Journal: Journal of Information Science
TTC-3600: A new benchmark dataset for Turkish text categorization

Abstract

Owing to the rapid growth of the World Wide Web, the number of documents accessible via the Internet increases explosively with each passing day. On news portals in particular, documents related to categories such as technology, sports and politics sometimes appear to be in the wrong category or are placed in a generic category called "others". At this point, text categorization (TC), which is generally addressed as a supervised learning task, is needed. Although a substantial number of studies have been conducted on TC in other languages, the number of studies conducted in Turkish is very limited owing to the limited accessibility and usability of the datasets created so far. In this paper, a new dataset named TTC-3600, which can be widely used in studies of TC of Turkish news and articles, is created. TTC-3600 is a well-documented dataset and its file formats are compatible with well-known text mining tools. Five classifiers widely used in the field of TC and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that, across all comparisons performed after the pre-processing and feature selection steps, the best accuracy, 91.03%, is obtained by combining the Random Forest classifier with the attribute ranking-based feature selection method. The publicly available TTC-3600 dataset and the experimental results of this study can be used in comparative experiments by other researchers.
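
To make the idea of attribute ranking-based feature selection concrete, here is a minimal Python sketch. The abstract does not say which ranking criterion or toolkit settings were used, so the chi-square score, the function names `chi_square_scores` and `select_top_k`, and the four-document toy corpus are all assumptions made for illustration. They are not taken from the paper or from TTC-3600.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each term by its largest chi-square statistic over all classes.

    Assumption: chi-square is used only as an example ranking criterion.
    """
    n = len(docs)
    doc_terms = [set(d.lower().split()) for d in docs]
    vocab = set().union(*doc_terms)
    class_counts = Counter(labels)
    scores = {}
    for term in vocab:
        best = 0.0
        for cls, n_cls in class_counts.items():
            # 2x2 contingency table: term present/absent vs. in class/not in class
            a = sum(1 for t, y in zip(doc_terms, labels) if term in t and y == cls)
            b = sum(1 for t, y in zip(doc_terms, labels) if term in t and y != cls)
            c = n_cls - a
            d = n - n_cls - b
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:  # a zero margin means the term carries no class signal
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def select_top_k(docs, labels, k):
    """Rank terms by score (ties broken alphabetically) and keep the top k."""
    scores = chi_square_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Hypothetical toy corpus, not drawn from TTC-3600.
toy_docs = [
    "goal match team the",
    "match referee goal the",
    "election party vote the",
    "vote parliament election the",
]
toy_labels = ["sports", "sports", "politics", "politics"]

print(select_top_k(toy_docs, toy_labels, 4))
```

In this toy corpus the word "the" occurs in every document, so it scores zero and is never selected, while terms that occur only in one class rank highest. The retained terms would then serve as the input features for a classifier such as Random Forest.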