Procedia Computer Science

Performance Comparison and Optimization of Text Document Classification using k-NN and Naïve Bayes Classification Techniques



Abstract

Information today is available in many formats, including text, image, video, and audio. A corpus is a large collection of documents. Information Retrieval (IR) makes it possible to extract information from unstructured data and to perform automatic summarization, classification, and clustering. This research focuses on data classification using two of the six approaches to data classification: k-NN (k-Nearest Neighbors) and Naïve Bayes. The text documents used are in XML format. The corpus was downloaded from the TREC Legal Track and contains more than three thousand text documents in over twenty classes. Of these classes, the six with the largest number of text documents were chosen. The data were processed using RapidMiner software, and the results show that the optimum value of k for k-NN is k=13. With this value of k, the average accuracy reached 55.17 percent, which is better than the 39.01 percent obtained with Naïve Bayes.
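The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' RapidMiner process: the toy corpus, the cosine similarity over raw term counts, the add-one smoothing, and all function names (`knn_predict`, `nb_train`, `nb_predict`, `best_k`) are assumptions made for illustration only. It shows the general shape of the experiment: classify held-out documents with k-NN for several values of k, pick the k with the highest accuracy, and compare against Naïve Bayes.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus (NOT the TREC Legal Track data): (text, label) pairs.
TRAIN = [
    ("contract breach damages court", "legal"),
    ("court ruling appeal judge", "legal"),
    ("judge contract lawsuit appeal", "legal"),
    ("invoice payment bank account", "finance"),
    ("bank loan interest payment", "finance"),
    ("account interest invoice loan", "finance"),
]
TEST = [
    ("appeal court contract", "legal"),
    ("loan payment bank", "finance"),
]

def bag(text):
    """Bag-of-words term counts for a whitespace-tokenized document."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(text, train, k):
    """Majority vote among the k training documents most similar to text."""
    query = bag(text)
    ranked = sorted(train, key=lambda ex: cosine(query, bag(ex[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def nb_train(train):
    """Collect per-class document counts, per-class term counts, and the vocabulary."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in train:
        class_docs[label] += 1
        for term in text.split():
            term_counts[label][term] += 1
            vocab.add(term)
    return class_docs, term_counts, vocab

def nb_predict(text, model):
    """Multinomial Naive Bayes with add-one smoothing (an assumed choice)."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_logp = None, -math.inf
    for label in class_docs:
        logp = math.log(class_docs[label] / total_docs)
        denom = sum(term_counts[label].values()) + len(vocab)
        for term in text.split():
            if term in vocab:  # terms unseen in training are skipped
                logp += math.log((term_counts[label][term] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

def accuracy(predict, test):
    """Fraction of test documents whose predicted label matches the true label."""
    return sum(predict(text) == label for text, label in test) / len(test)

def best_k(train, test, candidates):
    """Score each candidate k on the test set; return the best k and all scores."""
    scores = {k: accuracy(lambda t: knn_predict(t, train, k), test)
              for k in candidates}
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    k, scores = best_k(TRAIN, TEST, [1, 3, 5])
    model = nb_train(TRAIN)
    print("k-NN accuracy by k:", scores, "best k:", k)
    print("Naive Bayes accuracy:", accuracy(lambda t: nb_predict(t, model), TEST))
```

On a real corpus the candidate values of k would be scored with cross-validation or a separate validation split rather than on the final test set, and the accuracies would of course differ from those reported in the abstract.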
