
Clustering-based Method for Positive and Unlabeled Text Categorization Enhanced by Improved TFIDF

Abstract

PU learning arises frequently in Web page classification and text retrieval applications because users are often interested in information on a single topic. Collecting reliable negative examples is a key step in PU (Positive and Unlabeled) text classification, which addresses the machine learning setting in which the training set contains no labeled negative examples or negative examples are difficult to collect. This paper therefore presents a novel clustering-based method for collecting reliable negative examples (C-CRNE). Unlike traditional methods, we remove as many probable positive examples from the unlabeled set as possible, which yields more reliable negative examples. When building the classifier, we represent documents in the Vector Space Model with a novel TFIDF-improved feature weighting approach that reflects the importance of each term in the positive and in the negative training examples respectively. We also build a weighted voting classifier by iteratively applying the SVM algorithm, and we implement the OCS (One-class SVM), PEBL (Positive Example Based Learning) and 1-DNFII (Constrained 1-DNF) methods for comparison. Experimental results on three real-world datasets (Reuters Corpus Volume 1 (RCV1), Reuters-21578 and 20 Newsgroups) show that the proposed C-CRNE extracts more reliable negative examples than the baseline algorithms, with very low error rates. Our classifier also outperforms other state-of-the-art classification methods on traditional performance metrics.
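
The abstract does not spell out the C-CRNE procedure or the improved weighting formula, so the following is only a minimal sketch of the general idea of filtering probable positives out of an unlabeled set. It assumes standard TF-IDF weighting and a simple similarity test against the centroid of the positive documents, which stands in for the paper's clustering step. The function names, the toy documents and the `threshold` value are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per tokenized document (standard weighting,
    not the paper's improved variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Average a list of sparse vectors term by term."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def reliable_negatives(positive, unlabeled, threshold=0.1):
    """Return indices of unlabeled documents that look unlike the positive set.

    Assumption: a document whose similarity to the positive centroid reaches
    `threshold` is a probable positive and is removed; the rest are kept as
    candidate reliable negatives.
    """
    vecs = tfidf_vectors(positive + unlabeled)
    pos_centroid = centroid(vecs[:len(positive)])
    unlabeled_vecs = vecs[len(positive):]
    return [i for i, v in enumerate(unlabeled_vecs) if cosine(v, pos_centroid) < threshold]


# Toy, made-up data: two positive documents and three unlabeled ones.
positive = [["svm", "kernel", "margin"], ["svm", "classifier", "margin"]]
unlabeled = [
    ["kernel", "svm", "training"],   # shares terms with the positive set
    ["football", "match", "goal"],
    ["goal", "league", "football"],
]
print(reliable_negatives(positive, unlabeled))  # [1, 2]
```

In this toy run the first unlabeled document shares terms with the positive set and is filtered out, while the two unrelated documents are kept. A real pipeline would then train a classifier (for example an SVM) on the positive set and the retained negatives.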

Bibliographic details

  • Source
    Journal of Information Science and Engineering | 2014, Issue 5 | pp. 1463-1481 | 19 pages
  • Authors

    Lu Liu; Tao Peng;

  • Author affiliations

    College of Computer Science and Technology, Jilin University, Changchun 130012, China; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana 61801, USA

    College of Computer Science and Technology, Jilin University, Changchun 130012, China; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana 61801, USA; Key Laboratory of Symbol Computation and Knowledge Engineering, Ministry of Education, Changchun 130012, China

  • Original format: PDF
  • Language of text: English
  • Keywords

    text classification; reliable negative examples; clustering; C-CRNE; WVC;

