Clustering-based Method for Positive and Unlabeled Text Categorization Enhanced by Improved TFIDF

Lu Liu; Tao Peng

Abstract

PU (Positive and Unlabeled) learning arises frequently in Web page classification and text retrieval applications, because users may be interested in information on the same topic. PU text classification addresses a key problem in machine learning: training when no labeled negative examples are available in the training set, or when negative examples are difficult to collect. Collecting reliable negative examples is a key step in this setting, so this paper presents a novel clustering-based method for collecting reliable negative examples (C-CRNE). Unlike traditional methods, we remove as many probable positive examples as possible from the unlabeled set, so that more reliable negative examples are found. For building the classifier, we present a novel TFIDF-improved feature weighting approach that reflects the importance of a term in the positive and the negative training examples respectively, and we use it to describe documents in the Vector Space Model. We also build a weighted voting classifier by iteratively applying the SVM algorithm, and we implement the OCS (One-class SVM), PEBL (Positive Example Based Learning) and 1-DNFII (Constrained 1-DNF) methods for comparison. Experimental results on three real-world datasets (Reuters Corpus Volume 1 (RCV1), Reuters-21578 and 20 Newsgroups) show that the proposed C-CRNE extracts more reliable negative examples than the baseline algorithms, with very low error rates. Our classifier also outperforms other state-of-the-art classification methods on traditional performance metrics.
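The abstract does not spell out how C-CRNE works, so the following is only a minimal sketch of the general idea it describes: cluster the positive and unlabeled documents together, treat unlabeled documents that share a cluster with positives as probable positives, and keep the rest as reliable negatives. Everything concrete here is an assumption made for illustration: plain smoothed TF-IDF (not the paper's improved weighting), a toy cosine k-means with farthest-first initialization, and the hypothetical names `tfidf`, `kmeans`, `reliable_negatives` and `max_pos_ratio`.

```python
import math
from collections import Counter


def tfidf(docs):
    """Plain smoothed TF-IDF over tokenized docs (NOT the paper's improved weighting)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
         for t, c in Counter(doc).items()}
        for doc in docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def kmeans(vectors, k, iterations=10):
    """Toy cosine k-means with deterministic farthest-first initialization."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next seed: the vector least similar to every centroid chosen so far.
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        for j in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == j:
                    total.update(v)  # adds term weights component-wise
            if total:
                centroids[j] = dict(total)
    return labels


def reliable_negatives(positive, unlabeled, k=2, max_pos_ratio=0.0):
    """Return indices (into `unlabeled`) of documents kept as reliable negatives.

    Assumed rule: a cluster is trusted as negative only if its share of
    labeled positive documents is at most `max_pos_ratio`.
    """
    labels = kmeans(tfidf(positive + unlabeled), k)
    n_pos = len(positive)
    kept = []
    for j in range(k):
        members = [i for i, label in enumerate(labels) if label == j]
        if not members:
            continue
        pos_ratio = sum(1 for i in members if i < n_pos) / len(members)
        if pos_ratio <= max_pos_ratio:
            kept.extend(i - n_pos for i in members if i >= n_pos)
    return sorted(kept)


positive = [["svm", "kernel", "margin"], ["kernel", "svm", "classifier"]]
unlabeled = [
    ["svm", "margin", "classifier"],
    ["football", "goal", "match"],
    ["goal", "match", "league"],
    ["kernel", "margin", "svm"],
]
print(reliable_negatives(positive, unlabeled))  # prints [1, 2]
```

In this toy run the two sports documents land in a cluster with no labeled positives and are kept, while the two unlabeled documents that cluster with the positives are discarded as probable positives. A real pipeline would then train a classifier on the positives and the kept negatives; the abstract mentions iterated SVMs combined by weighted voting, which this sketch does not attempt.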

Bibliographic details

Source
Journal of Information Science and Engineering, 2014, No. 5, pp. 1463-1481 (19 pages)
Authors
Lu Liu; Tao Peng
Affiliations

College of Computer Science and Technology, Jilin University, Changchun 130012, China; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana 61801, USA

College of Computer Science and Technology, Jilin University, Changchun 130012, China; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana 61801, USA; Key Laboratory of Symbol Computation and Knowledge Engineering, Ministry of Education, Changchun 130012, China

Original format: PDF
Language: English
Keywords
text classification; reliable negative examples; clustering; C-CRNE; WVC
