Journal: Intelligent Data Analysis

Semi-supervised text categorization: Exploiting unlabeled data using ensemble learning algorithms


Abstract

Text categorization is one of the fundamental tasks in text mining. Classical supervised methods need a large amount of labeled data to train a classifier. Because assigning labels to large amounts of data is costly and time-consuming, it is useful to exploit data sets without labels, and many semi-supervised learning methods have been studied recently. Among these, self-training is an important learning algorithm: it classifies unlabeled samples using a small number of labeled ones and adds the most confident samples to the training set. In this paper, dynamic weighting is applied alongside a majority-vote approach to classify the unlabeled data into reliable and unreliable classes. The reliable data are then added to the training set, and the remaining data, including the unreliable data, are classified in an iterative process. We tested this method on features extracted from ten common Reuters-21578 classes. The experimental results indicate that the proposed method is effective and improves classification performance.
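
The self-training loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which base classifiers are used or how the dynamic weights are computed. Here the ensemble members are hypothetical nearest-centroid classifiers with three different distance metrics, each member's weight is assumed to be its accuracy on the current labeled set (recomputed every iteration), and a sample counts as "reliable" when the weighted vote share of the winning label reaches an assumed `threshold`. Plain numeric tuples stand in for extracted text features.

```python
from collections import Counter

def euclidean(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def chebyshev(a, b):
    return max(abs(p - q) for p, q in zip(a, b))

def centroid_classifier(metric):
    """Hypothetical base learner: nearest class centroid under `metric`."""
    def fit(X, y):
        counts = Counter(y)
        sums = {}
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
        centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}

        def predict(x):
            return min(centroids, key=lambda c: metric(x, centroids[c]))
        return predict
    return fit

def self_train(X_lab, y_lab, X_unlab, fits, threshold=0.9, max_iter=10):
    """Self-training with a weighted majority vote (assumed weighting scheme)."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_iter):
        if not pool:
            break
        # Retrain every ensemble member on the current labeled set.
        models = [fit(X_lab, y_lab) for fit in fits]
        # Assumption: dynamic weight = accuracy on the current labeled set.
        weights = [
            sum(m(x) == y for x, y in zip(X_lab, y_lab)) / len(y_lab)
            for m in models
        ]
        total = sum(weights)
        reliable, rest = [], []
        for x in pool:
            votes = Counter()
            for model, weight in zip(models, weights):
                votes[model(x)] += weight
            label, score = votes.most_common(1)[0]
            if total > 0 and score / total >= threshold:
                reliable.append((x, label))   # confident: move to training set
            else:
                rest.append(x)                # unreliable: retry next round
        if not reliable:
            break  # nothing new was learned, so further rounds cannot help
        for x, label in reliable:
            X_lab.append(x)
            y_lab.append(label)
        pool = rest
    return X_lab, y_lab, pool
```

Calling `self_train` with a few labeled points, a pool of unlabeled points, and `fits = [centroid_classifier(m) for m in (euclidean, manhattan, chebyshev)]` returns the enlarged labeled set together with whatever samples stayed unreliable. Lowering `threshold` admits more samples per round at the cost of noisier pseudo-labels.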
