Home > Foreign journals > Expert Systems with Applications > Text classification method based on self-training and LDA topic models

Text classification method based on self-training and LDA topic models


Abstract

Supervised text classification methods are efficient when they can learn from reasonably sized labeled sets. When only a small set of labeled documents is available, however, semi-supervised methods become more appropriate. These methods compare distributions between labeled and unlabeled instances, so it is important to focus on the representation and its discriminative ability. In this paper we present the ST LDA method for semi-supervised text classification with representations based on topic models. The proposed method comprises a semi-supervised text classification algorithm based on self-training and a model that determines parameter settings for any new document collection. Self-training is used to enlarge the small initial labeled set with the help of information from unlabeled data. We investigate how the topic-based representation affects prediction accuracy by applying the NBMN (multinomial naive Bayes) and SVM classification algorithms to the enlarged labeled set, and then compare the results with the same method on a typical TF-IDF representation. We also compare ST LDA with supervised classification methods and other well-known semi-supervised methods. Experiments were conducted on 11 very small initial labeled sets sampled from six publicly available document collections. The results show that our ST LDA method, when used in combination with NBMN, performed significantly better in terms of classification accuracy than other comparable methods and variations. ST LDA thus proved to be a competitive classification method for different text collections when only a small set of labeled instances is available. As such, the proposed ST LDA method may well help to improve text classification tasks, which are essential in many advanced expert and intelligent systems, especially when labeled texts are scarce. (C) 2017 Elsevier Ltd. All rights reserved.
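The self-training loop over LDA topic features that the abstract describes can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the toy corpus, topic count, confidence threshold, and the `-1` convention for unlabeled documents are all invented for the example, and the paper's parameter-setting model is omitted.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.naive_bayes import MultinomialNB

def self_train_lda(docs, y, n_topics=2, threshold=0.6, max_iter=5, seed=0):
    """Self-training on LDA topic features.

    y == -1 marks unlabeled documents. Each round, a multinomial naive
    Bayes classifier is fit on the currently labeled docs; unlabeled docs
    predicted with probability >= threshold receive pseudo-labels and
    join the training set for the next round.
    """
    counts = CountVectorizer().fit_transform(docs)
    # Doc-topic proportions serve as the (non-negative) feature vectors.
    topics = LatentDirichletAllocation(
        n_components=n_topics, random_state=seed).fit_transform(counts)
    y = np.asarray(y).copy()
    clf = MultinomialNB()
    for _ in range(max_iter):
        labeled = y != -1
        clf.fit(topics[labeled], y[labeled])
        if labeled.all():
            break
        proba = clf.predict_proba(topics[~labeled])
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # no confident pseudo-labels left to add
        idx = np.flatnonzero(~labeled)[confident]
        y[idx] = clf.classes_[proba[confident].argmax(axis=1)]
    return clf, topics, y

# Toy corpus: two labeled seeds (one per class), four unlabeled docs.
docs = [
    "goal match team player score",   # label 0
    "bake oven flour sugar recipe",   # label 1
    "team match win goal league",     # unlabeled
    "recipe bake sugar cake oven",    # unlabeled
    "player score match team goal",   # unlabeled
    "flour oven cake recipe bake",    # unlabeled
]
y_init = [0, 1, -1, -1, -1, -1]
clf, topics, y_final = self_train_lda(docs, y_init)
preds = clf.predict(topics)
```

On real collections the abstract's setup would replace the toy corpus with the document collection, and the comparison against TF-IDF amounts to swapping the `topics` matrix for a `TfidfVectorizer` output.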