Home > Foreign journals > Expert Systems with Applications > Text classification method based on self-training and LDA topic models

Text classification method based on self-training and LDA topic models


Abstract

Supervised text classification methods are efficient when they can learn from reasonably sized labeled sets. When only a small set of labeled documents is available, however, semi-supervised methods become more appropriate. These methods compare distributions between labeled and unlabeled instances, so it is important to focus on the representation and its discriminative ability. In this paper we present the ST LDA method for semi-supervised text classification with representations based on topic models. The proposed method comprises a semi-supervised text classification algorithm based on self-training and a model that determines parameter settings for any new document collection. Self-training is used to enlarge the small initial labeled set with the help of information from unlabeled data. We investigate how the topic-based representation affects prediction accuracy by applying the NBMN (multinomial naive Bayes) and SVM classification algorithms to the enlarged labeled set, and then compare the results with the same method on a typical TF-IDF representation. We also compare ST LDA with supervised classification methods and other well-known semi-supervised methods. Experiments were conducted on 11 very small initial labeled sets sampled from six publicly available document collections. The results show that our ST LDA method, when used in combination with NBMN, performed significantly better in terms of classification accuracy than other comparable methods and variations. ST LDA thus proved to be a competitive classification method for different text collections when only a small set of labeled instances is available. As such, the proposed ST LDA method may well help to improve text classification tasks, which are essential in many advanced expert and intelligent systems, especially when labeled texts are scarce. (C) 2017 Elsevier Ltd. All rights reserved.
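The self-training loop over LDA topic features that the abstract describes can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the toy corpus, topic count, confidence threshold, and the `-1` convention for unlabeled documents are all invented for the example, and the paper's parameter-setting model is omitted.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.naive_bayes import MultinomialNB

def self_train_lda(docs, y, n_topics=2, threshold=0.6, max_iter=5, seed=0):
    """Self-training on LDA topic features.

    y == -1 marks unlabeled documents. Each round, a multinomial naive
    Bayes classifier is fit on the currently labeled docs; unlabeled docs
    predicted with probability >= threshold receive pseudo-labels and
    join the training set for the next round.
    """
    counts = CountVectorizer().fit_transform(docs)
    # Doc-topic proportions serve as the (non-negative) feature vectors.
    topics = LatentDirichletAllocation(
        n_components=n_topics, random_state=seed).fit_transform(counts)
    y = np.asarray(y).copy()
    clf = MultinomialNB()
    for _ in range(max_iter):
        labeled = y != -1
        clf.fit(topics[labeled], y[labeled])
        if labeled.all():
            break
        proba = clf.predict_proba(topics[~labeled])
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # no confident pseudo-labels left to add
        idx = np.flatnonzero(~labeled)[confident]
        y[idx] = clf.classes_[proba[confident].argmax(axis=1)]
    return clf, topics, y

# Toy corpus: two labeled seeds (one per class), four unlabeled docs.
docs = [
    "goal match team player score",   # label 0
    "bake oven flour sugar recipe",   # label 1
    "team match win goal league",     # unlabeled
    "recipe bake sugar cake oven",    # unlabeled
    "player score match team goal",   # unlabeled
    "flour oven cake recipe bake",    # unlabeled
]
y_init = [0, 1, -1, -1, -1, -1]
clf, topics, y_final = self_train_lda(docs, y_init)
preds = clf.predict(topics)
```

On real collections the abstract's setup would replace the toy corpus with the document collection, and the comparison against TF-IDF amounts to swapping the `topics` matrix for a `TfidfVectorizer` output.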