A Pseudo Label based Dataless Naive Bayes Algorithm for Text Classification with Seed Words

Abstract

Traditional supervised text classifiers require a large number of manually labeled documents, which are often expensive to obtain. Recently, dataless text classification has attracted more attention, since it requires only a few seed words per category, which are much cheaper to provide. In this paper, we develop a pseudo-label based dataless Naive Bayes (PL-DNB) classifier with seed words. We initialize pseudo-labels for each document using seed word occurrences, and employ the expectation maximization algorithm to train PL-DNB in a semi-supervised manner. The pseudo-labels are iteratively updated using a mixture of seed word occurrences and estimates of the label posteriors. To avoid noisy pseudo-labels, we also consider the information of nearest neighboring documents in the pseudo-label update step, i.e., we preserve the local neighborhood structure of documents. We empirically show that PL-DNB outperforms traditional dataless text classification algorithms with seed words. In particular, PL-DNB performs well on the imbalanced dataset.
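
The training loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' PL-DNB implementation: the function names, the mixing weight `lam`, the smoothing constants `eps` and `alpha`, and the toy data are all assumptions made for illustration, and the nearest-neighbor regularization of the pseudo-labels is omitted. The sketch only shows the general pattern of initializing soft pseudo-labels from seed word counts and alternating multinomial Naive Bayes estimation with a mixed pseudo-label update.

```python
import numpy as np

def seed_pseudo_labels(X, seed_idx, eps=1e-2):
    """Soft pseudo-labels from seed word counts; each row sums to 1.

    X is a (docs x vocab) count matrix, seed_idx[k] lists the vocabulary
    indices of the seed words of class k. eps is a hypothetical smoothing
    constant so that documents without seed words get a uniform label.
    """
    counts = np.stack([X[:, idx].sum(axis=1) for idx in seed_idx], axis=1)
    counts = counts.astype(float) + eps
    return counts / counts.sum(axis=1, keepdims=True)

def fit_dataless_nb(X, seed_idx, n_iter=20, lam=0.5, alpha=1.0):
    """EM-style multinomial Naive Bayes trained from pseudo-labels only.

    lam (an assumed hyperparameter) mixes the seed word signal with the
    current label posterior; alpha is Laplace smoothing.
    """
    seed = seed_pseudo_labels(X, seed_idx)
    Q = seed.copy()  # current pseudo-labels, shape (docs, classes)
    for _ in range(n_iter):
        # M-step: class priors and per-class word distributions
        prior = Q.sum(axis=0) / Q.sum()
        word_counts = Q.T @ X + alpha
        theta = word_counts / word_counts.sum(axis=1, keepdims=True)
        # E-step: label posteriors, computed in log space for stability
        log_post = np.log(prior) + X @ np.log(theta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # Pseudo-label update: mixture of seed evidence and posterior
        Q = lam * seed + (1.0 - lam) * post
    return post.argmax(axis=1), prior, theta

# Toy vocabulary: 0="goal", 1="match", 2="vote", 3="election".
# Seed words: class 0 <- "goal", class 1 <- "vote".
X = np.array([
    [2, 1, 0, 0],
    [0, 3, 0, 0],  # contains no seed word
    [0, 0, 1, 2],
    [0, 0, 0, 3],  # contains no seed word
    [1, 2, 0, 0],
    [0, 0, 2, 1],
])
labels, prior, theta = fit_dataless_nb(X, seed_idx=[[0], [2]])
print(labels)
```

In this toy run, the two documents without any seed word start with a uniform pseudo-label and are assigned a class only through words that co-occur with the seed words, which is the effect the semi-supervised training is meant to achieve.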


Bibliographic record

Source
International Conference on Computational Linguistics | 2018 | pp. 1908-1917 | 10 pages
Authors
Ximing Li; Bo Yang
Original format: PDF

