Conference: International Conference on Computational Linguistics

A Pseudo Label based Dataless Naive Bayes Algorithm for Text Classification with Seed Words

Abstract

Traditional supervised text classifiers require a large number of manually labeled documents, which are often expensive to obtain. Dataless text classification has recently attracted growing attention because it requires only a few seed words per category, which are much cheaper to provide. In this paper, we develop a pseudo-label based dataless Naive Bayes (PL-DNB) classifier with seed words. We initialize pseudo-labels for each document using seed word occurrences, and employ the expectation maximization algorithm to train PL-DNB in a semi-supervised manner. The pseudo-labels are iteratively updated using a mixture of seed word occurrences and estimates of label posteriors. To avoid noisy pseudo-labels, we also consider the information of nearest neighboring documents in the pseudo-label update step, i.e., we preserve the local neighborhood structure of documents. We empirically show that PL-DNB outperforms traditional dataless text classification algorithms with seed words. In particular, PL-DNB performs well on the imbalanced dataset.
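The loop described in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything specific here is an assumption made for illustration and not the authors' formulation: the function name `pl_dnb_sketch`, the mixing weight `mix`, the Laplace smoothing `alpha`, the cosine-similarity nearest neighbors, and the equal-weight neighbor averaging are all hypothetical.

```python
import numpy as np

def pl_dnb_sketch(X, seed_ids, n_iter=20, mix=0.5, k=2, alpha=1.0):
    """Illustrative pseudo-label EM loop (hypothetical, not the paper's exact model).

    X        : (n_docs, n_words) word-count matrix
    seed_ids : one list of seed-word column indices per class
    """
    X = np.asarray(X, dtype=float)
    n_docs, _ = X.shape
    n_classes = len(seed_ids)

    # Seed evidence: normalized seed-word counts per class.
    # The small constant makes documents without seed words uniform.
    seed = np.stack([X[:, ids].sum(axis=1) for ids in seed_ids], axis=1) + 1e-3
    seed /= seed.sum(axis=1, keepdims=True)

    # k nearest neighbors by cosine similarity, excluding the document itself.
    unit = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)
    nbrs = np.argsort(-sim, axis=1)[:, :k]

    pseudo = seed.copy()  # initialize pseudo-labels from seed word occurrences
    for _ in range(n_iter):
        # M-step: Naive Bayes parameters from the soft pseudo-labels.
        prior = (pseudo.sum(axis=0) + alpha) / (n_docs + alpha * n_classes)
        word_counts = pseudo.T @ X + alpha
        word_prob = word_counts / word_counts.sum(axis=1, keepdims=True)

        # E-step: label posteriors under the multinomial Naive Bayes model.
        log_post = np.log(prior) + X @ np.log(word_prob).T
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

        # Pseudo-label update: mix seed evidence with posteriors, then
        # average with the neighbors' labels to damp noisy assignments.
        mixed = mix * seed + (1.0 - mix) * post
        pseudo = 0.5 * mixed + 0.5 * mixed[nbrs].mean(axis=1)

    return pseudo.argmax(axis=1), pseudo

# Toy corpus: columns are ball, team, game, vote, party, law.
X_toy = [
    [2, 1, 1, 0, 0, 0],
    [0, 2, 2, 0, 0, 0],  # no seed word
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 0, 2, 2],  # no seed word
    [0, 0, 0, 1, 1, 2],
]
labels, pseudo = pl_dnb_sketch(X_toy, seed_ids=[[0], [3]])
print(labels)
```

In this toy setup, the two documents that contain no seed word can only be labeled through the learned word distributions and their neighbors, which is the role the abstract assigns to the posterior and neighborhood terms.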