Journal: ACM Transactions on Asian and Low-Resource Language Information Processing

An Embedding-Based Topic Model for Document Classification

Abstract

Topic modeling is an unsupervised learning task that discovers the hidden topics in a collection of documents. In turn, the discovered topics can be used for summarizing, organizing, and understanding the documents in the collection. Most existing techniques for topic modeling are derivatives of Latent Dirichlet Allocation (LDA), which makes a bag-of-words assumption about the documents. However, bag-of-words models completely dismiss the relationships between words. For this reason, this article presents a two-stage algorithm for topic modeling that leverages word embeddings and word co-occurrence. In the first stage, we determine the topic-word distributions by soft-clustering a random set of embedded n-grams from the documents. In the second stage, we determine the document-topic distributions by sampling the topics of each document from the topic-word distributions. This approach leverages the distributional properties of word embeddings instead of relying on the bag-of-words assumption. Experimental results on various data sets from an Australian compensation organization show the comparative effectiveness of the proposed algorithm in a document classification task.
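The two stages described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the soft-clustering method, the n-gram length, or the embedding model, so everything below is an assumption made for illustration. The sketch uses a soft k-means (`soft_kmeans`, a hypothetical helper), bigrams embedded as the mean of their word vectors, hand-made 2-D embeddings, and a uniform topic prior in the sampling step. Because the toy corpus is tiny, it clusters every bigram rather than a random subset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary with made-up 2-D embeddings forming two loose groups
# (assumption: a real system would load pretrained word vectors).
vocab = ["injury", "pain", "surgery", "invoice", "payment", "fee"]
word_id = {w: i for i, w in enumerate(vocab)}
base = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)
emb = base + 0.05 * rng.standard_normal(base.shape)

docs = [
    ["injury", "pain", "surgery", "injury", "pain"],
    ["invoice", "payment", "fee", "invoice", "payment"],
    ["injury", "surgery", "pain", "fee", "payment", "invoice"],
]
docs = [[word_id[w] for w in d] for d in docs]


def bigrams(doc):
    """Return the consecutive word-id pairs of one document."""
    return list(zip(doc[:-1], doc[1:]))


def soft_kmeans(X, k, temperature=0.1, n_iter=30):
    """Soft k-means; returns an (n_points, k) matrix of memberships."""
    # Farthest-point initialisation keeps this toy run deterministic.
    centroids = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[d2.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        logits = -d2 / temperature
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        centroids = (resp.T @ X) / (resp.sum(axis=0)[:, None] + 1e-12)
    return resp


# Stage 1: soft-cluster embedded n-grams, then spread each n-gram's
# memberships over its words to obtain topic-word distributions (phi).
k = 2
ngrams = [g for d in docs for g in bigrams(d)]
X = np.array([emb[list(g)].mean(axis=0) for g in ngrams])
resp = soft_kmeans(X, k)
phi = np.full((k, len(vocab)), 1e-9)  # tiny smoothing for unseen words
for g, r in zip(ngrams, resp):
    for w in g:
        phi[:, w] += r
phi /= phi.sum(axis=1, keepdims=True)

# Stage 2: for every token, sample a topic with probability proportional
# to phi[topic, word] (uniform topic prior), then normalise the counts
# to obtain document-topic distributions (theta).
theta = np.zeros((len(docs), k))
for d, doc in enumerate(docs):
    for w in doc:
        p = phi[:, w] / phi[:, w].sum()
        theta[d, rng.choice(k, p=p)] += 1
theta /= theta.sum(axis=1, keepdims=True)

print(np.round(phi, 3))
print(np.round(theta, 3))
```

Each row of `theta` could then serve as a feature vector for a downstream document classifier, which is the kind of task the abstract evaluates. The choice of classifier is left open here because the abstract does not specify one.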