An Embedding-Based Topic Model for Document Classification

Seifollahi Sattar; Piccardi Massimo; Jolfaei Alireza

Abstract

Topic modeling is an unsupervised learning task that discovers the hidden topics in a collection of documents. In turn, the discovered topics can be used for summarizing, organizing, and understanding the documents in the collection. Most existing techniques for topic modeling are derivatives of Latent Dirichlet Allocation (LDA), which makes a bag-of-words assumption about the documents. However, bag-of-words models completely disregard the relationships between words. For this reason, this article presents a two-stage algorithm for topic modeling that leverages word embeddings and word co-occurrence. In the first stage, we determine the topic-word distributions by soft-clustering a random set of embedded n-grams from the documents. In the second stage, we determine the document-topic distributions by sampling the topics of each document from the topic-word distributions. This approach leverages the distributional properties of word embeddings instead of relying on the bag-of-words assumption. Experimental results on various data sets from an Australian compensation organization show the remarkable comparative effectiveness of the proposed algorithm in a document classification task.
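
The two stages described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the clustering method, the sampling scheme, or the embedding model, so everything concrete here is an assumption made for illustration: the toy 2-D "embeddings" are hand-made, single words stand in for n-grams, the soft clustering is a simple soft k-means with a hypothetical sharpness parameter `beta`, and all function names are invented. It is not the authors' implementation.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def memberships(vec, centroids, beta):
    """Soft cluster memberships: softmax over negative scaled distances."""
    d2 = [sq_dist(vec, c) for c in centroids]
    low = min(d2)  # subtract the minimum for numerical stability
    w = [math.exp(-beta * (d - low)) for d in d2]
    total = sum(w)
    return [x / total for x in w]

def topic_word_distributions(embeddings, k, beta=5.0, iters=20):
    """Stage 1 (assumed form): soft k-means over embedded words.
    Returns one word distribution (dict word -> probability) per topic."""
    words = list(embeddings)
    vecs = [embeddings[w] for w in words]
    # Deterministic farthest-point initialisation of the k centroids.
    centroids = [list(vecs[0])]
    while len(centroids) < k:
        far = max(vecs, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        resp = [memberships(v, centroids, beta) for v in vecs]
        for j in range(k):
            mass = sum(r[j] for r in resp)
            centroids[j] = [sum(r[j] * v[d] for r, v in zip(resp, vecs)) / mass
                            for d in range(len(vecs[0]))]
    # Normalise memberships over the vocabulary to get p(word | topic).
    resp = [memberships(v, centroids, beta) for v in vecs]
    phi = []
    for j in range(k):
        mass = sum(r[j] for r in resp)
        phi.append({w: r[j] / mass for w, r in zip(words, resp)})
    return phi

def document_topic_distribution(tokens, phi, n_samples=200, seed=0):
    """Stage 2 (assumed form): repeatedly pick a token from the document,
    sample a topic with probability proportional to p(word | topic)
    (uniform topic prior), and return the normalised topic counts."""
    rng = random.Random(seed)
    k = len(phi)
    known = [t for t in tokens if t in phi[0]]
    if not known:
        return [1.0 / k] * k  # no known words: fall back to uniform
    counts = [0] * k
    for _ in range(n_samples):
        word = rng.choice(known)
        weights = [phi[j][word] for j in range(k)]
        topic = rng.choices(range(k), weights=weights)[0]
        counts[topic] += 1
    return [c / n_samples for c in counts]

# Hand-made toy embeddings (hypothetical values, two obvious groups).
embeddings = {
    "injury": (1.0, 0.1), "fracture": (0.9, 0.0), "surgery": (1.1, 0.2),
    "invoice": (0.0, 1.0), "payment": (0.1, 0.9), "refund": (0.2, 1.1),
}
phi = topic_word_distributions(embeddings, k=2)
theta_medical = document_topic_distribution(["injury", "fracture", "surgery"], phi)
theta_billing = document_topic_distribution(["invoice", "payment", "refund"], phi)
print(theta_medical, theta_billing)
```

In this sketch the resulting document-topic vectors (`theta_medical`, `theta_billing`) could serve as features for a downstream classifier, which is one plausible way to connect the topic model to the document classification task the abstract mentions.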


Bibliographic details

Source: ACM Transactions on Asian and Low-Resource Language Information Processing, 2021, No. 3, pp. 52.1-52.13 (13 pages)
Authors: Seifollahi Sattar; Piccardi Massimo; Jolfaei Alireza
Author affiliations:

RMIT Univ Sch Comp Technol 124 La Trobe St Melbourne Vic 3000 Australia;

Univ Technol Sydney Sch Elect & Data Engn 15 Broadway Ultimo Sydney NSW 2007 Australia;

Macquarie Univ Dept Comp 16 Macquarie Walk Sydney NSW 2109 Australia;

Original format: PDF
Language of text: English
Keywords: Topic modelling; word embedding; document classification; clustering

Similar literature

1. Web document classification using topic modeling based document ranking [J]. Youngseok Lee, Jungwon Cho. International Journal of Electrical and Computer Engineering. 2021, No. 3
2. Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering [J]. Sahand Vahidnia, Alireza Abbasi, Hussein A. Abbass. Journal of Data and Information Science. 2021, No. 3
3. Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering [J]. Sahand Vahidnia, Alireza Abbasi, Hussein A. Abbass. 数据与情报科学学报：英文版 (Journal of Data and Information Science, English edition). 2021, No. 3
4. Building topic mixture language models using the document soft classification notion of topic models [C]. 2010 7th International Symposium on Chinese Spoken Language Processing. 2010
5. Topics in Document Classification [D]. Wongchaisuwat, Papis. 2018
6. Incorporating Statistical Topic Models in the Retrieval of Healthcare Documents [O]. Karla Caballero, Ram Akella. 2015
7. Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering [O]. Sahand Vahidnia, Alireza Abbasi, Hussein A. Abbass. 2021
8. Text Classification of Installation Support Contract Topic Models for Category Management [R]. Sevier, W. C. 2018
