Deep mixtures of unigrams for uncovering topics in textual data

Viroli Cinzia; Anderlucci Laura

Abstract

Mixtures of unigrams are one of the simplest and most efficient tools for clustering textual data, as they assume that documents related to the same topic have similar distributions of terms, naturally described by multinomials. When the classification task is particularly challenging, such as when the document-term matrix is high-dimensional and extremely sparse, a more composite representation can provide better insight into the grouping structure. In this work, we developed a deep version of mixtures of unigrams for the unsupervised classification of very short documents with a large number of terms, by allowing for models with further deeper latent layers; the proposal is derived in a Bayesian framework. The behavior of the deep mixtures of unigrams is empirically compared with that of other traditional and state-of-the-art methods, namely k-means with cosine distance, k-means with Euclidean distance on data transformed according to semantic analysis, partition around medoids, mixture of Gaussians on semantic-based transformed data, hierarchical clustering according to Ward's method with cosine dissimilarity, latent Dirichlet allocation, mixtures of unigrams estimated via the EM algorithm, spectral clustering and affinity propagation clustering. The performance is evaluated in terms of both correct classification rate and Adjusted Rand Index. Simulation studies and real data analysis prove that going deep in clustering such data highly improves the classification accuracy.
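The baseline named in the abstract, a mixture of unigrams estimated via the EM algorithm, can be sketched as follows. This is a minimal illustration of the standard single-layer mixture of multinomials, not the deep Bayesian model proposed in the paper. The function names, the deterministic initialisation from evenly spaced documents, and the smoothing constant `alpha` are assumptions made for the example.

```python
import math


def _normalise(values):
    """Scale non-negative values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def fit_mixture_of_unigrams(counts, k, n_iter=50, alpha=1.0):
    """EM for a k-component mixture of multinomials (illustrative sketch).

    counts[d][t] is the number of times term t occurs in document d.
    Returns (weights, term_probs, labels).
    """
    n_docs, n_terms = len(counts), len(counts[0])
    # Assumed initialisation: seed each component from an evenly spaced
    # document, with additive smoothing so no term has probability zero.
    seeds = [round(j * (n_docs - 1) / max(k - 1, 1)) for j in range(k)]
    term_probs = [_normalise([c + alpha for c in counts[s]]) for s in seeds]
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each document,
        # computed in log space to avoid underflow.
        resp = []
        for row in counts:
            log_post = [
                math.log(weights[j])
                + sum(c * math.log(term_probs[j][t]) for t, c in enumerate(row))
                for j in range(k)
            ]
            top = max(log_post)
            resp.append(_normalise([math.exp(v - top) for v in log_post]))
        # M-step: update mixing weights (floored as a numerical guard)
        # and smoothed per-component term distributions.
        weights = _normalise([max(sum(r[j] for r in resp), 1e-12) for j in range(k)])
        term_probs = [
            _normalise([
                alpha + sum(resp[d][j] * counts[d][t] for d in range(n_docs))
                for t in range(n_terms)
            ])
            for j in range(k)
        ]
    # Hard assignment: most probable component per document.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, term_probs, labels


# Toy document-term matrix: the first three documents use terms 0-1,
# the last three use terms 2-3.
counts = [
    [5, 3, 0, 0], [4, 4, 0, 0], [6, 2, 0, 0],
    [0, 0, 3, 5], [0, 0, 4, 4], [0, 0, 2, 6],
]
weights, term_probs, labels = fit_mixture_of_unigrams(counts, k=2)
print(labels)
```

On real short-text data the document-term matrix is far sparser and the groups overlap, which is the setting where the abstract argues that additional latent layers help.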

Bibliographic details

Source
Statistics and Computing | 2021, issue 3 | 22.1-22.10 | 10 pages
Authors
Viroli Cinzia; Anderlucci Laura
Affiliations

Univ Bologna Dept Stat Sci Via Belle Arti 41 I-40126 Bologna Italy;

Univ Bologna Dept Stat Sci Via Belle Arti 41 I-40126 Bologna Italy;

Original format: PDF
Language: eng
Keywords
Deep learning; Mixture models; Clustering; Text data analysis;
