Journal: Statistics and Computing

Deep mixtures of unigrams for uncovering topics in textual data

Abstract

Mixtures of unigrams are one of the simplest and most efficient tools for clustering textual data, as they assume that documents related to the same topic have similar distributions of terms, naturally described by multinomials. When the classification task is particularly challenging, such as when the document-term matrix is high-dimensional and extremely sparse, a more composite representation can provide better insight into the grouping structure. In this work, we develop a deep version of mixtures of unigrams for the unsupervised classification of very short documents with a large number of terms, by allowing for models with additional, deeper latent layers; the proposal is derived in a Bayesian framework. The behavior of the deep mixtures of unigrams is empirically compared with that of other traditional and state-of-the-art methods, namely k-means with cosine distance, k-means with Euclidean distance on data transformed according to semantic analysis, partitioning around medoids, mixture of Gaussians on semantic-based transformed data, hierarchical clustering according to Ward's method with cosine dissimilarity, latent Dirichlet allocation, mixtures of unigrams estimated via the EM algorithm, spectral clustering, and affinity propagation clustering. Performance is evaluated in terms of both correct classification rate and Adjusted Rand Index. Simulation studies and real data analysis show that going deep in clustering such data substantially improves classification accuracy.
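As a rough illustration of the baseline named in the abstract, the following is a minimal sketch of a plain (shallow) mixture of unigrams fitted by EM, together with the Adjusted Rand Index used for evaluation. It is not the deep Bayesian model proposed in the paper. The function names, the additive smoothing constant `alpha`, the random initialization, and the default iteration count are all illustrative assumptions rather than details taken from the source.

```python
import numpy as np


def fit_mixture_of_unigrams(X, k, n_iter=100, alpha=1e-2, seed=0):
    """EM for a mixture of multinomials.

    X is a document-term count matrix of shape (n_docs, n_terms).
    alpha is a small additive smoothing term (an assumption of this sketch)
    that keeps every term probability strictly positive.
    """
    rng = np.random.default_rng(seed)
    n_docs = X.shape[0]
    # Start from random soft assignments of documents to clusters.
    resp = rng.dirichlet(np.ones(k), size=n_docs)
    for _ in range(n_iter):
        # M-step: mixing weights and per-cluster term probabilities.
        weights = resp.sum(axis=0) / n_docs
        term_counts = resp.T @ X + alpha
        term_probs = term_counts / term_counts.sum(axis=1, keepdims=True)
        # E-step: posterior cluster probabilities, computed in log space.
        log_post = np.log(weights + 1e-300) + X @ np.log(term_probs).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    return weights, term_probs, resp.argmax(axis=1)


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same documents.

    Undefined (division by zero) when both partitions are trivial,
    for example when each puts all documents in a single cluster.
    """
    _, t = np.unique(np.asarray(labels_true), return_inverse=True)
    _, c = np.unique(np.asarray(labels_pred), return_inverse=True)
    # Contingency table of true cluster versus predicted cluster.
    table = np.zeros((t.max() + 1, c.max() + 1))
    np.add.at(table, (t, c), 1)

    def pairs(m):
        # Number of unordered pairs that can be formed from m items.
        return m * (m - 1) / 2.0

    same_both = pairs(table).sum()
    same_true = pairs(table.sum(axis=1)).sum()
    same_pred = pairs(table.sum(axis=0)).sum()
    expected = same_true * same_pred / pairs(len(t))
    max_index = 0.5 * (same_true + same_pred)
    return (same_both - expected) / (max_index - expected)
```

Given a count matrix `X` and known topic labels, `fit_mixture_of_unigrams(X, k)` returns hard cluster labels that can be passed to `adjusted_rand_index`, which is invariant to how the clusters are numbered. EM is sensitive to initialization, so in practice one would typically run several seeds and keep the fit with the highest likelihood.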
