
Online Multilingual Topic Models with Multi-Level Hyperpriors



Abstract

For topic models, such as LDA, that use a bag-of-words assumption, it becomes especially important to break the corpus into appropriately-sized "documents". Since the models are estimated solely from the term cooccurrences, extensive documents such as books or long journal articles lead to diffuse statistics, and short documents such as forum posts or product reviews can lead to sparsity. This paper describes practical inference procedures for hierarchical models that smooth topic estimates for smaller sections with hyperpriors over larger documents. Importantly for large collections, these online variational Bayes inference methods perform a single pass over a corpus and achieve better perplexity than "flat" topic models on monolingual and multilingual data. Furthermore, on the task of detecting document translation pairs in large multilingual collections, polylingual topic models (PLTM) with multi-level hyperpriors (mlhPLTM) achieve significantly better performance than existing online PLTM models while retaining computational efficiency.
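The single-pass inference the abstract refers to builds on online (stochastic) variational Bayes for LDA, in which each document yields a noisy estimate of the topic-word parameters that is blended into the global state with a decaying learning rate. The sketch below shows only that flat online update, not the paper's multi-level hyperprior smoothing; the function names, hyperparameter defaults, and the stdlib-only digamma approximation are all illustrative assumptions, not the authors' implementation.

```python
import math
import random

def digamma(x):
    # Stdlib-only asymptotic approximation of the digamma function
    # (an assumption for self-containment; scipy.special.digamma is
    # the usual choice). Valid for x > 0.
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0/12 - f * (1.0/120 - f/252))

def online_lda(docs, vocab_size, num_topics=2, alpha=0.1, eta=0.01,
               tau0=1.0, kappa=0.7, seed=0):
    """Single-pass online variational Bayes for *flat* LDA.

    The mlhPLTM extension described in the abstract would add a
    document-level hyperprior above the per-section `gamma` so that
    short sections borrow strength from their parent document;
    that extra level is omitted here.
    """
    rng = random.Random(seed)
    D, K, V = len(docs), num_topics, vocab_size
    # lam[k][w]: variational parameters of the topic-word Dirichlets.
    lam = [[eta + rng.random() for _ in range(V)] for _ in range(K)]
    for t, doc in enumerate(docs):            # one pass over the corpus
        gamma = [alpha + len(doc) / K] * K    # per-doc topic proportions
        counts = [[0.0] * V for _ in range(K)]
        for _ in range(20):                   # local E-step iterations
            elog_theta = [digamma(g) for g in gamma]
            gamma_new = [alpha] * K
            counts = [[0.0] * V for _ in range(K)]
            for w in doc:
                # phi_{wk} ∝ exp(E[log theta_k] + E[log beta_{kw}])
                logs = [elog_theta[k] + digamma(lam[k][w])
                        - digamma(sum(lam[k])) for k in range(K)]
                m = max(logs)
                phi = [math.exp(x - m) for x in logs]
                s = sum(phi)
                for k in range(K):
                    gamma_new[k] += phi[k] / s
                    counts[k][w] += phi[k] / s
            gamma = gamma_new
        # Global M-step: blend this document's noisy lambda estimate
        # into the running state with a decaying learning rate.
        rho = (tau0 + t) ** (-kappa)
        for k in range(K):
            for w in range(V):
                lam[k][w] = ((1 - rho) * lam[k][w]
                             + rho * (eta + D * counts[k][w]))
    return lam
```

A toy call such as `online_lda([[0, 0, 1], [2, 3, 3], [0, 1, 1], [2, 2, 3]], vocab_size=4)` returns a `K x V` list of positive Dirichlet parameters after exactly one sweep, which is what makes the approach attractive for the large collections the abstract targets.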

