Information Processing & Management

Probabilistic topic modeling in multilingual settings: An overview of its methodology and applications


Abstract

Probabilistic topic models are unsupervised generative models that describe document content as a two-step generation process: documents are observed as mixtures of latent concepts or topics, while topics are probability distributions over vocabulary words. Recently, significant research effort has been invested in transferring the probabilistic topic modeling concept from monolingual to multilingual settings, and novel topic models have been designed to work with parallel and comparable texts. We define multilingual probabilistic topic modeling (MuPTM) and present the first full overview of the current research, methodology, advantages, and limitations in MuPTM. As a representative example, we choose a natural extension of the omnipresent LDA model to multilingual settings, called bilingual LDA (BiLDA). We provide a thorough overview of this representative multilingual model, from its high-level modeling assumptions down to its mathematical foundations. We demonstrate how to use the data representations, that is, the output sets of (i) per-topic word distributions and (ii) per-document topic distributions coming from a multilingual probabilistic topic model, in various real-life cross-lingual tasks involving different languages, without any external language-pair-dependent translation resource: (1) cross-lingual event-centered news clustering, (2) cross-lingual document classification, (3) cross-lingual semantic similarity, and (4) cross-lingual information retrieval. We also briefly review several other applications from the relevant literature, and introduce and illustrate two related modeling concepts: topic smoothing and topic pruning. In summary, this article encompasses the current research in multilingual probabilistic topic modeling. By presenting a series of potential applications, we reveal the importance of the language-independent and language-pair-independent data representations obtained by means of MuPTM.
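To illustrate how per-document topic distributions enable cross-lingual comparison without any translation resource, the sketch below scores document pairs by the cosine similarity of their topic proportions over a shared cross-lingual topic space. The theta vectors are hypothetical, made up for illustration rather than taken from a trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical per-document topic distributions (theta) over K = 4 shared
# cross-lingual topics, as a multilingual topic model might output for an
# English document, a Dutch document on the same event, and an unrelated one.
theta_en = [0.70, 0.10, 0.15, 0.05]
theta_nl = [0.65, 0.05, 0.20, 0.10]
theta_other = [0.05, 0.80, 0.05, 0.10]

print(cosine(theta_en, theta_nl))     # high: the same latent topics dominate
print(cosine(theta_en, theta_other))  # low: different dominant topics
```

Because both documents live in the same language-independent topic space, the comparison needs no dictionary or parallel data at query time; this is the basic mechanism behind the clustering, classification, and retrieval tasks listed above.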
We provide clear directions for future research in the field by giving a systematic overview of how to link and transfer aspect knowledge across corpora written in different languages via the shared space of latent cross-lingual topics, that is, how to effectively employ the learned per-topic word distributions and per-document topic distributions of any multilingual probabilistic topic model in various cross-lingual applications.
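The learned per-topic word distributions can be employed in the same spirit for cross-lingual semantic similarity: each word is represented by its normalized topic signature, and words are matched across languages by comparing those signatures. The sketch below uses the Bhattacharyya coefficient as the distributional similarity; the phi values and the word pairs are entirely hypothetical, chosen only to show the mechanics.

```python
import math

# Hypothetical per-topic word distributions (phi) over K = 3 shared
# cross-lingual topics; entries are illustrative P(word | topic) values,
# not the output of a trained model.
phi_en = {"government": [0.08, 0.01, 0.01], "match": [0.01, 0.07, 0.01]}
phi_nl = {"regering": [0.09, 0.01, 0.01], "wedstrijd": [0.01, 0.08, 0.02]}

def topic_signature(col):
    """Normalize a word's column of phi into P(topic | word)."""
    s = sum(col)
    return [p / s for p in col]

def bhattacharyya(p, q):
    """Similarity of two discrete distributions (1.0 means identical)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def nearest(word, phi_src, phi_tgt):
    """Target-language word whose topic signature is closest to `word`'s."""
    p = topic_signature(phi_src[word])
    return max(phi_tgt,
               key=lambda w: bhattacharyya(p, topic_signature(phi_tgt[w])))

print(nearest("government", phi_en, phi_nl))  # regering
print(nearest("match", phi_en, phi_nl))       # wedstrijd
```

Since the topic space is shared, the source-language word and its target-language candidates are directly comparable, which is the language-pair-independent property the abstract emphasizes.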
