Journal: Computer Speech and Language

Topic modeling of Chinese language beyond a bag-of-words



Abstract

The topic model is one of the best-known hierarchical Bayesian models for language modeling and document analysis. It has achieved great success in text classification, where a text is represented as a bag of its words, disregarding grammar and even word order; this is referred to as the bag-of-words assumption. In this paper, we investigate topic modeling of the Chinese language, whose morphology differs from that of alphabetic Western languages such as English. Chinese characters, rather than Chinese words, are the basic structural units of Chinese. Previous empirical studies have shown that the character-based topic model performs better than the word-based topic model. In this research, we propose the character-word topic model (CWTM), which takes the character-word relation into account in topic modeling. Two types of experiments are designed to test the performance of the newly proposed model: topic extraction and text classification. Through empirical studies, we demonstrate the superiority of the proposed model over both word-based and character-based topic models.
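
The difference between the word-based and character-based representations mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the CWTM itself, whose details the abstract does not give. The sketch assumes the text has already been segmented into words, and the function names `word_bag`, `char_bag`, and `char_word_pairs` are hypothetical names chosen for this illustration. The last function simply counts which character occurs inside which word, as one possible way to picture a character-word relation.

```python
from collections import Counter


def word_bag(segmented_words):
    """Bag of words: count each word, ignoring order."""
    return Counter(segmented_words)


def char_bag(segmented_words):
    """Bag of characters: count each character, ignoring word boundaries."""
    return Counter(ch for word in segmented_words for ch in word)


def char_word_pairs(segmented_words):
    """Count (character, word) co-occurrences.

    Assumption for illustration only: this is one simple way to record
    which characters appear in which words, not the model from the paper.
    """
    return Counter((ch, word) for word in segmented_words for ch in word)


# A toy, pre-segmented document: "topic", "model", "topic".
doc = ["主题", "模型", "主题"]

words = word_bag(doc)
chars = char_bag(doc)
pairs = char_word_pairs(doc)

# Reordering the words does not change the bag, which is exactly
# what the bag-of-words assumption discards.
same_after_shuffle = word_bag(list(reversed(doc))) == words
```

In this toy document the word bag has two distinct entries while the character bag has four, which shows how the choice of basic unit changes the vocabulary a topic model is fitted on.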
