Abstract:
With the growth of the Internet, text classification and topic extraction are applied ever more widely, and topic models play a critical role in extracting topics from text. LDA (latent Dirichlet allocation), a widely used and mature topic model, is a probabilistic generative model that handles synonymy and polysemy well. However, when LDA is used to model a document collection in the domain of social science literature, it ignores the topic characteristics of the collection itself, so the extracted topic distributions are biased toward high-frequency words in the documents. As a result, the extracted topics deviate from the documents' actual topics and the results are not accurate enough. In this paper, we examine the process of topic modeling of documents with LDA and, taking into account the characteristics of documents in the field of social science literature, propose a new topic modeling method that improves the modeling process accordingly, so that the finally extracted topics are more accurate and more consistent with the topic characteristics of the document collection itself.