Journal: Information Processing & Management

Fuzzy topic modeling approach for text mining over short text

Abstract

The growing role of social media in everyday life has made short-text posting commonplace. Short texts carry limited context and have distinctive characteristics, which makes them difficult to process. Every day, billions of short texts are produced as tags, keywords, tweets, phone messages, messenger conversations, social network posts, and so on. Analyzing these texts is essential in text mining and content analysis, and extracting precise topics from large-scale short-text collections is a critical and challenging task. Conventional approaches fail to capture word co-occurrence patterns in topics because of the sparsity of short texts such as web text, social media posts (for example, on Twitter), and news headlines. This paper therefore addresses the sparsity problem with a novel fuzzy topic modeling (FTM) approach for short text, taken from a fuzzy perspective. Local and global term frequencies are computed with a bag-of-words (BOW) model. Principal component analysis is applied to remove the negative impact of high dimensionality on the global term weighting; the fuzzy c-means algorithm is then used to retrieve semantically relevant topics from the documents. Experiments are conducted on three real-world short-text datasets: the snippets dataset, which is small, and the larger Twitter and questions datasets. The results show that the proposed approach discovers topics more precisely and performs better than state-of-the-art baseline topic models such as GLTM, CSTM, LTM, LDA, Mix-gram, BTM, SATM, and DREx + LDA. The performance of FTM is also demonstrated in terms of classification, clustering, topic coherence, and execution time. On the snippets dataset, FTM achieves classification accuracies of 0.95, 0.94, 0.91, 0.89, and 0.87 with 50, 75, 100, 125, and 200 topics, respectively. On the questions dataset, the corresponding accuracies are 0.73, 0.74, 0.70, 0.68, and 0.78. On both datasets, FTM's classification accuracy is higher than that of the state-of-the-art baseline topic models.
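The pipeline outlined in the abstract (bag-of-words term frequencies, global term weighting, principal component analysis, then fuzzy c-means) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the exact weighting formula, so an IDF-style global weight is assumed here, and the function names (`bow_matrix`, `global_weight`, `pca`, `fuzzy_c_means`), the toy documents, and the parameter values are all hypothetical.

```python
import numpy as np

def bow_matrix(docs):
    """Build a vocabulary and a document-term count matrix (local term frequency)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            X[i, index[w]] += 1.0
    return vocab, X

def global_weight(X):
    """Scale counts by an IDF-style global weight (assumed scheme, not from the paper)."""
    df = np.count_nonzero(X, axis=0)
    idf = np.log((1.0 + X.shape[0]) / (1.0 + df)) + 1.0
    return X * idf

def pca(X, k):
    """Project the rows of X onto the top-k principal components using SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def fuzzy_c_means(X, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means; returns cluster centers and an n x c membership matrix."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)          # each row is a membership distribution
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)         # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

# Toy short texts (hypothetical data, for illustration only).
docs = [
    "stock market prices fall",
    "market shares and stock trading",
    "football match final score",
    "team wins football league match",
]
vocab, X = bow_matrix(docs)
Z = pca(global_weight(X), k=2)
centers, U = fuzzy_c_means(Z, c=2)
print(U.round(2))  # soft topic memberships, one row per document
```

Each row of `U` sums to 1, so a document can belong partly to several topics. This soft assignment is what distinguishes fuzzy clustering from hard clustering, where each document gets exactly one label.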
