Conference: International Conference on Advances in Big Data, Computing and Data Communication Systems

Expectation maximisation on unsupervised web mined data using probability latent semantic analysis (PLSA) algorithm


Abstract

The explosion of the web as an information source has recently brought new challenges to the annotation and classification of web documents, as well as to information and collaborative filtering. Not only does the huge number of documents introduce performance issues; an even bigger challenge is that most of the data are unlabeled. The arrival of social media and blogging has further increased the volume of web data. The machine learning community has therefore taken an interest in using unsupervised learning to classify such big data. A newer approach, semi-supervised learning, relies on the assumption that unlabeled data can help reveal interesting patterns that industry can use. This study uses the Probability Latent Semantic Analysis (PLSA) algorithm, an unsupervised machine learning algorithm, to retrieve information based on latent classes of documents and terms. PLSA is the topic modeling tool the study uses to deduce hidden topics across documents: it examines the terms and documents and uses the Expectation Maximization algorithm to infer which words belong to which topic. The model was used to discover whether given documents express one or more topics. The study concluded that the PLSA algorithm is more efficient on preprocessed data than on raw web data, and that processing time was reduced when preprocessing was used to eliminate redundant latent variables. Increasing k, the number of topics, improved topic quality.
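To make the role of Expectation Maximization in PLSA concrete, the following is a minimal sketch of the textbook PLSA updates, not the implementation used in the study. It assumes a dense document-term count matrix; the function name `plsa`, its parameters, the random initialisation, and the toy data are all illustrative assumptions. The E-step computes the posterior P(z|d,w) for each topic z, and the M-step re-estimates P(w|z) and P(z|d) from the expected counts.

```python
import numpy as np


def plsa(counts, k, n_iter=50, seed=0):
    """Fit PLSA by EM on a (documents x words) count matrix.

    Hypothetical helper for illustration only.
    Returns P(z|d) with shape (D, k) and P(w|z) with shape (k, W).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    eps = 1e-12  # guards against division by zero

    # Random initialisation; each row is normalised to a distribution.
    p_z_d = rng.random((n_docs, k))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((k, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # (D, k, W)
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)

        # M-step: weight the posteriors by the observed counts n(d,w).
        weighted = counts[:, None, :] * post                  # (D, k, W)
        p_w_z = weighted.sum(axis=0)                          # (k, W)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=2)                          # (D, k)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps

    return p_z_d, p_w_z


# Toy corpus (made up): documents 0-1 use words 0-1, documents 2-3 use words 2-3.
counts = np.array(
    [[4, 3, 0, 0],
     [5, 2, 0, 0],
     [0, 0, 3, 4],
     [0, 0, 2, 5]],
    dtype=float,
)
p_z_d, p_w_z = plsa(counts, k=2)
```

Each row of `p_z_d` is a document's mixture over the k latent topics, and each row of `p_w_z` is a topic's distribution over words. The dense (D, k, W) arrays keep the sketch short but would not scale to web-sized corpora, where sparse counts would be needed.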
