Expectation maximisation on unsupervised web mined data using probability latent semantic analysis (PLSA) algorithm

Abstract

The recent explosion of the web as an information source has brought new challenges to the annotation and classification of web documents, as well as to information filtering and collaborative filtering. The huge number of documents introduces performance issues, and an even bigger challenge is that most of the data are unlabeled. The arrival of social media and blogging has further increased the volume of web data. The machine learning community has therefore taken an interest in using unsupervised learning to classify such big data. A newer approach, semi-supervised learning, relies on the assumption that unlabeled data can help reveal interesting patterns that industry can use. The study uses the Probabilistic Latent Semantic Analysis (PLSA) algorithm, an unsupervised machine learning algorithm, to retrieve information based on latent classes of documents and terms. PLSA is the topic modeling tool the study uses to deduce hidden topics across the documents: it looks at the terms and documents and uses the Expectation Maximization algorithm to infer which words belong to which topic. The model was used to determine whether a given document covers one or more topics. The study concluded that the PLSA algorithm is more efficient on preprocessed data than on raw web data, and that processing time was reduced when preprocessing was used to eliminate redundant latent variables. Increasing k, the number of topics, improved topic quality.
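The abstract does not give the study's implementation, so the following is only a minimal sketch of how PLSA can be fitted with Expectation Maximization. It assumes a small document-term count matrix and models each document as a mixture of k topics, with P(z|d) the topic weights of a document and P(w|z) the word distribution of a topic. The function names `normalise` and `plsa`, the toy vocabulary, the counts, and the iteration count are all illustrative assumptions, not taken from the paper.

```python
import random


def normalise(row):
    """Scale non-negative numbers to sum to 1 (uniform if all are zero)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsa(counts, k, iters=200, seed=0):
    """Fit P(z|d) and P(w|z) to a document-term count matrix with EM."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random starting point; every row is a probability distribution.
    p_z_d = [normalise([rng.random() + 0.1 for _ in range(k)])
             for _ in range(n_docs)]
    p_w_z = [normalise([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(k)]

    for _ in range(iters):
        acc_z_d = [[0.0] * k for _ in range(n_docs)]
        acc_w_z = [[0.0] * n_words for _ in range(k)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                post = normalise([p_z_d[d][z] * p_w_z[z][w] for z in range(k)])
                for z in range(k):
                    acc_z_d[d][z] += n * post[z]
                    acc_w_z[z][w] += n * post[z]
        # M-step: renormalise the expected counts into probabilities.
        p_z_d = [normalise(row) for row in acc_z_d]
        p_w_z = [normalise(row) for row in acc_w_z]
    return p_z_d, p_w_z


# Hypothetical toy corpus: two sports-like and two politics-like documents.
vocab = ["goal", "match", "team", "vote", "party", "election"]
counts = [
    [3, 2, 4, 0, 0, 0],
    [2, 3, 3, 0, 0, 0],
    [0, 0, 0, 4, 2, 3],
    [0, 0, 0, 3, 3, 2],
]
k = 2
p_z_d, p_w_z = plsa(counts, k)

# Dominant topic per document and the three heaviest words per topic.
dominant = [max(range(k), key=row.__getitem__) for row in p_z_d]
for z in range(k):
    top = sorted(range(len(vocab)), key=lambda w: -p_w_z[z][w])[:3]
    print("topic", z, [vocab[w] for w in top])
print("dominant topic per document:", dominant)
```

A document whose row of `p_z_d` puts noticeable weight on more than one topic would be read as covering several topics, which is the kind of question the abstract says the model was used to answer. The sketch uses pure Python loops for clarity; a real web-scale corpus would need sparse matrices and preprocessing such as stop-word removal.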


Bibliographic record

Source: International Conference on Advances in Big Data, Computing and Data Communication Systems, 2019, pp. 1-9 (9 pages)
Author: Chengeta Kennedy
Original format: PDF
Keywords: Semantics; Probabilistic logic; Crawlers; Mathematical model; Classification algorithms; Analytical models


Similar literature

1. A qualitative and quantitative assessment of the impact of three processing algorithms with halving of study count statistics in myocardial perfusion imaging: Filtered backprojection, maximal likelihood expectation maximisation and ordered subset expectation maximisation with resolution recovery [J]. Modi B.N., Brown J.L.E., Kumar G. Journal of nuclear cardiology: official publication of the American Society of Nuclear Cardiology. 2012, no. 5
2. The Effectiveness of a Probabilistic Principal Component Analysis Model and Expectation Maximisation Algorithm in Treating Missing Daily Rainfall Data [J]. Zun Liang Chuan, Sayang Mohd Deni, Soo-Fen Fam. Asia-Pacific journal of atmospheric sciences. 2020, no. 1
3. More data trumps smarter algorithms: Comparing pointwise mutual information with latent semantic analysis [J]. Gabriel Recchia, Michael N. Jones. Behavior Research Methods, Instruments & Computers. 2009, no. 3
4. Expectation maximisation on unsupervised web mined data using probability latent semantic analysis (PLSA) algorithm [C]. Chengeta Kennedy. International Conference on Advances in Big Data, Computing and Data Communication Systems. 2019
5. Poisson Process Bandits: Sequential Models and Algorithms for Maximising the Detection of Point Process Data [D]. Grant, James Andrew. 2019
6. The Unsupervised Feature Selection Algorithms Based on Standard Deviation and Cosine Similarity for Genomic Data Analysis [O]. Juanying Xie, Mingzhao Wang, Shengquan Xu. 2021
7. Improved summarization of Chinese spoken documents by probabilistic latent semantic analysis (PLSA) with further analysis and integrated scoring [O]. Sheng-yi Kong, Lin-shan Lee. 2006
