Journal: JISTEM - Journal of Information Systems and Technology Management

Automated text clustering of newspaper and scientific texts in Brazilian Portuguese: analysis and comparison of methods



Abstract

This article reports the findings of an empirical study of automated text clustering applied to scientific articles and newspaper texts in Brazilian Portuguese. The objective was to find the most effective computational method for clustering the input texts into their original groups. The study comprised four experiments, each with four procedures: (1) corpus selection (a set of texts is selected for clustering); (2) word class selection (nouns, verbs and adjectives are extracted from each text using specific algorithms); (3) filtering (a set of terms is selected from the results of the previous stage, a semantic weight is assigned to each term, and an index is generated for each text); and (4) clustering (the Simple K-Means, sIB and EM clustering algorithms are applied to the indexes). After these procedures, statistics on clustering correctness and clustering time were collected. The sIB clustering algorithm was the best choice for both the scientific and the newspaper corpus, with the caveat that sIB requires the number of clusters as input before running (68.9% correctness in 1 minute for the newspaper corpus and 77.8% correctness in 1 minute for the scientific corpus). The EM clustering algorithm can additionally estimate the number of clusters without user intervention, but its best-case correctness is below 53%. In the experiments carried out, the results of automated clustering remained far from those of human text classification; clustering correctness was also observed to vary with the number of input texts and their topics.
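The filtering, clustering and correctness steps described in the abstract can be illustrated with a minimal sketch. Everything specific in it is an assumption rather than the study's method: TF-IDF stands in for the unspecified "semantic weight", a small cosine-based k-means stands in for Simple K-Means (sIB and EM are not shown), correctness is taken to be the share of texts whose cluster's majority label matches their original group, and the six tokenized toy texts and all function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_index(docs):
    """Build one unit-length sparse TF-IDF vector (term -> weight) per tokenized text."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    """Dot product of two sparse vectors (equals cosine when both are unit length)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def mean_vector(vectors):
    """Sum the member vectors and rescale to unit length to get a centroid."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}

def kmeans(vectors, k, iterations=20):
    """Cosine-based k-means; seeded with the first k texts so the run is deterministic."""
    centroids = [dict(v) for v in vectors[:k]]
    assignment = None
    for _step in range(iterations):
        new_assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # no text changed cluster: converged
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignment

def correctness(assignment, labels):
    """Fraction of texts whose cluster's majority label equals their own label (assumed metric)."""
    hits = 0
    for c in set(assignment):
        members = [lab for a, lab in zip(assignment, labels) if a == c]
        hits += Counter(members).most_common(1)[0][1]
    return hits / len(labels)

# Hypothetical toy corpus: each text is already reduced to nouns/verbs/adjectives.
docs = [
    ["gol", "time", "jogador", "campeonato"],
    ["célula", "proteína", "experimento", "gene"],
    ["time", "jogador", "torcida", "gol"],
    ["gene", "proteína", "amostra", "célula"],
    ["campeonato", "torcida", "time", "vitória"],
    ["experimento", "amostra", "gene", "resultado"],
]
labels = ["newspaper", "scientific", "newspaper", "scientific", "newspaper", "scientific"]

vectors = tfidf_index(docs)
assignment = kmeans(vectors, k=2)
score = correctness(assignment, labels)
print(assignment, score)
```

Because the two toy groups share no vocabulary, this sketch recovers them exactly; real newspaper and scientific corpora overlap heavily, which is consistent with the much lower correctness figures reported above. Like sIB in the study, this k-means variant needs the number of clusters `k` as input.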
