Automated text clustering of newspaper and scientific texts in Brazilian Portuguese: analysis and comparison of methods

Alexandre Ribeiro Afonso; Cláudio Gottschalg Duque

Abstract

This article reports the findings of an empirical study of automated text clustering applied to scientific articles and newspaper texts in Brazilian Portuguese. The objective was to find the most effective computational method for clustering the input texts into their original groups. The study covered four experiments, each with four procedures: 1. Corpus Selections (a set of texts is selected for clustering); 2. Word Class Selections (nouns, verbs and adjectives are chosen from each text by using specific algorithms); 3. Filtering Algorithms (a set of terms is selected from the results of the previous stage, a semantic weight is assigned to each term, and an index is generated for each text); 4. Clustering Algorithms (the clustering algorithms Simple K-Means, sIB and EM are applied to the indexes). After these procedures, statistics on clustering correctness and clustering time were collected. The sIB clustering algorithm was the best choice for both the scientific and the newspaper corpus, provided that the number of clusters is supplied as input before running (for the newspaper corpus, 68.9% correctness in 1 minute; for the scientific corpus, 77.8% correctness in 1 minute). The EM clustering algorithm can additionally estimate the number of clusters without user intervention, but its best case is less than 53% correctness. In the experiments carried out, the results of automated clustering remain far from those of human text classification; it was also observed that clustering correctness varies with the number of input texts and their topics.
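
The indexing, clustering and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which term weighting or correctness measure was used, so the sketch assumes TF-IDF weights, a plain k-means over cosine similarity with a deterministic farthest-first start, and correctness measured by mapping each cluster to its majority original group. The word class selection step is skipped (the input is assumed to be already reduced to nouns, verbs and adjectives), and all function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_index(docs):
    """Build one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    index = []
    for doc in docs:
        tf = Counter(doc)
        index.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return index


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(vectors, k, iterations=20):
    """Plain k-means on cosine similarity; k must be given by the caller."""
    # Assumption: farthest-first initialisation, chosen only to keep the
    # sketch deterministic (no random seeds involved).
    centroids = [vectors[0]]
    while len(centroids) < k:
        far = min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        centroids.append(far)
    labels = []
    for _ in range(iterations):
        new_labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels


def clustering_correctness(true_groups, labels):
    """Fraction of texts that fall in the majority original group of their cluster."""
    by_cluster = {}
    for group, lab in zip(true_groups, labels):
        by_cluster.setdefault(lab, Counter())[group] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return correct / len(labels)


# Toy corpus (hypothetical): two sports texts and two science texts.
docs = [
    ["futebol", "gol", "time"],
    ["time", "gol", "campeonato"],
    ["genoma", "proteína", "célula"],
    ["célula", "proteína", "experimento"],
]
groups = ["sports", "sports", "science", "science"]

labels = kmeans(tfidf_index(docs), k=2)
print(labels, clustering_correctness(groups, labels))
```

Like Simple K-Means and sIB in the study, this sketch needs the number of clusters as input; estimating it automatically, as the abstract reports for EM, would require a different model and is not shown here.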

Bibliographic details

Source: JISTEM - Journal of Information Systems and Technology Management, 2014, No. 2, 22 pages
Authors: Alexandre Ribeiro Afonso; Cláudio Gottschalg Duque
Original format: PDF
Chinese Library Classification: General industrial technology

Similar literature

1. Brazilian Portuguese Text Clustering Based on Evolutionary Computing [J]. Alexandre Ribeiro Afonso. Latin America Transactions, IEEE (Revista IEEE America Latina). 2016, No. 7.
2. Survey on-demand: a versatile scientific article automated inquiry method using text mining applied to asset liability management [J]. Igor Ferreira Do Nascimento, Pedro Henrique De Melo Albuquerque, Yaohao Peng. International Journal of Business Intelligence and Data Mining. 2021, No. 3.
3. Creation of Individual Scientific Concept-Centered Semantic Maps Based on Automated Text-Mining Analysis of PubMed [J]. Ekaterina Ilgisonis, Andrey Lisitsa, Valerya Kudryavtseva. Advances in Bioinformatics. 2018, No. 1.
4. Studying the Effects of Text Preprocessing and Ensemble Methods on Sentiment Analysis of Brazilian Portuguese Tweets [C]. Fernando Barbosa Gomes, Juan Manuel Adan-Coello, Fernando Ernesto Kintschner. International Conference on Statistical Language and Speech Processing. 2018.
5. Computer-Assisted and Traditional Methods of Text Analysis - A Comparative Study of East and West German Newspaper Language (Sociolinguistics, Text Linguistics) [D]. Kempf, Renate Uta. 1984.
6. Creation of Individual Scientific Concept-Centered Semantic Maps Based on Automated Text-Mining Analysis of PubMed [O]. Ekaterina Ilgisonis, Andrey Lisitsa, Valerya Kudryavtseva. 2018.
7. Automated Text Clustering of Newspaper and Scientific Texts in Brazilian Portuguese: Analysis and Comparison of Methods [O]. Alexandre Ribeiro Afonso, Cláudio Gottschalg Duque. 2014.
