Using CRF+LG for automated classification of named entities in newspaper texts

Abstract

Information production has been growing at an accelerated rate. The large amount of information to be processed makes the collection and exploitation of text challenging and requires effort from the most diverse areas, especially computational linguistics. One of the goals of computational linguistics is to enable the collection and exploitation of linguistic datasets through empirical evidence extracted with computational resources. This work presents the creation of a dataset extracted from a newspaper. Techniques were used for sentence extraction, tokenization, and named entity recognition (NER), along with statistical methods for describing the dataset. Other researchers can directly benefit from the available corpus, which contains 1029 news articles in Portuguese annotated for entities according to the HAREM categories. In a sample of 108 pages, our experiments show a 97.0% similarity compared to gold standard texts from the same newspaper. For the NER task and the automatic annotation of the extracted dataset, proportions of the Second HAREM and aTribuna100 datasets were used to train a hybrid CRF+LG model. With the trained model, the 1029 extracted articles were automatically annotated. In general, the metrics show that the classification model performed best with the 70/30 proportion, especially for the Person (PER) category, where it reached 91.11% precision and 95.82% recall. Overall, the model showed 95.86% accuracy.
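To make the reported figures concrete, the following is a minimal sketch of a 70/30 split and of per-category precision and recall. It is not the authors' code: the function names `split_70_30` and `precision_recall`, the shuffling seed, the token-level scoring, and the toy labels are all assumptions made for illustration, and the paper may score at the entity level instead.

```python
import random
from typing import List, Sequence, Tuple


def split_70_30(items: Sequence, seed: int = 0) -> Tuple[list, list]:
    """Shuffle a copy of `items` and split it 70/30 (hypothetical helper)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def precision_recall(gold: List[str], pred: List[str], label: str) -> Tuple[float, float]:
    """Token-level precision and recall for one category (assumed scoring scheme)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data, invented for this example only.
train, test = split_70_30(range(10))
gold = ["PER", "O", "PER", "ORG", "O", "PER"]
pred = ["PER", "PER", "PER", "ORG", "O", "O"]
per_precision, per_recall = precision_recall(gold, pred, "PER")
```

On the toy labels, two PER tokens are correct, one is a false positive, and one is missed, so both values come out to 2/3.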

Bibliographic record

Source
Latin American Computing Conference | 2020 | pp. 27-32 | 6 pages
Authors
Jaimel de Oliveira Lima; Cristiano da Silveira Colombo; Flávio Izo; Juliana Campos Pinheiro Pirovani; Elias de Oliveira;
Original format: PDF
Keywords
Training; Measurement; Annotations; Statistical analysis; Standards organizations; Production; Portable document format;

Similar literature

1. Named entity recognition and classification in biomedical text using classifier ensemble [J]. Saha Sriparna, Ekbal Asif, Sikdar Utpal Kumar. International journal of data mining and bioinformatics, 2015, no. 4.
2. Automated text clustering of newspaper and scientific texts in brazilian portuguese: analysis and comparison of methods [J]. Alexandre Ribeiro Afonso, Cláudio Gottschalg Duque. JISTEM - Journal of Information Systems and Technology Management, 2014, no. 2.
3. Context-sensitive gender inference of named entities in text [J]. Sudeshna Das, Jiaul H Paik. Information Processing & Management, 2021, no. 1.
4. CRF+LG: A Hybrid Approach for the Portuguese Named Entity Recognition [C]. Juliana P. G. Pirovani, Elias de Oliveira. International Conference on Intelligent Systems Design and Applications, 2018.
5. Named Entity Resolution for Historical Texts [D]. Holmes, Audrey. 2019.
6. De-identifying Spanish medical texts - named entity recognition applied to radiology reports [O]. Irene Pérez-Díez, Raúl Pérez-Moraga, Adolfo López-Cerdán, 2021.
7. Text Classification and Named Entities for New Event Detection [O]. Giridhar Kumaran, James Allan. 2004.
