Latin American Computing Conference

Using CRF+LG for automated classification of named entities in newspaper texts

Abstract

Information production has been growing at an accelerated rate. The large amount of information to be processed makes the collection and exploitation of text challenging and demands effort from the most diverse areas, especially computational linguistics. One of the goals of computational linguistics is to enable the collection and exploitation of linguistic datasets through empirical evidence extracted with computational resources. This work presents the creation of a dataset extracted from a newspaper. The pipeline uses sentence extraction, tokenization, and named entity recognition (NER), together with statistical methods for describing the dataset. The resulting corpus is available, so other researchers can benefit from it directly. It contains 1029 news articles in Portuguese, annotated for entities according to the HAREM categories. In a sample of 108 pages, our experiments show 97.0% similarity between the extracted text and gold-standard texts from the same newspaper. For the NER task and the automatic annotation of the extracted dataset, different proportions of the Second HAREM and aTribuna100 datasets were used to train a hybrid $\text{CRF}+\text{LG}$ model. The trained model was then used to annotate the 1029 extracted articles automatically. In general, the classification model achieved its best metrics with the 70/30 proportion, especially for the Person (PER) category, which reached 91.11% precision and 95.82% recall. Overall, the model showed 95.86% accuracy.
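To make the reported metrics concrete, the following is a minimal sketch of how per-category precision and recall, and overall accuracy, can be computed from gold and predicted labels. It is not the authors' evaluation code: it assumes token-level scoring (the abstract does not say whether evaluation was per token or per entity), and the function names, the `LOC` label, and the toy data are hypothetical.

```python
from collections import Counter


def per_category_scores(gold, predicted, outside="O"):
    """Token-level precision and recall for each entity category.

    Assumption: one label per token, with `outside` marking non-entity tokens.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            if g != outside:
                tp[g] += 1          # correct entity label
        else:
            if p != outside:
                fp[p] += 1          # predicted a category that is wrong here
            if g != outside:
                fn[g] += 1          # missed the gold category

    scores = {}
    for cat in set(tp) | set(fp) | set(fn):
        predicted_total = tp[cat] + fp[cat]
        gold_total = tp[cat] + fn[cat]
        scores[cat] = {
            "precision": tp[cat] / predicted_total if predicted_total else 0.0,
            "recall": tp[cat] / gold_total if gold_total else 0.0,
        }
    return scores


def accuracy(gold, predicted):
    """Fraction of tokens whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Toy data (hypothetical, not taken from the corpus).
gold_labels = ["PER", "PER", "O", "LOC", "O", "PER"]
pred_labels = ["PER", "O", "O", "LOC", "PER", "PER"]

scores = per_category_scores(gold_labels, pred_labels)
overall_accuracy = accuracy(gold_labels, pred_labels)
```

On the toy data, two of the three predicted `PER` tokens are correct and two of the three gold `PER` tokens are found, so both precision and recall for `PER` are 2/3. Entity-level scoring, where a multi-token entity counts once, would generally give different numbers.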
