COVID-19 Named Entity Recognition for Vietnamese

Abstract

The current COVID-19 pandemic has led to the creation of many corpora that facilitate NLP research and downstream applications to help fight the pandemic. However, most of these corpora are exclusively for English. As the pandemic is a global problem, it is worth creating COVID-19 related datasets for languages other than English. In this paper, we present the first manually annotated COVID-19 domain-specific dataset for Vietnamese. In particular, our dataset is annotated for the named entity recognition (NER) task with newly defined entity types that can be used in other future epidemics. Our dataset also contains the largest number of entities compared to existing Vietnamese NER datasets. We conduct experiments using strong baselines on our dataset and find that (i) automatic Vietnamese word segmentation helps improve the NER results, and (ii) the highest performance is obtained by fine-tuning pre-trained language models, where the monolingual model PhoBERT for Vietnamese (Nguyen and Nguyen, 2020) produces higher results than the multilingual model XLM-R (Conneau et al., 2020).
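
To illustrate the distinction between syllable-level and word-level input mentioned in the abstract, the following minimal sketch shows one way syllable-level BIO tags could be projected onto segmented words. It is not taken from the paper: the function name merge_to_word_level, the LOCATION label, the example sentence, and the rule that a word inherits the tag of its first syllable are all assumptions made for illustration. Joining syllables with underscores follows a common convention of Vietnamese word segmenters.

```python
from typing import List, Tuple


def merge_to_word_level(
    syllables: List[str],
    tags: List[str],
    word_lengths: List[int],
) -> List[Tuple[str, str]]:
    """Collapse syllable-level BIO tags onto segmented words.

    word_lengths gives the number of syllables in each word, as a
    (hypothetical) word segmenter might report it. Each word takes the
    tag of its first syllable; this is an illustrative assumption.
    """
    if len(syllables) != len(tags):
        raise ValueError("syllables and tags must have the same length")
    if sum(word_lengths) != len(syllables):
        raise ValueError("word_lengths must cover every syllable exactly once")

    merged = []
    pos = 0
    for n in word_lengths:
        # Join the syllables of one word with underscores.
        word = "_".join(syllables[pos:pos + n])
        merged.append((word, tags[pos]))
        pos += n
    return merged


# Hypothetical example sentence: "Bệnh nhân ở Hà Nội" ("The patient is in Hanoi").
example_syllables = ["Bệnh", "nhân", "ở", "Hà", "Nội"]
example_tags = ["O", "O", "O", "B-LOCATION", "I-LOCATION"]
example_word_lengths = [2, 1, 2]

word_level = merge_to_word_level(example_syllables, example_tags, example_word_lengths)
```

With word-level input, the two-syllable location becomes a single token carrying one B- tag, so the tagger no longer has to predict a B-/I- boundary inside the word.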
