Journal: ACM Transactions on Asian Language Information Processing

An Approach to Construct a Named Entity Annotated English-Vietnamese Bilingual Corpus

Abstract

Manually constructing a Named Entity (NE) annotated bilingual corpus is a time-consuming, labor-intensive, and expensive process, but such a corpus is necessary for natural language processing (NLP) tasks such as cross-lingual information retrieval, cross-lingual information extraction, and machine translation. In this article, we present an automatic approach to constructing an NE-annotated English-Vietnamese bilingual corpus from a bilingual parallel corpus by proposing an NE alignment method. Starting from a bilingual corpus in which the initial NEs are extracted separately for each language, the approach tries to correct unrecognized or incorrectly recognized NEs before aligning the NEs, using a variety of bilingual constraints. The generated corpus not only improves the NE recognition results but also provides alignments between English NEs and Vietnamese NEs, which are necessary for training NE translation models. The experimental results show that the approach outperforms the baseline methods. In the English-Vietnamese NE alignment task, the F-measure increases from 68.58% to 79.77%. NE recognition quality also improves significantly: the F-measure goes from 84.85% to 88.66% for the English side and from 75.71% to 85.55% for the Vietnamese side. When the corpus supplies additional semantic information to machine translation systems, the BLEU score increases from 33.04% to 45.11%.
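The F-measure reported above is the harmonic mean of precision and recall. The following is a minimal sketch of how such a score could be computed for NE alignment. It assumes that alignments are represented as (English NE, Vietnamese NE) pairs and scored by exact match against a gold set. The abstract does not describe the evaluation protocol, so the matching criterion, the function name `alignment_f_measure`, and the sample pairs are all hypothetical illustrations.

```python
def alignment_f_measure(predicted, gold):
    """Return (precision, recall, F1) for predicted NE alignment pairs.

    Assumption: a predicted pair counts as correct only if it exactly
    matches a gold pair. The article may use a different criterion.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and system alignments (not taken from the article).
gold = {
    ("Hanoi", "Hà Nội"),
    ("Vietnam", "Việt Nam"),
    ("United Nations", "Liên Hợp Quốc"),
}
predicted = {
    ("Hanoi", "Hà Nội"),
    ("Vietnam", "Việt Nam"),
    ("United Nations", "Liên Hợp"),  # wrong boundary, so no exact match
}

precision, recall, f1 = alignment_f_measure(predicted, gold)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Under exact matching, a boundary error such as the truncated Vietnamese NE in the last pair lowers both precision and recall, which is one reason correcting NE boundaries before alignment can raise the alignment F-measure.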
