Journal: ACM Transactions on Asian Language Information Processing

Alignment of Bilingual Named Entities in Parallel Corpora Using Statistical Models and Multiple Knowledge Sources

Abstract

Named entity (NE) extraction is one of the fundamental tasks in natural language processing (NLP). Although many studies have focused on identifying NEs within monolingual documents, aligning NEs in bilingual documents has not been investigated extensively due to the complexity of the task. In this article we introduce a new approach to aligning bilingual NEs in parallel corpora by incorporating statistical models with multiple knowledge sources. In our approach, we model the process of translating an English NE phrase into a Chinese equivalent using lexical translation/transliteration probabilities for word translation and alignment probabilities for word reordering. The method involves automatically learning phrase alignment and acquiring word translations from a bilingual phrase dictionary and parallel corpora, and automatically discovering transliteration transformations from a training set of name-transliteration pairs. The method also involves language-specific knowledge functions, including handling abbreviations, recognizing Chinese personal names, and expanding acronyms. At runtime, the proposed models are applied to each source NE in a pair of bilingual sentences to generate and evaluate the target NE candidates; the source and target NEs are then aligned based on the computed probabilities. Experimental results demonstrate that the proposed approach, which integrates statistical models with extra knowledge sources, is highly feasible and offers significant improvement in performance compared to our previous work, as well as the traditional approach of IBM Model 4.
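
The runtime step described in the abstract (generate target NE candidates, score them with lexical and alignment probabilities, keep the best) can be illustrated with a minimal sketch. This is not the authors' implementation: the probability table, the smoothing floor, the uniform alignment function, and all function names are hypothetical, and the transliteration model and language-specific knowledge functions are left out entirely.

```python
from typing import Callable, Dict, List, Tuple

# Toy lexical translation probabilities P(target_word | source_word).
# These values are invented for illustration; they are not from the article.
LEXICAL_PROB: Dict[Tuple[str, str], float] = {
    ("Bank", "银行"): 0.8,
    ("China", "中国"): 0.9,
    ("China", "中华"): 0.05,
    ("of", "的"): 0.3,
}
FLOOR = 1e-6  # assumed smoothing value for unseen word pairs

AlignProb = Callable[[int, int, int, int], float]


def uniform_alignment(i: int, j: int, src_len: int, tgt_len: int) -> float:
    """Assumed placeholder: every source position i is equally likely
    for target position j. A learned reordering model would go here."""
    return 1.0 / src_len


def score_candidate(
    source: List[str],
    target: List[str],
    align_prob: AlignProb = uniform_alignment,
) -> float:
    """Score a target candidate under the best word alignment.

    Each target word independently picks the source word that maximises
    lexical probability times alignment probability.
    """
    score = 1.0
    for j, t_word in enumerate(target):
        score *= max(
            LEXICAL_PROB.get((s_word, t_word), FLOOR)
            * align_prob(i, j, len(source), len(target))
            for i, s_word in enumerate(source)
        )
    return score


def best_candidate(
    source: List[str],
    candidates: List[List[str]],
    align_prob: AlignProb = uniform_alignment,
) -> List[str]:
    """Return the candidate with the highest score."""
    return max(candidates, key=lambda c: score_candidate(source, c, align_prob))


source_ne = ["Bank", "of", "China"]
candidates = [["中国", "银行"], ["中国", "人民"], ["人民", "银行"]]
print(best_candidate(source_ne, candidates))
```

Because the sketch does not penalise source words left untranslated, it tends to favour shorter candidates; the example therefore compares candidates of equal length only.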