Journal: Information Processing & Management

Extracting translations from comparable corpora for Cross-Language Information Retrieval using the language modeling framework


Abstract

A main challenge in Cross-Language Information Retrieval (CLIR) is to estimate a proper translation model from the available translation resources, since translation quality directly affects retrieval performance. Among the different translation resources, we focus on obtaining translation models from comparable corpora, because they provide appropriate translations for both languages and domains with limited linguistic resources. In this paper, we employ a two-step approach to build an effective translation model for the CLIR task from comparable corpora, without requiring any additional linguistic resources. In the first step, translations are extracted by deriving correlations between source-target word pairs. In the second step, these correlations are used to estimate word translation probabilities. We propose a language modeling approach for the first step, where modeling based on probability distributions provides two key advantages. First, our approach is easier to tune than previous, heuristically adjusted methods. Second, it provides a principled basis for integrating additional lexical and translational relations to improve the accuracy of translations extracted from comparable corpora. As an illustration, we integrate monolingual word co-occurrence relations into the translation extraction process, which helps to extract more reliable translations for low-frequency words in a comparable corpus. Experimental results on an English-Persian comparable corpus show that our method outperforms previous approaches in terms of both translation quality and CLIR performance. Moreover, the proposed method is naturally applicable to any comparable corpus, regardless of its languages. In addition, we demonstrate that the word translation probabilities estimated in the second step of our approach have a significant impact on CLIR performance.
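The two-step structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' method: the abstract does not give the language modeling formulation, so step 1 below substitutes a plain document-level pointwise mutual information (PMI) score as the correlation measure, and step 2 simply normalizes the top-scoring candidates. The function names, the `top_k` parameter, and the toy document pairs are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def extract_correlations(aligned_pairs):
    """Step 1 (stand-in): score source-target word pairs.

    Uses document-level PMI over aligned comparable documents.
    This is an assumed substitute for the paper's language
    modeling approach, which the abstract does not spell out.
    """
    n = len(aligned_pairs)
    src_df, tgt_df, joint_df = Counter(), Counter(), Counter()
    for src_doc, tgt_doc in aligned_pairs:
        src_words, tgt_words = set(src_doc), set(tgt_doc)
        src_df.update(src_words)
        tgt_df.update(tgt_words)
        for s in src_words:
            for t in tgt_words:
                joint_df[(s, t)] += 1

    scores = defaultdict(dict)
    for (s, t), c in joint_df.items():
        pmi = math.log((c * n) / (src_df[s] * tgt_df[t]))
        if pmi > 0:  # keep only positively associated pairs
            scores[s][t] = pmi
    return scores


def to_translation_probs(scores, top_k=3):
    """Step 2 (stand-in): normalize the top-k correlation scores
    of each source word into a probability distribution."""
    probs = {}
    for s, candidates in scores.items():
        top = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        total = sum(score for _, score in top)
        probs[s] = {t: score / total for t, score in top}
    return probs


# Toy aligned document pairs (hypothetical, transliterated target words).
pairs = [
    (["water", "river"], ["ab", "rud"]),
    (["water", "sea"], ["ab", "darya"]),
    (["water"], ["ab"]),
    (["book"], ["ketab"]),
]
probs = to_translation_probs(extract_correlations(pairs))
print(probs["river"])
```

Plain PMI is known to favor rare words, which is one reason a sketch like this handles low-frequency terms poorly; the abstract's point about integrating monolingual co-occurrence relations addresses exactly that weakness.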
