2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling


Abstract

It is well known that the output quality of statistical machine translation (SMT) systems increases with more training data. To obtain more parallel text for translation modeling, researchers have turned to the web to mine parallel sentences, but most previous approaches have avoided the difficult problem of pairwise similarity on cross-lingual documents and instead rely on heuristics. In contrast, we confront this challenge head-on using the MapReduce framework. On a modest cluster, our scalable end-to-end processing pipeline was able to automatically gather 5.8 million parallel sentence pairs from English and German Wikipedia. Augmenting existing bitext with these data yielded significant improvements over a state-of-the-art baseline (2.39 BLEU points in the best case).
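The abstract describes the approach only at a high level, so the following minimal Python sketch is just a hedged illustration of the general technique it names: scoring cross-lingual document pairs with MapReduce-style map and reduce steps over a shared term space, before any sentence extraction. The bilingual lexicon DE_EN, the toy documents, and the cosine-over-postings scoring below are assumptions made for illustration and are not taken from the paper.

```python
# A minimal, self-contained sketch (NOT the paper's implementation) of
# cross-lingual pairwise document similarity computed MapReduce-style.
# The lexicon, documents, and weighting are hypothetical toy data.

from collections import Counter, defaultdict
import math

# Hypothetical German->English lexicon used to project German documents
# into the English term space (real systems would use translation probabilities).
DE_EN = {"haus": "house", "katze": "cat", "hund": "dog", "blau": "blue"}

def project(tokens, lang):
    """Map tokens into the shared (English) vocabulary."""
    if lang == "en":
        return tokens
    return [DE_EN[t] for t in tokens if t in DE_EN]

def tf_vector(tokens):
    """L2-normalized term-frequency vector."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {t: c / norm for t, c in counts.items()}

def mapper(doc_id, lang, tokens):
    """Emit (term, (doc_id, lang, weight)) postings, as a MapReduce mapper would."""
    for term, w in tf_vector(project(tokens, lang)).items():
        yield term, (doc_id, lang, w)

def reducer(term, postings):
    """For one term, emit partial cosine scores for every cross-lingual pair."""
    en = [(d, w) for d, lang, w in postings if lang == "en"]
    de = [(d, w) for d, lang, w in postings if lang == "de"]
    for d_en, w_en in en:
        for d_de, w_de in de:
            yield (d_en, d_de), w_en * w_de

def pairwise_similarity(docs):
    """Run the map and reduce phases locally and sum partial scores per pair."""
    shuffled = defaultdict(list)
    for doc_id, lang, tokens in docs:
        for term, posting in mapper(doc_id, lang, tokens):
            shuffled[term].append(posting)
    scores = defaultdict(float)
    for term, postings in shuffled.items():
        for pair, partial in reducer(term, postings):
            scores[pair] += partial
    return scores

if __name__ == "__main__":
    docs = [
        ("en1", "en", "the cat and the dog in the house".split()),
        ("de1", "de", "die katze und der hund im haus".split()),
        ("de2", "de", "der himmel ist blau".split()),
    ]
    for (d_en, d_de), score in sorted(pairwise_similarity(docs).items()):
        print(d_en, d_de, round(score, 3))
```

Emitting partial products per shared term and summing them in the reducer is the standard way to express pairwise cosine similarity as a MapReduce job without materializing full document vectors at any single node; a real pipeline would then align and filter sentences within the high-scoring document pairs.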
