2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling


Abstract

It is well known that the output quality of statistical machine translation (SMT) systems increases with more training data. To obtain more parallel text for translation modeling, researchers have turned to the web to mine parallel sentences, but most previous approaches have avoided the difficult problem of pairwise similarity on cross-lingual documents and instead rely on heuristics. In contrast, we confront this challenge head-on using the MapReduce framework. On a modest cluster, our scalable end-to-end processing pipeline was able to automatically gather 5.8 million parallel sentence pairs from English and German Wikipedia. Augmenting existing bitext with these data yielded significant improvements over a state-of-the-art baseline (2.39 BLEU points in the best case).
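The abstract describes the approach only at a high level, so the following minimal Python sketch is just a hedged illustration of the general technique it names: scoring cross-lingual document pairs with MapReduce-style map and reduce steps over a shared term space, before any sentence extraction. The bilingual lexicon DE_EN, the toy documents, and the cosine-over-postings scoring below are assumptions made for illustration and are not taken from the paper.

```python
# A minimal, self-contained sketch (NOT the paper's implementation) of
# cross-lingual pairwise document similarity computed MapReduce-style.
# The lexicon, documents, and weighting are hypothetical toy data.

from collections import Counter, defaultdict
import math

# Hypothetical German->English lexicon used to project German documents
# into the English term space (real systems would use translation probabilities).
DE_EN = {"haus": "house", "katze": "cat", "hund": "dog", "blau": "blue"}

def project(tokens, lang):
    """Map tokens into the shared (English) vocabulary."""
    if lang == "en":
        return tokens
    return [DE_EN[t] for t in tokens if t in DE_EN]

def tf_vector(tokens):
    """L2-normalized term-frequency vector."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {t: c / norm for t, c in counts.items()}

def mapper(doc_id, lang, tokens):
    """Emit (term, (doc_id, lang, weight)) postings, as a MapReduce mapper would."""
    for term, w in tf_vector(project(tokens, lang)).items():
        yield term, (doc_id, lang, w)

def reducer(term, postings):
    """For one term, emit partial cosine scores for every cross-lingual pair."""
    en = [(d, w) for d, lang, w in postings if lang == "en"]
    de = [(d, w) for d, lang, w in postings if lang == "de"]
    for d_en, w_en in en:
        for d_de, w_de in de:
            yield (d_en, d_de), w_en * w_de

def pairwise_similarity(docs):
    """Run the map and reduce phases locally and sum partial scores per pair."""
    shuffled = defaultdict(list)
    for doc_id, lang, tokens in docs:
        for term, posting in mapper(doc_id, lang, tokens):
            shuffled[term].append(posting)
    scores = defaultdict(float)
    for term, postings in shuffled.items():
        for pair, partial in reducer(term, postings):
            scores[pair] += partial
    return scores

if __name__ == "__main__":
    docs = [
        ("en1", "en", "the cat and the dog in the house".split()),
        ("de1", "de", "die katze und der hund im haus".split()),
        ("de2", "de", "der himmel ist blau".split()),
    ]
    for (d_en, d_de), score in sorted(pairwise_similarity(docs).items()):
        print(d_en, d_de, round(score, 3))
```

Emitting partial products per shared term and summing them in the reducer is the standard way to express pairwise cosine similarity as a MapReduce job without materializing full document vectors at any single node; a real pipeline would then align and filter sentences within the high-scoring document pairs.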
