Active Learning and Crowdsourcing for Machine Translation in Low Resource Scenarios.

Abstract

Corpus-based approaches to automatic translation, such as Example-Based and Statistical Machine Translation systems, use large amounts of human-created parallel data to train mathematical models for automatic language translation. Generating parallel data at scale for new language pairs requires intensive human effort and the availability of fluent bilinguals or expert translators. It is therefore difficult and expensive to provide state-of-the-art Machine Translation (MT) systems for rare languages.

In this thesis, we explore active learning to reduce costs and make the best use of human resources when building low-resource MT systems. Active learning approaches help us identify sentences that, if translated, have the potential to provide maximal improvement to an existing system. We then apply active learning to other relevant tasks in MT, such as word alignment, classifying monolingual text by topic, and extracting comparable corpora from the web. In all these tasks we reduce the annotated data required by the underlying supervised learning models. We also extend the traditional active learning approach of optimizing selection for a single annotation type to handle cases with multiple annotation types, and show a further reduction in the cost of building low-resource MT systems.

Finally, as part of this thesis, we have implemented a new framework, Active Crowd Translation (ACT), a cost-sensitive active learning setup for building MT systems for low-resource language pairs. Our framework provides a suitable platform for involving geographically dispersed human translators around the world, in a timely and sparing fashion, for the rapid building of translation systems. We first explore the ACT paradigm with expert translators and then generalize to full-scale crowdsourcing with non-expert bilingual speakers.

In the case of Machine Translation, although crowdsourcing services like Amazon's Mechanical Turk have opened doors to tap human potential, they guarantee neither translation expertise nor extended availability of translators. We address several challenges in eliciting quality translations from an unvetted crowd of bilingual speakers.

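Bibliographic record

  • Author

    Ambati, Vamshi

  • Author affiliation

    Carnegie Mellon University

  • Degree-granting institution: Carnegie Mellon University
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2012
  • Pages: 148 p.
  • Total pages: 148
  • Original format: PDF
  • Language of text: English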