Journal: Computer Speech and Language

MuST-C: A multilingual corpus for end-to-end speech translation


Abstract

End-to-end spoken language translation (SLT) has recently gained popularity thanks to advances in sequence-to-sequence learning in its two parent tasks: automatic speech recognition (ASR) and machine translation (MT). However, research in the field must contend with the scarcity of publicly available corpora for training data-hungry neural networks. Indeed, while traditional cascade solutions can build on sizable ASR and MT training data for a variety of languages, the available SLT corpora suitable for end-to-end training are few, typically small, and of limited language coverage. We help fill this gap by presenting MuST-C, a large and freely available Multilingual Speech Translation Corpus built from English TED Talks. Its unique features include: i) language coverage and diversity (from English into 14 languages from different families), ii) size (at least 237 hours of transcribed recordings per language, 430 on average), iii) variety of topics and speakers, and iv) data quality. Besides describing the corpus creation methodology and discussing the outcomes of empirical and manual quality evaluations, we present baseline results computed with strong systems on each language direction covered by MuST-C.
