BMC Medical Informatics and Decision Making

ParaMed: a parallel corpus for English–Chinese translation in the biomedical domain

Abstract

Biomedical language translation requires multilingual fluency as well as relevant domain knowledge. Such requirements make it challenging to train qualified translators and costly to generate high-quality translations. Machine translation represents an effective alternative, but accurate machine translation requires large amounts of in-domain data. While such datasets are abundant in general domains, they are less accessible in the biomedical domain. Chinese and English are two of the most widely spoken languages, yet to our knowledge, a parallel corpus does not exist for this language pair in the biomedical domain. We developed an effective pipeline to acquire and process an English-Chinese parallel corpus from the New England Journal of Medicine (NEJM). This corpus consists of about 100,000 sentence pairs and 3,000,000 tokens on each side. We showed that training on out-of-domain data and fine-tuning with as few as 4000 NEJM sentence pairs improve translation quality by 25.3 (13.4) BLEU in the en→zh (zh→en) direction. Translation quality continues to improve at a slower pace on larger in-domain data subsets, with a total increase of 33.0 (24.3) BLEU in the en→zh (zh→en) direction on the full dataset. The code and data are available at https://github.com/boxiangliu/ParaMed .
