IEEE International Conference on Parallel and Distributed Systems

Seal: Efficient Training Large Scale Statistical Machine Translation Models on Spark

Abstract

Statistical machine translation (SMT) is an important research branch of natural language processing (NLP). As in many other NLP applications, larger training data sets can potentially yield higher translation accuracy for SMT models. However, traditional single-node SMT model training systems can hardly cope with the fast-growing volume of training corpora in the big data era, which creates an urgent need for efficient large scale machine translation model training systems. In this paper, we propose Seal, an efficient, scalable, and end-to-end offline SMT model training toolkit built on Apache Spark, a widely used distributed data-parallel platform. Seal parallelizes the training of all three key SMT models: the word alignment model, the translation model, and the N-gram language model. To further improve training performance in Seal, we also propose a number of system optimization methods. In word alignment model training, tuning the block size greatly reduces the I/O and communication overhead. In translation model training, compactly encoding the training corpus significantly reduces the amount of data transferred over the network, thus improving overall training efficiency. We also optimize the maximum likelihood estimation (MLE) algorithm to address the data skew issue in the join operation used in both translation model training and language model training. The experiment results show that Seal outperforms the well-known SMT training system Chaski with about a 5× speedup for word alignment model training. For syntactic translation model and language model training, Seal outperforms existing cutting-edge tools with about 9~18× and 8~9× speedups on average, respectively. Overall, Seal outperforms the existing distributed system with a 4~6× speedup and single-node systems with a 9~60× speedup on average. In addition, Seal achieves near-linear data and node scalability.
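The abstract only summarizes the skew optimization applied to the MLE join. As an illustration of the general technique, below is a minimal Spark (Scala) sketch of a skew-aware relative-frequency MLE step for phrase translation probabilities, using key salting so that very frequent source phrases do not overload a single reducer. This is not Seal's actual implementation: the sample data, the salting factor, and all identifiers are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession

import scala.util.Random

// Minimal sketch (not Seal's code): relative-frequency MLE for phrase
// translation probabilities p(tgt | src) = count(src, tgt) / count(src),
// with key salting to spread hot source phrases across partitions and
// mitigate data skew in the aggregation/join.
object SkewAwareMleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("skew-aware-mle-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical extracted phrase pairs: (sourcePhrase, targetPhrase)
    val phrasePairs = sc.parallelize(Seq(
      ("the house", "das Haus"), ("the house", "das Haus"),
      ("the house", "dem Haus"), ("a car", "ein Auto")
    ))

    val saltBuckets = 8 // assumption: fixed salting factor

    // 1) Count each (src, tgt) pair under a salted key, then strip the salt
    //    and combine the partial counts. A hot source phrase is split over
    //    `saltBuckets` reducers instead of landing on a single one.
    val pairCounts = phrasePairs
      .map { case (src, tgt) => ((src, tgt, Random.nextInt(saltBuckets)), 1L) }
      .reduceByKey(_ + _)
      .map { case ((src, tgt, _), c) => ((src, tgt), c) }
      .reduceByKey(_ + _)

    // 2) Marginal count per source phrase, derived from the already
    //    aggregated pair counts (cheap, since pairs are now unique).
    val srcCounts = pairCounts
      .map { case ((src, _), c) => (src, c) }
      .reduceByKey(_ + _)

    // 3) Join pair counts with marginals and emit p(tgt | src).
    val probabilities = pairCounts
      .map { case ((src, tgt), c) => (src, (tgt, c)) }
      .join(srcCounts)
      .map { case (src, ((tgt, c), total)) => (src, tgt, c.toDouble / total) }

    probabilities.collect().foreach(println)
    spark.stop()
  }
}
```

The same two-stage (salted, then unsalted) aggregation pattern applies to N-gram counting in language model training, where a handful of extremely frequent prefixes would otherwise dominate a single partition.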