IEEE International Conference on Parallel and Distributed Systems

Seal: Efficient Training Large Scale Statistical Machine Translation Models on Spark

Abstract

Statistical machine translation (SMT) is an important research branch of natural language processing (NLP). As in many other NLP applications, larger training data sets can potentially yield higher translation accuracy for SMT models. However, traditional single-node SMT model training systems can hardly cope with the fast-growing volume of training corpora in the big data era, which creates an urgent need for efficient large scale machine translation model training systems. In this paper, we propose Seal, an efficient, scalable, and end-to-end offline SMT model training toolkit built on Apache Spark, a widely used distributed data-parallel platform. Seal parallelizes the training of all three key SMT models: the word alignment model, the translation model, and the N-gram language model. To further improve training performance in Seal, we also propose a number of system optimization methods. In word alignment model training, tuning the block size greatly reduces the I/O and communication overhead. In translation model training, compactly encoding the training corpus significantly reduces the amount of data transferred over the network, thus improving overall training efficiency. We also optimize the maximum likelihood estimation (MLE) algorithm to address the data skew issue in the join operation used in both translation model training and language model training. The experiment results show that Seal outperforms the well-known SMT training system Chaski with about a 5× speedup for word alignment model training. For syntactic translation model and language model training, Seal outperforms existing cutting-edge tools with about 9~18× and 8~9× speedups on average, respectively. Overall, Seal outperforms the existing distributed system with a 4~6× speedup and single-node systems with a 9~60× speedup on average. In addition, Seal achieves near-linear data and node scalability.
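The abstract only summarizes the skew optimization applied to the MLE join. As an illustration of the general technique, below is a minimal Spark (Scala) sketch of a skew-aware relative-frequency MLE step for phrase translation probabilities, using key salting so that very frequent source phrases do not overload a single reducer. This is not Seal's actual implementation: the sample data, the salting factor, and all identifiers are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession

import scala.util.Random

// Minimal sketch (not Seal's code): relative-frequency MLE for phrase
// translation probabilities p(tgt | src) = count(src, tgt) / count(src),
// with key salting to spread hot source phrases across partitions and
// mitigate data skew in the aggregation/join.
object SkewAwareMleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("skew-aware-mle-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical extracted phrase pairs: (sourcePhrase, targetPhrase)
    val phrasePairs = sc.parallelize(Seq(
      ("the house", "das Haus"), ("the house", "das Haus"),
      ("the house", "dem Haus"), ("a car", "ein Auto")
    ))

    val saltBuckets = 8 // assumption: fixed salting factor

    // 1) Count each (src, tgt) pair under a salted key, then strip the salt
    //    and combine the partial counts. A hot source phrase is split over
    //    `saltBuckets` reducers instead of landing on a single one.
    val pairCounts = phrasePairs
      .map { case (src, tgt) => ((src, tgt, Random.nextInt(saltBuckets)), 1L) }
      .reduceByKey(_ + _)
      .map { case ((src, tgt, _), c) => ((src, tgt), c) }
      .reduceByKey(_ + _)

    // 2) Marginal count per source phrase, derived from the already
    //    aggregated pair counts (cheap, since pairs are now unique).
    val srcCounts = pairCounts
      .map { case ((src, _), c) => (src, c) }
      .reduceByKey(_ + _)

    // 3) Join pair counts with marginals and emit p(tgt | src).
    val probabilities = pairCounts
      .map { case ((src, tgt), c) => (src, (tgt, c)) }
      .join(srcCounts)
      .map { case (src, ((tgt, c), total)) => (src, tgt, c.toDouble / total) }

    probabilities.collect().foreach(println)
    spark.stop()
  }
}
```

The same two-stage (salted, then unsalted) aggregation pattern applies to N-gram counting in language model training, where a handful of extremely frequent prefixes would otherwise dominate a single partition.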