IEEE Transactions on Parallel and Distributed Systems

Why Dataset Properties Bound the Scalability of Parallel Machine Learning Training Algorithms

Abstract

As the training dataset size and the model size of machine learning increase rapidly, more computing resources are consumed to speed up the training process. However, the scalability and performance reproducibility of parallel machine learning training, which mainly uses stochastic optimization algorithms, are limited. In this paper, we demonstrate that sample differences within the dataset play a prominent role in the scalability of parallel machine learning algorithms. We propose to use statistical properties of the dataset to measure sample differences. These properties include the variance of sample features, sample sparsity, sample diversity, and similarity in sampling sequences. We choose four types of parallel training algorithms as our research objects: (1) the asynchronous parallel SGD algorithm (Hogwild! algorithm), (2) the parallel model average SGD algorithm (minibatch SGD algorithm), (3) the decentralized optimization algorithm, and (4) the dual coordinate optimization algorithm (DADM algorithm). Our results show that the statistical properties of the training datasets determine the scalability upper bound of these parallel training algorithms.
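
The abstract names four dataset properties (variance of sample features, sample sparsity, sample diversity, and similarity in sampling sequences) without giving their formulas here. The sketch below shows one plausible way to estimate such properties with NumPy; the concrete definitions used (mean per-feature variance, fraction of zero entries, mean pairwise cosine distance, and cosine similarity between consecutive random minibatch means) and all function names are illustrative assumptions, not the paper's own measures.

```python
import numpy as np

def feature_variance(X):
    """Mean per-feature variance across the dataset (assumed definition)."""
    return float(X.var(axis=0).mean())

def sample_sparsity(X, tol=0.0):
    """Fraction of entries whose magnitude is at most tol (assumed definition)."""
    return float(np.mean(np.abs(X) <= tol))

def sample_diversity(X, n_pairs=1000, seed=0):
    """Mean cosine distance over randomly drawn sample pairs,
    used here as a proxy for how different samples are from one another."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), n_pairs)
    j = rng.integers(0, len(X), n_pairs)
    a, b = X[i], X[j]
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    return float(np.mean(1.0 - num / den))

def sampling_sequence_similarity(X, batch_size=32, n_batches=100, seed=0):
    """Mean cosine similarity between mean vectors of consecutive random
    minibatches, used here as a proxy for similarity in sampling sequences."""
    rng = np.random.default_rng(seed)
    means = np.stack([
        X[rng.choice(len(X), size=batch_size, replace=False)].mean(axis=0)
        for _ in range(n_batches)
    ])
    a, b = means[:-1], means[1:]
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    return float(np.mean(num / den))

if __name__ == "__main__":
    # Synthetic dense data; a sparse, low-diversity dataset would score differently.
    X = np.random.default_rng(0).standard_normal((5000, 128))
    print("feature variance:            ", feature_variance(X))
    print("sample sparsity:             ", sample_sparsity(X))
    print("sample diversity:            ", sample_diversity(X))
    print("sampling sequence similarity:", sampling_sequence_similarity(X))
```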