IEEE International Conference on Parallel and Distributed Systems

A Quick Survey on Large Scale Distributed Deep Learning Systems

Abstract

Deep learning has been widely used in various fields and plays a major role in them. As it gradually penetrates these fields, the data quantity of each application is increasing tremendously, and so are the computational complexity and the number of model parameters. As an obvious result, training and inference are time consuming. For example, training a classic ResNet-50 classification model on the ImageNet data set takes 14 days on an NVIDIA M40 GPU. Thus, distributed acceleration is a very useful way to dispatch the computation of training, and even inference, to many nodes in parallel and accelerate the whole process. Facebook's work and UC Berkeley's acceleration can train the ResNet-50 model within an hour and within minutes, respectively, through distributed deep learning algorithms and systems. Like other distributed accelerations, this makes it possible to shorten the training of large models on large data sets from weeks to minutes, which gives researchers and developers more room to explore and search. However, besides acceleration, what other issues will a distributed deep learning system confront? Where is the upper limit of acceleration? What applications will acceleration be used for? What is the price and cost of acceleration? In this paper, we take a simple and quick survey of distributed deep learning systems from the algorithm perspective, the distributed system perspective, and the applications perspective. We present several recent excellent works and analyze the restrictions and prospects of distributed methods.
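To make the data-parallel acceleration described in the abstract concrete, below is a minimal sketch, not taken from any of the surveyed systems, of synchronous data-parallel SGD. It uses a toy linear-regression model and simulated workers in NumPy; the worker count, learning rate, and problem sizes are illustrative assumptions. Each simulated worker computes a gradient on its shard of the global batch, and the gradients are averaged (as an all-reduce would do) before a single shared model update, which is the basic pattern behind the Facebook and UC Berkeley ResNet-50 scaling results mentioned above.

```python
# Sketch of synchronous data-parallel SGD with simulated workers (NumPy).
# Assumptions for illustration: a toy linear-regression task, 4 workers,
# and a fixed learning rate; real systems shard data across machines and
# aggregate gradients with an all-reduce or a parameter server.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: y = X @ w_true + noise
n_samples, n_features, n_workers = 512, 16, 4
X = rng.normal(size=(n_samples, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true + 0.01 * rng.normal(size=n_samples)

w = np.zeros(n_features)   # model replica shared by every worker
lr = 0.1

def local_gradient(X_shard, y_shard, w):
    """Mean-squared-error gradient computed on one worker's data shard."""
    residual = X_shard @ w - y_shard
    return 2.0 * X_shard.T @ residual / len(y_shard)

for step in range(100):
    # Data parallelism: split the global batch across the workers.
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [local_gradient(Xs, ys, w) for Xs, ys in shards]
    # Synchronous aggregation: average the per-worker gradients,
    # then apply one update to the shared parameters.
    g = np.mean(grads, axis=0)
    w -= lr * g

print("parameter error:", np.linalg.norm(w - w_true))
```

Because the averaged gradient equals the gradient over the full global batch, adding workers leaves the update unchanged while dividing the per-worker compute, which is the source of the speedups, and also of the communication and large-batch issues the survey discusses.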
