IEEE Transactions on Parallel and Distributed Systems

Horus: Interference-Aware and Prediction-Based Scheduling in Deep Learning Systems

Abstract

To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of-the-art resource managers are needed to increase GPU utilization and maximize throughput. While co-locating DL jobs on the same GPU has been shown to be effective, this can incur interference causing slowdown. In this article, we propose Horus: an interference-aware and prediction-based resource manager for DL systems. Horus proactively predicts GPU utilization of heterogeneous DL jobs extrapolated from the DL model's computation graph features, removing the need for online profiling and isolated reserved GPUs. Through micro-benchmarks and job co-location combinations across heterogeneous GPU hardware, we identify GPU utilization as a general proxy metric to determine good placement decisions, in contrast to current approaches which reserve isolated GPUs to perform online profiling and directly measure GPU utilization for each unique submitted job. Our approach promotes high resource utilization and makespan reduction; via real-world experimentation and large-scale trace-driven simulation, we demonstrate that Horus outperforms other DL resource managers by up to 61.5 percent for GPU resource utilization, 23.7-30.7 percent for makespan reduction, and 68.3 percent for job wait time reduction.
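
To make the scheduling idea concrete, the sketch below illustrates one way a prediction-based, interference-aware placer could work: each job's GPU utilization is estimated offline from computation-graph features, and a job is co-located on a GPU only while the combined predicted utilization stays under an interference threshold. This is a minimal illustration under stated assumptions, not the paper's actual implementation; the predictor, the feature names (gflops_per_step, params_millions, batch_size), the linear weights, and the best-fit heuristic are all hypothetical.

```python
# Sketch of interference-aware, prediction-based placement (illustrative only).
# Utilization is predicted from computation-graph features; co-location is
# allowed only if the summed prediction stays below a threshold.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Job:
    name: str
    graph_features: Dict[str, float]  # e.g. FLOPs per step, parameter count, batch size


@dataclass
class GPU:
    gpu_id: int
    jobs: List[str] = field(default_factory=list)
    predicted_util: float = 0.0  # sum of predicted utilizations of placed jobs


def predict_utilization(features: Dict[str, float]) -> float:
    """Toy stand-in for the learned utilization predictor.

    Horus extrapolates GPU utilization from computation-graph features;
    here a hypothetical linear model is used purely for illustration.
    """
    weights = {"gflops_per_step": 0.004, "params_millions": 0.001, "batch_size": 0.002}
    util = sum(weights.get(k, 0.0) * v for k, v in features.items())
    return min(util, 1.0)


def place(job: Job, gpus: List[GPU], threshold: float = 0.9) -> Optional[GPU]:
    """Best-fit placement: pick the GPU that ends up most utilized while
    remaining under the interference threshold; otherwise queue the job."""
    util = predict_utilization(job.graph_features)
    candidates = [g for g in gpus if g.predicted_util + util <= threshold]
    if not candidates:
        return None  # queue instead of forcing an interfering co-location
    best = max(candidates, key=lambda g: g.predicted_util)
    best.jobs.append(job.name)
    best.predicted_util += util
    return best


if __name__ == "__main__":
    cluster = [GPU(0), GPU(1)]
    jobs = [
        Job("resnet50", {"gflops_per_step": 80, "params_millions": 25, "batch_size": 64}),
        Job("bert-base", {"gflops_per_step": 120, "params_millions": 110, "batch_size": 32}),
    ]
    for j in jobs:
        g = place(j, cluster)
        print(j.name, "->", f"GPU {g.gpu_id}" if g else "queued")
```

In this toy run the second job is diverted to the empty GPU because co-locating it with the first would push the predicted combined utilization past the threshold, which mirrors the abstract's point that utilization serves as a proxy for interference-safe placement.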
