IEEE Transactions on Parallel and Distributed Systems

Horus: Interference-Aware and Prediction-Based Scheduling in Deep Learning Systems

Abstract

To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of-the-art resource managers are needed to increase GPU utilization and maximize throughput. While co-locating DL jobs on the same GPU has been shown to be effective, this can incur interference causing slowdown. In this article, we propose Horus: an interference-aware and prediction-based resource manager for DL systems. Horus proactively predicts GPU utilization of heterogeneous DL jobs extrapolated from the DL model's computation graph features, removing the need for online profiling and isolated reserved GPUs. Through micro-benchmarks and job co-location combinations across heterogeneous GPU hardware, we identify GPU utilization as a general proxy metric to determine good placement decisions, in contrast to current approaches which reserve isolated GPUs to perform online profiling and directly measure GPU utilization for each unique submitted job. Our approach promotes high resource utilization and makespan reduction; via real-world experimentation and large-scale trace-driven simulation, we demonstrate that Horus outperforms other DL resource managers by up to 61.5 percent for GPU resource utilization, 23.7-30.7 percent for makespan reduction, and 68.3 percent for job wait time reduction.
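
To make the scheduling idea concrete, the sketch below illustrates one way a prediction-based, interference-aware placer could work: each job's GPU utilization is estimated offline from computation-graph features, and a job is co-located on a GPU only while the combined predicted utilization stays under an interference threshold. This is a minimal illustration under stated assumptions, not the paper's actual implementation; the predictor, the feature names (gflops_per_step, params_millions, batch_size), the linear weights, and the best-fit heuristic are all hypothetical.

```python
# Sketch of interference-aware, prediction-based placement (illustrative only).
# Utilization is predicted from computation-graph features; co-location is
# allowed only if the summed prediction stays below a threshold.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Job:
    name: str
    graph_features: Dict[str, float]  # e.g. FLOPs per step, parameter count, batch size


@dataclass
class GPU:
    gpu_id: int
    jobs: List[str] = field(default_factory=list)
    predicted_util: float = 0.0  # sum of predicted utilizations of placed jobs


def predict_utilization(features: Dict[str, float]) -> float:
    """Toy stand-in for the learned utilization predictor.

    Horus extrapolates GPU utilization from computation-graph features;
    here a hypothetical linear model is used purely for illustration.
    """
    weights = {"gflops_per_step": 0.004, "params_millions": 0.001, "batch_size": 0.002}
    util = sum(weights.get(k, 0.0) * v for k, v in features.items())
    return min(util, 1.0)


def place(job: Job, gpus: List[GPU], threshold: float = 0.9) -> Optional[GPU]:
    """Best-fit placement: pick the GPU that ends up most utilized while
    remaining under the interference threshold; otherwise queue the job."""
    util = predict_utilization(job.graph_features)
    candidates = [g for g in gpus if g.predicted_util + util <= threshold]
    if not candidates:
        return None  # queue instead of forcing an interfering co-location
    best = max(candidates, key=lambda g: g.predicted_util)
    best.jobs.append(job.name)
    best.predicted_util += util
    return best


if __name__ == "__main__":
    cluster = [GPU(0), GPU(1)]
    jobs = [
        Job("resnet50", {"gflops_per_step": 80, "params_millions": 25, "batch_size": 64}),
        Job("bert-base", {"gflops_per_step": 120, "params_millions": 110, "batch_size": 32}),
    ]
    for j in jobs:
        g = place(j, cluster)
        print(j.name, "->", f"GPU {g.gpu_id}" if g else "queued")
```

In this toy run the second job is diverted to the empty GPU because co-locating it with the first would push the predicted combined utilization past the threshold, which mirrors the abstract's point that utilization serves as a proxy for interference-safe placement.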
