Journal of Supercomputing

A hybrid scheduling platform: a runtime prediction reliability aware scheduling platform to improve HPC scheduling performance

Abstract

The performance of scheduling algorithms for HPC jobs depends heavily on the accuracy of job runtime values. Prior research has established that neither user-provided runtimes nor system-generated runtime predictions are accurate. We propose a new scheduling platform that performs well despite runtime uncertainty. Our platform builds on the key observation that two important classes of scheduling strategies, backfilling and plan-based scheduling, differ in their sensitivity to runtime accuracy. We first confirm this observation with trace-based simulations that characterize the sensitivity of different scheduling strategies to job runtime accuracy. We then apply gradient boosting tree regression as a meta-learning approach to estimate the reliability of the system-generated job runtimes. The estimated prediction reliability of job runtimes is then used to choose a specific class of scheduling algorithm. Our hybrid scheduling platform uses a plan-based scheduling strategy for jobs with high expected runtime accuracy and backfills the remaining jobs on top of the planned jobs. While resource sharing is used to minimize resource fragmentation, a specific ratio of CPU cores is reserved for backfilling less predictable jobs so that these jobs do not starve. This ratio is adapted dynamically based on the share of resource requirements that predictable jobs account for among recently submitted jobs. We perform extensive trace-driven simulations on real-world production traces to show that our hybrid scheduling platform outperforms both pure backfilling and pure plan-based scheduling algorithms.
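The routing and reservation logic described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Job` and `HybridPartitioner` names, the reliability threshold of 0.8, the sliding window of 100 submissions, and the rule that the backfill reserve equals one minus the predictable share of recently requested cores are all assumptions made for illustration. The `reliability` field stands in for the output of the meta-learner (gradient boosting tree regression in the paper), which is not reproduced here.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    cores: int                # CPU cores requested
    predicted_runtime: float  # system-generated runtime prediction, in seconds
    reliability: float        # hypothetical meta-learner score in [0, 1]


class HybridPartitioner:
    """Routes jobs to a planned or a backfilled class and adapts the share
    of cores reserved for backfilling (illustrative assumptions only)."""

    def __init__(self, total_cores, threshold=0.8, window=100):
        self.total_cores = total_cores
        self.threshold = threshold          # assumed reliability cut-off
        self.recent = deque(maxlen=window)  # (cores, predictable) per recent job

    def is_predictable(self, job):
        # A job counts as predictable when its runtime prediction is
        # estimated to be reliable enough.
        return job.reliability >= self.threshold

    def submit(self, job):
        """Record the job and return the scheduling class it is routed to."""
        predictable = self.is_predictable(job)
        self.recent.append((job.cores, predictable))
        return "plan" if predictable else "backfill"

    def backfill_ratio(self):
        """Fraction of cores reserved for less predictable jobs, assumed to be
        their share of the cores requested by recently submitted jobs."""
        total = sum(cores for cores, _ in self.recent)
        if total == 0:
            return 0.0
        predictable = sum(cores for cores, flag in self.recent if flag)
        return 1.0 - predictable / total

    def reserved_backfill_cores(self):
        # Round down so the reservation never exceeds the computed share.
        return int(self.total_cores * self.backfill_ratio())
```

For example, after submitting a 300-core job with reliability 0.95 and a 100-core job with reliability 0.40 to a 1000-core system, the first is routed to the plan and the second to backfilling, and a quarter of the cores are reserved for backfilling under this rule.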