International Journal of High Performance Computing Applications

A generic approach to scheduling and checkpointing workflows



Abstract

This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which, unlike fail-stop errors, corrupt the task being executed by a processor but do not cause the entire memory of that processor to be lost. We revisit classical mapping heuristics such as Heterogeneous Earliest Finish Time (HEFT) and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is overkill when failures are rare, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Unlike previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as minimal series-parallel graphs. Extensive experiments report significant gains over both CkptAll and CkptNone for a wide variety of workflows.
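The trade-off between CkptAll and CkptNone can be illustrated with a small sketch. It is not the model or the code used in the paper. It assumes a single processor running a linear chain of tasks, exponentially distributed fail-stop failures, and the standard closed-form expected completion time for a work segment followed by a checkpoint. All function names and parameter values (`expected_segment_time`, `ckpt_none`, `ckpt_all`, task lengths, checkpoint and recovery costs) are hypothetical and chosen only for illustration.

```python
import math


def expected_segment_time(work, ckpt, recovery, rate, downtime=0.0):
    """Expected time to run `work` seconds and then checkpoint for `ckpt` seconds.

    Assumes exponential fail-stop failures with the given rate (per second).
    After each failure the processor pays `downtime`, then `recovery`,
    and restarts the segment from its beginning.
    """
    return (
        math.exp(rate * recovery)
        * (1.0 / rate + downtime)
        * math.expm1(rate * (work + ckpt))
    )


def ckpt_none(task_times, rate, downtime=0.0):
    """No checkpoints: the whole chain is one segment, restarted from scratch."""
    return expected_segment_time(sum(task_times), 0.0, 0.0, rate, downtime)


def ckpt_all(task_times, ckpt, recovery, rate, downtime=0.0):
    """Checkpoint after every task: each task is its own segment.

    Simplification (assumption): the recovery cost is charged for every
    task, including the first one.
    """
    return sum(
        expected_segment_time(w, ckpt, recovery, rate, downtime)
        for w in task_times
    )


if __name__ == "__main__":
    tasks = [100.0] * 20  # hypothetical chain: 20 tasks of 100 s each
    for rate in (1e-6, 1e-3):  # rare vs. frequent failures
        none = ckpt_none(tasks, rate)
        every = ckpt_all(tasks, ckpt=10.0, recovery=10.0, rate=rate)
        print(f"rate={rate:g}: CkptNone={none:.1f}s  CkptAll={every:.1f}s")
```

Under these assumptions, CkptNone has the lower expected time when failures are rare, while CkptAll wins once failures become frequent. This is the gap that intermediate checkpointing strategies aim to close.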
