A generic approach to scheduling and checkpointing workflows

Han Li; Le Fevre Valentin; Canon Louis-Claude; Robert Yves; Vivien Frederic

Abstract

This work deals with scheduling and checkpointing strategies for executing scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but, unlike fail-stop errors, do not cause the entire memory of that processor to be lost. We revisit classical mapping heuristics such as Heterogeneous Earliest Finish Time and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is overkill when failures are rare, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Unlike previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as minimal series-parallel graphs. Extensive experiments report significant gains over both CkptAll and CkptNone for a wide variety of workflows.
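
The trade-off between CkptAll and CkptNone can be illustrated with a small numerical sketch. It is not the method of the paper: it assumes a simplified setting of a linear chain of tasks on a single processor, exponentially distributed fail-stop failures with rate `lam`, and the classical closed-form expected time (1/lam + D) * exp(lam * R) * (exp(lam * (W + C)) - 1) to complete work W followed by a checkpoint C, with recovery time R and downtime D. The function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def expected_time(work, ckpt, recovery, downtime, lam):
    """Expected time to run `work` seconds followed by a checkpoint of `ckpt`
    seconds, under exponential fail-stop failures of rate `lam` (assumed model).
    After each failure we pay `downtime`, then `recovery`, then retry."""
    return (1.0 / lam + downtime) * math.exp(lam * recovery) * math.expm1(lam * (work + ckpt))


def ckpt_all(tasks, ckpt, recovery, downtime, lam):
    """Chain of tasks, checkpointing after every task: a failure only forces
    re-execution of the current task (simplification: every task, including
    the first, pays the same recovery cost)."""
    return sum(expected_time(w, ckpt, recovery, downtime, lam) for w in tasks)


def ckpt_none(tasks, downtime, lam):
    """Chain of tasks with no checkpoint: a failure restarts the whole chain
    from scratch, so the chain behaves like one big task with no checkpoint
    and no recovery cost."""
    return expected_time(sum(tasks), 0.0, 0.0, downtime, lam)


if __name__ == "__main__":
    tasks = [100.0] * 10  # hypothetical task durations, in seconds
    for lam in (1e-6, 1e-4, 1e-2):  # hypothetical failure rates, per second
        all_t = ckpt_all(tasks, ckpt=10.0, recovery=10.0, downtime=0.0, lam=lam)
        none_t = ckpt_none(tasks, downtime=0.0, lam=lam)
        print(f"lam={lam:g}  CkptAll={all_t:.1f}  CkptNone={none_t:.1f}")
```

Under these assumptions, CkptNone wins when failures are very rare, because the checkpoint overhead is pure waste, while CkptAll wins by a wide margin when failures are frequent, because the cost of restarting the whole chain grows exponentially with its length. Strategies that checkpoint only some tasks aim to sit between these two extremes.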

Bibliographic record

Source
International Journal of High Performance Computing Applications, 2019, issue 6, pp. 1255-1274 (20 pages)
Authors
Han Li; Le Fevre Valentin; Canon Louis-Claude; Robert Yves; Vivien Frederic
Author affiliations

East China Normal Univ Shanghai Peoples R China|Univ Claude Bernard Lyon Univ Lyon CNRS ENS Lyon Inria Lyon France;

Univ Claude Bernard Lyon Univ Lyon CNRS ENS Lyon Inria Lyon France;

Univ Bourgogne Franche Comte CNRS FEMTO ST Inst Besancon France;

Univ Claude Bernard Lyon Univ Lyon CNRS ENS Lyon Inria Lyon France|Univ Tennessee Knoxville TN USA;

Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Keywords
Workflow; checkpoint; fail-stop error; resilience;

Similar literature

1. A generic approach to scheduling and checkpointing workflows [J]. Han Li, Le Fevre Valentin, Canon Louis-Claude. International Journal of High Performance Computing Applications, 2019, issue 6.
2. FAMOBACH: A fast and survivable workflow scheduling approach based MOHEFT using backtracking and checkpointing [J]. Bouzidi Mohammed Redha, Daoudi Mourad, Ziani Benameur. Computer Communications, 2021, issue Apra.
3. Checkpointing Strategies for Scheduling Computational Workflows [J]. Guillaume Aupy, Anne Benoit, Henri Casanova. International Journal of Networking and Computing, 2016, issue 1.
4. On the complexity of scheduling checkpoints for computational workflows [C]. Robert Yves, Vivien Frederic, Zaidouni Dounia. 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops, 2012.
5. Redesigning an Outpatient Pharmacy Workflow Using Generic Simulation Modelling to Maximize a Renovation Opportunity [D]. Izumi, Janet C. 2015.
6. A workflow-driven approach to integrate generic software modules in a Trusted Third Party [O]. Martin Bialke, Peter Penndorf, Tim Wegner. 2015.
7. An efficient fault tolerant workflow scheduling approach using replication heuristics and checkpointing in the cloud [O]. Amrith Rajagopal Setlur, S. Jaya Nirmala, Har Simrat Singh. 2020.
