Journal: International Journal of High Performance Computing Applications

THE LAM/MPI CHECKPOINT/RESTART FRAMEWORK: SYSTEM-INITIATED CHECKPOINTING

Abstract

As high performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI.
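The idea of coordinated checkpointing and rollback recovery can be illustrated with a small toy simulation. This is only a sketch under stated assumptions: it runs in a single Python process, it does not use LAM/MPI or BLCR, and every name in it (`Rank`, `coordinated_checkpoint`, `rollback`) is hypothetical rather than part of either system. It shows the general pattern of draining in-flight messages so that all processes are saved at a consistent point, then restoring all of them together after a failure.

```python
import copy
from collections import deque


class Rank:
    """Toy stand-in for one parallel process: local state plus an inbox."""

    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.total = 0          # application state (hypothetical)
        self.inbox = deque()    # messages sent to this rank, not yet received

    def send(self, other, value):
        other.inbox.append(value)

    def receive_all(self):
        while self.inbox:
            self.total += self.inbox.popleft()


def coordinated_checkpoint(ranks):
    """Drain every channel, then snapshot all ranks at the same logical point."""
    for r in ranks:
        r.receive_all()  # no message may remain in flight when state is saved
    return [copy.deepcopy(r.__dict__) for r in ranks]


def rollback(ranks, snapshot):
    """Restore every rank to the state captured by the checkpoint."""
    for r, saved in zip(ranks, snapshot):
        r.__dict__.update(copy.deepcopy(saved))


ranks = [Rank(i) for i in range(3)]
ranks[0].send(ranks[1], 5)
ranks[1].send(ranks[2], 7)
snapshot = coordinated_checkpoint(ranks)

ranks[2].send(ranks[0], 100)   # work performed after the checkpoint
ranks[0].receive_all()
rollback(ranks, snapshot)      # simulated failure: all ranks roll back together
totals = [r.total for r in ranks]
```

Because the snapshot is taken only after every inbox is empty, no message is lost or duplicated on restart, and the work done after the checkpoint is discarded on all ranks at once.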
