2012 SC Companion: High Performance Computing, Networking, Storage and Analysis

Poster: Programming Model Extensions for Resilience in Extreme Scale Computing

Abstract

System resilience is a key challenge in building extreme-scale systems. A large number of HPC applications are inherently resilient, but application programmers lack mechanisms to convey their fault tolerance knowledge to the system. We present a cross-layer approach to resilience: we propose a set of programming model extensions and develop a runtime inference framework that reasons about the context and significance of faults, as they occur, relative to the application programmer's fault tolerance expectations. Using a set of accelerated fault injection experiments on real scientific and engineering codes, we demonstrate the validity of our approach. Our experiments show that a cross-layer approach, which explicitly engages the programmer in expressing fault tolerance knowledge that is then leveraged across the layers of system abstraction, can significantly improve the dependability of long-running HPC applications.
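To make the cross-layer idea concrete, here is a minimal sketch of how programmer-supplied fault tolerance annotations might be consulted by a runtime when a fault is reported. It is not the poster's implementation: the names `Tolerance`, `Action`, `Region`, `ResilienceRuntime`, `annotate`, and `on_fault`, the three tolerance levels, and the address-range lookup are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Tolerance(Enum):
    """Hypothetical annotation levels a programmer could attach to data."""
    TOLERANT = "tolerant"        # errors here are expected to be absorbed
    RECOVERABLE = "recoverable"  # errors can be repaired, e.g. by recomputation
    CRITICAL = "critical"        # errors invalidate the run


class Action(Enum):
    """Hypothetical responses the runtime could choose from."""
    IGNORE = "ignore"
    RECOVER = "recover"
    ABORT = "abort"


# Assumed mapping from the programmer's expectation to a runtime response.
_POLICY = {
    Tolerance.TOLERANT: Action.IGNORE,
    Tolerance.RECOVERABLE: Action.RECOVER,
    Tolerance.CRITICAL: Action.ABORT,
}


@dataclass(frozen=True)
class Region:
    """An annotated address range [start, start + size)."""
    name: str
    start: int
    size: int
    tolerance: Tolerance

    def contains(self, address: int) -> bool:
        return self.start <= address < self.start + self.size


class ResilienceRuntime:
    """Toy stand-in for a runtime that reasons about fault context."""

    def __init__(self) -> None:
        self._regions: list[Region] = []

    def annotate(self, name: str, start: int, size: int,
                 tolerance: Tolerance) -> Region:
        """Record the programmer's expectation for one address range."""
        region = Region(name, start, size, tolerance)
        self._regions.append(region)
        return region

    def on_fault(self, address: int) -> Action:
        """Pick a response for a fault reported at `address`.

        The first matching region wins; faults in unannotated memory
        are treated conservatively.
        """
        for region in self._regions:
            if region.contains(address):
                return _POLICY[region.tolerance]
        return Action.ABORT


if __name__ == "__main__":
    runtime = ResilienceRuntime()
    runtime.annotate("residual_vector", 0x1000, 4096, Tolerance.TOLERANT)
    runtime.annotate("mesh_coordinates", 0x3000, 4096, Tolerance.RECOVERABLE)
    runtime.annotate("loop_counters", 0x5000, 64, Tolerance.CRITICAL)
    for addr in (0x1010, 0x3100, 0x5008, 0x9000):
        print(hex(addr), runtime.on_fault(addr).value)
```

In this sketch the annotation step stands in for the programming model extensions, and `on_fault` stands in for the runtime inference step. A real framework would presumably draw on much richer context than an address lookup.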