PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems

Abstract

For high-performance computing (HPC) system designers and users, meeting the myriad challenges of next-generation exascale supercomputing systems requires rethinking their approach to application and system software design. Among these challenges, providing resiliency and stability to scientific applications in the presence of high fault rates requires new approaches to software architecture and design. As HPC systems become increasingly complex, they require intricate solutions for detecting and mitigating the various modes of faults and errors that occur in these large-scale systems, as well as solutions for failure recovery. These resiliency solutions often interact with and affect other system properties, including application scalability, power and energy efficiency. Therefore, resilience solutions for HPC systems must be thoughtfully engineered and deployed. In previous work, we developed the concept of resilience design patterns, which consist of templated solutions based on well-established techniques for detection, mitigation and recovery. In this paper, we use these patterns as the foundation to propose new approaches to designing runtime systems for HPC systems. The instantiation of these patterns within a runtime system enables flexible and adaptable end-to-end resiliency solutions for HPC environments. The paper describes the architecture of the runtime system, named PLEXUS, and the strategies for dynamically composing and adapting pattern instances under runtime control. This runtime-based approach enables actively balancing the cost-benefit trade-off between performance overhead and protection coverage of the resilience solutions. Based on a prototype implementation of PLEXUS, we demonstrate the resiliency and performance gains achieved by the pattern-based runtime system for a parallel linear solver application.

Bibliographic details

Source
Pacific Rim International Symposium on Dependable Computing | 2020 | pp. 31-39 | 9 pages
Authors
Saurabh Hukerikar; Christian Engelmann
Original format
PDF
Keywords
Resilience; Runtime; Libraries; Computer architecture; Systems architecture; Government; Scalability

Similar documents

1. Opportunities for Nonvolatile Memory Systems in Extreme-Scale High-Performance Computing [J]. Vetter Jeffrey S., Mittal Sparsh. Computing in Science & Engineering. 2015, No. 2
2. Exploring the Design Tradeoffs for Extreme-Scale High-Performance Computing System Software [J]. K. Wang, A. Kulkarni, M. Lang, IEEE Transactions on Parallel and Distributed Systems. 2016, No. 4
3. ARTful: A model for user-defined schedulers targeting multiple high-performance computing runtime systems [J]. Santana Alexandre, Freitas Vinicius, Castro Marcio, Software, Practice & Experience. 2021, No. 7
4. Investigating Operating System Noise in Extreme-Scale High-Performance Computing Systems Using Simulation [C]. Christian Engelmann. Proceedings of the IASTED Multiconferences. 2013
5. High-performance computer system architectures for embedded computing [D]. Lee, Dongwon. 2011
6. FPGA-Based High-Performance Embedded Systems for Adaptive Edge Computing in Cyber-Physical Systems: The ARTICo3 Framework [O]. Alfonso Rodríguez, Juan Valverde, Jorge Portilla, 2018
7. Investigating Operating System Noise in Extreme-Scale High-Performance Computing Systems Using Simulation [O]. Christian Engelmann. 2015
