Exploring the Design Tradeoffs for Extreme-Scale High-Performance Computing System Software

K. Wang; A. Kulkarni; M. Lang; D. Arnold; I. Raicu

Abstract

Owing to the extreme parallelism and high component failure rates of tomorrow's exascale systems, high-performance computing (HPC) system software will need to be scalable, failure-resistant, and adaptive to sustain system operation and full system utilization. Much existing HPC system software is still designed around a centralized server paradigm and is therefore susceptible to scaling issues and single points of failure. In this article, we explore the design tradeoffs for scalable system software at extreme scales. We propose a general system software taxonomy by deconstructing common HPC system software into its basic components. The taxonomy helps us reason about system software in three ways: (1) it gives us a systematic way to architect scalable system software by decomposing it into its basic components; (2) it allows us to categorize system software based on the features of these components; and (3) it suggests the configuration space to consider for design evaluation via simulations or real implementations. Further, we evaluate different design choices for a representative piece of system software, a key-value store, through simulations of up to millions of nodes. Finally, we show evaluation results for two distributed system software packages, Slurm++ (a distributed HPC resource manager) and MATRIX (a distributed task execution framework), both developed based on insights from this work. We envision that the results in this article will help lay the foundations for developing next-generation HPC system software for extreme scales.

Bibliographic details

Source
IEEE Transactions on Parallel and Distributed Systems | 2016, Issue 4 | pp. 1070-1084 | 15 pages
Authors
K. Wang; A. Kulkarni; M. Lang; D. Arnold; I. Raicu
Author affiliation

K. Wang is with the Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616 (email: kwang22@hawk.iit.edu).

Original format: PDF
Language: English
Keywords
Distributed systems; High-performance computing; Key-value stores; Simulation; Systems and software

Similar literature

1. Opportunities for Nonvolatile Memory Systems in Extreme-Scale High-Performance Computing [J]. Vetter Jeffrey S., Mittal Sparsh. Computing in Science & Engineering. 2015, Issue 2
2. Big data and extreme-scale computing: Pathways to Convergence-Toward a shaping strategy for a future software and data ecosystem for scientific inquiry [J]. Asch M., Moore T., Badia R., Experimental Mechanics. 2018, Issue 4
3. Antisocial Computing: Exploring Design Risks in Social Computing Systems [J]. David W. McDonald, David H. Ackley, Randal Bryant, Interactions. 2014, Issue 6
4. PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems [C]. Saurabh Hukerikar, Christian Engelmann. Pacific Rim International Symposium on Dependable Computing. 2020
5. Rethinking the design and implementation of the I/O software stack for high-performance computing [D]. Zhang, Xuechen. 2012
6. ImageMiner: a software system for comparative analysis of tissue microarrays using content-based image retrieval, high-performance computing, and grid technology [O]. David J Foran, Lin Yang, Wenjin Chen, 2011
7. Investigating Operating System Noise in Extreme-Scale High-Performance Computing Systems Using Simulation [O]. Christian Engelmann. 2015
8. Integration of Tools for the Design and Assessment of High-Performance, Highly Reliable Computing Systems (DAHPHRS), Phase 1 [R]. Scheper, C., Baker, R., Frank, G., 1992
