Journal: IEEE Transactions on Parallel and Distributed Systems

Exploring the Design Tradeoffs for Extreme-Scale High-Performance Computing System Software

Abstract

Owing to the extreme parallelism and high component failure rates of tomorrow's exascale systems, high-performance computing (HPC) system software will need to be scalable, failure-resistant, and adaptive in order to sustain system operation and full system utilization. Much existing HPC system software is still designed around a centralized server paradigm and is therefore susceptible to scaling issues and single points of failure. In this article, we explore the design tradeoffs for scalable system software at extreme scales. We propose a general system software taxonomy by deconstructing common HPC system software into its basic components. The taxonomy helps us reason about system software in three ways: (1) it gives us a systematic way to architect scalable system software by decomposing it into its basic components; (2) it allows us to categorize system software based on the features of these components; and (3) it suggests the configuration space to consider when evaluating designs through simulations or real implementations. Further, we evaluate different design choices for a representative piece of system software, namely a key-value store, through simulations of up to millions of nodes. Finally, we show evaluation results for two distributed systems, Slurm++ (a distributed HPC resource manager) and MATRIX (a distributed task execution framework), both developed based on insights from this work. We envision that the results in this article will help lay the foundations for developing next-generation HPC system software for extreme scales.
