
Overhead of using spare nodes

Abstract

With the increasing fault rate on high-end supercomputers, fault tolerance has been gathering attention. To cope with this situation, various fault-tolerance techniques are under investigation; these include user-level, algorithm-based fault-tolerance techniques and parallel execution environments that enable jobs to continue after a node failure. Even with these techniques, some programs with static load balancing, such as stencil computations, may underperform after a failure recovery. Even when spare nodes are present, they are not always substituted for failed nodes in an effective way. This article considers three questions: how spare nodes should be allocated, how they should be substituted for faulty nodes, and how much the communication performance is affected by such a substitution. The third question arises because node substitutions modify the rank mapping, which can incur additional message collisions. In a stencil computation, rank mapping is done in a straightforward way on a Cartesian network without incurring any message collisions. However, once a substitution has occurred, the optimal node-rank mapping may be destroyed. Therefore, these questions must be answered in a way that minimizes the degradation of communication performance. In this article, several spare node allocation and failed node substitution methods will be proposed, analyzed, and compared in terms of communication performance following the substitution. The proposed substitution methods are named sliding methods. The sliding methods are analyzed using a simulation program developed by the authors and evaluated on the K computer, Blue Gene/Q (BG/Q), and TSUBAME 2.5. It will be shown that when failures occur, the stencil communication performance on the K computer and BG/Q can be slowed by a factor of around 10, depending on the number of node failures. The barrier performance on the K computer can be cut in half. On BG/Q, barrier performance can be slowed by a factor of 10. Further, it will also be shown that almost no such communication performance degradation can be seen on TSUBAME 2.5. This is because TSUBAME 2.5 has an InfiniBand network connected in a fat-tree topology, while the K computer and BG/Q have dedicated Cartesian networks. Thus, the communication performance degradation depends on network characteristics.
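
The effect of a substitution on the rank mapping can be illustrated with a small toy model. The sketch below is not the article's simulator and does not claim to reproduce its sliding methods, which the abstract does not define. It assumes a 2D mesh whose extra rightmost column holds spare nodes, a row-major rank mapping, a 5-point stencil, and Manhattan hop distance as a rough proxy for communication cost (link sharing and message collisions are not modeled). The function names `substitute_direct` and `substitute_shift_row` are hypothetical and exist only for this illustration.

```python
from itertools import product


def build_mapping(width, height):
    """Map rank -> (x, y), row-major, on a width x height compute region.

    Assumption: column x == width is reserved for spare nodes.
    """
    return {y * width + x: (x, y)
            for y, x in product(range(height), range(width))}


def stencil_pairs(width, height):
    """Yield each unordered pair of ranks that exchange data in a 5-point stencil."""
    for y, x in product(range(height), range(width)):
        rank = y * width + x
        if x + 1 < width:
            yield rank, rank + 1        # right neighbor
        if y + 1 < height:
            yield rank, rank + width    # lower neighbor


def hops(a, b):
    """Manhattan distance between two node coordinates on the mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def hop_stats(mapping, width, height):
    """Return (max hops, total hops) over all stencil neighbor pairs."""
    dists = [hops(mapping[a], mapping[b])
             for a, b in stencil_pairs(width, height)]
    return max(dists), sum(dists)


def substitute_direct(mapping, failed_rank, width):
    """Hypothetical: move only the failed rank to the spare node in its row."""
    new = dict(mapping)
    _, y = mapping[failed_rank]
    new[failed_rank] = (width, y)
    return new


def substitute_shift_row(mapping, failed_rank, width):
    """Hypothetical: shift the failed rank and all ranks to its right one node
    toward the spare column, so the failed node is simply skipped."""
    new = dict(mapping)
    fx, fy = mapping[failed_rank]
    for rank, (x, y) in mapping.items():
        if y == fy and x >= fx:
            new[rank] = (x + 1, y)
    return new


if __name__ == "__main__":
    W, H, FAILED = 4, 4, 5
    base = build_mapping(W, H)
    print("no failure:", hop_stats(base, W, H))
    print("direct    :", hop_stats(substitute_direct(base, FAILED, W), W, H))
    print("shift row :", hop_stats(substitute_shift_row(base, FAILED, W), W, H))
```

In this toy 4 x 4 case with rank 5 failed, every neighbor pair is one hop apart before the failure; the direct substitution stretches the longest neighbor exchange to 4 hops, while shifting the row keeps it at 2. The numbers say nothing about the real machines, but they show why the choice of substitution method matters on a Cartesian network.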
