Overhead of using spare nodes

Atsushi Hori; Kazumi Yoshinaga; Thomas Herault; Aurélien Bouteiller; George Bosilca; Yutaka Ishikawa

Abstract

With the increasing fault rate on high-end supercomputers, fault tolerance has been gathering attention. To cope with this situation, various fault-tolerance techniques are under investigation; these include user-level, algorithm-based fault-tolerance techniques and parallel execution environments that enable jobs to continue following a node failure. Even with these techniques, some programs with static load balancing, such as stencil computations, may underperform after a failure recovery. Even when spare nodes are present, they are not always substituted for failed nodes in an effective way. This article considers three questions: how spare nodes should be allocated, how they should be substituted for faulty nodes, and how much the communication performance is affected by such a substitution. The third question arises because node substitution modifies the rank mapping, which can incur additional message collisions. In a stencil computation, ranks can be mapped onto a Cartesian network in a straightforward way without incurring any message collisions. However, once a substitution has occurred, the optimal node-rank mapping may be destroyed. Therefore, these questions must be answered in a way that minimizes the degradation of communication performance. In this article, several spare node allocation and failed node substitution methods are proposed, analyzed, and compared in terms of communication performance following the substitution. The proposed substitution methods are named sliding methods. The sliding methods are analyzed with a simulation program developed by the authors and evaluated on the K computer, Blue Gene/Q (BG/Q), and TSUBAME 2.5. The results show that when failures occur, stencil communication on the K computer and BG/Q can be slowed by a factor of around 10, depending on the number of node failures. Barrier performance on the K computer can be cut in half, and on BG/Q it can be slowed by a factor of 10. In contrast, almost no such degradation is seen on TSUBAME 2.5. This is because TSUBAME 2.5 has an InfiniBand network connected in a fat-tree topology, while the K computer and BG/Q have dedicated Cartesian networks. Thus, the communication performance degradation depends on network characteristics.
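
The rank-mapping effect described in the abstract can be illustrated with a toy model. The sketch below is not the authors' simulator and does not reproduce their sliding methods. It assumes a hypothetical non-periodic 2D mesh with one spare column, a row-major rank mapping, and hop count (Manhattan distance) as a rough stand-in for communication cost. The function names are invented for illustration. It compares the worst stencil neighbor distance before a failure, after the failed rank is moved directly to the spare in its row, and after a hypothetical row shift toward the spare.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]  # (row, column) of a node in a 2D mesh


def initial_mapping(rows: int, cols: int) -> Dict[int, Coord]:
    """Row-major rank-to-node mapping; column index `cols` holds the spares."""
    return {r * cols + c: (r, c) for r in range(rows) for c in range(cols)}


def substitute_directly(mapping: Dict[int, Coord], failed_rank: int,
                        cols: int) -> Dict[int, Coord]:
    """Assumed naive policy: move only the failed rank to its row's spare."""
    new = dict(mapping)
    row, _ = mapping[failed_rank]
    new[failed_rank] = (row, cols)
    return new


def shift_row_toward_spare(mapping: Dict[int, Coord], failed_rank: int,
                           cols: int) -> Dict[int, Coord]:
    """Hypothetical shift: every rank at or beyond the failed column moves
    one node toward the spare. Not necessarily any method from the paper."""
    new = dict(mapping)
    row, failed_col = mapping[failed_rank]
    for rank, (r, c) in mapping.items():
        if r == row and c >= failed_col:
            new[rank] = (r, c + 1)
    return new


def hops(a: Coord, b: Coord) -> int:
    """Manhattan distance; ignores detours around the failed node."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def max_stencil_hops(mapping: Dict[int, Coord], rows: int, cols: int) -> int:
    """Largest hop count between any rank and its 4-point stencil neighbors."""
    worst = 0
    for rank, node in mapping.items():
        r, c = divmod(rank, cols)  # logical position of the rank
        for dr, dc in ((0, 1), (1, 0)):  # right and down; covers each pair once
            nr, nc = r + dr, c + dc
            if nr < rows and nc < cols:
                worst = max(worst, hops(node, mapping[nr * cols + nc]))
    return worst


if __name__ == "__main__":
    rows, cols, failed = 4, 4, 5  # rank 5 initially sits on node (1, 1)
    base = initial_mapping(rows, cols)
    print(max_stencil_hops(base, rows, cols))                                        # 1
    print(max_stencil_hops(substitute_directly(base, failed, cols), rows, cols))     # 4
    print(max_stencil_hops(shift_row_toward_spare(base, failed, cols), rows, cols))  # 2
```

In this toy setting every stencil neighbor is one hop away before the failure, while either substitution stretches some neighbor pairs across several links. Longer paths share links with other messages, which is one way the message collisions mentioned in the abstract can arise on a Cartesian network.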


Bibliographic record

Source: International Journal of High Performance Computing Applications, 2020, No. 2, pp. 208-226 (19 pages)
Authors: Atsushi Hori; Kazumi Yoshinaga; Thomas Herault; Aurélien Bouteiller; George Bosilca; Yutaka Ishikawa
Author affiliations:

RIKEN Center for Computational Science
Meguro-ku
Innovative Computing Laboratory, The University of Tennessee
Innovative Computing Laboratory, The University of Tennessee
Innovative Computing Laboratory, The University of Tennessee
RIKEN Center for Computational Science

Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Keywords: Fault tolerance; fault mitigation; spare node; communication performance; sliding method

