
Accelerating MPI collective communications through hierarchical algorithms with flexible inter-node communication and imbalance awareness.



Abstract

This work presents and evaluates algorithms for MPI collective communication operations on high-performance systems. Collective communication algorithms are investigated extensively, and a universal algorithm to improve the performance of MPI collective operations on hierarchical clusters is introduced. This algorithm exploits shared-memory buffers for efficient intra-node communication while still allowing the use of unmodified, hierarchy-unaware traditional collectives for inter-node communication. The universal algorithm shows impressive performance results with a variety of collectives, improving upon the MPICH algorithms as well as the Cray MPT algorithms. Speedups average 15x-30x for most collectives, with improved scalability up to 65,536 cores.

Further novel improvements are also proposed for inter-node communication. By using algorithms that take advantage of multiple senders working from the same shared-memory buffer, an additional speedup of 2.5x can be achieved. The discussion also evaluates special-purpose extensions to improve intra-node communication. These extensions return a shared-memory or copy-on-write-protected buffer from the collective, which reduces or completely eliminates the second phase of intra-node communication.

The second part of this work improves the performance of MPI collective communication operations in the presence of imbalanced process arrival times. High-performance collective communications are crucial for the performance and scalability of applications, and imbalanced process arrival times are common in these applications. A micro-benchmark is used to investigate the nature of process imbalance under perfectly balanced workloads, and to understand inter- versus intra-node imbalance. These insights are then used to develop imbalance-tolerant reduce, broadcast, and alltoall algorithms, which minimize the synchronization delay observed by early-arriving processes.
These algorithms have been implemented and tested on a Cray XE6 using up to 32k cores with varying buffer sizes and levels of imbalance. Results show speedups over MPICH averaging 18.9x for reduce, 5.3x for broadcast, and 6.9x for alltoall in the presence of high, but not unreasonable, imbalance.
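The two-phase structure the abstract describes can be sketched in a pure-Python simulation. This is an illustration of the general hierarchical-reduce idea, not the dissertation's implementation: ranks are grouped by node, each node reduces into its leader (standing in for a shared-memory buffer), and only the leaders take part in the hierarchy-unaware inter-node reduction. The function name and fixed node size are assumptions for the sketch.

```python
# Hypothetical sketch of a hierarchical sum-reduction (not the author's code).
# Phase 1 models intra-node reduction via shared memory; phase 2 models the
# unmodified, flat inter-node collective run only among node leaders.

def hierarchical_reduce(values, ranks_per_node):
    """Simulate a hierarchical sum-reduction over values[rank]."""
    n = len(values)
    # Phase 1: each node reduces its ranks' values into the node leader
    # (the lowest rank on the node), mimicking a shared-memory buffer.
    leaders = {}
    for rank in range(n):
        leader = (rank // ranks_per_node) * ranks_per_node
        leaders[leader] = leaders.get(leader, 0) + values[rank]
    # Phase 2: inter-node reduction among leaders only; any traditional
    # hierarchy-unaware algorithm could be reused here unchanged.
    return sum(leaders.values())

# Example: 8 ranks, 4 ranks per node -> 2 node leaders.
print(hierarchical_reduce(list(range(8)), 4))  # prints 28, same as a flat reduce
```

The point of the decomposition is that the expensive inter-node step now involves one participant per node rather than one per core, while intra-node traffic stays inside fast shared memory.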
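The synchronization delay that the imbalance-tolerant algorithms target can be illustrated with a toy arrival-time model (an assumption for illustration, not the dissertation's micro-benchmark): in a synchronizing collective, every early rank waits until the last rank arrives, so one straggler imposes its full lateness on everyone else.

```python
# Toy model (hypothetical, not the dissertation's benchmark): per-rank wait
# time when a collective blocks all ranks until the last arrival.

def sync_delay(arrivals):
    """Return how long each rank waits if all block until the last arrival."""
    last = max(arrivals)
    return [last - t for t in arrivals]

arrivals = [0, 1, 1, 50]      # rank 3 is a straggler
print(sync_delay(arrivals))   # prints [50, 49, 49, 0]
```

An imbalance-tolerant reduce, broadcast, or alltoall aims to shrink these waits by letting early-arriving ranks deposit their contributions and continue, rather than idling until the straggler shows up.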

Record details

  • Author

    Parsons, Benjamin S.

  • Affiliation

    Purdue University

  • Granting institution: Purdue University
  • Subject: Computer engineering
  • Degree: Ph.D.
  • Year: 2015
  • Pages: 160 p.
  • Total pages: 160
  • Format: PDF
  • Language: English

