
Accelerating MPI collective communications through hierarchical algorithms with flexible inter-node communication and imbalance awareness.



Abstract

This work presents and evaluates algorithms for MPI collective communication operations on high-performance systems. Collective communication algorithms are investigated extensively, and a universal algorithm to improve the performance of MPI collective operations on hierarchical clusters is introduced. This algorithm exploits shared-memory buffers for efficient intra-node communication while still allowing the use of unmodified, hierarchy-unaware traditional collectives for inter-node communication. The universal algorithm shows impressive performance results with a variety of collectives, improving upon the MPICH algorithms as well as the Cray MPT algorithms. Speedups average 15x-30x for most collectives, with improved scalability up to 65,536 cores.

Further novel improvements are also proposed for inter-node communication. By using algorithms that take advantage of multiple senders working from the same shared-memory buffer, an additional speedup of 2.5x can be achieved. The discussion also evaluates special-purpose extensions to improve intra-node communication. These extensions return a shared-memory or copy-on-write-protected buffer from the collective, which reduces or completely eliminates the second phase of intra-node communication.

The second part of this work improves the performance of MPI collective communication operations in the presence of imbalanced process arrival times. High-performance collective communications are crucial for the performance and scalability of applications, and imbalanced process arrival times are common in these applications. A micro-benchmark is used to investigate the nature of process imbalance under perfectly balanced workloads, and to understand inter- versus intra-node imbalance. These insights are then used to develop imbalance-tolerant reduce, broadcast, and alltoall algorithms, which minimize the synchronization delay observed by early-arriving processes.
These algorithms have been implemented and tested on a Cray XE6 using up to 32k cores with varying buffer sizes and levels of imbalance. Results show speedups over MPICH averaging 18.9x for reduce, 5.3x for broadcast, and 6.9x for alltoall in the presence of high, but not unreasonable, imbalance.
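The two-phase structure the abstract describes can be sketched in a pure-Python simulation. This is an illustration of the general hierarchical-reduce idea, not the dissertation's implementation: ranks are grouped by node, each node reduces into its leader (standing in for a shared-memory buffer), and only the leaders take part in the hierarchy-unaware inter-node reduction. The function name and fixed node size are assumptions for the sketch.

```python
# Hypothetical sketch of a hierarchical sum-reduction (not the author's code).
# Phase 1 models intra-node reduction via shared memory; phase 2 models the
# unmodified, flat inter-node collective run only among node leaders.

def hierarchical_reduce(values, ranks_per_node):
    """Simulate a hierarchical sum-reduction over values[rank]."""
    n = len(values)
    # Phase 1: each node reduces its ranks' values into the node leader
    # (the lowest rank on the node), mimicking a shared-memory buffer.
    leaders = {}
    for rank in range(n):
        leader = (rank // ranks_per_node) * ranks_per_node
        leaders[leader] = leaders.get(leader, 0) + values[rank]
    # Phase 2: inter-node reduction among leaders only; any traditional
    # hierarchy-unaware algorithm could be reused here unchanged.
    return sum(leaders.values())

# Example: 8 ranks, 4 ranks per node -> 2 node leaders.
print(hierarchical_reduce(list(range(8)), 4))  # prints 28, same as a flat reduce
```

The point of the decomposition is that the expensive inter-node step now involves one participant per node rather than one per core, while intra-node traffic stays inside fast shared memory.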
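The synchronization delay that the imbalance-tolerant algorithms target can be illustrated with a toy arrival-time model (an assumption for illustration, not the dissertation's micro-benchmark): in a synchronizing collective, every early rank waits until the last rank arrives, so one straggler imposes its full lateness on everyone else.

```python
# Toy model (hypothetical, not the dissertation's benchmark): per-rank wait
# time when a collective blocks all ranks until the last arrival.

def sync_delay(arrivals):
    """Return how long each rank waits if all block until the last arrival."""
    last = max(arrivals)
    return [last - t for t in arrivals]

arrivals = [0, 1, 1, 50]      # rank 3 is a straggler
print(sync_delay(arrivals))   # prints [50, 49, 49, 0]
```

An imbalance-tolerant reduce, broadcast, or alltoall aims to shrink these waits by letting early-arriving ranks deposit their contributions and continue, rather than idling until the straggler shows up.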

Record details

  • Author

    Parsons, Benjamin S.

  • Affiliation

    Purdue University

  • Granting institution: Purdue University
  • Subject: Computer engineering
  • Degree: Ph.D.
  • Year: 2015
  • Pages: 160 p.
  • Total pages: 160
  • Format: PDF
  • Language: English

