Journal: IEEE Transactions on Parallel and Distributed Systems

GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation



Abstract

Designing high-performance and scalable applications on GPU clusters requires tackling several challenges. The key challenge is the separation of host memory and device memory, which forces programmers to combine multiple programming models, such as CUDA and MPI, to operate on data in different memory spaces. The challenge becomes harder still when real-world applications use non-contiguous data in multidimensional structures. Together, these issues limit both programming productivity and application performance. We propose GPU-Aware MPI to support GPU-to-GPU data communication through standard MPI. It unifies the separate memory spaces, avoiding explicit CPU-GPU data movement and CPU/GPU buffer management. It supports all MPI datatypes on device memory with two algorithms: a GPU datatype vectorization algorithm and a vector-based GPU kernel data pack and unpack algorithm. A pipeline overlaps the non-contiguous data packing and unpacking on the GPUs, the data movement over PCIe, and the RDMA data transfer on the network. We incorporate our design into the open-source MPI library MVAPICH2 and optimize a production application, the multiphase 3D LBM. Beyond the gain in programming productivity, we observe up to a 19.9 percent improvement in application-level performance on 64 GPUs of the Oakley supercomputer.


