Journal: IEEE Transactions on Parallel and Distributed Systems

GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation



Abstract

Designing high-performance and scalable applications on GPU clusters requires tackling several challenges. The key challenge is the separation of host memory and device memory, which forces programmers to combine multiple programming models, such as CUDA and MPI, to operate on data in different memory spaces. The challenge becomes harder still when real-world applications use non-contiguous data in multidimensional structures. Together, these issues limit both programming productivity and application performance. We propose GPU-Aware MPI to support GPU-to-GPU data communication through standard MPI. It unifies the separate memory spaces, avoiding explicit CPU-GPU data movement and CPU/GPU buffer management. It supports all MPI datatypes on device memory with two algorithms: a GPU datatype vectorization algorithm and a vector-based GPU kernel data pack and unpack algorithm. A pipeline overlaps the non-contiguous data packing and unpacking on the GPUs, the data movement over PCIe, and the RDMA data transfer on the network. We incorporate our design into the open-source MPI library MVAPICH2 and optimize a production application, the multiphase 3D LBM. Beyond the gain in programming productivity, we observe up to a 19.9 percent improvement in application-level performance on 64 GPUs of the Oakley supercomputer.


