Journal: IEEE Transactions on Visualization and Computer Graphics

TOD-Tree: Task-Overlapped Direct Send Tree Image Compositing for Hybrid MPI Parallelism and GPUs


Abstract

Modern supercomputers have thousands of nodes, each with CPUs and/or GPUs capable of several teraflops. However, the network connecting these nodes is relatively slow, on the order of gigabits per second. For time-critical workloads such as interactive visualization, the bottleneck is no longer computation but communication. In this paper, we present an image compositing algorithm that works on both CPU-only and GPU-accelerated supercomputers and focuses on communication avoidance and overlapping communication with computation at the expense of evenly balancing the workload. The algorithm has three stages: a parallel direct send stage, followed by a tree compositing stage and a gather stage. We compare our algorithm with radix-k and binary-swap from the IceT library in a hybrid OpenMP/MPI setting on the Stampede and Edison supercomputers, show strong scaling results and explain how we generally achieve better performance than these two algorithms. We developed a GPU-based image compositing algorithm where we use CUDA kernels for computation and GPU Direct RDMA for inter-node GPU communication. We tested the algorithm on the Piz Daint GPU-accelerated supercomputer and show that we achieve performance on par with CPUs. Last, we introduce a workflow in which both rendering and compositing are done on the GPU.
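The three stages named in the abstract can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not say how ranks are grouped, how the image is partitioned, or which blending operator is used. Here ranks are assumed to be depth-sorted and grouped consecutively, each group member is assumed to own one contiguous image region, and blending is assumed to be the standard "over" operator on premultiplied (color, alpha) pixels. The names `over`, `split`, and `tod_tree_sketch` are hypothetical, and there is no MPI, GPU code, or overlap of communication with computation.

```python
from functools import reduce

def over(front, back):
    """Blend two pixel lists; each pixel is (premultiplied color, alpha)."""
    return [(cf + (1 - af) * cb, af + (1 - af) * ab)
            for (cf, af), (cb, ab) in zip(front, back)]

def split(image, parts):
    """Cut an image into `parts` contiguous regions of near-equal size."""
    n = len(image)
    bounds = [n * i // parts for i in range(parts + 1)]
    return [image[bounds[i]:bounds[i + 1]] for i in range(parts)]

def tod_tree_sketch(images, group_size):
    """Simulate the three stages; images[0] is assumed nearest the viewer."""
    assert len(images) % group_size == 0  # assumption: groups divide evenly
    groups = [images[i:i + group_size]
              for i in range(0, len(images), group_size)]

    # Stage 1 (direct send): within each group, member j receives region j
    # from every member and blends those pieces in depth order.
    partial = []
    for group in groups:
        pieces = [split(img, group_size) for img in group]
        partial.append([reduce(over, [p[j] for p in pieces])
                        for j in range(group_size)])

    # Stage 2 (tree): merge groups pairwise, region by region. The nearer
    # group always stays in front, so depth order is preserved.
    stride = 1
    while stride < len(partial):
        for g in range(0, len(partial) - stride, 2 * stride):
            partial[g] = [over(a, b)
                          for a, b in zip(partial[g], partial[g + stride])]
        stride *= 2

    # Stage 3 (gather): concatenate the finished regions into one image.
    return [px for region in partial[0] for px in region]
```

Because the "over" operator is associative, regrouping the blends this way should give the same image as compositing all ranks front to back, up to floating-point rounding. Comparing `tod_tree_sketch(images, group_size)` against `reduce(over, images)` is therefore a convenient sanity check.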
