Journal: IEEE Transactions on Parallel and Distributed Systems

A Scalable Work-Efficient and Depth-Optimal Parallel Scan for the GPGPU Environment


Abstract

The parallel scan is a basic tool for parallelizing algorithms that appear to have serial dependencies. The performance of these algorithms relies heavily on the efficiency of the underlying parallel scan. To maintain work efficiency, current parallelization methods either sacrifice overall depth or limit scalability. In this study, we present a parallel scan method, derived from the Han-Carlson parallel prefix graph, that is both work-efficient and depth-optimal. In this method, the depth exceeds the lower bound by only a small constant; therefore, the amount of computation and/or memory access is effectively reduced. We also employ a novel cascaded thread-block execution method to exploit the single-program-multiple-data (SPMD) nature of the Compute Unified Device Architecture (CUDA) environment developed by NVIDIA. The proposed method uses the low-latency, inter-thread-accessible shared memory and the single-instruction-multiple-thread (SIMT) characteristics of the graphics hardware to reduce high-latency global memory access and costly barrier synchronization. Our experimental results demonstrate average speedups of approximately 40 percent and 10 percent, respectively, over the CUDA Data Parallel Primitives (CUDPP) library's derivation of the Kogge-Stone prefix tree and over an implementation of Merrill and Grimshaw's method, which uses a coarser combination of the Kogge-Stone graph and the Brent-Kung prefix graph.
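
For readers unfamiliar with the Han-Carlson prefix graph mentioned above, the following is a minimal sketch of the textbook network, simulated sequentially in Python. It is not the paper's method: it omits the work-efficient modifications, the cascaded thread-block execution, and any CUDA specifics. The function name `han_carlson_scan` and its return values are hypothetical, chosen only for illustration. The sketch shows how the network reaches a depth of one level above log2(n) for power-of-two inputs.

```python
from itertools import accumulate
import operator


def han_carlson_scan(values, op=operator.add):
    """Inclusive scan that follows the textbook Han-Carlson prefix network.

    Each "level" is a set of independent combines that a parallel machine
    could run at once; here the levels are simulated one after another.
    Returns (prefix_results, depth, combine_count).
    """
    x = list(values)
    n = len(x)
    depth = 0
    combines = 0

    def run_level(pairs):
        # Read every input before writing, as a synchronous parallel step would.
        nonlocal depth, combines
        updates = [(dst, op(x[src], x[dst])) for src, dst in pairs]
        for dst, value in updates:
            x[dst] = value
        if updates:
            depth += 1
            combines += len(updates)

    # First level: every odd position absorbs its even left neighbour.
    run_level([(i - 1, i) for i in range(1, n, 2)])

    # Middle levels: Kogge-Stone doubling restricted to the odd positions.
    d = 2
    while d < n:
        run_level([(i - d, i) for i in range(d + 1, n, 2)])
        d *= 2

    # Last level: every even position except 0 takes the finished prefix
    # held by the odd position immediately to its left.
    run_level([(i - 1, i) for i in range(2, n, 2)])

    return x, depth, combines


data = [3, 1, 4, 1, 5, 9, 2, 6]
result, depth, combines = han_carlson_scan(data)
assert result == list(accumulate(data))  # matches a plain serial scan
print(result, depth, combines)  # [3, 4, 8, 9, 14, 23, 25, 31] 4 12
```

For eight inputs the sketch uses 4 levels, one more than log2(8) = 3, which illustrates the "small constant above the lower bound" idea. The unmodified network still performs more combines than a serial scan (12 versus 7 here), which is presumably the gap the paper's work-efficient derivation addresses.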
