A Scalable Work-Efficient and Depth-Optimal Parallel Scan for the GPGPU Environment

Ha Sang-Won; Han Tack-Don

Abstract

The parallel scan is a basic tool for parallelizing algorithms that appear to have serial dependencies. The performance of these algorithms relies heavily on the efficiency of the parallel scan being used. To maintain work efficiency, current parallelization methods either sacrifice overall depth or limit scalability. In this study, we present a parallel scan method that is derived from the Han-Carlson parallel prefix graph and is both work-efficient and depth-optimal. In this method, the depth is allowed to exceed the lower bound by a small constant, which effectively reduces the amount of computation and/or memory access. We also employ a novel cascaded thread-block execution method to exploit the single-program-multiple-data (SPMD) nature of the compute unified device architecture (CUDA) environment developed by NVIDIA. The proposed method uses the low-latency shared memory that is accessible across threads and the single-instruction-multiple-thread (SIMT) characteristics of the graphics hardware to reduce high-latency global memory access and costly barrier synchronization. Our experimental results demonstrate average speedups of approximately 40 percent over the CUDA data parallel primitives (CUDPP) library's derivation of the Kogge-Stone prefix tree and approximately 10 percent over an implementation of Merrill and Grimshaw's method, which uses a coarser combination of the Kogge-Stone graph and the Brent-Kung prefix graph.
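To make the depth and work trade-off in the abstract concrete, the following is a minimal sketch of an inclusive scan that follows the textbook Han-Carlson prefix network. It is not the authors' GPU method: the function name `han_carlson_scan`, the power-of-two length restriction, and the sequential simulation of each parallel stage are all assumptions made for illustration. The sketch counts stages (depth) and operator applications (work) so the two can be compared with other prefix networks.

```python
from operator import add


def han_carlson_scan(values, op=add):
    """Inclusive scan following the textbook Han-Carlson prefix network.

    Hypothetical illustration only: each stage is simulated sequentially,
    whereas on a GPU every stage would be a single parallel step.
    Returns (result, depth, op_count).
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        # Simplifying assumption: the input length is a power of two.
        raise ValueError("length must be a power of two")
    x = list(values)
    depth = 0
    ops = 0
    if n == 1:
        return x, depth, ops

    # Stage 1 (Brent-Kung style): every odd slot absorbs its left neighbour.
    for i in range(1, n, 2):
        x[i] = op(x[i - 1], x[i])
        ops += 1
    depth += 1

    # Middle stages (Kogge-Stone style), applied to odd slots only,
    # with distances 2, 4, ..., n/2.
    d = 2
    while d < n:
        prev = list(x)  # snapshot, so all updates in a stage read old values
        for i in range(d + 1, n, 2):
            x[i] = op(prev[i - d], prev[i])
            ops += 1
        depth += 1
        d *= 2

    # Final stage: every even slot from index 2 on absorbs the finished
    # prefix held by the odd slot to its left.
    if n > 2:
        for i in range(2, n, 2):
            x[i] = op(x[i - 1], x[i])
            ops += 1
        depth += 1

    return x, depth, ops


if __name__ == "__main__":
    result, depth, ops = han_carlson_scan([3, 1, 4, 1, 5, 9, 2, 6])
    print(result, depth, ops)
```

For eight inputs this network finishes in four stages, one more than the three-stage lower bound, and applies the operator 12 times; a Kogge-Stone network on the same input needs three stages but 17 operator applications. The operator is always applied as `op(left, right)`, so the sketch also works for associative operators that are not commutative, such as string concatenation.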

Bibliographic details

Source: IEEE Transactions on Parallel and Distributed Systems, 2013, Issue 12, pp. 2324-2333 (10 pages)
Authors: Ha Sang-Won; Han Tack-Don
Affiliation: Yonsei University, Seoul
Original format: PDF
Language: English
Keywords: GPGPU; Han-Carlson adder; Parallel scan; high-performance computing; prefix sum

Similar literature

1. Toward efficient parallel routing optimization for large-scale SDN networks using GPGPU [J]. Wang Xiong, Zhang Qian, Ren Jing. Journal of Network and Computer Applications, 2018, Jul.
2. High-performance quadtree constructions on large-scale geospatial rasters using GPGPU parallel primitives [J]. Jianting Zhang, Simin You. International Journal of Geographical Information Science, 2013, Issue 11-12.
3. Parallelization of the Scale-Changing Technique in Grid Computing environment for the electromagnetic simulation of multi-scale structures [J]. F. Khalil, C. J. Barrios-Hernandez, A. Rashid. International Journal of Numerical Modelling, 2011, Issue 1.
4. Mr. Scan: Extreme scale density-based clustering using a tree-based network of GPGPU nodes [C]. Welton Benjamin, Samanas Evan, Miller Barton P. International Conference for High Performance Computing, Networking, Storage and Analysis, 2013.
5. A study on the micro/nano resolution CT scanner for a GPGPU computing environment [D]. Nagai, Norio. 2012.
6. On the Accuracy and Parallelism of GPGPU-Powered Incremental Clustering Algorithms [O]. Chunlei Chen, Li He, Huixiang Zhang. 2017.
7. Model Order Reduction of Large-Scale Finite Element Systems in an MPI Parallelized Environment for Usage in Multibody Simulation [O]. Volzer Thomas, Eberhard Peter. 2016.
