Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs

Rafique A.; Constantinides G.A.; Kapre N.


Abstract

Trading communication with redundant computation can increase the silicon efficiency of FPGAs and GPUs in accelerating communication-bound sparse iterative solvers. While iterations of the iterative solver can be unrolled to provide reduction in communication cost, the extent of this unrolling depends on the underlying architecture, its memory model, and the growth in redundant computation. This paper presents a systematic procedure to select this algorithmic parameter k, which provides communication-computation tradeoff on hardware accelerators like FPGA and GPU. We provide predictive models to understand this tradeoff and show how careful selection of k can lead to performance improvement that otherwise demands significant increase in memory bandwidth. On an Nvidia C2050 GPU, we demonstrate a 1.9×-42.6× speedup over standard iterative solvers for a range of benchmarks and that this speedup is limited by the growth in redundant computation. In contrast, for FPGAs, we present an architecture-aware algorithm that limits off-chip communication but allows communication between the processing cores. This reduces redundant computation and allows large k and hence higher speedups. Our approach for FPGA provides a 0.3×-4.4× speedup over same-generation GPU devices where k is picked carefully for both architectures for a range of benchmarks.
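
The tradeoff described in the abstract can be illustrated with a minimal sketch. It is not the paper's GPU or FPGA implementation: it assumes a 1D three-point stencil as a stand-in for the sparse matrix A, takes k as the number of unrolled iterations, and uses hypothetical names (`spmv`, `powers_naive`, `powers_blocked`, `block`). Each block loads its entries plus a k-wide halo once, then performs k local multiplies, so communication drops while some work is repeated in the halos.

```python
# Illustrative sketch only. A = tridiag(1, -2, 1) is an assumed stand-in matrix;
# all function and parameter names here are hypothetical.

def spmv(x):
    """One product y = A x with zero values assumed outside the vector."""
    n = len(x)
    return [(x[i - 1] if i > 0 else 0.0) - 2.0 * x[i]
            + (x[i + 1] if i < n - 1 else 0.0) for i in range(n)]

def powers_naive(x, k):
    """k separate multiplies: the full vector is exchanged at every step."""
    for _ in range(k):
        x = spmv(x)
    return x

def powers_blocked(x, k, block):
    """Compute A^k x block by block, loading each block and its halo once.

    Returns (result, work), where work counts stencil evaluations so the
    redundant computation can be compared with the naive n * k.
    """
    n = len(x)
    out = [0.0] * n
    work = 0
    for start in range(0, n, block):
        end = min(start + block, n)
        lo, hi = max(0, start - k), min(n, end + k)
        local = x[lo:hi]          # the only "communication" for this block
        for _ in range(k):
            # Values near a cut edge become stale by one position per step,
            # but after k steps the stale region is still inside the halo.
            local = spmv(local)
            work += len(local)
        out[start:end] = local[start - lo:end - lo]
    return out, work

if __name__ == "__main__":
    x = [float(i % 7) for i in range(32)]
    for k in (1, 2, 4):
        result, work = powers_blocked(x, k, block=8)
        print(k, result == powers_naive(x, k), work, len(x) * k)
```

For simplicity the sketch recomputes the whole halo at every step; shrinking the computed region by one entry per step would reduce the redundant work but not remove it, which is why the choice of k matters.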


Bibliographic details

Source
Parallel and Distributed Systems, IEEE Transactions on | 2015, No. 1 | pp. 24-34 | 11 pages
Authors
Rafique A.; Constantinides G.A.; Kapre N.
Author affiliation

Department of Electrical and Electronic Engineering, Imperial College London, UK;

Original format: PDF
Language of text: English
Keywords
Field programmable gate arrays; Graphics processing units; Instruction sets; Kernel; Registers; Sparse matrices; Vectors; Iterative numerical methods; field programmable gate arrays (FPGAs); graphics processing units (GPUs); matrix powers kernel; sparse matrix-vector multiply


Similar literature

1. Model-driven autotuning of sparse matrix-vector multiply on GPUs [J]. Choi Jee W., Singh Amik, Vuduc Richard W. ACM SIGPLAN Notices: A Monthly Publication of the Special Interest Group on Programming Languages. 2010, No. 5
2. Optimization techniques for sparse matrix-vector multiplication on GPUs [J]. Marco Maggioni, Tanya Berger-Wolf. Journal of Parallel and Distributed Computing. 2016, issue "jula"
3. Performance optimization of Sparse Matrix-Vector Multiplication for multi-component PDE-based applications using GPUs [J]. Abdelfattah Ahmad, Ltaief Hatem, Keyes David. Concurrency and Computation: Practice and Experience. 2016, No. 12
4. Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs [C]. Jee W. Choi, Amik Singh, Richard W. Vuduc. Principles and Practice of Parallel Programming. 2010
5. Sparse Convex Optimization on GPUs [D]. Maggioni, Marco. 2015
6. A new development of non-local image denoising using fixed-point iteration for non-convex ℓp sparse optimization [O]. Shuting Cai, Kun Liu, Ming Yang
7. Model-driven autotuning of sparse matrix-vector multiply on GPUs [O]. Jee W. Choi, Amik Singh, Richard W. Vuduc. 2010
