NVIDIA GPUs Scalability to Solve Multiple (Batch) Tridiagonal Systems Implementation of cuThomasBatch


Abstract

Solving tridiagonal systems is one of the most computationally expensive parts of many applications, so multiple studies have explored the use of NVIDIA GPUs to accelerate this computation. However, these studies have mainly focused on parallel algorithms, which efficiently exploit shared memory and can saturate the GPU's capacity with a small number of systems, but scale poorly when the number of systems is relatively high. We propose a new implementation (cuThomasBatch) based on the Thomas algorithm. To achieve good scalability with this approach, it is necessary to transform the way the inputs are stored in memory so as to exploit coalescence (contiguous threads accessing contiguous memory locations). The results of this study show that the implementation is able to beat the reference code when dealing with a relatively large number of tridiagonal systems (2,000-256,000), being close to 3× (in double precision) and 4× (in single precision) faster on one NVIDIA Kepler GPU.
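The two ideas in the abstract, one serial Thomas solve per system and an interleaved memory layout, can be illustrated with a minimal sketch. This is not the authors' cuThomasBatch code: it is plain Python standing in for a CUDA kernel, the function names `interleave` and `thomas_batch_interleaved` are hypothetical, and the loop over `s` plays the role of one GPU thread per system. In the interleaved layout assumed here, element `i` of system `s` lives at index `i * num_systems + s`, so neighbouring "threads" touch neighbouring addresses at every step.

```python
def interleave(batch):
    """Convert per-system arrays into one flat array in which element i of
    system s is stored at index i * num_systems + s (assumed layout)."""
    num_systems = len(batch)
    n = len(batch[0])
    flat = [0.0] * (n * num_systems)
    for s, row in enumerate(batch):
        for i, value in enumerate(row):
            flat[i * num_systems + s] = value
    return flat


def thomas_batch_interleaved(lower, diag, upper, rhs, n, num_systems):
    """Solve num_systems tridiagonal systems of size n stored interleaved.
    lower[0] and upper[n-1] of each system are ignored."""
    c_prime = list(upper)  # scratch copies, inputs stay untouched
    x = list(rhs)
    for s in range(num_systems):  # on a GPU: one thread per system
        # Forward elimination.
        c_prime[s] = upper[s] / diag[s]
        x[s] = rhs[s] / diag[s]
        for i in range(1, n):
            idx = i * num_systems + s
            prev = idx - num_systems
            denom = diag[idx] - lower[idx] * c_prime[prev]
            c_prime[idx] = upper[idx] / denom
            x[idx] = (rhs[idx] - lower[idx] * x[prev]) / denom
        # Back substitution.
        for i in range(n - 2, -1, -1):
            idx = i * num_systems + s
            x[idx] -= c_prime[idx] * x[idx + num_systems]
    return x


# Two small systems of size 3 with known solutions [1, 2, 3] and [1, 1, 1].
lower = interleave([[0.0, -1.0, -1.0], [0.0, 1.0, 1.0]])
diag = interleave([[2.0, 2.0, 2.0], [4.0, 4.0, 4.0]])
upper = interleave([[-1.0, -1.0, 0.0], [1.0, 1.0, 0.0]])
rhs = interleave([[0.0, 0.0, 4.0], [5.0, 6.0, 5.0]])
solution = thomas_batch_interleaved(lower, diag, upper, rhs, n=3, num_systems=2)
```

The result `solution` is itself interleaved, so the unknowns of system `s` are read back with stride `num_systems`. The sketch does no pivoting, so it assumes well-conditioned (for example diagonally dominant) systems.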


Bibliographic details

Source
International Conference on Parallel Processing and Applied Mathematics | 2018 | pp. 243-253 | 11 pages
Authors
Pedro Valero-Lara; Ivan Martinez-Perez; Raul Sirvent; Xavier Martorell; Antonio J. Pena
Original format
PDF
Keywords
Tridiagonal linear systems; Scalability; Thomas algorithm; PCR; CR; Parallel processing; cuSPARSE; CUDA

Similar literature


1. cuThomasBatch and cuThomasVBatch, CUDA Routines to compute batch of tridiagonal systems on NVIDIA GPUs [J] . Pedro Valero-Lara, Ivan Martinez-Perez, Rauel Sirvent, Concurrency and Computation . 2018, No. 24
2. Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to Target FPGAs, GPUs, and CPUs [J] . Macintosh Hamish J., Banks Jasmine E., Kelson Neil A. International journal of reconfigurable computing . 2019, No. PTa1
3. Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to Target FPGAs, GPUs, and CPUs [J] . Hamish J. Macintosh, Jasmine E. Banks, Neil A. Kelson International journal of reconfigurable computing . 2019, No. 5aaPagea2
4. NVIDIA GPUs Scalability to Solve Multiple (Batch) Tridiagonal Systems Implementation of cuThomasBatch [C] . Pedro Valero-Lara, Ivan Martinez-Perez, Raul Sirvent, International Conference on Parallel Processing and Applied Mathematics . 2018
5. More effective use of high performance systems using sub-batch allocation resource management within multiple component multiple data applications. [D] . Foley, Samantha S. 2010
6. LASSIE: simulating large-scale models of biochemical systems on GPUs [O] . Andrea Tangherloni, Marco S. Nobile, Daniela Besozzi, 2017
7. cuThomasBatch and cuThomasVBatch, CUDA Routines to compute batch of tridiagonal systems on NVIDIA GPUs [O] . Pedro Valero-Lara, Ivan Martínez-Pérez, Raül Sirvent, 2018
