Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs

Jakub Kurzak; Hartwig Anzt; Mark Gates; Jack Dongarra


Abstract

Many problems in engineering and scientific computing require the solution of a large number of small systems of linear equations. Due to their high processing power, Graphics Processing Units became an attractive target for this class of problems, and routines based on the LU and the QR factorization have been provided by NVIDIA in the cuBLAS library. This work addresses the situation where the systems of equations are symmetric positive definite. The paper describes the implementation and tuning of the kernels for the Cholesky factorization and the forward and backward substitution. Targeted workloads involve the solution of thousands of linear systems of the same size, where the focus is on matrix dimensions from 5 by 5 to 100 by 100. Due to the lack of a cuBLAS Cholesky factorization, execution rates of cuBLAS LU and cuBLAS QR are used for comparison against the proposed Cholesky factorization in this work. Execution rates of forward and backward substitution routines are compared to equivalent cuBLAS routines. Comparisons against optimized multicore implementations are also presented. Superior performance is reached in all cases.
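The workload described in the abstract (many small symmetric positive definite systems, each solved by a Cholesky factorization followed by forward and backward substitution) can be illustrated with a minimal CPU reference sketch. This is not the paper's CUDA implementation and says nothing about its tuning; the function names `cholesky_lower`, `solve_spd`, and `solve_batch` are hypothetical and chosen only for illustration.

```python
import math


def cholesky_lower(a):
    """Return the lower-triangular factor L with L * L^T == a.

    `a` is a list of rows describing a symmetric positive definite matrix.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squares already placed in row j.
        d = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L


def solve_spd(a, b):
    """Solve a x = b for one symmetric positive definite system."""
    L = cholesky_lower(a)
    n = len(b)
    # Forward substitution: L y = b.
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    # Backward substitution: L^T x = y.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def solve_batch(matrices, rhs):
    """Solve every system in the batch independently (hypothetical helper)."""
    return [solve_spd(a, b) for a, b in zip(matrices, rhs)]


if __name__ == "__main__":
    mats = [[[4.0, 2.0], [2.0, 3.0]], [[9.0, 3.0], [3.0, 5.0]]]
    rhs = [[10.0, 8.0], [12.0, 8.0]]
    print(solve_batch(mats, rhs))
```

Each system in the batch is independent of the others, which is what makes this class of problem a natural fit for a massively parallel device; the sequential loop in `solve_batch` only stands in for that parallelism.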


Bibliographic details

Source: IEEE Transactions on Parallel and Distributed Systems, 2016, no. 7, pp. 2036-2048 (13 pages)
Authors: Jakub Kurzak; Hartwig Anzt; Mark Gates; Jack Dongarra
Author affiliation: Electrical Engineering and Computer Science, Knoxville, Tennessee
Original format: PDF
Language: English
Keywords: CUDA; Cholesky factorization; GPU; SIMT; batched; kernel

Similar literature

1. Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs [J]. Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov. Procedia Computer Science, 2016, no. 1.
2. Fast Cholesky factorization on GPUs for batch and native modes in MAGMA [J]. Abdelfattah Ahmad, Haidar Azzam, Tomov Stanimire. Journal of Computational Science, 2017, May issue.
3. cuPentBatch-A batched pentadiagonal solver for NVIDIA GPUs [J]. Gloster Andrew, Naraigh Lennon O., Pang Khang Ee. Computer Physics Communications, 2019.
4. NVIDIA GPUs Scalability to Solve Multiple (Batch) Tridiagonal Systems Implementation of cuThomasBatch [C]. Pedro Valero-Lara, Ivan Martinez-Perez, Raul Sirvent. International Conference on Parallel Processing and Applied Mathematics, 2018.
5. Matrix factorizations, triadic matrices, and modified Cholesky factorizations for optimization [D]. Fang, Haw-ren. 2006.
6. NMF-mGPU: non-negative matrix factorization on multi-GPU systems [O]. Edgardo Mejía-Roa, Daniel Tabas-Madrid, Javier Setoain. 2015.
