
Fault Tolerant Cholesky Factorization on GPUs



Abstract

Direct Cholesky-based solvers are typically used to solve large linear systems whose coefficient matrix is symmetric positive definite. For such systems they are faster than more general solvers such as LU and QR. In recent years, graphics processing units (GPUs) have become a popular platform for scientific computing and are increasingly used as major computational units in supercomputers. However, GPUs are susceptible to transient faults caused by events such as alpha particle strikes and power fluctuations, so the probability of an error increases as more GPU computing nodes are used. In this paper, we introduce two efficient fault tolerance schemes for Cholesky factorization and study their performance using a direct Cholesky solver in the presence of faults. Using a transient fault injection mechanism for NVIDIA GPUs, we compare our schemes with a traditional checksum fault tolerance technique and show that they offer superior performance, good error coverage, and low overhead.
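
To make the checksum idea mentioned in the abstract concrete, the following is a minimal CPU-side sketch in plain Python. It is not either of the paper's two schemes, and it is not the traditional encoded-checksum technique that the paper uses as a baseline: it is a simplified check applied after factorization, which tests whether A·e equals L·(Lᵀ·e) for the all-ones vector e. The function names `cholesky` and `checksum_ok` and the tolerance `tol` are assumptions made for illustration.

```python
import math


def cholesky(a):
    """Return lower-triangular L with a = L * L^T.

    `a` is a list of rows and must be symmetric positive definite.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squares already placed in row j.
        d = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def checksum_ok(a, L, tol=1e-9):
    """Hypothetical helper: compare row sums of A with L * (L^T * e).

    e is the all-ones vector, so A*e is the vector of row sums of A and
    L^T*e is the vector of column sums of L. A mismatch beyond `tol`
    (relative to the row sum) signals a corrupted factor.
    """
    n = len(a)
    lt_e = [sum(L[i][j] for i in range(j, n)) for j in range(n)]
    for i in range(n):
        row_sum = sum(a[i])
        recon = sum(L[i][j] * lt_e[j] for j in range(i + 1))
        if abs(row_sum - recon) > tol * max(1.0, abs(row_sum)):
            return False
    return True


if __name__ == "__main__":
    A = [[4.0, 12.0, -16.0],
         [12.0, 37.0, -43.0],
         [-16.0, -43.0, 98.0]]
    L = cholesky(A)
    print("clean factor passes:", checksum_ok(A, L))
    L[2][1] += 0.5  # simulate a transient fault in one entry of L
    print("corrupted factor passes:", checksum_ok(A, L))
```

A single-vector check of this kind costs O(n²) on top of the O(n³) factorization, but it only detects errors and can miss corruptions whose effects cancel in the sums. Locating and correcting an error requires additional checksum information.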
