IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale

Extending and Evaluating Fault-Tolerant Preconditioned Conjugate Gradient Methods

Abstract

We compare and refine exact and heuristic fault-tolerance extensions for the preconditioned conjugate gradient (PCG) and the split preconditioner conjugate gradient (SPCG) methods for recovering from failures of compute nodes of large-scale parallel computers. In the exact state reconstruction (ESR) approach, which is based on a method proposed by Chen (2011), the solver keeps extra information from previous search directions of the (S)PCG solver, so that its state can be fully reconstructed if a node fails unexpectedly. ESR does not make use of checkpointing or external storage for saving dynamic solver data and has only negligible computation and communication overhead compared to the failure-free situation. In exact arithmetic, the reconstruction is exact, but in finite-precision computations, the number of iterations until convergence can differ slightly from the failure-free case due to rounding effects. We perform experiments to investigate the behavior of ESR in floating-point arithmetic and compare it to the heuristic linear interpolation (LI) approach by Langou et al. (2007) and Agullo et al. (2016), which does not have to keep extra information and thus has lower memory requirements. Our experiments illustrate that ESR, on average, has essentially zero overhead in terms of additional iterations until convergence, whereas the LI approach incurs much larger overheads.
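The linear interpolation (LI) idea mentioned above can be illustrated with a small serial sketch. It assumes the usual formulation in which the lost block of the iterate, with index set I, is recomputed from the surviving entries by solving A[I,I] x_I = b_I - A[I,J] x_J, after which the solver is restarted from the interpolated iterate. The Jacobi preconditioner, the test matrix, and all function names (`build_matrix`, `li_recover`, `pcg`) are hypothetical choices made for this illustration and are not taken from the paper; ESR is not shown.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_matrix(n):
    """Hypothetical SPD test matrix: tridiagonal with a varying diagonal."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 2.0 + 0.1 * i
        if i > 0:
            A[i][i - 1] = A[i - 1][i] = -1.0
    return A

def solve_dense(M, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x

def li_recover(A, b, x, lost):
    """LI step: solve A[I,I] x_I = b_I - A[I,J] x_J for the lost index set I."""
    lost_set = set(lost)
    kept = [j for j in range(len(b)) if j not in lost_set]
    A_II = [[A[i][j] for j in lost] for i in lost]
    rhs = [b[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    x_new = list(x)
    for i, val in zip(lost, solve_dense(A_II, rhs)):
        x_new[i] = val
    return x_new

def pcg(A, b, x0, tol=1e-10, max_iter=200, fail_at=None, lost=()):
    """Jacobi-preconditioned CG; optionally simulates one failure at iteration fail_at."""
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]
    b_norm = dot(b, b) ** 0.5
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
    z = [d * ri for d, ri in zip(inv_diag, r)]
    p, rz = list(z), dot(r, z)
    for it in range(1, max_iter + 1):
        if it == fail_at:
            # Simulated node failure: the entries in `lost` are wiped,
            # interpolated from the surviving entries, and PCG is restarted.
            for i in lost:
                x[i] = 0.0
            x = li_recover(A, b, x, list(lost))
            r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
            z = [d * ri for d, ri in zip(inv_diag, r)]
            p, rz = list(z), dot(r, z)
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it - 1  # number of completed iterations
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

if __name__ == "__main__":
    A, b = build_matrix(16), [1.0] * 16
    _, iters_ok = pcg(A, b, [0.0] * 16)
    _, iters_li = pcg(A, b, [0.0] * 16, fail_at=5, lost=[4, 5, 6, 7])
    print("failure-free iterations:", iters_ok)
    print("iterations with LI recovery:", iters_li)
```

Comparing the two printed iteration counts gives a rough feel for the kind of iteration overhead the abstract refers to. A single small serial example like this says nothing about the averages reported in the paper.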
