
A case study towards the verification of the utility of analytical models in selecting checkpoint intervals.

Abstract

As high performance computing (HPC) systems grow larger, with increasing numbers of components, failures become more common. Codes that use large numbers of nodes and run for long periods must take such failures into account and adopt fault tolerance mechanisms to avoid losing computation and, with it, system utilization. One of those mechanisms is checkpoint/restart. Although analytical models exist to guide users in selecting an appropriate checkpoint interval, these models rest on assumptions that may not always hold. This thesis examines some of these assumptions, in particular the consistency of parameters such as Mean Time To Interrupt (MTTI), checkpoint latency, and restart time, and explores the utility of the models, which assume an exponential failure distribution. The related experimentation uses checkpoint and restart data collected from NAMD, a widely used biomolecular simulation code, and failure data from Los Alamos National Lab (LANL), where failure distributions are not exponential in nature. It also presents preliminary work on spatio-temporal clustering of HPC failure data, aimed at determining the degree to which failures that occur in HPC centers are related.

The experimental results of this thesis validate that Daly's execution-time and Arunagiri's defensive-I/O checkpoint/restart models hold for NAMD. This shows that these models have utility even when failures do not follow an exponential distribution. The clustering results indicate that for some systems failures fall into easily recognized clusters, while for others they fall into small clusters, indicating that they occur close together in space and time. Note, however, that these results do not show whether the clustered failures are related events, since random events sometimes cluster too. Spatio-temporal autocorrelation is recommended as a continuation of this research to determine the degree of relatedness of failure events.
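To make concrete what an analytical checkpoint-interval model computes, here is a minimal Python sketch. It assumes the commonly cited form of Daly's higher-order estimate of the optimum interval and his expected wall-clock time under exponentially distributed failures. The abstract does not state the formulas, so both should be checked against the thesis itself. The function names and the numeric inputs (a 5-minute checkpoint latency and a 24-hour MTTI) are hypothetical and are not taken from the NAMD or LANL data.

```python
import math


def daly_optimum_interval(delta, mtti):
    """Estimate the compute time between checkpoints (same units as inputs).

    Assumed form of Daly's higher-order estimate:
        tau = sqrt(2*delta*M) * (1 + (1/3)*sqrt(delta/(2M)) + (1/9)*(delta/(2M))) - delta
    for delta < 2M, and tau = M otherwise.
    delta: checkpoint latency; mtti: mean time to interrupt (M).
    """
    if delta >= 2 * mtti:
        return mtti
    x = delta / (2 * mtti)
    return math.sqrt(2 * delta * mtti) * (1 + math.sqrt(x) / 3 + x / 9) - delta


def expected_wall_time(solve_time, tau, delta, restart, mtti):
    """Expected total wall-clock time, assuming exponential failures.

    Assumed form: M * exp(R/M) * (exp((tau + delta)/M) - 1) * Ts / tau,
    where Ts is the failure-free solve time and R is the restart time.
    """
    return (mtti * math.exp(restart / mtti)
            * (math.exp((tau + delta) / mtti) - 1) * solve_time / tau)


if __name__ == "__main__":
    delta = 300.0        # hypothetical checkpoint latency, seconds
    restart = 600.0      # hypothetical restart time, seconds
    mtti = 86400.0       # hypothetical MTTI, seconds
    solve = 7 * 86400.0  # hypothetical failure-free run time, seconds

    tau = daly_optimum_interval(delta, mtti)
    print(f"estimated optimum interval: {tau:.1f} s")
    for candidate in (tau / 2, tau, 2 * tau):
        hours = expected_wall_time(solve, candidate, delta, restart, mtti) / 3600
        print(f"interval {candidate:8.1f} s -> expected wall time {hours:.2f} h")
```

Comparing the expected wall time at the estimated optimum with shorter and longer intervals illustrates the trade-off the models capture: checkpointing too often wastes time writing checkpoints, while checkpointing too rarely loses more work per failure.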

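Bibliographic record

  • Author

    Harney, Michael Joseph

  • Author affiliation

    The University of Texas at El Paso

  • Degree-granting institution: The University of Texas at El Paso
  • Subject: Engineering, Computer; Computer Science
  • Degree: M.S.
  • Year: 2013
  • Pages: 175 p.
  • Total pages: 175
  • Original format: PDF
  • Language of text: eng
  • Chinese Library Classification: Linguistics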