Conference: IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale

Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer



Abstract

Extreme-scale computing systems employ Reliability, Availability and Serviceability (RAS) mechanisms and infrastructure to log events from multiple system components. In this paper, we analyze RAS logs in conjunction with the application placement and scheduling database in order to understand the impact of common RAS events on application performance. This study, conducted on the records of about 2 million applications executed on the Titan supercomputer, provides important insights for system users, operators and computer science researchers. Specifically, we investigate the impact of RAS events on application performance and its variability by comparing cases where events are recorded with corresponding cases where no events are recorded. Such a statistical investigation is possible because we observed that system users tend to execute their applications multiple times. Our analysis reveals that most RAS events do impact application performance, although not always. We also find that different system components affect application performance differently. In particular, our investigation includes the following components: parallel file system, processor, memory, graphics processing units, and system and user software issues. Our work establishes the importance of providing feedback to system users for increasing the operational efficiency of extreme-scale systems.
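The comparison described in the abstract (repeated runs of the same application, split by whether a RAS event was recorded) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' method: the record layout `(app_name, runtime_seconds, had_ras_event)`, the function name `compare_runs`, and the use of runtime as the performance measure are all hypothetical, and the records are synthetic.

```python
from collections import defaultdict
from statistics import mean, pstdev


def compare_runs(runs):
    """Group runs by application and summarize runtime with vs. without RAS events.

    `runs` is an iterable of (app_name, runtime_seconds, had_ras_event) tuples;
    this layout is an assumption made for illustration only.
    Returns {app_name: {"clean": (mean, stdev), "affected": (mean, stdev)}}
    for applications that have at least one run in each group.
    """
    groups = defaultdict(lambda: {"clean": [], "affected": []})
    for app, runtime, had_event in runs:
        key = "affected" if had_event else "clean"
        groups[app][key].append(runtime)

    summary = {}
    for app, group in groups.items():
        # A comparison only makes sense when both kinds of run exist.
        if group["clean"] and group["affected"]:
            summary[app] = {
                kind: (mean(values), pstdev(values))
                for kind, values in group.items()
            }
    return summary


# Synthetic records, not taken from the Titan logs.
runs = [
    ("app_a", 100.0, False),
    ("app_a", 110.0, False),
    ("app_a", 150.0, True),
    ("app_b", 50.0, False),  # no affected run, so app_b is skipped
]
summary = compare_runs(runs)
```

Reporting the mean and the standard deviation side by side mirrors the abstract's interest in both performance and its variability; a real analysis would need a statistical test and far more runs per application.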
