IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale

CPU Overheating Characterization in HPC Systems: A Case Study

Abstract

As supercomputers grow in size, the number of abnormal events also increases. Some of these events may cause an application to fail; others simply reduce system efficiency. CPU overheating is one such event: when a CPU overheats, it reduces its frequency, which decreases system efficiency. This paper studies the problem of CPU overheating in supercomputers. In the first part, we analyze data collected over one year on a supercomputer from the TOP500 list to understand the conditions under which CPU overheating occurs. Our analysis shows that overheating events are caused by some specific applications. In the second part, we evaluate the impact of such overheating events on the performance of MPI applications. Using six representative HPC benchmarks, we show that, for a majority of the applications, a frequency drop on one CPU affects the execution time of distributed runs in proportion to both the duration and the extent of the frequency drop.