Journal of Physics: Conference Series

Leveraging the checkpoint-restart technique for optimizing CPU efficiency of ATLAS production applications on opportunistic platforms

Abstract

Data processing applications of the ATLAS experiment, such as event simulation and reconstruction, spend a considerable amount of time in the initialization phase. This phase includes loading a large number of shared libraries, reading detector geometry and conditions data from external databases, building a transient representation of the detector geometry, and initializing various algorithms and services. In some cases the initialization step can take as long as 10-15 minutes. Such slow initialization has a significant negative impact on the overall CPU efficiency of a production job, especially when the job runs on opportunistic, often short-lived, resources such as commercial clouds or volunteer computing. To improve this situation, we can take advantage of the fact that ATLAS runs large numbers of production jobs with similar configuration parameters (e.g. jobs within the same production task). This allows us to checkpoint one job at the end of its configuration step and then use the resulting checkpoint image for rapid startup of thousands of production jobs. By applying this technique we can reduce the initialization time of a job from tens of minutes to just a few seconds. In addition, we can leverage container technology to restart checkpointed applications on a variety of computing platforms, in particular on platforms different from the one on which the checkpoint image was created. We will describe the mechanism for creating checkpoint images of Geant4 simulation jobs with AthenaMP (the multi-process version of Athena, the ATLAS data simulation, reconstruction and analysis framework) and the use of these images for running ATLAS Simulation production jobs on volunteer computing resources (ATLAS@Home) and on supercomputers.
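The efficiency argument above can be made concrete with a toy model. CPU efficiency is taken here as the fraction of a job slot's wall-clock time spent on useful event processing. The one-time cost of creating a checkpoint image is spread over every job restarted from it. This is a minimal sketch under stated assumptions. The function names, the 1-hour slot, the 5-second restore time and the job counts are all hypothetical illustrations rather than ATLAS measurements. The model also ignores effects such as image transfer and container startup.

```python
"""Toy CPU-efficiency model for short-lived opportunistic job slots.

Hypothetical illustration only: all names and numbers are assumptions,
not ATLAS measurements or ATLAS software interfaces.
"""


def cpu_efficiency(init_s: float, work_s: float) -> float:
    """Fraction of wall-clock time spent on useful event processing.

    init_s: seconds spent initializing (libraries, geometry,
            conditions data, algorithm/service setup).
    work_s: seconds spent processing events after initialization.
    """
    if init_s < 0 or work_s < 0:
        raise ValueError("times must be non-negative")
    total = init_s + work_s
    return work_s / total if total > 0 else 0.0


def amortized_init_s(create_s: float, restore_s: float, n_jobs: int) -> float:
    """Per-job startup cost when one checkpoint image serves n_jobs.

    create_s:  one-time cost of initializing a job and writing the image.
    restore_s: per-job cost of restarting from the image.
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    return restore_s + create_s / n_jobs


if __name__ == "__main__":
    slot_s = 60 * 60  # hypothetical 1-hour opportunistic slot
    cold_init_s = 15 * 60  # upper end of the 10-15 minute range quoted above
    restore_s = 5  # hypothetical "few seconds" restart from an image

    for label, init_s in [("cold start", cold_init_s),
                          ("checkpoint restart", restore_s)]:
        eff = cpu_efficiency(init_s, slot_s - init_s)
        print(f"{label:>18}: {eff:.1%}")

    # Creating the image costs roughly one cold initialization, but that
    # cost becomes negligible once it is shared by many jobs.
    for n in (1, 100, 1000):
        print(f"{n:>5} jobs per image: "
              f"{amortized_init_s(cold_init_s, restore_s, n):.2f} s per job")
```

With these assumed numbers, a 15-minute initialization leaves about 75% of the slot for event processing. A 5-second restart leaves about 99.9%. The amortization loop shows why the gain depends on reusing one image across many jobs with similar configuration, as in a single production task.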