Journal: IEEE Transactions on Parallel and Distributed Systems

Loop transformations for fault detection in regular loops on massively parallel systems


Abstract

Distributed-memory systems can incorporate thousands of processors at a reasonable cost. However, as the number of processors in a system increases, fault detection and fault tolerance become critical issues. Errors can be detected by replicating a computation on more than one processor and comparing the results these processors produce. During program execution, data dependencies typically prevent all of the processors in a multiprocessor system from being busy at all times, so processor schedules contain idle time slots. The goal of this work is to exploit these idle time slots to schedule duplicated computation for the purpose of fault detection. We propose a compiler-assisted approach to fault detection in regular loops on distributed-memory systems. The approach achieves fault detection by duplicating the execution of statement instances: after the data dependencies of a regular loop are carefully analyzed, selected instances of loop statements are duplicated in a way that ensures the desired fault coverage. We first present duplication strategies for fault detection and show that these strategies use idle processor time to execute replicated statements whenever possible. Next, we present loop transformations that implement these fault-detection strategies. We also develop a general framework for selecting appropriate loop transformations. Experiments on the CRAY-T3D show that the overhead of adding the fault-detection capability is usually less than 25%, and less than 10% when communication overhead is reduced by grouping messages.
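The two ideas in the abstract, idle slots created by data dependencies and fault detection by comparing duplicated statement instances, can be illustrated with a small Python sketch. This is not the paper's transformation or framework. It assumes a hypothetical loop nest a[i][j] = a[i-1][j] + a[i][j-1], one row per processor, and a wavefront schedule in which instance (i, j) runs at time step i + j. The function names and the fault-injection parameter are made up for illustration.

```python
def idle_slots(n_rows, n_cols):
    """Return, per processor, the time steps in which it has no work.

    Assumption: processor p owns row p and executes instance (p, j) at
    time p + j, which respects the dependences on (p-1, j) and (p, j-1).
    """
    total_steps = n_rows + n_cols - 1
    return {
        p: [t for t in range(total_steps) if not (p <= t < p + n_cols)]
        for p in range(n_rows)
    }


def run_with_duplication(n_rows, n_cols, faulty=None):
    """Execute the loop, running each statement instance twice and comparing.

    `faulty` is a hypothetical (i, j) instance whose primary execution is
    corrupted, standing in for a transient processor fault.
    """
    # Row 0 and column 0 hold boundary values of 1.
    a = [[1] * (n_cols + 1) for _ in range(n_rows + 1)]
    mismatches = []
    for i in range(1, n_rows + 1):
        for j in range(1, n_cols + 1):
            primary = a[i - 1][j] + a[i][j - 1]
            if faulty == (i, j):
                primary += 1  # injected fault on the primary execution
            # Duplicate execution, imagined to run on a different processor.
            duplicate = a[i - 1][j] + a[i][j - 1]
            if primary != duplicate:
                mismatches.append((i, j))
            a[i][j] = primary
    return a, mismatches


if __name__ == "__main__":
    print(idle_slots(3, 4))
    print(run_with_duplication(3, 4, faulty=(2, 3))[1])
```

In this sketch the duplicates run inline for simplicity. The approach described in the abstract would instead place them in idle slots such as those reported by idle_slots, so that the comparison adds little to the overall schedule length.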
