Loop transformations for fault detection in regular loops on massively parallel systems

Chun Gong; Melhem R.


Abstract

Distributed-memory systems can incorporate thousands of processors at a reasonable cost. However, with an increasing number of processors in a system, fault detection and fault tolerance become critical issues. By replicating the computation on more than one processor and comparing the results produced by these processors, errors can be detected. During the execution of a program, due to data dependencies, typically not all of the processors in a multiprocessor system are busy at all times. Therefore processor schedules contain idle time slots and it is the goal of this work to exploit these idle time slots to schedule duplicated computation for the purpose of fault detection. We propose a compiler-assisted approach to fault detection in regular loops on distributed-memory systems. This approach achieves fault detection by duplicating the execution of statement instances. After carefully analyzing the data dependencies of a regular loop, selected instances of loop statements are duplicated in a way that ensures the desired fault coverage. We first present duplication strategies for fault detection and show that these strategies use idle processor times for executing replicated statements, whenever possible. Next, we present loop transformations to implement these fault-detection strategies. Also, a general framework for selecting appropriate loop transformations is developed. Experimental results performed on the CRAY-T3D show that the overhead of adding the fault detection capability is usually less than 25%, and is less than 10% when communication overhead is reduced by grouping messages.
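The idea in the abstract (idle slots in a processor schedule, plus duplicated statement instances whose results are compared) can be illustrated with a small sketch. This is not the authors' algorithm or their loop transformations. It assumes a hypothetical two-deep loop `a[i][j] = a[i-1][j] + a[i][j-1]`, a row-per-processor wavefront schedule, a "next row's processor" checker assignment, and a simple fault model in which a faulty processor adds 1 to every result. All function names are illustrative.

```python
def wavefront_schedule(n_rows, n_cols):
    """Assumed mapping: instance (i, j) runs on processor i at time step i + j."""
    return {(i, j): (i, i + j) for i in range(n_rows) for j in range(n_cols)}


def idle_slots(schedule, n_procs):
    """List the (processor, time) slots that the schedule leaves unused."""
    makespan = max(t for _, t in schedule.values()) + 1
    busy = set(schedule.values())
    return [(p, t) for p in range(n_procs) for t in range(makespan)
            if (p, t) not in busy]


def run_with_duplication(n_rows, n_cols, faulty_proc=None):
    """Execute every instance twice on different processors and compare.

    Returns the instances whose primary and duplicate results disagree.
    """
    # Padded array: instance (i, j) is stored at a[i + 1][j + 1];
    # row 0 and column 0 hold boundary values.
    a = [[1] * (n_cols + 1) for _ in range(n_rows + 1)]

    def execute(i, j, proc):
        # a[i-1][j] + a[i][j-1], written in padded coordinates.
        value = a[i][j + 1] + a[i + 1][j]
        # Hypothetical fault model: a faulty processor corrupts its result.
        return value + 1 if proc == faulty_proc else value

    mismatches = []
    for i in range(n_rows):
        for j in range(n_cols):
            primary = i                  # processor that owns row i
            checker = (i + 1) % n_rows   # assumed choice of duplicating processor
            result = execute(i, j, primary)
            if execute(i, j, checker) != result:
                mismatches.append((i, j))
            a[i + 1][j + 1] = result
    return mismatches


if __name__ == "__main__":
    schedule = wavefront_schedule(3, 4)
    print("idle slots:", idle_slots(schedule, 3))
    print("mismatches, no fault:", run_with_duplication(3, 4))
    print("mismatches, processor 1 faulty:", run_with_duplication(3, 4, faulty_proc=1))
```

The sketch only counts idle slots and does not place duplicates into them. Deciding which instances to duplicate, and on which processor and time step, is the scheduling problem the paper addresses.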


Bibliographic details

Source: IEEE Transactions on Parallel and Distributed Systems, 1996, no. 12, pp. 1238-1249 (12 pages)
Authors: Chun Gong; Melhem R.
Original format: PDF
Language: English
Chinese Library Classification: Computing technology, computer technology

Similar literature


1. Efficient processor assignment algorithms and loop transformations for executing nested parallel loops on multiprocessors [J]. Chien-Min Wang, Sheng-De Wang. IEEE Transactions on Parallel and Distributed Systems, 1992, no. 1
2. Adaptive Fault Detection and Isolation Approach for Actuator Stuck Faults in Closed-Loop Systems [J]. Xiao-Jian Li, Guang-Hong Yang. International Journal of Control, Automation, and Systems, 2012, no. 4
3. Adaptive fault detection and isolation approach for actuator stuck faults in closed-loop systems [J]. Xiao-Jian Li, Guang-Hong Yang. International Journal of Control, Automation and Systems, 2012, no. 4
4. Communication-Conscious Mapping of Regular Nested Loop Programs onto Massively Parallel Processor Arrays [C]. Sebastian Siegel, Renate Merker, Frank Hannig. IASTED International Conference on Parallel and Distributed Computing and Systems, 2006
5. A Proof Theory for Loop-Parallelizing Transformations [D]. Bell, Christian James. 2014
6. Fault Detection and Safety in Closed-Loop Artificial Pancreas Systems [O]. B. Wayne Bequette. 2014
7. Loop Transformations for Fault Detection in Regular Loops on Massively Parallel Systems [O]. Chun Gong, Rami Melhem, Rajiv Gupta. 1996
8. New Loop Transformation Techniques for Massive Parallelism [R]. Lu, L. C., Chen, M. 1990
