Proceedings of the 9th International Conference for Young Computer Scientists (ICYCS 2008)

Transient Fault Recovery on Chip Multiprocessor based on Dual Core Redundancy and Context Saving

Abstract

To address the increasing susceptibility of microprocessors to transient faults, many techniques have been proposed that exploit the core redundancy of Chip Multiprocessors (CMPs). Chip-level Redundant Threading (CRT) is a novel approach to detecting transient faults on CMPs by executing two copies of a given program on separate cores and comparing the store data. CRTR (CRT with Recovery) achieves fault recovery by comparing the result of every instruction before commit. Once a mismatched result is detected, the microprocessor can be recovered by re-executing from the faulty instruction. Inter-core communication therefore becomes critical in CRTR. To reduce the inter-core communication bandwidth demand, this paper proposes a new approach to fault recovery, Dual Core Redundancy with Context saving (DCR-C). DCR-C extends CRT by adding hardware-implemented context saving and recovery. In DCR-C, as in CRT, only store instructions are compared before commit, so the bandwidth demand is greatly reduced. Context saving is triggered by a cache miss caused by a store, so the context-saving latency can be efficiently hidden. Once a fault is detected, the processor is restored to the saved context. The experimental results demonstrate that DCR-C is a preferable approach to achieving fault recovery with low performance overhead and low inter-core bandwidth demand.
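The recovery flow described in the abstract can be illustrated with a small software model. The following is only a toy sketch under stated assumptions, not the paper's hardware design: the instruction set (`addi`, `store`), the `Core` class, the `run_dcr_c` function, and the use of a Python set as a stand-in for "the store missed in the cache" are all hypothetical. The model has no loads or branches, and faults are injected only into the trailing core, so the leading core's context is trustworthy by construction.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 4)

    def context(self):
        # Snapshot of the architectural state (hypothetical: pc + registers).
        return (self.pc, list(self.regs))

    def restore(self, ctx):
        self.pc, self.regs = ctx[0], list(ctx[1])


def step(core, program):
    """Execute one instruction; return (addr, value) for a store, else None."""
    op, a, b, c = program[core.pc]
    core.pc += 1
    if op == "addi":      # regs[a] = regs[b] + c
        core.regs[a] = core.regs[b] + c
        return None
    if op == "store":     # request to write regs[b] to address a (c unused)
        return (a, core.regs[b])
    raise ValueError(f"unknown op: {op}")


def run_dcr_c(program, fault=None):
    """Run two redundant cores; fault = (step_index, reg, bitmask) flips
    bits once in the trailing core to mimic a transient fault."""
    lead, trail = Core(), Core()
    memory, cached = {}, set()
    saved = lead.context()    # the initial context is trivially fault-free
    recoveries = 0
    steps = 0
    while lead.pc < len(program):
        if fault is not None and steps == fault[0]:
            trail.regs[fault[1]] ^= fault[2]
            fault = None      # transient: happens only once
        steps += 1
        out_lead = step(lead, program)
        out_trail = step(trail, program)
        if out_lead is None and out_trail is None:
            continue          # non-store results are not compared
        if out_lead != out_trail:
            # Store mismatch: roll both cores back to the saved context.
            lead.restore(saved)
            trail.restore(saved)
            recoveries += 1
            continue
        addr, value = out_lead
        memory[addr] = value  # commit only after the comparison succeeds
        if addr not in cached:
            # Toy stand-in for a store-induced cache miss: save the context.
            cached.add(addr)
            saved = lead.context()
    return memory, recoveries


program = [
    ("addi", 1, 0, 5),      # r1 = 5
    ("store", 100, 1, 0),   # mem[100] = r1, first touch -> context saved
    ("addi", 2, 1, 3),      # r2 = r1 + 3
    ("store", 104, 2, 0),   # mem[104] = r2
]
print(run_dcr_c(program))                        # fault-free run
print(run_dcr_c(program, fault=(3, 2, 0b100)))   # corrupt r2 before 2nd store
```

In the faulty run, the corrupted value is caught at the second store, both cores roll back to the context saved at the first store, and re-execution produces the same memory contents as the fault-free run. A real design would also have to deal with loads, control flow, and errors latent in unverified register state, none of which this sketch models.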
