Increasing processor dependability in distributed shared-memory servers.

Abstract

Scalable shared-memory servers offer high performance and capacity within the familiar shared-memory programming model. However, reliability and availability have been significant shortcomings of previous shared-memory architectures, as a single error in one of the many processor or memory modules could bring down the entire system. The goal of this thesis is to eliminate the processor module as a single point of failure for shared-memory servers without requiring changes to software and while minimizing the impact on commodity hardware designs.

The basic approach studied is distributed redundancy, where pairs of processor cores are grouped together logically but separated physically to increase availability of the system. We propose a design space based on fault-containment granularity, and argue that achieving our goals requires that processor cores and their private caches keep unchecked values from propagating into shared memory. We investigate two alternatives for exposing these updates to the outside system: forcing a check when external requests arrive or hiding the updates using a relaxed memory model.

We propose initial designs based on lockstep coordination that constructs synchronous redundant processor pairs. We then leverage the hidden-update mechanisms to develop an asynchronous, distributed-redundant system. Our evaluations of common enterprise workloads show that asynchronous redundancy can achieve performance overheads averaging just 10% over a non-redundant system, while obviating the need for extensive initialization and deterministic execution found in synchronous designs.

We observe that although asynchronous redundancy has numerous benefits for the designer, it complicates the system's ability to recover from chip failures. Our implementation of asynchronous redundancy relies on one of the replica cores in each pair being potentially incoherent with the rest of the system, leading to temporal regions where, if the coherent core failed, data could be lost. We propose simple extensions to the cache coherence protocol to close these windows of vulnerability. Using symbolic model checking, we formally verify an example distributed shared-memory coherence protocol and our proposed extensions for chip-failure tolerance.

Bibliographic record

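  • Author

    Gold, Brian T.

  • Author affiliation

    Carnegie Mellon University

  • Degree-granting institution Carnegie Mellon University
  • Subject Computer Science
  • Degree Ph.D.
  • Year 2009
  • Pages 84 p.
  • Total pages 84
  • Original format PDF
  • Language English
  • Chinese Library Classification Automation technology, computer technology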
  • Date indexed 2022-08-17 11:38:30
