Journal: Journal of Parallel and Distributed Computing

The Failure Trace Archive: Enabling the comparison of failure measurements and models of distributed systems


Abstract

With the increasing presence, scale, and complexity of distributed systems, resource failures are becoming an important and practical topic of computer science research. While numerous failure models and failure-aware algorithms exist, their comparison has been hampered by the lack of public failure data sets and data processing tools. To facilitate the design, validation, and comparison of fault-tolerant models and algorithms, we have created the Failure Trace Archive (FTA), an online, public repository of failure traces collected from diverse parallel and distributed systems. In this work, we first describe the design of the archive, in particular the standard FTA data format, and the design of a toolbox that facilitates automated analysis of trace data sets. We also discuss the use of the FTA for various current and future purposes. Second, after applying the toolbox to nine failure traces collected from distributed systems used in various application domains (e.g., HPC, Internet operation, and various online applications), we present a comparative analysis of failures in these systems. Our analysis presents statistical insights and typical statistical modeling results for the availability of individual resources in each system. The results underline the need for public availability of trace data from different distributed systems. Last, we show how different interpretations of the meaning of failure data can lead to different conclusions for failure modeling and job scheduling in distributed systems. Our results for these interpretations suggest that existing failure-aware algorithms may need to be revisited when applied to general rather than domain-specific distributed systems.
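To make the idea of per-resource availability statistics concrete, the following is a minimal sketch of the kind of summary such an analysis might compute. The record layout `(node_id, start, end, available)`, the sample values, and the function name `summarize` are illustrative assumptions; they are not the actual FTA data format or toolbox API. The exponential rate shown is the standard maximum-likelihood estimate (number of intervals divided by their total length), used here only as one simple example of a fitted model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical event records: (node_id, start_time, end_time, available).
# This layout and these values are assumptions for illustration only,
# not the real FTA trace format or real trace data.
events = [
    ("n1", 0, 100, True),
    ("n1", 100, 120, False),
    ("n1", 120, 300, True),
    ("n2", 0, 50, True),
    ("n2", 50, 150, False),
    ("n2", 150, 300, True),
]

def summarize(events):
    """Compute simple per-node availability statistics from interval records."""
    up = defaultdict(list)    # durations of availability intervals per node
    down = defaultdict(list)  # durations of unavailability intervals per node
    for node, start, end, available in events:
        (up if available else down)[node].append(end - start)

    summary = {}
    for node in sorted(set(up) | set(down)):
        total_up = sum(up[node])
        total_down = sum(down[node])
        summary[node] = {
            # Fraction of observed time during which the node was available.
            "availability": total_up / (total_up + total_down),
            # Mean length of an availability interval.
            "mean_up": mean(up[node]) if up[node] else None,
            # Maximum-likelihood rate of an exponential fit to the
            # availability interval lengths: n / sum of durations.
            "exp_rate": len(up[node]) / total_up if total_up else None,
        }
    return summary

if __name__ == "__main__":
    for node, stats in summarize(events).items():
        print(node, stats)
```

Whether an exponential model is appropriate for a given trace is an empirical question; the sketch only shows where such a fit would plug in.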
