Runtime support for improving reliability in system software

Abstract

As software becomes increasingly complex, software reliability becomes more and more important. In particular, the reliability of system software is critical to the overall reliability of computer systems, since system software is designed to provide a platform for the application software that runs on top of it. Unfortunately, ensuring the reliability of system software is very challenging, and the defects (bugs) in it can often have a severe impact.

This dissertation proposes using runtime support to improve system software reliability. Runtime support here means extending the runtime software system with functionality useful for reliability-oriented tasks, such as instrumentation-based profiling, runtime analysis, checkpointing/re-execution, scheduling control, and memory layout control. Leveraging runtime support, this dissertation proposes novel methods for bug manifestation, bug detection, bug diagnosis, failure recovery, and error prevention in multiple phases of the software development and deployment cycle.

The preferred phase for detecting and fixing software bugs is pre-release testing. To improve testing effectiveness and efficiency, this dissertation proposes a first method that helps manifest the bugs hidden in system software. Because some bugs always make their way to deployment sites no matter how rigorous the software testing is, this dissertation proposes a second method that helps monitor system software and detect runtime errors. To handle the runtime errors caused by software bugs, this dissertation proposes a third method that helps diagnose the failure, recover the program, and prevent future errors due to the same bugs.

Specifically, we propose a software testing method called 2ndStrike to manifest hidden concurrency typestate bugs in multi-threaded system software. 2ndStrike first profiles certain program runtime events related to typestate and thread synchronization. Based on the logs, 2ndStrike then identifies bug candidates, that is, pairs of events that would cause a typestate violation if their order were reversed. Finally, 2ndStrike re-executes the program in multiple iterations with controlled thread interleaving to manifest the bug candidates.

In addition, we propose a deployment-time monitoring and analysis method called DM-Tracker to detect anomalies in distributed system software running on parallel platforms during production runs. Based on the observation that data movements in parallel programs typically follow certain patterns, our idea is to extract data movement (DM)-based invariants at program runtime and check for violations of these invariants. Such violations indicate potential bugs, such as data races and memory corruption bugs, that manifest themselves in data movements. Using the data movement information, we propose a statistical-rule-based approach to detect anomalies for finding bugs.

Finally, we propose a deployment-time fault tolerance method called First-Aid to recover from failures in system software caused by common memory bugs during production runs and to prevent future errors caused by the same bugs. Upon a failure, First-Aid diagnoses the bug type and identifies the memory objects that trigger the bug. To do so, it rolls the program back to previous checkpoints and uses two types of environmental changes that can prevent or expose memory bug manifestation during re-execution. Based on the diagnosis, First-Aid generates and applies runtime patches to avoid the memory bug and prevent its recurrence.

We have designed and implemented software prototypes for the proposed methods and evaluated them with real-world bugs in large open-source system software packages, such as Apache, MySQL, Mozilla, and MVAPICH. The experimental results show that the methods proposed in this dissertation can help improve the reliability of system software in various scenarios. The results also demonstrate that the runtime support in these methods brings key advantages such as high efficiency, high accuracy, and high usability.
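
The candidate-identification step that the abstract attributes to 2ndStrike (find logged events whose reversed order would violate a typestate) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the dissertation: the file-like typestate automaton in `TRANSITIONS`, the `Event` record, and the functions `is_legal` and `find_candidates`. The sketch also ignores the synchronization events that the abstract says are profiled, so it may report reorderings that locks or other synchronization would in fact rule out.

```python
from dataclasses import dataclass

# Hypothetical typestate automaton for a file-like object:
# (current state, operation) -> next state. A missing entry is a violation.
TRANSITIONS = {
    ("closed", "open"): "opened",
    ("opened", "read"): "opened",
    ("opened", "close"): "closed",
}


@dataclass(frozen=True)
class Event:
    thread: int  # id of the thread that performed the operation
    obj: str     # name of the object the operation applies to
    op: str      # typestate-relevant operation


def is_legal(ops, state="closed"):
    """Return True if the operation sequence is accepted by the automaton."""
    for op in ops:
        key = (state, op)
        if key not in TRANSITIONS:
            return False
        state = TRANSITIONS[key]
    return True


def find_candidates(log):
    """Return pairs of consecutive events on the same object, issued by
    different threads, whose swapped order would violate the typestate."""
    per_object = {}
    for event in log:
        per_object.setdefault(event.obj, []).append(event)

    candidates = []
    for events in per_object.values():
        ops = [e.op for e in events]
        for i in range(len(events) - 1):
            first, second = events[i], events[i + 1]
            if first.thread == second.thread:
                continue  # order within one thread cannot be reversed
            # Replay the log prefix with only this pair swapped.
            swapped_prefix = ops[:i] + [second.op, first.op]
            if not is_legal(swapped_prefix):
                candidates.append((first, second))
    return candidates


# Observed (bug-free) interleaving: thread 1 opens and reads, thread 2 closes.
log = [Event(1, "f", "open"), Event(1, "f", "read"), Event(2, "f", "close")]
candidates = find_candidates(log)
```

For this log the only cross-thread pair is the read by thread 1 followed by the close by thread 2; swapping them yields a read on a closed object, so that pair is reported. A testing tool in the spirit of the abstract would then try to force that reversed order during re-execution.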

Bibliographic record

Author: Gao, Qi
Author affiliation: The Ohio State University
Degree-granting institution: The Ohio State University
Discipline: Computer Science
Degree: Ph.D.
Year: 2010
Pages: 133
Original format: PDF
Language: English
