Journal of Parallel and Distributed Computing

A resource management and fault tolerance services in grid computing


Abstract

In grid computing, resource management and fault tolerance services are important issues. The availability of the resources selected for job execution is a primary factor in determining computing performance. In this paper, we propose a resource manager for optimal resource selection. Using a genetic algorithm, our resource manager automatically selects, from among the candidate resources, the set of resources that achieves optimal performance. Typically, the probability of a failure is higher in grid computing than in traditional parallel computing, and the failure of a resource can be fatal to job execution. Therefore, a fault tolerance service is essential in computational grids. Grid services are also often expected to meet minimum levels of Quality of Service (QoS) for desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the conventional notion of failures in distributed systems in order to provide a fault tolerance service that deals with various types of resource failures, including process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures, and (3) the fault manager guarantees that submitted jobs complete, and job migration improves the performance of job execution even if some failures occur.
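The abstract does not describe the chromosome encoding, fitness function, or genetic operators used by the resource manager, so the following is only a minimal sketch of how a genetic algorithm might pick a fixed-size set of resources from a candidate pool. Everything here is an assumption made for illustration: the function name `select_resources`, the per-resource `speeds` and `availabilities` inputs, and the toy fitness (sum of speed times availability) are hypothetical and are not taken from the paper.

```python
import random


def select_resources(speeds, availabilities, k, generations=60, pop_size=30, seed=0):
    """Pick k resource indices with a simple genetic algorithm (illustrative only)."""
    rng = random.Random(seed)
    n = len(speeds)

    def fitness(subset):
        # Hypothetical objective: total speed weighted by availability.
        return sum(speeds[i] * availabilities[i] for i in subset)

    def random_subset():
        return tuple(sorted(rng.sample(range(n), k)))

    def crossover(a, b):
        # Child draws k distinct resources from the union of both parents.
        pool = sorted(set(a) | set(b))
        return tuple(sorted(rng.sample(pool, k)))

    def mutate(subset, rate=0.2):
        # With probability `rate`, swap one chosen resource for an unchosen one.
        if k < n and rng.random() < rate:
            chosen = list(subset)
            unused = [i for i in range(n) if i not in subset]
            chosen[rng.randrange(k)] = rng.choice(unused)
            return tuple(sorted(chosen))
        return subset

    population = [random_subset() for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism: keep the better half, refill with mutated offspring.
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite)))
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    return max(population, key=fitness)


if __name__ == "__main__":
    # Made-up candidate pool: relative speed and estimated availability per resource.
    speeds = [1.0, 2.5, 0.8, 3.0, 1.7]
    availabilities = [0.99, 0.60, 0.95, 0.90, 0.85]
    print(select_resources(speeds, availabilities, k=2))
```

Weighting speed by availability reflects the abstract's point that availability of the selected resources drives performance, but a real fitness function would likely also account for job characteristics and network cost, which this sketch leaves out.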
