Conference: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

Achieving Efficient Distributed Scheduling with Message Queues in the Cloud for Many-Task Computing and High-Performance Computing



Abstract

Task scheduling and execution over large-scale distributed systems play an important role in achieving good performance and high system utilization. Due to the explosion of parallelism found in today's hardware, applications need to perform over-decomposition to deliver good performance. This over-decomposition is driving job management systems to support applications with a growing number of finer-grained tasks. Our goal in this work is to provide a compact, lightweight, scalable, and distributed task execution framework (CloudKon) that builds upon cloud computing building blocks (Amazon EC2, SQS, and DynamoDB). Most of today's state-of-the-art job execution systems have predominantly Master/Slaves architectures, which have inherent limitations, such as scalability issues at extreme scales and single points of failure. Distributed job management systems, on the other hand, are complex and employ non-trivial load-balancing algorithms to maintain good utilization. CloudKon is a distributed job management system that can support both HPC and MTC workloads with millions of tasks/jobs. We compare our work with other state-of-the-art job management systems, including Sparrow and MATRIX. The results show that CloudKon delivers better scalability than other state-of-the-art systems for some metrics, all with a significantly smaller code base (5%).
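
The abstract does not describe the framework's internals, so the following is only a minimal sketch of the general idea of queue-based, pull-style scheduling: workers fetch tasks from a shared message queue on their own, so no central master has to assign work or balance load. An in-memory `queue.Queue` stands in for a cloud message queue such as SQS, and the name `run_workers` and its parameters are hypothetical, chosen for illustration.

```python
import queue
import threading


def run_workers(tasks, num_workers=4):
    """Execute zero-argument callables with pull-based workers.

    Assumption: an in-memory queue stands in for a cloud message queue;
    this is an illustrative sketch, not the framework's actual design.
    """
    task_queue = queue.Queue()
    results = {}
    results_lock = threading.Lock()

    # A client submits tasks by pushing them onto the shared queue.
    for task_id, func in enumerate(tasks):
        task_queue.put((task_id, func))

    def worker():
        # Each worker pulls the next available task by itself, so load
        # balances naturally: a worker that finishes early simply pulls again.
        while True:
            try:
                task_id, func = task_queue.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            value = func()
            with results_lock:
                results[task_id] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Return results in submission order.
    return [results[i] for i in range(len(tasks))]


if __name__ == "__main__":
    print(run_workers([lambda n=n: n * n for n in range(10)]))
```

Because workers only interact with the queue, adding capacity in this sketch means starting more workers; a real deployment would also have to handle message visibility, duplicate delivery, and worker failure, none of which are modeled here.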
