Journal: IEEE Transactions on Parallel and Distributed Systems

Cost-Aware Big Data Processing Across Geo-Distributed Datacenters

Abstract

With the globalization of services, organizations continuously produce large volumes of data that need to be analysed across geo-dispersed locations. The traditional centralized approach of moving all data to a single cluster is inefficient or infeasible because of limitations such as the scarcity of wide-area bandwidth and the low-latency requirements of data processing. Processing big data across geo-distributed datacenters has therefore continued to gain popularity in recent years. However, managing distributed MapReduce computations across geo-distributed datacenters poses a number of technical challenges: how to allocate data among a selection of geo-distributed datacenters to reduce the communication cost, how to determine a Virtual Machine (VM) provisioning strategy that offers high performance at low cost, and what criteria should be used to select a datacenter as the final reducer for big data analytics jobs. In this paper, these challenges are addressed by balancing bandwidth cost, storage cost, computing cost, migration cost, and latency cost between the two MapReduce phases across datacenters. We formulate this complex cost optimization problem for data movement, resource provisioning, and reducer selection as a joint stochastic integer nonlinear optimization problem that minimizes the five cost factors simultaneously. We integrate the Lyapunov framework into our study and design an efficient online algorithm that minimizes the long-term time-averaged operation cost. Theoretical analysis shows that our online algorithm provides a near-optimal solution with a provable gap and guarantees that data processing completes within pre-defined bounded delays. Experiments on the WorldCup98 web site trace validate the theoretical analysis and demonstrate that our approach is close to the offline-optimal performance and superior to some representative approaches.
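The abstract does not spell out the online algorithm, so the following is only a generic illustration of the Lyapunov drift-plus-penalty idea it refers to, not the paper's method. It assumes a single backlog queue at one datacenter and a small, hypothetical menu of (service, cost) provisioning options; the names `drift_plus_penalty_step`, `simulate`, `options`, and the trade-off parameter `v` are all introduced here for illustration. In each time slot the controller picks the option minimizing `v * cost - backlog * service`, so a larger `v` favors lower cost while tolerating a larger, but still bounded, backlog.

```python
def drift_plus_penalty_step(backlog, options, v):
    """Choose one (service, cost) option for the current slot.

    Minimizes v * cost - backlog * service, the per-slot
    drift-plus-penalty expression for a single queue.
    """
    return min(options, key=lambda opt: v * opt[1] - backlog * opt[0])


def simulate(arrivals, options, v):
    """Run the controller over a sequence of per-slot arrivals.

    Returns (time-averaged cost, largest backlog observed).
    """
    backlog = 0.0
    total_cost = 0.0
    max_backlog = 0.0
    for arrived in arrivals:
        service, cost = drift_plus_penalty_step(backlog, options, v)
        total_cost += cost
        # Standard queue update: serve first, then admit new arrivals.
        backlog = max(backlog - service, 0.0) + arrived
        max_backlog = max(max_backlog, backlog)
    return total_cost / len(arrivals), max_backlog


if __name__ == "__main__":
    # Hypothetical options: (data units served per slot, cost per slot).
    options = [(0.0, 0.0), (2.0, 1.0), (5.0, 4.0)]
    arrivals = [1.0] * 1000  # one data unit arrives every slot
    for v in (0.0, 10.0):
        avg_cost, worst_backlog = simulate(arrivals, options, v)
        print(f"v={v}: avg cost={avg_cost:.3f}, max backlog={worst_backlog}")
```

Running the sketch with a small and a large `v` shows the usual trade-off in this family of algorithms: raising `v` lowers the time-averaged cost at the price of a larger worst-case backlog, which corresponds to the cost-versus-delay balance the abstract describes.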