Scientific workflows in collaborative cloud environments are becoming increasingly popular, and there is an urgent need to reduce the large volume of data transferred across geo-distributed data centers during workflow execution. By exploiting data dependencies, we propose a two-stage data placement strategy and a task scheduling strategy for efficient workflow execution. At workflow build time, the most closely related datasets are placed in the same data center according to the data dependencies between them; at runtime, each task is scheduled to the data center holding most of the data it depends on, and newly generated datasets are placed in the data center with which they have the strongest dependency. Experimental results show that the proposed strategy significantly reduces the volume of data transferred among data centers, thereby improving the performance of scientific workflow execution and lowering the cost of doing science in the cloud.
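The build-time stage described above can be illustrated with a minimal sketch: datasets with the strongest pairwise dependency are greedily co-located in the same data center, subject to a per-center capacity limit. The function name, the dependency measure, and the greedy pair-merging heuristic here are illustrative assumptions, not the paper's exact algorithm.

```python
def place_datasets(sizes, dependency, capacity, n_centers):
    """Greedily assign each dataset to a data center.

    sizes:      {dataset: size}
    dependency: {(a, b): weight}, pairwise dependency between datasets
    capacity:   per-center storage limit
    n_centers:  number of data centers
    Returns {dataset: center_index}.
    """
    placement = {}
    used = [0] * n_centers  # storage consumed at each center

    # Process dataset pairs from strongest to weakest dependency,
    # trying to keep each pair in the same data center.
    for a, b in sorted(dependency, key=dependency.get, reverse=True):
        for d in (a, b):
            if d in placement:
                continue
            partner = b if d == a else a
            # Prefer the partner's center if it is already placed,
            # then fall back to the least-loaded centers.
            prefer = placement.get(partner)
            candidates = [prefer] if prefer is not None else []
            candidates += sorted(range(n_centers), key=lambda c: used[c])
            for c in candidates:
                if used[c] + sizes[d] <= capacity:
                    placement[d] = c
                    used[c] += sizes[d]
                    break

    # Datasets that appear in no dependency pair go to the emptiest center.
    for d in sizes:
        if d not in placement:
            c = min(range(n_centers), key=lambda c: used[c])
            placement[d] = c
            used[c] += sizes[d]
    return placement
```

For example, with three equal-size datasets where `d1` and `d2` are strongly dependent, the sketch keeps `d1` and `d2` together and spills `d3` to another center once capacity is reached; the runtime stage would then schedule each task to the center holding most of its input data.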