Meeting application deadlines is crucial as cloud applications become increasingly mission-critical and deadline-sensitive. Empirical studies on large-scale clusters reveal that a few slow tasks, known as stragglers, can significantly stretch job execution times. A number of strategies have been proposed to mitigate stragglers by launching speculative or clone task attempts. These strategies often rely on a model-based approach to optimize key operating parameters and are therefore prone to inaccuracy and incompleteness in the underlying models. In this paper, we present LASER, a deep learning approach to speculative execution and replication of deadline-critical jobs. Machine learning has been used successfully to solve a wide variety of classification and prediction problems. In particular, deep neural networks (DNNs), which consist of multiple hidden layers of units between the input and output layers, can provide more accurate regression (prediction) than traditional machine learning algorithms. We compare LASER with SRQuant, a speculative-resume strategy based on quantitative analysis. Both scheduling algorithms aim to improve the Probability of Completion before Deadlines (PoCD), i.e., the probability that MapReduce jobs meet their desired deadlines, while reducing the cost of speculative execution, measured by the total (virtual) machine time. We evaluate and compare the two strategies through testbed experiments. The results show that our two strategies outperform Hadoop without speculation (Hadoop-NS) and Hadoop with speculation (Hadoop-S) by up to 89% in PoCD and 13% in cost.
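As a concrete illustration of the DNN regression mentioned above, the sketch below builds a small regressor with multiple hidden layers between the input and output layers, of the kind used to predict task execution times. The feature set, layer sizes, and training loop are our own assumptions for illustration; they are not taken from the LASER implementation.

```python
# Minimal sketch of a multi-hidden-layer DNN regressor (illustrative only,
# not the paper's LASER model).
import torch
import torch.nn as nn

# Hypothetical per-task input features (e.g., progress rate, input size,
# node load); the actual LASER feature set is not specified here.
NUM_FEATURES = 8

model = nn.Sequential(
    nn.Linear(NUM_FEATURES, 64),  # input layer -> first hidden layer
    nn.ReLU(),
    nn.Linear(64, 64),            # second hidden layer
    nn.ReLU(),
    nn.Linear(64, 1),             # output: predicted (remaining) task runtime
)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(features: torch.Tensor, runtimes: torch.Tensor) -> float:
    """One gradient step of the regression objective."""
    optimizer.zero_grad()
    pred = model(features).squeeze(-1)
    loss = loss_fn(pred, runtimes)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under this framing, a scheduler could compare the predicted remaining runtime of a running task against the job's deadline to decide whether launching a speculative or clone attempt is worth the extra machine time.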