The proliferation of big data and big computing has boosted the adoption of machine learning across many application domains, and several distributed machine learning platforms have emerged recently. We investigate the architectural design of these platforms, as the design decisions inevitably affect their performance, scalability, and availability. We study Spark as a representative dataflow system, PMLS as a parameter-server system, and TensorFlow and MXNet as examples of more advanced dataflow systems. Taking a distributed-systems perspective, we analyze the communication and control bottlenecks of each approach, and we also consider fault tolerance and ease of development on these platforms. To provide a quantitative evaluation, we measure the performance of these systems on basic machine learning tasks: logistic regression and an image classification example on the MNIST dataset.