Proceedings of the IEEE

Scaling-Up Distributed Processing of Data Streams for Machine Learning



Abstract

Emerging applications of machine learning in numerous areas including online social networks, remote sensing, Internet-of-Things (IoT) systems, smart grids, and more involve continuous gathering of and learning from streams of data samples. Real-time incorporation of streaming data into the learned machine learning models is essential for improved inference in these applications. Furthermore, these applications often involve data that are either inherently gathered at geographically distributed entities for physical reasons, for example, IoT systems and smart grids, or intentionally distributed across multiple computing machines for memory, storage, computational, and/or privacy reasons. Training of machine learning models in this distributed, streaming setting requires solving stochastic optimization (SO) problems in a collaborative manner over communication links between the physical entities. When the streaming data rate is high compared with the processing capabilities of individual computing entities and/or the rate of the communication links, this poses a challenging question: How can one best leverage the incoming data for distributed training of machine learning models under constraints on computing capabilities and/or communication rate? A large body of research in distributed online optimization has emerged in recent decades to tackle this and related problems. This article reviews recently developed methods that focus on large-scale distributed SO in the compute- and bandwidth-limited regimes, with an emphasis on convergence analysis that explicitly accounts for the mismatch between computation, communication, and streaming rates and that provides sufficient conditions for order-optimal convergence. In particular, it focuses on methods that solve: 1) distributed stochastic convex problems and 2) distributed principal component analysis, which is a nonconvex problem with a geometric structure that permits global convergence. For such methods, this article discusses recent advances in distributed algorithmic design in the face of high-rate streaming data. Furthermore, it reviews the theoretical guarantees underlying these methods, which show that there exist regimes in which systems can learn from distributed processing of streaming data at order-optimal rates, nearly as fast as if all the data were processed on a single super-powerful machine.
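Illustrative note: the central issue above is matching the data-arrival rate to per-node computation and communication rates. A minimal sketch of the general kind of scheme the article surveys, assuming a synchronous distributed mini-batch SGD scheme rather than any specific algorithm from the article, is given below: each node buffers the samples that arrive between two communication rounds into a local mini-batch, so a higher streaming rate simply translates into a larger batch per round. All names (num_nodes, batch_per_round, step_size) and the synthetic least-squares stream are assumptions made for this example only.

# Hedged sketch: synchronous distributed mini-batch SGD on streaming data.
# Not the article's specific algorithms; an illustration of the setting only.
import numpy as np

rng = np.random.default_rng(0)
dim, num_nodes, batch_per_round, rounds, step_size = 10, 4, 32, 200, 0.05
w_true = rng.normal(size=dim)   # ground-truth model generating the streams
w = np.zeros(dim)               # shared iterate after each averaging round

def stream_sample(node, size):
    """Simulate `size` fresh samples arriving at one node between rounds."""
    X = rng.normal(size=(size, dim))
    y = X @ w_true + 0.1 * rng.normal(size=size)
    return X, y

for t in range(rounds):
    local_grads = []
    for node in range(num_nodes):
        # Each node turns the samples buffered since the last communication
        # round into one mini-batch gradient of the least-squares loss.
        X, y = stream_sample(node, batch_per_round)
        grad = X.T @ (X @ w - y) / batch_per_round
        local_grads.append(grad)
    # One communication round: average the local gradients (e.g., via a
    # parameter server or all-reduce) and take a single SGD step.
    w = w - step_size * np.mean(local_grads, axis=0)

print("estimation error:", np.linalg.norm(w - w_true))

In this sketch, the ratio of streaming rate to communication rate determines batch_per_round; the convergence analyses reviewed in the article characterize when such rate-matched mini-batching still attains order-optimal error as a function of the total number of processed samples.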

