Proceedings of the IEEE

Scaling-Up Distributed Processing of Data Streams for Machine Learning



Abstract

Emerging applications of machine learning in numerous areas including online social networks, remote sensing, Internet-of-Things (IoT) systems, smart grids, and more involve continuous gathering of and learning from streams of data samples. Real-time incorporation of streaming data into the learned machine learning models is essential for improved inference in these applications. Furthermore, these applications often involve data that are either inherently gathered at geographically distributed entities for physical reasons, for example, IoT systems and smart grids, or intentionally distributed across multiple computing machines for memory, storage, computational, and/or privacy reasons. Training of machine learning models in this distributed, streaming setting requires solving stochastic optimization (SO) problems in a collaborative manner over communication links between the physical entities. When the streaming data rate is high compared with the processing capabilities of individual computing entities and/or the rate of the communication links, this poses a challenging question: How can one best leverage the incoming data for distributed training of machine learning models under constraints on computing capabilities and/or communication rate? A large body of research in distributed online optimization has emerged in recent decades to tackle this and related problems. This article reviews recently developed methods that focus on large-scale distributed SO in the compute- and bandwidth-limited regimes, with an emphasis on convergence analysis that explicitly accounts for the mismatch between computation, communication, and streaming rates and that provides sufficient conditions for order-optimal convergence. In particular, it focuses on methods that solve: 1) distributed stochastic convex problems and 2) distributed principal component analysis, which is a nonconvex problem with a geometric structure that permits global convergence. For such methods, this article discusses recent advances in distributed algorithmic design in the face of high-rate streaming data. Furthermore, it reviews the theoretical guarantees underlying these methods, which show that there exist regimes in which systems can learn from distributed processing of streaming data at order-optimal rates, nearly as fast as if all the data were processed on a single super-powerful machine.
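Illustrative note: the central issue above is matching the data-arrival rate to per-node computation and communication rates. A minimal sketch of the general kind of scheme the article surveys, assuming a synchronous distributed mini-batch SGD scheme rather than any specific algorithm from the article, is given below: each node buffers the samples that arrive between two communication rounds into a local mini-batch, so a higher streaming rate simply translates into a larger batch per round. All names (num_nodes, batch_per_round, step_size) and the synthetic least-squares stream are assumptions made for this example only.

# Hedged sketch: synchronous distributed mini-batch SGD on streaming data.
# Not the article's specific algorithms; an illustration of the setting only.
import numpy as np

rng = np.random.default_rng(0)
dim, num_nodes, batch_per_round, rounds, step_size = 10, 4, 32, 200, 0.05
w_true = rng.normal(size=dim)   # ground-truth model generating the streams
w = np.zeros(dim)               # shared iterate after each averaging round

def stream_sample(node, size):
    """Simulate `size` fresh samples arriving at one node between rounds."""
    X = rng.normal(size=(size, dim))
    y = X @ w_true + 0.1 * rng.normal(size=size)
    return X, y

for t in range(rounds):
    local_grads = []
    for node in range(num_nodes):
        # Each node turns the samples buffered since the last communication
        # round into one mini-batch gradient of the least-squares loss.
        X, y = stream_sample(node, batch_per_round)
        grad = X.T @ (X @ w - y) / batch_per_round
        local_grads.append(grad)
    # One communication round: average the local gradients (e.g., via a
    # parameter server or all-reduce) and take a single SGD step.
    w = w - step_size * np.mean(local_grads, axis=0)

print("estimation error:", np.linalg.norm(w - w_true))

In this sketch, the ratio of streaming rate to communication rate determines batch_per_round; the convergence analyses reviewed in the article characterize when such rate-matched mini-batching still attains order-optimal error as a function of the total number of processed samples.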

