Future Generation Computer Systems

Optimizing distributed data stream processing by tracing

Abstract

Heterogeneous mobile, sensor, IoT, smart environment, and social networking applications have recently started to produce unbounded, fast, and massive-scale streams of data that have to be processed "on the fly". Systems that process such data have to be enhanced with detection of operational exceptions and with triggers for both automated and manual operator actions. In this paper, we illustrate how tracing in distributed data processing systems can be applied to detect changes in the data and the operational environment, so that heterogeneous data stream processing systems remain efficient under potentially changing data quality and distribution. By tracing individual input records, we can (1) identify outliers in a web crawling and document processing system and use the insights to define URL filtering rules; (2) identify heavy keys, such as NULL, that should be filtered out before processing; (3) give hints to improve the key-based partitioning mechanisms; and (4) measure the limits of overpartitioning when heavy thread-unsafe libraries are imported.

Using Apache Spark as an illustration, we show how various data stream processing efficiency issues can be mitigated or optimized by our distributed tracing engine. We describe and qualitatively compare two different designs, one based on reporting to a distributed database and another based on trace piggybacking. Our prototype implementation consists of wrappers suitable for JVM environments in general, with minimal impact on the source code of the core system. Our tracing framework is the first to solve tracing across the boundaries of multiple systems and to provide detailed performance measurements suitable for automated optimization, not just debugging. (C) 2018 Elsevier B.V. All rights reserved.
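The heavy-key idea in point (2) can be sketched in plain Spark: sample the key distribution, flag keys whose share exceeds a threshold, and filter them before the expensive keyed stage. This is a minimal illustrative sketch, not the paper's implementation; the input shape, the sampling fraction, and the 25% threshold are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object HeavyKeySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("heavy-key-sketch")
      .master("local[*]") // local master for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: (key, payload) pairs; a NULL-like key is None.
    val records = sc.parallelize(Seq(
      (Option.empty[String], "a"), (Some("k1"), "b"),
      (Option.empty[String], "c"), (Some("k2"), "d")
    ))

    // Approximate the key distribution on a sample of the stream.
    val sampleCounts = records
      .sample(withReplacement = false, fraction = 0.5, seed = 42L)
      .map { case (key, _) => (key, 1L) }
      .reduceByKey(_ + _)
      .collect()

    // Flag keys whose sampled share exceeds 25% as "heavy".
    val total = sampleCounts.map(_._2).sum.max(1L)
    val heavyKeys = sampleCounts
      .collect { case (key, c) if c.toDouble / total > 0.25 => key }
      .toSet

    // Filter heavy keys (such as NULL) before the expensive keyed stage.
    val filtered = records.filter { case (key, _) => !heavyKeys.contains(key) }
    println(s"heavy keys: $heavyKeys, remaining records: ${filtered.count()}")

    spark.stop()
  }
}
```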
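The trace piggybacking design can likewise be sketched as an envelope that travels with each record and accumulates per-stage timestamps, instead of reporting each event to a distributed database. The Traced wrapper and through method below are hypothetical names chosen for illustration, not the paper's API.

```scala
object PiggybackTraceSketch {
  // The trace rides along with the record ("piggybacking"): each stage
  // appends a (stage name, nanotime) hop rather than writing to an external store.
  final case class Traced[T](value: T, traceId: Long, hops: List[(String, Long)]) {
    def through[U](stage: String)(f: T => U): Traced[U] =
      Traced(f(value), traceId, (stage, System.nanoTime()) :: hops)
  }

  def main(args: Array[String]): Unit = {
    // Wrap records in the envelope; in Spark, zipWithUniqueId() could assign ids.
    val records = Seq("doc-a", "doc-b").zipWithIndex
      .map { case (r, id) => Traced(r, id.toLong, Nil) }

    // Two illustrative stages; in Spark these would be map() transformations.
    val out = records
      .map(_.through("tokenize")(_.split('-').toList))
      .map(_.through("count")(_.size))

    // Hops accumulate newest-first; reverse to read them in pipeline order.
    out.foreach { t =>
      println(s"trace ${t.traceId}: ${t.hops.reverse.map(_._1).mkString(" -> ")} = ${t.value}")
    }
  }
}
```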