Future Generation Computer Systems

Optimizing distributed data stream processing by tracing

Abstract

Heterogeneous mobile, sensor, IoT, smart environment, and social networking applications have recently started to produce unbounded, fast, and massive-scale streams of data that have to be processed "on the fly". Systems that process such data have to be enhanced with detection of operational exceptions and with triggers for both automated and manual operator actions. In this paper, we illustrate how tracing in distributed data processing systems can be applied to detect changes in the data and the operational environment, so that heterogeneous data stream processing systems remain efficient under potentially changing data quality and distribution. By tracing individual input records, we can (1) identify outliers in a web crawling and document processing system and use the insights to define URL filtering rules; (2) identify heavy keys, such as NULL, that should be filtered out before processing; (3) give hints to improve the key-based partitioning mechanisms; and (4) measure the limits of overpartitioning when heavy thread-unsafe libraries are imported.

Using Apache Spark as an illustration, we show how various data stream processing efficiency issues can be mitigated or optimized by our distributed tracing engine. We describe and qualitatively compare two different designs, one based on reporting to a distributed database and another based on trace piggybacking. Our prototype implementation consists of wrappers suitable for JVM environments in general, with minimal impact on the source code of the core system. Our tracing framework is the first to solve tracing across the boundaries of multiple systems and to provide detailed performance measurements suitable for automated optimization, not just debugging. (C) 2018 Elsevier B.V. All rights reserved.
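The heavy-key idea in point (2) can be sketched in plain Spark: sample the key distribution, flag keys whose share exceeds a threshold, and filter them before the expensive keyed stage. This is a minimal illustrative sketch, not the paper's implementation; the input shape, the sampling fraction, and the 25% threshold are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object HeavyKeySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("heavy-key-sketch")
      .master("local[*]") // local master for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: (key, payload) pairs; a NULL-like key is None.
    val records = sc.parallelize(Seq(
      (Option.empty[String], "a"), (Some("k1"), "b"),
      (Option.empty[String], "c"), (Some("k2"), "d")
    ))

    // Approximate the key distribution on a sample of the stream.
    val sampleCounts = records
      .sample(withReplacement = false, fraction = 0.5, seed = 42L)
      .map { case (key, _) => (key, 1L) }
      .reduceByKey(_ + _)
      .collect()

    // Flag keys whose sampled share exceeds 25% as "heavy".
    val total = sampleCounts.map(_._2).sum.max(1L)
    val heavyKeys = sampleCounts
      .collect { case (key, c) if c.toDouble / total > 0.25 => key }
      .toSet

    // Filter heavy keys (such as NULL) before the expensive keyed stage.
    val filtered = records.filter { case (key, _) => !heavyKeys.contains(key) }
    println(s"heavy keys: $heavyKeys, remaining records: ${filtered.count()}")

    spark.stop()
  }
}
```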
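The trace piggybacking design can likewise be sketched as an envelope that travels with each record and accumulates per-stage timestamps, instead of reporting each event to a distributed database. The Traced wrapper and through method below are hypothetical names chosen for illustration, not the paper's API.

```scala
object PiggybackTraceSketch {
  // The trace rides along with the record ("piggybacking"): each stage
  // appends a (stage name, nanotime) hop rather than writing to an external store.
  final case class Traced[T](value: T, traceId: Long, hops: List[(String, Long)]) {
    def through[U](stage: String)(f: T => U): Traced[U] =
      Traced(f(value), traceId, (stage, System.nanoTime()) :: hops)
  }

  def main(args: Array[String]): Unit = {
    // Wrap records in the envelope; in Spark, zipWithUniqueId() could assign ids.
    val records = Seq("doc-a", "doc-b").zipWithIndex
      .map { case (r, id) => Traced(r, id.toLong, Nil) }

    // Two illustrative stages; in Spark these would be map() transformations.
    val out = records
      .map(_.through("tokenize")(_.split('-').toList))
      .map(_.through("count")(_.size))

    // Hops accumulate newest-first; reverse to read them in pipeline order.
    out.foreach { t =>
      println(s"trace ${t.traceId}: ${t.hops.reverse.map(_._1).mkString(" -> ")} = ${t.value}")
    }
  }
}
```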