CleanM: An Optimizable Query Language for Unified Scale-Out Data Cleaning

International Conference on Very Large Data Bases

Abstract

Data cleaning has become an indispensable part of data analysis due to the increasing amount of dirty data. Data scientists spend most of their time preparing dirty data before it can be used for data analysis. At the same time, the existing tools that attempt to automate the data cleaning procedure typically focus on a specific use case and operation. Still, even such specialized tools exhibit long running times or fail to process large datasets. Therefore, from a user's perspective, one is forced to use a different, potentially inefficient tool for each category of errors. This paper addresses the coverage and efficiency problems of data cleaning. It introduces CleanM (pronounced clean'em), a language that can express multiple types of cleaning operations. CleanM goes through a three-level translation process for optimization purposes; a different family of optimizations is applied at each abstraction level. Thus, CleanM can express complex data cleaning tasks, optimize them in a unified way, and deploy them in a scale-out fashion. We validate the applicability of CleanM by using it on top of CleanDB, a newly designed and implemented framework that can query heterogeneous data. When compared to existing data cleaning solutions, CleanDB a) covers more data corruption cases, b) scales better and can handle cases for which its competitors are unable to terminate, and c) uses a single interface for querying and for data cleaning.