Declarative Data Cleaning: Language, Model, and Algorithms

Abstract

Data cleaning, the removal of inconsistencies and errors from original data sets, is a well-known problem in decision support systems and data warehouses. It arises regardless of the application: relational database joining, web-related, or scientific. In all cases, existing ETL (Extraction, Transformation, Loading) and data cleaning tools are insufficient for writing data cleaning programs. The main challenge is the design and implementation of a data flow graph that generates clean data effectively and efficiently. Needed improvements to the current state of the art include (i) a clear separation between the logical specification of data transformations and their physical implementation, (ii) an explanation of the reasoning behind cleaning results, and (iii) interactive facilities to tune a data cleaning program. This paper presents a language, an execution model, and algorithms that enable users to express data cleaning specifications declaratively and perform the cleaning efficiently. We use as an example a set of bibliographic references used to construct the Citeseer Web site. The underlying data integration problem is to derive structured and clean textual records so that meaningful queries can be performed. Experimental results assess the proposed framework for data cleaning.