Journal: IEEE Transactions on Knowledge and Data Engineering

Improving the Quality of Web-Based Data Imputation With Crowd Intervention

Abstract

Data incompleteness is a common data quality problem in databases. Recent work proposes retrieving missing string values from the World Wide Web to achieve higher imputation recall, but this risks introducing web noise into the imputation results. So far there has been no effective way to control the quality of web-based data imputation, given the complexity of the quality model and the lack of sufficient ground-truth data. In this article, an EM-based quality model is first built for web-based data imputation. It jointly considers three key factors: the precision of web sources, the correlation among web sources, and the precision and recall of the employed extractors. However, the accuracy of the EM-based quality model can suffer when the EM (Expectation Maximization) assumption that "the majority agree on the truth" does not hold. To solve this problem, we introduce crowd intervention to help improve the quality model. A straightforward but expensive approach is to let the crowd identify all such undesirable cases and provide the right imputation values for these blanks. A more crowd-economical approach is to select a small set of blanks for crowd-based imputation, whose results help adjust the EM-based quality model towards a better one. To achieve this, an adaptive blank selection strategy is proposed to select a sequence of blanks for crowd-based imputation. We also work on finding a proper time to stop further crowd intervention, balancing crowd efficiency against quality improvement. Experiments on three real-world data collections and one simulated data collection show that the proposed quality model improves the quality of web-based imputation results by more than 15 percent, while the crowd cost-saving strategy reduces crowd cost by more than 75 percent.
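The interplay between the majority assumption and crowd intervention can be illustrated with a small sketch. This is not the paper's quality model: it is a simplified, EM-style alternation that tracks only source precision (no source correlation, no extractor precision or recall), and the function name `em_style_imputation`, its parameters, and the toy data are all hypothetical.

```python
from collections import defaultdict

def em_style_imputation(claims, crowd_answers=None, init_precision=0.8,
                        iterations=20, eps=1e-6):
    """Simplified EM-style truth discovery (illustrative assumption only).

    claims:        {blank: {source: value}} candidate values found on the web
    crowd_answers: {blank: value} values supplied by the crowd, treated as truth
    Returns (imputed values per blank, estimated precision per source).
    """
    crowd_answers = crowd_answers or {}
    sources = {s for votes in claims.values() for s in votes}
    precision = {s: init_precision for s in sources}
    belief = {}

    for _ in range(iterations):
        # E-step (simplified): weight each candidate value by the current
        # precision of the sources that assert it, then normalize.
        for blank, votes in claims.items():
            if blank in crowd_answers:
                # Crowd-provided values are pinned as ground truth.
                belief[blank] = {crowd_answers[blank]: 1.0}
                continue
            scores = defaultdict(float)
            for source, value in votes.items():
                scores[value] += precision[source]
            total = sum(scores.values())
            belief[blank] = {v: sc / total for v, sc in scores.items()} if total else {}

        # M-step (simplified): a source's precision is the average belief
        # in the values it claimed, clamped away from zero.
        for s in sources:
            probs = [belief[b].get(votes[s], 0.0)
                     for b, votes in claims.items() if s in votes]
            precision[s] = max(eps, sum(probs) / len(probs))

    imputed = {b: max(p, key=p.get) for b, p in belief.items() if p}
    return imputed, precision

# Toy data: sources B and C always agree with each other, source A disagrees.
claims = {
    "b1": {"A": "x", "B": "y", "C": "y"},
    "b2": {"A": "p", "B": "q", "C": "q"},
    "b3": {"A": "m", "B": "n", "C": "n"},
}

majority_only, _ = em_style_imputation(claims)
with_crowd, crowd_precision = em_style_imputation(
    claims, crowd_answers={"b1": "x", "b2": "p"})
```

In this toy data, the run without crowd input follows the majority for every blank, whereas pinning two blanks to crowd answers that side with source A raises A's estimated precision enough to change the value chosen for the third blank. Choosing which blanks to send to the crowd, and when to stop, is the subject of the paper's adaptive selection strategy and is not modeled here.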

Bibliographic details

  • Source
  • Author affiliations

    Soochow Univ Inst Artificial Intelligence Sch Comp Sci & Technol Suzhou 215006 Jiangsu Peoples R China|Univ Calif Santa Cruz Dept Comp Sci & Engn Santa Cruz CA 95064 USA;

    Soochow Univ Inst Artificial Intelligence Sch Comp Sci & Technol Suzhou 215006 Jiangsu Peoples R China|IFLYTEK Res Suzhou Peoples R China|IFLYTEK State Key Lab Cognit Intelligence Hefei Peoples R China;

    Soochow Univ Inst Artificial Intelligence Sch Comp Sci & Technol Suzhou 215006 Jiangsu Peoples R China;

    Soochow Univ Inst Artificial Intelligence Sch Comp Sci & Technol Suzhou 215006 Jiangsu Peoples R China;

    Soochow Univ Inst Artificial Intelligence Sch Comp Sci & Technol Suzhou 215006 Jiangsu Peoples R China;

    Inst Elect & Informat Engn UESTC Guangdong Dongguan 523808 Guangdong Peoples R China|Univ Queensland Sch Informat Technol & Elect Engn Brisbane Qld 4072 Australia;

  • Indexing information
  • Original format: PDF
  • Text language: English
  • Chinese Library Classification (CLC)
  • Keywords

    Correlation; Data models; Web pages; Databases; Data mining; Data collection; Task analysis; Data imputation; web; crowd;

