Improving the Quality of Web-Based Data Imputation With Crowd Intervention

Gu Binbin; Li Zhixu; Liu An; Xu Jiajie; Zhao Lei; Zhou Xiaofang

Abstract

Data incompleteness is a common data quality problem in databases. Recent work proposes retrieving missing string values from the World Wide Web to achieve higher imputation recall, but doing so risks introducing web noise into the imputation results. So far there is no effective way to control the quality of web-based data imputation, given the complexity of the quality model and the lack of sufficient ground truth data. In this article, an EM-based quality model is first built for web-based data imputation, which jointly considers three key factors: the precision of web sources, the correlation among web sources, and the precision and recall of the employed extractors. However, the accuracy of the EM-based quality model can suffer when the EM (Expectation Maximization) assumption that "the majority agree on the truth" does not hold. To solve this problem, we introduce crowd intervention to help improve the quality model. A straightforward but expensive approach is to let the crowd identify all such undesirable cases and provide the right imputation values for these blanks; a more crowd-economic approach is to select a small set of blanks for crowd-based imputation, whose results help adjust the EM-based quality model towards a better one. To achieve this, an adaptive blank selection strategy is proposed to select a sequence of blanks for crowd-based imputation. We also study when to stop further crowd intervention so as to balance crowd efficiency and quality improvement. Our experiments on three real-world data collections and one simulated data collection show that the proposed quality model improves the quality of web-based imputation results by more than 15 percent, while our crowd cost saving strategy saves more than 75 percent of the crowd cost.
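
The abstract gives no formulas, so the following is only a minimal illustrative sketch of the general idea, not the authors' model. It assumes a much simpler setting: each web source has a single precision value, sources are treated as independent (source correlation and extractor precision/recall are ignored), and crowd answers are pinned as ground truth. The function names `em_truth_discovery` and `most_uncertain_blank`, the data layout, and the entropy-based selection rule are all hypothetical stand-ins chosen for illustration.

```python
import math
from collections import defaultdict


def em_truth_discovery(claims, crowd_answers=None, n_iter=20, init_precision=0.8):
    """Simplified EM over source precision (an assumption, not the paper's model).

    claims: {blank: {source: value}}
    crowd_answers: {blank: value}, treated here as known truth.
    Returns (posteriors, precision).
    """
    crowd_answers = crowd_answers or {}
    sources = {s for by_src in claims.values() for s in by_src}
    precision = {s: init_precision for s in sources}
    posteriors = {}
    for _ in range(n_iter):
        # E-step: posterior over candidate values for every blank.
        for blank, by_src in claims.items():
            candidates = set(by_src.values())
            if blank in crowd_answers:
                truth = crowd_answers[blank]
                candidates.add(truth)
                posteriors[blank] = {v: 1.0 if v == truth else 0.0 for v in candidates}
                continue
            scores = {}
            for v in candidates:
                log_p = 0.0
                for s, claimed in by_src.items():
                    p = min(max(precision[s], 0.01), 0.99)  # keep log() finite
                    log_p += math.log(p if claimed == v else 1.0 - p)
                scores[v] = log_p
            top = max(scores.values())
            weights = {v: math.exp(sc - top) for v, sc in scores.items()}
            total = sum(weights.values())
            posteriors[blank] = {v: w / total for v, w in weights.items()}
        # M-step: a source's precision is the expected share of its claims that are true.
        hits, counts = defaultdict(float), defaultdict(int)
        for blank, by_src in claims.items():
            for s, claimed in by_src.items():
                hits[s] += posteriors[blank].get(claimed, 0.0)
                counts[s] += 1
        precision = {s: hits[s] / counts[s] for s in sources}
    return posteriors, precision


def most_uncertain_blank(posteriors, already_asked=()):
    """Pick the blank with the highest posterior entropy (a stand-in heuristic)."""
    def entropy(dist):
        return -sum(p * math.log(p) for p in dist.values() if p > 0)
    open_blanks = [b for b in posteriors if b not in already_asked]
    return max(open_blanks, key=lambda b: entropy(posteriors[b]), default=None)


def best_value(dist):
    return max(dist, key=dist.get)


# Toy data: sources A and B always agree with each other; C always disagrees.
claims = {
    "b1": {"A": "x", "B": "x", "C": "y"},
    "b2": {"A": "p", "B": "p", "C": "q"},
    "b3": {"A": "m", "B": "m", "C": "n"},
}
plain, _ = em_truth_discovery(claims)
guided, guided_precision = em_truth_discovery(claims, crowd_answers={"b1": "y", "b2": "q"})
```

In this toy data the unguided run sides with the majority (A and B) on every blank. Once the crowd confirms C's values for `b1` and `b2`, the estimated precision of C rises and the imputation for the unasked blank `b3` changes as well, which is the sense in which a few crowd answers can adjust the whole model.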

Bibliographic record

Source
IEEE Transactions on Knowledge and Data Engineering | 2021, Issue 6 | pp. 2534-2547 | 14 pages
Authors
Gu Binbin; Li Zhixu; Liu An; Xu Jiajie; Zhao Lei; Zhou Xiaofang
Author affiliations

Soochow Univ Inst Artificial Intelligence Sch Comp Sci & Technol Suzhou 215006 Jiangsu Peoples R China|Univ Calif Santa Cruz Dept Comp Sci & Engn Santa Cruz CA 95064 USA;

Soochow Univ Inst Artificial Intelligence Sch Comp Sci & Technol Suzhou 215006 Jiangsu Peoples R China|IFLYTEK Res Suzhou Peoples R China|IFLYTEK State Key Lab Cognit Intelligence Hefei Peoples R China;

Soochow Univ Inst Artificial Intelligence Sch Comp Sci & Technol Suzhou 215006 Jiangsu Peoples R China;

Soochow Univ Inst Artificial Intelligence Sch Comp Sci & Technol Suzhou 215006 Jiangsu Peoples R China;

Soochow Univ Inst Artificial Intelligence Sch Comp Sci & Technol Suzhou 215006 Jiangsu Peoples R China;

Inst Elect & Informat Engn UESTC Guangdong Dongguan 523808 Guangdong Peoples R China|Univ Queensland Sch Informat Technol & Elect Engn Brisbane Qld 4072 Australia;

Original format: PDF
Language: English
Keywords
Correlation; Data models; Web pages; Databases; Data mining; Data collection; Task analysis; Data imputation; web; crowd;
