SDLER: stacked dedupe learning for entity resolution in big data era

Ngueilbaye Alladoumbaye; Wang Hongzhi; Mahamat Daouda Ahmat; Elgendy Ibrahim A.


Abstract

In the Big Data era, Entity Resolution (ER) faces many challenges, such as the need for high scalability, the coexistence of complex similarity metrics, tautonymy and synonymy, and the requirement of Data Quality Evaluation. Moreover, despite more than seventy years of development efforts, there is still a high demand for democratizing ER by reducing human participation in tuning parameters, labeling data, defining blocking functions, and feature engineering. This study aimed to explore a novel Stacked Dedupe Learning ER system with high accuracy and efficiency. The study evaluated sophisticated composition methods, such as Bidirectional Recurrent Neural Networks (BiRNNs) with Long Short-Term Memory (LSTM) hidden units, to convert each tuple into a Word Representation Distribution that captures similarities among tuples. For cases where pre-trained word embeddings were not available, the study also considered ways to learn and tune a Word Representation Distribution customized for ER tasks under different scenarios. In addition, a Locality Sensitive Hashing (LSH) based blocking approach was assessed; it considers all attributes of a tuple and produces smaller blocks than traditional methods that use only a few attributes. The algorithm was tested on multiple datasets, namely benchmark and multilingual data. The experimental results showed that Stacked Dedupe Learning achieves high quality and good performance, and scales well compared with existing solutions.
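
The abstract does not describe how the LSH-based blocking is implemented. As a rough illustration of the general idea only (blocking on all attributes of a tuple rather than a few hand-picked ones), the sketch below uses MinHash signatures with banding over word tokens. This is a swapped-in, textbook LSH scheme and not the paper's method, which operates on learned representations; the function names, the record layout, and the `num_hashes` and `bands` parameters are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def tokens(record):
    """Collect lowercase word tokens from every attribute of a record (a dict)."""
    return {w for value in record.values() for w in str(value).lower().split()}


def minhash(token_set, num_hashes):
    """MinHash signature: for each seeded hash, the minimum hash value over the tokens."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(), "big"
            )
            for t in token_set
        ))
    return signature


def lsh_blocks(records, num_hashes=32, bands=16):
    """Group record ids into blocks: two records share a block if any band of
    their signatures is identical. Parameters are illustrative, not tuned."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rid, record in records.items():
        toks = tokens(record)
        if not toks:
            continue  # nothing to hash for an empty record
        sig = minhash(toks, num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(rid)
    # Only buckets holding at least two records yield candidate comparisons.
    return [ids for ids in buckets.values() if len(ids) > 1]


def candidate_pairs(blocks):
    """Deduplicated set of record-id pairs that fall in at least one common block."""
    return {pair for block in blocks for pair in combinations(sorted(block), 2)}


# Hypothetical toy records: "a" and "b" differ only in letter case.
records = {
    "a": {"name": "Alice Smith", "city": "Harbin"},
    "b": {"name": "alice smith", "city": "HARBIN"},
    "c": {"name": "Bob Jones", "city": "Ndjamena"},
}
pairs = candidate_pairs(lsh_blocks(records))
```

Only pairs that share a block are passed on to the more expensive matching step, which is what makes blocking useful at scale. More bands with fewer rows each raise recall at the cost of larger candidate sets.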

Bibliographic record

Source
Journal of Supercomputing | 2021, Issue 10 | pp. 10959-10983 | 25 pages
Authors
Ngueilbaye Alladoumbaye; Wang Hongzhi; Mahamat Daouda Ahmat; Elgendy Ibrahim A.
Author affiliations

Harbin Inst Technol Sch Comp Sci & Technol POB 75 Harbin Peoples R China;

Harbin Inst Technol Sch Comp Sci & Technol POB 75 Harbin Peoples R China;

Univ Ndjamena Tchad Dept Informat BP 1117 Ave Mobutu Ndjamena Chad;

Harbin Inst Technol Sch Comp Sci & Technol POB 75 Harbin Peoples R China|Menoufia Univ Fac Comp & Informat Dept Comp Sci Menoufia Egypt;

Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
Original format: PDF
Language of text: English
Keywords
Bidirectional RNN; Big data; Data quality; Entity resolution; Stacked Dedupe Learning (SDL); Word Representation Distribution (WRD);

Similar literature

1. Linked Data Entity Resolution System Enhanced by Configuration Learning Algorithm [J]. Khai Nguyen, Ryutaro Ichise. IEICE Transactions on Information and Systems. 2016, Issue 6.
2. Leveraging active learning to reduce human effort in the generation of ground-truth for entity resolution [J]. Diego Araújo, Carlos Eduardo Santos Pires, Dimas Cassimiro Nascimento. Computational Intelligence. 2020, Issue 2.
3. The Interaction of Data, Data Structures, and Software in Entity Resolution Systems [J]. Yinle Zhou, John Talburt, Eric D. Nelson. Software Quality Professional. 2011, Issue 4.
4. Tutorial: Uncertain Entity Resolution: Re-evaluating Entity Resolution in the Big Data Era [C]. Avigdor Gal. International Conference on Very Large Data Bases. 2014.
5. Interactive data integration and entity resolution for exploratory visual data analytics [D]. Morton, Kristi. 2015.
6. Optimized Dual Threshold Entity Resolution For Electronic Health Record Databases – Training Set Size And Active Learning [O]. Erel Joffe, Michael J. Byrne, Phillip Reeder, 2013
7. An Iterative, Self-Assessing Entity Resolution System: First Steps toward a Data Washing Machine [O]. John R. Talburt, Awaad K., Daniel Pullen, 2020
