Journal: Journal of Supercomputing

SDLER: stacked dedupe learning for entity resolution in big data era



Abstract

In the Big Data era, Entity Resolution (ER) faces many challenges, such as the need for high scalability, the coexistence of complex similarity metrics, tautonymy and synonymy, and the requirement of data quality evaluation. Moreover, despite more than seventy years of development effort, there is still a strong demand for democratizing ER by reducing human involvement in parameter tuning, data labeling, defining blocking functions, and feature engineering. This study aimed to explore a novel Stacked Dedupe Learning ER system with high accuracy and efficiency. The study evaluated sophisticated composition methods, such as Bidirectional Recurrent Neural Networks (BiRNNs) and Long Short-Term Memory (LSTM) hidden units, to convert each tuple into a distributed word representation that captures similarities among tuples. Where pre-trained word embeddings were not available, ways to learn and tune distributed word representations customized for ER tasks under different scenarios were also considered. In addition, a Locality Sensitive Hashing (LSH) based blocking approach was assessed; it considers all attributes of a tuple and produces smaller blocks than traditional methods that use only a few attributes. The algorithm was tested on multiple datasets, namely benchmark and multilingual data. The experimental results showed that Stacked Dedupe Learning achieves high quality and good performance, and scales well compared with existing solutions.
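
To make the LSH-based blocking idea more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say which LSH family is used or what exactly is hashed, so this example assumes MinHash with banding over the tokens of all attributes of a tuple; the paper may instead hash learned tuple representations. All function names and parameters (tokens, minhash_signature, lsh_blocks, candidate_pairs, num_hashes, bands) are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def tokens(record):
    """Lower-cased tokens drawn from every attribute value of a tuple."""
    return {tok for value in record.values() for tok in str(value).lower().split()}


def _hash(seed, token):
    # Deterministic 64-bit hash; the seed simulates a family of hash functions.
    digest = hashlib.blake2b(f"{seed}:{token}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(token_set, num_hashes=32):
    """MinHash signature: the minimum hash value under each seeded hash function."""
    return tuple(
        min((_hash(seed, tok) for tok in token_set), default=0)
        for seed in range(num_hashes)
    )


def lsh_blocks(records, num_hashes=32, bands=8):
    """Group record ids into blocks; ids sharing any signature band share a block.

    Assumption: MinHash banding stands in for whichever LSH scheme the paper uses.
    """
    if num_hashes % bands != 0:
        raise ValueError("bands must divide num_hashes evenly")
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rid, record in records.items():
        sig = minhash_signature(tokens(record), num_hashes)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(rid)
    # Only buckets holding at least two records yield comparisons.
    return [ids for ids in buckets.values() if len(ids) > 1]


def candidate_pairs(blocks):
    """Unique record-id pairs that fall into at least one common block."""
    return {pair for block in blocks for pair in combinations(sorted(block), 2)}
```

Only pairs returned by candidate_pairs would be passed on to the more expensive similarity model, which is where blocking saves work. Because tokens are taken from every attribute, two tuples can land in the same block even when no single attribute matches exactly.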
