Configurable assembly of classification rules for enhancing entity resolution results

Dimas Cassimiro Nascimento; Carlos Eduardo Santos Pires; Thiago Pereira Nóbrega


Abstract

Real-world datasets often present different types of data quality problems, such as outliers, missing values, inaccurate representations and duplicate entities. To identify duplicate entities, a task named Entity Resolution (ER), we may employ a variety of classification techniques. Rule-based classification techniques have gained increasing attention in the state of the art because automatic learning approaches can be incorporated to generate Rule-Based Entity Resolution (RbER) algorithms. However, these algorithms present a series of drawbacks: i) the generation of high-quality RbER algorithms usually requires high computational and/or manual labeling costs; ii) the impossibility of tuning RbER algorithm parameters; iii) the inability to incorporate user preferences regarding the ER results into the functioning of the algorithm; and iv) the logical (binary) nature of RbER algorithms usually falls short when tackling special cases, i.e., challenging duplicate and non-duplicate pairs of entities. To overcome these drawbacks, we propose Rule Assembler, a configurable approach that classifies duplicate entities based on confidence scores produced by logical rules, taking into account tunable parameters as well as user preferences. Experiments carried out using both real-world and synthetic datasets have demonstrated the ability of the proposed approach to enhance the results produced by baseline RbER algorithms and basic assembling approaches. Furthermore, we demonstrate that the proposed approach does not entail a significant overhead over the classification step and conclude that the Rule Assembler parameters APA, WPA, TβM and Max are more suitable for use in practical scenarios.
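The general idea of classifying a pair of entities from confidence scores produced by logical rules, rather than from a purely binary rule outcome, can be illustrated with a small sketch. This is not the Rule Assembler algorithm itself: the abstract does not define its parameters (APA, WPA, TβM, Max), so the `Rule` structure, the example rules, their confidence values, the two combination strategies ("max" and "mean") and the threshold below are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]


@dataclass
class Rule:
    """A logical matching rule with an attached confidence (hypothetical structure)."""
    name: str
    predicate: Callable[[Record, Record], bool]
    confidence: float  # assumed to lie in [0, 1]


def same(field: str) -> Callable[[Record, Record], bool]:
    """Build a predicate that compares one field case-insensitively."""
    return lambda a, b: a[field].strip().lower() == b[field].strip().lower()


# Illustrative rules; the names and confidence values are invented.
RULES: List[Rule] = [
    Rule("same_email", same("email"), 0.9),
    Rule("same_name_and_city", lambda a, b: same("name")(a, b) and same("city")(a, b), 0.6),
]


def rule_scores(rules: List[Rule], a: Record, b: Record) -> List[float]:
    """Each rule contributes its confidence if it fires for the pair, else 0.0."""
    return [r.confidence if r.predicate(a, b) else 0.0 for r in rules]


def assemble(scores: List[float], strategy: str = "max") -> float:
    """Combine per-rule scores into one score (two assumed strategies)."""
    if not scores:
        return 0.0
    if strategy == "max":
        return max(scores)
    if strategy == "mean":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown strategy: {strategy}")


def is_duplicate(rules: List[Rule], a: Record, b: Record,
                 threshold: float = 0.5, strategy: str = "max") -> bool:
    """Classify the pair as duplicate when the combined score reaches the threshold."""
    return assemble(rule_scores(rules, a, b), strategy) >= threshold


if __name__ == "__main__":
    a = {"name": "Ana Silva", "city": "Recife", "email": "ana@example.com"}
    b = {"name": "ana silva", "city": "recife", "email": "a.silva@example.org"}
    print(is_duplicate(RULES, a, b, strategy="max"))
    print(is_duplicate(RULES, a, b, strategy="mean"))
```

In this toy setup the threshold and the combination strategy play the role of tunable parameters: the same pair can be accepted under one configuration and rejected under another, which is the kind of flexibility a strictly binary rule set does not offer.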


Bibliographic details

Source
Information Processing & Management | 2020, Issue 3 | 102224.1-102224.26 | 26 pages
Authors
Dimas Cassimiro Nascimento; Carlos Eduardo Santos Pires; Thiago Pereira Nóbrega
Author affiliations

Federal Rural University of Pernambuco (UFRPE); Federal University of Campina Grande (UFCG), Aprígio Veloso 882 - Universitário, Campina Grande, PB 58429-900, Brazil; Federal University of Campina Grande (UFCG), Brazil

Federal University of Campina Grande (UFCG), Brazil

Original format: PDF
Language: English

