Journal: Information Processing & Management

Configurable assembly of classification rules for enhancing entity resolution results


Abstract

Real-world datasets often present different types of data quality problems, such as outliers, missing values, inaccurate representations and duplicate entities. To identify duplicate entities, a task named Entity Resolution (ER), we may employ a variety of classification techniques. Rule-based classification techniques have attracted increasing attention in the literature because automatic learning approaches can be incorporated to generate Rule-Based Entity Resolution (RbER) algorithms. However, these algorithms present a series of drawbacks: i) generating high-quality RbER algorithms usually requires high computational and/or manual labeling costs; ii) RbER algorithm parameters cannot be tuned; iii) user preferences regarding the ER results cannot be incorporated into the functioning of the algorithm; and iv) the logical (binary) nature of RbER algorithms usually falls short when tackling special cases, i.e., challenging duplicate and non-duplicate pairs of entities. To overcome these drawbacks, we propose Rule Assembler, a configurable approach that classifies duplicate entities based on confidence scores produced by logical rules, taking into account tunable parameters as well as user preferences. Experiments carried out using both real-world and synthetic datasets demonstrate that the proposed approach enhances the results produced by baseline RbER algorithms and basic assembling approaches. Furthermore, we demonstrate that the proposed approach does not entail a significant overhead over the classification step and conclude that the Rule Assembler parameters APA, WPA, TβM and Max are more suitable for use in practical scenarios.
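
The abstract does not spell out how Rule Assembler computes or combines confidence scores, so the following is only a minimal sketch of the general idea: binary rules vote on a pair of records, each rule carries a confidence score, and a tunable threshold turns the combined score into a duplicate/non-duplicate decision. Every name here (`same_email`, `same_name`, `assemble`, `is_duplicate`, the `"weighted"` and `"max"` strategies, the confidence values and the threshold) is a hypothetical illustration. In particular, the `"max"` strategy is a generic maximum and is not claimed to correspond to the paper's Max parameter.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Rule = Callable[[Record, Record], bool]


def same_email(a: Record, b: Record) -> bool:
    """Hypothetical logical rule: both records share a non-empty email."""
    return bool(a.get("email")) and a.get("email") == b.get("email")


def same_name(a: Record, b: Record) -> bool:
    """Hypothetical logical rule: names match, ignoring case."""
    return a.get("name", "").lower() == b.get("name", "").lower()


def assemble(votes: List[bool], confidences: List[float],
             strategy: str = "weighted") -> float:
    """Combine binary rule outputs into one score in [0, 1].

    Assumed strategies (illustrative, not taken from the paper):
      "weighted": confidence of the rules that fired / total confidence
      "max":      highest confidence among the rules that fired
    """
    fired = [c for v, c in zip(votes, confidences) if v]
    if strategy == "max":
        return max(fired, default=0.0)
    if strategy == "weighted":
        total = sum(confidences)
        return sum(fired) / total if total else 0.0
    raise ValueError(f"unknown strategy: {strategy}")


def is_duplicate(a: Record, b: Record, rules: List[Rule],
                 confidences: List[float], threshold: float = 0.5,
                 strategy: str = "weighted") -> bool:
    """Classify a pair as duplicate when the assembled score reaches the threshold."""
    votes = [rule(a, b) for rule in rules]
    return assemble(votes, confidences, strategy) >= threshold


# Illustrative data; the confidence values are made up for the example.
rules = [same_email, same_name]
confidences = [0.9, 0.6]
r1 = {"name": "Ann Lee", "email": "ann@x.org"}
r2 = {"name": "ann lee", "email": "ann@x.org"}
r3 = {"name": "Ann Lee", "email": "other@x.org"}
```

With these made-up values, `r1` and `r3` agree only on the name, so the decision depends on the chosen strategy and threshold. That is the kind of borderline pair where a purely binary rule set leaves no room for tuning or user preference.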