Journal: International Journal of Web Information Systems
Learning representations of Web entities for entity resolution

Abstract

Purpose - Matching instances of the same entity, a task known as entity resolution, is a key step in data integration. This paper proposes a deep learning network that learns different representations of Web entities for entity resolution.

Design/methodology/approach - To match Web entities, the proposed network learns the following representations of entities: embeddings, which are vector representations of the words in the entities in a low-dimensional space; convolutional vectors from a convolutional layer, which capture short-distance patterns in word sequences in the entities; and bag-of-words vectors, created by a bag-of-words (BoW) layer that learns weights for words in the vocabulary based on the task at hand. Given a pair of entities, the similarities between their learned representations are used as features for a binary classifier that identifies a possible match. In addition to those features, the classifier also uses a modification of inverse document frequency for pairs, which identifies discriminative words in pairs of entities.

Findings - The proposed approach was evaluated on two commercial and two academic entity resolution benchmark data sets. The results show that the proposed strategy outperforms previous approaches on the commercial data sets, which are more challenging, and achieves results similar to those of its competitors on the academic data sets.

Originality/value - No previous work has used a single deep learning framework to learn different representations of Web entities for entity resolution.
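
The feature-building step described in the abstract (similarities between per-entity representations, passed on to a binary classifier) can be illustrated with a minimal sketch. This is not the paper's network: nothing is learned here, the embedding table `TOY_EMB`, the word weights, and all function names are hypothetical, and the convolutional vectors and the pairwise inverse-document-frequency feature are omitted because the abstract does not define them.

```python
import math
from collections import Counter

# Hypothetical 2-dimensional word embeddings; a real system would learn these.
TOY_EMB = {
    "sony": [1.0, 0.0],
    "camera": [0.0, 1.0],
    "dell": [0.5, 0.5],
    "laptop": [0.2, 0.8],
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_embedding(tokens, emb, dim):
    """Average the embeddings of the known tokens; zero vector if none are known."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def bow_vector(tokens, vocab, weights):
    """Weighted bag-of-words vector; words without a weight default to 1.0."""
    counts = Counter(tokens)
    return [counts[w] * weights.get(w, 1.0) for w in vocab]


def pair_features(a, b, emb, dim, weights):
    """Return [embedding similarity, bag-of-words similarity] for an entity pair."""
    ta, tb = tokenize(a), tokenize(b)
    vocab = sorted(set(ta) | set(tb))
    return [
        cosine(mean_embedding(ta, emb, dim), mean_embedding(tb, emb, dim)),
        cosine(bow_vector(ta, vocab, weights), bow_vector(tb, vocab, weights)),
    ]


# Example: features for one candidate pair, using assumed (not learned) weights.
features = pair_features("Sony camera", "Sony digital camera", TOY_EMB, 2, {"sony": 2.0})
print(features)
```

In a full pipeline, feature vectors like `features` would be computed for many labelled pairs and used to train a binary match/non-match classifier. The choice of classifier is left open here, since the abstract does not specify one.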
