首页> 外文会议>International conference on very large data bases >SilkMoth: An Efficient Method for Finding Related Sets with Maximum Matching Constraints
【24h】

SilkMoth: An Efficient Method for Finding Related Sets with Maximum Matching Constraints

机译:SilkMoth:一种有效的方法来查找具有最大匹配约束的相关集

获取原文

摘要

Determining if two sets are related - that is, if they have similar values or if one set contains the other - is an important problem with many applications in data cleaning, data integration, and information retrieval. For example, set relatedness can be a useful tool to discover whether columns from two different databases are join-able; if enough of the values in the columns match, it may make sense to join them. A common metric is to measure the relatedness of two sets by treating the elements as vertices of a bipartite graph and calculating the score of the maximum matching pairing between elements. Compared to other metrics which require exact matchings between elements, this metric uses a similarity function to compare elements between the two sets, making it robust to small dissimilarities in elements and more useful for real-world, dirty data. Unfortunately, the metric suffers from expensive computational cost, taking 0(n3) time, where n is the number of elements in the sets, for each set-to-set comparison. Thus for applications that try to search for all pairings of related sets in a brute-force manner, the runtime becomes unacceptably large. To address this challenge, we developed SilkMoth, a system capable of rapidly discovering related set pairs in collections of sets. Internally, SilkMoth creates a signature for each set, with the property that any other set which is related must match the signature. Silkmoth then uses these signatures to prune the search space, so only sets that match the signatures are left as candidates. Finally, Silkmoth applies the maximum matching metric on remaining candidates to verify which of these candidates are truly related sets. An important property of SilkMoth is that it is guaranteed to output exactly the same related set pairings as the brute-force method, unlike approximate techniques. Thus, a contribution of this paper is the characterization of the space of signatures which enable this property. We show that selecting the optimal signature in this space is NP-complete, and based on insights from the characterization of the space, we propose two novel filters which help to prune the candidates further before verification. In addition, we introduce a simple optimization to the calculation of the maximum matching metric itself based on the triangle inequality. Compared to related approaches, SilkMoth is much more general, handling a larger space of similarity functions and relatedness metrics, and is an order of magnitude more efficient on real datasets.
机译:对于数据清理,数据集成和信息检索中的许多应用程序,确定两组是否相关(即它们是否具有相似的值或一组是否包含另一组)是一个重要问题。例如,集合相关性是发现两个不同数据库中的列是否可连接的有用工具。如果列中足够的值匹配,则将它们连接起来可能很有意义。一个通用的度量是通过将元素视为二部图的顶点并计算元素之间最大匹配对的分数来测量两组的相关性。与其他需要在元素之间进行精确匹配的度量相比,该度量使用相似性函数来比较两个集合之间的元素,从而使其对于元素中的细微差异具有鲁棒性,并且对于实际的脏数据更有用。不幸的是,该度量标准遭受了昂贵的计算成本,需要花费0(n3)时间,其中n是每个集合之间比较的集合中元素的数量。因此,对于尝试以蛮力方式搜索相关集的所有配对的应用程序,运行时间变得无法接受。为了应对这一挑战,我们开发了SilkMoth,该系统能够快速发现集合集合中的相关集合对。在内部,SilkMoth为每个集合创建一个签名,其属性是与之相关的任何其他集合都必须与该签名匹配。然后,Silkmoth使用这些签名来修剪搜索空间,因此仅保留与签名匹配的集合作为候选。最后,Silkmoth对剩余的候选者应用最大匹配度量,以验证这些候选者中哪些是真正相关的集合。 SilkMoth的一个重要特性是,与近似技术不同,它可以确保输出与蛮力方法完全相同的相关集配对。因此,本文的一个贡献就是对签名空间的特征化,从而使这种特性成为可能。我们表明,在该空间中选择最佳签名是NP完全的,并且基于对空间特征的了解,我们提出了两种新颖的过滤器,有助于在验证之前进一步修剪候选者。此外,我们基于三角形不等式对最大匹配度量本身的计算进行了简单的优化。与相关方法相比,SilkMoth更为通用,可以处理更大范围的相似性函数和相关性度量,并且在实际数据集上的效率要高一个数量级。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号