SilkMoth: An Efficient Method for Finding Related Sets with Maximum Matching Constraints

机译：SilkMoth：一种有效的方法来查找具有最大匹配约束的相关集

获取原文

页面导航

摘要
著录项
相似文献
相关主题

摘要

Determining if two sets are related - that is, if they have similar values or if one set contains the other - is an important problem with many applications in data cleaning, data integration, and information retrieval. For example, set relatedness can be a useful tool to discover whether columns from two different databases are join-able; if enough of the values in the columns match, it may make sense to join them. A common metric is to measure the relatedness of two sets by treating the elements as vertices of a bipartite graph and calculating the score of the maximum matching pairing between elements. Compared to other metrics which require exact matchings between elements, this metric uses a similarity function to compare elements between the two sets, making it robust to small dissimilarities in elements and more useful for real-world, dirty data. Unfortunately, the metric suffers from expensive computational cost, taking 0(n3) time, where n is the number of elements in the sets, for each set-to-set comparison. Thus for applications that try to search for all pairings of related sets in a brute-force manner, the runtime becomes unacceptably large. To address this challenge, we developed SilkMoth, a system capable of rapidly discovering related set pairs in collections of sets. Internally, SilkMoth creates a signature for each set, with the property that any other set which is related must match the signature. Silkmoth then uses these signatures to prune the search space, so only sets that match the signatures are left as candidates. Finally, Silkmoth applies the maximum matching metric on remaining candidates to verify which of these candidates are truly related sets. An important property of SilkMoth is that it is guaranteed to output exactly the same related set pairings as the brute-force method, unlike approximate techniques. Thus, a contribution of this paper is the characterization of the space of signatures which enable this property. We show that selecting the optimal signature in this space is NP-complete, and based on insights from the characterization of the space, we propose two novel filters which help to prune the candidates further before verification. In addition, we introduce a simple optimization to the calculation of the maximum matching metric itself based on the triangle inequality. Compared to related approaches, SilkMoth is much more general, handling a larger space of similarity functions and relatedness metrics, and is an order of magnitude more efficient on real datasets.

机译：对于数据清理，数据集成和信息检索中的许多应用程序，确定两组是否相关（即它们是否具有相似的值或一组是否包含另一组）是一个重要问题。例如，集合相关性是发现两个不同数据库中的列是否可连接的有用工具。如果列中足够的值匹配，则将它们连接起来可能很有意义。一个通用的度量是通过将元素视为二部图的顶点并计算元素之间最大匹配对的分数来测量两组的相关性。与其他需要在元素之间进行精确匹配的度量相比，该度量使用相似性函数来比较两个集合之间的元素，从而使其对于元素中的细微差异具有鲁棒性，并且对于实际的脏数据更有用。不幸的是，该度量标准遭受了昂贵的计算成本，需要花费0（n3）时间，其中n是每个集合之间比较的集合中元素的数量。因此，对于尝试以蛮力方式搜索相关集的所有配对的应用程序，运行时间变得无法接受。为了应对这一挑战，我们开发了SilkMoth，该系统能够快速发现集合集合中的相关集合对。在内部，SilkMoth为每个集合创建一个签名，其属性是与之相关的任何其他集合都必须与该签名匹配。然后，Silkmoth使用这些签名来修剪搜索空间，因此仅保留与签名匹配的集合作为候选。最后，Silkmoth对剩余的候选者应用最大匹配度量，以验证这些候选者中哪些是真正相关的集合。 SilkMoth的一个重要特性是，与近似技术不同，它可以确保输出与蛮力方法完全相同的相关集配对。因此，本文的一个贡献就是对签名空间的特征化，从而使这种特性成为可能。我们表明，在该空间中选择最佳签名是NP完全的，并且基于对空间特征的了解，我们提出了两种新颖的过滤器，有助于在验证之前进一步修剪候选者。此外，我们基于三角形不等式对最大匹配度量本身的计算进行了简单的优化。与相关方法相比，SilkMoth更为通用，可以处理更大范围的相似性函数和相关性度量，并且在实际数据集上的效率要高一个数量级。

著录项

来源
《International conference on very large data bases》|2017年|1082-1093|共12页
会议地点
作者
Dong Deng; Albert Kim; Samuel Madden; Michael Stonebraker;
展开▼
作者单位

展开▼
会议组织
原文格式 PDF
正文语种
中图分类
关键词

相似文献

外文文献
中文文献
专利

1. An Efficient Algorithm for Finding a Maximum Weight k-Independent Set on Trapezoid Graphs [J] . Mrinmoy Hota, Madhumangal Pal, Tapan K. Pal Computational optimization and applications . 2001,第1期

机译：在梯形图上找到最大权重k独立集的有效算法
2. Maximum weight independent sets and matchings in sparse random graphs. Exact results using the local weak convergence method [J] . Gamarnik D, Nowicki T, Swirszcz G Random structures & algorithms . 2006,第1期

机译：稀疏随机图中的最大权重独立集和匹配项。使用局部弱收敛方法的精确结果
3. Alternating Direction Method for Maximum Entropy Subject to Simple Constraint Sets [J] . A. Bnouhachem, Z. B. Liu Journal of Optimization Theory and Applications . 2004,第2期

机译：简单约束条件下最大熵的交替方向法
4. SilkMoth: An Efficient Method for Finding Related Sets with Maximum Matching Constraints [C] . Dong Deng, Albert Kim, Samuel Madden, International conference on very large data bases . 2017

机译：SilkMoth：具有最大匹配约束的相关集的有效方法
5. Topics in numerical relativity: Solving the initial value problem using adaptive mesh refinement, examining evolution stability using spectral methods, and finding apparent horizons using a mean curvature-level set method [D] . Lowe, Lisa L. 2008

机译：数值相对论的主题：使用自适应网格细化解决初值问题，使用频谱方法检查演化稳定性，并使用平均曲率级集方法找到视在地平线
6. New Item Selection Method Accommodating Practical Constraints in Cognitive Diagnostic Computerized Adaptive Testing: Maximum Deviation and Maximum Limitation Global Discrimination Indexes [O] . Junjie Li, Lihua Ma, Pingfei Zeng, 2021

机译：新的项目选择方法适用于认知诊断计算机的自适应测试中的实际限制：最大偏差和最大限制全局鉴别指标
7. Maximum Weight Independent Sets and Matchings in Sparse Random Graphs. Exact Results Using the Local Weak Convergence Method [O] . David Gamarnik, Tomasz Nowicki, Grzegorz Swirscsz 2008

机译：稀疏随机图中的最大权重无关集和匹配项。使用局部弱收敛方法的精确结果

SilkMoth: An Efficient Method for Finding Related Sets with Maximum Matching Constraints

摘要

著录项

相似文献

相关主题

期刊订阅