International conference on world wide web

Efficient Estimation for High Similarities using Odd Sketches


Abstract

Estimating set similarity is a central problem in many computer applications. In this paper we introduce the Odd Sketch, a compact binary sketch for estimating the Jaccard similarity of two sets. The exclusive-or of two sketches equals the sketch of the symmetric difference of the two sets. This means that Odd Sketches provide a highly space-efficient estimator for sets of high similarity, which is relevant in applications such as web duplicate detection, collaborative filtering, and association rule learning. The method extends to weighted Jaccard similarity, relevant e.g. for TF-IDF vector comparison. We present a theoretical analysis of the quality of estimation to guarantee the reliability of Odd Sketch-based estimators. Our experiments confirm this efficiency, and demonstrate the efficiency of Odd Sketches in comparison with b-bit minwise hashing schemes on association rule learning and web duplicate detection tasks.
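The exclusive-or property stated in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the 64-bit sketch width, and the use of SHA-256 as the hash function are all assumptions made for illustration. Each element flips one hashed bit, so a bit ends up holding the parity of the number of elements mapped to it. The size estimate assumes the hash behaves like a uniformly random function, under which a bit of an n-bit sketch of m elements is set with probability (1 - (1 - 2/n)^m)/2, roughly (1 - e^{-2m/n})/2.

```python
import hashlib
import math


def odd_sketch(items, n_bits=64):
    """Build a parity sketch as an integer bitmask (hypothetical helper).

    Each distinct item flips exactly one bit, chosen by hashing its string
    form, so bit i holds the parity of the number of items hashed to i.
    """
    sketch = 0
    for item in set(items):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % n_bits
        sketch ^= 1 << position  # flip the bit
    return sketch


def estimate_symmetric_difference(sketch_a, sketch_b, n_bits=64):
    """Estimate |A symmetric-difference B| from two sketches (assumed estimator).

    Inverts the approximation P(bit set) = (1 - exp(-2m/n)) / 2, where m is
    the size of the symmetric difference and n is the sketch width.
    """
    ones = bin(sketch_a ^ sketch_b).count("1")
    if 2 * ones >= n_bits:
        return math.inf  # sketch is saturated; the sets are too different
    return -(n_bits / 2) * math.log(1 - 2 * ones / n_bits)


# Two highly similar sets: they differ in exactly 10 elements.
a = set(range(0, 1000))
b = set(range(5, 1005))

# XOR of the two sketches equals the sketch of the symmetric difference.
assert odd_sketch(a) ^ odd_sketch(b) == odd_sketch(a ^ b)

print(estimate_symmetric_difference(odd_sketch(a), odd_sketch(b)))
```

Because elements common to both sets flip the same bit twice, they cancel in the exclusive-or, and only the differing elements contribute. This is why a small sketch can suffice when the sets are highly similar, and why the estimate degrades as the symmetric difference grows relative to the sketch width.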
