
Approximate String Joins in a Database (Almost) for Free



Abstract

String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data, especially for more complex queries involving joins. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string joins directly, and it is a challenge to implement this functionality efficiently with user-defined functions (UDFs). In this paper, we develop a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. At the core, our technique relies on matching short substrings of length q, called q-grams, and taking into account both positions of individual matches and the total number of such matches. Our approach applies to both approximate full string matching and approximate substring matching, with a variety of possible edit distance functions. The approximate string match predicate, with a suitable edit distance threshold, can be mapped into a vanilla relational expression and optimized by conventional relational optimizers. We demonstrate experimentally the benefits of our technique over the direct use of UDFs, using commercial database systems and real data. To study the I/O and CPU behavior of approximate string join algorithms with variations in edit distance and q-gram length, we also describe detailed experiments based on a prototype implementation.
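The q-gram idea in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the function names, the `#`/`$` padding characters, and the specific thresholds (length difference at most k, matching positions at most k apart, at least max(|s1|, |s2|) - 1 - (k - 1)q matching q-grams) are assumptions chosen for illustration, and the filters run in plain Python rather than as a relational expression inside a database. The filters only prune candidates; an exact edit distance check confirms each surviving pair.

```python
def positional_qgrams(s, q=3):
    """Return (position, q-gram) pairs for s, padded with assumed '#' and '$' markers."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(i, padded[i:i + q]) for i in range(len(padded) - q + 1)]


def edit_distance(a, b):
    """Unit-cost (Levenshtein) edit distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def passes_filters(s1, s2, k, q=3):
    """Cheap necessary conditions for edit_distance(s1, s2) <= k (assumed thresholds)."""
    # Length filter: k edits change the length by at most k.
    if abs(len(s1) - len(s2)) > k:
        return False
    # Count pairs of equal q-grams whose positions differ by at most k
    # (analogous to a join on q-gram value followed by a grouped count).
    matches = sum(
        1
        for p1, g1 in positional_qgrams(s1, q)
        for p2, g2 in positional_qgrams(s2, q)
        if g1 == g2 and abs(p1 - p2) <= k
    )
    # Count filter: each edit can destroy at most q q-grams.
    needed = max(len(s1), len(s2)) - 1 - (k - 1) * q
    return matches >= needed


def approximate_join(left, right, k, q=3):
    """Pairs (a, b) within edit distance k; filters run before the exact check."""
    return [
        (a, b)
        for a in left
        for b in right
        if passes_filters(a, b, k, q) and edit_distance(a, b) <= k
    ]
```

For example, `approximate_join(["John Smith"], ["Jon Smith", "Jane Doe"], k=1)` keeps only the pair `("John Smith", "Jon Smith")`: the second candidate is rejected by the length filter before any edit distance is computed.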
