Conference proceedings: Knowledge-Based Systems for Safety Critical Applications
Distance based indexing for string proximity search

Abstract

In many database applications involving string data, it is common to have near neighbor queries (asking for strings that are similar to a query string) or nearest neighbor queries (asking for strings that are most similar to a query string). The similarity between strings is defined in terms of a distance function determined by the application domain. The most popular string distance measures are based on a (weighted) count of (i) character edit or (ii) block edit operations needed to transform one string into the other. Examples include the Levenshtein edit distance and the recently introduced compression distance. Our main goal is to develop efficient near(est) neighbor search tools that work for both character and block edit distances. Our premise is that distance-based indexing methods, which were originally designed for metric distances, can be modified for string distance measures, provided that these measures form almost metrics. We show that several distance measures, such as the compression distance and the weighted character edit distance, are almost metrics. To analyze the performance of distance-based indexing methods (in particular VP trees) for strings, we then develop a model based on the distribution of pairwise distances. Based on this model, we show how to modify VP trees to improve their performance on string data, providing tradeoffs between search time and space. We test our theoretical results on synthetic data sets and protein strings.
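To make the setting concrete, the following is a minimal sketch of a near neighbor (range) query over a VP tree built with the unit-cost Levenshtein edit distance. It is not the authors' modified structure: unit-cost Levenshtein is a true metric, so the standard triangle-inequality pruning applies as is, and the adjustments the abstract describes for almost metrics are not reproduced here. All names (`levenshtein`, `VPNode`, `build_vp_tree`, `range_search`) are hypothetical and chosen only for illustration.

```python
import random
from typing import Callable, List, Optional

Dist = Callable[[str, str], int]


def levenshtein(a: str, b: str) -> int:
    """Unit-cost character edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


class VPNode:
    """One vantage point plus the median radius that splits the rest."""

    def __init__(self, vantage: str, radius: int,
                 inside: Optional["VPNode"], outside: Optional["VPNode"]):
        self.vantage = vantage
        self.radius = radius
        self.inside = inside      # strings with dist(vantage, s) <= radius
        self.outside = outside    # strings with dist(vantage, s) > radius


def build_vp_tree(items: List[str], dist: Dist,
                  rng: random.Random) -> Optional[VPNode]:
    """Recursively split the items around a randomly chosen vantage point."""
    if not items:
        return None
    rest = list(items)
    vantage = rest.pop(rng.randrange(len(rest)))
    if not rest:
        return VPNode(vantage, 0, None, None)
    scored = sorted(((dist(vantage, s), s) for s in rest), key=lambda t: t[0])
    radius = scored[len(scored) // 2][0]  # median distance to the vantage point
    inside = [s for d, s in scored if d <= radius]
    outside = [s for d, s in scored if d > radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, dist, rng),
                  build_vp_tree(outside, dist, rng))


def range_search(node: Optional[VPNode], query: str, r: int,
                 dist: Dist) -> List[str]:
    """Return every indexed string within distance r of the query."""
    if node is None:
        return []
    d = dist(query, node.vantage)
    found = [node.vantage] if d <= r else []
    # Triangle inequality: the inside ball can hold a match only if d - r <= radius.
    if d - r <= node.radius:
        found += range_search(node.inside, query, r, dist)
    # Likewise, the outside region can hold a match only if d + r > radius.
    if d + r > node.radius:
        found += range_search(node.outside, query, r, dist)
    return found


if __name__ == "__main__":
    words = ["kitten", "sitting", "mitten", "bitten", "smitten", "kitchen"]
    tree = build_vp_tree(words, levenshtein, random.Random(0))
    print(sorted(range_search(tree, "fitten", 1, levenshtein)))
```

The two pruning tests are only sound when the distance satisfies the triangle inequality. For a distance that satisfies it only approximately, the bounds would need to be loosened accordingly, which is the kind of modification the abstract refers to.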