Distance based indexing for string proximity search


Abstract

In many database applications involving string data, it is common to have near neighbor queries (asking for strings that are similar to a query string) or nearest neighbor queries (asking for strings that are most similar to a query string). The similarity between strings is defined in terms of a distance function determined by the application domain. The most popular string distance measures are based on a (weighted) count of (i) character edit or (ii) block edit operations needed to transform one string into the other. Examples include the Levenshtein edit distance and the recently introduced compression distance. The main goal is to develop efficient near(est) neighbor search tools that work for both character and block edit distances. Our premise is that distance-based indexing methods, which were originally designed for metric distances, can be modified for string distance measures, provided that they form almost metrics. We show that several distance measures, such as the compression distance and the weighted character edit distance, are almost metrics. In order to analyze the performance of distance-based indexing methods (in particular VP trees) for strings, we then develop a model based on the distribution of pairwise distances. Based on this model, we show how to modify VP trees to improve their performance on string data, providing tradeoffs between search time and space. We test our theoretical results on synthetic data sets and protein strings.
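As a rough illustration of the kind of index the abstract refers to, the following is a minimal sketch of a plain vantage point (VP) tree answering near neighbor (range) queries under the unit-cost Levenshtein distance. It is not the authors' modified structure: it assumes a true metric, so the standard triangle-inequality pruning applies unchanged, and the names `levenshtein`, `VPNode`, `build_vp_tree`, and `range_search` are hypothetical choices made for this example.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Unit-cost character edit distance (a true metric)."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


class VPNode:
    def __init__(self, vantage, radius, inner, outer):
        self.vantage = vantage  # the vantage point stored at this node
        self.radius = radius    # median distance from vantage to the rest
        self.inner = inner      # subtree of items with distance <= radius
        self.outer = outer      # subtree of items with distance > radius


def build_vp_tree(items, dist, rng=None):
    """Recursively split items around a randomly chosen vantage point."""
    rng = rng or random.Random(0)  # fixed seed: an assumption, for repeatability
    items = list(items)
    if not items:
        return None
    vantage = items.pop(rng.randrange(len(items)))
    if not items:
        return VPNode(vantage, 0, None, None)
    pairs = [(dist(vantage, s), s) for s in items]
    radius = sorted(d for d, _ in pairs)[len(pairs) // 2]
    inner = [s for d, s in pairs if d <= radius]
    outer = [s for d, s in pairs if d > radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inner, dist, rng),
                  build_vp_tree(outer, dist, rng))


def range_search(node, query, r, dist, out=None):
    """Collect every indexed item within distance r of query."""
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(query, node.vantage)
    if d <= r:
        out.append(node.vantage)
    # Triangle inequality: any match s satisfies d - r <= dist(vantage, s) <= d + r,
    # so a subtree is skipped when that interval cannot reach it.
    if d - r <= node.radius:
        range_search(node.inner, query, r, dist, out)
    if d + r > node.radius:
        range_search(node.outer, query, r, dist, out)
    return out


words = ["kitten", "sitting", "mitten", "bitten", "kitchen",
         "written", "sitter", "fitting", "knitting", "smitten"]
tree = build_vp_tree(words, levenshtein)
print(sorted(range_search(tree, "kitten", 1, levenshtein)))
```

The pruning tests are only valid when the distance satisfies the triangle inequality exactly. For the almost metrics discussed in the abstract, the pruning bounds would presumably have to be loosened, which this sketch does not attempt.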


Bibliographic record

Source
Knowledge-Based Systems for Safety Critical Applications | 1994 | pp. 125-136 | 12 pages
Authors
Sahinalp S.C.; Tasan M.; Macker J.; Ozsoyoglu Z.M.
Author affiliation
Dept. of Genetics, Case Western Reserve Univ., Cleveland, OH, USA
Original format: PDF
Language: English
Chinese Library Classification: Automation technology, computer technology

Similar literature

1. Efficiently Supporting Edit Distance Based String Similarity Search Using B$^+$-Trees [J]. Lu W., Du X., Hadjieleftheriou M., Knowledge and Data Engineering, IEEE Transactions on. 2014, No. 12
2. Search engine indexing storage optimisation using Hamming distance [J]. Anirban Kundu, Siddhartha Sett, Subhajit Kumar, International journal of intelligent information and database systems. 2012, No. 2
3. Indexing the bit-code and distance for fast KNN search in high-dimensional spaces [J]. LIANG Jun-jie, FENG Yu-cai, Journal of Zhejiang University. A, Science. 2007, No. 6
4. Distance based indexing for string proximity search [C]. Sahinalp, S.C., Tasan, . 2003
5. String Similarity Joins and Search Under Edit Distance [D]. Zhang, Haoyu. 2020
6. SAM: String-based sequence search algorithm for mitochondrial DNA database queries [O]. Alexander Röck, Jodi Irwin, Arne Dür
7. Distance Based Indexing for String Proximity Search [O]. S. Cenk Sahinalp, Murat Tasan et al. 2003
