首页> 外国专利> Density-based indexing method for efficient execution of high dimensional nearest-neighbor queries on large databases

Density-based indexing method for efficient execution of high dimensional nearest-neighbor queries on large databases

机译：基于密度的索引方法，可在大型数据库上高效执行高维最近邻居查询

页面导航

摘要
著录项
相似文献

摘要

Method and apparatus for efficiently performing nearest neighbor queries on a database of records wherein each record has a large number of attributes by automatically extracting a multidimensional index from the data. The method is based on first obtaining a statistical model of the content of the data in the form of a probability density function. This density is then used to decide how data should be reorganized on disk for efficient nearest neighbor queries. At query time, the model decides the order in which data should be scanned. It also provides the means for evaluating the probability of correctness of the answer found so far in the partial scan of data determined by the model. In this invention a clustering process is performed on the database to produce multiple data clusters. Each cluster is characterized by a cluster model. The set of clusters represent a probability density function in the form of a mixture model. A new database of records is built having an augmented record format that contains the original record attributes and an additional record attribute containing a cluster number for each record based on the clustering step. The cluster model uses a probability density function for each cluster so that the process of augmenting the attributes of each record is accomplished by evaluating each record's probability with respect to each cluster. Once the augmented records are used to build a database the augmented attribute is used as an index into the database so that nearest neighbor query analysis can be very efficiently conducted using an indexed look up process. As the database is queried, the probability density function is used to determine the order clusters or database pages are scanned. The probability density function is also used to determine when scanning can stop because the nearest neighbor has been found with high probability.

机译：通过从数据中自动提取多维索引来有效地对记录数据库执行最近邻居查询的方法和装置，其中每个记录具有大量属性。该方法基于首先获得概率密度函数形式的数据内容的统计模型。然后使用此密度来决定如何在磁盘上重组数据，以进行有效的最近邻居查询。在查询时，模型决定应扫描数据的顺序。它还提供了一种方法，用于评估到目前为止在模型确定的数据的部分扫描中找到的答案的正确性概率。在本发明中，对数据库执行聚类处理以产生多个数据聚类。每个群集都具有一个群集模型。聚类集以混合模型的形式表示概率密度函数。建立一个新的记录数据库，该数据库具有增强的记录格式，该格式包含原始记录属性，另外的记录属性则包含基于聚类步骤的每个记录的聚类编号。聚类模型对每个聚类使用概率密度函数，以便通过评估每个记录相对于每个聚类的概率来完成增强每个记录的属性的过程。一旦使用扩充记录来建立数据库，就将扩充属性用作数据库的索引，以便可以使用索引查找过程非常有效地进行最近邻居查询分析。在查询数据库时，概率密度函数用于确定顺序簇或数据库页面被扫描。概率密度函数还用于确定何时可以停止扫描，因为已找到高概率的最近邻居。

著录项

公开/公告号US6263334B1

专利类型
公开/公告日2001-07-17

原文格式PDF
申请/专利权人 MICROSOFT CORPORATION;
展开▼

申请/专利号US19980189229
发明设计人 KRISTIN P. BENNETT;DAN GEIGER;USAMA FAYYAD;
展开▼

申请日1998-11-11
分类号G06F173/00;
国家 US
入库时间 2022-08-22 01:03:48

相似文献

专利
外文文献
中文文献