Journal: Knowledge and Information Systems

Bulk construction of dynamic clustered metric trees

Abstract

Repositories of complex data types, such as images, audio, video and free text, are becoming increasingly common in various fields. A general approach to searching such data is similarity search, in which the query asks for objects similar to a given one and similarity is modeled by a metric distance function. An important class of access methods for similarity search in metric data is that of dynamic clustered metric trees, where the index is structured as a paged and balanced tree and the space is partitioned hierarchically into compact regions. While access methods of this class allow dynamic insertions, typically of single objects, the problem of efficiently inserting a given data set into the index in bulk is largely open. In this article we address this problem and propose novel algorithms for its two cases: the index is initially empty (bulk loading), or the index is initially non-empty (bulk insertion). The proposed bulk loading algorithm builds the index bottom-up, layer by layer, using a new sampling-based clustering method that improves clustering results by improving the quality of the selected sample sets. The proposed bulk insertion algorithm employs the bulk loading algorithm to load the given data into a new index structure, and then merges the new and the existing structures into a unified high-quality index, using a novel decomposition method to reduce overlaps between the structures. Both algorithms yield significantly improved construction and search performance, and are applicable to all dynamic clustered metric trees. Results from an extensive experimental study show that the proposed algorithms outperform alternative methods, reducing construction costs by up to 47% (CPU) and 99% (I/O), and search costs by up to 48% (CPU) and 30% (I/O).
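
To make the bottom-up, layer-by-layer idea concrete, here is a minimal Python sketch. It is not the algorithm from the article: the abstract does not describe how samples are selected or how nodes are laid out, so this sketch assumes plain uniform random sampling, a simple "nearest pivot with free capacity" assignment, and an in-memory ball-shaped node (pivot plus covering radius). The names `Node`, `cluster_by_sample`, `bulk_load` and `range_search` are hypothetical and chosen only for illustration.

```python
import math
import random
from dataclasses import dataclass, field


def euclidean(a, b):
    """One example of a metric distance function."""
    return math.dist(a, b)


@dataclass
class Node:
    pivot: tuple            # routing object chosen from the sample
    radius: float           # covering radius: upper bound on the distance to anything below
    children: list = field(default_factory=list)  # objects (leaf) or Nodes (internal)
    is_leaf: bool = True


def cluster_by_sample(items, key, capacity, dist, rng):
    """Sample ceil(n / capacity) pivots uniformly (an assumption of this sketch),
    then put every item into the nearest pivot's group that still has room."""
    k = math.ceil(len(items) / capacity)
    pivots = [key(items[i]) for i in rng.sample(range(len(items)), k)]
    groups = [[] for _ in range(k)]
    for item in items:
        by_distance = sorted(range(k), key=lambda i: dist(key(item), pivots[i]))
        for i in by_distance:
            if len(groups[i]) < capacity:
                groups[i].append(item)
                break
    return [(p, g) for p, g in zip(pivots, groups) if g]


def bulk_load(objects, capacity=4, dist=euclidean, seed=0):
    """Build a balanced tree bottom-up: cluster objects into leaves,
    then cluster each layer of nodes into the layer above until one root remains."""
    objects = list(objects)
    if not objects or capacity < 2:
        raise ValueError("need at least one object and capacity >= 2")
    rng = random.Random(seed)
    layer = [
        Node(p, max(dist(p, o) for o in group), group, True)
        for p, group in cluster_by_sample(objects, lambda o: o, capacity, dist, rng)
    ]
    while len(layer) > 1:
        layer = [
            # Triangle inequality: d(p, x) <= d(p, child.pivot) + child.radius.
            Node(p, max(dist(p, c.pivot) + c.radius for c in group), group, False)
            for p, group in cluster_by_sample(layer, lambda n: n.pivot, capacity, dist, rng)
        ]
    return layer[0]


def range_search(node, query, r, dist=euclidean):
    """Return all indexed objects within distance r of query."""
    if dist(query, node.pivot) > r + node.radius:
        return []  # the whole region lies outside the query ball
    if node.is_leaf:
        return [o for o in node.children if dist(query, o) <= r]
    found = []
    for child in node.children:
        found.extend(range_search(child, query, r, dist))
    return found
```

Calling `bulk_load(points, capacity=8)` on a list of coordinate tuples returns a root node, and `range_search(root, q, r)` then answers a similarity (range) query, pruning subtrees through the covering radius. Because every layer is built from the one below it, all leaves end up at the same depth, which corresponds to the "balanced" property mentioned in the abstract. The quality of the resulting clusters depends heavily on the sample, which is the part the article claims to improve.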
