首页> 中文期刊> 《计算机学报》 >RM树:一种支持字符串相似性操作的索引

RM树:一种支持字符串相似性操作的索引

         

摘要

String similarity processing is well adopted in various fields such as data cleaning, information integration, spelling check and bioinformatics. With the rapid growth of information, processing strings over massive datasets becomes a significant basic operation. Existing solutions based on q-Gram and inverted lists suffer from the following disadvantages: large storage cost, poor efficiency of updates and limited types of operation. At meanwhile, most of these methods are main memory based. Bed-tree is a recently proposed tree structured, disk resident index which supports diverse operations. Bed-tree fails to cluster similar strings together, thus incurs large I/O costs while string similarity processing. This paper proposes a disk resident index RM-tree which clusters similar strings together while eliminating the shortcomings of q-Gram and inverted list based solutions. RM-tree supports multiple types of string similarity operations such as range query, top-k query and string similarity join. This paper presents the construction method of RM-tree and its query processing algorithms. RM-tree is tested with extensive experiments including index construction and different types of query processing in terms of I/O cost and time consumption with two real world datasets and a synthetic dataset. Experiment results show that RM-tree is efficient for supporting string similarity operations, and provides better performance than Bed-tree.%字符串相似性操作在很多领域中被广泛应用,如数据清洁、信息集成等.现有研究工作主要为基于q-Gram和倒排索引的内存方法,在处理大量数据时具有以下缺点:内存消耗大、更新效率低、支持操作类型有限.现有的外存索引Bed树无法将相似的字符串聚类,在查询处理过程中导致了较大的I/O代价.该文设计了支持多种字符串相似性操作的RM树索引,消除了现有内存方法的缺点,并通过字符串聚类的方法提高了相似性操作的效率.该文通过大量实验结果证明了RM树的有效性.

著录项

相似文献

  • 中文文献
  • 外文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号