Journal: 《计算机应用与软件》 (Computer Applications and Software)

Pruning Strategy for Hierarchical Clustering in Distributed Computing Environments

Abstract

A class of hierarchical clustering algorithms based on node coverage relations in a tree structure can generate meaningful summaries of massive data. However, the underlying problem has been proved to be NP-complete, so computing an exact solution requires an enormous amount of computation. An effective pruning method exists for the stand-alone (single-machine) setting, but that pruning algorithm is not feasible in a distributed computing environment. This paper therefore proposes a new pruning strategy for this hierarchical clustering algorithm in distributed environments. Each node is bound to an ordered array of the basic events it covers, which converts exhaustive queries into intersection operations on ordered arrays and allows extensive pruning during the merge process. As a result, computation time is reduced significantly at the cost of a limited amount of additional space. Tests on two public benchmark datasets show that, compared with a naive distributed computing strategy, the new hierarchical clustering algorithm improves time efficiency by roughly 30 to 40 times on average.
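The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea of intersecting ordered arrays with pruning during the merge. It assumes each node carries an ascending list of the integer ids of the basic events it covers. The function name `intersect_sorted`, the `min_size` threshold, and the specific pruning rule (abandon the merge once the intersection can no longer reach `min_size`) are hypothetical illustrations, not the paper's method.

```python
from typing import List, Optional


def intersect_sorted(a: List[int], b: List[int], min_size: int = 0) -> Optional[List[int]]:
    """Two-pointer intersection of two ascending lists of event ids.

    Returns the common ids, or None when the merge is pruned because the
    intersection cannot reach `min_size` (a hypothetical pruning criterion).
    """
    i = j = 0
    common: List[int] = []
    while i < len(a) and j < len(b):
        # Upper bound on the final size: ids found so far plus the shorter
        # remaining tail. If even that is too small, stop early.
        if len(common) + min(len(a) - i, len(b) - j) < min_size:
            return None
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common if len(common) >= min_size else None


# Hypothetical coverage table: node name -> sorted ids of covered basic events.
covers = {
    "node_a": [1, 3, 5, 7],
    "node_b": [3, 4, 5, 8],
}
shared = intersect_sorted(covers["node_a"], covers["node_b"])
```

Because both arrays are sorted, each merge takes time linear in their combined length instead of requiring a lookup per candidate event, and the early exit skips the remainder of a merge that cannot meet the threshold. The space cost is the per-node array, which matches the abstract's trade-off of limited extra space for less computation.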
