首页> 中文期刊> 《计算机技术与发展》 >基于二叉树的并行频繁项集挖掘算法

基于二叉树的并行频繁项集挖掘算法

         

摘要

Along with the advent of the era of big data,people have higher requirements in the speed of data processing and the utilization of data. In the aspect of mining frequent itemset,the algorithms of Count Distribution and Data Distribution are classical parallel algo-rithms for mining frequent itemset,because large storage space and communication overhead are needed in the process of mining,the min-ing efficiency is not very ideal. A parallel algorithm of frequent itemset mining based on the binary-tree is proposed in this paper,it takes advantage of the parallelism of MapReduce. Firstly,find out all subsets of fixed size in the database by using the method of traversing the binary-tree. Secondly,count occurrence numbers of each subset,and compare with a fixed threshold which is set in advance. If the occur-rence number of a subset is more than the threshold value,the subset is the frequent itemset which is requested. The study of the compari-son and analysis of the experimental results show that the proposed algorithm needs only one process of MapReduce to complete the min-ing work,it makes full use of the parallelism of the cluster. It does not need to use iterative way for mining frequent itemset,and the per-formance is superior to the CD and DD algorithms,in other words,it has higher mining efficiency.%大数据时代的到来,使得人们对数据的处理速度、利用率等方面的要求变得更高。在频繁项集挖掘方面, Count Distribution算法和Data Distribution算法是比较经典的并行频繁项集挖掘算法,由于挖掘过程中需要较大的存储空间和通信开销,挖掘效率并不十分理想。文中提出了一种基于二叉树的并行频繁项集挖掘算法,利用了MapReduce的并行性,先通过遍历二叉树的方法找出数据库中固定大小的所有子集,然后统计每个子集的出现次数,再与事先设定好的一个固定阈值进行比较,超过阈值的子集即为所求的频繁项集。通过对实验结果进行对比分析表明,提出的算法只需要一次Ma-pReduce过程即可完成挖掘,充分利用了集群的并行性,不需要使用迭代的方式进行挖掘,性能上明显优于CD和DD算法,也就是说,该算法具有较高的挖掘效率。

著录项

相似文献

  • 中文文献
  • 外文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号