Journal of Intelligent Information Systems

One-pass MapReduce-based clustering method for mixed large scale data



Abstract

Big data is often characterized by huge volume and mixed types of attributes, namely numeric and categorical. K-prototypes has been fitted into the MapReduce framework and has thus become a solution for clustering mixed large scale data. However, k-prototypes requires computing the distances between every cluster center and every data point. Many of these distance computations are redundant, because data points usually stay in the same cluster after the first few iterations. Moreover, k-prototypes is poorly suited to running within the MapReduce framework: its iterative nature cannot be modeled through MapReduce, since at each iteration the whole data set must be read from and written to disk, which results in a large number of input/output (I/O) operations. To deal with these issues, we propose a new one-pass accelerated MapReduce-based k-prototypes clustering method for mixed large scale data. The proposed method reads and writes the data only once, which greatly reduces the I/O operations compared to the existing MapReduce implementation of k-prototypes. Furthermore, the proposed method uses a pruning strategy to accelerate the clustering process by reducing the redundant distance computations between cluster centers and data points. Experiments performed on simulated and real data sets show that the proposed method is scalable and improves on the efficiency of existing k-prototypes methods.
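To make the cost of the assignment step concrete, the following is a minimal sketch of the distance computation that plain k-prototypes repeats for every point and every center at each iteration. The abstract does not define the dissimilarity measure, so this assumes the commonly used k-prototypes formulation: squared Euclidean distance on the numeric attributes plus a weight `gamma` times the number of categorical mismatches. The names `mixed_distance`, `assign`, and `gamma` are illustrative, and the paper's pruning strategy and MapReduce layout are not shown because the abstract does not describe them.

```python
from typing import List, Sequence, Tuple

# A point or center is a pair: (numeric attributes, categorical attributes).
Mixed = Tuple[Sequence[float], Sequence[str]]


def mixed_distance(x_num: Sequence[float], x_cat: Sequence[str],
                   c_num: Sequence[float], c_cat: Sequence[str],
                   gamma: float = 1.0) -> float:
    """Assumed k-prototypes dissimilarity: squared Euclidean distance on the
    numeric part plus gamma times the count of categorical mismatches."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, c_num))
    mismatches = sum(1 for a, b in zip(x_cat, c_cat) if a != b)
    return numeric + gamma * mismatches


def assign(point: Mixed, centers: List[Mixed], gamma: float = 1.0) -> int:
    """Return the index of the nearest center. This computes one distance per
    center, which is the work a pruning strategy tries to avoid repeating."""
    x_num, x_cat = point
    distances = [mixed_distance(x_num, x_cat, c_num, c_cat, gamma)
                 for c_num, c_cat in centers]
    return min(range(len(centers)), key=distances.__getitem__)
```

With n points and k centers, this unpruned step performs n × k distance evaluations per iteration, which is why skipping computations for points that are unlikely to change cluster can pay off.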
