首页> 外文期刊>Parallel and Distributed Systems, IEEE Transactions on >Efficient $k$ -Means++ Approximation with MapReduce
【24h】

Efficient $k$ -Means++ Approximation with MapReduce

机译:有效的 $ k $ < / alternatives> -借助MapReduce的++近似

获取原文
获取原文并翻译 | 示例
           

摘要

-means is undoubtedly one of the most popular clustering algorithms owing to its simplicity and efficiency. However, this algorithm is highly sensitive to the chosen initial centers and thus a proper initialization is crucial for obtaining an ideal solution. To address this problem, -means++ is proposed to sequentially choose the centers so as to achieve a solution that is provably close to the optimal one. However, due to its weak scalability, -means++ becomes inefficient as the size of data increases. To improve its scalability and efficiency, this paper presents MapReduce -means++ method which can drastically reduce the number of MapReduce jobs by using only one MapReduce job to obtain centers. The -means++ initialization algorithm is executed in the Mapper phase and the weighted -means++ initialization algorithm is run in the Reducer phase. As this new MapReduce -means++ method replaces the iterations among multiple machines with a single machine, it can reduce the communication and I/O costs significantly. We also prove that the proposed MapReduce -means++ method obtains approximation to the optimal solution of -means. To reduce the expensive distance computation of the proposed method, we further propose a pruning strategy that can greatly avoid a large number of redundant distance computations. Extensive experiments on real and synthetic data are conducted and the performance results indicate that the proposed MapReduce -means++ method is much more efficient and can achieve a good approximation.
机译:-means无疑是最流行的聚类算法之一,因为它具有简单性和效率。但是,该算法对所选的初始中心高度敏感,因此正确的初始化对于获得理想的解决方案至关重要。为了解决此问题,建议使用-means ++顺序选择中心,以实现可证明接近最佳中心的解决方案。但是,由于其较弱的可伸缩性,-means ++随着数据大小的增加而变得效率低下。为了提高其可扩展性和效率,本文提出了MapReduce -means ++方法,该方法可以通过仅使用一个MapReduce作业来获得中心,从而大大减少MapReduce作业的数量。 -means ++初始化算法在Mapper阶段执行,加权-means ++初始化算法在Reducer阶段运行。由于此新的MapReduce -means ++方法用一台机器替换了多台机器之间的迭代,因此可以显着减少通信和I / O成本。我们还证明了所提出的MapReduce -means ++方法获得了-means最优解的近似值。为了减少所提出方法的昂贵距离计算,我们进一步提出了一种修剪策略,该策略可以大大避免大量冗余距离计算。对真实和合成数据进行了广泛的实验,性能结果表明,所提出的MapReduce -means ++方法效率更高,并且可以实现良好的近似。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号