...
首页> 外文期刊>Knowledge and information systems >Fast and exact out-of-core and distributed k-means clustering
【24h】

Fast and exact out-of-core and distributed k-means clustering

机译:快速准确的核心外和分布式k均值聚类

获取原文
获取原文并翻译 | 示例
           

摘要

Clustering has been one of the most widely studied topics in data mining and k-means clustering has been one of the popular clustering algorithms. K-means requires several passes on the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on various approximate versions of k-means, which require only one or a small number of passes on the entire dataset. In this paper, we present a new algorithm, called fast and exact k-means clustering (FEKM), which typically requires only one or a small number of passes on the entire dataset and provably produces the same cluster centres as reported by the original k-means algorithm. The algorithm uses sampling to create initial cluster centres and then takes one or more passes over the entire dataset to adjust these cluster centres. We provide theoretical analysis to show that the cluster centres thus reported are the same as the ones computed by the original k-means algorithm. Experimental results from a number of real and synthetic datasets show speedup between a factor of 2 and 4.5, as compared with k-means. This paper also describes and evaluates a distributed version of FEKM, which we refer to as DFEKM. This algorithm is suitable for analysing data that is distributed across loosely coupled machines. Unlike the previous work in this area, DFEKM provably produces the same results as the original k-means algorithm. Our experimental results show that DFEKM is clearly better than two other possible options for exact clustering on distributed data, which are down loading all data and running sequential k-means or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance.
机译:聚类一直是数据挖掘中研究最广泛的主题之一,而k均值聚类一直是流行的聚类算法之一。 K均值需要在整个数据集上进行多次传递,这对于大型磁盘驻留数据集而言非常昂贵。鉴于此,已经对k均值的各种近似版本进行了大量工作,这些工作仅需要对整个数据集进行一次或少量遍。在本文中,我们提出了一种称为快速精确k均值聚类(FEKM)的新算法,该算法通常只需要对整个数据集进行一次或少量传递,并证明可产生与原始k所报告的相同的聚类中心-均值算法。该算法使用采样来创建初始聚类中心,然后对整个数据集进行一次或多次遍历以调整这些聚类中心。我们提供理论分析,以显示由此报告的聚类中心与原始k-means算法计算的聚类中心相同。来自大量真实和合成数据集的实验结果显示,与k均值相比,速度提高了2到4.5倍。本文还描述并评估了FEKM的分布式版本,我们将其称为DFEKM。该算法适用于分析在松散耦合的计算机之间分布的数据。与该领域的先前工作不同,DFEMK可证明产生与原始k均值算法相同的结果。我们的实验结果表明,对于在分布式数据上进行精确聚类,DFEKM明显优于其他两个可能的选项,后者分别下载所有数据并在松耦合配置下运行顺序k均值或并行k均值。此外,即使在紧密耦合的环境中,如果存在明显的负载不平衡,DFEMK也会优于并行k均值。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号