Fast and exact out-of-core and distributed k-means clustering

Ruoming Jin; Anjan Goswami; Gagan Agrawal

首页> 外文期刊>Knowledge and information systems >Fast and exact out-of-core and distributed k-means clustering

【24h】

Fast and exact out-of-core and distributed k-means clustering

机译：快速准确的核心外和分布式k均值聚类

获取原文

获取原文并翻译 | 示例

掌桥外文数据库（机构版） >>

开具论文收录证明 >>

文献代查 >>

页面导航

摘要
著录项
相似文献
相关主题

摘要

Clustering has been one of the most widely studied topics in data mining and k-means clustering has been one of the popular clustering algorithms. K-means requires several passes on the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on various approximate versions of k-means, which require only one or a small number of passes on the entire dataset. In this paper, we present a new algorithm, called fast and exact k-means clustering (FEKM), which typically requires only one or a small number of passes on the entire dataset and provably produces the same cluster centres as reported by the original k-means algorithm. The algorithm uses sampling to create initial cluster centres and then takes one or more passes over the entire dataset to adjust these cluster centres. We provide theoretical analysis to show that the cluster centres thus reported are the same as the ones computed by the original k-means algorithm. Experimental results from a number of real and synthetic datasets show speedup between a factor of 2 and 4.5, as compared with k-means. This paper also describes and evaluates a distributed version of FEKM, which we refer to as DFEKM. This algorithm is suitable for analysing data that is distributed across loosely coupled machines. Unlike the previous work in this area, DFEKM provably produces the same results as the original k-means algorithm. Our experimental results show that DFEKM is clearly better than two other possible options for exact clustering on distributed data, which are down loading all data and running sequential k-means or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance.

机译：聚类一直是数据挖掘中研究最广泛的主题之一，而k均值聚类一直是流行的聚类算法之一。 K均值需要在整个数据集上进行多次传递，这对于大型磁盘驻留数据集而言非常昂贵。鉴于此，已经对k均值的各种近似版本进行了大量工作，这些工作仅需要对整个数据集进行一次或少量遍。在本文中，我们提出了一种称为快速精确k均值聚类（FEKM）的新算法，该算法通常只需要对整个数据集进行一次或少量传递，并证明可产生与原始k所报告的相同的聚类中心-均值算法。该算法使用采样来创建初始聚类中心，然后对整个数据集进行一次或多次遍历以调整这些聚类中心。我们提供理论分析，以显示由此报告的聚类中心与原始k-means算法计算的聚类中心相同。来自大量真实和合成数据集的实验结果显示，与k均值相比，速度提高了2到4.5倍。本文还描述并评估了FEKM的分布式版本，我们将其称为DFEKM。该算法适用于分析在松散耦合的计算机之间分布的数据。与该领域的先前工作不同，DFEMK可证明产生与原始k均值算法相同的结果。我们的实验结果表明，对于在分布式数据上进行精确聚类，DFEKM明显优于其他两个可能的选项，后者分别下载所有数据并在松耦合配置下运行顺序k均值或并行k均值。此外，即使在紧密耦合的环境中，如果存在明显的负载不平衡，DFEMK也会优于并行k均值。

著录项

来源
《Knowledge and information systems》 |2006年第1期|共24页
作者
Ruoming Jin; Anjan Goswami; Gagan Agrawal;
展开▼
作者单位

展开▼
收录信息
原文格式 PDF
正文语种 eng
中图分类自动化系统理论;
关键词
k-means clustering; Out-of-core datasets distributed; k-means; Confidence radius; Boundary points;

机译：k-均值聚类;核外数据集分布;k-均值;置信半径;边界点;

相似文献

外文文献
中文文献
专利

1. Fast and exact out-of-core and distributed k-means clustering [J] . Ruoming Jin, Anjan Goswami, Gagan Agrawal Knowledge and information systems . 2006,第1期

机译：快速准确的核心外和分布式k均值聚类
2. Slabpose Columnsort: A New Oblivious Algorithm for Out-of-Core Sorting on Distributed-Memory Clusters [J] . Geeta Chaudhry, Thomas H. Cormen Algorithmica . 2006,第3期

机译：Slabpose Columnsort：一种新的遗忘算法，用于分布式内存集群上的核外排序
3. Comparison of Distributed K-Means and Distributed Fuzzy C-Means Algorithms for Text Clustering [J] . I Made Artha Agastya, Teguh Bharata Adji, Noor Akhmad Setiawan Communications in Science and Technology . 2017,第1期

机译：文本聚类的分布式K均值和分布式模糊C均值算法的比较
4. Even Faster Exact k-Means Clustering [C] . Christian Borgelt International Symposium on Intelligent Data Analysis . 2020

机译：甚至更快的k-merical群集
5. A fast and scalable hardware architecture for K-means clustering for big data analysis. [D] . Raghavan, Ramprasad. 2016

机译：用于K均值群集的快速且可扩展的硬件体系结构，用于大数据分析。
6. A Fast Exact k-Nearest Neighbors Algorithm for High Dimensional Search Using k-Means Clustering and Triangle Inequality [O] . Xueyi Wang -1

机译：快速精确最近邻居法高维搜索使用K-均值聚类和三角不等式
7. Fast and exact out-of-core and distributed k-means clustering [O] . Ruoming Jin, Anjan Goswami, Gagan Agrawal 2005

机译：快速和精确的外核和分布式K-MEARE集群

Fast and exact out-of-core and distributed k-means clustering

摘要

著录项

相似文献

相关主题

期刊订阅