Analysis and performance improvement of K-means clustering in big data environment

Abstract

Big data environments support the processing of very large volumes of data, on the order of gigabytes to terabytes. Online applications that generate huge numbers of data requests, such as Facebook and Google, are therefore handled with big data technologies. This work studies the big data environment, investigating how data is consumed in it and how the supporting tools work with Hadoop storage. To ground the investigation, a cluster analysis technique, specifically the k-means clustering algorithm, is implemented with Hadoop and MapReduce. Clustering is the part of big data analytics in which unlabelled data is processed and organized into groups. The traditional k-means algorithm is observed not to work well with Hadoop and MapReduce, so a small modification is made to the data processing technique. In addition, cluster analysis with traditional k-means reveals several issues, namely fluctuating accuracy, outliers, and empty clusters. A new clustering algorithm that modifies the traditional k-means approach is therefore proposed and implemented. It first improves data quality by removing outlier points from the dataset and then performs the clustering using the bi-part method. The proposed technique is implemented using Java, Hadoop, and MapReduce, and its performance is evaluated and compared with that of the traditional k-means clustering algorithm. The results show more accurate cluster formation and the removal of these deficiencies. The proposed work is thus suitable for big data environments and improves clustering performance.
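The abstract does not give the details of the outlier removal, the bi-part method, or the MapReduce job layout, so the following is only a minimal plain-Java sketch of the general idea, not the authors' implementation. It assumes a simple distance-based outlier rule (the `factor` threshold is hypothetical), expresses one standard k-means iteration as a map step (assign each point to its nearest centroid) and a reduce step (average each group), and keeps the old centroid when a cluster receives no points. The class and method names are illustrative, and no Hadoop API is used.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double dist2(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Component-wise mean of a non-empty list of points.
    static double[] mean(List<double[]> pts) {
        double[] m = new double[pts.get(0).length];
        for (double[] p : pts) {
            for (int i = 0; i < m.length; i++) {
                m[i] += p[i];
            }
        }
        for (int i = 0; i < m.length; i++) {
            m[i] /= pts.size();
        }
        return m;
    }

    // ASSUMED outlier rule (not from the paper): drop every point whose
    // distance to the global mean exceeds `factor` times the average distance.
    static List<double[]> removeOutliers(List<double[]> pts, double factor) {
        double[] center = mean(pts);
        double avg = 0.0;
        for (double[] p : pts) {
            avg += Math.sqrt(dist2(p, center));
        }
        avg /= pts.size();
        List<double[]> kept = new ArrayList<>();
        for (double[] p : pts) {
            if (Math.sqrt(dist2(p, center)) <= factor * avg) {
                kept.add(p);
            }
        }
        return kept;
    }

    // "Map" step: index of the nearest centroid (ties go to the lower index).
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (dist2(p, centroids[c]) < dist2(p, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // One k-means iteration: map points to centroid keys, group by key,
    // then "reduce" each group to its mean. An empty cluster keeps its
    // previous centroid instead of producing an undefined mean.
    static double[][] iterate(List<double[]> pts, double[][] centroids) {
        Map<Integer, List<double[]>> groups = new HashMap<>();
        for (double[] p : pts) {
            groups.computeIfAbsent(nearest(p, centroids), key -> new ArrayList<>()).add(p);
        }
        double[][] next = new double[centroids.length][];
        for (int c = 0; c < centroids.length; c++) {
            List<double[]> group = groups.get(c);
            next[c] = (group == null) ? centroids[c].clone() : mean(group);
        }
        return next;
    }
}
```

In an actual Hadoop job the map and reduce steps would run as separate `Mapper` and `Reducer` tasks with the centroids distributed to every mapper; here they are collapsed into one method to keep the sketch self-contained.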

Bibliographic details

Source
2015 International Conference on Communication Networks, 2015, pp. 43-46 (4 pages)
Conference location: Gwalior (IN)
Authors
Purva Rathore; Deepak Shukla
Affiliation

Computer Science and Engineering, IES IPS Academy, Indore, India

Original format: PDF
Language: eng
Keywords
big data; clustering; data mining; implementation; performance improvement


Similar literature

1. A hybrid reciprocal model of PCA and K-means with an innovative approach of considering sub-datasets for the improvement of K-means initialization and step-by-step labeling to create clusters with high interpretability [J]. Anaraki Seyed Alireza Mousavian, Haeri Abdorrahman, Moslehi Fateme. Pattern Analysis and Applications. 2021, No. 3
2. PERFORMANCE IMPROVEMENT OF DISK BASED K-MEANS OVER K-MEANS ON LARGE DATASETS [J]. SWAGATIKA DEVI, TRILOKNATH PANDEY, ALOK KUMAR JAGADEV. Journal of Theoretical and Applied Information Technology. 2013, No. 3
3. Performance based analysis between k-Means and Fuzzy C-Means clustering algorithms for connection oriented telecommunication data [J]. T. Velmurugan. Applied Soft Computing. 2014
4. Analysis and performance improvement of K-means clustering in big data environment [C]. Purva Rathore, Deepak Shukla. International Conference on Communication Networks. 2015
5. A fast and scalable hardware architecture for K-means clustering for big data analysis [D]. Raghavan, Ramprasad. 2016
6. Does Determination of Initial Cluster Centroids Improve the Performance of K-Means Clustering Algorithm? Comparison of Three Hybrid Methods by Genetic Algorithm Minimum Spanning Tree and Hierarchical Clustering in an Applied Study [O]. Saeedeh Pourahmad, Atefeh Basirat, Amir Rahimi. 2020
7. Analysis of Simple K-Mean and Parallel K-Mean Clustering for Software Products and Organizational Performance Using Education Sector Dataset [O]. Rui Shang, Balqees Ara, Islam Zada. 2021
