2015 International Conference on Communication Networks

Analysis and performance improvement of K-means clustering in big data environment


Abstract

Big data environments support the processing of very large volumes of data, on the order of gigabytes to terabytes. Online applications that generate very large numbers of data requests, such as Facebook and Google, are therefore served using big data technologies. This work studies the big data environment, investigating how data is consumed and how the supporting tools work with Hadoop storage. To make the investigation concrete, a cluster analysis technique, specifically the k-means clustering algorithm, is implemented using Hadoop and MapReduce. Clustering is a part of big data analytics in which unlabelled data is processed and organized into groups. The traditional k-means algorithm is observed not to work well with Hadoop and MapReduce, so a small modification is made to the data processing technique. In addition, cluster analysis reveals several issues with traditional k-means, namely fluctuating accuracy, outliers, and empty clusters. A new clustering algorithm that modifies the traditional k-means approach is therefore proposed and implemented. The approach first improves data quality by removing outlier points from the dataset and then applies the bi-part method to perform the clustering. The proposed technique is implemented using Java, Hadoop, and MapReduce, and its performance is evaluated and compared with that of the traditional k-means algorithm. The results show improved accuracy of cluster formation and the removal of these deficiencies. The proposed approach is thus suitable for the big data environment and improves clustering performance.
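The abstract does not spell out the outlier rule or the bi-part method, so the following is only a minimal sketch under stated assumptions. It is plain Python that mimics the map and reduce phases rather than using the Hadoop API or the authors' Java code. The outlier rule (distance from the global mean above the mean distance plus `z` standard deviations) and the reading of "bi-part" as a bisecting split (repeatedly dividing the cluster with the largest squared error in two) are both assumptions, and all function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    center = mean(points)
    return sum(math.dist(p, center) ** 2 for p in points)


def remove_outliers(points, z=2.0):
    """Assumed rule: drop points farther from the global mean than
    (mean distance + z * standard deviation of the distances)."""
    center = mean(points)
    dists = [math.dist(p, center) for p in points]
    mu = sum(dists) / len(dists)
    sigma = math.sqrt(sum((d - mu) ** 2 for d in dists) / len(dists))
    return [p for p, d in zip(points, dists) if d <= mu + z * sigma]


def map_assign(point, centroids):
    """Map phase: emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)),
                  key=lambda i: math.dist(point, centroids[i]))
    return nearest, point


def reduce_means(pairs):
    """Reduce phase: group points by key and average each group."""
    groups = defaultdict(list)
    for key, point in pairs:
        groups[key].append(point)
    return {key: mean(group) for key, group in groups.items()}


def two_means(points, rng, max_iters=20):
    """Split one cluster in two with k-means (k = 2)."""
    # Seed with a random point and the point farthest from it, so the
    # two starting centroids differ whenever the cluster has any spread.
    first = rng.choice(points)
    second = max(points, key=lambda p: math.dist(p, first))
    centroids = [first, second]
    for _ in range(max_iters):
        new = reduce_means(map_assign(p, centroids) for p in points)
        updated = [new.get(i, centroids[i]) for i in range(2)]
        if updated == centroids:
            break
        centroids = updated
    halves = [[], []]
    for p in points:
        halves[map_assign(p, centroids)[0]].append(p)
    return halves


def bisecting_kmeans(points, k, seed=0):
    """Assumed reading of 'bi-part': keep splitting the cluster with
    the largest squared error until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        if sse(clusters[worst]) == 0:
            break  # nothing left that can be split
        clusters.extend(two_means(clusters.pop(worst), rng))
    return clusters


# Toy data: three tight groups plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (0, 10), (0, 11), (1, 10), (1, 11),
          (100, 100)]
cleaned = remove_outliers(points)
clusters = bisecting_kmeans(cleaned, k=3)
```

In this sketch the seeded random generator keeps runs repeatable, and only clusters with non-zero spread are ever split. These choices are aimed at the fluctuating-accuracy and empty-cluster issues the abstract mentions, though the paper may handle them differently.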
