MAPREDUCE-BASED DISTRIBUTED CLUSTER PROCESSING METHOD FOR LARGE-SCALE DATA
Abstract
The present invention provides a MapReduce-based distributed cluster processing method for large-scale data, comprising: sampling the large-scale data in equal proportion and without repetition; inputting the sampled data into a MapReduce distributed parallel framework and calculating the local density of each sampled data point and the average density of the sampled data; taking all sampled data points whose local density is greater than the average density as a candidate point set for the initial cluster center points, and feeding the candidate point set back to a master node, wherein any two adjacent candidate points whose distance from each other is greater than twice a set range are selected as the initial cluster center points; using the MapReduce distributed parallel framework to perform a parallel clustering task, wherein, for each cluster, an average value of the distance between the data is calculated in order to update the cluster center point; applying, at the child nodes, an error sum of squares criterion function to determine whether to continue iterating; and clustering, at the child nodes, the large-scale data according to the cluster center points. The invention thereby implements parallel clustering, reducing the number of clustering iterations while increasing clustering accuracy and the efficiency of parallel clustering.
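The steps in the abstract can be illustrated with a minimal single-process Python sketch. It is not the patented implementation and does not use Hadoop: the map and reduce phases are simulated with plain functions, and every name (`sample_without_replacement`, `local_densities`, `initial_centers`, `map_assign`, `reduce_update`, `cluster`) is hypothetical. Several details are assumptions, because the abstract does not specify them: local density is taken as the number of other points within the set range (`radius`), candidates are filtered greedily from densest to least dense, each center is updated to the mean of its cluster members, and iteration runs on the sampled data and stops when the error sum of squares changes by less than `tol`.

```python
import math
import random


def sample_without_replacement(data, fraction, seed=0):
    """Draw a fixed proportion of the data, never picking a point twice."""
    k = max(1, int(len(data) * fraction))
    return random.Random(seed).sample(data, k)


def local_densities(points, radius):
    """Assumed density: number of other points within `radius` of each point."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]


def initial_centers(points, radius):
    """Keep above-average-density points that lie more than 2*radius apart."""
    dens = local_densities(points, radius)
    avg = sum(dens) / len(dens)
    ranked = sorted(zip(points, dens), key=lambda t: -t[1])
    candidates = [p for p, d in ranked if d > avg]
    centers = []
    for c in candidates:  # greedy pass, densest candidates first (assumption)
        if all(math.dist(c, s) > 2 * radius for s in centers):
            centers.append(c)
    return centers


def map_assign(chunk, centers):
    """Map phase: emit (index of nearest center, point) for one data chunk."""
    nearest = lambda p: min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
    return [(nearest(p), p) for p in chunk]


def reduce_update(pairs, centers):
    """Reduce phase: move each center to its members' mean, return new SSE."""
    groups = {}
    for idx, p in pairs:
        groups.setdefault(idx, []).append(p)
    new_centers = list(centers)
    for idx, pts in groups.items():
        new_centers[idx] = tuple(sum(c) / len(pts) for c in zip(*pts))
    sse = sum(math.dist(p, new_centers[idx]) ** 2 for idx, p in pairs)
    return new_centers, sse


def cluster(data, fraction=0.5, radius=1.0, n_chunks=2, tol=1e-6, max_iter=50):
    """Sample, pick initial centers, iterate on the sample, label all data."""
    sample = sample_without_replacement(data, fraction)
    centers = initial_centers(sample, radius)
    if not centers:
        raise ValueError("no point has above-average density; adjust radius")
    chunks = [sample[i::n_chunks] for i in range(n_chunks)]  # simulated child nodes
    prev_sse = float("inf")
    for _ in range(max_iter):
        pairs = [kv for chunk in chunks for kv in map_assign(chunk, centers)]
        centers, sse = reduce_update(pairs, centers)
        if abs(prev_sse - sse) < tol:  # SSE criterion: stop once it stabilizes
            break
        prev_sse = sse
    labels = [idx for idx, _ in map_assign(data, centers)]
    return centers, labels


if __name__ == "__main__":
    rng = random.Random(1)
    pts = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(50)]
    pts += [(rng.gauss(8, 0.3), rng.gauss(8, 0.3)) for _ in range(50)]
    print(cluster(pts, fraction=0.5, radius=1.0)[0])
```

Because the initial centers come from dense, well-separated regions of the sample, the number of clusters does not have to be supplied in advance, and the iterations start close to a stable solution. This is in line with the abstract's claim of fewer clustering iterations, though the sketch itself does not measure that.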