《计算机学报》 (Chinese Journal of Computers)

A Minimum Spanning Tree Outlier Detection Method Based on Fast k-Nearest Neighbors

         

Abstract

Outlier detection, also known as anomaly detection, is an important and fundamental research task in the field of data mining. It is mainly used to uncover unusual mechanisms or potential dangers, and it aims to detect those observations that deviate so much from the remaining data that they become suspicious. Outliers, which are novel, abnormal, and few in number, are often discarded as noise. Outliers can also be classified into several types, such as local and global outliers. Outlier detection techniques can be applied in many fields, for example to intrusion behavior, fraud, and early signs of disease in medicine. Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to outlier detection. kNN-based algorithms can be applied to large data sets efficiently, so they are widely used in distance-based and density-based outlier detection. Unfortunately, the time complexity of the kNN-based algorithm is O(N²), and the cost grows sharply with the size of the data set. The time and space complexity of minimum spanning tree (MST)-based clustering algorithms that use Prim's or Kruskal's method is also O(N²), the clustering result depends on parameters supplied by the user, and such algorithms cannot detect local outliers inside high-density clusters. Existing MST-based algorithms become ineffective when given unsuitable parameters or when applied to data sets composed of clusters with diverse shapes, sizes, and densities. Meanwhile, most MST algorithms cannot build the tree dynamically, because they need to know the distance between every pair of points in advance. To address these problems, we propose a new outlier detection method that absorbs the advantages of distance-based and density-based methods. First, the algorithm builds a split tree to store the proximity information among the data points. Second, it efficiently obtains all pairs of the well-separated pair decomposition of the whole data set. Third, it partitions the input data set into several frames that satisfy certain conditions, so that each point's k nearest neighbors can be obtained quickly from the first two results. Fourth, a minimum spanning tree is built dynamically from the k-nearest-neighbor result. In addition, points suspected of being outliers are ranked by their outlier factor using MST-based clustering, without the user having to enter the number of clusters manually; a new algorithm and a new metric are proposed to select the exact number of clusters and to avoid insignificant clusters, and all outliers are then reported. The time complexities of computing the k nearest neighbors and of creating the tree are O(kN) and O(NlogN), respectively. The experiments show that the new algorithm can detect both local and global outliers without requiring the number of clusters from the user. In the experiments, we use a series of real and synthetic data sets to verify the efficiency and effectiveness of the KDNS, FkNN, and ADC algorithms proposed in this paper. The experimental results show that, compared with previous approaches, our algorithms drastically reduce the time complexity and significantly improve the rate of outlier detection.
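The distance-based idea behind the kNN step can be illustrated with a small, self-contained sketch. The code below is not the paper's FkNN or KDNS procedure (which uses a split tree and well-separated pair decomposition to reach O(kN)); it is a brute-force O(N²) baseline that scores every point by the distance to its k-th nearest neighbor, the quantity that distance-based methods use to rank suspected outliers. The function name knn_outlier_scores and the toy data are assumptions made for illustration only.

import numpy as np

# Brute-force k-distance outlier scoring: a minimal sketch, not the paper's
# O(kN) FkNN algorithm.  Every point is scored by the distance to its k-th
# nearest neighbor; isolated points receive large scores.
def knn_outlier_scores(X, k=5):
    # Pairwise Euclidean distances, O(N^2) time and space.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbor
    # Distance to the k-th nearest neighbor (the "k-distance").
    return np.partition(dist, k - 1, axis=1)[:, k - 1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cluster = rng.normal(0.0, 1.0, size=(200, 2))     # one dense cluster
    strays = rng.uniform(8.0, 10.0, size=(5, 2))      # five far-away points
    X = np.vstack([cluster, strays])
    scores = knn_outlier_scores(X, k=5)
    top5 = np.argsort(scores)[::-1][:5]               # highest scores first
    print("suspected outliers:", sorted(top5.tolist()))

The paper's contribution at this stage is computing the k nearest neighbors in O(kN) via the split tree and well-separated pair decomposition, rather than by the O(N²) enumeration shown here.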
Outlier detection, also called anomaly detection, is one of the meaningful and actively studied problems in data mining, with wide applications such as intrusion behavior, fraud, and early signs of disease in medicine. kNN-based algorithms can be applied to large data sets well and are therefore widely used in distance-based and density-based outlier detection. However, the time complexity of the k-nearest-neighbor algorithm is O(N²), and the time cost grows greatly as the data set grows. When MST-based clustering algorithms build the minimum spanning tree with Prim's or Kruskal's algorithm, both the space complexity and the time complexity are O(N²); the clustering result depends on the user's choice of parameters, and local outliers inside dense clusters are easily missed. To address these problems, a new outlier detection method is proposed that combines the strengths of density-based and clustering-based methods. The method has the following advantages: (1) the time complexity of computing the k nearest neighbors is O(kN) (k ≪ N); (2) the time complexity of building the minimum spanning tree is O(NlogN); (3) the number of clusters is identified adaptively; (4) several types of outliers can be detected. Finally, extensive experiments verify the effectiveness of the proposed KDNS, FkNN, and ADC algorithms. The experimental results show that, compared with existing algorithms, the proposed algorithms can greatly reduce the time complexity and significantly improve the outlier detection rate.
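For the MST-based clustering step, a standard baseline is to build the tree, cut its heaviest edges, and report very small components as outliers. The sketch below follows that recipe using SciPy's MST routine on the full O(N²) distance matrix; it does not implement the dynamic O(NlogN) construction or the adaptive cluster-number selection (ADC) described above, and the function mst_outliers with its n_clusters and min_size parameters is a hypothetical example, not the paper's interface.

import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

# MST-based clustering baseline: cut the (n_clusters - 1) heaviest edges of the
# minimum spanning tree and flag points in components smaller than min_size.
def mst_outliers(X, n_clusters=4, min_size=5):
    dist = squareform(pdist(X))                        # dense pairwise distances
    mst = minimum_spanning_tree(dist).toarray()        # MST over the complete distance graph
    edges = np.argwhere(mst > 0)                       # (i, j) of every tree edge
    weights = mst[mst > 0]
    # Remove the heaviest edges to split the tree into n_clusters components.
    for i, j in edges[np.argsort(weights)[::-1][: n_clusters - 1]]:
        mst[i, j] = 0.0
    n_comp, labels = connected_components(mst, directed=False)
    sizes = np.bincount(labels, minlength=n_comp)      # component sizes
    return np.flatnonzero(sizes[labels] < min_size)    # points in tiny components

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([
        rng.normal(0.0, 0.5, size=(100, 2)),           # cluster A
        rng.normal(6.0, 0.5, size=(100, 2)),           # cluster B
        np.array([[3.0, 15.0], [-8.0, -9.0]]),         # two isolated points
    ])
    print("outlier indices:", mst_outliers(X, n_clusters=4, min_size=5))

Fixing n_clusters and min_size by hand is exactly the dependence on user parameters that the proposed adaptive cluster-number selection is meant to remove.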
