Source: 《计算机应用》 (Journal of Computer Applications)

File-Type Identification Algorithm Based on Principal Component Analysis and K-Nearest Neighbors

         

Abstract

To address the low accuracy of identifying file types from file extensions and file signatures, a content-based file-type identification algorithm combining Principal Component Analysis (PCA) and K-Nearest Neighbors (KNN) is proposed. First, PCA is used to preprocess the samples and reduce the dimensionality of the sample space. Next, the dimension-reduced training samples are clustered, so that each file type is represented by its cluster centroids. Finally, to reduce the classification error that an unevenly distributed training set can cause, a distance-weighted KNN algorithm is proposed. Experimental results show that, with a large number of training samples, the improved algorithm lowers the computational complexity of classification while maintaining a high recognition accuracy. Because the algorithm does not depend on file-type signatures, it is also more widely applicable.
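
The final step, a distance-weighted KNN vote over cluster centroids, can be illustrated with a minimal sketch. The abstract does not state the weighting function, so the inverse-distance weight `1 / (d + eps)` used here is an assumption. The function name `weighted_knn_predict`, the labels, and the 2-D points standing in for PCA-reduced centroids are likewise hypothetical.

```python
import math
from collections import defaultdict


def weighted_knn_predict(query, centroids, k=3, eps=1e-9):
    """Classify `query` by an inverse-distance-weighted vote of its k nearest centroids.

    `centroids` is a list of (vector, label) pairs, for example the cluster
    centroids of each file type after dimensionality reduction.
    The 1 / (d + eps) weight is an assumption, not taken from the paper.
    """
    # Euclidean distance from the query to every centroid, nearest first.
    dists = sorted((math.dist(query, vec), label) for vec, label in centroids)
    votes = defaultdict(float)
    for d, label in dists[:k]:
        # Closer centroids contribute larger votes.
        votes[label] += 1.0 / (d + eps)
    return max(votes, key=votes.get)


# Hypothetical, deliberately unbalanced data: three "pdf" centroids, one "jpg".
centroids = [
    ((0.0, 0.0), "pdf"),
    ((0.0, 0.4), "pdf"),
    ((0.4, 0.0), "pdf"),
    ((1.0, 1.0), "jpg"),
]

# The query lies next to the single "jpg" centroid.
print(weighted_knn_predict((0.9, 0.9), centroids, k=3))
```

With `k=3`, the three nearest centroids to `(0.9, 0.9)` are one `jpg` and two `pdf`. An unweighted majority vote would pick `pdf`, whereas the distance weighting lets the much closer `jpg` centroid dominate. This is the kind of imbalance-induced error the abstract says the weighting is meant to reduce.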
