International Seminar on Research of Information Technology and Intelligent Systems

Removing Noise, Reducing dimension, and Weighting Distance to Enhance k-Nearest Neighbors for Diabetes Classification



Abstract

Various machine learning methods have been applied in the medical field to classify diseases such as diabetes. The k-nearest neighbors (KNN) algorithm is one of the best-known approaches for predicting diabetes. Many researchers have found that combining KNN with one or more other algorithms can provide better results. In this paper, a combination of three procedures, removing noise, reducing dimensionality, and weighting distances, is proposed to improve a standard voting-based KNN for classifying the Pima Indians Diabetes Dataset (PIDD) into two classes. First, the noise in the training set is removed using k-means clustering (KMC) to make the voter data in both classes more competent. Second, the dimensionality is reduced to decrease the intra-class data distances while increasing the inter-class ones. Two dimensionality-reduction methods, principal component analysis (PCA) and an autoencoder (AE), are applied to investigate the linearity of the dataset. Since the dataset is imbalanced, a proportional weight is incorporated into the distance formula to make the voting fair. A 5-fold cross-validation-based evaluation shows that each proposed procedure works very well in enhancing KNN. KMC alone increases the accuracy of KNN from 81.6% to 86.7%. Combining KMC and PCA improves the KNN accuracy to 90.9%. A combination of KMC and AE enhances KNN to an accuracy of 97.8%. Combining the three proposed procedures of KMC, PCA, and weighted KNN (WKNN) increases the accuracy to 94.5%. Finally, the combination of KMC, AE, and WKNN reaches the highest accuracy of 98.3%. The fact that AE produces higher accuracies than PCA indicates that the features in the dataset are highly non-linear.
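The abstract describes weighting the KNN vote in proportion to class sizes to offset the imbalance in PIDD. The exact weighting formula is not given in the abstract; the sketch below is a minimal illustration assuming each neighbor's vote is scaled by the inverse frequency of its class, so minority-class neighbors are not drowned out by the majority. All names here (`weighted_knn_predict`, the toy data) are illustrative, not taken from the paper.

```python
import math
from collections import Counter

def weighted_knn_predict(train_X, train_y, x, k=3):
    """Classify x with k-nearest neighbors whose votes are scaled by
    inverse class frequency (an assumed form of the paper's
    'proportional weight' for imbalanced data)."""
    # Euclidean distance from x to every training point
    dists = sorted(
        (math.dist(p, x), label) for p, label in zip(train_X, train_y)
    )
    # Inverse-frequency weight per class: rare classes vote louder
    freq = Counter(train_y)
    n = len(train_y)
    votes = Counter()
    for _, label in dists[:k]:
        votes[label] += n / freq[label]
    return votes.most_common(1)[0][0]

# Toy imbalanced set: four class-0 points, two class-1 points
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (6, 6)]
y = [0, 0, 0, 0, 1, 1]
print(weighted_knn_predict(X, y, (5.5, 5.5), k=3))  # -> 1
```

With plain majority voting a minority-class query near the boundary can still lose once a single majority neighbor enters the top k; scaling each vote by n/freq[label] restores the balance without changing the distance computation itself.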
