
Machine learning for network based intrusion detection : an investigation into discrepancies in findings with the KDD cup '99 data set and multi-objective evolution of neural network classifier ensembles from imbalanced data


Abstract

Over the last decade, it has become commonplace to evaluate machine learning techniques for network based intrusion detection on the KDD Cup '99 data set. This data set has served well to demonstrate that machine learning can be useful in intrusion detection. However, it has been criticised in the literature, and it is out of date. Therefore, some researchers question the validity of the findings reported based on this data set. Furthermore, as identified in this thesis, there are also discrepancies in the findings reported in the literature, and in some cases the results are contradictory. Consequently, it is difficult to analyse the current body of research to determine the value of the findings. This thesis reports on an empirical investigation to determine the underlying causes of the discrepancies. Several methodological factors, such as choice of data subset, validation method and data preprocessing, are identified and are found to affect the results significantly. These findings have also enabled a better interpretation of the current body of research. Furthermore, the criticisms in the literature are addressed and future use of the data set is discussed, which is important since researchers continue to use it due to a lack of better publicly available alternatives. Due to the nature of the intrusion detection domain, there is an extreme imbalance among the classes in the KDD Cup '99 data set, which poses a significant challenge to machine learning. In other domains, researchers have demonstrated that well known techniques such as Artificial Neural Networks (ANNs) and Decision Trees (DTs) often fail to learn the minor class(es) due to class imbalance. However, this has not previously been recognised as an issue in intrusion detection. This thesis reports on an empirical investigation that demonstrates that it is the class imbalance that causes the poor detection of some classes of intrusion reported in the literature.
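
The effect of class imbalance described above can be illustrated with a small sketch. The class counts and the helper names `accuracy` and `per_class_rates` are assumptions made for illustration only; they are not taken from the thesis or from the KDD Cup '99 data set. A classifier that always predicts the major class scores a high overall accuracy while detecting none of the minor class.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of all instances classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_rates(y_true, y_pred):
    """Fraction of each class's instances classified correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

# Hypothetical, heavily imbalanced labels: 990 major-class, 10 minor-class.
y_true = ["normal"] * 990 + ["attack"] * 10
# A degenerate classifier that has only learned the major class.
y_pred = ["normal"] * 1000

overall = accuracy(y_true, y_pred)
rates = per_class_rates(y_true, y_pred)
print(overall)  # overall accuracy looks excellent
print(rates)    # per-class rates expose the undetected minor class
```

Reporting the rate for each class separately, rather than a single accuracy figure, is what makes the failure on the minor class visible.
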
An alternative approach to training ANNs is proposed in this thesis, using Genetic Algorithms (GAs) to evolve the weights of the ANNs, referred to as an Evolutionary Neural Network (ENN). When employing evaluation functions that calculate the fitness proportionally to the instances of each class, thereby avoiding a bias towards the major class(es) in the data set, significantly improved true positive rates are obtained whilst maintaining a low false positive rate. These findings demonstrate that the issues of learning from imbalanced data are due not to limitations of the ANNs but to the training algorithm. Moreover, the ENN is capable of detecting a class of intrusion that has been reported in the literature to be undetectable by ANNs. One limitation of the ENN is a lack of control over the classification trade-off the ANNs obtain. This is identified as a general issue with current approaches to creating classifiers. Striving to create a single best classifier that obtains the highest accuracy may give an unfruitful classification trade-off, which is demonstrated clearly in this thesis. Therefore, an extension of the ENN is proposed, using a Multi-Objective GA (MOGA), which treats the classification rate on each class as a separate objective. This approach produces a Pareto front of non-dominated solutions that exhibit different classification trade-offs, from which the user can select one with the desired properties. The multi-objective approach is also utilised to evolve classifier ensembles, which yields an improved Pareto front of solutions. Furthermore, the selection of classifier members for the ensembles is investigated, demonstrating how this affects the performance of the resultant ensembles. This is key to explaining why some classifier combinations fail to give fruitful solutions.
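
The abstract does not give the thesis's evaluation functions or its MOGA, so the following is only a minimal sketch of the two ideas under stated assumptions. `balanced_fitness` assumes the fitness is the plain mean of per-class classification rates, and `pareto_front` is a generic non-dominated filter; both names and all candidate values are hypothetical.

```python
def balanced_fitness(rates):
    """Assumed fitness: mean of per-class rates, so each class counts
    equally regardless of how many instances it has."""
    return sum(rates) / len(rates)

def dominates(a, b):
    """True if a is at least as good as b on every objective and
    strictly better on at least one (objectives are maximised)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(candidates):
    """Keep only candidates that no other candidate dominates."""
    return [a for a in candidates
            if not any(dominates(b, a) for b in candidates)]

# Hypothetical classifiers, each described by its classification rate
# on (major class, minor class).
candidates = [
    (0.99, 0.10),
    (0.95, 0.60),
    (0.90, 0.80),
    (0.85, 0.70),  # dominated by (0.90, 0.80)
    (0.95, 0.50),  # dominated by (0.95, 0.60)
]

front = pareto_front(candidates)
best_balanced = max(candidates, key=balanced_fitness)
print(front)
print(best_balanced)
```

A single-objective fitness collapses the trade-off to one winner, whereas the front keeps every non-dominated trade-off so that a user can choose among them.
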

Bibliographic record

  • Author: Engen Vegard
  • Year: 2010
  • Original format: PDF
  • Language: English
