Hypothesis margin based weighting for feature selection using boosting: Theory, algorithms and applications.


Abstract

Feature selection (FS) is a preprocessing step that aims to identify a small subset of highly predictive features from a large set of raw input variables, many of which may be irrelevant or redundant. It plays a fundamental role in the success of many learning tasks in which high dimensionality is a major challenge. In this thesis, we take an unusual approach: we use boosting as an effective FS method by exploiting the mean margins of the training examples. A weight criterion, termed Margin Fraction (MF), is assigned to each feature that contributes to the margin distribution combined in the final output produced by boosting. We argue that using the MF is more favorable for several reasons. First, boosting hypothesis margins have been used both for theoretical generalization bounds and as guidelines for algorithm design, so a natural goal is to find learners (features) that achieve a maximum margin. Second, current boosting-based feature selection methods measure the relative importance of features based on the Confidence Ratio (CR) of the learned base hypothesis. However, while a feature may have a large CR, it will not contribute to a good overall margin unless its "conditional" margin is also large.

The thesis consists of two main parts. In part one, we establish a rigorous theoretical and mathematical basis for the proposed weighting and selection methodology, and we describe how to extend this methodology to handle imbalanced data. We do so by defining a new weight metric, termed AUC Margin Fraction (AMF), that characterizes the quality of a set of features based on the maximized Area Under the ROC Curve (AUC) margin it induces while learning with boosting. Based on this, we design two different embedded FS algorithms, SBS-MF and SBS-AMF. We then investigate the effectiveness of the proposed methods through extensive comparisons with other algorithms on real-world data.

In part two, we apply the proposed SBS-AMF method to design a real intrusion detection system (IDS) for virtual server environments that uses only information available from the perspective of the virtual machine monitor (VMM). VMM-based IDSs break the boundaries of current state-of-the-art IDSs. They represent a new point in the IDS design space that trades a lack of program semantics for greater malware resistance and ease of deployment. To test the effectiveness and robustness of our proposed VMM IDS, we use different classes of servers, virtual appliances, and workloads, as well as different classes of malware. Our experimental results show that SBS-AMF achieves significantly better detection performance on the data sets tested using the Local Outlier Factor (LOF) anomaly detection algorithm, with an average detection rate of 96% and a false alarm rate of 5%. These results indicate that the features selected by SBS-AMF carry enough information to build a real IDS that is not sensitive to the characteristics of the attack behavior or to a specific workload.

Given the growing popularity of Graphics Processing Units (GPUs) in general-purpose computing, we applied this parallel computing approach to accelerate the LOF method and so increase the detection speed of the proposed VMM IDS, since near real-time performance is needed to detect malicious activity before the system becomes fully compromised. With the GPU-enabled LOF CUDA implementation we achieved more than a 100X. (Abstract shortened by UMI.)
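The abstract does not give the formula for the Margin Fraction, so the following is only a rough sketch of the general idea of crediting features with their share of the boosting margin. It assumes AdaBoost over single-feature decision stumps and splits the mean normalized margin of the training set among the features the stumps use. The function names, the toy data, and the per-feature "share" are all hypothetical illustrations and are not the thesis's MF or SBS-MF definitions.

```python
import math

def stump_predict(row, feature, threshold, polarity):
    """Decision stump: +polarity if the feature exceeds the threshold, else -polarity."""
    return polarity * (1 if row[feature] > threshold else -1)

def best_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted training error."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring feature values
            for pol in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, t, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, pol)
    return best

def adaboost(X, y, rounds):
    """Plain discrete AdaBoost; returns a list of (alpha, feature, threshold, polarity)."""
    w = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(rounds):
        err, j, t, pol = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, t, pol))
        # Up-weight misclassified examples, down-weight correct ones, renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, t, pol))
             for row, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def normalized_margins(X, y, ensemble):
    """Margin of each example: y * sum(alpha * h(x)) / sum(alpha), in [-1, 1]."""
    total = sum(alpha for alpha, _, _, _ in ensemble)
    return [yi * sum(a * stump_predict(row, j, t, p) for a, j, t, p in ensemble) / total
            for row, yi in zip(X, y)]

def margin_shares(X, y, ensemble):
    """Hypothetical per-feature split of the mean normalized margin (not the thesis's MF)."""
    total = sum(alpha for alpha, _, _, _ in ensemble)
    shares = [0.0] * len(X[0])
    for alpha, j, t, pol in ensemble:
        for row, yi in zip(X, y):
            shares[j] += alpha * yi * stump_predict(row, j, t, pol) / (total * len(X))
    return shares

# Toy data (made up): feature 0 is informative, feature 1 is mostly noise.
X = [[2.0, 1.0], [3.0, 0.0], [4.0, 1.0], [0.5, 1.0],
     [0.0, 0.0], [1.0, 1.0], [1.5, 1.0], [0.2, 0.0]]
y = [1, 1, 1, 1, -1, -1, -1, -1]

ensemble = adaboost(X, y, rounds=3)
margins = normalized_margins(X, y, ensemble)
shares = margin_shares(X, y, ensemble)
print("per-feature margin shares:", shares)
print("mean normalized margin:", sum(margins) / len(margins))
```

By construction the shares sum to the mean normalized margin, so ranking features by share is one simple way to ask which features the boosted ensemble relies on to separate the classes with confidence. A feature that no stump selects receives a share of zero.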

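Bibliographic record

  • Author

    Alshawabkeh, Malak

  • Author affiliation

    Northeastern University

  • Degree-granting institution: Northeastern University
  • Subject: Engineering Computer
  • Degree: Ph.D.
  • Year: 2013
  • Pages: 161 p.
  • Total pages: 161
  • Original format: PDF
  • Language: English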