
Robust and efficient feature selection for high-dimensional datasets.


Abstract

Feature selection is an active research topic in the machine learning and knowledge discovery in databases (KDD) communities. It makes data mining models more comprehensible to domain experts, improves their prediction performance and robustness, and reduces model training time. This dissertation aims to provide solutions to three issues that many current feature selection researchers overlook: feature interaction, data imbalance, and multiple subsets of features.

Most extant filter feature selection methods are pairwise comparison methods. They test one predictor variable against the response variable at a time and report a correlation measure for each feature. Such methods cannot take feature interactions into account.

Data imbalance is another issue in feature selection. If data imbalance is not considered, the selected features will be biased towards the majority class.

In high-dimensional datasets with sparse data samples, many different feature sets will be highly correlated with the output. Domain experts usually expect multiple feature sets to be identified so that they can evaluate them against their domain knowledge.

This dissertation aims to solve these three issues using a criterion called minimum expected cost of misclassification (MECM). MECM is a model-independent evaluation measure. It evaluates the classification power of the tested feature subset as a whole, and it has adjustable weights to deal with imbalanced datasets. A number of case studies showed that MECM had some favorable properties for searching for a compact subset of interacting features. In addition, an algorithm and a corresponding data structure were developed to produce multiple feature subsets.

This research has broad potential applications in engineering, business, and bioinformatics, such as credit card fraud detection, spam filtering for email, and gene selection for disease diagnosis.
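The abstract does not give the MECM formula, so the following is only a rough sketch of the two ideas it describes, not the dissertation's method. It assumes discrete features and uses a plain in-sample estimate of expected misclassification cost: within each combination of the selected feature values, predict the class that is cheapest under a user-supplied cost table. The names `pearson`, `subset_cost`, `unit_cost`, and `minority_cost` are hypothetical. On an XOR-style response, every single feature has zero correlation with the label, yet the pair of interacting features scores a cost of zero when evaluated as a whole.

```python
from collections import Counter, defaultdict
from itertools import product


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def subset_cost(rows, labels, subset, cost):
    """In-sample expected misclassification cost of a feature subset.

    Assumption (not from the source): rows are grouped by their values on
    `subset`, and each group predicts the class with the lowest total cost.
    cost[(true, predicted)] is the penalty for one such error.
    """
    classes = sorted(set(labels))
    cells = defaultdict(Counter)
    for row, y in zip(rows, labels):
        cells[tuple(row[j] for j in subset)][y] += 1
    total = 0.0
    for counts in cells.values():
        total += min(
            sum(counts[t] * cost[(t, p)] for t in classes if t != p)
            for p in classes
        )
    return total / len(rows)


# Toy data: y = x0 XOR x1, and x2 is irrelevant.
rows = list(product([0, 1], repeat=3))
labels = [x0 ^ x1 for x0, x1, _ in rows]
unit_cost = {(0, 1): 1.0, (1, 0): 1.0}

# Pairwise view: each feature alone looks useless.
for j in range(3):
    print(f"corr(x{j}, y) = {pearson([r[j] for r in rows], labels):+.2f}")

# Subset view: only the interacting pair (0, 1) separates the classes.
for subset in [(0,), (1,), (0, 2), (0, 1)]:
    print(subset, subset_cost(rows, labels, subset, unit_cost))

# Imbalance: raising the cost of missing the minority class (label 1)
# changes which prediction is cheapest, and therefore the score.
skewed_labels = [0, 0, 0, 1]
no_features = [()] * 4
minority_cost = {(0, 1): 1.0, (1, 0): 5.0}
print(subset_cost(no_features, skewed_labels, (), unit_cost))
print(subset_cost(no_features, skewed_labels, (), minority_cost))
```

Because the cost is estimated on the same rows used to choose each group's prediction, it is optimistic for large subsets on sparse data; a held-out estimate would be needed in practice.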

Bibliographic record

  • Author: Mo, Dengyao
  • Author affiliation: University of Cincinnati
  • Degree-granting institution: University of Cincinnati
  • Subject: Statistics; Engineering, Industrial; Information Science
  • Degree: Ph.D.
  • Year: 2011
  • Pages: 131 p.
  • Total pages: 131
  • Original format: PDF
  • Language: English
