Robust and efficient feature selection for high-dimensional datasets.


Abstract

Feature selection is an active research topic in the machine learning and knowledge discovery in databases (KDD) communities. It helps make data mining models more comprehensible to domain experts, improves the prediction performance and robustness of the model, and reduces model training time. This dissertation aims to provide solutions to three issues that many current feature selection researchers overlook: feature interaction, data imbalance, and multiple subsets of features.

Most extant filter feature selection methods are pairwise comparison methods: they test each pair of variables, i.e., one predictor variable and the response variable, and provide a correlation measure for each feature's association with the response variable. Such methods cannot take feature interactions into account.

Data imbalance is another issue in feature selection. If data imbalance is not considered, the selected features will be biased towards the majority class.

In high-dimensional datasets with sparse data samples, there will be many different feature sets that are highly correlated with the output. Domain experts usually expect to be given multiple feature sets so that they can evaluate them based on their domain knowledge.

This dissertation aims to solve these three issues based on a criterion called minimum expected cost of misclassification (MECM). MECM is a model-independent evaluation measure. It evaluates the classification power of the tested feature subset as a whole, and it has adjustable weights to deal with imbalanced datasets. A number of case studies showed that MECM has some favorable properties for searching for a compact subset of interacting features. In addition, an algorithm and a corresponding data structure were developed to produce multiple feature subsets.

The success of this research will have broad applications ranging from engineering and business to bioinformatics, such as credit card fraud detection, email filter settings for spam classification, and gene selection for disease diagnosis.
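The abstract's point that pairwise filter methods cannot capture feature interactions can be illustrated with a toy XOR dataset. The sketch below is not taken from the dissertation and does not implement MECM; the helper names `pearson` and `subset_error` are hypothetical and chosen only for this illustration. It shows that each feature alone has zero correlation with the response, while the two features evaluated together as a subset classify it perfectly.

```python
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def subset_error(features, response):
    """Misclassification rate of a majority-vote lookup table built on
    the given feature tuples (illustrative only, not the MECM criterion)."""
    table = {}
    for key, label in zip(features, response):
        table.setdefault(key, []).append(label)
    # Within each cell, the minority labels are the ones misclassified.
    errors = sum(min(labels.count(0), labels.count(1))
                 for labels in table.values())
    return errors / len(response)


# Toy data: every combination of two binary features; response is their XOR.
rows = list(product([0, 1], repeat=2))
x1 = [a for a, _ in rows]
x2 = [b for _, b in rows]
y = [a ^ b for a, b in rows]

# Pairwise view: one predictor against the response at a time.
pairwise_scores = {"x1": pearson(x1, y), "x2": pearson(x2, y)}

# Subset view: evaluate features jointly rather than one by one.
single_error = subset_error([(a,) for a in x1], y)
joint_error = subset_error(rows, y)

print("pairwise correlations:", pairwise_scores)
print("error using x1 alone:", single_error)
print("error using (x1, x2):", joint_error)
```

A pairwise filter would discard both features here, whereas a criterion that scores the subset as a whole would keep them.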


Bibliographic record

Author: Mo, Dengyao
Author affiliation: University of Cincinnati
Degree-granting institution: University of Cincinnati
Subjects: Statistics; Engineering, Industrial; Information Science
Degree: Ph.D.
Year: 2011
Pages: 131
Original format: PDF
Language: English

