
Combating the class imbalance problem in small sample data sets.



Abstract

The class imbalance problem is a recent development in machine learning. It is frequently encountered when using a classifier to generalize on real-world application data sets, and it causes a classifier to perform sub-optimally. Researchers have rigorously studied resampling methods, new algorithms, and feature selection methods, but no studies have been conducted to understand how well these methods combat the class imbalance problem. In particular, feature selection has rarely been studied outside of text classification problems. Moreover, no studies have addressed the further problem of learning from small samples. This paper develops a new feature selection metric, Feature Assessment by Sliding Thresholds (FAST), specifically designed to handle small sample imbalanced data sets. FAST is based on the area under the receiver operating characteristic curve (AUC) generated by moving the decision boundary of a single-feature classifier, with thresholds placed using an even-bin distribution. This paper also presents a first systematic comparison of the three types of methods developed for imbalanced data classification problems and of seven feature selection metrics evaluated on small sample data sets from different applications. We evaluated the performance of these metrics using AUC and area under the precision-recall curve (PRC). We compared each metric on the average performance across all problems and on the likelihood of a metric yielding the best performance on a specific problem. We examined the performance of these metrics inside each problem domain. Finally, we evaluated the efficacy of these metrics to see which perform best across algorithms. Our results showed that the signal-to-noise correlation coefficient (S2N) and FAST are strong candidates for feature selection in most applications.

Keywords: class imbalance problem, feature evaluation and selection, machine learning, pattern recognition, bioinformatics, text mining.
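The FAST metric described above can be sketched in a few lines: place thresholds at quantiles of a single feature's values (so each bin holds roughly the same number of samples), record the true- and false-positive rates as the decision boundary slides across those thresholds, and integrate the resulting ROC curve. The function below is an illustrative reconstruction from the abstract alone, not the thesis's actual implementation; the function name, bin count, and tie-breaking details are assumptions.

```python
import numpy as np

def fast_score(feature, labels, n_bins=10):
    """Score one feature by the AUC of a sliding-threshold classifier.

    Illustrative sketch of FAST as described in the abstract: thresholds
    follow an even-bin (equal-frequency) distribution, i.e. quantiles of
    the feature values. `labels` must be binary (1 = positive class).
    """
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    # Even-bin thresholds: quantiles so each bin holds ~equal sample counts.
    thresholds = np.quantile(feature, np.linspace(0.0, 1.0, n_bins + 1))

    n_pos = (labels == 1).sum()
    n_neg = (labels == 0).sum()
    tpr, fpr = [], []
    for t in thresholds:
        predicted_pos = feature >= t  # slide the decision boundary
        tpr.append((predicted_pos & (labels == 1)).sum() / n_pos)
        fpr.append((predicted_pos & (labels == 0)).sum() / n_neg)

    # Order the operating points by FPR, anchor the curve at (0, 0) and
    # (1, 1), and integrate with the trapezoidal rule.
    order = np.argsort(fpr)
    f = np.concatenate([[0.0], np.array(fpr)[order], [1.0]])
    t = np.concatenate([[0.0], np.array(tpr)[order], [1.0]])
    auc = np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2.0)
    # A feature whose *low* values mark the positives is equally useful,
    # so take the larger of AUC and 1 - AUC.
    return max(auc, 1.0 - auc)
```

For feature selection, one would compute this score for every feature and keep the top-ranked subset; a perfectly separating feature scores near 1.0 and an uninformative one near 0.5.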

Bibliographic Record

  • Author

    Wasikowski, Michael

  • Affiliation

    University of Kansas

  • Degree grantor: University of Kansas
  • Subject: Computer Science
  • Degree: M.S.
  • Year: 2009
  • Pages: 105 p.
  • Total pages: 105
  • Format: PDF
  • Language: eng
  • CLC classification: Automation and computer technology

