
Combating the class imbalance problem in small sample data sets.



Abstract

The class imbalance problem is a recent development in machine learning. It is frequently encountered when using a classifier to generalize on real-world application data sets, and it causes a classifier to perform sub-optimally. Researchers have rigorously studied resampling methods, new algorithms, and feature selection methods, but no studies have been conducted to understand how well these methods combat the class imbalance problem. In particular, feature selection has rarely been studied outside of text classification problems. Moreover, no studies have addressed the further problem of learning from small samples. This paper develops a new feature selection metric, Feature Assessment by Sliding Thresholds (FAST), specifically designed to handle small sample imbalanced data sets. FAST is based on the area under the receiver operating characteristic curve (AUC) generated by moving the decision boundary of a single-feature classifier, with thresholds placed using an even-bin distribution. This paper also presents a first systematic comparison of the three types of methods developed for imbalanced data classification problems and of seven feature selection metrics evaluated on small sample data sets from different applications. We evaluated the performance of these metrics using AUC and area under the precision-recall curve (PRC). We compared each metric on the average performance across all problems and on the likelihood of a metric yielding the best performance on a specific problem. We examined the performance of these metrics inside each problem domain. Finally, we evaluated the efficacy of these metrics to see which perform best across algorithms. Our results showed that the signal-to-noise correlation coefficient (S2N) and FAST are strong candidates for feature selection in most applications.

Keywords: class imbalance problem, feature evaluation and selection, machine learning, pattern recognition, bioinformatics, text mining.
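The FAST metric described above can be sketched in a few lines: place thresholds at quantiles of a single feature's values (so each bin holds roughly the same number of samples), record the true- and false-positive rates as the decision boundary slides across those thresholds, and integrate the resulting ROC curve. The function below is an illustrative reconstruction from the abstract alone, not the thesis's actual implementation; the function name, bin count, and tie-breaking details are assumptions.

```python
import numpy as np

def fast_score(feature, labels, n_bins=10):
    """Score one feature by the AUC of a sliding-threshold classifier.

    Illustrative sketch of FAST as described in the abstract: thresholds
    follow an even-bin (equal-frequency) distribution, i.e. quantiles of
    the feature values. `labels` must be binary (1 = positive class).
    """
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    # Even-bin thresholds: quantiles so each bin holds ~equal sample counts.
    thresholds = np.quantile(feature, np.linspace(0.0, 1.0, n_bins + 1))

    n_pos = (labels == 1).sum()
    n_neg = (labels == 0).sum()
    tpr, fpr = [], []
    for t in thresholds:
        predicted_pos = feature >= t  # slide the decision boundary
        tpr.append((predicted_pos & (labels == 1)).sum() / n_pos)
        fpr.append((predicted_pos & (labels == 0)).sum() / n_neg)

    # Order the operating points by FPR, anchor the curve at (0, 0) and
    # (1, 1), and integrate with the trapezoidal rule.
    order = np.argsort(fpr)
    f = np.concatenate([[0.0], np.array(fpr)[order], [1.0]])
    t = np.concatenate([[0.0], np.array(tpr)[order], [1.0]])
    auc = np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2.0)
    # A feature whose *low* values mark the positives is equally useful,
    # so take the larger of AUC and 1 - AUC.
    return max(auc, 1.0 - auc)
```

For feature selection, one would compute this score for every feature and keep the top-ranked subset; a perfectly separating feature scores near 1.0 and an uninformative one near 0.5.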

Bibliographic Record

  • Author

    Wasikowski, Michael

  • Affiliation

    University of Kansas

  • Degree grantor: University of Kansas
  • Subject: Computer Science
  • Degree: M.S.
  • Year: 2009
  • Pages: 105 p.
  • Total pages: 105
  • Format: PDF
  • Language: eng
  • CLC classification: Automation and computer technology

