
Classification on Data with Biased Class Distribution



Abstract

Labeled data for classification are often obtained by sampling that restricts or favors certain classes. A classifier trained on such data will be biased, leading to wrong inference and sub-optimal classification of new data. Given an unlabeled new data set, we propose a bootstrap method that estimates its class probabilities from an estimate of the classifier's accuracy on training data and an estimate of the probabilities of the classifier's predictions on the new data. We then propose two methods to improve classification accuracy on new data. The first applies only to classifiers designed to predict posterior class probabilities: the predictions of an existing classifier are adjusted according to the estimated class probabilities of the new data. The second applies to an arbitrary classification algorithm but requires retraining on properly resampled data. The proposed bootstrap algorithm was validated through experiments with 500 replicates calculated on 1,000 realizations for each of 16 choices of data set size, number of classes, prior class probabilities, and conditional probabilities describing a classifier's performance. Applying the proposed methodology to a benchmark data set, with various class probabilities on the unlabeled data and balanced class probabilities on the training data, provided strong evidence that it can significantly improve classification of unlabeled data.
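The prior-estimation and posterior-adjustment ideas in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm, since the abstract gives no formulas. It assumes that the classifier's accuracy is summarized by a confusion matrix P(predicted j | true i) that carries over unchanged to the new data, that the new class probabilities are recovered by solving the resulting linear system, and that posteriors are rescaled by the ratio of new to training priors. All function names (`solve_linear`, `estimate_priors`, `bootstrap_priors`, `adjust_posterior`) are hypothetical.

```python
import random

def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.
    Assumes the matrix is invertible (raises ZeroDivisionError otherwise)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def estimate_priors(conf, pred_freq):
    """conf[i][j] = P(predicted j | true i), estimated on training data.
    pred_freq[j] = fraction of new data predicted as class j.
    Solves sum_i conf[i][j] * p[i] = pred_freq[j], then clips and renormalizes."""
    k = len(pred_freq)
    transposed = [[conf[i][j] for i in range(k)] for j in range(k)]
    p = [max(x, 0.0) for x in solve_linear(transposed, pred_freq)]
    total = sum(p)
    return [x / total for x in p]

def bootstrap_priors(train_pairs, new_preds, n_classes, replicates=200, seed=0):
    """Average the prior estimate over bootstrap resamples (an assumed scheme).
    train_pairs: (true_label, predicted_label) pairs from training data.
    new_preds: predicted labels on the unlabeled new data."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replicates):
        pairs = rng.choices(train_pairs, k=len(train_pairs))
        preds = rng.choices(new_preds, k=len(new_preds))
        counts = [[0] * n_classes for _ in range(n_classes)]
        for true_label, pred_label in pairs:
            counts[true_label][pred_label] += 1
        conf = [[c / max(sum(row), 1) for c in row] for row in counts]
        freq = [preds.count(j) / len(preds) for j in range(n_classes)]
        estimates.append(estimate_priors(conf, freq))
    return [sum(e[i] for e in estimates) / replicates for i in range(n_classes)]

def adjust_posterior(posterior, train_priors, new_priors):
    """Rescale a posterior by new_prior / train_prior per class and renormalize."""
    w = [p * n / t for p, n, t in zip(posterior, new_priors, train_priors)]
    s = sum(w)
    return [x / s for x in w]
```

With a classifier trained on balanced data, `adjust_posterior(posterior, [0.5, 0.5], estimated_priors)` shifts each prediction toward the classes that are more frequent in the new data. The retraining-based second method is not sketched here; it would presumably use the estimated priors to resample the training set before refitting.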
