Journal of Intelligent Information Systems

Identifying and Handling Mislabelled Instances

Abstract

Data mining and knowledge discovery aim to produce useful and reliable models from data. Unfortunately, some databases contain noisy data that degrade the generalization of the models. An important source of noise is mislabelled training instances. We propose a new approach that improves classification accuracy through a preliminary filtering procedure. An example is considered suspect when, in its neighbourhood as defined by a geometrical graph, the proportion of examples of the same class is not significantly greater than in the database as a whole. Such suspect examples in the training data can be either removed or relabelled. The filtered training set is then provided as input to learning algorithms. Our experiments on ten benchmarks from the UCI Machine Learning Repository, using 1-NN as the final algorithm, show that removal gives better results than relabelling. Removal maintains the generalization error rate when 0 to 20% noise is introduced on the class, especially when the classes are well separable. Finally, the proposed filtering method is compared with the relaxation relabelling scheme.
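
The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not say which geometrical graph or which statistical test is used, so this sketch assumes a k-nearest-neighbour neighbourhood as a stand-in for the graph and an exact one-sided binomial test for "significantly greater". The function name `flag_suspects` and the parameters `k` and `alpha` are hypothetical.

```python
import math
import numpy as np


def flag_suspects(X, y, k=5, alpha=0.05):
    """Return a boolean mask that is True for suspect instances.

    Assumptions (not from the source): the neighbourhood is the k nearest
    neighbours under Euclidean distance, and significance is judged with
    an exact one-sided binomial test at level alpha.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)

    # Pairwise squared Euclidean distances; a point is not its own neighbour.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    neighbours = np.argsort(d2, axis=1)[:, :k]

    suspects = np.zeros(n, dtype=bool)
    for i in range(n):
        # Proportion of the instance's class in the whole data set.
        p0 = float(np.mean(y == y[i]))
        # Number of neighbours that share the instance's label.
        same = int(np.sum(y[neighbours[i]] == y[i]))
        # P(S >= same) when S ~ Binomial(k, p0): the chance of seeing at
        # least this many same-class neighbours if labels were unrelated
        # to position.
        p_value = sum(
            math.comb(k, s) * p0**s * (1.0 - p0) ** (k - s)
            for s in range(same, k + 1)
        )
        # Not significantly more same-class neighbours than expected.
        suspects[i] = p_value > alpha
    return suspects
```

Under these assumptions, the removal variant amounts to training the final classifier on `X[~mask], y[~mask]`, where `mask = flag_suspects(X, y)`. With a small `k` the test is strict, so correctly labelled neighbours of a mislabelled point may also be flagged.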