Identifying and Handling Mislabelled Instances

FABRICE MUHLENBACH; STEPHANE LALLICH; DJAMEL A. ZIGHED

Abstract

Data mining and knowledge discovery aim at producing useful and reliable models from data. Unfortunately, some databases contain noisy data that degrade the generalization of the models. An important source of noise is mislabelled training instances. We propose a new approach that improves classification accuracy through a preliminary filtering procedure. An example is considered suspect when, in its neighbourhood as defined by a geometrical graph, the proportion of examples of the same class is not significantly greater than in the database as a whole. Such suspect examples in the training data can be removed or relabelled. The filtered training set is then provided as input to learning algorithms. Our experiments on ten benchmarks from the UCI Machine Learning Repository, using 1-NN as the final algorithm, show that removal gives better results than relabelling. Removal maintains the generalization error rate when from 0 to 20% of noise is introduced on the class, especially when the classes are well separable. Finally, the proposed filtering method is compared to the relaxation relabelling scheme.
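
The filtering rule described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: a k-nearest-neighbour neighbourhood stands in for the geometrical graph, and a one-sided binomial test stands in for the significance test, whose exact form the abstract does not give. The names `find_suspects`, `remove_suspects`, `k` and `alpha` are hypothetical.

```python
from math import comb, dist


def knn_neighbours(points, k):
    """Indices of the k nearest other points for each point (brute force).

    Assumption: a k-NN neighbourhood replaces the geometrical graph.
    """
    n = len(points)
    neighbours = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(points[i], points[j]))
        neighbours.append(others[:k])
    return neighbours


def binomial_upper_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, m) * p**m * (1 - p)**(trials - m)
               for m in range(successes, trials + 1))


def find_suspects(points, labels, k=7, alpha=0.05):
    """Flag examples whose neighbourhood does not contain significantly
    more same-class examples than the class's share of the whole data set.

    Assumption: a one-sided binomial test at level alpha.
    """
    n = len(points)
    class_share = {c: labels.count(c) / n for c in set(labels)}
    suspects = []
    for i, neigh in enumerate(knn_neighbours(points, k)):
        same = sum(labels[j] == labels[i] for j in neigh)
        p_value = binomial_upper_tail(same, len(neigh), class_share[labels[i]])
        if p_value > alpha:  # not significantly greater: suspect
            suspects.append(i)
    return suspects


def remove_suspects(points, labels, suspects):
    """Return the training set with the suspect examples dropped."""
    drop = set(suspects)
    kept = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in kept], [labels[i] for i in kept]
```

Calling `find_suspects(points, labels)` with a list of coordinate tuples and a list of class labels returns the indices to filter, and `remove_suspects` produces the reduced training set that would then be passed to the final classifier. With small neighbourhoods the test has little power, so clean examples near a mislabelled one may also be flagged; `k` and `alpha` control that trade-off.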


Bibliographic record

Source
Journal of Intelligent Information Systems | 2004, No. 1 | pp. 89-109 | 21 pages
Authors
FABRICE MUHLENBACH; STEPHANE LALLICH; DJAMEL A. ZIGHED
Affiliation

ERIC Laboratory, Lumiere University (Lyon 2), Batiment L, 5. Avenue Pierre Mendes-France, 69676 Bron Cedex, France

Indexing information
Original format: PDF
Language of text: English
Chinese Library Classification: Computing technology, computer technology
Keywords
supervised learning; mislabelled data; geometrical neighbourhood; filtering; removing instances; relabelling instances

