Supervised machine learning has recently gained importance in various fields of research. To train reliable models, data scientists need credible data, which is not always available. A particularly hard and widespread problem that deteriorates the performance of learning methods is mislabeled samples (Northcutt et al., 2021). Common sources of mislabeling are weakly defined classes, labels that change their meaning over time, unsuitable annotators, and ambiguous labeling guidelines. Because mislabeling lowers prediction quality, it is essential that scientists can identify wrong labels before the actual learning process starts. To this end, numerous algorithms for the identification of noisy instances have been developed. However, a comprehensive empirical comparison of the available methods has so far been missing.
In this paper, we survey and benchmark methods for the identification of mislabeled samples in tabular data. We discuss the theoretical background of label noise and how it can lead to mislabeling, review categorizations of identification methods, and briefly introduce 34 specific approaches together with popular datasets. Finally, 20 selected methods are benchmarked on artificially corrupted data with controllable mislabeling and on a new real-life genomic dataset with known errors. We compare the methods while varying the amount and type of noise as well as the sample size and domain of the data. We find that most methods perform best on datasets with a noise level of around 20-30%, whereas identifying noisy instances at low and high levels of noise prevalence is more challenging. Furthermore, no single method consistently outperforms all others, while ensemble-based methods often outperform individual models. We provide all datasets and analysis code to enable better handling of mislabeled data and give recommendations on the use of noise filters depending on dataset characteristics.
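To illustrate what "controllable mislabeling" means in practice, the following is a minimal sketch of injecting symmetric label noise at a chosen rate; the function name and details are illustrative assumptions, not the benchmark's actual implementation.

```python
import numpy as np

def inject_label_noise(y, noise_rate, n_classes, seed=None):
    """Flip a controllable fraction of labels to a different, uniformly chosen class.

    Returns the noisy labels and the indices of flipped samples, which serve as
    ground truth when evaluating noise-identification methods.
    """
    rng = np.random.default_rng(seed)
    y_noisy = np.asarray(y).copy()
    n_flip = int(round(noise_rate * len(y_noisy)))
    flip_idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
    for i in flip_idx:
        # Choose any class except the original one (symmetric noise model).
        candidates = [c for c in range(n_classes) if c != y_noisy[i]]
        y_noisy[i] = rng.choice(candidates)
    return y_noisy, flip_idx
```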