Supervised machine learning has recently gained importance in various fields of research. To train reliable models, data scientists need credible data, which is not always available. A particularly hard and widespread problem that deteriorates the performance of learning methods is mislabeled samples (Northcutt et al., 2021). Common sources of mislabeling are weakly defined classes, labels that change their meaning over time, unsuitable annotators, and ambiguous labeling guidelines. Because mislabeling lowers prediction quality, it is essential that scientists can identify wrong labels before the actual learning process starts. To this end, numerous algorithms for the identification of noisy instances have been developed. However, a comprehensive empirical comparison of the available methods has so far been missing.
In this paper, we survey and benchmark methods for the identification of mislabeled samples in tabular data. We discuss the theoretical background of label noise and how it can lead to mislabeling, review categorizations of identification methods, and briefly introduce 34 specific approaches together with popular datasets. Finally, 20 selected methods are benchmarked on artificially corrupted data with controllable mislabeling and on a new real-life genomic dataset with known errors. We compare the methods while varying the amount and type of noise as well as the sample size and domain of the data. We find that most methods perform best on datasets with a noise level of around 20-30%, whereas identifying noisy instances at low and high levels of noise prevalence is more challenging. Furthermore, no single method consistently outperforms all others, while ensemble-based methods often outperform individual models. We provide all datasets and analysis code to enable better handling of mislabeled data and give recommendations on the use of noise filters depending on dataset characteristics.
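To illustrate what "controllable mislabeling" means in practice, the following is a minimal sketch of injecting symmetric label noise at a chosen rate; the function name and details are illustrative assumptions, not the benchmark's actual implementation.

```python
import numpy as np

def inject_label_noise(y, noise_rate, n_classes, seed=None):
    """Flip a controllable fraction of labels to a different, uniformly chosen class.

    Returns the noisy labels and the indices of flipped samples, which serve as
    ground truth when evaluating noise-identification methods.
    """
    rng = np.random.default_rng(seed)
    y_noisy = np.asarray(y).copy()
    n_flip = int(round(noise_rate * len(y_noisy)))
    flip_idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
    for i in flip_idx:
        # Choose any class except the original one (symmetric noise model).
        candidates = [c for c in range(n_classes) if c != y_noisy[i]]
        y_noisy[i] = rng.choice(candidates)
    return y_noisy, flip_idx
```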