Medical datasets inherently contain errors arising from subjective or inaccurate test results, or from confounding biological complexities. These errors are difficult for medical experts to detect manually because of missing contextual information, restrictive data privacy regulations, and the sheer scale of the data to be reviewed. Current methods for detecting errors in data typically focus only on minimizing the effects of random classification noise. More recent work has used deep learning to capture errors stemming from subjective labelling and confounding variables; however, such methods can be computationally intensive and inefficient.
In this work, a deep-learning-based algorithm was used in conjunction with a label-clustering approach to automate error detection. Results demonstrated high performance and efficiency on both image- and record-based datasets. Errors were identified with an accuracy of up to 85%, while requiring up to 93% fewer computing resources. The resulting trained AI models exhibited greater stability and up to a 45% relative improvement in accuracy, from 69% to over 99%. These results indicate that practical, automated detection of errors in medical data is possible without human oversight.