In recent years, the rapid development of computing power has driven rapid growth in the scale of datasets across many fields, highlighting the importance of big data analysis1,2. Data mining and machine learning are often used to discover new underlying facts and identify patterns of interest in collected data3,4. In the process of mining or learning, however, there are barriers to using the collected data directly: it often contains flaws such as measurement errors, abnormal values, and missing values5. Therefore, imperfect data needs to be processed to improve data quality6.
Missing data refers to incomplete sample records in a dataset, where one or more variables of some samples are missing. Missing data occurs in many real-world fields, such as industry7, medicine8,9, business10, and scientific research11. The causes of data loss are varied and can be broadly divided into mechanical and human factors, for example, data storage failures, memory damage, mechanical failures, refusal by interviewees to disclose related information, and data entry errors12–15. Missing data may lead to the loss of much key information, especially when the missing rate is very high, resulting in degraded performance of data mining or machine learning16,17.
The traditional way to deal with missing data is deletion, including listwise (case) deletion, feature deletion, and pairwise deletion18. The deletion method removes samples or feature attributes that contain missing data directly from the dataset, yielding a complete dataset without missing values. Deleting missing values is a simple solution, and when only a few observations contain missing values it generally causes no major problems; otherwise, it can lead to serious information loss19,20.
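For concreteness, a minimal sketch of these deletion strategies in Python with pandas; the small DataFrame and its column names are illustrative, not data from this paper:

```python
# Illustrative deletion strategies for missing data using pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 31, np.nan, 47],
    "income": [3200, np.nan, 2800, 5100],
    "score":  [0.7, 0.4, 0.9, np.nan],
})

listwise = df.dropna(axis=0)         # listwise deletion: drop every sample (row) with any missing value
no_features = df.dropna(axis=1)      # feature deletion: drop every feature (column) with any missing value
pairwise_corr = df.corr(method="pearson")  # pairwise deletion: each correlation uses only the rows
                                           # where both variables are observed
```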
At present, the most commonly used approach to missing data is imputation21, whose goal is to predict the missing data with a certain algorithm and then replace each missing value with the predicted value. Commonly used imputation methods include random imputation, mean imputation, median imputation, mode imputation, and hot-deck imputation22,23.
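A hedged sketch of the single-value strategies listed above (mean, median, and mode) using scikit-learn's SimpleImputer; the toy array is synthetic and only illustrates the interface:

```python
# Mean, median, and mode imputation with scikit-learn's SimpleImputer.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [5.0, 6.0]])

for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    print(strategy, imputer.fit_transform(X))  # each missing entry is replaced column-wise
```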
Different from the above single-imputation methods, multiple imputation (MI) generates several different imputations for the same missing data, so as to obtain multiple complete datasets, and then combines these values to obtain the final imputation24. Most researchers focus on the imputation of specific variables, mainly using two imputation methods: multivariate normal imputation (MVNI) and multivariate imputation by chained equations (MICE)25. MVNI assumes that all variables in the imputation model follow a joint multivariate normal distribution and uses data augmentation to impute missing data under that assumption26. MICE is also known as fully conditional specification (FCS). The essential difference between MICE and MVNI is that MICE does not require a joint distribution of the variables; instead, it imputes the variables one by one using the conditional distribution of each variable given the others27. This makes the MICE method more widely applicable in practice than the MVNI method28,29.
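As an illustration of the chained-equations idea, the sketch below uses scikit-learn's IterativeImputer, which is inspired by MICE/FCS: each feature with missing values is modeled conditionally on the other features and imputed in turn over several rounds. The synthetic data and the simple averaging of the completed datasets are illustrative assumptions, not the software or pooling rules used in the cited studies:

```python
# MICE/FCS-style multiple imputation sketch with scikit-learn's IterativeImputer.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.1] = np.nan   # introduce roughly 10% missing values

# Different random seeds yield multiple completed datasets, in the spirit of MI.
completed = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]
pooled = np.mean(completed, axis=0)     # a simple, non-standard pooling of the imputations
```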
This paper proposes a new imputation method, the class center-based multiple mixed imputation method (CCMMI), to handle missing values. In CCMMI, we first compute the average of the data samples in each class, that is, the class center30. Imputation thresholds for the subsequent multiple mixed imputation are then derived from the Euclidean distances between the data samples and their class centers. For each class of the incomplete dataset, several single models produce candidate imputation results for each missing value, and the Euclidean distances of these candidates are compared with the class threshold. Finally, the most suitable candidate is selected as the imputation result for each missing value. The CCMMI method is compared with other single models on datasets from different domains.
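The sketch below reflects one possible reading of this selection step, under stated assumptions: the helper names, the per-class `thresholds` argument, and the fallback rule are illustrative and should not be taken as the paper's exact algorithm, which is defined in Section 3.

```python
# Hedged sketch of a class center-based selection among candidate imputations.
import numpy as np

def class_centers(X, y):
    """Per-class feature means over observed values (the class centers)."""
    return {c: np.nanmean(X[y == c], axis=0) for c in np.unique(y)}

def ccmmi_select(X, y, candidate_sets, thresholds):
    """For each incomplete sample, keep the candidate (from a list of fully
    imputed arrays) closest to its class center among those within the class
    threshold, falling back to the overall nearest candidate otherwise."""
    centers = class_centers(X, y)
    X_out = X.copy()
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        center = centers[y[i]]
        dists = [np.linalg.norm(cand[i] - center) for cand in candidate_sets]
        within = [d if d <= thresholds[y[i]] else np.inf for d in dists]
        best = int(np.argmin(within)) if np.isfinite(min(within)) else int(np.argmin(dists))
        X_out[i] = candidate_sets[best][i]
    return X_out
```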
The rest of the paper is structured as follows. Section 2 provides an overview of missingness mechanisms and related work on MI. Section 3 presents the proposed CCMMI method. Section 4 describes the experiments and analyzes the results. Finally, Section 5 concludes the paper.