Data sources
To cover a wide population and a broad range of scale applications, three common scales administered to different age groups were selected as the datasets for simulated imputation: the self-acceptance scale (SAQ) of college students, the activities of daily living scale (ADL) of elderly people, and the self-esteem scale (RSES) of middle school students [12-14]. In each enrolled dataset, samples with missing values were discarded to obtain a complete dataset. The SAQ dataset included 742 individuals with complete data on age, gender, parental characteristics, etc. The ADL dataset included 1242 elderly people with data on age, gender, characteristics of daily living, etc. The RSES dataset included 3513 middle school students, also with complete data on age, gender, parental characteristics, etc. The SAQ and ADL datasets formed the simulation group and were used to compare and assess the performance of the different imputation methods. The RSES dataset formed the validation group and was used to test the application of the imputation methods. All datasets were complete on all required variables.
Simulation of missing data
The mechanism of missingness is important when imputing missing values. Following Little and Rubin, missing item scores can be categorized into three types. When the missingness is independent of the actual or potential study variables, the data are considered missing completely at random (MCAR) [15]. If the missingness is related to the biological, psychological, social and/or cultural diversity of subjects, that is, if it depends on known or observed covariates, the data are considered missing at random (MAR). Item nonresponse is classified as missing not at random (MNAR) if the probability of an item being missing depends on the true answer [16, 17]. In real-world data there is no way to verify whether the data are MAR or MNAR, although MCAR can be checked with Little's MCAR test. The missing data mechanism is therefore difficult to determine. Most scholars assume that questionnaire data are MAR, which allows the relationships with other variables to be used. In addition, most current imputation methods assume MAR in order to avoid biased results. The explanatory variables in this study were therefore set to missing under a MAR mechanism.
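As an illustration of what a MAR mechanism means in practice, the following minimal Python sketch sets an item to missing with a probability that depends only on a fully observed covariate. It is not the authors' SAS procedure. The field names `gender` and `item1`, the toy data, and the rule that one covariate level doubles the chance of missingness are all assumptions made for the example.

```python
import random

def impose_mar(rows, target, covariate, rate, seed=0):
    """Blank out `target` with a probability that depends only on `covariate`."""
    rng = random.Random(seed)
    # Hypothetical rule: records with covariate == 1 are twice as likely to be missing.
    weights = [2.0 if row[covariate] == 1 else 1.0 for row in rows]
    # Rescale so that the expected overall missing rate equals `rate`.
    scale = rate * len(rows) / sum(weights)
    masked = []
    for row, weight in zip(rows, weights):
        new_row = dict(row)  # copy, so the complete dataset is left untouched
        if rng.random() < min(1.0, weight * scale):
            new_row[target] = None  # the covariate itself stays observed
        masked.append(new_row)
    return masked

# Toy data (assumed): a binary covariate and one item score on a 1-4 scale.
data_rng = random.Random(1)
rows = [{"gender": i % 2, "item1": data_rng.randint(1, 4)} for i in range(2000)]
masked = impose_mar(rows, target="item1", covariate="gender", rate=0.20)
```

Because the probability of missingness is a function of the observed covariate and not of the item's own value, the resulting pattern is MAR rather than MNAR. Giving every record the same weight would produce MCAR instead.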
Imputation methods
Four imputation methods are considered in this study. (1) The direct deletion method deletes all subjects with missing values and conducts the statistical analysis on the remaining complete dataset. It is the simplest and most common approach and is widely used in statistical software. (2) Mode imputation is one of the simplest and most naive methods for imputing missing values of categorical variables. The mode of the non-missing values of each variable is used to impute the missing values. (3) Hot-deck (HD) imputation selects the value of the corresponding variable from the observation most "similar" to the observation with the missing value and uses it as the imputed value. It is generally divided into two methods: sequential hot-deck imputation and random hot-deck imputation [18]. In sequential hot-deck imputation, the most "similar" observation is selected in some order within the imputation class. In random hot-deck imputation, the donor is selected at random from the imputation class. This study uses random hot-deck imputation. (4) Multiple imputation (MI) aims to produce a range of values that "approximate" the missing response [19]. MI uses a set of external covariates to generate a range of plausible values for each missing value (based on correlations between the covariates and the item to be replaced). The algorithm works by iteratively imputing the missing values based on the fitted conditional models until a stopping criterion is satisfied.
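Mode imputation and random hot-deck imputation, as described above, can be sketched in a few lines of Python. This is only an illustrative sketch and not the SAS implementation used in the study. The function names, the use of `None` to mark a missing value, and the fallback to all donors when an imputation class has no observed donor are assumptions of the example.

```python
import random
from collections import Counter

def mode_impute(values):
    """Replace each missing value (None) with the mode of the observed values."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def random_hot_deck(rows, target, class_vars, seed=0):
    """Fill missing `target` values with a random donor from the same imputation class."""
    rng = random.Random(seed)
    donors = {}
    for row in rows:
        if row[target] is not None:
            key = tuple(row[v] for v in class_vars)
            donors.setdefault(key, []).append(row[target])
    all_donors = [row[target] for row in rows if row[target] is not None]
    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are not modified
        if new_row[target] is None:
            key = tuple(new_row[v] for v in class_vars)
            # Assumed fallback: use every donor if the class has no observed value.
            pool = donors.get(key) or all_donors
            new_row[target] = rng.choice(pool)
        filled.append(new_row)
    return filled
```

For example, `mode_impute([1, 2, 2, None, 3])` fills the gap with 2, while `random_hot_deck(rows, "item1", ["gender"])` draws each imputed value from respondents of the same gender. The imputation class here is defined by exact matches on `class_vars`, which is one simple way to express "similar" observations.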
Performance evaluation of imputation algorithms
Comparison of different imputation methods is performed as follows:
(1) Absolute deviation. It is the absolute value of the difference between a result obtained from the complete dataset and the corresponding result obtained from the imputed dataset.
(2) The root mean square error (RMSE) [20]. (see Equation 1 in the Supplemental Files)
where n = the number of simulated imputations at each missingness proportion,
yij = the statistic from the ith imputation using imputation method j at each missingness proportion,
yi0 = the statistic from the ith imputation using the complete dataset at each missingness proportion.
A higher RMSE indicates a larger difference between the dataset imputed with the tested method and the complete dataset. A narrower range of RMSE values indicates a more stable imputation method. Likewise, a wider range of RMSE values for each combination indicates a less stable, and therefore less reliable, imputation method.
(3) Average relative error. (see Equation 2 in the Supplemental Files)
where n = the number of simulated imputations at each missingness proportion,
yij = the statistic from the ith imputation using imputation method j at each missingness proportion,
yi0 = the statistic from the ith imputation using the complete dataset at each missingness proportion.
The vertical axis plots the percentage relative error for continuous variables and percentage misclassification error for categorical variables, while the horizontal axis groups the results according to the proportion of missing values. Each boxplot represents the error measure over 50 random replications.
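The three evaluation measures can be sketched in Python as follows. Equations 1 and 2 are given only in the Supplemental Files, so the formulas below are the standard textbook definitions of RMSE and mean relative error and are an assumption about the exact form the authors used. The function names are hypothetical.

```python
import math

def absolute_deviation(imputed_stat, complete_stat):
    """Absolute difference between one imputed-data statistic and its complete-data value."""
    return abs(imputed_stat - complete_stat)

def rmse(imputed_stats, complete_stats):
    """Assumed form: sqrt( (1/n) * sum_i (y_ij - y_i0)^2 ) over n simulated imputations."""
    n = len(imputed_stats)
    squared = ((yij - yi0) ** 2 for yij, yi0 in zip(imputed_stats, complete_stats))
    return math.sqrt(sum(squared) / n)

def average_relative_error(imputed_stats, complete_stats):
    """Assumed form: (1/n) * sum_i |y_ij - y_i0| / |y_i0| (requires y_i0 != 0)."""
    n = len(imputed_stats)
    relative = (abs(yij - yi0) / abs(yi0) for yij, yi0 in zip(imputed_stats, complete_stats))
    return sum(relative) / n
```

Here `imputed_stats` holds the statistic (for instance a mean) from each of the n simulated imputations under one method, and `complete_stats` holds the matching complete-data values. The relative error is undefined when a complete-data statistic is zero, which can matter for correlation coefficients close to zero.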
Statistical analysis
The SAQ and ADL datasets were used as the simulation group. The missing rates in all datasets were set at 5%, 10%, 15%, and 20% under a MAR missing data mechanism. At each missing rate, we simulated MAR missingness 50 times and filled in the missing values with each of the four imputation methods, and then calculated the absolute deviation and RMSE of the mean, standard deviation, and correlation coefficient. If the results of all methods were similar, the average relative error of these statistics was additionally calculated for all of them; otherwise, it was calculated only for the methods that were not obviously less effective than the others, in order to determine the preferred methods. The RSES dataset, as the validation group, was used to assess how well the methods extend to a supposed real-world situation by simulating the different nonresponse rates once. All analyses were performed in SAS 9.4.
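The overall simulation design (four missing rates, 50 replications per rate, then an error summary per rate) can be outlined with the Python sketch below. It stands in for the SAS 9.4 analysis and does not reproduce it. The MAR rule, the choice of the mean as the only statistic, the use of mode imputation as the example method, and all function names are assumptions made for illustration.

```python
import math
import random
import statistics
from collections import Counter

RATES = (0.05, 0.10, 0.15, 0.20)  # missing rates used in the study
N_REPS = 50                       # replications per missing rate

def mode_impute(values):
    """Replace each missing value (None) with the mode of the observed values."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def simulate(complete, covariate, impute, seed=0):
    """Return {rate: RMSE of the imputed-data mean against the complete-data mean}."""
    rng = random.Random(seed)
    true_mean = statistics.mean(complete)
    # Hypothetical MAR rule: covariate == 1 doubles the chance of missingness.
    weights = [2.0 if c == 1 else 1.0 for c in covariate]
    results = {}
    for rate in RATES:
        scale = rate * len(complete) / sum(weights)
        squared_errors = []
        for _ in range(N_REPS):
            masked = [None if rng.random() < min(1.0, w * scale) else y
                      for y, w in zip(complete, weights)]
            filled = impute(masked)
            squared_errors.append((statistics.mean(filled) - true_mean) ** 2)
        results[rate] = math.sqrt(sum(squared_errors) / N_REPS)
    return results
```

Calling `simulate(scores, gender, mode_impute)` on a list of item scores and a matching binary covariate returns one RMSE per missing rate. Passing a different `impute` function would let the same loop compare other single-imputation methods under identical missingness patterns.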