Comparison of the Performance of Recursive Partitioning Models and Parametric Imputation Models Based on Predictive Mean Matching in Multiple Imputation by Chained Equations, for Imputing Missing Values of a Binary Outcome in the Presence of an Interaction Effect


Background: Among modern multiple imputation methods, Multiple Imputation by Chained Equations (MICE) is a popular approach because of its flexibility. Our main focus in this study is to compare the performance of parametric imputation models based on predictive mean matching (PMM) with recursive partitioning methods within MICE in the presence of an interaction in the data.
Methods: We compared the performance of parametric and tree-based imputation methods via simulation using two data generation models. For each combination of data generation model and imputation method, the following steps were performed: data generation, removal of observations, imputation, logistic regression analysis, and calculation of bias, Coverage Probability (CP), and Confidence Interval (CI) width for each coefficient. Furthermore, the model-based and empirical SE, and the estimated proportion of the variance attributable to the missing data (λ), were calculated.
Results: We show by simulation that, when imputing a binary response in observations involving an interaction, manually entering the interaction term into the PMM imputation model improves the performance of the PMM method compared to the recursive partitioning models in MICE. The parametric method in which we entered the interaction term into the imputation model (MICE-Interaction) led to smaller bias and slightly higher coverage probability for the interaction effect, but it had slightly wider confidence intervals than tree-based imputation (especially classification and regression trees).
Conclusions: The application of MICE-Interaction led to better performance than the recursive partitioning methods in MICE. However, if the user is interested in estimating the interaction but does not know enough about the structure of the observations, recursive partitioning methods can be suggested for imputing the missing values.


Background
Among imputation methods, Multiple Imputation (MI) has received particular attention in different areas of research (Sterne et al., 2009; Rubin, 1987). The popularity of this method is attributed to its flexibility, as well as its ability to produce unbiased estimates and valid inferences for data that are Missing At Random (MAR, i.e., when, conditional on the observed data, the probability of data being missing does not depend on unobserved data) or Missing Completely At Random (MCAR, i.e., when the probability of a variable having missing data does not depend on any observed or missing variables) (Lee and Carlin, 2010).
Multiple Imputation by Chained Equations (MICE) is commonly used for imputing missing data (Raghunathan et al., 2001; van Buuren and Groothuis-Oudshoorn, 2010). The primary advantage of MICE is that it is not necessary to specify a joint distribution for all variables (Gelfand and Smith, 1990). Any interactions or non-linear relationships that are estimated in the final analysis model must be accounted for in the imputation model; incorrect specification of the imputation model in MICE may lead to invalid inferences.
When there is an interaction between the variables, the default imputation methods (imputation models that include only the main effects) may produce biased parameter estimates, since those imputation models are not compatible with the analysis model (Van Buuren, 2007; Seaman et al., 2012a; Slade and Naylor, 2020). In some cases, it may not be desirable or possible to include interactions in a parametric imputation model, due to numerous predictors, small sample size, or highly correlated predictors. In these situations, recursive partitioning (tree-based) methods, such as Classification And Regression Trees (CART) and Random Forests (RF), can be used within the MICE algorithm (Strobl et al., 2009).
CART and RF have some advantages owing to their inherently nonparametric nature: they do not require the user to specify an imputation model, and they are flexible enough to capture interaction effects and non-linear relations. At each step in the MICE algorithm, missing values are imputed for a given incomplete variable using a tree built with all other variables as predictors. A primary drawback of tree-based methods is the difficulty of interpreting the results; however, this disadvantage is inconsequential for the imputation process (Strobl et al., 2009; Burgette and Reiter, 2010; Doove et al., 2014). In this research, we conducted two simulation studies to compare the performance of (partially) parametric imputation methods (considering the interaction in the imputation model) with recursive partitioning methods, in order to find a more convenient method for preserving interaction effects. This paper is organized as follows. First, the simulation designs are described. Then, the four imputation methods within the MICE algorithm are presented, followed by the results of the simulation studies.
The results from both simulation studies will be discussed in the last section.

Methods
We compared the performance of parametric and tree-based imputation methods via simulation using two data generation models. For each combination of data generation model and imputation method, the following steps were performed: data generation, removal of observations under MAR and MCAR mechanisms, imputation, logistic regression analysis, and calculation of bias, Coverage Probability (CP), and Confidence Interval (CI) width for each coefficient. In general, a desirable imputation method results in low bias, a CP of at least 95%, and narrow CIs. Furthermore, the model-based and empirical SE, and the estimated proportion of the variance attributable to the missing data (λ), were calculated.

Data generation
A total of 1000 simulated datasets with 1000 observations were generated, using two different logistic regression models, one with an interaction between two continuous variables and one with an interaction between two binary variables. The models are specified in Equations 1 and 2, respectively.
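The paper's Equations 1 and 2 define the exact data-generating models; they are not reproduced in this excerpt. As a minimal sketch (the study used R; Python is used here purely for illustration, and the coefficient values below are illustrative placeholders, not the paper's true parameters), one dataset per scenario can be generated as:

```python
import numpy as np

rng = np.random.default_rng(42)

def generate_dataset(n=1000, binary_predictors=False):
    """Generate one dataset from a logistic model with an interaction.

    binary_predictors=False mimics Scenario 1 (two continuous variables);
    binary_predictors=True mimics Scenario 2 (two binary variables).
    The coefficients below are illustrative, not the paper's Equations 1-2.
    """
    if binary_predictors:
        x1 = rng.binomial(1, 0.5, n).astype(float)
        x2 = rng.binomial(1, 0.5, n).astype(float)
    else:
        x1 = rng.normal(0, 1, n)
        x2 = rng.normal(0, 1, n)
    # Linear predictor with main effects and an interaction term
    eta = 0.5 + 1.0 * x1 + 1.0 * x2 + 1.0 * x1 * x2
    p = 1 / (1 + np.exp(-eta))          # logistic link
    y = rng.binomial(1, p)              # binary outcome
    return x1, x2, y

x1, x2, y = generate_dataset()
```

In the simulation, this generation step would be repeated 1000 times per scenario.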

Removal of observations
In each of the datasets, we induced missing values under MAR and MCAR mechanisms such that the proportion of missing data in the outcome variable was 10%, 20%, 30%, 40%, or 50%.
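A hedged sketch of this amputation step follows. The paper does not specify the exact MAR condition used, so the dependence of missingness on the observed covariate x1 below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def ampute_outcome(y, x1, prop=0.3, mechanism="MCAR"):
    """Set a proportion `prop` of the outcome to missing (np.nan).

    MCAR: every observation has the same probability of being removed.
    MAR: the removal probability depends only on the observed covariate
    x1 (an illustrative assumption; the paper's MAR condition may differ).
    """
    y = y.astype(float).copy()
    n = len(y)
    if mechanism == "MCAR":
        miss = rng.choice(n, size=int(prop * n), replace=False)
    else:  # MAR: larger x1 -> higher chance of a missing outcome
        w = 1 / (1 + np.exp(-x1))
        w = w / w.sum()
        miss = rng.choice(n, size=int(prop * n), replace=False, p=w)
    y[miss] = np.nan
    return y
```

This yields exactly the target missing proportion in the outcome while leaving the predictors fully observed, matching the study design.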

Imputation of missing data
Throughout the analysis, we performed MICE using the mice package in R version 3.1.0 (van Buuren and Groothuis-Oudshoorn, 2010; R Core Team, 2013). The missing values were imputed using PMM-based multiple imputation methods and two recursive partitioning methods in MICE.

PMM
PMM is generally preferred over standard regression imputation because it creates imputations from the observed data itself, which maintains data structure such as skewness and avoids problems such as imputing impossible values (Morris et al., 2014; Slade and Naylor, 2020). The default implementation of PMM in mice includes only main effects in the imputation model. PMM here corresponds to the standard PMM presented in Doove et al. (2014).
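The donor-matching idea behind PMM can be sketched as follows. This is a simplified illustration in Python, not the mice implementation: mice additionally draws the regression coefficients from their posterior before matching, a Bayesian step omitted here:

```python
import numpy as np

def pmm_impute(y_obs, X_obs, X_miss, rng, k=5):
    """Simplified predictive mean matching for one incomplete variable.

    Fit a linear model on the complete cases, compute predicted means
    for observed and missing rows, and for each missing row draw a
    donor's *observed* value from the k cases with the closest
    predictions. Imputed values are therefore always observed values.
    """
    # OLS fit on observed rows (intercept added manually)
    A = np.column_stack([np.ones(len(X_obs)), X_obs])
    beta, *_ = np.linalg.lstsq(A, y_obs, rcond=None)
    pred_obs = A @ beta
    B = np.column_stack([np.ones(len(X_miss)), X_miss])
    pred_miss = B @ beta
    imputations = np.empty(len(X_miss))
    for i, pm in enumerate(pred_miss):
        # the k nearest observed predictions form the donor pool
        donors = np.argsort(np.abs(pred_obs - pm))[:k]
        imputations[i] = y_obs[rng.choice(donors)]
    return imputations
```

Because every imputed value is drawn from the observed data, PMM preserves features such as skewness and bounded ranges, as noted above.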

MICE-Interaction
For MICE-Interaction, we included the interaction term as just another variable and used the default predictor matrix in the mice function in R (Von Hippel, 2009; Seaman et al., 2012b). The difference between MICE-Interaction and PMM (MICE-PMM) is the inclusion of an interaction term in the imputation model for MICE-Interaction.

MICE-Stratified
For the MICE-Stratified method, the datasets are stratified by the fully observed binary variables that contribute to the interaction; the PMM model in MICE is then used to impute the missing values within each of the two subgroups.

MICE-CART
CART is a tree-based imputation method that does not require the specification of an imputation model. In short, CART builds a decision tree from binary decision rules, each using one predictor variable to partition the data into two nodes by minimizing the heterogeneity of the outcome within each node.
Subgroups are created by the optimal split according to a measure of homogeneity, such as the Gini index (Breiman, 1984). To avoid overfitting, partitioning continues until a certain criterion is reached (e.g., a predetermined number of observations in the final subsets) (Friedman et al., 2001; Breiman, 1984). The data matrix Y is partitioned into Y_obs and Y_miss: Y_obs comprises the columns of Y that are completely observed, and Y_miss the part that is not completely observed. The following steps describe the MICE-CART procedure (k is the number of incomplete variables, and Ẏ is the currently imputed data matrix) (Burgette and Reiter, 2010): 1. For j = 1, . . . , k, the initial values of Y_j^miss are filled in by random draws from Y_j^obs, and a data matrix Z is defined.
2. A CART model is fitted for each incomplete variable Y_j, using the remaining variables of Z as predictors; only subjects with observed values of Y_j are used in this step.
3. For each subject with a missing value of Y_j, the terminal node is determined according to the tree fitted in step 2. An observed value of Y_j is randomly selected from the subset in this node and used for imputation.

4. Steps 2 and 3 are repeated for several iterations. This procedure is performed for each variable with at least one missing value, yielding one complete dataset.
For the CART method in MICE, a complexity parameter of 10^-4 and a minimum of five observations at any terminal node were applied (Therneau et al., 2018; van Buuren and Groothuis-Oudshoorn, 2010).
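The MICE-CART update for a single incomplete variable (steps 2-3 above) can be sketched in Python using scikit-learn's CART implementation. Note this is an illustrative approximation: the study used R's rpart via mice, and scikit-learn's ccp_alpha and min_samples_leaf only roughly correspond to rpart's cp and minbucket:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cart_impute_step(y, X, rng, min_leaf=5):
    """One MICE-CART update for a single incomplete binary variable y.

    Fit a tree on the complete cases (step 2), then for each missing
    case find its terminal node and draw a donor from the observed
    values in that node (step 3).
    """
    obs = ~np.isnan(y)
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, ccp_alpha=1e-4)
    tree.fit(X[obs], y[obs])
    # Terminal-node ids for observed and missing rows
    leaf_obs = tree.apply(X[obs])
    leaf_miss = tree.apply(X[~obs])
    y = y.copy()
    y_obs_vals = y[obs]
    imputed = np.empty(leaf_miss.shape)
    for i, leaf in enumerate(leaf_miss):
        donors = y_obs_vals[leaf_obs == leaf]  # observed values in same node
        imputed[i] = rng.choice(donors)
    y[~obs] = imputed
    return y
```

Because donors come from the terminal node of the fitted tree, splits induced by interactions between predictors are respected without the user specifying any interaction term.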

MICE-RF
The RF algorithm is essentially an extension of CART that creates multiple trees using bootstrap samples from the original data. To overcome overfitting, at each node a random sample of variables is used to determine the optimal split. The RF imputation method creates multiple regression trees by randomly selecting B bootstrap samples from the original sample (Hayes et al., 2015). To make each split, a subsample of independent variables is selected randomly and used with CART to produce a tree (Breiman, 2001). Following the same description as for CART in MICE, instead of growing one tree over all the observations, bootstrap samples are first drawn and a tree is fitted on each bootstrap sample. Each tree contains several leaves, and each leaf contains a subset of observed values of variable j, called donors. Then, the leaf associated with the missing value of the j-th variable (Y_j^miss) is determined from each fitted tree (Breiman, 2001).
For each missing value of Y_j, the donors in the corresponding terminal nodes of the B trees are pooled, and a random value is selected from these donors. This process is repeated for each variable with missing values to produce a complete dataset. The rest of the imputation process is the same as for the CART method in MICE (Doove et al., 2014). In the present study, we used the mice package to implement RF in MICE (van Buuren and Groothuis-Oudshoorn, 2010; Liaw and Wiener, 2002). We drew B = 10 bootstrap samples (White et al., 2011; Shah et al., 2014) and made one-third of the predictors candidates for each split (van Buuren and Groothuis-Oudshoorn, 2010; Doove et al., 2014). However, we did not consider changing the number of trees (B), because the literature (Shah et al., 2014) suggests that the imputation quality is equivalent for B = 10 and B = 100.
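The donor pooling across the B bootstrap trees can be sketched as follows. As with the CART sketch, this is an illustrative approximation of mice's RF imputation using scikit-learn trees, not the mice implementation itself:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rf_impute_step(y, X, rng, B=10, min_leaf=5):
    """One MICE-RF update sketch for a single incomplete binary variable.

    Fit B trees on bootstrap samples of the complete cases; for each
    missing case, pool the observed values in its terminal node of
    every tree, then draw one donor at random from the pooled set.
    """
    obs = ~np.isnan(y)
    X_obs, y_obs = X[obs], y[obs]
    n_obs = len(y_obs)
    m = int(np.ceil(X.shape[1] / 3))  # one-third of predictors per split
    donor_pools = [[] for _ in range(int((~obs).sum()))]
    for _ in range(B):
        boot = rng.choice(n_obs, size=n_obs, replace=True)
        tree = DecisionTreeClassifier(min_samples_leaf=min_leaf,
                                      max_features=m,
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X_obs[boot], y_obs[boot])
        leaf_obs = tree.apply(X_obs[boot])       # nodes of training cases
        leaf_miss = tree.apply(X[~obs])          # nodes of missing cases
        for i, leaf in enumerate(leaf_miss):
            donor_pools[i].extend(y_obs[boot][leaf_obs == leaf])
    y = y.copy()
    y[~obs] = [rng.choice(pool) for pool in donor_pools]
    return y
```

Randomizing both the bootstrap sample and the split candidates decorrelates the trees, which is what distinguishes this update from the single-tree CART step.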

Regression analysis
For each of the 1000 imputed datasets, we fit a correctly specified final analysis model (a logistic regression model), and the results are combined using Rubin's rules (Rubin, 2004).
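Rubin's rules for pooling a single coefficient across m imputations can be written compactly; the sketch below also returns λ, the proportion of variance attributable to the missing data, which is reported later in this paper:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine m per-imputation estimates with Rubin's rules.

    estimates, variances: length-m arrays of the coefficient estimates
    and their squared standard errors from each imputed dataset.
    Returns the pooled estimate, the total variance, and lambda.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    qbar = estimates.mean()            # pooled point estimate
    ubar = variances.mean()            # within-imputation variance
    b = estimates.var(ddof=1)          # between-imputation variance
    t = ubar + (1 + 1 / m) * b         # total variance
    lam = (1 + 1 / m) * b / t          # fraction due to missingness
    return qbar, t, lam
```

The pooled standard error is the square root of the total variance t, and this is the model-based SE referred to in the Methods section.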

Calculation of bias, CP, and CI width
We computed bias as the estimated coefficient minus the true value. For a single repetition of the simulation, the CP is 0 if the estimated 95% CI does not contain the true value and 1 if it does (Mistler and Enders). Over the 1000 iterations of the simulation, the CP is the proportion of confidence intervals that contain the true value of the coefficient of interest. The Average CI Length (AL) refers to the width of the estimated 95% CI for each coefficient. We report these measures across the 1000 replications for each imputation method. Additionally, we computed the empirical standard error for each coefficient, calculated as the SD of the estimates across the 1000 simulation replications (Doove et al., 2014; Shah et al., 2014).
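The performance measures just described can be sketched in a few lines, assuming arrays of pooled estimates and standard errors from the R replications:

```python
import numpy as np

def summarize_simulation(est, se, true_value, z=1.96):
    """Bias, coverage, average CI length, and empirical SE.

    est, se: length-R arrays with the pooled estimate and its standard
    error from each of the R simulation replications.
    """
    est, se = np.asarray(est, float), np.asarray(se, float)
    bias = est.mean() - true_value                          # mean bias
    lo, hi = est - z * se, est + z * se                     # 95% CIs
    cp = np.mean((lo <= true_value) & (true_value <= hi))   # coverage
    al = np.mean(hi - lo)                                   # average CI length
    emp_se = est.std(ddof=1)                                # empirical SE
    return bias, cp, al, emp_se
```

Comparing the empirical SE with the mean model-based SE indicates whether the Rubin's-rules variance estimate is unbiased, a check used in the Results section.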

Scenario 1 (Interaction between two continuous variables)
For this scenario, Table 1 displays the mean biases across the 1000 replications, and Tables 2 and 3 display the coverage probabilities and average 95% confidence interval widths, respectively, for each imputation method. For the interaction effect, MICE-Stratified led to estimates with higher mean biases than the tree-based methods. Further, the CP values of the interaction effect were smallest under MICE-Stratified at all missing percentages (for instance, under the MAR mechanism, it was only 33.8% at the 10% missing percentage).
For the interaction effect, MICE-Interaction had smaller mean absolute biases and larger coverage probability values at all the missing percentages (>99.0%) than the tree-based methods. The mean absolute values of relative biases in estimating the interaction at 20% and 30% missing percentages under the MAR mechanism were, respectively, 0.001 and 0.006 for the MICE-Interaction method, and it generated complete coverage values.
For the MICE-Interaction method, the mean absolute relative biases for the coefficients contributing to the interaction (X1 and X2) at the 20% missing percentage under the MAR mechanism were 0.010 and 0.006, respectively, and were 0.002 and 0.005 at the 30% missing percentage. Furthermore, the coverage values for these coefficients at the 20% and 30% missing percentages under the MAR mechanism were complete or very close to one. CP values for the MICE-Interaction method were greater than 0.95. At the 40% and 50% missing percentages, the CP values for the recursive partitioning methods were less than 0.95 and were not acceptable. The MICE-Stratified method produced CP values below 0.95 for almost all effects.
In the presence of interaction between two continuous variables, for the main effects, MICE-CART led to smaller SE than MICE-Interaction and MICE-RF.

Scenario 2 (Interaction between two binary variables)
In the case where there is an interaction between two binary variables, although the CI length, the model-based standard error, and the proportion of variance attributable to the missing data were lowest for the MICE-Stratified method, this method was not acceptable given its bias, relative bias, and CP values.
The empirical standard errors and the model-based standard errors were almost the same, indicating that the variance of the coefficients obtained from Rubin's rules was unbiased.

Discussion
The CART and RF models in MICE were evaluated in comparison with the compatible parametric model. Recent research has shown that whenever the imputation model and the analysis model are compatible, the parametric imputation model is superior to nonparametric tree-based imputation methods (CART and RF) (Slade and Naylor, 2020). However, no study has compared the performance of these methods for binary responses when both binary and continuous predictors are available.
In the study conducted by Javadi et al., for the case where the target was imputing the response variable in the presence of an interaction between a binary and a continuous predictor, the four discussed imputation methods were compared; they concluded that by manually entering the interaction term into the PMM model, the performance of the PMM-based imputation methods was greatly improved relative to the recursive partitioning methods, in terms of bias, coverage, and the variation attributable to missing values (Javadi et al., 2021).
In all scenarios, we calculated the proportion of variation attributable to the missing values (λ). Because all the computed λ values were approximately 0.50, we do not report them individually. λ values of less than one half indicate that the imputation results were less affected by the imputation model used and more affected by the observed values (Van Buuren, 2018; Doove et al., 2014).
In this study, in all scenarios with an interaction between two binary variables, MICE-Interaction was superior to other methods. This means that in addition to the interaction estimates, MICE-Interaction estimated all the main effects with the least amount of bias values, and acceptable coverages.
In scenarios with an interaction between two quantitative variables, the MICE-CART method yielded mean absolute relative biases of less than 0.20 in about 80% of cases. In estimating the interaction, the coverage values of MICE-Interaction were acceptable in all scenarios. In this case, at missing percentages above 30%, the MICE-CART method should be used with caution, since its coverage values were less than 0.95.
In the case where there was an interaction between two binary variables, although the performance of the MICE-Interaction method was much better than that of the other methods, the tree-based methods also performed acceptably in estimating the interaction. In about 70% of cases, the MICE-CART method yielded absolute relative bias values of less than 0.20, and in about 90% of cases the coverage probability values were close to, or at least, 0.95.
The present study has several strengths. For the first time, we compared tree-based imputation methods (CART and RF) with parametric imputation methods in MICE by specifying a parametric imputation model (MICE-Interaction) in simulation studies with binary outcomes, in the presence of interactions. Also, we used the simulation technique to generate missing data in complete datasets; therefore, the true value of the missing data was known for evaluating multiple imputation methods; this is important because we had access to the original observations before creating the missing values and could compare the results of real and imputed observations. Moreover, the missing percentages were generated under two mechanisms on binary response variables.
Some limitations of this study should be considered. Firstly, only one condition was considered for the MAR mechanism, and it was chosen arbitrarily; therefore, the results can only be generalized to this condition, not to other MAR mechanisms. Secondly, it would have been preferable to have access to real data on which to evaluate the performance of the methods. Finally, we considered only a sample size of 1000.
Overall, the selection of predictors and their effects for incorporation in the final model is an issue beyond the scope of this study. Our recommendations for the imputation model are based on the assumption that the final model is correctly defined. Future research should consider the interplay between the misspecification of the final model and the selection of MICE models.
The number of trees built for the RF model was set to 10 to limit bias, and RF was compared to PMM operating under its default parameters. The number of iterations for the MICE-based imputation methods was set to five, as suggested. In practice, however, the influence of the tuning parameters on imputation performance and computational burden is still unclear and warrants further study; a parameter search may be necessary for complex datasets to attain better imputation performance. For CART, the default complexity parameter value of 10^-4 is appropriate. Smaller values of minbucket led to slightly decreased mean bias, but the default value of 5 seemed reasonable (Breiman, 2001; Doove et al., 2014; Slade and Naylor, 2020).

Conclusion
To our knowledge, this is the first paper to compare tree-based imputation in MICE to a parametric model that includes a true interaction effect and a binary outcome. If interest lies primarily in the estimation of an interaction between two continuous variables, we recommend utilizing MICE-Interaction, as it leads to the lowest bias, the highest CP, and narrow 95% CIs for the interaction effects at all missing percentages. In this case, if interest lies primarily in the estimation of main effects, there is a trade-off between tree-based and parametric imputations at missing percentages below 30%. MICE-Interaction had the highest CP for the interaction effect, but at the expense of wide 95% CIs. In the presence of an interaction between two binary variables, we recommend using MICE-Interaction, as it led to the lowest bias and acceptable CP for the interaction effects at all missing percentages. Importantly, parametric imputation should only be utilized if there is enough information to ensure that all necessary interaction terms are included in the imputation model.
Our results have important implications for epidemiologists dealing with incomplete data. To reduce bias in the estimated coefficients, researchers who are aware of interactions should include them in the parametric PMM imputation model so that the imputation model is compatible with the analysis model. Therefore, if the researcher has sufficient information about the structure of the data and is able to correctly detect and identify non-linear effects such as interactions, the use of the MICE-Interaction method is definitely recommended. Of course, in such a situation the results obtained from the MICE-Interaction method are ideal results that may not occur in practice, because it is challenging to accurately identify all the interactions and relationships in the observations. Therefore, if the researcher is interested in estimating the interaction but does not have enough information about the structure of the observations, it is better to evaluate the recursive partitioning methods in MICE, because these models detect interactions and non-linear effects automatically and inherently. Using the plain PMM method in MICE, which is the default in most statistical software, is not recommended in this setting.