To evaluate the results of the four different pooling methods in multiply imputed datasets after a BWS-procedure, we conducted a simulation study and repeated the procedures in a real-world dataset (NHANES).
Simulation datasets
1) To generate simulated datasets, we used the coefficients and standard deviations (SD) of the variables in an empirical dataset of low back pain patients as input parameters [14] (see Table 1).
2) A set of 9 variables, including categorical, dichotomous, and continuous (normally distributed) variables, was generated.
3) In the simulated datasets, categorical and dichotomous variables were initially generated as continuous variables to determine their level of correlation and were subsequently categorized (into 4 categories) or dichotomized using cutoff values from the empirical dataset. For the categorical variables ‘Cat1’ and ‘Cat2’ the cutoff values (as cumulative proportions) were [0, 0.6, 0.8, 0.9, 1] and [0, 0.3, 0.6, 0.8, 1], respectively. For both dichotomous variables the median was used as the cutoff value.
4) The outcome measure was obtained by first calculating the linear predictor score, multiplying the coefficients from the empirical dataset by the predictor values. A logistic regression model was then used to convert these scores into probabilities, and a draw from a uniform distribution was compared with each probability to generate a binary outcome.
Table 1 Means and variances used for the simulated dataset.

| Varname | Coefficients | Mean | Variance | Standard Deviation (SD) | Distribution |
|---|---|---|---|---|---|
| Cat1 (Xcat1) | 0.5 / 1.5 / 1.5# | 2 | 0.9 | 0.95 | Normal |
| Cat2 (Xcat2) | 1.5 / 1.5 / 1.5# | 3 | 1 | 1 | Normal |
| Dich1 (X1) | -0.5 | 0.8 | 0.2 | 0.45 | Normal |
| Dich2 (X2) | -1 | 0.7 | 0.3 | 0.55 | Normal |
| Cont1 (X3) | 0.5 | 7 | 2 | 1.41 | Normal |
| Noise (X4) | 0 | 7 | 2 | 1.41 | Normal |
| Cont2 (X5) | -0.1 | 40 | 90 | 9.5 | Normal |
| Cont3 (X6) | -0.1 | 26 | 15 | 3.9 | Normal |
| Cont4 (X7) | -0.1 | 34 | 23 | 4.8 | Normal |
Cat1=categorical variable 1; Cat2=categorical variable 2; Dich1=dichotomous variable 1; Dich2=dichotomous variable 2;
Cont1=continuous variable 1; Cont2=continuous variable 2; Cont3=continuous variable 3; Cont4=continuous variable 4; Noise=noise variable.
# Coefficients belonging to the dummies of the categorical variable.
5) Four different datasets were simulated: two sets with 200 observations, one with a correlation of 0.2 between the variables and the other with a correlation of 0.6, and, similarly, two sets with 500 observations.
6) One of the regression coefficients in the model was set to zero to mimic the behavior of a noise variable during variable selection (‘Noise’). A code sketch of these data-generating steps is given below.
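The steps above can be summarized in the following sketch. The exchangeable correlation structure, the intercept of zero (in practice it would be calibrated to the desired outcome prevalence), and the use of MASS::mvrnorm are assumptions of this sketch; all numbers come from Table 1 and steps 1 to 6.

```r
library(MASS)

n   <- 200   # or 500 (step 5)
rho <- 0.2   # or 0.6 (step 5); exchangeable correlation assumed here
mu  <- c(2, 3, 0.8, 0.7, 7, 7, 40, 26, 34)                # means, Table 1
sds <- c(0.95, 1, 0.45, 0.55, 1.41, 1.41, 9.5, 3.9, 4.8)  # SDs, Table 1
Sigma <- diag(sds) %*% (rho + (1 - rho) * diag(9)) %*% diag(sds)

X <- mvrnorm(n, mu = mu, Sigma = Sigma)                   # step 2
colnames(X) <- c("Cat1", "Cat2", "Dich1", "Dich2",
                 "Cont1", "Noise", "Cont2", "Cont3", "Cont4")
dat <- as.data.frame(X)

# Step 3: categorize/dichotomize the initially continuous variables
dat$Cat1  <- cut(dat$Cat1, quantile(dat$Cat1, c(0, 0.6, 0.8, 0.9, 1)),
                 include.lowest = TRUE, labels = FALSE)
dat$Cat2  <- cut(dat$Cat2, quantile(dat$Cat2, c(0, 0.3, 0.6, 0.8, 1)),
                 include.lowest = TRUE, labels = FALSE)
dat$Dich1 <- as.numeric(dat$Dich1 > median(dat$Dich1))
dat$Dich2 <- as.numeric(dat$Dich2 > median(dat$Dich2))

# Step 4: linear predictor -> probability -> binary outcome; the dummy
# coefficients of Cat1 and Cat2 follow the '#' entries in Table 1,
# the intercept of 0 is an assumption
lp <- with(dat,
  0.5 * (Cat1 == 2) + 1.5 * (Cat1 == 3) + 1.5 * (Cat1 == 4) +
  1.5 * (Cat2 == 2) + 1.5 * (Cat2 == 3) + 1.5 * (Cat2 == 4) -
  0.5 * Dich1 - 1 * Dich2 + 0.5 * Cont1 + 0 * Noise -     # step 6: 'Noise'
  0.1 * Cont2 - 0.1 * Cont3 - 0.1 * Cont4)
dat$Y <- as.numeric(runif(n) < plogis(lp))
```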
Generating Missing data
In each of the four simulated datasets, 20 percent missing data were created in both categorical variables and in two continuous variables (‘Noise’ and ‘Cont4’). To create missing-at-random (MAR) missingness, each of these variables was linked to another, fully observed variable that governed which 20 percent of its values were set to missing, as sketched below.
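A minimal sketch of such MAR missingness follows. Which variable drives the missingness of which target is not specified above, so the pairings (and the rank-based missingness probabilities) are assumptions of this sketch.

```r
# Higher values of the driver variable -> higher probability of missingness;
# probabilities are rescaled so that on average `prop` of values go missing.
make_mar <- function(data, target, driver, prop = 0.20) {
  p <- rank(data[[driver]]) / nrow(data)
  p <- p * prop / mean(p)
  data[[target]][runif(nrow(data)) < p] <- NA
  data
}

# Hypothetical driver/target pairings, for illustration only
dat <- make_mar(dat, "Cat1",  "Cont1")
dat <- make_mar(dat, "Cat2",  "Cont2")
dat <- make_mar(dat, "Noise", "Cont3")
dat <- make_mar(dat, "Cont4", "Cont1")
```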
Imputation method
MI was performed in each simulated dataset, generating 5 imputed datasets (m = 5) with Multivariate Imputation by Chained Equations (MICE), including the outcome in the imputation model [15,16].
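With the mice package this step looks as follows; the seed and the conversion of the categorized variables to factors are choices of this sketch.

```r
library(mice)

# Treat the categorized variables as factors so mice imputes them as such
dat$Cat1 <- factor(dat$Cat1)
dat$Cat2 <- factor(dat$Cat2)

# m = 5 imputations; the outcome Y stays in the data and therefore enters
# every imputation model, which is mice's default behaviour [15,16]
imp <- mice(dat, m = 5, printFlag = FALSE, seed = 2024)
```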
Pooling methods
Four different pooling methods were used (a code sketch follows this list):
1) The pooled sampling variance method (D1), which combines the pooled parameter estimates and the pooled sampling variances of the imputed datasets to construct a test that resembles a multivariate Wald test [8,17].
2) The multiple parameter Wald test (D2), which pools the chi-square values from the multiple parameter Wald or likelihood ratio tests in each imputed dataset [9].
3) The Meng and Rubin pooling method (D3), which pools likelihood ratio tests [10].
4) The median-P-rule (MPR), which uses the median P-value of the significance tests conducted in each imputed dataset; hence, it depends on P-values only and not on the parameter estimates [13].
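The first three tests are available in the mice package as D1(), D2(), and D3(); the MPR is a simple median over the per-imputation P-values. A sketch, reusing the mids object `imp` from above (the choice to test ‘Cat1’ and ‘Cont4’ is purely illustrative):

```r
library(mice)

# Full model and a nested model without 'Cat1' (a 3-df comparison)
fit1 <- with(imp, glm(Y ~ Cat1 + Cat2 + Dich1 + Dich2 + Cont1 + Noise +
                        Cont2 + Cont3 + Cont4, family = binomial))
fit0 <- with(imp, glm(Y ~ Cat2 + Dich1 + Dich2 + Cont1 + Noise +
                        Cont2 + Cont3 + Cont4, family = binomial))

D1(fit1, fit0)  # 1) pooled sampling variances: multivariate Wald-type test
D2(fit1, fit0)  # 2) pools the per-imputation chi-square test statistics
D3(fit1, fit0)  # 3) Meng and Rubin: pools likelihood ratio tests

# 4) Median-P-rule, here for the single-df term 'Cont4'
p_mpr <- median(sapply(fit1$analyses, function(m)
  summary(m)$coefficients["Cont4", "Pr(>|z|)"]))
```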
Statistics and analyses
Logistic regression analyses were performed in all original complete simulated datasets and in all multiply imputed datasets, and the coefficients, standard errors (SEs), P-values, and all developed prognostic models were compared. The BWS-procedures were performed using different P-out selection values (ranging from 1.0 down to 0.05) to develop parsimonious prediction models containing the strongest prognostic variables as well as larger prediction models containing strong and less strong prognostic variables.
The entire procedure was repeated 500 times, and the results were compared with those of the BWS-procedure in the complete dataset (no missing data). All statistical procedures were performed in R. A simplified sketch of such a pooled BWS loop follows below.
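The sketch below ranks terms by their pooled P-value under Rubin's rules (equivalent to D1 for single-df terms) or by the MPR. It assumes one coefficient per predictor (so no factor variables) and is not the exact implementation used in the study.

```r
library(mice)

bws_pooled <- function(imp, outcome, preds, p_out = 0.05,
                       rule = c("rubin", "median")) {
  rule <- match.arg(rule)
  completed <- complete(imp, action = "all")   # list of m filled-in datasets
  repeat {
    f    <- as.formula(paste(outcome, "~", paste(preds, collapse = " + ")))
    fits <- lapply(completed, function(d) glm(f, family = binomial, data = d))
    if (rule == "rubin") {                     # Rubin's rules per term
      s <- summary(pool(as.mira(fits)))
      p <- setNames(s$p.value, as.character(s$term))
    } else {                                   # median-P-rule (MPR)
      pm <- sapply(fits, function(m) summary(m)$coefficients[, "Pr(>|z|)"])
      p  <- apply(pm, 1, median)
    }
    p <- p[names(p) != "(Intercept)"]
    if (all(p < p_out) || length(preds) == 1) break
    preds <- setdiff(preds, names(which.max(p)))  # drop the weakest term
  }
  preds
}

# e.g. with P-out = 0.05, using the non-categorical predictors only:
# bws_pooled(imp, "Y", c("Dich1", "Dich2", "Cont1", "Noise",
#                        "Cont2", "Cont3", "Cont4"), p_out = 0.05)
```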
Comparing pooling methods
When comparing the pooling methods for variable selection, the focus was on three points:
The selection frequency of the variables.
The selection frequency of the variables of each method was compared with that in the complete datasets (i.e., without missing values), which served as the reference. The frequency was obtained by counting how many times a variable was selected in a model and dividing this by the total number of simulated models within that run. Subsequently, we evaluated which pooling method showed selection frequencies most similar to those in the complete dataset.
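For a hypothetical list `models`, in which each element is the character vector of variables selected in one simulated model, these frequencies can be computed as:

```r
# Fraction of simulated models in which each variable was selected
all_vars <- c("Cat1", "Cat2", "Dich1", "Dich2", "Cont1",
              "Noise", "Cont2", "Cont3", "Cont4")
sel_freq <- sapply(all_vars,
                   function(v) mean(sapply(models, function(m) v %in% m)))
```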
The P-values of the selected variables.
To compare the P-values, all P-values were first natural-log-transformed, so that even the smallest differences between P-values were graphically visible. The median of these transformed P-values for each of the four pooling methods was compared with the median in the complete dataset.
The stability of the selected prognostic models.
Model stability was evaluated by means of model selection frequencies, which quantify how many times a particular set of prognostic variables was selected [4,18]. The first 10 unique prognostic models were evaluated for all pooling methods and for the models selected in the complete dataset. For example: in one simulation run, 500 initial models were fitted by applying BWS, and from these 500 models the 10 unique models selected in the complete dataset were identified. With method D1, 375 of the 500 models were identical to one of those 10 unique models, counting as 375/500 = 75 percent of the same unique models. With method D2 this was 350 models (70 percent), with D3 340 (68 percent), and with MPR 450 (90 percent). In this example, MPR was the most stable pooling method. This analysis of the stability of the prognostic models was executed for all BWS-procedures under all conditions.
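Reusing the hypothetical `models` list from the previous sketch, the model selection frequencies can be tabulated by reducing each selected model to a canonical signature:

```r
# Sorted variable names joined into one string identify a unique model
sig        <- sapply(models, function(m) paste(sort(m), collapse = " + "))
model_freq <- sort(table(sig), decreasing = TRUE)
head(model_freq, 10)                 # the 10 most frequent unique models
head(prop.table(model_freq), 10)     # the same, as selection proportions
```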
Analyses in the NHANES-dataset
The NHANES-dataset was used as a real-world dataset to evaluate the performance of the same four pooling methods (D1, D2, D3, and MPR). Twelve variables were analyzed: 6 continuous, 2 dichotomous, and 4 categorical, with a dichotomous variable as the outcome measure. MI was performed, generating 5 imputed datasets (m = 5), using Multivariate Imputation by Chained Equations (MICE) and including the outcome in the imputation model [15,16]. A BWS-procedure was conducted with P-out < 0.05 for all pooling methods. The selected variables and the developed models were compared.
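As an illustration only: the mice package ships a small NHANES teaching subset (nhanes2, 4 variables), which can stand in here for the larger 12-variable extract analyzed in the study.

```r
library(mice)

imp_nh <- mice(nhanes2, m = 5, printFlag = FALSE, seed = 2024)

# Pooled 2-df test for dropping the categorical 'age' (D2 and D3 analogous);
# the study itself used a logistic model with a dichotomous outcome
fit1 <- with(imp_nh, lm(chl ~ age + bmi + hyp))
fit0 <- with(imp_nh, lm(chl ~ bmi + hyp))
D1(fit1, fit0)
```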