Background Within routinely collected health data, the fact that data are missing for an individual may itself provide useful information. This occurs, for example, in electronic health records, where the presence or absence of data is informative. Although naive use of missing indicators to exploit such information can introduce bias, using them in conjunction with other imputation approaches may unlock the potential value of missingness to reduce bias and improve prediction.
Methods We conducted a simulation study to determine when combining a missing indicator with an imputation approach, such as multiple imputation, improves model performance, measured as bias in causal effect estimation and as predictive accuracy, across a range of scenarios with unmeasured variables. We used directed acyclic graphs and structural models to elucidate the causal structures of interest. We considered a variety of missingness mechanisms and handled each using complete case analysis, unconditional mean imputation, regression imputation and multiple imputation. In each case, we evaluated supplementing the approach with missing indicator terms.
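As an illustration of what supplementing an imputation approach with a missing indicator term can look like, the following is a minimal sketch using unconditional mean imputation. It is not the simulation code used in the study: the data-generating model, the coefficients, the sample size and the helper names `mean_impute_with_indicator` and `fit_ols` are all hypothetical choices made for this example, and the error reported is in-sample only.

```python
import numpy as np


def mean_impute_with_indicator(x):
    """Replace NaNs in x with the observed mean and return (x_imputed, r),
    where r is 1.0 for values that were missing and 0.0 otherwise."""
    x = np.asarray(x, dtype=float)
    r = np.isnan(x).astype(float)
    x_imputed = np.where(r == 1.0, np.nanmean(x), x)
    return x_imputed, r


def fit_ols(columns, y):
    """Least-squares fit of y on an intercept plus the given columns.
    Returns the coefficients and the design matrix."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, design


# Hypothetical toy scenario (an assumption, not one of the paper's scenarios):
# an unmeasured variable u affects both the outcome y and whether x is missing.
rng = np.random.default_rng(0)
n = 2000
u = rng.normal(size=n)                      # unmeasured variable
x = rng.normal(size=n)                      # covariate subject to missingness
y = 1.0 + 0.5 * x + u + rng.normal(scale=0.5, size=n)
is_missing = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * u))
x_obs = np.where(is_missing, np.nan, x)

x_imp, r = mean_impute_with_indicator(x_obs)

# Model without and with the missing indicator term.
coef_plain, design_plain = fit_ols([x_imp], y)
coef_ind, design_ind = fit_ols([x_imp, r], y)

mse_plain = np.mean((y - design_plain @ coef_plain) ** 2)
mse_ind = np.mean((y - design_ind @ coef_ind) ** 2)
print(f"in-sample MSE without indicator: {mse_plain:.3f}")
print(f"in-sample MSE with indicator:    {mse_ind:.3f}")
```

Because the two models are nested, the in-sample error with the indicator can never exceed the error without it, so this sketch only shows the mechanics; a fair comparison of predictive accuracy would use held-out data, and the same indicator column could equally be appended after regression or multiple imputation.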
Results For estimating causal effects, we find that multiple imputation combined with a missing indicator gives minimal bias in most scenarios. For prediction, we find that regression imputation combined with a missing indicator minimises mean squared error.
Conclusion In the presence of missing data, careful use of missing indicators, combined with appropriate imputation, can improve both causal effect estimation and predictive accuracy.