BACKGROUND: Real-world data are increasingly being used as a complement to randomized controlled trials (RCTs) for evaluating the effectiveness and cost-effectiveness of healthcare interventions. Real-world data are often expected to generalize better than RCT data because they include more representative patient populations and more closely resemble daily clinical practice. However, because real-world data are not collected for research purposes, missing observations are highly common. Inadequate handling of missing observations may lead to biased estimates and invalid conclusions. The aim of this scoping review was to identify and critically appraise statistical methods for dealing with missing observations in real-world data.
METHODS: We searched PubMed for simulation studies, published between January 2000 and December 2018, that assessed the performance of statistical methods for dealing with missing observations in real-world data. We restricted the search to simulation studies because well-designed simulation studies may be the preferable basis for choosing the best missing data method for real-world data. Information was extracted on the aims of the studies, data generating mechanisms, assessed statistical methods for dealing with missing observations, performance of the statistical methods, statistical software used, and authors’ remarks and conclusions on the validity and usability of the statistical methods.
RESULTS: Fifty-two studies were eligible for inclusion; 19 assessed methods for missing covariates, 13 for missing outcome(s), 15 for missing observations in both covariates and outcome(s), and in 5 studies it was unclear whether missing observations were present in covariates and/or outcome(s). Eleven studies accounted for the multilevel structure of the data; the remaining studies did not. For single-level data missing at random (MAR), multiple imputation by chained equations (MICE) and multivariate normal imputation (MVNI) seemed to be the best performing methods. For multilevel MAR data, multiple imputation-based methods seemed to be the most flexible and best performing methods. For data missing not at random (MNAR, 16 studies), selection models and pattern mixture models appeared to be the most promising.
CONCLUSIONS: The choice of a statistical method to deal with missing observations depends on the type of missing variables and the assumed missing data mechanism. Although MAR is the most commonly assumed missing data mechanism, data that are missing not at random (MNAR) are also common in real-world data, yet only a few studies evaluated methods for MNAR data.