In this section, we describe how specific IDA results may be used in the subsequent regression analysis. The possible impact of IDA has three aspects: it may induce refinements of the intended statistical analysis and help define the statistical analysis plan, help avoid misinterpretation of the results of the regression analysis, and provide useful background knowledge for deciding how to present those results.
5.1 Bacteremia study: refinements of the analysis strategy triggered by IDA results
Revisions of the analysis strategy based on the results of IDA are justified as long as no predictor-outcome associations were evaluated during IDA. An update of the analysis strategy could encompass a refinement of the model specifications, additional analyses such as sensitivity analyses, or possibly a change to the intended analysis methods. In our examples, we show that large proportions and specific patterns of missing values, skewed distributions of predictors, and a high degree of redundancy between predictors may suggest that the plan should be updated. Furthermore, IDA findings might lead to the planning of sensitivity analyses before formal statistical analyses commence.
In this example, the number of possible candidate predictors is relatively large if the final model should be 'parsimonious' and explainable. Here IDA can provide the necessary information to guide decisions on which predictors to focus on during model building. This is often an iterative process that depends on observed features of the data, as the correlation and missingness structure of the predictor set changes if predictors are removed. We consider IDA as the first iteration of this process. Further steps will then be carried out by the modeling team.
Predictors with univariate distributions that are particularly narrow or, in the case of categorical variables, extremely unbalanced may contribute only very little to predictive performance because most of the subjects are similar. The chance that such a variable is a strong predictor of the outcome is very low.
Given their histograms, basophiles (and the associated basophiles ratio), and probably eosinophiles (and the eosinophiles ratio) could be candidates for discarding because of their excessive spike at zero, which cannot be removed by any transformation. Given that the aim of the analysis is to predict bacteremia accurately, these variables are unlikely to contribute to overall predictive accuracy.
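A simple screen for such spikes can be sketched as follows. This is a minimal illustration and not the procedure used in our IDA: the function name `modal_share`, the cut-off of 0.5 and the toy values are hypothetical and chosen only to show the idea of flagging predictors for which a single value dominates the distribution.

```python
from collections import Counter

def modal_share(values):
    """Return the most frequent value and its share among non-missing observations."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values")
    value, count = Counter(observed).most_common(1)[0]
    return value, count / len(observed)

# Toy data (hypothetical values, NOT taken from the bacteremia study)
toy = {
    "BASO": [0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.2, 0.0, 0.0],
    "NEU": [4.1, 7.3, 5.2, 9.8, 6.0, 3.3, 12.4, 8.1, 5.5, 6.7],
}

THRESHOLD = 0.5  # hypothetical cut-off for "a single value dominates"
flagged = [name for name, vals in toy.items() if modal_share(vals)[1] > THRESHOLD]
print(flagged)
```

Whether a flagged predictor is actually dropped remains a judgment of the modeling team; the screen only points to variables whose histograms deserve a closer look.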
Predictors are also expected to add little or no predictive performance if they are redundant to other predictors in the model. However, sometimes a reparametrization of the predictor space may remove the redundancy and enhance interpretability of the model. Analysis of the predictors' multivariate distributions revealed that two predictors rated a priori as important for predicting bacteremia exhibited a very high correlation: leukocytes (WBC) and neutrophiles (NEU). This correlation stems from the fact that neutrophiles are the largest subtype of leukocytes. As a consequence of the correlation and the background information, one could replace WBC with a new variable WBC_NONEU = WBC – NEU. Using NEU and WBC_NONEU retains all the information of the two predictors for the model and keeps their regression coefficients interpretable but removes the high correlation (see also [21]).
From our IDA, we would probably conclude that among the leukocyte-related predictors, only the computed WBC_NONEU (leukocytes minus neutrophiles) and NEU should be retained in an analysis with an extended candidate set, probably together with MONO and LYM. The corresponding 5 'ratio' variables may not be needed for modeling as they are largely collinear with their absolute counterparts, and basophiles and eosinophiles may be omitted because of their spike at zero.
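The effect of this reparametrization on the correlation can be sketched with simulated data. The values below are hypothetical and not the bacteremia data; in particular, the simulation assumes that the non-neutrophile part of the leukocyte count is independent of NEU, which need not hold in real data. The helper `pearson` is written out only to keep the sketch self-contained.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)  # fixed seed so that the sketch is reproducible

# Simulated counts (hypothetical, NOT the bacteremia data):
# leukocytes are the sum of neutrophiles and all other subtypes.
neu = [random.uniform(1.0, 15.0) for _ in range(500)]
other = [random.uniform(1.0, 4.0) for _ in range(500)]
wbc = [n_i + o_i for n_i, o_i in zip(neu, other)]

# Reparametrization: WBC_NONEU = WBC - NEU
wbc_noneu = [w_i - n_i for w_i, n_i in zip(wbc, neu)]

r_original = pearson(wbc, neu)       # WBC vs. NEU
r_reparam = pearson(wbc_noneu, neu)  # WBC_NONEU vs. NEU

print(f"corr(WBC, NEU)       = {r_original:.2f}")
print(f"corr(WBC_NONEU, NEU) = {r_reparam:.2f}")
```

For linear terms, the reparametrization does not change the linear predictor, since b1·WBC + b2·NEU = b1·WBC_NONEU + (b1 + b2)·NEU; only the meaning of the coefficient of NEU changes.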
Likewise, one may also consider removing those predictors that exhibit large proportions of missing values. Assume that a predictor with many missing values is closely associated with other predictors. In this case, it may be well imputable but is not likely to add predictive value on top of those other predictors. If the predictor in question is not associated with other predictors, it may not be well imputable and hence its imputation will introduce noise into the analysis. Hence, removing such variables may be indicated. Unfortunately, it is not so clear when to disregard a predictor because of its proportion of missing values. The threshold value may depend on how many predictors are affected by missing values, how much they are affected, and what the missingness pattern looks like. In our example, PAMY, TRIG and CHOL, all with more than one third of their values missing, are probably the most obvious candidates for omission. Moreover, the following predictors all have missingness proportions of more than 20%: GLU, AMY, LIP, and HS. According to the missingness pattern dendrogram, PAMY, TRIG, CHOL, AMY and LIP also have highly correlated missingness, suggesting that they cannot inform each other in imputation models (Fig. 1).
Hence, 14 predictors could be excluded from model building without having to expect reduced predictive accuracy. This reduces the dimensionality of the predictor space from 51 to 37 without unblinding the association of the predictors with the outcome. The modeling team will recompute the numbers of observations with complete recordings for all remaining predictors.
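The two quantities used in this argument, the proportion of missing values per predictor and the correlation between missingness indicators of two predictors, can be computed as in the following sketch. The toy records are hypothetical and not the bacteremia data, missing values are assumed to be encoded as `None`, and the cut-off of one third merely mirrors the proportion mentioned above.

```python
import math

def missing_proportion(values):
    """Share of entries that are missing (encoded as None)."""
    return sum(v is None for v in values) / len(values)

def missingness_correlation(a, b):
    """Pearson (phi) correlation between the missingness indicators of a and b."""
    ia = [1.0 if v is None else 0.0 for v in a]
    ib = [1.0 if v is None else 0.0 for v in b]
    n = len(ia)
    ma, mb = sum(ia) / n, sum(ib) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(ia, ib))
    saa = sum((x - ma) ** 2 for x in ia)
    sbb = sum((y - mb) ** 2 for y in ib)
    if saa == 0 or sbb == 0:
        return float("nan")  # undefined if a variable is never or always missing
    return sab / math.sqrt(saa * sbb)

# Toy records (hypothetical values, NOT the bacteremia data)
toy = {
    "PAMY": [None, None, 3.1, None, 2.0, 4.4, None, 1.9],
    "AMY": [None, None, 50.0, None, 61.0, 47.0, 55.0, 58.0],
    "CRP": [12.0, 3.4, None, 80.1, 5.5, 9.9, 20.3, 1.2],
}

proportions = {name: missing_proportion(vals) for name, vals in toy.items()}
high_missing = [name for name, p in proportions.items() if p > 1 / 3]
r_pamy_amy = missingness_correlation(toy["PAMY"], toy["AMY"])

print(proportions)
print(high_missing)
print(round(r_pamy_amy, 2))
```

A high correlation between missingness indicators means that the two predictors tend to be missing for the same subjects, so that neither is available to impute the other.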
High-resolution histograms may also guide the modeling team in the choice of an appropriate handling of nonlinearity. Generally, many different modeling strategies to handle nonlinear associations of predictors with the outcome are available, e.g. restricted cubic splines, penalized splines or fractional polynomials [3]. For example, restricted cubic splines provide a linear fit outside of the boundary knots, and one may want to adjust default knot positions in case of very skewed distributions. In our IDA, we already showed histograms based on pseudolog transformations of some predictors. These transformations were necessary in scatterplots to enhance their interpretability, but whether to use transformed predictors in the outcome models may be debatable. Royston and Sauerbrei [22] discussed other ways of pretransforming predictors to increase robustness of models, in particular at the tails of the predictor support.
The IDA domains should probably not be considered in isolation when identifying predictors that should be removed from model building. For example, a moderate proportion of missingness together with a moderate degree of redundancy of a predictor may also justify its exclusion.
If interactions of predictors have been pre-specified, IDA may evaluate the joint distribution of these predictors. Strong association of the predictors involved in an interaction may make the inclusion of their interaction unnecessary as it would come with great estimation uncertainty (cf. [23], p. 301). For example, we included scatterplots of all predictors with age, stratified by sex, in our IDA. Among the key predictors, BUN had the highest correlation with age, with correlation coefficients of 0.487 (males) and 0.386 (females), while there was hardly any correlation between age and PLT. Hence, interaction terms involving age and PLT can be more precisely estimated than interaction terms involving age and BUN.
Sensitivity analyses, which are in general not part of IDA, are a tool to evaluate the robustness of estimates to decisions in model building, for example the choice between different methods, the impact of variable selection, or the impact of strategies to handle missingness or influential points. Sensitivity analyses should be pre-specified, and IDA may suggest that certain sensitivity analyses are necessary to back up the modeling results. Regarding missing values in the bacteremia study, one could perform such a sensitivity analysis by not imputing any predictors of minor importance but just omitting them from the model. One may also consider transforming some predictors with particularly skewed distributions before modeling. While this may lead to differences in the interpretation of the associated regression coefficient (if a linear functional form is chosen for such a predictor), one could evaluate whether it also affects prediction performance or the standard errors of the other covariates' regression coefficients (see Additional File 1 Appendix B.2 for an example). Regarding the strong correlation between WBC and NEU, one could define such a sensitivity analysis by removing either WBC or NEU from the model and evaluating whether this has a relevant effect on the model's performance.
In all three cases, such sensitivity analyses are consequences of IDA but they are still predefined in the sense that they are planned before uncovering the association of the outcome with the predictors [14, 18, 24].
By contrast, for example, sensitivity analyses that result from observing an unexpected pattern in the residuals of a model (e.g. if residuals show a clear nonlinear association with a predictor) must be seen as post-hoc analyses. Modifying the model because of such an unplanned sensitivity analysis increases the risk of overfitting the model. Nevertheless, it should be done and reported as a post-hoc analysis.
5.2 Bacteremia study: how IDA may guide the interpretation of modeling results
The results of a regression model consist of the estimated regression coefficients and their covariance matrix, in particular their standard errors. They may include predictions for selected predictor patterns and will also comprise measures of model performance.
Skewed distributions. Skewed distributions of predictors may have consequences for the precision and the robustness of these results, and knowledge about the distributional shapes of the predictors is essential for interpretation. As revealed by our IDA, some of the predictors exhibited highly skewed distributions. For these predictors, the estimation of the nonlinear functional forms may suffer from disproportional impact of some observations, and estimation uncertainty will be reflected by wide confidence intervals. The impact of highly influential points may be reduced by pretransforming the predictors to more symmetric distributions, which however may change their interpretation if a linear functional form is finally chosen. Alternatively, the values could be winsorized before modeling as previously suggested [17, 22]. In addition, extreme values should be assessed for implausibility and, if classified as such, potentially removed. In general, there are numerous ways to make analyses robust against such influential points, including transformation, robust regression or robust variance estimation [22, 25–26].
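As an illustration of winsorizing, the following sketch replaces the k smallest and k largest values of a predictor by their nearest retained neighbours. This is a plain count-based variant shown under our own assumptions; it is not necessarily the pretransformation proposed in [17, 22], and the function name `winsorize`, the choice k = 1 and the toy values are hypothetical.

```python
def winsorize(values, k):
    """Replace the k smallest and k largest values by their nearest retained neighbours."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must satisfy 0 <= k < len(values) / 2")
    ordered = sorted(values)
    lower, upper = ordered[k], ordered[len(ordered) - 1 - k]
    # Clip every value to the interval [lower, upper]; the original order is kept.
    return [min(max(v, lower), upper) for v in values]

# Toy values with one extreme observation (hypothetical, NOT the bacteremia data)
toy_values = [3.0, 4.5, 5.1, 6.2, 250.0]
winsorized = winsorize(toy_values, k=1)
print(winsorized)
```

Unlike removal, winsorizing keeps all subjects in the analysis but limits the leverage of the most extreme values.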
Transformation of predictors. If a predictor has been transformed, regression coefficients are given per unit of the transformed predictor. In the case of the pseudo-log transformation with base 10 that was suggested for BUN, they would correspond to the difference in outcome expected for a tenfold increase in the original predictor. This correspondence is only approximate because a pseudo-log rather than a log transformation was used. See Additional File 1 Appendix B.2 for analyses of the bacteremia study with and without preceding pseudo-log transformation of predictors. If pseudo-log transformations are used for WBC and NEU in modeling the data, a unit of the pseudo-logarithm would correspond roughly to a tenfold increase in the original WBC or NEU. The range of the pseudo-logged values is about 1.5; thus a unit difference covers almost the entire range of the data and comparably large regression coefficients have to be expected. See Additional File 1 Appendix B.3 for an illustration.
Validity of predictions. IDA allows identifying the support of a model, i.e., the ranges of values of the predictors from which the model was derived and to which it should be applicable. Predictions for observations from areas with higher joint density of predictors come with more confidence (narrower confidence intervals), while predictions with smaller support are less precise. The joint distribution also helps to understand in which cases predictions would actually be extrapolations. For example, in Fig. 3 the density of data points in any of the age-sex groups is very low for values of the pseudolog of WBC (t_WBC) greater than 1.5. The support is also essential to understand measures of model performance. Usually, the wider the support of a model, the more variance in the outcome can potentially be explained, and hence measures like the area under the ROC curve or the R-squared also tend to be greater. See Additional File 1 Appendix B.4 for an illustration.
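Why the correspondence to a tenfold increase is only approximate can be seen from a small numerical sketch. We assume here the common definition of the pseudo-logarithm as asinh(x / (2·sigma)) / ln(base), as implemented for instance in the R package scales; the parameter `sigma` and its default of 1 are assumptions of this sketch and not necessarily the values used in our IDA.

```python
import math

def pseudo_log(x, sigma=1.0, base=10.0):
    """Pseudo-logarithm: asinh(x / (2 * sigma)) / ln(base); defined also at x = 0."""
    return math.asinh(x / (2.0 * sigma)) / math.log(base)

# Far from zero, a tenfold increase is close to one unit on the pseudo-log scale ...
step_far = pseudo_log(100.0) - pseudo_log(10.0)
# ... but close to zero it corresponds to clearly less than one unit.
step_near = pseudo_log(1.0) - pseudo_log(0.1)

print(f"10 -> 100: {step_far:.3f} units")
print(f"0.1 -> 1 : {step_near:.3f} units")
```

Under this definition the transformation behaves like log10 for values that are large relative to sigma and is approximately linear near zero, which is why it can be applied to predictors with zeros.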
Missing data handling. While a method to handle missing data is usually prespecified, IDA can give some information to support this decision or put it into question. If multiple imputation was prespecified, it has to be expected that the regression coefficients of predictors with a higher proportion of missing values will generally have larger standard errors than those of predictors with fewer missing values, relative to how these quantities compare after complete case analysis. Consequently, if multiple imputation is combined with data-driven model selection approaches (such as backward elimination), such predictors are also less likely to be selected than more complete predictors, given that they have approximately equal association with the outcome. Hence the decision whether to apply multiple imputation or complete case analysis may impact the structure of the selected model. See also Additional File 1 Appendix B.5 for an illustration.
Interpretation of nonlinear functional forms. For predictors for which a nonlinear functional relationship with the outcome is assumed, the partial response function (predicted values vs. predictor) will usually be evaluated graphically. Areas in which this response function has a wide confidence interval correspond to low support in observed predictor values, and such low support may preclude the precise estimation of a nonlinear functional form. In Additional File 1 Appendix B.1 we used a simplified fractional polynomial model for bacteremia status to illustrate the interplay between decisions to apply transformations to predictors before model building and their consequences for the estimated functional forms. In Additional File 1 Appendix B.6 we show an example where a nonlinear effect of a predictor was identified, but in the most relevant subrange of the predictor, where the data are dense, the estimated nonlinear functional form agreed well with a straight line.
Predictor selection or reparameterization of predictors. If two correlated predictors are considered for a model (like WBC and NEU), interpretation may be difficult if the correlation results from the definition of the predictors. In the example with WBC and NEU, WBC cannot stay constant while NEU varies because neutrophiles are a component of leukocytes. Above we suggested replacing WBC by WBC_NONEU; WBC_NONEU and NEU can then vary independently, ensuring interpretability of the regression coefficients.
5.3 How IDA may guide the presentation of results
While in this paper we intentionally do not present the actual modeling results for the bacteremia study, we give some general remarks on how IDA may guide the presentation of such results.
Transformations of predictors included in prediction models should be appropriately documented and reported. For continuous predictors, IDA suggests appropriate unit increments of predictors to which regression coefficients or derived quantities such as odds or hazard ratios should correspond (e.g. 1-year, 5-year or 10-year increments of age). Numerous examples from the medical literature demonstrate that this is often ignored, and one can find reports of ratio estimates for a continuous predictor with confidence limits that are all close to 1. For example, Ma et al. [27] report an adjusted risk ratio of CRP (95% confidence interval) of 0.982 (0.973, 0.991) with a p-value < 0.001 for predicting survival of persons admitted to a hospital with COVID-19. When considering the reported interquartile ranges (7.52 to 37.93 mg/L for survivors, and 35.52 to 148.31 mg/L for non-survivors), it becomes apparent that a unit difference in CRP in this study cohort is probably not an appropriate choice for presenting the model, if interpretability is a goal.
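Re-expressing such a ratio for a more meaningful increment is simple if the predictor enters the model linearly on the log scale: a ratio per unit is raised to the power of the chosen increment. The sketch below applies this to the figures quoted above; the increment of 10 mg/L is a hypothetical choice for illustration, and the function name `rescale_ratio` is ours.

```python
import math

def rescale_ratio(ratio, increment):
    """Ratio for a difference of `increment` units, given the ratio per one unit.

    Assumes a linear effect of the predictor on the log scale.
    """
    return math.exp(increment * math.log(ratio))

# Reported estimate and 95% confidence limits per 1 mg/L of CRP (Ma et al. [27])
per_unit = (0.982, 0.973, 0.991)

INCREMENT = 10  # hypothetical choice: present the ratio per 10 mg/L
per_increment = tuple(rescale_ratio(r, INCREMENT) for r in per_unit)

print("per 1 mg/L : %.3f (%.3f, %.3f)" % per_unit)
print("per %d mg/L: %.3f (%.3f, %.3f)" % ((INCREMENT,) + per_increment))
```

With this increment the point estimate comes out at roughly 0.83, which conveys the strength of the association far more directly than a value of 0.982 per single unit.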
Royston and Sauerbrei [17, p. 54f] discuss choosing an appropriate reference category for categorical predictors in a regression model. While there may be background knowledge to support the choice of a specific category as the reference, IDA may be used to ensure that the 'sample size (of the referent category) should not be too small' to avoid inflation of standard errors for all comparisons to the reference [17, p.55].
In our example, one could be interested in presenting partial dependence of predictions on predictors by displaying the estimated response function of predictors. IDA guides the choice of an appropriate range for the x-axis, which will be either the range of the predictor or a bit less than the range depending on the data sparsity in the tails of its distribution. One could also use a scaling of the x-axis that corresponds to the transformation that was deemed appropriate to symmetrize the distribution of the predictor. See Additional File 1 Appendix B.2 for illustrations.
All changes to the prespecified analysis and reporting strategy induced by IDA must be transparently reported in a statistical methods summary for the statistical report. In this example we did not suggest specific changes, but only illustrated which aspects of an analysis plan could be further refined or put into question once the IDA results are available. For each of these refinements, usually many options are possible and specific choices may depend on the preferences and experience of the analysis team.
In this application with 50 possible candidate predictors to choose from, there is a lot of emphasis on how to use IDA to guide model building by disregarding predictors in the analysis. This is of course very specific to this example, and IDA is not always related to this aspect. In other studies, this aspect may play a smaller role, and the importance of IDA should not be understood as being mostly related to it.