Regression without regrets – initial data analysis is an essential prerequisite to multivariable regression

doi:10.21203/rs.3.rs-3580334/v1

Research Article

This work is licensed under a CC BY 4.0 License

Journal publication: published 08 Aug, 2024 in BMC Medical Research Methodology

Statistical regression models are used for predicting outcomes based on the values of some predictor variables or for describing the association of an outcome with predictors. With a data set at hand, a regression model can be easily fit with standard software packages. This bears the risk that data analysts may rush to perform sophisticated analyses without sufficient knowledge of basic properties, associations in and errors of their data, leading to wrong interpretation and often questionable presentation of the modeling results. Ignorance about special features of the data such as redundancies or particular distributions may even invalidate the chosen analysis strategy. The main aim of initial data analysis (IDA) in the context of regression analyses is seen in providing knowledge about the data to confirm the appropriateness of or to refine a chosen model building strategy, to interpret the modeling results correctly, and to guide the presentation of modeling results. In order to facilitate reproducibility, IDA needs to be preplanned, an IDA plan should be included in the general statistical analysis plan of a research project, and results should be well documented. Biased statistical inference of the final regression model can be minimized if IDA abstains from evaluating associations of outcome and predictors, a key principle of IDA. We give advice on which aspects to consider in an IDA plan for data screening in the context of regression modeling to supplement the statistical analysis plan. We illustrate this IDA plan for data screening in an example of a typical diagnostic modeling project and give recommendations for data visualizations.

Initial data analysis

IDA framework

regression models

data screening

reporting

variable selection

functional form

variable transformation

STRATOS Initiative

Statistical models are commonly used for predicting an outcome variable based on the values of some predictor variables, and for describing the association between the outcome and the predictors. This is often done by defining a model structure such as, in its simplest form, a linear combination of the covariates, and estimating the unknown parameters of the model. Many aspects of multivariable regression analyses such as choosing an appropriate model family, covariate selection for a model, consideration of nonlinear associations of continuous covariates with the outcome, or validation of regression models have been discussed extensively [1–4]. However, correct interpretation and adequate presentation of a model crucially depend on knowledge about the predictors, in particular about their marginal and joint distributions. Further properties of the data such as patterns of missing values, collinearities, measurement errors, or complex hierarchies in the measured predictors may have to be considered when choosing a model building strategy that is appropriate for the tasks of prediction or description.

In practice, however, standard software packages and ready-made functions for statistical procedures make regression analyses possible with little effort. Data analysts may therefore rush to perform sophisticated analyses without systematically checking for errors in the data, without a clear understanding of the underlying features of the data, without knowing whether the data are suitable for the intended analyses, or even without knowing whether the data can actually answer the research questions of interest.

The main aim of Initial Data Analysis (IDA) is seen in providing reliable and transparent knowledge about the data and how they meet preconditions to conduct appropriate statistical analyses and a correct interpretation of the results to answer pre-defined research questions [5]. Others have noted the need for a strategic approach to initial data analysis (for example, [6]). The IDA framework consists of six steps [5] incorporated in the research work flow [7]. Regarding steps I and II of the framework, we assume that metadata exist in sufficient detail, and that data cleaning was already performed [8]. Metadata summarize background information about the data, and a data cleaning process identifies and corrects technical errors. In this paper we focus on those aspects of the IDA framework that address preparation of the data for building a regression model and possible consequences of the findings (steps III - VI).

Data screening (step III) examines data properties to inform decisions about the intended analysis.
Initial data reporting (step IV) documents the insights gained in the previous steps so that they can be referred to when interpreting results from the regression modeling.
Refining and updating the analysis plan (step V) covers the possible consequences of such initial analyses, i.e., the intended way of building the regression models may have to be revised.
Finally, reporting of IDA methods and results in research papers (step VI) is necessary to ensure transparency regarding key findings that influence the analysis or interpretation of results.

Further details about the elements of IDA are discussed in [5].

In this paper our objectives are (1) to give advice on what to consider in an IDA plan for data screening in the context of regression modeling to supplement the statistical analysis plan and (2) to illustrate this IDA plan for data screening in an example with recommendations for data visualizations. Reproducible R code with link to dataset is available at https://stratosida.github.io/regression-regrets/. Intermediate results of this project have been presented at various conferences and workshops. Feedback was sought from STRATOS members and other experienced statisticians to come to a consensus.

We outline the assumptions made in our paper in Section 2. In Section 3 we define a list of several topics of relevance for an IDA plan in a regression modeling context. We show how to prespecify IDA topics by means of an example study in Section 4. In Section 5, we discuss possible consequences of IDA findings. Finally, we reflect on integrating IDA in the research process for regression analyses and address reporting and limitations.

This paper elaborates on planning and conducting IDA, in particular data screening, in a reproducible manner in the context of regression analyses for a descriptive or predictive purpose. We are very aware of the danger of formulating and testing hypotheses after exploring the data, which can lead to overstated associations and false positive results [9–10]. Hence, IDA should not be confused with exploratory data analysis or with Chatfield's notion of an 'initial examination of data', as the latter two approaches explicitly evaluate associations of the outcome and the predictors [11–12]. Instead, a key principle of IDA is that it should not, without good reason, anticipate analyses directly related to the research question. This implies that associations between outcome and predictors are explored neither numerically nor graphically. Nevertheless, the conduct of IDA is guided by the research question and the intended analyses.

In this paper we consider studies in which the primary descriptive or predictive research question is addressed by fitting, interpreting and probably also applying one or several regression models with a continuous, binary, ordinal or count-type outcome variable. Figure 1 illustrates how IDA planning and conducting could be embedded in the research workflow. Ideally, a statistical analysis plan can be fully prespecified before data collection, and it may entail, among many other items, the specification of the predictor set to be included in a model, the model structure, the outcome variable, any further steps considered in model building, and specifications how the modeling results will be interpreted with respect to the research question, and how they will be presented (Fig. 1, item 1). Embedding IDA in the research workflow, this statistical analysis plan is specified as an “analysis strategy” that is considered preliminary for the time being. Next, an IDA plan is devised which specifies some analyses of the data to be conducted, carefully avoiding any evaluations of predictor-outcome associations. This IDA plan may also include evaluating conditions that lead to changes in the prespecified model, e.g. because of abundance of missing values, very imbalanced or even degenerate distributions or because of redundancy of predictors (Fig. 1, item 2). Then IDA is conducted according to the prespecified IDA plan (Fig. 1, item 3). Using the results of IDA, the statistical analysis plan may be updated or refined; any changes will be transparently reported as a consequence of IDA (Fig. 1, item 4). Finally, statistical analysis is conducted and reported using the finalized statistical analysis plan (Fig. 1, item 5). Besides suggesting refinements or updates of a preliminary statistical analysis plan, IDA may also guide presentation and interpretation of modeling results, which will be exemplified in later sections of the paper.

As a simple example, consider the case where the prespecified model includes a particular categorical predictor with four levels, which should be included as three dummy variables contrasting three levels to a common predefined reference level. There is hardly any background knowledge about the expected frequencies of the four levels. Without IDA, it may turn out during the regression modeling process that the prespecified model is not estimable because no events were observed at one of the levels of the predictor. The analysis team may then recommend revising the analysis strategy and collapsing this level with other levels in the analysis, but such a revision is problematic because it is made post hoc and was not part of the analysis plan. Now assume instead that the IDA plan includes checks of the distributions of each predictor and conditions under which collapsing of predictor levels is considered. IDA is then conducted according to this plan and reveals the sparsity of one level of the predictor, but without evaluating the frequency of outcome events for that level. As a result, the statistical analysis plan is refined according to the predefined conditions, transparently stating that the sparse level of the categorical predictor in question was collapsed with a suitable other level. Final statistical modeling proceeds according to the prespecified, finalized statistical analysis plan.
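
The prespecified collapsing rule in this example can be sketched in code. The following Python sketch is illustrative only: the function name `collapse_sparse_levels`, the threshold of 10 observations, the target level and the toy data are assumptions made here and are not taken from the paper, whose own reproducible code is written in R. The sketch inspects only the level frequencies of the predictor and never uses the outcome.

```python
from collections import Counter


def collapse_sparse_levels(values, min_count, merge_into):
    """Recode levels observed fewer than min_count times into merge_into.

    Only the predictor's own frequencies are inspected; the outcome
    variable is deliberately not an argument of this function.
    """
    counts = Counter(values)
    sparse = {
        level
        for level, n in counts.items()
        if n < min_count and level != merge_into
    }
    recoded = [merge_into if v in sparse else v for v in values]
    return recoded, sorted(sparse)


# Hypothetical four-level predictor; level "D" is sparse.
stage = ["A"] * 40 + ["B"] * 35 + ["C"] * 22 + ["D"] * 3

# The threshold and the target level would be fixed in the IDA plan
# before looking at the data (assumed values here).
recoded, collapsed = collapse_sparse_levels(stage, min_count=10, merge_into="C")

print("levels collapsed:", collapsed)
print("frequencies after collapsing:", dict(Counter(recoded)))
```

Because the rule and its threshold are written down before the data are inspected, the resulting change to the analysis plan can be reported as a planned consequence of IDA rather than as a post hoc decision.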

Since regression models can be used for a wide range of purposes, assumptions on the regression analysis set-up in this paper are listed in Table 1. IDA tasks will be explained in a well-defined, practically relevant setting typically encountered in biomedical research.

Table 1

Assumptions and scope for a general strategy of initial data analyses as prerequisite of regression analyses.
Aspect	Details
Purpose of analysis	Descriptive or predictive model to relate an outcome variable with a set of predictors
Type of analysis	Regression analysis with one outcome variable
Type of outcome variable	Continuous, binary or count (no survival-type, multivariate or longitudinal outcome variable is considered here)
Number of predictors	3–50; the number is assumed smaller than the number of effective observations (low-dimensional setting)
Data cleaning	Has been conducted; for example, we assume that the variables satisfy plausibility limits.
Analysis strategy	Most elements of the model building strategy have been defined: the type of regression model that should be used, the set of predictors, and the way model building (selection of variables, consideration of possible nonlinear effects of continuous predictors, coding of categorical variables, inclusion of interactions, missing data) should be handled.
Background knowledge	We assume that the analyst collaborates with a domain expert to discuss which variables could be important predictors, or could be pivotal to describe the study cohort, or which correlation between predictors can be assumed, for example because of known causal relationships or knowledge about the recruitment process. We assume that the analysis strategy incorporates this domain expertise.

Figure 1 about here

In this section, we provide guidance on how to develop an IDA plan focused on data screening to prepare regression analyses (see Table 2). We will first discuss some prerequisites, and then outline our approach with nine key elements and five possible extensions. Further extensions of this core set are possible and their consideration depends on the context of the research study. Finally, we list some possible consequences of IDA results for the analysis strategy, the interpretation of results and the presentation of the regression model for a research report.

3.1 Prerequisites

Research aim: The research aim should be clearly defined as either predictive (validating, updating or newly developing a prediction model) or descriptive (estimating one or several associations of an outcome variable with predictors) (Table 2, PRE1). For newly developed prediction models, is transparency of the model important? If the research aim is descriptive, it should be clarified which predictor-outcome associations in the target population are of central interest and should be estimated and described.

Analysis strategy: We assume that the analysis strategy has been specified in a preliminary statistical analysis plan (Table 2, PRE2). To plan IDA, knowledge about the set of predictors to be considered in a model, the outcome variable, and the analytical strategy to build the regression model are necessary. Any assumptions of the model should be stated.

Data dictionary and metadata: a detailed data dictionary should be available, which informs about the meaning of each variable in context of the research question, the units of measurement, the possible levels in case of categorical variables, admissible values or value ranges (Table 2, PRE3). More generally, metadata also refer to information about the research study protocol and data collection processes.

Domain expertise, predictor grouping, variables of interest, structuring (Table 2, PRE4):

Groups of predictor variables may be considered, for example according to their biological context as in the dataset described below, which may influence the analysis strategy and may also help to structure IDA.
Missing value mechanisms: if not already specified in metadata, domain experts should be consulted to explain possible reasons for the occurrence of missing values for each predictor.
Domain expertise may contribute to a priori knowledge about the shape of distributions for the predictors, and possible correlations. Knowledge about correlations could be summarized in a directed or undirected acyclic graph connecting the predictors with each other as previously suggested, which may help to assess observed associations [13].
Structural variables: in the context of IDA, we define structural variables as those that help to structure IDA results for a clear organization and essential overview of data properties. Structuring can be based on levels of measurement (centers), on calendar time of recruitment, on demographic variables such as sex or age, or on variables of central importance to the research questions. Often the association of predictors with structural variables such as centers or time is of interest, and multivariate distributions of predictors may be easier to understand if stratified by structural variables such as sex. Ideally, structural variables are completely observable for all individuals. If a research aim includes many structural variables, it may be necessary to prioritize some of them for the task of structuring IDA. There may also be studies where no such structural variables can be identified. Structural variables may or may not be included as predictors in the analysis strategy. IDA analyses may also be structured by first describing predictors that are deemed more important to predict the outcome, followed by less important predictors.

3.2 Key elements of an IDA plan for regression

Here we outline key elements of an IDA plan. As stated above, IDA aims at providing reliable and transparent knowledge about the data. The IDA plan is focused on data screening to prepare for regression analyses; it does not specify procedures or software packages to conduct the analyses. It can be understood as a minimum basic set of analyses that can be extended depending on the context of the research study and data collection. The key elements are listed in Table 2; they are centered around three IDA domains: missing values, univariate distributions, and the multivariate system of predictors.

In many studies, missing values are a central and dominant problem which needs to be addressed. We propose to start data screening with missing values, first evaluating various levels of unit missingness (Table 2, M1) as recommended by the STROBE statement [14]. Unit missingness refers to observational units that have missing data on all variables required for analysis. In observational studies, this means that parts of the target population are (randomly or not randomly) underrepresented in the study cohort. Unit missingness could result from a biased selection process that may distort results (Table 2, ME1).

The proportion of missing values (item missingness) should be computed for the outcome and separately for each predictor (Table 2, M2). This is then followed by evaluating the number of complete observations that are available for a regression model (Table 2, M3). Knowledge about patterns of missing values may be useful for the modeling team to decide on possible substitution of predictors with abundant missing values. If stratified by the structural variables, such patterns may give further information about missingness mechanisms; for example, if missingness of a predictor is associated with time of recruitment this may indicate that measurements of this predictor became available only during the course of a study (Table 2, M4).
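
Items M2 to M4 can be illustrated with a few lines of code. The sketch below is a minimal Python example under stated assumptions: the records, the variable names and the recruitment year used as structural variable are hypothetical toy data, and `None` is assumed to mark a missing value.

```python
from collections import Counter

# Hypothetical toy records; None marks a missing value.
records = [
    {"AGE": 61, "WBC": 9.1, "CRP": 12.0, "year": 2006},
    {"AGE": 47, "WBC": None, "CRP": None, "year": 2006},
    {"AGE": 70, "WBC": 14.2, "CRP": None, "year": 2007},
    {"AGE": 55, "WBC": 7.4, "CRP": 3.5, "year": 2008},
    {"AGE": 38, "WBC": 6.0, "CRP": 1.2, "year": 2008},
]
variables = ["AGE", "WBC", "CRP"]
n = len(records)

# M2: number and proportion of missing values per variable.
item_missing = {v: sum(r[v] is None for r in records) for v in variables}
item_missing_prop = {v: item_missing[v] / n for v in variables}

# M3: number of observations that are complete in all model variables.
n_complete = sum(all(r[v] is not None for v in variables) for r in records)

# M4: missingness patterns, coded as the tuple of variables that are missing.
patterns = Counter(
    tuple(v for v in variables if r[v] is None) for r in records
)

# M4, stratified: proportion of missing CRP values by recruitment year.
flags_by_year = {}
for r in records:
    flags_by_year.setdefault(r["year"], []).append(r["CRP"] is None)
crp_missing_by_year = {
    year: sum(flags) / len(flags) for year, flags in sorted(flags_by_year.items())
}

print(item_missing_prop)
print(n_complete)
print(patterns)
print(crp_missing_by_year)
```

In this toy example the proportion of missing CRP values changes with recruitment year, which is the kind of pattern that would prompt a question to the domain expert about when the measurement became available.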

Knowledge about the empirical univariate distributions of the outcome and the involved predictors is important for later modeling decisions, for presentation of modeling results, and also for interpreting a model correctly (Table 2, U1 & U2). Such analyses may also detect concentrations of data, e.g., a spike at zero or digit preferences depending on the relative frequency. This knowledge may guide the decision whether the functional form of a continuous predictor can be modelled flexibly, which would require a sufficient number of observations in the area where a nonlinear association with the outcome is expected, or should be better predefined, e.g., as linear. It may also guide a strategy to deal with influential data points, such as truncation or transformation of a predictor. For the purpose of presenting the modelling results, the univariate distribution of a predictor may guide the choice of appropriate units corresponding to the regression coefficients.
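
A numerical summary of a continuous variable along the lines of item U2 might look as follows. This is a sketch in Python using only the standard library; the function name `univariate_summary` and the toy values are assumptions, and the percentile definition (the "inclusive" method of `statistics.quantiles`) is one of several common conventions. The Gini mean difference is computed as the mean absolute difference between all pairs of observations, using an equivalent formula on the sorted values.

```python
import statistics
from collections import Counter


def univariate_summary(x, probs=(1, 5, 25, 50, 75, 90, 99), n_extreme=5):
    """Summarize one continuous variable; missing values (None) are dropped."""
    x = sorted(v for v in x if v is not None)
    n = len(x)
    # 99 cut points: cuts[p - 1] is the p-th percentile.
    cuts = statistics.quantiles(x, n=100, method="inclusive")
    mode_value, mode_count = Counter(x).most_common(1)[0]
    # Gini mean difference: sum over i < j of (x_j - x_i) equals
    # sum_i (2i - n - 1) * x_(i) for the sorted values (i is 1-based).
    pair_sum = sum((2 * i - n - 1) * v for i, v in enumerate(x, start=1))
    return {
        "n": n,
        "n_distinct": len(set(x)),
        "quantiles": {p: cuts[p - 1] for p in probs},
        "lowest": x[:n_extreme],
        "highest": x[-n_extreme:],
        "mean": statistics.fmean(x),
        "sd": statistics.stdev(x),
        "iqr": cuts[74] - cuts[24],
        "gini_mean_difference": 2 * pair_sum / (n * (n - 1)),
        "mode": (mode_value, mode_count),
    }


# Hypothetical right-skewed variable with a spike at zero.
toy = [0, 0, 0, 1, 2, 3, 4, 5, 8, 20]
summary = univariate_summary(toy)
for key, value in summary.items():
    print(key, value)
```

The mode and its frequency make a spike at zero visible, and the number of distinct values helps to judge whether a flexible functional form is realistic for the predictor.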

For multivariate description we suggest stratifying associations between predictors by structural variables in order to understand how the predictors reflect structural heterogeneity and to limit the number of descriptions to be produced at this analysis stage (Table 2, V1). Furthermore, we suggest examining associations between predictors (Table 2, V2). At this stage, it may be sufficient to use all pairwise complete observations to compute correlation coefficients between predictors. If nonlinear functional forms are considered for continuous predictors in a regression model, nonparametric (Spearman) correlation coefficients are a good choice because they are invariant to monotone transformations of the predictors. If the analysis strategy prespecified the consideration of some biologically plausible interactions, the association of the involved predictors should be given special attention in IDA, as high correlation between them may influence the ability to detect their interaction (Table 2, V3).
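
The computation of a Spearman correlation from pairwise complete observations (item V2) can be sketched as follows. The Python code below is a minimal illustration with hypothetical toy values; the helper names `average_ranks`, `pearson` and `spearman_pairwise` are assumptions made here. Ties receive average ranks, and the Spearman coefficient is obtained as the Pearson correlation of the ranks.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_pairwise(x, y):
    """Spearman correlation using only pairs where both values are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    xs, ys = zip(*pairs)
    return pearson(average_ranks(xs), average_ranks(ys)), len(pairs)


# Hypothetical toy values; None marks a missing measurement.
wbc = [4.0, 6.5, None, 9.0, 15.0, 30.0]
neu = [2.0, 4.0, 5.0, 6.5, None, 27.0]

rho, n_pairs = spearman_pairwise(wbc, neu)
print(rho, n_pairs)
```

Reporting the number of contributing pairs next to each coefficient is useful, because with pairwise deletion the coefficients in one correlation matrix can rest on different subsets of the data.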

Table 2

Check list for an initial data analysis (IDA) plan
Topic	Item	Features
Prerequisites
Research aim	PRE1	Define the research aim as predictive or descriptive; in case of a predictive research aim, are aspects of transparency or parsimony important?
Analysis strategy	PRE2	Check specification of models and roles of variables in the models
Data dictionary	PRE3	For variables identified in P1, and any additional structural variables, check variable labels, definitions, values, units of measurement, data type, etc.
Domain expertise	PRE4	When discussing the analysis strategy with a domain expert, the following aspects should be addressed: potential key predictors, structural variables for IDA, any other grouping of predictors, expected proportion of missing values, expected distributions of and correlations between predictors
IDA screening domain: Missing values (predictor and outcome variables)
Participant (unit) missingness	M1	Describe the numbers of participants who were potentially eligible but not assessed for eligibility, those who were assessed for eligibility but not recruited, and those who were recruited but did not contribute any data.
Variable (item) missingness	M2	Provide number and proportion of missing values for each predictor and for the outcome variable; distinguish by reason of missingness, if applicable.
Complete cases	M3	Describe number of complete observations when considering outcome and predictors for any candidate model described in P1.
Patterns	M4	Investigate patterns of missing values across all variables, either as tables or appropriately visualised. Can be structured by structural variables.
Missing values – Optional extensions
Predictors	ME1	Investigate predictors of missingness (complete vs incomplete cases).
IDA screening domain: Univariate descriptions (structural variables, predictors and outcome)
Categorical variables	U1	Summarize frequency and proportion for each category or with appropriate plots. If it is considered to collapse rare categories, summarize also frequencies of collapsed categories.
Continuous variables	U2	Inspect distributions with high-resolution histogram, summary of main quantiles (e.g. 1st, 5th, 25th, 50th, 75th, 90th, 99th) and extremes (e.g. 5 highest and 5 lowest values), further measures of location (e.g., the mean) and spread (e.g. Gini mean difference, standard deviation, interquartile range), number of distinct values. Describe the mode of the data and its frequency. Similarly, inspect distributions of transformed variables, if applicable.
Univariate analyses – Optional extensions
Sparsity	UE1	Create distributional plots to identify particular observations with extreme values
IDA screening domain: Multivariate descriptions (structural variables and predictors)
Association	V1	Visualize and summarize the association of each predictor with the structural variables
Correlation	V2	Quantify association (pairwise correlations) between all key predictors in a matrix or heatmap
Interactions, if applicable	V3	Evaluate bivariate distributions of the predictors specified in interactions. Include appropriate graphical displays.
Multivariate analyses – Optional extensions
Correlation	VE1	Compare matrices of Spearman and Pearson correlation coefficients
Clustering	VE2	Visualize clustering of predictors using a dendrogram to show closely associated predictors
Redundancy	VE3	Compute Variance Inflation Factors or fit parametric additive models to determine how well each predictor can be predicted from the remaining predictors

3.3 Further aspects and possible extensions

IDA domain extension: missing values

A framework for the treatment and analysis of missing values in observational studies (TARMOS framework) has been developed and described by STRATOS-TG1 [15]. Here we note two aspects of that framework that could be considered as parts of an IDA:

“A table of characteristics for the ‘complete’ versus ‘incomplete’ (or all) participants, or by whether variables with substantial missingness are observed.” [15] (Table 2, ME1)
“An assessment of the predictors of missingness, e.g. using a logistic regression model fitted to an indicator for being a complete record, and predictors of missing values i.e. associations with the incomplete variables.” to make inferences about potential mechanisms underlying missing data [15]. (Table 2, ME1)
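
The first of these two aspects, a table of characteristics for complete versus incomplete participants, can be sketched as follows. The Python code is a minimal illustration under stated assumptions: the records, variable names and the chosen summary measures are hypothetical, and the second aspect (a model for the completeness indicator) is not covered by this sketch.

```python
import statistics

# Hypothetical toy records; None marks a missing laboratory value.
records = [
    {"AGE": 61, "SEX": "f", "WBC": 9.1, "CRP": 12.0},
    {"AGE": 47, "SEX": "m", "WBC": None, "CRP": None},
    {"AGE": 70, "SEX": "m", "WBC": 14.2, "CRP": None},
    {"AGE": 55, "SEX": "f", "WBC": 7.4, "CRP": 3.5},
    {"AGE": 38, "SEX": "m", "WBC": 6.0, "CRP": 1.2},
    {"AGE": 82, "SEX": "f", "WBC": None, "CRP": 40.0},
]
lab_vars = ["WBC", "CRP"]


def is_complete(record):
    """True if all laboratory variables are observed for this record."""
    return all(record[v] is not None for v in lab_vars)


groups = {
    "complete": [r for r in records if is_complete(r)],
    "incomplete": [r for r in records if not is_complete(r)],
}

# Describe the structural variables age and sex within each group.
table = {
    name: {
        "n": len(rows),
        "median_age": statistics.median(r["AGE"] for r in rows),
        "prop_female": sum(r["SEX"] == "f" for r in rows) / len(rows),
    }
    for name, rows in groups.items()
}

for name, row in table.items():
    print(name, row)
```

Only structural variables and predictors enter this table; the outcome is left out, in line with the key principle of IDA stated earlier.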

IDA domain extension: univariate descriptions

Distributional plots may be useful to allow identification of areas with no or sparse data, or to identify extreme values that could have disproportional influence on regression results (Table 2, UE1).
If an unexpected distribution is identified in the ‘IDA domain: univariate distributions’, special attention should be given to evaluating bivariate distributions of predictors with this variable. For example, a skewed or multimodal univariate distribution for a predictor can result from a strong correlation of that predictor with other predictors. Multimodality in a distribution might also require further investigation, e.g., of possible measurement errors or digit preference. Of note, if a predictor exhibits a skewed distribution, it may be difficult to depict in bivariate scatterplots, and the axis may need transformation.

IDA domain extension: multivariate descriptions

Once decisions on how to handle missing values and how to include predictors in a model have been made, more data screening may follow which is then performed by the modeling team. Specifically, these aspects may complement the basic set of multivariate analyses when needed:

Any Spearman correlation matrix may be complemented by Pearson (linear) correlation coefficients. Large differences between Pearson and Spearman correlation coefficients for a pair of predictors may indicate unusual types of association or outliers in a bivariate data cloud (Table 2, VE1). In particular if there are many predictor variables, one could focus on the scatterplots of such pairs of predictors to investigate the data pattern.
Variable clustering: identifies clusters of predictors that are closely associated (also contained as dendrogram in the heat map) (Table 2, VE2). Such clusters may give rise to model simplifications.
Redundancy analysis: identifies if a predictor is (almost) entirely represented by a linear combination or generalized additive model of other predictors [16] (Table 2, VE3).
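
The redundancy check in item VE3 can be illustrated with variance inflation factors. For predictor j, the variance inflation factor is 1 / (1 - R²), where R² comes from the linear regression of predictor j on all remaining predictors. The Python sketch below uses only the standard library and covers linear redundancy only, not the additive-model variant mentioned above; the function names and the toy values are assumptions made here. Exactly collinear predictors would make the normal equations singular, which this sketch does not handle.

```python
def ols_r_squared(y, X):
    """R^2 of the least-squares fit of y on the columns of X plus an intercept."""
    n = len(y)
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Normal equations A b = c.
    A = [[sum(row[i] * row[j] for row in rows) for j in range(p)] for i in range(p)]
    c = [sum(rows[k][i] * y[k] for k in range(n)) for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for j in range(col, p):
                A[r][j] -= f * A[col][j]
            c[r] -= f * c[col]
    # Back substitution.
    b = [0.0] * p
    for i in reversed(range(p)):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, p))) / A[i][i]
    fitted = [sum(bi * xi for bi, xi in zip(b, row)) for row in rows]
    ybar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def variance_inflation_factors(columns):
    """VIF of each predictor, regressing it on all other predictors."""
    vifs = {}
    for target in columns:
        others = [name for name in columns if name != target]
        X = list(zip(*(columns[o] for o in others)))
        r2 = ols_r_squared(columns[target], X)
        vifs[target] = float("inf") if r2 >= 1 else 1 / (1 - r2)
    return vifs


# Hypothetical toy predictors: WBC is close to NEU + LYM by construction.
predictors = {
    "NEU": [3.0, 5.0, 8.0, 12.0, 4.0, 9.0, 6.0, 15.0],
    "LYM": [1.5, 2.0, 1.0, 0.8, 2.5, 1.2, 1.8, 0.6],
    "WBC": [4.6, 6.9, 9.05, 12.75, 6.6, 10.1, 7.8, 15.65],
    "AGE": [34, 71, 55, 62, 48, 80, 29, 66],
}
vifs = variance_inflation_factors(predictors)
for name, value in vifs.items():
    print(name, round(value, 1))
```

Only predictors enter these regressions, so the check stays within the scope of IDA. Very large values for a group of predictors indicate that one of them is nearly a linear combination of the others.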

In some regression problems, specific further analyses may become relevant:

With categorical predictors, correspondence analysis may be helpful to explore their associations graphically [17]
In case of a mix of continuous and categorical predictors (which is the rule rather than the exception), computing variance inflation factors for each design variable makes it possible to identify redundancies at the modelling level. Design variables are all variables that code predictors for the model. Categorical predictors are usually coded as several binary dummy variables. Purposeful dummy coding (e.g. concerning the choice of reference category, or using ordinal coding for ordinal predictors as proposed by [18]) later facilitates a meaningful interpretation of the regression coefficients. Continuous predictors with an assumed nonlinear association with the outcome are coded with basis variables (e.g. spline bases or fractional polynomials). As this aspect does not involve the outcome variable, it may be seen as IDA, but usually it will be part of the planned analyses.
Depending on the scale of measurement, association between predictors could be visualized by scatterplots (continuous by continuous), or dotplots of original values (continuous by categorical), or frequencies (categorical by categorical). Ideally, these graphical displays should show the association of each predictor with all structural covariates simultaneously (Table 2, V2). For example, to display associations of a predictor X with the assumed structural covariates age and sex one could show scatterplots of X by age for males and females in two panels, or superimpose the two scatterplots using different symbols for males and females. If among the structural covariates there are two continuous ones, one of them could be categorized, but that bears the risk of overlooking patterns in the association of the predictor with that structural covariate. With many predictors evaluated, this approach could catch chance findings which may be overinterpreted, so the level of detail must be in accordance with sample size.
In situations with ‘many’ predictors, a heat map visualising the clustering of observations on one axis and of variables on the other axis may be a useful summary of the correlation structure of the independent variables. The structural covariates could be highlighted in the heat map.
Multivariate analyses are particularly vulnerable to missing values, as only observational units which are complete in the considered predictor variables can be included. While imputation methods can be used to reconstruct missing data, we consider these model-based methods as part of modeling and not of IDA.

We emphasize again that these multivariate analyses, in particular visual analyses, should not incorporate the outcome variable.

To illustrate the IDA plan (Table 2) an example study in a regression context with publicly available data and R code will be shown. The corresponding IDA plan, detailed analyses, and materials can be found in Additional File 1 and at the accompanying website (see Availability of data and materials). In this section, we chose a subset from this comprehensive material to exemplify the key aspects.

4.1 Overview of the bacteremia study

We will exemplify our proposed systematic approach to data screening by means of a diagnostic study with the hypothetical primary aim of using age, sex and 49 laboratory variables to fit a diagnostic prediction model for the bacteremia status (= presence or absence of bacteria in the blood stream) of a blood sample. A hypothetical secondary aim of the study is to describe the functional form of each predictor in the model. Between January 2006 and December 2010, patients with clinically suspected bacteremia were included if blood culture analysis was requested by the responsible physician and blood was sampled for assessment of hematology and biochemistry. An analysis of this study can be found in Ratzinger et al. [19].

The data consist of 14,691 observations from different patients, 8% of whom had bacteremia, and 51 potential predictors. To protect data privacy, our version of the data was slightly modified compared to the original version, and this modified version was cleared by the Medical University of Vienna for public use (DC 2019-0054). Compared to the official results given in [19], our results may therefore differ to a negligible degree.

4.2 Bacteremia study: prerequisites for the IDA plan

4.2.1 Research aim (PRE1)

We assume that the aims of the study are to fit a diagnostic prediction model for bacteremia with 51 potential predictors collected in routine laboratory analyses of blood samples and to describe the functional form of each predictor.

4.2.2 Analysis strategy (PRE2)

These aims are addressed by fitting a logistic regression model with bacteremia status as the dependent variable. Prediction models for bacteremia that preceded the model of Ratzinger et al. [19] (see the citations therein) included the predictors age (AGE), leukocytes (WBC), blood urea nitrogen (BUN), creatinine (CREA), thrombocytes (PLT), and neutrophils (NEU). Hence, we consider these variables as key predictors with known strong associations with bacteremia. Upon consultation with a laboratory medicine specialist, some variables were considered to be of medium importance for predicting bacteremia: potassium (POTASS) (which is related to kidney function), and some acute-phase related parameters such as fibrinogen (FIB), C-reactive protein (CRP), aspartate transaminase (ASAT), alanine transaminase (ALAT), and gamma-glutamyl transpeptidase (GGT). All other potential predictors are probably of minor importance. Continuous predictors should be modelled by allowing for flexible functional forms. Since there is a large number of potential predictors, the flexibility of the functional form of a predictor (which determines the number of parameters in the model) will follow its assumed importance for predicting bacteremia. Hence, more flexibility will be allowed for key predictors (e.g., to represent at least cubic associations with the outcome) than for all other predictors (e.g., enabling the modeling of quadratic associations). The decision on whether to use only key predictors, or to consider also candidate predictors from the predictor sets of medium or minor importance, will be made based on results of data screening before uncovering the association of predictors with the outcome variable. In this example, an adequate strategy to cope with missing values will also be chosen after screening the data. Candidate strategies are omission of predictors with abundant missing values, complete case analysis, single value imputation or multiple imputation with chained equations, or a combination of those. In other types of studies, in particular longitudinal studies or studies with few candidate predictors, one might be forced or able to prespecify handling of missing data if sufficient prior knowledge about missingness patterns is available.

4.2.3 Data dictionary (PRE3)

The data dictionary of the bacteremia data set consists of columns for variable names, variable labels, scale of measurement (continuous or categorical), units, plausibility limits, and remarks (a simplified version is in Table 3). In the original data dictionary the variables are sorted by alphabetical order, but for Table 3 we sorted them by importance.

4.2.4 Domain expertise (PRE4)

The demographic variables age and sex are chosen as the structural variables in this analysis for illustration purposes, since they are commonly considered important for describing a cohort in health studies. Key predictors and predictors of medium importance are as defined above. Laboratory analyses always bear the risk of machine failures, and hence missing values are a frequent challenge. This may differ between laboratory variables, but no a priori estimate about the expected proportion of missing values can be assumed. As most predictors measure concentrations of chemical compounds or cell counts, skewed distributions are expected. Some predictors describe related types of cells or chemical compounds, and hence some correlation between them is to be expected. For example, leukocytes consist of five different types of blood cells (BASO, EOS, NEU, LYM and MONO), and the sum of the concentration of these types approximately (but not exactly) gives the leukocyte count, which is recorded in the variable WBC. Moreover, these variables are given as absolute counts and as percentages of the sum of the five variables, which creates some correlation. Some laboratory variables differ by sex and age, but the special selection of patients for this study (suspicion of bacteremia) may distort or alter the expected correlations with sex and age.

4.3 Bacteremia study: IDA plan

In the following, we exemplify an IDA plan for the bacteremia study which uses the template of Table 2. The plan is written in future tense as we assume it is created before looking into the data.

4.3.1 Participant missingness (M1)

As the data is exported from the registry of the laboratory, and only performed laboratory analyses are included, participant missingness cannot be evaluated.

4.3.2 Variable missingness (M2)

Numbers and proportions of missing values will be reported for each predictor separately (M2). Type of missingness has not been recorded.

4.3.3 Complete cases (M3)

The number of available complete cases (outcome and predictors) will be reported when considering:

outcome

outcome and structural variables,

outcome and key predictors only,

outcome and key predictors and predictors of medium importance,

outcome and all predictors.
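The complete-case counts for these nested variable sets can be sketched as follows. This is a minimal illustration in Python, not the R code of the accompanying website; the toy records, the outcome name `BC` and the assignment of variables to sets are hypothetical placeholders.

```python
# Hypothetical toy records; None marks a missing laboratory value.
records = [
    {"BC": 0, "AGE": 61, "SEX": 1, "WBC": 7.1, "CRP": 3.2, "GLU": None},
    {"BC": 1, "AGE": 74, "SEX": 2, "WBC": 15.4, "CRP": None, "GLU": 101.0},
    {"BC": 0, "AGE": 38, "SEX": 2, "WBC": None, "CRP": 0.4, "GLU": 88.0},
    {"BC": 0, "AGE": 55, "SEX": 1, "WBC": 9.8, "CRP": 12.5, "GLU": 93.0},
]

# Nested variable sets mirroring the list above (placeholder assignment).
outcome = ["BC"]
structural = ["AGE", "SEX"]
key = ["WBC"]
medium = ["CRP"]
minor = ["GLU"]

variable_sets = {
    "outcome": outcome,
    "outcome + structural": outcome + structural,
    "outcome + key": outcome + key,
    "outcome + key + medium": outcome + key + medium,
    "outcome + all": outcome + structural + key + medium + minor,
}


def count_complete(rows, variables):
    """Number of rows without a missing value in any of the given variables."""
    return sum(all(row[v] is not None for v in variables) for row in rows)


complete_cases = {
    name: count_complete(records, variables)
    for name, variables in variable_sets.items()
}
for name, n in complete_cases.items():
    print(f"{name}: {n} of {len(records)} ({n / len(records):.0%})")
```

Reporting the counts for nested sets side by side makes it visible at which step of extending the predictor set the loss of observations becomes substantial.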

4.3.4 Patterns of missing values (M4)

Patterns of missing values will be investigated by:

computing a table of complete cases (see 4.3.3) for strata defined by the structural variables age and sex,

constructing a dendrogram of missingness indicators to explore which predictors tend to be missing together.
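As a sketch of the input to such a dendrogram, pairwise discordance of missingness indicators could be computed as below. We assume here that discordance means the share of observations in which exactly one of two predictors is missing; the indicator values are hypothetical, and the hierarchical clustering step itself is not shown.

```python
from itertools import combinations

# Hypothetical missingness indicators (True = value missing), one list per predictor.
missing = {
    "AMY": [True, True, False, False, True, False],
    "LIP": [True, True, False, False, True, False],
    "GLU": [False, True, True, False, False, False],
}


def discordance(a, b):
    """Share of observations in which exactly one of the two predictors is missing."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Pairwise discordances; these could serve as distances for hierarchical clustering.
pairwise = {
    (p, q): discordance(missing[p], missing[q])
    for p, q in combinations(sorted(missing), 2)
}
for (p, q), d in sorted(pairwise.items(), key=lambda item: item[1]):
    print(f"{p} vs {q}: discordance {d:.2f}")
```

A discordance of zero means that two predictors are always missing together, so neither can help to impute the other.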

4.3.4 Univariate descriptions: Categorical variables (U1)

For sex and bacteremia status, the frequency and proportion of each category will be described.

4.3.5 Univariate descriptions: Continuous variables (U2)

For all continuous predictors, combo plots consisting of high-resolution histograms, boxplots and dotplots will be created. Because of the expected skewed distribution, combo plots will also be created for log-transformed predictors. As numerical summaries, minimum and maximum values, main quantiles (5th, 10th, 25th, 50th, 75th, 90th, 95th), the mean, the Gini mean difference, the number of distinct values, and the five lowest and five highest values will be reported using the R function Hmisc::describe.
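Some of these numerical summaries can be sketched in a few lines. The following Python illustration is not a reimplementation of Hmisc::describe; it omits the quantiles, the data are hypothetical, and the Gini mean difference is taken as the mean absolute difference over all pairs of observations.

```python
from itertools import combinations


def gini_mean_difference(values):
    """Mean absolute difference over all pairs of observations."""
    pairs = list(combinations(values, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def describe(values, k=5):
    """Selected numerical summaries of the non-missing values."""
    observed = sorted(v for v in values if v is not None)
    return {
        "n": len(observed),
        "missing": len(values) - len(observed),
        "distinct": len(set(observed)),
        "min": observed[0],
        "max": observed[-1],
        "mean": sum(observed) / len(observed),
        "gmd": gini_mean_difference(observed),
        "lowest": observed[:k],
        "highest": observed[-k:],
    }


crp = [0.4, 1.1, None, 3.2, 12.5, 0.4, 25.0, 7.9]  # hypothetical values
summary = describe(crp)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The pairwise loop is quadratic in the number of observations, so for a data set of the size considered here a sorting-based formula would be preferable.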

Graphical and parametric multivariate analyses of the predictor space such as cluster analyses or the computation of variance inflation factors can be heavily influenced by the distribution of the predictors. In order to make this set of analyses more robust to highly influential points or areas of the predictor support, some predictors may need transformation (e.g. cube root or logarithmic transformation). As possible transformations we will consider cube roots and logarithms of predictors. Since some predictors may have values at or close to 0, we will consider the pseudolog transformation instead of the log transformation [20]. The success of transformations to symmetrize predictor distributions will be assessed by evaluating each untransformed and transformed predictor's correlation with normal deviates. Additional File 1 Appendix A contains some further explanations on the pseudo-log transformation.
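The assessment described above may be sketched as follows. We assume one common definition of the pseudo-logarithm, asinh(x / (2 sigma)) / log(base), and we approximate the "correlation with normal deviates" by the correlation of the sorted values with normal quantiles at plotting positions (i + 0.5) / n. Both choices, the tuning value `sigma=0.1` and the data are assumptions for illustration and may differ from the definitions in Appendix A.

```python
import math
from statistics import NormalDist


def pseudo_log(x, sigma=1.0, base=10.0):
    """Pseudo-logarithm: near-linear around 0, logarithmic for large |x|."""
    return math.asinh(x / (2.0 * sigma)) / math.log(base)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def normal_score_correlation(values):
    """Correlation of the sorted values with corresponding normal quantiles."""
    n = len(values)
    scores = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return pearson(sorted(values), scores)


# Hypothetical right-skewed predictor including a value of exactly 0.
values = [0.0, 0.1, 0.2, 0.3, 0.5, 0.8, 1.2, 2.0, 3.5, 6.0, 12.0, 40.0]
transformed = [pseudo_log(v, sigma=0.1) for v in values]

gain = normal_score_correlation(transformed) - normal_score_correlation(values)
print(f"gain in correlation with normal scores: {gain:.3f}")
```

Unlike the logarithm, this transformation is defined at 0, which is why it can be applied to predictors with values at or close to 0.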

4.3.6 Multivariate descriptions: associations of predictors with structural variables (V1)

A scatterplot of each predictor against age, with separate panels for males and females, will be constructed. The associated Spearman correlation coefficients will be computed.

4.3.7 Multivariate descriptions: correlation analyses (V2)

A matrix of Spearman correlation coefficients will be computed.

4.3.8 Comparing nonparametric and parametric predictor correlation (VE1)

A matrix of Pearson correlation coefficients will be computed. Predictor pairs for which Spearman and Pearson correlation coefficients differ by more than 0.1 correlation units will be depicted in scatterplots.
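The screening rule can be sketched as follows, with Spearman correlation computed as the Pearson correlation of average ranks. The three predictors `A`, `B` and `C` are hypothetical; only the threshold of 0.1 is taken from the plan.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
predictors = {
    "A": a,
    "B": [math.exp(v) for v in a],  # monotone in A but strongly nonlinear
    "C": [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.2, 15.9],  # roughly linear in A
}

# Flag pairs whose two correlation coefficients differ by more than 0.1.
flagged = [
    (p, q)
    for p, q in combinations(sorted(predictors), 2)
    if abs(spearman(predictors[p], predictors[q])
           - pearson(predictors[p], predictors[q])) > 0.1
]
print(flagged)
```

A large difference between the two coefficients typically points to a monotone but nonlinear association or to a few influential points, which the scatterplots should then clarify.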

4.3.9 Variable clustering (VE2)

A variable clustering analysis will be performed to evaluate which predictors are closely associated. A dendrogram groups predictors by their correlation. Scatterplots of pairs of predictors with Spearman correlation coefficients greater than 0.8 will be created.

4.3.10 Redundancy (VE3)

Variance inflation factors will be computed between the candidate predictors. This will be done for the three possible candidate models, and using all complete cases in the respective candidate predictor sets. Redundancy will further be explored by computing parametric additive models for each predictor in the first two candidate models using the Hmisc::redun function.

4.4 Bacteremia study: results of IDA

The full results of IDA according to the IDA plan are available in the Supplementary File. Moreover, our accompanying website https://stratosida.github.io/regression-regrets/ also provides the R code for full reproducibility. The main findings of IDA can be understood from the selected results described below.

4.4.1 IDA domain: missing values (M)

There is no instance of unit missingness in the sample dataset. Outcome variable and structural variables are completely observed. An analysis with only the key predictors hardly suffers from missing values (94% complete cases). If the predictor set is extended to include those of medium importance, the proportion of complete cases decreases to only 63.9%, which may indicate that for this predictor set, complete case analysis is perhaps no longer acceptable. Extending to all predictors, only 27% of the observations would be complete. Individual predictors are missing with proportions of 48% and less. Only seven predictors have missingness proportions of more than 20%, and ten predictors between 10% and 20%. The remaining 32 predictors have smaller missingness proportions. Age and sex, the structural variables, are never missing. Completeness of predictors does not vary between groups defined by the structural variables.

We also investigated the concordance of missingness between predictors (Fig. 2). GLU, PAMY and HS show very individual missingness patterns with more than 20% discordance with the patterns of any other predictors. Some groups of predictors have much lower discordance than missingness proportions, which points towards very similar missingness patterns. For example, AMY and LIP; TRIG and CHOL; or FIB, NT and APTT are such groups.

Figure 2 about here

4.4.2 IDA domain: univariate distributions (U)

Many of the predictors measure concentrations of chemical compounds in the blood or represent cell counts. These predictors typically exhibit skewed distributions.

For 15 predictors a pseudolog transformation increased the correlation with normal deviates by more than 0.2 correlation units compared to not transforming the predictor. For these predictors, original and transformed distributions have been compared (cf Fig. 3 for four examples), and in scatterplots (IDA domain: multivariate analyses) the transformed values will be used. We also evaluated cube root transformations but that transformation reduced skewness only very moderately.

Figure 3 about here

We also investigated if there were spikes at specific values, by listing the five most frequent values of each predictor. For example, for basophiles and eosinophiles the most frequent value was 0 and occurred with a frequency (proportion) of 12,671 (87%) and 6,994 (48%), and corresponding concentration ratios (ratios of highest frequency and average frequency) were 15.7 and 17.3, respectively.
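The concentration ratio can be sketched as follows. We assume that "average frequency" means the number of non-missing observations divided by the number of distinct values; the data are hypothetical.

```python
from collections import Counter


def concentration_ratio(values):
    """Ratio of the highest value frequency to the average value frequency."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    highest = max(counts.values())
    average = len(observed) / len(counts)
    return highest / average


baso = [0.0] * 6 + [0.1, 0.2, None]  # hypothetical counts with a spike at 0
most_frequent_value, frequency = Counter(
    v for v in baso if v is not None
).most_common(1)[0]
ratio = concentration_ratio(baso)
print(most_frequent_value, frequency, round(ratio, 2))
```

A ratio of 1 means that all distinct values occur equally often, whereas large ratios indicate a spike.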

4.4.3 IDA domain: multivariate descriptions (V)

Absolute Spearman correlation coefficients of predictors with AGE, stratified by SEX, were generally below 0.3. Only few predictors had Spearman correlation coefficients with AGE between 0.2 and 0.3.

Some Spearman correlation coefficients between pairs of predictors were greater than 0.8, e.g. between WBC and NEU; between EOS and EOSR; BASO and BASOR; RBC, HGB and HCT; MPV and PDW; and LYMR and NEUR. These pairs were investigated by scatterplots. The high correlations can be explained by domain expertise.

For 23 pairs of predictors Spearman and Pearson correlation coefficients differed by more than 0.1. Scatterplots of these pairs revealed nonstandard patterns of association between the predictors of these pairs.

Among the key predictors and the predictors of medium importance, WBC and NEU exhibited the highest degree of redundancy with variance inflation factors above 7 (key predictor set) or even above 14 (key and medium importance predictors). Including all remaining predictors, many predictors became almost exactly redundant. Variance inflation factors increased when considering parametric additive models instead of linear models.

Table 3

Simplified data dictionary of the bacteremia study. Key predictors and structural variables are in boldface, predictors of medium importance in italic.
Variable	Label	Scale	Units	Variable	Label	Scale	Units
ID	Patient Identification	nom.	1-14691	GBIL	Bilirubin	cont.	mg/dl
BloodCulture	Blood culture result for bacteremia	nom.	no, yes	GLU	Glucoses	cont.	mg/dl
AGE	Patient Age	cont.	years	HCT	Haematocrit	cont.	%
BUN	Blood urea nitrogen	cont.	mg/dl	HGB	Haemoglobin	cont.	G/L
CREA	Creatinine	cont.	mg/dl	HS	Uric acid	cont.	mg/dl
NEU	Neutrophiles	cont.	G/L	LDH	Lactate dehydrogenase	cont.	U/L
PLT	Blood platelets	cont.	G/L	LIP	Lipases	cont.	U/L
SEX	Patient sex	nom.	1 = male, 2 = female	LYM	Lymphocytes	cont.	G/L
WBC	White blood count	cont.	G/L	LYMR	Lymphocyte ratio	cont.	% (mg/dl)
ALAT	Alanin transaminase	cont.	U/L	MCH	Mean corpuscular hemoglobin	cont.	fl
ASAT	Aspartate transaminase	cont.	U/L	MCHC	Mean corpuscular hemoglobin concentration	cont.	g/dl
CRP	C-reactive protein	cont.	mg/dl	MCV	Mean corpuscular volume	cont.	pg
GGT	Gamma-glutamyl transpeptidase	cont.	G/L	MG	Magnesium	cont.	mmol/L
FIB	Fibrinogen	cont.	mg/dl	MONO	Monocytes	cont.	G/L
POTASS	Potassium	cont.	mmol/L	MONOR	Monocyte ratio	cont.	%
ALB	Albumin	cont.	G/L	MPV	Mean platelet volume	cont.	fl
AMY	Amylase	cont.	U/L	SODIUM	Sodium	cont.	mmol/L
AP	Alkaline phosphatase	cont.	U/L	NEUR	Neutrophile ratio	cont.	%
APTT	Activated partial thromboplastin time	cont.	sec	NT	Normotest	cont.	%
BASO	Basophiles	cont.	G/L	PAMY	Pancreas amylase	cont.	U/L
BASOR	Basophile ratio	cont.	%	PDW	Platelet distribution width	cont.	%
CA	Calcium	cont.	mmol/L	PHOS	Phosphate	cont.	mmol/L
CHE	Cholinesterase	cont.	kU/L	RBC	Red blood count	cont.	T/L
CHOL	Cholesterol	cont.	mg/dl	RDW	Red blood cell distribution width	cont.	%
CK	Creatinine kinases	cont.	U/L	TP	Total protein	cont.	G/L
EOS	Eosinophils	cont.	G/L	TRIG	Triglycerides	cont.	mg/dl
EOSR	Eosinophil ratio	cont.	%

Table 3 about here

In this section, we describe how specific IDA results may be used in the regression analysis to follow. The possible impact of IDA has three aspects: it may induce refinements of the intended statistical analysis and help define the statistical analysis plan, help avoid misinterpretation of the results of the regression analysis, and provide useful background knowledge to decide on how to present the results of the regression analysis.

5.1 Bacteremia study: refinements of the analysis strategy triggered by IDA results

Revisions of the analysis strategy based on the results of IDA are justified only if no predictor-outcome associations were evaluated during IDA. An update of the analysis strategy could encompass a refinement of the model specifications, additional analyses, such as sensitivity analyses, or possibly a change to the intended analysis methods. In our examples, we show that large proportions and specific patterns of missing values, skewed distributions of predictors, and a high degree of redundancy between predictors may suggest that the plan should be updated. Furthermore, IDA findings might lead to planning of sensitivity analyses before formal statistical analyses commence.

In this example, the number of possible candidate predictors is relatively large if the final model should be 'parsimonious' and explainable. Here IDA can provide the necessary information to guide decisions which predictors to focus on during model building. This is often an iterative process that depends on observed features of the data, as the correlation and missingness structure of the predictor set changes if predictors are removed. We consider IDA as the first iteration of this process. Further steps will then be carried out by the modeling team.

Predictors with univariate distributions that are particularly narrow, or, in case of categorical variables, that are extremely unbalanced may contribute only very little to predictive performance because most of the subjects are similar. The chance that such a variable is a strong predictor of the outcome is very low.

Given their histograms, basophiles (and the associated basophiles ratio), and probably eosinophiles (and the eosinophiles ratio) could be candidates for discarding because of their excessive spike at zero, which cannot be removed by any transformation. Given that the aim of the analysis is to predict bacteremia accurately, these variables are unlikely to contribute to overall predictive accuracy.

Predictors are also expected to add only little or no predictive performance if they are redundant to other predictors in the model. However, sometimes a reparametrization of the predictor space may remove the redundancy and enhance interpretability of the model. Analysis of the predictors' multivariate distributions revealed that two predictors rated a priori as important for predicting bacteremia exhibited a very high correlation: leukocytes (WBC) and neutrophiles (NEU). This correlation stems from the fact that neutrophiles are the biggest subtype of leukocytes. As a consequence of the correlation and the background information, one could replace WBC with a new variable WBC_NONEU = WBC – NEU. Using NEU and WBC_NONEU retains all the information of the two predictors for the model and keeps their regression coefficients interpretable but removes the high correlation (see also [21]).
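The effect of this reparametrization can be illustrated with simulated data. In the sketch below, neutrophil counts and the counts of all other leukocytes are drawn independently from hypothetical lognormal distributions, an assumption that real data need not satisfy. For a model with only two predictors, the variance inflation factor equals 1 / (1 - r^2), where r is their Pearson correlation.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """Variance inflation factor in a model containing only x and y."""
    r = pearson(x, y)
    return 1.0 / (1.0 - r * r)


rng = random.Random(1)
# Hypothetical cell counts: neutrophils and all other leukocytes, independent.
neu = [rng.lognormvariate(1.6, 0.6) for _ in range(2000)]
other = [rng.lognormvariate(0.6, 0.4) for _ in range(2000)]
wbc = [n + o for n, o in zip(neu, other)]  # leukocytes contain the neutrophils

# Reparametrization: leukocytes other than neutrophils.
wbc_noneu = [w - n for w, n in zip(wbc, neu)]

vif_before = vif_two_predictors(wbc, neu)
vif_after = vif_two_predictors(wbc_noneu, neu)
print(f"VIF with WBC and NEU:       {vif_before:.1f}")
print(f"VIF with WBC_NONEU and NEU: {vif_after:.2f}")
```

In real data WBC_NONEU and NEU will usually remain somewhat correlated, so the reduction in redundancy is expected to be less complete than in this simulation.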

From our IDA, we would probably conclude that among the leukocyte-related predictors, only the computed WBC_NONEU (leukocytes minus neutrophiles) and NEU should be retained in an analysis with an extended candidate set, probably also MONO and LYM. The corresponding 5 'ratio' variables, basophiles and eosinophiles may not be needed for modeling as they are largely collinear with their absolute counterparts.

Likewise, one may also consider removing those predictors that exhibit large proportions of missing values. Assume that a predictor with many missing values is closely associated with other predictors. In this case, it may be well imputable but is not likely to add predictive value on top of those other predictors. If the predictor in question is not associated with other predictors, it may not be well imputable and hence its imputation will introduce noise into the analysis. Hence, removing such variables may be indicated. It is not so clear, unfortunately, when to disregard a predictor because of its proportion of missing values. The threshold value may depend on how many predictors are affected by missing values, how much they are affected, and what the missingness pattern looks like. In our example, PAMY, TRIG and CHOL, all with more than one third of their values missing, may be the most obvious candidates for omission. Moreover, the following predictors all have missingness proportions of more than 20%: GLU, AMY, LIP, and HS. According to the missingness pattern dendrogram, PAMY, TRIG, CHOL, AMY and LIP also have highly correlated missingness, suggesting that they do not serve each other in imputation models (Fig. 2).

Hence, 14 predictors could be excluded from model building without having to expect reduced predictive accuracy. This reduces the dimensionality of the predictor space from 51 to 37 without unblinding the association of the predictors with the outcome. The modeling team will recompute the numbers of observations with complete recordings for all remaining predictors.

High-resolution histograms may also guide the modeling team in the choice of an appropriate handling of nonlinearity. Generally, many different modeling strategies to handle nonlinear associations of predictors with the outcome are available, e.g. restricted cubic splines, penalized splines or fractional polynomials [3]. For example, restricted cubic splines provide a linear fit outside of the boundary knots, and one may want to adjust default knot positions in case of very skewed distributions. In our IDA, we already showed histograms based on pseudolog transformations of some predictors. These transformations were necessary in scatterplots to enhance their interpretability, but whether to use transformed predictors in the outcome models may be debatable. Royston and Sauerbrei [22] discussed other ways of pretransforming predictors to increase robustness of models, in particular at the tails of the predictor support. Probably one should not use each of the IDA domains alone to identify predictors that should be removed from model building. For example, a moderate proportion of missingness together with a moderate degree of redundancy of a predictor may also justify its exclusion.

If interactions of predictors have been pre-specified, IDA may evaluate the joint distribution of these predictors. Strong association of the predictors involved in an interaction may make the inclusion of their interaction unnecessary as it would come with great estimation uncertainty (cf. [23], p. 301). For example, we included scatterplots of all predictors with age, stratified by sex, in our IDA. Among the key predictors, BUN had the highest correlation with age with correlation coefficients of 0.487 (males) and 0.386 (females), while there was hardly any correlation of age and PLT. Hence, interaction terms involving age and PLT can be more precisely estimated than interaction terms involving age and BUN.

Sensitivity analyses, which are in general not part of IDA, are a tool to evaluate the robustness of estimates to decisions in model building, for example choices of different methods, impact of variable selections, or impact of strategies to handle missingness or influential points. Sensitivity analyses should be pre-specified, and IDA may suggest that certain sensitivity analyses are necessary to back up the modeling results. Regarding missing values in the bacteremia study, one could perform such a sensitivity analysis by not imputing any predictors of minor importance but just omitting them from the model. One may also consider transforming some predictors with particularly skewed distributions before modeling. While this may lead to differences in the interpretation of the associated regression coefficient (if a linear functional form is chosen for such a predictor), one could evaluate if it also affects prediction performance or the values of the standard errors of the other covariates' regression coefficients (see Additional File 1 Appendix B.2 for an example). Regarding the strong correlation between WBC and NEU, one could define such a sensitivity analysis by removing either WBC or NEU from the model and evaluating if this has a relevant effect on the model's performance.

In all three cases, such sensitivity analyses are consequences of IDA but they are still predefined in the sense that they are planned before uncovering the association of the outcome with the predictors [14, 18, 24].

By contrast, for example, sensitivity analyses that result from observing an unexpected pattern in the residuals of a model (e.g. if residuals show a clear nonlinear association with a predictor) must be seen as post-hoc analyses. Modifying the model because of such an unplanned sensitivity analysis increases the risk of overfitting the model. Nevertheless, it should be done and reported as a post-hoc analysis.

5.2 Bacteremia study: how IDA may guide the interpretation of modeling results

The results of the regression model consist of the estimated regression coefficients, their covariance matrix and in particular their standard errors; they may include predictions for selected predictor patterns and will also comprise measures of model performance.

Skewed distributions. Skewed distributions of predictors may have consequences on the precision and the robustness of these results, and knowledge about the distributional shapes of the predictors is essential for interpretation. As revealed by our IDA, some of the predictors exhibited highly skewed distributions. For these predictors, the estimation of the nonlinear functional forms may suffer from disproportional impact of some observations, and estimation uncertainty will be reflected by wide confidence intervals. Impact of highly influential points may be reduced by pretransforming the predictors to more symmetric distributions, which however may change their interpretation if finally a linear functional form is chosen. Alternatively, the values could be winsorized before modeling as previously suggested [17, 22]. In addition, extreme values should be assessed for implausibility and, if classified as such, potentially removed. In general, there are numerous ways to make analyses robust against such influential points, including transformation, robust regression or the estimation of robust variances [22, 25–26].

Transformation of predictors. If a predictor has been transformed, regression coefficients are given for units of the transformed predictor. In case of the pseudo-log transformation using a base of 10 that was suggested for BUN, they would correspond to the difference in outcome expected for a tenfold increase in the original predictor. This correspondence is only approximate as a pseudo-log transformation was used. See Additional File 1 Appendix B.2 for analyses of the bacteremia study with and without preceding pseudo-log transformation of predictors. If pseudo-log transformations are used for WBC and NEU in modeling the data, a unit of the pseudo-logarithm would correspond roughly to a tenfold increase of the original WBC or NEU. The range of the pseudo-logged values is about 1.5; thus a unit difference covers almost the entire range of the data and comparably large regression coefficients have to be expected. See Additional File 1 Appendix B.3 for an illustration.

Validity of predictions. IDA allows identifying the support of a model, i.e., the ranges of values of the predictors from which the model was derived and to which it should be applicable. Predictions for observations from areas with higher joint density of predictors come with more confidence (narrower confidence intervals), while predictions with smaller support are less precise. The joint distribution also helps to understand in which cases predictions would actually be extrapolations. For example, in Fig. 3 the density of data points in any of the age-sex-groups is very low for values of the pseudolog of WBC (t_WBC) greater than 1.5. The support is also essential to understand measures of model performance. Usually, the wider the support of a model, the more variance in the outcome can potentially be explained, and hence measures like the area under the ROC curve or the R-squared also tend to be greater. See Additional File 1 Appendix B.4 for an illustration.

Missing data handling. While a method to handle missing data is usually prespecified, IDA can give some information to support this decision or put it into question. If multiple imputation was prespecified, it has to be expected that the regression coefficients of predictors with higher proportion of missing values will generally have larger standard errors than those with fewer missing values, relative to comparing these quantities after complete case analysis. Consequently, if multiple imputation is combined with data-driven model selection approaches (such as backward elimination), such predictors are also less likely to be selected than more complete predictors, given they have approximately equal association with the outcome. Hence the decision whether to apply multiple imputation vs. using complete case analysis may impact the structure of the selected model. See also Additional File 1 Appendix B.5 for an illustration.

Interpretation of nonlinear functional forms. For predictors for which a nonlinear functional relationship with the outcome is assumed, the partial response function (predicted values vs. predictor) will usually be evaluated graphically. Areas in which this response function has a wide confidence interval correspond to low support in observed predictor values, and such a low support may preclude the precise estimation of a nonlinear functional form. In Supplemental Appendix B.1 we used a simplified fractional polynomial model for bacteremia status to illustrate the interplay between decisions to apply transformation to predictors before model building and their consequences on the estimated functional forms. In Additional File 1 Appendix B.6 we show an example where a nonlinear effect of a predictor was identified, but in the most relevant subrange of the predictor where the data is dense, the estimated nonlinear functional form agreed well with a straight line.

Predictor selection or reparameterization of predictors. If two correlated predictors are considered for a model (like WBC and NEU), interpretation may be difficult if the correlation results from the definition of the predictors. In the example with WBC and NEU, WBC cannot stay constant while varying NEU because neutrophiles are a component of leukocytes. Above we suggested replacing WBC by WBC_NONEU; then WBC_NONEU and NEU can vary independently, ensuring interpretability of regression coefficients.

5.3 How IDA may guide the presentation of results

While in this paper we intentionally do not present the actual modeling results for the bacteremia study, we give some general remarks on how IDA may guide the presentation of such results.

Transformations of predictors included in prediction models should be appropriately documented and reported. For continuous predictors, IDA suggests appropriate unit increments of predictors to which regression coefficients or derived quantities such as odds or hazard ratios should correspond (e.g. 1 year, 5 year or 10 year increments of age). Numerous examples from the medical literature demonstrate that this is often ignored, and one can find reports of regression coefficients of a continuous predictor and confidence limits that are all close to parity. For example, Ma et al [27] report an adjusted risk ratio of CRP (95% confidence interval) of 0.982 (0.973, 0.991) with a p-value < 0.001 for predicting survival of persons admitted to a hospital with COVID-19. When considering the reported interquartile ranges (7.52 to 37.93 mg/L for survivors, and 35.52 to 148.31 mg/L for non-survivors), it becomes apparent that a unit difference in CRP in this study cohort is probably not an appropriate choice for presenting the model, if interpretability is a goal.
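The consequence of the chosen increment can be made explicit by rescaling. The following sketch assumes that the reported ratio stems from a model that is linear in CRP on the log-ratio scale, so that the ratio for an increment of k units is the per-unit ratio raised to the power k; the increments of 10 and 50 units are chosen for illustration only.

```python
def rescale_ratio(ratio_per_unit, increment):
    """Ratio for a difference of `increment` units, assuming log-linearity."""
    return ratio_per_unit ** increment


# Estimate and 95% confidence limits per unit of CRP as quoted in the text.
estimate, lower, upper = 0.982, 0.973, 0.991

for increment in (10, 50):  # illustrative increments
    est, lo, hi = (rescale_ratio(v, increment) for v in (estimate, lower, upper))
    print(f"per {increment} units: {est:.3f} ({lo:.3f}, {hi:.3f})")

per_10 = tuple(rescale_ratio(v, 10) for v in (estimate, lower, upper))
```

Rescaling does not change the model or the p-value; it only expresses the same association for a difference in CRP that is meaningful given the observed distribution.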

Royston and Sauerbrei [17, p. 54f] discuss choosing an appropriate reference category for categorical predictors in a regression model. While there may be background knowledge to support the choice of a specific category as the reference, IDA may be used to ensure that the 'sample size (of the referent category) should not be too small' to avoid inflation of standard errors for all comparisons to the reference [17, p.55].

In our example, one could be interested in presenting partial dependence of predictions on predictors by displaying the estimated response function of predictors. IDA guides the choice of an appropriate range for the x-axis, which will be either the range of the predictor or a bit less than the range depending on the data sparsity in the tails of its distribution. One could also use a scaling of the x-axis that corresponds to the transformation that was deemed appropriate to symmetrize the distribution of the predictor. See Additional File 1 Appendix B.2 for illustrations.

All changes to the prespecified analysis and reporting strategy induced by IDA must be transparently reported in a statistical methods summary for the statistical report. In this example we did not suggest specific changes, but only illustrated which aspects of an analysis plan could be further refined or put into question once the IDA results are available. For each of these refinements, usually many options are possible and specific choices may depend on the preferences and experience of the analysis team.

In this application, with 51 potential predictors to choose from, there is a lot of emphasis on how to use IDA to guide model building by disregarding predictors in the analysis. This is of course very specific to this example, and IDA is not always related to this aspect. In other studies this aspect may receive less emphasis, and the importance of IDA should not be seen as mostly related to it.

In this paper we proposed a set of elements of initial data analysis that may help a data analyst in designing an IDA plan in conjunction with the statistical analysis plan. While this is for studies in which a descriptive or predictive research question is addressed with a statistical regression model, it can be adapted to other studies. In this context, IDA has the purpose to inform the data analyst and the domain expert about key properties of the data, without exploring the predictor-outcome association. The IDA findings are essential to empirically support the choice of the original analysis strategy or to suggest revisions. They are also key to correct interpretation of the analysis results. An IDA plan should balance an exhaustive investigation of the dataset with utility. It should have sufficient details to detect features of a data set that could affect the analysis by regression models, or the interpretation or presentation of results. It should cover necessary steps informed by the research aims and pre-specified analysis strategy in a systematic approach, to avoid missing items or overlooking important findings in lengthy template reports. While the IDA framework comprises six elements, here we devised a strategy to develop elements of an IDA plan from data screening onwards, balancing utility and parsimony. The strategy could be seen as a recommended minimum set of analyses that an IDA plan should contain in order to prepare for regression analysis. It is meant as a starting point from which an analyst may design a specific IDA plan by tailoring the items to their study or adding further aspects. Simplifications may be appropriate, in particular in the IDA domain of multivariate descriptions, for example if only few predictors are considered.

We included the outcome variable in univariate evaluations, but intentionally excluded it from any bivariate or multivariate analysis, as IDA shall not anticipate the main analysis. This principle distinguishes IDA from exploratory data analysis (EDA), in which associations in the data are explored and new hypotheses can be generated. To protect analysts against arriving at wrong conclusions from prematurely evaluating the association of the outcome with the predictors while performing IDA, one could generate an outcome-blinded 'IDA data set' from the main analysis set. In such a 'blinded' IDA data set, the outcome variable is detached from the predictors and permuted relative to them, such that any associations of the outcome with predictor variables are destroyed and any apparent associations are meaningless, while associations between the predictors are retained. The conclusions from our IDA example analyses would have been unchanged had they been conducted on such an IDA data set.
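The construction of an outcome-blinded IDA data set can be illustrated with a minimal sketch. It assumes that the analysis set is held as a list of records and uses Python purely for illustration; the function name `make_blinded_ida_data`, the column names and the toy values are hypothetical and are not taken from the bacteremia study or its published code.

```python
import random


def make_blinded_ida_data(rows, outcome_key, seed=0):
    """Return a copy of `rows` in which the outcome column is randomly
    permuted relative to the predictors (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed keeps the blinding reproducible
    outcomes = [row[outcome_key] for row in rows]
    rng.shuffle(outcomes)  # permute outcome values only
    blinded = []
    for row, permuted_outcome in zip(rows, outcomes):
        new_row = dict(row)  # copy, so the main analysis set stays intact
        new_row[outcome_key] = permuted_outcome
        blinded.append(new_row)
    return blinded


# Invented toy records, for illustration only
analysis_set = [
    {"age": 34, "wbc": 7.1, "bacteremia": 0},
    {"age": 71, "wbc": 15.3, "bacteremia": 1},
    {"age": 58, "wbc": 9.8, "bacteremia": 0},
    {"age": 45, "wbc": 12.0, "bacteremia": 1},
    {"age": 66, "wbc": 6.4, "bacteremia": 0},
    {"age": 29, "wbc": 8.2, "bacteremia": 0},
]

ida_set = make_blinded_ida_data(analysis_set, "bacteremia", seed=2023)
```

In this sketch the predictor values remain in their original rows, so associations between predictors are retained, and the univariate distribution of the outcome is unchanged because its values are only reordered. Any association between the permuted outcome and the predictors arises by chance alone.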

The IDA plan, and how potential IDA findings may guide decisions in the main statistical analyses, e.g. how to handle missing data or select predictors for a model, should be prespecified. IDA findings may suggest changes in the intended analyses that were not foreseeable, such as transformation of predictors, a refinement of the statistical model, or additional sensitivity analyses that are then pre-planned. Hence, an IDA plan enhances the statistical analysis plan, and relevant IDA methods should be incorporated in the methods section of a research report. The specific IDA findings that lead to such changes, or IDA findings that help interpret the model results, should be explicitly reported as results or in the discussion [24]. Transparent reporting of the planned and actually conducted analyses, as well as of the reasons for changes in an analysis plan, is essential for ensuring reproducibility and repeatability. Adequate reporting of research has been under discussion [14, 28] and we suggested reporting strategies of IDA for research papers [29]. IDA augments knowledge about a data set, and transparency in reporting will aid in making data accessible and reusable according to the FAIR principles [30]. Of note, a range of R packages may facilitate the conduct of several aspects of IDA as well as data quality assessments [31, 32].

In this paper we focused on a predictive research question, but our recommendations may also guide the planning of IDA for descriptive or explanatory research questions, including the estimation of an adjusted exposure-outcome association, and of models that estimate causal effects. They may also help in designing a systematic approach to data screening for clinical trials, in particular if covariate adjustment is used, and may then be applied before treatment allocation is unblinded. We also expect that our recommendations may be useful for researchers fitting models with modern algorithmic approaches.

In summary, we provide practical recommendations for an IDA plan and for how to carefully examine data properties to improve analyses and the reproducibility of results. Our hope is that this empowers researchers to follow a systematic strategy for IDA.

Supplementary information

The online version contains supplementary materials available at ...

Additional file 1. Word document (.docx). Regression without Regrets.

Acknowledgements
We thank Sebastian Hödlmoser for supporting us with analyses. This work was developed as part of the international initiative of Strengthening Analytical Thinking for Observational Studies (STRATOS), and an earlier version of this manuscript was approved by the Publication and Visualisation Panels of the STRATOS initiative. The objective of STRATOS is to provide accessible and accurate guidance in the design and analysis of observational studies (http://stratos-initiative.org/). Members of the Topic Group “Initial Data Analysis” of the STRATOS Initiative are Mark Baillie (Switzerland), Marianne Huebner (USA), Saskia le Cessie (Netherlands), Lara Lusa (Slovenia), Carsten O. Schmidt (Germany). Members of the Topic Group “Selection of Variables and Functional Forms in Multivariable Analyses” are Michal Abrahamowicz (Canada), Heiko Becher (Germany), Harald Binder (Germany), Daniela Dunkler (Austria), Frank Harrell (USA), Georg Heinze (Austria), Marc Henrion (Malawi), Nadja Klein (Germany), Aris Perperoglou (UK), Geraldine Rauch (Germany), Patrick Royston (UK), Willi Sauerbrei (Germany), Matthias Schmid (Germany), Christine Schilhart-Wallisch (Austria).

Authors' contributions

G.H., M.B., W.S. and M.H. conceived the idea and conceptualized the study. G.H., M.B. and M.H. wrote the main manuscript text. M.B. and G.H. performed the analysis and prepared all tables and figures. M.B. and G.H. prepared the code. G.H., M.B., L.L., F.E.H., W.S., C.O.S. and M.H. reviewed the manuscript, and read and approved the final manuscript.

Funding

Not applicable.

Availability of data and materials

All R code can be found at https://stratosida.github.io/regression-regrets. The data of the bacteremia study can be found at https://zenodo.org/records/7554815.

Ethics approval and consent to participate

The bacteremia study is the only part of this work where data from human subjects was involved. All experimental protocols were approved by the Ethics Committee of the Medical University of Vienna (EK Nr. 333/2011). The need for informed consent was waived by the Ethics Committee of the Medical University of Vienna (EK Nr. 333/2011). The data was cleared for public use by the data clearing committee of the Medical University of Vienna (DC 2019-0054). All methods were carried out in accordance with relevant guidelines and regulations.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Author details

¹Medical University of Vienna, Center for Medical Data Science, Institute of Clinical Biometrics, Spitalgasse 23, 1090 Vienna, Austria

²Novartis Pharma AG, Basel, Switzerland

³University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technology, Department of Mathematics, Koper/Capodistria, Slovenia

⁴University of Ljubljana, Faculty of Medicine, Institute of Biostatistics and Medical Informatics, Ljubljana, Slovenia

⁵University of Freiburg, Faculty of Medicine and Medical Center, Institute of Medical Biometry and Statistics, Freiburg, Germany

⁶University Medicine of Greifswald, Institute of Community Medicine, SHIP-KEF, Greifswald, Germany

⁷Vanderbilt University, School of Medicine, Department of Biostatistics, Nashville, TN, USA

⁸Michigan State University, Department of Statistics and Probability, East Lansing, MI, USA

References

1. Vach W. Regression Models as a Tool in Medical Research. Chapman and Hall/CRC. Boca Raton; 2013.
2. Harrell F Jr. Regression Modeling Strategies, 2nd Edition. Springer. New York, NY; 2015.
3. Sauerbrei W, Perperoglou A, Schmid M, Abrahamowicz M, Becher H, Binder H, Dunkler D, Harrell FE Jr, Royston P, Heinze G; for TG2 of the STRATOS initiative. State of the art in selection of variables and functional forms in multivariable analysis—outstanding issues. Diagn Progn Res. 2020;4:3. https://doi.org/10.1186/s41512-020-00074-3.
4. Royston P, Altman DG. Regression using Fractional Polynomials of Continuous Covariates: Parsimonious Parametric Modelling. JRSS C (Applied Statistics). 1994;43(3):429–67. https://doi.org/10.2307/2986270.
5. Huebner M, le Cessie S, Schmidt CO, Vach W. A contemporary conceptual framework for initial data analysis. Obs Stud. 2018;4:171–92.
6. Huber P. Data Analysis: What Can Be Learned From the Past 50 Years. Wiley. Hoboken, NJ; 2011.
7. Baillie M, le Cessie S, Schmidt CO, Lusa L, Huebner M; for Topic Group 'Initial Data Analysis' of the STRATOS initiative. Ten simple rules for initial data analysis. PLoS Comput Biol. 2022;18(2):e1009819. https://doi.org/10.1371/journal.pcbi.1009819.
8. Schmidt CO, Struckmann S, Enzenbach C, Reineke A, Stausberg J, Damerow S, et al. Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Med Res Methodol. 2021;21:63. https://doi.org/10.1186/s12874-021-01252-7.
9. Kerr NL. HARKing: hypothesizing after the results are known. Personal Soc Psychol Rev. 1998;2:196–217.
10. Ioannidis JPA. Why Most Published Research Findings Are False. PLoS Med. 2005;2(8):e124. https://doi.org/10.1371/journal.pmed.0020124.
11. Chatfield C. The Initial Examination of Data. JRSS A (General). 1985;148(3):214–31. https://doi.org/10.2307/2981969.
12. Cook D, Reid N, Tanaka E. The Foundation Is Available for Thinking About Data Visualization Inferentially. Harv Data Sci Rev. 2021;3:3. https://doi.org/10.1162/99608f92.8453435d.
13. Heinze G, Wallisch C, Dunkler D. Variable selection – A review and recommendations for the practicing statistician. Biom J. 2018;60(3):431–49. https://doi.org/10.1002/bimj.201700067.
14. Vandenbroucke JP, von Elm E, Altman DG, Gøtzsche PC, Mulrow CD, et al. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): Explanation and Elaboration. PLoS Med. 2007;4(10):e297. https://doi.org/10.1371/journal.pmed.0040297.
15. Lee KJ, Tilling KM, Cornish RP, Little RJA, Bell ML, Goetghebeur E, Hogan JW, Carpenter JR; STRATOS initiative. Framework for the treatment and reporting of missing data in observational studies: The Treatment And Reporting of Missing data in Observational Studies framework. J Clin Epidemiol. 2021;134:79–88. https://doi.org/10.1016/j.jclinepi.2021.01.008.
16. Harrell FE Jr, Dupont C. Hmisc: Harrell Miscellaneous. R package version 4.7-0. https://cran.r-project.org/package=Hmisc.
17. Sourial N, Wolfson C, Zhu B, Quail J, Fletcher J, Karunananthan S, Bandeen-Roche K, Béland F, Bergman H. Correspondence analysis is a useful tool to uncover the relationships among categorical variables. J Clin Epidemiol. 2010;63(6):638–46. https://doi.org/10.1016/j.jclinepi.2009.08.008.
18. Royston P, Sauerbrei W. Multivariable model-building: a pragmatic approach to regression analysis based on fractional polynomials for continuous variables. Wiley. Chichester; 2008.
19. Ratzinger F, Dedeyan M, Rammerstorfer M, Perkmann T, Burgmann H, Makristathis A, Dorffner G, Lötsch F, Blacky A, Ramharter M. A Risk Prediction Model for Screening Bacteremic Patients: A Cross Sectional Study. PLoS ONE. 2014;9(9):e106765. https://doi.org/10.1371/journal.pone.0106765.
20. Johnson NL. Systems of Frequency Curves Generated by Methods of Translation. Biometrika. 1949;36:149–76.
21. Gregorich M, Strohmaier S, Dunkler D, Heinze G. Regression with Highly Correlated Predictors: Variable Omission Is Not the Solution. Int J Environ Res Public Health. 2021;18(8):4259. https://doi.org/10.3390/ijerph18084259.
22. Royston P, Sauerbrei W. Improving the robustness of fractional polynomial models by preliminary covariate transformation: A pragmatic approach. Comput Stat Data Anal. 2007;51(9):4240–53. https://doi.org/10.1016/j.csda.2006.05.006.
23. Gelman A, Hill J, Vehtari A. Regression and Other Stories. Cambridge University Press. Cambridge; 2021.
24. Altman DG, McShane LM, Sauerbrei W, Taube SE. Reporting Recommendations for Tumor Marker Prognostic Studies (REMARK): explanation and elaboration. PLoS Med. 2012;9(5):e1001216. https://doi.org/10.1371/journal.pmed.1001216.
25. Rousseeuw PJ, Leroy AM. Robust regression and outlier detection. Wiley. New York, NY; 1987.
26. Zeileis A. Object-Oriented Computation of Sandwich Estimators. J Stat Softw. 2006;16(9):1–16. https://doi.org/10.18637/jss.v016.i09.
27. Ma X, Wang H, Huang J, Geng Y, Jiang S, Zhou Q, Chen X, Hu H, Li W, Zhou C, Gao X, Peng N, Deng Y. A nomogramic model based on clinical and laboratory parameters at admission for predicting the survival of COVID-19 patients. BMC Infect Dis. 2020;20(1):899. https://doi.org/10.1186/s12879-020-05614-2.
28. Glasziou P, Altman DG, Bossuyt P, Boutron I, Clarke M, Julious S, Michie S, Moher D, Wager E. Reducing waste from incomplete or unusable reports of biomedical research. Lancet. 2014;383(9913):267–76. https://doi.org/10.1016/S0140-6736(13)62228-X.
29. Huebner M, Vach W, le Cessie S, Schmidt CO, Lusa L; Topic Group “Initial Data Analysis” of the STRATOS Initiative (STRengthening Analytical Thinking for Observational Studies, http://www.stratos-initiative.org). Hidden analyses: a review of reporting practice and recommendations for more transparent reporting of initial data analyses. BMC Med Res Methodol. 2020;20(1):61. https://doi.org/10.1186/s12874-020-00942-y.
30. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten JW, da Silva Santos LB, Bourne PE, Bouwman J, Brookes AJ, Clark T, Crosas M, Dillo I, Dumon O, Edmunds S, Evelo CT, Finkers R, Gonzalez-Beltran A, Gray AJ, Groth P, Goble C, Grethe JS, Heringa J, 't Hoen PA, Hooft R, Kuhn T, Kok R, Kok J, Lusher SJ, Martone ME, Mons A, Packer AL, Persson B, Rocca-Serra P, Roos M, van Schaik R, Sansone SA, Schultes E, Sengstag T, Slater T, Strawn G, Swertz MA, Thompson M, van der Lei J, van Mulligen E, Velterop J, Waagmeester A, Wittenburg P, Wolstencroft K, Zhao J, Mons B. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018. https://doi.org/10.1038/sdata.2016.18. Erratum in: Sci Data. 2019;6(1):6.
31. Marino J, Kasbohm E, Struckmann S, Kapsner LA, Schmidt CO. R Packages for Data Quality Assessments and Data Monitoring: A Software Scoping Review with Recommendations for Future Developments. Appl Sci. 2022;12(9):4238.
32. Schmidt CO, Struckmann S, Enzenbach C, Reineke A, Stausberg J, Damerow S, et al. Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Med Res Methodol. 2021;21:63. https://doi.org/10.1186/s12874-021-01252-7.


Editorial decision: Revision requested (12 Mar, 2024)
Reviews received at journal (05 Feb, 2024)
Reviews received at journal (04 Jan, 2024)
Reviewers agreed at journal (07 Dec, 2023)
Reviewers agreed at journal (06 Dec, 2023)
Reviewers agreed at journal (04 Dec, 2023)
Reviewers invited by journal (04 Dec, 2023)
Editor invited by journal (13 Nov, 2023)
Editor assigned by journal (11 Nov, 2023)
Submission checks completed at journal (11 Nov, 2023)
First submitted to journal (08 Nov, 2023)
