Impact of Data-driven Methods for Causal Inference in Observational Studies: Simulations in Three Real-world Datasets

Background: Although data-driven methods for selecting covariates in multivariable models ignore confusion, mediation and collision, they are still used in causal inference. This study, through three real-world datasets, shows the impact of data-driven methods on causal inference. Methods: A research question leading to multivariate model was raised for each of three real-world datasets. Three covariate selection methods were compared on their performances to correctly answer the question: Augmented Backward Elimination with BIC criterion and “change-in-estimate” threshold set at 0.05, Backward Elimination with BIC criterion and a knowledge-based method relying on causal diagrams. The covariates were classied as indispensable, prohibited and optional, considering the potential bias they could cause on the estimate. For each dataset and sample size (N=75, 300 and 3,000), 10,000 Monte Carlo samples were drawn. Percentages of inclusion of each covariate in models were computed. Coverages of Wald’s 95% condence interval of exposure effects were computed with two different theoretical values (the analysed method, the knowledge-based method). Results: Even with the largest sample size (n=3,000), data-driven methods were not reproducible, with 8.6% to 53% of covariates included in 20% to 80% of experiences. Prohibited covariates could be included in more than 80% of experiences and indispensable covariates missed in more than 80% of experiences even with n=3,000. With the largest sample sizes, coverages of the theoretical knowledge-based value by data-driven methods ranged from 0% to 83.7%; coverages of the theoretical value of the same data-driven method ranged from 73.2% to 91.1% and were asymmetrical. Conclusion: In conclusion, data-driven methods should not be used in causal inference.

apply penalties forcing the value of some coe cients to zero, leading to exclusion of variables. Users try to obtain the "optimal" penalty parameter using cross-validation, AIC or BIC.
When the aim is to explain (exposure of interest), data-driven methods are also used. One of the most reported methods is the preselection of variables by univariable analysis, selecting adjustment variables that are associated with the outcome with a p-value cut-off [7,8]. The « change in estimate » selection method keeps an adjustment variable in the model depending on the change in the estimate of the coe cient of the exposure of interest (e.g., > 10%) that is induced by the removal of the adjustment variable, allowing conservation of "strong" potential confounders [9]. The purposeful selection method [10] is a hybrid method between a rst univariable preselection (e.g., p-value ≤ 0.25), then backward elimination of variables with non-signi cant association (e.g., p-value > 0.1) and a small "change in estimate" (e.g., ≤ 15-20%) and nally the entry of excluded variables from the initial univariable analysis (e.g., univariable p-value > 0.25) if signi cant (e.g., multivariable p-value < 0.10). However, in causal inference prior knowledge is primordial according to Witte J et al. [11]: "all selection methods rely on assumptions that can only be justi ed with subject-matter knowledge". Part of this knowledge can be represented with causal diagrams such as causal directed acyclic graphs (DAGs) [12]. In a causal DAG, variables are represented by nodes and causal relationships between variables by directed edges. These representations help the identi cation of confusion, mediation, or collision relations between variables. A collider is a variable caused jointly by the exposure and outcome of interest while a mediator is caused by the exposure and itself causes the outcome. Contrary to the causal DAG approach, the data-driven methods presented above do not take into account the difference between confounders, mediators or colliders and may lead to inappropriate adjustment on mediators and colliders unless variables are thoroughly preselected.
For several years now, stepwise methods have been criticized, as they lead to underestimated standard errors, non-normal sampling uctuations and erroneous p-values when using usual con dence interval and hypothesis test procedures: Wald, Score and likelihood ratio. Moreover, they amplify collinearity problems [13][14][15]. Peter L. Flom [15] proposed to use other methods such as the LASSO or the leastangle regression (LARS) which are supposed to be better alternatives. However, the stepwise method, and more widely, data-driven methods (including the penalized method proposed above), have well-known issues associated with p-values (even if AIC or BIC are used instead): risk of false discovery increased by the multiplicity of tests [16] and inappropriate acceptation of the null hypothesis when an effect is not signi cant [17]. Despite these warnings, researchers continue to use data-driven methods [7,8,18], and the development of data-driven methods for covariate selection in explanatory models is ongoing in 2019 with the article Confounder Selection via Support Intersection by Lee S et al.
But some researchers advocate the use of prior knowledge in explanatory models: VanderWeele TJ et al. [19] presented theoretical examples with DAGs. Heinz et al. [1] emitted recommendations. Witte J et al. [11] demonstrated the strengths and weaknesses of data-driven methods in causal inference using simulation with important sample size (minimum sample at 500), few variables, prevalence treatment and outcome around 0.5. However, these studies are too far from real-life settings.
There is a need for examples based on real datasets with prevalences far from 0.5, smaller datasets re ecting the low statistical power of some current studies and larger number of covariables. In this article, knowledge-based and data-driven covariate selection methods were tested in three causal research questions along with three real-life datasets. The objectives were to assess whether data-driven covariate selection methods were fragile without prior knowledge and whether they could lead to erroneous inference.

Methods
Data are derived from three different studies, published or submitted to peer-reviewed journals. Each one is described brie y in terms of size, available variables, and target population. Then, one causal research question is raised for each study. Covariate selection methods are compared on their propensity to select the appropriate set of variables and 95% con dence interval coverage of the true value that they are supposed to estimate.

Datasets and research questions
Dataset #1 is from a retrospective obstetrical study conducted in 200 patients with failure of medical abortion de ned as absent or incomplete expulsion requiring suction. The routinely collected clinical variables were: age (years), body mass index (BMI), number of vaginal deliveries, abortions, pregnancies, miscarriages, cesarean sections, children born alive, use of analgesic level 2 (yes or no), antiemetic (yes or no), weeks of gestation, length of hospitalization (hours), treatment route (oral or vaginal), additional dose of misoprostol, number of spontaneous fetal expulsions, doubt of expulsion (as declared by the physician: yes, don't know, or no), resort to suction (yes or no). Question #1 was: How much does failure of medical abortion increase the length of hospitalization? Dataset #2 is from a prospective study conducted in 665 adult women who underwent medical abortion. The variables collected a priori with an auto-questionnaire and a hetero-questionnaire were: Age (18-20,20-25,25-35,>35 years), medical history of abortion (medical, surgical, both or no) and pregnancy (yes or no), weeks of gestation, initial preference for medical abortion (yes or no), fear of hospitalization (yes or no), total misoprostol dose (200-400, 600-800, 1000-2000 mg), treatment route, pain (from 0 to 10), nausea/vomiting (yes or no), fever (yes or no), diarrhea (yes or no), bleeding estimate (as usual during menstruation, more, less than menstruation), misoprostol side effects intensity (from 0 to 3), and the question "If I were to have an abortion again, would I prefer the medical or surgical method?" (yes or no). Question #2 was: Does nausea increase the rate of preference for the surgical method of abortion for a future abortion?

Statistical analysis
Initial variable subset for data-driven methods Considering that minimal knowledge about the chronology of variable measurement was available, the starting pool of variables could differ depending on questions. All variables measured after the exposure of interest were removed to avoid inclusion of mediation variables [20] except for the rst dataset where all pre-suction variables were kept. Indeed, resort to suction was timestamped, unlike abortion failure in a context where suction is a late proxy of abortion failure.

Initial datasets
To use data-driven methods incomplete observations in the originals datasets #1 and #2 were excluded before any simulation was performed. In dataset #3, missing data were managed by simple imputation by chained equations before any simulation was performed. The predictive matrix included all preexposure covariates from the initial subset.

Covariate selection methods
The following covariate selection methods were used in order to build the explanatory models: (1) Knowledge-based with causal diagram using bidirectional edges to represent association between two variables that share ancestors (bidirectional edges were already used for this purpose in causal diagram for epidemiology [12]); (2) Augmented Backward Elimination (ABE) [21] with a "change in estimate" threshold set at 0.05 and BIC criterion; (3) Backward elimination (BE) with BIC criterion. Selection of the variable representing the exposure of interest was forced in the modeling process. For each data set, causal diagrams were designed by the author T.PL based on previous discussion with one medical expert of the eld (scienti c literature, guidelines and opinions), then validated by the same medical expert.

Knowledge-based approach
The "disjunctive criterion" [22] was used to select potential confounders from the initial set of candidate variables. In contrast with one of the classical de nitions of confounder which causes both the outcome and exposure, the "disjunctive criterion" selects any covariate that causes outcome or exposure or both. The disjunctive criterion method has been used more sparingly: -For a given potential variable, it was excluded if the existence of a selection bias leading to an imbalance between groups did not seem possible AND if this variable did not seem to be a proxy of an unmeasured confounder or an unknown confounder.
-Any potential variable considered as redundant with any selected confounder [23], exposure or outcome was excluded.
For each dataset, variable adjustments were categorized as either optional when it would not add or remove any important bias, prohibited when it might introduce an important bias or indispensable when it would remove an important bias. The lists of variables selected by the knowledge-based approach appear in the result tables. These lists of variables were not adapted to the sample size so as not to disadvantage the automatic methods. Here are some non-exhaustive explanations about the choice of variables: The research question associated with dataset #1 was the analysis of the causal link between "failure" of abortion and "length of hospitalization". "Doubt of expulsion" was considered as mediation variables between "failure" and "length of hospitalization" and so was prohibited adjustment. "Doubt of expulsion" and "additional dose of misoprostol" precede "resort to suction" and are completely or partly due to actual failure of abortion. The variables "anti-emetic" and "use of analgesic level 2" were caused by side effect caused by "additional dose of misoprostol" and so, were prohibited adjustments too.
The research question associated with dataset #2 was the analysis of the causal link between "nausea" and "future preference for surgical abortion compared to medical abortion after past medical abortion", the adjustment on "misoprostol side effect intensity" was prohibited because it seemed to be redundant with the exposure "nausea". That would be comparable to selecting "pain" (yes, no) and "pain" (0 to 10) in the same model. The adjustment on "total misoprostol dose" was optional because redundant with the adjustment on side effects on the one hand and "weeks of gestation" on the other. "Weeks of gestation" indicates misoprostol dose and may lead to "failure" or to "pain" which is one misoprostol side effect leading to "future preference for surgical abortion compared to medical abortion after past medical abortion". Side effects ("pain", "bleeding estimate") adjustments were indispensable because they are concomitant with "nausea" hence they may confound the real effect.
The research question associated with dataset #3 was the analysis of the negative causal link between "thrombolysis" and "massive brain necrosis". As this was a therapeutic question in a retrospective study, the indication bias could be important. This bias was corrected mostly by adjustment on "anticoagulant use", "time to imaging after stroke", "wake-up stroke", "medical history of stroke" which were classed as indispensable adjustments. There were no prohibited variable adjustments in this dataset.
Monte Carlo sampling and criteria of assessment A total of 10,000 samples (5,000 for dataset #3) each of three sizes (N = 75, 300 and 3000) of independent and identically distributed observations were randomly drawn with replacement from each initial dataset. The ordinary least squares general linear model was used for the rst research question while maximum likelihood logistic regressions were used for the second and third questions. Samples with constant outcome or exposure were excluded.

Criteria of assessment
The inclusion rate of a variable was de ned as the proportion of Monte Carlo experiences in which the variable was selected in the model by the data-driven method. We de ned inconstant inclusion of variables as any inclusion rate between 20 and 80%. The actual coverage of Wald's 95% con dence interval of the effect (odds ratio or adjusted difference of means) of the exposure, was assessed. Two different coverages were computed, with two different theoretical values of the real effect of the exposure on the outcome. The theoretical value was set to be the effect of the exposure on the outcome in the original dataset (duplicated to reach at least 1,000,000 observations in order to reach the asymptotic effect); rst in a model tted by the analyzed method, second in a model tted by the knowledge-based method. Overestimation and underestimation percentages were separately computed to describe coverage faults.
All analyses were performed with R statistical software (version 4.0.3, The R Foundation for Statistical Computing, Vienna, Austria) with 'abe', 'dagitty' [24], 'ggdag', 'mice', and data.table packages. The computations were performed on the GenOuest cloud computing platform, leveraging parallelization on a cluster of 50 processor cores.

General characteristics of samples
The numbers of patients excluded in dataset #1 and dataset #2 were respectively 24 (12%) and 67 (10%). In dataset #3, 95 (35%) of the 265 patients had incomplete observations that were kept using a simple imputation by chained equation. Primary exposures in datasets #1, #2 and #3, respectively "failure", "nausea" and "thrombolysis" had frequency rates respectively equal to 8.6%, 52% and 48%. The primary outcomes in datasets #1, #2 and #3 were respectively "length of hospitalization", "preference for medical abortion" and "massive necrosis". In dataset #1 the mean (standard deviation) "length of hospitalization" in the suction group was 27 (10) hours versus 11 (6) hours in the group without suction. In datasets #2 and #3, the frequencies of the primary outcomes were respectively 50% and 34%. The sets of eligible covariates, excluding the primary exposure, in datasets #1, #2 and #3 contained respectively 15, 13 and 23 potential covariates.
The set of included variables did not completely stabilize as the sample size grew. With BE in datasets #1, #2 and #3, respectively 8 (53%), 3 (23%) and 2 (9%) variables were inconstantly included with the largest sample size (n = 3000). With ABE, these gures were respectively 4 (27%), 3 (23%) and 3 (14%). Although that was expected for highly correlated variables such as "number of vaginal deliveries" and "number of children born alive", it was less expected for variables such as "body mass index" (Table 1).
With BE, some variables, such as "doubt of expulsion" in dataset #1 had an inclusion rate growing with sample size and tending towards 100%, while some other variables such as "use of analgesic level 2" had an inclusion rate that was constantly low (see Table 1). Strong U-shaped associations between the sample size and inclusion rate were found for "sleep apnea syndrome", "time to imaging after stroke" and "thrombus localization" in dataset #2 (see Table 2). Eventually, strong decreasing associations exist between sample size and inclusion rate, such as that of the variables "anticoagulant use", "diabetes mellitus" and "hypertension" in dataset #3 (see Table 3).

Coverage Compared to knowledge-based theoretical value
The coverage of the knowledge-based theoretical value by BE and ABE decreased with the sample size and, for the largest sample size (n = 3000), reached 0% for both BE and ABE in dataset #1, 58.9% and 57.6% for BE and ABE in dataset #2, and 77 and 88.7% for BE and ABE in dataset #3 (Tables 4 to 6).

Compared to the method itself
The coverages of the asymptotic BE value for BE on large sample sizes (n = 3000) in datasets #1, #2 and #3 were respectively 73.2%, 88.8% and 77%. For ABE, these gures were respectively 73.6%, 91.1% and 88.7%; for the knowledge-based method, they were 85%, 94.7% and 94.9%.

Discussion
In the three datasets, data-driven con dence intervals had poor coverage, even when compared to their own asymptotic effect. This may be due to the invalidity of Wald's con dence intervals for test-based data-driven methods (BE, ABE) without shrinkage [25]. Bias in point estimates makes the coverage tend towards zero when the sample size tends towards in nity as the width of the con dence interval tends towards zero, centered around the wrong value.
The inclusion of prohibited variables leads to bias in estimations to assess the causal effect of exposure on the outcome. Judd and McClelland wrote: "It seems unwise to let an automatic algorithm determine the questions we do and do not ask of our data" [26]. Since different models answer different research questions (e.g., adjusting or not on a mediation variable depending on whether the direct or total effect is to be analyzed), data-driven methods with unstable inclusions of covariables lead to random models, answering random research questions. In dataset #1, on large sample sizes, data-driven methods tended to answer the question "How much does suction increase the length of hospitalization?" instead of the initial question "How much does failure of medical abortion increase the length of hospitalization?", because they adjusted on "doubt of expulsion", "additional dose of misoprostol", "antiemetic".
In their article presenting ABE [21] authors assumed that the initial subset of variables respected the "disjunctive criterion". Hence, they did not use it as a blind method but as a complementary method of prior knowledge based on a DAG. They consider ABE as a way to improve reproducibility where DAGs are not unanimous or when knowledge is poor. As we saw in this article ABE and BE provide non-reproducible results, with random inclusion (inclusion rates between 20 and 80%) of many variables, even on quite large sample sizes, and some U-shape associations between sample size and inclusion rates. Moreover, ABE and BE include mediators and colliders without warning. Furthermore, the numerous options available in ABE function (p-value and "change-in-estimate" thresholds, AIC, BIC) may further increase non-reproducibility.
Other data-driven methods based on shrinkage such as Glider with group LASSO [27] or outcome adaptive LASSO [28] have been promoted as better alternatives, but they are as blind as test-based methods about relations (mediation, collision, confusion) between variables. On small-to-medium sample sizes, data-driven methods can omit confounders (e.g., "weeks of gestation" in dataset #1) leading to under-adjustment unless the sample size is very large besides the fact that most of the time underpowered studies lead to non-reproducible results [29]. A previous study, comparing forward, backward, stepwise and purposeful with simulation including six variables of which three had an effect on the outcome, showed the low proportion of experiences where the correct model was selected (even for n = 600 with 70% of correct models) [10]. However, in our study, even if some indispensable variables were not included, they were sometimes replaced by optional variables that could be proxies (e.g., "weeks of gestation" and "total misoprostol dose" in dataset #2 or "atrial brillation" and "anticoagulant use" in dataset #3). Indeed, in the causal DAG theory, an adjustment is su cient to control confounding if it blocks all backdoor paths. If several variables are on this path it is su cient to adjust on one or all of them to obtain the desired effect [30].
The variability of relations of inclusion rates to sample size (U-shape, decrease or increase) may be explained by contradictory forces growing in different directions with the sample size. First, the statistical power grows with the sample size for all effects, tending to an increase of the inclusion rate with the sample size. Second, the sensitivity of the "change in estimate" threshold is inversely proportional to sample size, leading to a decrease of inclusion rate with sample size. Third, two correlated variables compete for inclusion in the sample size. On small samples, the variable having the largest effect on the outcome and the highest statistical power may win while, on large samples, both variables have a statistical power tending towards 100% and other factors come into play, such as the change in the estimate of the primary exposure or the novel inclusion of a third variable that attenuate or increase the effect of one of the two concurrent variables.
All these chaotic inclusions and biased estimates could have an impact on clinical conclusions of articles [31].

Strengths & limits
Causal diagrams were validated by only one clinical practitioner who has studied the concerned eld, hence they are not universally consensual. The use of three datasets is a way to present diverse situations but is in no way exhaustive. Although real-life datasets were used, their small sample sizes made them imperfect, with idiosyncrasies that do not exist in the actual population. Data-driven methods used to recode variables or add interaction terms were not analyzed. Shrinkage-based methods were not analyzed, because they are rarely used with explanatory models. Although there are some methods to correct the bias of point estimate and con dence intervals after data-driven selection of variables, none were analyzed [32,33]. However, these methods provide inference that is only valid conditional on the selected variables, and so, fail to solve the problem of randomness of the research question due to methods randomly selecting variables. Moreover, these methods are not implemented in major statistical software: Stata, SPSS and SAS provide Wald's p-values and con dence intervals without warning when stepwise, backward or forward selection is performed.

Conclusion
This work shows that variables selected by data-driven methods (backward elimination and augmented backward elimination) are inconstant for a given sample size, vary with the sample size and are often inappropriate for causal inference (e.g., random adjustment on mediator). It also shows that coverage of the primary effect by 95% con dence interval estimated by data-driven methods can tend towards zero as the sample size grows. Selection of variables is not an easy task, relying on a complex context. Datadriven methods, blind to this context, should not be used to build an explanatory model.

Declarations
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.
Availability of data and material Data are not available because of individual privacy. R script is available at https://github.com/ClementMassonnaud/dataDrivR.

Con icts of interest/Competing interests
The authors have no relevant nancial or non-nancial interests to disclose.        Causal diagram built with prior knowledge and medical practitioner expertise for dataset #1