Estimating the effect of adjuvant chemo-therapy for colon-cancer using registry data: a method comparison and validation

Background: Although randomized controlled trials (RCT) are the gold standard to estimate treatment eﬀects, they are often criticized in terms of generalizability. Observational data might alleviate this problem by being readily available in large quantities. However, observational data are potentially confounded. In this methodological study we use a large-scale RCT as the gold standard to examine the performance of various statistical methods to control for potential confounding in observational data. Methods: In this paper we compare three types of methods that allow researchers to correct for such potential confounding: direct methods, inverse probability weighting (IPW) methods and doubly robust (DR) methods. We uniquely compare estimates obtained from the population-wide Netherlands Cancer Registry (NCR) on colon cancer ( n = 52621 ) with estimates obtained from a large-scale RCT. As the RCT diﬀers from the observational data both in its sampling mechanism and in its treatment assignment mechanism, we ﬁrst resample the NCR data to reﬂect the distribution in RCT data. Next, we correct for potential confounding using three alternative types of methods and consequentially evaluate their estimates to those obtained in the RCT. Results: We ﬁnd that while all estimators qualitatively approximate to ﬁndings in the RCT, methods that can ﬂexibly model the response (i.e., direct estimation and DR estimation) performed consistently superior to the inverse propensity score method. Subgroup analysis indicates that relatively simple models allow us to properly estimate the treatment eﬀect. However, these simple models do not properly identify heterogeneous treatment eﬀects in stage II colon cancer. Careful sensitivity analysis using more ﬂexible models demonstrates both the uncertainty and the potential heterogeneous treatment eﬀect in stage II cancer and provides robust estimation of treatment eﬀect in stage III cancer. Conclusions: Our results suggest that both the direct method and the DR method, when executed with care, can be used to reliably estimate treatment eﬀects based on registry data. This methodological validation opens the door to more extensive use of registry data for the estimation of (individual) treatment eﬀects. Additionally, our identiﬁcation of potentially meaningful subgroups of stage II colon cancer patients who, based on our analysis seem to beneﬁt from chemotherapy, should be further explored.

Estimating the effect of adjuvant chemo-therapy for colon-cancer using registry data: a method comparison and validation Lingjie Shen 1* , Erik Visser 2 , Hans de Wilt 4 , Henk Verheul 5 , Felice van Erning 3 , Gijs Geleijnse 3 and Maurits Kaptein 2  1 Background Colorectal cancer is the third most commonly diagnosed cancer and the second leading cause of cancer death worldwide in 2018 [1].The majority of patients are curatively treated with surgical removal of the tumor, surrounding tissues, and lymph nodes leading to excellent long term survival [2].For stage III colon cancer, Randomized Control Trials (RCTs) [3][4][5] have shown an improved overall survival when surgery is followed by adjuvant chemotherapy compared to surgery only.However, results from RCTs were not supportive of prescribing adjuvant chemotherapy to stage II patients [6] due to lack of convincing evidence.For example, the meta-analysis by Erlichman and Charles (1999) [7] showed in a pooled analysis a hazard ratio (HR) of 0.83 (90% CI 0.72, 1.07) for disease-free survival and a HR of 0.86 (90 % CI 0.68, 1.07) for overall survival for fluoropyrimidine monotherapy compared to observation.The same finding was also reported [8]: while both RCT and real world data suggest a clinically relevant benefit of adjuvant chemotherapy for stage II colon cancer, the estimates do not reach statistical significance.A borderline significantly improved overall survival was found in stage II cancer in QUASAR trial (Relative Risk 0.82 95% CI 0.7-0.95) and consequently the authors suggest adjuvant chemotherapy for all stage II patients.Without critical pathological and clinical characteristics to identify the heterogeneous treatment effect in subgroups (e.g., emergency resection, tumor size, grade, number of positive lymph nodes), conclusions from the QUASAR trial may however not be applicable to all stage II patients.
These mixed and inconsistent results from current trials may be partly due to the heterogeneity between populations resulting from selection rules of eligible patients.The strict selection rules in RCTs make baseline populations in trials differ from daily clinical practice, which can threaten the external validity of studies.Registry data, without strict patient selection, may truly reveal the clinical relevance and may provide additional evidence regarding the effect of the adjuvant chemotherapy on a population outside the context of trials.Consequently, researchers have recently tuned their interest to such registry data to estimate treatment effects because it covers a large, heterogeneous and country-wide population [8][9][10][11].
Estimating the effect of adjuvant chemotherapy using registry data, however, is challenging.First, variations in the type, the dose, and the length of the chemotherapy commonly occur across individuals, which make comparison of treatment effect across studies difficult.For instance, although capecitabine and oxaliplatin (CAPOX) and fluorouracil, leucovorin, and oxaliplatin (FOLFOX) are regarded as current standard treatments regarding their improved survival compared to fluorouracil and leucovorin (FU/LV) [4], the detailed information of the chemotherapy regimen for each individual is not always properly recorded.This hinders the estimation of the long-term effect of CAPOX and FOLFOX using existing registry data.To fully utilize the registry data regardless of the limitations, researchers uses a binary variable to indicate the chemotherapy (FU/LV and FOLFOX) and control, and consequentially compares treatment effect using registry data with the estimation from trials (FU/LV vs. control) [8]; an approach we also adopt in the current study.
Second, unlike RCT data, in which the observed and unobserved confounding variables are balanced by random treatment assignment, the observational registry data runs the risk of being confounded.More precisely, in observational data, the treatment assignment mechanism is not exactly known, and thus confounding cannot be ruled out [12].Understanding the performance of statistical methods to correct for such potential confounding is an extremely important methodological question: if we are able to properly control for confounding in registry data we open up a treasure throve of new data to be used to estimate (individual) treatment effects.
Many statistical methods to control for potential confounding have been suggested over the past few decades, most often under the assumptions of positivity and unconfoundedness which we describe in detail in section 3.These methods can roughly be classified into one of three classes [13]: 1 Direct methods: Direct methods aim to directly model the outcome as a function of the treatment and potential covariates.Some methods consider the treatment itself as a covariate and fit the outcome directly, while others fit the separate outcome for the treatment and control groups [14].Using multivariate regression to directly model the associations between the outcome and the treatment while adjusting the covariates is the most commonly applied implementation of the direct method [15].Other, more modern, implementations use more flexible models like Bayesian additive regression trees (BART) [16][17][18] or Causal Forests [19]. 2 Inverse probability weighting (IPW) methods: IPW methods control for potential confounding by estimating, and controlling for, the conditional probability that an individual unit receives the treatment (i.e., the propensity score, PS) [12].IPW methods estimate the treatment effect by weighting each sample with the inverse PS [20], thereby generating the pseudo-population whose covariates are independent of the treatment assignment (as would be the case in an RCT). 3 Doubly robust (DR) methods: DR methods combine a model for the treatment assignment mechanism and a model for the outcome, which can practically yield accurate estimation when one of the two models is good (but not necessarily consistent) [21,22].Note that for any of the methods above researchers can make a variety of modeling choices.Historically, parametric models have been commonly used.For instance, in the majority of published studies about the treatment effect estimation in the field of colorectal oncology [8,23,24] the direct method, using standard logistic regression, is employed.Although these parametric models are efficient and wellunderstood, they impose strong parametric assumptions about the relationship between potential confounders and outcomes.Recently, more flexible semi-parametric and non-parametric models have demonstrated good performance [14].In particular, BART seems well suited to uncover the true treatment effect when the treatment assignment is confounded [14,[25][26][27].
When comparing direct estimation with IPW and DR methods in simulation studies, an improved performance of direct estimation over IPW methods is found [14], while in other study DR methods outperform both direct estimation and IPW methods [28].Comparisons between these different classes of methods are however hard to carry out in non-simulated scenarios: often when observational data (registry data) is available there is no ground truth of the estimate of interest available (i.e., there is no RCT data available).The unavailability of evaluations of different methods to control for confounding on real-world data hinders the adoption of these methods in general.
The case of adjuvant chemotherapy for colon cancer is different, however: here we uniquely have access to a large RCT [29] (QUASAR trial) as an experimental benchmark and we have access to the observational data collected by the NCR.This provides us with the opportunity to evaluate the performance of different (classes of) methods to control for confounding on real-world registry data.To our best knowledge, the QUASAR trial is the best possible gold standard to compare the effect of chemotherapy verses control as it has large samples, long follow up years, and good completion of six months of chemotherapy.
To conclude, observational registry data and RCT data potentially differ in two respects which requires us to be aware of when estimating and comparing the treatment effect [30]: 1 the sampling mechanisms, i.e., the mechanism by which subjects are included into studies, can differ substantially.While the absence of strict inclusion criteria is an argument in favor of the use of observational data, the sampling process will ultimately affect the distributions of the covariates and subsequently will affect the estimated average treatment effect.The sampling mechanism may also lead to differences in effect estimation between the RCT and the observational study. 2 the treatment assignment mechanism differs: while in the RCT the assignment mechanism is known -and often uniformly random -for observational studies, the assignment mechanism is unknown and confounding is potentially present.
In this study, of which we provide an overview in Figure 1, we attempt to control for both sources of divergence and aim to explore the effects of three classes of correction methods of controlling for the second source of divergence.This study thus helps us understand how these different methods perform on real-world registry data, potentially opening up their use to inform the current treatment guidelines.
In the next section, we first describe the two data sources used in this study.In section 3 we describe the different methods for controlling for confounding that we examine in this paper: direct methods, IPW methods, and DR methods.In section 4 we explain our approach to ensure that, next to controlling for confounding, the sampling mechanism playing in both data sources are comparable: effectively, we estimate the distribution of the main covariates in the QUASAR trial as D QU ASAR ′ , and subsequently subsample the NCR data accordingly (X sub ∼ D QU ASAR ′ ).Note that we explicitly order these two steps such that the models used to control for confounding are estimated using all the available data (X N CR ∼ D N CR ).In section 5 we present our results: we compare several estimates obtained using both direct, IPW, and DR methods with the estimates of the average treatment effect (ATE) originating from the RCT, and explore the performance of IPW and DR with different trimming values.The results of the subgroup analysis and sensitivity analysis are also presented.Finally, in section 6 and 7 we discuss our findings and provide recommendations for the analysis of observational data.

Overview of data sources
In this study we effectively used two data sources: on one hand we used the estimates resulting from the QUASAR trial [29] as the gold standard.Note that we did not have access to the patient-level data of the QUASAR trial and relied on the estimates and descriptive statistics provided in the original paper.On the other hand, we used registry data from the NCR (N = 52621) as our observational dataset of choice.The impact of the discrepancy of the treatments in two data sources on the estimation will be further discussed and interpreted in section 6.

The gold standard: the QUASAR trial
The QUASAR trial was an RCT designed to test the effects of adjuvant chemotherapy for colorectal cancer patients.Between May 25, 1994 In the trial, eligible patients had undergone the resection and had no evidence of distant metastases.Furthermore, they had no definite contraindications to chemotherapy.1622 patients were randomly assigned to receive chemotherapy consisting of FU/LV.The primary outcome was overall survival.We refer readers to [29] for more details regarding the exact protocol of the QUASAR trial.

The NCR
The NCR is a registry containing all incidences of cancer in the Netherlands managed by the Netherlands Comprehensive Cancer Organisation (Integraal Kankercentrum Nederland, IKNL).We analyzed a subset of patients who were diagnosed with colon cancer between 2006 and 2015 and who underwent surgical resection [1] .Follow-up was completed until January 31, 2019 .The total study population consisted of pathological stage II and III (N = 52621) patients .We excluded 100 patients who had the missing value of treatment and incidence.The primary outcome is overall survival.

Covariates description and data processing
Table 1 presents the distribution of the covariates in the current colon-cancer guideline and some important prognostic covariates as well as the basic demographic covariates, in which we exclude 4586 patients with clinical metastasis.For height and weight, a large number of values were not recorded (82.3 % and 81.3 %) which are imputed with median.The BMI is then computed with height and weight.For categorical covariates, the missing value is replaced with category as 'not recorded'.To explore the impact of covariates included on the estimation, we will consider four groups of covariates for analysis: [1] We only focus on the colon cancer patients instead of colorectal cancer patients in NCR for two reasons: first in NCR, the treatment regimen for rectum cancer patients are much more complicated than colon cancer; second, reported by [29], the treatment effect of adjuvant chemotherapy didn't differ significantly by tumor site, namely, colon or rectum. 1 3 covariates: age, sex, and stage; 2 5 covariates: 2 covariates (age, sex) in addition to 3 covariates in the guideline (stage, MSI, high risk [2] ); 3 13 covariates: 2 covariates (age, sex) in addition to 11 covariates in the guideline (stage, cM, pT, pN, high risk, MSI, number of lymph nodes assessed, colon perforation, lymphomatic invasie, agio invasie, grade); 4 All covariates in Table 1 3 Controlling for confounding: Direct, IPW, and DR methods In this section, we will introduce three groups of methods for controlling for confounding in detail.Note that this controls for confounding; we cover controlling for sampling mechanism in section 4.
To formalize our methods, we first formalize the problem of estimating treatment effects.We use the potential outcome framework by Rubin [see 32, chapter 1].We have a population with distribution D and size n, let X denote a vector of covariates with length d sampled according to D, Z denote the binary treatment variable (1=adjuvant chemotherapy, 0=no adjuvant chemotherapy) and Y denote a binary response variable (1 ==death, 0 ==alive).Capital roman letters denote random variable, while lower case letters denote realized values.We use Y i (1) to denote the potential outcome for subject i under the treatment Z = 1 and Y i (0) denote the potential outcome under the treatment Z = 0.The observed outcome under the realized treatment for the subject i can be denoted as The population average treatment effect (ATE) with covariates X ∼ D can be formulated as: However, by definition we cannot learn the true treatment effect because only one of the Y i (1) and Y i (0) is observed, which is the fundamental problem of causal inference.
To estimate the average treatment effect, assumptions of unconfoundedness and positivity (or overlap) ) [2] risk factors implied in the 2014 Dutch guideline for colon cancer [31]: perforation at diagnosis = yes, pT = 4, lymphatic invasion = yes, angio invasion = EMVI or IMVI, grade = poor to undifferentiated or unknown, the number of lymph nodes assessed less than 10.As long as one of risk factors occurs in stage II patients, patients are identified as 'high risk = high'.For stage III, 'high risk = not applicable'.
are made [33].In RCTs, the known assignment mechanism causes the treatment assignment Z independent of both observed and unobserved covariates X, consequentially leading to the treatment Z independent of potential outcomes under the assumption of unconfoundedness.In this case, the average treatment effect (ATE) can be formulated as In observational study where the treatment assignment Z is correlated with the covariates X, the ATE can be formulated as [34]: Note that the population treatment effect is defined as the average treatment effect on the target population D. To make the estimator of the treatment effect comparable across the QUASAR (D QU ASAR ) and NCR (D N CR ), we need to resample the subjects X sub in NCR according to D QU ASAR (X sub ∼ D QU ASAR )such that the population in NCR is matched to QUASAR.Although the true D QU ASAR is unknown, provided by the descriptive statistics of the covariates in QUASAR, we can estimate the D QU ASAR ′ by assuming the covariates are independent, which will be introduced in section 4. Now that we have formalized the problem of estimating the treatment effect, we will introduce different methods for modeling the E[Y i (1)] and E[Y i (0)] in observational study, namely, direct methods, IPW methods, and DR methods.In each case we have multiple implementations: we discuss implementations relying on relatively simple regression models and implementations relying on the flexible, non-parametric BART model.All the models are fitted based on the population (N=52521).Then the estimators of the treatment effect of the target population in NCR can be obtained by averaging the treatment effect of subjects sampled according to D QU ASAR ′ .

Direct methods
The direct method, also known as regression adjustment, uses the regression method to directly fit the outcome model E[Y |Z, X].If the unconfoundedness assumption holds, E[Y i (1)] and E[Y i (0)] can be obtained by marginalizing the conditional outcome of Y given covariates X across the population under treatment Z = 1 and Z = 0, respectively, and can be formalized as follows: which have been proved in equation 6.Therefore, by fitting the model for potential outcomes conditional on the covariates to adjust for the direct and indirect associations between covariates and outcomes, we can obtain the potential outcomes for each subject and subsequentially the ATE of the population.In the following section, we will introduce two methods to fit the model for potential outcomes E(Y |Z, X) based on observed outcomes, namely, a parametric model of logistic regression (Direct/LReg) and a non-parametric model of BART (Direct/BART).

Logistic regression (LReg)
Logistic regression (LReg) is a commonly used parametric method to control for confounding.When the covariates are strongly predictive of outcomes, we can efficiently identify the relation between treatment and outcome by making use of the information about relations between covariates and outcomes.By specifying the main effect and interaction effect between treatment and each independent covariate, ] and E[Y (0)] can be expressed as: where β0 , β1 , β1 and α are estimated by maximum likelihood.We perform lasso regularization to prevent overfitting [35].

Bayesian additive regression tree (BART)
BART is a sum-of-trees model where each tree is constrained by a regularization prior to a weak learner, and fitting and inference are obtained by Bayesian Markov Chain Monte Carlo (MCMC) algorithm that generates samples from a posterior [16].[27] and [36] first promoted the use of BART to obtain the heterogeneous and average treatment effect.[26] also compared the BART with other methods using simulated observational data in healthcare and results show that BART has a low bias.Unlike many other non-parametric models, BART is flexible and it can handle non-linear main effects and multi-way interactions without much input from researchers [37].We refer the readers to [16] for details.For binary outcomes, E(Y |Z, X) can be estimated using a probit model and then E[Y (1)] and E[Y (0)] can be obtained by averaging the outcome conditional on X for Z = 1 and Z = 0 across the population.
In our study, we use the default setting in [16] because of BART's robust performance with respect to various hyperparameter settings: the number of trees = 50 (we found that the results show no gain with more number of trees incorporated), the number of posterior draws = 1000, and the prior distribution of σ is inverse gamma distribution IG(α, β) where shape parameter(splitting probability) of α is 0.95 and rate parameter (depth penalty) β is 2, respectively.

Inverse probability weighting estimation (IPW) methods
IPW is one of the PS methods for correcting the mismatch of data distribution produced by the treatment assignment.We define PS as follows: We refer readers to [28,33] for details about the PS.The IPW method is considered to have some methodological advantages because it imitates the same treament assignment mechanism as the RCT.Weighting by the PS effectively generates the pseudopopulation in which the treatment assignment is independent of observed covariates, generating the similar distribution of covariates between two groups.Hence, in this pseudopopulation, we can obtain E[Y (1)] and E[Y (0)] simply by averaging the outcome in treatment and control groups after weighting by the inverse of PS.Then the E[Y (1)] and E[Y (0)] with respect to the distribution of X ∼ D in the target population with n subjects can be obtained via and Then the ATE can be estimated by directly comparing the difference between E[Y (1)] and E[Y (0)].In this study, we compute the PS using two regression models, namely, LReg with lasso regularization (IPW/LReg), and BART (IPW/BART) with default setting.We trim the weights at 0.1 and 0.9, which is implemented in both IPW and DR methods, and explore the effect of different trimmed values on the ATE estimation with these methods, which we will further illustrate in section 5.

Doubly robust methods (DR)
The DR estimator combines a model for the potential outcomes as a function of covariates E(Y |Z, X) and a model for assignment mechanism as a function of covariates E(Z|X) (i,e, π(X)).Particularly, we fit a separate outcome model for the treatment group Z = 1 and control group Z = 0. m z (X i ) is the predicted values for the subject i given X i and the treatment Z i = z.
Then E[Y i (1)] and E[Y i (0)] with respect to the distribution of X ∼ D in the target population can be estimated via [28]: The name DR is derived from the fact that the estimation remains unbiased when either a model for treatment assignment mechanism or a model for the potential outcomes is correctly specified [22].Combining a direct estimator and an IPW estimator under the form of a DR estimator leads to lower bias than the direct estimator alone, and lower variance than the IPW estimator alone [38].See [39, p148-149] for more details of the proof of the DR property.In this study, four DR estimators are constructed based on two models for potential outcomes E[Y |Z, X] fitted by LReg and BART and two models for assignment mechanism E[Z|X] fitted by LReg and BART.Table 2 shows the details of these four combinations.

Additional methods
Next to the three groups of methods described above, we also include two other alternatives: 1 To explore the effect of sampling mechanism and treatment assignment on estimation, we roughly model potential outcomes E(Y i |Z i ) on the whole population before the subsampling ((X N CR ∼ D N CR ) and the study population after subsampling (X sub ∼ D QU ASAR ′ )respectively. 2 Furthermore, in addition to covariates, we also include the PS in the direct method E[Y i |Z i , X i , π(X i )] (Direct/ps-BART).Both the model for potential outcomes and PS (i.e., π(X i ) = E(Z i |X i )) are fitted by BART with the default setting.Direct/ps-BART incorporates the estimate of PS in the specification of the outcome model, implicitly inducing a covariate-dependent prior to the regression function, which even outperforms the BART model ( [25,40].When fitting the model using LReg, we need to specify the relationship between each variable: the main effect of covariates and the interaction effect between the treatment and each covariate are specified for the model of E(Y i |X i , Z i ) in the direct method; the main effect of all covariates and two-way interaction effects of all covariates are specified for the model of E(Y i |X i , Z i = z) in DR method and the model of E(Z i |X i ) in IPW method.To prevent overfitting, lasso regularization is implemented on all the models fitted by LReg, the loss function is to minimize the mean squared error, and the ten-fold cross-validation is implemented to choose the regularization parameter.

Overview of methods
Note that all the models are initially fitted based on the whole population (X N CR ∼ D N CR ) and the estimator of ATE for comparison with the QUASAR are then obtained from the subsamples (X sub ∼ D QU ASAR ′ ) from the whole population.We quantify the sampling uncertainty by stratified sampling the study population for 5000 times, which we will introduce in section 4.

Study population sampling
In this section, we will select the study population from the whole population (N = 52621) to ensure that the subsamples of the NCR (X sub ∼ D QU ASAR ′ ) is similar to the samples in the QUASAR trial (X QU ASAR ∼ D QU ASAR ).We first select the eligible patients (N = 45539) by complying with the inclusion criteria of QUASAR.Then we subsample the target population according to the probablity estimated from the joint distribution of D QU ASAR ′ over important covariates by assuming independence.

Eligible participates
To select the eligible participates for the study population, we first sought to comply with the same inclusion criteria as the QUASAR trial [29].From a total of N = 52621 patients with colon cancer, 100 patients missed the information of treatment and incidence, 45442 patients were without distant metastases (cM==1), diagnosed stage II or stage III and aged 23 to 86.All of these 45442 remaining patients are included for a two-year follow-up analysis and n = 34832 for a five-year follow-up analysis in which patients diagnosed after 2013 are excluded.Figure 2 provides an overview of this first selection step [3] .

Sampling
Next to following the inclusion criteria in the QUASAR trial, we also estimate the distribution D QU ASAR ′ over covariates by correcting for the distribution employed in the NCR.Effectively, we ensured that our subsample X sub ∼ D QU ASAR ′ resulting from the NCR data corresponds -in terms of the distribution of several important covariates -as closely as possible to the QUASAR trial data.By correcting for the sampling mechanism such that the distribution of covariates are the same in both data sources (i.e., D QU ASAR ′ and D QU ASAR ), the estimator obtained by averaging the outcomes across subsamples of NCR (X sub ∼ D QU ASAR ′ ) can be reliably compared to the QUASAR in further analysis.To do so, we divide the eligible population into 16 subgroups by stage, sex, and age, and sampled from each subgroup according to the joint probability of their cancer stage, their sex, and their age as observed in the QUASAR sample (see Appendix) [4] .
Then we implemented this subsampling such that the resulting sample was as large as possible while maintaining the desired joint distribution of the covariates.The stratified subsampling leads to the selection of 7704 and 6378 patients from the NCR for two-year follow-up and five-year follow-up analysis; Table 3 demonstrates that our subsampling procedure effectively ensured the desired joint distribution of the NCR subsample comparable to the original QUASAR sample.Note that, to incorporate the variability induced by our subsampling procedure in our remaining procedure, we create m = 5000 stratified (sub)samples.

Results
In this section, we compare the performance of the different methods to control for confounding.We examine the ATE in two and five years of follow-up.The main results are presented in Figure 3, where the ATEs in each method are estimated with four groups of covariates.Numerical details are provided in Table 5 and 6.In addition, to assess the magnitude of violation of positivity assumption in IPW [3] It is worth noting that in our study the models for potential outcomes and assignment mechanism were fitted based on a very large sample (N = 52521), utilizing the information from the general population which is beyond the scope of most RCTs.Thus, it is reasonable to, in the future, estimate the treatment effect of patients who are not included in the RCT using these methods. [4]No other covariate distributions are reported in the paper of QUASAR trial.Note that only the marginals are reported, and we estimate the joint distribution by assuming independence supported by [41].and DR methods, we visualize the PS distribution between treatment and control groups and explore the variation of IPW and DR to different trim values -an approach to improve the performance in face of the positivity violation -which are shown in Figure 4. We also show the results of subgroup analysis to identify the heterogeous treatment effect and sensitivity analysis to assess the assumption of unconfoundedness.

Results of the estimation of treatment effect
To explore the impact of different covariates sets used on the estimation, we compare the results of three methods using four sets of covariates, which are shown in Figure 3.We find that the methods using the least covariates are the closest to the ground truth of RCT while the results using more covariates substantially overestimate the ATE.
We then explore the results of the different methods after the two-year followup.In the RCT the estimated ATE -0.021 (95% CI -0.039, -0.001).The result is presented in Figure 3 using the black dotted line (point estimate) and the grey band (CI).Using the NCR data without adjusting for any confounders -i.e., the Crude method -we obtain an ATE of -0.069 (95% CI -0.076, -0.062) for X N CR ∼ D N CR and a median ATE of -0.019 (range from -0.041, 0.003) for X sub ∼ D QU ASAR ′ .Thus, if neither control for sampling mechanism nor treatment assignment, the Crude method overestimates the treatment effect.This overestimation could be alleviated by controlling for the sampling mechanism, although the estimation still deviates from the RCT and shows large variance produced by the sampling process.
Regarding the impact of the number of covariates controlled for on the ATE estimation, it seems that using sex, age, and stage can properly obtain the ATE as close as possible to the RCT.When high risk and more covariates added (shown in the second, the third and the fourth row in Figure 3), the estimate deviates far from the results from the RCT.
Regarding the direct methods, the direct/BART and direct/ps-BART are very close to the ground truth while the direct/LReg seems to overestimate the ATE.Notably, the variation over different subsamples obtained to correct for the sampling mechanism in direct methods is very small: the direct method seems very robust to this sampling variation.Compared to direct methods, the IPW methods, however, seem to have a huge variation.The DR methods seem to -as expected -combine the best of both worlds: the estimates are close to the ground truth, and they exhibit small sampling variance compared to IPW methods.For each of the DR methods, irrespective of using parametric or non-parametric models, the majority of the point estimates are within the confidence bound of the original ICR.The DR using 13 and all covariates all show the same results, which probably means that unconfoundedness is hold in both cases.
In the five-year follow-up analysis, the ATE in RCT is -0.025 (95% CI -0.059, 0.008).In NCR, ATEs using crude method for X N CR ∼ D N CR and X sub ∼ D QU ASAR ′ are -0.073(95% CI -0.083, -0.063) and a median of -0.031 (range from -0.065, 0.005), respectively.These results are qualitatively similar to those obtained after two years: the Crude method overestimates the treatment effect, all the DR methods are very close to the ground truth.The majority of the point estimates obtained from the DR method -whether using a parametric or a non-parametric approach -fall within the confidence bound obtained in the RCT.
To conclude, our results show that a) in general all the estimates are consistent in terms of direction with the results of RCT; b) strength of the association between treatment and outcome varies across statistical methods and covariates included, which is also indicated in [42].Clearly, without cautious controlling for the confounding and sampling mechanism, results using the crude method will not be reliable in the aspect of both magnitude and directionality of the treatment effect, as reported in Crude methods on population X N CR ∼ D N CR and X sub ∼ D QU ASAR ′ .However, when correction methods are used carefully, we do obtain estimates that are close to the ground truth; c) regarding the performance of the four groups of methods, the DR estimations, particularly modeled by the non-parametric BART, seem more robust in obtaining the treatment effect than IPW.Within the direct methods the choice of a model, the non-parametric model, namely, ps-BART, is quite flexible and performs properly.

The results of the estimation with different trimmed values
Regarding the effect of covariates on the estimation of ATE, we find that the estimate using the least number of covariates is closest to the ground truth from the RCT while estimates from models using more covariates deviate from the ground truth.The right panel of Figure 4 shows the distribution of PS between treatment and control, indicating that adjusting for more covariates leading to more extreme PS.Although IPW estimator is theoretically unbiased, it is sensitive to extreme weights, and hence weights are commonly truncated or bounded to shrink the variance of the estimator.
In our study, weights are trimmed at a fixed level, namely, at 0.1 and 0.99, 0.5 and 0.95, 0.1 and 0.9, and 0.2 and 0.8 [20] and the results are shown in the left panel of Figure 4. We found that across the results, trimming the IPW weights decreases the variance of the estimator but makes the point estimate substantially vary across trimmed values.Besides, we also find that IPW methods benefit from the trimming, which is quite clear when looking at complex models using 5, 13 and all covariates.This impact on the complex model can also be validated by looking at the estimates with only three covariates: in this case, the point estimation doesn't vary across different trim values probably due to the comparable specification of PS model with the ground truth.Overall, in our study, trimming the extreme weight can help the IPW method approach to the ground truth.

Subgroup analysis
From the previous results, we find that high risk and other pathological characteristics are sensitive to the estimation.To understand the reason that leads to this overestimation, we explore the estimation in the subgroups of stage II and stage III classified by the high-risk level using Direct/ps-BART.The results are shown in Figure 5.We found that by adjusting the high risk the method can identify the heterogeneous treatment effect among the subgroups in stage II cancer.This finding is supported by a) the treatment effect on stage III patients are the same in the model with and without adjusting for high risk covariate ; b) the treatment effect for stage II colon cancer who are not identified as high risk are the same and is approximately equal to zero in all the models, regardless of high risk being adjusted or not; c) the heterogeneity of treatment effect in stage II cancer is supported by many studies indicating that part of stage II high-risk colon cancer.for example, pT4, can achieve the survival benefit from the adjuvant chemotherapy [11,43].
Based on these findings, although the models using three covariates can estimate the treatment effect comparable to the RCT, we can't rule out the possibility that the true treatment effect of the NCR data is very close to the estimate from the models using 5, 13 and all covariates, because no pathological information was provided in the RCT to identify the heterogeneous treatment effect of the high risk stage II patients.Overall, the model using three covariates without adjusting for high risk can estimate the treatment effect as close as possible to the ground truth from the RCT while at the cost of inability to identify the heterogeneity among the population, in particular, the subgroup of stage II cancer.

Sensitivity analysis
The subgroup analysis implicitly indicates that the estimation for stage III patients are robust to various covariates included while stage II patients are quite sensitive to the type and the number of covariates.Hence, we can use sensitivity analysis in subgroups to test for the impact of an unmeasured confounder and assess the extent to which we allow the unconfoundedness assumption to be violated in subgroups.
In our study, a two-parameter sensitivity analysis based on flexible non-parametric methods BART is implemented [44] and the results are presented in Figure 6.It implies that the estimate of stage II colon cancer, particularly those are unknown risk, is sensitive to the unmeasured confounder and the sign of the estimate can even change in the case where, for instance, some high risk factors are not controlled for.This implies that more information about this subgroup should be collected to identify the high risk stage, otherwise the estimation might not truly reveal the treatment effect.However, for stage III patients, the estimation of the treatment effect is quite robust to the unmeasured confounders and seemingly deterministic.

Discussion
In this study, we exploited the unique opportunity of having access to large-scale colon cancer registry data to compare three different methods of controlling for potential confounding in observational data, namely, direct method, IPW method, and DR method.Our study, by comparing the resulting estimates to those obtained in a large scale RCT, validated that these methods can be used to obtain the estimates of treatment effects close to the ground truth.This opens up the door to use the large-scale, high-quality registry data regarding colon cancer to further strengthen the (Dutch) guidelines by adding, as a third source of information next to RCT data and expert consensus, the available registry data.It has to be noted, however, that adding this third source requires care: methods using naive estimators -such as the crude method evaluated in this study -can be highly biased.We can only use the registry data when both the sampling process and the assignment process are properly controlled for.
Next to demonstrating the utility of observational data in this specific case, we also find evidence of the advantages of modeling the outcome (as done in the direct method and the DR method) over solely modeling the treatment assignment process (IPW method).This latter result might be because it is hard to properly estimate this process.Besides, the supreme performance of non-parametric BART over simple parametric LReg is quite promising, which implies its use for further analysis.In the subgroup analysis, we find that when adjusting for the high-risk in models, the heterogeneous treatment effect could be identified, although the estimation might deviate from the result of the RCT.This dilemma of whether or not to include more covariates as to obtain both the accurate and heterogeneous estimation could be solved by access to a variety of data with complete information in each subgroup, which could probably shrink the variation of estimation from the model using all the covariates as shown in Figure 5.The unconfoundedness assumption is also explored using sensitivity analysis and the result is qualitatively reasonable.
In general, the estimates we obtained from the NCR observational data -as can be seen in Figure 3 -are often higher than those in the ground truth Quasar trial.While we have discussed several sources of this discrepancy due to our explored correction methods, there are possible two alternative explanations.On one hand, although we have resampled the registry data to as much as possible match the population in two data sources, there is still uncorrectable heterogeneity between populations at baseline.For example, more stage II patients are identified as high risk in registry data than RCT, which consequentially increases the overall estimation.On the other hand, it is now known that CAPOX and FOLFOX are more effective than FU/LV [4].The fact that in the NCR data more patients received CAPOX/FOLFOX as opposed to FU/LV, while in the QUASAR trial all patients received FU/LV, might truly increase the average treatment effect of chemotherapy overall in the NCR.
Regarding the subsampling process, we made two assumptions a) the assumption of independence among gender, age and stage when estimating the distribution of population in QUASAR based on the population-level statistics provided in the end of the enrollment; b) the assumption of no change of the estimated distribution over time when subsampling the target population accordingly from the NCR data.These two assumptions, fortunately, can approximately be checked by the paper published during that time.Some studies reported that males and females have a similar distribution of incidence by age from 1971 to 1999 [45,46].Therefore, the variation of the distribution of age and gender in colon cancer over time could be ignorable and hence doesn't violate the assumption.However, all these two assumptions are still untestable without patient-level data with covariate information.

Conclusion
To conclude, our study shows that it is possible to obtain treatment effects from observational data when analyzed carefully.Without cautious controlling for confounders, the results from observational data may mislead people to make incorrect decisions.Our study also shows that the colon cancer registry data, when handled with care, can potentially be used to inform future guidelines: at the very least the ATE obtained from this data source -with the appropriate controls -provides estimates that are similar to those obtained in an RCT.For the first STRATUM, P (ST RAT U M ) = P (stageII ∩ male ∩ age<50) = P (stageII) × P (male) × P (age<50) = 0.91 × 0.61 × 0.11 = 0.061 (see Table 3 for more details of distributions of these covariates in the QUASAR).

Figures Figure 1
Figures

Figure 6 Supplementary
Figure 6Supplementary Files

Table 2
provides an overview of the methods introduced in section 3.1, 3.2, 3.3 and 3.4.Parametric LReg and non-parametric BART with the default setting introduced in section 3.1.2are used to fit both models for the potential outcomes E[Y i |Z i , X i ] and the treatment assignment E(Z i |X i ).

Table 1
Baseline characteristics of stage II and III patients who underwent surgery between 2006 And 2015

Table 2
Methods description

Table 3
Baseline characteristics of the study population of in the NCR after subsampling and QUASAR

Table 4
Stratified sampling profile P(Stratum): the probability of stratum in the population P with sample size 45485 P(STRATUM): the probability of stratum in the study population S with sample size N=7704, which is consistent with that of QUASAR