## 2.1 Data

To evaluate the impact of accounting, or failing to account, for baseline imbalances, skewed costs, correlated costs and effects, and missing data in trial-based economic evaluations, empirical data from two previously published trial-based economic evaluations were used: the REALISE and HypoAware studies.

## REALISE study

In the Rehabilitation After Lumbar disc Surgery (REALISE) study, early rehabilitation after lumbar disc surgery was compared to no referral after lumbar disc surgery among 169 participants (intervention group: n = 92; control group: n = 77). Resource use was measured from a societal perspective at 6, 12 and 26 weeks follow-up using cost questionnaires[42]. Resource use was valued using Dutch standard costs[43]. Utility values were based on the EuroQol (EQ-5D-3L), which was administered at baseline and 3, 6, 9, 12 and 26 weeks follow-up[42]. Utility values were estimated using the Dutch tariff for the EQ-5D-3L[44]. Quality-adjusted life years (QALYs) were calculated using linear interpolation between measurement points.
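As an illustration of the QALY calculation described above, the following minimal sketch computes the area under the utility curve with the trapezoidal rule, which is what linear interpolation between measurement points amounts to. The function name `qalys_linear_interpolation` and the utility values are hypothetical; only the measurement schedule (baseline and 3, 6, 9, 12 and 26 weeks) is taken from the REALISE study.

```python
def qalys_linear_interpolation(times_weeks, utilities, weeks_per_year=52.0):
    """Area under the utility curve (trapezoidal rule), expressed in QALYs."""
    if len(times_weeks) != len(utilities):
        raise ValueError("times and utilities must have the same length")
    total = 0.0
    for i in range(1, len(times_weeks)):
        # Duration of the interval in years
        years = (times_weeks[i] - times_weeks[i - 1]) / weeks_per_year
        # Mean utility over the interval under linear interpolation
        mean_utility = (utilities[i] + utilities[i - 1]) / 2.0
        total += years * mean_utility
    return total


# Hypothetical utilities for one participant at the REALISE measurement points
times = [0, 3, 6, 9, 12, 26]
utils = [0.50, 0.60, 0.70, 0.75, 0.80, 0.85]
qalys = qalys_linear_interpolation(times, utils)
print(round(qalys, 4))
```

Over a 26-week follow-up, a participant in perfect health throughout (utility 1.0) would accrue 0.5 QALYs, which is the maximum attainable in this time horizon.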

## HypoAware study

In the HypoAware study, the HypoAware intervention (a blended, group and online psycho-educational intervention based on the evidence-based Blood Glucose Awareness Training) was compared to usual care among 137 participants (intervention group: n = 71; control group: n = 66)[45]. Resource use was measured from a societal perspective at 2, 4, and 6 months follow-up using cost questionnaires. Utility values were based on the EuroQol (EQ-5D-5L), which was administered at baseline and at 2, 4, and 6 months follow-up. Utility values were estimated using the Dutch tariff for the EQ-5D-5L[46]. QALYs were calculated using linear interpolation between measurement points.

Tables describing baseline characteristics of the REALISE and HypoAware study populations are included in the Appendix (Supplementary Tables 1 and 2). For a detailed description of both studies, the reader is referred elsewhere[42, 45-47].

## 2.2 Statistical analysis

In total, 14 full economic evaluations were performed for each of the REALISE and HypoAware studies. In the first analysis, a statistical approach was used in which baseline imbalances, the skewed nature of cost data, the correlation between costs and effects, and missing data were ignored. This approach simply compared the differences in costs and effects between the two groups using t-tests, including only participants with complete cost and effect data, while assuming that both costs and effects were normally distributed and that costs and effects were not correlated. Although this statistical approach ignores all of the challenges in trial-based economic evaluations, it is still being used in practice[7, 8, 48]. Step by step, the subsequent analyses accounted for the different statistical challenges, until in the final approach all of the statistical challenges were accounted for using the following methods:

__- Baseline imbalances:__

Regression-based adjustment was used[16, 49, 50]. Costs and effects were corrected for their baseline value, if available, and for relevant confounding variables. A variable was considered a confounder if the estimated regression coefficient for the cost or effect difference changed by 10% or more when the possible confounding factor was added to the model[50, 51]. For the REALISE study, confounders of costs were participants’ baseline mental health status, physical health status, risk of future work disability, fear-avoidance beliefs about work, treatment credibility and treatment expectations. Confounders of effects included the participants’ baseline utility value, mental health status, back pain, and risk of future work disability. For the HypoAware study, confounders of costs were the participants’ baseline costs, number of severe hypoglycemia episodes during the previous 6 months, and wearing a real-time sensor. Confounders of effects comprised the participants’ baseline utility value and marital status.
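The 10% change-in-estimate criterion can be sketched as follows. This is a simplified illustration with one candidate variable and made-up data, not the code used in the studies. The helper names `ols` and `is_confounder` are hypothetical, and the relative change is assumed to be computed against the unadjusted (crude) treatment coefficient.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X'X)b = X'y.

    Assumes the columns of X are not collinear (no singularity check).
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def is_confounder(treatment, outcome, candidate, threshold=0.10):
    """True if adding `candidate` changes the treatment coefficient by >= threshold."""
    crude = ols([[1.0, t] for t in treatment], outcome)[1]
    adjusted = ols([[1.0, t, c] for t, c in zip(treatment, candidate)], outcome)[1]
    return abs(adjusted - crude) / abs(crude) >= threshold


# Made-up data: outcome = 10 + 2 * treatment + 3 * covariate
treatment = [0, 0, 0, 0, 1, 1, 1, 1]
imbalanced = [1, 2, 1, 2, 3, 4, 3, 4]   # differs between groups
balanced = [1, 2, 1, 2, 1, 2, 1, 2]     # identical in both groups
y_imbalanced = [10 + 2 * t + 3 * c for t, c in zip(treatment, imbalanced)]
y_balanced = [10 + 2 * t + 3 * c for t, c in zip(treatment, balanced)]

flag_imbalanced = is_confounder(treatment, y_imbalanced, imbalanced)
flag_balanced = is_confounder(treatment, y_balanced, balanced)
print(flag_imbalanced, flag_balanced)
```

In the made-up data, the imbalanced covariate shifts the treatment coefficient from 8 to 2 and is flagged, whereas the balanced covariate leaves it unchanged and is not.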

__- Skewed costs:__

Non-parametric bootstrapping with 5,000 replications was used[33, 52-54]. The non-parametric bootstrap is a data-based resampling method to estimate statistical uncertainty without making any distributional assumptions[52]. Bootstrapped confidence intervals were estimated using the bias-corrected and accelerated bootstrap method. The advantage of bias-corrected and accelerated bootstrapping over percentile bootstrapping is that it adjusts better for skewness and bias of the sampling distribution, resulting in more accurate confidence intervals[52, 55]. In the REALISE study, the skewness of costs was 1.70 and the kurtosis was 5.75 (excess kurtosis 2.75). In the HypoAware study, the skewness of costs was 1.39 and the kurtosis was 3.90 (excess kurtosis 0.90). The positive skewness indicates that the distribution is skewed to the right, and the positive excess kurtosis indicates heavier tails than those of a normal distribution (i.e. relatively many outliers).
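To make the bias-corrected and accelerated (BCa) interval concrete, the sketch below applies it to a difference in mean costs using only the Python standard library. It is an illustration under stated assumptions rather than the procedure used in the studies: participants are resampled as whole rows of `(group, cost)`, resamples that happen to lack one group are redrawn, the acceleration constant comes from a leave-one-participant-out jackknife, and all cost values and function names are hypothetical.

```python
import random
from statistics import NormalDist, mean


def mean_difference(rows):
    """Mean cost in the intervention group (1) minus the control group (0)."""
    treated = [cost for group, cost in rows if group == 1]
    control = [cost for group, cost in rows if group == 0]
    return mean(treated) - mean(control)


def resample(rows, rng):
    """Draw participants with replacement; redraw if a group is absent."""
    while True:
        sample = rng.choices(rows, k=len(rows))
        if {group for group, _ in sample} == {0, 1}:
            return sample


def bca_interval(rows, n_boot=5000, alpha=0.05, seed=2024):
    """Return (estimate, lower, upper); each group needs >= 2 participants."""
    rng = random.Random(seed)
    normal = NormalDist()
    estimate = mean_difference(rows)
    boots = sorted(mean_difference(resample(rows, rng)) for _ in range(n_boot))

    # Bias correction: share of replicates below the observed estimate
    below = sum(b < estimate for b in boots)
    below = min(max(below, 1), n_boot - 1)  # keep inv_cdf finite
    z0 = normal.inv_cdf(below / n_boot)

    # Acceleration: skewness of the leave-one-out (jackknife) estimates
    jack = [mean_difference(rows[:i] + rows[i + 1:]) for i in range(len(rows))]
    centre = mean(jack)
    num = sum((centre - j) ** 3 for j in jack)
    den = 6.0 * sum((centre - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    def adjusted_index(level):
        # Map the nominal level to the BCa-adjusted percentile of the replicates
        z = normal.inv_cdf(level)
        p = normal.cdf(z0 + (z0 + z) / (1.0 - accel * (z0 + z)))
        return min(n_boot - 1, max(0, int(p * n_boot)))

    return estimate, boots[adjusted_index(alpha / 2)], boots[adjusted_index(1 - alpha / 2)]


# Hypothetical right-skewed costs (in euros) for a small trial
rows = [(1, c) for c in (1200.0, 800.0, 950.0, 4300.0, 700.0, 1500.0)] + \
       [(0, c) for c in (900.0, 600.0, 5200.0, 750.0, 1100.0, 650.0)]
estimate, lower, upper = bca_interval(rows)
print(round(estimate, 2), round(lower, 2), round(upper, 2))
```

When the bias correction and the acceleration constant are both zero, the adjusted percentiles reduce to the ordinary 2.5th and 97.5th percentiles, i.e. the percentile bootstrap interval.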

__- Correlation between costs and effects:__

Seemingly unrelated regression (SUR) analysis was used, in which two separate regression models were specified simultaneously (i.e. one for costs and one for effects)[33, 34]. In the REALISE study, the correlation between costs and effects was ρ = -0.42. In the HypoAware study, the correlation between costs and effects was ρ = -0.44. A negative correlation indicates that individuals with worse outcomes incur higher costs.
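In general form, a SUR model for a trial-based economic evaluation can be written as below. The notation is introduced here for illustration and is not taken from the original studies: $C_i$ and $E_i$ denote the costs and effects of participant $i$, $T_i$ is the treatment indicator, $\mathbf{x}_i^{C}$ and $\mathbf{x}_i^{E}$ are the (possibly different) covariate vectors of the two equations, and $\rho_{\varepsilon}$ is the correlation between the error terms, which need not equal the raw correlation between costs and effects reported above.

$$
\begin{aligned}
C_i &= \beta_{0}^{C} + \beta_{1}^{C} T_i + \boldsymbol{\beta}_{2}^{C\top}\mathbf{x}_i^{C} + \varepsilon_i^{C},\\
E_i &= \beta_{0}^{E} + \beta_{1}^{E} T_i + \boldsymbol{\beta}_{2}^{E\top}\mathbf{x}_i^{E} + \varepsilon_i^{E},\\
\operatorname{Cov}\!\begin{pmatrix}\varepsilon_i^{C}\\ \varepsilon_i^{E}\end{pmatrix}
&= \begin{pmatrix}\sigma_C^{2} & \rho_{\varepsilon}\,\sigma_C\,\sigma_E\\ \rho_{\varepsilon}\,\sigma_C\,\sigma_E & \sigma_E^{2}\end{pmatrix}.
\end{aligned}
$$

Under this formulation, $\beta_{1}^{C}$ and $\beta_{1}^{E}$ are the incremental costs and incremental effects, respectively. In approaches without baseline adjustment, the covariate terms drop out of both equations.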

__- Missing data:__

Missing data were assumed to be missing at random (MAR)[35]. Multiple Imputation by Chained Equations (MICE) with predictive mean matching (PMM) was used to predict and impute the missing values based on observed data[26, 56]. PMM was used to deal with the skewed distribution of costs[18]. The advantage of PMM is that it is more robust against non-normal data than linear regression estimation methods, as it uses the observed distribution of the data, so that values that do not occur in the observed data cannot be imputed[57]. The number of imputed datasets was increased until the loss of efficiency was less than 5%, resulting in 10 imputed datasets for the REALISE study and 20 imputed datasets for the HypoAware study[58]. The imputed datasets were analysed separately to obtain a set of estimates, which were then pooled using Rubin’s rules[35] to obtain overall estimates, variances, and confidence intervals[35, 58, 59]. In the REALISE study, 33 (24%) participants had missing cost data and 21 (15%) had missing effect data. In the HypoAware study, 28 (17%) participants had missing cost data and 20 (12%) had missing effect data.
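The pooling step can be illustrated with a minimal sketch of Rubin's rules for a single parameter, for example incremental costs. The function name `pool_rubin` and the input values are hypothetical. The relative efficiency is computed as 1 / (1 + λ/m), where λ, the share of the total variance attributable to missing data, is used here as a large-sample approximation of the fraction of missing information. Confidence intervals, which additionally require the degrees of freedom of the reference t-distribution, are left out.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their variances with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)                 # pooled point estimate
    within = mean(variances)                 # within-imputation variance
    between = variance(estimates)            # between-imputation variance (m - 1)
    total = within + (1 + 1 / m) * between   # total variance
    lam = (1 + 1 / m) * between / total      # variance share due to missing data
    rel_eff = 1 / (1 + lam / m)              # efficiency relative to m -> infinity
    return {
        "estimate": pooled,
        "total_variance": total,
        "lambda": lam,
        "relative_efficiency": rel_eff,
    }


# Hypothetical incremental cost estimates from m = 5 imputed datasets
estimates = [100.0, 120.0, 110.0, 130.0, 90.0]
variances = [400.0, 420.0, 380.0, 410.0, 390.0]
pooled = pool_rubin(estimates, variances)
print(pooled["estimate"], pooled["total_variance"])
print(round(1 - pooled["relative_efficiency"], 3))  # loss of efficiency
```

In this made-up example, the loss of efficiency with five imputations is about 8%, so under a 5% criterion the number of imputed datasets would be increased.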

An overview of the 14 analytical approaches used in this study, as well as the statistical challenges they account for, can be found in Table 1. For all approaches, incremental costs and QALYs, 95% confidence intervals around incremental costs and QALYs, incremental cost-effectiveness ratios (ICERs) and cost-effectiveness acceptability curves (CEACs) were estimated and compared. ICERs were calculated by dividing incremental mean costs by incremental mean QALYs. CEACs were estimated using the Incremental Net Monetary Benefit (INMB) approach[60]. CEACs represent the probability of an intervention being cost-effective (y-axis) for a range of different ceiling ratios (x-axis) and provide a summary measure of the joint uncertainty surrounding costs and effects[61, 62]. All analyses were performed in Stata/SE 16® (StataCorp LP, College Station, TX, USA).
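The relation between the ICER, the INMB and the CEAC can be sketched as follows. This is a simplified illustration rather than the Stata code used for the analyses: the CEAC is approximated as the share of replicated cost and effect differences (for example bootstrap replicates) with a positive INMB at each ceiling ratio, and the replicate values and function names are hypothetical.

```python
def icer(delta_cost, delta_qaly):
    """Incremental cost-effectiveness ratio: incremental costs per QALY gained."""
    if delta_qaly == 0:
        raise ZeroDivisionError("ICER is undefined when incremental QALYs are zero")
    return delta_cost / delta_qaly


def inmb(delta_cost, delta_qaly, ceiling_ratio):
    """Incremental net monetary benefit at a given ceiling ratio (euros per QALY)."""
    return ceiling_ratio * delta_qaly - delta_cost


def ceac(replicates, ceiling_ratios):
    """Probability of cost-effectiveness per ceiling ratio.

    `replicates` is a list of (delta_cost, delta_qaly) pairs.
    """
    return [
        sum(inmb(dc, dq, ratio) > 0 for dc, dq in replicates) / len(replicates)
        for ratio in ceiling_ratios
    ]


# Hypothetical replicates of (incremental costs in euros, incremental QALYs)
replicates = [(-200.0, 0.01), (300.0, 0.02), (500.0, -0.01), (100.0, 0.03)]
probabilities = ceac(replicates, [0, 10_000, 23_300])
print(probabilities)  # [0.25, 0.5, 0.75]
```

At a ceiling ratio of 0 €/QALY gained, the INMB equals minus the incremental costs, so the CEAC starts at the probability that the intervention is cost-saving.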

## 2.3 Comparison of the statistical approaches

Statistical approaches were compared in terms of how sensitive the point estimates are to changes in the statistical approach (i.e. value sensitivity) and how sensitive the conclusion of an economic evaluation is to changes in the statistical approach (i.e. decision sensitivity)[63]. Value sensitivity was assessed by comparing incremental costs and QALYs, the corresponding confidence intervals, and ICERs across the 14 statistical approaches. Decision sensitivity was assessed by comparing the CEACs of the 14 statistical approaches. For comparing and interpreting the CEACs, thresholds of 0 €/QALY gained, 10,000 €/QALY gained and 23,300 €/QALY gained (i.e. about 20,000 £/QALY gained) were used, which refer to a situation in which decision-makers are not willing to pay anything per QALY gained, the Dutch willingness-to-pay (WTP) thresholds (i.e. between 20,000 €/QALY gained and 80,000 €/QALY gained depending on disease severity) and the British National Institute for Health and Care Excellence (NICE) threshold, respectively.