We simulated a cohort of 100,000 (50,000 male and 50,000 female) individuals who were followed from birth to 95y (or death, whichever occurred first). The mortality rates for age bands 0-1y, 1-5y and every 5y from 5 to 95y were obtained from the US 1959-1961 birth cohort for white males and females(12).
We assumed the first-ever stroke occurred from age 45y. The marginal incidence rates for first stroke between 45-95y were calibrated using the 10y ischaemic stroke incidence rates from the Oxford Vascular Study(10). The rates were adjusted by an increase of ~50-100% for each sex and age band in our simulation, as the Oxford Vascular Study members were less deprived(18) and had a lower risk of stroke than the rest of England(19). Details about the adjustments are provided in the Supplementary materials (A1).
To quantify the survivor bias, we generated data under the null hypothesis that the male-female difference in first-ever stroke incidence remained 5 per 1000 person-years across age bands from 45 to 95y. Any deviation of the observed sex difference from 0.005 per person-year reflected survivor bias. We allowed incident stroke to be associated with an increased hazard of death within the same or subsequent age band(s), i.e. the effect of stroke history. We assumed that there existed an unobserved, time-invariant construct U that influenced the risk of stroke and/or survival (e.g. genetic variants or lifestyle factors), and that there were no other confounders or mediators in the sex–stroke causal pathways (Figure 1). We explored four causal scenarios, defined by the relations between U, incident stroke and survival for males and females, to assess sex disparities in stroke incidence in each situation given the mortality rates at different life stages.
Causal scenarios
In all four causal scenarios (Figure 1), the mortality rate was higher in males than females, especially during early and middle life(12), consistent with the UK national statistics(20). We assumed that stroke incidence rates were also higher for males than females. In addition, the hazard ratios (HR) for mortality associated with instantaneous stroke and with stroke history were each set to the same value across all causal scenarios.
In the simulated cohort, U was generated from a standard normal distribution for each individual. The effect of U on incident stroke was the same in all four scenarios (HR=1.5), while its effect on survival (or mortality) varied (Figure 1). Details about the simulation parameters are provided in Supplementary materials (A2).
In causal scenario A, U did not influence mortality (HR=1.0). In scenarios B to D, U had a direct effect on mortality. In scenario B, U influenced mortality with the same magnitude for males and females (HR=1.5). In scenario C, U influenced mortality for males (HR=1.5), but not females (HR=1.0). In scenario D, the influence of U on mortality was greater for males (HR=2.5) than females (HR=1.5). As U was positively associated with stroke and/or mortality, individuals with larger values of U were more likely to die, leaving a selected sample of survivors with smaller values of U. Therefore, the distribution of U among survivors would provide insights into the magnitude of survivor bias.
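The four scenarios can be written as a main effect of U plus a sex-by-U interaction on the log-hazard scale. The sketch below is illustrative only (in Python, whereas the study itself used R): the hazard ratios are those stated above, but the names `SCENARIO_HR` and `u_coefficients`, and the coding of sex as 1 = male and 0 = female, are assumptions made for this example.

```python
import math

# Hazard ratios for the effect of U on mortality, as stated in the text.
# The dictionary layout and key names are illustrative, not the authors'.
SCENARIO_HR = {
    "A": {"male": 1.0, "female": 1.0},
    "B": {"male": 1.5, "female": 1.5},
    "C": {"male": 1.5, "female": 1.0},
    "D": {"male": 2.5, "female": 1.5},
}


def u_coefficients(scenario):
    """Return (main effect, interaction) log-hazard coefficients for U.

    Assumes sex is coded 1 = male, 0 = female (an assumption, not stated
    in the text). The main effect is then the female log-HR and the
    interaction is the male-minus-female difference in log-HR.
    """
    hr = SCENARIO_HR[scenario]
    main = math.log(hr["female"])
    interaction = math.log(hr["male"]) - math.log(hr["female"])
    return main, interaction


for name in SCENARIO_HR:
    main, interaction = u_coefficients(name)
    print(name, round(main, 3), round(interaction, 3))
```

Under this coding, only scenarios C and D have a non-zero interaction, which is what makes the selection on U differ between males and females.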
Data simulation and analyses
Under each scenario, we generated 1000 datasets, each consisting of 100,000 individuals followed to 95y. In each simulated dataset, survival data were generated for each age band (0-1y, 1-5y and every 5y thereafter) according to a piecewise Cox proportional hazard model:
$$H_{i,t}^{survival} = \lambda_{0,t}^{survival} \cdot \exp\{\beta_{1,t} \cdot \text{gender}_{i} + \beta_{2} \cdot U_{i} + \beta_{3} \cdot \text{gender}_{i} \cdot U_{i} + \beta_{4} \cdot \text{stroke history}_{i,t}\} \tag{1}$$
where \(H_{i,t}^{survival}\) denotes the survival hazard for individual i in band t; \(\lambda_{0,t}^{survival}\) denotes the baseline survival hazard and \(\beta_{1,t}\) denotes the sex difference in survival in band t (same for all individuals); \(\beta_{2}\) and \(\beta_{3}\) denote the effect of U and the interaction between U and sex on survival, respectively; \(\beta_{4}\) represents the influence of stroke history on survival.
We first generated the survival hazard \({H}_{i,t}^{survival}\), and then the survival time \({T}_{i,t}^{survival}\) by an inverse variable transformation under the exponential failure time model(21). If \({T}_{i,t}^{survival}\) was greater than the interval length, individual i survived band t and the simulation continued in the next band (t+1); otherwise, individual i died in band t.
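Under a constant hazard H within a band, the survival function is exp(-H·T), so a uniform draw V on (0, 1] gives a survival time T = -ln(V)/H. The following minimal sketch shows that step in Python (the study itself used R). The function names and all numeric inputs in the usage lines are hypothetical placeholders, apart from the HR of 1.5 for U, and do not reproduce the calibrated parameters.

```python
import math
import random


def band_hazard(baseline, b_sex, b_u, b_sex_u, b_history, male, u, history):
    """Piecewise proportional-hazards rate for one individual in one band.

    Argument names are illustrative; `male` and `history` are 0/1 flags.
    """
    linear = b_sex * male + b_u * u + b_sex_u * male * u + b_history * history
    return baseline * math.exp(linear)


def draw_event_time(hazard, rng):
    """Inverse-transform draw from an exponential distribution with this rate."""
    v = 1.0 - rng.random()  # uniform on (0, 1], which avoids log(0)
    return -math.log(v) / hazard


def survives_band(hazard, band_length, rng):
    """True if the drawn survival time exceeds the length of the band."""
    return draw_event_time(hazard, rng) > band_length


# Usage with placeholder values (not the study's calibrated rates).
rng = random.Random(2024)
h = band_hazard(baseline=0.01, b_sex=0.3, b_u=math.log(1.5), b_sex_u=0.0,
                b_history=0.0, male=1, u=0.5, history=0)
print(survives_band(h, band_length=5.0, rng=rng))
```

Over many draws, the proportion surviving a band of length L approaches exp(-H·L), which gives a simple check on the sampler.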
Similarly, the first incident stroke (after 45y) was generated according to the piecewise Cox proportional hazard model:
$$H_{i,t}^{stroke} = \lambda_{0,t}^{stroke} \cdot \exp\{\gamma_{1} \cdot \text{gender}_{i} + \gamma_{2} \cdot U_{i}\} \tag{2}$$
where \(H_{i,t}^{stroke}\) denotes the stroke hazard for individual i in band t; \(\lambda_{0,t}^{stroke}\) denotes the baseline stroke hazard in band t; \(\gamma_{1}\) (fixed at log(0.005)) represents the male-female difference in stroke incidence (i.e., 5 per 1000 person-years for all age bands); \(\gamma_{2}\) denotes the effect of U on first incident stroke. As in the survival model (1), we first generated the stroke hazard \(H_{i,t}^{stroke}\), and then the time to the first incident stroke \(T_{i,t}^{stroke}\) by an inverse variable transformation. Conditional on individual i surviving the previous band (t-1), if \(T_{i,t}^{stroke}\) was less than 5y, individual i experienced the first incident stroke in band t; otherwise, the individual remained stroke-free throughout the band. Details about the data generating process are provided in Supplementary materials (A3).
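The band-by-band logic for stroke and death can be sketched for a single individual as follows. This is a simplified illustration in Python (the study used R) and not the procedure in Supplementary materials (A3): the bands, baseline rates and most coefficients are hypothetical placeholders, the sex-by-U interaction is omitted, and stroke and death times within a band are drawn independently without ordering them. Only the HR of 1.5 for the effect of U on stroke comes from the text.

```python
import math
import random

# Hypothetical 5-year bands from 45y; all rates are placeholders,
# not the calibrated values used in the study.
BANDS = [(45, 50), (50, 55), (55, 60)]
BASE_DEATH = [0.005, 0.008, 0.012]
BASE_STROKE = [0.001, 0.002, 0.003]


def exp_time(rate, rng):
    """Inverse-transform draw of an exponential waiting time."""
    return -math.log(1.0 - rng.random()) / rate


def simulate_person(male, u, rng, b_sex_death=0.3, b_u_death=0.0,
                    b_sex_stroke=0.2, b_u_stroke=math.log(1.5),
                    b_history=math.log(2.0)):
    """Follow one person through BANDS; return one record per band entered.

    All default coefficients except b_u_stroke are illustrative assumptions.
    """
    records, had_stroke = [], False
    for (start, end), h0_death, h0_stroke in zip(BANDS, BASE_DEATH, BASE_STROKE):
        length = end - start
        stroke_now = False
        if not had_stroke:  # only stroke-free individuals are at risk
            h_stroke = h0_stroke * math.exp(b_sex_stroke * male + b_u_stroke * u)
            stroke_now = exp_time(h_stroke, rng) < length
        # Stroke in this or an earlier band raises the hazard of death.
        history = 1 if (had_stroke or stroke_now) else 0
        h_death = h0_death * math.exp(
            b_sex_death * male + b_u_death * u + b_history * history)
        died = exp_time(h_death, rng) <= length
        records.append({"band": start, "at_risk": not had_stroke,
                        "stroke": stroke_now, "died": died})
        had_stroke = had_stroke or stroke_now
        if died:
            break  # no further bands are simulated after death
    return records


rng = random.Random(7)
print(simulate_person(male=1, u=rng.gauss(0.0, 1.0), rng=rng))
```

Each record carries an `at_risk` flag so that later incidence calculations can restrict to individuals who entered the band alive and stroke-free.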
Once the survival and stroke data were simulated, we derived the cumulative death and survival rates and the stroke incidence rates for males and females in each age band from 45y under the four causal scenarios. The stroke incidence rate during a specific age band was calculated as the total number of events (i.e. first incident strokes) divided by the total person-years at risk. Individuals at risk were those who were alive and had not experienced a first incident stroke at the beginning of the age band.
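As a small worked example of this calculation, the sketch below computes an incidence rate from per-person records and the resulting male-female difference per 1000 person-years. It is written in Python for illustration (the study used R). The function name and the toy data are hypothetical, and it assumes each at-risk individual contributes time until stroke, death or the end of the band, which the text does not spell out.

```python
def incidence_rate(records):
    """Events per person-year from (years_at_risk, had_event) pairs.

    Assumes each pair describes one individual who was alive and
    stroke-free at the start of the band.
    """
    events = sum(1 for _, had_event in records if had_event)
    person_years = sum(years for years, _ in records)
    return events / person_years if person_years > 0 else float("nan")


# Toy data for a single 5-year band (not simulated study output).
males = [(5.0, False), (2.5, True), (5.0, False), (1.0, True)]
females = [(5.0, False), (5.0, False), (3.0, True), (5.0, False)]

rate_m = incidence_rate(males)      # 2 events over 13.5 person-years
rate_f = incidence_rate(females)    # 1 event over 18.0 person-years
diff_per_1000 = 1000 * (rate_m - rate_f)
print(round(diff_per_1000, 2))
```

In the simulation setting, this difference would be compared with the 5 per 1000 person-years fixed under the null hypothesis.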
All metrics were derived for each dataset, and then averaged over all datasets under each causal scenario. Data simulation and analyses were performed in R (version 3.6.1).