2.1 Individual-level UK Biobank dataset
We used UK Biobank data to investigate the influence of birthweight on breast cancer [18], with age at menarche and age at menopause as two candidate mediators. After applying a quality control procedure similar to that described in prior work [19], we obtained a total of 337,198 independent individuals of white British ancestry aged 48–82 years. We kept only female individuals with breast cancer as cases, leaving 7,350 breast cancer patients. To guarantee the temporal ordering among age at menarche, age at menopause and breast cancer, which is necessary for a causal interpretation of the estimated effects [20, 21], we confirmed that no breast cancer patient reported an age at menopause earlier than her age at menarche, and excluded 1,590 patients whose age at diagnosis preceded their age at menarche or age at menopause. This left 5,760 breast cancer patients. To maximize the sample size and thus statistical power, we included all female individuals without breast cancer as controls, giving a total of 162,778 controls. Besides birthweight, age at menarche, age at menopause and breast cancer status, we primarily incorporated age (at the end of the last data collection), menopause status (yes/no), ever smoked (yes/no) and BMI as potential covariates (Table 1).
Table 1
Descriptive statistics of the UK Biobank data after quality control used in the mediation analysis
variable | N | mean ± sd or yes/no |
birthweight (kg) | 108,956 | 3.2 ± 0.6 |
age at menarche (year) | 163,763 | 13.0 ± 1.6 |
age at menopause (year) | 96,098 | 49.8 ± 5.0 |
age (year) | 168,539 | 66.5 ± 7.9 |
BMI (kg/m2) | 130,635 | 27.0 ± 5.1 |
breast cancer | 168,538 | 5,760/162,778 |
menopause or not | 141,969 | 102,889/39,080 |
ever smoked or not | 167,983 | 93,396/74,587 |
Note: BMI: body mass index; N: the sample size available for each variable, which differs across variables because of different patterns of missing values; sd: standard deviation.
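The exclusion step described in Section 2.1 can be sketched as follows. This is a minimal illustration, not the authors' code: the `Case` record, its field names and the toy values are hypothetical, and the treatment of missing ages (a missing age never triggers an exclusion) is an assumption, since the text does not state how missing values were handled at this step. The last lines only re-check the sample-size arithmetic reported above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Case:
    """Hypothetical record for one breast cancer patient (field names assumed)."""
    age_menarche: Optional[float]
    age_menopause: Optional[float]
    age_diagnosis: float


def respects_temporal_order(case: Case) -> bool:
    """True if diagnosis does not precede menarche or menopause.

    Assumption: a missing age (None) cannot violate the ordering.
    """
    for age in (case.age_menarche, case.age_menopause):
        if age is not None and case.age_diagnosis < age:
            return False
    return True


def filter_cases(cases: List[Case]) -> List[Case]:
    """Keep only the cases whose diagnosis follows both reproductive events."""
    return [c for c in cases if respects_temporal_order(c)]


# Toy data (made up for illustration only).
toy_cases = [
    Case(age_menarche=13.0, age_menopause=50.0, age_diagnosis=58.0),  # kept
    Case(age_menarche=12.0, age_menopause=52.0, age_diagnosis=45.0),  # excluded
    Case(age_menarche=14.0, age_menopause=None, age_diagnosis=47.0),  # kept (assumption)
]
kept = filter_cases(toy_cases)

# Sample-size bookkeeping using the counts reported in the text.
n_cases_after_exclusion = 7_350 - 1_590
n_women_analyzed = n_cases_after_exclusion + 162_778
```

With the reported counts, the bookkeeping reproduces the 5,760 cases and the 168,538 women listed for breast cancer status in Table 1.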
2.2 Mediation association from birthweight to breast cancer mediated by age at menarche or age at menopause
Using the UK Biobank dataset, we explored the association between birthweight (the exposure X) and breast cancer (the outcome Y), with age at menarche (the first mediator M1) and age at menopause (the second mediator M2) as two potential mediators (Fig. 2). We implemented the mediation analysis under the traditional framework, with the set of covariates varying across mediation models [22]. The basic principle was to include a covariate in a model only if it was measured at the same time as the outcome of that model. For example, when age at menarche was the outcome, we included no covariates because none of them was measured at a woman's age at menarche in the UK Biobank dataset; when breast cancer was the outcome, we considered BMI and smoking but did not include age at menopause if only age at menarche was analyzed as the mediator. Moreover, we fitted a linear model for continuous outcomes (e.g., age at menarche or age at menopause) and a logistic model for binary outcomes (e.g., breast cancer).
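Written out, the model structure described above might look as follows when both mediators are analyzed together. This is a sketch under stated assumptions: the coefficient symbols ($\alpha$, $\beta$, $\gamma$), the covariate vectors $C_2$ and $C_Y$ and the error terms $\varepsilon_1$, $\varepsilon_2$ are notation introduced here for illustration and are not taken from the original analysis. Whether the model for $M_2$ carries covariates, and which ones, is likewise an assumption; the model for $M_1$ has none, following the rule stated above.

```latex
\begin{align}
M_1 &= \alpha_0 + \alpha_1 X + \varepsilon_1, \\
M_2 &= \beta_0 + \beta_1 X + \beta_2 M_1 + \beta_C^{\top} C_2 + \varepsilon_2, \\
\operatorname{logit} \Pr(Y = 1) &= \gamma_0 + \gamma_1 X + \gamma_2 M_1 + \gamma_3 M_2 + \gamma_C^{\top} C_Y .
\end{align}
```

The first two equations are the linear models for the continuous mediators, and the third is the logistic model for the binary outcome.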
With two mediators under consideration, we highlight that age at menarche can affect age at menopause but not vice versa [23]. As a consequence, there are eight association possibilities, with four possible paths from birthweight to breast cancer (Fig. 2): (i) a direct association between birthweight and breast cancer that is mediated by neither age at menarche nor age at menopause; (ii) birthweight affects breast cancer through age at menarche alone; (iii) birthweight affects breast cancer through age at menopause alone; (iv) birthweight affects breast cancer through age at menarche and subsequently through age at menopause. The sum of the effect sizes on these paths equals the total causal effect of birthweight on breast cancer, and the sum of the last three effect sizes can be viewed as the indirect effect of birthweight. Note that, because the temporal ordering of these variables is fixed, the estimated effects have a causal interpretation if the additional sequential ignorability assumptions are satisfied [20].
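The four paths and their summation can be illustrated with a small product-of-coefficients sketch. It is an illustration only: the edge coefficients below are made-up numbers, not estimates from the UK Biobank analysis, and the function names are hypothetical. The decomposition is exact when all models are linear; with a logistic outcome model the sum on the log-odds scale should be read as approximate.

```python
# Directed edges of the assumed mediation diagram, each with a hypothetical
# coefficient (made-up values, NOT estimates from the UK Biobank analysis).
EDGES = {
    ("X", "M1"): -0.10,   # birthweight -> age at menarche
    ("X", "M2"): 0.05,    # birthweight -> age at menopause
    ("M1", "M2"): 0.20,   # age at menarche -> age at menopause (not the reverse)
    ("X", "Y"): 0.02,     # direct path, birthweight -> breast cancer
    ("M1", "Y"): -0.03,   # age at menarche -> breast cancer
    ("M2", "Y"): 0.04,    # age at menopause -> breast cancer
}


def find_paths(source, target, edges):
    """Enumerate all directed paths from source to target (the graph is acyclic)."""
    if source == target:
        return [[target]]
    found = []
    for (u, v) in edges:
        if u == source:
            for tail in find_paths(v, target, edges):
                found.append([source] + tail)
    return found


def path_effect(path, edges):
    """Multiply the coefficients along one path (product-of-coefficients rule)."""
    effect = 1.0
    for u, v in zip(path, path[1:]):
        effect *= edges[(u, v)]
    return effect


all_paths = find_paths("X", "Y", EDGES)
effects = {" -> ".join(p): path_effect(p, EDGES) for p in all_paths}

direct = effects["X -> Y"]
indirect = sum(value for name, value in effects.items() if name != "X -> Y")
total = direct + indirect

for name, value in sorted(effects.items()):
    print(f"{name}: {value:+.4f}")
print(f"indirect = {indirect:+.4f}, total = {total:+.4f}")
```

Because the diagram has no edge from age at menopause back to age at menarche, the search finds exactly the four paths (i) to (iv) listed above.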
2.3 Evaluating linear and non-linear relationship between birthweight and breast cancer
As shown below, although we failed to detect an obvious linear relationship between birthweight and breast cancer in Model 1 and identified only a marginally significant association between them in Model 8, we cannot completely exclude the possibility of a non-linear association between birthweight and breast cancer [3, 10, 16]. To assess this, we fitted a logistic regression that additionally included the square of birthweight. We also discretized birthweight into three categories (< 2.5, 2.5–4.0, or > 4.0 kg, following prior work; see, for instance, Table S1) and carried out a similar logistic analysis. Furthermore, we performed a stratified analysis by menopause status in each analysis setting. All statistical analyses were implemented in the R software environment [24], with a significance level of 0.05.
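The data preparation behind these analyses can be sketched as below. The analyses themselves were run in R; this Python fragment is only an illustration of the three preparation steps (squared term, three-level categorization, stratification) and does not fit any model. The category labels, the field names, the toy records and the choice to treat 2.5 and 4.0 kg as belonging to the middle category are assumptions made for this sketch.

```python
from collections import defaultdict


def birthweight_category(bw_kg: float) -> str:
    """Assign one of three categories; labels and boundary handling are assumed."""
    if bw_kg < 2.5:
        return "low"
    if bw_kg <= 4.0:
        return "middle"
    return "high"


def quadratic_terms(bw_kg: float) -> tuple:
    """Linear and squared birthweight terms for the non-linear logistic model."""
    return (bw_kg, bw_kg ** 2)


def stratify(records, key):
    """Group records by the value of `key`, e.g. menopause status."""
    strata = defaultdict(list)
    for record in records:
        strata[record[key]].append(record)
    return dict(strata)


# Toy records (made up for illustration only).
toy = [
    {"birthweight": 2.3, "menopause": True, "breast_cancer": 0},
    {"birthweight": 3.4, "menopause": True, "breast_cancer": 1},
    {"birthweight": 4.2, "menopause": False, "breast_cancer": 0},
]
for record in toy:
    record["bw_cat"] = birthweight_category(record["birthweight"])
    record["bw"], record["bw_sq"] = quadratic_terms(record["birthweight"])

by_menopause = stratify(toy, "menopause")
```

Each stratum in `by_menopause` would then be passed to the same logistic model as the full sample, once with the squared term and once with the categorical variable.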