2.1 Data source
In this study, we used summary-level data from genome-wide association studies (GWAS) to explore the relationship between three reproductive factors and lung cancer through two-sample MR analysis. We extracted the single nucleotide polymorphisms (SNPs) that were reliably associated with age at first sexual intercourse (AFS; ebi-a-GCST90000047), age at menarche (ukb-a-315) and age at menopause (ukb-b-17422) from the datasets archived in the GWAS database. Similarly, we obtained summary-level data for the outcomes LUCA, LUAD, LUSC and SCLC from the datasets ieu-a-966, ieu-a-965, ieu-a-967 and finn-b-C3_SCLC, respectively. All SNPs were selected from European populations to satisfy Hardy–Weinberg equilibrium and to minimize demographic distribution bias (10). These GWAS data are listed in Table 1.
2.2 Statistical approach
MR analysis is a novel approach for addressing questions in human biology and epidemiology that uses genetic variants as instrumental variables (IVs) (11). Genetic information is transmitted from parents to offspring, or from cell to cell at each cell division (12), and is therefore largely unaffected by confounding factors and reverse causality. MR analysis identifies genetic variants related to the target exposure through large-scale GWAS and then applies them to independent datasets to generate unbiased estimates of the effect of the exposure on the outcome (13).
2.3 Selection of instrumental variables
A reliable instrumental variable must meet the following conditions: (1) the instrumental variable is reliably associated with the exposure (P < 5 × 10⁻⁸); (2) the instrumental variable is not associated with confounding factors; and (3) the instrumental variable affects the outcome only through the exposure (14,15). Figure 1 provides an overview of the basic principles, design, and process of our MR analysis.
First, we extracted SNPs that were strongly associated with the three reproductive factors (P < 5 × 10⁻⁸) from the GWAS datasets as their respective IVs. Second, we excluded SNPs in linkage disequilibrium (LD), retaining only independent SNPs (r² < 0.001 within a distance of 10,000 kb). As smoking is a common risk factor for lung cancer, SNPs associated with “smoking” and “cigarette using” were also removed. Third, to avoid the bias caused by weak instrumental variables, we calculated the F-statistic for each SNP by using the formula (16):
$$F = \frac{R^2 (N - k - 1)}{k (1 - R^2)}$$
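to estimate the strength of the genetic instruments, where N is the sample size and k is the total number of SNPs selected for MR analysis. The following formula was used to calculate R² for each significant SNP (16):
$$R^2 = \frac{2 \times \mathrm{EAF} \times (1 - \mathrm{EAF}) \times \mathrm{BETA}^2}{2 \times \mathrm{EAF} \times (1 - \mathrm{EAF}) \times \mathrm{BETA}^2 + 2 \times \mathrm{EAF} \times (1 - \mathrm{EAF}) \times N \times \mathrm{SE}^2}$$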
where BETA is the estimated effect of the SNP on the exposure, EAF is the effect allele frequency, N is the sample size, and SE is the standard error of the effect estimate for each SNP. An F-statistic > 10 suggests that a SNP is a sufficiently strong instrument to explain phenotypic variation, whereas an F-statistic < 10 indicates a weak instrument (17). The F-statistics for the three reproductive factors are shown in Supplementary Table 1. Fourth, from the SNPs associated with the three reproductive factors, we extracted the corresponding SNPs for each of the four outcomes at a correlation criterion of r² < 0.01. Furthermore, ambiguous SNPs with incompatible alleles and palindromic SNPs with ambiguous strand orientation were excluded from the MR analysis. We also excluded SNPs with an LD distance > 10,000 kb or a minor allele frequency (MAF) < 0.01, as well as multidirectional outlier SNPs. The final numbers of SNPs used for each reproductive factor in the lung cancer analyses are shown in Supplementary Table 1.
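The instrument-strength check described above can be sketched in a few lines. This is a minimal illustration, not the code used in the study: the R² expression is the commonly used approximation based on BETA, EAF, N and SE and is assumed here to match the cited formula, the function names are hypothetical, and the input values are invented for demonstration.

```python
def snp_r2(beta, eaf, se, n):
    """Approximate proportion of exposure variance explained by one SNP.

    Assumed form: 2*EAF*(1-EAF)*BETA^2 divided by
    (2*EAF*(1-EAF)*BETA^2 + 2*EAF*(1-EAF)*N*SE^2).
    """
    genetic_var = 2.0 * eaf * (1.0 - eaf) * beta ** 2
    residual_var = 2.0 * eaf * (1.0 - eaf) * n * se ** 2
    return genetic_var / (genetic_var + residual_var)


def f_statistic(r2, n, k=1):
    """F = R^2 (N - k - 1) / (k (1 - R^2)); k = 1 for a per-SNP F-statistic."""
    return r2 * (n - k - 1) / (k * (1.0 - r2))


# Hypothetical summary statistics for a single SNP (not taken from the study).
r2 = snp_r2(beta=0.05, eaf=0.30, se=0.005, n=100_000)
f = f_statistic(r2, n=100_000, k=1)
is_strong = f > 10  # conventional threshold for a sufficiently strong instrument
```

With these made-up inputs the SNP explains roughly 0.1% of the exposure variance and clears the F > 10 threshold by a wide margin.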
2.4 Statistical analysis
Our study applied MR analysis to explore the genetic associations between three reproductive factors [age at first sexual intercourse (AFS), age at menarche and age at menopause] and four outcomes [overall lung cancer (LUCA), lung adenocarcinoma (LUAD), squamous cell lung cancer (LUSC) and small cell lung cancer (SCLC)].
To evaluate the causal effect of each reproductive factor on lung cancer risk and to test the sensitivity of the results to different patterns of violation of the instrumental variable assumptions, we mainly used three statistical methods: inverse variance weighted (IVW), MR-Egger and weighted median (WM) (18). The IVW method requires every genetic variant to satisfy the instrumental variable assumptions; when this condition is met, its statistical efficiency is higher than that of the other two methods (19), so it was used as the primary method. The MR-Egger method provides a sensitivity analysis for the robustness of the MR results (20). The WM method generates consistent estimates even if up to 50% of the information comes from invalid instrumental variables (21). The estimated effects of the three reproductive factors are reported as odds ratios (OR) with corresponding 95% confidence intervals (CI).
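To make the estimators concrete, the following sketch computes a fixed-effect IVW estimate, converts it to an OR with a 95% CI, and computes a weighted median point estimate from per-SNP summary statistics. It is an illustration under simplifying assumptions (first-order weights, no random-effects adjustment, no bootstrap standard error for the median) and does not reproduce the R packages' implementations; all function names and numbers are hypothetical.

```python
import math


def wald_ratios(beta_exp, beta_out, se_out):
    """Per-SNP ratio estimates and their first-order standard errors."""
    ratios = [bo / be for be, bo in zip(beta_exp, beta_out)]
    ses = [so / abs(be) for be, so in zip(beta_exp, se_out)]
    return ratios, ses


def ivw(beta_exp, beta_out, se_out):
    """Fixed-effect inverse variance weighted estimate and its standard error."""
    ratios, ses = wald_ratios(beta_exp, beta_out, se_out)
    weights = [1.0 / s ** 2 for s in ses]
    estimate = sum(w * r for w, r in zip(weights, ratios)) / sum(weights)
    return estimate, math.sqrt(1.0 / sum(weights))


def weighted_median(beta_exp, beta_out, se_out):
    """Weighted median of the ratio estimates (point estimate only)."""
    ratios, ses = wald_ratios(beta_exp, beta_out, se_out)
    order = sorted(range(len(ratios)), key=lambda i: ratios[i])
    r = [ratios[i] for i in order]
    w = [1.0 / ses[i] ** 2 for i in order]
    total = sum(w)
    # Normalised cumulative weight up to the midpoint of each ordered estimate.
    cum, running = [], 0.0
    for wi in w:
        running += wi
        cum.append((running - 0.5 * wi) / total)
    if cum[0] >= 0.5:
        return r[0]
    # Linear interpolation between the two estimates that bracket 50% of the weight.
    for i in range(len(r) - 1):
        if cum[i] < 0.5 <= cum[i + 1]:
            frac = (0.5 - cum[i]) / (cum[i + 1] - cum[i])
            return r[i] + frac * (r[i + 1] - r[i])
    return r[-1]


# Hypothetical data: three SNPs imply a log-OR of 0.2, the fourth is an invalid outlier.
beta_exp = [0.10, 0.20, 0.10, 0.10]
beta_out = [0.02, 0.04, 0.02, 0.10]
se_out = [0.01, 0.01, 0.02, 0.02]

est, se = ivw(beta_exp, beta_out, se_out)
odds_ratio = math.exp(est)
ci_low, ci_high = math.exp(est - 1.96 * se), math.exp(est + 1.96 * se)
wm = weighted_median(beta_exp, beta_out, se_out)
```

In this toy example the single invalid SNP pulls the IVW estimate above 0.2, while the weighted median stays at the value supported by the majority of the weight, which is the behaviour that motivates reporting both.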
The MR-pleiotropy residual sum and outlier (MR-PRESSO) method was used to identify and remove outlier SNPs (22). We then used three approaches to test the sensitivity of the MR results. We tested heterogeneity among SNPs using the IVW and MR-Egger methods, with funnel plots as validation (23,24). We applied the MR-Egger method to evaluate whether horizontal pleiotropy was present among the SNPs (25). In addition, we employed the leave-one-out method to detect whether any single SNP had a significant impact on the results (26). All MR analyses were conducted using R software (version 4.2.3) with the R packages TwoSampleMR, MendelianRandomization, and MRPRESSO.
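Two of the sensitivity checks above can be illustrated with a short sketch. It assumes that heterogeneity is quantified with Cochran's Q around a fixed-effect IVW estimate, which is a common choice but is not stated explicitly in the text, and it implements leave-one-out by recomputing the IVW estimate with each SNP removed in turn. Function names and data are hypothetical, and MR-Egger and MR-PRESSO are not covered here.

```python
def ivw_estimate(beta_exp, beta_out, se_out):
    """Fixed-effect IVW estimate using first-order inverse variance weights."""
    weights = [(be / so) ** 2 for be, so in zip(beta_exp, se_out)]
    ratios = [bo / be for be, bo in zip(beta_exp, beta_out)]
    return sum(w * r for w, r in zip(weights, ratios)) / sum(weights)


def cochran_q(beta_exp, beta_out, se_out):
    """Cochran's Q around the IVW estimate and its degrees of freedom."""
    est = ivw_estimate(beta_exp, beta_out, se_out)
    q = sum(
        ((bo - est * be) / so) ** 2
        for be, bo, so in zip(beta_exp, beta_out, se_out)
    )
    return q, len(beta_exp) - 1


def leave_one_out(beta_exp, beta_out, se_out):
    """IVW estimate recomputed with each SNP dropped in turn."""
    n = len(beta_exp)
    estimates = []
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        estimates.append(
            ivw_estimate(
                [beta_exp[j] for j in keep],
                [beta_out[j] for j in keep],
                [se_out[j] for j in keep],
            )
        )
    return estimates


# Hypothetical data: the fourth SNP has a ratio estimate far from the others.
beta_exp = [0.10, 0.20, 0.10, 0.10]
beta_out = [0.02, 0.04, 0.02, 0.10]
se_out = [0.01, 0.01, 0.02, 0.02]

q, df = cochran_q(beta_exp, beta_out, se_out)
loo = leave_one_out(beta_exp, beta_out, se_out)
```

Here almost all of Q comes from the fourth SNP, and dropping it is the only leave-one-out run that returns the estimate to 0.2, which is the kind of pattern these checks are meant to expose.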