Ethical approval
The datasets employed were obtained exclusively from publicly accessible repositories, underscoring our commitment to transparency and data accessibility. Documentation of ethical clearance and approvals for each constituent study analyzed in our research is available in the corresponding primary articles[13-15].
The selection of data sources
Table S1 presents statistics of the genome-wide association studies (GWAS) used in our statistical analyses. Kurilshikov et al., through the establishment of the MiBioGen consortium, curated available GWAS data on gut microbiota (GM) to investigate the interplay between genetic variation and microbial composition. The consortium enrolled 18,340 participants from 24 cohorts across 11 countries, with both 16S rRNA gene sequencing profiles of microbial composition and genotyping data[13]. The resulting dataset comprised 211 taxa and 122,110 variant loci; the 119 genera with an average abundance exceeding 1% were selected for further analysis in our study. Figure 1 shows a schematic diagram of the Mendelian randomization (MR) study of GM, circulating inflammatory proteins, and breast cancer progression. We aimed to examine the relationship between GM and breast cancer in greater depth by analyzing the mediating role of circulating inflammatory proteins. Our study adheres to the STROBE-MR guidelines.
Figure 1. Schematic design of the two-sample Mendelian randomization study of gut microbiota and breast cancer. The design hypothesizes that gut microbiota influence the course of breast cancer by modulating circulating inflammatory proteins.
2.1 Data source for inflammatory proteins
We used the circulating inflammatory proteins GWAS dataset reported by Zhao JH et al., encompassing 14,824 participants across 11 cohorts. This dataset identified a total of 180 protein quantitative trait loci (pQTLs), comprising 59 cis and 121 trans variants[14]. GWAS analysis was conducted independently for each cohort using linear regression under an additive genetic association model. The effect on each inflammatory protein was quantified as the change in inverse-normalized protein level per copy of the effect allele. To mitigate potential confounding, analyses were adjusted for population substructure using genetic principal components, and covariates (e.g., age and sex) were incorporated into the model.
2.2 Data source for breast cancer
We used the findings of a breast cancer GWAS conducted by Michailidou K et al.[15]. That study included 228,951 individuals of European ancestry and 27,172 individuals of Asian ancestry. The analysis identified 65 novel loci associated with overall breast cancer risk. Additionally, by integrating computational data to identify target driver genes within breast cells at each locus, the study demonstrated a significant overlap between candidate target genes and somatic driver genes.
2.3 Instrumental variable
The significance threshold for instrumental variables (IVs) linked to breast cancer was set at P < 5 × 10⁻⁸.
Because few single nucleotide polymorphisms (SNPs) were associated with GM genera or circulating inflammatory proteins at the genome-wide significance threshold (P < 5 × 10⁻⁸), a threshold of P < 1 × 10⁻⁵ was adopted when screening IVs for both GM genera and circulating inflammatory proteins. The rationale for these thresholds has been described in detail in prior studies[16, 17]. To ensure reliability and precision, additional criteria were applied to all IVs: (i) linkage disequilibrium (LD) clumping to satisfy the MR assumptions (R² < 0.001, clumping window = 10,000 kb), using the European subset of the 1000 Genomes Project as the reference panel; (ii) exclusion of palindromic SNPs to avoid ambiguity in effect-allele alignment; (iii) removal of outliers identified by the Radial-MR method to ensure robustness of the data.
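The P-value and palindromic-SNP criteria above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the `Snp` record, the function names, and the toy SNP identifiers are hypothetical, and LD clumping and Radial-MR outlier removal are deliberately left out because they require a reference panel and a dedicated implementation.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    """Hypothetical minimal record for one candidate instrument."""
    rsid: str
    effect_allele: str
    other_allele: str
    pval: float


# Complementary base pairs. A SNP whose two alleles are complements of each
# other (A/T or C/G) is palindromic: it reads the same on both strands.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(snp: Snp) -> bool:
    """Return True when the two alleles are each other's complement."""
    return COMPLEMENT.get(snp.effect_allele.upper()) == snp.other_allele.upper()


def select_instruments(snps, p_threshold):
    """Keep SNPs below the P-value threshold that are not palindromic."""
    return [s for s in snps if s.pval < p_threshold and not is_palindromic(s)]


# Toy data with made-up identifiers and P-values (illustration only).
candidates = [
    Snp("rs_demo1", "A", "G", 3e-6),  # passes both filters
    Snp("rs_demo2", "A", "T", 1e-7),  # palindromic (A/T)
    Snp("rs_demo3", "C", "T", 2e-4),  # fails the P-value threshold
    Snp("rs_demo4", "G", "C", 5e-9),  # palindromic (G/C)
]

kept = select_instruments(candidates, p_threshold=1e-5)
print([s.rsid for s in kept])
```

In practice, the same function would be called with the threshold appropriate to each exposure (1 × 10⁻⁵ for GM genera and inflammatory proteins in this study).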
2.4 Statistical analysis
2.4.1 Mendelian randomization
First, we conducted MR analyses to evaluate the potential pathway involving gut flora, inflammatory proteins, and breast cancer. Specifically, we examined: (i) the effects of the 119 GM genera on breast cancer; (ii) the effects of circulating inflammatory proteins on breast cancer; and (iii) for the GM genera linked with breast cancer, their influence on circulating inflammatory proteins. Our primary method for obtaining univariable Mendelian randomization (UVMR) estimates was inverse-variance weighting (IVW), complemented by the weighted median (WM) and MR-Egger methods. The IVW estimate is given by the slope of a weighted regression of the SNP-outcome effects on the SNP-exposure effects, with the intercept constrained to zero[18]. The WM method provides consistent estimates as long as at least half of the weight comes from valid IVs. The MR-Egger method includes an intercept term, and an intercept significantly different from zero (P < 0.05) indicates horizontal pleiotropy. Further details on these MR methods can be found elsewhere[19, 20]. Several sensitivity analyses were conducted. First, Cochran's Q test was used to examine heterogeneity of the results, with P < 0.05 indicating heterogeneity. Second, the MR-PRESSO method was used to identify potential outlier SNPs indicative of pleiotropy and to validate the robustness of the data. False discovery rate (FDR) correction was applied to each UVMR analysis, with the FDR threshold set at Pfdr < 0.05[21]. An exposure was considered to be associated with the outcome if P < 0.05.
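As an illustration of the IVW estimator and Cochran's Q statistic described above, the following sketch computes both from per-SNP summary statistics. It is not the code used in this study: the function names and toy effect sizes are hypothetical, a fixed-effect standard error is assumed, and the conversion of Q to a P-value (a chi-square test with one fewer degrees of freedom than the number of SNPs) is omitted.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Slope of a weighted regression of SNP-outcome effects on SNP-exposure
    effects through the origin, with weights 1 / se_outcome**2."""
    weights = [1.0 / se ** 2 for se in se_outcome]
    numerator = sum(w * bx * by
                    for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx * bx for w, bx in zip(weights, beta_exposure))
    beta = numerator / denominator
    se = math.sqrt(1.0 / denominator)  # fixed-effect standard error (assumption)
    return beta, se


def cochran_q(beta_exposure, beta_outcome, se_outcome, beta_ivw):
    """Sum of squared standardized residuals around the IVW regression line."""
    return sum(((by - beta_ivw * bx) / se) ** 2
               for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))


# Toy summary statistics for three SNPs (made-up values, illustration only).
bx = [0.10, 0.20, 0.30]   # SNP-exposure effects
by = [0.05, 0.11, 0.14]   # SNP-outcome effects (log odds)
se = [0.01, 0.01, 0.01]   # standard errors of the SNP-outcome effects

beta_ivw, se_ivw = ivw_estimate(bx, by, se)
q_stat = cochran_q(bx, by, se, beta_ivw)
q_df = len(bx) - 1
print(beta_ivw, se_ivw, q_stat, q_df)
```

A large Q relative to its degrees of freedom suggests that the per-SNP ratio estimates disagree more than sampling error alone would explain.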
2.4.2 Multivariable Mendelian randomization
Multivariable MR (MVMR) was conducted to investigate potential mediation of the effect of GM on breast cancer. MVMR, an extension of UVMR, allows the overall effect of an exposure on an outcome to be decomposed into direct and indirect effects[22, 23]. In our study, UVMR was used to estimate the total effect of the exposure on the outcome, and MVMR was used to estimate the direct effect of the exposure on the outcome while accounting for potential mediation.
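The decomposition can be sketched numerically as follows. This example assumes the difference method, in which the indirect (mediated) effect is the total effect from UVMR minus the direct effect from MVMR; the text above does not state which mediation estimator was applied, and the function name and input values are hypothetical.

```python
def decompose_effect(total_effect, direct_effect):
    """Split a total effect into indirect effect and proportion mediated.

    Assumes the difference method: indirect = total - direct.
    Effects are expected on the log-odds scale.
    """
    if total_effect == 0:
        raise ValueError("proportion mediated is undefined for a zero total effect")
    indirect_effect = total_effect - direct_effect
    proportion_mediated = indirect_effect / total_effect
    return indirect_effect, proportion_mediated


# Made-up effect sizes on the log-odds scale (illustration only).
total = 0.20    # exposure -> outcome, from UVMR
direct = 0.15   # exposure -> outcome adjusted for the mediator, from MVMR

indirect, proportion = decompose_effect(total, direct)
print(indirect, proportion)
```

Confidence intervals for the indirect effect require additional steps (for example, the delta method or bootstrapping) that are not shown here.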
2.4.3 Statistical analysis
Statistical analyses were performed in R (version 4.2.1, R Foundation for Statistical Computing, Vienna, Austria), with effect sizes presented as odds ratios (ORs) and corresponding 95% confidence intervals. The "TwoSampleMR" and "MendelianRandomization" packages were used for the MR analyses.
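For reference, an MR estimate on the log-odds scale can be converted to an OR with a 95% confidence interval as in the sketch below. It is written in Python for illustration only (the analyses themselves were run in R), it assumes a normal approximation, and the function name and input values are hypothetical.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% quantile of the standard normal


def odds_ratio_ci(beta, se, z=Z_95):
    """Convert a log-odds estimate and its standard error to (OR, lower, upper)."""
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))


# Made-up estimate (illustration only): log-odds of ln(1.2) with SE 0.05.
or_point, or_lower, or_upper = odds_ratio_ci(math.log(1.2), 0.05)
print(f"OR = {or_point:.2f} (95% CI {or_lower:.2f}-{or_upper:.2f})")
```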