Study design and cohort characteristics
We used data in the UK Biobank (UKB) (http://www.ukbiobank.ac.uk). The UKB is a large prospective study that aims to improve the diagnosis, treatment, and prevention of disease. Full details are described elsewhere 50. It includes more than 500,000 participants aged 37-73 years, with baseline recruitment conducted between 2006 and 2010. Informed consent was obtained during enrolment, as was permission to access medical and other health-related data for research purposes. The UKB has approval from the North West Multi-Centre Research Ethics Committee (MREC) and the National Information Governance Board for Health and Social Care (11/NW/0382).
Cancer status was ascertained through linkage to national cancer registries 51. Information on cancer registrations was available up to October 2016, which includes the diagnoses code according to the International Classification of Diseases (ICD; ninth and tenth editions). We mapped all cancer-related ICD codes into “phecodes” which better reflect disease coding as relevant for clinical practice 52. We excluded participants who had self-reported having had cancer but did not have a record in the cancer registry. For participants with multiple cancer diagnoses, we included the first diagnosed cancer based on the date of diagnosis. As controls, we used participants with no report of any type of cancer-based on self-report, cancer registry, or hospital inpatient data, or benign or in situ tumours from the cancer registry. Applying criteria previously used by others 9, we grouped cancers according to whether they were sensitive to hormonal variation, classifying cancers of the breast, endometrium, ovary, prostate, and thyroid as “hormone-sensitive cancer”. Our analyses included 235,512 controls and 15,197 hormone-sensitive cancer cases. Incident cancer cases were defined as those diagnosed after the baseline assessment before the end of follow-up ( October 2016) and prevalent cases were those diagnosed before baseline assessment in the UKB. The basic covariates and covariates used for statistical adjustments are described in detail in the supplementary file (Supplementary method).
Genotypic Data: to control for artifacts introduced to the data during genotyping, initial standard quality control (QC) measures were applied to all data sets before analyses. The genotype data in the UKB includes 92,693,895 SNPs genotyped from 488,377 study participants. The QC procedure for the genotypic data focused on two levels i.e., at individual and SNP level. First, at the individual level, we exclude individuals with a call rate of less than 95% and individuals who did not self-identify as white British ancestry or who exhibited sex inconsistencies (sex mismatch between self-reported phenotype sex and genotype determined sex data) and had a putative sex chromosome aneuploidy (chromosomal anomalies). To check identical genes shared through common ancestors, we randomly selected individuals from a pair and excluded those pairs in which their genomic relationship is larger than 0.05. Furthermore, to avoid bias induced as a result of population stratification and to ensure participants are taken from a relatively homogenous population, we checked the population substructure in the Principal Component (PC) analysis to the excluded individual as population outliers with the first or second PC outside ± 6 SD of the population mean. Based on the release of the UKB genotype dataset, for those who were included in both the first and second, we calculated the genotype discordance rate between imputed genotype of the two versions for each SNP and each individual and exclude those with a genotype discordance rate of more than 0.05. Secondly at the SNP level, genetic markers with an INFO score <0.6, markers that deviate significantly from Hardy-Weinberg equilibrium (HWE) (1.00E-07) or with a call rate <0.95, with MAF <0.01 and ambiguous or duplicated SNPs were excluded. Additional specific cohort-level quality control measures can be found in the reference cohort-specific publications 53. To avoid systematic differences between cases and controls being interpreted as genetic variance, a more stringent quality-control process was then applied to the data. This included excluding individuals with incomplete phenotype data and re-moving markers with a minor allele frequency of less than 1%. In this study, we used high-quality SNPs from the International HapMap Project [HapMap3] that were reliable in estimating genetic variance and covariance at the genome-wide level, feasible for more complicated analyses and there was no substantial difference between estimated genetic variance from HapMap3 and 1000 genome SNPs54. After QC, 1,217,312 HapMap3 SNPs with 288,837 study participants have remained for the analyses.
Statistical Information
For the Univariate heritability estimate, we assumed a linear mixed model for the heritability analysis as follows:
$$y=Xb+{Z}_{a}a+\epsilon$$
where y is a vector of the response variable (cancer status); b is the vector of regression coefficients for the fixed effects; a is additive genetic effects with variance; ε is residual (environment effects) with variance and Z and X is the design of matrix of the fixed effects 14.
For the heritability estimate, the genomic relationship matrix (GRM) was constructed using plink software 55, 56. To estimate the Univariate heritability of the subgroups of cancers, two different methods were applied. First, we used the genomic relationship matrix-restricted maximum likelihood (GREML) method, which is based on the individual level genotype data. Second, as linkage disequilibrium score (LDSC) regression method largely depends on summary level genotype data, using the UKB individual genotype data, we computed the summary statistics. We used the pre-computed LD score for white Europeans 57 which is considered suitable for standard LDSC analysis in European populations to use it in a command-line tool of LDSC. For each method, we used both incident and prevalent cases together in the dataset as cases. The analyses were repeated restricting prospective [incident] cancer cases only. With the use of the prevalence rate of the subgroups of cancers, the observed scale estimates were transformed to liability scale according to Lee et al using MTG2 software. We used χ2 which is distributed following a chi-square distribution with 2 degrees of freedom and Wald tests.
The GREML method requires individual-level genotype data and is computationally demanding 55. The sample size of the UKB is large, therefore, we randomly subdivided the dataset to shorten computing time and applied a meta-analysis approach. We first divided the samples into two groups, UKBB1 (91,472 individuals from the first release) and the other samples except for UKBB1, named as UKBB2. In UKBB2, 197,365 individuals with genotype data passed the QC. We further randomly divided the UKBB2 into two groups of equal size (denoted as UKB2A [n=98,682] and UKB2B [n=98,683]) and fitted all models mentioned above for each group. We then meta-analysed the heritability and other related estimates from UKB2A, UKB2B, and UKBB1 using the Fisher’s method 58. For UKBB2, we used the same variables for adjustment as UKBB1.
Genome-wide association (GWAS) analysis
Recent advances in computational methods have facilitated the investigation of genetic variants and their effects on multiple complex diseases, i.e., GWAS. After estimating heritability, we, therefore, extend the analysis to estimate the effects of genome-wide SNPs associated with causal genes on the group of hormone-sensitive cancers as a single trait GWAS, using a logistic regression model. The phenotype used for the GWAS analysis is similar to the SNP-based heritability estimate. In total, 15,197 hormone-sensitive cancer cases, including breast cancer, prostate, endometrial, ovarian, thyroid, and 223,207 controls were included in the GWAS analysis. The phenotype is similarly adjusted to multiple variables to the heritability estimate to identify significant SNPs using the list of common SNPs from HapMap3. We first computed the statistical power of the study for hormone-sensitive cancers using the online available software GAS Power calculator for genomic study 59. The power calculation is conducted under the assumptions of genetic models (i.e., additive), 5% minor allele frequencies (MAFs), pair-wise LD, a 6.34% disease prevalence, 1:1 case-to-control ratio, and 5% level of significance. We found the sample size of hormone-sensitive cancers was sufficient to achieve 80% statistical power according to the additive genetic model applied. The power curve is attached in the supplementary file. [Supplementary Fig. 2].
We performed post GWAS analyses that involves constructing a quantile-quantile (QQ) plot for hormone-sensitive cancers in each case [all hormone-sensitive cancers Vs prospective hormone-sensitive cancer cases only]. We further quantified the degree of genomic inflation factor (lambda = λ) i.e., how best the observed data points fit to the expected value. The QQ plots in each case showed the bulk of the distribution is in the lower tail of the graph.
We identified genome-wide significant SNPs for hormone-sensitive cancers using plink software 56 to obtain the GWAS P-values that were used for the Manhattan plot for qqman package in R. For the post GWAS analysis to see the genomic inflation factor (λ = lambda), we plot QQ plot using QCEWAS package in R. λ is the median of the resulting chi-square test statistics divided by the expected median of the chi-square distribution. The median of a chi-squared distribution with one degree of freedom is 0.4549364, i.e., [qchisq(0.5,1) = 0.4549364]. A λ value is calculated from p-values in the output we have from the genome-wide association analysis. Low significant results are removed (there are more significant results than expected) to increase the lambda value. To rescale the lambda value to provide better information, we use the following formula to rescale the lambda calculated 60.
where n is the study sample size for cases and controls respectively, and ncases,1000 and ncontrols,1000 is the target sample size of 1000.
Phenotypic correlation:
Estimates of phenotypic and genetic correlation were computed separately between hormone-sensitive cancer and each non-cancer trait. The phenotypic correlation was estimated using Pearson correlations between each pair of traits for complete observation in R. To examine the genetic architecture further, we performed phenotypic correlation for components of hormone-sensitive cancers using the leave-one-out analysis approach. The results are summarized and presented in Table 2.
Genetic correlation analysis
As Bivariate LDSC estimates are not biased with sample overlap wherein controls are common in both traits and computationally very efficient 24, we run the genetic correlation to generate an overview of the genetic relationship between hormone-sensitive cancers and the six non-cancer subgroup traits. We then used the Bivariate GREML approach to estimate the genetic correlation between hormone-sensitive cancers and seven non-cancer traits. Further, we examine the genetic correlation between each component of hormone-sensitive cancers using a pair-wise comparison approach. The genetic correlation (rg ± SE) is calculated using cross-trait LD Score regression method.
As most oestradiol hormone is bound to the serum protein sex-hormone binding globulin (SHBG) and Albumin, i.e., biologically unavailable to exhibit its physiologic effect, implying the need to compute the free hormone level, we calculated the free concentration using serum oestradiol and the concentration of SHBG and Albumin with their respective association constant K 61.
$$cFO=({E}_{2}-{N}_{Total})/(\left({N}_{SHBG}\right)-\left({E}_{2}+{N}_{{E}_{2}}\right))$$
where cFo = calculated free oestradiol; E2= serum oestradiol level; NE2 =0.64x109*Albumin level +1; NSHBG = 5.55x104 * SHBG level; and NTOTAL = NSHBG + NE2
Leave-one-out (LOO) approach to determine the genetic correlation of hormone-sensitive cancers
The iterative scheme of leave-one-out analysis is carried out by using a different possible combination of hormone-sensitive cancers in cross-trait LDSC regression. The grouped hormone-sensitive cancer comprised of 5 distinct heterogeneous cancers and we created a 5-fold leave-one-out analysis that involves the different possible combinations of the hormone-sensitive cancers. During each iterative step, we exclude data of one independent cancer at a time and use the remaining cancer types as grouped hormone-sensitive cancers to compute the genetic correlation in Bivariate LDSC. These steps are iteratively completed five times. The analysis sketch map demonstrating all the possible combinations is summarized in Supplementary Fig. 4.
Gene-environment interaction
Finally, we checked the gene-environment interaction for hormone-sensitive cancers with selected traits using Bivariate GREML and GxEsum techniques for traits with continuous level measurement. The Bivariate GREML approach is applied with the assumptions of gene-environment interactions in contrast to the Univariate GREML model that assumes the absence of GxE interactions. Here in this method, we stratified the hormone-sensitive cancer phenotype by traits regarded as environments [i.e., BMI-normal Vs high; metabolic environment-favourable Vs unfavourable; and sex-male Vs female] to look for interactions. Such approach allows us to test whether the genetic effects are heterogeneous if individuals lie in the same environment thereby test for gene-environment interaction.
A recently proposed alternative method for quantitative traits, called GxEsum is able to estimate gene-environment interaction. This method is built on LDSC approach by using GWAS summary statistics and suggested as computationally efficient method 26. For SNP effects modulated by quantitative environment, the expected chi-square statistics (χ²) is
$$E\left[{\chi }^{2}|\iota j\right]= \frac{N{\delta }_{g1 }^{2}}{M}*lj+1+2({\delta }_{g1}^{2}+{\delta }_{\tau 1}^{2})$$
,
where N is the number of individuals, M is the number of SNPs, \({\delta }_{g1}^{2}\)is the variance due to GxE, \(\) \({\delta }_{\tau 1}^{2}\) is the variance due to residual heterogeneity or scale effects caused by residual-environment interaction (RxE), \(\iota j\) is the LD score at the variant j.
The P-value in this study is calculated applying the Wald-test with the assumption of the distribution of estimated genetic correlation was normal. The statistical significance level was set at p<0.05 (2-tailed).
Software: We have used the well-established MTG2 14 software to conduct the Bivariate GREML analyses and estimate the genetic correlation coefficient between each non-cancer trait and subgroups of cancer. For MTG2, the source code, executive binary file, user manual, and toy examples for practice are readily available for downloads using the link https://sites.google.com/sit/honglee0707/mtg2. The GxEsum model is implemented in the script that is publicly available at https://github.com/honglee0707/GxEsum. The version of source code used in the manuscript is deposited with DOI:10.5281/zenodo.4659681 at https://zendo.org/record/4659681#.YGKZXc9xeUk. The rest statistical analyses were performed using publicly available software that includes plink1.9, LDSC, and analysis packages in R & Python.
Reporting summary: Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability: All data will be available to approved users of the UK Biobank upon application. The authors state that all data necessary for confirming the conclusions presented in the manuscript are represented fully within the manuscript.