Study design
UK Biobank is a large prospective cohort established in the United Kingdom (https://www.ukbiobank.ac.uk). Approximately 500,000 participants aged 40 to 69 years were enrolled between 2006 and 2010. All participants were recruited from 22 assessment centers across the UK, covering England, Scotland, and Wales. Data on sociodemographic characteristics, lifestyle, and health-related factors, together with blood samples, were collected during the interview (Sudlow et al. 2015). Furthermore, health-related outcomes, such as cancer, death status, and cause of death, were collected from multiple health-related registries, such as the national death registry and the national health system. All data were collected with the participants’ consent. All variables used in this study were measured at entry, except for disease outcomes.
The exclusion criteria were as follows: (1) individuals with missing data in the database; (2) individuals of genetically non-British Caucasian ancestry; (3) individuals whose genetic sex was inconsistent with their reported sex; and (4) individuals who had a first- or second-degree relative, as defined by a kinship coefficient ≥ 0.8. The detailed filtering process of the participants is shown in Fig. 1. In total, 388,557 individuals were included in this analysis. Over half of the participants (208,514, 53.66%) were female, and 180,043 (46.34%) were male.
In this study, access to the data was approved under UK Biobank applications 58484 and 63726, and the data were updated through August 22nd, 2020.
Mortality Data
We used the mortality data from UK Biobank, updated through August 22nd, 2020. The data were collected through death registries from England, Scotland, Northern Ireland, and Wales. We determined the primary cause of death for each individual. In total, 24,680 death records were matched and passed quality control. The dates of enrollment, cancer diagnosis, and death were also available from the UK Biobank database.
Genome-wide Genotyping Data
UK Biobank genotyping data included 488,377 participants. Genotyping was performed using the Affymetrix UK BiLEVE Axiom array on an initial 49,950 participants; the remaining 438,427 participants were genotyped using the Affymetrix UK Biobank Axiom array. The quality control of genotypes has been described previously (Bycroft et al. 2018). Genotype imputation was used to improve coverage by inferring the alleles of ungenotyped SNPs based on linkage disequilibrium (LD). Genotypes were imputed using the IMPUTE4 software with two reference panels: the Haplotype Reference Consortium (HRC) and a combined UK10K and 1000 Genomes Phase 3 haplotype resource (Auton et al. 2015).
Selection Of Diseases, Traits, And Biomarkers
We retrieved data on the burden of disease in the UK from open online resources, including the Global Cancer Observatory (GCO), the National Health Service (NHS), and GLOBOCAN (Bray et al. 2018). Thirteen chronic diseases and cancers with genetic backgrounds were selected according to the incidence of common diseases in the UK, in descending order (Sampson et al. 2015). They were Alzheimer’s disease (Desikan et al. 2017), essential hypertension, ischemic heart disease (Khera et al. 2016), type II diabetes, allergy, Parkinson’s syndrome, female breast cancer (Mavaddat et al. 2019), bladder cancer, colorectal cancer, lung cancer, prostate cancer, skin cancer, and thyroid cancer. All cancer cases in UK Biobank were clinically diagnosed and recorded by multiple cancer registries according to the International Classification of Diseases, 10th edition (ICD-10). We used the outcome data updated through August 2020.
We selected 31 traits and clinical biomarkers that international guidelines and references (Prins et al. 2017; Thompson and Voss 2009) highlight as important health indicators, including (1) anthropometric-related indicators: height, weight, body mass index (BMI), systolic blood pressure (sBP), diastolic blood pressure (dBP), mean arterial pressure (MAP), pulse rate (PR), smoking status (Ever/Never), alcohol intake (Ever/Never), and sleep duration; (2) hematological biomarkers: red blood cell (RBC), hemoglobin (Hb), hematocrit (Ht), mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), white blood cell (WBC), neutrophil, eosinophil, basophil, monocyte, lymphocyte, and platelet; (3) metabolism-related biomarkers: total cholesterol (TC), triglyceride (TG), HDL cholesterol (HDLC), LDL cholesterol (LDLC), and glucose; and (4) liver-related biomarkers: aspartate aminotransferase (AST), alanine aminotransferase (ALT), alkaline phosphatase (ALP), and γ-glutamyl transpeptidase (γGTP). A summary of these variables is listed in Supplementary Table 1. We performed correlation analyses among these variables to reduce potential overfitting.
To increase the reliability of measured indicators such as sBP, dBP, MAP, and pulse rate, UK Biobank measured each indicator twice during the interview, with a 10-minute break between measurements. We therefore used the mean of the two measurements as the value for each individual. Individuals with any missing values were removed from further analysis, as were individuals in the bottom or top 1% of any continuous variable. After QC, we excluded basophil and eosinophil from the first-step analysis because more than one third of their values were missing in our population.
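The phenotype QC steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the field names (`sbp_1`, `sbp_2`), the toy records, and the rank-based definition of the 1% tails are all assumptions made for the example.

```python
def mean_of_repeats(first, second):
    """Average two repeated readings; None if either reading is missing."""
    if first is None or second is None:
        return None
    return (first + second) / 2.0


def trim_tails(records, key, percent=1):
    """Drop the lowest and highest `percent` of records ranked by `key`.

    Assumption: cut-offs are rank-based, removing n * percent // 100
    records from each tail; ties are not treated specially.
    """
    ordered = sorted(records, key=lambda r: r[key])
    k = len(ordered) * percent // 100
    return ordered[k:len(ordered) - k]


# Hypothetical records: two systolic readings per participant.
raw = [{"id": i, "sbp_1": 100 + i, "sbp_2": 102 + i} for i in range(200)]
raw.append({"id": 200, "sbp_1": None, "sbp_2": 120})  # incomplete record

for r in raw:
    r["sbp"] = mean_of_repeats(r["sbp_1"], r["sbp_2"])

# Remove individuals with missing values, then trim both 1% tails.
complete = [r for r in raw if r["sbp"] is not None]
kept = trim_tails(complete, "sbp")
```

With 200 complete records, two are dropped from each tail. A percentile-based cut-off with interpolation would be an equally reasonable reading of the text.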
For diseases and cancers, the terms we retrieved followed the ICD-10 definitions provided in the data. We then matched four healthy individuals to each patient by age group and sex.
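A 1:4 matching of this kind could look like the sketch below. The text specifies only the ratio and the matching variables, so the rest is assumed for illustration: controls are drawn at random without replacement within each (age group, sex) stratum, a case is skipped if fewer than four controls remain, and the record layout and identifiers are hypothetical.

```python
import random
from collections import defaultdict


def match_controls(cases, controls, ratio=4, seed=0):
    """Map each case id to `ratio` control ids from the same stratum."""
    rng = random.Random(seed)
    pool = defaultdict(list)
    for c in controls:
        pool[(c["age_group"], c["sex"])].append(c["id"])
    for ids in pool.values():
        rng.shuffle(ids)  # random order within each stratum

    matched = {}
    for case in cases:
        stratum = pool[(case["age_group"], case["sex"])]
        if len(stratum) < ratio:
            continue  # assumption: cases without enough controls stay unmatched
        # pop() removes the control so it cannot be reused for another case
        matched[case["id"]] = [stratum.pop() for _ in range(ratio)]
    return matched


# Hypothetical example: one female case aged 50-59 and twelve candidate controls.
cases = [{"id": "case_1", "age_group": "50-59", "sex": "F"}]
controls = [
    {"id": f"ctrl_{i}", "age_group": "50-59", "sex": "F" if i % 2 == 0 else "M"}
    for i in range(12)
]
matched = match_controls(cases, controls)
```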
Genome-wide Association Studies
The GWAS was performed with the generalized linear model in PLINK2 software (Chang et al. 2015) (https://www.cog-genomics.org/plink/2.0/, v2.00a). We excluded variants with (1) INFO score < 0.9; (2) variant call rate < 95%; (3) minor allele frequency (MAF) < 0.001; or (4) Hardy-Weinberg equilibrium P-value ≤ 1\(\times {10}^{-20}\). Only autosomal variants were studied. Sex, age, age\(^{2}\), and the top ten genetic principal components were included as covariates. The significance threshold for SNPs was P < 5\(\times {10}^{-8}\). The sample sizes of the case-control GWAS analyses were smaller than those of the quantitative traits, which resulted in P-values higher than 5\(\times {10}^{-8}\).
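The variant-level filters above amount to a simple predicate. The sketch below restates them in Python for clarity; it is not how PLINK2 applies them, and the dictionary keys (`chrom`, `info`, `call_rate`, `maf`, `hwe_p`) and example variants are hypothetical.

```python
AUTOSOMES = {str(c) for c in range(1, 23)}


def passes_variant_qc(variant):
    """Keep a variant only if it triggers none of the four exclusion rules."""
    return (
        variant["chrom"] in AUTOSOMES      # autosomal variants only
        and variant["info"] >= 0.9         # exclude INFO score < 0.9
        and variant["call_rate"] >= 0.95   # exclude call rate < 95%
        and variant["maf"] >= 0.001        # exclude MAF < 0.001
        and variant["hwe_p"] > 1e-20       # exclude HWE P <= 1e-20
    )


def is_genome_wide_significant(p_value, threshold=5e-8):
    """Apply the P < 5e-8 threshold stated in the text."""
    return p_value < threshold


# Hypothetical variants for illustration.
good = {"chrom": "1", "info": 0.98, "call_rate": 0.99, "maf": 0.12, "hwe_p": 0.4}
rare = dict(good, maf=0.0005)
sex_linked = dict(good, chrom="X")
```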
To maintain sample size and increase the accuracy of the GWAS, we applied Leave-One-Group-Out (LOGO) cross-validation GWAS meta-analysis (Dimou et al. 2017), an approach widely used in machine learning and GWAS studies (Zeggini and Ioannidis 2009). We first randomly split all participants into ten equal subgroups, and a GWAS was performed independently within each subgroup for every phenotype. The GWAS results from nine subgroups were then combined by inverse-variance meta-analysis, while the remaining subgroup was held out for construction of the PRS. This process was repeated ten times, with each of the ten subgroups used exactly once as the validation data. The ten results were then averaged to produce a single, stable estimate. The GWAS meta-analysis was performed with METAL software (http://csg.sph.umich.edu/abecasis/Metal/).
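The LOGO scheme can be illustrated for a single SNP as follows. This sketch assumes a fixed-effect inverse-variance estimator written in plain Python as a stand-in for METAL; the function names, the random seed, and the per-group estimates are hypothetical.

```python
import math
import random


def split_into_groups(ids, n_groups=10, seed=0):
    """Randomly partition participant ids into n_groups near-equal groups."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]


def inverse_variance_meta(estimates):
    """Pool (beta, se) pairs with fixed-effect inverse-variance weights."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    return beta, math.sqrt(1.0 / total)


def logo_meta(group_estimates):
    """For each held-out group, meta-analyze the remaining groups.

    group_estimates: one (beta, se) pair per subgroup for a single SNP.
    Returns a list of (held_out_index, pooled_beta, pooled_se).
    """
    results = []
    for held_out in range(len(group_estimates)):
        training = [e for i, e in enumerate(group_estimates) if i != held_out]
        results.append((held_out, *inverse_variance_meta(training)))
    return results


groups = split_into_groups(range(100))
# Hypothetical per-group GWAS estimates for one SNP.
per_group = [(0.10, 0.05)] * 10
folds = logo_meta(per_group)
```

Each entry in `folds` gives the weights that would be applied to the held-out subgroup, so no individual contributes to the effect sizes used for their own PRS.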
Construction Of Polygenic Risk Scores
For the imputed genotype data in UK Biobank, SNPs underwent quality control before construction of the PRS. First, we matched SNP rsIDs by chromosome and position against NCBI dbSNP. Second, we removed non-biallelic SNPs and ambiguous palindromic SNPs (A/T or C/G SNPs with allele frequencies between 0.4 and 0.6) (Meisner et al. 2020). Third, we removed SNP loci with MAF greater than 0.15 (Linck and Battey 2019). Finally, LD was analyzed with PLINK2, and independent loci were retained by LD clumping at r\(^{2}\) < 0.1, based on the 1000 Genomes European reference panel (Bulik-Sullivan et al. 2015).
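The second QC step (biallelic and palindromic filtering) is easy to get subtly wrong, so a small sketch may help. It assumes alleles are given as single-base strings and treats the 0.4 to 0.6 frequency window as inclusive, which the text does not specify; the function names are hypothetical, and the rsID matching, MAF, and LD clumping steps are not covered here.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_biallelic_snp(ref, alt):
    """True for two distinct single-base alleles (no indels, no multi-allelic sites)."""
    return ref in COMPLEMENT and alt in COMPLEMENT and ref != alt


def is_ambiguous_palindromic(ref, alt, allele_freq, low=0.4, high=0.6):
    """True for A/T or C/G SNPs whose allele frequency is too close to 0.5
    for the strand to be resolved from frequency alone."""
    return COMPLEMENT.get(ref) == alt and low <= allele_freq <= high


def keep_for_prs(ref, alt, allele_freq):
    """Apply both filters from the second QC step."""
    return is_biallelic_snp(ref, alt) and not is_ambiguous_palindromic(
        ref, alt, allele_freq
    )
```

For example, an A/T SNP at frequency 0.5 is dropped, while an A/T SNP at frequency 0.1 is kept because its strand can still be inferred from the allele frequency.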
For each individual, the PRS for each phenotype was constructed as the weighted sum of effect-allele dosages. The weight of each SNP is its effect size: the beta value from the regression of the phenotype on the effect allele for quantitative phenotypes, and the odds ratio (OR) for binary phenotypes. To minimize the correlation between traits, we identified SNPs that overlapped between different GWAS; for each such SNP, the mean effect size was used and the SNP was counted only once. The formula is described below:
$${\text{PRS}}_{i,j}=\sum _{k=1}^{{n}_{j}}{G}_{i,k}\cdot {\beta }_{k},$$
where \({\text{PRS}}_{i,j}\) is the PRS value for the ith individual and jth phenotype, \({n}_{j}\) is the number of SNPs for the jth phenotype, \({\beta }_{k}\) is the beta (quantitative phenotype) or the OR (binary phenotype) of the kth SNP, and \({G}_{i,k}\) is the dosage of the kth SNP for the ith individual.
Finally, each PRS was standardized by Z-score transformation to have a mean of 0 and a standard deviation of 1.
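The score and its standardization can be written directly from the formula. The sketch below is illustrative only: the SNP labels, dosages, and weights are made up, and using the population standard deviation for the Z-score is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def polygenic_score(dosages, weights):
    """Sum of dosage * weight over the SNPs in `weights` (G_{i,k} * beta_k)."""
    return sum(dosages[snp] * w for snp, w in weights.items())


def standardize(scores):
    """Z-score transformation: subtract the mean, divide by the standard deviation."""
    mu = mean(scores)
    sd = pstdev(scores)  # assumption: population standard deviation
    return [(s - mu) / sd for s in scores]


# Hypothetical weights and imputed dosages (0 to 2) for three individuals.
weights = {"snp_a": 0.10, "snp_b": -0.20}
cohort = [
    {"snp_a": 2.0, "snp_b": 1.5},
    {"snp_a": 0.0, "snp_b": 0.0},
    {"snp_a": 1.0, "snp_b": 2.0},
]
raw_scores = [polygenic_score(d, weights) for d in cohort]
z_scores = standardize(raw_scores)
```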
Statistical analysis
Cox proportional hazards models were fitted to assess the association of clinical phenotypes and their derived PRS with survival, quantified by hazard ratios (HR) and 95% confidence intervals (CI).
First, for each individual, we meta-analyzed the PRS for every trait by the inverse-variance method, and the resulting value was used for prediction in the Cox model. Because sex is both an innate exposure and a dominant determinant of all-cause mortality in our population, we stratified the prediction by sex. Second, we constructed an overall quantitative risk score for the genetic contribution to lifespan by summing the PRS of the 13 common diseases, ten anthropometric measurements, and 21 risk biomarkers. We call this overall risk score the integrated PRS (iPRS). Each PRS in the iPRS was weighted by its coefficient from a multivariable Cox regression adjusted for the top 10 genetic principal components. The iPRS was divided into quantiles to examine the dose-response relationship between the hazard ratio and the dominant phenotype for all-cause mortality.
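Once the Cox coefficients are available, the iPRS is a weighted sum followed by a quantile split. The sketch below assumes the coefficients have already been estimated elsewhere; the trait names, coefficient values, and the choice of five rank-based bins are hypothetical, as the text does not state how many quantiles were used.

```python
def integrated_prs(prs_by_trait, cox_coefs):
    """Weighted sum of standardized PRS values, one weight per trait."""
    return sum(cox_coefs[t] * prs_by_trait[t] for t in cox_coefs)


def quantile_bins(scores, n_bins=5):
    """Assign each score a rank-based bin index from 0 to n_bins - 1."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(scores)
    return bins


# Hypothetical Cox log-hazard coefficients and one individual's PRS values.
cox_coefs = {"bmi": 0.2, "sbp": 0.4}
person = {"bmi": 1.0, "sbp": -0.5}
iprs = integrated_prs(person, cox_coefs)

# Hypothetical iPRS values for ten individuals, split into five bins.
bins = quantile_bins([float(i) for i in range(10)])
```

The bin index can then serve as a categorical predictor, with the lowest bin as reference, to read off hazard ratios across increasing iPRS levels.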
Continuous variables were summarized as means and standard deviations or as medians, and categorical variables as counts and proportions.
Statistical analyses were performed with R software (https://www.r-project.org/, Version 4.0.3). A two-sided P < 0.05 was considered statistically significant.