Study outline
Our primary aim was to investigate to what extent genetics-based prediction models can benefit from addition of the easily attainable phenotypic risk factors: sex, smoking status, T2D parental disease status, physical activity, BMI and age.
To this end, we built predictive models using linear regression, with and without these variables, using the UKB data. We built models separately for prediction of T2D and CAD. All models were trained using a subset of the UKB and validated in both the remainder of the UKB data and, additionally, the Lifelines data; details are further explained in the methods and supplementary materials. Both are large databases that provide, for a large number of individuals, the input variables required for our models: genotyping chip data, BMI, genetic sex, smoking status, quantification of physical activity and parental disease status (Table 1). All reported statistics refer to the results for the Lifelines data used for validation, unless specified otherwise.
Table 1
Statistics of included participants. Data are presented as mean (SD) or n (%). For a histogram of the age distributions, we refer to supplementary figure 1.
| | UKB | Lifelines |
|---|---|---|
| Number of included individuals | 334,338 | 29,825 |
| Number of males | 154,558 (46.2%) | 11,929 (40.0%) |
| Number of females | 179,780 (53.8%) | 17,896 (60.0%) |
| Age range (yrs) | 39 - 73 | 18 - 91 |
| Body mass index (BMI), kg/m² | 27.2 (SD: 4.7) | 25.7 (SD: 4.2) |
| Number of individuals currently smoking | 29,572 (8.8%) | 5,592 (18.8%) |
| Number of individuals smoking in the past | 149,410 (44.7%) | 15,159 (50.8%) |
| Average days/week with vigorous activity | 1.8 | 1.3 |
| Average days/week with moderate activity | 3.6 | 4.1 |
| T2D prevalence at first assessment | 14,263 (4.3%) | 514 (1.7%) |
| T2D incidence after first assessment | 987 (0.3%) | 162 (0.5%) |
| CAD prevalence at first assessment | 10,090 (3.0%) | 477 (1.6%) |
| CAD incidence after first assessment | 5,395 (1.6%) | 99 (0.3%) |
All analyses were conducted twice, once to model incidence and once to model prevalence. We analysed both to gauge how strongly the outcome affects the risk predictors, rather than the other way around (Figure 1). For example, BMI is a known risk predictor for T2D but is also affected by T2D, which makes identifying individuals who already have T2D easier than predicting which individuals will develop T2D in the future. Since predicting which individuals already have T2D is not interesting for the purpose of prevention, we primarily focus our analyses on predicting incidence rather than prevalence. To model prevalence, we used the entire dataset. To model incidence, we excluded all individuals who had already attained the outcome at their first visit.
We present our results as incidence odds ratios of individuals in the highest risk decile compared to the remainder of the population, to allow comparison with previous work and easy interpretation. Furthermore, individuals at highest risk stand to gain most from intervention, which makes identifying this group highly relevant. Additionally, we report the area under the receiver operating characteristic curve (AUROC) for all models (Figure 2, Supplementary Table 1).
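As an illustration of the top-decile metric described above, the following is a minimal sketch of how an odds ratio for the highest risk decile versus the remainder could be computed from predicted risks and observed outcomes. The function name `top_decile_odds_ratio` is hypothetical, and the Wald interval on the log odds ratio is an assumption; the text does not state how the confidence intervals were derived.

```python
import math


def top_decile_odds_ratio(scores, outcomes, z=1.96):
    """Odds ratio of the outcome in the top risk decile versus the other 90%.

    Returns (odds_ratio, ci_low, ci_high). The interval is a Wald interval on
    the log odds ratio (an assumption, not necessarily the method used in the
    study). All four cells of the 2x2 table must be non-zero.
    """
    n = len(scores)
    # Indices ordered from highest to lowest predicted risk.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    n_top = n // 10
    top = set(order[:n_top])

    a = sum(1 for i in top if outcomes[i])  # cases in the top decile
    b = n_top - a                           # non-cases in the top decile
    c = sum(1 for i in range(n) if i not in top and outcomes[i])  # cases in the rest
    d = (n - n_top) - c                     # non-cases in the rest

    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci_low = math.exp(math.log(odds_ratio) - z * se)
    ci_high = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, ci_low, ci_high
```

Here `scores` would hold the model's predicted risk per individual and `outcomes` the observed incident cases (1) and non-cases (0).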
PGS based predictions
First, we reproduced earlier reports showing that UKB-derived PGS can be used to identify high-risk individuals in Lifelines 28. We trained and validated models in the UKB and then also validated them externally in the Lifelines cohort. In Lifelines, the prevalence odds ratios for those in the top decile are 3.4 (95-CI: 2.8-4.1) for T2D and 2.8 (95-CI: 2.2-3.5) for CAD, with AUROCs of 0.89 (95-CI: 0.89-0.90) and 0.87 (95-CI: 0.86-0.89), respectively, after correcting for age, sex, genotyping batch number, smoking status, parental disease status and the first 4 principal components (Figure 2).
Questionnaire-based risk factors improve incidence predictions based on PGS
Next, we investigated how much predictive power PGS-based models gain when easily and freely attainable conventional risk factors are included. We built a number of models to assess the added value of each of those variables, by integrating individual factors into the PGS-based model and by integrating PGS into the non-genetic factor model.
We are interested in identifying individuals at high risk of developing T2D or CAD in the future, with the aim of acting preventively in these individuals. To identify individuals of a certain age who are at risk of developing T2D or CAD (rather than already having it), it is best to train models that predict incidence rather than prevalence 23. Prior to this analysis, we therefore removed individuals who already had the outcome at their initial measurement and then trained and validated the model. For comparison, we also created models that predict prevalence rather than incidence (Figure 2).
For a T2D prediction model based on PGS, we observe that individuals in the highest risk decile have a 2.0 (95-CI: 1.3-3.1) fold higher incidence, which increases to 5.8 (95-CI: 4.2-7.9) when BMI, physical activity, sex, parental disease and smoking status are included in the model (Fisher exact test p-value: 7.3 × 10⁻⁴).
In addition to the previous model, we constructed a model that includes age as an additional risk factor. We built this model separately because we deemed it of less value to compare individuals of different ages when aiming to identify those who would benefit most from preventive action. When age is also added to the model, the incidence odds ratio in the top decile is 7.4 (95-CI: 5.4-10.49), which is not statistically significantly different from the model without age (Fisher exact test p-value: 0.43). Similarly, we fail to observe a difference between including and excluding age in the part of the UKB dataset we used for validation: for individuals in the top decile, we observe an incidence odds ratio of 6.7 (95-CI: 5.7-7.8) without age in the model and 6.5 (95-CI: 5.6-7.6) with age included (Fisher exact test p-value: 0.9). This suggests that the probability of developing T2D does not increase with age. As this was contrary to our expectations, we investigated the incidence of T2D at different ages in all three datasets: the UKB training dataset, the UKB set used for validation and Lifelines. In the training set, incidence increases until the age of 57 and then decreases again. We observe a similar effect in the UKB set used for validation. In the Lifelines data we do not observe this decrease after a certain age; instead, incidence keeps increasing with age (Supplementary figure 2).
Similar to T2D, for CAD, Lifelines individuals in the highest risk decile have a 3.2-fold increased risk (95-CI: 2.0-5.1) when modelling incidence for CAD based on PGS compared to 4.6 (95-CI: 3.0-7.1) when BMI, physical activity, sex, parental disease and smoking status are also included in the model. While no statistically significant difference is observed in Lifelines (Fisher exact p-value: 0.36), this is observed in the UKB where individuals in the highest risk decile according to the PGS-based model have an odds ratio of 2.3 (95-CI: 2.1-2.5) compared to 4.3 (95-CI: 4.0-4.6) when the questionnaire-based risk factors are added (Fisher exact p-value = 4.1 × 10⁻²⁴). We attribute this difference to the much smaller sample size of the Lifelines cohort, which leads to larger 95 percent confidence intervals.
When age is also included in the model, the incidence odds ratio in Lifelines increases to 11.4 (95-CI: 7.7-17.4, Fisher exact test p-value = 0.032). This effect is larger than in the UKB, where the incidence odds ratio in the highest decile is 5.3 (95-CI: 4.9-5.7, Fisher exact test p-value = 9.5 × 10⁻²⁶). This is likely due to the much larger age range of the participants in the Lifelines database, combined with the rarity of CAD at younger ages (18 to 91 in Lifelines and 39 to 73 in the UKB; for the age distribution we refer to supplementary figure 1).
Although we observe no effect of age on T2D, we do observe clear effects of the other questionnaire-based variables on both T2D and CAD. Overall, we conclude that there is a clear benefit of adding risk factors that can be obtained through a simple questionnaire to PGS-based risk assessments.
Limited added value of PGS on top of questionnaire-based risk factors for prediction of incidence
In the previous section we investigated the benefit of adding questionnaire-based risk factors to PGS. Here we investigate to what extent PGS add value, for predicting incidence, to a model based solely on those non-genetic risk factors that can be attained through a questionnaire. This allows the added value to be weighed against the added cost and effort of running a genotyping chip.
For a T2D prediction model based on BMI, physical activity, sex, parental disease and smoking status we observe that, compared to the remainder of the population, individuals in the highest risk decile have a 4.2 (95-CI: 3.0-5.8) fold higher incidence. When PGS are added to the model this is 5.8 (95-CI: 4.2-7.9) fold. Although the PGS term in the model is significant (Wald test p-value: 2.76 × 10⁻⁹), the number of individuals detected in the highest decile is not statistically significantly different (Fisher exact test p-value = 0.25). Similarly, in the UKB, the odds ratios in the top decile are 5.6 (95-CI: 4.8-6.5) without PGS in the model and 6.7 (95-CI: 5.7-7.8) with PGS in the model, a difference that is not statistically significant (Fisher exact test p-value = 0.21). In contrast, when PGS are added to a model that predicts prevalence, rather than incidence, based on the aforementioned variables, the odds ratio improves significantly (Fisher exact test p-value = 3.1 × 10⁻⁹) from 6.4 (95-CI: 6.1-6.7) to 7.9 (95-CI: 7.5-8.2). We do note that the incidence rate is much lower than the prevalence rate, which may explain the failure to observe this difference for incidence. Nonetheless, while the PGS term is significant in the model, in a relatively large cohort it does not appear to have a distinguishable effect on the number of individuals with T2D classed in the top decile, compared to a model that does not include this term.
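The comparison of top-decile case counts between two models can be sketched with a two-sided Fisher exact test. The text does not spell out how the 2x2 table was constructed; one plausible reading, assumed here, is cases versus non-cases in the top decile under each of the two models. The function name is hypothetical and the implementation is a plain hypergeometric sum intended for illustration only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (not confirmed by the text):
      row 1 = top decile of model A: a cases, b non-cases
      row 2 = top decile of model B: c cases, d non-cases
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x cases in row 1 given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    x_min = max(0, col1 - row2)
    x_max = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one;
    # the small relative tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(x_min, x_max + 1))
        if p <= p_obs * (1 + 1e-7)
    )
```

For real analyses an established implementation such as `scipy.stats.fisher_exact` would be the more robust choice.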
Similar to T2D, we modelled incidence for CAD based on BMI, physical activity, sex, parental disease and smoking status. In Lifelines, individuals in the highest risk decile have a 2.4 (95-CI: 1.4-3.8) fold higher incidence compared to 4.6 (95-CI: 3.0-7.1) when PGS are included in the model (Fisher exact test p-value = 0.10). This increases to 11.4 (95-CI: 7.7-17.4) when age is also included in the model (Fisher exact test p-value = 0.03). While in Lifelines we do not observe a statistically significant difference between the models that include and exclude PGS, this is likely due to the limited sample size. In the UKB dataset used for validation, individuals in the highest risk decile have a 3.3 (95-CI: 3.0-3.5) fold higher risk for CAD when PGS are excluded and 4.3 (95-CI: 4.0-4.6) fold if PGS are included in the model, a significant difference (Fisher exact test p-value = 6.4 × 10⁻³). This shows that PGS, to some extent, exert their risk effects through mechanisms that are not captured by the non-genetic risk factors. Additionally, it reaffirms the added value of PGS for individuals above the age of 39, as all individuals in the UKB are above this age.
Overall, PGS provide some, but limited, added value on top of questionnaire-based risk factors for predicting T2D and CAD incidence. However, obtaining PGS is costly, effortful and time-consuming, whereas questionnaire-based risk factors are cheap, fast and easy to obtain.
PGS and non-genetic risk factors identify different aspects of disease risk
Previously it was questioned whether PGS predict the same aspects of disease risk as these and other common, non-genetic risk factors 16 and if PGS would thus be no more than a complex approach to achieve the same result. The fact that the PGS term is statistically significant in a model that contains also the other risk factor terms indicates that PGS capture some aspect of risk that is not already captured by non-genetic risk factors. However, since the statistical significance of the term in the model can be difficult to interpret, we investigated whether individuals predicted to have a high incidence for T2D based on PGS alone are also identified through a model based on sex, smoking status and parental disease status. We investigated how the predictions from PGS compare to predictions based on BMI, sex and smoking, on an individual level.
We found that the correlation between the predictions of the model based on questionnaire data and those of the model based on genetics is marginal (Lifelines: T2D: r=0.05, p-value: 5.0 × 10⁻¹⁵; CAD: r=0.01, p-value: 0.04; UKB: T2D: r=0.04, p-value: 8.8 × 10⁻⁶⁵; CAD: r=0.01, p-value: 1.2 × 10⁻¹¹). Over 60% of individuals were ranked at least 3 deciles apart by the two models. Furthermore, approximately 7.5% of the individuals in the highest risk category of the PGS-based model (decile 1) were classed in the lowest risk category by the non-genetic model (decile 10) (Figure 3). Similar results are observed when prevalence, rather than incidence, is examined (Supplementary figure 3).
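The decile comparison above can be illustrated with a short sketch that assigns each individual a risk decile under two models and measures how often the two assignments differ by a given number of deciles. The function names are hypothetical; decile 1 denotes the highest predicted risk, following the convention in the text, and ties are broken by input order (an assumption).

```python
def decile_ranks(scores):
    """Assign each individual a decile from 1 (highest risk) to 10 (lowest risk)."""
    n = len(scores)
    # Indices ordered from highest to lowest predicted risk.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    deciles = [0] * n
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // n + 1
    return deciles


def fraction_apart(deciles_a, deciles_b, min_gap=3):
    """Fraction of individuals whose two decile assignments differ by >= min_gap."""
    n = len(deciles_a)
    return sum(1 for a, b in zip(deciles_a, deciles_b) if abs(a - b) >= min_gap) / n
```

With `decile_ranks` applied once to the PGS-based predictions and once to the questionnaire-based predictions, `fraction_apart` gives the share of individuals ranked at least three deciles apart.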
From our findings, we can conclude that risk predictions based on genetic risk scores are largely dissimilar to those derived from a list of known, questionnaire-based risk factors. While both predictions appear to allow identification of individuals at higher risk, they largely disagree on who those individuals are.
Genetic risk can be largely mitigated by controlling BMI for T2D and CAD
The fact that risk estimated from questionnaire-based risk factors and risk based on genetics do not strongly overlap suggests that non-genetic risk factors can be modified to mitigate the potential risk calculated based on genetics. To investigate whether individuals at high genetic risk can mitigate their genetic predisposition for T2D by adopting a healthier lifestyle, we investigated the effect of BMI in individuals in different genetic risk categories. We limited the analysis to BMI because, on the one hand, it is a known causal risk factor and showed the largest impact in our analyses and, on the other, weight reduction is a feasible lifestyle intervention that could be advised to mitigate genetic predisposition. Furthermore, limiting this analysis to the single most impactful variable allows for easy interpretation of the result.
We compared the effect of having a higher BMI in the different categories of genetic risk, in terms of both relative and absolute risk (Figure 4). In the low genetic risk category, the T2D incidence among those with a BMI above 30 was 0.6%, higher than the incidence of 0% among individuals with a BMI between 18.5 and 25 (Fisher exact test p-value: 0.03). In individuals at high genetic risk for T2D, the incidence among those with a BMI above 30 was 2.8%, higher than the 0.3% among those with a BMI between 18.5 and 25 (Fisher exact test p-value: 6.1 × 10⁻⁵). The absolute difference in incidence between the high and normal BMI groups is thus approximately 4-fold larger in the high genetic risk group (2.5%) than in the low genetic risk group (0.6%). A similar pattern is observed in the UK Biobank (Figure 4). This suggests that those at high genetic risk for T2D benefit more from controlling their weight.
For CAD we fail to observe this phenomenon for incidence in Lifelines, but we do observe it when we examine prevalence (Figure 4). The prevalence in the low genetic risk group is 0.1% in the normal BMI (18.5-25) group and 1.2% in the high BMI (30+) group (Fisher exact test p-value: 8.9 × 10⁻³). The prevalence in the high genetic risk group is 1.9% in the normal BMI group and 14.4% in the high BMI group (Fisher exact test p-value: 1.6 × 10⁻¹⁸). The absolute difference in the high genetic risk group is thus 12.5% compared to only 1.1% in the low genetic risk group. We ascribe our failure to observe this difference for incidence to the low incidence numbers. Taken together, this supports the notion that those at high genetic risk for CAD also benefit more from weight control than those in the low genetic risk group, in terms of absolute risk reduction.
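The absolute risk differences quoted for T2D incidence and CAD prevalence in Lifelines can be retraced with a few lines of arithmetic. The sketch below uses only the percentages reported in the text; the function and variable names are illustrative.

```python
def absolute_risk_difference(risk_high_bmi, risk_normal_bmi):
    """Difference in risk (in percentage points) between BMI 30+ and BMI 18.5-25."""
    return risk_high_bmi - risk_normal_bmi


# T2D incidence in Lifelines (%), as reported in the text.
t2d_low_genetic = absolute_risk_difference(0.6, 0.0)
t2d_high_genetic = absolute_risk_difference(2.8, 0.3)

# CAD prevalence in Lifelines (%), as reported in the text.
cad_low_genetic = absolute_risk_difference(1.2, 0.1)
cad_high_genetic = absolute_risk_difference(14.4, 1.9)

# How many times larger the absolute difference is in the high genetic risk group.
t2d_fold = t2d_high_genetic / t2d_low_genetic
cad_fold = cad_high_genetic / cad_low_genetic
```

The T2D ratio works out to roughly 4 (2.5 versus 0.6 percentage points) and the CAD ratio to roughly 11 (12.5 versus 1.1 percentage points).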
No significant interaction effects between PGS and other risk factors
In addition to the additive models, we have also created models including a multiplicative interaction term between BMI and PGS, but this term does not significantly contribute to the prediction of either T2D or CAD (Wald test p-value = 0.02).
Indeed, when we apply the model with this multiplicative term to the Lifelines data, we observe that the predictive power of both models is similar, as is evident from the similar AUROCs of the models including and excluding the BMIxPGS interaction term. This holds for predicting both prevalence and incidence. The AUROC of the model predicting prevalence is 0.893 (95-CI: 0.883-0.903) with the multiplicative term compared to 0.894 (95-CI: 0.884-0.903) without it, and for incidence it is 0.812 (95-CI: 0.784-0.840) compared to 0.810 (95-CI: 0.782-0.839). In terms of the prevalence ratio in the highest decile, no difference is observed either. We do note that, although we do not observe these interactions to be significant, they may still exist but require larger sample sizes to detect, as large sample sizes are a known requirement for detecting interaction effects 29.
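For readers less familiar with the AUROC used in this comparison, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case, with ties counted as one half. The function name is hypothetical, and the pairwise loop is written for clarity rather than for cohorts of this size.

```python
def auroc(scores, outcomes):
    """AUROC as the share of case/non-case pairs in which the case scores higher."""
    cases = [s for s, y in zip(scores, outcomes) if y]
    non_cases = [s for s, y in zip(scores, outcomes) if not y]
    wins = 0.0
    for p in cases:
        for q in non_cases:
            if p > q:
                wins += 1.0   # case ranked above non-case
            elif p == q:
                wins += 0.5   # tie counts as half
    return wins / (len(cases) * len(non_cases))
```

Applying `auroc` to the predictions of the model with the interaction term and to those of the model without it, on the same validation individuals, yields the two values being compared.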