The necessity of incorporating non-genetic risk factors into polygenic risk score models

doi:10.21203/rs.3.rs-1691718/v1

Article

This work is licensed under a CC BY 4.0 License

Journal Publication

published 20 Feb, 2023

Read the published version in Scientific Reports →

The growing public interest in genetic risk scores for various health conditions may inspire preventive health action. However, these risk scores can be deceiving as they do not consider other, easily attainable risk factors, such as sex, BMI, age, smoking habits, parental disease status and physical activity. We show improved performance in identifying the 10% most at-risk individuals for type 2 diabetes (T2D) and coronary artery disease (CAD) by including these common risk factors. For T2D, incidence in the highest risk group increases from 3.2-fold (genetics-based model) and 4.2-fold (common risk factor-based model) to 5.8-fold (combined model). Similarly, for CAD we see an increase from 2.9- and 2.4-fold to 5.0-fold risk. Importantly, we show that genetic and common risk factors capture different risks. As such, it is paramount that all these variables are considered when reporting risk, unlike current practice with available genetic tests.

"The function of protecting and developing health must rank even above that of restoring it when it is impaired" (Hippocrates). Yet, for a myriad of reasons, modern medicine and healthcare have evolved into a paradigm of “diagnosing and curing”, rather than preventing. All the while, 32 to 58 percent of all Europeans age 50 and over suffer from multiple age-related non-transmissible chronic diseases¹. As a result, healthcare costs have spiraled out of control, with the World Health Organization reporting a global expenditure of 8.3 trillion USD or 10% of the gross domestic product (GDP) in 2018². However, chronic conditions can in part be prevented by following simple health guidelines such as regular physical exercise, having a healthy diet, and not smoking^3–5. Yet, the adherence to this advice is limited. Among other reasons, this can be explained by the low perceived risk for each of these chronic conditions separately⁶.

Risk perception to stimulate preventive health action

Risk perception can influence health behaviors^7,8. These health behaviors can affect the risk for chronic diseases⁹. Unfortunately, little of this information has reached, been understood by, or been acted upon by the public, even though such insights and understanding can inspire preventive health action¹⁰. As a result, most people are unaware of, and unmotivated to tackle, their greatest personal health risk factor, be it inactivity, poor diet or tobacco use. This may in part be ascribed to a lack of interest among the public. However, there is a growing interest in genetics-based risk assessment, as evidenced by the millions of genetic tests sold worldwide. This growing interest, combined with the predictive power of polygenic scores (PGS), can be, and is being, harnessed to promote disease prevention^8,11.

PGS are risk scores computed from genetic profiles. They have proven effective at identifying the 10% of individuals at highest risk, who have odds ratios of 2.5 and 2.9 for developing type 2 diabetes (T2D) and coronary artery disease (CAD), respectively, when compared to the rest of the population. These risk assessments are based on relatively cheap genotyping chip assessments (as opposed to the whole genome sequencing (WGS) required for monogenic analyses), which are well suited for PGS calculations ^12,13. Indeed, these PGS are now being implemented in commercially available tests ^14,15 and made available to the public. However, these tests neither account for nor are corrected for important established disease risk factors that can be easily attained through a questionnaire and could potentially vastly improve the predictions. By not accounting for factors such as weight and smoking, PGS-based risk assessment can lead to a false perception of low risk, thereby potentially reducing the willingness of an individual to change health behavior. Conversely, it can lead to an unnecessary perception of high risk, potentially triggering avoidable anxiety. We seek to investigate how much disease predictions can be improved by adding established risk factors that can be assessed through a questionnaire to existing PGS models.

Genetic health risk limitations

Although PGS have proven able to identify individuals at high risk based on genotyping chip data, the usefulness of this newer approach to risk stratification remains a topic of debate^16,17. One commonly raised concern is that the variance explained for the predicted outcomes is often low, typically between 1 and 5% for phenotypes such as diabetes and CAD ^17,18. Other, non-genetic risk factors, such as age, sex, smoking status, parental disease status, physical activity and body mass index (BMI), which already form part of most clinical risk prediction models, have proven more effective at identifying individuals at high risk ^19–21. Therefore, combining both genetic and non-genetic factors should lead to improved risk prediction, as previously suggested in a study that showed that a T2D prediction model including BMI and PGS outperforms models including only one of these predictors ²². Furthermore, additional inclusion of a number of biomarkers further increased its discriminative power ^23,24. However, most previously reported models combining PGS with other risk factors suffer from one or more of the following three limitations. First, they can be overfitted to the population they have been trained on ²⁵. Second, they are built to predict which individuals have a certain outcome (prevalence) rather than those who will develop it in the future (incidence). This introduces a bias, as multiple risk factors are also affected by the outcome, which can lead to misleading risk assessments. Third, they are either not comprehensive or, conversely, include variables that require biomarker measurements, making them difficult and costly to implement for existing PGS platforms. As such, there are no models available yet that allow for easy implementation into currently available PGS services.

In this paper, we evaluate prediction models that include additional variables beyond PGS, but limit ourselves to risk variables that can be easily acquired (through a questionnaire), which would allow these models to be easily implemented in different practical settings. We investigate to what extent PGS prediction models benefit from inclusion of the following easily accessible variables: BMI, sex, age, physical activity, parental disease and smoking status. We develop these predictive models using the UK Biobank (UKB)²⁶ data, and subsequently externally validate them in the Lifelines ²⁷ cohort (Figure 1). These are two large cohorts for which physical measurements, biomarker and genetic data, and to some extent also time series data, are available for a large number of participants. While previous studies are limited to predicting prevalence, in this study we also build models to predict incidence.

Lastly, it remains unclear whether there is a multiplicative effect of having both poor genetics and being at risk through other factors such as having a high BMI. To investigate if there are interaction effects of the aforementioned risk factors with genetics, we investigated whether risk classification is more accurate when multiplicative terms with genetics are included in the risk models. We believe that the framework we introduce will help improve PGS-based predictions for a wide variety of diseases, thereby promoting the implementation of preventive health practices.

Study outline

Our primary aim was to investigate to what extent genetics-based prediction models can benefit from addition of the easily attainable phenotypic risk factors: sex, smoking status, T2D parental disease status, physical activity, BMI and age.

To this end, we have built predictive models through linear regression modelling, including and excluding these variables, using the UKB data. We built models separately for prediction of T2D and CAD. All models were trained using a subset of the UKB and validated in both the remainder of the UKB data and the Lifelines data, with details further explained in the methods and supplementary materials. These are two large databases that provide, for a large number of individuals, the input variables required for our models: genotyping chip data, BMI, genetic sex, smoking status, quantification of physical activity and parental disease status (Table 1). All reported statistics refer to the results for the Lifelines data used for validation, unless specified otherwise.

Table 1

Statistics of included participants. Data are presented as mean (SD) or n (%). For a histogram of the age distributions, we refer to supplementary figure 1.

	UKB	Lifelines
Number of included individuals	334,338	29,825
Number of males	154,558 (46.2%)	11,929 (40.0%)
Number of females	179,780 (53.8%)	17,896 (60.0%)
Age range (yrs)	39 - 73	18 - 91
Body mass index (BMI), kg/m²	27.2 (SD:4.7)	25.7 (SD:4.2)
Number of individuals currently smoking	29,572 (8.8%)	5,592 (18.8%)
Number of individuals smoking in the past	149,410 (44.7%)	15,159 (50.8%)
Average days/week with vigorous activity	1.8	1.3
Average days/week with moderate activity	3.6	4.1
T2D prevalence at first assessment	14,263 (4.3%)	514 (1.7%)
T2D incidence after first assessment	987 (0.3%)	162 (0.5%)
CAD prevalence at first assessment	10,090 (3.0%)	477 (1.6%)
CAD incidence after first assessment	5,395 (1.6%)	99 (0.3%)

All analyses were conducted twice, once to model incidence and once to model prevalence. We analysed both incidence and prevalence to gauge how strongly the outcome affects the risk predictors, rather than the other way around (Figure 1). For example, BMI is a known risk predictor for T2D, but is also affected by T2D, making prediction of individuals that already have T2D more accurate than risk prediction of individuals that will develop T2D in the future. Since predicting which individuals already have T2D is not interesting for the purpose of prevention, we primarily focus our analyses on predicting incidence rather than prevalence. To model prevalence, we used the entire dataset. To study incidence, we excluded all individuals that had already attained the outcome at their first visit.
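The distinction between the two analysis sets can be made concrete with a minimal sketch. The record layout and field names below are hypothetical and chosen only for illustration; they are not the UKB or Lifelines schema, and the labelling rule for the prevalence set is our reading of the text.

```python
from dataclasses import dataclass

@dataclass
class Participant:
    # Hypothetical record layout; field names are illustrative only.
    pid: str
    has_outcome_at_baseline: bool  # outcome already present at first assessment
    develops_outcome_later: bool   # outcome first recorded after first assessment

def prevalence_dataset(cohort):
    """Prevalence modelling keeps everyone; the label is the baseline outcome."""
    return [(p, p.has_outcome_at_baseline) for p in cohort]

def incidence_dataset(cohort):
    """Incidence modelling drops baseline cases; the label is a later outcome."""
    return [
        (p, p.develops_outcome_later)
        for p in cohort
        if not p.has_outcome_at_baseline
    ]

# Toy cohort (invented data, not from either biobank).
cohort = [
    Participant("a", True, False),
    Participant("b", False, True),
    Participant("c", False, False),
]
prevalence = prevalence_dataset(cohort)
incidence = incidence_dataset(cohort)
```

In this toy cohort, participant "a" contributes to the prevalence analysis only, because a predictor such as BMI measured at baseline may already reflect the disease.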

We present our results as incidence odds ratios of individuals in the highest risk decile compared to the remainder of the population, to allow for comparison to previous works and easy interpretability. Furthermore, individuals at highest risk stand to gain most from intervention, which makes identifying this group highly relevant. Additionally, we report the area under the receiver operating characteristic curve (AUROC) for all models (Figure 2, supplementary table 1).
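As an illustration of the reported metric, the sketch below computes the odds ratio of the top risk decile against the remainder, with a 95% confidence interval on the log-odds scale (Woolf method). The interval method and the continuity correction for empty cells are assumptions on our part; the text does not state how the reported intervals were derived.

```python
import math

def top_decile_odds_ratio(risk_scores, outcomes, z=1.96):
    """Odds ratio (with Woolf CI) of the top 10% by predicted risk vs the rest."""
    n = len(risk_scores)
    order = sorted(range(n), key=lambda i: risk_scores[i], reverse=True)
    top_n = max(1, n // 10)
    top = set(order[:top_n])

    a = sum(1 for i in top if outcomes[i])  # cases in the top decile
    b = top_n - a                           # non-cases in the top decile
    c = sum(1 for i in range(n) if i not in top and outcomes[i])  # cases, rest
    d = (n - top_n) - c                     # non-cases, rest

    # Assumed continuity correction (add 0.5 to each cell) if any cell is empty.
    if 0 in (a, b, c, d):
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5

    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Toy data: 100 individuals, 4 of 10 top-decile members and 9 of the other 90
# develop the outcome (invented numbers).
scores = list(range(100))
events = [False] * 100
for i in (90, 91, 92, 93):
    events[i] = True
for i in range(9):
    events[i] = True

or_value, ci_low, ci_high = top_decile_odds_ratio(scores, events)
```

With these toy counts the odds ratio is (4 × 81) / (6 × 9) = 6.0; the wide interval reflects the small number of events.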

PGS based predictions

First, we reproduced earlier reports showing that UKB-derived PGS can be used to identify high-risk individuals in Lifelines ²⁸. We trained and validated models in the UKB and then also validated them externally in the Lifelines cohort. In Lifelines, the prevalence odds ratios for those in the top decile are 3.4 (95-CI: 2.8-4.1) for T2D and 2.8 (95-CI: 2.2-3.5) for CAD, with AUROCs of 0.89 (95-CI: 0.89-0.90) and 0.87 (95-CI: 0.86-0.89), respectively, after correcting for age, sex, genotyping batch number, smoking status, parental disease status and the first 4 principal components (Figure 2).

Questionnaire-based risk factors improve incidence predictions based on PGS

Next, we investigate how much predictive power these PGS models would gain by including easily and freely attainable regular risk factors into a PGS-based model. We built a number of models to assess the added value of each of those variables, by integrating individual factors into the PGS-based model and by integrating PGS into the non-genetic factor model.

We are interested in identifying individuals at high risk of developing T2D or CAD in the future, aiming to act preventively in high-risk individuals. To identify individuals of a given age who are at risk of developing T2D or CAD (rather than already having it), it is best to train models that predict incidence rather than prevalence, i.e. individuals that will attain the outcome in the future rather than those who already have it ²³. Prior to our analysis, we removed individuals that had the outcome at their initial measurement from the data, and then trained and validated the model. For comparison, we have also created models that predict prevalence rather than incidence (Figure 2).

For a T2D prediction model based on PGS, we observe that individuals in the highest risk decile have a 2.0 (95-CI: 1.3-3.1) fold higher incidence, which increases to 5.8 (95-CI: 4.2-7.9) when BMI, physical activity, sex, parental disease and smoking status are included in the model (Fisher exact test p-value: 7.3*10^-04).

In addition to the prior model, we constructed a model that includes age as an additional risk factor. We built this model separately as we deemed it of less value to compare individuals at different ages when aiming to identify individuals that would benefit most from preventive action. When age is also added to the model, the incidence odds ratio in the top decile is 7.4 (95-CI: 5.4-10.49), which is not statistically significantly different from the model without age (Fisher exact test p-value: 0.43). Similarly, we fail to observe a difference between models with and without age in the part of the UKB dataset we used for validation. In the UKB we observe an incidence odds ratio for individuals in the top decile of 6.7 (95-CI: 5.7-7.8) without age included in the model and 6.5 (95-CI: 5.6-7.6) with age included (Fisher exact test p-value: 0.9). This suggests that the probability of developing T2D does not increase with age. As this was contrary to our expectations, we investigated the incidence of T2D at different ages in all three datasets: the UKB training dataset, the Lifelines dataset and the UKB set used for validation. In the training set, the incidence increases until the age of 57 and then starts decreasing again. We observe a similar effect in the UKB set used for validation. In the Lifelines data we do not observe this decrease after a certain age; instead, the incidence keeps increasing with age (Supplementary figure 2).
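The model comparisons above rely on Fisher exact tests. The text does not spell out the contingency table that was tested; one plausible reading, assumed here, is a 2×2 table of cases and non-cases in the top decile under each of the two models being compared. A minimal standard-library sketch of the two-sided test follows; the example counts are invented.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    x_min = max(0, col1 - row2)
    x_max = min(row1, col1)
    p_value = sum(
        p
        for p in (prob(x) for x in range(x_min, x_max + 1))
        if p <= p_observed * (1 + 1e-7)  # tolerance for floating-point ties
    )
    return min(1.0, p_value)

# Assumed layout: rows are models, columns are cases / non-cases among the
# individuals each model places in its top decile (hypothetical counts).
p_models = fisher_exact_two_sided(40, 260, 25, 275)

# Small reference table used as a sanity check.
p_reference = fisher_exact_two_sided(3, 1, 1, 3)
```

Because two models applied to the same cohort place overlapping individuals in their top deciles, the two rows are not independent samples; this sketch shows only the mechanics of the test, not a recommendation for that design.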

Similar to T2D, for CAD, Lifelines individuals in the highest risk decile have a 3.2-fold increased risk (95-CI: 2.0-5.1) when modelling incidence for CAD based on PGS compared to 4.6 (95-CI: 3.0-7.1) when BMI, physical activity, sex, parental disease and smoking status are also included in the model. While no statistically significant difference is observed in Lifelines (Fisher exact p-value: 0.36), this is observed in the UKB where individuals in the highest risk decile according to the PGS-based model have an odds ratio of 2.3 (95-CI: 2.1-2.5) compared to 4.3 (95-CI: 4.0-4.6) when the questionnaire-based risk factors are added (Fisher exact p-value = 4.1*10^-24). We attribute this difference to the much smaller sample size of the Lifelines cohort, which leads to larger 95 percent confidence intervals.

When age is also included in the model, the incidence odds ratio in Lifelines increases to 11.4 (95-CI: 7.7-17.4, Fisher exact test p-value = 0.032). This effect is larger than in the UKB, where the incidence odds ratio in the highest decile is 5.3 (95-CI: 4.9-5.7, Fisher exact test p-value = 9.5*10^-26). This is likely due to the much larger age range of the participants in the Lifelines database, combined with the rarity of CAD at younger ages (18 to 91 in Lifelines and 39 to 73 in the UKB; for the age distribution we refer to supplementary figure 1).

Although we observe no effect of age on T2D, we do observe clear effects of the other questionnaire-based variables on both T2D and CAD. Overall, we conclude that there is a clear benefit of adding risk factors that can be obtained through a simple questionnaire to PGS-based risk assessments.

Limited added value of PGS on top of questionnaire-based risk factors for prediction of incidence

In the previous section we investigated the added benefit of adding questionnaire-based risk factors to PGS. Here we investigate to what extent PGS add value to a model based on solely those non-genetic risk factors that can be attained through a questionnaire, to predict incidence. This will allow a comparison of the added value for the added cost and effort of running a genotyping chip.

For a T2D prediction model based on BMI, physical activity, sex, parental disease and smoking status we observe that, compared to the remainder of the population, individuals in the highest risk decile have a 4.2 (95-CI: 3.0-5.8) fold higher incidence. When PGS are added to the model this is 5.8 (95-CI: 4.2-7.9) fold. Despite the fact that the PGS term in the model is significant (Wald test p-value: 2.76 * 10^-9), the number of individuals detected in the highest decile is not statistically significantly different (Fisher exact test p-value = 0.25). Similarly, in the UKB, the odds ratios in the top decile are 5.6 (95-CI: 4.8-6.5) without PGS in the model and 6.7 (95-CI: 5.7-7.8) with PGS in the model, a difference that is not statistically significant (Fisher exact test p-value = 0.21). In contrast, if we add PGS to a model that predicts prevalence, rather than incidence, based on the aforementioned variables, the predictions differ significantly (Fisher exact test p-value = 3.1*10^-9), with the odds ratio improving from 6.4 (95-CI: 6.1-6.7) to 7.9 (95-CI: 7.5-8.2). We do note that the incidence rate is much lower than the prevalence rate, which may explain the failure to observe this difference for incidence. Nonetheless, we note that while the PGS term is significant in the model, even in a relatively large cohort it does not appear to have a distinguishable effect on the number of individuals with T2D classed in the top decile, compared to a model that does not include this term.

Similar to T2D, we modelled incidence for CAD based on BMI, physical activity, sex, parental disease and smoking status. In Lifelines, individuals in the highest risk decile have a 2.4 (95-CI: 1.4-3.8) fold higher incidence compared to 4.6 (95-CI: 3.0-7.1) when PGS are included in the model (Fisher exact test p-value = 0.10). This increases to 11.4 (95-CI: 7.7-17.4) when age is also included in the model (Fisher exact test p-value = 0.03). While in Lifelines we do not observe a statistically significant difference between the models that include and exclude PGS, this is likely due to the limited sample size. In the UKB dataset used for validation, individuals in the highest risk decile have a 3.3 (95-CI: 3.0-3.5) fold higher risk for CAD when PGS are excluded and 4.3 (95-CI: 4.0-4.6) fold if PGS are included in the model, a significant difference (Fisher exact test p-value = 6.4*10^-3). This shows that PGS, to some extent, exert their risk effects through mechanisms that are not captured by the non-genetic risk factors. Additionally, it reaffirms the added value of PGS for individuals above the age of 39, as all individuals in the UKB are above this age.

Overall, it is clear that PGS add some, but limited, value on top of questionnaire-based risk factors for predicting T2D and CAD incidence, compared to when only freely attainable risk factors are used. However, the former is costly, requires effort and is time consuming, whereas the latter is cheap, fast and easy.

PGS and non-genetic risk factors identify different aspects of disease risk

Previously it was questioned whether PGS predict the same aspects of disease risk as these and other common, non-genetic risk factors ¹⁶, in which case PGS would be no more than a complex approach to achieve the same result. The fact that the PGS term is statistically significant in a model that also contains the other risk factor terms indicates that PGS capture some aspect of risk that is not already captured by non-genetic risk factors. However, since the statistical significance of the term in the model can be difficult to interpret, we investigated whether individuals predicted to have a high incidence for T2D based on PGS alone are also identified through a model based on sex, smoking status and parental disease status. We investigated how the predictions from PGS compare to predictions based on BMI, sex and smoking, on an individual level.

We found that the correlation between the predictions of the model based on questionnaire data and those of the model based on genetics is marginal (Lifelines: T2D: r=0.05, p-value: 5.0*10^-15; CAD: r=0.01, p-value: 0.04, UKB: T2D: r=0.04, p-value: 8.8*10^-65; CAD: r=0.01, p-value: 1.2*10^-11). Over 60% of individuals were ranked at least 3 deciles apart by the two models. Furthermore, approximately 7.5% of the individuals in the highest risk category according to the PGS-based model (decile 1) were classed in the lowest risk category by the non-genetic model (decile 10) (Figure 3). Similar results are observed when prevalence, rather than incidence, is interrogated (Supplementary figure 3).
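The decile comparison can be sketched as follows. This is a minimal illustration with hypothetical function names; it follows the convention in the text that decile 1 is the highest predicted risk and decile 10 the lowest, and assumes a simple rank-based decile assignment.

```python
def decile_ranks(scores):
    """Assign each individual a decile: 1 = highest predicted risk, 10 = lowest."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    deciles = [0] * n
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // n + 1
    return deciles

def fraction_far_apart(scores_a, scores_b, min_gap=3):
    """Share of individuals whose deciles under the two models differ by >= min_gap."""
    da, db = decile_ranks(scores_a), decile_ranks(scores_b)
    return sum(abs(x - y) >= min_gap for x, y in zip(da, db)) / len(da)

def top_to_bottom_fraction(scores_a, scores_b):
    """Share of model A's decile 1 that model B places in decile 10."""
    da, db = decile_ranks(scores_a), decile_ranks(scores_b)
    top = [i for i, decile in enumerate(da) if decile == 1]
    return sum(db[i] == 10 for i in top) / len(top)

# Toy example: two models that rank 100 individuals in exactly opposite order.
pgs_scores = list(range(100))
questionnaire_scores = list(reversed(range(100)))

disagreement = fraction_far_apart(pgs_scores, questionnaire_scores)
flipped = top_to_bottom_fraction(pgs_scores, questionnaire_scores)
```

Perfectly opposed rankings are an extreme case used only to exercise the functions; the disagreement reported above arises from near-zero rather than negative correlation.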

From our findings, we can conclude that risk predictions based on genetic risk scores are largely dissimilar to those derived from a list of known, questionnaire-based risk factors. While both predictions appear to allow identification of individuals at higher risk, they do largely disagree on whom those individuals are.

Genetic risk can be largely mitigated by controlling BMI for T2D and CAD

The fact that risk estimated from questionnaire-based risk factors and risk based on genetics do not strongly overlap suggests that non-genetic risk factors can be modified to mitigate the potential risk calculated based on genetics. To investigate whether individuals at high genetic risk can mitigate their genetic predisposition for T2D by adopting a healthier lifestyle, we investigated the effect of BMI in individuals in different genetic risk categories. We limited the analysis to BMI as, on the one hand, it is a known causal risk factor and showed the largest impact in our analyses; and, on the other, weight reduction is a feasible lifestyle intervention which could be advised to mitigate genetic predisposition. Furthermore, limiting this analysis to the single most impactful variable allows for easy interpretation of the result.

We compared the effect of having a higher BMI in the different categories of genetic risk, in terms of both relative and absolute risk (figure 4). In the low genetic risk category, the T2D incidence among those with a BMI above 30 was 0.6%, higher than the incidence of 0% among individuals with a BMI between 18.5 and 25 (Fisher exact test p-value: 0.03). In individuals at high genetic risk for T2D, the incidence among those with a BMI above 30 was 2.8%, higher than the incidence of 0.3% among those with a BMI between 18.5 and 25 (Fisher exact test p-value: 6.1*10^-5). The absolute difference between the BMI groups is thus approximately 4-fold larger in the high genetic risk group than in the low genetic risk group: 2.5 percentage points in the former compared to only 0.6 in the latter. A similar pattern is observed in the UK Biobank (figure 4). This suggests that those at high genetic risk for T2D benefit more from controlling their weight.

For CAD we fail to observe this same phenomenon for incidence in Lifelines, but do observe this in case we interrogate prevalence (Figure 4). The prevalence in the low genetic risk group is 0.1% in the normal BMI (18.5-25) group and 1.2% in the high BMI (30+) group (Fisher exact test p-value: 8.9*10^-3). The prevalence in the high genetic risk group is 1.9% in the normal BMI group and 14.4% in the high BMI group (Fisher exact test p-value: 1.6*10^-18). The absolute difference in the high genetic risk group is thus 12.5% compared to only 1.1% in the low genetic risk group. We ascribe our failure to observe this difference for incidence to the low incidence numbers. Taken together, this supports the notion that those at high genetic risk for CAD also benefit more from weight control than those in the low genetic risk group, in terms of absolute risk reduction.
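The absolute differences quoted in this section follow directly from the reported percentages. The short sketch below reproduces the arithmetic; the helper name is ours, and rounding to one decimal matches the precision of the reported figures.

```python
def absolute_difference(high_bmi_pct, normal_bmi_pct):
    """Absolute risk difference in percentage points, rounded to one decimal."""
    return round(high_bmi_pct - normal_bmi_pct, 1)

# T2D incidence in Lifelines (percentages as reported in the text).
t2d_low_genetic = absolute_difference(0.6, 0.0)    # BMI > 30 vs BMI 18.5-25
t2d_high_genetic = absolute_difference(2.8, 0.3)

# CAD prevalence in Lifelines (percentages as reported in the text).
cad_low_genetic = absolute_difference(1.2, 0.1)
cad_high_genetic = absolute_difference(14.4, 1.9)

# Ratio of the absolute differences, high versus low genetic risk.
t2d_ratio = t2d_high_genetic / t2d_low_genetic
cad_ratio = cad_high_genetic / cad_low_genetic
```

The T2D ratio of roughly 4.2 is the basis for the "4-fold" statement, and the corresponding CAD prevalence ratio is about 11.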

No significant interaction effects between PGS and other risk factors

In addition to the additive models, we have also created models including a multiplicative interaction term between BMI and PGS, but this term does not significantly contribute to the prediction of either T2D or CAD (Wald test p-value = 0.02).

Indeed, when we apply the model with this multiplicative term to the Lifelines data, we observe that the predictive power of both models is similar, as evident from the similar AUROCs of the models including and excluding the BMIxPGS interaction term. This is the case for predicting both prevalence and incidence. The AUROC of the model predicting prevalence is 0.893 (95 CI: 0.883-0.903) with the multiplicative term compared to 0.894 (95 CI: 0.884-0.903) without it, and for incidence this is 0.812 (95 CI: 0.784-0.840) compared to 0.810 (95 CI: 0.782-0.839). In terms of the prevalence ratio in the highest decile, no difference is observed either. We do note that, although we do not observe these interactions to be significant, they may still exist but require larger sample sizes to detect, as large sample sizes are a known requirement for detecting interaction effects ²⁹.
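For readers less familiar with the AUROC used to compare these models, the sketch below computes it from predicted scores and binary outcomes using the rank-sum (Mann-Whitney) formulation, with ties sharing their average rank. It is a generic standard-library illustration, not the pipeline used for the reported values, and it does not compute confidence intervals.

```python
def auroc(scores, labels):
    """AUROC: probability that a random case outranks a random non-case."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])

    # Assign 1-based ranks; tied scores share the average rank of their group.
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1

    n_pos = sum(1 for y in labels if y)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one case and one non-case")

    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy predictions from two hypothetical models on the same four individuals.
outcomes = [0, 0, 1, 1]
auc_without_interaction = auroc([0.10, 0.40, 0.35, 0.80], outcomes)
auc_with_interaction = auroc([0.12, 0.41, 0.33, 0.79], outcomes)
```

Because the AUROC depends only on the ranking of the scores, an added term that barely reorders individuals leaves it essentially unchanged, which is the pattern reported above for the BMIxPGS term.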

PGS can be used to identify individuals at higher risk of developing T2D, CAD and other diseases ²⁸. This can help identify and motivate individuals that should be prioritized for preventive health measures. For this reason, we focused mostly on the utility of PGS to identify individuals at highest risk, defined as those with the 10% highest risk, as opposed to its discriminating power in the remainder of the risk spectrum. We confirm that PGS can be used in a Dutch cohort (Table 1) to identify the top 10% at-risk individuals at an approximately 3.2- and 2.8-fold higher risk of developing T2D and CAD, respectively. However, we also find that individuals that are in the highest risk decile based on BMI, smoking status, physical activity, parental T2D status and sex have incidence odds ratios of 4.2 and 2.4, compared to the remainder of the Lifelines cohort, for T2D and CAD, respectively (Figure 1). This suggests that a risk assessment based on variables that can be obtained through a simple questionnaire or directly from electronic health records is similarly or more accurate than risk prediction based solely on genetics. Due to the ease of attaining such variables, we suggest continuing to use the questionnaire approach as a first risk assessment, rather than relying solely on genetic testing to determine risk.

Nonetheless, as genetic testing becomes increasingly accessible and appealing to individuals, there is a potential to harness this interest to deliver risk impressions for numerous preventable chronic conditions. We show that when PGS predictions are augmented with risk factors that can be easily attained through a questionnaire, risk predictions become more accurate, improving from approximately a 2.0- and 2.3-fold higher incidence in the top decile to 6.5- and 4.3-fold for T2D and CAD, respectively.

Additionally, we showed that PGS-derived risk often does not agree with risk derived from questionnaire-based risk factors (Figure 3). Our results suggest that many individuals presented with risk assessments based solely on their genetic risk scores will falsely conclude they are at low or high risk, stressing the need for inclusion of these easily attainable variables into existing PGS models. For example, when PGS are reported without consideration of other risk factors, an individual may feel protected by a low genetic risk score despite being at high risk as a heavily overweight smoker. As such, it may even be deceiving to report risk based solely on PGS, which is concerning because this is currently often the case with offered PGS services, at least with commercially available genetic tests. Hence, we strongly argue for adoption of additional variables into PGS risk models, especially those that can be acquired through a simple questionnaire.

Although models based on sex, BMI, parental disease and smoking status perform relatively well, there is still added value of the genetic risk scores, albeit limited, in line with earlier reports ^30,31. We observe that when genetic risk is also included on top of sex, BMI, parental disease and smoking status, the incidence odds ratio increases from 4.2 to 6.5 for T2D and from 2.4 to 4.3 for CAD (Figure 2). Whether these gains are sufficient to warrant the added cost of a genotyping assessment remains an open question for now. However, with the cost of genotyping chips being close to the 30 euro mark and 30X WGS currently periodically being available for less than 200 euro ¹⁵, it is not difficult to imagine that such data will soon be readily available for a large number of individuals. This stresses the need for platforms that allow integrated analysis of genetic and phenotypic data.

Age is still an obvious predictor of prevalence, also in the case of T2D, as prevalence is a function of time. Although we argue that it is unfair to compare disease prevalence of older individuals to that of younger individuals, and that age should thus not be used in a model that predicts prevalence, this does clearly indicate that age should be considered when presenting individuals with their risk. If age is not considered when informing about risk, the absolute prevalence of a disease may appear irrelevant. For example, an increase in prevalence from 0.25% to 2% for diabetes at young ages may appear irrelevant, whereas a communicated increase in absolute risk from 5% to 40% at older age may seem far more substantial and is more likely to trigger action. Therefore, it is important to communicate the lifetime risk rather than the 10-year risk to individuals of younger ages.

Role in prevention

The models created in this project can be used to identify individuals at high risk of either CAD or T2D. Depending on the outcome an individual is at risk for, different preventive actions may be appropriate, as different risk factors may be relevant. For instance, high blood pressure can be a risk factor for CAD and can be affected by salt intake. For T2D, high blood pressure is less of an issue, while sugar intake may be much more important to monitor. If individuals are aware of the phenotype they are at highest risk for, they can identify the risk factors that they can reduce to efficiently lower their health risk (as opposed to following all standard guidelines, which cover too wide a range of actions to inspire actual action).

While some individuals are at high genetic risk, which they cannot change, they can still take preventive action to offset their genetic predisposition. Earlier work has indeed shown that those with elevated risk based on genetics can still lower their risk to well below the overweight individual with low genetic risk ³². Similarly, we observe that individuals with a healthy BMI (between 18.5 and 25) and high genetic risk (top decile), still have a lower or similar incidence than individuals with a high BMI (over 30) and low genetic risk (bottom decile) for T2D or medium genetic risk in case of CAD (figure 4). We simultaneously observe that individuals in the highest genetic risk groups stand to gain the most from a healthier lifestyle in terms of reducing risk on an absolute scale. Thus, if a limited number of individuals can be selected for a program to limit or even reduce weight, those in the high genetic risk category should be targeted over those in the low genetic risk category. These predictions can therefore be useful when prevention becomes a more common procedure in health care.

While our models can be used to provide individuals with insights into their health risks and will likely inspire some action, a single risk assessment will probably not be sufficient to trigger consistent long-term lifestyle and dietary changes ³³. Therefore, we acknowledge that action beyond the provision of such information is necessary. Since short-term weight loss programs have limited or no effects ^34,35, a system needs to be constructed that continuously monitors and guides individuals, which is costly.

These costs are also a barrier, which should be removed by a health insurance system that supports preventive care. However, in the Netherlands the government pays insurers additional money for insuring chronically ill individuals, which limits their motivation to prompt these individuals to improve their health ³⁶. Therefore, change is required at a much higher level before our society will effectively shift from reactive healthcare toward preventive care. Insurers should be rewarded for lowering the risk profiles of the individuals they insure, that is, for improving the health of their customers. Here, risk models such as these could be used as a guide to determine the appropriate reward, making the scheme cost-effective for the government in the long term and prompting insurers to promote healthier lifestyles.

Limitations

To allow easy implementation of our models in practice, we limited our analyses to risk factors that can be acquired through a questionnaire or from electronic medical records. We acknowledge that better prediction models may be achieved by including previously identified non-genetic biomarkers ³⁷. Other work has shown that an AUROC of 0.9 can be achieved, allowing identification of the 7% of the population that is at a 7-fold higher risk of T2D ²³. One limitation of that study, however, is that it predicts which individuals have T2D, rather than which will develop it in the future. As T2D affects weight, it is unclear to what extent weight is a predictor or a consequence. We also observe this effect (Figure 2), as the discriminative power of the prediction models increases when we predict prevalence rather than incidence. Therefore, we recommend that when predictors that are affected by the outcome are included in the model, individuals who had already attained the outcome at the original measurement are removed from the analysis.
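The recommendation above can be illustrated with a short sketch. This is not the study's pipeline: the `Participant` record, its field names and the function `split_for_incidence` are hypothetical, and treating a diagnosis in the baseline year itself as prevalent is an assumption made here for simplicity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Participant:
    participant_id: str
    baseline_year: int
    diagnosis_year: Optional[int]  # None if never diagnosed during follow-up


def split_for_incidence(cohort):
    """Drop prevalent cases and label the rest as incident case or non-case.

    A participant diagnosed in or before the baseline year is treated as a
    prevalent case (an assumption) and excluded from the incidence analysis.
    """
    at_risk = [
        p for p in cohort
        if p.diagnosis_year is None or p.diagnosis_year > p.baseline_year
    ]
    labels = [p.diagnosis_year is not None for p in at_risk]
    return at_risk, labels
```

Predictors such as weight are then only evaluated in individuals who were free of the outcome when the predictor was measured, so that they cannot be a consequence of the disease.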

Furthermore, the follow-up period is limited to an 8-to-12-year time span, and more individuals will develop T2D and CAD after this period. Additionally, not all individuals who have diabetes will have received a diagnosis or reported it to the UKB, causing noise in the data. Indeed, we observe that the incidence of T2D is higher in Lifelines than in the UKB, despite the fact that the average age of the Lifelines cohort is lower and that the overall diabetes prevalence in the UK (7%) ³⁸ is similar to that in the Netherlands (6.6%) ³⁹. This is also despite the fact that follow-up annotation for the Lifelines cohort is shorter (approximately 8 years in Lifelines versus 12 years in the UKB). We attribute this difference to the fact that Lifelines, in contrast to the UKB, was designed to take repeat measurements from all participants. The date of disease annotation is often correlated with the date of the second assessment, a clear bias that cannot be resolved. Despite these limitations, we are still able to identify individuals at a significantly increased risk, as is evident from the much higher incidence of the respective outcomes in the higher risk categories in both cohorts.

There are also a number of ethical limitations to consider when offering polygenic risk scores. These warrant an elaborate discussion, for which we refer to ⁴⁰.

Lastly, in this paper, we focused on T2D and CAD due to their high prevalence, burden on society and their often-preventable nature. We acknowledge that PGS can play a role in screening for other common health conditions as well, such as cancer ^12,41,42 (albeit varying per cancer type ⁴³), and even rare diseases in the future ⁴⁴.

Conclusion

With the emerging public interest in preventive health ⁴⁵, the demand for more personalized risk assessments is likely to keep increasing. This will increase the need to identify potential focus points for preventive health action. While genetic risk profiling is, to some extent, readily commercially available to the general public, most of the reported risk estimates can be greatly improved by using models that include easily accessible variables. To this end, we have developed a SaaS platform that transforms raw VCF files into risk scores, with the option of taking additional variables such as BMI, sex, age, parental disease and smoking status into consideration, to ultimately arrive at more accurate predictions than those available to the public to date. We expect that methods like the one presented here will become common in healthcare to identify high-risk individuals and initiate targeted preventive measures.

Data and code availability

All results and code created during this project are available upon request, if sharable in accordance with the UK Biobank and Lifelines material transfer agreement, by contacting the corresponding author of this paper (Sipko van Dam). We adhered to the 'Scientific Reports' policies on sharing data and materials.

The manuscript is based on data from the UK Biobank through application 55495. The Resource is available to all bona fide researchers for all types of health-related research that is in the public interest, without preferential or exclusive access for any person. The catalogue of the UK Biobank is accessible at https://biobank.ndph.ox.ac.uk/ukb/catalogs.cgi. All international researchers can obtain data access at https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access. A fee is required.

The manuscript is based on data from the Lifelines Cohort Study, Study OV20_00020. Lifelines adheres to standards for data availability. Due to ethical restrictions imposed by the Lifelines Scientific Board and the Medical Ethical Committee of the University Medical Center Groningen related to protecting patient privacy, the data are not publicly available. The data catalogue of Lifelines is publicly accessible at http://www.lifelines.net. All international researchers can obtain data at the Lifelines research office ([email protected]), for which a fee is required.

The Lifelines and UK Biobank systems allow access for reproducibility of the study results.

Acknowledgements

We thank the UKB for the data access granted through application 55495 and Lifelines for the data access granted through application OV20_00020. Additionally, we thank the UGLI consortium for the QC on Lifelines genotyping data and the related documentation. The Lifelines Biobank initiative has been made possible by a subsidy from the Dutch Ministry of Health, Welfare and Sport, the Dutch Ministry of Economic Affairs, the University Medical Center Groningen (UMCG, the Netherlands), the University of Groningen and the Northern Provinces of the Netherlands. This project was funded by the UMCG under project number PPP-2019_023 and by Ancora Health B.V. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests

I have read the journal's policy and the authors of this manuscript have the following competing interests: Sipko van Dam, Pytrik Folkertsma, Jose Castela Forte, Dylan H. de Vries and Rahul Gannamani are employed by Ancora Health B.V., a for profit organisation. Bruce Wolffenbuttel sits on the medical advisory board of Ancora Health B.V. Additionally, Jose Castela Forte and Rahul Gannamani own shares of Ancora Health B.V. The funder provided support in the form of salaries for all employees but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author contributions

SvD: Study design, wrote manuscript, performed modelling analyses. PF: Processed/structured data (structuring and aligning data from both biobanks) and supported all analyses. JCF: Provided input to the design of the study, aided interpretation of the results. DdV: Designed figures. RG: conceptualization of the paper. BW: Provided input to the design of the study, insights into interpretation of the results and design and conceptualization of the paper. Provided data analysis infrastructure. Editing and final approval of the manuscript were done by all authors.

References

1. Van Der Heide, I., Melchiorre, M. G., Quattrini, S. & Boerma, W. Innovating care for people with multiple chronic conditions in Europe: An overview.
2. Global spending on health: Weathering the storm. https://www.who.int/publications/i/item/9789240017788.
3. Kvaavik, E., Batty, G. D., Ursin, G., Huxley, R. & Gale, C. R. Influence of individual and combined health behaviors on total and cause-specific mortality in men and women: the United Kingdom health and lifestyle survey. Arch. Intern. Med. 170, 711–718 (2010).
4. Pot, G. K. et al. Lifestyle medicine for type 2 diabetes: practice-based evidence for long-term efficacy of a multicomponent lifestyle intervention (Reverse Diabetes2 Now). BMJ Nutr. Prev. Heal. 3, bmjnph-2020-000081 (2020).
5. Raghupathi, W. & Raghupathi, V. An Empirical Study of Chronic Diseases in the United States: A Visual Analytics Approach to Public Health. Int. J. Environ. Res. Public Health 15, (2018).
6. Fink, G., McConnell, M. & Nguyen, B. D. Learn or react? An experimental study of preventive health decision making. Exp. Econ. 1–32 (2020) doi:10.1007/s10683-020-09668-6.
7. Ferrer, R. & Klein, W. M. Risk perceptions and health behavior. Curr. Opin. Psychol. 5, 85 (2015).
8. Muse, E. D. et al. Impact of polygenic risk communication: an observational mobile application-based coronary artery disease study. npj Digit. Med. 5, 1–9 (2022).
9. Warburton, D. E. R., Nicol, C. W. & Bredin, S. S. D. Health benefits of physical activity: the evidence. Can. Med. Assoc. J. 174, 801 (2006).
10. Buja, A. et al. Health Literacy and Physical Activity: A Systematic Review. J. Phys. Act. Heal. 17, 1259–1274 (2020).
11. Torkamani, A., Wineinger, N. E. & Topol, E. J. The personal and clinical utility of polygenic risk scores. Nat. Rev. Genet. 19, 581–590 (2018).
12. Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018).
13. Mars, N. J. et al. Polygenic and clinical risk scores and their impact on age at onset of cardiometabolic diseases and common cancers. bioRxiv 727057 (2019) doi:10.1101/727057.
14. Multhaup, M. L. et al. The science behind 23andMe’s Type 2 Diabetes report: Estimating the likelihood of developing type 2 diabetes with polygenic models.
15. Nebula Library - Unlocking Genetic Research. https://nebula.org/blog/nebula-library-unlocking-genetic-research/.
16. Janssens, A. C. & Joyner, M. J. Polygenic risk scores that predict common diseases using millions of single nucleotide polymorphisms: Is more, better? Clin. Chem. 65, 609–611 (2019).
17. Choi, S. W., Mak, T. S. H. & O’Reilly, P. F. Tutorial: a guide to performing polygenic risk score analyses. Nat. Protoc. 15, 2759–2772 (2020).
18. Mostafavi, H. et al. Variable prediction accuracy of polygenic scores within an ancestry group. Elife 9, (2020).
19. Wilson, P. W. F. et al. Prediction of incident diabetes mellitus in middle-aged adults: The Framingham Offspring Study. Arch. Intern. Med. 167, 1068–1074 (2007).
20. D’Agostino, R. B. et al. General cardiovascular risk profile for use in primary care: The Framingham Heart Study. Circulation 117, 743–753 (2008).
21. Boecker, M. & Lai, A. G. Could personalised risk prediction for type 2 diabetes using polygenic risk scores direct prevention, enhance diagnostics, or improve treatment? Wellcome Open Res. 5, 1–14 (2021).
22. Moldovan, A., Waldman, Y. Y., Brandes, N. & Linial, M. Body mass index and birth weight improve polygenic risk score for type 2 diabetes. J. Pers. Med. 11, 582 (2021).
23. Liu, W., Zhuang, Z., Wang, W., Huang, T. & Liu, Z. An Improved Genome-Wide Polygenic Score Model for Predicting the Risk of Type 2 Diabetes. Front. Genet. 0, 63 (2021).
24. He, Y. et al. Comparisons of Polyexposure, Polygenic, and Clinical Risk Scores in Risk Prediction of Type 2 Diabetes. Diabetes Care 44, 935 (2021).
25. De La Vega, F. M. & Bustamante, C. D. Polygenic risk scores: a biased prediction? Genome Med. 10, (2018).
26. Sudlow, C. et al. UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLOS Med. 12, e1001779 (2015).
27. Scholtens, S. et al. Cohort Profile: LifeLines, a three-generation cohort study and biobank. Int. J. Epidemiol. 44, 1172–1180 (2015).
28. Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018).
29. McAllister, K. et al. Current Challenges and New Opportunities for Gene-Environment Interaction Studies of Complex Diseases. Am. J. Epidemiol. 186, 753 (2017).
30. Elliott, J. et al. Predictive Accuracy of a Polygenic Risk Score–Enhanced Prediction Model vs a Clinical Risk Score for Coronary Artery Disease. JAMA 323, 636–645 (2020).
31. Sun, L. et al. Polygenic risk scores in cardiovascular risk prediction: A cohort study and modelling analyses. PLOS Med. 18, e1003498 (2021).
32. Läll, K., Mägi, R., Morris, A., Metspalu, A. & Fischer, K. Personalized risk prediction for type 2 diabetes: the potential of genetic risk scores. Genet. Med. 19, 322–329 (2017).
33. Sheeran, P., Harris, P. R. & Epton, T. Does heightening risk appraisals change people’s intentions and behavior? A meta-analysis of experimental studies. Psychol. Bull. 140, 511–543 (2014).
34. Freedhoff, Y. & Hall, K. D. Weight loss diet studies: we need help not hype. Lancet (London, England) 388, 849–851 (2016).
35. Söderlund, A., Fischer, A. & Johansson, T. Physical activity, diet and behaviour modification in the treatment of overweight and obese adults: A systematic review. Perspect. Public Health 129, 132–142 (2009).
36. Health insurance in the Netherlands | Leaflet | Government.nl. https://www.government.nl/documents/leaflets/2012/09/26/health-insurance-in-the-netherlands.
37. van der Meer, T. P., Wolffenbuttel, B. H. R. & Patel, C. J. Data-driven assessment, contextualisation and implementation of 134 variables in the risk for type 2 diabetes: an analysis of Lifelines, a prospective cohort study in the Netherlands. Diabetologia 64, 1268–1278 (2021).
38. Whicher, C. A., O’Neill, S. & Holt, R. I. G. Diabetes in the UK: 2019. Diabet. Med. 37, 242–247 (2020).
39. Sluijs, T., Lokkers, L., Özsezen, S., Veldhuis, G. A. & Wortelboer, H. M. An Innovative Approach for Decision-Making on Designing Lifestyle Programs to Reduce Type 2 Diabetes on Dutch Population Level Using Dynamic Simulations. Front. Public Heal. 9, (2021).
40. Lewis, A. C. F. & Green, R. C. Polygenic risk scores in the clinic: new perspectives needed on familiar ethical issues. Genome Med. 13, (2021).
41. Mars, N. et al. Polygenic and clinical risk scores and their impact on age at onset and prediction of cardiometabolic diseases and common cancers. Nat. Med. 26, 549–557 (2020).
42. Mars, N. et al. The role of polygenic risk and susceptibility genes in breast cancer over the course of life. Nat. Commun. 11, 1–9 (2020).
43. Zhang, Y. D. et al. Assessment of polygenic architecture and risk prediction based on common variants across fourteen cancers. Nat. Commun. 11, 1–13 (2020).
44. Oetjens, M. T., Kelly, M. A., Sturm, A. C., Martin, C. L. & Ledbetter, D. H. Quantifying the polygenic contribution to variable expressivity in eleven rare genetic disorders. Nat. Commun. 10, 1–10 (2019).
45. AD, M. & D, M. Analyzing Public Interest in Metabolic Health-Related Search Terms During COVID-19 Using Google Trends. Cureus 13, (2021).
46. ugli [Lifelines Wiki]. http://wiki-lifelines.web.rug.nl/doku.php?id=ugli.
47. Lopera Maya, E. A. et al. Lack of Association Between Genetic Variants at ACE2 and TMPRSS2 Genes Involved in SARS-CoV-2 Infection and Human Quantitative Phenotypes. Front. Genet. 11, 613 (2020).
48. Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
49. UK Biobank Accessing UK Biobank Data Version 2.3. (2020).
50. Resource 1967. https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=1967.
51. Data-Field 22000. https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=22000.
52. Data-Field 22006. https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=22006.
53. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
54. Scott, R. A. et al. An Expanded Genome-Wide Association Study of Type 2 Diabetes in Europeans. Diabetes 66, 2888–2902 (2017).
55. Nikpay, M. et al. A comprehensive 1000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat. Genet. 47, 1121–1130 (2015).
56. Xue, A. et al. Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. Nat. Commun. 9, 1–14 (2018).
57. van der Harst, P. & Verweij, N. Identification of 64 Novel Genetic Loci Provides an Expanded View on the Genetic Architecture of Coronary Artery Disease. Circ. Res. 122, 433–443 (2018).

Editorial decision: Major revision (31 Aug, 2022)
Reviewers agreed at journal (28 Jul, 2022)
Reviews received at journal (16 Jul, 2022)
Reviewers agreed at journal (15 Jul, 2022)
Reviewers invited by journal (15 Jul, 2022)
Editor assigned by journal (15 Jul, 2022)
Editor invited by journal (15 Jun, 2022)
Submission checks completed at journal (15 Jun, 2022)
First submitted to journal (25 May, 2022)
