A CNN deep learning model to improve SNP-based hypertension risk prediction accuracy

DOI: https://doi.org/10.21203/rs.3.rs-2285831/v1

Abstract

Hypertension, known as the silent killer, is a modifiable risk factor for cardiovascular diseases such as ischemic heart disease, one of the leading causes of death worldwide. Developing methods to detect the risk of hypertension, especially at a young age, is therefore essential. Most models for predicting disease risk are based primarily on lifestyle factors. Recently, accounting for genetic risk factors, including disease-related single nucleotide polymorphisms (SNPs), has improved the accuracy of individual disease prediction. An SNP is a small genetic change in DNA and is the most common type of genetic variation in humans. Four approaches are used to predict hypertension from genomic markers: statistical, meta-analysis, machine learning, and clinical modeling. The most critical issue in these models is the large number of input SNPs and the relationships among them. The present study applies a deep learning method based on a convolutional neural network (CNN) to multiple SNPs and hypertension labels from a longitudinal cohort study. For comparison, a polygenic risk score (PRS) was calculated using the plink and gcta64 software.

First, the genomic data are converted into an image and fed into the CNN model, whose layers include convolution, pooling, fully connected, and output layers. The data contain three parts: genomic data, age, and longitudinal hypertension data from the Tehran Cardiometabolic Genetic Study (TCGS). The area under the ROC curve (AUC) was used to compare model performance. The CNN model, with an AUC of 0.877, performs better than the PRS and the latest models presented in the literature.

Introduction

Hypertension, the silent killer, is a major cause of cardiovascular diseases such as ischemic heart disease and stroke, and a significant cause of death and disease worldwide. Blood pressure and hypertension are multifactorial traits involving many metabolic pathways (1–3). Methods that can accurately detect the risk of hypertension at young ages are therefore essential.

One of the most important goals of personalized health is to improve the accuracy of disease prediction by examining genetic variants. Accordingly, different clinical and genetic methods are used to predict hypertension.

The human genome and genomic differences

The human genome contains all the information needed for human life. It consists of about 3 billion base pairs of DNA organized into 23 pairs of chromosomes and includes all the information needed to determine a person's traits. Genes encode this information as specific sequences of four bases: adenine, thymine, guanine, and cytosine (A, T, G, C).

When the genomes of two individuals are compared, a large number of differences are seen. They result from genomic changes such as single nucleotide polymorphisms, deletions, substitutions, and copy number changes. Each of these changes, along with environmental factors, can lead to a different phenotype in the individual.

An SNP is a small genomic change in DNA and is the most common type of genetic variation in humans. An SNP occurs when one nucleotide base, for example A, is replaced by one of the other three bases (C, G, or T). These changes can be silent (no change in phenotype), non-pathogenic (a change in phenotype without disease), pathogenic (diabetes, cancer, heart disease, Huntington's disease, and hemophilia), or latent. A latent change, for example one in a gene regulatory region, is not harmful in itself and manifests only under certain conditions, such as susceptibility to lung cancer or a changed response to treatment (4).

Genome-Wide Association Studies (GWAS) and Polygenic Risk Score (PRS)

Genetic linkage analyses were used to examine the relationship between genotype and phenotype through inheritance within families. These methods have been very successful for diseases caused by a single gene, but their findings do not generalize to complex multifactorial diseases. Genome-wide association studies have identified biomarkers, identified individuals carrying specific disease genes, and characterized response to treatment more broadly (5–7).

Most of these studies use a case-control design: a group of patients is selected as the case group and a group of healthy individuals as the control group.

These individuals are genotyped for most known single nucleotide polymorphisms, and the allele frequency of each marker is compared between the case and control groups. The tests estimate effect size as an odds ratio, which measures how strongly the presence or absence of an allele is associated with the presence or absence of the studied phenotype in a specific population (8).
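As a rough illustration of the case-control comparison described above, the following sketch computes an allelic odds ratio from a 2 x 2 table of allele counts. The function name and the example counts are hypothetical and are not taken from the study.

```python
def allelic_odds_ratio(case_risk, case_other, control_risk, control_other):
    """Odds of carrying the risk allele in cases divided by the odds in controls.

    All four arguments are allele counts from a 2 x 2 table.
    """
    if min(case_risk, case_other, control_risk, control_other) <= 0:
        raise ValueError("all four cell counts must be positive")
    return (case_risk * control_other) / (case_other * control_risk)


# Hypothetical counts: 300 risk and 700 other alleles in cases,
# 200 risk and 800 other alleles in controls.
example_or = allelic_odds_ratio(300, 700, 200, 800)
print(round(example_or, 3))
```

An odds ratio above 1 suggests that the allele is more common among cases than among controls; a value near 1 suggests no association.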

A polygenic score (PGS) or polygenic risk score (PRS) estimates an individual's genetic liability to a trait or disease and is calculated from the individual's genotype profile and genome-wide association study data. Although current PRSs usually explain only a small fraction of trait variance, their correlation with the single largest contributor to phenotypic variation, genetic liability, has led to the widespread use of PRSs in biomedical research (9).

Disease risk models

Epidemiology-based disease risk models, which have limited predictive power, rely primarily on lifestyle and related risk factors such as a positive family history. Recently, accounting for genetic risk factors such as disease-related or phenotype-related SNPs has improved the accuracy of individual disease risk prediction (10).

Currently, the development of genetic risk models focuses on achieving enough predictive power to identify individuals at risk reliably. GWAS characterize SNPs in terms of their association with a disease or phenotype in the population (11).

This section reviews genomic models for predicting traits and hypertension, grouped into statistical, meta-analysis, machine learning, and clinical approaches.

Statistical approach

The statistical approach is the dominant approach in genome-wide association analysis. The first statistical model is GRAMMAR (12), in which the data are analyzed under a mixed model.

The GCTA model was presented in 2011 (13). Unlike single-SNP association analysis, GCTA treats the effects of all SNPs as random effects in a linear mixed model. The ACTA model improved the performance of this approach (14). With the advent of fast GPUs, the algorithm was adapted for parallel execution, and the REACTA model was proposed (15). In the most recent research, a PRS model based on SNPs identified in hypertension-related GWAS reached an AUC of 0.804 (16).

Meta-analysis approach

Meta-analysis methods integrate GWAS of the same phenotype using quantities such as p-values and sample sizes. The most important model in this approach is METAL (17), which combines different GWAS based on sample size, variance, p-value, and effect size. Various software packages, such as GWAMA and MetABEL, have been proposed for this approach (18).

Machine learning approach

Examining the combined effect of multiple genetic variants and environmental variables raises several challenges. First, in GWAS, genotypes of up to one million SNPs are determined in only a few thousand individuals, which leads to a small sample size relative to a large number of dimensions. Second, when a large number of SNPs are genotyped on a genomic scale, the linkage disequilibrium between SNPs, which leads to correlated variables, should be taken into account (19, 20).

Clinical approach

In the clinical approach, clinical factors are considered in addition to genetic factors. Risk factors for hypertension considered in clinical research include family history, age, sex, salt intake, alcohol consumption, smoking, sedentary lifestyle, and obesity.

One study using traditional criteria for predicting blood pressure applied machine learning methods and reached an AUC of 0.757 (21). In another study, clinical factors such as age, smoking, family history, physical activity, and body mass index gave an AUC of 0.854, which rose to 0.871 when genetic parameters were added (19).

Research gaps and objectives

There are several challenges in risk modeling based on genomic and clinical factors. First, in GWAS, about one million SNPs are genotyped in only a few thousand individuals, so the sample size is small relative to the number of features. Second, when many genotyped SNPs lie in the same genomic region, the linkage disequilibrium (LD) between them, which leads to correlated variables, must be considered. Third, increasing the accuracy of predictions at younger ages is the most critical issue (21). These challenges can be addressed with machine learning methods, which can handle a large number of features and still work when the features are correlated.

The present study therefore applies a CNN-based deep learning method to multiple blood pressure SNPs from a longitudinal cohort study.

Method

In this paper, the risk of hypertension is modeled using the PRS and CNN methods. Both models use SNPs and age, without any clinical factors.

Polygenic risk score

Because a person's genome does not change significantly from birth, genomic information can act as an early predictor of risk. Hypertension is affected by several genetic variants with small effect sizes. To predict risk meaningfully, the combined effect of multiple SNPs must be examined by calculating a single metric that represents an individual's overall genetic risk. A simple genetic risk score (GRS) is the sum of the number of risk alleles present in each individual, usually over several SNPs from GWAS and sometimes weighted by effect size (22). Recently, much larger sets of SNPs, from thousands to millions, have been used to create improved scores such as the PRS (23).
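The score described above can be sketched in a few lines. This is a minimal illustration, not the plink or gcta64 computation used in the study; the function name, the risk-allele counts, and the effect sizes are hypothetical.

```python
def genetic_risk_score(risk_allele_counts, effect_sizes=None):
    """Sum risk-allele counts (0, 1, or 2 per SNP), optionally weighted.

    With effect_sizes=None every SNP has weight 1, which gives the simple
    unweighted GRS. Passing per-SNP effect sizes gives a weighted score.
    """
    if effect_sizes is None:
        effect_sizes = [1.0] * len(risk_allele_counts)
    if len(effect_sizes) != len(risk_allele_counts):
        raise ValueError("one effect size is required per SNP")
    return sum(w * c for w, c in zip(effect_sizes, risk_allele_counts))


# Hypothetical individual with four SNPs.
counts = [0, 1, 2, 1]
betas = [0.10, 0.05, 0.20, 0.00]  # assumed effect sizes, for illustration only

simple_score = genetic_risk_score(counts)
weighted_score = genetic_risk_score(counts, betas)
print(simple_score, round(weighted_score, 2))
```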

The risk indicated by a PRS is probabilistic and is expressed as a range of possible values, which differs from the risk information given by genetic markers of single-gene disorders.

The plink and gcta64 software are used to calculate the PRS. The execution code is shown in Fig. 1. The code runs on the Linux platform and operates on files in the ped format, which stores the genomic data.

In addition, to validate the model, the data are divided into training and test sets, and the PRS is calculated separately for each set and then used in the disease prediction model.

Convolutional neural network

A convolutional neural network is suitable for processing data with two properties: (1) the features are spatially related, as in an image with pixels arranged in an array or a video with frames in sequence, and (2) the features are homogeneous across the image. The input layer of a CNN takes the data in the form of an image.

The other layers are the convolution layer, the pooling layer, the fully connected layer, and the output layer (24). The convolution layer extracts local features by sliding filters of learned weights across all parts of the image. The pooling layer operates independently on each feature map, for example by average or maximum pooling, which effectively reduces the resolution of the feature map and the number of network parameters to be optimized. A CNN stacks these basic structures, with the output of each layer serving as the input of the next, which allows deep feature learning. The CNN implementation steps are shown in Fig. 2.
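The two core operations, convolution and max pooling, can be illustrated on a small array. The sketch below uses plain Python lists, no padding, and a stride of 1; these choices, the toy image, and the diagonal filter are assumptions made for illustration and do not reproduce the network used in the study.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1), summing products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2 x 2 block."""
    rows, cols = len(feature_map) // 2, len(feature_map[0]) // 2
    return [
        [
            max(feature_map[2 * r][2 * c], feature_map[2 * r][2 * c + 1],
                feature_map[2 * r + 1][2 * c], feature_map[2 * r + 1][2 * c + 1])
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Toy 5 x 5 image with a diagonal line, and a 2 x 2 diagonal filter.
toy_image = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1],
]
diagonal_filter = [[1, 0], [0, 1]]

features = conv2d_valid(toy_image, diagonal_filter)  # 4 x 4 feature map
pooled = max_pool_2x2(features)                      # 2 x 2 after pooling
print(pooled)  # [[2, 0], [0, 2]]
```

The filter responds most strongly where the image contains the diagonal pattern, and pooling halves each dimension while keeping the strongest response in each block.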

The first step is to convert the genome data stored in a ped file into an image file. Each SNP is represented by four consecutive pixels in a row: one pixel, chosen according to the genotype, is black and the other three remain white. For best input, the image is made square, with equal width and height determined by the number of SNPs (about 600K). This method reduces the volume of genomic data with a compression factor of 13%. In the next step, a deep network based on a convolution layer with 24 filters is formed. The filters are 4 × 4; in each row of a filter, one column has the value one and the rest are zero. After the filter is applied, the results are pooled. The number of CNN layers depends on the complexity of the problem; in this study, six layers are appropriate. Each convolution layer is followed by a 2 × 2 max-pooling layer, which reduces the width and height of the image. Because the inputs differ, the data are normalized after the six layers. Finally, the image is converted into an array and passed to the fully connected network (FCN). A sample image after six layers is shown in Fig. 3.
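The encoding step can be sketched as follows. The text does not specify how a genotype selects one of the four pixels, so the sketch assumes a hypothetical integer genotype code from 0 to 3; the pixel values (0 for black, 255 for white), the function names, and the row-by-row layout with white padding are also assumptions.

```python
WHITE, BLACK = 255, 0


def encode_genotype(code):
    """Return four pixels, all white except the one selected by code (0..3).

    The meaning of each code is an assumption; the source only states that
    one of four pixels is blacked out according to the genotype.
    """
    if code not in (0, 1, 2, 3):
        raise ValueError("genotype code must be 0, 1, 2, or 3")
    pixels = [WHITE] * 4
    pixels[code] = BLACK
    return pixels


def genotypes_to_image(codes, snps_per_row):
    """Lay the 4-pixel groups out row by row, padding the last row with white."""
    rows = []
    for start in range(0, len(codes), snps_per_row):
        chunk = codes[start:start + snps_per_row]
        row = [pixel for code in chunk for pixel in encode_genotype(code)]
        row += [WHITE] * (4 * snps_per_row - len(row))
        rows.append(row)
    return rows


# Five hypothetical genotype codes, two SNPs (eight pixels) per row.
toy_pixels = genotypes_to_image([0, 2, 3, 1, 1], snps_per_row=2)
for row in toy_pixels:
    print(row)
```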

Research data

The research data consist of genomic data, age, and longitudinal blood pressure phenotype data for 7268 participants across six phases. These data were selected from the Tehran Cardiometabolic Genetic Study (TCGS) (25). TCGS is a genetic cohort study designed to determine the risk factors for major non-communicable disorders within the Tehran Lipid and Glucose Study (TLGS) (26).

This research was approved by the National Ethics Committee in Biomedical Research in May 2021 under the code IR.SBMU.ENDOCRINE.REC.1400.008. All participants provided a sufficient baseline blood sample for plasma and DNA analysis and gave written consent for blood-based analysis and long-term follow-up. All methods of this study were performed in accordance with the relevant guidelines. All participants were included except those with isolated diastolic hypertension (DBP greater than 90 mmHg and SBP less than 140 mmHg) (27). To measure blood pressure, participants first sat in a chair for 15 minutes, after which a specialist measured their blood pressure twice using a standard mercury sphygmomanometer, with at least a 30-second interval between the two measurements. The average of the two measurements was recorded as the participant's blood pressure (28).

Participants with an SBP above 140 mmHg, a DBP above 90 mmHg, or who were taking hypertension medication at any phase are labeled hypertensive; all others are considered normal (19, 29).
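The labeling rule can be written as a short function. This is a sketch of the stated thresholds only; the function and argument names are hypothetical, and applying the rule to one phase at a time is an assumption.

```python
def label_hypertension(sbp, dbp, on_medication):
    """Label one record as 'hypertensive' or 'normal'.

    sbp and dbp are in mmHg (the averages of the two readings);
    on_medication is True if the participant takes hypertension medication.
    """
    if sbp > 140 or dbp > 90 or on_medication:
        return "hypertensive"
    return "normal"


print(label_hypertension(150, 80, False))  # hypertensive
print(label_hypertension(120, 80, False))  # normal
print(label_hypertension(120, 80, True))   # hypertensive
```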

In phase two or three, DNA was extracted from white blood cells by the standard proteinase K method and genotyped by deCODE with the HumanOmniExpress-24-v1-0 chip according to the Illumina protocol (30). The genomic data for each individual comprise 600K SNPs stored in a ped file.

Following (31), to improve the results and for quality control, participants were grouped by age (over and under 18 years) and SNPs were classified as rare or common. Accordingly, only SNPs with a missing rate below 5% and a minor allele frequency (MAF) above 5% are included.
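The SNP filter can be sketched as below. The sketch assumes each genotype is stored as a count of one allele (0, 1, or 2), with None marking a missing call; this representation and the function name are hypothetical, and in practice such filtering is usually done with dedicated genetics software.

```python
def passes_qc(genotypes, max_missing=0.05, min_maf=0.05):
    """Return True if a SNP has missing rate < max_missing and MAF > min_maf.

    genotypes: one entry per individual, an allele count (0, 1, 2) or None.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    missing_rate = 1 - len(called) / len(genotypes)
    allele_freq = sum(called) / (2 * len(called))
    maf = min(allele_freq, 1 - allele_freq)  # frequency of the rarer allele
    return missing_rate < max_missing and maf > min_maf


# Three hypothetical SNPs, each genotyped in 100 individuals.
common_snp = [0, 1, 2, 1, 0, 1, 0, 2, 1, 0] * 10   # MAF 0.4, no missing calls
rare_snp = [0] * 99 + [1]                          # MAF 0.005
sparse_snp = [1] * 90 + [None] * 10                # 10% missing calls

print(passes_qc(common_snp), passes_qc(rare_snp), passes_qc(sparse_snp))
```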

Validation and evaluation of results

The literature describes various criteria for validating and evaluating modeling results; these are shown in Table 1. TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative. These criteria express the efficiency of the constructed model on the research data.
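Table 1 is not reproduced here, so the following sketch computes a set of commonly used criteria from the four counts; which of them appear in Table 1 is an assumption. The counts in the example are hypothetical and are assumed to be nonzero where they appear in a denominator.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard criteria derived from the confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for 100 test records.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```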

In addition to these criteria, one of the most widely used criteria for comparing models is the area under the receiver operating characteristic curve (AUC). A perfect classifier has an AUC of 1, and a classifier no better than chance has an AUC of 0.5. A good classifier has an AUC greater than 0.5, and performance is considered excellent as the AUC approaches 1 (32).
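One way to read the AUC is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; the function name and the example scores are hypothetical, and the pairwise loop is meant for clarity, not for large data sets.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = hypertensive) and model scores.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)  # 0.75
```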

10-fold cross-validation is used to validate the models: the data are divided into ten equal, non-overlapping parts, a model is built for each split, and the final result is the average of the results obtained. Because each participant has similar data across phases, all rows for a participant are placed in a single fold.
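Keeping all rows of a participant in one fold can be sketched as follows. The round-robin assignment over sorted participant identifiers is an assumption made to keep the example deterministic; the study does not state how participants were assigned to folds, and the identifiers are hypothetical.

```python
def grouped_folds(participant_ids, n_folds=10):
    """Return a fold index per row so that a participant never spans two folds."""
    unique_ids = sorted(set(participant_ids))
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique_ids)}
    return [fold_of[pid] for pid in participant_ids]


# Six rows (one per phase record) from three hypothetical participants.
row_ids = ["p1", "p1", "p2", "p3", "p3", "p3"]
fold_index = grouped_folds(row_ids, n_folds=2)
print(fold_index)  # [0, 0, 1, 0, 0, 0]
```

Each fold in turn serves as the test set while the remaining folds form the training set, so no participant contributes rows to both sets in the same split.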

The PHP language is used to prepare the data, and Orange is used for modeling and visualization. The code and sample data are available on GitHub.

Results

After the PRS was calculated, simple neural network, random forest, decision tree, logistic regression, and support vector machine (SVM) models were used to obtain the classification results. The results of PRS modeling are shown in Table 2 and the ROC curve in Fig. 4.

In the CNN modeling, different classification methods are used for the FCN layer. The results are shown in Table 3 and the ROC curve in Fig. 5.

The final comparative results of PRS and CNN modeling are shown in Table 4.

Discussion

In the present study, more than 45,000 blood pressure samples from TCGS participants were evaluated and predicted with the PRS and CNN models. The CNN model reaches an AUC of 0.877 without considering any additional risk factors. By comparison, the AUC values reported in the literature are 0.804 for modeling based on genomic variants (16) and 0.871 for modeling based on genomic and clinical factors (19). These results indicate the efficiency of the CNN method. A crucial practical achievement in the genomic era is the creation of a tool for early detection of hypertension risk from genomic markers, which can improve prevention and treatment in high-risk individuals.

Conclusion And Future Work

Our approach provides an improved tool for predicting the risk of hypertension at younger ages, but it is limited by its reliance on internal validation. No data outside TCGS were used for external validation, and the cohort has limited ethnic diversity.

The primary purpose of this study was to improve the accuracy of hypertension risk prediction with a CNN deep neural network, and the approach outperformed PRS. In future work, accuracy may be increased further by adding influential clinical variables such as race. Prehypertension, and the separation of isolated diastolic hypertension (IDH) and isolated systolic hypertension (ISH), could also be used to model hypertension better.

Declarations

Ethical Approval and Consent to Participate

This research was approved by the National Committee for Ethics in Biomedical Research in May 2021 with code IR.SBMU.ENDOCRINE.REC.1400.008. 

Consent for publication

All participants provided an adequate baseline blood sample for plasma and DNA analysis and gave written consent for blood-based analyses and long-term follow-up.

Human and Animal Rights

All methods in this study were performed following appropriate and relevant guidelines. 

Availability of data and materials

Not applicable

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Conflict of interest

The authors declare no competing financial interests.

Acknowledgements

We gratefully acknowledge the input of the investigators, research coordinators, and committee members of this study.

Supplementary materials

Details of the supplementary materials are available with the published article.

Authorship contributions

Conception and design of the study: Seyed Ali Lajevardi, Mehrdad Kargari, Maryam Sadat Daneshpour;

Acquisition of data: Seyed Ali Lajevardi, Maryam Sadat Daneshpour, Mehdi Akbarzadeh;

Analysis and/or interpretation of data: Seyed Ali Lajevardi, Maryam Sadat Daneshpour, Mehdi Akbarzadeh;

Drafting the manuscript: Seyed Ali Lajevardi, Maryam Sadat Daneshpour.

References

  1. López-Martínez F, Núñez-Valdez ER, Crespo RG, García-Díaz V. An artificial neural network approach for predicting hypertension using NHANES data. Sci Rep [Internet]. 2020 Dec 30;10(1):10620. Available from: https://doi.org/10.1038/s41598-020-67640-z
  2. Mills KT, Stefanescu A, He J. The global epidemiology of hypertension. Nat Rev Nephrol [Internet]. 2020;16(4):223–37. Available from: http://dx.doi.org/10.1038/s41581-019-0244-2
  3. Zhou B, Bentham J, Di Cesare M, Bixby H, Danaei G, Cowan MJ, et al. Worldwide trends in blood pressure from 1975 to 2015: a pooled analysis of 1479 population-based measurement studies with 19·1 million participants. Lancet. 2017;389(10064):37–55.
  4. Filshtein TJ, Brenowitz WD, Mayeda ER, Hohman TJ, Walter S, Jones RN, et al. Reserve and Alzheimer's disease genetic risk: Effects on hospitalization and mortality. Alzheimer's Dement. 2019 Jul 1;15(7):907–16.
  5. Mills MC, Rahal C. A scientometric review of genome-wide association studies. Commun Biol [Internet]. 2019 Dec 7;2(1):9. Available from: http://dx.doi.org/10.1038/s42003-018-0261-x
  6. Hebbring S. Genomic and Phenomic Research in the 21st Century. Trends Genet [Internet]. 2019;35(1):29–41. Available from: https://doi.org/10.1016/j.tig.2018.09.007
  7. Bush WS. Genome-wide association studies. Encycl Bioinforma Comput Biol ABC Bioinforma. 2018;1–3:235–41.
  8. Visscher P, Brown M, McCarthy M, Yang J. Five Years of GWAS Discovery. Am J Hum Genet [Internet]. 2012;90(1):7–24. Available from: https://doi.org/10.1016%2Fj.ajhg.2011.11.029
  9. Choi SW, Mak TSH, O'Reilly PF. Tutorial: a guide to performing polygenic risk score analyses. Nat Protoc [Internet]. 2020;15(9):2759–72. Available from: http://dx.doi.org/10.1038/s41596-020-0353-1
  10. Mosley JD, Gupta DK, Tan J, Yao J, Wells QS, Shaffer CM, et al. Predictive Accuracy of a Polygenic Risk Score Compared with a Clinical Risk Score for Incident Coronary Heart Disease. JAMA - J Am Med Assoc. 2020;323(7):627–35.
  11. Abraham G, Inouye M. Genomic risk prediction of complex human disease and its clinical application. Curr Opin Genet Dev [Internet]. 2015;33(Cvd):10–6. Available from: http://dx.doi.org/10.1016/j.gde.2015.06.005
  12. Aulchenko YS, Ripke S, Isaacs A, van Duijn CM. GenABEL: an R library for genome-wide association analysis. Bioinformatics [Internet]. 2007;23(10):1294–6. Available from: https://doi.org/10.1093%2Fbioinformatics%2Fbtm108
  13. Yang J, Zeng J, Goddard ME, Wray NR, Visscher PM. Concepts, estimation and interpretation of SNP-based heritability. Nat Genet [Internet]. 2017;49(9):1304–10. Available from: https://doi.org/10.1038%2Fng.3941
  14. Gray A, Stewart I, Tenesa A. Advanced Complex Trait Analysis. Bioinformatics. 2012;28(23):3134–6.
  15. Cebamanos L, Gray A, Stewart I, Tenesa A. Regional heritability advanced complex trait analysis for GPU and traditional parallel architectures. Bioinformatics [Internet]. 2014;30(8):1177–9. Available from: https://doi.org/10.1093%2Fbioinformatics%2Fbtt754
  16. Vaura F, Kauko A, Suvila K, Havulinna AS, Mars N, Salomaa V, et al. Polygenic Risk Scores Predict Hypertension Onset and Cardiovascular Risk. Hypertens (Dallas, Tex 1979) [Internet]. 2021 Apr [cited 2021 Mar 15];77(4):1119–27. Available from: http://www.ncbi.nlm.nih.gov/pubmed/33611940
  17. Willer CJ, Li Y, Abecasis GR. METAL: Fast and efficient meta-analysis of genome-wide association scans. Bioinformatics. 2010;26(17):2190–1.
  18. Evangelou E, Ioannidis JPA. Meta-analysis methods for genome-wide association studies and beyond. Nat Rev Genet [Internet]. 2013;14(6):379–89. Available from: http://dx.doi.org/10.1038/nrg3472
  19. Niu M, Wang Y, Zhang L, Tu R, Liu X, Hou J, et al. Identifying the predictive effectiveness of a genetic risk score for incident hypertension using machine learning methods among populations in rural China. Hypertens Res [Internet]. 2021; Available from: http://dx.doi.org/10.1038/s41440-021-00738-7
  20. Szymczak S, Biernacka JM, Cordell HJ, González-Recio O, König IR, Zhang H, et al. Machine learning in genome-wide association studies. Genet Epidemiol [Internet]. 2009;33(S1):S51–S57. Available from: https://doi.org/10.1002%2Fgepi.20473
  21. Wu X, Yuan X, Wang W, Liu K, Qin Y, Sun X, et al. Value of a Machine Learning Approach for Predicting Clinical Outcomes in Young Patients With Hypertension. Hypertension [Internet]. 2020;75(5):1271–8. Available from: https://doi.org/10.1161%2Fhypertensionaha.119.13404
  22. Padmanabhan S, Dominiczak AF. Genomics of hypertension: the road to precision medicine. Nat Rev Cardiol [Internet]. 2020 Nov 20; Available from: http://dx.doi.org/10.1038/s41569-020-00466-4
  23. Torkamani A, Wineinger NE, Topol EJ. The personal and clinical utility of polygenic risk scores. Nat Rev Genet [Internet]. 2018;19(9):581–90. Available from: http://dx.doi.org/10.1038/s41576-018-0018-x
  24. Luo Y, Li Y, Lu Y, Lin S, Liu X. The prediction of hypertension based on convolution neural network. 2018 IEEE 4th Int Conf Comput Commun ICCC 2018. 2018;2122–7.
  25. Daneshpour MS, Fallah M-S, Sedaghati-Khayat B, Guity K, Khalili D, Hedayati M, et al. Rationale and Design of a Genetic Study on Cardiometabolic Risk Factors: Protocol for the Tehran Cardiometabolic Genetic Study (TCGS). JMIR Res Protoc [Internet]. 2017;6(2):e28. Available from: https://doi.org/10.2196%2Fresprot.6050
  26. Azizi F, Ghanbarian A, Momenan AA, Hadaegh F, Mirmiran P, Hedayati M, et al. Prevention of non-communicable disease in a population in nutrition transition: Tehran Lipid and Glucose Study phase II. Trials [Internet]. 2009;10(1). Available from: https://doi.org/10.1186%2F1745-6215-10-5
  27. Mahajan S, Zhang D, He S, Lu Y, Gupta A, Spatz ES, et al. Prevalence, Awareness, and Treatment of Isolated Diastolic Hypertension: Insights From the China PEACE Million Persons Project. J Am Heart Assoc [Internet]. 2019 Oct;8(19):1–17. Available from: https://www.ahajournals.org/doi/10.1161/JAHA.119.012954
  28. Tohidi M, Hatami M, Hadaegh F, Azizi F. Triglycerides and triglycerides to high-density lipoprotein cholesterol ratio are strong predictors of incident hypertension in Middle Eastern women. J Hum Hypertens. 2012;26(9):525–32.
  29. Mills KT, Bundy JD, Kelly TN, Reed JE, Kearney PM, Reynolds K, et al. Global Disparities of Hypertension Prevalence and Control. Circulation [Internet]. 2016 Aug 9;134(6):441–50. Available from: https://www.ahajournals.org/doi/10.1161/CIRCULATIONAHA.115.018912
  30. Kolifarhood G, Sabour S, Akbarzadeh M, Sedaghati-khayat B, Guity K, Rasekhi Dehkordi S, et al. Genome-wide association study on blood pressure traits in the Iranian population suggests ZBED9 as a new locus for hypertension. Sci Rep [Internet]. 2021 Jun 3 [cited 2021 Sep 1];11(1):1–13. Available from: https://www.nature.com/articles/s41598-021-90925-w
  31. Mattoo TK. Definition and diagnosis of hypertension in children and adolescents - UpToDate. UpToDate [Internet]. 2019;(Cv):1–34. Available from: https://www.uptodate.com/contents/definition-and-diagnosis-of-hypertension-in-children-and-adolescents?search=tension arterial&source=search_result&selectedTitle=1~150&usage_type=default&display_rank=1#H12
  32. Martinez-Ríos E, Montesinos L, Alfaro-Ponce M, Pecchia L. A review of machine learning in hypertension detection and blood pressure estimation based on clinical and physiological data. Biomed Signal Process Control [Internet]. 2021 Jul;68(March):102813. Available from: https://doi.org/10.1016/j.bspc.2021.102813

Tables

Tables 1 to 4 are available in the Supplementary Files section