Importance of GWAS risk loci and clinical data in predicting asthma using machine-learning approaches

Asthma is a serious immune-mediated respiratory airway disease. Its pathogenesis involves both genetic and environmental factors but remains unclear. To understand the risk factors of asthma, we combined genome-wide association study (GWAS) risk loci and clinical data to predict asthma using machine-learning approaches. A case–control study with 123 asthma patients and 100 healthy controls was conducted in the Zhuang population in Guangxi. GWAS risk loci were detected using polymerase chain reaction, and clinical data were collected. Machine-learning approaches (e.g., extreme gradient boosting [XGBoost], decision tree, support vector machine, and random forest algorithms) were used to identify the major factors that contributed to asthma. A total of 14 GWAS risk loci with clinical data were analyzed using 10-fold cross-validation repeated 10 times for all machine-learning models. Using GWAS risk loci or clinical data alone, the best performances were area under the curve (AUC) values of 64.3% and 71.4%, respectively. Combining GWAS risk loci and clinical data, XGBoost established the best model with an AUC of 79.7%, indicating that the combination of genetics and clinical data can improve performance. We then ranked the features by importance and found that the top six risk factors for predicting asthma were rs3117098, rs7775228, family history, rs2305480, rs4833095, and body mass index. Asthma-prediction models based on GWAS risk loci and clinical data can accurately predict asthma and thus provide insights into the pathogenesis of asthma. Further research is required to evaluate more genetic markers and clinical data and predict asthma risk.


Background
Asthma is an immune-mediated respiratory airway disease, characterized by cough, wheezing, chest tightness, and shortness of breath. The Global Initiative for Asthma (GINA) has reported that the prevalence of asthma is increasing in many countries, and approximately 339 million people are affected worldwide [1]. Over recent decades, various studies have focused on the relationship between genetics and/or environment and asthma, but its etiology remains unclear. Although the heritability estimate of asthma has reached 80% [2], the etiology cannot simply be explained by genetic factors. The mechanisms underlying environmental effects on genetics also play an important role. The methods for gene discovery in asthma progressed from candidate gene association studies to family-based genome-wide linkage analyses and then to genome-wide association studies (GWAS). GWAS of asthma have dominated in recent years, providing bias-free discovery of novel risk loci [3]. Since the first GWAS of childhood asthma was reported in 2007, 83 papers on asthma or asthma-related traits had been recorded in the GWAS Catalog as of January 27, 2020 (https://www.ebi.ac.uk/gwas/). Among these papers, more than 1000 loci are associated with asthma or asthma-related traits. However, most of the loci are located in introns or in the intergenic regions, and few loci are located in the functional regions such as exons and non-coding regions. Fourteen loci in the functional regions are associated with asthma. Five missense variants (CDHR3 rs6967330 [4], TLR1 rs4833095 [5], GSDMA rs7212938 [5], GSDMB rs2305480 [6], and GSDMA rs3894194 [6]), three regulatory region variants (IL5 rs4143832 [7], HLA-DQB1 rs7775228 [8], and TACR1 rs7588010 [9]), three upstream gene variants (IL33 rs928413 [4], HLA-DRA rs9268516 [10], and TSLP rs1837253 [11]), two non-coding transcript exon variants (NOTCH4 rs404860 [8] and BTNL2 rs3117098 [8]), and one 5′UTR variant (IL1RL1 rs3771180 [11]) are reported.
Most of these loci are located in immune-related genes, such as TLR1, GSDMA, IL5, and HLA-DQB1, which is consistent with immune responses in asthma. Several loci have been replicated and associated with asthma in China (rs6967330 [12] and rs928413 [13]), United Kingdom (rs7212938 and rs3894194 [14]), Korea (rs7212938 [15]), Slovenia (rs2305480 [16]), and Sweden (rs2305480 and rs3771180 [17]). Furthermore, two loci (rs1837253 and rs3117098) have been replicated and associated with asthma in the Guangxi Zhuang population in our previous studies [18,19]. This finding suggests that GWAS risk loci, particularly those located in the functional regions, play an important role in the genetics of asthma.
To date, many studies of candidate genes or environmental risk factors have been conducted, but they have often excluded gene-environment interactions in asthma. The observable features (phenotype) of asthma, such as clinical features, and the underlying mechanisms (endotype) are complex and represent a series of host-environment interactions that occur over different spatial scales. Moreover, environmental exposure factors (e.g., exposure to tobacco smoke and allergies, body mass index [BMI], vitamin D levels, and air pollutants) may alter gene activity and expression without changing the underlying DNA sequence and increase the risk of asthma. In recent years, several Mendelian randomization studies have confirmed that higher BMI [20], smoking [21], and low linoleic acid [22] (which is believed to suppress immune responses) may increase asthma risk. Therefore, a risk prediction model for asthma requires a combination of genetic and environmental information.
The traditional method, which sets P values as the "gold standard" of statistical validity [23], cannot meet the present requirements of multiple data types and high accuracy of risk prediction. The present study is the first report on combining GWAS risk loci and clinical data to predict asthma using machine-learning approaches. We conducted a case-control study of 123 patients with asthma and 100 healthy controls and detected 14 GWAS risk loci located in the functional regions. We collected clinical data that were easy to obtain from the records of patients with asthma. Machine-learning approaches were used to build an asthma risk prediction model by combining GWAS risk loci and clinical data.

Subjects And Methods
Subject ascertainment and collection of clinical data
A total of 123 patients with asthma and 100 healthy controls were recruited from the First Affiliated Hospital of Guangxi Medical University in China, from February 2010 to August 2016. All subjects were from the Zhuang population and permanently resided in Guangxi Province, China. Patients with asthma were diagnosed by at least two respiratory physicians in accordance with the guidelines of GINA. The controls were also evaluated by respiratory physicians using a self-report questionnaire, which covered general conditions and medical history; those who had a history of lung diseases, asthma, rhinitis, or other allergic diseases were excluded. Clinical data, including gender, age, height, weight, family history, exposure to tobacco smoke, and allergies, were collected from electronic medical records. The same information was also collected for controls from the health examination center. The present study was approved by the Institutional Ethics Committee of Guangxi Medical University (Approval Number: 2013-KY-GuiKe-053), in accordance with the Declaration of Helsinki. All subjects were informed and provided written informed consent.
DNA isolation and SNP genotyping
DNA was extracted from peripheral venous blood (2 mL) of each subject using a DNA extraction kit (provided by Tiangen, Shanghai, China) following the manufacturer's instructions. Primer design (Primer 3 Online), synthesis, and SNP genotyping were conducted by Shanghai BioWing Applied Biotechnology Company (http://www.biowing.com.cn/). The primer sequences of the 14 GWAS risk loci are listed in Table 1. All these SNPs were genotyped using the polymerase chain reaction (PCR)/ligase detection reaction assay. A multiplex PCR method was used for the amplification of target DNA sequences. The reaction conditions included 2 min of initial denaturation at 95 °C, then 40 cycles of denaturation at 94 °C for 30 s, 90 s of annealing at 53 °C, 30 s of extension at 65 °C, and a final 10 min of extension at 65 °C. To determine whether the reaction was successful, 2 µL of each product was run in a 3.0% agarose gel.
The ligation reaction for each subject was conducted in a total volume of 10 µL, including 1× NEB Taq DNA ligase buffer 1 µL, 2 pmol/µL of each probe mix 1 µL, Taq DNA ligase 0.05 µL, ddH2O 4 µL, and multi-PCR product 4 µL. The ligase detection reaction was performed at 95 °C for 2 min, then 40 cycles at 94 °C for 15 s, and at 50 °C for 20 s. The fluorescent ligase detection reaction product was characterized by the sequencer PRISM 3730 (ABI). Approximately 5% of the DNA samples were added into the total samples under blind conditions to assess the quality of SNP genotyping. The concordance rate was 100%.

Data preprocessing and machine-learning approaches
The procedure for preprocessing data was as follows: (1) For the genotype of GWAS risk loci, wild type was set to "0", heterozygous was set to "1", and homozygous was set to "2".
Missing records accounted for no more than 10% of all features, and we used the mode of each feature to fill in the missing records. The traditional method (chi-square test) was used to compare the differences between cases and controls with SPSS 16.0.
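The encoding and imputation steps described above can be sketched in plain Python. This is a minimal illustration rather than the authors' code: the genotype strings, the `reference_allele` argument, the example records, and both helper names are assumptions, and the sketch assumes that "homozygous" refers to the homozygous variant genotype.

```python
from collections import Counter

def encode_genotype(genotype, reference_allele):
    """Map a two-letter genotype to 0/1/2 by counting non-reference alleles.

    Returns None for a missing genotype so that it can be imputed later.
    """
    if genotype is None:
        return None
    return sum(1 for allele in genotype if allele != reference_allele)

def impute_with_mode(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v is None else v for v in values]

# Hypothetical genotypes for one SNP whose wild-type allele is assumed to be "G".
raw = ["GG", "GA", None, "AA", "GG", "GG"]
encoded = [encode_genotype(g, "G") for g in raw]
imputed = impute_with_mode(encoded)
print(encoded)  # [0, 1, None, 2, 0, 0]
print(imputed)  # [0, 1, 0, 2, 0, 0]
```

Mode imputation keeps every feature on its original 0/1/2 scale, which suits categorical genotype codes better than a mean would.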
All modeling processes were programmed in PyCharm using Python version 3.7.4 on a laptop with an Intel® Core™ i7-9850H central processing unit (2.60 GHz) and 16 GB RAM. Using cross_val_score [26], evaluation metrics were calculated to evaluate the predictive power of the models, including area under the curve (AUC) of the receiver operating characteristic curve, accuracy score, precision score, recall score, and f1_score. We evaluated all models using 10-fold cross-validation repeated 10 times. Machine-learning approaches including XGBoost [27], decision tree (DT) [28], support vector machine (SVM) [29], and random forest (RF) [30] algorithms were selected as classifiers to identify the importance of features.
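The resampling scheme, 10-fold cross-validation repeated 10 times, can be illustrated with a short plain-Python splitter. This is a sketch under stated assumptions: the text does not say whether the folds were stratified by case/control status, so stratification, the `seed` value, and the function name `repeated_stratified_kfold` are all choices made here for illustration.

```python
import random

def repeated_stratified_kfold(labels, n_splits=10, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated stratified k-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        # Shuffle each class separately, then deal its indices round-robin
        # so that every fold keeps roughly the overall case/control ratio.
        for cls in sorted(set(labels)):
            idx = [i for i, y in enumerate(labels) if y == cls]
            rng.shuffle(idx)
            for pos, i in enumerate(idx):
                folds[pos % n_splits].append(i)
        for k in range(n_splits):
            test_idx = sorted(folds[k])
            train_idx = sorted(i for j, f in enumerate(folds) if j != k for i in f)
            yield train_idx, test_idx

# Labels matching the study's group sizes: 123 cases (1) and 100 controls (0).
labels = [1] * 123 + [0] * 100
splits = list(repeated_stratified_kfold(labels))
print(len(splits))  # 100 train/test splits in total
```

Each model would be fitted on `train_idx` and scored on `test_idx`, and the 100 scores averaged. In scikit-learn, a `RepeatedStratifiedKFold` object passed as the `cv` argument of `cross_val_score` plays the same role.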

GWAS risk loci genotype and clinical data of study subjects
A total of 123 patients with asthma (50 males and 73 females) and 100 healthy controls (52 males and 48 females) were included in the present study. The median age and range of the asthma group were 27.9 years and 22-67 years, respectively, and those of the controls were 38.8 years and 18-71 years, respectively. GWAS risk loci genotype and clinical data of cases and controls, which were used as risk features for asthma classification, are listed in Table 2. When the chi-square test was used to compare the proportions of these risk features, four positive features (rs3117098, rs1837253, BMI, and family history) were significantly different between cases and controls (P<0.05).
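As a worked example of the chi-square comparison, the sketch below applies the Pearson statistic to the sex distribution reported above. It is an illustration only: the authors used SPSS and their exact options are not stated, so the absence of a continuity correction is an assumption, and `chi_square_2x2` is a hypothetical helper name.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and P value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Cases: 50 males and 73 females; controls: 52 males and 48 females.
statistic, p_value = chi_square_2x2(50, 73, 52, 48)
print(f"chi2 = {statistic:.2f}, P = {p_value:.3f}")
```

Under these assumptions the P value exceeds 0.05, which agrees with sex not appearing among the four significant features.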

Machine-learning model performance and comparison
Four different machine-learning models were evaluated with five evaluation metrics in Python, and the results are listed in Table 3. Using GWAS risk loci or clinical data alone, the best performances were AUC values of 64.3% and 71.4%, respectively. Using the four positive features of the traditional method, the best performance was an AUC of 70.2%. When combining GWAS risk loci and clinical data, XGBoost established the best model with an AUC of 79.7%, indicating that the combination of genetics and clinical data could achieve better performance.
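For readers less familiar with the five metrics, the following plain-Python sketch shows how each is defined. The labels, scores, and the 0.5 decision threshold are invented for illustration and are not taken from Table 3; the function names are likewise assumptions.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as one half.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

# Hypothetical predicted probabilities for four cases and four controls.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.7, 0.3, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]

auc = roc_auc(y_true, scores)
metrics = classification_metrics(y_true, y_pred)
print(auc)      # 0.875
print(metrics)  # accuracy, precision, recall, and f1 are all 0.75 here
```

AUC depends only on how the scores rank cases against controls, whereas the other four metrics depend on the chosen threshold, which is one reason AUC is convenient for comparing models.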

Important features of asthma prediction
The XGBoost models, which showed the best predictive performance, were selected as the final predictors. To capture the informative risk features of asthma, a split-count measure was used, which gives the number of times each feature is used as a splitting node in the trees. A high F score indicated that the corresponding feature was important. As shown in Figure 1, the top six risk factors in predicting asthma were rs3117098, rs7775228, family history, rs2305480, rs4833095, and BMI.
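The split-count idea behind the F score can be sketched as follows. The two trees, their nested-dictionary layout, and the leaf values are invented for illustration and do not reproduce the fitted model; in the XGBoost library this kind of count corresponds to the "weight" importance type.

```python
from collections import Counter

def count_splits(node, counts):
    """Recursively count how often each feature is used as a split."""
    if "leaf" in node:
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)

# Two hypothetical trees; structure and leaf values are made up.
trees = [
    {"feature": "rs3117098",
     "left": {"feature": "family_history",
              "left": {"leaf": -0.2}, "right": {"leaf": 0.4}},
     "right": {"leaf": 0.1}},
    {"feature": "rs3117098",
     "left": {"leaf": -0.1},
     "right": {"feature": "BMI",
               "left": {"leaf": 0.0}, "right": {"leaf": 0.3}}},
]

counts = Counter()
for tree in trees:
    count_splits(tree, counts)

ranking = counts.most_common()
print(ranking)  # [('rs3117098', 2), ('family_history', 1), ('BMI', 1)]
```

A split count says how often a feature is used, not how much each split improves the fit, so a feature with many weak splits can rank above one with a few strong splits.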

Discussions
In general, collecting all environmental exposure factors and detecting all genetic information for asthma is difficult. As such, using the limited genetic and environmental exposure information available to predict the risk of asthma is important. In the present study, we applied machine-learning approaches by combining GWAS risk loci and clinical data to build an accurate classifier for the prediction of asthma. The results showed that XGBoost established the best model with an AUC of 79.7% in predicting asthma, wherein rs3117098, rs7775228, family history, rs2305480, rs4833095, and BMI were the top six risk factors in this model.
A previous study selected SNPs from the openSNP database and used machine-learning approaches to predict asthma. The results showed that asthma can be predicted with an AUC of 0.62 and 0.64 for RF-SVM and RF-K-nearest neighbor models, respectively [31]. A similar performance, with an AUC of 64.3%, was obtained in our study when only GWAS risk loci were adopted as risk factors. Family-based studies showed that asthma was a heritable disease with heritability estimates of approximately 50%-60% [32]. Genetic variants identified as associated with asthma in large-scale GWAS accounted for only a small fraction of this heritability [3]. Introducing clinical data as environmental exposure factors was necessary to improve the accuracy of asthma-prediction models. AlSaad et al. used a real electronic health record dataset comprising 6159 asthma cases and 4912 controls to predict the risk of asthma and obtained the highest AUC of 0.831 [33]. In contrast to using GWAS risk loci or clinical data alone, we combined GWAS risk loci and clinical data and obtained a more accurate asthma-prediction model with an AUC of 79.7%.
The accuracy of our study was lower than that of AlSaad et al. [33], probably because the sample size of our study was relatively small and we selected fewer clinical data. Collecting the clinical data for control samples was difficult, particularly for clinical indicators that had not been measured. Therefore, larger, more comprehensive data collection and better-designed research are required to verify our results.
Meanwhile, we found that the top six risk factors in predicting asthma were rs3117098, rs7775228, family history, rs2305480, rs4833095, and BMI. However, in contrast to the results obtained by traditional methods, several loci without significant differences between cases and controls appeared among the top risk factors based on XGBoost modeling. XGBoost considered the interactions of risk factors, whereas the traditional method only performed direct genotype-phenotype association testing. The XGBoost method can therefore use more information than traditional methods for the construction of a classification model. Boosting is a popular ensemble technique in which new models are added to correct the errors made by the prior models. Models are added sequentially until no remarkable improvement can be observed. Gradient boosting is an algorithm in which new models are created to predict the residuals of previous models and are then combined for the final prediction. When new models are added, a gradient descent algorithm is used to minimize the loss. The XGBoost model has been widely used for diagnosis classification [34,35], treatment effect [36,37], and prognosis evaluation [38,39] in different diseases.
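The residual-fitting idea behind gradient boosting can be made concrete with a toy example. The sketch below is not XGBoost: it assumes squared-error loss, one-split regression stumps, a single numeric feature, and invented data, whereas XGBoost adds regularization, uses second-order gradient information, and would use a logistic loss for case/control classification. All function names are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the threshold on x whose two-leaf fit to the residuals is best."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def mse(y, predictions):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)

def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Boost stumps on squared error; residuals are the negative gradient."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Each new stump is fitted to what the current ensemble still gets wrong.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
    return base, stumps, predictions

# Toy data: one hypothetical numeric feature and a binary outcome.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]
base, stumps, fitted = gradient_boost(x, y)
print(mse(y, [base] * len(y)), mse(y, fitted))
```

The training error after boosting is lower than that of the constant baseline, because each round moves the predictions a fraction `learning_rate` of the way toward the current residuals.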
The present study has several important limitations. First, pulmonary function and other clinical indicators are closely related to asthma development, but they were not included in our clinical feature set because such data were lacking for controls. Further research should focus on gathering more clinical data to improve the diagnostic value for asthma. Second, our work included only the Zhuang population from a single center in China. The prediction models might not be suitable for other ethnicities or regions. Third, although we used 10-fold cross-validation, the small sample size still affected the accuracy and stability of the model.

Conclusions
Our study combined GWAS risk loci and clinical data for the first time to construct asthma-prediction models. We have obtained an asthma-prediction model with a higher accuracy based on the XGBoost method, which may provide insights into the pathogenesis of asthma. Further study is required to evaluate more genetic markers and clinical data and predict asthma risk.

Figure 1. Feature importance plot of features.