Subject ascertainment and collection of clinical data
A total of 123 patients with asthma and 100 healthy controls were recruited from the First Affiliated Hospital of Guangxi Medical University, China, between February 2010 and August 2016. All subjects were of Zhuang ethnicity and were permanent residents of Guangxi Province, China. Asthma was diagnosed by at least two respiratory physicians in accordance with the Global Initiative for Asthma (GINA) guidelines. Controls were also evaluated by respiratory physicians using a self-report questionnaire covering general condition and medical history; those with a history of lung disease, asthma, rhinitis, or other allergic diseases were excluded. Clinical data for patients, including gender, age, height, weight, family history, exposure to tobacco smoke, and allergies, were collected from electronic medical records. The same information was collected for controls from the health examination center. The present study was approved by the Institutional Ethics Committee of Guangxi Medical University (Approval Number: 2013-KY-GuiKe-053) and was conducted in accordance with the Declaration of Helsinki. All subjects were informed about the study and provided written informed consent.
DNA isolation and SNP genotyping
DNA was extracted from peripheral venous blood (2 mL) of each subject using a DNA extraction kit (Tiangen, Shanghai, China) following the manufacturer's instructions. Primer design (Primer 3 Online), primer synthesis, and SNP genotyping were conducted by Shanghai BioWing Applied Biotechnology Company (http://www.biowing.com.cn/). The primer sequences of the 14 GWAS risk loci are listed in Table 1. All SNPs were genotyped using the polymerase chain reaction (PCR)/ligase detection reaction assay. A multiplex PCR method was used to amplify the target DNA sequences. The reaction conditions were an initial denaturation at 95 °C for 2 min, followed by 40 cycles of denaturation at 94 °C for 30 s, annealing at 53 °C for 90 s, and extension at 65 °C for 30 s, with a final extension at 65 °C for 10 min. To determine whether the reaction was successful, 2 µL of each product was run on a 3.0% agarose gel.
The ligation reaction for each subject was conducted in a total volume of 10 µL, containing 1 µL of 1× NEB Taq DNA ligase buffer, 1 µL of probe mix (2 pmol/µL of each probe), 0.05 µL of Taq DNA ligase, 4 µL of ddH2O, and 4 µL of multiplex PCR product. The ligase detection reaction was performed at 95 °C for 2 min, followed by 40 cycles of 94 °C for 15 s and 50 °C for 20 s. The fluorescent ligase detection reaction products were analyzed on an ABI PRISM 3730 sequencer. To assess the quality of SNP genotyping, approximately 5% of the DNA samples were included among the total samples as blinded duplicates. The concordance rate was 100%.
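As a brief illustration of the quality-control arithmetic, the sketch below computes a concordance rate as the percentage of duplicate genotype calls that match the original calls. The function name and the example genotype strings are hypothetical and are not taken from the genotyping pipeline used in the study.

```python
def concordance_rate(original_calls, duplicate_calls):
    """Return the percentage of duplicate genotype calls matching the originals.

    Both arguments are equal-length sequences of genotype calls for the same
    sample/SNP pairs (hypothetical format, e.g. "AA", "AG", "GG").
    """
    if len(original_calls) != len(duplicate_calls):
        raise ValueError("call lists must have the same length")
    if not original_calls:
        raise ValueError("at least one pair of calls is required")
    matches = sum(a == b for a, b in zip(original_calls, duplicate_calls))
    return 100.0 * matches / len(original_calls)
```

A rate of 100% means that every blinded duplicate reproduced the genotype assigned to the original sample.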
Data preprocessing and machine-learning approaches
The procedure for preprocessing data was as follows:
(1) For the genotype of each GWAS risk locus, wild type is set to “0”, heterozygous is set to “1”, and homozygous is set to “2”.
(2) For BMI categories, underweight (BMI<18.5) is set to “0”, normal weight (18.5≤BMI<24.0) is set to “1”, and overweight (BMI≥24.0) is set to “2”.
(3) For family history of asthma, exposure to tobacco smoke, and allergies, “no” is set to “0”, and “yes” is set to “1”.
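The three encoding rules above can be sketched in a few lines of Python. This is a minimal illustration only: the dictionary keys, field names, and the helper that derives BMI from height and weight (weight in kg divided by height in metres squared) are assumptions made for the example, not details reported by the study.

```python
# Hypothetical label-to-code maps; the label strings are illustrative.
GENOTYPE_CODES = {"wild_type": 0, "heterozygous": 1, "homozygous": 2}
YES_NO_CODES = {"no": 0, "yes": 1}


def encode_bmi(bmi):
    """Map a BMI value (kg/m^2) to the category code of rule (2)."""
    if bmi < 18.5:
        return 0  # BMI < 18.5
    if bmi < 24.0:
        return 1  # 18.5 <= BMI < 24.0
    return 2      # BMI >= 24.0


def bmi_from_height_weight(height_cm, weight_kg):
    """Standard BMI formula; assumed here, not stated in the text."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def encode_subject(record):
    """Encode one subject given as a dict with hypothetical field names."""
    bmi = bmi_from_height_weight(record["height_cm"], record["weight_kg"])
    return {
        "snp": GENOTYPE_CODES[record["snp"]],                      # rule (1)
        "bmi_category": encode_bmi(bmi),                           # rule (2)
        "family_history": YES_NO_CODES[record["family_history"]],  # rule (3)
        "tobacco_exposure": YES_NO_CODES[record["tobacco_exposure"]],
        "allergies": YES_NO_CODES[record["allergies"]],
    }
```

In a real dataset the `"snp"` entry would be repeated for each of the 14 loci; a single locus is shown to keep the sketch short.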
Missing records accounted for no more than 10% of all features and were imputed with the mode of the corresponding feature. The conventional chi-square test was used to compare cases and controls, with SPSS 16.0.
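To make these two steps concrete, the sketch below shows mode imputation for one encoded feature and a Pearson chi-square test for a 2×2 table (for example, cases/controls by yes/no). It is an illustration in plain Python, not a reproduction of the SPSS procedure: the function names are hypothetical, missing values are assumed to be represented as `None`, and no continuity correction is applied.

```python
import math
from collections import Counter


def impute_mode(values):
    """Replace None entries with the most frequent observed value.

    If several values are tied, the one encountered first is used.
    """
    observed = [v for v in values if v is not None]
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v is None else v for v in values]


def chi_square_2x2(table):
    """Pearson chi-square test for a 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom and no
    continuity correction.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

Features with three levels, such as genotype or BMI category, would need a 2×3 table and 2 degrees of freedom, which this 2×2 sketch does not cover.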
All models were implemented in Python 3.7.4 using PyCharm on a laptop with an Intel® Core™ i7-9850H central processing unit (2.60 GHz) and 16 GB of RAM. Evaluation metrics were calculated with the cross_val_score function[26] to assess the predictive power of the models, including the receiver operating characteristic (ROC) curve, area under the curve (AUC), accuracy, precision, recall, and F1 score. All models were evaluated using 10-fold cross-validation repeated 10 times. Four machine-learning algorithms, XGBoost[27], decision tree (DT)[28], support vector machine (SVM)[29], and random forest (RF)[30], were selected as classifiers and used to identify the importance of features.
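A minimal sketch of this evaluation scheme with scikit-learn is given below. Several details are assumptions made for illustration: the folds are stratified (`RepeatedStratifiedKFold`), the data are synthetic random codes rather than study data, the number of features (18) is arbitrary, all hyperparameters are library defaults, and XGBoost is left out so that the example depends only on scikit-learn and NumPy.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# scikit-learn scorer names for AUC, accuracy, precision, recall, and F1.
METRICS = ("roc_auc", "accuracy", "precision", "recall", "f1")


def make_synthetic_data(n_cases=123, n_controls=100, n_features=18, seed=0):
    """Random 0/1/2 codes standing in for encoded features (not study data)."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 3, size=(n_cases + n_controls, n_features))
    y = np.array([1] * n_cases + [0] * n_controls)
    return X, y


def build_models(seed=0):
    """Three of the four classifier families, with default hyperparameters."""
    return {
        "DT": DecisionTreeClassifier(random_state=seed),
        "RF": RandomForestClassifier(n_estimators=100, random_state=seed),
        "SVM": SVC(random_state=seed),
    }


def evaluate(model, X, y, n_splits=10, n_repeats=10, seed=0):
    """Mean score per metric over repeated stratified k-fold cross-validation."""
    cv = RepeatedStratifiedKFold(
        n_splits=n_splits, n_repeats=n_repeats, random_state=seed
    )
    return {
        metric: float(cross_val_score(model, X, y, cv=cv, scoring=metric).mean())
        for metric in METRICS
    }
```

Calling `evaluate(model, X, y)` for each entry of `build_models()` yields one mean score per metric over the 100 train/test splits. Because the cross-validation object is seeded, every metric is computed on the same splits. An XGBoost classifier that follows the scikit-learn estimator interface could be added to the dictionary in the same way.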