Individual biomarker analysis
Our data were obtained from HADCL (Healthcare Analytics and Data Center). We identified 43,981 patients with AD from a population of 12 million individuals spanning the years 2000 to 2019. From an initial set of 210 routine blood biomarkers in the dataset, we focused our analysis on 31 specific biomarkers. The selection process is described in the Methods section. Student’s t-test26 was used to evaluate the differences in the 31 routine blood biomarkers between MCI and AD within each of the 4 age and gender groups, using only records without missing values. The significance threshold was set at p < 0.05. Of the 31 biomarkers tested, 24 differed significantly between MCI and AD patients (p < 0.05); these are presented in Table 1. The remaining 7 biomarkers did not reach statistical significance and are not included in the table. Notably, hemoglobin (blood), HCT (Haematocrit), and RBC (Red Blood Cell) were significantly reduced, while neutrophils (absolute) and WBC (White Blood Cell) were increased, in AD patients in all 4 groups.
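As an illustration of the per-biomarker comparison described above, the following is a minimal sketch of a two-sided Student's t-test with pooled variance, written with the Python standard library only. The function names and the hemoglobin values are hypothetical and are not taken from the study data; the p-value is obtained by numerically integrating the t density, which is one of several possible ways to compute it.

```python
import math
from statistics import mean, variance


def two_sided_p(t, df, steps=20000):
    """Two-sided p-value, integrating the t density over [0, |t|] with Simpson's rule."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps is even, as Simpson's rule requires
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central)


def student_t_test(a, b):
    """Student's t-test assuming equal variances; returns (t, df, p)."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df, two_sided_p(t, df)


# Hypothetical hemoglobin values (g/dL) for one age and gender group.
mci_hemoglobin = [13.9, 13.1, 14.2, 13.6, 13.8, 14.0]
ad_hemoglobin = [12.8, 13.0, 12.5, 13.2, 12.9, 12.6]
t_stat, dof, p_value = student_t_test(mci_hemoglobin, ad_hemoglobin)
significant = p_value < 0.05  # threshold used in the text
```

In this sketch the test would be repeated for each of the 31 biomarkers within each group, keeping those with p < 0.05.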
Model development
As described in the data processing part of the Methods section, we excluded 10,344 patients, leaving a total of 7,159 patients: 777 in Female 65 ~ 74 (MCI, n = 131; AD, n = 646), 642 in Male 65 ~ 74 (MCI, n = 147; AD, n = 495), 3,821 in Female 75 ~ 89 (MCI, n = 249; AD, n = 3,572), and 1,919 in Male 75 ~ 89 (MCI, n = 222; AD, n = 1,697).
However, the number of AD patients far exceeds that of MCI patients in our data. Because class imbalance may result in high false positive rates and poor generalization when modeling27,28, the AD patients in each group were randomly undersampled to the same number as the MCI subjects to minimize this bias. After balancing, the groups comprised 262 patients in Female 65 ~ 74 (MCI, n = 131; AD, n = 131), 294 patients in Male 65 ~ 74 (MCI, n = 147; AD, n = 147), 498 patients in Female 75 ~ 89 (MCI, n = 249; AD, n = 249), and 444 patients in Male 75 ~ 89 (MCI, n = 222; AD, n = 222). For more details, please refer to Fig. 1.
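The balancing step can be sketched as random undersampling of the majority class. The snippet below is an illustrative sketch, not the authors' code: the record layout, the `label` key, and the fixed seed are assumptions, while the class counts are the Female 65 ~ 74 counts reported above.

```python
import random


def undersample_majority(records, label_key="label", seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[label_key], []).append(rec)
    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))  # sampling without replacement
    rng.shuffle(balanced)
    return balanced


# Female 65-74 before balancing: 131 MCI and 646 AD (hypothetical placeholder records).
cohort = ([{"id": i, "label": "MCI"} for i in range(131)]
          + [{"id": 131 + i, "label": "AD"} for i in range(646)])
balanced = undersample_majority(cohort)
```

With these inputs the balanced set contains 262 records, 131 per class, matching the group size given in the text.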
To train more stable models, ensemble learning29 was adopted in this work. Ensemble learning combines multiple weak classifiers into a single, more comprehensive model. The algorithm assigns higher weights to the more important weak classifiers, so that if one weak classifier mispredicts a patient, the others can correct it. To establish the best classification algorithm, 85% of each data set (Female 65 ~ 74, Male 65 ~ 74, Female 75 ~ 89, and Male 75 ~ 89) was selected randomly as the training set, and the remaining 15% was held out as the independent test set. Details of the final data distribution are shown in Fig. 1. Five-fold cross-validation (CV)30 was used to train the weak classifier models, and extensive screening and grid search were applied to determine the optimal algorithms and the best-performing weights, respectively.
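The data partitioning described above can be sketched as a random 85/15 split followed by 5-fold cross-validation on the training portion. This is a minimal sketch under stated assumptions: the function names are hypothetical, the test-set size is obtained by simple rounding, and no class stratification is applied, none of which is specified in the text.

```python
import random


def train_test_split_indices(n, test_fraction=0.15, seed=0):
    """Shuffle sample indices and hold out a fraction as the independent test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)  # rounding rule is an assumption
    return idx[n_test:], idx[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


# 262 is the balanced Female 65-74 group size from the text.
train_idx, test_idx = train_test_split_indices(262)
cv_splits = list(k_fold(train_idx, k=5))
```

The independent test indices are never passed to `k_fold`, so model selection in this sketch only sees the training portion.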
According to the t-test results, 5 biomarkers were significant in all 4 groups: Hemoglobin (blood), HCT, Neutrophil (absolute), RBC, and WBC. Of the models trained with these 5 biomarkers, only the Female 65 ~ 74 model had an AUC greater than 0.65. For Female 65 ~ 74, we combined the 5 biomarkers with Logistic Regression (LR)31, Gaussian Naive Bayes (GNB)32, and Random Forest (RF)33 to generate the basic classifiers, which were then combined by ensemble learning with a weight set of LR: GNB: RF = 1: 3: 1 (Fig. 2). In the other 3 groups, models using the feature matrix formed by the 5 significant biomarkers did not achieve sufficiently high performance (AUCs were lower than 0.65). We therefore used forward feature selection (FFS)34 to add biomarkers from Table 1 one by one to optimize the models. This yielded an optimal model for Female 75 ~ 89, generated by basic classifiers that integrated 19 biomarkers (analytes related to liver function (Alanine Aminotransferase (ALT), Alkaline Phosphatase (ALP), Bilirubin), minerals and proteins (Potassium, Calcium, Phosphate, Protein, Albumin), and blood cells (Basophil (absolute), Neutrophil (absolute), Lymphocyte (absolute), Basophil (%), Neutrophil (%), Lymphocyte (%), WBC, RBC, Mean corpuscular hemoglobin concentration (MCHC), Hematocrit (HCT), Hemoglobin)) with the LR, Gradient Boosting (GB)35, and Extra Trees (ET)36 algorithms. Similarly, ensemble learning was used to build the Female 75 ~ 89 model with a weight set of LR: GB: ET = 1: 3: 4 (Fig. 2).
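Forward feature selection, as used above, can be sketched as a greedy loop that adds one candidate biomarker at a time and keeps it only if the score improves. The sketch below is illustrative: `score_fn` stands in for whatever evaluation the authors used (for example a cross-validated AUC), and the toy scoring function and its gain values are invented for demonstration only.

```python
def forward_feature_selection(candidates, score_fn, start=()):
    """Greedily add the candidate that most improves score_fn; stop when none helps."""
    selected = list(start)
    best = score_fn(selected) if selected else float("-inf")
    remaining = [c for c in candidates if c not in selected]
    while remaining:
        scored = [(score_fn(selected + [c]), c) for c in remaining]
        top_score, top_feature = max(scored)
        if top_score <= best:
            break  # no remaining biomarker improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Toy score: a made-up baseline plus made-up per-feature gains (NOT study results).
TOY_GAINS = {"ALT": 0.03, "Albumin": 0.02, "Platelet": -0.01}


def toy_score(features):
    return 0.60 + sum(TOY_GAINS.get(f, 0.0) for f in features)


selected, best_score = forward_feature_selection(
    candidates=["ALT", "Albumin", "Platelet"],
    score_fn=toy_score,
    start=["Hemoglobin", "HCT"],
)
```

In practice each call to `score_fn` retrains the classifiers on the candidate feature set, so the cost grows with the number of candidates tried per round.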
Because optimization by FFS did not work well for the male models, we changed strategy and used all 31 biomarkers to train the model. We ultimately constructed a relatively good prediction model for Male 65 ~ 74 using the 31 biomarkers with the Linear Discriminant Analysis (LDA)37, Support Vector Machine (SVM)38, and ET algorithms. Ensemble learning was used to construct the Male 65 ~ 74 model with a basic classifier weight set of LDA: SVM: ET = 1: 2: 4 (Fig. 2). For Male 75 ~ 89, we first performed Maximum-Relevance-Maximum-Distance (MRMD)39 feature selection to screen out 23 biomarkers (ALP, variables associated with renal function (Creatinine, Urea), minerals and proteins (Potassium, Sodium, Calcium, Phosphate, Albumin, Protein), and blood cells (Basophil (absolute), Neutrophil (absolute), Eosinophil (absolute), Lymphocyte (absolute), Monocyte (absolute), Basophil (%), Monocyte (%), WBC, Platelet, Mean corpuscular volume (MCV), Mean corpuscular hemoglobin (MCH), MCHC, Hemoglobin, Red blood cell distribution width (RDW) (%))), and then combined them with the RF, AdaBoost Classifier (ADA), and ET algorithms. These basic classifiers were used for ensemble learning to generate the Male 75 ~ 89 model with a weight set of RF: ADA: ET = 1: 2: 3 (Fig. 2).
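The weight sets reported for the four models can be read as coefficients in a weighted vote over the basic classifiers. The text does not state whether the vote is taken over predicted probabilities or over class labels, so the sketch below assumes weighted soft voting (a weighted average of each classifier's predicted AD probability). The probabilities in the example are hypothetical.

```python
# Weight sets as reported in the text for each age and gender group.
ENSEMBLE_WEIGHTS = {
    "Female 65-74": {"LR": 1, "GNB": 3, "RF": 1},
    "Female 75-89": {"LR": 1, "GB": 3, "ET": 4},
    "Male 65-74": {"LDA": 1, "SVM": 2, "ET": 4},
    "Male 75-89": {"RF": 1, "ADA": 2, "ET": 3},
}


def weighted_soft_vote(probabilities, weights):
    """Weighted average of per-classifier AD probabilities (soft voting is assumed)."""
    total = sum(weights[name] for name in probabilities)
    return sum(weights[name] * p for name, p in probabilities.items()) / total


# Hypothetical per-classifier probabilities for one Female 65-74 patient.
p_ad = weighted_soft_vote(
    {"LR": 0.40, "GNB": 0.70, "RF": 0.55},
    ENSEMBLE_WEIGHTS["Female 65-74"],
)
predicted_label = "AD" if p_ad >= 0.5 else "MCI"  # 0.5 cut-off is an assumption
```

Here LR alone would predict MCI, but the more heavily weighted GNB pulls the combined probability above the cut-off, which illustrates how one classifier's error can be corrected by the others.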
Routine blood biomarker-driven models and effective predictive performance
The prediction accuracy and AUC values of each model in CV are shown in Table 2 and Fig. 3. Specifically, the Female 65 ~ 74 model, an ensemble of LR31, GNB32, and RF33 trained on 5 biomarkers, yielded an accuracy (ACC) of 0.63 and an AUC of 0.70, with a sensitivity (SN) of 0.61 and a specificity (SP) of 0.65. The Female 75 ~ 89 model, which integrated 19 biomarkers with LR, GB35, and ET36, achieved similar performance (ACC = 0.63, AUC = 0.67, SN = 0.69, and SP = 0.58). Likewise, the Male 65 ~ 74 model, built from 31 biomarkers with LDA37, SVM38, and ET, provided ACC = 0.63, AUC = 0.66, SN = 0.62, and SP = 0.64, and the Male 75 ~ 89 model, built from 23 biomarkers with RF, ADA, and ET, obtained ACC = 0.66, AUC = 0.68, SN = 0.66, and SP = 0.65.
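For reference, the reported metrics can be computed from predictions as in the following sketch, which uses the standard definitions of SN, SP, ACC, and F1 from confusion-matrix counts and computes AUC as the probability that a randomly chosen positive case is scored higher than a randomly chosen negative case. Treating AD as the positive class (label 1) is an assumption, and the labels and scores are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """SN, SP, ACC and F1 from binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "SN": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "ACC": (tp + tn) / len(y_true),
        "F1": 2 * tp / (2 * tp + fp + fn),
    }


def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = AD, 0 = MCI) and predicted AD probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Note that AUC is threshold-free, whereas SN, SP, ACC, and F1 depend on the chosen cut-off, which is why a model can show a similar AUC but a larger gap between SN and SP.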
Table 2
The model performance on cross-validation test sets
| Data | Model | SN | SP | ACC | F1 | AUC |
| F 65 ~ 74 | ML model | 0.61±0.01 | 0.65±0.02 | 0.63±0.00 | 0.62±0.00 | 0.70±0.01 |
| F 65 ~ 74 | TabNet | 0.50±0.34 | 0.70±0.28 | 0.60±0.08 | 0.48±0.26 | 0.74±0.08 |
| M 65 ~ 74 | ML model | 0.62±0.02 | 0.64±0.02 | 0.63±0.02 | 0.62±0.02 | 0.66±0.00 |
| M 65 ~ 74 | TabNet | 0.65±0.31 | 0.58±0.36 | 0.61±0.08 | 0.59±0.15 | 0.72±0.06 |
| F 75 ~ 89 | ML model | 0.69±0.10 | 0.58±0.08 | 0.63±0.18 | 0.66±0.24 | 0.67±0.22 |
| F 75 ~ 89 | TabNet | 0.35±0.40 | 0.74±0.36 | 0.55±0.07 | 0.31±0.33 | 0.63±0.10 |
| M 75 ~ 89 | ML model | 0.66±0.04 | 0.65±0.04 | 0.66±0.00 | 0.66±0.02 | 0.68±0.01 |
| M 75 ~ 89 | TabNet | 0.39±0.32 | 0.81±0.21 | 0.60±0.07 | 0.42±0.30 | 0.67±0.10 |
Abbreviations: SN, sensitivity; SP, specificity; ACC, accuracy; F1, F-score; AUC, area under the receiver operating characteristic curve; F 65 ~ 74, Female 65 ~ 74 years old; M 65 ~ 74, Male 65 ~ 74 years old; F 75 ~ 89, Female 75 ~ 89 years old; M 75 ~ 89, Male 75 ~ 89 years old.
We next evaluated the model performance on independent test sets. The prediction results are summarized in Table 3 and Fig. 3, and the detailed ROC curves are depicted in Fig. 4. Consistent with the results on CV test sets, the performance of each model on the independent test sets was relatively stable, yielding an average ACC of 0.6425, AUC of 0.685, SN of 0.6275, and SP of 0.6575. This consistency between CV and independent test sets suggests that our models are resistant to noise, outliers, and slight changes in the data set, and can therefore provide reliable predictions of progression from MCI to AD for unseen samples. Such stability is crucial for application in clinical settings, where the models can aid in the early detection of high-risk individuals.
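The averages quoted above are the unweighted means of the four ML-model rows in Table 3. The short sketch below reproduces that arithmetic; the dictionary layout is only an illustrative way of holding the table values.

```python
from statistics import mean

# ML-model rows of Table 3 (independent test sets), point estimates only.
INDEPENDENT_TEST = {
    "Female 65-74": {"SN": 0.70, "SP": 0.66, "ACC": 0.68, "AUC": 0.76},
    "Male 65-74": {"SN": 0.59, "SP": 0.65, "ACC": 0.62, "AUC": 0.65},
    "Female 75-89": {"SN": 0.61, "SP": 0.68, "ACC": 0.64, "AUC": 0.66},
    "Male 75-89": {"SN": 0.61, "SP": 0.64, "ACC": 0.63, "AUC": 0.67},
}

# Unweighted mean across the four groups, per metric.
averages = {
    metric: mean(row[metric] for row in INDEPENDENT_TEST.values())
    for metric in ("ACC", "AUC", "SN", "SP")
}
```

Because the mean is unweighted, each group contributes equally regardless of its sample size.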
Table 3
The model performance on independent test sets
| Data | Model | SN | SP | ACC | F1 | AUC |
| F 65 ~ 74 | ML model | 0.70±0.04 | 0.66±0.02 | 0.68±0.02 | 0.69±0.02 | 0.76±0.02 |
| F 65 ~ 74 | TabNet | 0.43±0.34 | 0.65±0.31 | 0.54±0.09 | 0.41±0.26 | 0.59±0.11 |
| M 65 ~ 74 | ML model | 0.59±0.04 | 0.65±0.07 | 0.62±0.05 | 0.61±0.05 | 0.65±0.07 |
| M 65 ~ 74 | TabNet | 0.54±0.35 | 0.47±0.33 | 0.51±0.05 | 0.47±0.21 | 0.49±0.09 |
| F 75 ~ 89 | ML model | 0.61±0.14 | 0.68±0.04 | 0.64±0.17 | 0.63±0.20 | 0.66±0.22 |
| F 75 ~ 89 | TabNet | 0.32±0.38 | 0.72±0.36 | 0.52±0.05 | 0.29±0.30 | 0.57±0.07 |
| M 75 ~ 89 | ML model | 0.61±0.00 | 0.64±0.00 | 0.63±0.00 | 0.62±0.00 | 0.67±0.00 |
| M 75 ~ 89 | TabNet | 0.28±0.24 | 0.80±0.17 | 0.54±0.06 | 0.32±0.24 | 0.57±0.09 |
Abbreviations: SN, sensitivity; SP, specificity; ACC, accuracy; F1, F-score; AUC, area under the receiver operating characteristic curve; F 65 ~ 74, Female 65 ~ 74 years old; M 65 ~ 74, Male 65 ~ 74 years old; F 75 ~ 89, Female 75 ~ 89 years old; M 75 ~ 89, Male 75 ~ 89 years old.
Impact of age and gender on prediction performance
In MAP, we stratified patients into 4 distinct age and gender groups and constructed models trained with 5, 31, 19, and 23 biomarkers for Female 65 ~ 74, Male 65 ~ 74, Female 75 ~ 89, and Male 75 ~ 89, respectively. To demonstrate the influence of age and gender stratification on the prediction of progression from MCI to AD, we performed an additional comparative experiment using data from all individuals aged 65 ~ 89 years, irrespective of gender. In MAP, the 5-, 19-, and 23-biomarker sets used for training the Female 65 ~ 74, Female 75 ~ 89, and Male 75 ~ 89 models were derived from the initial 31 biomarkers by a secondary selection process after age and gender stratification. For the non-stratified data, the 5- and 19-biomarker data were selected directly from all MCI and AD patients to increase the sample size for analysis, while the 23-biomarker data strictly followed a two-step screening method, in which the biomarkers were chosen by MRMD feature selection from the initial set of 31 biomarkers (Fig. 5). As described above, we also excluded patients with missing values, balanced the data between the AD and MCI groups, and divided the data into training and independent test sets for model training (Fig. 5).
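The stratification and missing-value exclusion described above can be sketched as follows. The record fields (`sex`, `age`, biomarker keys) and the patient values are hypothetical; the age bands are those named in the text, and records outside 65 ~ 89 are simply skipped in this sketch.

```python
def assign_group(sex, age):
    """Map a patient to one of the 4 age and gender groups, or None if out of range."""
    if sex not in ("Female", "Male"):
        return None
    if 65 <= age <= 74:
        band = "65-74"
    elif 75 <= age <= 89:
        band = "75-89"
    else:
        return None
    return f"{sex} {band}"


def stratify(patients, biomarkers):
    """Drop records with any missing biomarker, then bucket the rest by group."""
    groups = {}
    for patient in patients:
        if any(patient.get(b) is None for b in biomarkers):
            continue  # complete-case analysis: exclude missing values
        group = assign_group(patient["sex"], patient["age"])
        if group is not None:
            groups.setdefault(group, []).append(patient)
    return groups


# Hypothetical records with two of the biomarkers named in the text.
patients = [
    {"sex": "Female", "age": 70, "Hemoglobin": 12.9, "WBC": 6.1},
    {"sex": "Male", "age": 80, "Hemoglobin": 13.4, "WBC": 7.0},
    {"sex": "Male", "age": 68, "Hemoglobin": None, "WBC": 5.5},   # missing value
    {"sex": "Female", "age": 92, "Hemoglobin": 12.1, "WBC": 6.4},  # outside 65-89
]
groups = stratify(patients, biomarkers=["Hemoglobin", "WBC"])
```

For the non-stratified comparison, the same complete-case filter would apply but all retained patients aged 65 ~ 89 would be pooled into a single set.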
With the non-stratified data, the same machine-learning methodology as in MAP produced weaker model performance. As shown in Table 4, without age and gender stratification the models exhibited higher SN but lower SP, that is, they tended to misclassify MCI subjects who did not convert to AD as AD. This may be because the models focus excessively on the biomarker patterns of AD patients while neglecting those of MCI subjects. Consistent with the cross-validation findings, the independent test results also showed that models built without categorizing patients into age and gender groups performed worse than models grouped by age and gender. Specifically, the former provided an average ACC of 0.56, an AUC of 0.605, an SN of 0.7175, and an SP of 0.4075 (Table 4). These findings indicate that age and gender stratification plays a crucial role in improving the accuracy of identifying individuals at risk of progressing from MCI to AD.
Table 4
The model performance on data without age and gender stratification
| Test set | Features | Data | Model | SN | SP | ACC | F1 | AUC |
| Cross-validation | 5 biomarkers | F & M 65 ~ 89 | ML model | 0.72±0.01 | 0.39±0.01 | 0.56±0.01 | 0.62±0.01 | 0.58±0.00 |
| Cross-validation | 5 biomarkers | F & M 65 ~ 89 | TabNet | 0.50±0.18 | 0.64±0.13 | 0.57±0.05 | 0.52±0.13 | 0.61±0.05 |
| Cross-validation | 19 biomarkers | F & M 65 ~ 89 | ML model | 0.73±0.01 | 0.73±0.01 | 0.73±0.01 | 0.73±0.01 | 0.60±0.00 |
| Cross-validation | 19 biomarkers | F & M 65 ~ 89 | TabNet | 0.69±0.31 | 0.69±0.31 | 0.69±0.31 | 0.69±0.31 | 0.61±0.04 |
| Cross-validation | 23 biomarkers | F & M 65 ~ 89 | ML model | 0.75±0.01 | 0.41±0.02 | 0.58±0.01 | 0.64±0.01 | 0.62±0.01 |
| Cross-validation | 23 biomarkers | F & M 65 ~ 89 | TabNet | 0.49±0.39 | 0.54±0.43 | 0.52±0.03 | 0.44±0.19 | 0.50±0.06 |
| Cross-validation | 31 biomarkers | F & M 65 ~ 89 | ML model | 0.74±0.01 | 0.42±0.01 | 0.58±0.01 | 0.64±0.01 | 0.63±0.01 |
| Cross-validation | 31 biomarkers | F & M 65 ~ 89 | TabNet | 0.34±0.20 | 0.76±0.18 | 0.55±0.03 | 0.41±0.16 | 0.62±0.04 |
| Independent | 5 biomarkers | F & M 65 ~ 89 | ML model | 0.71±0.01 | 0.38±0.01 | 0.54±0.01 | 0.61±0.02 | 0.57±0.01 |
| Independent | 5 biomarkers | F & M 65 ~ 89 | TabNet | 0.48±0.18 | 0.58±0.15 | 0.53±0.02 | 0.49±0.11 | 0.55±0.01 |
| Independent | 19 biomarkers | F & M 65 ~ 89 | ML model | 0.73±0.01 | 0.43±0.01 | 0.58±0.01 | 0.63±0.02 | 0.62±0.02 |
| Independent | 19 biomarkers | F & M 65 ~ 89 | TabNet | 0.70±0.32 | 0.37±0.32 | 0.54±0.03 | 0.56±0.06 | 0.55±0.03 |
| Independent | 23 biomarkers | F & M 65 ~ 89 | ML model | 0.71±0.02 | 0.42±0.01 | 0.56±0.01 | 0.62±0.01 | 0.62±0.02 |
| Independent | 23 biomarkers | F & M 65 ~ 89 | TabNet | 0.55±0.37 | 0.56±0.41 | 0.56±0.04 | 0.50±0.17 | 0.59±0.02 |
| Independent | 31 biomarkers | F & M 65 ~ 89 | ML model | 0.72±0.01 | 0.40±0.01 | 0.56±0.01 | 0.62±0.01 | 0.61±0.01 |
| Independent | 31 biomarkers | F & M 65 ~ 89 | TabNet | 0.37±0.09 | 0.71±0.08 | 0.54±0.01 | 0.42±0.05 | 0.56±0.12 |
Abbreviations: F & M 65 ~ 89, Female and Male 65 ~ 89 years old.
Comparative analysis of MAP and TabNet models for predicting the risk of MCI to AD
To compare the performance of MAP with models trained with deep-learning methods, the 5-, 19-, 23-, and 31-biomarker sets were trained with TabNet40, a deep-learning method that is widely used to train interpretable models, to construct TabNet models with or without age and gender grouping. The TabNet models performed poorly on both CV test sets and independent test sets. When grouped by age and gender, TabNet models had AUCs similar to those of MAP on CV test sets, but their ACCs were lower and the differences between SNs and SPs were larger than those of MAP (Table 2 and Fig. 3). On independent test sets, TabNet models obtained lower ACCs, AUCs, and SNs, and larger differences between SNs and SPs, than MAP (Table 3 and Figs. 3, 4). When the data were not stratified by age and gender, TabNet models performed slightly worse than MAP on CV test sets and independent test sets (Table 4).