Development of individualized diagnostic models and analysis process for AD and PD patients
Based on serum protein expression profiles, we used artificial intelligence to construct three individual disease diagnostic models. In addition, we mined the biological pathways, functions, upstream factors, and pseudo-time information across diseases (Fig. 1). A total of 388 serum protein expression profiles were downloaded from the GEO database, comprising 156 control (CT), 50 MCI, 50 AD, and 132 PD samples. On the one hand, the optimal features for model construction were first filtered according to differential expression, variance, collinearity, and importance. The optimal features were then used to train different classifiers, including random forest (RF), decision tree (DT), naive Bayes (NB), artificial neural network (ANN), and k-nearest neighbor (KNN). The trained model was applied to the test set to assess its classification effectiveness (accuracy, confusion matrix, ROC) and feature effectiveness (TSNE). On the other hand, we analysed the pathways and functions that differed between disease and normal samples, followed by the possible order of occurrence of the diseases. Finally, we analysed the upstream regulators and possible drug targets.
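To make the evaluation step concrete, the following minimal sketch computes a confusion matrix and the overall accuracy from true and predicted labels. The labels are toy values invented for illustration, and the helper names `confusion_matrix` and `accuracy` are hypothetical; this is not the authors' code.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]

def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Toy labels (assumption): ten samples, one PD sample predicted as AD.
classes = ["CT", "MCI", "AD", "PD"]
y_true = ["CT", "CT", "CT", "MCI", "MCI", "AD", "AD", "PD", "PD", "PD"]
y_pred = ["CT", "CT", "CT", "MCI", "MCI", "AD", "AD", "PD", "PD", "AD"]

matrix = confusion_matrix(y_true, y_pred, classes)
for name, row in zip(classes, matrix):
    print(name, row)
print("accuracy:", accuracy(matrix))
```

Reading the matrix row by row shows which true class a misclassified sample came from and which class it was assigned to, information that a single accuracy value hides.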
The AI model for the diagnosis of AD, MCI and CT based on 6 serum protein markers
AD, CT, and MCI samples were extracted from the dataset, and 1879 DEPs among the AD, CT, and MCI groups were detected (Fig. 2a). During feature engineering, we followed three principles: 1) features with small variance have little impact on the classifier; 2) highly correlated features may lead to collinearity problems in the model; 3) a few important features are sufficient to represent the whole feature set. After variance, correlation, and importance screening (Supplementary Fig. 1a-b), six features were finally obtained: LOC728492, PCBD2, EPHA2, MRPL19, SGK2, and LGALS1. These six optimal features were expressed significantly differently between groups, and their importance is shown in Fig. 2b-c.
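The first two screening principles can be sketched as follows. This is only an illustration under stated assumptions: the thresholds (`min_variance`, `max_abs_corr`), the greedy "first feature wins" rule, the toy data, and the function names are all hypothetical, and the importance-based third step is not shown.

```python
from math import sqrt
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if one is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def filter_features(columns, min_variance=0.1, max_abs_corr=0.9):
    """columns maps feature name -> list of values, one per sample."""
    # Principle 1: drop features whose variance is below the threshold.
    kept = [n for n, v in columns.items() if pvariance(v) >= min_variance]
    # Principle 2: walk the features in order and drop any feature that is
    # highly correlated with one already selected.
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[s])) <= max_abs_corr
               for s in selected):
            selected.append(name)
    return selected

# Toy expression values (assumption): five samples, four features.
toy = {
    "f_flat": [1.0, 1.0, 1.01, 1.0, 0.99],   # almost constant
    "f_a": [0.0, 1.0, 2.0, 3.0, 4.0],
    "f_a_copy": [0.1, 1.1, 2.1, 3.1, 4.2],   # nearly identical to f_a
    "f_b": [2.0, -1.0, 3.0, 0.0, 1.0],
}
selected = filter_features(toy)
print(selected)
```

In this toy run the near-constant feature is removed by the variance filter and the near-duplicate by the correlation filter, leaving two features that carry distinct information.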
We built models with different classifiers (KNN, RF, ANN, NB, DT) and different feature sets (optimal features, random features, all features). The optimal features achieved classification performance similar to or even better than all features, and this was not due to randomness (Fig. 2d). The accuracy and loss curves for these six features during ANN training (Fig. 2e) show that training was stopped once the model was stable. The micro-AUC for the optimal features was 0.9994, higher than 0.9191 for all features and 0.6385 for random features (Fig. 2f). The accuracy of this model in all three test sets was greater than 0.95 (Fig. 2g-i), and the corresponding AUCs are shown in Supplementary Fig. 4a-c. The model accuracy on the entire test set was 98.07%, with MCI and AD classified completely correctly, outperforming all features and random features (Fig. 2j-l). Compared to all features and random features, the optimal features separated the samples well (Fig. 2m-o). These results show that the optimal features selected by feature engineering help to improve performance and simplify the model. To analyse the correlation between optimal features and disease progression, we assigned a value of 0 to CT, 0.5 to MCI, and 1 to AD samples. Most features were positively correlated with the severity of cognitive loss, except for MRPL19. EPHA2 is a neuroinflammatory factor (Supplementary Fig. 1c), which may indicate that the neuroinflammatory pathway in which EPHA2 resides is closely related to the progression of AD.
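The progression analysis can be illustrated with a short sketch that applies the stage encoding from the text (CT = 0, MCI = 0.5, AD = 1) and correlates it with a feature's expression. The text does not state which correlation coefficient was used, so Pearson correlation is an assumption here, and the expression values and function names are invented for illustration.

```python
from math import sqrt

# Stage encoding taken from the text.
STAGE_SCORE = {"CT": 0.0, "MCI": 0.5, "AD": 1.0}

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def stage_correlation(labels, expression):
    """Correlate encoded disease stage with one feature's expression."""
    scores = [STAGE_SCORE[label] for label in labels]
    return pearson(scores, expression)

# Toy data (assumption): two samples per group.
labels = ["CT", "CT", "MCI", "MCI", "AD", "AD"]
rising = [1.0, 1.2, 1.9, 2.1, 3.0, 3.1]    # increases with severity
falling = [3.0, 2.8, 2.1, 2.0, 1.1, 0.9]   # decreases with severity

print(round(stage_correlation(labels, rising), 3))
print(round(stage_correlation(labels, falling), 3))
```

A positive coefficient corresponds to a feature that rises with cognitive loss, and a negative coefficient to a feature that falls, which is the pattern reported for MRPL19.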
The AI model for the diagnosis of PD, MCI and CT based on 15 serum protein markers
We extracted PD, CT, and MCI samples from the dataset and used 3092 DEPs as initial features (Fig. 3a). After feature selection, 15 features were retained (Supplementary Fig. 1d-e): ERO1LB, IGLa, LOC400763, PHKG2, PPM1L, RAD51L3, IL23A, DYNLRB2, BCAT1, CDC37, IL1RN, MAB21L2, S100A13, FAM73B, and IP6K2. Heatmaps of the 15 features also showed significant differences between groups (Fig. 3b-c). As before, the ANN model with feature engineering performed best (Fig. 3d). Once the model became stable, the classification accuracy was highest and the loss was lowest (Fig. 3e). The test set accuracy with the optimal features was 97.05%, with MCI classified completely correctly and a micro-AUC of 0.9984, compared with 0.83343 for all features and 0.7897 for random features (Fig. 3f,j-l). The accuracy of this model in all three test sets was greater than 0.94 (Fig. 3g-i), and the corresponding AUCs are shown in Supplementary Fig. 4d-f. The optimal features distinguished the MCI samples well compared to all features and random features (Fig. 3m-o). Finally, we also analysed the correlation between optimal features and disease progression (Supplementary Fig. 3f). Among these features, IL23A is a pro-inflammatory cytokine and IL1RN is an anti-inflammatory factor. MAB21L2 may be related to neurodevelopment [21], and BCAT1 knockout may cause neuronal oxidative damage [22].
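For readers unfamiliar with the micro-AUC reported here, the sketch below shows one common way to compute it: the multi-class labels are one-hot encoded, all (sample, class) pairs are pooled, and a single binary AUC is computed over the pooled scores. Whether the authors computed it exactly this way is an assumption; the probabilities are toy values and the function names are hypothetical.

```python
def binary_auc(y_true, y_score):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def micro_auc(labels, probabilities, classes):
    """Pool one-vs-rest indicators and scores over all classes, then score once."""
    flat_true, flat_score = [], []
    for label, probs in zip(labels, probabilities):
        for cls, p in zip(classes, probs):
            flat_true.append(1 if label == cls else 0)
            flat_score.append(p)
    return binary_auc(flat_true, flat_score)

# Toy predictions (assumption): each row holds class probabilities in the
# order given by `classes`.
classes = ["CT", "MCI", "PD"]
labels = ["CT", "MCI", "PD", "PD"]
probabilities = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.5, 0.1, 0.4],  # true class PD is ranked below CT for this sample
]
print(micro_auc(labels, probabilities, classes))  # 31/32 = 0.96875
```

Because every (sample, class) pair contributes equally, the micro-average is dominated by the larger classes, which is worth keeping in mind given the unequal group sizes in this dataset.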
The AI model for the diagnosis of AD, PD, MCI and CT based on 30 serum protein markers
Similarly, we filtered the DEPs of all samples to obtain the optimal model with 30 features (Fig. 4a-b). Among these, PCBD2 and LGALS1 are also features of model 1, while IGLa, ERO1LB, MAB21L2, CDC37, DYNLRB2, FAM73B, IP6K2, and S100A13 are also features of model 2, which indicates that the features extracted by feature engineering are robust. The importance of these 30 features is shown in Fig. 4c. With the filtered features, the ANN model again performed best compared to other methods and other classifiers (Fig. 4d). The accuracy of this model in all three test sets was greater than 0.95 (Fig. 4g-i), and the corresponding AUCs are shown in Supplementary Fig. 4g-i. The classification accuracy across all test sets was 98.71%: all samples were correctly classified except one PD sample, which was misclassified as AD. The micro-AUC reached 0.9999 (Fig. 4f,j-l), greater than 0.8541 for all features and 0.6660 for random features, and the MCI samples were completely separated in the TSNE plot (Fig. 4m-o).
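The overlap between the three marker panels can be tabulated with plain set operations. The sketch below uses only the marker names given in the text; the full 30-marker list of model 3 is not given there, so `REPORTED_IN_MODEL3` contains just the ten markers named as shared, and the variable names are hypothetical.

```python
# Marker panels as listed in the text.
MODEL1 = {"LOC728492", "PCBD2", "EPHA2", "MRPL19", "SGK2", "LGALS1"}
MODEL2 = {
    "ERO1LB", "IGLa", "LOC400763", "PHKG2", "PPM1L", "RAD51L3", "IL23A",
    "DYNLRB2", "BCAT1", "CDC37", "IL1RN", "MAB21L2", "S100A13", "FAM73B",
    "IP6K2",
}
# Only the model 3 markers named in the text as shared with models 1 and 2;
# the remaining 20 markers are not listed and are left out.
REPORTED_IN_MODEL3 = {
    "PCBD2", "LGALS1", "IGLa", "ERO1LB", "MAB21L2", "CDC37", "DYNLRB2",
    "FAM73B", "IP6K2", "S100A13",
}

shared_with_model1 = sorted(MODEL1 & REPORTED_IN_MODEL3)
shared_with_model2 = sorted(MODEL2 & REPORTED_IN_MODEL3)

print(f"model 1: {len(shared_with_model1)}/{len(MODEL1)} reused", shared_with_model1)
print(f"model 2: {len(shared_with_model2)}/{len(MODEL2)} reused", shared_with_model2)
print("models 1 and 2 share markers:", bool(MODEL1 & MODEL2))
```

By this count, 2 of the 6 markers of model 1 and 8 of the 15 markers of model 2 reappear in model 3, while models 1 and 2 themselves have no marker in common.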
The serum proteins of patients in the MCI, AD and PD groups each differed from healthy samples in distinct ways
We first analysed the DEPs of each disease. The numbers of DEPs in MCI, AD, and PD compared to CT were 1010, 839, and 2122, respectively. The numbers of DEPs in AD and PD compared to MCI were 1221 and 1467, respectively. Finally, the number of DEPs between AD and PD was 2082 (Supplementary Fig. 2a). In terms of the number of DEPs, PD was very different from CT, MCI, and AD, but was closest to MCI (Supplementary Fig. 2b). Next, we found that the phase change proteins in MCI, AD, and PD differed considerably in location and cell type: those in PD were mostly found in the nucleus and among enzymes, those in AD were mostly found in the extracellular space and among transcriptional regulators, and those in MCI were mostly found in the cytoplasm (Supplementary Fig. 2c-d). In addition, phase separation scores in the cytoplasm differed between PD and normal samples, which may indicate that phase separation in PD is associated with the cytoplasm (Supplementary Fig. 2e). Further analysis of the cell type scores in the cytoplasm revealed that the differences may lie in other cell types. Finally, we show the 10 proteins with the largest differences in each disease relative to normal (Supplementary Fig. 2g-i): EMG1 and IFI6 were the most up-regulated and down-regulated DEPs for MCI, ZCD2 and IFI6 for AD, and CCT7 and RANBP6 for PD.
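The pairwise DEP counts reported above can be arranged as a symmetric lookup, which makes the "closest group" statement easy to check. The counts are taken from the text; treating the DEP count as a simple dissimilarity measure follows the text's own comparison, and the function names are hypothetical.

```python
# Pairwise DEP counts as reported in the text.
DEP_COUNTS = {
    frozenset({"MCI", "CT"}): 1010,
    frozenset({"AD", "CT"}): 839,
    frozenset({"PD", "CT"}): 2122,
    frozenset({"AD", "MCI"}): 1221,
    frozenset({"PD", "MCI"}): 1467,
    frozenset({"AD", "PD"}): 2082,
}
GROUPS = ["CT", "MCI", "AD", "PD"]

def dep_count(a, b):
    """Number of DEPs between two groups, independent of argument order."""
    return DEP_COUNTS[frozenset({a, b})]

def closest_group(group):
    """Group with the fewest DEPs relative to `group`."""
    others = [g for g in GROUPS if g != group]
    return min(others, key=lambda g: dep_count(group, g))

for g in GROUPS:
    row = {other: dep_count(g, other) for other in GROUPS if other != g}
    print(g, row, "closest:", closest_group(g))
```

For PD, the smallest count is against MCI (1467, versus 2082 against AD and 2122 against CT), consistent with the statement that PD is closest to MCI.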
Early PD may occur before early MCI
Serum molecules circulate with the blood and can affect the body's cells, tissues, and organs broadly. To examine the biological events influenced by serum molecules, we further analysed the activation levels of individual biological events on the basis of conventional significance analysis. Using the underlying sample information, we subdivided the diseases: MCI into early MCI (EMCI) and late MCI (LMCI), AD into early mild-moderate AD (EMMAD) and late mild-moderate AD (LMMAD), and PD into early-stage PD (ESPD) and mild-moderate PD (MMPD). In the canonical pathways and the disease and bio functions, the number of up-regulated pathways increased and the number of down-regulated pathways decreased along the progression from EMCI/CT to LMMAD/CT (Fig. 5a-b). Z-scores, the mean change in a pathway relative to control samples, showed the same trend. ESPD followed the same trend as EMCI but with greater variation. These results show that biological events in MCI and AD patients change along a continuum, while PD is more distinct from both. The biological events in patients with early PD were intermediate between those of healthy samples and early MCI, which suggests that early PD may precede early MCI.
Similarly, IPA was used to analyse the 7 groups of samples (Fig. 5c). Among the canonical pathways, we chose the 10 pathways with the largest relative differences between MMPD and LMMAD. Long-term activation of EIF2 leads to a continuous decline in protein synthesis, which in turn leads to memory impairment and neuronal damage [23]. The proportion of EIF2 up-regulation was small in MCI but larger in AD and PD, which may indicate that EIF2 is more related to neuronal damage.
Among the canonical pathways, we identified two pathways associated with coronavirus, namely the "Coronavirus Replication Pathway" and the "Coronavirus Pathogenesis Pathway". Compared to normal samples, coronavirus replication was enhanced in disease but pathogenicity was reduced. Coronavirus replication was stronger in AD than in PD. According to the literature, patients with COVID-19 appear to be more susceptible to AD, and patients with AD may be more susceptible to severe COVID-19 infection [24]. In contrast, the current literature does not clearly indicate whether PD patients are more susceptible to COVID-19. This may reveal a greater susceptibility to COVID-19 in AD.
Compared to normal samples, the level of cell maturation was relatively low in the early stages of disease, whereas in the middle and late stages cell maturation increased abnormally to near or even above normal levels. In terms of molecular function, excessive activation of nuclear factor NF-kB has been shown to play an important role in driving Aβ deposition, neuroinflammation, and neurodegenerative disease in AD. NF-kB levels were not increased in PD, however, which may suggest that NF-kB does not promote α-SYN deposition [25].