We included three phases to identify and validate the gene-based RPE/choroid complex diagnostic model for AMD (Figure 1). In the discovery phase, GSE29801 was applied to screen DEGs, which were used in the RF model to select key genes biomarkers for AMD. In the training phase, LASSO logistic regression was utilized to identify informative genes and construct the diagnostic model. In the validation phase, the performance of the model was verified in GSE50195. Besides, enrichment and immune-cell analysis were also performed in the two datasets to obtain a robust association between the model and overall immune status in AMD patients.
Identification of DEGs in AMD
A total of 83 DEGs were found between AMD and normal samples in GSE29801, including 29 up-regulated ones and 54 down-regulated ones (Figure 2A). Then the DEGs were selected robust and general genes.
Selection of candidate gene biomarkers
We adopted RF model to identify key gene biomarkers for classifying AMD and normal samples. The top 20 genes were used in the traning cohort (Figure 2 B).
Functional enrichment analysis
Metascape analysis showed the top 20 clusters of enriched biological processes (Figure 2C). Results showed that the DEGs between AMD and normal samples were significantly enriched in regulation of burn wound healing, regulation of system process, NABA SECRETED FACTORS and so on.
Construction of the diagnostic model
The 20 genes selected by RF were applied to LASSO logistic regression to build the diagnostic model. Consequently, ten optimal genes were employed to establish a diagnostic model, including BMPR2, CNOT3, CRLF1, FXYD6, HRASLS5, KRTDAP, NUDT16L1, PI16, PLAGL1, SART1 (Figure 3A, B). The risk score formula was calculated as follows: risk score = -2.50339 + (-0.02804 * expression level of BMPR2) + (0.05620 * expression level of CNOT3) + (-0.00099 * expression level of CRLF1) + (0.00811 * expression level of FXYD6) + (-0.00074 * expression level of HRASLS5) + (0.13072 * expression level of KRTDAP) + (0.01815 * expression level of NUDT16L1) + (0.00053 * expression level of PI16) + (0.00086 * expression level of PLAGL1) + (0.00074 * expression level of SART1). The ROC curves indicated a high diagnostic power of the model, because the AUC was 0.908 (95% CI: 0.823-0.975) with a specificity of 0.782 and a sensitivity of 0.965 (Figure 4A). Patients were classified into high- and low-risk groups based on the median expression levels of the optimal genes (Figure 5A).
External validation of the diagnostic model
To verify the robustness of the ten-gene based diagnostic model, we obtained an independent cohort GSE50195. Patients were divided into high- and low-risk groups according to the fixed formula and the median expression levels of the optimal genes calculated from the training cohort (Figure 5B). The ROC curve in the validation cohort indicated a reliable diagnostic result, with the AUC of 0.809 (95% CI: 0.522-0.889) and the specificity and sensitivity of 0.644 and 0.912, respectively (Figure 4B). The model had consistently high sensitivity in both training and validation cohorts so that the diagnostic performance of the model remained robust and precise.
Gene set enrichment analysis
GSEA was conducted to elucidate the potential biological processes occurring in high-risk patients compared with low-risk ones. We found that glutathione metabolism and phototransduction are the shared pathway that had gene enriched in the training and validation cohort. In the training cohort, gene expression in the two signal pathway significantly reduced (Figure 6A). While in the validation cohort, glutathione metabolism had an increase level of gene expression (Figure 6B). The above analyses may set the foundation for further exploring the molecular mechanisms of AMD.
Inference of immune cells in retinal pigment epithelium (RPE)/choroid complex
We quantified 28 types of immune cells including the B cells, T cells, DCs, macrophages, natural killer cells and so on to investigate the composition of the RPE/choroid complex by applying ssGSEA. As a result, the levels of neutrophil in the high-risk group were significantly higher than that in the low-risk group in both training and validation datasets. In the training cohort, the levels of activated B cell, activated CD4 T cell, central memory CD4 T cell, effector memory CD8 T cell, gamma delta T cell, immature B cell, myeloid-derived suppressor cells (MDCS), natural killer T cell, T follicular helper cell, type 1 T helper cell and type2 helper cell in the high-risk group were significantly higher than that in the low-risk group (Figure 7A). While in the validation cohort, the level of monocyte increased in the high-risk group, but the levels of CD56bright natural killer cell, CD56dim natural killer cell and memory B cell decreased in the high-risk group compared with that in the low-risk group (Figure 7B).
We also analyzed the correlation between the ten genes involved in the diagnostic model and immune cells. In the training cohort, NUDT16L1, FXYD6, HRASLS5 and KRTDAP were negatively correlated with almost all the immune cells, while PLAGL1, ACSL1 and CRLF1 exhibited positive correlations (Figure 8A). In the validation cohort, the correlations between the optimal genes and the immune cells were not that strong as that in the training cohort. It showed that PI16 was negatively correlated with CD56dim natural killer cell, natural killer cell and neutrophil, and CNOT3 and HRASLS5 also had the negative correlations with CD56dim natural killer cell. While BMPR2 and ACSL exhibited positive correlations with CD56bright natural killer cell.