Differentially expressed mRNAs and lncRNAs associated with prognosis
Expression data and clinical information of three cohorts were downloaded from TCGA and GEO. The baseline characteristics were summarized in Table 1. The first cohort consisted of 535 LUAD and 59 normal samples from TCGA, and was used for differential expression analysis. 17122 mRNAs and 1826 lncRNAs were screened, and 1456 mRNAs and 92 lncRNAs were differentially expressed between LUAD and normal groups. The expression profiles of the DEmRNAs and DElncRNAs were visualized in Fig. 2 in the form of heatmaps and volcano maps. As shown in Fig. 2A&C, 606 and 850 mRNAs were upregulated and downregulated in LUAD samples respectively while Fig. 2B&D shows the 49 upregulated and the 43 downregulated lncRNAs respectively.
189 DEmRNAs and 11 DElncRNAs were associated with the overall survival of LUAD patients using univariate COX regression. The genes with the 10 highest and the 10 lowest Hazard Ratios (HRs) were reported in Table S1 (Supplementary Materials).
Additional filtering of mRNAs and lncRNAs
A PPI network was constructed by uploading the 189 prognosis-associated DEmRNAs on the STRING database. Fig. 3A depicts the derived PPI network, which was visualized by the Cytoscape program and contained 77 nodes and 919 edges. Then, 2 modules were identified using the MCODE algorithm (shown in Fig. 3B-C). 43 mRNAs were further characterized as hub genes by the cytoHubba tool. Co-expression analysis of mRNAs and lncRNAs was also performed, as shown in Fig. 3D. 3 lncRNAs were significantly correlated to the expression of hub mRNAs. As a result, 46 genes (43 hub mRNAs and 3 lncRNAs) were used for the following risk modeling analysis.
Risk scores derived from COX regression
6 genes with nonzero coefficients were identified in the LASSO COX regression model. The risk score was calculated for each sample using the following formula:
RiskScore= 0.08443×DLGAP5 + 0.02479×MKI67 + 0.00070×TOP2A + 0.09877×CCNB1 + 0.05551×PICSAR + 0.10420×SCAT1
Samples were classified into high- and low-risk groups by comparing their risk scores to the median score in the TCGA cohort. As shown in Fig. 4A-C, the areas under the time-dependent ROC (AUROC) of TCGA are 0.690, 0.687, 0.673 and 0.726 for 1-, 2-, 3- and 4-years survival, respectively. The predicted risk scores presented better performance in the GSE31210 cohort (AUROC: 0.710, 0.696, 0.723 and 0.749 for 5, 6, 7 and 8 years, respectively) and the GSE30219 cohort (AUROC: 0.699, 0.696, 0.731 and 0.726 for 1, 2, 3 and 4 years, respectively). It is noteworthy that a different time interval was assessed for GSE31210 because LUAD was relatively mild (mostly in stage I) and survival time was generally longer in this cohort. Additionally, the Kaplan Meier analysis with log-rank tests revealed a significant difference in the overall survival (p<0.01) between the two groups in the TCGA cohort as well as in the two validation cohorts (shown in Fig. 4D-F). Moreover, as shown in Fig. 4G-I, the proposed risk score was proven to be an independent prognostic factor for LUAD (HR>1, p<0.05) by successively using univariate and multivariate COX regression.
GO and KEGG pathway enrichment analyses were performed on the DEGs between the high- and low-risk groups to elucidate the biological functions and pathways associated with the risk score. As shown in Fig. 5A-B, the DEGs between different risk groups are mainly enriched in the functions associated with mitosis and cell cycle, such as DNA replication, chromosome segregation and spindle. KEGG pathway enrichment analysis (Fig. 5D-E) also revealed that DNA replication and cell cycle pathways were enriched. Interestingly, the DEGs are significantly enriched in pathways of neurodegeneration and related diseases, such as Huntington disease and Alzheimer disease.
Fig. 5C&F shows the correlation between the risk score and immune status. As seen, the scores of aDCs, pDCs, APC co-stimulation, and MHC class I, are significantly different between the low-risk and high-risk groups. Most of the immune functions are significantly higher expressed in the high-risk group with the exception of the Type II IFN Response.
Machine learning modeling
The distributions of gene expression and risk scores in the different stages of LUAD are presented in Fig. 6A. Gene expression and risk scores were all significantly higher in pathological Stage III-IV LUAD, suggesting poorer survival for these LUAD stages. Five machine learning models, including logistic regression, KNN, SVM, random forest and XGBOOST, were applied to classify early-stage LUAD (pathological Stage I-II) from others (pathological Stage III-IV) based on the 6 risk genes. The AUROCs were assessed and are shown in Fig. 6B. The KNN model presented the highest AUROC (0.790) outperforming the other machine learning models. Fig. 6C shows the SHAP illustration of the XGBOOST model. Blue through red color indicates low to high expression level. SHAP values depict the effect of gene expression on the classification model. A positive SHAP value means that gene expression leads to an increase in the possibility of being Stage I-II, while a negative value represents that gene expression reduces this possibility.
Interestingly, the 6 genes’ expression and risk scores were lower in EGFR mutation than other LUAD. Five machine learning models were trained to recognize EGFR mutation, and the XGBOOST model was the one that presented the best performance (AUROC: 0.778). The AUROCs and SHAP values are shown in Fig. 6F&G.
The web-based tool
A web-based tool was developed (http://www.aimedicallab.com/tool/aiml-luad.html), and its interface is shown in Fig. 6D&H. Users only need to enter the normalized expression data of the 6 genes for each patient and click the ‘predict’ button. Then, the risk score and classification results of the machine learning models will be calculated and shown. Note that the early-stage classification was performed by KNN and the EGFR mutation was recognized by XGBOOST, since these were the best performing models in the two classification problems, respectively. Fig. 6D shows the input and results of a high-risk patient from the TCGA cohort; he was 71 years old with stage-III LUAD and survived only about 2 months in the study. In contrast, Fig. 6E is an example of low-risk patients. Her condition was more moderate (Stage I) and she survived more than 6 years.
CMap analysis and molecular docking
The forty-three hub mRNAs were all upregulated in LUAD samples and were uploaded to the CMap web server. A connectivity list of thousand compounds was obtained, with 20 of them at the top or bottom presented in Table S2 (Supplementary Materials). Four compounds had a score of 100 or -100, including SA-792541, doxorubicin, JW-7-24-1 and CD-437. The scores of mirtazapine, teniposide were 99.99 and -99.99 respectively. The other 14 compounds, including quizartinib, terreic-acid, SN-38 and purvalanol-a, presented a score of 99.98 or -99.98.
Molecular docking was conducted using the AutoDock Vina. The affinity scores of TOP2A-Purvalanola, Ki67-Etoposide and CCNB1-Etoposide were -6.7, -6.4 and -6.8 kcal/mol, respectively. These scores were all less than -5 kcal/mol suggesting, therefore, a strong blinding. The docking results are visualized using the PyMOL program and they are presented in Fig. 7.