1、Classification and characterization of lung adenocarcinoma
The overview of the process used in this study was shown in Fig. 1. After extraction from KEGG and MMPORT databases, a total of 947 genes related to metabolic pathways and 483 genes related to immunity were obtained. In the TCGA LUAD cohort (n = 502), 343 pairs of gene expression differences were found in contrast to adjacent normal tissues. The GEO-LUAD cohort was merged using the combat package and principal component analysis (PCA) was applied to evaluate the merging effect (Fig. 2). Through GEO cohort validation, 272 genes were identified that were expressed in both GEO and TCGA cohorts. Using these 272 pairs of genes for Non-negative matrix factorization (NMF) algorithm, the TCGA cohort was divided into two subtypes (Fig. 3A). According to the results of survival analysis, they were named high-risk group and low-risk group, and it was found that there were statistically significant differences in disease-free survival and survival between these two types (Fig. 3B, C).
We used the ESTIMATE algorithm to estimate the population abundance of immune cells and stromal cell populations infiltrating the tissues in both subtypes (Fig. 3D). The ESTIMATE evaluation results showed that C1 subtype had a higher score compared to C2 subtype, indicating a more significant level of immune cell infiltration in C1 subtype (Fig. 3E).
2、Establishment and validation of prognostic models
In order to further explore the value of differential genes, The TCGA-LUAD cohort was randomly divided into a training group and a testing group in a 7:3 ratio. we combined univariate Cox regression, LASSO regression (Fig. 4A, B) and multivariate Cox regression (Fig. 4C), used to select 9 genes to construct an immune prognostic risk model. Risk scores were obtained by combining expression levels and regression coefficients (RiskScore = CAT exp × -0.166751936312374 + CCL20 exp ×0.0563504470190809 + GPI exp ×0.133661988590366 + INSL4 exp * 0.126270665159998 + NT5Eexp * 0.105058801123867 + GSTA3exp × -0.064669483229661) + GNPNAT1 exp × 0.136377362257813). The training set and test set patients were divided into low-risk group and high-risk group based on median risk score, and the robustness of the prognostic risk model was evaluated. In both TCGA cohort (Fig. 5A, B, C) and GEO cohort (Figure D), patients in the high-risk group had shorter survival time than those in the low-risk group ROC curves showed that RiskScore had the largest area under AUC curve among all clinical features (Fig. 5E). In TCGA cohort, the areas under AUC curve at 1 year, 3 years, and 5 years were: 0.752, 0.755, and 0.654 respectively (Fig. 5F).
According to the RiskScore, calculate the scores of each sample and then divide the GEO cohort and TCGA cohort into two groups based on the median risk score (Fig. 6A, B). Compare the gene expression patterns in different risk groups between the two cohort. The results show that regardless of whether it is in different risk groups in TCGA or GEO cohort similar abnormal gene expressions are observed in both (Fig. 6C, D). Additionally, there is a higher distribution of death cases in the high-risk group (Fig. 6E, F). In addition, after dividing the TCGA cohort into two groups based on patient staging (I-II stage and III-IV stage), we found that the high-risk group exhibited shorter survival (Fig. 6G, H).
3. Construction of a Nomogram
After univariate Cox regression and Multivariate Cox analysis, we found that the risk score is an independent risk factor and used it to construct a column chart for prediction (Fig. 7A, B). In the column chart, scores for various factors such as age, stage, and risk score were calculated, and the total score can be used as a predictive tool (Fig. 7C). At the same time, a calibration curve was plotted to measure the consistency between actual observed prognosis values and predicted values from the column chart (Fig. 7D). The performance of the column chart was evaluated using ROC curves with AUC values of 0.715 at 1-year survival period, 0.735 at 3-year survival period, and 0.739 at 5-year survival period (Fig. 7E).
4. Clinical differences between high-risk and low-risk groups
After analysis, it was found that there is a correlation between the genes used to construct the prognostic model and the risk scores (Fig. 8A), and there are also differences in gene expression between high-risk and low-risk groups (Fig. 8B-H). However, when dividing the high and low expression groups based on the median expression of each gene, we observed survival differences between the high and low expression groups of CAT\CCL20\GPI\GNPNAT1 in TCGA cohort (Fig. 9A-D).
The analysis of the correlation between RiskScore and tumor mutation showed that risk score was positively correlated with tumor (Fig. 10A), and there were differences in TMB between high and low risk groups (Fig. 10B). Further investigation into the relationship between TMB and survival time found that the high mutation group had a shorter survival time (Fig. 10C). According to the division of risk groups, the TCGA cohort was divided into four groups: high-risk high-TMB group, high-risk low-TMB group, low-risk high-TMB group, and low-risk low-TMB group. The results showed that the low-risk high-TMB group had the longest survival time, followed by the low-risk low-TMB group and the high-risk high-TMB group (Fig. 10D). In terms of gene mutations, the analysis of the top 10 genes TP53, TTN, MUC16, CSMD3, RYR2, LRP1B, ZFHX4, USH2A, KRAS and IRP2 showed that the mutation rates of each gene in the high-risk group were significantly higher than those in the low-risk group (Fig. 10E, F).
To further investigate immune infiltration, we employed the MCPcounter method. We calculated the high- and low-risk groups, risk scores, and the population abundance of stromal cell populations. The results showed a correlation between B-cell lineage, T-cell, myeloid cell, endothelial cell, and fibroblast with riskscore (Fig. 11A). In the high- and low-risk groups, the abundance of myeloid cells, endothelial cells, fibroblasts, and neutrophils varied (Fig. 11B-E).
4、Potential drug sensitivity analysis
We used TIDE and TICA models to predict clinical differences between high-risk and low-risk groups, and used IMvigor210 as an external validation set. According to TIDE scores, high-risk groups showed higher TIDE score and Worse treatment response than low-risk groups (Fig. 12A). According to TICA scores, there was no statistically significant difference in AIPS scores in PD1 positive groups, but in PD1 negative groups, regardless of CALT4 expression, low-risk groups had higher scores (Fig. 12B). External validation on the IMvigor210 cohort showed that patients classified as high-risk groups had shorter survival and weaker immunotherapy response (Fig. 12C, D).
The oncoPredict was used to conduct drug sensitivity analysis on high- and low-risk groups, specifically for common chemotherapy regimens of lung adenocarcinoma. The results showed that there were varying degrees of response differences among commonly used chemotherapy drugs such as paclitaxel(Fig. 13A), docetaxel(Fig. 13B), cisplatin(Fig. 13C), gemcitabine(Fig. 13D), and vinorelbine(Fig. 13E) in the high- and low-risk groups. These drugs are commonly used chemotherapy drugs for lung adenocarcinoma and showed higher scores in the low-risk group.