Building a robust and generalizable LUAD classifier
To explore prognosis-related signatures in LUAD, we performed an integrating analysis of different RNA-Seq data sets from multiple cohorts (Fig. 1). Firstly, 17 consensus prognosis-related genes (PRGs) were identified using a univariate Cox regression analysis performed on intersection genes of six independent cohorts (GSE13210, GSE68465, GSE31210, GSE72094, TCGA, and OncoSG). Then, a subset of 16 PRGs with correlation coefficient < 0.75 (Supplementary Figs. 2A) was selected for further investigation after correlation analysis.
The study utilized the 16-PRG set as a basis for developing a total of 26 classifiers. These classifiers were constructed by employing a mixture of eight distinct machine learning techniques (Fig. 2A). Upon evaluating the performance of 26 classifiers, we utilized the Concordance index (C-index) derived from both training and test sets. Our findings revealed that the classifier generated by employing the Gradient Boosting Machine (GBM) algorithm, known as the generalized boosted-regression-modeling classifier, attained a C-index of 0.69 in the training set. In addition, the consistency indices in the test sets (GSE13120, GSE31210, GSE68465, GSE72094, and OncoSG) ranged from 0.61 to 0.69 (Fig. 2A). All of these results collectively demonstrate that the GBM classifier exhibits excellent stability and accuracy. Subsequently, the scoring system of the GBM classifier was employed to allocate risk scores, and the samples were segregated into high-risk and low-risk categories based on a cutoff of GBM risk scores.
Additionally, the results of ROC analysis revealed the following area under the curve (AUC) values for 1-, 3-, and 5-year predictions, respectively: 0.725, 0.714, and 0.723 in TCGA; 0.924, 0.719, and 0.693 in GSE13120; 0.70, 0.686, and 0.706 in GSE31210; 0.685, 0.653, and 0.589 in GSE68465; 0.613, 0.656, and 0.601 in GSE72094; and 0.625, 0.638, and 0.756 in OncoSG (Fig. 2B-G). The analysis of Kaplan-Meier (KM) curves revealed that LUAD patients in the greater GBM-risk group exhibited a considerably worse prognosis compared to those in the lower GBM-risk group (Fig. 2H-M). Collectively, our findings indicated that the GBM classifier, which consisted of 16 PRGs, had a high level of accuracy in both patient classification and prognosis prediction for LUAD.
Integrating GBM risk score and clinicopathological factors in a prognostic model
Next, we examined the autonomy of the GBM-risk score as a prognostic model. The independence of GBM risk score was validated through multivariate Cox regression analysis, with a statistically significant p-value of less than 0.001, log (HR) = 5.75 (95%CI, 4.39–7.11) (Fig. 3A). Considering the inconsistent clinical information categories across distinct cohorts, we have opted to exclude non-shared clinical information. Consequently, we selected the TCGA dataset as the training set, while the GSE13120, GSE31210, GSE72094, and OncoSG datasets as the test sets. Subsequently, a prognostic model was constructed utilizing the Cox proportional hazards (Coxph) approach, and shared clinical data was integrated from several cohorts, including variables such as gender, age, tumor grade, and risk score. The results indicated that the Coxph model demonstrated promising predictive accuracy in both the training set of TCGA and the independent external test sets of GSE13120, GSE31210, GSE72094, and OncoSG (Fig. 3B-F). Specifically, in the training set of TCGA, the 1-, 3-, and 5-year AUC values were 0.783, 0.754, and 0.753, separately (Fig. 3B). In the independent external test set of GSE13120, the 1-, 3-, and 5-year AUC values were 0.904, 0.792, and 0.752, separately. Similarly, in the independent external test sets of GSE31210, GSE72094, and OncoSG, the 1-, 3-, and 5-year AUC values were as follows: GSE31210 (0.843, 0.780, and 0.732), GSE72094 (0.659, 0.727, and 0.667), and OncoSG (0.666, 0.591, and 0.606) (Fig. 3C-F). Consequently, a visual representation of the Cox proportional hazards model, incorporating the GBM risk score, age, sex, and grade, was created utilizing the R package 'regplot' (Fig. 4A). The constructed nomogram showed a visual depiction of the prognostic model, enabling healthcare practitioners to approximate the likelihood of survival for particular patients by considering the prognostic markers that have been established.
In comparison to other clinical features, the GBM-risk score had a greater area under the curve (AUC) value as determined by ROC analysis (Fig. 4B-D). The calibration curves offered empirical evidences that support the reliable prediction ability of the nomogram in determining the survival rates at 1-year, 3-year, and 5-year intervals for LUAD patients (Fig. 4E-G). The DCA showed that the nomogram established in this study demonstrated enhanced precision in prognosticating survival outcomes in patients with LUAD as compared to conventional methods of either administering all available treatments or refraining from treatment altogether (Fig. 4H). These findings indicated that the nomogram has a potential to be a useful tool in the clinical decision-making process for LUAD patients. Collectively, the utilization of the GBM-based risk score, in conjunction with clinical characteristics, demonstrated encouraging efficacy in prognosticating the prognosis of LUAD patients.
Single-cell deconvolution revealing epithelial cells as a predominant cell subgroup associated with worse survival
Additionally, we conducted a reanalysis of the single-cell transcriptomes of a total of 17 samples, consisting of nine samples from patients with LUAD and eight samples from individuals with healthy lung sample. Following a comprehensive set of stringent quality control and filtering procedures, we conducted an integrative analysis of the transcriptomes from a collective of 55,426 cells. Eleven major cell types were identified based on characteristic canonical cell markers (Supplementary Figs. 2B, Supplementary Figs. 3A, C), including epithelial cells and various immune cell types such as T cells, B lymphocytes, plasma cells, monocytes, macrophages, mast cells, and dendritic cells (Supplementary Figs. 3A, C). Furthermore, stromal cell types such as fibroblasts and endothelial cells, as well as cycling cells, were also identified (Supplementary Figs. 3A, C). Subsequently, we excluded cycling cells and undefined cells from our examination and proceeded to develop a comprehensive single-cell atlas for LUAD (Fig. 5A-B). This atlas revealed notable heterogeneity in the proportions of stromal cells and immune cells across the samples (Fig. 5C), which aligns with findings from prior studies (Kim et al. 2020; Wu et al. 2021).
Leveraging the deconvolution algorithm, scissor, we projected the prognostic information extracted from LUAD BULK-seq data onto the single-cell atlas of LUAD. The findings revealed epithelial cells as a predominant cell subgroup associated with worse survival across all of six cohorts (Fig. 6A-F). This suggests that the cellular composition of epithelial cells plays crucial role in influencing the prognosis of LUAD.
Epithelial cells with a significant enrichment of PRGs, and malignant epithelial cells with higher levels of PRGs than non-cancerous counterparts
To gain a more comprehensive understanding of the operational mechanisms of consensus PRGs at a subcellular level, we conducted a thorough examination focusing on individual cellular samples obtained from clinical sources. Multiple methods were employed to evaluate the enrichment of PRGs within certain subpopulations of cells. It was revealed that PRGs had a prominent enrichment in epithelial cells (Fig. 7A), and there was a statistically significant increase in PRGs in LUAD when compared to healthy lung tissues (Fig. 7B). These findings provide additional evidence that epithelial cells are the primary subgroup of cells associated with prognosis in LUAD. In order to determine if malignant epithelial cells maintain a significant enrichment of PRGs, we conducted copy number karyotyping (CopyKAT) analysis. The epithelial cells were divided into two groups, high-grade malignancy group and normal group, followed by the scoring of PRGs in both cell categories utilizing five distinct algorithms. A notably elevated PRGs score was seen in malignant cells compared to their normal counterparts (Fig. 7C-D). In summary, these findings indicate that PRGs may exert a substantial influence on tumor formation at a subcellular level.
Correlation between PRGs and tumor microenvironment in LUAD
In order to examine the potential interactions among epithelial cells and other biological components, we employed CellChat as a tool for analyzing cell-cell communication. Our study revealed an intricate interplay among epithelial cells, macrophages, and T cells (Fig. 8A-B). Based on the findings obtained from the CopyKAT analysis, we proceeded to conduct a more detailed investigation into the communication between epithelial cells in both malignant and normal states. The results revealed a noteworthy contact between malignant cells and macrophages, as well as between malignant cells and T cells (Fig. 8C-D).
Moreover, in order to further investigate the correlation between PRGs and tumor microenvironment, we utilized the deconvolution method known as xCell to calculate the cellular makeup of each bulk RNA-seq data sample. This was done in combination with the Gene Set Variation Analysis (GSVA) score for each sample. By utilizing these metrics, we performed the correlation analysis between PRGs and the invasion of non-neoplastic cells. A positive correlation was identified between the levels of PRGs and M1 macrophages, as well as between PRGs and Th2 cells (Fig. 9A-C, Supplementary Figs. 4A-D). Conversely, a negative correlation was found between PRGs and CD4 Tcm cells, as well as between PRGs and M2 macrophages (Fig. 9A-C, Supplementary Figs. 4A-D). Furthermore, a notable disparity was observed in samples with varying tumor purities based on PRGs GSVA score. Specifically, samples with elevated tumor purity demonstrated markedly higher PRGs GSVA scores (Fig. 9D). In summary, these findings established a connection between PRGs and tumor microenvironment in LUAD, specifically in relation to the infiltration of non-neoplastic cells, including macrophages and T lymphocytes.
PRGs exhibiting higher enrichment in tumor-associated cell states
To investigate the potential function of PRGs in epithelial cell development, we extracted all of the epithelial cells and analyzed their cell trajectory using monocle2. The results demonstrated that the epithelial cells were separated into three states (Fig. 10A), with the majority of tumor cells being enriched in states 1 and 3 (Fig. 10B). The PRGs enrichment score in various states were evaluated by five aforementioned enrichment score algorithms, ‘AddModuleScore’, ‘UCell’, ‘AUCell’, ‘Singscore’, and ‘Jasmine’. The results showed the cells of states 1 and 3 exhibited significantly higher PRGs score enrichment than state 2 cells (Fig. 10C). These results indicate that PRGs may have a substantial effect on the growth and progression of epithelial cells, particularly in cell states associated with tumors.
High GSVA score of PRGs in LUAD experiences worse outcomes
To delineate the disparate biological functions of PRGs in LUAD, we further performed GSVA and GSEA analysis on previously aforementioned six cohorts. We found that the high PRGs GSVA score (HPGS) groups showed significantly enrichment of signal pathways including epithelial-mesenchymal transition (EMT), hypoxia, IL6/JAK/STAT3 and KRAS_UP (Fig. 11A-F, Supplementary Figs. 5A-F). Meanwhile, we also observed that the HPGS groups had a markedly reduced survival span (Fig. 11G-L). EMT plays a pivotal role in the advancement of cancer, as it enables the migration, invasion, and metastasis of cells (Craene and Berx 2013; Yuan et al. 2014; Pastushenko and Blanpain 2019). The increased prevalence of EMT markers in the HPGS groups may provide a plausible explanation for their poorer clinical outcomes of LUAD. Hypoxia is a widely recognized catalyst for the advancement of tumors and the development of resistance against therapeutic interventions (Eltzschig and Carmeliet 2011; Gilkes et al. 2014). The activation of the IL6/JAK/STAT3 pathway, involved in inflammation and immunity, has the potential to contribute to the remodeling of the tumor microenvironment, hence facilitating tumor growth and progression (Johnson et al. 2018; Choy et al. 2020). The KRAS_UP pathway, which is commonly shown to be mutated in LUAD, is closely linked to the manifestation of aggressive tumor characteristics and the development of resistance towards therapeutic interventions (Buscail et al. 2020; Drosten and Barbacid 2020). Individuals exhibiting elevated PRGs GSVA scores may potentially derive advantages from the implementation of more assertive treatment approaches or the utilization of specialized therapeutic interventions specifically designed to counteract the activation of these implicated pathways. Taken together, our findings indicated that LUAD patients with high PRGs GSVA score experienced worse outcomes, which was associated with the activation of cancer-related pathways.