Integration and deconvolution methodology deciphering prognosis-related signatures in lung adenocarcinoma

This study aims to establish a risk prediction model based on prognosis-related genes (PRGs) and clinicopathological factors, and investigate the biological activities of PRGs in lung adenocarcinoma (LUAD). Risk score signatures were developed by employing multiple algorithms and their amalgamations. A predictive model for overall survival was established through the integration of risk score signatures and several clinicopathological parameters. A comprehensive single-cell atlas, gene set enrichment analysis (GSEA) and gene set variation analysis (GSVA) were used to investigate the biological activities of prognosis-related genes in LUAD. A risk prediction model was established based on 16 PRGs, exhibiting robust performance in predicting overall survival. The single-cell analysis revealed that epithelial cells were primarily associated with worse survival of LUAD, and PRGs were predominantly enriched in malignant epithelial cells and influenced epithelial cell growth and progression. Furthermore, GSEA and GSVA analysis showed that PRGs were involved in tumor pathways such as epithelial-mesenchymal transition, hypoxia and KRAS_UP, and high GSVA scores are correlated with worse outcome in LUAD patients. The constructed risk prediction model in this study offers clinicians a valuable tool for tailoring treatment strategies of LUAD and provides a comprehensive interpretation on the biological activities of PRGs in LUAD.


Introduction
Lung adenocarcinoma (LUAD), a particularly virulent subtype of non-small cell lung cancer, remains a formidable global health adversary largely due to its dismal patient prognosis (Denisenko et al. 2018;Siegel et al. 2022).The advent of single-cell RNA sequencing has revolutionized our understanding of this disease, enabling a high-resolution exploration of tumor heterogeneity and prognostic cell subpopulations (Suvà and Tirosh 2019;Gohil et al. 2021).In this complex landscape, machine learning emerges as a powerful tool to integrate multidimensional omics datasets for prognostic prediction (Cruz and Wishart 2007;Komorowski et al. 2022).A diverse array of machine learning algorithms, including artificial neural networks, Bayesian networks, support vector machines, and decision trees, have been Ming Yi and Jiaying Shi contributed equally to this work.
harnessed to tackle various cancer prediction tasks (Akay 2009;Lg and At 2013).The unveiling of a plethora of gene expression signatures tethered to LUAD prognosis has further amplified the potential of machine learning models (Langerød et al. 2007;Corrêa and Augsburger 2016;Stelloo et al. 2016).The integration of these signatures into machine learning models holds tantalizing promise for patient riskstratification and the tailoring of treatments.
However, rigorous validation, model interpretability, and generalizability remain significant hurdles.The "blackbox" nature of machine learning further complicates matters, obscuring the biological interpretation of model features.Confronting these limitations is paramount for the clinical translation of robust, personalized prediction models that harness the bounty of omics data to enhance outcomes of LUAD patients.In this study, a machine learning pipeline was employed to thoroughly investigate prognosis-related signatures across several molecular profiles and develop a reliable prognostic model for LUAD.In addition, we elucidated the biological activities of these predictive risk gene signatures through the implementation of an integrated and deconvolution analysis of both single-cell and bulk transcriptome data of LUAD.

Public data collection and processing
The single-cell RNA-Seq data used in this study consist of nine samples from LUAD patients obtained from the GSE171145 dataset.Additionally, eight samples (GSM4058906, GSM4058911, GSM4058912, GSM4058913, GSM4058914 GSM4058917, GSM4058921, and GSM4058926) from individuals with healthy lungs were included sourced from the GSE136831 dataset.The microarray data utilized in this study were obtained from GSE13213, GSE68465, GSE31210, and GSE72094.The GEO datasets can be accessed at http:// www.ncbi.nlm.nih.gov/ geo.The additional bulk RNA-Seq datasets for OncoSG (Chen et al. 2020) and the Cancer Genome Atlas (TCGA) were obtained from the cBioPortal database (http:// www.cbiop ortal.org) and the UCSC Xena (http:// xena.ucsc.edu), respectively.All datasets underwent batch effect correction.The mRNA expression data from the bulk LUAD cohorts were converted to z scores to improve comparability across different databases.The "ComBat" function from the R package sva (version 3.46.0)was used to eliminate batch effects.

Univariate Cox regression and correlation analysis
The univariate Cox regression analysis was performed on both the training and testing sets using the R package glmnet (version 4.1-7), focusing on the intersection genes.The selection of consensus prognosis-related genes for further investigations was based on the criteria of statistical significance (p < 0.05) and hazard ratios (HRs) greater than 1 or less than 1.The correlation analysis was performed using the "cor" function in R software (version 4.2.3).

Developing a robust machine learning-based classifier
A combination of eight machine learning algorithms, including random survival forest (RSF), elastic network (Enet), stepwise Cox regression (StepCox), CoxBoost, partial least squares regression for Cox (plsRcox), and supervised principal components (SuperPC), was utilized to develop a classifier to generalize boosted regression modeling (GBM) and survival support vector machine (survival-SVM).The assessment of the classifiers' performance was carried out utilizing the Concordance index (C-index).The assignment of GBM risk score for each sample was performed using the "gbm.perf"function from the R package gbm (version 2.1.8.1).The Kaplan-Meier (KM) curve analysis was performed using the R package survival (version 3.5-5) and survminer (version 0.4.9).The receiver operating characteristic (ROC) analysis was conducted using the R package timeROC (version 0.4) and survivalROC (version 1.0.3.1).

Multivariate Cox regression analysis
To assess the statistical independence of the risk score, a multivariate Cox regression analysis was performed using the R package glmnet (version 4.1-7).This analysis offers estimates of the contribution of the risk score while controlling for other relevant factors.

Nomogram model development and evaluation
The PRGs risk score and clinical pathological factors were enrolled into multivariate Cox regression analysis to confirm prognostic factors associated with overall survival (OS) of LUAD patients.To predict the survival probability at 1, 3, and 5 years for LUAD patients, a nomogram model was created using the R package regplot (version 1.1).Calibration curves were plotted to evaluate the agreement between the predicted survival probabilities and the actual observed survival rates, using the R package rms (version 6.7-0).In addition, decision curve analysis (DCA) was conducted using the R package caret (version 6.0-94) and ggDCA (version 1.2) in order to assess the clinical utility of a predictive model by quantifying the net benefit of using the model for decision-making compared to other strategies.

Single-cell RNA-seq data generation, quality control and across-sample integration
The samples were converted into a Seurat object using the "CreateSeuratObject" function from the Seurat package (version 4.3.0.1) in R software (version 4.2.3)(Satija et al. 2015).We removed cells with either lower than 500 or higher than 6000 expressed genes (Supplementary Fig. 1A).In addition, we discarded cells with mitochondria content higher than 20% (Supplementary Fig. 1B).Finally, a total of 55,426 cells were acquired for the subsequent analysis.The R package harmony (version 0.1.1)was used as the batch effect removal method (Korsunsky et al. 2019).We used the R package Seurat to normalize expression matrices by the functions of "Normal-izeData" and "ScaleData".Then "FindVariable" function of Seurat package was applied to select the top 2000 variable genes (Supplementary Fig. 1C).

Dimension reduction and unsupervised clustering
Principal components analysis (PCA) was performed on the integration-transformed expression matrix using the "Run-PCA" function from R package Seurat, and the first 15 principal components (PCs) were used in the "FindNeighbors" function from R package seurat (Supplementary Fig. 1B).The resolution parameters of the "FindClusters" function were different for different cell types, with 0.5 for all cells.Uniform manifold approximation and projection (UMAP) was performed for visualization in two dimensions using the "RunUMAP" function with the same PCs and other default parameters (Supplementary Fig. 1D).

Cell type identification and differential gene expression analysis
The R package SingleR (version 2.0.0) was used to evaluate the cell type scores (Aran et al. 2019), and "FindAllmakers" function from R packsge Seurat was used to identify the top10 genes for each cluster.Besides, the R package COSG (version 0.9.0) was applied to evaluate the top 100 genes in each cluster (Dai et al. 2022).The differentially expressed genes (DEGs) of each cluster were generated by Seurat "FindAllMarkers" function.

Single cell deconvolution using Scissor and Gene sets enrichment analysis (GSEA)
The R package Scissor (version 2.0.0) was used to associate phenotypic data from bulk RNA-seq with the single-cell data (Sun et al. 2022).To determine the primary effector subgroups of PRGs, the enrichment scores of PRGs in different cell types was assessed using five enrichment score algorithms conducted by the R package irGSEA (version 2.1.4)and Seurat (version 4.3.0.1), namely "UCell", "AUCell", "Singscore", "Jasmine" and "AddModuleScore".

CopyKAT analysis
The R package copykat (version 1.1.0)was used to infer copy number variations (CNVs) based on the gene expression of all the epithelial cells in our single-cell RNA data (Gao et al. 2021).The parameters containing ngene.chr = 5, win.size = 25, KS.cut = 0.1, distance = 'euclidean' and id.type = 'S' were used in function "copycat".Three samples with more than 100 epithelial cells in healthy lungs were used as reference, and CNV inference analysis on epithelial cells from LUAD was performed.

Cell-cell interplay and xCell infer cell composition
To investigate the potential communication between epithelial cells and other types of cells, the ligand-receptor interactions between different cells was analyzed using the R package CellChat (version 1.6.1)(Jin et al. 2021).The function "xCellAnalysis" in package xCell (version 1.1.0)was used to infer cell composition from bulk RNA-seq.

Pseudo-time trajectories analysis of epithelial cells
To further investigate the potential role of PRGs in epithelial cell development, we extracted all the epithelial cells and performed cell trajectory analysis using the R package monocle (version 2.26.0) (Qiu et al. 2017).The function "reduc-tionDimension" in package monocle was used to reduce dimension, with the parameters of 'max_components = 2' and 'method = 'DDRTree''.The function "orderCells" and "plot_cell_trajectory" was used to order cells and plot cell development trajectory.

Gene set enrichment analysis and Gene set variation analysis (GSVA)
Gene set variation analysis was conducted by function "gsva" in the R package GSVA (version 1.46.0)(Hänzelmann et al. 2013).The samples were divided into high PRGs GSVA score (HPGS) groups and low PRGs GSVA score (LPGS) groups based on the GSVA scores of the PRGs gene set.Differential expression genes analysis between the two groups was performed by using the R package limma (version 3.54.2).All differential expression genes were ranked in descending order according to log2FoldChange (log2FC).The gene set of 50 hallmark pathways investigated in this study was downloaded from the GSEA website (https:// www.gsea-msigdb.org/ gsea/ index.jsp).Then, GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) enriched pathways was identified by the R package clusterProfiler (version 4.6.2) (Yu et al. 2012).

Statistical analysis and graphical visualization
All data processing, statistical analysis, and plotting were conducted using the R software (version 4.2.3).The optimal cutoff value was determined using the R package cutoff (version 1.3).The plots displayed in this study was generated using R package ggplot2 (version 3.

Building a robust and generalizable LUAD classifier
To explore prognosis-related signatures in LUAD, we performed an integrating analysis of different RNA-Seq data sets from multiple cohorts (Fig. 1).Firstly, 17 consensus prognosis-related genes (PRGs) were identified using a univariate Cox regression analysis performed on intersection genes of six independent cohorts (GSE13213, GSE68465, GSE31210, GSE72094, TCGA, and OncoSG).Then, a subset of 16 PRGs with correlation coefficient < 0.75 (Supplementary Fig. 2A) was selected for further investigation after correlation analysis.
The study utilized the 16-PRG set as a basis for developing a total of 26 classifiers.These classifiers were constructed by employing a mixture of eight distinct machine learning techniques (Fig. 2A).Upon evaluating the performance of 26 classifiers, we utilized the Concordance index (C-index) derived from both training and test sets.Our findings revealed that the classifier generated by employing the Gradient Boosting Machine (GBM) algorithm, known as the generalized boosted-regression-modeling classifier, attained a C-index of 0.69 in the training set.In addition, the consistency indices in the test sets (GSE13213, GSE31210, GSE68465, GSE72094, and OncoSG) ranged from 0.61 to 0.69 (Fig. 2A).All of these results collectively demonstrate that the GBM classifier exhibits excellent stability and accuracy.Notably, the classifiers denoted as 'RSF', "StepCox[both] + RSF", "StepCox[forward] + RSF", and "StepCox[backward] + RSF" exhibited AUCs exceeding 0.9 within the TCGA cohort, but performed weakly in other datasets (Fig. 2A).This discrepancy could potentially be attributed to the overfitting during its development.Subsequently, the scoring system of the GBM classifier was employed to allocate risk scores, and the samples were segregated into high-risk and low-risk categories based on a cutoff of GBM risk scores.
Additionally, the ROC analysis revealed the following area under the curve (AUC) values for 1-, 3-, and 5-year predictions, respectively: 0.725, 0.714, and 0.723 in TCGA; 0.924, 0.719, and 0.693 in GSE13213; 0.70, 0.686, and 0.706 in GSE31210; 0.685, 0.653, and 0.589 in GSE68465; 0.613, 0.656, and 0.601 in GSE72094; and 0.625, 0.638, and 0.756 in OncoSG (Fig. 2B-G).The analysis of Kaplan-Meier (KM) curves revealed that LUAD patients in the greater GBM-risk group exhibited a considerably worse prognosis compared to those in the lower GBM-risk group (Fig. 2H-M).Collectively, our findings indicated that the GBM classifier, which consisted of 16 PRGs, had a high level of accuracy in both patient classification and prognosis prediction for LUAD.

Integrating GBM risk score and clinicopathological factors in a prognostic model
Next, we examined the autonomy of the GBM-risk score as a prognostic model.The independence of GBM risk score was validated through multivariate Cox regression analysis, with a statistically significant p value of less than 0.001, log (HR) = 5.75 (95% CI 4.39-7.11)(Fig. 3A).Considering the inconsistent clinical information categories across distinct cohorts, we have opted to exclude non-shared clinical information.Consequently, we selected the TCGA dataset as the training set, and the datasets including GSE13213, GSE31210, GSE72094, and OncoSG as the test sets.Subsequently, a prognostic model was constructed utilizing the Cox proportional hazards (Coxph) approach, and shared clinical data was integrated from several cohorts, including variables such as gender, age, tumor grade, and risk score.The results indicated that the Coxph model demonstrated promising predictive accuracy in both the training set of TCGA and the independent external test sets of GSE13213, GSE31210, GSE72094, and OncoSG (Fig. 3B-F).

3
Fig. 1 Flowchart describing the construction and validation of prognosis-related signatures in LUAD Specifically, in the training set of TCGA, the 1-, 3-, and 5-year AUC values were 0.783, 0.754, and 0.753, separately (Fig. 3B).In the independent external test set of GSE13213, the 1-, 3-, and 5-year AUC values were 0.904, 0.792, and 0.752, separately.Similarly, in the other independent external test sets, the 1-, 3-, and 5-year AUC values were as follows: GSE31210 (0.843, 0.780, and 0.732), GSE72094 (0.659, 0.727, and 0.667), and OncoSG (0.666, 0.591, and 0.606) (Fig. 3C-F).Consequently, a visual representation of the Cox proportional hazards model, incorporating the GBM risk score, age, sex, and grade, was created utilizing the R package 'regplot' (Fig. 4A).The constructed nomogram showed a visual depiction of the prognostic model, enabling healthcare practitioners to approximate the likelihood of survival for particular patients by considering the prognostic markers that have been established.
In comparison to other clinical features, the GBM-risk score had a greater area under the curve (AUC) value as determined by ROC analysis (Fig. 4B-D).The calibration curves offered empirical evidences that support the reliable prediction ability of the nomogram in determining the survival rates at 1-, 3-, and 5-year intervals for LUAD patients (Fig. 4E-G).The following DCA analysis further showed that the established nomogram demonstrated enhanced precision in prognosticating survival outcomes in LUAD patients compared to conventional methods of either administering all available treatments or refraining from treatment altogether (Fig. 4H).These findings indicated that the nomogram has a potential to be a useful tool in the clinical decision-making process for LUAD patients.Collectively, the utilization of the GBM-based risk score in conjunction with clinical characteristics demonstrated encouraging efficacy in prognosticating the prognosis of LUAD patients.

Single-cell deconvolution revealing epithelial cells as a predominant cell subgroup associated with worse survival
Additionally, we conducted a reanalysis of the single-cell transcriptomes of a total of 17 samples, consisting of nine LUAD and eight healthy lung samples.Following a comprehensive set of stringent quality control and filtering procedures, we conducted an integrative analysis of the transcriptomes from a collective of 55,426 cells.Eleven major cell types were identified based on characteristic canonical cell markers (Supplementary Fig. 2B), including epithelial cells and various immune cell types such as T cells, B lymphocytes, plasma cells, monocytes, macrophages, mast cells, and dendritic cells (Supplementary Fig. 3A-C).Furthermore, stromal cell types such as fibroblasts and endothelial cells, as well as cycling cells, were also identified (Supplementary Fig. 3A-C).Subsequently, we excluded cycling cells and undefined cells from our examination and proceeded to develop a comprehensive single-cell atlas for LUAD (Fig. 5A, B).This atlas revealed notable heterogeneity in the proportions of stromal cells and immune cells across the samples (Fig. 5C), which aligns with findings from prior studies (Kim et al. 2020;Wu et al. 2021).
Leveraging the deconvolution algorithm, scissor, we projected the prognostic information extracted from the Bulk RNA-seq data onto the single-cell atlas of LUAD.The findings revealed epithelial cells as a predominant cell subgroup associated with worse survival across all of six cohorts (Fig. 6A-F), suggesting that the cellular composition of epithelial cells plays crucial role in influencing the prognosis of LUAD.

PRGs significantly enriched in epithelial cells, and malignant epithelial cells in particular
To gain a more comprehensive understanding of the operational mechanisms of consensus PRGs at a subcellular level, we conducted a thorough examination focusing on individual cellular samples obtained from clinical sources.Multiple methods were employed to evaluate the enrichment of PRGs within certain subpopulations of cells.It was revealed that PRGs had a prominent enrichment in epithelial cells (Fig. 7A), and there was a statistically significant increase in PRGs in LUAD when compared to healthy lung tissues (Fig. 7B).These findings provide additional evidence that epithelial cells are the primary subgroup of cells associated with prognosis in LUAD.In order to determine if malignant epithelial cells maintain a significant enrichment of PRGs, we conducted copy number karyotyping (CopyKAT) analysis.The epithelial cells were divided into two groups, high-grade malignancy group and normal group, followed by the scoring of PRGs in both cell categories utilizing five distinct algorithms.A notably elevated PRGs score was seen in malignant cells compared to their normal counterparts (Fig. 7C, D).In summary, these findings indicate that PRGs may exert a substantial influence on tumor formation at a subcellular level.

Correlation between PRGs and tumor microenvironment in LUAD
In order to examine the potential interactions among epithelial cells and other biological components, we employed Fig. 2 Building a robust and generalizable LUAD classifier.A A total of 26 classifiers and their respective C-index for six independent cohorts, including TCGA, GSE13213, GSE31210, GSE68465, GS72094, and OncoSG, as well as the average C-index for each classifier in these cohorts.B-G Time-dependent ROC analysis for predicting OS at 1, 2, 3, 4, 5 years in TCGA (B), GSE13213 (C), GSE31210 (D), GSE68465 (E), GS72094 (F), and OncoSG (G).H-M Kaplan-Meier curves of OS across TCGA (H), GSE13213 (I), GSE31210 (J), GSE68465 (K), GS72094 (L), and OncoSG (M) Our study revealed an intricate interplay among epithelial cells, macrophages, and T cells (Fig. 8A, B).Based on the findings obtained from the CopyKAT analysis, we proceeded to conduct a more detailed investigation into the communication between epithelial cells in both malignant and normal states.The results revealed a noteworthy contact between malignant cells and macrophages, as well as between malignant cells and T cells (Fig. 8C, D).
Moreover, in order to further investigate the correlation between PRGs and tumor microenvironment, we utilized the deconvolution method known as xCell to calculate the cellular makeup of each bulk RNA-seq data sample.This was done in combination with the GSVA score for each sample.
By utilizing these metrics, we performed the correlation analysis between PRGs and the invasion of non-neoplastic cells.A positive correlation was identified between the levels of PRGs and M1 macrophages, as well as between PRGs and Th2 cells (Fig. 9A-C, Supplementary Fig. 4A-D).Conversely, a negative correlation was found between PRGs and CD4 Tcm cells, as well as between PRGs and M2 macrophages (Fig. 9A-C, Supplementary Fig. 4A-D).Furthermore, a notable disparity was observed in samples with varying tumor purities based on PRGs GSVA score.Specifically, samples with elevated tumor purity demonstrated markedly higher PRGs GSVA scores (Fig. 9D).In summary, these findings established a connection between PRGs and tumor microenvironment in LUAD, specifically in relation   to the infiltration of non-neoplastic cells, including macrophages and T lymphocytes.

PRGs exhibiting higher enrichment in tumor-associated cell states
To investigate the potential function of PRGs in epithelial cell development, we extracted all of the epithelial cells and analyzed their cell trajectory using monocle2.The results demonstrated that the epithelial cells were separated into three states (Fig. 10A), with the majority of tumor cells being enriched in states 1 and 3 (Fig. 10B).The PRGs enrichment score in various states were evaluated by five aforementioned enrichment score algorithms, 'AddMod-uleScore', 'UCell', 'AUCell', 'Singscore', and 'Jasmine'.The results showed the cells of states 1 and 3 exhibited significantly higher PRGs score enrichment than state 2 cells (Fig. 10C).These results indicate that PRGs may have a substantial effect on the growth and progression of epithelial cells, particularly in cell states associated with tumors.

High GSVA score of PRGs in LUAD experiences worse outcomes
To delineate the disparate biological functions of PRGs in LUAD, we further performed GSVA and GSEA analysis on previously aforementioned six cohorts.We found that the high PRGs GSVA score (HPGS) groups showed significantly enrichment of signal pathways including epithelialmesenchymal transition (EMT), hypoxia, IL6/JAK/STAT3 and KRAS_UP (Fig. 11A-F, Supplementary Fig. 5A-F).Meanwhile, we also observed that the HPGS groups had a markedly reduced survival span (Fig. 11G-L).EMT plays a pivotal role in the advancement of cancer, as it enables the migration, invasion, and metastasis of cells (Craene and Berx 2013;Yuan et al. 2014;Pastushenko and Blanpain 2019).The increased prevalence of EMT markers in the HPGS groups may provide a plausible explanation for their poorer clinical outcomes of LUAD.Hypoxia is a widely recognized catalyst for the advancement of tumors and the development of resistance against therapeutic interventions (Eltzschig and Carmeliet 2011;Gilkes et al. 2014).The activation of the IL6/JAK/STAT3 pathway, involved in inflammation and immunity, has the potential to contribute to the remodeling of the tumor microenvironment, hence facilitating tumor growth and progression (Johnson et al. 2018;Choy et al. 2020).The KRAS_UP pathway, which is commonly shown to be mutated in LUAD, is closely linked to the manifestation of aggressive tumor characteristics and the development of resistance towards therapeutic interventions (Buscail et al. 2020;Drosten and Barbacid 2020).Individuals exhibiting elevated PRGs GSVA scores may potentially derive advantages from the implementation of more assertive treatment approaches or the utilization of specialized therapeutic interventions specifically designed to counteract the activation of these implicated pathways.Taken together, our findings indicated that LUAD patients with high PRGs GSVA score experienced worse outcomes, which was associated with the activation of cancer-related pathways.

Discussion
This study demonstrates the robust prognostic utility of the 16-gene classifier and integrated prognostic model in LUAD.Through validation in multiple independent cohorts, these signatures exhibit generalizability without overfitting to a single dataset.The 16-gene classifier, comprised of prognosis-related genes (PRGs), provides additive and independent prognostic information beyond conventional clinical and pathological factors.
Multivariate and nomogram analyses revealed that the classifier's risk score significantly improved prognostic prediction even after accounting for variables like age, sex, and grade.This indicates that the biological processes represented by these PRGs, including glutathione metabolism (GCLM), tumor suppression (HTATIP2), RNA splicing (SNRPD1), apoptosis inhibition (BIRC5), and potential roles in metastasis (NUP37), confer substantial prognostic signals that complement clinical variables.The results of cell trajectory analysis showed that PRGs are essential to epithelial cell growth and development, especially in tumor-associated cells.Monocle2 (Qiu et al. 2017) has helped to clarify epithelial cell states, with most tumor cells in states 1 and 3. Since malignancy is characterized by increased cell proliferation, tumor cells may be more abundant in certain settings.State 2 cells with lower PRGs content may exhibit reduced proliferative potential, enhanced differentiation potential, and a decreased ability to produce tumors.Notably, Some PRGs have established mechanistic roles in LUAD biology governing proliferation, apoptosis, and metastasis.For instance, GCLM overexpression promotes tumor progression and decreases survival rate (Harris et al. 2015).The HTATIP2 protein acts as a tumor suppressor inhibiting proliferation and metastasis (Tong et al. 2009;Li et al. 2013).Elevated SNRPD1 correlates with an aggressive cancer phenotype (Schümperli and Pillai 2004;Quidville et al. 2013).NUP37 upregulation may enhance metastasis (Bilokapic and Schwartz 2012;Liu et al. 2012).BIRC5 inhibits apoptosis Thorough examination of single-cell transcriptome data from both LUAD and healthy lung tissues, our findings highlight the notable involvement of epithelial cells and PRGs in the prognosis of LUAD.Firstly, the investigation of singlecell deconvolution of this study revealed that epithelial cells were the predominant cell subgroup that had a favorable correlation with worse survival in patients with LUAD.This discovery aligns with the established understanding of the involvement of epithelial cells in the development of lung cancer, given that these cells are the main site of initiation in LUAD (The Cancer Genome Atlas Research Network 2014; Kim et al. 2020;Wu et al. 2021).Our study also the significance of the cellular composition of epithelial cells in exerting an impact on the prognosis of LUAD.This implies that implementing techniques focused on regulating the makeup of epithelial cells may have the potential to enhance the prognosis of individuals with LUAD.Additionally, our investigation unveiled a substantial augmentation of PRGs in epithelial cells, along with a statistically significant elevation of PRGs in LUAD in comparison to lung tissues in a healthy state.This observation implies that PRGs have a significant impact on the advancement and evolution of LUAD.The significant abundance of PRGs in malignant epithelial cells provided additional evidences in favor of this concept, indicating that the pursuit of targeting PRGs may be a feasible treatment approach for LUAD.However, additional research is required to clarify the precise functions of these PRGs in LUAD and to verify their potential as targets for therapeutic intervention.
The examination of intercellular communication and the relationship between PRGs and immune cell infiltration yields significant insights into the tumor microenvironment of LUAD.The observed contact between cancer cells and macrophages/T cells demonstrates a notable intercommunication that has the potential to facilitate the advancement of tumors.The noteworthy observation lies in the positive correlation between PRGs and M1 macrophages/Th2 cells, given that both cell types have the potential to display protumoral activities.M1 macrophages, which have historically been regarded as having antitumoral properties, can also contribute to the advancement of cancer by promoting angiogenesis, modifying the extracellular matrix, and suppressing the adaptive immune response (Kessenbrock et al. 2010;Quail and Joyce 2013).Similarly, the pro-cancer role of Th2 cells is largely due to their ability to suppress Th1/ CTL responses (Kidd 2003;Walker and McKenzie 2018).The elevated levels of PRGs observed in these cells indicate a potential involvement of PRGs in modulating an immunosuppressive milieu.On the other hand, the intriguing aspect is in the negative connection seen between PRGs and M2 macrophages/CD4 Tcm cells.M2 macrophages and CD4 Tcm cells are commonly associated with the provision of antitumor immunity (Mueller et al. 2013;Yunna et al. 2020).The observed reduction in cell infiltration, along with elevated levels of PRGs, suggests that PRGs might play a role in modifying the microenvironment to avoid immune surveillance.Additionally, our findings that samples exhibiting high tumor purity displayed elevated levels of PRGs suggest that the upregulation of PRGs is associated with the increased dominance of cancer cells inside the tumor.All of these provide evidences for PRGs involved in the formation of an immunosuppressive microenvironment, which facilitates uncontrolled tumor growth.
One potential constraint of this study pertains to the comparatively limited sample size, particularly in relation to the analysis conducted using single-cell RNA sequencing data.Although the bulk RNA sequencing data yielded valuable insights into the patterns of gene expression across a substantial cohort of patients, the single-cell analysis was limited to a small subset of patients.In order to validate the findings and achieve a more comprehensive understanding of the variability of cell types within lung tumors and their microenvironment, it would be necessary to conduct singlecell RNA sequencing on a larger cohort.Furthermore, the examination solely focused on the analysis of gene expression data, neglecting to explore other potential determinants of patient prognosis, such as gene mutations, copy number variations, and protein expression levels.The integration of multi-omics data from bigger cohorts is expected to enhance the robustness of prognostic models and uncover novel prognostic signals that extend beyond gene expression profiles.Notwithstanding these constraints, the utilization of both single-cell and bulk RNA sequencing in this investigation yields significant findings and underscores the potential of an integrated multi-omics strategy in analyzing larger groups of individuals.This approach has the capacity to enhance the development of more accurate prognostic tools and the identification of actionable therapeutic targets for personalized treatment of lung cancer.

Conclusion
Our investigation employed a combination of single-cell and bulk RNA sequencing methodologies to acquire significant knowledge regarding the microenvironment of lung tumors and to ascertain prospective targets for therapeutic intervention.A prognostic model was developed utilizing gene signatures to assign risk scores, thereby equipping clinicians with a valuable tool to facilitate informed treatment decisions for patients.The utilization of both single-cell and bulk analysis techniques has provided insights into the functional significance of specific genes that have elevated expression in tumor epithelial cells.These genes have been found to be crucial in facilitating intercellular communication, driving malignant transformation, and influencing the overall prognosis of the tumor.Our study found that individuals who had high-risk scores experienced worse outcomes, which was associated with an increased presence of cancer-related pathways.The results of our study emphasize potential targets for immunotherapy and alternative treatment strategies that have the potential to disrupt critical processes that promote tumor growth.Our findings can aid healthcare professionals in developing more precise and individualized treatment plans.In brief, the integration of single-cell and bulk RNAseq data holds significant potential for elucidating the intricate dynamics of the tumor microenvironment and identifying actionable targets that can inform more precise treatment strategies tailored to individual patients' prognoses.

Fig. 3
Fig. 3 Integrating GBM risk score and clinicopathological factors in a prognostic model.A Multivariable Cox regression analysis of OS.B-F Time-dependent ROC analysis for predicting OS at 1, 3, 5 years

Fig. 5 Fig. 6
Fig. 5 Cell clustering and annotation.A UMAP plot of 50,000 + cells from 9 LUAD samples and 8 healthy lung samples, colored by 10 major cell types.B Heatmap of canonical markers of 10 major cell types.C Cell composition of different samples

Fig. 7
Fig. 7 PRGs significantly enriched in epithelial cells, and malignant epithelial cells in particular.A PRGs enrichment scores in 10 types of cells, calculated by five algorithms.B Comparison of PRGs scores

Fig. 8
Fig. 8 Cell-cell communication.A, B The weight of interactions between epithelial cells and other cell types.Epithelial cells connecting to other cells (A), and other cells connecting to epithelial cells

Fig. 9
Fig. 9 Correlation between PRGs and the cell composition in different LUAD samples.A Heatmap of correlation coefficient between PRGs and cell composition.B, C Scatter plot showing the correlation between PRGs and cell composition in TCGA (B) and OncoSG (C).D Comparison of the PRGs scores of different tumor purities ◂

Fig. 10
Fig. 10 Association between PRGs and epithelial cells state.A Trajectory plots showing different states of epithelial cells.B Trajectory plots showing epithelial cells and cancer cells.C Comparison of PRGs scores of epithelial cells in different states