Stem Cell-Related Prognostic Signature for Lung Adenocarcinoma

Background: Adenocarcinoma is characterized by high inltration and negative growth, which is easy to invade the walls of blood vessels and lymphatic vessels. The function of the stem cell population is to control and maintain cell regeneration. Therefore, it is necessary to study the prognostic value of stem cells in LUAD. Methods: Stem cells signature construction relies on the univariate, least absolute shrinkage operator (LASSO) and multivariate Cox regression analysis. Risk plot, Kaplan-Meier analysis and the area under ROC (AUC) are veried the accuracy of the signature in GEO (GSE20319 and GSE42127) and TCGA cohort respectively. What's more, to further improve persuasiveness, we conducted nomogram, drug ecacy analysis, immune inltration analysis, and compared the relationship between mRNA stem cell index (mRNAsi) and signature. Result: The patients were divided into two groups via the cutoff of risk score; it can be seen that the low-risk group has better survival. The robust signature that contains ten stem cells is independent prognostic factors for LUAD (C6orf62, DNER, NELL2, LATS2, LGR5, PTPRO, LRIG1, PABPC1, NT5E and SET). The nomogram showed that 1-year (0.805), 3-year (0.773), and 5-year (0.765) survival rates of the signature we constructed were all greater than 0.7, indicating that our signature was very feasible. Conclusion: The signature can be used as a reliable and convenient tool for lung adenocarcinoma.


Background
Lung adenocarcinoma refers to a malignant tumor originating from lung epithelial tissue, which is a type of non-small cell lung cancer. In recent years, the incidence rate has gradually increased. In addition, due to the limitations of diagnosis and treatment, the mortality rate of LUAD ranks rst in malignant tumors [1]. Tumor stem cells refer to cells that have self-renewal ability and can produce heterogeneous tumor cells, which play a signi cant role in tumor survival, proliferation, metastasis and recurrence [2,3]. The ability of tumor stem cells to move and migrate makes tumor cells migration possible, at the same time, cancer stem cells can stay dormant for a long time and have a variety of drug-resistant molecules, but are not sensitive to external physical and chemical factors that kill tumor cells, which leads to the result that tumors often relapse after conventional cancer treatment eliminates most common tumor cells [4,5].
The development of LUAD treatment plan and survival period are affected by many factors, but the TNM stage of tissue cells may be one of signi cant factor in determining treatment plan and estimating prognosis. TNM stage is based on anatomy and is a description of the cumulative range of tumors.
However, it should be emphasized that the TNM stage also has shortcomings including the uneven source of case data and the relatively complicated stage of N. With the gradual development of diagnosis and treatment technology, we found that molecular markers have a greater prognosis for patients.
Studying the genetic functions and pathways of LUAD could contribute to establish prognostic markers and therapeutic targets, which could accurately and comprehensively predict the prognosis of LUAD [6]. Therefore, the ideal of using cancer stem cell therapy provides a new direction for the diagnosis and treatment of LUAD and the regulatory mechanism of stem cells at the molecule level requires further digging.
In this research, we constructed a stem cell signature of 10 genes as a prognostic target for lung adenocarcinoma. Meanwhile, we analyzed the types of immune cells in LUAD to understand the connection between stem cells and the immune microenvironment.

Data Acquisition and Selection
The RNA-sequencing and clinical traits information of LUAD were obtained from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov) and Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.gov/geo/) that were served as training cohort and validation cohort, respectively. After classi cation and regularization, there were 54 normal samples and 497 tumor samples. At the same time, when merging clinical information, missing and incomplete samples were deleted. Besides, 166 tumor stem cells were downloaded from the cancerSEA database [7] to prepare for further signature construction. GSE30219 [8] was conducted by GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array).

Signature Construction and Veri cation
It was worth emphasizing that the R package was an indispensable key tool for us to construct and verify a signature next. The signature was established by a two-step method, the rst step was LASSO COX regression, and the second step was multivariate COX regression. Patients were divided into low-risk and high-risk groups based on the cutoff of risk score, which was calculated by formula as follows: HR 1 × gene 1 expression + HR 2 × gene 2 expression … + HR n × gene n expression [10]. In the TCGA and GEO cohorts, the risk curve was drawn to describe further the relationship between the patients' risk value and survival states and protein expression, the Kaplan-Meier curve and ROC curve were used to verify the reliability of the signature [11]. What's more, to further clarify the relationship and comparison between clinical factors and signature, subgroup analysis and nomogram analysis were conducted.

Gene Set Enrichment Analysis
GSEA is a method used to evaluate the distribution trend of genes in the gene list sorted by phenotype correlation and to understand gene positioning, function and biological signi cance. The molecular tag database is constructed, which contains multiple functional gene sets. By analyzing the gene expression data, it is obtained whether the expression is signi cantly enriched in a certain function. We presented the GO term and the KEGG pathway of the stem cell signature to analyze further it's possible biological functions [12]. The number of permutations was set to 1000, and our selection criteria are closely related to a nominal P-value (P < 0.05).

Therapeutic E cacy and Signature
Some patients from TCGA recorded the results of the evaluation of the e cacy after the rst treatment of the radiotherapy and chemotherapy, which also provided a direction for us to verify the reliability of the signature in term of e cacy. According to Response Evaluation Criteria in Solid Tumors (RECIST) and risk score, this part of patients was classi ed to compare whether there were differences between different therapeutic effect [13].

mRNAsi and Signature
In recent years, literature has proposed the concept of stemness indices (mRNAsi), which was calculated by a predictive model with an OCLR algorithm based on pluripotent stem cell samples from the Progenitor Cell Biology Consortium dataset (https://bioinformaticfmrp.github.io/PanCanStem_Web/). Speci cally, the Spearman correlation algorithm (RNA expression data) contributed to the stem index model to score LUAD samples in the TCGA dataset. The stem indices were then mapped to the [0, 1] range by using a linear transformation that subtracted the minimum and divide by the maximum. The index is closer to 1, which indicated that the cell differentiation was worse, and the characteristic of stem cells was stronger. We merged mRNAsi into our signature to compare whether there was a difference between low-and high-risk groups [14].

Immune In ltration Analysis
TIMER database, providing six types of immune cell in ltration and using RNA-Seq expression pro ling data to detect immune cell in ltration in tumor tissue, was used to appraise potential links between risk grouping and tumor-in ltrating immune cells (TIICs). Deconvolution is a newly released statistical method that allows TIMER to infer the incidence of TICC from gene expression pro les. CIBERSORT (http://cistrome.shinyappes), a deconvolution algorithm, can estimate the cell composition of complex tissues based on standardized gene expression data, and the method can be used to energize speci c cell types. With CIBERSORT, we can visualize the composition of immune cells in tumor samples of LUAD, and standard annotation les established gene expression datasets. P-value (P < 0.05) was a signi cant criterion to determine the type of immune cells affected by grouping [15].

Construction of Signature
All cancer stem cells were downloaded from CancerSEA (http://biocc.hrbmu.edu.cn/CancerSEA/home.jsp), which involved 14 functional states of 41900 single cancer cells from 25 cancer types. Twenty-eight CSCs associated with OS (P < 0.05) was measured as predictive CSCs for LASSO analysis (Fig. 1A-B). Through multivariate COX regression, we select ten CSCs to construct a robust signature for LUAD ( Table 1). The calculation formula of the risk score is as follows: risk score = 0.578 × expression C6orf62 + 1.24 × expression DNER + 0.737 × expression NELL2 + 1.404 × expression LATS2 + 1.202 × expression LGR5 + 0.676 × expression PTPRO + 0.718 × expression LRIG1 + 1.306 × expression PABPC1 + 1.126 × expression NT5E + 1.458 × expression SET. According to the cutoff of risk scores, patients in TCGA were divided into low-risk group and high-risk group [16]. The area under the ROC curve for 1, 3, 5-year were 0.771, 0.734, 0.687 (Fig. 1C). The survival analysis suggested that the overall survival rate of the low-risk group was higher than that of the high-risk group (P < 0.001). year survival rate of the low-risk group was close to 50%, while the 5-year survival rate of the high-risk group was only 20% (Fig. 1D). The box plot displayed the stem cell contained in the signature in the normal tissues and tumor tissues. (Fig. 1E). The risk curve can clearly show the relationship between survival status, survival time and expression of CSCs and risk score (Fig. 1F) [17].

Validation of the Signature in TCGA
The univariate Cox regression showed factors related to prognosis like a stage, T, M, N and risk score (P < 0.05), while multivariate Cox regression showed that only stage and risk score were signi cant independent risk factor of LUAD. Compared with other clinical factors, the area under the ROC curve of the signature in each period was the largest, which implied that compared with other clinical factors, the predictive ability of the gene signature we constructed was optimal ( Fig. 2A-B). The area under the ROC curve (AUC). The area under the ROC curve for 1-year, 3-year and 5-year OS were 0.771, 0.734 and 0.687, which implied that our signature had excellent predictive power (Fig. 2C-E) [18,19]. We conducted a hierarchical analysis to clarify the link between subgroups and risk grouping. A further conclusion was drawn that all subgroups except N3 could identify high-risk and low-risk groups. In N3, there were only two patients, both of which belonged to the high-risk group. And in most subgroups, high-and low-risk groups had signi cant differences, such as age < = 65, age > 65, female, male, stage III, T2, T3, N0, N2 and M0 (P < 0.05) (Fig. 3). What's more, the relationship between genes in the signature and various clinical factors was also clearly revealed through clinical correlation analysis. P-value < 0.05 was our criterion to judge whether it was meaningful (Fig. 4) [20]. We constructed nomogram to predict the score of the signature and clinical factors, the correction curve showed the predicted value, and the 45-degree line represented the actual survival result. The 45-degree lines of the calibration chart were close to 45 degrees in 1-, 3-, 5-year survival probability. The 1-year AUC was 0.805; the 3-year AUC was 0.773, the 5year AUC was 0.765, which indicated that our signature was very reliable (Fig. 5A) [21].

Gene Set Enrichment Analysis
The biological characteristics of the signature were con rmed by the analysis of GO term and KEGG pathway. In GO term annotation, ve categories were positively associated with the low-risk group, which were hexose catabolic process, NADH metabolic process, monosaccharide catabolic process, ATP generation from ADP and NAD metabolic process. At the same time, ve categories were negatively related to the low-risk group, which were negative regulation of adaptive immune response, regulation of tumor necrosis factor biosynthetic process, bile acid metabolic process, positive regulation of tyrosine phosphorylation of STAT5 protein and regulation of type 2 immune response. In the KEGG pathway, there were ve pathways were positively associated with the low-risk group, such as ECM receptor interaction, focal adhesion, glycosphingogolipid biosynthesis latco and neolatco series, pentose phosphate pathway and P53 signaling pathway. While there were ve pathways were negatively related to the low-risk group, like JAK start signaling pathway, primary immunode ciency, VEGF signaling pathway, B cell receptor signaling pathway and T cell receptor signaling pathway (Fig. 5B) [22].

Validation of the Signature in GEO
In order to further verify the feasibility of the gene signature, we veri ed through the GEO database. In GSE20319 and GSE42127, the relationship between survival status, survival time and the expression of the CSCs and risk score were consistent with the conclusion in TCGA. The survival analysis in GSE30219 and GSE42127 revealed that the overall survival rate of the low-risk group was signi cantly better than that of the high-risk group (P < 0.05). Besides, in GSE30219, the area under the ROC curve was 0.826, 0.638 and 0.599 in 1-year, 3-year and 5-year survival rates, respectively. One of the more worthwhile was that in GSE42127, the area under the ROC curve was 0.788, 0.657 and 0.582. This series of external veri cation fully demonstrated the feasibility and accuracy of our signature (Fig. 6) [23].

Therapeutic E cacy Analysis
In the TCGA database, 126 patients recorded the results of the rst treatment after radiotherapy and chemotherapy. At the same time, we tracked the evaluation of e cacy, 111 cases were complete response (CR), only one case was the partial response (PR), eight cases were stable disease (SD), and seven cases were progressive disease (PD). The three CSCs in the signature had signi cant differences in the e cacy of the different drug (P < 0.05). And it was worth noting that our signature had a certain meaning. P-value was 0.052, very close to 0.05 (Fig. 7A) [24,25].

Relationship between Signature and mRNAsi
There were already clear articles that calculated the mRNAsi of 1174 genes. We matched the known mRNAsi with the samples and divided our patients into two groups by the median value of mRNAsi (high-mRNAsi group and low-mRNAsi group) [26]. It was found that mRNAsi could not effectively distinguish high-and low-mRNAsi in LUAD and the area under the ROC curve still had a certain gap compared with our signature. However, the mRNAsi of the high-risk group in our signature was also signi cantly higher than that of the low-risk group. This also veri ed that our signature was related to stem cells (Fig. 7B) [27,28].

Tumor mutational burden and Immune In ltration Analysis
According to the calculated mutation burden value, we found that it had signi cant differences in the lowrisk and high-risk groups (Fig. 8A). TIMER database, which provides six types of immune cell in ltration, uses RNA-Seq expression pro ling data to detect immune cell in ltration in tumor tissue. The signature showed a negative correlation with the levels of B cells, CD4 T cells, CD8 T cells, Dendritic cells, Macrophages and neutrophil cells (P < 0.05) (Fig. 8B). These situations revealed that our signature was indeed related to immune cells. In addition, we characterize the cellular composition of the tumorin ltrating immune cells through CIBERSORT method. Compared with the high-risk group, CD8 T cells, monocytes, resting dendritic cells and resting mast cells had higher expressions (P < 0.05), while M0 macrophage had lower expression (P < 0.001) (Fig. 8C). CD4 memory activated T cells and CD8 T cells had the highest positive correlation (r2 = 0.53), which implied that there was a mutual effect between them. While plasma and M2 Macrophages had the highest negative correlation (r2 = -0.37) that suggested they were antagonistic to each other (Fig. 8D) [29].

Discussion
Despite the dramatic progress in diagnosis and treatment, the prognosis of advanced lung adenocarcinoma is still unsatisfactory. With the development of clinical management of lung cancer, some prognostic factors are well characterized, such as age, grade and TNM grade. Cancer stem cells refer to cells that have self-renewal capacity and can produce heterogeneous tumor cells, which play a signi cant role in tumor survival, proliferation, metastasis and recurrence. Therefore, the use of CSCs to establish a prognostic model is conducive to the prediction and precise treatment of LUAD.
We established a signature containing ten genes(C6orf62, DNER, NELL2, LATS2, LGR5, PTPRO, LRIG1, PABPC1, NT5E, SET)and divided patients into high and low-risk groups based on the median risk value. The research on the mechanism of stem cells have been pervasive, but there is no experiment to build these ten stem cells into a signature.
Cancer stem cells or tumor initiating cells are considered to be the main drivers of disease progression and treatment resistance across various cancer types. DNER is a neuron-speci c transmembrane protein with extracellular EGF-like repeat sequences, which promotes the metastasis and proliferation of cancer cells by activating Girdin/PI3K/ATK signal transduction and progression of prostate cancer and the growth of PC-3 cells by regulating the main genes of cancer stem cells [30][31][32]. NELL2s is a rich glycoprotein that contains EGF-like domains in nerve tissues, interact with protein kinase C and has multiple physiological functions. Hypermethylation of promoter silences NELL2 and affects the progression of renal cell carcinoma [33][34][35]. LATS2, as a potential tumor suppressor, is a signi cant mediator of the apoptosis response pathway. LATS2-Wnt / β-catenin / DRP1 / mitochondrial division is identi ed as a signaling pathway that promotes cancer cells death [36,37].
LGR5 is a promising marker of intestinal stem cells and cancer stem cells. Intestinal stem cell marker LGR5 is a receptor for Rspongin, and its role is to enhance Wnt signaling in hyperplastic crypts. Wnt pathway plays a signi cant key in ISC self-renewal by inducing RSPO receptor LGR5 expression. An abnormal increase in LGR5 expression may represent one of the most common molecular changes in some human cancers, resulting in long-term enhancement of canonical Wnt / β-catenin signaling [38][39][40]. PTPRO is a tumor suppressor and is abnormally expressed in various malignant tumors. PTPRO causes ulcerative colitis through TLR4 / NF-KB signaling pathway and plays a role in liver brosis by affecting PDGF signaling in HSC activation. It is noteworthy that PTPRO is a new candidate gene for emphysema with severe obstruction [41,42]. LRIG1, a transmembrane protein, has a tumor suppressive effect, and its expression is downregulated in a variety of cancers. It can antagonize epidermal growth factor receptor signaling in epithelial tissues and inhibit cell invasion, migration, VM (angiogenesis simulation) by regulating EGFR / ERK-mediated EMT (epithelial-mesenchymal transition) [43,44]. PABPC1 can combine with adenylate-rich sequences in mRNA under the action of high a nity, which plays an important role in post-transcriptional regulation of genes and is also involved in many metabolic pathways of mRNA, including adenylate polymerization/adenylation, mRNA transport, mRNA translation, microRNA degradation related regulation [45]. NT5E is a ubiquitously expressed glycosylphatidylinositol-xed glycoprotein, which can convert extracellular adenosine 5'-monophosphate to adenosine, and promote tumor development by inhibiting the anti-tumor immune response and promoting angiogenesis [46,47].
GSEA proves that the constructed signature does involve related cancer pathway. P53 is a tumor suppressor protein that regulates the expression of various genes, including apoptosis, growth inhibition, differentiation, inhibition of cell cycle progression and accelerated DNA repair, genotoxicity and senescence after cellular stress. Like all other tumor suppressors, the P53 gene normally slows or monitors cell division. The JAK / STAT signaling pathway is involved in numerous signi cant biological processes such as cell differentiation, proliferation, migration, apoptosis, survival, and immune regulation. In addition, the JAK / STAT signaling pathway also participates in drug treatment of anemia, thrombocytopenia, neutropenia, and antiviral.
With immune in ltration analysis, we found that the signature regulates the immunity of lung adenocarcinoma through CD4 T cell, which can interfere the immune response of the immune system to the tumor, participate in the immune escape of the tumor, induce the immune tolerance of the tumor, and promote the occurrence and development of the tumor.

Conclusion
In conclusion, compared with TMN stage, our robust signature of ten stem cells is a very reliable and accurate prognostic target for LUAD. Further research should be devoted to the functional analysis of our research results and veri cation in clinical trials.

Consent for publication
Not approval.

Available of data and materials
All data were from TCGA and GEO, which are publicly available.   Survival curve analysis. Kaplan-Meier survival illustrates the overall survival of subgroups, which was strati ed by age ≤ 65, age > 65, gender and TNM stage. Subgroup analysis. Box plot shows the relationship between stem cells in the signature and each clinical subgroup.