Establishment of a Prognostic Stemness Signature For Cervical Squamous Cell Carcinoma

Background: The aim of this study was to construct a robust stemness-related gene signature for predicting prognosis of cervical squamous cell carcinoma (CSCC). Methods: Expression data for the PCBC database-derived pluripotent stem cell (PSC) samples were collected using the one-class logistic regression (OCLR) method to calculate stemness indexes (mRNAsi) of samples derived from the TCGA dataset. Functions of possible mRNAsi-related stemness genes extracted through WGCNA were then examined by enrichment analysis. Most representative stemness genes for prognosis prediction were screened to construct a stemness-related gene signature by shrinkage estimate and univariate analysis. Next, the TCGA dataset and the GSE44001 external dataset were incorporated into that model and classied to evaluate the model eciency and stability in patient prognosis prediction and classication according to the Riskscore. The associations between the Riskscore and clinical characteristics as well as relevant signaling pathways were also explored. Moreover, the prognosis predicting eciency of the stemness-related gene signature was compared with those of CSCC prognostic signatures reported in other studies. Results: According to the ndings, mRNAsi showed signicant correlation with key oncogene mutation degrees (including DMD, KMT2C, EP300 and MUC4), inltrating stroma cells and the CIMP classication for CSCC cases. The 8-stemness gene signature in this study achieved high stability and accuracy in prognosis prediction for CSCC cases. In the meanwhile, the model provided diverse therapeutic targets to precisely treat CSCC in clinical practice based on various subtype-specic stemness genes. Conclusion: Our present study suggested that the 8 stemness gene signature can help to screen out novel stem-related diagnostic indicators, therapeutic targets and prognostic predictors in CSCC.

CSCC patients at close clinical stages. As a result, hierarchical processing for the high-risk subgroups is necessary to optimize postoperative treatment and achieve individualized monitoring, making it urgently needed to discover the new prognostic markers.
Stemness is the ability to differentiate and self-renew from original cells, and this is mainly because that normal stem cells can differentiate to different cell types in the adult organisms (11). Tumor progression is related to the loss of differentiation phenotypes and the gradual acquisition of progenitors-and stem cells-like characteristics. Cancer cell distant metastasis is quite commonly seen in the non-differentiated primary tumors, which results in disease progression and poor prognosis, and this is especially because of resistance to existing treatments (12)(13)(14). Some studies indicate that, CSCC cells constituted by the highly heterogeneous cancer cells are not the same. Only a small portion of different tumor cells display the stemness features in the self-replication and maintenance of tumor maintenance. Such cells are the cell bank of CSCC, which have great in uence on cancer metastasis and relapse (15,16).
An increasing number of genomic, epigenomic, transcriptomic and proteomic signatures have been discovered to be related to cancer stemness. Such signatures show correlation with some tumorigenesisrelated signaling pathways, thus modulating transcriptional networks and maintaining cancer cell growth and proliferation. The dysregulated transcription of cancer cells frequently leads to the acquisition of stemness characteristics as well as de-differentiation of oncogenes through altering key signaling pathways that regulate the phenotypes of normal stem cells (17)(18)(19). Over the last one decade, The Cancer Genome Atlas (TCGA) and the GEO databases have displayed the landscapes of primary cancers through generating the comprehensive molecular pro les comprised of histopathological and clinical annotations, together with the genomic, epigenomic, transcriptomic and post-translational proteomic characteristics (20,21). The Progenitor Cell Biology Consortium (PCBC) database (22) covers the expression pro les of the embryonic stem cells (ESCs) and the induced pluripotent stem cells (iPSCs), which can be used for identifying the CSCC stemness genes in later analysis.
In this work, we constructed a model based on the PCBC database-derived stem cell sequencing data for estimating the stemness index (mRNAsi) values for CSCC samples obtained via TCGA and the GEO databases, examining the relationships with patient clinicopathological features and prognosis, and identifying the mRNAsi-related stemness genes extracted through WGCNA. Besides, those most representative stemness genes were extracted for constructing the model to predict the prognosis for CSCC by Lasso and stepwise regression analysis. Notably, our model will assist the clinicians in accurately predicting the prognosis for CSCC patients and selecting individualized treatments.

Methods
Database sources and pre-processing The overall ow diagram of present study was summarized in Fig 1. We acquired thelevel 3 standardized data for the RNA sequences from TCGA via the GDC Data Portal using the TCGAbiolinks functions of R package, namely, GDCdownload, GDCquery, and GDCprepare (http://www.r-project.org). A total of 304 samples were collected in this work for analysis. At the same time, GSE44001 dataset enrolling 97 CSCC samples was obtained from the GEO database (23). Moreover, the RNA sequencing pro les were acquired from the PCBC database through the PCBC Synapse Portal (https://www.synapse.org/pcbc), including 16 ESC and 77 iPSC samples.
We processed the TCGA-derived CSCC RNA-Seq data by the following steps: First of all, we retained primary solid tumor samples with available expression data. Secondly, we converted the Ensembl ID into the gene symbol, and retained protein-encoding genes. For samples having several gene symbols, we used the average expression value for further analysis. For survival analysis, we removed samples with survival time < 0 day. GEO data were preprocessed according to the following steps: Firstly, we retained samples that had complete expression data, then, converted information on the survival time of samples in the unit of day. Later, we transformed the probe to gene symbol, but eliminated probes related to several genes. The average expression value was used in the situation that having several gene symbols. Table 1 presents the clinical statistics of both TCGA and GSE44001 datasets after preprocessing.

Calculation of OCLR-derived mRNAsi
To calculate mRNAsi based on the mRNA expression data, we constructed a prediction model using the PCBC-derived pluripotent stem cell samples (ESC andiPSC) through the one-class logistic regression (OCLR) method (24). In order to make sure that the TCGA cohort was comparable, gene names in the form of Ensembl IDs were mapped to the Human Genome Organization (HUGO) (25), and genes with no mapping were removed. As for the training matrix obtained, altogether 12998 mRNA expression pro les were collected, which were obtained on the basis of every available PCBC sample. For all samples, the respective average expression values were determined and used in centralization. Finally, we used the R package gelnet (v1.2.1) OCLR function to calculate the gene weight vector based on those processed data.
For all CSCC samples, their respective expression pro les were determined by respective Spearman correlation coe cient and model gene weight vector. Then, all coe cients of Spearman correlation were transformed linearly, while the minimal coe cient of samples was eliminated, and then the value acquired was divided by maximum for calculating mRNAsi of each respective sample, yielding the range of distribution of [0,1].
TCGA data were utilized to compare diverse mRNAsi values among samples with different ages, genders, tumor grades and stages. To investigate the relationship of mRNAsi with mutations related to CSCC in the key oncogenes, the R package TCGAbiolinks(26) (V2.14.0) was utilized to detect the mutations in TCGA-CSCC, while the R package maftools (V2.0.16) (27) was adopted for plotting the mutation frequency diagram. To explore the associations between mRNAsi and molecular subtypes of CSCC samples, TCGA-CSCC samples were clustered in accordance withgene expression pro les collected by the R package TCGAbiolinks (v2.14.0). Later, R package estimate was used to calculate the immune and stromal scores of each sample, so as to further explore the correlations of immune and stromal scores with the sample miRNAsi.
The construction of mRNAsi-related gene modules through WGCNA First of all, transcripts with median absolute deviation (MAD) of >median and over 75% TPM>1 were selected based on the sample expression pro les, while hierarchical clustering was also conducted for sample clustering analysis. Then, samples with distance >80000 were taken as the outlier samples for screening. The coe cient of Pearson correlation was used to determine the inter-transcript distance, the R software package Weighted Gene Co-expression Network Analysis (WGCNA) was utilized to establish distance between any two transcripts, whereas 8 was set as the soft threshold to screen co-expression modules. Notably, the co-expression network is suggested to be consistent with scale-free network. For a node with a connectivity k, its logarithm (log(k)) must show negative correlation negatively correlated with the logarithm of speci c node occurrence probability (log(P(k)), with a correlation coe cient of >0.9. Proper β value was adopted for guaranteeing that the network was a scale-free one. Later, the expression matrix was transformed to an adjacent one, which was then additionally transformed to a topological one for gene clustering based on Topological Overlap Matrix (TOM) by using the average-linkage hierarchical clustering approach according to the mixed dynamic shear tree standard. Besides, in every gene network module, the gene number was set as 80 at least. Then, gene module was determined by the dynamic shear approach, and eigengene values were calculated for all modules successively. Afterwards, clustering analysis were performed on all modules, and adjacent modules were merged into a new one, with reset appropriate height, deepSplit and minModuleSize values. The correlations of the obtained gene modules with mRNAsi were calculated to search for the gene modules with high correlation for further research. Moreover, Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) enrichment analysis was performed by utilizing the WebGestaltR (v0.4.2) R package to examine the signaling pathways affected by the genes involved in these modules.

Survival analysis
Samples containing survival data from TCGA dataset were divided into training and test sets at a 6:4 ratio. For the sake of eliminating the in uence of random allocation bias on model stability, each sample was randomly grouped for 100 times with replacement. Univariate analysis was carried out by using the Cox proportional hazards regression model, and survival data was analyzed through the survival coxph function of R package(28). A p-value <0.05 suggested that the difference was statistically signi cant.
Construction and validation of the prognostic stemnessrelated gene model Lasso regression analysis was conducted using the R package glmnet (29) function. Besides, stepwise regression analysis was carried out by R package MASS function according to the Akaike data criteria. Finally, a risk model was established with genes selected. Thereafter, related gene expression data were selected based on TCGA and GEO datasets, respectively, which were later used in the model to calculate Riskscore values of all samples. Later, Riskscore values were transformed into the zscore, and the Riskscore of 0 was utilized as a threshold to classify samples to high (Risk-H) or low (Risk-L) risk group. In addition, we determined the accuracy, stability and e ciency of the model to classify CSCC prognosis through KM and receiving operating characteristic (ROC) analyses using TCGA test dataset as well as GSE44001 dataset.

Function analysis of Riskscore in CSCC patients
The R package GSVA(30) ssGSEA function was employed to assess the score of KEGG enrichment analysis. At the same time, we determined the relationships between Riskscore values and ssGSEA scores, and conducted clustering analysis for all samples.

Relationships between Riskscore and clinicopathological features in CSCC paitents
We used relevant Riskscore values and clinical characteristics to plot the forest plot and construct the nomogram. Moreover, the relationships between patient survival and Riskscore values as well as clinical characteristics were examined through the univariate and multivariate Cox proportional hazards regression models.

Results
Relationships between mRNAsi and clinicopathological features, molecular classi cation, immune status and gene mutations in CSCC patients First of all, mRNAsi of each TCGA-CSCC sample was calculated by OCLR method. Further analysis found that there was no signi cant difference in mRNAsi across samples with diverse ages, genders, grades and stages (Fig.S1). Then, we screened out 10 most frequent mutations and revealed that samples carrying DMD, KMT2C, EP300 and MUC4 mutations (Fig.S2, p < 0.05) showed markedly elevated mRNAsi values compared with samples having no mutation. MUC4 and KMT2C have been recognized as the promoter genes of CSCC, which modulate CSCC occurrence, development and malignant transformation, and their mutations are markedly correlated with tumor metastasis, relapse, and poor patient prognosis (31,32).
Then, we compared the difference in mRNAsi across various samples. As a result, difference in mRNAsi was not signi cant across different CSCC subtypes on the basis of gene expression pro les and iCluster subgroups (Fig.S3A, 3C). Subsequently, we compared the difference in mRNAsi among CSCC samples with different CIMP subtypes. Our results indicated that there was signi cant difference between groups (Fig.S3B). According to our results, mRNAsi was signi cantly negatively correlated with the stromal cell in ltration (StromalScore and ESTIMATEScore) in TCGA-CSCC tumor samples, but it showed weak correlation with immune cell in ltration (ImmuneScore) (Fig.S4).
In conclusion, mRNAsi showed tight correlation with gene mutations and in ltration of stroma cells, and it was a creditable molecular marker for CSCC.
Selection of possible stemness-related genes and functional enrichment analysis According to Fig.2A, some outlier samples were observed, samples that had >80000 distance were identi ed to be outlier samples in screening, and 303 samples were obtained at last (Table S1). Then, WGCNA was used to construct the weight co-expression network for guaranteeing the scale-free network (Fig.2B). Subsequently, gene modules were determined by dynamic shear approach, which were then conducted clustering analysis. Additionally, modules with close distance were further merged into the new module, having height, deepSplit and minModuleSize set to 0.25, 2 and 80, respectively. Altogether 16 modules were obtained at last (Fig.2C). It should be noted that, grey module represented gene set that was not clustered to any one of the rest modules. Based on Fig.2D, the relationships between the 16 module eigenvectors and mRNAsi as well as clinical features were then determined, separately. Clearly, the pink module showed the most signi cant positive correlation with mRNAsi, while green and cyan modules showed signi cant negative correlations, among which, the pink module contained 286 genes, the green module covered 493 modules, and the cyan module had 114 gene, yielding a total of 893 genes (Table S1). Notably, these genes were identi ed as the possible stemness-related genes.
Subsequently, we carried out functional enrichment analysis on genes in the pink (Figure 3 and Table S2), green and cyan modules (Figure 4 and Table S3). As a result, these genes were mainly enriched to the pathways that were signi cantly correlated with cervical cancer genesis and development, including ECMreceptor interaction, the cGMP-PKG signaling pathway, the PI3K-Akt signaling pathway, and Human papillomavirus infection (33)(34)(35). Prognostic risk prediction model establishment on the basis of possible stemness-related genes and accuracy validation The detailed information of Training (n=175) and test (n=116) set samples are displayed in Table S4.
Differences in the clinical factors showed no statistical signi cance between the training and test sets, with the only exception of tumor grade, indicating reasonable sample grouping ( Table 2).
Straight after univariate survival analysis, the prognosis-speci c stemness-related genes were narrowed through lasso regression to reduce the gene number for the risk model. Then, samples in the training set were divided into Risk-L and -H groups based on the Riskscore. Fig.6A shows survival distributions in Risk-L and Risk-H groups, and the 8 stemness gene expression in samples. According to the gure, samples in training set that had greater Riskscore values had signi cantly reduced OS compared with those that had lower Riskscore values, which revealed that samples that had higher Riskscore values showed more dismal prognosis. In addition, the high expression levels of PPP1R14A, TNFRSF11B, CPQ, AOC3 and DPYSL4 were associated with dismal prognosis for samples in training set, and they might serve as risk factors. Moreover, the increased KCNMA1 and RHOH expression was associated with the better prognosis for samples in training set and were thus the protective factors. Also, we carried out ROC analysis to classify the prognosis for samples in training set based on Riskscore value. The e ciency of our model in predicting the OS at 1-5 years for training set samples was examined, with the mean AUC of up to 0.87 (Fig.6B). At last, we drew the KM curve on the basis of OS for samples in Risk-L and Risk-H groups, thus comparing the difference in survival between both groups. As observed from Fig.6C, Risk-H group of the training set had remarkably shorter OS than that of Risk-L group (P<0.0001).
To validate the reliability of our model, the expression data of the above-mentioned 8 stemness-related genes of TCGA-CSCC samples in test set were collected, and the Riskscore values for all test set samples were measured. Similar to previous analysis results, samples in test set with greater Riskscore values had markedly reduced OS than those with lower Riskscore values, with the mean average AUC of up to 0.8 (Fig.S5). At last, an external data set was used to validate our model accuracy in the prediction of CSCC prognosis and similar ndings were obtained (Fig.S6).
In conclusion, our risk model constructed on the basis of sample survival and possible stemness genes using Lasso and stepwise regression analysis was e cient in predicting CSCC prognosis. Signaling pathways associated with Riskscore As shown in Fig.7A, there were 22 related KEGG pathways obtained, all of which had correlation coe cient >0.3. Next, above-mentioned 22 pathways were chosen for clustering analysis on the basis of sample enrichment scores. 21 out of those 22 pathways showed negative correlation with Riskscore, which were mainly associated with cellular metabolism and immunity (Fig.7B).
The 8-stemness gene signature was an independent factor for predicting CSCC prognosis Nomogram has been identi ed as an approach to effectively and intuitively present risk model results, and it is used conveniently to predict the prognosis for patients. To be speci c, the length of straight line in nomogram represents the in uences of different parameters together with their signi cance on prognosis. In this study, a nomogram was constructed using Riskscore, age, tumor grade and T stage. As displayed in Fig.8, Riskscore most signi cantly affected the prediction of survival rate. Furthermore, both Riskscore and patient clinical characteristics were utilized for plotting the forest plot. According to Fig.9, Riskscore had the HR of around 2.43 (P<0.001), which indicated that the stemness-related gene signature might serve as a vital risk factor in predicting CSCC prognosis. Afterward, univariate and multivariate COX regression analyses were conducted to assess the independence of 8-stemness gene signature on survival prediction. As a result, only 8-stemness gene signature showed signi cant correlation with the prognosis for CSCC patients, both upon univariate (HR=2.3410, p=8.3E-12) and multivariate (HR=2.6570, p=1.9E-12) analyses (Table 3). Thus, our 8-stemness gene signature might be used to be an independent prognostic factor for CSCC.
Comparison between our 8-stemness gene signature risk model and other reported prognostic models Three CSCC prognostic models (namely, the 8-, 5-and 3-gene signature) (36-38) reported in literature were selected, which were then compared with the 8-stemness gene signature constructed in our study. The corresponding Riskscore values of each prognostic model were calculated based on gene expression pro les for samples derived from TCGA-CSCC dataset, which were further converted to zscore values. Then, samples were divided into high-(Riskscore >0) and low-risk (Riskscore >0) groups, and OS difference were compared. Fig.10 displays the ROC and overall survival KM curves for those three models. Obviously, the 8-stemness gene signature constructed in this study had higher ROC value than other three models, with statistically signi cant difference in OS between low and high risk subgroups.

Discussion
In current study, expression data for the PCBC database-derived pluripotent stem cell samples were obtained using the OCLR method to calculate stemness indexes (mRNAsi) of samples derived from the TCGA dataset. Meanwhile, corresponding clinicopathological features, gene mutation status and survival data were retrieved. Further analysis results revealed that mRNAsi having signi cant correlation with key oncogene mutation status, stroma cells in ltrating and CIMP classi cations among CSCC cases. Additionally, functions of the 893 potential mRNAsi-related stemness genes within the gene modules were analyzed by enrichment analysis and WGCNA. Besides, we conducted shrinkage estimate and univariate analysis to screen the most representative stemness genes for prognosis to construct the 8stemness gene signature for CSCC. Then, samples obtained based on the TCGA-CSCC dataset and GSE44001 external dataset were incorporated into the model and classi ed according to the Riskscore value to evaluate the model e ciency and stability in the prediction of patient prognosis. Subsequently, we examined the associations between Riskscore and relevant clinicopathological characteristics as well as signal transduction pathways. At last, our 8-stemness gene signature was compared with three CSCC prognostic signatures reported in literature.
As a result, one (KCNMA1) in the eight genes was suggested previously to be involved in the progression of CSCC, pathogenesis and malignant transformation, indicating that our bioinformatic mining results were highly reliable and accurate. (39). However, the relationships of the remaining 7 genes with CSCC are not validated in clinical or basic study. Typically, RHOH participates in the migration of prostate cancer cells (40). PPP1R14A has been demonstrated to evaluate the prognostic risk of gastric cancer (41). DPYSL4, which is associated with mitochondrial supercomplexes, can regulate energy metabolism in tumor cells and be involved in cancer invasion and progression (42). Dysregulation of TNFRSF11B is shown to be involved in breast cancer genesis and development and correlates with poor prognosis in patients (43). Nonetheless the associations of genes with CSCC are not examined yet which should be further investigated.
Nowadays, with the rapid development of high-throughput genome-wide sequencing techniques, the bioinformatic big data for the large-scale CSCC samples can be easily obtained to explore the hidden information. Despite the superiority of our stemness-related gene signature, there are certain limitations in the present work like other bioinformatic studies. First of all, the experimental veri cation is lacking, and further experimental validation should be performed on the potential mechanisms underlying the signature genes. In addition, we have limited TCGA-CSCC samples and relevant clinical data. Therefore, a larger sample size is needed for identifying the key prognostic stemness gene functions for predicting CSCC progression.

Conclusion
To sum up, this study established a novel stemness-related gene signature which was validated as a new and signi cant prognostic marker for CSCC. Notably, our stemness-related gene signature connected the molecular features of CSCC stem cells and clinical outcomes, which will help to nd novel predictors for therapeutic effect and prognosis. Besides, the methods of our study had general applicability and provided some references value for other cancer types.    The KM curve on the basis of expression level of each 8 gene in the risk model

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.