Identification of a Six-gene Signature Predicting Overall Survival for Cervical Cancer


 Background: Although the incidence of cervical cancer has decreased in decades with the development of human papillomavirus vaccines and cancer screening, cervical cancer remains one of the leading causes of cancer-related deaths worldwide. Identify the potential biomarker for treatment and prognosis of cervical cancer is necessary.Methods: Samples with mRNA-seq, copy number variant, single single nucleotide polymorphism data and clinical follow-up information were download from TCGA database, which were randomly divided into training datasets (N=146) and test datasets (N=147). We selected and identified prognostic gene set and genomic mutated gene set and then integrated the two set of data with random survival forest algorithm and constructed a prognostic signature. External validation datasets and immunohistochemical staining were also evaluated.Results: We obtained 1,416 different expression prognostic-related genes, 624 genes with copy number amplification, 1,038 genes with copy number deletion, and 163 significantly mutated gene. A total of 75 candidate genes were obtained after overlap of the different expression genes and genomic variations. Subsequently we obtained six characteristic genes through random survival forest algorithm. The results showed that high expression of SLC19A3, FURIN, SLC22A3, DPAGT1 and low expression of CCL17, DES were associated with poor prognosis in cervical cancer patients. We constructed a six-gene signature which can separate cervical cancer samples associated with different overall survival and showed robust performance for predicting survival (Training set: p ˂ 0.001, AUC = 0.82; Testing set: p ˂ 0.01, AUC = 0.59). Conclusions: Our study identified a novel six-gene signature and nomogram to predict overall survival of cervical cancer, which may be beneficial to clinical decision making for individual treatment.


Introduction
Cervical cancer (CC) is one of the main causes of cancer death in women and the third leading malignancy among women after breast and colorectal cancers worldwide, with 569,000 new cases each year [1,2]. Cervical cancer is still a major problem for healthy women worldwide despite signi cant efforts. Walboomers JM et al demonstrated that human papillomavirus (HPV) is a necessary cause of cervical squamous cell carcinoma (CSCC) [3]. It is proved that HPV integration induces genomic instability and somatic mutation, playing an important role in the pathogenesis of CC [4]. Although the development of cervical cancer screening has led to a drastic reduction in the incidence of cervical cancer, there are still many challenges for advanced lesions. The 5-year survival rates of distant cervical squamous cell carcinoma patients were only 17% in United States [5]. Therefore, searching for potential biomarkers for treatment and prognosis of CSCC with multi-omics data is necessary. However, few studies were conducted to explore the relation between genomic factor and CSCC prognosis.
At present, biomarkers used for the prognosis of CSCC are mainly divided into two categories. One is the clinical biomarkers for CSCC survival prediction, such as squamous cell antigen (SCC). The other one is to ne prognosis signatures constructed by several prognostic genes by high-throughput sequencing data analysis. For instance, Shen et al discovered that CD28 and PTEN gene play important roles in the occurrence and development of CSCC through methylated microarray data [6]. Mao et al identi ed lncRNAs related to the prognosis of CSCC from TCGA database and developed a 15-lncRNA signature risk score to comprehensively assess the prognostic function of lncRNA [7]. Ma et al found the potential miRNA expression signature capable of predicting survival time for CSCC patients [8]. LRIG1 and LRIG3 expression in CSCC is related to HPV status and patient survival [9]. Therefore, it may probably be crucial and helpful to identify genes of prognostic value by bioinformatics analysis. High-throughput multi-omics sequencing data has laid a solid foundation for identifying genes related to cancer prognosis. Multiomics data analysis can reveal the mechanism of cancer development from multiple perspectives.
In this study, we analyzed the transcriptome data to obtain genes related to the prognosis of CSCC. And then we analyzed the copy number variation (CNV) data and mutation data of CSCC to obtain genes related to the occurrence and development of CSCC. Subsequently, we integrated the results and propose a prognostic gene signature by using the random survival forest algorithm. It is proved that the model can effectively verify the test dataset and external independent dataset, and the signature shows a strong clinical independence. Overall, the prognostic gene signature can effectively predict the prognostic risk of CC and provide a basis for a better understanding of the pathogenesis of CSCC.
Materials And Methods 1. Collection and pre-processing of publicly available data Cervical cancer RNA sequencing (RNA-seq) FPKM dataset and corresponding clinical follow-up information were downloaded from publicly available TCGA database by UCSC cancer browser (https://xenabrowser.net/datapages/), which contains 309 samples and 306 samples respectively. The copy number variation (CNV) data of the SNP 6.0 array contains a total of 297 samples. The mutation annotation le (MAF) downloaded from the GDC client contains a total of 289 samples. The GSE44001 expression pro le data and clinical follow-up information were downloaded from the GEO database (https://www.ncbi.nlm.nih.gov/geo/), which contains a total of 300 samples.
The TCGA transcriptome data were standardized and randomly divided into two groups, one group as the training datasets (n=146), and the other group as the test datasets (n=147). The GSE44001 data were used as independent validation datasets. The age, survival status, grade and tumor stage were expressed in Supplementary S1.

Univariate Cox proportional-hazards regression analysis
Univariate Cox proportional-hazards regression analysis was performed for each transcriptome gene in TCGA training datasets, and a gene with a signi cance level of p < 0.05 was selected as a prognostic gene.
3. Copy number variation (CNV) analysis GISTIC 2.0 [10] software was used to identify the genes with signi cant ampli cation or deletion. The threshold was set as ampli cation or deletion length greater than 0.1 and signi cance level of p < 0.05. 4. Gene mutation analysis MUTSIG 2.0 software was used to identify the genes in MAF with signi cant mutation with p< 0.05.

Gene signature construction
At rst, the candidate genes signi cantly related to survival were screened for signi cant ampli cation, deletion, and mutation. Then we used random survival forest algorithm [11] to rank the importance of the genes. Monte Carlo iterations are 100, and the number of advances is 5 (nrep = 100, nstep = 5), Genes with relative importance greater than 0.5 were identi ed as characteristic genes. Finally, we performed a multivariate Cox regression analysis to build the risk score model: where N is the number of prognostic genes, is the expression value of prognostic genes, and is the estimated regression coe cient of genes in the multivariate Cox regression analysis.

Functional enrichment analysis
Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis was performed by using the R package clusterpro ler [12] for genes, to identify over-represented GO terms in biological processes categories and KEGG pathway. For this analysis, a FDR < 0.05 was considered to denote statistical signi cance.

Immunohistochemical staining (IHC)
We collected a total of 160 human cervical tissue samples, 130 of which had accompanying follow-up information, and 30 cancer-adjacent cervical tissue samples from archives of para n-embedded tissues between January 2010 and January 2014 at the Department of Pathology of Peking Union Medical College Hospital. The follow-up was performed until December, 2018. The pathological diagnoses were recon rmed by a pathologist. The project was approved by the Ethical Committee (Peking Union Medical College Hospital), and informed consent was acquired from patients or family members. IHC was performed as previously described [13]. Antibodies against the following were used: CCL17 1:200; DES 1:100; DPAGT1 1:200; FURIN 1:200, SLC19A3 1:500, SLC22A3 1:500. The scoring details have been described previously [14]. Positive staining was de ned as follows: ve random microscope elds were selected to evaluate semi-quanti cation staining. The intensity of immunostaining was graded as follows: 1+, weak; 2+, moderate; 3+, strong or 4+, very strong. The area of positive cancer cells in each microscopic eld was categorized as follows: 1+, 0 to 25%; 2+, 25 to 50%; 3+, 50 to 75% or 4+, 75 to 100%. The sum between 5 and 80 was obtained by multiplying the two scores by 5. A sum from 0 to 42 was assigned as "low expression" and that from 43 to 80 as "high expression". Two pathologists independently reviewed the slides using their individual criterion, and the consensus interpretations were used as the nal interpretations. Discrepant interpretations were adjudicated by a third pathologist.

Results
1. Selection and identi cation of prognostic gene set and genomic mutated gene set For TCGA training datasets, a total of 1,412 genes were selected as prognostic candidate genes basing on univariate Cox regression analysis (p < 0.05). The top 20 most signi cant prognostic genes with lowest p value were expressed in Table 1. (More details in Supplementary S2) Meanwhile, genes with signi cant ampli cation or deletion were identi ed by GISTIC 2.0. Figure 1A shows the signi cantly ampli ed segments in cervical cancer genome. And the signi cantly ampli ed genes on each segment are recorded in Supplementary S3. For instance, ERBB2 was signi cantly ampli ed on the 17q12 segment (q value = 5.62E-12), MYC was signi cantly ampli ed on the 8q24.21 segment (q value = 7.70E-09), and EGFR was signi cantly ampli ed on the 7p11.2 segment (q value = 0.01188). In total, 624 genes were identi ed with signi cant ampli cation. Figure 1B shows the signi cantly deleted segments in cervical cancer genome. And the signi cantly deleted genes on each segment are recorded in Supplementary S4. For instance, CD3D was signi cantly deleted in 11q24.2 segment (q value = 9.11E-26), SMAD4 was signi cantly deleted in 18q21.2 segment (q value = 0.0010189), and PTEN was signi cantly deleted in 10q23.31 (q value = 0.0077083). A total of 1,038 genes were identi ed with signi cant deletion.
Subsequently, we used MUTSIG 2.0 to identi ed genes with signi cant mutation with TCGA MAF data. P 0,05 was set as the threshold. A total of 163 genes were identi ed with signi cantly mutation frequency.
In Figure 1C, the top 50 genes with the most frequent mutations and the distribution of different mutation forms including synonymous mutations, missense mutations, frame insertions or deletions, frame movements, nonsense mutations, splice sites and other non-synonymous mutations in training datasets were shown. The upper histogram represents the total number of synonymous and nonsynonymous mutations in each patient's 50 genes, and the right histogram represents the mutation proportion in all samples of 50 genes. In these genes, some of which were reported to be closely related to the occurrence and development of cancer in previous studies can be identi ed, such as FBXW7, PIK3CA, PTEN, KRAS, TP53, etc.
Furthermore, we analyzed the pathways and biological processes involved in copy number variant genes and mutant genes to investigate the potential mechanism. After the integration of copy number variant genes and mutant genes, a total of 1,643 genes were obtained for functional enrichment analysis. As shown in Figure 1D, 1,643 genes were signi cantly enriched in human T−cell leukemia virus 1 infection, Apoptosis, FoxO signaling pathway, ErbB signaling pathway, Central carbon metabolism in cancer, endometrial cancer, etc. In Figure 1E, 1,643 genes were signi cantly in developmental process, protein metabolic process, epidermal cell differentiation.

Prognostic signature construction for cervical cancer patients
In this study, we aimed to obtain a prognostic signature for cervical cancer prediction. Therefore, we selected 75 genes with signi cant copy number variant and mutation in prognostic candidate gene set for stepwise importance rank by random survival forest algorithm. Finally, we identi ed 6 genes with relative importance greater than 0.5 as the nal prognostic signature. Hazard risk (HR), z-score, p value, importance and relative importance of 6 genes are shown in Table 2. Figure 1F shows the relationship between prediction error rates and the number of separate trees, and Figure 1G expresses the out-of-bag importance rank of the top 6 genes. Multivariant Cox regression analysis was conducted to develop a 6gene signature and the risk model was as follow Then the 6-gene signature risk score for each sample was calculated. According to the median value of all the sample, patients were divided into a high-risk group and a low risk group (cutoff=0.02513276). The usage of this 6-gene signature model in TCGA training datasets is shown in Figure 2. Figure 2A shows that 73 patients were divided into the high-risk group and the low-risk group respectively. Figure 2B demonstrates the receiver operating characteristic (ROC) curve, and the Area Under Curve (AUC) of 60months is 0.82. Figure 2C shows that as the risk score increases, the survival period decreases signi cantly, and the majority of high-risk group are death samples. We also analyzed the expression level of 6 genes with the increase of rick score. And it is veri ed that patients in high risk group tend to have higher expression levels of SLC19A3, FURIN, SLC22A3 and DPAGT1, as well as lower expression of CCL17 and DES (Desmin).

Detection of the robustness of gene signature
To verify the accuracy of the 6-gene signature, the test datasets were researched with the same model and cut-off value as training datasets. Kaplan-Meier survival curve, ROC curve and AUC of 6-gene signature, the relation between risk score, survival period and gene expression level are shown in Figure 3. Figure 3A demonstrated the signi cant prognostic difference between high-risk group and low-risk group (log-rank p= 0.009019433). Figure 3B, C shows the consistent conclusion with training datasets that SLC19A3, FURIN, SLC22A3 and DPAGT1 might be risk factors while CCL17 and DES might be protective factors.
Furthermore, we validated the 6-gene signature in external independent datasets downloaded from GEO to con rm the stability with different platforms. The same formula and cut-off value was conducted. In the consistence with our previous ndings, Figure 4 demonstrated the Kaplan-Meier curves of GSE44001 datasets, which further suggested a signi cant prolonged survival time in the low-risk patients compared to that in the high-risk patients ( Figure 4A). The AUC of 60-months is 0.59, and the relation between gene expression level and risk score is also consistent with TCGA datasets (Figure 4B, C). Therefore, the 6-gene signature has robust prognostic capabilities.
2. Independence evaluation of the prognostic signature To con rm the independence of the 6-gene signature in clinical application, univariate and multivariate Cox regression analysis were conducted with TCGA training datasets, TCGA test datasets and GSE44001 datasets respectively. Variates was analyzed systemically in

Discussion
There were an estimated 569,800 new cervical cancer cases and 311,400 deaths worldwide in 2018 [15]. Nearly 90% of cervical cancer deaths occurred in developing parts of the world. In several Western countries, where screening programs have long been established, cervical cancer rates have decreased by as much as 65% over the past 40 years. Whereas in contrast to favorable overall trends, cervical cancer rates are reported to be rising in Uganda and in some countries of Eastern Europe (Estonia, Lithuania, and Bulgaria) [16]. Although increasingly preventable through the use of vaccines, and curable through early cytological detection, women with advanced or recurrent disease face a dismal prognosis with potentially considerable morbidity and mortality [17]. The treatment and prognostic prediction of CC patients remains a big challenge. Heterogeneity is evident in the results for overall and progression-free survival and local recurrence [18]. If the tumor behavior can be reliably predicted at the initial diagnosis, the prognosis of CC would probably be optimally improved. Therefore, it is critical to investigate the molecular mechanism of CC and identify novel biomarkers.
At present, some gene signatures have been used in clinics, such as multigene signature panels (MSPs) to evaluate breast cancer recurrence [19], which reveals that it is a reliable molecular identi cation method to screen gene prognostic biomarkers by gene expression pro ling. Lin et al constructed a prognostic model for CC patients undergoing radiotherapy and chemotherapy with stage IB-IV based on magnetic resonance imaging (MRI), whole-tumor apparent diffusion coe cient (ADC), human papillomavirus (HPV) genotyping [20]. Mao et al developed a 26-IncRNAs signature to predict the overall survival of CSCC [21]. In our study, we downloaded RNA-seq data, CNV data and MAF data of CC from TCGA database and divide them into training datasets (N=146) and test datasets (N=147). By analyzing the CNV data and mutation data of the training set, potential genes related to the occurrence and development of cervical cancer were obtained, meanwhile the transcriptome data was used for analysis to screen for biomarkers related to the prognosis of cervical cancer, and subsequently two sets of data were integrated. As a result, the prognostic signature model was determined through random survival forest. The test datasets and external datasets (GSE44001) further veri ed that 6-gene model was robust. The 6-gene signature has a higher AUC and fewer genes, which might be probably more conducive to clinical application.
In our 6-gene signature, DPAGT1, FURIN, SLC19A3 and SLC22A3 are veri ed to be risk factors, while CCL17 and DES are veri ed to be protective factors. Some previous researches have revealed the association between the 6 genes we identi ed with some other cancer pathogenesis. It was reported that CCL17 played a critical role in preventing tumorigenesis for its ability to engage CCR4+CD8+ T cells [22][23][24]. Okada et al. used RGD-ber-mutant adenoviral vector transfected into the tumors of melanomabearing mice, and found that recombinant AdRGD-CCL17 induced CD8+CTL cell aggregation to tumor tissue, which effectively inhibiting tumor growth [25]. In addition, the Naoko team also used tumorbearing animal experiments to con rm that injecting recombinant CCL17-adenovirus into the tumor of mice with colorectal cancer will induce in ltration of activated CD8+CTL and exert signi cant anti-tumor immune function [26]. Zhan et al. also found that the CCL17 loaded drug-eluting stents were inoculated into pancreatic cancer-bearing mice, and the locally released CCL17 recruited CCR4+CD8+ T cells to pancreatic tumor tissues, thereby inhibiting the occurrence and metastasis of pancreatic tumors [27].
Interestingly, many studies also showed that CCL17 can also stimulate cancer cell proliferation and migration [28,29] [30,31]. However, in present study, the expression of CLL17 was evaluated in the cervical cancer tissue and showed lower expression was associated with poor prognosis of cervical caner. DES encodes a muscle-speci c class III intermediate lament which plays a crucial role in maintaining the structure of sarcomeres, inter-connecting the Z-disks and forming the myo brils [32].
Study revealed that the expression of DES was signi cantly downregulated in gall bladder carcinogenesis according to correlated promoter hypermethylation, which indicated its potential as a candidate biomarker for gall bladder carcinogenesis [33]. DES was demonstrated that it can inhibit telomerase in prostate cancer cells [34]. Moreover, previous studies also reported that DES is a prognostic predictor and therapeutic target for colorectal cancer [35,36]. However, there was rare DES study in cervical cancer. At present study, we also demonstrated that DES was signi cantly down-regulated in cervical cancer which associated with poor prognosis. DPAGT1 is an upstream regulator of E-cadherin N-glycosylation status and adheres junction's composition, and dysregulation of DPAGT1 may causes disturbances in intercellular adhesion in oral cancer [37,38]. In 2019, the data also demonstrated that overexpression of DPAGT1 was signi cantly associated with a poor chemotherapeutic response in hepatitis B viruspositive hepatocellular carcinoma patients [39]. FURIN encodes a member of the subtilisin-like proprotein convertase family, which includes proteases that process protein and peptide precursors tra cking through regulated or constitutive branches of the secretory pathway. Studies revealed that FURIN is associated with the prognosis of pancreatic cancer We integrated multi-omics data including transcriptome, CNV data and mutation data, eventually constructed a 6-gene signature for CC prognosis. The signature was veri ed to be robust and stable in different databases. Furthermore, the signature was shown to be optimally independent even with in uence of many clinical factors. But it is not completely reliable since we only draw the conclusion with bioinformatics analysis alone. Therefore, the further experimental veri cation researches would be necessary.

Conclusion
In present study, we constructed a 6-gene signature for cervical cancer prognosis based on multi-omics data. The predictive value of the signature was further tested in training datasets and external independent validation datasets by applying Kaplan-Meier survival curves. And the signature shows strong robustness and clinical independence. Therefore, the 6-gene signature can be used as a prognosis assessment for cervical cancer patients, and it may provide some guidance for clinical treatment. The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.