3.1 TCGA-USC patient characteristics
A dataset of 110 USC samples and 35 adjacent normal uterus tissue samples was downloaded from TCGA. The training and testing sets consisted of data from 56 and 74 USC cases, respectively. The clinicopathological features did not differ significantly among the 2 groups and the whole dataset (P > 0.05). These features are summarized in Table 1.
3.2 Identification of dysregulated genes in USC and functional enrichment analysis
To ensure that our analysis compared comparable numbers of USC and non-USC cases, we downloaded a dataset of normal uterus tissue samples from GTEx (n = 78), which, along with the 35 in the TCGA dataset, brought the total number of normal uterine cases to 113. Using the limma package in R with cutoff thresholds of |log2FC| > 2 and FDR < 0.01, we identified 1385 genes as dysregulated in USC tissue vs the normal controls (Figure 1A). Functional enrichment analysis revealed that the dysregulated genes were significantly associated with 717 GO terms and 21 KEGG pathways. The most significantly enriched GO terms were extracellular matrix, mitosis, and cell adhesion, processes that might promote cancer progression (Figure 1B). The most significantly enriched pathways were involved in cell adhesion, cell cycle, PI3K-Akt signaling, microRNAs in cancer, transcriptional misregulation, melanoma, and bladder cancer (Figure 1C).
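The thresholding step described above can be illustrated with a minimal sketch. It assumes that the FDR is a Benjamini-Hochberg adjusted P-value and that each gene already has a log2 fold change and a raw P-value from the differential expression model; the function names and gene labels are hypothetical and are not taken from the analysis code.

```python
from typing import Dict, List, Tuple


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the raw p-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_dysregulated(
    stats: Dict[str, Tuple[float, float]],
    lfc_cutoff: float = 2.0,   # |log2FC| > 2, as stated in the text
    fdr_cutoff: float = 0.01,  # FDR < 0.01, as stated in the text
) -> List[str]:
    """Keep genes passing both cutoffs; stats maps gene -> (log2FC, raw p)."""
    genes = list(stats)
    fdr = benjamini_hochberg([stats[g][1] for g in genes])
    return [
        g for g, q in zip(genes, fdr)
        if abs(stats[g][0]) > lfc_cutoff and q < fdr_cutoff
    ]


# Hypothetical toy input, for illustration only.
toy = {
    "GENE_A": (3.1, 1e-6),
    "GENE_B": (-2.5, 2e-5),
    "GENE_C": (1.2, 1e-7),   # significant, but fold change too small
    "GENE_D": (2.8, 0.2),    # large fold change, but not significant
}
print(select_dysregulated(toy))
```

Both up-regulated and down-regulated genes pass the filter because the cutoff is applied to the absolute fold change.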
3.3 Prognostic signature construction and evaluation in the training set
To identify dysregulated genes that may be associated with OS, we performed univariable Cox regression analysis and uncovered 29 genes that significantly correlated with OS (Table S1). To narrow these down to the most important prognostic genes, we used LASSO regression analysis, which revealed 5 dysregulated genes as potential critical indicators of USC survival (Figure 2A). Next, multivariable Cox regression analysis narrowed these down to a signature of 4 genes, KRT23, CXCL1, SOX9 and ABCA10 (Figure 2B), that effectively predicted OS (Table 2). Among these, KRT23, CXCL1, and SOX9 exhibited positive regression coefficients, indicating that higher expression is associated with a higher risk of mortality, whereas ABCA10 showed a negative regression coefficient, implying that higher expression is associated with a lower mortality risk. Next, we constructed the following risk prediction formula based on the 4 prognostic genes and used it to calculate a risk score for each of the 56 USC cases in the training set: risk score = (0.5424 × expression level of KRT23) + (0.2398 × expression level of CXCL1) + (0.5398 × expression level of SOX9) - (1.7023 × expression level of ABCA10). The cases were then ranked by risk score and assigned to the high-risk or low-risk group based on whether their score was higher or lower than the median risk score (Figure 2C). The relationship between risk scores and survival time is shown in Figure 2D. Visualization of the expression of the 4 genes in a heatmap revealed that the expression level of the 3 high-risk genes increased with rising risk scores, while the low-risk gene showed the opposite trend (Figure 2E). Kaplan-Meier analysis revealed that patients in the high-risk group experienced worse outcomes relative to the low-risk group (P = 0.003317, Figure 2F).
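The risk score formula and the median split described above can be sketched as follows. The coefficients are those reported in the text; the patient identifiers and expression values are hypothetical, and assigning scores exactly equal to the median to the low-risk group is an assumption, since the text does not say how ties were handled.

```python
from statistics import median
from typing import Dict

# Regression coefficients of the 4-gene signature, as reported in the text.
COEFFICIENTS = {
    "KRT23": 0.5424,
    "CXCL1": 0.2398,
    "SOX9": 0.5398,
    "ABCA10": -1.7023,
}


def risk_score(expression: Dict[str, float]) -> float:
    """Weighted sum of the 4 gene expression levels for one patient."""
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


def assign_risk_groups(scores: Dict[str, float]) -> Dict[str, str]:
    """Split patients into high-risk and low-risk groups at the median score.

    Assumption: a score equal to the median is labelled "low".
    """
    cutoff = median(scores.values())
    return {
        pid: ("high" if score > cutoff else "low")
        for pid, score in scores.items()
    }


# Hypothetical expression values, for illustration only.
patients = {
    "P1": {"KRT23": 2.0, "CXCL1": 1.0, "SOX9": 1.0, "ABCA10": 0.5},
    "P2": {"KRT23": 0.5, "CXCL1": 0.2, "SOX9": 0.3, "ABCA10": 1.5},
    "P3": {"KRT23": 3.0, "CXCL1": 2.0, "SOX9": 2.5, "ABCA10": 0.1},
    "P4": {"KRT23": 1.0, "CXCL1": 0.5, "SOX9": 0.8, "ABCA10": 1.0},
}
scores = {pid: risk_score(expr) for pid, expr in patients.items()}
print(assign_risk_groups(scores))
```

Because ABCA10 carries a negative coefficient, high ABCA10 expression lowers the score, which matches its description as the low-risk gene.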
In AUC (area under the ROC curve) analysis, the 4-gene prognostic signature scored 0.855, outperforming standard clinicopathological parameters (0.213, 0.796, 0.728 and 0.564 for age, myometrium invasion, node metastasis and stage, respectively; Figure 2G).
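To make the AUC comparison concrete, the sketch below computes an AUC from its rank-based definition: the probability that a randomly chosen case with the event has a higher score than a randomly chosen case without it. This is a simplification that treats the outcome as a binary label and ignores censoring; the text does not say whether a time-dependent ROC method was used, so the sketch should not be read as the procedure behind Figure 2G. The function name and the toy values are hypothetical.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (event, non-event) pairs ranked correctly.

    labels: 1 for cases with the event (e.g. death), 0 otherwise.
    Ties between an event and a non-event score count as half a pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 3 of the 4 (event, non-event) pairs are ordered
# correctly, so the AUC is 0.75.
print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

Under this definition, a value of 0.5 corresponds to a random ranking, and a value below 0.5 means that the marker ranks cases in the opposite direction to the one assumed.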
3.4 Validation of the 4-gene signature in the testing set
To assess the robustness of the 4-gene prognostic signature, risk scores for the 74 USC cases in the testing set were calculated and ranked as described in section 3.3 (Figure 3A). The relationship between risk scores and survival time is shown in Figure 3B. The expression of the 3 high-risk genes increased with rising risk scores, while the low-risk gene exhibited the opposite trend (Figure 3C). Kaplan-Meier analysis indicated that the high-risk group experienced worse outcomes relative to the low-risk group (P = 0.0004387, Figure 3D). In AUC analysis, the 4-gene signature scored 0.811, which was higher than the scores of conventional prognostic indicators (0.430, 0.752, 0.808 and 0.688 for age, myometrium invasion, node metastasis and stage, respectively; Figure 3E), consistent with observations made in the training set.
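The comparison of survival between the high-risk and low-risk groups can be illustrated with a two-group log-rank test. This is a sketch under the assumption that the P-values reported alongside the Kaplan-Meier curves come from a log-rank test, which the text does not state explicitly; the function name and the toy follow-up times are hypothetical.

```python
import math
from typing import Sequence, Tuple


def logrank_test(
    times_a: Sequence[float], events_a: Sequence[int],
    times_b: Sequence[float], events_b: Sequence[int],
) -> Tuple[float, float]:
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    events: 1 if the event (e.g. death) was observed, 0 if censored.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Numbers at risk just before time t in each group.
        n_a = sum(1 for ti, _, g in data if g == 0 and ti >= t)
        n_b = sum(1 for ti, _, g in data if g == 1 and ti >= t)
        # Events occurring exactly at time t in each group.
        d_a = sum(1 for ti, e, g in data if g == 0 and ti == t and e)
        d_b = sum(1 for ti, e, g in data if g == 1 and ti == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the event count in group A.
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))

    if variance == 0:
        raise ValueError("test is undefined for these data")
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical follow-up times (months); every case has an observed event.
high_risk = ([1, 2, 3, 4, 5], [1, 1, 1, 1, 1])
low_risk = ([10, 11, 12, 13, 14], [1, 1, 1, 1, 1])
chi2, p = logrank_test(*high_risk, *low_risk)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

The same function could be applied within each stage stratum to mirror a stratified comparison, again under the log-rank assumption.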
3.5 Independent prognostic value of the 4-gene signature
To evaluate whether the 4-gene signature is prognostic independently of conventional prognostic indicators, we performed univariate and multivariate Cox regression analysis on testing set cases with complete clinical features. This analysis revealed that our prognostic signature and tumor stage are both independent predictors of OS (Table 3). Next, we tested whether the 4-gene signature could predict OS at different disease stages. To this end, we stratified the cases by stage into early (stage I+II) and late (stage III+IV) groups. Patients in the high-risk group exhibited lower OS relative to those in the low-risk group in both early and late stages (P = 0.003306 and P = 0.02755, respectively; Figure 3F-G). These results suggest that the 4-gene signature retains prognostic value in both stage groups, with the stronger separation in early-stage disease, highlighting its potential clinical application.
3.6 Evaluation of the 4-gene signature in predicting RFS
To evaluate whether the 4-gene signature could predict recurrence-free survival (RFS) in USC, TCGA-USC cases with RFS data were analyzed. Cases with RFS of <30 days were excluded, and the remaining 95 cases were analyzed further. Each patient's risk score was calculated and ranked as described in section 3.3 (Figure 4A). The risk scores and recurrence times are shown in Figure 4B. This analysis revealed that expression of the 4 genes increased with rising risk scores (Figure 4C). Kaplan-Meier analysis revealed that the high-risk group had a higher recurrence rate relative to the low-risk group (P = 0.01198, Figure 4D). In AUC analysis, the prognostic signature scored 0.737 for RFS prediction, which was higher than the scores of conventional indicators (0.151, 0.595, 0.551 and 0.632 for age, myometrium invasion, node metastasis and stage, respectively; Figure 4E). Univariate and multivariate Cox regression analysis identified the prognostic signature and stage as independent prognostic factors for RFS, consistent with the OS analysis (Table 4). Analysis of the effectiveness of the 4-gene signature in predicting RFS at different disease stages revealed that patients in the low-risk and high-risk groups had significantly different RFS in late stage (P = 0.003489, Figure 4F). However, there was no significant difference between the two risk groups in early stage (Figure S1).