Identification of molecular subtype related to HGSOC metabolism
The transcriptome data of 855 metabolic genes were extracted from TCGA-OV, and unsupervised clustering was used to construct molecular typing. When K = 2, the slope of CDF (Consistent cumulative distribution function) decreased slightly (Figure 2A-B), the sample size of each subgroup was more balanced, the intra-group correlation was high, and the inter-group correlation was low (Figure 2C). The other subtype when k at other value was showed at Supplementary figure 1. We divided the HGSOC samples of TCGA-OV into cluster 1 and cluster 2. Kaplan-Meier analysis showed that outcomes in cluster 2 were worse than cluster 1 (p = 0.025, hazard ratio = 1.395, Figure 2D). We verified the metabolic typing in GPL96-OV, GPL570-OV, and GPL6480-OV and found that these metabolic genes could still divide the samples into two groups, but there was no statistical significance in prognostic guidance. Only in the GPL96-OV and GPL570-OV cohorts, cluster 2 show a trend of poor prognosis (Supplementary figure 2).
Characteristics of two metabolic subtypes
To determine the metabolic characteristics of the two subtypes, we identified differentially-expressed metabolic genes of the two subtypes. We found that 141 genes were overexpressed in cluster 2 (| logFC | > 1, adjust p-value > 0.05, Figure 2E) and 160 genes were overexpressed in cluster 1 (| logFC | > 1, adjust p-value > 0.05). We placed these genes into Metascape for enrichment analysis and found that cluster 2 was enriched in carbohydrate processes (Figure 2F), while cluster 1 was enriched in respiratory processes (Figure 2G). Then, we carried out ssGSEA analysis of immune cells and found higher immune infiltration in cluster 1 than in cluster 2 (Figure 2H). Finally, according to the subtype grouping, GSEA analysis showed that cluster 2 was enriched in "LYSINE_DEGRADATION", while cluster1 was enriched in "OXIDATIVE_PHOSPHORYLATION", "ALZHEIMERS_DISEASE", "PARKINSONS_DISEASE", and "HUNTINGTONS_DISEASE" (Figure 2I).
Construction of co-expression network related to metabolic subtypes
To build a scale-free network, it is necessary to select a suitable Soft threshold. When the Soft threshold = 3/4/5/6/8, R-square > 0.9 (Figure 3A). The mean connectivity is the largest among these values when the Soft threshold = 3 (Figure 3B). Therefore, we chose the Soft threshold = 3 to build a scale-free network. Then, using the method of the dynamic cutting tree, the genes were divided into five modules (Figure 3C). We took the futime, fustat, age, grade, stage, and metabolic subtype of the sample as character data and analyzed the correlations with the module. The correlation between the turquoise module and the cluster was the highest (Figure 3D, cor = -0.47, p = 8E-21). We drew a scatter plot of the turquoise module's gene significance and module membership (Figure 3E). Finally, we used Cytoscape to construct a co-expression network (Figure 3F) for the gene significance > 0.3 and module membership > 0.3. Finally, we identified five hub genes (NDUFA2, COX8A, COX5B, UQCR11, UQCRQ, and ATP5MF) with the highest degree of connection in this network.
Construction of the metabolic-related prognostic model
Univariate Cox analysis of 564 genes in the turquoise module showed that 48 genes were statistically significant (Figure 4A, p < 0.05). The 855 metabolic genes in the GSEA database were overlapped with the entire genomes of GPL96-OV, GPL570-OV, and GPL6480-OV. Then GPL96-OV, GPL570-OV, and GPL6480-OV obtained 783, 832, and 846 metabolic genes. To enable our model to be successfully verified by the data of other cohorts, we intersected these 48 genes with the metabolic genes of GPL96-OV, GPL570-OV, and GPL6480-OV (40 genes) (Figure 4B). To optimize our model, we used LASSO for dimensionality reduction analysis. With the increase of lambda value, the coefficients of some genes gradually become 0 (Figure 4C), indicating that these genes have little effect on the model and should be abandoned. Then the 10X cross-validation method was used to calculate the partial likelihood deviance of the model. When the gene number was 16, the model's performance was the best (Figure 4D). Figure 4E shows the 16 genes and their corresponding coefficients (Table 1). We named the model 16-MGM (16 metabolic gene-related model). The risk score of each patient was calculated using 16-MGM.
Verifying the prognostic performance of 16-MGM in training set TCGA-OV
Kaplan-Meier analysis of TCGA-OV showed that 16-MGM had a significant prognostic ability (Figure 4F, p < 0.05), and the prognosis was poor when the score was high. Figure 4G showed the score distribution and survival status distribution of each patient. ROC curve showed that 16-MGM had a certain prognostic ability after nine years (AUC > 0.65). Univariate and multivariate Cox analysis of risk score, age, stage, and grade showed that the model's risk score was independent of age, stage, and grade for predictive analysis (Figure 4I). It has prognostic ability in many clinical subgroups, including age < = 60, age > 60, grade 3, stage 3, and stage 4 (Figure 4J).
Verify the prognostic performance of 16-MGM in the verification sets
We verified the prognostic ability of 16-MGM in verification sets GPL96-OV (N = 251) and GPL570-OV (N = 353), and GPL6480-OV (N = 558) cohorts. Kaplan-Meier analysis showed that, in the GPL96-OV cohort, the risk score predicted the overall survival rate (Figure 5A, p = 0.43), and the AUC values of 3 and 5 years were greater than 0.5. In the GPL570-OV cohort, the risk score predicted the overall survival rate (Figure 5B, p = 0.019), and the AUC values of 3 and 5 years were greater than 0.5. In the GPL6480-OV cohort, the risk score predicted the overall survival rate (Figure 5C, p < 0.001), and the AUC value of 3 and 5 years was greater than 0.5. 16-MGM also predicted progression-free survival (PFS), which was statistically significant in the cohort TCGA-OV and GPL570-OV cohorts (Figure 5D-E, p < 0.05), but not in the GPL6480-OV cohort (Figure 5D-E).
Difference of immune infiltration level between high and low-risk score samples
We used the CIBERSORT algorithm to calculate the infiltration percentage of 22 immune cells per sample in the TCGA-OV, GPL96-OV, GPL570-OV, and GPL6480-OV cohorts. There was no significant difference in infiltration between patients with high- and low-risk scores (Figure 6A). However, there were significant differences in the infiltration of some immune cells. We selected the cells with different immune infiltration in each cohort (Figure 6B) and found that there was significant infiltration of "mast cell activated" in the high-risk group of the GPL96, GPL70, and GPL6480 cohorts and "T cell follicular helper" in the low-risk group of the GPL96, GPL70, and TCGA-OV cohorts (p < 0.05).
GSEA analysis of samples with high and low-risk scores
We performed GSEA analysis of the KEGG pathway in TCGA-OV, GPL96-OV, GPL570-OV, and GPL6480-OV. The high-risk group did not enrich the pathway, while the low-risk group enriched several pathways (Figure 7A-D), including metabolic-related pathways. We intersected the results (Figure 7E) and found that "DNA replication" was enriched in all four cohorts.