Identifying prognostic lactylation-related genes and subtypes in PCa
The clinical and pathological characteristics of PCa patients in the TCGA and MSKK cohorts are listed in Table 1. Using the R package “limma” with an absolute log2-fold change (FC) > 0.585 and an adjusted P value < 0.05 to perform differential expression analysis, 81 lactylation-related genes were differentially expressed between tumor tissues and adjacent nontumorous tissues in TCGA cohort (Fig. 2A). By univariate cox analysis of the relationship between 81 differentially expressed genes and disease-free survival(P < 0.05) of prostate cancer from TCGA, we obtained 16 of the 81 genes were significantly related to the prognosis of PCa (Fig. 2B). Close correlation was observed among the 16 lactylation-related DEGs (Fig. 2C). CNV status analysis showed a frequent alteration in 16 lactylation-related DEGs (Fig. 2D). It was noted that ALDOA only had copy number deletion and most alterations were losses in copy number. Unsupervised consensus clustering analysis of 16 prognostic lactylation-related genes could obtain two clusters with two different lactylation signature (Fig. 2E, F). In this study, consensus clustering based on differentially expression of lactylation-related genes was achieved using the R package “Consensus ClusterPlus”. The optimal clustering value was k = 2. Kaplan–Meier method suggested that, compared with the DFS of prostate cancer patients in C1 and C2 lactylation cluster, PCa patients in C2 lactylation cluster had significantly shorter DFS time (Fig. 2G; P = 0.045, HR = 1.53, 95% CI = 1.01–2.34). PCA analysis and t-Distributed Stochastic Neighbor Embedding suggested significant differences between C1 and C2 lactylation cluster(p = 0.001) (Fig. 2H, I). The heatmap for the association between 16 lactylation-related DEGs and clinicopathological manifestations was also analyzed (Fig. 2J).
Selection of key LRGs by machine learning and construction of a LRGs prognostic model with good performance
Prostate cancer patients with DFS in TCGA were randomly divided into training cohort (n = 245) and test cohort (n = 246) at a ratio of 1:1. Because we need to construct the Cox proportional hazards regression model later, the main premise for constructing the cox proportional hazards regression model is the assumed hazard ratio. We check each collaborators variable (T stage, N stage, metastasis status, PSA, Residual tumor, risk score) is in line with the cox proportional hazards model assumptions with Schoenfeld residuals. Each schoenfeld individual test p was not statistically significant (p > 0.05), the global schoenfelds were not statistically significant (Supplementary Fig. 1). Therefore, we can assume that this cox model meets the proportional hazards' assumption. Then, we used LASSO regression analysis, random forest, XGBoost, Boruta algorithm and SVM-RFE to find key genes associated with DFS in the TCGA training cohort. Five lactylation-related genes, ALDOA, DDX39A, H2AX, KIF2C and RACGAP1, were identified as key genes by the LASSO regression algorithm. The five variables were obtained based on the optimal value of λ = 0.069(Fig. 3A), and the lowest partial likelihood deviance is shown in Fig. 3B. According to the correlation map between the number of decision trees and the model error, we chose ntree was 600, the error rate of the model tended to stabilize and the importance of variables was ranked according to the VIMP method (Fig. 3C). Six key genes, ALDOA, DDX39A, H2AX, KIF2C, RACGAP1 and H2BC4, were selected by the minimum depth variable selection function in the R package “randomForestSRC”. SVM-RFE algorithm identified 8 key genes significantly associated with the DFS (Fig. 3D). The Boruta algorithm identified 6 key genes significantly associated with the DFS (Fig. 3E). XGBoost algorithm screened 10 most important signature genes affecting the disease-free survival of prostate cancer (Fig. 3F).The intersection of the RF, LASSO, SVM-RFE, Boruta and XGBoost results was shown in a Venn diagram in Fig. 3G. We identified 5 overlapping key genes, including ALDOA, DDX39A, H2AX, KIF2C, RACGAP1. The overlapping key genes screened by the five machine learning algorithms were consistent with those screened by the lasso algorithm. We wanted to use the five signature genes to construct a survival prognosis model. Then, the five key genes were used to construct a survival prognostic model by Lasso cox regression analysis in TCGA training set, and validated in TCGA test set and external validation set MSKCC. The risk score model was as follows: Risk score = (0.0880 × expression level of ALDOA) + (0.0108 × expression level of DDX39A) + (0.0199 × expression level of H2AX) + (0.2848 × expression level of KIF2C) + (0.0181× expression level of RACGAP1). Univariate and multivariate Cox regression analyses demonstrated that T stage, Gleason score, and risk score based on the signature of five lactylation-related genes were independent predictors of prognosis in patients with PCa (Fig. 3H、I).According to the median value of the risk scores in TCGA training cohort, the patients were divided into two groups, either high-risk group or low-risk group. KM analysis indicated that significantly poorer DFS in TCGA training cohort was detected among patients in the high-risk group compared to patients in the Low-risk group (log-rank test, p < 0.0001, HR = 6.04,95%CI:2.83–12.9,5-yearsDFS:0.49,95%CI:0.38–0.63) (Fig. 3J). A receiver operating characteristic (ROC) curve was constructed to estimate the model and assess the reliability of the risk score, and the areas under the curve (AUCs) in TCGA training cohort at 1 year, 3 years and 5 years DFS were 0.82, 0.797 and 0.769, suggesting that the risk model can be useful in predicting prognosis (Fig. 3K).The ROC curve analysis shows the highly sensitive and specific predictive performance of the risk score in TCGA training cohort (Fig. 3L).So, the Prognostic model had a better predictive performance in the TCGA training cohort.
An internal validation set (TCGA test cohort) and an external validation set (MSKCC cohort) were used to verify the prediction performance of the prognostic model
To verify the prediction model had a powerfully predictive performance in the other cohort, the TCGA test cohort and the MSKCC cohort were used to verify the predictive performance. According to the median value of the risk scores in TCGA test cohort, the patients were divided into high-risk group and low-risk group. KM analysis indicated that significantly poorer DFS in TCGA test cohort was detected among patients in the high-risk group compared to patients in the Low-risk group (log-rank test, p = 0.0003, HR = 3.31,95%CI:1.16–6.61,5-years DFS:0.64,95%CI:0.52–0.77) (Fig. 4A). In the TCGA test cohort, the AUC for the 1-year, 3-year, and 5-years DFS were 0.704,0.674 and0.644(Fig. 4B). The ROC curve analysis shows the highly sensitive and specific predictive performance of the risk score in TCGA test cohort (Fig. 4C). To better assess its predictive accuracy, the above results were replicated in the entire TCGA cohort. We found that the DFS of the high-risk group was shorter than that of the low-risk group (P < 0.001) and the area under the ROC curves (AUC) of the prognostic model for DFS was 0.762 at 1 year, 0.745, at 3 years, and 0.709 at 5 years in entire TCGA cohort (Fig. 4D,E). Also, in the ROC curve containing risk score, stage, Gleason score, residual tumor and prostate-specific antigen (PSA), the AUC value of the risk score was higher than other indicators in the entire TCGA cohort (Fig. 4F) Similarly, patients with PCa in MSKCC cohort were classified into a high-risk group and a low-risk groups according to the median risk score. KM analysis indicated that the DFS was shorter in patients with high-risk scores than those with low-risk scores in MSKCC cohort (log-rank test, p = 0.0021, HR = 2.91,95%CI:1.43–5.92,5-years DFS:0.62,95%CI:0.50–0.76) (Fig. 4G). The areas under the curve (AUCs) of the ROC curve at 1 year, 3 years and 5 years DFS were 0.835, 0.701 and 0.661, suggesting good predictive ability (Fig. 4H). The risk score was statistically different between TCGA subgroups C1 and C2(Fig. 4I). The risk score of each patient in the entire TCGA cohort and MSKCC cohort was calculated based on the expression levels and regression coefficients of these five genes. The distribution of risk scores and survival status in the entire TCGA cohort and MSKCC cohort is shown in Fig. 4J and Fig. 4K. The risk curves reflect the relationship between the risk score and disease-free survival status of patients with prostate cancer, and we found that the probability of recurrence or progression was higher in the high-risk patients than the low-risk patients.
Establishment and evaluation of the predictive nomogram
The independent predictors, including T stage, risk score and Gleason score, which affect the DFS of PCa patients, were incorporated into the nomogram model (Fig. 5A). Time-dependent C-index curves of different variables based on TCGA cohorts show the optimum performance of nomogram compared with other single factors (Fig. 5B). The nomogram AUCs in the TCGA-PRAD cohort for the 1-year, 3-year, and 5-year OS probability were 0.796, 0.755, and 0.746, respectively (Fig. 5C). To evaluate the accuracy of this nomogram, the ROC curves were utilized to compare this signature with several available clinical traits and showed that the predictive value of this nomogram model was more optimal compared to several clinical traits. (Fig. 5D). Moreover, the calibration curve of the nomogram model showed the actual DFS was close to the predicted DFS (Fig. 5E). DCA showed that the nomogram provided net benefit over different threshold probability ranges, respectively, suggesting that might have potential applications in different contexts (Fig. 5F).
Analysis of tumor immune microenvironment of high- and low-risk groups
First, we estimated the composition of infiltrating immune cells in high- and low-risk groups of TCGA cohort by CIBERSORT, ESTIMATE, MCP counter, TIMER algorithms, IPS algorithms, EPIC algorithms and xCell algorithms. The results of Immune cell Infiltration in high and low-risk groups were presented as a heatmap (Fig. 6A). From the heatmap, we have known that regulatory T cell and M2 Macrophages were significantly higher in the high-risk group compared to the low-risk group, while Neutrophils and Fibroblasts were significantly lower in the high-risk group. The increase of regulatory T cells and M2 macrophages are conducive to tumor immune escape and tumor recurrence, which suggested that the infiltration of these immune cell subtypes into the tumor microenvironment might have a significant impact on the prognosis of PCa patients. Based on risk grouping, in terms of immune cells, we discovered that the content of natural killer T cell, natural killer cell, immature dendritic and neutrophil was significantly lower in patients with high-risk group compared to the patients with low-risk group (Fig. 6B). In terms of immune functions, the Immune functions of Type II IFN response, CCR and MHC class I level were significantly downregulated in the patients with high-risk group compared to the patients with low-risk group with ssGSVA algorithm analysis (Fig. 6C). Then, we explored the correlation between infiltrating immune cells and risk group and prognostic genes with the CIBERSORT algorithm. We found that the risk score of LRGs was significantly correlated with regulatory T cells, M2 Macrophages, T cells CD4 memory resting and T cells follicular helper (Fig. 7A). Moreover, the risk score had a significantly positive correlation with the regulatory T cells, M2 macrophages and follicular helper T cells, and it had a significantly negative correlation with the T cells CD4 memory resting (Fig. 7B-E). Considering the importance of checkpoint inhibitors in clinical treatment, we further investigated potential changes in immune checkpoint expression between the two groups. Combined with the bar chart showing immune checkpoint expression, where most canonical markers of exhausted T including TIM3/HAVCR2, LAG3 and CTLA4 are highly expressed in high-risk group of patients, we could conclude that the abundance of exhausted T cells infiltration in high-risk group of patients may be higher than that in low-risk group [19]. Besides, we found that the expression of B2M, JAK1, JAK2, CD40, TNFSF4, CD28 in the high-risk group were significantly lower than those in the low-risk group (Fig. 7F).
Functional enrichment analysis and drug sensitivity
Exploring the functional annotation between high-risk and low-risk subtypes, we found that the high-risk group was significantly enriched in the lamin filament, U2AF complex, positive regulation of apoptotic DNA fragmentation and positive regulation of DNA catabolic process in the GSVA of gene ontology biological processes (GOBPs) (Fig. 8A). KEGG pathway analyses were performed to explore the biological processes and pathways using the gene set variation analysis (GSVA). We found that the high-risk group was enriched in the base excision repair, DNA replication, homologous recombination, pyrimidine metabolism, mismatch repair, cell cycle (Fig. 8B). The risk score was closely related to the DNA synthesis, repairing and degradation and cell cycle progression of PCa according to the above functional enrichment analysis, and we further analyzed the response of PCa patients in the TCGA cohort to DNA synthesis and repairing and cell cycle-related chemotherapy drugs (5-Fluorouracil, Cisplatin, Docetaxel, Paclitaxel, Gemcitabine, Epirubicin,Talazoparib). The results revealed that these drugs had lower half maximal (50%) inhibitory concentration (IC50) in patients of the high-risk group, implying that these patients may be more sensitive to these drugs (Fig. 8C-I). Based on the immune checkpoint analysis, we know that JAK1 and JAK2 are lower in the high-risk group, which is consistent with our results from the drug prediction analysis that we know that the high-risk group has a worse response to JAK inhibitors (JAK_8517, Ruxolitinib) (Fig. 8J-K).
Relationship between the LRGs signature and clinicopathological traits and mutation landscape
First, we evaluated the association between clinicopathological parameters and the risk score to further investigate the clinical relevance of the prognostic model. The risk score increased significantly with increasing Gleason score, T-stage and N-stage, but there was no statistical difference between the no metastasis (M0) and metastasis (M1) (Fig. 9A). In assessing the relationship between expression profiles of the signature genes in PCa tissues and related clinicopathological characteristics based on TCGA cohort, we further found the 5 key genes were differentially expressed between tumor tissues and normal tissues, and were highly expressed in tumor tissues (Fig. 9B). There were differences in expression of 5 key genes in different T stages and N stages, and the expression in advanced T stages (T3 + T4) and N1 was higher than that in early T stages(T1 + T2) and N0(Fig. 9C-D). Among the five key genes, except ALDOA, the expression was different in different Gleason scores and increased with the increase in Gleason score (Fig. 9E). We further investigated the mutation profiles separately in the high- and low-risk groups of PCa patients from TCGA cohort. Results indicated that Tumor Mutational Burden (TMB) was more widely distributed in the high-risk group, and miscellaneous mutation was the most common variant classification in PCa. The top five most frequent mutation genes in the high-risk group were TP53(20%), SPOP (16%), TITIN (12%), MUC16(10%), FOXA1(8%), while genes such as TITIN (13%), SPOP (7%), MUC16 (6%), MUC4(6%), SYNE1 (6%) had the top five mutation frequencies in low-risk group (Figure. 9F-G). Previous studies have shown that mutations in these genes are often associated with a poor prognosis for PCa [20–23].
The mRNA expression levels of lactylation-related genes in PCa cells
To further evaluate the mRNA expression levels of these five genes, we utilized their expression in the cell lines using RT-qPCR. The results indicated that most signature genes were expressed at a relatively high level in castration-resistant prostate cancer cell lines (C4-2, PC3,22Rv1, DU145) compared with non-castration-resistant prostate cancer cell line (LNCAP) (Fig. 10A-E). Castration-resistant prostate cancer cell lines are more malignant than non-castration-resistant prostate cancer cell line. In sum, the high expression of these five key genes is positively correlated with the malignant degree of prostate cancer, which validated the previous accuracy of our previous bioinformatics analysis.