Systematic identification of prognostic gene signatures for overall survival prediction
To identify the prognostic gene signatures, we collected three public DLBCL datasets with accession numbers of GSE32918 (n = 172), GSE4475 (n = 166) and GSE69051 (n = 172) from Gene Expression Omnibus (GEO) database as depicted in the flow chart in Figure S1. Subsequently, univariable Cox regression analysis was conducted, and identified 763, 685, and 589 genes associated with overall survival (OS) based on the three gene expression datasets (Fig. 1A, log-rank test, P < 0.01), respectively. Particularly, CEBPA, CSF2RA, CYP27A1, LST1, MREG, SCPEP1, and TARP were found to be significantly associated with OS in all the three datasets at the stringent threshold (Fig. 1A). Furthermore, the three datasets were merged as training set (n = 510), and multivariable Cox regression model was then built based on these genes using the merged dataset. A stepwise method was used to select a subset of those gene signatures and a multivariable Cox regression model with highest performance. Specifically, five genes including CEBPA, CYP27A1, LST1, MREG, and TARP were retained in the multivariable Cox model (Table 1), which was termed as five-gene Cox model, and all of them were favorable for the prognosis (Fig. 1B).
Performance Validation In An Independent Dataset
To evaluate the performance of the multivariable model in risk prediction, we first calculated the risk scores for the DLBCL samples in the training set, and stratified these samples into high and low risk groups by the median of risk scores. The high-risk group exhibited worse prognosis than the low-risk group (Fig. 2A, P < 0.0001). Moreover, we also collected an independent gene expression dataset of 414 samples with long-term follow-up. Similarly, the risk scores were predicted and used to stratify the samples in the validation set. Consistently, the two groups also had significant prognostic difference (Fig. 2B, P < 0.0001). Furthermore, the five gene signatures were found to be expressed higher in low-risk group than high-risk group in both the training and validation sets (Figs. 2C-2D). These results indicated that the five gene signatures were robust and consistently associated with OS in both training and validation datasets.
The five-gene Cox model is superior to other gene expression-based Cox models
To demonstrate the superiority of the five-gene Cox model based on the five gene signatures, we compared its performance with three sets of gene signatures[16–18] in the validation set. Based on the trained models by the gene signatures, the samples in the validation set could also be stratified into high- and low-risk groups. The three sets of gene signatures had the capability of predicting the prognosis of DLBCL patients (Fig. 3A-C). The gene signatures proposed by Rosenwald et al. [17] had the worst performance (Fig. 3B). However, our proposed five gene signatures showed the highest statistical significance than the other sets (Fig. 3D), suggesting that the Cox model based on the five gene signatures was superior to other models.
The five-gene-based risk stratification is a prognostic factor independent of clinical factors
To further investigate the robustness of the five-gene Cox model, we tested the independence of the five-gene-based risk stratification in the validation set. As the IPI scoring system was a well-recognized factor for prognostic risk prediction and widely applied in clinical practice[19], the samples were first divided into two groups with high ( > = 3) or low (< 3) IPI scores, considering age, serum lactate dehydrogenase (LDH), Eastern Cooperative Oncology Group (ECOG) Performance Status, Ann Arbor stage, and extranodal infiltration sites[5]. As shown in Fig. 4A, the risk scores estimated by the five-gene Cox model showed no significant difference between the two groups. Moreover, the differences were not also observed between the four stages. These results suggested that the risk scores were not associated with IPI scoring system and tumor stage.
Notably, the samples could be classified into four groups by combining the IPI scoring system and the five-gene-based risk stratification, and the four groups exhibited significantly prognostic difference (Fig. 4B, P < 0.0001). It should be noted that the differences of overall survival were not observed between the two groups with the worse prognosis, but the samples with IPI > = 3 in high-risk group still had shorter overall survival than samples with IPI > = 3 in the low-risk group based on the KM curve.
Moreover, we also tested whether the risk stratification was independent of the subtypes. Consistently, the three subtypes, including ABC, GCB and unclassified subtypes, could be further stratified into high- and low-risk groups, and the samples in high-risk group had significantly shorter survival than those in low-risk group (Fig. 4C-E). In addition, we also fitted the IPI scoring system, stage, subtype and risk stratification into a multivariable Cox model, and found that the risk stratification was still statistically significant with these prognostic factors as cofactors (Table 2). These results further demonstrated that the five-gene-based risk stratification was an independent prognostic factor for DLBCL risk prediction.
The molecular characteristics and potential drugs for the two risk groups
To reveal the molecular characteristics of the two risk groups, we compared the gene expression profiles of high-risk with those of low-risk group. The differentially expressed genes (DEGs) were then selected by Wilcoxon rank-sum test and fold change (Adjusted P-value < 0.05 and log2-fold change > 0.5). Moreover, the overrepresentation enrichment analysis (ORA) was employed to identify the pathways potentially involved in the DLBCL progression (Fig. 5A). Specifically, cell cycle-related pathway and those associated with genomic stability maintenance, such as Fanconi anemia pathway and Mismatch repair, were highly upregulated in high-risk group (Adjusted P-value < 0.05). In contrast, immune-related pathways such as rheumatoid arthritis, antigen processing and presentation, allograft rejection, hematopoietic cell lineage, and Th1 and Th2 cell differentiation were upregulated in low-risk group (Adjusted P-value < 0.05).
Among the DEGs, 29 genes were found to be therapeutic targets specifically upregulated in the two risk groups (Fig. 5B). As we known, CD20 (also termed MS4A1) is expressed on the surface of normal B lymphocytes and almost all DLBCL cases. At present, RITUXIMAB, a chimeric monoclonal antibody directed against the CD20, combine with intensive chemotherapy (CHOP) is the standard therapy for DLBCL. Besides, two cell cycle kinases, AURKB and CDK1, were upregulated in high-risk group, and BARASERTIB and DINACICLIB might be the potential drugs for treating DLBCL with high-risk factors. For the low-risk group, some immune checkpoint proteins and inhibitors were identified, such as PDCD1 (PD-1), CD274 (PD-L1), CTLA4, and their corresponding drugs, suggesting that the low-risk samples might benefit from inhibiting the immune checkpoint pathway.