Large-scale transcriptome data analysis reveals prognostic signature genes for overall survival prediction in diffuse large B cell lymphoma

BACKGROUND With the improvement of clinical treatment outcomes in Diffuse large B cell lymphoma (DLBCL), the high rate of relapse in DLBCL patients is still an established barrier, due to the therapeutic strategy selection based on potential target remains unsatisfactory. Therefore, there is an urgent need in further exploration of prognostic biomarkers so as to improve the prognosis of DLBCL. METHODS The univariable and multivariable Cox regression models were employed to screen out gene signatures for DLBCL overall survival prediction. The differential expression analysis was used to identify representative genes in high-risk and low-risk groups, respectively, by student t test and fold change. The functional difference between the high-risk and low-risk groups were identied by the gene set enrichment analysis. RESULTS We conducted a systematic data analysis to screen the candidate genes signicantly associated with overall survival of DLBCL in three NCBI Gene Expression Omnibus (GEO) datasets. To construct a prognostic model, ve genes (CEBPA, CYP27A1, LST1, MREG, and TARP) were then screened and tested using the multivariable Cox model and the stepwise regression method. Kaplan-Meier curve conrmed the good predictive performance of the ve-gene Cox model. Thereafter, the prognostic model and the expression levels of the ve genes were validated by means of an independent dataset. All ve genes were signicantly favorable for the prognosis in DLBCL, both in training and validation datasets. Additionally, further analysis revealed the independence and superiority of the prognostic model in risk prediction. Functional enrichment analysis revealed some vital pathways resulting in unfavorable outcome and potential therapeutic targets in DLBCL. CONCLUSION We developed a ve-gene Cox model for the clinical outcome prediction of DLBCL patients. Meanwhile, potential drug selection using this model can help clinicians to improve the clinical practice for the patients.


Background
Diffuse large B cell lymphoma (DLBCL) is the most common type of aggressive non-Hodgkin lymphoma with an annual incidence of 1-5/10,000 [1,2]. DLBCL is an aggressive and potentially curable hematological malignancy, which makes an early diagnosis and effective treatments essential for patients. R-CHOP (rituximab, cyclophosphamide, doxorubicin, vincristine, prednisone) is the current standard rst line treatment of DLBCL [3]. Despite the high rate of complete response (76%), approximately 40% of patients will relapse, and the molecular mechanism underlying recurrence remains largely unknown [4]. DLBCL displays tremendous clinical, genetical and molecular heterogeneity. The International Prognostic Index (IPI) has been used to predict the prognosis of patients with DLBCL for nearly 30 years, there still have minority of patients whose clinical process were not in accord with the IPI strati cation [5]. Gene expression pro ling has helped identify two major subtypes, known as germinal center B-cell-like (GCB) and activated B-cell-like (ABC), of which patients of the ABC group exhibit a generally worse prognosis [6]. However, the high prices and strict requirements regarding tissue limit the routine use of this method. Therefore, efforts have been made to nd novel biomarkers with prognostic values in order to improve therapeutic strategy selection based on potential targets [7].
Currently, various markers are de ned through immunophenotyping, such as CD5, CD30, BCL2, MYC and TP53 [8,9]. CD5 promotes downstream B-cell receptor signaling, is associated with ABC subtype and more aggressive clinical traits. Patients with CD30 + DLBCL, which leads to the downregulation of NF-κB and B-cell receptor signaling, tends to exhibit a better prognosis [10,11]. Meanwhile, in patients with the GCB subtype, BCL2 and MYC rearrangements would lead to worse prognosis [12]. TP53 mutation also adversely affects patients' prognosis [13]. Based on the new integrated genetic map, B. Chapuy, et al identi ed distinct subsets, including a previously unrecognized group of low-risk ABC-DLBCLs, two GCB-DLBCLs subsets with different prognosis and an ABC/GCB-independent group [14]. In addition, R.
Schmitz, et al uncovered some previously unknown subtypes of DLBCL by differences in gene-expression signatures and responses to immunochemotherapy [15]. The subset of high-risk patients requires more intensive therapy, the personalized therapy based on patient's histological and molecular-genetic characteristics will bring greater bene ts to patients. Therefore, further exploration of prognostic indicators is still needed to distinguish DLBCL patients with varied prognosis.

Data collection
The gene expression data and corresponding clinical information were collected from NCBI Gene Expression Omnibus (GEO) database with accession numbers of GSE32918 (n = 172), GSE4475 (n = 166), GSE69051 (n = 172), and GSE10846 (n = 414). The expression values were normalized by the data submitters, and discretized by median, which were used for downstream analysis.

Cox proportional hazard model
The univariable Cox proportional hazard model was used to screen the prognostic genes in the rst three datasets. Subsequently, the gene signatures jointly identi ed in the three datasets were then subjected for the multivariable Cox model, and the stepwise regression method was used to determine the best model. The expression threshold for splitting high-expression and low-expression was selected based on the minimal P-value of the log-rank tests in the training set. The threshold in validation set was selected by the quantile in the training set. The risk scores for the samples of training and validation sets were estimated by the multivariable Cox model. The high-and low-risk groups were strati ed based on the median of the risk scores in the training set. The independence of the risk strati cation on the cofactors was also assessed by multivariable Cox model.

Differential Gene Expression Analysis
Page 4/18 The differential gene expression analysis was conducted to identify the genes that upregulated or downregulated in a speci ed group as compared the other one. The Wilcoxon rank-sum test and fold change methods were employed, and the thresholds of adjusted p-value and log2-fold change were determined at 0.05 and 0.5.

The Pathway Enrichment Analysis
The upregulated genes in each risk group were subjected to the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis, respectively. Hypergeometric test was applied to test the statistical signi cance of the pathways. The threshold for adjusted P-value was determined at 0.05.

The Drug-target Identi cation
The therapeutic targets were selected from the upregulated genes in each risk group. The drugs and upregulated genes were mapped by the R package maftools with drugInteractions.

Systematic identi cation of prognostic gene signatures for overall survival prediction
To identify the prognostic gene signatures, we collected three public DLBCL datasets with accession numbers of GSE32918 (n = 172), GSE4475 (n = 166) and GSE69051 (n = 172) from Gene Expression Omnibus (GEO) database as depicted in the ow chart in Figure S1. Subsequently, univariable Cox regression analysis was conducted, and identi ed 763, 685, and 589 genes associated with overall survival (OS) based on the three gene expression datasets (Fig. 1A, log-rank test, P < 0.01), respectively. Particularly, CEBPA, CSF2RA, CYP27A1, LST1, MREG, SCPEP1, and TARP were found to be signi cantly associated with OS in all the three datasets at the stringent threshold (Fig. 1A). Furthermore, the three datasets were merged as training set (n = 510), and multivariable Cox regression model was then built based on these genes using the merged dataset. A stepwise method was used to select a subset of those gene signatures and a multivariable Cox regression model with highest performance. Speci cally, ve genes including CEBPA, CYP27A1, LST1, MREG, and TARP were retained in the multivariable Cox model (Table 1), which was termed as ve-gene Cox model, and all of them were favorable for the prognosis

Performance Validation In An Independent Dataset
To evaluate the performance of the multivariable model in risk prediction, we rst calculated the risk scores for the DLBCL samples in the training set, and strati ed these samples into high and low risk groups by the median of risk scores. The high-risk group exhibited worse prognosis than the low-risk group ( Fig. 2A, P < 0.0001). Moreover, we also collected an independent gene expression dataset of 414 samples with long-term follow-up. Similarly, the risk scores were predicted and used to stratify the samples in the validation set. Consistently, the two groups also had signi cant prognostic difference ( Fig. 2B, P < 0.0001). Furthermore, the ve gene signatures were found to be expressed higher in low-risk group than high-risk group in both the training and validation sets (Figs. 2C-2D). These results indicated that the ve gene signatures were robust and consistently associated with OS in both training and validation datasets.
The ve-gene Cox model is superior to other gene expression-based Cox models To demonstrate the superiority of the ve-gene Cox model based on the ve gene signatures, we compared its performance with three sets of gene signatures [16][17][18] in the validation set. Based on the trained models by the gene signatures, the samples in the validation set could also be strati ed into highand low-risk groups. The three sets of gene signatures had the capability of predicting the prognosis of DLBCL patients ( Fig. 3A-C). The gene signatures proposed by Rosenwald et al. [17] had the worst performance (Fig. 3B). However, our proposed ve gene signatures showed the highest statistical signi cance than the other sets (Fig. 3D), suggesting that the Cox model based on the ve gene signatures was superior to other models.
The ve-gene-based risk strati cation is a prognostic factor independent of clinical factors To further investigate the robustness of the ve-gene Cox model, we tested the independence of the vegene-based risk strati cation in the validation set. As the IPI scoring system was a well-recognized factor for prognostic risk prediction and widely applied in clinical practice [19], the samples were rst divided into two groups with high ( > = 3) or low (< 3) IPI scores, considering age, serum lactate dehydrogenase (LDH), Eastern Cooperative Oncology Group (ECOG) Performance Status, Ann Arbor stage, and extranodal in ltration sites [5]. As shown in Fig. 4A, the risk scores estimated by the ve-gene Cox model showed no signi cant difference between the two groups. Moreover, the differences were not also observed between the four stages. These results suggested that the risk scores were not associated with IPI scoring system and tumor stage.
Notably, the samples could be classi ed into four groups by combining the IPI scoring system and the ve-gene-based risk strati cation, and the four groups exhibited signi cantly prognostic difference ( Fig. 4B, P < 0.0001). It should be noted that the differences of overall survival were not observed between the two groups with the worse prognosis, but the samples with IPI > = 3 in high-risk group still had shorter overall survival than samples with IPI > = 3 in the low-risk group based on the KM curve.
Moreover, we also tested whether the risk strati cation was independent of the subtypes. Consistently, the three subtypes, including ABC, GCB and unclassi ed subtypes, could be further strati ed into high-and low-risk groups, and the samples in high-risk group had signi cantly shorter survival than those in lowrisk group (Fig. 4C-E). In addition, we also tted the IPI scoring system, stage, subtype and risk strati cation into a multivariable Cox model, and found that the risk strati cation was still statistically signi cant with these prognostic factors as cofactors ( Table 2). These results further demonstrated that the ve-gene-based risk strati cation was an independent prognostic factor for DLBCL risk prediction.
The molecular characteristics and potential drugs for the two risk groups To reveal the molecular characteristics of the two risk groups, we compared the gene expression pro les of high-risk with those of low-risk group. The differentially expressed genes (DEGs) were then selected by Wilcoxon rank-sum test and fold change (Adjusted P-value < 0.05 and log2-fold change > 0.5). Moreover, the overrepresentation enrichment analysis (ORA) was employed to identify the pathways potentially involved in the DLBCL progression (Fig. 5A). Speci cally, cell cycle-related pathway and those associated with genomic stability maintenance, such as Fanconi anemia pathway and Mismatch repair, were highly upregulated in high-risk group (Adjusted P-value < 0.05). In contrast, immune-related pathways such as rheumatoid arthritis, antigen processing and presentation, allograft rejection, hematopoietic cell lineage, and Th1 and Th2 cell differentiation were upregulated in low-risk group (Adjusted P-value < 0.05).
Among the DEGs, 29 genes were found to be therapeutic targets speci cally upregulated in the two risk groups (Fig. 5B). As we known, CD20 (also termed MS4A1) is expressed on the surface of normal B lymphocytes and almost all DLBCL cases. At present, RITUXIMAB, a chimeric monoclonal antibody directed against the CD20, combine with intensive chemotherapy (CHOP) is the standard therapy for DLBCL. Besides, two cell cycle kinases, AURKB and CDK1, were upregulated in high-risk group, and BARASERTIB and DINACICLIB might be the potential drugs for treating DLBCL with high-risk factors. For the low-risk group, some immune checkpoint proteins and inhibitors were identi ed, such as PDCD1 (PD-1), CD274 (PD-L1), CTLA4, and their corresponding drugs, suggesting that the low-risk samples might bene t from inhibiting the immune checkpoint pathway.

Discussion
DLBCL is a remarkably heterogeneous disease, both histologically and genetically. Despite signi cant advances in subtype classi cation of DLBCL, accurate prediction of prognosis remains a challenge. With the development of high throughput sequencing technology, some potential prognostic genomic markers for DLBCL patients have been identi ed [17,20,21]. However, the number of prognostic markers is still limited. There is an urgent need to screen out more biomarkers to improve the accuracy of prognostic prediction.
In the present study, we identi ed potential gene candidates by using the univariable Cox regression analysis to examine associations between gene expression and patient prognosis of three DLBCL cohorts in GEO. To further narrow down these gene signatures, multivariate Cox analysis was carried out on the merged datasets. A stepwise approach was used to select a subset of gene candidates with highest performance, and a risk model was established for predicting DLBCL prognosis based on the expression levels of ve genes including CEBPA, CYP27A1, LST1, MREG, and TARP. We evaluated the model performance using an independent gene expression dataset and compared it with previously reported models. Our ve-gene based risk model showed improved robustness, accuracy, and e ciency compared to those models and was demonstrated to be an independent prognostic factor for overall survival in patients with DLBCL. Subsequently, we compared the gene expression pro les of high-risk with those of low-risk group and performed ORA to identify pathways potentially involved in the DLBCL progression. Thus, we believe that our ve-gene-based risk scoring model can be used for re ning DLBCL subtypes and potentially improving patient therapy.
According to the multivariable Cox model, high expression of the ve genes were all associated with a favorable survival outcome. CEBPA is a transcription factor playing roles in regulating proliferation and differentiation of many cell types [22]. Within the hematopoietic system, inactivation mutation of CEBPA blocks the granulocytic differentiation in acute myeloid leukemia (AML) [23]. In addition, it has been reported that CEBPA-regulated PER2 activation is a potential tumor suppressor pathway in diffuse large B-cell lymphoma (DLBCL) [24]. CYP27A1, a cytochrome P450 oxidase family member, is closely related to the proliferation of multiple tumor cells, such as prostate, breast and colon cancer [25][26][27]. LST1 is encoded within the TNF region of the human MHC which regulates lymphocyte proliferation [28]. MREG is reported to suppress thyroid cancer cell invasion and proliferation through PI3K/Akt-mTOR signaling pathway [29]. The biological role of these genes in DLBCL need to be further investigated.
The overrepresentation enrichment analysis of DEGs suggests that the abnormal cell cycle progression and increased genomic instability contribute to the rapid progression of DLBCL. Inhibitors of cell cycle kinases, such as BARASERTIB and DINACICLIB, may be effective in high-risk patients. On the contrary, genes related to immune-related pathways, such as antigen processing and presentation, Th1 and Th2 cell differentiation, were enrichment in low-risk group, suggesting that activated host immune response may predict favorable prognosis and response to therapy. These ndings provide novel clues into the explanation of the mechanism of DLBCL.
The prognostic model we proposed is helpful for further risk strati cation in genetic level on the basis of the current traditional typing, but this study still has some limitations. Some potential prognostic factors may be not included in the model such as the racial factors and the roles that the ve genes play in DLBCL required further experimental validation.

Conclusions
To sum up, our research indicates that the ve-gene prognostic model is a reliable tool for predicting the OS of DLBCL patients and providing some hints on drug selection, which can assist clinicians in selecting personalized treatment, although speci c drug selection requires further molecular biology research and clinical trials.

Consent for publication
Participants gave their written informed consent for the materials to appear in publications without limit on the duration of publication.

Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors have declared that no competing interests exist.        The molecular characteristics and potential drugs for the two risk groups (A) The top-ten GO terms enriched by the upregulated genes in high-risk and low-risk groups. The dots size and color represent the ratio of gene counts and statistical signi cance, respectively. (B)The heat map of the expression of 29 therapeutic targets genes among the DEGs, for the low-risk vs high-risk groups in the validation datasets.
The DEGs were selected by Wilcoxon rank-sum test and fold change (Adjusted P-value < 0.05 and log2-