Penalized regression models have been used primarily for variable selection in high-dimensional data analysis. However, selection of a variable by a penalized regression model does not necessarily indicate that the variable is important, because the results of penalized regression depend on the tuning parameters(10). The number of variables selected into the model decreases as the tuning parameter λ increases. Variables that are still selected when λ is large can therefore be considered relatively important. Based on this property, we proposed a strategy to measure variable importance using penalized regression analysis. This strategy can be used to analyze GEP to detect an optimal gene subset associated with cancer subtype classification, diagnosis, and prognosis. In this study, we applied this strategy to DLBCL classification analysis. Six genes were identified as an optimal gene subset for both subtype classification and survival prediction in DLBCL. The predictive and prognostic performance of these six genes was further validated in the external dataset. Moreover, taking the simplicity and predictability of clinical models into consideration, we found that the six-gene model outperformed typical penalized regression models. Together, these results indicate that our strategy is effective.
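The ranking idea described above can be illustrated with a minimal sketch. It is not the implementation used in this study: it assumes a plain linear lasso fitted by coordinate descent on simulated data, and it ranks each variable by the largest tuning parameter at which its coefficient is still nonzero. The function names (`lasso_cd`, `rank_by_entry_lambda`), the grid of tuning parameters, and the toy data are all hypothetical.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200, tol=1e-8):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, with b = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (column j added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

def rank_by_entry_lambda(X, y, lambdas):
    """Rank variables by the largest lambda at which they are still selected."""
    p = len(X[0])
    entry = [0.0] * p  # 0.0 means "never selected on this grid"
    for lam in sorted(lambdas, reverse=True):
        beta = lasso_cd(X, y, lam)
        for j in range(p):
            if beta[j] != 0.0 and entry[j] == 0.0:
                entry[j] = lam
    order = sorted(range(p), key=lambda j: entry[j], reverse=True)
    return order, entry

# Simulated data (assumption): variables 0 and 1 carry signal, 2 and 3 are noise.
random.seed(0)
n = 100
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(n)]
y = [3.0 * row[0] + 1.0 * row[1] + random.gauss(0, 0.5) for row in X]

lambdas = [4.0 * 0.8 ** k for k in range(20)]  # hypothetical decreasing grid
order, entry = rank_by_entry_lambda(X, y, lambdas)
print("ranking (most important first):", order)
print("largest lambda with nonzero coefficient:", entry)
```

In this sketch a variable that survives a stronger penalty is ranked higher, which mirrors the property that fewer variables remain in the model as λ grows. For a binary outcome such as subtype, a penalized logistic model would take the place of the linear lasso.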
MYBL1, MAML3, and S1PR2 were highly expressed in patients with the GCB subtype relative to patients with the ABC subtype. MYBL1 is a member of the myb transcription factor family. All members of the myb family are involved in regulating the proliferation and/or differentiation of different hematopoietic cells; MYBL1 regulates the proliferation and/or differentiation of germinal center (GC) B cells(17). Josée Golay et al. suggested that MYBL1 could be a specific marker for proliferating centroblasts because of its specific induction(17). Subsequently, several studies based on GEP analysis also demonstrated that MYBL1 could be regarded as a biomarker for classifying DLBCL, which is highly consistent with our results(3, 5–7). Sphingosine-1-phosphate receptor 2 (S1PR2) is a G-protein-coupled receptor (GPCR). It couples to Gα12 or Gα13 (encoded by GNA12 and GNA13) to induce apoptosis(18). Jagan R. Muppidi et al. found that mutations resulting in S1PR2 inactivation occurred exclusively in the GCB subtype and hardly ever in the ABC subtype(19). Additionally, in a mouse model lacking both alleles of S1PR2, half of the mice developed B-cell lymphomas with GCB morphology and molecular characteristics(20). In our study, however, S1PR2 expression was higher in the GCB subtype than in the ABC subtype. This can be explained by FOXP1, which represses S1PR2 expression and was highly expressed in the ABC subtype(18). MAML3 belongs to the Mastermind-like (MAML) family, which comprises MAML1, MAML2, and MAML3. Members of this family are essential transcriptional coactivators for Notch-induced transcription events(21). MAML1 was reported to be a regulator of the NF-κB signaling pathway, and activation of the NF-κB pathway is a main characteristic of the ABC subtype(22, 23). Kochert et al. showed that MAML2 was highly expressed in several types of B cell-derived lymphomas relative to normal B cells(24). However, functional studies of MAML3 are limited. Studies have shown that MAML3 overexpression may be involved in cancer metastasis(25), suggesting that MAML3 overexpression is associated with poor prognosis, which is discordant with our results. This implies that MAML3 may act through other regulatory and oncogenic mechanisms in DLBCL.
TNFRSF13B, CYB5R2, and BATF were relatively overexpressed in the ABC subgroup. TNFRSF13B, also known as TACI (transmembrane activator and calcium-modulating cyclophilin ligand interactor), is a member of the tumor necrosis factor (TNF) receptor superfamily(26). TACI and its ligands, BAFF and APRIL, are critical factors for the growth and survival of both normal and malignant B cells(27). Accumulating evidence indicates that TACI tends to be frequently expressed in the ABC subtype and has been used as a classifier in DLBCL classification(6, 28). The high expression of TACI may be one of the reasons for NF-κB pathway activation in the ABC subtype, because binding of TACI to its ligands can activate the NF-κB pathway(29). CYB5R2 belongs to the cytochrome reductase family. This enzyme has been shown to be involved in oxidation-reduction reactions, drug metabolism, methemoglobin reduction in erythrocytes, and lipid metabolism(30, 31). The expression of CYB5R2 varies among cancers. CYB5R2 has been considered a tumor suppressor gene and was shown to be inactivated in prostate cancer, breast cancer, and nasopharyngeal carcinoma(32–34). However, Lotem et al. reported that CYB5R2 was up-regulated in B cell acute lymphocytic leukemia(35). The role of CYB5R2 in lymphomas is poorly understood. Qun Liu et al. found that CYB5R2 expression was highly correlated with many genes of the Toll pathway, suggesting that CYB5R2 may be associated with cancer invasion(36). The protein encoded by BATF is a nuclear basic leucine zipper protein that belongs to the AP-1/ATF superfamily of transcription factors(37). BATF has been shown to play an important role in T and B cells during immune responses, and it controls global regulators of class-switch recombination in both cell types(38). Interestingly, in the context of B-cell malignancy, BATF has been consistently linked to the ABC subtype(6, 39). Jun Li et al. identified BATF as a target of the NF-κB pathway(40). This implies that BATF overexpression in the ABC subtype may be associated with abnormal activation of the NF-κB pathway.
To date, several signatures based on gene expression have been developed to determine the COO subtype, some of which use formalin-fixed, paraffin-embedded tissue (FFPET). The Lymph2Cx assay proposed by Scott et al. is one of the methods that use FFPET(41). It is a 20-gene signature, and five of its genes are members of the six-gene model. Analyzing FFPET is undoubtedly more clinically practical than analyzing frozen tissue(9, 41), but the six-gene model is more parsimonious and easier to interpret than the 20-gene signature because it is a simple logistic model. By contrast, the 20-gene signature is a weighted average of the 15 predictive genes, and Scott et al. did not describe in detail how the weights were calculated(41). This may limit its widespread use. Additionally, we ranked genes by importance, which provides an ordering of genes in terms of priority for further functional and targeted drug research. Further efforts are needed to realize the potential clinical benefits of the six-gene signature: first, designing specific probes and quantifying the expression of the six genes in FFPET, as Scott et al. did(41); second, evaluating the predictive accuracy of the six-gene model in FFPET; and third, validating the model in independent cohorts(42).
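To make the interpretability argument concrete, the form of a six-gene logistic model can be sketched as follows. The coefficients and intercept are placeholders, not the fitted values from this study; only their signs follow the expression pattern discussed above (MYBL1, MAML3, and S1PR2 higher in GCB; TNFRSF13B, CYB5R2, and BATF higher in ABC). The function name `prob_abc` and the assumption of standardized expression values are also hypothetical.

```python
import math

# Placeholder coefficients (assumption): signs reflect the direction of
# expression described in the text; magnitudes are NOT fitted values.
COEFS = {
    "MYBL1": -1.0,
    "MAML3": -1.0,
    "S1PR2": -1.0,
    "TNFRSF13B": 1.0,
    "CYB5R2": 1.0,
    "BATF": 1.0,
}
INTERCEPT = 0.0  # placeholder

def prob_abc(expr):
    """Return the modeled probability of the ABC subtype for one sample.

    `expr` maps gene symbol -> standardized expression value (assumed scale).
    """
    score = INTERCEPT + sum(COEFS[g] * expr[g] for g in COEFS)
    return 1.0 / (1.0 + math.exp(-score))

# Two made-up samples: one GCB-like and one ABC-like expression profile.
gcb_like = {"MYBL1": 1.5, "MAML3": 1.0, "S1PR2": 1.2,
            "TNFRSF13B": -0.8, "CYB5R2": -0.5, "BATF": -1.0}
abc_like = {g: -v for g, v in gcb_like.items()}

p_gcb_like = prob_abc(gcb_like)
p_abc_like = prob_abc(abc_like)
print(p_gcb_like, p_abc_like)
```

Each gene contributes one additive term to the log-odds, so the direction and size of its effect can be read directly from its coefficient.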
In summary, we developed a strategy for penalized regression analysis that ranks variables based on the relationship between the tuning parameters and the number of variables selected into the model. This strategy can be applied to determine the optimal gene subset for cancer subtype classification, diagnosis, and prognosis. In this study, we applied this strategy to DLBCL stratification. Six genes were eventually identified as composite markers for both subtype classification and prognostic prediction. Further, the predictive performance of the six genes was validated in an external dataset, which demonstrated the effectiveness of our strategy. Finally, the ordered gene list provides a direction for further functional and targeted drug research.