Subtype classication and prognosis of diffuse large B-cell lymphoma based on variable importance analysis

Background Gene expression proling (GEP) is considered as gold standard for cell-of-origin classication of diffuse large B-cell lymphoma (DLBCL). The high dimensionality of GEP limits its application in clinical practice. Penalized regression was commonly used to determine the optimal gene subset for classication in high dimensional gene data. However, the results of penalized regression methods were affected by the tuning parameters. Results To solve the instability of penalized regression methods, we proposed a strategy to measure the importance of variables with an aggregated index. This strategy was applied to six penalized methods to identify a small gene subset for DLBCL classication. Using a training dataset of 350 DLBCL patients, we identied six genes (MYBL1, TNFRSF13B, MAML3, CYB5R2, BATF, and S1PR2) as the optimal gene subset for DLBCL classication. The AUC was 0.9986 (95%CI 0.9967–1) and discrimination slope (DS) was 0.9442 (95%CI 0.9203–0.9661) in the training dataset. The discriminative performances were further validated in the external dataset with an AUC of 0.9455 (95%CI 0.9298–0.9612) and DS of 0.6211 (95%CI 0.5824–0.6591). Additionally, the calibration and clinical usefulness were apt in both datasets. Subgroups patients characterized by these six genes showed signicantly different prognosis. Furthermore, model comparisons demonstrated that the six-gene model outperformed models constructed by typical penalized regression methods. Penalized variable importance analysis is ecient to identify with good predictive datasets (ii) subtype datasets molecular datasets


Sample features
summarizes the relationship between the clinical characteristics and DLBCL subtypes in the training and validation dataset (CombatData). To su ciently describe the distributions of clinical characteristics of CombatData, we also listed GSE31312 and GSE23501 in Table 1 because they provided su cient clinical and demographic information. There was signi cant difference in IPI between ABC and GCB subtype both in training and validation dataset. The proportion of high risk (IPI ≥ 3) DLBCL patients in ABC subtype is signi cantly higher than GCB subtype. In training dataset, the difference in IPI may be due to the signi cant difference in age, disease-stage, serum LDH levels and the performance status (ECOG) between these two subtypes. In the validation dataset, the difference in IPI may be attributed to signi cant difference in age and disease-stage. The median survival time for training and validation dataset was 6.80 and 6.95 years, respectively.     Identi cation of the optimal subset for DLBCL classi cation The six penalized methods were applied to the training dataset. Subsequently, the variables were ranked according to the selected frequency. Variables ranked rst were relatively important. As shown in Table 2, different methods generated different ranked gene sets, which made it di cult to select the optimal variables. Thus, the RankAggreg method was used to merge individual ranking lists into a single "super"list. This "super"-list re ected the overall importance of variables ( Table 2). The annotated gene lists were displayed in Table S1 (Additional le 1). Genes in the "super"-list are MYBL1, TNFRSF13B, MAML3, CYB5R2, BATF, S1PR2, ASB13, LIMD1, SERPINA9, ENTPD1, LMO2, FUT8, PALD1, ZBTB32, BCL2L10, PDE7A, respectively.  To identify the best gene subset for DLBCL strati cation, variables in the "super"-list were sequentially added to the logistic models, and the AUC was calculated (Fig. 1A). According to the results of Delong's test, the addition of the seventh variable did not signi cantly increase the AUC of the logistic model (top 6 AUC vs. top 7 AUC: z = 1.317, p = 0.188). Thus, the top six genes were determined as the optimal gene subset for DLBCL classi cation (six-gene model). The expression pattern of the six genes is shown in Fig. 1B. The expression levels of TNFRSF13B, CYB5R2, and BATF were relatively up-regulated in patients with ABC subtype, whereas the levels of MAML3, MYBL1, and S1PR2 were down-regulated. Table 2 is here and is attached to the end of manuscript.

Model performance
The six-gene model exhibited good discriminative performance in the training dataset. The AUC was 0.9986 (95%CI: 0.9967-1), and the DS (Additional le 2: Figure S1A including sensitivity, speci city, positive predictive value (PPV), and negative predictive value (NPV), we set 0.5 as the threshold for assigning patients to ABC or GCB subtypes. In the training dataset, the sensitivity, speci city, PPV, and NPV were 0.988, 0.978, 0.976, and 0.989, respectively. In the validation dataset (CombatData), the four metrics were 0.845, 0.871, 0.851, and 0.866, respectively (Additional le 3: Figure S2). The calibration performance is shown in Fig. 2C and Fig. 2D (Additional le 4: Figure S3). It indicated that the correspondence was close between the mean predicted probabilities and the observed outcomes in both the training and validation datasets. Figure 2E and Fig. 2F (Additional le 5: Figure S4) shows the net bene t for the mRNA-based molecular subtype signature. The six-gene model for DLBCL classi cation had a higher net bene t than the two alternative strategies: all or none of the patients was assigned to the ABC subtype. (CombatData), survival analyses indicated that there were signi cant differences in the OS and progression-free survival (PFS) rate between the two predicted subgroups (OS: p = 0.003, PFS: p < 0.001; Fig. 3B and 3C). The median OS and PFS time for the predicted ABC subgroup was 6.13 and 3.46 years, respectively. As shown in Table 3, patients with the predicted ABC subtype had a higher risk of mortality than patients with the predicted GCB subtype after adjusting for IPI (HR = 1.530, 95%CI: 1.104-2.121).
Above all, in the presence of the information from IPI, the predicted subtype by the six genes remained as signi cant risk predictor both in training and validation datasets, indicating that the six genes contained prognostic information that was independent of IPI. Figure S5 (Additional le 7) and Table 3 show that the difference of OS between the two predicted subgroups was not signi cant in GSE23501. This may be due to fewer (10/60) patient deaths during the follow-up years.  Table 3 is here and is attached to the end of manuscript.

Discussion
The penalized regression model has been primarily used for variable selection in high-dimensional data analysis. However, if a variable is selected by the penalized regression model, it does not necessarily indicate that the variable is important, because the results of penalized regression are affected by tuning parameters (10). The number of variables selected into the model would decrease with increasing .
When was large, variables that could still be selected into the model were considered relatively important. Based on this property, we proposed a strategy to measure the importance of variable based on the penalized regression analysis. This strategy can be used to analyze GEP to detect an optimal gene subset associated with cancer subtype classi cation, diagnosis, and prognosis. In this study, we applied this strategy for DLBCL classi cation analysis. Finally, six genes were identi ed as an optimal gene subset for both subtype classi cation and survival prediction in DLBCL. The predictive and prognostic performances of those six genes were further validated in the external dataset. What's more, taking χ 2 χ 2 λ λ simplicity and predictability of clinical models into consideration, we found that the six-gene model outperformed the typically penalized regression models. All these indicated that our strategy is effective.
MYBL1, MAML3, and S1PR2 were highly expressed in patients with GCB subtype relative to levels in patients with the ABC subtype. MYBL1 is a member of the myb transcription factor family. All members of the myb family are involved in the regulation of proliferation and/or differentiation of different hematopoietic cells, of which MYBL1 regulates the proliferation and/or differentiation of germinal center (GC) B cells (17). Jose´e Golay et al. suggested that MYBL1 could be a speci c marker for proliferating centroblasts due to its speci c induction (17). Subsequently, several studies based on GEP analysis also demonstrated that MYBL1 could be regarded as a biomarker to classify DLBCL, highly consistent with our results(3, 5-7). Sphingosine-1-phosphate receptor 2 (S1PR2) is a G-protein-coupled receptor (GPCR). It couples Gα12 or Gα13 (encoded by GNA12 and GNA13) to induce apoptosis (18). Jagan R. Muppidi et al. found that mutations that result in S1PR2 inactivation were exclusively in the GCB subtype and hardly ever occurred in the ABC subtype (19). Additionally, in a mouse model lacking both alleles of S1PR2, half of the mice developed B-cell lymphomas with GCB morphology and molecular characteristics (20).
However, the expression of S1PR2 in the GCB subtype was relatively higher than that in the ABC subtype in our study. This is because FOXP1, whose function is to repress S1PR2 expression, was highly expressed in the ABC subtype (18). MAML3 belongs to the Mastermind-like (MAML) family, including MAML1, MAML2, and MAML3. Members of this family are essential transcriptional coactivators for Notch-induced transcription events (21). MAML1 was reported to be a regulator of the NF-κB signal pathway, and activation of the NF-κB pathway is a main characteristic of the ABC subtype (22,23). Kochert et al. showed that MAML2 was highly expressed in several types of B cell-derived lymphomas relative to normal B-cells (24). However, functional study on MAML3 is limited. Studies have shown that MAML3 overexpression may be involved in cancer metastasis (25), suggesting that MAML3 overexpression was associated with poor prognosis, which was discordant with our results. It implied that MAML3 may involve other regulatory and oncogenic mechanisms in DLBCL.
TNFRSF13B, CYB5R2, and BATF were relatively overexpressed in the ABC subgroup. TNFRS13B, also known as TACI (transmembrane activator and calcium-modulating cyclophilin ligand interactor), is a member of the tumor necrosis factor (TNF) receptor superfamily (26). TACI and its ligands, BAFF and APRIL, are critical factors for the growth and survival of both normal and malignant B cells (27).
Accumulating evidence indicated that TACI tended to be frequently expressed in the ABC subtype and was considered as a classi er in the DLBCL classi cation (6,28). The high expression of TACI may be one of the reasons for NF-κB pathway activation in the ABC subtype, because the combination of TACI and its ligand could activate the NF-κB pathway (29). CYB5R2 belongs to the cytochrome reductase family. This enzyme has been shown to be involved in oxidation reduction, drug metabolism, methemoglobin reduction in erythrocytes, and lipid metabolism (30,31). The expression of CYB5R2 varied in different cancers. CYB5R2 has been considered as a tumor suppressor gene and was shown to be inactivated in prostate cancer, breast cancer, and nasopharynx carcinoma (32)(33)(34). However, Lotem et al. reported that CYB5R2 was up-regulated in B cell acute lymphocytic leukemia (35). The role of CYB5R2 in lymphomas is poorly understood. Qun Liu et al. found that CYB5R2 expression was highly correlated with many genes of the Toll pathway, suggesting that CYB5R2 may be associated with cancer invasion (36). The protein encoded by BATF is a nuclear basic leucine zipper protein that belongs to the AP-1/ATF superfamily of transcription factors (37). BATF has been shown to play an important role in T-and B-cells during immune responses, and BATF controls global regulators of class-switch recombination in both T-and B-cells (38). Interestingly, in the context of B-cell malignancy, BATF was consistently linked to the ABC subtype (6,39). Jun Li et al. identi ed BATF as the target of the NF-κB pathway (40). It implied that the overexpressed BATF in the ABC subtype may be associated with abnormal activation of the NF-κB pathway.
To date, several signatures based on gene expression have been developed to determine COO subtype, some of which use formalin-xed, para n-embedded tissue (FFPET). The Lymph2X assay proposed by Scott et al. is one of these methods that use FFPET (41). It is a 20-gene signature, ve of which are the member of six-gene model. There is no doubt analyzing FFPET is more clinically practical than frozen tissue(9, 41), but six-gene model is more parsimonious and easier to be explained than 20-gene signature, because it's a simple logistic model. Whereas, the 20-gene signature is a weighted average of the 15 predictive genes, and Scott et al. did not describe how the weight was calculated in detail (41). This may limit its widespread use. Additionally, we ranked genes based on their importance, which provides an ordering of genes in term of priority for further functional and targeted drug research. Certainly, to realize the potential clinical bene ts of the six-gene signatures, further efforts are needed: rstly, designing speci c probes and quantifying expression of six genes in FFPET, as Scott et al. did (41); secondly, evaluating the predictive accuracy of six-gene model in FFPET; thirdly, validation in independent cohorts (42).
In summary, in the penalized regression analysis, we developed a strategy to rank variables based on the relationship between tuning parameters and the number of variables selected into the model. This strategy can be applied to determine the optimal gene subset for cancer subtype classi cation, diagnosis, and prognosis. In this study, we applied this strategy for DLBCL strati cation. Six genes were eventually identi ed as composite markers for both subtype classi cation and prognostic prediction. Further, the predictive performance of the six genes was validated in an external dataset, which demonstrated the e ciency of our strategy. Finally, the ordered gene list provides a direction for further functional and targeted drug research.

Conclusions
The six genes had considerable clinical usefulness in DLBCL classi cation and prognosis. Penalized variable importance analysis is an e cient strategy to identify an optimal gene subset with good predictive performance.

Methods
Variable importance analyses LASSO (least absolutes shrinkage and selection operator) (14), aLASSO (adaptive LASSO) (12), EN (elastic net) (13), ridge regression (16), SCAD (smoothly clipped absolute operator) (11), and MCP (minimax concave penalty) (15) are commonly used penalized methods. The basic idea of these methods is to subtract a penalty term from an objective function, and thus set some regression coe cients to zero (11)(12)(13)(14)(15)(16). The penalty term is a function of the absolute value of regression coe cients and tuning parameters. Among all tuning parameters, penalty factor is usually determined by using crossvalidation methods, which results in the instability of penalized regression methods. The bigger is, the fewer variables will be selected into the model. Conversely, the smaller is, the more variables are in the nal model.
When is large, variables that can still be selected into the model are relatively important. Based on this property of penalized methods, we proposed a strategy to measure the importance of selected variables. This strategy contained three steps: (1) generating 100 using pathwise coordinate descent method (43,44); (2) constructing penalized regression model using each to select variables; and (3) ranking variables according to the frequency that they were selected in 100 penalized regressions. As such, more frequently selected variables are considered more important.
Different methods usually generate different ranked gene sets (45). To obtain a single list that re ects the overall importance of variables, a rank aggregation method proposed by Pihur et al. (hereafter referred as RankAggreg) was used (46). The basic idea of RankAggreg is that it takes every ranking list into consideration and nds a "super"-list, which would be as "close" as possible to all individually ordered lists simultaneously (46).

Identifying optimal variable subset
To obtain a small variable subset that can still achieve good predictive performance, variables based on the "super"-list were sequentially added to the logistic model (45). The area under the receiver operating characteristic curve (AUC) was used to determine the number of variables selected into the model (45). A variable whose addition would make the AUC reach the statistical maximum was chosen as the threshold. For instance, the addition of the fth variable signi cantly increased the AUC, whereas addition of the sixth variable did not. We chose the top ve variables as the optimal variable subset. The statistical tests of paired AUC were conducted using Delong's method (47).
The above-mentioned methods can be used to identify the optimal gene subset for cancer subtype classi cation, diagnosis, and prognosis. In this study, we applied this method to discover the optimal mRNA-based molecular signature for DLBCL classi cation. The response variable Y of logistic model is equal to 1 for patients with ABC subtype otherwise is 0.
Model performance assessment (48,49) λ λ λ λ λ λ Based on the optimal gene subset, a multivariable logistic regression model was constructed as the nal prediction model. The performance of the prediction model was assessed in terms of discrimination, calibration, and decision curve analysis. Discrimination referred to the ability of the model to separate patients with the ABC subtype from those with the GCB subtype. It was quanti ed based on AUC and the discrimination slope (DS). Calibration performance indicated the agreement between observed outcomes and predictions. It was assessed by a exible calibration curve, which could be generated based on a nonparametric loess smoother. The model was recalibrated in validation datasets to reduce the effect of miscalibration (50). Decision curve analysis was used to explore the clinical usefulness of the mRNAbased molecular signature. It was presented through a decision curve.
Additionally, we also compared our model with typical LASSO, EN, MCP and SCAD. The tuning parameter in the typically penalized methods was chosen based on the rule of minimum mean cross-validated error. Considered that a good clinical prediction model should predict accurately and be parsimonious (51), we use the number of variables contained in the model, the change of AUC (ΔAUC), category-free net reclassi cation improvement (NRI > 0) and integrated discrimination improvement (IDI) to compare model performance. More details about model comparison can be found in Additional le 6.

Survival analysis
To assess the prognostic performance of the mRNA-based molecular subtype signature, survival analyses were performed. Survival curves of the predicted subtypes were estimated using the Kaplan-Meier method, and the signi cance test was performed using the log-rank test. Univariate and multivariable Cox proportional hazards models were constructed to evaluate the association among the mRNA-based molecular subtype signature, clinical characters, and survival in each dataset.
For all analyses, a P value less than 0.05 indicated statistical signi cance. Further, all analyses were conducted using R software (version 3.5.1).

Patients and GEP analysis
The raw GEP and clinical information of DLBCL patients were downloaded from the Gene Expression Omnibus (GEO) database. Datasets were identi ed if they met the following criteria: (i) datasets were created using Affymetrix Human Genome U133 Plus 2.0 (HG-U133 Plus_2.0); and (ii) subtype information was available. Consequently, six datasets GSE10846, GSE31312, GSE93984, GSE23501, GSE56313, and GSE64555 were included. Samples with incomplete subtype and/or unclassi ed or overlapping information in the datasets were excluded, and 974 DLBCL patients were eventually enrolled for the analyses. A group of 350 patients from Lenz's study (the accession number is GSE10846) was used as the training dataset to identify mRNA-based molecular subtype signatures. Other datasets were merged using Combat method and were used for validation (hereafter referred as CombatData) (52). To fully judge the generalizability of the model, we also assessed the model performance in each component of CombatData.
λ Each raw GEP dataset was preprocessed using the robust multi-array average (RMA) algorithm (53). After background correction, normalization, and summarization, the differentially expressed genes between the ABC and GCB subgroups were determined using Liner models and empirical Bayes methods (54).
Subsequently, the z-score transformations were performed on the genes with interquartile ranges greater than or equal to 1 (55,56). Finally, 3240 genes were considered as candidate mRNA-based molecular subtype signatures.

Declarations
Ethics approval and consent to participate Ethical approval and consent to participate was waived since this study was completely based on the publicly available GEO database.

Consent for publication
Not applicable.

Availability of data and materials
The results reported here are based on data generated by the GEO under accession numbers GSE10846, GSE31312, GSE23501, GSE93984, GSE56313, and GSE64555.  Six-gene model performance assessments. (A) AUC for each validation datasets (CombatData and its component); (B) discrimination slope for validation dataset. The boxplot display the distribution of predicted probabilities P (Y=1|X) for ABC and GCB subtype respectively; (C,D)calibration plot for training and validation dataset, respectively. For a perfect calibration, the mean predicted probabilities are close to observed ones for all groups; (E,F)decision curves for training and validation datasets, respectively. The black line corresponds to the net bene t when no patients assign to ABC subtype, whereas the grey solid