Subtype classification and prognosis of diffuse large B-cell lymphoma based on variable importance analysis
Background Gene expression profiling (GEP) is considered the gold standard for cell-of-origin classification of diffuse large B-cell lymphoma (DLBCL), but its high dimensionality limits its application in clinical practice. Penalized regression is commonly used to select an optimal gene subset for classification from high-dimensional gene expression data. However, the results of penalized regression methods are sensitive to the choice of tuning parameters.
Results To address the instability of penalized regression methods, we proposed a strategy that measures the importance of variables with an aggregated index. This strategy was applied to six penalized methods to identify a small gene subset for DLBCL classification. Using a training dataset of 350 DLBCL patients, we identified six genes (MYBL1, TNFRSF13B, MAML3, CYB5R2, BATF, and S1PR2) as the optimal gene subset for DLBCL classification. In the training dataset, the AUC was 0.9986 (95% CI 0.9967–1) and the discrimination slope (DS) was 0.9442 (95% CI 0.9203–0.9661). Discriminative performance was further validated in the external dataset, with an AUC of 0.9455 (95% CI 0.9298–0.9612) and a DS of 0.6211 (95% CI 0.5824–0.6591). Additionally, calibration and clinical usefulness were satisfactory in both datasets. Subgroups of patients characterized by these six genes showed significantly different prognoses. Furthermore, model comparisons demonstrated that the six-gene model outperformed models constructed by typical penalized regression methods.
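The two discrimination measures reported above can be illustrated with a short sketch. It assumes the standard definitions: the AUC as the probability that a randomly chosen case receives a higher predicted probability than a randomly chosen non-case (ties counted as one half), and the discrimination slope as the difference between the mean predicted probability in cases and in non-cases. The function names and the toy labels and probabilities are hypothetical and are not taken from the study data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise (Mann-Whitney) comparison.

    labels: iterable of 0/1 class labels; scores: predicted probabilities.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # case ranked above non-case
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


def discrimination_slope(labels, scores):
    """Mean predicted probability in cases minus mean in non-cases."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


# Hypothetical toy example (not study data): 3 cases, 3 non-cases.
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]

print(f"AUC = {auc(labels, probs):.3f}")
print(f"DS  = {discrimination_slope(labels, probs):.3f}")
```

The pairwise AUC is quadratic in the sample size, which is acceptable for illustration; a rank-based formula would be preferable for large cohorts.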
Conclusions The six genes had considerable clinical usefulness in DLBCL classification and prognosis. Penalized variable importance analysis is an efficient strategy to identify an optimal gene subset with good predictive performance.
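The abstract does not define the aggregated importance index, so the following is only a sketch of one plausible aggregation, not the authors' method. It assumes that each penalized method has already been fitted over a grid of tuning parameter values, and it scores each gene by the fraction of all fits in which its coefficient is nonzero (a selection frequency). The names `selection_frequency`, `top_genes`, and `coef_paths`, the placeholder gene labels, and the toy coefficients are all hypothetical.

```python
def selection_frequency(coef_paths):
    """Fraction of fits in which each gene has a nonzero coefficient.

    coef_paths: dict mapping a method name to a list of fits, one per
    tuning parameter value; each fit is a dict {gene: coefficient}.
    This is an assumed stand-in for an aggregated importance index.
    """
    counts = {}
    total_fits = 0
    for fits in coef_paths.values():
        for coefs in fits:
            total_fits += 1
            for gene, beta in coefs.items():
                counts.setdefault(gene, 0)
                if beta != 0.0:
                    counts[gene] += 1
    return {gene: c / total_fits for gene, c in counts.items()}


def top_genes(freq, k):
    """Return the k genes with the highest frequency (ties broken by name)."""
    return sorted(freq, key=lambda g: (-freq[g], g))[:k]


# Hypothetical coefficients from two methods, two tuning values each.
coef_paths = {
    "lasso": [
        {"gene_a": 0.8, "gene_b": 0.0, "gene_c": 0.0},
        {"gene_a": 1.1, "gene_b": 0.3, "gene_c": 0.0},
    ],
    "elastic_net": [
        {"gene_a": 0.7, "gene_b": 0.2, "gene_c": 0.0},
        {"gene_a": 0.9, "gene_b": 0.4, "gene_c": 0.1},
    ],
}

freq = selection_frequency(coef_paths)
print(freq)
print(top_genes(freq, 2))
```

Averaging over the whole tuning grid, rather than relying on a single selected value, is what makes such an index less sensitive to the tuning parameter. Note that this sketch weights every fit equally, so a method with a longer grid contributes more to the index.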
Figure 1
Figure 2
Figure 3
Posted 24 May, 2020