Generation and validation of STDs
The CTGAN, copula GAN, and TT-GAN models were trained for comparison (Appendix 2). We then employed RF, CatBoost, XGBoost, and LightGBM as regression classifiers to predict continuous variables and implemented the discretization and converter methodology. To assess prediction performance, we evaluated the STD generated by each GAN model separately with the RF, CatBoost, XGBoost, and LightGBM prediction models (Appendix 3). In total, we generated 1,616 lung cancer STD and 4,766 liver cancer STD.
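The evaluation above reports AUC values in the form "mean ± spread". As a minimal sketch of how such figures could be produced, the Python below computes a rank-based AUC and summarizes repeated runs. It rests on assumptions not stated in this section: that the ± term is a sample standard deviation over repeated runs, and that AUC is reported in percent. The function names `roc_auc` and `summarize` and the toy scores are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev


def roc_auc(y_true, y_score):
    """Rank-based AUC: the probability that a randomly chosen positive
    case receives a higher score than a randomly chosen negative case
    (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def summarize(aucs):
    """Format AUC fractions from repeated runs as 'mean ± SD' in percent.
    Assumption: SD is the sample standard deviation."""
    pct = [100 * a for a in aucs]
    return f"{mean(pct):.2f} ± {stdev(pct):.2f}"


# Toy example: one fixed real test set, scores from three hypothetical runs
# (for instance, models trained on three independently generated STD).
y_true = [1, 1, 1, 0, 0, 0]
runs = [
    [0.9, 0.8, 0.4, 0.5, 0.3, 0.2],
    [0.9, 0.6, 0.7, 0.5, 0.3, 0.2],
    [0.9, 0.8, 0.2, 0.5, 0.3, 0.1],
]
aucs = [roc_auc(y_true, scores) for scores in runs]
print(summarize(aucs))
```

In practice a library routine would replace the quadratic pairwise loop; the explicit version is used here only to keep the definition of AUC visible.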
As shown in Table 1, for the lung cancer dataset, the following AUC values were obtained for the original dataset: RF: 85.02%, CatBoost: 86.02%, XGBoost: 84.24%, and LightGBM: 84.49%. Model performance was first examined on the STD generated by each GAN model without the preprocessing stage. For the STD generated by CTGAN, the AUC was 84.00 ± 0.55 for RF, 83.80 ± 0.45 for CatBoost, 81.20 ± 0.48 for XGBoost, and 82.88 ± 0.72 for LightGBM. For the STD generated by copula GAN, the values were 84.45 ± 0.26 for RF, 81.58 ± 0.77 for CatBoost, 79.40 ± 0.71 for XGBoost, and 84.07 ± 0.58 for LightGBM. The STD generated by TT-GAN yielded an AUC of 81.53 ± 0.46 for RF, 82.64 ± 0.44 for CatBoost, 84.45 ± 0.51 for XGBoost, and 84.32 ± 0.18 for LightGBM.
Model performance was then assessed on the STD generated by each GAN model after the preprocessing stage. The STD generated by CTGAN with the RF classifier yielded AUC values of 82.31 ± 0.50 for RF, 83.68 ± 0.74 for CatBoost, 81.33 ± 0.55 for XGBoost, and 80.59 ± 0.32 for LightGBM. With the CatBoost classifier, the AUC was 83.39 ± 0.33 for RF, 83.93 ± 0.67 for CatBoost, 83.07 ± 1.13 for XGBoost, and 82.38 ± 0.31 for LightGBM. With the XGBoost classifier, the AUC was 82.95 ± 0.55 for RF, 83.58 ± 0.24 for CatBoost, 82.09 ± 1.02 for XGBoost, and 82.67 ± 0.54 for LightGBM. With the LightGBM classifier, the AUC was 83.32 ± 0.67 for RF, 82.99 ± 0.28 for CatBoost, 80.76 ± 0.58 for XGBoost, and 82.97 ± 1.71 for LightGBM.
For the STD generated by copula GAN with the RF classifier, the AUC was 81.18 ± 0.83 for RF, 80.50 ± 0.80 for CatBoost, 78.30 ± 0.92 for XGBoost, and 77.78 ± 1.61 for LightGBM. With the CatBoost classifier, the AUC was 81.90 ± 0.21 for RF, 82.17 ± 0.50 for CatBoost, 79.81 ± 1.01 for XGBoost, and 81.65 ± 1.16 for LightGBM. With the XGBoost classifier, the AUC was 81.50 ± 0.76 for RF, 82.30 ± 0.77 for CatBoost, 80.35 ± 0.60 for XGBoost, and 82.50 ± 0.87 for LightGBM. With the LightGBM classifier, the AUC was 81.18 ± 0.78 for RF, 80.04 ± 0.92 for CatBoost, 81.36 ± 0.44 for XGBoost, and 81.27 ± 1.07 for LightGBM.
Further, the TT-GAN-derived lung STD with the RF classifier yielded AUC values of 83.53 ± 0.22 for RF, 83.92 ± 0.44 for CatBoost, 83.19 ± 0.77 for XGBoost, and 82.58 ± 0.19 for LightGBM (Table 1). With the CatBoost classifier, the AUC was 84.69 ± 0.55 for RF, 85.86 ± 0.30 for CatBoost, 85.94 ± 0.51 for XGBoost, and 84.55 ± 0.56 for LightGBM. With the XGBoost classifier, the AUC was 84.84 ± 0.47 for RF, 85.91 ± 0.14 for CatBoost, 85.44 ± 0.19 for XGBoost, and 85.34 ± 0.60 for LightGBM. With the LightGBM classifier, the AUC was 84.69 ± 0.37 for RF, 85.69 ± 0.09 for CatBoost, 82.97 ± 0.36 for XGBoost, and 85.42 ± 0.71 for LightGBM.
Table 1. Performance evaluation of prediction models using lung cancer SSD test dataset
Values are AUC (%) by prediction model (RF, CatBoost, XGBoost, LightGBM).

| Data | Generator | Classifier | RF | CatBoost | XGBoost | LightGBM |
|---|---|---|---|---|---|---|
| Original | - | - | 85.02% | 86.02% | 84.24% | 84.49% |
| Without Discretization and converter | CTGAN | - | 84.00 ± 0.55 | 83.80 ± 0.45 | 81.20 ± 0.48 | 82.88 ± 0.72 |
| | Copula GAN | - | 84.45 ± 0.26 | 81.58 ± 0.77 | 79.40 ± 0.71 | 84.07 ± 0.58 |
| | TT-GAN | - | 81.53 ± 0.46 | 82.64 ± 0.44 | 84.45 ± 0.51 | 84.32 ± 0.18 |
| Discretization and converter | CTGAN | RF | 82.31 ± 0.50 | 83.68 ± 0.74 | 81.33 ± 0.55 | 80.59 ± 0.32 |
| | | CatBoost | 83.39 ± 0.33 | 83.93 ± 0.67 | 83.07 ± 1.13 | 82.38 ± 0.31 |
| | | XGBoost | 82.95 ± 0.55 | 83.58 ± 0.24 | 82.09 ± 1.02 | 82.67 ± 0.54 |
| | | LightGBM | 83.32 ± 0.67 | 82.99 ± 0.28 | 80.76 ± 0.58 | 82.97 ± 1.71 |
| | Copula GAN | RF | 81.18 ± 0.83 | 80.50 ± 0.80 | 78.30 ± 0.92 | 77.78 ± 1.61 |
| | | CatBoost | 81.90 ± 0.21 | 82.17 ± 0.50 | 79.81 ± 1.01 | 81.65 ± 1.16 |
| | | XGBoost | 81.50 ± 0.76 | 82.30 ± 0.77 | 80.35 ± 0.60 | 82.50 ± 0.87 |
| | | LightGBM | 81.18 ± 0.78 | 80.04 ± 0.92 | 81.36 ± 0.44 | 81.27 ± 1.07 |
| | TT-GAN | RF | 83.53 ± 0.22 | 83.92 ± 0.44 | 83.19 ± 0.77 | 82.58 ± 0.19 |
| | | CatBoost | 84.69 ± 0.55 | 85.86 ± 0.30 | 85.94 ± 0.51 | 84.55 ± 0.56 |
| | | XGBoost | 84.84 ± 0.47 | 85.91 ± 0.14 | 85.44 ± 0.19 | 85.34 ± 0.60 |
| | | LightGBM | 84.69 ± 0.37 | 85.69 ± 0.09 | 82.97 ± 0.36 | 85.42 ± 0.71 |
Table 2 shows the performance of the RF, CatBoost, XGBoost, and LightGBM prediction models for the liver cancer dataset, evaluated using the AUC on the test set. The original dataset showed AUC values of 85.96% for RF, 86.69% for CatBoost, 85.14% for XGBoost, and 85.91% for LightGBM. Without the preprocessing stage, the STD from CTGAN exhibited AUC values of 83.31 ± 0.17 for RF, 83.81 ± 0.23 for CatBoost, 81.20 ± 0.50 for XGBoost, and 82.69 ± 0.19 for LightGBM. The STD from copula GAN exhibited AUC values of 82.46 ± 0.07 for RF, 83.61 ± 0.24 for CatBoost, 80.93 ± 0.62 for XGBoost, and 82.53 ± 0.42 for LightGBM. The STD from TT-GAN exhibited AUC values of 80.29 ± 0.14 for RF, 81.98 ± 0.35 for CatBoost, 80.43 ± 0.31 for XGBoost, and 80.33 ± 0.37 for LightGBM.
With the preprocessing stage, the STD generated by CTGAN with the RF classifier yielded AUC values of 81.77 ± 0.21 for RF, 82.78 ± 0.52 for CatBoost, 79.60 ± 0.62 for XGBoost, and 80.94 ± 0.34 for LightGBM. The CatBoost classifier yielded AUC values of 82.65 ± 0.24 for RF, 81.00 ± 0.33 for CatBoost, 77.60 ± 0.27 for XGBoost, and 80.34 ± 0.60 for LightGBM. The XGBoost classifier yielded AUC values of 82.96 ± 0.21 for RF, 82.44 ± 0.50 for CatBoost, 80.43 ± 0.50 for XGBoost, and 81.81 ± 0.48 for LightGBM. Finally, the LightGBM classifier yielded AUC values of 82.47 ± 0.43 for RF, 81.47 ± 0.29 for CatBoost, 78.76 ± 0.41 for XGBoost, and 80.34 ± 0.19 for LightGBM.
The STD generated by copula GAN with the RF classifier exhibited AUC values of 78.95 ± 0.28 for RF, 71.70 ± 0.70 for CatBoost, 65.62 ± 2.14 for XGBoost, and 74.54 ± 1.42 for LightGBM. The CatBoost classifier yielded AUC values of 80.95 ± 0.38 for RF, 79.41 ± 0.69 for CatBoost, 75.10 ± 1.09 for XGBoost, and 78.69 ± 1.41 for LightGBM. The XGBoost classifier yielded AUC values of 78.96 ± 0.45 for RF, 79.75 ± 1.13 for CatBoost, 74.95 ± 1.24 for XGBoost, and 74.30 ± 1.42 for LightGBM. The LightGBM classifier yielded AUC values of 77.67 ± 1.01 for RF, 70.46 ± 1.15 for CatBoost, 68.46 ± 1.43 for XGBoost, and 71.32 ± 0.56 for LightGBM.
For the STD generated by TT-GAN, the RF classifier yielded AUC values of 83.24 ± 0.26 for RF, 83.83 ± 0.13 for CatBoost, 82.99 ± 0.26 for XGBoost, and 82.76 ± 0.19 for LightGBM. The CatBoost classifier yielded AUC values of 83.32 ± 0.24 for RF, 83.96 ± 0.19 for CatBoost, 83.10 ± 0.31 for XGBoost, and 82.37 ± 0.18 for LightGBM. The XGBoost classifier yielded AUC values of 83.32 ± 0.18 for RF, 84.06 ± 0.15 for CatBoost, 83.29 ± 0.15 for XGBoost, and 84.04 ± 0.20 for LightGBM. Finally, the LightGBM classifier yielded AUC values of 82.32 ± 0.37 for RF, 84.13 ± 0.12 for CatBoost, 83.28 ± 0.46 for XGBoost, and 83.16 ± 0.48 for LightGBM.
Table 2. Performance evaluation of prediction models using liver cancer SSD test dataset
Values are AUC (%) by prediction model (RF, CatBoost, XGBoost, LightGBM).

| Data | Generator | Classifier | RF | CatBoost | XGBoost | LightGBM |
|---|---|---|---|---|---|---|
| Original | - | - | 85.96% | 86.69% | 85.14% | 85.91% |
| Without Discretization and converter | CTGAN | - | 83.31 ± 0.17 | 83.81 ± 0.23 | 81.20 ± 0.50 | 82.69 ± 0.19 |
| | Copula GAN | - | 82.46 ± 0.07 | 83.61 ± 0.24 | 80.93 ± 0.62 | 82.53 ± 0.42 |
| | TT-GAN | - | 80.29 ± 0.14 | 81.98 ± 0.35 | 80.43 ± 0.31 | 80.33 ± 0.37 |
| Discretization and converter | CTGAN | RF | 81.77 ± 0.21 | 82.78 ± 0.52 | 79.60 ± 0.62 | 80.94 ± 0.34 |
| | | CatBoost | 82.65 ± 0.24 | 81.00 ± 0.33 | 77.60 ± 0.27 | 80.34 ± 0.60 |
| | | XGBoost | 82.96 ± 0.21 | 82.44 ± 0.50 | 80.43 ± 0.50 | 81.81 ± 0.48 |
| | | LightGBM | 82.47 ± 0.43 | 81.47 ± 0.29 | 78.76 ± 0.41 | 80.34 ± 0.19 |
| | Copula GAN | RF | 78.95 ± 0.28 | 71.70 ± 0.70 | 65.62 ± 2.14 | 74.54 ± 1.42 |
| | | CatBoost | 80.95 ± 0.38 | 79.41 ± 0.69 | 75.10 ± 1.09 | 78.69 ± 1.41 |
| | | XGBoost | 78.96 ± 0.45 | 79.75 ± 1.13 | 74.95 ± 1.24 | 74.30 ± 1.42 |
| | | LightGBM | 77.67 ± 1.01 | 70.46 ± 1.15 | 68.46 ± 1.43 | 71.32 ± 0.56 |
| | TT-GAN | RF | 83.24 ± 0.26 | 83.83 ± 0.13 | 82.99 ± 0.26 | 82.76 ± 0.19 |
| | | CatBoost | 83.32 ± 0.24 | 83.96 ± 0.19 | 83.10 ± 0.31 | 82.37 ± 0.18 |
| | | XGBoost | 83.32 ± 0.18 | 84.06 ± 0.15 | 83.29 ± 0.15 | 84.04 ± 0.20 |
| | | LightGBM | 82.32 ± 0.37 | 84.13 ± 0.12 | 83.28 ± 0.46 | 83.16 ± 0.48 |
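The TT-GAN preserved the attributes of the original data and the relationships between variables, thereby maintaining connections between continuous and categorical values during the generation of the STD. With the discretization and converter step, the TT-GAN STD yielded the highest AUC values among the three generators for both datasets (up to 85.94 ± 0.51 for lung cancer and 84.13 ± 0.12 for liver cancer; Tables 1 and 2), indicating that it retained real-world patterns well enough to support prediction performance approaching that of the original data.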
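To make the comparison with the original data concrete, the short Python sketch below computes the difference in mean AUC between the TT-GAN STD and the original dataset for each prediction model. The numbers are copied from Tables 1 and 2; the XGBoost-classifier row is used only as an illustrative choice, the ± terms are ignored, and the helper name `auc_gap` is hypothetical.

```python
# Mean AUC values (%) copied from Tables 1 and 2.
original = {
    "lung": {"RF": 85.02, "CatBoost": 86.02, "XGBoost": 84.24, "LightGBM": 84.49},
    "liver": {"RF": 85.96, "CatBoost": 86.69, "XGBoost": 85.14, "LightGBM": 85.91},
}
# TT-GAN STD with discretization and converter, XGBoost-classifier row
# (an illustrative choice; other rows can be substituted).
tt_gan = {
    "lung": {"RF": 84.84, "CatBoost": 85.91, "XGBoost": 85.44, "LightGBM": 85.34},
    "liver": {"RF": 83.32, "CatBoost": 84.06, "XGBoost": 83.29, "LightGBM": 84.04},
}


def auc_gap(orig, synth):
    """Return synthetic-minus-original AUC (percentage points) per prediction model."""
    return {model: round(synth[model] - orig[model], 2) for model in orig}


for dataset in original:
    gaps = auc_gap(original[dataset], tt_gan[dataset])
    line = ", ".join(f"{model}: {gap:+.2f}" for model, gap in gaps.items())
    print(f"{dataset}: {line}")
```

A negative value means the model trained on STD scored below the model trained on the original data. Because the spread of each estimate is ignored, small differences should not be over-interpreted.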