3.1. Experimental setup
To determine whether the combination of SCAD feature selection, SMOTE resampling, and BO-ML models can improve classification performance, the following stages were completed:
1. Importing the COPD monitoring data.
2. Using SCAD to select the most relevant features.
3. Splitting the original dataset into a training set (70%) and a test set (30%) (see Supplementary Table S3).
4. Balancing the COPD training data with SMOTE.
5. Building BO-ML models on both the balanced and unbalanced datasets.
6. Comparing and evaluating model performance.
The flowchart of this process is shown in Figure 1. Throughout this process, we observe whether the combination of these methods enhances or reduces the overall performance of the models. Furthermore, to preserve the models' ability to generalize, SMOTE resampling was applied to the training set only, while the test set retained its original feature variables without any additional processing.
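Stages 3 and 4 hinge on one detail that is easy to get wrong: the split must come before resampling, so that synthetic samples never leak into the test set. A minimal sketch of the stratified split, using synthetic stand-in data (sizes and prevalence taken from Sections 3.2 and 3.4; all variable names are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the COPD monitoring data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(4747, 14))             # 14 SCAD-selected features
y = (rng.random(4747) < 0.093).astype(int)  # ~9.3% COPD prevalence

# Stage 3: stratified 70/30 split; the class ratio is preserved in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# Stage 4 (SMOTE) is then applied to X_train/y_train ONLY; X_test keeps
# its original, imbalanced distribution for honest evaluation.
```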
3.2. Baseline characteristics
Among the initial 6,648 study participants, 1,901 individuals with incomplete data were excluded, leaving a final analysis sample of 4,747 participants, of whom 443 (9.3%) were confirmed COPD patients. The gender distribution was 48.9% male and 51.1% female. The age distribution was as follows: 26.9% were between 40 and 49 years old, 36.6% between 50 and 59, 28.6% between 60 and 69, and 7.9% over 70. More detailed information can be found in Supplementary Table S2. Figure 2 shows that the prevalence of COPD is higher among males than females; that prevalence decreases with increasing literacy and increases with advancing age; and that individuals with lower BMI values have a higher prevalence of COPD, reaching 27.1% among those classified as underweight.
3.3. Using SCAD to screen COPD related factors
The SCAD model was used to screen 34 potential risk factors associated with COPD. The "tune.method" option in the SIS package specifies the criterion for selecting the optimal tuning parameter λ: AIC, BIC, eBIC, or CV. After tuning, this study adopted AIC as the selection criterion to ensure that sufficient information was retained. The SCAD method ultimately identified 14 variables that exhibit strong associations with COPD, as shown in Table 3.
Table 3 The selected variables and regression coefficients for the SCAD method.
Variables | AIC | BIC | eBIC | CV
Cough frequently at age 14 and before (X2) | -0.30505211 | -0.07537729 | - | -
Hospitalization for pneumonia or bronchitis between the ages of 15 and 17 (X4) | 0.50023039 | - | - | -
Respiratory disease (X5) | 0.87865125 | 0.57747375 | 0.41777030 | 0.41777030
Gastroesophageal reflux (X12) | -0.20792135 | - | - | -
Family history (X14) | 0.28988007 | 0.08573527 | - | -
Current smoking (X16) | 0.44375813 | 0.00910660 | 0.07505639 | 0.07505639
Polluting fuel for household heating (X18) | 0.17471077 | - | - | -
Age (X22) | 0.58876871 | 0.53305883 | 0.40153387 | 0.40153387
Marital status (X24) | -0.13293717 | - | - | -
Region (X25) | 0.02864334 | - | - | -
Gender (X26) | -1.03309069 | -1.25037895 | -0.85786855 | -0.85786855
BMI (X27) | -0.10398143 | - | - | -
Kyphosis (X31) | 0.05966857 | - | - | -
Funnel chest (X33) | 1.89946875 | - | - | -
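For context, the SCAD penalty that produces the sparse coefficients in Table 3 is usually stated through its derivative (Fan and Li's formulation, with the common default a = 3.7); for a coefficient magnitude t ≥ 0:

```latex
p'_{\lambda}(t) = \lambda \left\{ I(t \le \lambda)
  + \frac{(a\lambda - t)_{+}}{(a-1)\lambda}\, I(t > \lambda) \right\},
  \qquad \lambda > 0,\; a > 2
```

Small coefficients receive the full lasso-like penalty λ and are shrunk to zero, while sufficiently large coefficients are left essentially unpenalized, which is why SCAD can select variables with less estimation bias than the lasso.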
In addition, to examine whether the variables selected by the SCAD method exhibit collinearity, we tested for multicollinearity using the variance inflation factor (VIF); a VIF value below 5 indicates weak multicollinearity. Table 4 shows that the selected variables have VIF values close to 1, indicating weak collinearity among them. This suggests that careful variable selection can effectively mitigate the adverse effects of feature collinearity on the classification performance of the model.
Table 4 VIF test values.
Variables | VIF value | Variables | VIF value
X2 | 1.053 | X22 | 1.051
X4 | 1.016 | X24 | 1.036
X5 | 1.069 | X25 | 1.320
X12 | 1.021 | X26 | 1.737
X14 | 1.055 | X27 | 1.026
X16 | 1.747 | X31 | 1.022
X18 | 1.341 | X33 | 1.012
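The VIF values in Table 4 can be reproduced with a short computation: regress each feature on the remaining features and take VIF_j = 1/(1 - R_j²). A self-contained sketch on illustrative random data (not the COPD dataset):

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        # Auxiliary OLS regression of feature j on all the other features
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1.0 - np.var(y - A @ beta) / np.var(y)   # R^2 of that regression
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Independent columns give VIF near 1; a near-duplicate column inflates it
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
```

Values close to 1, as in Table 4, mean that no selected variable can be linearly reproduced from the others.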
3.4. The results of SMOTE resampling
The original training dataset exhibits class imbalance, with far fewer COPD patients (n=314) than non-COPD patients (n=3,008). After SMOTE resampling, a balanced 1:1 distribution of COPD patients to non-patients was achieved, as shown in Table 5.
Table 5 Class distribution before and after SMOTE resampling.
Dataset | COPD patients | Non-COPD patients
The original training set | 314 | 3008
After SMOTE resampling | 3008 | 3008
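The core of SMOTE is simple: each synthetic minority sample is a random point on the line segment between a real minority sample and one of its k nearest minority-class neighbours. A simplified, self-contained sketch (brute-force neighbours; a stand-in for library implementations, not the code used in this study):

```python
import numpy as np

def smote_minority(X_min, n_new, k=5, seed=None):
    """Generate n_new synthetic samples from the minority-class matrix X_min."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # Pairwise distances within the minority class (self-distance excluded)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]                # k nearest neighbours
    base = rng.integers(0, n, size=n_new)             # random base sample
    nb = nn[base, rng.integers(0, min(k, n - 1), size=n_new)]  # random neighbour
    gap = rng.random((n_new, 1))                      # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])
```

In the setting of Table 5, oversampling the 314 training-set patients up to 3,008 would correspond to generating 3008 − 314 synthetic patients this way and concatenating them with the original training data.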
3.5. Model establishment and evaluation
In the BO algorithm, the maximum number of objective-function evaluations is set to 100 as the termination criterion. The optimization is run 8 times to find the optimal parameter values for each classifier in its specific search space. Table 6 summarizes the best configurations for all classifiers, and Table 7 summarizes the internal validation results for each classification model on the training dataset. As shown in Table 7, before balancing the data, all classification models achieved high specificity (1.000) but extremely low sensitivity (ranging from 0.000 to 0.021). This indicates that the models did not achieve satisfactory performance in detecting COPD patients within the imbalanced dataset. In contrast, after applying SMOTE resampling to balance the data, the comprehensive evaluation metrics AUC and G-mean improved significantly for all models, demonstrating the effectiveness of data balancing in enhancing the models' recognition of minority-class samples.
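The BO loop itself alternates between fitting a surrogate model to the hyperparameter evaluations made so far and choosing the next point by maximising expected improvement. A minimal one-dimensional sketch using a Gaussian-process surrogate (an illustration of the idea only; the study's search spaces, objective, and BO implementation are not reproduced here):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayes_opt(f, lo, hi, n_init=5, n_iter=20, seed=0):
    """Minimise a 1-D objective f on [lo, hi] with a GP and expected improvement."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lo, hi, n_init)[:, None]          # random initial design
    y = np.array([f(x[0]) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
    for _ in range(n_iter):                           # evaluation budget
        gp.fit(X, y)
        cand = rng.uniform(lo, hi, 256)[:, None]      # candidate pool
        mu, sd = gp.predict(cand, return_std=True)
        z = (y.min() - mu) / np.maximum(sd, 1e-9)
        ei = (y.min() - mu) * norm.cdf(z) + sd * norm.pdf(z)  # expected improvement
        x_next = cand[np.argmax(ei)]                  # most promising candidate
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))
    return X[np.argmin(y), 0], y.min()
```

In the study's setting, f would be a validation loss of the classifier at the candidate hyperparameter values, with the total evaluation budget capped at 100.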
When comparing different models, the BO-KNN model stood out for its relatively strong performance on the imbalanced dataset. During internal validation using the holdout method, it achieved AUC (0.680), ACC (0.908), specificity (1.000), sensitivity (0.021), and G-mean (0.146). After applying data balancing, however, the BO-DT model exhibited the best performance, with AUC (0.920), ACC (0.860), specificity (0.854), sensitivity (0.867), and G-mean (0.860). These results highlight the superior performance of the SMOTE and BO-DT combination compared with other models in mitigating the effects of data imbalance.
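The specificity 1.000 / sensitivity 0.000 pattern seen before balancing is exactly what a classifier that always predicts the majority class produces. The evaluation metrics used above can be computed from the confusion counts as follows (a generic sketch, not the study's evaluation code):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """ACC, sensitivity, specificity and G-mean for binary labels (1 = COPD)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sens = tp / (tp + fn) if tp + fn else 0.0   # recall on COPD patients
    spec = tn / (tn + fp) if tn + fp else 0.0   # recall on non-patients
    return {"ACC": (tp + tn) / len(y_true),
            "Sensitivity": sens,
            "Specificity": spec,
            "G-mean": (sens * spec) ** 0.5}
```

On a 9%-prevalence test set, predicting "non-COPD" for everyone already yields ACC around 0.91 with specificity 1.000 but sensitivity and G-mean 0.000, which is why ACC alone is misleading here and G-mean is reported.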
Table 6 The optimal selection of hyperparameter values for different models.
Models | Optimized hyperparameters | Imbalanced | SMOTE resampling
BO-DT | Maximum number of splits | 2 | 420
 | Split criterion | Maximum deviance reduction | Maximum deviance reduction
BO-NB | Distribution names | Kernel | Kernel
 | Kernel type | Gaussian | Box
BO-SVM | Kernel function | Gaussian | Gaussian
 | Kernel scale | 0.001 | 0.0043
 | Box constraint level | 0.017 | 6.4026
 | Standardize data | TRUE | FALSE
 | Multiclass method | One-vs-One | One-vs-All
BO-KNN | Number of neighbors | 8 | 2985
 | Distance metric | Euclidean | Spearman
 | Distance weight | Equal | Squared inverse
 | Standardize | TRUE | FALSE
(Note: BO-DT: Decision Tree improved by the Bayesian optimization algorithm; BO-NB: Naive Bayes improved by the Bayesian optimization algorithm; BO-SVM: Support Vector Machine improved by the Bayesian optimization algorithm; BO-KNN: K-nearest neighbors improved by the Bayesian optimization algorithm.)
Table 7 Performance of the models on the internal validation data.
Models | AUC | ACC | Specificity | Sensitivity | G-mean
BO-DT | 0.500 | 0.906 | 1.000 | 0.000 | 0.000
BO-NB | 0.760 | 0.906 | 1.000 | 0.000 | 0.000
BO-SVM | 0.530 | 0.906 | 1.000 | 0.000 | 0.000
BO-KNN | 0.680 | 0.908 | 1.000 | 0.021 | 0.146
SMOTE+BO-DT | 0.920 | 0.860 | 0.854 | 0.867 | 0.860
SMOTE+BO-NB | 0.750 | 0.695 | 0.667 | 0.723 | 0.695
SMOTE+BO-SVM | 0.820 | 0.836 | 0.729 | 0.942 | 0.829
SMOTE+BO-KNN | 0.920 | 0.843 | 0.812 | 0.875 | 0.843
To ensure the models' ability to generalize, this study proceeded with external validation, applying each model to the test set. The external validation results (Table 8) align with the internal validation results, indicating that resampling improved the classification models' ability to identify COPD patients in the imbalanced dataset. During external validation on the imbalanced dataset, the BO-KNN model performed consistently with the internal validation results, and its performance was satisfactory. However, after balancing the data with SMOTE resampling, the BO-NB model demonstrated remarkably stable generalization performance in external validation. Combined with SMOTE, the BO-NB model outperformed the other models on the evaluation metrics, with noticeably higher scores on the comprehensive metrics AUC (0.770) and G-mean (0.696). This suggests that BO-NB achieves a higher recognition rate for both positive and negative samples, as well as excellent overall predictive performance.
Table 8 Performance of the models on the external validation data.
Models | AUC | ACC | Specificity | Sensitivity | G-mean
BO-DT | 0.500 | 0.909 | 1.000 | 0.000 | 0.000
BO-NB | 0.750 | 0.909 | 1.000 | 0.000 | 0.000
BO-SVM | 0.540 | 0.909 | 1.000 | 0.000 | 0.000
BO-KNN | 0.680 | 0.907 | 0.995 | 0.023 | 0.152
SMOTE+BO-DT | 0.670 | 0.816 | 0.863 | 0.349 | 0.549
SMOTE+BO-NB | 0.770 | 0.671 | 0.665 | 0.729 | 0.696
SMOTE+BO-SVM | 0.610 | 0.738 | 0.765 | 0.465 | 0.597
SMOTE+BO-KNN | 0.650 | 0.789 | 0.825 | 0.426 | 0.593