In our experiment, four evaluation metrics are used to evaluate the performance of the four classifiers: accuracy, precision, recall, and F1-score. The ROC curve and the confusion matrix were used to assess the predictive power of the models.
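For reference, all four metrics can be computed directly from the entries of a binary confusion matrix. The sketch below is illustrative only; the counts are hypothetical, not taken from our experiments:

```python
def metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # a.k.a. true positive rate / sensitivity
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical counts for illustration
acc, prec, rec, f1 = metrics(tp=80, fp=10, fn=20, tn=90)
print(acc, prec, rec, f1)
```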
1. Logistic Regression
The model was trained on the training set, predictions were made on the test set, and the ROC curve was then drawn. Figure 2a shows the ROC curve and AUC of the logistic regression with SMOTE, and Figure 2b shows the results for the data resampled with MSMOTE. The ROC curves show test-set AUC values of 83.1% with SMOTE and 82.1% with MSMOTE.
Compared with the SMOTE test set, the AUC with MSMOTE decreases slightly, to about 82%.
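These AUC values can also be understood without plotting: AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (the Mann-Whitney statistic). A minimal pairwise implementation, on hypothetical scores rather than our data:

```python
def auc(scores_pos, scores_neg):
    """AUC as the Mann-Whitney statistic: the fraction of (positive, negative)
    score pairs ranked correctly, counting ties as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical predicted default probabilities
pos = [0.9, 0.8, 0.55]  # actual defaults
neg = [0.7, 0.3, 0.2]   # actual non-defaults
print(auc(pos, neg))
```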
The accuracies in Table 2 show that the predictive power with the two resampling techniques (SMOTE and MSMOTE) is nearly identical, with MSMOTE performing slightly better in terms of accuracy.
Figure 3 presents the confusion matrix of the logistic regression with MSMOTE. Table 2 shows the evaluation metrics of the logistic regression model with SMOTE and MSMOTE.
Table 2. Summary of evaluation metrics of the logistic regression.
| Classifiers | Accuracy (%) | AUC (%) | Precision (%) | Recall (%) | F1-Score (%) |
| Logistic regression + SMOTE | 82 | 83 | 97 | 83 | 89 |
| Logistic regression + MSMOTE | 87 | 82 | 96 | 89 | 93 |
In Figure 3, the logistic regression model with MSMOTE correctly predicts 89.5% of non-defaulted loans and 55.25% of defaulted loans; 44.75% of defaults and 10.5% of good loans are missed. Consequently, to minimize losses from loan defaults, the number of missed defaults needs to be minimized and the number of correctly predicted non-defaulted loans maximized.
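The missed-loan rates follow directly from the per-class hit rates, since each row of a normalized confusion matrix sums to 100%:

```python
# Per-class hit rates reported for logistic regression + MSMOTE (Figure 3)
correct_non_default = 89.5   # % of non-defaulted loans predicted correctly
correct_default = 55.25      # % of defaulted loans predicted correctly

# Complements: the misclassified share of each class
missed_good = 100 - correct_non_default    # good loans flagged as defaults
missed_defaults = 100 - correct_default    # defaults predicted as good loans

print(missed_good, missed_defaults)
```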
2. Random Forest
Figures 4a and 4b show the ROC curves of the Random Forest model with SMOTE and MSMOTE, respectively. Each curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), with both rates moving together from 0 to 1 as the decision threshold varies. A good classification model always has its ROC curve above the black baseline (the broken line). The AUC of the Random Forest model with SMOTE resampling is 89%, while that with MSMOTE is 99%. This means the Random Forest model with MSMOTE performs better than the one with SMOTE.
From the confusion matrix, we can directly calculate the precision, recall, F1-score, and accuracy of the random forest algorithm to show its performance.
Figure 5 shows the confusion matrix of the random forest model when the data is resampled with MSMOTE. As shown there, the model predicts non-defaulted loans better than defaulted loans. Overall, the results are better than those of the logistic regression. As shown in Table 3, the accuracy of the random forest model is 99% with MSMOTE and 96% with SMOTE.
Table 3. Summary of evaluation metrics of the random forest classifier.
| Classifier | Accuracy (%) | AUC (%) | Precision (%) | Recall (%) | F1-Score (%) |
| RF + SMOTE | 96 | 89 | 96 | 99 | 98 |
| RF + MSMOTE | 99 | 99 | 99 | 99 | 99 |
Finally, the accuracy score (the number of correct predictions over the total number of predictions) on the training and test data of the random forest with MSMOTE is checked. The classifier gives an accuracy score of about 99% on both, with no overfitting.
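A simple way to screen for overfitting, as done here, is to compare train and test accuracy: a large gap suggests the model has memorized the training data. A small sketch, where the tolerance threshold is an arbitrary illustrative choice, not a value from this study:

```python
def overfit_gap(train_acc, test_acc, tolerance=1.0):
    """Return the train-test accuracy gap (in percentage points) and whether
    it exceeds a chosen tolerance, i.e. a possible sign of overfitting."""
    gap = train_acc - test_acc
    return gap, gap > tolerance

# Random forest + MSMOTE scores from Table 4
gap, overfits = overfit_gap(99.99829, 99.99774)
print(gap, overfits)
```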
Table 4. Train and test accuracy score of random forest with MSMOTE.
| Model | Train Accuracy (MSMOTE) | Test Accuracy (MSMOTE) |
| Random Forest | 99.99829% | 99.99774% |
3. Bagging Classifiers
Figures 6a and 6b display the ROC curves and AUC values of the bagging classifier. On the test set, bagging with SMOTE and with MSMOTE achieved AUC values of 89% and 99%, respectively.
Figure 7 displays the confusion matrix of the bagging classifier after the dataset is balanced with MSMOTE. The evaluation metrics, precision, recall, F1-score, and accuracy, are all calculated from this confusion matrix and are summarized in Table 5 below.
Table 5. Summary of evaluation metrics of the bagging classifier.
| Classifier | Accuracy (%) | AUC (%) | Precision (%) | Recall (%) | F1-Score (%) |
| Bagging + SMOTE | 96 | 89 | 96 | 99 | 98 |
| Bagging + MSMOTE | 99 | 99 | 99 | 99 | 99 |
In terms of F1-score, all classifiers perform well above 50%; the bagging classifier with MSMOTE achieves an F1-score of 99%.
When checked for overfitting, the model shows slight overfitting with SMOTE and performs better with MSMOTE, as the train and test accuracy scores of bagging with MSMOTE are very close. This means there is almost no overfitting when applying bagging with MSMOTE.
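To make the bagging mechanism concrete, the toy sketch below implements it from scratch on a 1-D problem: each base learner (a decision stump) is trained on a bootstrap resample of the data, and predictions are combined by majority vote. This is an illustrative toy, not the actual classifier used in our experiments:

```python
import random
from collections import Counter

def fit_stump(data):
    """Fit a 1-D decision stump: pick the threshold minimizing training error
    for the rule `predict True when x > threshold`."""
    best = None
    for t in sorted({x for x, _ in data}):
        err = sum((x > t) != y for x, y in data)
        if best is None or err < best[1]:
            best = (t, err)
    return best[0]

def bagging_fit(data, n_estimators=25, seed=0):
    """Train one stump per bootstrap resample of the data (bagging)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_estimators)]

def bagging_predict(thresholds, x):
    """Majority vote of the bootstrapped stumps."""
    votes = Counter(x > t for t in thresholds)
    return votes.most_common(1)[0][0]

# Toy 1-D data: label is True when the feature exceeds 0.5
data = [(i / 10, i / 10 > 0.5) for i in range(11)]
stumps = bagging_fit(data)
print(bagging_predict(stumps, 0.9), bagging_predict(stumps, 0.1))
```

Averaging many high-variance learners trained on resampled data is what gives bagging its variance-reduction and overfitting-resistance properties discussed above.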
Table 6. Train and test accuracy scores of bagging with SMOTE and MSMOTE.
| Model | Train Accuracy, SMOTE (%) | Test Accuracy, SMOTE (%) | Train Accuracy, MSMOTE (%) | Test Accuracy, MSMOTE (%) |
| Bagging | 99.9775 | 96.0783 | 99.9746 | 99.95548 |
4. Adaptive Boosting
Finally, Figures 8a and 8b display the ROC curves and AUC values of AdaBoost. The test set confirmed that AdaBoost has a predictive power (AUC) of 82% with SMOTE and 83% with MSMOTE.
The results show that the ROC curve and AUC value of AdaBoost with MSMOTE are better than those with SMOTE.
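To illustrate what AdaBoost does under the hood, the from-scratch sketch below trains weighted decision stumps in rounds, up-weighting the samples each stump misclassifies. It is a toy on hypothetical 1-D data, not the implementation used in our experiments:

```python
import math

def adaboost_fit(data, n_rounds=10):
    """Minimal AdaBoost with threshold stumps on 1-D data.
    data: list of (x, y) with y in {-1, +1}."""
    n = len(data)
    w = [1.0 / n] * n          # sample weights, updated every round
    ensemble = []              # list of (alpha, threshold, sign)
    for _ in range(n_rounds):
        # Pick the stump `sign * (1 if x > t else -1)` with minimal weighted error
        best = None
        for t in sorted({x for x, _ in data}):
            for sign in (+1, -1):
                err = sum(wi for wi, (x, y) in zip(w, data)
                          if sign * (1 if x > t else -1) != y)
                if best is None or err < best[0]:
                    best = (err, t, sign)
        err, t, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)       # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # stump weight
        ensemble.append((alpha, t, sign))
        # Re-weight: boost the samples this stump got wrong, then normalize
        w = [wi * math.exp(-alpha * y * sign * (1 if x > t else -1))
             for wi, (x, y) in zip(w, data)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(a * s * (1 if x > t else -1) for a, t, s in ensemble)
    return 1 if score > 0 else -1

# Toy data: positives above 0.5
data = [(i / 10, 1 if i > 5 else -1) for i in range(11)]
model = adaboost_fit(data)
print(adaboost_predict(model, 0.9), adaboost_predict(model, 0.1))
```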
As seen in Figure 9, the confusion matrix of AdaBoost with MSMOTE shows 99% of settled loans and 36.8% of defaulted loans correctly predicted. In other words, about 63.2% of defaulted loans are missed, while only 1% of good loans are missed.
Table 6. Summary of evaluation metrics of the adaptive boosting.
| Classifier | Accuracy (%) | AUC (%) | Precision (%) | Recall (%) | F1-Score (%) |
| Adaboost + SMOTE | 90 | 81 | 96 | 94 | 95 |
| Adaboost + MSMOTE | 95 | 83 | 95 | 99 | 97 |
In summary, as shown in Table 7, Adaboost performed better when the data was balanced with MSMOTE. Regarding overfitting, the model shows some overfitting when executed with MSMOTE and none with SMOTE.
Table 7. Train and test accuracy score of Adaboost with SMOTE and MSMOTE.
| Model | Train Accuracy, SMOTE | Test Accuracy, SMOTE | Train Accuracy, MSMOTE | Test Accuracy, MSMOTE |
| Adaboost | 90.896% | 90.468% | 97.298% | 95.14% |
5. Analysis of Results and Discussion
For better comparison, the performance results of the four models are summarized in Table 8 and Table 9. The F1-scores show that all classifiers performed well, with a minimum of 93%, except LR with SMOTE, which scored the lowest at 89%. The results also show that RF and Bagging have comparable performances, scoring better in almost all measures with both SMOTE and MSMOTE. The LR classifier showed the lowest classification ability, with accuracy and F1-score of 82% and 89% with SMOTE and 87% and 93% with MSMOTE, respectively. The RF and Bagging classifiers performed better, with accuracy and AUC of 96% and 89% with SMOTE and 99% for both with MSMOTE. The precision and recall of RF and Bagging were also higher than those of LR and Adaboost, which indicates good performance.
Table 8. Summary of performances of the four ensemble classifiers using SMOTE.
| Classifiers | Accuracy (%) | AUC (%) | Precision (%) | Recall (%) | F1-Score (%) |
| LR + SMOTE | 82 | 83 | 97 | 83 | 89 |
| RF + SMOTE | 96 | 89 | 96 | 99 | 98 |
| Bagging + SMOTE | 96 | 89 | 96 | 99 | 98 |
| Adaboost + SMOTE | 90 | 81 | 96 | 94 | 95 |
Table 9. Summary of evaluation metrics for ensemble classifier with MSMOTE.
| Classifiers | Accuracy (%) | AUC (%) | Precision (%) | Recall (%) | F1-Score (%) |
| LR + MSMOTE | 87 | 82 | 96 | 89 | 93 |
| RF + MSMOTE | 99 | 99 | 99 | 99 | 99 |
| Bagging + MSMOTE | 99 | 99 | 99 | 99 | 99 |
| Adaboost + MSMOTE | 95 | 83 | 95 | 99 | 97 |
Figure 10 shows the ROC curves of all four classification models. As mentioned earlier, an ROC curve that tends towards the top-left corner of the graph correctly identifies a greater proportion of observations (i.e., a higher recall). A higher AUC value also indicates better model performance. As indicated by Figure 10b and Table 9, the RF and bagging models have shown better classification ability. Consequently, they both outperformed the other models in predicting potential defaulters.
As seen in the results, of the two resampling techniques, SMOTE and MSMOTE, the latter achieved better performance when combined with the ensemble classifiers. This is mainly because MSMOTE not only considers the distribution of minority-class samples but also rejects latent noise points based on the K-NN method [21]. Experimental results have also indicated that the MSMOTE algorithm can yield better prediction of the minority class than SMOTE [22]. Moreover, when used with a bagging-based ensemble classifier, MSMOTE gives better accuracy, precision, F1-score, and recall than when used with boosting-based ensemble classifiers. This is because bagging improves the stability and accuracy of machine learning algorithms, reduces variance, overcomes overfitting, and improves misclassification rates, among other benefits. In noisy data environments like ours, bagging generally outperforms boosting [23] [24].
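The core idea of SMOTE that this discussion relies on is simple: synthesize new minority samples by interpolating between a minority point and one of its minority-class nearest neighbours (MSMOTE additionally filters candidate points it deems noisy). A minimal sketch of the interpolation step, on hypothetical 2-D data rather than our dataset:

```python
import random

def smote_sample(minority, k=2, seed=0):
    """Generate one synthetic minority sample: pick a random minority point,
    pick one of its k nearest minority-class neighbours, and interpolate
    between them at a random fraction of the distance."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # k nearest neighbours of `base` within the minority class (Euclidean, 2-D)
    neighbours = sorted((p for p in minority if p != base),
                        key=lambda p: (p[0] - base[0]) ** 2 + (p[1] - base[1]) ** 2)[:k]
    nb = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, nb))

# Hypothetical minority-class points; the last one is an outlier that a
# noise-filtering variant like MSMOTE might reject before sampling
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0)]
print(smote_sample(minority))
```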
When we compare our results with existing works, as Table 10 presents, the proposed models outperformed previous approaches, including those using the same datasets.
Table 10. Performance Comparison of Selected Related Works with the Proposed System.
| Ref. | Year | Proposed Method | Dataset | Algorithm(s) Used | Performance (Accuracy in %) |
| [5] | 2021 | Resampling and cost-sensitive mechanisms | Lending Club datasets | Logistic Regression | 65.5 |
| [6] | 2018 | Combinations of classifiers and resampling techniques | Lending Club datasets | Random Forest | 81.76 |
| [10] | 2020 | Synthetic Minority Oversampling Technique (SMOTE) | Dataset from Kaggle | Neural Network | 94.81 |
| [11] | 2018 | Hybrid under-sampling method that combines clustering | Real loan default data from a P2P company | DSUS | 68.8 |
| [20] | 2019 | K-XGBoost model based on K-Means++ | Small business credit loan data of Lending Club | XGBoost | 92.2 |
| Our Study | 2022 | MSMOTE with bagging-based ensemble classifiers (Random Forest, Bagging) | Loan Data for Dummy Bank from Kaggle | Random Forest, Bagging Classifier | 99 |
| Our Study | 2022 | MSMOTE with boosting-based ensemble classifier (Adaboost) | Loan Data for Dummy Bank from Kaggle | Adaptive Boosting | 95.1 |
Furthermore, the following research questions were raised in the first section:
RQ1: Which approach or method better handles imbalanced data in loan prediction systems?
RQ2: Which approaches or methods for handling imbalanced data have been overlooked?
RQ3: Which machine learning algorithm handles imbalanced data better?
The first research question is answered by the discussion in Section II, the review of related works. As mentioned there, existing approaches have used resampling techniques and cost-sensitive learning methods to deal with class imbalance in loan default prediction and improve classifier performance.
The second research question is likewise answered in Section II. There are many resampling techniques, some of which are used in many works: Random Undersampling (RUS) [5, 6, 7, 11, 13], Random Oversampling (ROS) [6, 7, 11], SMOTE [6, 7, 8, 10, 11, 13], DSUS [11], and cost-sensitive learning [5, 8, 13]. The method we have used in this work has been overlooked in previous works on handling imbalanced data. Besides, the ensemble classifier technique adopted in this work had not previously been used to handle imbalanced data.
The final research question is answered by Tables 8 and 9 and Figure 10b. As mentioned, the experimental results confirmed that bagging-based ensemble techniques perform better in loan default prediction.