The experiments were conducted using scikit-learn, a widely used Python library for machine learning, together with XGBoost, an open-source Python library that provides a gradient boosting framework. Gradient boosting is a machine learning approach used for classification and regression problems. XGBoost ensembles weak learners to create a strong learner: models are introduced into the ensemble sequentially, and each newly added model corrects the errors of its predecessors, thereby converging toward an optimized solution.
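For illustration, the snippet below sketches how such a boosted ensemble can be trained with the xgboost Python package; the data, split and hyperparameter values are placeholders rather than the tuned settings used in our experiments.

```python
# Minimal sketch: training a gradient boosting ensemble with xgboost.
# X and y are dummy placeholders; the hyperparameter values below are
# illustrative assumptions, not the tuned settings used in this study.
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
X = rng.random((500, 14))                      # dummy feature matrix
y = rng.integers(0, 2, 500)                    # dummy binary labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = XGBClassifier(
    n_estimators=100,    # number of sequentially added weak learners
    learning_rate=0.1,   # shrinks each new tree's contribution
    max_depth=3,         # depth of each weak learner
    eval_metric="logloss",
)
model.fit(X_train, y_train)                    # trees are added sequentially
print("Test accuracy:", model.score(X_test, y_test))
```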
5.1 Data Description
To validate the efficacy and generalization performance of our proposed model, we used four datasets with different imbalance ratios: two commonly used credit scoring datasets and two real-world datasets.
The Australian and German datasets are the most frequently used credit scoring datasets in the related literature, which allows us to perform a comparative study against previously conducted research in credit scoring. Both are freely available from the UCI Machine Learning Repository [31]. The Australian dataset consists of 690 instances, of which 307 are fully paid and 383 are defaults, and comprises eight numerical and six categorical features. The German dataset consists of 1000 instances, with 700 samples indicating fully paid loans and 300 representing defaulters, and comprises 13 categorical and 7 numerical features. From the two datasets, the most frequently used financial and nonfinancial indicators are selected.
The robustness of Adaptive XGBoost is further examined using two real-world Peer-to-Peer (P2P) credit scoring datasets. Credit scoring is an ephemeral scenario in which many of the variables may drift over time, so these datasets validate the suitability of the approach for real-world problems. Adaptive XGBoost is therefore tailored for incremental learning environments and can detect and adapt to changes in the underlying data distribution.
The first real-world dataset is the RRDai data, sourced via a web crawler from RenRenDai, a Chinese Internet finance enterprise, and consists of loan data for the year 2017. The dataset is publicly available at https://www.renrendai.com. Most instances of the RRDai dataset represent outstanding loans. The classification task is to ascertain whether a client defaults on the obligation to pay the monthly repayment amount. The attributes of the RRDai dataset include the loan amount \(P\), the annualized yield rate \(apr\), the repayment period \(T\), the remaining repayment period \({T}_{r}\) and the actual outstanding amount \({L}_{u}\). Two repayment options exist: the average capital plus interest approach and the debt servicing approach.
The monthly loan repayment under the average capital plus interest method is expressed as:
\({L}_{e}=P\times \frac{\frac{apr}{12}{\left(1+\frac{apr}{12}\right)}^{T}}{{\left(1+\frac{apr}{12}\right)}^{T}-1}\) (10)
The gross interest for the debt servicing approach is calculated as:
\({R}_{t}\) = P*\(\frac{apr}{12}\) * T (11)
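As a worked illustration of Eqs. (10) and (11), the two repayment quantities can be computed as follows; the loan amount, rate and period in the example are assumed values, not figures from the RRDai data.

```python
# Worked sketch of Eqs. (10) and (11); all input values are assumed.
def monthly_repayment_annuity(P, apr, T):
    """Eq. (10): monthly repayment L_e, average capital plus interest."""
    r = apr / 12.0                              # monthly interest rate
    return P * (r * (1 + r) ** T) / ((1 + r) ** T - 1)

def gross_interest_debt_servicing(P, apr, T):
    """Eq. (11): gross interest R_t under the debt servicing approach."""
    return P * (apr / 12.0) * T

# Example: a 10,000 loan at a 12% annualized yield over T = 24 months.
print(monthly_repayment_annuity(10000, 0.12, 24))      # ~470.73 per month
print(gross_interest_debt_servicing(10000, 0.12, 24))  # 2400.0 in total
```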
The second real-world dataset, LendingClubLoan, is sourced from LendingClub, a Peer-to-Peer (P2P) lending company that matches borrowers with investors online. The dataset is available on Kaggle at https://www.kaggle.com/wendykan/lending-club-loan-data. It consists of complete loan data from 2007 to 2015, including the current loan status and recent payment information.
The credit scoring input variables of the datasets were split into the following subsets [33]: applicant assessment (grade, subgrade), loan characteristics (loan purpose and amount), applicant characteristics (annual income, housing situation, credit history length, delinquency) and applicant indebtedness (loan amount to annual income, annual installment to income).
The imbalanced datasets are processed using the Synthetic Minority Oversampling Technique (SMOTE). Previously proposed credit risk assessment models for financial institutions use imbalanced data in which the number of non-default cases is usually much higher than the number of default cases. If the class imbalance problem is not taken into account and all data are used to build a classification model, the resulting model has high accuracy for non-defaults but extremely low accuracy for defaults. This study uses SMOTE to balance the numbers of default and non-default cases and thereby minimize the effect of the class imbalance problem on modeling. After the relevant variables are identified, outliers and missing values are handled. A random sample of 80% of the data is used to construct the credit scoring model, and the remaining 20% is used to backtest the discriminating power of the completed model.
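A minimal sketch of this preprocessing step, assuming the imbalanced-learn implementation of SMOTE and the 80/20 split described above, is given below; here SMOTE is applied to the training portion only, so that the held-out data keep the original class distribution.

```python
# Sketch of the imbalance handling and 80/20 split, assuming the
# imbalanced-learn implementation of SMOTE; X and y are placeholders.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 20))
y = rng.choice([0, 1], size=1000, p=[0.7, 0.3])   # imbalanced labels

# 80% for model construction, 20% held out for backtesting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Oversample the minority (default) class on the training portion only,
# so the held-out set keeps the original class distribution.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(np.bincount(y_train_bal))                   # classes are now balanced
```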
Table 2
Composition of the credit scoring datasets
| Dataset | Instances | Features | Training Set | Test Set | Good/Bad |
|---|---|---|---|---|---|
| German | 1000 | 24 | 800 | 200 | 700/300 |
| Australia | 690 | 14 | 552 | 138 | 307/383 |
| RRDai | 1421 | 17 | 1137 | 284 | 1072/349 |
| LendingClubLoan | 2642 | 11 | 2114 | 528 | 1322/1320 |
5.2 Evaluation Measures
In credit scoring, average accuracy is the most commonly used performance metric and reflects the overall generalization performance of the learning model. To better explore the model's ability to distinguish between nondefault and default applications and to reach a reliable and robust conclusion, we employ six evaluation metrics to empirically evaluate the prediction performance of our proposed approach: accuracy (ACC), Brier score (BS), Area Under the ROC Curve (AUC), the F1 score, and the Type I and Type II errors derived from the confusion matrix. The two error types evaluate the models' predictive performance in more detail. A Type I error occurs when a default loan application is wrongly classified as a nondefault application; conversely, a Type II error occurs when a nondefault application is misclassified as a default. In a confusion matrix, TP and TN represent the numbers of correctly classified good and bad borrowers respectively, while FP and FN represent the numbers of misclassified loan applications.
The performance metrics can be represented as follows:
The average accuracy (ACC)= \(\frac{TP+TN}{TP+FP+TN+FN}\) (12)
The Type I error: \(\frac{FP}{TN+FP}\) (13)
The Type II error: \(\frac{FN}{TP+FN}\) (14)
The Brier score is a performance metric that measures the accuracy of predicted probabilities and the calibration of the prediction performance. It lies in the interval 0 to 1, ranging from perfect to poor probabilistic predictions; a lower Brier score reflects better predictions.
The Brier score can be expressed as:
BS = \(\frac{1}{N}\sum _{i=1}^{N}{\left({p}_{i}-{y}_{i}\right)}^{2}\) (15)
where N is the number of samples, and \({p}_{i}\) and \({y}_{i}\) denote the probability prediction and the true label of sample i respectively.
The F1-score considers both the precision and the recall of a classification model. It is the harmonic mean of precision and recall and ranges within the interval 0 to 1. F1 can be expressed as:
F1 = \(2\times \frac{precision\times recall}{precision+recall}\) (16)
where precision is the proportion of predicted positive cases that are truly positive and is defined as:
Precision = \(\frac{TP}{TP+FP}\) (17)
Recall is the proportion of actual positive cases that are correctly predicted and is expressed as:
Recall = \(\frac{TP}{TP+FN}\) (18)
The Area Under the ROC Curve (AUC) performs a global assessment by computing the area under the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) as the decision threshold on the predicted scores is varied.
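The six metrics can be computed from a confusion matrix and the predicted probabilities as sketched below with scikit-learn; y_true and y_prob are placeholders, and the positive class is taken to be the nondefault (good) label, matching the definitions above.

```python
# Sketch: computing the six evaluation metrics with scikit-learn.
# y_true and y_prob are dummy placeholders; the positive class (1) is
# the nondefault (good) label, matching the definitions above.
import numpy as np
from sklearn.metrics import (accuracy_score, brier_score_loss,
                             confusion_matrix, f1_score, roc_auc_score)

rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, 200)
y_prob = np.clip(0.6 * y_true + 0.4 * rng.random(200), 0.0, 1.0)
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = accuracy_score(y_true, y_pred)        # Eq. (12)
type_i = fp / (tn + fp)                     # Eq. (13): bad classified as good
type_ii = fn / (tp + fn)                    # Eq. (14): good classified as bad
bs = brier_score_loss(y_true, y_prob)       # Eq. (15)
f1 = f1_score(y_true, y_pred)               # Eq. (16)
auc = roc_auc_score(y_true, y_prob)         # area under the ROC curve
print(acc, type_i, type_ii, bs, f1, auc)
```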
5.3 Experimental Results
The prediction performance of AHPSO-XGBoost is compared with nine other learning approaches. The prediction results of the benchmark models across the four datasets and evaluation metrics are presented as baselines for our proposed method. Ten approaches, namely AdaBoost, AdaBoost-NN, Bagging Decision Tree, Bagging NN, Random Forest, Decision Tree, Logistic Regression, Leveraging Bagging, Adaptive Random Forest and AHPSO-XGBoost, are validated across the datasets.
Table 3
Performance of benchmark models across datasets (ACC, F1, AUC, Type I and Type II in %; BS on a 0 to 1 scale)
| Dataset | Classifier | ACC | F1 | AUC | BS | Type I | Type II |
|---|---|---|---|---|---|---|---|
| Australia | AdaBoost | 72.43 | 53.65 | 62.05 | 0.2268 | 56.49 | 12.65 |
| | AdaBoost-NN | 71.85 | 43.85 | 58.85 | 0.1678 | 53.66 | 16.46 |
| | Bagging-DT | 73.57 | 56.43 | 67.35 | 0.2465 | 50.42 | 13.68 |
| | Bagging-NN | 70.58 | 59.35 | 63.48 | 0.1874 | 65.48 | 10.48 |
| | Random Forest | 71.35 | 61.67 | 61.35 | 0.2247 | 56.34 | 11.56 |
| | Decision Tree | 74.45 | 54.32 | 59.85 | 0.1778 | 62.23 | 14.28 |
| | Logistic Regression | 72.85 | 62.28 | 60.59 | 0.1764 | 58.45 | 15.09 |
| | Leveraging Bagging | 74.33 | 57.76 | 55.79 | 0.1743 | 63.54 | 17.36 |
| | Adaptive Random Forest | 72.85 | 54.34 | 61.75 | 0.1659 | 54.35 | 11.49 |
| | AHPSO-XGBoost | 75.43 | 72.48 | 70.48 | 0.1604 | 53.05 | 10.43 |
| German | AdaBoost | 66.34 | 58.58 | 63.46 | 0.2213 | 62.53 | 12.87 |
| | AdaBoost-NN | 69.34 | 62.87 | 67.82 | 0.2174 | 61.27 | 13.48 |
| | Bagging-DT | 73.47 | 59.42 | 63.28 | 0.1783 | 59.74 | 15.85 |
| | Bagging-NN | 68.76 | 64.37 | 59.86 | 0.1674 | 56.46 | 11.78 |
| | Random Forest | 71.26 | 69.72 | 60.47 | 0.2034 | 60.74 | 12.35 |
| | Decision Tree | 74.64 | 67.48 | 58.73 | 0.2167 | 58.79 | 14.56 |
| | Logistic Regression | 72.58 | 65.39 | 65.48 | 0.1784 | 64.72 | 16.86 |
| | Leveraging Bagging | 74.78 | 70.84 | 69.87 | 0.1674 | 68.43 | 11.38 |
| | Adaptive Random Forest | 68.85 | 69.87 | 64.89 | 0.1689 | 64.58 | 12.69 |
| | AHPSO-XGBoost | 76.83 | 73.74 | 71.84 | 0.1567 | 69.79 | 11.21 |
| RRDai | AdaBoost | 74.47 | 59.47 | 72.38 | 0.2217 | 18.74 | 14.23 |
| | AdaBoost-NN | 69.83 | 61.35 | 69.85 | 0.2145 | 21.67 | 16.78 |
| | Bagging-DT | 71.68 | 58.67 | 64.58 | 0.1673 | 22.14 | 13.29 |
| | Bagging-NN | 68.73 | 63.48 | 62.87 | 0.1784 | 18.94 | 15.63 |
| | Random Forest | 70.39 | 60.79 | 67.89 | 0.2242 | 17.87 | 12.84 |
| | Decision Tree | 66.58 | 57.45 | 68.89 | 0.1783 | 22.31 | 17.62 |
| | Logistic Regression | 68.38 | 70.34 | 70.38 | 0.1687 | 18.75 | 12.31 |
| | Leveraging Bagging | 72.68 | 72.84 | 73.49 | 0.1602 | 22.62 | 11.87 |
| | Adaptive Random Forest | 74.86 | 76.58 | 71.29 | 0.1687 | 18.76 | 13.56 |
| | AHPSO-XGBoost | 72.35 | 70.35 | 74.67 | 0.1583 | 16.71 | 11.05 |
| LendingClubLoan | AdaBoost | 69.47 | 66.89 | 71.86 | 0.1873 | 22.17 | 16.81 |
| | AdaBoost-NN | 65.67 | 63.47 | 68.47 | 0.2241 | 17.64 | 13.48 |
| | Bagging-DT | 63.28 | 59.83 | 66.89 | 0.1784 | 21.37 | 12.75 |
| | Bagging-NN | 66.87 | 68.42 | 63.12 | 0.1843 | 17.85 | 14.39 |
| | Random Forest | 69.84 | 71.89 | 67.84 | 0.1798 | 22.36 | 13.89 |
| | Decision Tree | 67.38 | 69.84 | 69.54 | 0.2345 | 17.06 | 12.34 |
| | Logistic Regression | 70.83 | 71.73 | 73.42 | 0.1865 | 16.89 | 13.26 |
| | Leveraging Bagging | 73.48 | 74.86 | 74.89 | 0.1721 | 16.54 | 12.19 |
| | Adaptive Random Forest | 71.89 | 72.49 | 72.36 | 0.1672 | 17.02 | 11.87 |
| | AHPSO-XGBoost | 77.58 | 76.69 | 74.85 | 0.1612 | 16.03 | 10.89 |
Table 3 shows the prediction performance of the benchmark models on the different indicators, and Table 4 summarizes the results of XGBoost models optimized with different optimization algorithms. The XGBoost model optimized with AHPSO achieved the best performance on the four credit datasets. For the Australia dataset, AHPSO-XGBoost achieved the highest accuracy and the lowest Type I error rate; a lower Type I error indicates that the model avoids misinterpreting a poor credit application as one with a high probability of good credit. AHPSO-XGBoost also has the lowest Brier score compared with XGBoost-RS and XGBoost-TPE, and its F1 score is the best among all the optimized XGBoost algorithms owing to the appropriate settings found by AHPSO. This success is largely attributable to the XGBoost hyperparameter max_delta_step, which counteracts the problem of imbalanced data to a certain extent. On the German dataset, the AHPSO-XGBoost algorithm obtained the best prediction accuracy with a score of 75.43%, about 4% higher than the default XGBoost model, and achieves promising results for the Type II error and Brier score. The F1 score obtained is also promising and demonstrates that the model can adequately learn the imbalanced data.
For the RRDai dataset, AHPSO-XGBoost obtains the best prediction performance on all indicators when compared with the other optimized credit scoring models, a sign that its hyperparameter settings are more reasonable than those of the other models.
On the LendingClubLoan dataset, AHPSO-XGBoost performed well on all indicators. The Type II error of the default, non-optimized model is lower than that of AHPSO-XGBoost because the imbalanced data contain more labels associated with good credit; the parameter settings need to be improved to learn the imbalance accurately. AHPSO-XGBoost achieves the best probability prediction ability, as it attains the lowest Brier score.
On average, AHPSO-XGBoost achieves better predictions than the other models. The results indicate that AHPSO can align the XGBoost algorithm with the characteristics of the credit data. In terms of prediction performance, AHPSO-XGBoost outperforms the PSO-XGBoost model because AHPSO avoids premature convergence of the particle swarm and enables the particles to find hyperparameters that make the algorithm more accurate.
Table 4
Performance of XGBoost with various optimization methods on the credit data (ACC, F1, AUC, Type I and Type II in %; BS on a 0 to 1 scale)
| Dataset | Classifier | ACC | F1 | AUC | BS | Type I | Type II |
|---|---|---|---|---|---|---|---|
| Australia | XGBoost | 71.29 | 71.85 | 69.45 | 0.2016 | 63.37 | 13.59 |
| | XGBoost-GS | 68.83 | 69.59 | 66.84 | 0.1724 | 67.59 | 14.53 |
| | XGBoost-RS | 70.69 | 66.79 | 63.49 | 0.1836 | 61.47 | 12.68 |
| | XGBoost-TPE | 72.53 | 68.74 | 68.86 | 0.1649 | 59.89 | 15.04 |
| | PSO-XGBoost | 73.48 | 70.28 | 67.57 | 0.1631 | 60.81 | 12.37 |
| | AHPSO-XGBoost | 75.62 | 73.89 | 70.74 | 0.1514 | 58.84 | 11.19 |
| German | XGBoost | 71.45 | 68.52 | 72.86 | 0.1784 | 66.89 | 16.59 |
| | XGBoost-GS | 68.56 | 71.49 | 70.49 | 0.2205 | 71.64 | 12.47 |
| | XGBoost-RS | 71.53 | 69.78 | 73.53 | 0.1687 | 69.42 | 14.65 |
| | XGBoost-TPE | 69.59 | 70.89 | 71.63 | 0.1734 | 70.83 | 13.84 |
| | PSO-XGBoost | 70.37 | 72.18 | 74.76 | 0.1689 | 72.18 | 11.87 |
| | AHPSO-XGBoost | 73.54 | 74.87 | 76.89 | 0.1582 | 74.27 | 11.16 |
| RRDai | XGBoost | 74.64 | 69.73 | 71.43 | 0.1708 | 70.67 | 12.74 |
| | XGBoost-GS | 70.85 | 71.47 | 68.57 | 0.1636 | 69.42 | 13.56 |
| | XGBoost-RS | 72.58 | 70.83 | 69.43 | 0.2134 | 71.54 | 11.87 |
| | XGBoost-TPE | 71.87 | 72.82 | 70.82 | 0.1672 | 68.71 | 12.25 |
| | PSO-XGBoost | 69.43 | 73.49 | 72.69 | 0.1587 | 73.47 | 11.34 |
| | AHPSO-XGBoost | 76.49 | 75.58 | 73.45 | 0.1503 | 74.38 | 10.46 |
| LendingClubLoan | XGBoost | 71.42 | 70.41 | 69.79 | 0.1504 | 59.87 | 13.27 |
| | XGBoost-GS | 73.19 | 69.86 | 72.69 | 0.1484 | 49.76 | 11.54 |
| | XGBoost-RS | 70.42 | 72.58 | 70.83 | 0.1253 | 53.28 | 12.47 |
| | XGBoost-TPE | 72.39 | 71.39 | 71.59 | 0.1386 | 48.93 | 14.42 |
| | PSO-XGBoost | 73.87 | 74.83 | 73.82 | 0.1164 | 52.63 | 11.35 |
| | AHPSO-XGBoost | 76.87 | 75.78 | 74.89 | 0.1056 | 49.89 | 11.21 |
5.4 Kolmogorov-Smirnov Evaluation Metric
The evaluation of the generalization performance of machine learning algorithms tailored for credit scoring requires metrics that highlight how well creditworthy and non-creditworthy customers are distinguished and that ascertain whether the learning model has changed. In credit scoring, the availability of more data does not guarantee that more training data will improve the generalization performance of a traditional batch learning classifier: more data may lead classifiers to overfit, and the data may exhibit drifting characteristics. To evaluate the capability of a machine learning model to discriminate between customers who will pay their debts in full and those who will not, and to evaluate whether the variables have drifted, we implement the Kolmogorov-Smirnov (KS) metric [34]. The population scores may exhibit concept drift [35], since population scores may evolve over time while some populations remain stable. To evaluate whether the population scores have changed, the Population Stability Index (PSI) is adopted [36]. If the PSI evolves over time, it indicates degradation of the prediction model and signals that the population scores are changing, so a new prediction model needs to be built. The KS statistic measures the maximum distance between the empirical cumulative distribution functions (cdfs) of the scores obtained by customers who pay their debts in full and those who default [34]. Without loss of generality, if the number of customers to be scored is (𝑛+𝑚), customer 𝑖 is assigned \({D}_{i}=1\) if the debt is paid in full and \({D}_{i}=0\) otherwise. The empirical cdfs of good and bad customers are expressed by Eqs. 19 and 20 respectively, where 𝑛 denotes the number of good customers, 𝑚 the number of bad customers, \(L={min}_{1\le i\le (n+m)}{s}_{i}\) is the lower bound of all available scores and \(H={max}_{1\le i\le (n+m)}{s}_{i}\) is the upper bound.
$${F}_{good}\left(a\right)=\frac{1}{n}\sum _{i=1}^{n+m}\left\{\begin{array}{cc}1& \text{if } {s}_{i}\le a \text{ and } {D}_{i}=1\\ 0& \text{otherwise}\end{array}\right. \quad \text{with } a\in \left[L,H\right]$$
(19)
$${F}_{bad}\left(a\right)=\frac{1}{m}\sum _{i=1}^{n+m}\left\{\begin{array}{cc}1& \text{if } {s}_{i}\le a \text{ and } {D}_{i}=0\\ 0& \text{otherwise}\end{array}\right. \quad \text{with } a\in \left[L,H\right]$$
(20)
The Kolmogorov-Smirnov metric is given in Eq. (21) and expresses the maximum difference between the cdfs describing the good and the bad customers.
KS = \({max}_{a\in \left[L,H\right]}\left|{F}_{bad}\left(a\right)-{F}_{good}\left(a\right)\right|\) (21)
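A small sketch of Eq. (21) is shown below, assuming separate score arrays for good and bad customers; the two-sample statistic can equivalently be obtained from scipy.stats.ks_2samp.

```python
# Sketch of Eq. (21): the KS statistic is the maximum distance between
# the empirical cdfs of good and bad customers' scores (dummy data here).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
scores_good = rng.beta(5, 2, 1000)   # placeholder scores, D_i = 1
scores_bad = rng.beta(2, 5, 1000)    # placeholder scores, D_i = 0

ks_stat = ks_2samp(scores_good, scores_bad).statistic
print("KS:", ks_stat)  # 0 = indistinguishable; 1 (100%) = perfect separation
```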
If the value of the KS statistic is zero, the two credit score distributions are identical, indicating that the credit score cannot distinguish between defaulters and nondefaulters. If the KS value equals 100, the credit score perfectly distinguishes defaulters from nondefaulters. A KS score greater than 35% indicates substantial discriminative power of the prediction model in distinguishing the different types of customers. The Population Stability Index (PSI) shows the changes that occur in the population distribution of loan applicants. The PSI reveals changes in the environment that need further analysis by bank experts to determine whether macroeconomic conditions or lending policies are affecting the model outcomes [36]. To obtain the PSI score, we compute the probability distribution function (pdf) of the defaulting customers in two different time periods. The pdfs of these distributions are calculated using a specific number of ranges so that each range contains approximately the same number of defaulting customers. The variables \({n}_{i}\) and \({m}_{i}\) count the defaulting customers of the two samples in the ith bin, with \(\sum {n}_{i}\) = 𝑁 and \(\sum {m}_{i}\) = 𝑀. Given these counters, the PSI can be computed as:
PSI = \(\sum _{i=1}^{r}\left(\frac{{n}_{i}}{N}-\frac{{m}_{i}}{M}\right)\times \left(\text{ln}\frac{{n}_{i}}{N}-\text{ln}\frac{{m}_{i}}{M}\right)\) (22)
If PSI rates are below 10%, it is an indication that the population is quite stable and that the learning model is not unacceptably volatile.
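A corresponding sketch of Eq. (22) is given below, assuming r bins built from the quantiles of the first (reference) sample so that each bin holds approximately the same number of customers; the smoothing constant eps is an assumption added to avoid taking the logarithm of zero.

```python
# Sketch of Eq. (22): PSI between two score samples. Bins are quantiles
# of the reference sample; eps is an assumed smoother to avoid log(0).
import numpy as np

def psi(ref_scores, new_scores, r=10, eps=1e-6):
    edges = np.quantile(ref_scores, np.linspace(0, 1, r + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # cover the full range
    n_frac = np.histogram(ref_scores, bins=edges)[0] / len(ref_scores)
    m_frac = np.histogram(new_scores, bins=edges)[0] / len(new_scores)
    n_frac, m_frac = n_frac + eps, m_frac + eps
    return np.sum((n_frac - m_frac) * (np.log(n_frac) - np.log(m_frac)))

rng = np.random.default_rng(2)
period1 = rng.beta(2, 5, 5000)    # placeholder scores, first time period
period2 = rng.beta(2.2, 5, 5000)  # placeholder scores, second time period
print("PSI:", psi(period1, period2))  # below 0.10 suggests a stable population
```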
5.5 AHPSO-XGBoost Performance Using the Kolmogorov-Smirnov Metric
The use of batch learning to assess the prediction performance of AHPSO-XGBoost does not indicate whether variables will drift over time. To evaluate the effectiveness of AHPSO-XGBoost on both static and dynamic domains, so that it properly distinguishes creditworthy and non-creditworthy customers in environments where the customer variables may drift or evolve with time, we use the two real-world datasets, RRDai and LendingClubLoan. The traditional credit scoring datasets used in the literature are too small in terms of available features and instances and are unclear with regard to what each feature represents. For the two real-world datasets, the target indicates whether customers paid their debt in full or defaulted in the month being analyzed.

The learning algorithms used in the comparative study have different hyperparameters that may impact the generalization performance of the learned prediction models. For AHPSO-XGBoost, the hyperparameters are optimized by the AHPSO algorithm, while the remaining algorithms are optimized using a grid search with ten-fold cross-validation, with the goal of maximizing the KS rates on the test data. To learn incrementally, the data stream learning algorithms are tuned so that the KS rates during the testing phase are optimized.

To validate the prediction performance of AHPSO-XGBoost against other incremental learning algorithms, we adjusted the holdout validation scheme together with the test-then-train approach, since holdout validation alone is not tailored to streaming scenarios where data is often collected over an extended period of time. In the holdout validation scheme, the available data is split into a specific number of months for training and the remainder is used for testing. The objective is to test whether the created predictive model, its KS rates and its PSI generalize to unseen instances and changing populations over time. If the nature of the data is indeed ephemeral, the expectation is that both the AHPSO-XGBoost and KS rates will drop over time after the model is learned and that the PSI rates will increase; if the AHPSO-XGBoost, KS and PSI rates remain the same, incremental learning would not be required. To perform an informed analysis, different training and testing ratios of months are used, and different proportions of each dataset are tested to evaluate the impact of different data volumes and to find out whether the behavior of loan applications evolves over time. If the variables affecting loan applications are drifting, a batch model would erroneously score the loan applications, resulting in lower KS and AHPSO-XGBoost rates and a higher PSI.

The monthly test-then-train validation scheme uses each month for training immediately after it has been used for evaluation. The objective is to compare the results generated by a learner that is constantly updated with new data against a model that was learned up to a predetermined point and evaluated later. Comparing the results of the two models enables us to determine whether continuously updating the predictive learning models significantly improves the AHPSO-XGBoost and KS results without compromising the PSI rates.
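The sketch below illustrates the monthly test-then-train loop described above; monthly_batches, the incremental model interface (predict_proba / partial_fit) and the ks_score helper are placeholders for the actual stream setup rather than our exact implementation.

```python
# Simplified sketch of the monthly test-then-train validation scheme.
# `model` (with predict_proba / partial_fit), `monthly_batches` and the
# KS helper below are placeholders for the actual stream learning setup.
from scipy.stats import ks_2samp

def ks_score(y_true, y_prob):
    """KS between score distributions of good (1) and bad (0) customers."""
    return ks_2samp(y_prob[y_true == 1], y_prob[y_true == 0]).statistic

def monthly_test_then_train(model, monthly_batches):
    """Assumes `model` was already initialized on the first training months."""
    ks_per_month = []
    for X_month, y_month in monthly_batches:
        y_prob = model.predict_proba(X_month)[:, 1]   # 1) test first
        ks_per_month.append(ks_score(y_month, y_prob))
        model.partial_fit(X_month, y_month)           # 2) then train
    return ks_per_month
```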
5.6 Analysis of the Results
Figure 1 shows the KS rates for the RRDai dataset. All the algorithms learn incrementally, and we use the KS rates to evaluate their behavior.
From the experiments conducted on the RRDai dataset, there is a positive trend: the KS rates grow over time for all the stream learning algorithms, an indication that the classifiers are capable of discerning between creditworthy and non-creditworthy customers even as more training and testing data arrive. The Friedman test [37], a nonparametric test that ranks the classifiers separately, was used to compare the ranking performance of all ten stream learning algorithms. The test revealed no statistically significant differences among the ten methods, and the PSI rates of all the classifiers are low.
Figure 2 shows the PSI rates obtained on the RRDai dataset. The PSI rates fluctuate over time, an indication that the variables are drifting, the prediction model has degraded, the population itself is changing, and a new learning model needs to be built.
Figure 3 shows the KS rates on the LendingClubLoan dataset. All the data stream learning algorithms were able to learn a consistent prediction model capable of discerning between creditworthy and non-creditworthy customers.
The KS results obtained indicate that the availability of more data generated higher KS rates. The Friedman test indicates that there are no statistically significant differences between the observed KS rates, which is largely attributable to the fact that the dataset is quite stable.
The PSI rates for the LendingClubLoan dataset are shown in Figure 4. The population continues to change, as in the RRDai dataset, even though the dataset shows signs of stability. The PSI rates continue to fluctuate, and the model needs to be updated continuously.
5.7 Statistical Tests of Significance
To conclude the empirical experiments, we perform a complete performance evaluation by carrying out hypothesis tests to determine whether the experimental differences in performance are statistically significant. The Friedman test [37], a nonparametric test that ranks the classifiers separately, is used to compare the performance rankings of all the algorithms. Its null hypothesis states that all classifiers under consideration perform identically and that all differences are merely random fluctuations. If the null hypothesis of the Friedman test is rejected, a post hoc test is conducted to find the particular pairwise comparisons that generate significant differences. The post hoc test used here is the Nemenyi test [38], in which the prediction performances of two learning algorithms are considered significantly different if their average ranks differ by at least the Critical Difference (CD). The Critical Difference (CD) is expressed as
CD = \({q}_{\alpha }\sqrt{\frac{k\left(k+1\right)}{6n}}\), where \({q}_{\alpha }\) is the critical value based on the Studentized range (Tukey) distribution, \(k\) is the number of machine learning prediction models to be compared and \(n\) is the number of datasets.
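For completeness, the sketch below computes the critical difference for a setting such as ours; the value q_α = 3.164 for k = 10 models at α = 0.05 is taken from standard Nemenyi tables and should be treated as an assumption here.

```python
# Sketch: Nemenyi critical difference for comparing average ranks.
# q_alpha = 3.164 is the tabulated value for k = 10 models at
# alpha = 0.05; treat it as an assumption taken from standard tables.
import math

def critical_difference(q_alpha, k, n):
    """CD = q_alpha * sqrt(k * (k + 1) / (6 * n))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

print(critical_difference(3.164, k=10, n=4))  # k models over n datasets
```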
Table 5 shows the average rank of each model on each dataset.
Table 5
The average rank of each model on each dataset
| Classifier | Australia | German | RRDai | LendingClubLoan | Average Rank |
|---|---|---|---|---|---|
| AdaBoost | 2.25 | 3.45 | 3.42 | 2.85 | 2.99 |
| AdaBoost-NN | 3.14 | 2.78 | 3.78 | 4.00 | 3.43 |
| Bagging-DT | 2.84 | 3.46 | 2.96 | 3.86 | 3.28 |
| Bagging-NN | 1.94 | 3.58 | 2.86 | 4.68 | 3.27 |
| Random Forest | 1.86 | 2.74 | 3.48 | 3.94 | 3.01 |
| Logistic Regression | 2.72 | 3.82 | 2.98 | 3.84 | 3.34 |
| Decision Tree | 3.24 | 2.54 | 1.86 | 4.35 | 3.00 |
| Leveraging Bagging | 1.35 | 1.89 | 2.34 | 2.18 | 1.94 |
| Adaptive Random Forest | 1.48 | 2.34 | 2.00 | 2.54 | 2.09 |
| AHPSO-XGBoost | 1.00 | 1.35 | 2.26 | 2.45 | 1.77 |
The proposed AHPSO-XGBoost prediction model performs best among the ten models with an average rank of 1.77. AHPSO-XGBoost is slightly better than Leveraging Bagging, whose average rank is 1.94, which demonstrates the robustness and effectiveness of our approach in distinguishing between creditworthy and non-creditworthy customers. Figure 5 shows the CD diagram with the results of the Nemenyi test. The horizontal axis represents the average rankings of all the machine learning models, and models whose difference in average ranks is lower than the CD value are connected by a black bar. In the Nemenyi diagram, the ranks of all ten learning models are shown, and our proposed AHPSO-XGBoost model is significantly better than the other nine learning models.