Let \(T=\left\{\left(x_{i},y_{i}\right)\mid i=1,2,\cdots ,n\right\}\) be a training dataset, where \(x_{i}\) denotes an instance and \(y_{i}\in \left\{0,1\right\}\) is its corresponding output class. A sample belongs to the majority (negative) class if \(y_{i}=0\) and to the minority (positive) class if \(y_{i}=1\). To build an effective credit scoring model for high-dimensional datasets, a novel multiple-optimized ensemble learning (MOEL) method is proposed. The overall process of MOEL is illustrated in Fig. 1 and consists of four main steps. In the first step, the training set is partitioned into multiple subsets using weighted random forests (WRFs) to construct diverse and optimized subsets. In the second step, a new evaluation metric that considers the data distribution in each subset is applied to select better-optimized subsets. In the third step, a novel oversampling method is applied to balance the class distribution in each optimized subset. Finally, a stacking-based ensemble method is employed to build an optimized credit scoring model. The detailed mechanisms of these steps are discussed in the following sub-sections.
3.1. Generation of multiple diverse optimized subsets (MDOS)
The redundant and irrelevant features in high-dimensional imbalanced datasets have a negative impact on credit scoring models, which can cause banks to suffer significant financial losses. Applying resampling to the original feature space of a high-dimensional imbalanced dataset may not only amplify the error rate but also increase the training time. As a result, in addition to resampling, selecting the relevant features is crucial to improve the model's classification performance. To address these two issues, the training set is segmented into multiple diverse optimized subsets (MDOS) so that the relevant features can be chosen from the high-dimensional imbalanced dataset. To make the subsets more diverse and discriminative, WRFs consisting of multiple random trees with different class weights are applied to each subset. The WRFs are adopted for the following reasons: 1) in each subset, they generate sufficient diversity by randomly selecting both features and instances, 2) the relevance of the features in each random tree is calculated from the data at its split nodes, 3) they handle the uneven class distribution by adjusting the weights of the samples, and 4) combining the weighted random trees helps to build a more accurate and effective predictive model.
To lessen the negative impact of the imbalanced data, the WRF assigns more weight to the minority class instances to increase their importance. The weight of the samples of class \(c\) in the \(i\)th weighted random tree is defined as:
$$w_{i}^{c}=\frac{N_{i}}{C\cdot N_{i}^{c}}$$
1
where \(w_{i}^{c}\) represents the weight of the instances of the \(c\)th class in the \(i\)th tree, \(C\) is the number of classes, \(N_{i}\) is the total number of instances, and \(N_{i}^{c}\) is the number of instances belonging to the \(c\)th class.
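For illustration, the class weights of Eq. (1) can be computed directly from the label vector of a subset. The sketch below is ours (function and variable names are not part of the original method); note that Eq. (1) coincides with the "balanced" class-weight heuristic used in scikit-learn.

```python
import numpy as np

def wrf_class_weights(y):
    """Class weights of Eq. (1): w_i^c = N_i / (C * N_i^c), where N_i is the
    subset size, C the number of classes, and N_i^c the count of class c."""
    classes, counts = np.unique(y, return_counts=True)
    n, n_classes = len(y), len(classes)
    return {cls: n / (n_classes * cnt) for cls, cnt in zip(classes, counts)}

# Example: 90 majority (y=0) and 10 minority (y=1) samples
y = np.array([0] * 90 + [1] * 10)
print(wrf_class_weights(y))  # {0: ~0.556, 1: 5.0} -> minority weighted higher
```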
In this work, the relevance of the features in each random tree is calculated using the Gini index [47]. The non-creditworthy applicants are considered positive samples and the creditworthy applicants negative samples. The Gini index measures how well the \(j\)th node of the \(i\)th weighted random tree splits the samples into positive and negative classes. Mathematically, it is expressed as:
$${G}_{i,j}=1-{p}_{i,0}^{2}-{p}_{i,1}^{2}$$
2
where \(p_{i,0}\) and \(p_{i,1}\) are the proportions of negative and positive samples in the \(j\)th node of the \(i\)th weighted tree, respectively.
The importance of feature \(f\) at the \(j\)th node of the \(i\)th weighted tree is computed by considering both the sample weights and the Gini values before and after branching, which is expressed as:
$${IM}_{i,j}\left(f\right)=\frac{\sum _{c=0}^{1}w_{i}^{c}\,N_{i,j,c}}{N_{i}}\,G_{i,j}-\frac{\sum _{c=0}^{1}w_{i}^{c}\,N_{i,j\_l,c}}{N_{i}}\,G_{i,j\_l}-\frac{\sum _{c=0}^{1}w_{i}^{c}\,N_{i,j\_r,c}}{N_{i}}\,G_{i,j\_r}$$
3
where the first term of Eq. (3) represents the weighted number of samples and the Gini value of the \(j\)th node before splitting, and the other two terms represent the weighted numbers of samples and the Gini values after splitting. \(N_{i,j,c}\) denotes the number of samples of the \(c\)th class at the \(j\)th node, while \(N_{i,j\_l,c}\) and \(N_{i,j\_r,c}\) denote the numbers of samples of the \(c\)th class in the left and right children of the \(j\)th node of the \(i\)th tree, respectively. \(G_{i,j\_l}\) and \(G_{i,j\_r}\) represent the Gini values of the left and right children of the \(j\)th node, respectively.
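For concreteness, the per-node importance of Eq. (3) is a weighted Gini decrease and can be written as a small helper once the class counts of a node and its children are known. The following sketch is only illustrative; all names are ours and the counts are assumed to be extracted from the fitted tree.

```python
def gini(n_neg, n_pos):
    """Gini impurity of Eq. (2) from the class counts at a node."""
    total = n_neg + n_pos
    if total == 0:
        return 0.0
    p_neg, p_pos = n_neg / total, n_pos / total
    return 1.0 - p_neg ** 2 - p_pos ** 2

def node_importance(parent, left, right, w, n_tree):
    """Weighted impurity decrease of Eq. (3) for one split node.

    parent, left, right : (n_neg, n_pos) class counts at the node and its children
    w                   : class weights from Eq. (1), e.g. {0: w_neg, 1: w_pos}
    n_tree              : total number of samples N_i in the ith weighted tree
    """
    def weighted_share(counts):
        # (sum_c w_c * N_{i,j,c}) / N_i
        return (w[0] * counts[0] + w[1] * counts[1]) / n_tree

    return (weighted_share(parent) * gini(*parent)
            - weighted_share(left) * gini(*left)
            - weighted_share(right) * gini(*right))
```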
The importance of each feature \(f\in F_{i}\) in the \(i\)th weighted random tree is obtained by summing its node-level importance over the set \(J_{i}\left(f\right)\) of nodes that split on \(f\):
$$IM_{i}\left(f\right)=\sum _{j\in J_{i}\left(f\right)}{IM}_{i,j}\left(f\right)$$
4
Finally, the importance of the features can be computed by combining their importance in all random trees of the WRF, which is expressed as:
$$FIM\left(f\right)=\sum _{i=1}^{m}{IM}_{i}\left(f\right)$$
5
where \(m\) is the number of trees in the WRF.
To obtain the optimized feature subset (FS), the features with the highest \(FIM\) values are incrementally added to the subset. The feature selection procedure starts with an empty subset, and the performance of a classifier \(M\) is evaluated after each feature is added. The F1 score is used as the performance metric because it is more appropriate for assessing the model's predictive ability on imbalanced datasets [47]. First, the feature with the highest \(FIM\) value is added to the FS, and the performance obtained by the classifier is taken as the baseline. The subset is then expanded sequentially by adding the next most informative feature. After each addition, the performance of the candidate FS is compared against the current best FS; if the candidate achieves a higher F1 score, it becomes the current best FS. This process is repeated until all features have been evaluated, and the resulting optimized FS is used in the subsequent steps. By repeating this process with different WRFs, a set of optimized feature subsets \(FS_{i}\), \(i=1,\cdots ,m\), is generated. This rank-based feature selection process helps to eliminate invalid and redundant features from the high-dimensional imbalanced credit dataset. Algorithm 1 presents the pseudocode of MDOS.
Algorithm 1: MDOS
Input: Training dataset \(T={\left\{\left(x_{i},y_{i}\right)\right\}}_{i=1}^{n}\) with \(F\) features and \(m\) WRFs
Output: Optimized feature subsets \(FS_{i}\), \(i=1,\cdots ,m\)
Procedure:
1. for \(i=1,2,\cdots ,m\) do
2.   Build a \(WRF_{i}\) using \(T\)
3.   Compute the \(FIM_{i}\) value of each feature of \(WRF_{i}\) using (1)-(5)
4.   \(FS_{i}=\left\{\right\}\)
5.   Select the feature \(f_{1}\) with the highest \(FIM_{i}\) value as the baseline feature subset, i.e., \(FS_{i}=\left\{f_{1}\right\}\)
6.   Evaluate the classifier on \(FS_{i}\), i.e., \(baseline=F1\_score\left(FS_{i}\right)\)
7.   for \(j=2\) to \(F\) do
8.     Select the next best feature \(f_{j}\) and evaluate the classifier on \(FS_{i}\cup \left\{f_{j}\right\}\)
9.     If \(F1\_score\left(FS_{i}\cup \left\{f_{j}\right\}\right)>baseline\), then \(FS_{i}=FS_{i}\cup \left\{f_{j}\right\}\) and \(baseline=F1\_score\left(FS_{i}\right)\)
10.  end for
11. end for
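A minimal Python sketch of Algorithm 1 is given below. It substitutes the FIM scores of Eqs. (1)-(5) with the impurity-based feature_importances_ of a class-weighted scikit-learn random forest (an approximation on our part) and then runs the rank-based forward selection with the F1 score; all identifiers are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def mdos_subset(X, y, random_state=0):
    """Sketch of Algorithm 1 for one WRF: rank features with a class-weighted
    random forest, then greedily grow the subset while the F1 score improves."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, stratify=y, test_size=0.3, random_state=random_state)

    # Class-weighted random forest standing in for the WRF of Eqs. (1)-(5)
    wrf = RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=random_state)
    wrf.fit(X_tr, y_tr)
    ranked = np.argsort(wrf.feature_importances_)[::-1]  # highest FIM first

    def f1_of(features):
        clf = LogisticRegression(max_iter=1000).fit(X_tr[:, features], y_tr)
        return f1_score(y_val, clf.predict(X_val[:, features]))

    subset = [ranked[0]]       # step 5: start with the top-ranked feature
    baseline = f1_of(subset)   # step 6: baseline F1 of the current best subset
    for f in ranked[1:]:       # steps 7-10: keep a feature only if F1 improves
        candidate = subset + [f]
        score = f1_of(candidate)
        if score > baseline:
            subset, baseline = candidate, score
    return subset
```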
3.2. Selection of better-optimized subsets
In the previous step, a set of multiple diverse optimized subsets is generated by applying a rank-based feature selection process to the WRFs. Since the subsets are obtained from different WRFs, they may have different generalization capabilities. The major goal of this stage is to pick the optimized subsets with the best performance by evaluating each optimized subset with a newly developed evaluation metric called weight density. This metric assesses the weight and density of each sample of a subset, taking into account both the feature space and the imbalanced class distribution, so that more optimized subsets can be selected.
Due to the imbalanced class distribution, the minority class instances are given more importance by assigning them higher weights than the majority class instances. The weights of the majority class samples \(W_{maj}\) and the minority class samples \(W_{min}\) are defined as:
$${W}_{min}=\left|{T}_{maj}\right|/\left|{T}_{min}\right|$$
7
where \(\left|T_{maj}\right|\) and \(\left|T_{min}\right|\) represent the numbers of majority and minority class instances in the optimized subset, respectively.
Additionally, the density of the instances in each optimized subset is calculated by counting the number of neighboring instances that belong to the other class. This density factor helps to detect noisy samples in the dataset: if the nearest neighbors \(KNN\left(x_{i}\right)\) of a particular instance \(x_{i}\) exclusively contain examples of a different class, that instance is regarded as noise and could harm the model's performance.
In each subset, the density of an instance \(x_{i}\) is computed from its K-nearest neighbors \(KNN\left(x_{i}\right)\). Let \(KNN\left(x_{i}\right)=N_{i,maj}\cup N_{i,min}\), where \(N_{i,maj}\) and \(N_{i,min}\) are the sets of majority and minority class instances located in the neighborhood of \(x_{i}\). For each \(x_{i}\) belonging to the \(j\)th optimized subset, its density is defined as:
$$D\left({x}_{i},j,min\right)=\frac{\left|{N}_{i,maj}\right|}{K}$$
8
$$D\left({x}_{i},j,maj\right)=\frac{\left|{N}_{i,min}\right|}{K}$$
9
where \(D\left(x_{i},j,min\right)\) and \(D\left(x_{i},j,maj\right)\) represent the densities of minority and majority class instances, respectively.
The weight-density measure of the \(j\)th optimized subset is computed by combining the weights and densities over all of its instances, which is expressed as:
$${WD}_{j}=\sum _{x_{i}\in {\varnothing }_{j}}\left[{W}_{maj}\times D\left(x_{i},j,maj\right)+{W}_{min}\times D\left(x_{i},j,min\right)\right]$$
10
When an instance's neighborhood contains more instances from a different class, the value of \({WD}_{j}\) increases. As a result, the decision boundary becomes fuzzier and more complicated, increasing the likelihood that such instances are wrongly classified, which lowers the generalization ability. Therefore, to achieve better-optimized ensemble learning, we select the top-\(\alpha \) optimized subsets \(\left({\varnothing }_{1},{\varnothing }_{2},\cdots ,{\varnothing }_{\alpha }\right)\) having the minimum values of \({WD}_{j}\). The process of generating optimized subsets for ensemble learning is presented in Algorithm 2.
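The weight-density score can be prototyped with a plain k-NN search, as in the sketch below. It is a rough sketch under two assumptions of ours: the majority-class weight of Eq. (6), which is not reproduced above, is taken as 1, and the instance-level terms of Eq. (10) are summed over the subset; all names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weight_density(X, y, k=5, w_maj=1.0):
    """Weight-density measure of Eqs. (7)-(10) for one optimized subset.
    w_maj defaults to 1 (Eq. (6) is not shown here; this is an assumption),
    while w_min = |T_maj| / |T_min| follows Eq. (7)."""
    n_maj, n_min = np.sum(y == 0), np.sum(y == 1)
    w_min = n_maj / n_min

    # k nearest neighbours of every instance, excluding the instance itself
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    neighbour_labels = y[idx[:, 1:]]

    wd = 0.0
    for label, neigh in zip(y, neighbour_labels):
        if label == 1:   # minority instance: fraction of majority neighbours, Eq. (8)
            wd += w_min * np.mean(neigh == 0)
        else:            # majority instance: fraction of minority neighbours, Eq. (9)
            wd += w_maj * np.mean(neigh == 1)
    return wd  # lower values indicate cleaner class neighbourhoods

# Subsets with the smallest weight_density(...) values are kept (top-alpha).
```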
Algorithm 2: Selection of better-optimized subsets
Input: Training dataset \(T={\left\{\left(x_{i},y_{i}\right)\right\}}_{i=1}^{n}\) with \(F\) features and \(m\) WRFs
Output: Better-optimized subsets
Procedure:
1. for \(i=1,2,\cdots ,m\) do
2.   Implement Algorithm 1 to obtain the feature subsets \(FS_{i}\left(i=1,2,\cdots ,m\right)\)
3.   Set the weights of the majority and minority instances using (6) and (7), respectively
4.   Compute the densities of the minority and majority class instances using (8) and (9), respectively
5.   Compute the weight-density measure \({WD}_{i}\) of each feature subset using (10)
   end for
6. Select the top-\(\alpha \) optimized subsets with the minimum values of \(WD\), i.e., \(\left({\varnothing }_{1},{\varnothing }_{2},\cdots ,{\varnothing }_{\alpha }\right)\)
3.3. Resampling of optimized subsets
Each of the optimized subsets \({\varnothing }_{i}\left(i=1,2,\cdots ,\alpha \right)\) is used to build the base classifiers of the stack-based ensemble model. To increase the effectiveness of the ensemble model, a novel oversampling method is applied to balance each optimized subset. To oversample the subsets, the Mahalanobis distance metric is adopted to compute the distance between the data samples. It is adopted for the following reasons: 1) it is able to recognize highly correlated data and outliers in the dataset, and 2) eliminating these data samples can enhance the performance of the model. The Mahalanobis distance \(\left({Dist}^{2}\right)\) of each minority sample \(x_{i}^{min}\in {\varnothing }_{i}\) is computed as:
$${Dist}_{i}^{2}={\left(x_{i}-\mu \right)}^{T}{\Sigma }^{-1}\left(x_{i}-\mu \right)$$
11
where \(\mu \) is the mean of \(T_{min}\) and \(\Sigma \) is its covariance matrix. The minority samples are first sorted according to their \({Dist}^{2}\) values in descending order. Next, the sorted minority samples are divided into two equal partitions: the instances of \(partition1\) have \({Dist}^{2}\) values greater than or equal to that of the middle instance, while the remaining instances belong to \(partition2\).
$$partition1=\left({x}_{1}^{min},{x}_{2}^{min},\cdots {x}_{\frac{n}{2}}^{min}\right) and partition2=\left({x}_{\frac{n}{2}+1}^{min},{x}_{\frac{n}{2}+2}^{min},\cdots {x}_{n}^{min}\right)$$
12
where \(n\) is the number of minority examples in the optimized subset. Then a unique label \(l_{i}\) is sequentially assigned to each instance of both partitions: the first instances of the two partitions receive the same label, and the remaining instances are labeled likewise. The instances of both partitions are labeled as:
$$partition1\left(label\left(x_{i}^{min}\right)\right)={l}_{i} , i=\left\{1,\cdots ,n/2\right\} and$$
$$partition2\left(label\left({x}_{i}^{min}\right)\right)={l}_{i-n/2}, i=\left\{\frac{n}{2}+1, \cdots ,n \right\}$$
13
The final stage involves creating synthetic instances by averaging the two instances, one from each partition, that share the same label, which is defined as:
$${x}_{i}^{syn\_min}=\left({x}_{i}+{x}_{j}\right)/2$$
14
where \(x_{i}\) and \(x_{j}\) are instances of partition1 and partition2 with the same label. The synthetic instances are added to the minority subset of the optimized subset \({\varnothing }_{i}\). The newly generated instances are distinct but related to the instances of both partitions. The same procedure is repeated sequentially for the remaining instances, and each new instance is given the same label as its parents. This procedure continues until the optimized subset is balanced. The advantages of the proposed oversampling method are: 1) contrary to SMOTE or KNN-based techniques, which generate synthetic instances within a specific cluster of the minority class subset and give the classifier no additional information, the synthetic instances produced by the proposed method are evenly distributed throughout the minority class region, and 2) the newly generated instances lie strictly within the region of the minority class instances, so no overlapping of synthetic instances with the majority class takes place. The pseudocode of the proposed Mahalanobis-based oversampling method is presented in Algorithm 3.
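A compact sketch of one pass of the oversampling step (Eqs. (11)-(14)) is given below: Mahalanobis distances, descending sort, equal partitioning, label pairing, and averaging. The repeated pairing of Algorithm 3 (step 14) is omitted for brevity, and the pseudo-inverse of the covariance matrix is our own safeguard against singularity; all identifiers are illustrative.

```python
import numpy as np

def mahalanobis_oversample_once(X_min, n_new):
    """One pass of the Mahalanobis-based oversampling (Eqs. (11)-(14)).

    X_min : minority-class instances of one optimized subset (2-D array)
    n_new : number of synthetic instances still required
    Returns at most min(n_new, len(X_min) // 2) synthetic minority samples.
    """
    mu = X_min.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X_min, rowvar=False))  # guards singular covariance
    diff = X_min - mu
    dist2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # Eq. (11)

    # Sort by distance (descending) and split into two equal partitions, Eq. (12)
    order = np.argsort(dist2)[::-1]
    half = len(X_min) // 2
    part1, part2 = X_min[order[:half]], X_min[order[half:2 * half]]

    # Instances sharing the same sequential label are averaged, Eqs. (13)-(14)
    synthetic = (part1 + part2) / 2.0
    return synthetic[:n_new]
```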
3.4. Stacking-based ensemble method
In the final step, a stacking-based ensemble method with self-adaptive parameter optimization is applied to the optimized balanced subsets to train the base classifiers. The performance of the base classifiers can be enhanced by optimizing their parameters; in this method, the parameters of the base classifiers are auto-optimized, which improves the performance of the ensemble model.
As shown in Fig. 2, base classifiers (\({clf}_{1},{clf}_{2},\cdots ,{clf}_{n}\)) such as logistic regression (LR), random forest (RF), XGBoost, gradient boosting decision tree (GBDT), and light gradient boosting machine (LGBM) are employed. The algorithms are trained on the optimized balanced subsets and their performance is validated on the validation set. Next, the top \(m\) classifiers (\({sclf}_{1},{sclf}_{2},\cdots ,{sclf}_{m}\)) are selected based on their superior F1 scores. To further improve the performance of the selected classifiers, their parameters are optimized using a hyperparameter optimization framework [48]. The predicted responses of the optimized classifiers \(\left({oclf}_{1},{oclf}_{2},\cdots ,{oclf}_{m}\right)\) are stacked and integrated by a meta-classifier to obtain the final prediction. LR is employed as the meta-classifier because of its high predictive performance in stacking-based ensemble methods [8].
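The stacking stage can be outlined with scikit-learn's StackingClassifier, as sketched below. Only scikit-learn estimators are used here (XGBoost and LightGBM would be added as further base learners), and the base-classifier selection by validation F1 score and the self-adaptive hyperparameter optimization of [48] are omitted, so this is a simplified outline rather than the authors' exact implementation.

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

def build_stacked_model():
    """Outline of the stacking step: selected base classifiers feed their
    class probabilities to a logistic-regression meta-classifier."""
    base_learners = [
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, class_weight="balanced")),
        ("gbdt", GradientBoostingClassifier()),
        # XGBClassifier / LGBMClassifier could be appended here if available
    ]
    return StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),  # LR meta-classifier
        stack_method="predict_proba",
        cv=5,
    )

# Usage sketch: one stacked model per balanced optimized subset
# model = build_stacked_model().fit(X_balanced, y_balanced)
```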
Algorithm 3: Mahalanobis-based Oversampling Technique
Input: Optimized subsets \({\varnothing }_{i}\), minority class subset \(T_{min}\), and majority class subset \(T_{maj}\)
Output: Balanced subsets
Procedure:
1. for \(i=1,2,\cdots ,\alpha \) do
2.   Find the number of new synthetic samples to be generated, i.e., \(r=\left|T_{maj}\right|-\left|T_{min}\right|\)
3.   For each \(x_{i}\in T_{min}\), compute its Mahalanobis distance \({Dist}^{2}\) using (11)
4.   Sort the minority instances according to their \({Dist}^{2}\) values in descending order
5.   Divide the minority class subset into two equal partitions, i.e., \(mid=n/2\)
6.   For each \(x_{i}\in partition1\) and \(y_{i}\in partition2\), sequentially assign a unique label using (13)
7.   Initialize \(c=0\)
8.   Initialize the synthetic minority subset \(T_{min}^{syn}=\left\{\right\}\)
9.   for \(j=1,2,\cdots ,n/2\) do
10.    Select \(x_{i}\) and \(x_{i+\frac{n}{2}}\) from partition1 and partition2, respectively, such that \(label\left(x_{i}\right)=label\left(x_{i+\frac{n}{2}}\right)\)
11.    Generate a synthetic instance \(x\) by averaging the instances \(x_{i}\) and \(x_{i+\frac{n}{2}}\)
12.    \(T_{min}^{syn}=T_{min}^{syn}\cup \left\{x\right\}\)
13.    \(c=c+1\)
     end for
14.  If \(c<r\), pair the synthetic minority instances \(T_{min}^{syn}\) with the samples of both partitions and repeat steps (9)-(13). If still \(c<r\), pair the instances of the current \(T_{min}^{syn}\) with those of the immediately preceding level and then with earlier levels, repeating steps (9)-(13) for each subsequent level.
15.  If \(c\ge r\), combine the generated synthetic minority samples with the original minority instances, \(T_{gmin}=T_{min}\cup T_{min}^{syn}\)
16.  Finally, the optimized subset becomes balanced, i.e., \({\varnothing }_{i\_bal}=T_{gmin}\cup T_{maj}\)
   end for