In our experiment, four evaluation metrics are used to evaluate the performance of the four classifiers: accuracy, precision, recall, and F1-score. The ROC curve and the confusion matrix were used to assess the predictive power of the models.
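For reference, all four metrics can be computed directly from the entries of a binary confusion matrix. The sketch below is illustrative only; the counts are hypothetical, not taken from our experiments:

```python
def metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # a.k.a. true positive rate / sensitivity
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical counts for illustration
acc, prec, rec, f1 = metrics(tp=80, fp=10, fn=20, tn=90)
print(acc, prec, rec, f1)
```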
1. Logistic Regression
The model was trained on the training set, predictions were made on the test set, and the ROC curve was then drawn. Figure 2a shows the ROC curve and AUC of the logistic regression with SMOTE, and Figure 2b shows the results for the data resampled with MSMOTE. The ROC curves show test-set AUC values of 83.1% with SMOTE and 82.1% with MSMOTE.
Compared with the SMOTE test set, the AUC with MSMOTE decreases slightly, to about 82%.
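These AUC values can also be understood without plotting: AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (the Mann-Whitney statistic). A minimal pairwise implementation, on hypothetical scores rather than our data:

```python
def auc(scores_pos, scores_neg):
    """AUC as the Mann-Whitney statistic: the fraction of (positive, negative)
    score pairs ranked correctly, counting ties as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical predicted default probabilities
pos = [0.9, 0.8, 0.55]  # actual defaults
neg = [0.7, 0.3, 0.2]   # actual non-defaults
print(auc(pos, neg))
```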
The accuracies in Table 2 show that the predictive power with the two resampling techniques (SMOTE and MSMOTE) is nearly identical, with MSMOTE performing slightly better in terms of accuracy.
Figure 3 presents the confusion matrix of the logistic regression with MSMOTE. Table 2 shows the evaluation metrics of the logistic regression model with SMOTE and MSMOTE.
Table 2. Summary of evaluation metrics of the logistic regression.
| Classifiers | Accuracy (%) | AUC (%) | Precision (%) | Recall (%) | F1-Score (%) |
| Logistic regression + SMOTE | 82 | 83 | 97 | 83 | 89 |
| Logistic regression + MSMOTE | 87 | 82 | 96 | 89 | 93 |
In Figure 3, the logistic regression model with MSMOTE correctly predicts 89.5% of non-defaulted loans and 55.25% of defaulted loans; 44.75% of defaults and 10.5% of good loans are missed. Consequently, to minimize losses from loan defaults, the number of missed defaults needs to be minimized and the number of correctly predicted non-defaulted loans maximized.
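The missed-loan rates follow directly from the per-class hit rates, since each row of a normalized confusion matrix sums to 100%:

```python
# Per-class hit rates reported for logistic regression + MSMOTE (Figure 3)
correct_non_default = 89.5   # % of non-defaulted loans predicted correctly
correct_default = 55.25      # % of defaulted loans predicted correctly

# Complements: the misclassified share of each class
missed_good = 100 - correct_non_default    # good loans flagged as defaults
missed_defaults = 100 - correct_default    # defaults predicted as good loans

print(missed_good, missed_defaults)
```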
2. Random Forest
Figures 4a and 4b show the ROC curves of the Random Forest model with SMOTE and MSMOTE, respectively. Each curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), with both rates moving together from 0 to 1 as the decision threshold varies. A good classification model always has its ROC curve above the black baseline (the broken line). The AUC of the Random Forest model with SMOTE resampling is 89%, while that with MSMOTE is 99%. This means the Random Forest model with MSMOTE performs better than the one with SMOTE.
From the confusion matrix, we can directly calculate the precision, recall, F1-score, and accuracy of the random forest algorithm to show its performance.
Figure 5 shows the confusion matrix of the random forest model when the data is resampled with MSMOTE. As shown there, the model predicts non-defaulted loans better than defaulted loans. Overall, the results are better than those of the logistic regression. As shown in Table 3, the accuracy of the random forest model is 99% with MSMOTE and 96% with SMOTE.
Table 3. Summary of evaluation metrics of the random forest classifier.
| Classifier | Accuracy (%) | AUC (%) | Precision (%) | Recall (%) | F1-Score (%) |
| RF + SMOTE | 96 | 89 | 96 | 99 | 98 |
| RF + MSMOTE | 99 | 99 | 99 | 99 | 99 |
Finally, the accuracy score (the number of correct predictions over the total number of predictions) on the training and test data of the random forest with MSMOTE is checked. The classifier gives an accuracy score of about 99% on both, with no overfitting.
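A simple way to screen for overfitting, as done here, is to compare train and test accuracy: a large gap suggests the model has memorized the training data. A small sketch, where the tolerance threshold is an arbitrary illustrative choice, not a value from this study:

```python
def overfit_gap(train_acc, test_acc, tolerance=1.0):
    """Return the train-test accuracy gap (in percentage points) and whether
    it exceeds a chosen tolerance, i.e. a possible sign of overfitting."""
    gap = train_acc - test_acc
    return gap, gap > tolerance

# Random forest + MSMOTE scores from Table 4
gap, overfits = overfit_gap(99.99829, 99.99774)
print(gap, overfits)
```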
Table 4. Train and test accuracy score of random forest with MSMOTE.
| Model | Train Accuracy (MSMOTE) | Test Accuracy (MSMOTE) |
| Random Forest | 99.99829% | 99.99774% |
3. Bagging Classifiers
Figures 6a and 6b display the ROC curves and AUC values of the bagging classifier. On the test set, bagging with SMOTE and with MSMOTE achieved AUC values of 89% and 99%, respectively.
Figure 7 displays the confusion matrix of the bagging classifier after the dataset is balanced with MSMOTE. The evaluation metrics, precision, recall, F1-score, and accuracy, are all calculated from this confusion matrix and are summarized in Table 5 below.
Table 5. Summary of evaluation metrics of the bagging classifier.
| Classifier | Accuracy (%) | AUC (%) | Precision (%) | Recall (%) | F1-Score (%) |
| Bagging + SMOTE | 96 | 89 | 96 | 99 | 98 |
| Bagging + MSMOTE | 99 | 99 | 99 | 99 | 99 |
In terms of F1-score, all classifiers perform well above 50%; the bagging classifier with MSMOTE achieves an F1-score of 99%.
When checked for overfitting, the model shows slight overfitting with SMOTE and performs better with MSMOTE, as the train and test accuracy scores of bagging with MSMOTE are very close. This means there is almost no overfitting when applying bagging with MSMOTE.
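To make the bagging mechanism concrete, the toy sketch below implements it from scratch on a 1-D problem: each base learner (a decision stump) is trained on a bootstrap resample of the data, and predictions are combined by majority vote. This is an illustrative toy, not the actual classifier used in our experiments:

```python
import random
from collections import Counter

def fit_stump(data):
    """Fit a 1-D decision stump: pick the threshold minimizing training error
    for the rule `predict True when x > threshold`."""
    best = None
    for t in sorted({x for x, _ in data}):
        err = sum((x > t) != y for x, y in data)
        if best is None or err < best[1]:
            best = (t, err)
    return best[0]

def bagging_fit(data, n_estimators=25, seed=0):
    """Train one stump per bootstrap resample of the data (bagging)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_estimators)]

def bagging_predict(thresholds, x):
    """Majority vote of the bootstrapped stumps."""
    votes = Counter(x > t for t in thresholds)
    return votes.most_common(1)[0][0]

# Toy 1-D data: label is True when the feature exceeds 0.5
data = [(i / 10, i / 10 > 0.5) for i in range(11)]
stumps = bagging_fit(data)
print(bagging_predict(stumps, 0.9), bagging_predict(stumps, 0.1))
```

Averaging many high-variance learners trained on resampled data is what gives bagging its variance-reduction and overfitting-resistance properties discussed above.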
Table 6. Train and test accuracy scores of bagging with SMOTE and MSMOTE.
| Model | Train Accuracy, SMOTE (%) | Test Accuracy, SMOTE (%) | Train Accuracy, MSMOTE (%) | Test Accuracy, MSMOTE (%) |
| Bagging | 99.9775 | 96.0783 | 99.9746 | 99.95548 |
4. Adaptive Boosting
Finally, Figures 8a and 8b display the ROC curves and AUC values of AdaBoost. The test set confirmed that AdaBoost has a predictive power (AUC) of 82% with SMOTE and 83% with MSMOTE.
The results show that the ROC curve and AUC value of AdaBoost with MSMOTE are better than those with SMOTE.
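To illustrate what AdaBoost does under the hood, the from-scratch sketch below trains weighted decision stumps in rounds, up-weighting the samples each stump misclassifies. It is a toy on hypothetical 1-D data, not the implementation used in our experiments:

```python
import math

def adaboost_fit(data, n_rounds=10):
    """Minimal AdaBoost with threshold stumps on 1-D data.
    data: list of (x, y) with y in {-1, +1}."""
    n = len(data)
    w = [1.0 / n] * n          # sample weights, updated every round
    ensemble = []              # list of (alpha, threshold, sign)
    for _ in range(n_rounds):
        # Pick the stump `sign * (1 if x > t else -1)` with minimal weighted error
        best = None
        for t in sorted({x for x, _ in data}):
            for sign in (+1, -1):
                err = sum(wi for wi, (x, y) in zip(w, data)
                          if sign * (1 if x > t else -1) != y)
                if best is None or err < best[0]:
                    best = (err, t, sign)
        err, t, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)       # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # stump weight
        ensemble.append((alpha, t, sign))
        # Re-weight: boost the samples this stump got wrong, then normalize
        w = [wi * math.exp(-alpha * y * sign * (1 if x > t else -1))
             for wi, (x, y) in zip(w, data)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(a * s * (1 if x > t else -1) for a, t, s in ensemble)
    return 1 if score > 0 else -1

# Toy data: positives above 0.5
data = [(i / 10, 1 if i > 5 else -1) for i in range(11)]
model = adaboost_fit(data)
print(adaboost_predict(model, 0.9), adaboost_predict(model, 0.1))
```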
As seen in Figure 9, the confusion matrix of AdaBoost with MSMOTE shows 99% of settled loans and 36.8% of defaulted loans correctly predicted. In other words, about 63.2% of defaulted loans are missed, while only 1% of good loans are missed.
Table 6. Summary of evaluation metrics of the adaptive boosting.
| Classifier | Accuracy (%) | AUC (%) | Precision (%) | Recall (%) | F1-Score (%) |
| Adaboost + SMOTE | 90 | 81 | 96 | 94 | 95 |
| Adaboost + MSMOTE | 95 | 83 | 95 | 99 | 97 |
In summary, as shown in Table 7, Adaboost performed better when the data was balanced with MSMOTE. Regarding overfitting, the model shows some overfitting when executed with MSMOTE and none with SMOTE.
Table 7. Train and test accuracy score of Adaboost with SMOTE and MSMOTE.
| Model | Train Accuracy, SMOTE | Test Accuracy, SMOTE | Train Accuracy, MSMOTE | Test Accuracy, MSMOTE |
| Adaboost | 90.896% | 90.468% | 97.298% | 95.14% |
5. Analysis of Results and Discussion
For better comparison, the performance results of the four models are summarized in Table 8 and Table 9. The F1-scores show that all classifiers performed well, with a minimum of 93%, except LR with SMOTE, which scored the lowest at 89%. The results also show that RF and Bagging have comparable performances, scoring better in almost all measures with both SMOTE and MSMOTE. The LR classifier showed the lowest classification ability, with accuracy and F1-score of 82% and 89% with SMOTE and 87% and 93% with MSMOTE, respectively. The RF and Bagging classifiers performed better, with accuracy and AUC of 96% and 89% with SMOTE and 99% for both with MSMOTE. The precision and recall of RF and Bagging were also higher than those of LR and Adaboost, which indicates good performance.
Table 8. Summary of performances of the four ensemble classifiers using SMOTE.
| Classifiers | Accuracy (%) | AUC (%) | Precision (%) | Recall (%) | F1-Score (%) |
| LR + SMOTE | 82 | 83 | 97 | 83 | 89 |
| RF + SMOTE | 96 | 89 | 96 | 99 | 98 |
| Bagging + SMOTE | 96 | 89 | 96 | 99 | 98 |
| Adaboost + SMOTE | 90 | 81 | 96 | 94 | 95 |
Table 9. Summary of evaluation metrics for ensemble classifier with MSMOTE.
| Classifiers | Accuracy (%) | AUC (%) | Precision (%) | Recall (%) | F1-Score (%) |
| LR + MSMOTE | 87 | 82 | 96 | 89 | 93 |
| RF + MSMOTE | 99 | 99 | 99 | 99 | 99 |
| Bagging + MSMOTE | 99 | 99 | 99 | 99 | 99 |
| Adaboost + MSMOTE | 95 | 83 | 95 | 99 | 97 |
Figure 10 shows the ROC curves of all four classification models. As mentioned earlier, an ROC curve that tends towards the top-left corner of the graph correctly identifies a greater proportion of observations (i.e., a higher recall). A higher AUC value also indicates better model performance. As indicated by Figure 10b and Table 9, the RF and bagging models have shown better classification ability. Consequently, they both outperformed the other models in predicting potential defaulters.
As seen in the results, of the two resampling techniques, SMOTE and MSMOTE, the latter achieved better performance when combined with the ensemble classifiers. This is mainly because MSMOTE not only considers the distribution of minority-class samples but also rejects latent noise points based on the K-NN method [21]. Experimental results have also indicated that the MSMOTE algorithm can yield better prediction of the minority class than SMOTE [22]. Moreover, when used with a bagging-based ensemble classifier, MSMOTE gives better accuracy, precision, F1-score, and recall than when used with boosting-based ensemble classifiers. This is because bagging improves the stability and accuracy of machine learning algorithms, reduces variance, overcomes overfitting, and improves misclassification rates, among other benefits. In noisy data environments like ours, bagging generally outperforms boosting [23] [24].
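The core idea of SMOTE that this discussion relies on is simple: synthesize new minority samples by interpolating between a minority point and one of its minority-class nearest neighbours (MSMOTE additionally filters candidate points it deems noisy). A minimal sketch of the interpolation step, on hypothetical 2-D data rather than our dataset:

```python
import random

def smote_sample(minority, k=2, seed=0):
    """Generate one synthetic minority sample: pick a random minority point,
    pick one of its k nearest minority-class neighbours, and interpolate
    between them at a random fraction of the distance."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # k nearest neighbours of `base` within the minority class (Euclidean, 2-D)
    neighbours = sorted((p for p in minority if p != base),
                        key=lambda p: (p[0] - base[0]) ** 2 + (p[1] - base[1]) ** 2)[:k]
    nb = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, nb))

# Hypothetical minority-class points; the last one is an outlier that a
# noise-filtering variant like MSMOTE might reject before sampling
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0)]
print(smote_sample(minority))
```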
When we compare our results with existing works, as Table 10 presents, the proposed models outperformed previous approaches, including those using the same datasets.
Table 10. Performance Comparison of Selected Related Works with the Proposed System.
| Ref. | Year | Proposed Method | Dataset | Algorithm(s) Used | Performance (Accuracy in %) |
| [5] | 2021 | Resampling and cost-sensitive mechanisms | Lending Club datasets | Logistic Regression | 65.5 |
| [6] | 2018 | Combinations of classifiers and resampling techniques | Lending Club datasets | Random Forest | 81.76 |
| [10] | 2020 | Synthetic Minority Oversampling Technique (SMOTE) | Dataset from Kaggle | Neural Network | 94.81 |
| [11] | 2018 | Hybrid under-sampling method that combines clustering | Real loan default data from a P2P company | DSUS | 68.8 |
| [20] | 2019 | K-XGBoost model based on K-Means++ | Small business credit loan data of Lending Club | XGBoost | 92.2 |
| Our Study | 2022 | MSMOTE with bagging-based ensemble classifiers (Random Forest, Bagging) | Loan Data for Dummy Bank from Kaggle | Random Forest, Bagging Classifier | 99 |
| Our Study | 2022 | MSMOTE with boosting-based ensemble classifier (Adaboost) | Loan Data for Dummy Bank from Kaggle | Adaptive Boosting | 95.1 |
Furthermore, the following research questions were raised in the first section:
RQ1: Which approach or method better handles imbalanced data in loan prediction systems?
RQ2: Which approaches or methods for handling imbalanced data have been overlooked?
RQ3: Which machine learning algorithm handles imbalanced data better?
The first research question is answered by the discussion in Section II, the review of related works. As mentioned there, existing approaches have used resampling techniques and cost-sensitive learning methods to deal with class imbalance in loan default prediction and improve classifier performance.
The second research question is likewise answered in Section II. There are many resampling techniques, some of which are used in many works: Random Undersampling (RUS) [5, 6, 7, 11, 13], Random Oversampling (ROS) [6, 7, 11], SMOTE [6, 7, 8, 10, 11, 13], DSUS [11], and cost-sensitive learning [5, 8, 13]. The method we have used in this work has been overlooked in previous works on handling imbalanced data. Besides, the ensemble classifier technique adopted in this work had not previously been used to handle imbalanced data.
The final research question is answered by Tables 8 and 9 and Figure 10b. As mentioned, the experimental results confirmed that bagging-based ensemble techniques perform better in loan default prediction.