Anomaly credit data detection based on enhanced Isolation Forest

In view of the real-world problems of falsified and erroneous credit data, and the degradation they cause in credit evaluation models, we propose an outlier detection algorithm that considers two characteristics of credit data: class imbalance and cost sensitivity. We use an anomaly detection model called EIF to optimize credit evaluation models. EIF uses the EasyEnsemble algorithm to construct balanced datasets and trains an Isolation Forest model for anomaly detection on each balanced dataset under a different perturbation. On the one hand, the balanced datasets solve the class-imbalance problem through undersampling; on the other hand, each sub-model learns from all minority-class samples, addressing the cost-sensitive problem. Experiments were performed on the UCI German dataset, and a test set with fake data was constructed using attribute-credit correlations. Compared with other anomaly detection algorithms applied to common credit evaluation models, the EIF-optimized model achieves a higher F1-score and a lower cost-sensitive error rate. In conclusion, the EIF model is effective in enhancing the performance of credit evaluation models on forged credit data.


Introduction
Rapid economic development has changed the public's consumption habits. New forms of consumption such as prepayment and credit consumption have in turn driven the rapid development of the credit industry. The core of the credit business is to use credit evaluation technology to reliably evaluate a customer's credit and to control the risks generated in the business. Although institutions can accurately evaluate a customer's credit from the collected data, credit evaluation still relies on a large amount of credible data. If abnormal data arises from forgery or errors, credit evaluation will be greatly affected. Therefore, the detection and processing of abnormal data has become a focus of credit evaluation technology.
The principle of credit evaluation is to build a model from the various information collected about evaluated individuals, in order to identify the characteristics associated with credit; the credit of future customers is then evaluated with this model. Early credit evaluation relied largely on practitioners' own experience, which is costly and does not scale [1]. With the development of big data technology, machine learning methods are increasingly used in credit evaluation. Because of the seriousness of this work, strongly interpretable models are preferred in most studies. Vojtek et al. [2] use linear discriminant analysis (LDA) and logistic regression (LR) for credit evaluation, explaining that these two models are widely used in bank credit evaluation due to their simplicity and strong interpretability. Uddin et al. [3] use a random forest (RF) model to assess the credit risk of micro-enterprises and conduct a multi-dimensional analysis of missing data. Other such models include decision trees (DT) [4,5], naive Bayes (NB) [6,7], and support vector machines (SVM) [8][9][10]. Neural networks have also achieved good results in credit evaluation, but because of their poor interpretability they are not widely used in practice [9][10][11].
Anomaly detection is also known as outlier detection in some literature. It assumes that certain features of anomalies (outliers) differ from those of normal samples (inliers), and detects anomalies through these features. Distance-based outlier detection assumes that normal samples are densely distributed while anomalies are sparse, and uses the distance between a sample and its neighbors as the judgment index [11]. Ren and Liu [12] used k-nearest neighbors (KNN) for anomaly-detection pre-processing of network intrusion behaviors, and the resulting high-quality dataset was used for traditional model training. Density-based algorithms are a variant of distance-based algorithms: they compute the density difference between a sample and its neighbors to decide whether it is an anomaly [13,14]. Campos [15] comprehensively compares representative distance- and density-based outlier detection methods, and the Local Outlier Factor (LOF) algorithm, based on local density, performs best among them. Classification-based algorithms use a traditional classification model for anomaly detection, training a one-class model on normal data and flagging samples that do not belong to the normal class as outliers [16]. The Isolation Forest (iForest) [17] randomly selects attributes and split values to recursively divide the dataset into a tree structure. Anomalies tend to be isolated near the root of the tree, while normal points lie deeper in the isolation tree. iForest has linear time complexity and is not affected by sample dimensionality, so it performs well in medium and high dimensions [18].
There are two problems in the practical use of credit data. On the one hand, because the credit industry manually screens applicants during the business process, samples with obviously low credit are rejected, leaving relatively few bad samples and an insufficient description of the characteristics of bad credit; this causes the class-imbalance problem. On the other hand, the loss caused by misjudging a bad-credit sample as good is much larger than that of judging a good-credit sample as bad, which causes the cost-sensitive problem. The class-imbalance problem is usually addressed by sampling: undersampling randomly discards majority-class samples, which may lose important features, while oversampling replicates minority-class samples many times, which leads to overfitting. The SMOTE [19] algorithm uses the idea of oversampling to interpolate between minority-class samples, expanding the data while reducing overfitting. Liu et al. [20] proposed two ensemble-learning methods, EasyEnsemble and BalanceCascade, which combine the advantages of bagging and boosting by using multiple undersampled base models to form an overall ensemble; this overcomes the defect that simple undersampling loses majority-class features. Cost-sensitive problems are usually solved by cost-sensitive learning. Frumosu et al. [21] explored the relationship between cost and quality in industry using cost-sensitive learning, noting the importance of cost metrics. However, the cost of misjudgment is difficult to quantify in credit evaluation work. Therefore, in this paper, each sub-model learns from all minority-class samples so that the model as a whole focuses more on bad-credit samples.
In this paper, we use the correlation between attributes and credit to generate anomalies that can deceive credit evaluation models, and use these anomalies to simulate fake samples in real work. We then use the idea of EasyEnsemble to construct balanced training datasets and, through sample perturbation and attribute perturbation, train an improved iForest model called the EIF model. The EIF model can detect the fake data generated above and improve the performance of traditional credit evaluation models.

Existing work
In building the credit anomaly detection model, we draw on some ideas from existing work. To help the reader understand the model in this study, this section introduces the core ideas of the algorithms it uses: Isolation Forest (iForest) and EasyEnsemble.

Isolation forest
Isolation Forest (iForest) is an anomaly detection algorithm proposed by Liu et al. in 2012 [17]. The nodes of an isolation tree are denoted T and are divided into external and internal nodes: nodes without children are external nodes, and the rest are internal nodes. An internal node consists of an attribute q, a split value p of attribute q, and two child nodes (T_l, T_r), where p is a random value between the minimum and maximum of attribute q. In an isolation tree, q and p serve as the separation conditions of a node: each node compares a new sample's value of q with p to decide whether the sample belongs to T_l or T_r.
When training an isolation tree on a d-dimensional dataset, a subsample X' of ψ samples is first drawn. The isolation tree then randomly selects an attribute q and a split value p to recursively split X' until each node contains only one sample or all samples in a node have the same values. The dataset is subsampled n times to train different isolation trees, and all of these isolation trees together constitute the isolation forest.
Define the path length h(x) of a sample x as the number of edges on the path from the root of the isolation tree to the external node containing x. Under the assumption above, anomalies usually have short path lengths in the tree. Let E(h(x)) be the average path length of x over the forest; the anomaly score of x is then

s(x, ψ) = 2^(−E(h(x)) / c(ψ)),     (1)

where c(ψ) is the average path length of an unsuccessful search in a binary search tree built on ψ samples, used to normalize h(x):

c(ψ) = 2H(ψ − 1) − 2(ψ − 1)/ψ.     (2)

Here H(i) is the harmonic number, usually estimated as ln(i) plus Euler's constant (0.5772156649). Scores close to 1 indicate anomalies, while scores well below 0.5 indicate normal samples.
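Formulas (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration of the scoring function only, not the authors' implementation:

```python
import math

def harmonic(i):
    # H(i) estimated as ln(i) plus Euler's constant
    return math.log(i) + 0.5772156649

def c(psi):
    # average path length of an unsuccessful BST search over psi samples
    if psi <= 1:
        return 0.0
    return 2.0 * harmonic(psi - 1) - 2.0 * (psi - 1) / psi

def anomaly_score(avg_path_length, psi):
    # s(x, psi) = 2 ** (-E(h(x)) / c(psi)); values near 1 flag anomalies
    return 2.0 ** (-avg_path_length / c(psi))
```

Note that when E(h(x)) equals c(ψ) the score is exactly 0.5, which is why scores well above 0.5 mark anomalies and scores well below 0.5 mark normal points.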

EasyEnsemble
The EasyEnsemble algorithm is an undersampling method for class-imbalance problems and is suitable for training ensemble models. EasyEnsemble divides the data into a majority-class dataset N and a minority-class dataset P. N is undersampled independently k times to obtain subsets {N_1, N_2, ..., N_k} with |N_i| = |P|, and each N_i is combined with P to train a base classifier. Overall, each base classifier is trained on a balanced training set, and the entire dataset is used in training, so no features are lost.
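The subset construction above can be sketched as follows. This is a schematic of the sampling step only (the choice of base classifier is left open), assuming the minority class is labeled 1:

```python
import numpy as np

def easy_ensemble_subsets(X, y, k, minority_label=1, seed=0):
    """Build k balanced training sets in the EasyEnsemble style: every
    minority sample plus an equally sized random undersample of the
    majority class."""
    rng = np.random.default_rng(seed)
    P = np.flatnonzero(y == minority_label)   # minority indices
    N = np.flatnonzero(y != minority_label)   # majority indices
    subsets = []
    for _ in range(k):
        N_i = rng.choice(N, size=len(P), replace=False)
        idx = np.concatenate([P, N_i])
        subsets.append((X[idx], y[idx]))
    return subsets
```

Each of the k subsets is balanced, and because every subset reuses all of P, the minority class is never undersampled.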

Problem description and algorithm introduction
In credit evaluation, data is sometimes forged to deceive the existing credit evaluation model. Such data is not recognized by the existing model and causes serious losses to credit evaluation work. We propose an iForest model combined with the ideas of EasyEnsemble, called the EIF model, and design a fake-data generation algorithm based on practical experience.

EIF model
As noted earlier, credit evaluation work is characterized by two kinds of imbalance: class imbalance and cost sensitivity. To solve the problems they cause, we use ensemble learning, which constructs multiple base classifiers to jointly handle a difficult problem. Each base classifier is trained on a balanced dataset constructed by undersampling, which solves the class-imbalance problem; meanwhile, each balanced training set contains all bad-credit samples, so every base classifier fully learns the features of bad credit and the ensemble as a whole focuses more on bad-credit samples, mitigating the cost-sensitive problem.
We call this ensemble model the enhanced Isolation Forest model (EIF model). Figure 1 shows the structure of the EIF model.
In ensemble learning, to ensure a strong ensemble classifier, the base classifiers should be "good but different": each base classifier should have a certain classification ability while learning a different view of the overall data. This paper uses random sample sampling and random attribute sampling to generate differences among the base classifiers. The specific process is shown in Algorithm 1:
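Since Algorithm 1 is not reproduced in this text, the following sketch gives our reading of the EIF training procedure: k balanced training sets (all bad-credit samples plus an undersample of good-credit samples) with a random feature subset per base Isolation Forest. It uses scikit-learn's IsolationForest as the base learner; parameter names are ours, not the paper's:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def train_eif(X_bad, X_good, k=10, m_attrs=None, l=256, seed=0):
    # sample perturbation: each of the k training sets pairs all bad-credit
    # samples with a fresh random undersample of good-credit samples
    rng = np.random.default_rng(seed)
    d = X_bad.shape[1]
    m_attrs = m_attrs or d
    models = []
    for i in range(k):
        rows = rng.choice(len(X_good), size=len(X_bad), replace=False)
        train = np.vstack([X_bad, X_good[rows]])
        # attribute perturbation: each base model sees a random feature subset
        cols = rng.choice(d, size=min(m_attrs, d), replace=False)
        clf = IsolationForest(max_samples=min(l, len(train)), random_state=i)
        clf.fit(train[:, cols])
        models.append((cols, clf))
    return models

def predict_eif(models, X):
    # majority vote over base detectors; sklearn's predict returns -1 for anomalies
    votes = np.stack([clf.predict(X[:, cols]) == -1 for cols, clf in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)  # 1 = anomaly
```

The two perturbations give the base detectors the "good but different" property: each sees a different slice of the good-credit samples and a different attribute subspace, while all of them see every bad-credit sample.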

Generating anomalies
In real life, data is sometimes maliciously modified to obtain a good credit rating. This study uses the correlation between attributes and credit to simulate deliberate fraud in credit assessment and to falsify the dataset. We treat falsified data as anomalies. The process of generating anomalies is shown in Fig. 2.
The overall idea of the algorithm is to use the correlation coefficient between each attribute and credit to forge each sample and finally get the forged samples identified as good credit by the evaluation model.
We filter each attribute according to its correlation coefficient with credit. The algorithm calculates the Pearson correlation coefficient between each attribute {A_1, A_2, ..., A_n} and the credit C:

ρ(A_i, C) = cov(A_i, C) / (σ_{A_i} σ_C).     (3)

In the experiment, we set the forgery threshold to 0.1: attributes whose correlation coefficient has absolute value greater than 0.1 are considered worth falsifying. A dictionary is then generated from the Pearson correlation coefficients as the forgery rule. In this paper, C = 0 means good credit and C = 1 means bad credit, and the forgery dictionary F = {f_1, f_2, ..., f_n} is generated by formula (4): attributes with low correlation to credit are excluded, attributes positively correlated with C are set to their minimum value, and attributes negatively correlated with C are set to their maximum value:

f_i = min(A_i) if ρ(A_i, C) > 0.1;  f_i = max(A_i) if ρ(A_i, C) < −0.1.     (4)
In the forgery step, the algorithm randomly selects some bad-credit samples. For each sample, k attributes are selected and replaced with values from the forgery dictionary F. These samples are then predicted by a pretrained simple model; samples predicted as good credit are successful forgeries. The test set with anomalies is obtained by replacing the corresponding original test samples with the forgeries. The credit data forgery algorithm is as follows:
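The dictionary construction and attribute replacement described above can be sketched as follows. This is our reading of the algorithm, not the authors' code; the final filtering through a pretrained check model (LR in the experiments) is noted in a comment rather than implemented:

```python
import numpy as np

def build_forgery_dict(X, C, threshold=0.1):
    # formula (4) as we read it: attributes positively correlated with
    # bad credit (C = 1) are pushed to their minimum, negatively
    # correlated ones to their maximum; weakly correlated ones are skipped
    F = {}
    for j in range(X.shape[1]):
        r = np.corrcoef(X[:, j], C)[0, 1]
        if r > threshold:
            F[j] = X[:, j].min()
        elif r < -threshold:
            F[j] = X[:, j].max()
    return F

def forge(sample, F, k, rng):
    # replace k randomly chosen forgeable attributes with dictionary values;
    # the paper then keeps only forgeries that a pretrained check model
    # predicts as good credit
    fake = sample.copy()
    for j in rng.choice(list(F), size=min(k, len(F)), replace=False):
        fake[j] = F[j]
    return fake
```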

EIF-based anomaly credit data detection
When a credit evaluation model is applied to a dataset mixed with anomalies, its performance drops significantly. A credit evaluation model equipped with the EIF model first judges whether each sample is an anomaly; only normal samples are then passed to credit evaluation. This paper treats anomalies as bad credit. Anomalies could be marked and studied further to extract more value, but that is beyond the scope of this study, so we leave the question aside. The workflow of the EIF model is shown in Fig. 3.
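The two-stage workflow can be sketched in a few lines. The function names are illustrative; `eif_predict` stands for any detector returning 1 for anomalies, and `credit_model` for any fitted classifier with a `predict` method:

```python
import numpy as np

def evaluate_with_eif(eif_predict, credit_model, X):
    # stage 1: samples flagged by EIF are labeled bad credit (1) directly;
    # stage 2: the remaining samples are scored by the credit evaluation model
    is_anomaly = eif_predict(X).astype(bool)
    y_pred = np.ones(len(X), dtype=int)
    if (~is_anomaly).any():
        y_pred[~is_anomaly] = credit_model.predict(X[~is_anomaly])
    return y_pred
```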

Experiment
This section describes the environment and dataset of the experiment, and introduces evaluation metrics suited to the imbalance problems.

Experimental environment
The EIF model and the credit data forgery algorithm were developed in Python using Spyder under Windows 10; the hardware was an AMD Ryzen 7 5800H @ 3.20 GHz with 16 GB of RAM. All experiments were carried out in the same environment.

Data source and preprocessing
The dataset used in this paper is the German credit dataset from the UCI public repository. It describes 1000 loan records: 700 "good credit" samples and 300 "bad credit" samples. The original data is represented by 20 attributes; the dataset's authors numericalized it with one-hot encoding into the file german-numeric, which has 24 attributes. Based on the german-numeric file, we set good credit to 0 and bad credit to 1 for the subsequent formulas and experiments. Dimensional differences between attributes would give them unequal influence in distance calculations, and this paper uses several distance-based algorithms, so normalization is required to remove this effect; we use Min-Max normalization for preprocessing.
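The Min-Max step is standard; a minimal per-attribute version (with a guard for constant columns, which the formula would otherwise divide by zero on) looks like this:

```python
import numpy as np

def min_max_normalize(X):
    # scale each attribute (column) to [0, 1] so that attributes with
    # larger ranges do not dominate distance computations
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard constant columns
    return (X - lo) / span
```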

Model evaluation
This paper uses Accuracy to evaluate the performance of the credit forgery algorithm, and F1-score and cost-sensitive error rate to evaluate the performance of credit evaluation models. To calculate these metrics, we define TP, TN, FP, and FN as in Table 1. From the confusion matrix, we define:

Accuracy = (TP + TN) / (TP + TN + FP + FN),
Precision = TP / (TP + FP),
Recall = TP / (TP + FN).

When evaluating the credit forgery algorithm, every sample is bad credit (TN = FP = 0), so the accuracy formula reduces to Accuracy = TP / (TP + FN). The F1-score is the harmonic mean of Precision and Recall; because it considers both, it reflects the generalization ability of a model and is a common metric for binary classification problems:

F1 = 2 × Precision × Recall / (Precision + Recall).

The cost-sensitive error rate is used to assess the effect of cost imbalance. It is defined in terms of the cost matrix shown in Table 2.
Here cost_01 is the cost when a sample's actual credit is good but it is misjudged as bad; conversely, cost_10 is the cost when a sample's actual credit is bad but it is misjudged as good. In credit evaluation work, cost_10 is much larger than cost_01. Accordingly, the cost-sensitive error rate over m samples is

E = (1/m) × ( cost_01 × Σ_{x∈D−} I(f(x) ≠ y) + cost_10 × Σ_{x∈D+} I(f(x) ≠ y) ),

where f is the credit evaluation model, D+ is the bad-credit sample set, and D− is the good-credit sample set. The total cost is the number of times each kind of error is made multiplied by its cost, and the cost-sensitive error rate is the total cost divided by the number of samples.
The cost-sensitive error rate assigns a different cost to each type of error, so it reflects the overall loss of the model. Our experiments use the cost-sensitive error rate to measure a model's ability to handle cost-sensitive problems.
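The two metrics above can be computed directly from the definitions, with bad credit (label 1) as the positive class. This is an illustrative implementation of the formulas, not the authors' code:

```python
import numpy as np

def f1_score(y_true, y_pred):
    # harmonic mean of precision and recall (bad credit = positive class 1)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def cost_sensitive_error(y_true, y_pred, cost01=1, cost10=5):
    # cost01: good misjudged as bad; cost10: bad misjudged as good
    e01 = np.sum((y_true == 0) & (y_pred == 1))
    e10 = np.sum((y_true == 1) & (y_pred == 0))
    return (cost01 * e01 + cost10 * e10) / len(y_true)
```

The default costs match the experimental setting (cost_01 = 1, cost_10 = 5), so missing a bad-credit sample is penalized five times as heavily as rejecting a good one.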

Experimental design and results
To test the performance of our methods, we designed a series of comparative experiments. The models and algorithms we compare against are listed in Table 3.
First, we verify the effect of the fake dataset generated by the credit data forgery algorithm. Using five-fold cross-validation, the dataset is divided into five parts; in each experiment one part is the test set and the rest form the training set. The algorithm modifies all bad-credit samples in the test set, and the experiment is repeated with different numbers of forged attributes k; LR serves as the simple check model, yielding a completely fake set. Common credit evaluation models are trained on the training set and then predict credit for the fake set, to check whether the fake set fools them. Five parallel experiments were performed and the Accuracy values were averaged. Table 4 shows the Accuracy of each evaluation model in recognizing fake samples generated by the forgery algorithm. The forgery algorithm falsifies samples based on the correlation coefficient between each attribute and credit, so the forged samples most strongly affect LR and SVM, which rely on linear correlations. The algorithm is less effective against the NB and DT models because these models focus on the effect of certain features on the sample as a whole, and changes to a few attributes may not alter their judgment, even when those attributes are highly correlated with credit. Overall, the forgery algorithm deceives commonly used credit evaluation models to varying degrees, so we consider it capable of forging bad-credit samples into apparently good ones.
The credit data forgery algorithm then uses k = 3 and LR as the check model to generate test sets with varying proportions of fake samples. These test sets are predicted by common credit evaluation models with and without EIF. We use the F1-score and the cost-sensitive error rate (CR) to evaluate model performance, with cost_01 = 1 and cost_10 = 5 in the cost matrix. The EIF parameters after tuning are: k = 10, m = 10, l = 256, c = 19. A five-fold cross-validation experiment is performed and the means are reported in Table 5.
Table 5 shows that models with EIF achieve higher F1-scores and clearly lower cost-sensitive error rates, especially at high fake rates. When the percentage of forged samples is low, EIF improves the F1-score only slightly, and for the NB model it even reduces the F1-score weakly; however, the F1-score is still clearly improved when the percentage of forgeries is high. Meanwhile, the cost-sensitive error rate decreases significantly after EIF processing. EIF improves the performance of credit evaluation models, and the effect grows as the fake rate rises; we conclude that EIF has the ability to counter anomalies.
We also substitute various anomaly detection algorithms for EIF and test their performance on the same credit evaluation problem, using the same credit evaluation models and test sets as in the previous experiment with a fake rate of c = 20%. F1-score and cost-sensitive error rate (CR) are used for comparison; the results are shown in Table 6.
Each optimal result is highlighted in bold. Most models achieve their best F1-scores after EIF processing, and EIF's cost-sensitive error rate is only marginally behind that of the best anomaly detection algorithm. OCSVM reduces the cost-sensitive error rate significantly because its strict decision boundary rejects as many suspected fake samples as possible, which is reflected in the significant drop in F1-score of the models using OCSVM. LOF yields a better F1-score on the NB model because it is more "tolerant" of forged samples as long as they are partially supported by similar points, which can also be seen in LOF's high cost-sensitive error rate across all credit evaluation models. Compared with the plain IF algorithm, EIF achieves both a higher F1-score and a lower cost-sensitive error rate, and can be regarded as an improvement over IF. While keeping cost losses low, the EIF model pays more attention to accuracy, so models with EIF obtain the best F1-scores with low cost-sensitive error rates. This demonstrates that the EIF model detects anomalies in credit data well, improving on both the class-imbalance and cost-sensitive problems, and is an excellent choice for handling anomalies in credit evaluation work.

Conclusion
To address the problem that new data may be forged against an existing model in credit evaluation work, this paper proposes a method to detect fake data using the anomaly detection model EIF, based on Isolation Forest and EasyEnsemble. The EIF model trains each base model on all minority-class samples plus an undersample of the majority class, solving the class-imbalance problem without losing global information. We use linear correlations to generate fake data for evaluating models; this fake data simulates anomalies in credit data. Common credit evaluation models with the EIF model are assessed by F1-score and cost-sensitive error rate, and the results are significantly better than those of the vanilla models. Compared with other detection models, the EIF model also has a better optimization effect. The EIF model is effective in enhancing the performance of credit evaluation models.
In this paper, detected anomalies are directly classified as bad credit. In the future, we plan to analyze each credit anomaly for more information, refine credit evaluation models using that information, and, from the perspective of credit evaluation work, discover the causes of anomalies and restore them to normal credit data.