2.1 Data Standardization
Before analysis, the data are standardized. The variables in the data set have different units and widely differing degrees of variation: differing units make the practical interpretation of coefficients difficult, and large differences in variance give the variables very different weights in the computed relationship coefficients. To eliminate the influence of units and of each variable's own scale and variance, every variable is standardized (new value = (original value - mean)/standard deviation)[7].
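As a minimal sketch (the data here are illustrative, not the paper's dataset), this standardization can be done column by column from the formula above, or equivalently with sklearn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: two variables with very different units and scales.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0],
              [4.0, 7000.0]])

# new value = (original value - mean) / standard deviation, per column
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transform via scikit-learn
X_scaled = StandardScaler().fit_transform(X)

assert np.allclose(X_manual, X_scaled)
print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

After this transform every variable has zero mean and unit variance, so no variable dominates the coefficients merely because of its units.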
2.2 Data Normalization
In solving the credit card fraud binary classification problem, the naive Bayes algorithm is used to implement the classification, and naive Bayes requires that the data fed into the model be non-negative; therefore, we cannot use standardization (which produces negative values) and must instead use normalization to process the data[8].
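A hedged sketch of min-max normalization on illustrative data: it maps every variable into [0, 1], so all values are non-negative as naive Bayes requires.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data, including negative raw values.
X = np.array([[-5.0, 100.0],
              [ 0.0, 200.0],
              [ 5.0, 400.0]])

# Min-max normalization: (x - min) / (max - min), per column.
X_norm = MinMaxScaler().fit_transform(X)

# All values now lie in [0, 1], satisfying the non-negativity
# requirement of the naive Bayes model.
print(X_norm.min(), X_norm.max())  # 0.0 1.0
```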
2.3 Dividing the data set
After data preprocessing, a machine learning model was selected for training. For the continuous variables in the dataset, a correlation-coefficient heat map was used to check for linear relationships between the variables, and none was found (see Fig. 1); nor was any other kind of relationship found (see Fig. 2). There is therefore no need to remove any variable, and all of them are used in the analysis. The data are divided into training data and test data with Python's train_test_split function: the training set is 70% of the samples, the test set is 30%, and random_state = 100 is set when splitting.
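The split described above can be sketched as follows (the feature matrix and labels here are stand-ins for the preprocessed fraud data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative stand-in for the preprocessed feature matrix and labels.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# 70% training / 30% test split with the fixed seed used in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100)

print(len(X_train), len(X_test))  # 70 30
```

Fixing random_state makes the split reproducible across runs.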
2.4 Random Oversampling
Because the data are unbalanced and credit card fraud detection is a binary classification problem, it is important to balance the number of fraud labels against the amount of non-fraud data. On inspection, 91.26% of the samples are non-fraud and 8.74% are fraud (see Fig. 3), so the sample is seriously imbalanced and this imbalance needs to be addressed. We handle the imbalance with machine learning techniques, chiefly: oversampling, undersampling, integrated sampling, and using class weights. In the following study, oversampling is performed with two methods, random oversampling and SMOTE oversampling, and undersampling with random undersampling. These imbalance-handling methods come mainly from Python's imblearn library.
The core idea of random oversampling is to randomly replicate minority-class samples until the minority class is the same size as the majority class, yielding a new balanced data set[9, 10]. Since the dataset has already been divided into a training set and a test set, we oversample only the training set, so as not to change the distribution of the test data. We use the RandomOverSampler class in the imblearn library to randomly oversample the training set. Finally, the oversampled training set is used to train the decision tree, random forest, KNN, and naive Bayes algorithms.
2.5 SMOTE Oversampling
SMOTE (Synthetic Minority Oversampling Technique) is a refinement of the random oversampling method[11]. Random oversampling adds minority samples by simply copying them, which easily leads to model overfitting; the basic idea of SMOTE is instead to analyze the minority-class samples and synthesize new samples from them[12]. When using SMOTE oversampling to handle the data imbalance problem, the data processed are again only those in the training set, with no processing of the test set. The SMOTEENN class in the imblearn library is used to oversample the training set data. Finally, the oversampled training set is used to train the decision tree, random forest, KNN, and naive Bayes algorithms.
2.6 Undersampling
Under-sampling (down-sampling) achieves sample balance by randomly removing some majority-class samples to reduce the size of the majority class[13]. Because the credit card fraud dataset we study is highly skewed and contains 1,000,000 records, the sample is so large that the dataset must be reduced to make learning feasible, and in this case undersampling is a reasonable and effective strategy[14]. We therefore undersample the data. To preserve the distribution of the test set, only the training set is undersampled, using imbalanced-learn's RandomUnderSampler and NearMiss methods. Finally, the undersampled training set is used to train the decision tree, random forest, KNN, naive Bayes, and support vector machine algorithms.
2.7 Integrated sampling
The core of integrated sampling is to first expand the minority class by oversampling and then remove the points left in an overlapping ("glued") state with the Tomek Link method[15], sometimes directly removing all close pairs, because after oversampling the ratio of class 0 to class 1 samples has already reached 1:1. We use imblearn's SMOTETomek class to implement integrated sampling. So as not to change the distribution of the test set, we resample only the training set data. The resampled training set is then used to train the decision tree, random forest, KNN, and naive Bayes algorithms.
2.8 Using class weights to improve class imbalance
Since the credit card fraud detection dataset we study is imbalanced, we can also mitigate the class imbalance by setting class weights.
First, we train a simple logistic regression. Then, we train a weighted logistic regression with class_weight='balanced', in which case the model automatically assigns class weights inversely proportional to the class frequencies. Finally, we use a grid search to find the optimal class weights. The metric we optimize is the F1 score.
For logistic regression, we use log loss as the cost function. We do not use mean squared error, because the prediction function is a sigmoid curve rather than a fitted straight line, and squared error on a sigmoid does not yield a well-behaved (convex) optimization objective.
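As an illustration, the log loss of a prediction p for true label y is -[y log p + (1 - y) log(1 - p)], averaged over the samples; a minimal check of this formula against sklearn (toy labels and probabilities):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.7, 0.6])  # predicted P(class = 1)

# Log loss: -(y*log(p) + (1-y)*log(1-p)), averaged over the samples.
manual = -np.mean(y_true * np.log(p_pred)
                  + (1 - y_true) * np.log(1 - p_pred))

print(np.isclose(manual, log_loss(y_true, p_pred)))  # True
```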
2.9 Simple logistic regression
Because the training set data have been processed and are well distributed, and the logistic regression model is well suited to this type of problem[16], we try logistic regression on the credit card fraud problem. We train the model with the sklearn library, using the default logistic regression; by default, the algorithm assigns equal weights to both classes. The resulting confusion matrix is shown in Fig. 4.
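A minimal sketch of this baseline on synthetic imbalanced data standing in for the fraud dataset (the class proportions are chosen to mimic the 91%/9% split described above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the fraud dataset.
X, y = make_classification(n_samples=2000, weights=[0.91, 0.09],
                           random_state=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=100)

# Default logistic regression: both classes weighted equally.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)  # rows: true class 0/1, columns: predicted class 0/1
```

With equal class weights, the classifier tends to favour the majority class, which the confusion matrix makes visible.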
2.10 Logistic regression (class_weight='balanced')
We add the class_weight parameter to the logistic regression algorithm, passing the value 'balanced'; with class_weight='balanced', the model automatically assigns class weights inversely proportional to the class frequencies. The resulting confusion matrix is shown in Fig. 5.
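A sketch of what 'balanced' does, on the same kind of synthetic stand-in data: sklearn computes each class's weight as n_samples / (n_classes * n_class_samples), so the minority class receives the larger weight.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

X, y = make_classification(n_samples=2000, weights=[0.91, 0.09],
                           random_state=100)

# 'balanced' assigns weight n_samples / (n_classes * n_class_samples),
# i.e. inversely proportional to each class's frequency.
w = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], w)))  # the minority class gets the larger weight

# The same weighting applied inside the model:
clf = LogisticRegression(class_weight='balanced',
                         max_iter=1000).fit(X, y)
```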
2.11 Logistic regression (manually set class weights)
We use grid search to find the weight that makes the F1 score highest, i.e., we search over weights between 0 and 1. In the grid search, if the minority class is given weight n, the majority class is given weight 1 - n.
By plotting the scores for the different weight values tried in the grid search, we obtain the change in F1 as the minority-class weight varies (see Fig. 6). The figure shows that the F1 score peaks when the minority-class weight is 0.793082759825461. The grid search thus yields the best class weights: 0.20691724017453905 for class 0 (the majority class) and 0.793082759825461 for class 1 (the minority class). With these weights, performance on the test data is measured by the F1 score; the resulting confusion matrix is shown in Fig. 7.
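The weight search can be sketched as a simple loop over candidate minority-class weights, scoring each by F1 on held-out data (synthetic stand-in data again; the grid and optimum here are illustrative, not the paper's values):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.91, 0.09],
                           random_state=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=100)

# Search minority-class weights n in (0, 1); the majority class gets 1 - n.
best_w, best_f1 = None, -1.0
for n in np.linspace(0.05, 0.95, 19):
    clf = LogisticRegression(class_weight={0: 1 - n, 1: n},
                             max_iter=1000).fit(X_tr, y_tr)
    f1 = f1_score(y_te, clf.predict(X_te))
    if f1 > best_f1:
        best_w, best_f1 = n, f1

print(best_w, best_f1)  # minority-class weight that maximizes F1
```

Plotting f1 against n over this grid reproduces the kind of curve described for Fig. 6.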