2.1 Data Standardization
Before analysis, the data are standardized. The variables in the data set have different units and widely differing degrees of variation: differing units make the practical interpretation of coefficients difficult, and large differences in variance give the variables very different weights in the computed relationship coefficients. To eliminate the influence of units and of each variable's own scale and variance, every variable is standardized (new value = (original value - mean)/standard deviation)[7].
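As a minimal sketch (the data here are illustrative, not the paper's dataset), this standardization can be done column by column from the formula above, or equivalently with sklearn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: two variables with very different units and scales.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0],
              [4.0, 7000.0]])

# new value = (original value - mean) / standard deviation, per column
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transform via scikit-learn
X_scaled = StandardScaler().fit_transform(X)

assert np.allclose(X_manual, X_scaled)
print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

After this transform every variable has zero mean and unit variance, so no variable dominates the coefficients merely because of its units.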
2.2 Data Normalization
In solving the credit card fraud binary classification problem, the naive Bayes algorithm is used to implement the classification, and naive Bayes requires that the data fed into the model be non-negative; therefore, we cannot use standardization (which produces negative values) and must instead use normalization to process the data[8].
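A hedged sketch of min-max normalization on illustrative data: it maps every variable into [0, 1], so all values are non-negative as naive Bayes requires.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data, including negative raw values.
X = np.array([[-5.0, 100.0],
              [ 0.0, 200.0],
              [ 5.0, 400.0]])

# Min-max normalization: (x - min) / (max - min), per column.
X_norm = MinMaxScaler().fit_transform(X)

# All values now lie in [0, 1], satisfying the non-negativity
# requirement of the naive Bayes model.
print(X_norm.min(), X_norm.max())  # 0.0 1.0
```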
2.3 Dividing the data set
After data preprocessing, a machine learning model was selected for training. For the continuous variables in the dataset, a correlation-coefficient heat map was used to check for linear relationships between the variables, and none was found (see Fig. 1); nor was any other kind of relationship found (see Fig. 2). There is therefore no need to remove any variable, and all of them are used in the analysis. The data are divided into training data and test data with Python's train_test_split function: the training set is 70% of the samples, the test set is 30%, and random_state = 100 is set when splitting.
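The split described above can be sketched as follows (the feature matrix and labels here are stand-ins for the preprocessed fraud data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative stand-in for the preprocessed feature matrix and labels.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# 70% training / 30% test split with the fixed seed used in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100)

print(len(X_train), len(X_test))  # 70 30
```

Fixing random_state makes the split reproducible across runs.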
2.4 Random Oversampling
Because the data are unbalanced and credit card fraud detection is a binary classification problem, it is important to balance the number of fraud labels against the amount of non-fraud data. On inspection, 91.26% of the samples are non-fraud and 8.74% are fraud (see Fig. 3), so the sample is seriously imbalanced and this imbalance needs to be addressed. We handle the imbalance with machine learning techniques, chiefly: oversampling, undersampling, integrated sampling, and using class weights. In the following study, oversampling is performed with two methods, random oversampling and SMOTE oversampling, and undersampling with random undersampling. These imbalance-handling methods come mainly from Python's imblearn library.
The core idea of random oversampling is to randomly replicate minority-class samples until the minority class is the same size as the majority class, yielding a new balanced data set[9, 10]. Since the dataset has already been divided into a training set and a test set, we oversample only the training set, so as not to change the distribution of the test data. We use the RandomOverSampler class in the imblearn library to randomly oversample the training set. Finally, the oversampled training set is used to train the decision tree, random forest, KNN, and naive Bayes algorithms.
2.5 SMOTE Oversampling
SMOTE (Synthetic Minority Oversampling Technique) is a refinement of the random oversampling method[11]. Random oversampling adds minority samples by simply copying them, which easily leads to model overfitting; the basic idea of SMOTE is instead to analyze the minority-class samples and synthesize new samples from them[12]. When using SMOTE oversampling to handle the data imbalance problem, the data processed are again only those in the training set, with no processing of the test set. The SMOTEENN class in the imblearn library is used to oversample the training set data. Finally, the oversampled training set is used to train the decision tree, random forest, KNN, and naive Bayes algorithms.
2.6 Undersampling
Under-sampling (down-sampling) achieves sample balance by randomly removing some majority-class samples to reduce the size of the majority class[13]. Because the credit card fraud dataset we study is highly skewed and contains 1,000,000 records, the sample is so large that the dataset must be reduced to make learning feasible, and in this case undersampling is a reasonable and effective strategy[14]. We therefore undersample the data. To preserve the distribution of the test set, only the training set is undersampled, using imbalanced-learn's RandomUnderSampler and NearMiss methods. Finally, the undersampled training set is used to train the decision tree, random forest, KNN, naive Bayes, and support vector machine algorithms.
2.7 Integrated sampling
The core of integrated sampling is to first expand the minority class by oversampling and then remove the points left in an overlapping ("glued") state with the Tomek Link method[15], sometimes directly removing all close pairs, because after oversampling the ratio of class 0 to class 1 samples has already reached 1:1. We use imblearn's SMOTETomek class to implement integrated sampling. So as not to change the distribution of the test set, we resample only the training set data. The resampled training set is then used to train the decision tree, random forest, KNN, and naive Bayes algorithms.
2.8 Using class weights to improve class imbalance
Since the credit card fraud detection dataset we study is imbalanced, we can also mitigate the class imbalance by setting class weights.
First, we train a simple logistic regression. Then, we train a weighted logistic regression with class_weight='balanced', in which case the model automatically assigns class weights inversely proportional to the class frequencies. Finally, we use a grid search to find the optimal class weights. The metric we optimize is the F1 score.
For logistic regression, we use log loss as the cost function. We do not use mean squared error, because the prediction function is a sigmoid curve rather than a fitted straight line, and squared error on a sigmoid does not yield a well-behaved (convex) optimization objective.
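As an illustration, the log loss of a prediction p for true label y is -[y log p + (1 - y) log(1 - p)], averaged over the samples; a minimal check of this formula against sklearn (toy labels and probabilities):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.7, 0.6])  # predicted P(class = 1)

# Log loss: -(y*log(p) + (1-y)*log(1-p)), averaged over the samples.
manual = -np.mean(y_true * np.log(p_pred)
                  + (1 - y_true) * np.log(1 - p_pred))

print(np.isclose(manual, log_loss(y_true, p_pred)))  # True
```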
2.9 Simple logistic regression
Because the training set data have been processed and are well distributed, and the logistic regression model is well suited to this type of problem[16], we try logistic regression on the credit card fraud problem. We train the model with the sklearn library, using the default logistic regression; by default, the algorithm assigns equal weights to both classes. The resulting confusion matrix is shown in Fig. 4.
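A minimal sketch of this baseline on synthetic imbalanced data standing in for the fraud dataset (the class proportions are chosen to mimic the 91%/9% split described above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the fraud dataset.
X, y = make_classification(n_samples=2000, weights=[0.91, 0.09],
                           random_state=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=100)

# Default logistic regression: both classes weighted equally.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)  # rows: true class 0/1, columns: predicted class 0/1
```

With equal class weights, the classifier tends to favour the majority class, which the confusion matrix makes visible.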
2.10 Logistic regression (class_weight='balanced')
We add the class_weight parameter to the logistic regression algorithm, passing the value 'balanced'; with class_weight='balanced', the model automatically assigns class weights inversely proportional to the class frequencies. The resulting confusion matrix is shown in Fig. 5.
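A sketch of what 'balanced' does, on the same kind of synthetic stand-in data: sklearn computes each class's weight as n_samples / (n_classes * n_class_samples), so the minority class receives the larger weight.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

X, y = make_classification(n_samples=2000, weights=[0.91, 0.09],
                           random_state=100)

# 'balanced' assigns weight n_samples / (n_classes * n_class_samples),
# i.e. inversely proportional to each class's frequency.
w = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], w)))  # the minority class gets the larger weight

# The same weighting applied inside the model:
clf = LogisticRegression(class_weight='balanced',
                         max_iter=1000).fit(X, y)
```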
2.11 Logistic regression (manually set class weights)
We use grid search to find the weight that makes the F1 score highest, i.e., we search over weights between 0 and 1. In the grid search, if the minority class is given weight n, the majority class is given weight 1 - n.
By plotting the scores for the different weight values tried in the grid search, we obtain the change in F1 as the minority-class weight varies (see Fig. 6). The figure shows that the F1 score peaks when the minority-class weight is 0.793082759825461. The grid search thus yields the best class weights: 0.20691724017453905 for class 0 (the majority class) and 0.793082759825461 for class 1 (the minority class). With these weights, performance on the test data is measured by the F1 score; the resulting confusion matrix is shown in Fig. 7.
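The weight search can be sketched as a simple loop over candidate minority-class weights, scoring each by F1 on held-out data (synthetic stand-in data again; the grid and optimum here are illustrative, not the paper's values):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.91, 0.09],
                           random_state=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=100)

# Search minority-class weights n in (0, 1); the majority class gets 1 - n.
best_w, best_f1 = None, -1.0
for n in np.linspace(0.05, 0.95, 19):
    clf = LogisticRegression(class_weight={0: 1 - n, 1: n},
                             max_iter=1000).fit(X_tr, y_tr)
    f1 = f1_score(y_te, clf.predict(X_te))
    if f1 > best_f1:
        best_w, best_f1 = n, f1

print(best_w, best_f1)  # minority-class weight that maximizes F1
```

Plotting f1 against n over this grid reproduces the kind of curve described for Fig. 6.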