Fraud Detection: A Study of AdaBoost Classifier and K-Means Clustering

Fraud can greatly affect the economy, and billions of dollars are lost to it. It can occur through credit cards, insurance and bank accounts. Many studies have addressed fraud prevention, and machine learning techniques, including supervised models, unsupervised models and neural networks, have helped in detecting fraud. The dataset for the present work was collected during a research collaboration between Worldline and the Machine Learning Group of Université Libre de Bruxelles on big data mining and fraud detection. It consists of the time and amount of transactions made by European card holders during September 2013. This paper reviews past and present models used for fraud detection and presents a study of K-Means Clustering and the AdaBoost Classifier by comparing their accuracies.


Introduction
Fraud can be a major reason for the downfall of an economy. Fraudsters can steal money and use it, which in turn leads to the circulation of illegal money. Illegal money in companies affects their reputation. Fraud takes various forms, including financial fraud, credit card fraud and reputation fraud. There are many ways of committing fraud, such as stealing a person's identity or hacking emails to obtain personal details. Technology has made it easier for fraudsters to steal money, owing to the huge growth of the e-commerce market.
Fraud detection relies on knowing the time and place of a transaction. If the place of a transaction is far away from the card holder's usual place of transaction and the time is at an odd hour, it is likely to be a case of fraud. Fraudsters may also withdraw a large amount of money at one time, which may be unusual for the card holder.
Fraud can be detected through machine learning models, which can be supervised or unsupervised. Supervised models include support vector machines (SVMs), Naive Bayes and the AdaBoost Classifier. Unsupervised models include Principal Component Analysis (PCA) and K-Means Clustering. Neural networks can be trained in either a supervised or an unsupervised manner. Figure 1 shows a block diagram for supervised vs. unsupervised machine learning models.
Various works have been done on fraud detection in recent years. Ref. [1] shows how the AdaBoost classifier works on credit card fraud detection and how the best accuracy is found by majority voting. Fraud detection as an application for reputation in markets is discussed in [18]. A graph-based semi-supervised model was proposed in [3], in which a hierarchical structure was created for node-level attention and view-level attention. An attention-based neural network for credit card fraud detection was proposed in [4]. Opinion fraud detection can be done by using decision trees, as shown in [5]. A comparative analysis of logistic regression, K-Nearest Neighbour (KNN) and Naive Bayes models was performed in [7]. The KNN algorithm has been applied to fraud detection with outlier detection [13]. Recently, there have been many review works on the methods used for fraud detection. Ref. [6] gives a brief theoretical review on financial literacy. Statistical fraud detection is discussed in [30]. Refs. [28], [27], [20], [19], [15], [2], [25] describe data mining based tools and fraud detection techniques. Decision trees have been used in Ref. [8]. Gradient boosting was used in Ref. [9]. Fuzzy systems and Bayesian learning have been studied in Ref. [15]. Ref. [11] discusses optimizing extreme learning machines for fraud detection. Real-time fraud detection, which can be done by using self-organizing maps, is studied in [26], [16]. Neural networks are applied to fraud detection in [24], [23], [14] by finding different features and classifying the data. The input vectors for the neural network can be the current transaction, the payment history and the transaction history. Convolutional neural networks were also used for online transaction fraud detection [22]. A secure AI-based architecture was proposed in [10]. A survey on adaptive fraud detection was done in [29].
Sensitivity analysis was used within a deep learning framework for fraud detection in credit card transactions [21]. Bayesian and neural networks have been combined to improve fraud detection [17]. The AdaBoost algorithm was applied to Naive Bayes in Ref. [12] in order to detect insurance fraud.
The dataset for this research was collected during a research collaboration between Worldline and the Machine Learning Group of Université Libre de Bruxelles on big data mining and fraud detection. It consists of transactions made by European card holders during September 2013. To protect the confidentiality of the users and their identities, the information is provided as numerical variables obtained after a PCA transformation. It also contains the time between the transactions and the amount of money for each transaction.
The paper is organised as follows. Section 2 discusses some examples of supervised models, followed by a description of unsupervised models.

Support Vector Machines (SVMs)
SVMs are generally used for classification and regression. They can work with high-dimensional data by separating the classes with hyperplanes. A hyperplane is a subspace with one dimension fewer than the feature space. When the classes cannot be separated in the original feature space, the SVM maps the data into a higher-dimensional feature space, for example by adding a feature, so that a hyperplane can segregate them. This mapping is done through kernel functions, which implicitly increase the dimension of the feature space. Multiple kernel functions can be tried in order to find the one that best transforms the non-linear problem into a linear one. SVMs need low computational power.
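The idea of adding a feature so that a hyperplane can separate the classes can be illustrated with a toy example. The sketch below is not an SVM solver and is not taken from the present work: the data are made up, and the feature map x -> (x, x^2) and the offset 2.5 are hypothetical choices. It only shows that points which no single threshold can separate in one dimension become linearly separable after the mapping.

```python
# Toy data (made up for illustration): class 1 lies on both sides of class 0,
# so no single threshold on x separates the two classes.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, 0, 0, 0, 1, 1]


def separable_by_threshold(values, labels):
    """Return True if one threshold on a 1-D feature splits the classes.

    Assumes that no two points from different classes share the same value.
    """
    pairs = sorted(zip(values, labels))
    sorted_labels = [label for _, label in pairs]
    # Separable means the sorted labels change value at most once.
    changes = sum(1 for a, b in zip(sorted_labels, sorted_labels[1:]) if a != b)
    return changes <= 1


def feature_map(x):
    """Hypothetical feature map: append x squared as a second feature."""
    return (x, x * x)


def predict(point, offset=2.5):
    """Linear decision rule in the mapped space: the line x2 = offset."""
    return 1 if point[1] - offset > 0 else 0


mapped = [feature_map(x) for x in xs]
predictions = [predict(p) for p in mapped]

print(separable_by_threshold(xs, ys))                       # original space
print(separable_by_threshold([p[1] for p in mapped], ys))   # along the new feature
print(predictions == ys)
```

A kernel function achieves the same effect without computing the mapped coordinates explicitly.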

The AdaBoost Classifier
The AdaBoost (Adaptive Boosting) classifier is an ensemble learning method that combines weak classifiers to produce a strong classifier. Each weak classifier is given a coefficient chosen to minimize the training error, so a weak classifier with a high accuracy is assigned more weight. Boosting can be used with other algorithms in order to improve their performance; for example, random forests or XGBoost estimators can be used to improve the accuracy. The quality of the resulting binary classification can be measured with the Matthews Correlation Coefficient (MCC). An MCC of +1 means that the prediction is perfect, whereas an MCC of −1 means total disagreement between prediction and observation.
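The MCC can be computed from the four entries of a binary confusion matrix as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The following is a minimal sketch of that formula; the function name `mcc`, the label encoding (1 for fraud, 0 for genuine) and the example labels are assumptions made for illustration and do not come from the dataset.

```python
import math


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (1 = fraud, 0 = genuine)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention when a row or column of the confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Made-up labels: two fraudulent and four genuine transactions.
actual = [1, 1, 0, 0, 0, 0]
perfect = [1, 1, 0, 0, 0, 0]
inverted = [0, 0, 1, 1, 1, 1]
all_genuine = [0, 0, 0, 0, 0, 0]

print(mcc(actual, perfect))      # perfect prediction
print(mcc(actual, inverted))     # total disagreement
print(mcc(actual, all_genuine))  # predicting one class only
```

Unlike plain accuracy, the MCC gives no credit to a classifier that labels every transaction as genuine, which matters when fraud cases are rare.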

The Naive Bayes
Naive Bayes makes use of Bayes' theorem together with the "naive" assumption that the features are independent of each other given the class. According to the theorem, the probability of a class is calculated from prior knowledge and the observed conditions. Various experimental analyses of Naive Bayes have been carried out for fraud detection. It can be used for classification problems and is suitable for real-time fraud detection. However, it cannot identify all abnormal behaviours.
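As a worked illustration of the theorem, the sketch below computes the probability that a transaction is fraudulent given two observed conditions, treating them as independent given the class. All probabilities, the two conditions (odd hour, large amount) and the function name are hypothetical and chosen only to make the arithmetic concrete.

```python
def naive_bayes_posterior(prior_fraud, likelihoods_fraud, likelihoods_genuine):
    """P(fraud | evidence), assuming the pieces of evidence are independent given the class."""
    score_fraud = prior_fraud
    for p in likelihoods_fraud:
        score_fraud *= p          # P(evidence_i | fraud)
    score_genuine = 1.0 - prior_fraud
    for p in likelihoods_genuine:
        score_genuine *= p        # P(evidence_i | genuine)
    # Bayes' theorem: normalise by the total probability of the evidence.
    return score_fraud / (score_fraud + score_genuine)


# Hypothetical numbers: 0.2% of transactions are fraudulent.
prior = 0.002
# Evidence: the transaction happened at an odd hour and had a large amount.
given_fraud = [0.6, 0.5]      # P(odd hour | fraud), P(large amount | fraud)
given_genuine = [0.1, 0.05]   # P(odd hour | genuine), P(large amount | genuine)

posterior = naive_bayes_posterior(prior, given_fraud, given_genuine)
print(posterior)
```

With these made-up numbers the posterior is far larger than the prior but still well below one half, which illustrates why rare fraud cases are hard to flag with certainty.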

Unsupervised Models
Unsupervised machine learning models find hidden patterns in the data by themselves, without any supervision. The following subsections discuss unsupervised machine learning models.

Principal Component Analysis (PCA)
PCA is used to turn high-dimensional data into a lower-dimensional representation while preserving as much of the information (variance) in the original data as possible. It can be used for feature extraction and dimensionality reduction. The principal components are found as the directions (vectors) along which the data vary the most, rather than by fitting functions. PCA transforms the input features into principal components, which are then used as the input features. Because the reduced data contain less noise, PCA also helps in visualising high-dimensional data.
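A minimal sketch of the idea for two-dimensional data is given below. It finds the direction of largest variance from the covariance matrix in closed form and projects each point onto it, so that two features are replaced by one. The data and function names are made up for illustration; the dataset used in this work was already PCA-transformed by its providers, so this is not the procedure applied to it.

```python
import math


def first_principal_component(points):
    """Return (unit direction, explained variance ratio) for 2-D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(p[0] - mean_x, p[1] - mean_y) for p in points]
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix in closed form.
    trace = sxx + syy
    det = sxx * syy - sxy * sxy
    gap = math.sqrt(max(trace * trace / 4.0 - det, 0.0))
    largest = trace / 2.0 + gap
    # Eigenvector belonging to the largest eigenvalue.
    if abs(sxy) > 1e-12:
        direction = (largest - syy, sxy)
    elif sxx >= syy:
        direction = (1.0, 0.0)
    else:
        direction = (0.0, 1.0)
    norm = math.hypot(direction[0], direction[1])
    return (direction[0] / norm, direction[1] / norm), largest / trace


def project(points, direction):
    """Coordinates of the centred points along the given unit direction."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    return [(x - mean_x) * direction[0] + (y - mean_y) * direction[1]
            for x, y in points]


# Made-up 2-D data lying close to the line y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
direction, ratio = first_principal_component(data)
scores = project(data, direction)  # one number per point instead of two
print(direction, ratio, scores)
```

For data with more than two features the same steps apply, with the eigenvectors of the full covariance matrix computed numerically.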

K-Means Clustering
K-Means is a method of separating the dataset into k clusters, where each data point belongs to the cluster whose mean (centroid) is closest to it. The clusters are independent of each other. The centroids are chosen and then refined iteratively to train the model. Each data point is given the label of its cluster.
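The assignment and update steps described above can be sketched as follows. This is a plain-Python illustration of the standard (Lloyd's) iteration on made-up points, not the code used for the results in this paper; the function names, the seed and the initialisation from randomly chosen data points are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, k, max_iterations=100, seed=0):
    """Return (centroids, assignments) after iterating until convergence."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise with k data points
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the centroids are stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


# Made-up data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, assignments = k_means(points, k=2)
print(centroids)
print(assignments)
```

The cluster indices themselves are arbitrary; only the grouping of the points is meaningful.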
In the next section, the results for fraud detection on the dataset, consisting of the time and amount of transactions of European card holders during September 2013, are discussed. The results of the present work are obtained by using the AdaBoost Classifier, which is a supervised machine learning model, as well as the unsupervised model based on K-Means Clustering. Python is used for coding both models. Both algorithms have their own pros and cons. The K-Means algorithm is easier to apply to a small dataset as it requires less computation time. The AdaBoost Classifier can be used when there is a larger number of estimators.

Fraud Detection using the AdaBoost Classifier
The most important parameters of this classifier are the number of estimators, the base estimator and the learning rate. The number of estimators indicates the number of weak learners to train. The base estimator is the weak learner used to train the model. The learning rate scales the weight given to each weak learner.
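To make the role of these parameters concrete, the sketch below implements a small discrete AdaBoost from scratch. It is an illustration under stated assumptions, not the implementation used for the results reported here: the base estimator is a one-feature decision stump rather than the random forest used in this work, the labels are encoded as +1 and −1, the data are a made-up toy set, and the parameter names `n_estimators` and `learning_rate` are borrowed from scikit-learn's `AdaBoostClassifier` (the source does not name the library it used).

```python
import math


def fit_stump(X, y, w):
    """Best weighted decision stump: (error, feature, threshold, polarity)."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] > threshold else -polarity for row in X]
                err = sum(wi for wi, p, t in zip(w, preds, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, threshold, polarity)
    return best


def stump_predict(stump, row):
    _, j, threshold, polarity = stump
    return polarity if row[j] > threshold else -polarity


def adaboost_fit(X, y, n_estimators=10, learning_rate=1.0):
    """Train n_estimators stumps; return a list of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []
    for _ in range(n_estimators):
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # guard against log(0)
        # Coefficient of this weak learner, scaled by the learning rate.
        alpha = learning_rate * 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Raise the weights of misclassified samples and lower the others.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


def accuracy(ensemble, X, y):
    return sum(1 for row, t in zip(X, y) if adaboost_predict(ensemble, row) == t) / len(y)


# Made-up 1-D data that no single stump can classify correctly.
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

acc_one = accuracy(adaboost_fit(X, y, n_estimators=1), X, y)
acc_three = accuracy(adaboost_fit(X, y, n_estimators=3), X, y)
print(acc_one, acc_three)
```

A smaller learning rate shrinks every coefficient, so the sample weights change more slowly and more estimators are typically needed to reach the same fit.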
In this method the base estimator is taken to be a random forest. The number of estimators is varied from one observation to the next. The accuracy for 200 estimators with a learning rate of 0.01 is 92.479%. For the same learning rate and different numbers of estimators, the accuracies are shown in Table 1. For a learning rate of 1 and different numbers of estimators, the accuracies are shown in Table 2.
The accuracy of the model for 100 estimators with a learning rate of 0.01 is 93.089%. For a learning rate of 1, the accuracy is 93.902% for the same number of estimators. The maximum accuracy of 94.512% is obtained with 1000 estimators for a learning rate of 0.01 and with 400 estimators for a learning rate of 1. As shown in Tables 1 and 2, the learning rate does not affect the accuracy of the AdaBoost Classifier to a great extent. It helps in improving the stability of the model.

Fraud Detection using the K-Means Clustering
Different numbers of clusters are formed for the given dataset and their accuracies are studied, as shown in Table 3. For 300 clusters, the accuracy is observed to be the highest, with a value of 99.385%. The accuracy is observed to be low when the number of clusters is small.
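The source does not state how an accuracy is derived from the clusters. One common convention, assumed here purely for illustration, is to label every cluster with the most frequent true class among its members and to count a point as correct when its own class matches that label. The cluster assignments and labels below are made up.

```python
from collections import Counter


def majority_vote_accuracy(assignments, labels):
    """Accuracy after labelling each cluster with its most common true label."""
    by_cluster = {}
    for cluster, label in zip(assignments, labels):
        by_cluster.setdefault(cluster, []).append(label)
    # Most frequent true label within each cluster.
    cluster_label = {
        cluster: Counter(members).most_common(1)[0][0]
        for cluster, members in by_cluster.items()
    }
    correct = sum(1 for cluster, label in zip(assignments, labels)
                  if cluster_label[cluster] == label)
    return correct / len(labels)


# Made-up example: three clusters, labels 1 = fraud and 0 = genuine.
example_assignments = [0, 0, 0, 1, 1, 2, 2, 2]
example_labels = [0, 0, 1, 1, 1, 0, 0, 0]
example_accuracy = majority_vote_accuracy(example_assignments, example_labels)
print(example_accuracy)
```

Under this assumed convention, smaller clusters tend to be purer, which would be consistent with the low accuracy observed for small numbers of clusters.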

Conclusions
In the present work, fraud detection is studied for the dataset consisting of the time and amount of transactions of European card holders during September 2013, using a supervised machine learning model, the AdaBoost Classifier, as well as K-Means clustering, which is an unsupervised machine learning model. The dataset was collected during a research collaboration between Worldline and the Machine Learning Group of Université Libre de Bruxelles on big data mining and fraud detection. Although they are two different models, they do not show a huge difference in their accuracies. Different learning rates for the AdaBoost Classifier lead to marginal differences in its accuracy. The accuracy of K-Means clustering is higher, but it takes a longer time to compute compared to the AdaBoost Classifier model.