Figure 1 depicts our four-phase approach to detecting fraud with machine learning. The process starts with data collection. We use the IEEE-CIS dataset provided by Vesta Corporation, a forerunner in guaranteed e-commerce payment solutions [17]. A rich, complex dataset is the backbone of any robust machine learning model that aims to produce output matching reality. This dataset was chosen because it is highly representative, covering almost every challenging real-life pattern of a typical fraud detection problem: massive data volume, genuine transactions greatly outnumbering fraudulent events, and diverse card-related features ranging from time deltas, transaction amounts, and addresses to network connection information (IP, ISP, proxy, etc.) and digital signatures (user agent, browser, OS, version, etc.) associated with each transaction.
In the second phase, we apply several preprocessing methods. The first is a minification step that reduces memory usage, saving resources when building the prediction model and speeding up training. We then conduct an exploratory analysis to inspect the data for patterns, trends, and relationships between variables, and between the target column and the other variables. Next, we experiment with several techniques to choose the most suitable approaches to feature transformation and feature selection. The main part of the preprocessing stage is to separate users into new and old groups by establishing a card identification from the given card-related features in the dataset. In the third phase, after converting categorical and numerical data into suitable forms, we apply a DNN to unknown users and CatBoost to known users, before combining their predictions into the final result in the last phase. Details of each process are clarified in the following sections.
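As an illustration of the minification step mentioned above, the following sketch downcasts numeric columns of a pandas DataFrame to the smallest type that can hold their values. The file names and the merge on TransactionID are assumptions based on the dataset description, not the exact pipeline code.

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to smaller dtypes to cut memory usage."""
    start = df.memory_usage(deep=True).sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtype
        if pd.api.types.is_integer_dtype(col_type):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(col_type):
            df[col] = pd.to_numeric(df[col], downcast="float")
    end = df.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"Memory reduced from {start:.1f} MB to {end:.1f} MB")
    return df

# Hypothetical usage: join the two files on TransactionID, then minify.
# transaction = pd.read_csv("train_transaction.csv")
# identity = pd.read_csv("train_identity.csv")
# train = transaction.merge(identity, on="TransactionID", how="left")
# train = reduce_mem_usage(train)
```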
3.1 Data Source
The IEEE-CIS dataset consists of two files, transaction and identity, joined by TransactionID, with 433 features and 590,540 instances in total. They are real-world transactions provided by Vesta Corporation, a forerunner specializing in guaranteed e-commerce payment solutions. Although a brief glossary is provided, the meaning of each feature remains obscure because all features are masked, without a pairwise dictionary, under a privacy protection agreement. For clarity, Table 1 summarizes the features in the transaction and identity files based on Vesta's explanations:
Table 1. Data description
Transaction

Feature | Description
TransactionDT | Timedelta from a given reference datetime (not an actual timestamp)
TransactionAmt | Transaction payment amount in USD
ProductCD | Product code, the product for each transaction
card1 - card6 | Payment card information, such as card type, card category, issuing bank, country, etc.
addr | Address
dist | Distance
P_ and (R__) emaildomain | Purchaser and recipient email domain
C1 - C14 | Counts, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
D1 - D15 | Timedeltas, such as days between previous transactions, etc.
M1 - M9 | Matches, such as names on the card and address, etc.
Vxxx | Vesta-engineered rich features, including ranking, counting, and other entity relations

Identity

Feature | Description
DeviceType | Type of machine the customer uses
DeviceInfo | Information about the machine
id_01 - id_11 | Numerical identity features, such as device rating, IP domain rating, proxy rating, and behavioral fingerprints like account login times, failed login times, how long an account stayed on the page, etc.
The business logic behind the binary label, according to the owner of the dataset, is that a transaction is denoted "isFraud=1" when there is a reported chargeback on the card, and all transactions posterior to it that are associated with the same user account, email address, etc., are labeled as fraud as well. If the cardholder did not report within 120 days, those suspicious transactions are automatically considered legitimate (isFraud=0). In other words, once a card has been reported as fraudulent, the associated account is converted to isFraud=1. We are therefore predicting fraudulent clients rather than fraudulent transactions.
3.2 Data Preprocessing
Exploratory Analysis. Transactions were recorded from November 30th, 2017 to May 31st, 2018, as depicted in Figure 2. During exploratory data analysis, we notice that approximately 3.5% of the training transactions are fraudulent and more than 95% of the columns contain missing values, a common pattern in real-world fraud detection tasks.
To deal with the imbalance between the number of fraudulent and non-fraudulent transactions, we apply the SMOTE method, which multiplies the number of fraudulent transactions using the KNN algorithm. Specifically, a data point is randomly selected from the pool of fraudulent transactions, its closest neighbors are determined, and new fraudulent samples are generated between the selected point and its neighbors.
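A minimal sketch of this oversampling step, assuming the imbalanced-learn library, an already numeric and imputed feature matrix X_train, and a binary pandas Series y_train (names are placeholders):

```python
from imblearn.over_sampling import SMOTE

# SMOTE synthesizes new minority-class (fraud) samples by interpolating
# between a randomly chosen fraud point and one of its k nearest fraud neighbors.
smote = SMOTE(k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

print(y_train.value_counts())       # original, heavily imbalanced labels
print(y_resampled.value_counts())   # balanced labels after oversampling
```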
After multivariate analysis, we find that fraudulent transactions are concentrated in products of category W or C, paid for by debit or credit card, Visa or Mastercard. Cards such as American Express and Discover have very few or even no fraudulent transactions in the case of charge cards, because they are not as commonly used as other cards. In addition, fraud is associated with users whose email domains are gmail.com or hotmail.com, who use computers running Windows 7 or Windows 10, or, when transactions are made by phone, with devices typically running Chrome 63.0 or generic Mobile Safari. The probability of a fraudulent transaction is roughly the same for computers and phones. For variables whose values are encoded as T/F (M1 to M9 excluding M4, id_35 to id_38) or New/Found/NotFound (id_15, id_16, id_28, id_29), fraudulent transactions are dominant among observations with the values T and Found.
Feature Transformation. Most of the variables have right-skewed distributions. Some variables need a logarithmic transformation to approximate a normal distribution, such as TransactionAmt shown in Figure 3:
The dataset contains many null values (NaN), so any column with more than 95% missing values is discarded because it contributes little to the model's performance. Numerical features are imputed with 0 or the mean, while for categorical features each blank entry is filled with the word "Unknown" and treated as a separate category. Since the machine learning models only accept numerical input, categorical features are converted to numbers through label encoding.
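The transformation steps above might look like the following sketch; the missing-value threshold and the choice of 0 as the numeric fill value follow the description, while the helper name is ours.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def transform_features(df: pd.DataFrame) -> pd.DataFrame:
    # Log-transform the heavily skewed transaction amount (Figure 3).
    df["TransactionAmt"] = np.log1p(df["TransactionAmt"])

    # Drop columns with more than 95% missing values.
    missing_ratio = df.isna().mean()
    df = df.drop(columns=missing_ratio[missing_ratio > 0.95].index)

    # Impute and encode the remaining columns.
    for col in df.columns:
        if df[col].dtype == "object":
            df[col] = df[col].fillna("Unknown")               # new category
            df[col] = LabelEncoder().fit_transform(df[col].astype(str))
        else:
            df[col] = df[col].fillna(0)                       # or df[col].mean()
    return df
```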
Feature Engineering. This is the major part of our process, where we split customers into known and unknown groups. First, the initial dataset is divided into two parts, a training set and a test set, with a ratio of 7:3. Since a single user may use multiple cards across many different transactions, it is necessary to define card groups based on the identifying card attributes (card1, card2, card3, card4, card5, card6, ProductCD) together with the CardID_D1 column, as shown in Figure 4 below.
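A sketch of how such a grouping key might be built is given below. The exact derivation of CardID_D1 is not published, so the assumption here is that it is obtained from the transaction day (TransactionDT in seconds) minus the D1 timedelta, which keeps it constant across a card's transactions and matches the trailing component of the cardGroup_name strings in Table 2.

```python
import pandas as pd

CARD_COLS = ["card1", "card2", "card3", "card4", "card5", "card6"]

def add_card_group(df: pd.DataFrame) -> pd.DataFrame:
    # Assumption: CardID_D1 = transaction day - D1 (days since the card appeared).
    df["CardID_D1"] = (df["TransactionDT"] // (24 * 60 * 60)) - df["D1"]
    # Concatenate the identifying card attributes, CardID_D1 and ProductCD
    # into a single key, mirroring the cardGroup_name values in Table 2.
    df["cardGroup_name"] = (
        df[CARD_COLS].astype(str).agg("".join, axis=1)
        + df["CardID_D1"].astype(str)
        + df["ProductCD"].astype(str)
    )
    return df

# train = add_card_group(train)
# train["cardGroup_name"].value_counts().head()   # compare with Table 2
```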
Next, we continue to separate the cardGroups into cardIDs based on the V307 feature, which is important in identifying cardIDs.
Table 2. First five rows of the number of transactions corresponding to each card group
No. | cardGroup_name | Counts
0 | 15775481.0150.0mastercard102.0credit-129S | 1414
1 | 9500321.0150.0visa226.0debit84W | 480
2 | 7919194.0150.0mastercard166.0debit-92W | 439
3 | 7919194.0150.0mastercard166.0debit-124W | 282
4 | 7919194.0150.0mastercard202.0debit-34W | 242
As can be seen in Figure 5, each color marks a cardID belonging to a cardGroup, and V307 is the cumulative sum of the TransactionAmt values of the preceding transactions. Next is the stage of identifying customers based on the customer identification information (TransactionAmt, id_19, id_20), assuming id_19 and id_20 carry IP address information, together with the cardID; the result used to separate new and old users is illustrated in Figures 6 and 7:
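Based on the stated property that V307 accumulates the previous TransactionAmt values, the splitting of a cardGroup into cardIDs can be sketched as follows. The greedy matching and the tolerance are illustrative assumptions, not the exact published procedure.

```python
import pandas as pd

def split_card_group(group: pd.DataFrame, tol: float = 0.01) -> pd.Series:
    """Assign cardIDs within one cardGroup by matching each row's V307 against
    the running sum of TransactionAmt of already-open cardIDs (greedy sketch)."""
    group = group.sort_values("TransactionDT")
    running = {}          # cardID -> cumulative TransactionAmt so far
    labels = []
    next_id = 0
    for _, row in group.iterrows():
        match = next((cid for cid, total in running.items()
                      if abs(row["V307"] - total) <= tol), None)
        if match is None:                 # no cumulative sum matches: open a new cardID
            match, next_id = next_id, next_id + 1
            running[match] = 0.0
        running[match] += row["TransactionAmt"]
        labels.append(match)
    return pd.Series(labels, index=group.index)

# cardIDs = train.groupby("cardGroup_name", group_keys=False).apply(split_card_group)
```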
User separation is performed for both the training set and the test set. Identifiers present in both sets are recognized as old customers; the rest are new customers. The purpose is to train the model to identify new and old users well, so that once a person is reported as fraudulent, subsequent transactions involving the same user identifier are also labeled "isFraud=1".
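A sketch of the new/old separation, assuming train and test DataFrames that already carry the cardID, id_19, and id_20 columns; the exact composition of the user identifier is an assumption based on the description above.

```python
import pandas as pd

def build_uid(df: pd.DataFrame) -> pd.Series:
    # Assumption: the user identifier combines the cardID with id_19/id_20
    # (treated here as IP-related information).
    return (df["cardID"].astype(str) + "_"
            + df["id_19"].astype(str) + "_"
            + df["id_20"].astype(str))

train["uid"] = build_uid(train)
test["uid"] = build_uid(test)

# Identifiers seen in both sets are old (known) customers; the rest are new.
known_uids = set(train["uid"]) & set(test["uid"])
train["is_known_user"] = train["uid"].isin(known_uids)
test["is_known_user"] = test["uid"].isin(known_uids)

# Known users are later routed to CatBoost, unknown users to the DNN.
```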
Feature Selection. Picking the right set of features as model inputs is one of the key contributions to the performance we achieve. First, we use the Principal Component Analysis (PCA) technique to reduce the number of V-prefixed variables from 339 to the 30 most important components. This method is based on the observation that the data are not distributed randomly in space but often lie near certain special lines or planes. PCA considers the special case where such planes are linear subspaces.
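A hedged sketch of this reduction with scikit-learn, assuming the V columns have already been imputed; the component count follows the text, the scaling and fill value are assumptions.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

v_cols = [c for c in train.columns if c.startswith("V")]   # the 339 V features

scaler = StandardScaler()
V_scaled = scaler.fit_transform(train[v_cols].fillna(0))

pca = PCA(n_components=30, random_state=42)
V_reduced = pca.fit_transform(V_scaled)                     # 339 -> 30 components

print(pca.explained_variance_ratio_.sum())  # variance retained by the 30 components
```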
For the DNN, the input variables include the categorical variables ProductCD, card1-card6, addr1, addr2, P_emaildomain, R_emaildomain, and M1-M9 (after label encoding), and the numerical variables not prefixed with V or id_ (normalized to zero mean and unit variance). For CatBoost, the input variables are everything except TransactionID, TransactionDT, isFraud, and the V variables discarded after PCA.
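The split of inputs between the two models can be sketched as follows; the column lists mirror the description above, and kept_v_cols is a placeholder for whatever V-derived features remain after PCA.

```python
from sklearn.preprocessing import StandardScaler

# DNN inputs: selected categorical columns (already label-encoded)
# plus numerical columns not prefixed with "V" or "id_".
dnn_categorical = (["ProductCD", "addr1", "addr2", "P_emaildomain", "R_emaildomain"]
                   + [f"card{i}" for i in range(1, 7)]
                   + [f"M{i}" for i in range(1, 10)])
dnn_numerical = [c for c in train.columns
                 if c not in dnn_categorical
                 and not c.startswith(("V", "id_"))
                 and c not in ("TransactionID", "TransactionDT", "isFraud")]

# Standardize the numerical DNN inputs to zero mean and unit variance.
train[dnn_numerical] = StandardScaler().fit_transform(train[dnn_numerical])

# CatBoost inputs: everything except the excluded columns and the
# V variables discarded after PCA.
excluded = {"TransactionID", "TransactionDT", "isFraud"}
catboost_features = [c for c in train.columns
                     if c not in excluded
                     and (not c.startswith("V") or c in kept_v_cols)]
```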
3.3 Model
Our model is a combination of CatBoost and a neural network as base learners. Their predictions on the overlapping and non-overlapping parts, respectively, are combined into a single output. We use CatBoost to see whether we can improve the prediction rate for overlapping users. CatBoost is a powerful gradient boosted decision tree (GBDT) implementation for classification tasks involving big data. Two innovative qualities of CatBoost are its automatic handling of categorical values and its strong performance relative to other GBDT implementations. CatBoost uses an ordering principle for computing target-based statistics, in which the values for each example rely only on the observed history [18]. Thus, for a dataset with plentiful categorical features such as IEEE-CIS, we can improve our training results without spending time and effort turning categories into numbers. CatBoost is also robust in that it does not require extensive hyper-parameter tuning [19]. Due to these advantages, CatBoost, whose parameters are described in Table 3, outperforms most other machine learning algorithms in both speed and accuracy.
Table 3. CatBoost parameters
Parameter | Description | Value
learning_rate | Used for reducing the gradient step | 0.07
loss_function | The metric to use in training | Logloss
depth | Depth of the tree | 8
n_estimators | The number of trees to build before taking the maximum voting or averages of predictions | 5000
For non-overlapping users, our neural network architecture, as given in Figure 8, consists of an input layer whose size equals the number of selected features, 3 hidden layers (with 512, 256, and 1 neurons, respectively), and an output layer of one neuron. The optimal parameters for this model are shown in Table 4.
Table 4. Neural Network parameters
Parameter | Description | Value
learning_rate | Used for reducing the gradient step | 0.0001
loss_function | The metric to use in training | binary cross-entropy
optimizer | Adam with Nesterov momentum | Nadam
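A sketch of the network described above and in Table 4, using Keras; the layer sizes follow the description of Figure 8, while the activations, batch size, and epoch count are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_dnn(n_features: int) -> tf.keras.Model:
    model = models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(512, activation="relu"),   # hidden layer 1
        layers.Dense(256, activation="relu"),   # hidden layer 2
        layers.Dense(1, activation="relu"),     # hidden layer 3 (per Figure 8)
        layers.Dense(1, activation="sigmoid"),  # output: fraud probability
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Nadam(learning_rate=1e-4),  # Table 4
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model

# Hypothetical usage on the unknown-user subset:
# dnn = build_dnn(n_features=X_train_unknown.shape[1])
# dnn.fit(X_train_unknown, y_train_unknown, epochs=10, batch_size=1024,
#         validation_split=0.1)
```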