After preparing the dataset, we began implementing machine learning models. Since our dataset is highly imbalanced, with 55,129 entries in the majority class versus only 239 entries in the minority class, we applied methods such as oversampling, undersampling, and cost-sensitive learning. Our test set has 22,097 samples, of which 22,001 are negative samples '0' (low suspicion of IUU activity) and 96 are positive samples '1' (high suspicion of IUU activity). For our project, we use two performance measurement indicators:
- The Receiver Operating Characteristic Area Under the Curve (ROC_AUC), a measure of classification performance at various threshold settings. The top-left-most point of the curve gives the optimal threshold setting for a given model, and the higher the ROC_AUC score, the better the model is at classifying the samples.
- Recall. In our project, the cost of misclassifying a positive case (vessel conducting IUU activity) as a negative case (vessel not conducting IUU activity) is higher. The top-left-most point of the ROC curve gives the best trade-off between the recall of the majority and minority classes, and the higher the recall, the better the model performs (see the sketch after this list).
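As a concrete illustration of this threshold selection, here is a minimal sketch assuming a fitted scikit-learn-style classifier; `model`, `X_test`, and `y_test` are placeholders, not our actual variable names:

```python
# Sketch: pick the ROC threshold closest to the top-left corner (FPR=0, TPR=1).
# `model`, `X_test`, and `y_test` are placeholders for a fitted classifier
# and the held-out test split.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_score = model.predict_proba(X_test)[:, 1]         # probability of class 1
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print("ROC_AUC:", roc_auc_score(y_test, y_score))

best = np.argmin(np.hypot(fpr, 1 - tpr))            # distance to (0, 1)
y_pred = (y_score >= thresholds[best]).astype(int)  # classify at that threshold
```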
Oversampling is a technique for creating artificial samples of the minority class. One popular oversampling technique is the Synthetic Minority Over-sampling Technique (SMOTE) [29], which creates synthetic samples by interpolating between existing minority-class samples and their nearest neighbors. We oversampled our training data's minority class using SMOTE and kept the testing data separate so that it would not be contaminated; a minimal sketch of this step is shown below, followed by the models we trained.
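```python
# Sketch of the SMOTE step: split first, then oversample ONLY the training
# data so the test set stays uncontaminated. `X`, `y`, and the split ratio
# are placeholders/assumptions, not the project's exact code.
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)  # illustrative split

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
# y_train_res is now balanced 1:1; X_test / y_test are untouched.
```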
- Artificial Neural Network (ANN) with 3 dense layers (including 1 input and 1 output layer): the first layer had 50 neurons with ReLU activation, the second layer had 15 neurons with ReLU activation, and the output layer had 1 neuron with sigmoid activation, since we wanted a binary classification (0 - low suspicion of IUU activity, 1 - high suspicion of IUU activity). We used the Adam optimizer and the binary cross-entropy loss function. After training the model on X_train, we plotted the ROC curve and used it to find the best threshold for classifying the test-set predictions (a sketch of this architecture follows Table 1). The recalls for the majority and minority classes are 0.75 and 0.71, respectively, on the test set.
Table 1. Confusion Matrix of ANN (oversampling)

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 16,524      | 5,477       |
| Actual 1 | 28          | 68          |
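A sketch of this architecture in Keras, reusing the resampled data from the earlier snippet; the layer sizes and activations follow the text, while the epoch count and batch size are illustrative assumptions:

```python
# Sketch of the 50 -> 15 -> 1 network described above.
import tensorflow as tf

ann = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train_res.shape[1],)),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(15, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # binary output
])
ann.compile(optimizer="adam", loss="binary_crossentropy",
            metrics=[tf.keras.metrics.AUC(name="roc_auc")])
ann.fit(X_train_res, y_train_res, epochs=20, batch_size=256)  # illustrative
```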
- XGBoost, short for "Extreme Gradient Boosting", is an optimized distributed gradient boosting library designed for efficient training of machine learning models [30]. We used a high gamma value and a low max-depth value because the model was overfitting (a sketch follows Table 2). The recalls for the majority and minority classes are 0.78 and 0.67, respectively, on the test set.
Table 2. Confusion Matrix of XGBoost (oversampling)

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 17,134      | 4,867       |
| Actual 1 | 32          | 64          |
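A sketch of this configuration; the text reports only "high gamma, low max depth", so the exact values below are assumptions:

```python
# Sketch: shallow, heavily regularized XGBoost to curb overfitting.
from xgboost import XGBClassifier

xgb = XGBClassifier(
    gamma=10,          # "high gamma": larger loss reduction required to split
    max_depth=3,       # "low max depth": shallower trees
    eval_metric="auc",
)
xgb.fit(X_train_res, y_train_res)
```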
- Logistic Regression estimates the probability of an event occurring, in this case whether a vessel was conducting IUU activity (1) or not (0), based on a given set of independent variables. The recalls for the majority and minority classes are 0.91 and 0.40, respectively, on the test set.
Table 3. Confusion Matrix of Logistic Regression (oversampling)

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 20,073      | 1,928       |
| Actual 1 | 58          | 38          |
- Ensemble learning is a process in which multiple models are combined to solve a given problem and produce better results than the individual models [31]. Here, we performed ensemble learning with XGBoost and Logistic Regression using soft voting, in which the predicted probabilities of the models are averaged and the class with the highest combined probability is picked (a sketch follows Table 4). The recalls for the majority and minority classes are 0.90 and 0.47, respectively, on the test set.
Table 4. Confusion Matrix of Ensemble Learning (oversampling)

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 19,773      | 2,228       |
| Actual 1 | 51          | 45          |
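A sketch of the soft-voting setup with scikit-learn's VotingClassifier; the hyperparameters reuse the assumptions from the earlier snippets:

```python
# Sketch: soft voting averages predict_proba across XGBoost and Logistic
# Regression, then picks the class with the higher mean probability.
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

ensemble = VotingClassifier(
    estimators=[("xgb", XGBClassifier(gamma=10, max_depth=3)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft",
)
ensemble.fit(X_train_res, y_train_res)
```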
- Stacking is a method for exploring different models for the same problem: we train base models on the training set, append each base model's predictions (on the training set) to the training set as new features, and finally train a meta-model on this augmented training set. Here, the base models were ANN and Logistic Regression, and the meta-model was XGBoost (a sketch follows Table 5). The recalls for the majority and minority classes are 0.91 and 0.40, respectively, on the test set.
Table 5. Confusion Matrix of Stacking model (oversampling)

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 20,073      | 1,928       |
| Actual 1 | 58          | 38          |
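A sketch of the stacking setup with scikit-learn's StackingClassifier; to keep the sketch self-contained, an MLPClassifier with the same 50/15 hidden layers stands in for the Keras ANN (an assumption, not our actual implementation):

```python
# Sketch: out-of-fold base-model probabilities become features for the
# XGBoost meta-model. MLPClassifier stands in for the Keras ANN here.
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

stack = StackingClassifier(
    estimators=[("ann", MLPClassifier(hidden_layer_sizes=(50, 15), max_iter=500)),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=XGBClassifier(gamma=10, max_depth=3),
    stack_method="predict_proba",
    cv=5,
)
stack.fit(X_train_res, y_train_res)
```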
As seen in Fig. 3 above, for oversampling the best trade-off between the recall of the majority and minority classes belongs to ANN and XGBoost.
Undersampling is a technique that randomly removes samples from the majority class of the training dataset, producing a better class distribution; it can reduce the skew from 1:100 to 1:10, or, as in our case, to 1:1 [33]. We performed random undersampling on our training data's majority class and kept the testing data separate so that it would not be contaminated (a minimal sketch is shown below). We then trained the same models as for oversampling.
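```python
# Sketch: random undersampling of the training majority class to a 1:1 ratio;
# the test split is untouched. X_train / y_train reuse the names from the
# earlier split sketch.
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_train_us, y_train_us = rus.fit_resample(X_train, y_train)
```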
Table 6. Confusion Matrix of ANN (undersampling)

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 14,033      | 7,968       |
| Actual 1 | 25          | 71          |
Table 7. Confusion Matrix of XGBoost (undersampling)

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 16,938      | 5,063       |
| Actual 1 | 22          | 74          |
- Logistic Regression, with a similar configuration to that used for oversampling. The recalls for the majority and minority classes are 0.72 and 0.72, respectively, on the test set.
Table 8. Confusion Matrix of Logistic Regression (undersampling)

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 15,942      | 6,059       |
| Actual 1 | 27          | 69          |
- Ensemble learning with XGBoost and Logistic Regression, with soft voting. The recalls for the majority and minority classes are 0.78 and 0.75, respectively, on the test set.
Table 9. Confusion Matrix of Ensemble Learning (undersampling)

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 17,126      | 4,875       |
| Actual 1 | 24          | 72          |
- Stacking model with cost-sensitive ANN and Logistic Regression as the base models and XGBoost as the meta-model. The recalls for the majority and minority classes are 0.29 and 0.96, respectively, on the test set.
Table 10. Confusion Matrix of Stacking (undersampling)

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 6,315       | 15,686      |
| Actual 1 | 4           | 92          |
As seen in Fig. 4 above, for undersampling the best trade-off between the recall of the majority and minority classes belongs to XGBoost and Ensemble Learning.
Cost-sensitive learning is a method used when there is class imbalance and misclassifying a positive case as negative has serious consequences. Here, we assign a weight to each class: the higher the weight, the higher the cost of misclassifying that class [32]. For each model, we performed a grid search to find the best weight for the minority class; a minimal sketch of this search is shown below, followed by the models we trained.
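```python
# Sketch of the class-weight grid search, shown here for Logistic Regression;
# the candidate weights are illustrative, not the paper's exact grid.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {"class_weight": [{0: 1, 1: w} for w in (10, 50, 100, 230, 500, 750)]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="recall", cv=5)   # optimize minority-class recall
search.fit(X_train, y_train)                    # original, unresampled training data
print(search.best_params_)
```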
- Artificial Neural Network (ANN) with 3 dense layers (including 1 input and 1 output layer): the first layer had 50 neurons with ReLU activation, the second layer had 15 neurons with ReLU activation, and the output layer had 1 neuron with sigmoid activation. We used the Adam optimizer and the binary cross-entropy loss function. The recalls for the majority and minority classes are 0.75 and 0.56, respectively, on the test set.
Table 11. Confusion Matrix of ANN (cost-sensitive)

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 16,430      | 5,571       |
| Actual 1 | 42          | 54          |
- Cost-sensitive ANN, here with 4 dense layers (including 1 input and 1 output layer): the first layer had 63 neurons with ReLU activation, the second layer had 30 neurons with ReLU activation, the third layer had 10 neurons with ReLU activation, and the output layer had 1 neuron with sigmoid activation. We used the Adam optimizer and the binary cross-entropy loss function. The weight for the majority class is 1, whereas the weight for the minority class is 750 (a sketch follows Table 12). The recalls for the majority and minority classes are 0.81 and 0.69, respectively, on the test set.
Table 12. Confusion Matrix of Cost-sensitive ANN

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 17,889      | 4,112       |
| Actual 1 | 30          | 66          |
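A sketch of the deeper cost-sensitive network in Keras; the layer sizes and class weights follow the text, while the epoch count and batch size are illustrative assumptions:

```python
# Sketch of the 63 -> 30 -> 10 -> 1 cost-sensitive network described above.
import tensorflow as tf

cs_ann = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(63, activation="relu"),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
cs_ann.compile(optimizer="adam", loss="binary_crossentropy")
cs_ann.fit(X_train, y_train, epochs=20, batch_size=256,
           class_weight={0: 1, 1: 750})  # misclassified positives cost 750x more
```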
Table 13. Confusion Matrix of XGBoost (cost-sensitive)

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 16,656      | 5,345       |
| Actual 1 | 20          | 76          |
- Logistic Regression with the class_weight parameter used to assign different class weights; here, we set the weight of the majority class to 1 and of the minority class to 230 (see the snippet after Table 14). The recalls for the majority and minority classes are 0.74 and 0.74, respectively, on the test set.
Table 14. Confusion Matrix of Logistic Regression (cost-sensitive)

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 16,355      | 5,646       |
| Actual 1 | 25          | 71          |
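A minimal sketch of this weighted logistic regression:

```python
# Sketch: logistic regression with the class weights reported above.
from sklearn.linear_model import LogisticRegression

lr_cs = LogisticRegression(class_weight={0: 1, 1: 230}, max_iter=1000)
lr_cs.fit(X_train, y_train)
```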
- Ensemble learning with XGBoost and Logistic Regression, with soft voting. The recalls for the majority and minority classes are 0.79 and 0.76, respectively, on the test set.
Table 15. Confusion Matrix of Ensemble Learning (cost-sensitive)

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 17,328      | 4,673       |
| Actual 1 | 23          | 73          |
- Stacking model with cost-sensitive ANN and Logistic Regression as the base models and XGBoost as the meta-model. The recalls for the majority and minority classes are 0.78 and 0.71, respectively, on the test set.
Table 16. Confusion Matrix of Stacking (cost-sensitive)

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 17,156      | 4,845       |
| Actual 1 | 28          | 68          |
As seen in Fig. 7 above, for cost-sensitive learning the best trade-off between the recall of the majority and minority classes belongs to XGBoost and Ensemble Learning.
As seen in Fig. 8 above, comparing the best models across oversampling (OS), undersampling (US), and cost-sensitive learning (CS), the best trade-offs between majority and minority recall belong to XGBoost (cost-sensitive), Ensemble Learning (cost-sensitive), and XGBoost (undersampling).