Two main architectures were proposed to generate our four different models on the three different datasets: CNN-LSTM (Fig. 1) and LSTM-FCN (Fig. 2). The proposed CNN-LSTM architecture uses a one-dimensional convolutional hidden layer that operates over a 1D sequence with 3 filters (collections of kernels that store the values learned during training) and a kernel size of 32. The convolutional hidden layer is accompanied by batch normalization, which normalizes its input by applying a transformation that keeps the mean output close to 0 and the output standard deviation close to 1. The hidden layer is used for feature extraction. An activation function is applied in the hidden layers of a neural network so that the model can learn more complex functions; in our architecture, we used the Rectified Linear Unit (ReLU) to improve training. The ReLU is followed by a MaxPooling1D layer, which reduces learning time by downsampling the previous layer's output to its most salient values. A dropout layer was introduced to avoid overfitting, a common issue in LSTM models; it drops each of the layer's outputs with a probability of 0.2. The output of the dropout layer is then passed into the LSTM block, which comprises a single hidden layer of 8 LSTM units and an output layer used to make a prediction. The LSTM block is followed by a Dense layer (a Dense layer receives input from all neurons of the previous LSTM output layer) that produces one value for the sigmoid activation function. The sigmoid accepts any real-valued input and maps it to the range (0, 1), from which the binary outcome (Benign, Attack) is obtained.
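The layer stack described above can be sketched in Keras as follows. This is a minimal illustration, not the paper's code; in particular, the input shape of 100 time steps with 1 feature per step is an assumption made only so the sketch is runnable.

```python
import numpy as np
from tensorflow.keras import layers, models

# Sketch of the described CNN-LSTM stack. The input shape (100 time
# steps, 1 feature) is an illustrative assumption, not a value from the paper.
model = models.Sequential([
    layers.Input(shape=(100, 1)),
    layers.Conv1D(filters=3, kernel_size=32),  # 3 filters, kernel size 32
    layers.BatchNormalization(),               # mean ~0, std ~1
    layers.Activation("relu"),                 # ReLU activation
    layers.MaxPooling1D(pool_size=2),          # keep the most salient outputs
    layers.Dropout(0.2),                       # drop outputs with p = 0.2
    layers.LSTM(8),                            # single hidden layer of 8 LSTM units
    layers.Dense(1, activation="sigmoid"),     # one output value in (0, 1)
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Thresholding the single sigmoid output (e.g., at 0.5) yields the Benign/Attack decision.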
As part of optimizing the algorithm, a binary cross-entropy loss function was used to estimate the loss of the proposed architecture on each iteration so that the weights can be updated to reduce the loss on the next iteration [63],[64],[65],[66],[67],[68].
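As a concrete illustration of binary cross-entropy (a sketch, not the paper's code), the loss penalizes predicted probabilities that drift away from the true 0/1 labels, so confident correct predictions produce a much smaller loss than poor ones:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; eps guards against log(0)."""
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# toy labels and two sets of predicted Attack probabilities (illustrative values)
y_true = np.array([1.0, 0.0, 1.0, 1.0])
good = binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.95]))
bad = binary_cross_entropy(y_true, np.array([0.4, 0.6, 0.3, 0.5]))
# good is far smaller than bad: the optimizer updates weights to push
# predicted probabilities toward the true labels on the next iteration
```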
LSTM-FCN augments the fast classification performance of temporal convolutional layers with the precise classification of LSTM neural networks [69]. Temporal convolutions have proven to be an effective learning model for time series classification problems [35]. The proposed LSTM-FCN has an architecture similar to the proposed CNN-LSTM, but it instead utilizes a GlobalAveragePooling1D layer, which retains more information about the "less important" outputs [65]. The layers are then concatenated into one final Dense layer with a sigmoid activation function.
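The difference between the two pooling choices can be seen on a toy feature map (illustrative values, not from the paper): global average pooling lets every time step contribute to the pooled value, while max pooling keeps only the peak activation.

```python
import numpy as np

# toy feature map: 3 time steps x 2 channels (illustrative values)
x = np.array([[0.1, 0.9],
              [0.2, 0.1],
              [0.3, 0.2]])

gap = x.mean(axis=0)  # GlobalAveragePooling1D: every time step contributes
gmp = x.max(axis=0)   # max pooling: only the largest activation survives
# gap = [0.2, 0.4], gmp = [0.3, 0.9]: the average reflects the "less
# important" time steps that the max discards
```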
Both models utilized the Adam optimization algorithm [70] with a fixed learning rate of 0.03 (the proportion by which weights are updated throughout the 3 training epochs of the proposed architecture); 0.03 is a mid-range value that allows for steady learning. There was no need to optimize the hyperparameters (e.g., finding the optimal number of LSTM cells) because of the near-0% misclassification rate of the proposed models. The default weight initializer used in the proposed architecture is GlorotUniform (Xavier uniform). Since k-fold cross-validation (CV) is not commonly used in DL, it is introduced here on each model to investigate whether it produces different results by preventing overfitting. Moreover, the k value is chosen as 5, which is very common in the field of ML [71],[72]. The models utilized StratifiedKFold [73] to ensure that each fold of the dataset has the same proportion of observations (balanced) with respect to the response feature. Where k-fold CV was not introduced, the train_test_split function from Scikit-learn [74] was utilized to split the data into 80% for training and 20% for testing. A summary of the accuracy and loss results for all applied models on all datasets is listed in Table 5.
Table 5
Accuracy and Loss values for all datasets.
Method | BoT-IoT Accuracy | BoT-IoT Loss | UNSW-NB15 Accuracy | UNSW-NB15 Loss | TON-IoT Accuracy | TON-IoT Loss |
CNN-LSTM | 99.99% | 0.0016 | 99.99% | 0.0001 | 98% | 0.05 |
LSTM-FCN | 100% | 0.0068 | 100% | 0.0054 | 90% | 0.36 |
CNN-LSTM 5-folds CV | 99.99% | 0.0020 | 99.99% | 0.0001 | 95% | 0.25 |
LSTM-FCN 5-folds CV | 100% | 0.0015 | 100% | 0.0002 | 85% | 0.59 |
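The two data-splitting setups described above can be sketched with scikit-learn on a toy imbalanced label vector (80 Benign, 20 Attack — illustrative numbers): StratifiedKFold preserves the class ratio in every fold, and train_test_split produces the 80%/20% hold-out used in the non-CV runs.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)  # toy labels: 80 Benign (0), 20 Attack (1)

# 5-fold stratified CV: every test fold keeps the 80/20 class ratio
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for _, test_idx in skf.split(X, y):
    assert np.bincount(y[test_idx]).tolist() == [16, 4]  # 16 Benign, 4 Attack

# non-CV setting: a single 80% train / 20% test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
```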
Accuracy describes the percentage of test data that is classified correctly. All of these models perform a binary classification: Attack or Benign. An accuracy of 99.99% means that out of 10,000 rows of data, the model correctly classifies 9,999. Table 5 shows that very high accuracy levels (≈ 99.99%) were achieved for the BoT-IoT and UNSW-NB15 datasets. However, this was not the case for the TON-IoT dataset, where accuracy ranged from 85% to 98%. The table also reveals that using 5-folds CV decreased the accuracy of the models on the TON-IoT dataset. The proposed LSTM-FCN models showed slightly better performance than the proposed CNN-LSTM models in detecting attacks on the BoT-IoT and UNSW-NB15 datasets (100% vs. 99.99%), while the CNN-LSTM performed significantly better than the LSTM-FCN in detecting attacks on the TON-IoT dataset (98% vs. 90% and 95% vs. 85%).
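Accuracy on a labeled test set reduces to a simple count (a minimal sketch with made-up labels, not the paper's evaluation code):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 0, 1, 1, 0]  # 1 = Attack, 0 = Benign (illustrative labels)
y_pred = [1, 0, 1, 0, 0]  # one misclassification
acc = accuracy(y_true, y_pred)  # 4 of 5 correct -> 0.8
```

At 99.99% accuracy, this count would come out to 9,999 correct out of every 10,000 rows.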
The models output a probability between 0 and 1 to predict the binary classes Attack (1) or Benign (0). If the predicted probability of Attack is 0.6, the probability of Benign is 0.4, and the outcome is classified as an Attack. The loss is the binary cross-entropy over the test set: the farther the predicted probability of an outcome's true class falls from 1, the larger the penalty. Table 5 shows that very low loss values were achieved for the BoT-IoT and UNSW-NB15 datasets. At the same time, using 5-folds CV reduced the loss values for the LSTM-FCN from 0.0068 to 0.0015 and from 0.0054 to 0.0002 on the BoT-IoT and UNSW-NB15 datasets, respectively. However, this was not the case for the TON-IoT dataset, where loss values increased when 5-folds CV was implemented.
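The 0.6/0.4 example above can be worked through directly; assuming the true class of that sample is Attack (y = 1), its binary cross-entropy contribution is −ln(0.6):

```python
import math

p_attack = 0.6             # sigmoid output for the Attack class
p_benign = 1 - p_attack    # 0.4
label = "Attack" if p_attack >= 0.5 else "Benign"

# assuming the true class is Attack, the per-sample cross-entropy is -ln(p_attack)
loss = -math.log(p_attack)  # ~0.51; shrinks toward 0 as p_attack approaches 1
```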
The Area Under the Receiver Operating Characteristic curve (AUROC) is a performance measure for classification models. The AUROC captures the model's ability to separate the classes, Attack or Benign in this case: it is the probability that the binary classifier ranks a randomly chosen Attack example above a randomly chosen Benign example, averaged across all possible decision thresholds. An AUROC of 1 indicates that the model has an ideal capacity to distinguish between Attack and Benign. An AUROC of 0 indicates that the model is reciprocating the classes, i.e., predicting the Benign class as Attack and vice versa. An AUROC of 0.5 indicates that the model is incapable of distinguishing between Attack and Benign. Table 6 shows a summary of AUROC values for all proposed models on the three datasets.
Table 6
Summary of AUROC values from different models.
Dataset | CNN-LSTM | LSTM-FCN | CNN-LSTM 5-folds CV (folds 1-5) | LSTM-FCN 5-folds CV (folds 1-5) |
BoT-IoT | 1.00 | 1.00 | 0.500, 0.500, 0.500, 0.500, 0.500 | 0.998, 0.976, 0.987, 0.993, 0.998 |
UNSW-NB15 | 1.00 | 1.00 | 1.00, 1.00, 1.00, 1.00, 1.00 | 1.00, 1.00, 1.00, 1.00, 1.00 |
TON-IoT | 0.993 | 0.868 | 0.887, 0.997, 0.999, 0.583, 0.999 | 0.543, 0.892, 0.891, 0.807, 0.660 |
All models showed ideal capacity (AUROC = 1.00) for predicting the Attack and Benign classes on the UNSW-NB15 dataset. On the BoT-IoT dataset, the CNN-LSTM and LSTM-FCN models likewise showed high capacity (AUROC = 1.00). The CNN-LSTM 5-folds CV had an AUROC of 0.5 on every fold, indicating that this model is incapable of distinguishing between Attack and Benign on that dataset. The LSTM-FCN 5-folds CV had AUROC values of at least 0.976 across the folds, which means that this model remains highly capable of predicting the Attack and Benign classes.
The TON-IoT dataset showed a slightly different picture than the first two datasets, with AUROC values of 0.993 and 0.868 for the CNN-LSTM and LSTM-FCN models, respectively, indicating that they come close to fully separating the Attack and Benign classes. The 5-folds CV runs of both models produced AUROC values ranging from 0.543 to 0.999. Figure 3 demonstrates the accuracy, precision, and recall achieved by the CNN-LSTM and LSTM-FCN models on all datasets.
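The ranking interpretation of AUROC used above can be computed directly (a sketch on made-up scores, not the paper's evaluation code): count how often an Attack example is scored above a Benign example.

```python
import numpy as np

def auroc(y_true, scores):
    """Probability a random Attack (1) outranks a random Benign (0); ties count half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([1, 1, 0, 0])          # illustrative labels
s = np.array([0.9, 0.8, 0.3, 0.1])  # illustrative Attack probabilities
# perfect separation -> 1.0; reversed scores (reciprocated classes) -> 0.0;
# constant scores (no separating ability) -> 0.5
```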