The field of Software-Defined Networking (SDN) has witnessed various applications of machine learning (ML) approaches for the detection of Distributed Denial of Service (DDoS) attacks. Numerous studies have explored the effectiveness of different ML algorithms in enhancing security measures within SDN environments. One notable study conducted by [26] utilized supervised learning algorithms to address DDoS attacks based on flow fluctuations. Analyzing Packet_in requests from an emulated SDN network, they compared the performance of eight supervised learning models. However, their approach using a single feature for detection might not generalize well to all forms of attack traffic.
In contrast, [17] employed an Advanced SVM with a 5-tuple feature for detecting flooding-based DDoS attacks, achieving a 97% detection rate for TCP SYN flood and UDP flooding. Yet this approach focused on specific attack types, potentially neglecting other forms harmful to the network. The work of [28] took a more comprehensive approach, considering three attack forms (TCP SYN, UDP, and ICMP) and transforming traffic features into a 6-tuple for detection. Although the approach achieved 95% accuracy with a low false-alarm rate, the limited dataset may impair generalization. [12] aimed to improve accuracy using KNN and SVM models but utilized only two features, again potentially limiting generalization. Similarly, [21] proposed a KPCA-GA-SVM algorithm, demonstrating a high accuracy of 98.09%; however, its reliance on traditional datasets might limit generalization capabilities.
The use of traditional datasets is a common trend: [16] applied Naïve Bayes on NSL-KDD data, and [13] used CICIDS 2017 for DNN-based real-time DDoS threat detection. While these achieve good accuracy, concerns about generalization capabilities persist. [23] explored different ML techniques for classifying bandwidth attacks, controller attacks, and flow-table attacks. [3] proposed a Snort IDS combined with deep learning techniques, showcasing the superiority of SAE over adaptive sampling. [7] employed a CNN-based multi-dimensional IP flow analysis for DDoS detection, demonstrating effectiveness in SDN data simulation. [22] applied four ML algorithms to DDoS flooding attacks, showcasing their vulnerability-testing and detection capabilities. [5] utilized LR, SVM, KNN, RF, and LSTM for DDoS detection, with Random Forest exhibiting the highest accuracy; however, their offline testing may limit real-time applicability. [4] compared SVM, KNN, DT, MLP, and CNN for DDoS detection, with SVM performing best, but generalization concerns remain due to the dataset's nature. [2] introduced a novel dataset and applied a hybrid Random Forest-Support Vector Classifier alongside other ML algorithms for DDoS detection; however, the inclusion of multiple features may impose overhead on the controller in a real SDN network.
In summary, various ML approaches have been explored for DDoS detection in SDN; however, most studies did not use the Matthews correlation coefficient (MCC), logistic loss, or training time as evaluation metrics. The MCC is not affected by dataset imbalance and can help determine whether a classification is near-perfect, the log loss assesses a model's uncertainty, and the training time indicates a model's computational intensity. Hence, these metrics were incorporated in this study to evaluate performance.
Table 1
Existing studies and performance evaluation metrics used
Author | Accuracy | Precision | F1-score/F-measure | Specificity | Recall | MCC | Log loss | Training time |
[4] | ✓ | ✓ | ✓ | | | ✓ | | |
[5] | ✓ | ✓ | ✓ | | ✓ | | | |
[26] | ✓ | ✓ | ✓ | | ✓ | | | |
[22] | ✓ | | | | ✓ | | | |
[2] | ✓ | ✓ | ✓ | ✓ | | | | |
[21] | ✓ | ✓ | | | ✓ | | | |
[23] | ✓ | | | ✓ | | | | ✓ |
[7] | ✓ | ✓ | ✓ | ✓ | | | | |
[3] | ✓ | ✓ | ✓ | | ✓ | | | |
[17] | ✓ | | | | | | | |
[28] | ✓ | | | | | | | |
[16] | | ✓ | ✓ | | ✓ | | | |
This study | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
2.1 Machine learning techniques used
K-Nearest Neighbor (KNN): KNN is a supervised machine learning algorithm that examines the k data points nearest to a query and assigns the most frequently occurring class among those neighbors as the final classification [25].
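As a minimal sketch, KNN classification of flow records might look as follows in scikit-learn; the feature matrix X and labels y here are hypothetical placeholders (0 = benign, 1 = DDoS), not the dataset used in this study.

```python
# Minimal KNN sketch in scikit-learn; X and y are hypothetical flow
# features and labels (0 = benign, 1 = DDoS), not this study's data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X = rng.random((200, 5))             # 200 flows, 5 features each
y = rng.integers(0, 2, size=200)     # binary class labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# The k nearest neighbours of each query vote; the most frequent class wins.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))     # mean accuracy on held-out flows
```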
Random Forest
Random Forest is a machine learning algorithm that uses multiple decision trees to predict class labels [29]. Given N training samples described by M features, each tree is built on a bootstrap sample drawn with replacement, and at each tree node m features (m < M) are randomly selected as candidates for the split [24].
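A minimal sketch of this in scikit-learn is shown below; X and y are hypothetical placeholders, and the parameter names map onto the description above.

```python
# Random Forest sketch; n_estimators is the number of trees, bootstrap=True
# draws each tree's sample with replacement, and max_features plays the role
# of m, the attributes randomly considered at each node.
# X and y are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.random((200, 5))
y = rng.integers(0, 2, size=200)

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=3)
rf.fit(X, y)
print(rf.predict(X[:5]))
```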
Naïve Bayes Classifier
The Naïve Bayes classifier is a classification algorithm based on Bayes' rule, assuming attribute independence. It uses supervised learning to build a prediction model, calculating the probability of an instance belonging to each class. Equation (2.1) gives the probability that an instance belongs to a given class [15].
$$P(y \mid x)=\frac{P(y)\,P(x \mid y)}{P(x)} \tag{2.1}$$
\(P(y \mid x)\) is the probability of instance x belonging to class y. \(P(y)\) is the prior probability of class y, i.e., how frequently that class occurs in the dataset. \(P(x \mid y)\) is the probability of observing x given class y, while \(P(x)\) is the probability of x occurring.
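As an illustration, a Gaussian Naïve Bayes sketch in scikit-learn follows; X and y are hypothetical placeholders rather than this study's data.

```python
# Gaussian Naive Bayes sketch applying Bayes' rule (Eq. 2.1) under the
# attribute-independence assumption; X and y are hypothetical placeholders.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(2)
X = rng.random((200, 5))
y = rng.integers(0, 2, size=200)

nb = GaussianNB().fit(X, y)
# predict_proba returns P(y|x) for every class, normalised by P(x).
print(nb.predict_proba(X[:3]))
```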
Logistic regression
This is a machine learning classification technique that assigns an input to one of two possible classes using the sigmoid function [6]. It can be expressed in the form
$$P(X)=\frac{e^{b_0+b_1 x}}{1+e^{b_0+b_1 x}} \tag{2.2}$$
Here P(X) is the predicted probability of the output, \(b_0\) is the bias, and \(b_1\) is the coefficient associated with the input value x. The classification label is set to 0 when the predicted probability falls below a given threshold; otherwise class label 1 is assigned.
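A minimal sketch of this thresholding rule with scikit-learn is shown below; X, y, and the 0.5 threshold are hypothetical placeholders.

```python
# Logistic regression sketch; predict_proba yields P(X) from Eq. (2.2),
# and the 0.5 threshold mirrors the labelling rule described above.
# X and y are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.random((200, 5))
y = rng.integers(0, 2, size=200)

lr = LogisticRegression().fit(X, y)
p = lr.predict_proba(X[:5])[:, 1]    # probability of the positive class
labels = (p >= 0.5).astype(int)      # class 1 when probability >= threshold
print(p, labels)
```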
Support Vector Machine
Support Vector Machine (SVM) is a classification algorithm. The idea of SVM is to find a hyperplane that best separates two classes [18].
Given a training data set \(D=\{(X_1,y_1),(X_2,y_2),\dots,(X_n,y_n)\}\), where \(X_i\) is the feature vector of a training sample and \(y_i\in\{+1,-1\}\) is the associated class label, the linear hyperplane for the training data is defined as \(w\cdot x+b=0\), where \(w\) is the weight vector and \(b\) is the bias term. A point above the separating hyperplane satisfies \(w\cdot x+b>0\), while a point below it satisfies \(w\cdot x+b<0\). The two margins are adjusted to control the separability of the data:
$$w\cdot x+b\ \begin{cases}\ge 1 & \text{for } y_i=+1\\ \le -1 & \text{for } y_i=-1\end{cases}$$
$$H_1:\ w\cdot x+b\ge 1\ \text{for } y_i=+1 \tag{2.3}$$
$$H_2:\ w\cdot x+b\le -1\ \text{for } y_i=-1 \tag{2.4}$$
This means that a vector falling on or above \(H_1\) belongs to class +1, while a vector falling on or below \(H_2\) belongs to class \(-1\); equivalently, \(y_i(w\cdot x+b)\ge 1,\ \forall i\).
The margin to be maximized is \(\frac{2}{\Vert w\Vert}\); with slack variables, the soft-margin formulation minimizes
$$\frac{1}{2}\Vert w\Vert^2+C\sum_{i=1}^{N}\xi_i \tag{2.5}$$
such that \(y_i(w\cdot x_i+b)\ge 1-\xi_i,\ \xi_i\ge 0,\ i=1,\dots,N\), where \(C>0\) is the penalty parameter indicating the degree of attention paid to outliers and the relaxation variable \(\xi_i\) measures the degree of margin violation. Finding the maximum value of \(\frac{2}{\Vert w\Vert}\) is equivalent to calculating the minimum value of \(\Vert w\Vert\).
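A minimal linear-SVM sketch in scikit-learn follows; X and y are hypothetical placeholders, and the C argument corresponds to the penalty parameter in Eq. (2.5).

```python
# Linear SVM sketch; the C argument is the penalty parameter of Eq. (2.5):
# a larger C punishes margin violations (slack) more heavily.
# X and y are hypothetical placeholders.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = rng.integers(0, 2, size=200)

svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
svm.fit(X, y)
print(svm.predict(X[:5]))
```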
2.3 Performance evaluation metrics
The metrics used to evaluate the performance of the various models are accuracy, precision, F1-score, recall, specificity, Matthews correlation coefficient, logistic loss, and training time.
Accuracy is the proportion of traffic, both benign and malicious, that the system classifies correctly [21].
$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}\ast 100\%$$
Precision
Precision is defined as the percentage of the model's positive-class predictions that are actually positive [17].
$$Precision=\frac{TP}{TP+FP}\ast 100$$
Recall is the measure of the relevant instances that are successfully retrieved [2]. \(Recall=\frac{TP}{TP+FN}\ast 100\)
F1 score combines recall and precision into a single measure [2]: \(F1Score=\frac{2\ast Precision\ast Recall}{Precision+Recall}\ast 100\)
Specificity
Specificity is defined as the measure of the prediction of the negative class in the dataset [2]. \(Specificity=\frac{TN}{TN+FP}\ast 100\)
The Matthews correlation coefficient (MCC) uses the contingency matrix to compute the Pearson product-moment correlation coefficient between observed and predicted values. This approach remains robust even when dealing with imbalanced datasets [4].
MCC = \(\frac{\left(TP\ast TN\right)-(FP\ast FN)}{\sqrt{\left(TP+FN\right) \times \left(TN+FP\right)\times \left(TN+FN\right) \times (TP+FP)}}\)
Logistic loss
Logistic loss is a metric used to assess the effectiveness of a classification model. It measures the model's uncertainty by comparing predicted probabilities against the true binary outcomes; the lower the log loss, the better the model's predictions match the actual results.
$$\text{Logistic Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log\left(p_i\right)+\left(1-y_i\right)\log\left(1-p_i\right)\right]$$
N is the number of instances or samples in the dataset, \(y_i\) is the actual binary class label for the i-th instance (0 or 1), and \(p_i\) is the predicted probability of the positive class for the i-th instance.
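As a minimal sketch, all of the metrics above, including training time measured around the fit call, can be computed with scikit-learn as follows; the data, model choice, and 0.5 threshold are hypothetical placeholders, not this study's experiment.

```python
# Sketch of the evaluation metrics above on a hypothetical model; the data,
# model, and 0.5 threshold are placeholders, not this study's experiment.
import time
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             log_loss, matthews_corrcoef, precision_score,
                             recall_score)

rng = np.random.default_rng(4)
X = rng.random((500, 5))
y = rng.integers(0, 2, size=500)

t0 = time.perf_counter()
model = LogisticRegression().fit(X, y)   # training time is measured around fit()
train_time = time.perf_counter() - t0

y_prob = model.predict_proba(X)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
print("Accuracy:         ", accuracy_score(y, y_pred))
print("Precision:        ", precision_score(y, y_pred))
print("Recall:           ", recall_score(y, y_pred))
print("F1-score:         ", f1_score(y, y_pred))
print("Specificity:      ", tn / (tn + fp))   # TN / (TN + FP)
print("MCC:              ", matthews_corrcoef(y, y_pred))
print("Log loss:         ", log_loss(y, y_prob))
print("Training time (s):", train_time)
```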