This section describes the basic components of the SDN architecture and how it addresses the problems of traditional networks. In addition, the theoretical background of the Replicator Neural Network (RNN) and EncDecAD anomaly detection methods is presented.
3.1. Software-Defined Networking
As stated by the Open Networking Foundation (ONF) [39], SDN is an emerging architecture that decouples network forwarding from network control. This separation makes network control directly programmable and allows the underlying infrastructure to be abstracted for network services and applications. Infrastructure devices operate as simple forwarding engines in this architecture, processing incoming packets according to a set of rules generated on the fly by a controller in the control layer following pre-defined program logic. The controller typically runs on a remote machine and communicates with the forwarding elements over a secure link using a set of standardized commands. ONF proposes a high-level architecture for SDN [40] that is divided vertically into three essential layers: i) The Infrastructure Layer consists of forwarding elements, including physical and virtual switches, that are accessible through an open interface. ii) The Control Layer is composed of a collection of software-based SDN controllers that provide unified control functionality via open APIs for handling forwarding behavior. Controllers can communicate with one another and with the other layers via three communication interfaces: Southbound, Northbound, and East/Westbound. iii) The Application Layer mainly comprises end-user applications that make use of SDN communication and network services. The main parts of this architecture, namely the control layer, application layer, infrastructure layer, and the communication interfaces between these three layers, are shown in detail in Figure 1.
Through three open interfaces, the SDN controller communicates with these layers: a) The southbound interface enables communication between the controller and the forwarding elements of the infrastructure layer. The OF protocol, which is managed by ONF, is an essential element for building SDN solutions according to ONF and can be regarded as a promising implementation of such an interface. b) The northbound interface makes the controllers programmable by exposing a universal network abstraction and other functionality to programs at the application layer. Rather than a protocol, it is viewed as a software API that enables programming and management of the network. Although there is no standardization effort for this interface, many vendors offer REST-based APIs that give applications a programming interface to their controllers. c) The East/Westbound interface, a communication interface between controllers, is not yet backed by a recognized standard. It is primarily intended to enable inter-controller communication to synchronize state for high availability.
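As an illustration of such a northbound API, the following minimal sketch queries flow statistics and installs a flow entry through the REST interface exposed by the Ryu controller's ofctl_rest application; the controller address, datapath ID, and the specific match fields are assumptions chosen for the example, not part of any standard.

```python
import requests

CONTROLLER = "http://127.0.0.1:8080"  # assumed Ryu controller with ofctl_rest enabled
DPID = 1                              # assumed datapath ID of the target switch

# Retrieve the flow table of switch DPID via the northbound REST API.
flows = requests.get(f"{CONTROLLER}/stats/flow/{DPID}").json()
print(flows)

# Proactively install a flow entry: forward TCP port 80 traffic out of port 2.
flow_entry = {
    "dpid": DPID,
    "priority": 100,
    "match": {"eth_type": 0x0800, "ip_proto": 6, "tcp_dst": 80},
    "actions": [{"type": "OUTPUT", "port": 2}],
}
requests.post(f"{CONTROLLER}/stats/flowentry/add", json=flow_entry)
```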
To be useful in the SDN architecture, forwarding elements (usually switches) are required to support a southbound API. OF switches come in two types: software-based (e.g., Open vSwitch) and hardware-based implementations. Software switches are generally well designed and feature-complete; however, even the latest implementations suffer from limited performance. Hardware-based OF switches are usually implemented as ASICs. Although they offer line-rate forwarding for a large number of ports, they lack the flexibility and feature completeness of software implementations.
The OF-enabled switch can be divided into three main elements [42]: the data path, the control path, and the OF protocol. a) The data path contains one or more flow tables and group tables that look up and forward packets. A flow table consists of flow entries associated with actions that tell the switch how to handle the flow. Flow tables are typically populated by the controller, enabling the controller to specify alternative ways of forwarding flows. b) The control path is the channel that connects the switch to the controller. The OF protocol is used to exchange commands and packets across this channel. c) The OF protocol is responsible for the interconnection of switches and controllers. It specifies the messages exchanged, the packets sent and received, the collection of statistics, and the actions to be executed on certain flows.
A flow table entry in an OF-enabled switch consists of several fields and can be organized as follows (a minimal data structure sketch is given after the list):
- Matching fields are used to identify network packets based on the 15-tuple packet header, the ingress port, and optional packet metadata. Figure 2 shows the packet header fields arranged according to OSI layers 1-4.
- The priority of a flow entry determines its precedence in the matching order.
- The action set specifies the actions to be executed on packets whose headers match the entry.
- Counters are used to keep track of traffic statistics (the total number of bytes and packets in each flow, as well as the time at which the last packet matched the flow).
- Timeouts define the maximum amount of time or idle time before the switch expires the flow.
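The sketch below models these fields as a minimal Python data structure; the field names and the choice of a dictionary for the match fields are illustrative assumptions, not part of the OF specification.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class FlowEntry:
    """A simplified flow table entry with the fields described above."""
    match: Dict[str, object]              # matching fields; missing keys act as wildcards
    priority: int                         # precedence in the matching order
    actions: List[str]                    # action set applied to matching packets
    packet_count: int = 0                 # counter: packets matched by this entry
    byte_count: int = 0                   # counter: bytes matched by this entry
    last_matched: Optional[float] = None  # counter: time of the last matching packet
    idle_timeout: int = 0                 # seconds of inactivity before expiry (0 = none)
    hard_timeout: int = 0                 # maximum lifetime in seconds (0 = none)
```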
OF messages can be grouped into three main categories [41]: controller-to-switch, asynchronous, and symmetric. Controller-to-switch messages are initiated by the controller and used to manage or inspect the state of the switches. Asynchronous messages are initiated by a switch to notify the controller of network events and changes to the switch's state. Finally, symmetric messages can be generated by either the switch or the controller without solicitation. As soon as an ingress packet arrives at the OF switch, pipeline processing scans the flow tables. The matching flow table entry is determined by the match fields and priority: a packet matches an entry if the values in the packet header correspond to the values in the entry's fields. A wildcard (omitted) value in a flow table entry field matches every possible value in the header. Only the highest-priority matching flow entry is selected. If several flow entries match with the same priority, the selected flow entry is explicitly undefined. To address this situation, the OF specification offers an optional mechanism that allows a switch to check whether a new flow entry overlaps an existing one. In this way, a packet can be matched to a flow with wildcard fields (macroflow), matched exactly to a flow (microflow), or not matched to any flow. If a match is found, the actions defined in the matching flow table entry are performed. If there is no match, the switch passes the packet (or only its header) to the controller for a decision. After checking the related policy in the management plane, the controller responds by instructing the switch to add new entries to its flow table. The switch uses the new entry to handle both the queued packet and subsequent packets of the same flow.
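The following sketch illustrates this lookup logic over the FlowEntry structure introduced above; treating absent match keys as wildcards and the send_to_controller callback are assumptions made for the example.

```python
import time
from typing import Dict, List, Optional

def lookup(flow_table: List["FlowEntry"], packet: Dict[str, object]) -> Optional["FlowEntry"]:
    """Return the highest-priority entry whose match fields all agree with the packet."""
    candidates = [
        entry for entry in flow_table
        if all(packet.get(k) == v for k, v in entry.match.items())  # absent keys act as wildcards
    ]
    return max(candidates, key=lambda e: e.priority) if candidates else None

def process_packet(flow_table, packet, send_to_controller):
    """Apply the matching entry's actions, or hand the packet to the controller on a table miss."""
    entry = lookup(flow_table, packet)
    if entry is None:
        send_to_controller(packet)  # table miss: the controller decides and installs a new entry
        return []
    entry.packet_count += 1
    entry.byte_count += int(packet.get("size", 0))
    entry.last_matched = time.time()
    return entry.actions
```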
The controller stands at the heart of SDN networks, connecting applications and network devices. The SDN controller is responsible for managing all network flows by installing flow entries in switch devices. There are two distinct forms of flow configuration: proactive and reactive. Proactive configuration preloads flow rules into the flow tables, so the flow setup procedure is completed before the first packet of a flow reaches the OF switch. The primary advantage of proactive setup is that it reduces the frequency with which the controller is contacted, resulting in negligible installation delay; however, it has the potential to overload switch flow tables. In the reactive setup mode, the controller adds a flow rule to the flow table only when no matching entry exists, which occurs when the first packet of a flow arrives at the OF switch. As a result, communication between the controller and the switch is initiated by a single packet. After a specified period of inactivity, these flow entries expire and are erased from the table. To respond to the flow setup request, the controller first evaluates the flow against the applications' policies and then determines the necessary actions to execute. Following that, it computes a route for the flow and installs new flow entries, issuing requests to each switch along that path.
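As a concrete illustration of reactive setup, the following minimal Ryu application reacts to a table miss (a packet-in event) by installing a flow entry with an idle timeout; flooding as the forwarding decision and the timeout value are placeholder assumptions rather than a realistic policy.

```python
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import MAIN_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3

class ReactiveApp(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPPacketIn, MAIN_DISPATCHER)
    def _packet_in_handler(self, ev):
        msg = ev.msg
        dp = msg.datapath
        ofp, parser = dp.ofproto, dp.ofproto_parser
        in_port = msg.match['in_port']

        # Placeholder policy decision: flood packets of this flow.
        actions = [parser.OFPActionOutput(ofp.OFPP_FLOOD)]
        match = parser.OFPMatch(in_port=in_port)
        inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]

        # Install a reactive flow entry that expires after 10 s of inactivity.
        dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=1, match=match,
                                      instructions=inst, idle_timeout=10))
```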
Transferring statistics between switches and controllers provides an overview of switch traffic. There are two ways for a switch to provide statistics to the controller: pull-based and push-based flow monitoring. In the pull-based approach, the controller collects the counters of the flows that fit a specified flow specification. This technique can optionally generate a report that aggregates all flows matching a wildcard specification; while this minimizes switch-to-controller traffic, it prevents the controller from learning about the behavior of the individual flows. The pull-based strategy also requires careful tuning of the interval between controller requests, as this interval can impair the scalability and reliability of functions based on statistics collection. In the push-based approach, each switch delivers statistics to the controller to alert it of certain occurrences, such as the creation of a new flow, a timeout, or the deletion of a table entry due to inactivity. This procedure does not inform the controller about the flow's behavior before the entry times out (which makes it unsuitable for scheduling).
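A minimal pull-based monitoring loop is sketched below, using the same assumed Ryu REST endpoint as before; the polling interval illustrates the trade-off mentioned above and is an arbitrary choice, and the exact shape of the returned JSON is assumed from ofctl_rest's conventions.

```python
import time
import requests

CONTROLLER = "http://127.0.0.1:8080"  # assumed Ryu controller with ofctl_rest enabled
DPID = 1
POLL_INTERVAL = 5.0  # seconds; shorter intervals give fresher statistics but more overhead

while True:
    # Pull the per-flow counters of switch DPID.
    stats = requests.get(f"{CONTROLLER}/stats/flow/{DPID}").json()
    for flow in stats.get(str(DPID), []):
        print(flow["match"], flow["packet_count"], flow["byte_count"])
    time.sleep(POLL_INTERVAL)
```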
3.2. Anomaly Detection
3.2.1. Replicator Neural Network based anomaly detection
RNNs are neural networks and a special case of autoencoders [43], originally proposed as a compression technique [44]. Their use as an anomaly detection technique was first suggested by Hawkins et al. [45]. In typical multilayer neural networks, input vectors are mapped to separate target output vectors. The RNN, in contrast, uses the input vectors themselves as targets; in other words, the RNN reproduces the input values at the output. The RNN's weights are chosen so as to minimize the mean squared reconstruction error. As a result, while common patterns are likely to be replicated well by the trained RNN, patterns characterizing outliers are represented less accurately and yield a greater error. The outlierness of a data sample is therefore quantified by its reconstruction error.
Cordero et al. propose an approach that uses an RNN to identify anomalies in network flows [46]. In this method, an RNN [45] is used primarily to create a model that represents normal network flows. While it has been demonstrated that the original RNN may be reduced to three layers [47], using the original five layers together with the dropout regularization technique [48] produces superior results and avoids overfitting.
Each layer in an RNN is fully connected to the next. The activation function of layers 2 and 4 is the nonlinear hyperbolic tangent. The output layer's activation function is linear, i.e., the identity. The sole difference between the original RNN and the one used in [46] lies in the activation function of the middle layer (layer 3). The original RNN uses a stepwise activation function that, in theory, reduces the dimensionality of the input data by clustering data samples [45]. While the stepwise activation function possesses intriguing theoretical properties, backpropagation approaches based on gradient descent do not work adequately with it [46]: because the gradient of a stepwise function is nearly zero almost everywhere, the learning process stalls. Instead, [46] employs the sigmoid activation function, which has been shown to be effective as an intermediate activation function for RNNs [49].
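A minimal PyTorch sketch of this five-layer architecture is given below; the hidden-layer widths and dropout rate are illustrative assumptions, since such hyperparameters are tuned per dataset.

```python
import torch.nn as nn

class ReplicatorNN(nn.Module):
    """Five-layer replicator network: tanh on layers 2 and 4, sigmoid middle layer,
    linear (identity) output, with dropout for regularization."""
    def __init__(self, d_in: int, h_outer: int = 32, h_mid: int = 8, p_drop: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, h_outer), nn.Tanh(), nn.Dropout(p_drop),     # layer 2
            nn.Linear(h_outer, h_mid), nn.Sigmoid(), nn.Dropout(p_drop), # layer 3 (sigmoid replaces stepwise)
            nn.Linear(h_mid, h_outer), nn.Tanh(), nn.Dropout(p_drop),    # layer 4
            nn.Linear(h_outer, d_in),                                    # output layer: linear / identity
        )

    def forward(self, x):
        return self.net(x)
```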
The features extracted from network flows are used to build the RNN models. Depending on the tools used, many different features can be extracted from the network for training; the number of input neurons equals the number of selected features. At each training step, the RNN is fed the extracted flow features as input. A validation set is used to ascertain how well the learning process generalizes. After training, the RNN serves as a model of normal behavior for the purpose of calculating anomaly scores (AS). Flows whose AS exceeds a predetermined threshold are considered anomalous.
Let \(\mathcal{X}=\left\{{\mathbf{x}}_{1},{\mathbf{x}}_{2},\dots ,{\mathbf{x}}_{N}\right\}\) represent the set of all \(N\) training samples, where \(\mathbf{x}=\left\{{x}^{1},{x}^{2},\dots ,{x}^{D}\right\}\) is a flow with \(D\) features, and let the function \(f\left(\mathbf{x}\right)=\widehat{\mathbf{x}}+\overrightarrow{ϵ}\) correspond to the output of an RNN. The reconstruction of \(\mathbf{x}\) is represented by the vector \(\widehat{\mathbf{x}}\), and the vector \(\overrightarrow{ϵ}=\left\{{ϵ}^{1},{ϵ}^{2},\dots ,{ϵ}^{D}\right\}\) contains the error elements of the reconstruction. The weights of the neural network are updated using backpropagation with a gradient descent technique such as Stochastic Gradient Descent (SGD). The loss function minimized in the learning process is formulated in Eq. 1.
$$\mathbf{L}=\frac{1}{N}\sum _{i=1}^{N} {\left({\mathbf{x}}_{i}-\left({\widehat{\mathbf{x}}}_{i}+{\overrightarrow{ϵ}}_{i}\right)\right)}^{2}$$
1
Since the purpose of backpropagation with gradient descent is to minimize \(\mathbf{L}\), the network seeks a combination of weights such that \(\widehat{\mathbf{x}}\approx \mathbf{x}\) and \(\overrightarrow{ϵ}\approx 0\). To avoid learning the trivial identity solution \(f\left(\mathbf{x}\right)=\mathbf{x}\), noise is added throughout the network by randomly dropping units in each learning iteration with the dropout method [48]. The residual \(\overrightarrow{ϵ}=\mathbf{x}-\widehat{\mathbf{x}}\) is used to calculate the AS, which quantifies how anomalous a set of features is. The AS of the set of flow features \(\mathbf{x}\) is defined in Eq. 2.
$$AS\left(\mathbf{x}\right)=\frac{1}{D}\sum _{i=1}^{D} {\left({x}^{i}-{\widehat{x}}^{i}\right)}^{2}=\frac{1}{D}\sum _{i=1}^{D} {\left({ϵ}^{i}\right)}^{2}$$
2
To determine whether a network flow \(\mathbf{x}\) is an anomaly, a threshold is selected that decides when the AS is too high for the flow to be counted as normal. The threshold is set to the highest reconstruction error \(E\) observed during training after the elimination of outliers.
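The training objective of Eq. 1, the anomaly score of Eq. 2, and the threshold selection can be sketched as follows, reusing the ReplicatorNN model above; the optimizer settings and the outlier-trimming quantile are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train(model, X_train, epochs: int = 100, lr: float = 0.01):
    """Minimize the mean squared reconstruction error (Eq. 1) with SGD."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X_train), X_train)  # the target is the input itself
        loss.backward()
        opt.step()

def anomaly_score(model, X):
    """Per-flow mean squared reconstruction error (Eq. 2)."""
    model.eval()  # disable dropout for scoring
    with torch.no_grad():
        return ((X - model(X)) ** 2).mean(dim=1)

def select_threshold(scores: torch.Tensor, trim: float = 0.99):
    """Highest training error after trimming outliers, here approximated
    by discarding the top 1% of scores (an assumed choice)."""
    return torch.quantile(scores, trim)
```

A flow is then flagged as anomalous when its score from anomaly_score exceeds the selected threshold.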
3.2.2. LSTM based encoder-decoder for anomaly detection
LSTM networks [50] are recurrent models used for a variety of learning tasks such as handwriting recognition, speech recognition, and sentiment analysis. An LSTM-based encoder is used to map an input sequence to a vector representation of fixed dimensionality. The decoder is another LSTM network that reconstructs the target sequence from this vector representation.
Malhotra et al. [51] suggest an LSTM-based Encoder-Decoder (EncDecAD) scheme for time series anomaly detection. In this architecture, the encoder generates a vector representation of the input time series, which the decoder uses to reproduce it. The EncDecAD is trained to reconstruct samples of "normal" time series, using the input time series itself as the target output. The reconstruction error is then used to determine the likelihood of an anomaly at a given point. It is demonstrated that an encoder-decoder model trained solely on normal sequences can detect anomalies in multivariate time series. The intuition is that the encoder-decoder has only seen normal examples during the training phase; given an anomalous sequence, the trained model fails to reproduce it well, resulting in higher reconstruction errors compared with those of normal sequences.
The EncDecAD approach is expressed mathematically as follows. A time series \(X=\left\{{\mathbf{x}}^{\left(1\right)},{\mathbf{x}}^{\left(2\right)},\dots ,{\mathbf{x}}^{\left(L\right)}\right\}\) of length \(L\) is given, where each point \({\mathbf{x}}^{\left(i\right)}\in {\mathbb{R}}^{m}\) is an \(m\)-dimensional vector of readings for \(m\) variables at time \({t}_{i}\). The case is considered where many such time series are available or can be acquired by sliding a window of length \(L\) over a longer time series. The EncDecAD model is trained to reconstruct the normal time series. The reconstruction errors are then used to determine the likelihood that a point in a test time series is anomalous, producing an anomaly score \({a}^{\left(i\right)}\) for each point \({\mathbf{x}}^{\left(i\right)}\). A higher anomaly score indicates that the point is more likely to be anomalous.
An LSTM encoder-decoder is trained to reconstruct instances of normal time series. The LSTM encoder summarizes the input time series in a fixed-length vector representation. The LSTM decoder reconstructs the time series from this representation, using its current hidden state and the value predicted at the previous time step.
The encoder's hidden state \({\mathbf{h}}_{E}^{\left(i\right)}\) at time \({t}_{i}\) is computed from the value \({\mathbf{x}}^{\left(i\right)}\) at time \({t}_{i}\) and the encoder's hidden state \({\mathbf{h}}_{E}^{(i-1)}\) at time \({t}_{i-1}\). The decoder reconstructs the sequence in reverse order. In the training phase, the target \({\mathbf{x}}^{\left(i\right)}\) is used as the decoder input to obtain the state \({\mathbf{h}}_{D}^{(i-1)}\), from which \({\mathbf{x}}^{{\prime }(i-1)}\) corresponding to the target \({\mathbf{x}}^{(i-1)}\) is predicted. In the inference phase, the decoder instead uses its own prediction \({\mathbf{x}}^{{\prime }\left(i\right)}\) as input, together with the hidden state \({\mathbf{h}}_{D}^{\left(i\right)}\), to obtain the next hidden state \({\mathbf{h}}_{D}^{(i-1)}\) and predict \({\mathbf{x}}^{{\prime }(i-1)}\). Given a set of normal training sequences \({s}_{N}\), the objective of model training is to minimize \(\sum _{X\in {s}_{N}} \sum _{i=1}^{L} {∥{\mathbf{x}}^{\left(i\right)}-{\mathbf{x}}^{{\prime }\left(i\right)}∥}^{2}\).
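A compact PyTorch sketch of this encoder-decoder is given below, assuming single-layer LSTMs and teacher forcing during training; both choices, as well as the hidden size, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncDecAD(nn.Module):
    """LSTM encoder-decoder that reconstructs a window of length L in reverse order."""
    def __init__(self, m: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.LSTM(m, hidden, batch_first=True)
        self.decoder_cell = nn.LSTMCell(m, hidden)
        self.out = nn.Linear(hidden, m)

    def forward(self, x: torch.Tensor, teacher_forcing: bool = True) -> torch.Tensor:
        # x: (batch, L, m); the final encoder state initializes the decoder.
        _, (h, c) = self.encoder(x)
        h, c = h.squeeze(0), c.squeeze(0)
        L = x.size(1)
        pred = self.out(h)          # x'(L) from h_D(L) = h_E(L)
        recon = [pred]
        for pos in range(L - 1, 0, -1):
            # Training feeds the true value x(i) (teacher forcing);
            # inference feeds back the previous prediction x'(i).
            inp = x[:, pos, :] if teacher_forcing else pred
            h, c = self.decoder_cell(inp, (h, c))  # h_D(i-1)
            pred = self.out(h)                     # x'(i-1)
            recon.append(pred)
        return torch.stack(recon[::-1], dim=1)     # reorder to t_1 ... t_L
```

Training minimizes the squared reconstruction error between the output of forward and the input window over the sequences in \({s}_{N}\), e.g., with nn.MSELoss.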
Normal time series are subdivided into four sets \({s}_{N},{v}_{N1},{v}_{N2}\) and \({t}_{N}\), and the anomalous time series are divided into two sets \({v}_{A}\) and \({t}_{A}\). The LSTM encoder-decoder reconstruction model is trained on the set of sequences \({s}_{N}\), with the set \({v}_{N1}\) used for early stopping. The reconstruction error vector at time \({t}_{i}\) is calculated as \({\mathbf{e}}^{\left(i\right)}=\mid {\mathbf{x}}^{\left(i\right)}-{\mathbf{x}}^{{\prime }\left(i\right)}\mid\). The parameters \(\mu\) and \({\Sigma }\) of a normal distribution \(\mathcal{N}(\mu ,{\Sigma })\) are estimated by maximum likelihood from the error vectors of the points in the sequences of \({v}_{N1}\). The anomaly score of any point \({\mathbf{x}}^{\left(i\right)}\) is then calculated as \({a}^{\left(i\right)}={\left({\mathbf{e}}^{\left(i\right)}-\mu \right)}^{T}{{\Sigma }}^{-1}\left({\mathbf{e}}^{\left(i\right)}-\mu \right)\). In a supervised setting, a point in a sequence is labeled “anomalous” if \({a}^{\left(i\right)}\) is greater than a threshold \(\tau\), and “normal” otherwise. If sufficient anomalous sequences are available, a threshold \(\tau\) over the anomaly scores is learned to maximize \({F}_{\beta }=\left(1+{\beta }^{2}\right)\times P\times R/\left({\beta }^{2}P+R\right)\), where \(R\) is recall and \(P\) is precision. Here, “normal” refers to the negative class, while “anomalous” refers to the positive class. If a window contains an anomalous pattern, the entire window is marked as “anomalous”; this is particularly advantageous in practical applications where the precise location of the anomaly within the window is unknown. The parameters \(\tau\) and \(c\) are determined by maximizing \({F}_{\beta }\) on the validation sequences in \({v}_{N2}\) and \({v}_{A}\).
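The error modeling and thresholding steps can be sketched as follows; NumPy is used, and the exhaustive search over candidate thresholds is an assumed strategy for illustration.

```python
import numpy as np

def fit_error_model(errors: np.ndarray):
    """MLE of N(mu, Sigma) from error vectors e(i), shape (n_points, m)."""
    mu = errors.mean(axis=0)
    sigma = np.cov(errors, rowvar=False)
    return mu, np.linalg.inv(sigma)

def anomaly_scores(errors: np.ndarray, mu, sigma_inv):
    """Score a(i) = (e(i) - mu)^T Sigma^-1 (e(i) - mu) for each point."""
    d = errors - mu
    return np.einsum('ij,jk,ik->i', d, sigma_inv, d)

def select_tau(scores: np.ndarray, labels: np.ndarray, beta: float = 1.0):
    """Pick the threshold maximizing F_beta on validation scores (label 1 = anomalous)."""
    best_tau, best_f = None, -1.0
    for tau in np.unique(scores):
        pred = scores > tau
        tp = np.sum(pred & (labels == 1))
        p = tp / max(pred.sum(), 1)
        r = tp / max((labels == 1).sum(), 1)
        f = (1 + beta**2) * p * r / max(beta**2 * p + r, 1e-12)
        if f > best_f:
            best_tau, best_f = tau, f
    return best_tau
```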
3.3. Evaluation Metrics
The proposed approach should be validated using relevant metrics. Binary classification results fall into four categories [52]: 1) True Positive (TP): positive instances that have been classified correctly; 2) False Negative (FN): positive instances that have been classified incorrectly; 3) False Positive (FP): negative instances that have been classified incorrectly; 4) True Negative (TN): negative instances that have been classified correctly. Additional metrics can be derived from these [53]:
Accuracy: This metric is expressed as the proportion of correct predictions to total instances:
$$\begin{array}{c}\text{Accuracy}=\left(\text{TP}+\text{TN}\right)/\left(\text{TP}+\text{TN}+\text{FP}+\text{FN}\right)\end{array}$$
3
True Positive Rate (TPR): This metric is equivalent to the proportion of correctly identified positive instances to all instances that should be identified as positive, i.e., \(\text{TPR}=\text{TP}/(\text{TP}+\text{FN})\).
False Positive Rate (FPR): This metric denotes the proportion of the number of misclassified negative instances to the total number of negative instances, i.e., \(\text{FPR}=\text{FP}/(\text{FP}+\text{TN})\).
Receiver Operating Characteristics (ROC): When the dataset suffers from class imbalance, the ROC curve [54][55] is commonly used as a standard criterion for evaluating classifiers [56]. In such cases, the area under the ROC curve (AUC) is frequently used as a de facto measure of classifier effectiveness. After sorting instances by their classification probabilities, the AUC measures how frequently a random instance of the positive class ranks higher than a random instance of the negative class.
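For completeness, these metrics can be computed, for example, with scikit-learn; the label and score arrays below are placeholders.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0])               # placeholder ground-truth labels
y_pred = np.array([0, 1, 1, 1, 0, 0])               # placeholder predicted labels
y_score = np.array([0.1, 0.6, 0.8, 0.9, 0.4, 0.2])  # placeholder anomaly scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
auc = roc_auc_score(y_true, y_score)                # area under the ROC curve
print(accuracy, tpr, fpr, auc)
```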