Multivariate outlier filtering for A-NFVLearn: an advanced deep VNF resource usage forecasting technique

Virtual Network Function Resource Adaptation (VNF-RA) aims at adequately adapting Network Function Virtualization Infrastructure (NFVI) resources according to the geographical fluctuation of user demand by maximizing the quality of service (QoS) of the offered services while minimizing the energy consumption of the NFVI and limiting the risks of Service Level Agreement (SLA) breaches as well as the CAPEX and OPEX of cloud operators and their customers. Virtual Network Function (VNF) resource usage forecasting plays, therefore, a key role in enabling proactive resource adaptation in dynamic Network Function Virtualization (NFV) environments whose resource demand constantly changes. In parallel, Long Short-Term Memory (LSTM)-based prediction has garnered huge interest in the research community, and several research teams have proposed different VNF resource usage prediction algorithms based on this machine learning technique. However, current LSTM-based VNF resource usage forecasting techniques lack the flexibility to take several resource attributes of different scales from a Service Function Chain (SFC), over many time steps, in order to predict several other resource attributes over many time steps. In this paper, we push the state of the art forward by presenting A-NFVLearn, a flexible multivariate, LSTM-based model with an attention mechanism which uses different attributes of resource load history (CPU, memory, I/O bandwidth) from various VNFs of an SFC to forecast the future load of multiple resources of a VNF. We then propose a multivariate outlier filtering scheme at pre-processing based on Adjusted Outlyingness (AO), which improves the training time performance of LSTM-based models without impacting prediction accuracy.


Introduction
Network Function Virtualization (NFV) technology is a milestone for next-generation cloud infrastructures. NFV aims at dematerializing legacy network functions (NFs) embedded in physical hardware and converting them into heterogeneous, scalable VNFs. In turn, those VNFs become swiftly deployable on-demand anywhere in a cloud infrastructure or a Network Function Virtualization Infrastructure (NFVI). This allows NFs initially anchored in a physical SFC to be deployed and scaled at various areas in a telecom service provider (TSP)'s or a cloud service provider (CSP)'s network (i.e. the core, cloud or edge networks) to efficiently handle the fluctuating resource demand and Service Level Objectives (SLOs) of its customers (e.g. delay and availability of business-critical applications). Those features not only help TSPs and CSPs meet their financial objectives and SLAs while controlling CAPEX and OPEX, but also bring latency and QoS benefits to the customer [1].
However, to bring NFV technology to maturity, new automated mechanisms for VNF deployment, management and orchestration must be designed. To this end, many research teams have been looking at Deep Learning techniques like LSTM to improve upon traditional resource forecasting mechanisms such as moving average (MA), autoregression (AR), weighted average (WA) and auto-regressive integrated moving average (ARIMA) for resource usage forecasting in an SFC. In fact, accurate VNF resource usage prediction is an important first step in a Virtual Network Function Resource Adaptation (VNF-RA) pipeline that facilitates decision-making to automatically adapt available physical resources to VNF instances by triggering VNF horizontal scaling, vertical scaling, or migration requests by NFV management and orchestration (NFV-MANO).
In any case, enhancing LSTM-based resource usage forecasting mechanisms is no easy task. Researchers not only need to acquire massive amounts of data from operating NFVIs, but they also need to devise ad hoc mining and filtering techniques for this data with the help of both seasoned experts of the NFV domain and skilled Deep Learning practitioners with vast experience in time series forecasting.
Hence, such large amounts of data require adequate filtering mechanisms in order to efficiently remove outliers. Doing so with the proper know-how brings huge performance benefits at the learning stage of LSTM models thanks to the reduced quantity of training examples. When carefully executed, it also makes trained models less prone to over-fitting without impacting the accuracy of the predicted outputs.
This makes outlier filtering an important challenge for budding resource usage forecasting experts, and it is the first issue we tackle in this paper. The next challenge is to build upon an existing LSTM-based architecture, called NFVLearn [2], and improve its prediction accuracy. NFVLearn is a multivariate, many-to-many LSTM-based architecture that leverages resource usage interdependencies of several input resource attributes such as vCPU, vRAM, and I/O bandwidth of many VNFs in an SFC over several time steps, to forecast resource usage of many resource attributes of one VNF over many time steps.
The article is organized as follows. Problem description and motivation are described in Sect. 2. We then present a literature review in Sect. 3. A description of data-driven resource usage prediction, correlation coefficients used in this paper and the NFVLearn design are presented in Sect. 4. Experiments and evaluation of our proposed techniques are described in Sect. 5. Finally, a discussion and a conclusion are presented in Sects. 6 and 7, respectively. For convenience, a summary of acronyms and a summary of symbols are provided in Tables 1 and 2, respectively.

Problem description and motivation
Resource usage forecasting based on Deep Learning techniques such as NFVLearn requires considerable amounts of historical resource usage data from VNFs. This data often travels intermittently throughout an NFVI's control plane, which puts pressure on the traffic load overhead in the substrate network. Fortunately, although this is an unavoidable limitation of data-hungry Deep Learning implementations, it can be mitigated by carefully selecting the proper model architecture. So far, two different implementation types appear in the literature to mitigate this limitation. The first one is to run a VNF resource usage forecasting model on the NFV-MANO, which is a good option when an administrator wants to offload resource usage forecasting tasks away from the VNFs (e.g. when there are VNF processing, memory, power and disk size constraints, a low number of SFCs to manage in the NFVI, low network latency, large network bandwidth, a small number of network hops from the VNFs to the MANO, etc.) [3]. The second implementation type is to run a VNF forecasting model locally on a VNF, which is a better option in large-scale NFVIs and remote data centres with large numbers of managed SFCs and VNFs, where outgoing resource usage data in the control plane could eventually cause network delays and bottlenecks [4]. NFVLearn's agnostic design (i.e. selection of the number of input and output time steps and their granularity, selection of input and output features) makes it extremely flexible and, therefore, fit for both implementation types. However, its LSTM-based architecture could benefit from the latest advancements in the Deep Learning field. For instance, its ability to decipher interdependencies over several time steps of multiple resource usage features of historical data makes it prone to great forecasting accuracy improvements from an attention mechanism. Moreover, to provide the most accurate predictions, VNF resource usage forecasting mechanisms typically require resource load history from multiple sources to truly benefit from the resource attribute interdependencies of an SFC. However, those mechanisms based on Deep Learning are very sensitive to noise in the collected resource usage history required to forecast future resource load, especially for online prediction. That is why data pre-processing, outlier filtering and outlier removal are key in order to reduce variance in the training set.
More specifically, to improve the forecasting accuracy of patterns, seasonality, trends, and periodicity in time series, filtering out extreme values from the sampling pool is a common practice. Leaving those outliers in a Deep Learning training set would otherwise lead to model under-fitting [5-7]. This is especially critical in cloud computing and NFV since those environments host a wide variety of heterogeneous applications, services, tasks and processes, and SFCs can run across several server hosts and clusters at various points of the core and edge networks.
Our aim in this paper is to tackle these challenges. To reach this goal, our main contributions are the following:

• The study and investigation of an attention mechanism added to NFVLearn's architecture. We then compare the training performance and resource usage prediction accuracy of this new design, called A-NFVLearn, with the results of NFVLearn's models.

• The study and investigation of AO, a novel multivariate outlier filtering technique, applied at A-NFVLearn's pre-processing stage. We compare the training performance and resource usage prediction accuracy of A-NFVLearn's models with and without AO.

• The design of this added outlier filtering technique as an entirely automated step that generalizes well to resource usage data from different SFC types and scenarios (IMS and web virtualized instances).

Related work
This section presents a literature review relevant to two aspects of this study: LSTM-based resource usage prediction and outlier filtering at the pre-processing stage of RNN-based models.

Automated learning and VNF resource usage forecasting
Resource usage forecasting garners huge interest from the mobile, edge, cloud and NFV research communities. For instance, authors in [1, 8] note how challenging it is for prospective research teams to design robust, simple and practical resource usage forecasting mechanisms for cloud computing and NFVIs. The main reasons are twofold: first, it requires a vast knowledge of several particularities and objectives of NFV and cloud computing resource usage prediction (delays in VNF placement, resource adaptation in the server nodes, QoS, SLAs, overhead reduction in the control plane, latency minimization, etc.); second, it requires a deep understanding of the potential and limitations of LSTM learning techniques. Therefore, research teams must gather cross-cutting skills in two large research areas. However, several state-of-the-art LSTM architectures have recently been proposed that successfully enable automated NFV, cloud and mobile networking management and orchestration [4, 9-14]. Several approaches leverage LSTM [15]. For instance, the authors in [9] propose a method for analysing the time series correlation of IoT equipment working conditions based on univariate sensor data that leverages an LSTM-based prediction model for working status forecasting. Results showed that their approach achieves a lower RMSE than an ARIMA model. In another example, Patel et al. [12] propose a proactive approach using LSTM-based deep learning for predicting the auto-scaling of VNFs ahead of time in response to dynamic traffic variations. Their LSTM model inputs CPU and bandwidth capacity to forecast the CPU resources utilized by upcoming VNF instances. It then uses a distributed VNF provisioning algorithm to produce scaling decisions ahead of time. Approaches in [10, 11] propose VNF-FG bandwidth forecasting algorithms for allocating resources in NFV environments that can weigh the over-provisioning and under-provisioning costs differently through an asymmetric loss function. Their proposed solutions are interesting since their LSTM models input multivariate network traffic data from all links of a VNF-FG or an SFC to predict future resource needs and help plan resource allocation through the VNF-FG.
A new trend has also emerged in the last few years, where research teams combine LSTM cells with an attention layer [16] with beneficial results. Several approaches in the NFV research domain successfully applied this combination for resource usage forecasting [4,13,14]. For instance, authors in [14] introduce a VNF resource prediction based on a Content and Aspect Embedded Attentive Target Dependent Long Short-Term Memory (CAT-LSTM) model that maximizes the benefits of using an SFC through a directed graph. Their model inputs CPU usage of multiple VNFs over several time steps to predict CPU usage over one time step of a VNF. In [4], authors also use CAT-LSTM to predict future CPU resource loads of a VNF by inputting historical CPU resource utilization of that same VNF and its neighbours. Finally, the approach proposed in [13] goes further by taking the CPU, memory, input and output bandwidth of all VMs to forecast through a CAT-LSTM model the CPU utilization of an intrusion prevention system (IPS) instance. They use up to 20 s of multivariate data sampled every 5 s to predict a resource usage horizon of 5 s, thus 1 time step of that IPS.

Outlier filtering for data pre-processing
Outlier detection and filtering of large data sets is well established and documented. This is an important step at the pre-processing stage, before deep learning model training.
However, several proposed time series-based outlier filtering techniques are designed for univariate data. This is understandable because typical time series prediction techniques such as AR, MA, WA, ARIMA, etc., do not leverage multivariate data and are usually "same-to-same" feature predictions. Among common outlier filtering techniques for univariate data, we find z-score filtering [17], one-class Support Vector Machine (SVM) [18-20] and Local Outlier Factor (LOF) [21, 22].
Authors in [17] use z-score to pre-process and clean missing data and outliers in an approach that forecasts electricity load and price using Jaya-LSTM in smart grids. Although this approach can predict two sets of features (electricity load and electricity price), data filtering focuses on univariate data to improve prediction accuracy on a single feature. The authors then compare the prediction accuracy of a single output feature per model with SVM and univariate LSTM approaches. Z-score is an easy way to estimate outliers in a normal distribution by locating the mean and standard deviation of univariate values of a population.
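As an illustration of the z-score rule itself (a generic NumPy sketch, not the Jaya-LSTM pipeline of [17]):

```python
import numpy as np

def zscore_filter(x, threshold=3.0):
    """Flag points lying more than `threshold` standard deviations
    from the mean (the classic z-score rule for univariate data)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold  # True marks an outlier

# Example: a flat CPU-usage series with a single spike
load = np.r_[np.full(19, 0.3), 2.9]
print(zscore_filter(load))  # only the final spike is flagged
```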
Approaches in [18-20] use one-class SVM for outlier detection and elimination because this approach is simple and efficient and can guarantee the authenticity of data to a certain extent. One-class SVM can greatly improve the precision of anomaly detection in the case of small or unbalanced samples and makes no assumptions about the data distribution. For example, the authors in [18] use it to reduce irregular patterns in metro passenger flow forecasting from an LSTM network. Next, the authors in [19] introduce a novel two-step LSTM configuration for removing deterministic components associated with dominant gear signals in the first step (with LSTM regression) and removing the 'residual deterministic' components associated with varying gear signals in the second step (with one-class SVM). They present different LSTM architectures combined with a one-class SVM to separate abnormal data from normal vibration signals collected from non-consecutive helicopter test flight time series. They show that LSTM regression is not advantageous and that better performance can be achieved by one-class SVM outlier detection based on statistical features. Finally, authors in [20] present a data analysis system combining ARIMA and LSTM for persistent organic pollutant (POP) concentration prediction. ARIMA is used to capture linear components, while LSTM is used to capture nonlinear components. They use the one-class SVM method to detect anomalies in POP concentration values and then use the average sampled concentration of the previous month to replace each abnormal value.
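A minimal sketch of one-class SVM outlier flagging with scikit-learn's OneClassSVM on synthetic data (illustrative only; the kernel and nu values are our assumptions, not the configurations of [18-20]):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Mostly "normal" CPU/memory samples plus a few extreme points
normal = rng.normal(loc=[0.4, 0.5], scale=0.05, size=(500, 2))
extreme = np.array([[0.95, 0.95], [0.02, 0.98]])
X = np.vstack([normal, extreme])

# nu bounds the fraction of training points treated as outliers
labels = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale").fit_predict(X)
outlier_mask = labels == -1  # -1 marks points outside the learned boundary
print(outlier_mask.sum(), "points flagged")
```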
LOF is another commonly applied technique, efficient at finding anomalous data points by measuring the local deviation of a given data point with respect to its neighbours. In [21], authors apply LOF and adaptive K-means to an abnormal data recognition algorithm to implement data pre-processing and noise extraction on wind turbine data. Univariate data are then re-combined and processed by an LSTM-based stacked denoising autoencoder (LSTM-SDAE) model to obtain nonlinear temporal relationships among multivariate variables. In another approach [22], LOF is applied to enhance detection efficiency in a two-stage abnormal behaviour-based anomaly detection mechanism in a virtual machine. The authors demonstrate that LOF can significantly reduce computational complexity and meet real-time performance needs.
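A minimal scikit-learn sketch of LOF flagging on synthetic data (parameters such as n_neighbors are our assumptions, not those of [21, 22]):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
X = np.vstack([X, [[6.0, 6.0]]])  # one point far from any neighbourhood

lof = LocalOutlierFactor(n_neighbors=20, contamination="auto")
labels = lof.fit_predict(X)             # -1 marks local-density outliers
scores = -lof.negative_outlier_factor_  # larger score = more anomalous
print(labels[-1], scores[-1])           # the planted point is flagged
```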
Lastly, a common outlier detection technique frequently found in the literature is the Mahalanobis Distance (MD). It is a measure of the distance between a point P and a distribution D: a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D. In [21], MD is calculated based on reconstruction errors, and an alarm mechanism based on the sliding window technique is set up to detect abnormalities in real time. Finally, authors in [23] use MD to denoise multi-dimensional sensor data, flag the outliers, and enhance the robustness of the denoising process.
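For completeness, a short NumPy/SciPy sketch of MD-based flagging (the chi-squared cutoff is a common rule of thumb for Gaussian-like data, not necessarily the rule used in [21, 23]):

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_distances(X):
    """Distance of each row of X from the sample mean, scaled by the
    sample covariance (a multivariate analogue of the z-score)."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(200, 3)), [[8.0, 8.0, 8.0]]])
d = mahalanobis_distances(X)
# Compare d^2 with a chi-squared quantile (p degrees of freedom)
cutoff = np.sqrt(chi2.ppf(0.975, df=X.shape[1]))
print(np.nonzero(d > cutoff)[0])  # index 200 (the planted point) is among the flagged
```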
While these approaches prove extremely efficient when applied to specific problems involving univariate data, none offers a way to precisely detect outliers in multivariate time series. This capability is particularly valuable for an approach such as A-NFVLearn, which heavily relies on interdependencies between several resource attributes, hence on multiple interrelated variables.

A-NFVLearn design
This section first discusses the data structure of the collected resource usage data used for our experiment. Next, we give detailed information about the attention mechanism enhancing A-NFVLearn's resource usage predictions. Lastly, we present the AO approach used to remove outliers from multivariate data. For reference, A-NFVLearn's data structure and architecture are depicted in Fig. 1.

Data structure
Timestamped resource usage historical data is required to train A-NFVLearn models through supervised learning. This VNF Component (VNFC) or VNF resource usage data is collected from simulations at regular intervals and can be from any resource attribute: vCPU, memory, input bandwidth, output bandwidth, etc. This stored data is then filtered, scaled and organized so that a column represents the resource attribute of a VNF and a row represents a time stamp at which the resource usage occurred. Data reorganized in this manner simplifies later data pre-processing as required by A-NFVLearn's architecture; for instance, it allows us to: (1) label F_x input features and F_y output features from any of the resource attribute columns of the dataset and, (2) for each labelled input and output feature, create resource usage sliding windows of size T_x input time steps and of size T_y output time steps.
This process results in the generation of training examples, which can further be split into training and validation sets, with each training example being a set of two matrices: the input matrix and the output labels matrix. The input matrix of a training example (Eq. 1), of dimension [T_x, F_x], is denoted as follows:

X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,F_x} \\ \vdots & \ddots & \vdots \\ x_{T_x,1} & \cdots & x_{T_x,F_x} \end{bmatrix}   (1)

and the output labels matrix (Eq. 2), of dimension [T_y, F_y], is denoted analogously:

Y = \begin{bmatrix} y_{1,1} & \cdots & y_{1,F_y} \\ \vdots & \ddots & \vdots \\ y_{T_y,1} & \cdots & y_{T_y,F_y} \end{bmatrix}   (2)

The choice of values for T_x and T_y is at the will of the user before model training. The number of input/output features and time steps is based on several factors such as prediction accuracy, the length between input/output time steps, interdependency reinforcement, etc., as we will later see in Sect. 5.4. A filtered data frame is therefore denoted as a set of N input samples (Eq. 3) and output labels (Eq. 4) as follows:

\mathcal{X} = \{ X^{(1)}, X^{(2)}, \ldots, X^{(N)} \}   (3)

\mathcal{Y} = \{ Y^{(1)}, Y^{(2)}, \ldots, Y^{(N)} \}   (4)
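As an illustration, a minimal NumPy sketch of this sliding-window construction (names such as make_training_examples are ours for illustration, not part of NFVLearn's code base):

```python
import numpy as np

def make_training_examples(frame, in_cols, out_cols, T_x, T_y):
    """Slice a [timestamps x attributes] resource-usage frame into
    (input, label) pairs: inputs of shape [T_x, F_x] built from the
    selected input attributes, labels of shape [T_y, F_y] built from
    the output attributes over the T_y steps that follow."""
    X, Y = [], []
    for t in range(len(frame) - T_x - T_y + 1):
        X.append(frame[t : t + T_x, in_cols])
        Y.append(frame[t + T_x : t + T_x + T_y, out_cols])
    return np.stack(X), np.stack(Y)

# e.g. 1000 timestamps x 23 attributes, 5 input steps -> 3 output steps
frame = np.random.rand(1000, 23)
X, Y = make_training_examples(frame, in_cols=[0, 4, 8], out_cols=[0, 1], T_x=5, T_y=3)
print(X.shape, Y.shape)  # (993, 5, 3) (993, 3, 2)
```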

Attention mechanism
A-NFVLearn's attention mechanism [16] is developed to improve performance on long input sequences. The main idea is to allow the decoder to selectively access encoder information during decoding. This is achieved by building a different "context vector" for every time step of the decoder, calculated as a function of the previous hidden state of the decoder and of all the hidden states of the encoder, to which trainable weights are assigned.
In this way, the attention mechanism assigns different importance to the different features of the input sequence and gives more attention to the more relevant inputs.

Encoder
At each time step t_x ∈ {1, 2, …, T_x}, the hidden representation h(t_x) of the input sequence is computed as a function of the hidden state h(t_x − 1) of the previous time step and the current input. The final hidden state h(T_x) thus contains the encoded information from all the previous hidden representations and inputs.

Context vector
A different context vector c(t y ) is computed for every time step t y ∈ 1, 2, … , T y of the decoder.
To calculate the context vector c(t_y) for time step t_y, we proceed as follows. First of all, for every combination of time step t_x of the encoder and time step t_y of the decoder, the so-called alignment scores e(t_x, t_y) are computed with the following weighted sum:

e(t_x, t_y) = V_a \tanh\big(W_a h(t_x) + U_a s(t_y - 1)\big)

In this equation, W_a, U_a and V_a are trainable attention weights. The weights W_a are associated with the hidden states h(t_x), t_x ∈ {1, 2, …, T_x}, of the encoder, the weights U_a are associated with the hidden states s(t_y), t_y ∈ {1, 2, …, T_y}, of the decoder, and the weights V_a define the function that calculates the alignment score.
For every time step t_y, the scores e(t_x, t_y) are normalized using a softmax activation function over the encoder time steps t_x, obtaining the attention weights α(t_x, t_y):

\alpha(t_x, t_y) = \frac{\exp\big(e(t_x, t_y)\big)}{\sum_{t'_x = 1}^{T_x} \exp\big(e(t'_x, t_y)\big)}

The attention weights α(t_x, t_y) capture the importance of the input of time step t_x for decoding the output of time step t_y. The context vector c(t_y) is calculated as the weighted sum of all the hidden states h(t_x) of the encoder according to the attention weights:

c(t_y) = \sum_{t_x = 1}^{T_x} \alpha(t_x, t_y)\, h(t_x)

Hence, this context vector allows us to pay more attention to the more relevant inputs in the input sequence.

Decoder
The context vector c(t_y) is passed to the decoder, which computes the probability distribution of the next possible output. This decoding operation runs over all T_y output time steps. The current hidden state s(t_y) is computed according to an LSTM function, taking as input the context vector c(t_y), the hidden state s(t_y − 1) and the output ŷ(t_y − 1) of the previous time step:

s(t_y) = \mathrm{LSTM}\big(s(t_y - 1),\, \hat{y}(t_y - 1),\, c(t_y)\big)

Using this mechanism, the model can find the correlations between different parts of the input sequence and corresponding parts of the output sequence.
For each time step, the output of the decoder is calculated by applying the softmax function to the weighted hidden state s(t_y):

\hat{y}(t_y) = \mathrm{softmax}\big(W\, s(t_y)\big)
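To make the computations above concrete, here is a minimal NumPy sketch of one decoder step of this additive attention (toy dimensions; the random h_enc stands in for encoder LSTM outputs, so this illustrates the equations rather than A-NFVLearn's actual implementation):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attention_context(h_enc, s_prev, W_a, U_a, V_a):
    """One decoder step: score every encoder hidden state h(t_x) against
    the previous decoder state s(t_y - 1), normalize the scores, and
    return the weighted context vector c(t_y)."""
    # alignment scores e(t_x, t_y) = V_a . tanh(W_a h + U_a s)
    scores = np.array([V_a @ np.tanh(W_a @ h + U_a @ s_prev) for h in h_enc])
    alpha = softmax(scores)                          # attention weights
    context = (alpha[:, None] * h_enc).sum(axis=0)   # c(t_y)
    return context, alpha

# toy dimensions: T_x = 5 encoder steps, hidden size 8, attention size 4
T_x, d_h, d_a = 5, 8, 4
rng = np.random.default_rng(0)
h_enc = rng.normal(size=(T_x, d_h))   # encoder hidden states h(1..T_x)
s_prev = rng.normal(size=d_h)         # previous decoder state s(t_y - 1)
W_a = rng.normal(size=(d_a, d_h))
U_a = rng.normal(size=(d_a, d_h))
V_a = rng.normal(size=d_a)
c, alpha = attention_context(h_enc, s_prev, W_a, U_a, V_a)
print(alpha.round(3), c.shape)        # weights sum to 1, context has size d_h
```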

AO implementation
AO is an outlier detection mechanism for multivariate data that allows for skewness, presented in [24]. The method is based on the adjusted boxplot for skewed data [25] and essentially defines a different scale on each side of the median for univariate data. This scale is obtained by means of a robust measure of skewness [26]. In this work, this approach was selected for its unique ability to adapt to skewed distributions, such as highly correlated but radically different workload patterns like VNF resource usage from several resource attributes. Since it defines a different scale on each side of the median of univariate data, AO easily adapts to different resource usage scenarios and to low, mid and high resource usage patterns of an SFC.

AO for univariate skewed data
The outlyingness of a univariate data point tells us how far the observation lies from the centre of the data, standardized using a robust scale. In this definition, it does not matter whether the data point is smaller or larger than the median. However, when the distribution is skewed, authors in [24] apply a different scale on each side of the median. The AO for univariate data is then defined as:

AO^{(1)}(x_i, X_n) = \begin{cases} \dfrac{x_i - \operatorname{med}(X_n)}{w_2 - \operatorname{med}(X_n)} & \text{if } x_i > \operatorname{med}(X_n) \\[1ex] \dfrac{\operatorname{med}(X_n) - x_i}{\operatorname{med}(X_n) - w_1} & \text{if } x_i \leq \operatorname{med}(X_n) \end{cases}

with w_1 and w_2 being the lower and upper whiskers of the adjusted boxplot applied to the data set X_n.
Note that AO^{(1)} is location and scale invariant. Hence it is not affected by changing the data's centre and/or scale. As the AO is based on robust measures of location, scale and skewness, it is resistant to outliers. Theoretically, resistance to up to 25% of outliers can be achieved, although the MedCouple often has a substantial bias when the contamination exceeds 10%.

AO for multivariate skewed data
Consider now a p-dimensional sample X_n = (x_1, …, x_n)^T with x_i = (x_{i1}, …, x_{ip})^T. AO outlier detection for multivariate data is then defined as:

AO_i = AO(x_i, X_n) = \sup_{\|a\| = 1} AO^{(1)}(a^T x_i, X_n a)

In practice, AO cannot be computed by projecting the observations on all univariate directions a. Hence, we should restrict ourselves to a finite set of random directions. Our simulations have shown that considering m = 250p directions yields a good balance between efficiency and computation time. Random directions are generated as the directions perpendicular to the subspace spanned by p observations, randomly drawn from the data set. As such, AO is invariant to affine transformations of the data. Moreover, in our implementation we always take \|a\| = 1, although this is not required as AO^{(1)} is scale invariant.
Once AO is computed for every observation, we can use this information to decide whether an observation is outlying. Except for normal distributions, for which the asymptotic distribution of the AO values is known, the AO distribution is generally unknown (but typically right-skewed, as the values are bounded by zero). Hence, we compute the adjusted boxplot of the AO values and declare a multivariate observation outlying if its AO_i exceeds the upper whisker of the adjusted boxplot. More precisely, the outlier cutoff value equals:

\mathrm{cutoff} = Q_3 + 1.5\, e^{3\,\mathrm{MC}}\, \mathrm{IQR}

where Q_3 is the third quartile of the AO values, and similarly for IQR and MC. Here, MC stands for MedCouple, a robust measure of skewness [26]. It is defined as:

\mathrm{MC}(X_n) = \operatorname*{med}_{x_i \leq \operatorname{med}_n \leq x_j} h(x_i, x_j)

with med_n the sample median, and

h(x_i, x_j) = \frac{(x_j - \operatorname{med}_n) - (\operatorname{med}_n - x_i)}{x_j - x_i}
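For concreteness, a simplified NumPy sketch of the whole AO pipeline described above (our experiments use the LIBRA Matlab implementation; this version uses a naive O(n²) MedCouple, skips the special tie-handling kernel at the median, and is meant as an illustration only):

```python
import numpy as np

def medcouple(x):
    """Naive O(n^2) MedCouple, a robust skewness measure in [-1, 1].
    (LIBRA uses a faster algorithm; ties at the median are skipped here.)"""
    x = np.sort(np.asarray(x, dtype=float))
    med = np.median(x)
    lo, hi = x[x <= med], x[x >= med]
    h = [((xj - med) - (med - xi)) / (xj - xi)
         for xi in lo for xj in hi if xj > xi]
    return np.median(h) if h else 0.0

def adjusted_whiskers(x):
    """Whiskers of the adjusted boxplot for skewed data [25]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr, mc = q3 - q1, medcouple(x)
    if mc >= 0:
        return q1 - 1.5 * np.exp(-4 * mc) * iqr, q3 + 1.5 * np.exp(3 * mc) * iqr
    return q1 - 1.5 * np.exp(-3 * mc) * iqr, q3 + 1.5 * np.exp(4 * mc) * iqr

def ao_univariate(x):
    """AO^(1): outlyingness with a different scale on each side of the median."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    w1, w2 = adjusted_whiskers(x)
    up, dn = max(w2 - med, 1e-12), max(med - w1, 1e-12)  # guard degenerate scales
    return np.where(x > med, (x - med) / up, (med - x) / dn)

def ao_multivariate(X, dirs_per_dim=250, seed=0):
    """Projection-pursuit AO: maximum of AO^(1) over random directions
    perpendicular to hyperplanes through p randomly drawn observations."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    rng = np.random.default_rng(seed)
    ao = np.zeros(n)
    for _ in range(dirs_per_dim * p):
        pts = X[rng.choice(n, size=p, replace=False)]
        try:
            a = np.linalg.solve(pts, np.ones(p))  # normal of hyperplane a.x = 1
        except np.linalg.LinAlgError:
            continue  # degenerate draw, try another direction
        a /= np.linalg.norm(a)
        ao = np.maximum(ao, ao_univariate(X @ a))
    return ao

def flag_outliers(X):
    """Declare observations outlying when AO_i exceeds the upper
    whisker of the adjusted boxplot of the AO values."""
    ao = ao_multivariate(X)
    _, cutoff = adjusted_whiskers(ao)
    return ao > cutoff
```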

Experiment
This section presents the main experiments performed and the results obtained.

Data
Big instances of VNFs, such as an IP multimedia subsystem (IMS), may comprise smaller instances of software components called VNFCs. A VNFC is an internal component of a VNF which provides a defined sub-set of that VNF's functionality, with the main characteristic that a single instance of this component maps 1:1 against a single virtualization container. This means that the VNFCs in a VNF are linked to each other by a combination of directed and undirected links and work together to provide the required functionality of the VNF. This inevitably poses a challenge for someone aiming to collect resource usage metrics from those VNFCs since some ad hoc measures have to be taken to filter the internal network traffic between multiple VNFCs of the same VNF. For this experiment, we have combined the resource usage metrics of all VNFCs into single VNFs and redefined SFCs into directed graphs, as shown in Fig. 2. A complete description of the composition of the SFCs used in this work can be found below.
The proposed approach was tested on an IMS scenario and a Web scenario, both built on a bare-metal Kubernetes testbed.
The IMS SFC deployment comprises a Clearwater Core IMS virtual environment organized in a VNF-FG split into INVITE and REGISTER requests. VNF instances and components have been labelled according to their position in the forwarding graph (VNF1 and VNF2 in the INVITE request FG, and VNF3, VNF4, VNF5 and VNF6 in the REGISTER request FG). A SIPp traffic generator handles the generation of both request types towards the SFC. SIPp is an open-source traffic generator and test tool for the SIP protocol. We collected CPU, memory, input bandwidth and output bandwidth resource usage data from VNF1 to VNF5, and all but output bandwidth resource usage data from VNF6, for a total of 23 resource attributes.
The Web SFC deployment is a virtualized web server environment composed of a Go web application (VNF1) attached to a virtualized switch serving as a load balancer. That load balancer is tied to three Java REST API instances (VNF2), which serve words read from a PostgreSQL database (VNF3). The network traffic originates from a Siege traffic generator. Siege is an open-source regression test and benchmark utility that can stress test a single URL with a defined number of simulated users. We collected CPU, memory, input bandwidth and output bandwidth resource usage data from VNF1, and all but output bandwidth resource usage data from VNF2 and VNF3, for a total of 10 resource attributes. In this scenario, we could not collect output bandwidth resource usage data from VNF2 because the network monitor was incorrectly scanning the network traffic on the load balancer between VNF1 and VNF2 instead of the exit node of VNF2.
Each pod in the IMS and web scenario tests has three vCPUs, 2 GB of RAM, and 10 Mbps links. The workloads generated by SIPp and Siege both vary linearly, with sharp increases/decreases occurring around 10% of the time. VNF resource usage metrics (vCPU, vRAM, I/O bandwidth) were collected every 20 s over a span of two weeks, giving over 80,000 timestamped raw/unfiltered data points, which were used to train and validate our LSTM models.

Tools and model training
The models were trained on a computer with a 6-core Intel Core i7-10710U CPU with each core clocked at 1.10 GHz. AO filtering was performed in Matlab 2016a using LIBRA, developed at Robust@Leuven. The outliers were flagged and stored in a .csv file, then filtered at the data pre-processing stage of model training. Examples of flagged outliers (in red) in a bagplot using LIBRA for bivariate data are depicted in Fig. 3.

Figure 4 presents a high-level overview of data pre-processing before model training. Steps 1 and 2 are explained in Sects. 4.1 and 5.1, respectively. In step 3, we select the VNF resource attributes we want to forecast (we usually select resource attributes from one VNF for each A-NFVLearn model). In step 4, we then select the most highly correlated input resource attributes from an SFC to enhance the forecasting of our models. To do so, we use any approach presented in [2] (for this work, only the Pearson input feature selection mechanism was used). Next, in step 5, we use multivariate AO to filter outliers from the resource usage data of the input and output features selected in the previous steps (steps 3 and 4). Finally, we apply MinMax normalization to the input and output features in step 6.

All trained models used the same hyperparameters, shown in Table 3, for comparison purposes. The number of hidden layers, the number of units per layer, the regularization process and its sub-parameters (the ADAM optimizer's learning rate, β₁, β₂ and ε) and the early stopper value were selected through a grid search process that yielded the best overall MSE validation loss.

For the normalization process, NFVLearn uses an unorthodox technique: before training our models, we separately fit and transform both input features and output features using MinMax normalization with a range of [0, 1]. The key idea behind this technique is that we use input/output features of widely different ranges ([0.00, 3.00] usage ratio from 3 vCPUs, [0, 4,000,000,000] bytes of RAM usage and [0, 10,000,000] bps of I/O bandwidth usage). Hence, our aim is to uniformly scale down the values of any and all output features, to then uniformly derive the weights of the neural nodes through gradient descent during backpropagation. The fit weights of the MinMax normalization for both input and output features are then stored in order to "un-normalize" predicted values later on, when using our trained models for prediction.

Moreover, we have selected specific input features (Tables 4 and 5) based on the highest correlation coefficient scores matching one or several output features (Tables 6 and 7) for each model. We made those choices to ensure the best RMSE scores for each predicted resource usage feature by leveraging LSTM's ability to reinforce predictions based on hidden interdependencies between the selected resource attributes.
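As an illustration of this separate input/output normalization, a short sketch with scikit-learn's MinMaxScaler (synthetic shapes and values; not NFVLearn's actual code):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# X: [examples, T_x, F_x] inputs, Y: [examples, T_y, F_y] labels
X = np.random.rand(500, 10, 3) * [3.0, 4e9, 1e7]  # vCPU ratio, bytes, bps scales
Y = np.random.rand(500, 5, 2) * [3.0, 4e9]

# Fit one scaler per side on the flattened feature columns, range [0, 1]
x_scaler = MinMaxScaler((0, 1)).fit(X.reshape(-1, X.shape[-1]))
y_scaler = MinMaxScaler((0, 1)).fit(Y.reshape(-1, Y.shape[-1]))

X_n = x_scaler.transform(X.reshape(-1, X.shape[-1])).reshape(X.shape)
Y_n = y_scaler.transform(Y.reshape(-1, Y.shape[-1])).reshape(Y.shape)

# After prediction, the stored y_scaler "un-normalizes" model outputs
y_pred_n = Y_n[:1]  # stand-in for a model prediction
y_pred = y_scaler.inverse_transform(
    y_pred_n.reshape(-1, Y.shape[-1])).reshape(y_pred_n.shape)
```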
Finally, we have trained models with different sets T x = [5,10] and T y = [3,5,7] in order to evaluate the prediction accuracy of different numbers of output time steps given different numbers of input time steps.

Comparisons
Different LSTM-based combinations are tested in Sect. 5.4. For instance, we designed models with NFVLearn or A-NFVLearn and combined them either with or without AO filtering for a total of four different combinations. Then, we trained those models with variations in the numbers of input time steps ([5, 10]) and predicted output time steps ([3, 5, 7]). In order to compare the prediction accuracy of each setup, the RMSE defined in Eq. 16 and the coefficient of determination R² defined in Eq. 17 were used. The principle behind this selection of metrics is that RMSE offers a good measure of the differences between values. At the same time, R² provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}   (16)

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}   (17)

where y is the observed value, ŷ is the estimated value, \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i is the sample mean, and n is the number of values to evaluate.
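Both metrics can be computed directly, e.g. with scikit-learn (an illustrative snippet with toy values, not our evaluation code):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([0.42, 0.47, 0.51, 0.49, 0.55])  # observed usage
y_pred = np.array([0.40, 0.49, 0.50, 0.52, 0.53])  # model output

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # Eq. 16
r2 = r2_score(y_true, y_pred)                       # Eq. 17
print(f"RMSE = {rmse:.4f}, R^2 = {r2:.3f}")
```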

Results
The results of our experiments are shown in Tables 8, 9, 10, 11 and 12 and in Figs. 5, 6, 7 and 8. The average run time of our trained models in simulations was 0.01654 ms, with a standard deviation of 0.00098 ms.

Table 8 gives each model's number of flagged AO outliers at pre-processing. Those flagged outliers were then processed out of the data set for each training instance, along with their neighbouring data in a time window of length T_x. We notice that several models (i.e. IMS VNF1, IMS VNF3, IMS VNF6, WEB VNF1, and WEB VNF2) have a relatively low number of detected outliers, amounting to less than 5% of their respective data sets. On the other side of the spectrum, however, other models (i.e. IMS VNF5 and WEB VNF2) get a high number of filtered outliers, amounting to over 10% of their respective data sets.

Tables 9 and 10 show the RMSE ratio and performance ratio of each combination of attention-based and AO-filtered IMS models, while Tables 11 and 12 do the same for attention-based and AO-filtered web models, respectively. For each table, the average RMSE and performance are compared to the average "without attention/without AO" baseline models. Table 9 shows that the RMSE scores of IMS attention-based models outperform those without an attention mechanism. Furthermore, observations in Table 10 show that IMS models with AO filtering outperform models without AO filtering during the training of the models. Moreover, we notice that IMS models with an attention-based mechanism have the edge over models without attention performance-wise. These observations demonstrate that IMS models using an attention mechanism and AO filtering improve the RMSE accuracy of the resource usage predictions and significantly improve IMS model training performance.

Observations from Table 11 show that the RMSE scores of web models without AO filtering outmatch those with AO filtering. In this case, models with AO filtering yield over 9% higher RMSE scores than the baseline web model, which is still at a tolerable level since the baseline web model's average RMSE score is 0.0621. Next, we observe in Table 12 that attention-based web models outperform web models without an attention mechanism, with an edge for models which also use AO filtering, with a 39% performance gain during training. These observations show that attention-based, AO-filtered web models benefit from an outstanding training performance gain without significantly impacting the overall RMSE scores of the models.

Figures 5 and 6 depict training and validation results of all IMS and web models assembled in two sets: the first compares models with and without an attention mechanism (Fig. 5) and the second compares models with and without AO filtering (Fig. 6). In Fig. 5, all subfigures indicate that models with an attention mechanism achieve better RMSE results than those without, both for IMS and web models, and in both training and validation.

Figure 7 presents validation scores per resource attribute for models trained with T_y = [3, 5, 7]. For instance, Fig. 7a and b depicts histograms of R² scores and RMSE from predicted CPU resource usage, respectively. Next, Fig. 7c and d shows R² scores and RMSE of predicted memory usage. Lastly, Fig. 7e and f shows predicted output bandwidth resource usage results.
General observation of the R² scores throughout shows that the majority are > 0.90, which indicates a high fidelity of the predictions compared to the observed results at the validation step. However, some resource attributes showing R² scores below 0.8 require further investigation, i.e. WEB VNF1 CPU (Fig. 7a), memory (Fig. 7c) and output bandwidth (Fig. 7e), and IMS VNF2 memory (Fig. 7c, with R² scores near or below 0.2). In WEB VNF1's case, we understand that this is an instance of a web server receiving HTTP requests from end users. As such, it is difficult to forecast resource usage of any kind given prior resource usage history from that same VNF or its neighbours. Furthermore, IMS VNF2's memory usage is difficult to forecast since that VNF is the end point of IMS INVITE requests; its main function is, therefore, more network traffic oriented and less reliant on random access memory than, for example, a VNF hosting a database manager.
Observations of RMSE scores in Fig. 7b, d and f show that models built using T_x = 5 and T_y = 3 (dark blue) are proportionally the least accurate models overall, and those built using T_x = 10 and T_y = 3 are the most accurate. Interestingly, the RMSE accuracy of the models has stronger ties to the number T_x of input time steps than to the number T_y of output time steps, showing that LSTM does indeed learn relationships over many time steps and enhances output predictions accordingly.
Further, results in Fig. 8 show interesting trends. First, we introduce results produced by a baseline LSTM approach to establish a basis of evaluation against A-NFVLearn and NFVLearn. This baseline LSTM has a state-of-the-art architecture that takes multiple input features of a similar resource attribute over many time steps to forecast resource usage of a single resource attribute over many time steps (e.g. it inputs the CPU resource usage of VNF1, VNF2 and VNF3 to predict the future CPU usage of VNF1). When comparing the results of this baseline LSTM with our different approaches, we observe that its average RMSE per output time step aligns closely with the results of NFVLearn (without attention) with AO. Our three other approaches, however, clearly outmatch it in RMSE accuracy, especially for predictions over 4 time steps.
Next, we notice that models without attention yield better RMSE accuracy when using fewer input time steps (models with 5 input time steps are more accurate than those with 10 input time steps). In contrast, models with an attention mechanism show the opposite. Moreover, we notice that the gap in RMSE accuracy becomes larger between models with and without attention when the model uses more input time steps. This observation demonstrates an attention mechanism's ability to decipher complex relationships between input features over more input time steps than models simply using LSTM.
Moreover, other trends emerge when looking at Fig. 8a and b. First, we notice a significantly poorer RMSE accuracy for the predicted outputs at time step 1 from models without attention, followed by excellent RMSE accuracy results at output time steps 2-4. Further investigation should be made to determine why output predictions at the first time step fail to generalize. Results also show that RMSE accuracy degrades steadily, and with a lower slope, when using models with attention mechanisms. This is especially beneficial at predicted output time steps 4-7, where models with attention outperform those without attention. Lastly, we observe that models using AO show slightly lower RMSE accuracy than those without AO. However, it is worth noting that models with attention and AO outperform those without AO (both with and without attention) when predicting resource usage values at output time steps 6 and 7.
Results in Fig. 8 highlight A-NFVLearn's efficiency with an attention mechanism and AO filtering. The attention mechanism ensures a smooth, steady prediction accuracy over a large horizon window, while AO filtering provides outstanding training performance. When used together, models trained with attention and AO outmatch models without attention, and that advantage grows with the availability of a large resource usage history.

Discussion
First, a very clear trend emerges from the results in Sect. 5.4: A-NFVLearn shows significant gains in prediction accuracy and training performance over NFVLearn for both IMS and Web models. This observation is interesting from a design perspective since it confirms our initial hypothesis that NFVLearn's multivariate, many-to-many architecture was suitable for an attention mechanism upgrade. This enhancement demonstrates that attention mechanisms are well suited to resource usage forecasting using interdependency-based resource usage history in cloud and NFV environments. A-NFVLearn benefits from the attention mechanism's deeper ability to decipher relationships between several input features and improves model generalization. It does so with fewer training iterations than NFVLearn's standard LSTM architecture.
The next highlight from the results is that AO filtering systematically improves the training performance of the VNF resource usage forecasting models. Pairing it with an attention mechanism further improves performance results while mitigating the decrease in RMSE accuracy. This technique is a good option for A-NFVLearn's pre-processing stage since an administrator can trade a minor loss in RMSE accuracy for reduced under-fitting and improved training performance. Applied to large datasets and NFV environments hosting large numbers of VNFs, AO filtering will significantly reduce the overall training time of all the VNF models, reducing the power required to train them. As mentioned, the decision to apply AO filtering is left entirely to the system administrator, according to model training performance, accuracy and prediction variance objectives.
Another comment is that the AO filtering process is extremely user-friendly. Throughout the experiments, we have kept AO's outlier detection parameters to the default settings provided in the implementation of [24]. The flagged multivariate outlier data points are automatically filtered out along with their neighbours.

Conclusion
In this paper, we presented A-NFVLearn, a multivariate, many-to-many, attention-enhanced LSTM-based architecture designed for resource usage forecasting of multiple resource attributes of a VNF in an SFC. We compared the training performance, RMSE prediction accuracy and R² prediction fidelity of A-NFVLearn and NFVLearn with and without AO's pre-processing outlier filtering technique. Results show that leveraging A-NFVLearn's attention mechanism significantly improves a VNF model's training time performance and RMSE prediction accuracy, and that AO multivariate outlier filtering also improves training time performance and reduces variance in the forecasted resource usage of CPU, memory and network bandwidth resource attributes of VNFs from IMS and web SFCs. In future work, we aim to devise a clustering technique to label jobs and tasks from the multivariate resource load of a VNF to reinforce the resource usage forecasting of A-NFVLearn.