GANAD: A GAN-based method for network anomaly detection

Cyber-intrusions pose severe threats to networks, e.g., system paralysis, information leakage, and economic losses. To protect network security, anomaly detection methods based on generative adversarial networks (GANs) have been proposed to hinder cyber-intrusion. However, existing GAN-based anomaly scoring methods are built upon generator networks designed for data synthesis, which yields unappealing performance on the anomaly detection task; their inefficient and unstable behavior leaves detection quite challenging. To cope with these issues, we propose GANAD, a novel GAN-based approach designed specifically for anomaly identification rather than data synthesis. Specifically, it first adopts an auto-encoder-like architecture, which removes the time-consuming latent-space search of the traditional generator loss computation. To stabilize training, the proposed discriminator replaces the JS divergence with the Wasserstein distance plus a gradient penalty. It then employs a new training strategy to better learn the minority abnormal distribution from normal data, which improves detection precision. Hence, our approach ensures detection performance while overcoming the instability of GAN training. Experimental results demonstrate that our approach outperforms state-of-the-art methods while reducing time consumption.


Introduction
At present, the Internet and computer networks have realized the interconnection of information, accelerated the speed of information transmission, and changed the way data are transmitted. However, the widespread adoption of computer networks introduces security threats that may cause misbehavior and severe damage. These threats keep changing and evolve into new, unknown variants [1]. To prevent them, anomaly detection technologies are employed in a classification engine that determines the safety of the network. Nowadays, an excellent anomaly detection system is required to discover various anomalies efficiently as new network attacks emerge. Recently, plenty of data-driven network intrusion detection methods have appeared, which tend to focus on minority attack classes compared with normal traffic [2]. First, many supervised methods were proposed that classify behaviors not matching normal behavior as attacks. Representative supervised methods such as the decision tree (DT) [3] and the support vector machine (SVM) [4] can analyze and identify such behaviors successfully. However, they were also shown not to scale to large real-world network datasets, in which the amount of attack traffic is limited. Therefore, unsupervised and weakly-supervised methods like REPEN [5] and PRO [6], and semi-supervised methods like DeepSAD [7], capable of classifying anomalies without labeled data, were considered for defending against anomaly threats. However, they fail to detect all abnormal behaviors efficiently because of unknown anomalies, data contamination, etc. These methods are thus not capable of handling current cyber anomaly threats, their level of sophistication, and their flexibility.
Moreover, the lack of prior knowledge, i.e., of attack categories (zero-day attacks), is an important challenge in the detection task, as such attacks need to be detected quickly to avoid great damage. On the one hand, many network infrastructures and individual devices within CPSs or the IoT have unknown vulnerabilities, which complicates security solutions. On the other hand, with the fast development of 5G networks and cloud services, the transmission volume and bandwidth of cyber-attack traffic will significantly increase, and such intrusions may be difficult to detect in real time and stably. GANs [8] have been applied in this field and have achieved excellent performance on complex network traffic datasets. It is well known that unknown network intrusions behave in a pattern more similar to a known anomaly pattern than to normal data [9]. Since a GAN is able to learn an implicit probability distribution, the discriminator can find generated or fake samples. AnoGAN [10] is the first GAN-based method; it extracts normal sample features to discriminate anomalies. GANomaly [11] further improves on that work by replacing the generator with an encoder-decoder-encoder network.
High efficiency and continuously stable detection are challenging for current methods. Existing GAN-based anomaly detection methods do not satisfy the low-latency and stable detection requirements of an IDS. In addition, the works above mainly focus on data synthesis by the generator [12] and obtain suboptimal anomaly scores for intrusion detection. Thus, two main challenges remain. On the one hand, previous GAN-based methods perform poorly because they rely solely on the generator. On the other hand, because of the optimization problem of finding a latent variable, existing GAN-based methods [10,13] cannot simultaneously achieve stable training and efficiency, and hence may fail to identify anomalies in real time.
To overcome these hurdles, we design a WGAN-based model called GANAD that improves performance through an improved GAN network structure and can be applied flexibly. Our architecture models input data as heterogeneous network nodes and organizes the GAN as an auto-encoder-like network to handle the characteristic features of the data. It introduces a newly designed encoder whose layers use spectral normalization, which helps the generator produce samples with wider variety. Besides, this design accelerates the residual loss computation by avoiding the typical GAN structure [13], which must find the corresponding latent space iteratively, and increases detection efficiency by avoiding the generator's optimization problem. The proposed network trains the discriminator and generator with spectral normalization, which can capture important hidden information in the sample distribution even with little overlap between samples. Moreover, we improve the architecture of [14] by replacing the unconstrained GAN discriminator, required for computing the discrimination loss, with a discriminator trained with a gradient penalty, which stabilizes adversarial training. Therefore, our model not only simulates the data distribution more accurately, yielding superior performance, but also reduces computational cost. Furthermore, the proposed approach combines the residual loss and the discrimination loss into a training strategy that models weak abnormal supervisory signals. Our work thus achieves an equilibrium between optimal detection accuracy and efficient performance. Experiments on three different datasets verify the effectiveness and generalization of our approach, and the results demonstrate that it outperforms state-of-the-art methods.
The main contributions of this paper are summarized as follows:
• We propose an anomaly-based IDS for network anomaly detection using an improved GAN, called GANAD, which achieves efficient intrusion detection. Experiments show that the novel variant GAN architecture gives our approach optimal performance and overcomes the challenge of lacking prior knowledge.
• We propose a novel and faster way of computing the discrimination and reconstruction losses, which is the key to improving detection performance and meets the trade-off between high efficiency and stability.
• To obtain better evaluation scores, we propose a novel training strategy that models the weakly labeled abnormal data space between the majority normal and minority abnormal samples, and also between the real and generated samples.
• The proposed GANAD is validated on three real-world network datasets for binary and multi-class classification tasks. Experimental results demonstrate that the proposed approach is superior to state-of-the-art network anomaly detection approaches, achieving both stability and efficiency.

Related work
We have surveyed many research efforts and encouraging progress on network anomaly detection. In this section, we briefly introduce existing works in this field, grouped into traditional work, DNN-based work, and GAN-based work.

Traditional work
In the initial stage there were traditional methods, most of which utilized supervised machine learning algorithms. For example, distance-based methods [15] evaluate whether a data point is anomalous by using the distances to its nearest neighbors or clusters. Clustering-based approaches have also been proposed: Blowers et al. [16] applied DBSCAN to identify anomalies in the network. Khan et al. [17] proposed a method that uses a genetic algorithm to detect anomalies. Shone et al. [18] discriminated anomalies using a random forest as the classifier. To multi-classify various anomalies, Snehal et al. [19] combined SVMs and decision trees into a multi-class anomaly detection system that constructs multi-class SVMs using a binary classification tree. Selvakumar et al. [20] proposed a fuzzy and rough set based nearest neighborhood algorithm (FRNN) to classify network trace datasets. Representative machine learning methods based on density estimation, such as the Local Outlier Factor (LOF) [21], Robust Covariance [22], and Isolation Forest (IF) [23], can alleviate the problem of having too little labeled data. However, with the rapid development of networks, network traffic with ultra-high dimensionality has become ubiquitous, rendering these methods ineffective.

DNN-based work
More recent works are based on deep neural networks (DNNs), and DNN-based algorithms have also been widely used in network anomaly detection. In the initial stage, Ingre et al.
Torres et al. [24] used recurrent neural networks (RNNs) to capture the temporal features of network data. Deng et al. [25] combined a structure learning approach with graph neural networks, additionally using attention weights to provide explainability for the detected anomalies. Kwon et al. [26] established three different convolutional neural network (CNN) architectures of increasing structural scale to improve network anomaly detection performance. In another study on the same task, Zhao et al. [27] suggested a network intrusion detection framework using a DBN and a probabilistic neural network, and demonstrated that this combination performed better than the non-optimized DBN.
In the next stage, Pang et al. [28] adopted a reinforcement learning method called DPLAN that optimizes learning from labeled and unlabeled abnormal data to identify unknown anomalies. Wang et al. [29] proposed an unsupervised representation learning method called RDP that learns data distances in a random projection space by training a neural network with random mappings. Pang et al. [30] proposed DevNet, which realizes anomaly score learning through neural deviation learning and optimizes the anomaly score representation by integrating a neural network, a Gaussian prior, and a Z-Score-based deviation loss function. The autoencoder (AE) [31], variational autoencoder (VAE) [32], and deep autoencoding Gaussian mixture model (DAGMM) [33] have successively been used for abnormal data detection; these methods model the data distribution and derive anomaly scoring criteria based on Gaussian mixtures. In a follow-up study, Zhai et al. [34] proposed an energy-based model, DSEBM, that uses the accumulated energy between class denoising autoencoder layers to obtain the anomaly score. Lately, Mirsky et al. [35] proposed Kitsune, a plug-and-play network intrusion detection system (NIDS) that can learn to detect attacks on the local network without supervision.

GAN-based work
Finally, generative adversarial networks (GANs) were applied to network anomaly detection. Their common practice is to determine whether a test sample is in an abnormal state by measuring the discrepancy between the test sample distribution and the learned distribution. AnoGAN [10] generates real-space samples from the latent space and defines anomaly scores based on the discrepancy between the generated samples obtained from the latent space and the test samples. In addition, this method iteratively optimizes the latent representation via the back-propagation algorithm; this iterative optimization is computationally complex and time-consuming, making it inapplicable to real-time network anomaly detection. Instead of a typical GAN, Efficient GAN-Based Anomaly Detection (EGBAD) [36] first brought the BiGAN architecture to the anomaly detection domain. Lately, ALAD [37] adopted bi-directional GANs that simultaneously learn an encoder network during training. Although this design avoids the computational expense of inferring the latent code, its discriminator is still time-consuming at test time. Because of the success of GANs in generating realistic complex datasets, MAD-GAN [13] used Long Short-Term Memory recurrent neural networks (LSTM-RNNs) as the base models in the GAN framework to capture the temporal correlation of time-series distributions. Mohammadi [38] proposed an end-to-end deep architecture for IDS that uses GANs to train deep models in a semi-unsupervised setting. f-AnoGAN [39] improved computational efficiency by adding an encoder before the generator to map data to the latent space, and proposed three different architectures for this mapping; however, that work lacked an evaluation of the time cost of these architectures. Recently, with GANs emerging as a promising unsupervised approach to detecting cyber-attacks, FID-GAN [40] proposed an unsupervised intrusion detection system (IDS) for cyber-physical systems in a fog architecture, achieving higher detection rates. IGAN [41] tackled the class imbalance problem by generating new representative instances for minority classes, adding an imbalanced data filter and convolutional layers to the typical GAN. ACGAN [42] proposed an auxiliary classifier generative adversarial network to generate synthesized samples to augment intrusion detection datasets.
Proposed method

System model
The architecture of our proposed system model, shown in Figure 1, is based on the MLP framework and consists of three parts: 1) the input part; 2) the GAN part; 3) the anomaly score part. All of them use fully connected layers. The input includes training data, testing data, and a random latent space. The training data samples are the normal data patterns used to train the GAN and the encoder; the testing data patterns are evaluated by our system model. The GAN part is equipped with the discriminator, generator, and encoder. On the left is a GAN framework in which the generator and discriminator are obtained by iterative adversarial training. The encoder is then trained within the MLP architecture while using the trained generator as the decoder. On the right, the input part sends unknown data patterns and real data patterns to be evaluated by the system. On the top, according to the anomaly score computed from the discrimination and reconstruction losses, the anomaly detection system decides whether the evaluated pattern is an anomaly or not.

GAN with MLP
As a powerful model framework, a GAN is suitable for dealing with high-dimensional data such as traffic samples. Designed from game theory, a GAN consists of two adversarial networks: a generator G and a corresponding discriminator D. The generator produces synthetic data samples similar to real sample patterns from a random latent space, while the discriminator distinguishes generated samples from real ones. Following the typical GAN framework, the synthetic samples produced by the generator are passed to the discriminator, which tries to tell the generated (i.e., "fake") data samples apart from the actual (i.e., "real") normal training samples. The two models are trained together in a zero-sum adversarial minimax game, in which the generator tries to maximize the probability of its outputs being recognized as real, while the discriminator tries to minimize that probability. They can therefore be regarded as two agents playing a minimax game with value function V(G, D):

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]     (1)

Because of the heterogeneous network, real-world network traffic data is diverse and complex. To handle such high-dimensional and diversified data, the discriminator and generator are constructed as MLP networks. We assume that network traffic samples are not independent of each other and that unseen relationships exist among them; each layer of the network thus plays an important role in capturing the non-linear and combinatorial features of network data. In our framework, a single MLP network, as one small part of the GAN, captures the correlation characteristics among the data in preparation for the detection task. Besides, our GAN differs from general GANs in its network architecture and training strategy, described in Sections 3.3 and 3.4.
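As a concrete illustration, the minimax value function above can be estimated by Monte Carlo over minibatches. The sketch below uses tiny, untrained numpy MLPs standing in for D and G; all weights, dimensions, and sample sizes are illustrative assumptions, not the paper's actual networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def mlp_forward(x, w1, w2):
    # One-hidden-layer MLP: ReLU hidden layer, scalar sigmoid output in (0, 1).
    h = np.maximum(x @ w1, 0.0)
    return sigmoid(h @ w2)

dim, latent_dim, n = 8, 4, 256
# Hypothetical toy weights standing in for a trained D and G.
w1_d, w2_d = rng.normal(size=(dim, 16)), rng.normal(size=(16, 1))
w1_g, w2_g = rng.normal(size=(latent_dim, 16)), rng.normal(size=(16, dim))

x_real = rng.normal(size=(n, dim))         # samples from p_data
z = rng.normal(size=(n, latent_dim))       # latent samples from p_z
x_fake = np.maximum(z @ w1_g, 0.0) @ w2_g  # G(z)

d_real = mlp_forward(x_real, w1_d, w2_d)
d_fake = mlp_forward(x_fake, w1_d, w2_d)

# Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
value = np.mean(np.log(d_real + 1e-8)) + np.mean(np.log(1.0 - d_fake + 1e-8))
print(value)
```

During training, D takes gradient steps to increase this estimate while G takes steps to decrease it.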

Network architecture
Among recently developed GAN architectures, BiGAN, proposed by Donahue [14], adds an encoder that maps real samples to the latent space, so there is no need to search for the latent state corresponding to a sample at test time. This design saves time by avoiding back-propagation-based inversion. Inspired by the computational efficiency of BiGAN, we build a GAN framework that maps the input data samples to the latent space through an encoder network during training. Our model improves its latent representation of the data and its testing efficiency by adding spectral normalization to the encoder network. The overall architecture of our framework is shown in Figure 2, where x represents real-space variables, z denotes random variables sampled from a latent distribution, ẑ = E(x) is the latent variable obtained by the encoder, and x̂ = G(z) is the new space variable generated by the generator. The framework consists of three main parts: the encoder, the generator, and the discriminator. First, the real data samples are preprocessed to obtain x. Then x and z are input to the encoder and generator networks respectively, yielding an accurate latent distribution for ẑ as well as a reconstructed generated distribution for x̂.
In our model, the weights of the discriminator D are initialized with the Xavier initializer, and D is trained under the gradient-penalty constraint to stabilize discriminator training. It is optimized with Adam to minimize the earth-mover distance between its predictions and the real labels. Its loss is presented as (2):

L_D = (1/n) Σ_{i=1}^{n} [ D(G(z_i)) − D(x_i) + θ (‖∇_{x̂_i} D(x̂_i)‖_2 − 1)^2 ]     (2)

where n is the number of samples and x_i, ∀i ∈ {1, ..., n}, are training data samples, which should be distinguished as real and recognized as normal by our network. The z_i are latent-space samples, whose generated counterparts should be distinguished as fake and detected as anomalies by the discriminator network. We define θ as the penalty coefficient, used to enforce the unit gradient norm constraint, and x̂_i as random samples drawn uniformly along straight lines between pairs of points sampled from the data distribution and the generator distribution. The weights of the generator G are also initialized with the Xavier initializer, and G is trained with Adam to minimize the Wasserstein-1 distance; its objective is to fool the discriminator into the wrong decision of recognizing the generated samples as real. Its loss function is given by:

L_G = −(1/n) Σ_{i=1}^{n} D(G(z_i))     (3)

In the standard GAN architecture, the discriminator D is used to discriminate between real and generated samples. However, the discriminator alone does not perform detection well; [10] indicates that samples reconstructed by the generator can be used to localize the anomalous distribution in classification tasks. Therefore, our method adopts a new strategy to detect minority abnormal samples with a newly designed GAN, computing an anomaly score as the convex combination of a reconstruction loss and a discriminator loss. The reconstruction loss measures the dissimilarity between the evaluated real sample and the generated sample in the input domain space, while the discriminator loss takes the discriminator network output into account. In the adversarial training phase, the implicit representation of evaluated samples learned by the generator always affects the discriminator's decision. Thus, the reconstruction loss is very important, since it can be used to measure the probability of an evaluated sample being an anomaly.
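To make the gradient-penalty construction concrete, the sketch below evaluates a WGAN-GP style discriminator loss for a linear critic, whose input gradient is available in closed form; the critic, the data, and the penalty coefficient are illustrative assumptions, not the paper's trained networks:

```python
import numpy as np

rng = np.random.default_rng(1)

def critic(x, w):
    # Linear critic D(x) = x . w; its gradient w.r.t. the input is w everywhere.
    return x @ w

n, dim, theta = 128, 8, 10.0
w = rng.normal(size=dim)
x_real = rng.normal(size=(n, dim))
x_fake = rng.normal(loc=2.0, size=(n, dim))  # stand-in for generated samples G(z)

# Interpolated samples x_hat = eps * x_real + (1 - eps) * x_fake, eps ~ U(0, 1),
# drawn along straight lines between real and generated points.
eps = rng.uniform(size=(n, 1))
x_hat = eps * x_real + (1.0 - eps) * x_fake

# For this linear critic the gradient at every x_hat is w, so the penalty
# theta * E[(||grad D(x_hat)|| - 1)^2] collapses to a single term.
grad_norm = np.linalg.norm(w)
penalty = theta * (grad_norm - 1.0) ** 2

# Critic loss: E[D(G(z))] - E[D(x)] + gradient penalty
loss_d = critic(x_fake, w).mean() - critic(x_real, w).mean() + penalty
print(loss_d)
```

With a neural critic the gradient would instead be obtained by automatic differentiation at each x_hat, but the structure of the loss is the same.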
To compute the reconstruction loss L_R, one must first find the latent-space representation corresponding to the sample being evaluated. The literature [13] has shown that computing L_R by inverting the generator is time-consuming. To compute L_R faster, [40] proposed an encoder that maps from the data pattern space to the latent space directly, training it within an auto-encoder [43]. For this purpose, our architecture builds a newly designed encoder that maps data patterns to the latent space. In contrast to [40], which trains the encoder through an auto-encoder, our encoder is trained with a simple MLP network; the resulting performance makes it more suitable for detecting emerging intrusions.
To compute L_R efficiently, we utilize a simple MLP network rather than an auto-encoder to train the encoder E. The encoder obtains latent-space representations of real data patterns by mapping them to the latent space; its architecture is shown in Figure 3. In addition, we train the generator as the decoder part of the auto-encoder, ensuring that x and the corresponding reconstruction G(E(x)) are as similar as possible. Figure 4 shows the relationship between the encoder and generator space mappings. To stabilize network training, spectral normalization (SN) [44] is applied to normalize the weight matrices of the encoder's fully connected layers. Compared with the encoder of [40], our encoder with spectral normalization is better able to learn the latent representation of the data distribution. Moreover, the encoder is trained by measuring the Euclidean distance between the input data x and the reconstructed data G(E(x)):

L_E = (1/n) Σ_{i=1}^{n} (x_i − G(E(x))_i)^2     (4)

where n is the data dimension and the subscript i indexes its components.
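The SN step used in the encoder can be sketched as follows. This is a generic power-iteration implementation of W/σ(W), where σ(W) is the largest singular value, and is not the authors' exact code:

```python
import numpy as np

def spectral_normalize(w, n_iter=50):
    """Divide a weight matrix by its largest singular value (power iteration)."""
    rng = np.random.default_rng(0)
    u = rng.normal(size=w.shape[0])
    for _ in range(n_iter):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    sigma = u @ w @ v  # estimate of the largest singular value of w
    return w / sigma

w = np.random.default_rng(3).normal(size=(16, 8))
w_sn = spectral_normalize(w)
# After normalization the spectral norm is 1, so the layer is 1-Lipschitz.
print(np.linalg.norm(w_sn, ord=2))
```

In a training framework the normalization would be re-applied after every weight update, typically with a single power-iteration step amortized across updates.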
The discriminator network D is the remaining part of the whole architecture; together with the generator and encoder, it constitutes our GAN. However, even many modified loss functions can misbehave in the presence of a good discriminator [45]. Specifically, real-world network traffic sample quality is often poor, since abnormal samples are the minority and unusual. Thus, although the WGAN value function appears to correlate with sample quality, it does not make generator optimization easier, which results in undesirable detection performance. Unlike approaches that directly minimize the training loss, adding a gradient-penalty constraint to our discriminator is a better way to optimize adversarial training. The gradient penalty is a model-level constraint that does not affect the learning capacity of the neural network; hence it allows the discriminator to approximate the Wasserstein metric more accurately. The input layer receives the data pairs (x, E(x)) and (z, G(z)). These paired data patterns contain more hidden information and thus promote the discriminator's detection performance. In addition, spectral normalization is applied to the input layers, which not only stabilizes discriminator training but also improves discriminator performance through reparameterization. Moreover, it also updates the weights of the hidden layer, which compensates for the lack of a gradient penalty in multi-category real-sample application scenarios [46].
Figure 5 describes the architecture of the discriminator, where x, G(z), E(x), and z are input variables: x is a data sample from the real distribution, z represents the latent space, G(z) is the generated data sample obtained by the generator, and E(x) is the latent representation produced by the encoder. We first input z and E(x) into the x_layer, and input x and the generated sample G(z) into the y_layer, to obtain the vectors fed to the intermediate layer. Spectral normalization is added to each of the two input layers. The intermediate layer embedded in the discriminator helps to better evaluate the difference between the pairs of discriminator inputs. This layer softens the decision of the discriminator, yielding a more moderate result, and is also used to compute the feature-matching loss. Finally, the output layer is trained with the discriminator loss function d(·) to obtain the final loss value from the intermediate layer's output.
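A toy forward pass of this paired-input discriminator might look like the following. The layer sizes, random weights, and single intermediate layer are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(4)

def relu(v):
    return np.maximum(v, 0.0)

dim, latent_dim, hidden = 8, 4, 16

def discriminator(data, latent, w_in, w_mid, w_out):
    """Score one (data, latent) pair; returns (score, intermediate features)."""
    pair = np.concatenate([data, latent])  # concatenated pair, e.g. (x, E(x))
    h = relu(pair @ w_in)                  # input layer (spectrally normalizable)
    f = relu(h @ w_mid)                    # intermediate layer, reused for feature matching
    return f @ w_out, f

# Hypothetical toy weights for the input, intermediate, and output layers.
w_in = rng.normal(size=(dim + latent_dim, hidden))
w_mid = rng.normal(size=(hidden, hidden))
w_out = rng.normal(size=hidden)

x, ex = rng.normal(size=dim), rng.normal(size=latent_dim)      # real pair (x, E(x))
z, gz = rng.normal(size=latent_dim), rng.normal(size=dim)      # fake pair (z, G(z))

score_real, f_real = discriminator(x, ex, w_in, w_mid, w_out)
score_fake, f_fake = discriminator(gz, z, w_in, w_mid, w_out)
print(score_real, score_fake)
```

The returned intermediate features f are what a feature-matching loss would compare between real and reconstructed pairs.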

Training strategy
In this paper, GANAD is an anomaly detection approach based on the discriminator, which evaluates how different a sample's distribution is from the rest of the data. We introduce a WGAN-based model to simulate the data distribution precisely, then define anomaly scores for all samples by quantifying the difference between the distributions of real and generated samples. Finally, we separate anomalies from normal samples through a testing criterion.
To this end, we first need to model the data distribution accurately: the generator is used to learn the normal data distribution until the generated distribution approximates it, i.e., p_G(x) ≈ p_X(x), where p_G(x) is the distribution learned by the generator and p_X(x) is the real data distribution. Our model obtains a better latent-space representation through the encoder and is thus able to reconstruct the real distribution precisely; in addition, the generator better simulates the true distribution through adversarial learning.
In this context, we redefine a novel anomaly score that measures the attributes of a data sample. Generally, a GAN learns the latent feature space of the data via the generator network and determines whether a test sample is abnormal by calculating its probability of being normal, so the residual between real and generated examples is commonly defined as the anomaly score. Unlike this definition, we use the reconstruction loss and the discriminator loss at the last update iteration of the latent-space mapping procedure in the testing stage as the two scores that constitute the anomaly score. One is the reconstruction score R_rec, for which we use the generator to measure the dissimilarity between the generated and real samples in the real space. The other is the discriminator score D_d, which measures the dissimilarity between the generated and real samples during adversarial training. Inspired by [10], we use a convex combination of the reconstruction error and the discriminator error to judge whether a sample is abnormal. The anomaly score in this paper is therefore designed as shown in (5):

A(x) = α R_rec(x) + (1 − α) D_d(x)     (5)

where α is a constant between 0 and 1, and the reconstruction score R_rec and discriminator score D_d are defined by the reconstruction loss L_R and the discriminator loss L_D. Here, we define L_R as the reconstruction loss function shown in (6), a cost function used to measure the variability between the test and generated samples, where G(E(x)) denotes the sample reconstructed from the latent space corresponding to x:

L_R(x) = ‖x − G(E(x))‖_1     (6)
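The convex combination above can be sketched directly. In this toy example the reconstruction residual uses an L1 distance and the discrimination term uses feature matching on discriminator features; the inputs, features, and α value are illustrative assumptions:

```python
import numpy as np

def anomaly_score(x, x_rec, f_x, f_rec, alpha=0.9):
    """Convex combination of reconstruction and discrimination scores.

    x     : evaluated sample;  x_rec : its reconstruction G(E(x))
    f_x   : discriminator intermediate features of the real pair
    f_rec : discriminator intermediate features of the reconstructed pair
    """
    l_r = np.abs(x - x_rec).sum()    # reconstruction loss (L1 residual)
    l_d = np.abs(f_x - f_rec).sum()  # feature-matching discrimination loss
    return alpha * l_r + (1.0 - alpha) * l_d

rng = np.random.default_rng(2)
x = rng.normal(size=32)
good_rec = x + rng.normal(scale=0.01, size=32)  # near-perfect reconstruction (normal)
bad_rec = x + rng.normal(scale=1.0, size=32)    # poor reconstruction (anomalous)
f = rng.normal(size=8)

s_normal = anomaly_score(x, good_rec, f, f + 0.01)
s_anom = anomaly_score(x, bad_rec, f, f + 1.0)
print(s_normal < s_anom)  # anomalous samples score higher
```

A threshold on this score then yields the binary normal/anomaly decision.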
The encoder and generator collaborate to reconstruct the input: the input data is passed through the encoder and then the generator to produce the output, and the reconstruction error is the discrepancy between input and output. L_D denotes the discriminator loss function, for which we have two expressions. As shown in (7), the first uses the cross-entropy loss function δ to represent the difference between the source representation of real samples x and the latent representation E(x). Next, (8) defines L_D by the feature-matching loss, which evaluates whether the reconstructed data has features similar to the true sample inside the discriminator.
In the discriminator loss function, f_i denotes the i-th intermediate layer of the discriminator network, and f_i(·) is that layer's output. Specifically, f_i(·) plays an important role in the training procedure, mapping the data space to the feature space. Generally, when the dataset is relatively large, multiple intermediate layers help to better evaluate the difference between a pair of discriminator inputs in feature terms; in our network we use only one layer. Adding an intermediate layer, to which we apply L1 regularization as an auxiliary to the main objective, enables us to capture the rich feature information of the samples.
For the binary classification task, we only need to distinguish whether a sample is anomalous or normal, and we define the reconstruction error of the sample as its score. As shown in (9), we propose a cost function L_r to compare the reconstructed sample G(E(x)) with x. Generally, the cross-entropy loss is adopted to train the generator as a classifier, and we obtain the residual between them via this loss. To simulate a weak anomaly-supervised signal over the data distribution, we use objective function (10) to make the generator produce samples that match the statistics of real data. Typically, the standard cross-entropy loss is used to enable the discriminator to correctly distinguish real from generated samples; however, for multi-classification tasks, the feature-matching loss improves GAN training performance in our work more than other loss functions.
Our goal is to obtain a precise anomalous data distribution for identifying various anomalies. To achieve this, we use objective function (11) to train the discriminator as a classifier, with λ = 1 for the binary classification task and λ = 0 otherwise.

Model training
To make our model training more stable, we make a series of improvements to the network structure and training strategy. Inspired by SN, we apply spectral normalization in the discriminator network to impose the Lipschitz constraint needed to reach the saddle point of the discriminator-based loss function. A discriminator trained with SN lets the parameter matrix use as many features as possible for discrimination while satisfying the local 1-Lipschitz constraint. Unlike the Wasserstein-distance-based GAN (WGAN) [47], which enforces the 1-Lipschitz constraint by weight clipping, we adopt the gradient penalty [45] as the gradient regularization method, which avoids gradient explosion or vanishing. The saddle-point problem min_{G,E} max_D V(D, E, G) includes the gradient regularization V_gr(D) on the discriminator and spectral normalization V_sn(D, E) on the discriminator and encoder; Equation (12), defined below, solves this saddle-point problem.
We fine-tune the model training objective as shown in (13) and (14), where D, E, and G denote the discriminator, encoder, and generator respectively. Equation (13) states that the training objective is to minimize the loss of the generator and encoder while maximizing the loss of the discriminator. Equation (14) represents the optimization of the training process: p_X denotes the distribution of data samples and p_Z the distribution over the latent space, while p_E(z | x) and p_G(x | z) are the joint distributions learned by the encoder and generator respectively. w_gp denotes the Wasserstein distance with a gradient penalty term, which constrains the parameters to satisfy 1-Lipschitz continuity, and w is the plain Wasserstein distance. In our model we apply w_gp to train the discriminator D, while w is used to train the encoder and generator. The improved discriminator loss thus solves two problems of WGAN caused by gradient clipping, namely parameter concentration and gradient vanishing or explosion, by adding the gradient penalty mechanism. E, like G, is a non-linear parametric function and can likewise be trained using the Wasserstein distance. Our algorithm applies the above optimizations in the training process and adapts them to different tasks. In a nutshell, it can be described as shown in Algorithm 1:

Algorithm 1
...
4: ẑ ← E(x), encode samples
5: x̂ ← G(z), reconstruct samples
6: W_sn(W_E) := W_E/σ(W_E), updating weights of the encoder
7: W_sn(W_D) := W_D/σ(W_D), updating weights of the discriminator
8: until training reaches stability
9: end for
10: for detection-steps do
11:   procedure Inference
12:     if binary classification then ...
15:   end procedure

Datasets processing
To evaluate the performance of our model, we run binary-classification and multi-classification experiments on the KDDCUP'99 (10 percent) [48], NSL-KDD [48], and UNSW_NB15 [49] benchmark datasets. KDDCUP'99 (10 percent) is widely used for testing network anomaly detectors; it is built from the data captured in DARPA'98 and simulates four attack scenarios: DoS, probe, U2R, and R2L. NSL-KDD is an iterated and updated version of KDDCUP'99 that discards the shortcomings of the earlier dataset (redundant records, duplicate records, and data imbalance), making the attacks more realistic. UNSW_NB15 mixes real modern normal traffic with comprehensive modern attack activities and best simulates traffic in a real network environment; it covers a wide range of attack scenarios spanning nine different families of attacks, such as backdoors, DoS, exploits, fuzzers, and worms. For each dataset, we construct a training set containing only normal data and a testing set containing both normal and attack data, according to the contamination rate.

KDDCUP'99 contains 805,050 records with 41-dimensional features of three types: inherent features, content features, and traffic features. The originally discrete inherent features 'protocol_type', 'service', 'flag', 'land', 'logged_in', 'is_host_login', and 'is_guest_login' are processed with dummy coding or one-hot coding. Following the dataset setup of [36], we label the 'normal' samples as 'abnormal' and the 'abnormal' samples as 'normal'; since the detection in this experiment is pure binary classification, this trick does not affect the identification ability of the model. Next, we randomly divide the original dataset (about 500,000 samples) into two groups, choose the normal-labeled samples from one group as the training set, and discard the abnormal samples. Finally, normal- and abnormal-labeled samples are selected for the test set according to the contamination rate.

NSL-KDD contains 148,517 data patterns with 41 features plus an additional class label per sample. It represents nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. We group the classes into five categories, DoS, probe, U2R, R2L, and normal, and conduct additional multi-classification experiments on this dataset. In the preprocessing stage, we first combine the training and test sets; each categorical feature is then encoded into a one-hot or dummy vector and scaled to the range [0, 1]. The split into training and test sets follows that of KDDCUP'99.

UNSW_NB15 is created by the IXIA PerfectStorm tool in the Cyber Range Lab of UNSW Canberra. It covers in-depth characteristics of network traffic, containing 257,673 samples with 49-dimensional features composed of flow features, basic features, content features, time features, and additional generated features. We use dummy encoding or one-hot encoding to process the three main nominal features (protocol type, state type, and service), transforming these discrete features into numeric ones. The split into training and testing sets follows that of KDDCUP'99.
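The nominal-feature encoding and [0, 1] scaling described above can be sketched with pandas; the tiny frame below uses fabricated values and only a few KDDCUP'99-style columns, purely for illustration.

```python
import pandas as pd

# Toy rows with a few KDDCUP'99-style columns (values are fabricated).
df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp"],
    "service": ["http", "domain_u", "ftp"],
    "flag": ["SF", "S0", "SF"],
    "src_bytes": [181, 239, 5450],
})

# Dummy / one-hot encode the discrete nominal features.
encoded = pd.get_dummies(df, columns=["protocol_type", "service", "flag"])

# Min-max scale the numeric column into [0, 1].
col = encoded["src_bytes"]
encoded["src_bytes"] = (col - col.min()) / (col.max() - col.min())
```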

Simulation experiments
The network anomaly detection problem targets anomalies without prior knowledge, relying on the difference between the latent space distributions of the minority abnormal samples and the majority normal samples. To this end, we use the additional encoder to model the latent space distribution of the data examples, and we assume that all training data patterns are normal. Moreover, we apply spectral normalization when training the MLP networks with hidden layers for the encoder and discriminator. We use an MLP of depth 3 with 1 intermediate layer for the discriminator, and MLPs of depth 3 with 1 hidden layer for the generator and encoder. To generate better samples, we find that a latent space dimension of 32 is the best choice for our study. Moreover, by introducing an encoder, our proposal is expected to improve both detection precision and detection efficiency. We therefore compare our method to the works in [11,13,37,40], which all detect anomalies using a GAN architecture together with an additional network that helps reconstruct data samples. Detection performance is evaluated using precision, recall, and F1 score; detection efficiency is evaluated using the mean computing time from the beginning of training to the end of testing. In addition, we evaluate the effect of combining the gradient penalty term (GP) and spectral normalization (SN); these ablation studies are reported in Section 4.5. We expect a better detection rate when both GP and SN are combined.
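The three MLPs just described (depth 3, one hidden or intermediate layer, latent dimension 32) can be sketched as follows; the hidden width of 64 and the activation functions are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

LATENT_DIM = 32  # latent space dimension used in our experiments

def build_encoder(input_dim, hidden=64):
    # depth-3 MLP with one hidden layer: data space -> latent space
    return nn.Sequential(nn.Linear(input_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, LATENT_DIM))

def build_generator(output_dim, hidden=64):
    # depth-3 MLP with one hidden layer: latent space -> data space
    return nn.Sequential(nn.Linear(LATENT_DIM, hidden), nn.ReLU(),
                         nn.Linear(hidden, output_dim))

def build_discriminator(input_dim, hidden=64):
    # depth-3 MLP with one intermediate layer: data space -> critic score
    return nn.Sequential(nn.Linear(input_dim, hidden), nn.LeakyReLU(0.2),
                         nn.Linear(hidden, 1))

E, G = build_encoder(41), build_generator(41)
x_rec = G(E(torch.randn(4, 41)))   # encode, then reconstruct
```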

Comparison algorithms
To validate the effectiveness of our approach, we compare it with other anomaly detection methods: Isolation Forest (IF) [23], One-Class Support Vector Machine (OC-SVM) [50], the autoencoder-based model DAGMM [33], and the GAN-based models AnoGAN [10], ALAD [37], MAD-GAN [13], and FID-GAN [40]. The following is a brief introduction to these methods: Isolation Forest (IF) is a classical traditional machine learning method, generally used for anomaly detection on structured data. Anomalies are defined as "outliers that are easy to isolate", which can also be understood as points in sparse regions of the feature space. First, trees are constructed on randomly selected features using randomly selected split values. Then, the anomaly score is derived from the average path length from a specific sample to the root.
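A minimal scikit-learn sketch of Isolation Forest on synthetic 2-D data (the data here is fabricated, purely for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 2))     # dense "normal" cluster
outliers = rng.uniform(-6.0, 6.0, size=(10, 2))  # sparse scattered points

clf = IsolationForest(n_estimators=100, random_state=0).fit(normal)
labels = clf.predict(np.vstack([normal, outliers]))  # +1 inlier, -1 anomaly
path_scores = clf.score_samples(normal)              # higher = more normal
```

Samples in sparse regions are isolated by few random splits, so their average path length is short and their anomaly score is high.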
One-Class Support Vector Machine (OC-SVM) is an unsupervised novelty detection method built on libsvm that evaluates a high-dimensional distribution by learning a decision boundary around the normal examples.
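The corresponding OC-SVM usage, again on synthetic data; nu (an upper bound on the fraction of training errors) is a hypothetical choice here:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, size=(300, 2))   # normal examples only

# Learn a decision boundary enclosing the normal data.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(train)

# A point near the cluster vs. a point far outside it.
preds = ocsvm.predict(np.array([[0.0, 0.0], [8.0, 8.0]]))
```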
Deep Autoencoding Gaussian Mixture Model (DAGMM) is an autoencoder-based anomaly detection method that uses the likelihood of a sample's latent and reconstruction features as the criterion for anomaly detection. Its main idea is to first train an autoencoder to generate both the latent space features and the reconstruction features of a sample, and then train an estimation network that outputs the parameters of a Gaussian mixture model over the low-dimensional latent space for sample modeling.
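The reconstruction-error idea underlying autoencoder-based detectors such as DAGMM (with its GMM estimation network omitted) can be sketched in a few lines; all sizes are illustrative.

```python
import torch
import torch.nn as nn

# Compress 41-dimensional records to an 8-dimensional code and back.
# DAGMM additionally feeds latent and reconstruction features to a GMM
# estimation network, which this sketch leaves out.
autoencoder = nn.Sequential(
    nn.Linear(41, 8), nn.ReLU(),
    nn.Linear(8, 41),
)

def reconstruction_error(x):
    # per-sample mean squared error, usable as an anomaly score
    return ((autoencoder(x) - x) ** 2).mean(dim=1)

scores = reconstruction_error(torch.randn(16, 41))
```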
AnoGAN is the first GAN-based anomaly detection method. It uses the common DCGAN architecture for unsupervised learning of the latent space distribution of normal samples. In the inference stage, it recovers the latent representation of each test sample and flags the sample as abnormal when the resulting score exceeds a certain threshold.
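AnoGAN's inference step, recovering the latent representation by gradient descent with the generator frozen, can be sketched as follows; the tiny MLP generator stands in for DCGAN, and the step count and learning rate are assumptions.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 41))
for p in G.parameters():
    p.requires_grad_(False)          # the trained generator stays fixed

x_test = torch.randn(1, 41)          # a test sample to explain
z = torch.randn(1, 8, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.1)

for _ in range(50):                  # iterative latent-space search
    opt.zero_grad()
    loss = ((G(z) - x_test) ** 2).mean()
    loss.backward()
    opt.step()

residual = ((G(z) - x_test) ** 2).mean().item()  # anomaly-score component
```

This per-sample optimization loop is precisely why AnoGAN's inference is slow, and why later methods (including ours) replace it with an encoder that maps x to z directly.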
ALAD is a bidirectional-GAN-based adversarial learning method for anomaly detection that captures adversarially learned features for the detection task. The reconstruction error is then used to determine whether a data sample is abnormal. The model is built on cycle-consistency losses in both the real space and the latent space, together with stabilized GAN training, and its anomaly detection performance achieves state-of-the-art results.
MAD-GAN is a multivariate anomaly detection method built on the GAN framework that detects attacks using a novel anomaly score called the DR-Score. This score exploits both the discriminator and generator networks, which are LSTM-RNNs, by computing and combining a reconstruction loss with a discrimination loss.
FID-GAN is a recent fog-based, unsupervised intrusion detection method for CPSs using GANs. It is designed for a fog architecture, which brings computation resources closer to the end nodes and thus helps meet low-latency requirements.

Detection performance
We use precision, recall, and F1 score as the metrics to evaluate anomaly detection performance. The experimental results on the three datasets are shown in Tables 1 and 2, which demonstrate that our method outperforms the GAN-based methods AnoGAN, ALAD, MAD-GAN, and FID-GAN. From the experimental results, in general, we observe the following:
• For the KDDCUP'99 dataset, our approach is about 2% to 3% higher than the best baseline method on each metric, and it also surpasses the other SOTA methods in terms of accuracy. The results on the NSL-KDD and UNSW_NB15 datasets in Table 1 show that these datasets are more complex and more challenging to detect intrusions on, since their precision values are in general lower than those on KDDCUP'99.
• For the NSL-KDD dataset, our approach significantly outperforms all the above methods on all metrics. The results also show that a lack of sufficient training data leads to poor performance for all data-driven deep learning methods. We conjecture that MAD-GAN and FID-GAN perform poorly because the NSL-KDD dataset is more complex and more challenging. Since our method can accurately learn the distributions of real normal and generated data and identify the subtle differences between them, it detects efficiently even when the dataset is not large.
• On the UNSW_NB15 dataset, our method is slightly better than ALAD and AnoGAN, and it outperforms the other methods in terms of accuracy and F1 score. In terms of precision, IF is higher than AnoGAN; however, IF is weaker than the deep learning methods on the other two metrics, which again verifies the deficiency of traditional machine learning methods in discriminating anomalies. FID-GAN's precision seems poor even though it achieves a near-100% recall value; this is unacceptable in a real-world setting because the number of false positive samples is too large.
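For reference, the three reported metrics are computed from binary predictions (1 = anomaly) as follows; the labels below are toy values chosen for illustration.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # ground truth (toy)
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # detector output (toy)

p = precision_score(y_true, y_pred)  # TP / (TP + FP) -> 0.75
r = recall_score(y_true, y_pred)     # TP / (TP + FN) -> 0.75
f1 = f1_score(y_true, y_pred)        # harmonic mean  -> 0.75
```

A detector that flags almost everything (like the near-100% recall case above) inflates recall while precision collapses, which is why all three metrics are reported together.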
We observe from the results that MAD-GAN does not consider computational complexity, which hinders its intrusion detection performance, and that FID-GAN, the improved version of MAD-GAN, is still inefficient. In addition, the computing optimization adopted by FID-GAN is not well suited to discriminating anomalies, since a gradient-based optimizer easily gets stuck in local optima. Because the KDDCUP'99 dataset itself provides no classification or specification of specific attack categories, it is not suitable for the multi-classification task; of note, MAD-GAN and FID-GAN cannot be employed for specific category discrimination. In contrast to MAD-GAN and FID-GAN, the key ideas that let our approach outperform them are adopting gradient-free computation and detecting anomalies from different perspectives. Meanwhile, Table 2 shows that our approach completely surpasses the various existing baseline approaches, demonstrating excellent detection performance. This is because the novel training strategy used in our method can learn complex data distributions better than other GAN-based methods. Overall, comparing the relative performance of GANAD with other GAN-based methods, we see that they cannot compete with our method because we model the latent space distribution of the samples appropriately.

Time cost performance
To validate the efficiency of our approach, we compare the time spent on the experiments against other GAN-based methods, training every model for 50 epochs. From the results in Figure 6a and b, our method clearly outperforms the other methods. In general, we observe the following:
• Since anomaly detection is a latency-constrained application, the anomaly score must be computed in a short time, which mainly depends on the computation of the discrimination and reconstruction losses; hence most GAN-based methods suffer from high time consumption. On the KDDCUP'99 dataset, the results in Figure 6a show that our approach is significantly superior to AnoGAN. In contrast to our architecture, MAD-GAN and FID-GAN model the data as time series and use RNN-LSTM networks to capture data dependencies; our method, which uses only fully connected layers, therefore requires less computing time. This also indicates that we can handle big high-dimensional data faster and more efficiently. For the NSL-KDD and UNSW_NB15 datasets, the time differences between our method and the other four methods are less pronounced; we conjecture this is because the datasets are not large enough for the experiment. Even so, our evaluation results are still the best.
• For the multi-classification experiments, Figure 6b shows the advantage of our approach over the other two methods: GANAD is good at multi-classification. The time cost of ALAD is about twice that of our approach, which validates the success of combining the gradient penalty with spectral normalization. Finding the latent representation of a sample and computing its reconstruction loss demands time, and the encoder in our architecture enables a major reduction in detection time because it obtains the latent representation of patterns through a direct mapping.
Since training GANs is not always easy due to mode collapse and stabilization issues, this is a disadvantage of using ALAD to improve existing GAN-based IDSs. Contrary to ALAD, our method stabilizes training by adding a constraint to the loss computation.

Ablation studies
To better demonstrate the detection performance of our model, we perform ablation experiments by adding and deleting model components. In particular, we run experiments with and without the gradient penalty term (GP), and with and without spectral normalization (SN), to examine the performance of the full model (with both GP and SN). From the experimental results, in general, we observe the following:
• As shown in Table 3, our approach is overall balanced across all metrics, and adding both SN and GP improves model performance on the UNSW_NB15 dataset. However, adding GP or SN alone does not appear to have a significant effect on the KDDCUP'99 and NSL-KDD datasets.
• From Table 4, we see that the ablation experiments do not clearly reflect the superiority of the overall framework, as the performance of each variant approach is almost the same. In the NSL-KDD experiment, our variant approaches are slightly improved, which indicates that the effect is still present. On the UNSW_NB15 dataset, adding either the gradient penalty or spectral normalization alone only makes the model more stable or more computationally efficient; we conjecture the reason is related to the large differences in the number of attack types.

Conclusion and future work
In this article, we proposed GANAD, a novel GAN-based system specifically designed for detecting network anomalies. The detection is based on a novel training strategy that learns the minority abnormal distribution from normal data patterns better than previous works. In addition, we utilize an additional encoder to map data samples to the latent space, so that the generator loss computation is optimized. Furthermore, to address the severe GAN training instability that hinders the detection task, our discriminator training replaces the JS divergence with the Wasserstein distance plus a gradient penalty. The empirical evaluation on three datasets demonstrates that our model outperforms previous GAN-based models in most cases with respect to recall, precision, and F1 score, while also reducing training cost and time consumption. Moreover, we conducted ablation experiments to validate the efficiency of our method. Our approach thus offers a new way to detect network anomalies. Because information about the distribution of anomalous samples in existing network data is vague and hard to observe, abnormal behavior deviates only slightly from normal behavior and is masked in the data space. In future work, we plan to explore network anomaly detection at a deeper level and investigate the application of GANs in unsupervised network intrusion detection to improve the detection of unknown anomalous traffic.

Figure 1
Figure 1 The complete framework of our system model

Figure 4
Figure 4 The encoder and generator

Figure 5
Figure 5 The architecture of discriminator

Figure 6
Figure 6 Time cost comparison between GANAD and the two other methods

Algorithm 1
Algorithm 1 GAN-based adversarial learning network anomaly detection. Input: x, real space variables; z, latent space variables; E, encoder function; G, generator function; f, the feature layer of D.

Table 3
The bold entries represent the best results of our experimental data compared with other approaches

Table 4
Multi-classification performance of ablation study on two datasets. The bold entries represent the best results of our experimental data compared with other approaches