MFVT: an anomaly traffic detection method merging feature fusion network and vision transformer architecture

Network intrusion detection, which relies on the extraction and analysis of network traffic features, plays a vital role in network security protection. Most current feature extraction and analysis for network intrusion detection uses deep learning algorithms, but deep learning requires large amounts of training resources and handles imbalanced datasets poorly. In this paper, a deep learning model (MFVT) based on a feature fusion network and the vision transformer architecture is proposed, which improves the handling of imbalanced datasets and reduces the sample data resources needed for training. In addition, to improve on traditional raw traffic feature extraction methods, a new raw traffic feature extraction method (CRP) is proposed; CRP uses the PCA algorithm to reduce all the processed digital traffic features to a specified dimension. On the IDS 2017 and IDS 2012 datasets, ablation experiments show that the proposed MFVT model significantly outperforms other network intrusion detection models, and its detection accuracy reaches the state-of-the-art level. Moreover, when the MFVT model is combined with the CRP algorithm, the detection accuracy is further improved to 99.99%.

interest in the research of intrusion detection systems and good detection results have been achieved [8][9][10].
Besides, the detection of anomalous network traffic is an essential classification task within network intrusion detection [11]: researchers must make accurate judgments on the collected network traffic data and detect traffic with offensive behavior. To detect anomalous traffic more effectively, network traffic packets are usually divided into flows according to source IP, destination IP, source port, destination port, protocol, and timestamp [12]. Current anomaly traffic detection technology mainly comprises traditional network anomaly detection techniques and machine learning-based methods. In this paper, deep learning methods were used to classify network traffic. Deep learning methods are end-to-end and extract network traffic features automatically, avoiding the cumbersome process of manual feature extraction, and they have good adaptability, self-organization, and generalization ability. Therefore, deep learning can give a detection system more stable performance and higher detection efficiency [13,14].
However, deep learning technology needs a large amount of labeled data for training, and labeling requires experts with specific knowledge to spend a lot of time, which is time-consuming and laborious. Moreover, most datasets used in deep learning are imbalanced. These problems significantly affect the performance of deep learning models. Under-sampling and over-sampling are commonly used to address data imbalance, but under-sampling discards data and thus loses some features, while over-sampling adds data and thus changes the original data distribution; both affect experimental accuracy [15]. In this paper, the traffic features learned by a two-layer convolutional network are fused, which alleviates the impact of data imbalance on experimental accuracy. Motivated by the outstanding performance of the transformer architecture in natural language processing (NLP) and the limitations of applying it directly to computer vision, Dosovitskiy [16] adapted the transformer and proposed the vision transformer architecture, which treats an image as a sequence of patches for image classification and achieved good results. Experiments also showed that the vision transformer requires fewer training resources. Inspired by the vision transformer architecture, a deep learning model (MFVT) based on a feature fusion network and the vision transformer architecture was proposed in this paper for network anomaly traffic detection. The MFVT model handles imbalanced datasets well and therefore effectively reduces the sample resources required for training. Based on the MFVT model, this paper also studies the influence of learning rate changes and of the number of training epochs on experimental accuracy.
So far, there are many ways to process raw network traffic data, but no uniform standard. Since the inputs a neural network accepts must have the same dimensionality, the extracted network traffic data must be reduced to a specific dimension before it can be used as input to the neural network model. Most traditional methods directly truncate the network traffic data to a fixed dimension; although the results are quite good, there is room for improvement. Therefore, the PCA algorithm is used in this paper to reduce all the processed digital traffic features to a specified dimension. The experimental accuracy obtained on the IDS 2017 [17] and IDS 2012 [18] datasets is significantly higher than that of the traditional methods.
In summary, the main contributions of this paper are as follows.
• A deep learning model (MFVT) based on a feature fusion network and the vision transformer architecture is proposed, which can effectively improve detection accuracy while reducing training resources. On the IDS 2017 and IDS 2012 datasets, the MFVT model achieves the best performance on all evaluation metrics.
• A new raw traffic data extraction algorithm (CRP) is proposed, which uses the PCA [19] algorithm to reduce the processed digital traffic features to a specified dimension. The ablation experiment results show that the detection accuracy is significantly improved compared with traditional methods.
• Based on the MFVT model, the impact of the number of training epochs and of learning rate variation on the detection performance of the model is further studied.
The rest of this paper is organized as follows. Section 2 introduces work related to the model and method presented in this paper, Sect. 3 details the deep learning model and the raw network traffic data processing algorithm, Sect. 4 describes the ablation experiments and experimental results of the MFVT model in detail, and finally, our work is summarized in Sect. 5.

Related work
This section mainly summarizes some documents related to the work of this paper, including intrusion detection and transformer architecture.

Intrusion detection
With the continuous development of artificial intelligence, big data, and cloud computing, intrusion detection technology is constantly being updated with new techniques [20][21][22][23]. In 1980, Anderson [24] proposed the concept of intrusion detection technology, which aims to identify abnormal behaviors in the network in time and reduce the losses they cause. Over the past 40 years, many methods have been applied to intrusion detection, all aiming to sense attacks with good predictive accuracy and to improve real-time prediction. These methods all attempt to extract a pattern from network traffic that distinguishes attack traffic from regular traffic. Table 1 briefly summarizes the methods used in intrusion detection. Currently, the traditional machine learning methods applied to intrusion detection are mainly supervised, such as support vector machines (SVM) [25][26][27], K-nearest neighbors (KNN) [28], and random forests (RF) [29,30]. These methods have a high false alarm rate and a low detection rate for attack traffic. Designing a feature set that accurately reflects traffic characteristics is a common problem in traditional machine learning, and the quality of the feature set directly affects the classification performance. Although many researchers have worked on feature set design in recent years [31,32], how to design a suitable set of traffic features remains an open research topic.
Moreover, deep learning [33] has good self-adaptability, self-organization, and generalization capabilities, so it can largely remove the need to manually design a feature set, as traditional machine learning requires. Deep learning can give detection systems higher detection efficiency and has therefore been widely studied in recent years. Yan [34] constructed an intrusion detection system based on a convolutional neural network (CNN) and applied a generative adversarial network to synthesize attack traces; experimental results verified the effectiveness of the system. Zhang [13] proposed a deep hierarchical intrusion detection model that combines a CNN with a long short-term memory (CNN_LSTM) network, and the CNN_LSTM model achieved good performance on the IDS 2017 dataset. Lin [35] constructed a dynamic network anomaly detection system, which uses a long short-term memory (LSTM) network combined with an attention mechanism to detect anomalies. Zhang [36] proposed a two-layer parallel cross-fusion deep learning model (PCCN), which uses feature fusion to improve feature extraction from small-sample data and showed good performance in ablation experiments. Zhong [37] proposed HELAD, a network anomaly traffic detection algorithm integrating multiple deep learning techniques; although HELAD has better adaptability and detection accuracy, its error rate is slightly higher.

Transformer architecture
The transformer architecture [38] first appeared in the field of natural language processing (NLP) and has occupied an important position there ever since, being continuously improved by subsequent scholars [39]. Vaswani [40] first constructed the transformer architecture based on the attention mechanism. Devlin et al. [41] proposed BERT, a new language representation model, which pretrains a transformer from unlabeled text by jointly conditioning on left and right context; BERT achieved state-of-the-art results on 11 natural language processing tasks at the time.
Influenced by the excellent performance of the transformer architecture on NLP tasks, scholars began to extend it to computer vision and achieved good results. Chen et al. [42] constructed a sequence transformer to perform regression prediction of pixels and obtained competitive results in image classification. In 2020, Dosovitskiy et al. [43] proposed the vision transformer architecture, which uses a pure transformer to directly extract features from sequences of image patches and obtained state-of-the-art performance on multiple image recognition benchmark datasets. Beyond basic image classification tasks [44], transformer models are gradually being applied to various computer vision tasks, and the number of vision models based on the transformer architecture keeps growing. In this paper, the latest intrusion detection model based on feature fusion is improved and integrated with the vision transformer architecture, yielding a deep learning model (MFVT) that combines a feature fusion network with the vision transformer architecture for network anomaly traffic detection. MFVT takes full advantage of the respective strengths of feature fusion and the vision transformer architecture, and further improves the detection accuracy of abnormal network traffic when combined with the proposed CRP algorithm.

Model and methods
This section mainly introduces the CRP algorithm and the MFVT model.
To improve the ability of existing deep learning models to process imbalanced datasets and to reduce the required training set resources, a new model, MFVT, and a new raw data processing algorithm, CRP, were designed in this paper. The MFVT model improves detection on small-sample datasets and reduces training set resources, and CRP effectively removes interfering features from the raw data. Figure 1 shows the entire detection process. The MFVT model is mainly composed of a feature fusion network and the vision transformer architecture. MFVT can use the raw features of network traffic to automatically learn the differences between traffic categories and thus classify anomalous traffic, but the network model requires all input data to have the same dimensionality, so the CRP algorithm was proposed to extract the raw features of network traffic and produce data of identical dimensions.
Fig. 1 Anomaly network traffic detection process

Data processing
The raw data processing algorithm (CRP) proposed in this paper extracts raw traffic data from pcap files and processes it into the two-dimensional matrix required by the network model [45,46]. Figure 2 shows the entire data processing process.
Three steps are required to process the raw flow data into a two-dimensional matrix. The specific steps are as follows.
The first step is to extract the raw data of network traffic from the pcap file and then convert the extracted byte type data into binary type data.
In the second step, the converted packets are divided into flows according to the five-tuple, and the number of packets per flow and the number of bytes per packet are limited when dividing the flows. If a flow contains too few packets, the preceding packet is repeated to fill the gap; if a packet contains too few bytes, it is padded with zeros. For the implementation of this step, refer to the paper [47]. Through these operations, a dataset with fixed dimensions is obtained. The pseudocode is shown in Algorithm 1.
Fig. 2 Overall flow of data processing
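As an illustration, the flow-splitting and padding step described above might be sketched as follows. This is a minimal sketch with hypothetical names (`split_into_flows` and its parameters are ours), not a reproduction of the paper's Algorithm 1:

```python
from collections import defaultdict

def split_into_flows(packets, n_packets=4, n_bytes=100):
    """Group (five_tuple, payload) pairs into fixed-size flows.

    Each flow is truncated/padded to n_packets packets of n_bytes bytes:
    short packets are zero-padded, and missing packets are filled by
    repeating the preceding one, as described in the text.
    """
    flows = defaultdict(list)
    for five_tuple, payload in packets:
        flows[five_tuple].append(payload)
    fixed = {}
    for key, pkts in flows.items():
        pkts = pkts[:n_packets]
        # pad each packet's bytes with zeros up to n_bytes, then truncate
        pkts = [(p + bytes(n_bytes))[:n_bytes] for p in pkts]
        # too few packets: repeat the preceding item
        while len(pkts) < n_packets:
            pkts.append(pkts[-1])
        fixed[key] = pkts
    return fixed
```

The result is a dataset of fixed dimensions (n_packets × n_bytes per flow), ready for the PCA step that follows.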
In the third step, the network traffic data obtained after the first two steps still has high dimensionality and may contain redundant features that are useless for network training, so further extraction is needed. In this paper, the data from the first two steps are fed directly into the PCA algorithm to obtain data of the required dimensionality, and the result is then reshaped into a two-dimensional matrix. The pseudocode is shown in Algorithm 2.
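This PCA step can be sketched directly from Formulas 1-5 described below. The following is a minimal NumPy sketch, not the paper's Algorithm 2 (`pca_reduce` is our name):

```python
import numpy as np

def pca_reduce(X, k):
    """Reduce an n x D data matrix X to k dimensions via PCA."""
    X = X - X.mean(axis=0)            # Formula 1: zero-average the matrix
    c = np.cov(X, rowvar=False)       # Formula 2: covariance matrix c
    b, w = np.linalg.eigh(c)          # Formula 3: eigenvalues b, eigenvectors w
    order = np.argsort(b)[::-1]       # Formula 4: sort by eigenvalue, descending
    p = w[:, order[:k]].T             #            keep top-k eigenvectors as rows of p
    return X @ p.T                    # Formula 5: reduced dataset Y
```

Each row of the returned matrix is one sample expressed in the k principal components; reshaping these rows gives the two-dimensional matrix fed to the network.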
The main idea of PCA is to map the N-dimensional features onto K dimensions; these K orthogonal features, also known as principal components, are reconstructed from the original N-dimensional features, as shown in Formulas 1-5.
Formula 1 indicates that the original data X is arranged into a matrix with n rows and D columns, and the matrix is then zero-averaged; x_ij represents the entry in row i and column j of X. In Formula 2, c represents the covariance matrix of X. Formula 3 expresses obtaining the eigenvalues and eigenvectors of the covariance matrix c: eig() is the eigendecomposition function, w denotes the eigenvectors, and b the corresponding eigenvalues. In Formula 4, the eigenvectors are arranged as rows of a matrix, ordered from top to bottom by their eigenvalues, and the first k rows are taken to form the matrix p, where sort() is the sorting function and select() the selection function. Formula 5 represents the dataset Y obtained after dimensionality reduction. Figure 3 shows the overall structure of the MFVT model, which is composed of two parts.

The structure of MFVT
The first part is the feature fusion network, which is composed of two parallel convolutional branches. The first branch stacks two convolution layers: the first convolution has a stride of 1, the second a stride of 2, and both use a kernel size of 3. The second branch consists of a convolutional layer and a pooling layer, where the convolutional layer has a kernel size of 3 and a stride of 1, and the pooling layer has a stride of 2. The padding size used in both branches is 1. To make full use of the features extracted by the convolution and pooling layers, the extracted features are fused to improve feature extraction for small-sample data. The whole calculation process of the feature fusion network is shown in Formulas 6-16. Formula 6 represents the padding operation, and Formula 7 gives the size of the output matrix of a convolution after padding. With padding_n equal to 1, stride = 1 keeps the output size unchanged, and stride = 2 halves it.
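The size rule just stated can be checked numerically with the standard convolution output arithmetic (our helper `conv_out` implements what Formula 7 describes):

```python
def conv_out(w, kernel=3, stride=1, padding=1):
    """Output size of a convolution: floor((w + 2*padding - kernel) / stride) + 1."""
    return (w + 2 * padding - kernel) // stride + 1

# With padding 1 and kernel 3: stride 1 preserves the size, stride 2 halves it.
```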

Fig. 3 MFVT's overall structure
Here X_O represents the matrix obtained after the original flow data is processed by the CRP algorithm. Because the convolution operation changes the size of the input matrix, the padding operation of Formula 7 is needed to keep the matrix size unchanged. X represents the matrix after the padding operation, x_ij represents a specific data value in the matrix, W is the width of the matrix, and H is the height.
Formulas 6, 8, 9, and 10 represent the entire calculation process of the first branch of the feature fusion network. Formulas 8 and 10 represent the convolution operation: V represents the convolution kernel matrix, v_ij represents a specific value in the kernel, and k represents the kernel size. X_1^1 represents the feature matrix obtained after the first convolution; since the stride is 1, the output size remains unchanged (Formula 7). X_1^2 represents the matrix obtained after the padding operation on X_1^1, and X_1^3 represents the feature matrix obtained after the second convolution; its size is halved because the stride is 2 (Formula 7).
Formulas 6, 11, and 12 represent the entire computational process of the second branch of the feature fusion network, where X_2^1 denotes the feature matrix extracted after the convolution operation (stride = 1 does not change the output size), and X_2^2 denotes the feature matrix obtained after the max pooling operation, which halves the size of the output feature matrix.
Formula 13 shows the scale changes of the features extracted by the first and second branches of the feature fusion network. Formula 14 represents the specific process of fusing the features of the two branches. The fusion is a concatenation along the channel dimension (the channel counts add), while all other dimensions of the data must remain consistent. C represents the number of channels: C(1) means the number of channels is 1, C(32) means 32, and so on. X_f represents the features extracted by the feature fusion network.
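The fusion network described above can be sketched in PyTorch. This is an illustrative sketch under our own class name and channel counts, not the authors' implementation:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Two parallel branches whose outputs are concatenated channel-wise.

    Branch 1: two 3x3 convolutions (stride 1, then stride 2).
    Branch 2: a 3x3 convolution (stride 1) plus 2x2 max pooling (stride 2).
    Both branches halve the spatial size, so their outputs line up for
    concatenation along the channel dimension (Formula 14).
    """
    def __init__(self, in_ch=1, ch=32):
        super().__init__()
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, ch, kernel_size=3, stride=1, padding=1),
            nn.Conv2d(ch, ch, kernel_size=3, stride=2, padding=1),
        )
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, ch, kernel_size=3, stride=1, padding=1),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        # channel-wise concatenation: C(32) + C(32) -> C(64)
        return torch.cat([self.branch1(x), self.branch2(x)], dim=1)
```

For a 1×44×44 input, each branch yields 32×22×22, and the fused output X_f has 64 channels.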
The second part is the vision transformer architecture. To combine the vision transformer architecture with the feature fusion network, the structure of the vision transformer is modified in this paper. Its main components are feature embedding, learnable embedding, and the transformer encoder. For feature embedding, a standard transformer accepts a sequence of token embeddings as input. To process the feature X_f learned by the feature fusion network, we reshape X_f into a flattened sequence of 2D blocks X_p. Formula 15 gives the specific transformation: as in NLP, a classification token is added to the sequence, and the feature map is cut into multiple patches, where p denotes the number of patches.
For learnable embedding, a learnable embedding z_0^0 = x_class is prepended to the sequence of feature block embeddings, where x_class denotes the category vector whose state z_L^0 at the transformer encoder output is used as the feature representation y, as shown in Formula 21. The learnable embedding is randomly initialized at the beginning of training and learned during training.
The transformer encoder consists of several blocks, each containing a multi-head attention block and a multi-layer perceptron (MLP) block, with normalization applied before each block and a residual connection applied after each block. Figure 4 shows the structure of the multi-head attention. Finally, the embedding vectors that combine the category vector and the feature block embeddings are input into the transformer encoder. The encoder, built up from blocks, can extract data features for classification just like a CNN. The whole calculation process is shown in Formulas 18-21. The feature block embedding X_PE^1 and the category vector X_class form the embedding input vector Z_0. Formula 19 adopts a skip connection, where MSA represents the multi-head self-attention operation, LN represents the normalization operation, L represents the number of repetitions, and Z'_l represents the l-th output. Formula 20 also adopts a skip connection, where MLP represents the multi-layer perceptron block and Z_l represents the l-th output. y represents the final feature representation.
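One encoder block as described (pre-norm attention and MLP, each with a residual connection, as in Formulas 19-20) might look like this in PyTorch. The dimensions are illustrative assumptions, not the paper's:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder block: LN -> MSA -> residual, LN -> MLP -> residual."""
    def __init__(self, dim=64, heads=4, mlp_dim=128):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(),
                                 nn.Linear(mlp_dim, dim))

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # Formula 19 (skip connection)
        z = z + self.mlp(self.ln2(z))                      # Formula 20 (skip connection)
        return z
```

Stacking L such blocks and reading off the class-token output gives the representation y of Formula 21.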

Experiments and results analysis
This section first introduces the experimental environment, the IDS 2017 and IDS 2012 datasets used in the experiments, and the evaluation criteria, and finally describes the ablation experiments and some experimental details. In the ablation experiments, a series of advanced models was compared with the MFVT model.

The experimental environment of this paper
In this paper, ablation experiments were conducted on the MFVT model and the CRP data processing algorithm under the environment shown in Table 2.
Fig. 4 Attention structure

Datasets
In this paper, a series of ablation experiments was designed using both the IDS 2012 and IDS 2017 datasets.
The IDS 2012 dataset contains a week of network activity including both normal and malicious behavior: three days consist entirely of normal traffic, and the remaining four days consist of a large amount of normal traffic mixed with a specific type of attack traffic. The attack traffic in the IDS 2012 dataset includes internal penetration, HTTP denial of service, distributed denial of service using an IRC botnet, and brute force cracking of SSH [18].
The IDS 2017 data collection period lasted five days, from 9 a.m. on Monday, July 3, 2017, to 5 p.m. on Friday, July 7, 2017, of which Monday includes only normal traffic. The attacks implemented include brute force FTP, brute force SSH, DoS, Heartbleed, web attack, infiltration, botnet, and DDoS [17].

Evaluation metrics
Authoritative evaluation metrics must be used to judge the merits of a network anomaly traffic detection method. The effectiveness of a machine learning-based network anomaly traffic detection algorithm can be evaluated by the metrics shown in Formulas 22-26 [48]. TP is the number of positive samples predicted as positive by the model, TN the number of negative samples predicted as negative, FP the number of negative samples predicted as positive (which drives the false alarm rate), and FN the number of positive samples predicted as negative (which drives the missed detection rate) [37].
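From these four counts, the usual metrics (what Formulas 22-26 describe) are computed as follows; this is a standard implementation, not the paper's code:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1-score, and false positive rate
    from the confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return accuracy, precision, recall, f1, fpr
```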

Ablation experiment and results analysis
In this paper, the two datasets IDS 2012 and IDS 2017 were used for ablation experiments. In addition, an exploratory study of the impact of model optimization methods on MFVT detection performance was carried out on the IDS 2012 dataset. In the MFVT model, the kernels used in the convolutional neural network are 3 × 3, the patch size set in the vision transformer architecture is 11 × 11, the number of heads in the multi-head attention is 12, and the number of blocks in the encoder is 12. During model training, the batch size is 256, the number of training epochs is set to 100, and the stochastic gradient descent (SGD) optimizer is used to accelerate network convergence; the momentum is fixed at 0.9, the learning rate is fixed at 3e-2, weight_decay is set to 0, and the loss function is CrossEntropyLoss. All ablation experiments and results are described in detail below. Figure 6 shows the training curves of the MFVT model on the IDS 2012 dataset, including training loss, validation loss, and validation accuracy. As the figure shows, the MFVT model converges quickly, but there are large fluctuations in the later stages of training. Table 3 shows the experimental results based on the IDS 2012 dataset. It is obvious from the table that the MFVT model combined with the CRP algorithm proposed in this paper is superior to the other methods on all evaluation metrics, reaching the state-of-the-art level. It can also be concluded from the table that the MFVT model alone performs very well: its detection accuracy is only slightly worse than that of DT (decision tree), but it has higher precision. To better demonstrate the ability of the MFVT model to handle imbalanced data, the results of all evaluation metrics for each type of attack traffic are shown in Table 4.
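The training configuration listed above maps directly onto PyTorch; in this sketch, the `nn.Linear` placeholder stands in for the MFVT model and its shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Placeholder for the MFVT model; only the optimizer/loss setup is of interest here.
model = nn.Linear(121, 12)

# SGD with momentum 0.9, lr 3e-2, weight_decay 0, as in the text.
optimizer = torch.optim.SGD(model.parameters(), lr=3e-2,
                            momentum=0.9, weight_decay=0)
criterion = nn.CrossEntropyLoss()
```

Training then iterates for 100 epochs over batches of 256 samples, calling `optimizer.step()` after each backward pass.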

Ablation experiment based on IDS 2012
Combining Fig. 5b and Table 4 (the experimental results for Infiltrating and Distributed denial, which account for a relatively large proportion, are marked in bold), it can be concluded that the HTTP and BruteSSH traffic, which account for a relatively small proportion, still obtain good experimental results. This shows that the MFVT model has a strong ability to recognize small-sample data. The detection performance is further improved by combining the CRP algorithm with the MFVT model.

Ablation experiment based on IDS 2017
The IDS 2012 dataset contains few types of attack traffic, so the effectiveness of the MFVT model and the CRP data processing algorithm cannot be shown to generalize from this dataset alone. Therefore, ablation experiments were also performed on the more complex IDS 2017 dataset. Figures 7 and 8 show the results of these ablation experiments, from which it can be seen that the precision, recall, F1-score, and accuracy of the MFVT model, and of the MFVT model combined with the CRP algorithm, all reach nearly 100%, significantly better than the other comparison models. The comparison of the FPR between the MFVT model and the other comparison models is shown in Fig. 8, from which it can be seen that the MFVT model is still the best. Combined with Fig. 5a and Table 5 (the experimental results for DDoS, Hulk, and PortScan, which account for a large proportion of the attacks, are marked in bold), it can be concluded that the MFVT model combined with the CRP algorithm has a better ability to recognize small samples.
To further examine the prediction errors of the proposed MFVT model combined with CRP, the experimental results were plotted as the heat map shown in Fig. 9. The heat map shows that the performance of the MFVT model combined with CRP is very high, and the prediction error rate is extremely low.
To verify that the MFVT model can reduce the sample resources required for training, we tested it on the IDS 2017 dataset by reducing the training set size according to Formula 27 with all other conditions held constant: data_0 is the initial training set size, data_n is the updated size, and n is taken according to Formula 28, where n_0, the initial value of n, equals 0.9 and N takes values in the range 1-7. Table 6 shows the test results.
As can be seen from Fig. 10, when the training set is reduced to 80% of its original size, the impact on the overall accuracy on the test set is very small. This experiment shows that the MFVT model combined with the CRP algorithm can effectively reduce the sample resources required for training.

Optimization of MFVT model
To conclude this section, we attempt to further improve the detection accuracy and the stability of the model by increasing the number of training epochs and continuously adjusting the learning rate (lr) during training. IDS 2012 is used as the ablation dataset because it takes less time to train on than IDS 2017. Two sets of experiments were conducted. In the first group, the model was trained for 1,000 epochs and the results were recorded every 100 epochs. In the second group, based on the first, the lr is changed every 100 epochs according to Formula 29, where lr_i is the learning rate after each change, lr_0 is the initial learning rate, and epoch counts each block of one hundred iterations. To ensure the rigor of the experiment, the values were obtained after running the two sets of experiments several times. It can be seen from Fig. 11 that both increasing the training epochs and varying the lr can yield better prediction accuracy in some intermediate results, but the experimental results stabilize in the end. In comparison, varying the lr makes the experimental results more stable.

Results and discussion
Since most deep learning models need a lot of training resources, a network anomaly traffic detection model (MFVT), which combines a feature fusion network with the vision transformer architecture, was proposed. MFVT can reduce training resources while maintaining high detection accuracy. In this paper, a new raw traffic data extraction algorithm (CRP) was also proposed. The MFVT model combined with the CRP algorithm achieved nearly 100% detection accuracy on both the IDS 2012 and IDS 2017 datasets, with much better performance than the other methods in the comparison experiments. The MFVT model combined with the CRP algorithm is more capable of handling imbalanced datasets and further improves detection accuracy. Although the combination performs excellently in anomaly traffic detection, the scalability of the model is weak, and its detection accuracy on new types of attack traffic that do not appear in the training set needs to be improved in the face of the increasingly complex network environment and the emergence of new attack types.
Considering the importance and practical significance of scalability, the scalability of the MFVT model will be further improved in future work to enhance its practical value.