Given the volume and load of network traffic data to be processed, a traditional one-dimensional convolutional neural network cannot remain lightweight while identifying the types and categories of encrypted and malicious traffic with high precision. We therefore add a normalization module and an attention-mechanism module to the model.

## 5.1 Batch Normalization Addition

When designing the convolutional neural network model, we add a Batch Normalization (BN) module to the standard convolutional layers [35]. Internal Covariate Shift [36] causes problems such as slow convergence and gradient saturation, which the BN module resolves.

$${y}_{i}^{\left(b\right)}=BN\left({x}_{i}^{\left(b\right)}\right)=\gamma \left(\frac{{x}_{i}^{\left(b\right)}-\mu \left({x}_{i}\right)}{\sqrt{{\sigma }^{2}\left({x}_{i}\right)+\epsilon }}\right)+\beta$$

1

\({x}_{i}^{\left(b\right)}\) denotes the value of the \(i\)-th input node of this layer when the \(b\)-th sample of the current batch is fed in; \({x}_{i}=[{x}_{i}^{\left(1\right)}, {x}_{i}^{\left(2\right)}, {x}_{i}^{\left(3\right)},\dots ,{x}_{i}^{\left(m\right)}]\) is a row vector whose length equals the batch size \(m\); \(\mu\) and \(\sigma\) are the mean and standard deviation over the batch; \(\epsilon\) is a negligibly small constant introduced to prevent division by zero; and \(\beta\) and \(\gamma\) are the learnable shift and scale parameters.
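The per-node computation in Formula 1 can be sketched in NumPy (a minimal illustration, not the paper's implementation; `gamma` and `beta` stand in for the learnable scale and shift parameters):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch-normalize x of shape (batch, nodes) over the batch axis."""
    mu = x.mean(axis=0)                  # per-node mean over the batch
    var = x.var(axis=0)                  # per-node variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta          # scale and shift

# After BN, each node's outputs have mean ~= beta and std ~= gamma.
x = np.random.default_rng(0).normal(5.0, 3.0, size=(64, 4))
y = batch_norm(x, gamma=2.0, beta=1.0)
```

This makes explicit why BN counters Internal Covariate Shift: whatever the distribution of the incoming activations, the layer's inputs are re-centered and re-scaled each batch.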

## 5.2 Attention Mechanism Addition

Due to the uneven distribution of the data, the model pays more attention to the classes with sufficient samples, which degrades the final classification performance. As described in [37], CBAM is a lightweight, general-purpose module that can be inserted into any CNN model, and it plays a non-negligible role in the design of the GAB and CAB [38]. The GAB and CAB learn discriminative features and thereby mitigate the accuracy loss caused by the uneven data distribution.

$${M}_{c\_a}=ReLU\left(Conv2\left(GAP\left({M}_{G-IN}\right)\right)\right)\otimes {M}_{G-IN},\quad M\in {R}^{H\times W\times C},\quad {M}_{G-IN}\in {R}^{H\times W\times {C}^{\prime }},\quad {C}^{\prime }=C/2$$

2

The channel attention feature \({M}_{c\_a}\) is calculated in Formula 2, where \(H\) is the height, \(W\) is the width, \(C\) is the number of channels, \(ReLU\) denotes the ReLU activation function, \(GAP\) denotes global average pooling, and \({M}_{G-IN}\) is the input obtained by a 1×1 convolution layer that halves the number of channels.
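Formula 2 can be sketched as follows (a NumPy illustration under our own simplifying assumptions: `Conv2` is modeled as a 1×1 convolution, i.e. a per-channel matrix multiply with assumed weights `w`):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def channel_attention(m_g_in, w):
    """Formula 2: M_ca = ReLU(Conv2(GAP(M_G_IN))) (x) M_G_IN.

    m_g_in: input feature map of shape (H, W, C')
    w:      (C', C') weights modeling the 1x1 'Conv2' layer (an assumption)
    """
    gap = m_g_in.mean(axis=(0, 1))        # global average pooling -> (C',)
    att = relu(gap @ w)                   # channel attention vector -> (C',)
    return m_g_in * att[None, None, :]    # reweight each channel of M_G_IN

rng = np.random.default_rng(1)
m_g_in = rng.random((8, 8, 4))            # H=8, W=8, C'=4
m_ca = channel_attention(m_g_in, rng.random((4, 4)))
```

The broadcast multiply at the end is the \(\otimes\) in Formula 2: every channel of \({M}_{G-IN}\) is scaled by its attention weight.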

$${M}_{G-OUT}={M}_{c\_a}\otimes ReLU\left(C\_G\left({M}_{c\_a}\right)\right)$$

3

A 1×1 convolution produces the category-wise feature map \({M}^{\prime }\in {R}^{H\times W\times ck}\), where \(c\) is the number of channels allocated to each category and \(k\) is the number of classes. During training, channel dropout retains half of the features, yielding \({M}^{\prime \prime }\); at inference the Dropout is removed, so \({M}^{\prime \prime }={M}^{\prime }\) and all features are used for prediction.
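The channel-dropout behaviour described above can be sketched as follows (our own illustrative NumPy version; the 0.5 keep rate follows the text, while the unscaled masking is an assumption):

```python
import numpy as np

def channel_dropout(m_prime, rate=0.5, training=True, rng=None):
    """Zero out whole channels of M' during training; identity (M'' = M') at inference."""
    if not training:
        return m_prime                        # Dropout removed: use all features
    rng = rng or np.random.default_rng(0)
    keep = rng.random(m_prime.shape[-1]) >= rate
    return m_prime * keep[None, None, :]      # drop roughly half of the channels

m_prime = np.ones((4, 4, 8))
m_pp_train = channel_dropout(m_prime, training=True)
m_pp_infer = channel_dropout(m_prime, training=False)
```

Dropping entire channels (rather than individual activations) forces each category's \(c\) feature maps to carry redundant, independently useful evidence.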

Formula 3 computes the output of the GAB, namely the spatial attention feature map \({M}_{G-OUT}\), which has the same dimensions as \({M}_{G-IN}\). \({M}_{G-OUT}\) preserves the subtle, class-specific details of each network traffic category and serves as the input to the subsequent CAB.
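Formula 3 can be illustrated likewise (a sketch; we model \(C\_G\) as a 1×1 convolution that collapses the channels into a single spatial map, which is an assumption about its form):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def gab_output(m_ca, w_cg):
    """Formula 3: M_G_OUT = M_ca (x) ReLU(C_G(M_ca)).

    m_ca: channel-attended map of shape (H, W, C')
    w_cg: (C',) weights of the assumed 1x1 conv 'C_G' yielding one spatial map
    """
    spatial = relu(m_ca @ w_cg)          # (H, W) spatial attention map
    return m_ca * spatial[..., None]     # same shape as the GAB input

rng = np.random.default_rng(2)
m_ca = rng.random((8, 8, 4))
m_g_out = gab_output(m_ca, rng.random(4))
```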

$${S}_{i}=\frac{1}{n}\sum _{j=1}^{n}GMP\left({m}_{ij}^{\prime \prime }\right),\quad i\in \left\{1,2,3,\dots ,k\right\},\quad S=\left\{{S}_{1},{S}_{2},{S}_{3},\dots ,{S}_{k}\right\}$$

4

In Formula 4, \({S}_{i}\) measures how strongly the feature maps of category \(i\) respond, \(GMP\) denotes global max pooling, and \({m}_{ij}^{\prime \prime }\) is the \(j\)-th feature map of class \(i\) in \({M}^{\prime \prime }\). The score \({S}_{i}\) of each network traffic category is obtained by averaging the global-max-pooled responses of its feature maps.

$${M}_{i\_avg}^{\prime }=\frac{1}{n}\sum _{j=1}^{n}{m}_{ij}^{\prime },\quad i\in \left\{1,2,3,\dots ,k\right\}$$

5

In Formula 5, \({M}_{i\_avg}^{\prime }\) is the averaged feature map of class \(i\), and \({m}_{ij}^{\prime }\) is the response of the \(j\)-th feature map of class \(i\) in \({M}^{\prime }\); the feature maps of each class are summed and averaged.

$${A}_{CAB}=\frac{1}{k}\sum _{i=1}^{k}{S}_{i}{M}_{i\_avg}^{\prime },\quad {A}_{CAB}\in {R}^{H\times W\times 1}$$

6

In Formula 6, \({A}_{CAB}\) multiplies the score of each class by that class's semantic feature map and averages the results over all classes, which helps the model localize the discriminative regions of each traffic category.

$${M}_{C-OUT}={M}_{C-IN}\otimes {A}_{CAB}$$

7

Finally, as shown in Formula 7, \({M}_{C-OUT}\) is obtained by multiplying the CAB input \({M}_{C-IN}\) with the category attention map \({A}_{CAB}\), enabling the model to classify the different network traffic categories more accurately.
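Formulas 4–7 can be chained in one sketch (a minimal NumPy illustration of our own; `m_prime` plays the role of \({M}^{\prime \prime }\) at inference, where \({M}^{\prime \prime }={M}^{\prime }\)):

```python
import numpy as np

def cab(m_c_in, m_prime, k):
    """Category Attention Block, Formulas 4-7.

    m_c_in:  CAB input M_C_IN, shape (H, W, C)
    m_prime: category-wise features M', shape (H, W, c*k)
    k:       number of classes
    """
    H, W, ck = m_prime.shape
    c = ck // k
    groups = m_prime.reshape(H, W, k, c)             # k groups of c feature maps
    # Formula 4: S_i = (1/n) * sum_j GMP(m''_ij)
    S = groups.max(axis=(0, 1)).mean(axis=1)         # (k,) per-class scores
    # Formula 5: M'_i_avg = (1/n) * sum_j m'_ij
    m_avg = groups.mean(axis=3)                      # (H, W, k)
    # Formula 6: A_CAB = (1/k) * sum_i S_i * M'_i_avg
    a_cab = (m_avg * S[None, None, :]).mean(axis=2)  # (H, W)
    # Formula 7: M_C_OUT = M_C_IN (x) A_CAB
    return m_c_in * a_cab[..., None]

rng = np.random.default_rng(3)
m_c_in = rng.random((8, 8, 16))
m_c_out = cab(m_c_in, rng.random((8, 8, 12)), k=4)   # c = 3 channels per class
```

Grouping the channels by class before pooling is what lets the block score each traffic category separately, which is the mechanism that counteracts the uneven class distribution discussed at the start of Section 5.2.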