Visual Analytics for Network Events Classification in LAN With Deep Convolutional Neural Network


This article presents a method for visualizing network traffic in a LAN. The method arranges the communication frequencies of nine protocol types along a Hilbert curve and combines them through array exchange and projection; we call the resulting images feature maps of network events. We simulate several known scan cases in LANs and collect the network traffic to generate feature maps for each case. To solve this multi-label classification task, we train a deep convolutional neural network (DCNN) in two different network environments, with the feature maps as the input data and the scan cases as the labels. We split each dataset into a training set and a validation set at a ratio of 4:1. Based on the micro and macro scores on the validation set, the scheme achieves macro-F-measure scores of 0.982 and 0.975 and micro-F-measure scores of 0.976 and 0.965 in the two LANs, respectively.


Introduction
As more and more devices are connected to the internet, more information is shared online in the form of digital data. In the era of big data, resilient and robust network systems that can protect the privacy of users are critical. In a local area network (LAN), malware delivered, for example, by phishing e-mails can intrude and spread to other hosts, causing leakage of personal information. By spreading further to social media, messaging services, and applications, such attacks can affect multiple aspects of personal life.
With the explosion of information, managing network systems is becoming increasingly difficult. Moreover, the digital nature of network data adds to the complexity of explaining certain events in networks. For big network traffic data, visual analytics is commonly used to convert the data into visual information, and it has been adopted in research to improve the explainability of tasks such as anomaly detection from network traffic.
Meanwhile, with the advancement of machine learning, especially deep learning, it is worth considering deep learning for the onerous analysis of enormous volumes of network traffic. In many studies, machine learning methods such as support vector machines and neural networks are used to detect anomalies in a LAN by deciding whether the status of the network is normal or abnormal. These studies focus mainly on detecting anomalies rather than on explaining the underlying network events.
In this research, we focus on visualizing network traffic by selecting suitable discriminators, and on classifying different events in a LAN with a deep learning method.
When malware intrudes into a LAN and tries to spread to (or steal data from) the other hosts in the LAN, it tries to access specific TCP or UDP ports on all the hosts. We call this kind of activity a network event. In this paper, the network events that our scheme aims to classify are ARP scans, TCP scans, UDP scans, and scans of specific ports.
We propose a scheme that visualizes the network traffic of different network events in a LAN by generating feature maps based on a structure called the Hilbert curve, compressing the protocol information of a certain duration into one image (Fig. 1). We then simulate eight types of network events in a LAN and generate feature maps as the dataset for training a DCNN model that solves this multi-label classification task. Our model detects and classifies these eight types of network events using visual analytics, which provides some explainability for the occurrence of a specific type of attack.
*Correspondence: sywtokyo@hongo.wide.ad.jp. Graduate School of Information Science and Technology, University of Tokyo, Tokyo, Japan. Full list of author information is available at the end of the article.
This paper is organized as follows. Section 2 discusses related work on network anomaly detection with machine learning. Section 3 gives an overview of the scheme, including the visualization of network traffic based on protocol communication information and the classification of network events using a DCNN. Section 4 presents the performance evaluation of the scheme based on the validation scores of the model. Section 5 discusses the contribution of this research and the limitations that future work can address. Section 6 concludes the paper.

Related work
Anomaly detection in computer networks has attracted much attention over more than 40 years of evolution [1]. With the rapid growth and increasing complexity of network infrastructures and the evolution of attacks, identifying and preventing network attacks is becoming increasingly challenging. Traditional approaches apply knowledge-based rules to network communication; once these rules are satisfied, a network event is considered malicious.
Several traditional machine learning methods, such as the support vector machine (SVM) and the neural network (NN), have been used to address network anomaly detection in both personal computers and critical infrastructure [2] [3]. Yang et al. [4] use a restricted Boltzmann machine (RBM) to extract high-level features of traffic data and train an SVM with stochastic gradient descent (SGD) to classify these features. Asmaa et al. [5] presented a comprehensive discussion of using an RBM for feature learning together with a classifier for anomaly detection. Salama et al. [6] presented a hybrid intrusion detection scheme using a deep belief network (DBN) and an SVM, classifying traffic into two clusters: normal or attack. They adopt the DBN to reduce the dimension of the features and the SVM as the classifier. They evaluated this scheme on the NSL-KDD dataset [7] and achieved an accuracy of 0.9.
However, these methods are limited in handling big network traffic data and lack explainability, which is a disadvantage when solving more complicated detection problems in networks. Furthermore, many studies adopt a supervised method that trains a classifier on data labeled as normal or abnormal, so that knowledge about anomalies can be learned and the anomalies detected; the knowledge obtained in this way, however, can be limited.
With the advancement of deep learning in recent years, large-scale analysis of network traffic data has become feasible and has shown strong performance. For instance, Saxe et al. [8] proposed a deep neural network (DNN) based malware detector that employs two-dimensional binary features to detect malware. Yousefi et al. [9] proposed a generative feature learning-based approach for malware classification, in which latent features from the hidden layer of an autoencoder are used for anomaly detection.
Whereas former research aims to detect anomalies in networks, we propose an approach to classify different types of network events in a LAN. Instead of labeling data only as normal or abnormal, we adopt eight types of network event clusters and build a multi-label classifier in a dynamic way. Moreover, we focus on representing the network traffic in a LAN as 2-D image data based on protocol communication frequencies. We then build a dataset to train a deep convolutional neural network (DCNN) to solve this multi-label classification problem. Consequently, the system is expected to handle big network traffic data while keeping the training process stable for the multi-label classification of network events in various LAN environments.

Network traffic visualization in LAN with Hilbert curve
The system we use to collect network traffic under different events consists mainly of two operating terminals and the other hosts connected to the LAN. One terminal serves as an event generator that executes scan commands to generate various network events in the LAN, and the other collects the network traffic of these events at the same time (Fig. 2).

Fig. 2
The events generating and data collecting system in the LAN.
In this research, we use the tool tcpdump to collect network traffic. All traffic broadcast in the LAN or sent directly to the data collector is collected and processed on a daily basis. To generate the network events, we use the tool nmap, which can issue various scan commands in a LAN. The command for each network event is run for two whole days. We then extract the protocol and timestamp information from the collected traffic data in order to visualize the time-series features hidden in the network traffic.
We generate one feature map per constant recording time unit. To define this recording time unit, we first introduce a concept called the fineness, which expresses how finely we analyze the information hidden in big network traffic data. We then define the constant time unit T for recording traffic data and generating each feature map as in (1).
$$T = T_{st} \times \mathrm{fineness} \times \left(\frac{\mathrm{size}}{S_t}\right)^{2} \tag{1}$$
Here, Tst (time standard) is the standard interval for the recording time unit, which is defined as 64 seconds in this research. St is the basic segment parameter, the standard size of a feature map, with a value of 8 pixels. The size parameter gives the length and width of a feature map. For example, when the fineness is 1.0, a feature map with a size of 16 pixels represents a recording time unit of 256 seconds (around four minutes). If we use a size of eight pixels instead, the recording time unit is 64 seconds. As a result, these two parameters (fineness and size) let us represent the traffic features of a time span of fixed length in images of different sizes. Likewise, an image of fixed size can represent the traffic data of time spans of different lengths.
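The relation between fineness, size, and recording time unit can be checked with a few lines of Python. This is a minimal sketch that assumes the recording time unit scales as Tst × fineness × (size / St)², a form that reproduces the worked values given in the text (256 s, 64 s, and 128 s); the function names are illustrative and not taken from the authors' code.

```python
T_ST_SECONDS = 64  # standard interval Tst stated in the text
S_T_PIXELS = 8     # basic segment St stated in the text


def recording_time_unit(fineness: float, size: int) -> float:
    """Seconds of traffic summarised by one size x size feature map (assumed form)."""
    return T_ST_SECONDS * fineness * (size / S_T_PIXELS) ** 2


def seconds_per_pixel(fineness: float, size: int) -> float:
    """Time span covered by a single pixel of the feature map."""
    return recording_time_unit(fineness, size) / (size * size)


if __name__ == "__main__":
    for fineness, size in [(1.0, 16), (1.0, 8), (0.5, 16)]:
        print(fineness, size, recording_time_unit(fineness, size))
```

Under this assumption the fineness is simply the number of seconds represented by one pixel, which matches the 0.5-second records used later in the paper.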
In detail, we count how many communications of each protocol are recorded during every recording time unit and use these counts as the discriminators that represent the features of network traffic under different events. For the sake of explainability in this network events classification, we extract and cluster the communication frequency information of IP, ARP, TCP, HTTP, HTTPS, UDP, mDNS, DHCP, and all other protocols. Then, as the visualizing method, we convert the communication frequencies of these protocols into pixel values using (2), so that each value shows how frequently a specific protocol communicated during a recording unit.
$$p_i = \frac{f_i}{\max_j f_j} \tag{2}$$
where $f_i$ is the frequency of a protocol's communication within the i-th recording unit, the denominator is the maximum of all frequency values within the duration used to generate a feature map, and $p_i$ is the value of the corresponding pixel.
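The conversion from packet timestamps to pixel values can be sketched as follows. This is a minimal illustration, assuming that the pixel value is the per-unit count divided by the maximum count in the same feature map, as the description of (2) suggests; the function name, the argument names, and the choice to return zeros for an idle protocol are assumptions.

```python
def frequency_to_pixels(timestamps, start, bin_seconds, n_bins):
    """Count packets per recording unit and scale the counts to [0, 1]."""
    counts = [0] * n_bins
    for t in timestamps:
        index = int((t - start) // bin_seconds)
        if 0 <= index < n_bins:  # ignore packets outside the recording window
            counts[index] += 1
    peak = max(counts)
    if peak == 0:  # protocol never seen: assumed to map to an all-zero row
        return [0.0] * n_bins
    return [c / peak for c in counts]


if __name__ == "__main__":
    # Four packets of one protocol, recording units of 0.5 seconds
    print(frequency_to_pixels([0.1, 0.2, 0.6, 1.7], start=0.0, bin_seconds=0.5, n_bins=4))
```

For the settings used in this research, `bin_seconds` would be 0.5 and `n_bins` would be 256 per protocol.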
A suitable visualization should offer a dense representation of information, show the time-related relationships between data, and have a computable two-dimensional structure suited to a DCNN model. We therefore adopt a geometric structure called the Hilbert curve, with which we compress the time-related features of network traffic data into a 2D image while keeping the hidden relationships between the data. We project the pixel values computed from the communication frequencies as pixel points into an image. Here, we adopt the fineness described above with a value of 0.5 and an image size of 16 pixels. Thus each feature map of a protocol consists of 256 records (16 × 16), each of which shows the features of a specific protocol's communication within 0.5 seconds (Fig. 3).
To express the time-related features of an event in a LAN, we put the feature maps of the nine protocol clusters into one image (Fig. 1). That is, the statistical information of the nine types of protocols collected in the LAN is represented in different regions of one image (48×48) through the array exchange and projection. The image size of 16 and the fineness value of 0.5 are chosen in consideration of the computing cost of the DCNN model and the fineness of the feature representation. Finally, we obtain a feature map that represents the time-sequential traffic data of a specific network event within 128 seconds in the LAN, using the information of nine types of protocols.
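The projection onto a Hilbert curve and the combination of nine protocol maps can be sketched as below. The index-to-coordinate routine is the widely used iterative Hilbert curve conversion; the 3 × 3 tiling and the order of the protocol tiles are assumptions, since the text does not specify how the array exchange and projection arranges the regions.

```python
def hilbert_d2xy(side, d):
    """Map index d along a Hilbert curve to (x, y) on a side x side grid.

    side must be a power of two and 0 <= d < side * side.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate the quadrant so consecutive indices stay adjacent
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def protocol_map(pixels, side=16):
    """Place a time-ordered list of pixel values along the Hilbert curve."""
    if len(pixels) != side * side:
        raise ValueError("expected side * side pixel values")
    grid = [[0.0] * side for _ in range(side)]
    for d, value in enumerate(pixels):
        x, y = hilbert_d2xy(side, d)
        grid[y][x] = value
    return grid


def tile_feature_maps(maps, per_row=3):
    """Tile equally sized square maps row by row (assumed layout)."""
    side = len(maps[0])
    rows = -(-len(maps) // per_row)  # ceiling division
    image = [[0.0] * (side * per_row) for _ in range(side * rows)]
    for k, m in enumerate(maps):
        top, left = (k // per_row) * side, (k % per_row) * side
        for r in range(side):
            image[top + r][left:left + side] = m[r]
    return image
```

Because consecutive indices on the curve are always neighbouring pixels, records that are close in time stay close in the image, which is the property the convolutional layers can exploit. Nine 16 × 16 maps tiled this way give the 48 × 48 image described in the text.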

Network events generating and traffic data collecting
In this research, we run seven types of scan commands in a LAN: arp scan, tcp scan, scan of tcp port 23, scan of tcp port 80, udp scan, scan of udp port 137, and scan of udp port 1900. We implemented each of these events by issuing the corresponding nmap command from the event generator in the LAN. Network traffic from two network environments, LAN A and LAN B, was collected and used to generate feature maps. LAN A is a network with a variable-length subnet mask of 25 bits, and it serves the institute's critical infrastructure. LAN B, on the other hand, is a general-purpose network shared by several labs for research and other daily operations.
We generated eight types of network events (including the normal state of a network) in LAN A and LAN B separately. We then collected the network traffic through the data collector and generated feature maps of the network events using the approach described above (Fig. 4). In the end, we obtained 8120 feature maps in LAN A and 7125 feature maps in LAN B. We divided each dataset into a training set and a validation set at a ratio of 4:1 (Table 1).

Table 1
Datasets of traffic data's feature maps in LAN A and LAN B.

Fig. 5
The structure of the DCNN model we adopt in this research, consisting of thirteen convolutional layers accompanied by five max-pooling layers, and three fully connected layers.

Network events classification using DCNN
A convolutional neural network (CNN) is a type of deep learning model that is invariant to translation of its input, including time-related data, and is commonly used for image-related problems such as multi-label image classification. A CNN model usually consists of several convolutional layers, some of them followed by pooling layers, and several fully connected layers at the end. The kernels of the convolutional layers and the pooling layers compress the information in the input data, and combining these layers is thought to express the information contained in the image. Finally, to obtain a one-dimensional output for the multi-label classification, fully connected layers are appended to the CNN model so that the output is narrowed to a specific range. Here, the input data are the generated feature maps, and the labels are the corresponding types of network events.
We designed and built a DCNN model based on VGG-16 [11], which consists of thirteen convolutional layers accompanied by five max-pooling layers, followed by three fully connected layers with an output of eight values (Fig. 5). This DCNN model extracts the hidden features of network events inside a feature map. The three fully connected layers then flatten the output of the convolutional and max-pooling layers into a one-dimensional array and compress it into a vector of eight values, which gives the confidences of the eight types of network events.
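To see how a 48 × 48 feature map passes through such a network, the following sketch traces the tensor shapes layer by layer. It assumes the standard VGG-16 block layout (2, 2, 3, 3, 3 convolutions with 64, 128, 256, 512, 512 channels), 3 × 3 convolutions that preserve the spatial size, and 2 × 2 max pooling with stride 2; the text only states the layer counts, so these details and the layer names are assumptions.

```python
# (number of conv layers, output channels) per block, as in standard VGG-16
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def trace_shapes(side=48, blocks=VGG16_BLOCKS):
    """Return (layer name, height, width, channels) after every conv and pooling layer."""
    shapes = []
    for b, (n_convs, channels) in enumerate(blocks, start=1):
        for c in range(1, n_convs + 1):
            # size-preserving convolution: only the channel count changes
            shapes.append((f"conv{b}_{c}", side, side, channels))
        side //= 2  # 2 x 2 max pooling halves height and width
        shapes.append((f"pool{b}", side, side, channels))
    return shapes


if __name__ == "__main__":
    for layer in trace_shapes():
        print(layer)
```

Under these assumptions the five pooling steps reduce the image as 48 → 24 → 12 → 6 → 3 → 1, so the fully connected layers receive a 512-value vector and reduce it to the eight output confidences.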
Moreover, we use the ReLU activation function, defined in (3). After the computation of each layer, this function is applied to the output to make it non-linear.
$$f(x) = \max(0, x) \tag{3}$$
where x is the input of a node, f(x) is its output, and max returns the larger of 0 and x.
At the last layer, for this multi-label problem, we use the softmax activation function, defined in (4). The softmax maps the components of its input, which may be negative, greater than one, or not sum to 1, into the interval (0, 1) so that they sum to 1.
$$\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)} \tag{4}$$
where $x_i$ is the i-th element of the input vector x. The softmax normalizes these values by dividing each exponential by the sum of the exponentials of all the elements.
Training a deep learning model is essentially an optimization problem, and a suitable learning function for updating the weights of the model is important for the classification result. We therefore adopt two learning functions, RMSProp and Adam, train the model with each of them separately, and compare the training results.
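The two activation functions described above, ReLU and softmax, can be written in a few lines of plain Python. This is a minimal sketch; subtracting the maximum before exponentiating is a common numerical safeguard that does not change the result and is not mentioned in the text.

```python
import math


def relu(x):
    """ReLU: the larger of 0 and x."""
    return max(0.0, x)


def softmax(values):
    """Normalise a list of scores into values in (0, 1) that sum to 1."""
    shift = max(values)  # numerical safeguard, mathematically a no-op
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    scores = [-1.0, 0.0, 3.0]  # hypothetical outputs of the last layer
    print([relu(s) for s in scores])
    print(softmax(scores))
```

The largest input keeps the largest output, so the predicted network event is the index of the maximum confidence.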
RMSProp places more emphasis on the latest gradient information than on past gradient information: the past gradients are gradually forgotten, and the new gradients are strongly reflected. It therefore adapts well to time-related data. In its standard form, this learning function is defined as (5) and (6).
$$h_t = \rho\, h_{t-1} + (1 - \rho) \left(\frac{\partial L}{\partial W}\right)^{2} \tag{5}$$
$$W_t = W_{t-1} - \frac{\eta}{\sqrt{h_t}}\, \frac{\partial L}{\partial W} \tag{6}$$
where L is the training loss, W is the weights of the node, $h_t$ is the running average of the squared gradients, $\eta$ is the learning rate, which controls how far the weights are updated at each step, and $\rho$ is the decay rate with a value of 0.9, which determines how strongly past gradient information affects the current update.
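A single RMSProp update can be sketched as follows. This is an illustration of the standard RMSProp rule rather than the authors' implementation; the small constant `eps` in the denominator is an assumption added to avoid division by zero, and the function and argument names are illustrative.

```python
import math


def rmsprop_step(weights, grads, cache, lr=2e-5, decay=0.9, eps=1e-8):
    """Apply one RMSProp update and return (new weights, new cache).

    cache holds the running average of squared gradients for each weight.
    """
    new_cache = [decay * h + (1 - decay) * g * g for h, g in zip(cache, grads)]
    new_weights = [
        w - lr * g / (math.sqrt(h) + eps)
        for w, g, h in zip(weights, grads, new_cache)
    ]
    return new_weights, new_cache


if __name__ == "__main__":
    # One weight with gradient 2.0, starting from an empty cache
    print(rmsprop_step([1.0], [2.0], [0.0], lr=0.1))
```

With a decay of 0.9, each new squared gradient enters the cache with weight 0.1 and older contributions shrink by a factor of 0.9 per step, which is the gradual forgetting described in the text.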
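The other learning function, Adam, with decay values of 0.9 and 0.999, is also used to train the DCNN model, and its result is compared with that of RMSProp. In its standard form, Adam is defined as (7).
$$\begin{aligned} m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, \frac{\partial L}{\partial W}, & v_t &= \beta_2 v_{t-1} + (1 - \beta_2) \left(\frac{\partial L}{\partial W}\right)^{2}, \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^{t}}, & \hat{v}_t &= \frac{v_t}{1 - \beta_2^{t}}, \\ W_t &= W_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \end{aligned} \tag{7}$$
where L is the training loss, W is the weights of the node, $\eta$ is the learning rate, $\beta_1$ and $\beta_2$ are the decay rates with values of 0.9 and 0.999, respectively, and $\epsilon$ prevents the denominator from being zero.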
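For comparison with RMSProp, one Adam update can be sketched in the same style. The sketch follows the standard Adam rule with bias correction; whether the authors' formulation includes the bias-correction terms is not stated in the text, so that part is an assumption, as are the function and argument names.

```python
import math


def adam_step(weights, grads, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update at step t (t starts at 1).

    Returns (new weights, new first moments, new second moments).
    """
    new_m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grads)]
    new_v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grads)]
    # Bias correction compensates for the zero initialisation of m and v
    m_hat = [mi / (1 - beta1 ** t) for mi in new_m]
    v_hat = [vi / (1 - beta2 ** t) for vi in new_v]
    new_weights = [
        w - lr * mh / (math.sqrt(vh) + eps)
        for w, mh, vh in zip(weights, m_hat, v_hat)
    ]
    return new_weights, new_m, new_v


if __name__ == "__main__":
    print(adam_step([1.0], [2.0], [0.0], [0.0], t=1, lr=0.1))
```

On the first step the corrected moments equal the gradient and its square, so the weight moves by almost exactly the learning rate in the direction opposite to the gradient, regardless of the gradient's magnitude.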
We then train this DCNN model from scratch, initializing the weights with small random values in the interval [0, 1.0). We adopt mini-batch training, dividing the input feature maps into small batches with a size of 40, which stabilizes the training process and reduces the training cost. A learning rate ($\eta$) of 0.00002 is used for updating the weights. Early stopping is used to prevent overfitting, a situation in which the validation accuracy keeps decreasing while the training accuracy keeps increasing, by monitoring the validation loss over the most recent five epochs.
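The early stopping rule can be sketched as a small helper that watches the validation loss. This is a minimal illustration assuming that training stops once five consecutive epochs fail to improve on the best validation loss seen so far; the class name, method name, and the example loss values are hypothetical.

```python
class EarlyStopping:
    """Signal a stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.stale_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0  # improvement: reset the counter
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopping(patience=5)
    for epoch, loss in enumerate([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75]):
        if stopper.should_stop(loss):
            print("stop after epoch index", epoch)
            break
```

In a training loop, `should_stop` would be called once per epoch with the validation loss of that epoch.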
As an example of the training process with early stopping, we show the validation loss with RMSProp in LAN A (Fig. 6). As the weights are updated, the validation loss decreases gradually, and training finally terminates because the validation loss has not improved over the last five epochs. After training, the model is expected to identify the eight types of network events in a LAN based on the confidence scores of its predictions.
We trained our model on the datasets from the two LAN environments, LAN A and LAN B, and compared the two learning functions, RMSProp and Adam. The corresponding training graphs for the two LAN environments and the two learning functions are shown in Fig. 7. In LAN A, both learning functions train for more than 25 epochs and reach a final validation accuracy of around 0.98, although Adam seems to outperform RMSProp at first. In LAN B, Adam completes training in only 15 epochs, while RMSProp takes 30 epochs in total. However, the final result of RMSProp is clearly better than that of Adam, with both a better training accuracy and a better validation accuracy, and its training process is also more stable. Based on these comparisons, we choose RMSProp in this research and further evaluate the performance of the scheme on the multi-label network events classification problem.
Furthermore, by visualizing the compressed representation of the feature maps of different network events at the last fully connected layer of the DCNN model, we obtained compressed feature maps of these eight types of network events (Fig. 8).

Results
We evaluate the scheme in two active networks, LAN A and LAN B, each of which includes the event generating and data collecting system described earlier. We evaluate its performance using the precision, recall, and F-measure.
The precision of an event type is the fraction of the test data classified as that event that actually belongs to it; the recall is the fraction of the test data of that event that is successfully classified. They are defined as (8) and (9), and the F-measure, defined as (10), gives a comprehensive evaluation of a model's performance.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{8}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{9}$$
$$F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{10}$$
where TP, FP, and FN are the numbers of true positives, false positives, and false negatives for an event type.
Here, we use the macro averages of each evaluation standard as the result, which take the average of each class's metric, thus treating all classes equally. Moreover, considering the existing imbalance in the datasets (e.g. fewer examples of tcp scan than the other classes in LAN A), the micro average method is also adopted to further evaluate the scheme, which aggregates the contributions of all classes to compute the average metric instead.
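The difference between macro and micro averaging can be illustrated with a short sketch. It assumes that the macro F-measure is the mean of the per-class F-measures and that the micro F-measure is computed from the true positives, false positives, and false negatives summed over all classes; other definitions of the macro F-measure exist, and the function names and toy labels are illustrative.

```python
def per_class_counts(y_true, y_pred, classes):
    """Count true positives, false positives and false negatives per class."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            counts[truth]["tp"] += 1
        else:
            counts[pred]["fp"] += 1
            counts[truth]["fn"] += 1
    return counts


def prf(tp, fp, fn):
    """Precision, recall and F-measure from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def macro_micro_f(y_true, y_pred, classes):
    """Return (macro F-measure, micro F-measure)."""
    counts = per_class_counts(y_true, y_pred, classes)
    macro = sum(prf(**counts[c])[2] for c in classes) / len(classes)
    totals = {k: sum(counts[c][k] for c in classes) for k in ("tp", "fp", "fn")}
    micro = prf(**totals)[2]
    return macro, micro


if __name__ == "__main__":
    truth = ["arp", "arp", "arp", "udp"]  # toy labels, not the paper's data
    preds = ["arp", "arp", "udp", "udp"]
    print(macro_micro_f(truth, preds, ["arp", "udp"]))
```

Because the macro average weights every class equally while the micro average weights every sample equally, the two scores diverge when the classes are imbalanced.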
During training, we applied the evaluation methods above at each epoch. We computed the precision, recall, and F-measure scores of each event as a measure of the classification ability of the scheme. The corresponding results are shown in Table 2. The classification between normal, arp scan, tcp scan, and udp scan performs relatively well. In contrast, the classification between scans of specific ports, such as tcp port 80 and udp port 137, has relatively low F-measure scores, which means that distinguishing between scans of these specific ports is more difficult than distinguishing only between normal and abnormal states in a LAN. Overall, we achieved a macro-F-measure score of 0.982 in LAN A and 0.975 in LAN B, and a micro-F-measure score of 0.976 in LAN A and 0.965 in LAN B.

Discussion
The visualization of network traffic data adds some explainability to anomaly detection and classification in a LAN. A DCNN model is adopted to classify the recurring patterns in the feature maps of various network events, and experiments in two different networks are conducted to evaluate the scheme.
On the other hand, an adversary could still forge the features inside a feature map, for example by adjusting communication frequencies. Therefore, the scheme needs to be verified with a more careful experiment in a real-world setting. Furthermore, besides the proposed eight types of network events, the influence of additional, non-explicit network events on the classification result should be considered.

Conclusion
In this research, we aimed to visualize the traffic data in a LAN by generating feature maps and to classify the different network events with a DCNN model. We adopted nine types of protocol information as the discriminators for feature representation in the feature maps. We then evaluated the scheme using the recall, the precision, and the F-measure. As a comprehensive evaluation of the scheme's performance in network events classification, we achieved macro-F-measure scores of 0.982 and 0.975 and micro-F-measure scores of 0.976 and 0.965 in the two LAN environments, respectively.

Availability of data and materials
The datasets used for the evaluation of the algorithm are available online at https://github.com/yuweisunn/LANSecurity