Cyberattacks detection and analysis in a network log system using XGBoost with ELK stack

The use of artificial intelligence and machine learning methods in cyberattacks has increased significantly in recent years. To defend against cyberattacks, attack events can be detected and identified by observing log data and analyzing whether it shows abnormal behavior. This paper implements a network log system (NetFlow Log) on the ELK Stack to visually analyze log data and present the characteristics of several network attack behaviors for further analysis. Additionally, the system evaluates eXtreme Gradient Boosting (XGBoost), Recurrent Neural Network (RNN), and Deep Neural Network (DNN) models as machine learning methods. Keras is used as the deep learning framework for building the models that detect attack events. The experiments confirm that the XGBoost model achieves an accuracy of 96.01% on potential threats. On the full attack dataset it achieves 96.26% accuracy, which is better than the RNN and DNN models.


Introduction
In recent years, cyberattacks have been evolving and becoming more sophisticated. For example, with the development of Machine Learning algorithms, some malicious users may combine cyberattack techniques with Machine Learning to analyze information from social networks (Ahad et al. 2016). The specific target of a cyberattack is then chosen based on the data, the attack success rate, or the vulnerabilities discovered. According to Neustar's International Network Benchmark Index report released in 2018 [1], 82% of cybersecurity experts said they are worried that attackers will use artificial intelligence to launch destructive attacks on the network environment. However, a large number of experts believe that artificial intelligence can also play a considerable role in network security and provide excellent support.
As mentioned above, various cyberattacks have appeared in the campus network environment and threaten its stability. The network logs show many unusual usage patterns that attempt to bypass the campus network security system (Lai et al. 2019; Liu et al. 2021). However, systems that can both visualize network log data and detect cyberattacks are costly.
In this work, the open-source ELK Stack platform is used to build a network log system (NetFlow Log). First, a large volume of network logs related to cyberattack behavior is collected to obtain preliminary information. Second, the data are analyzed and visualized. After visualizing the log data, the administrator can import historical log data into the machine learning model for analysis and detection. The administrator can then perform a risk assessment by cross-validating the model's output against the visual information displayed by the ELK Stack, even for events that have not yet occurred or are uncertain. The administrator thus has sufficient information to make the right decisions and take precautions to avoid information security losses. Our goal is to use XGBoost for machine learning and to implement a visualization system for cyberattack behavior that helps administrators detect whether historical network log data contain cyberattack behavior. The specific objectives are as follows: 1. Demonstrate the visualization and monitoring system for NetFlow logs. 2. Compare the XGBoost, RNN, and DNN models on two kinds of data: potential attack log data and full attack log data.

Background review and related works
This section provides the background of this work and introduces the tools it uses, including Python, ELK Stack, XGBoost, and so on. The following subsections discuss each of them in more detail.

Keras
In

ELK Stack
ELK Stack refers to an architecture based on three open-source software packages: Elasticsearch, Logstash, and Kibana (Bajer 2017). It can be used to build a system for collecting, querying, and analyzing logs, and it can ingest data from any source and in any format. Without changing the original system architecture, ELK Stack can search and analyze data almost instantly and then present the analysis results visually (Rattan et al. 2019). NetFlow Log, the automated network log platform built in this work, is built on top of these three open-source packages. Beyond the three core components, the ELK Stack ecosystem includes many other software packages, such as Filebeat, X-Pack, and ECE.

Decision tree
In decision theory, a decision tree consists of a decision graph and its possible outcomes and is used to support decision-making toward a goal (Safavian and Landgrebe 1991). In Machine Learning, a decision tree is a predictive model that uses a tree structure to segment the data and make decisions based on it. Each internal node in the tree represents a test on a feature, and each branch represents one possible outcome of that test, that is, one way of segmenting the data. Segmenting the data yields an information gain. The segmentation is repeated until a leaf node is reached. Each leaf node corresponds to a target value and is characterized by all the feature values on the path from the root node to that leaf. Decision trees can be used for forecasting and data analysis. A complete decision tree typically contains three types of nodes: decision nodes, chance nodes, and end nodes. Decision trees can be generated by several methods, including classification tree analysis, regression tree analysis, Classification and Regression Trees (CART), and Chi-square Automatic Interaction Detection (CHAID).
CART is the most fundamental component of XGBoost, so the CART regression tree is introduced first. It constructs a decision tree from the features and data of the training set to determine the prediction for each record, and it uses the Gini index to compute the gain used to select the splitting features. The Gini index is defined as
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2 \qquad (1)$$
where $p_k$ represents the probability of class $k$ in dataset $D$ and $K$ is the number of classes. The Gini index used to calculate the gain of a split on feature $A$ is
$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2) \qquad (2)$$
where $D$ represents the entire dataset, and $D_1$ and $D_2$ represent, respectively, the data in the dataset that have feature $A$ and the data that do not.
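As a small illustration of the Gini index and the weighted Gini index of a split, the following Python sketch evaluates both on a toy set of records. The destination ports and labels are made-up values for illustration, not data from this work, and the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(rows, labels, has_feature):
    """Weighted Gini index after a binary split.

    Rows for which has_feature(row) is True form D1, the rest form D2.
    """
    d1 = [y for x, y in zip(rows, labels) if has_feature(x)]
    d2 = [y for x, y in zip(rows, labels) if not has_feature(x)]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

# Toy records (illustrative only): destination port and class label.
rows = [1434, 1434, 80, 443, 80, 443]
labels = ["attack", "attack", "normal", "normal", "normal", "normal"]

before = gini(labels)                                        # impurity of D
after = gini_split(rows, labels, lambda port: port == 1434)  # impurity after split
print(before, after)
```

A split that lowers the weighted Gini index the most is the one a CART-style tree would prefer; here the split on the toy port value separates the two classes completely.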

Gradient boosting
Gradient Boosting is a Boosting method, a Machine Learning technique for regression and classification problems (Friedman 2001). It builds the prediction model as an ensemble of multiple weak learners, where each new model is established in the gradient descent direction of the loss function of the previous model. A large loss means that the model is more error-prone; conversely, if each new model makes the loss drop, the ensemble keeps improving. The loss is thus reduced step by step along the gradient direction, and a good model is finally obtained (Friedman 2002). The specific algorithm is as follows:
Input: training set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
Output: boosting tree $f_M(x)$
Procedure:
- Initialization: $f_0(x) = 0$; for $m = 1, 2, \ldots, M$:
- Calculating the residual $r_{mi} = y_i - f_{m-1}(x_i),\ i = 1, 2, \ldots, n$ (3)
- Fitting the residual $r_{mi}$ to learn a regression tree and get
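The residual-fitting loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: it assumes squared-error loss (so the residual is the quantity to fit), uses depth-1 regression trees (stumps) as the weak learners, and adds each fitted tree to the running model. The data values and function names are illustrative.

```python
def fit_stump(xs, rs):
    """Fit a depth-1 regression tree to residuals rs by least squares."""
    best = None
    for t in sorted(set(xs))[:-1]:            # candidate split thresholds
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, M):
    """Start from f_0(x) = 0 and repeatedly fit a stump to the residuals."""
    trees = []
    predict = lambda x: sum(tree(x) for tree in trees)   # current model f_m
    for _ in range(M):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]  # r_mi
        trees.append(fit_stump(xs, residuals))
    return predict

# Illustrative one-dimensional regression data.
xs = list(range(1, 11))
ys = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]

def sse(model):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))

loss_1 = sse(boost(xs, ys, M=1))
loss_6 = sse(boost(xs, ys, M=6))
print(loss_1, loss_6)
```

Because every stump is the least-squares fit to the current residuals, the training loss cannot increase from one round to the next, which is the behavior described above.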

XGBoost
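The objective function of XGBoost consists of two parts (Chen and Guestrin 2016):
$$\mathrm{Obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{t} \Omega(f_t)$$
The first part, $l(y_i, \hat{y}_i)$, measures the difference between the predicted score $\hat{y}_i$ and the true score $y_i$. The second part is the regularization term $\Omega(f_t)$, and the formula is as follows:
$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$
Here $T$ represents the number of leaf nodes, $w_j$ represents the weight of the $j$-th leaf node, $\gamma$ controls the number of leaf nodes, and $\lambda$ keeps the scores of the leaf nodes from becoming too large, which prevents overfitting.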
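To make the role of $\gamma$ and $\lambda$ concrete, the sketch below evaluates the regularization term for a list of leaf weights and adds it to a loss. Squared error is assumed as the loss purely for illustration; the function names and numbers are hypothetical and do not come from this work.

```python
def omega(leaf_weights, gamma, lam):
    """Regularization of one tree: gamma * T + 0.5 * lambda * sum_j w_j^2."""
    T = len(leaf_weights)
    return gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)

def objective(y_true, y_pred, trees_leaf_weights, gamma=1.0, lam=1.0):
    """Training loss (squared error assumed) plus the penalty of every tree."""
    loss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return loss + sum(omega(w, gamma, lam) for w in trees_leaf_weights)

# One tree with two leaves: the penalty grows with the leaf count and weights.
penalty = omega([0.5, -0.5], gamma=1.0, lam=2.0)
total = objective([1.0, 0.0], [1.0, 0.0], [[0.5, -0.5]], gamma=1.0, lam=2.0)
print(penalty, total)
```

With a perfect fit the loss is zero, so the objective equals the penalty alone, which shows that a tree with more leaves or larger leaf weights is penalized even when it fits the training data.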

Deep neural network (DNN) and recurrent neural network (RNN)
In the broad sense, DNN includes variants such as CNN and RNN. In practical applications, a so-called Deep Neural Network usually incorporates several known structures, such as LSTM or convolution layers. In the narrow sense, however, DNN differs from RNN and CNN in that it refers specifically to a fully connected neuron structure that contains no convolution units or temporal associations. DNN is sometimes called a Multi-Layer Perceptron (MLP). A neural network used to process sequence data is called an RNN. In a DNN, adjacent layers are fully connected, but the nodes within each layer are not connected to each other, which makes this model very inefficient at processing sequence problems. For example, in advertising promotion, one needs to understand the user's browsing habits or preferences over time and use them. The principle of the RNN model is to connect a neuron's output back to its input, so the network memorizes previous information and uses it to calculate the current output. That is to say, the output of an RNN is affected both by the input from the previous layer and by the output of the neurons in the same layer.
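The difference between the two structures can be illustrated with a single layer written in plain Python. This is a toy sketch, not the Keras models built in this work; the weights, the tanh activation, and the input sequence are arbitrary assumptions chosen only to show that a recurrent layer reacts to its own previous output.

```python
import math

def dense_step(x, W, b):
    """Fully connected layer: the output depends only on the current input x."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def rnn_step(x, h_prev, W, U, b):
    """Recurrent layer: the output also depends on the layer's previous output."""
    return [math.tanh(sum(w * xi for w, xi in zip(w_row, x))
                      + sum(u * hj for u, hj in zip(u_row, h_prev)) + bi)
            for w_row, u_row, bi in zip(W, U, b)]

W = [[0.5, -0.2], [0.1, 0.3]]   # input weights (arbitrary)
U = [[0.4, 0.0], [0.0, 0.4]]    # recurrent weights (arbitrary)
b = [0.0, 0.0]
sequence = [[1.0, 0.0], [1.0, 0.0]]   # the same input presented twice

h = [0.0, 0.0]
rnn_outputs = []
for x in sequence:
    h = rnn_step(x, h, W, U, b)   # h carries information to the next step
    rnn_outputs.append(h)

dense_outputs = [dense_step(x, W, b) for x in sequence]
print(dense_outputs)
print(rnn_outputs)
```

The dense layer returns the same output for both identical inputs, while the recurrent layer returns a different output the second time because its previous output is fed back in.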

Related works
Many theories, ideas, and experimental designs from other research helped us obtain better results in our experiments. Regarding the background of this work (Peterson 2018), Sharafaldin et al. (2019) gave us a lot of inspiration: they analyzed and visualized a large amount of data and proposed a classification of cyberattacks. In addition, a conference paper presented by Yuan et al. (2017) at the 2017 IEEE International Conference on Smart Computing (SMARTCOMP) described a DDoS defense mechanism that uses Deep Learning to identify DDoS attacks, which also inspired us. Many other papers provided constructive references as well (Chen et al. 2018).
Kozik et al. (2018) used the flexibility of a cloud-based architecture for large-scale Machine Learning, shifting the parts with high computing and high storage requirements to the cloud. The cloud first builds a complex learning model, and edge computing then executes it. Similarly, Kristiani et al. (2020) demonstrated a framework that combines sensors, edge, and cloud (iSEC).
Al-Qurishi et al. (2018) proposed a model for predicting Sybil attacks using Deep Learning. A Sybil attack creates enough false identities to deny real nodes the ability to receive or transmit on the network, effectively blocking the network services of other users. Their experiments show that the model can provide high-precision predictions even when uncleaned data are imported. In this work, the network log data are provided by the campus network security system, and raw data are used for both training and prediction. Our experiments show that even data consisting entirely of attack behavior can be classified with high accuracy and few misjudgments. James Zhang et al. have proposed a method to detect abnormal behavior in network performance data, which uses data collected by perfSONAR servers in the Open Science Grid and applies Boosted Decision Trees (BDT) and simple feedforward neural networks for Machine Learning. In this work, eXtreme Gradient Boosting is likewise used for classification to detect anomalous behavior in network log data. The network log data are divided into attack and non-attack and finally submitted to ELK for visual analysis.
Today's hackers can use HTTP Parameter Pollution [2] to contaminate training data, injecting crafted data into the training set to undermine Machine Learning classification and reduce detection accuracy. Sen Chen et al. propose KUAFUDET, a two-stage learning enhancement method (Chen et al. 2017) that learns to identify malware through adversarial detection. It includes a training phase, which selects and extracts features, and a testing phase, which uses the result of the training phase. Ranking and extraction by feature importance are also used in this work, and the complete attack data and the general original log data were imported as experimental data for reference comparison.
Hongyu Liu et al. have proposed a point-to-point detection method (Liu et al. 2019). It performs payload classification (PL-RNN) with Deep Learning models based on convolutional neural networks and recurrent neural networks and uses it for attack detection. In this work, XGBoost is used to learn from log data and summarize its important features. It effectively detects the difference between normal data and attack behavior and serves as the basis for the binary classification. In addition, Peiyuan Sun et al. (2019) proposed a Machine Learning-based approach that models attack behavior based on intuitive observation. Similarly, at the 2015 International Conference on Information and Communication Technology and Systems (ICTS), Langi et al. (2015) presented an evaluation of Logstash and Elasticsearch.
Ibrahim Ghafir et al. have proposed a Machine Learning-based system (Ghafir et al. 2018) that can detect and predict APT attacks accurately and quickly. The system was evaluated experimentally and can predict an APT at an early stage, with a prediction accuracy of 84.8%. In this work, Machine Learning is also used to quickly build a predictive model that classifies network logs. About half of the data consist of cyberattack behavior, and the model achieves high accuracy. In addition, this work constructs a visualization system for network log data so that administrators can easily view log data at any point in time.
Ozgur Koray Sahingoz et al. (2019) note that phishing is one of the methods used by hackers today. They propose a real-time anti-phishing system, which was experimentally shown to achieve an accuracy rate of 97.98% in detecting phishing URLs. Abebe Abeshu Diro and Naveen Chilamkurti (2018) state that Deep Learning is the preferred approach for attack detection because of its high feature extraction capability. In their work, they also express the hope that Machine Learning can make progress in detecting attacks.
The managers at the Italian National Institute for Nuclear Physics (INFN) used ELK Stack to set up a monitoring system that facilitates the management of each node's activities (Bagnasco et al. 2015). In a conference paper, T. Ram Prakash et al. described the construction of an ELK Stack system and how to identify network users geographically (Prakash et al. 2016). In addition, the paper by Chao-Tung also proposed a visualization platform that uses ELK Stack for the statistical analysis of air quality and influenza-like illness. This work follows the ELK Stack construction method, successfully imports the network logs of the campus network security system, and analyzes the data.

System design and implementation
This section describes how artificial intelligence is used to build the predictive models and how ELK Stack is used to implement the system architecture and the visualization of network log data. In addition, this work creates Deep Learning models using DNN and RNN to compare with the XGBoost Machine Learning model. The network logs collected in this work come from campus network devices, with more than 7 million records per day, approximately 2 to 3 GB. So far, 2 TB have been collected, and the volume continues to increase.

System architecture
In this work, we installed Anaconda3 on Windows 10 and used Jupyter Notebook as the Python development environment. After the network log data are preprocessed in this environment, XGBoost is used for Machine Learning, and the model is run on historical network log data to check for cyberattack behavior. In addition, a network log system is built on Linux using open-source software such as ELK Stack to visualize cyberattack behavior so that managers can analyze it more intuitively.
As shown in Fig. 1, the network logs are collected and submitted to ELK for visual analysis, which presents the results of cyberattack behavior detection to the administrator. In parallel, Python imports the network logs, performs data preprocessing, and conducts model training. Finally, the model reports its cyberattack predictions to the managers. Suppose the ELK Stack analysis treats a data stream as regular, but the model predicts that the same data stream is attack behavior. In that case, the administrator can cross-validate the two results to perform a risk assessment, which helps prevent the impact of hidden or unknown cyberattacks.
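The cross-validation of the two results can be pictured as a simple comparison per data stream. The sketch below is a hypothetical illustration: the record layout and the field names `elk_flagged` and `model_pred` are assumptions made for this example, not part of the system described here.

```python
def cross_check(flows):
    """Return flows that ELK treats as regular but the model predicts as attacks."""
    return [f for f in flows if not f["elk_flagged"] and f["model_pred"] == 1]

# Hypothetical per-flow results from the two sides of the system.
flows = [
    {"id": "a", "elk_flagged": False, "model_pred": 0},  # both say regular
    {"id": "b", "elk_flagged": False, "model_pred": 1},  # disagreement: review
    {"id": "c", "elk_flagged": True,  "model_pred": 1},  # both say attack
]

to_review = cross_check(flows)
print([f["id"] for f in to_review])
```

Only the flows on which the two sides disagree are handed to the administrator for risk assessment, which keeps the manual review focused on possibly hidden or unknown attacks.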

NetFlow log system
First, Linux shell scripts and scheduled jobs were written so that the machine automatically downloads the network log data from the server side. The server side collects NetFlow logs using NetDump, a tool that captures all types of packets on the LAN and prints them out. Its purpose is to acquire information about the different packets that flow on the LAN and to categorize them. After data processing, Logstash collects and filters the log data and forwards them to Elasticsearch for later search or analysis. Kibana is then used to visualize the analyzed data and present it on a website. Together, these components form the NetFlow Log System, a campus network log platform.

Network usage
Before analyzing cyberattack behavior, this work sets up several frequently used domains and visualizes the corresponding log data so that the administrator can monitor the network for abnormal use. This paper divides these domains into search engines, auction sites, online communities, entertainment, and high-risk domains. All of the above domain IPs are public IPs and can only be observed by the administrator.

Attack data analysis
Cyberattacks tend to disguise their packets as a secure data stream to trick the information security system. However, just as walking in the snow leaves footprints, attacks leave traces. This work selected several kinds of cyberattacks and recorded their characteristic values. Elasticsearch is then used to filter the network log data by these values, and managers monitor the visualized data for suspected cyberattacks.
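A filter of this kind can be expressed as an Elasticsearch query body. The Python sketch below only builds the request body as a dictionary and does not contact a server. The field names `proto` and `dst_port` are assumptions about the index mapping, and the signatures are the well-known ports associated with each worm rather than the characteristic values actually recorded in this work. The `fixed_interval` option assumes a recent Elasticsearch version.

```python
import json

# Illustrative signatures only: well-known ports tied to each worm,
# not the characteristic values recorded by the authors.
SIGNATURES = {
    "sql_slammer": {"proto": "UDP", "dst_port": 1434},
    "worm_sasser": {"proto": "TCP", "dst_port": 445},
}

def build_query(attack, interval="60m"):
    """Build a query body: filter by signature, count events per time bucket."""
    sig = SIGNATURES[attack]
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {"bool": {"filter": [
            {"term": {"proto": sig["proto"]}},        # hypothetical field name
            {"term": {"dst_port": sig["dst_port"]}},  # hypothetical field name
        ]}},
        "aggs": {"events_over_time": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": interval}
        }},
    }

body = build_query("sql_slammer")
print(json.dumps(body, indent=2))
```

A body like this could be sent to the search endpoint of the log index, and the bucket counts it returns correspond to the kind of per-interval event chart that Kibana draws.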

Machine learning with XGBoost
This section discusses how XGBoost is used for Machine Learning to construct a prediction model that detects cyberattack behavior in network log data, that is, to determine which data streams in the network log data are suspected of cyberattack behavior and which are normal. Figure 2 shows the decision tree of XGBoost. Figure 3 shows the DNN model established in this work. It consists of an input layer, two hidden layers, three dropout layers, and the final output layer, and its layers are fully connected. Figure 4 shows the RNN model established in this work. The difference between the two models is that the output of the RNN is affected not only by the input from the previous layer but also by the output of the neurons in the same layer.

Data preprocessing
First, the log data must be preprocessed to convert them into a format that the machine can learn from. The procedure is given in Algorithm 1. Each raw log contains about 500,000 records, of which suspected attack records account for about 1.8%. This experiment uses ELK Stack to filter the attack data of different periods and then extracts the log data from the database for integration. Finally, the log data are preprocessed to prepare the training set and the verification set. A total of about 200,000 records is divided into 66% as a training set and 33% as a verification set. There are therefore two kinds of datasets to be trained on: raw log data and full attack log data.
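The preprocessing steps can be sketched as follows. This is an illustration under assumptions rather than the authors' Algorithm 1: the record fields are hypothetical, protocol names are mapped to integer codes as one example of format conversion, and the majority class is downsampled as a stand-in for collecting additional attack records from other periods.

```python
import random

def encode_protocol(records):
    """Map protocol strings to integer codes so a model can consume them."""
    codes = {p: i for i, p in enumerate(sorted({r["proto"] for r in records}))}
    return [dict(r, proto=codes[r["proto"]]) for r in records]

def balance(records, rng):
    """Downsample so attack and non-attack records are about 50/50."""
    attack = [r for r in records if r["label"] == 1]
    normal = [r for r in records if r["label"] == 0]
    k = min(len(attack), len(normal))
    return rng.sample(attack, k) + rng.sample(normal, k)

def split(records, rng, train_fraction=0.66):
    """Shuffle, then cut into a training set and a verification set."""
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rng = random.Random(0)
# Synthetic log: 1,000 records, 1.8% labelled as suspected attacks.
raw = [{"proto": "TCP" if i % 2 else "UDP", "dst_pt": 1000 + i,
        "label": 1 if i < 18 else 0} for i in range(1000)]

balanced = balance(encode_protocol(raw), rng)
train, verify = split(balanced, rng)
print(len(balanced), len(train), len(verify))
```

The 66% cut mirrors the split described above; in the real pipeline the balanced set is much larger because attack records are gathered from several periods instead of discarding normal traffic.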

XGBoost model training
The data preprocessed as described in Sect. 3.3.1 are then imported into the model for Machine Learning training. However, compared with cyberattack data, normal network usage accounts for most of the logs, and attack records may not appear at all. Therefore, the primary goal is to make the model learn the correct features.
The model must be prevented from treating cyberattack data as normal traffic or noise. Therefore, log data are collected over multiple periods, and the records with attack characteristics are filtered out to form a training set. The training set is built by combining data from different attacks with non-attack data. In the end, attack and non-attack data each make up approximately 50% of the data in this work, which gives the best model feedback. The training set includes roughly 150,000 records, and the validation set, which likewise contains about 50% attack and 50% non-attack data, includes approximately 77,000 records. In addition, randomly sampled floating-point values are used to tune the parameters of XGBoost, and L1 and L2 regularization are used to regularize the gradient boosting and avoid overfitting or underfitting. The feature importance is examined after each training run to adjust the features of the log data.
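Tuning with randomly sampled floating-point parameters can be sketched as a small random search. The sketch is self-contained and does not train a model: the scoring function is a placeholder, and the value ranges are assumptions, not the settings used in this work. The dictionary keys follow the keyword names of `xgboost.XGBClassifier`, where `reg_alpha` and `reg_lambda` are the L1 and L2 regularization strengths.

```python
import random

def sample_params(rng):
    """Draw one candidate configuration (ranges are illustrative assumptions)."""
    return {
        "max_depth": rng.randint(3, 10),
        "learning_rate": rng.uniform(0.01, 0.3),
        "subsample": rng.uniform(0.5, 1.0),
        "reg_alpha": rng.uniform(0.0, 1.0),    # L1 regularization
        "reg_lambda": rng.uniform(0.0, 5.0),   # L2 regularization
    }

def random_search(evaluate, n_trials, seed=0):
    """Keep the candidate with the best validation score."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        params = sample_params(rng)
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Placeholder scorer so the sketch runs on its own. In practice it would
# train a classifier with these parameters and return validation accuracy.
def dummy_evaluate(params):
    return -abs(params["learning_rate"] - 0.1)

best_params, best_score = random_search(dummy_evaluate, n_trials=20)
print(best_params, best_score)
```

Replacing the placeholder with a function that fits the model on the training set and scores it on the verification set turns this into the tuning loop described above.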

XGBoost model prediction
For prediction, two types of data are imported into the Machine Learning model. The first is complete attack data, on which the model reaches 96.26% accuracy, and the second is new, unmodified log data. Together they verify the correctness and versatility of our model. Once training and validation are complete, the model has high precision and a good F1 score. The procedure is given in Algorithm 2.

Deep learning with keras
This section discusses how Keras is used for Deep Learning. CNTK and TensorFlow are widely used in Deep Learning research. However, although both have compelling features, using them directly is more complicated. Therefore, this work uses Keras to build binary classification models that detect cyberattack behavior in network log data and determine which data streams are suspected of cyberattack behavior and which are normal. A DNN model and an RNN model were built and evaluated on the same data as the XGBoost Machine Learning model.

Deep neural networks model
First, the log data are preprocessed as in Sect. 3.5 and converted into a format that the machine can learn from. After the data are imported into the DNN model, it is trained in a supervised manner. To help the DNN model reach a globally optimal solution during the experiment, this work uses the Scikit-learn suite to optimize the parameters of the model. Since this work aims to predict potential attacks, the dataset contains several types of attacks. However, the characteristics of cyberattack behavior are very scattered, which leads to over-fitting or vanishing gradients even after numerous adjustments. To keep the experiment fair, the model is also given a full attack dataset, as well as new, unmodified log data for validation.

Recurrent neural network model
Besides DNN, RNN is another Deep Learning model that can be applied to this data problem. The DNN cannot fully predict the full attack data and often suffers from over-fitting or vanishing gradients, whereas an RNN can handle the same data and significantly reduce the over-fitting seen with the DNN. Therefore, following the DNN model, this work builds an RNN model and runs the experiments on the same data. To help the RNN model reach a globally optimal solution during the experiment, this work uses the Scikit-learn suite to optimize the parameters of the model.

Experimental results
This section describes the use of XGBoost to build a Machine Learning model, the use of Keras to build DNN and RNN Deep Learning models for binary classification, and the use of ELK Stack to analyze network usage and attack behavior characteristics.

Experimental environment
This section describes our hardware lab environment. The experiment uses two hosts. One runs Linux and serves as the ELK Stack server. The other runs Windows 10, with Anaconda 3 and the related packages installed as the Python development environment, and is used to build the XGBoost Machine Learning model. The detailed hardware is shown in Table 1.

ELK stack network usage
To confirm network usage on campus more easily, this experiment looks up the public IP addresses of major commercial websites, search engines, social networks, and so on. This information can easily be found on websites such as ipinfo.io. Elasticsearch is then used to filter the required domain information, remove non-service local IP addresses to avoid clutter, and record the necessary domain names, and Kibana is used to visualize the result. Figure 5 shows a pie chart of network usage.
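The filtering and grouping step can be illustrated with Python's standard `ipaddress` module. The category table below is hypothetical: the networks are reserved documentation ranges that stand in for the real public IP ranges of each service, and only the common RFC 1918 ranges are treated as local addresses.

```python
import ipaddress

# Local (non-service) ranges to drop; RFC 1918 private networks.
LOCAL_NETS = [ipaddress.ip_network(n)
              for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

# Hypothetical category table using documentation ranges as placeholders.
CATEGORIES = {
    "search engine": [ipaddress.ip_network("192.0.2.0/24")],
    "online community": [ipaddress.ip_network("198.51.100.0/24")],
}

def categorize(ip_text):
    """Return the category of a destination IP, or None for local addresses."""
    ip = ipaddress.ip_address(ip_text)
    if any(ip in net for net in LOCAL_NETS):
        return None
    for name, nets in CATEGORIES.items():
        if any(ip in net for net in nets):
            return name
    return "other"

def usage_share(dst_ips):
    """Fraction of non-local records per category, as plotted in a pie chart."""
    counts = {}
    for ip_text in dst_ips:
        cat = categorize(ip_text)
        if cat is not None:
            counts[cat] = counts.get(cat, 0) + 1
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

shares = usage_share(["192.0.2.7", "198.51.100.9", "192.168.1.5", "192.0.2.8"])
print(shares)
```

In the system described here the same grouping is done by Elasticsearch filters and drawn by Kibana; the sketch only shows the logic behind the resulting shares.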

ELK stack attack analysis
In this experiment, the characteristics of several kinds of cyberattack behavior are selected as the filtering conditions. After ELK analysis, the data are visualized, giving the administrator an intuitive way to observe cyberattacks. Figure 6 shows the Code Red attack events every 60 minutes. The graph shows that Code Red attack events occur quite often, but each event is short. Figure 7 shows the Worm Sasser attack events every 60 minutes. Worm Sasser attacks last longer and occur a little less often. Figure 8 presents the SQL Slammer attack events every 60 minutes. SQL Slammer attacks occur a little less often and are short. Figure 9 illustrates the DDoS attacks every 5 minutes. It can be seen from the graph that compared to other attacks

Machine learning data preprocessing
Each raw network log contains about 500,000 records, of which suspected attack records account for about 1.8% of the total. Training works best when attack and non-attack records each make up about 50% of the data. Therefore, this experiment uses ELK Stack to filter the attack data of other periods and then extracts the log data from the database for integration. Finally, the log data are preprocessed to prepare the training set and the verification set. About 200,000 records are divided into 66% as a training set and 33% as a verification set. Figure 10 is a bar graph in which the features are sorted by importance score. In descending order, they are "Dst Pt", "In Byte", "Src Pt", "Output", "In Pkt", "Duration", "Proto", and "Input". As shown in Fig. 11, Gain represents the relative contribution of a feature to the model, and a high value means that the feature is more important for prediction.

XGBoost model prediction
Weight indicates the number of times a feature is used to split a node, as shown in Fig. 12. Figure 13 represents the relative number of observations associated with each feature. Total Gain represents the total gain that a feature brings across its split nodes in all trees, as shown in Fig. 14. The number of all samples covered by a feature at each split node is called Total Cover, as shown in Fig. 15. To verify the correctness and versatility of the model, new raw log data are used for prediction; the data are cleaned by preprocessing and then handed to the model. The prediction accuracy is as high as 96.01%. To further verify the correctness of the model, a full-attack prediction set is re-sampled, and the accuracy is as high as 96.26%, as shown in Table 2. This shows that attack data can be recognized reliably when attack behavior characteristics are present in the log data.
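How the importance measures relate to one another can be shown by aggregating per-split statistics. The sketch below follows the definitions used by the XGBoost Python package, where gain and cover are averages over the splits that use a feature. The split records are made-up numbers, not values from the trained model.

```python
from collections import defaultdict

def importance(splits):
    """Aggregate (feature, gain, cover) split records into importance scores."""
    weight = defaultdict(int)
    total_gain = defaultdict(float)
    total_cover = defaultdict(float)
    for feature, gain, cover in splits:
        weight[feature] += 1            # number of splits using the feature
        total_gain[feature] += gain
        total_cover[feature] += cover
    return {
        f: {
            "weight": weight[f],
            "total_gain": total_gain[f],
            "gain": total_gain[f] / weight[f],      # average gain per split
            "total_cover": total_cover[f],
            "cover": total_cover[f] / weight[f],    # average cover per split
        }
        for f in weight
    }

# Made-up split records across all trees: (feature, gain, samples covered).
splits = [("Dst Pt", 12.0, 100.0), ("Dst Pt", 4.0, 40.0), ("In Byte", 6.0, 60.0)]
scores = importance(splits)
ranking = sorted(scores, key=lambda f: scores[f]["gain"], reverse=True)
print(ranking, scores["Dst Pt"])
```

A feature can rank differently under each measure, which is why the figures report them separately.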
Finally, evaluation indicators are applied to test the model: the mean square error (MSE), the accuracy, and the F1 score, as shown in Table 3. Figure 16 shows the training and validation loss values for the DNN model. The loss on the training data keeps decreasing and approaches the loss on the validation data. Figure 17 shows the training and validation accuracy values for the DNN model. The accuracy on the training set keeps increasing and approaches that of the verification set, which indicates that the model is training well.
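For reference, the three evaluation indicators named above (MSE, accuracy, and F1 score) can be computed for binary labels as in the following sketch. The label vectors are invented for illustration and are unrelated to the results reported in the tables.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the attack class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mse(y_true, y_pred):
    """Mean square error between true labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Invented labels: 1 = attack, 0 = non-attack.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), mse(y_true, y_pred))
```

For hard 0/1 predictions the MSE equals the error rate, so it carries the same information as accuracy; it becomes more informative when the model outputs probabilities.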

DNN model prediction
The prediction results of the DNN model on the validation set are shown in Table 4. The data used for prediction are the same as those used for XGBoost. The DNN model's prediction accuracy is as high as 96.89%. To verify the versatility of the model, a full attack prediction set is also sampled here, on which the accuracy is only 69.66%. Compared with the previous Figure 18 shows the training and validation loss values for the RNN model. Figure 19 shows the training and validation accuracy values for the RNN model. These two figures show that the RNN model is also training in a good direction and has a good accuracy rate. The prediction results of the RNN model are shown in Table 5. The data used for prediction are the same as those used for the first two models. The RNN model's prediction accuracy is as high as 97.61%, even surpassing the accuracy of XGBoost. To verify the versatility of the model, the same set of attack data was also used for prediction; however, the accuracy was only 70.85%.

RNN model prediction
The comparison of three models of XGBoost, RNN, and DNN is presented in Table 6.

Conclusions and future works
This paper demonstrates a network log monitoring and visualization system built with ELK Stack. The system allows administrators to easily visualize charts and monitor the information they need from tens of millions of log records. This work also compares the XGBoost Machine Learning model with the RNN and DNN Deep Learning models. According to the experimental results, XGBoost performs best in predicting the full attack data. Therefore, this work chooses XGBoost as the machine learning model for attack prediction on log data. This attack prediction model complements the analysis performed with ELK. For example, suppose the ELK Stack analysis treats a log as ordinary data, but the model predicts that these data streams show attack behavior characteristics. In this case, the administrator can cross-verify the two results and carry out a further information security risk assessment. In the future, ELK Stack will collect more characteristic values related to attack behavior and further visualize the network log data as analysis charts. The network usage view will add the remaining large domain IPs and distinguish each domain, which makes it more convenient for management to observe. XGBoost is one of the most popular machine learning models, and its use is not limited to the two categories of attack and non-attack log data. More features of attack behavior can be added to enrich our database. XGBoost can then be used to create a multi-class model that directly identifies the type of attack and finds unusual data in the network log. Besides, cross-validation can be used in conjunction with deep learning to compare predictions and improve information security.