This section describes the materials and methods used in the experiments: the dataset, the neural network models, and the hyperparameters (optimization function and loss function) used to train the neural networks.
3.1. Dataset
The dataset used in this paper is obtained from the TON_IoT datasets, which were made publicly available by the University of New South Wales (UNSW) at the Australian Defence Force Academy. This collection is described on its website as a new generation of Industry 4.0/Internet of Things (IoT) and Industrial IoT (IIoT) datasets for evaluating the fidelity and efficiency of different cybersecurity applications based on Artificial Intelligence (AI) [18].
The TON_IoT datasets contain data collected from several normal and cyber-attack events in a realistic experimental setup designed at the Cyber Range and IoT Labs of UNSW Canberra [18]. The collection comprises several distinct datasets, including telemetry data of IoT/IIoT services, operating system logs, and network traffic of the IoT network.
Each record is labelled as one of two classes: an attack or normal operation. Attack records are further divided into attack sub-classes. The nine types of cyber-attack are scanning, DoS, DDoS, ransomware, backdoor, data injection, Cross-Site Scripting (XSS), password cracking, and Man-in-The-Middle (MITM) attacks.
Of all the data collected by the lab, this paper focuses only on the IoT/IIoT dataset, which contains two main folders: “Processed_datasets” and “Train_Test_datasets”. The “Processed_datasets” folder contains the processed and filtered versions of the datasets, with their standard features and labels, in CSV format. The “Train_Test_datasets” folder contains samples of the datasets intended for training and testing new cybersecurity and machine learning applications.
This paper uses the “Train_Test_datasets” to train the neural network models. The folder contains data collected from seven different IoT devices: Fridge, Garage Door, GPS Tracker, Modbus, Motion Light, Thermostat, and Weather. The experiment used to collect the data combined physical and simulated IoT/IIoT devices. Of these datasets, the Modbus dataset is used in this paper.
The Modbus dataset was chosen because the Modbus protocol is found in most smart manufacturing and industrial applications. The experimenters extracted register-type features from the Modbus service; the extracted features are described in Table 1. The Modbus data communications protocol is discussed in more detail in the next section.
Table 1
Features of the Modbus Dataset
| Feature | Description |
| --- | --- |
| ts | Timestamp of sensor reading data |
| date | Date of logging Modbus register’s data |
| time | Time of logging Modbus register’s data |
| FC1_Read_Input_Register | Modbus function code that is responsible for reading an input register. |
| FC2_Read_Discrete_Value | Modbus function code that is in charge of reading a discrete value. |
| FC3_Read_Holding_Register | Modbus function code that is responsible for reading a holding register. |
| FC4_Read_Coil | Modbus function code that is responsible for reading a coil. |
| label | Identifies normal and attack records, where ‘0’ indicates normal and ‘1’ indicates attack. |
| type | A tag with the normal or attack sub-class, such as DoS, DDoS, and backdoor attacks. |
The Modbus dataset does not contain all of the classes previously described. It contains six classes: Backdoor, Injection, Normal, Password, Scanning, and XSS. A brief explanation of each of these cyber-attacks is given below.
- Backdoor refers to any method by which an unauthorized user gets around normal security measures and gains high-level user access (also known as root access) on a computer system, network, or software application [24]. Attackers can use this attack to steal any type of data (personal, financial) stored in the system or network, to install additional malware, or to hijack devices.
- Injection attacks are attacks in which the attacker injects code into a program or query, or injects malware onto a computer, in order to execute remote commands that can read or modify a database. They can also be used to change data on a website [25].
- Password cracking attacks are attacks in which the attacker tries to crack the password of a computer system by either brute-force or dictionary attacks [26]. This allows the attacker to bypass the authentication procedure and hence compromise the IoT or IIoT device.
- Scanning attacks are attacks in which the attacker scans devices to gather network information before launching more sophisticated attacks. Common scanning techniques include IP address scanning, port scanning, and version scanning [27].
- XSS attacks, also called Cross-Site Scripting attacks, are a type of injection in which malicious scripts are injected into benign and trusted websites. XSS attacks occur when an attacker uses a web application to send malicious code, generally in the form of a browser-side script, to a different end user [28]. This attack is capable of compromising the data and authentication procedures between IIoT devices and a remote web server.
An overview of the Modbus dataset is shown in Fig. 1.
It can be seen from Fig. 1 that the dataset is highly imbalanced. Even in the binary-classification task, the ratio of the two classes is 68:32.
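The imbalance noted above can be checked directly from the `label` column of Table 1. The sketch below uses a toy stand-in DataFrame; with the real data one would load the Modbus train/test CSV instead (the file name here is an assumption, adjust it to the extracted TON_IoT archive).

```python
import pandas as pd

def class_balance(df: pd.DataFrame) -> pd.Series:
    """Fraction of normal (0) vs. attack (1) records in the 'label' column."""
    return df["label"].value_counts(normalize=True).sort_index()

# Toy stand-in reproducing the 68:32 split described in the text; with the
# real data, use pd.read_csv("Train_Test_Modbus.csv") (file name assumed).
toy = pd.DataFrame({"label": [0] * 68 + [1] * 32})
print(class_balance(toy))
```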
3.2. Modbus Communication Protocol
Modbus is a data communications protocol published by Modicon in 1979. It was initially used in their programmable logic controllers, but it has since become a de facto standard communication protocol [29].
The Modbus protocol is very popular in industrial environments because it is openly published and royalty-free. Part of its popularity is due to the fact that it is easier to maintain than other standards, and it places fewer restrictions on the format of the data to be transmitted.
The Modbus protocol uses character serial communication lines, Ethernet, or the Internet protocol suite to transmit data. Data can be transmitted between multiple devices connected to the same cable or Ethernet network. The ease of use of the Modbus protocol is apparent in its use in IoT applications where, for example, a motion sensor module and temperature module can both be connected to the same computer, via Modbus.
Modbus is a master/slave protocol, meaning one device operates as the master while one or more devices operate as slaves. The master can read and write the data in a slave’s registers. Each slave on the network has a unique 8-bit device address that the master uses to address it; a slave does not respond unless it recognizes its own address. The slave must also respond within a specified time, or the master reports a “no response” error.
A transmitted data packet always starts with the slave address, followed by a function code and the parameters defining what is being asked for.
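The packet layout described above can be sketched in code. The example below builds a read request in the Modbus RTU framing, which appends a CRC-16 checksum after the address, function code, and parameters; the slave address, starting register, and count are arbitrary illustrative values.

```python
import struct

def crc16_modbus(data: bytes) -> int:
    """Standard Modbus CRC-16 (polynomial 0xA001, initial value 0xFFFF)."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc

def read_holding_registers_request(slave: int, start: int, count: int) -> bytes:
    """Slave address, function code 0x03 (Read Holding Registers),
    then big-endian starting address and register count, then CRC."""
    pdu = struct.pack(">BBHH", slave, 0x03, start, count)
    return pdu + struct.pack("<H", crc16_modbus(pdu))  # CRC sent little-endian

frame = read_holding_registers_request(slave=0x01, start=0x0000, count=2)
print(frame.hex())
```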
Table 2 lists the object types that a Modbus server provides to a Modbus client device [30].
Table 2
The object types that a Modbus server provides to a Modbus client device.
| Object Type | Access | Size | Address Space |
| --- | --- | --- | --- |
| Coil | Read-write | 1 bit | 00001–09999 |
| Discrete input | Read-only | 1 bit | 10001–19999 |
| Input register | Read-only | 16 bits | 30001–39999 |
| Holding register | Read-write | 16 bits | 40001–49999 |
The address space of the object types is called the entity address. The entity address is the starting address, a 16-bit value carried in the data part of the Modbus frame, ranging from 0 to 65,535 (0000–FFFF in hexadecimal).
Coils are 1-bit registers used to control discrete outputs; they can be read or written. Discrete inputs are also 1-bit registers, used as inputs; they can only be read. Input registers are 16-bit registers used for input, and they can only be read. Holding registers are the most universal registers: 16-bit registers where data can be read or written. A holding register can be used to hold many things, including, but not limited to, inputs, outputs, configuration data, or any requirement that calls for the “holding” of data [31].
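The object types and access rules of Table 2 can be summarized as a small lookup. This is a sketch of the standard data-model conventions, not code from the paper; the function name is illustrative.

```python
def object_type(entity_address: int) -> tuple:
    """Map a Modbus data-model entity address (per Table 2) to its
    object type and read/write access."""
    if 1 <= entity_address <= 9999:
        return ("coil", "read-write")
    if 10001 <= entity_address <= 19999:
        return ("discrete input", "read-only")
    if 30001 <= entity_address <= 39999:
        return ("input register", "read-only")
    if 40001 <= entity_address <= 49999:
        return ("holding register", "read-write")
    raise ValueError("address outside the standard object-type ranges")

print(object_type(40010))  # a holding register, readable and writable
```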
The Modbus dataset used in this paper contains the four object types logged over time.
3.3. Neural Network
This paper uses three different types of neural network to learn the mapping between the Modbus register inputs and the attack-type output.
Neural networks, also called artificial neural networks, fall under the topic of Deep Learning. Deep Learning in turn falls under Machine Learning, and Machine Learning is one type of implementation of Artificial Intelligence.
Neural networks contain layers of nodes (also called artificial neurons), and each node contains tunable parameters: the weights and biases. Engineers do not tune them manually; instead, an algorithm tunes the weights and biases for them. For a neural network to work well, it needs a large amount of data, and the data should cover all intended use cases.
This sub-section discusses the embedding layer, the different types of neural networks used, and the optimization and loss functions used.
3.4. Embedding Layer
The main contribution of this paper is the use of embedding layers to model the Modbus register values. The embedding layer is used almost exclusively in the field of Natural Language Processing (NLP). Its benefit in NLP tasks is that it provides dense, low-dimensional vectors, with its main advantage being generalization power [32].
This layer provides a simple way to convert words into vectors of real numbers. Before the embedding layer became so popular, there were various methods to convert words into vectors, but none of them worked as well. This is due to the simple fact that the embedding layer is trainable [33], whereas the other vectorization methods could not be differentiated with respect to the loss function and hence could not be trained.
The embedding layer is a simple lookup table that stores the embeddings of a fixed dictionary and size [34]. In simple terms, it maps a number to a table of vectors, and these vectors then represent the number in the neural network. The representation vectors are trainable, so they keep improving at representing the input. An ideal embedding layer encodes the meaning of a word, so that similar words lie closer together in the vector space [35].
The inspiration to use the embedding layer comes from the type of data the Modbus dataset provides. Each register holds a 16-bit number, which represents a value between 0 and 65,535. This looks very similar to NLP datasets in which words are encoded with integers ranging from 0 to the number of unique words. The embedding layer maps each register value to its own vector. The neural network then learns whether an attack has occurred by taking the four vectors of the four different registers, and the embedding layer learns to map the combinations of register vectors that constitute an attack closer to each other in the vector space.
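The register-embedding idea above can be sketched with PyTorch's `nn.Embedding`: each 16-bit register value indexes a trainable lookup table, and the four register embeddings form one input sample. The embedding dimension and register values here are illustrative assumptions, not the paper's actual hyperparameters.

```python
import torch
import torch.nn as nn

# One trainable vector per possible 16-bit register value (0..65535).
embedding = nn.Embedding(num_embeddings=65536, embedding_dim=16)

# One sample: the four Modbus register readings (FC1..FC4), values arbitrary.
registers = torch.tensor([[123, 4500, 512, 7]])
vectors = embedding(registers)            # shape: (1, 4, 16)
flat = vectors.view(vectors.size(0), -1)  # concatenated: (1, 64)
print(flat.shape)
```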
3.5. Multi-Layer Perceptron
A multi-layer perceptron (MLP) is a feedforward artificial neural network, and the first type of neural network used in this experiment.
It is a simple, fully connected class of feedforward neural network that maps an input vector to an output vector [36]. It performs the following linear transformation:
$$y = Wx + b$$
where, \(y\) is the output vector,
\(x\) is the input vector,
\(W\) is the weight matrix,
\(b\) is the bias vector.
An MLP layer can only linearly map an input to the output; activation functions are used to make the whole network learn non-linear functions. The activation function used in the hidden layers of our network is the Rectified Linear Unit (ReLU), except for the output layer, where a log-softmax function is used.
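A minimal sketch of such an MLP head, with ReLU in the hidden layer and log-softmax at the output. The layer sizes are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(64, 128),   # linear transformation y = Wx + b
    nn.ReLU(),            # non-linearity in the hidden layer
    nn.Linear(128, 6),    # six classes in the Modbus dataset
    nn.LogSoftmax(dim=1), # log-probabilities for NLLLoss
)

x = torch.randn(8, 64)    # batch of 8 samples (sizes illustrative)
log_probs = mlp(x)        # shape: (8, 6)
print(log_probs.shape)
```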
3.6. Convolutional Neural Network
The Convolutional Neural Network (CNN) is a class of artificial neural network that is used mostly in the field of computer vision [37]. CNNs were inspired by the biological visual cortex [38], [39], in which the connectivity of the neurons resembles the organization of neurons in animal brains.
For a CNN to be most effective, the vectors should have a spatial aspect to them: individual vectors carry little information on their own, but a group of vectors does. Think of a single pixel of color: it has no meaning by itself, but put millions of pixels together and you get an image, and if the pixels are rearranged in any way, the image loses meaning again. This is the main reason other neural networks cannot map images as effectively: a CNN inherently takes the position of each input vector relative to its neighbors into consideration when training the model.
Convolutional neural networks have been shown to perform well in NLP tasks, so this paper also uses a CNN to test whether it can map the embedded vectors from the Modbus dataset.
There are other layers used in conjunction with the CNN layer, which are the Max-Pooling Layer, and the Dropout Layer.
The max-pooling layer applies max pooling over an input signal: within a window of the given kernel size, only the maximum value is passed forward, reducing the dimension of the vector. This ensures that only the strongest features from the CNN are used by the neural network to learn. Max pooling is given by the following equation:
$$out\left(N_{i},C_{j},h,w\right)=\max_{m=0,\dots,kH-1}\ \max_{n=0,\dots,kW-1} input\left(N_{i},C_{j},stride\left[0\right]\times h+m,stride\left[1\right]\times w+n\right)$$
where, \(N\) is the batch size,
\(C\) is the number of channels,
\(h\) and \(w\) are the height and width indices of the output,
\(kH\) and \(kW\) are the kernel height and width.
The dropout layer randomly zeroes some of the elements of the input tensor with a probability \(p\) that the user sets. This is used to decrease the co-adaptation of the neurons in the neural network [40]; neurons are said to be co-adapted when their behavior is highly correlated.
The three neural network layers are used in the following order: Convolutional Neural Layer, Max-Pooling Layer, and Dropout Layer.
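The three-layer ordering above can be sketched as follows. The channel counts, kernel sizes, and dropout probability are illustrative assumptions; the input treats the embedding dimension as channels over the sequence of four registers.

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv1d(in_channels=16, out_channels=32, kernel_size=2),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),  # keeps only the maximum in each window
    nn.Dropout(p=0.5),            # randomly zeroes elements during training
)

# (batch, embedding_dim as channels, sequence of 4 registers)
x = torch.randn(8, 16, 4)
out = block(x)
print(out.shape)  # conv shrinks length 4 -> 3, pooling 3 -> 1
```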
3.7. Long Short-Term Memory
Long short-term memory (LSTM) is a type of Recurrent Neural Network (RNN) [41]. RNNs, in particular, were designed for time-series data [42]. Time-series data, such as stock-market data, are data collected over time, where each point is highly correlated with its own previous data points. The recurrent neural network was invented to map all the intricacies that come with modelling such data.
The inspiration to use an RNN on this dataset is that RNNs such as GRUs and LSTMs are very good at modelling natural language [43]. Since the embedding layer is also primarily used in Natural Language Processing, it is logical that RNNs should also work well with the embedded vectors of the Modbus dataset.
Of the plain RNN, GRU, and LSTM layers that fall under the category of recurrent layers, the LSTM layer is chosen because it is the most effective at modelling long sequences of data. This is due to the fact that the LSTM has a dedicated cell state that is not directly affected by the input data.
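A minimal sketch of an LSTM over the embedded register sequence; the hidden size is an illustrative assumption, not the paper's hyperparameter.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

# (batch, sequence of 4 registers, embedding dimension)
x = torch.randn(8, 4, 16)
output, (h_n, c_n) = lstm(x)  # c_n is the cell state mentioned above
print(output.shape, h_n.shape)
```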
3.8. Loss Function
The loss function is one of the most important aspects of designing a neural network, because the loss function, also called a cost function, is how the performance of the model is gauged: the lower the loss, the better the model performs. If the loss of the model is low on the training dataset but higher on the validation or testing dataset, the model has overfit.
The loss functions used in this experiment are the Negative Log Likelihood Loss (NLLLoss) [44] and the Binary Cross Entropy Loss (BCELoss) [45]. Each is well suited to training a classification problem with its respective number of classes.
The Negative Log Likelihood Loss is given by the equation below:
$$\ell\left(x,y\right)=\sum_{n=1}^{N}\frac{l_{n}}{\sum_{n=1}^{N}w_{y_{n}}},\qquad l_{n}=-w_{y_{n}}\,x_{n,y_{n}}$$
Where, \(x\) is the input (predicted output),
\(y\) is the target (real output),
\(w\) is the weight,
\(N\) is the batch size.
The Binary Cross Entropy Loss is given by the following equation:
$$\ell\left(x,y\right)=L=\left\{l_{1},\dots,l_{N}\right\}^{T}$$
$$l_{n}=-w_{n}\left[y_{n}\cdot \log x_{n}+\left(1-y_{n}\right)\cdot \log\left(1-x_{n}\right)\right]$$
Where, \(x\) is the input (predicted output),
\(y\) is the target (real output),
\(w\) is the weight,
\(N\) is the batch size.
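The two losses above can be sketched in PyTorch: NLLLoss expects log-probabilities (hence the log-softmax output layer) with integer class targets, while BCELoss expects probabilities in (0, 1) with float targets. All values here are random placeholders for illustration.

```python
import torch
import torch.nn as nn

# Multi-class: log-probabilities over 6 classes, integer targets.
log_probs = torch.log_softmax(torch.randn(8, 6), dim=1)
targets = torch.randint(0, 6, (8,))
nll = nn.NLLLoss()(log_probs, targets)

# Binary: probabilities in (0, 1), float targets 0.0 or 1.0.
probs = torch.sigmoid(torch.randn(8))
binary_targets = torch.randint(0, 2, (8,)).float()
bce = nn.BCELoss()(probs, binary_targets)

print(nll.item(), bce.item())
```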
3.9. Optimization Function
The optimization function is used to update the weights and biases in the network, gradually decreasing the overall loss of the model over many iterations. The optimization algorithm used is the Adam optimization algorithm, introduced in [46].
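One Adam update step can be sketched as below: compute the loss, backpropagate the gradients, and let the optimizer adjust the weights and biases. The learning rate and model dimensions are assumptions for illustration, not the paper's settings.

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 6)  # stand-in for any of the three networks
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.NLLLoss()

x = torch.randn(8, 64)
y = torch.randint(0, 6, (8,))

optimizer.zero_grad()  # clear gradients from the previous iteration
loss = loss_fn(torch.log_softmax(model(x), dim=1), y)
loss.backward()        # backpropagate the loss
optimizer.step()       # Adam updates the weights and biases
print(loss.item())
```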