Our two main objectives are detecting intrusions and classifying the types of attacks. We use deep Q-learning for binary intrusion detection and machine learning approaches for multi-class detection. Because binary detection requires a fast response, it is implemented in the fog layer; in other words, deep Q-learning identifies attacks in the fog layer. Events marked as intrusions are sent to the cloud for further analysis and identification of the attack type. The cloud can perform more complex analysis because classification there is not urgent and the response time is more flexible. The cloud layer therefore hosts a robust ensemble machine learning approach for multi-class classification.

In the first step, we model the environment using a GRU architecture to learn the internal relations between events in the fog. In general, we integrate the influence of the historical information of the dynamic environment into policy optimization, because we aim to detect attacks that are small mutations of previous attacks and to recognize zero-day attacks in an IoT environment.

An important point is that the information obtained in the different fog nodes is sent to, and shared through, the summary module in the cloud. That is, the information created in each fog node is sent to the cloud so that it is visible to all nodes. This module is responsible for updates based on the received information. The proposed method is implemented in three phases: 1) the preprocessing and environment modeling phase, 2) the binary detection and parameter updating phase, and 3) the multi-class detection phase. An overview of the proposed architecture is shown in Fig. 1. The three phases are described in detail in the following subsections.

## 3.1 Preprocessing and environment modeling phase

Since the IoT environment is highly variable and new attacks often differ only slightly from previous attacks, finding the relationships between events significantly helps in detecting intrusions and their types. A GRU can learn dependencies over both short and long periods. It was proposed to address the vanishing gradient problem in RNNs. A GRU is similar to an LSTM but simpler, and it uses fewer parameters, so it is more computationally efficient. Using a GRU lets the detection of new attacks benefit from information about previous events. In other words, prior information can be incorporated into an internal state that is a suitable representation of the interactive environment. These states are fed into Deep Q-Learning to perform binary detection. We use Deep Q-Learning to deploy agents in environments with a discrete action space.

According to Fig. 2, in the first step Bn records (a mini-batch) are sampled from the dataset. Choosing Bn records means that each state (S) consists of Bn records, and each record contains the features S1 to Sm.
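The sampling step can be illustrated with a minimal Python sketch. The toy dataset, the `sample_state` helper, and sampling without replacement are assumptions made for illustration only; the text states only that Bn records with features S1 to Sm form one state.

```python
import random

def sample_state(dataset, batch_size, rng):
    """Draw Bn records; the resulting mini-batch is treated as one state S.

    Sampling without replacement is an assumption of this sketch.
    """
    return rng.sample(dataset, batch_size)

# Toy dataset: 10 records, each with m = 3 made-up feature values.
dataset = [[float(r), float(r) * 0.5, float(r % 2)] for r in range(10)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
state = sample_state(dataset, batch_size=4, rng=rng)
# `state` is a list of Bn = 4 records, each a list of m = 3 features.
```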

In the next step (Fig. 3), Si, hi−1 and Ai−1, which represent the current state, the hidden state of the previous step, and the previous action, respectively, enter the GRU module at each time step t. The calculations in Eq. (1) are performed in each GRU unit.

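$$
\begin{aligned}
z_i &= \sigma\left(W_z S_i + U_z h_{i-1}\right) \\
r_i &= \sigma\left(W_r S_i + U_r h_{i-1}\right) \\
h'_i &= \tanh\left(W S_i + r_i \odot U h_{i-1}\right) \\
h_i &= (1 - z_i) \odot h_{i-1} + z_i \odot h'_i
\end{aligned}
\tag{1}
$$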

zi is the update gate for state i. This gate decides what information should be discarded and what new information should be kept. Wz and Uz are the weights of Si and hi−1, respectively. The sigmoid activation function is then applied to produce an output between 0 and 1. At this stage, the model decides how much past information to carry forward, which also mitigates the vanishing gradient problem. ri is the reset gate, which decides what information from the past should be forgotten. Wr is the weight of Si and Ur is the weight of hi−1. h'i is the candidate state (the current memory content); the reset gate determines how much of the previous step is removed from it. In the last step, the network computes the vector hi, which holds the information for the current unit and passes it on. The update gate determines what to take from the current memory content h'i and what from the previous step hi−1. This step is given by the equation for hi.
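To make the gate computations concrete, the following is a minimal pure-Python sketch of one GRU step. It assumes the standard GRU formulation, in which the reset gate scales U hi−1 elementwise inside the tanh. The helper names (`sigmoid`, `matvec`, `gru_step`) and the parameter dictionary are hypothetical, and the previous action Ai−1 is left out because Eq. (1) does not show how it enters the unit. It is not the authors' implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def gru_step(s, h_prev, p):
    """One GRU update; p maps weight names to matrices (assumed layout)."""
    ws_z, uh_z = matvec(p["Wz"], s), matvec(p["Uz"], h_prev)
    ws_r, uh_r = matvec(p["Wr"], s), matvec(p["Ur"], h_prev)
    ws, uh = matvec(p["W"], s), matvec(p["U"], h_prev)
    z = [sigmoid(a + b) for a, b in zip(ws_z, uh_z)]    # update gate
    r = [sigmoid(a + b) for a, b in zip(ws_r, uh_r)]    # reset gate
    # Candidate state: the reset gate scales the contribution of h_prev.
    h_cand = [math.tanh(a + ri * b) for a, ri, b in zip(ws, r, uh)]
    # Blend the previous hidden state with the candidate state.
    return [(1.0 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]

# With all-zero weights both gates equal 0.5 and the candidate is 0,
# so the new hidden state is half of the previous one.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros for name in ("Wz", "Uz", "Wr", "Ur", "W", "U")}
h_next = gru_step([1.0, 1.0], [1.0, -2.0], params)
print(h_next)  # [0.5, -1.0]
```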

The output of this module includes the prediction of the next observation Si+1 and the next hidden state hi. The produced hi indicates what information from the previous step hi−1 and what information from the current observation Si should be stored. In addition to being used as an input to the GRU at time t + 1, the generated output hi is also sent to the binary detection module to perform attack detection. In general, we perform binary detection on data that carries useful information from both previous and current events. In other words, detection can draw on what was learned in earlier time steps.

## 3.2 Binary detection phase

RL is based on the theory of Markov Decision Processes (MDPs). An MDP is a tuple (S, A, T, R), where S is a set of states, A is a set of actions, T is a mapping function that gives the probability of transition from the pair (S, A) to a new state, and R is the reward function. The mapping function satisfies the Markov property: the probability of transition to a new state depends only on the current state and action. The aim in an MDP is to learn the optimal policy, which chooses the best action for each state. As mentioned, deep Q-learning is used as the binary detection technique. Deep Q-learning is a model-free algorithm: it does not build a model of the environment's transition function. Its purpose is to estimate the Q-value, that is, to solve for the function Q(S, A) from the experience samples in each time step according to the following formula:

$$
Q(S_i, A_i) \leftarrow Q(S_i, A_i) + \alpha\left(R_i + \gamma\, Q(S_{i+1}, A_{i+1}) - Q(S_i, A_i)\right) \tag{2}
$$

To update Q(S, A), we need Q(Si, Ai), the value of the function in state i; Q(Si+1, Ai+1), its value in state i + 1; the learning rate α; the discount factor γ; and Ri, the reward for action Ai. Updating Q(S, A) for each action A in state S therefore requires the estimated return Ri + γQ(Si+1, Ai+1), which is also called the TD-Target. Repeated application of the update rule drives Q(S, A) toward the correct values.
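The update rule in Eq. (2) can be traced with a small scalar sketch. The function name `td_update` and all numeric values are illustrative assumptions; a deep Q-network replaces the scalar estimate with a neural network, but the arithmetic of the target is the same.

```python
def td_update(q_sa, reward, q_next, alpha, gamma):
    """Apply one update of Eq. (2) to a single scalar Q estimate."""
    td_target = reward + gamma * q_next          # estimated return (TD-Target)
    return q_sa + alpha * (td_target - q_sa)     # move a fraction alpha toward it

# One step: 0.5 + 0.1 * (1.0 + 0.9 * 2.0 - 0.5), approximately 0.73.
one_step = td_update(q_sa=0.5, reward=1.0, q_next=2.0, alpha=0.1, gamma=0.9)

# Repeated updates against a fixed target of 1.0 approach that target.
q = 0.0
for _ in range(200):
    q = td_update(q, reward=1.0, q_next=0.0, alpha=0.1, gamma=0.9)
```

With a fixed target, each update closes a fraction α of the remaining gap, which is why repeated application settles on the target value.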

Before binary detection is carried out, the parameters and the DQL model must be initialized. That is, the output of the modeling phase is fed into the DQL, which determines the QV, SV, AV and Bn vectors.

As shown in Algorithm 1, we use two levels of iteration (inner and outer) to detect intrusions in the binary detection phase. The outer iteration trains the DQL model, and the inner iteration improves the Q-values. Each Bn represents a state (S), which is determined at the beginning of the outer iteration. The current state (Si) is passed into the outer loop to perform the DQL training process. At the end of the outer iteration, the Bn value is reset and the DNN parameters are updated if needed.

As mentioned, the inner iteration is used to improve the Q-values. In the inner iteration, the Q-function is approximated by the DNN. The values of the Bn records represent the current state of the environment. Note that the current state in this step is the output of the modeling step, which means that previous events have also affected it.

We use a deep neural network (Q-Network) to calculate Q(S, A). In each inner iteration, the features of each record (the values of its variables) are fed into the DNN input layer, and the DNN output layer produces the Q-values for the two actions (intrusion/normal). All values obtained in each iteration are stored in the vector Q (Q-Value). The action with the highest Q-value is predicted as the current action (A'i) for the given record according to the following equation.

$$
\text{Action} = \arg\max\left(\text{Q-Value}\right) \tag{3}
$$

Note that the epsilon-greedy approach is used for exploration in DRL. Epsilon-greedy is an action-selection strategy in reinforcement learning that helps the agent discover all possible actions through exploration and find the optimal policy. An action selected by the epsilon-greedy approach is either chosen at random with probability ɛ or is the predicted action with probability (1−ɛ). In the first inner iterations, the probability of choosing a random action is high. Over time this probability decreases, and actions are increasingly chosen by prediction. The DNN used is a three-layer deep neural network that uses the ReLU activation function for all layers. The loss function is the mean squared error.
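A minimal sketch of epsilon-greedy selection over the two actions follows. The action encoding (0 for normal, 1 for intrusion), the exponential decay schedule, and its constants are assumptions of this sketch; the text states only that the probability of a random action is high at first and decreases over time.

```python
import random

NORMAL, INTRUSION = 0, 1  # assumed encoding of the two actions

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the argmax action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(step, start=1.0, end=0.05, decay=0.99):
    """Hypothetical schedule: exponential decay from `start` down to `end`."""
    return max(end, start * decay ** step)

rng = random.Random(42)
q_values = [0.2, 0.7]  # made-up Q-values for (normal, intrusion)

# With epsilon = 0 the choice is purely greedy, so INTRUSION is selected.
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
```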

Next, the predicted actions (A'i) and the label of each state (Ai) are used to calculate the reward, as in Eq. (4).

$$
RV_i = \mathrm{Reward}\left(AV_i, \mathrm{label}_i\right) \tag{4}
$$
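The text does not specify the reward values, so the sketch below assumes the simplest possible scheme: a reward of 1 when the predicted action matches the label and 0 otherwise. Both the `reward_vector` helper and the `hit`/`miss` values are hypothetical.

```python
def reward_vector(predicted_actions, labels, hit=1.0, miss=0.0):
    """Return one reward per record: `hit` if the prediction matches the label.

    The 1/0 reward values are an assumption, not taken from the paper.
    """
    return [hit if a == y else miss for a, y in zip(predicted_actions, labels)]

# Four records: predictions and ground-truth labels (0 = normal, 1 = intrusion).
rewards = reward_vector([1, 0, 1, 0], [1, 0, 0, 0])
print(rewards)  # [1.0, 1.0, 0.0, 1.0]
```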

According to Eq. (2), we need to calculate the TD-Target, which requires QVi+1. As shown in Fig. 4, the next state (Si+1) is used to calculate QVi+1 and obtain the TD-Target. Here we use another deep neural network (Target-Network) whose parameters are separate from those of the Q-Network, according to Eq. (5):

$$
\begin{aligned}
QT_i &= RV_i + \gamma\, Q(S_{i+1}, A_{i+1}) \\
     &= RV_i + \gamma\, QV_{i+1}
\end{aligned}
\tag{5}
$$

RVi is the reward earned for Si, γ ∈ [0, 1] is the discount factor for future rewards, and QVi+1 is the Q-value vector of Si+1. The value of γ is updated in each iteration. After each iteration, the loss function is calculated to improve the performance of the neural network (Target-Network). According to Eq. (6), the loss is calculated at the end of each inner iteration.
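The TD-Target of Eq. (5) can be sketched per record as follows. The sketch assumes that the next-state Q-values come from the Target-Network and that the next action is the greedy one, so Q(Si+1, Ai+1) becomes the maximum over the two actions. That choice is consistent with the argmax in Eq. (3) but is an assumption here, as are the helper name and the numbers.

```python
def td_targets(rewards, next_q_values, gamma):
    """Compute QT = RV + gamma * max_a Q_target(next_state, a) for each record."""
    return [r + gamma * max(q) for r, q in zip(rewards, next_q_values)]

rewards = [1.0, 0.0]                      # RV for two records
next_q_values = [[0.2, 0.8], [0.5, 0.1]]  # made-up Target-Network outputs
targets = td_targets(rewards, next_q_values, gamma=0.9)
# targets is approximately [1.72, 0.45]
```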

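$$
\text{Loss} = \frac{1}{n}\sum_{i=1}^{n}\left(QV_i - QT_i\right)^2
= \frac{1}{n}\sum_{i=1}^{n}\left(Q(S_i, A_i) - \left(RV_i + \gamma\, Q(S_{i+1}, A_{i+1})\right)\right)^2 \tag{6}
$$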
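As a small numeric check of the mean squared error between the predicted Q-values and the TD-Targets, consider the sketch below. The function name and the values are illustrative assumptions.

```python
def mse_loss(q_values, q_targets):
    """Mean of squared differences between predicted Q-values and TD-Targets."""
    n = len(q_values)
    return sum((q - t) ** 2 for q, t in zip(q_values, q_targets)) / n

# Two records: the first prediction is off by 0.5, the second is exact.
loss = mse_loss([0.5, 1.0], [1.0, 1.0])
print(loss)  # 0.125
```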

At the end of the outer iteration, the updated parameters are sent to the Q-Network and Target-Network.

Using two neural networks (Target-Network and Q-Network) improves the stability of the model. Note that the outer iterations continue until the entire dataset is covered.
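The control flow of the two nested iterations and the parameter synchronization can be outlined with a skeleton. Everything here is a placeholder under stated assumptions: batches are taken sequentially rather than sampled, the "gradient step" is a counter increment, and the parameter dictionaries stand in for network weights. It shows only the loop structure, not the training procedure of Algorithm 1.

```python
import copy

def iterate_batches(records, batch_size):
    """Yield consecutive mini-batches until the whole dataset is covered."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

def train(records, batch_size, inner_steps):
    q_params = {"updates": 0}                 # stand-in for Q-Network weights
    target_params = copy.deepcopy(q_params)   # Target-Network keeps its own copy
    syncs = 0
    for state in iterate_batches(records, batch_size):   # outer iteration
        for _ in range(inner_steps):                      # inner iteration
            q_params["updates"] += 1          # placeholder for one training step
        # End of outer iteration: copy the updated parameters to the target.
        target_params = copy.deepcopy(q_params)
        syncs += 1
    return q_params, target_params, syncs

q_params, target_params, syncs = train(list(range(10)), batch_size=4, inner_steps=3)
```

Keeping a separate copy means the targets stay fixed during the inner iterations, which is the usual motivation for a target network.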

Events detected as attacks are sent to the cloud, where the type of attack is identified.