As described in Sections I and II, reinforcement learning is the most suitable approach for routing in Underwater Sensor Networks. Here, the authors have used Q learning to find a path from source to destination. Let us discuss the implementation of Q learning for underwater networks.
A. Q Learning
Q learning is one of the most popular reinforcement learning algorithms. It enables an agent to learn iteratively from the environment and improve over time by choosing optimal actions: the agent learns by exploring the environment and updates its model as the exploration continues.
The components of Q learning include the agent, states, actions, rewards, and Q values.
Agent – is a learning entity that takes optimal actions and operates in a specified environment
State – The current position of an agent in the specified environment is labeled as the state of the agent
Actions – Any operation taken by an agent when it is in a specified state.
Rewards – A positive or negative response given to an agent based on its action.
Q values – It is the metric used to measure the quality of an action taken in a particular state.
Some of the ways of finding this Q value are -
Temporal Difference – It calculates the Q value by comparing the value of the current state and action with that of the previous state and action.
Bellman’s Equation – It calculates the Q value of a given state and assesses how it ranks relative to other states. The state with the highest value is considered the optimal state. Bellman’s equation is given in (1).
\(Q\left(s,a\right)=Q\left(s,a\right)+\alpha \left(r+\gamma \max_{{a}^{\prime }}Q\left({s}^{\prime },{a}^{\prime }\right)-Q\left(s,a\right)\right)\) (1) [22]
Q(s,a) → represents the expected reward for taking action a in the state s
r → represents the actual reward received for any particular action
s’ → represents the next state
α → represents the learning rate
γ → represents a discount factor
max(Q(s’, a’)) → is the highest expected reward over all possible actions a’ in state s’
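To make the update in (1) concrete, a minimal Python sketch is given below. The function and variable names (q_update, alpha, gamma) are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply the Bellman update of (1) to the entry Q[s, a].

    Q       : 2-D array of Q values indexed by [state, action]
    s, a    : current state and action indices
    r       : reward received for taking action a in state s
    s_next  : index of the next state s'
    alpha   : learning rate
    gamma   : discount factor
    """
    best_next = np.max(Q[s_next])  # max over a' of Q(s', a')
    Q[s, a] = Q[s, a] + alpha * (r + gamma * best_next - Q[s, a])
    return Q
```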
In this paper, the authors have used Bellman’s equation to find the Q value. These Q values are stored in a Q table whose rows correspond to the states and whose columns correspond to the actions of the specified environment, with each cell holding the expected reward for that state-action pair (Table 2).
Table 2

State/Action | A1        | A2        | A3
S1           | Q(S1, A1) | Q(S1, A2) | Q(S1, A3)
S2           | Q(S2, A1) | Q(S2, A2) | Q(S2, A3)
S3           | Q(S3, A1) | Q(S3, A2) | Q(S3, A3)
Here, the rows represent the different situations (states) the agent might encounter and the columns represent the actions. The agent looks up the Q table to fetch the expected future reward for any given state-action pair.
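As a sketch of this lookup, assuming a 3 × 3 table shaped like Table 2 (states S1–S3 as rows, actions A1–A3 as columns), the best action for a state is the argmax over that state’s row; the names below are illustrative.

```python
import numpy as np

# Hypothetical Q table shaped like Table 2: rows S1-S3, columns A1-A3.
Q = np.zeros((3, 3))

def best_action(Q, state):
    """Return the index of the action with the highest expected
    future reward for the given state (argmax over that row)."""
    return int(np.argmax(Q[state]))
```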
B. AI Routing
An underwater environment is simulated with an arbitrary (ad hoc) number of sensors. An AI-based routing scheme is designed to route packets from the source sensor to the destination sensor. Here, the state is represented either as position precision or as the sensor path. Algorithm 1 shows the AI-based routing used in this paper.
Algorithm 1
AI-Based Routing

2. Initialize state, action, reward, and the Q-table
3. Initialize the α, γ, and ε values
4. Choose an action based on the ε-greedy method:
     Update the Q-table using (1)
     Calculate the reward:
       If outcome == -1
         Return -2
       Else
         Return 2 * (1 - delay / event_duration)
     Select the next action based on the highest score in the Q-table
   Repeat step 4 for various episodes
Sensors can take random actions to forward packets to the next sensor. These random actions are selected using ε-greedy selection or geo-location-based routing, which are described in section C. Upon selecting an action, the Q value and reward are calculated as specified in Algorithm 1, and the Q table is updated accordingly. Once the Q table has been updated over various episodes, the sensor chooses the action with the highest value in the Q table. These actions yield an optimal path to deliver the packet to the destination sensor.
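A minimal sketch of this procedure is shown below. The reward rule follows Algorithm 1; the environment interface (reset/step) and all other identifiers are assumptions made for illustration, not the authors' implementation.

```python
import random
import numpy as np

def reward(outcome, delay, event_duration):
    """Reward rule from Algorithm 1: -2 on failure, otherwise a value
    that shrinks as the delivery delay approaches the event duration."""
    if outcome == -1:
        return -2
    return 2 * (1 - delay / event_duration)

def train(env, Q, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Repeat step 4 of Algorithm 1 for a number of episodes.

    Q is a NumPy array indexed by [state, action]. env is assumed to
    expose reset() -> state and step(state, action) ->
    (next_state, outcome, delay, event_duration, done).
    """
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy selection (see section C)
            if random.random() < epsilon:
                a = random.randrange(Q.shape[1])
            else:
                a = int(np.argmax(Q[s]))
            s_next, outcome, delay, event_duration, done = env.step(s, a)
            r = reward(outcome, delay, event_duration)
            # Bellman update of (1)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```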
C. Action Selection Method
In this section, let us discuss various action selection methods used in AI Routing algorithms.
Epsilon Greedy Action Selection Method – is a method that randomly balances exploration and exploitation. Here, the ε value refers to the probability of choosing a random (exploratory) action rather than the greedy one. Generally, ε is initialized to a small value, giving exploration a lower chance.
Action taken at time \(t\) = \(\left\{\begin{array}{ll}\text{max }{Q}_{t}\left(a\right)& \text{with probability } 1-\epsilon \\ \text{any action } a& \text{with probability } \epsilon \end{array}\right.\)  (2)
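A minimal sketch of (2), assuming the Q table is a NumPy array indexed by [state, action]; the function name and default ε are illustrative.

```python
import random
import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1):
    """Equation (2): with probability epsilon take a random action
    (exploration); otherwise take the action with the highest Q value
    for the current state (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(Q.shape[1])   # explore
    return int(np.argmax(Q[state]))           # exploit
```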
Geo Location Based Routing – is a geographical approach that returns the sensor closest to the requestor, as shown in (3). In this approach, the nearest sensor is chosen based on the geographical positions of the sensors, which are estimated from the acknowledgments of hello packets.
Geo-location and epsilon-greedy routing are in turn based on two important parameters: next target and position precision. Next target finds the nearest sensor using a dictionary built from earlier hello messages. This search has a threshold; once the threshold has been reached, the method switches to position precision, where the nearest neighbors are instead calculated from hello-packet response times.
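The sketch below illustrates one way this switch could be implemented; the dictionary layouts, the distance-based and response-time-based selection, and the threshold handling are assumptions based on the description above, not the authors' code.

```python
import math

def nearest_by_next_target(hello_positions, my_pos):
    """'Next target': choose the closest sensor from a dictionary of
    positions learned from earlier hello messages.
    hello_positions: {sensor_id: (x, y, z)} -- assumed layout."""
    return min(hello_positions,
               key=lambda sid: math.dist(my_pos, hello_positions[sid]))

def nearest_by_position_precision(hello_rtts):
    """'Position precision': choose the neighbor with the shortest
    hello-packet response time.
    hello_rtts: {sensor_id: response_time_seconds} -- assumed layout."""
    return min(hello_rtts, key=hello_rtts.get)

def geo_next_hop(hello_positions, hello_rtts, my_pos, lookups_done, threshold):
    """Use next-target lookups until the threshold is reached, then
    switch to position precision, as described in the text."""
    if lookups_done < threshold:
        return nearest_by_next_target(hello_positions, my_pos)
    return nearest_by_position_precision(hello_rtts)
```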
In this paper, the authors have implemented epsilon-greedy and geo-location routing with the next-target and position-precision parameters. The results are presented in the next section.