To study epidemiology, various compartmental models have been implemented. Compartmental models define a simple mathematical foundation that projects the spread of infectious disease, and several variants are currently available19. Furthermore, different mathematical models have been presented to illustrate the relationship between population heterogeneity and the present pandemic crisis20. These compartmental models are mostly built on ordinary differential equations (ODEs)21. Although ODEs and other mathematical methods are sufficient for modeling an infectious disease, they lack the randomness of infection, recovery, and death. Moreover, such mathematical models do not account for super-spreaders22.
Therefore, we implement a virtual environment that addresses these issues. The virtual environment generates states and results based on particular actions. It is designed based on the SEIR (Susceptible-Exposed-Infectious-Recovered) compartmental model. Fig. 1 depicts the different stages of the SEIR compartmental model. Due to the randomness of the various transitions, implementing a virtual compartmental model makes the problem more challenging. The virtual environment is designed as a 2D grid in which the population can move randomly. Each day, the population performs a fixed number of random moves. Fig. 2 gives an infographic representation of the environment and the training process.
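The grid dynamics described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the grid size and per-day move budget are assumed values, and movement here is a single random step per move.

```python
import random

# Minimal sketch of the 2D-grid environment described above. GRID and
# MOVES_PER_DAY are illustrative assumptions, not the paper's values.
GRID = 50          # the world is a GRID x GRID cell grid
MOVES_PER_DAY = 4  # fixed number of random moves each individual makes per day

def random_step(pos):
    """Move one cell in a random direction, clipped to the grid boundary."""
    x, y = pos
    dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
    return (min(max(x + dx, 0), GRID - 1), min(max(y + dy, 0), GRID - 1))

def simulate_day(population):
    """Advance every individual by the day's full move budget."""
    for person in population:
        for _ in range(MOVES_PER_DAY):
            person["pos"] = random_step(person["pos"])

population = [{"pos": (random.randrange(GRID), random.randrange(GRID))}
              for _ in range(100)]
simulate_day(population)
```

Restricting movement (discussed below) would amount to shrinking the number of steps executed per day.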
In the environment, susceptible individuals become infected if they are in close contact with an infectious person. Initially, an infected individual is in the exposed stage. After 1-3 days, individuals in the exposed stage transition to the infectious stage, in which they can transmit the disease. Infectious individuals either recover after 21-27 days or lose their lives. The environment is configured so that around 80% of the infected population survives.
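The stochastic transitions above can be sketched as follows. The 1-3 day incubation, the 21-27 day infectious period, and the roughly 80% survival rate come from the text; the uniform sampling of durations and the dictionary representation are illustrative assumptions.

```python
import random

# Sketch of the stochastic SEIR transitions described above.
def expose(person):
    """Susceptible -> exposed, with a random 1-3 day incubation period."""
    person.update(stage="exposed", days=0, incubation=random.randint(1, 3))

def advance_stage(person):
    """Advance one individual by one day through the SEIR stages."""
    person["days"] += 1
    if person["stage"] == "exposed" and person["days"] >= person["incubation"]:
        person["stage"], person["days"] = "infectious", 0
        person["duration"] = random.randint(21, 27)  # infectious for 21-27 days
    elif person["stage"] == "infectious" and person["days"] >= person["duration"]:
        # around 80% of the infected population survives
        person["stage"] = "recovered" if random.random() < 0.8 else "dead"

p = {"stage": "susceptible", "days": 0}
expose(p)
for _ in range(40):  # 40 days is enough for any infection to resolve
    if p["stage"] in ("exposed", "infectious"):
        advance_stage(p)
```

After at most 3 + 27 = 30 simulated days, every infection ends in either recovery or death.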
In the virtual environment, the spread of the disease can be mitigated by restricting the movement of the population. There are three levels of movement restriction in the environmental setup: level 0, level 1, and level 2. In level 0, no movement restrictions are enforced, and the population makes the maximum number of movements. In level 1, the movement of individuals is reduced by 25%; in general, maintaining social distancing and avoiding unnecessary movement is considered equivalent to a level 1 restriction23. In level 2, movement is reduced by 75%, which is similar to a lockdown state24. These movement restrictions are imposed by the DRL agent. However, although movement restrictions reduce the spread of the disease, they also damage the economy.
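The three restriction levels can be read as a simple scaling of the daily move budget. The 25% and 75% reductions are from the text; the base move count is an assumed value for illustration.

```python
# Sketch of the three restriction levels: each level scales the daily move
# budget. BASE_MOVES is an assumed value; the reduction factors follow the text.
BASE_MOVES = 4  # moves per day under level 0 (no restriction)

RESTRICTION_FACTOR = {0: 1.00,   # level 0: no restriction
                      1: 0.75,   # level 1: movement reduced by 25% (distancing)
                      2: 0.25}   # level 2: movement reduced by 75% (lockdown)

def moves_allowed(level):
    """Moves per day the agent's chosen restriction level permits."""
    return round(BASE_MOVES * RESTRICTION_FACTOR[level])
```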
In the virtual environment, each individual contributes to the economy through movement. Therefore, movement restrictions also have an economic impact. Each individual contributes a value of 0.1-0.8 by moving. People who did not survive cannot make any further contribution to the economy, so a growing death count also harms the economy. Likewise, the infectious population cannot contribute, so a high number of active cases also has a negative influence on the economy.
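The economic rule above can be sketched as a daily aggregation. Only the 0.1-0.8 per-individual range and the exclusion of infectious and dead individuals come from the text; uniform sampling is an assumption.

```python
import random

# Sketch of the daily economic contribution described above: individuals who
# are alive and not infectious contribute a value drawn from the 0.1-0.8 range
# given in the text (uniform sampling is an assumption).
def daily_economy(population):
    total = 0.0
    for person in population:
        if person["stage"] not in ("infectious", "dead"):
            total += random.uniform(0.1, 0.8)  # contribution made by moving
    return total

population = ([{"stage": "susceptible"}] * 50 +
              [{"stage": "infectious"}] * 30 +
              [{"stage": "dead"}] * 20)
economy = daily_economy(population)  # only the 50 susceptible people contribute
```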
In RL, a state is an observation that passes estimable information to the agent. By analyzing this information, the agent makes an optimal move based on its policy. States can be either finite or infinite. In the virtual environment setup, relevant information about the spread of the disease is passed through the state, which consists of seven parameters. Fig. 2 illustrates the state parameters as an infographic. Active cases represent the number of people currently in the infectious stage. Newly infected refers to the number of people who have shifted into the infectious stage on a particular day. Cured cases and death cases give the number of people who have recovered and died, respectively, since the start of the pandemic. The reproduction rate represents the average number of people infected by the current infectious population. The economy represents the daily economic contribution of the population. Finally, the current movement restriction level is also included as a state parameter.
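The seven-parameter state can be assembled as a flat vector. The field names follow the text; the ordering and the raw (unnormalized) values are assumptions for illustration.

```python
# Sketch of the seven-parameter state described above. Field names follow the
# text; ordering and normalization are illustrative assumptions.
def make_state(env):
    return [env["active_cases"],      # population currently infectious
            env["newly_infected"],    # shifted into the infectious stage today
            env["cured_cases"],       # cumulative recoveries
            env["death_cases"],       # cumulative deaths
            env["reproduction_rate"], # avg. people infected per infectious case
            env["economy"],           # the day's economic contribution
            env["restriction"]]       # current movement-restriction level

state = make_state({"active_cases": 120, "newly_infected": 8,
                    "cured_cases": 300, "death_cases": 40,
                    "reproduction_rate": 1.3, "economy": 410.5,
                    "restriction": 1})
```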
In DRL, an action is encouraged or discouraged by a reward function. A reward function encourages an agent to reach a particular state/situation by assigning that situation a high reward. On the contrary, a particular action or situation is discouraged by giving the agent a low reward. The agent tries to learn a policy under which it avoids the discouraged situations. By designing a proper reward function, it is possible to train an agent that follows the behavior humans desire. For the current environment, the reward function is designed as follows,
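The equation itself did not survive extraction. Based on the parameter descriptions that follow (economy ratio Et, cumulative death ratio Dt, active-case percentage At, and tuning parameters r and s), a reconstruction consistent with the quoted numbers is

\[
R_t = E_t \, e^{-r A_t} - s\, D_t
\]

With r = 8, the gap between the highest and lowest economic stages, 0.75·e^{-8A_t}, falls below 0.001 once At exceeds roughly 0.82, matching the stated critical point. This form is an inference from the surrounding text, not a verbatim copy of the original equation.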
The reward function contains three parameters from the environment: the current economy ratio, the current cumulative death ratio, and the current percentage of active cases. Due to the three types of movement restriction, the economy ratio can be separated into three levels. Because of the direct relationship between movement restriction and the economy, level 0, level 1, and level 2 yield values of Et approximately close to 1, 0.75, and 0.25, respectively. However, these values can shift due to a high death count and randomness. Setting the Dt parameter aside, the correlation between the economic levels and active cases can be examined; Fig. 4 illustrates such a situation. The graph shows that while active cases are low, the reward prioritizes the higher economic stages, and a further increase in active cases lessens the reward of the higher economic stages.
By setting r = 8, the rewards of the different economic stages become almost identical (the absolute difference is less than 0.001) once active cases cross 0.82%. This boundary can be thought of as a critical point beyond which the economy no longer matters; past it, the goal becomes to suppress the surge of the disease. Furthermore, by including Dt in the reward function, the agent is also encouraged to reduce the death ratio. Fig. 3 illustrates the relation of the reward function to the active-case percentage and the death ratio in the three possible economic stages. The impact of deaths on the reward function is tuned using the parameter s, and s = 5 is set to prioritize the negative impact of deaths.
Both r and s are tuning parameters of the reward function. Increasing the value of r reduces the reward threshold (described in Fig. 4), whereas the value of s defines the significance of deaths. A higher value of s drives the agent to reduce the death ratio heavily, even at the cost of the economic balance.
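The quoted threshold can be checked numerically, assuming the reward takes the reconstructed form R = E·exp(−r·A) − s·D (an inference consistent with the quoted numbers, not the paper's verbatim equation; A is the active-case percentage, E the economy ratio, D the death ratio):

```python
import math

# Numeric check of the threshold behaviour described above, under the assumed
# reward form R = E * exp(-r * A) - s * D.
def reward(E, A, D, r=8, s=5):
    return E * math.exp(-r * A) - s * D

# Gap between the highest (E = 1) and lowest (E = 0.25) economic stages:
gap_before = reward(1.0, 0.80, 0.0) - reward(0.25, 0.80, 0.0)  # below 0.82%
gap_after = reward(1.0, 0.83, 0.0) - reward(0.25, 0.83, 0.0)   # above 0.82%
```

Under this form, the gap sits just above 0.001 below the 0.82% mark and drops below 0.001 just past it, consistent with the critical point described in the text.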
The Agent Network
The decision process of the DRL agent can be considered a Markov Decision Process (MDP). In an MDP, the environment contains a finite set of states S, with a finite set of actions A. If s, st ∈ S and α ∈ A, then the state transition can be represented as,
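The transition equation did not survive extraction; in standard MDP notation, consistent with the symbols used in the text, it reads

\[
\tau(s_t \mid s, \alpha) = \Pr\left(S_{t+1} = s_t \mid S_t = s, A_t = \alpha\right)
\]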
The equation states the probability of transitioning to a new state st when an action α is chosen in an environment state s. The DRL agent acquires a policy π through bootstrapping. Following this policy, the agent performs an optimal action αi for a given state s, represented as π(αi | s). The optimal action is chosen based on the state-value function Vπ(s), which defines the expected cumulative reward: a sum of future state rewards R discounted by powers of the discount value γ. This can be presented as,
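The value-function equation is missing here; the standard definition, consistent with the discount value γ and state rewards R named in the text, is

\[
V_\pi(s) = \mathbb{E}_\pi\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s \right]
\]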
An optimal policy π∗ yields the best possible state-value function, which can be defined as,
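The optimality equation is missing here; in standard notation it reads

\[
V_{*}(s) = \max_{\pi} V_\pi(s), \qquad \forall s \in S
\]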
As the transition function of the MDP (τ(st | s, α)) is unknown, a state-action function Qπ(s, α) is generated. The state-action function mimics the state-value function Vπ(s) while also trying to identify the best action α. It greedily chooses the action for which it gains the maximum state-value.
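In standard notation, the state-action function and the greedy action choice described above are

\[
Q_\pi(s, \alpha) = \mathbb{E}_\pi\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s,\ A_t = \alpha \right],
\qquad
\alpha^{*} = \arg\max_{\alpha \in A} Q_\pi(s, \alpha)
\]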
The Qπ(s, α) function is approximated by the DRL agent. In the experiment, we study memory-based DRL agents, since a memory-based agent perceives further possibilities, takes optimal decisions, and acquires better rewards25. Among different memory sizes, we found that the DRL agent selects better actions with a minimal memory of 30 days. The agent is implemented using three bidirectional Long Short-Term Memory (LSTM) layers. A bidirectional LSTM performs optimally when there exist both forward and backward relationships in a portion of data26. For this epidemic data, using bidirectional LSTMs provides the following benefits: (a) selecting an optimal action based on previous data, and (b) estimating the influence of selecting a particular action. The agent uses three bidirectional LSTM layers followed by four dense layers; Fig. 5 depicts the memory-based DRL agent architecture.
The DDQN (Double Deep Q-Network) method is used to train the agent. The DDQN architecture uses a main agent and a target agent; traditionally, both agents share the same network structure, and the standard DDQN training process is used here27. The agent is trained over 7000 episodes without any prior knowledge or human intervention. To explore the environment properly, random actions were taken during the training episodes. Training starts with a random-action ratio of ε = 1, which is continuously decayed as ε = max(ε − ε/6000, 0.1). To propagate future rewards to any particular state, the discount value γ is set to 0.9.
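The two training ingredients above can be sketched in isolation: the DDQN target, in which the main network selects the best next action while the target network evaluates it, and the ε decay schedule. Toy Q-value lists stand in for the Bi-LSTM agents; the numbers are illustrative.

```python
GAMMA = 0.9  # discount value from the text

# Sketch of the Double-DQN target described above: the main (online) network
# chooses the best next action, and the target network evaluates it.
def ddqn_target(reward, q_main_next, q_target_next):
    best = max(range(len(q_main_next)), key=lambda a: q_main_next[a])
    return reward + GAMMA * q_target_next[best]

# Main network favors action 1; target network evaluates that same action:
y = ddqn_target(1.0, [0.2, 0.9, 0.5], [0.3, 0.6, 0.8])  # 1.0 + 0.9 * 0.6

# Exploration schedule from the text: eps starts at 1 and decays each episode
# as eps <- max(eps - eps/6000, 0.1), over 7000 training episodes.
eps = 1.0
for _ in range(7000):
    eps = max(eps - eps / 6000, 0.1)
```

Decoupling action selection from action evaluation in this way is what distinguishes DDQN from plain DQN and reduces Q-value overestimation.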