Epidemic modelling of multiple virus strains: a case study of SARS-CoV-2 B.1.1.7 in Moscow

During a long-running pandemic a pathogen can mutate, producing new strains with different epidemiological parameters. Existing approaches to epidemic modelling only consider one virus strain. We have developed a modified SEIR model to simulate multiple virus strains within the same population. As a case study, we investigate the potential effects of SARS-CoV-2 strain B.1.1.7 on the city of Moscow. Our analysis indicates a high risk of a new wave of infections in September-October 2021 with up to 35 000 daily infections at peak. We open-source our code and data.


I. INTRODUCTION
I N March 2020, the World Health Organization (WHO) declared COVID-19, a disease caused by the SARS-CoV-2 virus, a global pandemic [1]. Up until August 2020 the number of worldwide COVID-19 cases doubled approximately every 20 days. As of April 2021, there have been over 130 million confirmed cases and over 2.8 million deaths.
Since the start of the epidemic SARS-CoV-2 has mutated into multiple new major strains. Some of these strains have the potential to change the course of pandemic and were deemed variants of concern (VOC) by the World Health Organization. As of time of writing, the list of VOC includes the B.1.1.7 strain from United Kingdom, the B.1.351 from South Africa and the B.1.1.28.1 [2]. The B.1.1.7 strain has been reported to have a 40% -90% larger basic reproduction number than the base SARS-CoV-2 [3], [4] and it has spread to 130 countries at the time of writing [2]. In combination this makes the B.1.1.7 potentially dangerous and it is important to assess the risks and plan the response.
Epidemic modelling was successfully used to estimate the spread of the virus and provided tools to plan the optimal response strategy [5]. In particular, modifications of the SEIR model [6] were effective, as both simple and capable of simulating a complex epidemic [7], [8], [9], [10], [11], [12], [13]. There were multiple attempts to apply more advanced machine learning techniques to COVID-19 forecasting [14], [15], [16]. However, Sun et al. report that due to insufficient training data and over-fitting more advanced machine learning models have not achieved improvements significant enough to justify their added complexity [9].
The core research question of our work is not to provide the most accurate forecast, but to estimate the potential impact of 1 https://github.com/btseytlin/covid peak sir modelling a new virus strain. We chose a SEIR-based approach to this problem for the following reasons. First, the parameters of SEIR models are simply characteristics of the disease, which allows to use medical studies to constrain parameter ranges. Second, Unlike black-box models, SEIR models provide the ability to adjust parameters and simulate multiple "what-if" scenarios. This is crucial for simulating what a virus strain with a 40% -90% larger basic reproduction number can do.
The main contributions of our work are as follows: The remaining paper is structured as follows. Section II provides necessary background on SEIR models. In Section III we describe the proposed SEIR modification and modelling quarantine measures. Section IV describes the evaluation process, dataset, parameter optimization and the resulting accuracy metrics. Finally, in Section V we extend the model for multiple strains. In Section VI we obtain forecasts for B.1.1.7 in Moscow.

II. SEIR BACKGROUND
The original SEIR model [6] simulates an epidemic in a closed population of size N . The SEIRD model [17] extends SEIR by adding fatalities from infection. We use the SEIRD model further on, as it is more relevant to our task.
In SEIRD, the population is partitioned into five compartments: Susceptible, Exposed, Infections, Recovered and Deceased. Susceptible individuals can be infected. Exposed individuals have been infected, but are not spreading the pathogen yet. Infectious individuals are spreading the pathogen. Recovered individuals are permanently immune. The size of compartments at time t is denoted by S(t), E(t), I(t), R(t), D(t). The following ODE system describes the transfer of population between compartments: subject to initial conditions S 0 , E 0 , I 0 , R 0 , D 0 and the con- The parameters of the model are: α -infection fatality rate, β -number of cases generated by an infected individual in one day, δ = 1/d incubation where d incubation is the length of incubation period, γ = 1/d inf ectious where d inf ectious is the time in days till recovery or death.
The basic reproduction number R 0 is the defining characteristic of a pandemic. It equals the expected number of people infected by one infectious individual during the course of their sickness and can be computed as follows:

III. MODELLING THE COVID-19 EPIDEMIC IN MOSCOW
To create an epidemic model for two strains we first have to create an accurate model of the one-strain epidemic. SEIRD requires modifications to simulate a long-running pandemic like COVID-19. It assumes all parameters to stay constant, but this assumption is violated by quarantine measures, where R 0 changes with time. Optimizing the parameters on historical data is also problematic as the model ignores possible underreporting. To resolve these issues we modify the SEIRD model with a quarantine measures function and additional compartments.

A. Modelling quarantine measures
In case of COVID-19 quarantine and social distancing measures directly affect the number of contacts and the probability of infection on contact, which makes β dependant on measures currently in effect. As a consequence, a model with constant β is not able to simulate multiple waves of epidemic. To model the effect of quarantine measures we replace the constant parameter β with a time-dependant function β(t).
For β(t) we propose a simple model where at every t the basic reproduction number is reduced by a factor of q(t) ∈ [0, 1]. Then we can then obtain β(t) from R(t) using Equation 2.
In general q(t) can be any differential function. We used a stepwise function with sigmoid transitions between steps: The q 1 , q 2 , . . . , q n values are quarantine power values to be obtained by optimization and the knot points t 1 , t 2 , . . . , t n are hyperparameters. The transition function produces a soft descent or ascent from q a to q b . The parameters s and r control the shape of transition. Empirically we found r = 20 and c = 0.5 to work well for our case.

B. SEIRD-H: modelling under-reporting in data
In reality statistical data on infections, deaths and recoveries is not complete. Many people go through the sickness without getting tested or hospitalized. Errors are possible when registering deaths as well. For example, a previous study [18] has shown that in Moscow the real death toll from COVID-19 might be 2.6 times higher than what the official statistics shows. We propose a modification to SEIRD to take in account the incompleteness of statistical data.
Let cases of infection, death and recovery be called visible if they are registered in statistical data, and invisible if they are undetected. Upon infection a case becomes visible with probability p i , and invisible with probability 1−p i . A death of an invisible case is registered as a visible death with probability p d to handle the situation where a person dies from unknown causes and is then revealed to have died from COVID-19.
To model these relationships we modified the compartments of the SEIRD model. In the modified model compartments I v , R v , D v contain visible cases and compartments I, R, D contain invisible cases. We nickname this model SEIRD-H: SEIRD with hidden states. The ODE system for the SEIRD-H model is provided in Appendix A Equation 9 and Figure 1 shows the schematic of the new model.
How do we obtain the initial conditions S 0 , E 0 , I 0 , I v0 , R 0 , R v0 , D 0 , D v0 ? The sizes of invisible compartments can not be directly observed. Our solution is to run the simulation from the first infected case till the required date. Consider the case of Moscow data: the first date of dataset is 2020-03-12, while the first official recorded COVID-19 case in Moscow is on 2020-03-02. We set the compartment values to S = N − 1, I = 1, and the rest to zeros, and run the simulation for ten days. The obtained states are the initial conditions for further simulation.
We have established a modified SEIR model by adding a time-dependant function of quarantine measures and hidden compartments for cases invisible in statistical data. However the model is not complete until we find such parameters that it's able to accurately simulate the one-strain epidemic. Additionally, the accuracy of the model must be experimentally tested before it can be extended for virus two-strains.

IV. EXPERIMENTS
We aim to produce a model that outputs values for visible I, D and R cases as close as possible to statistical data. However it's not enough to check the fit of our model on training data. In reality we want to train the model on historical data and produce forecasts for the feature. We have to use an evaluation procedure that imitates this regime to get a robust estimate of the quality of forecasts. One way to do this is time-aware cross-validation. In this setting, an evaluation date s randomly selected. The model parameters are optimized on historical data before this date. The model produces a forecast for the time range after the evaluation date. The forecast is compared to the ground truth data and error is computed. The process is repeated many times and final error metrics are averaged. We evaluate our model using time-aware cross validation and use the Mean Absolute Error (MAE) and Symmetric Mean Percentage Error (SMAPE) metrics. The results are compared to two baselines: the persistence model, which always predicts the last training value, and the unmodified SEIRD model.

A. Dataset
For conducting the experiments the official COVID-19 data provided by Moscow authorities and made available by Yandex [19] was used. The dataset spans the time range between 2020-03-10 and 2021-03-23. For each date the following statistics were provided: new confirmed cases, new recoveries, new deaths, total confirmed cases, total recoveries, total deaths.

B. Optimizing model parameters
We optimized the model-predicted new daily visible infections, deaths and recoveries to be as close as possible to the statistical data. The Levenberg-Marquardt method via the lmfit [20] library was used for performing the optimization. Ranges of possible parameter values were constrained using information from COVID-19 studies [21], [13]. The final loss function is a sum of residuals for infections, deaths and recoveries. To put all residuals on the same scale we transformed them to symmetric relative error: The final loss function L is a weighted sum of relative errors for daily infected, recovered and dead: Where n is the length of training time range, true I (t), true R (t), true D (t) are ground truth daily infections, recoveries and deaths on day t, and pred I (t), pred R (t), pred D (t) are output by the model, w I , w R , w D are hyperparameter weights for loss components.
Empirically we found weights w D = 0.5, w I = w R = 0.25 to work well.
The obtained parameters are provided in Table I. An interesting observation is that our models considers 75% infectious cases to go undetected.
The resulting quarantine measure function and training fit are presented on Figure 2 and Figure 3. The model predicts no quarantine measures at the beginning of the epidemic, a summer period of lockdown and a following period of eased quarantine measures, which approximately matches the real timeline in Moscow.

C. Model verification
For testing the one-strain model by time-aware crossvalidation we pick 14 evenly spaced dates in the training range. For each date we train the model on the previous data, forecast cumulative deaths for 30 days ahead, and compute the MAE and SMAPE.
The resulting metrics are presented in Table II. Basic SEIRD fails to outperform the persistence baseline, mostly due to producing very bad predictions during the second wave of  infections, as it's not able to simulate multiple waves. SEIRD-H outperforms both baselines. We conclude that SEIRD-H produces accurate enough predictions to extend it for modelling two virus strains.

V. MODELLING TWO VIRUS STRAINS
We can think of two strains within the same population as two parallel pandemics, with the condition that a person can only get infected once. This can be modelled by two SEIRD models that share the compartments of Susceptible, Recovered and Dead individuals. Figure 4 illustrates this idea.
The base SARS-CoV-2 model can be obtained by training a SEIRD-H model. For training this model we use historical data prior to the date of first infection by the second strain.
Obtaining the model for the second strain is trickier, as we do not have per-strain statistical data to train a separate model. We can use the fact that a new strain at it is core is the same disease as the base pathogen, but with slightly different epidemiological parameters. For example, we can assume that quarantine measures affect the new strain in the same way as  Fig. 4: Two-strain model as two parallel pandemics that share the compartments of Susceptible, Recovered and Dead.
the base strain. Thus, we can reuse the learned parameters of the base model to create the new strain model. The major difference between the base SARS-CoV-2 and B.1.1.7 is that B.1.1.7 has a 40% -90% larger basic reproduction number [3]. Then the B.1.1.7 model can be obtained by copying the parameters from the base model and modifying R B.1.1.7 (t). For example, if B.1.1.7 is 50% more infectious than the base strain then we can set R B.1.1.7 (t) = 1.5 · R(t).
The ODE system for the two-strain model is provided in Appendix section A Equation 10 and the detailed schematic is provided in Appendix section B Figure 7. Now that we have obtained a model for two virus strains, we can use it to forecast the potential effect of B. To sum up our analysis, the new wave of infections happens in every scenario, even though the magnitude and date of the peak vary. The new wave might be averted by a factor not considered in the model, like vaccinations or seasonality, but we can say for certain that B.1.1.7 possesses a significant risk.  Up until July 2021 it will look like COVID-19 infections are declining, as the few B.1.1.7 will be invisible among numerous base strain cases. It will be tempting to relax quarantine measures. Then B.1.1.7 cases can explode exponentially and catch society by surprise. The bare minimum recommendation is to prepare for the explosion of cases in June-August 2021 and study B.1.1.7 further.

VII. CONCLUSIONS
The aim of our work was to develop a model to asses the risk of a new virus strains. We proposed the SEIRD-H model, a modified SEIR model with additional compartments, to simulate the epidemic from incomplete statistical data. We introduce a smoothed stepwise function to model quaran-tine measures. The proposed approach is verified by crossvalidation on historical data and is shown to outperform the baselines. Finally, we extend the obtained model to simulate two strains of COVID-19 and forecast the impact of B.1.1.7 strain on the city of Moscow. Experimental results indicate that B.1.1.7 has the potential to cause a new wave of infections in Moscow that peaks in September-October 2021.
Narrowing the risk estimate is subject to further study. One approach to improving the forecast is to find the true number of current B.1.1.7 cases via genetic sequencing. The other approach is to model the effect of vaccinations. Currently our model provides a starting point for estimating the effects of new virus strains and can be used to assess risks of other COVID-19 strains and even different epidemics.
The interpretations, conclusions, and recommendations in this work are those of the authors and do not necessarily represent the views of associated organizations.

APPENDIX A TWO-STRAIN MODEL ODE SYSTEM
The ODE system for the SEIRD-H one-strain model: The ODE system for the SEIRD-H two-strain model: