Predictive modeling of novel coronavirus spreading in European regions based on worldwide data-sets

Background: The outbreak of the novel coronavirus 2019-nCoV, reported by the World Health Organization (WHO) in its situation report of 21 January 2020, spread around the globe during the following weeks and has caused serious health, social and financial issues. The primary epicentre was reported in Wuhan City, China, which together with other affected regions in Asia remains the most monitored area. The European region, on the other hand, experienced only a few cases until 29 February, when the number of confirmed cases reached one thousand. Within a few days, on 8 March, the most infected places in Italy were locked down under a restrictive quarantine. In this study we present a probabilistic model that tracks the spread of infection through the European regions, based on a model adjusted with worldwide data-sets. Methods: To track and predict the officially reported number of cases, we developed a probabilistic model able to predict the number of confirmed cases one day ahead. The model has internal (hidden) parameters that cannot be observed directly, such as the number of infectious individuals or the transmissibility rate, which is represented by the reproduction number R_t. The model starts with an assumed number of infectious individuals and an assumed reproduction number R_0, and the internal parameters are refined as more data are gathered from the data-sets over time. The model is updated each day as the new number of confirmed cases is reported. The particle filter algorithm is used as the back-end method because of its ability to handle multi-modal distributions. Results: The presented results show the performance of the probabilistic model, which handles short-term prediction (on the order of days) of confirmed and recovered cases. The estimated reproduction number is further used in a long-term simulation that fits the data gathered worldwide. The one-day prediction error is below 5% of the nominal value; because we are located in Czechia, the prediction model for our region was also tested four days ahead, with the same error. The overall performance of the model was compared with data gathered from China because of the longer history of measurements there. Conclusion: We have proposed a probabilistic model that is used with particle filters to predict the next values of the confirmed cases. As a side effect, we were able to estimate internal parameters such as the reproduction number and the recovery rate, which can be used for running long-term simulations of the virus spreading.


Introduction
The present study focuses on modeling the growth of infections (number of confirmed cases) and the transmissibility rate (represented by the reproduction number R_t) of the novel coronavirus 2019-nCoV in European regions, based on worldwide data-sets. The Covid-19 outbreak was reported by the World Health Organization (WHO) on 21 January 2020 [1]. On 30 January, WHO announced that Covid-19 constituted a Public Health Emergency of International Concern [2]. After two months, at the beginning of March, the number of confirmed cases in Wuhan City, China had reached 67000 [3], and Italy, the European epicentre, was placed under restrictive quarantine with over 10000 cases on 11 March 2020 [3]. More European regions applied restrictive measures such as closing public places, schools and universities.
In this study we present a probabilistic model to track the spread of infection through the European regions, based on a model adjusted with worldwide data-sets.
Other approaches use mathematical models to simulate the dynamics of infection spreading in a population [4,5,6] or attempt to model the spread via commercial air travel [7,8].
While our model is based on a nonparametric implementation of the Bayes filter called the particle filter [9], other methods rely on different Bayes filters such as Kalman filters [10] or on Markov chain Monte Carlo methods [11].

Materials and methods
Used data sets
All reported experiments are based on data-sets available online [12], operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) [13]. The data-sets, which contain the total numbers of confirmed (reported) cases [3], recovered cases and deceased cases [14,15], are used in the simulation experiments for each Country/Region and each day from 22 January 2020 to 22 March 2020.
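As an illustration of how such per-day counts could be prepared for the model, the following minimal sketch aggregates a wide-format time series by Country/Region. It assumes the CSV layout of the JHU CSSE time series files (four leading columns `Province/State`, `Country/Region`, `Lat`, `Long`, followed by one column per date). The function name `confirmed_by_country` and the inline sample are hypothetical, and the sample values are placeholders, not real counts.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows in the assumed wide layout; the numbers are made up.
SAMPLE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
,Examplia,0.0,0.0,0,1,3
North,Testland,0.0,0.0,2,2,5
South,Testland,0.0,0.0,1,4,4
"""


def confirmed_by_country(csv_text):
    """Sum the daily cumulative counts of all provinces of each country."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    dates = header[4:]                      # date columns follow the 4 id columns
    totals = defaultdict(lambda: [0] * len(dates))
    for row in reader:
        country = row[1]
        for i, value in enumerate(row[4:]):
            totals[country][i] += int(value)
    return dates, dict(totals)


dates, totals = confirmed_by_country(SAMPLE)
for country, series in totals.items():
    print(country, list(zip(dates, series)))
```

Each resulting series gives one measurement per day, which matches the one-day time step used by the model.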

Simulation model
The model estimates the reproduction number R, which represents the transmissibility of the virus, changing in time, as the average number of newly infected people generated by one infectious person. The initial constant R_0 is called the basic reproduction number [16], and several studies focus on calculating R_0 [17,18]. The actual average number of infections generated by one person at time t is called the effective reproduction number R_t [6,19]; other researchers have estimated it as R_t = 6.47 [20], R_t = 2.2 [5] and R_t = 2.68 [11].
The simulation model is based on a prediction-correction mechanism [9]: in each prediction step the internal parameters are estimated, and in the correction step the model parameters are updated with a measurement (if available). The internal state of the model, x_t = [I_t, C_t, Re_t, R_t]^T, consists of the effective reproduction number R_t, the number of infectious individuals I_t who have not been detected, the number of confirmed cases C_t and the number of recovered cases Re_t, which can be obtained from public reports. The measurement taken at time t is represented by the variable z_t.
The algorithm uses particle filters to represent the posterior distribution bel(x_t) = p(x_t | z_t) over possible states x_t, given the measurements z_t, by a set of particles X_t. The simulation time step was chosen as one day because of the nature of the data-sets used. The input to the model (the measurement) is the number of confirmed cases for each day.
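For orientation, the prediction-correction mechanism can be written as the standard Bayes filter recursion, sketched here without a control input since the text describes none. The transition density p(x_t | x_{t-1}), the measurement likelihood p(z_t | x_t) and the normalizer \eta are generic placeholders; the source does not give their concrete form.

```latex
\begin{align}
\overline{bel}(x_t) &= \int p(x_t \mid x_{t-1})\, bel(x_{t-1})\, dx_{t-1} && \text{(prediction)} \\
bel(x_t) &= \eta\, p(z_t \mid x_t)\, \overline{bel}(x_t) && \text{(correction)}
\end{align}
```

A particle filter approximates both steps by sampling: each particle is propagated through the transition model, and the corrected belief is represented by resampling the particles in proportion to their measurement likelihoods.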
Because the available data-set provides information about confirmed cases, the model is used to predict the confirmed cases C_t based on the estimated effective reproduction number R_t and the number of infectious individuals I_t.
One iteration of the proposed model consists of the following steps. Steps 1 and 2 initialize the particles and state vectors.
Steps 3 and 4 sample a new set of particles from the distribution of samples at the previous time step t − 1. For each particle x_t, a weight w_t is calculated from the measurement z_t, and the weighted set of samples is updated. Finally, steps 7 and 8 create the new set of particles X_t by resampling the states with probability given by their weights.
The incubation period parameter was set to 5.2 days according to the WHO report [21], the work of Li et al. [5] and calculations based on the time from disease onset to diagnosis [22]. The recovery period parameter was originally set to 10 days, following the parameters used in other models, e.g. epidemic SEIR models [23,24], where it represents the time from diagnosis to total recovery.
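The sample, weight and resample steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the source does not state the transition equations, so the daily update in `predict`, the prior ranges in `init_particles`, the random-walk noise on R_t and the 5% measurement spread in `weight` are all assumptions, as are the function names. Only the incubation period (5.2 days) and the recovery period (10 days) come from the text.

```python
import math
import random

INCUBATION_DAYS = 5.2   # value stated in the text
RECOVERY_DAYS = 10.0    # value stated in the text


def init_particles(n, c0, rng):
    """Draw n particles (I, C, Re, R) from an assumed prior."""
    return [
        (rng.uniform(0.0, 10.0 * max(c0, 1.0)),  # undetected infectious I_0 (assumed range)
         float(c0),                              # confirmed C_0 from the first report
         0.0,                                    # recovered Re_0
         rng.uniform(0.5, 3.0))                  # reproduction number (assumed range)
        for _ in range(n)
    ]


def predict(particle, rng):
    """Assumed one-day transition; the source does not give its equations."""
    i, c, re, r = particle
    r = max(0.0, r + rng.gauss(0.0, 0.05))   # random walk on R_t (assumed noise)
    detected = i / INCUBATION_DAYS           # infectious individuals confirmed today
    i = max(0.0, i + (r - 1.0) * detected)   # detected leave I, each having caused r infections
    re = re + (c - re) / RECOVERY_DAYS       # confirmed cases recovering today
    c = c + detected                         # cumulative confirmed cases
    return (i, c, re, r)


def weight(particle, z):
    """Gaussian likelihood of the reported confirmed cases z (assumed 5% spread)."""
    sigma = max(1.0, 0.05 * z)
    return math.exp(-0.5 * ((particle[1] - z) / sigma) ** 2)


def filter_step(particles, z, rng):
    """Predict every particle, weight it by the measurement, then resample."""
    predicted = [predict(p, rng) for p in particles]
    weights = [weight(p, z) for p in predicted]
    if sum(weights) == 0.0:                  # every particle is far from z: keep all equally
        weights = [1.0] * len(predicted)
    return rng.choices(predicted, weights=weights, k=len(predicted))


def mean_state(particles):
    """Average each state component over the particle set."""
    n = len(particles)
    return tuple(sum(p[k] for p in particles) / n for k in range(4))


def predict_confirmed(particles, rng):
    """One-day-ahead forecast: mean predicted C over the particles."""
    return mean_state([predict(p, rng) for p in particles])[1]


rng = random.Random(0)
reports = [100.0 * 1.1 ** day for day in range(1, 15)]  # synthetic series, not real data
particles = init_particles(2000, 100.0, rng)
for z in reports:
    particles = filter_step(particles, z, rng)
print("mean (I, C, Re, R):", mean_state(particles))
print("one-day-ahead C:", predict_confirmed(particles, rng))
```

Resampling with replacement in proportion to the weights keeps the particle count constant while concentrating the set on states consistent with the latest report; the hidden components I_t and R_t are corrected only indirectly, through their effect on the predicted C_t.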

Results
Simulation experiments were performed on regions located in China, where the proposed model was tested, and the model was then applied to regions located within Europe. While many regions in Europe reach a similar number of infectious individuals, we show in Fig. 1 [...] compared with the number of confirmed (reported) cases for a particular day. After the complete set of measurements for all evaluated days (from 23 January 2020 to 22 March 2020), RMSE = 2516.6 and an average R_t = 1.09 were calculated. From the simulation, the basic reproduction number R_0 = 1.4 and the number of infectious individuals I_0 = 998 were estimated. We can compare the modeled data with the numbers of confirmed and recovered cases to see the performance of the model. The number of infectious individuals is only an informative value because we cannot compare it with real data. An approach to obtaining an informative number of real infections in the population was presented by Wu and McGoogan [25]; however, it is impossible to get similar data-sets for European locations.
The model itself can be used for simulating estimated long-term cases. For this purpose the model is initialized with the confirmed cases C_0 = 444 as reported on 23 January 2020, the number of infectious individuals I_0 estimated by the particle filter, and the number of recovered cases Re_0 = 0. The long-term simulation is shown in Fig. 3, and the results are comparable with the real-world scenario.
The one-day predictions for European regions are represented by Czechia as our home country (Fig. 4), Italy (Fig. 5) as the country with the highest number of confirmed cases by 22 March 2020, and Germany (Fig. 6) as the region with the first confirmed European case.
The model is also capable of short-term prediction more than one day ahead. The result of a four-day prediction is shown in Fig. 7: the predicted number of cases was 1765 for Wednesday 25 March 2020 and 1796 for Thursday 26 March 2020, while the numbers of confirmed cases reported by the Ministry of Health of the Czech Republic [26] were 1763 for Wednesday and 2022 for Thursday.
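The error measures mentioned in this section can be computed as in the short sketch below. It assumes the usual definitions: RMSE as the root of the mean squared difference between two series, and prediction error as the absolute difference expressed as a percentage of the reported value. The source does not spell out its exact formulas, so the helper names `rmse` and `relative_error_pct` and these definitions are assumptions.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equally long series."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("series must be non-empty and of equal length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


def relative_error_pct(predicted, reported):
    """Absolute prediction error as a percentage of the reported value."""
    return abs(predicted - reported) / reported * 100.0


# Four-day forecast values for Czechia quoted in the text: (predicted, reported).
forecasts = {"Wed 25 Mar 2020": (1765, 1763), "Thu 26 Mar 2020": (1796, 2022)}
for day, (pred, real) in forecasts.items():
    print(f"{day}: {relative_error_pct(pred, real):.2f} %")
```

Under these definitions the two quoted forecasts give about 0.11 % for Wednesday and 11.18 % for Thursday.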

Discussion
Although there are data-sets that cover the global spread of infection, there can be problems with the data gathered for prediction analysis. We found differences between the data presented in the data-sets used [3] and the data tracked by our team and presented locally in Czechia (Tab. 1). Because the epidemic in Czechia is at its very beginning, it is difficult to track the confirmed cases and update the model. The estimated average reproduction number for Czechia, R_t = 1.3, corresponds to the growing number of confirmed cases and also to the restrictive measures applied locally to high schools and universities, which should lead to elimination of the infection. All presented results show that the prediction error for the number of confirmed cases is below 5% of the nominal value, which exceeds our expectations. A drawback is that the reported number of recovered cases is used only for the RMSE calculation, not for the model update. Even though this value is not used in the correction step of the particle filter algorithm, the model still tracks the real values of the recovered cases, as shown in Fig. 2.
The estimated average reproduction numbers for representative locations around Europe (R_t = 1.28 for Italy, R_t = 1.31 for Czechia and R_t = 1.22 for Germany), together with the estimated R_t = 1.1 for China-Hubei, are lower than the reproduction numbers estimated by other researchers at the time of the Covid-19 outbreak, e.g. R_t = 6.47 [20] for Wuhan during January 2020, which may be due to the analysis of data from a different period. On the other hand, the lower estimate, e.g. for China-Hubei after almost two months, shows that the spread of infection is slowing down. For the coming weeks we can expect that the restrictive measures applied in Europe will similarly help to bring the infection under control, although high demands on the health infrastructure are expected and it may take a few weeks to reduce the reproduction number.

Conclusion
The proposed probabilistic model shows the capability to handle short-term predictions for various regions in Europe. Moreover, the simulation experiments suggest that it is reasonable to use such a model for overall trend analysis. Because the novel coronavirus epidemic is still evolving, the model parameters can be adjusted more precisely as more data become available, and a comparison with other classical models, e.g. SEIR, can be performed.

Availability of data and materials
The data-sets supporting the results are available in public repositories as referenced.