Understanding the COVID19 infection curves -- �nding the right numbers

,


Introduction
In the absence of a vaccine and efficient antiviral drugs, the containment of the COVID19 pandemics relies mainly on non-pharmaceutical interventions.Such interventions depend upon the correct determination of the parameters that characterize the spread of the virus (like the transmission rate, recovery rate, death rate, etc.), which are fed into mathematical models to predict the time evolution of the pandemics.But obtaining the correct numbers is difficult while the pandemics unfolds, because of incomplete data and uncertainties in their interpretation.One such error, together with its solution, is addressed in this paper.
The mathematical modeling of the spreading of infectious diseases is centuries old [1] and eventually started with the work of Daniel Bernoulli on the variola spreading [2].These models may be grouped in two categories: stochastic and deterministic.Whereas the stochastic models, based mainly on simulations, may be more suitable for smaller communities, the deterministic models are usually employed for large communities, like countries or cities, where the fluctuations around averages are small and the time evolution is accurately determined by a small number of parameters.Because the analyzed population is divided into compartments, the deterministic models are also called compartmental models [3].The most simple such model is the SIR model, composed of three compartments: Susceptible, Infected, and Recovered, which was introduced in 1927 by Kermack,McKendrick,and Walker [3,4].The Susceptible compartment includes the healthy persons that are not immune and may get infected by contact with the sick people from the Infected compartment.The persons in the Infected compartment are transferred into the Recovered compartment as they recover or die (we shall call this also the Closed cases compartment).In this model it is also assumed that the recovered people are immune and cannot get sick again (therefore they do not return to the Susceptible compartment upon recovery).
If we denote by S, I, and R the number of people in the compartments Susceptible, Infected, and Recovered, respectively, then, in the simplest case one may assume that the total number of people, N = S + I + R is constant.It is more simple to work with the normalized values, so we introduce the notations s = S/N , y = I/N , and r = R/N .Then, the time evolution of s, y, and r is given by the set of equations (see the Supplemental Material and Refs.[4, where p and λ will be called the infection rate and the recovery rate, respectively.In general, p and λ may also be functions of t (for example, p is influenced by regulations imposed to the population, such as social distancing, whereas λ is influenced by the response of the public health system and the time passed since the infection), but for the moment we shall assume that they are constants.From Eqs. (1), only the first two are independent, the third one being a consequence of the normalization condition s + y + r = 1, which yields ṙ = − ṡ − ẏ.Equations ( 1) may be solved with the initial boundary conditions s(0) = s 0 and y(0) = y 0 (see the Supplementary Material for more details).

Method 1 (M1): fitting the time series of the active cases
With the formalism presented above, one may try to describe the evolution of the COVID19 pandemics.The official numbers of infected, dead, and recovered persons may be found on several databases.We use the one from GitHub (https://github.com/CSSEGISandData/COVID-19).In Fig. 1 we plot y(t) (Active cases) for Romania, Hubei (China), Germany, and South Korea, together with the fit of the SIR model (minimum square deviation), with constant p and λ (or a).This simplified model does not fit the data very well and more complex models should be employed.
Although the details of fitting the time series of the Active cases are important, here we shall focus on a more conspicuous problem.In Fig. 2 we show the time series of the Closed cases of the same countries and on the same periods as in Fig. 1.We observe that the SIR results for the same parameters as in Fig. 1 differ from the recorded data by orders of magnitude.Such discrepancies cannot be mended by small adjustments of the parameters p and λ or by assuming that they vary with time.
We did such fits for several countries and the situation was the same for all of them.

Method 2 (M2): fitting the time series of the infections
One may try another approach.The time variation of s, and therefore of y, is determined by the number of infectious persons, which may not be accurately described by the number of active cases, as it is assumed in the SIR model.This happens because there is no medical procedure that could eliminate a patient from the list of active cases immediately when it becomes noninfectious.Rather, one is eliminated from this list after a certain interval of time, when it tests negative for the COVID19 disease.This may allow for an overestimation of the number of active cases at any moment in time.For this reason, we may assume that the (total) number of infected people, which is y(t) + r(t) = 1 − s(t), is more appropriate for the fitting procedure.In this way we obtain the plots of Fig. 3. Obviously, this fits the time series of the infectious population much better than Method 1, but it fails to fit (by orders of magnitude) the time series of the Active cases, as one can see in Fig. 4. Furthermore, for longer time intervals, the fitting functions for Romania and Germany of Fig. 3 are likely to differ significantly from the data.Therefore, we observed in this section that the SIR model cannot simultaneously fit the time series of the active cases and the total number of infections or, equivalently, the active cases and the closed cases.We may expect that by changing the constant p and λ parameters with functions of time we may be able to better fit the active cases or the closed cases, but not both of them at the same time, because of the huge differences observed between the predictions and the time series.This prompts us to make some substantial changes in the model.

Methods
To explain the discrepancies observed in the previous section, we extend the SIR model by assuming that the number of infected people is significantly different from the one we know.For simplicity, we shall refer to this as the SIR2 model.We assume that beside the known infected population I 1 , there are I 2 infected persons that we do not know about, such that the total infected population is I = I 1 + I 2 .Similarly, the know and unknown recovered populations are denoted by R 1 and R 2 , respectively (see Fig. 5 for the schematic representation of this model).If the total population is again denoted by N , then the susceptible population is (2e) In the system (2) we assumed that the rates of contamination due to the infected populations I 1 are I 2 , denoted by p 1 and p 2 , respectively, are different, because the population I 1 is much better monitored and isolated from the healthy population (one may safely assume that p 1 ≪ p 2 ).Furthermore, a contaminated person first enters the I 2 population and it has the rate ξ (assumed constant) to enter the known infected population I 1 .The rates by which  the cases are closed (by recovery or death) in the populations I 1 and I 2 , denoted by λ 1 and λ 2 , respectively, may be different because of the following two (competing) reasons: (1) the population I 1 gets better health care and (2) the symptoms of I 1 may be (on average) more severe than the ones of the population I 2 , because they may have led to the observation of COVID19.The rate ζ of transferring population from R 2 to R 1 is mostly due to the postmortem detection of the SARS-Cov-2 virus.In practice, this may be very small and therefore it may be ignored (we set ζ = 0 in numerical calculations).
If we (wrongly) assume that all the infected cases are known (as it was the case in the simple SIR model), N = S 1 + I 1 + R 1 , we obtain the fake number conservation condition ṡ1 + ẏ1 + ṙ1 = 0, which, in this case, may lead to discrepancies.
We observe that the SIR2 model fits the European countries Romania and Germany (considering the simplicity of the model), but the data for Hubei and South Korea are not fitted well-although it is much better than the simple SIR model.These discrepancies may happen for several reasons, from the biases that exist in the data to the time dependence of the parameters, as discussed above.Nevertheless, the dramatic differences between the fitting of different time series, observed in the previous section, do not appear in SIR2.
Similar results are obtained for other countries, as we show in the Supplemental Material.

Discussion
Whereas the SIR model with constant p and λ (infection and healing rates) completely fails to describe the data because of the large discrepancies between the fits of different time series (if we fit one time series, another one may differ from the fitting function by more than one order of magnitude in significant parts of the fitting range), the SIR2 model fits quite well the available data for several countries (e.g.Figs. 6 and 7 and the Supplemental Material), whereas for other countries some discrepancies may appear (e.g.Figs. 8 and 9).We also notice that s decreases fast.According to our assumptions, s cannot be directly calculated from the available data, since we do not know y 2 and r 2 (or, equivalently, I 2 and R 2 ).We estimate it in the model.If s becomes too small, then the time variations of y 1 and r 1 are not determined by the actual infections, but by the rate of people transfer from compartment I 2 into compartment I 1 (COVID19 diagnosis at some time after the infection) and from I 1 to R 1 (recovery or death of people confirmed with COVID19).
The discrepancies between the data and the SIR2 estimates may be due to the fact that the parameters that we use in the models should be replaced with functions of time, according to the time evolution of the pandemic (people awareness, health system improvement, governmental measures, time dependence of the recovery rates, etc.) and the testing of the population.Another important aspect is that the spread of the disease is not geographically homogeneous and one should split countries (or regions) more localized containers in which the infections are more uniformly distributed.Furthermore, even in the same geographical area, some people are more exposed to the virus than others, which would lead to further splitting of the population into containers.These aspects are not incorporated into our model, but our description, while it remains simple, is reliable.
Nevertheless, maybe the most important conclusion of this work is that there is a significant part of the population which was infected and which is not detected.This implies that at the time of this writing, at least in Europe, the susceptible population has decreased sufficiently in most countries, so that, if the virus causes immunity, a second wave of infections may be much smaller or it may not be possible if restrictions are lifted.If an increase of the number of case would appear upon relaxing the restrictions, this would be most likely due to the unequal exposure of the population to the virus.Furthermore, in some countries the present time variation of the COVID19 cases may be due to the diagnosis of infections that took place some time ago, rather due to fresh infections.
where p and λ are called the infection rate and the recovery rate, respectively (see also the main text).By dividing the system (1) by N , we obtain the system (1) of the main text, namely with a being the only parameter.From Eqs. (3) we see that a sets the shape of the curve whereas p sets the scale of the time interval or the evolution speed of the epidemic.In what follows, we shall omit to include explicitly the dependence of the functions on the t or x variables, unless we consider it necessary.Other characteristics of the time evolution of s may be readily obtained from (3).For example, the first two equations may be combined into a single equation for s, which may be further simplified if we introduce the notation w(x) = ln[s(x)]: w ′′ = (e w − a)w ′ (5) * Institutul Nat ¸ional de Cercetare-Dezvoltare pentru Fizicȃ şi Inginerie Nuclearȃ Horia Hulubei, Mȃgurele, Romania, Email: dragos@theory.nipne.ro† Colegiul Nat ¸ional de Informaticȃ "Tudor Vianu", Bucharest, Romania (by • ′′ we denote the second derivative d 2 /dx 2 ).Starting from the initial conditions w(0) ≡ w 0 = ln(s 0 ) = ln(1 − y 0 − r 0 ) < 0 and w ′ (0) = −y 0 < 0, one may obtain an analytical expression for w(t).For this, we integrate both sides of Eq. ( 5) to obtain In Eq. ( 6) we may separate the variables and obtain In the equations above, one can neglect r 0 , since this is in general either zero or very small.We see that D(w, a, y 0 , r 0 ) ≥ 0. Furthermore, the integral diverges if and only if the denominator D becomes zero at the lower integration limit.If we denote this limit by w ∞ , then D(w ∞ , a, y 0 , r 0 ) = 0 and lim x→∞ w(x) = w ∞ .Therefore, w ∞ may be determined from the equation Since s ′ < 0 for any x, then w ′ < 0, so w ∞ < w 0 < 0. To find the solution of Eq. (8), we denote f a (w) ≡ 1 + aw − e w and we observe that f a (±∞) = −∞ and f a (0) = 0.The derivative f ′ a (w) = a − e w is monotonically decreasing, from lim w→−∞ f ′ a (w) = a to lim w→∞ f ′ a (w) = −∞.This implies that f a (w) = 0 has two solutions, one of them being w = 0 and the other one being smaller or bigger than zero, depending on a.The function f a (w) has a maximum at w m , which satisfies the condition 0 = f ′ a (w m ) = a−e wm , that is, w m = ln a.Therefore, if a < 1, w m < 0, so the second solution is negative, whereas if a > 1, w m > 0 and the second solution is positive.If a = 1, then both solutions coincide with w m = 0.These aspects may be seen in Fig. S1.
From Eq. ( 8) and Fig. S1 we can see that if a ≥ 1, then lim y0+r0→0 w ∞ = 0, whereas if a < 1, then lim y0+r0→0 w ∞ < 0. In epidemics a < 1, that is, a small initially infected population may spread the disease in a finite fraction of a large population.In this case, since at the beginning y 0 + r 0 ≪ 1, then s 0 1 and y ′ (x = 0) of Eq. (3b) is positive.On the other hand, if a ≥ 1, then y ′ (x) < 0 and the number of active cases decreases monotonically.

Fitting the data with SIR
To find the approximation of w for x ≪ 1, we may start either from Eq. ( 6) or from Eq. (7).From Eq. (6), for example, using w ′ (x ≪ 1) ≈ w ′ (0) = −y 0 , we obtain directly From Eq. (9a) we obtain the other quantities, r(x) ≈ ay 0 x and y(x) ≈ y 0 e (1−y0−r0−a)x (9b) As expected, in the short time limit, both s and y vary exponentially with time (or with x = pt).This exponential variation is a hallmark of the epidemic spreading at x ≪ 1.
If an epidemic had evolved for some time and is far from the beginning, as is the case of the COVID19 pandemic, one should fit the whole time series.The COVID19 started spreading in different countries or regions at different times, so we fit each country or region separately.As explained in the main text, we take the data from https://github.com/CSSEGISandData/COVID-19and form the vectors of data S d , I d , and R d , corresponding to the compartments S, I, and R, respectively.The unit of time is 1 day and we choose t = 0 the day when the number of infections exceeds 100.We choose this starting point because, on one hand, we want to eliminate the large fluctuations that may occur in the beginning of the epidemic spreading and, on the other hand, to keep as many data points as possible.If N is the total population of the country or region analyzed, we calculate the densities s d = S d /N , y d = I d /N , and r d = R d /N .On the other hand, we calculate the time evolution s(t), y(t), and r(t), using Eqs.( 2) and the initial conditions s 0 = s d (1), y 0 = y d (1), and r 0 = r d (1).The functions s(t), y(t), and r(t) depend on the parameters p and λ, which we determine by fitting the data using the least squares method.
The fitting of the data may be done in several ways.We present here the two methods that we used.
Method 1 (M1) Eventually the most important time series is y d , which represents the time evolution of the number of active cases.Therefore we first determine the parameters p and λ by fitting y(t) to y d (with the initial conditions as specified above).The results are plotted in Fig. 1 of the main text.We notice that while some of the details may not be fitted by this method, the calculated functions reproduce the general behavior of the time series.The details may be adjusted by assuming that p depends on time (because of the restrictions imposed and the reaction of the population), whereas λ depends both on the time of the infection and on the time elapsed from the infection (because of the time variation of the health care system efficiency and of the human body response).Using such adjustments, one may expect to be able to fit the time series y(t), but the real problem of this method is that the same parameters should be used to calculate the estimation of the other time series, namely of s(t) and r(t).It is shown in Fig. 2 of the main text that the difference between r(t) and the data y d (t) is very big-in relevant time intervals, the estimation is more than one (or even two) orders of magnitude bigger than the reported values.Therefore, by adjusting the parameters to fit the y d time series in more detail is not expected to make such a huge difference in the values of r(t) to compensate for the differences observed in Fig. 2.
We have to find another solution.
Method 2 (M2) As a second method, we try to fit the time series of the confirmed cases y d + r d .We applied again the least squares method and we obtained the results presented in Fig. 3 in the main text.Even though the calculations y(t) + r(t) are similar to the data y d + r d , this time the active cases y(t) differ by orders of magnitude from the time series y d , as one can see in Fig. 4 (main text).As in the case of M1, the fine adjustments of p and λ to fit the time series of confirmed cases cannot compensate for the big differences obtained for the active cases.
We also tried to fit simultaneously y d and r d , by applying the least square method to both sets, but by doing so, in the best case we could fit only one of the time series.Visually, in general this method led to worse results for both, y d and r d , whereas quantitatively, the time series of bigger numbers was better fit than the time series consisting of the smaller numbers.

Fitting with the modified SIR model-SIR2
In order to fit all the observed time series, we had to make qualitatively important adjustments to the model, as explained in Section 3 of the main text.We split the compartment I of the SIR model into I 1 and I 2 , where I 1 represents the number of confirmed active cases, whereas I 2 is the number of infected people which are not known and therefore are missed from the statistics.Similarly, R is split into R 1 and R 2 , R 1 representing the known closed cases (recovered persons plus deaths) and R 2 being the number of closed cases which are missing from the statistics.As in the SIR model, we define the densities s ≡ S/N , y 1 = I 1 /N , y 2 = I 2 /N , r 1 = R 1 /N , and r 2 = R 2 /N , whereas the rest of the parameters are defined in the main text.This model allowed us to fit quite well all the time series for most of the countries, as one can see in the main text and in Fig. S2 of this Supplemental Material, but there are also some countries and regions that cannot be fitted so well, like Hubei (China), South Korea (see main text), and the Netherlands (Fig. S2).
We also notice that the susceptible population decreases faster in some of the cases, which means that the dynamics of y 1 and r 1 in the time interval when s(t) ≪ 1 is mainly due to the crossing from I 2 to I 1 (at rate ξ) and recovery from I 1 to R 1 (at rate λ 1 ).The oscillations and discontinuities of s, when s ≪ 1, are numerical artifacts, since s ≥ 0 and according to Eq. (2a) of the main text ṡ < 0 for any t.These artifacts are irrelevant for the time evolution of the other quantities because s ≪ 1.The Closed cases r(t).The model is tted to the time series of the Active cases, as shown in Fig. 1.
Figure 3 The number of con rmed cases 1 s(t) = y(t) + r(t) and the t with the SIR model in four populations: Romania, Hubei (China), Germany, and South Korea.The t parameters and the time interval from which the data is taken are shown in the plots.
The time series of the Active cases and the SIR predictions, for the parameters obtained by tting the Con rmed cases with the SIR model as shown in Fig. 3.

Supplementary Files
This is a list of supplementary les associated with this preprint.Click to download.

Figure 2 :
Figure 2: The Closed cases r(t).The model is fitted to the time series of the Active cases, as shown in Fig. 1.

Figure 3 :
Figure 3: The number of confirmed cases 1 − s(t) = y(t) + r(t) and the fit with the SIR model in four populations: Romania, Hubei (China), Germany, and South Korea.The fit parameters and the time interval from which the data is taken are shown in the plots.

Figure 4 :
Figure 4: The time series of the Active cases and the SIR predictions, for the parameters obtained by fitting the Confirmed cases with the SIR model as shown in Fig. 3.

Figure 5 :Figure 6 :Figure 7 :
Figure 5: The schematic representation of the extension of the SIR model, denoted here as SIR2 model.

Figure 8 :Figure 9 :
Figure 8: The fitting of the SIR2 model to the data in Hubei, for the indicated period.The obtained parameters are shown in the plots.

Figure 1 :
Figure1: (S1) The function f a (w) (continuous lines) and the constants a ln(1 − y 0 ) (dashed lines) from the equation that gives the asymptotic value of w, when x → ∞ (we assume r 0 = 0).

Figures Figure 1 The
Figures

Figure
Figure

Figure
Figure