Modelling the SARS-CoVid-2 outbreak in Italy: development of a robust statistical index to track disease dynamics

Mariano Bizzarri (  mariano.bizzarri@uniroma1.it ) Sapienza University. Rome, Department of Experimental Medicine, Systems Biology Group Lab; Sapienza University, Rome, Italy. Mario Di Traglia Department of Public Health and Infectious Diseases (Biostatistics Section), Sapienza University, Rome, Italy. Alessandro Giuliani Istituto Superiore di Sanità, Environment and Health Department, Rome, Italy. Annarita Vestri Department of Public Health and Infectious Diseases (Biostatistics Section), Sapienza University, Rome, Italy. Valeria Fedeli Sapienza University. Rome, Department of Experimental Medicine, Systems Biology Group Lab; Sapienza University, Rome, Italy. Alberto Prestininzi NHAZCA Srl, SpinOff-Sapienza University, Rome, Italy.


Introduction
After a novel strain of coronavirus -SARS-CoV-2 -was identi ed in Wuhan (China) 1 , Italy resulted to be one among the most affected countries by Covid-19 epidemic 2 . Noticeably, despite the existence of an already established program to deal with such an outbreak, Italian government was taken by surprise.
Moreover, despite warning coming from previous, disregarded clinical reports, clinical and administrative countermeasures lasted until March 2020 before to be adopted in a systematic way.
Predictive mathematical models for epidemics are fundamental to forecast the course of the epidemic and to plan effective control strategies. Unfortunately, it is likely that unreliability of the preliminary data as well as inadequacy/lack of proper descriptive models played a relevant role in undermining political decisions of the Italian administration.
Several models have been developed to describe the Sars-CoV-2 pandemic, by reframing the SEIR model3, using the discrete-time SIR model (which includes dead individuals)4, or a control-oriented SIR model that stresses the effects of delays and compares the outcomes of different containment policies5.
Stochastic transmission models have been applied as well6.
The above-mentioned approaches stem from compartmental diffusion processes routinely used in physical chemistry and pharmacokinetics. SEIR stands for Susceptible-Exposed-Infected-Recovery that denote four states that each patient traverses going from susceptible (S), to the exposed (E) and infected (I) conditions ending up into the recovery (R), a compartment that collects both recovered and dead patients. The apparent oddity of putting into the same compartment healed and dead persons comes from the physical-chemistry origin of the model in which R compartment collects all the entities (e.g. molecules) arriving at the nal step of the process (end products, excreted or adsorbed substances).
So far so good, but the problem (when the model is applied to real processes) is hidden in the statement 'on average', that embeds the physical condition of 'ergodicity'. In simple terms, this condition implies stochasticity of the contacts and (as for any averaging process) the homogeneity of the process.
Both these conditions are violated in the Italian case. Data highlighted differences of one order of magnitude in both infection dimension and spreading rate between Lombardia (accounting for around 60% of total fatalities and infections) and the rest of Italy. We will not discuss the possible causes of this discrepancy (that, even if less dramatic, involves the whole Northern Italy, in which 86% of the overall Italian fatalities have been recorded), as this task would require a speci c clinical/epidemiological inquiry. Instead, we focus on the possible ways to get a su ciently accurate estimate of the epidemics process when in presence of huge violation of SEIR model premises. A statistical index different from the SEIR-derived ones, could be very useful to assess the entity of an epidemics outbreak.
The need of a different statistical index is still more cogent in the Italian case, considering that the Lombardia outbreak happened earlier than the appearance of Covid-19 cases in other regions, and administrative o cials were asked to take decisions based on limited information, which -at that timewas mainly restricted to Lombardia singularity. This situation could be not so rare, thus the need of a more robust statistical index is not limited to the Italian situation. Indeed, during the initial stages of the infection, there is not enough information to predict epidemic temporal development using analytical models, but some useful hints could be extracted as "hidden" information from simple statistics based on very preliminary data. This is why , before entering into the new proposed index derivation, it is worth checking if the epidemic data collected during the rst months, affected by both uncertainty and heterogeneity biases, do allow for a reliable forecasting by adopting less demanding models, with respect to SEIR ones. We achieved that goal by means of the Verhulst-Pearle model (sigmoid), which was able to predict the occurrence of the peak of infection, when fed with very early data. We estimated the parameters by regression methods, applying the geometric mean as a function of truly infected individuals. The initial condition for the Bernoulli differential equation was determined applying the backpropagation method 7 .
Then we obtained a novel index (RI) derivation, by splitting the R compartment into healed and dead persons, ending up into the dH/dI (RI) index with dH and dI being the daily (day(i)-day(i-1)) differential of healed and infected people, respectively. This choice allowed us to detect a clear tipping point in epidemic dynamics in Italy (without considering data from Lombardia), thus recognizing the onset the decline in the epidemic cycle. Such tipping point has emerged only later in Lombardia, and this nding is consistent with the anomalous dynamics of the epidemics in that region.
We complemented the investigation by a concomitant estimate of the clinical severity of Covid-19 disease, which should allow grasping both the potential lethality of Covid-19 and the effectiveness of hospital care. Underestimation of asymptomatic (positive) Covid-19 bearing subjects and uncertainty in detecting the true number of infected people, due to limitation in both availability and reliability of oralpharyngeal swab data during the rst months of the epidemic, severely undermined the trustworthiness of raw lethality data (Case Fatality Ratio, CFR, corresponding to the death/infected ratio), mostly because true incidence resulted underestimated. CFR values relative to different regions convey information on both age distribution (COVID-19 fatal cases are almost completely restricted to patients >70 years old) and e ciency of public health systems 8 . In fact, Italy has the oldest population in Europe and the second oldest population in the world, after Japan. COVID-19 has shown a strong dependence on age due to the severity of the infection and the risk of death8.
However, the CFR in Japan is much lower (3.2%) than that recorded in Italy (13.7%), despite their similarities. Therefore, to obtain a more reliable evaluation of clinical evolution, we looked for an independent parameter that could free itself from the estimate of infected individuals, which is affected by a high level of uncertainty as abovementioned. Thus, we focused on the daily number of accesses of Covid-19 patients to the intensive care unit (ICU). Here we show that the number of accesses to the intensive care unit has a very high linear correlation with the number of deaths so that the slope of this relationship is a good faith estimate of the effectiveness of hospital care as well as an unbiased index of the clinical evolution of the epidemic.
The shared rationale of the above-mentioned strategies is the 'blessing of delay': in order a patient is actually registered as 'healed' it is necessary to wait for three negative tests; this creates a time delay with respect to the ascertainment of infection cases. This delay (together with the latent but unescapable correlation between infected and healed frequencies) makes the ratio between daily changes in healed and infected compartments a proxy of the time derivative of infection dynamics only marginally affected by infected number uncertainty. Similar considerations hold for intensive care unit hospitalization and fatalities that in turn do not include any infection number estimate.
Methods RI index. We rely on o cial data from the Italian Health Administration, recorded daily for each Italian province (in Italy each region is further divided into provinces) 9 . The time window taken into consideration went from the end of February to 12 May 2020. The division of this time window into " elds" derives both from the need to compare different regional entities and to be consistent with different stages of epidemic curve. Accordingly, the average value of the RI (dH/dI) index was calculated from the healed and infected curves within each eld ( Table 1). The elds correspond to T0 = starting point, with a very low number of infected and almost no healed patients. T1 = initial exponential increase in the epidemic phase, featuring a marginal number of infected compared to healed patients. T2 = in ection of the curve of infected patients, corresponding to the beginning of the exponential increase of fatal cases. T3 = plateau, the phase of the curve of infected patients showing a trend toward reduction with simultaneous exponential increase in the number of healed individuals. T4 = drastically negative drop among infected patients tendency of the infected and exponential increase in the healed. The time windows were calibrated on the total of the Italian regions (excluding Lombardia data), while Lombardia was singularly considered. It is worth noting the Italian dynamics of RI was super-imposable to the dynamics relative to other countries (data not shown).
It is worth remarking that the infected, dead and healed compartments are non-overlapping, as each patient is considered to be healed only after three negative PCR results performed across a period of three weeks. The robustness to the uncertainty of the infected number of the proposed index derives from both the use of the daily differential (instead of the total number of infected patients), and the long delay between the infection and the declaration of "complete recovery", which is obtained only after three PCR negative test. In the initial stages of the infection, there is not enough information to predict epidemics temporal development using too demanding models. Thus, we approached the forecast of time and intensity of the peak of infection through the Verhulst-Pearle sigmoid model (Fig.3). The derivative of the sigmoid indicates (in the peak) the in ection point of the epidemic curve ( The model tted with the data until 24 March was demonstrated to correctly predict the onset of the end of April peak. The con dence interval, (1-α) = 0.95, indicates the uctuation of the real amount of daily infected people during the peak (Ll = 100000, Ul = 110000). The real data was around 108000 infected people (as predicted by the model).
Backward prediction model. The rst available data on the number of daily infected in Italy was available on February 24, 2020. In the rst ten days of registration, epidemiological data resulted highly "unstable" and untrustworthy. Nevertheless, they can still capture the dynamics of the phenomenon. To eliminate excessive uctuations, a four terms moving median was applied to the data. Furthermore, on the levelled series, the geometric mean of the rates of change was calculated. This average is 0.02354 (net rate). Finding the evolutionary dynamics of the number of infected, at the beginning of the exponential branch and coming from the asymptotic branch, the formula of compound capitalization is applied. We indicate with t the day February 24 and with j the number of days before t. The daily infected peoples at time t-j is given as follows: Where: N(tj) = 202 / (1+ 0.02353808) j N (t) = levelled number of infected at day t, alpha = geometric mean of the in the rst 10 days and j=numbers of days before t.
For j = 130, we get N (t-j) = 10; from February 24, subtracting 130 days we arrive in mid-October.
This apparently paradoxical early beginning of epidemics obtained an independent biological proof by the genetic distances among different Sars-CoV2 isolated strains10.

Results
Epidemics trend in Italy and Lombardia region. Table 1 shows the dynamics of the RI index (dH/dI, calculated as the average value at different time intervals both in Lombardia and in the rest of Italy). The different time intervals (" elds") correspond to the peculiar behaviour of infectious dynamics, analyzed for Lombardia and for the rest of other Italian regions. Raw data have been provided by daily sampling made by Italian Health Administration and are affected by biases due to non random sampling and test uncertainty (especially important in the rst phases of the epidemic).
With a simple glance at Table 1, it is possible to appreciate the differences between Lombardia and the rest of the Italian regions. The difference is evident from the average values of RI = dH/dI calculated for each eld, between T0 and T5. The rst eld highlights an RI greater in Lombardia with respect to the rest of Italy tracking the start of the epidemic process in this region. Consistently, the much higher percentage of patients healed in Lombardia region in the initial periods suggests the need to bring the actual start of the process back in time compared to the o cial registration of the rst cases. Going to more recent times (successive intervals) is evident as the index reaches a plateau at window T3 for 'rest of Italy', indicating the end of the cycle of epidemic, while still oating in Lombardia, with a large dispersion of the RI value .This is still more evident in Figures 1A and 1B. Figure 1A reports daily changes since 24 February 2020 (T0), of infected, and healed in Italy without Lombardia. Top panel reports RI (dH/dI) index while bottom panel refers to the temporal trend of healed and infected cases. The analysis of single sections of the time course suggest some interesting observations. The average RI value in Field 1 T0-T1 is near to zero; there is only a slow increase of infected (it must be remembered the low number of swap analyses performed in this period) with almost non-detectable increase of healed cases. The average RI value in the Field 2 T1-T2 is on the rise, but lower than 1 due to the temporal delay between healed and infected cases generation. The infection rate rises, and so does healed curve but at a lower rate. The average RI value reaches a 'tipping point' marking a phase transition11 of the dynamics at Field 3 (T2-T3). The healed daily variation overcomes the infection rate and the RI has a huge variance, another signature of phase transition 11 . In this time interval, the Infection rate reaches its plateau while healed curve displays a neat acceleration. The average value RI assumes an average value >1 in the Field 4 (T3-T4), where a signi cant departure of the rate in between healed and infected patients can be observed. Finally, in the Field 5 (T4-T5), the dH/dI ratio assumes an average value <0. From this point onward, dH/dI is no more relevant given the negative rate of infected persons and thus the negative values of dH/dI index. This is a further proof of the phase transition started in the previous intervals and identi ed by infection peak. It is worth noting how the two extremes of the section (3 and 16 April) are coincident with the reduction of the absolute number of infected persons. This behaviour coincides with the plateau of our forward model based on a completely independent approach. Figure 1B relates to the Lombardia region and tells us a different story than the rest of Italy. First, we note that, in contrast to the rest of Italy, there is no clear "critical point". The RI does not assume the average value >1 for the entire range of data examined (82 days from T0 corresponding to 12 May 2020), while in the rest of Italy the number of recovered people is very high, with RI> 1 starting from the 40th day (April 2, 2020). The Lombardia region continues to oscillate around an average index of RI=0.65, without reaching a stable plateau or going towards stable negative territories. All in all, the proposed index tells us of the reaching of a tipping-point at day 45 from the beginning of analysis (in the rest of Italy) that marks the end of epidemic cycle being the fatalities after that point the 'echo' of previous infections. Lombardia differs from the previous sketched scenario as it displays a completely different dynamics of the index.

Parameters for clinical severity estimation
The aforementioned Japan/Italy paradox, where two nations with largely super-imposable demographic structures have huge differences in lethality trends, asks for something different from the usual lethality indexes. We decide to remove the principal cause of uncertainty, i.e. the rate of infected people, by focusing on different observables. That goal was achieved by relying on the relation between the number of accesses in intensive care units (ICU) and the number of deaths. Both these variables are known with no (or very little) uncertainty and they give an immediate estimate of the severity of the disease ( Fig.2A,  B). Figures 2C and 2D reports the evolution of ICU as well as the Covid-19-related fatalities recorded on a daily basis in Italy, excluding (Fig.2C) and including Lombardia data (Fig. 2D). Figure 2C shows the parallel behavior of the number of accesses to intensive care and the number of deaths for the entire country. The two parameters are strictly correlated (Pearson r = 0.94), and the slope of the curve is equal to 0.19 pointing to an 81% of success for intensive care (Figure 2a). This relation points to a constant 'clinical relevance' of the disease throughout the entire time course.
The gure refers to the same day for both death numbers and ICUs and is worth noting that the t comes from the shared trend of the two variables both registering the same epidemic spread. As a matter of fact the 'lagged correlations' (i.e. the correlations between number of deaths and ICUs, three, four, ve and six days before) displayed a signi cantly lower correlation coe cient. The fact that ICUs-Fatalities relation is driven by the epidemics dynamics (the relation practically disappears when lagged correlations are computed) tells us that this relation can be considered as a proxy of the epidemics dynamics more than a purely clinical descriptor. Figures 2C and 2D describe the simultaneous evolution over time of the two parameters -ICUs and death -evidencing how the rise and the subsequent decline of the epidemics can aptly be depicted with the help of these observables. Keeping in mind that fatalities have a variable delay from hospitalization, the ICUs-Dead number function is equivalent to a temporal differential rate of infection scaled by a largely constant intrinsic fatality rate. At odds with other strategies, the ICUs-Dead number function does not imply any direct infection estimate. Figure 2B refers to Lombardia, and again we observe a strict relation between the number of daily accesses to intensive care units and deaths. It is worth noting that while for the rest of Italy the slope was 0.19, the slope relative to Lombardia is quite higher (0.28) pointing to a decreased success of care dropping from 81 to 72%. By a closer look at Fig.2B is evident how the right part of the plot (corresponding to the peak of the epidemics with higher number of both fatalities and intensive care accesses) has a higher displacement of observations from the regression line with a tendency to score a higher number of fatalities.
This asymmetric displacement could be the image in light of the initial overcrowding of the hospitals in Lombardia that rst faced the initial impact of the epidemic. In any case, the relation between hospital accesses and deaths can be considered as a useful parameter for monitoring disease severity. The corollary of this fact is that the number of healed individuals properly re ects the epidemics evolution in a clinically reasonable way, allowing overcoming bias provided by uncorrected data collection about incidence and lethality. Accesses to critical care units reach a peak at mid-March and then regularly decreases, reaching in these days (July 2020) very limited numbers (<50 patients in respect to ~4000 at mid March). We hypothesize that this behaviour re ects the improvement in treatment of patients that are followed at home or in standard hospital regimen, and does not imply any change in the viral pathogenicity, at least in the short period we have investigated.
These ndings bring some relevant consequences: a) despite infection and death rates values being untrustworthy, recording of critical care units accesses rates can provide a useful estimate of epidemic dynamics, especially if we aim to appreciate the most relevant medical outcomes; b) decrease in critical care units accesses rates mirrors the ones we observed for death rates.
Given the reduction in hospitalization in intensive care structures is mostly ascribed to improved e ciency of medical treatments delivered at home or during standard hospitalization, it can be reliably surmised that early and e cient pharmacological therapy can successfully reduce both accesses to critical care units as well as death rates. While the reasons behind the (apparent) high incidence of fatality rates in Lombardia still remain uncovered, we can exclude that differences in virus pathogenicity could provide an affordable answer. The impressive trend we observed in Lombardia from the outset (end of February 2020) simply demonstrated that, since the beginning, the number of (asymptomatic) infected individuals reached a "critical" mass that ultimately caused a sudden "eruption". Indeed, almost half of the Covid-19 cases registered in Italy occurred in Lombardia.
The Verhulst-Pearle sigmoid model (sigmoid) (Fig. 3A, B) allowed us to reconstruct the time and intensity of the epidemics peak and to highlight the "hidden" initial period of the epidemic (backward prediction). Namely, we have been able in dating back the beginning of epidemics some months before the rst ascertained case. This occurrence, that recently has been acknowledge by new investigations12, tells about an 'hidden circulation' of the virus in Lombardia long ago before any o cial evidence that could be one of the concurring causes of Lombardia singularity. On a more methodological ground it is worth noting how the sigmoidal model was able to correctly estimate the time course of the derivative of epidemic spread (the same information conveyed by SEIR-based R-like statistics) .when fed of only early (and very uncertain) data before 24 March 2020 ( Figure 4).

Conclusion
As aptly remarked in a paper recently appeared in Nature 20 , R-like statistics are in many cases imprecise estimates of epidemic dynamics that both rest on unveri ed assumptions and do not capture the current status of an epidemic spiking up and down when case numbers are low. Moreover, they are out of scope in situations with marked spatial heterogeneity. Other measures (trends in new infections rate, deaths and hospital admissions, cohort surveys) could provide a more reliable estimate of how many people in a population currently have the disease (or have had it). Overall, R index is "too large, and is being used for purposes for which it was never intended" 20 . Here we set some feasible alternatives xing the above fallacies by checking some of the above-mentioned 20 alternative strategies.
Indeterminacy of the estimate of affected people, by means of a general underestimation of the asymptomatic pool, might have biased the models on which o cial statistics usually rely. Furthermore, the fact that data on the infection spreading have been mostly obtained through Real-Time RT-PCR diagnostic -a tool burdened by a huge rate of unreliable results 13,14 -is an additional source of uncertainty. It is noteworthy that serological-based survey have evidenced incidence rates 40-80 fold higher than those suggested by previous estimates, based on RT-PCR15.
Another source of bias is linked to the fact that the beginning of the epidemic has been overlooked by missing a non-negligible number of affected patients (especially in the time lapse of November 2019 -January 2020). Tracing back the early starting point of the Sars-CoV-2 outbreak is a di cult task, besides some attempts have been done in that eld10; our results suggest a sensible way to achieve this goal by means of sigmoidal models much less demanding than classical SEIR ones.
The tracing back the beginning of the epidemic to October 2019. Beside the genetic distance analysis 10, indirect evidence has been recently provided by mass media relating the emergence of "strange" interstitial pneumonitis during the fall and evidence is mounting that Covid-19 spreading should be backdated also in other countries, including China. This reconstruction casts more than a shadow upon the current narrative and require an in depth investigation of the true origins of Covid-19 epidemic.
Uncertainty affects even lethality rate estimation. Lethality (or Case Fatality Ratio, CFR) is a measure critically dependent on the accuracy of incidence estimation. It is commonplace to explain differences recorded in fatality rate registered in different countries by advocating differences in demographic structure or in the e ciency of public health systems. Indeed, Italy has the most elderly population in Europe and the second most elderly population in the world after Japan. COVID-19 has a strong age dependence for the severity of the infection and the risk of death. However, lethality rate in Japan is far lower (3,2 %) than that recorded in Italy (13,7 %)16 , the similarities in between the two countries notwithstanding. Moreover, the overall burden of casualties ascribed to Sars-CoV-2 might be easily awed by the choice of exclusion/inclusion criteria. Indeed, it is di cult to differentiate between deaths in which Covid-19 is detectable from deaths caused by SARS-CoV-2 infection because the vast majority of patients who died in Italy had one or more major pathologies (98.8% with at least 1 comorbidity, and 48.6% having 3 or more diseases) that contributed to the death17. Additionally, the death trend also re ects a number of intertwined factors and represents the ultimate terminus of a complex process that started much earlier. Therefore, the simple daily death recording unlikely could provide a signi cant picture of the epidemic dynamics. Overall, these factors are among those that contributed in undermining the effectiveness of epidemic management18. Indeed, in "the absence of prevalence and incidence data, including the results of serology testing, it is di cult to predict the effects of speci c major public health decisions"19.
To derive easy to grasp but reliable estimates of the time course of epidemics, here we propose the use of a new index (RI) that allows to detect tipping point (and then the change of pace of epidemics going to the end of its cycle) in a reliable way.
In the same time, the relation between fully recovered individuals and deaths could represent a trustworthy and unbiased estimator of the epidemics process. It is worth of interest that, when the number of accesses is plotted against the number of Covid-19-related casualties, we found a near to unity relation between these two parameters largely invariant in time and space (i.e., Italian regions display the same trend without appreciable differences).     Early data before March 24, 2020