A new regression model for the forecasting of COVID-19 outbreak evolution: an application to Italian data

The novel coronavirus SARS-CoV-2 was first identified in China in December 2019. In just over five months, the virus affected over 4 million people and caused about 300,000 deaths. This study aimed to model new COVID-19 cases in Italian regions using a new curve. A new empirical curve is proposed to model the number of new cases of COVID-19. It resembles a known exponential growth curve, which has a straight line as an exponent, but in the growth curve proposed, the exponent is a logistic curve multiplied for a straight line. This curve shows an initial phase, the expected exponential growth, then rises to the maximum value and finally reaches zero. We characterized the epidemic growth patterns for the entire Italian nation and each of the 20 Italian regions. The estimated growth curve has been used to calculate the expected time of the beginning, the time related to peak, and the end of the epidemics. Our analysis explores the development of the outbreaks in Italy and the impact of the containment measures. Data obtained are useful to forecast future scenarios and the possible end of the epidemic.


Introduction
The novel coronavirus SARS-CoV-2, responsible for the COVID-19 epidemic, was first identified in Wuhan, China, in December 2019 among a cluster of patients that presented with an unidentified form of viral pneumonia [1]; it is reported to cause a range of symptoms, including fever, cough, and shortness of breath [2]. At the time of writing (data updated on May 31st, 2020, GMT 07:46), the number of persons infected worldwide was 5,941,778, with 367,378 reported deaths; 212 countries have confirmed cases. In Italy, the first official cases appeared on February 21st in the Lombardia region; eleven municipalities in northern Italy were identified as the centers of the two main Italian clusters and placed under quarantine. The majority of positive cases in other regions are traced back to these two clusters. On March 8th, the quarantine was expanded to all of the Lombardia region and 14 other northern provinces, and on the following day to all of Italy, placing more than 60 million people in a de-facto quarantine mode. On March 11th, the Prime Minister prohibited nearly all commercial activity except for supermarkets and pharmacies, and on the same day, the World Health Organization recognized COVID-19 as a pandemic. On March 21st, the Italian government closed all non-essential businesses and industries, with additional restrictions on people's movements. On May 31st, the total number of cases in Italy was 232.664, with 33.340 deaths; the total of tests performed was over two and a half million.
Mathematical models are powerful tools for understanding infectious diseases [3]. Identifying signature features of the growth kinetics of an outbreak can be helpful to design reliable models of disease spread and understand important details of the transmission dynamics of an infectious disease [4]. The force of infection in mathematical transmission models is typically estimated using time-series data that describe epidemic growth as a function of time [5]. Usually, an outbreak follows an exponential growth at an early stage, peaks, and then the growth rate decays, as containment measures to limit the transmission of the virus are introduced. Most of the studies in the literature used simple exponential growth models and focused on the initial growing process, but on the other hand, there are also many works arguing that the number of infected people follows a trajectory different from a simple exponential growth [6]. Moreover, most of the models adapt to the time series of the total number of infected and not the daily variation of the same; that is, the number of new cases and new deaths per day. However, it is undoubtedly useful to try predicting the trend of newly infected people on a day-by-day basis, as this type of time series is much more useful in predicting the end of the epidemic. Through this time series, it is easily identifiable the day with the maximum increase of the number of the infected, while in the functions that consider the total infected number according to the time, it is necessary to define the inflection point, as the maximum number of new cases/day corresponds to the time corresponding to the inflection point.
In this paper, we employed a new growth model, starting from the known exponential growth. The novelty lies in modeling the exponential exponent with a logistic function, so that the exponent can decrease over time rather than remain constant. Our analysis explores the outbreak development in Italy and the impact of the containment measures, both at the national level and within the 20 Italian regions. We employed simple models to quantitatively document the effects of the Italian control measures against the COVID-19, provide informative projections on the development of the outbreak, and informative implications for the coming pandemic.

Mathematical assumption
A new empirical curve is proposed to modeling the number of new cases of COVID-19 as a function of time. The simple curve of exponential growth is defined by two parameters: a (is the number of times per unit time of growing by a factor e, i.e. the base of the natural logarithm) and b (related to abscissa values translation). The exponential growth curve is N(t) = e a(t−b) , where N(t) is the number of new COVID-19 infected per day, t is the number of days since the start of registration of the infected (Figure 1(a)).
The exponential growth curve can be rewritten as follows: Considering logarithmic transformation: then So, considering Eq. 2, exponential growth, irrespectively to b values, can be visualized as straight lines parallel (in logarithmic scale); for example, the exponential growth plotted in Figure 1(a) has been reported in Figure 1(b); another example in Figure 1(d,e). In real epidemics, the exponential growth phase is limited to the first period; after this first phase, there is a peak of new infections, only to see a progressive decrease in new infections, down to zero. So, the 3-parameters logistic curve of transformed data has the following equation: Considering the transformation reported above, the number of new infected can be modeled on a logistic curve (Figure 1(c,f), in logarithmic scale). Solving for N(t), we obtained: and finally With a factor correction: 1+e c(t) − 1 ( 7 ) a and b are related to the natural transmission rate (as in the exponential growth curve), c represents the effect size of containment measures. Parameter b corresponds to the day before the start of the epidemic when the value of N(t) is > 0. Therefore, considering Eq. 7, the graph in Figure 2(a,b) simulation of growth curve have been reported: The new curve growth proposed (7) has several useful characteristics: Considering the first growth phases of the infection, the exponent denominator (which is a logistic) approximates to 1, so that the whole exponent approximates at the straight line a(t-b). In other terms, the function approximates at For large values of t, i.e. at the end of the outbreak, the curve tends to zero.
For t → +∞e The insertion of the corrective term '−1' in (7) is justified since the number of newly infected people must reach zero value at the end of the epidemic.
The first derivative is: Unfortunately is not possible to calculate the exact value of t in order to N (t) = 0 (time of both peak values), but it is possible to obtain a numerical approximation, with the desired precision, considering: In this context, the precise value could not be necessary, but an approximate value of one day is sufficient. To calculate the day relative to the maximum increase of cases and deaths, it was estimated to which day corresponded to the maximum predicted value of the dependent variable. Using the same equation, it is also possible to model the number of new dead of COVID-19 as a function of time (days). The estimated growth curve can be directly used to calculate both the expected time of the beginning and the end of epidemic curves.

Data sources
The temporal resolution of the datasets was day-to-day, starting from February 24th, 2020. New COVID-19 cases and new deaths were extracted from the database of the Italian government at the following link: https://github.com/pcm-dpc/COVID-19/blob/master/ dati-regioni/dpc-covid19-ita-regioni.csv. We characterized the epidemic growth patterns for each of the 20 Italian regions (the autonomous provinces of Trento and Bolzano were considered instead of the whole Trentino Alto Adige region).

Statistical data analyses
Parameters (a, b, c) can be jointly estimated through a nonlinear least-square curve fitting to the case incidence curve modeled by the equation above reported, assuming that the number of daily new cases is a structural part plus a normal error term. Nonlinear regressions were performed determining unweighted least squares estimates of parameters using the Levenberg-Marquardt method; also, 95% confidence intervals were calculated. The goodness of fit was quantified using adjusted R square; normality of residuals was tested using the D'Agostino-Pearson test; finally, a Run test was performed to test autocorrelation among residuals. Initial values were chosen using a simulated curve on a Microsoft Excel spreadsheet. The trend of new cases and deaths is reported at the graphic level, as an example, only concerning the Lombardia region; the data for each region are available as Supplementary materials. All statistical analyses were performed using Excel 365, SPSS 20.0, and R software; the significance threshold was fixed at 0.05.

Results
In Figure 3(a,b), an example (new cases and new death) of the modeling of a new growth model for the forecasting of the COVID-19 outbreak evolution is shown. The proposed new growth curve showed high goodness of fitting; except for the Basilicata and Molise regions where R 2 adj showed values close to 0.20 (these regions are sparsely populated and therefore showed, in absolute value, a low number of deaths and cases), R 2 adj ranged from 0.2822 (new death; Umbria) to 0.8992 (new cases; Emilia Romagna), with a mean R 2 adj = 0.714. On the abscissa axis are the days starting from the beginning of the epidemiological data records, which is from 24/02/2020. For example, using Lombardia results, the curve relative to the new cases/day has parameters (with 95% confidence interval): a = 0.4401 (0.3758; 0.5174); b = −23.97 (−30.32; −18.43); c = 0.02434 (0.02258; 0.02708); the curve relative to the new deaths/day has parameters: a = 0.6147 (0.5123; 0.7426); b = −5.074 (−9.442; −1.156); c = 0.03247 (0.02969; 0.03533). Both the new cases/day and new deaths/day curves showed acceptable R 2 values (0.7766 and 0.8092, respectively). Residuals did not satisfy normality and absence of autocorrelation assumptions for the new cases (D'Agostino-Pearson test; p < 0.0001; Run test; p < 0.0001), while the new deaths/day showed a deviation from homoscedasticity (D'Agostino-Pearson test; p < 0.0001) and did not meet the assumption of autocorrelation absence (Run test; p = 0.0001); although Lombardia did not satisfy these assumptions, other regions satisfy them; moreover, the deviation from the observed values was not very pronounced (see Figure 3) The smallest regions showed a much lower total number of infected, so the new  cases time-series is subjected to significant sampling noise; R 2 value is lower than for bigger regions. Using a sum of predicted data obtained from the regional regression curves is possible to hypothesize that the numbers of the new infected, for Italy, will be less than 5 per day on day 212 (September 23th, 2020) from February 24th.
In Figure 4(a), the number of days from the (estimated) start to the peak (x-axis), and from the peak to the (estimated) end of the outbreak (y-axis), for each of the 19 Italian regions and the 2 autonomous provinces have been plotted. This plot shows the heterogeneity in the spreading speed of the virus between the different regions. In particular, it has been possible to calculate the average number of days between the beginning of the epidemic and the peak (33 ± 13 days); the minimum occurred for the Valle d'Aosta (13 days), while the maximum for the Lombardia region (58 days). Moreover, considering an estimated time-interval between the day of the maximum number of new cases and the end of the epidemic, an average number of days of 85 ± 40 has been calculated; the shortest time frame (22 days) is has been found for the Umbria region, while the longest (166 days) for the Marche region. Figure 4(b) shows the estimated date of the onset of the epidemic (in the x-axis) and the total number of infected (data updated on May 31st). It is noteworthy that the regions where the outbreak started first had the highest number of total infected (Lombardia, Piemonte, Emilia Romagna, and Veneto, in chronological order); in the remaining regions, where contagion began between February 21st and March 13th, the total number of cases in sensibly lower.
Finally, we empirically validated the proposed model using predicted values on the 25 of June 2020 for each Italian region and comparing them to the observed values of new cases and new deaths, averaged in the week between 22 and 28 of June to minimize random sampling error. Real data and estimated values showed comparable values, allowing to obtain a validation of the proposed model. Results of the comparison between real and estimated data for each region are reported as Supplementary Material.

Discussion
To improve the ability of epidemiological models to capture the trajectory and impact of pandemics and epidemics, we have introduced a generalized-growth model to characterize the epidemic ascending phase, maximum and descending phase until zero. To our knowledge, this new growth curve is proposed for the first time; it adapts to modeling new cases (and new deaths) per day. It shows some useful features: it is simple, uses only four parameters, so the risk of overfitting is minimized. The Levenberg-Marquardt algorithm reaches convergence with few iterations (5-10), while the initial values to be used, once found for a specific curve, are also suitable for all the others, both for new cases and for new deaths. The proposed model does not hold true for the entire Italian national data and the regional data at the same time. The proposed model is a non-additive one, meaning that if the model assumption holds true for each of the 20 regions, it does not hold true for the entire nation and vice versa. The model is clearly additive on the log scale and hence not additive on the original scale. Thus, the number of national daily new cases (and deaths) can be obtained as a sum of the regional numbers.
The proposed model has a good fit for all the time series considered, except for two regions (Basilicata and Molise, which are sparsely populated). This curve can be added to many others already present in the literature. The type of curve proposed is empirical, as it does not derive from a model of differential equations but from the ad hoc construction of an equation that shows the desired shape characteristics. This can be useful because, being freed from the assumption of the models, it can better adapt to the changes recorded in the time series of new cases of COVID-19. It shows similarities with the first derivative of Richard's function [6,7]. Even if it considers the asymmetry between the ascending and the subsequent descending phases, Richard's function cannot exceed a limit value; on the contrary, the function proposed in this paper overcomes this limitation, adapting to any type of asymmetry present.
This type of curve is important since it allows to foresee the amount of time it takes for the epidemic to reach its end. It is especially useful in times of emergency to rationalize resources, monitor the heterogeneity of the outbreak evolution in the different regions, and plan the necessary public health interventions in the period of maximum crisis. It should be noted that the numbers of new cases used for these epidemiological growth curves are obtained from nasopharyngeal swabs. In Italy, this screening procedure has been carried out mainly on subjects with evident severe symptoms. On the other hand, in the various regions, there has been a heterogeneous approach to screening procedures, for which the results may have been affected by bias and random errors due to the different criteria used for performing nasopharyngeal swabs. Nevertheless, the estimated values of the onset of the epidemic are fairly consistent with each other and quantify a scenario already reported by the official sources of information. This curve could also be helpful to assess shifts in epidemic growth patterns resulting from mitigation during an outbreak due to population behavior changes or public health control interventions that affect transmission.

Conclusions
In this paper, we have calibrated the new growth curve to model the reported number of infected and death cases in the COVID-19 epidemics, from February 24th to May 11th, for the whole of Italy and its 20 regions. Our analysis explores the development of the outbreaks in Italy and the impact of the containment measures both at the aggregate level and within each region. We documented the early stages of infection, which follow the exponential pattern, estimated the peak of new infections, and made future scenario projections of the possible end of the outbreak. We quantified the initial reactions and ramping up of control measures on the dynamics of the epidemics and unearthed an inverse relationship between the number of days from peak to the quasi-end and the duration from start to the peak of the epidemic among the 20 analyzed Italian regions. We identified heterogeneity of the development of the epidemic and responses across the regions. The goodness of forecasting will establish the goodness of this new equation and its possible use in the future.

Disclosure statement
No potential conflict of interest was reported by the author(s).