Two-scale Phenomenological and Statistical Analysis of the COVID-19 Pandemic

In this work, the daily reported new infection cases and deaths globally due to COVID-19 are approximated by Gaussian functions of the time elapsed since the beginning of the pandemic. This approximation is shown to describe consistently the time evolution of the daily rates of new cases and the total duration of the epidemic in any given region or country. At the pandemic scale, this phenomenological analysis condenses the statistical data of the epidemic evolution in twenty-four countries into two "universal master plots" involving in total five adjustable parameters per country. The analysis reveals intrinsic features of the pandemic as well as local specificities of the epidemic, helps in anticipating the epidemic peak date in each country and yields confidence intervals for the average numbers of incidences. Finally, the phenomenology facilitates comparisons of the epidemic evolution in different countries and evaluations of the effectiveness of the related health policies.


Introduction
The mathematical and statistical analyses of epidemic data are substantial tools for national health organizations developing prevention policies that limit the propagation of an epidemic while sustaining health systems and lowering human and economic losses. Known models, statistical, mathematical, mechanistic state-space and empirical (machine learning), have been thoroughly reviewed recently 1. Since the emergence of COVID-19, several newly published research articles rely on mathematical models to predict the spread of the virus, to estimate the basic reproduction number, R_0, and to highlight the influence of public policies on the outcome of the epidemic [2][3][4][5][6][7][8][9][10][11]. Statistical analyses of the epidemic data from different countries, by means of distribution and error functions, have also been performed with the aim of predicting the epidemic peak 12,13, some of which make use of machine learning techniques 14,15. Although statistical information from official health organizations and various other information channels 16-18 becomes increasingly available and trustworthy, modeling remains difficult because the collected data are affected by large fluctuations, especially in the neighborhood of the epidemic peak. In addition, comparing the respective efficiency of the public health policies adopted in different countries or regions is delicate, since the cardinal numbers of the affected populations and the distributions of primary seeds are very different. Nevertheless, examining the statistical databases shows qualitatively that the total number of infections/deaths as a function of the time elapsed since the first case occurrence is sigmoidal in shape, whereas its derivative, i.e. the number of new cases per day as a function of the elapsed time, τ, looks reasonably Gaussian. It may thereby be assumed that the virus evolves similarly in different countries.
Common to the regions experiencing the virus spread are also the adopted countermeasures, lockdown policies, travel restrictions and quarantine, which justify viewing countries as closed, non-interacting systems.
Like other studies, the present work relies on statistical datasets from twenty-four countries [16][17][18], and postulates that in all of them the number of daily new cases is a Gaussian function of the elapsed time. The time evolution of the corresponding total number of cases, n(τ), is then obtained by integration, yielding an error function of τ whose sigmoidal shape consistently describes the data. It is shown in addition that n(τ) and its time derivative transform easily into non-dimensional quantities, in terms of which time series of statistical data from any country superimpose with very little dispersion on master plots. The latter represent the virus evolution in any country. Comparing the epidemic evolution in different countries or regions is thereby simplified, reducing to a comparison of the respective values of the parameters spanning this homothetic transformation. It is found that by the time the epidemic peak is reached, the parameters yielding the master plots have converged toward stationary values and estimates of the epidemic duration become reliable.
In the following, the model and the results obtained with it are first presented. Therefrom, conclusions are drawn about the epidemic evolution and the limits of the adopted analysis are revealed. The methodology and procedures used in this work are detailed in the Methods and Computations section, which aims at addressing adequately the issues important for others in the field to reproduce the results 19. The Discussion section lists the main conclusions of the present work, comments on limitations and envisages directions for future work to further improve the methodology and the agreement with the statistical data.

Model
As has already been proposed in the literature 13,20, the numbers of daily new cases, infections or deaths, are supposed to evolve as Gaussian functions of the time elapsed since the occurrence of the first case, τ, expressed in number of days:

dn/dτ = A_0 exp[−(τ − τ_m)² / (2σ²)]    (1)

where τ_m locates the epidemic peak in time, σ is the standard deviation and A_0 is a normalization constant proportional to the total number of cases, N_0, that should be reached upon vanishing of the epidemic at elapsed time τ_end. Integration of eq. (1) over the time interval [0, τ] yields the number of predicted cases as a function of time, τ:

n(τ) = (N_0 / 2) [1 + erf((τ − τ_m) / (√2 σ))]    (2)

Introducing the reduced variable, τ* = (τ − τ_m) / (√2 σ), eqs. (1)-(2) read as follows:

n*(τ*) = n / N_0 = (1/2) [1 + erf(τ*)]    (3)

dn*/dτ* = (1/√π) exp(−τ*²)    (4)

Parameters (N_0k, τ_mk, σ_k)_i, with indices k and i (i = inf/dec) labelling respectively the k-th (k = 1-24) country dataset and its kind, infections or deaths, can be determined via least-squares fits of eq. (2) to the data (see Methods and Computations below). The relationship between the populations of infected and deceased individuals is accounted for in the present analysis by choosing identical standard deviations for both, thereby reducing the number of adjustable parameters to five. Data from different origins (countries or regions), transformed via eqs. (3)-(4) and plotted all together, yield non-dimensional master plots condensing all the available information on the graphs of the right-hand-side functions in these equations. Unlike the processing of individual datasets at the country epidemic scale, their combination into the master plots targets the description of the virus propagation at the upper scale of the pandemic.
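The reduced forms of the model, the Gaussian daily rate and its error-function integral, can be sketched in a few lines of NumPy/SciPy; function names here are illustrative, not from the authors' code:

```python
import numpy as np
from scipy.special import erf

def n_star(tau_star):
    """Reduced cumulated number of cases: n* = (1 + erf(tau*)) / 2."""
    return 0.5 * (1.0 + erf(tau_star))

def dn_star(tau_star):
    """Reduced daily rate: dn*/dtau* = exp(-tau*^2) / sqrt(pi),
    i.e. the derivative of n_star with respect to tau*."""
    return np.exp(-tau_star ** 2) / np.sqrt(np.pi)

def to_reduced(tau, tau_m, sigma):
    """Reduced elapsed time tau* = (tau - tau_m) / (sqrt(2) * sigma)."""
    return (tau - tau_m) / (np.sqrt(2.0) * sigma)
```

At the peak (τ* = 0) the reduced cumulated number equals 1/2 and the reduced rate reaches its maximum 1/√π, which is how the master plots are normalized.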
Remarkable and common to all the datasets is the presence of large fluctuations that seem to amplify upon approaching the epidemic peak 16,17. These certainly reflect counting errors affecting the data but may also originate from causes intrinsic to the virus propagation. However, there is little hope of extracting useful information from the data scatter present in country-specific datasets, given the low values of the corresponding cardinal numbers, which equal the number of days of the lasting epidemic. For the master plots, however, the situation is different despite the moderate, yet persisting, data dispersion. Indeed, the related dataset size is now proportional to the number of contributing countries/regions. Considering the master plots as representative of the pandemic behavior on average, confidence intervals can be determined at any desired value of τ*. This is equivalent to considering the daily new cases observed in each country as resulting from realizations of the same stochastic process, with ñ = dn*/dτ* the associated random variable. This view of the pandemic, emerging from the lower scale as the superimposition of the parallel and time-shifted epidemic evolutions in different countries, is reasonable, provided active regions in each country have not reached the 2-dimensional percolation threshold and that countries correspond to the image, invoked in the introduction, of closed and non-interacting systems. Noteworthy, country-specific confidence intervals are easy to obtain via eqs. (3)-(4) and the parameter set (N_0k, τ_mk, σ_k)_i appropriate to each country/region. Of great practical importance, the main duration of the pandemic, δτ*_duration = 2τ*_end, can be defined as the time delay separating the instances at α‰ of the maximum of the master-plot Gaussian (eq. 4), with the value of α conveniently defined. In the following, the value α = 1 has been arbitrarily chosen.
Thereby, an effective elapsed time is obtained marking the end of the epidemic in any given country. In real time units this reads, for both infections and deaths,

δτ_duration^i = 2 √2 σ_i τ*_end,  i = inf/dec

This phenomenological analysis has been applied here to the reported datasets from France 16-18, taken as a working example of a country that has passed the epidemic peak at present time (Fig. 1, a and b). In Fig. 1a, arrows mark the time instances with daily rates at 1‰ of the maxima of the theoretical graphs, thus defining an effective relaxation time characteristic of the epidemic duration in France, δτ_duration^inf = δτ_duration^dec = 83 days (Table 1). In this table, the columns Shift^inf and Shift^dec represent the time delay from January 2020 until the occurrence of the first case (infection/death). Accordingly, the position of the epidemic peak in this figure (infections) is obtained by adding to τ_m^inf = 73.9 the corresponding Shift^inf = 3 days. The relaxation time is a kind of "mean field" parameter emerging from the present analysis, which may help refine comparisons of the epidemic evolution in various countries. With no exceptions, all the country datasets used in the present work are fitted with equal quality by the Gaussian and error-function forms, demonstrating, as a first result of the present work, that these consistently describe the evolution of the epidemic. Fig. 1a shows that with increasing elapsed time the scatter of the data around the theoretical graph increases. As already mentioned above, this is a feature common to all the datasets forming the statistical database used in the present work. No indications are available about the causes of these fluctuations, though it is reasonable to admit that counting errors are superimposed on other causes, possibly intrinsic to the epidemic.
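The duration definition above can be made explicit: setting the Gaussian rate of eq. (4) equal to α‰ of its maximum gives exp(−τ*²) = α/1000, hence τ*_end = √(ln(1000/α)) and, in real time, δτ_duration = 2√2 σ √(ln(1000/α)). A minimal sketch of this arithmetic (the function name is illustrative):

```python
import math

def epidemic_duration(sigma, alpha=1.0):
    """Effective epidemic duration in days: the real-time delay between the
    two instants where the Gaussian daily rate falls to alpha per-mille of
    its maximum. From exp(-tau*^2) = alpha/1000 one gets
    tau*_end = sqrt(ln(1000/alpha)), and the real-time interval is
    delta_tau = 2 * sqrt(2) * sigma * tau*_end."""
    tau_star_end = math.sqrt(math.log(1000.0 / alpha))
    return 2.0 * math.sqrt(2.0) * sigma * tau_star_end
```

With the paper's choice α = 1, the duration is ≈ 7.43 σ, so a fitted σ of roughly eleven days is consistent with the 83-day relaxation time reported for France.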
Interestingly, the graph representing the time evolution of the cumulated numbers of infections is much less affected by fluctuations, which reflects the usual smoothing effect of integrating the daily rates of new infections over time and is a strong indication that counting errors predominantly cause the observed data dispersion (Fig. 1b). In the following, the usual prescription of statistical data analyses is used, considering the data dispersion as the consequence of a Gaussian noise, which yields confidence intervals around the mean values of the numbers of infections or deaths 21.

Master plots
Having determined (N_0k, τ_mk, σ_k)_i separately for each country, k = 1-24, with the parameter values given in Table 1, the raw datasets have been converted into reduced coordinates via eqs. (3)-(4) and are displayed all together in Figs. 2 (a-d). It can be seen that the cumulated numbers of events condense with little dispersion on the master graphs, except for two of them departing significantly from the average behavior (Fig. 2b). More specifically, the infection and death datasets of China are mutually inconsistent: unlike the former, the latter departs markedly from the master plot prescription. Though less marked, a similar trend is observed for the South Korean datasets. Obviously, the correlation that should exist between infections and deaths does not hold in these cases. The reasons behind such discrepancies are not clear and deserve further investigation, well beyond the scope of the present work, in which pragmatism has led to eliminating these two singular datasets from the subsequent statistical analysis. This exemplifies the discriminating power of the master plot analysis and constitutes a main, significant result of the present work: master plots are essential in checking the consistency of the statistical datasets of the virus spreading.
Error bars shown in Fig. 3 represent confidence intervals at a 99% confidence level, computed under the usual assumption that the data dispersion stems from a random process with a Gaussian distribution of probability. It can be seen that the error bars represent faithfully the scatter of the data displayed therein. More precisely, the reduced time interval [-4, 4] has been divided into contiguous elements of equal length, ≈ 0.18 and ≈ 0.14 for infections and deaths respectively. To the data contained in each interval is then assigned a time instance corresponding to the interval center, whereas ⟨n*⟩ and ⟨dn*/dτ*⟩ are set equal to the mean values of the data contained in the given time interval (full circles in these figures). Because the time intervals contain a modest number of elements (≈ 36), the error bars have been determined by resorting to Student's t-distribution 22, which appropriately replaces the Gaussian distribution of probability when dealing with finite datasets. It can also be seen from these figures that the theoretical graphs of the cumulated numbers and daily rates of events (full lines) fit the data points satisfactorily, although the daily-rate graphs systematically underestimate the data for τ* > 0.8 (Figs. 3a-3b). This trend highlights the fact that the Gaussian is a useful yet imperfect approximation of the daily rates and points to the need for further investigation targeting its improvement. Finally, it is worth noting that the error bars for the cumulated events are quite small, as is the related noise, and would not be visible without the amplification factor (×5) used in Figs. 3c-3d.
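The coarse-graining and Student's-t error bars described above can be sketched as follows; this is a minimal re-implementation under the stated assumptions (bins of equal length over [-4, 4], confidence interval of the mean per bin), not the authors' actual code:

```python
import numpy as np
from scipy import stats

def binned_confidence(tau_star, values, n_bins=45, conf=0.99):
    """Coarse-grain reduced data over [-4, 4] into contiguous equal-length
    bins. For each bin return its centre, the sample mean of the data it
    contains and the half-width of the confidence interval of that mean,
    computed with Student's t-distribution (appropriate for the modest
    per-bin sample sizes)."""
    edges = np.linspace(-4.0, 4.0, n_bins + 1)
    centres, means, half_widths = [], [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = values[(tau_star >= lo) & (tau_star < hi)]
        if sel.size < 2:
            continue  # too few points for a variance estimate
        t_crit = stats.t.ppf(0.5 * (1.0 + conf), df=sel.size - 1)
        centres.append(0.5 * (lo + hi))
        means.append(sel.mean())
        half_widths.append(t_crit * sel.std(ddof=1) / np.sqrt(sel.size))
    return np.array(centres), np.array(means), np.array(half_widths)
```

Because the per-bin counts are small (a few dozen points), the t-critical value is noticeably larger than the Gaussian 99% quantile, widening the error bars accordingly.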
The real-time delay corresponding to the time intervals in the above-described 'coarse-graining' procedure yielding the error bars amounts to ≈ 2-3 days. Close examination of Fig. 3 also shows that the coarse-graining smooths the data sufficiently to eliminate the fluctuations otherwise present in all the datasets of daily rates (e.g. Fig. 1). Since this delay is much shorter than any characteristic time of the epidemic, incubation period or total duration, the hypothesis made above, that case-counting procedures are principally responsible for the fluctuations present in the daily rates, is favored.
The capability of estimating error bars from the evolution of the pandemic is a direct consequence of the present two-scale phenomenological analysis that yields the master plots and should be considered as the third substantial result of this work.
It is generally accepted that phenomenology serves in classifying data and testing them for internal consistency rather than constituting a predictive tool. This statement fully applies to the present work, which however departs slightly from the general case in two particular aspects. The first is related to the symmetry properties of Gaussian functions. The second, empirical, emerges from the present study: beyond the epidemic peak the country parameters converge toward stationary values, reflecting the existence of a robust minimum of the least-squares functional (see the Methods and Computations section below).
Stationary parameters σ_i and τ_m^i (i = inf/dec) imply that the widths and positions of the Gaussian rate distributions are also stationary, which yields their widths at any height below their maxima. In particular, by arbitrarily defining the dates of the beginning and end of the epidemic as the time instances at which the events (infections or deaths) amount to 1‰ of the respective maxima, one obtains the total duration of the epidemic in terms of infection or death events. Duration values of the epidemic, defined in terms of infection daily rates, are collected in Table 1. By averaging the values for the countries having passed the epidemic peak one obtains ⟨τ_duration^inf/dec⟩ = 94 days, which represents the characteristic, estimated relaxation time of the pandemic. Confidence in this value will increase as more and more countries having crossed the epidemic peak enter the average, the main interest being to provide a sound basis for deciding how long confinement restrictions should last.

Methods and Computations
Data in the form of time series of cumulated numbers of infections/deaths, updated on a daily basis, have been collected from various official sources for 24 selected countries experiencing the epidemic, from January 22 to May 8, 2020 16-18. Therefrom, first derivatives are obtained numerically via the Richardson finite-differences scheme working at order O(h^6) 23,24. At the former date, no epidemic events had been announced in the countries entering the present study except China, for which the starting date of the epidemic is still poorly known. The parameters (N_0k, τ_mk, σ_k)_i have then been estimated numerically for each country by using a least-squares minimization scheme applied to the populations of infected and deceased individuals, while adopting a common value for the corresponding standard deviations. The in-house computer program driving the minimization 19 interfaces with the multidimensional minimization package MERLIN 25. In principle, the parameters can be adjusted via either of eqs. (1), (3) or (2), (4) indifferently. In practice, however, due to the large fluctuations affecting the time derivatives (Fig. 1a), the smoother integral forms prove preferable (Fig. 1b). In this work, eq. (3) has been used together with the objective function given by

χ² = Σ_j w_j [n*(τ*_j) − n*_p(τ*_j)]²

where the index j runs over the events in any given dataset, w_j are weights (w_j = 10 if |τ*_j| > 1 and w_j = 1 otherwise) and n*(τ*_j), n*_p(τ*_j) are respectively the observed and predicted numbers of events expressed in reduced units (eq. 3). The statistical analysis of the reduced data fluctuations around the mean values ⟨n*⟩ and ⟨dn*/dτ*⟩ at any τ* (master curves) has been made under the assumption that these follow normal distributions of known mean but unknown variance.
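The weighted objective described above is straightforward to express in code; the following is a sketch of that least-squares functional under the stated weighting rule (function and variable names are illustrative):

```python
import numpy as np
from scipy.special import erf

def chi_squared(params, tau, n_obs):
    """Weighted least-squares objective for one country dataset.
    params = (N0, tau_m, sigma); tau are elapsed days; n_obs are the
    observed cumulated counts. Weights w = 10 for |tau*| > 1 and w = 1
    otherwise emphasize the tails of the distribution."""
    N0, tau_m, sigma = params
    tau_star = (tau - tau_m) / (np.sqrt(2.0) * sigma)
    n_star_obs = n_obs / N0                        # observed, reduced
    n_star_pred = 0.5 * (1.0 + erf(tau_star))      # predicted error-function form
    w = np.where(np.abs(tau_star) > 1.0, 10.0, 1.0)
    return np.sum(w * (n_star_obs - n_star_pred) ** 2)
```

The objective vanishes for a dataset generated exactly by the error-function form and grows rapidly as τ_m or σ are displaced from their fitted values, which is the stationarity the text relies on.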
A reasonable compromise between the number of master plot data points carrying error bars (confidence intervals of the mean) and the need for the highest possible statistical significance of the estimated errors leads to defining about 50 contiguous time intervals of length δτ* ∈ [0.14, 0.18]. Otherwise, the computations proceeded along the standard guidelines of statistical data analyses 21.
The minimization procedure yields stationary values of the parameters for all the countries that are close to or have crossed the epidemic peak (Table 1). If this condition is not fulfilled, the quality of the fit improves at the expense of increasingly large, unacceptable N_0 and τ_m values. In those cases, τ_m and the σ's were determined by limiting N_0 to the actual numbers of infections/deaths within a 5% tolerance interval, using a native tool of the MERLIN package that allows setting lower and upper limits on the values of the adjustable parameters proposed by the minimizer. Figure 1 illustrates the successful application of this procedure to the data from France.
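The bounded-parameter strategy can be illustrated with a standard box-constrained minimizer; SciPy's L-BFGS-B is used here as a stand-in for the MERLIN bounding facility, and all names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import erf

def fit_bounded(tau, n_obs, n_actual, guess):
    """Least-squares fit of (N0, tau_m, sigma) with N0 restricted to within
    5% of the actually reported total, n_actual, mimicking the bounding
    procedure described in the text for countries not yet past the peak."""
    def objective(p):
        N0, tau_m, sigma = p
        ts = (tau - tau_m) / (np.sqrt(2.0) * sigma)
        w = np.where(np.abs(ts) > 1.0, 10.0, 1.0)
        return np.sum(w * (n_obs / N0 - 0.5 * (1.0 + erf(ts))) ** 2)
    bounds = [(0.95 * n_actual, 1.05 * n_actual),  # N0 within 5% tolerance
              (0.0, None),                          # tau_m non-negative
              (1.0, None)]                          # sigma kept away from zero
    res = minimize(objective, guess, method="L-BFGS-B", bounds=bounds)
    return res.x
```

On synthetic error-function data the fit recovers the generating parameters, while the N_0 bound prevents the runaway totals that the unconstrained fit produces for pre-peak datasets.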

Discussion
Unlike other works, the present study takes advantage of a phenomenological approach to evidence features common to the epidemic in different countries. The approach relies on the assumption that the time rates of events are Gaussian functions of the elapsed time, which is shown to describe satisfactorily the shapes of the data from different origins. However, the developed methodology is not unconditionally bound to this hypothesis, since any other distribution, better approximating the observations or stemming from justified modeling principles, would adapt perfectly to the phenomenological framework of the present work. Defining this framework has been motivated by the observation that cumulated numbers and daily rates of events of different origins condense into uniquely defined master graphs via a linear transformation of the coordinates. This is the major result of the present work, potentially of general applicability to cases other than the COVID-19 pandemic.
The early applied travel and mobility restrictions, aiming at hindering the virus propagation, have practically transformed the countries experiencing the epidemic into closed and non-interacting population ensembles. It follows that the global virus evolution at the pandemic scale may be viewed as resulting from the linear superimposition of k = 1-24 time-shifted replicas of the same stochastic process evolving in k different countries. Thereby, the data dispersion overwhelmingly present in the master plots of the daily-rate datasets (Figs. 2c, 2d) is used for evaluating error bars, which provide expectations for the upper limits of daily rates as functions of the elapsed time at the desired confidence level. Although the projected durations of the epidemic may differ substantially between countries, as seen in Table 1, the corresponding reduced data are within the estimated error bars (Fig. 3), thus leading to the conclusion that the intrinsic characteristics of the pandemic are well captured. This result is of interest for national health authorities. In this context, the coarse-graining procedure yielding the estimation of errors has proved useful in eliminating data fluctuations (Fig. 3). However, clarifying their origin is a matter for further investigation.
Finally, the present phenomenological analysis can be used as a tool for checking the consistency of the datasets referring to the time series of cumulated numbers, i.e. the reduced data of infections and deaths from different countries. Without addressing the details of the correlations between them, the phenomenology discriminates datasets that depart significantly from the master plots. Thereby, singularities are brought to light that justify the elimination of the incriminated datasets from the error estimation in the present analysis, while encouraging further investigations for a better understanding of their origins. Work in progress focuses on improving the data description by using better justified time distributions of events.