Logarithmic Quadratic Regression Model for Early Periods of COVID-19 Epidemic Count Data

Abstract


Abstract Background
While COVID-19 epidemic has been spreading worldwide, its characteristics are still unclear.The development of good mathematical models for predicting its prevalence and subsiding is strongly expected.The epidemic curve shows how the epidemic increases and subsides.This is the number of persons found infected daily.To express this with a mathematical model, the compartment model such as the SIR model is used generally.However, model parameter values of these ordinary differential equation based models are very sensitive for errors of observed data, and it is often di cult to nd a reliable model especially when the amount of data is not su cient.On the other hand, a regression model with a small number of parameters is more robust against data errors than a highly sensitive nonlinear differential equation model, though, it is not clear what a good regression model is for epidemic data.

Methods
We modeled the initial emerging period of the epidemic curve of COVID-19 in Tokyo with a model that introduces a quadratic polynomial function to the logarithms of the numbers of infected cases, and modeled it with other regression models including the generalized linear model to compare.

Results
It was shown that the statistical properties of the logarithmic quadratic function model were good even in the early stages of the epidemic, which is generally said to increase exponentially and monotonically.By applying the logarithmic quadratic function model to the data of the number of cases in each country of the world, the starting and the subsiding dates of the epidemic and the total number of cases in each country were estimated.

Conclusions
Although an epidemic curve in an early period said generally to be exponential, namely linear in the logarithmic space, a quadratic curve regression ts better than the linear and the generalized linear model.These estimates can be informative to reveal the transition mechanism from pre-epidemic to epidemic, and to pandemic.

Background
While COVID-19 epidemic has been spreading worldwide, its characteristics are still unclear.The development of good mathematical models for predicting prevalence and subsiding is strongly expected.
The epidemic curve shows how the scale of the infection increases.For example, this is a time series of the number of infected persons con rmed in a day, or daily epidemic count data.When this is expressed by a mathematical model, compartmental models such as the SIR model [1][2][3], and regression with an exponential function [6] are used generally.The number of infected persons per day is the difference in the cumulative number of infected persons, and ordinary differential equations (ODEs) or difference equations are often used in these models.The exponential regression model is used when the difference and accumulation are increasing rapidly in the early period of epidemic.The exponential regression is equivalent mathematically to the linear regression of logarithms of count data.
In the compartment model with simultaneous differential equation systems, the values of parameters such as basic production number R0, incubation period, and period with infectious ability must be determined [1,2].However, it is rare that they can be determined by experimental observation for new infectious diseases, especially in the early period of the epidemic, so they are determined by searching for parameter values that best t the model to the time series of the epidemic counts.Parameter estimation uses a nonlinear numerical optimization method in general, but the optimum parameter values uctuate greatly due to slight errors in the observed data.It is often di cult to obtain a reliable model when the amount and accuracy of data are not su cient [6,7].
Models with a small number of parameters may be more robust to data errors than nonlinear and sensitive ODE models.Regression of the epidemic count data by the exponential function and regression of the logarithm of the epidemic count by a linear or a quadratic function can be easy at model parameter estimation.These regressions may robust for data with large errors or small sample sizes.
Since the epidemic count is a non-negative integer, and it seems to increase exponentially in the early period of the epidemic, the generalized linear model (GLM) that uses the logarithmic transformation for the link function and Poisson distribution for the error distribution can be expected as a good model [5].In this model, it is expected that the logarithms of the epidemic counts will be linear with time, but it has not been proven.GLM, and the exponential function, are monotonically increasing, though the epidemic count increases and then decreases generally [6].
A time series of the epidemic count, or an epidemic curve is often depicted as a bell shape curve [7].A typical mathematical model of a bell curve is the exponential quadratic function, that is used for the probability density function of the normal distribution.This is represented as a simple quadratic polynomial function by logarithmic transformation, and has two zeros.These two zero points can be considered the epidemic's starting and subsiding points, and this starting point can be an approximate value of the transition point from the endemic to the epidemic.Also, if the epidemic curve is divided into two periods, namely steady uctuation and bell shape curve periods, each may correspond to an endemic and an epidemic.
We applied a model that introduces a quadratic polynomial function to the logarithms of the epidemic count data, that is, the logarithmic quadratic function model, to model the transition of the epidemic counts in the early period of the epidemic of COVID-19 in Tokyo.The accuracy and statistical properties were compared with other regression models, the exponential regression and GLM.We searched for the data range where the model ts best for each model.As a result, it is shown that the logarithmic quadratic function model has better statistical properties even in the early period of the epidemic, which is said to increase exponentially and monotonically.
We estimated the starting and the subsiding dates of the epidemic and the total number of cases in each country by applying the logarithmic quadratic function model to the epidemic count data in 114 countries in the world.These estimates can be informative to reveal the transition mechanism from endemic to epidemic, and epidemic to pandemic.

Result Tokyo
We applied the logarithmic quadratic function model to the daily epidemic count data of Tokyo, and compared statistical properties of the model with those of the exponential regression and GLM.The data published by Tokyo metropolitan government [10] were used for model comparison.This is the daily numbers of persons found infected in Tokyo from January 24, 2020, when the rst infected person was con rmed in Tokyo, to April 20, 2020.The day on which the count is zero was excluded from the data.
The analyzed data consist of 62 days.
The epidemic count hardly increases in the former period of the Tokyo data, and it increases in the latter period (Fig. 1, middle).The exponential regression, GLM, and the bell curve regression were applied to this count data, and the exponential regression, linear regression, and the quadratic polynomial function regression were applied to the logarithms of the count data.We introduced the Poisson distribution for the error distribution model and the logarithmic transformation for the link function of GLM.
We searched for the data range that gave the best t for each model by reducing the samples (days) one by one from the beginning of the data.The model tness was evaluated by the absolute value of the mean of the residuals of the data, AMR, de ned as follows: where N is the number of samples in the count data, ri is the residual of the sample i and the expected value for the sample i that is predicted by the tted model.In the AMR plot (Fig. 1, left) of the logarithmic quadratic model, N for which the t is optimal can be roughly seen, the minimum is at N = 49.In the linear model, AMR has little correlation with N. Other models have the lowest AMR when the data size is reduced to approximately 10 points or less.Therefore, except for the logarithmic quadratic model, the entire original data consist of 62 samples was used for model tting, namely we chose N = 49 for the logarithmic quadratic function model and 62 for all other models.
In the distribution of residuals to the expected value predicted by the model (Fig. 1, right), it is most vertically symmetrical when the bell curve is applied to the count data.Also in the logarithmic quadratic function model, which is a logarithmic version of the bell curve model for count data, the residual distribution is approximately vertically symmetrical.In other models, the residual is biased both upward and downward depending on the model prediction values.Especially the going down pro le in the right most part of the plots shows that the model prediction is too large for large count data.
The fewer the outlier, the better generally in the distribution of leverages (Figure S1).It is customary for outliers to be at least 2.5 times the average value.In the modeling of the epidemic counts, the outliers are the most in the case of the exponential regression to the count data, and are the same as or larger than those of the models for logarithms of counts.In the modeling of the logarithms of counts, the number of outliers is the same in linear regression and the quadratic function regression, however, outliers and other samples are separated clearly in the plot of the linear regression.This separation is unnatural statistically.

Worldwide Estimation
When the logarithmic quadratic function model is applied and the model curve is convex upward, the two points where the model curve intersects the horizontal axis (zero point) are found.These can be estimates of the starting and subsiding dates of an epidemic.The total number of infected persons can also be estimated by integrating the model function from the estimated starting date to the subsiding date.
We applied the quadratic function to the logarithms of the epidemic counts in each country, observed in the period from the end of 2019 to April 17, 2020.The dataset is published by European Centre for Disease prevention and Control (ECDC) [11].The number of countries or regions included is 204.For the tting range, we searched for the range where the AMR was the smallest for each country.The number of countries and regions where the model was convex upward was 114.The estimations of starting and subsiding dates and the total number of infected persons were calculated for these.The numbers of days from the estimated starting date to the subsiding date, or estimated period of epidemic, of the top 30 countries are shown in Table 1.The estimated total numbers of infected persons of top 30 countries are shown in Table 2. Plots of logarithms of count data and model predictions for each country are shown in the supplementary.

Discussion
By searching the range of data for which the model is optimal, the logarithmic quadratic function with statistically better properties than the exponential function and GLM can be obtained.Optimal data ranges cannot be found for other models.It is commonly said that observations increase exponentially in an early period of an epidemic, that is, their logarithms look linear with time, but a regression model to capture the pro les of logarithms of epidemic count should be the quadratic polynomial function, rather than the linear.
The bell curve model for the epidemic counts is mathematically equivalent to the quadratic polynomial model for the logarithms of the counts, but in the former, the predicted value decreases at the end of the data range, whereas it does not in the logarithmic quadratic model (Fig. 1, middle).In the tting results of the bell curve model, the error of the count data is large when the predicted value is large.It is considered that the sensitivity of the parameter value to the error is high due to the exponential function in the model, and this sensitivity caused over tting.The logarithmic quadratic function is more suitable for parameter estimation because the numerical calculation for parameter estimation is stable generally and the runtime error is less likely to occur.
In general count data, such as the epidemic count data, the distribution of the counts and the residuals of the data may seem to be larger as the model prediction values are larger.However, GLM of the logarithmic link function and Poisson distribution, which have these characteristics, do not show that it has good properties in this study; the residual distribution do not seem uniform for model prediction values.The residual distribution of the logarithmic quadratic function model looks more uniform.This suggests that the logarithms of the epidemic counts are not linear with time.Although the linear model for logarithms has very small AMR, this is a good result, but the distribution is heavily biased by the model values.
The logarithmic quadratic function model can be used to estimate the starting and the subsiding dates, and the total number of infected cases.This is not possible with the exponential regression and GLM, which are monotonically increasing models.Therefore, the logarithmic quadratic function model may be useful for predicting the kinetics of epidemic in its early period.These estimations may be informative for exploring the transition mechanism from epidemic to pandemic.
The logarithmic quadratic function is a very simple model.When this ts well the epidemic data, it is likely that the kinetics or the mechanism of the epidemic is simple.In other words, there is no or unchanging effect of arti cial control of the epidemic.In the case of COVID-19, the epidemic went very fast, so it is considered that there are not a few cases where the epidemics began to subside before the governmental response was effective.The logarithmic quadratic function does not t well to American data (Figure S2).This is considered to be due to the national measures taken.
It is sometimes di cult to t the exponential quadratic function as a bell curve model directly to the epidemic count data.In order to do this, the initial values of the parameter values given to the tting algorithm must be close enough to the optimum values, and it is often necessary to manually nd good initial values.Therefore, the exponential quadratic function is not suitable for systematically tting models of many countries as in this study.
There are errors in the count data and the model parameters are highly dependent on the errors.Therefore, the estimated starting and subsiding dates and the estimated total number of infected persons are highly affected by the errors, and reliability of the estimations cannot be guaranteed for any data.Also, the data may deviate from the "natural" quadratic curve if a national response is taken.We think it is needed to nd a model that is more robust than the logarithmic quadratic function for such data.

Method Data
We used two datasets on the number of persons found infected per day, Tokyo and International datasets.These are published by Tokyo metropolitan government and European Centre for Diseases prevention and Control (ECDC).Modeling was performed by removing days with a count of 0 in the Tokyo data and days with a count of 2 or less from the international data.

Formulae Logarithmic Quadratic Model
The logarithmic quadratic function is tted to logarithm of epidemic count data y(x) as follows;

2 )
where x is a number of days from the beginning of the data, y(x) is a number of persons found infected in the day x, a, b and c are model parameters.x and y are natural numbers and a, b and c are real numbers.Two zero points of the logarithmic quadratic function that t to the daily epidemic count data are represented by model parameters a, b and c as follows: where -c/a is positive when the tted function is convex upward, or a < 0 and c > 0, and logarithm of a typical epidemic curves are convex upward.Therefore, the estimated epidemic period is represented as 2{(-c/a)^0.5},and the estimated maximum daily epidemic count is exp(c).When log(y(x)) is represented by a quadratic polynomial function as shown above, y(x) is represented by an exponential quadratic function as follows: where C is exp(c).Here let the N(x) is the probability density function of the standard normal distribution of which the mean is 0 and the variance is 1.N(x) is de ned as follows: y(x) is represented by N(x) as follows: y (x) = exp (a( x − b) 2 + c) = C exp(a(x − b) Estimated total number of infected persons are represented by two zero point of the logarithmic quadratic function above and Q(x), the cumulative distribution function of the normal distribution, as follows: GLM and Exponential model The generalized linear model (GLM) for the epidemic count data introduces logarithmic transform as the link function and Poisson distribution for the residual distribution.

Figures
Figures

Table 1
Estimated period lengths in days of epidemic of top 30 counties those relative AMR is less than 0.03.When the size of data is small or epidemic counts are small for a country, the in uence of data errors to model parameters is large, so parameter estimations are not reliable for such countries.Although the logarithmic quadratic function model was convex upward in 114 countries, 14 of them had a large relative AMR, and the ttings did not seem successful.The logarithmic quadratic function model of 33 countries became convex downward, and numerical calculation failed in 53 countries.Four countries (Anguilla, Bhutan, South Sudan and Yemen) were excluded from modeling because the daily epidemic counts are only 0 or 1.

Table 1 :
Estimated period lengths in days of epidemic of top 30 counties those relative AMR is less than 0.03.