Tokyo
We applied the logarithmic quadratic function model to the daily epidemic count data of Tokyo and compared its statistical properties with those of the exponential regression and GLM. The data published by the Tokyo Metropolitan Government [10] were used for model comparison. These are the daily numbers of persons found infected in Tokyo from January 24, 2020, when the first infected person was confirmed in Tokyo, to April 20, 2020. Days on which the count was zero were excluded from the data. The analyzed data consist of 62 days.
The epidemic count hardly increases in the early period of the Tokyo data and increases in the later period (Fig. 1, middle). The exponential regression, GLM, and the bell curve regression were applied to the count data, and the exponential regression, linear regression, and quadratic polynomial regression were applied to the logarithms of the count data. For the GLM, we used the Poisson distribution as the error distribution model and the logarithmic transformation as the link function.
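As an illustration of the quadratic regression on the logarithms of the counts, the following is a minimal sketch of an ordinary least-squares fit. It assumes natural logarithms and strictly positive counts; the function name `fit_log_quadratic` and the synthetic data are hypothetical and are not taken from the analysis above.

```python
import math

def fit_log_quadratic(days, counts):
    """Least-squares fit of log(count) = a*t**2 + b*t + c; returns (a, b, c).

    Assumes natural logarithms and strictly positive counts.
    """
    y = [math.log(v) for v in counts]
    # Design matrix rows for the columns t^2, t, 1.
    rows = [[t * t, float(t), 1.0] for t in days]
    # Normal equations (X^T X) beta = X^T y, stored as an augmented 3x4 matrix.
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(3)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            factor = M[r][col] / M[col][col]
            M[r] = [x - factor * p for x, p in zip(M[r], M[col])]
    # Back substitution.
    beta = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(M[i][j] * beta[j] for j in range(i + 1, 3))
        beta[i] = (M[i][3] - tail) / M[i][i]
    return tuple(beta)

# Synthetic, noise-free example (not the Tokyo data).
demo_days = list(range(20))
demo_counts = [math.exp(-0.02 * t * t + 0.5 * t + 1.0) for t in demo_days]
a, b, c = fit_log_quadratic(demo_days, demo_counts)
```

A negative fitted `a` corresponds to a curve that is convex upward, which is the case of interest for the bell-shaped epidemic profile.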
We searched for the data range that gave the best fit for each model by removing samples (days) one by one from the beginning of the data. The model fit was evaluated by the absolute value of the mean of the residuals of the data, AMR, defined as follows:
$$\mathrm{AMR} = \left| \frac{1}{N} \sum_{i=1}^{N} r_i \right|$$
where N is the number of samples in the count data and r_i is the residual of sample i, that is, the difference between the observed value and the expected value for sample i predicted by the fitted model. In the AMR plot (Fig. 1, left) of the logarithmic quadratic model, the N for which the fit is optimal can be roughly seen; the minimum is at N = 49. In the linear model, AMR has little correlation with N. The other models have the lowest AMR when the data size is reduced to approximately 10 points or fewer. Therefore, we chose N = 49 for the logarithmic quadratic function model and used the entire original data of 62 samples for all other models.
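The range search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: residuals are taken on the count scale (least-squares residuals on the logarithmic scale average to zero whenever the model has an intercept), a simple log-linear fit stands in for the models, and the names `amr`, `log_linear_predict`, `best_window` and the lower bound `min_size` are hypothetical.

```python
import math

def amr(observed, expected):
    """Absolute value of the mean residual (observed minus expected)."""
    residuals = [o - e for o, e in zip(observed, expected)]
    return abs(sum(residuals) / len(residuals))

def log_linear_predict(days, counts):
    """Stand-in model: straight line through log(counts), mapped back to counts."""
    n = len(days)
    y = [math.log(v) for v in counts]
    t_mean, y_mean = sum(days) / n, sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in days)
    slope = sum((t - t_mean) * (yi - y_mean) for t, yi in zip(days, y)) / sxx
    intercept = y_mean - slope * t_mean
    return [math.exp(intercept + slope * t) for t in days]

def best_window(days, counts, fit_predict, min_size=5):
    """Drop samples from the beginning one at a time.

    Returns (N, AMR) for the remaining sample size N with the smallest AMR.
    min_size is an assumed lower bound on the number of samples kept.
    """
    best = None
    for start in range(len(days) - min_size + 1):
        d, c = days[start:], counts[start:]
        score = amr(c, fit_predict(d, c))
        if best is None or score < best[1]:
            best = (len(d), score)
    return best

# Synthetic example: a flat early period followed by exponential growth.
demo_days = list(range(25))
demo_counts = [1.0] * 10 + [math.exp(0.3 * k) for k in range(1, 16)]
best_n, best_amr = best_window(demo_days, demo_counts, log_linear_predict)
```

Passing the model as a callable keeps the search itself independent of which regression is being evaluated.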
In the distribution of residuals against the expected value predicted by the model (Fig. 1, right), the residuals are most vertically symmetrical when the bell curve is applied to the count data. In the logarithmic quadratic function model, which is the logarithmic version of the bell curve model for count data, the residual distribution is also approximately vertically symmetrical. In the other models, the residuals are biased upward or downward depending on the model prediction values. In particular, the downward trend in the rightmost part of the plots shows that the model prediction is too large for large counts.
In the distribution of leverages (Figure S1), fewer outliers are generally better. It is customary to regard a sample as an outlier when its leverage is at least 2.5 times the average value. Among the models of the epidemic counts, the exponential regression has the most outliers, and the numbers of outliers are the same as or larger than those of the models for the logarithms of counts. Among the models of the logarithms of counts, the number of outliers is the same in the linear regression and the quadratic function regression; however, the outliers and the other samples are clearly separated in the plot of the linear regression. This separation is statistically unnatural.
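The 2.5-times rule for leverages can be illustrated with a short sketch. It covers only the straight-line regression with an intercept, for which the leverage of sample i has the closed form 1/n + (x_i - mean(x))^2 / S_xx; the function names and the example days are hypothetical.

```python
def leverages(x):
    """Leverage of each sample in a straight-line fit with intercept."""
    n = len(x)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]

def leverage_outliers(x, factor=2.5):
    """Indices of samples whose leverage is at least factor times the average."""
    h = leverages(x)
    threshold = factor * sum(h) / len(h)
    return [i for i, hi in enumerate(h) if hi >= threshold]

# Synthetic example: one isolated early day followed by ten consecutive days.
demo_days = [0] + list(range(30, 40))
flagged = leverage_outliers(demo_days)  # only the isolated first day is flagged
```

Because days with a zero count are excluded, the retained days are not evenly spaced, and isolated early days can have a large leverage in this way.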
Worldwide Estimation
When the logarithmic quadratic function model is applied and the model curve is convex upward, the curve intersects the horizontal axis (zero) at two points. These points can serve as estimates of the starting and subsiding dates of an epidemic. The total number of infected persons can also be estimated by integrating the model function from the estimated starting date to the subsiding date.
We applied the quadratic function to the logarithms of the epidemic counts in each country, observed in the period from the end of 2019 to April 17, 2020. The dataset is published by the European Centre for Disease Prevention and Control (ECDC) [11]. The number of countries or regions included is 204. For the fitting range, we searched for the range where the AMR was the smallest for each country. The number of countries and regions where the model was convex upward was 114. The estimates of the starting and subsiding dates and of the total number of infected persons were calculated for these. The numbers of days from the estimated starting date to the subsiding date, that is, the estimated periods of the epidemic, for the top 30 countries are shown in Table 1. The estimated total numbers of infected persons for the top 30 countries are shown in Table 2. Plots of the logarithms of the count data and the model predictions for each country are shown in the supplementary material.
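The estimation described above can be sketched as follows for a fitted curve log(count) = a t^2 + b t + c. This is an illustration under assumptions that are not stated in the text: natural logarithms are used, the quantity integrated is the count-scale function exp(a t^2 + b t + c), and the integral is approximated with the trapezoidal rule. The function name `epidemic_summary` and the coefficients in the example are hypothetical.

```python
import math

def epidemic_summary(a, b, c, steps=10000):
    """Start, end, duration and integrated total for log(count) = a*t**2 + b*t + c.

    The start and end are the two roots of the quadratic, i.e. the days on
    which the modeled log count is zero.
    """
    if a >= 0:
        raise ValueError("curve is not convex upward (need a < 0)")
    disc = b * b - 4.0 * a * c
    if disc <= 0:
        raise ValueError("curve does not cross the horizontal axis twice")
    root = math.sqrt(disc)
    t_start, t_end = sorted(((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)))

    def count(t):
        # Assumed count-scale model: exponential of the fitted quadratic.
        return math.exp(a * t * t + b * t + c)

    # Trapezoidal rule between the two roots.
    h = (t_end - t_start) / steps
    total = 0.5 * (count(t_start) + count(t_end))
    total += sum(count(t_start + k * h) for k in range(1, steps))
    total *= h
    return t_start, t_end, t_end - t_start, total

# Hypothetical coefficients: roots at day 0 and day 100.
t_start, t_end, duration, total = epidemic_summary(-0.01, 1.0, 0.0)
```

The two `ValueError` branches correspond to curves for which no starting and subsiding dates can be read off, such as a model that is convex downward.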
Table 1
Estimated lengths of the epidemic period, in days, for the top 30 countries whose relative AMR is less than 0.03.
estimated days of epidemic | country |
278.3 | Senegal |
258.5 | San_Marino |
243.1 | Peru |
210.0 | Slovakia |
198.7 | Russia |
175.7 | Bahamas |
173.6 | China |
159.3 | Uzbekistan |
156.5 | Saudi_Arabia |
132.2 | Paraguay |
130.4 | India |
117.1 | Democratic_Republic_of_the_Congo |
114.4 | Mexico |
112.9 | Indonesia |
109.0 | Sweden |
104.4 | Uruguay |
104.1 | Kazakhstan |
102.4 | Italy |
101.4 | Iran |
99.8 | Serbia |
99.5 | Morocco |
98.5 | Ireland |
98.4 | South_Korea |
97.3 | Palestine |
97.2 | Bulgaria |
96.1 | Pakistan |
95.7 | Brazil |
95.2 | Algeria |
92.3 | Cote_dIvoire |
91.9 | Finland |
Table 2
Estimated final numbers of cases, or infected person counts, for the top 30 countries whose relative AMR is less than 0.03.
estimated number of total cases | country |
5636461.7 | Russia |
4350695.2 | Peru |
213335.6 | Spain |
194503.7 | Italy |
143654.2 | Germany |
109365.7 | India |
108111.5 | Turkey |
92216.3 | Iran |
75154.5 | Saudi_Arabia |
64596.7 | Brazil |
45299.1 | Canada |
44709.0 | Belgium |
37631.5 | Netherlands |
29618.6 | China |
28518.7 | Switzerland |
25445.0 | Mexico |
22857.1 | Ireland |
21888.4 | Uzbekistan |
21469.5 | Portugal |
20065.4 | Sweden |
14232.1 | Pakistan |
14206.0 | Serbia |
13816.9 | Austria |
13653.6 | Ukraine |
12723.2 | Israel |
12722.4 | Indonesia |
12121.0 | Chile |
11503.3 | Poland |
10784.4 | Romania |
10297.4 | Ecuador |
When the data size is small or the epidemic counts are small for a country, the influence of data errors on the model parameters is large, so the parameter estimates are not reliable for such countries. Although the logarithmic quadratic function model was convex upward in 114 countries, 14 of them had a large relative AMR, and the fits did not seem successful. The logarithmic quadratic function model was convex downward in 33 countries, and the numerical calculation failed in 53 countries. Four countries (Anguilla, Bhutan, South Sudan and Yemen) were excluded from modeling because their daily epidemic counts were only 0 or 1.