Simple method for estimating daily and total COVID-19 deaths using a Gumbel model

The purpose of the present paper is to introduce a method for forecasting daily and total numbers of COVID-19associated deaths. We apply the Gumbel distribution function for the analysis of time series data of the first wave. The Gumbel distribution function F(t) has a notable property of F(t) = 1/2.718 at the node (peak) point of the distribution. Therefore, we can forecast the number of total deaths N. In the present study, the Gumbel model gives the estimate N ≈ 2.718N1, where N1 is the partial sum of the daily numbers of deaths until the day of the peak. The proposed model can also forecast the daily numbers after the peak. We investigated the data of New York City, Belgium, Switzerland, Sweden, and the United Kingdom. The Gumbel model gives reasonable results for New York City, Belgium, and Switzerland. On the other hand, the proposed method underestimates N for Sweden and the United Kingdom. The proposed approach is very simple, and carrying out the analysis is easy. This method uses spreadsheet software for most of the calculations, and no special program is needed. keywords: COVID-19, mortality, forecast, extreme values, Gumbel distribution Coronaviruses are large, enveloped RNA viruses, the importance of which has been recognized through the identification of a newly emerged coronavirus as the causative agent of severe acute respiratory syndrome (SARS).1 The worldwide pandemic called COVID-19 is caused by the most recently discovered coronavirus, SARS-CoV-2, and is “the third documented spillover of an animal coronavirus to humans in only three decades that has resulted in a major epidemic”.2 The 2003 outbreak of SARS had a case-fatality rate of approximately 10% (774 deaths), and MERS killed 34% of people with illness between 2012 and 2019 (2494 deaths). While the case-fatality rate of COVID-19 is a few percent, the associated pandemic has caused many more deaths worldwide.3 The international research community is currently fighting this global health crisis. There are numerous efforts to develop methods of predicting the time evolution of a pandemic for the purpose of science-based policy making.4–6 In the first wave of a pandemic in many regions, the daily plot of reported deaths is single-peaked and skewed to the right. The time series data have three phases: (A) an exponential increase in the early stage, (B) the intermediate stage changing from increasing to decreasing, and (C) a slowly decreasing final stage. Taking this characteristic into account, the present study makes use of extreme value (EV) statistics to estimate the daily numbers in (C) and the total number of deaths from the data of (A) and (B). EV theory provides methods to analyze the maximum (or minimum) value samples derived from distributions, such as normal or exponential distributions. EV theory has been used in many fields to estimate the probability of rare events from observed data.9 The Fisher-Tippett type I distribution,10 i.e., the Gumbel distribution, is a distribution of maximum (or minimum) values from the data of several distributions. The cumulative distribution function of the Gumbel distribution for maximum values is given as F(t) = exp{−e−y(t)}, (1) where y(t) is a linear function of t, y(t) = (t −β ) α , α > 0. (2) Here, α and β are parameters that decide the shape and position of the distribution, respectively. The parameter β corresponds to the position of the node, i.e., the peak position, and is rather easy to estimate from the distribution curve. A notable property of the Gumbel model is that we can estimate the total number N from the data of t ≤ β . To this end, we make use of the value of F(t) at the node: F(β ) = e−1 = 1/2.718. (3)

keywords: COVID-19, mortality, forecast, extreme values, Gumbel distribution Coronaviruses are large, enveloped RNA viruses, the importance of which has been recognized through the identification of a newly emerged coronavirus as the causative agent of severe acute respiratory syndrome (SARS). 1 The worldwide pandemic called COVID-19 is caused by the most recently discovered coronavirus, SARS-CoV-2, and is "the third documented spillover of an animal coronavirus to humans in only three decades that has resulted in a major epidemic". 2 The 2003 outbreak of SARS had a case-fatality rate of approximately 10% (774 deaths), and MERS killed 34% of people with illness between 2012 and 2019 (2494 deaths). While the case-fatality rate of COVID-19 is a few percent, the associated pandemic has caused many more deaths worldwide. 3 The international research community is currently fighting this global health crisis. There are numerous efforts to develop methods of predicting the time evolution of a pandemic for the purpose of science-based policy making. [4][5][6] In the first wave of a pandemic in many regions, the daily plot of reported deaths is single-peaked and skewed to the right. The time series data have three phases: (A) an exponential increase in the early stage, (B) the intermediate stage changing from increasing to decreasing, and (C) a slowly decreasing final stage. Taking this characteristic into account, the present study makes use of extreme value (EV) statistics to estimate the daily numbers in (C) and the total number of deaths from the data of (A) and (B).
EV theory provides methods to analyze the maximum (or minimum) value samples derived from distributions, such as normal or exponential distributions. EV theory has been used in many fields to estimate the probability of rare events from observed data. 9 The Fisher-Tippett type I distribution, 10 i.e., the Gumbel distribution, is a distribution of maximum (or minimum) values from the data of several distributions. The cumulative distribution function of the Gumbel distribution for maximum values is given as where y(t) is a linear function of t, Here, α and β are parameters that decide the shape and position of the distribution, respectively. The parameter β corresponds to the position of the node, i.e., the peak position, and is rather easy to estimate from the distribution curve. A notable property of the Gumbel model is that we can estimate the total number N from the data of t ≤ β . To this end, we make use of the value of F(t) at the node:  Since the partial sum of daily reports N 1 with t ≤ β is This means that the total number N can be estimated from the early stage of daily data t ≤ β . Using the relation n(β ) = N/(αe), where n(β ) is the maximum daily number at t = β , we can estimate α by α = N e n(β ) .
From this, we define the convergence time t c by t c = 2.9702α, using the relation that 95th percentile of the Gumbel distribution is β + 2.9702α. The proposed method is simple and easy to use. We have performed the main part of the calculations by using spreadsheet software.

Results
The present study analyzes the daily numbers and the total number of deaths in the first wave of the COVID-19 pandemic. We have selected the data for one city in the United States, New York City, and four European countries, Belgium, Switzerland, Sweden, and the United Kingdom. All plots of time series data for this city and these countries are single-peaked and skewed to the right. New York City reports two types of data: confirmed deaths and probable deaths. Confirmed deaths counts the number of people who had laboratory-confirmed COVID-19. We analyzed the time series data for confirmed deaths in New York City. Moreover, we chose Sweden because of the herd immunity strategy taken by this country. Table 1 shows the summary of our results. The total populations of this city and these countries are approximately 83.37 × 10 5 (NYC), 115.

New York City
In Figure 1, we report the results of our analysis of the daily data for New York City as function of time. The solid line indicates the theoretical plot of daily numbers N e · f (t) using the Gumbel distribution. Here, f (t) is the probability density function of the Gumbel distribution. The dashed curve indicates the 7-day moving average of the reported data.
The number of data points used for the regression analysis is 15. The output of the linear regression analysis is presented in the form of c 0 + c 1 t with constants c 0 and c 1 . The 95% confidence interval of c 0 is from −1.6130 to −1.5865, and that of c 1 is from 0.08613 to 0.08905. The coefficient of correlation is R(T,Y ) = 0.9996, and the coefficient of determination is R(T,Y ) 2 = 0.9992. As this large value of the correlation coefficient suggests, the theoretical curve almost perfectly fits the reported data in the exponentially increasing period. The theoretical curve for forecasting in the decreasing phase also shows a reasonable fit to the observation.

Belgium
In Figure 2, we report the results of our analysis of the daily data for Belgium as a function of time. The solid line indicates the theoretical result of the Gumbel model. The dashed curve indicates the 7-day moving average of the reported data. The number of data points used in the regression analysis is 15. The 95% confidence interval of c 0 is from −1.5731 to −1.5602, and that of c 1 is from 0.06948 to 0.07091. The coefficient of correlation is R(T,Y ) = 0.9999, and the coefficient of determination is R(T,Y ) 2 = 0.9997.

Switzerland
In Figure 3

Sweden
In Figure 4, we report the results of our analysis of the daily data for Sweden as a function of time. The solid line indicates the theoretical curve of the daily numbers. The dashed curve indicates the 7-day moving average of the reported data. The number of data points used in the regression analysis is 20. The 95% confidence interval of c 0 is from −1.6199 to −1.5900, and that of c 1 is from 0.06477 to 0.06726. The coefficient of correlation is R(T,Y ) = 0.990, and the coefficient of determination is R(T,Y ) 2 = 0.979.
The large discrepancy between the observed data and theoretical prediction seems to suggest that this discrepancy is the result of the herd immunity policy taken by Sweden. Although there has been much discussion on the effect of herd immunity, 12 scientific investigations on this issue are still limited. Therefore, it is too early to make such a conclusion at this stage of our knowledge of COVID-19.

United Kingdom
The fit for the United Kingdom is reported in Figure 5 The output of our analysis for the United Kingdom is N/N e = 1.338, and this ratio for Sweden is 1.443. The theoretical curve also deviates from the reported data in the same manner as Sweden.

Discussion
The present study demonstrates the advantage of the EV theory for investigating time series data of deaths in pandemic outbreaks. We make use of one class of Gumbel distribution for the extremes of maximum values. The Gumbel distribution provides a means of extrapolation in time with theoretical justification by the EV theory and can be used to estimate the data in the decreasing phase from the data in the exponentially increasing phase. Gompertz (1825) proposed a formula for describing the mortality rates of the elderly. This law states that death rates increase exponentially with age. 13 Gumbel discussed the Gompertz's model in one section (6.3.5 Oldest Ages) of his book. 7 He applied the Gumbel model of minimum values for the analysis of mortality. The cumulative distribution function is given by The survival function of the Gompertz model S(t) is defined for t ≥ 0. It can be shown 14 that S(t) is approximated as the survival function of the Gumbel distribution for −∞ < t < ∞, There are theoretical efforts to derive the Gumbel distribution of the smallest values to explain the phenomena of exponentially increasing mortality. Abernethy (1979) applied EV theory to this problem. 15 This previous study used an analogy between death of the organism and the failure of a multicomponent system and mathematically derived this type of Gumbel distribution. We hope that this line of approach can explain the applicability of the Gumbel distribution of maximum values to the pandemic data. In many regions, it is reported that the outbreak of a second wave of the pandemic began before the first wave of COVID-19 converged to a final stage. 16 We have carried out preliminary calculations for the first wave data of the United States. The dataset is that used in this analysis for European countries. The estimated number of total deaths N e is approximately 100,000. However, the second wave of the pandemic began in the decreasing phase of the first wave, so we cannot calculate the total number of deaths N from the reported data. Therefore, we are now trying to design a more general tool for analyzing the first and second waves of a pandemic simultaneously.

Data extraction
From the website for New York City, we downloaded the dataset of the daily number of COVID-19 deaths: https://www1.nyc.gov/site/doh/covid/covid-19-data-deaths.page. This dataset includes two types of data: confirmed deaths and probable deaths. This analysis uses the data of confirmed deaths.
For the analysis of the data for Belgium (BEL), Switzerland (CHE), Sweden (SWE), and the United Kingdom (UK), we downloaded datasets from the website of the European Center for Disease Prevention and Control: https//www.ecdc.europa.eu. (The filename is COVID-19-geographic-distribution-worldwide-2020-09-12.xlsx.)

Gumbel model
The probability density function f (t) of the Gumbel distribution is given by The meant and variance V of the distribution arē where the mode, which is the peak position of f (t), is β . Useful information for the parameter estimation is obtained for t = β : For the Gumbel distribution, it is easy to calculate the time t = q(P) that satisfies the relation F(t) = P with 0 < P < 1. Using eq. (1), we obtain the following quantile: We present quantiles for P = 0.5, 0.90, and 0.95: Here, q(0.5) is the median of distribution. We define the convergence time t c measured from the peak position by the quantile of P = 0.95 as t c = 2.9702α.

Estimation of daily and total numbers
This subsection describes the process for estimating the daily data and the total number. The date is indexed as t = 0 when the first case was reported. Later we will redefine the starting date for the convenience of comparing the data of different regions.
We use the 7-day moving average n(t) as the daily data in this analysis: where d t denotes the reported daily number. Note that n(t) includes the information of d t+1 , d t+2 , and d t+3 . In the process of calculations, note that, for the data of Switzerland, it is necessary to use the 9-day average because even data of the seven-day average fluctuate considerably around the trend line. The accumulated number U(t) is calculated using d i : From U(t), we estimate the distribution function F(t). The first step is the estimation of total number N = ∑ t n(t). To this end, we used eq. (3). Therefore, it is necessary to estimate mode b. We define the estimated mode t b , which takes an integer value recursively defined as We define the number N 1 as the partial sum of daily data {d t ,t ≤ t b }. The estimate of the total number N e can be calculated by The next step is the estimation of daily number n(t). The distribution function F(t) is approximated using U(t) and N e :F (t) = U(t)/N e .
The linear function y(t) = (t − β )/α is estimated by the regression analysis of the data of intermediate estimateỹ: Next, we rewrite the definition of t usingF(t) of eq. (9). We set the first time satisfying the conditionF(t) > 0.01 as t = 1. Setting the final time t f as the smallest t satisfying t > t b , and n(t) < 0.1n(t b ), and we define the "observed" total number of deaths by N = U(t f ).

Forecasting of time series data
We calculate the estimate of y(t) from the intermediate estimateỹ using the regression analysis program. After performing a linear regression of the dependent variableỹ and the independent variable t, we obtain two estimated parameters a and b for α and β , respectively. The present study uses t max points for the regression : T : 1, 2, . . . ,t max and Y :ỹ(1),ỹ(2), . . . ,ỹ(t max ).
Here, time (day) starts from t = 1. We define the estimate of y(t) as The goal is to calculate the estimate of F(t): The daily moving average n(t) is estimated by n e (t) = {F e (t) − F e (t − 1)}N e .
There is another way to estimate α. From eq. (6), we obtain an estimate of α aŝ α = N e e n(t b ) = N 1 n(t b ) , and a direct estimate of t c is calculated byt Note that this estimation uses N 1 and n(t b ), both obtainable directly from observed data.