Uncovering the universality of COVID-19 fatality rates

An important and elusive characteristic of the COVID-19 pandemic is the fatality rate which allows us to understand the severity of this disease, health care needs, and the impact on large populations. To address this and other questions we present a probabilistic model to study the evolution of the COVID-19 pandemic and to correct the case fatality rates. Our model employs probabilities to estimate the time evolution of infections, recoveries, and deaths. This model discriminates asymptomatic, mild/moderate, and severe cases and allows for the estimation of undiagnosed individuals. Furthermore, we compare the model curves to oﬃcial data for medium-sized cities, world metropolises, and medium-sized countries, spanning a range of populations from a few million to several million individuals. Using the undiagnosed estimates we correct the case fatality rates and ﬁnd that it ranges from 0 . 33 % ± 0 . 02 % to 1 . 14 % ± 0 . 07 % . Since we applied the method to cities and countries with diﬀerent characteristics, the corrected case fatality rates indicate a universality that is independent of location and other so-cial/demographic conditions. Our results agree with sample tests and seroprevalence studies, considerably changing our understanding of the COVID-19 fatality rates. Other applications include the estimate for severe cases, ICU needs, and undiagnosed cases.


Introduction
The severe acute respiratory syndrome coronavirus 2 (SARS COV 2) is the virus causing the Corona Virus Disease (COVID-19) [1]. Sources suggest the virus crossed from their animal hosts to humans through an yet unknown mechanism [2,3,4]. It started infecting humans in late 2019 in Wuhan, China and by March 2020 the disease was spreading fast in several countries around the globe and was declared a global pandemic by the World Health Organization (WHO) [5,6,7]. The SARS-COV-2 virus is known to cause a wide variety of symptoms, most notably in the respiratory system [8,9] and it spreads through human contact, contaminated surfaces, and through the air, [10]. This latter mechanism is particularly important if people are in confined air-tight spaces such as airplanes, restaurants, and similar environments. As in other pandemic episodes, isolation plays a major role in containing the virus [11,12]. Very early in the pandemic, the governments imposed social isolation, confinement, and even complete lock downs lasting several weeks in an effort to stop the virus dissemination. This generated immense political, social, and economic burdens with consequences virtually impossible to measure at the present time [13,14,15]. Health professionals and governments worldwide are making huge efforts to understand the virus, treat the patients and at the same time keep track of infected individuals [16].
One of the fronts in the battle to understand epidemics and to plan effective control strategies is guarded by mathematical models [17,18,19,20,21,22]. These so-called dynamic or epidemiological models try to answer some general questions: What is the duration of the epidemic?; How many people will get infected and die?; Are isolation and social distancing determinants to the course of the pandemic? The answers to these questions might help governments and health care officials to properly react. A widely used approach to model this problem is the Susceptible-Infectious-Recovered (SIR) model, first introduced in 1927 [23]. Other efforts use variations/generalizations of this approach to find exact and more general solutions [24,25,26]. Despite being limited and overly deterministic, SIR-like models continue to be useful and widely used to study pandemics and outbreaks [27].
In this work, we introduce a probabilistic model to study the COVID-19 pandemic. In particular, we assign relative probabilities and use random numbers to weight the occurrence of infections, deaths, and recoveries according to the assigned probabilities. This approach uses a straightforward and non-deterministic way to compute the time evolution of the disease and was specifically designed to model the characteristics of the COVID-19 pandemic. This model is especially useful to calculate the number of infected individuals, deaths, ICU needs, and especially the so-called asymptomatic cases that often go undiagnosed. One of the most important and elusive features of the COVID-19 pandemic is the case fatality rate (CFR) [28,29]. Since asymptomatic cases are difficult to detect and might come in huge numbers, we anticipate that the number of cases is much larger than officially reported [30,31,32]. This implies that the case fatality rates need to be corrected for a proper estimation of the real fatality rate. Our model addresses this problem by dividing those infected into different levels of severity and specifically estimating asymptomatic cases that often go unreported.
This paper is organized as follows. In section 2 we fully describe our probabilistic model. Section 3 is devoted to discussing the parameter space of the model and fitting the model to real curves for selected cities and countries around the world. We also calculate the case fatality rates of the COVID-19 infection and correct them using the model results. Section 4 is dedicated to a thorough discussion of our results and an evaluation of the impact and reach of our probabilistic approach to modeling this problem. We also point out and discuss the limitations of the model. In section 5 we draw our conclusions, especially those regarding the corrections to the case fatality rates.

The Probabilistic Model
We developed a probabilistic code to model and interpret the time evolution of the COVID-19 pandemic. The model assumes that individuals may occupy different compartments according to their condition or stage of the disease: free, asymptomatic, mild/moderate, severe, recovered or deceased (Fig.1). The model works by assigning probabilities to these compartments and also to the related events (infections, recoveries, deaths). To weight these probabilities we use a very well-known and documented random number generating function [33]. For each day (step or iteration) the code evaluates if new infections, recoveries, or deaths occur in a completely random, nondeterministic fashion.
We aim to model a population of N individuals with a fraction I being isolated. First, we set the infection probability of free individuals (P I ) and assume that the probability of new infections varies according to the number of infected individuals (N I ) [34]. Since N I varies with time it impĺies that P I is also a function of time (P I (t)). This happens because as the outbreak progresses more individuals became infected and the rate of new infections increases. This cannot go on indefinitely since both the total population N and the total number of exposed individuals (1 − I) × N are finite. It implies that as more individuals get infected, new infections are less likely, assuming re-infections do not occur. We thus define P I (t) as given by equation 1 thus allowing for a realistic modeling of the official cases: This function is updated at each iteration t (days) and increases by a constant value S (increase/decrease step). A critical fraction C F of the total population (C F × N ) is considered as a point of interest above which the function stops growing, in accordance with the idea of herd immunity since more recovered/immunized individuals decrease the rate of new infections. Thus the infection probability remains constant during a critical interval ranging from C F × N < N I (t) ≤ (C F + δ) × N , where δ represents a small variation of the critical fraction C F . In our simulations, we use C F = 0.40 and δ = 0.10, although these values can be reset by the user. When the critical interval is exceeded (N I (t) > (C F + δ) × N ) the model assumes that new infections are rarer and thus the probability of infection starts to decrease with a negative S at each iteration. Eventually P I (t) is so low that the rate of new infections dims and the outbreak dies out.
Next we assign relative probabilities for each compartment in fig. 1. Free individuals, once infected, have a probability of assuming one of the infected compartments: asymptomatic (P A ), mild/moderate (P M ), or severe (P S ). In this model, only severe cases might evolve to death and we assign a probability P D for this occurrence. Hence, the recovery probability of severe cases is P R = 1 − P D . The model has three levels of randomness and at each iteration, all individuals are checked. If an individual is free then the first aleatory number is drawn and compared against P I (t). If the individual becomes infected then a second aleatory number is drawn and compared to P A , P M , and P S to check the severity of the case. Also at each iteration the model calculates/updates the number of asymptomatic (A(t)), mild/moderate M (t), severe cases (S(t)), recoveries and deaths (D(t)). Asymptomatic and mild/moderate individuals always recover in t A and t M days after infection. Severe cases either recover or evolve to death in t S days after infection when a third and final aleatory number is drawn and compared to P D to decide if a death or a recovery will occur. It is very important to set these different recovery times to realistically model the official deaths and case curves. In our simulations we used fixed median recovery times (days) for the asymptomatic (t A = 10), mild/moderate (t M = 15) and severe cases (t S = 20), according to [35,36]. However, these values can be adjusted by the user if needed. As mentioned, a fraction I of the total population N is isolated/quarantined and the model will thus run on (1 − I) × N individuals. In summary, all the probabilities (P A , P M , P S , P D ) and parameters (I, S, C F , δ) are set by the user and reasonable values will be discussed in section 3.
It is known that the official number of total cases is often underestimated by a myriad of factors. Up to 75% of cases are asymptomatic [37,38], meaning that these cases will rarely enter the official records. To properly model this feature we need to estimate a range for the number of cases to account for those that are not detected. Our model thus calculates both an upper limit (N upp ) and a lower limit (N low for the total number of cases, allowing for a flexible range for the total number of infections. Note that the cumulative number of cases (N I (t)) is different from the total number of active cases (N A (t)), representing those currently infected: N upp Cumulative cases grow as a function of time and eventually reach a plateau. In contrast, active cases increase as a function of time reaches a maximum and then starts to fall, eventually reaching zero as A(t), M(t), and S(t) either recover or die. Both cumulative and active cases are given by the model and the user may choose which one to use depending on the data format available for the official cases. In this work, we chose to model our data using the cumulative cases model curves.
To fit the models to the official data we allow the probabilities for asymptomatic and mild/moderate cases to vary in the range P A = 0.67 − 0.70 and P M = 0.28 − 0.31 respectively, while fixing P S = 0.02, according to the medical literature [37,39,40]. In practice we let the official curves for cases and deaths tell us exactly which values to use. With recovery times and probabilities for each case severity fixed, we are left with P D and S to be adjusted. We follow the available information in the literature but ultimately let the models adjust according to the data. We adjust the model parameters and probabilities to get a good fit on the overall shape of the curve (setting S is critical) and determine the best fit by minimizing the χ 2 to obtain a good fit by comparing D(t) to the official deaths (D O ) and verifying which isolation model corresponds to the official data. We then use the same isolation to adjust both N upp I (t) and N low I (t) to the official data. Since both cumulative models are well above the official cases we multiply the official case curves by an underestimation correction, meaning that many cases are 'missing' in the official data. We end up with two underestimation factors (U low C and U upp C ) that give us a range for the number of real cases. After both curves are adjusted according to D(t), we have two equations relating the models (N I (t)) and the official cases (C O ): N upp To calculate the case fatality rate (CFR) we simply divide official deaths (D O ) by the total number of official cases (C O ), according to equation 2(a). However the model offers a way to correct the denominator of this equation and we use both U low C and U upp C to obtain the corrected case fatality rates, according to equations 2(b),(c). These equations properly correct the case fatality rates by taking into account the adjusted number of cases. Note that using U low C as a correction produces an upper limit for the case fatality rate (CCRF upp ) and using U upp C produces a lower limit (CCF R low ), because U low C < U upp C .
As shown in the next section these corrections dramatically change the case fatality rates for the COVID-19 pandemic.

Results
In this section we show: i) The parameter space of the model; ii) The fits of official data for selected cities and countries around the globe; iii) The case fatality rates and their corrections using the model results.

Exploring the Model Parameter Space
To test the model parameter space we simulated a fictitious city with 1,000,000 individuals using as input ( iii) The number of cumulative and severe cases change drastically from I = 0 to I = 0.90 since the numbers of exposed individuals change with I. The number of deaths also varies from 150 (I = 0.90) to 2800 (I = 0). Model results for the severe cases for different values of S and I have a peak and eventually vanish as more people either die (with probability P D ) or recover (with probability 1 − P D ). This experiment clearly demonstrates the model's ability to cover different scenarios, ranging from short-lived outbreaks to long-term infections. Both parameters S and I are able to account for changes in the steepness, plateau, duration of outbreaks, and most importantly how rapidly the disease is spread among the population.

Fitting the probabilistic model to official COVID-19 cases
We present the applicability of the model to different population sizes ranging from cities with ∼ 1 million people to larger cities and countries with several million individuals. In every figure of this section we use C F = 0.40, δ = 0.10, t A = 10, t M = 15 and t S = 20 to generate models from I = 0.30 to I = 0.90. Other parameters and probabilities are adjusted according to each case and are shown in the figures. Figure 3, left side, displays the models for the city of São Paulo, Brazil (pop. 12,252,000) which accumulated 247,730 official cases and 11,030 deaths in the first 181 days after the first officially reported case on February 26 th , 2020. On the right side, we show the results for New York City, USA (pop. 8,336,817) which reached 247,613 cases and 19,196 deaths 228 days after the first reported case on February 29 th , 2020. The official data for cases and deaths (red continuous curves) for both cities are from official government sources [41,42]. Figures 3(a),(b),(c),(d) show the fits for the upper and lower limit models to the official cases for each city. All fits were adjusted multiplying the official data (C O ) by underestimation factors U upp C and U low C (indicated in the plots) thus correcting for the missing/undiagnosed cases, as discussed in section 2. The main result is that São Paulo has its official cases underestimated by factors ranging from U low C = 4.39 (lower limit) to U upp C = 15.69 (upper limit). In other words, São Paulo had, according to the model, at least 4.39 times more cases than officially reported. For New York we find underestimation factors ranging from U low C = 6.12 to U upp C = 21.87, meaning that New York had at least 6.12 times more cases than officially reported. Figures 3(e),(f) show the evolution of severe cases for both cities. We find that for São Paulo the severe cases reach a maximum of 8,291 individuals 106 days after the first reported case (I = 0.70 model curve). For New York, we find that severe cases reach a maximum of 41,291 such cases 32 days after the first reported case (I = 0.50 model curve). The official data for severe cases was not available and is thus not shown. The time evolution of deaths is shown in figures 3(g),(h). We use the official deaths fit corresponding to I = 0.70 (São Paulo) and I = 0.50 (New York) to adjust both the upper and lower limit models to the official cases.
Another example is shown in figure 4 for Italy (pop. 60,360,000) and Spain (pop. 47,431,256). The official data for these countries were obtained from official government sources [43,44]. Italy reached 298,200 cases and 35,710 deaths 219 days after the first reported case on February 15 th , 2020. The time evolution for the upper and lower limit cases for Italy is shown in figures 4(a),(c). We find U low C = 10.13 and U upp C = 33.79, meaning that Italy had at least 10.13 times more cases than officially reported. Figure 4(e) shows that severe cases in Italy reached a maximum of 129,481 individuals 41 days after the first reported case (I = 0.70 curve). The time evolution of deaths in Italy is shown in figure 4(g). Note that we use the official deaths best fit corresponding to a line above the I = 0.90 model curve to adjust both the upper and lower limit models shown in figures 4(a),(b). We further assume that this line also fits the severe cases in figure 4(e). According to official data [44], Spain reached 543,400 cases and 29,630 deaths 208 days after the first reported case on February 15 th , 2020. The time evolution for both upper and lower limit cases are shown in figures 4(b),(d) and we find U low C = 9.22 and U upp C = 29.77, meaning that Spain had at least 9.22 times more cases than officially reported. Figure 4(f) shows that severe cases in Spain reached a maximum of 50,760 such cases 40 days after the first reported case (I = 0.70 curve). The time evolution of deaths is shown in figure 4(h) and a model curve between I = 0.70 and I = 0.90 is the best fit, which was also used to adjust both upper and lower limit models in figures 4(b),(d). The same approach described above was applied to other cities and countries and the results are summarized in table 1.

Correcting the Case Fatality Rates (CCFR)
One of the most important and elusive features of the COVID-19 pandemic is the fatality rate since the official cases are highly underestimated [45,46]. Thus calculating the case fatality rates according to equation 2(a) necessarily produces highly overestimated values (see table 1, CFR(%) column). As described in section 2 the corrected fatality rates are obtained using equations 2(b),(c) and the lower and upper values of U C obtained from the fits. We calculated the CF R's and their corrections for several countries and cities as shown in table 1. We find mean corrected case fatality rates and standard deviations of CCF R low = 0.33 ± 0.02 and CCF R upp = 1.14 ± 0.07.

Discussion
The Parameter Space. We tested the model's ability to properly cover its parameter space and to account for different scenarios, ranging from shortlived outbreaks to long-duration infections. We did this by varying parameters S and I. The parameter S is able to describe several outbreak scenarios as shown in figure 2 (left side) where we fix the value of I to 0 and run simulations for different values of S in the range 1.0 × 10 −7 − 1.0 × 10 −2 . In practice the value of S governs the length of the outbreak. The parameter I measures how the fraction of isolated individuals affects the evolution of the pandemic episode. This is investigated in figure 2 (right side) where we plot the time evolution of N upp I , N low I , severe cases, and deaths for several values of I (S = 10 −5 ). Although the model is able to account for different scenarios, we point that the infection probability curve may not be a bonafide representation of reality. Nevertheless, it is a hypothesis that according to our tests correctly describes the time evolution of cases. We also point out that the isolation factor should not be literally interpreted since the fraction of isolated individuals in a given population varies every day. Also, a city's population is not constant and there is a fair amount of people that move in and out of big urban centers on a daily basis. In other words, there is a social dynamic that is important but is not taken into account in our model. All these factors compete for an oscillating isolation factor and thus limit its meaning and reach.
Fitting real cases. In figures 3 and 4 we show the fits to real data for two world metropolises with a few million inhabitants and two medium-sized countries with several million individuals. In these figures and for the other cities and countries considered we used a range of values for the S parameter ranging from 9 × 10 −5 to 1 × 10 −2 . The death probabilities ranged from 14% to 22%. We clearly note that good fits were found regardless of the size of the population, with changes in the parameter S and relatively small variations in the probabilities. However, the parameter S is not an assumption. Instead, the shape of official deaths and case curves tell us which value of S better describes the data. We clearly see that the overall shape of the curves and the time to reach a plateau were properly recovered by the fits, suggesting similar and universal characteristics of the disease in different locations around the globe and for different population sizes, despite the complex social dynamics and limitations described above.
The immediate gain from each fit is the U upp C and U low C values. These numbers are very different for each one of the places considered and are summarized in table 1. For New York, official data indicates 247,730 cases (after 228 days), but we find U upp C = 21.87 and U low C = 6.12, meaning that the real number of cases could be at least 6.12 times higher or even 21.87 times higher. If we consider the lower limit as true, for instance, we find that official cases are roughly 16% of the real cases, i.e., 84% of the cases went unnoticed. Italy had 298,200 official cases (after 219 days) and we find U upp C = 33.79 and U low C = 10.13. Again, using the lower limit result, we find that Italy had at least 10.13 times more cases than officially reported, meaning that official cases represent roughly 10% of the real cases. Similar conclusions might be drawn for the other places considered and this discrepancy in official and real cases could be due to limited or no access to medical facilities, poor testing or a myriad of other factors [47,48].
Severe Cases and ICU needs. Our model estimates the number of severe cases, those that will most likely visit the hospital in need of medical care. Some of these cases might require long-term medical attention and a fraction of those will need intensive care units (ICU) and respirators. The curves corresponding to severe cases show that they go up, reach a peak, and start to decrease as the outbreak progresses. For Spain (see Fig.4), considering the I = 0.70 model curve, we see that severe cases peaked roughly at 50,000 individuals. This result alone dramatically explains why the medical facilities and hospitals in Spain (and in many other cities and countries) had exhausted their capacities very early into the outbreak [49,50,51,52]. For instance, if only 2% of these severe cases require intensive care, we are still talking about 1000 ICU beds that must be dedicated to COVID-19 patients. However, as of 2013, Spain had a total of 4700 multi-purpose ICU beds in the entire country [53]. Another example is New York City which had, according to official data 17,259 individuals hospitalized 32 days after the first reported case on February 29 th , 2020 [41]. No data was available to show how many of those hospitalized required respirators/ICU beds. For NYC, the model predicts a maximum of 41,291 severe cases for the 32 nd day after the first reported case. Taking into account that severe cases take ∼ 20 days to recover (or evolve to death) [36] we should compare the model prediction to day 52 after the first reported case. Comparing the model prediction to the official data [41] on day 52 we indeed see that NYC had 41,516 hospitalizations. This demonstrates how precisely the model estimates the number of those in need of hospitalizations/ICU beds for such a full-blown pandemic episode. However, we did not explore in depth this particular aspect of the model since we focused on estimating the corrections to the case fatality rates.
Correcting the Official Case Fatality Rate. We use the U low C and U upp C values to correct the official case fatality rates (CFR's) according to equation 2. These results are in table 1. We notice the values of U low C and U upp C are very different for each of the locations considered but we find CCF R upp = 1.14% ± 0.07% and CCF R low = 0.33% ± 0.02%. These small standard deviations for the mean CCFR's contrast to the wildly different CF R's for the many locations considered which average 6.75% ± 3.11%. Although the corrections applied to the models to fit the many official data curves were very different, they translate into consistent corrected case fatality rates (see table 1). In summary, considering the ubiquitous unreported cases the resulting corrected case fatality rates (CCFR's) are much lower than the case fatality rates (CF R's).
In summary we find the corrected case fatality rates (CCFR's) ranging from 0.33% to 1.14%. This is in accordance with completely independent studies that use sample testing conducted worldwide and especially in hardhit countries like Brazil and the USA. In particular, a study compiling data from several independent seroprevalence studies in 51 different countries finds case fatality rates varying from 0% to 1.54% with a median of 0.27%. This study includes both light (e.g. Afghanistan) and hard (e.g. USA) hit areas around the globe [54]. Another seroprevalence study conducted in Brazil in 90 cities (comprising 54 million individuals) finds that the official number of cases is underestimated by a factor of 7. These cities combined had (until May 2020) ∼ 105, 000 official cases and ∼ 8, 000 deaths, suggesting a corrected case fatality rate of ∼ 1.10% [55,56]. Finally, another seroprevalence study conducted in New York City finds a corrected case fatality rate for the first wave of the COVID-19 pandemic of 0.97% [57]. These studies find corrected case fatality rates well inside the range indicated by our model. We highlight that seroprevalence studies are a completely different approach compared to our mathematical-statistical study.
We mention that calculating the real case fatality rates for the COVID-19 pandemic is a challenge since it is impacted by access to medical facilities, social constraints, population size and density, the mean age of the impacted population, and a myriad of other factors. That is why our method tries to set both an upper and a lower limit for the CCFR's. We argue that countries with better infrastructure, medical facilities, and social standards might experience a lower case fatality rate than countries with less favorable conditions. Nevertheless, even considering the lower limit of 0.33% ± 0.02% this disease is at least 30 times deadlier than the 2009 Swine Flu pandemic which had an estimated case fatality rate of 0.01% [58,59].

Conclusions
Our probabilistic method was able to generate model curves and good fits to the official data for different population sizes, thus demonstrating that the parameters and probabilities chosen to describe the problem for each case are correctly describing the time evolution of the COVID-19 pandemic. These results clearly show how isolation plays an important role in preventing the spread of this disease. This probabilistic model was designed to estimate the number of undiagnosed patients using both an upper and a lower limit for the number of cases, clearly showing that real cases might be several times larger than official reported cases, independent of population size and other factors. This clearly demonstrates a universal characteristic for the disease and especially for the corrected case fatality rates. The model also demonstrated that we can estimate the time evolution of severe cases and showed that thousands of patients were in this condition during the most severe days of the outbreak, in accordance with official data. That was the reason for so many deaths in the first months after the first reported cases: hospitals were not prepared for a fast-spreading disease like COVID-19. No country or health system can deal with several thousand individuals in need of medical care at the same time, some of which in dire need of respirators. Our main result is the use of a range of values for the number of real cases to correct the official fatality rates. Notwithstanding, the corrections applied for the number of cases were very different the derived corrected fatality rates are rather consistent and have small standard deviations which agree with other studies employing seroprevalence methods. We find that the COVID-19 pandemic may not be as deadly as initially thought or as indicated by the CFR's since the corrected fatality rates range from 0.33% ± 0.02% to 1.14% ± 0.07%. This revealed that although the pandemic affected countries and regions in different ways we were still able to find a universal characteristic for this pandemic.   shows a different result: The upper limit for the total number of cases (a), the lower limit for the total number of cases (c), the severe cases (e) and deaths (g). On the right hand side panels we show several scenarios referring to the different values of I (with fixed S = 10 −5 ). Panels (b), (d), (f) and (h) also show the upper limit for the total number of cases, the lower limit for the total number of cases, severe cases and deaths, respectively. to I = 90%. The probabilities P A , P M , P S and P D and the parameters S and U C used to adjust the models to the data are indicated. We show the upper limit for the total number of cases (a) and (b); the lower limit for the total number of cases (c) and (d); severe cases (e) and (f) and deaths cases (g) and (h). Our model estimates the severe cases in each considered scenario, although we do not have the time evolution of real severe cases for each of the cities considered. Italy (left-hand side) and Spain (right-hand side). Each model curve corresponds to a different isolation factor I (green), covering most of the parameter space from I = 30% to I = 90%. The probabilities P A , P M , P S and P D and the parameters S and U C used to adjust the models to the data are indicated. We show the upper limit for the total number of cases (a) and (b); the lower limit for the total number of cases (c) and (d); severe cases (e) and (f) and deaths cases (g) and (h). Although we do not have the time evolution of real severe cases for each of the cities considered, our model estimates the severe cases in each investigated scenario.