Estimating Unreported COVID-19 Cases in the United States Based on the tvSIRu Model

COVID-19


Introduction
Although COVID-19 was first reported several months ago 1, the coronavirus is still raging on a global scale and continues to surge in the United States, one of the most important engines of the global economic network.
The epidemic in the United States will have an important impact on the global economy and politics. Relatively accurate estimates of the pandemic are fundamental to preventing and controlling COVID-19 in the United States 2,3, and the transmission rate (TR) and infection fatality rate (IFR) are key indicators 4.
The main obstacle to calculating such indicators is the unreported infection rate (UIR), which may be caused by insufficient testing, under-recording of mild or asymptomatic patients, and time lag bias 5,6. Direct use of IFR values derived from official data might lead to larger errors 7. Similar research on SARS has pointed out that preferential ascertainment of severe cases and delayed reporting of deaths are the two main sources of error in the case fatality risk (CFR) 8. Beyond insufficient early testing, mild and asymptomatic patients may account for most unreported cases. In Brazil, only some moderate and severe hospitalized cases have been recorded 9. The time lag bias, in turn, could be explained by the incubation period of COVID-19, which varies over a wide range 10 and during which patients are still highly infectious 11. The incubation period is also correlated with the age of the infectives, which also directly affects IFR 12. It has been concluded that unreported cases may lead to four kinds of uncertainty in IFR calibration: an unclear denominator, unknown infection time, unknown incubation, and undiagnosed asymptomatic infections 13.
Characterizing unreported cases has become a popular question in epidemic modeling of COVID-19. A growing body of literature calculates the UIR or the reported rate (RR) from country-level data 14,15,16, although data from a single country may lead to greater bias 17. Moreover, because county-level data on recovered infectives in the United States have not been released, the calculation of IFR depends merely on national aggregate data, which may further amplify the error. More and more studies use multinational data 18, county-level data 19, or mixed country-county regional data 20 for analysis, which greatly improves modeling accuracy by increasing the dimensionality and quantity of data.
However, few studies have investigated the time effect of UIR, which may affect the accuracy of all the indicators. A recent study suggests using a time-varying SIR model to capture the changing transmission rate 21. Moreover, the incubation period has been shown to change across stages of transmission 22. Some studies argue that the COVID-19 IFR of China has been overestimated and that a plausible value would be 2.3% 23, while another study shows that the IFR of COVID-19 in Wuhan in the early stage might have been as high as 20% 24. Such disputes may also imply a changing trend in IFR.
This study proposes an SIR regression model with an unreported infection rate (SIRu) and an SIRu model with time-varying parameters (tvSIRu) to estimate the values of TR, UIR, and IFR, and to assess the impact of the time effect.
The data for this study are the county-level data for the United States released by Johns Hopkins University 25. This study provides the first insights into the time series values of TR, UIR, and IFR of COVID-19, contributing to a deeper understanding of the trend of COVID-19 in the United States.

Methods

Data
The COVID-19 data used in this article cover 3,142 counties in the United States and include the number of daily new infectives, cumulative infectives, and deaths, while the number of recovered infectives remained unreported.
The data range from January 22, 2020, to August 20, 2020, and contain 666,104 (3,142 × 212) records. As a one-day time lag (t, t+1) was applied in the data analysis, the number of records used for regression was 662,962 (3,142 × 211).
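The record counts above follow directly from the date range and the number of counties. A minimal Python check, using only the figures stated in the text (the variable names are illustrative):

```python
from datetime import date

N_COUNTIES = 3142  # number of counties stated in the text

first_day = date(2020, 1, 22)
last_day = date(2020, 8, 20)

# Inclusive number of daily observations per county.
n_days = (last_day - first_day).days + 1

# Every county contributes one record per day.
n_records = N_COUNTIES * n_days

# Pairing each day t with the next day t+1 drops one observation per county.
n_lagged = N_COUNTIES * (n_days - 1)

print(n_days, n_records, n_lagged)
```

The lagged count is smaller by exactly one record per county, because the last day of each county series has no successor to pair with.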

tvSIRu model with fixed UIR
In the classic SIR dynamic model, the number of daily new infectives (Id t1) at time t1 can be expressed as a function of the infection rate β, the number of susceptible persons (S t0), the number of infected persons (I t0), and the total population (N) at time t0 (Equation 1).
The SIR model with an unreported infection rate (SIRu) is illustrated in Figure 1.
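As an illustration of the relation described for Equation 1, the following Python sketch assumes the standard discrete mass-action form of the SIR model, in which new infections equal β · S · I / N. The exact form of Equation 1 is an assumption here, and the county size and number of infectives are hypothetical; only β = 0.0339 is taken from the estimates reported in this paper.

```python
def daily_new_infections(beta, s_t0, i_t0, n):
    """New infections between t0 and t1 in a discrete SIR step.

    Assumes the standard mass-action form beta * S * I / N.
    """
    return beta * s_t0 * i_t0 / n


# Hypothetical county: 100,000 residents, 500 currently infected, nobody removed yet.
n = 100_000
i_t0 = 500
s_t0 = n - i_t0

# beta = 0.0339 is the average value reported in the paper; the rest is illustrative.
new_cases = daily_new_infections(beta=0.0339, s_t0=s_t0, i_t0=i_t0, n=n)
print(f"expected new infections at t1: {new_cases:.2f}")
```

In the SIRu setting, S t0 and I t0 are not observed directly and must first be derived from the reported data together with φ and λ.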

Figure 1 SIR model with unreported infection cases
As the number of recovered infectives was not released, two special parameters were added to the SIR model, φ for the unreported infection rate (UIR) and λ for the recovery/death rate (RDR), which could be described by the following equations:
Here, Ic t0 represents the total cumulative infectives at time t0, and Icr t1 denotes the cumulative cases reported. Rc t0 reflects the whole population of removals at time t0, and Rdr t0 the cumulative deaths reported at time t0. RDR can also be transformed into an infection fatality rate (IFR):
IFR = 1 / (RDR + 1)
The two explanatory variables in Equation 1, S t0 and I t0, could then be calculated as:
If UIR is fixed without a time effect, the daily new infectives at time t1 (Id t1) could also be calculated from φ and the corresponding reported data (Idr t1):
The SIR model (Equation 1) could be developed into Equation 5 by substituting Equations 2-4, and the further simplified form, Equation 7, is taken as the general tvSIRu model.
As the four variables can be calculated from the released data, Equation A.0 could be seen as a primary linear form of Equation A with coefficients a, b, c, and d:
Considering the fixed time effect of all three parameters in Equation A, the corresponding average values (β, λ, φ) could be calculated in Equation B.
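The conversion between RDR and IFR stated above, IFR = 1 / (RDR + 1), can be checked numerically. The short Python sketch below applies it to RDR values that appear in this paper; the function names are illustrative.

```python
def rdr_to_ifr(rdr):
    """IFR = 1 / (RDR + 1), the relation stated in the text."""
    return 1.0 / (rdr + 1.0)


def ifr_to_rdr(ifr):
    """Inverse of the relation above: RDR = 1 / IFR - 1."""
    return 1.0 / ifr - 1.0


# RDR values quoted in the paper: the national average and the state-level range.
for rdr in (192.5, 200.0, 500.0):
    print(f"RDR = {rdr:6.1f} -> IFR = {100 * rdr_to_ifr(rdr):.3f}%")
```

Because the relation is monotonically decreasing, a rising RDR always corresponds to a falling IFR, which is how the trends in the two quantities are read against each other in the results.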

tvSIRu model with time-varying UIR
If the UIR varies over time, the UIR of the cumulative cases and that of the daily new cases will also differ; they are denoted φ and φ´ respectively. Equation 7 would be rewritten as:
To simplify the computation, a new parameter β´ was introduced:
Then the time-varying model (tvSIRu) could be transformed into a form similar to Equation A:
To verify the assumption of time-varying parameters, the coefficients in Equations A and B could be represented by initial values and time effect functions. Such functions were substituted into the two models step by step, resulting in several sub-equations with time effects.
Substituting Equations 9-11 into Equations A and B, three complete functions could be generated:
In terms of the specific functions reflecting the time effect, the power, exponential, and periodic functions were tested and compared in this article.
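To make the three families of time-effect functions concrete, the Python sketch below assumes one simple parameterization of each (power, exponential, periodic) and compares fixed candidate curves against a synthetic series using a least-squares AIC. The functional forms, parameter values, and data are all assumptions made for illustration and are not taken from this paper; a real analysis would estimate the parameters before comparing AIC values.

```python
import math


# All three functional forms below are assumptions made for illustration.
def power_effect(t, initial, k):
    """Power-law time effect: initial * t**k (t counted in days, t >= 1)."""
    return initial * t ** k


def exponential_effect(t, initial, k):
    """Exponential time effect: initial * exp(k * t)."""
    return initial * math.exp(k * t)


def periodic_effect(t, initial, amplitude, period):
    """Periodic time effect: a sinusoidal modulation around the initial value."""
    return initial * (1.0 + amplitude * math.sin(2.0 * math.pi * t / period))


def aic_least_squares(observed, predicted, n_params):
    """AIC for a least-squares fit with Gaussian errors: n * ln(RSS / n) + 2 * k."""
    n = len(observed)
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return n * math.log(rss / n) + 2 * n_params


days = range(1, 213)  # T1 ... T212
# Synthetic "observed" parameter path: a power-law decay with a small wiggle.
observed = [power_effect(t, 2.0, -0.8) * (1.0 + 0.05 * math.sin(t)) for t in days]

# Fixed (not fitted) candidate curves with their number of parameters.
candidates = {
    "power": ([power_effect(t, 2.0, -0.8) for t in days], 2),
    "exponential": ([exponential_effect(t, 2.0, -0.02) for t in days], 2),
    "periodic": ([periodic_effect(t, 0.1, 0.5, 70.0) for t in days], 3),
}
aic = {name: aic_least_squares(observed, pred, k) for name, (pred, k) in candidates.items()}
best = min(aic, key=aic.get)
print(best, aic)
```

A lower AIC indicates a better trade-off between fit and number of parameters, which is the criterion used later when the power and exponential functions are compared.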

OLS and SIRu Regressions
The linear regression derived from the SIRu model showed an acceptable fit, with an adjusted R2 of 0.4813 (n = 662,962) (Table 1). The negative values of coefficients b and c were consistent with the corresponding operation signs in Equation A.0. Such results may verify the assumption of the SIRu model to a certain extent.
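The kind of linear fit reported above can be sketched with ordinary least squares in Python. The example below uses synthetic data: the four regressors are hypothetical stand-ins for the four variables of Equation A.0, the coefficient values are invented, and the sketch includes an intercept, which Equation A.0 may not. It only illustrates how the coefficients and the adjusted R2 are obtained.

```python
import numpy as np


def ols_with_adjusted_r2(X, y):
    """Fit y = intercept + X @ coef by OLS; return (coefficients, adjusted R^2)."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    rss = float(residuals @ residuals)
    tss = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - rss / tss
    # Penalize R^2 for the number of regressors p (intercept excluded).
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return coef, adj_r2


# Synthetic data: four hypothetical regressors with coefficients a, b, c, d,
# where b and c are negative as in the reported regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
true_coef = np.array([0.5, -0.2, -0.1, 0.3])
y = X @ true_coef + rng.normal(scale=0.1, size=500)

coef, adj_r2 = ols_with_adjusted_r2(X, y)
print("intercept and coefficients:", np.round(coef, 3))
print("adjusted R2:", round(adj_r2, 4))
```

The signs of the fitted coefficients can then be compared with the operation signs expected from the model, as is done for b and c above.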

SIRu at the State level
The study further utilized county-level data to compare state-level parameters based on fixed time effects. Figure 2 shows the fit of Equation A.0 across all states; in most states the adjusted R2 was above 0.5. Each state has different values of TR, UIR, and RDR in Equation A.1, which indicates an obvious spatial heterogeneity in the transmission of COVID-19 (Figure 3). All the parameters and statistical descriptions are reported in Appendix A and B.
In terms of UIR, most states are concentrated between 28 and 50 (Figure 3b). The fitting results for RDR in some cities are not significant, but most of the significant values are between 200 and 500, which is equivalent to an IFR ranging from 0.2% to 0.5% (Figure 3c). The Pearson correlation between the three state-level indicators was also tested, showing a positive correlation between UIR and RDR. In other words, the lower the IFR, the higher the UIR (Figure 3d).

When the time effect of RDR was further added in Equation A.3, the AIC of the power function displayed a slight decrease (Table 4). Here, the UIR was 19.02 (95% CI 18.93-19.12), which is similar to the value in Equation A. However, both equations showed decreasing trends in RDR, implying an increase in IFR (Figure 5). The power function also showed better performance in tvSIRu with all three time-varying parameters estimated by Equation B.0, which indicated a gradual increase in both UIR and RDR (Table 5). This trend indicated that the initial UIR and RDR were relatively low (Figure 6). The values of UIR and RDR ... respectively. IFR could be calculated as 0.70% (95% CI 0.52%-0.95%). Based on the officially released data on August 20, 2020, it can be concluded that about 30% of the whole population had been infected.

In terms of the fixed time effect, the results show that from January 22 to August 20, the average TR and UIR at the country level in the United States were 0.03 and 19.5 respectively, and the RDR was 192.5, which means the infection fatality rate was 0.516%. This IFR is slightly lower than the overall infection fatality rate of 0.66% estimated in China 17, while the CDC recommends 0.65% 26.
In further analysis at the state level, the UIR of all states ranged from ...
Based on the time-varying effect at the country level, UIR and RDR increase following a power function rather than the exponential function that is the default in many studies 21. In contrast to the average value of 0.03 in SIRu, the TR estimated by tvSIRu decreases from a large value of 227 to 0.022 on August 20, which is much lower than the fixed value of 0.05-0.06 reported in related research 21. This may further explain the high contagiousness in the initial stage of COVID-19 transmission. The increasing UIR estimated by tvSIRu reaches 9.1 (95% CI 5.7-14.0) at T212 (August 20), which is very close to the value of 9 estimated in a former study on April 20 and in the latest study in September. The UIR value is also close to the value reported in Brazil (reported rate = 9.2%, UIR = 10.8) 18. Such similarity in the UIR estimated in different periods may be caused by the fixed time effect in the former models, which only represent average values of UIR, as the SIRu model does. The increasing UIR means that the IFR is on a downward trend. The value of IFR on August 20 was 0.70% (95% CI 0.52%-0.95%), which is still close to the value recommended by the CDC 26.
Many studies have supposed that the UIR will decrease with improved COVID-19 testing and increased hygiene awareness, but our research shows that the UIR in the United States is increasing, which may have a great impact on policy-making for COVID-19 prevention. In addition, an empirical infection rate is often used in contemporary COVID-19 modeling, but the tvSIRu model indicates that the COVID-19 infection rate changes dramatically. The initial value of TR is 246, reflecting that the pandemic was extremely contagious in the early transmission stage in the United States. Previous SIR modeling has seldom characterized this feature, which may lead to large estimation errors. The decreasing TR and IFR and the increasing UIR indicated by the model show that the epidemic is spreading rapidly in the United States with a large self-healing population. However, a potential increase in cases of severe illness would greatly affect the medical system, and the relevant departments still need to provide more protection to high-risk groups.

Conclusion
This article indicates that there may be an increasing number of unrecorded COVID-19 cases in official U.S. data. The tvSIRu model provides a simple, convenient, and relatively accurate way to calculate the unreported parameters of COVID-19 with a time effect from officially released data.
Moreover, this method can easily be transferred to the analysis or epidemic modeling of other countries.
It must be admitted that if data from a single geographic unit are used, the independent variables may display strong collinearity, leading to overfitting; it is therefore necessary to use data at a proper sub-geographical level to fit the national-level or state-level model. Furthermore, the non-linear regression was based on Gauss-Newton iteration, which could be further optimized with machine learning models.
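For readers unfamiliar with Gauss-Newton iteration, the following Python sketch fits a two-parameter power curve y = a · t^k to noise-free synthetic data. The model, the starting values, and the data are assumptions chosen for illustration; this is not the regression code used in this study, and no damping or line search is included, so convergence depends on a reasonable starting point.

```python
import numpy as np


def gauss_newton_power(t, y, a0, k0, n_iter=50, tol=1e-10):
    """Fit y ~ a * t**k by undamped Gauss-Newton iteration on the residuals."""
    params = np.array([a0, k0], dtype=float)
    for _ in range(n_iter):
        a, k = params
        pred = a * t ** k
        residuals = y - pred
        # Jacobian of the prediction with respect to (a, k).
        J = np.column_stack([t ** k, a * t ** k * np.log(t)])
        # Gauss-Newton step: least-squares solution of J @ step ~ residuals.
        step, *_ = np.linalg.lstsq(J, residuals, rcond=None)
        params = params + step
        if np.linalg.norm(step) < tol:
            break
    return params


# Synthetic, noise-free series over T1 ... T212 with a = 2.0 and k = 0.3.
t = np.arange(1, 213, dtype=float)
y = 2.0 * t ** 0.3

a_hat, k_hat = gauss_newton_power(t, y, a0=1.5, k0=0.35)
print(a_hat, k_hat)
```

Each iteration linearizes the model around the current parameters and solves a linear least-squares problem, which is why strong collinearity among the columns of the Jacobian makes the step unstable.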

Figure 2 State-level fitness of Equation A.0 with county-level data
The scaled density curve of adjusted R2 shows that Equation A was generally applicable, and its mapping indicates that the potential spatial heterogeneity of the states will affect the results of SIRu modeling. Among them, the states in the Southeast, on the West Coast, and in the Great Lakes region show higher adaptability.
... shows an obvious connection between RDR and both TR and UIR.

Figure 4 Time-varying TR estimated by Equation A.2
Although the initial values of the power function are much higher than those of the exponential function, the two values tend to converge in the medium term, while the periodic function shows that the epidemic was in its third wave.

Figure 5 Time-varying RDR with 95%CI estimated by Equation A.3
If the time effect of UIR is not considered, the fitting results show that RDR exhibits a decreasing effect over time, which means that IFR may be slowly increasing.

Figure 3 State-level parameters of Equation A.1 with county-level data: (a) Transmission rate; (b) Unreported infection rate; (c) Recovery/death rate; (d) Correlation test



Figure 1 SIR model with unreported infection cases
The SIRu model introduces a parameter for unreported infection into the traditional SIR model; unreported infections are supposed to have the same transmission rate and infection fatality rate as reported infections.

Figure 6 Time-varying UIR, RDR with 95%CI estimated by Equation B.0
Equation B only provides the estimated values of UIR and RDR. Both the power function and the exponential function imply an increasing effect, although the power function yields much smaller values than the exponential function in terms of UIR estimation.


Table 1 Linear SIR Regression estimated by Equation A.0

... unreported infection rate φ, and the recovery/death rate λ (Table 2). The result showed that the average β value from January 22 to August 20 was 0.0339 (95% CI 0.0338-0.0340), and the φ value was 19.5 (95% CI 19.38-19.54), which implied that, on average, there might be 19.5 undiagnosed cases for each infection reported in US counties. Meanwhile, the λ value of 192.5 (95% CI 191.790-193.243) could be interpreted as an IFR value of 0.516%.

Table 2 SIR Regression estimated by Equation A.1

Table 4 Time-varying TR and RDR estimated by Equation A.3