RLIM: a recursive and latent infection model for COVID-19 prediction and turning point in United States

First detected in Wuhan, Hubei, and identified by the WHO as a novel member of the coronavirus family, COVID-19 has spread worldwide at an exponential rate, causing millions of deaths and widespread public fear. COVID-19 has recently produced a second wave in the U.S., India, Brazil and other parts of the world. However, its transmission, incubation, and recovery processes remain unclear from medical, mathematical and pharmaceutical perspectives. The classical Susceptible-Infected-Recovered (SIR) model is limited in its ability to describe the dynamic behavior of COVID-19. Hence, a recursive, latent model is needed to predict the number of future COVID-19 infections in the U.S. In this article, a dynamic model called RLIM, based on the classical SEIR model, is proposed to predict the number of COVID-19 infections under an assumed dynamic secondary infection rate ω. An intermediate state called SI is introduced between the recovery and infection states to record the number of secondary infections arising after a latent period following recovery. Unlike other models, RLIM fits historical recovery cases and uses them to predict future infections. Because RLIM draws on multiple information sources and provides an error back-propagation scheme, its predictions can reasonably be expected to be more accurate. Projections for four U.S. states show that, with the secondary infection rate ω varying from 0.01 to 0.3 and a latent period of 14 days, RLIM can predict the number of new infections from January 15 to February 15, 2021 with AFER lower to


Introduction
Since its first appearance in Wuhan, Hubei, China in December 2019, the novel coronavirus disease named COVID-19 has affected millions of people all over the world, causing unpredictable economic losses and public fear. To date, the origin of COVID-19, its incubation time and its transmission speed have not been fully clarified. Numerous medical, clinical, and mathematical studies have attempted to explain the dramatic increase in infections and to predict transmission trends.
A number of COVID-19 studies base their mathematical modeling on the Susceptible-Infective-Removed (SIR) model, originally proposed by Kermack and McKendrick [1] to analyze the 1665-1666 plague in London, United Kingdom and the 1906 pestilence in Mumbai, India. Theoretically, this model divides virus transmission into three phases (susceptible, infective and removed) and relates mathematical parameters to the characteristics of each stage. For example, the parameter β is defined between the susceptible and infective states as the fraction of healthy but vulnerable persons who become positively infected patients. β is associated with the basic reproduction number R0, which expresses the average speed at which a given virus spreads. Another parameter, γ, is widely used to record the fraction moving from infection to recovery. The reciprocal of γ indicates the median incubation period of COVID-19, which has attracted considerable interest from the scientific community.
In response to the COVID pandemic, the White House and a coalition of leading research groups prepared the COVID-19 Open Research Dataset (CORD-19) on Kaggle [4]. One of its challenges is to explore COVID-19 transmission, incubation and environmental factors across more than 400,000 scholarly articles in CORD-19. According to CORD-19's records, many scientists began their investigation of COVID-19 with patient records from Wuhan, Hubei since April 2019, but soon turned to Italy, the United States, Brazil and India as well. Anatoly Zhiljavsky et al. [6] developed a mathematical model based on the SIR model to analyze the COVID-19 epidemic in the United Kingdom, asserting that a reproduction number R0 of 2.5 is suitable for large cities such as London, Manchester and Sheffield. Virginia E. Pitzer et al. [7] examined diagnostic testing practices in the United States and estimated that the United States had a higher reproduction number than China, with R0 ranging from 4.0 to 7.1. Easton R. White et al. [8] claimed that government interventions in the United States resulted in an SIR model with a dynamic, time-variant R0.
In our model, R0 is likewise treated as a time-variant parameter, ranging from 1 to 10.
Regarding the incubation period of COVID-19, several research findings have also been published. Lu [9] investigated 2015 COVID-19 infected cases and recorded incubation periods ranging from 0 to 33 days. Lai [10] collected the exposure periods of 125 Chinese patients and estimated a median incubation period of 4.75 days. Zhu [11] assumed that the latent period and the infectious period are approximately equal to the incubation period and the length of stay in hospital, and concluded that the value of the latent period and the infectious period are preliminarily 5 to 10 days, respectively. Adhikari [12] asserted that the average incubation duration of COVID-19 was 4.8 ± 2.6 days, ranging from 2 to 11 days (95% confidence interval, 4.1 to 7). In this article, the value of γ ranges from 0.03 (33 days) to 0.25 (4 days), based on the findings mentioned above.
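The range quoted for γ follows from taking the reciprocal of a period in days. A minimal sketch of that arithmetic is given below; the function name rate_from_period is hypothetical and chosen only for illustration.

```python
def rate_from_period(days: float) -> float:
    """Return the per-day rate whose reciprocal is the given period in days."""
    if days <= 0:
        raise ValueError("period must be positive")
    return 1.0 / days


# Longest and shortest periods quoted in the text (33 days and 4 days).
gamma_low = rate_from_period(33)
gamma_high = rate_from_period(4)

print(round(gamma_low, 2), round(gamma_high, 2))
```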
Although numerous mathematical models have been developed to address the dynamics of COVID-19, very few focus on secondary infections arising among recovered patients. Many of these models treat COVID-19 as a respiratory disease that requires immediate medical attention but does not last long or cause secondary infections. However, the long duration and the secondary outbreaks in the United States, United Kingdom, Brazil and India indicate that COVID-19 does not end in the same way as flu. For example, Ester C Sabino et al. [21] observed the resurgence of COVID-19 in Brazil in January 2021 and asserted that one of the main reasons was that immunity against COVID-19 infection had already begun to wane by December 2020. Thus, members of the recovered group may be infected again or become virus carriers. On the COVID Tracking Project website, COVID-19 recovery is defined as "Symptom improvement", "Hospital discharges", or even "days since diagnosis" [15]. Yet there is no clear evidence that these recovered patients are immune to COVID-19 afterwards. It is therefore reasonable and necessary to assume that a portion of them will, after a certain period, return to the susceptible group because of the vulnerable state of their immune systems. A recent report by Christian Gaebler et al. [16] also indicates that the humoral memory response to COVID-19 lasts between 1.3 and 6.3 months after infection without vaccine support. Victor Alexander Okhuese [14] attempted to estimate the probability of COVID-19 reinfection by searching for the equilibrium state of the SEIRUS model. In his simulation, the rate of recovery and the rate of infection meet and reach an equilibrium state after 12 days. However, his model only considered wrongly executed PCR tests, which is not accurate enough to describe current COVID-19 transmission in the United States.
In this article, we develop and present a novel COVID-19 transmission model named the Recursive Latency Infection Model (RLIM). The main contributions of this model are:
- Secondary infections are taken into consideration for the United States recovered group;
- A mathematical model is developed to describe the recovery-infection process;
- Predictions of COVID-19 infected and recovered cases are provided, with analysis based on state-level data sources.
The rest of this manuscript is organized as follows. Section 2 discusses the necessary mathematical modeling, equations and algorithms. Section 3 describes the simulation settings, software and scientific packages used by the RLIM program. Section 4 discusses New Jersey's COVID-19 data and simulation results, and provides predictions of NJ's infections between mid-January and mid-February. Section 5 summarizes the work and offers further discussion.
Fig. 1 The SIR model.

Recursive Latency Infection Model (RLIM)
In this paper, a modified COVID-19 transmission model based on the original SIR model is proposed. The Susceptible-Infective-Removed (SIR) model was originally proposed by Kermack and McKendrick in 1927 [1], who used it to successfully explain the 1665-1666 plague in London and the 1906 pestilence in Mumbai, India. The SIR model diagram is shown in Figure 1, and its transmission process is described by equations (1), (2) and (3).
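For reference, the classical SIR system is usually written in the form below. This is the standard textbook formulation, given on the assumption that it corresponds to equations (1)-(3); the total population N = S + I + R and the relation R0 = β/γ are likewise standard results rather than statements taken from this article.

```latex
\begin{align}
\frac{dS}{dt} &= -\frac{\beta S I}{N}, \\
\frac{dI}{dt} &= \frac{\beta S I}{N} - \gamma I, \\
\frac{dR}{dt} &= \gamma I,
\end{align}
\qquad N = S + I + R, \qquad R_0 = \frac{\beta}{\gamma}.
```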
The classical SIR model is only feasible in an ideal epidemic transmission environment, because it ignores the time variance of the infection rate β and the recovery rate γ. It also assumes no disease control, meaning that no political or clinical intervention takes place. Such transmission behavior rarely occurs in practice. Nevertheless, building on its theoretical assumptions, researchers have proposed many revised models, such as SEIR [20], SEIRUS [14], mechanistic-statistical SEIR [18], and deep learning SEIR [19], which incorporate different epidemic transmission characteristics and human interventions. Descriptions of these models can be found in Hethcote's review [2].
RLIM is inspired by research from Jianping Huang's team at Lanzhou University [3]. Their model, named GPCP (Global Prediction system for COVID-19 Pandemic), adds four disease states to the SIR model: insusceptible state (P), potentially infected state (E), quarantined state (Q), and mortality state (D). GPCP's disease transmission process is described by equations (4)-(10).
In RLIM, based on our observations and on facts drawn from news report analysis, we add a parameter ω to the transmission loop. ω represents the probability that a patient who has been recovered from COVID-19 for a certain period is again identified as a virus carrier by respiratory or antibody tests. Following this definition, ω acts between states R and I, transferring part of the Recovered group back into the Infected group. Under these assumptions, the RLIM equation system is modified as in equations (11)-(15).
Compared with the GPCP model, RLIM has the following advantages:
(1) It simplifies the virus transmission process by removing unnecessary states such as P (immune group) and Q (quarantine group), because no official records in our dataset track these two states.
(2) It extends the GPCP model with the recursive state SI and the parameter ω to avoid the problem of forward-only transmission. Without a recursive state and parameter, the number of new infections decreases regardless of the actions taken, which contradicts current United States COVID-19 transmission records.
(3) It introduces the latency parameter τ to indicate the median re-infection period. In RLIM, τ is initialized to 14 days according to WHO's instructions and scientific reports. This parameter corresponds to the recovery policy of many U.S. states: hospitalized patients are automatically treated as recovered after a certain period.
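The feedback from the recovered group to the infected group can be illustrated with a small daily-step simulation. The sketch below is not the RLIM system of equations (11)-(15): it assumes a plain SIR update in which a fraction ω of the people who were in the recovered group τ days earlier moves back to the infected group. The function name, the update rule and the value of β are assumptions made for illustration only.

```python
def simulate_reinfection(s0, i0, r0, beta, lam, omega, tau, days):
    """Daily-step SIR with a delayed recovered-to-infected flow (illustrative only).

    Assumed update, not the paper's equations:
      new infections   = beta * S * I / N
      new recoveries   = lam * I
      secondary cases  = omega * R(t - tau), capped by the available recovered group
    """
    n = s0 + i0 + r0
    s, i, r = [float(s0)], [float(i0)], [float(r0)]
    for t in range(days):
        new_inf = beta * s[t] * i[t] / n
        new_rec = lam * i[t]
        # Recovered count tau days ago; treated as zero before enough history exists.
        r_lag = r[t - tau] if t >= tau else 0.0
        re_inf = min(omega * r_lag, r[t] + new_rec)
        s.append(s[t] - new_inf)
        i.append(i[t] + new_inf - new_rec + re_inf)
        r.append(r[t] + new_rec - re_inf)
    return s, i, r


# Example run with the default lambda, omega and tau quoted in this article;
# the population split and beta = 0.2 are hypothetical.
s, i, r = simulate_reinfection(
    s0=9000, i0=500, r0=500, beta=0.2, lam=0.01, omega=0.01, tau=14, days=30
)
print(round(i[-1], 1))
```

Because every flow is moved from one group to another, the total S + I + R stays constant, and setting ω to zero recovers the forward-only behavior described in item (2).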
To apply equations (11) and (12) in our algorithm, they are transformed into discrete data series as in equations (16) and (17). Substituting the 4th-order expressions for I(t) and R(t) from equations (16) and (17) into equation (14) yields equation (18).
Equation (16) establishes a relationship between the coefficients [a, b, c, d] and [e, f, g, h]. Thus, RLIM can predict the number of infected cases given the historical numbers of recoveries, previous infections, and assumed values of the infection rate, recovery rate and secondary infection period. The corresponding algorithm flow is shown in Figure 3 and discussed in section 2.2.
Fig. 3 The RLIM model diagram.

Notations
The notations used throughout this article are described in Table 1.

RLIM algorithm
The RLIM algorithm calculates the predicted numbers of infections according to equations (16) to (18) and minimizes the difference between the actual data recorded by the COVID Tracking Project and the predictions returned by the model. First, the predicted recovery numbers R_1, R_2, ..., R_n are calculated using the 4th-order Runge-Kutta (R-K) method, based on the real data series of a U.S. state between November 2020 and January 2021. The coefficients (e, f, g, h) associated with this recovery function are then transformed into another set of coefficients (a, b, c, d), using the pre-assigned recovery rate λ (default value 0.01) and secondary infection rate ω (default value 0.01). With the coefficients (a, b, c, d) assigned, the numbers of newly infected cases I_1, I_2, ..., I_n within this state can be calculated. By comparing these predictions with the actual data, one can evaluate whether this round of prediction is accurate. RLIM continues to search for the optimal infection series and then determines the optimal ω associated with the state.
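The search for the best ω can be sketched as a simple grid search that minimizes the mean square error between predicted and recorded infections. The predictor below is a hypothetical one-step recursion standing in for equations (16)-(18), which are not reproduced here; the function names and the synthetic data are assumptions for illustration, and the candidate grid 0.01 to 0.30 follows the range of ω quoted in the abstract.

```python
def predict_infections(recovered, i0, lam, omega, tau):
    """Hypothetical stand-in predictor, NOT the paper's equations (16)-(18).

    Each day the infected count loses a fraction lam and gains omega times
    the recovered count observed tau days earlier.
    """
    infected = [float(i0)]
    for t in range(len(recovered) - 1):
        r_lag = recovered[t - tau] if t >= tau else 0.0
        infected.append(infected[t] * (1.0 - lam) + omega * r_lag)
    return infected


def mean_square_error(pred, actual):
    """Average squared difference between two equally long series."""
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)


def search_omega(actual, recovered, i0, lam=0.01, tau=14, candidates=None):
    """Return the candidate omega with the lowest MSE, together with that MSE."""
    if candidates is None:
        candidates = [k / 100 for k in range(1, 31)]  # 0.01, 0.02, ..., 0.30

    def score(omega):
        pred = predict_infections(recovered, i0, lam, omega, tau)
        return mean_square_error(pred, actual)

    best = min(candidates, key=score)
    return best, score(best)


# Synthetic check: generate "actual" data with a known omega and recover it.
recovered = [1000.0 + 10.0 * t for t in range(40)]
actual = predict_infections(recovered, i0=500.0, lam=0.01, omega=0.13, tau=14)
best_omega, best_mse = search_omega(actual, recovered, i0=500.0)
print(best_omega, best_mse)
```

A grid search is used here only because it is easy to read; a continuous optimizer could replace it without changing the structure of the loop.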

Performance measure
RLIM's performance is measured as the difference between its predictive output I_p and the actual value I_k. Three performance indicators are defined: mean square error (MSE), root mean square error (RMSE), and average forecasting error rate (AFER), given by equations (19) to (21). Because the number of infections differs greatly between U.S. states, ranging from hundreds to thousands, these indicators are also normalized to between 0 and 1 when judging performance.

Mean square error (MSE)
The average squared difference between RLIM's predictive output I_p and the actual value I_k is given by equation (19).

Root mean square error (RMSE)
The root mean square error is also used to evaluate RLIM's prediction quality. Its formulation is given by equation (20).

Average forecasting error rate (AFER)
The average forecasting error rate is the percentage error representing the relative difference between the predictive output I_p and the actual value I_k. It is a cumulative statistical deviation between the two time series.
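The three indicators can be computed as in the sketch below. MSE and RMSE follow their usual definitions; for AFER, the commonly used form (the mean of |I_k - I_p| / I_k, expressed as a percentage) is assumed to match equation (21). The sample numbers are hypothetical.

```python
import math


def _check(pred, actual):
    """Validate that both series are non-empty and of equal length."""
    if len(pred) != len(actual) or not actual:
        raise ValueError("series must be non-empty and of equal length")


def mse(pred, actual):
    """Mean square error between predicted and actual values."""
    _check(pred, actual)
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)


def rmse(pred, actual):
    """Root mean square error: the square root of the MSE."""
    return math.sqrt(mse(pred, actual))


def afer(pred, actual):
    """Average forecasting error rate in percent (assumed usual definition)."""
    _check(pred, actual)
    if any(a == 0 for a in actual):
        raise ValueError("actual values must be non-zero for a relative error")
    return 100.0 * sum(abs(a - p) / abs(a) for p, a in zip(pred, actual)) / len(actual)


# Hypothetical daily counts, for illustration only.
actual = [100.0, 200.0, 400.0]
pred = [110.0, 180.0, 400.0]
print(mse(pred, actual), rmse(pred, actual), afer(pred, actual))
```

Because AFER is relative to the actual value, it allows states with very different case counts to be compared, which MSE alone does not.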

Experimental setup

Data source
The data used in our simulation come from [13]. This dataset contains United States state-level COVID-19 data starting from April 2020. In this article, New Jersey (NJ), New York (NY), South Dakota (SD), and New Mexico (NM) are selected because they all publish daily recovery reports. RLIM relies heavily on accurate and reliable recovery case reports, and these states provide recovery data of high credibility. The parameter initializations for RLIM are the same for all states:
(1) Data fitting period: November 15, 2020 to January 15, 2021;
(2) Prediction period: January 16, 2021 to February 15, 2021;
(3) Recovery rate: λ = 0.01;
(4) Secondary infection rate: ω = 0.01;
(5) Latency period: τ = 14.
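These shared initial settings could be collected in a single configuration object, as in the sketch below. The class and field names are hypothetical and are not taken from the RLIM code base; only the values come from the list above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RlimSettings:
    """Initial settings listed in this section (names are illustrative)."""

    fit_start: date = date(2020, 11, 15)
    fit_end: date = date(2021, 1, 15)
    predict_start: date = date(2021, 1, 16)
    predict_end: date = date(2021, 2, 15)
    recovery_rate: float = 0.01             # lambda
    secondary_infection_rate: float = 0.01  # omega
    latency_days: int = 14                  # tau

    def fit_days(self) -> int:
        """Number of calendar days in the fitting window, both ends included."""
        return (self.fit_end - self.fit_start).days + 1

    def predict_days(self) -> int:
        """Number of calendar days in the prediction window, both ends included."""
        return (self.predict_end - self.predict_start).days + 1


settings = RlimSettings()
print(settings.fit_days(), settings.predict_days())
```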

Software implementation
RLIM is implemented in Python 3.7, and the essential software package is SciPy 1.5.4. Two SciPy modules are used: integrate and optimize. RLIM uses the integrate module to calculate the MSE and the optimize module to fit the real recovery data to the 4th-order R-K parameters.

Code availability
The RLIM software is publicly available on GitHub [17], with all code and implementations available for research. Simulation results are also available via GitHub.

Prediction with MSE/RMSE/AFER
Figure 4 shows the data fitting and prediction for New York state and indicates that RLIM successfully fits the data records from mid-November to mid-January. Different values of the secondary infection rate ω result in different curve shapes and different peak numbers of infections. For example, ω values of 0.2, 0.22, 0.25 and 0.3 result in predicted numbers of new infections of 17,500, 18,000, 20,000 and 25,000, respectively. These values correspond to cumulative infected cases of 1.6 million, 1.6 million, 1.8 million and 2 million. In summary, a higher secondary infection rate results in a faster increase in infections. Another interesting observation concerns the turning point, which shows no correspondence with either λ or ω. Simulations with RLIM on different states show that recovery rates ranging from 0.01 to 0.5 do not affect the turning point; this topic is discussed in section 4.2. In Table 2 [...] in Nov. 2020, with AFER 29.02%, while in December 2020 it obtains a better MSE of 4630069.903 and a lower AFER of 14.97%.
In Table 3, the numbers of New York state's new recoveries are also predicted for Nov. 2020, Dec. 2020 and Jan. 2021, with MSE/RMSE/AFER calculated against the actual data records. In Nov. 2020, the mean square error of the recovery case prediction reaches 27893.26, with an AFER of 38.8%. For the December 2020 predictions, the mean square error increases to 43394.77, but with a better AFER of 25.6%.

The turning point
From RLIM's output for New York state's infected and recovered numbers, one can observe that the turning point of this state's COVID-19 transmission is around January 30, 2021. Predictions indicate that, from mid-January, New York's infections will slowly increase from 17,324 to 18,549 and then fall back to 17,816 by mid-February. The main reason this turning point appears is that the secondary infection rate ω, whose range is [0.1, 0.55], fluctuates strongly in early November 2020, drops below 0.4 during December, and after Christmas 2020 stabilizes around 0.25. Figure 5 illustrates the variation of ω since mid-November 2020. These dynamic, irregular points reflect the true COVID-19 data records and are calculated from equation (18). Mathematically, RLIM reaches a turning point when the recovery rate λ equals the secondary infection rate ω. The recovery rate λ in RLIM is a constant 0.01; thus, given RLIM's predicted turning point for New York state of January 31, 2021, ω must drop to 0.01 and reach an optimal state by that date. After that, RLIM's predicted sequences of new infections and new recoveries will be linear unless new data records are received from reliable data sources.
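Locating a turning point in a predicted series amounts to finding where the values stop rising and start falling. The sketch below shows one simple way to do this; the function name is hypothetical, and the three sample values are the start, peak and end figures quoted above rather than a full daily series.

```python
def turning_point(series):
    """Return the index of the first local maximum, or None if there is none.

    A point counts as a turning point when it is higher than its predecessor
    and not lower than its successor.
    """
    for k in range(1, len(series) - 1):
        if series[k] > series[k - 1] and series[k] >= series[k + 1]:
            return k
    return None


# Start, peak and end values of the New York prediction quoted in the text.
ny_prediction = [17324, 18549, 17816]
peak_index = turning_point(ny_prediction)
print(peak_index, ny_prediction[peak_index])
```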

New Jersey, South Dakota and Virginia
Simulation results shown in Figure 6 describe three scenarios: moderate increase (New Jersey, top), moderate decrease (South Dakota, middle), and exponential increase (Virginia, bottom). The optimal secondary infection rates ω for these states are marked in the figure (0.13 for New Jersey, 0.056 for South Dakota and 0.19 for Virginia). The infection and recovery data of these states indicate no strong correlation between ω and COVID-19 transmission trends. Equations (11) to (15) in section 2.1 also show that, in RLIM, ω only affects the incremental steps of infection cases positively and of recovery cases negatively. However, ω remains valuable for predicting the turning point, which occurs when ω approaches the value of the recovery rate. Thus, one can conclude that if the recovery rate remains stable over a period of τ (in RLIM, τ equals 14), then RLIM will approach it within that period and lock in the turning point. This property significantly reduces the time scientists need to analyze the behavior of COVID-19.

Conclusion and future work
This research proposes a recursive, latent, dynamic virus transmission model derived from the classical SIR model. The model, named RLIM, attempts to explain how COVID-19 spreads within different regions of the United States. By introducing a new parameter ω into the classical SIR model, RLIM is able to predict newly infected cases based on recovery data and historical COVID-19 records. Simulations with RLIM on New York, New Jersey, South Dakota and Virginia show that, given a reasonable initial value of ω, the model can predict 30 days of infections and recoveries with a reasonable error rate. As stated in section 1, RLIM's performance relies heavily on accurate recovery reports and statistical data from reliable information sources. Some U.S. states measured recovery as "length of periods from hospitalization" or "days from the first symptom appears", which resulted in erroneous recovery data. With accurate, precisely defined recovery data, RLIM is able to provide better predictions.
A further field of application is the integration of RLIM with machine learning techniques. The recursive, latent state can be recast as a back-propagation process inside a neural network, so that RLIM can be equipped with self-learning abilities.