When did COVID-19 start? - Optimal inference of time ZERO


 According to the official report, the first case of COVID-19 and the first death in the United States occurred on January 20 and February 29, 2020, respectively. On April 21, California reported that the first death in the state occurred on February 6, implying that community spreading of COVID-19 might have started earlier than previously thought. Exactly what is time ZERO, i.e., when did COVID-19 emerge and begin to spread in the US and other countries? We develop a comprehensive predictive modeling framework to address this question. Using available data of confirmed infections to obtain the optimal values of the key parameters, we validate the model and demonstrate its predictive power. We then carry out an inverse inference analysis to determine time ZERO for ten representative States in the US, plus New York city, UK, Italy, and Spain. The main finding is that, in both the US and Europe, COVID-19 started around the new year day.


INTRODUCTION
The first case of COVID-19 in the United States was reported on January 20, 2020 and, according to the official account, the first death on American soil occurred on February 29. Astonishingly, on April 21, California reported that the first death in the state occurred on February 6, more than three weeks earlier than previously reported. One implication is that community spreading may have already occurred in the US three weeks earlier than believed. The importance of knowing precisely the starting date of community spreading cannot be overstated: this is the date based on which government mitigating actions and control measures would be imposed. In the US, according to the government report, the onset of an exponential increase in the infections occurred in the middle of March, leading to the belief that COVID-19 began the phase of community spreading around the same time. Based on this perception, the White House issued a nationwide social-distancing order on March 16. Statewide stay-at-home or shelter-in-place orders were given by the governors of various States at different time. The effectiveness of these government actions notwithstanding, as of July 24, there have been over four million cases in the US with close to 145,000 deaths. This devastating development means that the perceptive date of community spreading of COVID-19 in the US was wrong: it could have been weeks earlier than the governments have chosen to believe.
To develop a reliable method to infer the starting date of COVID-19, or any infectious disease, is of uttermost importance. Precise knowledge of exactly when community spreading started would prompt the government to take actions at the earliest possible moment, drastically reducing the number of infections and saving many thousands of lives. More specifically, knowing this time in combination with knowledge about the symptomatic individuals in the early stage of the disease spreading enables: (1) an effective reduction in the range of contact tracing, which increases the chance of accurately locating the source of infection with limited resources, (2) an assessment of the infectability of the virus and the way by which it spreads, providing guidance for early control measures, (3) determination of the interstate and international propagation paths of the virus, and (4) providing unequivocal early warnings for the governments. In this regard, a recent work based on gene sequencing analysis found clusters of related viruses in patients living in different neighborhoods of New York city, suggesting that multiple, independent but isolated introductions of the virus had mainly come from Europe and other parts of the US [1]. In another study [2], the frequencies of the key words such as coughing and fevers on Twitter were used for model analysis, with the finding that the actual outbreak time can be 5-19 days earlier than officially reported.
Our method is based on inverse inference and enables the starting date of the epidemic to be deduced from limited available data. Another feature of our model is that it contains sufficient details to capture the key dynamical behaviors of COVID-19 spreading. In particular, our inverse method of inference is based on a comprehensive, non-Markovian spreading model tailored to COVID-19 in either an open or a closed setting [3]. The dynamics are described by the time evolution of the populations in five distinct states (SHIJR): susceptible (S), hidden (H), infected (I), confirmed (J), and removed (R). The S, I, and R states are conventional, but the H and J states are COVID-19 proper. Of particular importance is the H-state population: it is the population of individuals who have already contracted the coronavirus but have shown only mild symptoms or would never show any symptoms. To evolve the dynamics, the initial hidden population, H(t 0 ), is an essential parameter, where t 0 is the time (day) at which the number of real confirmed cases begins to increase. Government control measures are usually imposed some time after this day.
It is important to note that t 0 is not the day when the coronavirus first appears in the community (i.e., the starting date of community spreading): the latter could be significantly earlier, at a date termed as time "ZERO," denoted as 0! For a time period between t 0 and t 1 , where t 1 > t 0 , there is an appreciable and continuous increase in the number of infections, prompting the government to impose vigorous control measures on day t 1 . The timeline is thus: 0 < t 0 < t 1 . The problem is to determine time 0. Our inverse inference method is articulated to solve this problem. By generating dynamical evolution of the epidemic model, comparing the model prediction with the limited data available after t 0 , and invoking an optimization procedure, we determine the key model parameter values including H(t 0 ). Running the model between a hypothesized time 0 and t 0 to determine how long it takes for H(0) (a small positive integer, e.g., one or two) to reach H(t 0 ) allows us to pin down time 0 precisely. The relevant dates and timeline are illustrated in Methods.
We apply the inverse method to ten States in the US plus New York city and the country as a whole, and find that time ZERO is as early as the beginning of January. In the US, the day t 1 is March 16, while t 0 varies among the individual States (e.g., February 26 for Washington and March 2 for New York). For a specific system (a State or a country), letting ∆T ≡ t 0 − 0 be the time span between the date on which the virus started community spreading and the date of the number of real confirmed cases beginning to increase, we find an exponential scaling relation between H(t 0 ), the number of people who already carried the coronavirus on the officially confirmed date in that system and ∆T . This means that, a longer delay in reporting the first case would lead to an exponentially large virus-carrying population, rendering significantly more challenging to fully control the disease spreading. The need for early, pre-emptive testing thus cannot be overemphasized.
Our model has the power to predict the occurrence of community spreading of COVID-19 long before the time of official report of the outbreak, making it possible for the governments to impose control measures and to summon the essential medical resources. Take the example of the US. In February, there was already unmistakable evidence that the coronavirus already existed in the US. In complete hindsight, consider the fictitious scenario that widespread tests had been carried out in the US in February so that adequate data had been available. Inverse inference could have been carried out then to determine time ZERO. This could have sent the vital message to the governments that community spreading of COVID-19 had already started weeks ago. If strict government control actions had been taken at the end of February, the epidemic picture of the US today would have been drastically different. Our framework of modeling and inverse inference can arguably be a valuable asset for guiding the governments to take appropriate actions for possible future outbreaks of coronavirus or other infections diseases.

RESULTS
Finding time ZERO for the US States, Italy, Spain, and UK. We aim to find time ZERO for ten representative States in the US, plus New York City (NYC), the entire USA, and three European countries [Italy, Spain, and the United Kingdom (UK)]. Each State in the US is treated as an open system, while NYC, USA, and the three European countries are treated as a closed system (Supplementary Note of SI). We first demonstrate that each corresponding model has the required predictive power. Figure 1 shows four representative examples in terms of the daily accumulative number of confirmed cases, J(t), for NYC, the State of New York, the State of California, and UK. For each example, the whole time interval is divided into two sub-intervals: the first (blue) is used to estimate the four key model parameters and the second sub-interval (green) of 14 days is used for prediction. The model generated J(t) in the first phase is thus the result of a sophisticated, optimal fit. As can be seen from The orange, red, and green horizontal segments represent the officially reported time duration in which there is (are) one (two and three) confirmed cases, respectively. Not every system would have all three colored segments as, e.g., more than one case could be reported by the government. Those rare cases typically represent external input, not the beginning of community spreading. The purple triangles represent the time of lock-down for the respective systems. In the US, the virus first emerged around January 4-9. Figure 2 shows the inferred time ZERO together with its confidence interval for the 15 States/city/countries, in an ascending order. It can be seen that the novel coronavirus appeared in Italy about the New Year day. In the US, it first emerged between January 7 and 11 in Washington or New York State. When the whole country of USA is treated as a closed system, the virus first appeared on about January 9, with confidence interval overlapping with that of the Washington and New York States, suggesting Washington or New York as the first State in which the virus emerged. It can be seen from Fig. 2 that the officially reported time of the emergence of the first few cases does not represent the beginning the community spreading. In most cases, the time ZEROs inferred by our method were earlier than the official time, e.g., about one month earlier in Italy, two months earlier in New York State and New York City, over half month earlier for Washington State. These results indicate that, before the official report of cases, COVID-19 had already  spread in the community for some time. The danger of the misconception of the delayed starting date of community spread as reported by the governments is real with devastating consequences: when the governments decided to impose control measures, it may have already been too late. For example, New York State issued the lockdown order on March 22, which is more than two months later than time 0 (January 9), indicating unusually slow response of the State government to COVID-19. For the US as a country, the White House issued a nationwide social-distancing order on March 16, but it was already two months later than the starting time of COVID-19 in the country, attesting to the extremely and unreasonably slow response of the federal government to the pandemic! There are cases where the inferred dates of time ZERO agree approximately with the official dates, e.g., the State of Arizona. This is because the outbreak in other regions of the country (e.g., the original epicenter New York City) increased the awareness level, promoting the local governments and population to take certain protective measures and thereby delaying the community spread.
The results of time ZERO, together with model parameters and the estimated timeline for the appearance of five and 20 hidden individuals for the 15 systems are summarized in Tab. I. Note that the values of η lie in the interval [0.5, 0.8], indicating insufficient testing for a relatively long period of time after the exponential outbreak. Insufficient test, of course, gave fewer confirmed cases than actual, opening the door for the governments to undermine the severity of the pandemic and even to have the deception that COVID-19 would be under control. A consequence is that COVID-19 has continued to spread aggressively in the US at the present, with no indication of control in sight. In contrast, in countries that have successfully controlled the disease, such as China and South Korea, the value of λ is between 0.1 and 0.19, signifying significantly stringent government control measures [3]. FIG. 3. Exponential growth of the hidden asymptomatic population prior to implementation of government control for the 15 systems. For uncontrolled and free growth, the exponential rate is essentially the infection rate β whose value has been estimated from data. For a given system, after time ZERO has been determined, it is straightforward to predict how the hidden, asymptomatic population grows with time from a single case, before any government control measure is imposed. Under such a circumstance, there is free growth characterized by an exponential law, as shown in Fig. 3 on a semi-logarithmic scale for the 15 systems, where the exponential growth rate is determined by the infection rate β.
What is the relation between the size of the virus-carrying hidden population at the time of first confirmed case and ∆T , the time elapsed since ZERO? Figure 4 answers this question for the 15 systems. It can be seen that ∆T varies drastically among the 15 systems. When the first confirmed case was reported, there is already a sizable population of the asymptomatic individuals (hundreds or even thousands), whose actual size also varies dramatically among the systems. A general trend is that the hidden population at t 0 tends to grow exponentially with ∆T , a consequence of the free growth behavior exemplified in Fig. 3. Finding time ZERO for the recent second COVID-19 outbreak in Beijing, China. We apply our framework of inverse inference to predict time ZERO for the recent second outbreak of COVID-19 in Beijing, China. From June 11 to 14, 79 cases were reported in Beijing. Since the city had had zero new cases for several months before, the new outbreak is independent of the previous one in the January-February time frame, and can thus be regarded as a second outbreak. Analyzing the official report indicates that symptomatic patients first emerged on June 6. The extreme sparsity of the available data prevents an accurate determination of the four key parameters in our inverse model through optimization, but it is still possible to make approximate estimates. In particular, in Beijing, the regions of the new outbreak were isolated, and aggressive and large scale testing was conducted immediately after the emergence of the new cases, which includes those who are asymptomatic. That is, the individuals in the H state are tested. It is thus reasonable to hypothesize that, by June 15 (time t 0 ), the reported number of cases is approximately the value of H(t) at this date: H(t 0 ) = 79. Since no case was confirmed before June 15, we have η = 1. For the value of infection rate β, if we take it from the data collected in Wuhan (the epicenter in China during the first outbreak), i.e., β = 0.36, our model gives June 3 as time 0. However, if we use β = 0.2 as in the US and European countries, time 0 would be May 23. For β < 0.36, time 0 would be earlier than June 3. The latest possible date of time 0 is thus June 3. Since the first symptomatic individuals appeared on June 6 and the average incubation time is about five days, the estimated time 0 is quite close to the actual time 0. This demonstrates that, in China where government responses are quick and testing is widely available, our model can predict time 0 even during the early stage of the outbreak with sparse data. That is, from the first official report of the outbreak, our model is already capable of providing a rough estimate of time 0, facilitating greatly localization of the spreading source(s).

DISCUSSION
There were speculations that the novel coronavirus could have been in the US a few weeks earlier than January 20, the officially reported date. Inverse inference based on our non-Markovian SHIJR model for COVID-19 leads to a consistent answer confirming the speculations: the virus first emerged in the US at the very beginning of the year. Our confidence in this result comes from our model that has been comprehensively constructed and rigorously tested in terms of the following three aspects. Firstly, the designation of the model states and their dynamical evolution are fully in accord with the known characteristics of COVID-19, subject to government control measures. Secondly, the key model parameters are optimally estimated using empirical data from an adequately long time period including the initial growth phase of the epidemic. Thirdly, the model has been validated with its predictive power firmly established through a comparison between the model generated and real data of the daily accumulative number of confirmed cases in a 14-day period that is not involved in parameter estimation. All these have been done for ten US States and New York city, the US as a whole, plus three European countries most severely hit by the COVID-19 pandemic. Particularly worth noting is that our inverse procedure gives two results that are mutually consistent: the virus was already in Europe as early as the end of December 2019 and the earliest possible date for New York city to have the virus is around January 10. This consistence gives credence at a quantitative level to the widely believed proposition that the virus in New York city was from Europe through air travel [1].
Our study has demonstrated that, for a given system (a State, a city, or even a country), open or closed, our non-Markovian SHIJR model is capable of yielding an estimate of time ZERO and generating the possible epidemic trajectories into the future based on limited data with the power to predict the most likely epidemic scenario. With the inclusion of appropriate optimization and inverse inference procedures, the predictive modeling framework represents a contribution to mathematical and computational epidemiology, going beyond the existing models and offering a general and comprehensive paradigm applicable not only to COVID-19 but also to future pandemics. In addition, the framework developed in this paper can be an accurate and reliable tool/source for governments at all levels, enabling not only an accurate assessment of the government testing and surveillance capabilities for the infectious disease but also a comprehensive evaluation of the effects of government imposed measures to control the disease. This can provide guidance for optimizing these measures to save human lives and for determining the optimal time for reopening to minimize the economical and social impacts.

METHODS
In the US, the circumstances under which COVID-19 spreads vary dramatically among different States: not only are the levels of travel restriction orders dissimilar, but other factors affecting the disease spreading such as the population, medical resources, and social/political cultures are also distinct. A quantitative assessment of the effects of the control measures taken by the government to contain COVID-19 thus needs to be carried out on a State-by-State basis. A complication is that each individual State is not a closed system: people move into and out of the State on a daily basis. This presents a tremendous challenge to modeling [4,5], as most current data analyses and models for COVID-19 were for the setting of a closed system without considering the inbound and outbound population movements . Another feature of the COVID-19 pandemic that most existing models did not take into account is the non-Markovian nature of the spreading dynamics, as characterized by the various time delays associated with the dynamical states. To account for the non-Markovian characteristics in the model, coupled differential equations with distinct time delays are necessary [3]. To determine time 0 for a State in the US treated as an open system, we develop a coupled, dual-system spreading model. In particular, we treat the target system (A) as one under influences from another, much larger system (B) that represents all the other States. Because the size of B is much larger than that of A, in terms of the spreading dynamics, system B can be regarded as a closed system. From the standpoint of nonlinear physics, the influences of system B on system A can be viewed as a perturbation or background noise, while the effects of A on B can be neglected. The perturbation can be estimated based on the population of the target State and the empirical human movement data. The backbone of this unidirectionally coupled modeling framework is a non-Markovian spreading model incorporating various time delays in a closed-system setting [3].
For the five-state (SHIJR) model in the closed system setting, there are three parameters whose values are to be determined from data: (1) the infection rate β -the probability that an individual in the S state catches the virus and switches into the H state, (2) H(t 0 ) -the hidden population at the initial time t 0 when the available data, i.e., the number of confirmed cases, began to increase with time to enable reliable estimation of the model parameters, and (3) the fraction of undocumented infections, denoted as η, whose value is determined by the testing and surveillance capability of the government. For COVID-19, the state transitions in the SHIJR model are non-Markovian. In the open system setting, an additional feature exists: there is time dependence due to human movements in and out of the system. For both closed and open system models, the government control measures result in an exponential decrease in the human social and movement activities. The collective effects of these measures can be described by one parameter: the exponential decay rate of the activities, denoted by λ, where a larger rate corresponds to more stringent control measures. The value of λ can be estimated based on the available epidemic data.
A detailed mathematical description of the model is presented as a Supplementary Note in Supplementary Information (SI). Figure 5 explains our principle to infer time 0 in terms of a number of key dates underlying COVID-19 spreading. Some days after time 0, the first or the first few cases emerged and were officially reported. However, the number of cases is typically small, e.g., one or two, and this number can remain unchanged for a number of days. Time t 0 is the date after which the number of cases begins to increase. The government imposes control measures on date t 1 > t 0 . Since the government control is an integral part of our model, it is necessary to use data up to some date later for determining the model parameters when the effects of the control measures have been manifested in the data. We assume that this would require at least 12 days and thus set t 2 = t 1 + 12 days.
To estimate the four model parameters in a computationally feasible way while ensuring accuracy, we devise the following two-step procedure. We separate the four parameters into two groups: (β, H(t 0 ), η) and λ, where the first three are associated with the intrinsic spreading process while the last is tied to the reduction in the human movements as a result of government control measures. Because of the time delay for the effects of the measures to be manifested, the value of λ does not affect the fitting with the data in the early stage of the disease. As a result, we first use the available data in the time interval [t 0 ,t 2 ] to estimate (β, H(t 0 ), η). For each parameter, we assign an initial range of its possible values. A large number of combinations of the three parameter values are then used to evolve the model from t 0 to generate the number of confirmed cases, J(t), in the time interval [t 0 ,t 2 ]. Using the real data, we carry out a weighted optimization procedure to determine a set of combinations of the three parameters with minimum errors. We then choose a range for λ and uniformly distribute a number of values of λ in this range. Each λ value is combined with the already determined combinations of (β, H(t 0 ), η) to yield an equal number of combinations (β, H(t 0 ), η, λ). With all the chosen λ values, this leads to a large number of combinations of the four parameters. Finally, we carry out the same optimization procedure in the time interval [t 0 ,t 3 ] with t 3 > t 2 to determine the combination of the four parameters with the minimum error. We choose t 3 = t 2 + 12 days.
With the values of the optimal parameters so estimated, the model generates the time series J(t) that can be compared with the data. In the time interval [t 0 ,t 3 ], the agreement is generally excellent. However, this is not indicative of the predictive power of the model because the model parameters are estimated using the data in the same time interval. To test the model for prediction, we run the model in the time interval [t 3 ,t 4 ], as indicated in Fig. 5. Comparison with the real data in this time interval reveals generally quite good agreement, validating the model.
The model is now ready to be used for inferring time 0. Quite straightforwardly, for a given system (a State, a city, or a country), we set H(0) = 1, choose a number of possible candidates for time 0, and run the model from time 0 to t 0 to test which candidate date leads to the known number H(t 0 ): the starting date that gives the correct H(t 0 ) value is taken to be time 0.

DATA AND CODE AVAILABILITY
All relevant data are available from the authors upon request. All relevant computer codes are available from the authors upon request.