Dating the emergence of the first case of COVID-19 in Hubei, China with commercial flights: a retrospective modelling study


 BackgroundCommercial flights contributed to the early-stage international transmission of severe acute respiratorysyndrome coronavirus 2 (SARS-CoV-2). Understand the effect of international and inter-state flights on thevirus transmission is important to evaluate the initial response of the outbreak. This study investigated thelikely date of the emergence of the first COVID-19 case in Wuhan, China.MethodsWe constructed a geographical-structured model, including 9122 county-level geographical units in 250different regions or countries, and 26,094,036 flight plans. Using the model, we estimated the date of thenumber of deaths and the date of first death caused by COVID-19 in 155 different countries. We set acertain trigger for country and county level lockdown, and a built-in flight randomizer, to assess thedifferent evidence that can suggest the possibility of different dates to be the emergence of the firstCOVID-19 case.FindingsWe found the median number of global deaths caused by COVID-19 from 50 trails of our simulation isbetween 853901 and 28432 when the emergence of the first case of COVID-19 case is within the timeinterval between September 1, 2019 and November 1, 2020. Overall, the average and median R2 of 155countries decrease as the date of the emergence postpone; however, a small increase of R2 is observed whenthe date emergence of the first case of COVID-19 case is September 22. The deviation of the simulationresult of the deaths from the actual scenario in the eight major epicenters demonstrates a negative trend asthe first case emerges later. A similar pattern can be observed on the date of the first death in these eightcountries, which can be applied to a global scale.InterpretationsWe found that our simulation results can only include the actual number of global deaths caused byCOVID-19 on May 1,2020 inside the 95%PI when the date of the emergence of the first case is betweenSeptember 15, 2019 and October 15, 2020; The lowest deviation can be observed on September 22. Wealso found a decrease of R2 when we deploy later emergence of the first case; unexpectedly, an abnormalincrease of R2 on September 22 is observed. When focusing on the eight major epicenters, we found that thenumber of deaths in these eight countries can only be all included inside the 95%PI when the emergence ison September 15 and September 22. The deviation of the first deaths demonstrates different patternsdepending on the continent, whereas the global average demonstrates a pattern of normal distribution witha maximum of 0.2107 on September 22 after the removal of outliers. This could suggest a high likelihoodof the emergence of the first case of COVID-19 be around September 15, 2020 and September 22, 2020.


Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-Cov-2) has impacted multiple countries after first detected in Wuhan China. Evidence suggests that the virus can spread through aerosols. 1,2 Commercial flights were determined to be a crucial factor in importing cases and resulting in local infection. 3,4 Some countries have enforced strict border closures to mitigate the spread of this deadly virus. 5 Although multiple non-pharmaceutical interventions on the COVID-19 pandemic were introduced, they were not sufficient to contain the spreading of the virus, both locally and globally. 6,7 Multiple modelling studies have been conducted to assess the local COVID-19 transmission in the UK, Singapore, Canada, and many other countries. 6,8,9 These studies provided insight into the virus transmission and mitigation based on a local and static population. Results given by these studies reflect the countrylevel transmission dynamics and potential epidemic sizes without the influx and outflux of the cases. 4,9 International flights contributed to the early stage spread of the virus, 10 but none of the mathematical models introduced before included this crucial factor. Due to this reason, the simulation results from these models were limited to the local level. To better describe the global transmission dynamic and investigate the early transmission trajectory of the virus, we construct a computer model to evaluate the potential date of the emergence of the first case of COVID-19 and give an assessment of the effect of the mobility of the population on the exportation of cases. According to ISO-3166 and IATA standard, we broke down the world into 250 country-level regions, then further divided 2154 district-level regions ( Figure 1). 11,12 The population of each region was collected through census data, and it is shown in the supplementary material (supplementary material). 13 A dedicated SEIR model was constructed for each region, and will only be activated when a case is imported from another province. The patients could be traveling from a province to another and initiated the transmission in the destination.

Methods
We assumed the population to be distributed evenly without heterogeneity; the population of each county is recorded and integrated into the model (Supplementary Material).

Flights
We integrated the flight plans from September 1 st , 2019 to June 1 st , 2020 (26,094,036 Flights), to help to describe a more dynamic transmission scenario. Due to the lockdown and travel ban in some areas, only 21,081,805 are included (Supplementary material). Flights are assumed to be the only way to travel between states and countries, the exchange of population who is not infected is excluded.
We picked the global average passenger load factor in 2019 (82.6%) to be the uniform passenger load factor for our model, regardless of the actual load factor for a certain flight. 14 In 2018, 1,811,324,000 passengers traveled internationally through air. 15 Since commercial flights is one of the main contributors to the importation of COVID-19 cases, 10,16 we focused on the effect of commercial flights on the transmission of SARS-Cov-2 and therefore investigate the date of the emergence of the first case.
Several studies have discussed the case of onboard transmission, and they discovered a low probability of in-flight transmission. 17,18,19 Thus, we excluded the in-flight transmission, assuming the transmission scenario to be only local human-to-human infections.

Lockdown
We introduced two types of lockdown into the model: local level lockdown, and country-level lockdown. The lockdown in a certain county or country will be triggered when a certain number of deaths caused by COVID-19 is passed, causing the frequency of social activities to drop; the frequency of social activities will be reduced to no lower than 15% of its original level. Lockdowns were assumed to gradually decrease the number of contacts between the individuals in these counties, altering the relative number of contacts to our best estimate of a reasonable reduction in contact rates under lockdown.
The local level lockdown will be enforced only if the local threshold is surpassed, it will not affect the flight plans and the lockdown in other parts of the country.
The country-level lockdown will be triggered when the death poll of any county inside the country reached a threshold, it will cause the whole country to lockdown regardless of the severity of other counties (i.e. The whole China was locked down due to the high death poll in Hubei) (Supplementary Material).

Initial Epicentre
Since the origin of the virus is still unknown, this study will not address the location where the virus emerged, but rather focus on the date of the onset of the first case of COVID-19. Due to the reason that the first COVID-19 case was reported in Wuhan, China in December, 2019, we assumed the location of the first case of COVID-19 to be in Hubei, China in between the time interval from September 1, 2019 to November 1, 2019. The lockdown in Hubei provinces is assumed to be a country-level lockdown. 20,21

Figure 2: Description of the transmission model
Individuals in the stochastic compartmental model are classified into susceptible, exposed, infectious, and removed states. The cases will be determined to be an exported case or a local case by the stochastic algorithm.
Although the virus could spread through imported goods and animals, our model includes only human-tohuman transmission. 22 We assumed all individuals at the start as susceptible without immunity. The individual will enter the exposed state when contracted with an infected case. After an average latent period of 4.5 days, the person will remain infectious for 7 days; the infectious period includes 2 days of preclinical period and 5 days of clinical period. 6,23,24 The individual in the exposed and infected stage will be determined to become whether a local case or an exported case according to a stochastic algorithm ( Figure  2 25 The imported cases will preserve its data (time since onset, age, and social activity pattern) and start transmission accordingly in the new area. Since the average mortality rate of COVID-19 patients demonstrates a strong correlation with the age, 26 we stratified the population in some countries into 10-year age bands. 27 Our stratification was completed on the country level; thus, we did not consider the difference of population age distribution between different provinces and counties in the same country. 13 The mortality rate of each individual in the model was determined based on a function based on age. The mortality rate for a certain ten-year age interval was assumed to be the same; the data was acquired from a publication by the Chinese CDC (supplementary material). 28 Overall, the number of deaths caused by COVID-19 by May 1, 2020 worldwide demonstrates an exponential decrease pattern as later dates of the emergence of the first case of COVID-19 were deployed. 29

Deaths
Actual deaths is lower than the 95% prediction interval; when the date is set as late as than October 22, 2019, the actual number is higher than the 95% prediction interval ( Figure 3).      Figure 5). Obviously, as the emergence of the first case of COVID-19 was set later, the number of countries that do not include the actual scenario within its 95%PI increases. When the emergence happened on September 15, 2020 and September 22, 2020, the simulation included all the actual numbers of the eight countries inside the 95%PI; seven of eight countries are excluded from 95%PI when the emergence is on October 15, 2020 ( Figure 5).
Overall, the date of the emergence of the first case in which the eight epicentres demonstrate the lowest deviation from the actual scenario converged towards Septembers 15, 2020. The date of the emergence of the first case countries which results in the lowest deviation of countries in North America are overall later than that of those countries in Europe. The simulation with the emergence of the first case of COVID-19 date resulting in the lowest deviation of the first death from the actual date of the first death in the eight major epicenters which possesses the lowest deviation is later than the dates which they have the highest coefficient of determination. The dates with the lowest deviation of the United States, United Kingdom, Spain, Germany, Belgium, France, Canada, and Italy are October 23, October 8, October 8, October 15, October 8, October 1, October 8, and October 1, respectively. The similar pattern does apply to other countries.
There are some outliers from our simulation which are mainly African countries. The simulation result of Namibia, Lesotho, and Uganda, for example, which the date of the first death caused by COVID-19 happened after July, demonstrates significant deviation (the scenario that results in the latest date of the first death) from the actual scenario: 105 days, 87 days, and 89 days (Figure 6 a). When the three outliers are excluded, the correlation coefficient of Africa was significantly boosted, with a peak of 0.2427 (originally 0.0182) on October 1, 2019 (Figure 6 b). As a result of that, the global average increased after the outliers were removed, with a more obvious maximum on September 22, 2019 (Figure 6 c). The global average demonstrated a Gaussian distribution with an obvious central peak on September 22, 2019 after the removal of outliers; however, the pattern cannot be observed before the removal (Figure 6 c).

Discussion
Using this model, we evaluated different parameters regarding the deaths caused by COVID-19 worldwide. We found that the number of deaths worldwide by May 1, 2020, the number of deaths in the eight major epicenters on May 1, 2020, and the deviation of the first death in 155 different countries, demonstrates a convergence pattern of the maximum value when the emergence of the first case of COVID-19 is in the week between September 15, 2020 and September 22, 2020.
We observed an exponential decay pattern of the number of deaths caused by COVID-19 when the emergence of the first case delays (Figure 3). We found that the date in which our simulation result is the closest to the actual scenario to be September 22, 2019 ( Figure 3). Unexpectedly, the date of the emergence of the first case resulting in the highest degree of fit in the eight major epicenters differed to some extent depending on the continent ( Figure 5). When the date is later than October 8, 2019, although the number of deaths worldwide can still fit the actual number of deaths into its 95%PI, the number of deaths in some countries (ie. Italy) resulted from the simulation is far less than the actual number ( Figure 3, Figure 5).
We found the overall deviation of the number of deaths by May 1, 2020 from our simulation to be smaller when the emergence of the first case to be in September, 2019 than in October 2019. However, the deviation of the date of the first deaths in Europe demonstrates a minimum on October 15, 2019. The pattern of the deviation in the four different geographical divisions (Europe, North America and South America, Africa, and Asia and Oceania) varies to a great extent ( Figure 6). Unexpectedly, the global average Pearson product moment correlation coefficient demonstrates a normal distribution pattern when the date of the emergence of the first case changes ( Figure 6).
We observed the inconsistency between geographical divisions. For instance, the maximum degree of fit of the date of the first death in Europe is demonstrated when the emergence of the first case happened on October 15, 2020, whereas the counterparts for the other continents converge before October 1 ( Figure 6). This may due to the different passenger load factors of the flights and the effect of ground vehicles on the overall population mobility and thus affect the transmission dynamics. 30 The three outliers contributed to a great reduction of the overall correlation coefficient. We believe a higher degree of fit can be realized if some limitations to the countries for inclusion are added.
Our study is subject to several limitations. First, the census data we acquired is not completed at the same time, thus the transmission scenario in our simulation may be different in some areas. This may also have an impact on our age stratification, altering the mortality rate in some countries.
We did not include inter-state and international travels through ground vehicles, thus may underestimate the exportation in some countries where the trains and buses play dominant role in transportation. In terms of transmission, we did not include asymptomatic cases, which may change the transmission dynamics and therefore influence our modelling results from various aspects. 31,32 The passenger load factor of commercial fights has a great impact on the transmission scenarios; however, we failed to acquire the exact passenger load factor for every flight record but rather used a global average. 15 This may lead to an overestimation of the passengers of some flights with lower demands.
Another assumption we made is that Wuhan, China is the location where the virus emerged. We only discussed the scenario which that Wuhan is the sole epicenter at early stage of the virus transmission, whereas the actual scenario could be more complicated. 33 Our study provides insights into the early stage transmission of the SARS-CoV-2 virus with the presence commercial flights. Our simulation results suggest a high degree of fit and likelihood of the date of onset of the first case to be in between the time interval between September 15, 2019 and October 1, 2019 with the peak on September 22, 2019. The most likely date of the emergence of the first case we found is over 70 days earlier than the onset date of the first case identified (December 1, 2019). 33 Since the global Pearson product moment correlation coefficient demonstrates normal distribution as the onset date of the first case changes, we estimated that the coefficient would be as low as 0.1179 if the onset date of the first case of COVID-19 is December 1, 2019. This may indicate some untested cases in Wuhan at the early stage of the outbreak. 34,35 Nevertheless, since mutations will affect the infectivity of the virus, the actual origin of the virus may be far different from our estimation. 36

Supplementary Material
Geographical-structured model

Figure 1: Sample geographical division
We included all county/province level that has airports. Overall, there are 250 countries/regions and 2154 province level districts in our model. 1,2

Transmission model
We constructed a conventional SEIR model for each county-level unit. The overall model from a macro scale can be written as: The flowchart for an individual patient is:

Figure 2: Description of the transmission model
Individuals in the stochastic compartmental model are classified into susceptible, exposed, infectious, and removed states. The cases will be determined to be an exported case or a local case by the stochastic algorithm.

Stochastic algorithm for flight assign
We integrated 21,081,805 flights from September 1, 2019 to June 1, 2020 to describe the mobility of the population. To utilize this day, we used a stochastic algorithm to determine the number of patients on board.
For a county with the population of , the number of fights taken off of on a certain day , passenger of each flight of , passenger load factor of , the possibility of an individual onboard is: The expectancy of the number of exposed/infectious individuals onboard ( and )is: However, to reduce the amount of computational power required for the model, we simplified this algorithm to a Gaussian distribution function. Therefore, the number of exposed/infectious passenger is: After determining the number of patients that will be onboard, randomly pick certain patients to be exported from the province. The algorithm will then assign them a flight and a destination.
We picked 10 flights on January 11, 2020 from Shanghai, China (ISO-3166: CN-SH) to demonstrate our algorithm (Table 1). Overall, there are 2946 seats available ( ) to choose from. However, there may be some seats that do not have passengers; therefore, we will multiply by the passenger load factor ( = 82·6%) to find the number of passengers on board ( ), 3 which equals: In this case, is 2,433. The algorithm will then generate a random integer from 0 to (2433) for each passenger, this number is the passenger code for them ( Table 1). The passenger code is corresponding to one flight, and this is the flight which the passenger is going to take.

Key parameters in the model
The key epidemiology parameters in our model are acquired from published studies. 4,5,6 We used a serial interval of 7 days based on published studies. We also used the mean of the latent period 4·5 days. 4 The patient will be determined whether to demise or recover after a median of 18·8 days (95% credible interval [CrI] 15·7-49·7). 4 The mortality rate of the model for each individual was determined based on their age according to the table  (Table 2).  Pearson correlatoin coefficient H u n g a r y F r a n c e R u s s i a n F e d e r a t i o n G e r m a n y I t a l y U n i t e d K i n g d o m S p a i n T u r k e y N o r w a y K a z a k h s t a n B u l g a r i a R o m a n i a C z e c h i a A z e r b a i j a n U k r a i n e N e t h e r l a n d s M o n t e n e g r o S w i t z e r l a n d B e l g i u m P o l a n d G r e e c e F i n l a n d D e n m a r k P o r t u g a l C r o a t i a S w e d e n I r e l a n d C y p r u s S e r b i a A u s t r i a I c e l a n d L u H a i t i B e l i z e E l S a l v a d o r G u y a n a A n t i g u a a n d B a r b u d a T h a i l a n d J a p a n M a l a y s i a I n d o n e s i a U n i t e d A r a b E m i r a t e s A u s t r a l i a P h i l i p p i n e s I n d i a S i n g a p o r e S a u d i A r a b i a N e w Z e a l a n d I r a n M a l d i v e s P a k i s t a n I s r a e l B a n g l a d e s h N e p a l S r i L a n k a K y r g y z s t a n I r a q T a j i k i s t a n L e b a n o n C y p r u s F i j i U z b e k i s t a n J o r d a n A r m e n i a Q a t a r O m a n