Conditional quantiles estimation of the incubation period of COVID-19

Background: In December 2019, cases of pneumonia of unknown etiology were identified in Wuhan, Hubei province, China. The World Health Organization (WHO) named the disease COVID-19, standing for "2019 coronavirus disease", and announced that the disease had become a public health incident on December 31, 2019. This study aimed to investigate the distribution of the incubation period of COVID-19 conditional on the age of infected cases, and to estimate the corresponding conditional quantiles from information on 2172 confirmed cases from 29 provinces outside Hubei in China. Methods: We collected data, including the infection dates, onset dates, and ages of the confirmed cases, from the websites of the centres of disease control or from the daily public reports through February 16th, 2020. A new maximum likelihood method was developed to account for the biased sampling (right truncation) of the data, which arises because the epidemic is still ongoing. The estimators can be shown to be consistent under mild conditions. Results: Based on the collected data, we found that the conditional quantiles of the incubation period distribution of COVID-19 vary with age. In particular, the high conditional quantiles for people in the middle age group are shorter than those for the other groups. We estimated that the 0.95-th quantile for people aged 23∼55 is less than 15 days. Conclusions: Since the conditional quantiles vary with age, more precise measures may be taken for people of different ages. For example, an age-dependent quarantine duration may be considered in practice, rather than a uniform 14-day quarantine. In particular, the current quarantine duration may need to be extended for people aged 0∼22 and over 55, because the related 0.95-th quantiles are much greater than 14 days.


Background
In December 2019, some cases of pneumonia of unknown etiology were identified in Wuhan, Hubei province, China. After investigation by the National Coronavirus Research Group, this pneumonia was identified as being caused by a new coronavirus (2019-nCoV). The World Health Organization (WHO) named this disease COVID-19, standing for "2019 coronavirus disease" [1].
It turns out that the novel coronavirus, similar to SARS-CoV, is the seventh member of the Nidovirales family of coronaviruses [2]. However, COVID-19 has a shorter serial interval than SARS [3] and higher transmissibility than MERS in the Middle East countries [4]. It is highly infectious [5] and is contagious even during the incubation period [6]. It can cause severe symptoms or even death [7]. The novel coronavirus not only threatens cities in China [8], but also seems to have spread explosively worldwide. Hence, it is important to take the necessary measures to prevent and control it as quickly as possible.
In prevention and control efforts, it is well known that the incubation period distribution plays an important role. Knowledge of this distribution can help to model the size of the epidemic mathematically [8], to predict the time at which an outbreak will occur, and to determine the efficacy of medical intervention [9]. The pioneering work on deriving the incubation period distribution was conducted by Philip Sartwell in 1950 [10]. After that, the lognormal distribution was widely used to model the incubation period distribution of infectious diseases, and many authors studied the incubation period distributions of various other diseases. Some other distributions, e.g., the Gamma distribution and the Weibull distribution, were also suggested to fit the observed incubation periods; see for example [9,11,12,13].
In the literature, Li et al. [14] first studied the incubation period distribution of COVID-19 based on 10 early cases in Hubei province in China. Relying on their estimates, Li et al. [14] suggested a 14-day medical observation period or quarantine for exposed persons. Guan et al. [15] reported a median incubation period of 3.0 days (range 0 to 24.0 days) for 1099 patients from 552 hospitals in 31 provinces/provincial municipalities through January 29th, 2020. Recently, Backer et al. [16] updated this distribution based on the reported travel histories and symptom onset dates of 88 travellers from Wuhan with confirmed 2019-nCoV infection in the early outbreak phase. Backer et al. [16] estimated that the 97.5th percentile of the incubation period distribution of COVID-19 is 11.1 days. Linton et al. [17] further took the biased sampling issue into account, and found that the estimated 95th percentile is greater than 14 days.
However, none of the literature above investigates the distributional characteristics of the incubation period of COVID-19 across people of different ages. Based on 2172 confirmed cases collected outside Hubei province in China, a simple ANOVA indicates that the age of confirmed cases has a significant effect on the incubation period of COVID-19. This motivates us to estimate the incubation period distribution conditional on age. Note that the collected data are subject to biased sampling because the COVID-19 epidemic was still ongoing in China on February 16th, 2020. The current study differs from [18], [19], [20] and [21], which investigated the relationship between age and the incubation period of AIDS but did not address the biased sampling issue.
In this study, we develop a conditional quantile model of the incubation period of COVID-19 given the age of infected cases, and describe the estimation method in detail. The main results are calculated from the collected data, and the conclusions are presented accordingly.

Data and methods
In this section, we summarize the collected data and introduce the estimation method, which is designed around the main characteristics of the data.
The data set is taken from the websites of the centres of disease control, or from the daily public reports on COVID-19, in 29 provinces outside Hubei province through February 16th, 2020. It consists of 2172 confirmed cases with four variables each: gender, age, onset date and infection date. The incubation period is calculated by the formula "Incubation Period = Onset date − Infection date + 1". Note that the default unit of time is the day throughout this paper. Among these 2172 cases, there are more cases from Zhejiang, Henan and Anhui than from the other provinces because of the large number of confirmed cases in these provinces. Figure 1 shows the scatter plot of the incubation period of COVID-19 versus the age of confirmed cases.
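The incubation period formula above counts both the infection day and the onset day. A minimal sketch of that convention is given below; the function name and the example dates are hypothetical and are not taken from the collected data.

```python
from datetime import date

def incubation_days(infection: date, onset: date) -> int:
    """Incubation period in days: onset date - infection date + 1."""
    if onset < infection:
        raise ValueError("onset date precedes infection date")
    return (onset - infection).days + 1

# Hypothetical case: infected on January 20th, 2020, onset on January 25th, 2020.
print(incubation_days(date(2020, 1, 20), date(2020, 1, 25)))  # 6
```

Under this convention a case whose symptoms appear on the day of infection has an incubation period of one day rather than zero.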
We conduct a preliminary one-way ANOVA study on the incubation period of COVID-19 over four age groups, i.e., 0∼17, 18∼40, 41∼65 and over 65, and find that the age of confirmed cases has a significant effect on the incubation period. Hence, we further investigate the incubation period distribution of COVID-19 conditional on age as follows.
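For readers unfamiliar with the test, the F statistic of a one-way ANOVA can be sketched as follows. This is an illustration only: the three toy groups are made up and are not the collected data, and the p-value (which requires the F distribution) is not computed here.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    group_means = [mean(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Toy incubation periods (days) for three made-up age groups.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f_stat, df_b, df_w)
```

A large F value relative to the F distribution with (df_between, df_within) degrees of freedom indicates that the group means differ, which is the sense in which age is reported to have a significant effect.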
Note that the Weibull distribution fits the data set well, and that the mean incubation period varies with age. We propose to model the relationship between the true incubation period, say $T$, and the age, say $X$, through the conditional distribution $G(t|\lambda(X), \eta)$ of $T$ given $X$. The related density function is specified as follows:
$$g(t|\lambda(X), \eta) = \frac{\eta}{\lambda(X)} \left(\frac{t}{\lambda(X)}\right)^{\eta-1} \exp\left\{-\left(\frac{t}{\lambda(X)}\right)^{\eta}\right\} I(t > 0),$$
where $I(\cdot)$ denotes the indicator function, $\eta > 0$, and $\lambda(X) = \lambda_3(X) = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 > 0$. The reasons for using this form of conditional distribution are as follows: (i) the conditional mean of $T$ takes the form $E(T|X) = \lambda(X)\Gamma(1+1/\eta)$, which implies that the age $X$ affects $E(T|X)$ through $\lambda(X)$; (ii) $\lambda_3(x)$ is flexible enough to characterize the trend of $E(T|X)$ over $X$. Note that $\lambda_3(x)$ includes $\beta_0$, $\beta_0 + \beta_1 x$, and $\beta_0 + \beta_1 x + \beta_2 x^2$ as special cases. Here $\Gamma(\cdot)$ denotes the gamma function.
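Under a Weibull model with scale λ(x) and shape η, which is the parameterization implied by the conditional mean E(T|X) = λ(X)Γ(1+1/η), the conditional p-th quantile has the closed form λ(x)(−log(1−p))^{1/η}. The sketch below evaluates this expression for a cubic λ(x). The coefficient values are hypothetical placeholders chosen only to make the code run; they are not the estimates reported in this paper.

```python
import math

def scale(x, beta):
    """Cubic scale function: b0 + b1*x + b2*x^2 + b3*x^3."""
    b0, b1, b2, b3 = beta
    return b0 + b1 * x + b2 * x ** 2 + b3 * x ** 3

def conditional_mean(x, beta, eta):
    """E(T | X = x) = lambda(x) * Gamma(1 + 1/eta)."""
    return scale(x, beta) * math.gamma(1.0 + 1.0 / eta)

def conditional_cdf(t, x, beta, eta):
    """Weibull distribution function 1 - exp(-(t / lambda(x))^eta)."""
    if t <= 0:
        return 0.0
    return -math.expm1(-((t / scale(x, beta)) ** eta))

def conditional_quantile(p, x, beta, eta):
    """Solve conditional_cdf(t) = p for t."""
    return scale(x, beta) * (-math.log1p(-p)) ** (1.0 / eta)

# Hypothetical parameter values, NOT the estimates in Table 1.
beta_demo = (10.0, -0.1, 0.001, 0.0)
eta_demo = 2.0
for age in (10, 40, 70):
    q95 = conditional_quantile(0.95, age, beta_demo, eta_demo)
    print(age, round(q95, 2))
```

Because every quantile is a fixed multiple of λ(x) for a given η, the shape of the quantile curves over age is driven entirely by the polynomial λ(x).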
Furthermore, the empirical result shown in Figure 2 indicates that the distribution of the age $X$ may be modelled by a normal distribution. Write its density as $\varphi(x; \mu, \sigma^2)$. Then, a natural idea is to estimate the conditional distribution by maximizing the likelihood function based on $\{T_j, X_j\}_{j=1}^{m}$. Unfortunately, the incubation period of some infected cases cannot be fully observed while the COVID-19 epidemic is still ongoing. The observed incubation period of COVID-19, say $Y$, is subject to biased sampling. That is, $Y$ observed at some fixed time $t^*$ is not the same as the true incubation period $T$. This is because, for a case infected at time $t_S$, the incubation period can be observed at time $t^*$, with $Y = T$, only if $T \in (0, \Delta]$, where $\Delta = t^* - t_S$. This implies that the distribution of $Y$ is in fact the conditional distribution of $T$ given the random event $\{T \in (0, \Delta]\}$.
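That is, for $0 < y \le \Delta$, we have
$$F(y|X) = P(T \le y \mid T \in (0, \Delta], X) = \frac{G(y|\lambda(X), \eta)}{G(\Delta|\lambda(X), \eta)}. \qquad (2)$$
Denote the collected samples as $\{Y_i, X_i, \Delta_i\}_{i=1}^{n}$, where the $Y_i$'s denote the observed incubation periods, the $X_i$'s the ages, and $\Delta_i = t^* - t_{S,i}$ the difference between the infection time of the $i$-th case and the observation time $t^*$, i.e., February 16th, 2020.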
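The effect of conditioning on the event {T ∈ (0, ∆]} can be illustrated with a short sketch. It assumes a Weibull distribution with scale lam and shape eta for T (for a given age, lam would be λ(x)), and it computes the truncated distribution function together with a log-likelihood that is conditional on age and on ∆. This conditional log-likelihood is a simplification for illustration and is not claimed to be the likelihood used in this paper; all function names are hypothetical.

```python
import math

def weibull_logpdf(t, lam, eta):
    """Log-density of a Weibull with scale lam and shape eta, t > 0."""
    return math.log(eta / lam) + (eta - 1.0) * math.log(t / lam) - (t / lam) ** eta

def weibull_cdf(t, lam, eta):
    """Distribution function G(t) = 1 - exp(-(t / lam)^eta)."""
    return -math.expm1(-((t / lam) ** eta))

def truncated_cdf(y, delta, lam, eta):
    """P(T <= y | T <= delta) for 0 < y <= delta."""
    return weibull_cdf(y, lam, eta) / weibull_cdf(delta, lam, eta)

def truncated_loglik(ys, deltas, lam, eta):
    """Sum over cases of log g(y_i) - log G(delta_i)."""
    return sum(weibull_logpdf(y, lam, eta) - math.log(weibull_cdf(d, lam, eta))
               for y, d in zip(ys, deltas))

# A case infected 10 days before the cut-off: short incubation periods are
# over-represented relative to the untruncated distribution.
print(weibull_cdf(5.0, 7.0, 2.0), truncated_cdf(5.0, 10.0, 7.0, 2.0))
```

Dividing by G(∆) shifts probability mass toward shorter periods, which is why treating the observed Y as if it were T tends to understate the upper quantiles.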
Since the number of infected cases did not grow exponentially through February 16th, 2020, it is unreasonable to reuse the likelihood function developed in [17] in this paper. Fortunately, cases were infected on almost every day of the data collection period. Hence, it is reasonable to assume that the $\Delta_i$'s are non-random. Furthermore, note that $(Y_i, X_i)$ are independent of the number of cases infected on each day $t_{S,i}$. Then, after obtaining $F(y|X)$ in (2), we propose to use the following likelihood function:
Note that $E \log(L_T(\beta, \eta, \mu, \sigma^2)) = E \log(L_Y(\beta, \eta, \mu, \sigma^2))$ when all samples involved are independent. This independence assumption is mild because all observations in this paper were collected nationwide. It is easy to check that the maximum likelihood estimator based on (3) is asymptotically the same as that based on (1), which is consistent and asymptotically normal under some general conditions.
However, $L_Y(\beta, \eta, \mu, \sigma^2)$ contains some integrals, which are computationally difficult to evaluate. Fortunately, it holds that
Noting that the estimation of $\mu, \sigma^2$ is trivial based on $\{X_i\}_{i=1}^{n}$, we focus on how to estimate $\beta, \eta$ in the sequel. To handle the integrals, we propose to use the EM algorithm as follows. Here suppose that we have obtained some initial estimators $\hat{\beta}^{(0)}, \hat{\eta}^{(0)}$, which may be easily computed by treating the $Y_i$'s as if they were unbiased.
We have implemented this algorithm in R, relying on the optimization function constrOptim(). The implementation runs quite fast. Usually, convergence is achieved within several iterations.
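To illustrate why the truncation adjustment matters when fitting, the following sketch compares a naive Weibull fit with a truncation-adjusted fit on synthetic data. It is not the EM algorithm described above and it does not use constrOptim(): a plain grid search stands in for the optimizer, the scale is held constant rather than depending on age, and the true parameter values, the grid, and the range of ∆ are all hypothetical choices made for the example.

```python
import math
import random

def logpdf(t, lam, eta):
    """Log-density of a Weibull with scale lam and shape eta."""
    return math.log(eta / lam) + (eta - 1.0) * math.log(t / lam) - (t / lam) ** eta

def log_cdf(t, lam, eta):
    """Log of the Weibull distribution function G(t)."""
    return math.log(-math.expm1(-((t / lam) ** eta)))

def loglik(data, lam, eta, truncated):
    """Sum of log g(y_i), minus log G(delta_i) when truncation is modelled."""
    total = 0.0
    for y, delta in data:
        total += logpdf(y, lam, eta)
        if truncated:
            total -= log_cdf(delta, lam, eta)
    return total

def grid_mle(data, truncated):
    """Return the (lam, eta) grid point with the largest log-likelihood."""
    lams = [3.0 + 0.25 * i for i in range(49)]  # 3.00, 3.25, ..., 15.00
    etas = [1.0 + i / 10 for i in range(31)]    # 1.0, 1.1, ..., 4.0
    return max(((l, e) for l in lams for e in etas),
               key=lambda p: loglik(data, p[0], p[1], truncated))

def simulate(n, lam, eta, seed=0):
    """Synthetic right-truncated sample: a case is kept only if T <= delta."""
    rng = random.Random(seed)
    data = []
    while len(data) < n:
        delta = rng.randint(3, 25)        # days from infection to the cut-off
        t = rng.weibullvariate(lam, eta)  # true incubation period
        if 0.0 < t <= delta:
            data.append((t, delta))
    return data

TRUE_LAM, TRUE_ETA = 7.0, 2.0             # hypothetical, not the paper's estimates
sample = simulate(400, TRUE_LAM, TRUE_ETA)
naive_fit = grid_mle(sample, truncated=False)
adjusted_fit = grid_mle(sample, truncated=True)
print("naive:", naive_fit, "truncation-adjusted:", adjusted_fit)
```

The naive fit corresponds to the initial estimators obtained by ignoring the bias in the observed incubation periods; the adjusted fit renormalizes each contribution by G(∆_i).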

Results
Based on the information on 2172 confirmed cases, we compute the estimated parameters using the implementation of the EM algorithm mentioned above; see Table 1. It is worth mentioning that $\hat{\beta}_3 = -1.1 \times 10^{-6}$ is very small, which implies that the cubic form $\lambda_3(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$ is flexible enough to characterize the trend of the conditional mean $E(T|X)$ over $X$. We do not need to assume a higher-order polynomial for $\lambda(x)$. Using the results in Table 1, we obtain the conditional 0.05, 0.25, 0.5, 0.75, 0.9 and 0.95-th quantiles of the incubation period distribution of COVID-19 given age; see Figure 4 for details. Figure 4 indicates that the quantiles for middle-aged people seem to be smaller than those for the others. In particular, the estimated 0.95-th conditional quantile for children and the elderly is clearly greater than that for the middle-aged. In more detail, we report the 0.95-th conditional quantiles of the incubation period distribution of COVID-19 for different ages in Table 2. Table 2 indicates that the 0.95-th conditional quantiles for people aged 23∼55 lie between 14 and 15 days, shorter than those for the other groups. We also list the number of cases in each group and the corresponding proportions. It turns out that the infected cases aged 23∼55 account for more than 70% of all collected cases. Further, note that we collected 136 cases whose incubation periods are greater than 14 days. We provide the distribution of these cases over all age groups. The proportion in the age group 23∼55 is the smallest, at only 5.11%. In contrast, the proportions in the other age groups are much higher than 5%.
To further verify the results above, we divide the observed incubation periods into three groups by age: 0 to 25 years old, 26 to 60 years old and over 60 years old, according to World Population Prospects: the 2019 Revision [1]. Then we fit the Weibull distribution in each group by setting $\lambda(x) = \beta_0$. Figure 5 shows that people aged 0 to 25 or over 60 have a higher probability of a longer incubation period than people aged 26 to 60. Moreover, the right panel shows the fitted Weibull distribution functions of the incubation period of COVID-19 in the three age groups. It is clear that the 0.95 quantile for people under 26 and over 60 is greater than that for people aged 26 to 60. This roughly coincides with the results reported in Figure 4, and hence indicates that the conditional distribution considered above is able to characterize the relationship between age and the true incubation period of COVID-19.

Discussion
Our estimates of the conditional quantiles indicate that the incubation period of COVID-19 varies with the age of the infected cases. More precisely, the incubation period of the young and the old tends to be longer than that of middle-aged people.
This finding appears to have some support in immunology. Human immunity refers to the sensitivity of the immune system in response to infection. During the incubation period, the host's immune system has not yet been activated and the body has not begun to show symptoms, so the virus can use this period to replicate extensively. In many situations, the more responsive the host's immune system is to the infection, the shorter the incubation period tends to be. Since human immunity is weak at the beginning of life, improves with age, and declines in old age [22,23], it may not be surprising that the incubation periods of the young and the old cases are longer than those of the middle-aged cases.
Currently, the quarantine duration is fixed at 14 days. It does not take other factors, such as age, into account. Hence, our results may be helpful for disease control and prevention efforts, because they enable more precise measures to be taken. For example, a personalized quarantine duration can be adopted for individuals of different ages. In particular, people aged 23∼55 play important roles in daily life and are a significant part of the labor force. Besides, they account for the largest proportion of the population. A relatively short quarantine duration for them not only can reduce the burden on medical staff but also is conducive to social and economic development. In contrast, the 0.95-th conditional quantiles for ages 0∼22 and over 55 are much greater than 14 days. We may need to extend the quarantine duration for people of these ages. Such an extension may help prevention while having a limited impact on social and economic development.
It is worth mentioning that there exist other ways in statistics to characterize the conditional quantiles of the incubation period given age. One option is first to model the relationship between $T$ and $X$ by the following linear model:
$$T = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \varepsilon,$$
and then to use quantile regression to estimate the unknown parameters $\beta_i$, $i = 0, 1, 2, 3$. However, note that the true incubation period $T$ cannot be fully observed, and the observed incubation period $Y$ tends to be smaller than $T$. The estimated quantiles may therefore suffer from problems such as underestimation.
In fact, we also report in Figure 6 the result of the ordinary quantile regression mentioned above. Figure 6 shows that the regression quantiles follow a pattern similar to that of the conditional quantiles reported in Figure 4. Nevertheless, we also note that the 0.25-th quantile and the 0.05-th quantile cross each other when the age is greater than 80. It seems difficult to give a reasonable explanation for this phenomenon. This strange result may be caused by the biased sampling issue. Hence, we did not use the regression quantiles to analyze the current data, although they provide results similar to the conditional quantiles.
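For reference, ordinary quantile regression at level τ is usually defined through the check loss. The display below states that standard formulation; the cubic design in $X_i$ is an assumption inferred from the four parameters $\beta_0, \dots, \beta_3$ mentioned above, and the observed $Y_i$ is used as the response because $T$ is not fully observed.

```latex
\hat{\beta}(\tau) = \arg\min_{\beta \in \mathbb{R}^{4}}
  \sum_{i=1}^{n} \rho_{\tau}\!\left( Y_i - \beta_0 - \beta_1 X_i - \beta_2 X_i^{2} - \beta_3 X_i^{3} \right),
\qquad
\rho_{\tau}(u) = u \,\{ \tau - I(u < 0) \}.
```

Because each level τ is fitted separately, nothing in this objective prevents the fitted quantile curves from crossing, whereas quantiles derived from a single fitted conditional distribution are ordered by construction.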

Conclusion
Based on the collected data, our model showed that the incubation period of COVID-19 varies with the age of the infected cases. Specifically, the incubation period of the young and the old tends to be longer than that of middle-aged people. These findings enable more precise measures to be taken rather than fixed ones, and thus may be helpful for disease control and prevention efforts. For example, a personalized quarantine duration, namely shorter for middle-aged people and longer for the young and the old, can be adopted for individuals of different ages instead of a fixed 14 days. People aged 23∼55 are a major part of the labor force and account for the largest proportion of the population. Therefore, such measures may help prevention while having a limited impact on social and economic development.