Time-frame of human endemic coronavirus emergence
As mentioned above, all endemic coronaviruses were estimated to have recently emerged as human pathogens 2, 28, 30, 34, 35, 37, 38. However, besides being generally based on a limited number of sequences, most previous analyses did not include some of the viruses that are now recognized to be closely related to endemic human coronaviruses (e.g., the dromedary camel alphacoronaviruses related to HCoV-229E). Also, it is now recognized that the presence of recombination, the lack of a temporal signal in the sequence data, and the pervasive effect of purifying selection can affect molecular dating 39-41. Accounting for these effects has become common practice only in recent years. Thus, we decided to reassess the timing of endemic coronavirus emergence and to date the time when circulating strains last shared a common ancestor. To this purpose, we retrieved all available sequences with known sampling date for the four coronaviruses (HCoV-OC43, n=167; HCoV-229E, n=31; HCoV-NL63, n=37; HCoV-HKU1, n=68) (Supplementary Table 1). The animal viruses most closely related to each human coronavirus were also included in the analyses.
Because recombination is known to be frequent in all coronavirus genera 42-44, we used 3SEQ to identify recombination events, which were detected in all datasets (Fig. 1) 45. Based on the location of breakpoint positions, we then selected the longest non-recombining region for each alignment. Specifically, we obtained relatively long regions for HCoV-OC43, HCoV-229E, and HCoV-HKU1, whereas two short regions (NL63_reg1 and NL63_reg2) were available for the HCoV-NL63 alignment (Fig.1). For all the selected non-recombining regions, maximum likelihood phylogenetic trees were constructed and we checked for the presence of a temporal signal by performing regression of root-to-tip genetic distances against sampling dates. These analyses indicated a strong temporal signal for all regions, with the exclusion of NL63_reg1 and of the HCoV-HKU1 long region (Fig. 1). In this latter case, the lack of a temporal signal is most likely due to the short time span among virus sampling dates, with the earliest sequences collected in 2003 (Supplementary Table 1).
Before performing molecular dating, we evaluated whether natural selection strongly affected branch length estimates in the viral phylogenies. In fact, it is now recognized that purifying selection and saturation effects contribute to the time-dependent substitution rate variation in viruses, which in turn affects molecular dating 40, 46. We thus estimated branch lengths using the aBS-REL (adaptive branch-site random effects likelihood) model, which accounts for different selective pressures among lineages and is relatively robust to substitution saturation 47. For all phylogenies, branch lengths estimated with aBS-REL were comparable to those obtained with a GTR (General Time Reversible) model (Fig. 1), suggesting that molecular dating can be performed with minor effects related to the time dependency of substitution rates.
Thus, for the three phylogenies (HCoV-OC43, HCoV-229E, and NL63_reg2) showing a temporal signal, we used a Bayesian approach to estimate substitution rates and time-measured evolutionary histories. Substitution rates in the range of 1.6x10-4 to 1.5x10-3 were obtained, in line with previous analyses 43. For the HCoV-HKU1 phylogeny, date estimates were obtained by using the substitution rate of HCoV-OC43 (another betacoronavirus) as a prior. For the circulating strains of all coronaviruses, we obtained similar tMRCA (time to the most recent common ancestor) estimates, which ranged from 71 (HCoV-229E) to 55 (HCoV-HKU1) years ago (Fig. 2, Supplementary Fig. 1). The splits of human coronaviruses from their most closely related animal viruses were more variable. Specifically, we estimated that HCoV-OC43 split from the bovine coronavirus (BCoV) lineage around 1891 (IC:1876-1905), whereas HCoV-229E separated from the camel alphacoronavirus in the 18th century (1754, IC:1713-1791) (Fig. 2). Because bovines and camels are plausible zoonotic sources for the human infections, these split dates may be considered as good proxies for the time when HCoV-OC43 and HCoV-229E entered human populations. The long time span separating the split times and the tMRCAs are most likely accounted for either by extinct ancestral lineages or by unsampled viral diversity. With respect to HCoV-HKU1 and HCoV-NL63, they were estimated to have diverged from the related rodent or bat viruses in 1651 (IC:1469-1793) and 1871 (IC:528-1984), respectively (Supplementary Fig. 1). However, confidence intervals for HCoV-NL63 were very large. This effect is most likely due to the short non-recombining region we used for dating and indicates that the split from the bat virus can be estimated with substantial uncertainty. Also, because there is no indication whether rodents and bats represented the hosts from which the spillovers occurred, the time when HCoV-NL63 and HCoV-HKU1 entered human populations remains highly uncertain. Nonetheless, these data support a recent introduction of endemic coronaviruses in human populations and indicate that all zoonotic transmissions occurred in a time frame ranging from ~500 to 55 years ago (Fig. 3).
Human coronavirus emergence in the context of viral outbreaks
The molecular dating analyses reported above indicate that all endemic human coronaviruses emerged as human pathogens earlier than 55 years ago, and most likely in a more distant past (Fig. 3). This implies that, at least between ~ 1965 and 2002 (when SARS-CoV appeared), no coronavirus gained the ability to spread widely in our species. Thus, the pattern of coronavirus emergence seems to be highly irregular and to have intensified in recent years. To compare the timing of coronavirus emergence to that of another respiratory virus, we recorded the occurrence of known influenza pandemics since 1500. As previously noted, this pattern is also irregular and there is no clear relationship between pandemic occurrence and human population size 48. Notably, influenza pandemics were also uncommon from the 70s to 2009 (when H1N1pdm09 caused a pandemic) (Fig. 3). Overall, these data are not in full agreement with the notion that emerging infectious diseases (EIDs) have progressively increased in frequency in the last decades, with a peak in the 80s (possibly secondary to AIDS diffusion) 49. However, this might be due to chance and to the fact that we only analyzed few viruses. We thus retrieved data from the original work that analyzed EID occurrence (from 1940 to 2004). Specifically, Jones and co-workers defined the timing of EIDs based on the first description of the original case(s) of a novel human infectious disease 49. We focused on viral EIDs and we separated zoonotic diseases that are not transmitted by an arthropod vector, vector-borne diseases, and non-zoonotic viruses. Results did not suggest a strong increasing trend with time, with the exclusion of a higher frequency of zoonoses in 1990-2000 (Fig. 4b,d,f).
We next explored the annual occurrence of viral disease outbreaks as recorded in GIDEON (Global Infectious Diseases and Epidemiology Network), the most comprehensive epidemiology database at a global scale 50. We removed viruses for which an effective vaccine has been developed from the total record of viral disease outbreaks, to avoid geographic and timing biases. Diseases that are mainly sexually transmitted (e.g., AIDS, hepatitis B) or whose spread was largely due to the use of human-contaminated material (e.g., AIDS, hepatitis C) were also excluded. This originated a dataset of 1252 outbreaks for 69 viral diseases. We considered either all outbreaks (that involved humans and/or animals) or outbreaks with at least 50 recorded human cases that occurred between 1945 and 2017 (to avoid border effects). For both arthropod-borne and non-arthropod-borne zoonoses (but not for non-zonotic viruses), we observed a tendency to increase with time, which was particularly evident only since ~1995-2000 (Fig. 4a,c,e). However, these trends are likely to be strongly influenced by increased reporting efforts and the small number of outbreaks in the 40s and 50s is in line with previous observations that the number of discovered viruses abruptly increased around 1954, roughly corresponding to the advent of tissue culture techniques for virus detection 51. As previously suggested 49, we thus used the annual number of articles published in the Journal of Infectious Diseases (JID) to account for this effect (Fig. 4g). After Bonferroni correction for multiple tests and controlling for reporting effort (see Materials and Methods), no significant increase with time was observed for zoonotic disease outbreaks (Fig. 4a). A significant decrease with time was instead observed for non-zoonotic diseases and for all outbreaks of arthropod-borne diseases (Fig. 4c,e).
Overall, these data suggest that the timing of large epidemics is episodic and seems to be unrelated from human population growth. In general, the number of outbreaks of zoonotic and non-zoonotic viral diseases has not increased significantly over the last 80 years.
Recent and ongoing evolution of HCoV-OC43 and HCoV-229E
To investigate the selective patterns acting on the coding regions of HCoV-OC43 and HCoV-229E since their separation from bovine/camel viruses, we applied gammaMap 52, a method that combines analysis of within-population variation and divergence from an outgroup to estimate codon-wise selection coefficients (γ).
Coronaviruses have large and complex genomes which encode 16 non-structural (nsps) and four structural proteins (spike, envelope, membrane, and nucleoprotein), as well as a variable number of accessory molecules. Embecoviruses (e.g., HCoV-OC43, HCoV-HKU1, and BCoV) encode an additional structural protein, a hemagglutinin-esterase (HE) which serves as a receptor-destroying enzyme 2, 53. In line with data on several other viruses 47, 54-56, we found that most codons evolved under strong to moderate purifying selection (γ < -5) (Supplementary Fig. 2). However, sites with robust evidence of positive selection (posterior probability > 0.75 of γ ≥ 1) could also be detected. The majority of these sites are located in a restricted number of proteins with mainly structural functions (Fig. 5, 6, and Supplementary Fig. 3). Whereas most selected sites in the spike proteins and in HE are polymorphic in circulating viral populations, those located in other regions are not (Supplementary Table 2). Importantly, the positively selected sites in the spike proteins of both HCoV-OC43 and HCoV-229E are clustered within regions that interact with the cellular receptors (9-O-acetylated sialoglycans and aminopeptidase N, ANPEP) and that were previously shown to modulate binding (Fig. 5 and 6). Thus, most positively selected sites in HCoV-229E map to the three loops that contact human ANPEP (hANPEP) (see below) 57, 58. Likewise, several positively selected sites are located within the sialoglycan-binding site of the HCoV-OC43 spike protein, and changes at sites 22 and 24 in other embecoviruses largely affect binding affinity 59. Similarly, the HE positively selected sites map to the lectin domain. Mutations at positions 114, 177, and 178 determine the loss of sialoglycan binding, which is thought to have contributed to the shift to the human host 60. This indicates that gammaMap reliably identified relevant selection signatures.
Analysis of the polymorphic positively selected sites in the S and HE proteins of viruses sampled at different time intervals indicated that the evolution of HCoV-OC43 and HCoV-229E is ongoing and that new amino acid combinations have progressively emerged (Fig. 5c and 6c). Indeed, the amino acid status at the positively selected sites broadly corresponds to the receptor binding domain (RBD) classes of HCoV-229E and to HCoV-OC43 genotypes. Based on our dating analyses, the most recent RBD class of HCoV-229E (class VI) and the HCoV-OC43 genotypes (F/G/H) emerged ~ 15 years ago (Fig. 2). Interestingly, analysis of the RBD region in 92 BCoV sequences revealed limited variability with no clear temporal pattern (Supplementary Fig. 4). The same comparison could not be performed for camelid viruses as most of them were sampled in 2014-2015.
The HCoV-229E spike protein evolves to optimize receptor binding, not as a result of antigenic drift
Because, in analogy to SARS-CoV-2, HCoV-229E binds a protein receptor, we further investigated the positively selected sites in the spike protein. The specificity of HCoV-229E for hANPEP was previously ascribed to an extended tandem of H-bonds involving the 314-320 segment of RBD loop 1 and the 287-292 portion of a beta-strand surface-exposed of hANPEP domain II 57. Most of these interactions involve backbone atoms, reducing the dependency on sequence variations. In fact, the camel alphacoronavirus can use hANPEP as a receptor 32. It was however suggested that changes in loop regions might accommodate species-specific differences among ANPEP orthologs and optimize receptor binding affinity 58. We thus compared the HCoV-229E RBD crystal structure and the corresponding model for camel alphacoronavirus (Fig. 7a). We also modeled camel ANPEP (cANPEP) based on the structure of the human ortholog. Overall, cANPEP features fewer charged residues at the interface than the human protein. In particular, T287 and I314 are replaced by D288 and D315 in the human receptor, whereas G291 is replaced by K292. Analysis of the contact interface indicated that the positively selected sites 316 (R or K, depending on RBD class), 407 (S in class I, H in classes V and VI), and 408 (K in classes I, V and VI) contribute additional interactions with the human protein than those established by the camel virus (Fig. 7a). These are made possible by the presence of the charged residues in the human receptor. Overall, these observations suggest that HCoV-229E can interact with hANPEP more efficiently than the camel virus and that positively selected sites contribute to increase affinity.
Previous investigations showed that the affinity of the six RBD classes of HCoV-229E for hANPEP varies in a range of Kd from ~430 (class I) to ~30 nM (classes V and VI) (Fig. 7c) 57. In particular, a strong increase in affinity is observed for classes V and VI. Some of the positively selected sites contribute to this increased affinity by changing loop conformation and by establishing additional interactions (Supplementary Fig. 5). For instance, H407 in classes V and VI forms an additional polar interaction with the spatially close D315 of hANPEP, and K408 in the same classes intercepts the E291 backbone in the receptor. (Fig. 7a).
Variations in the RBD loops, which progressively emerged over the last 50 years (Fig. 7c), were previously proposed to derive from immune selection 57. Inspection of the IEDB database revealed that no experimental epitope for the spike protein of HCoV-229E has been described. We thus used the sequences of RBDs belonging to different classes to predict epitope positions for using BepiPred-2. Results indicated that epitopes do differ among RBD classes (Supplementary Table 3) and map to different structural regions (Fig. 7b) (data for HCoV-OC43 are shown in Supplementary Fig. 6). This is in line with the observation that antibodies against classes I and IV show no cross-neutralization and that HCoV-229E is undergoing antigenic drift 57, 58. Nonetheless, the hypothesis of antigenic drift is difficult to reconcile with the evidence that reinfection with HCoV-229E is common and humoral immunity might be short-lived 22-25. To clarify these issues, we used an extended set of spike protein sequences to date the temporal emergence of RBD classes. Results indicated that classes II, III, and IV, which have about two-fold higher affinity than class I, emerged 1-8 years apart. However, since the appearance of class V (with much higher affinity) about 44 years ago, no class emerged for 26 years (Fig. 7c). In fact, class VI split from class V about 17 years ago and the two classes show very similar sequence and binding properties. These different time intervals are poorly consistent with antigenic drift. Instead, these results suggest that strains with higher affinity for the cellular receptor have out-competed strains with lower affinity, and that HCoV-229E has evolved to optimize binding to the cellular receptor.