Human endemic coronavirus emergence in the context of past and recent zoonotic outbreaks

Understanding the evolutionary dynamics of the four human endemic coronaviruses might provide insight into the future trajectories of SARS-CoV-2 evolution. We re-assessed the timing of endemic coronavirus emergence and we show that all viruses entered human populations in a time-frame ranging from ~500 to 55 years ago. Because the three highly pathogenic coronaviruses (SARS-CoV, MERS-CoV, and SARS-CoV-2) spilled-over in tight temporal succession, the pattern of coronavirus emergence, in analogy to that of inuenza pandemics, is highly irregular. To contextualize this observation in a wider perspective of viral diseases emergence since 1945, we mined epidemiology database information. After controlling for reporting bias, we nd that, contrary to widespread beliefs, the occurrence of viral diseases (either zoonotic or not) has not increased over the last decades. Analysis of the recent and ongoing evolution of HCoV-229E and HCoV-OC43 indicated that positive selection most likely contributed to ne-tune the interaction with the human interferon/inammatory response. Conversely, integration of evolutionary inference and molecular dating provided evidence that these viruses are not undergoing antigenic drift and the temporal emergence of spike protein variants is best explained by optimization of receptor binding anity. These data provide a fresh look on viral disease emergence and on coronavirus evolution.


Introduction
Coronaviruses (order Nidovirales, family Coronaviridae, subfamily Coronavirinae) are a diverse group of positive-sense, single stranded RNA viruses with high zoonotic potential [1][2][3] . In 2002, a highly pathogenic coronavirus, severe acute respiratory syndrome coronavirus (SARS-CoV), spilled-over from palm civets to humans and caused ~8,000 cases in several countries 4 . These events were followed by the appearance, in 2012, of Middle East respiratory syndrome coronavirus (MERS-CoV), a camel-derived pathogen that caused multiple outbreaks of respiratory disease mainly in the Arabic Peninsula 5 . Containment and surveillance strategies allowed the control of these viruses, which have never (SARS-CoV) or only occasionally (MERS-CoV) reappeared in human populations 6 . However, at the end of 2019, a novel coronavirus, designated SARS-CoV-2 by the ICTV 7 , emerged in China and is now recognized as the cause of COVID-19 8 . The virus rapidly spread worldwide and the World Health Organization declared the SARS-CoV-2 pandemic in early March 2020. Most likely, SARS-CoV-2 originated and evolved in bats, eventually spilling over to humans, either directly or through an intermediate host [9][10][11][12][13][14][15] . To date, more than 72 million COVID-19 cases have been con rmed (https://covid19.who.int/, as of 18 th December 2020), suggesting that, until an effective vaccination campaign is implemented, the virus will continue to circulate among people and, possibly, other animals [16][17][18][19] .
The epidemic behavior of SARS-CoV, MERS-CoV, and SARS-CoV-2, as well as their clinical severity, have clearly raised awareness on the potential danger posed by coronaviruses, which were considered relatively harmless to humans before 2002. In fact, four other coronaviruses (HCoV-OC43, HCoV-HKU1, HCoV-NL63, and HCoV-229E), sometimes referred to as "common cold coronaviruses", have been circulating in human populations for decades, usually causing mild symptoms 2,20,21 . All these viruses are seasonal and generate short-term immunity, with reinfections being common within one year [22][23][24][25] .
Phylogenetic analyses indicated that bats most likely represent the animal reservoirs from which the HCoV-NL63 and HCoV-229E alphacoronaviruses emerged [27][28][29][30] . It is presently unknown whether HCoV-NL63 was transmitted to humans via an intermediate host, as the most closely related viruses were detected in bats from Kenya 27 . Conversely, viruses highly similar to HCoV-229E were identi ed in camelids (dromedary camels and alpacas), strongly suggesting that, in analogy to MERS-CoV, these animals represented the zoonotic source of human infection 2, 3, 31-33 . The other two endemic coronaviruses, HCoV-OC43 and HCoV-HKU1, belong to the Betacoronavirus genus and most likely have their animal reservoirs in rodents 2,3 . Whereas it is widely accepted that bovines were the intermediate hosts mediating the transmission of HCoV-OC43 to humans, the zoonotic source of HCoV-HKU1 is presently unknown 2,3,26,34,35 . Given the commensal behavior of several rodents, it cannot be excluded that the virus was directly transmitted to our species by mice or related animals.
Although with some controversy 36 , most previous estimates indicated that the endemic coronaviruses entered human populations in the last 1000 years 2,28,30,34,35,37,38 . However, early analyses were often based on a small number of sequences and did not control for effects that are now recognized to affect molecular dating. Also, little is known about the past and ongoing selective events that accompanied the emergence and spread of endemic coronaviruses in human populations. In the wake of the COVID-19 pandemic, a better understanding of the evolutionary dynamics of endemic coronaviruses, as well as of the tendency of viral disease emergence, might provide valuable insight into the possible trajectories of SARS-CoV-2 evolution, especially in case the virus should become endemic.

Results
Time-frame of human endemic coronavirus emergence As mentioned above, all endemic coronaviruses were estimated to have recently emerged as human pathogens 2,28,30,34,35,37,38 . However, besides being generally based on a limited number of sequences, most previous analyses did not include some of the viruses that are now recognized to be closely related to endemic human coronaviruses (e.g., the dromedary camel alphacoronaviruses related to HCoV-229E).
Also, it is now recognized that the presence of recombination, the lack of a temporal signal in the sequence data, and the pervasive effect of purifying selection can affect molecular dating [39][40][41] . Accounting for these effects has become common practice only in recent years. Thus, we decided to reassess the timing of endemic coronavirus emergence and to date the time when circulating strains last shared a common ancestor. To this purpose, we retrieved all available sequences with known sampling date for the four coronaviruses (HCoV-OC43, n=167; HCoV-229E, n=31; HCoV-NL63, n=37; HCoV-HKU1, n=68) (Supplementary Table 1). The animal viruses most closely related to each human coronavirus were also included in the analyses.
Because recombination is known to be frequent in all coronavirus genera [42][43][44] , we used 3SEQ to identify recombination events, which were detected in all datasets ( Fig. 1) 45 . Based on the location of breakpoint positions, we then selected the longest non-recombining region for each alignment. Speci cally, we obtained relatively long regions for HCoV-OC43, HCoV-229E, and HCoV-HKU1, whereas two short regions (NL63_reg1 and NL63_reg2) were available for the HCoV-NL63 alignment (Fig.1). For all the selected nonrecombining regions, maximum likelihood phylogenetic trees were constructed and we checked for the presence of a temporal signal by performing regression of root-to-tip genetic distances against sampling dates. These analyses indicated a strong temporal signal for all regions, with the exclusion of NL63_reg1 and of the HCoV-HKU1 long region (Fig. 1). In this latter case, the lack of a temporal signal is most likely due to the short time span among virus sampling dates, with the earliest sequences collected in 2003 (Supplementary Table 1).
Before performing molecular dating, we evaluated whether natural selection strongly affected branch length estimates in the viral phylogenies. In fact, it is now recognized that purifying selection and saturation effects contribute to the time-dependent substitution rate variation in viruses, which in turn affects molecular dating 40,46 . We thus estimated branch lengths using the aBS-REL (adaptive branchsite random effects likelihood) model, which accounts for different selective pressures among lineages and is relatively robust to substitution saturation 47 . For all phylogenies, branch lengths estimated with aBS-REL were comparable to those obtained with a GTR (General Time Reversible) model ( Fig. 1), suggesting that molecular dating can be performed with minor effects related to the time dependency of substitution rates.
Thus, for the three phylogenies (HCoV-OC43, HCoV-229E, and NL63_reg2) showing a temporal signal, we used a Bayesian approach to estimate substitution rates and time-measured evolutionary histories.
Substitution rates in the range of 1.6x10 -4 to 1.5x10 -3 were obtained, in line with previous analyses 43 . For the HCoV-HKU1 phylogeny, date estimates were obtained by using the substitution rate of HCoV-OC43 (another betacoronavirus) as a prior. For the circulating strains of all coronaviruses, we obtained similar tMRCA (time to the most recent common ancestor) estimates, which ranged from 71 (HCoV-229E) to 55 (HCoV-HKU1) years ago (Fig. 2, Supplementary Fig. 1). The splits of human coronaviruses from their most closely related animal viruses were more variable. Speci cally, we estimated that HCoV-OC43 split from the bovine coronavirus (BCoV) lineage around 1891 (IC:1876-1905), whereas HCoV-229E separated from the camel alphacoronavirus in the 18 th century (1754, IC:1713-1791) (Fig. 2). Because bovines and camels are plausible zoonotic sources for the human infections, these split dates may be considered as good proxies for the time when HCoV-OC43 and HCoV-229E entered human populations. The long time span separating the split times and the tMRCAs are most likely accounted for either by extinct ancestral lineages or by unsampled viral diversity. With respect to HCoV-HKU1 and HCoV-NL63, they were estimated to have diverged from the related rodent or bat viruses in 1651 (IC:1469-1793) and 1871 (IC:528-1984), respectively (Supplementary Fig. 1). However, con dence intervals for HCoV-NL63 were very large. This effect is most likely due to the short non-recombining region we used for dating and indicates that the split from the bat virus can be estimated with substantial uncertainty. Also, because there is no indication whether rodents and bats represented the hosts from which the spillovers occurred, the time when HCoV-NL63 and HCoV-HKU1 entered human populations remains highly uncertain.
Nonetheless, these data support a recent introduction of endemic coronaviruses in human populations and indicate that all zoonotic transmissions occurred in a time frame ranging from ~500 to 55 years ago (Fig. 3).

Human coronavirus emergence in the context of viral outbreaks
The molecular dating analyses reported above indicate that all endemic human coronaviruses emerged as human pathogens earlier than 55 years ago, and most likely in a more distant past (Fig. 3). This implies that, at least between ~ 1965 and 2002 (when SARS-CoV appeared), no coronavirus gained the ability to spread widely in our species. Thus, the pattern of coronavirus emergence seems to be highly irregular and to have intensi ed in recent years. To compare the timing of coronavirus emergence to that of another respiratory virus, we recorded the occurrence of known in uenza pandemics since 1500. As previously noted, this pattern is also irregular and there is no clear relationship between pandemic occurrence and human population size 48 . Notably, in uenza pandemics were also uncommon from the 70s to 2009 (when H1N1pdm09 caused a pandemic) (Fig. 3). Overall, these data are not in full agreement with the notion that emerging infectious diseases (EIDs) have progressively increased in frequency in the last decades, with a peak in the 80s (possibly secondary to AIDS diffusion) 49 . However, this might be due to chance and to the fact that we only analyzed few viruses. We thus retrieved data from the original work that analyzed EID occurrence (from 1940 to 2004). Speci cally, Jones and co-workers de ned the timing of EIDs based on the rst description of the original case(s) of a novel human infectious disease 49 . We focused on viral EIDs and we separated zoonotic diseases that are not transmitted by an arthropod vector, vector-borne diseases, and non-zoonotic viruses. Results did not suggest a strong increasing trend with time, with the exclusion of a higher frequency of zoonoses in 1990-2000 (Fig. 4b,d,f).
We next explored the annual occurrence of viral disease outbreaks as recorded in GIDEON (Global Infectious Diseases and Epidemiology Network), the most comprehensive epidemiology database at a global scale 50 . We removed viruses for which an effective vaccine has been developed from the total record of viral disease outbreaks, to avoid geographic and timing biases. Diseases that are mainly sexually transmitted (e.g., AIDS, hepatitis B) or whose spread was largely due to the use of humancontaminated material (e.g., AIDS, hepatitis C) were also excluded. This originated a dataset of 1252 outbreaks for 69 viral diseases. We considered either all outbreaks (that involved humans and/or animals) or outbreaks with at least 50 recorded human cases that occurred between 1945 and 2017 (to avoid border effects). For both arthropod-borne and non-arthropod-borne zoonoses (but not for nonzonotic viruses), we observed a tendency to increase with time, which was particularly evident only since 1995-2000 (Fig. 4a,c,e). However, these trends are likely to be strongly in uenced by increased reporting efforts and the small number of outbreaks in the 40s and 50s is in line with previous observations that the number of discovered viruses abruptly increased around 1954, roughly corresponding to the advent of tissue culture techniques for virus detection 51 . As previously suggested 49 , we thus used the annual number of articles published in the Journal of Infectious Diseases (JID) to account for this effect (Fig.   4g). After Bonferroni correction for multiple tests and controlling for reporting effort (see Materials and Methods), no signi cant increase with time was observed for zoonotic disease outbreaks (Fig. 4a). A signi cant decrease with time was instead observed for non-zoonotic diseases and for all outbreaks of arthropod-borne diseases (Fig. 4c,e).
Overall, these data suggest that the timing of large epidemics is episodic and seems to be unrelated from human population growth. In general, the number of outbreaks of zoonotic and non-zoonotic viral diseases has not increased signi cantly over the last 80 years.

Recent and ongoing evolution of HCoV-OC43 and HCoV-229E
To investigate the selective patterns acting on the coding regions of HCoV-OC43 and HCoV-229E since their separation from bovine/camel viruses, we applied gammaMap 52 , a method that combines analysis of within-population variation and divergence from an outgroup to estimate codon-wise selection coe cients (γ).
Coronaviruses have large and complex genomes which encode 16 non-structural (nsps) and four structural proteins (spike, envelope, membrane, and nucleoprotein), as well as a variable number of accessory molecules. Embecoviruses (e.g., HCoV-OC43, HCoV-HKU1, and BCoV) encode an additional structural protein, a hemagglutinin-esterase (HE) which serves as a receptor-destroying enzyme 2, 53 . In line with data on several other viruses 47, 54-56 , we found that most codons evolved under strong to moderate purifying selection (γ < -5) ( Supplementary Fig. 2). However, sites with robust evidence of positive selection (posterior probability > 0.75 of γ ≥ 1) could also be detected. The majority of these sites are located in a restricted number of proteins with mainly structural functions (  Table 2). Importantly, the positively selected sites in the spike proteins of both HCoV-OC43 and HCoV-229E are clustered within regions that interact with the cellular receptors (9-O-acetylated sialoglycans and aminopeptidase N, ANPEP) and that were previously shown to modulate binding ( Fig. 5 and 6). Thus, most positively selected sites in HCoV-229E map to the three loops that contact human ANPEP (hANPEP) (see below) 57,58 . Likewise, several positively selected sites are located within the sialoglycan-binding site of the HCoV-OC43 spike protein, and changes at sites 22 and 24 in other embecoviruses largely affect binding a nity 59 . Similarly, the HE positively selected sites map to the lectin domain. Mutations at positions 114, 177, and 178 determine the loss of sialoglycan binding, which is thought to have contributed to the shift to the human host 60 . This indicates that gammaMap reliably identi ed relevant selection signatures.
Analysis of the polymorphic positively selected sites in the S and HE proteins of viruses sampled at different time intervals indicated that the evolution of HCoV-OC43 and HCoV-229E is ongoing and that new amino acid combinations have progressively emerged ( Fig. 5c and 6c). Indeed, the amino acid status at the positively selected sites broadly corresponds to the receptor binding domain (RBD) classes of HCoV-229E and to HCoV-OC43 genotypes. Based on our dating analyses, the most recent RBD class of HCoV-229E (class VI) and the HCoV-OC43 genotypes (F/G/H) emerged ~ 15 years ago (Fig. 2). Interestingly, analysis of the RBD region in 92 BCoV sequences revealed limited variability with no clear temporal pattern (Supplementary Fig. 4). The same comparison could not be performed for camelid viruses as most of them were sampled in 2014-2015.
The HCoV-229E spike protein evolves to optimize receptor binding, not as a result of antigenic drift Because, in analogy to SARS-CoV-2, HCoV-229E binds a protein receptor, we further investigated the positively selected sites in the spike protein. The speci city of HCoV-229E for hANPEP was previously ascribed to an extended tandem of H-bonds involving the 314-320 segment of RBD loop 1 and the 287-292 portion of a beta-strand surface-exposed of hANPEP domain II 57 . Most of these interactions involve backbone atoms, reducing the dependency on sequence variations. In fact, the camel alphacoronavirus can use hANPEP as a receptor 32 . It was however suggested that changes in loop regions might accommodate species-speci c differences among ANPEP orthologs and optimize receptor binding a nity 58 . We thus compared the HCoV-229E RBD crystal structure and the corresponding model for camel alphacoronavirus (Fig. 7a). We also modeled camel ANPEP (cANPEP) based on the structure of the human ortholog. Overall, cANPEP features fewer charged residues at the interface than the human protein. In particular, T287 and I314 are replaced by D288 and D315 in the human receptor, whereas G291 is replaced by K292. Analysis of the contact interface indicated that the positively selected sites 316 (R or K, depending on RBD class), 407 (S in class I, H in classes V and VI), and 408 (K in classes I, V and VI) contribute additional interactions with the human protein than those established by the camel virus (Fig.  7a). These are made possible by the presence of the charged residues in the human receptor. Overall, these observations suggest that HCoV-229E can interact with hANPEP more e ciently than the camel virus and that positively selected sites contribute to increase a nity.
Previous investigations showed that the a nity of the six RBD classes of HCoV-229E for hANPEP varies in a range of K d from ~430 (class I) to ~30 nM (classes V and VI) (Fig. 7c) 57 . In particular, a strong increase in a nity is observed for classes V and VI. Some of the positively selected sites contribute to this increased a nity by changing loop conformation and by establishing additional interactions ( Supplementary Fig. 5). For instance, H407 in classes V and VI forms an additional polar interaction with the spatially close D315 of hANPEP, and K408 in the same classes intercepts the E291 backbone in the receptor. (Fig. 7a).
Variations in the RBD loops, which progressively emerged over the last 50 years (Fig. 7c), were previously proposed to derive from immune selection 57 . Inspection of the IEDB database revealed that no experimental epitope for the spike protein of HCoV-229E has been described. We thus used the sequences of RBDs belonging to different classes to predict epitope positions for using BepiPred-2. Results indicated that epitopes do differ among RBD classes (Supplementary Table 3) and map to different structural regions (Fig. 7b) (data for HCoV-OC43 are shown in Supplementary Fig. 6). This is in line with the observation that antibodies against classes I and IV show no cross-neutralization and that HCoV-229E is undergoing antigenic drift 57,58 . Nonetheless, the hypothesis of antigenic drift is di cult to reconcile with the evidence that reinfection with HCoV-229E is common and humoral immunity might be short-lived [22][23][24][25] . To clarify these issues, we used an extended set of spike protein sequences to date the temporal emergence of RBD classes. Results indicated that classes II, III, and IV, which have about two-fold higher a nity than class I, emerged 1-8 years apart. However, since the appearance of class V (with much higher a nity) about 44 years ago, no class emerged for 26 years (Fig. 7c). In fact, class VI split from class V about 17 years ago and the two classes show very similar sequence and binding properties. These different time intervals are poorly consistent with antigenic drift. Instead, these results suggest that strains with higher a nity for the cellular receptor have out-competed strains with lower a nity, and that HCoV-229E has evolved to optimize binding to the cellular receptor.

Discussion
Zoonotic diseases have been constantly emerging during human history, accounting for a large number of outbreaks, epidemics and pandemics. In recent years, several emerging and re-emerging pathogens have spread locally or globally. In addition to the three pathogenic coronaviruses, we have experienced the H1N1 pandemic (2009), as well as two pandemics caused by arthropod-borne viruses (chikungunya virus in 2014 and Zika virus in 2015) 61 . A cross-country Ebola epidemic has caused thousands of deaths in 2014-2016 in West Africa (https://www.who.int/health-topics/ebola/), and Nigeria reported the largest ever number of Lassa fever cases in 2018 (https://www.who.int/csr/don/23-march-2018-lassa-fevernigeria/). These events have led to the conclusion that the occurrence of viral zoonoses is escalating [61][62][63][64][65][66][67] . However, a survey of emerging infectious diseases (EIDs) caused by zoonotic viruses indicates no remarkable increase over eight decades and the occurrence of in uenza pandemics in the last 3 centuries reveals no temporal pattern, nor an association with human population size ( Fig. 3 and 4). Our data on the timing of endemic coronavirus emergence show that the four viruses entered human populations more than 50 years ago (and most likely much earlier). This indicates that, for several decades, no coronavirus spilled over to humans (or at least caused a registered outbreak) until three highly pathogenic coronaviruses emerged in tight temporal succession. To contextualize this observation in a wider perspective of viral diseases emergence over the last 8 decades, we mined data from GIDEON, a comprehensive epidemiology database. Because they have very different ecological characteristics, we separately analyzed zoonoses that, in analogy to coronaviruses, are not transmitted by arthropod vectors, arthropod-borne infections, and non-zoonotic diseases. An analysis of raw data -i.e. the number of outbreaks (irrespective of the number of cases) and the number of sizable (more than 50 human cases) outbreaks per year -showed a weak tendency for zoonoses to increase over time, with a possible upswing since 1995. However, this trend was abolished, if not reversed, when outbreak number was corrected for a measure of reporting effort, indicating that the occurrence of viral diseases has not been increasing over the last decades.
Of course, the data used to reach this conclusion have a number of limitations. First, outbreak counts are clearly subject to reporting biases depending on geographical location, misdiagnosis (especially before the widespread use of molecular testing), and concurrent historical events (e.g., wars, famines). Second, zoonotic events resulting in a small number of cases may remain unreported (especially in the past or in developing regions), a bias we tried to overcome by performing the analyses for outbreaks with at least 50 human cases. Third, the annual number of JID articles only represents a proxy for reporting effort.
Whereas it is di cult to weight the relevance of these possible biases, they are unlikely to affect our results to such a degree that we failed to detect a strong increase of outbreaks if it really occurred. Moreover, as mentioned above, no strong escalation of EIDs caused by viruses was observed when we analyzed the data curated by Jones and coworkers, who did not rely on GIDEON 49 . In this respect, it is worth mentioning that our data do not contradict the overall conclusion reached in their work-i.e. that EIDs have risen in frequency since 1940. In fact, Jones and colleagues found that less than 26% of EIDs are caused by viruses, indicating that the increase they observed over time is mostly due to other pathogens 49 . We also stress that our ndings concerning the timing of outbreaks by no means imply that anthropic changes previously associated to infectious disease emergence (e.g., deforestation, climate change, wildlife habitat encroachment, loss of biodiversity) 64, 67-77 have no role in favoring the occurrence of viral zoonoses.
If the three highly pathogenic coronaviruses do not t within a more general trend for zoonotic viruses to increasingly infect human populations, we are left with the question of which factors can explain the irregular pattern of coronavirus emergence. This same issue has remained unanswered for decades in the case of in uenza, as virological and non-virological factors have been associated to the occurrence of pandemics with poor explanatory power [78][79][80][81][82][83][84] . Thus, it is presently impossible to predict the timing of such events and their severity. In general, as noted elsewhere, we have very little ability to anticipate which viruses will emerge and how pathogenic they will be 85,86 . In this respect, it is also worth mentioning that, because they have now circulated in (and adapted to) human populations for decades, if not centuries, we cannot exclude that the endemic coronaviruses were once more pathogenic than they are now. Indeed, it was previously suggested that the 1889-1890 u pandemic, which was characterized by pronounced central nervous system symptoms, was actually caused by HCoV-OC43, as the dates corresponds to the time when the virus entered human populations and HCoV-OC43 displays some neurotropism 34 .
Although we cannot go back in time and infer the original phenotype of endemic coronaviruses, nor can we have a full picture of their ancestral genetic diversity, analysis of their evolution is potentially very informative to understand the future trajectories of SARS-CoV-2, and of coronaviruses in general. Analysis of bat coronaviruses indicated that, in analogy to SARS-CoV, SARS-CoV-2 required limited adaptation to gain the ability to infect and spread in our species 56,87 . As HCoV-OC43 and HCoV-229E most likely emerged from bovine and camelid coronaviruses, we investigated which selective events accompanied the divergence of these human viruses from the animal ones and their diffusion in humans. We note, however, that because of the lack of information on early isolates (as evident in the phylogenetic gaps in Fig. 2), it is formally impossible to distinguish between the initial events associated with the optimization for human infection and the ongoing adaptation resulting from immune selection or other pressures.
Our results indicate that the spike protein and other structural proteins of both viruses represented the major targets of selection. An interesting exception is the strong signature of selection we observed for HCoV-OC43 ORF5 (also known as ns12.9). The encoded protein functions as a viroporin and its deletion reduces viral replication, in ammatory response, and virulence in mouse models 88 . Positive selection also drove the evolution of the membrane proteins of both viruses, as well as of the envelope protein of HCoV-OC43. This latter, besides having structural roles, acts as a viroporin and represents a neurovirulence factor 89,90 . Likewise, the membrane proteins of several coronaviruses, including HCoV-OC43, in addition to their role in virion maturation, are capable of antagonizing interferon responses [91][92][93][94][95] .
Overall, these data suggest that positively selected sites in these proteins might contribute to ne-tune the interaction between coronaviruses and human immune responses.
Clearly, the spike protein, as well as HE in the case of HCoV-OC43, have a major interest as targets of selection, as they represent major determinants of host range and infectivity 1-3 . Most selected sites were found to be located in the receptor binding domains of the spike proteins, as well as in the lectin domain of HE. However, additional sites mapped to other regions of the spike proteins and were mostly xed in frequency. These include three sites in the heptad repeat region of the spike protein of HCoV-229E and one site in the fusion peptide of HCoV-OC43 (Fig. 5 and 6). Notably, the heptad repeat region was previously described as major target of selection in MERS-CoV and related camel viruses 96,97 and variants within this region and/or the fusion peptide were shown to modulate viral tropism and host range in several viruses, including animal coronaviruses 98-101 . Coronaviruses can use very different cellular receptors and their spike proteins display a remarkable ability to adapt to different cellular receptors 2,42 . Embecoviruses such as HCoV-OC43, HCoV-HKU1 and BCoV attach to 9-O-acetylated sialoglycans via the spike protein, with HE acting as a receptor-destroying enzyme 53,59 . Conversely, HCoV-229E and HCoV-NL63, in analogy to other betacoronavirus, use a protein receptor 102,103 . Biochemical and crystallographic analyses indicated that, since the shift to the human host, the spike and HE proteins of HCoV-OC43 have co-evolved to optimize the balance between binding and release from sialoglycans in human airways 60, 104 . We con rm herein the previously observed emergence of spike and HE variants over time and the replacement of earlier variants with the more recent ones (Fig. 5). However, the relative binding a nity and esterase activity of such variants have not been extensively investigated, yet. Conversely, binding assays have shown that different classes of the HCoV-229E spike protein RBD have very different binding a nity for hANPEP. The appearance of variants with increased a nity has clearly occurred progressively in time (Fig. 7c) and changes at the positively selected sites have most likely facilitated the initial spill-over from camels (Fig. 7a).
Whereas these data suggest that HCoV-OC43 and HCoV-229E have been adapting to optimize infection and spread in human populations, an alternative possibility is that the selective pressure towards amino acid replacements is exerted by the host immune system. Indeed, it was previously suggested that antigenic drift was responsible for the emergence of distinct HCoV-229E RBD classes 57 . This assumption was based on the observation that antibodies raised against class I RBD do no neutralized viruses with RBDs belonging to different classes 57,58 . Nonetheless, growing evidence suggests that the humoral immune response against endemic coronaviruses wanes in a few months [22][23][24][25] . As a consequence, natural reinfection is possible and the antibody response can hardly be regarded as a strong selective pressure for these viruses. In line with this observation, three lines of evidence indicate that antigenic drift is not primary responsible for selecting changes within the RBDs of HCoV-OC43 and HCoV-229E. First, limited variation with no temporal pattern was evident in the RBD region of BcoV sequences sampled over 34 years. Second, dating of the emergence of HCoV-229E RBD classes indicated an initial rapid turnover of classes I to IV followed by a 26-year time during which no variant turned up after the emergence of class V. Class V and the closely related class VI RBDs differ in binding a nity from the other classes by almost an order of magnitude 57 . Third, neutralizing antibodies against SARS-CoV, SARS-CoV-2 and MERS-CoV can bind region of the spike protein other than the RBD or can recognize different viral proteins [105][106][107][108] . However, the polymorphic selected sites we detected were almost exclusively located within the RBDs.
These results, together with the remarkable seasonality of endemic coronaviruses, suggest that selection has been acting to optimize biding to the cellular receptor and that strains with increasing a nity have replaced those with lower binding ability and, possibly, lower infectivity. These observations have relevance for the evolution of SARS-CoV-2. Recent work has indicated that the spike protein of SARS-CoV-2 can tolerate a substantial number of substitutions, with some of them even increasing receptor binding 109 . Thus, it is in theory possible that, as observed for HCoV-229E, variants that enhance ACE2 binding will emerge. We however note that even class V and VI RBDs have much lower a nity for hANPEP (K d3 0nM) than most other human coronaviruses to their respective cellular receptors (K d in the range of 1 to 5 nM for SARS-CoV-2, SARS-CoV, and HCoV-NL63) 110,111 . It is thus possible that the selective pressure for increased biding is low for SARS-CoV-2. Nonetheless, even minor changes in binding a nity might confer a selective advantage over other strains, as preliminary data on a newly evolved variant in the UK suggest (https://virological.org/t/preliminary-genomic-characterisation-of-an-emergent-sars-cov-2lineage-in-the-uk-de ned-by-a-novel-set-of-spike-mutations/563).
Whether SARS-CoV-2 will evolve variants that alter its antigenic properties is also, of course, of great interest. Our data on HCoV-229E and HCoV-OC43 suggest that antigenic drift is not a major evolutionary driving force for these viruses and available data on SARS-CoV-2 suggest that variants affecting antigenic properties are present at low frequency in the circulating viral population 112,113 . However, the evolution of the virus is likely to depend on the duration and e cacy of natural immunity. Although immunity against SARS-CoV-2 might be relatively short-lived, additional data over longer time-frames will be required to de nitely address this issue 114 . Moreover, in rare cases of long-term SARS-CoV-2 infection, either or not in association with convalescent plasma treatment, the virus might experience within-host selective pressures that favor the appearance of immune escape variants [115][116][117] . If such variants do not impair viral tness in naive hosts, they may spread in the population, especially if they are associated with changes that increase infectivity, as it seems to be the case for the recently emerged UK variant (https://virological.org/t/preliminary-genomic-characterisation-of-an-emergent-sars-cov-2-lineage-in-theuk-de ned-by-a-novel-set-of-spike-mutations/563). Likewise, the imminent mass deployment of vaccines against SARS-CoV-2 will subject the virus to a selective pressure that the endemic coronaviruses have never experienced.

Sequences and alignments
Complete or almost complete genome sequences for all four endemic coronaviruses were downloaded from the NCBI database (National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/).
Only sequences with known sampling dates were included in the analyses (Supplementary Table 1).The HCoV-OC43 Paris strain was excluded as its sampling date is uncertain 118 . For each human coronavirus, the closest phylogentically related animal virus was also retrieved: camel alphacoronavirus, BCoV, murine coronavirus, and a bat coronavirus from Kenya for HCoV-229E, HCoV-OC43, HCoV-HKU1, and HCoV-NL63, respectively (Supplementary Table 1).
All complete genome sequences with sampling year of BCoV were also retrieved from NCBI, along with all HCoV-229E spike sequences, again with collection date (Supplementary Table 4 and Supplementary  Table 1).

Recombination analysis
Recombination can affect phylogenetic tree branch length estimates and, consequently, molecular evolution analyses 39 . Thus, each coronavirus alignment was tested for evidence of recombination signals using the 3SEQ software (v.1.7) 45 . This method scans a given alignment searching for mosaic recombination signals in all possible sequence triplets. The result is the identi cation of genomic regions in which one of the three sequences is the recombinant (child) of the other two (parental). 3SEQ full scans were run with a recombination signi cance threshold of 0.01. All signi cant recombination events were mapped onto coronavirus alignments and the longest non-recombinant genomic regions, de ned as the genomic region comprised between two recombination breakpoints, were selected for subsequent analyses (Fig. 1).Unique recombination events identi ed in genome alignments are shown in Figure 1. No breakpoint was detected in the extended set of HCoV-229E spike proteins.

Phylogenetic trees and temporal signal
Phylogenetic trees for the non-recombinant regions of all endemic coronaviruses were reconstructed using the phyML software under a general time reversible (GTR) model plus gamma-distributed rates and 4 substitution rate categories 120 . Substitution models were estimated using JmodelTest 2 121,122 .
Internal GTR estimated branch lengths were compared to branch lengths calculated using a model that accounts for different selective pressures among lineages. This model is implemented in the aBSREL (adaptive Branch-Site Random Effects Likelihood 123 ) tool from the HYPHY suite (version 2.5) 124 .
To evaluate whether the non-recombinant genomic regions selected for the analyses carried su cient temporal signal, we calculated the correlation coe cients (r) of regressions of root-to-tip genetic distances against sequence sampling years 125 . We applied a method that minimizes the residual mean squares of the models and calculated p values by performing clustered permutations (1,000) of the sampling dates 125,126 . We considered signi cant a regression with p<0.05. Non-recombinant regions of HCoV-OC43, HCoV-229E, and HCoV-NL63 (region 2) showed evidence of temporal signal (Fig. 1). The same result was obtained for the extended set of HCoV-229E RBD spike proteins ( Supplementary Fig. 7). To select the best-t molecular clock and tree prior, we ran the path sampling tool implemented in BEAST to choose between a constant size, an exponential growth, or a coalescent Bayesian skyline tree prior and between a strict and a relaxed log normal clock (50 steps, 1,000,000 iterations each). A Bayes factor test was applied to compare the different likelihoods (Supplementary Table 4).
A constant size population tree prior with a strict clock model was favored for HCoV-OC43 and HCoV-229E, whereas an exponential growth population tree prior with a relaxed clock model with log-normal distribution was preferred for HCoV-NL63.
We performed two different Markov chain Monte Carlo runs for all four endemic coronaviruses, two hundred million iterations each, and sampled every 10,000 steps after a 10% burn-in. Runs were combined after checking for convergence and for heaving effective sampling sizes >100.

Population genetics-phylogenetic analysis
Selective events that accompanied the appearance of the human viruses were investigated for HCoV-OC43 and HCoV-229E, the two endemic coronaviruses for which the closest related animal virus is almost certain (i.e. the bovine and the camelid coronaviruses) (Supplementary Table 1).
Analyses were performed with gammaMap, that uses intra-speci c variation and inter-speci c diversity to estimate, along coding regions, the distribution of selection coe cients (γ). Thus, for the two coronaviruses, all ORF coding region sequences (cds) were retrieved from the same set of strains analyzed before, all possible overlapping regions were masked and single ORF cds alignments were generated using MAFFT, with codons sequence type as parameter.
In the gammaMap framework, the selection coe cient is de ned as 2PNes, where P is the ploidy, Ne is effective population size, and s is the tness advantage of any amino acid-replacing derived allele. The method categorizes selection coe cients into 12 prede ned classes ranging from -500 (inviable) to 100 (strongly bene cial), with 0 indicating neutrality 52 . We also assumed θ (neutral mutation rate per site), k (transitions/transversions ratio), and T (branch length) to vary within genes following log-normal distributions, whereas p (probability of adjacent codons to share the same selection coe cient) following a log-uniform distribution. Finally, for the selection coe cients, we considered a uniform Dirichlet distribution with the same prior weight for each selection class. We performed 2 runs with 100,000 iterations each and with a thinning interval of 10 iterations. Runs were merged after checking for convergence. Codon positions were de ned as positively selected if they showed a posterior probability > 0.75 of having γ ≥ 1.

EID events, in uenza pandemics, and viral disease outbreaks
The timing of in uenza pandemics was obtained from a previous work 83  We compiled a list of viral disease outbreaks from data stored in the Global Infectious Disease and Epidemiology Network (GIDEON) database 50 (last accessed December 15th, 2020) (Supplementary Table   6). GIDEON is updated monthly, and collates information from a wide range of sources, including Health Ministry publications, peer-review journals, textbooks, as well as the WHO and CDC websites. Information of individual outbreaks was manually inspected to determine whether the overall number of human cases (aggregated over countries in cases of multi-regional outbreaks) was higher than 50 (Supplementary  Table 7).
To analyze the relationship between viral disease outbreaks and time, we tted generalized linear models model with Poisson error using log(number JID articles) as an offset. The number of JID articles was selected as an offset because, as previously indicated 49 , the journal is a major, dedicated venue for the publication of human infectious disease research since 1945. Computations were performed in R version 3.1.2 using the glm function.

Molecular modeling and epitope prediction
The structure of HCoV-229E RBD of class I in complex with hANPEP was retrieved by the Protein Data Bank (PDB ID: 6AKT). Such structure was also used as template to model the interaction between the Sprotein RBD of HCoV-229E camel ortholog and cANPEP, using the webserver HOMCOS 139 . The HOMCOS webserver performs blastp searches 140 to look for complexes formed by proteins, which are homologous to the query proteins. We then selected one of these complexes as a template and launched the program MODELLER 141 that models the interaction between the query proteins with a script provided by HOMCOS. On the basis of sequence similarity at the binding interface, we chose structures 6U7E, 6U7F and 6U7G as templates to model the interaction of RDB Class I-II, Class IV and Class V/VI, respectively, with hANPEP. The same templates have been used to model the sole RBDs using SWISS-MODEL 142 . These RBDs structures were then used to map epitopes on the molecular surface. VADAR 1.8 (Volume, Area, Dihedral Angle Reporter) 143 , was used to assess the accuracy of all models. VADAR uses a combination of more than 15 speci c algorithms to calculate different parameters for each residue and of the overall protein structure. We used such parameters to verify i) the agreement of observed structural parameters (such as φ and ψ dihedral and buried charges) of the newly predicted structures with the expected values calculated on the corresponding sequences and ii) the presence of a low number of packing defects. Structures were then analyzed with the software PyMOL 144 , that was also used to create proteins gures.
Epitope positions were predicted using the BepiPred-2.0 method with default parameters and accessed through the IEDB server (http://tools.iedb.org/bcell/help/#Bepipred-2.0) 145 . Figure 1 Recombination events and temporal signal. Unique recombination events in endemic coronaviruses (left panels). Each event is shown as a line with dots representing the start and the end. The non-recombinant regions used in the analyses are indicated with gray shadows. Schematic representations of coronavirus genomes are also reported. Plots of the root-to-tip distance as a function of sampling years are shown in the central panels. Each point corresponds to a viral sequence and the dotted line is the linear regression calculated using a method that minimizes the residual mean squares. The r coe cient and the corresponding p value are also shown. Right panels report comparisons of branch lengths obtained using the aBS-REL and the GTR models. Each dot represents an internal branch of a phylogenetic tree calculated using the longest non-recombinant regions of each endemic coronavirus.    Glycans are rendered as dark blue spheres. (c) Sequence logos of the three RBD loops. Positively selected sites are shown in orange (irrespective of their amino acid frequency). Sequence logos are grouped by collection date.