Phylogenetic analysis defined two large clades (A and B). Within clade A, 5 clusters belonging to the betacoronavirus (βCoV) genus and 2 clusters belonging to the gammacoronavirus (γCoV) genus were detected (Fig. 1a). The βCoV cluster A7 comprises human SARS-CoV-2 (148 analyzed sequences), 6 pangolin CoVs (Manis javanica) (MP789, PcoV_GX_P5L, P5E, P1E, P4L and P2V) and 2 bat CoVs (RaTG13 and RmYN02; Rhinolophus affinis and R. malayanus, respectively). Within this cluster, we found that SARS-CoV-2 is genetically related to CoV RaTG13, and both share a common ancestor with CoV MP789 (Fig. 1a). This result agrees with previous analyses of the complete genome and of the S protein [7, 25].
Because the tree showed no phylogenetic incongruence, which suggests a lack of recombination between clusters, we focused on the cluster containing SARS-CoV-2 (A7) (Fig. 1a). We analyzed the genealogical relationships between the CoVs that comprise cluster A7. The haplotype network showed a loop between the group of pangolin CoVs and 4 hypothetical ancestral haplotypes (Fig. 1b), suggesting recombination within this cluster. Furthermore, the network suggests that 4 of the isolates analyzed here (SARS-CoV-2, RaTG13, MP789 and RmYN02) diverged from these 4 hypothetical ancestors (Fig. 1b). Although SARS-CoV-2 and RaTG13 CoV share a genomic nucleotide identity of 96.2% [4] and an S gene nucleotide identity of 93.15% [7], the divergence shown in the phylogenetic tree and in the haplotype network rules out a direct parental relationship between these two isolates (Fig. 1a and b).
In 2019, various pangolin CoVs were isolated, among which the isolate MP789 CoV is the most interesting because it shares a nucleotide similarity of 85%-92% with SARS-CoV-2 and 90% with RaTG13 CoV [7]. The similarity analysis of the S nucleotide sequences of cluster A7 shows a mosaic similarity pattern across the S gene between SARS-CoV-2, RaTG13, MP789 and RmYN02, which suggests a probable ancestral genetic exchange between the 4 hypothetical ancestors of these CoVs (Fig. 1c). The most notable differences between the S gene of SARS-CoV-2 and those of the other CoVs were found in the RBD, indicating a hybrid zone between RaTG13 and MP789 CoVs in this region (Fig. 1c). This result suggests a probable ancestral cross-species recombination between bat and pangolin CoVs.
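A mosaic similarity pattern of the kind described above is usually visualized by computing percent identity in sliding windows along an alignment. The following is a minimal sketch of that idea, not the procedure used for Fig. 1c: the function name, the window and step sizes, and the toy sequences are all hypothetical and chosen only so the pattern is easy to see.

```python
def window_identity(query, reference, window=200, step=20):
    """Percent identity between two aligned sequences in sliding windows.

    Both sequences must come from the same alignment (equal length).
    Columns where either sequence has a gap ('-') are ignored.
    Returns a list of (window_start, percent_identity or None).
    """
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(query) - window + 1, step):
        pairs = [
            (q, r)
            for q, r in zip(query[start:start + window],
                            reference[start:start + window])
            if q != "-" and r != "-"
        ]
        if not pairs:
            # Window made only of gap columns: identity is undefined.
            results.append((start, None))
            continue
        matches = sum(q == r for q, r in pairs)
        results.append((start, 100.0 * matches / len(pairs)))
    return results


# Toy aligned sequences (hypothetical, not real CoV data).
# ref_a matches the query on the left, ref_b on the right.
query = "ACGTACGTACGT"
ref_a = "ACGTACGTTTTT"
ref_b = "TTTTACGTACGT"

for name, ref in (("ref_a", ref_a), ("ref_b", ref_b)):
    print(name, window_identity(query, ref, window=4, step=4))
```

In this toy case the query is closest to `ref_a` in the first window and to `ref_b` in the last one, which is the signature that a similarity plot shows when different regions of a gene have different closest relatives.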
The S protein is thought to be under natural selection and to play an important role in cross-species transmission [5, 26–28]. A recent study reported negative selection in the S gene when SARS-CoV-2 was compared with RaTG13 and a group of pangolin CoVs [26]. We performed a McDonald-Kreitman (MK) test between SARS-CoV-2, RaTG13 and MP789. Between SARS-CoV-2 and RaTG13 CoV there were more synonymous (dS) than non-synonymous (dN) substitutions, indicating negative selection (NI>1), whereas between SARS-CoV-2 and MP789 CoV the opposite was found, indicating positive selection (NI<1) (Table 1). The negative selection predicted for SARS-CoV-2 is due to its high similarity to RaTG13 CoV; therefore, the fixation of dN substitutions is not favored. On the other hand, the discrepancies between our pangolin CoV results and those of a previous study [26] could be due to differences in the strategies and methods used.
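For readers unfamiliar with the MK test, the neutrality index is conventionally defined as NI = (Pn/Ps)/(Dn/Ds), where Pn and Ps are non-synonymous and synonymous polymorphisms within a group and Dn and Ds are the corresponding fixed differences between groups; the 2x2 table of these counts is then tested with Fisher's exact test. The sketch below illustrates that calculation under stated assumptions: the function names are hypothetical, the counts are illustrative placeholders and are not taken from Table 1, and the software used for the analysis may differ in details such as corrections for multiple hits.

```python
from math import comb


def neutrality_index(pn, ps, dn, ds):
    """NI = (Pn/Ps) / (Dn/Ds), computed as (Pn*Ds) / (Ps*Dn).

    Returns None when the denominator is zero (NI is undefined).
    """
    if ps * dn == 0:
        return None
    return (pn * ds) / (ps * dn)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties with the observed table are counted.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Illustrative counts only (NOT the values from Table 1).
pn, ps = 2, 42   # polymorphic: non-synonymous, synonymous
dn, ds = 7, 17   # fixed:       non-synonymous, synonymous

ni = neutrality_index(pn, ps, dn, ds)
p_value = fisher_exact_two_sided(dn, ds, pn, ps)
print(f"NI = {ni:.3f}, Fisher p = {p_value:.4f}")
```

With this definition, NI>1 reflects an excess of non-synonymous polymorphism relative to divergence (read as negative selection) and NI<1 an excess of non-synonymous fixed differences (read as positive selection), which is the interpretation used in the text.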
The S protein RBD plays a key role in the infection of human cells by SARS-CoV-2 because it contains the six amino acids (L455, F486, Q493, S494, N501 and Y505) that are essential for efficient binding of SARS-CoV-2 to ACE2 [12]. The SARS-CoV-2 RBD shows a closer similarity to the MP789 CoV RBD (96.8% homology) than to the RaTG13 CoV RBD (89.56% homology) [7]. Interestingly, we found that 26 of 33 dN substitutions between SARS-CoV-2 and RaTG13 CoV were located in the RBD, while only 7 of 505 dN substitutions between SARS-CoV-2 and MP789 CoV were located in the RBD (Table 1). This indicates that, in this region, MP789 CoV has undergone fewer dN changes than RaTG13 CoV when compared with SARS-CoV-2. Since only one polymorphism was detected in the RBD in the 148 SARS-CoV-2 sequences, the MK test could not determine an NI value, suggesting that this is a highly conserved region. The comparison between the SARS-CoV-2 and MP789 CoV RBDs shows that they share the 6 amino acids that are essential for binding to the ACE2 receptor, while RaTG13 CoV lacks these amino acids (Fig. 2). These results could indicate that pangolins and humans have a similar ACE2 at the domain that interacts with the S protein, as reported by others [29, 30]. Consequently, the ACE2 binding sites, and the region in general, should be conserved (70% homology), which would be sufficient for the interaction to take place.
Table 1. McDonald-Kreitman test for the Spike gene and RBD of SARS-CoV-2 compared with RaTG13 and MP789 CoVs

| Region | Compared CoV | Polymorphic substitutions between viruses: nonsynonymous | Polymorphic substitutions between viruses: synonymous | MK NI | Fisher's exact test p value |
|---|---|---|---|---|---|
| S gene | RaTG13 CoV | 33 | 223 | 12.839 | 0.00000 |
| S gene | MP789 CoV | 505 | 75 | 0.371 | 0.04134 |
| RBD | RaTG13 CoV | 26 | 60 | ND | 0.310345 |
| RBD | MP789 CoV | 7 | 71 | ND | 0.101266 |

NI: neutrality index (significance at 95%)
ND: undetermined
A genetic feature that makes SARS-CoV-2 more infectious is that the S protein harbors an insert of 12 nucleotides between the S1 and S2 subunits that encodes a polybasic cleavage site (RRAR) recognized by furins (Fig. 2). This cleavage site is associated with an increased efficiency of entry during infection [31, 32]. This insertion is not present in all betacoronaviruses; for example, it is absent from SARS-CoV [13, 31, 33]. However, the human HKU1 CoV and MERS-CoV have variants of polybasic insertions that are also recognized by furins [34–36]. The presence of these polybasic insertions has been shown to increase the pathogenicity of viruses, such as avian influenza [37–39], MERS-like CoV [40], and bovine CoV [41].
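Polybasic sites such as RRAR can be located in a protein sequence by scanning for the minimal furin recognition motif R-X-X-R, where X is any residue. The sketch below is a simple illustration of that scan and not the method behind Fig. 2: the function name is hypothetical, the two short fragments are illustrative stand-ins for a sequence with and without the insert, and dedicated cleavage-site predictors take more sequence context into account than this pattern does.

```python
import re

# Minimal furin motif R-X-X-R. The lookahead makes overlapping hits
# visible; [A-Z] keeps alignment gaps ('-') from matching as residues.
FURIN_MINIMAL = re.compile(r"(?=(R[A-Z]{2}R))")


def find_polybasic_sites(protein):
    """Return (0-based start, matched motif) for every R-X-X-R hit."""
    return [(m.start(), m.group(1))
            for m in FURIN_MINIMAL.finditer(protein.upper())]


# Illustrative fragments around an S1/S2-like junction (hypothetical
# examples, not sequences extracted from the alignment analyzed here).
with_insert = "TNSPRRARSVAS"
without_insert = "TNS----RSVAS"   # gaps mark the missing insert

print(find_polybasic_sites(with_insert))     # one hit: RRAR at index 4
print(find_polybasic_sites(without_insert))  # no hits
```

Restricting the middle positions to letters means a gapped alignment row can be scanned directly, so sequences lacking the insert are not reported as false hits across the gap.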
We also found that RmYN02 CoV has an insert in the same position as that in SARS-CoV-2, but it is not a polybasic cleavage insertion (-AAR). It has instead been suggested that this insert could be the product of recombination between wild bat CoVs [13]. On the other hand, several experiments have shown that this polybasic cleavage site is acquired and fixed during the serial passage of CoVs in cell cultures or in animals [37, 42]. These observations lead to two possible explanations for the polybasic cleavage insertion in SARS-CoV-2 and its role in adaptation to humans: 1) the ancestor of SARS-CoV-2 acquired it in a host, went through a recombination process in an unidentified intermediary host, and then jumped to humans, or 2) it was acquired in humans during a cycle of human-to-human transmission that helped its adaptation and virulence process. Rambaut et al. [43] determined that the most recent common ancestor of SARS-CoV-2 appeared in November 2019 and proposed that the virus had enough time to acquire the insert during transmission between humans.