Comparison of phylogenetic supertree and ML tree
To determine the evolutionary relationships among SARS-CoV–2 isolates accurately, both a phylogenetic supertree and an ML tree were constructed from 102 SARS-CoV–2 isolates collected worldwide, together with 5 SARS-CoV, 2 MERS-CoV, and 11 bat coronaviruses as outgroups. In the phylogenetic supertree (Figure 1), SARS-CoV and MERS-CoV were placed on one major branch, while SARS-CoV–2 belonged to another major branch. The divergent position of SARS-CoV–2 relative to SARS-CoV and MERS-CoV on the supertree was consistent with the ML tree in this study (Figure S1) and with previous reports on the phylogeny of SARS-CoV–2 constructed from whole genomes 3,4,6. However, some discrepancies exist between the phylogenetic supertree and the ML tree.
The phylogenetic supertree showed distinct phylogenetic distances within the SARS-CoV and SARS-CoV–2 clades, explicitly presenting the evolutionary relationships among the coronaviruses. By contrast, coronaviruses clustered tightly within the SARS-CoV and SARS-CoV–2 clades in the ML tree (Figure S1), with barely discernible branch lengths (less than 0.001). Furthermore, the supertree identified coronavirus AY572035, sampled from civet, as the closest ancestor of the SARS-CoVs (Figure 1) with a distinct branch length, which is consistent with a previous study 18. Notably, bat coronaviruses sampled from the same animal host and/or the same sampling location showed closer genetic distances in the supertree, which is plausible from an evolutionary perspective. In the ML tree, however, the bat coronaviruses showed no definitive evolutionary relationships. The ML tree was therefore less suitable for phylogenetic inference, at least for the coronaviruses listed above. The major factor determining the ML tree topology appears to be the orf1ab gene, which makes up about 75% of the genome. This is supported by the similarity between the evolutionary relationships in the whole-genome ML tree and those in the source ML tree based on the ORF1ab sequence (Figure S1, Figure 2A). Taken together, the phylogenetic supertree displayed clear advantages for deciphering evolutionary relationships among coronaviruses.
Clues to the origin of SARS-CoV–2
As the phylogenetic supertree and ML tree showed, RaTG13 (MN996532), bat-SL-CoVZC45 (MG772933), bat-SL-CoVZXC21 (MG772934), and the SARS-CoV–2s formed one major clade (Figure 1, Figure S1). In particular, RaTG13, isolated from the bat Rhinolophus affinis (Yunnan, China), is the closest relative of the SARS-CoV–2s while being located on a separate branch, which substantiates the previously reported phylogeny of SARS-CoV–2s constructed from whole genomes. The phylogenetic distance between the SARS-CoV–2s and RaTG13 was distinctly exhibited in the phylogenetic supertree (Figure 1); by contrast, it was barely observable in the ML tree constructed in this study (Figure S1) or in a previous report 19.
To explain why the proximity between the SARS-CoV–2s and RaTG13 differs between the supertree and the ML tree, we examined the 10 source ML trees (Figure 2) from which the supertree was built. Consistent with the supertree and the ML tree, RaTG13 (MN996532) was identified as the coronavirus nearest to the SARS-CoV–2s in the source ML trees of five CDSs: ORF1ab, spike protein, N protein, ORF6, and ORF7a (Figure 2A, 2B, 2D, 2G, 2H). By contrast, bat coronaviruses MG772933 and MG772934, both isolated from the bat Rhinolophus sinicus (Zhejiang, China) 20, were the nearest relatives of the SARS-CoV–2s in the source ML trees based on M protein, ORF3a, and ORF8 (Figure 2C, 2F, 2I). In addition, phylogenetic analysis of the E protein sequence placed the SARS-CoV–2s, MN996532, MG772933, and MG772934 on the same branch (Figure 2E). These differing results show clearly that rates of evolution are highly non-uniform across the protein-coding sequences of the SARS-CoV–2s and that no clear consensus phylogeny within the coronaviruses can be determined from individual genes, which makes single-gene phylogenetic analysis a relatively weak tool for studying viral phylogeny. The conflicting phylogenies in the 10 source ML trees suggest that another bat coronavirus, from a different species, may be the proximate ancestor of SARS-CoV–2, and/or that the SARS-CoV–2s had already evolved substantially in their animal host. What is clear is that the validity of RaTG13 as the direct ancestor of SARS-CoV–2 is seriously in question, although the two share 96.5% genome sequence identity. It is therefore misleading in phylogenetic inference to take RaTG13 as the direct ancestor of SARS-CoV–2.
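The disagreement among the source ML trees can be summarized with a simple tally. The following is a minimal sketch, not part of the original analysis: the dictionary transcribes only the nearest-relative assignments stated in the text above, the tenth source tree is omitted because its result is not described here, and the names `NEAREST_BY_CDS`, `tally_nearest`, and `agreement_fraction` are hypothetical.

```python
from collections import Counter

# Nearest bat-coronavirus relative of the SARS-CoV-2s in each source ML tree,
# transcribed from the text. The tenth source tree is not described in this
# section, so it is deliberately left out.
NEAREST_BY_CDS = {
    "ORF1ab": "RaTG13",
    "spike": "RaTG13",
    "N": "RaTG13",
    "ORF6": "RaTG13",
    "ORF7a": "RaTG13",
    "M": "MG772933/MG772934",
    "ORF3a": "MG772933/MG772934",
    "ORF8": "MG772933/MG772934",
    "E": "unresolved",  # all four groups fall on the same branch
}


def tally_nearest(nearest_by_cds):
    """Count how many source trees support each nearest-relative candidate."""
    return Counter(nearest_by_cds.values())


def agreement_fraction(nearest_by_cds, candidate):
    """Fraction of the listed source trees that support the given candidate."""
    counts = tally_nearest(nearest_by_cds)
    return counts[candidate] / len(nearest_by_cds)


if __name__ == "__main__":
    for candidate, n_trees in tally_nearest(NEAREST_BY_CDS).most_common():
        print(f"{candidate}: {n_trees} source tree(s)")
```

A tally of this kind only counts topological support per gene; it does not weight genes by length or account for branch support values, which a supertree method would handle differently.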
Mutants and evolution of SARS-CoV–2
Within the phylogenetic supertree, nine sub-branches were resolved in the SARS-CoV–2 clade, labeled clade A through clade I in Figure 1; these were absent from the ML tree based on full-length genome sequences (Figure S1). The sub-branches outline an evolutionary scenario of the SARS-CoV–2s in human hosts worldwide from December 2019 to March 2020, at least for the 102 SARS-CoV–2 isolates in this study. Examination of the ten CDSs of the SARS-CoV–2s showed diverse mutations distributed across five viral proteins: ORF1ab, N protein, spike protein, ORF3a, and ORF8 (Table 1).
At most mutation sites described in this study, the original amino acid was replaced by one with different chemical properties, except for L1599F in ORF1ab (clade A), V62L in ORF8 (clade H), and I1606V in ORF1ab (clade D1). Most strikingly, SARS-CoV–2s from the USA displayed shared mutations in clades A, C, D, F, H, and I, which include isolates from many of the countries in this study, including Spain, Finland, Sweden, Italy, Brazil, Australia, and South Korea. In particular, detection of the identical mutation in the ORF3a protein (G251V) in clade I indicates that the G251V mutant had spread by January 2020 or earlier in Sweden, Italy, Brazil, Australia, and the USA.
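The distinction between substitutions that preserve and those that alter chemical properties can be illustrated with a short sketch. It assumes one coarse grouping of the 20 amino acids by side-chain chemistry, chosen here for illustration and not taken from this study; other conventions exist and would classify some substitutions differently. The function names are hypothetical, and the labels are those mentioned in this section.

```python
import re

# Assumed coarse grouping of amino acids by side-chain chemistry.
# This is one common convention, not the classification used by the authors.
CLASSES = {
    "hydrophobic": "AVLIMFW",
    "special": "GPC",
    "polar": "STNQY",
    "acidic": "DE",
    "basic": "KRH",
}
AA_CLASS = {aa: name for name, members in CLASSES.items() for aa in members}


def parse_substitution(label):
    """Split a label such as 'D614G' into (reference, position, variant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if match is None:
        raise ValueError(f"not a substitution label: {label!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def is_conservative(label):
    """True if both residues fall in the same assumed chemical class."""
    ref, _, alt = parse_substitution(label)
    return AA_CLASS[ref] == AA_CLASS[alt]


if __name__ == "__main__":
    # Substitution labels mentioned in this section.
    for label in ["L1599F", "V62L", "I1606V", "G251V", "P4715L", "D614G", "H49Y"]:
        ref, pos, alt = parse_substitution(label)
        kind = "conservative" if is_conservative(label) else "non-conservative"
        print(f"{label}: {AA_CLASS[ref]} -> {AA_CLASS[alt]} ({kind})")
```

Under this assumed grouping, the three exceptions named above stay within the hydrophobic class, whereas the other listed substitutions cross class boundaries.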
The ORF1ab gene, which takes up about 75% of the coronavirus genome, encodes a series of non-structural proteins (nsp) that assemble to facilitate viral replication and transcription. Mutations in ORF1ab were present in the majority of clades, including clades A, B, C, D1 (within D), and E, which involve SARS-CoV–2s from Spain, the USA, and China, but no mutation site was shared among them. Among them was a proline-to-leucine mutation (P4715L) in ORF1ab, located in Nsp12, which is considered a primary target for nucleotide analog antiviral inhibitors such as remdesivir; the mutation could therefore make anti-coronavirus treatment less effective 21,22.
The viral spike protein, responsible for virus entry into the host cell, carried two mutation sites, one in clade A (D614G) and one in clade F (H49Y). The D614G site in the spike protein is located between the receptor-binding domain (451–509) and the polybasic cleavage site (682–685) 23, and may affect the binding of the virus to the human ACE2 receptor or its infectivity. Further studies and clinical observations are needed to determine whether mutation sites on the various proteins change the ability of the virus to infect and its pathogenicity.