Recombination analysis
R1: This analysis was done to initially check for the possibility of SARS-CoV-2 being a recombinant. Out of a total of 508 events, 28 events satisfied the criteria of at least 4 methods having significant p values, and of the 28 only two were of interest. Event 26 showed SARS-CoV-2 as a recombinant with a pangolin CoV and SARS-CoV as major and minor parents, and event 125 showed SARS-CoV as a recombinant with a pangolin CoV and bat CoV RaTG13 as major and minor parents respectively, see Table 2. Figures 1 and 2 also show the Neighbour-Joining (NJ) trees of the two events, event 26 tree shows how SARS-CoV-2 sequence, NC_045512.2 is most closely related to bat coronavirus RaTG13 (96.03%), and to the pangolin sequences (85.20%) as also shown in Table 3 in the pairwise percentage table. The event 26 NJ tree also shows some distant relationship with a lot of other bat CoVs in the Beta-CoVs genus. In event 125 the NJ tree shows the SARS-CoV from the 2003 Chinese outbreak as a recombinant of bat coronavirus RaTG13 with 78.43% pairwise percentage identity as shown in Table 3. The pangolin CoVs also cluster closely with SARS-CoV with about 78.12% pairwise percentage identity. All of the recombination events were in the non-structural protein region of the genome.
R2: As a follow up to R1, we sought to check for recombination events within the different SARS-CoV-2 viruses available in Genbank at the time of the analysis and also computationally feasible to analyse. Out of the 422 events, 37 events satisfied the criteria of at least 4 methods having significant p values, Of the 37 events, 5 involved SARS-CoV-2 as a recombinant with bat CoVs and pangolin CoVs as major and minor parents as shown in events 12, 39, 112, 124 and 147, see also Table 2. A common feature among all the R2 events is that different “variants” of the SARS-CoV-2 virus as depicted by the different accession numbers show different bat and pangolin CoV parents. This might be an indication of ancient (and not recent) recombination events that happened over a long period of time as indicated by the interaction of the SARS-CoV-2 with a diverse number of other CoVs, which is not possible in a short space of time. In event 12 there is recombination between two SARS-CoV-2 sequences. This is important in describing the possibility of co-infection or super-infection, as the prerequisite for recombination is that two viral sequences have to be in the same space and time for recombination to occur. In event 39 and 124, there are recombination events between SARS-CoV-2 from the recent outbreak and SARS-CoV from the 2003 outbreak. This makes it possible to hypothesize that SARS-CoV is the backbone sequence for SARS-CoV-2 given that they have a 78.54% similarity as shown in Table 3. Hence it is reasonable to hypothesize that natural recombination and/or mutational events on a SARS-CoV backbone sequence might have led to the emergence of SARS-CoV-2, which may also provide basis for not to ruling out the possibility of laboratory manipulation of the SARS-CoV. Only event 147 was in the structural protein encoding region of the genome, specifically the nucleocapsid protein region of the genome.
Table 2. Recombination analysis with position and function of genomic regions
R1
|
|
Event
|
Recombinant, Genbank ID
Major Parent, Genbank ID
Minor parent, Genbank ID
|
No. of Methods
|
P-value range
|
*Position of breaking points
Approx. genomic region
|
Known genomic function
|
26
|
SARS-CoV-2 sequence, NC_045512.2
Pangolin coronavirus, MT040336
SARS-CoV, NC_004718
|
5
|
1.30x 10-03—2.85x10-15
|
17581—18192
ORF1ab, nsp13 genomic region
|
RNA helicase, 5′ triphosphatase (59)
|
125
|
SARS-CoV, NC_004718
Pangolin coronavirus, MT040336
Bat coronavirus RaTG13, MN996532.1.
|
4
|
4.00x 10-02—3.10x10-08
|
15105—15538
ORF1ab, nsp12 genomic region
|
Primer dependent RdRp (60)
|
R2
|
|
Event
|
Recombinant, Genbank ID
Major Parent, Genbank ID
Minor parent, Genbank ID
|
No. of Methods
|
P-value range
|
Position of breaking points
Approx. genomic region
|
|
12
|
SARS-CoV-2 sequence, MT188341
SARS-CoV-2 sequence, MT293183
BatCoV, NC_014470
|
4
|
7.90 x 10-03—7.10 x10-40
|
29810—45
ORF10 to ORF1ab,
orf10 protein
|
Plays a role in virus assembly and release, and it involved in viral pathogenesis (61)
|
39
|
SARS-CoV, NC_004718
Pangolin coronavirus, MT040336
SARS-CoV-2 sequence, MT123293
|
5
|
1.30 x 10-03—8.40 x10-18
|
17752—18121
ORF1ab, nsp13 genomic region
|
RNA helicase, 5′ triphosphatase (59)
|
112
|
SARS-CoV-2 sequence, MT039888
Bat CoV, NC_025217
Bat CoV-HKU4, NC_009019
|
7
|
1.40x 10-02—1.14x10-08
|
4706—4934
ORF1ab, nsp3 genomic region
|
Polypeptides cleaving, blocking host innate immune response, promoting cytokine expression (62)
|
124
|
SARS-CoV-2 sequence, MT163716
Pangolin coronavirus, MT040335
SARS-CoV, NC_004718
|
4
|
4.90x 10-02—1.03x10-07
|
15164—15584
ORF1ab, nsp12 genomic region
|
Primer dependent RdRp (60)
|
147
|
SARS-CoV-2 sequence, MT263421
Bat CoV, NC_025217
Bat CoV-HKU5, NC_009020
|
5
|
1.60x 10-03—6.01x10-07
|
28935—29371
ORF9, Nucleocapsid protein
|
An antagonist of interferon (IFN) and viral encoded repressor of RNA interference (63)
|
*Approximate genomic region was estimated from nucleotide positions based on the SARS-CoV-2 sequence, NC_045512.2 using the Genbank annotated format of the sequence.
Pairwise genetic distance Simplot of SARS-CoV-2 and other related CoV
To corroborate the findings from the RDP4 software in detecting recombination, we used the Similarity Plot. The Similarity Plot allows identification of one query sequence, generally the one suspected to be the recombinant or the mosaic, and the rest of the sequences are considered as reference sequences. The analysis indicated that the newly isolated SARS-CoV-2 represented by sequence SARS-CoV-2_NC_045512.2 in our analysis is more closely related to bat Coronavirus (RaTG13) further supporting preliminary reports that suggested a possible species jump from bats to humans to seed the current pandemic virus associated with COVID19 (see Figure 3). One pangolin CoV MT040336 also shows close identity to SARS-CoV-2 (85.1%) similarity as also shown in Table 3.
Table 3. Pairwise percentage identity of SARS-CoV-2 and other related CoV
Phylogenetic Incongruence testing
To confirm the findings from the recombination analysis (RDP4) and sequence similarity (Simplot), we used a more conclusive test, the SH test (48). The null hypothesis of the SH test states that the difference between trees (branch length, topology or likelihoods) is zero. In Table 4 the observed differences between the 60A1 and 60A2 and/or 473A1 and 473A2 were significantly greater than zero and the null hypothesis was rejected, thus declaring that these trees are significantly different i.e. incongruent (p< 0.05) as shown by the p-values (p-SH) indicating that there is substantial phylogenetic incongruence between the 60A1 and 60A2 and/or 473A1 and 473A2 trees. The incongruence between these trees allude to the fact that recombination plays a role in CoV evolution and in SARS CoV-2 in particular.
Table 4. Shimodaira-Hasegawa test for incongruence
60 A1 tree as reference
|
Tree
|
deltaL
|
bp-RELL
|
p-KH
|
p-SH
|
p-WKH
|
p-WSH
|
c-ELW
|
p-AU
|
60A1
|
0
|
1
|
1
|
1+
|
1
|
1
|
1
|
1
|
60 A2
|
988.9
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
60 A2 tree reference
|
Tree
|
deltaL
|
bp-RELL
|
p-KH
|
p-SH
|
p-WKH
|
p-WSH
|
c-ELW
|
p-AU
|
60 A1
|
1.99
|
0.16
|
0.15
|
0.15
|
0.15
|
0.15
|
0.22
|
0.17
|
60 A2
|
0
|
0.84
|
0.85
|
1+
|
0.85
|
0.85
|
0.78
|
0.83
|
473 A1 tree as reference
|
Tree
|
deltaL
|
bp-RELL
|
p-KH
|
p-SH
|
p-WKH
|
p-WSH
|
c-ELW
|
p-AU
|
473 A1
|
0
|
0.99
|
0.97
|
1
|
0.97
|
0.97
|
0.99
|
0.10
|
473 A2
|
233.63
|
0.01
|
0.03
|
0.03
|
0.03
|
0.03
|
0.00
|
0.05
|
473 A2 tree as reference
|
Tree
|
deltaL
|
bp-RELL
|
p-KH
|
p-SH
|
p-WKH
|
p-WSH
|
c-ELW
|
p-AU
|
473 A1
|
12.74
|
0.40
|
0.38
|
0.38
|
0.38
|
0.38
|
0.40
|
0.37
|
473 A2
|
0
|
0.60
|
0.62
|
1
|
0.62
|
0.62
|
0.60
|
0.63
|
deltaL: logL difference from the maximal logl in the set; bp-RELL: bootstrap proportion using RELL method (64); p-KH: p-value of one-sided (55); p-SH: p-value of Shimodaira-Hasegawa test (48); p-WKH: p-value of weighted KH test; p-WSH: p-value of weighted SH test; c-ELW: Expected Likelihood Weight (65); p-AU: p-value of approximately unbiased (AU) test (56); +: 95% confidence sets; -: significant exclusion
Phylogenetic analysis
After demonstrating that recombination events play a major role in CoV evolution we sought to determine how the new 413 SARS-CoV-2 sequences would cluster together and also with previously known CoV sequences by constructing 3 phylogenetic trees as described in Figure 4, 5 and 6. Figure 4 shows 60 representative CoV sequences and how they are classified into the different genera. It also shows SARS-CoV-2 reference sequence being closely related to bat RaTG13 sequence in the sub-genus Sarbecovirus. Pangolin CoVs and other bat CoVs also cluster close to the SARS-CoV-2 virus. Figure 5 is an extension of Figure 4, with the addition of a collapsed branch of 413 SARS-CoV-2 sequences from the current outbreak. Figure 6 is an analysis of 690 SARS-CoV-2 complete sequences, with the tree rooted against the SARS-CoV-2 reference sequence (NC_045512.2).