Genomic recombination events may reveal the evolution of coronaviruses and the origin of 2019-nCoV

To trace the evolution of coronaviruses and reveal the possible origin of the novel pneumonia coronavirus (2019-nCoV), we collected and analyzed 2966 publicly available coronavirus genomes, including 182 2019-nCoV strains. We observed 3 independent, statistically significant recombination events between isolates from bats and pangolins. Consistent with previous reports, we also detected a putative recombination between Bat-CoV-RaTG13 and Pangolin-CoV-2019 covering the receptor-binding domain (RBD) of the spike glycoprotein (S protein), which may have led to the origin of 2019-nCoV. Population genetic analyses indicate that the recombinant region around the RBD is possibly undergoing directional evolution, which may result in the adaptation of the virus to be infectious in hosts. Not surprisingly, we find that the S protein of coronaviruses maintains high diversity among bat isolates, which may provide a genetic pool for the origin of 2019-nCoV.


Introduction
The novel pneumonia coronavirus (2019-nCoV), first identified in Wuhan, China 1,2, has caused a worldwide pandemic. To date, more than 180,000 confirmed novel coronavirus pneumonia (NCP) cases have been reported around the world. To control and prevent the disease, efforts have been made to trace the origins of 2019-nCoV. In the publication of the first 2019-nCoV genome, bats were considered the original host of this virus 3. Bat-CoV-RaTG13, a bat coronavirus isolated from Rhinolophus affinis, is 96% identical to 2019-nCoV at the whole-genome level 4. A pangolin isolate, Pangolin-CoV-2019, shares only 91.02% identity with 2019-nCoV at the whole-genome level, but shows higher sequence identity than Bat-CoV-RaTG13 in the coding sequence of the spike glycoprotein (S protein, 97.5%) 5. The pangolin was therefore considered a potential intermediate host of 2019-nCoV 6-8. It has been reported that the receptor-binding domain (RBD) of the S protein in 2019-nCoV might have resulted from a recombination event between Bat-CoV-RaTG13 and Pangolin-CoV-2019 6,7,9. The RBD-ACE2 binding free energy for 2019-nCoV is significantly lower than that for SARS 10,11, which partially explains the high infectivity of 2019-nCoV. Thus, genomic recombination may be closely related to the origin of 2019-nCoV. Statistical analysis of genomic recombination between pangolin and bat coronaviruses should therefore be important for tracing the origins of 2019-nCoV. The subsequent evolution of the recombinants is also of scientific interest and calls for in-depth, population-level analysis across more coronavirus strains.

Results And Discussion
We performed a multiple alignment of 2966 coronavirus genome sequences and tested for recombination between bat and pangolin coronaviruses. In total, we identified 3 independent recombinants. Each of them is supported by at least six statistical tests (P-value < 0.05) (Table 1). We validated the three recombinants by generating a phylogenetic tree for each (Figure 1) and pairwise identity plots (Figure S1). Two of the three recombination events suggest that exchanges of genetic material between bat coronaviruses and pangolin coronaviruses are not rare. One of these two recombinants lies within the ORF1 region, and the other spans the 3' end of ORF1 and the 5' start of the S protein (Figure 2A).
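The pairwise identity plots mentioned above can be illustrated with a sliding-window comparison of aligned sequences. The following is a minimal sketch, not the procedure used for Figure S1: the function names, the window and step sizes, and the rule of skipping gapped columns are all assumptions made for illustration. A candidate recombinant region would show up as a stretch of windows in which the query is closer to a different reference than in the rest of the genome.

```python
def window_identity(seq_a, seq_b, window=200, step=50):
    """Fraction of identical sites per window for two aligned sequences.

    Columns with a gap ('-') in either sequence are skipped (an assumption).
    Returns a list of (window_start, identity); identity is None when a
    window contains no comparable columns.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    results = []
    for start in range(0, len(seq_a) - window + 1, step):
        matches = 0
        compared = 0
        for x, y in zip(seq_a[start:start + window], seq_b[start:start + window]):
            if x == "-" or y == "-":
                continue
            compared += 1
            if x == y:
                matches += 1
        results.append((start, matches / compared if compared else None))
    return results


def closer_reference(query, ref_a, ref_b, window=200, step=50):
    """Label each window by the reference that is more similar to the query."""
    ids_a = window_identity(query, ref_a, window, step)
    ids_b = window_identity(query, ref_b, window, step)
    calls = []
    for (start, a), (_, b) in zip(ids_a, ids_b):
        if a is None or b is None or a == b:
            calls.append((start, "tie"))
        else:
            calls.append((start, "ref_a" if a > b else "ref_b"))
    return calls


# Toy alignment: the second half of the query matches ref_b instead of ref_a.
query = "AAAATTTT"
ref_a = "AAAAAAAA"
ref_b = "TTTTTTTT"
print(closer_reference(query, ref_a, ref_b, window=4, step=4))
```

In this toy example the label switches from one reference to the other between the two windows, which is the pattern an identity plot makes visible around a recombination breakpoint. Identity plots of this kind only support a visual check; they do not replace the statistical tests reported in Table 1.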
Our analysis once again confirmed that a 228 bp sequence within the S protein (Figure 2A) of 2019-nCoV very likely resulted from recombination between Bat-CoV-RaTG13 and Pangolin-CoV-2019 (Table 1, Figure 1D, Figure S1C, S2), although 2019-nCoV has not yet been isolated from or identified in bats or pangolins. At the whole-genome level, Bat-CoV-RaTG13 shows higher identity to 2019-nCoV than Pangolin-CoV-2019 does. Our analysis suggests that 2019-nCoV very likely originated from a bat coronavirus after acquiring a recombinant sequence in the S protein from a pangolin coronavirus (Figure 2B). The S protein recombinant sequence encodes a 76-amino-acid peptide located in the receptor-binding domain (RBD), which may influence the host preference of the virus. This recombination event may have played a key role in the origin of 2019-nCoV.
We observed a peak in the fixation index (Fst) calculated between human and bat coronaviruses at the S protein recombinant region. The same holds for Fst between human coronaviruses and those from other hosts, such as camel or cow (Figure 2C, Figure S3). This rise in differentiation suggests that recombination in the S protein is a common event and may be important for the adaptation of coronaviruses to different hosts. We did not observe obvious variation in the composite likelihood ratio (CLR) or Tajima's D within the S protein recombinant region among the 2019-nCoV strains. One explanation is that the RBD region is highly conserved in 2019-nCoV. Furthermore, we observed a CLR peak at the ORF1 recombinant for 2019-nCoV (Figure S4). We also observed a CLR peak at the boundary recombinant for human isolates (Figure S5), together with an Fst peak between human and bat coronaviruses at the same recombinant (Figure S5). Because too few samples were available, we could not identify selection signals in pangolin coronaviruses.
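To make the Fst scan concrete, the sketch below computes a windowed Fst between two groups of aligned sequences. The text does not state which Fst estimator was used, so this example assumes a Hudson-style estimator, Fst = 1 - Hw/Hb, where Hw is the mean number of pairwise differences within groups and Hb is the mean between groups. The function names, the window parameters, and the rule of ignoring non-ACGT characters are likewise illustrative assumptions.

```python
from itertools import combinations, product


def pairwise_diff(a, b):
    """Count differing sites, ignoring columns with gaps or ambiguous bases."""
    return sum(1 for x, y in zip(a, b) if x != y and x in "ACGT" and y in "ACGT")


def mean_within(seqs):
    """Mean pairwise differences within one group (needs at least 2 sequences)."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)


def mean_between(pop1, pop2):
    """Mean pairwise differences between sequences of two groups."""
    pairs = list(product(pop1, pop2))
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)


def hudson_fst(pop1, pop2):
    """Hudson-style Fst = 1 - Hw / Hb; None if the groups do not differ at all."""
    hw = (mean_within(pop1) + mean_within(pop2)) / 2
    hb = mean_between(pop1, pop2)
    if hb == 0:
        return None
    return 1 - hw / hb


def windowed_fst(pop1, pop2, window, step):
    """Fst in sliding windows along an alignment shared by both groups."""
    length = len(pop1[0])
    scan = []
    for start in range(0, length - window + 1, step):
        w1 = [s[start:start + window] for s in pop1]
        w2 = [s[start:start + window] for s in pop2]
        scan.append((start, hudson_fst(w1, w2)))
    return scan


# Toy data: the two host groups differ only in the second half of the alignment.
human = ["AAAAAAAA", "AAAAAAAA"]
bat = ["AAAATTTT", "AAAATTTT"]
print(windowed_fst(human, bat, window=4, step=4))
```

In the toy data the first window carries no differences at all, so Fst is undefined there, while the second window is fully differentiated between the two groups. A peak in such a scan marks a region where sequences differ more between host groups than within them, which is how the Fst peaks at the recombinant regions are interpreted above.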
To speed up the process, we performed whole-genome alignments with CUDA ClustalW 25. We performed recombination detection with RDP4