A simple method for predicting emerging SARS-CoV-2 variants using outgroups infecting non-human hosts

The ability to predict emerging variants of SARS-CoV-2 would be of enormous value, as it would enable proactive design of vaccines in advance of such emergence. Based on molecular evolutionary analysis of the S protein, we found a significant correspondence in the location of amino acid substitutions between recently emerged SARS-CoV-2 variants and their relatives that infected bats and pangolins before the pandemic. This observation suggests that a limited number of sites in this protein are repeatedly substituted in independent lineages of this group of viruses. It follows, therefore, that the sites of future emerging mutations in SARS-CoV-2 can be predicted by analyzing their relatives (outgroups) that have infected non-human hosts. We discuss a possible evolutionary mechanism behind these substitutions and provide a list of frequently substituted sites that potentially include future emerging variants in SARS-CoV-2.


Introduction
In December 2020, three SARS-CoV-2 variants with increased infectivity emerged from England, South Africa and Brazil. The fact that certain mutations in the spike (S) protein had occurred independently prompted us to reexamine our September 2020 study of the evolution of this protein [1]. In our original study, we characterized the Importance of each residue position in the S protein by comparing its diversity in SARS-CoV-2 with that in relatives (outgroups) that infected bats or pangolins, using a simple equation in which diversity(x) is defined as the number of different amino acids observed at the site in question in virus group x. This equation, which was meant to be descriptive rather than predictive, identified twenty positions of high Importance. We were thus surprised to find that, of these twenty positions, four were characteristic of the above emerging variants: Histidine 69, Valine 70, Glutamic acid 484 and Asparagine 501. These sites coincide with four of the five residues (69, 70, 417, 484, 501) that are observed multiple times in the three emerging lineages or in the lineage transmitted between humans and mink [5]. We reanalyzed the underlying sequence data and found that the Importance values of these sites were determined primarily by diversity(outgroup), rather than diversity(SARS-CoV-2). In hindsight, this is to be expected, as the latter term was close to unity at the time we performed the analysis (i.e., before the emergence of new variants).
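As a minimal sketch of the diversity(x) count defined above, the following Python snippet counts the distinct amino acids in each column of a set of aligned sequences. The function names, the toy sequences, and the choice to ignore gap ('-') and ambiguity ('X') characters are illustrative assumptions and are not taken from the original analysis.

```python
def diversity(column, ignore="-X"):
    """Number of distinct amino acids in one alignment column.

    Assumption: gap and ambiguity characters are not counted as residues.
    """
    return len({aa for aa in column if aa not in ignore})


def site_diversity(aligned_seqs):
    """Per-column diversity for a list of equal-length aligned sequences."""
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*rows) walks the alignment column by column.
    return [diversity(col) for col in zip(*aligned_seqs)]


if __name__ == "__main__":
    # Hypothetical three-sequence alignment, three columns wide.
    toy_group = ["NYE", "NFK", "N-E"]
    print(site_diversity(toy_group))
```

Applied separately to the SARS-CoV-2 sequences and to the outgroup sequences of one alignment, this would give the two per-site quantities, diversity(SARS-CoV-2) and diversity(outgroup), that the text compares.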

Theory
A natural question, then, is why a limited set of sites with high diversity in outgroups have also recently mutated in SARS-CoV-2. One possible explanation is that these sites are rapidly evolving under low functional constraints (i.e., neutral evolution) and are thus frequently substituted in multiple lineages. However, this explanation is contradicted by the fact that the sites in question are estimated to be under positive selection (nonsynonymous substitutions more frequent than synonymous substitutions) by Bayes Empirical Bayes analysis [6] applied to closely related outgroups (see the 4th column in https://mafft.cbrc.jp/alignment/pub/sarscov2/fulllist.tsv), although the estimation is sensitive to sequence selection. A more likely explanation, then, is that the sites are involved in infection of host cells, evasion of host immunity, or both.
Indeed, Glutamic acid 484 and Asparagine 501 are structurally close to the interface with the host cell receptor ACE2, which, in turn, is targeted by neutralizing antibodies. Histidine 69 and Valine 70, on the other hand, are far from the ACE2 binding site but proximal to a recently reported epitope for infection-enhancing antibodies [2,3]. An overlapping region has been reported to bind sialic acids [4]. Modification of these processes could thus enable the virus to escape from the host's immune system, albeit temporarily, as the change will inevitably be counteracted by a shift in the antibody repertoire of the host, resulting in an effective "arms race". In this scenario, the sites with higher diversity imply direct or indirect host-pathogen interactions and are thus in a constant state of flux.

Results and Discussion
According to the latter interpretation, it is possible that positions of mutations in future emerging variants can be predicted simply by identifying sites with high diversity in outgroups, where such an arms race has been played out for longer than between SARS-CoV-2 and humans. Because of their potential importance in the design of vaccines against future emerging variants, we list residue positions with the highest diversity(outgroup) in Table 1, where we have considered two definitions of outgroups: one identical to that used in our original analysis (6 sequences) and a broader one (11 sequences) that increases the amount of data used in the calculation. Both datasets are available at https://mafft.cbrc.jp/alignment/pub/sarscov2/.
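The ranking behind a list such as Table 1 could be sketched as follows. This is only an illustration under stated assumptions: the SARS-CoV-2 reference and the outgroup sequences are rows of the same alignment, gaps are written as '-', residue numbers follow the ungapped reference, and ties are broken by position. The function name rank_sites and the toy sequences are hypothetical.

```python
def rank_sites(reference_row, outgroup_rows, top=20):
    """Rank reference residue positions by diversity(outgroup).

    Returns (diversity, position, reference_residue) tuples, highest
    diversity first; position is 1-based in the ungapped reference.
    """
    ranked = []
    position = 0
    for ref_aa, *column in zip(reference_row, *outgroup_rows):
        if ref_aa == "-":
            continue  # column is an insertion relative to the reference
        position += 1
        # Distinct outgroup residues at this column, gaps ignored (assumption).
        d = len({aa for aa in column if aa != "-"})
        ranked.append((d, position, ref_aa))
    # Sort by descending diversity, then ascending position for stable ties.
    ranked.sort(key=lambda t: (-t[0], t[1]))
    return ranked[:top]


if __name__ == "__main__":
    # Hypothetical toy alignment: one reference row and three outgroup rows.
    reference = "MN-YE"
    outgroups = ["MNAFK", "MDAYQ", "MN-HE"]
    for d, pos, aa in rank_sites(reference, outgroups, top=2):
        print(f"{aa}{pos}\tdiversity(outgroup)={d}")
```

Running the same function on a narrower and a broader set of outgroup rows would mirror the two outgroup definitions considered in the text.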
When the diversity values are viewed as a heatmap on the spike molecular surface, it is apparent that the residue positions with high diversity are not evenly distributed, but form clusters in the N-terminal domain (NTD), the receptor binding domain (RBD) and the S1/S2 cleavage site (Fig. 1). We note that the correspondence between the positions of emerging mutations and those with high diversity(outgroup) is significant by Fisher's Exact Test regardless of whether the original outgroup (Table 2A) or the broad outgroup (Table 2B) is used.
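For readers who wish to run this kind of test, a one-sided Fisher's exact test on a 2x2 table (high-diversity site or not, versus mutated in emerging variants or not) can be sketched with the Python standard library alone. The one-sided (enrichment) form, the function name, and the toy counts below are assumptions made for illustration; they are not the counts of Table 2, and the source does not state which alternative hypothesis was used.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    a: high-diversity sites that are mutated in emerging variants
    b: high-diversity sites that are not
    c: other sites that are mutated
    d: other sites that are not
    Returns P(overlap >= a) under the hypergeometric null.
    """
    row1 = a + b            # number of high-diversity sites
    col1 = a + c            # number of mutated sites
    n = a + b + c + d       # all sites considered
    total = comb(n, col1)   # ways to choose the mutated sites
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / total


if __name__ == "__main__":
    # Hypothetical toy counts, chosen only to exercise the function.
    print(fisher_exact_greater(3, 1, 1, 3))
```

A small p-value would indicate that mutated sites fall among the high-diversity sites more often than expected by chance.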

Conflicts of interest
None declared.