Evolutionary Trace of SARS-CoV-2. In order to map functional determinants in SARS-CoV-2 proteins we applied the ET approach. With the multiple sequence alignments (Figure S1A, Dataset S1) and the corresponding phylogenetic trees (Figure S2-S4) in hand for 24 of the 26 SARS-CoV-2 proteins (see SI Methods and Materials), our protocol calculated the ET ranking of importance for 99.5% of SARS-CoV-2 amino acid residue positions (Dataset S2) generated from each of three protein databases (UniRef90, UniRef100, NCBI NR) and combined into a single average. To independently assess the quality of these ranks, rather than rely on the variety and breadth of sequences in the alignments as indicative of information content, we used a statistical measure that quantifies the distribution of ET rankings in the 3D structure. Residues with smaller ET rankings tend to cluster together in active sites, protein-protein interaction sites or other functional sites (30, 31, 39–41). Such a clustering of top-ranked residues was particularly prominent in several SARS-CoV-2 proteins and complexes including the NSP5 main protease, the NSP7/NSP8/NSP12 RNA-dependent RNA polymerase complex and the NSP10/NSP16 RNA cap methyltransferase complex and can be visualized as groups of warm colored residues in the protein structure (Figure 1). We evaluated the quality of ET rankings using the Selection Cluster Weighting (SCW) z-score which measures how well highly ranked residues cluster relative to a randomized distribution of scores on the structure (see SI Materials and Methods). For almost all proteins the SCW z-Score is 2 standard deviations above the randomized background, suggesting that the alignments are informative and that the resulting ET rankings are meaningful (Figure S1, Dataset S3). For the proteins that do not reach significant z-scores there is a clear correlation to a lack of sequences in the alignments (e.g. NSP1, E, ORF3, and ORF7a), or, the structure belongs to a small domain within a larger protein (e.g. the macrodomain within NSP3 and the HR2 domain within the S protein).
To probe these smaller domains within large proteins we further investigated the ADP-ribose-phosphatase (ADPRP) subdomain and macro and papain-like protease (PLpro) domains of NSP3. NSP3 was an intriguing case because top-ranked ET residues cluster well in its PLpro domain but not in its macrodomain or in the ADPRP subdomain (Dataset S3). In order to better resolve ET rankings for NSP3, we generated new alignments, phylogenetic trees, and ET residue rankings for the subsequences specific to each NSP3 domain structure (see SI Materials and Methods). In this focused analysis, the PLpro domain now yielded ~50% more sequences leading to a corresponding increase in the clustering of top-ranked residues (Figure S5). For the macrodomain and ADPRP subdomain, thousands of additional sequences spanning the three domains of life and distantly related viruses were included in the new data set which resulted in ET rankings that rivaled the significance of clustering in the PLpro domain. The stark differences we find in the phylogenetic trees of specific NSP3 domains confirm previous observations of alternate domain configurations in different coronavirus genera and even within clades of betacoronavirus (6). The improvement in SCW z-score corresponds to a cluster of highly ranked ET residues within the ligand binding site of the macro domain and ADPRP subdomain (Figure S5D and E) which was missing in the analysis of the full NSP3 reference sequence. Having better resolved ET rankings in the NSP3 domains, we returned to the main data set to see how well ET rankings captured functional sites in other proteins.
Phylogenetically conserved ligand binding sites. A catalog of SARS-CoV-2 ligand binding sites could serve as a timely resource for prioritizing therapeutic targets. Previous studies have shown that evolutionary sequence information correlates well-enough with enzyme active sites so as to serve as 3D-templates for functional signatures (35) and identify allosteric sites (42, 43). Here we used NSP12, NSP15 and NSP16 as examples to show how the evolutionary sequence information captured by ET can successfully predict ligand binding sites for virus proteins. NSP12 is an RNA dependent polymerase, NSP15 mediates the cleavage of both single- and double-stranded RNA at uridine sites (44) and NSP16 is a m7GpppA-specific, S-adenosylmethionine (SAM)-dependent, 2’-O-MTase (45). As shown in Figure 2A-C, top ranked ET residues cluster around the native ligands of NSP12 (RNA) (46), NSP15 (GpU) (8) and NSP16 (m7GpppA and SAM) (47), indicating an accurate prediction of ligand binding sites for these proteins. Several new functional sites are also predicted by ET (Figure 2D and 2E). On the spike protein (S), one such ET cluster partially overlaps the S2’ protease cleavage site that is critical for membrane fusion and infectivity of the SARS virus (48). On the nucleoprotein (N), a cluster of highly ranked ET residues lies adjacent to the putative RNA binding site (49) and may contribute to formation of N protein-RNA helical filaments that are essential to packaging the RNA genome. These results indicate ET can provide alternative drug target sites with no currently available ligand-bound structures.
In addition to being important to protein function, ideal drug target sites should also be rarely mutated in the current outbreak so as to avoid the potential emergence of drug resistance. Thus, we focused on positions that do not have any mutations observed in the 52,061 high quality, full length SARS-CoV-2 sequences that were available as of September 14th, 2020. As more genomes and mutations within them are sequenced it may be necessary to lower the variant count stringency. In order to translate proteome-wide ET ranks and mutational profiles into potential drug target sites, we focused on clusters of mutation-free, surface-exposed residues that are highly ranked by ET and fall within 5Å of each other (Figure 3, Dataset S4). The resulting catalog of putative drug targets includes 116 sites at ~5 sites per structure with the largest structure (full-length model of Spike, 6vsb_1_1_1) having the highest number of sites. For NSP12, NSP15 and NSP16, the predicted drug targets overlap the known ligand binding sites.
In order to evaluate whether these ET drug sites may correspond to druggable target sites, we examined their overlap with sites observed in five SARS-CoV-2 protein-drug complex crystal structures. It is important to note that all 5 drugs showed an inhibitory effect in either cellular or biochemical assays. Remdesivir has been shown to speed up the recovery of COVID-19 patients in clinical trials (50), while the α-ketoamide inhibitor 13b can suppress SARS-CoV-2 replication in cell lines (51). Vir251 and tipiracil were also shown to effectively inhibit the enzymatic activities of their targets (7, 8). The remaining drug, sinefungin, is a pan-MTnase (NSP16) inhibitor that inhibits the growth of yeast cells ectopically expressing NSP16 from SARS-CoV (45). The ET drug sites were mapped onto the five SARS-CoV-2 protein-drug complexes (7, 8, 51–53) and, as shown in Figure 3, all five drugs reside in protein surface pockets that are within or very close to our predicted ET drug sites. The ET drug site for NSP5 is the least well recovered due to a single SARS-CoV-2 sequencing entry (strain MT745875) wherein several residues in the protease active site are mutated (G143S, S144E and C145I), including the catalytic cystine residue. S144E and C145I are both caused by two nucleotide substitutions in the codon, and only observed in this strain (sampled on 06/24/20). It is unclear whether this is a sequencing artifact or represents a genuine active site plasticity that compromises NSP5’s active site as a stable drug target. It does however illustrate the importance of accurately detecting emerging sequence variations when choosing drug targets. Overall, these results show that predicted ET drug sites can recover experimentally tested drug binding pockets and suggest new sites that can be targeted in computational docking approaches. In addition, because these sites are conserved across multiple coronavirus genera, these predicted ET drug sites are anticipated to be relevant for identifying inhibitors of SARS-CoV-2 as well as more distantly related coronaviruses.
Conserved linear epitopes. ET drugs sites may prove valuable in guiding drug design, but these approaches are dependent upon having high resolution crystal structures and some structures are either not yet available (e.g. NSP2, NSP6, M, and several accessory proteins), do not cover a majority of the protein (NSP3 and NSP4) or are too low in resolution for accurate docking studies (NSP12, NSP14, ectodomain of S, N, ORF3a and ORF7a). However, ET operates over linear protein sequences and thereby can identify phylogenetically important sequence fragments even in the absence of a 3D structure (54). As in our approach to discover ET drug sites, we combined ET residue ranking information with sequencing data from SARS-CoV-2 isolates to arrive at linear peptides along the proteome that are evolutionarily important and also show little variation in the current outbreak (Figure S6, Dataset S5). In order to assess the value of these epitopes, we asked whether they could recapitulate ET-derived drug sites. ET-defined linear peptides for NSP12 were mapped onto an available NSP12 structure and, as illustrated in Figure 4A, the majority of the structural and linear peptides overlap with each other. Linear ET peptides and ET drug sites overlap well for other SARS-CoV-2 proteins, which was quantified by Jaccard Similarity and Fisher’s exact test (Dataset S6). These data suggest that linear ET peptides contain functionally relevant information since they recapitulate ET drug sites for proteins or domains without requiring 3D structural data. In the absence of a protein structure, these ET peptides could be useful in designing inhibitory peptides (55, 56).
These peptides are also connected to a second main approach towards resolving the pandemic, by way of vaccine development. Although vaccines for COVID-19 may become available soon, ideally, effective protection against future outbreaks from related coronaviruses would require a broadly neutralizing effect wherein the immune system recognizes epitopes shared among coronavirus species. The prospect of raising a broadly neutralizing response is bolstered by a recent study wherein naïve patients, never exposed SARS-CoV-2, were found to possess a subset of T-cells that can cross-react to homologous epitopes shared by common cold coronaviruses and SARS-CoV-2 (28). In this context, we note that ET rankings reflect the degree of homology over the phylogenetic tree, so we reasoned that summing ET scores over the length of an identified T-cell epitope may be able to estimate its potential for cross-reactivity.
As a first step, we summed the ET ranks for each of the 40 SARS-CoV-2 epitopes that had been shown to react with patient-derived T-cells so that they could be ranked by predicted cross-reactivity to 161 common cold coronavirus epitopes assayed by Mateus et al. Although summing ET ranks could identify SARS-CoV-2 epitopes that are more likely to be cross-reactive (Figure S7), it did not account for the specific amino acid differences in the potentially cross-reactive homolog. In other words, ET ranks can predict whether or not a SARS-CoV-2 epitope will be cross-reactive in general, but they do not specify which epitope homologs will cross react.
In order to improve resolution of our predictions to specific epitope homologs, we next combined EA, a predictor of mutational impact, with the summed ET rankings. EA calculates the predicted impact of amino acid variations on protein function aiding in the interpretation of coding variants (36–38). Summing the predicted impact of amino acid changes between a SARS-CoV-2 epitope and a homologous epitope in another virus (sumEA) while adjusting for the SARS-CoV-2 epitope’s overall evolutionary importance (sum(100-ET ranking)) produced a metric that was able to separate cross-reactive epitopes from those that did not cross react (Figure 4B and S7, Dataset S7). This metric, sumEA/sum(100-ET ranking), was then applied to 21 untested SARS-CoV-2 T-cell epitopes and their common cold homologs (28). From a total of 92 homologs we identified 23 with potential to cross react to one of five SARS-CoV-2 epitopes (Figure 4C, Dataset S8). These 5 SARS-CoV-2 epitopes along with the 9 others experimentally shown to possess cross-reactivity could be used in a multi-epitope vaccination strategy that provides a broad neutralizing response to currently circulating coronaviruses, SARS-CoV-2 and, possibly, future outbreaks. Moreover, the approach is not specifically linked to any specific virus, so it could be replicated in other families of pathogens.
Dissemination. To disseminate these results, a public website (http://cov.lichtargelab.org) makes these data and analyses fully accessible. The data include, for example, multiple sequence alignments, pre-calculated ET ranks, and predicted epitopes (both linear and structural) for all SARS-CoV-2 proteins. In addition, an interactive structure viewer enables users to explore any one of the ET-colored structures (Figure 1) and predicted ET drug sites associated with those structures (Dataset S4-5). The website will be updated as new SARS-CoV-2 isolates and protein structures become available.