In order to enable a comprehensive structural analysis of emerging partial or complete SARS-CoV-2 genomes, we follow a three-phase approach. In the first phase, we analyze new genomes (e.g. from GISAID or sequences directly provided by associated laboratories) for sequences with amino acid exchanges in regions of SARS-CoV-2 protein structures that are potentially relevant for drug-binding or cell-uptake.
In phase two, we look for sequences where one or more mutations are found that have not yet been investigated in our structural-analysis pipeline. In such a case, the sequence is submitted to a structure-modeling workflow, whose results can be used to predict an influence caused by the respective amino acid exchanges on active-site regions or cavities that are potentially relevant for drug-development. In the case of the spike protein, on whose RBD we focus herein, relevant models are analysed using our Catalophore-Halo technology (defined and illustrated below).
In phase three, we compare any modified RBD’s Halo to the wild-type Halo. If such a Halo comparison shows a substantial change, we expose the new variant to a LIE-based molecular dynamics modelling pipeline in order to predict the corresponding change in binding affinity.
SARS-CoV-2 sequence analysis of spike-RBD diversity
By December 6th 2021, 5,799,116 SARS-Cov-2 genome sequences and 5,649,261 spike-protein sequences were available at GISAID21. For the sequence analysis of the SARS-CoV-2 RBD diversity we extracted 5,173,253 RBD sequences by filtering the spike-protein sequences for the RBD flanking residues “TSNF” (position 315-318) and “NFNG” (position 542-545) as well as for correct size of 223 amino acids. Within these sequences we identified a total number of 7,700,325 amino acid exchanges, which amounts to 0.67 % of all residues. Compared to August 2020, when 185 mutations were identified in 73,042 spike-RBD sequences13, this occurrence per sequence of mutations shows a 588-fold increase, highlighting the progressive evolution of this virus30,31. The accumulation of mutations was significantly higher in the receptor binding motif (RBM), the 72 amino acid long sequence within the RBD, which mediates contacts with hACE215: With a total of 7,451,124 amino acid exchanges, meaning 2.00 % of all residues within all of the collected RBM sequences, the relative number of mutations within the RBM increased 1,547-fold compared to August 2020, where 68 amino acid exchanges were found in 73,042 RBM sequences13.
Out of the 5,173,253 considered RBD and RBM sequences, 85.11 and 84.67 %, respectively, contain at least one amino acid exchange, with 99.48 % of all mutated RBDs bearing mutations within the RBM. The mean number of mutations within the RBD and RBM sequences in this analysis are 1.75 ± 0.57 and 1.70 ± 0.49, respectively. With 15 amino acid substitutions located in the RBD, ten of which occur in the RBM, the recently discovered VOC Omicron stands out from previous variants by an at least three times higher number of mutations in this region21 (see also Figure 1b).
Referring to a list of 143 representative SARS-CoV-2 spike-protein sequences available at GISAID21, six of the amino acid exchanges within the Omicron RBD co-occur in other RBD variants, namely K417N (present in seven variants including VOC Beta), N440K (three variants), S477N (four variants), T478K (75 variants, including VOC Delta), E484A (two variants) and N501Y (33 variants including VOCs Alpha, Beta and Gamma). Their potential influences on one or more properties such as protein stability and flexibility as well as binding and sensitivity to antibodies have been evaluated, whereby exchanges N440K, S477N, T478K and N501Y were reported to possibly enhance binding affinity to hACE213,32–36.
The remaining nine mutations within the Omicron RBD (G339D, S371L, S373P, S375F, G446S, Q493R, G496S, Q498R and Y505H) do not occur in any of the 143 representative SARS-CoV-2 spike-protein sequences21. Nevertheless, possible influences of these amino acid exchanges have already been studied in silico or in vitro, with G339D, Q493R and Q498R being predicted to enhance binding to hACE235,37,38.
Visualization of amino acid exchanges using difference Halos
The influence of respective amino acid exchanges in the spike RBDs of VOCs Alpha, Beta, Delta and Omicron regarding crucial properties can be visualized by Catalophore Halos. Halos are multivariate physicochemical-property fields composed of points aligned to an equidistant grid. Herein, we show electrostatics (lower triangle, Figure. 2a) and hydrophobicity (upper triangle, Figure 2a) of the spike-RBD Halo. This depiction makes it easier to determine at a first glance, whether or not mutations have a noticeable influence on the RBD-hACE2 interface field and to decide whether or not a comprehensive structural analysis will be prioritized in such cases.
In order to spot the most prominent differences between variants, we chose to present the electrostatics and the hydrophobicity Halo point clouds (out of the 15 calculated physico-chemical properties), represented as a difference Halo matrix (see Figure 2). Differences were calculated by subtracting Halo 2 (columns of Figure 2a) from Halo 1 (rows). Regarding electrostatics and hydrophobicity, the Omicron Halo - among all the other variants plotted here - shows the most eye-catching differences compared to the wild-type Halo (Figure 2c and 2b). Interestingly, the Halos of the Beta and Gamma variants appear to be more different from the wild type regarding electrostatics than the Delta variant (Figure 2a), whereas the Alpha variant shows almost no difference regarding electrostatics and hydrophobicity.
By visual comparisons of the Halo cloud points regarding differences in property distribution of electrostatics and hydrophobicity, Omicron seems to be akin to a combination of Beta and Delta with additional changes in the region where the mutation N440K is located (Figure 2b). The large difference in the Halo between the Omicron and the Delta variants is explained by the mutations G339D, S371L, S373P, S375F, K417N, N440K, G446S, S477N, E484A, Q493R, G496S, Q498R, N501Y, Y505H only present in Omicron and the absence of L452R while only sharing T478K. At the binding interface, the new arginine replacing glutamine at Omicron spike position 493 yields a new strong hydrogen bond with ACE2 Glu35 which is clearly reflected by the dark blue area (increase in positive charge) of the Omicron-wildtype difference Halo as illustrated by Figure 2c. These depictions allow a rapid qualitative overview of the extent of physico-chemical changes at the RBD-hACE2 binding region. As a consequence, they quickly inform a decision on which variants to prioritize for further resource-intensive structural and dynamic analysis, e.g., calculating the binding affinities to hACE2 using the LIE prediction model we developed for this purpose.
LIE-model development and optimization
In order to achieve an as accurate as possible predictive model, we evaluated various simulation-time ranges and replicate numbers. For each combination of these two optimization parameters, we calculated two sets of binding energies ΔG, once by least-squares model fitting and once through leave-one-out cross-validation (LOOCV). The latter value set can safely be considered as a true prediction since none of the variants was included in the training set used for its evaluation.
We evaluated agreements between experimental and predicted ΔG values of each run-parameter combination by calculating corresponding mean absolute errors (MAE) given in kJ/mol and coefficients of determination (R²). According to the last prediction column in Supporting Table 1 (visualized in Figure 3a), the highest prediction accuracy and lowest error was achieved with 50 replicates of 500 ps molecular dynamics (MD) simulation trajectories, where R² amounts to 0.79 and 0.71 with an average error of less than 1.9 and 2.1 kJ/mol for model fitting and cross-validation, respectively. Using MD simulations longer than 500 ps (1 and 2 ns) actually reduced the predictive accuracy during cross-validation in terms of both quality measures which are, by the way, strongly correlated (r2=0.92). As expected, the highest accuracy was obtained with the highest number of replicates (50).
A corresponding scatter plot of predicted vs. experimental binding energies is depicted in Figure 3c. The ranking of variants in our training set, which comes from experimental measurements, is very well reflected by our model predictions, in particular comparing for instance the SARS-CoV-2 wild type (WT) with Alpha (green dots in Figure 3c). In addition, the distribution of experimental ΔG values in terms of mean and standard deviation (-55.76 ± 4.7 kJ/mol) agrees with those of the fitting (-55.83 ± 3.0 kJ/mol) and cross-validation procedure (-55.78 ± 4.0 kJ/mol) extraordinarily well, which was achieved with the optimal fitted weights, wvdw=0.758 and welec=0.028 (see below).
For this large difference in weights between the Van der Waals and the electrostatic contributions, two explanations are natural. On the one hand, spike-hACE2 binding affinities might mainly be affected by hydrophobic rather than electrostatic interactions and, on the other hand, the computed electrostatic interaction term might just not correlate well with experimental findings. However, having applied this predictive model to an unpublished internal set of 21 hACE2 variants in complex with the wild type for which EC50 values were available, we obtained a high prediction accuracy of R²=0.64. Prior to that, EC50 values had been converted to binding free energies in analogy to KD values.
Considering the number of replicates needed for a sufficient convergence of the energies and consequently of affinity-based variant rankings, the plot in Figure 3b indicates reasonable stability already from 30 replicates on, although MAE and R² are still slightly worse compared to 50 replicates (Table S1). We can safely assume a continuous growth of the model accuracy with additional replicates. However, with regard to the required run time (currently 2.7 days on a Amazon AWS C6i.16xlarge instance), 50 replicates seem to be a reasonable choice if we put more emphasis on accuracy. For this convergence analysis we determined the fraction of variant pairs associated with a swap in their energy order while increasing the number of replicates. All frequencies in percent were obtained via division by the total number of variant pairs n=432=1849. In more urgent cases, 30 replicates would take around one and a half days to run at a slightly lower accuracy.
Another major quality aspect of binding-affinity models is related to the predicted ranking of VOCs, as we strive for high consensus between predicted and experimental top-N variants. Figure 3d illustrates the fraction of variants among the top N predicted ones that are also included in the top N experimental variants of the training set. We compared the performance of an LIE model A) trained with 50 replicates and applied to predictions on the basis of 50 replicates as well, B) same as A) but using 30 replicates for both steps, and C) a mix of A) and B), namely model training with 50 but predictions on 30 replicates. Finally, a non-empirical prediction model published by Singh et al.13 in 2020 and based on umbrella sampling along with weighted-histogram analysis was applied to the same training set of 43 variants. As expected, the highest consensus, especially in the range of up to top 5, was obtained by models trained with 50 replicates, models A) and C) corresponding to the green and red plots (where green is mainly hidden behind the red line). Regarding predictions of training-set items, 80-100 % of the top 3-5 are in agreement with experimental top 3-5 candidates. Considerably less consensus was obtained with LIE model B) based on 30 training replicates yielding around 60-80 % at top 3-5 and, in particular, by the umbrella sampling method with 30-50 % consensus.
LIE application to VOCs
Using our final predictive model calibrated with the optimal weights
we estimated binding affinities for VOCs that emerged during the past year: Alpha, Beta, Gamma, Delta, and most recently Omicron. However, we have to bear in mind that our structural variant models only comprise the spike RBD rather than the entire spike trimer. Table 2 shows not only our calculated binding free energies along with KD values derived at 310 K temperature, but, for the purpose of rank comparison, also experimental values for WT, Alpha, Beta, and Gamma recently determined by Barton et al. through surface plasmon resonance32.
Variant
|
Experimental binding affinities
|
LIE model prediction
|
|
KD
[nM]
|
KD*
[nM]
|
KD*
ratio
|
ΔG
[kJ/mol]
|
ΔG
[kJ/mol]
|
KD
[nM]
|
KD
ratio
|
ΔΔG
[kJ/mol]
|
WT
|
74.4
|
1.70
|
1
|
-52.0
|
-53.1
|
1.11
|
1
|
0
|
Alpha
|
7.0
|
0.16
|
10.6
|
-58.1
|
-58.1
|
0.16
|
6.8
|
-4.9
|
Beta
|
20.0
|
0.46
|
3.7
|
-55.4
|
-53.2
|
1.07
|
1.0
|
-1.2
|
Gamma
|
13.5
|
0.31
|
5.5
|
-56.4
|
-54.2
|
0.74
|
1.5
|
-2.1
|
Delta
|
n.d.
|
-53.5
|
0.96
|
1.2
|
-1.5
|
Delta plus
|
n.d.
|
-52.3
|
1.56
|
0.7
|
-0.2
|
Omicron
|
n.d.
|
-56.8
|
0.27
|
4.1
|
-4.7
|
Table 1. SARS-CoV-2 VOCs binding free energies and dissociation constants predicted by LIE model and compared to KD values determined through surface plasmon resonance by Barton et al.32 KD* refers to experimental KD scaled to the value range of our training set using the corresponding WT KD (1.7 nM) as a reference. KD ratio relates all predicted as well as Barton’s measured (and scaled) KD values to the corresponding wild-type reference KD, whereas ΔΔG was calculated as the predicted binding-energy difference from the predicted WT energy (-50.9 kJ/mol).
For comparability reasons on the Gibbs energy level, Barton’s dissociation constants have been scaled to the value range of Zahradnik’s data/training set using the WT as an anchor point for which Zahradnik had published a KD value of 1.7 nM. This scaling gives an additional column KD* determined according to equation
The KD ratio column simply relates predicted KD values to the WT reference KD of 1.7 nM indicating how many times more strongly a particular mutant binds than the wild type. A final column with ΔΔG represents deviations of predicted binding energies from the WT binding energy.
For VOC binding affinity predictions, 240 replicates have been produced and analysed. According to these results in Table 1 and Figure 3c, the predicted order of dissociation constants for WT, Alpha, Beta and Gamma exactly reflects experimental findings32,39. In particular, with a 6.8-fold decrease compared to the wild-type dissociation constant, the most outstanding binding affinity among all VOCs (as far as experimental results are published) was predicted for Alpha. This is also in line with lab data32,39 featuring around 10-fold binding-affinity increase and reports40 stating, for the UK between October and November 2020, a 70%–80% rise in transmissibility with respect to the wild type reference.
For the new SARS-CoV-2 Omicron spike variant, our studies reveal a similarly remarkable 4.1-fold increase of binding affinity compared to the wild type. In terms of binding strength it therefore ranges between Alpha and Gamma. A significantly lower binding affinity, somewhere between Beta and WT, was predicted for Delta and Delta plus. Moreover, Delta plus (having K417N in addition to the amino acid exchanges found in Delta) was predicted to have a lower affinity than Delta, which is supported by experimental results revealing weaker binding due to the presence of K417N39,41. In addition, the authors allude to an increase of immune evasion caused by this mutation. It should be noted that Omicron as well includes this mutation possibly explaining its lower binding affinity compared to Alpha and thereby indicating a strong tendency to immune evasion for Omicron. The relative order of WT, Delta and Omicron is also in line with Docking results recently published by Kumar et al.42, although energy magnitudes and therefore differences given in that article translate to a KD ratio of 10 million for Omicron compared to the WT.