Protein models preparation
The crystal structure of the SARS-CoV-2 spike receptor-binding domain (RBD) bound to the ACE2 receptor was retrieved from the Protein Databank (pdb code 6M0J) and used as a starting point for protein models preparation. Two protein models were derived from this crystal structure : the RBD/ACE2 complex form and the RBD unbound form. Both structures were relaxed 100 times using rosetta modeling suite version 3.12 36,37 and the lowest scoring models were kept. The relaxations were made with coordinate constraints ensuring that the models do not deviate by more than 0.15 Angstroms from the initial crystal structure 38. Additional flags were set in order to account for glycosylated amino acid residues.
After relaxation, the RBD/ACE2 interface residues were computed on the complex form using Rosetta scripts and the InterfaceByVector residues selector with default settings. 27 and 25 residues were selected respectively on the RBD and ACE2 side.
Computational protein design and exhaustive sequence enumeration
Computational protein design and exhaustive suboptimal sequence enumeration tasks were performed on the RBD/ACE2 complex protein model using POMPd 23. POMPd relies on PyRosetta to compute energy matrices. PyRosetta version r245 was used in this project.
Mutable, flexible and rigid residues were defined as follows: the 27 interface residues on the RBD side were mutable, the 25 interface residues on the ACE2 side were flexible and all other residues were rigid. Mutable residues are allowed to mutate to any of the 20 natural amino acid types, flexible residues can reorient their side chain without changing their amino acid type and rigid residues are completely frozen. The Dunbrack 2010 rotamer library 39 was used to define the conformational search space. The basic level of rotamer discretization was used, no extra rotamers were added on chi angles. The Rosetta genpot energy function was used, and additional flags were set in order to account for glycosylated amino acid residues. For design, POMPd calls toulbar2 with flags “-dee: -hbfs: -m -A -s --cpd“. For enumeration the additional flags “--scpbranch -a -ub <Emax>” are added. All calculations were performed on the CALMIP high performance computing cluster, using Intel Skylake 6140 2.3 Ghz CPUs.
Calculations on the RBD/ACE2 complex form
A side chain positioning task was performed on the L strain structure in order to compute its optimal energy, which was determined to be -1694.57 kcal/mol. The Global Minimum Energy Conformation (GMEC), the optimal sequence using our settings, was computed. The GMEC was found to have an energy of -1704.42 kcal/mol. An exhaustive enumeration of suboptimal sequences was then performed using an energy threshold of 8 kcal/mol above the GMEC. 91,056,763 different sequences satisfying the threshold were identified. The energy threshold of 8 kcal/mol was determined to be the maximum value for which results could be obtained in one day time with the computational resources used approximately 400 gigabytes of RAM, and given that the number of sequences grows exponentially with the size of the enumeration threshold (Fig S2). The L strain sequence is not present in the enumerated sequences, it is located at 9.85 kcal/mol from the global optimum.
Calculations on the RBD unbound form
A side chain positioning task was performed on the L strain RBD unbound form using toulbar2. Its optimal energy, using our settings, was determined to be -386.74 kcal/mol.
For each one of the 91 million sequences found in the enumeration, the RBD unbound form energy was also computed by solving 91 million NP-complete side chain positioning problems to optimality. The 27 interface residues were defined as flexible and all other residues were kept rigid. The computation was performed using a parallel implementation MPI-based variant of toulbar2. It was completed in less than 2 days on 200 CPU cores.
∆∆G fitness landscape
∆∆G values were computed as ∆∆G = ∆Gmut – ∆Gwt = (Ecomplexmut – Eapomut) – (Ecomplexwt– Eapowt)
where Ecomplexmut and Eapomut are the energies of each mutant in the enumeration, respectively in complex and apo forms, Ecomplexwt and Eapowt are the energies of the L strain respectively in complex and apo forms. Prior to computing the fitness landscape, the 91 million sequences were filtered in order to retain only sequences having a sufficiently stable RBD unbound form, with energy less than 1 kcal/mol worse than the L strain and a negative ∆∆G energy (with increased predicted affinity towards ACE2). The remaining 6,390,176 sequences were further filtered in order to remove all mutants exhibiting unpaired cysteine mutations. The final set includes 4,507,187 different sequences. The fitness landscape was computed on the final set of sequences, using a Hamming distance of 1 as neighborhood and ∆∆G energy as the fitness function. We could include the L strain variant in the fitness landscape since it is a neighbor of two sequences.
Local optima cluster representatives calculation
The fitness landscape contained 3,272 local optima, which were clustered with mmseqs using a sequence identity threshold of 80%:
mmseqs easy-cluster in_fasta out_clusters tmp --min-seq-id 0.8
The clustering produced 59 clusters. Each cluster medoid was then identified, and the 59 corresponding sequences were selected for experimental analysis. A sequence logo representing all local optima was computed using Weblogo 40.
Most probable paths from the L strain to active potential variants
We calculated shortest paths between the L strain and active potential variants in the ∆∆G fitness landscape graph in which edges were weighted by mutational probabilities. We used DNA level mutation rates estimated by maximum likelihood using MEGA 41 with the General Time Reversible model (best fit under AIC and BIC regularization) extracted from 42 on coronavirus genomic sequences. From this, we computed a transition probability matrix at the DNA level using matrix exponentiation, with a time parameter adjusted to get an expected number of DNA mutations of around one mutation over the designed RBD region (with 27 residues). Matrix exponentiation was computed using the Pade approximation available in Python scikit as the scipy.linalg .expm function. The resulting transition matrix gives access to transition probabilities P(M[W) that a given DNA base W (in the L strain) will mutate to a base M in the next time-slice. To compute the amino acid level mutation rates induced by this DNA transition matrix, we first computed a codon to codon transition probability matrix, assuming independent identically distributed mutation rates given by the previous matrix. For a given amino acid A, let lc(A) be the set of synonymous codons representing amino acid A, the a priori probability that a given codon c in lc(A) is used to represent A is simply f(c) = r(c)/|lc(A)| where r(c) is the Relative Synonymous Codon Usage (RSCU) of the synonymous codon c, as computed for SARS-CoV-2 coronavirus 43. The probability for an amino acid W, represented by a latent codon variable, to mutate in an amino acid M (represented by any of its synonymous codon) is then P(M|W) = Sum_W in lc(W) f(cW) Sum_cM in lc(M) P(cM|cW). The -log of this transition probability matrix was used to weight the edges connecting two sequences in our variant landscape. The weight of a minimum cost path between two variants then defines a most likely path from the source variant to the target variant. Dijkstra’s algorithm was used to compute the shortest paths from the L strain to active potential variants.
Sequence community graph
The community graph was calculated from 4,507,188 sequences (including L strain variant). It was partitioned using the Leiden algorithm and modularity as a quality measure. Each node in the graph represents a community of sequences. Edges between the nodes were weighted with mutational probabilities described previously. The size of the nodes is proportional to the log of the size of communities. The thickness of edges connecting the L strain community is proportional to the log sum of all L strain community outgoing edges weights. All nodes with a degree smaller than 3 were removed. Self edges were removed and the graph was made undirected. The Leiden algorithm was run for 50 iterations and appears to be stable with a modularity of 0,93. Calculations were done using python leidenalg library. The partition was computed with the following command:
la.find_partition(g, weights = 'weights', partition_type=la.ModularityVertexPartition, n_iterations=50)
Theoretical affinity of antibodies towards L strain RBD and potential variants.
Three different antibodies in complex with L strain RBD were used for ∆∆G calculations (pdb codes : 6XDG and 7C01). These complexes were relaxed 100 times using rosetta modeling suite version 3.12 (genpot scoring function) and the lowest scoring models were kept. Coordinate constraints were set in order to ensure that the models do not deviate by more than 0.15 Angstroms from the initial crystal structure. Additional flags were set in order to account for glycosylated amino acid residues. The energy of each of these complexes was calculated.
The energies of the potential variants in complex with antibodies were obtained by side chain positioning calculations.
Identification of matching natural RBD sequences
We downloaded natural spike protein sequences from GISAID (https://www.gisaid.org/), using the spikeprot0125.tar archive containing 7,352,708 and only kept the 7,241,769 sequences with length above 1620, representing putative full-length sequences (possibly containing wildcard characters ‘X’). Identifying the subset of our 4,507,187 RBD motifs that appears in the database would require more than 20,000 billions pairwise alignments. Suspecting that only few of these would appear in the natural diversity, we exploited the gap structure of the motifs to look for matches of partial dense sub-motifs in the GISAID set. The submotif defined by the 18 last residues of our RBD motif contains only short gaps (the longest being 7 residues long). Our 4,507,187 RBD motifs contain only 152,487 different combinations of these 18 residues. We sorted this set alphabetically and divided it into 50 subsets. Exploiting the fact that any finite language is regular, we built a regular expression containing the disjunction of the motifs appearing in it and compiled it to a Deterministic Finite State Automata (DFA). By bringing similar motifs closer together, sorting before splitting increases the likelihood that the automata size will be small. Search was performed only in the subregion of the full sequences starting at position 451 and ending at position 519. DFA compilation and search was performed using Google’s re2 library (https://github.com/google/re2), as available in the Python API pyre2 (https://pypi.org/project/pyre2/). The sets of RBD motifs that yielded no hit were discarded and the same process of division in subsets, disjunction, automata compilation and search repeated recursively until singleton sequences with hits in GISAID were identified. We then selected full RBDs containing one of these 18-residue motifs with GISAID-hits and repeated the same search process. We found no occurrence of these in the GISAID sequences. We therefore repeated the same overall process, allowing this time for precisely one mismatch. With this added matching flexibility, our set of designed RBDs had 4,905,597 hits in GISAID (67.7%), covering the L strain, Delta and Lambda VOCs (Variants Of Concern), from a total of 51 designed RBDs (see Table S5-GISAID matches). With one extra mismatch allowed, the Alpha, Kappa, Eta and Iota VOCs are also covered.
Extraction of 27-residues RBD motifs from GISAID
From all sequences of the spike protein in the spikeprot0125.tar archive, we kept the 7,241,769 sequences with length above 1,620, representing putative full-length sequences (possibly containing wildcard characters ‘X’). From each sequence, we extracted the region from position 454 to 555 that was expected to contain the RBD design region (positions 404 to 505 in the L strain). We removed all sequences containing ‘X’ and removed duplicate sequences. A multiple sequence alignment was computed with mafft, using the L strain protein S sequence as a reference, preserving length (using mafft flags --6merpair --thread -1 --keeplength --addfragments). The 27 residues of interest were extracted from each sequence, resulting in a set of 826 different 27-mers. All sequences with remaining gaps were removed, resulting in a set of 774 unique gapless 27-mers. A sequence logo was computed from this set using Weblogo.
2D map of ∆∆G local minima landscape
We projected the ∆∆G local minima landscape on a 2D map using t-distributed stochastic neighbor embedding (t-SNE) as implemented in the python sk-learn package. We defined a customized distance metric in order to ensure that local minima clusters computed with mmseqs2 are correctly identified by t-SNE:
d(s1,s2) = Hamming(s1,s2) + λ.same_cluster(s1,s2)
Where Hamming is the Hamming distance between two sequences (i.e., number of mutations), same_cluster is a function which returns 1 if two sequences belong to the same cluster and 0 otherwise, and λ is a control parameter (λ=20 in our calculations). The t-SNE algorithm was run for 1000 iterations with a learning rate of 50, a perplexity value of 6 and an early exaggeration value of 12.
Sequence entropy
The sequence entropy of the 27 residues of the RBD interface were computed on the 3272 local optima of the ∆∆G fitness landscape, as well as on the 774 unique sequences extracted from GISAID. The Shannon entropy was calculated after normalizing the frequency of occurrence of each amino acid type at each position by the natural frequency of occurrence of amino acids as estimated in the literature 44.
Yeast display experiments
DNA sequences encoding for the receptor binding domains (RBDs) of L strain SARS-CoV-2, the human coronavirus 229E and the 59 potential variants (PVs) were synthesized by Twist Bioscience. Next, they were amplified by PCR to introduce 50 nucleotides long flanking sequences complementary to the yeast display plasmid (RRID:Addgene_41522). The amplified DNA sequences and the linearized yeast display plasmid were transformed into Saccharomyces cerevisiae cells (Strain EBY100; ATCC) so that the yeast homologous recombination machinery ligated the DNA sequences encoding for the RBDs at the N-terminal of the Myc tag. Transformed cells were selectively grown in tryptophan-free minimal (SD-Trp-Ura) media (6.7g/L Yeast Nitrogen Base, 5.0g/L Casamino acids, 1.065 g/L MES acid, and 2% w/v dextrose) for 24 h at 30 C, with shaking. Next day, cell media was changed to SG-CAA media (2% Galactose, 0.67% Yeast Nitrogen Base, 0.5% Casamino Acids, 0.54% Sodium Phosphate Dibasic, 0.856% Sodium Phosphate Monobasic Monohydrate) to induce RBD expression for 24 h at 30 C, with shaking. Next day, induced cells were spun down for 2 mins at 2,000 x g, resuspended in HBS blocking buffer (20 mM Hepes 7.4, 150 mM NaCl, 1% (w/v) BSA) and incubated with recombinant Fc-ACE2 for 45 mins at RT, with shaking. Next, plates were washed twice with HBS blocking buffer and incubated with 1:250 diluted FITC-conjugated anti c-Myc (Immunology Consultants Lab, CMYC-45F) and Alexa647-conjugated anti-human- antibodies for 30 mins at RT, with shaking. Cells were washed twice with HBS blocking buffer and cell fluorescence was measured using an IntelliCyt high throughput flow cytometer. Cells were gated to exclude non-single cells, FITC labeling was used to select RBD-expressing cells, and Alexa647 labeling was used to quantify Fc-ACE2-binding cells. Fc-ACE2-binding is reported as percentage within the FITC+ population and was gated according to the Alexa647 signal of the positive (L strain RBD) and negative (229e RBD) controls.
Expression and purification of recombinant soluble proteins
The DNA constructs for the RBDs of L strain SARS-CoV-2, the human coronavirus 229E and the eight PVs that showed Fc-ACE2 binding in yeast experiments were codon-optimized for mammalian cell expression, synthesized by Twist Bioscience and cloned into a mammalian expression vector as C-terminal genetic fusions to a 10xHis, a siderocalin module and a 3C protease cleavage site. The DNA construct encoding Fc-ACE2 was acquired from Addgene (#164222) and the DNA constructs encoding the four neutralizing antibodies were synthesized by Genscript. 24 μg of the respective DNA constructs were used to transfect 30 ml of suspension Expi293F (Thermo Scientific) cells at a density of 2.5E6 cells/ml in Expi293 media (Thermo Scientific) and cells were grown at 37 C in a humidified 8% CO2 incubator, with 130 rpm shaking. After 24 h, cells were feeded with 3 mM valproic acid and 0.45% glucose. After 5 days, cells were harvested for 10 mins at 1,000 x g. All RBD variants were purified using a sepharose Ni-IMAC resin (Pierce, Thermo Scientific) and eluted by 3C protease cleavage. The expressed IgG and Fc-ACE2 were purified using a protein A resin and eluted with 150 mM NaCl, 100 mM glycine (pH 2.8).
Kinetic analyses by biolayer interferometry (BLI)
BLI experiments were performed using an Octet 8-channel system (Sartorius) using HBS blocking buffer supplemented with 0.05% (w/v) Tween-20. 30 nM Fc-ACE2 or 20 nM of the three tested therapeutic antibodies were immobilized on Octet protein A biosensors. The biosensors were dipped into wells containing purified L strain RBD, 229E or the respective PV at 2000, 1000, 500, 250, 125, 62.5 and 31.3 nM concentrations for 200 seconds, and subsequently dipped into wells containing HBS blocking buffer supplemented with 0.05% (w/v) Tween-20 for 200 seconds. Data were reference-subtracted, and curves were fitted using the GraphPad Prism association dissociation model (https://www.graphpad.com/guides/prism/latest/curve-fitting/reg_equaton_association_then_disso.htm).
ACE2-expressing HEK293 cell lines
All transduction and neutralization experiments were performed using an Ace2-negative (Ace2-) HEK293 cell line, and two different Ace2-positive (Ace2+) HEK 293 cell lines. One ACE2+ cell line was transiently transfected with an ACE2 plasmid (Addgene #141185) followed by three rounds of hygromycin selection. Next, ACE2 expression was validated by binding of an anti-Myc antibody (Fig S.10). The other ACE2+ cell line was created by lentiviral transduction and previously published (Wu et al. 2021). Significant differences were not observed in the results obtained with both cell lines.
Production of spike-pseudotyped lentivirus
To produce lentiviral particles pseudotyped with the spike (S) protein, the RBDs of the chosen PVs were cloned in the context of the S protein into the Addgene # 145032 plasmid.
Subsequently, 3 μg of the corresponding spike plasmid, 12 μg of a lentiviral backbone expressing neonGreen (Addgene #162034) and 9 μg of a 2nd generation lentiviral packaging plasmid (Addgene #122600) were used for triple-transfections of 30 ml Exp293F cells. After 24 hours, cells were feeded with 3 mM valproic acid and 0.45% glucose, and lentivirus production proceeded for 4 additional days. After that, cells were pelleted for 5 mins at 1,000 x g and supernatants were filtered using 0.45 um filters (Sartorius) and stored at 4C.
The generated lentiviral particles were quantified by RT–PCR using a commercial kit (Biovision cat. No. K1471) that includes lysis buffer, reverse transcriptase, DNA polymerase and oligos annealing to the lentiviral scaffold. A standard curve was built with the provided standards and used to quantify the lentivirus amounts. Serial 10-fold dilutions of all pseudotyped lentivirus were run.
Pseudotyped lentiviral particles transduction assays
Ace2+ and Ace2- HEK293 cells were seeded in 96 well plates at 5x10E4 cells per well in DMEM media supplemented with 10% FBS and incubated at 37 C. After 24 h, media was removed and replaced by equal amounts of all pseudovirus diluted in fresh DMEM media. After 18 hours, cell media was removed and cells were washed three times with HBS blocking buffer. Next, viral transduction was measured as neonGreen fluorescence using an IntelliCyt high throughput flow cytometer. Uninfected controls and the lentiviral particles expressing the RBD of the L strain were used to set the gates. All experiments were done in technical replicates and repeated in 3 different days, and statistical significances were calculated by unpaired t tests using the GraphPad Prism software version 9.0.
Neutralization assays
First, we estimated the IC50 of the four purified neutralizing antibodies on pseudotyped lentiviruses carrying the L strain RBD (fig. S9), and then used a concentration corresponding to 10 x the respective IC50s for the neutralization experiments. To estimate the IC50s of the neutralizing antibodies with pseudotyped lentivirus expressing the RBD of the L strain, 10-fold dilutions of the antibodies were incubated with L strain pseudotyped lentivirus for 1 hour at 37 C and subsequently added to ACE2+ HEK293 cells. NeonGreen fluorescence was analyzed using an IntelliCyt high throughput flow cytometer and the data from 4 different experiments were used to estimate the IC50 values using the Graphpad Prism version 9.0. For the neutralization experiments, equal amounts of the lentivirus pseudotyped with the L strain or the corresponding PV variants were pre-incubated with 10 x IC50 concentrations of the 4 antibodies for 1 hour at 37 C and subsequently added to ACE2+ cells. Technical duplicates and at least 3 biological replicates of each sample were performed. Statistical significance was calculated by unpaired t tests using the GraphPad Prism software version 9.0.