RAIN: a Machine Learning-based identification for HIV-1 bNAbs

Broadly neutralizing antibodies (bNAbs) are promising candidates for the treatment and prevention of HIV-1 infection. Despite their critical importance, automatic detection of HIV-1 bNAbs from immune repertoire is still lacking. Here, we developed a straightforward computational method for Rapid Automatic Identification of bNAbs (RAIN) based on Machine Learning methods. In contrast to other approaches using one-hot encoding amino acid sequences or structural alignment for prediction, RAIN uses a combination of selected sequence-based features for accurate prediction of HIV-1 bNAbs. We demonstrate the performance of our approach on non-biased, experimentally obtained sequenced BCR repertoires from HIV-1 immune donors. RAIN processing leads to the successful identification of novel HIV-1 bNAbs targeting the CD4-binding site of the envelope glycoprotein. In addition, we validate the identified bNAbs using in vitro neutralization assay and we solve the structure of one of them in complex with the soluble native-like heterotrimeric envelope glycoprotein by single-particle cryo-electron microscopy (cryo-EM). Overall, we propose a method to facilitate and accelerate HIV-1 bNAbs discovery from non-selected immune repertoires.

no cure nor vaccine for HIV-1 infection available 2 .Humoral immune response to HIV-1 targets the envelope (Env) protein of the virion, a trimeric membrane glycoprotein complex comprising gp120 and gp41 3 .However, the virus rapidly escapes immune control due to the exceptional Env glycoprotein diversity generated by HIV-1 error-prone replication machinery 4 .Moreover, additional mechanisms of immune evasion exist, such as heavy glycosylation of gp120, promoting a conformational masking of the receptor-binding site, enhancing immune evasion 5 .Screening of plasma from HIV-1 seropositive (HIV-1+) subjects led to the identi cation of rare individuals possessing sera with broad and potent neutralizing activities against numerous HIV-1 viruses.Additional studies allowed the cloning and sequencing of B cell receptors (BCRs) and permitted the identi cation of broadly neutralizing antibodies (bNAbs), which can neutralize most viral strains at low concentrations in vitro 6 .Investigation of the development and structural properties of these bNAbs, revealed only a low level of sequence identity between them, but demonstrated that speci c characteristics are associated with their function.For example, bNAbs have an extreme level of somatic hypermutations (SHMs) and large nucleotide insertions leading to long heavy chain complementary determining regions (CDRs) 7,8 .
Since their identi cation, bNAbs have gained intense therapeutic interest.Although approved drugs against HIV-1 infection exist, passive antibody prophylaxis and immunotherapy could hold a valuable place in both prevention and treatment 9 .Passive transfer of bNAbs demonstrated a decrease of viral loads 10,11 , prevention of infection 12,13 , delay of viral rebound 14,15 and suppression of viremia in humanized mice, non-human primates and human without notable adverse events or side effects 16,17 .BNAbs target distinct sites of vulnerability at the surface of the envelope: the CD4-binding site (CD4bs), variable loop V1/V2 apex and V3 loop, a larger site spanning the interface between gp41 and gp120 (interface) including the fusion peptide, and the membrane-proximal external region (MPER).
Recently, a sixth site was discovered, de ned by the bNAb VRC-PG05, which binds to the center of the so called "silent face" of gp120 18 .
To date, the identi cation of bNAbs has required B cell isolation and clonal expansion from selected individuals possessing a serum with broadly neutralizing activity.This step is followed by antibody cloning and experimental validation of their neutralization potential.While both steps represent an important research effort, the process has bene ciated from identi ed immune donors 19 and the development of high-throughput analyses of antibody repertoires by next-generation sequencing (NGS).
Still, the number of identi ed HIV bNAbs remains relatively low, with only 250 of them reported 3,20 .Some bNAbs have been investigated in registered clinical trials, for prevention, as a component of long-acting antiretroviral therapy (ART), or as a component of intervention aimed at long-term drug-free remission of HIV 17,21,22 .Although, it is likely that the clinical success of bNAb passive immunization strategies will require a combination of antibodies to increase the overall breadth and potency against diverse HIV-1 isolates and to prevent the emergence of resistance 23 .The recent deployment of large datasets of human B cell repertoires on database repositories represents an opportunity for novel bNAbs identi cation assuming that computational tools for their automatic identi cation and classi cation are developed 24 .
Arti cial intelligence (AI)-based prediction tools to nd the antibodies and antigens have been developed 25 .However, most of these tools rely on structural or amino acid sequence similarities of related antibodies to identify potential target proteins 26 .Nonetheless, despite important research and characterization efforts, a precise set of criteria required for classifying bNAbs versus non-bNAbs is still lacking.
Here, we developed a computational pipeline named RAIN for Rapid Automatic Identi cation of bNAbs from Immune Repertoire.RAIN is based on four different machine learning algorithms, which can be trained in just a few minutes using a Python script.RAIN only requires the following: a cellranger scBCR output going through the Immcantation pipeline, and nally a R script converting the repertoire data into a features table for bNAbs prediction.We validated RAIN on previously identi ed bNAbs leading to a prediction accuracy of 100% and an Area Under the Curve (AUC) value ranging from 0.92 to 1 depending on the antigenic site.In addition, we isolated class switched memory B cell from HIV-1 immune donors and performed single-cell BCR sequencing to demonstrate the method performance.
Importantly, immune repertoire analysis of donors with a serum able to broadly neutralize different HIV-1 isolates led to the identi cation of three bNAbs, while none was detected in the repertoire of immune donors with sera that did not possess a broad neutralizing activity.The identi ed bNAbs were further characterized for their a nities to the envelope stabilized prefusion trimer BG505 DS-SOSIP, neutralizing activities and by cryoelectron microscopy (cryoEM) for one of them.

Results
Subset of discrete characteristics discriminate HIV-1 bNAbs from mAbs.
The automatic identi cation of HIV bNAbs cannot be solely based on amino acid sequence similarity of the heavy or light chains, due to a large sequence variability resulting from the long a nity maturation process.In contrast, HIV-1 bNAbs isolated from chronically infected adults exhibit a signature of characteristic features, including high somatic hypermutations (SHMs), insertions or deletions (indels), long complementarity-determining region H3 (CDRH3), high potency, and broad viral neutralization breadth 3 .Moreover, the VRC01 class bNAbs, targeting the CD4bs, have also been shown to preferentially use speci c germline alleles 27,28 and possess an unusually short CDRL3 of ve amino acids, needed to contact gp120, while avoiding the glycan at position N267 in the D loop of gp120 29 .While bNAbs targeting the V1V2 apex use speci c IGHV genes and together with bNAbs binding the V3 glycan, they are characterized by a long (20-34 residues) CDRH3 sequence 30,31 .
We hypothesized that integrating speci c parameters characterizing HIV-1 bNAbs in a machine learning framework, could allow a rapid identi cation of bNAbs from an immune repertoire (Figure 1).To identify predictors of HIV-1 bNAbs, we investigated speci c features associated with these antibodies and inferred them from their highly diversi ed amino acid sequences.We collected and curated bNAbs sequences from the CATNAP (Compile, Analyze and Tally NAb Panels) database 32 .Data curation consisted of only considering human a nity matured sequences, removing incomplete or unpaired sequences obtained from CATNAP database (Supplementary Table 1).We obtained a total of 255 bNAbs paired sequences, binding the V1V2 apex (n=98), V3 glycan (n=56), CD4 binding site (n=54), gp120/gp41 interface (n=26) and MPER (n=21).Next, to create a dataset of paired BCR sequences that are unlikely to recognize an HIV antigen (hereafter named mAbs), we retrieved and curated paired antibodies sequences from ten healthy seronegative donors to obtain a total of 14'962 sequences (Supplementary Table 2).Following this step, we investigated if some of the bNAbs distinct properties could be used as predictive variables for each targeted antigenic site.We considered as potential predictors, the length of the CDR3 for the heavy (H3) and light (L3) chains, the frequency of somatic hypermutation in the V gene (ν) or improbable acquired mutations in the framework regions only (uν), and the hydrophobicity of CDRH3 33,34 (φ) (Figure 2a-e).Interestingly, anti-CD4bs bNAbs analysis demonstrated a statistically higher somatic hypermutation frequency, a higher frequency of unconventional mutations (outside of the CDRs) 35 , and a signi cantly shorter length of CDRL3 (Figure 2a, b, e and Extended Data Fig 1a) compared to the control mAbs reported in Supplementary Table 2.For the anti-MPER bNAbs, we observed a longer CDRH3, with higher hydrophobicity, and a higher mutation frequency in both V gene and framework (FRW) regions (Figure 2a, b, c, d and Extended Data Fig 1b).The bNAbs targeting the V1V2 apex showed a higher mutation frequency of V gene, but the difference was mainly due to a higher hydrophobicity of the CDRH3 and a longer CDRH3 (Figure 2a-d  Part of these results were expected but con rmed that this set of characteristics is statistically different between bNAbs and mAbs.To further investigate if these characteristics could be used to discriminate between bNAbs and mAbs, we decided to use them as variables in a two-dimensional Principal Component Analysis (Figure 2f-j).Remarkably, the ve characteristics were su cient to separate bNAbs from mAbs into two distinct clusters within each category of antigenic sites.We observed an explained variation of 0.43 for PC1 and 0.29 for PC2 across all ve antigenic sites, while the weights of the features exhibited striking similarities.For PC1, both frequency of mutation in CDRs and framework regions were important, although hydrophobicity and length of CDRH3 were important for PC2.Unexpectedly, the length of CDRL3 was a less important feature.Based on these observations, we decided to use this set of measurable characteristics as predictors to classify bNAbs from mAbs.
Algorithm selection and validation for the computational pipeline.
To further investigate the feasibility of an automatic identi cation of potential HIV-1 bNAbs, we decided to use different machine learning (ML) approaches.First, antibody sequences were converted to a list of values corresponding to the set of predictors identi ed previously.BNAbs sequences coming from CATNAP database were annotated using Igblast and the Immcantation work ow [36][37][38] .The resulting Adaptive Immune Receptor Repertoire (AIRR) characteristics were converted to a feature format table.
Similarly, mAb sequences obtained from public databases were processed as described previously 39 and converted to a features table.For each antigenic site, bNAbs and mAbs were pooled as one dataset and subdivided into three: 60% as training set and 20% each as validation and testing set respectively.While the reported number of HIV-1 bNAbs is limited, a large quantity of mAb sequences is available.We thus decided to use rst the anomaly detection algorithm (AD) for the automatic identi cation of bNAbs.We used the multivariate gaussian model based on a threshold value (Epsilon) to estimate the probability of an antibody being agged as 'anomaly' or not.Then, the optimal Epsilon parameter minimizing the number of false positives was obtained using the validation set (Extended Data Fig. 2a-e), while evaluation of the AD performance, including computing of the area under the curve (AUC) was done with the test set (Figure 3a, b).We observed that the AD algorithm discriminates well bNAbs targeting the V1V2 apex (AUC: 0.93), the CD4bs (AUC: 0.88), the MPER (AUC: 0.82), and the interface (AUC: 0.8).
However, bNAbs targeting the V3 glycan were poorly identi ed, with an AUC of 0.64.Moreover, a high number of false positives was obtained, indicating a low precision with the AD (Figure 3a).To increase recall and precision of our detection method, we used both decision tree (DT) and random forest (RF) algorithms.First, we used a random forest to analyze the identi cation pro le of bNAbs with two classifying features and found that it allowed a clear decision boundaries plot on the training dataset for Following this result, we allowed the DT and RF algorithms to use all available features, including VH and VL genes, and further optimized our models.We used the validation dataset to perform hyperparameter tuning and systematically explore different combinations of hyperparameters.We based the classi ers' hyperparameter tuning on the false positives number and for the hierarchical model of the decision tree, the cost complexity pruning parameter (alpha) was set to zero (Extended Data Fig 4).Next, entropy was chosen as the quality measurement for the split in both DT and RF (further details are presented in the Methods section).Finally, we used the test datasets and evaluated performance metrics, including AUC, precision, recall and accuracy for the DT and RF models (Figure 3a, c, and d).The DT algorithm exhibited superior recall and precision performance compared to the AD algorithm, while the RF algorithm demonstrated even higher performance, achieving a minimum AUC of 0.92 for all tested antigenic sites.It achieved a precision of 1 for almost all antigenic sites (0.83 for the interface).Moreover, an AUC of 1.0 and 0.95 for the MPER and interface site respectively, but also 0.95 for the V3 glycan, demonstrating that RF had the best performance as expected.Next, we reviewed the selected parameters used as RF classi ers.Interestingly, among the seven most important features, some were shared between the antigenic sites, while others were distinct.Nevertheless, some expected characteristics were found, such as the frequency of mutation in V genes and unconventional mutations or the length of the CDR3 light chain, which has been described to be important for anti-CD4bs 40 .Similar parameters were observed for interface bNAbs, also characterized by their mutation frequency both conventional and unconventional.
While the V1V2 apex binders were classi ed based on their CDRH3 lengths.Interestingly, bNAbs targeting the V3 glycan and MPER have a more balanced classi cation with features such as CDRH3 hydrophobicity, mutation and CDRH3 length sharing similar weight (Figure 3e).The immunoglobulin variable VH5-51 gene segment was associated with bNAbs targeting the V3 glycan as previously reported for 35% of human anti-V3 bNAbs 41 .As a nal validation step, we compared the prediction results of each algorithm.Altogether, we observed that the different methods (AD, DT, and RF) identi ed the same true positives, while there was minimal overlap in false positives (Extended Data Fig 5).
In an effort to combine the different ML algorithms used above, we chose to incorporate the Super Learner Ensembles algorithm (SL) as an additional validation step 42 .SL is an algorithm combining multiple models to make an "ensemble" prediction.The SL algorithm exhibited very high accuracy and precision performance with a score of 1 for all antigenic sites (Extended Data Fig 6a) and achieved high performance for the MPER, V1V2 apex, and interface antigenic sites with a minimum AUC of 0.92 (Extended Data Fig 6b).In contrast, AUC was lower for the CD4bs, and V3 glycan antigenic sites (0.77 and 0.68), with a recall score of 0.53 and 0.35 respectively (Extended Data Fig 6a).Based on the performance of our machine learning approach for Rapid Automatic Identi cation of bNAbs from Immune Repertoire (RAIN), we decided to use it on experimental samples in an effort to discovery new bNAbs.
Experimental validation of pipeline using de novo immune repertoires.
To identify potential bNAbs, we investigated the neutralizing activity of puri ed immunoglobulin G (IgG) from the sera of different HIV-1 infected donors.Polyclonal IgGs from the serum of donors were puri ed with protein G resin and tested on the global HIV-1 panel of reference strains, containing strains that are representative of the global epidemic 43,44 .Interestingly, we observed that sera of donors 3, 11 and to some extent donor 9 had a broad neutralizing activity (Figure 4a).In contrast, sera from donors 1, 2, 5, 6, 7 and 8 were able to neutralize only one or two viruses (Figure 4a).Based on this result, we selected the donor 3 as test sample for bNAbs identi cation, while donors 1 and 2 were selected as negative control.We isolated IgG-class-switched B cells from peripheral blood mononuclear cells (PBMCs) of the different donors and performed single-cell sequencing of the B cell receptors (BCRs) (B3, G3, S4, and G4).Importantly, no enrichment step was applied for B cell sorting to ensure an unbiased repertoire for the downstream analysis.After ltering for error-corrected and productive sequences, we successfully reconstituted a set of 15,713 IgG sequences for donor 3.As a negative control, we sequenced BCRs from IgG+ memory B cells of donors 1 and 2 (that did not have sera with broad neutralization activity), which resulted in the acquisition of 8,347 IgG sequences (D1 and D2).Interrogation of the RAIN pipeline on the sequences obtained from donor 3, led to the identi cation of several potential bNAbs, but only 3 were recognized by the three algorithms out of 15'713 paired sequences (Extended Data Fig 7a).To further con rm this result, we used the SL model, which identi ed thirteen potential bNAbs in donor 3: six predicted to bind to the CD4 binding site, one to V1V2 apex, and six interface binders (Extended Data Fig 7b).Interestingly, SL con rmed our predicted bNAbs, but also identi ed an anti-V1V2 apex binder in donor 2. Three potential bNAbs were constantly identi ed as CD4 binders (bNAb2101, bNAb4251, and bNAb1586) belonging to the VRC01-class and derived from the VH1-2*02 variable heavy chain gene segment (Extended Data Fig 8).
Binding and neutralization properties of the bNAbs.
To validate these ndings, we cloned the potential bNAbs and some additional antibodies as negative control (hereafter referred as mAbs).BNAbs and mAbs were recombinantly produced to test their speci city and neutralizing activities.We rst assessed their binding to the envelope trimer SOSIP (using the clade A gp140 envelope stabilized prefusion trimer BG505 DS-SOSIP trimers) 45,46 , which is known to bind bNAbs that are representative of the majority of the known gp120 neutralizing antibody class 47,48 .Using biolayer interferometry (BLI), we detected a high-a nity interaction between all the identi ed bNAbs and SOSIP, characterized by an equilibrium dissociation constant (K D ) of 75nM, 3nM and 0.4nM for bNAbs 1586, 2101 and 4251 respectively.In contrast, no interaction could be detected between the control mAb and SOSIP (Figure 4b).To investigate the neutralization potency of our bNAbs, we sought of determining their IC 50 using global HIV-1 panel strains on TZM-bl cells 43,44 .We observed a broad neutralization activity across tiers and viral clade for bNAb4251, with a geometric mean IC 50 of 1.8µg/ml (Figure 4c).BNAb2101 could also neutralize different strains and speci cally clade AE viruses, however its neutralization pro le could not be considered as broad (Figure 4c).Finally, bNAb1586 was a relatively poor neutralizer, only able to inhibit the CNE55 strain at 38µg/ml (Figure 4d).Importantly, none of the antibodies had an effect on the SIVmac251.30.SG3 virus indicating a speci c neutralization activity.Overall, bNAb4251 could neutralize about 80% of the tested viruses but was not active against the TV1.29 and BJOX002000, similarly to VRC01, which targets the CD4 binding site 49 .Since the potentials bNAbs were predicted to target the CD4 binding site, we further tested their neutralization potential on virus strains lacking the glycosylation surrounding the CD4bs such as BG505.W6M.C2 strain with residue T332N (C2) or N197, N276, N363, and N462 (gly4) and other mutations previously described 50 (Figure 4d).While the additional clade B viruses: JRCSF.JB (modi ed at D167N) to be more susceptible to VRC01 neutralization and YU2.DG, a tier 2 strain.Finally, clade C strains were also used as the glycan at 362 was naturally absent.The neutralization pro le demonstrated a gain of potency speci cally for the mutation surrounding CD4bs (Figure 4d).
To con rm the epitope and explore the binding mode of bNAb4251, we decided to perform cryo-electron microscopy (cryo-EM) of the antigen-binding fragment (Fab)4251 in complex with the soluble native-like trimer (BG505 DS-SOSIP) 51 .After several rounds of 2D and 3D classi cations (Extended Data Fig 9 ), we could segregate trimers with zero, or one Fab attached and solved the structure of the complex at the resolution of 3.7Å (Figure 5a and, Supplementary Table 3).As predicted by our method RAIN, Fab4251 interacts with the CD4bs of the trimer and makes multiple contacts with both heavy and light chain (Figure 5a and b).In total, fty-one residues of the Fab interact with fty-six residues on gp120, to bury a surface area (bsa) of 950Å 2 .The interaction is principally dictated by the heavy chain with 700Å 2 bsa, while the light chain buries 250Å 2 of the gp120 surface (Figure 5c).The CDR-H2 makes most of the contact, totaling a bsa of 528Å 2 , a binding mode that have similarity to the previously described interaction of the CD4 receptor with gp120 (Figure 5d).The previously solved interaction of CD4 with gp120 revealed that two amino acids, F43 and N59 of CD4 make multiple contacts centered on residues N368, E370 and W427 of gp120 [52][53][54] (Figure 5d).Interestingly, H54 of CDR-H2 also mediates interaction with N368 and E370 of the "P43 cavity" located at the interface between the inner and outer gp120 domains (Figure 5c and d).
Previously reported bNAbs targeting the CD4-binding site (CD4bs) have been classi ed in two groups based on their mode of recognition, the VRC01 class (3BNC117, N6, N49P7, 3BNC60, VRC-PG20, NIH45-46, VRC-CH31 and 12A12) and the non-VRC01 classes (CH103, 8ANC131, VRC13 and VRC16) 55 .Structural investigation revealed that Fab4251 possesses an angle of approach similar to VRC01 (Figure 5e), a result in agreement with its CDR-H2 mediated contact on gp120, indicating that it belongs to the same antibody class.Moreover, hydrogen bounds are also present between the heavy chain R53, K62 and Q428, S460 of gp120, respectively (Figure 5e) and a salt bridge between R71 and N368 of gp120.The CDR-H3 also contact the gp120 with N100 contributing a hydrogen bond with N279 in loop D of gp120, as it was already reported for other VRC01 class bNAbs (Figure 5f).The light chain also participates in the interaction with the 5-residue LCDR3 QxxEx motif and a deletion in CDRL1 to accommodate the gp120 N276-glycan 28 , a feature also associated to VRC01-class antibodies.

Discussion
The advent of single-cell technologies resulted in the growing availability of paired full-length variable heavy and light chain BCR sequences.Therefore, immune repertoire sequencing coupled to arti cial intelligence holds great promise to improve diagnostic and treatment for numerous immune-related or infectious diseases 56 .The identi cation of speci c sequences involved in an immune response has already been successfully used in research settings to elucidate the role of immune dysregulation in conditions such as systemic lupus erythematosus, rheumatoid arthritism, type 1 diabetes, multiple sclerosis, Grave's disease, Crohn's disease, and many others 57 .However, limitations exist and only few studies examined the bene t of incorporating full length variable regions from heavy and light chain sequences to predict antibody speci city.Those studies are based on sequence-based embedding models 58,59 .Other efforts have focused on nding amino acid sequence similarity to an already known antibody.The similarity approaches led to important scienti c and medical successed [60][61][62] , but hold some limitations when sequences are very divergent.
In this study, we present RAIN, a pipeline based on two innovative technologies, single-cell BCR sequencing and machine-learning to identify bNAbs against HIV-1, based on their binding site.Our approach differs from other methods as the parameters required for the identi cation derived from selected characteristics, that are inferred from the amino acid sequences using Immcantation annotations.We demonstrate that ve speci c characteristics were su cient to separate bNAbs from mAbs (non-bNAbs) into two distinct clusters within each category of antigenic sites.In addition, we identify the frequency of unconventional mutations as key factor to de ne a HIV-1 bNAbs.Former studies reported the presence of mutations in the frameworks of bNAbs and correlated with binding a nity to the CD4bs 34,63 .Our results suggest it is an important characteristic for all bNAbs.This can be interpreted as a consequence of the time needed for the maturation process or as a modi cation of the immune system in response to chronic infection.
Performing a PCA analysis across all ve antigenic sites, we observed that despite their sequence divergences, the weights of the features exhibited striking similarities.This result could be interpreted as an additional level of immune escape that was not studied yet 64,65 .The RAIN approach can achieve a precision of 1 for almost all antigenic sites and be applied easily on any immune repertoire or already isolated antibody sequences to identify HIV-1 bNAbs.Importantly, another unique aspect of our work is the experimental validation with de novo data.Data were corroborated by functional cloning, expression and puri cation of the antibodies, and functional neutralization assays.Moreover, we characterized the bNAb4251 binding to DS-SOSIP at almost atomic resolution using cryoEM.In summary, we believe that our approach offers an innovative, straightforward method to search and identify antibodies in immune repertoires, accelerate antibody discovery, and might shed light on potentially unexplored mechanism of HIV-1 immune escape.

Serum IgG isolation
Serum samples from HIV-1-infected individuals were heat-inactivated at 56°C for 40 min and incubated with Protein G Sepharose (GE Life Sciences) overnight at 4 °C.IgGs were eluted from chromatography columns using 0.1Mglycine (pH= 3.0) into 0.1MTris (pH= 8.0).Buffer was exchanged to PBS through Amicon 30 kDa spin membranes (Millipore).Concentrations of puri ed IgGs were determined by UV/Vis spectroscopy (A280) on a Nanodrop 2000 and samples were stored at -20 °C.
Single-cell BCR-seq library preparation and sequencing 10X Genomics: The 5 single-cell VDJ libraries were generated using Chromium Next GEM Single Cell V(D)J Reagent kit v.1, 1.1 or v.2 (10X Genomics) according to the manufacturer's protocol.Paired heavy and light chain BCR libraries were prepared from the sorted B cell populations.Brie y, up to 20,000 memory B cells per well of 10X chip were loaded in the 10X Genomics Chromium Controller to generate single-cell gel beads in emulsion.After reverse transcription, gel beads in emulsion were disrupted.Barcoded complementary DNA was isolated and used for the preparation of BCR libraries.All the steps were followed as per the manufacturer's instructions in the user guide recommended for 10X Genomics kit v.1, 1.1 or 2. The puri ed libraries from each time point were pooled separately and sequenced on the NextSeq550 (Illumina) as per the instructions provided in 10X Genomics user guide for the read length and depth.BD Rhapsody: Memory B cells were targeted for single-cell targeted RNA-seq and BCR-Seq analysis using the BD Rhapsody Single-Cell Analysis System 66 (BD Biosciences).Brie y, the single-cell suspension was loaded into a BD Rhapsody cartridge with >200,000 microwells, and single-cell capture was achieved by random distribution and gravity precipitation.Next, the bead library was loaded into the microwell cartridge to saturation so that the bead was paired with a cell in a microwell.The cells were lysed in a microwell cartridge to hybridize mRNA molecules onto bar-coded capture oligos on the beads.These beads were then retrieved from the microwell cartridge into a single tube for subsequent cDNA synthesis, exonuclease I digestion, and multiplex-PCR-based library construction.Sequencing was performed on NovaSeq paired-end mode.
Singleron: Single-cell suspensions with 1 × 105 cells/mL in PBS were prepared.Then, the suspensions were loaded onto micro uidic devices, and scRNA-seq libraries were constructed according to the Singleron GEXSCOPE protocol in the GEXSCOPE Single-Cell RNA Library Kit (Singleron Biotechnologies) 67 .Individual libraries were diluted to 4 nM and pooled for sequencing.Pools were sequenced on an Illumina HiSeq X with 150 bp paired end reads.

Recombinant antibody production
Expi293 cells (ThermoFisher Cat No. A14527) were diluted to a nal volume of 0.5 L at a concentration of 2.5 × 10 6 cells.mL-1 in Expi293 media.Heavy chain and light chain plasmids were complexed with Polyethyleneimine (ThermoFisher) and added to the cells.On day ve, cells were cleared from cell culture media by centrifugation at 10,000g for 30 min and subsequently passed through a 0.45-μm lter.The supernatant containing the recombinant antibody was incubated with protein A resin (ThermoFisher) overnight at 4 °C.The resin was washed with 25 mL of phosphate-buffered saline (PBS).A total of 30 mL of 10 mM glycine pH 2.4, 150 mM NaCl were used to elute the antibody off the protein A resin.The acidic pH of the eluted antibody solution was increased to approximately 7 by the addition of 1M Tris pH 8.0.The antibody solution was buffer exchanged into PBS with successive rounds of centrifugation, ltered, and stored at −80 °C.

Fragment antigen binding (Fab) generation
For the Fab production, the heavy chain was engineered with a two amino acids glycine serine linker followed by a six-histidine tag and stop codon.Light and mutated heavy chains were transfected as described in the previous section.Cell supernatant was harvested ve days post transfection and puri ed by IMAC chromatography, followed by size exclusion chromatography on a Superdex 16/600 HiLoad column (Cytiva).
Recombinant HIV-1 envelope SOSIP gp140 production BG505 DS-SOSIP trimer 68 production and puri cation were performed as previously described 46 .Brie y, prefusion-stabilized Env trimer derived from the clade A BG505 strain was stably transfected in CHO-DG44 cells and expressed in ActiCHO P medium with ActiCHO Feed A and B as feed (Cytiva).Cell supernatant was collected by ltration through a Clarisolve 20MS depth lter followed by a Millistak + F0HC lter (Millipore Sigma) at 60 LMH.Tangential Flow Filtration was used to concentrate and buffer exchange clari ed supernatant in 20 mM MES, 25 mM NaCl, pH 6.5.The trimer was then puri ed by ion exchange chromatography as described 46 .Fractions containing theBG505 DS-SOSIP protein were pooled, sterile-ltered, snap-frozen, and stored at −80 °C.

IgG neutralization assay
Neutralization assays with IgGs against the 12-strain "global" virus panel, were performed in 96-well plates as previously described 43,69,70 .Brie y, 293T-derived HIV-1 Env-pseudotyped virus stocks were generated by cotransfection of an Env expression plasmid and a pSG3ΔEnv backbone.Animal sera were heat-inactivated at 56°C for 1 hour and assessed at 8-point 4-fold dilutions starting at 1:20 dilutions.Monoclonal antibodies were tested at 8-point 5-fold dilutions starting at 50 μg/ml or 500 μg/ml.Virus stocks and antibodies (or sera) were mixed in a total volume of 50 μL and incubated at 37°C for 1 hr.
TZM-bl cells (20 μl, 0.5 million/ml) were then added to the mixture and incubated at 37°C.Cells were fed with 130 μL cDMEM on day 2, lysed and assessed for luciferase activity (RLU) on day 3.A nonlinear regression curve was tted using the 5-parameter hill slope equation.The 50% and 80% inhibitory dilutions (ID50 and ID80) were determined for sera and the 50% and 80% inhibitory concentrations (IC50 and IC80) were determined for mAbs.All samples were tested in duplicates.

Biolayer interferometry
The biolayer interferometry experiments using SOSIP were performed as follows.All experiments were performed in reaction buffer (TBS pH 7.4 + 0.01% (w/v) BSA + 0.002% (v/v) Tween 20) at room temperature (RT) using an Octet K2 instrument (ForteBio).Protein A (Fortebio) biosensor probes were rst equilibrated in reaction buffer for 60 s.IgGs were diluted to 10 µg/ml in reaction buffer and immobilized onto the protein A probes for 300 s, followed by a wash for 60 s in reaction buffer.The binding of SOSIP trimers to the IgGs was then measured at various concentrations for 300 s, followed by dissociation for 800 s in reaction buffer.Analysis was performed using the Octet software and GraphPad Prism version 9.0.

Cryo-EM sample preparation
BG505 DS-SOSIP trimers complexes were prepared using a stock solution of 2 mg/ml trimer incubated with a six-fold molar excess of bNAb4251.To prevent interaction of the trimer complexes with the airwater interface during vitri cation, the samples were incubated in 0.085 mM n-dodecyl β-D-maltoside (DDM).Samples were applied to plasma cleaned QUANTIFOIL holey carbon grids (EMS, R2/2 Cu 300 mesh).The grid was blotted in an automatic plunge freezing apparatus Vitrobot MarkIV (Thermo Fisher, Hillsboro, USA) to control humidity and temperature.

Cryo-EM data collection
Grids were screened for particle presence and ice quality on a TFS Glacios microscope (200 kV), and the best grids were transferred to a TFS Titan Krios G4.Cryo-EM data were collected using a TFS Titan Krios G4 transmission electron microscope, equipped with a Cold-FEG on a Falcon IV detector in electron counting mode.Falcon IV gain references were collected just before data collection.Data were collected using TFS EPU v2.12.1 utilizing the aberration-free image shift protocol, recording 4 micrographs per ice hole.Movies were recorded at a magni cation of ×165,000, corresponding to the 0.73 Å pixel size at the specimen level, with defocus values ranging from −0.9 to −2.4 µm.Exposures were obtained with 39.89 e − Å −2 total dose, resulting in an exposure time of approximately 2.75 s per movie.In total, 15,163 micrographs in EER format were collected.
Cryo-EM Data processing and structure tting Data processing was performed with cryoSPARC including Motion correction and CTF determination 71 .
Particle picking and extraction (extraction box size 350 pixels 2 ) were carried out using cryoSPARC 71 .Next, several rounds of reference-free 2D classi cation were performed to remove artifacts and selected particles were used for ab-initio reconstruction and hetero-re nement.After hetero-re nement, 72'497 particles contributed to an initial 3D reconstruction of 3.7 Å resolution (Fourier-shell coe cient (FSC) 0.143) with C1 symmetry.A model of a SOSIP trimer (PDB ID 4TVP) 72 or AlphaFold2 (ColabFold implementation) models of the 4251 Fab were tted into the cryo-EM maps with UCSF Chimera.These docked models were extended and rebuilt manually with re nement, using Coot and Phenix 73,74 .Figures were prepared in UCSF Chimera, UCSF ChimeraX and Pymol 75 .The numbering of Fab 4251 is based on Kabat numbering of immunoglobulin models 76 .Buried surface area measurements were calculated within ChimeraX and PISA 77 .

CATNAP sequences
For all antigenic sites, paired bNAb sequences were collected from the CATNAP database 32 as of 1 st January 2022 as nucleotide and amino acid sequences.First, the 249 heavy chain and 240 light chain nucleotides sequences were annotated with Igblastn 36 .Sequences were then processed and analyzed using the Immcantation Framework (http://immcantation.org) with MakeDB.pyfrom Change-O v1.2.0 (with the options --extended -partial).Next, bNAbs were ltered by a dedicated Java script to keep only sequences with an annotated CDR3 and paired sequences (VH+VK/L).Each paired antibody was associated with its targeting Env antigenic site, information provided by the database CATNAP text le (abs.txt as of 1 st January 2022).The 27 CATNAP antibodies with only the protein sequences available were annotated with IgBlastp followed by MakeDB.pyfrom Change-O v1.2.0 (with the options igblast-aa -extended).In parallel, using the fasta protein sequences, ANARCI 78 was used to identify the junction region.As for nucleotide sequences, paired and annotated-CDR3 bNAbs were ltered in.In total, 255 bNAbs sequences were collected.Repartition of the antigenic site is as follows: 54 bNAbs target the CD4bs, 21 MPER, 98 V1V2, 56 V3, and 26 interface.

Paired B-cell receptor repertoires
For the training and evaluation of the machine learning models, paired BCRs repertoires of ten healthy donors were collected.The repertoires were obtained from various sources (Extended Data Table and sequenced using 10X genomics technology.Annotation and processing of the sequences were done as previously described 39 and resulted in the generation of a customized AIRR format table containing 14'962 paired BCRs.For HIV-1 immune donors three different sequencing technologies were employed: 10X genomics (D1, D2, G3, and G4), Singleron (S4), and BD Rhapsody (B3).Single-cell sequencing of selected HIV-1 immune donors using Singleron technology was processed using celescope v1.14.1 (https://github.com/singleron-RD/CeleScope)with ' v_CR' mode utilizing cellranger v7.0.1.BD rhapsody single cell sequencing was rst processed using BD Rhapsody Targeted mRNA Analysis Pipeline (version 1.11) and then, using a custom script, the generated 'VDJ_Dominant_Contigs.csv'le was converted into cellranger-like output les, namely ltered_contig_annotations.csv and ltered_contig.fasta.Lastly, the 10X Genomics single cell sequencing was processed with cellranger v7.0.1.The cellranger output les of the different HIV-1 repertoires enabled us to annotate and process them as described earlier, resulting in a table of paired BCRs with AIRR characteristics.The six different experiments resulted in 2'152 BCRs for D1, 6'195 BCRs for D2, 4'008 BCRs for B3, 3'794 BCRs for G3, 3'112 BCRs for S4, and 4'799 BCRs for G4.

Data pre-processing
Using a custom script, AIRR characteristics were converted into our features of interest.The 'mutation frequency' was calculated using the difference of residues between the protein sequence of the BCR and its germline sequence in the FWR1+CDR1+FWR2+CDR2+FWR3 regions (VH gene).The 'framework mutation frequency' was calculated similarly but using only FWR1+FWR2+FWR3.The 'hydrophobicity' of the CDRH3 sequences was computed using a customized score, aromatic residues having a highest value (1 for W, 0.75 for Y and 0.5 for F).Residues A, L, I, M, P, and V were set to 0.1, while the rest of the resides were set to zero.The values of all residues were summed up for each CDRH3.In addition, length of the CDRH3, CDRL3, VH and VL/K genes were considered as features.Two extra features were added to be used by the anomaly detection algorithm: 'VH1+CDRL3 length of ve residues' with a zero or one value designed for the bNAbs targeting the CD4bs and 'VH1-69+VK3-20+GW motif in the CDRH3' with a zero or one value for the bNAbs targeting MPER.
Training and evaluation of machine learning models Three ML-based approaches were trained on the features table generated using BCRs obtained from healthy donors and bNAbs datasets, using Python v3.8.16 and scikit-learn v1.0.2.These algorithms were: Anomaly Detection (AD), Decision (DT) and Random Forest (RF).For each antigenic site, dataset was partitioned into training, validation, and test sets with an 60:20:20 ratio, setting random.seed to 1 for all models.For the AD model, bNAbs data were removed from the training set, since this algorithm only trains with non-anomaly data.For this model, the features with discrete values were rst normalized using the preprocessing.normalizemethod (axis=0) from the scikit-learn library.Features exhibiting signi cantly different values from the normal distribution, were selected for each antigenic site, which included the frequency of mutations in the V genes and in the frameworks.For CD4bs, we added the combined feature VH1+CDR3L with a length of 5 residues.For MPER, we included the combined feature VH1-69, VK3-20, and the GW motif in CDRH3.Additionally, CDRH3 hydrophobicity was added for MPER, V1V2, and V3.Lastly, CDRH3 length was incorporated for V1V2 and V3.Using the validation test, a multivariate normal random variable was calculated with the mutivariate_normal function from the scipy package v1.8.0 and used for setting the optimal Epsilon parameter ( ) minimizing the false positive numbers.The Epsilon value was set to 619.55 for CD4bs, 231501.41 for MPER, 866803.64 for V1V2, 845445.99 for V3 and 24.36 for interface.Those threshold values were used on the test set to predict a BCR as an anomaly (bNAb) or not.For DT and RF models, V genes (for heavy and light chains) were onehot encoded as a pre-processing step, resulting in a total of 122 features in the features table.Hyperparameter tuning was conducted using the validation dataset, minimizing the number of false positives.DT models were trained with a balanced class weight, the Entropy criterion for measuring the quality of splits and the cost complexity pruning parameter alpha of zero.RF models were trained with 100 estimators, a balanced class weight, the Entropy criterion for measuring the quality of splits, maximum samples were set to 1.0, maximum depth of tree of 'none', maximum features of 11 (√122), and bootstrapping to build trees.Matplot library v3.6.2 was used to generate ROC plots from performance results and to generate the Venn diagrams showing the intersection of the number of true positives or false positives between the three models.The Super Learner Ensembles algorithm was implemented using the ML-Ensemble (mlens) v0.2.3 library.For each antigenic site, the dataset was partitioned into train and test sets with a 75:25 ratio.The SuperLearner was created with the precision score as scorer parameter, a k-fold cross validation of 10 folds and the option shu e set to true.The following classi ers were used as based models in the Super Learner algorithm: DecisionTreeClassi er, SVC (Support Vector Classi cation), KNeighborsClassi er, AdaBoostClassi er, BaggingClassi er, RandomForestClassi er and ExtraTreesClassi er.A LogisticRegression was used as the meta-model, with the solver parameter set to 'lbfgs'.

Statistical analysis
Flow cytometric data were acquired using BD FACSDiva (v.9.0) software.Flow cytometric data were analyzed using FlowJo (v.10.7.1).Statistics were conducted using R Statistical Software (v4.2.1) and ggstatsplot package 79 .The Complex Heatmap package was used for visualization 80 .No statistical methods were used to predetermine the sample size.The experiments were not randomized, and investigators were not blinded to allocation during experiments and outcome assessment.

Declarations
RAIN pipeline for automatic identi cation of bNAbs.Data collected from the CATNAP database (bNAbs) and healthy donor repertoires (mAbs) were converted as a feature table to train and validate four machine learning models: anomaly detection (AD), decision tree (DT), random forest (RF) and super learner (SL).
We performed single-cell BCR sequencing from HIV-1 seropositive donors with (illustrated by orange arm) or without (illustrated by white arm) sera broadly neutralizing activities.BCR sequences are processed by Immcantation work ow and analyzed as a features table.Next, the predicted bNAbs found by the four algorithms were produced and tested in neutralization and binding assays.
and Extended Data Fig 1c).BNAbs targeting the V3 glycan have higher frequency of mutations, and slightly higher hydrophobicity of the CDRH3 and a longer CDRH3 (Figure 2a-d and Extended Data Fig 1d).While bNAbs targeting the interface region, also demonstrated an increased frequency of mutations in the V gene and FWR regions (Figure 2a, b and Extended Data Fig 1e).
bNAbs targeting the interface or the V1V2 apex (Extended Data Fig 3a, b).The receiver-operating characteristic (ROC) curve and corresponding AUC of 0.94 was obtained for V1V2 apex (Extended Data Fig 3a) and 0.9 for interface (Extended Data Fig 3b), indicating good classi cation performance for both antigenic sites.Furthermore, a measured AUC of 0.77 was obtained for bNAbs binding the CD4bs (Extended Data Fig 3c).However, detection of bNAbs against other antigenic sites such as MPER (Extended Data Fig 3c), and V3 loop (Extended Data Fig 3e) was not satisfactory with an AUC close to 0.5 and 0.67 respectively.

Figure 2 Characteristics
Figure 2