Cell-based in situ selection for anti-CD151 Abs
To generate Abs targeting CD151, we used the phage-displayed Library F [37] of synthetic antigen-binding fragments (Fabs), which offers several advantageous features for CellectSeq (Figure S1). First, Library F is extremely diverse (> \(10^{10}\) unique members) and precisely designed to ensure that most members are stable and well displayed on phage [38]. Second, the library has proven functional for selections with either purified antigens [37] or cell-surface antigens [36] and has yielded numerous selective Abs per selection [36]. Third, the library was constructed with a single, highly stable human framework, resulting in negligible display bias, with most library members presented at similar levels [37]. In addition, the abundances of individual clones in pools enriched for target antigens are highly correlated with relative affinities [39]; this property enhances NGS analysis based on enrichment ranking, allowing for the identification of highly selective and high-affinity clones. Fourth, the synthetic Abs are diversified at only four complementarity-determining regions (CDRs; H1, H2, H3, and L3), which permits standard, cost-effective NGS procedures utilizing primers that anneal to common framework regions (Figure S2 & S3). Fifth, each of the four CDRs is composed of defined amino acid positions with restricted diversities. Therefore, NGS data quality can be evaluated accurately by assessing deviations from the fixed framework or the occurrence of unexpected codons at diversified positions. For instance, CDRs H1 and H2 contain only six and eight binary degenerate codons and offer a diversity of 64 (\(2^{6}\)) and 256 (\(2^{8}\)) unique sequences, respectively (Figure S1D). Conversely, CDRs L3 and H3 are much more diverse in terms of loop length (3–7 and 1–17 degenerate codons, respectively) and sequence composition (encoded by defined ratios of nine codons encoding nine amino acids).
CDRs L3 and H3 offer a theoretical diversity on the order of \(10^{7}\) and \(10^{17}\) unique sequences, respectively (Figure S1D). Ultimately, the four CDRs combined offer a practical diversity approximating \(10^{11}\) unique clones [37]. Thus, the highly diverse Library F, with defined length and chemical diversity encoded in CDRs L3, H1, H2, and H3, permits the precise probabilistic detection and elimination of artifactual CDR sequence combinations from NGS data, such as those derived from the PCR amplification steps required for the Illumina NGS process [40] (see Materials and Methods).
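The restricted design described above lends itself to a simple conformance check. The following is a minimal sketch, not the pipeline used in this study: the per-position residue sets in `H1_DESIGN` are hypothetical placeholders (the actual Library F design is given in Figure S1), and the function names are introduced here only for illustration. It shows how six binary positions yield 64 possible sequences and how a read carrying an unexpected residue would be flagged.

```python
from math import prod

# Hypothetical design: one set of allowed residues per diversified position.
# These sets are placeholders, NOT the actual Library F design.
H1_DESIGN = [{"I", "L"}, {"S", "Y"}, {"S", "Y"}, {"S", "Y"}, {"S", "Y"}, {"I", "M"}]


def design_diversity(design):
    """Number of distinct sequences the design can encode."""
    return prod(len(allowed) for allowed in design)


def conforms(seq, design):
    """True if seq has the designed length and only allowed residues."""
    return len(seq) == len(design) and all(
        residue in allowed for residue, allowed in zip(seq, design)
    )


if __name__ == "__main__":
    print(design_diversity(H1_DESIGN))    # six binary positions
    print(conforms("ISYSYI", H1_DESIGN))  # every residue is allowed
    print(conforms("ISYSYA", H1_DESIGN))  # 'A' is not allowed at the last position
```

Under this assumption, any read for which `conforms` returns `False` would be treated as a sequencing or PCR error and discarded.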
We performed in situ selections against cell-surface CD151 on live cells, where CD151 is targeted in its native cell-surface environment. For cell engineering, we selected the HEK293T cell line because it grows rapidly in suspension and exhibits high display of transgenic cell-surface proteins [41]. To enrich binders for CD151, we engineered HEK293T cells to stably overexpress CD151 (HEK293T-CD151+; positive cells) (Figure S4A). Conversely, to deplete binders that are not selective for the target, we engineered HEK293T cells that stably expressed a short hairpin RNA that depleted CD151 mRNA and, consequently, reduced cell-surface display of CD151 (HEK293T-CD151-; negative cells) (Figure S4A). The CellectSeq in situ selection strategy utilizes multiple rounds of selection against antigen-positive and antigen-negative cells, with the aim of producing a positive Ab pool enriched with clones selective for the target antigen and a negative Ab pool enriched with non-specific clones (background).
To this end, the naïve phage pool representing Library F was subjected to four rounds of selection with the engineered cell lines (Fig. 2). Round 1 consisted of a positive selection on HEK293T-CD151+ cells to enrich for Fab-phage that bound to CD151, followed by elution of bound phage and amplification by passage through E. coli. In round 2, phage pools were first exposed to control HEK293T-CD151- cells to deplete clones that bound to other cell-surface antigens, followed by positive selection with HEK293T-CD151+ cells. Round 3 repeated the round 2 process using the amplified phage pool from round 2. For the last round, the amplified phage pool from round 3 was split into two pools and subjected to a round 4 selection process that involved elution and amplification of phage bound to either HEK293T-CD151+ cells (positive selection) or HEK293T-CD151- cells (negative selection) (Fig. 2). Thus, the round 4 phage selection output consisted of two pools, a positive and a negative pool. After the four rounds of selection for binding to in situ CD151, we manually isolated 96 random Ab-phage clones from the round 4 phage output of HEK293T-CD151+ cells (positive pool). We screened all 96 clones by cellular phage ELISA [42], in which phage signals were measured for binding to HEK293T-CD151+ cells and compared to control HEK293T-CD151- cells. We identified 49 phage clones with binding signals 5-fold or greater over controls, which were deemed positive binders for cellular CD151 (Figure S5). Sanger DNA sequencing showed that all 49 clones shared the same sequence, that of clone CD151-1 (Table 1), indicating that the Ab selection enriched for an immunodominant clone. Manual Ab screening therefore failed to yield multiple unique and diversified CD151-selective clones; consequently, we next performed NGS analysis of the selection output.
NGS enrichment ranking selection for anti-CD151 Abs
To identify unique CD151-specific Fab-phage clones in the round 4 selection output, we performed NGS analysis to explore the output diversity and the relative abundance of every Ab clone. We deep sequenced the round 4 output derived from the positive and negative pools. This allowed us to obtain CD151-selective sequences (derived from the positive pool) and non-specific background sequences (derived from the negative pool). The phage DNA from the Ab selection output pools was subjected to PCR amplification, resulting in amplicons with Illumina NGS adaptor sequences and unique barcode identifiers that flanked the region of CDRs L3 and H3 (Figure S2 & S3). The amplicons from each output pool (positive and negative) were quality controlled for correct size, purified, quantified, normalized, pooled, and finally sequenced using an Illumina HiSeq 2500 instrument (see Materials and Methods). Besides the Illumina universal sequencing primers (PE1 and PE2), the NGS runs also included a custom primer that allowed for the complete sequencing of CDRs H1 and H2. Thus, the three primer reads (PE1, PE2, and custom; Figure S2 & S3) provided complete sequence coverage of the four diversified CDRs in Library F [37] (Figure S1). We performed duplicate NGS runs, and each run was controlled for high sequence quality scores [43, 44]. Instrument sequencing errors were filtered out using a per-base quality score cut-off of Q = 30, which corresponds to a 1 in 1,000 probability of an incorrect base call [43]. Subsequently, all sequencing reads from the duplicate NGS runs were combined and deconvoluted. The three different primer reads (PE1, PE2, and custom) for each clone were merged into a single sequence to derive the complete synthetic Ab sequence (see Materials and Methods).
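The relationship between a quality score and the base-call error probability is \(P = 10^{-Q/10}\), so Q = 30 gives P = 0.001. The sketch below illustrates a per-base filter of this kind. It assumes quality strings in Phred+33 ASCII encoding, and the function names are hypothetical; it is an illustration, not the filtering code used for the data reported here.

```python
def phred_error_probability(q):
    """Probability of an incorrect base call for Phred quality score q."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Convert a FASTQ quality string to scores (assumes Phred+33 encoding)."""
    return [ord(char) - 33 for char in quality_string]


def passes_quality(quality_string, min_q=30):
    """Keep a read only if every base meets the per-base cut-off."""
    return all(q >= min_q for q in decode_phred33(quality_string))


if __name__ == "__main__":
    print(phred_error_probability(30))  # error probability at Q = 30
    print(passes_quality("IIII"))       # 'I' encodes Q = 40
    print(passes_quality("II5I"))       # '5' encodes Q = 20
```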
The resulting high-quality nucleotide sequences were then compared to the designed sequence repertoire of Library F [37] to remove technical errors inherent to Illumina sequencing and PCR amplification. For each Ab clone, the nucleotide sequences were evaluated for codon deviations from the synthetic design of the fixed framework and restricted CDR positions (Figure S1A-B). Any sequences that diverged from the synthetic library design were discarded. Subsequently, the sequences were filtered for potential PCR-induced artifacts that may arise during NGS sample preparation and the Illumina sequencing process (see Materials and Methods). Such artifacts may occur through incorrect annealing that produces amalgams (i.e. combinations) of different clones [40], which in our case may be driven by the fixed Ab framework coding region (non-CDR). Therefore, for every sequence we obtained the frequencies (i.e. number of observations) of its CDR H3 and L3 sequences, since these two CDRs are the most diversified in the synthetic library and drive the majority of affinity interactions with the antigen [37]. We then identified valid L3/H3 pairs by calculating a frequency cut-off that sets a minimal threshold of valid occurrences, with all below-threshold pairs filtered from the selection pool (see Materials and Methods). This yielded 7,541,189 and 7,250,873 high-quality NGS reads for the positive and negative pools, respectively. The reads were then translated into amino acid sequences. This process ultimately yielded 23,671 and 56,352 unique amino acid sequences in the positive and negative pools, respectively.
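The pair-filtering step can be pictured as a count over L3/H3 combinations followed by a threshold. The following is a simplified sketch under stated assumptions: the threshold `min_count` is passed in directly as a hypothetical parameter, whereas the actual cut-off is calculated as described in Materials and Methods, and the sequence labels are placeholders.

```python
from collections import Counter


def filter_l3_h3_pairs(reads, min_count):
    """Count (L3, H3) pairs and keep those seen at least min_count times.

    reads: iterable of (l3, h3) tuples, one per sequencing read.
    min_count: hypothetical occurrence threshold for a valid pair.
    """
    pair_counts = Counter(reads)
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


if __name__ == "__main__":
    # Placeholder reads: one rare pair mixes CDRs from two abundant clones.
    reads = (
        [("L3a", "H3a")] * 5
        + [("L3b", "H3b")] * 4
        + [("L3a", "H3b")]  # possible PCR chimera, observed once
    )
    print(filter_l3_h3_pairs(reads, min_count=3))
```

In this toy input, the singleton pair that recombines CDRs from two abundant clones falls below the threshold and is removed, which is the behavior expected of a chimera filter.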
To perform NGS Ab enrichment ranking selection of potential CD151-selective clones, the unique high-quality sequence reads from each pool were parsed based on CDR sequences and observation counts. For each unique paratope, we plotted its count in the positive pool (x-axis) versus the ratio of its abundance (i.e. frequency) in the positive pool relative to the negative pool (y-axis) (Fig. 3). To estimate the number of potential unique CD151-binding clones in the plot, we defined an upper-right quadrant of putative binders. The upper-right quadrant contains sequences with observation counts of more than 200 in the positive pool and more than four-fold enrichment relative to the negative pool (Fig. 3). Comparative analysis of the unique sequences revealed that all upper-right quadrant clones are close homologs of clone CD151-1, each showing more than 80% sequence identity in both L3 and H3. This finding indicates that the Ab selection is enriched for homologous clones that potentially target a similar, immunodominant epitope, with CD151-1 being the most abundant and selective clone.
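The quadrant definition above can be expressed as a short filter. This is an illustrative sketch only: the thresholds (count above 200, ratio above four) come from the text, while the function name, the toy counts, and the pseudocount used to avoid division by zero for clones absent from the negative pool are assumptions.

```python
def upper_right_quadrant(pos_counts, neg_counts, min_count=200, min_ratio=4.0,
                         pseudocount=1):
    """Return clones with high positive-pool count and high enrichment ratio.

    pos_counts, neg_counts: dicts mapping clone sequence -> observation count.
    pseudocount: assumed offset so that clones absent from the negative pool
    still yield a finite ratio.
    """
    pos_total = sum(pos_counts.values())
    neg_total = sum(neg_counts.values())
    hits = {}
    for clone, n_pos in pos_counts.items():
        n_neg = neg_counts.get(clone, 0) + pseudocount
        ratio = (n_pos / pos_total) / (n_neg / neg_total)
        if n_pos > min_count and ratio > min_ratio:
            hits[clone] = ratio
    return hits


if __name__ == "__main__":
    pos = {"cloneA": 900, "cloneB": 100}
    neg = {"cloneA": 9, "cloneB": 990}
    print(upper_right_quadrant(pos, neg))  # only cloneA meets both thresholds
```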
Motif-based algorithm identifies selective and diversified Abs against CD151
Due to the lack of Ab diversity obtained by both manual selection of Ab clones and NGS enrichment ranking, we developed a novel motif-based algorithm to identify highly selective Abs for CD151 from the deep-sequenced phage pools. The in silico strategy for scoring CD151-selective Abs is based on exploring all possible sequence motifs (i.e. consensus motifs) in the positive pool and scoring their enrichment over the negative pool (Fig. 4A). This follows the premise that highly selective Abs are enriched with paratope motifs (i.e. linear information) that recognize the target antigen, whereas non-selective Abs lack such enrichment [45] (Figure S6). Therefore, for each Ab clone in the positive pool (i.e. candidate) (Fig. 4A1), we explored the entire space of linear information by exhaustively enumerating all possible motifs matching its CDR sequences [46] and obtained the frequency (number of matching sequences / total number of sequences) of every motif in the positive and negative pools (Fig. 4A2). According to the premise above, high enrichment of the motifs in the positive pool relative to the negative pool implies that the Ab candidate is potentially highly selective (Fig. 4A3). Thus, we analyzed each Ab in the positive pool for selective binding to CD151 by scoring the separation between the two distributions of motif frequencies in the positive and negative pools (see Materials and Methods for details). To this end, we used a t-test [47] to score the separation of the two distributions and then calculated the p-value to evaluate its statistical significance [48–50]. The lower the p-value, the greater the separation between the two distributions and hence the higher the selectivity of the candidate Ab. Finally, we applied a stringent p-value cut-off of \({10}^{-10}\) to identify highly selective Ab clones (see Materials and Methods).
This motif-based in silico strategy therefore allowed us to explore the selectivity of all Ab clones in the positive pool rapidly and exhaustively. We were able to identify potentially selective CD151 binders regardless of their individual frequencies in the total pool of sequences, thus bypassing the limitations of standard NGS analyses based solely on enrichment counts of individual clone sequences, which have difficulty discriminating between selective Ab clones and background.
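The motif-scoring idea can be made concrete with a toy example. The sketch below rests on several assumptions that the text does not specify: motifs are generated by replacing any subset of positions in a single CDR string with a wildcard, matching requires equal length, and separation is scored with Welch's unequal-variance t statistic. All function names and sequences are hypothetical, and the code is not the CellectSeq implementation.

```python
from itertools import product
from math import sqrt
from statistics import mean, variance

WILDCARD = "."


def enumerate_motifs(seq):
    """All 2**len(seq) motifs obtained by masking positions with a wildcard."""
    return [
        "".join(WILDCARD if masked else residue for residue, masked in zip(seq, mask))
        for mask in product((False, True), repeat=len(seq))
    ]


def matches(motif, seq):
    """True if seq has the motif's length and agrees at every fixed position."""
    return len(motif) == len(seq) and all(
        m == WILDCARD or m == residue for m, residue in zip(motif, seq)
    )


def motif_frequency(motif, pool):
    """Number of matching sequences divided by total number of sequences."""
    return sum(matches(motif, seq) for seq in pool) / len(pool)


def welch_t(xs, ys):
    """Welch's t statistic for two samples (assumed choice of t-test)."""
    return (mean(xs) - mean(ys)) / sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))


def score_candidate(candidate, pos_pool, neg_pool):
    """Separation of the candidate's motif frequencies between the two pools."""
    motifs = enumerate_motifs(candidate)
    pos_freqs = [motif_frequency(m, pos_pool) for m in motifs]
    neg_freqs = [motif_frequency(m, neg_pool) for m in motifs]
    return welch_t(pos_freqs, neg_freqs)


if __name__ == "__main__":
    pos_pool = ["YSY", "YSY", "YSA", "GSY"]  # toy positive pool
    neg_pool = ["GGG", "AAA", "YGG", "GGA"]  # toy negative pool
    print(score_candidate("YSY", pos_pool, neg_pool))
```

A positive statistic indicates that the candidate's motifs are more frequent in the positive pool. Converting the statistic to a p-value requires the t distribution, which is available, for example, through `scipy.stats.ttest_ind` with `equal_var=False`. Exhaustive enumeration grows as \(2^{n}\) in the sequence length, so a practical implementation would need a more economical motif representation than this sketch.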
Filtering PCR-induced sequence artifacts improves the in silico Ab selection results
As previously mentioned, PCR-induced artifacts may arise during NGS sample preparation and the Illumina sequencing process [40]. These artifacts represent invalid amalgams of existing CDR L3, H1, H2, and H3 sequences, which may be mistaken for novel Ab clones [40]. They may significantly bias the frequencies of individual clones and thereby affect the in silico Ab discovery strategy. Therefore, for both the positive and negative pools, we obtained the frequencies (i.e. number of observations) of CDR H3/L3 pairs, since these two CDRs are the most diverse in terms of length and amino acid composition in Library F [37]. We calculated a frequency cut-off to determine valid L3/H3 pairs using a minimal occurrence threshold, with all invalid pairs filtered from the selection pool as potential PCR and NGS artifacts (see Materials and Methods). We then applied the motif-based in silico Ab discovery strategy to predict highly selective CD151 binders (p-values < \(10^{-10}\)) in both scenarios, before and after filtering. Applying error-filtering to the positive pool reduced the number of unique Abs by 80% (Fig. 4B1-C1-D1). Similarly, applying error-filtering before the motif-based in silico prediction of CD151 clones reduced the number of unique predicted Abs by 85% (Fig. 4B2-C2-D2). Interestingly, before error-filtering the in silico predicted Abs clustered into 183 distinct families of similar L3/H3 sequences (> 80% identity), whereas after filtering the Abs were reduced to only 4 distinct families (Fig. 4B3-C3-D3).
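Grouping predicted Abs into families of similar L3/H3 sequences can be sketched as a greedy clustering. The version below is an assumption-laden illustration: identity is computed position by position without alignment and normalized by the longer sequence, each clone is compared only to the first member of each family, and the sequences are invented. The clustering procedure actually used is not described in this section.

```python
def identity(a, b):
    """Fraction of identical positions, normalized by the longer sequence."""
    if not a or not b:
        return 0.0
    same = sum(x == y for x, y in zip(a, b))
    return same / max(len(a), len(b))


def cluster_families(clones, threshold=0.8):
    """Greedily group (L3, H3) clones whose L3 and H3 both exceed threshold.

    Each clone joins the first family whose representative (first member)
    is more than `threshold` identical in both CDRs; otherwise it founds
    a new family.
    """
    families = []
    for l3, h3 in clones:
        for family in families:
            rep_l3, rep_h3 = family[0]
            if identity(l3, rep_l3) > threshold and identity(h3, rep_h3) > threshold:
                family.append((l3, h3))
                break
        else:
            families.append([(l3, h3)])
    return families


if __name__ == "__main__":
    clones = [
        ("SYYYPI", "YYSGSWGYAL"),
        ("SYYYPI", "YYSGSWGYAF"),  # one H3 mismatch: 90% identity
        ("GWAL", "AYHV"),          # unrelated clone
    ]
    print(len(cluster_families(clones)))  # number of families found
```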
To experimentally assess the validity of the predicted antibodies in both scenarios, we selected the Abs with the best specificity scores (p-values; Fig. 4A3) from each of the 4 families predicted after filtering, as well as 23 additional Abs predicted before filtering (Fig. 4B3-C3-D3 & Table S1). Due to the low NGS enrichment of the identified Ab clones, instead of using PCR rescue or similar methods [51, 52], all 27 candidate clones were synthesized as Ab DNA sequences and cloned into Fab protein expression plasmids. After Fab purification, we tested the activity of each clone by flow cytometry for binding to HEK293T-CD151+ cells compared to HEK293T-CD151- cells (control). All four Ab clones predicted after filtering were confirmed as CD151 binders (Pass validation; Table S1), with fluorescence signals 3-fold or greater than controls. In contrast, all 23 pre-filtering Abs failed to bind to CD151 (Table S1). The success rates of the motif-based in silico Ab discovery before and after filtering are thus 4 of 27 (~15%) and 4 of 4 (100%), respectively. This difference highlights the requirement to filter PCR-induced and NGS artifacts in order to derive selective clones accurately and effectively. Furthermore, the abundance (enrichment) of the 4 clones identified by motif-based in silico Ab selection varied from high (30%) to extremely low. In fact, clones CD151-2 and CD151-3 have frequencies below 0.01%, and clone CD151-4 has an extremely low frequency of 0.00009% (Table 1). These latter clones would be impossible to identify using manual sampling or standard NGS analyses based solely on enrichment.
Characterization of motif-based in silico identified Abs against CD151
To demonstrate the advantage of the motif-based in silico Ab discovery strategy, termed CellectSeq (Figure S7), we measured dose-dependent binding of all 4 clones (CD151-1 through CD151-4), in Fab format, to HEK293T-CD151+ cells. Quantitative flow cytometry showed tight and saturable binding of each Fab to HEK293T-CD151+ cells (Fig. 5A), with EC50 values in the low-nanomolar range (Table 1). We also used flow cytometry to assess epitope overlap by measuring the ability of immunoglobulin G (IgG) versions of each clone to block binding of each Fab to HEK293T-CD151+ cells. As expected, preincubation of HEK293T-CD151+ cells with each IgG reduced subsequent binding of the cognate Fab. Moreover, all IgGs blocked binding of the other Fab clones (Fig. 5B), implying that all four distinct clones share a similar CD151 binding epitope.
Further corroboration of specificity for CD151 was provided by immunoprecipitation mass spectrometry (IP-MS) experiments with each Fab on HEK293T-CD151+ cells and HT1080 cells (which express native levels of CD151 protein). Tandem mass spectra were searched against a human database to validate MS/MS protein identifications. Protein identifications were accepted if they could be established at greater than 99% probability based on identified peptides. After background filtering to remove keratin, immunoglobulin, and cytoplasmic proteins, the highest peptide counts for all four Fabs were for CD151 in both cell lines (Fig. 5C and S8). Integrin β1 (ITGB1), a receptor known to associate with CD151 [53], also immunoprecipitated with Fabs CD151-1, CD151-3, and CD151-4 (Fig. 5C), further supporting the selectivity of the Fabs for CD151. Taken together, the results show that the four Abs identified in silico recognize cell-surface CD151 with high affinity and specificity, and that all four clones likely bind to overlapping epitopes.