Machine Learning-driven Fragment-based Discovery of CIB1-directed Anti-Tumor Agents by FRASE-bot

SUMMARY Chemical probes are an indispensable tool for translating biological discoveries into new therapies, though they are increasingly difficult to identify. Novel therapeutic targets are often hard-to-drug proteins, such as messengers or transcription factors. Computational strategies have emerged as a promising way to expedite drug discovery for unconventional therapeutic targets. FRASE-bot exploits big data and machine learning (ML) to distill 3D information relevant to the target protein from thousands of protein-ligand complexes to seed it with ligand fragments. The seeded fragments can then inform either (i) de novo design of 3D ligand structures or (ii) ultra-large-scale virtual screening of commercially available compounds. Here, FRASE-bot was applied to identify ligands for Calcium and Integrin Binding protein 1 (CIB1), a promising but ligand-orphan drug target implicated in triple negative breast cancer. The signaling function of CIB1 relies on protein-protein interactions and its structure does not feature any natural ligand-binding pocket. FRASE-based virtual screening identified the first small-molecule CIB1 ligand (with binding confirmed in a TR-FRET assay) showing specific cell-killing activity in CIB1-dependent cancer cells, but not in CIB1-depleted cells.


INTRODUCTION
Recent progress in molecular biology and genome-scale studies constantly increases our understanding of cellular processes and of the implication of individual proteins in disease [1][2][3][4][5], thus extending the landscape of potential drug targets to harder-to-ligand proteins such as transcription factors and signaling or scaffolding proteins [6][7][8][9][10]. The global drug discovery pipeline has yet to adapt to this shift. Screening collections used by pharmaceutical companies are mainly composed of ligands targeting historic target families, such as G-protein coupled receptors 11 or protein kinases 12. Several strategies offer promising avenues to address the challenge by extending the screenable ligand space. Modern computing power enables an ever-increasing scale (up to a billion ligands) of structure-based virtual screening [13][14][15]. Alternatively, DNA-encoded libraries (DEL) push the boundaries of the accessible chemical space even further by allowing one-pot synthesis and screening of multiple billions of compounds [16][17][18][19][20]. Finally, generative neural networks [21][22][23][24] promise access to virtually limitless chemistries. These strategies, however, come at a cost and bring their own sets of problems. For instance, docking and scoring on novel targets are often subject to unacceptable false-positive rates 25,26, multi-billion DEL screening requires a long and costly triage and hit confirmation process, and generative approaches may require a prohibitively time- and effort-consuming synthetic effort without necessarily producing novel chemistry 22,27,28. Hence, there is an urgent need both to improve the existing strategies and to develop new approaches to lead finding.
To further expand the lead finding toolkit, we introduce FRASE-bot, a technology platform for de novo construction of small-molecule ligands to a protein of interest directly in its binding pocket. The only input FRASE-bot needs to initiate the design process is a 3D structure of the protein of interest. It makes use of deep learning to distill 3D information relevant to the protein of interest from 3D structures of tens of thousands of ligand-protein complexes in the Protein Data Bank (PDB), and respective structure-activity relationships (SAR). FRASE-bot exploits the concept of FRAgments in Structural Environments (FRASE) 29. Conceptually, FRASE-based design of a new ligand for a given protein involves two steps: (i) identification of structural environments, stored in the FRASE database, that match those in the protein of interest (this step results in an automated seeding of ligand fragments from the matching FRASEs into the target protein), and (ii) inspecting the seeded ligand fragments and combining them into synthetically tractable compounds. The concept of FRASE has been successfully applied to develop potent kinase-targeting in vivo antitumor agents 29, though it was implemented with one important limitation: it only allowed exploiting information within a given protein family, hence precluding ligand discovery for the majority of novel targets of interest, which belong to understudied families. To overcome that limitation we developed FRASE-bot, a computational approach that exploits the full body of 3D structural and SAR data to assemble a ligand in the binding site of any orphan protein. This advancement is due to a combination of a novel FRASE screening algorithm with a machine learning (ML)-based triage of selected ligand fragments to predict "nativeness" of the fragment's pose in the new protein environment.
FRASE-bot might be considered as a step toward a "virtual medicinal chemist" capable of conceiving ligands to novel targets based on an unbiased analysis of the full body of structural, chemical and biological data.
The new method was applied to identify potential agents against Triple Negative Breast Cancer (TNBC) by targeting Calcium- and Integrin-Binding Protein 1 (CIB1) [30][31][32]. Approximately one million new TNBC cases are diagnosed each year globally and 25% of patients die from this aggressive disease within 5 years of diagnosis [33][34][35]. Current TNBC treatment options are limited to surgery, radiation and systemic chemotherapy, which often fail due to inherent or acquired resistance. The need for new therapeutic approaches for TNBC is therefore urgent. CIB1 has been found to promote cell survival, growth and proliferation in cancer by regulating at least two prominent growth and oncogenic pathways, PI3K/AKT and MEK/ERK. Its inhibition has the potential to kill cancer cells with minimal effects on normal cells 32,36. Several 3D structures of apo and ligand-bound CIB1 have been solved using X-ray crystallography 31,37. CIB1 is an ideal challenging target for a novel hit/lead generation approach since it does not belong to a well-studied protein family and lacks binding pockets for endogenous small-molecule ligands purposefully sculpted by nature.

FRASE-bot
The FRASE-bot workflow includes the following steps. First, the FRASE database is screened to identify FRASEs with structural environments potentially matching those in the target protein. At this step, ligand fragments belonging to hit FRASEs are automatically seeded in the target protein. All potential hit FRASEs (typically thousands) are ranked using a "nativeness" score calculated by a neural network model. The model predicts whether a FRASE's ligand fragment has a native-like pose within the target's protein environment. Next, the top-scoring fragments (tens to hundreds) are exploited to obtain drug-like ligands for the target. Currently, two options exist: (i) build one or more pharmacophore models by analyzing densities of pharmacophoric features on the selected ligand fragments and use them to screen databases of commercially available compounds (this option was used in this study); or (ii) apply a generative neural network to enumerate novel ligand structures matching 2D and 3D constraints set by the poses of the seeded fragments. At the final selection step, the pharmacophore screening hits or generated compounds may be submitted to various filters, such as a computational assessment of the binding free energy (e.g., by MM-PBSA 38,39) or a visual inspection of ligand poses. Eventually, selected ligands are purchased or synthesized and tested in relevant bioassays.

FRASE database screening
Finding whether a target protein has residue arrangements similar to those in the database composed of 184,963 FRASEs (see Supplementary Methods for the database description) is a combinatorially challenging task. There is a large number of ways in which a set of 5 to 15 residues could be picked from the target protein, and there is a countless number of ways in which an environment picked from the target protein could be aligned with a given environment from the FRASE database. This makes a comprehensive enumeration and alignment of hundreds of environments in a target protein to hundreds of thousands of environments stored in the FRASE database computationally prohibitive. To address the combinatorial challenge, all protein environments, both in the database and in the target protein, are encoded into a canonical string notation enabling ultra-fast comparisons. Because the encoding is deliberately oversimplified, the screening step produces a significant number of irrelevant hits. However, at the next step of the workflow, all the hits are evaluated by a more thorough scoring method.
In the environment-encoding scheme, for each FRASE, we enumerate triplets of protein residues forming approximately equilateral triangles with sides between 8 Å and 12 Å (Figure 1A). Each residue in a triplet is represented by a many-hot 11-bit string expressing its physical properties: charge, H-bond donor/acceptor, aromaticity, size, and hydrophobicity (e.g., position #1 would be set to 1 if the residue has a positive ionizable group, 0 otherwise). Three positions each are reserved for size and hydrophobicity, respectively (e.g., all three positions are set to 1 for tryptophan and to 0 for glycine). Bit strings for individual residues are then concatenated into a single fingerprint for the whole triplet. Six distinct fingerprints are generated for each triplet to make triplet screening invariant to the order of residues. Typically, 6-15 qualifying triplets are located in a single FRASE. Although the triplet fingerprint explicitly encodes only 1D information about the triplets, it also contains implicit 3D information due to the geometric constraints on the shape and size of the triangle formed by the triplet residues. The same procedure can be applied to a whole protein structure, i.e., that of the target protein, and typically produces ~250 triplets for a medium-size protein. The triplets enable a FRASE database screening process as shown in Figure 1A. First, every triplet of the target protein is compared with every triplet in the database. Second, matching triplets are used to align the respective FRASE with the protein of interest. Third, the ligand fragment from the matching FRASE is inserted into the target protein. Fourth, the ligand fragment is scored on its fitness to the target environment using a neural network model (see next section). One advantage of the FRASE screening approach is that it bypasses an explicit detection and comparison of ligand-binding sites, a non-trivial task and an area of continued research [40][41][42][43][44].
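The order-invariant triplet lookup described above can be illustrated with a minimal Python sketch. It is not the authors' Pipeline Pilot implementation. The names `RESIDUE_BITS`, `build_index` and `screen` are hypothetical, and the bit values are illustrative placeholders rather than the actual entries of Supplementary Table S1 (only the all-ones size and hydrophobicity bits for tryptophan and the all-zeros bits for glycine follow the text). Exact string matching is used here as a simplification of the similarity search.

```python
from itertools import permutations

# Illustrative 11-bit residue codes. Assumed layout: 3 size bits, 3 hydrophobicity
# bits, then acceptor, donor, positive, negative, aromatic. Values are placeholders,
# not the real Table S1 entries.
RESIDUE_BITS = {
    "GLY": "00000000000",
    "TRP": "11111101001",
    "ASP": "10000010010",
    "LYS": "11010001100",
}


def triplet_fingerprints(residues):
    """Return the (up to six) order-dependent fingerprints of a residue triplet."""
    return {"".join(RESIDUE_BITS[r] for r in order) for order in permutations(residues)}


def build_index(frase_triplets):
    """Map every fingerprint permutation to the FRASE ids that contain it."""
    index = {}
    for frase_id, residues in frase_triplets:
        for fp in triplet_fingerprints(residues):
            index.setdefault(fp, set()).add(frase_id)
    return index


def screen(target_triplet, index):
    """Look up a target triplet; one ordering suffices because the index stores all six."""
    fp = "".join(RESIDUE_BITS[r] for r in target_triplet)
    return index.get(fp, set())
```

Storing all permutations on the database side turns each query into a single dictionary lookup, which is one plausible way to obtain the fast comparisons the text refers to.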

FRASE scoring
After thousands of FRASE hits have been selected from the database using fast-search algorithms and their ligand fragments seeded in the target protein, each fragment is evaluated on its fitness to its new environment. Since it is impossible to experimentally measure contributions of ligand fragments to the potency of the respective ligands, the "fitness" is assessed here by learning the distinction between interaction patterns in "true" FRASEs (i.e., FRASEs extracted from high-affinity ligand-protein complexes) and interaction patterns in decoy FRASEs. To this end, we applied a neural network model that learns the most informative features from the ligand-protein interaction graph (Figure 1B) to predict the likelihood that a FRASE is "true". The interaction graph is an extension of the chemical graph in which every pair of nearby (within 5 Å) ligand/protein atoms is connected by an interaction bond defined by the respective atom-centered pharmacophoric features (see Supplementary Methods for the feature lists). A full set of interaction bonds represents a unique ligand-protein interaction signature. To ensure invariance of the interaction signature with respect to the atom numbering, it is transformed into an interaction fingerprint in which any given position encodes a count of interaction bonds of a particular type (e.g., hydrogen-bond donor to hydrogen-bond acceptor). The dimension of the fingerprint is m*n (=377), where m (=29) is the number of possible pharmacophoric features on the ligand side and n (=13) is the number of features on the protein side. The interaction fingerprint is fed to a dense neural network (Figure 1B) to predict whether an input signature "looks" like a true FRASE or a decoy. The training set for this neural network model consists of a set of 38,791 true FRASEs randomly selected from the database and a set of 77,582 decoys. The decoys were produced by randomly swapping ligand fragments between true FRASEs.
The trained models were validated by making predictions for the remaining 146,172 true FRASEs and 509,184 decoys. More details on the model training and validation are provided in Supplementary Methods.
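The construction of the 377-dimensional interaction fingerprint can be sketched as follows. This is a simplified illustration, not the published code: the feature names in `LIGAND_FEATURES` and `PROTEIN_FEATURES` are placeholders (the actual lists are in Supplementary Methods), and each atom is assumed to carry a single feature.

```python
import math

# Placeholder feature vocabularies: 29 ligand-side and 13 protein-side types.
LIGAND_FEATURES = [f"L{i}" for i in range(29)]
PROTEIN_FEATURES = [f"P{j}" for j in range(13)]


def interaction_fingerprint(ligand_atoms, protein_atoms, cutoff=5.0):
    """Count ligand/protein feature pairs within `cutoff` angstroms.

    Each atom is given as (feature_name, (x, y, z)). The result has
    29 * 13 = 377 positions, one per (ligand feature, protein feature) pair,
    and does not depend on the order in which atoms are listed.
    """
    n = len(PROTEIN_FEATURES)
    fp = [0] * (len(LIGAND_FEATURES) * n)
    for lig_feat, lig_xyz in ligand_atoms:
        i = LIGAND_FEATURES.index(lig_feat)
        for prot_feat, prot_xyz in protein_atoms:
            if math.dist(lig_xyz, prot_xyz) <= cutoff:
                j = PROTEIN_FEATURES.index(prot_feat)
                fp[i * n + j] += 1
    return fp
```

Because only counts per pair type are stored, two differently numbered copies of the same complex produce the same vector.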

Pharmacophore model and screening
To quickly and cost-effectively test the relevance of the fragments seeded in the target protein, we opted for exploiting the fragment-protein complexes in pharmacophore-based virtual screening. To this end, all seeded fragments were converted into pharmacophoric features (Aromatic, Hydrophobic (aliphatic), H-bond Acceptor, H-bond Donor, Positive Ionizable and Negative Ionizable). All features were clustered based on their types and 3D coordinates. Centroids of the most populous and dense clusters were used to compose pharmacophoric queries to search Molport (all in-stock compounds, 2017.2), a database of ~5.1 million commercially available compounds.
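The step from seeded fragments to candidate pharmacophore points can be illustrated with a small sketch. It uses a greedy, centroid-based grouping as a stand-in for the k-means clustering used in the study, and the `radius` parameter is a hypothetical choice, not a value from the paper.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def cluster_features(features, radius=1.5):
    """Greedily group same-type features lying within `radius` of a cluster centroid.

    `features` is a list of (feature_type, (x, y, z)). Returns a list of
    (feature_type, centroid, population), most populous cluster first.
    """
    clusters = []
    for ftype, xyz in features:
        best = None
        for cluster in clusters:
            if cluster["type"] != ftype:
                continue
            d = math.dist(centroid(cluster["members"]), xyz)
            if d <= radius and (best is None or d < best[0]):
                best = (d, cluster)
        if best is not None:
            best[1]["members"].append(xyz)
        else:
            clusters.append({"type": ftype, "members": [xyz]})
    clusters.sort(key=lambda c: len(c["members"]), reverse=True)
    return [(c["type"], centroid(c["members"]), len(c["members"])) for c in clusters]
```

The centroids of the largest clusters returned by such a routine would then be the candidates for the pharmacophoric query.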

FRASE database screening and converting hit fragments into pharmacophore queries
Fast screening of the FRASE database against the x-ray structure of CIB1 (PDB: 6OCX) enabled identification of 5,362 fragments seeded in multiple regions of the structure (Figure 2A). Next, a filter passing only those fragments that do not "collide" with the protein (that is, whose atoms are at least 1 Å away from protein atoms) and that are sufficiently "buried" within the protein (that is, each ligand atom has an average of 5 protein atoms within 5 Å) reduced the number of fragments to 726. Finally, a fitness score threshold of 0.4 was applied to further reduce the number of fragments to 151 (Figure 3B). Each of the 151 hit fragments was then converted into a set of pharmacophoric features (H-bond donor/acceptor (HBD/HBA), Ionizable positive/negative (Pos/Neg), Aromatic (Ar), and Hydrophobic aliphatic (Hyd)), resulting in a total of 398 features (Figure 3D). All features were clustered using the k-means algorithm (see Methods) by type and 3D coordinates into 76 clusters (with a maximum distance between cluster members of 3 Å). Centroids of the ten largest clusters (Figure 3D) were considered as potential features for the pharmacophore model. Visual inspection of the protein structure suggested that centroids forming two groups (1-2 and 5-8) occupy two distinct nearby pockets that are close enough for a hypothetical ligand combining features from both groups to be a drug-like molecule. Eventually, centroids 1, 2, 6, and 8 were retained to compose a pharmacophore query consisting of two HBA and two Ar features (Figure 3D).
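The clash and buriedness filter can be written down in a few lines. The sketch below is one reading of the criteria stated above, assuming that "buried" means the mean number of protein atoms within 5 Å per fragment atom is at least 5; the function name and argument layout are hypothetical.

```python
import math


def passes_filter(fragment_xyz, protein_xyz, min_dist=1.0, shell=5.0, min_neighbors=5.0):
    """Return True if a seeded fragment neither clashes nor is too exposed.

    Clash: any fragment atom closer than `min_dist` to a protein atom.
    Buriedness: mean count of protein atoms within `shell` per fragment atom
    must reach `min_neighbors`.
    """
    total_neighbors = 0
    for atom in fragment_xyz:
        for prot in protein_xyz:
            d = math.dist(atom, prot)
            if d < min_dist:
                return False  # steric collision with the protein
            if d <= shell:
                total_neighbors += 1
    return total_neighbors / len(fragment_xyz) >= min_neighbors
```

In the study, this kind of filter reduced the 5,362 seeded fragments to 726 before the neural network score was applied.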

Virtual screening
The pharmacophoric query was used to screen the MolPort database. Schrodinger's Phase algorithm 45 was used to perform the pharmacophoric search, which yielded 190,320 hits. All pharmacophore hits were docked to the x-ray structure of CIB1 (PDB: 6OCX) using the Glide method 46 with standard precision. The docking grid was defined to comprise the four features of the pharmacophore query with a 5 Å margin. The Glide gscore values were distributed in a range from -1.4 to -10.3 kcal/mol with a median of -5.6 kcal/mol. About 10,000 ligands (i.e., ~5% of the docked set) having a gscore below -7 kcal/mol were selected for further triage. These docking hits were clustered so that visual inspection throughout the triage process could be limited to top-scored cluster representatives (i.e., marking a cluster representative for removal meant the removal of the whole cluster). K-means clustering on FCFP4 fingerprints with a maximum Tanimoto distance between cluster members of 0.60 resulted in 567 clusters. The triage process included visual inspection of 2D ligand structures and 3D ligand-protein complexes. First, the 567 structures of cluster representatives were scrutinized computationally and visually for the presence of unwanted reactive groups, suitability for docking (e.g., ligands containing long aliphatic linkers were removed), and their lead potential (113 clusters were removed at this step). Second, the 3D poses were browsed for a visual assessment of the binding entropy (e.g., flexible ligands whose binding relies on solvent-exposed hydrogen bonds were removed) and to ensure that the docking pose aligns well with the pharmacophore query (~350 clusters were removed at this step). Next, up to 3 cluster representatives (depending on the cluster size) were selected from the clusters that survived the first two triage steps, resulting in a set of ~200 candidates for purchase.
The final triage step involved the refinement of the purchase list based on prices, shipping fees and available budget. Eventually, 56 compounds were purchased for experimental confirmation (see Supplementary data).

Experimental confirmation in a CIB1 TR-FRET displacement assay
Binding of the 56 purchased compounds was experimentally assessed in a time-resolved fluorescence energy transfer (TR-FRET) assay (Figure 3A). Briefly, it serves as a displacement assay using an AlexaFluor-labeled linear CIB1-binding peptide and a Europium (Eu)-labeled Streptavidin donor for attachment to the AviTag biotinylated CIB1 protein (described previously in Puhl et al. and Haberman et al. 47,48). Dose-response studies (typically in a 0.005-100 µM concentration range) identified 19 compounds with a dose-dependent response, 3 of which had IC50 values below 10 µM (Figure 3B). The three most promising compounds were retested in the dose-response TR-FRET assay as biological replicates (n = 8) and the average curves are shown in the SI. Several compounds showed rather high Hill slopes (>3), which is a potential indication of promiscuous or aggregation-based artifact activity, whereas UNC10245380 typically yielded Hill slopes within the acceptable range of 1.5-2.5. Additionally, numerous analogs of UNC10245380 were purchased from commercial sources and data for this brief SAR-by-catalog study are also shown in the SI. None of these follow-up compounds yielded a dramatic improvement in potency or physicochemical properties, and therefore we focused on UNC10245380 for further studies.

CIB1 hit selectively targets CIB1-dependent cancer cells
As mentioned above, we decided to proceed with investigation of the most promising compound from the TR-FRET assay, UNC10245380 (IC50 = 8.0 ± 2.5 µM; Hill slope = 2.0), using cell-based assays. Previous studies showed that CIB1 depletion induces cell death and inhibition of CIB1-dependent signaling in 8 of 11 triple-negative breast cancer cell lines. The remaining 3 cell lines were insensitive to CIB1 depletion and thus serve as important controls for compound off-target effects. We selected 3 CIB1-depletion-sensitive and 2 CIB1-depletion-insensitive TNBC cell lines to screen compounds for cell death activity. Of the 15 compounds tested in cells (out of the 19 dose-responsive TR-FRET hits), UNC10245380 (IC50 = 8.0 µM) showed cell-specific killing activity in all three CIB1-depletion-sensitive cell lines (BT549, MDA-MB-436, and MDA-MB-468) (Figure 3C). Importantly, UNC10245380 showed no activity in the CIB1-depletion-insensitive cell lines (MDA-MB-231 and MDA-MB-453) (Figure 3C), strongly suggesting a lack of non-specific cellular toxicity. UNC10245380 also showed >90% dose-dependent cell death of MDA-468 TNBC cells over 48 h, indicating a more robust and rapid cell death than CIB1 shRNA depletion by 72 h (data not shown). We next examined the effects of UNC10245380 on CIB1-dependent signaling events. Western blot analysis showed that UNC10245380 inhibited AKT and ERK phosphorylation in CIB1-depletion-sensitive, but not -insensitive, cell lines (data not shown). UNC10245380 also upregulated death receptor TRAIL-R1/D5 expression specifically in the three CIB1-depletion-sensitive cell lines (data not shown), which recapitulates our previous findings 49. Taken together, our data indicate that UNC10245380 closely parallels the CIB1 depletion cell death and signaling phenotypes in TNBC cells [49][50][51], and thus provides an appropriate starting point for future CIB1-targeted anticancer drug development.

DISCUSSION
In our previous work 29, the concept of fragments in structural environments was introduced and tested in a "model system", that is, on protein targets with multiple known ligands and belonging to the well-established superfamily of protein kinases. Moreover, FRASE-based ligand design was performed stepwise, by visually assessing the choice of the next fragment and its attachment site. In this study, the FRASE-based ligand finding strategy was applied to Calcium and Integrin Binding protein 1 (CIB1), a genuine orphan target belonging to a small family of four proteins with the closest family member, Calcium and Integrin Binding protein 4, showing only 45% sequence identity to CIB1. Furthermore, no small-molecule ligands have been reported for any of the CIB family members and their only function appears to be signaling through association with a number of other proteins [52][53][54][55][56]. Putatively, CIB1 does not interact with any endogenous small-molecule binder and does not have a pocket sculpted by nature for such an interaction. Previously, several unsuccessful attempts were made by our groups to identify small-molecule CIB1 binders, including high-throughput screening of several collections of ~110K commercially available compounds. As a result, the only currently known CIB1 inhibitors are peptides identified from phage display screens 37,[57][58][59]. This study demonstrates the high potential of the FRASE-based strategy in the intended setting, that is, applied to a difficult non-conventional target.
On the technology side, FRASE-bot is a formalized semi-automated workflow enabling faster hit finding and leaving less room for subjective human decision making. The quantum leap from the previously published system 29, in which fragments were only exchanged between the aligned members of the same protein superfamily, has become possible due to triplet-based FRASE screening. Next, the ML-based scoring function enabled fragment ranking by inferring the latent similarity of their 3D poses to those of FRASEs in the FRASE database. Finally, cluster analysis of the fragments' pharmacophoric features made possible a formal algorithm for the transition from a large set of top-scoring ligand fragments to 3D pharmacophore-based screening of millions of commercially available compounds. A possible future alternative to virtual screening might be to use the seeded fragments, along with the respective 3D geometric constraints, as an input to ligand generators, thus biasing them toward generating binders to the target protein.
Conceptually, FRASE-bot builds on the legacy of multiple earlier developments. For over a century, medicinal chemists dealt with finding optimal substituents for a given scaffold, a paradigm quantified through an additive approach by Free and Wilson in the 1970s 60,61. In the 1990s, the advent of high-throughput x-ray and NMR technologies enabled structure-guided fragment-based discovery that explicitly exploited the concept of fragments "liking" specific protein environments [62][63][64][65]. In the 2000s, a broad family of approaches collectively termed chemogenomics or proteochemometrics experimented with various ways of using the protein sequence or structure to share SAR information between targets [66][67][68][69][70]. Finally, during the last decade, deep machine learning has shown potential to combine protein and ligand data for fast prediction of protein-ligand binding affinity [71][72][73][74].
Beyond FRASE-bot, its components may have a broader use in hit finding and drug design. For instance, the ligand fragments identified by FRASE screening can be used as an input to conventional structure-based design through incremental fragment growing or plugged as a structural constraint into a generative neural network. Furthermore, the module translating a large number of seeded ligand fragments into pharmacophoric queries for large-scale screening of commercially available compounds can be applied to experimentally identified fragments, e.g., from the Diamond Light Source 75. In addition, the neural scoring function distinguishing between true FRASEs and decoys could be used to rank poses in structure-based virtual screening. Another important outcome of this study is the first small-molecule CIB1 binder with a demonstrated cellular effect. We expect it to be a helpful probe to further exploit and validate CIB1 as a promising anti-cancer target. Later, UNC10245380 may be a source of inspiration for developing a CIB1-targeting drug or an in vivo probe.

FRASE database
The FRASE database was collected using data-processing protocols implemented in Pipeline Pilot 76. X-ray structures of high-affinity ligand-protein complexes were imported from the Protein Data Bank (PDB) 77. The PDB codes for high-affinity complexes of drug-like ligands were obtained from BindingDB 78 (retrieved in 05/2018). "High affinity" was defined as KD, Ki, IC50 or EC50 < 100 nM. Drug-likeness of the ligands was ensured by filters including Lipinski 79, REOS 80 and structural queries to remove peptides, inorganic and phosphorus-containing compounds, as well as those containing highly reactive groups. This selection process resulted in 10,464 complexes involving 4,724 unique ligands and 3,068 unique proteins. This dataset allowed us to generate a database of 184,963 FRASEs involving 51,060 unique ligand fragments. A FRASE was defined as a ligand fragment with all nearby protein residues (i.e., residues having at least one atom within 4.5 Å of the closest ligand atom). Ligands were fragmented using the "Enumerate Fragments" Pipeline Pilot component, allowing only single, non-cyclic bonds to be broken and keeping α-atoms attached to the cyclic fragments. Only fragments weighing between 50 and 300 Da were retained. All FRASEs, that is, ligand fragments with nearby protein residues, were saved to an SD file.

FRASE database screening
The screening protocols used in this study were implemented in Pipeline Pilot 76 as a combination of standard components and custom Pilot scripts. The four key steps of the screening process include (i) enumerating triplets of protein residues and representing them as fingerprints, (ii) similarity search in the FRASE database for triplets similar to those in the target protein, (iii) alignment of the FRASEs containing hit triplets to the protein structure by the Cα atoms of the triplet's residues, and (iv) making a target-based FRASE for further scoring. The "Align Molecules using Substructure" component is used for the alignment. After the hit FRASE from the database is aligned to the target protein, its ligand fragment can be used to create a new, target-based FRASE by cutting out the nearby residues of the target protein (i.e., residues having at least one atom within 4.5 Å of the closest ligand atom) and merging them with the ligand fragment.
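Step (iii), carrying a ligand fragment from a database FRASE into the target via three matched Cα atoms, can be illustrated with a frame-based superposition. This sketch is a swapped-in technique, not the "Align Molecules using Substructure" component: it builds an orthonormal frame on each Cα triangle and re-expresses fragment coordinates in the target frame, which is exact only when the two triangles are congruent. All function names are hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def _unit(v):
    norm = math.sqrt(_dot(v, v))
    return tuple(x / norm for x in v)


def triangle_frame(p1, p2, p3):
    """Orthonormal frame on a triangle: origin at its centroid, z along its normal."""
    origin = tuple((a + b + c) / 3.0 for a, b, c in zip(p1, p2, p3))
    x_axis = _unit(_sub(p1, origin))                    # lies in the triangle plane
    z_axis = _unit(_cross(_sub(p2, p1), _sub(p3, p1)))  # plane normal
    y_axis = _cross(z_axis, x_axis)
    return origin, (x_axis, y_axis, z_axis)


def transplant(point, src_triplet, dst_triplet):
    """Map a fragment atom from the FRASE frame (src) into the target frame (dst)."""
    src_origin, src_axes = triangle_frame(*src_triplet)
    dst_origin, dst_axes = triangle_frame(*dst_triplet)
    local = tuple(_dot(_sub(point, src_origin), axis) for axis in src_axes)
    return tuple(dst_origin[k] + sum(local[i] * dst_axes[i][k] for i in range(3)) for k in range(3))
```

For triangles that match only approximately, a least-squares superposition would be the more appropriate choice; the frame construction is used here only because it keeps the example dependency-free.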
All triplets of protein residues satisfying the geometric condition were enumerated in all FRASEs from the database and in the target protein. The geometric condition for a set of three residues to qualify as a triplet was for its respective Cα atoms to form an approximately equilateral triangle with edge lengths between 8 Å and 12 Å. Each residue in a triplet is represented by a many-hot 11-bit string expressing its physical properties: charge, H-bond donor/acceptor, aromaticity, size, and hydrophobicity. The first 3 positions are reserved for size, the next 3 for hydrophobicity, 1 for hydrogen bond acceptors, 1 for hydrogen bond donors, 1 for positive ionizable, 1 for negative ionizable, and 1 for aromatic. Bit strings for all 20 amino acid side chains used in residue triplet screening are shown in Supplementary Table S1.
The back-propagation neural network consisted of two hidden layers (32 and 16 nodes) with a rectified linear activation function and an output layer with a sigmoid activation function, trained with a binary cross-entropy loss function. The network was trained with the Adam optimizer and a batch size of 50 for 500 epochs. Accuracy on a test set was used as the performance metric. The training set for the model consisted of a set of 38,791 true FRASEs randomly selected from the database and a set of 77,582 decoys. The test set included 146,172 true FRASEs and 509,184 decoys.
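The geometric condition can be made concrete with a short enumeration sketch. The edge-length window of 8 to 12 Å comes from the text; the tolerance that defines "approximately equilateral" is not specified, so `max_spread` below is a hypothetical parameter.

```python
import math
from itertools import combinations


def enumerate_triplets(ca_coords, lo=8.0, hi=12.0, max_spread=2.0):
    """List residue triplets whose C-alpha atoms form a near-equilateral triangle.

    `ca_coords` maps a residue id to its C-alpha (x, y, z). A triplet qualifies
    if all three edges lie in [lo, hi] angstroms and the longest and shortest
    edges differ by at most `max_spread` (assumed tolerance).
    """
    triplets = []
    for ids in combinations(sorted(ca_coords), 3):
        edges = [math.dist(ca_coords[a], ca_coords[b]) for a, b in combinations(ids, 2)]
        if all(lo <= e <= hi for e in edges) and max(edges) - min(edges) <= max_spread:
            triplets.append(ids)
    return triplets
```

This brute-force loop scales with the cube of the number of residues, which is acceptable for a single FRASE or a medium-size protein; a distance-based neighbor list would be a natural refinement for larger structures.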
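The network topology described above (377 inputs, hidden layers of 32 and 16 ReLU units, one sigmoid output, binary cross-entropy loss) can be sketched in plain Python. Only the forward pass and the loss are shown; training with the Adam optimizer is omitted, and the weight initialization and all function names are assumptions made for illustration.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Random weights (assumed uniform initialization) and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_model(n_inputs=377, seed=0):
    """Layers 377 -> 32 -> 16 -> 1, returned as a list of (weights, biases)."""
    rng = random.Random(seed)
    sizes = [n_inputs, 32, 16, 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict(model, fingerprint):
    """Forward pass: ReLU on hidden layers, sigmoid on the output node."""
    h = fingerprint
    for k, (weights, biases) in enumerate(model):
        act = sigmoid if k == len(model) - 1 else relu
        h = [act(sum(w * x for w, x in zip(row, h)) + b) for row, b in zip(weights, biases)]
    return h[0]


def binary_cross_entropy(y_true, p, eps=1e-12):
    """Loss for one example; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))
```

The output of `predict` plays the role of the fitness score; a threshold of 0.4 was the cutoff applied to the CIB1 fragments in this study.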

Pharmacophore queries and screening
The 3D pharmacophore queries for database screening were created using the Phase software 81 with the Maestro graphical interface. Individual pharmacophore features were created and manually placed at the centers of feature clusters (see "Pharmacophore model and screening" in the main text for context). The created features were merged into pharmacophore queries for screening in the "Merged hypotheses" mode. The ligand input file was in maegz format. Prior to screening, the Molport ligand collection (all in-stock compounds, 2017.2) was filtered through modified Lipinski 79 and REOS 80 filters (the modified Lipinski rule allowed ligands of up to 600 Da). The 3D ligand structures for 5,142,498 ligands from the MolPort database were generated by the Pipeline Pilot software 76. The "Generate conformers during search" option (up to 50 conformers) was applied to the input ligands. "PhaseScreenScore" was used to rank the screening hits.

Docking
Ligands were docked to CIB1 using the Glide program 46 in standard docking precision (Glide SP). The binding region was defined by a 20Å × 20Å × 20Å box centered on the geometric center of the pharmacophore model. A scaling factor of 0.8 was applied to the van der Waals radii. Default settings were used for all the remaining parameters. One pose per ligand was generated.

Virtual screening triage
Virtual hits ranked and selected based on the Phase Fitness score and Glide gscore were submitted to a hit triage process. First, the hit redundancy was reduced through retaining the best-scoring hits from clusters of similar compounds. Clustering was performed by a k-means method as implemented in the Pipeline Pilot software 76 . The inclusion criterion was 45% of Tanimoto similarity on ECFP4 fingerprints to the current cluster center. Second, binding poses of top-ranked non-redundant hits were visually inspected to remove poses whose scores putatively underestimate the entropic penalty on binding. Finally, a fraction of hits from the final list was eliminated based on pricing and availability criteria.
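The redundancy-reduction step can be illustrated with a leader-style (sphere exclusion) selection, used here as a simple stand-in for the Pipeline Pilot k-means clustering. Hits are visited from best to worst docking score and a hit becomes a new representative only if its Tanimoto similarity to every existing representative is below the 45% inclusion criterion. Fingerprints are represented as sets of on-bit indices, and the function names are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pick_representatives(hits, threshold=0.45):
    """Keep the best-scoring member of each group of similar hits.

    `hits` is a list of (name, docking_score, on_bits); lower (more negative)
    scores are better. Returns representative names in score order.
    """
    representatives = []
    for name, score, bits in sorted(hits, key=lambda h: h[1]):
        if all(tanimoto(bits, rep_bits) < threshold for _, _, rep_bits in representatives):
            representatives.append((name, score, bits))
    return [name for name, _, _ in representatives]
```

Because hits are processed in score order, every representative is by construction the top-scored member of its group, which mirrors the rule of retaining the best-scoring hit per cluster.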

TR-FRET assay
For the TR-FRET tracer molecule, UNC10245204 was synthesized with an N-terminal cysteine to facilitate conjugation with maleimide AlexaFluor 647 (Thermo Fisher Scientific; Alexa647-CIB1-peptide). Compounds were dispensed into 384-well plates using a Mosquito HTS nanoliter instrument (TTP LABTech) as 3-fold serial dilutions (100x in DMSO, 0.1 μL) for final concentrations ranging from 100 μM to 0.5 nM. Biotinylated CIB1 was diluted to a final concentration of 3 nM in assay buffer (20 mM TRIS pH 7.5, 150 mM NaCl, 1 mM CaCl2, 1 mM CHAPS, and 1 mM DTT (added fresh each time)), and 5 μL of diluted protein was added to the wells of the assay plate using a Multidrop Combi Reagent Dispenser (Thermo Scientific) and incubated at room temperature for 20 min. Then, 5 μL of detection solution containing 2 nM Lance Eu-Streptavidin (Perkin Elmer) and 30 nM Alexa647-CIB1-phage peptide diluted in assay buffer was added to the wells and incubated for 30-60 minutes at room temperature protected from light. TR-FRET signals were measured using an Envision Multilabel Plate Reader (PerkinElmer; Eu excitation 320 nm, Eu emission 615 nm, Alexa dye emission 665 nm). The TR-FRET signal is measured as the ratio 665 nm/615 nm, and percent inhibition was calculated using two sets of control wells: biotinylated CIB1 and detection solution in the presence (100% inhibition, positive control) or absence (0% inhibition, negative control) of CIB1 inhibitor. Inhibition curves were analyzed using a four-parameter non-linear curve fit in ScreenAble software, and the mean and standard deviation were calculated. Publication curves were averaged and fit with a four-parameter non-linear curve fit using GraphPad Prism.
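The data reduction described above can be summarized in a short sketch. The normalization formula is the standard two-control scheme and is assumed rather than quoted from the assay software; the four-parameter logistic is written in its common Hill form. Function names and example values are illustrative.

```python
def fret_ratio(em_665, em_615):
    """TR-FRET signal as the acceptor/donor emission ratio (665 nm / 615 nm)."""
    return em_665 / em_615


def percent_inhibition(ratio, neg_ctrl, pos_ctrl):
    """Normalize a well to the controls.

    neg_ctrl: mean ratio without inhibitor (0% inhibition).
    pos_ctrl: mean ratio with inhibitor (100% inhibition).
    """
    return 100.0 * (neg_ctrl - ratio) / (neg_ctrl - pos_ctrl)


def four_param_logistic(conc, bottom, top, ic50, hill):
    """Four-parameter dose-response model; returns the response at `conc`."""
    return bottom + (top - bottom) / (1.0 + (ic50 / conc) ** hill)
```

With the values reported for UNC10245380 (IC50 = 8.0 µM, Hill slope = 2.0) and an idealized curve running from 0 to 100%, `four_param_logistic(8.0, 0.0, 100.0, 8.0, 2.0)` evaluates to the half-maximal response, as expected at the IC50.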

Cellular assay
Human triple-negative cell lines MDA-MB-468, MDA-MB-436, MDA-MB-231, MDA-MB-453, and BT-549 were cultured in Dulbecco's modified Eagle medium (DMEM, Gibco) supplemented with 10% fetal bovine serum and 1% non-essential amino acids (Gibco) at 5% CO2 and 37 °C. BT549 and MDA-436 media were also supplemented with 10 µg/ml insulin (Gibco). For cell death studies, each cell line was plated at a density of 1.5 × 10^5 cells/well. After 24 h, the media was replaced with 1.5 mL of media containing 30 µM of compound or 1% DMSO vehicle and incubated for an additional 24-48 h. Floating and adherent cell populations were harvested, and cell death was quantified by Trypan blue exclusion and expressed as the mean percentage of dead cells (i.e., trypan blue positive) from the combined floating and adherent cell populations. Statistical analysis from 2 separate experiments was performed using GraphPad Prism software.

SUPPLEMENTAL INFORMATION
The source code used to perform the current study, as well as the latest version of FRASE-bot are shared through the GitHub repositories (https://github.com/kireevlab/FRASE-bot-Pipeline-Pilot and https://github.com/kireevlab/FRASE-bot-RDKit). The input/output files generated by the Pipeline Pilot workflow are shared as a Mendeley data set (http://dx.doi.org/10.17632/9yn47cy5jv.1).