Descriptor-Free QSAR: Effectiveness and Screening for Putative Inhibitors of FGFR1

.


ii Introduction
The effectiveness of "conventional" quantitative structural activity relationship (QSAR) models in computer-aided drug discovery are well documented [1][2][3]-we define conventional QSAR as any model utilizing descriptors in the course of training. However, there are challenges i.e.
QSAR models are only as good as their descriptors ("garbage in garbage out"). Different types of descriptors have been developed [4] to have robust, and reliable models; but these are not without problems which include but not limited to: descriptor interpretations, bias, need for third party software for descriptor calculations, effective descriptors selection algorithm, intercorrelations, etc. [5][6][7]. This raises the question; can QSAR models be trained directly on the compound SMILES while eliminating descriptors? Recent works have employed long short term memory recurrent neural networks (LSTM-RNN) algorithm to build a descriptor-free QSAR model on large and diverse datasets as proof of concepts [8], but there is paucity of data on direct application of this method for drug design and discovery.
Squamous-NSCLC (Sq-NSCLC), unlike lung adenocarcinomas, lacks commonly targetable oncogenic aberrations such as EGFR mutations, ROS-1, or ALK rearrangements [10,11] but recent discoveries have revealed the basic fibroblast growth factor receptor 1 (FGFR1) as a crucial druggable target in squamous non-small cell lung cancer. FGFR1 is a transmembrane receptor tyrosine kinase having an extracellular domain for binding of ligand and a catalysismediated intracellular domain being responsible for the receptor kinase activity [12]. It plays a physiological role in the basic hallmarks of cell development including cell proliferation, growth, differentiation, angiogenesis, migration, and survival [13,14] but dysregulated in Sq-NSCLC condition through mechanisms including over-amplification of chromosome 8p12 and/or aberrant transcriptional regulation [9][10][11].
We therefore aim to investigate the effectiveness of descriptor-free QSAR (comparably with conventional QSAR) in modeling FGFR1 inhibitors dataset in Chembl repository; employ the iv The data were split into three sets: training-set (70%), test-set (20%), and validation-set (10%) using the RDKit Max-Min Picker module. The Max-Min algorithm calculates the fingerprints for the whole dataset, evaluates the Tanimoto distance between fingerprints (MACCS) and diverse subsets selected [21].

Model Training and Evaluation
Three models were built in this study: • long short-term memory (LSTM) with canonical SMILES. i.e., no descriptors (LSTM-SM); • neural network model with molecular fingerprints as descriptors (NN-FP); • random forest model with MOE 2D descriptors (RF-2D).

LSTM-SM Model i) LSTM Principle
Long short-term memory (LSTM) was introduced to solve the long term dependence problem of Recursive Neural Network (RNN) [22]; it utilizes cell states-serving as a form of "memory"connected network-wide. To update cell states during training, "Gates" are introduced: a forget layer gate (f t )that determines part of the cell state to be discarded (Eq1); an input layer gate (i t ) which determines part of the cell state that has to be updated (Eq2); and a tanh layer gate (Čt) that creates new candidate values that would be added to the cell state (Eq3).
Čt= tanh (WČ . [ (4) vi Finally, an output gate layer (Ot) is created to determine which aspect of the new cell state (Ct) to be outputted as a hidden state (h t ) to the next cell in the network (Eq5, Eq6).

ii) Model Training
Talos module was used for hyperparameter tuning: a python dictionary specifying different hyperparameters and value ranges was provided and the module randomly selects from this parameter dictionary and constructs various LSTM models to return the model performance of each selection.

iv) Model Validation
Due to the stochastic nature of neural networks, the LSTM-SM model was validated using the following protocols (model performance was validated at different random seeds): vii • 10-Fold Cross-validation: the LSTM-SM model was subjected to a 10% split 10-fold cross-validation and the average performance of the model was computed • Y-randomization: the training label was randomized and trained,this new model was used to predict the test set and validation sets. This protocol ascertaining that the observed performancewas not due to chance. It is expected that the new model would perform significantly lower than the original model (unrandomized). This process is iterated 10 times and average performance wascomputed.
• Validation-set: Validation data set were also used to evaluate model performance at every stage of training and testing.

Baseline Models
Two baseline models were built to serve as representative of conventional QSAR: • A fully connected neural network (NN-SM) trained on RDKit Morgan fingerprint (126 bits) The hyperparameters used for NN-SM and RF-2D model is stated in Table 2 and 3

viii Applicability Domain
Two protocols were applied to define the applicability domain of the model: • Enalos Similarity KNIME node [24] was used to flag compounds that were not similar to the training dataset. The fingerprints of the compounds were computed and subjected to the Enalos similarity node.
• Compounds with functional groups not found in the training set were considered outside the applicability domain of the model

Database Screening
ChemDiv database representative compounds (300,000) were downloaded for screening: in screening, active class prediction probability was restricted to 0.75 and above.Also, due to the stochastic nature of neural networks,the model predictions were repeated four times, and only compounds consistent in at least 3 predictions were selected.

Molecular Docking
The crystal structure of FGFR1 in complex with AZD4547(PDB ID: 4V05) was downloaded from the RCSB protein database and prepared using the Schrödinger protein preparation wizard [25]; missing side chains and loops were filled with prime [26], water beyond 5Å from the het group was deleted and het states were generated using Epik [27](pH 7.0 +/-2.0) while all other parameters were left at default values. The predicted active compounds were prepared using the Schrödinger LigPrep module in which force field minimization using OPLS2005 [28] and Het states were generated using Epik [27] (pH 7.0 +/-2.0) while the active site coordinates of the FGFR1 was extracted using the receptor grid generation module of Schrödinger.
The predicted active compounds were thereafter docked using Schrödinger virtual screening workflow (consisting of a filtering stage: based on drug-likeness criteria, docking, and binding affinity calculation). The docking stage was a three-step process utilizing the three Schrodinger glide docking algorithms: high throughput virtual screening (HTVS), Standard Precision (SP), and Extra precision (XP) sequentially-each with an increasing level of accuracy [29]. We initially docked the compounds in the first step in which 10% of the top scored compounds were returned as input for step two which involved the glide SP docking of the compounds for returns ix of 10% of the top best scoring compounds. Finally, the resultant compounds were docked using glide XP to retain the top best 100 compounds prior to their binding affinity calculations using the MMGBSA protocol.

Molecular Mechanics Generalized Born Surface Area
Compounds binding affinity was calculated using the prime molecular mechanics-generalized Born surface area (MM-GBSA) [30].  (7) where Gcomplex, Gprotein and Gligand represent the binding free energies of the protein-ligand complex, protein, and ligand respectively.

Molecular Docking Protocol Validation
The docking protocol was validated by redocking the co-crystalized ligand and superimposing the redocked pose with the crystalized pose. The RSMD value of pose differences was calculated. An enrichment study was done: 20 FGFR1 inhibitors reported in the literature were mixed with decoys and docked (this investigated how well the docking protocol was able to select active compounds ahead of decoys); ROC curve was plotted and AUC calculated.

Induced-fit Docking
The top 12 compounds from the molecular docking study were subjected to induced-fit docking to predict the binding pose of the compounds and calculate corresponding binding affinities (using MMGBSA).
Maestro induced-fit docking module [32,36] was used; briefly, the compounds were docked into the active site (with the active site residues held rigidly), prime module refined the active sites residue backbone, and finally redocked the compounds into the refined protein conformation.

Residue Mutation Analysis
Schrodinger maestro mutates utility was used to mutate FGFR1 Val561 to Met561; the mutated protein was subjected to induced-fit docking with selected compounds to investigate possible bonding interactions with the mutated residue. x

QM-MM Optimization
Optimization studies were further carried out on the compound poses obtained from the inducedfit docking experiment using the Schrodinger Q-site module [31]. This was to validate bonding interactions observed in induced-fit docking poses [32].

Figure 2: LSTM-SM model evaluation a) accuracy and loss over 100 epoch b) ROC curves:
Training-set: AUC 0.95; Test-set: AUC 0.95; Validation-set: AUC 0.95 The cross-validation of the NN-FP model showed accuracy, sensitivity, and specificity of 0.91, while there was a significant reduction in the model performance (Table 5)

Figure 3: NN-FP model evaluation a) accuracy and loss over 15 epoch b) ROC-CURVEs:
Training-set: AUC 0.95; Test-set: AUC 0.99; Validation-set: AUC 0.99 The LSTM-SM model was used to screen the ChemDiv database and a total of 15,000 compounds were predicted as actives. These compounds were subjected to molecular docking (to filter out compounds with low binding affinity) and induced-fit docking (to elucidate plausible binding modes indicating putative mechanism of inhibition and specificity).

Molecular Docking and Induced-fit Docking
The docking protocol was validated (see methods); superimposing RSMD was 0.789Å and the AUC of the enrichment ROC curve was 0.99 ( Figure 4)

QM-MM optimization
The QM-MM calculation was employed to optimize the predicted induced-fit poses. AZD4547 (standard) formed a new hydrophobic bond (π-π stacked) with Phe489; compound 2912 lost its hydrogen bond with the water moiety; compound 3488 lost its bonding interactions with Phe489 and Gly567; compound 5227 lost its interaction with Phe489; compound 1717 gained a hydrophobic interaction with Phe489 and two hydrogen bonds with Glu531, but lost its interaction with Val492 (figures 8,9). We further observed a reduction in the binding affinity of compound AZD4547, 2912, 5227, 1717, and an increase in compound 3488 binding affinity ( Table 7). The QM-MM optimized poses were then selected as final poses and binding affinities.

xxi Gateway Residue Mutation Analysis
We mutated Val561 to Met561 to investigate possible interactions with this mutated residue; compound 3488 and compound 5227 formed π-alkyl interactions with Met561 and compound 2912 formed π-sulfur interactions with Met561 as shown in figure 10. The compound 1717 was not subjected to this experiment as it did not make any initial interaction with Val561. We utilized the predicted interactions with key active site residues to suggest putative mechanism of inhibition and specificity. Our suggestions are based on observed interactions of experimentally validated inhibitors. Inhibitors of FGFR1 (non-covalent) developed so far inhibits via two mechanisms: type I and II. Type I (e.g. AZD4547) inhibits FGFR1 in its DFG-in conformation via interaction with Asp641 thus interrupting the coordination of ATP phosphate group [35,36]; Type II (e.g. Ponatinib) inhibits FGFR1 in its DFG-out conformation and forms interactions with the conserved Glu531 of the αC helix region [36,37].
Specificity for FGFR1 occurs via interactions with certain regions of FGFR1 active site. This interactions includes: interactions with Phe489 of the P-loop region which induces its closure (Ploop closure) over the inhibitor thereby creating a better fit; the conserved sites in the FGFR family makes most inhibitors to be active over a wide range of FGFRs (pan-FGFR inhibitors), but the difference in specific residue positions (Tyrosine, Cysteine, or Phenylalanine) in the hinge regions of FGFRs can be exploited for specificity; Tyr563 have been identified in FGFR1 [38]. Val561 confers a natural resistance on FGFR1 via steric hindrance [36] thereby serving as a xxiii gateway residue for FGFR1 the ability to form bonds with this residue also suggest specificity.
Finally, the ability to interact with hinge residues (Glu562 -Lys566) also suggests a level of specificity [36].
We, therefore, suggest the following: compound 2912 exhibits both type I and II features (interaction with Asp641 and Glu531 is observed), with its mechanism of specificity via interactions with Phe489 (P-loop), Val561 (gateway residue), Ala564, and Try563 (FGFR1 hinge residues); mechanism of inhibition for Compound 3488 is not clear (absence of bonding interactions with Asp641 or Glu531), however, specificity mechanism could be via interaction with Val561, Ala564, and Try563; Compound 5227 could inhibit via type II mechanism (interaction with Glu531) and specificity via interaction with Glu531, Phe642, Ala564, Try563, and Val561; Compound 1717 also exhibits both type I and II features (interaction with Asp641 and Glu531), with specificity via interactions with Phe489 and Ala564.
Acquired resistance is a challenge when considering long term efficacy of FGFR1 inhibitors, this resistance occurs via mutation of the gateway residue with a bulky amino acid e.g. methionine or isoleucine (resulting in the "gates been closed") [36,37,39]. Simulating this mutation(Val561 to Met561) we predict that the compounds might still be effective regardless of such mutations since the compounds still interacted with Met561: compound 2912 formed π-sulfur interaction, compound 3488, and 5227 from π-alkyl interaction suggesting effectiveness despite this acquired resistance. For compound 1717, interactions with Val561 or Met561 were absent.
With this we submit compounds 2912, 5227, and 3488 as potential specific inhibitors of FGFR1; however, the exact mechanism of inhibitions for the compounds needs to be experimentally verified. Compound 1717 (despite its high binding affinity) is not considered an effective inhibitor due to its in ability to interact with the gateway residue.

xxiv
We however, recognize that training a descriptor-less QSAR model is computationally intensive, but with advances in computing power this should be a non-issue (it is therefore, a choice between either speed or a QSAR model based purely on SMILES representation); also the stochastic nature of neural networks might result in some compound been missed.
For future perspective, we recommend further tuning of hyperparameters (this could reduce the number of epochs required to train), training with more FGFR1 inhibitors from other databases to make the model more robust, bidirectional LSTM algorithms could be experimented with, improvement or invention of a new textual representation of compounds purely for descriptorless QSAR modeling to ensure that attention values are easily mapped back to compound structure is also recommended. Finally, the suggested mechanism of inhibition and specificity of the compounds remain predictions and experimental validation is needed.

xxv Conclusion
With the advent of more sophisticated machine learning algorithms capable of self-feature extractions, we have shown that by allowing the model to extract its own descriptor/features, it can perform as good as feeding the model with descriptors if not better; we have also shown its effectiveness in screening for active compounds and elucidated a putative mechanism of inhibition and specificity for selected compounds. Hence, we can affirm that "descriptor calculation, preparation, filtering, and selection steps in the QSAR workflow can be eliminated".