A Machine Learning Approach to Identify Small Molecule Inhibitors of Secondary Nucleation in α-Synuclein Aggregation


 Drug development is an increasingly active area of application of machine learning, prompted by the need to overcome the high attrition rates of conventional drug discovery pipelines. This issue is especially pressing for neurodegenerative diseases, where disease-modifying drugs are not widely available yet. To address this problem, we describe an approach based on a combination of machine learning with chemical kinetics to target diseases caused by protein misfolding and aggregation. We use this approach to identify specific inhibitors of the proliferation of α-synuclein aggregates through secondary nucleation, a process implicated in Parkinson’s disease. Our results demonstrate that this approach leads to the identification of novel chemical matter with an improved hit rate and potency over more traditional approaches.


Introduction
Parkinson's disease is the most common neurodegenerative movement disorder, estimated to affect 2-3% of the population over 65 years of age [1][2][3][4] . The aggregation of the intrinsically disordered protein α-synuclein is thought to be responsible for the initial neurodegenerative processes underlying this disease, in which the pathological accumulation of misfolded protein molecules results in neuronal toxicity beginning in the substantia nigra region of the brain [1][2][3][5] . This hypothesis is supported by genetic evidence and by observations of the accumulation of α-synuclein in inclusions known as Lewy bodies in the neurons of Parkinson's disease patients [5][6][7] . The aggregates of α-synuclein, including insoluble fibrils and misfolded soluble oligomers, have been shown to exhibit various mechanisms of cellular toxicity including cell membrane disruption 8,9 .
The pathological relevance of the aberrant aggregation of α-synuclein has led to major efforts being invested into studying its mechanisms and identifying molecules that can inhibit those aggregation mechanisms associated with neurotoxicity [10][11][12][13] . This is a particularly pressing need given the lack of therapies and diagnostics currently available to PD patients [14][15][16] .
Computational methods could in principle reduce the time and cost of traditional drug discovery pipelines [17][18][19] . In this context, machine learning is rapidly emerging as a powerful drug discovery strategy 20 . To explore the potential of this strategy in drug discovery programs for Parkinson's disease and other protein misfolding diseases, we describe here a combination of machine learning with an innovative drug discovery approach based on chemical kinetics 10 to effectively explore the chemical space and identify compounds that inhibit the aggregation of α-synuclein. Our starting point is a recently described set of compounds that bind to the fibril structure of α-synuclein and, as a result, prevent the autocatalytic proliferation of α-synuclein fibrils 21 . Here, we used this initial set of compounds as input for machine learning to identify chemical matter that is both efficacious and represents a significant departure from the parent structures, providing novel compounds that conventional similarity searches would have failed to identify efficiently.
The machine learning part of our approach consists of 3 main components 22 : (1) the algorithms required to represent the structures of the molecules, (2) a model for training and prediction using these representations and the assay metrics, and (3) an experimental method to test the model and feed results back into it. This setup is illustrated in Fig. S1. Component 1 is a junction tree variational autoencoder 23 , pretrained on a set of 250,000 molecules 24 . Component 2 consists of a shallow learning algorithm utilising an initial random forest regressor (RFR) with a Gaussian process regressor (GPR) fitted to the residuals 25,26 . Since Gaussian processes work effectively on small datasets, this approach makes it possible to leverage the associated uncertainty measure when ranking molecules during acquirement prioritisation 22 . Depending on how conservative the experimentalist wishes to be, a greater or lesser weighting can be applied to this uncertainty, allowing a ranking based on both the predicted potency of the molecules and the uncertainty of that prediction. The last component is a chemical kinetics assay 10,27,28 , which identifies the top compounds that significantly inhibit the surface-catalysed secondary nucleation step in the aggregation of α-synuclein.
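As an illustration of component 2, a minimal sketch of an RFR with a GPR fitted to its residuals, together with an uncertainty-weighted ranking, is given below using scikit-learn. The class and function names, hyperparameters, and kernel setup are illustrative assumptions rather than the exact implementation used in this work (the Matérn kernel follows the choice discussed in the machine learning implementation details).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern


class StackedRFRGPR:
    """Random forest regressor with a Gaussian process fitted to its residuals."""

    def __init__(self):
        self.rfr = RandomForestRegressor(n_estimators=200, random_state=0)
        self.gpr = GaussianProcessRegressor(
            kernel=ConstantKernel() * Matern(nu=2.5), normalize_y=True)

    def fit(self, X, y):
        self.rfr.fit(X, y)
        residuals = y - self.rfr.predict(X)
        self.gpr.fit(X, residuals)
        return self

    def predict(self, X):
        # The GP models the RFR residuals; its standard deviation doubles
        # as an uncertainty estimate for the combined prediction.
        correction, std = self.gpr.predict(X, return_std=True)
        return self.rfr.predict(X) + correction, std


def rank(model, X, beta=0.5):
    """Rank candidates best-first by predicted value penalised by uncertainty."""
    pred, std = model.predict(X)
    return np.argsort(-(pred - beta * std))
```

The `beta` parameter in `rank` corresponds to the uncertainty weighting discussed above: larger values keep the search close to well-characterised chemical space, while smaller values favour exploration.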
The primary aim of the machine learning approach presented here is to open a route to obtain a greater hit rate and novelty than high-throughput screening, docking simulations and similarity searches. Overall, this work represents the rational development of an approach that combines chemical kinetics and machine learning for identification and optimisation of compounds that can inhibit potently and selectively the secondary nucleation of α-synuclein, which could be of use in therapeutic programs targeting synucleinopathies in the future.

A machine learning method of exploring the compound space
The overall approach is illustrated schematically in Fig. 1, starting from the 4 parent structures identified by docking simulations (Fig. 1A). Using the Tanimoto similarity metric 29 , 2 similarity searches were then carried out with these 4 structures as starting points. A selection of molecules closely related to the parent compounds (Tanimoto similarity > 0.5, referred to as the 'close similarity docking set', Fig. 1B) was experimentally tested in an aggregation assay. This step was then followed by a larger selection of compounds with a looser cut-off of structural similarity to the parent compounds (Tanimoto similarity > 0.4, referred to as the 'loose similarity docking set', Fig. 1B). The compounds resulting from these experiments were then used as input for a machine learning method for an iterative exploration of the chemical space (Fig. 1C). The similarity searches removed the most obvious targets of the machine learning approach, but also increased the size of the dataset available for training. The training set, however, remained small by typical machine learning standards, as it consisted of a few hundred molecules. Since training set sizes such as the one used here are typical in early-stage research, a further aim of this work was to demonstrate that machine learning can be used effectively even in such data sparse scenarios.
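The Tanimoto comparisons used for the similarity searches operate on molecular fingerprints. As a minimal, dependency-free sketch, a fingerprint can be treated as a set of 'on' bits (in practice a cheminformatics package such as RDKit would generate the fingerprints from the structures):

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)


def similarity_search(parents, library, cutoff):
    """Keep library entries whose similarity to any parent exceeds the cut-off.

    library maps a name to a fingerprint; parents is a list of fingerprints.
    """
    return {name: fp for name, fp in library.items()
            if any(tanimoto(fp, p) > cutoff for p in parents)}
```

Raising the cut-off from 0.4 to 0.5 in `similarity_search` reproduces the difference between the loose and close similarity docking sets described above.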
Selected molecules in the close similarity docking set were tested in an aggregation assay, which yielded 5 new hits from 25 new molecules (Fig. S2). Additional selected molecules in the loose similarity docking set were also tested in this assay in order to expand the training set beyond the initial docking compounds and their closely related derivatives. Although new hits featured amongst this set, the hit rate was low (4%), and both molecules 48 and 52, which had initially appeared to be the most promising of the parent structures, yielded poor results. Of the 29 molecules related to molecule 48 in the loose similarity docking set, none were hits, while of the 24 molecules related to molecule 52, only 2 were hits. The functional range of molecules 48 and 52 thus appeared to be narrowly limited to the chemical space immediately around the parent structures. Overall, the hit rate from the loose similarity docking set was less than a quarter of that of the close similarity docking set and involved testing 3 times as many compounds.
At this point it would have been challenging to further explore the chemical space using conventional structure-activity relationship (SAR) techniques without significant attrition, since the hit rate worsened as the similarity constraint to the hits was loosened.

Iterative application of the machine learning approach
One of the issues with applying machine learning to a data sparse scenario is that predictions are likely to be overconfident. While this problem can be addressed to an extent by utilising Gaussian processes, a complementary strategy is to restrict the search area to a region of chemical space that is more likely to yield successful results.
To this end, a structural similarity search of the 4 molecules in the docking set was carried out on the 'clean' and 'in stock' subset of the ZINC database, comprising ~6 million molecules. Any molecule showing a Tanimoto similarity of > 0.3 to any of the 4 structures of interest was included. This low threshold for Tanimoto similarity was intended to narrow the search space without being overly restrictive of the available chemical landscape, yielding a dataset of ~9000 compounds, which comprised the 'test set'. Different machine learning models were then trialled against the docking scores calculated for the test set, and the best performing model was trained on the whole aggregation data set and used to predict the top set of molecules (see Machine Learning Implementation in Supplementary Information, and Figs. S3 and S4). For this work, the half time of aggregation was used as the metric of potency for machine learning because of its robustness; by comparison, the amplification rate is more susceptible to small fluctuations in the slope of the aggregation fluorescence trace 31 (Fig. S5). Molecules that achieved a half time 2-fold greater than that of the negative control under standard assay conditions (see Methods) were classed as hits 21 . The algorithm was run 10 times from different random starting states, and those molecules that appeared in the top 100 ranked molecules more than 50% of the time (64 molecules) were chosen for purchase (first iteration). In this first iteration, there was an inherent bias towards the structure of molecule 69 in the dataset given the relative population sizes (Fig. S6), but with the caveat that many of these structures were only loosely related to the parent (Tanimoto similarity < 0.4). Many of the hit molecules came from this group, suggesting significant chemical departures from the parent structure.
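The consensus selection rule used for the first iteration (molecules appearing in the top 100 of the ranking in more than 50% of 10 randomly seeded runs) can be sketched as follows; `rank_fn` is a hypothetical placeholder standing in for the trained model's ranking:

```python
from collections import Counter


def consensus_selection(candidates, rank_fn, n_runs=10, top_k=100, min_frac=0.5):
    """Select candidates appearing in the top-k ranking in > min_frac of runs.

    rank_fn(candidates, seed) must return the candidates ordered best-first;
    each seed corresponds to one random restart of the model.
    """
    counts = Counter()
    for seed in range(n_runs):
        counts.update(rank_fn(candidates, seed)[:top_k])
    return [c for c, n in counts.items() if n / n_runs > min_frac]
```

Requiring a candidate to survive most restarts filters out molecules whose high ranking depends on one particular random initialisation of the model.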
After the first iteration, the compounds were pooled together to extend the training set, and a further 2 iterations were carried out. Example kinetic traces for 1 molecule are shown in Fig. 2A; the molecules are labelled according to iteration number and hit identifier within that iteration, I1.01 being the first hit (01) within iteration 1 (I1). The aggregation data from the first 3 prediction iterations are also shown in Fig. 2B. Of the 64 molecules from iteration 1, 8 were strong hits, representing a hit rate of 12.5%; the second iteration showed a further increase, with 12 strong hits representing an 18.8% hit rate; and the third iteration, with 12 hits, exhibited a hit rate of 21.4%. These hit rates represent an order of magnitude improvement over HTS (~1%) and, remarkably, an overall 45% improvement over the combined similarity search hit rates, even though the similarity searches had already removed the most likely hit candidates. The potency of these hits was also significantly higher on average than that of the hits identified by the similarity searches (Fig. 3A), without compromising the CNS-MPO 41 scores (Fig. 3B).
The chemical space explored by the machine learning approach was also inspected via dimensionality reduction techniques such as PCA, t-SNE and UMAP (see Methods) to investigate how the model was prioritising molecules (Fig. S7). The relative positioning of the training points and the parents within the chemical space is shown in Fig. 4. The stacked RFR-GPR model assigned low uncertainty to areas of the chemical space proximal to the observed data, and the corresponding acquirement priority mirrored this when trained on the aggregation data (Fig. 5). This figure also illustrates how the uncertainty weighting could be altered during the ranking, depending on how conservative a prediction was required to be. A drawback of a high uncertainty penalty was that the model remained in the chemical space it was confident in, while a lower uncertainty penalty ensured reasonable confidence of hit acquirement while still exploring the chemical space.
The changes in similarity of the hits to the parent structures are shown in Fig. 6. The similarity dropped for all structures at successive stages of the investigation, reaching its lowest point at the iterations of the machine learning approach. The most potent hits mostly retained the left-hand side scaffold of molecule 69 with the addition of polar groups to the benzene ring, but with significant alterations to the right-hand side of the scaffold. For example, from iteration 1, I1.01 replaced the right-hand side of molecule 69 with a single substituted benzene ring, while I1.02 replaced the right-hand side with a substituted furan ring. These changes were reflected in the Tanimoto similarity values, which were at the lower end of what was permitted in the test set, 0.3 being the cut off.
It was evident from this result that parts of the substructure were important to retain for potency, which the model did effectively while also identifying alterations in the rest of the scaffold that enhanced the potency considerably beyond that of the parent.

Validation experiments
A series of validation experiments were carried out on the most potent hits from the machine learning iterations. We first tested the binding to fibrils using the change in fluorescence polarisation under a titration of fibrils (Fig. S8). One example, molecule I3.08, which had the appropriate fluorescence properties for the assay, was investigated and exhibited a KD of ~ 700 nM.
The dose dependent potency in a light seeded aggregation assay was also investigated (Fig. 2A and Fig. S9), with all hit molecules exhibiting substoichiometric potency. As transient oligomeric species are considered the most damaging of the aggregate species in vivo 8,9 , oligomer flux simulations are also shown in Fig. 2A and Fig. S9, with all hits demonstrating dose dependent delay and reduction of the oligomer peak. The calculations and input used to derive these are described in the Methods. The last metric extracted from these light seeded experiments was an approximate overall rate at each molecule concentration, obtained by taking 1/t1/2, as shown in Fig. 2A and Fig. S10.
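A schematic of how such oligomer flux curves can be simulated is sketched below, assuming a minimal seeded-aggregation model in which the flux towards oligomers is taken as the secondary nucleation rate, k2·m(t)^n2·M(t). The rate constants and the reduced model itself are illustrative assumptions, not the exact scheme used for the figures.

```python
import numpy as np
from scipy.integrate import solve_ivp


def oligomer_flux(t_max, mtot, M0, P0, k_plus, k2, n2):
    """Integrate a minimal seeded-aggregation model and return the
    instantaneous flux towards oligomers, k2 * m(t)**n2 * M(t).

    State: fibril number P and fibril mass M; monomer m = mtot - M.
    """
    def rhs(t, y):
        P, M = y
        m = max(mtot - M, 0.0)           # monomer mass conservation
        return [k2 * m**n2 * M,          # new fibrils via secondary nucleation
                2 * k_plus * m * P]      # fibril mass growth via elongation

    t = np.linspace(0, t_max, 500)
    sol = solve_ivp(rhs, (0, t_max), [P0, M0], t_eval=t, rtol=1e-8)
    m = np.clip(mtot - sol.y[1], 0, None)
    return t, k2 * m**n2 * sol.y[1]
```

Reducing k2, as a secondary nucleation inhibitor does, both delays and lowers the flux peak in this model, mirroring the dose dependent behaviour reported above.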
These approximate rates were fitted with a Hill slope to obtain the 50% kinetic inhibitory concentration (KIC50), i.e. the concentration of molecule at which the t1/2 is increased 2-fold with respect to the control; this is also shown in Fig. 2A and Fig. S10. The more potent hits had KIC50 values in the region of 1 µM, translating to significant potency at a ratio of 1:10 molecule to α-synuclein monomer.
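The KIC50 extraction can be sketched as a Hill-type fit of the approximate rates (1/t1/2) against compound concentration; the functional form, parameter names and starting values below are illustrative assumptions rather than the exact fitting procedure used here:

```python
import numpy as np
from scipy.optimize import curve_fit


def hill(c, rate0, kic50, n):
    """Relative aggregation rate versus compound concentration c.

    At c = kic50 the rate halves, i.e. t1/2 doubles relative to the control.
    """
    return rate0 / (1 + (c / kic50) ** n)


def fit_kic50(conc, rates):
    """Fit the Hill curve to measured rates and return the fitted KIC50."""
    popt, _ = curve_fit(hill, conc, rates,
                        p0=[rates[0], np.median(conc), 1.0],
                        maxfev=10000)
    return popt[1]
```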
Finally, transmission electron microscopy imaging of the fibrils produced at the end of the light seeded aggregation reaction was carried out (Fig. S11) to verify that fibrils were produced, and mass spectrometry was carried out on α-synuclein samples after incubation with hit molecules. No change was observed in the mass of the α-synuclein peak for the molecule incubations compared with a 1% DMSO control, excluding covalent modification as a mechanism.

Discussion
Chemical kinetics approaches have advanced to the point that specific mechanisms of α-synuclein aggregation can be targeted in a reproducible way 10,27,28 . The mechanism targeted in this work is the surface-catalysed secondary nucleation step, which is responsible for the autocatalytic proliferation of α-synuclein fibrils. In a recent proof-of-principle report, initial hit molecules identified via docking simulations were shown to bind competitively with α-synuclein monomers along specific sites on the surface of α-synuclein fibrils 21,32,33 . Specific rate measures and other aggregation metrics were derived from these experiments, allowing quantitative and reliable comparisons between molecules in terms of SAR and offering metrics to optimise structures of interest 10,34 .
The aim of this work was to develop a machine learning approach to drug discovery for protein aggregation diseases that could both improve the hit rate of the in vitro assays employed and provide novel chemical matter more efficiently than conventional approaches. From the first iteration onwards, the hit rate of the approach, starting from initial hit compounds identified via docking simulations, represented an over 20-fold improvement on typical HTS hit rates (~0-1%) 35 . These structures also represent discoveries that could not have been obtained by staying close in chemical space to the parent structures, as would have been dictated by similarity search approaches. There are ~4000 molecules in the test set with Tanimoto similarity values in the range of these hits, and potentially all of these would have had to be screened to locate the same hits using similarity searches alone, as demonstrated by the looser similarity search, which exhibited a comparatively poor hit rate (4%) despite conservative structural alterations to the parent hits. The machine learning method was therefore able to supply a degree of novelty as well as an improved hit rate.
A limitation of this approach is the requirement to pick from a pre-existing library. To address this limitation, generative models could be employed 36 . A second limitation is the focus on a single assay metric of interest as a learning parameter. Addressing this limitation involves multi-parameter optimisation, which is a challenging area in rapid development [37][38][39] . Another topic of great interest in machine learning drug discovery approaches, besides potency prediction, is the prediction of pharmacokinetics and toxicity 40 . It could be possible to achieve this multi-parameter optimisation by utilising multiple models in parallel and then employing a joint ranking metric; alternatively, architectures such as generative adversarial networks may be capable of achieving this in a single model, although this has primarily been demonstrated with predicted chemical properties such as clogP and QED rather than with experimental results [37][38][39] . The molecules in this work were derived from a set that passed CNS-MPO 41 criteria in the initial docking simulation, and so the CNS-MPO score of the whole aggregation inhibitor set is relatively favourable, with most molecules exceeding the common cut-off value of 4 41 (Fig. 3B).

Conclusions
The results that we have presented illustrate the potential of a drug discovery approach that involves the combination of chemical kinetics and machine learning. The hits identified by this combined approach offered a significant improvement in potency over the parent molecules and represented a major structural departure from them. We anticipate that using machine learning approaches of the type described here could be of significant benefit to researchers working in the field of protein misfolding diseases, and indeed early-stage drug discovery research in general.

Compounds and chemicals
Compounds were purchased from MolPort (Riga, Latvia) or Mcule, and prepared in DMSO to a stock of 5 mM. All chemicals used were purchased at the highest purity available.

Determination of the elongation rate constant
In the presence of high concentrations of seeds (~ µM), the aggregation of α-synuclein is dominated by the elongation of the added seeds 43,44 . Under these conditions, where other microscopic processes are negligible, the aggregation kinetics of α-synuclein can be described by a simple elongation rate law 43,44 .
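As a sketch of that rate law, a standard form for elongation-dominated, seeded aggregation is given below in terms of the quantities used in this section (fibril mass concentration M(t), free monomer concentration m(t) = mtot − M(t), fibril-end concentration P, and elongation rate constant k+); the exact equation used in the original analysis may differ:

```latex
% Elongation-dominated growth at approximately constant fibril number P:
\frac{\mathrm{d}M(t)}{\mathrm{d}t} = 2 k_{+} P \, m(t)
\quad\Longrightarrow\quad
M(t) = m_{\mathrm{tot}} - \left[ m_{\mathrm{tot}} - M(0) \right] e^{-2 k_{+} P t}
```

The factor of 2 accounts for growth at both fibril ends; fitting the initial, approximately linear part of such traces at known seed concentration yields the elongation rate constant k+.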

Determination of the amplification rate constant
In the presence of low concentrations of seeds (~ nM), the fibril mass fraction, M(t), over time was described by fitting a generalised logistic function to the normalised aggregation data 10,45 , where mtot denotes the total concentration of α-synuclein monomers. The parameters of this function represent combinations of the effective rate constants for primary and secondary nucleation 45 , where kn and k2 denote the rate constants for primary and secondary nucleation, respectively, and nc and n2 denote the reaction orders of primary and secondary nucleation, respectively. In this case, the shape parameter of the logistic function was fixed at 0.3 for the fitting of all data (corresponding to a reaction order of n2 = 4), and k2, the amplification rate, is expressed as a normalised reduction for α-synuclein in the presence of the compounds as compared to its absence (1% DMSO).

Determination of the oligomer flux over time
The theoretical prediction of the reactive flux towards oligomers over time was obtained from the fitted rate constants, taking the flux as the rate at which new aggregates are generated through secondary nucleation.

(Fig. 1 caption, fragment) ...followed by a loose similarity search with a Tanimoto similarity cut-off of > 0.4 (the 'loose similarity docking set'). A machine learning method was then applied, using the observed data to predict hits from a compound library derived from the ZINC database with Tanimoto similarity > 0.3 to the parent structures (the 'test set'). (C) Successive iterations of prediction and experimental testing yielded higher hit rates, and molecules with higher potency on average than those identified in the previous similarity searches. Validation experiments were also carried out on the hits identified.

Machine learning implementation
The model was required to be appropriate for small datasets and easily applicable on standard hardware available to most laboratory workers over a short timescale. As a first-line test, Gaussian process regression (GPR) was employed alone, following Hie et al. 22 . A binding score training set was then used to fit the GPR, which was in turn used to predict the binding scores of the test set. The metric used to evaluate performance in this case was the R² score, or coefficient of determination, which measures the goodness of fit between a set of predictions and the ground truth values. This score ranges from 1, for a perfect fit, to arbitrarily negative values as a fit becomes worse, and is 0 when the predictions are equivalent to the expectation of the ground truth values of the training set 56 . The training set size was subsequently expanded, and the same process carried out. This was compared with a naïve Bayes model, which failed to score above 0 for any training set size.
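The R² behaviour described above (1 for a perfect fit, 0 when predicting the mean of the ground truth values, and arbitrarily negative for worse fits) can be checked directly with a minimal, dependency-free implementation:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

For example, predicting the mean of the observed values for every point gives SS_res = SS_tot and hence a score of exactly 0, which is why a model scoring below 0 (such as the naïve Bayes baseline above) is performing worse than the trivial mean predictor.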
The GPR kernel was initially the same as that utilised by Hie et al. 22 , i.e. a combination of a constant kernel and a radial basis function (RBF). Using these initial settings, R² scores of ~0.2 were obtained for small training set sizes. Hyperparameter optimisation yielded only marginal improvements in this performance. A selection of other kernels was tested, and all models were optimised via hyperparameter tuning before implementation, but most did not offer an improvement in performance. The Matérn kernel, a generalisation of the RBF with an extra parameter controlling the smoothness of the function, did however show a marginal improvement. These flexible functions are the most likely to be able to fit shallow energy minima problems such as those encountered here. The R² scores were still low, especially for the smaller training sets, but represented a viable starting point.
The same process was repeated using different proportions of the molecular feature set, and it was found that the GPR performance metrics were better when using truncated feature sets than when using the entire representation. In general, it is to be expected that fitting fewer features to a predicted value is easier for a regressor to achieve, and so higher scores are obtained. However, a better average R² score across the data set does not necessarily lead to a better result in terms of the actual molecules picked.
To investigate this issue, a simulation was created to mimic how the experimental cycle of testing might work using the docking scores as a surrogate for aggregation data. In the simulation, a random subset of 100 molecules was selected and the model trained on these molecules and their binding scores. The resultant model was then used to predict binding scores for the remaining molecules and rank them using a combination of the predicted value and the associated uncertainty value. The top 100 were then selected and their binding scores added to the training set as would occur in the experimental scenario, and this process was repeated 10 times. The ideal scenario would be that molecule sets with improved mean binding energy relative to the mean of the test set would be selected, and that selections would improve as the training set expanded.
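The simulated acquisition cycle described above can be sketched as follows, with `model_factory` standing in for any regressor that returns predictions together with uncertainties (such as the stacked RFR-GPR); all names and set sizes here are illustrative:

```python
import numpy as np


def acquisition_cycle(X, scores, model_factory, n_start=100, n_acquire=100,
                      n_rounds=10, beta=0.5, seed=0):
    """Simulate iterative acquisition against known (docking) scores.

    Each round: train on the acquired set, rank the remaining molecules by
    prediction minus beta * uncertainty, and move the top n_acquire into the
    training set. Returns the mean true score of each round's picks.
    """
    rng = np.random.default_rng(seed)
    acquired = list(rng.choice(len(X), size=n_start, replace=False))
    round_means = []
    for _ in range(n_rounds):
        acquired_set = set(acquired)
        rest = [i for i in range(len(X)) if i not in acquired_set]
        model = model_factory().fit(X[acquired], scores[acquired])
        pred, std = model.predict(X[rest])   # model must expose uncertainty
        order = np.argsort(-(pred - beta * std))
        picked = [rest[i] for i in order[:n_acquire]]
        round_means.append(float(scores[picked].mean()))
        acquired += picked
    return round_means
```

In the ideal scenario described above, the mean score of each round's picks exceeds the mean of the whole test set, and the picks degrade only gradually as the best candidates are consumed.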
Different uncertainty penalties were tested during this process. We found that a low uncertainty penalty produced better results by removing the most overconfident predictions without placing too many limitations on the model. At the early stages most predictions with low uncertainty were those with predicted binding scores close to the mean of the training set. An excessive uncertainty penalty during these stages would cause the model to only predict molecules that it was confident in, which were also likely to be mild.
A snapshot of the results of this testing is shown in Figs. S3 and S4. Fig. S3 demonstrates 2 points: the performance was slightly improved using the Matérn kernel in place of the RBF kernel, both in terms of overall hit selection and of performance improvement with increasing training set size; and the full-length molecular representation gave a significant boost in the number of hits selected versus the truncated representation, despite the lower R² scores. These results also provided some evidence that Gaussian process learning might work reasonably effectively even in this data sparse scenario, albeit at a modest level. It was expected that fitting experimental data would prove more challenging, however, and so a boost in performance was sought that would not compromise the simplicity of the model.

(Fig. S5 caption, fragment) The half time, although a simpler measure, is more robust and so was chosen for the machine learning approach. Data obtained from reference 21.