Improving de novo molecular design with curriculum learning

Reinforcement learning is a powerful paradigm that has gained popularity across multiple domains. However, applying reinforcement learning may come at the cost of many interactions between the agent and the environment. This cost can be especially pronounced when each piece of feedback from the environment is slow or computationally expensive to obtain, causing extensive periods of non-productivity. Curriculum learning provides a suitable alternative by arranging a sequence of tasks of increasing complexity, with the aim of reducing the overall cost of learning. Here we demonstrate the application of curriculum learning for drug discovery. We implement curriculum learning in the de novo design platform REINVENT, and apply it to illustrative molecular design problems of different complexities. The results show both accelerated learning and a positive impact on the quality of the output when compared with standard policy-based reinforcement learning.

While reinforcement learning can be a powerful tool for complex design tasks such as molecular design, training can be slow when problems are either too hard or too easy, as little is learned in these cases. Jeff Guo and colleagues provide a curriculum learning extension to the REINVENT de novo molecular design framework that poses problems of increasing difficulty over epochs such that the training process is more efficient.

The application of deep learning for drug discovery provides potential to accelerate therapeutics development. One fundamental challenge is molecular design, involving the design and prioritization of candidate molecules for experimental validation 1,2. Molecular design entails a multiparameter optimization (MPO) search in chemical space, estimated to be in the range of 10^23-10^60 molecules 3. Existing methods for molecular design include virtual screening (VS), which continues to be a valuable paradigm for identifying candidate molecules 4,5. Correspondingly, recent work by Sadybekov et al. introduced V-SYNTHES, a synthon-based screening approach capable of handling ultralarge compound libraries and yielding synthetically tractable molecules that were experimentally validated 5.
Recently, deep learning methods have emerged as an alternative to VS. An advantage of these methods is that a much larger chemical space can be sampled compared with methods relying on enumerating molecules 6. However, care needs to be taken that the generated molecules are synthetically feasible. Deep generative models using policy-based reinforcement learning (RL) 7-13, value-based RL 14, learning a molecular latent space 15, and other methods including tree search 16 and genetic algorithms 17-20 have been proposed to design molecules possessing desired properties. In the policy-based RL paradigm, an agent (a generative model) learns a policy (a series of actions to take at given states) to generate molecules that maximize a reward, which is typically computed on the basis of a predefined reward function 7-13. Often, physics-based approximations of binding affinity, such as molecular docking, are included as a component in the reward function to design molecules with enhanced predicted potency. Given sufficiently long training time, these models can learn to generate molecules that satisfy the desired MPO objective. However, in cases with complex reward functions where minima are difficult to find, the agent may spend many epochs sampling from areas in chemical space that are far away from the desired objective. The issue is exacerbated when computationally intensive components, such as molecular docking, are included in the reward function. Thus, policy-based RL can be infeasible for complex MPO objectives, leading to suboptimal allocation of computational resources and, eventually, suboptimal molecules identified for synthesis.
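As a concrete illustration of the policy-based RL update used by REINVENT-style generators, the agent's likelihood of a sampled SMILES is pulled towards the prior's likelihood augmented by a scaled reward. The sketch below is a simplified, pure-Python rendering of this idea; the function name and the scalar log-likelihoods are illustrative stand-ins, not the actual REINVENT code.

```python
def augmented_likelihood_loss(prior_loglik, agent_loglik, scores, sigma=128.0):
    """Mean squared difference between the agent log-likelihood and the
    'augmented' log-likelihood: prior log-likelihood + sigma * score.
    High-scoring SMILES thereby become more probable under the agent."""
    diffs = [
        (p + sigma * s - a) ** 2
        for p, a, s in zip(prior_loglik, agent_loglik, scores)
    ]
    return sum(diffs) / len(diffs)

# Toy batch of three sampled SMILES: log-likelihoods and reward scores
prior = [-40.0, -35.0, -50.0]   # log P_prior(smiles)
agent = [-38.0, -36.0, -49.0]   # log P_agent(smiles)
scores = [0.9, 0.1, 0.5]        # scoring-function rewards in [0, 1]

loss = augmented_likelihood_loss(prior, agent, scores)
```

Note that the loss vanishes exactly when the agent's log-likelihood equals the prior's shifted by sigma times the score, which is why a larger sigma pushes the agent harder towards high-reward regions.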
Curriculum learning (CL) has been proposed as a training strategy to overcome difficulties in learning complex tasks 21 . The basis of CL is to decompose complex objectives into simpler constituent objectives that are sequentially learned, guiding training towards successful convergence of the final objective. Provided with a curriculum where constituent objectives are strongly correlated with the final objective, corresponding gradients from sequential simpler tasks are more effective at traversing the optimization landscape and can accelerate convergence 22,23 . Molecular design often requires optimization of correlated properties that cumulatively define favourable chemical space-for example, generating known active scaffolds and improving binding affinity 24 . By applying concepts from CL, existing limitations of policy-based RL for molecular design can be circumvented. CL provides a strategy to lower the learning barrier of complex MPO objectives, reaching a state of productivity within a reasonable timeframe.
In this work, we build on the de novo molecular design platform REINVENT and introduce a CL implementation that can address tasks where policy-based RL has difficulties 8. The use of CL extends REINVENT's applicability to complex reward functions that were previously infeasible with standard policy-based RL. We demonstrate the use of CL in REINVENT by formulating a case study to design 3-phosphoinositide-dependent protein kinase-1 (PDK1) inhibitors 25. We show that, compared with standard policy-based RL, assembling a curriculum allows the agent to reach a state of productivity almost immediately, circumventing the high computational costs associated with reward components such as molecular docking. Moreover, we show that CL provides a natural method for agent policy regularization, such that minor changes in the curriculum can steer molecular design, enabling control over the quality and diversity of the results in a predictable manner.
In CL, a complex task is decomposed into simpler constituent tasks to accelerate training and convergence. The goal is to guide the agent to learn tasks with increasing complexity before providing the production objective. Agent learning progresses through the curriculum phase to the production phase and is controlled by curriculum progression criteria, checking that the agent achieves an adequate score threshold for each objective (Fig. 1). In the former, the agent is trained on simpler sequential tasks with gradually increasing complexity. In the latter, the agent reaches a state of productivity, whereby the agent samples compounds in favourable areas of chemical space that satisfy the production objective (see Methods for a formal definition of the CL strategy). Agent policy update is maintained in the production phase to ensure that the agent samples from diverse minima.
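The two-phase schedule described above can be sketched as a simple training loop. The agent API (`sample`, `update`) and the objective callables below are hypothetical stand-ins, not the actual REINVENT interface; the curriculum progression criterion is modelled as a threshold on the mean batch score.

```python
def run_curriculum(agent, curriculum, production_objective,
                   production_epochs=300, batch_size=128,
                   store_threshold=0.4):
    """Sketch of the CL schedule: `curriculum` is a list of
    (objective, threshold) pairs; each threshold is a curriculum
    progression criterion that the mean batch score must reach before
    the next, more complex objective is activated."""
    # Curriculum phase: sequential objectives of increasing complexity.
    for objective, threshold in curriculum:
        mean_score = 0.0
        while mean_score < threshold:
            batch = agent.sample(batch_size)
            scores = [objective(s) for s in batch]
            mean_score = sum(scores) / len(scores)
            agent.update(batch, scores)

    # Production phase: entered only once the final criterion is satisfied.
    collected = []
    for _ in range(production_epochs):
        batch = agent.sample(batch_size)
        scores = [production_objective(s) for s in batch]
        agent.update(batch, scores)
        collected.extend(s for s, sc in zip(batch, scores)
                         if sc >= store_threshold)
    return collected

# Toy demonstration: an 'agent' whose skill grows with every policy update.
class ToyAgent:
    def __init__(self):
        self.skill = 0.0
    def sample(self, n):
        return ["C"] * n          # placeholder SMILES
    def update(self, batch, scores):
        self.skill += 0.1

toy = ToyAgent()
collected = run_curriculum(
    toy, curriculum=[(lambda s: toy.skill, 0.3)],
    production_objective=lambda s: 1.0,
    production_epochs=2, batch_size=4)
```

The design point to note is the gate between phases: no production-phase (potentially expensive) scoring happens until every curriculum progression criterion has been met.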

Results
In this section, we devise three experiments to demonstrate the enhanced capability of CL to satisfy complex objectives relative to standard policy-based RL.
(1) Production objective: generate compounds that possess a target scaffold. Scoring function: matching substructure to target scaffold. Curriculum objective: achieve a state of productivity by decomposing the target scaffold into simpler sequential substructures with gradually increasing structural complexity. Scoring function: matching sequential substructures.

Experiments 2 and 3, described in the following sections, instead use a single similarity-based curriculum objective (Tanimoto 2D similarity and ROCS 3D shape similarity to a reference ligand, respectively) to guide the agent towards a production objective that includes a molecular docking constraint. For experiments 2 and 3, we further define a 'low' (0.5) and a 'high' (0.8 for Tanimoto and 0.75 for ROCS) scenario denoting the minimum score the agent must achieve with the curriculum objective activated before proceeding to the production objective. The purpose of these scenarios is to investigate the effect of variable degrees of agent curriculum objective knowledge on compound sampling in the production phase and how it impacts the state of productivity.
Target scaffold construction. As an initial example, we show that CL can guide the agent to generate compounds possessing a relatively complex scaffold that is not present in the training set for the prior (Fig. 2). The dihydro-pyrazoloquinazoline scaffold was identified as a promising starting point for PDK1 inhibitor design owing to good cell permeability and low promiscuity, and studies kept the phenyl substituent constant 25. The goal is to generate compounds that possess this target scaffold. We first demonstrate that the task is too complex for standard policy-based RL and denote this as baseline RL (Supplementary Fig. 1). In the baseline experiment, the only component in the scoring function is the target scaffold. Each generated compound scores either 1.0 or 0.5, denoting whether the scaffold is present or not, respectively. The average score of the baseline experiment does not exceed 0.5 across 2,000 epochs, indicating that the scaffold is not found. Given that the scaffold is not present in the training set, the likelihood of sampling a compound possessing the scaffold is low, and the inability to do so prevents meaningful agent learning. It is worth noting that, provided unlimited time, baseline RL will almost surely find the scaffold owing to sampling stochasticity. On the other hand, CL can accelerate convergence by decomposing the target scaffold into simpler substructures with gradually increasing structural complexity (Fig. 2). There are five curriculum objectives, each assigned to successively more complex substructures with curriculum progression criterion thresholds of 0.8. The agent is tasked to generate compounds possessing each substructure until the average score is 0.8. When a curriculum progression criterion is satisfied, the successive and more complex curriculum objective is activated. A sharp decrease in average score accompanies each curriculum objective update, for example at approximately epoch 150 (Fig. 2), as it is unlikely that currently sampled compounds will also possess the successive substructure by chance. Over the course of training, the agent learns to generate compounds possessing increasingly complex substructures until the target scaffold is constructed.

Fig. 1 | In the curriculum phase, the agent progresses through successive curriculum objectives that gradually increase in complexity. The agent samples compounds in the SMILES format through an RL cycle such that the conditional probabilities are updated to maximize the reward obtained on the basis of a scoring function composed of the current curriculum objective 38. Curriculum progression criteria check for sufficient learning of each curriculum objective on the basis of a threshold that the agent must achieve. If and only if the final curriculum progression criterion is satisfied does the agent progress to the production phase, in which a scoring function comprised of the production objective is applied.
Satisfying a molecular docking constraint. In this section, we demonstrate that simple curricula, utilizing a single curriculum objective, can accelerate agent productivity and generate compounds that satisfy a docking constraint, that is, predicted to retain experimentally validated interactions (see Methods for experiment hyperparameters) 9,10,16-18 . Simulating a real-world application where limited computational resources must be allocated, baseline RL and CL performances are compared, given a maximum number (300) of permitted production epochs, that is, epochs that involve docking, as these are relatively computationally demanding. For CL, curriculum objectives are first applied to guide the agent and the number of permitted curriculum epochs is not limited, as these are computationally inexpensive (Supplementary Table 2). Angiolini et al. design PDK1 inhibitors by leveraging the dihydro-pyrazoloquinazoline scaffold, which forms two hydrogen-bonding interactions with Ala 162 (Fig. 3a) that are crucial for potency 25 . The structure-based optimization is mimicked by defining the following production objective.
Production objective: generate compounds that retain the two hydrogen-bonding interactions with Ala 162, possess enhanced predicted potency compared with the reference ligand (as assessed by LigPrep and Glide docking score) and are drug-like, as measured by the QED 26,29-33. The QED score also prevents compounds from exploiting the weaknesses of docking algorithms to achieve favourable scores; for example, the presence of excessive hydrogen-bond donors would substantially restrict membrane permeability 34.
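The production objective above combines a docking component with QED. In REINVENT-style scoring, raw component values are mapped to [0, 1] by score transformations and then aggregated. The sketch below uses a generic reverse sigmoid as a stand-in for the docking transformation (the exact transform used in this work, Supplementary Fig. 2, is not reproduced here, and the midpoint and steepness values are illustrative), with a weighted arithmetic mean mirroring the equal component weights reported in Methods.

```python
import math

def reverse_sigmoid(x, midpoint=-9.0, k=1.5):
    """Map a Glide docking score (more negative = better) to [0, 1].
    A generic reverse sigmoid with illustrative parameters only."""
    return 1.0 / (1.0 + math.exp(k * (x - midpoint)))

def total_score(docking_score, qed, weights=(1.0, 1.0)):
    """Weighted arithmetic mean of the transformed component scores."""
    components = [reverse_sigmoid(docking_score), qed]
    return sum(w * c for w, c in zip(weights, components)) / sum(weights)

# A pose docking better than the reference (-10.907 kcal/mol), high QED...
good = total_score(-11.5, qed=0.85)
# ...versus a clearly weaker pose with the same QED
poor = total_score(-6.0, qed=0.85)
```

The transformation direction matters: because lower Glide scores are better, the sigmoid must be reversed so that more negative docking scores map towards 1.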
First, we show that the production objective is challenging for baseline RL (Fig. 3b,c). The docking score is approximately 0 for the first 100 epochs, indicating that essentially no compounds sampled satisfy the docking constraint. For epochs 100-200, some compounds satisfy the docking constraint but the score is still low. It is only from epoch 200 onwards that the docking score begins a steep improvement and indicates the point at which the agent starts entering a state of productivity. It is evident that baseline RL is suboptimal, as the agent spends a substantial amount of time generating compounds that do not satisfy the production objective. It is worth noting, however, that the agent eventually converges given enough time, and achieves scores similar to those of CL ( Supplementary Fig. 5).
To circumvent the limitations of baseline RL, we devise curricula and introduce two curriculum objectives to guide the agent to productivity: Tanimoto (2D) and ROCS (3D) shape-based similarity to the reference ligand 27, 28 . In the former, the rationale is that, by teaching the agent to first generate compounds with 2D similarity to the reference ligand, subsequently generated compounds will have a greater likelihood of satisfying the docking constraint. The rationale for ROCS is identical except with 3D similarity to match the shape and electrostatics of the reference ligand. Triplicate baseline RL experiments with Tanimoto and ROCS components (using a scoring function comprised of Tanimoto/ROCS, docking and QED together, respectively) were conducted for a fair comparison with CL. These baseline experiments did not improve agent productivity, and training progress similar to that of the baseline shown in Fig. 3b,c is observed (Supplementary Figs. 6 and 7). For the low and high Tanimoto scenarios, the agent is immediately capable of generating compounds that satisfy the docking constraint (Fig. 3b). More specifically, although docking starts at a relatively low value (but higher than baseline RL) for the low Tanimoto experiment, the agent quickly improves over the first 50 epochs and continues to do so for the remainder of the experiment. In the high Tanimoto scenario, docking starts at a score that exceeds the maximum score achieved by the baseline RL agent over 300 epochs and maintains productivity. The results are intuitive, as enforcing the agent to first learn to generate compounds with higher 2D similarity to the reference ligand should increase the likelihood of satisfying the docking constraint. Similar observations are made when using ROCS as a curriculum objective (Fig. 3c). 
In both the low and high scenarios, docking starts more favourably than for baseline RL, but unlike the Tanimoto experiments the ROCS experiments start at a less favourable docking score. First, these results are not completely surprising, as training the agent to satisfy a 3D shape similarity objective will decrease the likelihood, relative to 2D similarity, of satisfying the docking constraint owing to more potential conformational discrepancies of the generated compounds compared with the reference ligand 35. Second, the agent still improves significantly over 100 and 50 epochs for the low and high ROCS scenarios, respectively. These results convincingly demonstrate that the improvement in CL performance over baseline RL is attributable to the sequential nature of the CL objectives as opposed to the presence of the additional curriculum objectives only.

Fig. 2 | We define a curriculum where the target dihydro-pyrazoloquinazoline scaffold with a phenyl substituent is decomposed into sequential simpler substructures (highlighted) to guide the agent. The score drops momentarily when a successive substructure objective is introduced, as it is unlikely that currently sampled compounds will also possess the new substructure by chance. By using CL, the agent is able to find the target scaffold within 1,750 epochs, while standard policy-based RL is unsuccessful in the same number of epochs (Supplementary Fig. 1).
To visualize the quality of the results, the binding poses of selected generated compounds are superimposed with the reference ligand (Fig. 3d). The binding poses retain the two hydrogen-bond interactions with Ala 162, as enforced by the docking constraint. Furthermore, the superimposed binding poses demonstrate excellent agreement with the reference ligand, supporting plausibility. Thus, we show that using Tanimoto (2D) and ROCS (3D) shape-based similarities to the reference ligand as curriculum objectives can guide the agent to satisfy a complex production objective, and the results demonstrate that CL outperforms baseline RL given the same number of production epochs. Moreover, tuning  the degree of curriculum objective optimization, as shown in the low and high scenarios, provides direct control in guiding the agent to productivity.
Curriculum objectives enhance objective optimization. To further investigate the output of the baseline RL and CL experiments, all docking scores of the collected compounds were pooled from the triplicate experiments and the resulting distributions are illustrated in Fig. 4. First, CL generates a significantly greater number of favourable compounds compared with baseline RL, as only those that pass a minimum score based on docking and QED are stored. This is consistent with Fig. 3b-d, where the baseline RL agent struggles for the first 150 epochs, predominantly sampling compounds that do not satisfy the production objective. Second, compounds generated by CL exhibit more favourable docking scores than those of baseline RL, on average. Third, for both curriculum objectives (Tanimoto and ROCS), the high scenario has a greater density of favourable docking scores compared with the low scenario.
To quantify this, the fraction of compounds collected that possess a docking score better than that of the reference ligand (−10.907 kcal mol−1) was calculated for each experiment (Supplementary Table 3). The task chosen resembles a potential real-world application, as the reference ligand is an experimentally validated nanomolar inhibitor 25. In all cases, CL generates more compounds that dock more favourably than the reference ligand compared with baseline RL: 2,941-9,068 more by absolute counts, or 12.42-23.79% more by percentage. Furthermore, for both curriculum objectives (Tanimoto and ROCS), the high scenario outperforms the low scenario (by 316-3,415 compounds, or −0.4-10.57%) at the same task. Thus, a single curriculum objective provides a tunable parameter that can enhance and control the degree to which the agent is able to satisfy a production objective.

To assess the diversity of the output, the number of unique scaffolds generated was also quantified (see Supplementary Table 4 for individual experiments) 36. It is evident that the CL experiments generate more unique scaffolds than baseline RL. This is expected from the training plots observed in Fig. 3b,c, where the baseline RL experiments generate essentially no favourable compounds in the first 100 epochs. Of the curriculum objectives, Tanimoto generates more unique scaffolds than ROCS. Similarly, high scenarios generate more unique scaffolds than low scenarios for both Tanimoto and ROCS. To assess the quality of the generated scaffolds, we denote scaffolds as 'favourable' if the corresponding compound possesses a more favourable docking score than the reference ligand. CL generates more unique favourable scaffolds than baseline RL by absolute counts and percentage (Fig. 5). This is in agreement with the docking score distributions in Fig. 4, which illustrate clear enrichment in docking scores for the CL experiments.
The results show that using curriculum objectives increases the number of favourable scaffold ideas generated and maintains agent exploration as enforced by the diversity filter (DF, Methods). Importantly, the unique scaffold count demonstrates the capability of CL to perform scaffold hopping (see Supplementary Figs. 10-12 for example compounds generated in the high Tanimoto experiment and Supplementary Figs. 13-15 for unique scaffold statistics for each replicate experiment).

Direct steering of agent policy allows trade-off between production objective optimization and solution space diversity.
To further elucidate the role of curriculum objectives and the extent to which the agent retains acquired knowledge in downstream production tasks, the collected compounds from the CL Tanimoto experiments were pooled and the average Tanimoto similarity (compound and scaffold) to the reference ligand calculated for each epoch (Fig. 6a). The left-hand subplot shows the gradual optimization of Tanimoto similarity for the low and high scenarios, representing the curriculum phase. The right-hand subplot shows the Tanimoto similarities for all the compounds collected in the production phase. In general, the compounds generated from the high Tanimoto experiments possess a greater Tanimoto similarity to the reference ligand than do those from the low Tanimoto experiments, as expected (see Supplementary Fig. 17 for the distribution of Tanimoto similarities). Moreover, the gradual decrease in Tanimoto similarity at the scaffold level further supports the capability of CL to perform scaffold hopping (Fig. 6a). Interestingly, however, the difference is not marked and can be explained by cross-referencing the training plots shown in Fig. 3b. The low Tanimoto experiments start at docking scores notably lower than those in the high Tanimoto scenario, which suggests that the compounds collected at the beginning are those that happen to exhibit high Tanimoto similarity to the reference ligand. Consequently, the low Tanimoto experiments generate less favourable compounds in the first 50 epochs when the production objective is activated (Supplementary Fig. 18).

Fig. 4 | The results shown consist of all the stored compounds from the 300 permitted production epochs with the production objective docking and QED. N in the x-axis labels is the number of compounds collected (those that exceed a total score encompassing docking and QED above a threshold) in each pooled violin plot. 'Baseline Tanimoto RL' and 'Baseline ROCS RL' refer to baseline RL using a scoring function composed of docking, QED and Tanimoto/ROCS, respectively. 'CL Tanimoto low' and 'CL Tanimoto high' refer to using Tanimoto similarity as a curriculum objective that requires achieving corresponding thresholds of 'low' (0.5) or 'high' (0.8) score before progressing to the production objective. Analogously, 'CL ROCS low' and 'CL ROCS high' denote the results based on using ROCS as a curriculum objective, where the low score is 0.5 and the high is 0.75. Lower Glide docking scores denote a greater predicted binding affinity. The docking score for the reference ligand is −10.907 kcal mol−1 and is shown by the horizontal black dashed line. Not only does CL collect more compounds than baseline RL, but the compounds also possess more favourable docking scores, on average.
To investigate the effect of similarity-based curriculum objectives on solution space diversity, cross-Tanimoto similarities between each unique compound pair in the pooled datasets were calculated to quantify how different the collected compounds are from each other (Methods). Relative to the baseline RL experiments, collected compounds in the CL experiments exhibit greater intraset similarity, interpreted as the agent sampling compounds from 'closer' areas in chemical space (Fig. 6b). Moreover, the high scenarios have a greater density of high cross-Tanimoto similarities than do the low scenarios. Uniform Manifold Approximation and Projection (UMAP) was used as a dimension reduction technique to visualize the solution space diversity of the CL Tanimoto experiments 37. There is notable similarity, though no overlap (as there is no scaffold overlap, Supplementary Fig. 14), between the compounds sampled in the low and high scenarios (Fig. 6c). The results suggest that moderate optimization of similarity-based curriculum objectives (as in the low scenarios) already substantially narrows the agent's perceived solution space, in agreement with the cross-Tanimoto similarity distributions shown in Fig. 6b. The similarity between the compounds generated in the low and high experiments was quantified by calculating the cross-Tanimoto similarity between the two datasets (Supplementary Fig. 20). The majority of cross-Tanimoto similarities are >0.7, confirming that the compounds generated in the two scenarios were sampled from areas close in chemical space (Fig. 6c). Taken together, the observations in this section suggest that devising a curriculum and using curriculum objectives to guide the agent to a production objective facilitates knowledge retention that is exploited to achieve a state of productivity.
However, there is an inverse relationship between using similarity-based curriculum objectives to enhance production objective optimization and intraset diversity, imposing a trade-off when using CL over baseline RL.

Conclusions
In this work, we build on the de novo molecular design platform REINVENT by adapting CL to accelerate agent convergence on complex MPO objectives 8. Relative to baseline RL, which may issue many non-productive calls to expensive descriptors, curricula consisting of even one curriculum objective can successfully guide the agent to achieve productivity in substantially reduced time. We demonstrate the application of CL on two production objectives: constructing a relatively complex scaffold and satisfying a molecular docking constraint. In the former, given the same number of epochs, CL successfully constructs the complex scaffold from simpler constituents while baseline RL is unsuccessful. In the second application example, using Tanimoto (2D) or ROCS (3D) shape similarity to the reference ligand as a curriculum objective guides the agent to areas of chemical space that satisfy the docking constraint 27,28. In contrast, baseline RL visibly struggles, spending many epochs generating unfavourable compounds. CL facilitates direct steering of agent policy towards a production objective by providing the ability to teach the agent specific knowledge. The results show that teaching the agent to optimize curriculum objectives to a greater degree can enhance the ability to satisfy a complex production objective, relative to baseline RL. However, optimizing similarity-based curriculum objectives to a greater degree leads to lower intraset diversity, as the agent generates compounds that are closer in chemical space. Future work will apply CL towards other pertinent drug discovery tasks. For example, Nicolaou and colleagues use an evolutionary algorithm to design compounds with good selectivity and present an interesting case study 20.

Fig. 5 | (See Supplementary Table 4 for individual experiment quantities.) 'Favourable unique scaffolds' denotes the scaffolds that possess a more favourable docking score than the reference ligand. The fraction of favourable scaffolds generated is shown as an annotated percentage.

Fig. 6 | a, Left: the curriculum phase, where the agent is taught to sample compounds with Tanimoto (2D) similarity to the reference ligand. Right: the production phase. In general, high Tanimoto experiments sample more compounds that possess greater similarity to the reference ligand (this is shown at the compound and scaffold levels). b, Cross-Tanimoto similarity for intraset diversity. The plot shows the pooled collected compounds (those that exceed a total score encompassing docking and QED above a threshold) from the triplicate experiments, in which the overall dataset was reduced in size by a factor of ten to decrease computation time. Relative to the baseline RL experiments, CL generates compounds with notably greater intraset similarity. The effect is more pronounced in the high scenarios compared with the low scenarios. c, CL (Tanimoto curriculum objective) UMAP. The top 3,000 compounds were extracted from each triplicate experiment. Overall, the low and high scenarios sample from areas close in chemical space, but generally distinct from baseline RL.

Methods

Curriculum phase. In the curriculum phase, the goal is for the agent to learn to generate compounds that satisfy sequential curriculum objectives with increasing complexity that guide the agent towards the production phase. O_C1, …, O_Cn−1, O_Cn are designated curriculum objectives with corresponding curriculum progression criteria P1, …, Pn−1, Pn, which enforce sufficient agent learning of each sequential curriculum objective on the basis of a score threshold. If the score threshold is met, the agent progresses to the next curriculum objective; otherwise, the agent continues learning the current curriculum objective. This process collectively constitutes the curriculum phase.

Production phase. If and only if the final curriculum progression criterion Pn is satisfied, the production objective O_P is activated. Presumably, the agent is in a state of productivity and samples compounds that satisfy the production objective. Balance between chemical space exploration and exploitation can be achieved by tuning hyperparameters (Methods). The agent samples for a predefined number of epochs, and all compounds that score above a minimum threshold are stored and outputted at the end.
REINVENT CL extension. The implementation of CL builds on the REINVENT generative model, which uses a recurrent neural network architecture 6,8 . The molecular design task is formulated as a natural language processing problem, where compounds are sampled in the SMILES format on the basis of conditional probabilities 38,39 . The recurrent neural network in this work features three hidden layers of 512 long short-term memory cells with an embedding size of 256 and a linear layer with softmax activation 8,40 . A prior generative model is first trained on the ChEMBL dataset to learn the SMILES syntax 8,38,41 . The agent is initialized as the prior and is then focused towards an MPO task via RL. For further details on REINVENT, see the work by Blaschke et al. 8 .
REINVENT's learning hyperparameters. The same hyperparameters were used for the baseline RL and CL experiments: batch size of 128, learning rate of 0.0001, sigma scalar factor of 128, all scoring function components' weights set to 1 and using the Adam optimizer 42 .
Agent exploration and exploitation. Balance between agent chemical space exploration and exploitation was achieved by using a DF, inception and learning thresholds (curriculum progression criteria). A DF enforces diverse results by defining buckets with limited size that track the number of compounds sampled possessing the same scaffold. Once a bucket is full, further sampling of compounds with the same scaffold will be penalized 8,43 . Only compounds that exceed a user-defined total score (based on the score contributions of each component in the scoring function defined) are stored and added to the corresponding bucket. The specific total score threshold used in this work was 0.4. Inception is a form of experience replay to mitigate catastrophic forgetting and can speed up convergence by replaying previously sampled favourable compounds to the agent 8,44 . For further details on REINVENT, see the work by Blaschke et al. 8 . In the baseline RL experiments, an identical Murcko scaffold DF (penalizes the agent if the same Bemis-Murcko scaffold is sampled beyond the bucket size) and inception were applied. In contrast, the implementation of CL in REINVENT allows one to initialize separate DFs and inception for the curriculum phase and production phase. During the curriculum phase, the goal is for the agent to acquire intermediate knowledge. Thus, no DF was applied as it can be counterproductive to guiding the agent to favourable areas of chemical space. In the production phase, a new inception (previous favourable compounds during the curriculum phase cleared) was initialized. Presumably, the agent is in a state of productivity and samples compounds that satisfy the production objective 8,43 . Thus, an identical Murcko scaffold DF with a bucket size of 25 was applied to encourage exploration, such that the agent samples from different local minima 8,36 . 
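The bucket mechanics of the DF described above can be sketched as follows. The class name is illustrative, and the scaffold extractor is left as a pluggable callable because Bemis-Murcko scaffold extraction requires a cheminformatics toolkit (for example, RDKit); a toy stand-in is used here.

```python
from collections import defaultdict

class ScaffoldDiversityFilter:
    """Sketch of an identical Murcko scaffold DF: each scaffold owns a
    bucket of limited size; once the bucket is full, further compounds
    with the same scaffold are penalized with a zero score."""
    def __init__(self, scaffold_fn, bucket_size=25, score_threshold=0.4):
        self.scaffold_fn = scaffold_fn          # e.g. an RDKit Murcko scaffold extractor
        self.bucket_size = bucket_size
        self.score_threshold = score_threshold
        self.buckets = defaultdict(list)

    def score(self, smiles, raw_score):
        if raw_score < self.score_threshold:
            return raw_score                    # not stored in any bucket
        scaffold = self.scaffold_fn(smiles)
        if len(self.buckets[scaffold]) >= self.bucket_size:
            return 0.0                          # bucket full: penalize repeats
        self.buckets[scaffold].append(smiles)
        return raw_score

# Toy scaffold function: the first character stands in for a real scaffold
df = ScaffoldDiversityFilter(scaffold_fn=lambda s: s[0], bucket_size=2)
```

Because only compounds above the total score threshold are added to buckets, low-scoring compounds never consume bucket capacity; the penalty applies only to repeated high-scoring scaffolds, which is what pushes the agent towards different local minima.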
Finally, different curriculum progression criteria were applied in experiments 2 and 3 via the low and high scenarios to investigate the effect of curriculum objective optimization and agent exploration and exploitation. The results show that optimizing similarity-based curriculum objectives to a greater degree can decrease intraset diversity. Thus, curriculum progression criteria provide an additional control mechanism for result diversity.
ROCS 3D shape similarity. ROCS is a 3D shape similarity metric comprised of two components: 'shape' and 'colour'. The components quantify the match, if any, between the volumes occupied and the defined pharmacophoric features of the two ligands, respectively 27,28. Compounds with similar 'shape' and 'colour' are more likely to exhibit similar properties. The implementation of ROCS in REINVENT is described in detail by Papadopoulos et al. 45. In the CL experiments, the hyperparameters used for ROCS were 1:1 shape:colour, giving equal weighting to each component in the final ROCS similarity score.
Molecular docking constraint experiments. The PDK1 receptor crystal structure was obtained from the Protein Data Bank with Protein Data Bank identifier 2XCH (ref. 25). A receptor grid was generated in the Maestro graphical user interface with two hydrogen-bonding constraints specified between the reference ligand and Ala 162 (ref. 46). Ligand preparation and docking were performed using DockStream, which is integrated with REINVENT, facilitating parallelization over numerous CPU cores 47. 3D coordinates for all agent-sampled compounds from the baseline RL and CL experiments were generated using LigPrep. Default parameters were used except for the pH tolerance range, set to 7.0 ± 1.0 with Epik, and a maximum of two stereoisomers kept per compound 29. Glide docking used standard precision with the following settings: allow only amide trans isomers, allow up to 25 poses for post-docking minimization, apply strain correction and apply enhanced sampling with a factor of 2 (refs. 30-33). All baseline RL and CL experiments were allowed 300 production epochs, that is, epochs that involve docking, for a reasonable allocation of computational resources and a fair comparison between baseline RL and CL. The docking score transformation was chosen to encourage agent sampling of compounds that possess a more favourable docking score than the reference ligand (Supplementary Fig. 2).
Cross-Tanimoto similarity. The cross-Tanimoto similarity is calculated as the Tanimoto similarity for each unique compound pair in a dataset. Note that the compound pairs ' AB' and 'BA' are the same, and hence only calculated once.
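The cross-Tanimoto calculation can be sketched directly from this definition. The snippet below represents each fingerprint as a set of on-bit indices (toy data; in practice one would use, for example, Morgan fingerprints from a cheminformatics toolkit) and enumerates each unordered pair exactly once.

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit
    indices: |A intersect B| / |A union B|."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def cross_tanimoto(fingerprints):
    """Similarity for each unique unordered pair; 'AB' and 'BA' are the
    same pair, so each is computed only once."""
    return [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]

# Three toy fingerprints -> three unique pairs
fps = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]
sims = cross_tanimoto(fps)
```

Using `itertools.combinations` enforces the AB == BA convention stated above, giving n(n−1)/2 similarities for n compounds.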

Data availability
The trained generative model to reproduce the experiments in this work is provided at https://github.com/MolecularAI/ReinventCommunity/blob/master/notebooks/models/random.prior.new. The raw data that support the findings of this study are available from the corresponding author upon request.