Hard-Threshold Neural Network Based Prediction of Organic Synthetic Outcomes

Retrosynthetic analysis is the canonical technique for planning the synthesis routes of organic molecules in drug discovery and development. In this technique, screening the branches of the synthetic tree requires accurate forward reaction prediction, but existing software is still far from completing this step independently. Previous studies have applied neural networks to forward reaction prediction, but the accuracy is not yet satisfactory. Using the Edit Vector based description and Extended-Connectivity Fingerprints to transform reactions into vectors, the present work focuses on improving the neural network itself to advance template-based forward reaction prediction. Hard-threshold activation and a target propagation algorithm are implemented by introducing mixed convex-combinatorial optimization, and comparative tests are conducted to explore the optimal hyperparameter set. Using 15,000 experimental reactions extracted from granted United States patents, the proposed hard-threshold neural network is systematically trained and tested. The results demonstrate a higher prediction accuracy than that of a traditional neural network trained with backpropagation. Some successfully predicted reaction examples are also briefly illustrated.

Extensive pharma R&D activities are therefore in great demand, and a relatively short discovery-development-deployment cycle is desirable [1].
Over the past decades, innovations in organic synthesis, once a rate-limiting factor, have significantly enabled the discovery and development of important life-changing medicines, improving the health of patients worldwide [2,3]. Looking ahead, innovations and excellence in organic synthesis are expected to remain the most powerful driver for all phases of drug discovery and development. Very recently, chemists have enthusiastically applied advanced machine learning and artificial intelligence (AI) technologies to the preparation of drug molecules, for example, AI-driven discovery of drug molecules [4][5][6], automated planning of synthetic routes [7][8][9], machine-learning-driven optimization of reaction conditions [10][11][12], and autonomous assembly of synthetic processes [13][14][15].
Synthesis planning, regarded as one of the central elements of organic synthesis, can be traced back to the 1960s [16]. Traditional computer-aided approaches to synthesis planning suffer from disadvantages such as low efficiency, poor repeatability and high experimental cost. Meanwhile, more and more compounds and reactions are being discovered; for example, the numbers of reactions and compounds contained in the Reaxys database have exceeded 40 million and 100 million, respectively. Novel synthesis planning approaches are therefore urgently needed. During the past three years, a variety of machine learning and artificial intelligence methods, such as random forests, automated reasoning, support vector machines and, more recently, deep learning, have demonstrated their capacity for the discovery, design and production of organic molecules [13,[17][18][19]]. Clearly, applying these new technologies to end-to-end discovery and development of organic molecules will be the key to realizing fully automated synthesis planning [20].
Retrosynthetic analysis is the canonical technique used to plan the synthesis of small organic molecules for drug discovery [21]. In general, retrosynthetic analysis consists of four steps: (1) determine the target compound; (2) as the reverse of a reaction, disconnect certain bonds in the target compound that are thought to be easy to form according to known chemical knowledge, and thereby search for possible precursors; (3) repeat step 2 for all precursors to build a synthesis tree, expanding it until all precursors are available; (4) evaluate all branches of the synthetic tree one by one and take the most promising one as the optimal route. Because the same group of precursors may react at different groups of reaction sites, i.e., the real reactions may differ from the expected ones in the branches, step 4 is critical but also the most difficult. Accurate forward reaction prediction is therefore essential to guarantee the correct evaluation of each branch. The main aim of this paper is to present machine learning approaches for assisting forward reaction prediction.
Existing approaches to forward reaction prediction can be categorized as template-based and template-free methods. For example, Coley et al. [17] applied reaction templates to reactants to generate as many candidate reactions as possible, which were then used to train a neural network. Among template-free methods, Schwaller et al. [22] likened chemical reactions, from reactants to products, to translations from one language to another, so that forward reaction prediction can be cast as machine translation and solved by training a sequence-to-sequence recurrent neural network that directly takes reactant SMILES (Simplified Molecular Input Line Entry Specification) strings as input.
Since the research target is to discover new reactions meaningful for organic synthesis according to existing reaction mechanisms, and to fully reuse the reaction rules summarized from experimental results so far, this paper adopts the template-based approach. Specifically, it focuses on a novel hard-threshold deep neural network to improve the accuracy of template-based forward reaction prediction. In detail, hard-threshold activation and a target propagation algorithm are implemented by introducing mixed convex-combinatorial optimization, and comparative tests are conducted to explore the optimal hyperparameter set. The remainder of the paper is organized as follows: forward reaction prediction is described first, then the hard-threshold neural network is introduced; results and discussion are then given; the conclusion is presented last.

Forward reaction prediction
The above reaction can be decomposed into three Edits: atom 1 (nitrogen) loses a hydrogen, atoms 2 (carbon) and 3 (chlorine) lose the single bond between them, and atoms 1 and 3 gain a single bond. In the reactant, atom 1 has one hydrogen and two non-hydrogen neighbors, atom 2 has no hydrogen but three non-hydrogen neighbors, and atom 3 has one non-hydrogen neighbor and no hydrogen. Therefore, the feature vectors of atoms 1, 2 and 3, together with those of the two changed bonds, are assembled into the Edit Vector, where "/" represents the connection (concatenation) of feature vectors and the outermost brackets indicate that the Edit Vector is the combination of all the basic-Edit vectors in a reaction. For instance, if one reaction has four atoms losing a hydrogen in the reactant, its Edit Vector of hydrogen loss (e_1) will contain four 6D feature vectors.
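The assembly described above can be sketched in a few lines of Python. This is a minimal illustration only: the 6D feature values below are hypothetical placeholders, not the actual encoding used in the paper.

```python
# Illustrative sketch of assembling an Edit Vector from per-atom feature
# vectors. The "/" concatenation from the text is modeled by `concat`;
# the concrete 6-D feature values here are hypothetical placeholders.

def concat(*vectors):
    """The "/" operation in the text: connect feature vectors end to end."""
    out = []
    for v in vectors:
        out.extend(v)
    return out

# Hypothetical 6-D feature vectors for the three reacting atoms.
atom1 = [7, 1, 2, 0, 0, 0]   # nitrogen: 1 H, 2 heavy neighbours (toy encoding)
atom2 = [6, 0, 3, 0, 0, 0]   # carbon:   0 H, 3 heavy neighbours
atom3 = [17, 0, 1, 0, 0, 0]  # chlorine: 0 H, 1 heavy neighbour

# e1: hydrogen loss -> one 6-D vector per atom losing a hydrogen.
e1 = [atom1]
# bond loss -> concatenated features of the two atoms losing the bond.
e_bond_loss = [concat(atom2, atom3)]
# bond gain -> concatenated features of the two atoms gaining the bond.
e_bond_gain = [concat(atom1, atom3)]

# The outermost brackets: each Edit type collects all its basic-Edit vectors.
edit_vector = {"h_loss": e1, "bond_loss": e_bond_loss, "bond_gain": e_bond_gain}
```

A reaction with four hydrogen-losing atoms would simply place four such 6D vectors in the `h_loss` list.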

Candidate reaction selection
The selection step uses a composite neural network consisting of several subnetworks, as shown in Fig.3. For each candidate reaction, the four Edit Vectors mentioned above are calculated and served as inputs to four corresponding subnetworks. The sum of the outputs of the four subnetworks is then fed to the lower integrating subnetwork to produce a scalar probability score. These steps are repeated for all candidate reactions, and all the scores are normalized with the softmax function to estimate the probability of occurrence of each candidate reaction. Finally, the candidate reactions are sorted by probability score, and the top-ranked candidate is taken as the prediction. The prediction is counted as correct if its outcome has the same SMILES string as the recorded product, and incorrect otherwise.
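The score-normalize-rank loop above can be sketched compactly. The scoring function here is a stand-in for the subnetwork ensemble; only the softmax ranking logic is faithful to the text.

```python
import math

def softmax(scores):
    # Normalize candidate scores into probabilities; subtracting the
    # maximum keeps the exponentials numerically stable.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rank_candidates(candidates, score_fn):
    """Score every candidate reaction, normalize with softmax, and return
    (candidate, probability) pairs sorted by descending probability."""
    scores = [score_fn(c) for c in candidates]
    probs = softmax(scores)
    return sorted(zip(candidates, probs), key=lambda cp: -cp[1])

# Toy usage with three hypothetical candidates and made-up network scores.
ranked = rank_candidates(["cand_A", "cand_B", "cand_C"],
                         {"cand_A": 2.0, "cand_B": 0.5, "cand_C": -1.0}.get)
top_candidate, top_prob = ranked[0]
```

The top-ranked candidate is the model's prediction; comparing its product SMILES against the recorded one decides correctness.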
In order to improve the prediction accuracy, a hybrid model using the Edit Vector and the Extended-Connectivity Fingerprint (ECFP) [23] is also considered in our work.
The only difference between the two models is that an extra subnetwork without a hidden layer, which evaluates the ECFP, is added to the hybrid model, as shown in the middle of Fig.3. The output of the ECFP subnetwork is multiplied by ε when the subnetwork outputs are summed before being fed to the final integrating subnetwork, where ε is the mixing factor. By adjusting ε, the contribution of the ECFP subnetwork can be precisely controlled.
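The role of the mixing factor can be stated in one line; the function below is a trivial sketch of the weighted sum described above, with placeholder scalar outputs standing in for the real subnetwork outputs.

```python
def hybrid_score(edit_outputs, ecfp_output, eps):
    # Sum the four Edit-Vector subnetwork outputs and add the ECFP
    # subnetwork output scaled by the mixing factor eps. With eps = 0 the
    # hybrid model reduces to the pure Edit Vector based model.
    return sum(edit_outputs) + eps * ecfp_output
```

For example, with hypothetical subnetwork outputs `[1.0, 2.0, 3.0, 4.0]`, an ECFP output of `10.0`, and ε = 0.5, the integrating subnetwork receives 15.0.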

Hard-threshold neural network
Neural networks, especially deep neural networks, are currently the most popular machine learning models owing to their powerful fitting capability [24]. However, as the size of a neural network grows, problems such as gradient vanishing and gradient explosion often occur. To escape this dilemma, this paper applies a hard-threshold neural network to predicting the outcomes of organic synthesis.

Constructing hard-threshold neural network
"Hard-threshold neural network" means a neural network with hard-threshold activation, including the Step activation and the Staircase activation shown in Figs 4a and 4b, respectively. The staircase activation is simply a sum of step activations. In fact, hard-threshold activation was used in early work on binary classification, before modern multilayer neural networks were proposed. Its derivative is zero almost everywhere, which effectively avoids gradient vanishing and gradient explosion. Besides, the scale of its output is almost fixed and insensitive to the scale of the input, which helps avoid certain abnormal propagation and simplifies computation. However, the zero derivative also prevents such a network from being trained with traditional backpropagation.
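The two activations can be written down directly; the staircase below uses three hypothetical thresholds, whereas the figures in the paper may use different ones.

```python
def step(z, threshold=0.0):
    # Step activation: +1 above the threshold, -1 otherwise.
    # Its derivative is 0 everywhere except at the threshold itself.
    return 1.0 if z > threshold else -1.0

def staircase(z, thresholds=(-1.0, 0.0, 1.0)):
    # Staircase activation: a sum of step activations, giving a
    # piecewise-constant output whose range is fixed by the number of
    # thresholds, regardless of the input scale.
    return sum(step(z, t) for t in thresholds)
```

Note how `staircase` saturates at ±3 for these thresholds no matter how large |z| grows, which is exactly the scale-insensitivity mentioned above.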

Target propagation algorithm
A new training algorithm is required for hard-threshold neural networks, one that bypasses the zero derivative of the hard-threshold activation.
Based on the concept of "target propagation" [25], a target propagation algorithm named FTPROP-MB was recently proposed [26]. The basic idea is that, since a perceptron with Step activation is trainable, a hard-threshold neural network should also be trainable if it can be decomposed into perceptrons. Specifically, for every hard-threshold activation layer a target vector t_d is introduced to represent what the d-th layer is supposed to output. After the normal forward propagation through each layer, FTPROP-MB first determines t_d and then introduces a layer loss L_d, whose gradient is computed just as when training a perceptron, so that the weights can be updated.
The output of the hard-threshold activation is a set of discrete values h_dj ∈ {-1, +1}. When r(h_dj) = -1, i.e., the partial derivative of L_{d+1} with respect to h_dj is positive, L_{d+1} will clearly increase by setting h_dj = +1 while fixing the other components of h_d, so there is no doubt that t_dj = r(h_dj) = -1. On the other hand, when r(h_dj) = +1, i.e., the partial derivative of L_{d+1} is negative, we do not know exactly whether L_{d+1} would increase or decrease by setting h_dj = +1. However, the disagreement between h_dj and r(h_dj) indicates that the current value of h_dj lacks confidence, so a natural choice is to drive the pre-activation z_dj toward 0 by adjusting t_dj, making it easier for h_dj to flip. In summary, training an n-layer hard-threshold neural network involves both a continuous optimization problem over the weights and a combinatorial optimization problem over the target vectors, which together form a mixed convex-combinatorial optimization problem. The block diagram of the target propagation algorithm is shown in Fig.5.
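The per-component target heuristic can be sketched as follows. This is a simplified reading of the case analysis above, assuming g is the partial derivative of L_{d+1} with respect to h_dj and r = -sign(g); the return value 0.0 stands for "pull z_dj toward 0 so the unit can flip". The real FTPROP-MB formulation in [26] is more refined.

```python
# Minimal sketch of the target-setting heuristic described in the text.
# Assumptions (not taken verbatim from the paper): g is dL_{d+1}/dh_dj,
# r(h_dj) = -sign(g) is the sign the next layer's loss prefers, and a
# target of 0.0 encodes "drive the pre-activation z_dj toward 0".

def sign(x):
    return 1.0 if x > 0 else -1.0

def set_target(h, g):
    r = -sign(g)  # sign preferred by the next layer's loss
    if r == -1.0:
        # Positive gradient: raising h would increase L_{d+1}, so the
        # target is unambiguously -1.
        return -1.0
    if h == r:
        # The unit already agrees with the preferred sign: keep it.
        return h
    # Disagreement under low confidence: pull z_dj toward 0 to ease a flip.
    return 0.0
```

Setting targets this way for every component of every hidden layer yields the combinatorial half of the mixed convex-combinatorial problem; the weights are then updated against the layer losses as usual.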

Layer loss function
There remains the problem of choosing the layer loss function. Following related work [27], we adopt the soft hinge loss, weighted according to the incoming gradient, as shown in Fig.6.
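One concrete instance of such a loss can be sketched as below. The softplus-smoothed hinge is an assumption on our part, chosen only to illustrate "a hinge with a smooth corner"; the exact functional form and weighting in [27] may differ.

```python
import math

def soft_hinge(z, t):
    # Smoothed hinge between pre-activation z and target t in {-1, +1}:
    # softplus(1 - t*z) behaves like the hinge max(0, 1 - t*z) but has a
    # smooth corner, so its gradient is defined everywhere. This exact
    # smoothing is an illustrative assumption, not taken from [27].
    return math.log1p(math.exp(1.0 - t * z))

def weighted_layer_loss(zs, ts, gs):
    # Per-component soft hinge, weighted by the magnitude of the incoming
    # gradient g so that components the next layer cares about dominate.
    return sum(abs(g) * soft_hinge(z, t) for z, t, g in zip(zs, ts, gs))
```

Components whose incoming gradient is zero contribute nothing, while components with large |g| are pushed hard toward their targets.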

Preparing reaction database
The reaction data sets were originally extracted from the 1976-2013 USPTO data set compiled by Lowe [28]. Based on popular template sets [29,30], the extracted data were reduced to 15,000 groups of reactants corresponding to 15,000 real reactions. Approximately 5 million candidate reactions, including real and fake ones, were then generated by the aforementioned data augmentation strategy and stored in MongoDB.

Structure of Edit Vector
The atom features used in this paper are much more complex than in the simplified example illustrated above, while the bond features are the same. The specific structure is shown in Tab.1.

Structure of ECFP for hybrid model
Molecular fingerprints are also a common method for vectorizing molecules. A fingerprint is usually a 0-1 vector of adjustable dimension, equivalent to a hash of the molecule. The Extended-Connectivity Fingerprint (ECFP) proposed by Rogers et al. [23] in 2010 is a circular fingerprint based on the Morgan algorithm and has become a de facto industry standard. In this paper, ECFPs of the reactants and products, with a radius of 2 and a dimension of 1024, are used as a supplement to the Edit Vector to construct the hybrid model; the radius and dimension follow convention.
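The circular-fingerprint idea can be conveyed with a toy implementation: iteratively hash each atom's identifier together with its neighbours' identifiers, folding every identifier into a fixed-length bit vector. This is a didactic stand-in, not the actual Rogers-Hahn algorithm; in practice one would compute ECFPs with a cheminformatics toolkit such as RDKit.

```python
# Toy circular-fingerprint sketch in the spirit of ECFP. The molecule is
# given as a list of atom symbols plus an adjacency list; real ECFP uses
# richer atom invariants and duplicate-structure removal.

def toy_ecfp(atoms, bonds, radius=2, n_bits=1024):
    # Initial identifiers: a hash of each atom's symbol.
    ids = [hash(a) for a in atoms]
    bits = [0] * n_bits
    for i in ids:
        bits[i % n_bits] = 1  # fold identifier into the bit vector
    for _ in range(radius):
        # Update each atom's identifier from its neighbourhood, sorting
        # neighbour identifiers so the hash is order-independent.
        ids = [hash((ids[i], tuple(sorted(ids[j] for j in bonds[i]))))
               for i in range(len(atoms))]
        for i in ids:
            bits[i % n_bits] = 1
    return bits

# Toy usage: a three-atom chain C-N-O with radius 2 and 1024 bits,
# matching the dimensions used in the paper.
fp = toy_ecfp(["C", "N", "O"], [[1], [0, 2], [1]])
```

With radius 2 each bit reflects a substructure of up to two bonds around some atom, which is why the resulting vector captures local connectivity.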

Building training platform
PyTorch is one of the most popular deep learning frameworks, and its high customizability makes it convenient to implement all kinds of "non-standard" backpropagation algorithms. Therefore, all experiments in this paper were conducted with PyTorch on an NVIDIA GeForce GTX 1070 GPU.
The dataset containing 15,000 groups of reactants (5 million candidate reactions) is divided by convention: the latter 20% (3,000 groups with 1 million candidate reactions) forms the test set; 12.5% of the former 12,000 groups is randomly taken as the validation set (1,500 groups with 0.5 million candidate reactions); the remaining 10,500 groups (3.5 million candidate reactions) constitute the training set.
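The split arithmetic above can be checked with a short sketch (the random seed is an arbitrary choice for reproducibility, not taken from the paper):

```python
import random

def split_dataset(groups, seed=0):
    # Last 20% of the groups form the test set.
    n = len(groups)
    cut = int(0.8 * n)
    test = groups[cut:]
    former = groups[:cut]
    # 12.5% of the former groups, drawn at random, form the validation set.
    rng = random.Random(seed)
    val_idx = set(rng.sample(range(len(former)), int(0.125 * len(former))))
    valid = [g for i, g in enumerate(former) if i in val_idx]
    train = [g for i, g in enumerate(former) if i not in val_idx]
    return train, valid, test

train, valid, test = split_dataset(list(range(15000)))
# 15000 groups -> 10500 training, 1500 validation, 3000 test
```

Since 0.8 × 15,000 = 12,000 and 0.125 × 12,000 = 1,500, the three subsets come out to exactly 10,500 / 1,500 / 3,000 groups.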

Results and Discussion
Extensive observations of the prediction accuracy show that the model tends to stabilize after 100 epochs, and that the Adam optimizer makes training more stable than AdaDelta. The following results and discussion are therefore based on experiments performed with the Adam optimizer.

Edit Vector based model
The core of the hard-threshold neural network is its activation, so its influence on the model is examined first. The results are shown in Tab.2, where "(Soft/Hard)" in the first column marks the kind of activation. The prediction accuracy cannot be significantly improved by deepening or widening the subnetworks, so we conclude that the 200/100/50 subnetwork structure is already complex enough for this task. In other words, what limits the accuracy is overfitting rather than underfitting, and more hidden nodes will only disturb training. The following experiments therefore focus on how to avoid overfitting with the original subnetwork structure.

Tab.2 Influence of hard-threshold activation on Edit Vector based model
Dropout is a common and convenient strategy for avoiding overfitting [31]. The original idea is simple: in each forward propagation step, some outputs of hidden nodes are forced to zero, which prevents hidden nodes from connecting to incorrect partners and thus avoids overfitting. Similar to regularization, the principle of dropout is to reduce the number of non-zero parameters in the model. However, two additional advantages of dropout should be emphasized: first, the meaning of the dropout rate is relatively intuitive, so the proportion of outputs set to zero can be adjusted directly; second, unlike regularization, which penalizes all non-zero parameters, dropout zeroes outputs randomly during each forward propagation in training, which improves the robustness of the model. Note that the dropout rate must be set carefully: a rate that is too low cannot avoid overfitting, while one that is too high leads to underfitting. Compared with the model of the published literature [17], more than 120 additional reactions can be correctly predicted.
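The mechanism can be sketched in a few lines of plain Python. This is the common "inverted dropout" formulation, which we assume here for illustration; frameworks such as PyTorch implement the same idea internally.

```python
import random

def dropout(activations, rate, training=True, rng=None):
    # Inverted dropout: during training, zero each hidden output with
    # probability `rate` and rescale the survivors by 1/(1 - rate), so
    # the expected output is unchanged and no rescaling is needed at
    # inference time.
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

With `rate=0.5`, roughly half of the hidden outputs are zeroed on each forward pass, and a different random half each time, which is what breaks the co-adaptation of hidden nodes.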
Moreover, the training and validation histories of our model and of the model from the published literature [17] are shown in Fig.7. The hard-threshold neural network evidently has the potential to reach higher prediction accuracy while reducing the instability of the training process.
As for training efficiency, the above experiments show that the hard-threshold neural network and the corresponding optimization algorithm do not require much extra computing resource during training. Training takes about 13-14 hours for 100 epochs, very similar to traditional neural networks.

Hybrid model
As mentioned in the section "Candidate reaction selection", the mixing factor ε determines the proportion of the ECFP subnetwork's output in the sum fed to the integrating subnetwork, which directly determines how much the model relies on ECFP. According to the results in the section "Edit Vector based model", a more complex subnetwork does not help prediction, so only the influence of the mixing factor and the dropout rate is examined for the hybrid model; the results are shown in Tab.5.
Tab.5 Effect of mixing factor ε and dropout rate on hybrid model

Prediction examples
Here we illustrate two important reactions that our proposed model predicts successfully but the model in the literature fails on.
The reaction in Fig.8 is taken from the synthesis route of certain substituted 3,4-diarylpyrazole compounds, which modulate the activity of protein kinases [32]. The reaction in Fig.9 is extracted from the synthesis route of novel P2X3 receptor antagonists, which play a critical role in treating disease states associated with pain, in particular peripheral pain, inflammatory pain, or tissue-injury pain [33]. For this reaction, since the hydrochloric acid-pyridine condition is only weakly acidic, the imine hydroxyl group on the product should not dehydrate to form a cyano group. Our proposed model assigns a probability of 70.1% to the true product, whereas the model of the published literature assigns a probability of 47.1% to the true product and 48.5% to the wrong product.

Conclusion
In this paper, we implemented a vectorized description of reactions using the Edit Vector and ECFP, and thereby applied a hard-threshold neural network with a target propagation algorithm to template-based forward reaction prediction.
For the pure Edit Vector based model, the prediction accuracy reached 72.7%, higher than the published accuracy of 68.5%. We also found that the prediction accuracy benefits from the use of ECFP with a proper mixing factor. Although the implemented hard-threshold neural network, whose hyperparameters were adjusted heuristically, improved the prediction accuracy by only 4.2%, it provides a new alternative approach to computer-aided template-based forward reaction prediction in organic synthesis for drug discovery. An automatic approach to adjusting the hyperparameters for further improving the prediction accuracy is under investigation, and novel approaches to describing reactions for prediction purposes are also under consideration.
Abbreviations
SMILES: Simplified Molecular Input Line Entry Specification

Figure 1 The illustration of the template-based forward reaction prediction process. Reaction between chloroacetyl chloride and 2-methylamino-5-chlorobenzophenone.

Figure 3 The illustration of the selection process.

Figure 6 Weighted soft hinge layer loss function.

Figure 7 Training and validation accuracy history of the original model and the optimized model.

Figure 9 Amination of hydrazines with aromatic aldehydes under hydrochloric acid-pyridine conditions.