By observing the prediction-accuracy history on the validation set in preliminary tests, we found that the model tends to stabilize after 100 epochs and that the Adam optimizer makes training more stable than AdaDelta. Therefore, the following experiments are all performed under these two conditions.
Edit Vector based model
The core of the hard-threshold neural network is its activation, so its influence on the model is examined first. The results are shown in Tab.2, where "(Soft/Hard)" in the first column indicates whether the function has a soft or hard threshold, and the prefix of a Staircase activation gives its order, i.e., the number of "stairs" in its graph.
Tab.2 Influence of hard-threshold activation on Edit Vector based model
Activation | Train Acc | Val Acc | Test Acc
Tanh(Soft) | 80.0% | 71.1% | 70.0%
Step(Hard) | 72.7% | 71.5% | 69.1%
3-Staircase(Hard) | 76.4% | 69.3% | 69.8%
5-Staircase(Hard) | 78.0% | 68.9% | 68.8%
7-Staircase(Hard) | 80.1% | 70.0% | 71.2%
10-Staircase(Hard) | 80.0% | 69.8% | 70.8%
As Tab.2 shows, although the plain Step activation performs even worse than the Tanh activation, the prediction accuracy gradually improves as the order of the hard-threshold activation increases. When the order reaches 7, the Staircase activation achieves a higher prediction accuracy than the traditional activation, indicating that target propagation does offer an advantage. However, the prediction accuracy decreases again when the order is further increased to 10, suggesting that an order of 10 is unnecessary and 7 is appropriate; the following experiments are therefore performed with the 7-Staircase activation.
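The n-order staircase activation described above can be sketched as follows. This is a minimal illustration, not the paper's exact definition: the function name `staircase`, the output range [-1, 1], and the placement of the stairs at bin midpoints are all assumptions.

```python
import numpy as np

def staircase(x, order=7, low=-1.0, high=1.0):
    """Hypothetical hard-threshold staircase activation.

    Quantizes the input into `order` discrete output levels between
    `low` and `high`, producing `order` "stairs" in its graph. With
    order=1 this degenerates to a step function; as the order grows,
    the shape approaches a clipped linear (Tanh-like) curve.
    """
    x = np.clip(x, low, high)
    width = (high - low) / order
    # Index of the stair each input falls on (last bin closed on the right).
    idx = np.minimum(np.floor((x - low) / width), order - 1)
    # Map the stair index back to the midpoint output level.
    return low + (idx + 0.5) * width
```

Because the derivative of such a function is zero almost everywhere, ordinary backpropagation cannot train it directly, which is why target propagation is used instead.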
Tab.3 Influence of subnetwork structure on Edit Vector based model
Subnetwork structure | Train Acc | Val Acc | Test Acc
200/100/50 | 80.1% | 70.0% | 71.2%
250/100/50 | 80.0% | 72.0% | 71.3%
250/125/50 | 78.8% | 71.7% | 70.9%
300/100/50 | 78.7% | 69.8% | 69.8%
100/100/50/50/50 | 76.0% | 70.4% | 69.9%
100/100/50/50/50 | 74.7% | 70.1% | 69.5%
Furthermore, test results with respect to the subnetwork structure are shown in Tab.3. The prediction accuracy cannot be significantly improved by deepening or widening the subnetworks; sometimes it even decreases, which means that the 200/100/50 subnetwork structure is already complex enough for this task. In other words, what limits the accuracy is overfitting rather than underfitting, and additional hidden nodes only disturb training. Therefore, the following experiments focus on how to avoid overfitting with the original subnetwork structure.
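The structure labels in Tab.3 denote hidden-layer widths. A minimal forward-pass sketch for such a subnetwork is shown below; the input dimension (256), the scalar output, the tanh hidden activation, and the helper names `make_subnetwork`/`forward` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def make_subnetwork(sizes, rng, in_dim=256, out_dim=1):
    """Build weight/bias pairs for an MLP whose hidden widths are
    given by `sizes`, e.g. (200, 100, 50) as in Tab.3."""
    dims = [in_dim, *sizes, out_dim]
    return [(rng.standard_normal((i, o)) * np.sqrt(2.0 / i), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def forward(params, x):
    """Forward pass with tanh hidden layers (the soft baseline) and a
    linear output layer producing one score per example."""
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b
```

A deeper variant such as 100/100/50/50/50 is obtained simply by passing a longer `sizes` tuple; as Tab.3 shows, this extra capacity does not translate into higher accuracy here.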
Dropout is a common and convenient strategy for avoiding overfitting. The idea is simple: in each forward-propagation step, some hidden nodes are forced to output zero, as if they were "killed"; the "killed" nodes are chosen at random with a preset dropout rate before each forward-propagation step starts. In this way, hidden nodes are prevented from forming spurious co-adaptations, and overfitting can thus be reduced. Note that the dropout rate must be set carefully: too low a rate cannot noticeably relieve overfitting, while too high a rate damages the network so much that it causes underfitting instead. Test results for the dropout rate are shown in Tab.4.
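The mechanism described above can be sketched as an inverted-dropout layer; `dropout_forward` is a hypothetical helper, and the 1/(1-rate) rescaling of survivors (so the expected activation matches inference) is the standard convention, assumed rather than stated in the text.

```python
import numpy as np

def dropout_forward(h, rate, rng, training=True):
    """Randomly zero ("kill") hidden activations with probability `rate`.

    Surviving nodes are rescaled by 1/(1 - rate) so the expected layer
    output is the same at training and inference time; at inference the
    activations pass through unchanged.
    """
    if not training or rate == 0.0:
        return h
    mask = rng.random(h.shape) >= rate   # True where the node survives
    return h * mask / (1.0 - rate)
```

With a rate of 0.02, roughly 2% of the hidden nodes are silenced on each forward pass, which is the mild perturbation that Tab.4 finds optimal.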
Tab.4 Influence of dropout rate on Edit Vector based model
Dropout rate | Train Acc | Val Acc | Test Acc
0 | 80.1% | 70.0% | 71.2%
0.01 | 77.3% | 69.8% | 70.1%
0.02 | 79.9% | 72.5% | 72.7%
0.1 | 75.4% | 69.7% | 70.8%
As shown in Tab.4, a high dropout rate (such as 0.1) damages the model too much, while a low dropout rate (such as 0.01) cannot solve the overfitting problem; in neither case does the prediction accuracy improve. A dropout rate of 0.02 strikes a balance between these two cases: it moderates overfitting without damaging the model too much.
Integrating the adjustments in this section, the prediction accuracy of the final model is shown in the bold row of Tab.4. Its test accuracy is 72.7%, higher than that of Coley's Edit Vector based model (68.5%).
Hybrid model
As mentioned in Section 1.3, the mixing factor ε determines the proportion of the ECFP subnetwork's output in the summed subnetwork outputs, and thus directly controls how much the model relies on ECFP. According to the results in Section 4.2, a more complex subnetwork does not help prediction, so only the influence of the mixing factor and the dropout rate is examined in this section; the results are shown in Tab.5.
Tab.5 Influence of mixing factor ε and dropout rate on hybrid model
ε | Dropout rate | Train Acc | Val Acc | Test Acc
0 | 0 | 80.1% | 70.0% | 71.2%
1.0000 | 0 | 99.9% | 63.1% | 61.6%
0.1000 | 0 | 99.7% | 65.1% | 66.9%
0.0200 | 0 | 98.8% | 71.0% | 70.8%
0.0010 | 0 | 85.3% | 71.9% | 72.5%
0.0008 | 0 | 85.5% | 75.3% | 72.6%
0.0005 | 0 | 83.6% | 70.1% | 72.3%
0.0010 | 0 | 85.3% | 71.9% | 72.5%
0.0010 | 0.01 | 85.7% | 73.8% | 73.0%
0.0010 | 0.02 | 85.4% | 73.7% | 73.9%
0.0010 | 0.05 | 83.4% | 73.1% | 73.1%
0.0010 | 0.1 | 82.6% | 70.9% | 72.7%
0.0010 | 0.2 | 77.9% | 68.7% | 70.3%
As shown in Tab.5, when the mixing factor is large (such as 1), the model overfits severely. As the mixing factor is reduced to around 0.001, the overfitting largely disappears and the prediction accuracy even exceeds that of the Edit Vector based model (ε=0), indicating that the extra information introduced by ECFP does help prediction. If the mixing factor is decreased further, however, the hybrid model loses its meaning. The influence of the dropout rate on the hybrid model is nearly the same as on the Edit Vector based model: a dropout rate of 0.02 again gives the best balance.
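Under the description above, the hybrid combination can be sketched as a weighted sum of the two subnetworks' outputs; the function name `hybrid_output` and the exact additive form are assumptions consistent with, but not verbatim from, Section 1.3.

```python
import numpy as np

def hybrid_output(edit_vector_out, ecfp_out, eps=0.001):
    """Combine the two subnetwork outputs with mixing factor eps.

    eps = 0 recovers the pure Edit Vector based model; a large eps
    (e.g. 1) makes the model lean heavily on ECFP, which Tab.5 shows
    leads to severe overfitting.
    """
    return edit_vector_out + eps * ecfp_out
```

With eps around 0.001, the ECFP branch acts only as a small correction on top of the Edit Vector score, matching the behavior reported in Tab.5.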
Integrating the adjustments of the mixing factor and dropout rate, the prediction accuracy of the final model is shown in the bold row of Tab.5. Its test accuracy is 73.9%, higher than that of Coley's hybrid model (72.8%).
Prediction examples
Neural networks are often criticized as "black boxes" whose outputs cannot be explained to humans exactly, but their behavior can still be observed through examples. Here we take the final Edit Vector based model of Section 4.1 (referred to as the "optimized model"), compare it with a model trained with the same hyperparameters as in the literature (referred to as the "original model"), and give two prediction examples.
For the reaction in Fig.7, the substitution should occur on the pyrazole ring owing to the strong electron-withdrawing effect of the nitro group. The optimized model assigned a probability of 33.1% to the true product, whereas the original model assigned it only 1.7%, giving 31.6% to the wrong product.
For the reaction in Fig.8, since the hydrochloric acid–pyridine condition is weakly acidic, the imine hydroxyl group of the product should not dehydrate to form a cyano group. The optimized model assigned a probability of 70.1% to the true product, whereas the original model assigned it 47.1%, giving 48.5% to the wrong product.