Characterization of Extracted Chemical Reaction Rules
We used a reaction database with a highly imbalanced frequency distribution: the reaction rules with frequency ≥ 10 account for only 0.63% (61,234 rules) of all rules with frequency ≥ 1. In contrast, the reaction data covered by these rules (frequency ≥ 10) account for 21.3% (3,395,642 reaction data) of the corresponding reaction data (frequency ≥ 1) (Table S1), which indicates the high imbalance of our database. To improve the prediction accuracy by reducing the number of classes (reaction rules, 9,672,940 → 61,234, Figure S1) while retaining most of the reaction database (15,930,914 → 3,395,642), the reaction rules with frequency under 10 were removed (Table S1). As discussed below, low-frequency reaction rules are large compared with high-frequency ones; therefore, a larger rule (frequency < 10) is normally expected to be a subset of a smaller rule (frequency ≥ 10). Accordingly, the dataset with frequency ≥ 10 was selected as the test bed for the undersampling study without loss of generality; this dataset still shows high class imbalance (Figure 1, bottom).
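As a minimal illustration of the frequency cut-off described above, the filtering step can be sketched as follows; the toy data format and function name are hypothetical and do not come from our pipeline.

```python
from collections import Counter

def apply_frequency_cutoff(rule_of_reaction, min_freq=10):
    """Keep only reactions whose extracted rule occurs at least min_freq times.

    rule_of_reaction: dict mapping a reaction id to its extracted rule id
    (a hypothetical toy representation of the rule-extraction output).
    Returns (kept_rules, kept_reactions).
    """
    freq = Counter(rule_of_reaction.values())
    kept_rules = {rule for rule, n in freq.items() if n >= min_freq}
    kept_reactions = {rxn: rule for rxn, rule in rule_of_reaction.items()
                      if rule in kept_rules}
    return kept_rules, kept_reactions

# Toy example: rule "A" occurs 3 times, rule "B" once.
data = {1: "A", 2: "A", 3: "A", 4: "B"}
rules, reactions = apply_frequency_cutoff(data, min_freq=3)
# rules == {"A"}; reactions keeps only the three samples of rule "A"
```

Removing rules drops both the classes and their reaction data, which is why the class count shrinks far more than the data count in the imbalanced setting above.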
The top-10 reaction rules ranked by frequency are presented in Figure 2. The rank-1 reaction rule has the maximum frequency of 82,906 and corresponds to a chemical reaction between carboxylic acid and ester functional groups. Among the top-10 reaction rules, six are related to oxygen-containing functional groups (ester, carboxylic acid, alcohol, and ether), and the other four are related to nitrogen functional groups (amine, nitro, and azide). Furthermore, the tert-butyldimethylsilyl, tert-butoxycarbonyl, tert-butyl, and acetyl protecting groups for amine and alcohol functional groups are involved. Although only part of the Reaxys database [20] is used, we expect that these top-10 reaction rules are also the major reactions historically identified by chemists, because the database we used (approximately 16M reactions) corresponds to more than half of the one-product, one- and two-reactant reactions (approximately 28M) in Reaxys [20].
High-ranking reaction rules are relatively small (i.e., short). Except for the protecting groups, the center atoms of the reaction centers have one (radius = 1) or two (radius = 2) adjacent heavy atoms before and after the reaction (Figure 2). In contrast, low-ranking reaction rules include multiple (radius ≥ 3) heavy atoms near the center atom. Therefore, we expect that most excluded rules (frequency < 10) are subsets of the selected rules (frequency ≥ 10), differing in size and detail but sharing the same reaction center.
Higher frequency cut-offs can further improve the prediction accuracy of machine-learning models. However, eliminating additional large reaction rules can affect the predicted reaction yield, because specific large rules that capture the atomic environment near the reaction center can give higher yields than general small rules. Hence, the trade-off between rule diversity, which affects yields, and the accuracy improvement from the frequency cut-off should be carefully considered when generating datasets.
Quantitative Performance on the Undersampling Datasets
Our undersampling strategy is to improve the prediction accuracy while maintaining the number of reaction rules. In other words, our approach reduces the data imbalance, which increases the prediction accuracy for minor (low-frequency) rules. Figure 3 shows structural clusters obtained by random, similarity, and dissimilarity undersampling for the rank-1 and rank-3 reaction rules, the top-ranked rules for oxygen (Figure 3(a)) and nitrogen (Figure 3(b)) functional groups, respectively. In Figure 3(a), the similarity cluster (blue box) includes organic molecules with similar scaffolds (conjugated rings) and a carboxylic acid group, whereas the dissimilarity cluster (green box) contains molecules with dissimilar ring and chain scaffolds. Finally, molecules with conjugated and saturated rings and a carboxylic-acid side chain appear in the random cluster (red box), which lies between the similarity and dissimilarity clusters. A similar relationship is seen in the molecular clusters with an amine functional group (Figure 3(b)): all amine groups are located next to a conjugated ring, and, in particular, cellobiose (C12H22O11) and 2H-pyran bearing an acetoxymethyl group and a triyl triacetate are mainly included in the random cluster (red box).
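The similarity and dissimilarity selections above both rank molecules by pairwise Tanimoto similarity over molecular fingerprints. A minimal sketch, assuming fingerprints are represented as sets of on-bits (in practice these would come from a cheminformatics toolkit; the seed-based selection here is a simplification of cluster-based sampling):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def pick_similar(fps, seed_idx, k):
    """Similarity undersampling: the k molecules most similar to a seed molecule."""
    order = sorted(range(len(fps)),
                   key=lambda i: tanimoto(fps[seed_idx], fps[i]), reverse=True)
    return order[:k]

def pick_dissimilar(fps, seed_idx, k):
    """Dissimilarity undersampling: the k molecules least similar to the seed."""
    order = sorted(range(len(fps)),
                   key=lambda i: tanimoto(fps[seed_idx], fps[i]))
    return order[:k]

fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
sim = pick_similar(fps, seed_idx=0, k=2)      # [0, 1]: seed and its near neighbor
dis = pick_dissimilar(fps, seed_idx=0, k=1)   # [2]: no bits shared with the seed
```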
Similarly, for each of the 61,234 reaction rules, three undersampled datasets (random, similarity, and dissimilarity), each of size 612,340, were prepared using the Taylor–Butina clustering algorithm [21, 22]. Neural-network models were trained on these random, similarity, and dissimilarity datasets. A 5% prediction set randomly separated from each dataset was used to measure the prediction accuracy; the prediction set sizes are 239,877 for the baseline model and 30,617 for the undersampling models. Figure 4(a) presents the top-10 prediction accuracies of the baseline and the three undersampling models on their respective prediction sets. For statistical robustness, the prediction accuracies of the three undersampling models were averaged over prediction sets sampled both independently and commonly from the three undersampled datasets; standard errors are marked on each averaged undersampling curve in Figure 4(a) and on the overall average in Figure S3. For the top-1 accuracy, the similarity and random models are more accurate than the dissimilarity and baseline models. In particular, the similarity model shows the highest accuracy over the entire top-k range owing to the structural similarity within its prediction set. By analogy with the one-sided selection technique [16], the 10 samples selected for each reaction rule may be clearly representative because noisy data and data near the boundaries between classes (reaction rules) are removed. Indeed, the large molecules in the dissimilarity datasets (Figure 3) can match multiple reaction rules with similar probabilities; thus, the prediction accuracy of the dissimilarity model is relatively low. In addition, all undersampling models exceed the baseline accuracy for all top-k, which can be understood as mitigation of the data-imbalance problem through undersampling.
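The Taylor–Butina algorithm used for the cluster-based sampling can be sketched compactly: molecules within a distance cutoff (distance = 1 − Tanimoto) are neighbors, and the molecule with the most unassigned neighbors repeatedly becomes a cluster centroid. This is a simplified O(n²) version for small toy inputs, not the production implementation.

```python
def butina_cluster(fps, cutoff=0.35):
    """Sketch of Taylor-Butina clustering on fingerprints given as sets of on-bits.

    cutoff: maximum distance (1 - Tanimoto) for two molecules to be neighbors.
    Returns a list of clusters, each a sorted list of molecule indices.
    """
    n = len(fps)

    def dist(i, j):
        union = len(fps[i] | fps[j])
        sim = len(fps[i] & fps[j]) / union if union else 0.0
        return 1.0 - sim

    neighbors = [{j for j in range(n) if j != i and dist(i, j) <= cutoff}
                 for i in range(n)]
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Pick the molecule with the most unassigned neighbors as centroid
        # (ties broken by the lowest index for determinism).
        centroid = max(sorted(unassigned),
                       key=lambda i: len(neighbors[i] & unassigned))
        members = {centroid} | (neighbors[centroid] & unassigned)
        clusters.append(sorted(members))
        unassigned -= members
    return clusters

fps = [{1, 2, 3}, {1, 2, 3, 4}, {9, 10}]
clusters = butina_cluster(fps, cutoff=0.35)
# The first two fingerprints (distance 0.25) cluster together; the third is a singleton.
```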
Despite these improvements, the undersampling method results in information loss because some data are discarded. To examine the effect of this loss on the prediction accuracy, the same prediction set was used for all undersampling models. Figure 4(b) depicts the prediction accuracy obtained from all undersampling models on the same random prediction set. Over the entire top-k range, the similarity model shows the highest accuracy. Unlike in the previous results, the dissimilarity model shows the second-highest accuracy, and the random model shows the lowest performance for all top-k. Moreover, the top-1 accuracies of the similarity and dissimilarity models increased (Table S2), probably because some prediction data in the random set are structurally similar to the similarity and dissimilarity training sets, whereas the training and prediction data within the random set may not be structurally similar to each other. Indeed, as shown in Figure 3, the undersampled datasets do not match exactly, yet some samples in each dataset are structurally similar. Figure 4(c) shows the prediction accuracy on the same similarity prediction set; here, the random model shows the highest accuracy, and the dissimilarity model is more accurate than the similarity model. In Figure 4(d), the dissimilarity prediction set was used for all undersampling models. The similarity and random models show similar tendencies for all top-k and overlap with the dissimilarity model after top-7. These results stem from the characteristics of the dissimilarity prediction set: as shown in Figure 3, the molecular structures of the dissimilarity datasets are heterogeneous compared with the random and similarity datasets, and the convergence of the prediction accuracies of all undersampling models after top-7 can be explained by this high heterogeneity.
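The top-k accuracies compared throughout Figure 4 follow the standard definition: the fraction of prediction samples whose ground-truth rule appears among the model's k highest-ranked rules. A minimal sketch with made-up rule identifiers:

```python
def top_k_accuracy(ranked_predictions, ground_truth, k):
    """Fraction of samples whose true rule appears in the model's top-k list.

    ranked_predictions: per-sample lists of rule ids, best-ranked first.
    ground_truth: per-sample true rule ids, aligned with ranked_predictions.
    """
    hits = sum(1 for preds, truth in zip(ranked_predictions, ground_truth)
               if truth in preds[:k])
    return hits / len(ground_truth)

preds = [["r1", "r2", "r3"], ["r5", "r1", "r4"]]
truth = ["r1", "r4"]
# top-1: only the first sample hits -> 0.5; top-3: both hit -> 1.0
```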
These results confirm that the dissimilarity undersampling worked as intended.
To obtain a more complete picture of this information loss, two oversampling experiments based on random sampling and the synthetic minority oversampling technique (SMOTE) [12, 13] were performed with the same frequency cut-off of 10 for the reaction rules. To reduce the burden of the vast amount of oversampled data, a hybrid method was applied: undersampling for reaction rules with frequency over 20 (36% of the rules) and oversampling for rules with frequency under 20 (62% of the rules) (Figure S4(a)). The resulting oversampled dataset is therefore twice the size of the undersampled dataset. The prediction accuracies of the oversampling models are shown in Figure S4(b). Random and SMOTE sampling show very similar accuracies over the entire top-k range. Compared with the random undersampling method (Figure 4(a)), the top-1 and top-10 accuracies improved by 6% and 2%, respectively, which seems to be due to the doubled dataset size. We also expect that a model trained on both the oversampled data (frequency under 20) and the original data (frequency over 20) would be less accurate than our hybrid model because of data imbalance. These results show that oversampling can improve the prediction accuracy and that the hybrid model, which oversamples minor rules and undersamples major rules, can be an alternative between an oversampling model with a heavy training burden and an undersampling model with some information loss.
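The hybrid split by rule frequency can be sketched as below. This is a simplified stand-in: minor rules are oversampled by random duplication rather than SMOTE interpolation, and the threshold and per-rule sizes are illustrative parameters, not the exact ones used in our experiments.

```python
import random

def hybrid_sample(samples_by_rule, freq_split=20, n_keep=10, n_boost=20, seed=0):
    """Hybrid sampling sketch.

    Rules with at least freq_split samples (major rules) are randomly
    undersampled to n_keep samples; rarer rules (minor rules) are oversampled
    by random duplication (a stand-in for SMOTE) up to n_boost samples.
    """
    rng = random.Random(seed)
    out = {}
    for rule, samples in samples_by_rule.items():
        if len(samples) >= freq_split:
            # Major rule: random undersampling without replacement.
            out[rule] = rng.sample(samples, n_keep)
        else:
            # Minor rule: pad with duplicates drawn with replacement.
            extra = [rng.choice(samples) for _ in range(n_boost - len(samples))]
            out[rule] = samples + extra
    return out

toy = {"major_rule": list(range(30)), "minor_rule": list(range(12))}
balanced = hybrid_sample(toy)
# major_rule shrinks to 10 samples; minor_rule grows to 20
```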
Furthermore, models with higher frequency cut-offs combined with undersampling can be considered. With a higher cut-off, the number of reaction data per rule increases and the number of reaction rules (i.e., classes) decreases; in other words, the overall prediction accuracy increases, but the diversity of reaction rules decreases. The larger number of reaction samples for the surviving rules may also remedy the information loss caused by undersampling. Compared with previous studies on multiscale reaction classification [17] and data augmentation [18], the effect of data sampling on the prediction accuracy can be relatively large; in other words, inappropriate sampling in an undersampling model may result in low prediction performance.
Qualitative Analysis based on Detailed Examples
For the qualitative analysis of the baseline model and the three undersampling (random, similarity, and dissimilarity) models, we selected three target molecules from three reaction rules with different frequencies. In addition, target-oriented undersampling experiments were performed using training datasets sampled by structural similarity to each target molecule, where the similarity was calculated for all product molecules in each reaction rule using the Tanimoto coefficient [23]; the 10 reaction samples with the smallest pairwise distances to the target were selected for the training and prediction datasets. As shown in Figure 5, the first molecule (① 2,6-dimethylphenanthridine) was selected from a reaction rule with a frequency of 10; therefore, all reaction samples containing this molecule are included in the datasets of both the baseline and the three undersampling (random, similarity, and dissimilarity) models. Hence, predicting the reaction rule for this molecule with the baseline and four undersampling (random, similarity, dissimilarity, and target-oriented) models tests whether the imbalance problem is mitigated by undersampling. Figure 5(c)① shows the results of single-step retrosynthesis for 2,6-dimethylphenanthridine, with the top-1 to top-5 predicted reactants for the baseline and the four undersampling models. Owing to the dataset imbalance, the baseline model ranked the ground truth at top-3, as the ground-truth rule is a minor rule with frequency 10. In contrast, the similarity, dissimilarity, and target-oriented models rank the ground truth at top-1, top-2, and top-2, respectively, all higher than the baseline model, whereas the random model ranks it at top-4, lower than the baseline model.
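The target-oriented selection described above reduces, per reaction rule, to ranking product molecules by Tanimoto similarity to the target and keeping the 10 closest. A sketch, again assuming fingerprints are represented as sets of on-bits (a stand-in for real molecular fingerprints):

```python
def target_oriented_select(product_fps, target_fp, k=10):
    """Target-oriented undersampling sketch.

    Within one reaction rule, keep the k product molecules with the highest
    Tanimoto similarity (i.e., smallest pairwise distance) to the target.
    Returns the indices of the selected products, most similar first.
    """
    def tanimoto(a, b):
        union = len(a | b)
        return len(a & b) / union if union else 0.0

    ranked = sorted(range(len(product_fps)),
                    key=lambda i: tanimoto(product_fps[i], target_fp),
                    reverse=True)
    return ranked[:k]

products = [{1, 2}, {1, 2, 3}, {8, 9}]
target = {1, 2, 3}
chosen = target_oriented_select(products, target, k=2)
# index 1 matches the target exactly, index 0 is next closest
```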
In addition, the reaction rules predicted between top-1 and top-5 are the same for the baseline and the four undersampling models; only their ranks differ. Although this target molecule was included in the datasets of all prediction models, the baseline model ranks the ground truth at top-3, relatively low compared with the undersampling models. This example confirms that the undersampling method can mitigate the imbalance problem.
Figures 5(c)② and ③ present how the four undersampling models perform for target molecules included only in the baseline training dataset (Figure 5(a)). These two target molecules were selected from reaction samples of major reaction rules with frequencies of 208 and 373 (Figure 5(b)). In Figure 5(c)②, the prediction results for 7H-benzo[c]phenothiazine reveal that the baseline model ranks the ground truth at top-1, which may be attributed to both the high rule frequency of 208 and the presence of the target molecule in its dataset. Among the undersampling models, the dissimilarity and target-oriented models predict the ground truth at top-1, whereas the other two models failed to predict it within the top-5. The overlap of the predicted reaction rules between the baseline model and each undersampling model is only 20%, whereas the overlaps are 60, 80, and 60% between the similarity and dissimilarity, similarity and random, and dissimilarity and random models, respectively. In other words, the three undersampling models show similar prediction tendencies that differ from the baseline model, which is mainly related to the distinct statistical distributions of their training datasets. However, the target-oriented model shows only 0, 20, and 0% overlap with the similarity, dissimilarity, and random models, respectively.
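The overlap percentages quoted above are simple set intersections over each pair of top-5 predicted rule lists; the rule identifiers below are made up for illustration.

```python
def overlap_percent(top_a, top_b):
    """Degree of overlap (%) between two models' top-k predicted rule lists."""
    assert len(top_a) == len(top_b), "compare equally sized top-k lists"
    shared = set(top_a) & set(top_b)
    return 100.0 * len(shared) / len(top_a)

model_a = ["r1", "r2", "r3", "r4", "r5"]
model_b = ["r1", "r9", "r8", "r4", "r7"]
# two of five predicted rules are shared -> 40%
```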
Finally, the target molecule methyl (E)-3-(2-methoxy-5-(1-(3,4,5-trimethoxyphenyl)vinyl)-phenyl)acrylate was additionally chosen because it is a large molecule that can contain multiple reaction centers (Figure 5(c)③). Only the random undersampling model predicts the ground truth at top-1. In contrast, the baseline model predicts it at top-4, even though the ground-truth rule has a high frequency of 373 compared with the 10 samples per rule used in the random undersampling model. The overlaps of the predicted reaction rules between the baseline model and the similarity, dissimilarity, random, and target-oriented models are 0, 20, 40, and 40%, respectively, and the overlaps among the undersampling models are 0–40%. Hence, regardless of the statistical distribution of the training datasets, all prediction models had difficulty predicting the ground-truth reaction rule because of the multiple reaction centers in this large target molecule.
For these three target molecules, the prediction accuracies of the target-oriented models are shown in Figure 5(d); the top-k accuracies are very similar for all target molecules. This result demonstrates the robustness of the target-oriented models' prediction accuracy irrespective of the target molecule. In addition, their top-1 and top-10 accuracies are higher than those of the baseline model by 9.3 and 4.2%, respectively. Hence, the target-oriented model can be an alternative that overcomes the inappropriate-sampling problem of undersampling methods.