Data set. Obtaining pre-existing experimental data from publicly available databases such as SciFinder or Reaxys enables chemists to create large data sets based on published literature or patents16–19. However, to avoid inconsistency between sources and the omission of critical synthesis metadata, we selected each representative reaction type from a single literature source to reduce uncertainty20. Meanwhile, to explore the generalization capability of the models, we divided our data sets into two prediction tasks: one for predicting reaction performance and the other for predicting enantioselectivity21. Following these principles, we selected four previous representative studies as complete experimental reaction data: Buchwald-Hartwig amination (BHA, C-N)3 and Suzuki-Miyaura coupling reactions (SMC, C-C)15 were used to predict yields; asymmetric N,S-acetal formation (ANSA, C-S)4 and asymmetric hydrogenation (AHR)5 were used to predict the reaction enantiomeric excess.
Although an even distribution of yields or enantioselectivities across the range of each data set is preferable for model training, given the limited data availability no trimming based on yield or ee was carried out for the four data sets. Detailed statistics of the components are labeled in blue for each reaction (Fig. 1a-d). In BHA, we selected 4037 reactions performed via HTE and took account of four components: aryl halides, ligands, bases, and solvents. Interestingly, the reaction yields were found to be centred below 40%, especially when four additives (benzo[c]isoxazole, ethyl 5-methylisoxazole-4-carboxylate, ethyl isoxazole-4-carboxylate, and methyl isoxazole-5-carboxylate) participated in the reactions (Fig. 1a). The SMC data set was created by Sach and co-workers on an automated flow-based synthesis platform and was not purposely designed for an ML task15. Its component categories are the most complicated of the four data sets, comprising five components (boronic acids, aryl halides, ligands, bases, and solvents), which makes it the most challenging data set for machine learning models. From this set, we excluded all reactions with yields of 0% and some reactions with yields between 0 and 10%, finally selecting 4139 reactions. The yield distribution of SMC is similar to that of BHA, but no compounds act as severe reaction poisons (Fig. 1b). Unlike the task of predicting reaction yields, in asymmetric reactions the machine learning models must precisely predict the R/S configuration together with its enantioselectivity. Here we set the R configuration to positive (+) and the S configuration to negative (−), so that the predicted result is a score between −100% and +100%. Both asymmetric catalytic reaction data sets have a high percentage of reactions with over 70% ee, but ANSA has few reactions with low ee or in the < 0% ee range representing the S configuration (Fig. 1c). The AHR data set contains fewer reactions than the other three, and previous work predicted only the absolute value of the reaction enantioselectivities. We found that, under our sign convention, the number of AHR reactions with ee below −70% is comparable to the number above +70%, with few reactions in the middle of the range. Considering the relatively small size of the AHR data set, we considered it necessary to combine unsupervised learning methods to expand the training data (Fig. 1d).
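To make the sign convention above concrete, the following minimal sketch (with hypothetical column names, not those of the actual data files) shows how an (ee magnitude, R/S configuration) pair can be folded into a single signed target in [−100, +100]:

```python
import pandas as pd

# Hypothetical example: encode (ee magnitude, R/S configuration) pairs as a
# single signed target, with R mapped to positive and S to negative values.
df = pd.DataFrame({
    "ee_percent": [92.0, 85.5, 64.0],
    "configuration": ["R", "S", "R"],
})
sign = df["configuration"].map({"R": 1.0, "S": -1.0})
df["signed_ee"] = sign * df["ee_percent"]
print(df)
#    ee_percent configuration  signed_ee
# 0        92.0             R       92.0
# 1        85.5             S      -85.5
# 2        64.0             R       64.0
```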
Meanwhile, to evaluate the complexity of these four data sets, MACCS keys were extracted and the Tanimoto coefficient was calculated as the standard measure of molecular similarity22,23. Figure 1e-h shows spider plots reflecting the complexity of each category in every data set; a high score for a category indicates high intra-category similarity. Figure 1i summarizes the scores for each data set (Eq. (1)). Comparing the scores, SMC turned out to be the most complex data set, followed by AHR and BHA, whereas ANSA had the most similar molecules in each category. With these complexity scores, chemists can preliminarily judge how well machine learning models will fit the corresponding data set. Generally, the more complex the data set, the more difficult it is to make accurate predictions, and the worse the fit.
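As an illustration of this similarity measure, the sketch below computes MACCS keys with RDKit and averages the pairwise Tanimoto coefficients within one component category. The SMILES strings are placeholders, and the mean pairwise similarity is one plausible category score; the manuscript's Eq. (1) may aggregate the per-category scores differently:

```python
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys

# Example molecules standing in for one component category (e.g. bases).
smiles = ["CCN(CC)CC", "CCCCN(CCCC)CCCC", "c1ccncc1"]
fps = [MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s)) for s in smiles]

# Mean pairwise Tanimoto similarity as a simple intra-category score.
pairs = list(combinations(fps, 2))
score = sum(DataStructs.TanimotoSimilarity(a, b) for a, b in pairs) / len(pairs)
print(f"intra-category similarity score: {score:.3f}")
```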
Extraction and analysis of chemical descriptors. To generate machine-readable inputs, we need to quantify the physicochemical properties of candidate molecules, also called chemical descriptors. Although there is some research on feature extraction using unsupervised learning24–27, the mainstream approach is still to calculate the relevant properties with DFT or to use software libraries such as RDKit28 to generate feature matrices such as molecular fingerprints. Because molecular fingerprints are encoded with a one-hot scheme, they cannot reflect subtle changes, in contrast to the continuous variables calculated by DFT29,30. In view of this, although the computational cost of DFT is higher, DFT-based molecular feature engineering is the preferred method, especially when it is difficult to obtain sufficient experimental data to warrant implementation of the ML models.
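The contrast can be seen in a short sketch: between two close analogues, an RDKit bit-vector fingerprint only flips discrete bits, whereas DFT-derived descriptors vary continuously. The HOMO/LUMO values below are illustrative placeholders, not computed here:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

# Two close analogues: a binary fingerprint only flips whole bits between them.
mol_a = Chem.MolFromSmiles("c1ccccc1O")   # phenol
mol_b = Chem.MolFromSmiles("c1ccccc1N")   # aniline
fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=1024)
fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=1024)
diff = np.array(fp_a) != np.array(fp_b)
print(f"bits differing: {diff.sum()} of 1024")  # discrete 0/1 jumps only

# DFT descriptors, by contrast, are continuous; these (HOMO, LUMO) values
# in eV are illustrative placeholders.
homo_lumo = {"phenol": (-6.1, 0.3), "aniline": (-5.6, 0.6)}
```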
Following existing DFT-based feature engineering, current methods are mainly built on partial features of molecules. Doyle et al. focused on the properties of the skeleton shared by different molecules, such as the charges or NMR shifts of skeleton atoms3. Sunoj and co-workers additionally considered the steric hindrance of substituents on the molecular skeleton5. Such partial molecular information can describe the reaction space of the respective data sets well, but it does not generalize to other reactions. Because of this dependence on molecular skeletons, chemists need to redesign the feature extraction process for molecules with different skeletons. Denmark and co-workers extracted the conformation and space occupation of molecules; although the spatial information of the whole molecule was included, the feature dimension is very large and the extraction process is complex. Therefore, we used easily obtainable and more universal descriptors to predict reaction performance and enantioselectivity.
We divided the chemical descriptors into two parts: molecule-level descriptors (MLD) and atom-level descriptors (ALD). MLD, which distinguish the global properties of different molecules, comprise molecular volume, single-point energy, HOMO, LUMO, and dipole. ALD include atomic number, three-dimensional coordinates, Mulliken charges, NPA charges, and NMR shielding. By inputting ALD in accordance with the atomic sequence of the input file, the chemical properties of atoms can be reconstructed in the ML models. Boobier and co-workers also extracted atomic features when predicting the solubility of solutes in different solvents, but the features of identical elements were finally summed into descriptors31. We think this method loses atomic information, because atoms of the same element occupy different positions in the molecular structure. Likewise, aggregating the features does not make full use of the learning ability of the ML models, whereas ALD adapt well to graph neural networks32. Within ALD, we deliberately extracted the three-dimensional coordinates of all atoms, which are usually ignored in feature engineering, because three-dimensional features such as steric hindrance or dihedral angles correlate strongly with atomic coordinates. The complete extraction process can be implemented with Gaussian 09 alone, driven by an automated Python script. It is worth noting that compounds belonging to one category in a data set must be aligned to the molecule with the most atoms in that category to obtain uniform-length structural descriptors. Finally, the ALD and MLD of each compound are concatenated as input to the ML models.
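A minimal sketch of this assembly step follows, assuming (which the text does not specify) that alignment is realized by zero-padding every molecule's atom-property matrix to the largest atom count in its category before flattening and appending the MLD vector:

```python
import numpy as np

N_PROPS = 7     # atomic number, x, y, z, Mulliken, NPA, NMR shielding
MAX_ATOMS = 30  # atom count of the largest molecule in this category

def featurize(ald: np.ndarray, mld: np.ndarray) -> np.ndarray:
    """Pad the (n_atoms x N_PROPS) ALD matrix, flatten, and append MLD."""
    padded = np.zeros((MAX_ATOMS, N_PROPS))
    padded[: ald.shape[0]] = ald  # zero-pad to the category maximum
    return np.concatenate([padded.ravel(), mld])

ald = np.random.rand(12, N_PROPS)  # a 12-atom molecule (placeholder values)
mld = np.array([105.2, -402.7, -6.3, -0.9, 2.1])  # volume, E, HOMO, LUMO, dipole
x = featurize(ald, mld)
print(x.shape)  # (30*7 + 5,) = (215,)
```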
To analyze these descriptors, we first carried out a correlation analysis of MLD and ALD33. In Fig. 3a-d, exceptional autocorrelation was observed between the MLD. Among all atoms in the four data sets, 7 features (the coordinates are split into coordinate-x, coordinate-y, and coordinate-z) show acceptable autocorrelation in Fig. 3e-h. The correlation between the Mulliken charge and the NPA charge verifies the effectiveness of the correlation analysis: these two features are essentially the same property calculated by different methods, so they are strongly correlated34. Meanwhile, because ALD are independent variables that vary continuously in value, they can reflect subtle differences between atoms of the same element. In the ML models, the descriptors are ordered by the atomic sequence of each molecule, so the properties of each atom correspond to the element represented by the atomic number. The ML models can thus learn the properties of every atom during training, without being limited by the molecular skeleton or other predefined structures.
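The autocorrelation check can be reproduced with a standard correlation matrix over descriptor columns; the sketch below assumes Pearson correlation (the method behind ref. 33 is not stated) and uses synthetic placeholder data, with the mulliken/npa pair constructed to mimic the behavior discussed above:

```python
import numpy as np
import pandas as pd

cols = ["Z", "x", "y", "z", "mulliken", "npa", "nmr"]
rng = np.random.default_rng(0)
ald_table = pd.DataFrame(rng.normal(size=(500, len(cols))), columns=cols)
# The two charge schemes describe the same property, so make them correlated.
ald_table["npa"] = 0.9 * ald_table["mulliken"] + 0.1 * rng.normal(size=500)

corr = ald_table.corr(method="pearson")
print(corr.loc["mulliken", "npa"])  # high, as the two charge schemes agree
```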
Principal component analysis (PCA) reduces the feature dimension and displays it in two-dimensional or three-dimensional images, making it easier for chemists to observe the distribution of data sets in descriptor space, de-noise the data, and pick out representative reactions effectively35. Figure 3i-l shows the distribution of the four data sets, colored by yield% or ee%, in the first three principal components of the MLD. BHA, SMC, and ANSA display acceptable feature distributions: similar reactions are effectively clustered together in descriptor space, and there is no obvious noise. AHR is more complex, with a scattered overall reaction distribution; however, highly selective reactions (red or blue) still show a tendency to aggregate.
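A minimal sketch of this projection, assuming scikit-learn's PCA on a placeholder MLD matrix (one row per reaction):

```python
import numpy as np
from sklearn.decomposition import PCA

# Project MLD vectors onto the first three principal components, as used for
# the distribution plots; X is a placeholder (n_reactions x n_mld_features).
X = np.random.rand(200, 5)
pca = PCA(n_components=3).fit(X)
coords = pca.transform(X)
print(coords.shape, pca.explained_variance_ratio_)
```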
Model training and selection. In Fig. 4c, random forest and neural network models trained on the ANSA data set achieve R2 values of 0.93 and 0.95, respectively, the most substantial improvement. The remaining four algorithms (AdaBoost, linear regression, support vector machine, and k-nearest neighbors) provide no improvement on the three data sets. Further observation showed that, as the number of reactions and the complexity of the data sets increase (Fig. 1), using MLD alone led to worse fitting of the ML models, although the R2 of the neural network always remained above 0.90, so MLD can describe global molecular properties notably well. It should be noted that the distribution of yield or ee% induces a prediction tendency in the ML models. In BHA, the overall yield distribution lies in a lower range, and Fig. 4a shows that the model tends to predict low yields. In ANSA, there are more reactions with high ee%, and Fig. 4c reflects that the predicted ee% values are on the high side overall. These results indicate that some compounds in these reactions are clearly associated with the same outcome; with this in mind, the ML models only need to capture some exclusive information to make accurate regression analyses. Accordingly, when the complexity of the data increases (Fig. 1i), MLD can no longer satisfy the learning of the ML models. In this situation, it is necessary to increase the complexity of the reaction descriptors, and adding ALD supplies the required quantitative descriptors.
When employing ALD, the fitting of all ML models improved significantly on ANSA, BHA, and SMC compared to MLD (Fig. 5a-d). The dimension of ALD is much higher than that of MLD, which highlights the contribution of the properties of different molecules to yield and enantioselectivity. In conclusion, when a data set is not very complex, we extract MLD to train the ML model first and then decide whether to enrich the descriptors according to the training results. When using a deep learning model, such as a deep neural network, we recommend adding ALD to match the parameter capacity of the network. When weighing training speed against prediction quality, the outcome depends mainly on the extraction of descriptors, followed by the selection of the algorithm. To sum up, we employed all molecular and atomic descriptors for the following tasks. Considering that the architecture and parameters of a neural network can be flexibly adjusted for different tasks and data sets, we chose the feed-forward neural network to maximize extrapolative ability among the candidate ML models.
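The algorithm comparison can be reproduced schematically as below, with synthetic data standing in for the descriptor matrices and targets; the hyperparameters are illustrative scikit-learn defaults, not the settings used in this work:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Placeholder data: X would be the MLD/ALD/MALD matrix, y the yield or signed ee.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 50))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "RF": RandomForestRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "Linear": LinearRegression(),
    "SVM": SVR(),
    "KNN": KNeighborsRegressor(),
    "NN": MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=2000, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, f"R2 = {r2_score(y_te, model.predict(X_te)):.2f}")
```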
Prediction with both atomic and molecule level descriptors. The suitability of the combined molecular and atomic level descriptors (MALD) was compared with that of the previous two descriptor sets. Here, the calculation protocol and machine learning methods are the same as before, but a combination of atomic and molecular level descriptors represents each compound, rather than a single descriptor set. As can be seen from Fig. 6a-c, the neural network showed the most significant improvement with the MALD descriptor, and its R2 reached 0.97–0.98 on all three data sets. In view of the outstanding performance of the neural network, we further analyzed its use of the three descriptor sets on all data sets (Fig. 6f). We found that, with the best-performing neural network, MALD was always better than ALD, and ALD was better than MLD. This leads us to believe that richer descriptors enable neural networks to automatically extract more useful molecular properties, and that the integrated MALD has the potential to become a general descriptor for a variety of chemical prediction tasks.
At the same time, the results of the models using the ALD descriptor on all data sets are closer to those of the full-feature MALD descriptor. The reason may be that ALD represent atomic level features beyond the global information of molecules, which is crucial for describing stereochemical information. To verify this conjecture, we compared the performance of all machine learning methods on MLD, ALD, and MALD, and found that the methods showed extremely consistent performance trends on ALD and MALD, which differed significantly from MLD (Fig. 5d, Fig. 6d and Fig. 7e). These two distinct trends prove that the performance of the models on all data sets is mainly driven by ALD, and that the complete reconstruction of molecular three-dimensional information is very important for predicting and optimizing chemical reactions.
We then paid particular attention to the ability of the descriptors to describe chemical spaces of different complexity, which is the key factor in determining whether a machine learning protocol is universal across chemical prediction tasks. Because the complexity of the data sets directly affects the performance of the descriptors, we selected three data sets with representative complexities. Specifically, the BHA data set has minimal complexity: a small chemical space (only 15 halogenated aromatics, 4 catalysts, and 23 additives) and up to 4037 reactions. Not surprisingly, MLD, ALD, and MALD all performed well on this data set, although MALD was outstanding. However, the differences among the models on ANSA, which has a small number of reactions, drew our attention: when the atomic level features were added to the descriptor, the performance of all models improved (for example, the R2 of the SVM method increased sharply from 0.25 to 0.89, Figs. 4c and 5c). The SMC data set, with the largest complexity, offers an even clearer hint: ALD and MALD enable machine learning methods, especially RF and NN, to achieve very high performance in a large chemical space. The MALD descriptor thus demonstrated satisfactory performance and universality on these three representative data sets. More promisingly, our MALD outperformed the state-of-the-art works on these data sets (see rows 1–3 in Table 1).
Fig. 6f Comparison of MALD, ALD, and MLD on all data sets (training with the neural network).
Leave one molecule out prediction. Chemists pay particular attention to ML model predictions for unknown molecules and reactions, because they help synthetic chemists use the data already in hand for more effective experimental design. This places higher demands on the generalization ability of a machine learning model. Żurański and co-workers proposed a cross-validation method called leave-one-molecule-out, which helps evaluate the ability of machine learning models to explore unknown chemical spaces9. Here we adopted this test: the training set contained none of the reactions involving the held-out molecule, and all reactions of that molecule were placed in the test set, in order to evaluate the extrapolation ability of ML models trained in a known chemical reaction space. The neural network was still used for prediction, but to enhance the generalization ability of the model, we added dropout with a probability of 0.5 to each hidden layer of the network.
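A sketch of this protocol using scikit-learn's LeaveOneGroupOut on synthetic data, with a random forest standing in for the paper's dropout network (scikit-learn's MLP does not expose dropout):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneGroupOut

# Every reaction involving the held-out molecule goes to the test fold;
# the data and the molecule-to-reaction mapping here are synthetic.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))
y = 10 * X[:, 0] + rng.normal(size=300)
molecule_ids = rng.integers(0, 15, size=300)  # which molecule each reaction uses

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=molecule_ids):
    model = RandomForestRegressor(random_state=0).fit(X[train_idx], y[train_idx])
    rmse = mean_squared_error(y[test_idx], model.predict(X[test_idx])) ** 0.5
    print(f"molecule {molecule_ids[test_idx][0]}: RMSE = {rmse:.2f}")
```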
Figure 7a-c depicts the model's prediction for each molecule in the data set. The black dotted line in the figure represents the model's result under 5-fold cross-validation and serves as the standard RMSE value for evaluation. The standard RMSEs of BHA, SMC, and ANSA are 4.6, 4.4, and 4.3, respectively. Each black spot corresponds to the RMSE obtained by leave-one-molecule-out prediction. The neural network's prediction for an unknown molecule can be evaluated by how far each molecule deviates from the standard RMSE value.
In BHA (Fig. 7a), the neural network makes accurate extrapolative predictions for all unknown additives and aryl halides in the data set: the RMSEs of the 21 additives and 9 aryl halides are below 4.5. The RMSEs of molecules among the bases and ligands show a large, right-shifted deviation. This is mainly because of the large number of reactions involving these two categories (the reactions of compounds in each category account for 24.7% and 32.7% of the total) and because the similarity score of the bases is the lowest (Fig. 1e). The reactions of bases and ligands are therefore more complex than those of additives and aryl halides, but the neural network still offers a high level of accuracy, with all RMSE values below 8.0. In ANSA (Fig. 7c), the neural network predicted the two categories of reactants more accurately, and the results for imines are better than for mercaptans. The leading reason is that the structural variation among the mercaptans is more pronounced than among the imines, and the similarity score of the mercaptans is lower. Nevertheless, the neural network tends to encounter predictive limitations when applied to substrates with embedded catalysts in the ANSA data set: it shows a large deviation in extrapolative accuracy, and the same occurs for the aryl halides and boronic acids of SMC (Fig. 7b). Evaluating the composition of these two data sets, we found that although the similarity score of ANSA is higher than that of SMC, LOMOP shows the same trend in both. We therefore hypothesized that this problem is exacerbated by the presence of activity cliffs, where some catalysts can be a source of deleterious side reactivity36. For example, in ANSA, the ee% of the catalysts whose prediction RMSE exceeded 30 was almost always below 60%. Even though the catalysts have the highest similarity score in ANSA, the neural network cannot predict them accurately: because the overall ee% of ANSA is mostly above 70%, the ML models struggle to identify these unusual data, a deviation that compound similarity cannot capture. In BHA, since each category of compound shares the same molecular skeleton, the neural network only needs to predict yields as the substituents change. Interestingly, in SMC the aryl halides and boronic acids have the highest similarity scores (Fig. 1b), differing only in their substituents; however, when these molecules were picked as the test set, the prediction accuracy was low. In contrast, the SMC molecules among ligands, bases, and solvents, which have low similarity scores (Fig. 1b), achieved low RMSE scores. When an aryl halide or boronic acid with a modest structural change is removed, the entire reaction is left out at the same time. As a result, the reaction performance is determined not only by the substituent differences of the aryl halides or boronic acids but also by the ligand, base, and solvent, and the influence of the latter dominates, so the neural network predicts inaccurately when dealing with reactions comprising new compounds. The first two principal components of the full molecular space with MLD also reveal distinct regions of high, medium, and low accuracy (Fig. 7d-f): the red points are not accurate at all, the yellow and green points are more accurate, and the blue points are the most accurate. Molecules of the same category are grouped into clusters with high intra-class similarity.
This 2-D chemical space can be used to explore unknown reaction combinations by selecting and optimizing the molecules with the best LOMOP results.
Sparse data set prediction. As mentioned above, the ALD and MALD descriptors bring obvious benefits on small data sets such as ANSA, but in practice we face a more serious problem of insufficient reaction data, namely small data sets spanning a large chemical space, which we call sparse data sets. Most data sets owned by experimentalists are sparse and cannot be augmented through chemical experiments because of time and cost constraints. To further improve the ability of our machine learning protocol to deal with sparse data sets, we explored data augmentation based on MALD.
Here, we used a k-nearest-neighbor-based SMOTE algorithm to augment the MALD descriptor data of existing sparse data sets37. We selected a typical sparse reaction data set, AHR, which contains only 334 reactions but a large number of reactant molecules (127 olefins, 63 imines, 58 ligands, and 12 solvents) (Fig. 1d). The chemical space of this data set is complex and the amount of data is small, so predicting its enantioselectivity is very difficult. The algorithm randomly selects N samples from the k-nearest-neighbor data for random linear interpolation, creating new minority-class samples. Specifically, it generates new data in three steps, as sketched below: (1) for each sample, the Euclidean distance to all other samples is calculated to obtain its k nearest neighbors; (2) some samples are randomly selected from the k nearest neighbors of each sample according to a preset sampling rate; (3) random linear interpolation is performed between each selected neighbor and the original sample to generate a new sample.
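A minimal sketch of these three steps operating directly on MALD rows follows; interpolating the ee label with the same coefficient is an assumption of this sketch, and all data are placeholders:

```python
import numpy as np

def smote_like_augment(X, y, k=5, n_new=100, seed=0):
    """Three-step SMOTE-style interpolation over descriptor rows."""
    rng = np.random.default_rng(seed)
    # (1) Euclidean distances to all other samples -> k nearest neighbours.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]
    # (2) Pick a random base sample and one of its neighbours per new point.
    base = rng.integers(0, X.shape[0], size=n_new)
    nbr = knn[base, rng.integers(0, k, size=n_new)]
    # (3) Random linear interpolation between each pair; the label is
    # interpolated with the same coefficient (an assumption of this sketch).
    lam = rng.random((n_new, 1))
    X_new = X[base] + lam * (X[nbr] - X[base])
    y_new = y[base] + lam[:, 0] * (y[nbr] - y[base])
    return X_new, y_new

rng = np.random.default_rng(0)
X = rng.random((334, 30))             # placeholder MALD rows, one per reaction
y = rng.uniform(-100, 100, size=334)  # placeholder signed ee values
X_new, y_new = smote_like_augment(X, y, n_new=252)
print(X_new.shape, y_new.shape)       # 252 synthetic reactions
```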
Sunoj and co-workers5 augmented the AHR data set from 368 to 586 reactions. For comparison, we used the SMOTE algorithm based on MALD to create the same amount of reaction data. The augmented AHR data set was then applied to various machine learning methods, in which 80% of the reaction data was randomly selected as the training set and the remaining 20% was used as the test set. With the help of MALD and the data augmentation algorithm, all machine learning methods achieved good results, and the R2 of four methods reached 0.90–0.97. As a control, we trained and tested on the original AHR data set in the same way. We found that our MALD-based data augmentation algorithm significantly improved the performance of all machine learning methods (Fig. 8b) and outperformed the state-of-the-art work on the AHR data set (Table 1, row 4). Therefore, with the help of SMOTE, MALD can explore the unknown feature space of chemical reactions well and provide more useful information for machine learning methods.
We noted that Sunoj and co-workers5 described only the absolute value of the enantiomeric excess, because the AHR data set is scarce and difficult to predict. To further explore the effectiveness of our MALD-based data augmentation algorithm, we carried out the prediction of the R and S stereoisomers at chiral centers. We constructed a new data set from AHR, in which the data for each reaction are the reactants and the R or S stereoisomer of the reaction product; the R and S labels were obtained from the sign of the enantiomeric excess. Surprisingly, our MALD-based data augmentation algorithm enabled all machine learning methods to predict the R and S stereoisomers very well, with AUCs reaching 0.95–1 (Fig. 8c). The decision boundary (Fig. 8d) and decision probability (Fig. 8e) after principal component analysis also show the effectiveness of the machine learning method based on MALD and SMOTE.
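Schematically, this classification task reduces to training on sign-derived labels and scoring with ROC AUC, as in the following sketch on synthetic data (a random forest classifier stands in for the various methods compared in Fig. 8c):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Labels come from the sign of the signed ee (R -> 1, S -> 0);
# the features and the signal here are synthetic placeholders.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 30))
signed_ee = 40 * X[:, 0] + 10 * rng.normal(size=500)
y = (signed_ee > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUC = {auc:.2f}")
```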