Data set. Obtaining pre-existing experimental data from publicly available databases such as SciFinder or Reaxys enables chemists to create large data sets based on published literature or patents16–19. However, to avoid inconsistency between sources and the omission of critical synthesis metadata, we selected each representative reaction type from a single literature source to reduce uncertainty20. Meanwhile, to explore the generalization capability of the models, we divided our data sets into two prediction tasks: one for predicting reaction performance and the other for predicting enantioselectivity21. Following these principles, we selected four previous representative studies as complete experimental reaction data: Buchwald-Hartwig amination (BHA, C-N)3 and Suzuki-Miyaura coupling reactions (SMC, C-C)15 were used to predict yields; asymmetric N,S-acetal formation (ANSA, C-S)4 and asymmetric hydrogenation (AHR)5 were used to predict the reaction enantiomeric excess.
Although an even distribution of yields or enantioselectivities across the range of each data set is preferable for model training, given the limited data availability no trimming based on yield or ee was carried out for the four data sets. Detailed statistics of the components are labeled in blue for each reaction (Fig. 1a-d). In BHA, we selected 4037 reactions performed via HTE and took account of four components: aryl halides, ligands, bases, and solvents. Interestingly, the reaction yields were found to be centred below 40%, especially when four additives (benzo[c]isoxazole, ethyl 5-methylisoxazole-4-carboxylate, ethyl isoxazole-4-carboxylate, and methyl isoxazole-5-carboxylate) participated in the reactions (Fig. 1a). The SMC data set was created by Sach and co-workers on an automated flow-based synthesis platform and was not purposely designed for an ML task15. Its component categories are the most complicated of the four data sets, comprising five components (boronic acids, aryl halides, ligands, bases, and solvents), which makes it the most challenging data set for machine learning models. From this set, we excluded all reactions with yields of 0% and some reactions with yields between 0 and 10%, finally selecting 4139 reactions. The yield distribution of SMC is similar to that of BHA, but no compounds act as severe reaction poisons (Fig. 1b). Unlike the task of predicting reaction yields, in asymmetric reactions the machine learning models must precisely predict the R/S configuration together with its enantioselectivity. Here we set the R configuration to positive (+) and the S configuration to negative (−), so that the predicted result is a score between −100% and +100%. Both asymmetric catalytic reaction data sets have a high percentage of reactions with over 70% ee, but ANSA has few reactions with low ee or in the < 0% ee range representing the S configuration (Fig. 1c). The AHR data set contains fewer reactions than the other three, and previous work predicted only the absolute value of the reaction enantioselectivities. We found that, under our sign convention, the number of AHR reactions with ee below −70% is comparable to the number above +70%, with few reactions in the middle of the range. Considering the relatively small size of the AHR data set, we considered it necessary to combine unsupervised learning methods to expand the training data (Fig. 1d).
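To make the sign convention above concrete, the following minimal sketch (with hypothetical column names, not those of the actual data files) shows how an (ee magnitude, R/S configuration) pair can be folded into a single signed target in [−100, +100]:

```python
import pandas as pd

# Hypothetical example: encode (ee magnitude, R/S configuration) pairs as a
# single signed target, with R mapped to positive and S to negative values.
df = pd.DataFrame({
    "ee_percent": [92.0, 85.5, 64.0],
    "configuration": ["R", "S", "R"],
})
sign = df["configuration"].map({"R": 1.0, "S": -1.0})
df["signed_ee"] = sign * df["ee_percent"]
print(df)
#    ee_percent configuration  signed_ee
# 0        92.0             R       92.0
# 1        85.5             S      -85.5
# 2        64.0             R       64.0
```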
Meanwhile, to evaluate the complexity of these four data sets, MACCS keys were extracted and the Tanimoto coefficient was calculated as the standard measure of molecular similarity22,23. Figure 1e-h shows spider plots reflecting the complexity of each category in every data set; a high score for a category indicates high intra-category similarity. Figure 1i summarizes the scores for each data set (Eq. (1)). Comparing the scores, SMC turned out to be the most complex data set, followed by AHR and BHA, whereas ANSA had the most similar molecules in each category. With these complexity scores, chemists can preliminarily judge how well machine learning models will fit the corresponding data set. Generally, the more complex the data set, the more difficult it is to make accurate predictions, and the worse the fit.
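As an illustration of this similarity measure, the sketch below computes MACCS keys with RDKit and averages the pairwise Tanimoto coefficients within one component category. The SMILES strings are placeholders, and the mean pairwise similarity is one plausible category score; the manuscript's Eq. (1) may aggregate the per-category scores differently:

```python
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys

# Example molecules standing in for one component category (e.g. bases).
smiles = ["CCN(CC)CC", "CCCCN(CCCC)CCCC", "c1ccncc1"]
fps = [MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s)) for s in smiles]

# Mean pairwise Tanimoto similarity as a simple intra-category score.
pairs = list(combinations(fps, 2))
score = sum(DataStructs.TanimotoSimilarity(a, b) for a, b in pairs) / len(pairs)
print(f"intra-category similarity score: {score:.3f}")
```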
Extraction and analysis of chemical descriptors. To generate machine-readable inputs, we need to quantify the physicochemical properties of candidate molecules, also called chemical descriptors. Although there is some research on feature extraction using unsupervised learning24–27, the mainstream approach is still to calculate the relevant properties with DFT or to use software libraries such as RDKit28 to generate feature matrices such as molecular fingerprints. Because molecular fingerprints are encoded with a one-hot scheme, they cannot reflect subtle changes, in contrast to the continuous variables calculated by DFT29,30. In view of this, although the computational cost of DFT is higher, DFT-based molecular feature engineering is the preferred method, especially when it is difficult to obtain sufficient experimental data to warrant implementation of the ML models.
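The contrast can be seen in a short sketch: between two close analogues, an RDKit bit-vector fingerprint only flips discrete bits, whereas DFT-derived descriptors vary continuously. The HOMO/LUMO values below are illustrative placeholders, not computed here:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

# Two close analogues: a binary fingerprint only flips whole bits between them.
mol_a = Chem.MolFromSmiles("c1ccccc1O")   # phenol
mol_b = Chem.MolFromSmiles("c1ccccc1N")   # aniline
fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=1024)
fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=1024)
diff = np.array(fp_a) != np.array(fp_b)
print(f"bits differing: {diff.sum()} of 1024")  # discrete 0/1 jumps only

# DFT descriptors, by contrast, are continuous; these (HOMO, LUMO) values
# in eV are illustrative placeholders.
homo_lumo = {"phenol": (-6.1, 0.3), "aniline": (-5.6, 0.6)}
```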
Following existing DFT-based feature engineering, current methods are mainly built on partial features of molecules. Doyle et al. focused on the properties of the skeleton shared by different molecules, such as the charges or NMR shifts of skeleton atoms3. Sunoj and co-workers additionally considered the steric hindrance of substituents on the molecular skeleton5. Such partial molecular information can describe the reaction space of the respective data sets well, but it does not generalize to other reactions. Because of this dependence on molecular skeletons, chemists need to redesign the feature extraction process for molecules with different skeletons. Denmark and co-workers extracted the conformation and space occupation of molecules; although the spatial information of the whole molecule was included, the feature dimension is very large and the extraction process is complex. Therefore, we used easily obtainable and more universal descriptors to predict reaction performance and enantioselectivity.
We divided the chemical descriptors into two parts: molecule-level descriptors (MLD) and atom-level descriptors (ALD). MLD, which distinguish the global properties of different molecules, comprise molecular volume, single-point energy, HOMO, LUMO, and dipole. ALD include atomic number, three-dimensional coordinates, Mulliken charges, NPA charges, and NMR shielding. By inputting ALD in accordance with the atomic sequence of the input file, the chemical properties of atoms can be reconstructed in the ML models. Boobier and co-workers also extracted atomic features when predicting the solubility of solutes in different solvents, but the features of identical elements were finally summed into descriptors31. We think this method loses atomic information, because atoms of the same element occupy different positions in the molecular structure. Likewise, aggregating the features does not make full use of the learning ability of the ML models, whereas ALD adapt well to graph neural networks32. Within ALD, we deliberately extracted the three-dimensional coordinates of all atoms, which are usually ignored in feature engineering, because three-dimensional features such as steric hindrance or dihedral angles correlate strongly with atomic coordinates. The complete extraction process can be implemented with Gaussian 09 alone, driven by an automated Python script. It is worth noting that compounds belonging to one category in a data set must be aligned to the molecule with the most atoms in that category to obtain uniform-length structural descriptors. Finally, the ALD and MLD of each compound are concatenated as input to the ML models.
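A minimal sketch of this assembly step follows, assuming (which the text does not specify) that alignment is realized by zero-padding every molecule's atom-property matrix to the largest atom count in its category before flattening and appending the MLD vector:

```python
import numpy as np

N_PROPS = 7     # atomic number, x, y, z, Mulliken, NPA, NMR shielding
MAX_ATOMS = 30  # atom count of the largest molecule in this category

def featurize(ald: np.ndarray, mld: np.ndarray) -> np.ndarray:
    """Pad the (n_atoms x N_PROPS) ALD matrix, flatten, and append MLD."""
    padded = np.zeros((MAX_ATOMS, N_PROPS))
    padded[: ald.shape[0]] = ald  # zero-pad to the category maximum
    return np.concatenate([padded.ravel(), mld])

ald = np.random.rand(12, N_PROPS)  # a 12-atom molecule (placeholder values)
mld = np.array([105.2, -402.7, -6.3, -0.9, 2.1])  # volume, E, HOMO, LUMO, dipole
x = featurize(ald, mld)
print(x.shape)  # (30*7 + 5,) = (215,)
```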
To analyze these descriptors, we first carried out a correlation analysis of MLD and ALD33. In Fig. 3a-d, exceptional autocorrelation was observed between the MLD. Among all atoms in the four data sets, 7 features (the coordinates are split into coordinate-x, coordinate-y, and coordinate-z) show acceptable autocorrelation in Fig. 3e-h. The correlation between the Mulliken charge and the NPA charge verifies the effectiveness of the correlation analysis: these two features are essentially the same property calculated by different methods, so they are strongly correlated34. Meanwhile, because ALD are independent variables that vary continuously in value, they can reflect subtle differences between atoms of the same element. In the ML models, the descriptors are ordered by the atomic sequence of each molecule, so the properties of each atom correspond to the element represented by the atomic number. The ML models can thus learn the properties of every atom during training, without being limited by the molecular skeleton or other predefined structures.
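The autocorrelation check can be reproduced with a standard correlation matrix over descriptor columns; the sketch below assumes Pearson correlation (the method behind ref. 33 is not stated) and uses synthetic placeholder data, with the mulliken/npa pair constructed to mimic the behavior discussed above:

```python
import numpy as np
import pandas as pd

cols = ["Z", "x", "y", "z", "mulliken", "npa", "nmr"]
rng = np.random.default_rng(0)
ald_table = pd.DataFrame(rng.normal(size=(500, len(cols))), columns=cols)
# The two charge schemes describe the same property, so make them correlated.
ald_table["npa"] = 0.9 * ald_table["mulliken"] + 0.1 * rng.normal(size=500)

corr = ald_table.corr(method="pearson")
print(corr.loc["mulliken", "npa"])  # high, as the two charge schemes agree
```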
Principal component analysis (PCA) reduces the feature dimension and displays it in two-dimensional or three-dimensional images, making it easier for chemists to observe the distribution of data sets in descriptor space, de-noise the data, and pick out representative reactions effectively35. Figure 3i-l shows the distribution of the four data sets, colored by yield% or ee%, in the first three principal components of the MLD. BHA, SMC, and ANSA display acceptable feature distributions: similar reactions are effectively clustered together in descriptor space, and there is no obvious noise. AHR is more complex, with a scattered overall reaction distribution; however, highly selective reactions (red or blue) still show a tendency to aggregate.
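A minimal sketch of this projection, assuming scikit-learn's PCA on a placeholder MLD matrix (one row per reaction):

```python
import numpy as np
from sklearn.decomposition import PCA

# Project MLD vectors onto the first three principal components, as used for
# the distribution plots; X is a placeholder (n_reactions x n_mld_features).
X = np.random.rand(200, 5)
pca = PCA(n_components=3).fit(X)
coords = pca.transform(X)
print(coords.shape, pca.explained_variance_ratio_)
```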
Model training and selection. In Fig. 4c, random forest and neural network models trained on the ANSA data set achieve R2 values of 0.93 and 0.95, respectively, the most substantial improvement. The remaining four algorithms (AdaBoost, linear regression, support vector machine, and k-nearest neighbors) provide no improvement on the three data sets. Further observation showed that, as the number of reactions and the complexity of the data sets increase (Fig. 1), using MLD alone led to worse fitting of the ML models, although the R2 of the neural network always remained above 0.90, so MLD can describe global molecular properties notably well. It should be noted that the distribution of yield or ee% induces a prediction tendency in the ML models. In BHA, the overall yield distribution lies in a lower range, and Fig. 4a shows that the model tends to predict low yields. In ANSA, there are more reactions with high ee%, and Fig. 4c reflects that the predicted ee% values are on the high side overall. These results indicate that some compounds in these reactions are clearly associated with the same outcome; with this in mind, the ML models only need to capture some exclusive information to make accurate regression analyses. Accordingly, when the complexity of the data increases (Fig. 1i), MLD can no longer satisfy the learning of the ML models. In this situation, it is necessary to increase the complexity of the reaction descriptors, and adding ALD supplies the required quantitative descriptors.
When employing ALD, the fitting of all ML models improved significantly on ANSA, BHA, and SMC compared to MLD (Fig. 5a-d). The dimension of ALD is much higher than that of MLD, which highlights the contribution of the properties of different molecules to yield and enantioselectivity. In conclusion, when a data set is not very complex, we extract MLD to train the ML model first and then decide whether to enrich the descriptors according to the training results. When using a deep learning model, such as a deep neural network, we recommend adding ALD to match the parameter capacity of the network. When weighing training speed against prediction quality, the outcome depends mainly on the extraction of descriptors, followed by the selection of the algorithm. To sum up, we employed all molecular and atomic descriptors for the following tasks. Considering that the architecture and parameters of a neural network can be flexibly adjusted for different tasks and data sets, we chose the feed-forward neural network to maximize extrapolative ability among the candidate ML models.
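The algorithm comparison can be reproduced schematically as below, with synthetic data standing in for the descriptor matrices and targets; the hyperparameters are illustrative scikit-learn defaults, not the settings used in this work:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Placeholder data: X would be the MLD/ALD/MALD matrix, y the yield or signed ee.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 50))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "RF": RandomForestRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "Linear": LinearRegression(),
    "SVM": SVR(),
    "KNN": KNeighborsRegressor(),
    "NN": MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=2000, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, f"R2 = {r2_score(y_te, model.predict(X_te)):.2f}")
```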
Prediction with both atomic and molecule level descriptors. The suitability of the combined molecular and atomic level descriptors (MALD) was compared with that of the previous two descriptor sets. Here, the calculation protocol and machine learning methods are the same as before, but a combination of atomic and molecular level descriptors represents each compound, rather than a single descriptor set. As can be seen from Fig. 6a-c, the neural network showed the most significant improvement with the MALD descriptor, and its R2 reached 0.97–0.98 on all three data sets. In view of the outstanding performance of the neural network, we further analyzed its use of the three descriptor sets on all data sets (Fig. 6f). We found that, with the best-performing neural network, MALD was always better than ALD, and ALD was better than MLD. This leads us to believe that richer descriptors enable neural networks to automatically extract more useful molecular properties, and that the integrated MALD has the potential to become a general descriptor for a variety of chemical prediction tasks.
At the same time, the results of the models using the ALD descriptor on all data sets are closer to those of the full-feature MALD descriptor. The reason may be that ALD represent atomic level features beyond the global information of molecules, which is crucial for describing stereochemical information. To verify this conjecture, we compared the performance of all machine learning methods on MLD, ALD, and MALD, and found that the methods showed extremely consistent performance trends on ALD and MALD, which differed significantly from MLD (Fig. 5d, Fig. 6d and Fig. 7e). These two distinct trends prove that the performance of the models on all data sets is mainly driven by ALD, and that the complete reconstruction of molecular three-dimensional information is very important for predicting and optimizing chemical reactions.
We then paid particular attention to the ability of the descriptors to describe chemical spaces of different complexity, which is the key factor in determining whether a machine learning protocol is universal across chemical prediction tasks. Because the complexity of the data sets directly affects the performance of the descriptors, we selected three data sets with representative complexities. Specifically, the BHA data set has minimal complexity: a small chemical space (only 15 halogenated aromatics, 4 catalysts, and 23 additives) and up to 4037 reactions. Not surprisingly, MLD, ALD, and MALD all performed well on this data set, although MALD was outstanding. However, the differences among the models on ANSA, which has a small number of reactions, drew our attention: when the atomic level features were added to the descriptor, the performance of all models improved (for example, the R2 of the SVM method increased sharply from 0.25 to 0.89, Figs. 4c and 5c). The SMC data set, with the largest complexity, offers an even clearer hint: ALD and MALD enable machine learning methods, especially RF and NN, to achieve very high performance in a large chemical space. The MALD descriptor thus demonstrated satisfactory performance and universality on these three representative data sets. More promisingly, our MALD outperformed the state-of-the-art works on these data sets (see rows 1–3 in Table 1).
Fig. 6f Comparison of MALD, ALD, and MLD on all data sets (training with the neural network).
Leave one molecule out prediction. Chemists pay particular attention to ML model predictions for unknown molecules and reactions, because they help synthetic chemists use the data already in hand for more effective experimental design. This places higher demands on the generalization ability of a machine learning model. Żurański and co-workers proposed a cross-validation method called leave-one-molecule-out, which helps evaluate the ability of machine learning models to explore unknown chemical spaces9. Here we adopted this test: the training set contained none of the reactions involving the held-out molecule, and all reactions of that molecule were placed in the test set, in order to evaluate the extrapolation ability of ML models trained in a known chemical reaction space. The neural network was still used for prediction, but to enhance the generalization ability of the model, we added dropout with a probability of 0.5 to each hidden layer of the network.
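A sketch of this protocol using scikit-learn's LeaveOneGroupOut on synthetic data, with a random forest standing in for the paper's dropout network (scikit-learn's MLP does not expose dropout):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneGroupOut

# Every reaction involving the held-out molecule goes to the test fold;
# the data and the molecule-to-reaction mapping here are synthetic.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))
y = 10 * X[:, 0] + rng.normal(size=300)
molecule_ids = rng.integers(0, 15, size=300)  # which molecule each reaction uses

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=molecule_ids):
    model = RandomForestRegressor(random_state=0).fit(X[train_idx], y[train_idx])
    rmse = mean_squared_error(y[test_idx], model.predict(X[test_idx])) ** 0.5
    print(f"molecule {molecule_ids[test_idx][0]}: RMSE = {rmse:.2f}")
```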
Figure 7a-c depicts the model's prediction for each molecule in the data set. The black dotted line in the figure represents the model's result under 5-fold cross-validation and serves as the standard RMSE value for evaluation. The standard RMSEs of BHA, SMC, and ANSA are 4.6, 4.4, and 4.3, respectively. Each black spot corresponds to the RMSE obtained by leave-one-molecule-out prediction. The neural network's prediction for an unknown molecule can be evaluated by how far each molecule deviates from the standard RMSE value.
In BHA (Fig. 7a), the neural network makes accurate extrapolative predictions for all unknown additives and aryl halides in the data set: the RMSEs of the 21 additives and 9 aryl halides are below 4.5. The RMSEs of molecules among the bases and ligands show a large, right-shifted deviation. This is mainly because of the large number of reactions involving these two categories (the reactions of compounds in each category account for 24.7% and 32.7% of the total) and because the similarity score of the bases is the lowest (Fig. 1e). The reactions of bases and ligands are therefore more complex than those of additives and aryl halides, but the neural network still offers a high level of accuracy, with all RMSE values below 8.0. In ANSA (Fig. 7c), the neural network predicted the two categories of reactants more accurately, and the results for imines are better than for mercaptans. The leading reason is that the structural variation among the mercaptans is more pronounced than among the imines, and the similarity score of the mercaptans is lower. Nevertheless, the neural network tends to encounter predictive limitations when applied to substrates with embedded catalysts in the ANSA data set: it shows a large deviation in extrapolative accuracy, and the same occurs for the aryl halides and boronic acids of SMC (Fig. 7b). Evaluating the composition of these two data sets, we found that although the similarity score of ANSA is higher than that of SMC, LOMOP shows the same trend in both. We therefore hypothesized that this problem is exacerbated by the presence of activity cliffs, where some catalysts can be a source of deleterious side reactivity36. For example, in ANSA, the ee% of the catalysts whose prediction RMSE exceeded 30 was almost always below 60%. Even though the catalysts have the highest similarity score in ANSA, the neural network cannot predict them accurately: because the overall ee% of ANSA is mostly above 70%, the ML models struggle to identify these unusual data, a deviation that compound similarity cannot capture. In BHA, since each category of compound shares the same molecular skeleton, the neural network only needs to predict yields as the substituents change. Interestingly, in SMC the aryl halides and boronic acids have the highest similarity scores (Fig. 1b), differing only in their substituents; however, when these molecules were picked as the test set, the prediction accuracy was low. In contrast, the SMC molecules among ligands, bases, and solvents, which have low similarity scores (Fig. 1b), achieved low RMSE scores. When an aryl halide or boronic acid with a modest structural change is removed, the entire reaction is left out at the same time. As a result, the reaction performance is determined not only by the substituent differences of the aryl halides or boronic acids but also by the ligand, base, and solvent, and the influence of the latter dominates, so the neural network predicts inaccurately when dealing with reactions comprising new compounds. The first two principal components of the full molecular space with MLD also reveal distinct regions of high, medium, and low accuracy (Fig. 7d-f): the red points are not accurate at all, the yellow and green points are more accurate, and the blue points are the most accurate. Molecules of the same category are grouped into clusters with high intra-class similarity.
This 2-D chemical space can be used to explore unknown reaction combinations by selecting and optimizing the molecules with the best LOMOP results.
Sparse data set prediction. As mentioned above, the ALD and MALD descriptors bring obvious benefits on small data sets such as ANSA, but in practice we face a more serious problem of insufficient reaction data, namely small data sets spanning a large chemical space, which we call sparse data sets. Most data sets owned by experimentalists are sparse and cannot be augmented through chemical experiments because of time and cost constraints. To further improve the ability of our machine learning protocol to deal with sparse data sets, we explored data augmentation based on MALD.
Here, we used a k-nearest-neighbor-based SMOTE algorithm to augment the MALD descriptor data of existing sparse data sets37. We selected a typical sparse reaction data set, AHR, which contains only 334 reactions but a large number of reactant molecules (127 olefins, 63 imines, 58 ligands, and 12 solvents) (Fig. 1d). The chemical space of this data set is complex and the amount of data is small, so predicting its enantioselectivity is very difficult. The algorithm randomly selects N samples from the k-nearest-neighbor data for random linear interpolation, creating new minority-class samples. Specifically, it generates new data in three steps, as sketched below: (1) for each sample, the Euclidean distance to all other samples is calculated to obtain its k nearest neighbors; (2) some samples are randomly selected from the k nearest neighbors of each sample according to a preset sampling rate; (3) random linear interpolation is performed between each selected neighbor and the original sample to generate a new sample.
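A minimal sketch of these three steps operating directly on MALD rows follows; interpolating the ee label with the same coefficient is an assumption of this sketch, and all data are placeholders:

```python
import numpy as np

def smote_like_augment(X, y, k=5, n_new=100, seed=0):
    """Three-step SMOTE-style interpolation over descriptor rows."""
    rng = np.random.default_rng(seed)
    # (1) Euclidean distances to all other samples -> k nearest neighbours.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]
    # (2) Pick a random base sample and one of its neighbours per new point.
    base = rng.integers(0, X.shape[0], size=n_new)
    nbr = knn[base, rng.integers(0, k, size=n_new)]
    # (3) Random linear interpolation between each pair; the label is
    # interpolated with the same coefficient (an assumption of this sketch).
    lam = rng.random((n_new, 1))
    X_new = X[base] + lam * (X[nbr] - X[base])
    y_new = y[base] + lam[:, 0] * (y[nbr] - y[base])
    return X_new, y_new

rng = np.random.default_rng(0)
X = rng.random((334, 30))             # placeholder MALD rows, one per reaction
y = rng.uniform(-100, 100, size=334)  # placeholder signed ee values
X_new, y_new = smote_like_augment(X, y, n_new=252)
print(X_new.shape, y_new.shape)       # 252 synthetic reactions
```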
Sunoj and co-workers5 augmented the AHR data set from 368 to 586 reactions. For comparison, we used the SMOTE algorithm based on MALD to create the same amount of reaction data. The augmented AHR data set was then applied to various machine learning methods, in which 80% of the reaction data was randomly selected as the training set and the remaining 20% was used as the test set. With the help of MALD and the data augmentation algorithm, all machine learning methods achieved good results, and the R2 of four methods reached 0.90–0.97. As a control, we trained and tested on the original AHR data set in the same way. We found that our MALD-based data augmentation algorithm significantly improved the performance of all machine learning methods (Fig. 8b) and outperformed the state-of-the-art work on the AHR data set (Table 1, row 4). Therefore, with the help of SMOTE, MALD can explore the unknown feature space of chemical reactions well and provide more useful information for machine learning methods.
We noted that Sunoj and co-workers5 described only the absolute value of the enantiomeric excess, because the AHR data set is scarce and difficult to predict. To further explore the effectiveness of our MALD-based data augmentation algorithm, we carried out the prediction of the R and S stereoisomers at chiral centers. We constructed a new data set from AHR, in which the data for each reaction are the reactants and the R or S stereoisomer of the reaction product; the R and S labels were obtained from the sign of the enantiomeric excess. Surprisingly, our MALD-based data augmentation algorithm enabled all machine learning methods to predict the R and S stereoisomers very well, with AUCs reaching 0.95–1 (Fig. 8c). The decision boundary (Fig. 8d) and decision probability (Fig. 8e) after principal component analysis also show the effectiveness of the machine learning method based on MALD and SMOTE.
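Schematically, this classification task reduces to training on sign-derived labels and scoring with ROC AUC, as in the following sketch on synthetic data (a random forest classifier stands in for the various methods compared in Fig. 8c):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Labels come from the sign of the signed ee (R -> 1, S -> 0);
# the features and the signal here are synthetic placeholders.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 30))
signed_ee = 40 * X[:, 0] + 10 * rng.normal(size=500)
y = (signed_ee > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUC = {auc:.2f}")
```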