Organic synthesis is the foundation for the development of the life sciences, such as pharmaceutics and chemical biology1,2. For decades, the discovery of chemical reactions has been driven by serendipitous intuition stemming from expertise, experience and mechanistic exploration3. However, professional chemists sometimes have a hard time predicting whether a specific substrate can indeed undergo a desired transformation, even for some well-established reactions4,5. When optimizing reaction yield or selectivity, small changes in reaction factors, including catalysts, temperature, ligands, solvents and additives, may result in outcomes that deviate from the intended target. Thus, scientists have sought to develop computational models6–8 to explore the relationships between reaction factors and outcomes.
One of the crucial parts of modelling is finding an appropriate representation of a chemical reaction. Quantum-mechanics (QM) based descriptors, which represent electrostatic or steric characteristics and are calculated by density functional theory (DFT) or semi-empirical methods9–12, are frequently used for modelling. Doyle et al.13 utilized QM-derived descriptors to build a random forest model, which achieved good prediction performance for the Buchwald-Hartwig cross-coupling of aryl halides with 4-methylaniline. Sigman et al.14 defined four important DFT parameters to capture the conformational dynamics of the ligands, which were fed into multivariate regression modelling to correlate ligand properties with relative free energy. Denmark et al.15 generated a set of three-dimensional QM descriptors to develop an AI-based model for enantioselectivity prediction. Applying QM descriptors to modelling offers the advantage of model interpretability, but it usually requires a deep understanding of reaction mechanisms, and the resulting descriptors may be difficult to transfer to other reaction prediction tasks. Another popular kind of descriptor is the so-called reaction fingerprint. Glorius and co-workers16 developed multiple fingerprint features (MFFs) as molecular descriptors, by concatenating 24 different fingerprints, to predict the enantioselectivities and yields for different experimental datasets. Although good results were observed, this method can be time- and resource-intensive, as a single molecule was represented as a 71,374-bit array. Reymond et al.17 reported a molecular fingerprint called the differential reaction fingerprint (DRFP), which takes reaction SMILES as input and embeds them into an arbitrary binary space via set operations followed by hashing and folding, to perform reaction classification and yield prediction.
Although reaction fingerprints are easy to build, they may lose certain chemical information because of their limited set of predefined substructures; a task-specific representation that can be learned from the dataset is therefore needed.
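To make the "set operations, hashing and folding" idea concrete, the following is a minimal toy sketch and not the DRFP implementation. It assumes a reaction SMILES of the form `reactants>>products` with no agents, and it uses character n-grams of the SMILES strings as a crude stand-in for the substructures that a cheminformatics toolkit would enumerate. The function names `ngrams` and `toy_reaction_fingerprint` and the 64-bit length are hypothetical choices made for illustration.

```python
import hashlib


def ngrams(smiles, max_n=3):
    """Toy 'substructure' set: all character n-grams of length 1..max_n.

    Assumption: real fingerprints enumerate chemical substructures;
    character n-grams are only a stand-in for illustration.
    """
    return {
        smiles[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(smiles) - n + 1)
    }


def toy_reaction_fingerprint(reaction_smiles, n_bits=64):
    """Fold the fragments that differ between reactants and products into bits."""
    reactants, products = reaction_smiles.split(">>")
    reactant_set = set().union(*(ngrams(s) for s in reactants.split(".")))
    product_set = set().union(*(ngrams(s) for s in products.split(".")))

    # Set operation: keep fragments present on only one side of the reaction.
    changed = reactant_set ^ product_set

    bits = [0] * n_bits
    for fragment in changed:
        # Hashing: map each fragment to a reproducible integer.
        digest = hashlib.sha256(fragment.encode("utf-8")).digest()
        # Folding: reduce the integer to an index in the fixed-length vector.
        index = int.from_bytes(digest[:4], "big") % n_bits
        bits[index] = 1
    return bits


if __name__ == "__main__":
    # Esterification of ethanol with acetic acid, written without agents.
    fp = toy_reaction_fingerprint("CCO.CC(=O)O>>CC(=O)OCC.O")
    print(len(fp), sum(fp))
```

Because the fragment vocabulary is fixed by the enumeration rule rather than learned, any chemistry that the rule does not capture is invisible to the fingerprint, which is the limitation noted above.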
One possible solution to the issue of universal reaction descriptors is to apply graph neural networks (GNNs) to reaction prediction tasks18,19. Owing to their powerful capacity for modelling graph data, GNNs have recently become one of the most popular AI methods and have achieved remarkable prediction performance on several tasks20–23. Various graph-based models, such as the graph convolutional network (GCN)21,24, GraphSAGE25, the graph attention network (GAT)26 and the message passing neural network (MPNN)27, have been proposed to learn a function that maps the entire input graph to molecular properties, by either directly applying a weight matrix to the graph structure or using a message passing and aggregation procedure to update node features iteratively. A molecule is regarded as a graph, where atoms are treated as nodes and bonds are treated as edges. Node and edge features are influenced by proximal ones, and these features are learned and aggregated to form the embedding of the entire molecular graph28,29. In this work, we propose a modified communicative message passing neural network (GraphRXN), which is used to generate reaction embeddings for reaction modelling without using predefined fingerprints. For chemical reactions comprising multiple components, reaction features can be built up by aggregating the embeddings of these components and correlated to the reaction output via a dense-layer neural network.
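The message passing and aggregation procedure described above can be sketched in a few lines of plain Python. This is an illustrative simplification and not the GraphRXN architecture: it has no learned weights, edge features or dense output layer, and the sum-based update, the sum readout and the summation over reaction components are all assumptions chosen for brevity. The function names and the one-hot `[is_carbon, is_oxygen]` atom features are hypothetical.

```python
def message_passing(node_features, edges, steps=2):
    """Update each node by adding the summed features of its neighbours."""
    neighbours = {i: [] for i in range(len(node_features))}
    for a, b in edges:
        # Bonds are undirected, so each edge links both atoms.
        neighbours[a].append(b)
        neighbours[b].append(a)

    hidden = [list(f) for f in node_features]
    for _ in range(steps):
        updated = []
        for i, own in enumerate(hidden):
            # Message: element-wise sum over the neighbours' current vectors.
            message = [
                sum(hidden[j][k] for j in neighbours[i])
                for k in range(len(own))
            ]
            # Update: add the message to the node's own vector.
            updated.append([h + m for h, m in zip(own, message)])
        hidden = updated
    return hidden


def readout(hidden):
    """Sum all node vectors into a single molecule embedding."""
    return [sum(column) for column in zip(*hidden)]


def reaction_embedding(components, steps=2):
    """Sum the molecule embeddings of all reaction components."""
    embeddings = [
        readout(message_passing(features, edges, steps))
        for features, edges in components
    ]
    return [sum(column) for column in zip(*embeddings)]


if __name__ == "__main__":
    # Heavy-atom graphs with one-hot [is_carbon, is_oxygen] node features.
    ethanol = ([[1, 0], [1, 0], [0, 1]], [(0, 1), (1, 2)])  # C-C-O
    water = ([[0, 1]], [])                                   # O
    print(reaction_embedding([ethanol, water], steps=1))
```

In a trained model, the update and readout would involve learnable parameters, and the resulting reaction vector would be passed to a regression or classification head rather than used directly.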
Another major challenge for reaction prediction is access to high-quality data30,31. Although large amounts of data have been accumulated, the bias toward positive results in the literature leads to unbalanced datasets. Moreover, extracting valid large-scale data from the literature requires substantial human intervention32,33. High-throughput experimentation (HTE) is a technique that can perform a large number of experiments in parallel34,35. HTE could serve as a powerful tool for advancing AI chemistry, as it can significantly increase experimental throughput while ensuring data integrity and consistency. With this technology, several high-quality reaction datasets have been reported30, including Buchwald-Hartwig amination13,36,37, Suzuki coupling38–40 and photoredox-catalyzed cross-coupling41. These datasets contain both successful and failed reactions, which is critical for building forward reaction prediction models. Three public HTE datasets were used as proof-of-concept studies for our method, and encouraging results were obtained. As further verification, we used our in-house HTE platform to generate data for the Buchwald-Hartwig cross-coupling reaction. The GraphRXN methodology was then applied to the in-house dataset and a reasonably accurate prediction model was obtained (R² of 0.713), which highlights that our method can be integrated with a reaction robotics system for reaction prediction. We expect that deep-learning-based methods like GraphRXN, combined with the data-on-demand reaction machine, could potentially push the boundary of reaction methodology development42,43.