Multi-task mixture density DimeNet++
The Directional Message Passing Neural Network 38, or DimeNet for short, is one of the state-of-the-art GNN methods for predicting the properties of molecules and materials. By default, DimeNet comprises one Embedding Block (EB), which constructs initial directional edge features (or embeddings) from the node features, and six Interaction Blocks (IBs), which update and aggregate the edge embeddings. DimeNet extends the concept of message passing from node embeddings to directional edge embeddings, thereby enhancing the ability to differentiate among various molecules. As an improved version of DimeNet, DimeNet++ 39 drastically reduces the required training time while maintaining high accuracy by replacing the bilinear layer with element-wise multiplication and adding more multilayer perceptrons (MLPs). Additionally, the number of IBs is reduced from six to four.
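To make this distinction concrete, the sketch below contrasts a bilinear edge-message update (DimeNet-style) with the element-wise variant adopted by DimeNet++. This is illustrative PyTorch code written for this text, not the reference implementation; the tensor names, shapes, and initialization are assumptions.

```python
import torch
import torch.nn as nn

class EdgeMessageBilinear(nn.Module):
    """DimeNet-style message: a bilinear tensor couples the edge embedding
    with the spherical basis (O(num_basis * emb_size^2) weights)."""
    def __init__(self, emb_size, num_basis):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_basis, emb_size, emb_size) * 0.01)

    def forward(self, x_kj, sbf):
        # x_kj: [n_triplets, emb_size], sbf: [n_triplets, num_basis]
        return torch.einsum('tb,bij,ti->tj', sbf, self.W, x_kj)

class EdgeMessageHadamard(nn.Module):
    """DimeNet++-style message: project the basis with a plain linear layer
    and combine with the edge embedding by element-wise multiplication."""
    def __init__(self, emb_size, num_basis):
        super().__init__()
        self.lin_sbf = nn.Linear(num_basis, emb_size, bias=False)

    def forward(self, x_kj, sbf):
        return x_kj * self.lin_sbf(sbf)
```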
The architecture of our proposed model, named multi-task mixture density (MT-MD) DimeNet++, is illustrated in the right panel of Fig. 1a, with DimeNet++ shown on the left for comparison. The model is designed under the MT learning framework and comprises two branches for two distinct but related objectives. The main branch (MB) predicts the adsorption energy, while the structure relaxation branch (SRB) predicts the change of atomic positions after relaxation in the form of bond length changes. It is reasonable to assume that the changes of pair-wise distances among atoms follow a distribution rather than a set of fixed values, so MDNs 56, artificial neural networks specially designed to approximate the conditional distribution of labels, are applied in the SRB. Additionally, the MB and SRB are separated into two branches at the beginning of the architecture, and cross-stitch units (CSUs) 58, a state-of-the-art MT learning technique, are applied to connect the two branches, so that a suitable network architecture is learned for each task and possible negative transfer is prevented.
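As a rough illustration of how a CSU couples the two branches, the sketch below implements the simplest form of a cross-stitch unit: a learnable 2x2 matrix that linearly recombines activations from the MB and SRB. This is a hypothetical PyTorch sketch (scalar mixing weights and near-identity initialization are our assumptions); the unit used in MT-MD DimeNet++ may mix features at a finer granularity.

```python
import torch
import torch.nn as nn

class CrossStitchUnit(nn.Module):
    """Learnable 2x2 mixing of activations from the two task branches
    (main branch MB and structure relaxation branch SRB)."""
    def __init__(self):
        super().__init__()
        # Initialize close to identity so each branch starts mostly independent.
        self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1],
                                                [0.1, 0.9]]))

    def forward(self, x_mb, x_srb):
        # x_mb, x_srb: activations of the same shape from the two branches.
        out_mb = self.alpha[0, 0] * x_mb + self.alpha[0, 1] * x_srb
        out_srb = self.alpha[1, 0] * x_mb + self.alpha[1, 1] * x_srb
        return out_mb, out_srb
```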
Benchmarks on basic adsorption energy prediction tasks
A dataset containing 3075 Cu-based SAAs and the corresponding \({E}_{CO}\) values 17 is used in this work, in which 41 different element species are selected as dopants. By varying the surface index, the doping position, and the adsorption site of the CO molecule, 75 distinct structures are constructed for each doping species, composed of 6, 8, 8, 27, and 26 structures on the Cu(100), Cu(110), Cu(111), Cu(210), and Cu(411) surfaces, respectively. The whole dataset is randomly split 10 times, with 60% used as the training set, 20% as the validation set, and the remaining 20% as the test set. Every model tested in this section is trained and evaluated on the same 10 splits of the dataset to enable a fair comparison.
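The repeated 60/20/20 split can be reproduced along the following lines (a minimal sketch assuming the 3075 samples are indexed consecutively; the seed handling is an assumption).

```python
import numpy as np

def make_splits(n_samples=3075, n_repeats=10, ratios=(0.6, 0.2, 0.2)):
    """Return a list of (train_idx, val_idx, test_idx) tuples, one per repetition."""
    n_train = int(ratios[0] * n_samples)
    n_val = int(ratios[1] * n_samples)
    splits = []
    for seed in range(n_repeats):
        rng = np.random.default_rng(seed)
        perm = rng.permutation(n_samples)
        splits.append((perm[:n_train],
                       perm[n_train:n_train + n_val],
                       perm[n_train + n_val:]))
    return splits
```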
Two fundamental tasks are depicted in Figs. 1b and 1c, which employ initial and relaxed structures, respectively, as inputs to GNNs to predict the \({E}_{CO}\) obtained after optimization, where the initial structures are defined as the unrelaxed ones. The two tasks are comparable to the IS2RE and S2EF tasks defined in the Open Catalyst Project (OCP) 59, excluding forces. The former is referred to as the "I2I" task because both the training and test sets contain initial structures, whereas the latter is named the "R2R" task because both the training and test sets contain relaxed structures. In addition, we establish the "R2I" task, in which models are trained to predict adsorption energies using relaxed structures as inputs but use initial structures to predict the corresponding energies during inference. In all three tasks, the labels are the energies obtained after DFT structural relaxation.
The dataset is benchmarked by performing the three tasks with several representative baseline models. Table 1 presents a comprehensive comparison of the 8 models on the I2I and R2R tasks, in which AVER. and STD. denote the average and standard deviation of the evaluation metrics over the 10 test runs, respectively. STDAE denotes the standard deviation of the absolute errors of the test results; a high STDAE implies that some energies estimated by the model depart significantly from their actual values calculated by DFT, indicating a weak generalization ability. The metrics "within 0.02" and "within 0.1" represent the fraction of test samples predicted with an absolute error smaller than 0.02 and 0.1 eV, respectively. Three symmetry-function (SF) 60, 61 based Gradient Boosting Regression (GBR) 62 models are listed as representatives of traditional ML algorithms, with the doping and carbon atoms chosen as centers to construct the SFs. Details of the SF construction and the hyper-parameters of the GNNs are available in Supplementary Notes 1 and 2. The results show that all models achieve a higher accuracy when predicting the adsorption energy from relaxed structures rather than their initial counterparts, which is consistent with earlier works 43, 59. The GBR models have a substantially lower average MAE than SchNet and CGCNN on both the I2I and R2R tasks, but their STDAE values are larger than those of the GNN models, indicating that the GBR models perform well on only part of the data points. In comparison, the DimeNet-based models perform best on all metrics, although their run time and number of trainable parameters exceed those of SchNet and CGCNN. In the R2R task, DimeNet is superior to DimeNet++ with the original hyper-parameters (embedding size = 128, output embedding size = 256), but the performance of DimeNet++ with an embedding size of 256 and an output embedding size of 192 (denoted DimeNet++ (emb256)) is identical to or slightly better than that of DimeNet. Hereafter, DimeNet++ (emb256) is used for further experiments.
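For reference, the evaluation metrics reported in Table 1 can be computed from per-sample predictions as in the sketch below (illustrative code; the array names are placeholders).

```python
import numpy as np

def evaluate(pred, target):
    """Per-split evaluation metrics used in Table 1 (energies in eV)."""
    abs_err = np.abs(pred - target)
    return {
        'MAE (eV)': abs_err.mean(),
        'STDAE (eV)': abs_err.std(),          # spread of the absolute errors
        'within 0.02': (abs_err < 0.02).mean(),
        'within 0.1': (abs_err < 0.1).mean(),
    }

# AVER. and STD. in Table 1 are then the mean and standard deviation of each
# metric over the 10 random train/validation/test splits.
```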
Table 1 Comparison of models on the I2I, MT I2I and R2R tasks.

| Task | Model | Test MAE AVER. (eV) | Test MAE STD. (eV) | Test STDAE AVER. (eV) | Test within 0.02 AVER. | Test within 0.1 AVER. | Run Time AVER. (s) | Number of Parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| I2I | Localized cos 61 + GBR | 0.120 | 0.0178 | 0.457 | 0.299 | 0.758 | 6.83 | N/A |
| I2I | Gaussian-cos 60 + GBR | 0.130 | 0.0192 | 0.482 | 0.271 | 0.738 | 6.82 | N/A |
| I2I | Gaussian-tanh 60 + GBR | 0.132 | 0.0201 | 0.485 | 0.268 | 0.735 | 6.71 | N/A |
| I2I | SchNet 32 | 0.257 | 0.113 | 0.421 | 0.103 | 0.409 | 1845.45 | 541697 |
| I2I | CGCNN 34 | 0.182 | 0.0887 | 0.401 | 0.147 | 0.557 | 1712.22 | 703361 |
| I2I | DimeNet 38 | 0.099 | 0.0157 | 0.358 | 0.329 | 0.792 | 7741.72 | 2100070 |
| I2I | DimeNet++ 39 | 0.0966 | 0.0146 | 0.350 | 0.312 | 0.802 | 5335.63 | 1887110 |
| I2I | DimeNet++ (emb256) | 0.0938 | 0.0148 | 0.356 | 0.337 | 0.821 | 5906.27 | 3545542 |
| MT I2I | MT-MD DimeNet++ (MB1-SRB1) | 0.0926 | 0.0149 | 0.347 | 0.348 | 0.813 | 4421.92 | 2099199 |
| MT I2I | MT-MD DimeNet++ (MB3-SRB2) | 0.0875 | 0.0135 | 0.310 | 0.384 | 0.823 | 7258.57 | 4569855 |
| MT I2I | MT-MD DimeNet++ (MB4-SRB3) | 0.0870 | 0.0114 | 0.303 | 0.398 | 0.829 | 9231.51 | 6250431 |
| R2R | Localized cos 61 + GBR | 0.116 | 0.0237 | 0.585 | 0.311 | 0.792 | 26.25 | N/A |
| R2R | Gaussian-cos 60 + GBR | 0.108 | 0.0138 | 0.594 | 0.361 | 0.800 | 25.42 | N/A |
| R2R | Gaussian-tanh 60 + GBR | 0.117 | 0.0122 | 0.643 | 0.358 | 0.801 | 26.03 | N/A |
| R2R | SchNet 32 | 0.183 | 0.150 | 0.386 | 0.201 | 0.628 | 1904.08 | 541697 |
| R2R | CGCNN 34 | 0.139 | 0.102 | 0.355 | 0.198 | 0.683 | 1780.57 | 703361 |
| R2R | DimeNet 38 | 0.0556 | 0.0138 | 0.312 | 0.569 | 0.908 | 7678.62 | 2100070 |
| R2R | DimeNet++ 39 | 0.0603 | 0.0116 | 0.315 | 0.466 | 0.898 | 5171.22 | 1887110 |
| R2R | DimeNet++ (emb256) | 0.0551 | 0.0137 | 0.312 | 0.544 | 0.911 | 5893.75 | 3545542 |
Supplementary Tables 1 and 2 show the test results of models trained with normalized training labels for the I2I and R2R tasks, respectively. Because normalization does not guarantee a consistent performance increase for all algorithms, the training labels in the following experiments are not normalized. The results of the R2I task, shown in Supplementary Table 3, exhibit a large MAE for all models. This shows that training models on relaxed configurations to screen new catalysts from their initial structures can be ineffective, owing to the limited sampling of the potential energy surface 45.
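The label normalization examined in Supplementary Tables 1 and 2 is assumed here to be the usual standardization with training-set statistics; a minimal sketch of that procedure (our assumption, not necessarily the exact scheme used) is:

```python
import numpy as np

class LabelNormalizer:
    """Standardize energy labels using training-set statistics only."""
    def fit(self, y_train):
        self.mean = y_train.mean()
        self.std = y_train.std()
        return self

    def transform(self, y):          # applied to training labels
        return (y - self.mean) / self.std

    def inverse_transform(self, y):  # applied to model predictions
        return y * self.std + self.mean
```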
Adsorption energy prediction with MT GNN models
Most ML models, including GNNs, cannot predict \({E}_{CO}\) with high accuracy in the I2I task. To address this challenge, an MT learning framework is proposed, whose workflow is illustrated in Fig. 1d. In these tests, initial structures are used as inputs to predict \({E}_{CO}\), and information about the relaxed structures is incorporated during training by adding a branch that predicts the change of bond lengths. The trained models are then applied to directly predict \({E}_{CO}\) from initial structures without relaxation. Supplementary Table 4 compares the performance of DimeNet++ and MT-MD DimeNet++ with multiple architectures on the 10 test sets of the I2I task, and the results of three MT-MD DimeNet++ models are also listed in the "MT I2I" category of Table 1. In these tables, the numbers after 'MB' and 'SRB' indicate the numbers of IBs used in the MB and SRB, respectively.
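Conceptually, training couples an energy regression loss on the MB output with the negative log-likelihood of the bond-length changes under the mixture density predicted by the SRB. The sketch below shows one plausible formulation with a Gaussian mixture and a fixed loss weight; the component parametrization, the L1 energy loss, and the weight w_srb are assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

def mdn_nll(pi_logits, mu, log_sigma, target):
    """Negative log-likelihood of bond-length changes under a Gaussian mixture.
    pi_logits, mu, log_sigma: [n_edges, n_components]; target: [n_edges]."""
    log_pi = torch.log_softmax(pi_logits, dim=-1)
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    log_prob = dist.log_prob(target.unsqueeze(-1))   # [n_edges, n_components]
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()

def multitask_loss(e_pred, e_true, pi_logits, mu, log_sigma, dl_true, w_srb=1.0):
    """Weighted sum of the energy loss (MB) and the MDN loss (SRB)."""
    loss_mb = nn.functional.l1_loss(e_pred, e_true)
    loss_srb = mdn_nll(pi_logits, mu, log_sigma, dl_true)
    return loss_mb + w_srb * loss_srb
```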
As shown in Supplementary Table 4, all MT-MD DimeNet++ models outperform the original DimeNet++. Notably, even the model using only one IB in each of the MB and SRB (MB1-SRB1, also shown in Table 1), whose number of trainable parameters and run time are both lower than those of DimeNet++, achieves an improved prediction accuracy with an average MAE of 0.0926 eV. These results clearly demonstrate the superiority of our method. The best model on the MT I2I task is MB4-SRB3, with an MAE of 0.0870 eV and a run time of 9231.51 s per training run. To balance performance and computational cost, MB3-SRB2, with an MAE of 0.0875 eV, is chosen for further discussion.
Extrapolation on different surfaces
It is now well known that different Cu facets lead to different products in CO2RR 63–66. More and more experiments are designed to achieve high product selectivity by synthesizing Cu nanostructures with specific facets. Accordingly, a good extrapolation capability across different surfaces is key to applying ML models to facilitate the optimization of CO2RR catalysts. We demonstrate the generalization capacity of our model by training it to predict the \({E}_{CO}\) of SAAs on untrained surfaces.
In our dataset, there are 246, 326, 328, 1107, and 1066 samples based on the Cu(100), Cu(110), Cu(111), Cu(210), and Cu(411) surfaces, respectively. By randomly selecting 246 samples built upon each surface, five subsets are constructed. One of the five subsets is chosen as the target and is further split half-and-half into validation and test sets, while the remaining four subsets are combined as the training set, leading to a dataset with an 8:1:1 split ratio. Accordingly, the number of training samples is the same for the different surface extrapolation tasks. Based on these datasets, MT-MD DimeNet++ is trained to predict \({E}_{CO}\) on SAAs built on surfaces that do not appear in the training sets. The process is repeated 10 times with the same 10 random seeds for each of the five surfaces, and the average MAEs are depicted in Fig. 2a, with detailed results reported in Supplementary Table 5. Although trained on a relatively small dataset (984 samples in total), our model still achieves average MAEs below 0.2 eV on 4 out of the 5 test sets. For the Cu(110) surface, we find that the high MAE is caused by 4 outliers from the Sb-doped case, for which the DFT-calculated \({E}_{CO}\) is larger than 10 eV. These apparently unphysical values are attributed to a substantial reconstruction of the subsurface structure after optimization, shown in Supplementary Fig. 1, which is not the focus of our work. These data points were not removed in the previous tests to avoid data leakage.
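A sketch of how such a balanced surface-extrapolation split can be assembled is given below (assuming a per-sample surface label; re-drawing the 246-sample subsets for each seed is a simplification).

```python
import numpy as np

def surface_extrapolation_split(surfaces, target, n_per_surface=246, seed=0):
    """Hold out one surface: train on equal-sized subsets of the other four,
    and split the target-surface subset half/half into validation and test."""
    rng = np.random.default_rng(seed)
    surfaces = np.asarray(surfaces)
    subsets = {s: rng.choice(np.where(surfaces == s)[0], n_per_surface, replace=False)
               for s in np.unique(surfaces)}
    target_idx = rng.permutation(subsets[target])
    val_idx, test_idx = np.array_split(target_idx, 2)
    train_idx = np.concatenate([idx for s, idx in subsets.items() if s != target])
    return train_idx, val_idx, test_idx
```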
To further demonstrate the extrapolation ability of MT-MD DimeNet++, we evaluate our model by predicting \({E}_{CO}\) on SAAs based on Cu(751), a surface that has been shown to be more selective for C-C coupling than Cu(100) 67. We choose 4 doping species (Ag, Zn, Ga and Ge), 2 from the d-block and 2 from the p-block, which are the most promising doping elements for building Cu-based SAA catalysts for CO2RR owing to their high activity and selectivity, high stability, non-toxicity and low cost 17, 68, 69. With 9 initial structures for each doping case, 36 SAA configurations on Cu(751) are calculated in total to build a new test set. The 4 outliers of Sb-doped Cu(110) mentioned above are removed from the original SAA dataset, and the remaining 3071 samples are used as the training set. The results in Fig. 2b show that both the training and test MAEs are below 0.1 eV, and most predicted values are within the acceptable 0.1 eV error threshold (71.8% for the training set and 69.4% for the test set), reflecting the excellent extrapolation capacity of our model.
Additionally, to compare MT-MD DimeNet++ with the baseline algorithms, all models are trained on this dataset, shuffled with 5 different random seeds, to perform the Cu(751) extrapolation test 5 times, with the results shown in Supplementary Table 6. The three DimeNet-based models show the lowest MAEs among all models, and MT-MD DimeNet++ is the only one with an MAE below 0.09 eV (0.0893 eV).
Extrapolation on different elements
The doping species has a great impact on the chemical properties of SAAs, leading to different substrate-CO interaction behaviors. The ability to predict \({E}_{CO}\) on SAAs with unseen doping species is therefore another important aspect of accelerating the development of new SAA catalysts. In this section, each of the 41 doping elements is used in turn as the target element to construct 41 datasets. In each dataset, the 75 SAAs doped with the target atom form the test set, while the remaining samples, in which the target element does not appear, form the training set. When Sb is not the target element, the training set contains 2996 data points, excluding the 4 Sb-containing samples with \({E}_{CO}\) larger than 10 eV, whereas the number of training data points is 3000 when Sb is the target element. MT-MD DimeNet++ performs the prediction task 5 times for each target element with 5 different random seeds, and the results are shown in Fig. 3a as a periodic table, with more details provided in Supplementary Table 7. In total, the average test MAEs of 11 out of the 41 elements fall within the range of acceptable error (0.10–0.15 eV). The model predicts most untrained non-metal elements with relatively high accuracy. Although the model performs best for Cd and Pd (with MAEs of 0.100 and 0.101 eV, respectively), it does not perform well when extrapolating to some transition metal elements, especially those in groups VIB and VIIB and the first two columns of group VIIIB. Our preliminary interpretation of these phenomena is that training on the relatively more complex interaction patterns between transition metal dopants and the Cu substrate benefits extrapolation to the simpler non-metal doping elements, whereas the reverse is more difficult. We leave the study of the underlying mechanism to future work.
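The leave-one-element-out datasets can be constructed as sketched below (assuming per-sample dopant labels and DFT energies as arrays; the 10 eV cutoff drops the unphysical Sb outliers noted earlier).

```python
import numpy as np

def leave_one_element_out(dopants, energies, target_element, e_cutoff=10.0):
    """Test on all samples doped with `target_element`; train on the rest,
    dropping outliers with E_CO above the cutoff from the training pool."""
    dopants = np.asarray(dopants)
    energies = np.asarray(energies)
    test_mask = dopants == target_element
    train_mask = ~test_mask & (energies <= e_cutoff)
    return np.where(train_mask)[0], np.where(test_mask)[0]
```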
To further assess the generalization ability, a series of transfer learning tasks is designed for the three target elements with the worst performance (except Sb), i.e., N, Re and Ru. In this experiment, 2, 5, 10, 20, and 30 data points from the initial test sets are randomly selected and added to the training set to build transfer learning tasks with decreasing levels of difficulty. For each task, the sampling is repeated 5 times based on 5 different random seeds, resulting in 25 datasets in total for each element. All of the baseline models are applied for comparison with MT-MD DimeNet++, and the average MAEs for the different tasks are plotted in Fig. 3b-d, with details available in Supplementary Tables 8–10. Experiments in which no extra data points are supplemented are also conducted for the baseline models, and these results are likewise averaged over 5 tests for each model. None of the models can predict \({E}_{CO}\) on N-doped SAAs accurately, and the three GBR-based models even perform worse when two N samples are included. It is noted that many \({E}_{CO}\) values of N-doped samples (23 out of 75) are smaller than −1.0 eV, deviating strongly from the mean of the \({E}_{CO}\) distribution of the whole dataset. Together with the reconstructed structures after optimization, this indicates an unusual pattern of interaction between CO and the N-containing catalysts, leading to the failure of extrapolation. Compared with the N-doped cases, the test MAEs on Re- and Ru-doped samples decrease gradually as the number of added Re/Ru samples increases. For Re-doped SAAs, MT-MD DimeNet++ is the only model achieving an average MAE below 0.1 eV (0.0881 eV) when only 10 Re samples are included in training, and it continues to show the best performance up to 30 extra samples (0.0739 eV), notably better than DimeNet++ (0.0895 eV). For Ru-doped SAAs, the transfer is relatively easy compared with the other two cases. DimeNet++ (0.0533 eV) outperforms MT-MD DimeNet++ (0.0657 eV) on datasets including 10 Ru samples, whereas in the task with 30 Ru samples, the MT-MD model is again better than DimeNet++, with an MAE of 0.0310 eV versus 0.0356 eV. Therefore, although MT-MD DimeNet++ cannot directly predict the \({E}_{CO}\) of SAAs doped with untrained species, the model reaches a high accuracy once around 5 to 10 extra samples are included in the training set, largely reducing the computational cost.
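The few-shot transfer datasets described above can be derived from the leave-one-element-out split by moving a handful of target-element samples into the training set, as sketched below (illustrative only; that the moved samples are excluded from the test pool is our assumption).

```python
import numpy as np

def few_shot_transfer_split(train_idx, test_idx, k, seed=0):
    """Move k randomly chosen target-element samples from the test pool
    into the training set (k in {2, 5, 10, 20, 30}; repeated over 5 seeds)."""
    rng = np.random.default_rng(seed)
    moved = rng.choice(test_idx, size=k, replace=False)
    new_train = np.concatenate([train_idx, moved])
    new_test = np.setdiff1d(test_idx, moved)
    return new_train, new_test
```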