MOFTransformer
The overall schematic of our MOFTransformer is shown in Figure 1(a). To build towards universal transfer learning, both pre-training and fine-tuning strategies are implemented. The objective of pre-training is to allow the MOFTransformer to learn the essential characteristics of a MOF, and this pre-trained model serves as the starting point for all subsequent applications. Fine-tuning refers to the process of training the pre-trained model for the specific application at hand (e.g., gas adsorption uptake prediction). Figure 1(b) shows the schematic of the MOFTransformer architecture, which is based on the multi-layer bidirectional Transformer encoder developed by Vaswani et al.27 The MOFTransformer is a multi-modal Transformer that takes two types of embedding as inputs, representing local and global features, respectively: (1) an atom-based graph embedding and (2) an energy-grid embedding.
Previously, Xie et al.21 devised the crystal graph convolutional neural network (CGCNN), which transforms atoms (i.e., nodes), bonds (i.e., edges), and their features (i.e., the distances between atoms) into a vector space. Although the original CGCNN consists of both convolutional layers and pooling layers, the atom-based graph embedding in the MOFTransformer uses the output vectors of the CGCNN without the pooling layers, which allows our model to handle atom-wise features without losing information. It should be noted that many atoms in the unit cell of a MOF receive the same embedding from the CGCNN, given that the CGCNN creates the embedding from the atom types of the nodes, the distances, and the atom types of the neighboring nodes (see Supplementary Figure S1). We grouped these topologically identical atoms and defined the resulting sets as unique atoms (the details of the algorithm are explained in Supplementary Note S1). Removing the information from the overlapping atoms enables efficient training and prevents the significant memory issues that frequently appear when training with long input sequences.
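As a minimal illustration of the unique-atom idea (the full grouping algorithm is given in Supplementary Note S1), the sketch below simply collapses identical CGCNN embedding rows, assuming that topologically equivalent atoms receive identical embeddings; the function and variable names are illustrative rather than taken from the released code.

```python
import torch

def unique_atom_embeddings(atom_embs: torch.Tensor) -> torch.Tensor:
    """Collapse duplicate rows of the per-atom CGCNN output.

    atom_embs: (n_atoms, hidden_dim) embeddings from the CGCNN
               encoder without pooling layers.
    Returns:   (n_unique, hidden_dim) embeddings of the unique atoms.
    """
    return torch.unique(atom_embs, dim=0)

# Example: atoms 0 and 2 are topologically identical, so 4 atoms -> 3 tokens.
embs = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
print(unique_atom_embeddings(embs).shape)  # torch.Size([3, 2])
```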
When it comes to the energy-grid embedding, the energy grids were calculated using a methane probe molecule, which was chosen because it is straightforward to model. The Universal Force Field34 and TraPPE35 were used to describe the MOF atoms and the methane molecule, respectively, when computing the adsorbate-adsorbent van der Waals interactions. The 3D energy grids can be treated as 3D images, in which the grid points and their energy values serve as pixels and 1-channel colors, respectively. Similar to the Vision Transformer,29 the MOFTransformer takes 1-dimensional (1D) patches of the flattened 3D energy grids, where (H, W, D) are the height, width, and depth of the energy grids, (P, P, P) is the patch resolution, and N = HWD/P3 is the number of patches. Given that the energy grids were interpolated to 30 × 30 × 30 Å, the height H, width W, and depth D are 30 Å. The patch size P was set to 5 Å, so the number of patches N is 216.
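The patch extraction can be written in a few lines. The following is a minimal sketch, assuming a simple reshape-and-permute implementation (the released code may differ) and omitting the subsequent linear projection of each patch to the hidden dimension.

```python
import torch

def patchify_energy_grid(grid: torch.Tensor, patch: int = 5) -> torch.Tensor:
    """Split a (H, W, D) energy grid into N = HWD / P^3 flattened patches."""
    H, W, D = grid.shape
    g = grid.reshape(H // patch, patch, W // patch, patch, D // patch, patch)
    g = g.permute(0, 2, 4, 1, 3, 5)   # bring the three patch-grid indices forward
    return g.reshape(-1, patch ** 3)  # (N, P^3) 1D patches

grid = torch.randn(30, 30, 30)        # stand-in for an interpolated energy grid
patches = patchify_energy_grid(grid)  # shape (216, 125), matching N = 216
print(patches.shape)
```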
The MOFTransformer model is derived from the BERT-base model28 (L=12, H=768, A=12), where L is the number of blocks, H is the hidden size, and A is the number of self-attention heads. Following BERT, a class token [CLS] and a separator token [SEP], both learnable embeddings, are placed at the first position and between the two types of embedding, respectively (see Figure 1(b)). The output of the [CLS] token serves as the aggregate representation of the sequence, and a single pooling layer is added on top of it to predict the desired properties in the pre-training and fine-tuning tasks. In addition, a volume token [VOL], which encodes the normalized cell volume, is added at the final position of the input embedding, because the interpolation of the energy grids leads to a loss of information about the volume of the original grids. Finally, position embeddings and modal-type embeddings, which are also learnable, are added to the input embedding by element-wise summation. The position embedding encodes the position of each token in the sequence, and the modal-type embedding labels the two types of embedding as 0 and 1.
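The input assembly can be summarized in code. The sketch below follows the description above, assuming PyTorch; the linear projection used to embed the scalar cell volume into a [VOL] token and the maximum sequence length are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class MOFInputEmbedding(nn.Module):
    """Sketch of the input assembly: [CLS] + graph tokens + [SEP] +
    grid tokens + [VOL], plus position and modal-type embeddings."""

    def __init__(self, hidden: int = 768, max_len: int = 1024):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, hidden))   # learnable [CLS]
        self.sep = nn.Parameter(torch.randn(1, hidden))   # learnable [SEP]
        self.vol_proj = nn.Linear(1, hidden)  # assumed: embeds the scalar volume
        self.pos = nn.Embedding(max_len, hidden)          # position embedding
        self.modal = nn.Embedding(2, hidden)              # 0: graph, 1: grid

    def forward(self, graph_tokens, grid_tokens, volume):
        # graph_tokens: (n_g, hidden); grid_tokens: (n_e, hidden); volume: (1,)
        vol = self.vol_proj(volume.view(1, 1))            # [VOL] token
        seq = torch.cat([self.cls, graph_tokens, self.sep, grid_tokens, vol], 0)
        modal_ids = torch.cat([
            torch.zeros(graph_tokens.size(0) + 2, dtype=torch.long),  # [CLS]..[SEP]
            torch.ones(grid_tokens.size(0) + 1, dtype=torch.long),    # grid + [VOL]
        ])
        positions = torch.arange(seq.size(0))
        # element-wise summation of the three embeddings, as described above
        return seq + self.pos(positions) + self.modal(modal_ids)

emb = MOFInputEmbedding()
out = emb(torch.randn(10, 768), torch.randn(216, 768), torch.tensor([0.42]))
print(out.shape)  # torch.Size([229, 768]) = 10 + 216 + 3 special tokens
```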
Understanding MOF descriptors
It is important to recognize how the MOF descriptors (i.e., local features and global features) influence the properties of MOFs. As shown in Figure 2, H2 uptake, H2 diffusivity, and band gap were selected as case-study applications, representing the adsorption, diffusion, and electronic properties of MOFs, respectively. Figure 2(a-c) shows the structure-property maps obtained from molecular simulations for each of these applications. For H2 uptake and diffusivity, the data were taken from our fine-tuning dataset (20,000 structures), while the QMOF database with the PBE functional (20,373 structures) was used for the band gap values. From Figure 2(a-b), it can be seen that H2 uptake and diffusivity increase with accessible volume fraction and are strongly dependent on the MOF topology, owing to the correlation between topology and void fraction. Meanwhile, the band gap exhibits no correlation with accessible volume fraction or topology, which is reasonable given that electronic properties depend more on local chemical features than on global geometric features.
On top of this, Figure 2(d-f) shows the correlation between the MOF properties and the types of metal atoms. It can be seen that the dependence on metal atoms is lowest for H2 uptake and highest for the band gap energy, and similar trends can be found for the organic linkers (see Supplementary Figure S2). Together with the aforementioned geometric analysis, Figure 2(d-f) confirms that adsorption and diffusion properties rely more on global features, while electronic properties rely more on local features. Beyond these, some properties, such as the O2 diffusivity (which is more dependent on electronic effects than the H2 diffusivity) and the CO2 Henry coefficient, show more complex correlations between features and properties (see Supplementary Figure S3). This illustrates the importance of integrating both local and global features within the Transformer to enable universal transferability across different applications.
Pre-training Results
The pre-training tasks play an essential role in determining the effectiveness of the transfer learning performance. Three pre-training tasks were designed to capture the essential features of the MOFs: (1) MOF topology prediction (MTP), (2) void fraction prediction (VFP), and (3) metal cluster/organic linker classification (MOC). For the MTP task, the model was trained to classify MOFs into 1,079 topologies by adding a classification head, consisting of a single dense layer, to the [CLS] token; the list of topologies is summarized in Supplementary Table S1. For the VFP task, the model was trained to predict the accessible void fraction calculated by ZEO++26 by adding a single dense layer to the [CLS] token. Finally, the MOC task was performed because it enables the model to learn, separately, the features stemming from each metal node and organic linker: a binary classification (determining whether a given MOF atom belongs to the metal cluster or the organic linker) is conducted on the atom-wise features of the atom-based embedding. The accuracies of the MTP and MOC tasks were 0.97 and 0.98, respectively, and the MAE of the VFP task was 0.01.
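In code, the three heads amount to single dense layers on top of the relevant token outputs. A hedged sketch follows, where the class and attribute names, as well as the standard loss choices noted in the comment, are illustrative assumptions consistent with the description above.

```python
import torch
import torch.nn as nn

class PretrainingHeads(nn.Module):
    """Single dense layers for the three pre-training tasks described above."""

    def __init__(self, hidden: int = 768, n_topologies: int = 1079):
        super().__init__()
        self.mtp = nn.Linear(hidden, n_topologies)  # topology classification, on [CLS]
        self.vfp = nn.Linear(hidden, 1)             # void-fraction regression, on [CLS]
        self.moc = nn.Linear(hidden, 1)             # metal/linker logit, per atom token

    def forward(self, cls_out: torch.Tensor, atom_outs: torch.Tensor):
        # cls_out: (batch, hidden); atom_outs: (batch, n_atoms, hidden)
        return self.mtp(cls_out), self.vfp(cls_out), self.moc(atom_outs)

# Assumed standard losses: cross-entropy (MTP), MSE (VFP), BCE-with-logits (MOC).
heads = PretrainingHeads()
mtp, vfp, moc = heads(torch.randn(4, 768), torch.randn(4, 50, 768))
print(mtp.shape, vfp.shape, moc.shape)  # (4, 1079) (4, 1) (4, 50, 1)
```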
Next, we visualized the embedding vectors of the pre-trained model in a two-dimensional space using the t-SNE and PCA methods, as shown in Figure 3. Figure 3(a) shows a t-SNE plot of the class-token embedding vectors for the 10 most frequently appearing topologies in the dataset: MOFs with the same topology are clustered together and segregated from the other MOFs, indicating that proper learning has occurred. The same pattern of results was seen for all topologies (see Supplementary Figure S4). Furthermore, it is interesting to note that the PCA plot in Figure 3(b) exhibits a distribution of the embedding vectors that varies gradually with the void fraction, indicating that embedding vectors with similar void fraction values are clustered together. These results demonstrate that the pre-trained model successfully captures the critical features of the MOFs.
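A plot in the style of Figure 3(a) can be produced with standard tooling. The sketch below applies scikit-learn's t-SNE to exported [CLS] embeddings; the random stand-in data, the perplexity value, and the plot styling are all illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
cls_embeddings = rng.normal(size=(1000, 768))     # stand-in for [CLS] outputs
topology_labels = rng.integers(0, 10, size=1000)  # stand-in top-10 topology ids

coords = TSNE(n_components=2, perplexity=30).fit_transform(cls_embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=topology_labels, s=2, cmap="tab10")
plt.title("t-SNE of [CLS] embeddings colored by topology")
plt.show()
```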
Fine-tuning Results
Figure 3(c) shows the fine-tuning results for predicting H2 uptake (100 bar), H2 diffusivity, and band gap, which were obtained from GCMC, MD, and DFT simulations, respectively. While 1 million hMOFs were used for the pre-training step, a much smaller number of MOFs (i.e., 5,000 to 20,000) was used for training during the fine-tuning stage. The fine-tuning performance is compared with that of three baseline models (i.e., the energy histogram,17 the descriptor-based ML model,18 and CGCNN19,21), as these have shown high performance in predicting gas uptake, diffusivity, and band gap, respectively. From these comparisons, it can be seen that the MOFTransformer outperforms all of the other models, demonstrating both its superior performance and its transferable capabilities. Ablation studies of the fine-tuning, which examine the effect of the data size used in the pre-training tasks, are explained in Supplementary Note S2.
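Conceptually, fine-tuning reuses the pre-trained encoder and swaps the pre-training heads for a task-specific one. The sketch below illustrates this under the assumption that the encoder returns per-token outputs with [CLS] first; the wrapper class and its names are hypothetical, not the released API.

```python
import torch.nn as nn

class FineTuneModel(nn.Module):
    """Hypothetical fine-tuning wrapper: a pre-trained encoder plus one
    regression head on the [CLS] output (e.g., for H2 uptake)."""

    def __init__(self, encoder: nn.Module, hidden: int = 768):
        super().__init__()
        self.encoder = encoder            # pre-trained weights are reused
        self.head = nn.Linear(hidden, 1)  # task-specific regression head

    def forward(self, inputs):
        tokens = self.encoder(inputs)     # assumed shape: (batch, seq, hidden)
        return self.head(tokens[:, 0])    # [CLS] is the first token
```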
To further demonstrate transferability across different applications, the MOFTransformer was fine-tuned for the various properties summarized in Table 1, which compares the performance of our fine-tuned model with that of the machine-learning models used in other works. It can be seen that the MOFTransformer has either similar or higher performance (i.e., a higher R2 score or a lower MAE) across all properties. It is interesting to note that the MOFTransformer outperforms all the other models regardless of gas type, even though the energy grids were created with the methane molecule. Moreover, our model achieves a lower MAE than the machine-learning model that uses revised autocorrelations (RAC)37 together with geometric features as descriptors to predict the solvent removal stability and thermal stability collected by text mining. This result suggests that one can easily obtain high-performance structure-property relationships by using our pre-trained model and fine-tuning it, without needing to develop a new model from scratch.