Multiscale Graph Attention Neural Networks for Mapping Materials and Molecules beyond Short-Range Interatomic Correlations

Abstract: Bringing advances in machine learning to chemical science is leading to a revolutionary change in how materials discovery and atomic-scale simulations are accelerated. Currently, the success of most machine learning schemes can be largely traced to the use of localized atomic environments in the structural representation of materials and molecules. However, the lack of non-local correlations between atoms may undermine the reliability of machine learning models for mapping complex systems and describing long-range physical effects. To overcome such limitations, we report here a unified framework, the multiscale graph attention neural network, that maps materials and molecules into a generalizable and interpretable representation combining local and non-local information of atomic environments across multiple scales. As an exemplary study, our model is applied to predict electronic structure properties of one class of technologically important reticular materials, metal-organic frameworks (MOFs), which exhibit notable diversity in composition and structure. Our model is trained on datasets calculated with density functional theory, and the results show that it achieves state-of-the-art performance on this challenging task. A clustering analysis further demonstrates that our model yields a high-level identification of MOFs with spatial and chemical resolution, which could reveal new insights into complex systems and efficiently guide the search for reticular materials with desired properties.


INTRODUCTION
The past few years have witnessed a surge of interest in applying machine learning (ML) technologies to power many aspects of both computational chemistry and materials science [1][2][3]. For example, ML techniques open new avenues for constructing complicated potential energy surfaces from quantum-mechanical data in an automated fashion, tackling long-standing computational challenges (e.g., the realistic modeling of chemical reactions or of complex materials and interfaces) that are inaccessible to either poorly transferable empirical force fields or computationally demanding ab initio methods [4][5][6][7][8][9][10]. Moreover, ML approaches are revolutionizing the discovery and design of materials and molecules at an astounding rate through direct in silico screening and statistical analysis of massive chemical datasets [11][12][13][14][15][16][17][18][19][20][21].
The representation of materials and molecules is a crucial ingredient in constructing effective ML models, whether for statistical property regression, for clustering chemical structures, or for visualizing material phase space in a low-dimensional manifold. Conventional ML models adopt hand-crafted descriptors that encode the raw information about atomic systems (such as the chemical nature and coordinates of each atom) into a suitable representation with physical symmetry, such as the widely used smooth overlap of atomic positions (SOAP) 5,22, the Coulomb matrix 23, atom-centered symmetry functions 4,24, and composition-based features 25. Arguably, the design of effective descriptors usually requires both considerable domain expertise and human effort, which proves challenging for the immense number of materials with complex structures and diverse compositions.
Recently, tremendous attention has been devoted to the development of deep learning models that automatically discover flexible representations of materials and molecules with minimal human intervention. Notably, deep learning takes raw data directly as inputs, from which complex and abstract material representations can be learned by a series of hierarchically nested neural networks 26. To date, a number of deep learning models have been proposed for problems in materials and chemical science, including the deep tensor neural network (DTNN) 27, ANI-1 28, the crystal graph convolutional neural network (CGCNN) 29, SchNet 30, PhysNet 31, and AIMNet 32,33. These deep models have shown strong and flexible capabilities for representing complex systems (such as protein-like compounds 31 and drug-like molecules 11) and generally outperform conventional ML methods in predicting various quantum-chemical properties of small organic molecules, crystals, disordered materials, and surfaces [34][35][36][37][38].
Aside from powerful representation learning, the idea of locality undoubtedly lays a solid basis for current state-of-the-art (SOTA) ML-based potentials and property-regression schemes. Locality, supported by the principle of electronic nearsightedness 39,40, is associated with the description of atom-centered short-range chemical environments to infer complex many-body interactions, and it renders ML models interpretable, scalable, and robust for extensive properties 41,42. In this context, most current SOTA models build global representations simply as a collection of atom-centered local environments and neglect long-range interatomic correlations. However, this can significantly undermine the reliability of ML schemes when long-range interactions beyond the cutoff radius (such as electrostatics and van der Waals dispersion) dominate the properties of systems like ionic solids and electrolyte solutions [43][44][45]. Moreover, encoding only local information may lose global shape characteristics and markedly weaken pattern recognition for complex systems with wide structural and configurational diversity 35,46,47.
Indeed, there is a great and growing demand for ML to accelerate the discovery of complex materials. One class of representative materials is metal-organic frameworks (MOFs), which have great potential in many applications, such as gas storage and separation, sensors, thermoelectrics, catalysis, and photovoltaics [48][49][50][51]. Notably, over 80,000 porous MOFs have been synthesized over the past decades by assembling organic linkers and metal clusters 50,52, yet these are only a small fraction of the myriad possible structural motifs of MOFs that could be realized 53. This results in an urgent need for ML in this area. However, the overwhelming chemical space, together with the large number of atoms in MOF structures, makes it difficult to obtain a high-level global representation when only short-range interatomic correlations are incorporated.
In this work, we introduce a unified multiscale graph attention neural network (MGANN) architecture that captures both local and non-local features of atomic environments, aiming to provide a deep, high-level representation of complex materials and molecules. We evaluate our model on a recent quantum-chemical database 47 of MOFs for predicting their electronic bandgaps. This exemplary study demonstrates that our model achieves close agreement with density-functional theory (DFT) calculations and outperforms prior SOTA models, manifesting its general applicability. Finally, we illustrate how the learned information-rich representations can be used for high-fidelity chemical clustering and to sharply narrow down the candidate space for fast searches of desired materials.

RESULTS
Multiscale graph attention neural networks. Figure 1 depicts the unified MGANN architecture proposed in this work. Inspired by the interpretability, generalizability, and remarkable performance of deep graph networks in predicting material properties 34,36,38,54,55, MGANN receives atomistic structures of materials through a graph-based descriptor in which the atoms and the bonds connecting them are regarded as the nodes and the edges of the graph, respectively. Following a routine protocol in graph-based models 29,38, the initial node attributes are F-dimensional one-hot encodings of the chemical properties of the elements, independent of the atomic environments for now. The edge attributes between nodes i and j are encoded as a set of Gaussian-expanded distances,

u_ij^(k) = exp(-(r_ij - mu_k)^2 / sigma^2), k = 0, 1, ..., K,   (1)

where r_ij is the interatomic distance between atoms i and j, the centers mu_k are evenly spaced on [0, R_c], sigma is an adjustable parameter specifying the width of the Gaussian basis, R_c is the cutoff radius, and K is the number of bond features. Herein, an undirected graph G = (V, E) is constructed from the node and edge attributes. A graph convolutional neural network module is built to update the atomic embeddings by passing messages from neighbors and bonds. Distinct from previous SOTA deep graph networks such as CGCNN 29 and SchNet 30, we exploit new bond convolution (BondConv) operations to directly extract the interaction features from all the bonds emanating from each atom. The updated atomic embeddings are output by a channel-wise symmetric aggregation operation, i.e., max pooling to preserve permutational invariance, and in the first convolutional layer take the form

v_i^(1) = max_{j in N(i)} h_Theta(v_i^(0), v_j^(0), u_ij),   (2)

where h_Theta is a shared network whose weights Theta are the learnable parameters of the atomic features and interactions, N(i) denotes the neighborhood of atom i, and l denotes the current layer (here l = 0). In the subsequent convolutional layers, we change the BondConv to a simpler but effective form,

v_i^(l+1) = max_{j in N(i)} h_Theta(v_j^(l), u_ij).   (3)

Both Eqs. 2 and 3 can be implemented by a shared multilayer perceptron, guaranteeing permutational invariance with respect to the ordering of neighbors. Similar convolutional operations have been used successfully in visual tasks on 3D point clouds 56, but not yet applied to atomic structures. Finally, the graphs are updated through each convolutional layer of the networks, embedding increasingly more information about local environments into the atomic features. Meanwhile, besides the aforementioned permutational invariance, the outputs are also strictly invariant to translation and rotation, because only atom-centered descriptors and pairwise distances are used in the networks.
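As a concrete illustration, the Gaussian-expanded edge attributes and the permutation-invariant max pooling used by the convolution blocks can be sketched in NumPy. This is a minimal sketch, not the actual MGANN implementation; the hyperparameter values (r_cut, K, sigma) and the toy message vectors are illustrative assumptions.

```python
import numpy as np

def gaussian_expand(r_ij, r_cut=6.0, K=50, sigma=0.5):
    """Expand an interatomic distance r_ij into K Gaussian bond features.
    The centers mu_k are evenly spaced on [0, r_cut]; sigma sets the width."""
    mu = np.linspace(0.0, r_cut, K)
    return np.exp(-((r_ij - mu) ** 2) / sigma ** 2)

def maxpool_messages(messages):
    """Channel-wise max pooling over neighbor messages: the result is
    invariant to the ordering of the neighbors."""
    return np.max(np.stack(messages), axis=0)

feat = gaussian_expand(3.1)                 # one bond -> 50 smooth features
msgs = [np.array([1.0, -2.0]), np.array([0.5, 3.0])]
pooled = maxpool_messages(msgs)             # channel-wise max over neighbors
```

Because the pooling takes a channel-wise maximum, feeding the neighbor messages in any order yields the same output, which is what makes the aggregation symmetric.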
To address the loss of long-range correlations between atoms in previous local graph representations, we introduce, for the first time, the self-attention mechanism into this kind of architecture. The self-attention mechanism was originally proposed and applied in the emerging Transformer architectures to boost the performance of neural machine translation and the speed of model training 57. Its biggest benefit is that it processes word sequences in parallel and models the dependencies between words regardless of their distance in the input sequence. We notice that a few studies [58][59][60][61] have recently introduced self-attention to map the space of chemical reactions from text-based representations, namely SMILES 62,63, but to the best of our knowledge none involves mapping atomic configurations.
We now illustrate how self-attention is implemented in our model. Let the node-level embedding set output from the graph networks be G_em, an N x d_g matrix, where N is the number of atoms in the system and d_g is the feature dimensionality. Regardless of the graph structure, the node set of the graph can be treated as an N-component, unordered sequence. To compute self-attention, three matrices, the query Q, the key K, and the value V as defined in the original literature 57, are created by linear transformations of the input features G_em:

(Q, K, V) = G_em (W_Q, W_K, W_V),   (4)

where W_Q and W_K (of size d_g x d_a) and W_V (of size d_g x d_g) denote the shared learnable linear transformation matrices, and d_a is the dimension of the query or key vectors. Through the dot product between the query and key matrices, we can evaluate the attention weight of any local atomic environment against itself and every other one in the whole system, over arbitrarily long distances. The attention weight matrix A takes the form

A = Q K^T.   (5)

The attention weights determine how relevant the information of a certain local atomic environment is to that of other local and non-local ones, from which long-range correlations of information can be built. To make gradients more stable and all attention weights positive, the attention weights are further scaled by a factor of 1/sqrt(d_a) and normalized by a softmax operation:

A = softmax(Q K^T / sqrt(d_a)).   (6)
The outputs F_sa of the self-attention layer are obtained by summing up the weighted value vectors:

F_sa = A V.   (7)

Here, multiplying each value vector by the softmax weights automatically drowns out irrelevant information between local atomic environments and keeps intact the information worth attending to. As all operators in self-attention are independent of the order and size of the inputs, this endows our model with strict permutational invariance and scalability. The self-attention module is shown in Fig. 1c. In fact, many variants of self-attention are available to enhance the model. One alternative is to employ an evolved self-attention module, offset-attention, in place of the original one, inspired by the benefits of the Laplacian operator used in graph convolutional networks 64. One can refer to our recent work for more details on offset-attention 65.
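The scaled dot-product self-attention described above can be sketched in a few lines of NumPy. The random embeddings and the matrix shapes below are illustrative assumptions, not the trained MGANN weights.

```python
import numpy as np

def self_attention(G_em, W_q, W_k, W_v):
    """Scaled dot-product self-attention over N atomic embeddings.

    G_em: (N, d_g) node embeddings from the graph layers.
    W_q, W_k, W_v: learnable projection matrices.
    Returns F_sa = softmax(Q K^T / sqrt(d_a)) V.
    """
    Q, K, V = G_em @ W_q, G_em @ W_k, G_em @ W_v
    d_a = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_a)               # (N, N) attention logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # softmax: rows sum to 1
    return A @ V                                  # weighted sum of values

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 8))                       # 5 atoms, d_g = 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
F_sa = self_attention(G, Wq, Wk, Wv)              # (5, 4) attended features
```

Since every operation acts identically on all rows, permuting the atom order permutes the output rows in the same way, which is the permutation property the text relies on.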
After the raw descriptors flow through a stack of graph and self-attention layers, local and non-local information is extracted hierarchically across scales, eventually arriving at a high-level, global embedding of the atomic structure. Finally, the complex mapping from atomic structures to material properties is established by two fully connected hidden layers.
All weights and other learnable parameters in the networks are iteratively updated by using mini-batch stochastic gradient descent and minimizing the difference between the predicted properties and the reference data computed by DFT.
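The optimization procedure above can be sketched with a toy mini-batch SGD loop. A linear stand-in model on synthetic data is used here purely to illustrate the mechanics (shuffle, slice, gradient step); it is not the MGANN training code, and the learning rate, batch size, and data are assumptions.

```python
import numpy as np

# Mini-batch SGD sketch: fit a linear readout w by minimizing the squared
# error between predictions and reference targets.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 16))   # stand-in for learned material embeddings
y = X @ rng.normal(size=16)      # stand-in for DFT reference values

w = np.zeros(16)
lr, batch_size = 0.05, 32
losses = []
for epoch in range(50):
    order = rng.permutation(len(X))              # reshuffle each epoch
    for s in range(0, len(X), batch_size):
        b = order[s:s + batch_size]
        err = X[b] @ w - y[b]                    # residual on the mini-batch
        w -= lr * X[b].T @ err / batch_size      # gradient step on 0.5*MSE
    losses.append(float(np.mean((X @ w - y) ** 2)))
```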
Quantum MOF (QMOF) database. MOFs are a class of promising porous materials whose fascinating aspects include synthetic versatility, chemical tunability, and stability. The isoreticular principle enables the size of MOFs to vary over a wide range without changing their underlying topology 66. For instance, the tunable pore aperture and surface area of MOFs can range from <10 Å to ~100 Å and from 1,000 to 10,000 m^2/g 49,67, respectively. This allows one to fine-tune the structures of these materials with respect to selectivity and activity 68. To date, more than ten thousand kinds of MOFs have been synthesized by assembling organic linkers (benzene-1,4-dicarboxylate, 2,5-dihydroxybenzene-1,4-dicarboxylate, biphenyl-4,4′-dicarboxylate, etc.) and metal clusters (Mn, Fe, Co, Cu, Zn, Ni, etc.) 52,53.
The QMOF database is a curated MOF subset of the Cambridge Structural Database 52 and the 2019 CoRE MOF database 69. All crystal structures in the QMOF database are experimentally synthesized and fully relaxed at the DFT PBE-D3(BJ) level [70][71][72]. In contrast to previous databases, such as the OQMD 73 and the CoRE MOF database 74, the QMOF database additionally provides an important electronic structure property of MOFs, the bandgap E_g (in eV).
The bandgap is an excellent indicator for classifying MOFs as metals or semiconductors. Given that the majority of MOFs are electrical insulators, it is essential to identify the metallic MOFs, or those with low bandgaps, for expanding the applications of MOFs into (opto)electronic devices and revealing novel quantum-chemical insight into MOFs 49,75,76. The colored periodic table (Fig. 2a) shows the 78 chemical elements covered by the QMOF database. The violin plots (Fig. 2b) further illustrate the statistical distributions of system sizes per primitive unit cell in the QMOF database. The multiple chemical elements (Fig. 2a), together with the diverse structures (Fig. 2b), reflect the complexity of MOF chemistry and make the QMOF database an excellent modeling target for assessing the generality of our MGANN model.

Learning bandgaps of MOFs. We now evaluate the performance of the MGANN model on the QMOF benchmark set and make a comprehensive comparison with other common ML models. Here, we divide the benchmark models into two categories: classical machine learning (CML) models with hand-crafted descriptors and deep learning (DL) models with learnable representations. The original work 47 on the QMOF database provides benchmarks for a DL model (CGCNN) and five CML models on the QMOF-2 dataset. The five CML models are constructed with the same kernel ridge regression method but different descriptors: the Sine Coulomb matrix (SineCM) 77, "Stoichiometric-45" (SM-45) 78, "Stoichiometric-120" (SM-120) 25, the orbital field matrix (OFM) 79, and SOAP. Additionally, SchNet, a representative graph-based DL model, is included as a comparison benchmark. For direct comparison with the benchmarks from the QMOF database, we also train MGANN and SchNet on the same QMOF-2 dataset, with the mean absolute error (MAE) and the Spearman rank-order correlation coefficient (ρ) as joint metrics to quantitatively gauge the performance of the different models.
The dataset is randomly split into 80% for training, 10% for validation, and 10% for testing. The random splitting is repeated five times for five parallel runs, over which the statistics of the MAEs and ρ on the testing sets are obtained. Notably, SchNet is trained using previously reported hyperparameters optimized specifically for MOFs 35, whereas no such fine exploration of the hyperparameter space is carried out for MGANN.
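The evaluation protocol above (80/10/10 random split, MAE, and Spearman ρ) can be sketched as follows. The tie-free Spearman implementation is a minimal stand-in; the published benchmarks were not necessarily computed with this exact code.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between reference and predicted values."""
    return float(np.mean(np.abs(y_true - y_pred)))

def spearman(y_true, y_pred):
    """Spearman rho = Pearson correlation of the ranks (no-tie case)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    a, b = rank(y_true), rank(y_pred)
    a -= a.mean(); b -= b.mean()
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))

def split_80_10_10(n, seed):
    """Random index split into 80% train, 10% validation, 10% test."""
    idx = np.random.default_rng(seed).permutation(n)
    n_tr, n_va = int(0.8 * n), int(0.1 * n)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
```

Repeating `split_80_10_10` with five different seeds reproduces the five-parallel-run protocol, with the mean and standard deviation of the per-run MAE and ρ reported as the final statistics.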
As shown in Fig. 3a, on the one hand, SOAP achieves the best performance among the CML models, indicating that descriptors sensitive to both atomic structures and chemical elements are essential for improving model performance. In fact, SOAP has been shown to perform on par with other SOTA models in building machine-learning potentials for systems containing a few elements 80. On the other hand, the DL models (CGCNN, SchNet, MGANN) substantially outperform all CML models in predicting the properties of structurally complex and elementally diverse MOFs.

Visualization of MGANN latent space. The latent features inside DL models usually serve as the final learned representations of molecules and materials. Gaining insight into the learned latent space is essential for efficient data mining and analysis, which is most often in the service of rational materials design and accelerated materials discovery. We use the unsupervised t-distributed stochastic neighbor embedding (t-SNE) 81 technique to project the high-dimensional latent features into 2D space for visualization, as illustrated in Fig. 4.
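The t-SNE projection of high-dimensional latent features into 2D can be sketched with scikit-learn. The synthetic data below is a stand-in for the learned MGANN latents, and the perplexity value is an illustrative assumption.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for the latent features of 60 materials (d_g = 16): two synthetic
# "chemical families" offset from each other in feature space.
latent = np.vstack([rng.normal(0.0, 1.0, (30, 16)),
                    rng.normal(6.0, 1.0, (30, 16))])

# Project to 2D for visualization; perplexity must be < n_samples.
emb = TSNE(n_components=2, perplexity=10, init="random",
           random_state=0).fit_transform(latent)   # shape (60, 2)
```

Plotting `emb` colored by a property such as the bandgap is then a scatter plot away; well-separated families in the latent space stay separated in the 2D embedding.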

CONCLUSIONS
In summary, we have proposed a multiscale graph attention neural network architecture for hierarchically learning deep representations of materials, aiming to address the loss of long-range correlations between atoms in present local graph representations. The introduction of the self-attention mechanism enables our model to capture multiscale characteristics of systems over potentially long distances on top of the local graph. Meanwhile, the method remains highly parallelizable during training and prediction. The SOTA performance achieved by our model in predicting the quantum-chemical properties of MOFs demonstrates its generality and extensibility for complex materials with large unit cells and widely diverse structures and compositions. Moreover, the latent space analysis substantiates the high fidelity of our model in the chemical clustering of materials that share chemical and structural similarities. In conclusion, our model makes it possible to explore the latent representations to gain more chemical insight and useful knowledge, sharply narrowing down the search space for high-throughput screening and accelerating the discovery of complex materials.

A local graph G_i = (V_i, E_i) is thus ready for describing the local configuration of atom i, where V_i and E_i are the sets of node and edge attributes in the local graph. N_li denotes the number of neighbors of the i-th atom within a certain cutoff radius R_c, or, usually, a fixed number of nearest neighbors to save computer memory, especially for unit cells of large size. The whole system is naturally described by an undirected multigraph consisting of all the local graphs.
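Neighbor selection within a cutoff radius R_c, optionally capped at a fixed number of nearest neighbors, can be sketched as follows. This is a non-periodic toy sketch; a production code for crystals would additionally handle periodic boundary conditions.

```python
import numpy as np

def neighbor_list(positions, r_cut=5.0, max_neighbors=None):
    """Neighbors of each atom within r_cut (non-periodic sketch).

    Optionally keep only the max_neighbors nearest, mimicking the
    fixed-neighbor-count option used to bound memory for large cells.
    """
    # Pairwise distance matrix via broadcasting: (N, N)
    d = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self-pairs
    nbrs = []
    for i in range(len(positions)):
        j = np.where(d[i] < r_cut)[0]
        j = j[np.argsort(d[i, j])]               # sort by distance
        if max_neighbors is not None:
            j = j[:max_neighbors]                # keep the nearest ones only
        nbrs.append(j.tolist())
    return nbrs

pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
nl = neighbor_list(pos, r_cut=2.0)               # only the 0-1 pair is bonded
```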

Fig. 1. Illustration of the multiscale graph attention neural networks. (a) Architecture of the overall MGANN model. MGANN receives atomistic structures of materials through graph-based descriptors in which the atoms and the bonds connecting them are regarded as the nodes and the edges of the graph, respectively. After the raw descriptors flow through a stack of graph blocks and self-attention blocks, local and non-local information is extracted hierarchically across scales, eventually arriving at a high-level, global embedding of atomic structures. (b) Architecture of the GCNN blocks. (c) Architecture of the self-attention blocks. (d) Schematic of mapping molecules with the MGANN model. (e) Evaluation of the attention weights of a local atomic environment (LAE) against itself and every other one in the whole system, over long distances, by the scaled dot-product attention operators.

Fig. 3 Performance of the MGANN model on the QMOF-2 dataset. (a) Comparison of the MAEs and ρ for the bandgaps between the MGANN model and prior benchmark models. All models are divided into two categories: classical machine learning (CML, blue boxes) models with hand-crafted descriptors and deep learning (DL, yellow boxes) models with learnable representations. Note that all results presented here are averages over five parallel runs, with standard deviations shown as error bars. (b) Comparison of DFT-computed (from the QMOF-2 database) and MGANN-predicted bandgaps of MOFs on a test set.

Fig. 4. The three representative points generally gather together because of their structural similarity. Interestingly, the various distances between the representative points explicitly indicate the differences in their properties. Specifically, the bandgap values of Cu[Cu(pdt)2]C2H2, Cu[Ni(pdt)2]C2H2, and Cu[Ni(pdt)2] are 0.2395 eV, 0.0392 eV, and 0.0246 eV at the DFT+PBE-D3(BJ) level, respectively 47. Hence, we can clearly observe that the two
The structure distribution of MOFs in the 2D latent representation space shows a discernible pattern with respect to property values: MOFs with high and low bandgaps are located in distinctly different regions. The learned representations from our MGANN model thus group MOFs that share similarities in atomic structure, elemental composition, and chemical properties. We first take the MOF-74-type analogs as an example. The MOF-74 analogs are derived from the known MOF-74 (Zn2(2,5-dihydroxybenzene-1,4-dicarboxylate)) 82. They share the same underlying topology but differ in their metal clusters (e.g., Zn, Mg), functional groups, and ligand lengths. Dozens of MOF-74 analogs 83 are contained in the QMOF database. This demonstrates the capability of the MGANN model to understand similarities in structure and trends in material properties. Another exemplification is performed for three isostructural MOFs, i.e., Cu[Cu(pdt)2]C2H2 (pdt2- = 2,3-pyrazinedithiolate) 84, Cu[Ni(pdt)2]C2H2 85, and Cu[Ni(pdt)2] 85. The difference between Cu[Cu(pdt)2]C2H2 and Cu[Ni(pdt)2]C2H2 lies in the metal clusters, while the difference between Cu[Ni(pdt)2]C2H2 and Cu[Ni(pdt)2] lies in the additional C2H2 molecules. The representative points of the three MOFs are highlighted by red hollow squares as shown in