A Universal Framework for Featurization of Atomistic Systems

Molecular dynamics simulations are an invaluable tool in numerous scientific fields. However, the ubiquitous classical force fields cannot describe reactive systems, and quantum molecular dynamics are too computationally demanding to treat large systems or long timescales. Reactive force fields based on physics or machine learning can be used to bridge the gap in time and length scales, but these force fields require substantial effort to construct and are highly specific to a given chemical composition and application. A significant limitation of machine learning models is the use of element-specific features, leading to models that scale poorly with the number of elements. This work introduces the Gaussian multipole (GMP) featurization scheme that utilizes physically-relevant multipole expansions of the electron density around atoms to yield feature vectors that interpolate between element types and have a fixed dimension regardless of the number of elements present. We combine GMP with neural networks to directly compare it to the widely used Behler-Parrinello symmetry functions on the MD17 dataset, revealing improved accuracy and computational efficiency. Further, we demonstrate that GMP-based models can achieve chemical accuracy for the QM9 dataset, and their accuracy remains reasonable even when extrapolating to new elements. Finally, we test GMP-based models on the Open Catalyst Project (OC20) dataset, revealing comparable performance to graph convolutional deep learning models. The results indicate that this featurization scheme fills a critical gap in the construction of efficient and transferable machine-learned force fields.

The most common limitation is related to the scaling of the feature vector size as the number of elements in the system increases. Most existing descriptors scale polynomially or combinatorially with the number of elements in the system [24,36]. This poor scaling means that these featurization strategies can only be applied to training sets with a limited number of elements, making the prospect of a universal machine-learning model that works for all elements infeasible. While some approaches overcome this limitation [38,43], they have not been tested on large many-element benchmark systems. In addition to fingerprint scalability, there is also the issue of scalability of the regression model. Many high-accuracy models rely on kernel ridge regression (KRR), a technique that suffers from poor scalability for extremely large training set sizes. Neural networks are far more scalable, but if typical element-specific featurization schemes are used, then a high-dimensional neural network (HDNN) [24,45] is required to connect the features to atomic properties. The HDNN scheme uses element-specific neural networks, so the number of model parameters also increases with the number of elements. An alternative to featurization is deep learning [46], typically with graph-convolutional neural network (GCN) models [25,26,27,28,32,29,30]. These deep learning models show an excellent ability to learn appropriate representations for molecular, solid-state, and surface systems with many elements [47]. However, this comes at the cost of less transparent models that typically require more time for training and prediction than their feature-based counterparts. Moreover, many of the deep learning approaches utilize elemental features [32,25,27,28], suggesting that improved featurization schemes will translate to improved deep learning models.
Here we introduce a new featurization approach called Gaussian multipole (GMP) features. The GMP features utilize an implicit description of the electron density so that the feature vector's dimension is independent of the number of elements. Using the electron density as the fundamental input makes them suitable for universal machine-learning models that work for all elements and provides a straightforward route for extending them to systems that involve charged atoms or magnetic moments. It also naturally allows the use of the more efficient single neural network (SNN) [33] structure, where all elements share the same neural network. Moreover, the GMP features are related to a multipole expansion of the electron density [48,49], making them physically relevant and systematically improvable. Remarkably, we show that the GMP features are capable of interpolating between elemental species, a capability that, to our knowledge, has not been demonstrated with any other featurization scheme. These properties of the GMP features will facilitate a new family of fast and interpretable feature-based machine learning models that have the general applicability of more complex and opaque GCN models.
The GMP approach encodes the elemental identity through an approximate description of the electron density based on Gaussian basis functions. In this work, we use the valence density extracted from the SG15 pseudopotentials [50], approximated by 2-7 atom-centered Gaussians per element and normalized by the total number of valence electrons. The number and widths of the Gaussians for each element are determined using nonlinear regression, and these "static valence densities" are fixed for all models presented in this work. Details are provided in the supporting information. Conceptually, this is equivalent to using the valence density of non-interacting atoms as the fundamental description of the system (Fig. 1a). This approach also allows for interpolation and extrapolation between different elemental species, and concepts like charged atoms can be naturally accommodated by using the isolated charged atom to construct a suitable valence density. In principle, it is also possible to use other quantities such as the all-electron density, spin density, or self-consistent electron density, as long as the quantity can be represented by a linear combination of atom-centered Gaussians.
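As an illustration of this fitting step, the sketch below uses nonlinear least squares to represent a tabulated radial valence density as a small linear combination of normalized Gaussians. The density values, the number of Gaussians, and all variable names are hypothetical placeholders rather than the actual SG15-derived fits described in the SI.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical tabulated valence density rho(r) for one element; in practice
# this would be extracted from an SG15 pseudopotential.
r = np.linspace(0.0, 6.0, 200)                       # radial grid
rho_ref = 4.0 * np.exp(-1.8 * r**2) + 0.6 * np.exp(-0.35 * r**2)

def gaussian_density(r, *params):
    """Linear combination of 3D-normalized Gaussians: sum_k c_k (a_k/pi)^(3/2) exp(-a_k r^2)."""
    rho = np.zeros_like(r)
    for c, a in zip(params[0::2], params[1::2]):
        rho += c * (a / np.pi) ** 1.5 * np.exp(-a * r**2)
    return rho

# Fit 2 Gaussians (a coefficient and an exponent for each).
p0 = [1.0, 1.0, 1.0, 0.1]
popt, _ = curve_fit(gaussian_density, r, rho_ref, p0=p0, maxfev=10000)
print("fitted (coefficient, exponent) pairs:")
print(popt.reshape(-1, 2))
```

In practice the coefficients would also be rescaled so that the fitted density integrates to the total number of valence electrons, as described above.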
The approximated electron density is then vectorized via inner products between the electron density and atom-centered "probe" functions to generate the features for each atom. The probe function consists of a Gaussian function and a Maxwell-Cartesian spherical harmonic (MCSH) function. Gaussian functions of varying width account for the radial variation (Fig. 1b) of the electron density surrounding an atom, and the MCSH functions capture the angular variation (Fig. 1c). The Gaussian product rule provides analytical solutions to the integral in Eqn. 1, enabling highly efficient computation of the features. Rotational invariance is enforced by taking norms over each MCSH order [51] (differing from our previous publication [52]).

Figure 1: Illustration of the GMP featurization scheme. a) Construct the electron density distribution using a linear combination of Gaussians. b) Apply the Gaussian radial probe to focus on the electronic environments at different radial length scales around each atom core by masking the electron density distribution with Gaussian functions of different widths. c) Apply the angular (MCSH) probe, which acts as a multipole expansion of the radially-masked electron density, via inner products with different groups of MCSH functions. d) Take the norm of the results for each group to ensure rotational invariance, yielding an entry in the feature vector for each individual atom.
Mathematically, this is described as

$$\mu_{i,abc} = \langle G_i S_{abc},\, \rho \rangle = \int_V G_i(\mathbf{r})\, S_{abc}(\mathbf{r})\, \rho(\mathbf{r})\, d\mathbf{r}, \qquad \rho(\mathbf{r}) = \sum_{j} \sum_{k} G_{\mathrm{dens},j,k}(\mathbf{r} - \mathbf{r}_j),$$

where $\langle a, b \rangle$ denotes the inner product of two functions, $V$ is the volume, $\mu_{i,abc}$ is the feature resulting from the radial probe $G_i$ and the angular probe $S_{abc}$, and $\rho$ is the electron density distribution of the molecule, approximated by linear combinations of primitive Gaussians $G_{\mathrm{dens},j,k}$ centered at each atom $j$. The formalism does not include a strict cutoff radius, though for computational efficiency the sum can be restricted to surrounding atoms with a non-negligible overlap. The MCSH functions $S_{abc}$ are linear combinations of polynomials,

$$S_{abc}(\mathbf{r}) = \sum_{m} c_m\, x^{m_x} y^{m_y} z^{m_z},$$

where $abc$ is the specific index of the spherical harmonic function, $n = a + b + c$ is its order, and $m_x$, $m_y$, $m_z$ are the exponents of the specific polynomial. The first 4 orders of MCSH functions are listed in Table 1. To ensure that the resulting features are rotationally invariant, we use the weighted sum of squares over each rotational group (note that this definition differs from our previous publication [52]). The GMP feature vector entries are therefore defined as

$$\Phi_{i,\{abc\}} = \left( \sum_{(a',b',c') \in P(a,b,c)} w_{a'b'c'}\, \mu_{i,a'b'c'}^{2} \right)^{1/2},$$

where $\Phi$ denotes the GMP features, $i$ is an index over the radial probes, $\{abc\}$ is an index combination corresponding to a rotational group, and $P(a,b,c)$ denotes the permutation group of $a, b, c$ (e.g., $P(1,0,0) = \{(1,0,0), (0,1,0), (0,0,1)\}$).
The weight $w_{abc}$ is shared among the MCSH functions within a given permutation group of order $n = a + b + c$. The full set of features can thus be written as

$$\Phi = \left\{ \Phi_{i,\{abc\}} \,:\, i = 1, \dots, N_{\mathrm{Gaussian}},\; 0 \le a + b + c \le N_{\mathrm{MCSH}} \right\}.$$

Conceptually, the resulting features are similar to a multipole expansion of the electron density around each atom. The multipole expansion provides rotational features that are complete and orthogonal. The Gaussian probe's width controls the radial length scale of the multipole expansions, and the radial features are over-complete. These properties reduce linear dependencies within the features and lead to a systematic improvement in the description of the system as the number of features increases. The GMP scheme is closely related to the "wavelet" features [38], but it uses a different representation of the electron density and differs in the way the spherical harmonics are normalized and computed, so that it provides atom-centered features. It is also similar to the SOAP scheme, which uses element-specific densities and spherical harmonics [24,36,37]. However, the key difference is that the length of the feature vector is fixed regardless of the number of elements, and the normalization across rotational groups leads to fewer features. These properties enable GMP to efficiently featurize complex systems with an arbitrary number of elements and arbitrary boundary conditions.
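To make the analytic evaluation concrete, the sketch below computes the lowest-order ($S_{000} = 1$) contribution of a single density Gaussian to a feature using the Gaussian product rule, and verifies it against brute-force numerical integration. The parameter values are arbitrary, and this is an illustration of the underlying integral rather than the AMPTorch implementation.

```python
import numpy as np

def overlap_s000(alpha, beta, B, r0):
    """Analytic <exp(-alpha |r|^2), B exp(-beta |r - r0|^2)> over R^3.

    By the Gaussian product rule the integrand is a single Gaussian with
    prefactor D and exponent K = alpha + beta."""
    d2 = np.dot(r0, r0)
    D = B * np.exp(-alpha * beta / (alpha + beta) * d2)
    K = alpha + beta
    return D * (np.pi / K) ** 1.5            # integral of exp(-K |r - r0'|^2)

alpha, beta, B = 0.8, 1.5, 2.0
r0 = np.array([0.3, -0.4, 0.5])

# Brute-force check on a real-space grid.
g = np.linspace(-6.0, 6.0, 121)
X, Y, Z = np.meshgrid(g, g, g, indexing="ij")
integrand = (np.exp(-alpha * (X**2 + Y**2 + Z**2))
             * B * np.exp(-beta * ((X - r0[0])**2 + (Y - r0[1])**2 + (Z - r0[2])**2)))
numeric = integrand.sum() * (g[1] - g[0]) ** 3

print(overlap_s000(alpha, beta, B, r0), numeric)   # the two values agree closely
```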
In this work, we combine the GMP features with a neural network regression model to demonstrate the efficiency, accuracy, and transferability of models based on these features. As mentioned, the single neural network (SNN) architecture [33] is well-suited to the GMP features since all elements utilize the same features (Fig. 1). We also utilize a per-element bias term, equivalent to fitting formation energies instead of total energies, to reduce the magnitude of the energies. We have implemented the GMP+SNN framework in the AMPTorch code [53,54] and used this implementation for all models in this work. The number of parameters used by the GMP+SNN model varies depending on the number of features, hidden layers, and nodes per layer, but is generally lower than the number of parameters in a comparable GCN by an order of magnitude. Details of model fitting, implementation, and parameters are provided in the supplementary information. The notation GMP($N_{\mathrm{Gaussian}}$, $N_{\mathrm{MCSH}}$) is used to denote the feature sets in the examples below, where $N_{\mathrm{Gaussian}}$ is the number of radial Gaussians and $N_{\mathrm{MCSH}}$ is the maximum MCSH order used to construct the feature set. The notation SNN([layer sizes]) is used to denote the SNN model. The activation function is Tanh for the MD17 test and GELU for the QM9 and OC20 tests, and batch normalization is applied to each layer. Therefore, GMP(9,3) is a feature set constructed using 9 radial Gaussians and MCSH functions up to order 3, resulting in 9 × 4 = 36 features, and GMP(9,3)+SNN(50,50,50) is an SNN model based on the GMP(9,3) feature set with 3 hidden layers and 50 nodes per layer. We note that models with the same number of radial Gaussian probes are not necessarily equivalent, since the Gaussian widths are determined manually at this point. Additional details, including the widths used for the radial Gaussians, are provided in the supplementary information.
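A minimal PyTorch sketch of an SNN with a per-element bias is shown below. The shared network, batch normalization, and per-element bias follow the description above, but the class and variable names are hypothetical and the details differ from the AMPTorch implementation.

```python
import torch
import torch.nn as nn

class SNN(nn.Module):
    """A single network shared by all elements, mapping per-atom GMP
    features to atomic energies, plus a learnable per-element bias."""
    def __init__(self, n_features, hidden=(50, 50, 50), n_elements=5,
                 activation=nn.Tanh):
        super().__init__()
        layers, d = [], n_features
        for h in hidden:
            layers += [nn.Linear(d, h), nn.BatchNorm1d(h), activation()]
            d = h
        layers.append(nn.Linear(d, 1))
        self.mlp = nn.Sequential(*layers)
        self.element_bias = nn.Embedding(n_elements, 1)

    def forward(self, features, element_idx, image_idx, n_images):
        atom_e = self.mlp(features) + self.element_bias(element_idx)
        # Sum atomic contributions into per-image total energies.
        total = torch.zeros(n_images, 1, dtype=atom_e.dtype)
        return total.index_add(0, image_idx, atom_e)

# Example: GMP(9,3) yields 9 x 4 = 36 features per atom.
model = SNN(n_features=36)
feats = torch.randn(12, 36)                    # 12 atoms across 3 images
elem = torch.randint(0, 5, (12,))              # element index of each atom
img = torch.randint(0, 3, (12,))               # image index of each atom
print(model(feats, elem, img, n_images=3).shape)   # torch.Size([3, 1])
```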
First, we compare the GMP+SNN model to the Behler-Parrinello neural network (BPNN) approach, which is one of the most computationally efficient featurization schemes [39] and is ubiquitous in materials science and chemistry due to its simplicity, generality, and efficiency [24,55,56,57,58,59,54]. We utilize an established molecular dynamics trajectory of the 3-element (C, H, O) aspirin molecule at the DFT/PBE+vdW-TS level of theory for this comparison [60]. For the BPNN featurization, we use 12 variants inspired by examples in the literature [61]. All neural networks consist of 3 hidden layers with 50 nodes each, and the BPNN approach uses separate neural networks for each element (resulting in 3 times the number of parameters). We use a training set size of 40K images for all models, and the test and validation sets each contain 10K images. We ensure robustness by using ten randomly selected train/test/validation sets. The performance is measured by the mean absolute error (MAE) of the predicted energies for the test set images. The results reveal that the GMP+SNN accuracy increases systematically with the multipole expansion order and the number of radial probes, providing a clear strategy for identifying an appropriate feature set for a given problem. In Fig. 2b, we compare the accuracy of the GMP+SNN model to the BPNN model as a function of the number of features. All Pareto-optimal models utilize the GMP+SNN approach, despite the fact that the BPNN models have three times more fitted parameters due to element-specific neural networks. It is also clear that the accuracy of BPNN models does not systematically improve with the number of features. Finally, we compare the wall time needed to compute a single image for each model. To ensure a fair comparison, we use the same CPU for each test, both approaches use the same C++ source code and loop structure, and the time is averaged over the 10K predicted validation images (see supplementary information). Fig. 2c shows that the GMP+SNN framework is always faster than the BPNN approach at a fixed accuracy level, or equivalently, always more accurate for a fixed computation time. This example demonstrates that the GMP+SNN approach is capable of achieving a lower error than the BPNN approach with fewer parameters and less time.
The performance of the GMP+SNN can be further improved with force regularization, using a loss function defined as

$$L = \frac{1}{N_{\mathrm{image}}} \sum_{i=1}^{N_{\mathrm{image}}} \left[ \left( E_i^{NN} - E_i^{\mathrm{ref}} \right)^2 + \frac{\beta}{3 N_{\mathrm{atom}}^{i}} \sum_{j=1}^{N_{\mathrm{atom}}^{i}} \sum_{k=1}^{3} \left( F_{ijk}^{NN} - F_{ijk}^{\mathrm{ref}} \right)^2 \right],$$

where $N_{\mathrm{image}}$ is the number of training images, $N_{\mathrm{atom}}^{i}$ is the number of atoms in image $i$, $E^{NN}$ and $E^{\mathrm{ref}}$ are the predicted and reference energies of an image, $F^{NN}$ and $F^{\mathrm{ref}}$ are the predicted and reference forces on an atom along each Cartesian axis, and $\beta$ is the force regularization coefficient. Figure 3 shows the results of GMP+SNN with 2 feature sets, GMP(6,3)+SNN(50,50,50) and GMP(10,6)+SNN(50,50,50), as a function of the force regularization coefficient $\beta$. These two models are both on the Pareto frontier from the previous test. For simplicity, this test was done with 1K training images and 1K test images. The results confirm that energy training benefits from force regularization; in this case, the energy error is reduced by a factor of 2 to 3 at the optimal regularization coefficient. Both energy and force prediction accuracy show the same trend and the same optimal force coefficient, indicating that the model avoids the apparent trade-off between energy and force accuracy that has been observed for some GCN model architectures [47].

Next, we show that the GMP+SNN technique can scale to systems with more elements by training it on the atomization energies of the established QM9 benchmark dataset [62]. This dataset consists of 130K chemical species with up to 9 heavy atoms and five elements (C, H, O, N, F), optimized at the B3LYP level of theory. In this example, 4 GMP descriptor sets and corresponding SNNs of different sizes are trained and tested using the energies of each system. The learning curves showing the out-of-sample test error [63] as a function of training set size are shown in Fig. 4a. The errors of the tested GMP+SNN models are only slightly higher than those of the best state-of-the-art models based on Gaussian processes or GCNs (approximately 6 meV with 100K training data) [63,30,31,27,25,28,35,36], but the GMP+SNN models utilize fewer adjustable parameters and are more scalable than non-parametric models like Gaussian process regression. Fig. 4b shows the results of transfer learning between different elemental species in the QM9 dataset. We trained a model on the set of all molecules that do not contain fluorine (128,908 molecules) and tested the same model on fluorine-containing molecules (1,923 molecules). The model error increases by around one order of magnitude when the predictions include elements not present in the training set. However, the test set MAE is 0.32 eV, which is competitive with the accuracy of generalized gradient approximation density functional theory (GGA DFT) [64]. This reveals that the model is reasonably accurate even for elements outside the training set. The error distribution for transfer learning is bimodal, which is primarily related to the number of F atoms in the test systems. If the error is normalized by the number of F atoms in the system, the MAE decreases to 0.179 eV/F atom, and the distribution becomes more symmetric (see SI). In addition, we note that the error distribution for F atoms is relatively sensitive to the initial weights of the neural network: the MAE varies from 0.2 to 0.4 eV per system depending on the initial weights (see SI). This indicates that the model is extrapolating and is therefore more sensitive to the initial conditions and training procedure.
However, the resulting models never fail catastrophically since the errors are always competitive with GGA DFT, indicating the general robustness of the transfer learning between elements.
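Returning to the force-regularized loss defined at the start of this section, one plausible PyTorch realization, assuming predicted and reference energies and forces have already been collected into flat tensors, is sketched below; the tensor layout and names are assumptions for illustration.

```python
import torch

def energy_force_loss(E_nn, E_ref, F_nn, F_ref, image_idx, n_atoms, beta=1e-3):
    """Force-regularized loss.

    E_nn, E_ref : (N_image,) predicted / reference energies
    F_nn, F_ref : (N_atoms_total, 3) predicted / reference forces
    image_idx   : (N_atoms_total,) image each atom belongs to
    n_atoms     : (N_image,) number of atoms in each image
    """
    energy_term = (E_nn - E_ref) ** 2
    # Per-image sum of squared force errors over atoms and Cartesian axes.
    sq_force = ((F_nn - F_ref) ** 2).sum(dim=1)
    force_term = torch.zeros_like(E_nn).index_add(0, image_idx, sq_force)
    return (energy_term + beta * force_term / (3.0 * n_atoms)).mean()
```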
Finally, we test the performance of the GMP+SNN approach on the Open Catalyst Project (OC20) S2EF dataset. This dataset consists of >100M geometries and corresponding energies of >100K adsorbate-catalyst pairs, spanning 56 elements and 82 adsorbates with up to >100K different catalyst compositions for each adsorbate. The dataset also provides an independent set of 1M test systems [47]. The energies correspond to the GGA DFT level of theory with the RPBE functional [65], with mixed boundary conditions. The size, number of elements, and mix of solid-state and molecular systems make this one of the most challenging benchmark datasets available. To date, the only models capable of training and prediction for this dataset utilize elaborate GCN models [47]. Fig. 5a shows the learning curves for four GMP+SNN models of different sizes (up to ∼1.43M parameters) tested on the provided in-domain (ID) validation set. Details of all models are provided in the supplementary information. The full OC20 training set of 100M energies would require months to train with the computing resources currently available to us, so we restrict the analysis to energy training on datasets with fewer than 5M training images.
The results of training and testing on the OC20 set are shown in Fig. 5a, and the accuracies are comparable to those reported for GCN models on this dataset [47]. Moreover, the GMP+SNN models require fewer parameters than the GCN models, with the largest GMP+SNN model having ∼1.4M parameters, compared to 1.8M (DimeNet++), 3.6M (CGCNN), and 7.4M (SchNet) parameters for the GCN models [47]. The error also decreases as the number of data points, the number of GMP features, or the neural network size increases. This suggests that further improvement is possible, although significant computational resources will be required to optimize and evaluate GMP+SNN models for the OC20 dataset.
Finally, the scaling of computational time with the number of atoms in the system is plotted in Fig. 5b. The systems of different sizes are generated by making supercells that are periodic replications of an original unit cell of 35 atoms. The scaling is strictly linear due to the local nature of the features, which presents an advantage over most graph-based models, which require the entire structure as input. The linear scaling of the GMP+SNN model, along with a structure that is straightforward to parallelize, makes it ideal for the simulation of large systems that may not be feasible with graph-based models.

These examples demonstrate that the GMP featurization scheme is an efficient and universal approach to fingerprinting atomistic systems with an arbitrary number of elements or atom types. The GMP features are more computationally efficient than the deep-learning GCN models that are commonly used for many-element systems, and are even faster than the widely used Behler-Parrinello symmetry functions, resulting in machine-learned force fields that can be scaled to large many-element systems and long time scales. The GMP feature vectors utilize an implicit description of the system's electron density, making the number of features independent of the number of elements, facilitating the inclusion of general concepts from electronic structure theory, and enabling extrapolation to new element types. Moreover, the GMP features utilize physically-meaningful concepts, leading to more interpretable models that can be systematically improved and providing a foundation for hybrid models that incorporate more physics. For example, the self-consistent electron density from a simplified Hamiltonian could be used as input to the GMP model, or the limiting behavior derived from electrostatics could be used to constrain the regression model.
The examples presented here use the SNN regression model in conjunction with GMP features based on static valence densities to obtain accuracy comparable to state-of-the-art models on the MD17, QM9, and OC20 datasets. The results confirm that the GMP+SNN approach can reach accuracy competitive with GCN-based models despite having far fewer adjustable parameters and a simpler structure. There are numerous opportunities to revise and optimize the details of the GMP+SNN approach presented here, including modifying the valence density representation, optimizing feature selection, and systematically optimizing the SNN architecture. However, the encouraging initial results across a wide range of application domains suggest that the GMP+SNN approach is a promising universal route to constructing efficient and general machine-learning models for atomistic systems.

Code and Saved Models
Please check out and install the "MCSH_paper1" branch of AMPTorch to try any of the saved models and test scripts in this study. The code can be found here: https://github.com/medford-group/amptorch. The test scripts can be found here: https://github.com/medford-group/GMP_AmpTorch_Tests. Tutorials for regenerating all the test models are included in the repo.

Sample Calculation and Proof of Rotation Invariance
To show that the norms of the groups are rotation invariant, it suffices to show that the results depend only on interatomic distances.

General Expression
Substituting in the definition of $S_{abc}$ and the Gaussian functions gives

$$\mu_{i,abc} = \int_V A e^{-\alpha |\mathbf{r}|^2}\, S_{abc}(\mathbf{r}) \sum_{j} \sum_{k} B_{j,k}\, e^{-\beta_{j,k} |\mathbf{r} - \mathbf{r}_{0,j}|^2}\, d\mathbf{r},$$

where $A$, $B$, $\alpha$, and $\beta$ are parameters of the Gaussian functions, and the $\mathbf{r}_0$'s are the relative coordinates of the nearby atoms of interest.
Since the product of two Gaussian functions is another Gaussian function,

$$e^{-\alpha |\mathbf{r}|^2}\, e^{-\beta |\mathbf{r} - \mathbf{r}_0|^2} = e^{-\frac{\alpha \beta}{\alpha + \beta} |\mathbf{r}_0|^2}\, e^{-(\alpha + \beta) \left| \mathbf{r} - \frac{\beta}{\alpha + \beta} \mathbf{r}_0 \right|^2},$$

the equation can be written as

$$\mu_{i,abc} = \sum_{j} \sum_{k} D_{j,k} \int_V S_{abc}(\mathbf{r})\, e^{-K_{j,k} |\mathbf{r} - \mathbf{r}'_{0,j,k}|^2}\, d\mathbf{r},$$

where the $D$'s and $K$'s are constants, the $\mathbf{r}'_0$'s are the centers of the resulting Gaussian functions, and $x'_0$, $y'_0$, $z'_0$ are the components of $\mathbf{r}'_0$. Since the integral has analytical solutions, $\mu_{i,abc}$ is straightforward to calculate.

Rotation Invariance
To show that the features $\Phi_{i,\{abc\}}$ are rotation invariant, it suffices to show that their values depend only on interatomic distances. Although a general proof for all MCSH functions ($S^n_{abc}$) is more involved and beyond the scope of this work, it is straightforward to show a few specific cases as examples, and the rest can be proved in the same way.
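This invariance can also be checked numerically. The sketch below implements the analytic $\mu_{i,100}$, $\mu_{i,010}$, and $\mu_{i,001}$ integrals for a single radial probe and a set of density Gaussians, and confirms that the group norm is unchanged under a random rotation of the atomic positions; all parameter values are arbitrary.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def phi_100(alpha, centers, betas, Bs):
    """Group norm over P(1,0,0): sqrt(mu_100^2 + mu_010^2 + mu_001^2)."""
    mu = np.zeros(3)
    for r0, beta, B in zip(centers, betas, Bs):
        K = alpha + beta                        # exponent of the product Gaussian
        D = B * np.exp(-alpha * beta / K * np.dot(r0, r0))
        r0p = (beta / K) * r0                   # center of the product Gaussian
        # integral of x * exp(-K |r - r0'|^2) over R^3 is x0' * (pi/K)^(3/2), etc.
        mu += D * (np.pi / K) ** 1.5 * r0p
    return np.linalg.norm(mu)

rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 3))               # positions of 4 nearby atoms
betas = rng.uniform(0.5, 2.0, size=4)           # density Gaussian exponents
Bs = rng.uniform(0.5, 2.0, size=4)              # density Gaussian prefactors
R = Rotation.random(random_state=1).as_matrix()

print(phi_100(0.7, centers, betas, Bs))
print(phi_100(0.7, centers @ R.T, betas, Bs))   # identical: rotation invariant
```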

$\Phi_{i,000}$
Following the general expression above, and noting that $S_{000} = 1$ has only one term,

$$\mu_{i,000} = \sum_{j} \sum_{k} D_{j,k} \int_V e^{-K_{j,k} |\mathbf{r} - \mathbf{r}'_{0,j,k}|^2}\, d\mathbf{r} = \sum_{j} \sum_{k} D_{j,k} \left( \frac{\pi}{K_{j,k}} \right)^{3/2}.$$

Since each $D_{j,k}$ depends only on the distance $|\mathbf{r}_{0,j}|$, it is evident that this feature is rotation invariant.

$\Phi_{i,\{100\}}$

Following the general expression above,

$$\mu_{i,100} = \sum_{l} D_l \left( \frac{\pi}{K_l} \right)^{3/2} x'_{0,l},$$

where $l$ iterates through the index pairs $(j, k)$, and similarly for $\mu_{i,010}$ and $\mu_{i,001}$. Therefore,

$$\mu_{i,100}^2 + \mu_{i,010}^2 + \mu_{i,001}^2 = \sum_{l} \sum_{l'} D_l D_{l'} \left( \frac{\pi}{K_l} \right)^{3/2} \left( \frac{\pi}{K_{l'}} \right)^{3/2} \left( \mathbf{r}'_{0,l} \cdot \mathbf{r}'_{0,l'} \right),$$

and since $\mathbf{r}'_{0,l} \cdot \mathbf{r}'_{0,l'} = \frac{1}{2} \left( |\mathbf{r}'_{0,l}|^2 + |\mathbf{r}'_{0,l'}|^2 - |\mathbf{r}'_{0,l} - \mathbf{r}'_{0,l'}|^2 \right)$, the result is only a function of distances. Hence it is rotation invariant.

The full list of test results for the BPNN models is given in Table 3, and the sigmas of the radial probe Gaussians are listed in Table 5.

QM9 Example

Per-element Bias
A per-element bias is added to the SNN model to improve performance. Conceptually, this is equivalent to fitting formation energies rather than absolute energies. The total energy contribution of an atom is the model-predicted energy plus the per-element bias of the specific atom type. To determine the biases, a linear model is applied: the number of atoms of each element type is counted for every image in the training set and used as the independent variables, and the corresponding energy of each system is the dependent variable. For example, per-element biases were determined in this way for the trials with 100K training images. With the per-element biases determined, GMP+SNN models are fitted to the atomization energy minus the per-element biases. The model setups are given in Table 7. The cutoff distance is always 15 Å, so that the largest radial probe takes a negligible value of $8 \times 10^{-4}$ at the cutoff. The models are trained for 12,000 epochs, with the learning rate decreased by a factor of 2 every 2,000 epochs, from $1 \times 10^{-2}$ to $3 \times 10^{-4}$. The batch size is set to 32 images.
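The bias determination described above amounts to an ordinary least-squares problem. A minimal sketch with hypothetical data is shown below; the counts and energies are placeholders, not values from the QM9 fits.

```python
import numpy as np

# Rows: training images; columns: elements (e.g., H, C, N, O, F).
# counts[i, e] is the number of atoms of element e in image i.
counts = np.array([[8.0, 5.0, 0.0, 1.0, 0.0],
                   [4.0, 3.0, 1.0, 2.0, 0.0],
                   [6.0, 4.0, 0.0, 0.0, 1.0],
                   [2.0, 6.0, 2.0, 1.0, 0.0]])
energies = np.array([-120.3, -95.7, -88.2, -131.9])   # reference energies (eV)

# Least-squares per-element biases; the residuals become the SNN targets.
bias, *_ = np.linalg.lstsq(counts, energies, rcond=None)
residual_targets = energies - counts @ bias
print("per-element biases:", bias)
print("residual targets:", residual_targets)
```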

Transfer Learning to New Element
For this example, we used the same procedure as above, with the caveat that the per-element biases are not fitted using a linear model but are taken directly from the 100K-molecule trial. For more details, please refer to the test scripts.
Shown in Figure 6 is the error distribution for systems containing F atoms, expressed as error per F atom. Shown in Figure 7 are the error distributions from trials with different random seeds.