Beyond the Scales: A physics-informed machine learning approach for more e cient modeling of SARS-CoV-2 spike glycoprotein

This paper presents a physics-informed machine learning approach to the derivation of a bottom-up coarsegrained model of the SARS-CoV-2 spike glycoprotein from all-atomic molecular dynamics simulations. The machine learning procedure employs a force-matching scheme in the optimization of interaction parameters, where the forcematching scheme is combined in methodology with the initialization of the interaction parameters by the traditional iterative Boltzmann inversion method. The force-matched machine learning procedure is constructed based on two physics-informed layers: one is the Harmonic layer consisting of bond, angle, and dihedral terms as bonded potentials; the other is the Lennard-Jones layer consisting of the non-bonded Lennard-Jones potential. Coarse-grained validation simulations are performed with the learned parameters to test the derived bottom-up coarse-grained model. The simulations are able to reach the microsecond time scale with stability. The physics-informed learning approach yields simulation speeds nearly 40,000 times faster than conventional all-atomic simulations while maintaining comparable simulation accuracy. Additionally, through examination of the non-bonded Lennard-Jones parameters and the radial distribution function analysis, the learning approach matches pairwise distances of the ground-truth data with greater accuracy than the conventional iterative approach method. Keywords—coarse-graining, machine learning, multiscale modeling, SARS-CoV-2, S-protein, molecular


INTRODUCTION
The outbreak of SARS-CoV-2 in 2019 and its continued persistence have led to millions of deaths globally [1], initiating great investigation efforts on its molecular structure and mechanisms of infection. There exist numerous unique spike glycoproteins that cover the outer surface of the virion. These proteins function as glycosylated trimers largely responsible for the binding of the virus to the host cell receptor angiotensin-converting enzyme 2 (ACE2), mediating cell entry [2]. The S-protein of the virus is comprised of two domains: the first one is the S1 domain which is on the top of the S-protein and contains the receptor-binding domain (RBD) responsible for binding to ACE2 as an initial step toward cell entry; the second one is the S2 domain that forms the stalk of the S-protein, which mediates cell-fusion and integration of the virus into host cells [3][4][5][6].
Efforts toward analyzing conformational states, binding, and functions of the virus have been critical to uncovering developments with regards to therapeutics and vaccines [7]. Currently, prominent techniques of electron microscopy and X-ray crystallography have been proven useful in unveiling protein structural models at high resolutions [8,9]. All-atomic molecular dynamics (AAMD) simulations have provided a vital insight into the biological properties of systems down to the atomic-and femtosecond-level resolution, deepening understanding into the biological functions, interactions with other molecules, and conformational states [10,11]. Studies are currently underway in uncovering specific mechanisms of action of the SARS-CoV-2 virion or possible therapeutics [12][13][14]. However, many such practical and large-scale applications of AAMD simulations are challenged by the computational expense when dealing with these larger proteins or structures. The S-protein, containing over twenty thousand atoms, is no exception to this limitation [15].
Because of the complexity of the S-protein, multiscale modeling (MSM) [16][17][18], which has been shown advantageously on modeling large molecules over greater timescales, is adapted to this investigation. In the MSM framework, coarse-graining (CG) has been well established within literature for simulating complex, high-definition systems using simplified, lower-resolution representations, thus simulating with greater computational efficiency [19][20][21][22]. There exists a broad spectrum of coarse-graining resolutions, resulting in various definitions of models of interactions. Among these definitions are "physics-based" approaches that utilize all-atom models of interactions in defining united-atom potentials [23,24] and "knowledge-based" approaches that derive interaction models based on known protein structures [25][26][27].
Our motivation is to derive "physics-informed" coarse-graining models to enable the stable simulation of the SARS-CoV-2 S-protein across more relevant spatial and temporal scales that are inaccessible to conventional allatomic simulations. Our physics-informed CG models are categorized as the aforementioned "physics-based" approach.
In this work, we construct a "bottom-up" CG model of the SARS-CoV-2 S-protein from classical all-atomic models of interactions. While "physics-based" transferable CG models like MARTINI [28] have shown feasibility in a variety of applications, we use more aggressive coarse-grain modeling which involves molecular species/groupings specific to the mapping of the S-protein structure, i.e. a bottom-up CG model. We derive and optimize the interaction potentials of the CG model using a combination of iterative and physics-informed machine learning (ML) approaches. The AAMD simulations are conducted on powerful supercomputers to help obtain large amounts of data to derive the associated CG potentials. The experimental outcomes show computing speed nearly 40,000 times faster than that of the AAMD simulations while retaining comparable structural accuracy. The contributions of this work can be summarized as follows.
• We present an innovative application of supervised ML in the derivation of a CG model; • We combine ML with molecular dynamics towards greater efficiency, achieving a speed of coarse-grained simulations of 40,000 times faster than that of conventional all-atom models while maintaining comparable structural accuracy; • The greater efficiency presents the timeliness of the research in producing long-term simulations of SARS-CoV-2 that can help illuminate protein mechanisms and impacts of environmental changes.
The rest of this paper is organized as follows. Section 2 details the proposed physics-informed bottom-up CG model based on the ideas introduced above and the implementation of the model to conduct the simulations. Section 3 reports the experimental outcomes of our CG model with comparison to the AAMD simulations. Section 4 draws conclusions from the reported results, followed by discussions of our future research topics along the direction of multiscale modeling of biomolecular systems.

METHODOLOGY
In the sections of (2.1) and (2.2) below, some background information is given to help the presentation of our physics-informed bottom-up CG model in section (2.3).

Coarse-Grained Structure Development
The full SARS-CoV-2 S-protein model was obtained from the protein data bank 6VXX and was run through Nanoscale Molecular Dynamics (NAMD) software [5,29]. The all-atom system consists of 22,815 atoms in the 6VXX structure, with a total of 45,153 atoms including the hydrogens during the simulation. We utilized the Shape-Based Coarse Graining approach [30] to aggressively reduce the model to 60 atoms, maintaining the homotrimeric structure with 20 atoms per chain, see Fig. 1. The coarse-grained beads were distributed using a self-organizing neural network with a stochastic learning algorithm. The learning parameters are as follows: we used 12000 learning steps with an initial epsilon value of 0.3 and an initial lambda value of 12.0; we used a final epsilon value of 0.05 and a final lambda value of 0.01; the bonds were determined from the all-atom structure.

All-Atomic Simulations
To obtain the ground-truth data for our learning procedure, we first conducted all-atomic simulations. Our simulations were conducted on the AiMOS supercomputer, a heterogeneous system architecture that includes IBM POWER9 CPUs connected to NVIDIA GPUs, and the Seawulf computing cluster at Stony Brook University. We utilized the CHARMM-36 force field [31] in describing the S-protein system in a vacuum canonical ensemble at 310K. The simulation began with 10 ps of energy minimization and was run for 400 ps to reach a stable state with a 1 femtosecond timestep. Instead of running on a single long-term simulation, we ran two hundred short simulations in parallel, with starting states randomly sampled and transformed from an equilibrated trajectory. Random frame sampling within the stable state in addition to random translational and rotational transformations was used to generate initial states for two hundred separate subsequent simulations. From these simulations, frames containing coordinate and force data were collected every one femtosecond. We accumulated a total of 9.7905 nanoseconds of the groundtruth all-atom data, which upon mapping yielded 97,905 data frames of coordinates and forces, representing our ground-truth data.

Coarse-Grained Modeling
In addition to the developed coarse-grained structure and constructed ground-truth data, our proposed physicsinformed CG model follows a serial multiscale approach, as shown by Fig. 2. In the first box of "Data Collection", spatial and temporal mapping schemes are employed to map the AAMD simulations to the reduced-resolution coarse-grained structure. In the second box of "Parameter Optimization", the resulting ground-truth data is used to generate the initial parameterization of the CG model. The model parameters are then further refined by the physics-informed supervised ML approach. The initialization strategy employs the iterative Boltzmann inversion (IBI) approach on the bonded parameters [32,33] and the Visual Molecular Dynamics (VMD) software for the nonbonded terms [34]. The ML approach employs a force-matching scheme using the ground-truth position and force data. In the last box of "Validation Analysis", the learned parameters are implemented in a coarse-grain molecular dynamics (CGMD) validation simulation for comparison with a separate continuous long-term AAMD vacuum simulation. The simulations are evaluated and compared in terms of simulation accuracy and computation speed. More details are provided in the following sections.  Fig. 2: Illustration of our coarse-grain modeling pipeline including data collection, parameter optimization, and validation analysis.

Coarse-grained mapping
The raw trajectory obtained in sections (2.1) and (2.2) was processed by mapping the extracted atomic coordinate and force data to the coarse-grained scale. Spatial mapping was conducted by computing the center of mass and the sum of forces for each atom group, constituting a bead, according to the following equations: where , and , represent the calculated position and force, respectively, of bead , , and , represent the position and force, respectively, of atom within the atom group constituting bead , and represents the mass of atom as a weighting factor.
In addition to the spatial mapping, temporal averaging was performed by the averaging of both coordinates and forces across the temporal dimension every 100 frames. Using this mapping scheme, we processed the ground-truth data for the initialization of our model parameters.

Parameter initialization
We utilized the traditional IBI method in generating initial bonded values or weights for our proposed ML network to train the data. Diverging from the conventional implementation, we incorporated the refinement of dihedral parameters in addition to the bonds and angles. From the ground truth simulations, we were able to extract distribution functions ( ) of variable representing the bond lengths, bond angles, or torsion angles. Utilizing the Boltzmann relation: representing the potential function, where is a parameter and T represents the temperature. Furthermore, the bonded parameters can be modeled as harmonics: where 0 represents the respective equilibrium measurement and represents the harmonic constant. Thus, we can illustrate the relation between distribution functions and harmonic constants as follows: where the equilibrium measurement 0 is equal to the average position ⟨ ⟩. A coarse-grain simulation was conducted as a trial run using these initial guesses, where the Boltzmann inversion relation was once again applied to extract the potentials. To update the potentials to match the reference or ground-truth distribution, constants were scaled to match the trial distribution with the reference. The process is an iterative refinement until trial distribution matches the reference within some tolerance.
Additionally, we utilized the VMD software in deriving the initialization of the Lennard-Jones (LJ) interaction potential. In this procedure, each bead was assigned an LJ strength based on where ℎ ℎ and represent the hydrophobic and total solvent-accessible surface areas of domain , respectively, and is the user-controlled maximum energy for LJ potential well-depth, which was selected to be 20 kcal/mol. The LJ potential radius (with the minimum of ) is given by the radius of gyration of the group of atoms constituting bead i, which is increased by a user-defined addition, e.g. an increment of 1 Å was selected in this work. The LJ energy constant and the equilibrium distance for a pairwise interaction are defined in Eq. (7) and Eq. (8), respectively. These parameters thus constitute the LJ potential, , between pairs of beads as shown in Eq. (9).

Physics-informed model
We constructed a force-matching model as a means of preserving thermodynamic consistency in minimizing the error between the instantaneous ground-truth forces and predicted forces [35][36][37][38]. The error is defined as: where represents the instantaneous force and represents the coarse grain potential and is the gradient operator. Firstly, the model contains an initial featurization layer that provides the physics-informed layers with the pairwise distances, bond lengths, bond angles, and torsion angles extracted from the input coordinates, as displayed in Fig. 3. The model was constructed based on two physics-informed layers: one is the Harmonic layer comprised of bond, angle, and dihedral terms as bonded potentials; and the other is the Lennard-Jones layer consisted of the LJ potential accounting for the weak dipole attraction between distant atoms and the hard-core repulsion between close atoms. Within the Harmonic layer, the trainable parameters include the harmonic constants for each of the aforementioned bonded potentials, whereas, in the Lennard-Jones layer, the trainable parameters are the bead strength and the minimum radius, , for each unique bead . There exist 471 and 40 trainable parameters that comprise the bonded and non-bonded interactions, respectively, in the physics-informed model. The parameters are associated with the Lennard-Jones layer or LJ potential of Eqs. (7)-(9) and the Harmonic layer with the dihedral potentials of Eqs. (11) or (12), both of which will be presented in the following paragraphs.
For the dihedral potentials, we represented them in two ways: harmonic representation of Eq. (11) and periodic representation of Eq. (12). The harmonic form represents the dihedral potential in the same manner as bonded potentials, where the trainable constants are analogous to spring constants. The periodic representation accounts for the periodicity of dihedrals, where the phase shift angle, , was adjusted to fit the equilibrium value as the potential minima.
We visualize the torsion angle distribution to depict the unimodality in Fig. 4 and thus confirm our choice of = 1 as the multiplicity for the periodic representations, as well as = 0 for the harmonic representation. The distributions in Fig. 4 present more common conformations with the yellow color, where the means are the respective equilibrium states. Based on the above analysis, the resulting energy in Eq. (10) can be calculated according to: where , and are constant parameters, r is bond distance, is bond angle, is torsion angle, and was defined above as the torsion phase shift angle, which acts as an equilibrium angle in the harmonic representation, see Eq. (11).
The calculated energy was passed through a gradient layer to compute the derivatives with respect to the input coordinates, thus the force predictions ̂ can be calculated by Eq. (13): ̂= − (13). From these predictions, we can determine the loss as a mean-squared error function between the calculated and the mapped ground-truth forces, as given in Eq. (10). Validation of our above presented physics-informed bottomup CG approach is described in the following section. The validation analysis is shown as the third box of the pipeline of Fig. 2.

CGMD Validation
To measure the performance of our approach, we evaluated a validation CGMD simulation across the metrics of accuracy and speed. Using the optimized parameters, we ran the CGMD simulations of one microsecond for both the harmonic and periodic dihedral parameter sets respectively.
With regards to accuracy analysis, torsional analysis was applied in the form of free energy surface plots and the free energy was plotted along two dihedral quadruples, where the free energy profiles provide insight into the conformational states. From the plots, we compared our validation simulations with the ground-truth training data using the dihedral pairs belonging to S-protein RBD and S2 domain. Radial distribution functions (RDF) were applied as well in providing insight into the distribution of particles around certain particles. Thus, we can utilize these plots as a mode of comparison between the ground-truth training AAMD simulations (i.e. the reference baseline for comparison purposes) and the CGMD validation simulations (i.e. our presented CGMD model) for evaluating structural accuracy.
Additionally, we incorporated the root-mean-square-deviation (RMSD) analysis to monitor the structural stability of the compared models throughout their respective trajectories. In this analysis, we ran a separate AAMD vacuum simulation (as the baseline for the purpose of comparison) in order to obtain a continuous trajectory of data.
In addition to the accuracy analysis, we further performed speed analysis as a measure of the overall model's performance and efficiency. Using the Seawulf cluster, we ran the CGMD validation simulation for 5 microseconds and compared its simulation speed with the continuous 0.1 nanosecond AAMD validation simulation (i.e. the one used in our RMSD analysis as the reference baseline).
The validation experimental outcomes are reported in the section below.

RESULTS
In section (3.1) below, we will focus on the performance of our machine learning for those parameters associated with our CG model. The learned parameters will be compared to the reference distributions derived from the ground-truth data. With the learned model parameters, we will report the accuracy of our CGMD model simulations vs. the baseline AAMD model simulations in section (3.2) and the simulation speed in section (3.3).

Parameter Learning
Using the obtained 97,905 coordinates and force frames, the parameter initialization for bonds, angles, and dihedrals, respectively, proceeded with 3 iterations. For each iteration, the trial simulations were conducted with 10 femtosecond timesteps, minimized for 500 picoseconds, and simulated for 4 ns. Figure 5 illustrates the IBI refinement of the parameters to match the reference distributions to reasonable accuracy after three iterations. There exist 511 total learnable parameters, including the LJ potential parameters, which are learned with the physics-informed model configured with Adam optimizer with a learning rate of 0.001, a batch size of 256 for 10 epochs. Figure 6 displays the loss plot over the training process. Both training and validation losses approached convergence after 4 epochs. The optimization of each individual bonded and non-bonded parameter over the 10 epochs is visualized in Fig. 7. The results of Fig. 7 agree very well with that of Fig. 6.

CGMD Validation: Accuracy
To test the efficacy of our CG model, we conducted validation simulations using the learned parameters of section (3.1). The simulations were configured with a timestep size of 10 femtoseconds, as opposed to the 1 femtosecond time-step used in our ground-truth all-atom simulations. The resulting simulations, for both the harmonic and periodic dihedral representations, reached stability for one microsecond. We used this data to conduct our accuracy analysis (as well as speed analysis to be presented in section (3.3) below) in comparison to the groundtruth data. We evaluated the validation accuracy using the torsion analysis, the RDF plots, and the RMSD analyses.

Torsion analysis
During our accuracy analysis, we evaluated the free energy profiles as a function of dihedral angles. The plots were used to analyze and compare the torsion angles as a representation of the protein conformational states. We display two separate pairs of torsional angles for such analysis: one is located on the receptor-binding domain, and the other is located in the S2 subunit, see Fig. 8.

Radial-distribution function measure
We implemented the RDF measure to compare the mapped ground-truth statistics (or the AAMD simulations) with the performance of our CGMD simulations, see Fig. 9 and Fig. 10.  As illustrated in Fig. 9 and Fig. 10, our physics-informed machine learning approach is able to resemble the reference plots with reasonable accuracy, as it can capture the more significant peaks within the RDF. However, there does indicate some loss of resolution and accuracy in the smaller peaks at larger distances. Investigation on this observation of some loss in some small peaks will be one of our future research topics.
In addition to the use of RDF measures on the statistics above, we further expanded the RDF measures on the performance of our ML procedure for model parameterization. As seen in section (3.1) above, the ML procedure indicates very noticeable refinements on the model parameterization, particularly on the non-bonded LJ potential terms. Within this refinement is the very noticeable decrease in both the epsilon and the associated well-depth terms. Upon further investigation, it is shown that the model's calculated energies begin with positive repulsion, gradually becoming negative by the end of the training, demonstrating the proper optimization to match the distances of the ground-truth data. In comparison with the IBI trial results above in section (3.1), specifically on the atom pair between atom numbers 18 and 47, the learned distances are more consistent with the ground-truth result, as shown in Fig. 11. Within the RDF plots, our CG model is able to capture the positions and peaks in the respective pairs with comparable accuracy to the ground-truth model. Further analysis indicates stability for the entirety of the microsecond, indicating the feasibility of our approach for long-term modeling of the SARS-CoV-2 S-protein.

RMSD
We analyzed the evolution of our presented CG model trajectory by calculating the RMSD values using the starting structure as a reference frame, see Fig. 12. The RMSD reveals the overall stability and conformational change of the whole protein. Protein coordinates were recorded every 10 picoseconds and the RMSD was calculated on the aligned trajectory. Figure 12 presents the RMSD of our CGMD simulations alongside the baseline continuous AAMD simulations. Our presented CGMD simulations appear to have greater fluctuations in comparison to the reference AAMD simulations but remain consistent throughout the full microsecond of simulation, indicating long-term structural stability. Investigation on this observation of fluctuations will be another topic of our future research interests. It is worth noting that convergence to a higher value is a known drawback of the RMSD metric and, therefore, does not indicate inaccuracy within our model.
In addition to the RMSD analysis above, we also visualized the protein structural conformations of the starting and ending states of the AAMD and CGMD simulations as shown in Fig. 13. The protein maintains the overall structure, however, the slight deviations in the AAMD and CGMD structures will be investigated in a future study. The following section will focus on the comparison studies in terms of simulation efficiency.

CGMD Validation: Speed
Both the reference AAMD ground truth simulations and our CGMD validation simulations were conducted on the Seawulf cluster, where each computing node consists of two Intel Xeon E5-2690v3 CPUs. By using the parallel NAMD package on 1 node with 24 CPU cores, the AAMD simulations with 1 femtosecond as time step size produced 0.243 nanoseconds/day while the CGMD simulations with 10 femtoseconds as time step size produced 9532.6 nanoseconds/day. The experimental outcomes indicate that our CGMD validation simulations have a speed nearly 40,000 times faster than that of the reference AAMD simulations. Detailed measurements are presented in Table 1.

CONCLUSION AND DISCUSSION
This work represents an implementation of artificial intelligence-enabled modeling towards more efficient multiscale CGMD validation simulations. The presented physics-informed ML approach to the model parameterization includes two phases: firstly, using all-atom simulations to generate the ground-truth data for parameter learning; secondly, using the learned parameters to run long-term coarse-graining simulations. Our proposed physics-informed bottom-up CGMD model simulations were compared to the ground-truth AAMD model simulations.
The comparison between our coarse-graining modeled validation simulations and the mapped all-atom modeled simulations yields high accuracy in agreement within the free energy profiles, indicating a resemblance of the conformation.
Aside from the accuracy, our simulation model is significantly faster than the all-atom simulation model. With the aggressive coarse-graining approach, our model is able to achieve the simulation speed nearly 40,000 times faster than that of the all-atom simulations. This significant speedup lends itself to more aggressive coarse-graining of the protein structure, or more accurately reproducing the structural properties of the S-protein compared to other CG models. Our work presented here underscores the following contributions toward more efficient multiscale modeling: • Our approach demonstrates feasibility and advantages with the application of supervised ML in the derivation of a coarse-graining model; • In combining ML with molecular dynamics, we immensely accelerate simulation speed compared to conventional all-atom models while maintaining stability and structural accuracy; • The gained efficiency can elucidate protein mechanisms and render a great impact on future simulational studies of environmental changes by relieving the ongoing concerns about timeliness.
During the validation simulations, we observed some deviations from our expectation and investigations on these deviations will be our future research tasks: • The observation of some loss in small peaks in the RDF analysis; • The observation of the fluctuations in the RMSD analysis; • The observation of the slight deviations in the AAMD and CGMD structures.
In addition, the application of this approach in simulating the S-protein under various environmental conditions, including solvation, will be investigated as another future research topic.