ST-PCNN: Spatio-Temporal Physics-Coupled Neural Networks for Dynamics Forecasting

Ocean current, fluid mechanics, and many other spatio-temporal physical dynamical systems are essential components of the universe. One key characteristic of such systems is that certain physics laws -- represented as ordinary/partial differential equations (ODEs/PDEs) -- largely dominate the whole process, irrespective of time or location. Physics-informed learning has recently emerged to learn physics for accurate prediction, but they often lack a mechanism to leverage localized spatial and temporal correlation or rely on hard-coded physics parameters. In this paper, we advocate a physics-coupled neural network model to learn parameters governing the physics of the system, and further couple the learned physics to assist the learning of recurring dynamics. A spatio-temporal physics-coupled neural network (ST-PCNN) model is proposed to achieve three goals: (1) learning the underlying physics parameters, (2) transition of local information between spatio-temporal regions, and (3) forecasting future values for the dynamical system. The physics-coupled learning ensures that the proposed model can be tremendously improved by using learned physics parameters, and can achieve good long-range forecasting (e.g., more than 30-steps). Experiments, using simulated and field-collected ocean current data, validate that ST-PCNN outperforms existing physics-informed models.


INTRODUCTION
S PATIO-temporal modeling is essential in many scientific fields ranging from studies in biology [1], [2], information flow in social networks [3], sensor network communications [4], traffic predictions [5], [6], climate and environment forecasts [7], [8], to recent COVID-19 spread modeling [9]. These applications rely on accurate predictions of spatio-temporal structured data reflecting the realworld phenomena. In all mentioned cases, the major challenge is to infer, model, and predict the underlying causes, which generate the perceived data stream and propagate the involved causal dynamics through graphs and distributed sensor meshes. A stunning characteristic of such dynamical systems is that the widely distributed members (or sensors) share striking homogeneity and heterogeneity. The former is driven by the physics laws governing the systems, whereas the latter is impacted by the localized factor in spatial and temporal regions.
Take information propagation mechanisms of ocean current in Fig. 1 as an example, where each node denotes a geographic location observing ocean current. Three types of Fig. 1: Three types of dependencies in spatio-temporal modeling. A node is mixed with heterogeneity (orange) and homogeneity (blue) information, propagating between neighbors. At any time point t, the status of a center node is influenced by its previous time point (t-1) and its neighbors (red-arrows with various levels of strengths/weights). dependencies exist in spatio-temporal modeling: 1) Spatial dependence: a node in the mesh concurrently affects, and be affected by, its neighbors; 2) Temporal dependence: a node status depends on its previous status and affects its future status; and 3) Spatio-temporal dependence: a node even directly influence its neighbor nodes across the time.
Deep learning methods, such as graph neural networks (GNNs), have been applied to spatio-temporal modeling. Existing methods take temporal information into accounte.g. ARIMA [10], or integrate complex spatial dependencies into temporal models -e.g. ConvLSTM [11] and ST-3DNet [12]. Most recently, researchers utilize graph convolution methods, such as DCRNN [13] and STGCN [6], to model spatial correlations in spatio-temporal structured data. Instead of modeling the spatio and temporal correlations separately, STSGCM [14] and STG2Seq [15] try to simultaneously capture the localized spatio-temporal correlations. It is worth mentioning that STSGCM deploys multiple modules arXiv:2108.05940v1 [cs.LG] 12 Aug 2021 on each time period to capture the heterogeneity, which is computationally intensive.
One major issue of existing deep learning methods is that they seldom include prior knowledge of the underlying physics, i.e. taking the "homogeneity" into consideration.
A key property, that all spatio-temporal processes have in common, is that some generally underlying principles will apply irrespective of time or location when observing natural processes. As a result, the same predictable patterns individually modified by local spatial and temporal influences are observable repeatedly at different spatial locations in time. Recently, physics informed learning has emerged to incorporate physics into the deep learning [16], [17], [18]. PDE-Net [19] and ODE-Net [20] are proposed to train neural networks that simultaneously approximate the simulations and conforms to the PDEs representing the physical knowledge of systems. However, these methods suffer two major shortcomings: (1) they are often restricted to 1D temporal sequence or to a regular grid where constraints on the learnable filters can be easily defined; and (2) there is no good solution to combine homogeneity and heterogeneity for effective prediction. In summary, three research challenges are identified as follows: Learning physics: Physics is one of the fundamental pillars describing how the real-world behaves. Although physics informed learning has [16], [17], [18] taken physics into consideration, such methods are only applicable when specific/general equations are explicitly given and none of them consider the spatio-temporal cases. In reality, it is not always possible to describe all rules governing real-world data. We need to have a learning mechanism to automatically discover the physics underneath the data observations.
Coupling physics and spatio-temporal information: Homogeneity and heterogeneity are two key characteristics of dynamical systems, but governed by different modules. We need to have a new way to enable the learning of physics, and use the inferred physics to further guide the spatio-temporal learning with robust prediction, regardless of physical plausibility. Long-term forecasting: For dynamical systems, long-term forecasting allows proactive controls and early planning. However, most existing models only work on a one-stepahead short-term forecasting [21], [22]. An intuitive way to achieve long-term prediction is to recursively reuse previous-step predictions as input for the next-step prediction. Inevitably, such a mechanism leads to prediction errors that are probable to accumulate over time, and the results obtained may be modest.
In this paper, we propose a spatio-temporal physicscoupled neural networks (ST-PCNN) model to capture spatio-temporal correlations, and heterogeneity and its inherited homogeneity in spatially distributed manner. ST-PCNN is a three-network architecture, consisting of a forecasting net (FN), a transition net (TN), and a physics net (PN). ST-PCNN learns predictive neural network (FNs) that are distributively executed at different locations of a grid. Additional information routing transition neural network (TNs) laterally connect the FNs. Both FNs and TNs share their weights respectively, allowing efficient parallel computation and capturing heterogeneity from all spatial locations.
In order to incorporate physics laws with which an effective model is supposed to follow and helps training with less samples and being robust to unseen data, a third network PN is developed to reveal unknown governing physics from pre-given spatio-temporal data and vice versa facilitates the overall model to capture the homogeneity.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 defines the spatiotemporal forecasting problem for the dynamical systems from a machine-learning perspective. The proposed model ST-PCNN is introduced in Section 4, followed by experiments in Section 5 that includes comparative and ablation studies and conclusion in Section 6.

RELATED WORK
Physical process modeling is close to the field of spatiotemporal statistical models that are increasingly being used across a wide variety of scientific disciplines to describe and predict spatially explicit processes that evolve over time [23]. [24] advocate the use of physical prior knowledge to develop statistical models, e.g., PDEs related to the observed real-world phenomenon. They mainly consider auto regressive models within a hierarchical Bayesian framework. Another research direction is the use of neural networks (NNs) for enhancing the performance and reducing the complexity of numerical physical process simulation. There are three major approaches: 1) NNs are used in place of a computational demanding component of the simulation process. For examples, [25] uses a random forest to compute particle location and [26] adopts a CNN to approximate part of a numerical PDE scheme. 2) NNs are combined with related physics equations to model the whole physical process. For example, [27] uses physics-informed neural networks (PINNs) to learn nonlinear relations between spatialand temporal-coordinates with a given PDE. The given physics could be changed for various purposes, such as advection-diffusion equation informed in [28], fluid dynamics in [7], Lagrangian mechanics in [29], and Hamiltonian in SymODEN [30], etc. These methods are only applicable when the specific equations are explicitly given but hard to be generalized to incorporate other types of physics equations. 3) NNs are used to uncover the underlying hidden physics and model the dynamics of complex systems. Stateof-the-art work includes discovering the PDEs [16], [19], [31] or ODEs [20] from given observations of the systems.

PROBLEM DEFINITION
A dynamical system is observed from a grid of nodes (e.g., a sensor network), distributed/located at different locations. s (t,i,j) ∈ R denotes observed value of a node located at (i, j) at time t. For ease of representation we use s (t,:) to denote value of any node at time t, and s (:,i,j) denotes values observed at (i, j). Accordingly, each time slice of the grid observation is denoted by S (t,:) = {s (t,i,j) } i,j∈Ω ∈ R Ω and the time series of each node is defined by S (:,i,j) = {s (t,i,j) } t∈T ∈ R T (Ω is the total nodes and T is the total recorded time steps). S ∈ R Ω×T denotes the whole observations andŜ represents the variable flow in the future. Assume variable observed from a node is governed by an unknown physics rule, i.e., PDE/ODE 1 , which relates a function u(x, t) with its derivatives, i.e., D a u(x, t) := ∂ a 1 ∂x1 a 1 · · · ∂ a k ∂x k a k u(x, t), where a = (a 1 , · · · , a k ) are nonnegative integers. Consequently, any of the PDEs can be defined by this notion as: where G is a function that relates position x = (i, j) ∈ Ω and time t ∈ R with u and its partial derivatives at x and t. We say u is a solution to the PDE if Eq. (1) holds for every point x ∈ Ω and time step t ∈ R. In this paper, we observe data points (t, x; S) = {t, i, j; s (t,i,j) } i,j∈Ω , where S are noisy function values at x, that is, s (t,i,j) = u(t, i, j) + ξ (t,i,j) with noise ξ (t,i,j) impacted by the temporal localized factor. The u(t, i, j) refers to homogeneity while ξ (t,i,j) is the cause of heterogeneity in dynamical systems.

Spatio-Temporal Forecasting Problem:
From a machinelearning perspective, the spatio-temporal forecasting problem is to learn a non-linear mapping function f that maps the historical spatio-temporal observations {S tp−τ , S tp−τ +1 , · · · , S tp } into the future predictions {Ŝ tp+1 ,Ŝ tp+2 , · · · ,Ŝ tp+τ }, where τ denotes the length of observation conditioned on and τ denotes the prediction horizon. The learning is formulated as a deterministic optimization problem that constitutes both minimizing the data mismatches and estimating the hidden underlying PDE of a physical model by equating derivatives of the neural network approximation.

THE PROPOSED FRAMEWORK: ST-PCNN
To better modeling the dynamical systems in terms of considering both of the homogeneity and heterogeneity, a spatio-temporal physics-coupled neural networks (ST-PCNN) model is proposed. ST-PCNN is a tri-network architecture, as shown in Figs. 2a and 2b, consisting of physics network (PN), forecasting network (FN), and transition network (TN). FN receives 1) dynamic data, which is subject 1. Ordinary differential equations form a subclass of partial differential equations, corresponding to functions of a single variable. to prediction and changes over time, 2) static information, which stays constant and characterizes the location of each FN, and 3) lateral information from neighbors. The output of each FN includes predicted dynamics and additional lateral information that will be interacted with its neighbors. Such interaction, that distinguishes our architecture from others, is conducted through a TN with two-stacked linear layers. TN aims to model the location-sensitive transitions between adjacent FNs and thus enabling local context-dependent spatial information propagation.

Spatio-Temporal Heterogeneity
In many natural phenomenons, data are collected in a distributed manner and exhibit heterogeneous properties: each of these distributed locations present a different view of the natural process at the same time, where each view has its own individual representation of the space and dynamics [32]. Theoretically, each location may contain information that other location do not have access to. Therefore, all local views must be interacted in some way in order to describe the global activity comprehensively and accurately.
How to explicitly encode location information into neural networks is critical in this location-wise forecasting. Inspired by the Transformer [33] that encodes word positions in sentences, we extend the absolute positional encoding to represent grid positions. In particular, let i, j be the desired position in a regular grid,p (i,j) ∈ R d be its corresponding encoding, and d be the encoding dimension, then the encoding scheme is defined as: where ω k = The wavelengths form a geometric progression from 2π to 10000 · 2π. The positional embedding as a vector contains pairs of sines and cosines for each decreasing frequency along the vector dimension, it would allow the model to easily learn to attend by relative positions [33].
As illustrated in Fig. 2, the FN and TN are executed in space simultaneously. At each time t, the TN first encodes the current operation node's lateral info Ł and static infop as follows: where θ T = [W T , b T ] denote the weights and bias of TN. Ł (t,i,j) is a vector used to characterize interaction between a node at i, j to its neighbors. It is initialized as zeros at first step and continuously updated by Eq. (7) when t > 0. Then, FN encodes each view (i.e., staticp, dynamics S, and encoded Ł enc of each node) using a fusion layer: These features, f (t,i,j) ∈ R d f i,j , are then fed into an LSTM to model the node-specific interactions over time. The update mechanism of the LSTM cell is defined as: where σ(·) applies sigmoid on the input gate I (t) , forget gate F (t) , and output gate O (t) , and tanh(·) on memory cell A cell updates its hidden states h (t) based on the previous step h (t−1) and the current input f (t,i,j) . An output layer is stacked at the end of FN to transform the LSTM output into the expected dynamic prediction and additional lateral information as: whereŜ (t,i,j) denotes the prediction of the node dynamics at time step t. The learnable parameters are characterized by W (t) ∈ R d f i,j ×dy i,j and b out ∈ R dy i,j , where d yi,j denotes the total dimension of the dynamic and the lateral outputs.

Homogeneity by Underlying Physics
Physicists attempt to model natural phenomena in a principled way through analytic descriptions. Conservation laws, physical principles, or phenomenological behaviors are generally formalized using differential equations, which can best reasoning the homogeneity of observations. Knowledge accumulated for modeling physical processes in well developed fields such as maths or physics could be a useful guideline for dynamics learning [28]. Our proposed ST-PCNN includes a physics-aware module, the physics network PN, to learn underlying hidden physics. Two main approaches are considered: • PDE-learning-net: explicitly estimating the underlying partial differential equation (PDE) from time series assisted by neural network model fitting to PDE solution.
• ODE-informed-net: implicitly approximating timedependence with neural network and solving via an ordinary differential equation (ODE) solver.

PDE-learning-net
Partial differential equation (PDE) is an equation which imposes relations between the various partial derivatives of a multivariable function. PDEs are ubiquitous in mathematically-oriented scientific fields, such as physics and engineering. For instance, they are foundations in the modern scientific understanding of sound, heat, diffusion, fluid dynamics, general relativity, quantum mechanics, etc.
PDEs for a single variable v can be generally described as a linear combination of functions of u, its derivatives, and the dimensions. More formally, let D = {D 1 , · · · , D K } be a dictionary of such terms, we can then generally define PDEs as: where c = {c k } K k=1 is a set of coefficients to be determined. These terms are determined by best estimate of what would be relevant for each use case. For most physical systems, this would typically be limited to second order derivatives 2 .
Suppose we observe u(x, t) such that u is a solution to Eq. (8). It is then possible to approximate u by a neural networkû : Ω ∈ R, where observations at (x, t) becomes the training inputs. To approximate u, the PDE-learningnet [31], as illustrated in the dotted box of Fig. 3, consists of a four-linear-layer stacked, fully-connected network with linear input and output layers as denoted below: where the inputs are position x and time t, and yields the prediction valueû at that position. The tanh() activation function is employed after each layer except for the output layer. This approximation is optimized by a combination of multiple loss terms to place emphasis on different parts of the model, defined as: 2. Example: suppose the 1D wave equation is autt − buxx = G(utt, uxx) = 0, ∀x ∈ Ω ∈ R 2 , ∀t ∈ R. If we define our dictionary as D 1 = utt, D 2 = uxx, where aD 1 (u, x, t) − bD 2 (u, x, t) = 0, the PDE can then be represented by the coefficients c = [a, −b].

Algorithm 1: PDE-learning-net
Input : Spatio-temporal point (x, t); Dictionary of differential terms {D(û, x, t)} Output: c (the estimated coefficients of PDE) Initialize Neural network parameters: θ ofû, θc/ θc ; for number of epochs do Update θ and θc ← L.backward(); c ← θc/ θc ; end where the use of L u as a scaling factor for the other losses acts as a regularizing term to maintain a more consistent ratio of losses even when the mean square error (MSE) becomes very large or small. The λ d and λ sparse coefficients are set such that the individual loss function contributions will have similar magnitude.
L u is the loss between observed data points u and values calculated by the neural network at these pointsû: where θ is the neural network parameters. This is the primary regression loss term for the neural network. L d is a differentiation loss used to measure the error of the estimated PDE coefficients. Here, a dictionary of L differential terms D = {D 1 , D 2 , · · · , D L } is used. These functions are evaluated at K points x = {x i } K i=1 , sampled from Ω, resulting in a R K×L matrix with entries: By definition of the PDE, we require D(u, x )c = 0, thus c must lie within the null space of D(u, x ). Equivalently, c is a singular vector of D(u, x ), with associated singular value 0. If û − u m,2 < and assuming some regularity conditions on D, we can have D(u, x ) − D(û, x ) m,2 < C for some constant C that depends on D. Therefore, the singular vector of D(û, x ), associated with its smallest singular value, is an approximation of c.
The loss term L d is then defined along with the constraint c = 1 to avoid c being minimized to zero: The contribution of L d is twofold: minimizing L d (x , θ, c) over c recovers the PDE and minimizing L d (x , θ, c) overû additionally that encourages fitting to a solution of the PDE, thus preventing over-fitting to noise.
The loss term L sparse is further introduced: to impose the assumption that natural systems are inherently simple and are thus dependent on a few terms. The PDE-learning-net training is presented in Algorithm 1.

ODE-informed-net
Ordinary differential equation (ODE) is a differential equation containing one or more functions of one independent variable and the derivatives of those functions. The term ordinary is used in contrast with the term partial differential equation which may be with respect to more than one independent variables. ODE-informed-net, in this paper, is proposed to implicitly approximate time-dependence with neural network and solving via an ODE solver. The model structure is shown in Fig. 4. ODE-informed-net training is performed with the DO-PRI5 method [34] in the ODE solver with an absolute and relative tolerance of 1e −3 . The DOPRI5 method is well established as an ODE solver and allows for adaptive steps, thus the model can then perform network evaluations with a dynamic and arbitrary number of times. This can be tuned in and outside of training by changing tolerances before network evaluation. Model inputs are a set of derivatives of the target function, the current state, and the previous state. This is then encoded into the latent space, h 1 , as: The model interprets this as the initial state for the ODEsolver 3 DOPRI5 [34]. g(u, du dt , ∆t), where ∆t is the ODE solver time step and du dt is approximated by a neural network, in this case, a two-linear-layer stacked fully-connected network. The following process is looped within the ODE solver: where h 1 is the time derivative of the latent space state, h 2 is the calculated latent space state at t + ∆t, and t is the current time of the solver. If this time is not yet t + 1, t and h 1 are updated for the next iteration. Once this converges to expected tolerance, the final h 2 is decoded as the model's estimate for the true state at t + 1 as:  , τ ); θO)dτ + uE; u(x, t + 1) ←ûD(uO; θD); end L ← u(x, t + 1) −û(x, t + 1) 2 +α [θE, θO, θD] 2 ; Update θE, θO, θD ← L.backward(); end Here, the ODE-informed-net unites a sub-network acting as an ODE functionû O . By leveraging the multiple evaluations of this sub-network, the ODE-informed-net can effectively act as a deeper network while using fewer parameters and being more stable in training. When used to predict time-series, the ODE function solves for future states from some initial state at u(t 0 ) as: where the united ODE functionû O fits to the time derivative of u, mimicking the behavior of the physical systems. In more complex systems, if it is assumed that the system can be described with PDEs, then time gradients can be estimated as a function of the local state, previous state, and a set of spatial derivatives. These derivatives can be estimated via finite difference and be more formally defined as the set {D α u}; α = 0, 1, · · · , m, where m is the highest order of derivative with a typical value of 2. Our implementation performs this calculation at individual points in space and predicts the state of that point in the next time step. By leveraging ODE integration tools, any future values can be solved with arbitrary accuracy. When approximating time series data, the ODE-informednet is trained using pairs of data at sequential time steps to predict values for the next time step. The data is sampled at successive time points t − 1 and t at some spatial point x and their corresponding values u(x, t − 1), u(x, t), where u = (u 1 , u 2 , · · · , u d ) T is a d-dimension vector representing data values at time t and point x. Finally, an estimate of u(x, t + 1) is calculated at each point using the input states as the initial state for the ODE integration.
Regression loss L is defined for ODE-informed-net as a function of the MSE and the network parameters θ: whereû denotes the prediction. The ODE-informed-net training is presented in Algorithm 2.

Coupling Hetero-and Homogeneity
Based on above analysis, we describe the ST-PCNN to model the heterogeneous properties of spatio-temporal data and to reveal the homogeneous physics from raw data.  Here, a stacking coupling mechanism is proposed to integrate the obtained physics into the spatio-temporal learning.
As shown in Fig. 2a, at each time step t and location i, j, the FN produces the initial predictionŜ (t+1,i,j) he based on the current observation S (t,i,j) , the hidden states h (t−1,i,j) from previous-step (within LSTM), and the lateral info from its neighbors produced by TN. Here the 'heterogeneous' initial prediction leverages its own specific local attributes only. Then, regarding the integration of differential equations, the previous-step initial predictionŜ (t−1,i,j) and the current observation S (t,i,j) are fed into the learnt physics PDEs/ODEs from PN to derive the numerical solutionŜ (t+1,i,j) ho , which is the 'homogeneous' part of the dynamics regularized by governing physics. Finally, a coupling layer with parameters θ C = [W C , b C ] in Eq. (21), is used to produce final prediction S (t+1,i,j) by synthesizingŜ ST-PCNN training is presented in Algorithm 3 with supervised loss including sum of l 1 -norm and l 2 -norm loss.

Datasets
In this section, we evaluate ST-PCNN on a synthetic dataset and a real-world ocean current dataset, with the data statistics summarized in Table 1 and details introduced as below. Fig. 5, single-waves are propagating outwards, where waves are reflected at borders such that wave fronts become interactive. The following 2D wave equation was used for reflected wave data generation: Plots from left to right denote temporal evolving of the circular wave (propagate from center to boundaries). After the wave reaches the border, reflecting effects are generated through boundary conditions. PDE solutions were solved numerically using an explicit central difference approach:

Reflected Wave Simulation Data As illustrated in
where b stands for a variable of function u, and h is the approximation step size. In the case of calculating simulated wave data, we apply Eq. (23) to Eq. (22) to obtain: which can be solved for u(x, y, t + ∆t) to obtain an equation for determining state of the field at the next time step t + ∆t at each point. Both the boundary conditions (when x < 0 or x > f ieldwidth, analogously for y) and initial condition (in time step 0) are treated as zero. The following variable choices were met: ∆ t = 0.1, ∆ x = ∆ y = 1 and c = 3.0. The field was initialized using a Gaussian distribution: with amplitude factor a = 0.34, wave width in x and y directions σ 2 x = σ 2 y = 0.5, and s x , s y being the starting point or center of the circular wave. Fig. 6, the sensor array are placed in the GoM region, covered from 89 o W to 85 o W , and 25 o N to 27 o N with 30-50 km horizontal resolution, where the Loop Current (LC) extended northward and, more importantly, where eddy shedding events occurred most often [35]. This sensor array consisted of 25 pressure-recording inverted echo sounders (PIES), 9 full-depth tall moorings with temperature, conductivity and velocity measurements, and 7 near bottom current meter moorings deployed under the LC region. The dataset contains velocity data gathered from June 2009 to June 2011. Since the sampling frequency from multiple sensors varied from minutes to hours, we processed the dataset with a fourth order Butterworth filter and sub-sampled at 12-hour intervals, leading to a total of 1,810 records (905 days).

Baselines
• FC-LSTM [36] has been proven powerful for handling temporal correlation. The LSTM network here consists of 256 hidden units.
• PredRNN [37] memorizes both spatial appearances and temporal variations in a unified memory pool, which consists of two ST-LSTM layers with 128 hidden states each and 3 × 3 convolution filters.
• CDNN [28] is a recently developed physics-informed network, which has a convolutional-deconvolutional module with a warping mechanism to produce an interpretable latent state -the motion field 4 -advecting ocean dynamics.

Model Details and Implementation
Forecasting Network: The FN consists of a fully-connected layer, followed by an LSTM layer with 256 hidden units, and another fully-connected layer. Transition Network: The lateral output dimension is set to 8, analogously to the dimension of lateral input. Physics Network: Two types of PN are adopted: 1) PDElearning-net is a fully-connected network of width 50 with four hidden layers with ReLU activation. 2) ODE-informednet is of width 50 with integrated portion consisting of two fully-connected layers with ReLU activation.
All models are trained with ADAM optimizer with the sum of l 1 -norm and l 2 -norm loss and 0.01 learning rate. The batch size is set to 16. Learning rate decay and scheduled sampling are activated once the model does not improve in 20 epochs (in term of validation loss). The implementation was based on the Pytorch equipped with NVIDIA Geforce GTX 1080Ti and Titan Xp GPU with 32GB memory.

Physics Learning Analysis
We validate the hypothesis that the PDE-learning-net and ODE-informed-net are able to uncover the underlying hid-   den physics from raw data, and thus, are able to assist the spatio-temporal networks to capture the dynamics of the natural phenomenon. Both models assessed in this section are trained against values at individual points, i.e., trained over a single sequence of synthetic reflected wave data. The ODE-informed-net model efficacy is demonstrated via a time-series derived from a closed-loop evaluation initialized with a pair of time steps. Qualitative assessments show that the model is capable of accurately recreating the physical behavior of the ground truth data and multi-steps ahead in the future, as shown in Table 2 and Fig. 7. This suggests that for a simple physics example, an ODE approximation can accurately recreate data well with a relatively shallow network. However, computational overhead from finite difference operations required for the implementation of the network offsets some of this benefit.
Both the true and predicted PDEs are represented as a normalized vector of coefficients c for the following dictionary of differential terms: {u tt , u xx , u yy , u t , u x , u y , u}.  Table  3, the PDE-learning-net is able to reveal the optimal pde expression from data. To further evaluate the accuracy of the estimated PDE, it is compared to the synthetic data set, as shown in Table 2, which is performed by solving the estimated PDE numerically to directly compare the solutions. Although the estimated PDE produces a similar result, a damping term was incorrectly introduced, resulting in fading in multi-steps modeling, as shown in Fig. 7. This PDE estimation is valuable over a traditional neural network as this method can scale over the space-time dimensions and has flexibility through choice of numerical solver.

Spatio-Temporal Modeling Analysis
We compare our model with several baselines evaluated with a mean square error (MSE) metric. The complete visualization result of the synthetic reflected-wave forecasting tasks is shown in Fig. 8, where the top-half shows the no-physics-informed approaches while the bottom-half 5. The PDE estimation error is calculated by err(c, c * ) = (1 − |(c · c * )/( c c * )|) 1/2 , which is always non-negative, and is 0 iff. c and c * are co-linear. shows the physics-informed approaches. Table 4 (top-half) shows the quantitative comparison of different no-physicsinformed approaches for the synthetic reflected-wave forecasting tasks. In most cases, our ST-H H PCNN consistently outperforms other baseline methods. Although ConvLSTM and FC-LSTM yield the best single-step forecast, they seem easier to over-fit and lead to earlier fading prediction compared to Pred-RNN and ST-H H PCNN. Considering that the closedloop performance is more challenging than just single-step forecasting, which requires both intrinsic model stability and the maintenance of plausible ongoing dynamics, our ST-H H PCNN that distributively execute prediction in space shows the best generalization performance. Table 4 (bottom-half) presents the performance of physics-informed approaches on reflected-wave data. ST-PCNN with wave equation (Eq. (22)) informed is regard as a reference here, proving that if the governing physics is known, our model could perfectly capture the spatiotemporal dynamics even in the long-run. The CDNN, that discretizing the solution of the motion field (advectiondiffusion equation) with a warping scheme to inform forecasting, is not able to capture the complex long-term dynamics that strongly deviate from the truth after 40 steps closedloop prediction (shown in Fig. 8). It indicates that pre-given knowledge (e.g., PDE of 'general form') may not always be beneficial in physical process modeling, since it neglects the 'heterogeneity'. In the contrary, the MSE score of ST-PCNN, either informed by predicted ODE or PDE, is better than any of the baselines with/without physics inform. It indicates that ST-PCNN can capture both 'homogeneity' (underlying physics) and 'heterogeneity' (localized information) in modeling evolution and dynamics of natural phenomena.

GoM Loop Current Prediction
So far, the synthetic wave data only consider the strict regularly distributed grid, where distances between single measurement point are identical and enriched in physical property. Such situation may not suitable in many real-word applications. We adopt the LC data to validate ST-PCNN's performance in handling irregularly and widely distributed sensor grid. As shown in Table 5 and Fig. 9, ST-H H PCNN outperforms other no-physics-informed approaches, not only reaches the lowest single-step forecast error but also yields the best multi-step forecast performance. Among physicsinformed approaches, the DCNN achieves the lowest shortterm forecast MSE while the predicted ODE-informed ST-   PCNN reaches the best single-step and lasting forecast performance.

CONCLUSIONS
In this paper, we proposed a ST-PCNN model for accurate spatio-temporal forecasting. We argued that real-world dynamical systems are often challenged by heterogeneity and homogeneity. By coupling three networks to learn underlying physics, enable transition of node interaction, and forecast the future, ST-PCNN shows superior performance to baseline models. The key contribution of the paper, compared to existing research in the field, is three-fold: 1) A novel ST-PCNN framework capturing complex localized spatial-temporal correlations; 2) A 2D PDE-learningnet and ODE-informed-net as the physics nets to learn hidden physics; and 3) The physics-coupled neural network (PCNN) for long-term forecasting using only limited observations. Further, ST-PCNN is a general framework suitable for many other physical processes modeling.