Manifold-constrained Gaussian process inference for time-varying parameters in dynamic systems

Identification of parameters in ordinary differential equations (ODEs) is an important and challenging task when modeling dynamic systems in biomedical research and other scientific areas, especially with the presence of time-varying parameters. This article proposes a fast and accurate method, TVMAGI (Time-Varying MAnifold-constrained Gaussian process Inference), to estimate both time-constant and time-varying parameters in the ODE using noisy and sparse observation data. TVMAGI imposes a Gaussian process model over the time series of system components as well as time-varying parameters, and restricts the derivative process to satisfy ODE conditions. Consequently, TVMAGI does not require any conventional numerical integration such as Runge–Kutta and thus achieves substantial savings in computation time. By incorporating the ODE structures through manifold constraints, TVMAGI enjoys a principled statistical construct under the Bayesian paradigm, which further enables it to handle systems with missing data or unobserved components. The Gaussian process prior also alleviates the identifiability issue often associated with the time-varying parameters in ODE. Unlike existing approaches, TVMAGI can be applied to general nonlinear systems without specific structural assumptions. Three simulation examples, including an infectious disease compartmental model, are provided to illustrate the robustness and efficiency of our method compared with numerical integration and Bayesian filtering methods.


Introduction
Ordinary Differential Equations (ODEs) are often used to analyze the behavior of dynamic systems, such as the spread of infectious diseases (Li and Muldowney (1995)), interactions between species (Takeuchi et al. (2006)), and viral dynamics (Perelson et al. (1996)). This paper studies a general formulation of ODEs in which some of the parameters are allowed to be time-varying:

ẋ(t) = f(x(t), θ(t), ψ, t), t ∈ [0, T].

Here, x(t) is the series of system outputs from time 0 to T, ψ denotes the time-constant parameters, θ(t) denotes the time-varying parameters, and f is a set of general functions that characterize the derivative process. When f is non-linear, the system outputs x(t) typically do not have analytic solutions. To solve for x(t) given initial conditions x(0) and parameters θ(t) and ψ, numerical integration methods are often required, such as Euler's method or the Runge-Kutta method (Lapidus and Seinfeld (1971)).
This paper focuses on the inverse problem: given the observations, how to efficiently draw inference on the ODE parameters. Our goal is to estimate the time-constant parameters ψ and the time-varying parameters θ(t) of the ODE from data. In the real world, the observation data of the system components x are often obtained at discrete time points and are subject to measurement errors. We thus assume that we observe y(τ) = x(τ) + ε(τ), where τ denotes the observation time points and ε(τ) denotes Gaussian noise. We focus on the inference of θ(t) and ψ given y(τ), with emphasis on nonlinear structures f.
The time-varying parameter θ(t) in the ODE is often important yet challenging to recover from real-world data. For example, during a pandemic, the time-varying disease reproduction number is critical for public health policy decisions. However, its estimation can still be crude despite the best effort (Abbott et al. (2020)). The time-varying θ(t) provides too many degrees of freedom to the ODE system, and two different θ(t) can both give x(t) that fits the observation data, resulting in identifiability issues (Miao et al. (2011)).

Review of related literature
Most existing methods for ODE inference only accommodate time-constant parameters (Dondelinger et al. (2013); Yang et al. (2021); Wenk et al. (2019)). For time-varying parameter inference in ODE systems, existing methods all have their deficiencies (Wu (2005)). For example, Li et al. (2002) relied on time-consuming numerical integration; Huang and Liu (2006) proposed a Bayesian parametric approach to model time-varying coefficients in the HIV-1 dynamic model, sacrificing some flexibility; Chen and Wu (2008) and Cao et al. (2012) developed an efficient two-stage local polynomial estimation method that circumvents numerical integration for non-parametric time-varying parameters, but required the ODE system to depend linearly on the time-varying parameters (see Eq. (18)). Bayesian filtering methods have also been explored for time-varying ODE parameter inference, although they lack some statistical rigor. For example, Pei and Shaman (2020) and Shaman and Karspeck (2012) apply the Ensemble Adjustment Kalman Filter (EAKF) algorithm to estimate parameters in a metapopulation SEIR model. Schmidt et al. (2021) proposed an extended Kalman filter approach based on a Gauss-Markov process that can infer time-varying parameters but can no longer accommodate time-constant parameters.
To the best of our knowledge, there is no existing Bayesian inference method that eliminates numerical integration for a general ODE system with both time-constant and time-varying parameters.

Our contribution
We propose a fast and statistically principled method to infer time-varying θ(t) and time-constant ψ from noisy observations of the ODE. The key idea is to use a Bayesian approach and place Gaussian process (GP) priors on x(t) and the time-varying parameters θ(t), so that the identifiability issue is mitigated by the informative prior that favors smoother parameter curves. Our method is built upon the prior work of MAnifold-constrained Gaussian process Inference (Yang et al. (2021)), where the Gaussian process x(t) is restricted to a manifold that satisfies the ODE system. Placing a Gaussian process on x(t) facilitates fast inference on θ(t), as it completely bypasses numerical integration. Our approach also adheres to the classical Bayesian paradigm with rigorous theoretical derivation. Through a Gaussian process model on θ(t), we are able to generalize the MAnifold-constrained Gaussian process Inference to the situation where time-varying and time-constant parameters co-exist.
We name our method TVMAGI (Time-Varying MAnifold-constrained Gaussian process Inference), emphasizing its capability in handling time-varying parameters. We demonstrate the effectiveness of TVMAGI through three realistic simulation examples, where TVMAGI works well even when some of the system components x(t) are partially observed. Through these simulation examples, we also show that TVMAGI can outperform benchmark methods including a numerical integration approach, a Bayesian filtering approach, and a two-stage approach. Thanks to the computational savings of bypassing numerical integration, TVMAGI has great potential to be generalized to high-dimensional and large-scale systems.
TVMAGI achieves a substantial leap from the previously proposed time-constant parameter inference methods (Wenk et al. (2019); Yang et al. (2021)) by investigating a much more complicated problem with functional estimates of time-varying parameters. The change from time-constant to time-varying parameters also creates a notable difference in the scientific context, as the parameters to be inferred in most real-world phenomena are non-stationary or changing over time.
We assume that Θ(t) is continuous and differentiable in t during the time period [0, T], which helps to prevent overfitting and to alleviate the identifiability issue; this assumption can be relaxed later. The prior distribution of X and Θ in each dimension is an independent Gaussian process. That is,

X_d(t) ~ GP(μ^X_d, K^X_d), d = 1, ..., D, and Θ_p(t) ~ GP(μ^Θ_p, K^Θ_p), p = 1, ..., P,

where K^X_d and K^Θ_p : R × R → R are positive definite covariance kernels of the GPs, while μ^X_d and μ^Θ_p : R → R denote the mean functions.
The likelihood. The observations are denoted as y(τ) = (y_1(τ_1), ..., y_D(τ_D)), where y_d(τ_d) = x_d(τ_d) + ε_d(τ_d) and ε_d(τ_d) is the additive Gaussian observation noise on component d. In this paper, the notation t shall refer to time generically, and τ shall denote specifically the observation time points.
The manifold constraint. We introduce a variable W to quantify the difference between the derivative process Ẋ(t) of the Gaussian process and that implied by the ODE:

W = sup_{t ∈ [0,T], 1 ≤ d ≤ D} |Ẋ_d(t) − f_d(X(t), Θ(t), Ψ, t)|.

Intuitively, W is the L_∞ norm of the derivative difference, and W = 0 if and only if Ẋ(t) strictly satisfies the ODE structure, which is equivalent to constraining X(t) on the manifold of the ODE solutions. The advantage of the L_∞ norm is further discussed in Supplementary Material Section A. In the ideal situation where W ≡ 0, the posterior distribution of Θ(t), Ψ, and X(t) shall be formulated as P(Θ(t), Ψ, X(t) | W = 0, Y(τ) = y(τ)). However, such an ideal posterior is not computable in practice. Therefore, we approximate W by finite discretization on time points I = {t_1, t_2, . . . , t_n}, such that τ ⊂ I ⊂ [0, T]. We similarly define W_I on the discretization set I as the L_∞ distance between the derivative from the GP and that from the ODE:

W_I = max_{t ∈ I, 1 ≤ d ≤ D} |Ẋ_d(t) − f_d(X(t), Θ(t), Ψ, t)|.

Here W_I is the maximum over a finite set, and W_I → W monotonically as I becomes dense.
The associated computable likelihood of the discretized manifold constraint W_I = 0 is

P(W_I = 0 | X(I) = x(I), Θ(I) = θ(I), Ψ = ψ) = p_{Ẋ(I) | X(I) = x(I)}(f(x(I), θ(I), ψ, t_I)),

which is a multivariate Gaussian density, since the time derivative Ẋ(t) of a GP is also a GP with specific mean and covariance kernel.
The posterior. Therefore, a computable discretized posterior for TVMAGI inference of X(t), Θ(t), and Ψ is

P(X(I) = x(I), Θ(I) = θ(I), Ψ = ψ | W_I = 0, Y(τ) = y(τ)).   (8)

The closed-form derivation. The posterior distribution of X(t), Θ(t), and Ψ in Eq.(8) can be further derived in closed form (Eq.(9)-(10)), where ||v||²_A = vᵀAv, |I| is the cardinality of I, and f^{x,θ,ψ}_{d,I} is short for the d-th component of f(x(I), θ(I), ψ, t_I). A deeper look reveals that Eq.(9) is the joint probability in Bayesian statistics, and Eq.(10) further decomposes it into four parts. The 1st part, Eq.(2), corresponds to the independent GP prior distribution of Θ(I), as the priors of Θ(t) and Ψ are independent. The 2nd part, Eq.(3), is the GP prior on X(I), because the prior of X(I) is independent of Θ(t) and Ψ. The 3rd part, Eq.(4), is the level of observation noise: given the values of the underlying true components X(τ), the additive Gaussian observation noise ε(τ) is independent of everything else. The 4th part, Eq.(7), can be simplified to the conditional probability of Ẋ(I) given X(I), evaluated at f(x(I), θ(I), ψ, t_I). All four parts are multivariate Gaussian distributed. In particular, the 4th part, Eq.(7), is Gaussian because the conditional distribution of Ẋ(I) given X(I) is multivariate Gaussian, provided that the GP kernel K^X is twice differentiable.
We choose the Matern kernel with degrees of freedom ν = 2.01 for both Θ(t) and X(t) to guarantee a differentiable GP that allows flexible patterns:

K(s, t) = φ_1 · (2^{1−ν} / Γ(ν)) · (√(2ν) |s − t| / φ_2)^ν · K_ν(√(2ν) |s − t| / φ_2),

where K_ν denotes the modified Bessel function of the second kind, φ_1 is the variance hyper-parameter, and φ_2 is the length-scale hyper-parameter. In this case, ∂K/∂s, ∂K/∂t, and ∂²K/∂s∂t are all well-defined.
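As a concrete illustration, the Matern kernel with ν = 2.01 can be evaluated numerically via the modified Bessel function in SciPy. The sketch below is a minimal illustration, not the TVMAGI implementation; the function and variable names are our own:

```python
import numpy as np
from scipy.special import kv, gamma  # kv: modified Bessel function of the second kind

def matern_kernel(s, t, phi1=1.0, phi2=1.0, nu=2.01):
    """Matern covariance K(s, t) with variance phi1 and length-scale phi2."""
    r = np.abs(np.subtract.outer(np.atleast_1d(s), np.atleast_1d(t)))
    z = np.sqrt(2.0 * nu) * r / phi2
    # K_nu(z) diverges as z -> 0, but z^nu * K_nu(z) has a finite limit,
    # so the kernel equals phi1 on the diagonal.
    out = np.full(z.shape, phi1, dtype=float)
    pos = z > 0
    out[pos] = phi1 * (2.0 ** (1.0 - nu) / gamma(nu)) * z[pos] ** nu * kv(nu, z[pos])
    return out

tI = np.linspace(0, 10, 8)  # a small discretization grid
K = matern_kernel(tI, tI, phi1=2.0, phi2=3.0)
```

Because ν = 2.01 > 2, the mixed derivative ∂²K/∂s∂t exists, which is the twice-differentiability condition required above.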

Algorithm
This section provides a detailed computational scheme of TVMAGI, including the hyper-parameter settings. The implementation is available on GitHub. Overall, the Maximum A Posteriori (MAP) estimate of X(I), Θ(I), and Ψ is obtained by optimization, while the posterior mean/interval is obtained by Hamiltonian Monte Carlo. To set the hyper-parameters and initialize the optimizer, we introduce a multi-stage approach in the algorithm. The advantages of the multi-stage algorithm are discussed towards the end of this section.

Initialization and inference of the mean
At the first stage, we impose a GP only on X(t) and substitute the time-varying θ(t) with its unknown mean μ^Θ in the entire model. This formulation ignores the time-varying property of θ(t) and treats it as time-constant, which fits in the time-constant parameter inference framework of Yang et al. (2021). As such, we can use the MAGI package (Yang et al. (2021)) to obtain point estimates for the parameters and system components, denoted as μ̂^Θ, ψ^(0), and x(I)^(0). The μ̂^Θ is subsequently used as the prior mean for the time-varying θ(t) in an empirical Bayes fashion, and will be plugged into Eq.(11). The ψ^(0) and x(I)^(0) will be used as the initial values for ψ and x(I) in the later MAP optimization.
The hyper-parameters (φ^X_{1,d}, φ^X_{2,d}) for the kernel K^X_d in Eq.(12) and the noise level σ_d for each system component X_d, d = 1, ..., D, are also estimated with the MAGI package, using the Gaussian process smoothing marginal likelihood (Yang et al. (2021)). The MAGI-estimated noise level σ^(0) will serve as the initial value for the later joint MAP optimization.
We denote the optimized θ(I) as θ̂(I), and x(I)^(0), ψ^(0), σ^(0) are updated to new optima x̂(I), ψ̂, and σ̂. We call θ̂(I) the point-wise estimate since there is no requirement on the smoothness or continuity of θ̂(t) on I. Although wiggly and possibly overfitting the data, the point-wise estimate θ̂(I) captures the trend of parameter changes, which provides information to set the hyper-parameters of the GP kernels K^Θ_p for Θ(t).

GP hyper-parameters for time-varying ODE parameters
The length-scale parameter φ^Θ_2 controls how fast θ(t) can change. Given the point-wise estimate θ̂(I), we use the Gaussian process smoothing method to set the hyper-parameters φ^Θ_{1,p}, φ^Θ_{2,p} of the GP kernels K^Θ_p in Eq.(12). We treat θ̂(I) as observations of Θ(I), and operate on each dimension of the time-varying ODE parameters separately.
Recall the prior Θ_p(I) ~ GP(μ̂^Θ_p, K^Θ_p(I, I)), where the mean μ̂^Θ_p is obtained in Section 3.1. We use the empirical Bayes approach again to set φ^Θ_{1,p}, φ^Θ_{2,p} by maximizing the posterior density at θ̂_p(I):

(φ̂^Θ_{1,p}, φ̂^Θ_{2,p}) = argmax_φ p(θ̂_p(I) | φ) π_{Φ_p}(φ),

where θ̂_p(I) | φ ~ N(μ̂^Θ_p, K(φ) + diag(δ²)), δ is a nuisance parameter governing the induced noise in the point-wise estimate θ̂_p(I), and π_{Φ_p}(·) is the hyper-prior. In practice, the hyper-prior π_{Φ_p}(·) is often set to be uniform on a reasonable interval depending on the context, to ensure the desired level of smoothness for the time-varying ODE parameter θ_p(t).
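The empirical Bayes step above can be sketched as maximizing the Gaussian marginal likelihood of the point-wise estimate over (φ_1, φ_2, δ). The sketch assumes a flat hyper-prior (so the maximization reduces to maximum marginal likelihood) and a synthetic point-wise estimate; all names and the toy data are ours, not from the TVMAGI package:

```python
import numpy as np
from scipy.special import kv, gamma
from scipy.optimize import minimize

def matern(r, phi1, phi2, nu=2.01):
    z = np.sqrt(2 * nu) * np.maximum(r, 1e-12) / phi2
    k = phi1 * 2 ** (1 - nu) / gamma(nu) * z ** nu * kv(nu, z)
    return np.where(r == 0, phi1, k)

def neg_log_marginal(log_params, tI, theta_hat, mu):
    """-log N(theta_hat | mu, K(phi1, phi2) + delta^2 I), up to a constant."""
    phi1, phi2, delta = np.exp(log_params)  # optimize on the log scale
    r = np.abs(tI[:, None] - tI[None, :])
    C = matern(r, phi1, phi2) + (delta ** 2 + 1e-8) * np.eye(len(tI))
    resid = theta_hat - mu
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * (logdet + resid @ np.linalg.solve(C, resid))

# synthetic point-wise estimate: a smooth trend plus induced noise
rng = np.random.default_rng(0)
tI = np.linspace(0, 32, 64)
theta_hat = 1.8 - np.cos(np.pi * tI / 8) + 0.05 * rng.normal(size=64)
x0 = np.log([1.0, 5.0, 0.1])
res = minimize(neg_log_marginal, x0, args=(tI, theta_hat, theta_hat.mean()),
               method="Nelder-Mead")
phi1_hat, phi2_hat, delta_hat = np.exp(res.x)
```

With the uniform hyper-prior used by TVMAGI, the only change would be constraining the search range of the length-scale φ_2.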

Maximum A Posteriori (MAP) optimization
All the hyper-parameters are now set and will be held constant when optimizing Eq.(11) to obtain the MAP, with initial values θ(I)^(0) = θ̂(I), x(I)^(0) = x̂(I), ψ^(0) = ψ̂, and σ^(0) = σ̂, all from Section 3.2. The joint posterior function Eq.(11) of x(I), θ(I), ψ, and σ is optimized with the Adam optimizer (Kingma and Ba (2014)) in PyTorch to obtain the MAP estimate of TVMAGI. Finally, to mitigate the potential issue of the Adam optimizer converging to a local optimum, we suggest trying multiple initial values, including starting x(I) at linear interpolations of the observations y(τ).
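The mechanics of this MAP stage can be sketched with a stand-in objective. The toy quadratic below is only a placeholder for the actual negative log-posterior of Eq.(11) (which involves the GP and ODE terms); the optimizer setup mirrors the description above, and the simple two-phase learning-rate decay is our own addition:

```python
import torch

# Stand-in parameters shaped like theta(I), x(I), psi, and sigma; in TVMAGI
# these would be initialized at the Section 3.2 estimates.
theta = torch.zeros(64, requires_grad=True)
x = torch.zeros(64, requires_grad=True)
psi = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)  # keep sigma positive via log

target = torch.linspace(0.0, 1.0, 64)           # toy "truth" for illustration

def neg_log_posterior():
    # Placeholder quadratic objective standing in for -log of Eq.(11).
    return (((theta - target) ** 2).sum() + ((x - 2.0) ** 2).sum()
            + ((psi - 0.1) ** 2).sum() + ((log_sigma + 3.0) ** 2).sum())

for lr in (0.05, 0.005):                        # crude step-size decay
    opt = torch.optim.Adam([theta, x, psi, log_sigma], lr=lr)
    for _ in range(1500):
        opt.zero_grad()
        loss = neg_log_posterior()
        loss.backward()
        opt.step()
```

All unknowns are optimized jointly while the GP hyper-parameters stay fixed, matching the multi-stage design.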

Interval estimation of parameters
In addition to the MAP point estimate, we also quantify the parameter uncertainty in TVMAGI using posterior samples. In particular, we sample the posterior function Eq.(11) using Hamiltonian Monte Carlo (HMC) (Neal, 2011), while holding all the hyper-parameters at the same constant values as in Section 3.4. Details about the HMC algorithm can be found in Supplementary Material Section B. Specifically, in all illustration examples of this paper, we set the step size ε = 10^−5, the number of leap-frog steps L = 100, the sample size to 8000, and the burn-in ratio to 0.5, and the HMC is initialized at the MAP estimate.
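For concreteness, a minimal HMC sampler with a leapfrog integrator is sketched below on a toy target, where a standard normal stands in for the TVMAGI posterior of Eq.(11); the function names and toy settings (ε, L, sample size) are ours, and in TVMAGI the log-density and gradient would come from the closed-form posterior:

```python
import numpy as np

def hmc(log_post_and_grad, x0, eps, L, n_samples, seed=0):
    """Minimal Hamiltonian Monte Carlo; log_post_and_grad returns (logp, grad)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    logp, grad = log_post_and_grad(x)
    samples = []
    for _ in range(n_samples):
        p = rng.normal(size=x.shape)             # resample momentum
        xn, pn, gn, lpn = x.copy(), p.copy(), grad.copy(), logp
        h0 = -logp + 0.5 * p @ p                 # initial Hamiltonian
        for _ in range(L):                       # leapfrog integration
            pn = pn + 0.5 * eps * gn
            xn = xn + eps * pn
            lpn, gn = log_post_and_grad(xn)
            pn = pn + 0.5 * eps * gn
        h1 = -lpn + 0.5 * pn @ pn
        if np.log(rng.uniform()) < h0 - h1:      # Metropolis correction
            x, logp, grad = xn, lpn, gn
        samples.append(x.copy())
    return np.array(samples)

# toy posterior: standard normal in one dimension
draws = hmc(lambda x: (-0.5 * x @ x, -x), x0=[0.0], eps=0.1, L=20, n_samples=2000)
posterior = draws[1000:]                         # discard 50% burn-in, as in the text
```

Initializing the chain at the MAP, as TVMAGI does, shortens the burn-in needed in practice.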

Advantages of our multi-stage algorithm
Compared with jointly optimizing hyper-parameters and parameters together, the multi-stage optimization method enjoys several advantages. First, the GP hyper-parameters Φ^X for the system components are set at the first stage and held constant in the rest of the optimization, so that the inverse of the kernel matrix only needs to be computed once, following the recommendation of Golightly and Wilkinson (2010). Second, the GP hyper-parameters Φ^Θ for the time-varying parameters could not be set without any information about Θ(I).
Therefore, a multi-stage procedure is necessary, where a point-wise θ̂(I) is obtained in one stage without the GP, and then the GP hyper-parameters Φ^Θ are estimated in the following stage based on θ̂(I). Lastly, the multi-stage optimization ensures that each step of the optimization starts with sensible initial values obtained from the previous modularized optimization, thus drastically decreasing the chance of the Adam optimizer getting stuck in a local mode. Experiments have shown that our carefully designed multi-stage optimization is faster and achieves better results than joint optimization with randomized starting values.

Benchmark methods
In this section we provide a brief introduction to two common approaches for time-varying parameter inference in ODEs: numerical integration methods, represented by the Runge-Kutta method, and Bayesian filtering methods, represented by the Ensemble Adjustment Kalman Filter (EAKF). Supplementary Material Section E provides additional theoretical discussion about the limitations and the statistical rigor of the benchmark methods for ODE inference when time-varying and time-constant parameters co-exist.

Runge-Kutta method
The Runge-Kutta method (Lapidus and Seinfeld (1971)) is a brute-force way of performing parameter inference in ODE systems. As a non-linear least squares method, it minimizes the MSE between the observations and the trajectory reconstructed by numerical integration from the proposed initial conditions and parameters. The objective function is given by

(x̂(0), θ̂(I), ψ̂) = argmin_{x(0), θ(I), ψ} Σ_{t ∈ τ} ||y(t) − X_RK4(t; x(0), θ(I), ψ)||²_2,

where X_RK4 denotes the reconstructed trajectory using the 4th-order Runge-Kutta method.
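A sketch of this benchmark on a toy one-dimensional system ẋ = −ψx (our own example, with SciPy's adaptive `RK45` standing in for fixed-step RK4):

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

rng = np.random.default_rng(1)
tau = np.linspace(0, 5, 26)                           # observation times
truth = 3.0 * np.exp(-0.7 * tau)                      # x(0) = 3, psi = 0.7
y = truth * np.exp(0.03 * rng.normal(size=tau.size))  # 3% log-normal noise

def residuals(params):
    x0, psi = params
    # Reconstruct the trajectory by numerical integration, then compare to data.
    sol = solve_ivp(lambda t, x: -psi * x, (0, 5), [x0],
                    t_eval=tau, method="RK45", rtol=1e-8, atol=1e-10)
    return sol.y[0] - y

fit = least_squares(residuals, x0=[1.0, 0.1])
x0_hat, psi_hat = fit.x
```

Every evaluation of the objective requires a full numerical integration, which is what makes this approach expensive for larger systems.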
Supplementary Material Section C includes a few more Bayesian filtering benchmark methods: the Extended Kalman Filter (EKF), the Unscented Kalman Filter (UKF), and the Ensemble Kalman Filter (EnKF).
All the Bayesian filtering methods have the inherent limitation that all parameters must be assumed time-varying, and thus cannot accommodate time-constant parameters.

Evaluation Metrics
To assess the quality of the parameter estimates and the system recovery, we consider two metrics based on the root mean squared error (RMSE). First, we examine the accuracy of the parameter estimates using the parameter RMSE. For the time-constant parameters, we directly calculate the RMSE of the parameter estimates against the true parameter values across simulations. For the time-varying parameters, we additionally average the RMSE over the discretization set I. Second, we examine the system recovery using the trajectory RMSE. Due to the potential identifiability issue that different parameters can give similar system observations, we measure how well the system components are recovered as another independent evaluation.
To calculate the trajectory RMSE, we use numerical integration to reconstruct the trajectory based on the TVMAGI inferred parameters and initial conditions. The RMSE of the reconstructed trajectory to the true system is then calculated at observation time points.
We emphasize that the numerical integration is only used for evaluation purpose, and throughout our TVMAGI approach, no numerical integration is ever needed. For better distinction, we refer to the MAP of x(I) directly from TVMAGI as the inferred trajectory, and refer to the numerically integrated x(t) based on the TVMAGI inferred parameters and initial conditions as the reconstructed trajectory.
To assess the quality of the interval estimates, we consider the frequentist coverage of our posterior intervals. For the time-constant parameters, we directly calculate the proportion of repeated simulations in which our posterior interval covers the truth. For the time-varying parameters, we additionally average the coverage over the discretization set I. The coverage of the inferred trajectory can be similarly calculated, averaging over the discretization set I.
We do not compute the coverage of the reconstructed trajectory, as it would require running the numerical solver for each posterior sample of the parameters and initial conditions.

Results
We illustrate the accuracy and efficiency of TVMAGI through three realistic simulation studies of ODE models in epidemiology, ecology, and systems biology. We begin with a disease compartmental model that demonstrates the effectiveness of TVMAGI for problems with partially observed system component(s). We then use an ecology example to show how TVMAGI can mitigate the identifiability issue through the informative GP prior that favors smoother time-varying parameters. Lastly, we apply TVMAGI to a systems biology example with non-stationary, rapidly-changing time-varying parameters, and present TVMAGI's competitive performance against one additional tailor-made benchmark method for such an ODE.

SEIRD model
Consider a COVID-19 cases/deaths modeling using an infectious disease Susceptible-Exposed-Infectious-Recovered-Deceased (SEIRD) compartmental ODE model (Hethcote (2000); Hao et al. (2020)), where the entire population is classified into the S, E, I, R, D components, and the transitions from one state to another (i.e., the disease spreading dynamics) are modeled as an ODE:

dS/dt = −β_t S I / N,
dE/dt = β_t S I / N − v_e,t E,
dI/dt = v_e,t E − v_i I,
dD/dt = v_i p_d,t I.

N is the total population, and the cumulative recovered population is R = N − S − E − I − D.
The S, E, I and D denote the susceptible, exposed, infected population and cumulative death respectively. The 4 parameters of interest are investigated: rate of contact by an infectious individual (β), rate of transferring from state of exposed to infectious (v e ), rate of leaving infectious period (v i ) and fatality rate (p d ). During a pandemic, parameters in the SEIRD model can evolve over time due to pharmaceutical and non-pharmaceutical interventions.
We assume that β is time-varying due to the mutation of the disease and policy interventions during a specific time; p_d is time-varying depending on the sufficiency of medical treatment; v_e is time-varying due to the different levels of public awareness or complacency; and v_i is assumed to be an unknown time-constant parameter to avoid identifiability issues. A sample simulated dataset is also presented to visualize the noise level and observation schedule.
In the experiment we set v_i = 0.1, β_t = 1.8 − cos(πt/8), v_e,t = 0.1 − 0.02 cos(πt/8), and p_d,t = 0.05 + 0.025 cos(πt/8), and focus on a time horizon of 32 days. The initial values of the four components are set as (100000, 100, 50, 50) for (S, E, I, D). We assume S, I, D are observed at daily frequency with log-normal multiplicative observation noise at the 3% level. The exposed population E is assumed to be only sparsely observable at the 3% noise level, with one observation per two days, due to the high cost of data acquisition from sampling tests.
Such pandemic settings (Dong et al. (2020); Mwalili et al. (2020)) capture the periodic fluctuation of parameters often observed in the real world.
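The data-generating settings above can be reproduced with a short script. This is a sketch under stated assumptions: we take the standard SEIRD form with dD/dt = v_i·p_d,t·I, and read N as the sum of the initial compartments with R(0) = 0; variable names are ours:

```python
import numpy as np
from scipy.integrate import solve_ivp

beta = lambda t: 1.8 - np.cos(np.pi * t / 8)          # time-varying contact rate
v_e = lambda t: 0.1 - 0.02 * np.cos(np.pi * t / 8)
p_d = lambda t: 0.05 + 0.025 * np.cos(np.pi * t / 8)
v_i = 0.1                                             # time-constant

S0, E0, I0, D0 = 100000.0, 100.0, 50.0, 50.0
N = S0 + E0 + I0 + D0                                 # assumes R(0) = 0

def seird(t, u):
    S, E, I, D = u
    return [-beta(t) * S * I / N,
            beta(t) * S * I / N - v_e(t) * E,
            v_e(t) * E - v_i * I,
            v_i * p_d(t) * I]

t_obs = np.arange(0, 33)                              # daily over the 32-day horizon
sol = solve_ivp(seird, (0, 32), [S0, E0, I0, D0], t_eval=t_obs, rtol=1e-8)
rng = np.random.default_rng(0)
y = sol.y * np.exp(0.03 * rng.normal(size=sol.y.shape))  # 3% log-normal noise
# For E, keep only every other day to mimic the sparse observation schedule.
E_obs = y[1, ::2]
```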
We apply TVMAGI on a log-transformed system (by taking the log of populations in each of the S, E, I, D state) over 100 simulation datasets, with 2 discretizations per day. Figure 1 shows the results of parameter inference and the TVMAGI reconstructed trajectory X(I) of the ODE system. The parameter RMSE and trajectory RMSE introduced in Section 4.2 are presented in Table 1, where TVMAGI is shown to be more accurate than Runge-Kutta or EAKF.
For the point estimates, Figure 1 and Table 1 show that, even when the exposed population is sparsely observed, TVMAGI is still capable of providing good inference results. As the most important parameter when assessing the spread of the disease, β_t can be accurately and robustly inferred. v_i can also be accurately inferred as constant. p_d,t has larger variability at the start, as the initial deaths are too few to provide enough information. In comparison, the variability of the v_e,t inference increases at the end of the period, because the susceptible population has decreased to nearly zero while the infectious population reaches a plateau. Despite variations in the inferred parameters, the inferred system trajectories are all very close to the truth, confirming the intuition that the system is less sensitive to p_d in the earlier stage and to v_e in the later stage; further visualizations are provided in Supplementary Material Figure S5. On the computational cost, Table 1 also shows that TVMAGI is much faster than the Runge-Kutta numerical integration method. EAKF is fast, but gives unreliable results (see Supplementary Material Section E.2 for more discussion on the reliability of EAKF).

Lotka-Volterra Model
The Lotka-Volterra (LV) model (a.k.a. the predator-prey model) is widely used to describe the population fluctuations of predators and prey and their interactions in the ecosystem (Goel et al. (1971)).
With the introduction of time-varying parameters, the system becomes weakly identifiable during certain time ranges, which creates a challenge for the inference. Specifically, the ODE system is characterized as

dx/dt = α_t x − β x y,
dy/dt = δ x y − γ_t y,

where x and y denote the populations of prey and predators. α_t indicates the birth rate of the prey and γ_t denotes the death rate of the predator, both of which are assumed to fluctuate according to seasonality. β and δ describe the interaction relationships between predators and prey, and are assumed constant. We set the parameters β = 0.75, δ = 1, α_t = 0.6 + 0.3 cos(πt/5), and γ_t = 1 + 0.1 sin(πt/5). Time is measured on a yearly basis, and data for 20 years are generated with monthly observations contaminated by 3% multiplicative log-normal noise. The initial values of predators and prey are 1 and 3, an ideal ratio in real ecological systems (Donald and Stewart Anderson (2003)). The results are summarized in Table 2. Our recovered system components x and y are very close to the truth, despite the weak identifiability of the parameters α_t and γ_t when x and y are at their peak (year = 12). Most notably, α_t can deviate from the truth in the attempt to best fit the observed noisy data at the peak of x_t, resulting in a biased inference of the time-varying parameters at the weakly identifiable time points, although all deviations are still within the range of the smoothness constraints on α_t. Nevertheless, both TVMAGI inferred system components x and y are still accurate.
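The simulation settings above correspond to the following data-generating sketch, assuming the standard LV sign conventions dx/dt = α_t·x − β·x·y and dy/dt = δ·x·y − γ_t·y; variable names are ours:

```python
import numpy as np
from scipy.integrate import solve_ivp

alpha = lambda t: 0.6 + 0.3 * np.cos(np.pi * t / 5)   # seasonal prey birth rate
gamma = lambda t: 1.0 + 0.1 * np.sin(np.pi * t / 5)   # seasonal predator death rate
beta, delta = 0.75, 1.0                               # constant interaction terms

def lotka_volterra(t, u):
    x, y = u                                          # prey, predator
    return [alpha(t) * x - beta * x * y,
            delta * x * y - gamma(t) * y]

t_obs = np.linspace(0, 20, 241)                       # monthly over 20 years
sol = solve_ivp(lotka_volterra, (0, 20), [3.0, 1.0], t_eval=t_obs, rtol=1e-8)
rng = np.random.default_rng(0)
obs = sol.y * np.exp(0.03 * rng.normal(size=sol.y.shape))  # 3% log-normal noise
```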
Comparing with the benchmark models in Table 2, TVMAGI gives the most accurate parameter inference, thanks to the GP smoothing prior that mitigates the identifiability issue. The numerical method of Runge-Kutta gives better trajectory inference, but it cannot handle the identifiability issue in the parameters (see also Supplementary Material Figure S7). The coverage from TVMAGI is not ideal, possibly due to the bias in the α(t) estimate and the variance in the γ(t) estimate: if the GP smoothing prior is too strong, the point estimates will be biased, and if the GP smoothing prior is too weak, the point estimates will have large variance (see Supplementary Material Figure S10). The comparison of computational cost again demonstrates the expected advantage of TVMAGI over Runge-Kutta, while EAKF is the fastest method with the worst accuracy.
This example illustrates the performance of TVMAGI in the presence of weak identifiability: the inferred time-varying parameters at the weakly identified time points can be subject to deviation from the truth, although the parameters are smooth and still fit the observed data well.

HIV model
In this example, we compare TVMAGI with a state-of-the-art two-stage Efficient Local Estimation (ELE) method proposed by Chen and Wu (2008) in an HIV dynamic model that they studied. This is a challenging case for GP modeling, as the true time-varying parameter has non-periodic, non-stationary trends with rapid changes. The ODE model that characterizes the response to anti-viral regimens during HIV infection (Eq.(17)) can be written in the linear form

dX(t)/dt = Σ_{i=1}^{d} a_i(t) Z_i(t),   (18)

where Z_i(t) is the known covariate, and a_i(t) is the unknown time-varying coefficient.
Chen and Wu (2008) transformed the system Eq.(17) into Eq.(18) by taking d = 2, a_1(t) = −N T*(t), a_2(t) = N k[1 − r(t)]X(t), Z_1(t) = 1, and Z_2(t) = T(t), and then used their ELE method to estimate the time-varying coefficients a_i(t). For benchmark comparison, we treat a_1(t) and a_2(t) in Eq.(18) as unknown time-varying parameters for TVMAGI. Figure 4 shows the TVMAGI inferred time-varying parameter a(t) = Σ_{i=1}^{d} a_i(t)Z_i(t) and the reconstructed trajectory X(t). The parameter/trajectory RMSEs of TVMAGI and the benchmark methods are reported in Table 3. TVMAGI has a small advantage over the state-of-the-art method for the HIV model inference of X(t). Further visual comparison to the benchmark methods (Supplementary Material Figure S8) shows that TVMAGI is slightly more accurate in the beginning phase of the system, which is in fact the most challenging phase for HIV inference, as the viral load drops sharply due to the drug effect. TVMAGI also achieves competitive inference results on a(t), which is of clinical importance for the generation rate of the HIV virus (Cao et al. (2012)). The TVMAGI posterior interval coverage is less ideal because of the decreased accuracy in a(t) towards the ending period (Supplementary Material Figure S11). Most importantly, while the benchmark method requires a highly restricted form of the ODE formulation, TVMAGI assumes no specific form of the ODE equations, and is thus applicable to general ODE systems, albeit with a longer computing time.

Number of discretization
In this section we explore the sensitivity of TVMAGI to the number of discretization points. In the paper we used a discretization level of 2 in the SEIRD model; that is, with 32 observation points, we have a total of 64 discretization points (2 discretization points per observation). For comparison, we use the same observation data with discretization levels of 1 and 4, corresponding to 32 discretization points (1 per observation) and 128 discretization points (4 per observation), respectively. Table 4 shows that the inference accuracy indeed converges. However, the computational time scales up linearly with the number of discretization points. In practice, we recommend gradually increasing the number of discretization points until the result stabilizes, balancing the inference accuracy against the computing cost.

Selection of kernel
In this section we discuss how the kernel selection will affect the performance of TVMAGI.
In the paper we recommend modeling θ(t) as a Gaussian process with the Matern kernel ν = 2.01 to guarantee continuous and differentiable time-varying parameters while maintaining high flexibility. We can also use other GP kernels or hyper-parameters to control the smoothness.
For example, the Matern kernel with ν = 2.5 can be used for an even smoother GP with a simple closed-form kernel. The condition of differentiability can also be relaxed by substituting the kernel with ν = 1.5, in which case the parameters are only assumed to be continuous rather than differentiable, allowing more flexible patterns for the time-varying parameters θ(t). Table 5 shows the results under both kernels, where the performance is similar to that of the recommended ν = 2.01 in the SEIRD model, indicating that TVMAGI is not sensitive to the choice of kernel.
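For half-integer ν, the Matern kernel reduces to simple closed forms, which is what makes the ν = 1.5 and ν = 2.5 alternatives convenient. A minimal sketch (function name ours):

```python
import numpy as np

def matern_closed_form(r, phi1, phi2, nu):
    """Closed-form Matern kernels for nu in {1.5, 2.5}; r = |s - t| >= 0."""
    z = np.sqrt(2.0 * nu) * np.asarray(r, dtype=float) / phi2
    if nu == 1.5:
        return phi1 * (1.0 + z) * np.exp(-z)
    if nu == 2.5:
        return phi1 * (1.0 + z + z ** 2 / 3.0) * np.exp(-z)
    raise ValueError("closed form implemented only for nu in {1.5, 2.5}")

r = np.linspace(0.0, 5.0, 50)
k15 = matern_closed_form(r, 1.0, 1.0, 1.5)   # rougher sample paths
k25 = matern_closed_form(r, 1.0, 1.0, 2.5)   # smoother sample paths
```

The ν = 2.01 kernel has no such closed form, hence the Bessel-function evaluation used in the main method.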

Discussion
In this paper, we introduce a Bayesian approach, TVMAGI, for time-varying parameter inference in ODE dynamic systems. TVMAGI models the time-varying parameters and system components as Gaussian processes, and constrains the derivative processes to satisfy the ODE dynamics. We show that TVMAGI is statistically principled and illustrate its general applicability through three simulation examples. The results show that TVMAGI yields accurate and robust parameter inference from noisy observations, with reasonable interval estimates as well. Moreover, TVMAGI can mitigate the identifiability and overfitting issues in the time-varying parameters using the informative GP smoothing prior.
TVMAGI is also generally applicable in the presence of missing observations.
TVMAGI is more accurate than the benchmark methods because it addresses the challenges that the numerical integration method and the Bayesian filtering method face for ODE inference with time-constant and time-varying parameters. Numerical integration methods are the gold standard for ODE parameter inference when all parameters are time-constant.
However, in the presence of time-varying parameters, due to the lack of a smoothness structure on θ(t), the inferred time-varying parameters from Runge-Kutta will overfit the noisy observation data, resulting in a volatile θ(t) with little information about the true trends. The Bayesian filtering approach, on the other hand, cannot infer the time-constant parameter ψ because updating ψ is not permissible in a state-space model fashion. We can nevertheless enforce an update on the time-constant parameter, but there will be no guarantee on the accuracy of the reconstructed trajectory. The ELE two-stage approach relies on a regression technique that can only be used if the ODE depends linearly on the time-varying parameters. Therefore, TVMAGI is the only approach that is theoretically sound, practically accurate, and generally applicable for the ODE inference problem when time-constant and time-varying parameters co-exist.
In terms of computation time, TVMAGI has the notable advantage of reduced cost compared with the numerical integration method, while its inference is more accurate than that of the fast-yet-unreliable Bayesian filtering methods. Even for the three small illustrative problems in this paper, TVMAGI is more than twice as fast as the numerical-integration approach based on Runge-Kutta, with better accuracy. When dealing with large-scale systems, the gain in computation time is likely to be even larger: the computational cost of TVMAGI scales linearly as the dimension of the system components grows, whereas inference with a numerical integration method such as Runge-Kutta typically scales super-linearly. TVMAGI therefore has strong potential in large-scale systems, where numerical integration is expensive.
There are two settings that may require tuning in TVMAGI. First, the number of discretization points can affect the inference results. When the observed components are sparse, the number of discretization points should be increased until the results stabilize; however, an overly dense discretization leads to higher computation cost. For example, in the SEIRD model we set the discretization at 2 points per day for optimized performance, as a finer discretization does not further improve the accuracy of the results. Second, the inference results of TVMAGI can be affected by the hyper-parameter settings of the GP kernel for θ(t). To achieve the desired level of variability in the time-varying parameters, we find it helpful to use an informative hyper-prior that specifies the range of the length-scale (a.k.a. bandwidth) parameter of the GP kernel for θ(t), preventing obvious over-smoothing or over-fitting.
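The refine-until-stable tuning of the discretization level can be sketched as follows. Here `estimate_at` is a hypothetical stand-in for a full TVMAGI fit (replaced by a trapezoid-rule quadrature of a smooth curve so the sketch is runnable); in practice it would return a parameter estimate of interest whose stability is monitored.

```python
import numpy as np

def estimate_at(n_per_day, days=30.0):
    """Hypothetical stand-in for one TVMAGI run at a given discretization.

    Here we simply integrate a smooth curve with the trapezoid rule; in
    practice this would be a full TVMAGI fit at n_per_day points per day.
    """
    t = np.linspace(0.0, days, int(n_per_day * days) + 1)
    y = np.exp(-0.1 * t) * np.sin(t)
    return 0.5 * np.sum((y[1:] + y[:-1]) * np.diff(t))

# Double the discretization until the estimate stabilizes.
n, prev = 1, None
while True:
    cur = estimate_at(n)
    if prev is not None and abs(cur - prev) < 1e-4:
        break  # further refinement no longer changes the result
    n, prev = n * 2, cur
```

The stopping tolerance trades accuracy against the extra computation cost of a denser discretization.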
The Gaussian process modeling of θ(t) with a Matern kernel of ν = 2.01 ensures continuously differentiable time-varying parameters, which prevents overfitting the parameter to the observation noise. The variability in the time-varying parameter θ(t) can be further controlled by the length-scale GP hyperparameter φ2 through its hyper-prior. The Matern kernel, together with the hyper-prior on the length-scale hyperparameter, ensures both the smoothness and the degree of variability of θ(t), which in turn prevents over-fitting and mitigates identifiability issues. If a more flexible θ(t) is desired, a Matern kernel with ν = 1.5 and a hyper-prior favoring a smaller GP hyperparameter φ2 can be used to allow more rapid changes in θ(t).
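As a sketch of this prior, the Matern covariance with ν = 2.01 can be formed directly from the modified Bessel function of the second kind; the variance and length-scale values below are illustrative choices of ours, not the hyper-parameters used in the paper.

```python
import numpy as np
from scipy.special import gamma, kv

def matern_kernel(t1, t2, variance=1.0, lengthscale=1.0, nu=2.01):
    """Matern covariance between time points t1 and t2.

    nu = 2.01 yields continuously differentiable sample paths, matching
    the smoothness imposed on the time-varying parameter theta(t).
    """
    d = np.abs(np.subtract.outer(t1, t2))
    scaled = np.sqrt(2.0 * nu) * d / lengthscale
    # The d = 0 limit of the Matern form equals `variance`.
    K = np.full_like(scaled, variance, dtype=float)
    nz = scaled > 0
    K[nz] = (variance * (2.0 ** (1.0 - nu) / gamma(nu))
             * scaled[nz] ** nu * kv(nu, scaled[nz]))
    return K

# Draw one smooth prior sample for theta(t) on a half-day grid over 30 days.
t = np.linspace(0.0, 30.0, 61)
K = matern_kernel(t, t, variance=0.5, lengthscale=10.0)
rng = np.random.default_rng(0)
theta_sample = rng.multivariate_normal(np.zeros(len(t)), K + 1e-8 * np.eye(len(t)))
```

Increasing the length-scale yields flatter prior draws for θ(t), which is exactly how the hyper-prior on φ2 controls the variability of the time-varying parameter.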
One limitation of TVMAGI is its inherent bias. Like other Bayesian approaches, TVMAGI can be biased towards smoother curves due to the GP prior. The inference results will be less accurate when the true time-varying parameters have rapid changes, and the posterior interval coverage may suffer. As shown in the examples, however, the magnitude of this bias is small in practice, and the accuracy remains comparable with state-of-the-art approaches while TVMAGI offers much broader applicability.
Although we carefully designed the multi-stage optimization and sampling schedule, the Adam optimizer or the HMC sampler can occasionally get stuck. Among the 300 simulated datasets across the three examples, the algorithm got stuck on one particular dataset of the SEIRD model. In such cases, some manual tuning of the hyper-parameters or jittering of the sampled parameters may be needed. We will continue to improve the robustness of our proposed algorithm and our software implementation.
TVMAGI is also not suitable if the underlying time-varying parameter is a jump process.
In this case, methods from the change-point detection literature may be more applicable (Cuenod et al. (2011)). Alternatively, one can place a continuous-time Markov chain or Poisson process prior on θ(t), instead of a Gaussian process, to model the jump process.
There are also many other interesting future directions for TVMAGI. We currently focus on the empirical performance of TVMAGI through simulation examples; theoretical study of the convergence properties, identifiability, and asymptotic behavior of the time-varying parameter estimates are all natural directions for future research. This paper studies deterministic systems specified by ODEs, and it would be of future interest to extend TVMAGI to partial differential equations for spatial-temporal dynamics (Xun et al. (2013)) or to stochastic differential equations that model inherent noise (Kou and Xie (2004)).

Supplementary Material
A Advantage of the L∞ norm in Eq.(5) of main text
In this section we illustrate how the L∞ norm in Eq.(5) of the main text facilitates the theoretical construction, compared to the L2 norm. First, with the L∞ norm in W, it is clear that on the discretization subset I, the corresponding W_I is simply the maximum over I.
However, with the L2 norm ∫₀ᵀ ||Ẋ(t) − f(X(t), Θ(t), Ψ, t)||² dt, the formulation of the corresponding W_I is not as clear. Second, using the L∞ norm makes the theoretical justification easier. To mathematically study the properties of TVMAGI while avoiding the Borel paradox, one can use the fact that {W_I < ε} ≡ ∩_{i∈I} {W_i < ε}, thanks to W_I being the L∞ norm over the set I. Third, the L∞ norm in Eq.(5) and Eq.(6) automatically transforms into an L2 loss for the likelihood calculation in Eq.(7) and Eq.(11) through a simple mathematical derivation, which facilitates computation while maintaining theoretical rigor. This is because, when a Gaussian distributed vector is constrained to have zero deviation from some fixed value (i.e., the vector's L∞ distance to the fixed value is zero), the fixed value is plugged into the Gaussian probability density function, inducing an L2 loss in the target function of Eq.(11).
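Concretely, and up to notation (this display restates the surrounding discussion rather than reproducing Eq.(5) verbatim), the two quantities compared above are

```latex
W \;=\; \sup_{t \in [0,T]} \bigl\| \dot X(t) - f\bigl(X(t), \Theta(t), \Psi, t\bigr) \bigr\|_\infty ,
\qquad
W_I \;=\; \max_{i \in I} \bigl\| \dot X(t_i) - f\bigl(X(t_i), \Theta(t_i), \Psi, t_i\bigr) \bigr\|_\infty ,
```

so that, writing $W_i$ for the deviation at the single time point $t_i$, the event identity $\{W_I < \varepsilon\} = \bigcap_{i \in I} \{W_i < \varepsilon\}$ holds by construction.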

B HMC algorithm
We outline the HMC procedure for sampling from a target probability distribution; the interested reader may refer to Neal (2011) for a more thorough introduction to HMC. Algorithm 1 provides the details of our HMC implementation. Among the Bayesian filtering methods considered below, the Ensemble Kalman Filter (EnKF) is less discussed in ODE parameter inference applications.
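A single HMC transition with leapfrog integration can be sketched generically as follows; this is a minimal illustration of the procedure outlined above, not the tuned implementation of Algorithm 1 (step size, trajectory length, and the usage target are our own illustrative choices).

```python
import numpy as np

def hmc_step(log_prob, grad_log_prob, x0, step_size=0.1, n_leapfrog=20, rng=None):
    """One HMC transition: momentum resampling, leapfrog, Metropolis correction."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    p = rng.standard_normal(x.shape)             # resample auxiliary momentum
    current_h = -log_prob(x) + 0.5 * p @ p       # Hamiltonian = potential + kinetic

    # Leapfrog integration of Hamiltonian dynamics.
    p = p + 0.5 * step_size * grad_log_prob(x)
    for _ in range(n_leapfrog - 1):
        x = x + step_size * p
        p = p + step_size * grad_log_prob(x)
    x = x + step_size * p
    p = p + 0.5 * step_size * grad_log_prob(x)

    proposed_h = -log_prob(x) + 0.5 * p @ p
    if np.log(rng.uniform()) < current_h - proposed_h:   # accept/reject
        return x, True
    return np.array(x0, dtype=float), False

# Usage: sample from a standard 2-d Gaussian target.
logp = lambda x: -0.5 * x @ x
grad = lambda x: -x
rng = np.random.default_rng(1)
x, samples = np.zeros(2), []
for _ in range(500):
    x, _ = hmc_step(logp, grad, x, rng=rng)
    samples.append(x)
samples = np.array(samples)
```

In practice the step size is adapted so that the acceptance rate stays in a reasonable range; the sketch above keeps it fixed for clarity.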
To further illustrate the Bayesian filtering approaches, we also provide inference results using the other 3 methods. Fig.S1 and Fig.S2 illustrate the results for the SEIRD model; Fig.S3 shows the results for the LV model, and Fig.S4 those for the HIV model. The fundamental difference from the standard filtering setting is the lack of randomness in the state transition given the model parameters, so that the Bayesian update given the observation has zero effect.
With the ODE structure, there is no randomness in x_t | x_{t−1}, θ: the state transition distribution p(x_t | x_{t−1}, θ) is essentially a shifted Dirac delta distribution. Therefore, regardless of the emission probability p(y_t | x_t), the hidden state x_t does not depend on y_t, and all Bayesian updates have zero effect on p(x_t | x_{t−1}, θ), which remains a shifted Dirac delta distribution. The exact Bayesian filtering result is simply the solution of the ODE dynamics given the initial sample of x_0 and the parameter θ. In this case, parameter estimation under exact Bayesian filtering reduces to using a numerical solver to generate the entire ODE curve given x_0 and θ, and then using a least-squares approach to compare the solved curve with the observations to find the best x_0 and θ; the exact Bayesian filter thus degenerates to a numerical integration method. From the particle filter perspective, each particle of sampled (x_0, θ) evolves in time according to the ODE without any randomness, and parameter estimation becomes a brute-force search for the sampled particle (x_0, θ) that yields the smallest mean squared error against the observations.
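This degeneracy can be made concrete with a toy example of our own (a scalar exponential-decay ODE, not a model from the paper): with deterministic transitions, the particle weights depend only on each particle's total squared error, so the "filter" is exactly a brute-force least-squares search over sampled (x0, θ).

```python
import numpy as np

# Toy ODE: dx/dt = -theta * x, observed with Gaussian noise (sd = 0.05).
rng = np.random.default_rng(2)
true_theta, true_x0, dt, n_obs = 0.5, 2.0, 0.1, 50
t_grid = np.arange(n_obs) * dt
y = true_x0 * np.exp(-true_theta * t_grid) + 0.05 * rng.standard_normal(n_obs)

# "Particle filter" with deterministic transitions: each particle (x0, theta)
# evolves without noise, so its final weight is exp(-SSE / (2 sigma^2)).
n_particles = 2000
theta_p = rng.uniform(0.0, 1.0, n_particles)
x0_p = rng.uniform(1.0, 3.0, n_particles)
traj = x0_p[:, None] * np.exp(-theta_p[:, None] * t_grid[None, :])  # exact solution
sse = np.sum((traj - y[None, :]) ** 2, axis=1)
log_w = -0.5 * sse / 0.05 ** 2

best = np.argmax(log_w)  # identical to the minimum-SSE (least-squares) particle
```

Since the weight is a monotone transform of the squared error, ranking particles by likelihood and by least squares gives the same answer, which is the degeneracy described above.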
In light of this, Bayesian filtering/smoothing is more suitable for the inference of Stochastic Differential Equation (SDE) parameters (see Donnet and Samson (2013); Golightly and Wilkinson (2010)).
However, Bayesian filter methods can still be applied to the ODE inference problem if we forgo some statistical rigor. We can treat all parameters as time-varying and artificially introduce additional randomness into θ_t | θ_{t−1} and ψ_t | ψ_{t−1}. The simultaneous estimation of the system components x(t), the time-constant parameter ψ, and the time-varying parameter θ(t) then becomes the estimation of the joint hidden state (x_t, ψ_t, θ_t), and Bayesian filtering methods such as EKF, UKF, EnKF, and EAKF become applicable. However, this is not a statistically principled approach, because (1) the time-constant parameter ψ is now allowed to change with time, and (2) the inference on the distribution is not exact, as a Gaussian distributional approximation is used for the system components x(t). Nevertheless, this method can work empirically, with the notable success of the SIRS-EAKF model (Shaman and Karspeck (2012); Yang et al. (2020)).
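The augmented-state idea can be sketched on the same kind of toy exponential-decay ODE (our own illustration, not a model from the paper): the parameter is appended to the state with an artificial random walk of assumed variance q, so the ensemble Kalman update can move it.

```python
import numpy as np

rng = np.random.default_rng(3)
dt, n_steps, obs_sd, q = 0.1, 50, 0.05, 0.01
t = np.arange(1, n_steps + 1) * dt
y = 2.0 * np.exp(-0.5 * t) + obs_sd * rng.standard_normal(n_steps)  # true theta = 0.5

# Ensemble over the augmented state (x, theta); theta gets artificial
# random-walk noise so that Kalman updates can adjust it.
N = 500
ens = np.column_stack([rng.uniform(1.0, 3.0, N),    # x component
                       rng.uniform(0.2, 1.2, N)])   # theta component

for k in range(n_steps):
    # Forecast: deterministic Euler step for x, random walk for theta.
    ens[:, 0] += -ens[:, 1] * ens[:, 0] * dt
    ens[:, 1] += q * rng.standard_normal(N)
    # Stochastic EnKF analysis: update (x, theta) from the observation of x.
    pert_y = y[k] + obs_sd * rng.standard_normal(N)   # perturbed observations
    C = np.cov(ens.T)                                 # 2x2 ensemble covariance
    gain = C[:, 0] / (C[0, 0] + obs_sd ** 2)          # Kalman gain, scalar obs
    ens += np.outer(pert_y - ens[:, 0], gain)

theta_hat = ens[:, 1].mean()
```

The estimate of θ is pulled toward the truth through the cross-covariance between x and θ that the dynamics build up, which is exactly the mechanism (and the lack of principle: a "constant" parameter is allowed to drift) described above.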
In our examples, we found the Bayesian filter methods to be highly sensitive to the initialization of the parameters: randomized initialization often leads to insensible results.
Therefore, all the parameters for the Bayesian filter methods are initialized at the TVMAGI initial points from Section 3. Even so, the error in the filtered parameters is large at weakly identifiable time points (Figure S3), and the estimates for the later time points are no longer accurate due to the cascading effect.
The failure of Bayesian filtering in our setting is not surprising. First, it violates the assumptions of the model, as it is not principled to allow the time-constant parameter ψ to change over time. Although we used the average ψ̄ of the filtered parameters ψ_t as the final estimate of ψ, the approximation error can still be large: turning a time-constant parameter into a time-varying one during inference and later plugging in the average of the inferred values is not theoretically sound. The trajectory RMSE precisely evaluates how accurately the estimated parameters can reconstruct the entire system given the ODE structure, a criterion on which ψ̄ fails. Second, contrary to the typical Bayesian filtering setting in machine learning, where a long sequence of data is available, our experiments are designed to assess performance with short time series and sparse observations, as in most scientific experimental settings; the lack of long data series poses a challenge to Bayesian filtering methods. Third, the ODE structure is no longer exactly followed in Bayesian filtering, which loosens the structural constraints and incurs an additional loss of information from observation data that is already sparse.
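The trajectory RMSE criterion mentioned above can be sketched as follows; the logistic ODE, parameter values, and helper name are hypothetical stand-ins of ours, not the models or code of the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

def trajectory_rmse(f, x0_hat, theta_hat, x_true, t_grid):
    """RMSE between the trajectory reconstructed from estimated parameters
    and initial condition, and the true trajectory on the same grid."""
    sol = solve_ivp(lambda t, x: f(x, theta_hat, t),
                    (t_grid[0], t_grid[-1]), x0_hat,
                    t_eval=t_grid, rtol=1e-8, atol=1e-8)
    return float(np.sqrt(np.mean((sol.y.T - x_true) ** 2)))

# Toy check: logistic growth dx/dt = theta * x * (1 - x) with known solution
# x(t) = 1 / (1 + 9 exp(-0.8 t)) for x(0) = 0.1, theta = 0.8.
f = lambda x, theta, t: theta * x * (1.0 - x)
t_grid = np.linspace(0.0, 10.0, 101)
x_true = 1.0 / (1.0 + 9.0 * np.exp(-0.8 * t_grid))

rmse_exact = trajectory_rmse(f, [0.1], 0.8, x_true[:, None], t_grid)  # true params
rmse_off = trajectory_rmse(f, [0.1], 1.0, x_true[:, None], t_grid)    # wrong theta
```

A parameter estimate that merely tracks the observations but cannot regenerate the system through the ODE, such as ψ̄ above, scores poorly under this metric.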

F Additional results of TVMAGI interval estimates
In this section we present visualizations of the interval estimation results. We see that for long time series of observations, the estimated intervals tend to be narrow and may not contain the true values, which is a limitation of TVMAGI.

G Additional results on TVMAGI sensitivity study
In this section we provide additional visualizations for the TVMAGI sensitivity study, varying the discretization level and the Matern kernel degree of freedom ν.