Universal Differential Equations for Scientific Machine Learning

In the context of science, the well-known adage "a picture is worth a thousand words" might well be "a model is worth a thousand datasets." In this manuscript we introduce the SciML software ecosystem as a tool for mixing the information of physical laws and scientific models with data-driven machine learning approaches. We describe a mathematical object, which we denote universal differential equations (UDEs), as the unifying framework connecting the ecosystem. We show how a wide variety of applications, from automatically discovering biological mechanisms to solving high-dimensional Hamilton-Jacobi-Bellman equations, can be phrased and efficiently handled through the UDE formalism and its tooling. We demonstrate the generality of the software tooling to handle stochasticity, delays, and implicit constraints. This funnels the wide variety of SciML applications into a core set of training mechanisms which are highly optimized, stabilized for stiff equations, and compatible with distributed parallelism and GPU accelerators.


Introduction
Recent advances in machine learning have been dominated by deep learning which uses readily available "big data" to solve previously difficult problems such as image recognition [1,2,3] and natural language processing [4, 5, 6]. While some areas of science have begun to generate the large amounts of data required to train deep learning models, notably bioinformatics [7,8,9,10,11], in many areas the expense of scientific experiments has prohibited the effectiveness of these groundbreaking techniques. In these domains, mechanistic models are still predominantly deployed due to the inaccuracy of deep learning techniques with small training datasets. While these mechanistic models are constrained to be predictive by the use of prior structural knowledge (i.e. a form of inductive bias), the data-driven approach of machine learning can be more flexible and allows one to drop the simplifying assumptions required to derive theoretical models. The purpose of this work is to give a mathematical framework and accompanying software tool which bridges the gap by merging the best of both methodologies while mitigating the deficiencies.
This work falls into the burgeoning field of "scientific machine learning" which seeks to integrate machine learning-derived ideas into traditional engineering-related methods in order to incorporate domain knowledge and known physical information into the learning process [12]. It has recently been shown to be advantageous to merge differential equations with machine learning. Physics-Informed Neural Networks (PINNs) use partial differential equations in the cost functions of neural networks to incorporate prior scientific knowledge [13]. While this has been shown to be a form of data-efficient machine learning for some scientific applications, PINNs frame the solution process as a large optimization. The data efficiency of PINNs has been shown to further improve when encoding structural assumptions like energy conservation by directly modeling the Hamiltonian [14,15]. While this formalism incorporates the knowledge of physical systems into machine learning, it does not incorporate the numerical techniques which have led to stable and efficient solvers for the large majority of scientific models. Treating the training process as fully implicit is compute-intensive, and recent results have shown it becomes increasingly difficult as the stiffness of the model increases [16]. While work has demonstrated that efficiency is greatly improved when incorporating classical discretization techniques into the training process (as "discrete physics-informed neural networks", such as in multi-step neural networks [17]), the formalism provided by PINNs is not conducive to the full use of classical model simulation software. Thus no major PINN software framework, such as deepxde [18] or SimNet [19], has the ability to automatically combine scientific machine learning training with the highly efficient differential equation solvers and adjoint techniques developed over the last century.
Thus as a basis for scientific machine learning that incorporates the efficient numerical solver and adjoint techniques, we developed a formalism which we denote the universal differential equation (UDE). UDEs are differential equations which are defined in full or in part by a universal approximator. A universal approximator is a parameterized object capable of representing any possible function in some parameter size limit. Common universal approximators in low dimensions include Fourier or Chebyshev expansions, while common universal approximators in high dimensions include neural networks [20,21,22,23,24] and other models used throughout machine learning. Mathematically, in its most general form, the UDE is a forced stochastic delay partial differential equation (PDE) defined with embedded universal approximators:

N[u(t), u(α(t)), W(t), U_θ(u, β(t))] = 0    (1)

where α(t) is a delay function and W(t) is the Wiener process.
In this manuscript we describe how the tools of the SciML software ecosystem give rise to efficient training and analysis of the wide variety of UDEs which show up throughout the scientific literature. An astute reader may recognize a neural ordinary differential equation u′ = NN_θ(u, t) as the specific case of a UDE defined in full by a neural network [25,26,27,28,29,30]. The previously mentioned discrete physics-informed neural networks also arise when one uses the appropriate fixed time step method as the solver for the UDE. In addition, neural network based optimal control [31], i.e. finding a neural network c(u, t) which minimizes a loss of a differential equation solution u′ = f(u, c(u, t), t), and model augmentation [32] can similarly be phrased within this framework. Therefore, this architecture has an expansive reach in its utility throughout SciML and control applications.
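To make the neural ODE special case concrete, the following is a minimal sketch of a neural network right-hand side u′ = NN_θ(u) integrated with a classical RK4 step. This is an illustrative Python/NumPy analogue, not the paper's Julia tooling; the network architecture, weights, and step size are arbitrary stand-ins.

```python
import numpy as np

# A tiny fixed-weight MLP used as the right-hand side of a neural ODE,
# integrated with a classical RK4 step. All sizes/weights are hypothetical.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 2)) * 0.5, np.zeros(8)
W2, b2 = rng.normal(size=(2, 8)) * 0.5, np.zeros(2)

def nn(u):
    # the universal approximator defining u' = NN_theta(u)
    return W2 @ np.tanh(W1 @ u + b1) + b2

def rk4_step(f, u, dt):
    k1 = f(u)
    k2 = f(u + dt / 2 * k1)
    k3 = f(u + dt / 2 * k2)
    k4 = f(u + dt * k3)
    return u + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

u = np.array([1.0, 0.0])
for _ in range(100):          # integrate the neural ODE to t = 1
    u = rk4_step(nn, u, 0.01)
print(u.shape)
```

In the UDE setting, the same pattern applies with only part of the right-hand side replaced by the network, with the known mechanistic terms retained.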
It is thus in this context that we demonstrate the SciML ecosystem as a set of tools for handling the wide array of possible UDEs with solvers and adjoints handling all of the cases of adaptivity, stiffness, stochasticity, delays, and more. Figure 1 gives a general overview of how the tools of the SciML ecosystem connect to give rise to a modular and composable toolkit for scientific machine learning. The examples of this paper are built as the interplay of over 200 dependent libraries constructed by and for the SciML ecosystem, from high-level symbolic computing tools to low-level customized BLAS implementations that improve the performance of internal matrix factorizations over commonly used tools like OpenBLAS or MKL. To simplify the discussion, we will focus on how the composition of the numerical libraries and compiler-based code generation mechanisms gives rise to an efficient computing stack for SciML applications.
In Section 2.1 we showcase how the DifferentialEquations.jl solvers, the DiffEqSensitivity.jl adjoint methods, and the DiffEqFlux.jl helper functions combine to give rise to an efficient and stable training framework for UDEs. In the sections following, we focus on use cases which go beyond the obvious neural ODE and optimal control applications in order to demonstrate the generality of the framework and formalism. In Section 2.3 we highlight how the DataDrivenDiffEq.jl symbolic regression tooling can be used in conjunction with the UDE framework in order to augment models with known physical information and decrease the data requirements for symbolic model discovery. In Section 2.4 we showcase how recent methods in solving high-dimensional parabolic partial differential equations can be phrased as a universal stochastic differential equation, and thus these methodologies can be extended to high order, implicit, and adaptive methods through this formulation on the SciML tools. In Section 2.5 we demonstrate how the discovery of non-local operators for reduced order modeling, known as parameterizations in the climate modeling literature, can similarly be phrased as a UDE training problem. Together this shows how the SciML ecosystem and its UDE formalism substantially advance the ability of scientists and engineers to combine all scientific knowledge with the latest techniques of machine learning and numerical analysis to arrive at a highly efficient and flexible set of automated training techniques.

Efficient and Stable Training of Universal Differential Equations
Training a UDE amounts to minimizing a cost function C(θ) defined on u_θ(t), the current solution to the differential equation with respect to the choice of parameters θ. One choice of cost function is the Euclidean distance

C(θ) = Σ_i ‖u_θ(t_i) − d_i‖

at discrete data points (t_i, d_i). When optimized with local derivative-based methods, such as stochastic gradient descent, ADAM [36], or L-BFGS [37], this requires the calculation of dC/dθ, which by the chain rule amounts to calculating du/dθ. Thus the problem of efficiently training a UDE reduces to calculating gradients of the differential equation solution with respect to parameters.
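One direct way to obtain du/dθ is forward sensitivity analysis: augment the ODE with the sensitivity equations and differentiate the cost by the chain rule. A minimal scalar sketch (illustrative Python/SciPy, not the paper's Julia implementation; the equation u′ = θu and the parameter values are chosen only because they admit an analytic check):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Forward sensitivity for u' = theta*u: augment the state with
# s = du/dtheta, which satisfies s' = u + theta*s. Then for the cost
# C = (u(T) - d)^2, the chain rule gives dC/dtheta = 2*(u(T) - d)*s(T).
theta, u0, T, d = 0.5, 1.0, 1.0, 2.0

def aug(t, y):
    u, s = y
    return [theta * u, u + theta * s]

sol = solve_ivp(aug, (0, T), [u0, 0.0], rtol=1e-10, atol=1e-10)
uT, sT = sol.y[:, -1]
grad = 2 * (uT - d) * sT

# Analytic check: u(T) = u0*exp(theta*T), du/dtheta = T*u0*exp(theta*T)
exact = 2 * (u0 * np.exp(theta * T) - d) * T * u0 * np.exp(theta * T)
print(abs(grad - exact) < 1e-6)
```

As the text notes, this augmentation grows with the number of parameters, which is what motivates the adjoint methods described next.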
Efficient methods for calculating these gradients are commonly called adjoints [38,39,40,41,42]. The asymptotic computational cost of these methods does not grow multiplicatively with the number of state variables and parameters like numerical or forward sensitivity approaches, and thus it has been shown empirically that adjoint methods are more efficient on large parameter models [43,44]. The DiffEqSensitivity.jl module uses the composable, multiple-dispatch-based modular architecture of DifferentialEquations.jl [35,45] in order to extend all 300+ solvers with adjoints for a wide variety of equations, including ordinary differential equations, stochastic differential equations, delay differential equations, differential-algebraic equations, stochastic delay differential equations, and hybrid equations which incorporate jumps and Lévy processes. The full set of adjoint options available in this new software, which includes continuous adjoint methods and pure reverse-mode AD approaches, is described in Supplement S6.
To the authors' knowledge, this is the first differential equation library which includes choices from all of the demonstrated categories of forward and adjoint sensitivity analysis methods. This matters because each of these methods involves substantial trade-offs, and thus each adjoint has different contexts in which it is the most applicable. Discrete adjoint methods are provided through both tape-based and source-to-source reverse-mode automatic differentiation, with computational trade-offs between the two approaches: the former can be faster on scalarized heterogeneous differential equations, while the latter is more optimized for homogeneous vectorized function calls like those demonstrated in neural networks and discretizations of partial differential equations. Supplement S8 describes the literature behind continuous vs. discrete adjoint approaches and showcases the choice as a general trade-off between performance and stability. Additionally, continuous and discrete forward-mode sensitivity analysis approaches are also provided and optimized for equations with smaller numbers of parameters. Given the large number of variables involved in choosing a correct derivative calculation method, Supplement S7 describes a decision tree to guide users towards an appropriate choice. A compiler analysis of the user's differential equation function is provided by default to automatically choose an efficient adjoint.
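The continuous adjoint idea can be illustrated with a minimal scalar sketch: solve forward, then integrate the adjoint equation backward and accumulate the parameter gradient as a quadrature. This is an illustrative Python/SciPy analogue of the general technique, not the library's implementation; the equation and values are chosen to admit an analytic check.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Continuous adjoint for u' = theta*u with terminal cost C = (u(T) - d)^2:
# solve forward, then integrate lambda' = -(df/du)*lambda backward with
# lambda(T) = dC/du(T), accumulating dC/dtheta = integral of lambda*df/dtheta.
theta, u0, T, d = 0.5, 1.0, 1.0, 2.0

fwd = solve_ivp(lambda t, u: theta * u, (0, T), [u0], dense_output=True,
                rtol=1e-10, atol=1e-10)
uT = fwd.y[0, -1]

def back(t, y):
    lam, _ = y
    u = fwd.sol(t)[0]
    return [-theta * lam, lam * u]  # adjoint equation; quadrature term

bwd = solve_ivp(back, (T, 0), [2 * (uT - d), 0.0], rtol=1e-10, atol=1e-10)
grad = -bwd.y[1, -1]  # quadrature accumulated in reverse time flips sign

exact = 2 * (u0 * np.exp(theta * T) - d) * T * u0 * np.exp(theta * T)
print(abs(grad - exact) < 1e-6)
```

Note the adjoint pass has one extra ODE per cost functional rather than one per parameter, which is the asymptotic advantage described above.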
As described in Supplement S6.2, these adjoints use reverse-mode automatic differentiation for vector-transposed Jacobian products within the adjoint definitions to reduce the computational complexity. This has been shown to contribute up to two orders of magnitude in performance increases for large stiff differential equations over purely numerical vector-Jacobian product approaches, which are common in other stiff ODE solver libraries with adjoints like Sundials CVODES and PETSc TS [44]. In addition, the module DiffEqFlux.jl handles compatibility with the Flux.jl neural network library so that common deep architectures, such as convolutional neural networks and recurrent neural networks, automatically use all efficient backpropagation kernels whenever encountered in the derivative calculation of differential equations. Thus together these three tools, DifferentialEquations.jl, DiffEqSensitivity.jl, and DiffEqFlux.jl, combine to give a UDE training framework which covers the vast set of combinations required to efficiently train each model.

Features and Performance of the SciML Ecosystem
We assessed the viability of alternative differential equation libraries for universal differential equation workflows by comparing the features and performance of the given libraries. Table 1 demonstrates that the SciML ecosystem is the only differential equation solver library with deep learning integration that supports stiff ODEs, DAEs, DDEs, stabilized adjoints, and distributed and multithreaded computation. We note the importance of the stabilized adjoints in Section 2.5.1, as many PDE discretizations with upwinding exhibit unconditional instability when reversed, and thus this is a crucial feature when training embedded neural networks in many PDE applications. Table 2 demonstrates that the SciML ecosystem exhibits more than an order of magnitude better performance than torchdiffeq when solving ODEs, on systems of up to one million equations. Because the adjoint calculation is itself a differential equation solve, this performance advantage carries over to training times on scientific models. We note that torchdiffeq's adjoint calculation diverges on all but the first two examples due to the aforementioned stability problems.
To reinforce this measure of performance, Supplement S10 demonstrates a 100x performance difference over torchdiffeq when training the spiral neural ODE from [28,53]. Additionally, we note that the author of the tfdiffeq library for TensorFlow has previously concluded that "speed is almost the same as the PyTorch (torchdiffeq) code base (±2%)" 1 . In addition, Supplement S10 demonstrates a 1,600x performance advantage for the SciML ecosystem over torchsde using the geometric Brownian motion example from the torchsde documentation [46]. Given the computational burden, the mix of stiffness, and the non-reversibility of the examples which follow in this paper, these results demonstrate that the SciML ecosystem is the first deep learning integrated differential equation software ecosystem that can train all of the equations necessary for the results of this paper. Note that this does not imply that our solvers will demonstrate an order of magnitude performance difference or more on all types of equations. For example, large non-stiff neural ODEs used in image classification tend to be dominated by large dense matrix multiplications and are thus in a different performance regime. However, these results demonstrate that on equations generally derived from scientific models (ODEs derived from PDE semi-discretizations, heterogeneous differential equation systems, and sufficiently small systems containing neural networks), an order of magnitude or more performance difference exists over a wide set of cases.
In the following sections we will demonstrate the various applications which are enabled by this combination of efficient libraries.

Table 2: Relative time to solve for ML-augmented differential equation libraries (smaller is better). Non-stiff solver benchmarks from representative scientific systems were taken from [54] as described in Supplement S10. SciML stands for the optimal method choice out of the 300+ from SciML, which for the first is DP5, for the second is VCABM, and for the rest is ROCK4.

Knowledge-Enhanced Symbolic Regression via UDEs
Automatic reconstruction of models from observable data has been extensively studied. Many methods produce non-symbolic representations by learning functional representations [55,17].

Improved Identification of Nonlinear Interactions with Universal Ordinary Differential Equations
As a motivating example, take the Lotka-Volterra system:

x′ = αx − βxy,
y′ = γxy − δy.

Assume that a scientist has a time series of measurements dense in the prey x and sparse in the predator y from this system, but knows the birth rate α of x and assumes a linear decay of the population y. With only this information, a scientist can propose the knowledge-based UODE:

x′ = αx + U_1(x, y),
y′ = −δy + U_2(x, y),

which is a system of ordinary differential equations that incorporates the known structure but leaves room for learning unknown interactions between the predator and prey populations. Learning the unknown interactions corresponds to training the UA U : R² → R² in this UODE. Supplement S11.1 describes the full details of the UODE training hyperparameters as well as additional examples.
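For reference, a small sketch of generating the kind of ground-truth data this example trains against (standard Lotka-Volterra dynamics; the parameter values and initial condition below are hypothetical stand-ins, and the integration uses SciPy rather than the paper's Julia solvers):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Ground-truth Lotka-Volterra system; in the UODE, the interaction terms
# -beta*x*y and gamma*x*y are the parts replaced by the UA U(x, y).
alpha, beta, gamma, delta = 1.3, 0.9, 0.8, 1.8  # hypothetical values

def lotka(t, u):
    x, y = u
    return [alpha * x - beta * x * y, gamma * x * y - delta * y]

sol = solve_ivp(lotka, (0, 3), [0.44, 4.6], t_eval=np.linspace(0, 3, 31))
print(sol.y.shape)  # (2, 31): time-series measurements of (x, y)
```

The training loop would then fit U_1, U_2 so that the UODE solution matches such a trajectory at the measurement times.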
To then learn the missing portions of the equation, we can directly apply symbolic regression to the trained UAs in order to arrive at symbolic forms suggesting the missing equations. In contrast, the popular SINDy method [66, 67, 68] normally approximates derivatives using a spline over the data points and subsequently applies symbolic regression to the full equation. The use of such interpolating spline techniques for derivative estimates requires data dense enough for accurate derivative calculations. This is not required in the UODE framework, as the trained neural network can be sampled continuously and information can be spread throughout different data sources (such as structural inductive bias) as long as a causal and identifiable relation within the system is present. Figure 2 shows the UODE-based symbolic regression, and Figure 3 showcases its ability to fully and accurately recover the correct symbolic equations from the limited data, while the same symbolic regression method applied to the same data in the SINDy approach with derivative smoothing is unable to recover the equations. Supplement S11.1 further demonstrates additional examples using this approach and showcases how the UDE improves performance in the presence of noise. Together this highlights how incorporating prior structural knowledge through the UDE framework can improve the performance of symbolic regression techniques.
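The sparse-regression step itself can be sketched with sequentially thresholded least squares, the core of SINDy-style methods. The sketch below is illustrative only (synthetic noiseless samples, a hypothetical candidate library, and a hand-picked threshold), recovering a known interaction term from derivative samples such as those a trained UA provides:

```python
import numpy as np

# Sequentially thresholded least squares: recover x' = 1.3*x - 0.9*x*y
# from derivative samples over a small candidate-term library.
rng = np.random.default_rng(1)
x = rng.uniform(0.5, 2.0, 200)
y = rng.uniform(0.5, 2.0, 200)
dx = 1.3 * x - 0.9 * x * y                 # sampled derivative of x

library = np.column_stack([x, y, x * y, x**2, y**2])  # candidate terms
names = ["x", "y", "x*y", "x^2", "y^2"]

xi = np.linalg.lstsq(library, dx, rcond=None)[0]
for _ in range(10):                        # threshold small terms, refit
    small = np.abs(xi) < 0.1
    xi[small] = 0.0
    big = ~small
    xi[big] = np.linalg.lstsq(library[:, big], dx, rcond=None)[0]

print({n: round(float(c), 3) for n, c in zip(names, xi) if c != 0.0})
# expected to recover the x and x*y terms with coefficients near 1.3, -0.9
```

With noiseless samples the active terms and coefficients are recovered exactly; the UODE's advantage in the text is that it supplies clean, continuously sampled derivative information even when the raw data is sparse.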

Generalizing Sparse Regression of Non-ODEs via UDEs
The use of UDEs for extending sparse regression training methods also extends the types of problems to which sparse regression can be applied. For example, one might wish to encode prior knowledge of the conservation equation 1 = Σ_i u_i(t) into the evolution of the states of a discovered model. In this case, a universal differential-algebraic equation (DAE) of the form

M u′ = U_θ(u),

where M is singular, can be utilized to encode this prior knowledge. Once the UA is trained, symbolic regression on U_θ(u) subsequently discovers the dynamical equations. Similarly, one may know the general evolution of the system but be unaware of the stochastic features, leading to a universal stochastic differential equation of the form

du = f(u, t) dt + U_θ(u, t) dW_t.

Tutorials in DiffEqFlux.jl showcase how to train the embedded UAs within these objects, which can then subsequently be symbolically regressed upon as in Section 2.3. As a concrete example, we show how UDEs can turn equation discovery of partial differential equations into a low-dimensional sparse regression problem. To demonstrate discovery of spatio-temporal equations directly from data, we consider data generated from the one-dimensional Fisher-KPP (Kolmogorov-Petrovsky-Piskunov) PDE [69]:

∂ρ/∂t = r ρ(1 − ρ) + D ∂²ρ/∂x²,

with periodic boundary conditions. Such reaction-diffusion equations appear in diverse physical, chemical and biological problems [70]. To learn from the generated data, we define the UPDE:

∂ρ/∂t = U_θ(ρ) + D̂ CNN(ρ),

and train the UAs within this equation by performing a method of lines discretization as described in Supplement S11.1.4. Note that the derivative operator is approximated by a convolutional neural network CNN, a learnable arbitrary representation of a stencil, while the coefficient D̂ is treated as an unknown parameter fit simultaneously with the neural network weights. In this representation, the nonlinear term U_θ of the semilinear partial differential equation is simply an R → R operator applied locally to each point in space.
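The method-of-lines setup behind this example can be sketched as follows. This is an illustrative Python/SciPy analogue of the ground-truth simulation (not the trained UPDE): the periodic [1, −2, 1] Laplacian stencil is fixed here, whereas in the UPDE it is the learnable CNN stencil, and the reaction term plays the role of U_θ. Parameter values are hypothetical.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Method-of-lines discretization of Fisher-KPP with periodic boundaries:
# rho_t = D*rho_xx + r*rho*(1 - rho) on N grid points.
D, r, N = 0.01, 1.0, 64
x = np.linspace(0, 1, N, endpoint=False)
dx = x[1] - x[0]

def rhs(t, rho):
    # periodic [1, -2, 1] stencil via np.roll
    lap = (np.roll(rho, 1) - 2 * rho + np.roll(rho, -1)) / dx**2
    return D * lap + r * rho * (1 - rho)

rho0 = 0.5 + 0.4 * np.sin(2 * np.pi * x)
sol = solve_ivp(rhs, (0, 5), rho0, method="BDF")  # implicit: semi-stiff
print(sol.y[:, -1].max())  # the solution relaxes toward the state rho = 1
```

Training the UPDE amounts to replacing the fixed stencil and reaction term with parameterized versions and fitting them against snapshots from a simulation like this one.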
We note that Supplement S11.1.4 showcases how using the Fourier layer UA provided by DiffEqFlux.jl is an order of magnitude more efficient in training time than using a neural network for this low-dimensional case. This incorporation of prior structural information greatly decreases the dimensionality of the sparse regression as opposed to direct sparse regression techniques like PDE-FIND [71], changing the sparse regression from acting on O(n^d) state variables, for n discretization points in d dimensions, to simply O(1). Given that the cost scaling of sparse regression techniques is in many cases O(k²m + m²k) for k states and m data points, this amounts to a substantial decrease in the asymptotic cost of the symbolic regression by incorporating the inductive bias of the UDE. Figure 4 shows the result of training the UPDE against the simulated data, which recovers the canonical [1, −2, 1] stencil of the one-dimensional Laplacian and the diffusion constant, while symbolic regression of the U_θ term recovers the quadratic growth term.

Computationally-Efficient Solving of High-Dimensional Partial Differential Equations
It is impractical to solve high dimensional PDEs with mesh-based techniques since the number of mesh points scales exponentially with the number of dimensions. Given this difficulty, mesh-free methods based on universal approximators such as neural networks have been constructed to allow for direct solving of high dimensional PDEs [72,73]. Recently, methods based on transforming partial differential equations into alternative forms, such as backward stochastic differential equations (BSDEs), which are then approximated by neural networks, have been shown to be highly efficient on important equations such as the nonlinear Black-Scholes and Hamilton-Jacobi-Bellman (HJB) equations [74, 75, 76, 77].
Here we will showcase how one of these methods, a deep BSDE method for semilinear parabolic equations [75], can be reinterpreted as a universal stochastic differential equation (USDE) to generalize the method and allow for enhancements like adaptivity, higher order integration for increased efficiency, and handling of stiff driving equations through the SciML software. Consider the class of semilinear parabolic PDEs over a finite time span,

∂u/∂t (t, x) + (1/2) tr[σσᵀ(t, x) (Hess_x u)(t, x)] + ∇u(t, x) · μ(t, x) + f(t, x, u(t, x), σᵀ(t, x)∇u(t, x)) = 0,

with a terminal condition u(T, x) = g(x). Supplement S12 describes how this PDE can be solved by approximating the FBSDE:

dX_t = μ(t, X_t) dt + σ(t, X_t) dW_t,
du_t = f(t, X_t, u_t, U¹_θ₁(t, X_t)) dt + [U¹_θ₁(t, X_t)]ᵀ dW_t,

where U¹_θ₁ and U²_θ₂ (with u_0 = U²_θ₂(X_0)) are UAs, and the loss function is given by requiring that the terminating condition g(X_T) = u(X_T, W_T) is satisfied.

Adaptive Solution of High-Dimensional Hamilton-Jacobi-Bellman Equations
A fixed time step Euler-Maruyama discretization of this USDE gives rise to the deep BSDE method [75]. However, this form as a USDE generalizes the approach in a way that makes all of the methodologies of our USDE training library readily available, such as higher order methods, adaptivity, and implicit methods for stiff SDEs. As a motivating example, consider the classical linear-quadratic Gaussian (LQG) control problem in 100 dimensions,

dX_t = 2√λ c_t dt + √2 dW_t,

with t ∈ [0, T] and X_0 = x, where X_t is the state we wish to control, λ is the strength of the control, and c_t is the control process. Minimizing the control corresponds to solving the 100-dimensional HJB equation [78,79]:

∂u/∂t + Δu − λ‖∇u‖² = 0.

We solve the PDE by training the USDE using an adaptive Euler-Maruyama method [80] as described in Supplement S12. Supplementary Figure 15 showcases that this methodology accurately solves the equations, effectively extending recent algorithmic advancements to adaptive forms simply by reinterpreting the equation as a USDE. While classical methods would require an amount of memory that is exponential in the number of dimensions, making classical adaptive approaches infeasible, this approach is, to the authors' knowledge, the first to generalize the high order, adaptive, highly stable software tooling to the high-dimensional PDE setting.
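The Euler-Maruyama discretization underlying these methods can be illustrated in isolation. The sketch below is a language-agnostic NumPy analogue (not the paper's Julia/SciML implementation), applied to geometric Brownian motion because its mean E[X_T] = X_0·exp(μT) gives a simple check; all parameter values are hypothetical.

```python
import numpy as np

# Fixed-step Euler-Maruyama for dX = mu*X dt + sigma*X dW over many paths.
rng = np.random.default_rng(2)
mu, sigma, X0, T = 0.05, 0.2, 1.0, 1.0
n_steps, n_paths = 200, 20000
dt = T / n_steps

X = np.full(n_paths, X0)
for _ in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt), n_paths)
    X = X + mu * X * dt + sigma * X * dW   # one Euler-Maruyama step

# Weak check against the analytic mean E[X_T] = X0*exp(mu*T)
print(abs(X.mean() - X0 * np.exp(mu * T)) < 0.01)
```

An adaptive variant replaces the fixed dt with error-controlled step sizes, which is the generalization the USDE reformulation makes available.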

Accelerated Scientific Simulation with Automatically Constructed Closure Relations

Automated Discovery of Large-Eddy Model Parameterizations
As an example of directly accelerating existing scientific workflows, we focus on the Boussinesq equations [81]. The Boussinesq equations are a system of 3+1-dimensional partial differential equations acquired through simplifying assumptions on the incompressible Navier-Stokes equations, represented by the system:

∇ · u = 0,
∂u/∂t + (u · ∇)u = −∇p + ν∇²u + bẑ,
∂T/∂t + u · ∇T = κ∇²T,

where u = (u, v, w) is the fluid velocity, p is the kinematic pressure, ν is the kinematic viscosity, κ is the thermal diffusivity, T is the temperature, and b is the fluid buoyancy. We assume that density and temperature are related by a linear equation of state so that the buoyancy is only a function b = αgT, where α is the thermal expansion coefficient and g is the acceleration due to gravity. This system is commonly used in climate modeling, especially as the voxels for modeling the ocean [82, 83, 84, 81], in a multi-scale model that approximates these equations by averaging out the horizontal dynamics, T̄(z, t) = ∫∫ T(x, y, z, t) dx dy, in individual boxes. The resulting approximation is a local advection-diffusion equation describing the evolution of the horizontally-averaged temperature T̄:

∂T̄/∂t + ∂(wT)/∂z = κ ∂²T̄/∂z².

This one-dimensional approximating system is not closed, since wT (the horizontal average of the temperature flux in the vertical direction) is unknown. Common practice closes the system by manually determining an approximating wT from ad-hoc models, physical reasoning, and scaling laws. However, we can utilize a UDE-automated approach to learn such an approximation from data. Let

wT = U_θ(T̄, P),

where P are the physical parameters of the Boussinesq equation at different regimes of the ocean, such as the amount of surface heating or the strength of the surface winds [85]. Using data from average temperatures T̄ and known physical parameters P, the non-locality of the convection term may be captured by training a universal diffusion-advection partial differential equation.
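The horizontal-averaging operator that produces the training data can be sketched directly. This is an illustrative NumPy sketch on a synthetic stratified field (grid sizes and the field itself are hypothetical), reducing a 3D temperature field T(x, y, z) to the 1D profile T̄(z) that the learned closure evolves:

```python
import numpy as np

# Horizontally average a synthetic 3D temperature field to get Tbar(z).
nx, ny, nz = 16, 16, 32
z = np.linspace(0, 1, nz)
rng = np.random.default_rng(3)

# stratified background profile plus horizontal fluctuations
T3d = z[None, None, :] + 0.1 * rng.normal(size=(nx, ny, nz))

Tbar = T3d.mean(axis=(0, 1))   # average over x and y at each depth z
print(Tbar.shape)              # (nz,): the 1D profile the closure evolves
```

The same reduction applied to the product of vertical velocity and temperature fields yields the unclosed flux term that U_θ(T̄, P) is trained to approximate.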
Supplementary Figure 16 demonstrates the accuracy of the approach using a deep UPDE with high order stabilized-explicit Runge-Kutta (ROCK) methods, where the fitting is described in Supplement S13. To contrast with the trained UPDE, we directly simulated the 3D Boussinesq equations under similar physical conditions and demonstrated that the neural parameterization results in around a 15,000x acceleration. This demonstrates that physics-dependent parameterizations for acceleration can be directly learned from data, utilizing prior knowledge of the averaging approximation mixed with a data-driven discovery approach.

Data-Driven Nonlinear Closure Relations for Model Reduction in Non-Newtonian Viscoelastic Fluids
All continuum materials satisfy conservation equations for mass and momentum. The difference between an elastic solid and a viscous fluid comes down to the constitutive law relating stresses and strains. In a one-dimensional system, an elastic solid satisfies σ = Gγ, with stress σ, strain γ, and elastic modulus G, whereas a viscous fluid satisfies σ = ηγ̇, with viscosity η and strain rate γ̇. Non-Newtonian fluids have more complex constitutive laws, for instance when the stress depends on the history of deformation, which can alternatively be expressed in an instantaneous form where the history is stored in auxiliary variables φ_i [86]. To become computationally feasible, the expansion is truncated, often in an ad-hoc manner, e.g. φ_n = φ_{n+1} = ⋯ = 0 for some n. Only for a simple choice of G(t) does an exact closure condition exist, e.g. the Oldroyd-B model. For a fully nonlinear approximation, we train a UODE according to the details in Supplement S14 to learn a closure relation from the numerical solution of the FENE-P equations, a fully nonlinear constitutive law requiring a truncation condition [87]. Figure 5 compares the neural network approach to a linear, Oldroyd-B-like model for σ and showcases that the nonlinear approximation improves the accuracy by more than 50x. We note that the neural network approximation accelerates the solution by 2x over the original 6-state DAE, demonstrating that the universal differential equation approach to model acceleration is not just applicable to large-scale dynamical systems like PDEs but can also be effectively employed to accelerate small-scale systems.
Discussion

While many attribute the success of deep learning to its blackbox nature, the key advances in deep learning applications have come from incorporating inductive bias into architectures. Deep convolutional neural networks for image processing directly utilize the local spatial structure of images by modeling convolution stencil operations. Similarly, recurrent neural networks encode a forward time progression into a deep learning model and have excelled in natural language processing and time series prediction. Here we present software designed for generating such inductive-biased architectures for a wide variety of scientific domains. Our results show that by building these hybrid mechanistic models with machine learning, we can arrive at similar efficiency advancements by utilizing all known prior knowledge of the underlying problem's structure. While we demonstrate the utility of UDEs in equation discovery, we have also demonstrated that these methods are capable of solving many other problems, such as high-dimensional partial differential equations. Many methods of recent interest, such as discrete physics-informed neural networks, can additionally be written as discretizations of UDEs and thus efficiently implemented using the SciML tools. Our software implementation is the first deep learning integrated differential equation library to include the full spectrum of adjoint sensitivity analysis methods required to both efficiently and accurately handle the range of training problems that can arise from universal differential equations. We have demonstrated orders of magnitude performance advantages over previous machine learning enhanced adjoint sensitivity ODE software on a variety of scientific models, and demonstrated generalizations to stiff equations, DAEs, SDEs, and more.
While the results of this paper span many scientific disciplines and incorporate many different modeling approaches, all of the examples shown in this manuscript can be implemented using the SciML software ecosystem in just hundreds of lines of code each, with none of the examples taking more than half an hour to train on a standard laptop. This demonstrates both the efficiency of the software and its methodologies and the potential to scale to much larger applications.

Code and Data Availability
The code for reproducing the computational experiments can be found at https://github.com/ChrisRackauckas/universal_differential_equations. The following are the core libraries of the SciML ecosystem which together implement the functionality described in the paper:

Acknowledgements
We thank Jesse Bettencourt, Mike Innes, and Lyndon White for being instrumental in the early development of the DiffEqFlux.jl library, Tim Besard and Valentin Churavy for the help with the GPU tooling, and David Widmann and Kanav Gupta for their fundamental work across DifferentialEquations.jl. Special thanks to Viral Shah and Steven Johnson who have been helpful in the refinement of these ideas. We thank Charlie Strauss, Steven Johnson, Nathan Urban, and Adam Gerlach for enlightening discussions and remarks on our manuscript and software. We thank Stuart Rogers for his careful read and corrections. We thank David Duvenaud for extended discussions on this work. We thank the author of the torchsde library, Xuechen Li, for optimizing the SDE benchmark code.
The DiffEqFlux.jl pullback construction is not based on just one method but instead has a dispatch-based mechanism for choosing between different adjoint implementations. At a high level, the library defines the pullback on the differential equation solve function, and thus using a differential equation inside of a larger program leads to this chunk being a single differentiable primitive that is inserted into the backwards pass of Flux.jl when encountered, by overloading the Zygote.jl [52] and ChainRules.jl rule sets. For any ChainRules.jl-compliant reverse-mode AD package in the Julia language, when a differential equation solve is encountered in any Julia library during the backwards pass, the adjoint method is automatically swapped in to be used for the backpropagation of the solver. The choice of adjoint is determined by the type of the sensealg keyword argument; the options are fully described below.

Pullbacks and Adjoints
Given a function f(x) = y, the pullback at x is the function

B_x^f(v) = v^T (∂f/∂x),

where ∂f/∂x is the Jacobian J. We note that B_x^f(1) = (∇f)^T for a function f producing a scalar output, meaning the pullback of a cost function computes the gradient. A general computer program can be written as the composition of discrete steps

f = f^L ∘ f^{L−1} ∘ · · · ∘ f^1,

and thus the vector-Jacobian product can be decomposed as

v^T J = (· · · ((v^T J_L) J_{L−1}) · · ·) J_1,

which allows for recursively decomposing the pullback into a primitively known set of pullbacks:

B_x^f(v) = B_{x_1}^{f^1}( B_{x_2}^{f^2}( · · · B_{x_L}^{f^L}(v) · · · ) ),

where x_1 = x and x_{i+1} = f^i(x_i).
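As an illustrative sketch of this recursive decomposition (not library code; the stage functions below are hypothetical, chosen only to mirror the formulas above), each stage returns its output together with its pullback, and the reverse pass composes the pullbacks in reverse order:

```python
import math

# Each stage returns (output, pullback); the pullback maps a cotangent v
# at the output back to a cotangent at the input: v -> v^T (df/dx).
def stage_square(x):                 # f^1(x) = x^2
    return x * x, lambda v: v * 2.0 * x

def stage_sin(x):                    # f^2(x) = sin(x)
    return math.sin(x), lambda v: v * math.cos(x)

def stage_scale(x):                  # f^3(x) = 3x (scalar cost)
    return 3.0 * x, lambda v: v * 3.0

def compose_with_pullback(stages, x):
    pullbacks = []
    for f in stages:                 # forward pass, recording pullbacks
        x, B = f(x)
        pullbacks.append(B)
    def pullback(v):                 # reverse pass: apply B^{f^L} first
        for B in reversed(pullbacks):
            v = B(v)
        return v
    return x, pullback

x0 = 0.7
y, B = compose_with_pullback([stage_square, stage_sin, stage_scale], x0)
grad = B(1.0)   # B_x^f(1) = gradient of the scalar output d/dx 3*sin(x^2)
```

Seeding the pullback with 1.0 at the scalar output recovers the gradient, exactly as B_x^f(1) = (∇f)^T above.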

Backpropagation-Accelerated DAE Adjoints for Index-1 DAEs with Constraint Equations
Before describing the modes, we first describe the adjoint of the differential equation with constraints. The following derivation is based on [92] but modified to specialize on index-1 DAEs with linear mass matrices. The constrained ordinary differential equation

u' = f(u, p, t),
0 = h(u, p, t),

can be rewritten in mass matrix form

M u' = f̂(u, p, t),

where for the semi-explicit case M = diag(I, 0) and f̂ stacks the differential equations f and the algebraic equations h. We wish to solve for some cost function G(u, p) evaluated throughout the differential equation, i.e.

G(u, p) = ∫_{t₀}^{T} g(u(t, p), p, t) dt.

To derive this adjoint, introduce the Lagrange multiplier λ to form

I(p) = G(p) + ∫_{t₀}^{T} λ* (M u' − f̂(u, p, t)) dt.

Since M u' = f̂(u, p, t), we have I(p) = G(p) and thus

dG/dp = ∫_{t₀}^{T} (g_p + g_u s) dt + ∫_{t₀}^{T} λ* (M s' − f̂_u s − f̂_p) dt,

for s_i being the sensitivity du_i/dp of the ith variable. After applying integration by parts to λ* M s', we require that

M* λ' = g_u* − f̂_u* λ,  λ(T) = 0,

to zero out a term and get

dG/dp = ∫_{t₀}^{T} (g_p − λ* f̂_p) dt − λ*(t₀) M s(t₀).

If G is discrete, then it can be represented via the Dirac delta

g(u, p, t) = Σ_i ||d_i − u(t, p)||² δ(t_i − t),

in which case g_u = 2(u(t_i, p) − d_i)* at the data points (t_i, d_i). Therefore, the derivative of an ODE solution with respect to a cost function is given by solving for λ* using an ODE for λ^T in reverse time, and then using that to calculate dG/dp. At each time point where discrete data is found, λ is then changed using a callback (discrete event handling) by the amount g_u to represent the Dirac delta portion of the integral. Lastly, we note that dG/du₀ = −λ(0) in this formulation. We have to take care of consistent initialization in the case of semi-explicit index-1 DAEs. Writing u = (u_d, u_a) and λ = (λ_d, λ_a), where d and a denote differential and algebraic variables, and f and h denote differential and algebraic equations respectively, we need to satisfy the system of equations

λ_d' = g_{u_d}* − f_{u_d}* λ_d − h_{u_d}* λ_a,
0 = g_{u_a}* − f_{u_a}* λ_d − h_{u_a}* λ_a.

Combining the above two equations, we know that we need to increment the differential part of λ by

g_{u_d}* − h_{u_d}* (h_{u_a}*)^{−1} g_{u_a}*

at each callback. Additionally, the ODEs

μ' = f̂_p* λ − g_p*,  μ(T) = 0,

can be appended to the system of equations to perform the quadrature for dG/dp. We note that this formulation allows for a single linear solve of the algebraic equation above to generate a guaranteed consistent initialization of λ_a.
This is a major improvement over the previous technique when applied to index 1 DAEs in mass matrix form since the alternative requires a full consistent initialization of the DAE, a problem which is known to have many common failure modes [93].
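To make the continuous adjoint concrete in the simplest setting (an ODE, so M = I, with a single data point at t = T), the following pure-Python sketch reproduces dG/dp and dG/du₀ for the hypothetical test problem u' = pu against the analytic answers. This is an illustration of the derivation above, not the library's implementation:

```python
import math

def rk4_step(f, y, t, dt):
    k1 = f(t, y)
    k2 = f(t + dt / 2, y + dt / 2 * k1)
    k3 = f(t + dt / 2, y + dt / 2 * k2)
    k4 = f(t + dt, y + dt * k3)
    return y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

p, u0, T, d = 0.5, 1.2, 1.0, 2.0   # hypothetical problem u' = p*u, one datum d at t = T
N = 2000
dt = T / N

# forward pass, storing the trajectory for the reverse pass
us = [u0]
for i in range(N):
    us.append(rk4_step(lambda t, u: p * u, us[-1], i * dt, dt))
uT = us[-1]

# cost G = (1/2)*(u(T) - d)^2, so g_u = u(T) - d at the single data point;
# reverse pass: lambda(T) = 0, and the forward-time jump of g_u at t = T means
# the backwards integration starts from -(uT - d); away from data, lambda' = -p*lambda
lam = -(uT - d)
lams = [lam]
for i in range(N, 0, -1):
    lam = rk4_step(lambda t, l: -p * l, lam, i * dt, -dt)
    lams.append(lam)
lams.reverse()                      # lams[i] approximates lambda(i*dt)

# quadrature dG/dp = -int lambda * (df/dp) dt, with df/dp = u
vals = [-lams[i] * us[i] for i in range(N + 1)]
dGdp = dt * (sum(vals) - 0.5 * (vals[0] + vals[-1]))   # trapezoidal rule
dGdu0 = -lams[0]                    # dG/du0 = -lambda(0)

# analytic check: u(T) = u0*e^{pT}, dG/dp = (uT - d)*uT*T, dG/du0 = (uT - d)*e^{pT}
```

This is the QuadratureAdjoint-style workflow in miniature: interpolate (here, store) the forward solution, solve λ in reverse with a callback at the data point, then quadrature for the parameter gradient.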

Current Adjoint Calculation Methods
From this setup we have the following 8 possible modes for calculating the adjoint, with their pros and cons.
1. QuadratureAdjoint: a quadrature-based approach. This utilizes interpolation of the forward solution provided by DifferentialEquations.jl to calculate u(t) at arbitrary time points for doing the calculations with respect to df du in the reverse ODE of λ. From this a continuous interpolatable λ(t) is generated, and the integral formula for dG dp is calculated using the QuadGK.jl implementation of Gauss-Kronrod quadrature. While this approach is memory heavy due to requiring the interpolation of the forward and reverse passes, it can be the fastest version for cases where the number of ODE/DAE states is small and the number of parameters is large since the QuadGK quadrature can converge faster than ODE/DAE-based versions of quadrature. This method requires an ODE or a DAE.

2. InterpolatingAdjoint: a checkpointed interpolation approach. This approach solves the λ(t) ODE in reverse using an interpolation of u(t), but appends the equations for μ(t) and thus does not require saving the timeseries trajectory of λ(t). For checkpointing, a scheme similar to that found in SUNDIALS [41] is used. Points (u_k, t_k) from the forward solution are chosen as the interval points. Whenever the backwards pass enters a new interval, the ODE is re-solved on t ∈ [t_{k−1}, t_k] with a continuous interpolation provided by DifferentialEquations.jl. For the reverse pass, the tstops argument is set for each t_k, ensuring that no backwards integration step lies in two checkpointing intervals. This requires at most a total of two forward solutions of the ODE and the memory required to hold the interpolation of the solution between two consecutive checkpoints. Note that placing the checkpoints at the start and end of the integration interval makes this equivalent to a non-checkpointed interpolating approach which replaces the quadrature with an ODE/SDE/DAE solve for memory efficiency. This method tends to be both stable and to require a minimal amount of memory, and is thus the default. This method requires an ODE, SDE, or a DAE.
3. BacksolveAdjoint: a checkpointed backwards solution approach. Following [28], after a forward solution, this approach solves the u(t) equation in reverse along with the λ(t) and μ(t) ODEs. Since no interpolations are required, it requires O(1) memory. Unfortunately, many theoretical results show that backwards solution of ODEs is not guaranteed to be stable, and testing this adjoint on universal partial differential equations like the diffusion-advection example of this paper shows that it can be divergent; it is thus not universally applicable, especially in cases of stiffness. For stability we therefore modify this approach by allowing checkpoints (u_k, t_k) at which the reverse integration is reset, i.e. u(t_k) = u_k, and the backsolve is then continued. The tstops argument is set in the integrator to require that each checkpoint is hit exactly for this resetting to occur. By doing so, the resulting method utilizes O(1) memory plus the number of checkpoints required for stability, making it the least memory-intensive approach. However, the potential divergence does lead to small errors in the gradient, and thus for highly stiff equations we have found that this is only applicable to a certain degree of tolerance, such as 10⁻⁶, given reasonable numbers of checkpoints. When applicable, this can be the most efficient method for problems with large memory requirements. This method requires an ODE, SDE, or a DAE.
4. ForwardSensitivity: a forward sensitivity approach. From u' = f(u, p, t), the chain rule gives

d/dt (du/dp) = (∂f/∂u)(du/dp) + ∂f/∂p,

which can be appended to the original equations to give du/dp as a time series, which can then be used to compute dG/dp. While the computational cost of the adjoint methods scales like O(N + P) for N differential equations and P parameters, this approach scales like O(NP) and is thus only applicable to models with small numbers of parameters (thus excluding neural networks). However, when the universal approximator has a small number of parameters, this can be the most efficient approach. This method requires an ODE or a DAE.

5. ForwardDiffSensitivity: a forward-mode automatic differentiation approach, using ForwardDiff.jl [51] to calculate the forward sensitivity equations, i.e. an AD-generated implementation of forward-mode "discretize-then-optimize". Because it utilizes a forward-mode approach, the scaling matches that of the forward sensitivity approach and it tends to have similar performance characteristics. This method applies to any Julia-based differential equation solver.
6. TrackerAdjoint: a Tracker-driven tape-based reverse-mode discrete adjoint sensitivity, i.e. an AD-generated implementation of reverse-mode "discretize-then-optimize". This is done by using the TrackedArray constructs of Tracker.jl [88] to build a Wengert list (or tape) of the forward execution of the ODE solver, which is then reversed. This method applies to any Julia-based differential equation solver.
7. ZygoteAdjoint: a Zygote-driven source-to-source reverse-mode discrete adjoint sensitivity, i.e. an AD-generated implementation of reverse-mode "discretize-then-optimize". This utilizes the Zygote.jl [52] system directly on the differential equation solvers to generate a source code for the reverse pass of the solver itself. Currently this is only directly applicable to a few differential equation solvers, but is under heavy development.
8. ReverseDiffAdjoint: a ReverseDiff.jl tape-based reverse-mode discrete adjoint sensitivity, i.e. an AD-generated implementation of reverse-mode "discretize-then-optimize". In contrast to TrackerAdjoint, this methodology can be substantially faster due to its ability to precompile the tape, but it only supports calculations on the CPU.
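The ForwardSensitivity construction above can be illustrated on the same hypothetical toy problem u' = pu: appending s = du/dp with s' = f_u s + f_p yields the parameter gradient of a terminal cost directly. This is an illustrative sketch, not library code:

```python
import math

def rk4_step(f, y, t, dt):
    # RK4 on a list-valued state
    k1 = f(t, y)
    k2 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + dt, [yi + dt * ki for yi, ki in zip(y, k3)])
    return [yi + dt / 6 * (a + 2 * b + 2 * c + e)
            for yi, a, b, c, e in zip(y, k1, k2, k3, k4)]

p, u0, T, d = 0.5, 1.2, 1.0, 2.0   # hypothetical problem u' = p*u

# augmented system: u' = p*u, s' = f_u*s + f_p = p*s + u, with s(0) = 0
def aug(t, y):
    u, s = y
    return [p * u, p * s + u]

N = 2000
y = [u0, 0.0]
for i in range(N):
    y = rk4_step(aug, y, i * T / N, T / N)
uT, sT = y

# chain rule for G = (1/2)*(u(T) - d)^2: dG/dp = (u(T) - d) * s(T)
dGdp = (uT - d) * sT
# analytic check: s(T) = u0*T*e^{pT}, so dG/dp = (uT - d)*uT*T
```

Note the O(NP) scaling mentioned above: each additional parameter appends another copy of the sensitivity equations to the forward solve.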
For each of the non-AD approaches, there are the following choices for how the Jacobian-vector products Jv (jvp) of the forward sensitivity equations and the vector-Jacobian products v^T J (vjp) of the adjoint sensitivity equations are computed:

1. Automatic differentiation for the jvp and vjp. In this approach, automatic differentiation is utilized for directly calculating the jvps and vjps. ForwardDiff.jl with a single dual dimension is applied at f(u + λε) to calculate (df/du)λ, where ε is the dual-dimensional signifier. For the vector-Jacobian products, a forward pass at f(u) is utilized and the backwards pass is seeded at λ to compute λ^T (df/du) (and similarly for df/dp). Note that if f is a neural network, this implies that this product is computed by starting the backpropagation of the neural network with λ, and the vjp is the resulting return. Four methods are allowed to be chosen for performing the internal vjp calculations:

(a) Zygote.jl source-to-source transformation based vjps. Note that only non-mutating differential equation function definitions are supported in this mode. This mode is the most efficient in the presence of neural networks.
(b) Enzyme.jl source-to-source transformation based vjps. This is the fastest vjp choice in the presence of heavy scalar operations like in chemical reaction networks, but is currently not compatible with garbage collection and thus requires non-allocating f functions.
(c) ReverseDiff.jl tape-based vjps. This allows for JIT-compilation of the tape for accelerated computation. This is a fast vjp choice in the presence of heavy scalar operations like in chemical reaction networks, and is more general in application than Enzyme, but it is not compatible with GPU acceleration.
(d) Tracker.jl with arrays of tracked real values, which is utilized for mutating functions.
The internal calculation of the vjp on a general UDE recurses down to primitives and embeds optimized backpropagations of the internal neural networks (and other universal approximators) for the calculation of this product when this option is used.
2. Numerical differentiation for the jvp and vjp. In this approach, finite differences are utilized for directly calculating the jvps and vjps. For a small but finite ε, (f(u + ελ) − f(u))/ε is used to approximate (df/du)λ. For vjps, a finite difference gradient of λ^T f(u) is used.
3. Automatic differentiation for Jacobian construction. In this approach, (sparse) forward-mode automatic differentiation is utilized by a combination of ForwardDiff.jl [51] with SparseDiffTools.jl for color-vector based sparse Jacobian construction. After forming the Jacobian, the jvp or vjp is calculated.
4. Numerical differentiation for Jacobian construction. In this approach, (sparse) numerical differentiation is utilized by a combination of DiffEqDiffTools.jl with SparseDiffTools.jl for color-vector based sparse Jacobian construction. After forming the Jacobian, the jvp or vjp is calculated.
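The numerical jvp/vjp choices above amount to only a few lines. The sketch below (illustrative only, with a hypothetical test function f) checks the finite-difference directional derivative and gradient against the analytic Jacobian:

```python
import math

# hypothetical test function f: R^2 -> R^2
def f(u):
    return [u[0] * u[1], math.sin(u[0])]

def jvp_fd(f, u, v, eps=1e-7):
    # (f(u + eps*v) - f(u)) / eps approximates (df/du) v
    fu = f(u)
    fp = f([ui + eps * vi for ui, vi in zip(u, v)])
    return [(a - b) / eps for a, b in zip(fp, fu)]

def vjp_fd(f, u, lam, eps=1e-7):
    # finite-difference gradient of the scalar g(u) = lam^T f(u),
    # which equals lam^T (df/du)
    def g(x):
        return sum(l * fi for l, fi in zip(lam, f(x)))
    out = []
    for i in range(len(u)):
        up = list(u)
        up[i] += eps
        out.append((g(up) - g(u)) / eps)
    return out

u = [0.8, -1.3]
v = [0.5, 2.0]
lam = [1.0, -2.0]
# analytic Jacobian at u: [[u1, u0], [cos(u0), 0]]
J = [[u[1], u[0]], [math.cos(u[0]), 0.0]]
jv = jvp_fd(f, u, v)
vJ = vjp_fd(f, u, lam)
```

Note the cost asymmetry visible here: one extra f evaluation per jvp, but one per input dimension for the finite-difference vjp, which is why AD-based vjps are preferred for adjoints.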
In total this gives 48 different adjoint method approaches, each with different performance characteristics and limitations. A full performance analysis which measures the optimal adjoint approach for various UDEs has been omitted from this paper, since the combinatorial nature of the options requires a considerable amount of space to showcase the performance advantages and generality disadvantages between each of the approaches. A follow-up study focusing on accurate performance measurements of the adjoint choice combinations on families of UDEs is planned.

Sensitivity Algorithm Decision Tree
If no sensealg is provided by the user, DiffEqSensitivity.jl will automatically choose an adjoint technique from the list. By default, a stable adjoint with an auto-adapting vjp choice is used. The following decision tree is similar to that within the defaults (but is continuously updated due to various performance changes).
• If there are 50 parameters+states or less, consider using forward-mode sensitivities. If the f function is not ForwardDiff-compatible, use ForwardSensitivity; otherwise use ForwardDiffSensitivity, as it is more efficient.
• For larger equations, give BacksolveAdjoint and InterpolatingAdjoint a try. If the gradient of BacksolveAdjoint is correct, it is often the faster choice, so choose that (but it is not always faster!). If your equation is stiff or a DAE, skip this step, as BacksolveAdjoint is almost certainly unstable.
• If your equation does not use much memory and you're using a stiff solver, consider using QuadratureAdjoint as it is asymptotically more computationally efficient by trading off memory cost.
• If the other methods are all unstable (check the gradients against each other!), then ReverseDiffAdjoint is a good fallback on the CPU, while TrackerAdjoint is a good fallback on GPUs.
• After choosing a general sensealg, if the choice is InterpolatingAdjoint, QuadratureAdjoint, or BacksolveAdjoint, then optimize the choice of vjp calculation next:
- If your function has no branching (no if statements) and is heavily scalarized, use ReverseDiffVJP(true).
- If your calculations are on the CPU and your function is very scalarized in operations but has branches, choose ReverseDiffVJP().
- If your calculations are on the CPU or GPU and your function is very vectorized, choose ZygoteVJP().
- Else fall back to TrackerVJP() if Zygote does not support the function.
- If none of the reverse-mode AD based vjps work on your function, fall back to autojacvec=true (for forward-mode AD via ForwardDiff) or false for numerical Jacobians.

Continuous vs Discrete Adjoints
Previous research has shown that the discrete adjoint approach is more stable than continuous adjoints in some cases [99,100,101]. The choice between discrete and continuous adjoint approaches has been demonstrated on some equations to be a trade-off between stability and computational efficiency [102,103,104,105,106,107,108,109,110]. Care has to be taken, as the stability of an adjoint approach can be dependent on the chosen discretization method [111,112,113,114,115], and our software contribution helps researchers switch between all of these optimization approaches, in combination with hundreds of differential equation solver methods, with a single line of code change.

Integration with Existing Code
The open-source differential equation solvers of DifferentialEquations.jl [35] were developed in a manner such that all steps of the programs have a well-defined pullback when using a Julia-based backwards pass generation system. Our software allows for automatic differentiation to be utilized over differential equation solves without any modification to the user code. This enables the simulation software already written with DifferentialEquations.jl, including large software infrastructures such as the MIT-CalTech CLiMA climate modeling system [116] and the QuantumOptics.jl simulation framework [117], to be compatible with all of the techniques mentioned in the rest of the paper. Thus while we detail our results in isolation from these larger simulation frameworks, the UDE methodology can be readily used in full-scale simulation packages which are already built on top of the Julia SciML ecosystem.

ODE Solve Benchmarks
The three-state ODE benchmark utilized the Lorenz equations (LRNZ), a weather prediction model from [54] and the standard ODE IVP Testset [118,119]:

x' = σ(y − x),
y' = x(ρ − z) − y,
z' = xy − βz.

The 28-state ODE benchmark utilized the Pleiades equations (PLEI), a celestial mechanics simulation from [54] and the standard ODE IVP Testset [118,119]:

x_i'' = Σ_{j≠i} m_j (x_j − x_i)/r_ij,
y_i'' = Σ_{j≠i} m_j (y_j − y_i)/r_ij,

where r_ij = ((x_i − x_j)² + (y_i − y_j)²)^{3/2} and m_j = j for i, j = 1, ..., 7. The rest of the benchmarks were derived from a discretization of a two-dimensional reaction-diffusion equation, representing systems biology, combustion mechanics, spatial ecology, spatial epidemiology, and more generally physical PDE equations, where α_A(x) = 1 if x > 80 and 0 otherwise on the domain x ∈ [0, 100], y ∈ [0, 100], and t ∈ [0, 10] with zero-flux boundary conditions. For the purpose of parameter gradient tests, we calculated the derivative of the solution with respect to the entries of α_A(x). The diffusion constant D was chosen as 100 and all other parameters were left at 1.0. In this parameter regime the ODE was non-stiff, as indicated by solves with implicit methods not yielding performance advantages. The diffusion operator was discretized using the second-order finite difference stencil on an N × N grid, where N was chosen to be 16, 32, 64, 128, 256, and 512. To ensure fairness, the torchdiffeq functions were compiled using torchscript, which we verified improved performance. The code for reproducing the benchmark can be found at: https://gist.github.com/ChrisRackauckas/cc6ac746e2dfd285c28e0584a2bfd320 We omit the gradient performance benchmarks for this case from the main manuscript since the backsolve adjoint method of torchdiffeq is unstable on all of the partial differential equation examples. We note that when using BacksolveAdjoint with the SciML tools we similarly see a divergence, while other adjoints such as InterpolatingAdjoint do not diverge, reinforcing the point that this is due to lack of stability in the algorithm.
In the cases where torchdiffeq does not diverge, we see torchdiffeq being 12,000x and 1,200x slower on Lorenz and Pleiades respectively. However, we note that this is because, at the size of those equations, forward sensitivity analysis is more efficient and is not available in torchdiffeq; thus this overestimates the general performance difference in the derivative calculations.

Neural ODE Training Benchmark
The spiral neural ODE from [28] was used as the benchmark for the training of neural ODEs. The data was generated from the form

du/dt = A u³,  A = [−0.1 2.0; −2.0 −0.1],

where u³ denotes the elementwise cube.

SDE Solve Benchmark
The torchsde benchmarks were created using the geometric Brownian motion example from the torchsde README. The SDE was 4 independent geometric Brownian motions

dX_t = μ X_t dt + σ X_t dW_t,

where μ = 0.5 and σ = 1.0. Both software packages solved the SDE 100 times using the SRI method [120] and computed the action of the universal approximator on the chosen states.

Sparse Regression for Equation Discovery
Collect the measured states into the data matrix

X = [x(t₁)ᵀ; x(t₂)ᵀ; ...; x(t_m)ᵀ],

with Ẋ the corresponding matrix of derivative measurements. Then evaluate the observations in a basis Θ(X). For example:

Θ(X) = [1  X  X^{P₂}  X^{P₃}  · · ·],

where X^{P_i} stands for all P_i-th order polynomial terms, such as x₁², x₁x₂, and x₂². Using these matrices, find the sparse basis Ξ over a given candidate library Θ by solving the sparse regression problem Ẋ = Θ(X)Ξ with L₁ regularization, i.e. minimizing the objective function ||Ẋ − Θ(X)Ξ||₂ + λ||Ξ||₁. This method and other variants of SINDy applied to UDEs, along with specialized optimizers for the LASSO L₁ optimization problem, have been implemented by the authors and collaborators as the DataDrivenDiffEq.jl library on top of the ModelingToolkit.jl and Symbolics.jl computer algebra systems.
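A minimal sketch of the sequentially thresholded least-squares idea behind SINDy (pure Python, with a hypothetical one-dimensional system and hand-built normal equations; the library uses far more robust optimizers):

```python
def solve_linear(A, b):
    # Gaussian elimination with partial pivoting for a small dense system
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            fac = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= fac * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def lstsq(Theta, dX):
    # least squares via the normal equations Theta^T Theta xi = Theta^T dX
    n = len(Theta[0])
    A = [[sum(row[i] * row[j] for row in Theta) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * y for row, y in zip(Theta, dX)) for i in range(n)]
    return solve_linear(A, b)

def stlsq(Theta, dX, threshold=0.1, iters=10):
    # sequentially thresholded least squares: fit, zero out small
    # coefficients, refit on the surviving support, repeat
    n = len(Theta[0])
    active = list(range(n))
    xi = [0.0] * n
    for _ in range(iters):
        sub = [[row[i] for i in active] for row in Theta]
        coef = lstsq(sub, dX)
        xi = [0.0] * n
        for i, c in zip(active, coef):
            xi[i] = c
        active = [i for i in range(n) if abs(xi[i]) >= threshold]
        xi = [c if abs(c) >= threshold else 0.0 for c in xi]
    return xi

# hypothetical ground truth: x' = 1.5 x - 0.8 x^3, sampled exactly
xs = [0.1 + 0.05 * k for k in range(40)]
dX = [1.5 * x - 0.8 * x ** 3 for x in xs]
Theta = [[x, x ** 2, x ** 3, x ** 4] for x in xs]   # candidate library
xi = stlsq(Theta, dX)
# expected sparse recovery: coefficients near [1.5, 0, -0.8, 0]
```

Hard thresholding is a stand-in for the L₁ penalty above; both drive the coefficient vector Ξ toward sparsity so only the active library terms survive.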

Application to the Partial Reconstruction of the Lotka-Volterra
On the Lotka-Volterra equations, we trained a UDE model in two different scenarios, starting from x₀ = 0.44249296, y₀ = 4.6280594 with parameters chosen as α = 1.3, β = 0.9, γ = 0.8, δ = 1.8. The neural network consists of an input layer, two hidden layers with 5 neurons each, and a linear output layer modeling the polynomial interaction terms. The input and hidden layers have Gaussian radial basis activation functions. We trained for 200 iterations with ADAM with a learning rate γ = 10⁻¹. We then switched to BFGS with an initial stepnorm of γ = 10⁻², setting the maximum iterations to 10000. Typically the training converged after 400-600 iterations in total.

[Figure 6: Scenario 1) of the Lotka-Volterra recovery.]

Scenario 1) consists of a trajectory with 31 points measured in x(t), y(t) with a constant step size Δt = 0.1. We assumed perfect knowledge about the linear terms of the equations and their corresponding parameters α, δ. The trajectory has been perturbed with additive noise drawn from a normal distribution scaled by 5% of its mean. The loss was chosen as the L2 loss. Scenario 2) consists of a trajectory with 61 points measured in x(t) with a constant step size Δt = 0.1 and 6 points measured in y(t) with a constant step size Δt = 1.2. The trajectory has been perturbed with additive noise drawn from a normal distribution scaled by 1% of its mean. We assumed perfect knowledge about the linear terms of the first differential equation and its corresponding parameter α. In addition to the UDE, a linear decay rate with unknown parameter was added in the second differential equation governing the predator dynamics. The parameter was included in the training process. The data was divided into j = 5 different datasets d containing i = 13 measurements in x, t and 2 measurements in y for the initial and boundary condition.
The loss was chosen similar to shooting-like techniques, plus a regularization term. From the trained neural network, data was sampled over the original trajectory and fitted using the SR3 method with varying threshold λ = exp10.(−7 : 0.1 : 3). A Pareto-optimal solution was selected via the L1 norm of the coefficients and the L2 norm of the corresponding error between the differential data and the estimate. The knowledge-enhanced neural network returned the missing interaction terms. After determining the results of the symbolic regression, the learned ODE model was then trained to refit the parameters before extrapolating. In comparison, interpolations of the sparse predator measurements have been performed using linear, quadratic spline, cubic spline, and order-5 polynomial interpolation. Using SINDy with the same library of candidate functions over the interpolated data and their numerical derivatives leads to the results listed in Table 3. Given the sparsity of the data and the quality of the resulting fits and their derivatives, none of the attempts were able to recover the true equation describing the predator dynamics.
The UDE training procedure was evaluated 500 times in total for Scenario 1) given the initial trajectory with varying noise levels 0.1, 0.5, 1.0, 2.5, 5% in terms of the mean of the trajectory (100 repetitions each). The results of the successful recovery of the missing equations are shown in Fig. 8, and the training loss is shown in Fig. 9. Overall, the recovery rate is (50.4 ± 25.7)% for all 498 error-free runs. Two runs failed due to numerical instabilities of the combination of trained parameters and numerical integrator. Given that the study has been conducted in a brute-force manner, e.g. no prior data cleaning, no hyperparameter optimization, and no (automatic) pre-selection of the network's architecture or candidates has been performed, the sensitivity to the network's initial parameters is also captured in this study. However, since the scope of this work is to highlight the usefulness of UDEs as a means to extract specific information from time series, we limited ourselves to this naive approach. Further research can highlight additional methods to improve noise robustness.

Table 3: Results of using SINDy on interpolation data of the predator measurements

Interpolation      Recovered Equation
Linear             ẏ = 2.19·x² − 0.64·x³ + 0.05·x⁴ − 0.01·y⁴
Quadratic Spline   ẏ = −0.77·y³ + 0.23·y⁴ − 0.02·y⁵
Cubic Spline       ẏ = 0.69·x³ − 0.14·x⁴ + 0.03·x⁴·y − 0.02·x³·y²
Polynomial         ẏ = 0.08·x³ − 0.03·x⁴ + 9.62·y³ − 2.09·y⁴ + 0.06·y⁵
In Fig. 10 some examples of failed recovery can be seen. It is worth noting that the mere visual quality of the fit would, in most cases except for Sample 450, indicate a success. In some cases, e.g. Sample 500, some discontinuities can be seen in the solution, possibly due to overfitting. However, adding a regularization penalty to the overall loss function showed little effect and was hence omitted.

Application to the Reconstruction of the Lotka-Volterra Using the Hudson Bay Dataset
Additionally, a full recovery based on the Hudson Bay dataset of hare and lynx populations between 1900 and 1920, taken from [121] and originally published in [122], has been performed. The data has been normalized to the interval (0, 1] and the assumed model is given as

ẋ = αx + U₁(x, y, θ),
ẏ = −δy + U₂(x, y, θ),

incorporating a linear birth rate of the prey and a linear decay rate of the predator. The neural network consists of an input layer, two hidden layers with 5 neurons each, and a linear output layer. Again, radial basis activations have been used, except for the last hidden layer, which used a tanh activation. As in Scenario 2), we started by using a shooting loss with L2 regularization of the parameters, intentionally withholding data of the predators, to smoothly optimize the parameters using ADAM with a learning rate of 0.1 for 100 iterations and BFGS with an initial stepnorm of 0.01 until converged (maximum 200 iterations). After this initial fit, we switched to an L2-norm loss with regularization and trained until convergence using BFGS. The results are shown in Figure 11. Afterwards we performed a sparse regression with a candidate library of multivariate polynomials up to degree 5 and sinusoidal signals in the states on the recovered signal of the UDE using sequentially thresholded regression. The UDE has been subsampled to an interval of 0.5 years, efficiently augmenting the limited data. The mixed terms were recovered successfully. The resulting symbolic model has been post-fitted with an L2-norm loss to decrease its resulting error, as can be seen in Figure 12. A long-term estimation can be seen in Figure 13, using the parameters α = 0.557, δ = 0.826, β = −1.70 and γ = 2.04.

Application to the Reconstruction of the Fisher-KPP Equations
To generate training data for the 1D Fisher-KPP equation 7, we take the growth rate and the diffusion coefficient to be r = 1 and D = 0.01 respectively. The equation is numerically solved in the domain x ∈ [0, 1] and t ∈ [0, T] using a 2nd-order central difference scheme for the spatial derivatives, and the time integration is done using the Tsitouras 5/4 Runge-Kutta method. We implement the periodic boundary condition ρ(x = 0, t) = ρ(x = 1, t), and the initial condition ρ(x, t = 0) = ρ₀(x) is taken to be a localized function of width Δ = 0.2, representing the region where ρ ≈ 1. The data are saved at evenly spaced points with Δx = 0.04 and Δt = 0.5. In the UPDE 8, the growth neural network NN_θ(ρ) has 4 densely connected layers with 10, 20, 20 and 10 neurons each and tanh activation functions. The diffusion operator is represented by a CNN that operates on an input vector of arbitrary size. It has 1 hidden layer with a 3 × 1 filter [w₁, w₂, w₃] without any bias. To implement periodic boundary conditions for the UPDE at each time step, the vector of values at different spatial locations [ρ₁, ρ₂, ..., ρ_Nx] is padded with ρ_Nx to the left and ρ₁ to the right. This also ensures that the output of the CNN is of the same size as the input. The weights of both neural networks and the diffusion coefficient are simultaneously trained to minimize a loss function consisting of the data mismatch plus the penalty λ(w₁ + w₂ + w₃)², where λ is taken to be 10² (note that one could also structurally enforce w₃ = −(w₁ + w₂)). The second term in the loss function enforces that the learned differential operator is conservative, that is, the weights sum to zero. The training is done using the ADAM optimizer with learning rate 10⁻³. Similar to the Lotka-Volterra experiments, symbolic regression over monomials up to degree 10 has been performed to recover the nonlinear term U_θ of the Fisher-KPP equations from Section 11.1.4 in Scenario 3).
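The conservation constraint on the CNN stencil can be checked directly: with periodic padding, a filter whose weights sum to zero cannot change the total mass Σρ. A small illustrative sketch (not the training code; the field and weights below are arbitrary choices):

```python
import math

def periodic_conv(rho, w):
    # pad with rho[-1] on the left and rho[0] on the right, then apply
    # the 3-point stencil [w1, w2, w3]; output has the same length as rho
    w1, w2, w3 = w
    padded = [rho[-1]] + rho + [rho[0]]
    return [w1 * padded[i] + w2 * padded[i + 1] + w3 * padded[i + 2]
            for i in range(len(rho))]

# arbitrary smooth field and a conservative stencil with w3 = -(w1 + w2)
rho = [math.sin(2 * math.pi * k / 25) + 1.0 for k in range(25)]
w1, w2 = 1.0, -2.0
w = (w1, w2, -(w1 + w2))
drho = periodic_conv(rho, w)
total_change = sum(drho)   # zero (to roundoff) for any rho when sum(w) == 0
```

Because each ρ_i appears exactly once under each of the three shifts of a circular stencil, Σ_i drho_i = (w₁ + w₂ + w₃) Σ_i ρ_i, which is exactly what the λ(w₁ + w₂ + w₃)² penalty drives to zero.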
The system has been simulated with r = 1 for t ∈ [0, 5] s with a time resolution of Δt = 0.5 s and 25 measurements in x. Normally distributed noise with 2.5% of the mean has been added. Figure 14 shows the recovered nonlinear term via the UDE approach, using a network similar to Scenario 1). We trained the network using a loss function similar to Section 11.1.4, with a regularization of 1, using ADAM with a learning rate of 0.1 for 200 iterations and then switching to BFGS with an initial stepnorm of 0.01. The recovered equation is U_θ(ρ) = 1.0ρ − 1.0ρ². The symbolic regression has been performed using sequentially thresholded least squares over thresholds λ ∈ [10⁻³, 10²], logarithmically evenly distributed with a step size of 0.01.

Analysis of Alternative Function Approximators in Fisher-KPP
Additionally, we analyzed the performance of alternative universal approximators for this spatiotemporal PDE discovery task. To do this, we analyzed both the parameter efficiency and the computational efficiency. For the parameter efficiency, we attempted to quantify the minimum number of parameters that could robustly identify the terms of the PDEs. It is predicted by the lottery ticket hypothesis [123] that low-parameter neural networks can generally produce accurate fits, but have a low probability of being discovered. Thus we defined robustness as the ability to perform 5 optimizations with random initializations mixed with stochastic optimizers and recover a fitted solution to a loss of 0.01. All optimizations used the same optimization process, which was 400 iterations of ADAM to find a local minimum and then BFGS to complete the optimization (which automatically exited when the solution achieved a local minimum according to the default detection criteria of the Optim.jl library [124]). For the neural network we chose 1 hidden layer with a tanh activation function and scaled the size of the hidden layer from 2 to 4, corresponding to 4, 7, and 15 parameters. At 4 parameters, all 5 optimization runs failed to produce a loss of 0.01, while at 7 and 15 all optimizations passed. To test against a non-neural-network function approximator, we chose to use the Fourier basis. Tests with 3, 5, 7, and 15 parameters all gave fits to a loss of 0.01 on all 5 optimization runs, demonstrating that the smaller-parameter Fourier basis model was more robust than the smallest-parameter neural network model. The smallest robust neural network, the 7-parameter version, took approximately 2,500 seconds to train, with a standard deviation of approximately 1,000 seconds, a minimum time of approximately 1,300 seconds and a maximum time of approximately 3,300 seconds.
The Fourier basis with 7 parameters took approximately 250 seconds to train, with a standard deviation of approximately 5 seconds, a minimum time of approximately 242 seconds and a maximum time of approximately 255 seconds. This demonstrates that the training time with the Fourier basis was on average an order of magnitude faster than the neural network when comparing the same parameter size between the models.
All of the output losses and times for the experiments are stored as comments in the code for the experiments FisherKPPCNNSmall.jl and FisherKPPCNN-Fourier.jl in the example repository. The implementations of the classical basis functions can be found in the DiffEqFlux.jl repository, with an associated tutorial on classical basis functions and TensorLayer.

Discovery of Robertson's Equations with Prior Conservation Laws
On Robertson's equations, we trained a UDAE model against a trajectory of 10 points on the timespan $t \in [0.0, 1.0]$ starting from $y_1 = 1.0$, $y_2 = 0.0$, and $y_3 = 0.0$. The parameters of the generating equation were $k_1 = 0.04$, $k_2 = 3\times 10^{7}$, and $k_3 = 10^{4}$. The universal approximator was a neural network with one hidden layer of size 64. The equation was trained using the BFGS optimizer to a loss of $9\times 10^{-6}$.
Adaptive Solving for the 100 Dimensional Hamilton-Jacobi-Bellman Equation

Forward-Backwards SDE Formulation
Consider the class of semilinear parabolic PDEs, in finite time $t \in [0, T]$ and $d$-dimensional space $x \in \mathbb{R}^d$, that have the form
$$\frac{\partial u}{\partial t}(t,x) + \frac{1}{2}\operatorname{tr}\left(\sigma\sigma^T(t,x)\left(\operatorname{Hess}_x u\right)(t,x)\right) + \nabla u(t,x) \cdot \mu(t,x) + f\left(t, x, u(t,x), \sigma^T(t,x)\nabla u(t,x)\right) = 0,$$
with a terminal condition $u(T,x) = g(x)$. In this equation, $\operatorname{tr}$ is the trace of a matrix, $\sigma^T$ is the transpose of $\sigma$, $\nabla u$ is the gradient of $u$, and $\operatorname{Hess}_x u$ is the Hessian of $u$ with respect to $x$. Furthermore, $\mu$ is a vector-valued function, $\sigma$ is a $d \times d$ matrix-valued function, and $f$ is a nonlinear function. We assume that $\mu$, $\sigma$, and $f$ are known. We wish to find the solution at initial time $t = 0$ at some starting point $x = \zeta$. Let $W_t$ be a Brownian motion and take $X_t$ to be the solution of the stochastic differential equation $dX_t = \mu(t, X_t)dt + \sigma(t, X_t)dW_t$ with initial condition $X(0) = \zeta$. It has been shown that the solution of this PDE satisfies the following forward-backward SDE (FBSDE) [125]:
$$u(t, X_t) - u(0, \zeta) = -\int_0^t f\left(s, X_s, u(s, X_s), \sigma^T(s, X_s)\nabla u(s, X_s)\right)ds + \int_0^t \left[\nabla u(s, X_s)\right]^T \sigma(s, X_s)\,dW_s,$$
with terminating condition $u(T, X_T) = g(X_T)$. Notice that we can combine these two equations into a system of $d+1$ SDEs:
$$\begin{aligned}
dX_t &= \mu(t, X_t)dt + \sigma(t, X_t)dW_t,\\
dU_t &= -f\left(t, X_t, U_t, \sigma^T(t, X_t)\nabla u(t, X_t)\right)dt + \left[\sigma^T(t, X_t)\nabla u(t, X_t)\right]^T dW_t,
\end{aligned}$$
where $U_t = u(t, X_t)$. Since $X_0$, $\mu$, $\sigma$, and $f$ are known from the choice of model, the remaining unknown portions are the functional $\sigma^T(t, X_t)\nabla u(t, X_t)$ and the initial condition $U(0) = u(0, \zeta)$, the latter being the point estimate solution of the PDE.
To solve this problem, we approximate both unknown quantities by universal approximators, $U^1_{\theta_1}(t, x) \approx \sigma^T(t, x)\nabla u(t, x)$ and $U^2_{\theta_2}(x) \approx u(0, x)$. Therefore we can rewrite the system above as a stochastic UDE of the form
$$\begin{aligned}
dX_t &= \mu(t, X_t)dt + \sigma(t, X_t)dW_t,\\
dU_t &= -f\left(t, X_t, U_t, U^1_{\theta_1}(t, X_t)\right)dt + \left[U^1_{\theta_1}(t, X_t)\right]^T dW_t,
\end{aligned}$$
with initial condition $(X_0, U_0) = \left(\zeta, U^2_{\theta_2}(\zeta)\right)$. To be a solution of the PDE, the approximation must satisfy the terminating condition, and thus we define our loss as the expected difference between the approximating solution and the required terminating condition:
$$l(\theta_1, \theta_2) = \mathbb{E}\left[\left\|g(X_T) - U_T\right\|^2\right].$$
Finding the parameters $(\theta_1, \theta_2)$ which minimize this loss function gives rise to a BSDE which solves the PDE, and thus $U^2_{\theta_2}(\zeta)$ is the point estimate solution of the PDE once trained.
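The core of this method is a single Monte Carlo forward pass: simulate the coupled $(X_t, U_t)$ system by Euler-Maruyama and evaluate the terminal mismatch. The sketch below shows that loss evaluation in NumPy under placeholder choices of $\mu$, $\sigma$, $f$, and $g$ (all hypothetical, not the paper's LQG setup), with the two universal approximators as tiny, untrained, randomly initialized tanh networks; a real training run would minimize this loss over the network weights and the scalar initial guess.

```python
import numpy as np

# One evaluation of the FBSDE terminal-matching loss E[(g(X_T) - U_T)^2].
# mu, sigma, f, g are illustrative placeholders; the sign on the f-drift
# term follows the Ito derivation for the PDE convention used here.

rng = np.random.default_rng(1)
d, n_paths, n_steps, T = 4, 256, 50, 1.0
dt = T / n_steps

mu = lambda t, x: np.zeros_like(x)                         # placeholder drift
sigma_mat = np.sqrt(2.0) * np.eye(d)                       # placeholder diffusion
f = lambda t, x, u, z: -np.sum(z**2, axis=1)               # placeholder nonlinearity
g = lambda x: np.log(0.5 * (1.0 + np.sum(x**2, axis=1)))   # placeholder terminal cost

# U1 approximates sigma^T grad u; a 1-hidden-layer tanh net on (t, x).
W1 = 0.1 * rng.standard_normal((d + 1, 16)); b1 = np.zeros(16)
W2 = 0.1 * rng.standard_normal((16, d))
def U1(t, x):
    h = np.tanh(np.hstack([np.full((x.shape[0], 1), t), x]) @ W1 + b1)
    return h @ W2

u0 = 0.5                        # guess for u(0, zeta); trained in practice

X = np.zeros((n_paths, d))      # X_0 = zeta = 0 for every path
U = np.full(n_paths, u0)
for i in range(n_steps):
    t = i * dt
    dW = np.sqrt(dt) * rng.standard_normal((n_paths, d))
    Z = U1(t, X)                # Z approximates sigma^T grad u at (t, X_t)
    U = U - f(t, X, U, Z) * dt + np.sum(Z * dW, axis=1)
    X = X + mu(t, X) * dt + dW @ sigma_mat.T

loss = np.mean((g(X) - U) ** 2)
print(f"terminal-condition loss = {loss:.3f}")
```

In the actual experiments this simulation is performed by an adaptive SDE solver rather than fixed-step Euler-Maruyama, and the loss gradient is taken through the solver.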

The LQG Control Problem
This PDE can be rewritten into the canonical form above by setting $\sigma = \sqrt{2}$, $T = 1$, and $X_0 = (0, \ldots, 0) \in \mathbb{R}^{100}$. The universal stochastic differential equation was then supplemented with neural networks as the approximators. The initial condition neural network had 1 hidden layer of size 110, and the $\sigma^T(t, X_t)\nabla u(t, X_t)$ neural network had two hidden layers, both of size 110. For this example we chose $\lambda = 1$. The system was trained with the LambaEM method of DifferentialEquations.jl [35] with relative and absolute tolerances set to $10^{-4}$, using 500 training iterations and a loss computed from 100 trajectories per epoch.
On this problem, for an arbitrary $g$, one can show with Itô's formula that
$$u(t, x) = -\frac{1}{\lambda}\ln\left(\mathbb{E}\left[\exp\left(-\lambda\, g\left(x + \sqrt{2}\,W_{T-t}\right)\right)\right]\right),$$
which was used to calculate the error from the true solution.
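A reference value of this form can be estimated by plain Monte Carlo over the Brownian endpoint. The sketch below evaluates the expression at $t = 0$, $x = 0$; the terminal cost $g$, the dimension, and the sample budget are illustrative placeholders rather than the experiment's actual 100-dimensional configuration.

```python
import numpy as np

# Monte Carlo estimate of the reference solution
#   u(t, x) = -(1/lambda) * log E[ exp(-lambda * g(x + sqrt(2) W_{T-t})) ]
# at t = 0, x = 0. g, d, and the sample count are illustrative choices.

rng = np.random.default_rng(2)
d, T, lam, n_samples = 10, 1.0, 1.0, 200_000

g = lambda x: np.log(0.5 * (1.0 + np.sum(x**2, axis=1)))  # placeholder terminal cost

W_T = np.sqrt(T) * rng.standard_normal((n_samples, d))    # W_{T-t} at t = 0
samples = np.exp(-lam * g(np.sqrt(2.0) * W_T))
u0 = -np.log(np.mean(samples)) / lam
print(f"Monte Carlo estimate of u(0, 0): {u0:.4f}")
```

Because the expectation sits inside a logarithm, the estimator is slightly biased at finite sample sizes, which is why a large sample budget is used when computing the error of the trained UDE solution.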

Reduction of the Boussinesq Equations
As a test for the diffusion-advection equation parameterization approach, data was generated from the diffusion-advection equations using the missing function $\overline{wT} = \cos(\sin(T^3)) + \sin(\cos(T^2))$ with N spatial points discretized by a finite difference method on $t \in [0, 1.5]$ with Neumann zero-flux boundary conditions. A neural network with two hidden layers of size 8 and tanh activation functions was trained against 30 data points sampled from the true PDE. The UPDE was fit using the ADAM optimizer with learning rate $10^{-2}$ for 200 iterations and then ADAM with a learning rate of $10^{-3}$ for 1000 iterations. The resulting fit, shown in Figure 16, reached a final loss of approximately 0.007. We note that the stabilized adjoints were required for this equation: the backsolve adjoint method was unstable and resulted in divergence, and thus cannot be used on this type of equation. The trained neural network had a forward pass that took around 0.9 seconds. For the benchmark against the full Boussinesq equations, we utilized Oceananigans.jl [126]. It was set to use adaptive time stepping to maximize the time step according to the CFL condition (capped at CFL ≤ 0.3) and matched the boundary conditions, along with setting periodic boundary conditions in the horizontal dimension. The Boussinesq simulation used $128 \times 128 \times 128$ spatial points, a larger number than the parameterization, in order to accurately resolve the mean statistics of the 3-dimensional dynamics as is commonly required in practice [81]. The resulting simulation took 13,737 seconds on the same computer used for the neural diffusion-advection approach, demonstrating an approximately 15,000x acceleration.
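The zero-flux Neumann boundary conditions in the finite difference discretization are typically imposed with mirrored ghost points. The sketch below shows that construction for pure 1D diffusion with explicit Euler time stepping; the grid size, diffusion coefficient, and initial profile are illustrative values, not those of the experiment, and a UDE would add the trained network term to the right-hand side.

```python
import numpy as np

# Method-of-lines discretization of 1D diffusion with zero-flux
# (Neumann) boundaries via mirrored ghost points. Illustrative values.

N, D = 32, 0.1
dx = 1.0 / (N - 1)
u = np.cos(np.pi * np.linspace(0.0, 1.0, N))   # illustrative initial profile

def rhs(u):
    # Ghost points enforce du/dx = 0 at both ends by mirroring neighbors.
    up = np.empty(N + 2)
    up[1:-1] = u
    up[0], up[-1] = u[1], u[-2]
    # Second-order central difference for the Laplacian.
    return D * (up[2:] - 2.0 * up[1:-1] + up[:-2]) / dx**2

dt = 0.4 * dx**2 / D                           # within explicit-Euler stability
for _ in range(2000):
    u = u + dt * rhs(u)

# With zero-flux boundaries the profile relaxes toward a constant.
print(f"mean after diffusion: {u.mean():.6f}")
```

Stiffness of this diffusion operator is also why the stabilized adjoints, rather than backsolve adjoints, were required when differentiating through the solver.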

Automated Derivation of Closure Relations for Viscoelastic Fluids
The full FENE-P model is
$$f(\sigma)\sigma + \lambda \overset{\nabla}{\sigma} = \eta \dot{\gamma}, \qquad f(\sigma) = \frac{L^2 + \frac{\lambda}{\eta}\operatorname{Tr}(\sigma)}{L^2 - 3},$$
where $\overset{\nabla}{(\cdot)}$ is the upper convected derivative and $L$, $\eta$, $\lambda$ are parameters [87]. For a one-dimensional strain rate, $\dot{\gamma}_{12} = \dot{\gamma}_{21} = \dot{\gamma}$ and $\dot{\gamma}_{ij} = 0$ otherwise, the only one-dimensional stress required is $\sigma = \sigma_{12}$. However, $\sigma_{11}$ and $\sigma_{22}$ are both non-zero and store memory of the deformation (normal stresses). The Oldroyd-B model is the approximation obtained by taking $f(\sigma) \to 1$ (the limit $L \to \infty$), which admits an exact closure relation for the one-dimensional stress. As an arbitrary nonlinear extension, we trained a UDE model using a single additional memory field against simulated FENE-P data with parameters $\lambda = 2$, $L = 2$, $\eta = 4$, where $U_0$ and $U_1$ are neural networks, each with a single hidden layer containing 4 neurons and a tanh activation function. The loss was taken as $L = \sum_i \left(\sigma(t_i) - \sigma_{\text{FENE-P}}(t_i)\right)^2$ for 100 evenly spaced time points $t_i \in [0, 2\pi]$, and the system was trained using the ADAM optimizer with learning rate 0.015. The fluid is assumed to be at rest before $t = 0$, making the initial stress also zero.
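The structure of this closure UDE can be sketched as follows: the observable shear stress is evolved jointly with one extra memory field, and two small tanh networks supply the unknown right-hand sides. In the sketch, the networks are untrained random initializations, the imposed shear rate and the FENE-P target data are placeholder signals, and forward Euler stands in for the paper's ODE solver; only the shape of the loss over 100 points in $[0, 2\pi]$ mirrors the experiment.

```python
import numpy as np

# Schematic viscoelastic closure UDE: sigma_12 plus one memory field,
# with two 1-hidden-layer (4-neuron) tanh networks as the unknown RHS.
# Untrained weights and placeholder data; illustrative only.

rng = np.random.default_rng(3)
gamma_dot = lambda t: np.cos(t)                 # placeholder imposed shear rate

def make_net(n_in):
    W1, b1 = 0.5 * rng.standard_normal((n_in, 4)), np.zeros(4)
    W2 = 0.5 * rng.standard_normal(4)
    return lambda v: np.tanh(v @ W1 + b1) @ W2  # scalar output

U0, U1 = make_net(3), make_net(3)               # inputs: (sigma, memory, gdot)

ts = np.linspace(0.0, 2.0 * np.pi, 100)
dt = ts[1] - ts[0]
sigma, memory = 0.0, 0.0                        # fluid at rest before t = 0
trajectory = []
for t in ts:
    v = np.array([sigma, memory, gamma_dot(t)])
    sigma += dt * U0(v)                         # closure for d(sigma_12)/dt
    memory += dt * U1(v)                        # closure for the memory field
    trajectory.append(sigma)

sigma_target = 0.5 * np.sin(ts)                 # placeholder for FENE-P data
loss = np.sum((np.array(trajectory) - sigma_target) ** 2)
print(f"untrained closure loss: {loss:.3f}")
```

Training then minimizes this loss over the weights of both networks, so that the learned memory field plays the role the normal stresses play in the full FENE-P model.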