Biologically informed NeuralODEs for genome-wide regulatory dynamics



Abstract
Models that are formulated as ordinary differential equations (ODEs) can accurately explain temporal gene expression patterns and promise to yield new insights into important cellular processes, disease progression, and intervention design. Learning such ODEs is challenging, since we want to predict the evolution of gene expression in a way that accurately encodes the causal gene regulatory network (GRN) governing the dynamics and the nonlinear functional relationships between genes. Most widely used ODE estimation methods either impose too many parametric restrictions or are not guided by meaningful biological insights, both of which impede scalability and/or explainability. To overcome these limitations, we developed PHOENIX, a modeling framework based on neural ordinary differential equations (NeuralODEs) and Hill-Langmuir kinetics that can flexibly incorporate prior domain knowledge and biological constraints to promote sparse, biologically interpretable representations of ODEs. We test the accuracy of PHOENIX in a series of in silico experiments benchmarking it against several currently used tools for ODE estimation. We also demonstrate PHOENIX's flexibility by studying oscillating expression data from synchronized yeast cells and assess its scalability by modeling genome-scale breast cancer expression for samples ordered in pseudotime. Finally, we show how the combination of user-defined prior knowledge and functional forms from systems biology allows PHOENIX to encode key properties of the underlying GRN, and subsequently predict expression patterns in a biologically explainable way.

Background
Biological systems are complex, with phenotypic states, including those representing health and disease, defined by the expression states of the entire genome. Transitions between these states occur over time through the action of highly interconnected regulatory processes driven by transcription factors. Modeling the molecular mechanisms that govern these transitions is essential if we are to understand the behavior of biological systems and design interventions that can more effectively induce a specific phenotypic outcome. But this is challenging, since we want not only to predict gene expression at unobserved time points, but also to make these predictions in a way that is consistent with any prior knowledge of transcription factor binding sites. Models that accurately encode such interactions between transcription factors and target genes within gene regulatory networks (GRNs) can provide insights into important cellular processes, such as disease progression and cell-fate decisions [1-4].
Given that many dynamical systems can be described using ordinary differential equations (ODEs), a logical approach to modeling GRNs is to estimate ODEs for gene expression using an appropriate statistical learning technique [3-6]. Although estimating gene regulatory ODEs ideally requires time-course data, obtaining such data in biological systems is difficult (if not impossible, given the destructive nature of the associated assays). One can instead use pseudotime methods applied to cross-sectional data to order samples and subsequently estimate ODEs that capture the regulatory structure [7,8].
While a variety of ODE estimation methods have been proposed, most suffer from key issues that limit their applicability to modeling genome-wide regulatory networks. Some systems biology models formulate ODEs based solely on biochemical principles of gene regulation, and use the available data to parameterize these equations [9]. However, such methods impose several restrictions on the ODEs and cannot flexibly adjust to situations where the underlying assumptions do not hold; this increases the risk of model misspecification and hinders scalability to large networks, particularly given the enormous number of parameters necessary to specify a genome-scale model [10,11]. Other methods take non-parametric approaches to learning regulatory ODEs, using tools such as sparse kernel regression [3], random forests [5], variational autoencoders [7,12,13], diffusion processes [4], and neural ordinary differential equations [6,14], but these fail to include biologically relevant associations between regulatory elements and genes as constraints on the models.
These latter models can be broadly placed into two classes based on the inputs required to estimate the gradient f of the gene regulatory dynamics, where f(x) = dx/dt. The first class consists of methods like PRESCIENT [4] and RNAForecaster [6] that can learn f based only on time-series gene-expression input {x_t0, x_t1, ..., x_tT} without additional steps or consideration of other regulatory inputs [4,6,15]. In the process of learning transitions between observed expression states, however, these methods can lose predictive performance, since estimating f directly from expression snapshots is difficult. The second class consists of "two-step" methods such as Dynamo [3], RNA-ODE [5], and scDVF [12] that aim to alleviate this performance loss by replacing the difficult task of learning f with two separate steps [3,5,12]. First, the RNA velocity (dx/dt) is explicitly estimated for each data point in a preprocessing step, using spliced and unspliced transcript counts and one of many available velocity estimation tools [3,16,19-23]. In the next step, the original task of learning f is reduced to learning a vector field from expression-velocity tuples [x_i, (dx/dt)_i], to which a suitable learning algorithm is applied. Apart from needing additional inputs that may not always be available (for example, spliced and unspliced counts are not available for microarray data), these "two-step" methods are also sensitive to the velocity estimation tool used, many of which suffer from a multitude of weaknesses [19]. Still, the Jacobian of the estimated vector field can help indicate whether the learned dynamics are biologically meaningful [2,3,5,8].
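To make the two-step recipe concrete, here is a minimal numpy sketch, not taken from any of the cited tools: step one is assumed already done, so noiseless velocities of a hypothetical linear system stand in for the output of a velocity-estimation tool, and ridge regression stands in for the vector-field learners these methods actually use. All names (W_true, euler_rollout) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth linear dynamics dx/dt = x @ W_true (3 genes).
W_true = np.array([[-1.0, 0.5, 0.0],
                   [0.0, -1.0, 0.5],
                   [0.5, 0.0, -1.0]])

# Step 1 is assumed done: expression-velocity tuples [x_i, (dx/dt)_i],
# here generated noiselessly instead of by a velocity-estimation tool.
X = rng.normal(size=(200, 3))
V = X @ W_true

# Step 2: learn the vector field f with ridge regression, f(x) = x @ W_hat.
lam = 1e-3
W_hat = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ V)

# Trajectories are then predicted by integrating f, e.g. with Euler steps.
def euler_rollout(x0, W, dt=0.01, steps=100):
    x = x0.copy()
    for _ in range(steps):
        x = x + dt * (x @ W)
    return x

x0 = np.array([1.0, 0.0, 0.0])
pred = euler_rollout(x0, W_hat)    # trajectory under the learned field
truth = euler_rollout(x0, W_true)  # trajectory under the true field
```

With exact velocities the learned field essentially recovers the true one; in practice the quality of step two is bounded by the quality of the velocity estimates fed into it.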
While the flexibility of both classes of models is helpful in estimating arbitrary dynamics, they are "black-box" methods whose somewhat opaque nature not only makes them prone to overfitting, but also makes it difficult to extract interpretable mechanistic insights about regulatory control [1,6]. These models are optimized solely to predict RNA velocity or gene expression levels, and so their predictions are not explainable in the sense that most cannot be related back to a sparse causal GRN. Another major issue is the scalability of these methods; because of their computational complexity, they have not yet been shown to feasibly scale up to tens of thousands of genes, and certainly not to the entire genome [3,4,7,12-14]. Consequently, most of these methods either restrict themselves to a small set of highly variable genes [4,6,7,12,14] or resort to dimension-reduction techniques (PCA, UMAP, latent-space embedding, etc.) [3,4,7,13] as a preprocessing step. This leads to certain biological pathways being masked in the dynamics and impedes the recovery of causal GRNs. Finally, these models generally lack a means of incorporating biological constraints and prior knowledge to guide model selection and prevent overfitting [1,24].
We developed PHOENIX (Prior-informed Hill-like ODEs to Enhance Neuralnet Integrals with eXplainability) as a scalable method for estimating dynamical systems governing gene expression through an ODE-based machine learning framework that is flexible enough to avoid model misspecification and is guided by insights from systems biology that facilitate biological interpretation of the resulting models [25,26]. At its core, PHOENIX models temporal patterns of gene expression using neural ordinary differential equations (NeuralODEs) [27,28], an advanced computational method commensurate with the scope of human gene regulatory networks (more than 25,000 genes and 1,600 TFs) and the limited number of samples. We implement an innovative NeuralODE architecture that inherits the universal function approximation property (and thus the flexibility) of neural networks while resembling Hill-Langmuir kinetics (which have been used to model dynamic transcription factor binding site occupancy [10,29,30]), so that it can reasonably describe gene regulation by modeling the sparse yet synergistic interactions of genes and transcription factors.
Importantly, PHOENIX operates on the original gene expression space and does not require any dimensionality reduction, thus preventing information loss (especially for lowly expressed genes that are nonetheless important for cell fate) [4]. This, together with the incorporation of user-defined prior knowledge of likely network structure, ensures that a trained PHOENIX model is explainable: it not only predicts temporal gene expression patterns, but also encodes an extractable GRN that captures key mechanistic properties of regulation, such as activating (and repressive) edges and strength of regulation.

The PHOENIX model
Given a time-series gene expression data set, the NeuralODEs of PHOENIX implicitly estimate the local derivative (RNA velocity) at an input data point with a neural network (NN). We designed activation functions that resemble Hill kinetics and thus allow the NN to sparsely represent different patterns of transcriptional co-regulation by combining separate additive and multiplicative blocks that operate on the linear and logarithmic scales respectively. An ODE solver then integrates the estimated derivative to reconstruct the steps taken from an input x(t_i) at time t_i to a predicted output x(t_{i+1}) at time t_{i+1} [27]. The trained neural network block thus encodes the ODEs governing the dynamics of gene expression, and hence the underlying vector field and GRN. An important advantage of incorporating an ODE solver is that we can predict expression changes over arbitrarily long time intervals without relying on predefined Euler discretizations, as is required by many other methods [4,12,18]. We further augmented this framework by allowing users to include prior knowledge of gene regulation in a flexible way, which acts as a domain-knowledge-informed regularizer or soft constraint on the NeuralODE [24] (Figure 1). By combining the mechanism-driven approach of systems-biology-inspired functional forms and prior knowledge with the data-driven approach of powerful machine learning tools, PHOENIX scales up to full-genome data sets and learns meaningful models of gene regulatory dynamics.

Figure 1 PHOENIX is powered by a NeuralODE engine. Given an expression vector g(t_i) ∈ R^#genes at time t_i, a neural network (dotted rectangle) estimates the local derivative dg(t_i)/dt and an ODE solver integrates this value to predict expression at subsequent time points ĝ(t_{i+1}).
The neural network is equipped with activation functions (φ_Σ and φ_Π) that resemble Hill-Langmuir kinetics, and two separate single-layer blocks (NN_sums and NN_prods) that operate on the linear and logarithmic scales to model additive and multiplicative co-regulation respectively. A third block (NN_combine) then flexibly combines the additive and multiplicative synergies. PHOENIX incorporates two levels of back-propagation to parameterize the neural network while inducing domain-knowledge-specific properties; the first (red arrows with weight λ_data) aims to match the observed data, while the second (blue arrow with weight λ_prior) uses simulated ("ghost") expression vectors γ(t_i) to implement soft constraints defined by user-supplied prior models (P*) of putative regulatory interactions.

Neural ordinary differential equations (NeuralODEs)
NeuralODEs [27] learn dynamical systems by parameterizing the underlying derivative with a neural network f_θ: dx(t)/dt = f_θ(x(t), t). Given an initial condition x(t_0), the output at any later time point t_1 can be approximated using a numerical ODE solver S of adaptive step size: x̂(t_1) = S(x(t_0), f_θ, t_0, t_1). This is the basic architecture of a NeuralODE [27], and it lends itself to loss functions L (e.g. the ℓ_2 loss) of the form L(x(t_1), x̂(t_1)) = ∥x(t_1) − x̂(t_1)∥_2². To perform back-propagation, the gradient of the loss with respect to all parameters θ must be computed, which is done using the adjoint sensitivity method [27].
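The forward pass above can be sketched in a few lines of numpy; this is a generic illustration of the NeuralODE idea, not PHOENIX's implementation. A fixed-step RK4 integrator stands in for the adaptive-step solver S, the adjoint-based backward pass is omitted, and the names (f_theta, solve) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h = 3, 8  # state dimension and hidden width (illustrative sizes)

# A tiny neural network parameterizing the (autonomous) derivative:
# f_theta(x) = tanh(x @ W1) @ W2.
theta = {"W1": rng.normal(scale=0.3, size=(n, h)),
         "W2": rng.normal(scale=0.3, size=(h, n))}

def f_theta(x, theta):
    return np.tanh(x @ theta["W1"]) @ theta["W2"]

# A fixed-step RK4 integrator standing in for the adaptive-step solver S.
def solve(x0, t0, t1, theta, steps=50):
    x, dt = x0.copy(), (t1 - t0) / steps
    for _ in range(steps):
        k1 = f_theta(x, theta)
        k2 = f_theta(x + 0.5 * dt * k1, theta)
        k3 = f_theta(x + 0.5 * dt * k2, theta)
        k4 = f_theta(x + dt * k3, theta)
        x = x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return x

# l2 loss between the solver's prediction at t1 and an observation.
x0 = np.array([0.5, -0.2, 0.1])
x_obs = np.array([0.4, 0.0, 0.2])
x_pred = solve(x0, 0.0, 1.0, theta)
loss = float(np.sum((x_pred - x_obs) ** 2))
```

Training would then back-propagate this loss into theta, which NeuralODE implementations do memory-efficiently via the adjoint method rather than by differentiating through the solver steps.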

Model formulation and neural network architecture
Most models for co-regulation of gene expression are structured as a simple feedback process [29]. Given that gene regulation can be influenced by perturbations across an entire regulatory network of n genes, the expression of all genes can have an effect on a specific g_i(t) at time point t: dg(t)/dt = f_reg(g(t)), where g(t) = {g_i(t)}_{i=1}^n and f_reg : R^n → R^n is approximated with a neural network. To model additive as well as multiplicative effects within f_reg, we used an innovative neural network architecture equipped with activation functions that emulate (and can thus sparsely encode) Hill kinetics (see Figure 1). The Hill-Langmuir equation H(P) was originally derived to model the binding of ligands to macromolecules [30], and can be used to model transcription factor occupancy of gene regulatory binding sites [10]: H(P) = P^α / (K^α + P^α), where P is the transcription factor concentration, K the concentration of half-occupancy, and α the Hill coefficient. This form resembles the softsign activation function ϕ_soft(y) = y/(1 + |y|). For better neural network trainability, however, we shifted it to the center of the expression values. To approximate suitable exponents α, we further log-transformed H, since composing additive operations in log-transformed space with an elementwise exp can represent multiplicative effects. The resulting functions were employed as activation functions to define two neural network blocks (NN_sums and NN_prods), representing additive and multiplicative effects. The concatenated vectors c_Σ(g(t)) ⊕ c_Π(g(t)) served as input to a third block NN_combine (with weights W_∪ ∈ R^{n×2m}) that flexibly combined these additive and multiplicative effects; we found that a single linear layer was sufficient for this purpose. To simplify the representation of steady state for genes without upstream transcription factors (dg_i(t)/dt = 0, ∀t), we introduced gene-specific parameters υ ∈ R^n. Accordingly, the output derivative for each gene i was multiplied by ReLU(υ_i) = max{υ_i, 0}.
We expressed this using the Hadamard product (⊙) of the previous output and the elementwise ReLU of υ. All model parameters were learned based on observed data and prior domain knowledge (details in Supp. Methods 1).
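The architecture described above can be sketched roughly as follows; this is a hypothetical numpy rendition, not the actual PHOENIX implementation. In particular, phi_prod is only an illustrative stand-in for the log-transformed Hill activation (the exact form is in Supp. Methods 1), and all layer sizes and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 5, 4  # genes and hidden units per block (illustrative sizes)

# Hill-like activation: a softsign shifted toward the center of the
# (assumed normalized) expression range.
def phi_sum(y):
    z = y - 0.5
    return z / (1.0 + np.abs(z))

# Log-scale counterpart; an illustrative stand-in for the log-transformed
# Hill activation used by NN_prods.
def phi_prod(y):
    return np.log1p(np.abs(phi_sum(y)))

W_sum = rng.normal(scale=0.3, size=(n, m))       # NN_sums weights
W_prod = rng.normal(scale=0.3, size=(n, m))      # NN_prods weights
W_comb = rng.normal(scale=0.3, size=(2 * m, n))  # NN_combine weights
# Gene-specific switch parameters; nonpositive entries model genes with
# no upstream TFs, whose derivative is forced to zero.
upsilon = np.array([1.0, -1.0, 0.5, -0.2, 2.0])

def dg_dt(g):
    c_sum = phi_sum(g) @ W_sum             # additive co-regulation
    c_prod = np.exp(phi_prod(g) @ W_prod)  # exp of sums of logs = products
    combined = np.concatenate([c_sum, c_prod]) @ W_comb
    return combined * np.maximum(upsilon, 0.0)  # Hadamard with ReLU(upsilon)

g = rng.uniform(size=n)  # an expression vector
vel = dg_dt(g)           # estimated local derivative dg/dt
```

Note how the ReLU(υ) mask zeroes the derivative of genes 2 and 4 here, giving them a steady state by construction.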

Structural domain knowledge incorporation
One challenge we found in interpreting PHOENIX is that NeuralODEs have multiple solutions [31], many of which are inconsistent with our understanding of the process by which specific transcription factors (TFs) regulate the expression of other genes within the genome. Most solutions accurately represent gene-gene correlations, but do not necessarily reflect biologically established TF-gene regulation processes. Inspired by recent developments in physics-informed deep learning [24], we introduced biologically motivated soft constraints to regularize the search for a parsimonious approximation. The unregularized PHOENIX provides a purely expression-based approximation of the local derivative dg(t)/dt, but often we have additional structural information available about which TFs are more likely to regulate certain target genes. Hence, one can also formulate a domain-knowledge-informed prior model P* that yields a prior-based approximation P*(g(t)) of the same derivative.
By promoting our NeuralODE to flexibly align with such structural domain knowledge, we automatically searched for biologically more realistic models that still explained the observed gene expression data. To this end, we designed a modified loss function L_mod that incorporated the effect of the prior model P* using a set of K randomly generated expression vectors {γ_k ∈ R^n}, k = 1, ..., K. This induced a preference for consistency with prior domain knowledge.
While our modeling framework is flexible regarding the nature of the prior model P*, we incorporated a simple linear model, a common choice for chemical reaction networks or simple oscillating physical systems [32]. We used the adjacency matrix A of likely network structure based on prior domain knowledge (e.g. experimentally validated interactions, a motif map of promoter targets, etc.), with known activating and repressive edges set to +1 and -1 respectively. For cases where the sign (activating/repressive) was unknown, we randomly assigned +1 or -1 with uniform probability.
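A rough numpy sketch of such a prior-regularized loss follows; the exact functional form of L_mod is in the Supplement, so the linear maps, weights, and names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 5, 32  # genes and number of ghost vectors (illustrative)

# Prior adjacency A: +1 activating, -1 repressive, 0 no known edge
# (randomly generated here purely for illustration).
A = rng.choice([-1.0, 0.0, 0.0, 1.0], size=(n, n))

def P_star(g):
    return g @ A.T  # simple linear prior model of the derivative

# Stand-in for the NeuralODE's learned derivative (a linear map here).
W = rng.normal(scale=0.3, size=(n, n))
def f_hat(g):
    return g @ W.T

def modified_loss(g_obs, g_pred, lambda_data=1.0, lambda_prior=0.1):
    # Data term: predictions should match observed expression.
    data_term = np.mean((g_pred - g_obs) ** 2)
    # Prior term: on random "ghost" expression vectors gamma, the learned
    # derivative should stay close to the prior model's derivative.
    gamma = rng.uniform(size=(K, n))
    prior_term = np.mean((f_hat(gamma) - P_star(gamma)) ** 2)
    return lambda_data * data_term + lambda_prior * prior_term

g_obs = rng.uniform(size=n)
g_pred = rng.uniform(size=n)
loss_value = modified_loss(g_obs, g_pred)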

Results
We demonstrate the utility of PHOENIX for estimating gene expression dynamics by performing a series of in silico benchmarking experiments, in which PHOENIX exceeds even the most optimistic performance of popular black-box RNA dynamics estimation methods. We demonstrate the scalability of PHOENIX by applying it to genome-scale breast cancer samples ordered in pseudotime and investigate how scaling to the complete data set improves the representation of key pathways. Finally, we apply PHOENIX to yeast cell cycle data to show that it can capture oscillatory dynamics by flexibly deviating from Hill-like assumptions when necessary.

PHOENIX accurately and explainably learns temporal evolution of in silico dynamical systems
We began our validation studies with simulated gene expression time-series data, so that the underlying dynamical system producing the patterns of gene expression was known. We adapted SimulatorGRN [29,33] to generate time-series expression data from two synthetic S. cerevisiae gene regulatory systems (SIM350 and SIM690, consisting of 350 and 690 genes respectively). The activating and repressive interactions in each in silico system were used to synthesize noisy expression "trajectories" for each gene across multiple time points (see Methods 1.1 and 1.2). We split the trajectories into training (88%), validation (6%, for hyperparameter tuning), and testing (6%), and compared PHOENIX predictions on the test set against the "known"/ground-truth trajectories (detailed results in Supp. Results 1).
Since PHOENIX uses user-defined prior knowledge as a regularizer, we also corrupted the prior model at a level commensurate with the "experimental" noise level (see Supp. Methods 4.2), reflecting the fact that transcription factor-gene binding is itself noisy.

Figure 2 We applied PHOENIX to simulated gene expression data originating from two different in silico dynamical systems, SIM350 (top row) and SIM690 (bottom row), that simulate temporal expression of 350 and 690 genes respectively. Each simulated trajectory consisted of five time points (t = 0, 2, 3, 7, 9) and was subjected to varying levels of Gaussian noise (noise level σ/mean = 0%, 5%, 10%, 20%). Since PHOENIX uses a user-defined prior network model as a regularizer, we also corrupted the prior models by an amount commensurate with the noise level. For each noise setting we trained PHOENIX on 140 of these "observed" trajectories and validated on 10. The performance on the validation trajectories was used to determine the optimal value of λ_prior. We then tested the trained model (with the optimal choice of λ_prior) on 10 new test-set trajectories. We display both observed and predicted test-set trajectories for four arbitrary genes in both SIM350 and SIM690, across all noise settings, along with the mean squared error (MSE) between the predictions and the 10 test-set trajectories pre-noise.
We found that PHOENIX accurately learned the temporal evolution of the SIM350 and SIM690 data sets ( Figure 2) and was able to recover the true test set trajectories (that is, test set trajectories pre-noise) with a reasonably high accuracy even when the training trajectories and prior knowledge model had included high levels of noise. Furthermore, the shapes of the predicted trajectories (and hence the predicted steady state levels) obtained from feeding initial values (expression at t = 0) into the trained model remained robust to noise, suggesting that a trained PHOENIX model could be used to estimate the temporal effects of cellular perturbations.
Since the primary prediction engine of PHOENIX is a NeuralODE, we wanted to benchmark its performance relative to "out-of-the-box" (OOTB) NeuralODE models (such as RNAForecaster [6]) to understand the contributions of our modifications to the NeuralODE architecture. We tested a range of OOTB models (activation functions: ReLU, sigmoid, and tanh) where we adjusted the total number of trainable parameters to be similar to that of PHOENIX (see Supp. Methods 3.2). Because PHOENIX uses a domain prior of likely gene-regulation interactions in its optimization scheme, we also tested a version (PHX_0) where the weight of the prior was set to zero (λ_prior = 0). For each of SIM350 and SIM690, we observed that PHOENIX outperformed OOTB NeuralODEs on the test set in noiseless settings (Supp. Fig. 1 and Supp. Table 2). When we added noise, the PHOENIX models still generally outperformed the OOTB models, especially in the larger SIM690; the validation MSEs were more comparable between all models in the high-noise setting in SIM350. The consistently strong performance of PHOENIX suggests that a Hill-kinetics-inspired NeuralODE architecture better captures the dynamics of the regulatory process, in part because it models the binding kinetics of transcription factor-gene interactions.
In terms of the contribution of the prior to PHOENIX's performance, we observed that PHOENIX was generally outperformed by PHX_0, its unregularized version (Supp. Fig. 1 and Supp. Table 2). However, given that the prior can be interpreted as soft biological constraints on the estimated dynamical system [24], an important question is whether PHX_0 (as well as the OOTB models) makes accurate temporal predictions by correctly learning elements of the causal biology, or whether the lack of prior information results in an alternate learned representation of the dynamics which, despite predicting these particular trajectories well, does not explain the true biological regulatory process.
To this end, we recognized that the parameters of a trained PHOENIX model encode an estimate of the ground-truth gene regulatory network (GRN) that causally governs the system's evolution over time. We therefore inferred encoded GRNs from trained PHOENIX models and compared them to the ground-truth networks GRN_350 and GRN_690 used to synthesize SIM350 and SIM690 respectively (see Supp. Methods 2). Given PHOENIX's simple NeuralODE architecture, we were able to develop a GRN inference algorithm that could predict edge existence, direction, strength, and sign using just the model coefficients, without any need for time-consuming sensitivity analyses (unlike other approaches [2,12]). For comparison, we wanted to extract GRNs from the most predictive OOTB models; given their black-box nature, OOTB model GRNs had to be obtained via sensitivity analyses (see Supp. Methods 3.2).
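For a black-box model, a sensitivity analysis of this kind can be sketched as a finite-difference Jacobian of the learned vector field, averaged over sample points. This is an illustrative stand-in, not the procedure of Supp. Methods 3.2; the linear-inside-tanh field here is chosen only so the recovered Jacobian can be checked.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4  # genes (illustrative)

# A stand-in "black-box" vector field f: R^n -> R^n. A known linear map
# inside tanh lets us verify the Jacobian recovered at the origin.
W = rng.normal(size=(n, n))
def f(x):
    return np.tanh(x @ W.T)

def jacobian_fd(f, x, eps=1e-5):
    """Central finite-difference Jacobian J[i, j] = d f_i / d x_j at x."""
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

# Edge-strength scores: |Jacobian| averaged over sampled expression states
# (scores[i, j] ~ influence of regulator j on target gene i).
points = rng.uniform(-1, 1, size=(20, n))
scores = np.mean([np.abs(jacobian_fd(f, p)) for p in points], axis=0)
```

Each Jacobian evaluation costs 2n forward passes, which is why coefficient-based extraction is much cheaper at genome scale.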
To further assess this phenomenon, we computed the estimated model effect between every gene pair in SIM350, and compared these values between PHOENIX and PHX_0. We found that the incorporation of priors helped PHOENIX identify core elements of the dynamics and predict gene expression patterns in a biologically parsimonious manner (Supp. Fig. 2).
Since the inclusion of such static prior knowledge greatly increased the explainability of the inferred dynamics, we also investigated how explainability was affected by misspecification of the prior. In our in silico experiments, we had randomly corrupted (misspecified) the prior by an amount commensurate with the noise level (see Supp. Methods 4.2). We compared network representations of these misspecified prior constraints to GRNs extracted from the PHOENIX models that used these very priors. We found that PHOENIX was able to appropriately learn causal elements of the dynamics beyond what was encoded in the priors (Supp. Table 1). This suggests that even though the user-defined priors enhance explainability, PHOENIX can deviate from them when necessary and learn regulatory interactions from the data itself.

Figure 3 We extracted encoded GRNs from the trained PHOENIX models and the best-performing out-of-the-box NeuralODE models, for both in silico dynamical systems SIM350 (top row) and SIM690 (bottom row), across all noise settings. We compared these GRN estimates to the corresponding ground-truth GRNs used to formulate SIM350 and SIM690, and obtained AUC values as well as out-degree correlations (ρ_out). We also reverse-engineered a metric (C_max) to quantify how sparsely PHOENIX had inferred the dynamics (see Supp. Methods 2). Furthermore, we used these C_max values to obtain optimal true positive and true negative rates (TPR_max and TNR_max) that were independent of any cutoff value, allowing us to compare "best possible" networks across all settings.
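AUC values of the kind reported here can be computed from continuous edge scores and a binary ground-truth network via the rank-based (Mann-Whitney) formulation. A minimal numpy sketch, without tie handling and not the exact evaluation code used in the paper:

```python
import numpy as np

def auc_score(truth, scores):
    """Rank-based AUC: the probability that a true edge outranks a non-edge
    (equivalent to the Mann-Whitney U statistic; no tie handling)."""
    truth = np.asarray(truth, dtype=bool).ravel()
    scores = np.asarray(scores, dtype=float).ravel()
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = truth.sum(), (~truth).sum()
    return (ranks[truth].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy check: the first two "edges" are real, and the scores separate them
# perfectly from the non-edges, so the AUC is 1.
truth_toy = np.array([1, 1, 0, 0, 0])
perfect = auc_score(truth_toy, np.array([0.9, 0.8, 0.3, 0.2, 0.1]))
```

In practice `truth` would be the flattened ground-truth adjacency and `scores` the flattened edge strengths extracted from a trained model.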

PHOENIX exceeds the most optimistic performances of current black-box methods in silico
Having established PHOENIX models as both predictive and explainable, we compared its performance to other existing methods for gene expression ODE estimation in silico ( Table 1). As discussed earlier, these can be placed into two groups based on the input data. The "one-step" methods estimate dynamics by directly using expression trajectories; these include RNAForecaster [6] (which is an out-of-the-box NeuralODE), and PRESCIENT [4], among others [14,15]. PHOENIX is more similar to these methods.
"Two-step" methods such as Dynamo [3], RNA-ODE [5], and scDVF [12] estimate dynamics by first reconstructing RNA velocity using inputs such as spliced and unspliced mRNA counts and then estimating a vector field mapping expression to velocity. To avoid the need for uncommon input data types and to also emulate the theoretically optimal performance in the first step of these "two-step" velocity-based methods, we used the noiseless ground truth velocities as input into their second step since the true velocities were known in our in silico experiments (see Supp. Methods 3.1). Further, we used the validation set to optimize key hyperparameters of all the methods ( Table 1, right-most column) before finally testing predictive performance on expression values from held-out trajectories. Most of the methods also provide a means for extracting a gene network which we used to evaluate each method's explainability (see Supp. Methods 3.3).
In these comparisons, we found that the "one-step" trajectory-based methods generally yielded better predictions than the "two-step" velocity-based methods (although Dynamo sometimes achieved performance comparable to the one-step methods). This makes sense, since trajectory-based methods are optimized to predict gene expression trajectories, while velocity-based methods predict trajectories by first optimizing RNA velocity estimations [1]. Overall, PHOENIX outperformed even the optimistic versions of the black-box methods by large margins, both in terms of predicting gene expression (Supp. Table 2) and explainability (based on consistency with the ground-truth network; Supp. Table 3). We found that Dynamo was generally the most explainable competing method but that, in some settings, RNA-ODE and scDVF were more explainable. Finally, we found that the dynamics estimated by PHOENIX were much sparser than those of any other method, and that this sparsity remained fairly insensitive to noise levels (Supp. Table 4).
Further ODE estimation approaches (not included in our experiments) and their functionalities are discussed in the Supplement.

We tuned the prior weight (to λ_prior = 0.10) using the validation set to induce higher explainability by promoting a sparse GRN structure.
PHOENIX was able to learn the temporal evolution of gene expression across the yeast cell cycle, explaining over 84% of the variation in the test set (Figure 4, bottom-left). Notably, when we visualized the estimated dynamics by extrapolating from just initial values (expression at t = 0), we found that PHOENIX plausibly predicted continued periodic oscillations in gene expression, even though the training data consisted of only two full cell cycles (Figure 4, top). The amplitude of the predicted trajectories dampened across time points, which is expected given that yeast array data tends to exhibit underdamped harmonic oscillation during cell division, possibly reflecting de-synchronization of the yeast cells [36]. This performance on oscillatory dynamics is indicative of the high flexibility of PHOENIX, which inherits the universal function approximation property from NeuralODEs, allowing it to deviate from Hill-like assumptions when necessary while still remaining explainable due to the integration of prior knowledge.

PHOENIX infers genome-wide dynamics of breast cancer progression and identifies central pathways
Although there are a number of tools for inferring dynamics of regulatory networks, most do not scale beyond a few hundred genes, falling far short of the 25,000 genes in the human genome (Table 1). Given the performance improvements driven by PHOENIX's use of soft constraints, we wanted to test whether PHOENIX could be extended to human genome-scale networks. Due to the dearth of longitudinal human studies with genome-wide expression measurements, we used data from a cross-sectional breast cancer study (GEO accession GSE7390 [38]) consisting of microarray expression values for 22,000 genes from 198 breast cancer patients, and ordered these samples in pseudotime. For consistency in pseudotime ordering, we reused a version of this data that was already preprocessed and ordered (using a random-walk-based pseudotime approach) in the PROB paper [8].
We limited our analysis to the genes that had measurable expression and appeared in our regulatory prior, obtaining a single pseudotrajectory of expression values for n_g = 11165 genes across 186 patients, each at a distinct pseudo-timepoint. To explore whether PHOENIX's performance depends on the size of the data set, we created pseudotrajectories for n_g = 500, 2000, and 4000 genes by subsetting the data set to its n_g most variable genes. Similar to the yeast example, we split the 185 transition pairs into training (170, 90%), validation (8, 5%), and test (7, 5%) sets. For the domain prior network, we again used a simplistic prior model derived from a motif map of promoter targets (see Methods 3.2), and tuned λ_prior using the validation set.
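The construction of the smaller pseudotrajectories described above amounts to ranking genes by expression variance and keeping the top n_g. A minimal sketch (function name is ours, not from the PHOENIX codebase):

```python
import numpy as np

def most_variable_subset(expr, n_g):
    """Keep the n_g genes with the highest variance across samples.

    expr: (n_samples, n_genes) array of expression values.
    Returns the column indices of the selected genes (in their original
    order) and the subsetted expression matrix.
    """
    variances = expr.var(axis=0)
    # argsort is ascending, so take the last n_g indices, then re-sort
    # them to preserve the original gene ordering
    top = np.sort(np.argsort(variances)[-n_g:])
    return top, expr[:, top]
```

The same samples-by-genes matrix can then be passed through this function with n_g = 500, 2000, and 4000 to reproduce the nested gene sets.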
For each pseudotrajectory of size n_g, we fit a separate PHOENIX model and calculated the variation explained in the test set. We observed strong predictive performance, with test set R^2 values in the 97%-99% range (Figure 5, top). Even though PHOENIX's computational cost increased as we scaled to n_g = 11165 genes, the cost was not excessive even at this genome scale (see Supp. Table 7). This kind of feasible scalability to n_g > 10^4 genes is unprecedented among tools for estimating gene regulatory dynamical systems; other methods either focus on a subset of highly variable genes [6,7,12,14] or model dynamics in a less interpretable, lower-dimensional PCA/UMAP/latent space [3,4,13].
Next, we investigated PHOENIX's ability to identify biologically relevant and actionable information regarding gene regulation in breast cancer. First, we tested the ability of the learned dynamical system to reconstruct a gene regulatory network and predict TF-gene interactions. While the ground truth GRN is unknown, we can estimate performance by comparing a validation network of experimental ChIP-chip binding information [39] to a subnetwork of the encoded GRN of a trained PHOENIX model. We found good alignment between the two GRNs (AUC = 0.81-0.91) even when we scaled up to n_g = 11165 genes (Figure 5). It is important to note that the PHOENIX-based concordance with experimental data was generally greater than that obtained by comparing just the prior knowledge to the validation network (Supp. Table 1), indicating that PHOENIX was improving upon the GRN suggested by the prior knowledge, in addition to learning a dynamical model.
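One dependency-free way to compute such an AUC is the Mann-Whitney formulation: the probability that a validated edge receives a higher encoded score than a non-edge. This is a sketch of the comparison, not the paper's exact procedure (which is in Supp. Methods 2):

```python
import numpy as np

def grn_auc(encoded_scores, validation_edges):
    """AUC via the Mann-Whitney U statistic.

    encoded_scores: (n_tf, n_gene) matrix of edge scores from the
        trained model (absolute values are used, since signs encode
        activation vs. repression rather than confidence).
    validation_edges: binary (n_tf, n_gene) matrix of ChIP-supported
        interactions.
    """
    y = validation_edges.ravel().astype(bool)
    s = np.abs(encoded_scores).ravel()
    pos, neg = s[y], s[~y]
    # count pairwise comparisons; ties contribute 0.5
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to a random scoring of edges, and 1.0 to perfect separation of validated edges from non-edges.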
To better understand the benefits of PHOENIX's scalability, we investigated how estimating regulatory dynamics based on a subset of only the n_g most variable genes can alter the perceived importance of individual genes to the regulatory system in question. We reasoned that a model trained on all assayed genes should reconstruct biological information better than models restricted to a subset of genes [6,12,14]. First, we performed a gene-level analysis by perturbing the PHOENIX-estimated dynamical system from each value of n_g (500, 2000, 4000, 11165) in silico. This yielded "influence scores" representing how changes in the initial (t = 0) expression of each gene affected the subsequent (t > 0) predicted expression of all other genes (see Methods 3.3).
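The perturbation analysis can be sketched as follows: nudge each gene's initial expression, re-integrate the learned dynamics, and record how much the predicted trajectories of all other genes move. This is a simplified stand-in for the procedure in Methods 3.3 (names and the perturbation size are ours):

```python
import numpy as np

def influence_scores(predict, g0, delta=0.1):
    """Gene-level influence scores via in silico perturbation.

    predict: callable mapping an initial expression vector g(0) of shape
        (n_genes,) to predicted expression at later time points, shape
        (n_times, n_genes) -- e.g. a trained NeuralODE rolled forward.
    g0: baseline initial expression.
    delta: size of the perturbation applied to each gene in turn.
    """
    base = predict(g0)
    n = g0.shape[0]
    scores = np.zeros(n)
    for j in range(n):
        g_pert = g0.copy()
        g_pert[j] += delta
        diff = np.abs(predict(g_pert) - base)
        diff[:, j] = 0.0  # exclude the perturbed gene's own trajectory
        scores[j] = diff.mean()
    return scores
```

Genes whose perturbation propagates widely through the system receive high scores; genes whose perturbation decays without affecting others score near zero.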
As might be expected, the influence scores became increasingly concordant with centrality measures in the ChIP validation network as n_g increased, consistent with the key roles played by transcription factor genes in large GRNs (Supp. Table 7).
We observed that highly variable genes with known involvement in breast cancer (such as WT1 [40], ESR1 [41], AR [42], and FOXM1 [43]) were generally influential across all values of n_g (Supp. Fig. 3). Interestingly, both WT1 and FOXM1 were very influential in the n_g = 500 system, but their scores dropped in the full-genome (n_g = 11165) system. This is likely due to the way in which we constructed the smaller subsets of the whole genome, namely by selecting the most variable genes. One would expect the most variable transcription factor genes falling within any subset to be highly correlated in expression with other genes in the same set, and the overall effect to be diluted by adding more (potentially uncorrelated) genes to the system. It is more interesting that genes missing in the smaller subsets (due to low expression variability) were identified as central to the dynamics in the full (n_g = 11165) system. Among these genes are some encoding cancer-relevant transcription factors such as E2F1 [44,45] and CTCF [46], members of the TP53 family (TP73 [47,48]), DNA methyltransferase enzymes (DNMT1 [49]), and members of the KLF family [50].
We found that the more computationally manageable systems (n_g = 500, n_g = 2000) yielded an incomplete picture of gene-level influences, since the method used to construct these subsets hinders the mechanistic explainability of the resulting regulatory model. Certain genes exhibit relatively low variability in expression but are still central to disease-relevant genome-level dynamics; compared to methods that exclude such genes to make computation tractable [4,6,12], PHOENIX can correctly identify them as central because of its ability to model subtle but important genome-scale dynamics.

Finally, we performed a pathway-based functional enrichment analysis by translating these gene influence scores into pathway influence scores using permutation tests on the Reactome pathway database [51] (see Methods 3.4). Not surprisingly, the dynamical systems with fewer genes missed many pathways known to be associated with breast cancer that were identified as over-represented in the genome-scale (n_g = 11165) system (Figure 5 and Supp. Table 6). Notably, the pathways missed in the smaller networks include apoptosis regulation (a hallmark of cancer [52]), estrogen-related signalling (whose role in breast cancer is well documented [53]), regulation of beta cells (relevant for immune processes [54]), and TP53 regulation of caspases (relevant to apoptosis control in tumors [55]).
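The gene-to-pathway translation rests on a simple permutation test: is the mean influence score of a pathway's member genes higher than that of random gene sets of the same size? A sketch of this idea, simplified relative to the procedure in Methods 3.4 (function and parameter names are ours):

```python
import numpy as np

def pathway_influence_pvalue(scores, pathway_idx, n_perm=2000, seed=0):
    """One-sided permutation p-value for a pathway's influence.

    scores: per-gene influence scores, shape (n_genes,).
    pathway_idx: indices of the genes annotated to the pathway.
    Compares the pathway's mean score against the null distribution of
    mean scores for random gene sets of the same size.
    """
    rng = np.random.default_rng(seed)
    observed = scores[pathway_idx].mean()
    k = len(pathway_idx)
    null = np.array([
        rng.choice(scores, size=k, replace=False).mean()
        for _ in range(n_perm)
    ])
    # add-one correction so the p-value is never exactly zero
    return (1 + (null >= observed).sum()) / (1 + n_perm)
```

Running this over every Reactome gene set yields one influence p-value per pathway, which can then be compared across the n_g = 500, 2000, 4000, and 11165 systems.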
In a parallel analysis testing for functional enrichment of GO biological process terms, we again found the smaller systems to overlook important pathways that were clearly influential in the genome-scale analysis; these included developmental processes associated with epithelial tumor development, germ cell development (possibly linked to the regulation of cancer testis antigens), and a wide array of RNA metabolism processes that are increasingly recognized as being significant to breast cancer development [56] (Supp. Fig. 4). Similarly, for GO molecular function terms, the smaller gene subsets missed key processes such as estrogen receptor binding (Supp. Fig. 5).
These results clearly demonstrate the importance of scalable methods such as PHOENIX that can model genome-wide dynamics. The reduced gene sets from which we built the smaller PHOENIX models consisted of the 500, 2000, or 4000 most variable genes. These sets likely contain variable genes that are correlated with each other, meaning that we sample only a portion of the biological processes driving the temporal changes in breast cancer; the full picture only emerges when looking at regulatory processes across the spectrum of genes that can contribute. Alternative approaches, such as concentrating on specific pathways, risk introducing self-fulfilling biases into the discovery process. Similarly, methods that use low-dimensional embeddings to reduce the complexity of modeling dynamics risk losing valuable, biologically relevant insights. PHOENIX's scalability offers the best potential for discovery of interpretable insights with high explainability relative to the phenotypes under study.

Figure 5 (top panel): We applied PHOENIX to a pseudotrajectory of 186 breast cancer samples (ordered along subsequent "pseudotimepoints") consisting of n_g = 11165 genes [38]. We trained on 170 transition pairs, used 8 for validation (to tune λ_prior), and tested predictive accuracy on the remaining 7. We also repeated the analysis on smaller subsets of genes (n_g = 500, 2000, 4000), where we subsetted the full trajectory to only the n_g most variable genes in the pseudotrajectory.
We display both the predictive performance (R^2 on the test set) and the explainability performance (AUC from comparing encoded GRNs from trained models against a ChIP-seq validation network [39]).

Discussion
Given the importance of regulatory networks and their dynamics, there has been tremendous interest in inferring and modeling their physical and temporal behavior. NeuralODEs represent an extremely promising technology for inferring such networks, but so far, attempts to implement NeuralODE-based network modeling have encountered significant problems, not the least of which has been their inability to scale to modeling genome-wide dynamics in a biologically explainable manner.
PHOENIX represents an important new methodological extension to the NeuralODE framework that is not only scalable to the full human genome, but also biologically interpretable and able to explicitly capture both additive and multiplicative ways in which transcription factors cooperate in regulating gene expression. For simplified analyses, the underlying gene regulatory network can also be extracted from a learned model and compared with experimental evidence. An optional feature of PHOENIX that contributes significantly to its explainability is that it can be guided by (structural) domain knowledge. Notably, PHOENIX remains flexible enough to deviate from domain knowledge when necessary and to learn novel insights consistent with the training data.
PHOENIX's predictive accuracy, scalability, flexibility, and biological explainability can be attributed primarily to two features. First, our novel NeuralODE architecture, which uses Hill-like activation functions to capture the kinetic properties of molecular binding, provides a substantial advantage in predictive power. Second, the introduction of soft constraints based on prior knowledge of putative network structure leads to a scalable and biologically explainable estimate of the underlying dynamics.
Using simulated data, we have shown that PHOENIX outperforms other models for inferring regulatory dynamics (including other NeuralODE-based models), particularly in the presence of experimental noise. An application to data from the yeast cell cycle illustrates PHOENIX's flexibility in modelling arbitrary dynamics. Most importantly, PHOENIX is the only NeuralODE method capable of capturing genome-scale regulatory processes. Using data from breast cancer patients ordered in pseudotime, we illustrate not only PHOENIX's ability to faithfully model genome-scale networks, but also the power of extending regulatory modeling to capture seemingly subtle but biologically important regulatory processes.

Simulating time series gene expression data
For each n ∈ {350, 690}, we used the ground truth dynamical system SIM_n to generate expression vectors g(t) ∈ R^n across time points t. We started by sampling 160 i.i.d. standard-uniform vectors in R^n to act as initial (t = 0) conditions. We used these initial conditions to integrate SIM_n and obtain 160 expression trajectories across t ∈ T = {0, 2, 3, 7, 9} using R's deSolve package: {{g_i(t)}_{t∈T}}_{i=1}^{160}. We used only five time points to emulate potential scarcities of time-series information in real data sets, while the range t = 0 to 9 generally covered the transition from initial to steady state. Lastly, we added i.i.d. Gaussian noise vectors ε_i(t, σ) ~ N(0, σ^2 I) of varying σ to get noisy data sets {{g_i(t) + ε_i(t, σ)}_{t∈T}}_{i=1}^{160} for each σ ∈ S. Since the average simulated expression value was ≈ 0.5, using σ ∈ S = {0, 1/40, 1/20, 1/10} corresponded roughly to average noise levels of 0%, 5%, 10%, and 20%.
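The simulation protocol above can be sketched as follows. Since the exact equations of SIM_n are defined elsewhere, a toy relaxation system stands in for it here, and a simple Euler integrator stands in for deSolve:

```python
import numpy as np

def simulate_trajectories(f, n_genes, n_traj, times, sigma, dt=0.01, seed=0):
    """Integrate dg/dt = f(g) from Uniform(0,1) initial conditions,
    record expression at the requested observation times, and add
    N(0, sigma^2) measurement noise.

    Returns an array of shape (n_traj, len(times), n_genes).
    """
    rng = np.random.default_rng(seed)
    t_grid = np.arange(0.0, max(times) + dt, dt)
    out = np.zeros((n_traj, len(times), n_genes))
    for i in range(n_traj):
        g = rng.uniform(size=n_genes)  # standard-uniform initial state
        for t in t_grid:
            for j, t_obs in enumerate(times):
                if abs(t - t_obs) < dt / 2:  # snapshot at observation time
                    out[i, j] = g
            g = g + dt * f(g)  # forward Euler step
    out += rng.normal(scale=sigma, size=out.shape)
    return out

# toy stand-in for SIM_n: relaxation toward a fixed point at 0.5
decay = lambda g: 0.5 - g
data = simulate_trajectories(decay, n_genes=3, n_traj=160,
                             times=[0, 2, 3, 7, 9], sigma=1 / 20)
```

With the average expression near 0.5, sigma = 1/20 corresponds to the paper's ~10% average noise level; sigma = 0, 1/40, and 1/10 give the other three scenarios.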

Model setup for training and testing
For each simulation scenario, there were 160 simulated trajectories, of which we used 140 (88%) for training, 10 (6%) for validation (hyperparameter tuning), and 10 (6%) for testing. We provide details on the PHOENIX implementation (e.g. training strategy, prior incorporation) in Supp. Methods 1, and include finer technicalities (e.g. exact learning rates) in our GitHub repository [34]. For the prior domain knowledge model we used the simple linear model P*(γ_k) = A · γ_k − γ_k.

2 Testing on yeast cell cycle data

Data procurement and preprocessing

The expression data came from cell cycle-synchronized yeast cells ([35]), and consisted of two dye-swap technical replicates measured every five minutes for 120 minutes. Each of the two replicates was separately ma-normalized using the maNorm() function from the marray library in R/Bioconductor [58]. The data were batch-corrected [59] using the ComBat() function from the sva library [60], and probe sets mapping to the same gene were averaged, resulting in expression values for 5088 genes across fifty conditions. Two samples (corresponding to the 105-minute time point) were excluded for data-quality reasons, as noted in the original publication, and genes without motif information were then removed, resulting in an expression data set containing 48 samples (24 time points in each replicate) and 3551 genes. The expression values were then normalized to be between 0 and 1.

Model setup for training and testing
Given the extreme similarity between the two replicates in terms of gene expression values, using one replicate for training and the other for validation would have led to artificially good performance on the validation trajectory. Instead, we noted that the data set contained 46 distinct transition pairs, where a transition pair consists of two consecutive expression vectors in the data set, (g(t_i), g(t_{i+1})). We split these 46 transition pairs into training (40, 86%), validation (3, 7%), and test (3, 7%) sets. We provide details on the PHOENIX implementation (e.g. training strategy, prior incorporation) in Supp. Methods 1, and include finer technicalities (e.g. learning rate schedule, initialization scheme) in our GitHub repository [34].
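Building and splitting transition pairs can be sketched as below (a sketch only; the paper's actual randomization may differ from this uniform shuffle):

```python
import numpy as np

def transition_pairs(trajectory):
    """Turn an ordered expression matrix (n_timepoints, n_genes) into
    the list of consecutive (g(t_i), g(t_{i+1})) pairs."""
    return [(trajectory[i], trajectory[i + 1])
            for i in range(len(trajectory) - 1)]

def split_pairs(pairs, n_val, n_test, seed=0):
    """Randomly partition transition pairs into train/val/test lists."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(pairs))
    val, test = idx[:n_val], idx[n_val:n_val + n_test]
    train = idx[n_val + n_test:]
    return ([pairs[i] for i in train],
            [pairs[i] for i in val],
            [pairs[i] for i in test])
```

For the yeast data, 24 time points per replicate minus the per-replicate endpoints yield the 46 pairs, split here as 40/3/3.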
For the prior domain knowledge model we used the simple linear model P*(γ_k) = A · γ_k − γ_k. We based our choice of A on the regulatory network structure of a motif map, similar to that used in other methods such as PANDA [26]. We downloaded predicted binding sites for 204 yeast transcription factors [37]. These data include 4360 genes with tandem promoters, 3551 of which are also covered on the yeast cell cycle gene expression array. In total, 105 transcription factors in this data set target the promoter of at least one of these 3551 genes. The motif map between these 105 transcription factors and 3551 target genes provides the adjacency matrix A of 0s and 1s, where we randomly assigned each 1 to be either +1 (activating) or -1 (repressive).
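The final sign-assignment step can be sketched as follows (a minimal sketch; the function name is ours):

```python
import numpy as np

def signed_prior(motif_map, seed=0):
    """Turn a binary TF-by-gene motif map into a signed prior adjacency.

    Each putative edge (entry 1) is assigned +1 (activating) or -1
    (repressive) uniformly at random, since the motif data carry no
    information about the direction of regulation. Zeros stay zero.
    """
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=motif_map.shape)
    return motif_map * signs
```

The resulting signed matrix plays the role of A in the linear prior model above; the soft-constraint training then lets PHOENIX flip or prune signs that contradict the data.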
We used ChIP-chip data from Harbison et al. [37] to create a network of TF-target interactions, and used this as a validation network to test explainability. The targets of transcription factors in this ChIP-chip data set were filtered using the criterion p < 0.001. We calculated AUC values by comparing the encoded GRN retrieved from the trained models (see Supp. Methods 2) to the validation network.
3 Testing on breast cancer pseudotime data

Data procurement and pseudotime ordering
The original data set comes from a cross-sectional breast cancer study (GEO accession GSE7390 [38]) consisting of microarray expression values for 22000 genes from 198 breast cancer patients, which we aimed to sort along a pseudotime axis. The same data set was also used in the PROB paper [8]; PROB is a GRN inference method that infers a random walk-based pseudotime to sort cross-sectional samples and reconstruct the GRN. For consistency and convenience in pseudotime inference, we obtained the version of this data that was already preprocessed and sorted by PROB. We normalized the expression values to be between 0 and 1. We limited our analysis to genes that had measurable expression and appeared in our prior model, obtaining a pseudotrajectory of expression values for 11165 genes across 186 patients. We also created pseudotrajectories for n_g = 500, 2000, and 4000 genes by subsetting to the n_g highest-variance genes.
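The rescaling of each gene to the [0, 1] range is a standard per-gene min-max normalization, which can be sketched as (a sketch under the assumption that normalization is applied gene-wise across samples):

```python
import numpy as np

def minmax_normalize(expr):
    """Rescale each gene's expression to [0, 1] across samples.

    expr: (n_samples, n_genes) float array. Genes that are constant
    across all samples are mapped to 0 rather than dividing by zero.
    """
    lo = expr.min(axis=0)
    span = expr.max(axis=0) - lo
    span[span == 0] = 1.0  # guard against constant genes
    return (expr - lo) / span
```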

Model setup for training and testing
We noted that the data set contained 185 distinct transition pairs, where a transition pair consists of two consecutive expression vectors in the data set, (g(t_i), g(t_{i+1})). We split the 185 transition pairs into training (170, 90%), validation (8, 5%), and test (7, 5%) sets. Further implementation details can be found in Supp. Methods 1 and our GitHub repository [34].
For the prior domain knowledge model we used the simple linear model P*(γ_k) = W_0 · γ_k − γ_k. We based our choice of W_0 on a motif map, similar to that used in the breast cancer analysis in OTTER [61]. The network W_0 is derived from the human reference genome, specifically for breast tissue. W_0 is a binary matrix with entries W_0[i,j] ∈ {0, 1}, where 1 indicates a TF sequence motif in the promoter of the target gene. Sequence motif mapping was performed using the FIMO software [62] from the MEME suite [63] and the R package GenomicRanges [64]. Note that W_0 carries no sign information, so we cannot infer whether TFs inhibit or activate the expression of a gene; hence we assigned each non-zero entry to +1 or -1 with uniform probability.
Validation of explainability was challenging, since only a few data sets contain ChIP-seq data for many TFs from the same cells. We used ChIP-seq data from the MCF7 cell line (breast cancer, 62 TFs) in the ReMap2018 database [39] to create a validation network of TF-target interactions. We calculated AUC values by comparing the encoded GRNs retrieved from the trained models (see Supp. Methods 2) to the validation network.