Sparse Bayesian Identification of Nonlinear State-Space Systems via Stein Approximation Approach


This paper concerns the parameter estimation of nonlinear discrete-time systems in state-space form from noisy state measurements. A sparse Bayesian convex optimisation algorithm is proposed for parameter estimation and prediction. The proposed method takes full account of the data correlation, parameter priors, and constraints, all of which are explicitly modeled through Bayesian sparse learning and optimisation. The only assumption about the structure of the model is that only a few important terms govern the dynamics, so that the equations are sparse in the space of possible functions, and the number of basis functions is generated randomly. The main identification challenge resides in two aspects: first, a new objective function is obtained by the Stein approximation method within the convex optimisation problem. Second, a recursive least squares estimation with L1-regularization is developed, wherein the regularization parameter is selected from the viewpoint of optimisation. Compared with the maximum likelihood (ML) objective alone, the proposed objective usually captures more information about the dependence structure of the data. Three simulation examples are given to demonstrate the proposed algorithm's effectiveness. Furthermore, the performances of the approaches are analysed, including the RMSE of the parameter estimates, parameter sparsity, and the prediction of states and outputs.


Introduction
Estimating nonlinear systems is widely acknowledged to be both important and difficult [1,2]. As a result, there is a significant and active research effort focused on the issue. One important strand of this activity concentrates on particular system classes, such as neural networks [3], nonlinear ARMAX (NARMAX) [4], and Hammerstein-Wiener [5] models. Nonlinear functional forms can be expanded as sums of terms from a set of parameterized functions (see Sec. 2 and [6]); a common method for identifying nonlinear state-space models is to look for a concise description that is consistent with the available data among a collection of candidate nonlinear terms [7]. Classic functional decomposition methods such as the Volterra expansion, Taylor polynomial expansion, or Fourier series [6,8] provide a few options for basis functions. These methods are founded on the idea that there is a finite set of fundamental functions whose linear combination can be used to describe the dynamics of a system. The maximum likelihood (ML) approach is a common method for estimating nonlinear systems, as it is for linear dynamic systems. To compute the estimates, it is common to use a gradient-based search strategy such as the damped Gauss-Newton method [9]. Employing the Expectation Maximisation (EM) method [10] to produce ML estimates is an approach recently investigated in [11] in the context of bilinear systems. Various approximate EM-based techniques have also been studied in the past. In [12], the authors examined applying approximate solutions to the underlying nonlinear smoothing problem, often via Kalman filtering methods. A Gaussian process (GP) state-space model can also be used to describe the nonlinear model. In [13,14], both a Bayesian and a maximum likelihood estimation strategy are employed in addition to a competitive GP approximation method. In [15], a Monte Carlo technique is utilized for learning, which includes a variational method using the same GP approximation.
However, too many unknown parameters increase the complexity of the model. There are just a few papers that deal with the sparse-representation identification issue and multiple constraints on the system's parameters [17,25]. Different from previous work, our suggested framework allows us to include further constraints on the corresponding model parameters, e.g., inequality constraints, prior information, and Stein discrepancy constraints imposed among the parameters. The dynamical system discovery problem is reimagined in this paper from the perspectives of sparse regression [19,20,29], compressed sensing [21-23], Stein approximation theory [24], and convex optimization methods [25]. The use of sparsity approaches in dynamical systems is relatively new [26-28]. We exploit the fact that most nonlinear systems have only a few important dynamical terms, resulting in a sparse parameter set in a high-dimensional nonlinear function space.
Noise is an important factor affecting parameter estimation. When the input data of state measurements or their temporal derivatives are corrupted by noise, identifying reliable models is a key difficulty in model recovery via sparse identification of nonlinear dynamics (SINDy). Noisy measurements may lead to erroneous basis terms being identified and to poor parameter estimates in the model. The essential insight is that, for many systems of interest, the function f has only a few active terms, making it sparse in the space of potential functions. For model recovery, the least absolute shrinkage and selection operator (LASSO) [29], least angle regression (LARS) [30], sequentially thresholded least squares (STLS) [31], and basis pursuit denoising (BPDN) [32] are parameter-sparsity algorithms that can be applied. Recent advances in sparse methods, such as L0 and L1 norms, make this sparsity viewpoint attractive, because it is now feasible to detect which parameters are nonzero without undertaking a combinatorially arduous and time-consuming search. The resulting sparse model identification strikes a natural balance between model sparsity and precision, preventing the model from being overfit to the data. In nonlinear system identification, we use sparse regression to reduce the effect of noise and to anticipate critical events.
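As a concrete illustration of the sparsity-promoting regression mentioned above, the following sketch implements sequentially thresholded least squares (STLS) in the style of [31]; the dictionary, threshold value, and toy data are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def stls(F, y, threshold=0.1, iters=10):
    """Sequentially thresholded least squares (STLS), the core of SINDy:
    repeatedly solve least squares, then zero out small coefficients and
    refit on the surviving columns of the dictionary."""
    theta = np.linalg.lstsq(F, y, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(theta) < threshold
        theta[small] = 0.0
        big = ~small
        if big.any():
            theta[big] = np.linalg.lstsq(F[:, big], y, rcond=None)[0]
    return theta

# Toy demo: y depends on only 2 of 5 dictionary columns.
rng = np.random.default_rng(0)
F = rng.standard_normal((200, 5))
true_theta = np.array([1.5, 0.0, -2.0, 0.0, 0.0])
y = F @ true_theta + 0.01 * rng.standard_normal(200)
theta_hat = stls(F, y, threshold=0.2)
```

The thresholding step replaces a combinatorial subset search with a few cheap refits, which is why this viewpoint scales to large dictionaries.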
In this technical note we will discuss, from a statistical perspective, how Bayesian technology, optimization methods, and a Stein approximation strategy can mitigate the difficulties caused by large correlations in the state matrix. The following are the technical note's most important contributions: (1) A sparse Bayesian nonconvex optimisation algorithm is proposed for parameter estimation and prediction. The proposed method takes full account of the data correlation, parameter priors, and constraints, which are explicitly modeled through Bayesian sparse learning and optimisation. Moreover, a new objective function is obtained by the Stein method. Compared with the ML objective alone, it usually captures more information about the dependence structure of the data.
(2) We construct a nonconvex optimisation problem from a sparse Bayesian formulation of the nonlinear system identification problem with additive noise.
(3) Compared with other related methods, we take the Stein discrepancy of the parameters as an objective function, which is minimized in the nonconvex optimization problem. It captures more important information about the dependence structure of the data.
The rest of this paper is organized as follows. Section 2 describes the problem statement and background. Section 3 constructs the model in a Bayesian framework. In Section 4, nonconvex optimisation with a Stein constraint for identification is introduced. Three numerical illustrations, including the Narendra-Li model, a NARX model, and kernel state-space models (KSSM), are presented in Section 5. Finally, Section 6 gives some closing remarks.

Problem statement and background
This research addresses the problem of determining the parameters θ for certain categories of nonlinear state-space models. The discrete-time dynamics are given by the nonlinear state-space model

π(x_ik) = Σ_{r=1}^{N_i} θ_ir f_ir(x_k, u_k) + ν_ik,   i = 1, ..., n_x,   (1)

where x_k = [x_1k, ..., x_ik, ..., x_{n_x,k}]^T is the state variable at time k and u_k is the external control input. When the system is continuous-time, π(x_ik) = ẋ_ik; when the system is discrete-time, π(x_ik) = x_ik or x_ik − x_{i,k−1}. The noise ν_ik ∼ N(0, σ²_ik) (with σ_ik = 0 for i ≠ k) is assumed to be i.i.d. Gaussian. Each f_ir(x_k, u_k): R^{n_x+n_u} → R is Lipschitz continuous, n_x and n_u are the dimensions of x_k and u_k, respectively, and the θ_ir are the weights of the basis functions. Together, θ_ir and f_ir determine the dynamics. It is worth emphasizing that we make no assumptions regarding the form of the nonlinear functions on the right-hand side of (1). The output is o_ik.
If the system of interest can provide M data samples satisfying (1), the system in (1) can be represented as

Y_i = F_i θ_i + ν_i,   (2)

where Y_i = [π(x_i1), ..., π(x_iM)]^T ∈ R^M, θ_i = [θ_i1, ..., θ_iN_i]^T ∈ R^{N_i}, ν_i ∈ R^M, and F_i ∈ R^{M×N_i} is called the dictionary matrix. Its j-th column collects the j-th dictionary function evaluated at the M samples, [f_ij(x_1, u_1), ..., f_ij(x_M, u_M)]^T. The identification task in this framework is to find θ_i given the measured data in Y_i. This leads to a linear regression problem, in which the least squares (LS) method can be used if the model's nonlinear part is understood, i.e., F_i is known. In (1), we restrict attention to the identification problem for the nonlinear part g_i(x_k, u_k). Only a few frequently used nonlinear dynamical models must be considered, depending on the field. Because of the potential inclusion of non-relevant or non-independent dictionary functions in F_i, the solution θ_i to (2) is often sparse.
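To make the dictionary-matrix construction in (2) concrete, here is a minimal sketch with a hypothetical monomial dictionary; the specific basis functions and the toy data are assumptions for illustration only.

```python
import numpy as np

# Hypothetical dictionary of candidate basis functions f_j(x, u);
# the paper leaves the dictionary choice open, monomials are one common pick.
dictionary = [
    lambda x, u: np.ones_like(x),   # constant
    lambda x, u: x,                 # linear state term
    lambda x, u: u,                 # linear input term
    lambda x, u: x * u,             # bilinear term
    lambda x, u: x ** 2,            # quadratic term
]

rng = np.random.default_rng(1)
M = 300
x = rng.standard_normal(M)
u = rng.standard_normal(M)

# Dictionary matrix F (M samples x N candidate functions), as in Y = F theta + nu.
F = np.column_stack([f(x, u) for f in dictionary])
theta_true = np.array([0.0, 0.8, 0.0, -0.5, 0.0])    # sparse truth
Y = F @ theta_true + 0.01 * rng.standard_normal(M)

# Plain least squares works when F is known, but it does not return sparse theta.
theta_ls = np.linalg.lstsq(F, Y, rcond=None)[0]
```

Note that the LS solution spreads small nonzero values over the irrelevant columns, which is exactly what the sparsity-promoting machinery of the following sections is designed to avoid.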
The subscript i used to index the state variable is dropped hereafter for notational convenience, because the linear regression problems in (2) are independent.
3. Constructing the model in a Bayesian framework

Prior Selection for Parameter Sparsity
In Bayesian modeling, all unknowns are treated as random variables with specified distributions [33]. For Y = Fθ + ν in (3), the noise ν is independent and identically distributed (i.i.d.) Gaussian, i.e., ν ∼ N(0, βI) with β = σ². We can thus write the likelihood of the data as P(Y | θ) = N(Y; Fθ, βI). To impose sparseness on θ, the log-prior −(1/2α) θ^T θ is selected as a concave, non-decreasing function of |θ_j|, a class that includes Gaussian priors and t-distribution priors (see [34] for details).
Parameter θ is identified based on Y, and is not fixed; we substitute the expected value of θ for its true value. The posterior P(θ | Y) is heavily correlated and non-Gaussian, so computing the posterior mean E(θ | Y) is often difficult. To address this, P(θ | Y) is approximated by a Gaussian distribution. In [33], effective posterior computation algorithms are used for this computation.
Another method is to use super-Gaussian priors, in which the priors P(θ_j) are computed with variational EM algorithms [34]. We define hyperparameters α = [α_1, ..., α_N]^T ∈ R_+^N. The prior of θ can be written as P(θ; α) = Π_{j=1}^N N(θ_j; 0, α_j). Given the data Y, the posterior of θ can be represented as in [35]; the posterior mean m_θ and covariance Σ_θ are given by

m_θ = β^{-1} Σ_θ F^T Y,   Σ_θ = (β^{-1} F^T F + Λ^{-1})^{-1},   (4)

where Λ is the diagonal matrix diag[α]. Clearly, to maximize P(θ, α | Y), the most important question is how to select the best α̂. P(Y | θ) and P(θ; α) can be taken as prior information, so we consider only ∫ P(Y | θ)P(θ; α)dθ. Using type-II maximum likelihood [35], the marginal likelihood ∫ P(Y | θ)P(θ; α)dθ is maximised, and the selected α̂ is written as

α̂ = arg max_α ∫ P(Y | θ) P(θ; α) dθ.   (5)

After α̂ is computed in (5), the estimate of θ is obtained as θ̂ = m_θ evaluated at α̂. In other words, we pick the most likely hyperparameters α̂, which best explain the data Y.
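The type-II maximum likelihood selection of α̂ in (5) is typically carried out by fixed-point iteration. The sketch below alternates the posterior computation in (4) with the standard EM-style hyperparameter update α_j = m_j² + Σ_jj from the sparse Bayesian learning literature; the noise level and the toy problem are assumptions, not the paper's exact scheme.

```python
import numpy as np

def sbl(F, Y, beta=0.01 ** 2, n_iter=50):
    """Sparse Bayesian learning: alternate the Gaussian posterior in (4)
    with EM-style type-II ML updates of the prior variances alpha."""
    M, N = F.shape
    alpha = np.ones(N)
    for _ in range(n_iter):
        # Posterior covariance and mean (eq. (4), with Lambda = diag(alpha)).
        Sigma = np.linalg.inv(F.T @ F / beta + np.diag(1.0 / alpha))
        m = Sigma @ F.T @ Y / beta
        # EM update of the hyperparameters: alpha_j = m_j^2 + Sigma_jj.
        alpha = m ** 2 + np.diag(Sigma)
        alpha = np.maximum(alpha, 1e-12)  # numerical floor
    return m, alpha

# Toy problem: 8 candidate terms, only 2 active.
rng = np.random.default_rng(2)
F = rng.standard_normal((200, 8))
theta_true = np.zeros(8)
theta_true[[1, 4]] = [1.0, -1.5]
Y = F @ theta_true + 0.01 * rng.standard_normal(200)
m, alpha = sbl(F, Y)
```

As the iteration proceeds, the α_j of irrelevant terms collapse toward zero, driving the corresponding posterior means to zero, which is the mechanism behind the sparsity of θ̂.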

Stein operator selection and Stein constraint design
The approach can be sketched as follows for a target distribution P with support I. Find a suitable operator B := B_P (referred to as the Stein operator) and a large class of functions F_B = F(B_P) (referred to as the Stein class) such that Z has distribution P, denoted Z ∼ P, if and only if E[B_P f(Z)] = 0 for all functions f ∈ F_B. A Stein operator can be designed in a variety of ways [24]. In our framework, Stein's identity and the kernelized Stein discrepancy are crucial. Let P(θ) be a probability density function that is continuous and differentiable in θ. The Stein operator, abbreviated B_P, acts on the function φ(θ)Q(θ) and produces a zero-mean function B_P φ(θ)Q(θ) under θ ∼ P. Here Q(θ) is another smooth density, a given function related to P(θ).
For θ ∼ Q, the expectation E_{θ∼Q}[B_P φ(θ)] is related to the difference between P and Q. The probability distances between P(θ) and Q(θ) over a suitable function set F_B are defined through these expectations.
The discriminative power and computational tractability of the Stein discrepancy are determined by the chosen function set F_B. Traditional definitions of F_B include sets of functions with bounded Lipschitz norms, which creates a difficult functional optimization problem that is either computationally intractable or necessitates special-purpose treatment.
By choosing a proper Q(θ) and optimizing φ over the unit ball of a reproducing kernel Hilbert space (RKHS), the kernelized Stein discrepancy (KSD) avoids this challenge. The KSD is defined as

S(Q, P) = max_{φ ∈ H, ||φ||_H ≤ 1} E_{θ∼Q}[B_P φ(θ)].   (8)

The optimal solution of (8) is proportional to E_{θ'∼Q}[B_P k(θ', ·)]. For any fixed θ', the kernel function k(θ, θ') belongs, as a function of θ, to the Stein class F_B. Moreover, S(Q, P) = 0 if and only if P = Q whenever the kernel is strictly positive definite. Because of its decaying property, the RBF kernel belongs to the Stein class of smooth densities on θ ⊆ R^d. Both the Stein operator and the KSD depend on P(θ) only through the score function ∇_θ log P(θ), which is insensitive to the normalising constant of P, so that constant need not be considered. As a result of this property, Stein's identity can be used to handle unnormalized distributions effectively.
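A minimal numerical sketch of the KSD with an RBF kernel may help: for a one-dimensional target P, the discrepancy depends on P only through its score function, as noted above. The bandwidth, sample sizes, and test densities below are illustrative assumptions.

```python
import numpy as np

def ksd_rbf(samples, score, h=1.0):
    """V-statistic estimate of the kernelized Stein discrepancy between the
    empirical distribution of `samples` and a density P, given via its score
    function score(x) = d/dx log P(x), using a 1-D RBF kernel of bandwidth h."""
    x = samples[:, None]                       # column vector of samples
    y = samples[None, :]                       # row vector of samples
    d = x - y
    k = np.exp(-d ** 2 / (2 * h ** 2))         # RBF kernel matrix
    dkdx = -d / h ** 2 * k                     # derivative of k w.r.t. first arg
    dkdy = d / h ** 2 * k                      # derivative of k w.r.t. second arg
    d2k = (1.0 / h ** 2 - d ** 2 / h ** 4) * k # mixed second derivative
    sx, sy = score(x), score(y)
    # Stein kernel: u(x, y) = s(x)s(y)k + s(x)dk/dy + s(y)dk/dx + d2k/dxdy.
    u = sx * sy * k + sx * dkdy + sy * dkdx + d2k
    return u.mean()

rng = np.random.default_rng(3)
score_std_normal = lambda t: -t                # score of N(0, 1)
ksd_match = ksd_rbf(rng.standard_normal(500), score_std_normal)
ksd_mismatch = ksd_rbf(rng.standard_normal(500) + 2.0, score_std_normal)
```

Samples drawn from the target give a near-zero KSD, while samples from a shifted density give a clearly positive value, which is the discriminative property exploited by the Stein constraint.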
Let K(θ_j, ·) = k(θ_j, ·)Q(θ_j). From the previous subsection, the prior distribution of θ_j can be written as P(θ_j; α_j) = N(θ_j; 0, α_j). The Stein statistic T_j can then be re-expressed in terms of K and this prior, and its expectation E[T_j] computed accordingly, giving (12). Another technique for picking α̂ is to make it the minimizer of E[T].

Sparse Parameter Identification with Constraints from Data
In (12), E[T] serves as another objective function, which makes the model parameters less dependent on noisy data and prevents overfitting. In this section, the problem of system identification with convex constraints is given a sparse Bayesian formulation, which is then handled as a nonconvex optimisation problem. To obtain a better parameter α, the new objective function is constructed as in (13).

Objective Function in parameter identification
Theorem 1: Denote the objective function by J_α(α).
By minimising J_α(α), the optimal hyperparameters α̂ in (13) can be derived, where p(α_j) = −2 log ϕ(α_j). The mean of θ is calculated and represented as θ̂ = ΛF^T(βI + FΛF^T)^{-1}Y.
Proof: Using the Woodbury inversion identity, re-express m_θ and Σ_θ in (4) as

m_θ = ΛF^T(βI + FΛF^T)^{-1}Y,   Σ_θ = Λ − ΛF^T(βI + FΛF^T)^{-1}FΛ.

Because the data likelihood P(Y | θ) is Gaussian, the integral in (13) can be expressed in closed form, and the data-dependent term E(Y) again follows from the Woodbury inversion identity.
For the sake of calculation, we evaluate the integral of exp{−E(θ)} and then apply a −2 log(·) transformation to (17), which gives

From (13), we then obtain
To acquire an approximation of θ, we compute the posterior mean m_θ. The objective function in (14) is nonconvex.

Modified objective function in θ estimation
We use a modified objective function for θ with a penalty function. By analyzing the corresponding objective function of (14) in the α space, the analogous objective function in θ is subsequently shown to be nonconvex as well.
Theorem 2: Solving the optimisation problem below yields the estimated value of θ under the given restrictions. min

Proof of Theorem 2
Using the data-dependent term in (21) and J_α(α) in (14), a strict upper-bounding auxiliary function J_{α,θ}(α, θ) can be constructed on J_α(α). Minimising J_{α,θ}(α, θ) over α, we can see from the derivations in (21) that the posterior mean m_θ is the estimate of the parameter θ.
Lemma 2: In Theorem 2, the penalty function r(θ) promotes sparsity on the weights by being a non-decreasing, concave function of |θ|.

Parameter estimation with sparse method
From (23), we can see that ρ*(α*) does not affect the estimation of the parameters α, so J_{α,θ}(α, θ) is redefined accordingly. For a fixed α*, we notice that J_{α*}(α, θ) is jointly convex in θ and α. From (27), we can obtain the minimising α_j for any θ. Substituting this α_j into J_{α*}(α, θ), θ̂ can be obtained as follows. Due to the concavity of r(θ), the objective function in Theorem 2 can be optimised using a re-weighted L1-minimisation, in a similar way to that considered in (27).
In order to obtain a more stable and accurate parameter estimate θ̂, a re-estimation method is put forward (Algorithm 1). At the k-th iteration, the modified weight u^(k) is then supplied. On the basis of the above, we can now describe how the parameters are updated. To begin, we set the iteration count k to zero and initialise u^(0). Consider J_{α,θ}(α, θ) again. For any fixed α and θ, the tightest bound can be obtained by minimising over α*. The estimate of α* equals the gradient of the function ρ(α) in Lemma 1, i.e., α̂* = ∇_α ρ(α). The optimal α*(k+1) can then be replaced by its estimate α̂*(k+1), after which each α_j^{*(k+1)} is computed. Algorithm 1 summarizes the above procedure.
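The re-weighting scheme behind Algorithm 1 can be sketched as an iteratively reweighted L1 solver. The inner solver below (plain ISTA), the weight rule u_j = 1/(|θ_j| + ε), and the toy data are assumptions standing in for the paper's exact updates.

```python
import numpy as np

def weighted_lasso_ista(F, y, weights, lam, n_iter=500):
    """Solve min_theta 0.5||y - F theta||^2 + lam * sum_j weights_j |theta_j|
    by ISTA (proximal gradient); a simple stand-in for any weighted L1 solver."""
    L = np.linalg.norm(F, 2) ** 2          # Lipschitz constant of the gradient
    theta = np.zeros(F.shape[1])
    for _ in range(n_iter):
        grad = F.T @ (F @ theta - y)
        z = theta - grad / L
        thr = lam * weights / L
        theta = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)  # soft threshold
    return theta

def reweighted_l1(F, y, lam=0.1, eps=1e-3, n_outer=5):
    """Iteratively reweighted L1: weights u_j = 1/(|theta_j| + eps), which
    mimics a concave penalty r(theta) as in Theorem 2."""
    weights = np.ones(F.shape[1])
    theta = np.zeros(F.shape[1])
    for _ in range(n_outer):
        theta = weighted_lasso_ista(F, y, weights, lam)
        weights = 1.0 / (np.abs(theta) + eps)
    return theta

# Toy data: 10 candidate terms, 2 active.
rng = np.random.default_rng(4)
F = rng.standard_normal((150, 10))
theta_true = np.zeros(10)
theta_true[[2, 7]] = [1.2, -0.9]
y = F @ theta_true + 0.01 * rng.standard_normal(150)
theta_hat = reweighted_l1(F, y)
```

The reweighting makes small coefficients progressively more expensive to keep, so the iteration drives irrelevant weights exactly to zero while leaving the large coefficients nearly unbiased.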

Numerical examples
We present three numerical examples: the Narendra-Li model [38], a NARX model [37], and kernel state-space models [39]. The utility and performance of Algorithm 1 are demonstrated on these three simulation cases, which include a well-studied and challenging nonlinear system. The root mean squared error (RMSE) criterion is utilized to demonstrate the performance of the suggested identification approach against noise perturbation.

Algorithm 1: Nonlinear Identification Algorithm
Input:
1: Generate time-series data from the system of discrete-time dynamics characterized by (1).
2: Choose the dictionary functions that will be used to build the dictionary matrix described in Section 2.
Initialization: Solve the minimisation problem with L1-regularization and the optimization method on θ.
Update: Update the parameter α.
Output: The sparse weight set of θ.

The i-th estimate of the parameter θ in the Monte-Carlo experiment is denoted by θ̂_i. The RMSE of the experiment is defined as

RMSE = sqrt( (1/n) Σ_{i=1}^n ||θ̂_i − θ*||² ),

where θ* represents the true system parameter vector and n is the number of trials. To validate the theoretical results, the identification of the structured state-space model is simulated in the cases below.
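The Monte-Carlo RMSE criterion can be computed as follows; since the paper's exact normalisation is not shown, this sketch adopts one common reading of the definition.

```python
import numpy as np

def rmse(theta_estimates, theta_star):
    """RMSE over n Monte-Carlo trials: the square root of the mean squared
    estimation error ||theta_hat_i - theta*||^2 (one plausible reading of
    the definition above; the paper's normalisation is an assumption)."""
    errs = [np.sum((th - theta_star) ** 2) for th in theta_estimates]
    return np.sqrt(np.mean(errs))

theta_star = np.array([1.0, -2.0, 0.0])
trials = [theta_star + 0.1, theta_star - 0.1]   # two toy Monte-Carlo estimates
val = rmse(trials, theta_star)
```

Here each toy estimate is off by 0.1 in every component, so the squared error per trial is 3 × 0.01 = 0.03 and the RMSE is sqrt(0.03) ≈ 0.173.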

Example 1: Parameter identification for a general Narendra-Li model
Consider the state space representation of a nonlinear system.
where ξ_i(t) is Gaussian white noise. To generate the estimation data, the system is excited with a uniformly distributed random input signal u(t) ∈ [−2.5, 2.5] for 1 ≤ t ≤ 500. The validation data set is generated with the input u(t) = sin(2πt/10) + sin(2πt/25), t = 1, ..., 500. Because there are two state variables, the dictionary matrix Φ can be built as follows.
The output can then be defined accordingly. Using the dictionary matrix Φ in (34), the true values of the parameters θ_i for the model in (35) are the values shown in brackets. In our study, we used T = 1000 data samples for learning and added white Gaussian measurement noise of level ξ_i(t) = 0.1 to the training data. We apply Algorithm 1 to (32) and (33). The coefficients θ^(j) to learn are those in (35), with prior data. The RMSE of θ computed in the simulation is 0.039, which compares favourably with the significantly larger simulation errors reported in [38]. In the more interesting scenario of measurement noise in the training data, we obtain a result nearly identical to that for noise-free data.

Table 1: Comparison of RMSE and number of data points.
Method | RMSE | Data points
[38] | 0.06 | 2000
DWO [40] | 0.43 | 50000
MOD [41] | 0.46 | 50000
AHH [42] | 0.31 | 2000
MARS [42] | 0.49 | 2000
We compare our method with some previous results reported in the literature [38,40-42]. In this experiment, Algorithm 1 was executed 8 times and the average RMSE computed. The RMSE of y_t is 0.06. From Table 1, we can see that our method performs the best.
We also examine how well the method performs for different noise levels ξ_i(t) = 0.1, 0.3, and Fig. 2 shows that Algorithm 1 is robust: the RMSE does not increase quickly. In our experiment, 60 testing points are selected and the predicted results are compared with the true values. This gives satisfactory performance, and it is clear that the proposed model is capable of describing the system behavior well.

Example 2: Application to a NARX model
In this example, we analyze the following polynomial-terms model for a single-input single-output (SISO) nonlinear autoregressive system with exogenous input (NARX) [37].
with y, u, ξ ∈ R. In expanded form, we may write (36) as

y(t_{k+1}) = w^T f(y(t_k), ..., y(t_{k−m_y}), u(t_k), ..., u(t_{k−m_u})) + ξ(t_k),   (37)

so that model (36) is the general form of (37). Here d_y and d_u are the degrees of the output and input; m_y and m_u are the given memory orders of the output and input; w^T = [w_1, ..., w_N] ∈ R^N is the weight vector; and f(·) = [f_1(·), ..., f_N(·)]^T ∈ R^N is the vector of functions. Taking the NARX model (36) as an example, we set d_y = 1, d_u = 2, m_y = 2, m_u = 2. This yields f(·) ∈ R^28 and thus w ∈ R^28. Only four of the twenty-eight weights w_i are nonzero. The estimated parameter w agrees with the true value, as shown in Fig. 3. The prediction performance of Algorithm 1 is shown in Fig. 4; the predicted and exact trajectories match well for both noise levels ξ_i(t) = 0.1, 0.3, with RMSEs of 0.021 and 0.074, respectively.
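For concreteness, a polynomial NARX regressor built from three lagged outputs and three lagged inputs, expanded to total degree 2, indeed has 28 terms. The construction below is a uniform-degree sketch; the paper's dictionary (with d_y = 1, d_u = 2) may order or select terms differently.

```python
import numpy as np
from itertools import combinations_with_replacement

def narx_features(y_hist, u_hist, degree=2):
    """Build a polynomial NARX regressor vector f(.) from lagged outputs and
    inputs: all monomials of the lagged variables up to `degree`, including
    the constant term (a common construction; the paper's exact 28-term
    dictionary is not spelled out)."""
    z = np.concatenate([y_hist, u_hist])           # all lagged variables
    feats = [1.0]                                  # constant term
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(z)), d):
            feats.append(np.prod(z[list(idx)]))    # monomial of total degree d
    return np.array(feats)

# Three lagged outputs and three lagged inputs -> 6 variables;
# monomials up to degree 2: 1 + 6 + 21 = 28 terms.
f = narx_features(np.array([0.5, -0.2, 0.1]), np.array([1.0, 0.0, -1.0]), degree=2)
```

With six lagged variables, the degree-2 expansion contributes C(7, 2) = 21 quadratic monomials on top of the 6 linear terms and the constant, matching the stated dimension w ∈ R^28.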

Example 3: Kernel state-space models (KSSM) for autoregressive modeling
Kernel state-space models (KSSM) are autoregressive models satisfying a τ-order difference equation. The model may be described as the first-order multivariate process

x̃_{t+1} = F_t(x̃_t) + V_t,   (38)

where x̃_t = [x_t, ..., x_{t−τ+1}]^T, F_t(x̃_t) = [f_t(x_t, ..., x_{t−τ+1}), x_t, ..., x_{t−τ+2}]^T, and V_t = [ξ_t, 0, ..., 0]^T. The hidden state of an SSM can then be viewed as the process x̃_t, producing an SSM formulation of a complex autoregressive model with noise. By using nonlinear autoregressive modeling with a fixed number of delayed samples, the model can be utilized to predict time series. In addition, if the state-transition function f_t is defined using kernels as in (39), we obtain the suggested KSSM suited for autoregressive time series,
where Y_t is the observed process, h(x̃_t) is the observation function, V_t is the observation noise, and w = [w_1, ..., w_N] is the weight vector. Periodic time series are widely used in physics, engineering, and biology, so we take Fourier kernel functions in the KSSM and consider 5 candidate kernel functions for k_i(·): sin x_i, cos x_i, x_i, sin 2x_i, and cos 2x_i.
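The five candidate Fourier-type kernels can be collected into a small dictionary; the sketch below shows f_t as a weighted expansion over them, with hypothetical weights standing in for what Algorithm 1 would estimate.

```python
import numpy as np

# The five candidate kernel functions listed above, applied to a delayed sample.
kernels = [np.sin, np.cos, lambda x: x,
           lambda x: np.sin(2 * x), lambda x: np.cos(2 * x)]

def f_t(x, w):
    """State-transition function as a weighted kernel expansion (a minimal
    sketch for one scalar delayed sample; the weights w are what the
    identification algorithm would estimate from data)."""
    return sum(wi * k(x) for wi, k in zip(w, kernels))

# Hypothetical periodic transition: x_{t+1} = 0.9 sin(x_t) + 0.3 cos(2 x_t).
w_true = np.array([0.9, 0.0, 0.0, 0.0, 0.3])
x_next = f_t(1.0, w_true)
```

A sparse w then selects which of the five Fourier terms actually drive the dynamics, mirroring the dictionary-selection viewpoint of the earlier sections.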
The estimation data in the experiment has 500 sample points, and Fig. 5 shows the simulated outputs of the two processes evaluated on the validation set. In the figure, we compare the true and estimated values of w using the probability distribution and the scatter distribution. The RMSE of w is 0.16, which is a satisfactory result.

Conclusion
The parameter estimation of nonlinear discrete-time systems from noisy state data in state-space form is the subject of this work. For parameter estimation and prediction, a sparse Bayesian convex optimisation technique was presented. The suggested method considers data correlation, parameter priors, and constraints, all of which are explicitly modeled using Bayesian sparse learning and optimization. The sole assumption made regarding the model's structure is that just a few key terms govern the dynamics, resulting in equations that are sparse in the space of potential functions; the number of basis functions is produced at random. The identification procedure is divided into two parts. The first step is to create a reweighted L1-regularized least squares solver, with the regularization value chosen from the optimization viewpoint. In the second step, the Stein approach is used to create a new objective function, which usually captures more information about the dependence structure of the data than the maximum likelihood (ML) objective alone.