Reinforcement Learning based Optimal Control for Constrained Nonlinear Systems via a Novel State-Dependent Transformation

This paper focuses on developing an optimal controller for strict-feedback nonlinear systems with or without asymmetric time-varying full-state constraints. A novel nonlinear state-dependent transformation function is presented, by which the strict-feedback nonlinear system with state constraints is transformed into a new strict-feedback system in which the state constraints are implicit. The optimized backstepping technique is utilized to develop the optimal controller for the new strict-feedback system to track the desired reference signal without the feasibility conditions. Reinforcement learning (RL) is exploited to implement the optimal control in every step, where the identifier, critic, and actor networks are used to estimate the unknown system dynamics, evaluate the control performance, and generate the control output, respectively. It is theoretically proved that all the signals in the closed-loop system are bounded and that the proposed optimal controller can track the desired signal with or without time-varying asymmetric full-state constraints. Two simulation examples are presented to demonstrate the efficacy of the proposed scheme.


Introduction
Most practical systems are subject to various forms of constraints, due to physical limitations, system performance requirements, or security considerations, as in autonomous vehicles [1,2] and robotic systems [3,4], which makes handling constraints an important area of research in control design [5,6]. If those constraints are not properly accommodated, the result may be inaccurate control, system instability, and sometimes unexpected accidents, which makes the constrained control of nonlinear dynamic systems an extremely crucial and competitive topic.
There have been various approaches in the literature to address state constraints, such as reference governors [7], set invariance [8], and model predictive control [9]. Notably, a large array of improvements has been achieved in the last few years by utilizing the Barrier Lyapunov Function (BLF) or integral Barrier Lyapunov Function (iBLF) to address output or state constraints of nonlinear systems, see [10][11][12][13][14]. In [10] and [15], Tee et al. gave the definition of the BLF and used both symmetric and asymmetric Barrier Lyapunov Functions to address state constraints for nonlinear systems, where the state constraints are time-invariant. Full-state constraints in strict-feedback systems were considered by Liu and Tong et al. in [16], who proposed an adaptive neural network control scheme. Subsequently, pure-feedback and stochastic nonlinear systems with full-state constraints were also considered using adaptive control techniques by Liu and Tong in [17,12]. In [13], Li et al. designed a time-varying asymmetric BLF candidate to cope with time-varying full-state constraints. By developing an adaptive fuzzy controller, Sun et al. tackled a class of nontriangular structural stochastic switched nonlinear systems with full-state constraints based on the Barrier Lyapunov Function [18]. In [14], Barrier Lyapunov Functions were constructed to ensure that the constraints are not transgressed for a class of uncertain nonlinear systems with full-state constraints. A high-order tan-type Barrier Lyapunov Function was constructed to handle full-state constraints by Sun et al. in [19]. However, the current adaptive backstepping methods based on the BLF (or iBLF) have to fulfill a feasibility condition of no violation by the virtual controller, which means that the virtual controller must stay within a prespecified constraint interval [20]. The feasibility condition makes the design more difficult for strict-feedback and pure-feedback nonlinear systems with state constraints.
An offline parameter optimization method is typically employed to satisfy the feasibility conditions of the virtual controller at each step, which brings additional computational cost and a complex design procedure [21].
Therefore, new techniques that do not require the employment of a BLF need to be formulated to overcome the constraint problem and thus avoid the feasibility condition. Fortunately, some state mapping or transformation based methodologies have been proposed to solve the output or state constraint issues [22,23,20,[24][25][26]. In [22] and [23], a nonlinear mapping and a one-to-one nonlinear mapping, respectively, were proposed to transform the original system into a new system without constraints. However, the drawback of this type of method is that when the constraint is asymmetric, the mapped variable is not zero when the state is zero; that is, this type of method cannot handle asymmetric constraints well. Zhao and Song constructed a nonlinear state-dependent function that depends only on the constrained states and addressed asymmetric full-state constraints directly in [20]. In [27], Zhao and Song et al. extended their work to handle the constraints without any additional effort. Cao and Song et al. [28] used the same technique to propose a robust control scheme dealing with asymmetric and time-varying full-state constraints for pure-feedback systems. Li and Liu et al. [29] removed the feasibility conditions for nonlinear stochastic systems using a natural-logarithm-type nonlinear mapping. Liu and Zhang et al. [25] introduced a nonlinear state-dependent function to handle asymmetric time-varying full-state constraints for nonstrict-feedback nonlinear systems. A new general constraint function was introduced for uncertain pure-feedback systems, uniformly considering the cases with or without state constraints, by Cao and Wen in [21]. Yao and Tan et al. developed an adaptive fuzzy control for constrained stochastic nonlinear systems by using a nonlinear state-dependent transformation in [30]. However, none of the above mentioned methods considers the optimization performance of the controller, especially when using adaptive backstepping techniques.
Optimal control, a much-discussed philosophy in control theory and engineering in recent years, focuses on optimizing a cost to achieve maximum control performance [31]. In theory, the optimal controller can be obtained by solving the Hamilton-Jacobi-Bellman (HJB) equation; however, this equation is very difficult to solve due to its inherent nonlinearity [32,33]. Adaptive Dynamic Programming (ADP) or Reinforcement Learning (RL) is a prospective means of addressing the solution of the HJB equation [32,33,31,34]. Adaptive backstepping control combined with ADP or RL is a way to develop optimal controllers for strict-feedback systems, and such approaches can be divided into two categories. In the first category, the optimal problem is transformed into an equivalent optimal regulation problem by modifying the standard backstepping technique [35][36][37]. Recently, a new optimized backstepping method was developed in [38][39][40][41], whose core idea is to use ADP or RL for optimization at each step of the backstepping design procedure. In [38], Wang and Liu et al. proposed an equivalent optimal controller for strict-feedback systems, obtained via the Sontag feedback formula, and broadened the backstepping technique. Wen et al. [39] implemented a reinforcement learning algorithm with the identifier-actor-critic architecture based on fuzzy logic system (FLS) approximators for multiagent systems with unknown nonlinear dynamics. Then, Wen developed an optimized backstepping scheme for a class of strict-feedback systems in [42]. In [43,40], Wen and Liu et al. developed simplified optimized backstepping control schemes for a class of nonlinear strict-feedback systems with unknown dynamics and for perturbed nonlinear systems, respectively. In [44,45], optimized backstepping control schemes were extended to multi-agent systems. Nevertheless, none of the above mentioned literature considered state constraints, let alone time-varying asymmetric state constraints.
Although [46,47] considered state constraints, they assume that the state constraints are known constants and use a barrier optimization performance function to avoid them, which means that those methods cannot address time-varying asymmetric full-state constraints.
Motivated by the above literature and discussion, this paper concentrates on the optimized backstepping control of strict-feedback systems with time-varying asymmetric full-state constraints. First, a novel nonlinear state-dependent function (NSDF) is developed and used to transform the original strict-feedback system with time-varying asymmetric state constraints into a new system without constraints. Then, a simplified optimized backstepping control scheme is utilized to develop a tracking controller for the new system, in which reinforcement learning algorithms containing identifiers, critic networks, and actor networks are deployed to acquire the optimal controllers. Our contributions are outlined as follows.
1. In contrast to the typical BLF (iBLF) based approach to state constraints, which involves a feasibility condition that the virtual controller must satisfy, a novel NSDF is formulated to directly address the time-varying asymmetric full-state constraints without the feasibility condition. Moreover, it can cope with the cases with or without state constraints simultaneously, and the steady-state tracking error is not affected by the proposed novel NSDF.
2. Compared to the existing literature addressing state constraints via the backstepping technique, in this paper an optimal controller is developed for systems with time-varying asymmetric full-state constraints based on the optimized backstepping technique. In addition, reinforcement learning algorithms are used to yield the optimal controller, and the restriction of the persistent excitation condition is relaxed.
The remainder of the article is organized as follows. In Section 2, the formulation of the problem and brief fundamentals are given. The novel nonlinear state-dependent function and the optimal controller design procedure are presented in Section 3. The stability and performance analysis of the proposed scheme is shown in Section 4. In Section 5, two simulation examples are demonstrated. Finally, the conclusion is summarized in Section 6.

Problem Description
A class of nonlinear strict-feedback systems with time-varying asymmetric full-state constraints is considered.
The nonlinear system is described as:

ẋ_i = f_i(x̄_i) + x_{i+1}, i = 1, . . . , n − 1,
ẋ_n = f_n(x̄_n) + u,
y = x_1,

where x̄_i = [x_1, . . . , x_i]^T ∈ R^i are the state vectors of the system, u is the system input, and y = x_1 is the output of the system. The functions f_i(x̄_i) are assumed to be unknown. Each state variable x_i is subject to an asymmetric time-varying constraint with boundaries F_i1(t) and F_i2(t), i.e., F_i1(t) < x_i(t) < F_i2(t). The control objectives of this article are to design an optimal controller for the system (1) with asymmetric time-varying state constraints such that: 1. All the signals in the closed-loop system are bounded. 2. The system output y can track the desired signal y_r under the condition that all states are subject to asymmetric time-varying constraints.
The following assumptions are made in order to realize the above control objectives.

Assumption 1
The desired signal y_r is continuous, and its first time derivative ẏ_r is bounded and available. Moreover, the reference signal y_r should satisfy F_r1 < y_r < F_r2, where F_r1 = F_11 + F_0 and F_r2 = F_12 − F̄_0, F_0 and F̄_0 are positive constants, and F_11 and F_12 are the lower and upper boundaries of x_1.

Assumption 2
The time-varying boundaries F_i1(t) and F_i2(t) are smooth, and their derivatives are bounded and continuous.
Remark 1 Assumptions 1 and 2 have been commonly employed in the literature on handling full-state constraints, such as [21,25,48]. Assumption 1 indicates that the upper and lower bounds of the desired reference signal are slightly smaller than the constraints on the system state x; that is to say, the desired signal cannot surpass the range of the state constraints. The purpose of Assumption 2 is to allow the desired signal to be transformed into a new reference variable with the proposed state-dependent transformation function and applied to the controller design, as detailed in Section 3.2.

Neural Network
The radial basis function neural network (RBFNN) w*^T S(X) is used to approximate uncertainty, as in [49]. Suppose f(X) is a continuous function defined on a compact set; then for any given constant ε̄ > 0, there exists a constant ideal weight vector w* such that f(X) = w*^T S(X) + ε_0, where ε_0 denotes the approximation error, which satisfies |ε_0| ≤ ε̄, the input vector of the RBFNN is denoted by X = [X_1, . . . , X_ℓ]^T ∈ R^ℓ, and ℓ is the dimension of the input. The Gaussian function is taken as the basis function, defined as S_j(X) = exp(−‖X − π_j‖² / γ_j²), where π_j ∈ R^ℓ, j = 1, . . . , m, is the center of the basis function, m is the number of hidden nodes, and γ_j is the width of the Gaussian function.
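The approximation above can be sketched in code. This is a minimal illustration, not the paper's configuration: the input dimension, the number and placement of centers, the width, and the toy target function are all hypothetical choices, and the ideal weights are approximated here by a least-squares fit.

```python
import numpy as np

def gaussian_basis(X, centers, gamma):
    """Basis vector S(X) with S_j(X) = exp(-||X - pi_j||^2 / gamma^2)."""
    diffs = X[None, :] - centers                 # difference to each center pi_j
    return np.exp(-np.sum(diffs**2, axis=1) / gamma**2)

# Hypothetical setup: ell = 2 inputs, m = 25 hidden nodes on a 5x5 grid.
grid = np.linspace(-2.0, 2.0, 5)
centers = np.array([[a, b] for a in grid for b in grid])  # pi_j in R^ell
gamma = 1.0

def rbfnn(X, w):
    """RBFNN output w^T S(X), approximating a continuous f on a compact set."""
    return w @ gaussian_basis(X, centers, gamma)

# Approximate the ideal weights by least squares on samples of a toy target
# f(X) = sin(X_1) * X_2 (chosen only for illustration).
rng = np.random.default_rng(1)
Xs = rng.uniform(-2, 2, size=(400, 2))
ys = np.sin(Xs[:, 0]) * Xs[:, 1]
Phi = np.array([gaussian_basis(X, centers, gamma) for X in Xs])
w, *_ = np.linalg.lstsq(Phi, ys, rcond=None)

# Approximation error epsilon_0 stays small on the sampled compact set.
err = max(abs(rbfnn(X, w) - np.sin(X[0]) * X[1]) for X in Xs)
print(err)
```

The least-squares fit stands in for the "ideal weight" w*; in the paper the weights are instead adapted online by the updating laws of Section 3.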

Optimal Control Formulation
Consider a class of affine nonlinear continuous-time systems ẋ = f(x) + g(x)u(x), where x ∈ R^n is the state vector and u(x) ∈ R^m is the control input of system (5). The system functions f(x) ∈ R^n and g(x) ∈ R^{n×m} are continuous, and f(0) = 0. Moreover, f(x) + g(x)u(x) is Lipschitz continuous on a set Ω ⊂ R^n containing the origin. We assume that system (5) is stabilizable.
For the optimal control problem, define the utility function as r(x, u) = Q(x(t)) + u^T R u, where Q(x(t)) ≥ 0 and R = R^T > 0 is a square matrix of dimension m. The cost function is defined as J(x(t)) = ∫_t^∞ r(x(τ), u(τ)) dτ. Definition 1 (Admissible Control): The control u(t) is said to be admissible with respect to the cost function on a compact set Ω ⊂ R^n if u(x) is continuous on Ω, u(0) = 0, u(x) stabilizes the system on Ω, and J(x_0) is finite for all x_0 ∈ Ω.
Define u*(x) as the optimal controller; then the optimal cost function is described as (8). Taking the time derivative of (8) on both sides yields the HJB equation (9), where J*_x(x) = ∂J*(x)/∂x. From (9), the optimal state feedback control law (10) can be obtained. Substituting (10) into (9), we obtain (11). Due to the inherent nonlinear and uncertain terms of the HJB equation, it is hard to acquire the optimal controller by seeking a solution directly, and ADP or RL will be used to solve the equation.
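For the reader's convenience, the steps described above can be written out in the standard form used in the ADP literature [32,33]; this is a reconstruction under the assumption of the quadratic utility r(x, u) = Q(x) + uᵀRu, not a quotation of the paper's displayed equations:

```latex
J^{*}(x(t)) = \min_{u} \int_{t}^{\infty} \Big( Q(x(\tau)) + u^{T}(\tau) R\, u(\tau) \Big)\, d\tau,
\qquad \text{(optimal cost)}

0 = \min_{u} \Big[ Q(x) + u^{T} R u + J_{x}^{*T}(x)\big( f(x) + g(x) u \big) \Big],
\qquad \text{(HJB equation)}

u^{*}(x) = -\tfrac{1}{2} R^{-1} g^{T}(x) J_{x}^{*}(x),
\qquad \text{(optimal control law)}

0 = Q(x) + J_{x}^{*T}(x) f(x) - \tfrac{1}{4} J_{x}^{*T}(x) g(x) R^{-1} g^{T}(x) J_{x}^{*}(x).
```

The last identity, obtained by substituting the optimal control law back into the HJB equation, is the nonlinear partial differential equation that ADP/RL approximates rather than solves in closed form.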

Novel State-dependent Function
A novel NSDF is introduced to tackle the time-varying state constraints.
Firstly, we construct functions to denote the differences between the state x(t) and the constraint boundaries F_1(t) and F_2(t), where δ > 0 is a scale parameter; for simplicity, shorthand notation is defined for them. Then, to tackle the full-state constraint, we define a new constrained variable ξ(x) through the novel state-dependent transformation, and the time derivative of the new constrained variable follows. The novel state-dependent variable has the following properties. Remark 2 It should be noted that the four properties of the new state-dependent variable encompass all cases of asymmetric state constraints.
- Case 1: When the state x is 0, the new state-dependent variable is also 0, while when x is not equal to 0, the new state-dependent variable cannot be 0.
- Case 2: As the state variables approach the lower or upper boundary, the new state-dependent variables rapidly tend to infinity, which forces the controller to pull the state back from the constraint boundary.
- Case 3: The proposed new state-dependent transformation function can handle not only the state-constrained case but also the case of no state constraint; that is, the new state-dependent variables return to the original states when the state constraint boundaries are at infinity.
- Case 4: The parameter δ is used to scale the distance between the state and the boundary. Increasing the value of δ, ξ(x) becomes closer to x; on the contrary, ξ(x) is enlarged. See Fig. 1(b).
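The four properties above can be checked numerically. The paper's exact NSDF expression is not reproduced here; the function `xi` below is a hypothetical transformation (a normalized, δ-scaled variant of the form in [20]) used only to illustrate what Cases 1-4 demand of such a function:

```python
def xi(x, F1, F2, delta=1.0):
    # Hypothetical NSDF sketch (NOT the paper's exact function),
    # valid on the constrained interval -F1 < x < F2.
    # The base term is 1 at x = 0 and vanishes at either boundary.
    base = (1.0 + x / F1) * (1.0 - x / F2)
    return x / base ** (1.0 / delta)

# Case 1: xi(0) = 0, and xi is nonzero (with the sign of x) for x != 0.
print(xi(0.0, 1.0, 1.0))
# Case 2: xi blows up as x approaches a constraint boundary.
print(xi(0.999, 1.0, 1.0))
# Case 3: xi(x) -> x as the constraint boundaries tend to infinity.
print(xi(0.5, 1e6, 1e6))
# Case 4: increasing delta pulls xi(x) back toward x; decreasing it enlarges xi.
print(xi(0.5, 1.0, 1.0, delta=4.0), xi(0.5, 1.0, 1.0, delta=1.0))
```

Any function with these four limiting behaviors would qualify for the construction in this section; the specific algebraic form is a placeholder for illustration.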

Remark 3
To compare with the state-dependent transformations in the existing literature, the major approaches, marked A-E, are shown schematically in Fig. 1. It can be seen that all methods can handle asymmetric state constraints, but for the approach proposed in [23,29], marked D, the new state is not equal to 0 when the state is 0, unless the state boundary is symmetric. When the state is away from the constraint boundary, method A (proposed), method C (see [21]), and method E (see [20,48,25]) have better linearity than method B [27]. However, the slope of the linear part of method E is much smaller than that of A and C, indicating that A and C have a better ability to restore the original state. Nevertheless, the state-dependent transformation function of method C is a piecewise function with respect to the state and constraint boundaries, and its derivative likewise depends on the piecewise definition.
Remark 4 Most of the literature addressing state constraints using a BLF constructs a Lyapunov function of the form V = (1/2) log(F²/(F² − x²)). Similarly, in [20], a state-dependent transformation function is constructed as ξ(x) = x/((F_1 + x)(F_2 − x)). Note that when there is no state constraint, i.e., F, F_1, F_2 tend to ∞, V or ξ(x) tends to 0. That is, the methods mentioned above cannot cope with the situation where there is no state constraint or where time-varying state constraints tend to infinity. Like the remarkable work in [21], the proposed novel NSDF has the ability to handle this situation; see Case 3 in Remark 2.
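The limitation noted in Remark 4 can be observed numerically. A minimal sketch of the transformation from [20], showing that it collapses to 0 (rather than recovering x) as the constraint boundaries grow:

```python
def xi_song(x, F1, F2):
    # State-dependent transformation from [20], valid for -F1 < x < F2.
    return x / ((F1 + x) * (F2 - x))

x = 0.5
for F in (1.0, 10.0, 100.0, 1000.0):
    # With F1 = F2 = F -> infinity, xi behaves like x / F^2 -> 0, so the
    # unconstrained case cannot be recovered (contrast Case 3 in Remark 2).
    print(F, xi_song(x, F, F))
```

This is exactly why a transformation that returns to the original state as the boundaries recede (Case 3) is needed for the unified constrained/unconstrained treatment.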
Now, reconsider the system (1) with asymmetric time-varying full-state constraints. By utilizing the novel state-dependent function, the following transformation can be made on the original system, where i = 1, 2, . . . , n − 1. The new state variable vector ξ and the transformed system (18), in which the state constraints are incorporated, will be used to design the optimal controller.
Remark 5 Under the proposed novel nonlinear state-dependent transformation function, the original time-varying asymmetric state constraints are implicitly incorporated into a new strict-feedback system, see (18). It is worth mentioning that F_i(ξ) is a continuous dynamic function and that F_i(0) = 0 when x = 0. This means that the new system (18) is stabilizable, i.e., there exists a continuous input u that stabilizes the system asymptotically.

Optimal Backstepping Controller Design
An optimized controller will be designed in this section for system (18). Before designing the controller, a new tracking signal variable ξ_r needs to be introduced, and its derivative ξ̇_r is obtained similarly. In order to design the optimal controller by utilizing the backstepping technique, a new coordinate transformation is defined, where α_i is the designed virtual controller of the ith step. The optimal controllers are designed as in (23), where c̄_i is the design parameter of the ith-step optimal virtual controller and u is the optimal actual controller. Ŵ_fi ∈ R^{p_i}, Ŵ_ci ∈ R^{q_i}, and Ŵ_ai ∈ R^{q_i} are the estimated weight vectors of the identifier, critic network, and actor network of the ith step, respectively. Φ_fi ∈ R^{p_i} and Φ_ci ∈ R^{q_i} are the basis function vectors of the identifier and critic network, respectively. p_i and q_i are the dimensions of the weight and basis function of step i. The parameters should satisfy the conditions in (25). Next, the detailed design procedures are presented.
Step 1: The identifier F̂_1(ξ) with updating law Ẇ̂_f1 is designed as follows, where F̂_1(ξ) ∈ R is the output of the identifier, Ŵ_f1 ∈ R^{p_1} is the weight of the identifier neural network, and Φ_f1(ξ) ∈ R^{p_1} is the basis vector. In the updating law, Γ_1 is a positive definite matrix and σ_1 > 0 is a design parameter. The critic and actor networks are then designed, where ∂Ĵ*_1/∂ζ_1 ∈ R is the estimate of ∂J*_1/∂ζ_1 and Ŵ_c1 ∈ R^{q_1} is the weight of the critic NN, whose updating law uses the critic design parameter γ_c1 > 0, while γ_a1 > 0 is the actor design parameter.
Step i (i = 2, . . . , n − 1): According to (22), we can obtain ζ̇_i. Similarly, the cost of subsystem i is described with h_i(ζ_i, α_i) = ζ_i² + α_i². Treating ξ_{i+1} as the optimal virtual controller α*_i, the designed virtual controller follows, where c̄_i is a positive design parameter. The identifier, critic, and actor networks for subsystem i with their updating laws are designed such that F̂_i(ξ) ∈ R is the output of the identifier, Ŵ_fi ∈ R^{p_i} is the weight of the identifier NN, Φ_fi(ξ) ∈ R^{p_i} is the basis vector, ∂Ĵ*_i/∂ζ_i ∈ R is the estimate of ∂J*_i/∂ζ_i, and Ŵ_ci ∈ R^{q_i} is the weight of the critic. In their tuning laws, Γ_i is a positive definite matrix, σ_i > 0 is a design parameter, and γ_ci > 0, γ_ai > 0 are the design parameters of the critic and actor networks.
Step n: The identifier, critic network, and actor network are defined such that F̂_n(ξ) ∈ R is the output of the identifier, Ŵ_fn ∈ R^{p_n} is the weight of the identifier, Φ_fn(ξ) ∈ R^{p_n} is the activation function vector, ∂Ĵ*_n/∂ζ_n ∈ R is the estimate of ∂J*_n/∂ζ_n, and Ŵ_cn ∈ R^{q_n} is the weight of the critic NN. In the updating laws, Γ_n is a positive definite matrix, σ_n > 0 is a design parameter, and γ_cn > 0, γ_an > 0 are the design parameters, subject to the conditions

c̄_n > 3, γ_an > 1/2, γ_an > γ_cn > γ_an/2. (56)

Main Results and Stability Analysis
The main results and proofs are given as follows.
Theorem 1 Consider the strict-feedback uncertain nonlinear system (1) with asymmetric time-varying full-state constraints F_i1(t) < x_i(t) < F_i2(t) under Assumptions 1 and 2. By utilizing the optimized virtual and actual controllers (23) with the updating laws (24) and the design parameter conditions (25), the proposed scheme can ensure the following: 1. All the signals ζ_i, W̃_fi, W̃_ci, and W̃_ai are bounded. 2. The system output y can track the desired signal y_r. 3. All states do not violate the asymmetric time-varying constraints, with no dependence on feasibility conditions.
Proof Before giving the stability analysis, we show that the designed weight updating laws (24) minimize the approximation error of the HJB equation.
Recalling the HJB equation (11), by utilizing (43) and (44), we obtain the Bellman residual error e_i(t), defined as in (58). From equation (58), it can be seen that the optimal weights should satisfy equation (60). To ensure that the updating laws satisfy this equation, we define a positive function P_i = (Ŵ_ai − Ŵ_ci)^T(Ŵ_ai − Ŵ_ci), with the facts that ∂P_i/∂Ŵ_ai = −∂P_i/∂Ŵ_ci = 2(Ŵ_ai − Ŵ_ci). When P_i equals 0, equation (60) is satisfied. Therefore, the weight updating laws can be structured so that Ṗ_i ≤ 0. Recalling the updating laws (24), we obtain Ṗ_i ≤ 0. The above inequality means that the updating laws (24) can minimize the Bellman residual error e_i(t).
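The gradient facts used for P_i above can be verified by a quick finite-difference check. A minimal sketch, with hypothetical 4-dimensional weight vectors standing in for Ŵ_ai and Ŵ_ci:

```python
import numpy as np

def P(w_a, w_c):
    # P_i = (W_a - W_c)^T (W_a - W_c): squared distance between the
    # actor and critic weight estimates.
    d = w_a - w_c
    return float(d @ d)

rng = np.random.default_rng(0)
w_a, w_c = rng.standard_normal(4), rng.standard_normal(4)

grad_a = 2.0 * (w_a - w_c)   # claimed dP/dW_a = 2 (W_a - W_c)
grad_c = -grad_a             # claimed dP/dW_c = -dP/dW_a

# Central finite differences along each coordinate direction.
eps = 1e-6
num_a = np.array([(P(w_a + eps * e, w_c) - P(w_a - eps * e, w_c)) / (2 * eps)
                  for e in np.eye(4)])
num_c = np.array([(P(w_a, w_c + eps * e) - P(w_a, w_c - eps * e)) / (2 * eps)
                  for e in np.eye(4)])
print(np.max(np.abs(num_a - grad_a)), np.max(np.abs(num_c - grad_c)))
```

Since P_i is quadratic, the central differences agree with the analytic gradients up to floating-point roundoff, confirming the identities used in structuring Ṗ_i ≤ 0.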

Remark 6
It should be noted that, in the proof that the neural network weight updating laws guarantee convergence of the Bellman residual error to 0, we use the design results of step i rather than those of steps 1 and n. In fact, the proof for steps 1 and n is fundamentally the same as that for step i; only part of the notation differs. Next, the system stability proof is presented.
Remark 8 It is also important to note that, by employing the proposed state-dependent transformation function, as argued in [27,29,20,24,48,25,21], the controller design procedure does not require the feasibility condition to be satisfied, and the asymmetric time-varying full-state constraints are not violated.

Simulation Examples
In this section, the effectiveness of the proposed optimal control scheme is validated by two simulation examples.
Example 1:
The virtual and actual controllers of step 1 and step 2, as well as the tuning laws of the identifier, critic, and actor networks, are designed according to (23) and (24), respectively. In each step, the neural networks have the same structure, and the centers are evenly spaced in the range [−8, 8]. The tracking error of the system, including the transformed system and the original system, is given in Fig. 3, showing that satisfactory tracking results are obtained. Both the virtual controller and the actual controller input are presented in Fig. 4, which indicates that the feasibility condition is not required to be met by the virtual control. The effectiveness of the proposed optimized backstepping controller is shown in Fig. 5 and Fig. 6, in which we can see that both the cost functions and the estimated weight vectors converge rapidly.

Example 2:
To further verify the efficacy of the proposed scheme, a second set of simulations was conducted on an electromechanical system, whose dynamics and parameters are given in Tables 1 and 2. The desired signal is y_r = sin(0.5t) + 0.5 sin(t). The neural networks have the same structure as in Example 1, as well as the same initial values of the weight updating laws. The parameters of the controllers are chosen as ā_1 = 4, ā_2 = 3.1, ā_3 = 3.6, σ_1 = 0.36, σ_2 = 0.28, σ_3 = 0.24, γ_c1 = 1.4, γ_c2 = 1.3, γ_c3 = 1.3, γ_a1 = 1.8, γ_a2 = 1.5, γ_a3 = 1.5. The initial state is x(0) = 0. The simulation results are graphically illustrated in Figs. 7-10. From Fig. 7 and Fig. 8 it can be seen that the system output x_1 can track the desired signal y_r, and the system states x_2 and x_3 remain within the predefined time-varying boundaries. The actual tracking error is presented in Fig. 9, and Fig. 10 shows the outputs of the actual controller and the virtual controllers.
The simulation results explicitly demonstrate that the proposed optimized backstepping controller not only tracks the desired reference signal well under the time-varying asymmetric state constraints, but also keeps all the closed-loop signals bounded. Moreover, the virtual controllers in the proposed scheme do not have to fulfill the feasibility condition.

Conclusion
This paper investigates the optimal control of nonlinear strict-feedback systems subject to time-varying asymmetric state constraints. A novel state-dependent transformation function is proposed, and based on it, the original system is transformed into a new system with the state constraints incorporated. An optimized backstepping controller is designed to track the desired reference signal, and a reinforcement learning algorithm is used to implement the optimal control, where the identifier, critic network, and actor network are utilized to estimate the uncertain system dynamics, evaluate the control performance, and yield the controller, respectively. The proposed novel state-dependent transformation function not only avoids the feasibility conditions but also has the ability to handle the cases with or without state constraints simultaneously. Simulation examples verify the effectiveness of the proposed transformation function and the optimized controller. In the future, a reinforcement learning algorithm for the optimized backstepping controller that does not rely on the identifier is a worthwhile research problem.