Control and Anti-control of Chaos Based on the Moving Largest Lyapunov Exponent Using Reinforcement Learning

Abstract In this work, we propose a method for the control and anti-control of chaos based on the moving largest Lyapunov exponent using reinforcement learning. In this method, we design a reward function for reinforcement learning according to the moving largest Lyapunov exponent, which is similar to the moving average but computes the corresponding largest Lyapunov exponent from a recently updated time series of fixed, short length. We adopt the density peaks-based clustering algorithm to determine a linear region of the average divergence index, so that we can obtain the largest Lyapunov exponent of the small data set by fitting the slope of the linear region. We show that the proposed method is fast and easy to implement by controlling and anti-controlling typical systems such as the Henon map and the Lorenz system.

for ecological sustainability of the Sea [2]. Chaotic motions cause unpredictability and high risk in asteroid exploration [3]. The emergence of a Shilnikov chaotic attractor in an economic model with financial bubbles in the banking sector may cause a financial crisis [4]. In these and similar circumstances, chaos is harmful and should be suppressed [5]. In 1990, Ott, Grebogi and Yorke proposed an effective method of controlling chaos, called the OGY method [6]. They chose one of the unstable orbits embedded in a chaotic attractor that can improve the system performance, and then stabilized the system to the chosen orbit by using small but elaborately selected parameter perturbations. Their method can be applied to experimental situations by virtue of the delay coordinate embedding technique. Subsequently, a large number of methods for the control of chaos have been developed, such as delayed feedback control [7][8][9][10], open-plus-closed-loop control [11], and adaptive control [12].
Unfortunately, most approaches rely on prior knowledge or analytical descriptions of the system dynamics, which are difficult to obtain in practical applications. Based on reinforcement learning with Q-learning, Gadaleta and Dangelmayr introduced a general method for optimal chaos control [13]. They showed that this method can not only control high-dimensional discrete systems, namely 1-D and 2-D coupled logistic map lattices [14], but also attack the targeting problem in a complex multi-stable system by guiding its trajectory to a metastable state [15]. Lei and Han successfully applied the Q-learning method to the control of chaos in the Frenkel-Kontorova model [16]. The goal of reinforcement learning is to maximize the cumulative rewards, which determine whether the agent can ultimately learn the desired goal: stabilizing an unstable periodic orbit embedded in the chaotic attractor.
The agent acquires all the information required to find the optimal control strategy through interaction with the system. The method is data-driven and provides an intelligent black-box control, because it requires neither analytical descriptions of the governing dynamical equations nor knowledge of the targeted unstable periodic orbit.
On the other hand, the existence of chaotic characteristics makes dynamical systems flexible and useful. Davies et al. showed that chaotic behaviors can speed up the combustion process, because the sensitive dependence of chaos on initial values promotes the mixing of reactants [17]. Li and Xu proposed that chaotification of the quasi-zero-stiffness system can mask the line spectrum characteristics of the acoustic noise of machinery vibration, so as to enhance the concealment capability of underwater vehicles [18]. In a biological neural network with periodic solutions, Kohannim and Iwasaki achieved an expected strange attractor by tuning a small parameter to destabilize the oscillators' phase difference [19]. In this sense, chaos provides great flexibility for the system performance, because infinitely many unstable periodic orbits are embedded in the chaotic attractor and we can accomplish different goals with the aid of these unstable orbits [20]. Chen and his coauthors designed a small-amplitude feedback controller to chaotify discrete and continuous systems [22,23]. They showed that the controlled system is chaotic and has a positive Lyapunov exponent. Thereafter, various methods have been proposed for the anti-control of chaos, such as impulse control [21], feedback control [24,25], and topological conjugate mapping control [26]. Similarly, most existing methods for the anti-control of chaos face the problem of requiring explicit knowledge of the system in real-world applications. We demonstrated that control of chaos using reinforcement learning is model-free and easy to employ in the Frenkel-Kontorova model [16]. As far as we know, anti-control of chaos using reinforcement learning has not been reported. A natural question is whether one can apply the reinforcement learning based method to the anti-control of chaos by appropriately designing states, actions and reward functions for the agent.
In this work, we consider the problem of control and anti-control of chaos using reinforcement learning, based on the work of Gadaleta and Dangelmayr [13]. We show that the method is effective in controlling chaos, stabilizing unstable periodic orbits by altering a control parameter or the system's variables and designing a suitable reward function. If the goal changes from control to anti-control, one can design the opposite reward function by interchanging reward and punishment.
Unfortunately, when applying the method to destabilize a period-1 system for the purpose of anti-control of chaos, we find that the controlled system may settle into a period-2 state. When using the method to anti-control chaos, we must prevent the system from being driven to other high-periodic orbits instead of chaotic ones. Even if we could avoid the problem of falling into a high-periodic orbit, we would still face the problem of not distinguishing a quasi-periodic orbit from chaos. This brings a difficulty in designing a reward function and should not be neglected. In view of this, we introduce a reward function based on the largest Lyapunov exponent of the system, because it can be used to judge whether the corresponding system is chaotic. To describe the interactions of the agent and the environment in reinforcement learning, we make use of the concept of a Markov decision process, which discards previous data after each interaction. Therefore, we must calculate the largest Lyapunov exponent from a small data set. In this study, we define a moving largest Lyapunov exponent based on a recently updated time series of fixed, short length. In 2018, Zhou and Wang proposed a practical method [27] for calculating the largest Lyapunov exponent from a small data set by using the density peaks-based clustering algorithm [28]. This method provides strong support for the moving largest Lyapunov exponent. On this basis, we adopt the density peaks-based clustering algorithm to determine a linear region of the average divergence index, so that we can automatically obtain the largest Lyapunov exponent of the small data set by fitting the slope of the linear region. Once the moving largest Lyapunov exponent is obtained, the reward function can be defined by its value, so as to implement a control policy using reinforcement learning.
The rest of the paper is organized as follows. In Sect. 2, we briefly introduce reinforcement learning and describe the model-free control and anti-control algorithm based on Ref. [13]. We take the Henon map and the Lorenz system as illustrative examples with numerical simulations. In Sect. 3, we propose a new method of control and anti-control of chaos based on the moving largest Lyapunov exponent using reinforcement learning. We also demonstrate the effectiveness and advantages of the proposed method on the two systems. Section 4 presents conclusions and discussions.

Control and anti-control of chaos based on the states' period using reinforcement learning
This section discusses the problem of control and anti-control of chaos. Firstly, the theory of reinforcement learning is briefly reviewed; then the control algorithm is presented and extended to the anti-control of chaos. Finally, the Henon map and the Lorenz system are taken as illustrative examples to show the feasibility of the method.

Control policy with the states' period
In this study, we consider the use of a model-free RL algorithm to control and anti-control chaos in dynamical systems. In the model-free RL algorithm, we do not need an explicit understanding of the system. The agent's perception and cognition of the system are realized through continuous interaction with a simulation of the system, and the data obtained from the interaction are used not to model the system but to optimize the agent's own behavior. This avoids the requirement of analytical knowledge of system models in traditional control methods.
Reinforcement learning (RL), as a common machine learning technique, is a process of intelligent decision-making; among machine learning paradigms, its learning mode is the closest to the way humans learn. Through simulation and interaction, RL improves its decision-making ability so as to quickly accomplish expected goals. RL involves five elements: agent, environment, state, action and reward. The learning process is as follows: when the environment is in state $s_t$ at time $t$, the agent selects an action $a_t$; the environment then returns a reward $r_{t+1}$ and transitions to the next state $s_{t+1}$. This learning process is called the "agent-environment" interaction of a Markov decision process.
In general, the goal of RL is to maximize the expectation of the discounted accumulative reward
$$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1},$$
where $\gamma \in (0,1)$ is a constant discount factor, which guarantees that the return is bounded.
The RL algorithm is based on estimating the state-action value function
$$q_{\pi}(s,a) = \mathbb{E}_{\pi}\left[G_t \mid s_t = s,\, a_t = a\right] = \sum_{s',\,r} p(s', r \mid s, a)\left[r + \gamma\, v_{\pi}(s')\right],$$
where $p(s', r \mid s, a)$ denotes the state transition probability and $v_{\pi}$ is the state value function under the policy $\pi$.
Note that the policy $\pi$ is exactly a probability distribution, i.e., for each state it stores the probability of taking each action.
Under the optimal policy $\pi_{*}$, whose cumulative reward is the highest, the optimal state-action value function is defined as
$$q_{*}(s,a) = \max_{\pi} q_{\pi}(s,a).$$
Therefore, the Bellman optimality equation of the state-action value function is
$$q_{*}(s,a) = \sum_{s',\,r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_{*}(s', a')\right].$$
In general, the state transition probabilities are hard to obtain. In this case, one should seek the optimal strategy to control or anti-control chaos by a model-free method. The temporal-difference (TD) method provides a model-free, one-step update algorithm, since it combines the sampling of the Monte Carlo method with the bootstrapping of dynamic programming.
Gadaleta and Dangelmayr introduced a chaos control algorithm using the TD method in RL, in which the optimal state-action value function is approximated by Q-learning or Sarsa learning, and they showed that the performance of Q-learning is better than that of Sarsa learning [13]. Here, we select Q-learning as the control (or anti-control) algorithm; its state-action value function is updated by
$$Q(s_n, a_n) \leftarrow Q(s_n, a_n) + \alpha\left[r_{n+1} + \gamma \max_{a} Q(s_{n+1}, a) - Q(s_n, a_n)\right], \qquad (5)$$
where $0 < \alpha \le 1$ is the learning rate and we set $\alpha = \gamma = 0.9$. In addition, a deterministic policy is used to choose actions: at each step, the agent takes the action that maximizes the Q value of the current state-action pair. If several actions attain the maximum, we randomly choose one of them. Notice that the desired periodic or chaotic behaviors are unknown before the strategy is employed.
Each state-action pair's Q value is recorded in a Q-table, whose rows and columns correspond to the action space $A$ and the reference state set $W$, respectively (the Q values are initialized to 0). The action space, which depends on the perturbed parameters of the system, and the reference state set, which depends on the states of the system, are fixed before the environment is controlled or anti-controlled. In each episode of the task of controlling or anti-controlling chaos, the first state $s_0$ is randomly initialized.
For each step $n = 1, 2, \ldots$ in an episode, the agent first gets the reference state $w(s_n)$ of the system's state $s_n \in S$. Based on $w(s_n)$, the agent uses the deterministic strategy over the Q-table to select an action $a_n$, then applies $a_n$ to the system, obtaining the reward $r_{n+1}$ and the next state $s_{n+1}$. At the same time, the value $Q(w(s_n), a_n)$ in the Q-table is updated by Eq. (5) until the episode reaches its goal. In addition, our training method is an offline strategy; in other words, we train the Q-table over a limited number of episodes, each starting from a randomly selected initial state. According to the updated Q-table, we can judge the best action to take in any state, so as to achieve the aim of control or anti-control. A minimal sketch of this loop is given below.
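To make the procedure concrete, the following is a minimal Python sketch of the tabular Q-learning loop described above. The interface functions `env_step`, `reference_state`, `reward` and `init_state`, as well as the table sizes, are hypothetical placeholders rather than the authors' implementation; only the update rule of Eq. (5), the parameter values $\alpha = \gamma = 0.9$, and the deterministic tie-breaking policy follow the text.

```python
import numpy as np

# Hypothetical sizes: |W| reference states and |A| admissible perturbations.
N_STATES, N_ACTIONS = 64, 5
ALPHA, GAMMA = 0.9, 0.9  # learning rate and discount factor, as in Eq. (5)

# Q-table, initialized to 0; we index it as Q[state, action] for convenience.
Q = np.zeros((N_STATES, N_ACTIONS))

def greedy_action(w):
    """Deterministic policy: an action maximizing Q(w, a), ties broken randomly."""
    row = Q[w]
    best = np.flatnonzero(row == row.max())
    return np.random.choice(best)

def train(env_step, reference_state, reward, init_state,
          n_episodes=200, n_steps=500):
    """Tabular Q-learning over reference states.

    env_step(s, a)     -> next state after applying perturbation a (assumed given)
    reference_state(s) -> index of s in the reference set W (assumed given)
    reward(s_next)     -> scalar reward r_{n+1} (period- or mLLE-based)
    init_state()       -> random initial state for each episode (assumed given)
    """
    for _ in range(n_episodes):
        s = init_state()
        for _ in range(n_steps):
            w = reference_state(s)
            a = greedy_action(w)
            s_next = env_step(s, a)
            r = reward(s_next)
            w_next = reference_state(s_next)
            # One-step TD update of the state-action value, Eq. (5)
            Q[w, a] += ALPHA * (r + GAMMA * Q[w_next].max() - Q[w, a])
            s = s_next
```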
The aim of controlling chaos is to guide a chaotic system to unstable periodic orbits or fixed points embedded in the chaotic attractor of the environment (the system), or to make the chaotic system bifurcate to a periodic behavior. Due to the finite size of the Q-table, the observed states are mapped to the reference state set $W$, and the reward function based on the states' period can be written as
$$r_{n+1} = \begin{cases} 1, & w_n = w_{n-l}, \\ -1, & w_n \neq w_{n-l}, \end{cases} \qquad (8)$$
where $l$ is an integer and $w_n$ denotes the reference state at step $n$; the agent is rewarded when the reference state recurs with period $l$ and punished otherwise. In the following, we consider the anti-control of chaos in view of the above control strategy. The anti-control of chaos refers to changing periodic behaviors of environments into chaotic behaviors. Unlike the control of chaos, where any of the unstable periodic orbits can be considered the target of control, the objective of anti-control of chaos is to destabilize the system and make it reach a chaotic trajectory. For the anti-control of chaos using RL, the reward and punishment in Eq. (8) are interchanged, and there are two problems to be solved: the controlled system may fall into another periodic orbit, and a quasi-periodic orbit cannot be distinguished from a chaotic one. A sketch of this reward is given below.
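As an illustration, the period-based reward of Eq. (8) can be sketched as follows. The deque-based bookkeeping and the neutral reward during the warm-up phase are our own assumptions; `anti_control=True` interchanges reward and punishment as described above.

```python
from collections import deque

def make_period_reward(l, anti_control=False):
    """Reward of Eq. (8): +1 if the reference state recurs with period l,
    -1 otherwise; reward and punishment are interchanged for anti-control."""
    history = deque(maxlen=l)  # the last l reference states

    def reward(w_n):
        if len(history) < l:   # not enough history yet: neutral reward (our choice)
            history.append(w_n)
            return 0.0
        hit = (w_n == history[0])  # does w_n equal w_{n-l}?
        history.append(w_n)        # the oldest entry is dropped automatically
        r = 1.0 if hit else -1.0
        return -r if anti_control else r

    return reward
```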

Numerical simulation
In this subsection, we first take the Henon map as the simulation model to verify the algorithm in Sec. 2.1. The Henon map has the following form,
$$x_{n+1} = 1 - a x_n^2 + y_n, \qquad y_{n+1} = b x_n.$$
Figure 1 shows the evolution of the state variable of the slaved Henon map for the control of chaos, where the control action is performed on 200 iterations, and Fig. 2 shows the evolution of the slaved map when a period-one orbit is destabilized for $l = 4$ and $l = 20$.

Next, we consider the Lorenz system with the following form
$$\dot{x} = \sigma (y - x), \qquad \dot{y} = \rho x - y - xz, \qquad \dot{z} = xy - \beta z.$$
In the control of chaos, the parameters are set as $\sigma = 16$, $\rho = 50$, $\beta = 4$, under which the uncontrolled system is chaotic; Fig. 3 shows the evolution of the variable $x$ of the slaved Lorenz system, where the control action is performed on 1000 iterations. In the anti-control of chaos, the variable $x$ remains near a fixed point when $l = 50$, as shown in Fig. 4(a), and transfers from a fixed point to disorder after transient states when $l = 60$, as shown in Fig. 4(b). In this sense, an appropriate choice of $l$ will lead to disorder.

In the above two systems, we show that the algorithm in Sec. 2.1 can quickly achieve the aim of controlling chaos. But we find that whether the anti-control of chaos succeeds depends on the choice of $l$ in the reward function. Sometimes the existing reward function can accomplish the task, but it cannot be employed easily: in an unknown environment, the value of $l$ must be attempted and elaborately chosen, indicating that this technique is not suitable for general dynamical systems. Therefore, we have to modify the RL based method for the control and anti-control of chaos, in which the reward function is extremely important and must be redesigned. Both benchmark systems are sketched in code below.
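For reference, the two benchmark environments can be simulated as below. The Lorenz parameters follow the text ($\sigma = 16$, $\rho = 50$, $\beta = 4$), while the Henon parameters $a = 1.4$, $b = 0.3$, the fixed-step RK4 integrator, and the step size $dt = 0.01$ are conventional choices of ours, not values taken from the paper.

```python
import numpy as np

def henon_step(state, a=1.4, b=0.3):
    """One iteration of the Henon map: x' = 1 - a*x^2 + y, y' = b*x."""
    x, y = state
    return np.array([1.0 - a * x**2 + y, b * x])

def lorenz_rhs(state, sigma=16.0, rho=50.0, beta=4.0):
    """Right-hand side of the Lorenz system with the parameters used in the text."""
    x, y, z = state
    return np.array([sigma * (y - x), rho * x - y - x * z, x * y - beta * z])

def lorenz_step(state, dt=0.01, **params):
    """One fixed-step RK4 integration step; plays the role of env_step in the RL loop."""
    k1 = lorenz_rhs(state, **params)
    k2 = lorenz_rhs(state + 0.5 * dt * k1, **params)
    k3 = lorenz_rhs(state + 0.5 * dt * k2, **params)
    k4 = lorenz_rhs(state + dt * k3, **params)
    return state + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```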

Control and anti-control of chaos based on the moving largest Lyapunov exponent using reinforcement learning
In this section, we define and calculate a moving largest Lyapunov exponent (mLLE), on which a redesigned reward function is based, and propose an algorithm for the control and anti-control of chaos using RL. We again use the Henon map and the Lorenz system as illustrative examples to verify the proposed method.

Control policy with the moving largest Lyapunov exponent method
Choosing an appropriate reward function in RL is crucial when employing a control policy, because the reward function relates desired goals to the state variables of systems and determines whether tasks can be accomplished or not. As mentioned above, the anti-control of chaos in Sec. 2 often fails because the reward function is improperly chosen, which brings the difficulty of finding the key parameter $l$ for specific systems. For the purpose of seeking a universal strategy, we need to redesign a reward function that applies to general systems. In a dynamical system, Lyapunov exponents measure the average convergence or divergence rates of adjacent trajectories in phase space, and the sign of the largest Lyapunov exponent can be used to judge whether the system is chaotic. Taking this into account, we design a novel reward function for the control and anti-control of chaos with the aid of the largest Lyapunov exponent.
With a model-free control method, the model of the system is assumed to be unknown; to compute its largest Lyapunov exponent, one usually requires a large data set collected from the system. This requirement conflicts with RL's approximation of the learning process as a finite Markov decision process, which improves the current policy through observations of delayed rewards and does not retain a large amount of historical data. Starting from a small data set and borrowing the idea of the moving average [27,28], we define a moving largest Lyapunov exponent of the small data set and use a clustering technique based on density peaks to identify linear regions, calculating the mLLE with the following algorithm.

Algorithm 1 Calculating the moving largest Lyapunov exponent
Step 1: Collect a small data set $\{x_1, x_2, \ldots, x_L\}$ with a fixed length $L$.
Step 2: Reconstruct the phase space of the small data set by delay embedding, whose elements can be written as $Y_i = (x_i, x_{i+\tau}, \ldots, x_{i+(m-1)\tau})$, where the index number runs over the embedded points, $m$ is the embedding dimension and $\tau$ is the delay.
Step 3: Compute the average divergence index $y(i)$, i.e., the average logarithm of the distance between each embedded point and its nearest neighbor tracked $i$ steps ahead.
Step 4: Cluster the points of the average divergence index curve with the density peaks-based clustering algorithm [28]: (a)-(d) compute the local density of each point and its minimum distance to points of higher density, and select the points for which both quantities are large as cluster centers (see Ref. [28] for details); (e) classify the remaining points: each non-center point is put in the same cluster as its nearest neighbor of higher density.
Step 5: Pick the linear region: take the $K$ points that belong to the same cluster as $y(1)$ and whose indices increase sequentially. If the Pearson correlation coefficient between these points and their indices is more than 0.9, go to Step 6; otherwise, apply Step 4 again to the picked cluster. If $y(1)$ forms an isolated cluster, the algorithm fails.
Step 6: Fit the linear region using the least squares method; the fitted slope is taken as the current value of the mLLE.
Step 7: Update the small data set of length $L$, that is, delete its first point and append the newly observed point at the end, and repeat the above procedure to calculate the current mLLE until the observable states of the environment end. A simplified sketch of the algorithm is given below.
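The following is a simplified Python sketch of Algorithm 1. The divergence curve follows the small-data-set construction of Step 3; for brevity, the density peaks-based clustering of Steps 4-5 is replaced here by a simple scan that grows the initial segment while the Pearson correlation stays above 0.9, so this approximates the authors' linear-region detection rather than reproducing it faithfully. The default embedding parameters are our own assumptions.

```python
import numpy as np

def divergence_curve(series, m=2, tau=1, k_max=20):
    """Average divergence index y(i): mean log-distance between initially
    nearest neighbors tracked i steps ahead (small-data-set construction)."""
    x = np.asarray(series, dtype=float)
    n = len(x) - (m - 1) * tau
    # Step 2: delay embedding of the small data set
    Y = np.column_stack([x[j * tau : j * tau + n] for j in range(m)])
    d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)  # nearest neighbor of each embedded point
    y = []
    for i in range(1, k_max + 1):
        # keep pairs whose trajectories can still be followed i steps ahead
        valid = np.arange(n - i)[nn[: n - i] < n - i]
        dist = np.linalg.norm(Y[valid + i] - Y[nn[valid] + i], axis=1)
        dist = dist[dist > 0]
        y.append(np.mean(np.log(dist)) if dist.size else np.nan)
    return np.array(y)

def mlle(series, min_pts=5, r_min=0.9, **embed_kw):
    """Slope of the initial linear region of y(i), taken as the current mLLE.
    The linear region is found by growing the initial segment while the Pearson
    correlation stays above r_min (a simplification of Steps 4-5)."""
    y = divergence_curve(series, **embed_kw)
    i = np.arange(1, len(y) + 1, dtype=float)
    k = min_pts
    while k < len(y) and abs(np.corrcoef(i[: k + 1], y[: k + 1])[0, 1]) >= r_min:
        k += 1
    slope, _ = np.polyfit(i[:k], y[:k], 1)  # Step 6: least-squares fit
    return slope
```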
The proposed algorithm tracks the sign of the largest Lyapunov exponent by computing it over a continuously updated time series of fixed length. Therefore, we define reward functions according to the sign of the moving largest Lyapunov exponent for the control and anti-control of chaos using the RL scheme of Sec. 2.1. The reward function for the control of chaos is defined as
$$r_{n+1} = \begin{cases} 1, & \lambda_n < 0, \\ -1, & \lambda_n \ge 0, \end{cases} \qquad (11)$$
and the reward function for the anti-control of chaos is
$$r_{n+1} = \begin{cases} 1, & \lambda_n > 0, \\ -1, & \lambda_n \le 0, \end{cases} \qquad (12)$$
where $\lambda_n$ denotes the mLLE computed from the current window of observed states. For the control and anti-control of chaos based on the above reward functions (Algorithm 2), at the $n$th step of an episode the agent obtains the reference state $w_n$ of the current state $s_n$, selects the action $a_n$ under $w_n$ through the deterministic policy, executes $a_n$ in the system, and then obtains the reward $r_{n+1}$ from the sign of the mLLE together with the next state $s_{n+1}$; the Q-table is updated by Eq. (5) as before. In contrast to the period-based reward of Eq. (8), which cannot distinguish whether the observed time series of states is chaotic, high-periodic or quasi-periodic, the mLLE-based rewards make this distinction directly. However, in Algorithm 2, the problem of determining the length $L$ of the data set remains. In this case, $L$ is taken as a hyperparameter, which, as the numerical simulations will show, can be defined in advance or learned from data in the learning process. A sketch of the mLLE-based rewards follows.
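Given the `mlle` sketch above, the reward functions of Eqs. (11) and (12) reduce to a sign test on the current exponent. The sliding window below implements the update of Step 7; the neutral reward during the warm-up phase is our own assumption.

```python
from collections import deque

def make_mlle_reward(L=100, anti_control=False, **mlle_kw):
    """Rewards of Eqs. (11)-(12) from the sign of the moving largest
    Lyapunov exponent; reuses the mlle() sketch above."""
    window = deque(maxlen=L)  # sliding window of observed states, Step 7

    def reward(x_next):
        window.append(float(x_next))
        if len(window) < L:   # neutral reward until the window fills (our choice)
            return 0.0
        lam = mlle(list(window), **mlle_kw)  # current mLLE via Algorithm 1
        if anti_control:
            return 1.0 if lam > 0.0 else -1.0  # Eq. (12): reward chaos
        return 1.0 if lam < 0.0 else -1.0      # Eq. (11): reward regular motion

    return reward
```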

Numerical simulation
In this subsection, we again choose the Henon map and the Lorenz system to check the validity of Algorithm 2. In the algorithm, we calculate the mLLE at each step of an episode before updating the time series by appending the newly observed state of the system to its end and deleting its first element. In the learning process, we set a maximum number of episodes in advance. The reference set $W$ and the action space $A$ are the same as in Sec. 2.2. For the control of chaos, the reward function is defined in Eq. (11); Fig. 5 shows the evolution of the state variable of the slaved Henon map, where the control action is performed on 200 iterations. For the anti-control of chaos, the reward function is defined in Eq. (12); it can be seen in Fig. 6 that the variable $x$ of the slaved Henon map appears aperiodic, random but bounded after transient states when the control policy is performed on 100 iterations.
Therefore, the mLLE based reward function also enables the RL technique to chaotify the Henon map. In both the control and anti-control of the Henon map, the length $L$ of the observable states remains unchanged, which makes the method more applicable than the reward function in Sec. 2.2. Indeed, the mLLEs of the observed states are greater than 0 after the anti-control technique is implemented, indicating that the slaved map is chaotic and has not fallen into a high-periodic orbit. In addition, when we set different values of $L$, such as $L = 150, 200$, the goal of control or anti-control of chaos in the Henon map can still be achieved. To check the performance of Algorithm 2 in a low-dimensional continuous dynamical system, we consider the Lorenz system as an environment of RL.
The Lorenz system exhibits chaotic behavior with parameters $\sigma = 16$, $\rho = 50$, $\beta = 4$. For the key elements of RL, the reference state set $W$ and the action space $A$ are the same as those in Sec. 2.2. The reward function is chosen as Eq. (11) and the hyperparameter is $L = 100$. Fig. 7 depicts that the variable $x$ of the controlled system eventually evolves to a fixed point, showing that Algorithm 2 is also effective for the control of chaos in the continuous system.
When the parameters are set to $\sigma = 16$, $\rho = 20$, $\beta = 4$, the uncontrolled Lorenz system appears periodic. For the key elements, we select the same action space $A$ and reference state set $W$ as those used for the control of chaos in the Lorenz system in Sec. 2.2. In the anti-control of chaos, the reward function of Algorithm 2 is selected as Eq. (12) and the hyperparameter is $L = 100$. The anti-control results are shown in Fig. 8, where the control action is executed after 1000 iterations. Fig. 8(a) shows that the state variable evolves from a fixed point to an irregular behavior, and Fig. 8(b) displays the $x$-$y$ phase diagram after transient states.

Conclusions and discussions
This study considers the control and anti-control of chaos using an RL based method.
The method is model-free: it only requires data observed from interaction with the environment, rather than mathematical models of systems and targeted orbits. Besides controlling chaos, we generalize the model-free RL method based on the states' period to the anti-control of chaos by reversing the original reward function, that is, by interchanging its reward and punishment. We show with numerical simulations that the method can quickly achieve the aim of controlling chaos, but sometimes fails for the anti-control of chaos, because an improper choice of the period parameter $l$ makes the system fall into another periodic orbit. Furthermore, the method cannot distinguish chaos from high-periodic or quasi-periodic orbits. To overcome these shortcomings and seek a universal control strategy, we modify the RL based method and redesign the reward function by defining a moving largest Lyapunov exponent (mLLE) of the observed data with a fixed, short length. To calculate the mLLE, we adopt the density peaks-based clustering algorithm to determine a linear region of the average divergence index, so that we can automatically obtain the largest Lyapunov exponent of the small data set by fitting the slope of the linear region. Since the mLLE borrows the idea of the moving average, it characterizes the current dynamical behavior of the observed data and thus, through the reward function, guides the system to the expected trajectory under the control strategy. Further numerical simulations show that the proposed method is fast and easy to implement in controlling and anti-controlling typical systems such as the Henon map and the Lorenz system. In the future, we will consider controlling spatiotemporal chaos using deep RL based on the mLLE.

Figure 1
The evolution of the state variable of the slaved Henon map for the control of chaos, where the control action is performed on 200 iterations.

Figure 2
The evolution of the slaved Henon map when a period-one orbit is destabilized, where the control action is performed on 200 iterations: (a) the evolution of variable x for l = 4, (b) the evolution of variable x for l = 20.

Figure 3
The evolution of variable x of the slaved Lorenz system for the control of chaos, where the control action is performed on 1000 iterations.

Figure 4
The evolution of the Lorenz system when a fixed point is destabilized, where the control action is performed on 1000 iterations: (a) the evolution of variable x for l = 50, (b) the evolution of variable x for l = 60.

Figure 5
The evolution of the state variable of the slaved Henon map for the control of chaos, where the control action is performed on 200 iterations.

Figure 6
The evolution of the state variable of the slaved Henon map when a period-one orbit is destabilized, where the control action is performed on 100 iterations.

Figure 7
The evolution of variable x of the slaved Lorenz system for the control of chaos, where the control action is performed on 1000 iterations.

Figure 8
The anti-control of the Lorenz system, where the control action is performed on 1000 iterations: (a) the evolution of variable x, (b) the x-y phase diagram after transient states.