DDPG-based continuous thickness and tension coupling control for the unsteady cold rolling process

Cold rolling is an important part of the iron and steel industry, and the unsteady rolling process usually has a significant influence on the stability of product quality. In the unsteady rolling process, disturbances and uncertainties such as a variable lubrication state and variable equipment working conditions make it difficult to establish a state space model of thickness and tension, which has become a thorny problem in thickness and tension control. In this paper, we present a model-free controller based on Deep Deterministic Policy Gradient (DDPG) that continuously controls the thickness and tension of the unsteady rolling process without a mathematical model. We first formulate the thickness and tension control problem as a Markov Decision Process (MDP). By dividing the state space variables with the mechanism model, designing the reward function, and normalizing the states, the DDPG controller copes with the random disturbance and complex uncertainties of the unsteady cold rolling process; these strategies also ensure the learning performance and stability of the DDPG controller under random disturbance. Simulations and experiments show that the proposed DDPG controller requires no prior knowledge of uncertain parameters and operates without a mathematical model of the unsteady rolling process, and that it achieves better accuracy, stability, and rapidity for thickness and tension in the unsteady rolling process than a proportional-integral (PI) controller. The artificial intelligence-based controller brings both product quality improvement and intelligence to cold rolling.


Introduction
Cold rolling strip is one of the most important strip products and has been widely used in the automotive, aerospace, shipbuilding, bridge, architecture, electronics, and household appliance industries. Cold rolling strip is produced on a cold rolling line composed of stands: the hot strip passes through each stand in turn, each stand applies rolling force to the strip through a pair of work rolls, and the thickness of the strip is reduced continuously. Figure 1 is the equipment distribution diagram of a five-stand cold rolling line.
The thickness of the strip is an important quality indicator of cold rolling strip. In recent years, the demand for cold rolling strip has been constantly increasing, which brings new challenges to the production quality and efficiency of cold rolling strip. Once the tension of the strip goes out of range, the strip will fracture, which shows that tension control is key to guaranteeing stable production [1]. In the cold rolling process, there is a strong interaction between strip thickness and tension. Generally, the control strategy for thickness and tension is that the roll gap controls tension and the roll speed adjusts thickness.
The unsteady rolling process is a special part of the cold rolling process that includes acceleration, deceleration, the over-weld process, and so on. With the advancement of industrialization, the control performance of the steady-state rolling process has reached a very high level. However, in the unsteady rolling process, disturbances and uncertainties such as a variable lubrication state and variable equipment working conditions make it difficult to establish a mathematical model, which causes the thickness and tension control performance in the unsteady rolling process to remain inadequate. In addition, the coupling of thickness and tension in the unsteady rolling process also affects the quality of the finished strip.
The widely used Automatic Gauge Control (AGC) and Automatic Tension Control (ATC) in the cold rolling process, based on PI controllers, have reached a high level. The PI controller achieves excellent performance in Single Input Single Output (SISO) control systems, but it also has shortcomings such as an inability to handle multivariable coupling, a lack of flexibility, and poor adaptability. In the actual production of cold rolling strip, there is a strong correlation and interaction between the thickness control system and the tension control system.
To better adapt to the actual production process and improve the production quality, many researchers have tried to apply different control theories to the thickness and tension control of cold rolling strips [2]. Wang et al. [3] constructed a mathematical model considering random actuator failure and designed a fault-tolerant controller for the AGC system to control the thickness of the cold rolling process. Friedel et al. [4] established a rolling process model of a single stand and applied predictive functional control to thickness and tension control to improve control accuracy. Some researchers focus on decoupling control of thickness and tension. Li et al. [5] designed a decoupling controller based on the PID algorithm with the help of a diagonal recurrent neural network and achieved certain results on the proposed dynamic coupling model. In view of the coupling relationship between thickness and tension, a decoupling compensator based on the invariance principle was designed, improving the accuracy of thickness and tension control [6]. An [7] designed a multivariable decoupling control system for the thickness and tension of cold rolling based on a neural network with a genetic algorithm, which can resist small disturbances in the system. However, the above methods did not consider the random disturbance in the unsteady rolling process, which led to insufficient thickness and tension control in the unsteady rolling process.
To improve control performance and enhance system stability, many researchers have established mathematical models that consider disturbance terms and applied them to thickness and tension control. Hu et al. [8,9] introduced a new state space model of the cold rolling process and proposed a multivariable optimization strategy based on the inverse linear quadratic form and a control method based on a receding horizon control strategy, respectively, to optimize the control performance of thickness and tension. Koofigar et al. [10] established a mathematical model considering the uncertainty of the cold rolling process and used it to design a robust controller based on H∞ control theory. Ogasahara et al. [11] designed a controller based on explicit model predictive control, which obtained good control performance in the acceleration process under a small disturbance; in the unsteady rolling process with random disturbance, however, the performance of such controllers is still insufficient. Hu et al. [12] established a state space model with constraints and proposed a distributed model predictive control online optimal control strategy with good tracking performance compared with a PI controller. Ozaki et al. [13] proposed a nonlinear mathematical model to accurately control the tension and thickness of the cold rolling system during the acceleration and deceleration processes. To analyze the characteristics of the unsteady rolling process, Cao et al. [14] proposed an unsteady lubrication model, developed a dynamic chatter model, and analyzed the effects of the main process parameters on the stability of the rolling stand. The methods mentioned above can be categorized as model-based methods, in which the parameters are constant in the unsteady rolling process. However, accurate model parameters cannot be obtained because of many unexpected situations, such as insufficient data samples [15].
In addition, when the system state changes, a controller based on a fixed model struggles to maintain its performance. Meanwhile, Reinforcement Learning (RL) has developed significantly in recent years and has been applied in many fields, from the wireless Internet of Things and manufacturing to industrial process control [16][17][18]. Deep Reinforcement Learning (DRL) has attracted wide attention for solving high-dimensional control problems with high complexity [19]. DDPG is a model-free DRL method for continuous action spaces and has been widely applied in fields such as robotic control [20], manipulator control [21], and wireless sensors [22].
The stability of control systems with neural networks was proved by the Lyapunov method in [23,24], and DRL for control has made great progress in recent years [25]. Many researchers have applied DDPG to industrial process control, which is characterized by large disturbances and strong coupling, and achieved a series of results [18]. Gao et al. [26] applied the DDPG method to heating, ventilation, and air conditioning (HVAC) to achieve continuous and accurate thermal control of the system. Ma et al. [27] applied DDPG to a polymerization system with strong nonlinearity, large time delay, and noise. The DDPG controller learned the optimal strategy for controlling nonlinear valves in [28]. Spielberg et al. [29] extended DDPG to process control problems, evaluating their approach on SISO systems and Multi Input Multi Output (MIMO) systems and testing it under various scenarios. Notice that the DDPG method can operate without knowing the mathematical model and increases robustness against uncertainty in the thickness and tension control problem of the unsteady rolling process.
Therefore, the main contributions are listed as follows: 1. The model-free DDPG algorithm is applied, for the first time, to the thickness and tension control problem of the unsteady rolling process with random disturbance and complex uncertainty, overcoming the inaccurate modeling caused by random disturbance and the coupling relationship of thickness and tension. 2. We map the thickness and tension control problem of the unsteady rolling process to an MDP, taking the random disturbance of the unsteady rolling process into consideration; the action space of the agent and the state of the environment are divided according to the state space model of the steady-state rolling process, and the reward function is designed to compensate for the absence of a mathematical model of the unsteady rolling process and improve control performance. 3. Compared with a traditional PI controller, simulation results show that the DDPG controller has better accuracy, stability, and rapidity for thickness and tension in the unsteady rolling process. The maximum tension error of the DDPG controller is 11.58%, compared with 26.13% for the PI controller. The controller based on artificial intelligence improves the quality of the cold rolling strip and realizes intelligence.
The rest of the paper is organized as follows. In Sect. 2, we describe the unsteady rolling process and map the thickness and tension control problem to a Markov decision process. In Sect. 3, we set up the environment network and apply the DDPG controller to solve the thickness and tension control problem of the unsteady rolling process. Section 4 provides training and testing results, which show that the DDPG controller has better stability and rapidity than a PI controller. Finally, conclusions are drawn in Sect. 5.

The unsteady rolling process
The unsteady rolling process is a special part of the cold rolling process that includes acceleration, deceleration, the over-weld process, and so on. How to control the unsteady rolling process stably and accurately has become an urgent problem in the cold rolling process. The factors that make the unsteady rolling process hard to control are as follows. Due to the nonlinearity and time-varying behavior of the unsteady rolling process, coupled with uncertainties such as a variable lubrication state and variable equipment working conditions, it is difficult to establish an accurate mathematical model describing the unsteady rolling process. Although the steady-state mathematical model is available for reference, a constant-parameter model used in the actual control process can hardly keep up with continuously changing conditions, which leads to low model matching. In addition, key parameters such as rolling force and forward slip deviate in actual production, which also makes the existing model inaccurate. Figure 2 depicts the thickness and tension control curves in the steady-state and unsteady rolling processes. The thickness and tension control curves track the target curves well in the constant-speed range. However, in the range of velocity variation, the thickness and tension control curves deviate from the target curves, which indicates that the control accuracy of thickness and tension in the acceleration and deceleration processes with the PI controller is insufficient.
In our work, we propose a controller based on DRL to alleviate the insufficient control accuracy of thickness and tension in the acceleration and deceleration process. In order to deal with the difficulty of accurately describing the unsteady rolling process by the mathematical model, we adopt the strategy of constructing a deep neural network to represent the mathematical model. The specific methods will be described in Sect. 3.

State space model of thickness and tension
According to the bounce equation of the mill stand, the incremental equation of the exit thickness can be expressed as Eq. (1):

$$\Delta h_{out,i} = \Delta S_i + \frac{\Delta P_i}{M_m} \quad (1)$$

The subscripts $in$, $out$, $0$, and $i$ respectively denote the entry of a stand, the exit of a stand, the initial value, and the $i$th stand.
where $h$, $S$, $P$, and $M_m$ represent the strip thickness, roll gap, rolling force, and mill stiffness, respectively.
The rolling force is affected by the entry thickness, exit thickness, front tension, and back tension; linearizing gives Eq. (2):

$$\Delta P_i = \frac{\partial P}{\partial h_{in}}\Delta h_{in,i} + \frac{\partial P}{\partial h_{out}}\Delta h_{out,i} + \frac{\partial P}{\partial T_{in}}\Delta T_{in,i} + \frac{\partial P}{\partial T_{out}}\Delta T_{out,i} \quad (2)$$
where $T$ is the tension of the strip between adjacent stands. According to Eqs. (1) and (2), the exit thickness increment model can be expressed as Eq. (3).

In the rolling process, the strip exit speed $V_h$ and entry speed $V_H$ differ from the rolling speed because of forward slip and backward slip. The forward slip coefficient $f_i$ and the backward slip coefficient $b_i$ are defined as Eqs. (6) and (7):

$$f_i = \frac{V_{h,i} - V_i}{V_i} \quad (6) \qquad b_i = \frac{V_i - V_{H,i}}{V_i} \quad (7)$$

where $V$ represents the rolling speed.

The backward slip coefficient can also be obtained from the mass flow equation $h_{in}V_H = h_{out}V_h$. The forward slip is related to the friction coefficient, the deformation resistance, and the strip thickness and tension between stands, so we define the increment of the forward slip coefficient accordingly; the increments of the rolling speed and the backward slip coefficient can be expressed in the same way. According to Hooke's law and the inter-stand tension formula, the change rate of the tension increment between stands can be written as

$$\frac{d\Delta T_{in,i}}{dt} = \frac{EA}{L}\left(\Delta V_{H,i} - \Delta V_{h,i-1}\right)$$

Combining the above formulas yields the change rate of the tension increment. Based on field data and rolling experience, the dynamics of the actuators can be described by first-order lags:

$$\frac{d\Delta S_i}{dt} = \frac{\Delta U_{Si} - \Delta S_i}{\tau_S}, \qquad \frac{d\Delta V_i}{dt} = \frac{\Delta U_{Vi} - \Delta V_i}{\tau_V}$$

where $\Delta U_{Si}$ and $\Delta U_{Vi}$ are the roll gap target value and roll speed target value of stand $i$, respectively, $\tau_S$ is the time constant of the HGC (hydraulic gap control) system, and $\tau_V$ is the time constant of the ASR (automatic speed regulation) system. By adjusting the control signals $U_{Si}$ and $U_{Vi}$, the strip thickness and the inter-stand tension can track their reference values.
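As an illustration, the slip coefficients can be computed directly from the strip and roll speeds. The paper's exact forms of Eqs. (6) and (7) are not reproduced here, so the snippet below uses the common textbook definitions and invented speed values:

```python
def slip_coefficients(v_exit, v_entry, v_roll):
    """Forward slip f = (V_h - V)/V and backward slip b = (V - V_H)/V,
    under the common definitions (illustrative, not the paper's exact Eqs. 6-7)."""
    f = (v_exit - v_roll) / v_roll
    b = (v_roll - v_entry) / v_roll
    return f, b

# Invented speeds in m/s: strip leaves faster than the roll surface moves.
f, b = slip_coefficients(v_exit=10.5, v_entry=9.0, v_roll=10.0)
print(f, b)  # 0.05 0.1
```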
The control structure of stand 5 is shown in Fig. 3. Based on the above formulas and state space theory, the state space model of tension and thickness between adjacent stands can be written in the standard form of Eq. (18):

$$\dot{x}_i(t) = A_i x_i(t) + B_i u_i(t) + E_i d_i(t), \qquad y_i(t) = C_i x_i(t) \quad (18)$$

where $x_i(t)$ is the state, $u_i(t)$ the control input, $d_i(t)$ the disturbance term, and $y_i(t)$ the output of stand $i$.
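To illustrate how such a state space model behaves, the sketch below Euler-integrates a generic linear system $\dot{x} = Ax + Bu$. The matrices, the two-state interpretation, and the initial values are placeholders, not the paper's Eq. (18) parameters, which depend on mill data not reproduced here:

```python
import numpy as np

def simulate_lti(A, B, x0, u, dt=0.1, steps=50):
    """Euler-integrate dx/dt = A x + B u for a fixed control input u."""
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        x = x + dt * (A @ x + B @ u)
        traj.append(x.copy())
    return np.array(traj)

# Toy 2-state example: [tension increment, thickness increment] with a stable A.
A = np.array([[-0.5, 0.1], [0.0, -1.0]])
B = np.array([[0.0, 0.4], [0.8, 0.0]])
traj = simulate_lti(A, B, x0=[1.0, -0.5], u=np.array([0.0, 0.0]))
print(traj.shape)  # (51, 2)
# With zero input, the stable dynamics drive both increments toward zero.
```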

RL and DDPG
RL is an important branch of machine learning; an RL algorithm maps the environment state detected by the agent to the action of the agent. Through a trial-and-error learning process in the environment, the agent takes the action that maximizes the long-term discounted reward in Eq. (19):

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \quad (19)$$

where the discount factor satisfies $0 < \gamma < 1$.
Many RL algorithms, such as Q-learning, SARSA, and DQN, rely on an action-value function estimated from experience; these are called value-based RL algorithms. Trial-and-error search and delayed reward are prominent features of RL and provide a solid foundation for it. In essence, the action-value function estimates the total reward $R_t$ obtained from state $s$ by taking action $a$ and then following policy $\pi$, as expressed by Eq. (20):

$$Q^{\pi}(s, a) = \mathbb{E}\left[R_t \mid s_t = s,\; a_t = a\right] \quad (20)$$

Once the action-value function $Q$ converges to the optimum with sufficient experience, the action taken in each state moves the current state to the state with the highest action-value estimate.
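The discounted return of Eq. (19) can be computed recursively from the end of an episode; the reward sequence and discount factor below are arbitrary illustrative values:

```python
def discounted_return(rewards, gamma=0.9):
    """Total discounted reward R_t = sum_k gamma^k * r_{t+k} (Eq. 19),
    accumulated backwards so each step is a single multiply-add."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```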
Usually, the action-value function is represented by a nonlinear function approximator. If the action-value function does not generalize, its architecture is no longer applicable [30]. DQN and DDPG iteratively update the weights $\theta^Q$ of the Q-network by minimizing the loss function

$$L = \frac{1}{N}\sum_i \left(y_i - Q(s_i, a_i \mid \theta^Q)\right)^2$$
RL was initially proposed for discrete action-space problems, but many situations require continuous actions [31]. The actor-critic algorithm, based on the policy gradient for continuous spaces, was proposed by Sutton et al. [32]. It consists of two parts: an actor network updated by the policy gradient, and a critic network that estimates the action-value function $Q$. Figure 4 shows the schematic diagram of the actor-critic algorithm.
However, the actor-critic algorithm needs a long training cycle and has difficulty converging to the optimal result. Lillicrap et al. developed DDPG, borrowing mini-batch learning from DQN, to ensure convergence and improve training speed [33]. A replay buffer is used in DQN to store historical samples $[s_t, a_t, r_t, s_{t+1}]$. During training, the network weights are updated with mini-batches of tuples sampled randomly from the buffer.
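A minimal replay buffer of the kind used by DQN and DDPG can be sketched as follows; the capacity and batch size are illustrative, not the paper's settings:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next) tuples with uniform mini-batch sampling."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)  # oldest samples are evicted automatically

    def store(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)

    def __len__(self):
        return len(self.buf)

buf = ReplayBuffer(capacity=100)
for t in range(150):          # exceed capacity: only the last 100 tuples survive
    buf.store(t, 0.0, -1.0, t + 1)
print(len(buf))               # 100
batch = buf.sample(32)
print(len(batch))             # 32
```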
Following the step from DQN to Double DQN, DDPG also adopts a target network alongside the current network. The target network's weights are updated slowly, which greatly improves the stability of training [34]. DDPG learns a deterministic policy; unlike the stochastic actor-critic algorithm, its actor is updated by the sampled policy gradient

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_i \nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_i,\, a=\mu(s_i)}\; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s_i}$$

The target networks are slowly updated by Eqs. (23) and (24):

$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1-\tau)\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1-\tau)\theta^{\mu'}$$
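The soft target-network update of Eqs. (23) and (24) amounts to an exponential moving average of the online weights. A sketch with NumPy arrays standing in for the network layers and an illustrative τ:

```python
import numpy as np

def soft_update(target_weights, online_weights, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta', applied layer by layer."""
    return [tau * w + (1.0 - tau) * wt
            for wt, w in zip(target_weights, online_weights)]

online = [np.ones((2, 2))]    # stand-in for one layer of online weights
target = [np.zeros((2, 2))]   # target starts elsewhere, drifts toward online
target = soft_update(target, online, tau=0.1)
print(target[0][0, 0])  # 0.1
```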

Mapping thickness and tension control to MDP
In this subsection, we map the thickness and tension control problem of the unsteady rolling process as a Markov Decision Process (MDP), which is then solved by a model-free DRL algorithm in Sect. 3.3.
MDP is a mathematical model of sequential decision-making in RL [34]. It describes a fully observable environment with four basic elements: state $s$, action $a$, reward $r$, and state transition probability $p$. In an MDP, the agent constantly interacts with its environment, choosing actions according to its policy; the environment responds by presenting a new state $s_{t+1}$ and giving a feedback reward $r$ to the agent.
According to the state space model of a single stand in Eq. (18), the strip thickness and tension at the current time are affected only by the state at the previous time step and not by the state at any earlier time, so the Markov property holds.
In the thickness and tension control problem of the unsteady rolling process, the four basic elements are described as follows.

State $s$: (1) current entry tension of the $(i-1)$th stand $T^t_{in,i-1}$; (2) current entry tension of the $(i+1)$th stand $T^t_{in,i+1}$; (3) current entry tension of the $i$th stand $T^t_{in,i}$; (4) current roll gap of the $(i-1)$th stand $S^t_{in,i-1}$; (5) current roll gap of the $i$th stand $S^t_{in,i}$; (6) current … Note that the state parameters include the disturbance term $d_i(t)$, input term $x_i(t)$, and output term $y_i(t)$ of the state space model.
Action $a$: (1) the set-point of the roll gap $U^t_{Si}$; (2) the set-point of the roll speed $U^t_{Vi-1}$. The action corresponds to the control term $u_i(t)$ of the state space model, in which $U^t_{Si}$ and $U^t_{Vi-1}$ are continuous variables.
We define the tension error $\Delta T^t_{in,i}$ and thickness error $\Delta h^t_{out,i}$ as relative deviations from the set-points, Eq. (25):

$$\Delta T^t_{in,i} = \frac{T^t_{in,i} - T^{ref}_{in,i}}{T^{ref}_{in,i}}, \qquad \Delta h^t_{out,i} = \frac{h^t_{out,i} - h^{ref}_{out,i}}{h^{ref}_{out,i}} \quad (25)$$

Reward: we design the reward function by jointly considering the thickness and tension accuracy of the rolling stand and the convergence efficiency of agent training, Eq. (26):

$$r_t = -\left[\left(\Delta T^t_{in,i}\right)^2 + \left(\Delta h^t_{out,i}\right)^2\right] + \begin{cases} 0.08, & \text{squared-error sum} < 10\% \\ 0.05, & 10\% \le \text{squared-error sum} < 20\% \\ 0, & \text{otherwise} \end{cases} \quad (26)$$

The base reward is the negative of the squared sum of the tension error and thickness error. This form avoids excessive thickness or tension errors and keeps the thickness and tension controlled at the reference values, maintaining stable control performance. For the DDPG algorithm, the closer the average reward is to 0, the easier it is to learn the optimal strategy and the faster agent training converges. When the squared sum of the tension and thickness errors is between 10 and 20%, the agent is given an additional reward of 0.05; when it is less than 10%, the additional reward is 0.08. According to Eq. (26), the higher the reward, the more closely the thickness and tension curves fit the set-points, which also means better control performance. The MDP of the thickness and tension control problem of the unsteady rolling process is illustrated in Fig. 5.
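A direct implementation of the reward just described might look as follows, assuming the errors are expressed as fractions of the set-points (0.1 = 10%):

```python
def reward(tension_err, thickness_err):
    """Negative squared-error sum plus the stepped bonus described in the text:
    +0.08 if the squared-error sum is below 10%, +0.05 if it is in 10-20%."""
    sq = tension_err ** 2 + thickness_err ** 2
    r = -sq
    if sq < 0.10:
        r += 0.08
    elif sq < 0.20:
        r += 0.05
    return r

print(reward(0.0, 0.0))   # 0.08: exactly on target, full bonus
print(reward(0.3, 0.3))   # -0.13: squared sum 0.18 falls in the 10-20% band
print(reward(0.5, 0.5))   # -0.5: squared sum 0.5, no bonus
```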
For the above MDP, the state transition probability $p$ is not defined. The state transition probability is the probability distribution over next states after the agent takes actions $U^t_{Si}$ and $U^t_{Vi-1}$ in a given state. When the state transition probability is known, the MDP is completely observable and the cumulative reward can be solved by iterative methods. In the thickness and tension control problem of the unsteady rolling process, obtaining an accurate state transition probability model is a complex task, because an accurate mathematical model is difficult to establish, as discussed in Sect. 2. Based on these considerations, we adopt the model-free DDPG algorithm to overcome the uncertainties of the unsteady rolling process and the cross-coupling between stands. Without prior knowledge of the environment or state transitions, the model-free DDPG algorithm learns the thickness and tension control strategy by interacting with the environment as the training episodes grow.

Thickness and tension control framework
In the thickness and tension control problem of the unsteady rolling process, the simulation starts from a random initial state. In each step of each episode, the action $a_t$ is generated by the deterministic policy of the actor network plus random OU noise for exploration [31]; the noise decreases as the episodes accumulate. At each time step, the neural network simulating the rolling stand returns a training tuple to the replay buffer, in which the reward value is calculated according to Eq. (26).
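The OU exploration noise can be sketched as a discretized Ornstein-Uhlenbeck process; the θ, σ, and dt values below are illustrative defaults, not the paper's settings:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated, mean-reverting noise,
    commonly added to DDPG actions for exploration."""
    def __init__(self, dim, theta=0.15, sigma=0.2, dt=0.1, seed=0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(dim)
        self.rng = np.random.default_rng(seed)

    def sample(self):
        # dx = -theta * x * dt + sigma * sqrt(dt) * N(0, I)
        dx = (-self.theta * self.x * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape))
        self.x = self.x + dx
        return self.x

noise = OUNoise(dim=2)  # one noise channel per action (roll gap, roll speed)
samples = np.array([noise.sample() for _ in range(500)])
print(samples.shape)          # (500, 2)
print(float(samples.mean()))  # hovers around zero (mean-reverting)
```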
After enough experience has been stored in the replay buffer, the optimal strategy is learned by random sampling in mini-batches. The critic online network is updated by Eq. (21) and the actor network by Eq. (22). After each training step, the target critic network and the target actor network are slowly updated by Eqs. (23) and (24).
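For a sampled mini-batch, the critic's training targets take the standard DDPG form $y_i = r_i + \gamma\, Q'(s_{i+1}, \mu'(s_{i+1}))$. A minimal sketch, with the target-network outputs assumed to be precomputed:

```python
import numpy as np

def critic_targets(rewards, q_next, gamma=0.99):
    """y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) for a mini-batch.
    q_next holds the target critic's values at the target actor's actions."""
    return rewards + gamma * q_next

r = np.array([-0.13, 0.08])   # mini-batch rewards (illustrative)
qn = np.array([1.0, 2.0])     # target-network Q values (illustrative)
print(critic_targets(r, qn))  # [0.86 2.06]
```

The squared difference between these targets and the online critic's predictions is exactly the loss minimized in the critic update.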
The specific process is shown in Fig. 6. The pseudocode of DDPG algorithm for thickness and tension control in cold rolling is included as following.
The flowchart of DDPG Algorithm for thickness and tension control in cold rolling is shown as Fig. 7 below.
The advantages of the DDPG controller are that it performs continuous control and that it improves training efficiency and stability compared with a controller based on the plain actor-critic algorithm [31]. The DDPG controller has therefore become one of the ideal candidates in the industrial process control field.

Simulation result
This section presents the training and test results of the DDPG controller for thickness and tension control of the unsteady rolling process. We compare the thickness and tension control performance of the PI controller and the DDPG controller to illustrate the superiority of the latter. The results show that the DDPG controller has better stability and rapidity.

Simulation environment based on deep neural network
In this subsection, we define the simulation environment for the thickness and tension control problem of the unsteady rolling process. According to the MDP theory of the previous section, the simulation environment should accept the current state and action and present the new state of the next time interval together with the reward. We use a data-driven neural network to simulate this process, training it on real data to serve as the simulation environment. The action and the current state form the input vector of the environment network, and the state at the next time interval is its output, so that the network simulates the rolling stand in the thickness and tension control problem of the unsteady rolling process. For faster training and better convergence, the different types of input and output data of the environment network are normalized to [0, 1].
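The normalization step can be sketched as per-feature min-max scaling; the ranges below are invented for illustration, not the plant's actual operating ranges:

```python
import numpy as np

def minmax_scale(x, lo, hi):
    """Map each feature to [0, 1] given per-feature min/max from the training data."""
    return (np.asarray(x, dtype=float) - lo) / (hi - lo)

# Hypothetical ranges: tension in [10, 30] kN, thickness in [0.2, 1.0] mm.
lo = np.array([10.0, 0.2])
hi = np.array([30.0, 1.0])
print(minmax_scale([20.0, 0.6], lo, hi))  # [0.5 0.5]
```

At inference time the same `lo`/`hi` learned from the training data must be reused, otherwise the network sees inputs on a different scale than it was trained on.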
For the environment network with 12 inputs and 10 outputs, we first tried at most 128 neurons per layer with five fully connected layers, but the accuracy of the environment network on the test set only reached about 75%. Considering the complexity of the thickness and tension control problem of the unsteady rolling process, we increased the maximum number of neurons and the number of fully connected layers to better simulate the unsteady rolling process. Ultimately, an environment network with seven fully connected layers is used in the simulation, with 512 and 256 nodes in the third and fourth hidden layers, respectively. To normalize the output data to [0, 1], we adopt sigmoid as the activation function of the output layer. In the thickness and tension control problem of the unsteady rolling process, the exit thickness of stand 5 directly affects product quality, so this paper simulates the control of stand 5 in the deceleration and acceleration processes.
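A pure-NumPy sketch of such a seven-layer network's forward pass follows. The 512 and 256 widths of the third and fourth hidden layers come from the text; the remaining widths, the ReLU hidden activations, and the random weights are our assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, weights, biases):
    """Fully connected forward pass: ReLU on hidden layers, sigmoid on the
    output layer so every output lands in [0, 1] like the normalized state."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, x @ W + b)
    return sigmoid(x @ weights[-1] + biases[-1])

# 12 inputs, 10 outputs, seven weight layers; 512/256 are from the paper,
# the other hidden widths are invented.
sizes = [12, 64, 128, 512, 256, 128, 64, 10]
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]
out = mlp_forward(np.ones(12), Ws, bs)
print(out.shape)  # (10,)
# All outputs lie in [0, 1] thanks to the sigmoid output layer.
```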
To better evaluate the performance of the environment network, 10% of the deceleration process data and 10% of the acceleration process data are extracted as separate test sets. Figure 8 shows the convergence trend of accuracy over training episodes for the deceleration and acceleration processes. As shown in Fig. 8a, the accuracy of the deceleration environment network reaches 0.8 after 200 episodes and 0.9 after 700 episodes, finally stabilizing at about 0.93. Similarly, in Fig. 8b, the accuracy of the acceleration environment network reaches 0.8 at about 500 episodes and 0.9 at about 1000 episodes, finally stabilizing at about 0.92.
The accuracy of the trained deceleration and acceleration environment networks on the test sets is 92.44% and 91.82%, respectively, which indicates that the environment network generalizes well and can characterize the unsteady rolling behavior of the mill in both the deceleration and acceleration processes.

Training result
We implement the thickness and tension control framework for the unsteady rolling process in Python using the TensorFlow package [35]. Considering that the actor network and critic network have only 2 and 1 outputs, respectively, we reduce the maximum number of neurons and fully connected layers compared with the environment network. The actor network has five fully connected layers with 32, 64, 256, 128, and 32 neurons, respectively; the critic network also has five fully connected layers, with 64, 256, 128, 16, and 1 neurons. Training for 3000 episodes takes about 11 h on an Intel Core i5-4200H CPU.
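Under one possible reading of these architectures (treating the listed widths as stacked layers with separate 2-output and 1-output heads, a 10-dimensional state, and a 12-dimensional state-action critic input, all of which are our assumptions), the parameter counts can be tallied as:

```python
def param_count(sizes):
    """Weights plus biases of a fully connected stack with the given layer widths."""
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

# Hidden widths from the text; input dims and output heads are assumptions.
actor  = [10, 32, 64, 256, 128, 32, 2]
critic = [12, 64, 256, 128, 16, 1]
print(param_count(actor))   # 56194
print(param_count(critic))  # 52449
```

Both networks are far smaller than the environment network, consistent with the text's remark that the actor and critic need less capacity given their low output dimensions.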
We set the time interval of each episode to 5 s and each step to 0.1 s, so each episode contains 50 steps. Figure 9a shows the total reward of each episode vs. training episodes in the deceleration process; we sampled every five of the 3000 episodes. During the first 200 episodes of random exploration, the total reward of each episode is concentrated between −22 and −3. Once training starts, the DDPG controller keeps learning better control strategies and obtaining higher total rewards. As stated in the definition of the reward function, the higher the reward, the better the coupled thickness and tension control performance. The total reward converges to a maximum of about 2.5 in about 1600 episodes, and it takes about 5 h for the DDPG controller to learn the optimal control strategy.
In Fig. 9b, we also provide the total reward of each episode vs. training episodes in the acceleration process; similarly, we sampled every five of the 3500 episodes. The reward of the DDPG controller dips once between episodes 1000 and 1500, and finally converges to a maximum of about 3.9 at around 2000 episodes, which indicates that the DDPG controller learns the optimal control strategy in the acceleration process. It takes about 7 h for the DDPG controller to learn this optimal strategy.

Test result
In this subsection, we compare the control performance of the PI controller and the DDPG controller for thickness and tension control to further illustrate the superiority of the DDPG controller. Figure 10a shows the rolling speed curve of stand 5, which indicates that the simulated rolling stand is in the deceleration process. The results in Fig. 10b, c show that the actions taken by the DDPG controller differ considerably from those of the PI controller in the deceleration process, the PI controller taking smoother control actions. As shown in Fig. 10d, e, the resulting trajectories of the DDPG controller closely follow the target curves in the thickness and tension control problem of the unsteady rolling process, whereas the curves of the PI controller fluctuate up and down and do not track the targets well. Pre-training with historical data can reduce the number of training episodes needed in the practical application stage. It takes 1600 training episodes for the controller to learn the optimal strategy shown in Fig. 10. Note that the number of training episodes needed to learn the optimal thickness and tension control strategy is not fixed; it is jointly determined by the data used to train the environment network and the initial observation state. The strategy adopted by the DDPG controller is to increase the exit thickness of stand 5 and reduce its entry tension to resist the subsequent random disturbance. From the comparison of the control results in Fig. 10d, e, the maximum tension error of the DDPG controller is 11.65%, compared with 26.48% for the PI controller, with a tension set-point of 18.62 kN. The control performance of thickness and tension control in the acceleration process is shown in Fig. 11.
Unlike the thickness and tension control in the deceleration process, the controller achieved the performance shown in Fig. 11 after about 1800 episodes in the acceleration process. The difference in the number of training episodes needed to learn the optimal strategies in the deceleration and acceleration processes is caused by the different data used to train the environment networks and the choice of initial observation state. Figure 11a shows the simulated rolling stand in the acceleration process. Figure 11b, c again show that the actions taken by the DDPG controller differ considerably from the smooth control actions of the PI controller. Figure 11d, e show the control performance of thickness and tension in the acceleration process: the DDPG controller tracks the target thickness and tension curves well, while the PI controller does not. Compared with the PI controller's strategy, the optimal strategy adopted by the DDPG controller reduces the maximum tension error from 22.33 to 1.65%, with a tension set-point of 23.52 kN in the acceleration process.
In dealing with the unsteady rolling process with random disturbance, the DDPG controller has better stability and accuracy than the traditional PI controller, and the thickness and tension control performance is clearly improved. Note that the DDPG controller learns, without any prior knowledge, a control strategy that resists the random disturbance in the control system, using mini-batch sampling from the replay buffer and target networks updated alongside the current networks.

Conclusion
This paper develops a DDPG controller for the thickness and tension control problem of the unsteady rolling process. The controller realizes coupled control of thickness and tension in both the deceleration and acceleration processes. By comparison, the DDPG controller achieves better thickness and tension control performance in the unsteady rolling process than the PI controller. DDPG improves the control accuracy and stability of the process by maximizing the long-term discounted reward. We also show that the DDPG controller adapts well to complex process control problems with random disturbance, strong coupling, and nonlinearity. In addition, because DDPG is a model-free RL algorithm, the control performance of a DDPG-based controller is not tied to a fixed model of the rolling environment. In summary, the DDPG-based controller can control the thickness and tension of the unsteady rolling process without prior knowledge. DRL can be used to solve complex industrial problems in a way that differs from other control methods, which indicates that DRL-based controllers have great application prospects in industrial process control.
In future work, we will use parallel actors in a distributed system to comprehensively control the whole cold rolling thickness tension control system. We may also consider the prioritized replay mechanism to improve the training efficiency in the next research.
Author contribution Wenying Zeng: conceptualization, methodology, investigation, data curation, software, formal analysis, experiment, and writing of the manuscript. Jinkuan Wang: conceptualization, resources, funding acquisition, supervision, project administration, and review. Yan Zhang: data collection and curation, writing review and editing. Yinghua Han: methodology, supervision, and writing including review and editing. Qiang Zhao: methodology, supervision, and writing including review and editing.
Funding This work was supported by the National Natural Science Foundation of China (U21A20475,U1908213), Colleges and Universities in Hebei Province Science Research Program (QN2020504), The Fundamental Research Funds for the Central Universities (N2223001).

Data availability
The data are available on reasonable demand.

Code availability
The code is available on reasonable demand.

Declarations
Ethics approval and consent to participate All authors understand and approve the ethical responsibilities of the authors. The authors consent to participate.