Hierarchical Reinforcement Learning Considering Stochastic Wind Disturbance for Power Line Maintenance Robot

Robot intelligence includes motion intelligence and cognitive intelligence. Aiming at motion intelligence, a hierarchical reinforcement learning architecture considering stochastic wind disturbance is proposed for the decision-making of the power line maintenance robot with autonomous operation. This architecture uses the prior information of mechanism knowledge and empirical data to improve the safety and efficiency of robot operation. In this architecture, the high-level policy selection and the low-level motion control at the global and local levels are considered comprehensively under the condition of stochastic wind disturbance. Firstly, the operation task is decomposed into three sub-policies: global obstacle avoidance, local approach and local tightening, and each sub-policy is learned. Then, a master policy is learned to select the operation sub-policy in the current state. The double deep Q-network algorithm is used for the master policy, while the deep deterministic policy gradient algorithm is used for the operation policies. In order to improve the training efficiency, the global obstacle avoidance sub-policy takes the random forest composed of dynamic-environment decision trees as the expert algorithm for imitation learning. The architecture is applied to a power line maintenance scenario, the state function and reward function of each policy are designed, and all policies are trained in an asynchronous parallel computing environment. It is shown that this architecture can realize stable and safe autonomous operating decisions for the power line maintenance robot subjected to stochastic wind disturbance.


Introduction
In the electric power field, with its large coverage and complex network, power transmission lines (PTLs) often need to be maintained to ensure the stability and reliability of the power supply system. Electric workers usually operate at high altitude in a high-voltage environment that contains safety risks and places high requirements on the technical level of operators. Therefore, in such a dangerous and complex environment, it is appropriate to use robots instead of humans to carry out live operations. Autonomous operation is a promising grid robotics technology that improves the safety and efficiency of live operations, reduces energy consumption, and liberates human operators. In the architecture of an autonomous operating robot, decision-making is the key component of autonomous operation. So far, most decision algorithms can be divided into two broad categories: rule-based approaches and imitation-based approaches. The former is usually addressed by manually coding rules for robot behavior, while the latter focuses on using supervised learning techniques to mimic the operations of electrical workers. In order to make safe and efficient decisions under different operating conditions, both rely on a large amount of natural operational data to cover all possible operating scenarios for testing or training.
The rule-based approach has been widely studied in the field of power line maintenance robots (PLMR).
The Expliner robot [1] is designed to cross suspension clamps on PTLs in an "acrobatic mode" and can be controlled in a semi-automatic manner. After the engineer inputs the position data of the encoders into the memory of the controller, these data can be mapped into a motion sequence by certain rules, and the staff can make the robot reach the correct predetermined position through simple operation of the manipulator. The LineScout robot [2] uses unified strategies and rules to safely and controllably coordinate all motors through visual perception. In order to avoid human error and the possibility of accidentally damaging LineScout, three safety interlocking rules are implemented in the control software: a speed limit, a mode lock, and a linkage limit. A sequencer plans the movement of the robot according to these rules and assists the operator in correctly crossing obstacles.
Lioutikov and Maeda et al. [3] intuitively encode the hierarchical architecture using a formal grammar based on the nature of rules, and constrain and efficiently search the grammar space by using the physical properties of robot motion primitives. The adaptability of simple motion primitives and the continuity of complex motion sequences are verified on a 7-DOF lightweight manipulator. Antanas and Moreno et al. [4] propose a robot grasping method based on a probabilistic logic framework, which uses semantic object parts to improve grasping ability. The framework semantically interprets the pre-grasp configurations associated with the expected task and uses object-task affordances and object/task ontologies to encode rules that generalize over similar object parts and object/task classes. Muni and Parhi et al. [5] take the sensor's output on the distance of obstacles as the input of a rule-based model and obtain the turning angle as the output. A rule-based analysis method is used to train a fuzzy controller with membership functions. The output of the rule-based model and other conventional inputs are provided to a Sugeno fuzzy model, and the effective steering angle is used as the final output to avoid obstacles in the environment and safely guide the humanoid robot to the target point. Gordic and Jovanovic et al. [6] develop two complementary decision rules to identify collisions.
The first rule, based on the absolute difference between matched samples after comparison, uses a statistically determined threshold to perform quick detection of clear collisions. The second rule is based on the eigenvalues of the matched-sample covariance matrix, using their higher sensitivity to detect collisions of lower intensity.
The learning-based method is an effective way to solve the problems of the PLMR. Yan Yu et al. [7], combining the nonlinear mapping and approximation characteristics of neural networks, abstract the evaluation of the insulator cleaning effect into a nonlinear approximation process from the actual cleaning effect to the ideal cleaning effect.
An evaluation method of the robot insulator cleaning effect based on neural networks is thus proposed. Through the training of the neural network, the cleaning control parameters of the robot can be obtained and used in the online operation control of the robot, so that a better cleaning effect can be obtained. Jiang Wei et al. [8], based on the nonlinear approximation characteristics of neural networks, use a neural network to solve the fault diagnosis problem of the system. In their control method for the PLMR, the tilt angle of the robot caused by wind load and spatial field strength is the input signal of the neural network. Through the training and learning of the neural network, the output control variables are used to compensate the actual operating posture of the robot. Compared with traditional proportional-integral-derivative (PID) control, this algorithm has high real-time performance and good stability. Li Hongjun et al. [9] derive the state space expression of joint motion control under disturbance and uncertainty, and establish a sliding mode variable structure trajectory tracking control model of the manipulator. A neural network learning method is used to compensate the influence of the perturbed control parameters on the motion control of the robot, and the stability of the controller is analyzed using Lyapunov theory. The control method has good generality, strong adaptability and easy extensibility. Xiang and Su [10] propose an effective model-free off-policy actor-critic algorithm by integrating the task reward and a task-oriented guiding reward, and apply it to skill acquisition and continuous control of the robot. The agent can explore the environment more deliberately so as to improve the sampling efficiency. It can also leverage the experience more effectively, resulting in significant performance gains. Sasabuchi and Wake et al. [11] propose a body role division method suitable for different robots. It guides the configuration of the robot to complete a given task, which helps to teach a series of task sequences. Using a single human demonstration to combine task and movement knowledge, the method starts from the facts of human movement and decomposes the robot's body configuration into different roles by using the structural analogy of the human body. Zuo and Zhao et al. [12] propose a new off-policy AIL method, Robust Adversarial Imitation Learning (RAIL). The hindsight idea of variable reward (VR) is first incorporated into the off-policy AIL framework in order to make the agent significantly outperform the suboptimal expert who provided the demonstrations. In the AIL framework, the hindsight copy (HC) strategy designed by the new method maximizes the use of different demonstrations and speeds up learning.
In recent years, Hierarchical Reinforcement Learning (HRL) has been regarded as a reinforcement learning approach for more complex environments. Ahn and Song et al. [13] propose a method for learning decision ability in a cluttered environment, enabling the robot to grasp the target object after rearranging the objects that block it. Learning takes a hierarchical architecture consisting of high- and low-level motion selectors. The former determines the grasp or push actions, and the latter determines how to perform the selected actions. Saha and Gupta et al. [14] use HRL to learn the optimal strategy for operating at different time steps in order to achieve user goals in the hierarchy of a dialogue exchange. The dialog manager consists of a high-level domain meta-policy, a mid-level intent meta-policy for selecting among multiple sub-tasks or options, and a low-level control policy for selecting basic actions to complete the subtasks given by the higher-level meta-policies in different intents and domains.
Duan and Li et al. [15] propose an HRL method for the decision-making of autonomous vehicles that does not rely on a large amount of labeled driving data. The method takes into account both high-level maneuver selection and low-level motion control in the horizontal and vertical directions. Firstly, the driving task is divided into lane keeping, right lane change and left lane change, and the sub-policy of each maneuver is learned. Then, a master policy is learned to select the maneuver policy in the current state. Li and Srinivasan et al. [16] propose a layered approach that relies on traditional model-based controllers at the lower level and learned policies at the middle level. The low-level controllers can robustly perform different operating primitives (pose, slide, roll), and the middle-level policy orchestrates these primitives. The method is robust to inaccuracy of the target model and to observation noise. Pang and Liu et al. [17] study a layered approach in which the hierarchy involves two levels of abstraction. One is macro-operations extracted from expert demonstration trajectories, which can shrink the action space by an order of magnitude while remaining effective. The other is a two-level hierarchical architecture that is modular and easy to extend. The agent is trained from easy to hard through a curriculum transfer learning method. In order to solve the performance problem in path planning, Yu and Su et al. [18] use neural networks to enable robots to perceive the environment and perform feature extraction, which gives them state-action functions adapted to the environment. The path planning model of the mobile robot is constructed by mapping the current state to these actions through HRL.
The influence of stochastic wind load on mechanism/cable systems, especially in high-altitude environments, cannot be ignored. Alhassan and Zhang et al. [19] develop a lightweight, dual-arm robot and study its robustness to wind disturbances on a lab-scale PTL structure. The design of the robot components ensures a low aerodynamic drag coefficient, and the mechanism of the wind force acting on the robot-line coupling system is given. Lu and Liu et al. [20] propose a finite-frequency H∞ controller based on prior information for active control of a double telescopic boom. The robustness of the control strategy with an adaptive estimator to a stochastic wind field is studied. Although the stochastic wind field greatly aggravates the fluctuation of the contact force, the control strategy works well. In order to study the dynamic response of the transmission tower-line system under wind and rain loads, Fu and Li et al. [21] propose a calculation method for the wind and rain loads acting on PTLs. Modeling and dynamic analysis are carried out for a three-span conductor and pylon system, and the effect of the wind attack angle on the response of the pylon system under wind and rain loads is studied. Pombo and Ambrosio [22] develop a coupled multi-body/finite element computational model. In the co-simulation environment, the wind field forces act as nonlinear external forces that vary with time and are distributed on the finite element grid to act on each object. These wind field forces can be characterized by computational fluid dynamics or wind tunnel experiments. Zdziebko and Martowicz et al. [23] use the fluid-structure interaction (FSI) method with the Altair AcuSolve solver to calculate the additional lift force caused by airflow at high speeds. The Spalart-Allmaras turbulence model is considered in the steady-state simulation. A rule-based controller is used to output a suitable torque taking into account the actual speed, aerodynamic force and electromagnetic force. Song and Liu et al. [24] use an empirical spectrum to simulate the fluctuating wind speed and the stochastic wind field along the catenary, and derive the buffeting forces acting on the contact line and the suspension cable. In order to improve the current collection quality of the pantograph system under a strong stochastic wind field, a proportional-derivative (PD) sliding mode controller for a high-speed active pantograph is proposed.
The main contribution of this research is to propose an HRL architecture considering stochastic wind disturbance for the decision-making of the PLMR with autonomous operation. This architecture utilizes the prior information of mechanism knowledge (fluid-rigid-flexible coupling dynamics) and empirical data (a random forest composed of decision trees) to improve the safety and efficiency of PLMR operation. In this architecture, the high-level policy selection and the low-level motion control at the global and local levels are considered comprehensively under the condition of stochastic wind disturbance. Firstly, the task is decomposed into three sub-policies: global obstacle avoidance, local approach and local tightening, and each sub-policy is learned. Then, a master policy is learned to select the operation sub-policy in the current state. All policies (including the master policy and the operation policies) are represented by fully connected neural networks (FCNNs). The master policy uses the double deep Q-network (DDQN) algorithm, while the operation policies use the deep deterministic policy gradient (DDPG) algorithm. In order to improve the training efficiency, the global obstacle avoidance sub-policy takes the random forest composed of dynamic-environment decision trees (rapidly-exploring random trees or rapid search trees) as the expert algorithm for imitation learning. The architecture is applied to a PTL maintenance scenario, and the state function and reward function of each policy are designed respectively. All policies are trained in an asynchronous parallel computing environment (APCE).
The rest of this paper is organized as follows: Section 2 proposes a hierarchical reinforcement learning architecture for decision-making of the robot. Section 3 describes the design of the experiment, the state space, and the reward function. Section 4 discusses the training results of each operation policy and task. Section 5 is the conclusion of this paper.

Preliminaries
The PLMR operates autonomously under a Markov Decision Process (MDP): at running step $t$, the robot observes a state vector $s_t$, takes an action vector $a_t$, receives a scalar reward $r_t$, and then reaches the next state $s_{t+1}$. The reward is usually designed by human experts, with a lower value of $r_t$ when the robot is in a bad state (such as a collision) and a higher value of $r_t$ when it is in a normal state (such as keeping a safe distance from the PTL). This process continues until the robot reaches a termination state, such as a collision or goal completion. In this investigation, the reinforcement learning goal is: for each state $s$, find a deterministic policy $\pi(s)$ that maximizes the value function $V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \mid s_0 = s\right]$ representing the expected return, where $\gamma \in (0, 1]$ is a discount factor that weighs the importance of current and future rewards. The DDQN and DDPG algorithms are used in the HRL of this study to match the decision-making characteristics of the PLMR. The DDQN algorithm is a variant of DQN that mitigates overestimation: it separates the selection and the evaluation of the optimal action, reducing the bias in judging whether an action is good or bad and thereby improving the stability of the algorithm. The DDPG algorithm absorbs the ideas of DQN and DPG together with many improvement schemes, so that the model performs well in high-dimensional continuous spaces, with guarantees on training speed and convergence.
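For completeness, the expected return defined above satisfies the standard Bellman recursion, restated here in the paper's notation:

    % Bellman recursion for the state-value function under policy \pi
    V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0}=s\right]
               = \mathbb{E}\left[\, r_{0} + \gamma\, V^{\pi}(s_{1}) \,\middle|\, s_{0}=s \,\right],
    \qquad \gamma \in (0,1].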
Their action policy outputs can be respectively expressed as

$a_t = \arg\max_{a} Q(s_t, a \mid \theta^{Q})$ for DDQN and $a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$ for DDPG,

where $Q(s, a \mid \theta^{Q})$ and $\mu(s \mid \theta^{\mu})$ are the value and policy networks respectively; $\theta^{Q}$ and $\theta^{\mu}$ are their parameter vectors respectively; $\mathcal{N}_t$ is the stochastic noise vector from the noise model.
Their value functions approximate $Q^{*}$, which can be uniformly described as minimizing the following loss function:

$L(\theta^{Q}) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^{2}$

where $N$ is the sampling number of experience tuples $(s, a, r, s')$. The target of the value function can be written as follows:

$y_i = r_i + \gamma\, Q'\!\left(s'_i,\ \arg\max_{a'} Q(s'_i, a' \mid \theta^{Q}) \,\middle|\, \theta^{Q'}\right)$ for DDQN, and $y_i = r_i + \gamma\, Q'\!\left(s'_i,\ \mu'(s'_i \mid \theta^{\mu'}) \,\middle|\, \theta^{Q'}\right)$ for DDPG,

where the primed networks are the target networks. For the DDPG algorithm, the action policy improvement $\nabla_{\theta^{\mu}} J$ can be described as:

$\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s=s_i,\, a=\mu(s_i)}\, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big|_{s=s_i}$
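For illustration, the two bootstrapped targets above can be sketched as follows; the callables q, q_target and mu_target are hypothetical stand-ins for the value and policy networks, not the implementation used in this study.

    def ddqn_target(r, s_next, done, q, q_target, actions, gamma=0.99):
        """Double-DQN target: the online network selects the best next action,
        and the separate target network evaluates it (reduces overestimation)."""
        a_star = max(actions, key=lambda a: q(s_next, a))           # selection
        return r + gamma * (1.0 - done) * q_target(s_next, a_star) # evaluation

    def ddpg_target(r, s_next, done, q_target, mu_target, gamma=0.99):
        """DDPG target: the deterministic target actor proposes the next action."""
        return r + gamma * (1.0 - done) * q_target(s_next, mu_target(s_next))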

Overall Architecture
Previous studies on reinforcement learning usually use end-to-end networks to make decisions for autonomous operating robots; to complete a task that consists of a variety of operations, human experts need to design complex reward functions. Inspired by the option framework, this study proposes a robot learner based on an HRL architecture, which decomposes the task decision process into two levels to give the robot the ability of autonomous operation, as shown in Fig. 1. The architecture consists of two abstraction levels: (a) high-level policy selection and (b) low-level motion control. At the policy selection level, a master policy (DDQN agent) is used to select the operation policy to be executed in the current state. At the motion control level, the corresponding operation policy (DDPG agent) is activated and outputs the torques of the actuators (assemblies of motor and reducer). Multiple combinations of operation behaviors can form diversified operation tasks, which means that the trained policies can also be applied to other tasks. Therefore, the hierarchical architecture has better portability in the field of operation decision-making than end-to-end reinforcement learning. In order to learn the policies of the HRL, the task is decomposed into several operations, namely global obstacle avoidance, local approach and local tightening. The robot can then learn a task policy consisting of three sub-policies, each driven by an independent sub-goal, as shown in Fig. 2. The reward function of sub-policy learning only considers the corresponding operation, not the whole task. The robot then learns a master policy and activates certain sub-policies based on the operation task. Although the entire operation task needs to be considered when training the master policy, the associated reward function can be very simple because it does not have to consider how to control the actuators to achieve each operation. In addition, if the input to each policy were the entire perceptual information, the learning algorithm would have to determine which information is relevant. Therefore, this study designs different meaningful indicators as the state representations of different policies. Moreover, the movement of the robot is realized by multiple controllers jointly controlling the actuators, which are relatively independent of each other. It is noted that the robot dynamics used in this study still takes into account the coupling between the controllers. In summary, if an operation policy is selected by the master policy, the relevant controllers work simultaneously to enforce control.
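The resulting two-level decision loop can be sketched as follows; master_q, actors and env are hypothetical interfaces introduced only for illustration, not the implementation used in this study.

    import numpy as np

    SUB_POLICIES = ("global_obstacle_avoidance", "local_approach", "local_tightening")

    def hierarchical_step(obs, master_q, actors, env):
        """One decision step: the master DDQN selects a sub-policy, and the
        selected DDPG actor outputs the continuous joint torques."""
        k = int(np.argmax(master_q(obs)))   # high level: discrete policy selection
        torques = actors[k](obs)            # low level: continuous motion control
        next_obs, reward, done = env.step(torques)
        return next_obs, reward, done, SUB_POLICIES[k]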

Network Architecture and Hyperparameters
As the emphases of the master policy $\pi_m$ and the operation policies (global obstacle avoidance $\pi_g$, local approach $\pi_a$, and local tightening $\pi_t$) are different, the corresponding network architectures are also different. $\theta_g$, $\theta_a$ and $\theta_t$ are the network parameters of the three operation policies respectively. The master policy needs to uniquely select the appropriate operation policy according to the current state of the robot/PTL system.
The action space of the master policy is the discrete code representing the sub-policies, so a DDQN agent is used to execute it. The action space of the operation policies is the continuous torque of each joint of the robot, so a DDPG agent is used to realize them.
All agents are trained in an Asynchronous Parallel Computing Environment (APCE). When training an agent using parallel computing, the algorithm sends copies of the agent and the environment to each parallel worker.
Each worker simulates the agent's activity in the environment and sends its experience data back to the master agent. The master agent learns from the data sent by the workers and sends updated policy parameters back to them.
Local approach and local tightening: the network model of these two sub-policies is the same, and the architecture of the DDPG agent is roughly the same as that of global obstacle avoidance, except that the critic has been redesigned to fully meet the positioning accuracy required of the end tool of the manipulator during local operation. The velocity of each joint of the robot is taken into account, and attention is paid to the pose and coaxiality errors between the end tool and the task target, but the position error of the wrist joint is not considered. For the local tasks, the output signal of the policy network is limited to a reasonable range, that is, the amplification ratio is reduced to half.
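The data flow of this parallel scheme can be sketched with standard multiprocessing primitives; make_env, rollout and learner are placeholders for illustration, not the APCE implementation itself.

    from multiprocessing import Process, Queue

    def worker(exp_q, param_q, make_env, rollout):
        """Simulate the agent in a private environment copy and stream experience."""
        env, params = make_env(), param_q.get()
        while True:
            exp_q.put(rollout(env, params))   # send experience to the master agent
            while not param_q.empty():
                params = param_q.get()        # adopt the latest policy parameters

    def master(exp_q, param_qs, learner, n_updates):
        """Learn from worker experience and broadcast updated parameters."""
        for _ in range(n_updates):
            batch = exp_q.get()
            params = learner.update(batch)
            for q in param_qs:
                q.put(params)

    # Wiring for 16 workers (illustrative):
    # exp_q, param_qs = Queue(), [Queue() for _ in range(16)]
    # procs = [Process(target=worker, args=(exp_q, pq, make_env, rollout))
    #          for pq in param_qs]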

Imitative Learning for Global Obstacle Avoidance
For the global obstacle avoidance sub-policy, the PLMR must finish the obstacle avoidance operation in a relatively complex, narrow space, and therefore needs more exploration and trial-and-error in the training process to prevent the policy from falling into local optima caused by obstacles. Such expert guidance is adopted even though multi-agent learning is available in [25] and in typical human-in-the-loop tasks. In order to find the initial parameters of the global obstacle avoidance policy, the random forest [26] is used as a learnable expert policy in an imitation learning framework. The random forest sets a large number of decision trees with random initial configurations (initial velocity is 0) as roots, and the decision trees grow, exploratorily or toward the objective, into any collision-free state space by adopting the rapidly-exploring random tree (RRT) or rapid search tree (RST) algorithm under the dynamic environment (DE). The imitation learning method can be decomposed into two sub-steps: dataset sampling and policy optimization. The first step is to sample a dataset $D = \{\langle s, a^{*} \rangle\}$, in which each sample is a combination of a state observation and the best action. The second step is to train the initialization policy $\pi(s \mid \theta_g)$ for global obstacle avoidance, based on the random forest with the given $D$, as shown in the upper left of Fig. 1.
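The two sub-steps can be sketched as follows, assuming the random-forest expert exposes a function expert_action(s) and fit_policy is any supervised regressor; both names are placeholders rather than the paper's interfaces.

    import numpy as np

    def sample_dataset(expert_action, sample_state, n=10_000):
        """Sub-step 1: build D = {<s, a*>} by querying the RRT/RST expert."""
        states = [sample_state() for _ in range(n)]   # random collision-free states
        return [(s, expert_action(s)) for s in states]

    def pretrain_policy(dataset, fit_policy):
        """Sub-step 2: supervised initialization of the avoidance policy."""
        X = np.array([s for s, _ in dataset])
        Y = np.array([a for _, a in dataset])
        return fit_policy(X, Y)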

Stochastic Wind Disturbance
The PLMR operates in a special extreme environment. Among the many external disturbance factors, wind load is the main influencing factor due to its randomness, time-varying nature and nonlinearity. Under stochastic high-altitude wind load, the rigid robot sways and vibrates on the flexible PTL, and vortices shed from its body. The turbulence generated in the boundary layer near the robot body excites the arm and end-effector of the manipulator, which affects the positioning accuracy of its end tool. Therefore, research on HRL of the PLMR under wind load has strong theoretical significance and practical application value. Based on the rigid-flexible coupling dynamics of the robot and PTL without wind, the mathematical model with stochastic wind disturbance is further established in this section by mapping the wind forces to the joints of the robot and the finite element nodes of PTL. The robot can then carry out reinforcement learning in a simulation environment with stochastic wind disturbance, which improves the robustness of the control of the manipulator end-effector.
In order to better study the influence of the wind field on the robot/PTL system, it is necessary to apply stochastic, time-varying, nonlinear aerodynamic forces on the rigid robot and the flexible PTL. Assuming the robot/PTL system is in a streamlined wind field, the relative velocity between the object and the wind is calculated first, and then the drag and lift forces on the object are obtained. The direction of these forces depends on the direction of the relative velocity, as shown in the upper right of Fig. 1. The aerodynamic drag and lift forces can be respectively written in the following form [27]:

$\mathbf{F}_D = \frac{1}{2} \rho \tilde{A} C_D \|\mathbf{v}_r\|^{2}\, \tilde{\mathbf{e}}_D, \qquad \mathbf{F}_L = \frac{1}{2} \rho \tilde{A} C_L \|\mathbf{v}_r\|^{2}\, \tilde{\mathbf{e}}_L$

where $\mathbf{F}_D$ and $\mathbf{F}_L$ represent the drag and lift force vectors of the robot/PTL system under the wind load in the ground coordinate system respectively; $\rho$ is the fluid density of air; $C_D$ and $C_L$ are respectively the drag and lift coefficients per unit area related to the wind attack angle (collectively referred to as the aerodynamic coefficients, dimensionless); $\tilde{A}$ is the area of aerodynamic action; $\mathbf{v}_r$ is the vector of the relative velocity between the object and the wind; $\tilde{\mathbf{e}}_D$ and $\tilde{\mathbf{e}}_L$ are unit vectors in the directions of drag and lift, respectively.
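A direct transcription of these formulas follows; passing in the lift direction e_lift as a unit vector normal to the relative flow is an assumption for the planar case.

    import numpy as np

    RHO_AIR = 1.225  # air density [kg/m^3], standard-atmosphere value

    def aerodynamic_forces(v_obj, v_wind, area, c_d, c_l, e_lift):
        """Quasi-steady drag and lift on one body, following the equations above.
        c_d and c_l are the aerodynamic coefficients at the current attack angle."""
        v_rel = np.asarray(v_wind) - np.asarray(v_obj)  # relative velocity
        speed = np.linalg.norm(v_rel)
        if speed == 0.0:
            return np.zeros(3), np.zeros(3)
        q = 0.5 * RHO_AIR * area * speed**2     # dynamic pressure times area
        drag = q * c_d * (v_rel / speed)        # drag acts along the relative flow
        lift = q * c_l * np.asarray(e_lift)     # lift acts normal to it
        return drag, lift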
If the geometric shape of the object is fixed and the influence of wind speed is neglected under breeze conditions, the aerodynamic coefficients are functions of the wind attack angle only. The flow field around the robot/PTL system is simulated in Computational Fluid Dynamics (CFD) software to calculate the aerodynamic coefficients of the robot and the PTL fittings at different wind attack angles. In order to accurately model the environment of the reinforcement learning agents, it is necessary to accurately capture the aerodynamic forces acting on PTL, because under rigid-flexible coupling they significantly affect the positioning accuracy of the end tool of the manipulator. To meet this requirement, a CFD model of a PTL segment is established, and the aerodynamic coefficients of PTL at different wind attack angles are calculated using the FSI method. The geometry of the robot/PTL system is simplified to facilitate modeling.
The aerodynamic coefficients of PTL are calculated according to the CFD model, as shown in Fig. 3.

Fig. 3 Calculation of the PTL aerodynamic coefficients based on the CFD model
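At run time, the tabulated coefficients can be looked up by interpolation over the attack angle; the table values below are placeholders for illustration and are not the results of Fig. 3.

    import numpy as np

    ALPHA_DEG = np.array([-180.0, -90.0, 0.0, 90.0, 180.0])   # attack angles [deg]
    C_D_TAB   = np.array([  1.2,   0.8, 1.0,  0.8,   1.2])    # placeholder C_D
    C_L_TAB   = np.array([  0.0,  -0.4, 0.0,  0.4,   0.0])    # placeholder C_L

    def aero_coefficients(alpha_deg):
        """Linear interpolation of the CFD-identified coefficients."""
        return (np.interp(alpha_deg, ALPHA_DEG, C_D_TAB),
                np.interp(alpha_deg, ALPHA_DEG, C_L_TAB))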
Based on the previous research work [25,28] and Equation (6), the dynamic equation of PTL under wind load takes the standard constrained form

$\mathbf{M}\ddot{\mathbf{q}} + \boldsymbol{\Phi}_{\mathbf{q}}^{\mathrm{T}}\boldsymbol{\lambda} = \mathbf{Q} + \mathbf{S}^{\mathrm{T}}\left(\mathbf{F}_D + \mathbf{F}_L\right)$

where $\mathbf{M}$ is the total mass matrix of PTL; $\mathbf{q}$ represents the generalized coordinate vector of PTL nodes; $\mathbf{S}$ is the shape function matrix associated with the aerodynamic ANCF element; $\boldsymbol{\Phi}_{\mathbf{q}}$ is the Jacobian matrix of PTL for the constraint; $\boldsymbol{\lambda}$ represents the Lagrange multiplier vector corresponding to the constraint; and $\mathbf{Q}$ collects the remaining generalized forces (elastic and gravitational).

Operation Task Description
This paper mainly studies the operation problem of the dampers on PTL. The PTL has four conductors, fixed by spacers at certain intervals and arranged at the corners of a square in cross-section.
The robot is initialized and operates in any collision-free configuration between the four cables of the PTL, and must not collide with the cables, dampers and bolts mounted on them during operation. The objective of the operation is to tighten the damper bolts on a randomly selected cable. The damper models include FD-5 and FDZ-5F.
The autonomous operating robot needs to learn how to complete a target task as quickly as possible without colliding with itself or obstacles. In order to simulate the real operation situation, the wind disturbance is applied throughout the operating space.

Action Space
The robot motion is controlled jointly by multiple controllers driving the actuators, and each actuator is relatively independent. The tasks of the PLMR can be divided into three operation policies: global obstacle avoidance, local approach and local tightening. In order to improve the learning performance, the output actions of the master policy are normalized. Therefore, the action space of the master policy is defined as the ternary discrete set whose three elements encode the three operation policies.

State Representation and Reward Function
Four groups of states are proposed to represent the operation scenario and task: the states associated with the robot, with the cable, with the damper, and with the bolt. The indicators presented to represent the operation state are listed in full in Table 1.
Each state has a stochastic wind disturbance added before being observed by the robot learner. Some states are selected from the list as inputs to the relevant value networks and corresponding policies. The state space and reward function of each network vary and depend on the operation policy. In order to better set the reward function, four evaluation errors of the state are given, namely the position error of the wrist joint, the position error of the end tool, the posture error of the end tool, and the coaxiality error of the end tool:

$e_w = \| \mathbf{p}_w - \mathbf{p}_w^{*} \|, \quad e_p = \| \mathbf{p}_e - \mathbf{p}_e^{*} \|, \quad e_o = \| \mathbf{o}_e - \mathbf{o}_e^{*} \|, \quad e_c = d_l + d_r$

in which $\mathbf{p}_w$ and $\mathbf{p}_w^{*}$ are respectively the position of the coordinate system of the wrist joint and its expected value; $\mathbf{p}_e$ and $\mathbf{p}_e^{*}$ are respectively the position of the coordinate system of the end tool and its expected value; $\mathbf{o}_e$ and $\mathbf{o}_e^{*}$ are respectively the posture of the coordinate system of the end tool and its expected value; $d_\#$ is the distance between the end point of the end tool and the center axis of the bolt, where the subscript $\# \in \{l, r\}$ represents the left and right ends of the end tool. Note that due to the flexibility of PTL, variations in $\mathbf{p}_w^{*}$, $\mathbf{p}_e^{*}$ and $\mathbf{o}_e^{*}$ occur during robot operation that can be solved by the system coupling dynamics model. In order to normalize the errors, the error metric function is defined as $\eta_\# = f(e_\#)$, with the subscript $\# \in \{w, p, o, c\}$ representing the error type.

Table 1 lists the state indicators and the error metrics, including the wrist joint position and the end tool position. In the table, the listed symbols denote joint position, joint velocity, node coordinate and geometric solid respectively, and the node subscript for PTL represents a mesh node number within a limited range.

Master Policy:
given the three sub-policies of the operation policy, the robot needs to consider all the states in Table 1 when selecting among them. For the global obstacle avoidance sub-policy, the reward is taken as a weighted sum of the error metrics, $r_g = -(w_w \eta_w + w_e \eta_e)$, where $w_w$ and $w_e$ are the weight coefficients of the error metrics of the wrist joint and end tool position in the reward function of the global obstacle avoidance sub-policy.
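A minimal sketch of these quantities follows, assuming a tanh squashing for the normalization f (the paper states only that f normalizes the error) and the weighted-sum shaping above.

    import numpy as np

    def error_metric(e, scale=1.0):
        """Normalized error eta = f(e) in [0, 1); the tanh form is an assumption."""
        return float(np.tanh(e / scale))

    def avoidance_reward(p_wrist, p_wrist_ref, p_tool, p_tool_ref, w_w=1.0, w_e=1.0):
        """Weighted-sum shaping for the global obstacle avoidance sub-policy."""
        eta_w = error_metric(np.linalg.norm(np.asarray(p_wrist) - p_wrist_ref))
        eta_e = error_metric(np.linalg.norm(np.asarray(p_tool) - p_tool_ref))
        return -(w_w * eta_w + w_e * eta_e)   # smaller errors give higher reward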

Experiment Results
The HRL algorithm uses a discount factor $\gamma = 0.99$ for all policies and updates the target networks with a smoothing factor of $10^{-3}$. For each policy, the training process runs 16 copies of the asynchronous parallel agents and environments at a fixed frequency of 10 Hz. In each round of training, each agent has a sampling period of 1 second and a maximum duration of 100 seconds. For all learning processes, the value function and the action policy learn the neural network parameters at learning rates of $10^{-3}$ and $10^{-4}$, respectively. Their gradient threshold is 1, which limits the variation of the network parameters in each training iteration. The L2-norm method is used to deal with gradient values exceeding the gradient threshold: parameter gradients whose L2 norm is longer than the threshold are rescaled so that their L2 norm equals the gradient threshold. The L2 regularization (weight decay) factor is set to $10^{-4}$, which alleviates over-fitting of the neural networks. The standard deviation of the action noise model, its decay rate and the mean attraction constant are 0.2, $10^{-4}$ and 0.15, respectively. For the agents of each policy, whether the master policy or the operation policies, the learning speed, training stability and policy performance of multiple agents under APCE are much higher than those of a single agent.
As shown in Fig. 4(c), compared with the local tightening policy, the reward obtained by the local approach policy reaches a higher and more stable level. There are two main reasons for these apparent differences. First, the local tightening task is difficult and must meet stricter coaxiality error requirements, which is more complicated than operating the end tool with the local approach sub-policy, so it needs more exploration. Second, the randomly initialized autonomous operating robot inevitably falls temporarily into local optima caused by coaxiality error during operation. But as the number of training episodes increases, the robot explores the region beyond the local optimum through the noise of the action policy and obtains better returns. The fluctuation of the two curves near the peak value after convergence is caused by random initialization.
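For reference, the action noise model with the settings reported above can be written as an Ornstein-Uhlenbeck process; treating it as OU is our assumption, consistent with common DDPG practice.

    import numpy as np

    class OUNoise:
        """Ornstein-Uhlenbeck exploration noise: sigma = 0.2 with decay 1e-4 per
        step and mean-attraction constant theta = 0.15, as reported above."""
        def __init__(self, dim, sigma=0.2, sigma_decay=1e-4, theta=0.15, dt=0.1):
            self.x = np.zeros(dim)
            self.sigma, self.decay, self.theta, self.dt = sigma, sigma_decay, theta, dt

        def sample(self):
            dx = -self.theta * self.x * self.dt \
                 + self.sigma * np.sqrt(self.dt) * np.random.randn(self.x.size)
            self.x += dx
            self.sigma *= 1.0 - self.decay    # anneal exploration over time
            return self.x.copy()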

Simulation Verification
The performance of the learned master and operation policies is evaluated through a series of simulations. The simulation platform used in this study includes the model modules of the robot/PTL system and the wind load, and is built in MATLAB. At the beginning of each episode, the stationary autonomous operating robot is initialized on the PTL with a random, collision-free configuration and pose. The target task is randomly set to tighten the damper bolt of any cable of the PTL. In order to simulate real wind disturbances and verify the robustness of the policies against them, continuous wind loads at random speeds and attack angles are set throughout the operating space. The robot/PTL system is divided into a rigid-body and a flexible-body mesh for wind loading. During the training process, the simulation environment adds continuous wind loads at random speeds and attack angles to each mesh node of the system. According to urban wind speeds, the wind load considered is no more than force 3 (gentle breeze) and the attack angle ranges within ±180°. As shown in Fig. 5(a), the robot first adopts the global obstacle avoidance policy; when near the target, it switches to the local approach policy (from about 8 seconds to 9 seconds), and finally switches to the local tightening policy (approximately from 9 seconds to 10 seconds) when the tightening position is reached.
This suggests that our approach can learn a basic value network to evaluate operation states. In order to study the sensitivity of each operation policy to wind disturbance, the joint positions of the manipulator are evaluated under different wind loads with the master policy fixed, as shown in Fig. 5(b). It can be seen that when the wind load does not exceed the specified range, the policies are only slightly affected by the disturbance.
Interestingly, under the selections of the master policy, the robot tended to adopt the global obstacle avoidance policy at the beginning without explicit instruction, because its reward is the highest relative to the other, local policies. When approaching the targets of the local policies, the control accuracy of the end tool of the manipulator by the neural network is limited under stochastic wind disturbance, so small positioning errors easily arise. To some extent, this shows that the master policy, together with the three operation policies, can achieve smooth and safe decisions on PTL, and that the trained policies are capable of being translated into practical applications.
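The stochastic wind field applied in these simulations can be sketched as follows; the 5.4 m/s cap is our reading of the gentle-breeze (Beaufort force 3) limit.

    import numpy as np

    BREEZE_LIMIT = 5.4  # m/s, upper bound of Beaufort force 3 (assumption)

    def sample_node_wind(n_nodes, rng=None):
        """Per-node wind samples: random speed up to the breeze limit and a
        uniform attack angle within +/-180 degrees."""
        rng = rng or np.random.default_rng()
        speed = rng.uniform(0.0, BREEZE_LIMIT, size=n_nodes)
        alpha = rng.uniform(-180.0, 180.0, size=n_nodes)
        return speed, alpha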

Hierarchical and Single-policy Architecture
One question that is often encountered with a hierarchical architecture is whether the robot decision problem can be addressed by a non-hierarchical single-policy architecture. Previous work has shown the learning effect of single-policy reinforcement learning algorithms, which is obviously not satisfactory. Therefore, this study divides the task into two difficulty levels and tests the effect of using the hierarchical and single-policy architectures to make decisions at each difficulty level. The single-policy architecture also uses state representations and reward functions similar to those mentioned in Section 3.3 for training, and its network architecture is the same as that of the global obstacle avoidance policy. As shown in Table 2, surprisingly, at the lower difficulty level the single-policy architecture learns almost as well as the hierarchical architecture, with less significant differences. However, at the high difficulty level, the hierarchical architecture performs significantly better than the single-policy architecture. This indicates that the designed state representation and reward function are effective for training a good agent regardless of difficulty. As the difficulty of the problem increased, the single-policy architecture did not learn to use the shortest configuration path and the fastest running time to complete the target task, resulting in lower average operation speeds and longer operation times (about 18% slower than the hierarchical architecture). It is obvious that the single-policy architecture converges to a local optimum: during training, when the robot operates locally, it often collides with the obstacles on PTL, so it takes the policy of choosing a farther configuration path or running at a slower speed to be the best.
To sum up, on the one hand, due to the high state dimension, the single-policy architecture cannot learn an appropriate policy; on the other hand, modularization facilitates sub-policy substitution. For example, if the master policy needs one operation policy replaced, the parameters of the other networks can still be retained, thus improving the portability of the policies and the training efficiency. This is why the hierarchical architecture is chosen for the autonomous operating robot.

Conclusions
Aiming at the motion intelligence component of robot intelligence, a hierarchical reinforcement learning architecture considering stochastic wind disturbance is proposed for the decision-making of the PLMR with autonomous operation. This architecture uses the prior information of mechanism knowledge and empirical data to improve the safety and efficiency of PLMR operation. In this architecture, the high-level policy selection and the low-level motion control at the global and local levels are considered comprehensively under the condition of stochastic wind disturbance. Firstly, the task is decomposed into three sub-policies: global obstacle avoidance, local approach and local tightening, and each sub-policy is learned. Then, a master policy is learned to select the operation sub-policy in the current state. The double deep Q-network algorithm is used for the master policy, while the deep deterministic policy gradient algorithm is used for the operation policies. In order to improve the training efficiency, the global obstacle avoidance policy takes the random forest composed of dynamic-environment decision trees as the expert algorithm for imitation learning. The architecture is applied to a PTL maintenance scenario, the state function and reward function of each policy are designed, and all policies are trained in an asynchronous parallel computing environment. It is shown that this architecture can realize stable and safe autonomous operating decisions for the PLMR subjected to stochastic wind disturbance. In the future, researchers will test and improve the algorithm in more complex operating environments.