Gesture and Vision Based Automatic Grasping and Flexible Placement in Teleoperation

ABSTRACT: Teleoperation systems have attracted considerable attention because of their advantages in dangerous or unknown environments. It is very difficult to develop an operating system that can complete complex tasks fully autonomously. This paper proposes a robot arm control strategy based on gestures and visual perception. The strategy combines the advantages of humans and robots to obtain a convenient and flexible interaction model. Hand data were obtained with Leap-Motion, and a neural network was used to classify nine gestures that drive a finite state machine for robot control. The control mode switched between indicative control and mapping control. The robot acquired an autonomous grasping ability by incorporating YOLO 6D, depth data, and a probabilistic roadmap planner. The robot completed most of the trajectory independently, and only a few flexible trajectory segments required the user to make mapping actions. This interactive mode reduces the burden on the user and makes up for the shortcomings of traditional teleoperation.


Introduction
With the development of computer vision, the tasks that autonomous robots can accomplish have become more complex. However, in highly unstructured dynamic environments, the object may be unfamiliar, its shape may change, or its motion may be unknown. For such complex tasks, robot decision-making and control still need human intelligence, especially in complex and dangerous environments.
When a human cannot, or does not want to, be present where the robot operates, the robot must be operated remotely. Such cases include nuclear energy [1], ocean exploration [2], and many others.
Commonly used human-robot interfaces for remote operation include various devices, such as exoskeletons [3], electromagnetic [4] or motion capture devices [5], and EMG devices [6]. However, because these devices are attached to the human body, they may affect the comfort and flexibility of human operation. Other controlling devices, such as actuator replicas, control panels, joysticks, and mice [7], require the operator to learn extra operation skills. Non-contact gesture interaction [8] has been studied and integrated with speech recognition to give users a more intuitive and natural interactive experience. However, these interfaces are relatively direct and suited to small tasks, such as moving, rotating, starting, and stopping; they are difficult to apply to complex, large-scale actions. The complementary advantages of humans and robots are not exploited, even though a task can be segmented according to what robots and humans are each good at. The current development direction is toward intelligence: improving interaction efficiency while reducing work intensity and cost. Slight indicative actions, such as eye contact or brain signals [9], can also trigger the robot to complete certain tasks, which reduces the human's effort.
In addition to direct control [10,11], supervision is another way to reduce the operator's workload. Supervised autonomy is a model that allows the operator to start tasks while the execution itself is autonomous [12,13]. It allows the operator to control the robot without constant concentration, thus reducing the task intensity for the operator. In shared or mapping control, on the other hand, continuous input on the human-computer interface is used to control the robot [12], while functions such as protection and automatic collision avoidance are added on top of the human input. Hayati [14] provides strategies for robot systems that include supervised autonomy and shared control. Bauer [15] combines the two modes of operation via semantics. Semantic commands are easy to implement for simple instructions, such as moving 10 cm to the left, but are difficult for complex tasks, such as following a trajectory.
From the perspective of spatial scope, the recognition range of the hand is limited; in other words, the hand is not suitable for movements over a large range. Therefore, Du [16] intentionally built a three-dimensional mobile platform to increase the range of gesture-based control. Accordingly, letting a supervision mode handle wide-range movement while a direct control mode handles small-range flexible movement can also solve this problem. With regard to obstacle avoidance, a neural-learning-based method that controls each joint has been proposed [17]. To simplify obstacle avoidance and improve operation efficiency, we use a random node generation method that considers only the end effector of the robot [18].
Grasping is normal behavior for humans, but it consists of considerably complex cognitive and biomechanical processes [19]. For many innovative robot systems, grasping and manipulating objects are also necessary, and robot grasping technology is an important technology for the next-generation industry. In the 2016 Amazon Picking Challenge, the team that used the Fast R-CNN algorithm won the championship in the picking and stowing tasks [20]. Sahin [21] classifies object pose recognition into five categories: classification, regression, classification and regression, template matching, and point cloud feature matching. To improve recognition speed, YOLO [22] formulates detection as a regression problem and directly predicts the bounding box and classification probability from the image. However, image recognition should not be limited to the 2D plane; the position and posture of the object are also needed. Based on this, Tekin [23] designed the YOLO 6D pose prediction method, which achieves high recognition accuracy for pose estimation in complex scenes. In this study, indicative actions and mapping gestures were incorporated to reduce the effort of user engagement. The HRC model combined object recognition, gesture recognition, and a finite state machine (FSM). An automatic teleoperation grasping and placing mode was devised, and verification was done with real objects in a real environment. The robot identified the object and planned the trajectory and motion by itself; the user gave demand hints so that the robot could grasp the appropriate objects. On the other hand, the robot could be switched seamlessly to a mapping mode to place the object, which provided good flexibility.
The specific contributions of this research include the following: 1. Leap-Motion (Ultraleap Holdings, Ltd., UK) was used to acquire human finger gestures. The gestures were associated with the two interactive modes of indication and human-robot mapping, and were combined with the FSM to complete the tasks of grasping, obstacle avoidance, and placing.

2. The 2D-plane-based attitude recognition of the YOLO 6D object detection architecture was applied to grasping with a real robot. Its output was combined with point cloud data to meet the grasping requirements.

The interaction system structure
As shown in Fig. 1, the physical structure of the system was composed mainly of a mechanical arm, a gesture recognition device, and a visual recognition system. The robot arm manipulated objects with its gripper. The gesture recognition device recognized user gestures and controlled the operation of the manipulator. The vision system provided scene information, including object information and scene depth. Two cameras were used, one for the front view and the other for the side view, to help the operator control the robot.

Figure 2 shows the data structure of the system, which comprised two main parts: Part 1 collected human data and translated them to the scene; in Part 2, the robot received data from the human and the environment, grasped automatically, and positioned the object based on the human's gesture pose. In Fig. 2, the blue box on the left represents the user. Hand data were collected through images and transformed into a state and a position. The position information was based on the user's hand indication points, which could be recognized by the computer through a series of algorithms. The grey box on the right is the data flow of the system, which involved mainly object recognition, obstacle avoidance, motion planning, kinematics and inverse kinematics, and coordinate transformation. The object pose recognition was based on YOLO 6D and was written in Python. The obstacle avoidance algorithm was based on an improved probabilistic roadmap planner (PRM). The state switching was based on an FSM built in MATLAB's Simulink.
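As a small illustration of the coordinate transformation step in this data flow, the sketch below converts an object position from the camera frame to the robot base frame with a homogeneous transform. The extrinsic calibration values and function names are illustrative assumptions, not the system's actual calibration.

```python
import numpy as np

def to_homogeneous(R, t):
    """Build a 4x4 homogeneous transform from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Assumed extrinsic calibration: camera frame expressed in the robot base frame.
T_base_cam = to_homogeneous(np.eye(3), np.array([0.40, 0.0, 0.55]))

def camera_to_base(p_cam, T=T_base_cam):
    """Map a 3D point from camera coordinates to robot base coordinates."""
    p = np.append(p_cam, 1.0)          # homogeneous coordinates
    return (T @ p)[:3]

print(camera_to_base(np.array([0.10, -0.05, 0.80])))
```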

Input method
For convenience, gestures were chosen as the input method. Leap-Motion is small and can be positioned on a desktop approximately 20 cm away from the hand, or mounted on the head in combination with VR glasses, to capture human gestures [24]. As shown in Fig. 3a, Leap-Motion can identify hand postures well and, because it is based on infrared light, has a strong ability to resist interference from the environment. Combined with Unity, Leap-Motion can support delicate and subtle human-computer interactions, such as grasping the petal of a flower (Fig. 3b).

Gesture classifier
To obtain gesture transformations, a set of typical gestures was used (Fig. 4a), based on previous research [25,26]. The corresponding finger angle information is shown in Fig. 4b. The network structure is shown in Fig. 5, where $W_1$, $W_2$, and $W_3$ denote the layer weights and $b_1$, $b_2$, and $b_3$ the biases. From the input to the first hidden layer, each hidden unit takes one row of the weight matrix and multiplies it by the input $x$:

$a_i = \sum_j w_{1,ij} x_j + b_{1,i}$. (1)

From the last hidden layer to the output layer:

$z_k = \sum_i w_{3,ki} h_i + b_{3,k}$. (2)

The sigmoid function is used to add nonlinear factors:

$h_i = \sigma(a_i) = \dfrac{1}{1 + e^{-a_i}}$. (3)

The softmax function is applied to obtain normalized class probabilities:

$p_k = \dfrac{e^{z_k}}{\sum_m e^{z_m}}$. (4)
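A minimal NumPy sketch of such a feed-forward classifier is shown below. The layer sizes, the use of two hidden layers, and the random weights are illustrative assumptions rather than the trained network of Fig. 5.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(z):
    e = np.exp(z - z.max())            # subtract max for numerical stability
    return e / e.sum()

class GestureMLP:
    """Feed-forward gesture classifier: finger-angle features -> 9 gesture classes."""
    def __init__(self, n_in=10, n_hidden=32, n_out=9, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0, 0.1, (n_hidden, n_in));    self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0, 0.1, (n_hidden, n_hidden)); self.b2 = np.zeros(n_hidden)
        self.w3 = rng.normal(0, 0.1, (n_out, n_hidden));    self.b3 = np.zeros(n_out)

    def forward(self, x):
        h1 = sigmoid(self.w1 @ x + self.b1)   # Eqs. (1), (3)
        h2 = sigmoid(self.w2 @ h1 + self.b2)
        z = self.w3 @ h2 + self.b3            # Eq. (2)
        return softmax(z)                      # Eq. (4)

angles = np.random.rand(10)                    # placeholder finger-angle vector
probs = GestureMLP().forward(angles)
print("predicted gesture class:", int(np.argmax(probs)))
```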

Indicative focal point of gestures
The position and indication vector of the finger determine the focal point on the plane only when the gesture is G1. Here $P_i$ is the position of the index finger, $L$ is the pointing direction, $P_o$ is the intersection point between the straight line and the plane, $N$ is the normal vector of the plane, and $D$ is the plane offset term. The point $P_o$ can be calculated as shown in Eqs. (5) and (6):

$t = -\dfrac{N \cdot P_i + D}{N \cdot L}$, (5)

$P_o = P_i + t\,L$. (6)

The object closest to point $P_o$ is selected as the target object for attitude display, where $P_j$ is the coordinate of object $j$ and $j$ is the object subscript:

$j^{*} = \arg\min_{j} \lVert P_o - P_j \rVert$.
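A short sketch of this ray-plane intersection and nearest-object selection is given below; the numerical values are placeholders for illustration.

```python
import numpy as np

def indicated_point(p_i, l, n, d):
    """Intersection of the finger ray with the table plane (Eqs. 5-6).
    p_i: index-finger position, l: pointing direction,
    n: plane normal, d: offset such that n.x + d = 0 on the plane."""
    t = -(n @ p_i + d) / (n @ l)       # assumes the ray is not parallel to the plane
    return p_i + t * l

def select_target(p_o, object_positions):
    """Pick the object whose centre is closest to the indicated point."""
    dists = np.linalg.norm(object_positions - p_o, axis=1)
    return int(np.argmin(dists))

p_o = indicated_point(np.array([0.0, 0.2, 0.3]),
                      np.array([0.1, -0.2, -0.9]),
                      np.array([0.0, 0.0, 1.0]), 0.0)
objects = np.array([[0.05, 0.15, 0.0], [0.30, -0.10, 0.0]])
print(p_o, "-> target object", select_target(p_o, objects))
```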

Scene generation
For scene modeling, positioning, and navigation, a depth camera such as an RGB-D camera [27] or a LIDAR [28] is commonly used. The advantage [29] of the RGB-D camera is that it can obtain scene information with good performance at a relatively low price and can provide both a colour image and depth data. Commonly used devices such as RealSense and Kinect [30] are suitable for indoor and other small scenes. The RealSense second-generation D415 has a range of 0.3 to 10 m.
Typically, the D415 can capture colour and depth images concurrently at a high frame rate of 60 frames per second. A single frame comprises a colour image and a depth image and contains 3D geometric information, and the depth data can be converted to a point cloud. Depth data give the distance between the object and the camera mapped to the pixel coordinate system and do not directly reflect the Cartesian coordinates of the object in the scene. Also, the depth camera is subject to environmental interference that may cause noise, and a large amount of depth data affects the real-time performance of data interaction. To address this, the first step (1 in Fig. 6) was to turn the depth data into a point cloud and detect the presence of noise. Because the amount of data is very large and direct filtering would consume considerable computing resources, step 2 was to subsample and then filter. If the camera was tilted, part of the area would inevitably be lost after rotation, so step 3 limited the range of the point cloud and obtained the local highest point. Step 4 filled in the point cloud by combining the highest point cloud with a threshold value and expanding outward from the camera position as the starting position.
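A minimal NumPy sketch of the first three steps (back-projection, subsampling, workspace cropping) is given below. The intrinsic parameters, voxel size, and workspace bounds are assumed values, not the system's calibration.

```python
import numpy as np

def depth_to_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (metres) to an Nx3 point cloud using pinhole intrinsics."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]                     # drop invalid (zero-depth) pixels

def voxel_downsample(pts, voxel=0.01):
    """Keep one point per voxel to reduce the data before filtering (step 2)."""
    keys = np.floor(pts / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return pts[idx]

def crop_workspace(pts, lo, hi):
    """Limit the cloud to the workspace bounds (step 3)."""
    mask = np.all((pts >= lo) & (pts <= hi), axis=1)
    return pts[mask]

# Illustration with synthetic data and assumed D415-like intrinsics.
depth = np.full((480, 640), 0.8)
cloud = depth_to_cloud(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
cloud = voxel_downsample(cloud)
cloud = crop_workspace(cloud, lo=np.array([-0.5, -0.5, 0.2]), hi=np.array([0.5, 0.5, 1.2]))
print(cloud.shape)
```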

Object attitude recognition and trajectory planning
The object attitude recognition is based on YOLO 6D [23], and the corresponding network structure is adopted.
Normally, the confidence in 6D pose estimation would be calculated in 3D space, but this is cumbersome, so the points were projected to 2D coordinates for the calculation. The Euclidean distance $D_T(x)$ between the predicted 2D coordinates and the ground truth was computed; the smaller the distance, the more reliable the prediction. The confidence is

$c(x) = e^{\alpha\left(1 - D_T(x)/d_{th}\right)}$ if $D_T(x) < d_{th}$, and $c(x) = 0$ otherwise, (9)

where $d_{th}$ is a threshold set in advance, such as 30 pixels, and $\alpha$ is a sharpness parameter.
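A direct transcription of this confidence measure is sketched below; the sharpness value alpha is an assumed parameter for illustration.

```python
import numpy as np

def keypoint_confidence(pred_2d, true_2d, d_th=30.0, alpha=2.0):
    """Confidence of a predicted 2D keypoint (cf. Eq. 9).
    d_th: pixel threshold (30 px as in the text); alpha: assumed sharpness factor."""
    d = np.linalg.norm(np.asarray(pred_2d, dtype=float) - np.asarray(true_2d, dtype=float))
    return float(np.exp(alpha * (1.0 - d / d_th))) if d < d_th else 0.0

print(keypoint_confidence((320, 240), (322, 243)))
```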
The training loss function can be simplified as

$\mathcal{L} = \lambda_{pt}\,\mathcal{L}_{pt} + \lambda_{conf}\,\mathcal{L}_{conf} + \lambda_{cls}\,\mathcal{L}_{cls}$, (10)

where $\mathcal{L}_{pt}$, $\mathcal{L}_{conf}$, and $\mathcal{L}_{cls}$ are the coordinate-point, confidence, and classification losses, respectively, and the coefficients $\lambda$ are the weights of each loss. The recognition results for the four objects in Fig. 8 are shown in Table 1.
Motion planning is a problem in many engineering applications, such as robotics [31][32][33], navigation, and autonomous driving [34][35][36]. The essential problem in motion planning is to avoid obstacles and find a path connecting the start and target locations. An improved PRM [18], outlined in Algorithm 1 (probabilistic roadmap planner), is used in our system; a sketch of the basic procedure is given below.
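The sketch below shows a basic PRM of this kind (uniform random sampling, k-nearest-neighbour connection, shortest-path query). The spherical obstacle model, sampling bounds, and parameter values are assumptions for illustration, not the improved planner of [18].

```python
import heapq
import numpy as np

def collision_free(p, q, obstacles, steps=20):
    """Check the straight segment p->q against spherical obstacles (centre, radius)."""
    for t in np.linspace(0.0, 1.0, steps):
        x = p + t * (q - p)
        if any(np.linalg.norm(x - c) < r for c, r in obstacles):
            return False
    return True

def prm(start, goal, obstacles, n_samples=200, k=8, bounds=(0.0, 1.0), seed=1):
    """Basic probabilistic roadmap: sample nodes, connect k nearest neighbours, run Dijkstra."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    nodes = [np.asarray(start, dtype=float), np.asarray(goal, dtype=float)]
    while len(nodes) < n_samples:                         # sample collision-free nodes
        p = rng.uniform(lo, hi, size=len(start))
        if all(np.linalg.norm(p - c) >= r for c, r in obstacles):
            nodes.append(p)
    edges = {i: [] for i in range(len(nodes))}
    for i, p in enumerate(nodes):                         # connect k nearest neighbours
        dists = [np.linalg.norm(p - q) for q in nodes]
        for j in np.argsort(dists)[1:k + 1]:
            if collision_free(p, nodes[j], obstacles):
                edges[i].append((j, dists[j]))
                edges[j].append((i, dists[j]))
    dist, prev, pq = {0: 0.0}, {}, [(0.0, 0)]             # Dijkstra: node 0 -> node 1
    while pq:
        d, u = heapq.heappop(pq)
        if u == 1 or d > dist.get(u, np.inf):
            continue
        for v, w in edges[u]:
            if d + w < dist.get(v, np.inf):
                dist[v], prev[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    if 1 not in prev:
        return None                                       # no path found
    path, u = [1], 1
    while u != 0:
        u = prev[u]
        path.append(u)
    return [nodes[i] for i in reversed(path)]

obstacles = [(np.array([0.5, 0.5, 0.5]), 0.15)]
path = prm(np.zeros(3), np.ones(3), obstacles)
print("path length (nodes):", None if path is None else len(path))
```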

Finite state machine for human-robot collaboration
The switching modes are shown in Fig. 9. The G1 gesture was used to indicate an object. An indication with G1 meant that the robot was required to grasp the object and put it at the target position.
G2 was used to switch the robot mode from the indication state to the mapping state. In the mapping state, the hand directly controlled the robot arm's motion. G3 was used to end the mapping state.
G4 was used to open the manipulator's gripper. G1 to G9 indicate that a hand was recognized; when no hand was detected, the state G21 was used to keep the manipulator in its current pose. The user changed the running state of the robot solely by hand. For example, if the current state of the robot was recognition, once the user gave the G2 gesture, the robot switched to mapping; for G1, the robot switched to the indicative state. Other gestures caused the robot to remain in the recognition state.
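A minimal sketch of this gesture-driven state switching is given below. The gesture labels and state names follow the text; the exact transition details (e.g., returning from the indicative state after a grasp) are assumptions for illustration.

```python
class TeleopFSM:
    """Gesture-driven mode switching between recognition, indicative, and mapping states."""
    def __init__(self):
        self.state = "recognition"

    def step(self, gesture):
        if gesture == "G21":                      # no hand detected: hold current pose
            return self.state
        if self.state == "recognition":
            if gesture == "G1":                   # indicate object -> autonomous grasp
                self.state = "indicative"
            elif gesture == "G2":                 # switch to direct hand-robot mapping
                self.state = "mapping"
        elif self.state == "mapping":
            if gesture == "G3":                   # end mapping, back to recognition
                self.state = "recognition"
            elif gesture == "G4":
                print("open gripper")             # gripper command while mapping
        elif self.state == "indicative":
            self.state = "recognition"            # assumed: return after grasp completes
        return self.state

fsm = TeleopFSM()
for g in ["G1", "G5", "G2", "G4", "G3"]:
    print(g, "->", fsm.step(g))
```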

Human-robot motion mapping
In the mapping situation, the posture of the hand had to be mapped onto the end effector. The motion was divided into translation and rotation. The translation part can be expressed as

$P_m^{\,n+1} = P_m^{\,n} + k\,(P_h^{\,n+1} - P_h^{\,n})$, (11)

where $P_h$ and $P_m$ are the positions of the hand and of the robot end effector and $k$ is a scaling factor. The derivation of the rotation part is given in Eqs. (12) to (17), where $Q_h$ and $Q_m$ are the quaternions of the hand and the robot, and $\theta_h$ and $\theta_m$ are the rotation angles of the hand and the robot about their rotation axes. In a previous study, electromyography was used to change the flexibility of a pure mapping interaction; for more details, see [37].

From the quaternions of the hand at moments $n$ and $n+1$, the incremental rotation is

$\Delta Q_h = Q_h^{\,n+1} \otimes (Q_h^{\,n})^{-1}$. (12)

A unit quaternion can be expressed as

$Q = \left[\cos\tfrac{\theta}{2},\; \mathbf{u}\sin\tfrac{\theta}{2}\right]$, (13)

so the rotation angle of the hand is

$\theta_h = 2\arccos(\Delta q_{h,0})$, (14)

with rotation axis $\mathbf{u}_h = \Delta\mathbf{q}_{h,1:3}/\sin(\theta_h/2)$. The obtained rotation is transferred to the end of the robot arm; its rotation axis can be expressed as

$\mathbf{u}_m = R_{hm}\,\mathbf{u}_h$, (15)

where $R_{hm}$ is the fixed rotation from the hand (Leap-Motion) frame to the robot base frame. The rotation angle of the robot can be expressed as

$\theta_m = \theta_h$. (16)

The rotation quaternion of the robot at time $n+1$ can then be expressed as

$Q_m^{\,n+1} = \left[\cos\tfrac{\theta_m}{2},\; \mathbf{u}_m\sin\tfrac{\theta_m}{2}\right] \otimes Q_m^{\,n}$. (17)
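A compact sketch of this incremental rotation mapping is given below. The hand-to-robot axis mapping R_hm and the angle scale are placeholders for illustration, not the system's calibration.

```python
import numpy as np

def q_mul(a, b):
    """Hamilton product of quaternions [w, x, y, z]."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def q_conj(q):
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def map_hand_rotation(q_h_n, q_h_n1, q_m_n, R_hm=np.eye(3), scale=1.0):
    """Transfer the hand's incremental rotation between frames n and n+1 to the robot.
    R_hm (hand-to-robot axis mapping) and scale are illustrative assumptions."""
    dq = q_mul(q_h_n1, q_conj(q_h_n))             # incremental hand rotation (Eq. 12)
    dq = dq / np.linalg.norm(dq)
    theta_h = 2.0 * np.arccos(np.clip(dq[0], -1.0, 1.0))   # rotation angle (Eq. 14)
    if theta_h < 1e-8:
        return q_m_n                              # no significant rotation
    axis_h = dq[1:] / np.sin(theta_h / 2.0)
    axis_m = R_hm @ axis_h                        # axis in the robot frame (Eq. 15)
    theta_m = scale * theta_h                     # robot rotation angle (Eq. 16)
    dq_m = np.concatenate([[np.cos(theta_m / 2.0)], np.sin(theta_m / 2.0) * axis_m])
    q_m_n1 = q_mul(dq_m, q_m_n)                   # robot quaternion at n+1 (Eq. 17)
    return q_m_n1 / np.linalg.norm(q_m_n1)

q_h_n  = np.array([1.0, 0.0, 0.0, 0.0])
q_h_n1 = np.array([np.cos(0.1), np.sin(0.1), 0.0, 0.0])   # small rotation about x
print(map_hand_rotation(q_h_n, q_h_n1, np.array([1.0, 0.0, 0.0, 0.0])))
```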

Results and discussion
The operation interface of the experiment is shown in Fig. 10a. The gestures were recognized by Leap-Motion. After neural network processing and geometric computation in Unity, the gesture was passed to Python and Simulink via the User Datagram Protocol (UDP). Leap-Motion was connected to a computer via a serial port with a recognition rate of 60 Hz. YOLO 6D, running in Python, was used to recognize the object posture. The CPU was an Intel i7-4702, the graphics card was a GTX 1060, and the object recognition frequency was 2 Hz. The robot was an AUBO i5, which communicated with Simulink via the Transmission Control Protocol. An OnRobot 6-degrees-of-freedom force sensor, transmitting over UDP at 100 Hz, was used to limit the contact force; as a protective measure, the maximum contact force was set to 10 N. All computations ran on the computer, and the calculated trajectories were sent back to the AUBO robot.
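For the UDP link between the gesture front end and the Python side, a minimal receiver sketch could look as follows; the port number and JSON message layout are assumptions for illustration only.

```python
import json
import socket

# Assumed port and message layout: {"gesture": "G1", "pos": [x, y, z]}.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 5005))
sock.settimeout(1.0)

try:
    data, addr = sock.recvfrom(1024)              # one gesture packet per frame (~60 Hz)
    msg = json.loads(data.decode("utf-8"))
    print("gesture:", msg["gesture"], "hand position:", msg["pos"])
except socket.timeout:
    print("no gesture packet received")
finally:
    sock.close()
```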
The steps of the experiment are shown in Fig. 10b. The red dot is the position indicated by the finger. When the object was selected and the target position was determined, the dot turned into a red circle and flashed three times. Figure 10b1 indicates the position of the object to be grasped, and Fig. 10b2 indicates the target location.
The pixel results obtained are shown in Fig. 12. Figure 12a shows that the located points of all objects were concentrated in small clusters. Figures 12b and 12c show that the variance in the X direction was 0.45 pixels, with a maximum-to-minimum pixel difference of 2.45 pixels, and that the variance in the Y direction was 0.19 pixels, with a maximum-to-minimum difference of 1.18 pixels. This shows that the accuracy of the plane-based object attitude recognition was acceptable after conversion into pixels.
The four points approximately correspond to the actual positions in Figs. 13a and 13b. However, the point positions differed somewhat from the actual positions, so the spatial object mapped to the colour image had to be transformed. A large and flexible mechanical gripper on a robot arm has a less stringent requirement for position accuracy, but the gripper in this study was small and rigid, so the position had to be corrected. Although the position was obtained by combining with the point cloud, the geometric size of the object had to be considered further in the correction.
To simplify the process, a correction equation was devised (based on the actual situation in this study, the positions of the objects did not differ greatly). The position of the object could be obtained by combining the pixel coordinates, the depth data, the rotation, and the object size. As shown in Eqs. (19) to (21), the corrected displacement of the object was related to the measured displacement, the displacement of the object from the pixel centre, the width and height of the object, and the rotation of the object. The offset in the Y direction was corrected by Eq. (20), which was related mainly to the height and width of the object. The offset in the Z direction was corrected by Eq. (21), which is nonlinear and includes a scale factor. The corrected results are shown in Figs. 13c and 13d, which show that the object position was close to the actual position and met the grasping requirement.

The complete trajectory is shown in Fig. 14a. The robot avoided the obstacle by itself; only the placement (inside the dotted box) was completed by gestures directly mapped from the user. The manual mapping part is the final fine-tuning stage. To observe how well the robot followed the hand, a reciprocating movement was added. As shown in Figs. 14b, 14c, and 14d, the hand trajectory is relatively rounded, and the robot's displacement target is close to the hand trajectory.
Although the delay of the robot causes some sense of contouring in the actual running trajectory, the robot can follow the local motion well on the whole. In comparison, whole-process location mapping [38] requires the user to operate all trajectories, including movement, grasping, obstacle avoidance, and placement, which is a heavy burden on the user. In some situations, fully automatic operation can be achieved, such as peg-in-hole assembly [39] with complex requirements, but some scenes require a user to participate. Take, for example, a waiter in a restaurant: there are many studies on motion and obstacle avoidance for restaurant food delivery robots, and the food can be delivered to customers but cannot be placed [40]. In a Chinese restaurant, there are usually a few more dishes than in other restaurants, which can cause an overlap problem. It may also be necessary to consider who has the highest status at the table, which may influence the dish placement position; that undoubtedly increases the difficulty of placement. HRC grasping and placing, though seemingly simple, is a research direction full of challenges and rich rewards.

Conclusion
In this study, Leap-Motion was used to collect hand data, and then nine gestures were taught to and recognized by the neural network. The control mode combined the gestures and an FSM.
Based on YOLO 6D, the position of an object could be determined and corrected by combining depth data with a geometric transformation. Then, by combining the depth data and an improved PRM, the robot trajectory was generated to achieve obstacle avoidance. The results show that in this interactive mode, most of the trajectory can be completed by the robot and only a small part requires user participation, which reduces the burden on the user. Future work will focus on the user experience and force feedback systems. Intelligent learning capabilities may also be added to automatically learn the user's placement habits. Finally, a mobile module could be integrated to increase the working range of the robot. The present study will further promote robot intelligence and enhance the degree of HRC.

Funding
The project is supported by the National Key Research and Development Program of China.

Figure captions: Figure 1, scene of human-robot interaction; scene generation using a depth camera; collaborative interaction framework based on the finite state machine; Figure 10, experimental operation procedure; automatic motion trajectory of the AUBO robot; Figure 14, motion trajectory of the AUBO manipulator.