This article aims to generalize the grasping of cuboid- and cylinder-shaped objects, the two most common shapes. Objects of different geometrical shapes and sizes need to be picked in pick-and-place or metrological inspection; therefore, a single style of gripper is not suitable for varying geometries. In this work, two grippers were designed, fabricated, and fitted onto the industrial robot: a parallel gripper for the cuboid and a three-finger gripper for the cylinder. Tasks such as detection, segmentation, and pose estimation are treated as the inspection modules, while grasping is the decision and feedback for selecting a gripper. Figures 2 (a) and (b) depict the two grippers.
The pose estimation task follows object detection and classification. Detecting the X and Y coordinates is sufficient for grasping a cylindrical object; for a cuboid-shaped object, however, the pose must also be estimated. The pose of the cuboid is required in general, and the yaw angle in particular, for grasping. Figures 3 (a) and (b) show two situations in which the importance of pose estimation becomes apparent.
The following sections discuss the details of the AI and machine vision-based techniques developed and implemented in this work, along with the experimental setup, results, and validation.
2.1 Experimental setup – robot manipulator and gripper system
The experimental setup consists of an industrial robot (KUKA KR500 R2380 MT). The gripper system was assembled on the end effector of the robot. It comprises a gripper fixture, an actuator, and fingers fabricated for gripping the cylindrical and cuboid-shaped jobs. Both grippers were actuated pneumatically at a pressure of 5 bar. The gripper system was integrated with the robot controller for clamping and unclamping based on the feedback received from the AI solution. Figure 4 (a) shows the gripper system, and (b) shows the block diagram of its integration with the robot.
2.2 Experimental setup – conveyor system
A conveyor system was designed and fabricated in-house for transporting the jobs from one station to the other. It was mounted in front of the robot; one end of the conveyor serves as the feeding station and the other as the dispatch station. It is 1 m long and 0.3 m wide. The conveyor was designed as modular sections assembled through slots and extrusions in complementary modules, a slight departure from the conventional design of conveyor belts. Aluminium 6061 was selected as the material for fabricating the modular sections, shown in Fig. 5 (a) and (b).
The screw take-up method was adopted during the design to provide sufficient tension to the belt and avoid slippage; it utilizes tenons to tighten the belt. A single-row cylindrical roller bearing with an inner diameter of 17 mm and an outer diameter of 40 mm was chosen because it is suitable for high speeds, absorbs radial loads, has low friction, and offers a long service life. A stepper motor with a worm and worm gear was chosen for the conveyor because of its compatibility with a wide range of rotational speeds and its fast response to acceleration and deceleration. The holding torque is 21.5 N·m, with a maximum rotational speed of 130 rpm. Figure 6 shows the front view of the motor.
The motor was mounted to the conveyor using a motor bracket. The frame material was also aluminium alloy 6061. Mounting holes were drilled corresponding to the tapped holes present in the motor. The conveyor belt material was polyvinyl chloride (PVC) nylon. The fasteners used were M10 socket head bolts, M10 nuts, M5 socket head cap screws, and M6 socket head cap screws. After the design stage, a simulation of the conveyor system was performed in ANSYS Workbench 18.1 to ensure safe operation and performance. Figure 7 (a) shows the meshing of the modular section, and (b) shows the stress generation.
On the conveyor, a proximity sensor has been fitted that detects the arrival of a job. The conveyor was also integrated with the robot controller. The next section discusses the details of the AI solution.
2.3 AI solution – detection and classification
There exist various methods that employ deep learning for object detection, such as YOLO [1], SSD [2], Faster-RCNN [10], and Masked-RCNN [11]. These models can be categorized into two groups: two-shot and one-shot detectors. Two-shot detection consists of two stages: generation of region proposals through a region proposal network (RPN), followed by classification of the regions and refinement of the bounding boxes. One-shot detectors skip the RPN stage and directly generate the bounding boxes and class scores. A research article suggests that two-shot detectors achieve higher accuracy than one-shot techniques, the difference lying in the foreground/background imbalance during training, which the two-stage pipeline handles more easily [12]. YOLO, a one-shot detector, was selected in this work since the objective was to develop a real-time solution for autonomous robotic applications. YOLO employs two fully connected layers and splits the image into an X × X grid, each cell predicting M bounding boxes, i.e., X × X × M boxes in total, as depicted in Fig. 9.
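As an illustration only (not the exact pipeline of this work), a darknet-style YOLO model can be run through OpenCV's DNN module; the file names, input size, and confidence threshold below are assumptions:

```python
import cv2
import numpy as np

# Hypothetical file names for a darknet-style YOLO model trained on the two classes.
net = cv2.dnn.readNetFromDarknet("yolo_custom.cfg", "yolo_custom.weights")
classes = ["cylinder", "cuboid"]

frame = cv2.imread("conveyor_frame.png")   # placeholder image of the conveyor
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

h, w = frame.shape[:2]
for out in outputs:
    for det in out:                         # det = [Bx, By, W, H, objectness, class scores...]
        scores = det[5:]
        class_id = int(np.argmax(scores))
        if scores[class_id] > 0.5:          # confidence threshold (assumption)
            bx, by, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            print(classes[class_id], bx, by, bw, bh)
```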
YOLO also employs bounding box regression; a bounding box is a rectangular box that highlights the object present in an image. Each bounding box contains:
- W, H, which refer to the width and height, respectively
- c, indicating the class of the object (cylinder/cuboid, in this work)
- Bx, By, the centre of the bounding box
All the features mentioned above are evaluated using a single bounding box regression, as depicted in Fig. 10.
YOLO uses Intersection over Union (IoU) to provide a bounding box that properly wraps the object. Each grid cell calculates confidence scores along with the bounding box of the object. If the IoU equals one, the predicted bounding box coincides with the ground-truth box. Figure 11 depicts two bounding boxes; the green one has an IoU of 0.85, while the blue one has an IoU below 0.5.
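For reference, a minimal sketch of the IoU computation for two axis-aligned boxes given by their (x1, y1, x2, y2) corners:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# An IoU of 1.0 means the predicted box coincides with the ground-truth box.
print(iou((0, 0, 100, 100), (10, 10, 110, 110)))   # ~0.68
```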
SSD is an alternative technique that employs convolutional layers of varying sizes, applying a fixed set of default bounding boxes to the feature maps. The network predicts on feature maps of various scales, giving it higher accuracy than YOLO. During prediction, the network adjusts the boxes to match the object shape and provides probabilities for the presence of each class label in the box. However, it is more computationally expensive than YOLO. Speed was prioritized over accuracy for achieving a real-time solution; thus, YOLO was selected. Images in YOLO undergo several steps to produce the bounding boxes and class labels; those steps are discussed in the following sub-sections.
One of the concerns of using DL is the need for a large amount of training data, and gathering a dataset in manufacturing applications is expensive. Transfer learning is a very useful technique in this regard: it allows a neural network trained on one dataset to be re-trained on another, typically small, custom dataset, reusing the trained network's weights and biases. This decreases the training time and increases the accuracy achievable on a small dataset. This article also leverages the benefits of transfer learning on the YOLO model. The pre-trained model was trained on the COCO dataset consisting of 80 classes. The filters and output layers of the model were adjusted according to the available set of classes to prepare the network for transfer learning.
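As a hedged illustration, in a darknet-style YOLOv3 configuration the filter count of each convolutional layer preceding a detection layer is tied to the number of classes; the helper below is hypothetical but shows the adjustment implied when moving from the 80 COCO classes to the two classes used here:

```python
def yolo_filter_count(num_classes: int, anchors_per_scale: int = 3) -> int:
    # filters = (num_classes + 5) * anchors_per_scale,
    # where 5 accounts for (Bx, By, W, H, objectness score).
    return (num_classes + 5) * anchors_per_scale

print(yolo_filter_count(80))  # 255 filters for the original COCO-trained model
print(yolo_filter_count(2))   # 21 filters for the cylinder/cuboid model
```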
3.2 Pre-processing before coordinates and pose estimation
This sub-section details the pre-processing required for identifying the coordinates and, subsequently, assessing the pose for grasping. As the proximity sensor detects the object, the conveyor stops, and the camera feed starts. Figures 12 (a) and (b) show the camera feed with and without an object on the conveyor, respectively.
DL-based methods such as Masked-RCNN are popular for object localization, as they segment the object from its surroundings. There also exist fitting techniques, which fit various 2D shapes such as rectangles, circles, and polygons onto the object to localize it; they comprise primitive fitting and contour extraction modules. From Fig. 12, a few colour variations in the conveyor can be observed, and the object does not blend into the background. Therefore, applying basic image processing techniques to locate the object was beneficial, as they are computationally less expensive. However, their results depend greatly on factors such as contrast, lighting, and angle of view. Hence, techniques have been adopted to enhance images with poor contrast; the camera employed in this work is a low-cost webcam, so the pictures received were of poor contrast. Various morphological operations were used to remove imperfections in the image [13], and the angle of view was normalized by estimating the perspective transform [14]. Figure 13 (a) shows the edges of the object obtained without any morphological operation, and (b) shows the result with morphological transformations. Dilation was applied to the image, which helps obtain edges more reliably even in low lighting and other variable physical conditions.
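A minimal sketch of the edge enhancement step, assuming OpenCV and placeholder thresholds and kernel size:

```python
import cv2
import numpy as np

frame = cv2.imread("conveyor_frame.png")           # placeholder camera frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)        # suppress noise from the low-cost webcam
edges = cv2.Canny(blurred, 50, 150)                # raw edge map, thresholds are assumptions

kernel = np.ones((3, 3), np.uint8)                 # structuring element (assumed size)
dilated = cv2.dilate(edges, kernel, iterations=1)  # dilation thickens and reconnects weak edges
```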
Figure 14 (a) shows the image obtained from the live camera feed, and (b) shows the perspective transformation, which provides a normal view of the object and thereby corrects the angle of view.
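A minimal sketch of this correction, assuming the four corner points of the visible conveyor region are known (the pixel values below are placeholders):

```python
import cv2
import numpy as np

frame = cv2.imread("conveyor_frame.png")  # placeholder live-feed frame

# Corner points of the conveyor region in the raw image (placeholder values),
# ordered top-left, top-right, bottom-right, bottom-left.
src = np.float32([[112, 48], [1180, 62], [1214, 690], [88, 672]])
# Target rectangle: the full 1280 x 720 frame, giving a view normal to the belt.
dst = np.float32([[0, 0], [1280, 0], [1280, 720], [0, 720]])

M = cv2.getPerspectiveTransform(src, dst)
top_view = cv2.warpPerspective(frame, M, (1280, 720))
```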
3.3 X and Y coordinates estimation
It was necessary to locate the object on the conveyor to estimate its coordinates. A Canny edge detector was utilized to find the edges of the cuboid or cylinder present on the conveyor [15]. The contours of the edges were then extracted, and the centre coordinates were determined from the image moments (Fig. 15).
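A minimal sketch of this step, assuming OpenCV 4.x and that the job corresponds to the largest contour in the edge map:

```python
import cv2

frame = cv2.imread("top_view.png")        # placeholder, perspective-corrected frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)          # Canny thresholds are assumptions

contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
largest = max(contours, key=cv2.contourArea)   # assume the job is the largest contour

m = cv2.moments(largest)
cx_px = int(m["m10"] / m["m00"])          # centre X in pixels
cy_px = int(m["m01"] / m["m00"])          # centre Y in pixels
```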
These coordinates were found in pixels. However, for the grasping decision, i.e., the feedback to the robot, the coordinates were required in millimetres. This was achieved by mapping the camera resolution to the distance in millimetres visible in the frame. The resolution of the webcam utilized was 720 px × 1280 px. When fixed in place, the camera captured a fixed length and breadth of the conveyor, which were 385 mm and 220 mm, respectively. Thus, the distance per pixel in mm along the length was found by dividing 385 by 1280; similarly, the distance per pixel in mm along the breadth was found by dividing 220 by 720 (Fig. 16).
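The mapping reduces to two scale factors; a short sketch using the values reported above (the centroid values are placeholders):

```python
FRAME_W_PX, FRAME_H_PX = 1280, 720       # webcam resolution
VIEW_LEN_MM, VIEW_BRD_MM = 385.0, 220.0  # conveyor length and breadth visible in the frame

mm_per_px_len = VIEW_LEN_MM / FRAME_W_PX   # ~0.301 mm per pixel along the length
mm_per_px_brd = VIEW_BRD_MM / FRAME_H_PX   # ~0.306 mm per pixel along the breadth

cx_px, cy_px = 640, 360                    # placeholder centroid from the contour step
x_mm = cx_px * mm_per_px_len
y_mm = cy_px * mm_per_px_brd
print(x_mm, y_mm)                          # ~192.5 mm, ~110.0 mm
```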
3.4 Pose estimation
After calculating the required vertices, the perspective transform of the top surface is taken. The features of the transformed image are computed using ORB, and the top surface of the cuboid is detected by comparing these features. After finding the matches, the homography matrix is determined; it maps key points in one image to the corresponding points in the other image. The matrix is applied, through the perspective transform, to the corner points of the warped image, which serve as the 3D object points. The PnP model has been adopted to determine the 6D pose, i.e., X, Y, Z, roll, yaw, and pitch of the object with respect to the camera [16]. The PnP model computes the relative pose of the camera with respect to a 3D object without resorting to neural networks, and hence without relying on the heavy data needed to train them. Equation 1 below best describes it:

$$s\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & \gamma & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \qquad (1)$$
where fx and fy are the scaled focal lengths, γ is the skew factor (assumed to be 0), u0 and v0 are the coordinates of the principal point, s is the scale factor, r and t are the calculated rotation and translation, and x, y, z are the 3D points in world coordinates. The overall flow diagram of the approach is shown in Fig. 17.
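A minimal sketch of the PnP step, assuming OpenCV, a calibrated intrinsic matrix, and that the 3D corners of the cuboid's top face and their matched 2D image points are already available (all numeric values below are placeholders):

```python
import cv2
import numpy as np

# Intrinsic matrix built from fx, fy, u0, v0 with zero skew (placeholder calibration values).
K = np.array([[900.0,   0.0, 640.0],
              [  0.0, 900.0, 360.0],
              [  0.0,   0.0,   1.0]])

# 3D corners of the cuboid top face in object coordinates (mm), placeholder size 80 x 60 mm.
object_points = np.array([[0, 0, 0], [80, 0, 0], [80, 60, 0], [0, 60, 0]], dtype=np.float32)
# Matching 2D image corners (px) obtained via ORB matching and the homography (placeholders).
image_points = np.array([[512, 300], [690, 310], [684, 452], [508, 440]], dtype=np.float32)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None)
R, _ = cv2.Rodrigues(rvec)                       # rotation matrix of the object w.r.t. the camera
yaw = np.degrees(np.arctan2(R[1, 0], R[0, 0]))   # yaw angle used for the grasping decision
print(tvec.ravel(), yaw)
```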