This article aims to generalize the grasping of cuboid- and cylinder-shaped objects, the two most common shapes. Objects of different geometrical shapes and sizes need to be picked in pick-and-place or metrological inspection; therefore, a single style of gripper is not suitable for varying geometries. In this work, two grippers were designed, fabricated, and fitted onto the industrial robot: a parallel gripper for the cuboid and a three-finger gripper for the cylinder. Tasks such as detection, segmentation, and pose estimation are treated as the inspection modules, while grasping is the decision and feedback for selecting a gripper. Figures 2 (a) and (b) depict the two grippers.
The pose estimation task follows object detection and classification. Detecting the X and Y coordinates is sufficient for grasping a cylindrical object; for a cuboid-shaped object, however, the pose must also be estimated. The pose of the cuboid is required in general, and the yaw angle in particular, for grasping. Figures 3 (a) and (b) show two situations in which the importance of pose estimation becomes apparent.
The following sections discuss the details of the AI and machine vision-based techniques developed and implemented in this work, along with the experimental setup, results, and validation.
2.1 Experimental setup – robot manipulator and gripper system
The experimental setup consists of an industrial robot (KUKA KR500 R2380 MT). The gripper system was assembled on the end effector of the robot. It comprises a gripper fixture, an actuator, and fingers fabricated for gripping the cylindrical and cuboid-shaped jobs. Both grippers were actuated pneumatically at a pressure of 5 bar. The gripper system was integrated with the robot controller for clamping and unclamping based on the feedback received from the AI solution. Figure 4 (a) shows the gripper system, and (b) shows the block diagram of its integration with the robot.
2.2 Experimental setup – conveyor system
A conveyor system was designed and fabricated in-house for transporting the jobs from one station to the other. It was mounted in front of the robot; one end of the conveyor serves as the feeding station and the other as the dispatch station. It is 1 m long and 0.3 m wide. The conveyor was designed as modular sections assembled through slots and extrusions in complementary modules, a slight departure from the conventional design of conveyor belts. Aluminium 6061 was selected as the material for fabricating the modular sections, shown in Fig. 5 (a) and (b).
The screw take-up method was adopted during the design to provide sufficient tension to the belt and avoid slippage; it utilizes tenons to tighten the belt. A single-row cylindrical roller bearing with an inner diameter of 17 mm and an outer diameter of 40 mm was chosen because it is suitable for high speeds, absorbs radial loads, has low friction, and offers a long service life. A stepper motor with a worm and worm gear was chosen for the conveyor because of its compatibility with a wide range of rotational speeds and its fast response to acceleration and deceleration. The holding torque is 21.5 N·m, with a maximum rotational speed of 130 rpm. Figure 6 shows the front view of the motor.
The motor was mounted to the conveyor using a motor bracket. The frame material was also aluminium alloy 6061. Mounting holes were drilled corresponding to the tapped holes present in the motor. The conveyor belt material was polyvinyl chloride (PVC) nylon. The fasteners used were M10 socket head bolts, M10 nuts, M5 socket head cap screws, and M6 socket head cap screws. After the design stage, a simulation of the conveyor system was performed in ANSYS Workbench 18.1 to ensure safe operation and performance. Figure 7 (a) shows the meshing of the modular section, and (b) shows the stress generation.
On the conveyor, a proximity sensor has been fitted that detects the arrival of a job. The conveyor was also integrated with the robot controller. The next section discusses the details of the AI solution.
2.3 AI solution – detection and classification
There exist various methods that employ deep learning for object detection, such as YOLO [1], SSD [2], Faster-RCNN [10], and Masked-RCNN [11]. These models can be categorized into two groups: two-shot and one-shot detectors. Two-shot detection consists of two stages: generation of region proposals through a region proposal network (RPN), followed by classification of the regions and refinement of the bounding boxes. One-shot detectors skip the RPN stage and directly generate the bounding boxes and class scores. A research article suggests that two-shot detectors achieve higher accuracy than one-shot techniques, the difference lying in the foreground/background imbalance during training, which the two-stage pipeline handles more easily [12]. YOLO, a one-shot detector, was selected in this work since the objective was to develop a real-time solution for autonomous robotic applications. YOLO employs two fully connected layers and splits the image into an X × X grid, each cell predicting M bounding boxes, i.e., X × X × M boxes in total, as depicted in Fig. 9.
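As an illustration only (not the exact pipeline of this work), a darknet-style YOLO model can be run through OpenCV's DNN module; the file names, input size, and confidence threshold below are assumptions:

```python
import cv2
import numpy as np

# Hypothetical file names for a darknet-style YOLO model trained on the two classes.
net = cv2.dnn.readNetFromDarknet("yolo_custom.cfg", "yolo_custom.weights")
classes = ["cylinder", "cuboid"]

frame = cv2.imread("conveyor_frame.png")   # placeholder image of the conveyor
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

h, w = frame.shape[:2]
for out in outputs:
    for det in out:                         # det = [Bx, By, W, H, objectness, class scores...]
        scores = det[5:]
        class_id = int(np.argmax(scores))
        if scores[class_id] > 0.5:          # confidence threshold (assumption)
            bx, by, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            print(classes[class_id], bx, by, bw, bh)
```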
YOLO also employs bounding box regression; a bounding box is a rectangular box that highlights the object present in an image. Each bounding box contains:
- W, H, which refer to the width and height, respectively
- c, indicating the class of the object (cylinder/cuboid, in this work)
- Bx, By, the centre of the bounding box
All the features mentioned above are evaluated using a single bounding box regression, as depicted in Fig. 10.
YOLO uses Intersection over Union (IoU) to provide a bounding box that properly wraps the object. Each grid cell calculates confidence scores along with the bounding box of the object. If the IoU equals one, the predicted bounding box coincides with the ground-truth box. Figure 11 depicts two bounding boxes; the green one has an IoU of 0.85, while the blue one has an IoU below 0.5.
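For reference, a minimal sketch of the IoU computation for two axis-aligned boxes given by their (x1, y1, x2, y2) corners:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# An IoU of 1.0 means the predicted box coincides with the ground-truth box.
print(iou((0, 0, 100, 100), (10, 10, 110, 110)))   # ~0.68
```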
SSD is an alternative technique that employs convolutional layers of varying sizes, applying a fixed set of default bounding boxes to the feature maps. The network predicts on feature maps of various scales, giving it higher accuracy than YOLO. During prediction, the network adjusts the boxes to match the object shape and provides probabilities for the presence of each class label in the box. However, it is more computationally expensive than YOLO. Speed was prioritized over accuracy for achieving a real-time solution; thus, YOLO was selected. Images in YOLO undergo several steps to produce the bounding boxes and class labels; those steps are discussed in the following sub-sections.
One of the concerns of using DL is the need for a large amount of training data, and gathering a dataset in manufacturing applications is expensive. Transfer learning is a very useful technique in this regard: it allows a neural network trained on one dataset to be re-trained on another, typically small, custom dataset, reusing the trained network's weights and biases. This decreases the training time and increases the accuracy achievable on a small dataset. This article also leverages the benefits of transfer learning on the YOLO model. The pre-trained model was trained on the COCO dataset consisting of 80 classes. The filters and output layers of the model were adjusted according to the available set of classes to prepare the network for transfer learning.
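As a hedged illustration, in a darknet-style YOLOv3 configuration the filter count of each convolutional layer preceding a detection layer is tied to the number of classes; the helper below is hypothetical but shows the adjustment implied when moving from the 80 COCO classes to the two classes used here:

```python
def yolo_filter_count(num_classes: int, anchors_per_scale: int = 3) -> int:
    # filters = (num_classes + 5) * anchors_per_scale,
    # where 5 accounts for (Bx, By, W, H, objectness score).
    return (num_classes + 5) * anchors_per_scale

print(yolo_filter_count(80))  # 255 filters for the original COCO-trained model
print(yolo_filter_count(2))   # 21 filters for the cylinder/cuboid model
```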
3.2 Pre-processing before coordinates and pose estimation
This sub-section details the pre-processing required for identifying the coordinates and, subsequently, assessing the pose for grasping. As the proximity sensor detects the object, the conveyor stops, and the camera feed starts. Figures 12 (a) and (b) show the camera feed with and without an object on the conveyor, respectively.
DL-based methods such as Masked-RCNN are popular for object localization, as they segment the object from its surroundings. There also exist fitting techniques, which fit various 2D shapes such as rectangles, circles, and polygons onto the object to localize it; they comprise primitive fitting and contour extraction modules. From Fig. 12, a few colour variations in the conveyor can be observed, and the object does not blend into the background. Therefore, applying basic image processing techniques to locate the object was beneficial, as they are computationally less expensive. However, their results depend greatly on factors such as contrast, lighting, and angle of view. Hence, techniques have been adopted to enhance images with poor contrast; the camera employed in this work is a low-cost webcam, so the pictures received were of poor contrast. Various morphological operations were used to remove imperfections in the image [13], and the angle of view was normalized by estimating the perspective transform [14]. Figure 13 (a) shows the edges of the object obtained without any morphological operation, and (b) shows the result with morphological transformations. Dilation was applied to the image, which helps obtain edges more reliably even in low lighting and other variable physical conditions.
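A minimal sketch of the edge enhancement step, assuming OpenCV and placeholder thresholds and kernel size:

```python
import cv2
import numpy as np

frame = cv2.imread("conveyor_frame.png")           # placeholder camera frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)        # suppress noise from the low-cost webcam
edges = cv2.Canny(blurred, 50, 150)                # raw edge map, thresholds are assumptions

kernel = np.ones((3, 3), np.uint8)                 # structuring element (assumed size)
dilated = cv2.dilate(edges, kernel, iterations=1)  # dilation thickens and reconnects weak edges
```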
Figure 14 (a) shows the image obtained from the live camera feed, and (b) shows the perspective transformation, which provides a normal view of the object and thereby corrects the angle of view.
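A minimal sketch of this correction, assuming the four corner points of the visible conveyor region are known (the pixel values below are placeholders):

```python
import cv2
import numpy as np

frame = cv2.imread("conveyor_frame.png")  # placeholder live-feed frame

# Corner points of the conveyor region in the raw image (placeholder values),
# ordered top-left, top-right, bottom-right, bottom-left.
src = np.float32([[112, 48], [1180, 62], [1214, 690], [88, 672]])
# Target rectangle: the full 1280 x 720 frame, giving a view normal to the belt.
dst = np.float32([[0, 0], [1280, 0], [1280, 720], [0, 720]])

M = cv2.getPerspectiveTransform(src, dst)
top_view = cv2.warpPerspective(frame, M, (1280, 720))
```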
3.3 X and Y coordinates estimation
It was necessary to locate the object on the conveyor to estimate its coordinates. A Canny edge detector was utilized to find the edges of the cuboid or cylinder present on the conveyor [15]. The contours of the edges were then extracted, and the centre coordinates were determined from the image moments (Fig. 15).
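A minimal sketch of this step, assuming OpenCV 4.x and that the job corresponds to the largest contour in the edge map:

```python
import cv2

frame = cv2.imread("top_view.png")        # placeholder, perspective-corrected frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)          # Canny thresholds are assumptions

contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
largest = max(contours, key=cv2.contourArea)   # assume the job is the largest contour

m = cv2.moments(largest)
cx_px = int(m["m10"] / m["m00"])          # centre X in pixels
cy_px = int(m["m01"] / m["m00"])          # centre Y in pixels
```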
These coordinates were found in pixels. However, for the grasping decision, i.e., the feedback to the robot, the coordinates were required in millimetres. This was achieved by mapping the camera resolution to the distance in millimetres visible in the frame. The resolution of the webcam utilized was 720 px × 1280 px. When fixed in place, the camera captured a fixed length and breadth of the conveyor, which were 385 mm and 220 mm, respectively. Thus, the distance per pixel in mm along the length was found by dividing 385 by 1280; similarly, the distance per pixel in mm along the breadth was found by dividing 220 by 720 (Fig. 16).
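The mapping reduces to two scale factors; a short sketch using the values reported above (the centroid values are placeholders):

```python
FRAME_W_PX, FRAME_H_PX = 1280, 720       # webcam resolution
VIEW_LEN_MM, VIEW_BRD_MM = 385.0, 220.0  # conveyor length and breadth visible in the frame

mm_per_px_len = VIEW_LEN_MM / FRAME_W_PX   # ~0.301 mm per pixel along the length
mm_per_px_brd = VIEW_BRD_MM / FRAME_H_PX   # ~0.306 mm per pixel along the breadth

cx_px, cy_px = 640, 360                    # placeholder centroid from the contour step
x_mm = cx_px * mm_per_px_len
y_mm = cy_px * mm_per_px_brd
print(x_mm, y_mm)                          # ~192.5 mm, ~110.0 mm
```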
3.4 Pose estimation
After calculating the required vertices, the perspective transform of the top surface is taken. The features of the transformed image are computed using ORB, and the top surface of the cuboid is detected by comparing these features. After finding the matches, the homography matrix is determined; it maps key points in one image to the corresponding points in the other image. The matrix is applied, through the perspective transform, to the corner points of the warped image, which serve as the 3D object points. The PnP model has been adopted to determine the 6D pose, i.e., X, Y, Z, roll, yaw, and pitch of the object with respect to the camera [16]. The PnP model computes the relative pose of the camera with respect to a 3D object without resorting to neural networks, and hence without relying on the heavy data needed to train them. Equation 1 below best describes it:

$$s\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & \gamma & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \qquad (1)$$
where fx and fy are the scaled focal lengths, γ is the skew factor (assumed to be 0), u0 and v0 are the coordinates of the principal point, s is the scale factor, r and t are the calculated rotation and translation, and x, y, z are the 3D points in world coordinates. The overall flow diagram of the approach is shown in Fig. 17.
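A minimal sketch of the PnP step, assuming OpenCV, a calibrated intrinsic matrix, and that the 3D corners of the cuboid's top face and their matched 2D image points are already available (all numeric values below are placeholders):

```python
import cv2
import numpy as np

# Intrinsic matrix built from fx, fy, u0, v0 with zero skew (placeholder calibration values).
K = np.array([[900.0,   0.0, 640.0],
              [  0.0, 900.0, 360.0],
              [  0.0,   0.0,   1.0]])

# 3D corners of the cuboid top face in object coordinates (mm), placeholder size 80 x 60 mm.
object_points = np.array([[0, 0, 0], [80, 0, 0], [80, 60, 0], [0, 60, 0]], dtype=np.float32)
# Matching 2D image corners (px) obtained via ORB matching and the homography (placeholders).
image_points = np.array([[512, 300], [690, 310], [684, 452], [508, 440]], dtype=np.float32)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None)
R, _ = cv2.Rodrigues(rvec)                       # rotation matrix of the object w.r.t. the camera
yaw = np.degrees(np.arctan2(R[1, 0], R[0, 0]))   # yaw angle used for the grasping decision
print(tvec.ravel(), yaw)
```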