Binocular Vision-Based Intelligent 3-D Perception for Robotics Application

- Vision-based robotics has been the subject of many research contributions in the areas of vision and control. Vision technology has become central to common applications such as localization, automated map creation, autonomous navigation, mapping analysis, and risk pattern prediction. Stereo applications use pairs of 2-D images as input and generate reconstructed 3-D imagery by locating matching points. This paper introduces a method for developing an intelligent 3-D view reconstruction algorithm using binocular vision for robotic applications. The proposed system consists of two identical colour cameras mounted as a single stereo camera. 3-D reconstruction and visualization were performed from the pair of 2-D images. Calibration, multi-view image acquisition, the stereo rectification process, and the disparity process are discussed in Section II. Real-time captured images and stereo images from the Middlebury Stereo Datasets were used to test the system and verify the results in Section III.


I. INTRODUCTION
Machine stereo vision, also known as stereoscopic vision, has been an active area of robotics and engineering research for decades, and was widely investigated even before the emergence of event-based sensors. An autonomous robot needs to be aware of the three-dimensional state of the world to understand and reason about its environment. The difficulty with vision, however, is that the perceived image is only a two-dimensional projection of the 3-D world [1]. Stereo vision can be viewed as a spatial integration of multiple viewpoints to recover depth; a temporal integration is also possible.
Biological systems understand a scene more easily than machines, even at smaller energy budgets (Martin et al., 2018). Most animals have two eyes for a reason: the vision from both eyes is combined into a single stereoscopic percept, from which the brain builds a 3-D map of the scene. The view from each eye differs slightly, and this disparity is what allows us to perceive whether an object is nearer or farther away. In humans, stereopsis has become an attractive model system for understanding the relationship between neural activity and perception (Roe et al., 2007). Stereopsis was not demonstrated behaviorally in any non-human animal until 130 years after Wheatstone, when Bough provided evidence of stereopsis in macaque monkeys in 1970.
Modern machine stereo algorithms are, to some extent, inspired by human stereopsis, which is powerful but also complicated and expensive [2]. Figure 1 illustrates the typical stereo vision system of a human. Stereo vision combines multiple viewpoints with correspondence matching, and thereby recovers depth from a pair of images. Stereo vision algorithms have also been applied successfully in robotics [3].
Every visual sensor, whether artificial or biological, maps the 3-D world onto a 2-D representation. Depth sensors are the key to unlocking next-level machine vision applications in modern engineering. 3-D depth calculation and machine vision techniques are widely used in many applications, such as healthcare [4][5][6][7][8], autonomous navigation, teleoperation, and virtual/augmented reality modelling [9]. Google's Project Tango uses depth sensors to measure the real environment accurately and inform its graphics algorithms so that virtual content is positioned in the appropriate locations. Many warehouses now use fully autonomous vehicles to carry items from one place to another. The vehicle's ability to travel on its own relies on depth sensing, so it knows where it is in the world, where other important objects are, and, most importantly, how to travel safely from one point to another. Therefore, once stereo matching has identified a pair of corresponding points, the depth is calculated from the difference between the two points' pixel coordinates.

Figure 2. Illustration of the Stereo Geometry
Given the pixel coordinates (X_L, Y_L) in the left image and (X_R, Y_R) in the right image, the coordinates of the 3-D world point (X, Y, Z) are determined as

Z = f · b / d,   X = X_L · Z / f,   Y = Y_L · Z / f,

where d = (X_L − X_R) is the disparity in pixels, f is the focal length of the camera, and b is the baseline (interocular separation) of the camera pair in mm. The quality of the reconstructed 3-D data depends on the disparities, the calibration, the image rectification, and the overall stereo system architecture. Figure 3 depicts the 3-D reconstruction model of the proposed system.
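The depth relations above can be sketched directly in Python; the values in the usage example are hypothetical (f is the focal length in pixels, b the baseline in mm):

```python
def triangulate(x_l, y_l, x_r, f, b):
    """Recover (X, Y, Z) in camera coordinates from a matched pixel pair.

    x_l, y_l : pixel coordinates in the left image
    x_r      : column of the matching pixel in the right image
    f        : focal length in pixels
    b        : baseline (interocular separation) in mm
    """
    d = x_l - x_r                  # disparity in pixels
    if d <= 0:
        raise ValueError("disparity must be positive for a valid match")
    Z = f * b / d                  # depth shrinks as disparity grows
    X = x_l * Z / f                # back-project the left-image pixel
    Y = y_l * Z / f
    return X, Y, Z
```

For example, with f = 700 px, b = 60 mm and a 20-pixel disparity, the point lies at a depth of 2100 mm.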

Image rectification
Rectification methods are well known and have been studied extensively for years. These techniques adjust the captured images to simplify the stereo correspondence problem. Because of the optics, an image captured by an optical camera deviates from the true scene geometry. In stereo applications there are generally two things to correct: image distortion and the image epipolar geometry. This is known as the rectification process. Once the stereo pair is rectified, the stereo correspondence problem reduces from a 2-D search of order N² to a 1-D search of order N along the same epipolar line for each matching pair of points [11].

Epipolar Geometry
Epipolar geometry is the geometry of the stereo vision system. When two cameras view a 3-D scene from two separate locations, there is a set of geometric relations between the 3-D points and their projections onto the 2-D images that constrain the corresponding image points. These relations follow from the assumption that the cameras can be approximated by the pinhole camera model. Figure 4 illustrates an example of epipolar geometry.
Let us assume that the first camera is aligned with the world reference system and that the second camera is offset from the first by a rotation R followed by a translation T. The intrinsic projection matrix of the colour camera is then

K = | f/dx   0      u0 |
    | 0      f/dy   v0 |
    | 0      0      1  |

where f is the focal length of the colour camera, (u0, v0) are the coordinates of the principal point, and dx and dy are the physical sizes of a pixel in the horizontal and vertical directions, respectively. Here f/dx, f/dy, u0, and v0 are the internal (intrinsic) parameters of the colour camera.
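As an illustration, the intrinsic matrix and the pinhole projection can be written out with NumPy. All parameter values below are hypothetical, not the calibrated values of the cameras used in this study:

```python
import numpy as np

# Hypothetical intrinsics: focal length f (mm), pixel sizes dx, dy (mm),
# principal point (u0, v0) in pixels.
f, dx, dy = 3.6, 0.01, 0.01
u0, v0 = 320.0, 240.0

# Intrinsic matrix of the pinhole model described above.
K = np.array([[f / dx, 0.0,    u0],
              [0.0,    f / dy, v0],
              [0.0,    0.0,    1.0]])

# Project a 3-D point P (in camera coordinates, metres) to pixel coordinates.
P = np.array([0.1, 0.05, 2.0])
uvw = K @ P
u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]   # divide by depth to get pixels
```

With these numbers the point projects to roughly (338, 249) pixels, i.e. slightly right of and below the principal point, as expected for a point in front of the camera with positive X and Y.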

II. METHODOLOGY
This study aims to develop an intelligent 3-D view reconstruction algorithm using binocular vision for machines and robotic applications. The methodology of the proposed system is composed of calibration, image acquisition, pre-processing, stereo rectification, point-cloud generation, 3-D reconstruction, and visualization. Figure 5 illustrates the program flow chart of the proposed system. Image Acquisition: The proposed system comprised two optical cameras (Raspberry Pi 5MP camera module, 60 fps, 640×480 pixels) mounted in parallel as a single stereo camera. The system's most important requirement is to guarantee that the frames from both cameras are recorded concurrently with the same brightness, exposure, shutter time, and acquisition parameters. Several multiple-camera calibration methods can estimate the intrinsic and extrinsic parameters of the cameras at the same time. The cameras were modelled as pinhole cameras with nonlinear radial and tangential distortion compensation, and the calibration algorithms were developed in Python.
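The radial and tangential distortion compensation mentioned above is commonly expressed with the Brown-Conrady model; a minimal sketch is given below. The coefficients k1, k2 (radial) and p1, p2 (tangential) are placeholder values, since the real coefficients come out of the calibration step:

```python
def distort(x, y, k1=-0.2, k2=0.05, p1=0.001, p2=-0.001):
    """Map ideal normalized image coordinates (x, y) to their distorted
    positions under the Brown-Conrady radial/tangential model.
    Calibration solves for k1, k2, p1, p2 so the mapping can be inverted.
    """
    r2 = x * x + y * y                         # squared radius from centre
    radial = 1.0 + k1 * r2 + k2 * r2 * r2      # radial distortion factor
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d
```

The image centre is a fixed point of the model, and with all coefficients set to zero the mapping is the identity, which is a quick sanity check for an implementation.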

Grayscale Conversion:
In this study, the acquired images were converted into grayscale, and the results were passed to the stereo rectification process. Grayscale is the range of monochrome (gray) shades from pure white at the lightest end to pure black at the other end. A grayscale image carries only luminance (brightness) information and no colour information [13][14]: the highest luminance is white, the minimum luminance is black, and the shades of gray lie in between. In this study, the acquired stereo images were converted to grayscale using the weighted grayscale method.
 Unweighted: The red, green, and blue pixel values are simply averaged. There is no bias and no connection with human vision [15].
 Weighted: The human eye's different sensitivity to each colour is taken into account, and the average is weighted accordingly [15].
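The weighted conversion can be sketched with the widely used ITU-R BT.601 luma weights (0.299 R + 0.587 G + 0.114 B); the paper does not state which weights were used, so these are an assumption:

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an (H, W, 3) uint8 RGB image to an (H, W) uint8 grayscale
    image using BT.601 luma weights (weighted method)."""
    weights = np.array([0.299, 0.587, 0.114])
    # Contract the colour axis against the weights, then round back to uint8.
    return np.rint(rgb.astype(np.float64) @ weights).astype(np.uint8)
```

A pure red pixel (255, 0, 0), for instance, maps to the fairly dark gray level 76, reflecting the eye's lower sensitivity to red than to green.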
Stereo Rectification: Rectification methods are well known and have been practised widely for years. These techniques adjust the captured images to simplify the stereo correspondence problem. Because of the lenses, the image produced by an optical system deviates from the true scene geometry. For stereo implementations there are essentially two things that must be corrected: image distortion and the image epipolar geometry. In this study, image rectification was performed as part of the image calibration process.

Finding the Disparity (Sum of Absolute Differences):
The Sum of Absolute Differences (SAD) is a way to determine the disparity. Since the images are represented as 2-D arrays, the measure is computed over a block of (m × n) pixels. If all of the pixels match perfectly, the colour values associated with each pixel are the same and the two blocks are identical. In real stereo pairs such identical blocks rarely occur, so we search for the block with the closest match, i.e. the smallest SAD:

SAD(d) = Σ_i Σ_j | I_L(x + i, y + j) − I_R(x + i − d, y + j) | (10)
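The block search described above can be sketched as follows; this is a minimal single-pixel version in which the block size and maximum disparity are illustrative choices:

```python
import numpy as np

def sad_disparity(left, right, row, col, block=3, max_disp=16):
    """Return the disparity at (row, col) of the left image by sliding a
    (block x block) window along the same row of the right image and
    keeping the offset with the smallest Sum of Absolute Differences."""
    h = block // 2
    ref = left[row - h:row + h + 1, col - h:col + h + 1].astype(np.int32)
    best_d, best_sad = 0, np.inf
    for d in range(0, min(max_disp, col - h) + 1):
        cand = right[row - h:row + h + 1,
                     col - d - h:col - d + h + 1].astype(np.int32)
        sad = np.abs(ref - cand).sum()      # Equation (10) for this offset
        if sad < best_sad:
            best_sad, best_d = sad, d
    return best_d
```

Applying this to a synthetic pair where the left image is the right image shifted by a known number of pixels recovers that shift, which is a convenient unit test for any block-matching implementation.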

Mapping the Disparity in Three Dimensions:
To obtain measurement points, a disparity map must be determined before constructing an occupancy grid through stereo vision. As the magnitude of the disparity increases, the warmth of the displayed colour grows proportionally. The image width and height define the x and y axes, and each combination (x, y) addresses one pixel in the image. Each pixel in the map displays the disparity, which is simply a numerical representation of how close that pixel is to the camera. Finally, the distance information was plotted on the z-axis against the coordinates x and y (figure 6). Point clouds are a means of assembling a large number of single spatial measurements (x, y, z) into a dataset that represents a whole object or space. Figure 7 illustrates a sample point cloud of a torus. Each point gives the geometric coordinates (x, y, z) of a single point on the underlying sampled surface. Several formats may be used to store a point cloud; essentially, any format that can store the three numbers x, y, and z will do. Many formats are in common use for point cloud processing, and they can be classified into binary and ASCII types [16].
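Mapping a dense disparity map to a point cloud follows directly from the triangulation relations in Section I; a minimal sketch, with hypothetical f and b, is:

```python
import numpy as np

def disparity_to_point_cloud(disp, f, b):
    """Turn an (H, W) disparity map into an (N, 3) point cloud.
    disp : disparities in pixels (zeros are treated as invalid matches)
    f    : focal length in pixels;  b : baseline in mm
    """
    h, w = disp.shape
    ys, xs = np.mgrid[0:h, 0:w]          # pixel coordinate grids
    valid = disp > 0                     # skip pixels with no match
    Z = f * b / disp[valid]              # depth from disparity
    X = xs[valid] * Z / f                # back-project each pixel
    Y = ys[valid] * Z / f
    return np.column_stack([X, Y, Z])    # one (x, y, z) row per valid pixel
```

Each row of the result is one spatial measurement, so the array can be written out directly in an ASCII point cloud format (e.g. one "x y z" line per point) for a viewer such as MeshLab.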

3-D Visualization:
MeshLab [18] is an open-source program that is widely used for creating and editing 3-D triangular meshes. It provides a set of tools to edit, clean, heal, inspect, render, texture, and convert meshes. In this study, we used MeshLab for 3-D visualization of the point cloud dataset generated in the point-cloud generation process.

III. RESULTS AND DISCUSSIONS
In this section, we demonstrate depth map results using four image sets from two categories.
 Category 1: Experiment one was conducted with an image captured in real time using two cameras that functioned as a single stereo camera. The image shown in figure 8 and figure 9 is a man sitting on a chair, holding a helmet in his left hand.  In the stereo matching process, the algorithm does not search the entire 2-D right image for a corresponding point. The "epipolar constraint" reduces the search space to a one-dimensional line: the patch in the left image is compared with the patches along the same row in the right image. To obtain measurement points, a disparity map must be determined before constructing an occupancy grid through stereo vision. Figure 10 illustrates the disparity map of test one.  Category 2: Three stereo images from the 2005 [19] and 2014 [20] Middlebury Stereo Datasets, which are publicly available to researchers at vision.middlebury.edu [21][22], were used for this category.

Experiment Two
Experiment two was carried out using a Middlebury stereo image, as shown in figure 12. The depth map for the second experiment is shown in figure 13; the differences between the two images give the depth information, which is visualized as the depth map. Because of the low patch values, the disparity map for the second test lost detail, as shown in figure 14. Close objects resulted in large disparity values, which translate into light grayscale values; objects farther away appear darker.  Figures 16 to 19 show the stereo image, depth map, disparity map, and 3-D visualization of experiment three.

IV. CONCLUSIONS
In this paper, we presented the development of an intelligent 3-D perception algorithm for robotic applications based on binocular vision using two identical cameras. Stereo imaging is a passive technique that can recover the structure of the environment by comparing features observed in different photographs of the same scene. The algorithm can be applied, for example, to robotic hands that are guided by visual perception and equipped for handling instruments.
The experiments used two identical cameras mounted as a single stereo camera to obtain a common epipolar line. Stereo matching is the most crucial step in binocular vision reconstruction. Because a perfect stereo image pair could not be acquired from the two cameras, the 3-D visualization was not completed correctly in experiment one: the point-cloud generation failed because of errors introduced during calibration and image acquisition.
According to the results, the disparity was larger (brighter) for closer surfaces. Figures 24, 25, and 26 plot depth (z) against disparity (px) for experiments 2, 3, and 4. Table 1 shows the focal length, baseline, and maximum disparity values observed for experiments 2, 3, and 4. Each experiment used stereo images taken from the Middlebury stereo dataset.