Three-dimensional Spatial Localization Based on Binocular Vision

Abstract: With the rapid innovation of science and technology, researchers are no longer satisfied with simple reconstruction and composition in binocular vision. To address the problems of low accuracy and unstable system performance, this paper proposes a three-dimensional spatial recognition and localization algorithm that combines binocular stereo vision with a deep learning algorithm. First, Zhang's calibration method is used to bring the calibration error down to 0.10 pixels, and the SAD algorithm is selected to narrow the search range of matching points and reduce the data burden on subsequent experiments. Then, the three-dimensional spatial data calculated from binocular parallax are fed into a Faster R-CNN model for training, and target features are extracted and classified. Finally, the object and its position coordinates can be detected in real time. Experimental analysis shows that when the calibration error is optimal and the training data are sufficient, the algorithm effectively improves the quality of target detection: positioning accuracy and target recognition rate improve by about 3%-5%, and a higher FPS is achieved.

I. Introduction
Binocular vision has clear advantages over other ranging sensors: it is a non-contact ranging scheme, which not only avoids the problems of signal interference and non-line-of-sight error but also reduces application cost [3][4].
Three-dimensional spatial positioning based on binocular vision is a positioning method developed in recent years. It uses a binocular camera to capture images, processes them with computer vision algorithms to compute the three-dimensional information of the scene, and obtains the exact position of the target object through recognition, thereby achieving three-dimensional spatial positioning of the object [5][6]. Visual SLAM generally involves image acquisition, camera calibration, feature extraction, feature matching, and other steps [7].
Researchers have also used various improvements to further increase the accuracy and robustness of visual positioning systems. Camera calibration establishes the transformation that maps an object from the real world onto the computer image plane.
Zhang Zhengyou's planar calibration method is the most commonly used [8]. To obtain accurate depth information, calibration accuracy has been improved through refined checkerboard templates, coupling OpenCV with Zhang's calibration method, and other techniques [9][10][11][12]. After calibration, feature points are generally extracted. Common point-feature algorithms include SIFT, SURF, and ORB. Researchers have studied feature-point extraction and matching in depth to improve the accuracy and real-time performance of these algorithms [13][14][15][16][17]. In [18], a variety of vision schemes are proposed for different complex environments. In [19], binocular vision is applied to the design of an autonomous mobile robot. Visual positioning technology is now quite mature.
With the rapid development of artificial intelligence, deep learning has made breakthrough progress in machine vision. The surveys in [20][21][22][23] summarize deep learning techniques and argue that applying deep learning to vision is an inevitable future trend. Binocular vision algorithms based on deep learning fall into two categories: classification-based and regression-based. To meet the high-accuracy requirements of indoor scenes, classification-based algorithms are usually selected [24][25][26]. To improve system performance, [27] proposed the Faster R-CNN algorithm, and [28] used Fast R-CNN for training together with the SURF algorithm for stereo matching. Building on this, [29] fused Faster R-CNN with multi-scale features, and [30] proposed an improved Faster R-CNN detection framework to improve detection quality and positioning accuracy. By training and testing on different data sets, a more accurate and less time-consuming model can be found, achieving effective indoor recognition and three-dimensional spatial positioning of objects [31][32][33][34][35].
Based on binocular stereo vision, this paper proposes a three-dimensional spatial positioning algorithm that combines deep learning with binocular vision; it outputs object coordinates in real time and effectively improves target recognition accuracy. The main contributions of this paper are as follows.
1) A block stereo matching algorithm with fast search speed is proposed, which reduces the search range of matching points and thus the amount of computation, achieving a higher FPS.
2) A binocular target detection method based on a deep learning algorithm is proposed. Compared with binocular recognition alone, positioning accuracy is improved by 3%-5%.
The rest of the article is organized as follows. Section II introduces the theory of binocular stereo vision and deep-learning-based object detection, and summarizes the combined design of binocular vision and deep learning. The experiments and analysis are described in Section III and the conclusions in Section IV.
II. Method

Binocular stereo vision
In this section, we briefly introduce the binocular camera model, camera calibration, and stereo matching, which provide the background needed in what follows.

Coordinate system
Before analyzing the camera model, we briefly review the camera coordinate systems; this lays the foundation for the subsequent experiments and analysis and supports three-dimensional reconstruction. Binocular vision generally involves the world coordinate system, the camera coordinate system, and the image coordinate system. The world coordinate system serves as the reference frame of the target object in binocular vision; from this step on, the target object enters the computation. The camera coordinate system measures positions from the camera's own viewpoint and is the necessary intermediate step in transforming from the world coordinate system to the image coordinate system.
The object in the world coordinate system is transferred to the camera coordinate system through a rigid-body transformation, and the position of the object in the image coordinate system is then obtained by perspective projection. The image coordinate system is the representation of the object in the image [4]. Because a rigid-body transformation preserves shape, no deformation occurs between the world coordinate system and the camera coordinate system, and the relationship between them is

P_c = R * P_w + t

The homogeneous expression can be written as

[P_c; 1] = [R, t; 0^T, 1] * [P_w; 1]

where R is a unit orthogonal (rotation) matrix and t is the translation vector; together, R and t encode the rotation and translation between the two frames.
The transformation from the camera coordinate system to the image coordinate system maps 3D points to 2D. By similar triangles, a camera point (X_c, Y_c, Z_c) projects to image coordinates

x = f * X_c / Z_c,  y = f * Y_c / Z_c

where f is the focal length. Combining this perspective projection with the rigid-body transformation above, the world coordinate system can be converted to the image coordinate system.
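The two-step chain above (rigid-body transform, then perspective projection) can be sketched in a few lines of Python. The rotation, translation, and focal length below are illustrative values, not parameters from the paper's experiment.

```python
# Minimal sketch of the world -> camera -> image coordinate chain.
# R, t, and f are made-up values for illustration only.

def world_to_camera(p_w, R, t):
    """Rigid-body transform: P_c = R * P_w + t."""
    return [sum(R[i][j] * p_w[j] for j in range(3)) + t[i] for i in range(3)]

def camera_to_image(p_c, f):
    """Perspective projection by similar triangles: x = f*X/Z, y = f*Y/Z."""
    X, Y, Z = p_c
    return (f * X / Z, f * Y / Z)

# Identity rotation and a 1 m translation along the optical axis (assumed).
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t = [0.0, 0.0, 1.0]
f = 0.05  # focal length in metres (assumed)

p_c = world_to_camera([0.2, 0.1, 1.0], R, t)  # the point 1 m behind moves to Z = 2 m
x, y = camera_to_image(p_c, f)
```

A real pipeline would additionally convert the metric image coordinates to pixel coordinates using the intrinsic matrix obtained from calibration.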

Binocular camera model
In a visual SLAM system, cameras are usually used to acquire scene information, after which position, geometry, height, color, and other information can be processed accordingly. A binocular camera consists of two monocular cameras with a fixed, known distance (the baseline) between them. The inspiration for the binocular camera comes from the human eyes.
When human eyes observe things, depth perception gives a three-dimensional sense of the scene. Accordingly, the left and right cameras of a binocular rig, like a person's two eyes, lie in the same plane with their optical axes parallel [4]. Binocular vision can calculate the three-dimensional coordinates of each pixel through parallax, and when computing the parallax between the two images it can directly measure the distance of the object in front without needing to classify what kind of obstacle it is [5]. Since we cannot compute all captured points, some feature points are extracted to assist the calculation. According to the above model, once the relative pose of the cameras is determined, the geometric relationship yields the coordinates of a feature point in the stereo camera coordinate system. However, it must first be ensured that the measured point corresponds to the same physical point in both the left and right cameras, that is, that both cameras are referenced to the same coordinate system [4][5][6].
In this coordinate system, let the left camera be the main camera. A point P_w in the world coordinate system is transformed into the left camera's coordinate system as shown in Equation 5:

P_l = R_l * P_w + t_l    (5)

After the cameras are calibrated, the pose of the right camera relative to the left camera can be computed, as shown in Equation 6:

R = R_r * R_l^(-1),  t = t_r - R * t_l    (6)

From these formulas, the rotation matrix and translation vector of the right camera relative to the measured point can be obtained [15].
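Once the pair is rectified, the parallax model above reduces to the standard triangulation formula Z = f·b/d, from which X and Y follow by back-projection. The sketch below illustrates this; the focal length (in pixels), baseline, and disparity are assumed values, not measurements from the paper.

```python
# Triangulation for a rectified stereo pair (assumed parameters, for illustration).

def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Depth of a matched point: Z = f * b / d (f in pixels, b in metres)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return f_px * baseline_m / disparity_px

def pixel_to_3d(u, v, cx, cy, f_px, Z):
    """Back-project an image pixel to camera coordinates at known depth Z."""
    X = (u - cx) * Z / f_px
    Y = (v - cy) * Z / f_px
    return (X, Y, Z)

# Assumed intrinsics: 700 px focal length, 12 cm baseline, 35 px disparity.
Z = depth_from_disparity(700.0, 0.12, 35.0)  # about 2.4 m
X, Y, Z = pixel_to_3d(420.0, 240.0, 350.0, 240.0, 700.0, Z)
```

Note that a larger disparity means a closer object, which is why nearby points are easier to range accurately.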

Camera calibration
Camera calibration establishes the transformation from the real world to the computer image plane and solves for the intrinsic and extrinsic parameters. In addition, the camera's perspective projection is distorted by various deviations, so calibration also solves for the distortion coefficients used in image correction.
Because images taken by the camera are distorted, distortion must be understood before calibration. Radial distortion and tangential distortion are usually considered. Radial distortion displaces light along the radial direction; it is generally zero at the center of the image plane and grows toward the edge. Its mathematical model can be described by a Taylor series expansion [10]:

x_d = x (1 + k_1 r^2 + k_2 r^4 + k_3 r^6)
y_d = y (1 + k_1 r^2 + k_2 r^4 + k_3 r^6)

where r^2 = x^2 + y^2 and k_1, k_2, k_3 are the radial distortion coefficients. Tangential distortion is caused by the image plane and the lens not being perfectly parallel. Its model adds two additional parameters p_1 and p_2 [17]:

x_d = x + 2 p_1 x y + p_2 (r^2 + 2 x^2)
y_d = y + p_1 (r^2 + 2 y^2) + 2 p_2 x y

Camera calibration generally uses Zhang Zhengyou's method, proposed by Professor Zhang Zhengyou in 1998 [8], a single-plane chessboard method based on coplanar spatial feature points. After each camera is calibrated, the rotation matrix R and translation vector T between the two cameras are obtained as

R = R_r * R_l^(-1),  T = t_r - R * t_l

[13][14][15][16][17]

In the block matching (BM) stereo algorithm, the following similarity measures can be used:

(1) Sum of Absolute Differences (SAD): sum over the window of |I_L(x+i, y+j) - I_R(x+i-d, y+j)|
(2) Sum of Squared Differences (SSD): sum over the window of (I_L(x+i, y+j) - I_R(x+i-d, y+j))^2
(3) Normalized Cross-Correlation (NCC): sum(I_L * I_R) / sqrt(sum(I_L^2) * sum(I_R^2))

In this paper, the SAD algorithm is used: the absolute differences of corresponding pixel values are summed and compared to measure similarity, a local matching approach [17]. The basic process is as follows: define a small window around a pixel in the left image, slide a window of the same size along the corresponding row in the right image, compute the SAD cost at each candidate disparity, and take the disparity with the minimum cost as the match.
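The SAD search procedure just described can be sketched on a single rectified scan line. The image rows and window size below are toy values chosen so the correct disparity is known in advance.

```python
# Toy SAD block matching along one rectified scan line.
# The right row is the left row shifted 3 px (grey values are made up).

def sad(a, b):
    """Sum of absolute differences between two equal-length windows."""
    return sum(abs(x - y) for x, y in zip(a, b))

def match_row(left, right, x, win, max_disp):
    """Return the disparity that minimises SAD for the window centred at x."""
    patch = left[x - win: x + win + 1]
    best_d, best_cost = 0, float("inf")
    for d in range(0, max_disp + 1):
        if x - d - win < 0:          # candidate window would leave the image
            break
        cand = right[x - d - win: x - d + win + 1]
        cost = sad(patch, cand)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d

left  = [0, 0, 0, 10, 80, 200, 80, 10, 0, 0, 0, 0]
right = [10, 80, 200, 80, 10, 0, 0, 0, 0, 0, 0, 0]

d = match_row(left, right, x=5, win=2, max_disp=4)  # -> 3
```

Restricting `max_disp` is exactly the search-range reduction the paper credits with lowering the computational load.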

RPN Network
RPN discards selective search and directly uses a convolutional neural network (CNN) to generate candidate-region detection boxes; it is the core of the whole network. The RPN has two branches: a softmax classifier that labels anchors as positive or negative, and a bounding-box regressor that computes offsets relative to the anchors to obtain accurate proposals. It is a fully convolutional network that judges whether each position in the input image contains a target when predicting the target region box [25].
The RPN consists of a 3×3 convolution layer followed by two parallel 1×1 convolution layers. First, the convolution layers of the backbone network produce a shared feature map, which is fed into the RPN. In the RPN stage, the 3×3 convolution produces a 256×16×16 feature map; each anchor point on this feature map generates anchor boxes of varying sizes and aspect ratios, and these anchor boxes are fed into the two parallel 1×1 convolutions. With the help of the preset anchors, one branch performs classification to decide whether a candidate box contains a target, and the other regresses the box position to achieve preliminary localization [29].
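The anchor-generation step described above can be sketched as follows. The stride, scales, and aspect ratios are illustrative choices (and the ratio is taken here as width/height), not the settings used in the paper's experiment; only the 16×16 feature-map size comes from the text.

```python
import math

def make_anchors(fmap_h, fmap_w, stride, scales, ratios):
    """Lay k = len(scales)*len(ratios) boxes around each feature-map cell centre.

    Each anchor is (x1, y1, x2, y2) in input-image pixels; `ratios` is w/h.
    """
    anchors = []
    for i in range(fmap_h):
        for j in range(fmap_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * math.sqrt(r)   # keeps area s*s while varying shape
                    h = s / math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors

# A 16x16 feature map with 3 scales x 3 ratios gives 16*16*9 = 2304 anchors.
anchors = make_anchors(16, 16, 16, scales=[64, 128, 256], ratios=[0.5, 1.0, 2.0])
```

In the real network these anchors are then scored and refined by the two 1×1 convolution branches; here only the geometric layout is shown.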
The RPN first appeared in the Faster R-CNN network structure. In essence it is a "class-agnostic object detector based on sliding windows" whose main task is to generate proposals, which are then used together with the feature map of the last layer. The reason deep learning is combined with binocular vision is that the core of binocular stereo ranging is to compute the distance between the object and the camera using the triangle relationship.

Faster R-CNN detection module
A triangle is the most stable shape in mathematics and is uniquely determined, so the stability of the triangle relationship makes the three-dimensional data more accurate. Monocular vision, the basis of other vision systems, is relatively simple, but it suffers from scale ambiguity and requires a large amount of data, so it cannot obtain absolutely accurate position information.
Multi-camera vision not only imposes its own hardware requirements but also increases the computational complexity of the algorithm.
Common RGB-D visual odometry is strongly affected by lighting and has a limited measurement range. In contrast, binocular vision has relatively low hardware requirements and is fast, accurate, flexible, and stable, making it suitable for both indoor and outdoor use.

III. Results and discussion
The experiment combines a 100-degree distortion-free binocular camera with Python and Matlab. Matlab is used to complete the camera calibration; after repeated adjustment, an ideal calibration error is obtained, set at 0.10 pixels in this experiment.
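As a small illustration of how a calibration error figure such as 0.10 pixels is evaluated, the sketch below computes the mean reprojection error between detected and reprojected chessboard corners. The corner coordinates are hypothetical, not data from this experiment.

```python
import math

def mean_reprojection_error(projected, detected):
    """Mean Euclidean distance, in pixels, between reprojected and detected corners."""
    assert len(projected) == len(detected) and projected
    total = sum(math.dist(p, q) for p, q in zip(projected, detected))
    return total / len(projected)

# Hypothetical corners: each reprojected point is off by 0.1 px in x.
detected  = [(10.0, 10.0), (20.0, 10.0), (10.0, 20.0), (20.0, 20.0)]
projected = [(x + 0.1, y) for x, y in detected]

err = mean_reprojection_error(projected, detected)  # about 0.1 px
```

Calibration toolboxes report this same quantity (sometimes as an RMS rather than a mean) after optimizing the intrinsic, extrinsic, and distortion parameters.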

Stereo matching experiment
In the stereo matching experiment, we choose a gray-level-based template matching approach and, on this basis, compare the SAD, SSD, and NCC algorithms.
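For reference, the three similarity measures being compared can be written out directly for two equal-sized grey-level patches (flattened to 1-D lists for brevity); the sample patch values are made up.

```python
import math

def sad(a, b):
    """Sum of Absolute Differences (lower = more similar)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def ssd(a, b):
    """Sum of Squared Differences (lower = more similar)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ncc(a, b):
    """Normalized Cross-Correlation (higher = more similar; 1.0 for identical)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den

p = [10, 20, 30, 40]   # made-up grey values
q = [12, 18, 33, 39]

costs = (sad(p, q), ssd(p, q), ncc(p, q))  # -> (8, 18, ~0.997)
```

SAD needs only additions and absolute values, which is why it is the cheapest of the three and was chosen in this paper; NCC is more robust to brightness changes but costs a multiplication-heavy normalization per window.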

Model training loss and analysis
The loss function is the core measure of training performance. We found that after a period of training the loss briefly approaches 0, but the overall trend is downward. Figure 12 (Total_loss and Clone_loss) shows the total loss and the clone loss. Since the experiment trains the model on a single GPU only, the two losses are identical, and only the total loss is considered.
The total loss is obtained by adding the classification and regression loss of RPN network and the loss of Fast R-CNN network, which reflects the loss of the whole training model.
We can see that at the beginning of training the loss fluctuates considerably; after sufficient training the loss tends to 0 and remains stable. Table 2 compares the three-dimensional coordinates of objects captured by the left and right cameras. As shown in Table 2, for example, the coordinate of the keyboard relative to the binocular camera in three-dimensional space is (334, 330, 41), in centimeters.

IV. Conclusion
In this paper, based on binocular stereo vision fused with deep learning, a Faster R-CNN model is used to complete data training and ultimately achieve three-dimensional positioning of objects. Binocular vision uses parallax to obtain the three-dimensional information of an object; its combination of adequate accuracy and fast detection speed suits settings with high performance and environmental requirements, and the best calibration error is determined. The SAD, SSD, and NCC algorithms are compared and analyzed, and the most suitable, SAD, is selected, laying the foundation for subsequent data-set training and target detection. Using Faster R-CNN for target detection aligns features with targets better, avoids repeated extraction and computation of target features, greatly reduces the amount of computation, and improves classification accuracy; the overall accuracy is about 94.4%, and the system is more stable.