OptiDepthNet: A Real-Time Unsupervised Monocular Depth Estimation Network

With the development of deep learning, the network architectures applied to monocular depth estimation and their accuracy have greatly improved. However, these complex network structures make real-time processing difficult to achieve on embedded platforms. Consequently, this study proposed a lightweight encoder-decoder structure based on the U-Net model. Depthwise separable convolution was introduced into the encoder and decoder to optimize the network structure, reduce the computational complexity, and improve the running speed, making the algorithm more suitable for embedded platforms. While achieving comparable depth image accuracy, the network parameters could be reduced by up to a factor of eight, and the running speed more than doubled. The results showed the proposed method to be effective, offering a useful reference for monocular depth estimation algorithms running on embedded platforms.


Introduction
With the ongoing development of artificial intelligence, computer vision technology has been widely used in industrial applications, including production line inspection, unmanned aerial vehicle (UAV) detection, collaborative robots, and intelligent driving. A good understanding of scene information can help robots accurately locate themselves and complete complex technical actions, and accurate, effective depth information improves the efficacy of 3D reconstruction, target recognition, and semantic segmentation [1].
Currently, there are many ways to obtain depth information, which can be divided into active and passive methods. Active methods use ultrasonic, time-of-flight (TOF) laser, and lidar sensors, among others, which emit signals to obtain the depth of objects in a scene. The depth information can be obtained quickly, but there are several disadvantages, including large sensor volume, high power consumption, and susceptibility of the measured data to external noise and other environmental factors. Passive methods are more common, including binocular stereo matching [2], structure-from-motion [3], and multi-view stereo matching. They reduce cost but require camera parameter calibration and a large number of algorithm iterations, placing specific requirements on hardware platforms and taking time to perform. Moreover, their accuracy can be greatly affected by the environment.
These common depth acquisition methods have limitations for small robotic platforms in terms of power consumption, volume, and cost, so a scheme based on monocular depth estimation has become the system of choice. Camera imaging converts three-dimensional information into two-dimensional information, and depth information is lost in the imaging process. Recovering the depth of a scene from only a single image is therefore an ill-posed problem that is difficult to solve.
Deep learning has gradually been applied to the field of monocular depth estimation, with supervised and unsupervised learning being the commonly used approaches. The supervised learning method uses an encoder-decoder network to extract depth information from a pixel-level image and output the corresponding depth map against ground-truth labels, transforming depth estimation into a pixel-level regression problem. However, supervised learning relies on training with large labeled datasets, which can be expensive to collect and impose an enormous training burden. Moreover, it cannot generalize well to unfamiliar scenes, limiting its popularity. Conversely, unsupervised learning methods avoid the need for labeled data and make up for these limitations. However, to improve their depth estimation accuracy, many unsupervised methods increase the depth and feature extraction detail of the network architecture, increasing the number of network parameters and requiring the platform to have greater computing resources and storage capacity. Consequently, applying these algorithms and architectures to embedded devices with limited hardware resources has become an important research trend. One of the key challenges is maintaining the accuracy of monocular depth estimation algorithms while improving their computational speed.
Presently, a common technique is to use an embedded platform to acquire and preprocess images, transmit them to an edge server for training, and then transmit the trained models back to the embedded platform for depth estimation [4]. With this method, the embedded device only performs acquisition and communication and does not run the algorithm itself. To solve this problem, this study proposed an optimized unsupervised learning method that effectively reduces the number of network parameters and floating-point operations while keeping the output depth information unchanged, making better use of the platform's computing power, allowing the deep learning algorithm to run on the terminal device, and contributing to the distributed computing of the overall architecture.
The main idea underpinning the proposed method is to introduce depthwise separable convolution into the encoder-decoder architecture, improving the computational efficiency of the network by optimizing the parameters of the convolution layers while still realizing feature extraction and image reconstruction. In practice, after running the optimized network on an NVIDIA GeForce RTX 2080 platform, the training speed increased by more than 33.3%, and the image accuracy remained at its original level or was slightly improved. Specifically, the contributions of this study are as follows:
1. A fully convolutional unsupervised monocular depth estimation model, OptiDepthNet, was proposed to enforce left-right depth consistency. The encoder was based on the ResNet50 architecture and used depthwise separable convolutions in place of ordinary convolutions for optimization.
2. A performance comparison with several common models showed the effectiveness of the proposed method, the reduction in network parameters, and the improved operational efficiency.
In summary, our contribution is an optimized depth estimation method that allows monocular depth estimation to be adapted to embedded platforms and promotes edge computing in the overall system.

Related Work
In research on monocular depth estimation, deep learning is a commonly used method. Based on whether the true depth of a scene is required, network architectures can be divided into supervised and unsupervised networks.
In this section, we summarize the development of supervised and unsupervised estimation methods and introduce the development of real-time network architectures suitable for embedded devices.

Supervised Depth Estimation
For the depth estimation task of a single image, we usually focus on predicting absolute depth. In particular, platforms such as industrial robots and UAVs need to select a working strategy based on scene depth. Generally, a supervised regression model is used for prediction; that is, the training data are labeled, and the continuous depth values are regressed and fitted. Eigen et al. first introduced deep learning to the field of monocular depth estimation and proposed coarse-scale and fine-scale network architectures. The coarse-scale network was used to predict the global depth of the scene and obtain depth cues, such as target location, vanishing point, and spatial alignment, and the fine-scale network was used to locally refine the global predictions [5]. Based on this research, Eigen et al. (2015) proposed a unified multi-scale network architecture, using a deeper VGGNet and three fine scales to increase detail and resolution, adding a gradient regularization term to the scale-invariant loss, and calculating the difference between the predicted and real gradients for depth prediction, surface normal estimation, and semantic segmentation [6]. Liu et al. combined a deep convolutional network with continuous conditional random fields (CRFs), using the unary and pairwise potential terms of the CRFs in a deep structured strategy to estimate depth [7]. Li et al. (2015) proposed a multi-scale depth estimation method: the depth at the super-pixel scale was first regressed using a neural network, after which multi-layer conditional random field post-processing was used to refine the depth at both super-pixel and pixel scales [8]. Laina et al. (2016) added residual learning to the fully convolutional network architecture, increased the depth of the network structure to improve the depth estimation effect, and proposed a new up-sampling method and the BerHu loss function [9]. Cao et al. (2018) treated depth estimation as a pixel-level classification problem, projecting depth values into logarithmic space and then discretizing the continuous values into category labels based on the depth range [10].
Although supervised depth estimation can achieve better accuracy, each image is required to have a corresponding depth label, which can be very expensive to acquire. Moreover, the collected depth labels are usually sparse points that do not match the original image well.

Unsupervised Depth Estimation
Unsupervised methods do not need depth labels. Existing sets of left and right views can meet the research requirements, and the relative depth from an object to the camera can be obtained by combining epipolar constraints with an auto-encoding mechanism.
Garg et al. [11] used the original and target images to form a stereo image pair. First, an encoder was used to predict the depth map of the original image, after which a decoder reconstructed the original image from the target image and the predicted depth map, the reconstructed image then being compared with the original image to calculate the loss. Godard et al. [12] realized unsupervised depth prediction by using the consistency of left and right views: disparity maps were generated using epipolar geometric constraints, the mapping from the left (right) image to the right (left) image was learned, and monocular depth estimation was transformed into an image reconstruction problem, improving both performance and robustness. Godard et al. [12] subsequently added a left-right image consistency loss and an enhanced disparity smoothness loss, further improving the network and the accuracy of the estimated depth. However, the problems of unclear object contours and unsmooth depth changes in the obtained depth maps remained unsolved [13]. Tosi et al. [14] transformed monocular depth estimation into a stereo matching problem and used a stereo matching network for disparity estimation; the entire structure comprised a primary feature extraction network, a primary disparity estimation network, and a disparity refinement network. Casser et al. [15] proposed modeling the scene and individual objects, introducing geometry into the learning process and self-supervising the camera's ego-motion and object motion. Wang et al. [16] proposed calculating the loss function in a hierarchical embedding space for depth estimation model training: hierarchical embedding generators (HEGs) were designed to extract features from the depth map and construct subspaces at different levels, and the loss function was constructed from the distance between the reference and predicted depth embeddings. Mancini et al. [17] proposed a visual object detection system that trained a deep neural network on real and synthetic images to realize depth estimation and could detect obstacles at long distances and high speeds. Amir et al. [18] proposed a training method based on style transfer and adversarial training; that is, based on training with a large amount of synthetic environment data, per-pixel depth was predicted from a single real color image. However, this method could not cope with sudden illumination changes and saturation in the style transfer.
These unsupervised methods obtained higher depth map accuracy by increasing network complexity. However, the number of network parameters was large, and the computations required considerable resources.

Lightweight Monocular Depth Estimation Network
Although increasing network complexity improved the accuracy of the depth maps obtained with unsupervised methods, the resulting algorithms could not be applied to small robotic platforms with limited resources. The need to optimize existing network structures and reduce training parameters while preserving image accuracy has therefore become increasingly important.
Fast target detection and classification methods in deep learning are conducive to semantic image segmentation. Common detection models include SSD [19] and YOLOv3 [20], and classification networks include AlexNet [21], VGGNet [22], and ResNet [23]. SSD combines the advantages of YOLO and Fast R-CNN [24], offering both high speed and accuracy.
For embedded systems, Wofk et al. (2019) used MobileNet-v2 [25] in the encoder and a depthwise separable network in the decoder, with the original intention of lightening the encoder-decoder structure; network pruning [22] and other techniques were adopted to reduce the training parameters and memory usage [26]. Liu et al. (2020) proposed the MiniNet structure with recursive functions [27], which was not only extremely lightweight but also retained the capability of a deep network, maintaining real-time, high-performance unsupervised monocular depth prediction on video sequences at 110 fps on a single GPU, 37 fps on a single CPU, and 2 fps on a Raspberry Pi 3.
In this study, we proposed a lightweight network, OptiDepthNet, based on an existing fully convolutional encoder-decoder network, and introduced depthwise separable convolution as an optimization technique, which greatly improved the training speed while maintaining accuracy.

Methods
This section introduces our unsupervised learning network architecture for depth estimation. Inspired by the U-Net network [28] and the DeeperLab structure [29], a skip-connection structure was introduced between the encoder and decoder, and depthwise separable convolution was introduced to improve the computational speed of the network, realizing a balance between image accuracy and computational speed. We analyze the optimized network from two perspectives: the principle of obtaining depth from image reconstruction, and the encoder-decoder network that reconstructs the depth image.

Obtaining Depth Estimation from Image Reconstruction
At test time, inputting an image I_in should output the corresponding depth map d_out; that is, we must learn a function f such that d_out = f(I_in). To obtain this function, we constructed an unsupervised learning scheme based on the principle of binocular ranging and realized depth image reconstruction by combining training losses with left-right consistency checks.
Assuming the images are rectified, let b be the baseline distance between the two cameras, f the camera focal length, and Δd the disparity between the left and right input images I_L and I_R. Based on the relation d_out = b · f / Δd, the depth d_out of each pixel can be obtained [30]. Based on a fully convolutional network architecture [31], the calibrated image pairs are input into the training network, combined with the left-right consistency loss, disparity smoothness loss, and appearance matching loss, to train the network and obtain a suitable model.
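As a concrete illustration of this relation, the minimal sketch below converts a per-pixel disparity into depth. The baseline and focal length values are hypothetical placeholders chosen for illustration, not calibration constants from the paper.

```python
def disparity_to_depth(disparity_px: float, baseline_m: float = 0.54,
                       focal_px: float = 721.0) -> float:
    """Depth from disparity via d_out = b * f / delta_d.

    baseline_m and focal_px are illustrative values, not the paper's
    calibration constants; disparity is in pixels, depth in metres.
    """
    return baseline_m * focal_px / disparity_px

# A 30-pixel disparity with a 0.54 m baseline and 721 px focal length
# gives a depth of roughly 12.98 m.
print(disparity_to_depth(30.0))
```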

Network Architecture
We proposed OptiDepthNet, a fully convolutional network that refers to the left-right consistency network proposed by Godard et al. [12]. However, several modifications were made to the architecture to improve training speed. The network structure is shown in Fig. 1. The proposed network comprises an encoder and a decoder, which together realize depth map reconstruction and semantic segmentation of the input image. Features extracted from different layers of the encoder are fused in the decoder to improve the detail and feature accuracy of the reconstructed depth map. The disparity map is generated from the left and right images, and image reconstruction is realized through the deep network. The output depth image does not represent the absolute distance from objects in the image to the camera, but rather the relative relationship between objects: the brighter a region in the figure, the closer it is to the camera.
The encoder is responsible for extracting the depth features of the input image, while the decoder gradually restores the details and corresponding spatial dimensions of the target through up-sampling and deconvolution, using skip-connections to compensate for the loss of some features and achieve depth image reconstruction. Because the aim is real-time depth estimation and extracting rich image features is very important for accurate depth prediction, we chose the classical residual network ResNet50 as the main framework of the encoder and introduced depthwise separable convolution on this basis.

Encoder Network
In recent years, the deepening of convolutional neural networks (CNNs) to solve more complex practical problems has been accompanied by gradient vanishing and gradient explosion, making training increasingly difficult. Consequently, our DResNet encoder was optimized based on ResNet50 and consisted of a standard convolution layer and four groups of residual blocks. The first layer of the encoder was a 7 × 7 convolution with a stride of 2, activated by an exponential linear unit (ELU) function, with the number of output channels set to 64. In each residual block, the middle convolution was changed to a depthwise separable convolution, the remaining convolutions being 1 × 1. Figure 2a shows a normal residual block. For the input feature I_x, three convolution operations are performed (64 convolutions of 1 × 1, 64 of 3 × 3, and 256 of 1 × 1) to extract features and obtain the output feature I_x1; the input I_x is added to this output through the shortcut connection, and the sum passes through the ELU to give the output I_y. Figure 2b shows the module after the 3 × 3 convolution layer is optimized into a depthwise separable convolution: the input is divided into groups by channel (three in the figure) for the depthwise 3 × 3 convolution, followed by a pointwise 1 × 1 convolution, finally outputting the feature map I_x2.
(Fig. 1: The RGB image is input to the coding layer, which extracts features; the decoder then reconstructs the image to generate the depth map output.)
Thus, the ordinary 3 × 3 convolution is replaced by a depthwise 3 × 3 convolution followed by a pointwise 1 × 1 convolution. The depthwise separable convolution separates spatial (region) filtering from channel mixing, yielding a considerable improvement in computational performance and a reduction in training parameters while producing a similar output.
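To make the saving concrete: for 64 input and 64 output channels, an ordinary 3 × 3 convolution uses 3 · 3 · 64 · 64 = 36,864 weights, whereas a depthwise 3 × 3 (576 weights) plus pointwise 1 × 1 (4,096 weights) pair uses 4,672, roughly 7.9 times fewer. The sketch below shows one such optimized bottleneck block in TensorFlow/Keras, following the structure of Fig. 2b; the exact strides, normalization, and channel counts of the authors' DResNet are assumptions here, not confirmed details.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dres_block(x, mid_filters=64, out_filters=256):
    """Sketch of a DResNet-style bottleneck: the middle ordinary 3x3
    convolution is replaced by a depthwise 3x3 + pointwise 1x1 pair."""
    # 1x1 convolution reduces the channel dimension (as in Fig. 2a)
    y = layers.Conv2D(mid_filters, 1, padding='same', activation='elu')(x)
    # Depthwise separable replacement for the ordinary 3x3 convolution
    y = layers.DepthwiseConv2D(3, padding='same')(y)
    y = layers.Conv2D(mid_filters, 1, padding='same', activation='elu')(y)
    # 1x1 convolution restores the output channel dimension
    y = layers.Conv2D(out_filters, 1, padding='same')(y)
    # Shortcut connection, projected when channel counts differ
    shortcut = x
    if x.shape[-1] != out_filters:
        shortcut = layers.Conv2D(out_filters, 1, padding='same')(x)
    return layers.ELU()(layers.Add()([y, shortcut]))
```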

Decoder Network
The function of the decoder is to reconstruct the feature maps extracted by the encoder, generate appropriate predictions, and obtain a depth map corresponding to the input. Each layer of the encoder gradually reduces the spatial resolution and extracts higher-level features, so many image details may be lost, which can make it difficult for the decoder to recover pixel-level data. To meet the requirements of high precision and real-time performance, a depthwise separable convolution operation is performed on the output of each stage to simplify the network parameters, as shown in Fig. 3. The output of the encoder (conv5) is taken as the input of the first layer of the decoder. After nearest-neighbor interpolation and up-sampling (with a scale of 2), upconv6 is fused with the conv4 layer of the encoder as an output. Then, using depthwise and pointwise separable convolutions, the computational parameters of the output layer are greatly reduced to achieve a lightweight network, finally yielding the input of the next decoder level. Our DDenseNet decoder comprised five fusion modules, each halving the number of output channels relative to the number of input channels. Using interpolation and deconvolution, feature maps of 1/2, 1/4, 1/8, 1/16, and 1/32 the size of the original image were obtained. These feature maps were then concatenated with the same-sized feature maps from the encoder to generate six disparity maps of different sizes.
(Fig. 2 a: Implementation of the residual block convolution. b: Implementation after changing the middle convolution to a depthwise separable convolution.)
(Fig. 3: The encoder output conv5 is up-sampled and passed through a 3 × 3 convolution to obtain upconv6, which is connected with conv4 as an output. Depthwise and pointwise convolutions are then applied to obtain the output iconv6, the input of the next decoder stage.)
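A minimal sketch of one such fusion module is shown below, assuming the ordering upsample, 3 × 3 convolution, skip fusion, then depthwise + pointwise pair described above; the precise layer configuration of the authors' DDenseNet is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def decoder_stage(x, skip, out_channels):
    """Sketch of one decoder fusion module (cf. Fig. 3): nearest-neighbour
    up-sampling by 2, a 3x3 convolution, fusion with the matching encoder
    feature map, then a depthwise + pointwise convolution pair."""
    up = layers.UpSampling2D(size=2, interpolation='nearest')(x)
    up = layers.Conv2D(out_channels, 3, padding='same', activation='elu')(up)
    # Skip connection: concatenate with the same-sized encoder feature map
    fused = layers.Concatenate()([up, skip])
    # Depthwise + pointwise pair keeps the output layer lightweight
    fused = layers.DepthwiseConv2D(3, padding='same')(fused)
    return layers.Conv2D(out_channels, 1, padding='same', activation='elu')(fused)

# Usage (hypothetical tensors): conv5 from the encoder, conv4 as the skip.
# iconv6 = decoder_stage(conv5, conv4, out_channels=256)
```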

Loss Function
The proposed algorithm follows the loss function of Ref. [12], which comprises three parts: the similarity between the reconstructed image and the original image, the smoothness of the disparity map, and the consistency of the predicted left and right disparities. The appearance matching loss requires that, during training, each input image be reconstructed from the opposite view by disparity-guided bilinear sampling, and combines L1 regularization with SSIM [32]. The disparity smoothness loss encourages the generated disparity map to be as continuous as possible [33], using an L1 penalty on the disparity gradients. The left-right consistency loss penalizes the difference between the left and right disparity maps: when only the left view is input, both left and right disparity maps are predicted, and an L1 consistency penalty between them is used as part of the model.
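The sketch below illustrates the three terms in TensorFlow, following Godard et al. [12]; the SSIM/L1 weighting alpha = 0.85 is that paper's value, and the disparity warping step is omitted, so treat this as an outline under stated assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf

def appearance_matching_loss(img, recon, alpha=0.85):
    """SSIM + L1 photometric loss; alpha = 0.85 follows Godard et al. [12]."""
    ssim = (1.0 - tf.image.ssim(img, recon, max_val=1.0)) / 2.0
    l1 = tf.reduce_mean(tf.abs(img - recon), axis=[1, 2, 3])
    return tf.reduce_mean(alpha * ssim + (1.0 - alpha) * l1)

def disparity_smoothness_loss(disp, img):
    """Edge-aware L1 smoothness: disparity gradients are down-weighted
    where the image itself has strong gradients (object edges)."""
    disp_dy, disp_dx = tf.image.image_gradients(disp)
    img_dy, img_dx = tf.image.image_gradients(img)
    wx = tf.exp(-tf.reduce_mean(tf.abs(img_dx), axis=3, keepdims=True))
    wy = tf.exp(-tf.reduce_mean(tf.abs(img_dy), axis=3, keepdims=True))
    return tf.reduce_mean(tf.abs(disp_dx) * wx + tf.abs(disp_dy) * wy)

def lr_consistency_loss(disp_left, disp_right_projected):
    """L1 penalty between the left disparity map and the right disparity
    map projected into the left view (projection/warping omitted here)."""
    return tf.reduce_mean(tf.abs(disp_left - disp_right_projected))
```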

Experimental Results and Discussion
We used experiments on the KITTI dataset to demonstrate the effectiveness of the proposed method and conduct various model measurements. Comparing various encoders in terms of execution efficiency and image quality, we also showed that deepening the network structure can improve image quality to a certain extent.

The KITTI Dataset
The KITTI dataset was created to evaluate the performance of computer vision technologies, such as stereo vision, optical flow, and visual odometry, in a vehicular environment, and includes real image data collected from urban, rural, and expressway scenes. In this study, 3,756 frames were selected from 30 scenes for training and 500 frames for verification. Each input RGB frame was resized to 256 × 512 pixels, and the output depth maps have the same 256 × 512 pixel resolution.

Implementation Rules
The depth estimation network of the proposed OptiDepthNet was implemented in TensorFlow. The network was trained on the KITTI dataset, and its accuracy was evaluated using the official training and test data split. For training, we used one GPU with 23,500 training steps and images of 256 × 512 pixel resolution. The batch size was 8, the initial learning rate (learning_rate) was set to 0.0001, and num_threads was 8. OptiDepthNet was trained on an i7-9700 CPU-based platform (3 GHz, 32 GB RAM) with an NVIDIA GeForce RTX 2080 graphics card, the training time amounting to 36 h. During training, the color and saturation of the input frame were scaled by factors sampled uniformly from [0.8, 1.2] and the brightness from [0.5, 2.0], with each augmentation applied with a probability of 50%.
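A minimal sketch of this augmentation step is given below, assuming images normalized to [0, 1]; the exact order and coupling of the color, saturation, and brightness operations are assumptions, since the paper specifies only the ranges and the 50% probability.

```python
import tensorflow as tf

def augment(image):
    """Apply the augmentations described above with 50% probability:
    per-channel color factors in [0.8, 1.2], saturation in [0.8, 1.2],
    and brightness in [0.5, 2.0]. Expects an HWC image in [0, 1]."""
    if tf.random.uniform([]) < 0.5:
        image = image * tf.random.uniform([3], 0.8, 1.2)      # color
        image = tf.image.random_saturation(image, 0.8, 1.2)   # saturation
        image = image * tf.random.uniform([], 0.5, 2.0)       # brightness
        image = tf.clip_by_value(image, 0.0, 1.0)
    return image
```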
In building the network model, we used DResNet50, a variant of the ResNet50 model, as our encoder, the architecture and training process of the other models remaining unchanged.
Based on previous work, we used several image evaluation indexes to evaluate the depth images obtained by OptiDepthNet for unsupervised monocular depth estimation [34]. The quantitative indexes used by most monocular depth estimation algorithms are the relative error (REL), root mean square error (RMS), log error (LG), and accuracy (% correct). In general, lower errors and higher accuracy are better.
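For reference, these metrics are commonly computed over valid ground-truth pixels as sketched below; the delta < 1.25 threshold used for the accuracy term is the standard convention in this literature and is assumed here rather than stated by the paper.

```python
import numpy as np

def depth_metrics(pred, gt):
    """REL, RMS, log10 error (LG), and threshold accuracy (delta < 1.25),
    computed over pixels with valid (positive) ground-truth depth."""
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    rel = np.mean(np.abs(pred - gt) / gt)                    # relative error
    rms = np.sqrt(np.mean((pred - gt) ** 2))                 # root mean square
    lg = np.mean(np.abs(np.log10(pred) - np.log10(gt)))      # log error
    acc = np.mean(np.maximum(pred / gt, gt / pred) < 1.25)   # % correct
    return rel, rms, lg, acc
```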

Experimental Results
Firstly, we compare the proposed method with related work qualitatively and quantitatively. Secondly, we analyze the computational efficiency of the proposed OptiDepthNet. Thirdly, we examine the depth estimation results on the KITTI dataset. Fourthly, we study deeper network variants to prove the effectiveness of the proposed optimization method.

Comparison with Other Work
We evaluated OptiDepthNet using a KITTI split, the results being listed in Table 1. Compared with using ResNet50 as the encoder and DenseNet as the decoder [35], the proposed method optimizes the convolution operations of the encoder and decoder to reduce the computational overhead, and the effect is obvious. As can be seen from Table 1, the proposed OptiDepthNet reduces the network parameters by a factor of 2.67 compared with the Kuznietsov method. Godard's method [12], which is closely related to ours, was selected for detailed comparison. As shown in Fig. 4, our proposed network has 1.9 times fewer parameters than Godard's network. If the encoder uses the VGG network, our method uses 8.28 times fewer parameters than Godard's method, and 1.54 times fewer when using ResNet152. Evidently, the lighter the encoder network, the greater the parameter reduction. The comparative evaluation indexes show that the image quality obtained using ResNet50 as the encoder is better.
The model was trained on a single GPU and compared with Godard's work [12]. The running time of a single epoch was tested using VGG, ResNet50, and ResNet152 as encoders, with DVGG, DResNet50, and DResNet152 being the corresponding optimized networks. Figure 5 summarizes the final results obtained using the proposed method alongside the comparative results of previous work.
From the above comparative data, it can be seen that using DResNet50 as the encoder and DDenseNet as the decoder for image reconstruction yields a great improvement in network parameters and computational speed, which facilitates deploying the network on embedded platforms. Figure 6 compares the quality of four images from the KITTI dataset, with the encoder using the ResNet50 and DResNet50 networks, respectively. It can be seen that the depth output of the optimized network remains prominent. Table 2 compares the image quality results of the four images before and after optimization.

The KITTI Dataset Test
From the test samples, it can be seen that the optimized network model achieves a certain improvement in image quality.
(Fig. 4: Comparison of the parameters before and after network optimization, with the encoder adopting the VGG, ResNet50, and ResNet152 networks, respectively. After optimization, the networks that already have fewer parameters are reduced further.)

Extension to other Network Structures
Figure 7 shows the extension of the proposed optimization method to another network model. Comparing a VGG encoder with DVGG, its counterpart optimized with depthwise separable convolution, the network parameters and single-run time are greatly reduced. Four images from the KITTI dataset were used to test the model, with the image quality shown in Table 3. The image quality improved to a certain extent: after VGG network optimization, the image performance indexes are improved. It can be concluded that the proposed optimization method is applicable to a variety of network architectures.

Limitations
The proposed method greatly reduces the network parameters and improves the computational speed, but several problems remain. After reconstruction, the depth images of small objects can lose some boundary details, and the reconstruction algorithm needs further improvement. In addition, in terms of memory usage, network pruning could be conducted to reduce memory consumption and further optimize usage. At present, the proposed method processes only single images; it should be extended to video sequences, with further network structure optimization to better adapt to embedded system platforms and strengthen edge computing.

Conclusions
The depth estimation network architecture proposed in this study was experimentally evaluated on the KITTI dataset, focusing on the optimization of network parameters and the improvement of training speed. At the same image accuracy, the network parameters could be reduced by up to a factor of eight and the training speed more than doubled, enhancing the real-time performance of the depth estimation network and enabling its use in embedded devices, including robots, UAVs, and other small platforms.
Although the current research focuses on depth estimation, we believe that such methods could also be applied to areas such as image segmentation based on deep learning methods to improve the performance of intensive prediction tasks.

Feng Wei obtained a Master's degree in communication and information systems from Hohai University, China, in 2007 and began studying for a doctorate in information and communication engineering at Hohai University in 2017. Wei has extensive engineering experience in the embedded systems field, including intelligent transportation, image processing, and deep learning.
Xinghui Yin was born in Hunan, China, in 1962. He received a BSc degree in electromagnetic engineering from Xidian University, China, in 1983. Since 1983, he has been with the Purple Mountain Observatory, National Astronomical Observatories of China, Chinese Academy of Sciences, Jiangsu, China, where he has worked on several radio telescopes, remote sensing radiometers, and satellite earth station development projects. His research activities include radio heliography, low-noise receivers, and remote sensing measurement.