Image defogging based on multi-input and multi-scale UNet

The coarse-to-fine strategy has been widely used in the design of single-image defogging networks. In the traditional approach, sub-networks taking multi-scale input images are stacked so that image sharpness improves gradually from the bottom sub-network to the top one, which inevitably leads to the loss of image details. Toward a fast and accurate dehazing network design, we revisit the coarse-to-fine strategy and present a multi-input and multi-scale U-Net (MIMS-UNet). The MIMS-UNet has two distinct features. On the one hand, the single encoder of MIMS-UNet takes multi-scale images as multiple inputs, which slightly increases computation but greatly improves network performance. On the other hand, an encoder-decoder structure with context blocks is used to capture contextual information and recover more details. Experimental results show that the proposed method performs well in both quantitative and qualitative evaluations. Compared with existing methods, the proposed network achieves strong defogging results and effectively avoids color distortion after defogging.


Introduction
Due to the absorption and scattering of light by airborne particles, foggy images acquired by imaging equipment suffer from reduced contrast, color distortion and loss of detail, which degrades subsequent tasks such as object recognition and scene understanding. Fog also hinders image feature extraction and recognition and reduces the effectiveness of outdoor vision systems, so image defogging is of great practical significance. According to the physical scattering models [1][2][3], the hazing process is usually expressed as

I(x) = J(x) t(x) + A (1 − t(x)),   with t(x) = e^(−β(λ) d(x)),

where I(x) and J(x) are the observed hazy image and the haze-free scene radiance, A is the global atmospheric light representing the intensity of ambient light, and t(x) is the scene transmission describing the portion of light that is not scattered and reaches the camera sensor. d(x) and β(λ) denote the scene depth and the atmospheric scattering coefficient, respectively.

However, it is often difficult to estimate the transmission map from foggy images. Early prior-based methods attempted to estimate transmission maps using the statistical characteristics of clear images, such as the non-local prior [4] and the color attenuation prior [5]. These priors carry large errors, however, resulting in serious color distortion and contrast reduction in the restored images. At present, with the growth of computing power, dehazing methods based on convolutional neural networks have become the mainstream of research. These methods are effective and clearly outperform prior-based algorithms. Still, most current methods dehaze the observed image directly while ignoring the destruction of texture details during defogging, which leads to noise amplification and color distortion in the dehazed image. In addition, they usually minimize the mean squared error between the restored image and the fog-free image, which easily loses high-frequency image details; under foggy conditions this can cause over-smoothing and produce artifacts in regions with rich texture boundaries. Finally, the receptive field of a traditional CNN is relatively small, and expanding it by deepening the network leads to high resource consumption.
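To make the scattering model above concrete, the following sketch synthesizes a hazy image from a clear image and a depth map. The values of β and A here are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch of haze synthesis under the atmospheric scattering model
# I(x) = J(x) t(x) + A (1 - t(x)), t(x) = exp(-beta * d(x)).
# The values of beta and A below are illustrative, not taken from the paper.
import numpy as np

def synthesize_haze(J: np.ndarray, depth: np.ndarray,
                    beta: float = 1.0, A: float = 0.8) -> np.ndarray:
    """J: clear image in [0, 1], shape (H, W, 3); depth: scene depth, shape (H, W)."""
    t = np.exp(-beta * depth)                           # transmission map t(x)
    hazy = J * t[..., None] + A * (1.0 - t[..., None])  # scattering model
    return np.clip(hazy, 0.0, 1.0)
```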
In this paper, we propose MIMS-UNet to address the above problems. A single network processes multi-scale images and fuses the multi-input information to compensate for the lost high-frequency image details. In view of the relatively small receptive field of a traditional CNN, a context module is introduced, which expands the receptive field and captures multi-scale information without enlarging the network structure. Compared with state-of-the-art methods, our method achieves good performance while maintaining relatively small computational overhead.
The contributions of this work are summarized as follows: We propose a new MIMS-UNet network for effective dehazing, which extracts relevant features from foggy image content and recovers details and textures from foggy images.
We propose an encoder-decoder structure with context blocks to capture multi-scale information and defog from coarse to fine.
Extensive experiments show that the model performs better than other advanced algorithms.

Related work
In general, there are three classes of image defogging methods: image enhancement methods based on signal-processing principles, methods based on physical models [2] and deep learning methods. Image enhancement methods [6,7] do not consider the cause of image degradation and improve visual quality by enhancing image contrast, e.g., histogram equalization [8] and Retinex [9]. Because such methods ignore the root cause of degradation, however, the defogged images are prone to information loss, and their application scenarios are often limited.
Physical model-based methods build on the atmospheric scattering model: the transmission map and the atmospheric light are estimated from different priors to invert the model and realize image defogging. Among these methods, the most widely studied and applied [10] is defogging based on the dark channel prior, with representative works including [4,[10][11][12]]. Specifically, Tan [11] proposed a maximum-contrast defogging method based on the observation that a clear image often has higher contrast than the corresponding foggy image, but the resulting defogged images suffer from color distortion. He et al. [10] used the dark channel prior (DCP) to estimate transmission in local patches, based on the assumption that in at least one color channel the pixel values of haze-free patches are close to zero; however, the obtained transmission exhibits severe block artifacts. Subsequent work improved the efficiency and performance of the DCP method [13][14][15][16][17]. Zhu et al. [18] proposed to recover the depth information of the image from the color attenuation prior and then estimate the transmission map. Berman et al. [4] assumed that a few hundred distinct colors can well approximate the colors of clear images and performed defogging based on this prior. Although these methods have been shown to be effective, their performance is inherently limited because the assumed priors do not hold for all real images: the simplified physical model cannot accurately estimate the transmission map and atmospheric light, sky regions are handled poorly, and incomplete defogging and loss of background information readily occur.
In recent years, deep learning-based methods have become the mainstream of research. These methods use convolutional neural networks to restore clear images without modeling the specific causes of image degradation. Cai et al. [19] proposed the end-to-end defogging model DehazeNet to estimate the transmission map, constructing the network according to the priors used in traditional defogging methods. Ren et al. [20] proposed an end-to-end gated fusion network (GFN) based on the idea of image fusion: the network first estimates weight maps for three derived input images and then fuses the inputs under the guidance of these weight maps to obtain a fog-free image. Ren et al. [21] proposed a coarse-to-fine strategy to estimate the initial transmission. Zhang et al. [22] proposed a densely connected pyramid dehazing network (DCPDN) to estimate the medium transmission map. Qu et al. [23] proposed an enhanced pix2pix dehazing network that directly learns the mapping between foggy and fog-free images. Li et al. [24] integrated the transmission map and the atmospheric light into a single parameter of the physical scattering model and designed AOD-Net to estimate this parameter with a CNN. The above direct end-to-end methods can improve restoration quality to a certain extent, but over-fitting often occurs during learning and the original style characteristics of the image are easily ignored, resulting in incomplete or excessive defogging in some areas and color distortion in the restored image.

Proposed method
This section describes the details of our MIMS-UNet framework. We first outline the overall approach: the architecture of MIMS-UNet is based on MSBDN [25], which we substantially improve to achieve efficient defogging. We then introduce the MIMS module and the image defogging module in detail. The overall architecture of the proposed MIMS-UNet is shown in Fig. 1.

Multi-input multi-scale dehazing network
This paper designs a CNN-based multi-input and multi-scale defogging network that is independent of the atmospheric scattering model and realizes end-to-end defogging directly from foggy images to fog-free images. The backbone of MIMS-UNet is a U-Net. It has been shown that multi-scale images better handle fog of different densities within an image, and a variety of CNN-based defogging methods adopt this idea by taking foggy images of different scales as the input of each sub-network.
In our MIMS-UNet, each dense feature fusion (DFF) module takes a foggy image of a different scale as input. That is, in addition to the downscaled features passed from the DFF above, we also extract features from the downscaled foggy image and combine the two. By exploiting the complementary information between small-scale features and features obtained from small-scale images, our algorithm can remove fog effectively.
To allow the smaller foggy images (B2, B3, B4, B5) to be fed smoothly into the network, they are encoded by the SCM module and then passed to the DFF module, where the multi-scale information is fused as the input of each layer. This preserves the original colors of the image during defogging and overcomes incomplete defogging and color distortion after defogging.
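For illustration, a multi-scale input pyramid such as B1–B5 can be built by repeated 2× downsampling of the hazy image. The paper does not state the resampling method; bilinear interpolation in the sketch below is an assumption.

```python
# A hedged sketch of building the multi-scale inputs B1..B5 by repeated 2x
# downsampling of the hazy image. The resampling method (bilinear) is an
# assumption, not taken from the paper.
import torch
import torch.nn.functional as F

def build_pyramid(hazy: torch.Tensor, levels: int = 5) -> list:
    """hazy: (N, 3, H, W) tensor; returns [B1, B2, ..., B_levels]."""
    pyramid = [hazy]
    for _ in range(levels - 1):
        pyramid.append(F.interpolate(pyramid[-1], scale_factor=0.5,
                                     mode="bilinear", align_corners=False))
    return pyramid
```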
We first use a shallow convolutional module (SCM) [26] to extract features from the downsampled image, as shown in Fig. 2. We use stacks of 3 × 3 and 1 × 1 convolutions, concatenate the features of the last 1 × 1 layer with the input B_k, and then apply an additional 1 × 1 convolution layer to further refine the joined features. The output of the SCM at the k-th level is denoted SCM_out^k; as shown in Fig. 1, we use the SCM at the second, third and fourth levels.
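The following is a minimal PyTorch sketch of an SCM consistent with this description; the channel widths and the depth of the 3 × 3/1 × 1 stack are our assumptions, as the paper does not give them.

```python
# A minimal sketch of the shallow convolutional module (SCM) as described:
# stacked 3x3 and 1x1 convolutions, concatenation with the scaled input B_k,
# and a final 1x1 convolution to refine the joined features. Channel widths
# and the number of stacked convs are assumptions, not taken from the paper.
import torch
import torch.nn as nn

class SCM(nn.Module):
    def __init__(self, out_channels: int = 32):
        super().__init__()
        mid = out_channels - 3  # leave room to concatenate the 3-channel input
        self.stack = nn.Sequential(
            nn.Conv2d(3, mid // 2, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid // 2, mid, kernel_size=1), nn.ReLU(inplace=True),
        )
        self.refine = nn.Conv2d(mid + 3, out_channels, kernel_size=1)

    def forward(self, b_k: torch.Tensor) -> torch.Tensor:
        feats = self.stack(b_k)                  # 3x3 / 1x1 stack
        joined = torch.cat([feats, b_k], dim=1)  # concat with input B_k
        return self.refine(joined)               # extra 1x1 conv
```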

Context block
Multi-scale information is important for image dehazing; therefore, downsampling operations are usually adopted in networks. However, when the image resolution becomes too low, image structures are destroyed and information is lost, which is not conducive to feature reconstruction. To enlarge the receptive field and embed multi-scale features without further reducing the spatial resolution, we introduce a context block at the smallest scale between the encoder and decoder. It contains parallel convolutions with different dilation rates instead of further downsampling, so the receptive field is expanded without increasing the number of parameters or destroying image structures. The context block [27] has achieved good results in image segmentation [28] and deblurring [29]. The four dilation rates are 1, 2, 3 and 4; features are extracted from different receptive fields and the output is estimated by fusing them (the architecture of the context block is shown in Fig. 3), which is useful for estimation over larger receptive fields. In our network, we first use a 1 × 1 convolution to compress the feature channels, which reduces the running time and simplifies computation. In the fusion step, a 1 × 1 convolution restores the output channels and fuses the result with the original input features. Meanwhile, to prevent information blocking, a skip connection is adopted between the input and the output. The context module thus incorporates rich hierarchical context information, which benefits image defogging.
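The sketch below illustrates a context block of this kind in PyTorch: 1 × 1 channel compression, four parallel dilated 3 × 3 convolutions with rates 1–4, 1 × 1 fusion and a skip connection from input to output. The channel sizes are illustrative assumptions.

```python
# A sketch of the context block: 1x1 channel compression, four parallel 3x3
# convolutions with dilation rates 1-4, fusion via a 1x1 convolution, and a
# skip connection from input to output. Channel sizes are illustrative.
import torch
import torch.nn as nn

class ContextBlock(nn.Module):
    def __init__(self, channels: int = 256, compressed: int = 64):
        super().__init__()
        self.compress = nn.Conv2d(channels, compressed, kernel_size=1)
        self.branches = nn.ModuleList([
            nn.Conv2d(compressed, compressed, kernel_size=3,
                      padding=rate, dilation=rate)
            for rate in (1, 2, 3, 4)
        ])
        self.fuse = nn.Conv2d(4 * compressed, channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c = self.act(self.compress(x))  # channel compression
        multi = torch.cat([self.act(b(c)) for b in self.branches], dim=1)
        return x + self.fuse(multi)     # fusion plus skip connection
```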

Implementations
Our method is implemented in the PyTorch framework, and an NVIDIA 3090 GPU is used to train the defogging network under Ubuntu. The initial learning rate is set to 10⁻⁴, the batch size to 8, the Adam optimizer is used with momentum decay parameters β₁ = 0.9 and β₂ = 0.999, and the number of iterations is 100. The MSE between the network output and the ground truth is used as the loss function.
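The following sketch reproduces this training setup in PyTorch; the model and data loader are placeholders for the authors' actual implementation, which is not reproduced here.

```python
# A minimal sketch of the training setup described above: Adam with
# beta1 = 0.9, beta2 = 0.999, initial learning rate 1e-4, MSE loss.
# `model` and `train_loader` stand in for the authors' network and data
# pipeline, which are not reproduced here.
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, epochs: int = 100,
          device: str = "cuda") -> None:
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                                 betas=(0.9, 0.999))
    criterion = nn.MSELoss()  # MSE between output and ground truth
    model.to(device).train()
    for _ in range(epochs):   # "100 iterations" in the paper; epochs assumed
        for hazy, clear in train_loader:  # batches of 8 per the paper
            optimizer.zero_grad()
            loss = criterion(model(hazy.to(device)), clear.to(device))
            loss.backward()
            optimizer.step()
```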

Experimental results
In this section, we introduce and analyze the experimental setup and results. The experiments are based on training and testing on the RESIDE dataset (OTS and ITS), and the results are compared with other algorithms.

RESIDE dataset
The RESIDE dataset [30] contains both synthesized and real-world hazy/clean image pairs of indoor and outdoor scenes. The indoor training set (ITS) contains 1399 sharp images and 13,990 hazy images generated from them. For a fair comparison, we use the training set provided by MSBDN [25]. To compare with the latest methods, PSNR and SSIM are used, and a comprehensive comparison is carried out on the Synthetic Objective Testing Set (SOTS), which contains 500 indoor and 500 outdoor images. Comparative experiments are conducted against currently superior defogging methods, including the classical DCP [10] and recent deep learning-based methods. Our network shows good defogging ability and performs well in image detail and color fidelity; the results are shown in Table 1.
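For reference, PSNR and SSIM can be computed with scikit-image as in the sketch below; the data range and color handling are assumptions, since the paper does not describe its exact evaluation code.

```python
# A hedged sketch of computing the reported metrics with scikit-image.
# Data range and color handling are assumptions, not taken from the paper.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(dehazed: np.ndarray, gt: np.ndarray) -> tuple:
    """Both images as float arrays in [0, 1] with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, dehazed, data_range=1.0)
    ssim = structural_similarity(gt, dehazed, channel_axis=-1, data_range=1.0)
    return psnr, ssim
```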

NTIRE2018-dehazing challenge dataset
The NTIRE2018-Dehazing Challenge dataset includes an outdoor dataset (O-HAZE [31]) and an indoor dataset (I-HAZE [31]). The indoor dataset includes 35 pairs of foggy and fog-free images of indoor home environments, and the outdoor dataset includes 45 pairs of foggy and fog-free images of various outdoor scenes. We crop the images of both datasets to a size of 544 × 400 and use them as a real-fog test set to test the stability of our network.

Performance evaluation
We evaluate our method against manual prior-based methods (DCP [10] and NLD [4]) and deep convolutional neural networks (AOD [24], MSCNN [21], MsPPN [37], DcGAN [38], GFN [20], GCANet [39], PFFNet [33], GDN (GridDehazeNet) [34], DuRN [35], MSBDN [25], FDU [36], DA_dehazing [32], EPDN [23]). A quantitative comparison of the defogging results is given in Table 2. Compared with classical defogging algorithms such as DCP [10] and NLD [4], our results are far superior in quantitative terms (PSNR and SSIM). Meanwhile, compared with the most advanced fog removal algorithms of recent years [36], our results remain competitive. From Fig. 4, we can observe that the DCP [10] algorithm suffers serious color distortion, and its result does not look realistic. AOD [24] removes only a small amount of fog, and the overall tone of the restored image is whitish. GFN [20] also suffers from color distortion in some cases, and its defogged results appear darker than ours. Although directly estimating sharp images with end-to-end trainable networks [22,23] gives better results than indirect methods, incomplete defogging remains a problem, and the MSBDN-based algorithm [25] is not clear enough in some details. Compared with these methods, the images restored by our algorithm have sharper structures and details and are closer to the ground truth; our algorithm produces better visual results.
After verifying effectiveness on the SOTS test set, we also test our network on the O-HAZE [31] and I-HAZE [31] datasets to verify its stability and generalization. The results are shown in Figs. 5 and 6. The test results are not as good as those on the SOTS test set, but the images are largely freed of fog. These comparisons reflect the superiority of the proposed multi-input and multi-scale design, which gives the algorithm good performance in both overall and local defogging.

Ablation study and analysis
In this section, to verify the effectiveness of the proposed modules, we conduct ablation experiments (as shown in Fig. 7). For a fair comparison, all methods mentioned below are trained with the same settings as the proposed algorithm.

Ablation on the MIMS
When the MIMS module is introduced into the base network, performance improves considerably. Evaluated on the SOTS test set, the PSNR (as shown in Table 2) increases by 0.64 dB, which demonstrates the validity of the proposed module.

Ablation on the context block
The context block complements the downsampling operations by capturing information over a larger field. We observe that performance improves when context blocks are introduced, and details lost during defogging are recovered well.
We also perform ablation experiments, as shown in Fig. 7, which shows the defogging results before adding our modules. With the successive addition of the MIMS and CB modules, the defogged image becomes clearer and the recovery of details improves steadily. In conclusion, the ablation studies show that the proposed multi-input multi-scale module and context block are useful for improving the model's defogging effect and recovering more detailed information.

Conclusion
Aiming at the problems of existing image defogging algorithms, such as dependence on prior information, loss of feature information and blurred details after defogging, we proposed an effective multi-input and multi-scale U-Net defogging network. The network is based on an encoder-decoder architecture, and our study shows that the proposed modules effectively address image defogging while preserving the original colors of the image. The proposed method effectively alleviates incomplete defogging and the color distortion caused by excessive defogging. Comparative tests show that the proposed algorithm outperforms existing algorithms in both overall and local defogging performance.