An end-to-end multi-resolution feature fusion defogging network

Traditional convolutional neural networks perform well on synthetic single-image defogging datasets, but on real-world images with varying haze concentrations they produce incomplete defogging or distort and lose image detail. In this paper, the authors propose an end-to-end single-image defogging network that adopts an encoder-decoder structure, obtains a large receptive field over high-resolution pixels through a large number of pooling operations in the U-Net backbone, and uses skip connections to retain most of the image feature information. To achieve superior haze removal, the proposed method uses a bilateral grid to capture high-frequency edge information at low resolution. Relevant haze-related features are extracted, and a local affine model is fitted in bilateral space. Finally, the high- and low-resolution data are integrated with the extracted features to generate clear, vivid images. The authors compare the algorithm qualitatively and quantitatively with several state-of-the-art algorithms and show that it achieves better defogging results on both the SOTS dataset and real-world images, retains high-frequency image details, achieves a higher peak signal-to-noise ratio, and defogs haze images of different concentrations more effectively.


Introduction
Haze is a common weather phenomenon caused by small airborne particles like dust and smoke. These particles have a strong light absorption and scattering effect, leading to degraded image quality. Haze can severely threaten practical applications such as video surveillance, remote sensing, and autonomous driving. Moreover, it poses a challenge to advanced computer vision tasks like detection and recognition [1].
The atmospheric scattering model [2] offers a basic estimation of the haze effect. It generally supposes that

I(x) = J(x)t(x) + A(1 − t(x)),

where I(x) is the observed hazy image, J(x) is the scene radiance, A is the global atmospheric light, and t(x) = e^(−βd(x)) is the transmission determined by the scattering coefficient β and the scene depth d(x). Relying on a single network to extract features results in more redundancy, so integrating various types of image information is crucial. Currently, the commonly used method for recovering foggy images is to train a fully convolutional network (FCN) [3]. The computational complexity of an FCN grows quadratically with the spatial dimensions of the input image, which makes the network harder to train and can introduce artifacts. In addition, the receptive field of an FCN does not change with the size of the input image, so the features extracted from images of different resolutions are limited. To simultaneously preserve and integrate the deep and shallow features of the image, we obtain a flexible receptive field through a large number of pooling and skip-connection operations.
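As a concrete illustration of the model above, the following sketch synthesizes a hazy image from a clean image and a depth map. The function name and default parameter values are illustrative choices, not taken from the paper.

```python
import numpy as np

def synthesize_haze(J, depth, beta=1.0, A=0.9):
    """Apply the atmospheric scattering model I = J*t + A*(1 - t),
    with transmission t(x) = exp(-beta * d(x)).

    J:      clean image, float array in [0, 1], shape (H, W, 3)
    depth:  scene depth map, shape (H, W)
    beta:   scattering coefficient (haze density)
    A:      global atmospheric light
    """
    t = np.exp(-beta * depth)[..., None]   # per-pixel transmission
    return J * t + A * (1.0 - t)
```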
This paper offers the following significant contributions:
1. We propose a multi-resolution feature fusion dehazing network that is free from the atmospheric scattering model. Compared with other advanced dehazing methods, its dehazing effect is prominent in regions of uneven haze and rich texture.
2. By adopting the encoder-decoder structure [4], adjusting the U-Net [5,6] structure, and adding the ResNet [7] module, we generate a new network, UResNet, which reduces the loss of feature information after pooling and uses skip connections to retain most of the multi-scale information.
3. The image-enhancement method of bilateral grid learning is combined with UResNet, and the contrast and edge-detail information of the images are fused by multi-guided affine transformation and interpolation. The edge detail of the low-resolution image serves as a secondary auxiliary network branch that is fused with the high-resolution information to restore a high-quality fog-free image.
4. To cater to distinct dehazing application scenarios, the proposed network is trained separately on indoor and outdoor training sets, and it exhibits improved performance on the SOTS test set.

Related work
He et al. [8] proposed the dark channel prior dehazing algorithm based on the atmospheric scattering model, which led to significant progress in image dehazing. The method relies on the assumption that in a haze-free image at least one color channel has very low pixel values. Nevertheless, this prior is prone to violation, which leads to imprecise estimation of the transmission map and unsatisfactory dehazing results in the sky, snow, and other bright areas. Berman and Avidan [9] proposed the non-local prior that the colors of a haze-free image can be well approximated by a few hundred distinct colors, and Wang et al. [10] observed that the blurred regions of a hazy image are mainly concentrated in the brightness channel of the YCrCb color space. These physical-model-based defogging methods are limited under complex scene conditions. Compared with traditional methods, and with the support of big data, deep learning dehazing has achieved better performance and robustness. In 2016, Cai et al. [11] proposed DehazeNet, which obtains the transmission map by nonlinear regression with a convolutional neural network, uses an assumed prior to calculate the atmospheric light A, and finally recovers the haze-free image according to the atmospheric scattering model. Cai et al. also proposed a new nonlinear activation function, BReLU, and demonstrated its importance for accurate image restoration. Li et al. [12] proposed AOD-Net in 2017, arguing that estimating the transmission map and atmospheric light separately causes larger systematic errors; they therefore merged the transmittance and atmospheric light into an intermediate variable K and used a lightweight neural network to regress K and restore the haze-free image. Zhu's DehazeGAN [13] uses a method similar to AOD-Net, estimating both the transmittance and the atmospheric light simultaneously within a generative adversarial network. Ren et al. [14] proposed MSCNN in 2016, which uses a multi-scale network to estimate the transmission map at different levels of detail; by capturing local and global image features, the approach can alleviate halo artifacts in dehazed images. Mei et al. [15] departed from the atmospheric scattering model in 2019 and used an encoder-decoder architecture built from residual blocks to directly learn the mapping between hazy and haze-free images; this end-to-end dehazing method achieves better results. Proposed in 2019, Griddehaze-Net [16] comprises three modules: preprocessing, backbone, and post-processing; the preprocessing module is trained to generate diverse and relevant feature inputs that enhance the network's performance. Dong et al. [17] introduced MSBDN, which uses multi-scale feature maps, residual and dense connections, and pixel-reconstruction and perceptual losses to improve image clarity and reduce haze. FFA-Net [18] incorporates channel and pixel attention mechanisms and leverages local residual connections to let less important information, such as thin haze or low-frequency regions, bypass the main network.
Existing methods often produce incomplete dehazing or distorted image details, mainly because important image features may be lost or blurred during the dehazing process. Most CNNs have an inflexible receptive field, and this inflexibility leaves artifacts during haze removal. Although the encoder-decoder has powerful feature extraction capability, it is difficult for it to avoid losing many details.
Currently, some studies emphasize the use of high-frequency edge information and incorporate it into an auxiliary network as a second branch. For example, Fang et al. [19] presented a CNN method for single-image dehazing that incorporates an edge branch and a reconstruction branch to fuse spatial details. Wang et al. [20] used a two-branch encoder to extract low-light and near-infrared features separately and fused them to obtain better results. Duan et al. [21] decomposed a superimposed image into two independent images, then combined the original mixed input with the activated visual information in a competitive scheme to obtain an unambiguous image, effectively suppressing visual confusion. Meanwhile, to obtain more robust features and semantic information from the input image and to achieve a flexible receptive field, Chen et al. [4] used soft pooling to retain more information in the encoder, Zhang et al. [22] adopted an attention mechanism to enable the encoder to obtain more clustered features, and Song et al. [23] proposed global residual and multi-scale feature fusion instead of concatenation fusion and achieved an impressive dehazing effect.
Bilateral filtering and bilateral grids have achieved much in preserving edge details [24,25]. They focus on object boundaries with abrupt color changes and, through edge-aware processing in bilateral space, can better capture high-frequency information. Gabiger-Rose et al. [26] provided a comprehensive account of an FPGA implementation of a bilateral filter that enhances images in real time with synchronous processing. Gharbi et al. [27] employed the bilateral grid for image enhancement using local affine color transformations; however, the method suffers from color-information loss because the high-resolution guidance map is collapsed into a two-dimensional space. In 2021, Zheng et al. [28] proposed learning multiple full-resolution maps in bilateral space that preserve intricate edges and textures within the image.
Following the above work, the authors utilize an encoder-decoder network architecture to handle full-resolution foggy images, employing multi-branch residual parallel processing and multi-source information interaction. Additionally, a bilateral grid serves as a second auxiliary network for multi-resolution feature fusion and learns the high-frequency information of low-resolution edges. The goal is to restore a high-quality haze-free image, as demonstrated in Sect. 3.

Proposed method
Initially, the input image is downsampled to 256 × 256 and then passed through a mini-UResNet with an original channel number of 8 and a downsampled channel number of 64 to extract low-resolution features. The extracted features are merged into a bilateral grid, which is used to interpolate the feature maps of the image's RGB channels. This yields high-resolution feature maps that capture fine color and edge details. The feature maps of the three channels are combined with multi-scale high-resolution feature information and passed through convolutional blocks in UResNet for feature fusion. The resulting coefficient tensor is then multiplied element-wise by the original hazy image to generate the haze-free image.
The proposed network architecture is presented in Fig. 1. It consists of two branches: one extracts low-resolution (256 × 256) information, while the other learns edge information through the bilateral grid; finally, full-resolution image information is fused to restore haze-free images with rich edge information.
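A minimal sketch of this two-branch forward pass, assuming PyTorch. The module names (low_res_net, grid_head, slice_and_fuse) and the grid shape are placeholders inferred from the description above, not the authors' actual code.

```python
import torch.nn.functional as F

def dehaze_forward(hazy, low_res_net, grid_head, slice_and_fuse):
    """Sketch of the two-branch forward pass.

    hazy:           (B, 3, H, W) full-resolution hazy image
    low_res_net:    mini-UResNet feature extractor (placeholder)
    grid_head:      maps features to a bilateral grid (placeholder)
    slice_and_fuse: slices the grid at full resolution and fuses the
                    multi-resolution features into a coefficient tensor
    """
    # Branch 1: low-resolution feature extraction and grid construction
    low = F.interpolate(hazy, size=(256, 256), mode='bilinear',
                        align_corners=False)
    grid = grid_head(low_res_net(low))      # e.g., (B, 12, 64, 16, 16)
    # Branch 2: full-resolution fusion; the coefficient tensor is
    # multiplied element-wise with the hazy input to restore the image
    coeffs = slice_and_fuse(hazy, grid)     # (B, 3, H, W)
    return coeffs * hazy
```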

UResNet
Encoder-decoder self-attention and traditional multi-scale networks have demonstrated significant advantages in image restoration tasks. However, in encoder-decoder or traditional multi-scale networks, the information flow is often affected by a bottleneck effect caused by the hierarchical structure. U-Net uses downsampling/upsampling blocks and dense connections across different scales to integrate deep and shallow information. The 3 × 3 convolutional layer (with ReLU activation) modules of U-Net are replaced with the ResNet module shown in Fig. 2a, and bilinear interpolation is used in the upsampling process to make the feature information smoother. The modified network makes the exchange and aggregation of image information more flexible; the network structure is shown in Fig. 2b.
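A hedged PyTorch sketch of the two building blocks this paragraph describes: a residual block replacing U-Net's plain 3 × 3 Conv+ReLU pairs, and a decoder stage using bilinear upsampling with a U-Net skip connection. Channel counts and class names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Residual block replacing U-Net's plain 3x3 Conv+ReLU pairs."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        return F.relu(out + x)            # identity skip, as in ResNet

class UpBlock(nn.Module):
    """Decoder stage: bilinear upsampling, then concat with the
    encoder skip feature (in_ch = upsampled + skip channels)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, out_ch, 1)
        self.res = ResBlock(out_ch)

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode='bilinear',
                          align_corners=False)
        x = torch.cat([x, skip], dim=1)   # U-Net skip connection
        return self.res(self.reduce(x))
```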

Bilateral grid learning
The authors aim to tackle the contrast and visibility degradation of hazy images by predicting an affine bilateral model that recovers sharp edge and structure information. This is achieved by fitting a 3D array of affine functions in bilateral space in a dimensionality-boosting manner. The proposed network uses feature extraction blocks to build bilateral coefficients on a reduced-resolution version of the input image. The regressed affine bilateral grid is then used to reconstruct a high-quality haze-free image on the R, G, and B channels, resulting in better recovery of image details.
Firstly, as in the feature extraction module in Fig. 1, the hazy image is downsampled to 256 × 256, low-resolution features are extracted through UResNet, and a series of fusions with the original input produces a feature-map array M that contains rich edge and semantic information. M is represented as a bilateral grid of size 64 × 16 × 16, where each cell comprises 12 values; these values correspond to the RGB channels, and each channel has four associated affine transformation coefficients. Finally, by slicing, the low-resolution features are affine-transformed with the three-channel feature coefficients generated by UResNet, and the final haze-free image is restored. Specifically, in this paper, the low-resolution feature M is treated as a 3D expanded multichannel bilateral grid G, which can be expanded by unrolling the channel dimension:

M ∈ R^(16×16×(64·12)) → G ∈ R^(x×y×z×n),

where x = 16 and y = 16 are the spatial coordinates of the grid, z = 64 is its depth, and n = 12 is the number of coefficients stored in each grid cell. Similarly, [x = 256, y = 256] and z = 3 denote the spatial size and the number of channels of the full-resolution image.

Fig. 1 The scheme of the proposed network structure
Fig. 2 Overall structure of UResNet
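The slicing step can be implemented with trilinear sampling, as in HDRNet-style bilateral grid methods. The sketch below assumes a grid of shape (B, 12, 64, 16, 16) and a single-channel full-resolution guidance map in [0, 1]; the paper does not specify its exact guidance construction, so treat this as illustrative.

```python
import torch
import torch.nn.functional as F

def slice_bilateral_grid(grid, guide):
    """Sample per-pixel affine coefficients from a bilateral grid.

    grid:  (B, 12, 64, 16, 16) learned grid; 12 coefficients per cell
           (4 per RGB channel), depth 64, spatial 16x16
    guide: (B, 1, H, W) full-resolution guidance map in [0, 1]
    returns: (B, 12, H, W) trilinearly interpolated coefficients
    """
    B = grid.shape[0]
    H, W = guide.shape[2:]
    # Normalized spatial coordinates in [-1, 1]
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=grid.device),
        torch.linspace(-1, 1, W, device=grid.device),
        indexing='ij')
    xs = xs.expand(B, H, W)
    ys = ys.expand(B, H, W)
    zs = guide.squeeze(1) * 2 - 1            # guidance selects grid depth
    # grid_sample expects coordinates ordered (x, y, z) for 5-D input
    coords = torch.stack([xs, ys, zs], dim=-1).unsqueeze(1)  # (B,1,H,W,3)
    out = F.grid_sample(grid, coords, align_corners=True)    # (B,12,1,H,W)
    return out.squeeze(2)
```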

Multi-resolution feature fusion
To effectively fuse the edge-detail features extracted from the bilateral grid with the haze image, the authors apply a per-channel polynomial (first-order, i.e., affine) function to boost the feature maps of the RGB channels and enhance their representations, which can be expressed as

R′ = w_r · R + b_r,  G′ = w_g · G + b_g,  B′ = w_b · B + b_b,

where the enhanced feature maps of the RGB channels are denoted R′, G′, B′; R, G, B are the three color-channel feature maps extracted by UResNet; and the features on the three channels are processed by a three-layer 2D convolution to obtain w_r, w_g, w_b and b_r, b_g, b_b. This process is illustrated in Fig. 1 as feature extraction on the low-resolution feature map. Finally, the enhanced feature maps are passed through three convolution layers with the BReLU activation function, and the haze-free image is generated by element-wise multiplication of the coefficient tensor and the original hazy image.
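A minimal PyTorch sketch of this fusion step: a three-layer convolution head predicts the per-pixel weights and biases, which apply the first-order polynomial boost to the RGB feature maps. The hidden width (32) and module name are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAffineFusion(nn.Module):
    """Predict per-pixel weights/biases with a three-layer conv stack
    and apply a first-order polynomial boost to each RGB feature map."""
    def __init__(self, in_ch):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 6, 3, padding=1))   # w_r,w_g,w_b, b_r,b_g,b_b

    def forward(self, feats, rgb):
        # feats: (B, in_ch, H, W) fused features; rgb: (B, 3, H, W)
        w, b = self.head(feats).chunk(2, dim=1)   # each (B, 3, H, W)
        return w * rgb + b                        # enhanced R', G', B'
```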

Loss function
The training of the proposed network utilizes both smooth L1 loss and perceptual loss, the latter introduced by Johnson et al. [29]. These loss functions are combined into the total loss, which is expressed as

L_total = L_smooth + λ L_perceptual,

where the parameter λ adjusts the weight of the perceptual loss in the total loss. In this paper, λ is assigned a value of 0.04.

Smooth loss
MSE or L2 loss is commonly utilized as the loss function for single-image dehazing. Nonetheless, Lim et al. [30] argued that L1 loss can prevent potential gradient explosion, and many image restoration tasks trained with L1 loss exhibit superior performance compared to L2 loss. At the same time, because the L1 loss is not smooth and is sensitive to outliers, we choose the smooth L1 function as the pixel loss. The disparity between the pixel x of the ith color channel predicted by the network and that of the actual haze-free image is captured by Ĵ_i(x) − J_i(x), and the smooth L1 loss can be represented as

L_smooth = (1 / 3N) Σ_{x=1}^{N} Σ_{i=1}^{3} F(Ĵ_i(x) − J_i(x)),

where

F(e) = 0.5 e², if |e| < 1;  |e| − 0.5, otherwise,

and N is the total number of pixels.
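This formulation matches PyTorch's built-in F.smooth_l1_loss with its default beta of 1; a minimal sketch written out explicitly:

```python
import torch

def smooth_l1_pixel_loss(pred, target):
    """Smooth L1 pixel loss: F(e) = 0.5*e^2 if |e| < 1 else |e| - 0.5,
    averaged over all pixels of the three color channels."""
    e = pred - target
    per_elem = torch.where(e.abs() < 1, 0.5 * e ** 2, e.abs() - 0.5)
    return per_elem.mean()
```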

Perceptual loss
This study employs VGG16 [31], a deep neural network pretrained on ImageNet [32], as a loss network to extract multi-scale features for measuring the visual dissimilarity between the estimated and actual images. The deeper layers of the network capture increasingly abstract and complex semantic information, bringing image perception closer to the level of human cognition. The perceptual loss is defined on features extracted from the output layers of the first three stages of the network:

L_perceptual = Σ_{j=1}^{3} (1 / (C_j H_j W_j)) ‖φ_j(Ĵ) − φ_j(J)‖₂²,

where φ_j(Ĵ) and φ_j(J), j = 1, 2, 3, are the feature maps of the estimated and ground-truth images at VGG16's Conv1-2, Conv2-2, and Conv3-3 layers, and C_j, H_j, and W_j denote the number of channels, height, and width of the jth feature map, respectively.
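A sketch of this perceptual loss using torchvision's pretrained VGG16 (recent torchvision versions); the slice indices correspond to relu1_2, relu2_2, and relu3_3, i.e., the outputs of Conv1-2, Conv2-2, and Conv3-3. Input normalization to ImageNet statistics is omitted for brevity and would be needed in practice.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class VGGPerceptualLoss(nn.Module):
    """Feature-space MSE over VGG16 relu1_2, relu2_2, relu3_3,
    following Johnson et al. [29]."""
    def __init__(self):
        super().__init__()
        features = models.vgg16(
            weights=models.VGG16_Weights.IMAGENET1K_V1).features
        # Stages ending at relu1_2 (idx 3), relu2_2 (idx 8), relu3_3 (idx 15)
        self.stages = nn.ModuleList(
            [features[:4], features[4:9], features[9:16]])
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, pred, target):
        loss = 0.0
        x, y = pred, target
        for stage in self.stages:
            x, y = stage(x), stage(y)
            loss = loss + F.mse_loss(x, y)  # mean divides by C*H*W
        return loss
```

With the λ of Sect. 3.4, the total loss would then be smooth_l1_pixel_loss(pred, gt) + 0.04 * VGGPerceptualLoss()(pred, gt).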

Experiments
In this paper, training uses the RESIDE [33] dataset, which contains synthetic indoor and outdoor hazy images generated from NYU Depth V2 [34] and the Middlebury stereo dataset [35]. The indoor training set (ITS) is generated from 1399 clear images by selecting the atmospheric scattering coefficient β ∈ [0.6, 1.8]. After removing unwanted data, the outdoor training set (OTS) was obtained from 8477 clear images, with hazy images generated for β ranging from 0.04 to 0.2 and A ranging from 0.8 to 1.0, for a total of 296,659 pairs of outdoor images. The hazy images were center-cropped to 240 × 240, and peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) were selected as evaluation metrics on the Synthetic Objective Testing Set (SOTS), which comprises 500 pairs of indoor images and 500 pairs of outdoor images.
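For reference, PSNR on images scaled to [0, 1] can be computed directly as below, and SSIM is available as skimage.metrics.structural_similarity; this helper is a generic sketch, not the authors' evaluation code.

```python
import numpy as np

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio for images scaled to [0, data_range]."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)
```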

Implementation details
The dehazing network proposed in this study is trainable end-to-end and does not require pre-training. RGB images of size 240 × 240 were used for training. To expedite training, we utilized the Adam optimizer [36] with a batch size of 16 and the default values of 0.9 and 0.999 for β1 and β2, respectively. The initial learning rate is set to 10⁻³. For the ITS, the total number of training epochs is 120, and the learning rate is reduced by 80 percent after every 10 epochs. For the OTS, we trained for only 8 epochs, and the learning rate decreased by 75 percent after every 2 epochs. All training and testing in this paper were carried out on an RTX 3060 GPU with 12 GB of graphics memory.
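A minimal sketch of this optimizer configuration in PyTorch, assuming a model instance is available; an 80 percent reduction corresponds to gamma = 0.2 and a 75 percent reduction to gamma = 0.25 in StepLR.

```python
import torch

def make_optimizer(model, dataset='ITS'):
    """Adam with the reported hyperparameters and step schedules."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3,
                           betas=(0.9, 0.999))
    if dataset == 'ITS':   # 120 epochs, lr * 0.2 every 10 epochs
        sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10,
                                                gamma=0.2)
    else:                  # OTS: 8 epochs, lr * 0.25 every 2 epochs
        sched = torch.optim.lr_scheduler.StepLR(opt, step_size=2,
                                                gamma=0.25)
    return opt, sched
```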

Synthetic dataset
This section presents a qualitative and quantitative comparison of the proposed approach with state-of-the-art dehazing networks on the SOTS. Four advanced dehazing algorithms are compared: DCP [8], AOD-Net [12], GCA-Net [37], and Griddehaze-Net [16]. Except for DCP, which is based on a physical model, all are data-driven dehazing algorithms; the results are shown in Table 1.
In terms of PSNR, the proposed method has a clear advantage: the network retains more information while defogging. As shown in Fig. 3, the edge map of the defogged image contains deeper detail information. A visual comparison and qualitative analysis are given in Fig. 4, where the top three rows show dehazed indoor images and the bottom three rows show dehazed outdoor images. Due to the limitation of its physical model, DCP distorts parts of the picture, and its dehazing effect is poor in regions such as the sky in Fig. 4b3,4,5: in bright areas such as the sky, the dark channel information may not be evident because of the high illumination intensity, which makes it difficult for the algorithm to accurately estimate the density and depth of the fog. AOD-Net uses a lightweight network to regress the intermediate variable K, from which the haze-free image is recovered; its network architecture and training process affect the accuracy of the results and the generalization to new images, and it cannot completely remove the haze and retains less detail. GCA-Net struggles to preserve edges due to the inherent tradeoff between texture enhancement and structure preservation; as shown in Fig. 4d3,5, the processing of edges is inadequate, so color distortion appears at the boundary between buildings and the sky. The model parameters of Griddehaze-Net limit its performance on dense haze, as shown in Fig. 4e5; we subsequently find that it also handles real-world fog of inconsistent concentration poorly.

Fig. 3 An example of edge maps: a the edge map of a hazy image, b the edge map after dehazing by Griddehaze-Net, c the edge map obtained after dehazing using our method, and d the edge map of the ground-truth picture
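The paper does not state which operator produced the edge maps in Fig. 3; a common choice for such visualizations is the Laplacian, sketched here with OpenCV as one possible way to reproduce them.

```python
import cv2

def edge_map(path):
    """Grayscale Laplacian edge map, one way to produce Fig. 3-style
    visualizations (the paper's actual operator is unstated)."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Laplacian(gray, cv2.CV_64F, ksize=3)
    return cv2.convertScaleAbs(edges)
```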

Real-world hazy image
Since the haze concentration within a single image of the synthetic dataset is uniform, while haze concentration in the real world is uncertain, experiments on real-world images are necessary.
We selected five images from the RTTS dataset and natural outdoor foggy scenes for dehazing. As shown in Fig. 5, the halo effect of DCP appears at the edges of buildings, a white halo appears at the border between Tiananmen and the sky in the third image of Fig. 5, and the results are darker than those of the other algorithms, giving poor overall image quality. Although AOD-Net largely overcomes color distortion, its dehazing of the sky and clouds is incomplete, resulting in a loss of detail. GCA-Net produces shadows in the sky and clouds, and the overall image is too bright. Griddehaze-Net works well on nearby haze but is imperfect at a distance, such as the inscription on Tiananmen in the third picture and the billboard in heavy haze in the fifth picture. Our method performs best visually after dehazing.

Ablation study
To validate the efficacy of the approach proposed in this paper, the authors modify the configuration of the dehazing network and conduct ablation experiments. This paper mainly considers the effect of the ResNet module, bilateral grid learning, feature fusion, and perceptual loss on network performance, and conducts quantitative analysis on the SOTS indoor test set. The results are shown in Table 2. According to the indicators in the table, combining the bilateral grid method with UResNet extracts more multi-scale information.

Running time
This paper compares the average processing time of each method over the 500 indoor and 500 outdoor fog images of the SOTS test set on the same device. Figure 6 displays the results. The proposed approach achieves an average processing time of 0.206 s per image on the SOTS dataset.
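A sketch of how such an average per-image time can be measured in PyTorch; the loader is assumed to yield (hazy, ground_truth) batches and the device to be CUDA, so that synchronization yields accurate timings.

```python
import time
import torch

@torch.no_grad()
def avg_time_per_image(model, loader, device='cuda'):
    """Average per-image inference time over a test loader."""
    model.eval().to(device)
    total, count = 0.0, 0
    for hazy, _ in loader:
        hazy = hazy.to(device)
        torch.cuda.synchronize()          # finish pending GPU work
        start = time.perf_counter()
        model(hazy)
        torch.cuda.synchronize()          # wait for inference to finish
        total += time.perf_counter() - start
        count += hazy.size(0)
    return total / count
```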

Conclusion
This paper introduces a novel trainable end-to-end dehazing network that utilizes a bilateral grid and UResNet. UResNet adopts an encoder-decoder structure; through operations such as skip connections, concatenation, and addition, the bilateral grid learning assists the main network in retaining high-frequency details and restoring high-quality haze-free images. In addition, with the pixel loss and perceptual loss introduced in the loss function, the proposed network has strong advantages in restoring image details and color fidelity, and the recovered pictures achieve a higher peak signal-to-noise ratio and more high-frequency information.
Although the method in this paper achieves pleasing visual effects, its processing speed is slow on 4K-resolution images. In future work, the authors will modify the model to further improve its applicability and processing speed across application scenarios.
Author Contributions PX was involved in writing, review and editing, and research supervision and guidance. SD helped with data collection and processing and the visualization of experimental results.

Declarations
Conflict of interest All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Ethical approval This work did not require ethical approval under the research governance guidelines operating at the time of the research.