A multi-stage feature fusion defogging network based on the attention mechanism

This study proposes an effective multi-stage feature fusion defogging network based on an attention mechanism to address the difficulty of defogging and the color distortion of hazy images in complex environments. The model integrates several attention techniques into different branch architectures to form a multi-branch defogging network. The spatial details and contextual information of the image are complemented and merged across the feature information recovered by the different network branches, increasing the effectiveness of the network model. A stage attention fusion mechanism built between the network branches reduces the loss of image information during feature extraction and improves the defogging result. Experimental results demonstrate that the proposed algorithm achieves superior defogging performance on both synthetic and real-world scene datasets and outperforms other advanced algorithms, particularly on the Reside and O-Haze datasets: compared with the best existing technique, PSNR improves by 1.58 dB and 1.61 dB on Reside and O-Haze, respectively, and SSIM on O-Haze improves by 3.4%.


Introduction
In recent years, severe weather such as haze has occurred frequently in many regions owing to widespread environmental pollution and deteriorating atmospheric quality. In hazy weather, numerous tiny suspended particles scatter atmospheric light and affect light absorption [1]. Light reflected from the scene is attenuated, severely degrading the images obtained by imaging devices: the generated images are unclear, with reduced visibility, color distortion, and low contrast [2,3], which limits the application of imaging devices. Because several factors are unpredictable [4], such as the natural scenery, the haze concentration, and the weather conditions, obtaining optimal defogged image quality is difficult. Domestic and international researchers have therefore extensively studied image-defogging algorithms, which fall into three main categories: image-enhancement-based algorithms, image-restoration algorithms, and deep-learning-based algorithms [5]. Image-enhancement defogging methods include histogram equalization, the wavelet transform, Retinex [6], and related techniques. The local variant of histogram equalization enhances the image's local features by segmenting the image into non-overlapping or slightly overlapping local blocks and applying histogram processing to each of them. Single-scale, multi-scale, and color-restoration Retinex algorithms improve images but suffer from color shifts, poor adaptability, and other issues. Image-restoration defogging algorithms are mainly based on physical models of atmospheric scattering and include the dark channel prior [7,8], the Markov network model [9], and the linear color-attenuation prior [10].
Deep-learning-based image defogging currently falls into two categories: one predicts all the parameter values required by the atmospheric scattering model with a lightweight network and then applies the model for haze removal; the other generates a clear image directly from the hazy image with a neural network. Cai et al. [11] introduced the first end-to-end haze removal network (DehazeNet), which directly learns the relationship between hazy image regions and the transmission map, estimates the transmittance through a deep architecture, and proposes a new bilinear activation function to speed up training convergence. Ren et al. [12] presented a multiscale convolutional neural network (MSCNN) based on local multi-patching of images, removing fog by fusing multiple image regions and contextual information through skip connections. Li et al. [13] proposed dividing the defogging process into two stages in the all-in-one defogging network (AOD-Net): a K-estimation module consisting of a lightweight neural network first acquires multiscale features, estimates the fog depth, and estimates K(x) from I(x); a fog-removal module built on the atmospheric scattering model then estimates J(x) to obtain the defogged image. Nevertheless, AOD-Net relies excessively on the atmospheric model and struggles with dense fog. Combining the attention mechanism with multiscale feature extraction, Liu et al. [14] incorporated a semantic segmentation network into the defogging task. Chen et al. [15] suggested a gated context aggregation network (GCANet) incorporating semantic information and dilated convolution to enlarge the receptive field.
Current advanced deep-learning techniques include AOD-Net and GCANet, among others. Although these networks perform well on thin-fog datasets in synthetic scenarios, problems remain, such as insufficient extraction of deep feature information from hazy photographs and the loss of high-frequency information. Furthermore, fog residue, color shift, and distortion occur when dealing with dense fog in real scenes. To overcome these issues, this work proposes a multi-stage feature fusion defogging network based on the attention mechanism. Its main contributions are as follows.
(1) Combining single-scale and multi-scale image feature extraction improves the network's feature extraction capability.
(2) The residual attention feature sub-network (Raf-Net), built for full-resolution output, accurately recovers spatial detail features and concentrates on useful image information through residual learning and various forms of attention.
(3) A cross-stage attention fusion mechanism (CATM) allows distinct sub-network branches to cooperate in extracting features and lowers the loss of image features in the overall network via front-to-back feature transfer.
(4) The proposed multi-stage feature defogging network defogs synthetic datasets effectively and also performs well on dense-fog datasets of natural scenes, reducing dense-fog residue.

Atmospheric scattering model
On foggy days, the scattering of light by air particles is the primary cause of decreased visibility and blurred imagery. The atmospheric scattering model [16] describes the degradation of hazy images; its equation is

I(x) = J(x)t(x) + A(1 − t(x))        (1)
Essentially, defogging inverts this formula to recover the fog-free image J(x) from the hazy image I(x) captured by the imaging system. Here x denotes the spatial location of a pixel; t(x) is the transmission matrix, whose physical meaning is the proportion of atmospheric light that reaches the imaging device after attenuation by suspended particles; and A is the estimated atmospheric light value.
The transmittance satisfies t(x) = e^(−βd(x)), where β is the scattering coefficient, which can be treated as a uniform constant under certain conditions, and d(x) is the scene depth, from which the transmittance t(x) can be calculated. Rearranging Equation (1) gives

J(x) = (I(x) − A) / t(x) + A        (2)

so the transmission matrix t(x) and the estimated atmospheric light value A are both required to calculate the recovered clear image precisely. Owing to the imprecision of the atmospheric light estimate, pixels in white regions of the image can corrupt the determination of the maximum pixel value, which in turn impacts the computation of the transmittance. Because of this parameter uncertainty, defogging directly with the atmospheric scattering model yields subpar results.
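As a concrete illustration of this recovery process, the NumPy sketch below synthesizes a hazy image with the scattering model and then inverts it; the constant depth and the values of β and A are illustrative assumptions, not values from the paper.

```python
import numpy as np

def synthesize_haze(J, d, A=0.8, beta=1.0):
    """Apply the atmospheric scattering model: I = J*t + A*(1 - t), t = exp(-beta*d)."""
    t = np.exp(-beta * d)
    return J * t + A * (1.0 - t), t

def recover(I, t, A=0.8, t_min=0.1):
    """Invert the model: J = (I - A) / max(t, t_min) + A.
    The lower bound t_min avoids division blow-up where transmittance is tiny."""
    return (I - A) / np.maximum(t, t_min) + A

# Toy example: a clear image over a constant-depth scene.
J = np.random.rand(4, 4)      # clear image in [0, 1]
d = np.full((4, 4), 0.5)      # assumed scene depth
I, t = synthesize_haze(J, d)
J_hat = recover(I, t)         # exact recovery when t and A are known
```

In practice, of course, t(x) and A are unknown and must be estimated, which is exactly where the errors discussed above arise.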
Multi-stage networks

Fu et al. [25] employed a multi-stage cascade strategy to gradually recover the rain map and thereby deblur the image. Zhang et al. [26] proposed a multi-patch cascade network built on single-stage encoder-decoder structures, which obtains local information about the image in a grid and passes the obtained features bottom-up from multi-region features to single-region features to achieve an excellent deblurring effect. Ren et al. [27] proposed a multiscale deep network that combines coarse-scale and fine-scale networks: different stages roughly predict the transmittance of the fog map, and an edge network that combines local information with the coarse estimate then refines the transmittance map for defogging. Das et al. [28] performed cascade defogging with a three-stage sub-network, using multi-patch image regions as input and an encoder-decoder as the feature-extraction structure of each stage; however, its structure is uniform, and single-scale feature extraction restricts the overall expressiveness of the network. The network in this paper merges multiple stages and different branches: the first two stages employ the encoder-decoder structure, and the final stage employs the residual attention feature sub-network. After features are extracted individually, the branch features are fused to defog images in varied scenarios.

Attention Mechanisms
Attention mechanisms primarily handle information in two dimensions, spatial and channel, and include numerous types of attention, such as spatial, pixel, and channel attention [29,30]. The first two alter feature information in space, while the third weights information along the channel dimension. The channel attention method developed by Hu et al. [31] promotes high-quality features by compressing spatial feature information. Woo et al. [32] further weight the feature map by adding maximum pooling to spatial attention to obtain more refined attention weights. To include global perceptual fields in the spatial portion of the attention weights, Hou et al. [33] apply strip pooling to the weighted spatial information and generate local feature information by stacking the strip information. The attention module suggested by Zhong et al. [34] uses local pooling to produce pixel-weighted image features, partially compressing the feature map's spatial information. This paper employs distinct attentional feature-extraction modules for feature weighting in different dimensions, emphasizing practical information and enhancing the acquisition of feature information.

Overall network structure

As shown in Figure 1, the entire network consists of a feature-extraction component, a feature-augmentation component, and an overall optimal defogging component. For each branch stage, shallow features are first retrieved from the input image using two 3×3 convolutions and one PReLU activation unit. In the first and second sub-networks, the extracted shallow features enter encoder-decoder subnets based on the U-Net architecture: the image is downsampled through the attention mechanism and a downsampling operation to obtain a low-resolution feature map, and is then upsampled through bilinear upsampling. The procedure maps the shallow features to a low-resolution representation of the image and restores the original resolution by reverse mapping, learning extensive contextual semantic information, encoding multi-scale features, and obtaining a feature map containing all extracted features. The feature map then enters the feature enhancement module (a new module obtained by transforming the convolution of Waqas Zamir et al. [35]) to produce the enhanced output. The third sub-network, designed directly at the original resolution, generates accurate high-resolution image information and thereby retrieves rich spatial detail features. The residual maps Rs, Rp, and Rc produced by the three branch networks enter the residual skip-connection module, which comprises a weighting method designed to produce a better residual map; the residual map Rr and the fused features are then merged to accomplish image defogging.

Encoder-decoder subnet
A branching network based on the U-Net architecture, consisting primarily of an encoder and a decoder, is utilized in the first and second stages. The network's distinctive architecture enlarges the receptive field and permits extracting a vast array of semantic information. As depicted in Figure 2, the branching network possesses a three-layer structure comprising an encoding-decoding structure, upsampling and downsampling components, and skip connections containing an attention module. During the encoding phase, images are downsampled and their resolution is decreased to obtain multi-scale information. The decoding section adopts a structure symmetric to the encoding section, upsampling the image and combining its features with those of the encoding section for layer-by-layer feature extraction and fusion, so as to obtain image data containing all the features. The distinction from a standard U-Net lies in using multiple attention types for feature extraction and assigning different weights to the features, so that the encoder-decoder branches pay greater attention to the feature information of the fog map in different dimensions.
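A minimal PyTorch sketch of such an encoder-decoder branch is given below; the channel counts, the single downsampling/upsampling level, and the plain convolutions standing in for the paper's attention modules are all simplifying assumptions.

```python
import torch
import torch.nn as nn

class EncoderDecoderBranch(nn.Module):
    """Toy U-Net-style branch: downsample to enlarge the receptive field,
    then upsample and fuse encoder features through a skip connection."""
    def __init__(self, ch=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.PReLU())
        self.down = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)    # halve resolution
        self.enc2 = nn.Sequential(nn.Conv2d(ch * 2, ch * 2, 3, padding=1), nn.PReLU())
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.dec = nn.Sequential(nn.Conv2d(ch * 3, ch, 3, padding=1), nn.PReLU())
        self.out = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)                        # full-resolution features
        e2 = self.enc2(self.down(e1))            # low-resolution, wide-context features
        u = self.up(e2)                          # restore resolution
        d = self.dec(torch.cat([u, e1], dim=1))  # skip connection: fuse both scales
        return self.out(d)

x = torch.randn(1, 3, 64, 64)
y = EncoderDecoderBranch()(x)
```

The paper's branch would additionally insert attention modules into the skip connections and use more levels, but the down/up/fuse skeleton is the same.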

Subnet of residual attention feature
The residual attention sub-network, constructed at the original resolution, compensates for the limited performance of a single type of sub-network in the initial network and generates high-resolution images rich in spatial information without any downsampling of the image data. As shown in Figure 3, the subnet consists of a residual attention group, a skip connection designed on the residual structure, and two attention modules. The residual attention group contains five identical basic blocks and a convolution operation. The sub-network input is first processed by the basic residual blocks: the output is cascaded five times, then convolved and combined with the residual-based skip connection. The composed residual attention group is cascaded again and summed with the initial input features of the subnet through the residual structure to obtain full-resolution image features.
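The cascade just described might be sketched as follows; the stand-in basic block here is a plain convolutional residual block (the paper's version adds the attention described in the next subsection), and the channel count is an assumption.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Stand-in residual basic block (the paper's version adds channel and pixel attention)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.PReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class ResidualAttentionGroup(nn.Module):
    """Five cascaded basic blocks, a convolution, and a residual skip connection.
    No downsampling anywhere, so the full-resolution spatial detail is preserved."""
    def __init__(self, ch=16):
        super().__init__()
        self.blocks = nn.Sequential(*[BasicBlock(ch) for _ in range(5)])
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return x + self.conv(self.blocks(x))   # residual-based skip connection

f = torch.randn(1, 16, 32, 32)
out = ResidualAttentionGroup()(f)
```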

Residual Foundation Module
As shown in Figure 4, this basic block mainly contains local residual learning [36] and a feature attention module. First, the input image information passes through a 3×3 convolution with the PReLU activation function to form the residual block. The local residual learning and feature attention modules are combined by connecting channel attention (Figure 4(a)) and pixel attention (Figure 4(b)) in series. Finally, the input information is processed through a long skip connection to obtain the overall output features. Local residual learning allows less critical information in the image, such as fog or low-frequency regions with low attention, to bypass the main path through multiple local residual connections, making the residual attention sub-network on the main line focus more on valid information. The feature attention module combines the channel attention and pixel attention mechanisms, allowing the block to treat different pixels and features unequally, adding flexibility in handling different information, and extending the network's expressive capability; the process is shown in Equations (4) and (5).
Here f_c denotes the initial input feature map, Conv3 represents the 3×3 convolution, δ denotes the PReLU activation function, f′_c, F*_c, and p*_c have the same meanings as in Equation (4), and F_out represents the final feature after processing by the residual basic module.
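A hedged PyTorch sketch of such a basic block, with channel attention and pixel attention in series, is shown below; the reduction ratio r and the exact layer ordering are assumptions, since only the symbols described above are fixed by Equations (4) and (5).

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze spatial information with global pooling, then weight each channel (Figure 4(a))."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(ch, ch // r, 1), nn.PReLU(),
                                nn.Conv2d(ch // r, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.ca(x)

class PixelAttention(nn.Module):
    """Produce a per-pixel weight map so different pixels are treated unequally (Figure 4(b))."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.pa = nn.Sequential(nn.Conv2d(ch, ch // r, 1), nn.PReLU(),
                                nn.Conv2d(ch // r, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.pa(x)

class ResidualBasicBlock(nn.Module):
    """3x3 conv + PReLU, local residual learning, then channel and pixel attention in series,
    with a long skip connection back to the block input."""
    def __init__(self, ch=16):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.PReLU())
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.ca = ChannelAttention(ch)
        self.pa = PixelAttention(ch)

    def forward(self, f):
        res = self.conv2(self.conv1(f)) + f   # local residual learning
        return self.pa(self.ca(res)) + f      # long skip connection

f = torch.randn(1, 16, 32, 32)
F_out = ResidualBasicBlock()(f)
```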

Cross-stage attention fusion mechanism
As illustrated in Figure 5, before the features of one stage are propagated to the next stage for aggregation, they are first refined with a 1×1 convolution to decrease the number of parameters. They are then summed and batch-normalized to stabilize the distribution of the data inputs, reduce the sensitivity of the network parameters, and keep the learning process stable. The feature maps pass through the activation function and then through the SE attention module, which builds the interdependence between the channels of different stages more effectively and spontaneously recalibrates the features across channels; finally, the result is summed with the input of the next stage's sub-network. With this stage fusion mechanism, the feature distribution of haze in images can be learned better: before the feature transfer between stages, the features of the previous stage are stimulated to yield high-quality features, and redundant parts are removed. This improves the inter-stage expressiveness and makes the network less susceptible to losing feature information. Moreover, the multiscale characteristics of the previous stage enrich those of the subsequent stage, making the optimization process more stable and simplifying the information flow throughout the network, as illustrated in Equation (6).
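The fusion flow just described might be sketched in PyTorch as follows; the channel count and the SE reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Squeeze-and-excitation: global pooling, two 1x1 convs, sigmoid channel weights."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.fc = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(ch, ch // r, 1), nn.PReLU(),
                                nn.Conv2d(ch // r, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)

class CATM(nn.Module):
    """Cross-stage attention fusion: refine each stage's features with 1x1 convs,
    sum, batch-normalize, activate, recalibrate with SE attention,
    then add the result to the next stage's input."""
    def __init__(self, ch=16):
        super().__init__()
        self.reduce1 = nn.Conv2d(ch, ch, 1)   # 1x1 refinement of stage-1 features
        self.reduce2 = nn.Conv2d(ch, ch, 1)   # 1x1 refinement of stage-2 features
        self.bn = nn.BatchNorm2d(ch)
        self.act = nn.PReLU()
        self.se = SEAttention(ch)

    def forward(self, f1, f2, next_in):
        fused = self.bn(self.reduce1(f1) + self.reduce2(f2))
        return self.se(self.act(fused)) + next_in

f1 = torch.randn(2, 16, 32, 32)
f2 = torch.randn(2, 16, 32, 32)
nxt = torch.randn(2, 16, 32, 32)
out = CATM()(f1, f2, nxt)
```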
In the formula, c is the number of channels, and J and X are the final weight vector and the initial feature map of the SE attention module, respectively. f1 and f2 are the feature maps from the different branch networks, Conv1 is the 1×1 convolution, δ is the PReLU activation function, Xc is the output feature of the SE attention module, and Fout is the new feature output by the stage attention fusion mechanism.

Validation and analysis of experimental results

Experimental environment configuration
This research presents an end-to-end defogging network with a multistage fusion technique. The experimental platform is Ubuntu with the PyTorch deep learning framework, on a machine with an NVIDIA GTX 3060 GPU. During training, the number of input feature channels of each branch network is set to 96. Input images from the different datasets are randomly cropped to 256×256, and the cropped images are rotated and translated as data augmentation to increase the amount of data. The overall network model uses the Adam optimizer to reduce the loss and improve learning by continuously updating the model parameters; the smoothing constants β1 and β2 are set to 0.9 and 0.999, respectively. The learning rate starts at 0.0001 and is reduced to 0.000001 with a cosine annealing strategy. The training batch size is set to 2. The number of iterations varies with the size of each dataset, the resolution of the images, and the training scenario.
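These optimizer settings can be reproduced as in the sketch below; the placeholder model and the T_max value for the cosine schedule are assumptions, since the paper's iteration count varies by dataset.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Conv2d(3, 3, 3, padding=1)   # placeholder for the defogging network

# Adam with the smoothing constants from the paper
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
# cosine annealing from 1e-4 down to the paper's final rate of 1e-6
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

for epoch in range(100):
    # ... training step with batch size 2 and random 256x256 crops would go here ...
    scheduler.step()
```

After T_max scheduler steps, the learning rate has decayed along the cosine curve to eta_min.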

Experimental data set
To verify the practical defogging ability of the proposed network model, experiments were conducted on the synthetic dataset Reside and the real scene datasets O-Haze and D-Haze. The synthetic dataset Reside has five subsets, of which the indoor training set (ITS), containing 13,990 indoor fog images, is used extensively in this study. The O-Haze dataset is a real scene dataset in which fog is generated by a professional artificial fog machine; it comprises 45 pairs of outdoor photographs of the same content captured with and without fog. To further illustrate the adaptability and robustness of the proposed network, the model is also validated on the D-Haze dense fog dataset, which comprises 55 pairs of real-world dense fog photographs and corresponding clear images of varied outdoor scenes, characterized by dense and uniform fog against natural backgrounds. A complete network validation is conducted on these three datasets with distinct properties, together with the corresponding subjective evaluation.

Experimental analysis of synthetic data sets
In this study, we first evaluate the experimental effects on the SOTS test subset of the Reside dataset and then compare the experimental model's results with those of other advanced methods.
(1) Subjective visual analysis. As depicted in Figure 6, this work compares the proposed technique with several advanced algorithms (DCP, DehazeNet, AOD-Net, GCA, CMF) on various scene photographs from the synthetic dataset and visually displays the defogging effect of each approach. After the DCP algorithm defogs the image, the overall image appears dark and the colors of high-brightness areas are distorted (Figure 6(b)). The DehazeNet algorithm does not defog entirely, and a certain color distortion remains (Figure 6(c)). The AOD algorithm behaves like DehazeNet, and fog residue can be observed (Figure 6(d)). The GCA and CMF algorithms have an apparent defogging effect, but gray blur and unnatural color remain in the details (the upper-left wall of the third image in Figure 6(e) and Figure 6(f)). Compared to these methods, the proposed algorithm yields distinct texture features, a complete defogging effect, natural image color, and no distortion; it is closer to the original image from all angles, with significant visual advantages.
(2) Objective data analysis. The proposed algorithm is evaluated objectively by contrasting it with other algorithms on the evaluation indexes, demonstrating its superiority from multiple perspectives.
The evaluation metrics in this paper are peak signal-to-noise ratio (PSNR) and structural similarity (SSIM); larger values of both indicate better image quality. As shown in Table 2, among the data obtained on the SOTS dataset, the classical DCP algorithm scores lowest and the proposed algorithm scores highest. The mean SSIM increases from the previous best of 0.971 to 0.976; the mean PSNR increases from the previous best of 31.38 dB to 32.96 dB, a relative gain of 1.58 dB.

Method | PSNR/dB | SSIM
DehazeNet [11] | 20.64 | 0.800
AOD [13] | 19.16 | 0.850
GCANet [15] | 30.23 | 0.975
CMFNet [37] | 31.38 | 0.971
Ours | 32.96 | 0.976
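PSNR, the first of these metrics, can be computed as in the short NumPy sketch below (assuming images normalized to [0, 1]).

```python
import numpy as np

def psnr(ref, img, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means the restored image
    is closer to the reference."""
    mse = np.mean((ref - img) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.zeros((8, 8))
img = np.full((8, 8), 0.1)   # constant error of 0.1 -> MSE = 0.01 -> PSNR = 20 dB
value = psnr(ref, img)
```

SSIM, in contrast, compares local luminance, contrast, and structure statistics rather than raw pixel error, which is why the two metrics are reported together.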

Experimental analysis of real-scene data sets
The superiority of the proposed algorithm is further verified by defogging dense fog images of natural scenes; the uniform dense fog dataset D-Haze and the dense fog dataset O-Haze are used for the experiments and evaluations, respectively.
(1) D-Haze dataset. As shown in Table 3, the proposed algorithm was tested on the dense fog dataset D-Haze. Compared with the other advanced algorithms, it achieves the most significant metric values, with a PSNR of 14.86 dB.
As seen in Figure 7, various dense fog photographs taken from actual scenes are chosen and compared across the fog removal algorithms. From the subjective visual point of view, owing to the limitations of the traditional dark-channel-based DCP algorithm, there is a vast deviation when removing dense fog from the image; the image defogged by DCP exhibits severe problems (Figure 7(b)). Table 3 lists the objective results:

Method | PSNR/dB | SSIM
DehazeNet [11] | 13.84 | 0.425
AOD [13] | 13.14 | 0.414
GCANet [15] | 14.25 | 0.497
CMFNet [37] | 14.46 | 0.533
Ours | 15.01 | 0.531

Even in outdoor scenarios with dense fog, the proposed method offers superior defogging capability: the least fog remains compared with the other advanced algorithms, the authentic image is visible, and the subjective visual advantage is evident.
To demonstrate the strong defogging ability of the proposed algorithm on the Dense-Haze dense fog dataset, Figures 8 and 9 present a quantitative comparison of single images, with enlarged details for subjective observation of the defogging effect; for the objective data, statistics on single images further verify the algorithm's superior performance. (2) O-Haze dataset. As shown in Figure 10, the environment significantly affects the DCP algorithm: compared with the Dense-Haze dataset, a color shift occurs when natural outdoor scenes containing fog are defogged. CMF and GCA remove fog better, but a small amount of fog remains. The proposed multi-branch network has a noticeable fog removal effect and effectively removes fog from natural outdoor scenes.
As shown in Figure 11, a quantitative comparison of a single image verifies, from both subjective visual and objective data perspectives, the fog removal effect of the proposed algorithm on the O-Haze dense fog dataset; the proposed algorithm shows a significant advantage on this single image.

Experimental results analysis
In this section, the Reside dataset serves as the benchmark test set, and the performance of the network model is analyzed by quantitatively comparing different modules and branch networks.
Table 5 shows the comparison of the network models before and after the addition of each module in the ablation study. Without the stage attention fusion mechanism and the residual attention subnet, the PSNR and SSIM are the lowest in the table, at 31.36 dB and 0.968, respectively; after both are added, the indexes reach their highest values. Table 6 illustrates the impact of including different subnets on PSNR and SSIM. Among the U-Net-only configurations, the network with three U-Nets is superior to those with a single U-Net or two U-Nets: from a single U-Net branch to three identical networks, PSNR improves from 29.32 dB to 31.36 dB and SSIM from 0.877 to 0.990. Compared with the network consisting of three U-Net subnets of a single type, combining residual attention subnets with U-Net subnets for feature extraction and final fusion effectively extracts features containing rich spatial details and contextual information, thereby improving the model's image-defogging capability.

Conclusion
This paper proposes a multi-stage feature fusion defogging network based on the attention mechanism, a multi-branch structure composed of different types of branch sub-networks. It defogs synthetic fog images effectively and has a powerful defogging ability for both uniform and non-uniform dense fog in natural scenes, addressing the problems of difficult defogging and color distortion in complex environments. Specifically, the network comprises U-Net subnets and the proposed Raf-Net subnet for extracting spatial detail features and contextual information features. The feature information extracted by the two types of subnets is complementary, yielding more image information, improving the network's feature-extraction capability, and enhancing the extracted information through various attention modules. Moreover, a stage attention fusion mechanism across the distinct sub-networks effectively limits information loss during feature extraction, and feature fusion at the end of the overall network retains more image detail, achieving high-performance defogging. The proposed multi-branch network model has been validated on a variety of datasets, achieving excellent defogging capability and robustness, and can fulfill the task of image defogging efficiently.

Ethical Approval
This declaration is not applicable.

Competing interests
The authors declare that they have no competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Funding
This work received no funding.

Availability of data and materials
Because they are part of ongoing internal research in the laboratory, the data generated and analyzed in the current study will not be made public. For special requests, please contact the corresponding author.

Fig. 1
Fig. 1 Schematic diagram of multi-stage feature fusion defogging network based on attention mechanism

Fig. 3
Fig. 3 Schematic diagram of the subnet of residual attention feature

Fig. 6
Fig. 6 Comparison of defogging effect of different algorithms on SOTS dataset

Fig. 7
Fig. 7 Comparison of defogging effect of different algorithms on D-Haze dataset

Fig. 8
Fig. 8 Quantitative comparison chart of single images in the D-Haze dataset. Below each image, the first value is the peak signal-to-noise ratio and the second is the structural similarity.

Fig. 9
Fig. 9 Single picture detail comparison chart

Fig. 10
Fig. 10 Comparison of defogging effect of different algorithms on O-Haze dataset

Fig. 11
Fig. 11 Quantitative comparison chart of single images in the O-Haze dataset.

Table 1
Output dimensions of the overall network

Table 2
Comparison table of evaluation metrics of different algorithms on SOTS dataset

Table 3
Comparison table of evaluation metrics of different algorithms on D-haze dataset

Table 4
Comparison table of evaluation metrics of different algorithms on O-Haze dataset. As shown in Table 4, the evaluation indexes PSNR and SSIM of the proposed algorithm have the highest average values, 23.15 dB and 0.771, respectively, which again proves its good defogging effect.

Table 5
Comparison table of performance evaluation of different modules

Table 6
Comparison table of performance evaluation of different branch networks