Iterative dual regression network for blind image super-resolution

Previous single-image super-resolution methods assume that the blur kernel is known (e.g., bicubic) when degrading high-resolution (HR) images to low-resolution (LR) images, and they train a model on this single degradation to restore HR images. However, the actual degradation in the real world is often unknown, making it difficult to handle LR images produced by different degradations. To cope with this situation, previous methods attempt to restore SR images by combining a blur kernel estimation structure with a non-blind SR network. Two problems deserve serious consideration: (1) for accurate blur kernel estimation, insufficient correlation between consecutive kernels leads to unsatisfactory reconstruction results; (2) for the ill-posed image reconstruction problem, a more efficient constraint is worth exploring. To solve these two problems, we propose an iterative dual regression network for adaptive and precise blur kernel estimation, which improves the speed of kernel estimation by learning a dual mapping. Specifically, we design a Predictor-Generator structure: over several iterations, the Predictor searches for accurate kernels through intermediate kernels and generated SR images, while the Generator produces the final SR images with the help of the predicted kernels. More importantly, the elaborately designed dual learning strategy not only provides an additional constraint for accurate kernel estimation but also reduces the domain gap between SR images and HR images. Experiments on synthetically degraded images and real-world images show that our network is competitive in quantitative performance and superior in visual results.


Introduction
As a classic low-level vision task, single-image super-resolution (SISR) refers to restoring a plausible, sharply detailed HR image from its LR counterpart. It is widely used in various image tasks, such as visible images [1], infrared images [2], and medical images [3]. The introduction of convolutional neural networks (CNNs) has made SISR flourish. Many CNN-based methods [1,[4][5][6][7][8][9][10] have innovatively designed network architectures and training strategies to reach new levels of performance. To process images in real time, some lightweight networks [11][12][13] have also emerged. These methods assume that the blur kernel is known (e.g., bicubic) when degrading HR images to LR images. However, the degradation in actual applications is much more complex: it is unknown and varies from image to image due to shooting device parameters, external complications, etc. In addition, when the degradation of LR images deviates from the assumption [14], there is a large domain gap between the SR results and the desired HR images, which leads to a severe performance drop. Therefore, to alleviate this performance drop, we should focus on the case of an unknown blur kernel, i.e., blind SR.
In blind SR, the optimization of the unknown blur kernel is particularly important. To make this problem easier, previous methods such as IKC [15] trained a kernel estimation structure combined with a non-blind SR network to recover SR images. DAN [16] iteratively optimized the blur kernel and generated the final SR output in the last iteration. However, the inference speed of these methods is slow because of the relatively large number of iterations, and their estimates of the blur kernel are still not accurate enough.
In this paper, we propose an iterative dual regression network (IDRN) for adaptive and precise kernel estimation, which speeds up the estimation of the blur kernel.
Specifically, we design a Predictor-Generator structure: over several iterations, the Predictor searches for accurate kernels through intermediate kernels and generated SR images, while the Generator produces the final SR images with the help of the predicted kernels. More importantly, we integrate dual learning into the kernel estimation structure to obtain an additional constraint, which helps us predict the blur kernel more accurately. As the error between the estimated kernel and the ground-truth kernel decreases, the domain gap between the SR image produced with an inaccurate kernel and the HR image is gradually reduced. Accordingly, an accurate blur kernel in turn helps us generate better SR images. In experiments, our method achieves competitive results on both synthetic and real-world images.
We summarize our contribution in two points.
• We propose an iterative dual regression network (IDRN) to adaptively conduct kernel estimation. Through the elaborately designed Predictor-Generator structure, the speed of kernel estimation is significantly improved and the reconstruction results are more satisfying.
• We propose a dual learning strategy to optimize the reconstruction accuracy of SR images. In this way, both the accuracy of kernel estimation and the ability to reduce the domain gap are improved.

Blind SR
Blind SR assumes the degradation kernel is unknown. ZSSR [17] proposed an unsupervised super-resolution algorithm based on a single image: it exploits the repetitive nature of information within the image by extracting samples from the input image itself and training a small image-specific CNN applicable to the SR of that image. DGDML-SR [18] utilized the internal depth information of an image for super-resolution training on individual images without the need for external datasets. Through generative adversarial networks (GANs), KernelGAN [19] learned the degradation process from the image's own patch distribution instead of assuming bicubic degradation by default. However, these methods that rely on an image's own patches are quite time-consuming. To design more reasonable blind SR networks, a sequential combination of a kernel estimation method and a non-blind SR method is usually used to recover HR images. As pointed out in IKC [15], the SR results of these methods are usually sensitive to the provided blur kernels: when the estimated blur kernel is inaccurate, a domain gap appears between the SR images and the HR images, causing performance degradation. Thus, kernel estimation methods are an important component of blind SR. In IKC [15], a spatial feature transform (SFT) layer was proposed and inserted into each residual block to better preserve the details of the LR images, which allows the blur kernel information to be preserved in a deeper network. Additionally, a kernel estimation network was trained separately to estimate the degradation of the current LR image and thereby provide better performance. In the same vein, DAN [16] proposed a conditional residual block (CRB), which concatenates the stretched kernel and the LR image at the beginning of the residual block. However, the blur kernels estimated by these methods are still not accurate enough.
Moreover, during inference, when the number of iterations exceeds the number of iterations used in training, the performance drops significantly.
To address the non-differentiability of the perceptual metric NIQE, AMNet-RL [20] incorporates reinforcement learning into a blind SR model. Yamac et al. [21] proposed KernelNet, a modular and interpretable neural network structure for the blind SR kernel estimation problem: by estimating the blur kernel of the LR image and then upsampling it via self-convolution, a more accurate kernel estimate is obtained. DASR [22] adopts a contrastive learning strategy to distinguish various degradations in the representation space by learning abstract representations instead of performing explicit estimation in the pixel space, but learning implicit degradations in this way incurs a higher training cost. AdaTarget [23] can improve SR quality, but it still lags behind dedicated blind SR methods.

Dual learning
Dual learning methods [24,25] contain a primal model and a dual model that learn two opposite mappings simultaneously. Previous methods [25,26] typically used this scheme to perform image translation in the absence of paired training data. Recently, it has also been used to train SR networks in a semi-supervised manner: both unpaired and paired data are used for training, and closed loops are used to reduce the space of possible SR functions. However, this approach is still limited to SR with bicubic downsampling and has seen little application in blind SR. We apply dual learning to blind SR. On the one hand, the accuracy of the blur kernel can be improved during kernel prediction; on the other hand, the domain gap between SR images and HR images can be reduced, and the artifacts caused by random blur kernels can be alleviated.

Problem formulation
Actually, the degradation process from an HR image to an LR image can be expressed as a combination of blurring, decimation, and noise; mathematically,

y = (x ⊗ k) ↓_s + n,

where y is the degraded LR image, x is the HR image, ⊗ denotes the convolution operation, k denotes the blur kernel, ↓_s represents the bicubic downsampler with scale factor s, and n refers to additive white Gaussian noise. In research, blur kernels are generally divided into regular and irregular ones, such as isotropic and anisotropic Gaussian blur kernels. The isotropic blur kernel is more commonly used because it lends itself better to qualitative and quantitative study. For simplicity, this paper focuses on isotropic Gaussian blur kernels.
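The degradation model above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's code: the bicubic downsampler is approximated by strided decimation for brevity, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def make_isotropic_gaussian(size=21, sigma=2.0):
    """Build a normalized isotropic Gaussian blur kernel."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    xx, yy = torch.meshgrid(ax, ax, indexing="ij")
    k = torch.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def degrade(x, k, scale=4, noise_sigma=0.0):
    """y = (x conv k) downsampled by `scale`, plus Gaussian noise."""
    pad = k.shape[-1] // 2
    weight = k.view(1, 1, *k.shape)          # conv weight: (out, in, h, w)
    blurred = F.conv2d(F.pad(x, (pad,) * 4, mode="reflect"), weight)
    lr = blurred[..., ::scale, ::scale]      # bicubic in the paper; strided
                                             # decimation here for brevity
    return lr + noise_sigma * torch.randn_like(lr)

x = torch.rand(1, 1, 64, 64)                 # toy single-channel HR image
y = degrade(x, make_isotropic_gaussian(21, 2.0), scale=4)
print(y.shape)                               # torch.Size([1, 1, 16, 16])
```

Setting `noise_sigma=0.0` reproduces the noise-free setting used for training; the noisy experiments correspond to a positive `noise_sigma`.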

Our method
We design a Predictor-Generator structure in our blind SR framework. It consists of two main modules, the Generator G and the Predictor P. Over several iterations, the Predictor searches for accurate kernels through intermediate kernels and generated SR images, and the Generator generates SR images with the help of the estimated kernel.
We iterate the Predictor and Generator T times. To balance time and performance, T is set to 3 in this paper. When T = 0, we use PCA to reduce the blur kernel predicted by ConditionNet, denoted as the initial kernel k_0. The Generator generates the SR image I_0 based on k_0, and the Predictor predicts k_1 with the help of I_0 and k_0. Without loss of generality, at iteration i, the Generator generates the SR image I_{i-1} based on k_{i-1}, and the Predictor predicts k_i from the combination of I_{i-1} and k_{i-1}. Finally, the Generator outputs the final SR image. The structure of the whole framework is shown in Fig. 1.
In addition, two processing modules, the ConditionNet C and the Downscaler D, are included. The ConditionNet predicts initial kernels from LR images. The Downscaler is embedded in the Predictor to reconstruct the blurred and reduced (BR) images. We minimize the difference between the BR images and the LR images to obtain an additional constraint on the blur kernel, which helps us predict the blur kernels more accurately. The pseudo-code is shown in Algorithm 1.
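The iterative scheme described above can be summarized structurally as follows. This is a skeleton sketch: the ConditionNet C, Generator G, and Predictor P are stood in by toy callables, and a 10-dimensional PCA-reduced kernel is an assumed size, not the paper's specification.

```python
import torch
import torch.nn as nn

class IDRNLoop(nn.Module):
    """Structural sketch of the Predictor-Generator iteration."""
    def __init__(self, C, G, P, T=3):
        super().__init__()
        self.C, self.G, self.P, self.T = C, G, P, T

    def forward(self, lr):
        k = self.C(lr)                  # initial (PCA-reduced) kernel k_0
        for _ in range(self.T):
            sr = self.G(lr, k)          # SR image from the current kernel
            k = self.P(sr, k)           # refined kernel from SR image + kernel
        return self.G(lr, k), k         # final SR uses the last kernel

# Toy stand-ins: C maps LR -> 10-dim reduced kernel; G upscales x4.
C = lambda lr: torch.zeros(lr.size(0), 10)
G = lambda lr, k: nn.functional.interpolate(lr, scale_factor=4)
P = lambda sr, k: k
sr, k = IDRNLoop(C, G, P, T=3)(torch.rand(1, 3, 16, 16))
print(sr.shape)                         # torch.Size([1, 3, 64, 64])
```

With T = 3 as in the paper, the Generator runs once more after the loop to produce the final output from the last predicted kernel.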

Dual regression
We assume that x ∈ X is the LR image and y ∈ Y is the HR image. The primal mapping P can be denoted as X → Y such that the prediction P(x) has a distribution similar to that of the corresponding HR image y; that is, the HR image is reconstructed by learning the primal mapping P. The dual mapping D is denoted as Y → X such that the prediction D(y) has a distribution similar to that of the original input LR image x. Specifically, we use the bicubic downsampler and the kernel predicted by the Predictor to reconstruct the SR image at the size of the LR image; we refer to this reconstructed LR image as the BR image. Meanwhile, we use an L1 loss to minimize the difference between the BR image and the LR image. As the generated BR image gets closer to the original LR image, the predicted kernel also gets closer to the ground-truth kernel. This shows that the dual regression network provides additional supervision and helps the Generator and Predictor work well together.
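The dual constraint amounts to an L1 loss between the BR image and the original LR input. A minimal sketch, with the kernel-and-bicubic Downscaler approximated by average pooling for brevity (an assumption, not the paper's operator):

```python
import torch
import torch.nn.functional as F

def dual_regression_loss(sr, lr, scale=4):
    """L1 between the downscaled SR image (BR) and the LR input."""
    br = F.avg_pool2d(sr, kernel_size=scale)   # stand-in for D(y)
    return F.l1_loss(br, lr)

sr = torch.rand(1, 3, 64, 64)
lr = F.avg_pool2d(sr, 4)                       # perfectly consistent pair
print(dual_regression_loss(sr, lr).item())     # 0.0
```

When the SR image and predicted kernel are exactly consistent with the LR input, the loss vanishes; any mismatch in the predicted kernel shows up as a positive dual loss, which is the extra supervision signal described above.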

Algorithm 1 Iterative Dual Regression Network
Require: the LR image I_LR
Require: the max iteration number T
1: InitialKernel ← C(I_LR) (initialize the kernel estimation)
2: k_0 ← PCA(InitialKernel) (reduce the kernel using principal component analysis (PCA))
3: for i = 1 to T do
4:    I_{i-1} ← G(I_LR, k_{i-1}) (generate the SR image)
5:    k_i ← P(I_{i-1}, k_{i-1}) (predict the refined kernel)
6: end for
7: return G(I_LR, k_T) (output the final SR image)

Iterative dual regression network
ConditionNet. To obtain better SR results, we employ ConditionNet [27] to predict an initial kernel from the LR image. As seen in Fig. 1, the ConditionNet contains two global pooling layers: the first halves the spatial size of the feature maps, and the last gathers the feature maps into a 1 × 1 spatial size. Then, we aggregate the spatial information to obtain the predicted kernel.
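A rough sketch of such a pooling-based kernel predictor is given below. The channel counts, layer depth, and 10-dimensional reduced-kernel output are illustrative assumptions, not the exact configuration of ConditionNet [27]; only the two-pooling structure follows the description above.

```python
import torch
import torch.nn as nn

class ConditionNetSketch(nn.Module):
    """Convolutions with two pooling stages ending in a 1x1 global pool."""
    def __init__(self, in_ch=3, kernel_dim=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.AvgPool2d(2),                    # first pool: halve spatial size
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),            # last pool: gather to 1 x 1
        )
        self.head = nn.Linear(64, kernel_dim)   # reduced-kernel prediction

    def forward(self, lr):
        f = self.features(lr).flatten(1)        # (N, 64) aggregated features
        return self.head(f)                     # (N, kernel_dim)

k0 = ConditionNetSketch()(torch.rand(1, 3, 32, 32))
print(k0.shape)                                 # torch.Size([1, 10])
```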
Generator. The unfolded structure of the Generator is shown in Fig. 1. The reduced kernel and the LR image are mapped into feature maps with a consistent number of channels through a single convolutional layer; in this paper, the number of channels is set to 64. The body of the Generator contains the residual group (RG) only. The RG contains 10 feature modulation layers (FMLayer) and a skip connection. Finally, we use a PixelShuffle layer to scale the features to the desired size. The FMLayer, shown in Fig. 2a, uses the guidance of the simplified kernel k_{i-1} to adjust the feature map. Specifically, the layer can be represented as

x̂ = φ(k) ⊙ x + ψ(k),

where k denotes the blur kernel at the current iteration, x represents the intermediate feature maps, x̂ is the modulated output, and ⊙ refers to element-wise multiplication, i.e., the Hadamard product. φ and ψ are modulation functions.
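A hypothetical sketch of such a feature modulation layer follows: the reduced kernel is mapped by two small branches, playing the roles of φ and ψ, to a per-channel scale and shift applied to the feature maps, in the spirit of SFT-style modulation. The linear branches, trailing convolution, and all sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FMLayer(nn.Module):
    """Kernel-conditioned feature modulation: x_hat = phi(k) * x + psi(k)."""
    def __init__(self, channels=64, kernel_dim=10):
        super().__init__()
        self.phi = nn.Linear(kernel_dim, channels)   # scale branch
        self.psi = nn.Linear(kernel_dim, channels)   # shift branch
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, k):
        scale = self.phi(k)[..., None, None]         # (N, C, 1, 1)
        shift = self.psi(k)[..., None, None]
        return self.conv(x * scale + shift)          # modulate, then convolve

x = torch.rand(1, 64, 16, 16)                        # intermediate features
k = torch.rand(1, 10)                                # reduced kernel
print(FMLayer()(x, k).shape)                         # torch.Size([1, 64, 16, 16])
```

Because the modulation is per-channel, the kernel information can steer every FMLayer in the residual group without changing the feature map resolution.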
Predictor. We first process the SR images generated by the Generator: the kernel and SR images pass through a depth-wise convolution and the Downscaler to generate the BR images. As mentioned in Sect. 3.3, BR images are matched against LR images, which means they should be similar in distribution. The front part of the main body of the Predictor consists of residual blocks (RB) only. In particular, to better aggregate spatial information, the tail part is also composed of a ConditionNet [27]. The predicted kernel is transformed by PCA to generate a reduced kernel, which is fed into the Generator to form a loop.
(Table caption: the comparison is conducted using Gaussian8; DAN-3 means three iterations and DAN-4 means four iterations. The best two results are highlighted in bold and italics, respectively.)

Dataset and training details
Dataset. Following [15,16,22,28], we use 800 HR images in DIV2K [29] and 2650 HR images in Flickr2K [7] as the training set. We crop the HR images to patches of 256 × 256.
In this paper, we adopt isotropic Gaussian blur kernels, which makes it possible to compare different blur kernels quantitatively; in addition, isotropic Gaussian blur kernels help us explore the impact of different kernel widths on the images. We set the range of the kernel width σ to [0.2, 4.0], and the kernel size is fixed to 21 × 21 for scale factor 4. We train our network on noise-free degradations with isotropic Gaussian blur kernels; on this basis, we further add different levels of noise to cope with more complex situations. For quantitative evaluation, we use four benchmark datasets: Set5 [30], Set14 [31], BSD100 [32], and Urban100 [33]. During testing, we also define a kernel setting that allows a fair comparison: for scale factor 4, the range of the kernel width is set to [1.8, 3.2] and the kernel size is again fixed to 21 × 21, denoted as Gaussian8 [15].
Training. We crop the LR training samples into 64 × 64 patches and set the batch size to 32. All models are trained for 6 × 10^5 iterations. We use Adam [34] as the optimizer, with β_1 = 0.9 and β_2 = 0.99. The initial learning rate is 2 × 10^-4 and is halved every 1.5 × 10^5 iterations. The overall loss function is defined as follows.
L = L_H + L_L + L_K,
where L_H = L1(HR, SR), L_L = L1(LR, BR), and L_K is the L1 loss between the ground-truth kernels and the predicted kernels. We train all models on 4 TITAN XP GPUs.
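The three-term objective can be sketched directly from the definitions above. Equal weighting of the terms is an assumption here, since explicit weights are not stated; tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def total_loss(sr, hr, br, lr, k_pred, k_gt):
    """L = L_H + L_L + L_K, each term an L1 loss."""
    l_h = F.l1_loss(sr, hr)        # SR vs. ground-truth HR
    l_l = F.l1_loss(br, lr)        # dual constraint: BR vs. input LR
    l_k = F.l1_loss(k_pred, k_gt)  # predicted vs. ground-truth kernel
    return l_h + l_l + l_k

hr = torch.rand(1, 3, 64, 64); sr = hr.clone()       # perfect reconstruction
lr = torch.rand(1, 3, 16, 16); br = lr.clone()       # perfect dual consistency
k = torch.rand(1, 10)
print(total_loss(sr, hr, br, lr, k, k).item())       # 0.0
```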

Experimental results
On synthetic test images, we evaluate our method using the Gaussian8 kernels. As shown in Fig. 4, we present visual comparisons of different methods. We mainly compare our results with IKC [15], DAN [16], AdaTarget [23], and DASR [22]. IKC is a two-step solution and has disadvantages such as a long training time and less accurate blur kernel estimation. In DAN, there is no correlation between the blur kernels estimated in each iteration, so its performance is still insufficient. The performance of AdaTarget and DASR is also not good enough. Our method achieves better performance at a faster speed. The results are reported in Table 1. Under different levels of noise degradation, real-world-oriented methods such as BSRGAN [35], Real-ESRGAN [36], and SwinIR [37] do not dominate on Gaussian8 or on noisy random degradations. As shown in Table 2, our method still performs well under noise degradation.
To verify the generalization of our method, we test IDRN on real-world images. Since real-world images have no ground truth, we only compare the visual results of different methods. BSRGAN [35], Real-ESRGAN [36], and SwinIR [37] are designed to handle real-world images; SwinIR uses the backbone of BSRGAN. The visualization results are shown in Fig. 5. It can be observed that the DAN result is cleaner than that of IKC. Note that BSRGAN is designed for real-world images and produces a cleaner result than DAN, but in this real image, the letter "W" restored by BSRGAN is clearly distorted in the middle. Real-ESRGAN and SwinIR still exhibit a few artifacts. This shows that although these GAN-based methods improve the visual effect with a perceptual loss, artifacts remain a defect of these methods. The IDRN result, by contrast, is not significantly distorted. This shows that although our IDRN is trained on synthetic images rather than on a multi-degradation pool like BSRGAN, it still has some generalization ability in practical applications.

Quantitative comparison
Experimentation with the number of iterations. We compare the number of iterations against DAN. The PSNR results on Set5 are shown in Fig. 3, where the blue line represents DAN and the red line our method. As the number of iterations increases from 1 to 5, DAN reaches stable performance at the 4th iteration. Interestingly, with only three training iterations, our model achieves the highest PSNR; as seen in Fig. 3, the PSNR remains stable at the fourth and fifth iterations. In other words, our method converges faster and is more robust.
To quantitatively compare inference time, model parameters, and computation, we evaluate different methods on the same platform. We select 40 images synthesized with Gaussian8 kernels from Set5 as test images, all evaluated on the same platform with an RTX 2080Ti GPU. As shown in Table 3, our model requires less inference time than the other blind SR methods. We choose IKC [15] and DAN [16] as comparison methods. The average inference time of our method is only 0.58 seconds, while IKC spends much time on iteration and is on average 7 times slower than our method per image. Note that although our model has slightly more parameters than DAN, our method reduces computation by 25% and inference time by 22%. This comes down to the reduced number of iterations, which effectively removes redundant computation. The comparison shows that IDRN is superior to both methods in PSNR and inference speed.

Ablation experiments
To verify the effectiveness of our method, we conduct ablation experiments using DAN as the baseline. The proposed network has three important modules: (1) dual learning, (2) the ConditionNet in the head, and (3) the ConditionNet in the tail. The results are shown in Table 4. Model 1 adds only the dual learning module to DAN. Although the improvement at 4 iterations is relatively small, the performance improves by 0.2 dB at 3 iterations. This means we can use fewer iterations to achieve performance that would otherwise require more iterations, thus increasing the inference speed. In other words, the dual learning module contributes little to the overall performance ceiling, but it is effective in reducing the number of iterations. Model 2 adds the ConditionNet in the head and still improves performance at three iterations; we think the degradation information that a single ConditionNet can extract is limited. Model 3, which is the full IDRN, also adds the ConditionNet in the tail of the network, coupling the information of the head ConditionNet during the blur kernel prediction phase. This helps the overall network performance. With both the head and tail ConditionNets added, performance no longer changes between 3 and 4 iterations, which means we have found the optimal number of iterations to be 3. This saves considerable inference and training time compared with DAN.

Conclusion
In this paper, we proposed an iterative dual regression network for adaptive and precise kernel estimation, which improves kernel correlation and estimation speed by learning a dual mapping. Specifically, we designed a Predictor-Generator structure: over several iterations, the Predictor searches for accurate kernels through intermediate kernels and generated SR images, while the Generator produces the final SR images with the help of the predicted kernels. Experimental results show that our network performs well on both synthetic and real-world images. In the future, if these two parts of IDRN can be implemented with lightweight networks, we believe its practicality can be further improved.