Multi-Scale Cross-Fusion for Arbitrary Scale Image Super Resolution

Deep convolutional neural networks (CNNs) have brought great improvements to single image super resolution (SISR). However, most existing SISR pre-trained models can only reconstruct low-resolution (LR) images at a single scale, and their upsampling factors cannot be non-integers, which limits their application in practical scenarios. In this paper, we propose a multi-scale cross-fusion network (MCNet) to accomplish the super-resolution task of images at arbitrary scales. Specifically, we construct a multi-scale cross-fusion module (MSCF) to enrich spatial information and remove redundant noise, which uses deep feature maps of different sizes for interactive learning. Extensive experiments on four benchmark datasets show that the proposed method obtains better super-resolution results than existing arbitrary-scale methods in both quantitative evaluation and visual comparison.


Introduction
Image super resolution is a fundamental image processing technology that aims to generate high-resolution (HR) images from degraded low-resolution (LR) images. In recent years, single image super resolution (SISR) methods based on deep convolutional neural networks (CNNs) have developed significantly compared with conventional SISR models [1-14], and have been widely applied in fields such as medical imaging [15,16] and satellite imaging [17]. However, most existing SISR pre-trained models can only restore LR images at a single scale, so a separate model must be trained for each scale, which consumes additional computing resources. In addition, the fact that upsampling factors can only be integers limits their application in real-world scenarios.
To overcome these problems, the upsampling network has been redesigned. Lim et al. [18] developed a multi-scale deep super resolution architecture (MDSR), which uses three different upsampling branches (×2, ×3, ×4) to generate HR images of different sizes from degraded images within the same model. To extend the scale factor to the non-integer domain, Hu et al. [19] proposed a new method for image reconstruction at arbitrary scales, called the magnification-arbitrary network (Meta-SR), which uses several fully connected layers to predict the corresponding filter weights for each pixel. Compared with traditional single-scale upsampling modules, these arbitrary-scale upsampling networks offer better adaptability and flexibility. However, they completely ignore the importance of the backbone network for image restoration at arbitrary scales: reusing a traditional SISR backbone for reconstruction limits the performance of arbitrary-scale image super resolution.
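To make the meta-upscaling idea concrete, the following is a minimal PyTorch sketch of a weight-prediction head in the spirit of Meta-SR; the `(rel_x, rel_y, 1/scale)` input encoding, layer sizes, and class name are illustrative assumptions rather than the exact module from [19].

```python
import torch.nn as nn

class WeightPredictor(nn.Module):
    """Hypothetical Meta-SR-style head: an MLP maps per-pixel coordinates
    and the (possibly non-integer) scale to upsampling filter weights."""
    def __init__(self, in_channels=64, ksize=3, hidden=256):
        super().__init__()
        out_dim = in_channels * ksize * ksize  # one filter per output pixel
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden),              # input: (rel_x, rel_y, 1/scale)
            nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, coords):
        # coords: (num_output_pixels, 3) -> per-pixel filter weights
        return self.mlp(coords)
```

Because the filter weights are predicted from continuous coordinates rather than stored per scale, a single module can serve any scale factor, including non-integer ones.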
In this letter, we design a novel multi-scale cross-fusion network (MCNet), which performs excellently in arbitrary-scale reconstruction. Specifically, we place a powerful multi-scale cross-fusion module (MSCF) after the backbone network to enrich the spatial information and remove redundant noise from the deep features. In our MSCF, deep feature maps of different sizes learn interactively from each other. Experiments on four benchmark datasets demonstrate the highly advantageous performance of our MCNet.
The main contributions of this paper are as follows: 1) We propose a novel multi-scale cross-fusion network (MCNet), which not only removes blurring artifacts for efficient and accurate image reconstruction but also delivers state-of-the-art results compared with other SR methods. 2) To further improve feature representation ability, we propose a multi-scale cross-fusion module (MSCF) placed after the backbone network. The module consists of two basic components: a) a multi-downsampling convolution layer (MDConv), which uses convolutional layers with different kernel sizes to generate smaller feature maps, and b) a dual spatial mask (DSM), which learns interactive information from features at different scales.
2 Proposed Method
First, the extracted feature $F_d$ is obtained by applying a 3 × 3 convolutional layer and an existing SISR backbone network to the input LR image, i.e.,

$$F_d = E_{\varphi}\big(\mathrm{Conv}_{3\times3}(I_{LR})\big),$$

where $E_{\varphi}$ denotes the backbone network with multiple stacked residual blocks [18]. The second part of the MCNet framework is our proposed MSCF, which makes a significant contribution to generating clean and abundant features $F_{rid}$, given by

$$F_{rid} = Q(F_d),$$

where $Q(\cdot)$ denotes the MSCF and is described in more detail in the following subsection.
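As an illustration, a minimal PyTorch sketch of this first stage is given below, assuming an EDSR-style backbone of stacked residual blocks [18]; the channel width and block count are illustrative, not the paper's exact configuration.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain residual block, as used in EDSR-style backbones."""
    def __init__(self, n_feats):
        super().__init__()
        self.conv1 = nn.Conv2d(n_feats, n_feats, 3, padding=1)
        self.conv2 = nn.Conv2d(n_feats, n_feats, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return x + self.conv2(self.act(self.conv1(x)))

class FeatureExtractor(nn.Module):
    """F_d = E_phi(Conv3x3(I_LR)): shallow conv followed by the backbone."""
    def __init__(self, n_feats=64, n_blocks=16):
        super().__init__()
        self.head = nn.Conv2d(3, n_feats, kernel_size=3, padding=1)
        self.body = nn.Sequential(*[ResBlock(n_feats) for _ in range(n_blocks)])

    def forward(self, lr):
        x = self.head(lr)    # 3x3 convolution on the LR input
        return self.body(x)  # deep feature F_d from the backbone E_phi
```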
In the upsampling network, we incorporate scale information for image reconstruction by adding a new scale-guidance (SGU) module in another branch, which accomplishes a tailored image restoration task for our SR model. After the enrichment of features, $F_{rid}$ and its mapped coordinate $C$ in the HR image space are used in the next stage, the image upsampling network. Similar to LTE [21], an HR image $Y$ is generated through a continuous image upsampling module with the local texture estimator $G_{lte}$, i.e.,

$$Y = \sum_{i} W_i \, G_{lte}\big(F_{rid}^{\,i}, C\big),$$

where $i$ is the index of an offset latent code around $F_{rid}$ and $W_i$ is the corresponding weight of each coordinate. Consider a set $\{(I_{LR}^{i}, I_{HR}^{i})\}_{i=1}^{N}$ that contains $N$ LR-HR pairs, where $I_{LR}^{i}$ is an input LR image and $I_{HR}^{i}$ is the corresponding ground-truth (GT) image. We choose the $L_1$ loss function

$$L(\Omega) = \frac{1}{N}\sum_{i=1}^{N}\big\|\mathrm{MCNet}(I_{LR}^{i}) - I_{HR}^{i}\big\|_1$$

to optimize our network during training, where $\Omega$ denotes the set of learnable parameters in our proposed model.
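As a concrete illustration, the sketch below wires up a local-ensemble query and the $L_1$ training objective in PyTorch; `g_lte`, the latent/weight handling, and all shapes are simplified assumptions rather than the paper's implementation.

```python
import torch.nn.functional as F

def local_ensemble(latents, weights, coords, g_lte):
    # Y = sum_i W_i * G_lte(F_rid^i, C): blend the predictions decoded from
    # the offset latent codes around each query coordinate C.
    preds = [g_lte(z, coords) for z in latents]        # one prediction per code i
    return sum(w * p for w, p in zip(weights, preds))  # weighted sum over i

def training_step(model, optimizer, lr_batch, hr_batch, coords):
    # L1 objective: mean absolute error between MCNet(I_LR) and I_HR.
    optimizer.zero_grad()
    pred = model(lr_batch, coords)    # arbitrary-scale prediction Y
    loss = F.l1_loss(pred, hr_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```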

Multi-Scale Cross-Fusion Module (MSCF)
To further improve the quality of the reconstructed images from the backbone network, we design a powerful module consisting of a multi-downsampling convolutional architecture (MDConv) and a dual spatial mask (DSM). Referring to Fig. 2, in the MDConv module a set of convolutional layers is applied to downsample the deep features $F_d$ delivered by the SR backbone network; that is,

$$F_{td}^{k} = \mathrm{MDConv}_k(F_d), \quad k \in \left\{\tfrac{1}{8}, \tfrac{1}{4}, \tfrac{1}{2}, 1\right\},$$

where $k$ represents the downsampling factor and $F_{td}^{k}$ is the downsampled feature at a specific scale, which contains more plentiful global features of the image. By performing interpolation in space and concatenation in channel, the generated feature $F_{td}^{k}$ is used to redefine the new feature map $F_{cd}^{k}$. MDConv thus provides feature maps with different receptive fields and structural information for the next step. Then, the multi-scale features $F_{cd}$ are fed in succession into the dual spatial mask (DSM), the second sub-module of our MSCF, performing the communication as follows:

$$(D_k, C_k) = \mathrm{DSM}_i\big(F_{cd}^{k},\; C_{k/2}\big), \qquad F_{cd}^{k} = \mathrm{Coi}\big(F_{td}^{k}\big),$$

where $D_k$ and $C_k$ are the corresponding outputs of the DSM module, $C_{k/2}$ is the interactive feature passed on from the previous (coarser) scale, and $\mathrm{Coi}$ represents the corresponding operations of interpolation and concatenation. The operator $\mathrm{DSM}_i(\cdot)$ denotes our dual spatial mask, which learns attention weights from two feature maps at different scales; its detailed structure is

$$D = F \odot \mathrm{SM}(C\!\uparrow_2), \qquad C' = (C\!\uparrow_2) \odot \mathrm{SM}(F),$$

where $F$ and $C$ denote the two inputs of the mask module, $\uparrow_2$ is the ×2 upsampling operation, and $\mathrm{SM}(\cdot)$ is the spatial gate mechanism. Note that the two inputs of different sizes are adjusted to the same shape inside the DSM. $D_k$ serves as part of the final output of the MSCF, while $C_k$ is used for interactive learning in the next sub-module; the two streams thus learn additional textures and structures from each other.
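The sketch below gives one plausible PyTorch realization of MDConv and DSM; strided convolutions stand in for the multi-downsampling layers, sigmoid-gated convolutions stand in for the spatial gate $\mathrm{SM}(\cdot)$, and all kernel sizes and channel counts are assumptions rather than the paper's specification.

```python
import torch.nn as nn
import torch.nn.functional as F

class MDConv(nn.Module):
    """Produce features at scales 1, 1/2, 1/4, 1/8 via strided convolutions."""
    def __init__(self, n_feats=64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(n_feats, n_feats, kernel_size=3, stride=s, padding=1)
            for s in (1, 2, 4, 8)  # stride s realizes downsampling factor 1/s
        ])

    def forward(self, f_d):
        return [branch(f_d) for branch in self.branches]

class DSM(nn.Module):
    """Dual spatial mask: each input gates the other after shape alignment."""
    def __init__(self, n_feats=64):
        super().__init__()
        self.sm_f = nn.Sequential(nn.Conv2d(n_feats, 1, 3, padding=1), nn.Sigmoid())
        self.sm_c = nn.Sequential(nn.Conv2d(n_feats, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, f, c):
        # Align the coarser input to f's spatial size (the x2 upsampling in the text).
        c_up = F.interpolate(c, size=f.shape[-2:], mode='bilinear',
                             align_corners=False)
        d = f * self.sm_c(c_up)       # D  = F (.) SM(C up2)
        c_out = c_up * self.sm_f(f)   # C' = (C up2) (.) SM(F)
        return d, c_out
```

Chaining DSM from the coarsest branch to the finest lets global structure from the small maps refine local texture in the large ones, which is the cross-fusion behavior described above.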
3 Experimental Results

Performance Evaluation
Three state-of-the-art SR networks are compared with our proposed MCNet: Meta-SR [19], LIIF [20], and LTE [21]. Table 3 reports the Peak Signal-to-Noise Ratio (PSNR) values on the four benchmark datasets; our proposed MCNet significantly outperforms EDBNet on the Urban100 dataset. Specifically, compared with the EDBNet model, our MCNet shows PSNR improvements at medium scales. Furthermore, we show a visual comparison in Fig. 3. For the challenging details in "img044" and "img054", most previous works lose crucial details when restoring the images; in contrast, our MCNet achieves better results by recovering more detailed components. In addition, as shown by the cost comparison of four arbitrary-scale image super-resolution models in Table 1, the MCNet model requires only a small amount of additional computational resources. In the ablation study (Table 2), the full model achieves further improvement, particularly for upsampling scales within the training distribution, which is consistent with our motivation. To confirm the effectiveness of MDConv, we also compare MCNet with MCNet(-M); the full model enhances quality for both in-scale and out-of-scale factors.
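For reference, PSNR as used in such comparisons has a standard definition; the paper does not include its evaluation script, so the snippet below is the common formulation rather than the authors' code.

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio for images scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```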

Conclusion
In this letter, we propose a novel multi-scale cross-fusion network (MCNet) that equips existing SISR networks with arbitrary scaling factors. The multi-scale cross-fusion module (MSCF) removes redundant noise from deep feature maps and provides abundant spatial embeddings for subsequent image restoration. Comprehensive evaluation demonstrates that our MCNet achieves superior performance compared with state-of-the-art arbitrary-scale methods.

Fig. 1
Fig. 1 The network structure of our proposed MCNet, which contains three main parts: 1) a feature extraction network, 2) a cross-fusion module, and 3) an image reconstruction network.

Fig. 2
Fig. 2 Architecture of the multi-scale cross-fusion module (MSCF).

Fig. 3
Fig. 3 Qualitative comparison of different methods on the Urban100 dataset.

Table 1
Memory usage and time consumption compared with other arbitrary-scale SR models for ×2 upsampling.

Table 2
An Ablation Investigation of Three Variants on the Urban100 Dataset.