Lightweight Image Super-Resolution with ConvNeXt Residual Network

Single image super-resolution (SR) based on convolutional neural networks has been very successful in recent years. However, the computational cost of these models is too high for resource-constrained devices, so a central challenge for existing approaches is to balance the complexity of the CNN model against the quality of the resulting SR. To address this problem, various lightweight SR networks have been proposed. In this paper, we propose a lightweight and efficient residual network (IRN). Unlike previous lightweight SR networks, which aggregate more powerful features by improving feature utilization through complex layer-connection strategies, our main idea is to simplify feature aggregation by using simple and efficient residual modules for feature learning, thus achieving a good trade-off between the computational cost of the model and the quality of the resulting SR. In addition, we revisit the impact of the activation function and observe that different activation functions affect model performance. Experimental results show that IRN outperforms previous state-of-the-art methods on benchmark tests while maintaining a relatively low computational cost. The code will be available at https://github.com/kptx666/IRN.


Introduction
This paper focuses on the problem of single image super-resolution (SR). Image super-resolution is a classic low-level vision task in computer vision with a wide range of applications in security, surveillance, satellite, and medical imaging, and it can serve as a built-in module for other image restoration or recognition tasks. Single-image super-resolution refers to the reconstruction of visually appealing high-resolution (HR) images from corresponding low-resolution (LR) images. In the past few years, deep learning has greatly advanced the development of SR, and many deep neural network-based image SR methods have been proposed with great success. For example, Dong et al. [1] first proposed a super-resolution convolutional neural network with only three layers, SRCNN, which outperformed previous non-deep-learning methods. Subsequently, inspired by the success of deep convolutional neural networks [2] in ImageNet classification, researchers proposed deeper and more complex architectures to improve the performance of SR methods. Kim et al. [3,4] pushed the depth of SR networks to 20 layers and achieved better performance than SRCNN. The EDSR network [5] reached a depth of more than 160 layers. These works further demonstrated that deeper models are more beneficial to SR performance. Although these SR networks greatly improved the quality of reconstructed images, their memory consumption and computational cost were huge, making them difficult to deploy to resource-constrained devices such as mobile phones. Therefore, improving the efficiency of SR models and designing lightweight models has become critical.
To address these problems, a number of lightweight super-resolution models have been proposed. Ahn et al. [6] proposed CARN-M, a lightweight and efficient cascaded residual network with multiple residual connections, but its PSNR was relatively low. Hui et al. [7] proposed the information distillation network (IDN), which achieved better performance with a smaller number of parameters. Subsequently, the information multi-distillation network (IMDN) [8] introduced an information multi-distillation block with a contrast-aware attention layer, further improving on IDN. Wenbo et al. [9] proposed a linear combinatorial pixel-adaptive regression network (LAPAR), which casts direct LR-to-HR mapping learning as a linear-coefficient regression task over a dictionary of multiple predefined filter bases. FALSR [10] employed neural architecture search (NAS) techniques to obtain lightweight super-resolution models. However, these SR models are still not lightweight enough, and their SR performance can be further improved.
To this end, this paper proposes a lightweight residual network (IRN) that better balances model performance and computational cost. It is computationally cheaper than IMDN, LAPAR-A, and FALSR [8-10] while achieving better performance. Unlike most previous small-parameter models, which use recursive structures and information distillation, we design a residual block inspired by the ConvNeXt Block [11], which increases the depth of the network at a small computational cost and thus improves performance. Second, we introduce an effective enhanced spatial attention (ESA) [12] module to improve the SR model's ability to collect various kinds of fine-grained information; specifically, it helps the model exploit features that are more useful for image restoration (e.g., edges, corners, and textures).
The contributions of this paper can be summarized as follows: 1. We introduce the ConvNeXt Block to construct the residual block and demonstrate its effectiveness for SR. 2. We deploy an effective attention module (ESA) to strengthen the model with only a small amount of additional computation. 3. Our proposed IRN integrates the ConvNeXt Block and an effective attention module, which successfully enhances the compactness of the model and reduces the computational cost without sacrificing SR accuracy.

Single Image Super-Resolution
In recent years, with the rapid development of deep learning, more and more deep learning methods have been applied to SR tasks and have greatly improved their performance. Dong et al. [1] first proposed the deep learning-based method SRCNN, a model that achieves better performance than traditional methods despite having only three layers.
Although SRCNN achieves good results, its bicubic interpolation of LR images to the target size before input incurs a large number of redundant computations. The authors subsequently improved SRCNN in FSRCNN [13] by removing this pre-processing and upscaling the image directly at the end of the network with a transposed convolution, reducing the computational cost. To progressively reconstruct higher-resolution images, Lai et al. proposed the Laplacian pyramid super-resolution network (LapSRN) [14]. Other works such as MS-LapSRN [15] and progressive upsampling SR (ProSR) [16] also adopt this progressive upsampling framework and achieve relatively high performance. Shi et al. proposed an efficient sub-pixel convolution layer in ESPCN [17], where LR images are mapped through a series of feature layers and then upscaled to the HR output by a sub-pixel convolution module at the end of the network. Due to the effectiveness of sub-pixel convolution layers, later networks have adopted them as reconstruction modules and obtained better performance. Kim et al. [3,4] built the deeper networks VDSR and DRCN by stacking convolutions with residual connections, reaching a total of twenty layers and improving SR performance. Lim et al. [5] removed the batch normalization (BN) layers from the residual blocks [18] and stacked them to construct the deeper and wider residual networks EDSR and MDSR, achieving significant performance improvements. Zhang et al. [19] proposed RDN, which builds on EDSR by introducing dense connections [20,21] to make full use of the information in all feature layers. They also introduced a channel attention module [22] into the residual block and proposed the very deep residual channel attention network (RCAN) [23], and later introduced a non-local module into the residual block to construct the residual non-local attention network (RNAN) [24] for various image restoration tasks. Guo et al. [25] proposed a dual regression method that improves SR performance by introducing additional constraints. Liang et al. [26] proposed a Transformer architecture for image restoration based on the Swin Transformer [27], while Chen et al. [28] later proposed the SR model HAT, which achieved significant performance improvements.

Efficient SR Models
Although deep learning-based SR methods have achieved great success in terms of performance, their computational cost is too large for resource-constrained devices such as mobile phones. Therefore, many methods have been proposed to reduce the computational cost of SR models. For example, FSRCNN [13] avoided the redundant computation caused by SRCNN's [1] bicubic interpolation of input images by introducing a transposed convolution at the end of the network. DRCN [4] applies recursive networks to SR to reduce the number of parameters by reusing feature information multiple times. Ahn et al. [6] proposed CARN-M, a lightweight and efficient cascaded residual network with multiple residual connections, using grouped convolution to reduce the cost of standard convolution. Hui et al. [7] proposed the information distillation network (IDN), which splits the previously extracted features and processes them separately to reduce computation. The information multi-distillation network (IMDN) [8] improved on IDN by introducing a contrast-aware attention layer, further improving SR performance. The residual feature distillation network (RFDN) [12] revisits the architecture of IMDN, making the network lighter and improving performance through feature distillation connections (FDC) and shallow residual blocks (SRB). Nancy et al. [29] proposed a lightweight multi-scale attention-based image super-resolution network with better performance.

Attention model
The attention mechanism has become an important component for improving the performance of deep neural networks and is widely used in various computer vision tasks, such as image classification. It can be interpreted as focusing on the more useful information in the features. Hu et al. [22] proposed the squeeze-and-excitation (SE) block to exploit inter-channel attention at a low computational cost, improving ResNet and achieving significant performance gains in image classification. Wang et al. [30] improved the efficiency and performance of the module by modifying the fully connected layers in the SE block. CBAM [31] modified the SE block to use both spatial and channel attention.
In recent years, several attention-based SR models have also been proposed and have significantly improved SR performance. Hui et al. [8] proposed IMDN with a contrast-aware channel attention (CCA) mechanism to enhance the ability of SR models to collect various kinds of fine-grained information. Zhang et al. [23] introduced the channel attention mechanism into SR models and proposed RCAN. Dai et al. [32] proposed the second-order attention network (SAN) to explore more powerful feature representations using second-order feature statistics. Wang et al. [33] proposed a non-local module that generates an attention map by computing the correlation matrix between spatial points in the feature map, which is used to guide dense aggregation of contextual information.

Network Architecture
In this section, we describe our proposed lightweight residual network (IRN) in detail; its overall architecture is shown in Fig. 1.
Our IRN consists of three main components: a shallow feature extraction block, multiple stacked residual blocks (IRBs), and a reconstruction module (Fig. 1). We denote I_LR and I_SR as the input and output images of the IRN, respectively. In the first stage, we use a single 3 × 3 convolutional layer for shallow feature extraction, which can be represented as

F_0 = H_ext(I_LR),

where H_ext(·) denotes the convolution operation for shallow feature extraction and F_0 denotes the extracted feature map. We then apply multiple IRBs in a cascade for deep feature extraction, a process that can be represented as

F_n = H_IRB^n(F_{n-1}), n = 1, 2, ..., N,

where H_IRB^n(·) denotes the n-th IRB function and F_n is the n-th output feature map. In addition, we use a 3 × 3 convolutional layer to refine the deep features and then use the reconstruction module to generate I_SR, which can be expressed as

I_SR = H_rm(H_r(F_N)),

where H_rm represents the reconstruction module, consisting of a 3 × 3 convolution with 3 × S^2 output channels (S is the upscaling factor) and a sub-pixel convolution, and H_r represents a 3 × 3 convolution operation. The model is optimized using the L1 loss function, which can be expressed as

L(θ) = ‖H_IRN(I_LR; θ) − I_HR‖_1,

where H_IRN(·) is our IRN, θ denotes the model's learnable parameters, and ‖·‖_1 is the L1 norm.
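The three-stage pipeline above can be sketched in PyTorch as follows. This is a minimal illustration of the data flow only: the channel width is a hypothetical choice, and the IRBs are stood in by plain convolutions (the actual IRB internals are described in the next subsection).

```python
# Sketch of the IRN pipeline: H_ext -> N cascaded IRBs -> H_r -> H_rm.
# Channel width (48) is an assumed value; the IRB bodies here are
# placeholder 3x3 convolutions, not the real residual blocks.
import torch
import torch.nn as nn

class IRNSketch(nn.Module):
    def __init__(self, channels=48, num_irbs=4, scale=2):
        super().__init__()
        # H_ext: single 3x3 conv for shallow feature extraction
        self.ext = nn.Conv2d(3, channels, 3, padding=1)
        # Stand-ins for the cascaded IRBs
        self.irbs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_irbs)
        )
        # H_r: 3x3 conv that refines the deep features
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)
        # H_rm: 3x3 conv to 3*S^2 channels, then sub-pixel (pixel shuffle) upsampling
        self.rm = nn.Sequential(
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x):
        feat = self.ext(x)                  # F_0 = H_ext(I_LR)
        for irb in self.irbs:               # F_n = H_IRB^n(F_{n-1})
            feat = irb(feat)
        return self.rm(self.refine(feat))   # I_SR = H_rm(H_r(F_N))

sr = IRNSketch(scale=2)(torch.randn(1, 3, 24, 24))
print(tuple(sr.shape))  # spatial size doubled: (1, 3, 48, 48)
```

The sub-pixel convolution at the end is what keeps the whole body of the network operating at LR resolution, which is the main reason this layout is cheap.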

Residual Blocks
In this subsection, we introduce the residual block (IRB). As shown in Fig. 2, the residual block consists of one 3 × 3 convolution and n ConvNeXt Blocks, where n = 1 or 3; only the second IRB of the model uses n = 3, and n = 1 in the rest. We use the 3 × 3 convolution and the ConvNeXt Blocks to extract features. In particular, our IRN uses few activation functions: only the 4 × dim layer in the ConvNeXt Block uses the GELU [34] activation function, and no other layer uses any activation. The ConvNeXt Block [11] is an inverted-bottleneck architecture that decomposes the standard convolution into a depthwise convolution and a pointwise 1 × 1 convolution, as shown in Fig. 3, and widens the features before activation. It has been shown [35] that a model with wider features before activation can significantly improve single image super-resolution (SISR) performance for the same parameter count and computational budget. We therefore introduce it into our model, and experimental results show that it can significantly reduce the computational cost while maintaining SR performance. Given an input feature F_in, the structure is described as

F_cdc_j = E_CB^j(E_F1^j(F_cdc_{j-1})), j = 1, 2, with F_cdc_0 = F_in,

where E_F1^j denotes a 3 × 3 convolutional block, E_CB^j denotes a residual block made up of 1 or 3 ConvNeXt Blocks, and F_cdc_j is the feature extracted by the j-th module. After the two feature extraction modules, we add the final feature F_cdc_2 and the skipped feature F_in directly, which can be expressed as

F_cdc = F_cdc_2 + F_in,

where F_cdc is the final refined output feature. Next, we feed F_cdc into the ESA module [12] to obtain the final output of the IRB.
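A ConvNeXt Block of the kind described above can be sketched as follows, following the original ConvNeXt design [11] (7 × 7 depthwise kernel, LayerNorm, 4× pointwise expansion with GELU as the only activation); the exact kernel size and normalization used in IRN may differ.

```python
# Sketch of a ConvNeXt Block: depthwise conv -> LayerNorm -> 1x1 expand
# to 4*dim -> GELU (the only activation) -> 1x1 project -> residual add.
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)  # depthwise
        self.norm = nn.LayerNorm(dim)       # normalizes over the channel dim
        self.pw1 = nn.Linear(dim, 4 * dim)  # pointwise expansion (inverted bottleneck)
        self.act = nn.GELU()                # the only activation in the block
        self.pw2 = nn.Linear(4 * dim, dim)  # pointwise projection back to dim

    def forward(self, x):
        res = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)           # (N, C, H, W) -> (N, H, W, C)
        x = self.pw2(self.act(self.pw1(self.norm(x))))
        return res + x.permute(0, 3, 1, 2)  # back to (N, C, H, W), plus skip

y = ConvNeXtBlock(32)(torch.randn(2, 32, 16, 16))
print(tuple(y.shape))  # shape preserved: (2, 32, 16, 16)
```

Because the 7 × 7 convolution is depthwise and the expensive 4× expansion is done with cheap 1 × 1 (linear) layers, the block adds depth and width at a much lower cost than a standard wide 3 × 3 convolution.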

ESA Attention Module
As the effectiveness of the ESA module has been demonstrated [12,36], we introduce it into our IRN. To keep the ESA module sufficiently lightweight, it applies a 1 × 1 convolutional layer at the beginning to reduce the channel dimension of the input features. A strided convolution and a max pooling layer are then used to reduce the size of the feature map. After a set of convolutions extracts features, interpolation-based upsampling restores the original feature map size. Finally, an attention mask is generated through a sigmoid layer. The specific architecture of the ESA module is shown in Fig. 4.
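The sequence of operations above can be sketched as follows. This is a simplified stand-in, not the official ESA [12]: the reduction factor, the strided-conv/pooling hyperparameters, and the number of body convolutions are assumptions, and the original module contains additional convolutions and an internal skip connection.

```python
# Simplified ESA sketch: 1x1 channel reduction -> strided conv + max pool
# (shrink spatial size) -> conv -> bilinear upsample back -> 1x1 expand ->
# sigmoid mask -> elementwise gating of the input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ESASketch(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.reduce = nn.Conv2d(channels, mid, 1)               # 1x1 reduction
        self.down = nn.Conv2d(mid, mid, 3, stride=2, padding=0) # strided conv
        self.body = nn.Conv2d(mid, mid, 3, padding=1)           # feature extraction
        self.expand = nn.Conv2d(mid, channels, 1)               # restore channels

    def forward(self, x):
        a = self.reduce(x)
        a = F.max_pool2d(self.down(a), kernel_size=7, stride=3)  # shrink spatially
        a = self.body(a)
        a = F.interpolate(a, size=x.shape[2:], mode='bilinear',
                          align_corners=False)                   # recover size
        mask = torch.sigmoid(self.expand(a))                     # values in (0, 1)
        return x * mask                                          # spatial gating

out = ESASketch(48)(torch.randn(1, 48, 64, 64))
print(tuple(out.shape))  # (1, 48, 64, 64)
```

Computing the mask on a downsampled, channel-reduced copy of the features is what makes the spatial attention affordable: the heavy work happens at roughly one-sixth of the spatial resolution.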

Datasets and Metrics
Following previous work, we train the model on the DIV2K [37] dataset, which is widely used for image restoration tasks and contains 800 high-quality RGB training images. For testing, we use five widely used benchmark datasets: Set5 [38], Set14 [39], BSD100 [40], Urban100 [41], and Manga109 [42]. We use two metrics, peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [43], to evaluate the quality of the super-resolved images. Following existing work, we calculate PSNR and SSIM on the Y channel of the YCbCr representation converted from RGB.
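For reference, PSNR on the Y channel is typically computed as follows. The RGB-to-Y coefficients below are the BT.601 convention used by MATLAB's rgb2ycbcr, which most SR papers follow; details such as border shaving are omitted for brevity.

```python
# Y-channel PSNR as commonly used in SR evaluation.
import numpy as np

def rgb_to_y(img):
    """img: float array in [0, 255], shape (H, W, 3). Returns the luma channel."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    # BT.601 "studio" range conversion, as in MATLAB rgb2ycbcr
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, hr):
    """PSNR between the Y channels of two RGB images (8-bit value range)."""
    diff = rgb_to_y(sr.astype(np.float64)) - rgb_to_y(hr.astype(np.float64))
    mse = np.mean(diff ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

hr = np.random.randint(0, 256, (64, 64, 3)).astype(np.float64)
print(psnr_y(hr, hr))  # identical images -> inf
```

Evaluating on the Y channel reflects the fact that human vision is far more sensitive to luminance errors than to chrominance errors.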

Implementation Details
Our model is trained on the RGB channels, and the LR images are generated by downsampling (×2, ×3, and ×4) the HR images in MATLAB using bicubic interpolation. We use randomly cropped HR image patches of size 192 × 192 as input to our model, with the mini-batch size set to 64. We augment the training data with random horizontal flips and 90° rotations. Our model is trained using the ADAM optimizer [44] with momentum parameters β1 = 0.9, β2 = 0.999, and ε = 10^−8. The initial learning rate is set to 5 × 10^−4 and is halved after every 2 × 10^5 iterations. When training the final models, the ×2 model is trained from scratch; after it converges, we use it as a pretrained model for the other scales. In the IRN, we set the number of IRBs to 4. We implement our network in the PyTorch framework and train it on an NVIDIA RTX A5000 GPU.
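The optimizer and learning-rate schedule described above can be sketched in PyTorch as follows; the model here is a placeholder convolution, and the data is dummy data standing in for the cropped LR/HR patch pairs.

```python
# Training-setup sketch: Adam (beta1=0.9, beta2=0.999, eps=1e-8),
# initial lr 5e-4 halved every 2e5 iterations, L1 loss.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)  # placeholder for the IRN
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4,
                             betas=(0.9, 0.999), eps=1e-8)
# StepLR halves the learning rate every 200k scheduler steps
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200_000, gamma=0.5)
l1_loss = nn.L1Loss()  # the L1 objective the model is optimized with

# One training step on dummy tensors (real training uses 192x192 HR
# patches, mini-batch size 64, and flip/rotation augmentation)
lr_batch = torch.randn(2, 3, 48, 48)
hr_batch = torch.randn(2, 3, 48, 48)
loss = l1_loss(model(lr_batch), hr_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()
print(optimizer.param_groups[0]['lr'])  # still 5e-4 before 200k steps
```

Note that the schedule is stepped per iteration, not per epoch, matching the "every 2 × 10^5 iterations" decay rule.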

Model Analysis
In this subsection, we investigate the model parameters, the effectiveness of the ESA module, the effect of the activation function on the SR model, and the effectiveness of the IRN. Model parameters. To construct a lightweight SR model, the number of network parameters is crucial. From Table 3, we can observe that our IRN achieves comparable or better performance than other state-of-the-art SR methods such as LAPAR-A (NeurIPS 21) and SRFBN-S (CVPR 19). We also visualize the trade-off between performance and Multi-Adds/parameters in Fig. 5, which shows that our IRN achieves a better trade-off between performance and computational cost. Effectiveness of ESA. As shown in Table 1, the IRN without ESA shows a significant performance degradation for a parameter reduction of only about 10%, while the complete IRN shows clear performance improvements on the Set5, Set14, BSD100, Urban100, and Manga109 datasets. These results show that the ESA module can effectively improve SR performance.
A study of different activation functions. When introducing the ConvNeXt Block, we retain GELU as its activation function. However, most previous SR networks have used ReLU [45] or LeakyReLU [46]. We therefore investigate the effects of these three activation functions on the SR model. The results in Table 2 show that, among them, GELU obtains a significant performance improvement. We therefore retain GELU as the activation function in our model.

Running Time
As shown in Table 4, our method has the lowest number of parameters and running time compared with LAPAR-A (NeurIPS 21) and IMDN (ACM 19). The average running time also depends on code optimization and on how efficiently specific test platforms compute different operators (our IRN uses more 1 × 1 convolutions than IMDN and LAPAR-A), so the runtime of our method does not differ much from that of IMDN (ACM 19).

Study on the Impact of Different Number of IRB Blocks on Network Performance
We further investigate the impact of different numbers of IRB blocks on network performance. It can be observed from Table 5 that both performance metrics increase with the number of IRB blocks. However, increasing the number of IRBs beyond 4 yields only a small further improvement. We therefore use 4 IRB blocks to construct the IRN (Table 5).

Conclusion
In this paper, we propose IRN, a lightweight and efficient residual network for single image super-resolution. By using simple and efficient residual blocks to reduce the number of network layers and simplify the connections between layers, our network is made lighter and faster. In addition, we use effective ESA blocks to enhance the model's ability to collect fine-grained information. We also investigate the effect of the activation function on the SR model to find the best choice for our approach. Extensive experiments show that our proposed IRN strikes a good balance between model size, performance, and computational cost compared with other lightweight SR models, so that it can be easily ported to mobile devices.

Fig. 5 Illustration of PSNR, Multi-Adds, and parameter numbers of different SISR models on the Set5 dataset for ×4 SR

Table 1 Ablation studies of ESA

Table 3 Comparisons on multiple benchmark datasets for lightweight networks. The Multi-Adds are calculated corresponding to a 1280

Table 4 Comparison of our IRN with IMDN and LAPAR-A for ×3 SR. Run times are the average of 10 runs on the Urban100 test set

Table 5 Analysis of the number of IRB blocks in the proposed network for ×4 SR on Urban100