No-reference image quality assessment with multi-scale weighted residuals and channel attention mechanism

With the rapid development of deep learning, no-reference image quality assessment (NR-IQA) based on convolutional neural networks (CNNs) plays an important role in image processing. Currently, most CNN-based NR-IQA methods focus primarily on the global features of images while ignoring detail-rich local features and channel dependencies. In fact, there are subtle differences in detail between distorted and reference images, as well as differences in the contribution of different channels to IQA. Furthermore, multi-scale feature extraction can fuse detailed information from images at different resolutions, and the combination of global and local features is critical for extracting image features. In this paper, therefore, a multi-scale residual CNN with an attention mechanism (MsRCANet) is proposed for NR-IQA. Specifically, a multi-scale residual block is first used to extract features from distorted images. Then, residual learning with an active weighted mapping strategy and a channel attention mechanism is used to further process the image features and obtain richer information. Finally, a fusion strategy and fully connected layers are used to evaluate image quality. Experimental results on four synthetic databases and three in-the-wild IQA databases, as well as cross-database validation results, show that the proposed method has good generalization ability and is competitive with state-of-the-art methods.


Introduction
With the rapid development of computer technology and mobile devices, image information has become a source of perceptual content in daily life and learning. However, image distortions may occur during storage as a result of compression, copying, transmission, and other operations. Therefore, image quality assessment (IQA) has emerged as a critical technology in computer vision and image processing (Hong et al. 2016; Hao et al. 2019). In general, IQA is divided into two categories: subjective and objective evaluation. Subjective evaluation uses people as observers to reflect human visual perception. Objective evaluation calculates image quality by designing models and algorithms; it approximates the subjective perception of the human eye and provides evaluation results through numerical computation. Objective IQA can be divided into three categories: full-reference (FR), reduced-reference (RR), and no-reference (NR). For many applications, reference images are difficult to obtain due to factors such as environmental constraints in real-world settings. As a result, FR-IQA and RR-IQA methods are severely limited in practice. Because NR-IQA does not require reference images, this technology has attracted the interest of many researchers.
Due to the rapid development of deep learning, convolutional neural networks (CNNs) have been successfully applied in the field of computer vision (Holzinger 2018, 2021), such as in image super-resolution and recognition tasks, and have achieved numerous remarkable results (Jin et al. 2016). Despite CNNs' impressive performance in many visual tasks, NR-IQA remains challenging due to a lack of richly annotated real samples for training. First, because CNNs have a large number of parameters, learning these parameters requires a large amount of labeled data. The number of reference images in existing synthetic IQA databases for NR-IQA, such as LIVE (Sheikh et al. 2006), CSIQ (Larson and Chandler 2010), TID2008 (Ponomarenko et al. 2009), and TID2013 (Ponomarenko et al. 2015), is relatively small, with only about 20 to 30 reference images each. The in-the-wild IQA databases LIVE Challenge (Deepti and Alan 2015), KonIQ-10K (Vlad et al. 2020), and SPAQ (Fang et al. 2020) contain 1,169, 10,073, and 11,125 images with different contents, respectively; however, they provide no reference images at all. Second, image distortion is typically reflected in the variation of high-frequency detail across an entire image. In general, the more severe the distortion, the more high-frequency information is lost, whereas low-frequency information hardly changes. A model that ignores the observation process of objects and considers only the pixel values of images will overlook fine details and attend only to dense contour information, which increases the difficulty of learning.
To address the issue of a limited number of training samples, Kang et al. (2014) proposed a very shallow network. They obtained the global image quality by learning and predicting local image quality, using 32 × 32 image blocks as input. This method used image enhancement techniques of the kind necessary for image denoising or super-resolution problems. The network structure, however, is relatively simple, and the number of layers is small: only a single convolutional layer with a 7 × 7 kernel is used to extract image features, and such local features cannot describe global characteristics. As a result, this method cannot achieve sufficient accuracy.
Many researchers have devoted themselves to CNN-based NR-IQA and proposed many effective algorithms. Li et al. (2016); Dash et al. (2017) trained a network using full images of 224 × 224 pixels. Pan et al. (2016); Zuo et al. (2016) proposed saliency maps to adjust the weight of each patch. Other work predicted subjective scores by learning human visual perception behaviors. Lin and Wang (2018) introduced a GAN into a quality assessment network to provide fake reference images for NR-IQA. A NR-IQA network based on a two-stream CNN structure has also been proposed, with the network receiving both the original and gradient images. To extract different features, multi-scale convolution has been adopted to process images for super-resolution, yielding more effective high-frequency details and good results. Inspired by this, multi-scale feature extraction has also been applied in the field of image quality evaluation. Experiments showed that multi-scale feature extraction can improve IQA performance.
However, most existing methods focus on extracting more contour features from images, ignoring detail-rich local features and the dependencies among feature-map channels. In fact, fusing features at different resolutions can help reduce the loss of fine detail in severely distorted images. The channel attention mechanism and the active weighted mapping strategy have achieved good results in image super-resolution; they mainly simulate the human visual system through weighted processing. To the best of our knowledge, no prior work has combined multi-scale (three-scale) feature extraction, a channel attention mechanism, and active weighted mapping for IQA.
To address the above issues, this study proposes a novel multi-scale residual CNN with an attention mechanism for extracting contour and detail features and modeling channel correlation. The proposed network employs a multi-scale residual block (MSRB) with three parallel branches to extract features from distorted images, which differs significantly from previously proposed multi-scale structures. The MSRB can fuse global and local information from images at different resolutions. In addition, the network employs residual learning with an active weighted mapping strategy (AWMS) and a channel attention mechanism (CAM) to process global and local features and obtain richer information. Experiments show that the proposed network model can improve quality assessment performance for no-reference images.
The contributions of this study are as follows. (1) A novel CNN-based IQA algorithm is proposed that uses several effective modules to extract image detail and contour features. Experimental results show that it learns image quality features better and evaluates image quality more accurately. (2) An MSRB with three parallel branches is proposed to extract features from distorted images. Furthermore, an active weighted mapping mechanism is used in the MSRB for residual learning; its main role is to reweight the input mappings rather than directly adding the two residual-learning paths. (3) A CAM is incorporated into the IQA task. This mechanism determines the weight of each channel and can effectively improve cross-layer information flow through residual connections and channel relationships.
The rest of the paper is organized as follows. Section 2 provides a brief overview of some methods related to the proposed network. Section 3 describes the proposed network model. Section 4 describes the experimental details and conducts comparative and cross-validation experiments. Section 5 presents the conclusion and direction for future research.

Related works
In this section, we review some related methods and algorithms and briefly discuss their advantages and disadvantages.

No-reference image quality assessment based on CNN
Several years ago, Kang et al. (2014) first introduced CNNs into the field of NR-IQA and proposed a network with only five layers, which was of crucial research significance. The method adopted an image-block training strategy: the original image is first divided into 32 × 32 non-overlapping blocks, with the quality of each block assumed to be roughly the same as that of the original image, and the blocks are then fed sequentially into the network to train its parameters. The proposed network pushed NR-IQA to a high level despite the lack of samples in the training database, and many scholars have since devoted themselves to applying CNNs to NR-IQA. However, the network proposed by Kang et al. has only one convolutional layer, which is relatively shallow; as a result, it has some limitations and fails to meet accuracy requirements. Inspired by Kang's idea, Li et al. (2016); Dash et al. (2017) proposed CNNs with multiple layers of structure. They trained the networks with full images of size 224 × 224 pixels. Although the small-patch segmentation technique was not used, the deep network structure produced relatively good results on some datasets with fewer reference images. Sun et al. (2016) and Bosse et al. (2016) achieved remarkable results by fine-tuning parameters in deep CNN networks (Li et al. 2016; Dash et al. 2017). Pan et al. (2016) and Zuo et al. (2016) also designed a deep network for small patches. A CNN-based saliency-map detection method was proposed in this network: the saliency map was used to adjust the weight of each patch, and the overall image quality score was output at the end. Cheng et al. (2017) proposed a NR-IQA network based on a pre-saliency map and proved through experiments that the prediction error in salient regions was lower than that in uniform regions.
Based on this fact, they proposed a pre-SM algorithm that assigns relatively large weights to the small blocks of saliency maps to reduce the error.
Recently, a new distorted IQA (DIQA) model has been proposed. Its training is divided into two parts. The first part determines the information associated with the distorted and reference images. The second part primarily employs human subjective scores to fine-tune the CNN so that it predicts subjective opinion scores by learning human visual perception behavior. However, when the network randomly crops the input image, patches may overlap, causing redundant computation and wasting time and resources. Lin and Wang (2018) proposed a NR-IQA network based on a generative adversarial network. In this model, a GAN was introduced into the quality assessment network to provide pseudo-reference images for NR-IQA; a quality-perception loss was added to the GAN generator, and the discriminator was used to judge the pseudo-reference images. Although the above methods have effectively improved NR-IQA, the problem of gradient propagation weakening or vanishing may occur on some databases due to network depth.
To address this issue, a NR-IQA network based on a two-stream CNN structure was proposed, which inputs original and gradient images into the network, enabling it to extract different features representing distortion. Because that network extracted features using a single-size convolution kernel, the extracted feature information is insufficient, and detailed features may be lost. Multi-scale feature extraction has since been applied in the field of image quality evaluation. In the field of super-resolution, multi-scale residual convolution kernels were first used to process images, achieving a good zoom performance effect; however, that work did not attempt to use an active weighted strategy or a CAM to enhance the effectiveness of the feature representation. Another line of work used the structure and texture similarity of feature maps extracted by intermediate layers to represent FR model features, and employed the global average and standard deviation of the final feature maps fused from intermediate feature maps to represent NR model features. A ladder structure has also been proposed to learn quality-aware features more effectively: it integrates the features of the middle layers hierarchically into the final feature representation, allowing the model to fully utilize low-level to high-level visual information. These methods, however, could not pay particular attention to specific channels of distorted images because they did not consider the correlation between channels.

Channel attention
Visual attention is a unique signal-processing mechanism of human vision. By quickly scanning a global image, we can locate the important target areas that require attention, then attend to them more closely to obtain more detailed features while suppressing relatively useless ones. The attention mechanism in deep learning draws on this human mode of attentive perception.
Many researchers have recently devoted themselves to investigating attention mechanisms in deep learning. Hu et al. (2018) proposed a squeeze-and-excitation (SE) block that adaptively adjusts channel weights by learning channel dependence. A deep residual channel attention network with a CAM was subsequently introduced; to process input images, it combined residual learning with an SE block, adjusted different feature maps based on the degree of dependence between input images, and gave these feature maps appropriate weights. The experimental results showed that a CAM can improve image processing. Woo et al. (2018) proposed a convolutional attention module by combining channel attention and spatial attention; this module can automatically compute the importance of feature channels as well as of different regions of the feature space. Gao et al. (2020) proposed a global second-order average pooling module that uses covariance to represent the relationship between channels. Because the original input can be scaled along the channel dimension, this module plays a role in channel attention.

Active weighted mapping
In general residual learning, the original input is added to the information obtained after several convolution layers to produce the final output. The simplest residual learning, depicted in Fig. 2a, is

y = x + F(x),

where x and y represent the input and output of the residual block, and F(·) is the residual-learning convolution function. Hyoungho et al. (2018) proposed an active weighted mapping (AWM) strategy based on residual learning. The AWM mechanism differs from plain residual learning: it redefines the mapping of residual learning rather than directly adding the two paths. In particular, AWM assumes that the weights of x and F(x) differ. Following the AWM mechanism, different weights (Fig. 2b) can be derived and redistributed, and the mechanism can be expressed as

y = λ · x + μ · F(x),

where λ and μ are the weight values calculated by the AWM mechanism.
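As a concrete illustration, the AWM idea above might be sketched in PyTorch as follows. This is a minimal sketch, not the authors' implementation: the residual branch F(·), the hidden size, and the use of global average pooling plus two fully connected layers to produce λ and μ are assumptions modeled on the description above.

```python
import torch
import torch.nn as nn

class AWMResidual(nn.Module):
    """Active weighted mapping sketch: y = lam * x + mu * F(x), where the
    per-sample weights (lam, mu) are learned from global channel statistics
    instead of using the fixed lam = mu = 1 of plain residual learning."""
    def __init__(self, channels, hidden=16):
        super().__init__()
        # F(.): an assumed small residual branch of two convolutions
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # two FC layers map pooled statistics of [x, F(x)] to the 2 weights
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2),
            nn.Sigmoid(),
        )

    def forward(self, x):
        fx = self.body(x)
        # global average pooling of both paths -> (B, 2C)
        z = torch.cat([x.mean(dim=(2, 3)), fx.mean(dim=(2, 3))], dim=1)
        w = self.fc(z)                       # (B, 2): [lam, mu]
        lam = w[:, 0].view(-1, 1, 1, 1)
        mu = w[:, 1].view(-1, 1, 1, 1)
        return lam * x + mu * fx
```

The Sigmoid keeps both weights in (0, 1), so the block interpolates between the identity path and the learned residual rather than simply summing them.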

A new multi-scale residual block (MSRB)
Before we present our MsRCANet model, the MSRB structure is shown in Fig. 3, where S_{n−1} and S_n are the input and output of the MSRB, respectively. This module consists of three parallel branches with convolution kernels of various sizes and an AWMS mechanism. The MSRB structure is described in detail below.
In the first part, three convolution kernels of different sizes, 3 × 3, 5 × 5, and 7 × 7, are used to extract multi-scale features from the input features. After each convolution operation, the LRelu function is applied to obtain more detailed context information, and the three output features are concatenated into a single tensor. This procedure is repeated twice. Finally, a 1 × 1 convolution kernel is used to adjust the number of channels of the output features. In this part, W_{ij}^{k×k} and b_{ij}^{k×k} denote the weight and bias of branch j in layer i, where k × k denotes the size of a convolution kernel (k = 3, 5, 7; i = 1, 2; j = 1, 2, 3); L(·) represents the LRelu function; A_{ij} and B_{ij} represent the results obtained with the LRelu functions; concat[·] represents the fusion operation; and D represents the result obtained after applying the 1 × 1 convolution kernel.
In the second part, the AWMS is used to combine the output feature maps D obtained in the first part with the initial input feature maps S_{n−1} (see Fig. 4). Assume S_{n−1} = {S_1, S_2, ..., S_k} has k feature maps of size M × N. The global average channel information Z = {z_1, z_2, ..., z_k} is obtained after global average pooling; that is, the two-dimensional feature map of each channel becomes a single statistic, and the cth element of Z can be expressed as

z_c = Ave(S_c) = (1 / (M × N)) Σ_{i=1}^{M} Σ_{j=1}^{N} S_c(i, j),

where Ave(·) is the global average pooling operation and S_c(i, j) is the value at position (i, j) of the cth channel feature map. As shown in Fig. 4, Z_1 and Z_2 represent the two statistics obtained after global average pooling; we concatenate Z_1 and Z_2 into Z to obtain the weights of D and S_{n−1}. The weights λ and μ of S_{n−1} and D are calculated using two nonlinear fully connected layers:

γ = {λ, μ} = σ(W2 · δ(W1 · Z)),

where W1 ∈ R^{h×2k} and W2 ∈ R^{2×h} denote the weight matrices of the fully connected layers and h (2 < h < 2k) represents the number of nodes in the hidden layer. σ and δ represent the Sigmoid and Relu functions, respectively, and γ = {λ, μ} is the weight vector of D and S_{n−1}. The MSRB output is then

S_n = λ · S_{n−1} + μ · D.
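Putting the two parts together, a minimal PyTorch sketch of such a block might look as follows. The channel widths and hidden size h are assumptions; only the overall structure — three parallel branches applied twice with concatenation, a 1 × 1 reduction producing D, and AWMS weighting of D against the input S_{n−1} — follows the description above.

```python
import torch
import torch.nn as nn

class MSRB(nn.Module):
    """Multi-scale residual block sketch with 3x3 / 5x5 / 7x7 branches
    and AWMS weighting of the branch output D against the input."""
    def __init__(self, channels, hidden=16):
        super().__init__()
        def branch(in_c, k):
            # same-padding conv followed by LRelu, as in the text
            return nn.Sequential(
                nn.Conv2d(in_c, channels, k, padding=k // 2),
                nn.LeakyReLU(0.2, inplace=True),
            )
        self.first = nn.ModuleList([branch(channels, k) for k in (3, 5, 7)])
        self.second = nn.ModuleList([branch(3 * channels, k) for k in (3, 5, 7)])
        self.reduce = nn.Conv2d(3 * channels, channels, 1)  # 1x1 conv -> D
        self.fc = nn.Sequential(                            # AWMS weights
            nn.Linear(2 * channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2),
            nn.Sigmoid(),
        )

    def forward(self, s):
        a = torch.cat([b(s) for b in self.first], dim=1)    # first pass
        h = torch.cat([b(a) for b in self.second], dim=1)   # second pass
        d = self.reduce(h)
        # pooled statistics Z = [Z1, Z2] of D and the input S_{n-1}
        z = torch.cat([d.mean(dim=(2, 3)), s.mean(dim=(2, 3))], dim=1)
        w = self.fc(z)
        lam, mu = w[:, 0].view(-1, 1, 1, 1), w[:, 1].view(-1, 1, 1, 1)
        return lam * s + mu * d                             # S_n
```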

Channel attention mechanism (CAM)
In general, the feature maps of all input channels are treated as equally important. In fact, the importance of the feature information in each channel differs; thus, each channel should be given a different weight. This study employs a CAM similar to that of Hu et al. (2018) to assign weights to each channel. As shown in Fig. 5, the CAM consists of two parts, squeeze and excitation, which together learn the weight of each channel to generate channel-domain attention. The structure and operation of these two steps are described in detail next.
The squeeze function is similar to the AWMS mechanism. If X = {X_1, X_2, ..., X_C} is the input of the CAM with spatial size M × N, and the global average channel information after global average pooling is Z = {z_1, z_2, ..., z_C}, the kth element of Z can be expressed as

z_k = (1 / (M × N)) Σ_{i=1}^{M} Σ_{j=1}^{N} X_k(i, j).

The excitation function consists of two fully connected layers. It is mainly used to generate channel weights for the input feature maps; the learned weights represent the importance of the feature channels. Here, the Relu and Sigmoid functions are selected:

λ = σ(W2 · δ(W1 · Z)),

where W1 ∈ R^{(C/r)×C} and W2 ∈ R^{C×(C/r)} represent the weights of the two fully connected layers, r is a scaling ratio, and σ and δ represent the Sigmoid and Relu functions, respectively. λ = {λ_1, λ_2, ..., λ_C} is the learned weight vector of the feature channels. After obtaining the weight of each feature channel, the weight is multiplied by the corresponding original feature channel to create the weighted feature maps: the output X̃ of the CAM is obtained by channel-wise multiplication, with kth element X̃_k = λ_k · X_k.

Figure 6 depicts the proposed network's architecture. Input and Output represent the network's input and output, respectively, where Input is a locally normalized 32 × 32 image patch. First, the network employs a 3 × 3 convolution kernel to extract the detailed features of an input image; it then employs a 9 × 9 convolution kernel to extract contour features. During training, the quality score of each patch is used as the ground-truth label of its source image. In the testing stage, the quality estimate for each image is obtained by averaging its patch prediction scores. As shown in Fig. 6, the network is divided into three sections: the initial feature extraction, advanced feature extraction, and fully connected network parts.
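The squeeze and excitation steps of the CAM above can be sketched in a few lines of PyTorch (after Hu et al. 2018); this is a minimal sketch, and the scaling ratio r = 4 here is an illustrative choice, not a value taken from the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation channel attention: global average pooling
    produces one statistic per channel (squeeze), two fully connected
    layers with ReLU then Sigmoid learn per-channel weights (excitation),
    and each channel is rescaled by its weight."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),   # W1: (C/r) x C
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),   # W2: C x (C/r)
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = x.mean(dim=(2, 3))                          # squeeze: (B, C)
        lam = self.fc(z).unsqueeze(-1).unsqueeze(-1)    # weights: (B, C, 1, 1)
        return x * lam                                  # channel-wise rescale
```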
The local normalization method and three main components of the network are described in detail below.

Network model architecture
Local normalization employs a straightforward normalization method similar to that described in Zhang et al. (2016). Assuming that the pixel value at position (x, y) is I(x, y), the normalized value Î(x, y) is

Î(x, y) = (I(x, y) − μ(x, y)) / (σ(x, y) + θ),

where μ(x, y) and σ(x, y) are the mean and standard deviation of the pixel values in a local P × Q window centered at (x, y), θ is a small positive number (ensuring that the denominator is not 0), and P · Q is the total number of image pixels in the local window. In the experiments, this study uses a window with P = Q = 3, which has a relatively good effect. We found that local normalization alleviates the effect of the contrast and brightness of an image patch on the learning of the network.

The initial feature extraction part employs the 3 × 3 convolution kernel to extract features p_1 from the input patch; the 9 × 9 convolution kernel then extracts contour features and obtains feature information p_2. In these two operations, * denotes convolution, W_1^{3×3} and b_1 denote the weight matrix and bias of the 3 × 3 convolution kernel, and W_2^{9×9} and b_2 denote those of the 9 × 9 convolution kernel. The output p_2 is the input of the next stage.
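The local normalization above can be sketched in a few lines of NumPy. This is a minimal sketch: the reflect padding at the image borders is an assumption, since the paper does not specify how borders are handled.

```python
import numpy as np

def local_normalize(img, P=3, Q=3, theta=1e-6):
    """Local mean/std normalization over a P x Q window: each pixel has
    the local window mean subtracted and is divided by the local window
    standard deviation plus a small constant theta."""
    img = img.astype(np.float64)
    padded = np.pad(img, ((P // 2, P // 2), (Q // 2, Q // 2)), mode="reflect")
    H, W = img.shape
    # stack the P*Q shifted copies of the image, one per window offset
    windows = np.stack([
        padded[i:i + H, j:j + W] for i in range(P) for j in range(Q)
    ])
    mean = windows.mean(axis=0)
    std = windows.std(axis=0)
    return (img - mean) / (std + theta)
```

For a constant patch the local mean equals every pixel value, so the normalized output is zero everywhere, which illustrates how the scheme removes brightness and contrast variation.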
The advanced feature extraction part is the core of the network. First, the MSRB with the AWMS mechanism is used to extract multi-scale features from p_2. Second, the CAM is used to learn the correlation between feature channels. The network then uses a 5 × 5 convolution layer for feature extraction and applies max pooling and min pooling to the output features in parallel, giving p_6 = [p_6^1, p_6^2], followed by three convolution layers and another parallel pooling layer. Finally, the pooling results are fed into the fully connected part. In this stage, f(·) and g(·) represent the MSRB and CAM operations, respectively, and max-pool(·) and min-pool(·) represent the 2 × 2 pooling operations. The output p_10 of this stage is the input of the next stage. The fully connected part consists of three fully connected layers. To reduce complex co-adaptation between neurons, the dropout method is applied after the first fully connected layer, with the ratio set to 0.5. Here, W_1, W_2, and W_3 represent the weight matrices of the three fully connected layers, with node numbers 100, 800, and 1, respectively, and the output is the final predicted score of a test image.

Loss function
In regression problems, the mean square error or the mean absolute error can be used as the objective function. Although the mean square error is simple to calculate, the mean absolute error is more robust to outliers and better suited to IQA model regression. As a result, the mean absolute error serves as the optimization objective function in this study. Let x_n be an input patch, y_n its true score, θ the parameters of the network, and f(x_n; θ) the network's prediction; the optimization objective function is defined as

L(θ) = (1 / N) Σ_{n=1}^{N} | f(x_n; θ) − y_n |,

and our primary goal is to solve the optimization problem θ* = argmin_θ L(θ).
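In PyTorch, this mean-absolute-error objective corresponds directly to `nn.L1Loss`; the scores below are made-up values for illustration only.

```python
import torch
import torch.nn as nn

# Mean absolute error over a batch of patch predictions:
# L(theta) = (1/N) * sum_n |f(x_n; theta) - y_n|
mae = nn.L1Loss()

pred = torch.tensor([0.8, 0.5, 0.9])   # f(x_n; theta): predicted scores
true = torch.tensor([1.0, 0.4, 0.7])   # y_n: ground-truth scores
loss = mae(pred, true)                  # (0.2 + 0.1 + 0.2) / 3 ≈ 0.1667
```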

Experimental results and analysis
In this section, we describe experimental settings, perform comparative experiments on several databases, and analyze the experimental results in detail.

Database
When training the network, non-overlapping patches of size 32 × 32 are extracted from the initial image and used as the network's input. Compared with taking the initial image as the input, partitioning it into patches yields more training samples. Kang et al. (2014) pointed out in experiments that the quality score of each patch can be used as the ground-truth score of the initial image. Therefore, in the test stage, the average of the predicted scores of the input patches is used as the estimate of the quality score of the initial image. This study uses four synthetic databases: LIVE (Sheikh et al. 2006), CSIQ (Larson and Chandler 2010), TID2008 (Ponomarenko et al. 2009), and TID2013 (Ponomarenko et al. 2015). Besides the synthetic IQA databases, we also validate the proposed model on three in-the-wild IQA databases: LIVE Challenge (Deepti and Alan 2015), KonIQ-10K (Vlad et al. 2020), and SPAQ (Fang et al. 2020). Detailed information about the four synthetic databases and three in-the-wild databases follows. The LIVE database consists of 29 reference images and 779 distorted images. It has five types of distortion: JP2K compression (JP2K), JPEG compression (JPEG), white Gaussian noise (WN), Gaussian blur (GB), and fast fading (FF). LIVE provides differential mean opinion scores (DMOS) for all distorted images, with values in the range [0, 100]; the closer the DMOS is to 100, the worse the image quality and the more severe the distortion. Figure 7 shows the DMOS comparison of the same image with different distortion types in the LIVE database.
CSIQ database consists of 30 reference images and 866 distorted images, which is a shared database developed by the Computational Perception and Image Lab of Oklahoma State University. There are six types of distortion in the CSIQ database. CSIQ database provides DMOS for all distorted images, with DMOS values in the range [0,1]. The closer the DMOS value is to 1, the worse the image quality and the more severe the distortion. In this paper, we use a subset of this database with the same distortion types as the LIVE database.
TID2008 was established by the N504 Department of Signal Reception, Transmission, and Processing of the National Aeronautical and Aerospace University of Ukraine. TID2008 consists of 25 reference images and 1700 distorted images. There are 17 types of distortion, such as GB, image noise, JPEG compression, JP2K compression, JPEG transmission error, and JP2K transmission error. TID2008 provides mean subjective scores (MOS) for all distorted images. The value of MOS is in the range [0,9]. Contrary to DMOS, the closer the value is to 0, the worse the image quality and the more distorted the image.
TID2013 was established by the National Aeronautical and Aerospace University of Ukraine. TID2013 consists of 25 reference images and 3000 distorted images. There are 24 types of distortion, such as GB, JPEG compression, and JP2K compression. TID2013 provides mean opinion scores (MOS) for all distorted images, with values in the range [0, 9]. In this paper, we use a subset of this database with the same distortion types as the LIVE database. KonIQ-10K (Konstanz Authentic Image Quality Database) was established by the University of Konstanz in Germany in 2020 and contains 10,073 distorted images. The subjective scores were obtained through crowdsourcing, with 1,459 annotators providing 1.2 million subjective ratings. The database also provides image attributes and EXIF (Exchangeable Image File Format) information.
SPAQ (Smartphone Photography Attribute and Quality) was established by Jiangxi University of Finance and Economics in 2020, mainly for the evaluation of mobile phone imaging quality, including 11,125 images collected from 66 different imaging devices. The subjective experiment was carried out in a standard laboratory environment, and the value range of MOS is [0, 100]. At the same time, the database provides image attribute information, category information, and EXIF information.
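The patch-based protocol described at the start of this section — non-overlapping 32 × 32 patches for training, with the test-time image score taken as the mean of its patch predictions — can be sketched as follows.

```python
import numpy as np

def extract_patches(img, size=32):
    """Split an image into non-overlapping size x size patches, dropping
    any incomplete border patches, as in the training setup above."""
    H, W = img.shape[:2]
    return [img[i:i + size, j:j + size]
            for i in range(0, H - size + 1, size)
            for j in range(0, W - size + 1, size)]

def image_score(patch_scores):
    """Test-time image score: the mean of its patch predictions."""
    return float(np.mean(patch_scores))
```

For example, a 64 × 96 image yields 2 × 3 = 6 patches, and the image's predicted quality is the average of the six patch scores.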

Evaluation metrics
To effectively evaluate the proposed NR-IQA algorithm, its performance must be quantified. This study uses the Spearman rank-order correlation coefficient (SROCC), Kendall rank-order correlation coefficient (KROCC), Pearson linear correlation coefficient (PLCC), and root mean squared error (RMSE). SROCC and KROCC measure the prediction monotonicity of the NR-IQA results, while PLCC and RMSE measure their accuracy.
SROCC = 1 − (6 Σ_{i=1}^{n} D_i²) / (n(n² − 1)),

where D_i denotes the difference between the ith image's rankings in the predicted and subjective quality scores, and n is the number of images. SROCC is in the range [−1, 1]; the closer the SROCC is to 1, the better the performance of the corresponding algorithm.
KROCC = (n_c − n_d) / (n(n − 1)/2),

where n_c and n_d are the numbers of concordant and discordant pairs in the database, respectively. KROCC lies in the range [−1, 1]; the closer the KROCC is to 1, the better the performance of the corresponding algorithm.
PLCC = Σ_i (x_i − x̄)(y_i − ȳ) / √(Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)²),

RMSE = √((1/n) Σ_i (x_i − y_i)²),

where x_i and y_i represent the subjective and objective quality scores of the ith image, respectively, and x̄ and ȳ represent the average subjective and objective quality scores. PLCC lies in the range [−1, 1]; the closer the PLCC is to 1 and the smaller the RMSE, the better the image quality evaluation algorithm performs.
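The four criteria can be computed directly with SciPy; a minimal sketch:

```python
import numpy as np
from scipy import stats

def iqa_metrics(pred, mos):
    """SROCC/KROCC (monotonicity) and PLCC/RMSE (accuracy) between
    predicted quality scores and subjective scores."""
    pred, mos = np.asarray(pred, float), np.asarray(mos, float)
    srocc = stats.spearmanr(pred, mos).correlation
    krocc = stats.kendalltau(pred, mos).correlation
    plcc = stats.pearsonr(pred, mos)[0]
    rmse = float(np.sqrt(np.mean((pred - mos) ** 2)))
    return srocc, krocc, plcc, rmse
```

Note that in practice PLCC is often reported after a nonlinear mapping between predicted and subjective scores; the sketch above computes the plain linear correlation.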
To train the network, 100 epochs were conducted in the experiments. In each run, the training, test, and validation sets were divided according to the ratio 6:2:2. The proposed network was trained in the PyTorch deep learning framework and optimized using the Adam stochastic optimization method. The base learning rate and weight decay were both set to 10^−4. Training was stopped after 100 epochs because there was no further continuous improvement, and the experimental results in this paper were obtained after convergence within 100 epochs.
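The training setup above might be sketched as follows. This is a sketch under stated assumptions: the paper's learning rate and weight decay are read as 10^−4, the model is a placeholder standing in for MsRCANet, and only the optimizer choice and the 6:2:2 split protocol come from the text.

```python
import torch

def split_indices(n, seed=0):
    """Random 6:2:2 train/test/validation split of n sample indices."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(n, generator=g).tolist()
    a, b = int(0.6 * n), int(0.8 * n)
    return perm[:a], perm[a:b], perm[b:]

# Placeholder model; the real network would be the MsRCANet described above.
model = torch.nn.Linear(32 * 32, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
train_idx, test_idx, val_idx = split_indices(100)
```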

Evaluation of synthetic databases and in-the-wild databases
Evaluation of synthetic databases. The distorted images were trained on four synthetic databases. Figures 8 and 9 show the SROCC and PLCC variation and the loss curves obtained over 100 epochs of training on the LIVE database, respectively. As shown in the figures, the network converged at about 60 epochs. Tables 1 and 2 compare state-of-the-art IQA algorithms on the four synthetic databases. The best result for each IQA model in the tables is shown in bold, and the second-best in italics. As shown in Table 1, four FR-IQA methods and nine NR-IQA methods were compared with our proposed MsRCANet on four indexes: SROCC, KROCC, PLCC, and RMSE. The four FR-IQA methods were SSIM (Wang et al. 2004), PSNR (Wang and Bovik 2002), FSIM (Zhang et al. 2011), and VSI (Zhang et al. 2014); the nine NR-IQA methods were CORNIA (Ye et al. 2012), BRISQUE (Anish et al. 2012), CNN (Kang et al. 2014), CNN++ (Kang et al. 2015), HOSA, RAN4IQA (Ren et al. 2018), RankIQA (Liu and Bagdanov 2017), dynamic (Chen et al. 2020), and static (Chen et al. 2020). As shown in Table 2, twenty-one methods were compared with our proposed algorithm on two indexes, SROCC and PLCC; these methods included NIQE (Mittal et al. 2012), the method of Su et al. (2020), UNIQUE, HFF, and MMMNet (Li et al. 2021), among others. Table 1 lists the results of SROCC, PLCC, KROCC, and RMSE on the synthetic databases LIVE, TID2008, and TID2013. It can be seen that our algorithm is superior to all of the compared FR-IQA and NR-IQA methods. Table 2 lists the SROCC and PLCC results on the synthetic databases LIVE, CSIQ, and TID2013, along with the averages over the three databases. As shown in Table 2, the performance of the proposed network is the best and most competitive compared with other advanced models, except on the TID2013 database.
The proposed model has the best performance after averaging the SROCC and PLCC on the three synthetic databases, indicating that it has strong generalization ability and is superior to other NR-IQA algorithms.
To demonstrate the generalization ability of the proposed network, we tested it against other algorithms (PSNR (Wang and Bovik 2002), FSIM (Zhang et al. 2011), BRISQUE (Anish et al. 2012), GM-log (Xue et al. 2014), CNN (Kang et al. 2014), and BIECON (Kim and Lee 2017)) on each distortion type of the LIVE, CSIQ, and TID2008 databases. The experimental results are shown in Table 3. The SROCC results of our network are better than those of the other networks on all distortion types of these databases, demonstrating that the network evaluates these distorted images effectively and that our algorithm generalizes well.
Evaluation of in-the-wild databases. The proposed algorithm and six popular IQA algorithms were trained on distorted images from three in-the-wild IQA databases. These algorithms are QAC (Xue et al. 2013), NIQE (Mittal et al. 2012), BRISQUE (Anish et al. 2012), BMPRI (Min et al. 2018), CNN (Kang et al. 2014), and WaDIQaM-NR (Bosse et al. 2017). Table 4 shows the comparative results on the three in-the-wild databases; the best-performing IQA model is shown in bold and the second-best in italics. Although the performance of our proposed algorithm on the three in-the-wild IQA databases was not always the best, it achieved consistently competitive results, indicating that the proposed model generalizes well on in-the-wild IQA databases.

Cross-database evaluations
To further assess generalization ability, cross-database experiments were conducted, in which a model trained on one database is tested on another. Table 5 displays the SROCC results. To make the comparison more intuitive, visualization diagrams are provided in Figs. 10, 11, 12, and 13. As Table 5 and Figs. 10, 11, 12, and 13 show, the generalization ability of the proposed model surpasses that of the most advanced NR-IQA models in these challenging experiments, demonstrating the effectiveness of the proposed method.

Ablation study
In this section, ablation experiments were performed on the proposed network modules to verify their contribution to network performance. First, the input patch size was compared: patches of 16 × 16, 32 × 32, and 64 × 64 were tested while all other modules remained unchanged. The results are shown in Table 6 and Fig. 14. The evaluation performance was best with 32 × 32 patches, so the input patch size was set to 32 × 32 in our experiments.
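The non-overlapping patch grid implied by this setup can be sketched as follows (the coordinate helper is illustrative, not the authors' preprocessing code):

```python
def patch_origins(height, width, patch=32, stride=32):
    """Top-left (row, col) coordinates of every full patch-by-patch
    crop of an image, tiled with the given stride (illustrative)."""
    return [(r, c)
            for r in range(0, height - patch + 1, stride)
            for c in range(0, width - patch + 1, stride)]

# A 64 x 96 image yields a 2 x 3 grid of 32 x 32 patches.
origins = patch_origins(64, 96)
```

Each coordinate pair would then index a crop that is fed to the network as one input sample.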
To verify the effectiveness of the proposed MSRB and CAM, a comparative experiment was then carried out on the LIVE database to determine whether these modules benefit our network. The results are shown in Table 7 and Fig. 15. The models with the feature extraction blocks MSRB and CAM outperform those without them; that is, the multi-scale feature extraction and the channel attention mechanism indeed improve the evaluation results of the proposed network.
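As a rough sketch of the two ablated ideas, under strong simplifications: the multi-scale parallel branches are stood in for by 1-D moving averages with windows 3, 5, and 7, and the channel attention keeps only the squeeze-and-gate step, omitting the learned fully connected layers. None of this is the authors' implementation.

```python
import math

def multiscale_features(x):
    """Parallel 'branches' with receptive fields 3, 5 and 7 (moving
    averages standing in for convolutions), fused by concatenation."""
    def branch(x, k):
        out = []
        for i in range(len(x)):
            lo, hi = max(0, i - k // 2), min(len(x), i + k // 2 + 1)
            out.append(sum(x[lo:hi]) / (hi - lo))
        return out
    return branch(x, 3) + branch(x, 5) + branch(x, 7)

def channel_attention(feat):
    """Toy CAM: squeeze each channel by global average pooling, gate
    it through a sigmoid, and rescale the channel by its gate."""
    out = []
    for ch in feat:  # each channel is a 2-D list
        mean = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        gate = 1.0 / (1.0 + math.exp(-mean))
        out.append([[v * gate for v in row] for row in ch])
    return out
```

The sketch only shows why the two modules help at all: the branches see the input at three receptive-field sizes before fusion, and the gate reweights channels by their aggregate response.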

Conclusion
In this paper, a novel CNN-based no-reference quality assessment model, MsRCANet, is proposed. The model extracts contour and detail features, predicts image quality scores more accurately, and improves IQA accuracy. To better extract multi-scale features and handle the information flow between channels, we designed a multi-scale residual feature extraction block in which an active weighted residual learning mechanism processes the relevant information between channels. Specifically, parallel branches with three different convolution kernel sizes first extract features from the distorted images, and then the active weighted mechanism strategy (AWMS) processes the input information. As different channels contribute differently to image features, we used the channel attention mechanism (CAM) to process the feature maps of each channel and applied the backpropagation algorithm to learn the inter-channel dependencies within CAM. In this way, the mechanism deduces the corresponding weight of each channel, and the final outputs of the different channels are obtained.

The performance of the proposed method was analyzed experimentally. MsRCANet outperforms the most advanced NR-IQA methods in terms of image quality evaluation performance. Although MsRCANet performed well in IQA, some issues remain worthy of further study. For example, how can the feature maps extracted during IQA be visualized? How can performance on in-the-wild IQA databases be further improved? In future research, we will concentrate on these issues to find more effective ways to implement IQA.

Data availability
Enquiries about data availability should be directed to the authors.

Conflict of interest
The authors have not disclosed any competing interests.