Image denoising in the deep learning era

Over the last decade, the number of digital images captured per day has increased exponentially, due to the accessibility of imaging devices. The visual quality of photographs captured by low cost or miniaturized imaging devices is often degraded by noise during image acquisition and data transmission. With the re-emergence of deep neural networks, the performance of image denoising techniques has been substantially improved in recent years. The objective of this paper is to provide a comprehensive survey of recent advances in image denoising techniques based on deep neural networks. We begin with a thorough description of the fundamental preliminaries of the image denoising problem, followed by an overview of the benchmark datasets and commonly used metrics for objective assessment of denoising algorithms. We study the existing deep denoisers in the supervised and unsupervised training paradigms and review the technical specifics of some representative methods within each category. We conclude the survey by remarking on trends and challenges in the development of state-of-the-art algorithms and by outlining directions for future research.


Introduction
The role of digital cameras is to approximate an image of the real world by sampling from a discrete grid, while maintaining image quality as judged by human perception. The visual quality of images collected by handheld consumer cameras (Abdelhamed et al. 2018; Plotz and Roth 2017), medical imaging equipment, or industrial cameras may be impaired by several intrinsic and extrinsic factors related to the acquisition environment, such as the pixel pitch of the sensor or the scene light level. Noise corruption caused by light interference, dark current leakage, shot noise and lens aberration can deteriorate the perceptual quality of images. Image noise can also impact subsequent higher-level computer vision tasks; it is therefore often crucial to denoise images prior to any higher-level image interpretation tasks.
Image denoising refers to the process of inspecting a noisy image and recovering an estimate of the underlying clean counterpart by discarding the noise artifacts. Traditionally, image denoising is framed as an optimization procedure searching for the most likely clean image. Given the very large number of image-noise combinations that could yield the noisy image, image denoising falls into the family of ill-posed inverse problems (Bertero and Boccacci 1998; Vogel 2002). Traditional denoising techniques impose explicit regularization to constrain the search space of image-noise combinations. Some examples of these classic denoising methods leveraging a priori knowledge about the clean image include non-local self-similarity models (Buades et al. 2005; Dabov et al. 2007), sparse models (Mairal et al. 2009), gradient models (Beck and Teboulle 2009; Rudin et al. 1992; Osher 2005) and Markov random field models (Lan et al. 2006; Roth and Black 2009). However, these methods typically suffer from two major deficiencies: (1) discovering clean images often involves a set of tuned hyper-parameters that do not generalize well to unseen data and (2) all the computational steps are performed at the inference phase, requiring a considerable amount of resources.
Driven by the availability of large datasets, rapid increases in computational power and advances in algorithmic development for the optimization of neural networks, deep learning has made impressive improvements in many computer vision tasks (Krizhevsky et al. 2012; Simonyan and Zisserman 2014; Szegedy et al. 2015). In the context of image denoising, deep learning has attracted significant research interest and spawned many new research directions over the last decade (Lemarchand et al. 2020; Plotz and Roth 2017; Tian et al. 2020; Thakur et al. 2019) (Fig. 1). Employing neural networks in image denoising can be traced back to the seminal works exploring the advantages of lightweight networks over classical hand-crafted denoisers (Burger et al. 2012; Jain and Seung 2009). The research question initially asked was whether neural networks could compete with engineered classical denoisers. As our understanding of neural networks has improved, deep denoising networks have become the de-facto choice for state-of-the-art denoising applications. The extensive use of neural networks for denoising has created a diverse set of approaches to choose from, ranging from convolutional networks (Tai et al. 2017) to generative adversarial frameworks.

(Fig. 1: statistics of (a) peer-reviewed and (b) arXiv papers on image denoising over the past few years, and (c) the evolution history of image denoising algorithms in the deep learning era.)
The field of deep image denoising has developed rapidly but in a disparate manner. As depicted in Fig. 2, different denoising paradigms have been proposed over the past years; however, most of these methods are tailored to specific contexts and are based on benchmark datasets that are not directly comparable. Additionally, some new benchmark datasets have been proposed that are not yet covered in the available literature (Plotz and Roth 2017; Tian et al. 2020; Lemarchand et al. 2020; Thakur et al. 2019). This motivates us to examine recent advances in this research domain to provide an overview of current methods and a perspective on promising research directions. We provide a new taxonomy of the existing deep denoising techniques by grouping methods into two major categories: supervised and unsupervised approaches. In each category, we further organize the representative methods in accordance with their network design, adopted priors and training strategies. We present the methods in chronological order to show the advancement timeline for each training paradigm category.
To summarize, the main contributions of the survey are as follows: 1. We provide a thorough description of the preliminaries for image denoising as well as a comprehensive summary of the benchmark datasets and evaluation metrics. 2. We deliver an extensive overview of deep denoisers. We introduce a novel taxonomy of the existing methods in an effort to present a complete picture of the state of the art in deep denoising. 3. By compiling the results of previous work, we discuss research challenges and open issues to identify new trends and future research directions for the denoising community.
This survey is organized as follows: Sects. 2 and 3 cover the problem definition and review the mainstream datasets and evaluation metrics. In Sect. 4, we investigate the representative works in the supervised denoising area. Section 5 delivers a summary of recent unsupervised denoising methods, and Sect. 6 provides a summary of denoising applications in other domains. We conclude this survey in Sects. 7 and 8 with a discussion of a number of open problems and future research directions. We list the notations that will be used in this survey in Table 1.

Problem definition and terminology
Formally, let $X = \{x_i \in \mathbb{R}\}_{i=1}^{n}$ be a noisy image with $n$ pixels that is corrupted by a degradation function $\Phi$, and let $Y = \{y_i \in \mathbb{R}\}_{i=1}^{n}$ be the corresponding clean counterpart. The degradation function $\Phi$ for the $i$-th pixel is written as:

$$x_i = \Phi(y_i; \theta); \quad \forall i \in \{1, 2, \dots, n\}, \tag{1}$$

where $\theta$ indicates the set of parameters associated with the degradation function and noise model. Degradation by noise is often modelled as noise addition followed by pixel-wise clipping to account for sensor saturation. Suppose that $\eta_i$ denotes the noise component for the $i$-th pixel, physically caused by the light or the camera. The additive noise model can then be written as:

$$x_i = \Phi(y_i; \theta) = \mathrm{clip}\left(y_i + \eta_i\right); \quad \forall i \in \{1, 2, \dots, n\}, \tag{2}$$

where, without loss of generality and assuming the pixel intensities are in the range $[0, 1]$, we have $\mathrm{clip}(y_i) = \min(\max(y_i, 0), 1)$. The task of image denoising is to recover $Y$ from the observed noisy data $X$. Typically, the degradation function and the noise parameters are unknown. Thus, an approximation of the inverse function is learned such that:

$$\hat{y}_i = \Phi^{-1}(x_i; \psi); \quad \forall i \in \{1, 2, \dots, n\}, \tag{3}$$

where $\Phi^{-1}$ and $\psi$ denote the denoising function and its parameters, respectively. The learning-based denoiser is implemented as a regression function that maps the noisy inputs $X$ to the clean ground truth $Y$, i.e. $\Phi^{-1}: X \mapsto Y$. When training a neural network as a denoiser, the loss is typically composed of a fidelity term $\mathcal{L}(y_i, \hat{y}_i)$ measured between the clean estimate and the ground truth, and a regularization term $\mathcal{R}(\hat{y}_i)$ that constrains the solution space, adjusted with a trade-off parameter $\lambda$. The denoising network is trained to learn an optimal parameter configuration:

$$\psi^{*} = \arg\min_{\psi} \sum_{i=1}^{n} \mathcal{L}\!\left(y_i, \Phi^{-1}(x_i; \psi)\right) + \lambda\, \mathcal{R}\!\left(\Phi^{-1}(x_i; \psi)\right). \tag{4}$$

When choosing a fidelity term, prior knowledge about the clean input may be an important consideration. A mean squared error (MSE) or L2 distance fidelity term tends to produce over-smoothed outputs that lack high-frequency details, as it implicitly enforces a Gaussian prior on the restored output. Some works have demonstrated the benefits of mean absolute error (MAE) or L1 distance for producing higher quality restored images with perceptually sharper edges and textures (Zhao et al. 2017).
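For concreteness, the objective in Eq. 4 can be sketched in a few lines of PyTorch. This is a minimal sketch, not a prescription from the literature: the function name, the anisotropic total-variation regularizer standing in for $\mathcal{R}$, and all default values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def denoising_loss(clean, estimate, fidelity="l1", tv_weight=0.0):
    """Fidelity term L(y, y_hat) plus an optional regularizer R(y_hat)
    weighted by a trade-off parameter, in the spirit of Eq. 4."""
    if fidelity == "l1":   # MAE: perceptually sharper edges (Zhao et al. 2017)
        loss = F.l1_loss(estimate, clean)
    else:                  # MSE: tends to over-smooth high frequencies
        loss = F.mse_loss(estimate, clean)
    if tv_weight > 0:      # anisotropic total variation as an example prior
        tv = (estimate[..., :, 1:] - estimate[..., :, :-1]).abs().mean() + \
             (estimate[..., 1:, :] - estimate[..., :-1, :]).abs().mean()
        loss = loss + tv_weight * tv
    return loss
```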

Noise formation model
Noise in digital images comes from many sources, such as variation in sensor sensitivity (ISO factor), thermal fluctuations, signal transmission errors, photon shot noise and quantization noise. Noise models are approximations of the real noise created during signal conversion in the sensor and readout by an analog-to-digital converter. In this section, we elaborate on the three most commonly studied noise models in digital imaging.

Shot noise or signal-dependent noise
Photons travel from the scene to the camera during the exposure and arrive at the pixel sites in whole numbers, or packets. Scene irradiance is measured by the conversion of incident photons into charge at each pixel in a sensor array (Hasinoff 2014). The fluctuation of the packet count is proportional to the square root of the average incident photon count, i.e. the average light intensity, so the photon count at the sensor array carries an uncertainty that stems from random fluctuations in the arrival times of the photons. Such uncertainty is known as shot or photon noise and is theoretically described by the Poisson distribution $\mathcal{P}(\lambda y_i)$, where the mean and variance of the noise at pixel location $i$ depend on the clean pixel intensity $y_i$. The scalar coefficient $\lambda$ indicates the sensor-specific scaling factor of the signal (Fig. 3).

Read noise or signal-independent noise
Photons accumulated at each cell during the exposure are read out as a charge or voltage that is eventually stored as a scalar pixel value. Read noise is the summation of the noise from random events during the photon to photo-electron accumulation and readout process, including lower-level noises such as thermal fluctuations, analogue-to-digital quantization noise, reset noise, and source follower noise (Leyris et al. 2005; Konnik and Welsh 2014). Different sensor types have different read noise characteristics. A CCD sensor typically has one read action for all pixels, so the read noise is consistent among pixels but varies from image to image. A CMOS sensor has a read action for each pixel or column of pixels, so there is variability in the read noise from pixel to pixel within a single image. Read noise is conservatively approximated using a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$, where $\mu$ and $\sigma$ are fixed across the spatial dimensions of the image. Read noise is often treated as white Gaussian noise when it is sampled from a zero-mean distribution. Read noise alone underestimates the actual noise corruption occurring in real images, as it only models the errors from reading the charge accumulated at each pixel and not the noise from the charge accumulation itself.

Poisson-Gaussian noise
Digital imaging induces both signal-independent and signal-dependent errors, and a Gaussian or Poisson distribution alone may not be sufficient for precise noise modelling. To address this limitation, real noise is often modelled using a combination of both Poisson and Gaussian components. In practice, Poisson-Gaussian noise is approximated with a heteroscedastic Gaussian model, whose parameters change with respect to some quality of the signal. In image denoising, the noise is then represented by a Gaussian distribution whose variance is proportional to the signal intensity, i.e.

$$\eta_i \sim \mathcal{N}\!\left(0,\; \lambda_{\mathrm{shot}}\, y_i + \sigma_{\mathrm{read}}^2\right),$$

where $\lambda_{\mathrm{shot}}$ and $\sigma_{\mathrm{read}}^2$ denote the signal-dependent and signal-independent variance components, respectively. The heteroscedastic Gaussian model is commonly referred to as the noise level function (Liu et al. 2008; Foi et al. 2008).
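The heteroscedastic model above translates directly into a noise synthesis routine. The NumPy sketch below samples Poisson-Gaussian noise under the heteroscedastic Gaussian approximation; the parameter values and image shape are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_poisson_gaussian(clean, lam_shot=0.01, sigma_read=0.02):
    """Heteroscedastic Gaussian approximation of Poisson-Gaussian noise:
    per-pixel variance is lam_shot * y + sigma_read**2."""
    var = lam_shot * clean + sigma_read ** 2
    noisy = clean + rng.normal(0.0, 1.0, clean.shape) * np.sqrt(var)
    return np.clip(noisy, 0.0, 1.0)  # pixel-wise clipping, as in Eq. 2

clean = rng.random((64, 64))         # stand-in clean image in [0, 1]
noisy = add_poisson_gaussian(clean)
```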

Benchmarks
The recent use of deep neural networks has led to a consensus that datasets are of critical importance for a variety of computer vision and image processing applications. For image denoising, numerous publicly available datasets have emerged that greatly differ in image counts, quality, resolution, diversity and, most importantly, noise characteristics. In this section, we review some of the most widely used image denoising datasets. We group these datasets into synthetic noise and real noise categories and highlight their notable properties, such as image counts, resolution, and acquisition settings. Table 2 lists a summary of the studied datasets.

Synthetic noisy datasets
A common strategy to train neural networks for image denoising is to consider the image datasets used for other computer vision tasks (Martin et al. 2001; Agustsson and Timofte 2017) as a collection of clean images and simulate the noisy equivalents by imposing i.i.d. Gaussian, Poisson or Poisson-Gaussian random samples. Despite the popularity of synthetic noisy datasets, there are often reasons to question whether images drawn from other datasets are genuinely clean. Importantly, the noise characteristics in real noisy images do not always conform to those of the synthetic ones (Liu et al. 2008; Foi et al. 2008), resulting in a significant performance discrepancy when networks trained on synthetic noise are evaluated on real noisy images. This performance discrepancy may be addressed by leveraging transfer learning and domain adaptation methods, or by finding a more realistic noise model that considers the in-camera pipeline. Table 2 summarizes the most widely used synthetic datasets for training and evaluation of DL-based denoising models. Among them, we describe BSD and DIV2K in more detail below:

BSD
The Berkeley segmentation dataset (Martin et al. 2001) is the most widely used dataset for rendering noisy and clean pairs via a noise synthesis strategy. BSD is a collection of 500 natural RGB images of size 481 × 321 with human-labelled segmentation ground truths, each containing at least one discernible object. By today's standards, however, BSD contains fairly low-resolution images, which makes it less useful for real-world applications.

DIV2K
Agustsson and Timofte (2017) introduced DIV2K, a larger dataset primarily used as a benchmark for image super-resolution. In contrast to BSD, DIV2K contains images of higher resolution (2K) and larger content diversity. To fairly benchmark competing methods, the 1000 images in the DIV2K dataset are partitioned into subsets of 800, 100 and 100 images for training, validation and testing, respectively.

Real noisy datasets
There have been efforts to pair clean images with their real noisy equivalents to support denoising development in real-world applications. Such pairs can be captured by constraining the extrinsic variables in the imaging environment or by adjusting the intrinsic parameters of the imaging apparatus. A prevalent strategy to approximate the clean ground truth is to offset the inherent noise by collecting a rapid sequence of shots of a fixed scene followed by temporal averaging. Another strategy is to treat images taken with lower ISO factors and longer exposures as clean ground truths. In either setting, careful post-processing steps and image manipulations may be exploited to further marginalize the noise in the clean ground truths.

RNI15

Lebrun et al. (2015) provided the first collection of real noisy images, containing 15 images without clean counterparts. The images in RNI15 cover a variety of noise types, including low-light images from smartphones, old photographs, aerial images, etc. Due to the absence of clean ground truths, RNI15 is used solely for qualitative evaluation.

RENOIR

Anaya and Barbu (2018) presented the first dataset containing both noisy and clean images. In RENOIR, images of 120 scenes are captured with low and high ISO settings. For each scene, two clean images are taken, interleaved with one or two noisy ones in between. Multiple clean shots are used to verify the spatial alignment of images throughout the acquisition process. Finally, the low ISO images are averaged and paired with either or both of the noisy images. Two consumer cameras (Canon Rebel T3i, Canon S90) and a smartphone (Xiaomi Mi3) are used to collect images at ISO levels ranging from 100 to 6400. The RENOIR dataset does not model heteroscedastic noise, and low-frequency bias is not removed.

NAM

Nam et al. (2016) collected a laboratory-controlled dataset from 11 static scenes composed of printed pictures and a few real objects. For each scene, 500 successive JPEG images were captured and used to approximate a (nearly) clean ground truth. Images are taken by three consumer cameras (Nikon D800, Nikon D600 and Canon 5D Mark III) across three ISO factors (1600, 3200 and 6400). A major drawback of the NAM dataset is its use of printed pictures, which deviate from real-world scenes. Additionally, NAM performs neither heteroscedastic noise modelling nor low-frequency bias correction. Lastly, images with misalignment or differing illumination are not discarded in NAM.

DND
The Darmstadt noise dataset (Plotz and Roth 2017) consists of 50 scenes taken by four consumer cameras (Sony A7R, Olympus E-M10, Sony RX100 IV and Huawei Nexus 6P) across different ISO ranges and shutter speeds. Images with high ISO (short exposure time) and low ISO (long exposure time) are taken as the real noisy and clean images, respectively. Additional post-processing, including correction of spatial misalignment and removal of low-frequency bias, is further adopted to derive more accurate clean ground truths from the low ISO images. Moreover, the employed intensity transform is based on a heteroscedastic Tobit regression model.

SID

Chen et al. (2018) introduced the See-in-the-Dark (SID) dataset, consisting of 5094 raw pairs captured with a fast shutter (short exposure, noisy) and a slow shutter (long exposure, clean) using two cameras (Sony 7S II and Fujifilm X-T2). The dataset contains both indoor and outdoor images, where the latter are captured at night under moonlight or street lighting.

SIDD
This dataset (Abdelhamed et al. 2018) is collected from 10 scenes using five smartphones (Apple iPhone 7, Google Pixel, Samsung Galaxy S6 Edge, Motorola Nexus 6 and LG G4) with fifteen ISO levels (50-10,000) under three illumination temperatures (3200 K for tungsten or halogen, 4400 K for fluorescent lamps and 5500 K for daylight) and three brightness levels (low, normal and high). Each scene is captured multiple times with different camera settings and/or different lighting conditions, yielding more than 30,000 images. The collected noisy images are then processed by a systematic procedure to obtain the clean ground truth. The main focus of SIDD is to address the noticeable noise caused by the small sensor sizes and apertures of smartphone cameras.

PolyU

Xu et al. (2018) introduced a more comprehensive dataset taken from 40 versatile scenes in different lighting conditions using five cameras (Canon 5D Mark II, Canon 80D, Canon 600D, Nikon D800 and Sony A7 II). To cover more camera settings, each image is captured with six different ISO factors (800, 1600, 3200, 6400, 12,800 and 25,600). Moreover, other intrinsic camera parameters such as shutter speed, aperture and luminance are re-adjusted for each ISO so that all images are normally exposed. Each scene is captured 500-1000 times, and shots with spatial misalignment or luminance discrepancy are removed. The remaining samples of the same scene are then averaged and taken as the clean ground truth. Since the image pairs are subjectively monitored, spatial misalignment is largely avoided. PolyU contains both raw sRGB and JPEG images.

NIND
Most recently, NIND (Brummer and De Vleeschouwer 2019) was rendered from 101 scenes using two cameras (Fujifilm X-T1, Canon C500D). Each scene is captured with a set of ISO factors starting from 100 up to the highest possible value, and the image with the lowest ISO is taken as the clean ground truth. Images captured with the highest ISO tend to be quite dark and are therefore exposure-corrected in software. As the ISO increases, the shutter speed is decreased to match the original exposure value. On average, six images are captured per scene, yielding a dataset of 616 paired images in total.

Evaluation metrics
In this section, we provide a summary of two well-known metrics used to evaluate the performance of denoising methods. Although the majority of existing works use quantitative metrics for comparison, the visual quality of denoised images is also important when deciding on the best models, as humans are often the end consumers of denoised images.

Peak-signal-to-noise ratio
Peak signal-to-noise ratio (PSNR), measured in decibels (dB), is the most prevalent criterion for quantifying the degradation caused by lossy image transformations (e.g. compression, transmission, or reconstruction). Due to its simplicity and low computational cost, it is widely used for comparisons. Given two images $X = \{x_i \in \mathbb{R}\}_{i=1}^{n}$ and $Y = \{y_i \in \mathbb{R}\}_{i=1}^{n}$, PSNR is calculated as follows:

$$\mathrm{PSNR}(X, Y) = 10 \log_{10} \frac{\mathrm{MAX}_X^2}{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - y_i\right)^2},$$

where $\mathrm{MAX}_X$ is the maximum value of the dynamic range of the images. In image reconstruction, higher PSNR values generally indicate a reconstruction of better quality; in some cases, however, they may not, since PSNR correlates poorly with the quality perceived by human eyes (Wang et al. 2002; Blau et al. 2018).

Structural similarity
Wang et al. (2002, 2003), Zhou and Bovik (2002) and Zhou et al. (2004) proposed the structural similarity (SSIM) index as an image quality assessment metric that is better aligned with how humans perceive the visual quality of images. SSIM measures the visual impact of changes in image luminance, contrast and spatial dependencies, collectively the structural information in the viewing field (Zhou and Bovik 2002). Given two images $X$ and $Y$, SSIM is computed as follows:

$$\mathrm{SSIM}(X, Y) = \left[l(X, Y)\right]^{a} \left[c(X, Y)\right]^{b} \left[s(X, Y)\right]^{c},$$

where $a > 0$, $b > 0$, $c > 0$ control the relative significance of each term. The luminance, contrast and structural components are defined as follows:

$$l(X, Y) = \frac{2\mu_X \mu_Y + C_1}{\mu_X^2 + \mu_Y^2 + C_1}, \quad c(X, Y) = \frac{2\sigma_X \sigma_Y + C_2}{\sigma_X^2 + \sigma_Y^2 + C_2}, \quad s(X, Y) = \frac{\sigma_{X,Y} + C_3}{\sigma_X \sigma_Y + C_3},$$

where $\mu_X$ and $\mu_Y$ denote the means, $\sigma_X$ and $\sigma_Y$ the standard deviations, and $\sigma_{X,Y}$ the covariance of $X$ and $Y$. The constants $C_1$, $C_2$ and $C_3$ are introduced to avoid instabilities when the denominators are close to zero.
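Both metrics are straightforward to compute in practice. The snippet below evaluates PSNR manually and cross-checks it against scikit-image, which also provides an SSIM implementation; the test images are random stand-ins.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def psnr(x, y, max_val=1.0):
    """PSNR in dB for images with dynamic range [0, max_val]."""
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

x = np.clip(np.random.rand(128, 128), 0, 1)            # stand-in reference
y = np.clip(x + 0.05 * np.random.randn(128, 128), 0, 1)  # noisy version

print(psnr(x, y), peak_signal_noise_ratio(x, y, data_range=1.0))
print(structural_similarity(x, y, data_range=1.0))
```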

Supervised denoising
Supervised image denoising implies using both the noisy and the clean images when training neural networks. Optimizing the parameters of such networks requires access to massive datasets with accurate clean ground truths for supervision.
Owing to the prevalence of synthetic noisy datasets in recent years, deep denoisers based on supervised training schemes have dominated the literature (Tai et al. 2017; Liu et al. 2018). Apart from synthetic datasets, the advent of large denoising datasets with real noisy and clean image pairs has contributed significantly to the success of supervised denoisers for real-world denoising problems (Plotz and Roth 2017; Abdelhamed et al. 2018). In this section, we summarize the existing methods for supervised image denoising. We study the literature in two major directions, i.e. discriminative and generative approaches. In Table 3 and Fig. 4, we summarize key methods and corresponding network architectures for supervised image denoising.

Discriminative models
Discriminative methods have recently become increasingly prevalent for image denoising, thanks to their favourable trade-off between denoising quality and speed at test time. In the scope of deep denoisers, discriminative models exploit the capacity of neural networks to learn a direct mapping from noisy images to their clean counterparts. In particular, these methods attempt to find the optimal parameters of a feed-forward network that maximize the conditional probability $P(y_i \mid x_i)$ directly from the training set $\mathcal{D}_{\mathrm{train}}$. Mathematically, this can be written as:

$$\psi^{*} = \arg\max_{\psi} \prod_{(x_i, y_i) \in \mathcal{D}_{\mathrm{train}}} P\!\left(y_i \mid x_i; \psi\right).$$

The success of discriminative deep denoisers in fast inference is attributed to the fact that the learned parameters are kept fixed during testing, implying a fixed computational cost for each image.

(Table 3 legend: "S" and "R" denote synthetic and real noise, respectively; "D", "F", "LSC" and "SSC" represent effective depth, maximum filter size, long skip connection and short skip connection, respectively.)

However, this fixed parameterization comes at the expense of less flexibility and the necessity of training distinct networks for different noise levels. The differences between the approaches in this line of work are mainly related to network design, learning strategies and the modelling of prior information. In the remaining parts of this section, we collect and summarize some representative works in this category.

Plain networks
Plain feed-forward neural networks are the simplest DL-based models for image denoising, yet they have achieved superior performance against classic approaches such as BM3D (Dabov et al. 2007) and WNNM (Gu et al. 2014). In a nutshell, these networks are formed by assembling alternating sequences of convolutional or fully-connected layers, potentially interleaved with non-linear activations (Nair and Hinton 2010), normalization (Ioffe and Szegedy 2015; Ulyanov et al. 2016) and dropout operations (Srivastava et al. 2014; Krizhevsky et al. 2012). Leveraging neural networks for image denoising arguably began gathering momentum when Jain and Seung (2009) proposed to exploit convolutional layers as a way to relax the computational expenses associated with the parameter estimation and inference of popular probabilistic denoising methods. Following the achievements of sparse coding models in image processing, Xie et al. (2012) proposed a stacked sparse denoising auto-encoder (SSDA) framework by sequentially stacking multiple instances of denoising auto-encoders connecting the noisy input to the network output. In addition to the reconstruction term, an auxiliary KL-divergence term was employed to ensure the sparsity of the mean of the intermediate activations. Later, Agostinelli et al. (2013) extended the SSDA framework by introducing adaptive multi-column SSDA to improve its robustness against various noise types. Following this, Burger et al. (2012) showed that a well-trained multi-layer perceptron (MLP) network, given a massive collection of noisy and clean patches, could outperform the well-known BM3D (Dabov et al. 2007) method. Some of the proposed methods for image denoising are based on unrolling the inference procedure of model-based techniques, where the computational steps are modelled by neural layers. In this group, Schmidt and Roth (2014) borrowed the notion of shrinkage functions from the wavelet restoration domain (Hel-Or and Shaked 2008) and proposed a cascaded set of shrinkage fields (CSF) to model stage-wise predictions in an unrolled half-quadratic optimization procedure. Shrinkage functions in CSF are learned in a data-driven manner, reducing the optimization procedure in each stage to a single quadratic minimization; run-time speed is further improved by leveraging convolution operations and the discrete Fourier transform. Another example in this group is TNRD by Chen and Pock (2017); Chen et al. (2015), which exploited advances in partial differential equations for image restoration. TNRD is a flexible denoising framework in which each stage is modelled by a convolutional layer with large trainable kernels optimized over a large dataset. Kim et al. (2017) proposed to learn data-driven plain feed-forward networks as the implicit regularizer in the widely adopted alternating minimization algorithms (Wang et al. 2008) for image restoration.
The major body of previous denoisers based on plain deep networks focused on noise with spatially fixed statistics. The work of Zhang et al. (2018) was among the earliest attempts to account for spatially-varying noise by adapting the behaviour of a plain deep CNN to different regions. Particularly, they proposed to augment the noisy image with a noise level map prior to feeding it to the network; the noise level maps are generated by stretching either the actual or the estimated noise variance across the spatial dimensions to match the input size, as sketched below.
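A minimal sketch of this noise-level-map conditioning follows; the architecture (layer count, channel width, grayscale input) is a simplified assumption rather than the exact network of Zhang et al. (2018).

```python
import torch
import torch.nn as nn

class NoiseMapDenoiser(nn.Module):
    """Plain CNN conditioned on a spatial noise-level map; sizes are
    illustrative only."""
    def __init__(self, channels=64, depth=5):
        super().__init__()
        layers = [nn.Conv2d(2, channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1),
                       nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(channels, 1, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, noisy, sigma):
        # stretch the per-image noise level across the spatial dimensions
        level_map = sigma.view(-1, 1, 1, 1).expand_as(noisy)
        return self.net(torch.cat([noisy, level_map], dim=1))
```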

Residual networks
Plain networks with deep architectures suffer from the potential risk of vanishing or exploding gradients. Therefore, assisting techniques such as skip connections (Srivastava et al. 2015a, b) are often utilized to facilitate unhindered information flow within the layers of the network. The image denoising literature has witnessed significant use of residual learning in recent years (Bae et al. 2017; Jiao et al. 2017; Song et al. 2019; Kokkinos and Lefkimmiatis 2019; Tai et al. 2017; Ren et al. 2018; Santhanam et al. 2017). An early attempt to leverage residual connections for image denoising was described in REDNet, by Mao et al. (2016), where skip connections were used between corresponding layers in a mirrored encoder-decoder architecture.
Residue learning Instead of learning the absolute clean image, some denoising methods leverage long skip connections in their design to learn the residue between the noisy and clean images. Notably, DnCNN by Zhang et al. (2017) introduced the earliest attempt in this avenue and exhibited superior performance with a simpler architecture. The residue image produced at the output of DnCNN is subsequently subtracted from the noisy input to acquire the clean estimate. In other words, instead of learning a sophisticated mapping from one complete image to another, DnCNN learns the residue image that isolates the noise component so that its removal recovers the high-frequency details (see the sketch at the end of this subsection). Mathematically, residue learning based models can be written as:

$$r_i = \Phi^{-1}(x_i; \psi); \quad \forall i \in \{1, 2, \dots, n\},$$

where $r_i$ denotes the output residue. The clean estimate is computed as:

$$\hat{y}_i = x_i - r_i.$$

DnCNN has been successfully employed in many model-based denoising algorithms, serving as an implicit natural image prior (Meinhardt et al. 2017). To complement DnCNN, Remez et al. (2018) proposed CADN, which reflects the direct impact of all intermediate layers in estimating the residue image.

Other improvements Some researchers have adopted more sophisticated patterns for skip connections (Huang et al. 2017) to improve the representation power of the networks. Tai et al. (2017) designed MemNet by incorporating densely connected memory blocks between a low-level feature extractor and a reconstruction block. The memory blocks end in 1 × 1 gating convolutions that adaptively control how much of the information received from the previous layers should be preserved or discarded before being delivered to the subsequent module. Zhang et al. (2018) proposed a residual dense network intended to make full use of hierarchical features, with densely connected global memory blocks that are themselves formed by sequences of densely-connected convolutions. Most recently, dual residual building blocks have been proposed to enhance the interaction between paired operations, e.g. down-sampling and up-sampling, occurring within the network.
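The following sketch illustrates the residue learning scheme in a DnCNN-like network; the depth, width and grayscale input are simplified assumptions, not the published configuration.

```python
import torch.nn as nn

class ResidualDenoiser(nn.Module):
    """DnCNN-style residue learning: the network predicts the noise
    component and subtracts it from the input."""
    def __init__(self, channels=64, depth=7):
        super().__init__()
        body = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            body += [nn.Conv2d(channels, channels, 3, padding=1),
                     nn.BatchNorm2d(channels), nn.ReLU(inplace=True)]
        body += [nn.Conv2d(channels, 1, 3, padding=1)]
        self.body = nn.Sequential(*body)

    def forward(self, noisy):
        residue = self.body(noisy)  # r_i: estimated noise component
        return noisy - residue      # y_hat_i = x_i - r_i
```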

Attention mechanism
The attention mechanism has become an integral part of neural networks in recent years (Vaswani et al. 2017; Chaudhari et al. 2019). In networks equipped with attention mechanisms, the relationships among learned features are explicitly analyzed and exploited to improve the efficiency of representation learning (Hu et al. 2018a, b). In image denoising, many works have tried to exploit the principal merits of attention mechanisms to achieve better denoising performance with faster training and smaller model sizes (Anwar and Barnes 2019; Gu et al. 2019; Cheng et al. 2021; Hu et al. 2021; Zamir et al. 2020; Suganuma et al. 2019; Zamir et al. 2021). Anwar and Barnes (2019) proposed the first work that benefited from the emerging popularity of the channel-wise attention mechanism (Hu et al. 2018a) for image denoising, improving the learning efficiency of the network by re-scaling the feature channels in accordance with their mutual dependencies. Later, Cheng et al. (2021) proposed a novel subspace attention module in which the noisy images are projected into a learned clean subspace, such that the reconstructed image keeps most of the original content while removing the noise, i.e. the information irrelevant to the generated basis vectors. Hu et al. (2021) designed an efficient pseudo-3D auto-correlation module that attends over the vertical, horizontal and channel-wise axes simultaneously; in contrast to regular auto-correlation attention modules, this lightweight module avoids dense connections and high-dimensional operations.
A different class of attention mechanisms relates features from different scales or operations. Specifically, Gu et al. (2019) described a technique that connects the contextual features extracted at different resolutions in a top-down processing architecture. The input image is initially down-sampled into multiple scales using a shuffling operation; a hierarchical coarse-to-fine structure then gradually receives and manipulates the individual resolutions. After several convolutions, the features from each coarser scale are delivered to the next finer scale, transferring cross-scale contextual information throughout the multi-scale framework. In contrast to this top-down aggregation, Zamir et al. (2020) proposed a model that aggregates contextual information by exchanging information across all scales at each resolution level; the information delivered from other resolution levels is adaptively gated and fused with the current information by a self-attention mechanism. Similarly, Zamir et al. (2021) proposed to incorporate a supervised attention module between every two stages of a multi-stage architecture, and further introduced a cross-stage information exchange module to improve feature fusion between early and later stages.
Most recently, Suganuma et al. (2019) presented a more versatile layer architecture that embodies multiple operations, such as convolutions with different kernel sizes, applied to the input. In such a setting, the attention mechanism produces a weight vector that determines the impact of each operation within the layer; the weight vector is then multiplied with the outputs of the operations to re-scale them in accordance with their significance.

Nonlinear activation functions
Increasing the depth of the network architecture for better learning capacity is not always feasible due to the limited computational resources in many practical applications. To address this, more focus has been put on the role of activation functions in constructing efficient yet powerful networks (Klambauer et al. 2017; Misra 2019). Toward efficient image denoising, there have been various recent improvements that focus on ameliorating the activation functions (Kligvasser et al. 2018; Gu et al. 2019).
As opposed to the ubiquitous ReLU (Nair and Hinton 2010) activation, which operates per pixel, Kligvasser et al. (2018) introduced learnable activations with spatial connections into deep denoisers. ReLU can be viewed as a hard-gating mechanism in which irrelevant activations are discarded by a binary weight map. Conversely, their xUnit offers a soft-gating scheme, adopting internal convolutions and Gaussian gating modules to provide spatially-dependent, continuous-valued weight maps for the activations. Although this method requires more computation per layer, the improved representational efficiency of the layers reduces the size of the network compared to alternative approaches.
In the same vein, Gu et al. (2019) crafted another learnable activation function (MTLU) that helps boost the learning capacity of small networks. As depicted in Eq. 17, the core methodology of MTLU has two highlights: (a) dividing the activation space into several equidistant bins and (b) learning the coefficients of a distinct linear function per bin via back-propagation during training:

$$f(x) = a_k x + b_k \quad \text{if} \quad c_{k-1} \le x < c_k, \tag{17}$$

where the bin boundaries $\{c_k\}_{k=1}^{K}$ are $K$ hyper-parameters of MTLU, and $\{a_k\}_{k=1}^{K}$ and $\{b_k\}_{k=1}^{K}$ are the learned coefficients of the linear functions.
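A possible PyTorch realization of MTLU is sketched below; the bin range and count are illustrative, and extending the outermost bins to cover out-of-range activations is one reasonable choice among several.

```python
import torch
import torch.nn as nn

class MTLU(nn.Module):
    """Multi-bin trainable linear unit sketch (Eq. 17): the activation
    space is split into K equidistant bins, and each bin k learns its own
    slope a_k and intercept b_k."""
    def __init__(self, num_bins=40, lo=-2.0, hi=2.0):
        super().__init__()
        self.lo, self.num_bins = lo, num_bins
        self.width = (hi - lo) / num_bins
        self.a = nn.Parameter(torch.ones(num_bins))   # slopes
        self.b = nn.Parameter(torch.zeros(num_bins))  # intercepts

    def forward(self, x):
        idx = ((x - self.lo) / self.width).floor().long()
        idx = idx.clamp(0, self.num_bins - 1)  # outermost bins extend outward
        return self.a[idx] * x + self.b[idx]
```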

Non-local similarity
Many classic image denoising methods have demonstrated the merits of the non-local self-similarity (NSS) prior on natural images for image restoration (Buades et al. 2005; Dabov et al. 2007). The NSS prior states that similar image patches tend to re-occur within the image in non-local regions. While NSS has been broadly exploited in classic denoisers, only a few works have attempted to incorporate this internal image property into deep networks for image denoising (Lefkimmiatis 2017; Plötz and Roth 2018; Zhang et al. 2019; Liu et al. 2018; Guo et al. 2021; Lefkimmiatis 2018; Tachella et al. 2021). Among them, we discuss two major categories, distinguished by the way non-local information is incorporated into the training signal of the network: non-local retrieval and implicit non-local attention.
Non-local retrieval Motivated by BM3D (Dabov et al. 2007) and non-local means (Buades et al. 2005), the intent of work in this category is to explicitly find and retrieve the patches most similar to a query patch and utilize them in subsequent stages to discard the noise component. The earliest deep denoiser exploiting a non-local prior was NLNet, proposed by Lefkimmiatis (2017). NLNet is a patch-based proximal gradient method unrolled into multiple stages; each stage is efficiently modelled by a sequence of convolutions that linearly transform every patch, a block-matching step that collects similar patches, and a collaborative filtering block that projects all patches into a single patch representing the clean estimate. A later work proposed a patch denoising framework in which the network takes an individual noisy patch together with a set of its most similar patches and outputs a vector of matching scores; the denoised patch is then obtained by averaging across the candidates using the matching scores. In contrast to normal convolutions, which have a rigid sampling grid and fixed kernel weights, subsequent work proposed to explicitly learn the sampling locations along with the kernel weights in a data-driven manner, so that the network can adaptively sample from the 2D input space and freely expand its receptive field. Follow-up work not only adopted deformable convolutions, but also inserted modulated deformable convolutions into the network to sample spatially relevant features for weighting.
Non-local attention Most existing denoising methods suffer from a small receptive field due to local convolutions; however, long-range similarities may be exploited for denoising. Wang et al. (2018) embedded the concept of non-local means into neural networks and proposed the non-local neural network, leading to a considerable boost in many computer vision applications. Zhang et al. (2019) adopted this work and proposed a residual trunk-and-mask architecture for image denoising: the trunk branch provides the intermediate features, whereas the mask branch calibrates the features based on non-local correspondences in the spatial domain. Another notable technique is N3Net (Plötz and Roth 2018), which proposed a continuous deterministic relaxation of the non-differentiable KNN selection rule; it is used within the internal layers of the network to concatenate every feature vector with a weighted average of its most similar features in the 2D space of the intermediate representations. Liu et al. (2018) integrated the non-local mean operation into recurrent neural networks and proposed performing non-local matching in a confined region centred at the query position, rather than over the entire spatial scope. A minimal sketch of a non-local block follows.
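The sketch below assumes the embedded-Gaussian form popularized by non-local neural networks; the channel-reduction factor and the residual connection are conventional choices, not specific to any single denoiser above.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Minimal embedded-Gaussian non-local block: every spatial position is
    refined by an attention-weighted sum over all other positions."""
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or max(channels // 2, 1)
        self.theta = nn.Conv2d(channels, reduced, 1)  # query embedding
        self.phi = nn.Conv2d(channels, reduced, 1)    # key embedding
        self.g = nn.Conv2d(channels, reduced, 1)      # value embedding
        self.out = nn.Conv2d(reduced, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (n, hw, c')
        k = self.phi(x).flatten(2)                    # (n, c', hw)
        v = self.g(x).flatten(2).transpose(1, 2)      # (n, hw, c')
        attn = torch.softmax(q @ k, dim=-1)           # pairwise similarities
        y = (attn @ v).transpose(1, 2).reshape(n, -1, h, w)
        return x + self.out(y)                        # residual connection
```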
Graph neural networks Valsesia et al. (2019, 2020) proposed to exploit graph convolutions to cope with the limited receptive field of traditional convolutional layers. They generalized traditional convolution layers by creating adaptive receptive fields based on nearest-neighbour graphs; during training, distant but similar features are aggregated to leverage non-local similarities, and an additional module estimates the aggregation weights to further increase learning adaptability. Li et al. (2021) designed a cross-patch graph convolutional network to explicitly model cross-patch long-range contextual dependencies: for every patch, their network aggregates patches similar to the primary input patch and ensembles the extracted features toward a more accurate clean patch estimate. Later work extended patch-based graph convolutional networks with a dynamic attentive graph in which the query patch can have a dynamic, adaptive number of neighbours.

Raw denoising
Until recently, most deep denoisers were trained on pairs of simulated noisy and clean images, resulting in a dramatic performance discrepancy once the networks are evaluated on real noisy images. To narrow the gap between the training and inference domains, recent works in image denoising attempt to perform training and validation on raw real noisy datasets in an explicit manner (Moseley et al. 2021; Wei et al. 2020; Pan et al. 2020; Liu et al. 2021; Kim et al. 2020; Zamir et al. 2020; Jaroensri et al. 2019; Jang et al. 2021).
RAW noise synthesis As described in Sect. 2.2, the in-camera processing pipeline, a.k.a. the image signal processor (ISP), affects the nature of the noise, and noise can therefore come from different sources in a real camera system. Guo et al. (2019) proposed a noise model that accounts for heteroscedastic Gaussian noise as well as demosaicing, gamma correction and JPEG compression; the new noise model is used to simulate a large set of noisy images that resemble real-world noise. Brooks et al. (2019) incorporated more ISP components into the noise modelling. Particularly, they showed that a generic clean image can be unprocessed into realistic RAW data by successively applying inverse tone mapping, gamma decompression, sRGB-to-RGB correction, and inverse white balance and digital gain; heteroscedastic noise is then added to the RAW data to simulate realistic noisy and clean pairs for raw-to-raw training. Similarly, Zamir et al. (2020) proposed a cyclic framework for learning RGB-to-RAW and RAW-to-RGB mappings through two distinct network branches.
Additional improvements Motivated by the instance normalization module (Ulyanov et al. 2016), Kim et al. (2020) adopted adaptive instance normalization and a transfer learning scheme to reduce the domain discrepancy between synthetic and real noise data. After training the network on synthetic noisy datasets, only the adaptive instance normalization layers are fine-tuned on the real noisy data to bridge the distribution gap. Another work proposed a lightweight model for on-device denoising of real images; the core of the methodology is a novel k-sigma transform that projects noisy images captured across different ISO settings into an ISO-invariant space, so that a single network can process images with different noise characteristics. Liu et al. (2021) leveraged an invertible network for image denoising. To mitigate the challenges associated with the different distributions of the input and output pairs, they proposed to transform the noisy input into a low-resolution clean image and a latent encoding of the noise; the noise component can then be discarded by replacing its encoding with a sample drawn from a prior distribution.

Boosting
Boosting algorithms are among the most widely used techniques for improving the performance of machine learning algorithms (Talebi et al. 2013; Romano and Elad 2015). In the image denoising field, data-driven DL-based denoising networks have been incorporated as the base units of a cascaded boosting framework. Inspired by Strengthen-Operate-Subtract (SOS) boosting (Romano and Elad 2015), each boosting unit feeds the summation of the denoised image and the noisy input into the subsequent denoising module, and the previous denoised image is subtracted at each step to ensure the iterability of SOS, as sketched below. A cascaded boosting configuration leads to a very deep architecture; to cope with this challenge, the base units are lightweight structures equipped with dense residual connections and dilated kernels, reducing the overall computational burden of the network. In contrast to cascaded boosting, Choi et al. (2019) developed a convex optimization procedure to optimally aggregate the outputs of multiple denoising units (CsNet). Specifically, they solve a quadratic minimization problem to find the optimal weights for combining the complementary outputs of different denoising networks.
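Assuming access to any pre-trained denoiser f, the SOS iteration can be written in a few lines; this is a sketch of the generic SOS recursion (Romano and Elad 2015), not of any particular cascaded network.

```python
def sos_boosting(denoiser, noisy, steps=3):
    """Strengthen-Operate-Subtract iterations:
    y_{k+1} = f(x + y_k) - y_k, where f is a pre-trained denoiser."""
    estimate = denoiser(noisy)  # initial clean estimate y_0
    for _ in range(steps):
        estimate = denoiser(noisy + estimate) - estimate
    return estimate
```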

Generative image denoising
In contrast to discriminative models, which compute $P(y_i \mid x_i)$, a generative model captures the generation process of the observed noisy example by modelling $P(x_i \mid y_i)$. In other words, discriminative models focus on separating out the underlying clean image, while generative models try to understand the basics of noisy image formation. In the following, we elaborate on a few representative denoisers based on generative models, reviewing the existing works from two perspectives: methods based on variational inference, and generative adversarial networks.

Variational inference
In an attempt to discern the generation process of noisy observations, Yue et al. (2019) proposed a variational inference framework that performs both noise estimation and noise removal in a Bayesian manner. Their framework learns an approximate posterior over the clean image and the noise statistics, conditioned on the observed noisy image. By replacing the exact posterior with a variational distribution, they take advantage of the independence of the variational latent variables and represent the variational distribution in the form of two distinct functions: the variational approximation of the clean image is modelled by a conjugate Gaussian prior with mean and covariance parameters, while for noise estimation the inverse Gamma distribution is taken as the conjugate prior. Two distinct deep networks are trained to learn the mapping between the noisy image and the parameters of the variational posteriors. A second important generative denoiser, NoiseFlow, was proposed by Abdelhamed et al. (2019); it unites basic parametric noise models with the power of normalizing flow architectures (Kingma and Dhariwal 2018), initially proposed for variational inference (Rezende and Mohamed 2015) and density estimation (Dinh et al. 2014), to approximate the real noise distribution from large datasets. Starting from a simple distribution, NoiseFlow learns the transformation to the complex distribution of real noise via a sequence of differentiable and invertible mappings. The learned noise distribution is then used to compile a set of realistic synthetic images for training deep neural networks.

Generative adversarial networks
In recent years, generative adversarial networks (GAN) (Goodfellow et al. 2014) have received significant attention in a variety of computer vision and image processing tasks, thanks to their compelling ability to generate realistic examples plausibly drawn from an existing distribution of samples. In essence, GANs consist of training two distinct networks with different objectives simultaneously: the generator network and the discriminator network. The discriminator is typically a binary classifier that distinguishes real from synthetic data. The generator, in contrast, attempts to generate synthetic data that resembles the training data distribution; its objective is to synthesize realistic data such that the failure rate of the discriminator is maximized, while the discriminator aims to minimize the binary classification error (Jabbar et al. 2021). For image denoising, the applicability of GANs has been explored in recent years (Kim et al. 2019; Lin et al. 2019; Marras et al. 2020). GANs were first leveraged for real noise modelling to build a paired training dataset: a generator produces synthetic noise while a discriminator is trained to distinguish real from synthetic noise. However, this approach only takes a random vector as input to the noise generator, so the generated noise samples are signal-independent, since the network never sees the clean intensity signal during training. Kim et al. (2019) improved on this method by including additional inputs to the generator, such as the clean image, ISO and shutter speed. Most recently, the noise generation and camera characteristics have been decoupled via two distinct networks: a noise generative network receives and processes the clean image and an initial noise sample, while a latent vector produced by a camera-encoding network transfers camera-specific characteristics of the image to the noise generative network via feature concatenation. The final synthetic noise is obtained at the output of the generative network. The noise generative and camera-encoding networks are trained jointly, along with a discriminator supervised by adversarial, feature matching and triplet losses (Schroff et al. 2015).
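A single adversarial training step for signal-dependent noise synthesis might look as follows. This is a sketch only: the generator and discriminator definitions, the latent dimensionality and the plain non-saturating GAN loss are placeholder assumptions, not the published training recipe of any method above.

```python
import torch
import torch.nn.functional as F

def gan_noise_step(gen, disc, g_opt, d_opt, clean, real_noise):
    """One adversarial step: gen(clean, z) synthesizes signal-dependent
    noise; disc scores real vs. synthesized noise (logits)."""
    z = torch.randn(clean.size(0), 64)  # latent noise code (assumed dim)
    fake_noise = gen(clean, z)

    # discriminator update: real noise -> 1, synthetic noise -> 0
    real_logits = disc(real_noise)
    fake_logits = disc(fake_noise.detach())
    d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
           + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # generator update: try to fool the discriminator
    logits = disc(fake_noise)
    g_loss = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```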
Inspired by Li et al. (2017), a dual GAN has been designed to learn the joint distribution of noisy and clean pairs. The joint distribution is approximated by its two factorized forms, so the framework consists of two networks: (a) a denoiser that maps the noisy image to a clean estimate and (b) a noise generator that maps the clean image to a noisy one. Both networks are jointly trained along with a discriminator. After training, the learned denoiser can be used directly for noise removal, while the noise generator can be utilized to build realistic noisy and clean training pairs. Marras et al. (2020) proposed to constrain the residue of the noisy input: a denoising network takes the noisy input as well as encoded information about the camera and produces a residue estimate. During training, the ground-truth residue, the clean image and the encoded camera information are further fed into an auto-encoder. Given that the decoders of the denoising network and the auto-encoder are shared, the denoiser is explicitly constrained to generate residue estimates that are consistent with the noise manifold.

Unsupervised and self-supervised denoising
The DL-based image denoising research has flourished with hundreds of works seeking to learn the mapping between noisy and clean pairs. However, collecting clean images in some domains is very expensive, or sometimes infeasible. Accordingly, interest has grown in training denoisers without clean supervision. One line of work customizes the network inputs and outputs or exploits blind-spot schemes; another designs advanced self-supervised loss functions to train networks in the absence of clean images (Stein 1981; Soltanayev and Chun 2018; Zhussip et al. 2019; Soltanayev et al. 2020). It is noteworthy that researchers often use the terms unsupervised denoising, self-supervised denoising and blind denoising interchangeably in the literature. In Table 4 and Fig. 5, we summarize key methods and corresponding network architectures for unsupervised and self-supervised image denoising. Table 4 tags each method by its reliance on blind-spot networks ("BS"), SURE-based losses ("SU"), an image prior ("PR"), or input/output customization ("IO"); among its entries:

- (2018), ICASSP (BS, SU): a denoiser based on a SURE-like estimated loss and blind-spot networks
- Moran et al. (2020), CVPR (IO): adds noise to the input and learns a mapping between the noisier and the original noisy images
- Krull et al. (2019), CVPR (IO): leverages the surrounding context to predict a noise-free estimate of the central pixel
- Lehtinen et al. (2018), ICML (IO): learns a mapping between pairs of noisy images of the same clean image
- Laine et al. (2019), NeurIPS (IO): formulates blind-spot denoising in a Bayesian framework
- Huang et al. (2021), CVPR (IO): treats neighbouring pixels as different noisy realizations of the same signal
- Xie et al. (2020), NeurIPS (IO): provides a novel training scheme and loss that exploits the entire noisy image for updates

Unbiased MSE estimators
Mean-squared error is recognized as an indispensable element of deep denoisers, but it necessitates the availability of clean ground truths during training. In the past, the applicability of Stein's unbiased risk estimator (SURE) (Stein 1981) was explored for unsupervised denoising in traditional frameworks (Nguyen and Chun 2017; Van De Ville and Kocher 2009). Given its success, SURE has attracted considerable attention in DL-based image denoising over the past few years (Soltanayev and Chun 2018; Zhussip et al. 2019; Soltanayev et al. 2020). Soltanayev and Chun (2018) proposed the first work investigating the benefits of SURE in DL-based denoisers in lieu of the MSE loss. For additive Gaussian noise with variance $\sigma^2$, the SURE function can be written as:

$$\mathrm{SURE} = \frac{1}{n}\left\lVert X - \Phi^{-1}(X)\right\rVert^{2} - \sigma^{2} + \frac{2\sigma^{2}}{n}\,\mathrm{div}_X\!\left(\Phi^{-1}(X)\right), \tag{18}$$

where $\mathrm{div}_X(\cdot) = \sum_{i=1}^{n} \partial \Phi^{-1}(X)_i / \partial x_i$ denotes the divergence.

(Fig. 5: different methods and corresponding network architectures for unsupervised and self-supervised image denoising.)
However, the divergence term in Eq. 18 cannot be computed analytically in many circumstances. To address this issue, the authors adopted Monte-Carlo SURE (Ramani et al. 2008) to approximate the divergence term as:

$$\mathrm{div}_X\!\left(\Phi^{-1}(X)\right) \approx \frac{1}{\epsilon}\, \hat{z}^{\top}\!\left(\Phi^{-1}(X + \epsilon \hat{z}) - \Phi^{-1}(X)\right), \tag{19}$$

where $\hat{z} \sim \mathcal{N}(0, I)$ is a random probe vector and $\epsilon$ is a fixed small positive value. The DnCNN network is trained with the proposed MC-SURE objective function for simulated Gaussian noise removal. Zhussip et al. (2019) extended the SURE-based method to train denoising networks when two uncorrelated noise realizations per clean image are available, and further investigated the feasibility of using imperfect clean ground truths for supervision. The Monte-Carlo approximation for SURE involves a hyper-parameter, $\epsilon$, that is hard to select for optimal performance; to address this, Soltanayev et al. (2020) proposed a new approximation of the divergence term that requires no hyper-parameter.
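Eqs. 18-19 combine into a self-supervised loss that needs only the noisy input and an estimate of the noise level. The sketch below is a direct transcription under those assumptions; the function name and default probe scale are illustrative.

```python
import torch

def mc_sure_loss(denoiser, noisy, sigma, eps=1e-3):
    """Monte-Carlo SURE (Ramani et al. 2008): estimates the MSE to the
    unseen clean image using only the noisy input (Eqs. 18-19)."""
    n = noisy.numel()
    out = denoiser(noisy)
    z = torch.randn_like(noisy)  # probe vector z ~ N(0, I)
    div = (z * (denoiser(noisy + eps * z) - out)).sum() / eps
    return ((noisy - out) ** 2).sum() / n - sigma ** 2 \
           + (2 * sigma ** 2 / n) * div
```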

Image prior
Some researchers have examined the benefits of data-driven or hand-crafted priors for unsupervised denoising (Izadi et al. 2019; Ulyanov et al. 2018; Mataev et al. 2019). Ulyanov et al. (2018) showed that the structure of a neural network is itself able to capture a large portion of the image statistics prior. Specifically, an image-specific network is first initialized with random weights; a random uniform vector $u$ is then fed into the input layer, and the parameters of the network are optimized to match the network output to the observed noisy image using an L2 loss, i.e.

$$\min_{\psi} \left\lVert \Phi^{-1}(u; \psi) - X \right\rVert_2^2.$$
In a clean image, different regions are spatially coherent, so the network rapidly captures this prior and reconstructs smooth estimates; conversely, low spatial coherency makes a perfect reconstruction of the noise time-consuming. Accordingly, the implicit regularization imposed by the structure of the network and early stopping of the optimization yield clean estimates. Mataev et al. (2019) combined the implicit regularization captured by the CNN structure in the deep image prior (Ulyanov et al. 2018) with the explicit regularization paradigm of Regularization by Denoising (Romano et al. 2017) to improve the overall regularization effect and enhance image denoising. Similarly, the implicit CNN regularization has been combined with an explicit total variation penalty to improve the denoising power of the deep image prior (Ulyanov et al. 2018). Other work took the deep image prior one step further and employed a neural architecture search algorithm (Zoph and Le 2017) to optimize the CNN structure, searching for both the upsampling units in the decoder and the skip connection patterns between encoder and decoder layers. Jo et al. (2021) designed a novel metric based on the loss value to improve the stopping criterion in deep image prior training.
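A compact version of this optimization loop is given below; the input depth, iteration count and learning rate are illustrative assumptions, and in practice the stopping point is the critical hyper-parameter.

```python
import torch

def deep_image_prior(net, noisy, iters=1800, lr=0.01):
    """Deep-image-prior sketch (Ulyanov et al. 2018): fit a randomly
    initialized network to a single noisy image from a fixed random input,
    stopping early before the noise itself is reproduced."""
    u = torch.rand(1, 32, *noisy.shape[-2:])  # fixed random input (32 ch. assumed)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(iters):                    # early stopping acts as the regularizer
        opt.zero_grad()
        loss = ((net(u) - noisy) ** 2).mean() # L2 fit to the noisy target
        loss.backward()
        opt.step()
    return net(u).detach()
```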

Noise statistics
One of the well-known properties of the noise component in digital images is that noise pixels at different spatial locations are independent given the clean pixel values, i.e. $p(n_i, n_j \mid y) = p(n_i \mid y)\, p(n_j \mid y)$ for $i \neq j$. Furthermore, the noise is assumed to be zero-mean, i.e. $\mathbb{E}[n_i] = 0$, so that $\mathbb{E}[x_i] = y_i$.
Many works have recently relied on these statistical properties of noise and have employed neural networks for the denoising task in the absence of clean images (Krull et al. 2019; Lehtinen et al. 2018; Batson and Royer 2019), showing great advances in both reconstruction error and perceptual quality. We briefly discuss the representative methods along this line in the following. Lehtinen et al. (2018) introduced noise2noise, a training scheme in which the parameters of the network are optimized to learn a mapping $\Phi$ between pairs of independently corrupted images $X = \{y_i + n_i\}$ and $X' = \{y_i + n'_i\}$. In other words, the two images $X$ and $X'$ in a training pair share the same underlying clean image $Y$, but the per-pixel noise realizations $\{n_i\}$ and $\{n'_i\}$ are independent and different. With such training pairs, the network minimizes the MSE loss between its output for the noisy input and the noisy target, $\mathcal{L} = \sum_i \lVert \Phi(x_i) - x'_i \rVert^2$. Obviously, it is impossible for the learned network to predict one noisy realization from another. Therefore, the network inevitably converges to output the per-pixel expected value of its targets, i.e. $\mathbb{E}[x_i]$. Given that the noise is assumed to be zero-mean, the learned network converges to the clean image, as shown below:
$$\mathbb{E}[x'_i] = \mathbb{E}[y_i + n'_i] = y_i + \mathbb{E}[n'_i] = y_i .$$
Noise2Noise learning
This training framework allows the network to be trained solely on noisy images, without access to clean ground truth. Although this learning strategy requires multiple noisy realizations per scene during training, Lehtinen et al. demonstrated that even a single additional noisy image is sufficient to achieve reasonable denoising performance.
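A single noise2noise training step then reduces to regressing one noisy view onto the other; the sketch below assumes an external `model`, `optimizer`, and a data source yielding two independent noisy realizations of the same scene.

```python
def noise2noise_step(model, optimizer, noisy_a, noisy_b):
    # noisy_a and noisy_b share the clean content but carry independent noise.
    optimizer.zero_grad()
    loss = ((model(noisy_a) - noisy_b) ** 2).mean()  # target is the other view
    loss.backward()
    optimizer.step()
    return loss.item()
```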

Blind-spot networks
Despite the unprecedented success of noise2noise, its requirement for pairs of noisy images during training is a significant shortcoming. To solve this problem, a line of work comprising noise2void (Krull et al. 2019) and noise2self (Batson and Royer 2019) proposed the idea of blind-spot networks to train denoisers using single noisy images without ground truth. Considering a patch $\{x_i\}_{i=1}^{k^2}$ of size $k \times k$ centered at location $i$, the central pixel $x_i$ is excluded from the receptive field of the network through a masking scheme. The network is then trained to predict the value at location $i$, while the original pixel value $x_i$ is used as the ground truth for the loss calculation. Owing to the lack of information about $x_i$ in the forward pass, the network cannot learn an identity mapping between input and output and unavoidably produces an estimate consistent with the surroundings. According to Eqs. 21 and 24, the neighbouring pixels carry no information about the noise component $n_i$, and therefore the network produces the expected value of its inputs at convergence, i.e. $\mathbb{E}[x_i]$. The key idea of blind-spot networks has recently been expanded by Laine et al. (2019), who incorporated the blind receptive field into the architecture design rather than using a masking scheme on input patches. In particular, four rotated versions of the input image are fed into a network with a directional receptive field to exclude the central pixel from the surrounding context.
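A masking-based blind-spot loss can be sketched as follows; replacing each blinded pixel with a fixed neighbour and the mask ratio are simplifying assumptions (noise2void samples replacement pixels randomly from a local window).

```python
import torch

def blind_spot_loss(model, noisy, mask_ratio=0.01):
    mask = torch.rand_like(noisy) < mask_ratio        # pixels to blind
    neighbour = torch.roll(noisy, shifts=1, dims=-1)  # crude replacement value
    masked_input = torch.where(mask, neighbour, noisy)
    pred = model(masked_input)
    # supervise only at blinded locations, using the withheld noisy values
    return ((pred - noisy)[mask] ** 2).mean()
```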

Additional improvements
Along this direction, Quan et al. (2020) proposed self2self, which trains a single network for every noisy image. They exploit dropout operations (Srivastava et al. 2014) to randomly mask a subset of pixels during training. At inference, multiple random masks are applied to the input image to produce a set of clean estimates, which are then averaged to generate a robust clean estimate with lower variance. Moran et al. (2020) built on noise2noise and proposed a novel learning strategy called noisier2noise. Given a statistical model for the noise, synthetic noise is drawn and added to the original noisy image to generate a doubly-noisy image. The network is then trained to predict the original noisy image from the doubly-noisy image; after convergence, the clean estimate is obtained by a set of simple mathematical operations. Similar to noisier2noise, a related approach proposed to add synthetic noise to the original noisy input, but trains a distinct network for every individual image to be denoised. Additionally, Huang et al. (2021) introduced neighbor2neighbor learning to train a network on different versions of an individual noisy image collected by a novel random neighbour sub-sampler, as sketched below. Since neighbouring pixels are very similar in terms of the underlying clean content while the noise is independent, this learning strategy provides an approximation of noise2noise using only single noisy images. Furthermore, the authors proposed a novel regularization term in the loss function to improve the denoising performance. Pang et al. (2021) extended noise2noise and proposed the Recorrupted2Recorrupted scheme, where only a set of unorganized noisy images without pairwise correspondences is used during training. Lastly, Kim and Ye (2021) introduced a novel Bayesian framework called noise2score that provides the posterior mean of canonical parameters from noisy images based on Tweedie's formula (Efron 2011).
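A random neighbour sub-sampler in the spirit of neighbor2neighbor can be sketched as follows; the 2x2-cell scheme is our reading of the idea, and the exact sampler, loss, and regularizer of Huang et al. (2021) differ in detail.

```python
import torch

def neighbour_subsample(noisy):
    # noisy: (B, C, H, W) with even H and W. From every 2x2 cell, pick two
    # *different* pixels to form two sub-images that share clean content
    # but carry independent noise realizations.
    b, c, h, w = noisy.shape
    cells = noisy.unfold(2, 2, 2).unfold(3, 2, 2).reshape(b, c, h // 2, w // 2, 4)
    idx1 = torch.randint(0, 4, (b, 1, h // 2, w // 2, 1))
    idx2 = (idx1 + torch.randint(1, 4, idx1.shape)) % 4   # guaranteed distinct
    sub1 = torch.gather(cells, 4, idx1.expand(-1, c, -1, -1, 1)).squeeze(-1)
    sub2 = torch.gather(cells, 4, idx2.expand(-1, c, -1, -1, 1)).squeeze(-1)
    return sub1, sub2   # train with a noise2noise-style loss between them
```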

Denoising applications
Thanks to the ubiquitous demand for denoising algorithms based on deep learning and the rapid advances of denoising techniques in recent years, the ideas of DL-based denoising have been widely adopted in various applications such as video and burst-image denoising.

Joint demosaicing and denoising
Image demosaicing and denoising typically form the first two components of the image processing pipeline in camera systems and account for the most data loss and perturbation. Running these two modules sequentially accumulates the errors introduced by either component and is therefore sub-optimal. Recent data-driven approaches have been developed to mitigate this challenge through joint demosaicing and denoising (Ehret et al. 2019; Xing and Egiazarian 2021; Kokkinos and Lefkimmiatis 2019). Ehret et al. (2019) presented the first attempt to leverage a deep neural network for joint learning of demosaicing and denoising through fine-tuning on RAW bursts. Inspired by the emergence of content-adaptive networks, one study proposed a self-guided network for joint demosaicing and denoising: an initial estimate of the green channel guides the content recovery for all channels in the main branch, where a density map helps the network cope with regions of different difficulty levels. Xing and Egiazarian (2021) focused on developing a joint solution for three fundamental image restoration problems: demosaicing, denoising, and super-resolution. Their proposed network is universal in the sense that any of the modules can be eliminated from the process, yielding the output of the remaining modules.

Burst denoising
In recent years, burst denoising has become a prominent task in mobile photography on handheld devices. A burst captures a sequence of short exposures with small cross-frame motion and strong in-frame noise. Given that the noise is independent across frames, burst denoising relies on the assumption that averaging multiple noisy images leads to a more accurate estimate of the clean image. Many recently proposed burst denoising techniques employ deep neural networks to improve the state of the art (Mildenhall et al. 2018; Rong et al. 2020; Liang et al. 2020; Godard et al. 2018; Hasinoff et al. 2016; Kokkinos and Lefkimmiatis 2019; Marinč et al. 2019; Bhat et al. 2021).
Mildenhall et al. (2018) adopted kernel-prediction networks (KPN) for burst denoising, in which pixel-wise kernels are predicted by the network and convolved with the sequence of frames to obtain a clean frame. The averaging weights in every window centred on a pixel are predicted from the noisy images to address cross-frame motion and in-frame image discontinuities. Later, Marinč et al. (2019) proposed an extended version of KPNs with multiple kernels of different sizes. Additionally, subsequent work equipped KPNs with an attention mechanism to account for inter-frame and intra-frame relationships for better denoising performance. Taking this one step further, a basis prediction network (BPN) was proposed that, given a sequence of burst images, produces a set of 3D basis kernels and per-pixel mixing coefficients; the basis kernels and coefficients are then combined into per-pixel kernels that are convolved with the burst to estimate the clean image. Liang et al. (2020) designed a model to decouple the learning of motion from the learning of noise statistics for burst denoising. Most recently, Bhat et al. proposed a deep reparametrization of the MAP formulation for burst image super-resolution and denoising. Their method learns an error metric and a feature space for the target clean image; the learned feature space can then be used to directly model the image formation process and to integrate image priors into the clean estimate.
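The core of a KPN-style method is how the predicted per-pixel kernels are applied to the burst; the sketch below shows this application step, assuming the kernels have already been produced by a network (the tensor shapes and the joint softmax normalization are our assumptions, not the original design).

```python
import torch
import torch.nn.functional as F

def apply_predicted_kernels(burst, kernels, k=5):
    # burst:   (B, N, H, W)        N noisy frames
    # kernels: (B, N*k*k, H, W)    per-pixel weights predicted elsewhere
    b, n, h, w = burst.shape
    patches = F.unfold(burst.reshape(b * n, 1, h, w), k, padding=k // 2)
    patches = patches.reshape(b, n * k * k, h, w)  # local taps per pixel
    weights = torch.softmax(kernels, dim=1)        # normalize across all taps
    return (patches * weights).sum(dim=1)          # (B, H, W) clean frame
```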

Video denoising
Until recently, video denoising with neural networks had been largely under-explored. Similar to burst denoising, video data typically contain strong correlations along the temporal dimension that can aid the restoration process. Therefore, existing works on video denoising mainly focus on making full use of the spatio-temporal correlations between consecutive frames via recurrent methods (Maggioni et al. 2021; Chen et al. 2016), explicit motion estimation and warping (Xue et al. 2019; Tassano et al. 2019; Ehret et al. 2019), and implicit motion compensation (Claus and van Gemert 2019; Tassano et al. 2020; Dewil et al. 2021). The first attempt to denoise videos was made by Chen et al. (2016), who leveraged recurrent neural networks to learn the mapping between noisy and clean video sequences. Among the optical-flow-based methods, Tassano et al. (2019) proposed to align individually denoised frames with respect to the reference frame, followed by a temporal denoiser operating on the sequence of aligned frames. Similarly, Claus and van Gemert (2019) proposed decomposing video denoising into two steps: frame alignment and temporal filtering. Vaksman et al. (2021) introduced the idea of generating artificial frames based on patch-crafting, which are used to augment video sequences. The enlarged video sequence is then processed with spatial and temporal filtering to yield the denoised video.
Due to the heavy computational cost of motion estimation, several attempts have dealt with motion in an implicit manner. Claus and van Gemert (2019) suggested sequentially chaining spatial denoising and temporal fusion by processing three frames at a time to obtain the clean estimate of the middle frame. Similarly, Tassano et al. (2020) extended their previous work by replacing the optical flow alignment with implicit motion compensation integrated into the network architecture. Maggioni et al. (2021) adopted a multi-stage framework for video denoising: temporal coherency across frames is first aggregated by a fusion stage, a spatial denoising stage then discards the leftover noise in the fused image, and lastly a spatio-temporal refinement step restores high-frequency details.
More recently, some works have attempted to exploit the temporal redundancy in videos to design self-supervised denoising solutions (Ehret et al. 2019; Dewil et al. 2021). Inspired by noise2noise (Lehtinen et al. 2018), Ehret et al. (2019) proposed a frame-to-frame training scheme for blind video denoising that adapts a generic pre-trained denoiser to different noise models and data. Lee et al. (2021) proposed a simple yet effective self-supervised training scheme in which a pre-trained denoiser is fine-tuned for every individual input test sequence; during fine-tuning, the initial output of the pre-trained network is treated as the pseudo-clean ground truth for the loss calculation, as sketched below. Given a sequence of noisy frames surrounding the frame at time t, Dewil et al. (2021) adopted a solution similar to blind-spot networks and proposed to withhold the frame at time t-1 from the network inputs and use it as the ground truth to penalize the network's output for the frame at time t during training.
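A sketch of that test-time fine-tuning loop is given below; the optimizer, step count, and learning rate are illustrative assumptions, not the settings of Lee et al. (2021).

```python
import torch

def finetune_on_sequence(denoiser, noisy_seq, steps=100, lr=1e-5):
    with torch.no_grad():
        pseudo_clean = denoiser(noisy_seq)   # initial output as pseudo GT
    opt = torch.optim.Adam(denoiser.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((denoiser(noisy_seq) - pseudo_clean) ** 2).mean()
        loss.backward()
        opt.step()
    return denoiser(noisy_seq).detach()
```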

Medical imaging
Medical imaging analysis has developed rapidly in recent decades and has become a crucial factor in disease diagnosis. Medical images provide an accurate internal view of the human body, which is subsequently assessed by diagnostic techniques to identify tissues or organs requiring treatment. In medical imaging, precise and accurate information extraction is of paramount importance for disease diagnosis, staging and treatment. However, noise artifacts may degrade the visual representation of medical images during acquisition and/or later processing steps. Low-quality medical images complicate disease identification and may hamper patient care and treatment. Hence, denoising of medical images is indispensable and has become a mandatory pre-processing step in medical imaging systems.
There are several major sources of noise in medical imaging, including signal attenuation and scattering through tissue, inherent random variations in photon counts, electronic sensor and detector systems, patient motion, and image reconstruction imperfections. For example, PET and SPECT imaging are primarily affected by inherent noise due to variations in photon counts, electronic components, and detector planes. In addition to errors in the photon count, thermal activity may also produce Gaussian noise in X-ray imaging (Goyal et al. 2018). Noise in Magnetic Resonance Imaging (MRI) mainly takes the form of Rician, Gaussian, and Rayleigh noise due to electronic interference in the receiver circuits, radio-frequency emissions caused by the thermal motion of ions in the patient's body, and possible failure of electronic components. In Computed Tomography (CT) images, high-speed computation causes thermal energy fluctuations, which subsequently lead to Gaussian noise; CT images are also often contaminated by noise due to mathematical computations and quantum statistics [209]. Speckle noise is notorious in ultrasound imaging (Goyal et al. 2018). The reader may refer to Sagheer and George (2020) for a more comprehensive review of denoising in medical imaging.
In this section, we first provide a brief introduction to the most prevalent medical imaging modalities, followed by recent advances in denoising algorithms for each of them.

Ultrasound imaging
Ultrasound provides an efficient and non-invasive medical imaging modality and is widespread in medical diagnosis for musculoskeletal, cardiac, and obstetrical diseases. One of the main issues with ultrasound images is the presence of noise artifacts introduced during acquisition, transmission and analysis, which complicates the diagnosis of diseases by clinicians or computer-aided diagnosis (CAD) systems. In recent years, attempts have been made to leverage the strength of CNNs for ultrasound denoising, a.k.a. despeckling (Lan and Zhang 2020; Karaoğlu et al. 2021; Cammarasana et al. 2021). Lan and Zhang (2020) proposed a novel residual network based on the UNet architecture for ultrasound image despeckling. Furthermore, they equipped the network with both spatial and channel-wise attention mechanisms to enhance feature learning for improved noise removal. Cammarasana et al. (2021) trained a network whose inputs are noisy images and whose ground truths are denoised counterparts obtained from the parameter-tuned WNNM algorithm.

Magnetic resonance imaging
Magnetic resonance (MR) imaging is a widely used non-invasive imaging technique that provides high-resolution visualization of anatomical structures, tissues and organs. However, MR images may inevitably be captured along with noise artifacts caused by physiological motion, instabilities of the MR scanning hardware, and short acquisition times. As a result, noise removal or attenuation is essential for the comprehension and evaluation of MR images. In recent years, a number of deep learning-based methods have been proposed for MR image denoising (Jiang et al. 2018; Ran et al. 2019; You et al. 2019; Aetesam and Maji 2021). Jiang et al. (2018) proposed to extend DnCNN to multi-channel MR inputs under two training strategies: with and without a noise model. Ran et al. (2019) exploited a residual encoder-decoder coupled with adversarial and perceptual losses (Johnson et al. 2016) to outperform state-of-the-art methods on both simulated and real clinical data. You et al. (2019) used a wide architecture design to address the vanishing gradient issue and facilitate capturing more structural features. Most recently, Aetesam and Maji (2021) proposed to incorporate prior information about the image degradation in the form of loss functions to improve the learning performance of the network. They also adopted a Bayesian maximum a posteriori (MAP) estimator to further improve the quality of the restored images.
Given the advances in self-supervised denoising approaches (Lehtinen et al. 2018; Batson and Royer 2019; Krull et al. 2019), Xu and Adalsteinsson (2021) proposed a denoising framework for dynamic MRI. This approach uses only single noisy images along with a few auxiliary observations from different time frames to optimize the parameters of the network. To further improve the quality of the restored image, single-image and multi-image denoising schemes are aggregated in an end-to-end trainable network. Furthermore, spatial transformer networks (Jaderberg et al. 2015) are utilized to approximate the motion between slices.

Computed tomography
Computed tomography (CT) imaging is a widespread imaging modality that allows high-resolution visualization of anatomical structures. However, a major concern inherent to CT acquisition is the potential health hazard related to ionizing radiation (Brenner and Hall 2007). A common strategy to alleviate radiation exposure is to lower the operating current in CT examinations; a potential drawback of such dose reduction, however, is the introduction of noise artifacts in the reconstructed CT images. Hence, an active research direction for improving the quality of low-dose CT images focuses on post-processing the images after reconstruction. Recently, DL-based denoising methods have shown promising performance in removing the unwanted artifacts in low-dose CT.
Early deep learning methods leveraged feedforward and residual architectures to improve the feature extraction capability of networks for low-dose denoising (Chen et al. 2017a, b; Kang et al. 2017). Chen et al. (2017a) adopted a CNN to learn the mapping between low-dose and normal-dose CT in a patch-by-patch manner, and subsequently continued their efforts by proposing a residual encoder-decoder network for patch-based low-dose denoising. Kang et al. (2017) proposed to apply a directional wavelet transform to low-dose CT images prior to feeding them into CNNs; by doing so, the network can efficiently exploit the intra- and inter-band correlations to suppress noise artifacts. In order to capture structural information across large regions, later work introduced a 3D self-attention module to benefit from spatial information both within and between CT slices.
With the increased popularity of GANs in medical imaging (Yi et al. 2019), many researchers have attempted to boost the performance of DL-based low-dose CT denoising methods (Shan et al. 2018; Wolterink et al. 2017; Yi and Babyn 2018). Wolterink et al. (2017) were the first to adopt GANs for low-dose CT denoising. Next, Yi and Babyn (2018) used a UNet-like architecture for the generator and demonstrated improved denoising performance thanks to its multi-scale encoding and decoding structure. Yang (2018) further augmented the loss functions in a Wasserstein GAN with a perceptual loss (Johnson et al. 2016) to replace noise artifacts with more plausible recovered details. More recent work further improved low-dose denoising performance by adding edge-aware and noise-aware attention mechanisms to the generator, together with a multi-scale discriminator that expands the receptive field and improves its judgemental capability. Another recent work investigated the applicability of cycleGAN (Zhu et al. 2017) to train a low-dose CT denoising network based on unpaired image-to-image translation.
Building on the successful use of self-supervised denoisers on natural images, several works have tried to remove the need for normal-dose CT images during the training of deep networks (Won et al. 2021; Hasan et al. 2020; Hendriksen et al. 2020). Hendriksen et al. (2020) designed Noise2Inverse, where a CNN is trained to transform an image reconstructed from one sub-sinogram into the image reconstructed from the complementary sub-sinogram. The key idea of Noise2Inverse is to partition the data in the sinogram domain and train the CNN in the image domain; after training, the network performs denoising in the image domain only. Inspired by Noise2Noise, Hasan et al. (2020) introduced a collaborative technique to map many low-dose CT images to the normal-dose counterpart through joint training of multiple generators, where the difference between any two generated outputs is incorporated into the overall loss function to encourage collaboration between the generators. The most recent work on self-supervised CT denoising was proposed by Won et al. (2021), who developed a novel training strategy based on a pre-trained noise model and denoiser. For a new low-dose CT test image, the pre-trained denoiser is further fine-tuned by back-propagating the loss between its output and a pseudo-CT, which is simply a noise difference map predicted by the pre-trained noise model.

Positron emission tomography
Positron emission tomography (PET) imaging is one of the leading imaging modalities for quantitative in vivo measurement of physiological and biochemical processes, with applications in oncology (Rohren et al. 2004), cardiology (Dilsizian et al. 2016) and neurology (Herholz and Heiss 2004). However, high noise levels are one of the main shortcomings of PET compared to CT or MR. The amount of noise artifacts in PET directly depends on two factors: (1) the amount of injected tracer and (2) the duration of scanning. At the same time, patients' exposure to radiation has been a major concern in recent years (Nievelstein et al. 2012). Therefore, a significant amount of research has been devoted to reconstructing normal-dose PET images from low-dose counterparts by removing the noise artifacts.
The DL-based denoising methods for PET imaging can be divided into two categories. The first category operates only on the PET image for noise removal (Gong et al. 2018; Xu et al. 2017; Ouyang et al. 2019; Zhou et al. 2020). Xu et al. (2017) proposed to leverage a UNet to learn the mapping between PET images reconstructed at 1/200 of the standard injected dose and normal-dose ground truth images, and further offered a multi-slice input strategy to improve the robustness of the network. Gong et al. (2018) proposed pre-training a CNN with simulated data and fine-tuning the last few layers using real data; they also adopted a perceptual loss (Johnson et al. 2016) to improve the details of the restored image. Later work exploited a 3D GAN framework to estimate high-quality normal-dose PET images from their corresponding low-dose PET images. Zhou et al. (2020) adopted cycleGAN (Zhu et al. 2017) to learn an enhanced mapping between low-dose and full-dose PET. Most recently, Gong et al. (2020) used a Wasserstein GAN to perform denoising on low-dose PET images and adopted a task-specific initialization to transfer the weights from a pre-trained model for improved training.
The other category encompasses work that receives both PET and MR images as inputs to the network (Xiang et al. 2017). Xiang et al. (2017) designed a CNN with two input channels: the low-dose PET image and the accompanying T1-weighted acquisition from the MR modality; the network learned to combine these two inputs to improve noise removal. Intending to incorporate more structural information, subsequent work proposed a network receiving multi-contrast MR images along with the low-dose PET input. Another emerging family of low-dose PET denoising methods is based on self-supervised or unsupervised training. Cui et al. (2019) conducted the first investigation of low-dose PET denoising without full-dose clean ground truth: inspired by deep image prior (Ulyanov et al. 2018), they designed a network whose input is the CT or MR image of the same patient and used the original low-dose PET as the ground truth in the loss calculation. Noise2Noise (Lehtinen et al. 2018) motivated Yie et al. (2020) to propose a self-supervised method for low-dose PET denoising. In particular, they employed clinical list-mode PET data to generate real, statistically independent noisy images with various noise levels; these data were then used to train CNNs on pairs of noisy images.

Fluorescence microscopy
Fluorescence microscopy (FM) has become an indispensable tool in cell biology that provides visualization of living cells and tissues, forming the basis for the analysis of their morphological and structural characteristics. However, due to weak signal strength and diffraction limits, FM images suffer from large amounts of noise artifacts. Several methods have focused on the development of DL-based denoising schemes for FM imaging (Weigert et al. 2018; Pronina et al. 2020; Khademi et al. 2021). Weigert et al. (2018) applied a data generation technique to collect semi-synthetic FM images, followed by training a UNet model for image restoration. A step towards combining an optimization scheme with deep learning was made by Pronina et al. (2020) by aggregating learnable regularizers into the Wiener-Kolmogorov filter.
A common practice for collecting pairs of noisy and clean images in FM is to simulate noise from models and overlay it onto synthetic clean images. Zhong et al. (2021) followed this strategy and adopted a GAN framework to synthesize noisy and clean pairs to train a denoising network. Zhang et al. (2019) introduced the first dataset for FM imaging where the clean images are obtained by averaging multiple noisy captures. The averaging process, however, is not an entirely effective way to obtain clean images, as it only weakens the noise artifacts rather than eliminating them; moreover, the process of collecting the clean ground truth images is cumbersome.
To remove the requirement for clean images during training, a variety of self-supervised and unsupervised schemes for FM denoising have been proposed in recent years (Izadi and Hamarneh 2020; Krull et al. 2020; Goncharova et al. 2020; Zhong et al. 2021; Lequyer et al. 2021; Byun et al. 2021). Izadi et al. (2019) developed a disentangling network able to separate the noise and signal components of the input image by formulating prior information about the noise and the desired clean output in the loss function. Later, they integrated a classic patch-based non-local Bayesian filtering algorithm into a deep network (Izadi and Hamarneh 2020). Goncharova et al. (2020) built upon the success of blind-spot networks and proposed injecting additional knowledge about the structure of the signal into the self-supervised architecture: they added a convolution between the network output and a point spread function (PSF) to account for the diffraction limit in light microscopy. Krull et al. (2020) extended Noise2Void (Krull et al. 2019) by computing a posterior distribution based on a sampling-based noise model and a prior distribution over the true pixel intensities; the clean estimate for each pixel is then obtained with an arbitrary statistical estimator. The most recent work was developed by Byun et al. (2021), focusing on improving the computational burden and inference speed of blind-spot networks.

Denoising for high-level tasks
PSNR and SSIM are currently the most popular evaluation metrics for existing denoising algorithms, and no consideration is given to the added value of the restored images in downstream tasks. Recently, several works have proposed to tackle this limitation by connecting denoising and high-level vision tasks (Chen et al. 2021). For example, one approach cascaded a denoising module with various downstream networks to establish the relationship between low-level and high-level vision tasks during training, leveraging the joint loss from denoising and downstream tasks to update only the parameters of the denoising network. Similarly, another work proposed a network architecture with shared encoder blocks for image denoising and classification; in addition to the shared architecture, the joint loss function combines restoration and classification terms for objective optimization. In a medical imaging application, Chen et al. (2021) proposed a collaborative network design for image denoising and lesion detection in low-dose Computed Tomography (CT) images. In their approach, feedback from the downstream lesion detection task is injected into the denoising network by computing the perceptual loss on regions of interest extracted from low-dose and normal-dose CT images. Their research shows that collaborative training benefits both the denoiser and the lesion detector. Lastly, another method leveraged per-pixel soft segmentation and consistency regularization to denoise images and detect particles in cryogenic electron microscopy (cryo-EM) and cryo-electron tomography.

Future directions
Thanks to the recent advent of deep learning techniques, especially the strong learning capacity of convolutional neural networks, denoising methodology and its applications have progressed substantially in the scientific literature. However, many open problems remain, related to the intrinsic difficulty of solving the ill-posed denoising task. In this section, we summarize a few challenges together with promising directions for future research in this field.

Theoretical analysis
Most of the existing works in DL-based image denoising lack a theoretical foundation to endorse their design choices. In particular, the proposed methods are often designed by intuition and evaluated empirically on benchmark datasets. In the era of deep learning, research aimed at bridging the gap between traditional image denoising techniques and neural networks by establishing a solid theoretical foundation for architecture designs, loss functions, and training strategies would be highly impactful.

Universality and robustness
The universality mentioned here is twofold: generalization of a denoising algorithm to (1) different types of noise and (2) different noise intensities from the same noise source. Most of the studied denoising algorithms train distinct networks for different noise strengths and noise types, represented by statistical distributions and their parameters. In practice, however, many extrinsic factors from the scene and/or intrinsic parameters of the camera can dynamically influence the nature of the noise. Therefore, improving the robustness of models against varied noise characteristics is of substantial practical value.

Interpretability
DL-based denoising approaches inherit the black-box nature of deep learning models and often aim for higher performance on benchmark datasets, ignoring the explainability of the learned representations and results. We believe the literature demands more thorough efforts to make these models transparent to humans by illustrating why the found parameter settings and network designs outperform classical, interpretable approaches.

Computational efficiency
Since DL-based denoising research has focused on improving the state of the art, progressive improvements on benchmark datasets have been correlated with increases in network complexity, power consumption and execution time. Accordingly, such powerful denoising models might not be efficient enough for real-world deployment. For instance, one of the essential use cases of denoising algorithms lies in smartphone ISPs and other embedded devices with limited computational power, which demand highly efficient and fast models for real-time execution. As such, reducing the computational burden of DL-based denoising approaches to make them more compatible with compute-constrained real-world hardware and software is a timely yet challenging topic.

Data curation
Today, nearly all researchers acknowledge the unquestionable impact of real-world data on the performance of denoising algorithms. As such, future data curation may need to focus on bridging the gap between the data distributions that a network sees during training and at inference. Moreover, the modern camera pipeline has evolved around the idea of capturing and fusing multiple frames for denoising; it is therefore expected that datasets with multiple captures of the same scene will become available in the future. Similarly, physics-based medical image data synthesizers (e.g. POSSUM (Graham et al. 2016), SimSET (Bai et al. 2013), TOAST++ (Schweiger and Arridge 2014)) may be leveraged to provide training data for medical imaging denoising methods. Another notable direction for the future of image denoising and data curation is to combine multiple tasks such as denoising, demosaicing, super-resolution, and high-dynamic-range imaging into a single task.

Conclusion
Image denoising has played a key role in steadily improving the acquisition quality of cameras and delivering high-quality content to consumers and downstream computer vision tasks. In this paper, we began by revisiting the fundamental concepts and the mathematical definition of image denoising, and then provided an in-depth review of existing benchmark datasets and widely used evaluation metrics. We then laid out a novel categorization of supervised and unsupervised techniques and systematically highlighted the improvements and new trends in each category. The taxonomy introduced in this paper is systematic and comprehensive, and may help the reader appreciate the interplay of training strategies, loss functions, and architecture designs. We further discussed denoising challenges for burst images and videos and elaborated on important directions for future research in these application contexts. This survey provides a comprehensive view of the recent progress in deep learning-based image denoising, which we hope will drive further interest in image denoising research and facilitate impactful work addressing the discussed limitations.