Intensity and phase imaging through scattering media via deep despeckle complex neural networks

The presence of a scattering medium degrades both intensity and phase information, especially in biological imaging. Existing techniques for this challenge focus on reconstructing intensity information, and few attempts have been made to recover phase. We propose a method to simultaneously predict both intensity and phase information from a single speckle image using a deep despeckle complex neural network (DespeckleNet). Our method enables high-contrast, single-shot imaging of biological samples through scattering media without labeling. Various experiments demonstrate the superior reconstruction and generalization performance of our method on multiple types of biological samples with different scattering media. We also demonstrate real-time observation of living cellular activities without any contamination of or damage to the cells. Our method offers simple yet effective imaging through scattering media and paves the way for real-time label-free biological imaging.


Introduction
As one of the most challenging and practical research topics, imaging through scattering media 1,2 has received increasing attention in many fields, such as cloud tomography 3 , underwater imaging 4 , and biomedical imaging 5 . In particular, the tissues and cells of most organisms exhibit heterogeneous refractive indices, which randomly scrambles the relative phase of the laser and thus generates speckle 6 . This limits both the imaging resolution and the depth of observation. Many techniques have made great achievements in imaging through scattering media over the years, including optical coherence tomography (OCT) 7 , wavefront shaping [8][9][10][11] , the optical transmission matrix [12][13][14][15][16][17][18] , and others. However, the spatial distribution of the speckle is a complicated function of the microscopic arrangement of the scattering medium and the wavefront of the incident field, and it is difficult to provide a wide field of view (FOV) due to the memory effect. OCT and wavefront shaping techniques demand sophisticated optical design and hardware, which is hard to deploy in practice. In recent years, deep learning (DL) has shown favorable results in imaging through scattering media [19][20][21][22][23][24][25][26] . Li et al. built a "one-to-all" model based on the UNet backbone, which learns statistical information about similar scattering media with different microstructures and extracts the statistical invariance of the speckle 23 .
However, these methods only consider the reconstruction of intensity information and ignore phase information.
Phase information reflects the cellular structures and plays an important role in the imaging of transparent samples.
Conventional bright-field microscopy needs to stain the sample or use fluorescent labels to show the structure 27 , which may affect the normal vital movement of cells 28,29 . In contrast, phase imaging is a label-free microscopic imaging method that enables the imaging of transparent samples by attaining phase information of different components of cells. A well-known phase imaging technique is phase-contrast microscopy proposed by Zernike.
Phase-contrast microscopy uses interference between the scattering and non-scattering light waves to convert the optical phase into high contrast image 30 . Nomarski further invented differential interference contrast (DIC) microscopy based on phase-contrast microscopy, which can display a three-dimensional projection of the structure 31 .
Both of these methods belong to qualitative phase imaging techniques. Quantitative phase imaging techniques have made great progress in the biomedical field recently, including the transport of intensity equation (TIE) 32 , digital holographic imaging 33,34 , and tomographic phase microscopy [35][36][37] . Among these methods, the TIE technique can be easily deployed to conventional microscopy due to the desirable property of simple acquisition manner, no reference beam, and no phase unwrapping. Furthermore, it also suits both coherent and partially coherent illumination [38][39][40] .
However, TIE usually requires a series of images captured at different focal depths, which extends the acquisition time and limits real-time observation of the dynamic process. All of these methods cannot be directly applied to scattering problems.
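Under the simplifying assumption of approximately uniform in-focus intensity, TIE reduces to a Poisson equation that can be solved with FFTs. The sketch below is our own illustration of that limit (the function name and regularization are ours), not the implementation used in this work:

```python
import numpy as np

def tie_phase(i_minus, i_plus, dz, wavelength, pixel, i0=None):
    """Recover phase from a defocus pair via the transport-of-intensity
    equation under the uniform-intensity approximation:
        laplacian(phi) = -(k / I0) * dI/dz,
    solved with an FFT-based Poisson solver (eps regularizes q = 0)."""
    k = 2.0 * np.pi / wavelength
    didz = (i_plus - i_minus) / (2.0 * dz)        # axial intensity derivative
    if i0 is None:
        i0 = 0.5 * (i_plus + i_minus).mean()      # mean in-focus intensity
    g = -(k / i0) * didz                          # Poisson right-hand side
    ny, nx = g.shape
    qx = 2.0 * np.pi * np.fft.fftfreq(nx, d=pixel)
    qy = 2.0 * np.pi * np.fft.fftfreq(ny, d=pixel)
    q2 = qx[None, :] ** 2 + qy[:, None] ** 2
    eps = 1e-9                                    # avoids division by zero at DC
    phi = np.real(np.fft.ifft2(-np.fft.fft2(g) / (q2 + eps)))
    return phi - phi.mean()                       # phase up to an additive constant
```

Practical TIE pipelines, including multi-plane acquisitions like the one described later in this paper, use higher-order finite-difference estimates of dI/dz and handle non-uniform intensity; this sketch only conveys the core inversion.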
Previous studies indicate that there exists a functional relationship between the speckled image and the original sample 13 , and phase information can also be inferred from the intensity distribution 41 . In this paper, we developed a deep despeckle method to simultaneously predict the intensity and phase information of biological samples through the scattering media via a complex neural network (DespeckleNet). A speckled image, generated by inserting a diffuser in the optical path, is fed into our network and both intensity and phase are predicted as output. The network is devised based on the generative adversarial network (GAN) framework 42 , and we obtain a resilient speckle decorrelation for the intensity and phase with a wide range of statistical variations. So we can make high-quality object predictions with a completely different set of diffusers of the same class. We carry out experiments on both cells and tissue to demonstrate the superior performance of our method. In addition, we also observe the real-time division process of unlabeled living Hela cells using quantitative phase imaging. Our solution is a single-shot imaging method and can be applied in a variety of microscope systems without providing an additional sophisticated optical structure for despeckle.

Results
The overall experimental setup is depicted in Fig. 1(a). The coherent light source is modulated through the diffuser to illuminate the biological sample. Due to the optical roughness of the diffuser surface, the image captured by the camera is mixed with a granular pattern. Based on the captured speckle image, both intensity and phase information can be simultaneously recovered by our DespeckleNet, whose flowchart is shown in Figs. 1(b) and 1(c). To build the dataset for network training and testing, we replace the laser with an incoherent light source to obtain a series of in-focus and out-of-focus sample images without scattering. The in-focus image without scattering is the ground truth of intensity. The ground truth of the phase component is obtained by TIE, which takes images from different focal planes as input to extract the phase information. To reduce the computational complexity, each speckle image and the corresponding ground truth are cropped into many smaller patches and then fed into our model. The speckle pattern of each patch is different even if the patches come from the same diffuser. Our model learns the statistical invariance of the diffuser and can output high-quality reconstructions even for speckle patterns that were not seen during training. Experimentally, we verify the network performance with three kinds of diffusers, including thin tape, a Petri dish, and tissue, as shown in Fig. 1(d). We also use four different biological samples, comprising breast cancer cells, nasopharyngeal cancer cells, living Hela cells, and breast cancer tissue, to demonstrate the generality of our method, as shown in Fig. 1(e). In Fig. 1(f), we show real-time quantitative phase observation of living Hela cells without labeling or staining over a long time, which proves that our method can achieve single-shot unlabeled imaging through scattering media.
The generator generates the intensity and phase, and the discriminator judges whether the outputs are real or fake. Specifically, we first extract the features of the input image through the feature extractor, as indicated in Fig. 2(a). Next, we add a translate module to cope with the influence of speckle by stacking real-valued and complex residual blocks sequentially, as indicated in Fig. 2(b). The operation of the modified complex residual block is shown in Fig. 2(f), which decouples the intensity and phase information from the extracted features. Compared with real-valued convolution, complex convolution is more consistent with the optical complex-field propagation model.
Then we employ two decoders to reconstruct the intensity and phase images, as shown in Figs. 2(c) and 2(d). The two decoders have the same structures except that the intensity branch has an additional tanh activation function in the last layer to normalize the output. For the discriminator, we adopt PatchGAN to perform discrimination patch-wise, which is beneficial for the reconstruction of details, as shown in Fig. 2(e).

Multiple types of samples imaging through different scattering media. To verify the generalization and robustness of our method, we experimentally image cell samples through different scattering media, including thin tape, a tissue section, and a Petri dish. In addition, we test tissue samples to further demonstrate the superior performance across different sample types. Notably, all the samples preserve the original structure without fluorescence labeling or staining. These biological samples are illuminated by the laser through the diffuser, and on the other side, a camera captures the raw image mixed with the speckle pattern. The detailed setup is shown in Fig. S1. The captured raw images, whose speckle patterns have not been seen by the network, are fed into our trained pixel-to-pixel network.
Since the unstained cells are transparent, the captured raw images have poor contrast, which makes the pixel-level prediction task in bioimaging difficult. Nevertheless, our network outputs simultaneously recovered intensity and phase images of high quality compared with the ground truth obtained under a non-scattering scenario. We evaluate the performance of our DespeckleNet on nasopharyngeal carcinoma cells (C666-1) and breast cancer cells (MDA-MB-231). The first sample is imaged through thin tape to keep consistent with previous methods [43][44][45] . Furthermore, we replace the thin tape with a breast cancer tissue section (~4 µm thick) as the diffuser for the second sample, which is more in line with real bioimaging. Representative examples of the speckle and prediction pairs are shown in Figs. 3(a) and 3(b). The speckle prevents normal observation of the cells. We input the speckle into our network, and the outputs are clear images of the cells' intensity and phase, which are very close to the ground truth. We can see that the intensity images of unlabeled cells have low contrast, and it is difficult to recognize the boundary of the cells against the background. In contrast, the corresponding cell phase images have high contrast, and we can observe differences between the morphological features of the cells. Our results show that our network performs well not only with thin tape but also with real biological tissue as the scattering medium, providing a potential solution for unlabeled deep-tissue laser imaging. Next, we test the performance of DespeckleNet on living Hela cells through a Petri dish. Imaging of living cells often requires the cells to be placed in a Petri dish, which is equivalent to a diffuser. Different from the other experimental configurations, the Petri dish consists of upper and lower surfaces, and the laser goes through the scattering medium twice along the whole optical path.
The results are shown in Fig. 3(c), where the outputs of the network are still satisfactory even after scattering twice.
We also carry out an experiment on breast cancer HER2 tissue through thin tape to prove the robustness of our network for different sample types. As shown in Fig. 3(d), the pathological tissue structure is more complicated and its texture has more details, which inevitably imposes a great challenge for high-quality reconstruction. On the other hand, the observation of tissue usually requires a wide FOV, so we capture the raw image with 2048×2048 pixels (FOV ≈ 250×250 µm²). Conventional scattering imaging techniques have a small imaging FOV due to the limitation of the optical memory effect, whereas our method can break through this limitation.
In Fig. 3(d), the outputs of the network remain high fidelity compared to the ground truth, even though the tissue speckle images are blurred and coarse. To quantitatively analyze the performance of our network, the intensity and phase reconstructions are evaluated with the metrics defined in the Metrics section.

Although points A and A+ are located at the same position in the phase image and the intensity image, the red solid curve in Fig. 4(c) shows no obvious upward or downward tendency during the division process. This reveals that it is difficult to distinguish the cell division process using only the recovered intensity image. The intensity of the dashed curve is relatively lower, and no distinct fluctuation can be observed in the dashed red curve either. Overall, it is more accurate to identify the dividing state of the living Hela cells using phase information. In particular, there is an obvious phase difference between A and B in Fig. 4(b) after the left cell finished the division process at 540 min. However, the intensity difference between the two points at this moment is slight due to the high transparency of the cells. This further reveals the benefit of simultaneously recovering both intensity and phase in our method.

Discussion
In coherent bioimaging, the coherent transfer function is a complex-valued function describing the response of intensity and phase. Considering this optical property during imaging, we design DespeckleNet accordingly. In contrast to a traditional real-valued network, our method utilizes complex convolution blocks. We first extract features using a series of general convolution layers; the features are then divided into two groups for separate intensity and phase reconstruction. Our network stacks five common convolution blocks and four complex convolution blocks in the translate module, as shown in Fig. 2. More details about the complex convolution block are given in the DespeckleNet implementation section.

When a coherent signal consists of a large number of complex phasor components, speckle is generated by the summation of their independent phases. In particular, the phase and amplitude of the laser are randomly distributed after passing through a diffuser, and the superposition of these complex components results in a "random walk" 46 . In addition, the high coherence of the laser causes coherent superposition in a certain optical plane to form speckle. The complex wavefront is given by

$$A e^{i\theta} = \frac{1}{\sqrt{N}} \sum_{n=1}^{N} a_n e^{i\varphi_n},$$

where $A$ is the magnitude of the resultant phasor component summation, $\theta$ is the phase of the resultant phasor component summation, $N$ denotes the number of phasor components in the random walk, and $a_n$ and $\varphi_n$ represent the $n$th magnitude and phase of the phasor vector, respectively.

As shown in Fig. 6, we evaluate the impact of training datasets with different distributions on network performance using the learned perceptual image patch similarity (LPIPS) 50 metric, which is closer to human perception in visual similarity judgments. We acquire multiple sets of data under different experimental conditions (including light intensity, light incidence angle, etc.) to construct our training datasets, and the number of sets reflects the data diversity of the training datasets.
We constructed three training sets for comparison, cropped from 6, 12, and 18 sets of speckle images, respectively. For fairness, the total number of images is the same across the training datasets. Quantitative analysis shows that the training dataset built from the largest number of sets has the lowest mean LPIPS, which demonstrates that diversity in the training set improves the generalization ability of the network.
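Returning to the random phasor sum above: a toy numerical illustration (ours, not the paper's simulation) shows that summing many independent unit phasors per pixel produces fully developed speckle, whose intensity contrast approaches 1:

```python
import numpy as np

def random_phasor_sum(shape, n_components, rng):
    """Per-pixel 'random walk': sum of unit-magnitude phasors with
    independent, uniformly distributed phases (normalized by sqrt(N))."""
    phases = rng.uniform(0.0, 2.0 * np.pi, size=(n_components,) + shape)
    return np.exp(1j * phases).sum(axis=0) / np.sqrt(n_components)

rng = np.random.default_rng(0)
field = random_phasor_sum((256, 256), 100, rng)
intensity = np.abs(field) ** 2
# Fully developed speckle has negative-exponential intensity statistics,
# so the contrast (std / mean) is close to 1.
contrast = intensity.std() / intensity.mean()
```

The near-unit contrast is exactly what makes the granular pattern in the raw camera images so disruptive: the speckle fluctuations are as strong as the mean signal itself.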

DespeckleNet implementation
Generative model. In this paper, the backbone of our generative model is UNet, which was first proposed for medical image segmentation 51 . The generator contains four components: a feature extractor, a translate module, an intensity reconstruction module, and a phase reconstruction module, as shown in Fig. 2. The feature extractor consists of five consecutive convolution blocks. Each convolution block contains a convolution layer with a 4×4 kernel, a batch normalization layer, and a leaky ReLU activation layer. After each convolution block, we double the number of channels and set the convolution stride to 2, which downscales the feature map and expands the receptive field.
Since the formation of speckle results from the nonlinear weighted summation of different scattering mode microstructures within a certain receptive field, expanding the receptive field is beneficial for the despeckle task.
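As an illustration, the feature extractor described above (4×4 kernels, stride 2, batch normalization, leaky ReLU, channel doubling) can be sketched in PyTorch. The base channel width, input channel count, and leaky-ReLU slope below are our assumptions, not the paper's exact values:

```python
import torch
from torch import nn

def conv_block(c_in, c_out):
    """One downsampling block: 4x4 conv with stride 2, batch norm, leaky ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.2, inplace=True),
    )

class FeatureExtractor(nn.Module):
    """Five consecutive conv blocks; channels double after each block."""
    def __init__(self, c_in=1, base=32):
        super().__init__()
        widths = [base * 2 ** i for i in range(5)]   # e.g. 32, 64, 128, 256, 512
        blocks, prev = [], c_in
        for w in widths:
            blocks.append(conv_block(prev, w))
            prev = w
        self.body = nn.Sequential(*blocks)

    def forward(self, x):
        return self.body(x)

x = torch.randn(1, 1, 512, 512)        # one 512x512 speckle patch
feats = FeatureExtractor()(x)          # five stride-2 blocks: 512 -> 16
```

With a 512×512 input patch, five stride-2 blocks yield the 16×16, 512-channel feature maps that the translate module operates on.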
Our proposed feature translate module adopts five residual convolution blocks and four modified complex residual blocks rather than a common real-valued network. Because light propagates as a complex field, our module is better suited to the simultaneous recovery of both intensity and phase; the experimental results support this conclusion. Compared to the conventional complex network 52 , the complex residual blocks in our translate module are used for feature transformation at a fixed size, so the pooling layer is removed. The residual structure in the translate module also speeds up convergence.
Traditionally, given a complex-valued convolutional filter $W = W_r + iW_i$ and a complex input $h = h_r + ih_i$, where the subscripts $r$ and $i$ represent the real and imaginary parts, the complex convolution (CConv) is

$$\mathrm{CConv}(h) = (W_r * h_r - W_i * h_i) + i\,(W_r * h_i + W_i * h_r),$$

where $*$ denotes the real-valued convolution operation. In our modified complex convolution, the input channels are divided into two groups, standing for real and imaginary features, respectively. The convolution results of these two groups are then fused by a 1×1 convolution rather than by directly adding the real and imaginary parts. Next, we use a universal activation function, leaky ReLU, to realize the nonlinear mapping. The complex residual block $\mathrm{CRB}(\cdot)$ in our feature translate module is defined as

$$\mathrm{CRB}(h) = h + \sigma\!\left(\mathrm{Conv}_{1\times 1}\!\left(\mathrm{CConv}(h)\right)\right),$$

where $\sigma$ is the leaky ReLU activation. The feature map in the translate module has small lateral dimensions (16×16) and a large number of channels (512), which encodes sufficient information for the subsequent task. These feature maps then go through the intensity reconstruction module, which contains five consecutive transposed convolution layers (kernel size = 4×4, stride = 2) followed by a tanh activation layer to normalize the output to [-1, 1]. Information across different spatial scales is tunneled through the down-up paths by skip connections to preserve high-frequency details.

Loss function. We design our loss function as the combination of an adversarial loss with an L1 loss as the regularization term. To ensure convergence, we alternately train the discriminator and the generator. The loss function of the discriminator is formulated as

$$\mathcal{L}_D = -\mathbb{E}_{x,y}\left[\log D(x, y)\right] - \mathbb{E}_{x}\left[\log\left(1 - D(x, G(x))\right)\right],$$

where $G$ represents a fixed generator, $D$ represents the discriminator to be updated, $x$ is the input speckle image, and $y = (I, P)$ is the concatenated ground truth of intensity and phase. The loss function of the generator is formulated as

$$\mathcal{L}_G = -\mathbb{E}_{x}\left[\log D(x, G(x))\right] + \lambda\, \mathcal{L}_1\!\left(G(x), y\right),$$

where $\lambda$ is the regularization hyperparameter and $\mathcal{L}_1(G(x), y)$ represents the L1 loss between $G(x)$ and $y$. The symbols $G$, $D$, $x$, and $y$ have the same meaning as in the discriminator loss.
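A possible PyTorch realization of the modified complex residual block might look as follows; the 3×3 kernel size, the reuse of the real/imaginary kernels across both channel groups, and the fusion width are our assumptions rather than the paper's exact design:

```python
import torch
from torch import nn

class ModifiedComplexResBlock(nn.Module):
    """Sketch: input channels are split into 'real' and 'imaginary' halves,
    convolved with real-part and imaginary-part kernels, fused by a 1x1
    convolution (instead of a plain complex add), passed through leaky ReLU,
    and combined with a residual connection."""

    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.conv_r = nn.Conv2d(half, half, 3, padding=1)   # real-part kernel W_r
        self.conv_i = nn.Conv2d(half, half, 3, padding=1)   # imaginary-part kernel W_i
        self.fuse = nn.Conv2d(2 * channels, channels, 1)    # 1x1 fusion of the four products
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, h):
        h_r, h_i = h.chunk(2, dim=1)
        # the four real-valued convolutions of (W_r + i W_i) * (h_r + i h_i)
        rr, ii = self.conv_r(h_r), self.conv_i(h_i)
        ri, ir = self.conv_r(h_i), self.conv_i(h_r)
        fused = self.fuse(torch.cat([rr, ii, ri, ir], dim=1))
        return h + self.act(fused)

x = torch.randn(2, 512, 16, 16)          # translate-module feature map size
y = ModifiedComplexResBlock(512)(x)      # shape is preserved through the block
```

Letting the 1×1 fusion see all four convolution products separately, rather than hard-wiring the complex add, is what distinguishes this block from a textbook complex convolution.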
While the adversarial loss guides the generative model to map speckle images into despeckled ones, the regularization term speeds up the convergence of our network and smooths the training loss. Experimental results reveal that good performance is achieved when λ is set between 0.3 and 0.4.
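The combined objective can be sketched as follows, assuming a standard conditional-GAN binary cross-entropy adversarial term (the exact adversarial formulation is our assumption) with the L1 regularizer weighted by λ in the reported 0.3-0.4 range:

```python
import torch
from torch import nn

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()
lam = 0.35  # regularization weight; 0.3-0.4 is reported to work well

def d_loss(d_real_logits, d_fake_logits):
    """Discriminator: push real (speckle, ground-truth) patches toward 1
    and (speckle, generated) patches toward 0."""
    return (bce(d_real_logits, torch.ones_like(d_real_logits))
            + bce(d_fake_logits, torch.zeros_like(d_fake_logits)))

def g_loss(d_fake_logits, fake, target):
    """Generator: fool the discriminator, plus L1 regularization toward the
    concatenated intensity/phase ground truth."""
    return bce(d_fake_logits, torch.ones_like(d_fake_logits)) + lam * l1(fake, target)
```

With a PatchGAN discriminator, the logits are a spatial map of per-patch decisions, which is why `BCEWithLogitsLoss` is applied element-wise over the whole map.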

Metrics
Following the numerical evaluation methods proposed in previous papers, the SSIM is defined as

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)},$$

where $\mu_x$, $\mu_y$ are the averages of $x$, $y$; $\sigma_x^2$, $\sigma_y^2$ are the variances of $x$, $y$; $\sigma_{xy}$ is the covariance of $x$ and $y$; and $c_1$, $c_2$ are the variables used to stabilize the division with a small denominator. We also use LPIPS, a learning-based perceptual similarity metric, to measure the distance in feature space, which is more consistent with human perception. The LPIPS is defined as

$$\mathrm{LPIPS}(y, \hat{y}) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( f^{\,l}_{hw}(y) - f^{\,l}_{hw}(\hat{y}) \right) \right\|_2^2,$$

where $y$, $\hat{y}$ are the intensity (or phase) ground truth and the output from DespeckleNet, $f^{\,l}(y), f^{\,l}(\hat{y}) \in \mathbb{R}^{H_l \times W_l \times C_l}$ are the corresponding $l$th-layer features extracted from a pre-trained AlexNet, $H_l$, $W_l$, $C_l$ give the size of the feature map at the $l$th layer, $w_l \in \mathbb{R}^{C_l}$ is a real-valued scaling weight, and $\odot$ represents the Hadamard product. The number of feature layers is set to 5. A lower LPIPS value means the images are more similar to the ground truth in terms of human perception.
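For reference, the single-window form of SSIM can be computed directly from the means, variances, and covariance; practical implementations (e.g. scikit-image) average this over local Gaussian windows, and the constants below are illustrative:

```python
import numpy as np

def ssim(x, y, c1=1e-4, c2=9e-4):
    """Global (single-window) SSIM between two images of the same shape.
    c1 and c2 stabilize the division when means/variances are small."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Identical images score 1, while statistically independent images score near 0; LPIPS, by contrast, requires a pre-trained network and is not reproduced here.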

Network training schedule
Our model is implemented in Python with PyTorch on a computer with four NVIDIA RTX 2080Ti GPUs. Within each iteration, the generative model and the discriminative model are alternately updated. Both models were randomly initialized and optimized using the adaptive moment estimation (Adam) optimizer with momentum parameters β₁ = 0.9, β₂ = 0.999. The initial learning rate is set to 2×10⁻⁴ and decreases with a decay rate of 0.1 every 100 epochs. The batch size of the training dataset is set to 16.
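The reported schedule maps directly onto PyTorch's Adam and StepLR; the stand-in one-layer modules below are placeholders for the actual generator and discriminator:

```python
import torch
from torch import nn, optim

# Placeholder modules standing in for the full generator and discriminator.
g = nn.Conv2d(1, 2, 3, padding=1)
d = nn.Conv2d(3, 1, 3, padding=1)

# Adam with the reported momentum parameters and initial learning rate.
opt_g = optim.Adam(g.parameters(), lr=2e-4, betas=(0.9, 0.999))
opt_d = optim.Adam(d.parameters(), lr=2e-4, betas=(0.9, 0.999))

# Decay the learning rate by a factor of 0.1 every 100 epochs.
sched_g = optim.lr_scheduler.StepLR(opt_g, step_size=100, gamma=0.1)
sched_d = optim.lr_scheduler.StepLR(opt_d, step_size=100, gamma=0.1)

for epoch in range(200):
    # ... alternate discriminator and generator updates over the batches ...
    sched_g.step()
    sched_d.step()
```

After 200 epochs the learning rate has decayed twice, from 2×10⁻⁴ to 2×10⁻⁶.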

Data acquisition
In our experiments, we image different types of cells and tissues through different scattering media, including thin tape, tissue, and a Petri dish. To obtain the ground truth for network training, we collect ±2 µm and ±4 µm out-of-focus images and an in-focus image. The in-focus image is used as the ground truth of the intensity image, and the ground truth of the phase image is calculated by the TIE algorithm using the four out-of-focus and one in-focus intensity images.
The input of the network is the speckle image, which is obtained by illuminating the biological sample through the diffuser with the laser. For each sample under a specific diffuser, we collect 20 sets of images, which are then randomly cropped into about 2,000 512×512 patches for training and about 200 patches for testing. The images in the testing dataset are not used for training, which ensures that our network has never seen these speckle patterns before.
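The patch preparation step can be sketched as a simple random-crop routine (our illustration; the crop positions are random and the patch count is an argument):

```python
import numpy as np

def random_patches(image, n, size=512, rng=None):
    """Randomly crop n size-by-size training patches from one raw image."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    out = []
    for _ in range(n):
        y = rng.integers(0, h - size + 1)
        x = rng.integers(0, w - size + 1)
        out.append(image[y:y + size, x:x + size])
    return np.stack(out)
```

Because each crop lands on a different region of the diffuser's speckle, every patch effectively presents the network with a distinct scattering realization.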
Therefore, the datasets have the following two characteristics. One is that each patch corresponds to a different scattering pattern, which tests robustness to the scattering modes. The other is that the datasets consist of different samples, which tests generalization across sample types.
We design two scattering modes for data acquisition. In mode one, the living Hela cells are imaged through a Petri dish. In mode two, the breast cancer cells, nasopharyngeal carcinoma cells, and breast cancer HER2 tissue are imaged through thin tape and tissue section, respectively. The laser goes through the diffuser twice in mode one, and once in mode two. A 40×/1.25NA objective is used to magnify the Hela cells in mode one. A 40×/0.65NA objective is employed for the cell samples and a 20×/0.5NA objective is employed for the tissue samples in mode two. The detailed description of the experimental setup can be found in Figs. S1 and S2.

Additional information
Correspondence and requests for materials should be addressed to X.L. (lin-x@tsinghua.edu.cn) or Y.Z. (ybzhang08@hit.edu.cn). Supplementary information is available for this paper, including Figs. S1 to S11, Tables S1 to S3, and Video S1.

Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Code availability
Our code of DespeckleNet is available upon reasonable request.