Coded aperture snapshot spectral imaging fundus camera

Spectral imaging holds great promise for the non-invasive diagnosis of retinal diseases. However, to acquire a spectral datacube, conventional spectral cameras require extensive scanning, leading to a prolonged acquisition. This makes them ill-suited to retinal imaging, where the eye moves rapidly. To address this problem, we built a coded aperture snapshot spectral imaging fundus camera, which captures a large-sized spectral datacube in a single exposure. Moreover, to reconstruct a high-resolution image, we developed a robust deep unfolding algorithm using a state-of-the-art spectral transformer in the denoising network. We demonstrated the system performance on both standard targets and an eye phantom.


Introduction
Retinal imaging is crucial for the detection and management of ophthalmic diseases, such as age-related macular degeneration (AMD) [1] and glaucoma [2]. The current standard-of-care retinal imaging technologies encompass color fundus photography, scanning laser ophthalmoscopes (SLO) [3], and optical coherence tomography (OCT) [4]. Despite being extensively used in clinics, these techniques measure only spatial information of the retina. In contrast, spectral imaging captures light in three dimensions, i.e. acquiring both spatial coordinates (x, y) and wavelengths (λ) of a scene simultaneously. The rich information could be used to classify the underlying components of the object. Originally developed for remote sensing [5], spectral imaging has gained increasing popularity in medical applications, including retinal imaging [6]. The overall rationale of using a spectral camera in retinal imaging is that the ocular tissue's endogenous optical properties, such as absorption and scattering, change during the progression of a retinal disease, and the spectrum of light emitted from tissue carries quantitative diagnostic information about tissue pathology.
To measure a spectral datacube (x, y, λ), conventional spectral imaging cameras rely on scanning, either in the spatial domain, such as using a slit scanning spectrometer [7], or in the spectral domain, such as using a liquid-crystal-tunable-filter [8]. The scanning mechanism typically leads to a prolonged acquisition, making these techniques prone to motion artifacts. Furthermore, the data acquired from sequential measurements need to be registered in post-processing, a complicated procedure that is sensitive to motion and image signal-to-noise ratio (SNR). Particularly in retinal imaging, post-acquisition registration can result in artifacts due to tissue movement between successive images caused by arterial pulses as well as changes in the lens-eye geometry [9]. Additionally, it is challenging to keep a patient fixating on a target for an extended period of time.
A snapshot spectral imaging system can avoid all these problems and provide an ideal solution for obtaining retinal spectral data. In this category, representative techniques include the computed tomographic imaging spectrometer (CTIS) [9,10], the four-dimensional imaging spectrometer (4D-IS) [11], the image mapping spectrometer (IMS) [12,13], and coded aperture snapshot spectral imagers (CASSI) [14,15]. Among these methods, only CASSI can measure a large-sized spectral datacube because it uses compressive sensing to acquire data [16,17], leading to a high resolution along both spatial and spectral dimensions. CASSI uses a coded aperture (mask) and a dispersive element to modulate the input scene, and it captures a two-dimensional (2D), multiplexed projection of a spectral datacube. Provided that the spectral datacube is sparse in a given domain, it can be faithfully reconstructed with a regularization method. Essentially, just one spectrally dispersed projection of the datacube that is spatially modulated by the aperture code over all wavelengths is sufficient to reconstruct the entire spectral datacube [14]. Therefore, the acquisition efficiency of CASSI is significantly higher than that of non-compressive methods that directly map spectral datacube voxels to camera pixels, such as IMS.
In this work, we developed a spectral imaging device based on CASSI and integrated it with a commercial fundus camera. The resultant system can capture a 1180×1100×35 (x, y, λ) spectral datacube in a snapshot with a 17.5 μm and 15.6 μm resolution along the horizontal and vertical axes, respectively. The average spectral resolution is 5 nm from 445 nm to 602 nm. Moreover, to enable fast and high-quality image reconstruction, we developed a deep-learning-based algorithm. Once trained, our algorithm can reconstruct a megapixel CASSI image in only 60 s, a 20-fold improvement over conventional iterative algorithms. We demonstrated our method on both standard targets and an eye phantom.

System principle and method
Optical setup and system model

The schematic and photograph of the system are shown in Fig. 1(a) and (c), respectively, where we couple a CASSI system to a commercial fundus camera (Topcon TRC-50dx) through its top image port. The illumination light is provided by the internal halogen lamp of the fundus camera. After being reflected by an annular mirror, a doughnut-shaped beam is formed and refocused onto the subject's eye pupil, forming uniform illumination at the retina. The reflected light is collected by the same optics and forms an intermediate image at the top image port of the camera. This image is then passed to the CASSI system. In the CASSI system, we use two achromatic lenses (AC254-50, f = 50 mm, Thorlabs) to relay the input image to a coded mask with a random binary pattern. The mask was fabricated on a quartz substrate with chrome coating by photolithography (Front Range Photomask), as shown in Fig. 1(d). The smallest coded feature of the pattern is sampled by approximately 2×2 pixels on the camera (11.6 μm), leading to a maximum resolution of 1180×1100 pixels in the reconstructed image. The spectral range of the system is 450-600 nm, bandlimited by a combination of a 450 nm long-pass filter (FELH0450, Thorlabs) and a 600 nm short-pass filter (FESH0600, Thorlabs). Next, a 4f system consisting of an objective lens (4×/0.1 NA, Olympus) and an achromatic lens (AC254-100, f = 100 mm, Thorlabs) relays the coded image to a CMOS image sensor (acA2500-60um, Basler). To disperse the image, we position a round wedge prism (PS814-A, 10° beam deviation, Thorlabs) at the Fourier plane of the 4f relay system. The CASSI measurement can be considered as encoding high-dimensional spectral data and mapping it onto a 2D space [14]. As shown in Fig. 1(b), the coded aperture modulates the spatial information over the entire spectrum.
The image right after the coded aperture is

f_1(x, y; λ) = f_0(x, y; λ) T(x, y),

where f_0(x, y; λ) represents the spectral irradiance of the image at the coded aperture, and T(x, y) denotes the transmission function of the coded aperture.
After the prism, the modulated image is sheared along the dispersion axis, f_2(x, y; λ) = f_1(x − d(λ), y; λ), where d(λ) is the wavelength-dependent lateral displacement. The resultant intensity image captured at the detector plane is the superposition of multiple images of the spatially modulated scene at wavelength-dependent locations, which can be expressed as

g(x, y) = ∫ f_2(x, y; λ) dλ.
Because the detector plane is spatially discretized with a pixel pitch of Δ, g(x, y) is sampled across the entire 2D dimension of the detector plane. The measurement at the (m, n) pixel can be written as

g_{mn} = ∬ g(x, y) rect(x/Δ − m, y/Δ − n) dx dy + ω_{mn},

where ω_{mn} denotes the measurement noise at the (m, n) pixel, and rect is a rectangular function. After discretizing the spectral information into K bands, the discrete measurement at the (m, n) camera pixel can be written as

g_{mn} = Σ_{k=1}^{K} T_{m−k, n} f_{m−k, n, k} + ω_{mn},

where f and T are the discretized representations of the source spectral irradiance and the coded aperture pattern, respectively.
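The discrete forward model above can be illustrated with a minimal numerical sketch (our own, not the paper's code; the function name and the simplified one-pixel-per-band shear along one axis are assumptions):

```python
import numpy as np

def cassi_forward(cube, mask, noise_std=0.0, rng=None):
    """Simulate a single-disperser CASSI measurement.

    cube : (H, W, K) discretized spectral datacube f
    mask : (H, W)    coded-aperture transmission pattern T
    Each spectral band is modulated by the mask, sheared by one pixel
    per band along the column axis (the prism dispersion), and all
    bands are summed on the detector.
    """
    H, W, K = cube.shape
    g = np.zeros((H, W + K - 1))  # detector widened to hold the shear
    for k in range(K):
        g[:, k:k + W] += cube[:, :, k] * mask
    if noise_std > 0:  # additive measurement noise omega
        rng = rng or np.random.default_rng(0)
        g += rng.normal(0.0, noise_std, g.shape)
    return g

# A flat two-band cube under a fully open mask: the middle detector
# column receives overlapping contributions from both bands.
g = cassi_forward(np.ones((2, 2, 2)), np.ones((2, 2)))
```

In the real system the prism dispersion is nonlinear in wavelength, so the per-band shift would be wavelength-dependent rather than exactly one pixel per band.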

Deep unfolding reconstruction algorithm
Despite being simple in hardware, image reconstruction in CASSI can be computationally expensive when using conventional iterative algorithms like two-step iterative shrinkage/thresholding (TwIST) [18]. Additionally, conventional iterative algorithms usually have two steps in each iteration: physical projection and hand-crafted priors [18-20]. The mismatch between the hand-crafted priors and real data often results in poor image quality. Recently, deep learning methods have been used in CASSI to improve reconstruction quality and reduce the time cost [21-23]. However, most deep learning methods fail to incorporate the physical information of the system, such as the mask pattern, into the model and treat reconstruction as a "black box" during the training phase. Therefore, these methods lack robustness and are prone to artifacts. To solve this problem, we combined the advantages of the physical projection in conventional iterative reconstruction and the strong denoising ability of deep learning, and developed a deep unfolding network for CASSI reconstruction. Because the mask information is mainly processed in the projection operation, our algorithm exhibits strong robustness to variations in mask patterns.
To reconstruct the original spectral datacube from the 2D CASSI measurement, we first vectorize the image and rewrite the discrete forward model in matrix form:

y = Φf + ω,

where y is the vectorized measurement, f is the vectorized spectral datacube, Φ is the sensing matrix jointly determined by the coded aperture and the prism dispersion, and ω is the vectorized noise.
Given the measurement y and the matrix Φ, there are two optimization frameworks to predict the original spectral scene f: the penalty function method and the augmented Lagrangian (AL) method. Because the AL method outperforms the penalty function method, as shown in previous studies [24-26], we use the AL method to solve the above inverse problem:

(f̂, v̂) = argmin_{f, v} (1/2)‖y − Φf‖² + R(v) − λ_1^T (f − v) + (μ_1/2)‖f − v‖²,   (7)

where R(·), λ_1, and μ_1 denote the prior regularization, the Lagrangian multiplier, and the penalty parameter, respectively. Eq. (7) can be further written as

(f̂, v̂) = argmin_{f, v} (1/2)‖y − Φf‖² + R(v) + (μ_1/2)‖f − v − λ_1/μ_1‖².   (8)

To solve Eq. (8), we adopt the alternating direction method of multipliers (ADMM) [27-29]. According to ADMM, Eq. (8) can be stratified into two subproblems and solved iteratively:

v^i = argmin_v R(v) + (μ_1/2)‖f^{i−1} − v − λ_1^{i−1}/μ_1‖²,   (9)

f^i = argmin_f (1/2)‖y − Φf‖² + (μ_1/2)‖f − v^i − λ_1^{i−1}/μ_1‖²,   (10)

where v is an auxiliary variable, and the superscript i denotes the iteration index. Eq. (9) is a classical denoising problem, which can be solved with a denoising prior such as total variation, wavelet transformation, or a denoising network. Herein we use a deep unfolding network with a state-of-the-art spectral transformer for denoising [30]. Eq. (10) has a closed-form solution [29], which is termed the projection operation:

f^i = (Φ^T Φ + μ_1 I)^{−1} (Φ^T y + μ_1 v^i + λ_1^{i−1}).   (11)

Due to the special structure of Φ, which consists of a concatenation of diagonal blocks, as shown in [20], ΦΦ^T is diagonal and Eq. (11) can be solved in one shot via the matrix inversion lemma. Therefore, f can be recovered by multiple iterations of the spectral-transformer denoising and the projection operation, as shown in Fig. 2. Here f^0 is initialized as f^0 = (Φ^T Φ + μ_1 I)^{−1} Φ^T y. Fig. 2 shows the iterative architecture of the deep unfolding algorithm. In the projection operation stage, f^i is calculated from v^i according to Eq. (11), where Φ, λ_1, λ_2, y, μ_1, and μ_2 are inputs (for convenience, we use the FV^i package to represent these inputs in Fig. 2). In the denoising operation stage, v^i is calculated from f^{i−1} according to Eq. (9), where λ_2, Φ′, and μ_2 are inputs (we use the VF^i package to represent these inputs in Fig. 2). Additionally, the spectral transformer is used as the denoiser, which uses the matrix Φ′ to guide the transformer.
Φ′ is generated by a convolutional neural network (CNN) from μ_2 and Φ:

Φ′ = CNN(μ_2, Φ).   (12)

The Lagrangian multiplier can be updated as

λ_1^i = λ_1^{i−1} − μ_1 (f^i − v^i),   (13)

and the penalty parameters μ_1 and μ_2 can be trained in deep unfolding.
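Putting the denoising step, the projection step, and the multiplier update together, the unfolding loop can be sketched as follows. This is our own minimal numpy rendition under stated assumptions: a one-pixel-per-band shear stands in for the prism dispersion, and a caller-supplied `denoise` function stands in for the spectral transformer of [30]; the guidance matrix Φ′ and the learned penalty parameters are omitted.

```python
import numpy as np

def forward(cube, mask):
    # Phi f: modulate each band by the mask, shear, and sum on the detector
    H, W, K = cube.shape
    y = np.zeros((H, W + K - 1))
    for k in range(K):
        y[:, k:k + W] += cube[:, :, k] * mask
    return y

def adjoint(y, mask, K):
    # Phi^T y: redistribute detector pixels back to the K sheared bands
    H, Wd = y.shape
    W = Wd - K + 1
    return np.stack([y[:, k:k + W] * mask for k in range(K)], axis=2)

def unfold(y, mask, K, denoise, mu=0.5, n_iter=4):
    """Alternate denoising, the one-shot projection (using the fact that
    Phi Phi^T is diagonal, so its inverse is an elementwise division),
    and the Lagrangian multiplier update."""
    mask_cube = np.repeat(mask[:, :, None], K, axis=2)
    diag = forward(mask_cube, mask)        # diagonal of Phi Phi^T
    f = adjoint(y / (diag + mu), mask, K)  # f0 = (Phi^T Phi + mu I)^-1 Phi^T y
    lam = np.zeros_like(f)
    for _ in range(n_iter):
        v = denoise(f - lam / mu)                 # denoising step
        z = v + lam / mu
        r = (y - forward(z, mask)) / (diag + mu)  # elementwise inverse
        f = z + adjoint(r, mask, K)               # one-shot projection step
        lam = lam - mu * (f - v)                  # multiplier update
    return f
```

With a trivial identity denoiser this reduces to a Richardson-type least-squares iteration; in the actual network, `denoise` is the spectral transformer and the penalty parameters are trained end-to-end.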

Figure 2. Deep unfolding algorithm for CASSI reconstruction.
The projection operation is formulated in Eq. (11), and the spectral transformer is depicted in reference [30]. L is the total number of iterations.

Training
We used PyTorch [31] to train our model on an NVIDIA RTX 3090 GPU. The mean square error (MSE) was selected as the loss function. For training, we adopted an Adam optimizer [32] and set the number of iterations in deep unfolding to four, the mini-batch size to one, and the spatial resolution to 256×256. Although we trained the model using images with 256×256 pixels, we can apply it to images of any dimensions after rescaling. The initial learning rate was set to 1×10^−4. After the first five epochs, the learning rate decays by a factor of 0.9 every 15 epochs. The total number of epochs is 200, and the training time is about 60 hours.
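One way to read the schedule above is sketched below (our interpretation; the exact decay boundaries in the original training code may differ):

```python
def learning_rate(epoch, base_lr=1e-4, warmup=5, step=15, gamma=0.9):
    """Initial rate held for the first `warmup` epochs, then decayed by
    a factor of `gamma` every `step` epochs thereafter."""
    if epoch < warmup:
        return base_lr
    return base_lr * gamma ** ((epoch - warmup) // step)

# One value per training epoch over the full 200-epoch run.
rates = [learning_rate(e) for e in range(200)]
```

The same behavior could be obtained in PyTorch with a step scheduler, with the first five epochs excluded from the decay count.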

Rainbow object
We first validated our system by imaging an object illuminated with rainbow light. As shown in Fig. 3(a), we positioned a linear variable visible bandpass filter (LVF) (88365, Edmund Optics) in front of a broadband halogen light source. In an LVF, the thickness of the coating varies linearly along one dimension of the filter, resulting in a linear and continuous variation in spectral transmission. The spectral resolution of the LVF is 7-20 nm. To project a broad spectrum onto the field of view, we used a lens pair with a 3.3:1 focal-length ratio (MAP1030100-A, Thorlabs) to demagnify the linear filter and relay it to an intermediate plane, where a letter object was located. Therefore, each lateral location of the object exhibited a distinct color. Figures 3(b) and 3(c) show the raw image and the reconstructed panchromatic image, respectively. Ten representative spectral images from a total of 35 channels are shown in Fig. 3(d).

Spatial and spectral resolutions
We quantified the spatial resolution of the system by imaging a USAF resolution target. We positioned the resolution target at the back focal plane of a lens that mimicked the crystalline lens of the eye and placed the combined eye model in front of the fundus camera. Meanwhile, rather than using the theoretical mask pattern as a prior for reconstruction, we experimentally captured the coded mask image under uniform monochromatic illumination [Fig. 1(d)] and used this data to improve the reconstruction accuracy. We illuminated the USAF target in transmission mode with broadband light (450-600 nm). The reconstructed spectral images are shown in Fig. 4(a), and a zoomed-in view of the 532 nm channel is shown in Fig. 4(b). We also show the corresponding raw measurement in Fig. 4(c), where the spatio-spectral crosstalk is clearly visible. After reconstruction, we successfully removed the spatial modulation pattern and restored high-resolution images in all spectral channels. To quantify the spatial resolution, we zoomed in on the central part of the reconstructed 532 nm channel and plotted the intensities across the dashed lines in the image [Fig. 4(d)]. The image contrast is defined as C = (I_max − I_min)/(I_max + I_min), where I denotes the intensity.
We calculated the image contrast for each group of bars within the field of view (FOV). Given a contrast threshold of 0.4, the group 6 element 1 horizontal bars and the group 5 element 6 vertical bars are the smallest resolvable features. The corresponding spatial resolutions along the horizontal and vertical directions are 17.5 μm and 15.6 μm, respectively.
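The contrast criterion can be computed directly from a bar-group line profile; a small sketch (our own helper functions, not the paper's code):

```python
import numpy as np

def michelson_contrast(profile):
    """(I_max - I_min) / (I_max + I_min) for an intensity line profile
    taken across a group of resolution-target bars."""
    i_max, i_min = float(np.max(profile)), float(np.min(profile))
    return (i_max - i_min) / (i_max + i_min)

def resolvable(profile, threshold=0.4):
    # A bar group counts as resolved when its contrast exceeds the threshold.
    return michelson_contrast(profile) > threshold

# Deeply modulated bars: contrast (1 - 0.2) / (1 + 0.2) = 2/3, resolved.
c = michelson_contrast([1.0, 0.2, 1.0, 0.2])
```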
The spectral resolution of the system is determined by the size of the smallest feature on the coded mask, the camera pixel size, and the spectral dispersion power of the prism [12]. In our system, the smallest coded feature corresponds to two camera pixels. The spectral dispersion across this distance determines the spectral resolution. Because the prism has nonlinear spectral dispersion, its spectral dispersion power varies as a function of wavelength. We measured the wavelength-dependent spectral resolution by imaging a blank FOV uniformly illuminated by monochromatic light of varied wavelengths. The dispersion distances (in pixels) between the coded mask images of adjacent wavelengths at four spectral bands are shown in Table 1. Because the smallest coded feature is mapped to two camera pixels, the spectral resolution is double the pixel dispersion.
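Converting a measured pixel dispersion into a spectral resolution is then simple arithmetic; the sketch below uses illustrative numbers, not the actual values from Table 1:

```python
def spectral_resolution_nm(calib_step_nm, dispersion_px, feature_px=2):
    """Wavelength interval dispersed across one coded-mask feature.

    calib_step_nm : wavelength step between two monochromatic calibration images
    dispersion_px : measured shift (in camera pixels) between their mask images
    feature_px    : camera pixels spanned by the smallest coded feature (2 here)
    """
    return feature_px * calib_step_nm / dispersion_px

# Illustrative example: if a 10 nm calibration step shifts the mask image
# by 4 pixels, the local resolution is 2 * 10 / 4 = 5 nm.
res = spectral_resolution_nm(10.0, 4.0)
```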

Standard eye phantom
To further validate the system in retina imaging, we imaged a standard eye phantom (Wide field model eye, Rowe Technical design), which has both a realistic eye lens and a vasculature-like pattern painted at the back surface. A direct image of the retina obtained with a reference camera is shown in Fig. 5(a). We positioned the eye phantom in front of the fundus camera [ Fig. 5(b)], illuminated it with the camera's internal halogen lamp, and captured the retina image using CASSI in a snapshot.
The reconstructed spectral channel images of two ROIs are shown in Fig. 5(c), showing a close resemblance to the corresponding regions in the reference image [Fig. 5(a)]. Furthermore, to quantitatively evaluate the spectral accuracy, we measured the field-averaged spectrum using a benchmark fiber spectrometer (OSTS-VIS-L-25-400-SMA, Ocean Optics) as the reference. The CASSI-reconstructed spectrum matches well with the ground truth [Fig. 5(d)].

Conclusions
In summary, we developed a snapshot spectral retinal imaging system by integrating a CASSI system with a fundus camera. We also developed a deep unfolding method for fast and high-quality image reconstruction. The resultant system can acquire a large-sized spectral datacube in the visible range in a single exposure. The system performance has been demonstrated with standard targets and an eye phantom. Given its high-resolution snapshot imaging capability, we expect our method to find broad applications in retinal imaging.

Data Availability
Data underlying the results presented in this paper may be obtained from the corresponding author upon reasonable request.