Deep learning for ultra-widefield imaging: a scoping review

This article is a scoping review of published and peer-reviewed articles applying deep learning (DL) to ultra-widefield (UWF) imaging. This study provides an overview of the published uses of DL and UWF imaging for the detection of ophthalmic and systemic diseases, generative image synthesis, image quality assessment, and segmentation and localization of ophthalmic image features. A literature search was performed up to August 31st, 2021 using PubMed, Embase, Cochrane Library, and Google Scholar. The inclusion criteria were as follows: (1) deep learning, (2) ultra-widefield imaging. The exclusion criteria were as follows: (1) articles published in any language other than English, (2) articles not peer-reviewed (usually preprints), (3) no full-text availability, (4) articles using machine learning algorithms other than deep learning. No study design was excluded from consideration. A total of 36 studies were included. Twenty-three studies discussed ophthalmic disease detection and classification, 5 discussed segmentation and localization of ultra-widefield images (UWFIs), 3 discussed generative image synthesis, 3 discussed ophthalmic image quality assessment, and 2 discussed detecting systemic diseases via UWF imaging. The application of DL to UWF imaging has demonstrated significant effectiveness in the diagnosis and detection of ophthalmic diseases including diabetic retinopathy, retinal detachment, and glaucoma. DL has also been applied in the generation of synthetic ophthalmic images. This scoping review highlights and discusses the current uses of DL with UWF imaging, and the future of DL applications in this field.


Introduction
In 1926, the first fundus camera was introduced by Zeiss and Nordensen. At that time, the camera provided only a 20-degree field of view. Shortly thereafter, an improved camera provided practitioners with a 30-degree field of view of the fundus [1]. While a major advance at the time, these cameras provided ophthalmologists with a limited view of the retinal periphery. In 1981, the Diabetic Retinopathy Study provided an objective method to visualize up to 75 degrees of the retina by combining seven conventional 30-degree fundus images (FIs) [2,3]. This image type, known as 7 Standard Field (7SF) imaging, became the gold standard for diagnosing diabetic retinopathy (DR), and it remained so until technical developments in widefield (WF) and ultra-widefield (UWF) imaging.

Widefield and ultra-widefield imaging
The International Widefield Imaging Study Group established anatomic definitions of widefield images (WFIs) as "images depicting retinal anatomic features beyond the posterior pole, but posterior to the vortex vein ampulla, in all 4 quadrants" while describing ultra-widefield images (UWFIs) as "images showing retinal anatomic features anterior to the vortex vein ampullae in all 4 quadrants" [4]. WF imaging utilizes a scanning laser ophthalmoscope (SLO), which separates the illuminating and imaging lasers used [5]. By separating the beams, WF imaging reduces artifacts produced from the interfaces in the ocular media [5].
UWF imaging can provide up to a 200-degree view of the retina, which allows for visualization of the optic disk and the peripheral retina in the same view [6]. Multiple UWF imaging systems are available, each differing in its technology and field of view. The first UWF imaging system was introduced in 2000 by Optos [7]. The Optos system (Optos Inc, Dunfermline, UK) captures 200 degrees of the retina in a single image. The image provides coverage of approximately 82% of the retinal surface and does so without direct patient contact [8].
Other UWF imaging systems include the Heidelberg Spectralis Ultra-Widefield module, a noncontact, removable lens that is an add-on to the Heidelberg HRA cSLO (Heidelberg Engineering, Heidelberg, Germany). This module expands the viewing range of the system from 55 degrees to a full UWF view of the retina [6]. Other UWF imaging products include the Zeiss Clarus 500 retinal camera (Carl Zeiss AG, Oberkochen, Germany), which provides color and high-resolution imaging across the UWF anatomic range [9].

Clinical utility of UWF systems
As UWF imaging has provided a broader view of the retina, it has consistently been more effective at diagnosing retinal disease than previous imaging modalities. Ultra-widefield fluorescein angiography (UWF-FA), which combines UWF imaging with fluorescein angiography (FA) to visualize vessels, has been significantly more effective in diagnosing DR than previous 7SF imaging [10]. UWFIs provide 3.2 times more retinal surface area than 7SF and allow for a more comprehensive assessment of peripheral lesions and nonperfusion in DR [11,12]. A comparison of ultra-widefield imaging to color fundus photography is provided in Fig. 1. In retinal detachment (RD), UWF imaging has provided improved assessment of peripheral retinal breaks in comparison to indirect ophthalmoscopy [13]. In glaucoma, UWF imaging has been shown to have high agreement with color digital stereoscopy (CDS) in evaluating vertical cup-to-disc ratio and may be as effective as CDS in diagnosing the disease [14]. In patients with age-related macular degeneration (AMD), peripheral retinal changes were found to be highly prevalent, indicating that UWF imaging has greater value in diagnosing AMD than traditional fundoscopy [15]. From these findings, UWF imaging may be a window not only to the retina, but to the brain more broadly.
The aim of this survey is to review articles that apply DL models specifically to UWF imaging. The goal of this paper is not to discuss the added benefit of imaging the periphery for improving diagnosis or prognosis. Instead, our review focuses on the clinical utility of DL for UWF imaging and the current state of this field. This follows previous reviews on UWF imaging, which have described the landscape of UWF imaging and its clinical use in ophthalmology, as well as additional issues pertaining to UWF imaging and its clinical utility [6,16].

Machine learning, deep learning, and supervision
Machine learning (ML) refers to the ability of machines to generate associations and patterns between variables, learning in a sense similar to humans. By simulating the neural networks of human brains, ML networks generate probabilities and associations between variables to emulate human intelligence [17]. ML algorithms can often draw inferences between variables that are either imperceptible to humans or too complex for humans to associate [17]. ML is divided into categories based on the approaches taken to assist computers in learning, on a spectrum from supervised learning to unsupervised learning [18].
Deep learning (DL) is a subset of machine learning that uses multiple layers of learning to identify features in data [19]. For example, in processing a fundus image, a lower layer may identify the edges of the vasculature, while higher layers may then utilize these edges in context to identify vessels as larger objects.
Fig. 1 Comparison of Optos ultra-widefield imaging (200-degree field of view) to color fundus photography (45-degree field of view). A: Optos ultra-widefield optomap color image of the left fundus of a patient with diabetic retinopathy, with an overlaid color fundus photograph from the same patient's eye over the optic disc and macula region
Supervised learning refers to ML from human-provided input and output pairs. For example, supervised learning for an image classification task would require a set of labelled images with their corresponding classifications. The dataset is completely labelled, such that there is no ambiguity for the model training on it. In a dataset of ophthalmic disease images, for instance, every image would be labelled with the name of the disease it presents. By training a model on images and their corresponding classifications, machines can learn to infer relationships between the two. The trained model should then be able to take unlabelled input data and determine its classification [20]. While this method is the most effective at training these associations, it also requires the most human involvement, since an entire dataset must be labelled by hand.
Between supervised and unsupervised learning lies semi-supervised learning, which refers to ML training on incompletely labelled datasets. This approach provides the machine with an initial relationship between input and output data without a fully labelled set [21]. Because the dataset is incompletely labelled, the ML model must learn from the labelled images and then classify the unlabelled images. For example, in a dataset of ophthalmic disease images, only some of the images would be labelled with the name of the disease presented. The remaining images would be classified by the model that has learned from the labelled images, providing a machine-generated label for the previously unlabelled images. Models trained using semi-supervised approaches generally have lower accuracy than supervised learning on completely labelled datasets. However, because the dataset does not need to be fully labelled, this approach demands less human effort than labelling an entire dataset.
Unsupervised learning uses algorithms to learn patterns from data that lacks human labels and input. The dataset contains no labels whatsoever, requiring the ML model to first learn the features of the data in order to classify it [22]. For example, in training a model to classify images using an unsupervised approach, a successful algorithm would determine the features that correspond to a given cluster and sort the images into separate categories without input labels from humans. Given a set of ophthalmic disease images, the model would learn the features that cluster between images, and then attempt to classify the images into categories based on image and disease features. The ML model would then be trained from these developed categories to classify new images. Unsupervised learning has the lowest accuracy of the approaches described but requires no human involvement for labelling data. A figure showing the differences between the datasets and labels used in supervised, semi-supervised, and unsupervised learning is provided in Fig. 2.
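The three labelling regimes above can be contrasted with a toy, one-feature sketch. All data, labels, and the 1-nearest-neighbour "model" here are purely illustrative, not drawn from any study in this review:

```python
# Toy, one-feature illustration of the three labelling regimes using a
# 1-nearest-neighbour "model". All data and labels are hypothetical.
def nearest_label(x, labelled):
    """Return the label of the labelled example closest to x."""
    return min(labelled, key=lambda pair: abs(pair[0] - x))[1]

# Supervised: every example carries a human-provided label.
supervised = [(0.1, "healthy"), (0.2, "healthy"),
              (0.8, "disease"), (0.9, "disease")]

# Semi-supervised: only part of the data is labelled; the model assigns
# machine-generated (pseudo) labels to the rest.
labelled = [(0.1, "healthy"), (0.9, "disease")]
unlabelled = [0.2, 0.8]
pseudo = [(x, nearest_label(x, labelled)) for x in unlabelled]

# Unsupervised: no labels at all; group by a learned feature boundary
# (here, a simple threshold on the single feature).
data = [0.1, 0.2, 0.8, 0.9]
clusters = {x: "cluster_a" if x < 0.5 else "cluster_b" for x in data}
```

The semi-supervised branch shows pseudo-labelling in miniature: the unlabelled points inherit the label of their nearest labelled neighbour, after which the enlarged labelled set could be used for further training.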
Human-in-the-loop (HITL) training is an example of human involvement in the training of ML models. In these models, human input is used to validate or negate the predictions produced by the ML model, allowing the model to learn from the responses of the humans involved. This allows humans to direct the training of models. It is also useful for training models on unlabelled data, and provides a method to influence the training of ML models [23]. It is important to note that HITL is different from unsupervised learning, as the terms "supervised" and "unsupervised" refer to the labelling of the data used in the learning process, rather than to human involvement.
Fig. 2 Comparison of unsupervised, semi-supervised, and supervised learning
Building a deep learning model
DL specifically associates variables along nodes in a computational neural network. By associating data along these nodes, the artificial neural network (ANN) assigns a positive weight to variables with positive correlations and negative weights to variables with negative correlations. These weights determine the contributory strength of an input variable to the outcome of the neural network [24]. This develops a network of probabilistic associations between input variables. This is analogous to biological neurons, where associations between neurons are strengthened or weakened with excitatory and inhibitory stimuli respectively [25]. By associating data along these ANNs, machines can learn and train models based on input data. By associating features of the data across the nodes of the ANN, correlations between the data are strengthened or weakened. These connections between the nodes, known as edges, are analogous to the synapses in biological brains [26].
In DL, nodes are organized into multiple layers. Each layer contains a set of nodes and often performs a different transformation on the input data. Each neural network contains an input layer, where the data enters untransformed, and an output layer, which produces the learned result. Between these lie zero or more hidden layers, where further learning of data features occurs. Input data is processed forward from the input layer until it reaches the output layer [27].
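The forward pass described above can be sketched in a few lines. All weights, biases, and layer sizes below are illustrative only, not taken from any cited model:

```python
# Minimal sketch of a forward pass through a small feed-forward network:
# an input layer, one hidden layer (ReLU), and a single-node output
# layer (sigmoid). All weights and biases are illustrative.
import math

def dense(inputs, weights, biases, activation):
    """One layer: per node, a weighted sum of the inputs plus a bias,
    passed through an activation function."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def relu(v):
    return max(0.0, v)            # zeroes out inhibitory (negative) signals

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

x = [0.5, -1.2]                   # input layer: data enters untransformed
hidden = dense(x, [[0.8, -0.4], [0.3, 0.9]], [0.1, 0.0], relu)
output = dense(hidden, [[1.2, -0.7]], [0.05], sigmoid)   # learned result
```

Positive weights strengthen an edge's contribution and negative weights weaken it, mirroring the excitatory/inhibitory analogy drawn above.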
Convolutional neural networks (CNNs) build on ANNs by organizing data and nodes in three dimensions. Furthermore, CNNs separate feature extraction and classification into distinct layers. CNNs rely on a convolution layer, which performs a convolution operation on the data array or tensor. The convolution operation extracts high-level features from a data source, such as the edges of an image. By doing so, it reduces the spatial size of the data and flexibly adjusts to the features of the data that are deemed more important to higher-level processing [28]. For this reason, CNNs are especially useful in image processing, where their convolutional operation allows them to ignore noise and focus on higher-order image structures, like edges. Multiple CNN models exist, including LeNet, AlexNet, VGGNet, and InceptionResNetV2 [29][30][31][32].
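The convolution operation itself is simple to demonstrate. The sketch below slides a toy horizontal-gradient kernel over a small grayscale image to extract a vertical edge; the image and kernel values are illustrative, not from any cited model:

```python
# A minimal 2D convolution over a grayscale image (nested lists), using
# a simple gradient kernel to extract vertical edges.
def conv2d(image, kernel):
    """Slide the kernel over the image and sum elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(kh) for b in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]

# A 4x4 image with a vertical intensity edge between columns 1 and 2.
image = [[0, 0, 9, 9] for _ in range(4)]
edge_kernel = [[-1, 1]]   # responds where brightness changes horizontally
feature_map = conv2d(image, edge_kernel)   # strong response only at the edge
```

Note that the output (4 × 3) is smaller than the input (4 × 4), illustrating the reduction in spatial size mentioned above; real CNNs stack many such learned kernels per layer.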

Code-free and automated machine learning
The development of machine learning models has been challenging for clinicians, many of whom lack the technical expertise to train and develop them. For this reason, the advent of code-free and automated machine learning (AutoML) systems has helped to democratize access to the development of effective ML models in medicine and ophthalmology [33][34][35]. These solutions provide a user-friendly graphical user interface through which individuals can build ML models without code [36].
AutoML can be described as "AI that can build AI," as it allows non-technical users to develop AI tools that achieve accuracies close to or equal to those of code-based ML solutions developed by technical users. AutoML has been applied to a variety of ophthalmic data, including tabular data from electronic medical records, optical coherence tomography (OCT) scans, standard fundus images and UWFIs, and surgical videos [35,[37][38][39][40]. Further, feasibility studies have shown that non-technical ophthalmologists are able to develop effective ML models using AutoML tools, such as Google Cloud AutoML [34]. With the continued development and use of AutoML tools, it is expected that ML models will be used more effectively and more widely, both in ophthalmology and in medicine more broadly.

Training, validation, and testing
Datasets are split into training, validation, and testing sets such that each respective step has data that is similar to a model's intended input data. When datasets include more than one image per patient (right eye, left eye, and steered images), it is important to maintain patient-level splits, ensuring that all images from a given patient are assigned to training, validation, or testing, but never to more than one of these. This restriction eliminates the risk of train-test data contamination, which can allow DL models to use non-clinically relevant patterns to drive predictions.
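A patient-level split can be sketched as below. The function name, split ratios, and patient IDs are hypothetical; the key point is that patients, not images, are shuffled and assigned:

```python
# Sketch of a patient-level split: every image from a given patient is
# assigned to exactly one of the training, validation, or test sets.
import random

def patient_level_split(images_by_patient, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle patients (not images), then carve off test and validation
    patients before assigning the remainder to training."""
    patients = sorted(images_by_patient)
    random.Random(seed).shuffle(patients)
    n_test = int(len(patients) * test_frac)
    n_val = int(len(patients) * val_frac)
    test_p = patients[:n_test]
    val_p = patients[n_test:n_test + n_val]
    train_p = patients[n_test + n_val:]
    collect = lambda ps: [img for p in ps for img in images_by_patient[p]]
    return collect(train_p), collect(val_p), collect(test_p)
```

Because assignment happens at the patient level, a right-eye image can never leak into the test set while its left-eye counterpart sits in the training set.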
The training dataset is used to train the model to learn the weights and biases between dataset variables. This training dataset is the input data for the model to learn from. For this reason, the quality and quantity of this data will greatly impact the ability of the model to learn the features of interest [41].
Validation datasets are used to tune the hyperparameters of the model. Hyperparameters are parameters that control the learning process of the model, while parameters are the node weights that are derived from training the model. The results of the validation set are used by the engineer to determine the optimal hyperparameters for the learning process. For this reason, this dataset is also known as the "development" or "tuning" dataset. The model does not learn from this dataset and does not develop weights or biases that would alter the model in an automated sense [41].
Finally, the test dataset is used as the input data for evaluating the model. This dataset is used to determine the accuracy and effectiveness of the model. This test dataset is not used to adjust the model, nor does the model learn from it [41].
The proportion of the dataset allocated to each subset depends on the goals of the model being trained and evaluated. In models that require many hyperparameters to be adjusted, a larger validation set is recommended. However, a validation set is optional if the user does not intend to tune these hyperparameters. Similarly, the amount of data to be allocated to the training set depends on the complexity of the data and the amount of learning needed [41].
Another approach to dividing the dataset is k-fold cross-validation. In this process, the data is divided into k groups; k−1 groups are used as training data and the remaining group as validation data. This repeats until each group has served once as the validation set.
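The k-fold rotation can be sketched as a small index generator; the interleaved fold assignment below is purely illustrative:

```python
# Minimal k-fold cross-validation index generator: each of the k groups
# serves exactly once as the validation set, while the remaining k-1
# groups form the training set.
def k_fold(n_items, k):
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

splits = list(k_fold(6, 3))   # 3 (train, validation) index pairs
```

In practice the fold assignment should also respect patient-level grouping, for the contamination reasons discussed earlier.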

Image preprocessing
Image preprocessing serves multiple purposes when training a DL model on an imaging task. The first purpose is to conserve computational resources by resizing images [42]. Often, large images (i.e., the standard 3900 × 3072 pixels of UWFIs) will be resized to significantly smaller, lower-resolution images (i.e., 227 × 227 pixels) [43].
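Resizing can be illustrated with a nearest-neighbour downsampling sketch. The 3900 × 3072 to 227 × 227 figures come from the text above; the tiny 4 × 4 image here is a toy:

```python
# Nearest-neighbour downsampling: for each output pixel, pick the
# nearest source pixel. Toy values only.
def resize_nearest(image, out_h, out_w):
    in_h, in_w = len(image), len(image[0])
    return [[image[i * in_h // out_h][j * in_w // out_w]
             for j in range(out_w)]
            for i in range(out_h)]

small = resize_nearest([[1, 2, 3, 4],
                        [5, 6, 7, 8],
                        [9, 10, 11, 12],
                        [13, 14, 15, 16]], 2, 2)
```

Production pipelines typically use library resampling with interpolation (e.g., bilinear), but the principle of trading resolution for computation is the same.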
The second purpose of image pre-processing is to increase the data available for training the model and to train the model on generalized cases. This is done via data augmentation, which increases the size of the dataset by performing transformations on it, producing new combinations of data to train on [44]. This provides the training set with data that may match altered or degraded input data. For example, one could augment an image dataset by adding noise to the images, which would help the model learn to correctly classify noisy images containing the features of a given label. As the goal is to train these models to be useful on real-world data, the training data should contain the same "errors" or adjustments that imperfect real-world input data does. This allows the model to become more robust in detecting the features of interest. Common image augmentation methods include brightness adjustment, gamma correction, histogram equalization, noise addition, and inversion [45]. Data augmentation can increase the size of the training set manyfold, with reported increases of five to eighteen times the original dataset size [45,46].
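A few of these augmentations can be sketched on an 8-bit grayscale image represented as nested lists of 0-255 values. The offsets and noise amplitude below are arbitrary illustrative choices:

```python
# Sketch of common augmentations: horizontal flip, brightness shift
# (clamped to the 0-255 range), and noise addition.
import random

def flip_horizontal(image):
    return [row[::-1] for row in image]

def adjust_brightness(image, delta):
    return [[min(255, max(0, px + delta)) for px in row] for row in image]

def add_noise(image, amplitude, seed=0):
    rng = random.Random(seed)
    return [[min(255, max(0, px + rng.randint(-amplitude, amplitude)))
             for px in row] for row in image]

def augment(dataset):
    """Each source image yields itself plus three transformed copies,
    quadrupling the effective training set."""
    out = []
    for img in dataset:
        out += [img, flip_horizontal(img),
                adjust_brightness(img, 30), add_noise(img, 10)]
    return out
```

Chaining more transformations (rotations, gamma correction, histogram equalization) multiplies the dataset further, which is how the five- to eighteen-fold increases cited above are reached.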

Evaluating the model
In ML classification tasks, predictions are classified into true positives, true negatives, false positives, and false negatives.
From these values, the sensitivity and specificity of the model are calculated. Sensitivity refers to the ability of the model to correctly predict positive observations, calculated as the number of true positives divided by the sum of the true positives and false negatives. Specificity refers to the ability to reject classifications for cases that do not fit the condition, calculated as the number of true negatives divided by the sum of the true negatives and false positives [47].
Sensitivity and specificity depend on the threshold used for detection. As the threshold for detection increases, sensitivity (predicting positive outcomes) decreases while specificity (correctly rejecting negative cases) increases. By plotting the sensitivity, or true positive rate, as a function of the false positive rate (1 − specificity), the receiver operating characteristic (ROC) curve is produced. The integral of the entire curve is referred to as the area under the ROC curve (AUROC). The AUROC value serves as a measure of the model's ability to correctly classify and predict based on the test data. A perfectly predictive model would score 1, while a perfectly inaccurate model would score 0 [48]. The area under the precision-recall curve (AUPRC) is occasionally used and is the integral of the plot of the positive predictive value (precision) as a function of the sensitivity (recall) [49].
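These calculations can be sketched directly. The trapezoidal AUROC helper and toy scores below are illustrative, not drawn from any study in this review:

```python
# Sensitivity, specificity, and a trapezoidal AUROC computed from raw
# prediction scores and binary labels.
def sensitivity(tp, fn):
    return tp / (tp + fn)          # true positive rate

def specificity(tn, fp):
    return tn / (tn + fp)          # true negative rate

def auroc(scores, labels):
    """Sweep the detection threshold over the distinct scores, collect
    (FPR, TPR) points, and integrate with the trapezoid rule."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    points.append((1.0, 1.0))
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A model that ranks every diseased eye above every healthy eye scores 1.0 regardless of where the threshold is set, which is why AUROC is a threshold-free summary of discrimination.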
Another method for evaluating some models is the Dice coefficient. It is often used to quantify the performance of image segmentation tasks, as it measures the spatial overlap between the object intended to be identified and the object actually segmented. For example, a Dice coefficient would be useful in quantifying how successfully an ML model has segmented and detected vasculature in an image [50]. Dice coefficient scores range from 0, indicating no spatial overlap between the target and the area segmented, to 1, indicating complete overlap. A higher Dice coefficient therefore indicates greater effectiveness in identifying the structure of interest.
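The Dice coefficient reduces to a few lines over binary masks. The toy "vessel" masks below are illustrative only:

```python
# Dice coefficient between two binary segmentation masks, flattened
# to 1-D lists of 0/1 values.
def dice_coefficient(pred, target):
    """Dice = 2|A intersect B| / (|A| + |B|) for binary masks."""
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2 * intersection / total if total else 1.0

predicted = [1, 1, 0, 0, 1, 0, 0, 0]   # model's segmented vessel pixels
ground =    [1, 1, 1, 0, 0, 0, 0, 0]   # annotated ground-truth pixels
score = dice_coefficient(predicted, ground)   # = 4/6, about 0.667
```

Here the model recovers two of the three true vessel pixels and adds one false one, so the overlap score penalizes both the miss and the false detection.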

Deep learning in standard ophthalmic imaging
Deep learning has been applied extensively to other modalities of ophthalmic imaging, such as OCT, standard fundus imaging, and fundus autofluorescence (FAF) imaging. While outside the scope of this review, it is worth noting that the application of DL to standard FIs has been successful and well established for the detection of referable DR, diabetic macular edema (DME), and AMD. OCT imaging has been used extensively with DL for the detection of AMD and glaucoma. While we discuss here the application of DL to UWF imaging specifically, it should be noted that other ophthalmic imaging modalities have been effectively used with DL for the detection and classification of ophthalmic disease [51].

Deep learning in ultra-widefield imaging
Methodology
A literature search was performed up to August 31st, 2021, using the following online databases: PubMed, Embase, Cochrane Library, and Google Scholar. Article screening was done by the senior author (RD). The inclusion criteria were as follows: (1) deep learning, (2) ultra-widefield imaging. The exclusion criteria were as follows: (1) articles published in any language other than English, (2) articles not peer-reviewed (usually preprints), (3) no full-text availability, (4) articles using machine learning algorithms other than deep learning. No study design was excluded from consideration. The detailed search methodology was as follows: ("deep learning" OR "artificial intelligence" OR "machine learning") AND ("Ultra-Widefield" OR "UWF" OR "UWFI" OR "Optos").
A total of 36 studies were included. A full listing of included studies, authors, and their respective digital object identifiers are listed in Table 1. A full listing of included studies and their respective architectures, datasets, and experimental results are listed in Table 2. A listing of the included studies and descriptions of their ground-truth datasets, human annotation, grader level, and description of patient-level splits is available in Table 3. A chart detailing the number of included publications by year is included in Fig. 3. A map highlighting the number of publications by country is provided in Fig. 4.

Disease detection and classification
Disease detection and classification have been the most thoroughly investigated uses for UWF imaging with DL. Specifically, DL has been used for disease detection and classification of DR, RD, glaucoma, AMD, retinitis pigmentosa (RP), pachychoroid, retinal vein occlusion (RVO), idiopathic macular hole (IMH), retinal hemorrhage (RH), and sickle cell retinopathy (SCR).
Wang et al. were the first to apply a DL model to UWFIs for the detection of referable DR, in 2018 [55]. In this study, 754 UWFIs were acquired from patients with diabetes presenting at the Narayana Nethralaya hospital in Bangalore, India. The images were transmitted to and graded by "certified DR graders" at the Doheny Eye Institute, of which 643 were gradable and input into the algorithm. The study set a threshold of moderate non-proliferative DR (NPDR) or higher (i.e., level 2 or higher on the International Clinical Diabetic Retinopathy scale) as sufficient to warrant a referral to an ophthalmologist. The EyeArt algorithm, developed and trained using standard flash color images, was applied here to UWFIs. This proprietary, closed-source algorithm automatically detects and quantifies DR lesions, such as hemorrhages, microaneurysms, lipid exudates, and cotton wool spots, in UWFIs. Half of the dataset was used to train the classifiers in the previously developed EyeArt algorithm, while the other half was retained as a testing dataset.
The algorithm found that 21.22% of the images contained referral-warranted DR, while the graders determined that 30.77% did. The algorithm was applied to each eye independently, as well as at the patient level, where referable DR in either eye classified the patient as having DR. At the patient level, the algorithm achieved 91.7% sensitivity, 50.0% specificity, and a 0.873 AUROC. When using individual eyes, the algorithm achieved 90.3% sensitivity, 53.6% specificity, and a 0.851 AUROC. While the authors achieved high sensitivity, the low specificity indicates a high number of false positives using the EyeArt algorithm. While the results were promising, a full understanding of the ML methods cannot be determined, as the algorithm is closed source. Considering that EyeArt was designed and trained on FIs, it is expected that algorithms designed and trained on UWFIs would be more effective.
Nagasawa et al. published a study in 2019 that used DL to detect treatment-naïve proliferative diabetic retinopathy (PDR) from UWFIs [56]. In this study, 378 UWFIs were graded for the presence of PDR by three retina specialists using the Early Treatment Diabetic Retinopathy Study (ETDRS) severity scale, which includes disease features such as RH and neovascularization in grading the presence of PDR.
The authors used the VGG-16 CNN, which automatically learns the local features of an image and generates a classification model [31]. The authors trained 40 DL models over 40 learning cycles and chose the model with the highest correct-answer rate on the test data as the DL model for the study. The selected CNN achieved 94.7% sensitivity, 97.2% specificity, and a 0.969 AUROC. Gradient-weighted class activation mapping (Grad-CAM) was utilized to visualize the image features used by the CNN to classify images as containing referable PDR.
The authors specifically used treatment-naïve PDR, which may have improved their results relative to Wang et al. Nonetheless, the authors were able to achieve high sensitivity and high specificity, indicating that CNN approaches trained on UWFIs may be superior to applying algorithms designed for color FIs (i.e., the EyeArt algorithm) to UWFIs for DR detection.
Using UWF-FA from PDR patients, Bawany et al. utilized DL in 2020 to correlate automated vessel density with visual acuity (VA) [52]. While not focusing on the detection of DR generally, the goal of the study was to use a DL-quantified measure (retinal vessel density) and determine whether it correlated with an outcome (VA) known to be affected by PDR. From a dataset of 42 UWFIs from patients with PDR without significant center-involving DME, retinal blood vessels were first detected using a deep neural network (DNN) with a U-Net architecture. The authors trained the model on UWF-FA images with corresponding ground-truth vessel maps generated using a HITL procedure first demonstrated by Ding et al. [58]. For each UWFI, two UWF-FA images effectively provided a "ground truth" of vessel location to evaluate the segmentation. The output of the DNN was a vessel map in which pixel intensity indicated the likelihood of a pixel being a vessel. The trained DNN achieved a 0.930 AUPRC.
Vessel density was measured by calculating the percentage of vessel pixels in a circular area centered around the fovea. To study the correlation between vessel density and best corrected visual acuity (BCVA), UWF-FAs were analyzed using the trained model. The study found a statistically significant positive correlation between vessel density and BCVA of 0.4071 (p = 0.0075), but no statistically significant correlation between vessel density and central retinal thickness.
Tang et al. published a study in 2021 which used DL to detect vision-threatening DR (VTDR) and referable DR (RDR) from UWFIs [54]. In this study, 2861 UWFIs from the primary dataset were labeled for the presence or absence of RDR and VTDR respectively by graders according to the International Clinical Diabetic Retinopathy Disease Severity Scale. A total of 9392 UWFIs were used for training, primary validation, and geographical external validation across the primary dataset and the four external datasets. The authors then trained three CNNs to develop a pipeline for disease detection from the UWFIs. The first CNN classified images as gradable or ungradable, the second detected VTDR, and the third detected RDR. The study used transfer learning and applied ResNet50 models pre-trained on ImageNet. Finally, the authors applied Class Activation Mapping heatmaps for each result (true positive, true negative, false positive, and false negative) to assess DL performance.
The first CNN to determine gradeability achieved an 86.5% sensitivity and 82% specificity.
The authors of another included study [53] compared the ability of CNNs to classify ETDRS 7SF images vs. optic disk- and macula-centered ETDRS F1-F2 images as containing DR. They first trained a U-Net model with ResNet-18 for optic disk detection on the publicly available REFUGE dataset of color FIs [59]. The authors then used size and distance thresholds to determine macula locations. An ophthalmologist with over ten years of experience and a certified grader with two years of experience categorized a dataset of 13,271 UWFIs as healthy or containing DR. They then input these UWFIs into their trained model to detect the optic disk and macula center. From these detected locations, they segmented the UWFIs into ETDRS 7SF images and F1-F2 images. The 7SF ETDRS images contain 7 fields of 30 degrees each, while F1-F2 images contain only 30-degree overlapping circles centered on the optic disk and macula center.
The authors then fine-tuned a ResNet-34 model pre-trained on ImageNet using their dataset. In doing so, they achieved a 0.915 AUROC, 83.38% sensitivity, and 83.41% specificity on 7SF images, compared with a 0.8867 AUROC, 80.60% sensitivity, and 80.61% specificity on F1-F2 images. The 7SF images achieved results that were significantly greater for all three measures (p < 0.001) compared to those of F1-F2 images.
While the authors demonstrate that DL classification systems are more accurate using 7SF images, the achieved AUROC, sensitivity, and specificity have been exceeded in previously published studies using whole UWFIs. This indicates the greater utility of whole UWFIs over 7SF and F1-F2 images of the fundus.
Nagasawa et al. published a second study on DR using CNNs and UWFIs in April 2021 [57]. They compared the accuracy of DL-based DR staging from UWFIs and OCT angiography (OCTA) images. UWFIs and OCT en face images of the superficial plexus, deep plexus, outer retina, choriocapillaris, and density map were extracted for 491 patients with diabetes. OCTA scans of a 6 × 6 mm region were acquired for each patient. The OCTA images and UWFIs were combined into a single image file to form a third "imaging modality." The dataset contained images stratified by three retinal specialists into categories of no apparent DR, mild NPDR, moderate NPDR, and severe NPDR using the ETDRS scale.
The study authors then trained a VGG-16 CNN to first classify the images as containing DR, and the second to detect PDR. Each CNN was tested on UWFI, OCTA, and UWF-OCTA combined datasets. In detecting DR, the first CNN achieved AUCs after training on the UWFI, OCTA, and UWF-OCTA images of 0.790, 0.883, and 0.847 respectively. In detecting PDR, the second CNN achieved AUCs after training on the UWFI, OCTA, and UWF-OCTA images of 0.981, 0.928, and 0.964 respectively. This study demonstrates the ability of DL systems to detect DR and PDR but also demonstrates no additive benefit of combining imaging modalities (UWFIs and OCTA images) to increase the accuracy of disease classification.

Retinal detachment
Five studies have been published on RD detection from UWFIs [39,46,60-62]. The first published, from Ohsugi et al., used 831 UWFIs to detect rhegmatogenous RD (RRD) [60]. The dataset contained 411 images from RRD patients and 420 images from non-RRD patients, which were reviewed and classified by two ophthalmologists. The study used a CNN with 3 convolutional layers, each followed by a ReLU activation layer, finishing with two fully connected layers. The final output layer performed binary classification using a softmax function. The trained model achieved a 0.988 AUROC, 97.6% sensitivity, and 96.5% specificity.
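A toy version of the architecture described (three convolutional layers, each followed by ReLU, then two fully connected layers and a binary softmax output) might look like the following PyTorch sketch; the filter counts, pooling, and input size are assumptions for illustration, not the study's actual configuration:

```python
import torch
import torch.nn as nn

class SmallRRDNet(nn.Module):
    """Toy model in the spirit of the architecture described: three
    convolutional layers, each followed by ReLU (with pooling to shrink
    the feature maps), then two fully connected layers and a binary
    softmax output. All sizes are illustrative, not from the paper."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, 2),   # RRD vs. non-RRD
            nn.Softmax(dim=1),   # class probabilities
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# One dummy 3-channel 64 x 64 image -> a pair of class probabilities.
probs = SmallRRDNet()(torch.randn(1, 3, 64, 64))
```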
In 2019, Li et al. developed a DL system for identifying specific characteristics of RD from UWFIs [61]. Their system detected notable peripheral retinal lesions (NPRLs), such as lattice degeneration and retinal breaks, which can lead to RRD. Three retina specialists, each with over 5 years of experience, classified 5606 UWFIs, with disagreements adjudicated by a retinal specialist with over 20 years of experience. They then compared the performance of 4 CNNs: InceptionResNetV2, InceptionV3, ResNet50, and VGG-16. With each CNN, the authors explored three training regimes: i) no data augmentation, ii) data augmentation with brightness shifts, 45-degree rotation, and horizontal and vertical flipping, and iii) data augmentation with histogram brightness equalization, 45-degree rotation, and horizontal and vertical flipping. This led to 12 models trained and compared. The study found that InceptionResNetV2 trained with the second data augmentation method achieved the greatest performance, with 98.7% sensitivity, 99.2% specificity, and 99.1% total accuracy. This significantly exceeded the ophthalmologists evaluated in the study: a general ophthalmologist with 5 years of experience achieved a 97.6% accuracy, 93.6% sensitivity, and 98.7% specificity, while one with 3 years' experience achieved a 94.5% accuracy, 85.9% sensitivity, and 96.8% specificity. These results are very promising for the continued use of DL in identifying NPRLs, given its greater accuracy compared to trained ophthalmologists.
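The second augmentation scheme can be approximated with plain NumPy; the brightness range and image size here are arbitrary, and the 45-degree rotation (which requires interpolation, e.g. `scipy.ndimage.rotate`) is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, max_shift=30):
    """Return simple augmented variants of one image (H x W x 3, uint8):
    a random brightness shift plus horizontal and vertical flips, in the
    spirit of the second augmentation scheme described above. The
    45-degree rotation used in the study is omitted here."""
    shift = rng.integers(-max_shift, max_shift + 1)
    brightened = np.clip(img.astype(np.int16) + shift, 0, 255).astype(np.uint8)
    return [brightened, img[:, ::-1], img[::-1, :]]  # bright, h-flip, v-flip

# A dummy 8 x 8 RGB "image".
img = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
variants = augment(img)
```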
In 2020, Li et al. then applied an InceptionResNetV2-based DL model to detect RD and discern macular status using 11,087 UWFIs labelled for RD by three retinal specialists [46]. They first developed a DL system to detect RD, which achieved a 96.1% sensitivity, 99.6% specificity, and 0.989 AUROC. A retina specialist with 3 years' experience achieved 94.4% sensitivity and 99.1% specificity, while a specialist with 5 years' experience achieved 95.4% sensitivity and 99.8% specificity. The RD images were then used as the dataset for the DL model for macular status classification, which achieved 93.8% sensitivity, 90.9% specificity, and 0.975 AUROC. The ophthalmologist with 3 years of training achieved a sensitivity and specificity of 86.3% and 87.1%, while the more senior ophthalmologist achieved 91.3% and 92.4% respectively. The gap between the DL model and the ophthalmologists was larger for discerning macular status than for RD detection. As macular status is an indication for emergency surgery, this difference is significant in demonstrating the utility of DL in ophthalmology [46].
Zhang et al. developed a DL system for detecting lattice degeneration, retinal breaks, and RD in tessellated eyes [62]. They tested two image pre-processing techniques with the seResNext50 CNN on 911 UWFIs from tessellated eyes, classified for disease features by three retinal specialists. The first technique resized all images to 512 × 512 pixels, with the DL model outputting a positive score per lesion for each image. The second cropped patches around labelled lesions; the DL model assigned a positive score to each patch and output the maximum score across all of an image's patches. They trained three distinct models for detecting lattice degeneration, retinal breaks, and RD respectively, for a total of 6 tested models. In detecting lattice degeneration, the resizing method achieved a 0.888 AUROC while the cropping method achieved 0.841. For retinal breaks, the resizing method achieved a 0.843 AUROC while the cropping method achieved 0.953. For RD, the resizing and cropping methods achieved AUROCs of 1.000 and 0.979 respectively. The full-image (resizing) method led to greater accuracy in all cases except retinal breaks, where the cropping method was superior.
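The difference between the two pre-processing strategies can be illustrated with a small NumPy sketch, using a stand-in "classifier" (mean intensity) in place of the study's CNN; all names and values here are hypothetical:

```python
import numpy as np

def score_resized(image, classify):
    """Strategy 1: score the whole image once (a real pipeline would
    first resize it, e.g. to 512 x 512; omitted in this stub)."""
    return classify(image)

def score_patches(patches, classify):
    """Strategy 2: score each labelled-lesion patch separately and let
    the image inherit the maximum patch score."""
    return max(classify(p) for p in patches)

# Hypothetical stand-in classifier: mean intensity as a "lesion score".
classify = lambda arr: float(np.mean(arr)) / 255.0

img = np.zeros((16, 16), dtype=np.uint8)
img[0:4, 0:4] = 255                       # a small bright "lesion"
patches = [img[0:4, 0:4], img[8:12, 8:12]]

whole = score_resized(img, classify)      # diluted by background: 0.0625
patch = score_patches(patches, classify)  # max over patches: 1.0
```

The sketch shows why patch-level scoring can help with small lesions such as retinal breaks: a whole-image score dilutes a tiny lesion into the background, while the max over patches preserves it.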
In 2021, Antaki et al. published a study exploring the use of AutoML technologies for classifying RD, RP, and RVO from UWFIs [39]. They trained a DL model through the Google Cloud AutoML platform using RD and normal UWFIs from publicly available, validated datasets, which were reviewed by two ophthalmologists. The binary classification of RD achieved an 89.77% sensitivity and 78.72% specificity when the confidence level of the system was set to 0.8, and a 0.921 AUPRC.

Glaucoma
Glaucoma, whose pathophysiology primarily involves degeneration of the optic nerve head (optic disc), has been studied using DL with UWF imaging of the retina.
Two studies have investigated glaucoma with UWFIs [63,64]. In 2018, Masumoto et al. applied a DL classifier to UWFIs to detect glaucoma in a patient dataset stratified by disease severity [64]. The study authors first categorized glaucoma patients as early (mean deviation better than -6 dB), moderate (-6 to -12 dB), or severe (-12 dB or worse) based on visual field damage from Humphrey Field Analyzer measurements. The ground truth dataset therefore did not require any human annotation or grading, as the UWFIs were categorized from a quantifiable visual field deficit. In classifying any glaucoma, the DL model achieved a mean 0.872 AUROC. For early, moderate, and severe glaucoma, the DL model achieved AUROCs of 0.830, 0.864, and 0.934 respectively. The DL model was similarly most sensitive and most specific in classifying severe glaucoma vs. healthy UWFIs. While the results are promising, the AUROC did not reach 0.9, a weakness acknowledged by the study authors.
Li et al. used DL for automated glaucomatous optic neuropathy (GON) detection using UWFIs in 2020 [63]. They trained a CNN based on the InceptionResNetV2 neural network. All 22,972 UWFIs were classified as containing GON or not by three glaucoma specialists, based on a vertical cup-to-disc ratio ≥ 0.7, rim width ≤ 0.1 of disc diameter, retinal nerve fiber layer defects, or disc splinter hemorrhages. On the primary dataset, they achieved an AUROC, sensitivity, and specificity of 0.999, 97.5%, and 98.4% respectively. Across the primary and four external datasets, the AUROC, sensitivity, and specificity ranged over 0.983-0.999, 97.5-98.2%, and 94.3-98.4%. The methods demonstrated by Li et al. achieved significantly greater outcomes in detecting and classifying glaucoma on UWFIs than those of Masumoto et al. [64], likely due to the larger primary dataset.
While outside the scope of this review, other studies have used posterior-segment OCT imaging with DL for glaucoma detection [65]. Notably, methods using 3D segmentation-free OCT volumetric data achieved an AUROC of 0.940, which exceeds the AUROC of Masumoto et al. but not that of Li et al.'s UWFI-based classification of glaucoma [66]. This suggests some added value of UWFI in the DL-based detection of glaucoma compared to OCT alone.

Age-related macular degeneration
At the time of writing, two studies have been published on using DL with UWFIs to diagnose or detect AMD or its complications [67,68]. Matsuba et al. published a study in 2018 using DL to detect AMD on UWFIs [67]. In this study, they trained a CNN on UWFIs of healthy patients (no visible fundus disease) and patients with exudative (wet) AMD. Ground truth diagnosis was ascertained by two retinal specialists using a combination of standard fundus examination, OCT imaging, and FA. The CNN achieved a 0.976 average AUROC, with 100% average sensitivity and 97.31% specificity in detecting wet AMD. Six ophthalmologists yielded a correct classification 81.9% of the time, with 71.4% sensitivity and 92.5% specificity. The study ophthalmologists averaged 11 min 23.54 s for classification, while the DL model averaged 26.29 s.
The second study, published in 2021, comes from Li et al., who used DL for the automated detection of retinal exudates and drusen from UWFIs [68]. Images were labelled as containing retinal exudates and/or drusen (RED) or non-RED by three retina specialists, with disagreements adjudicated by a retina specialist with over twenty years of experience. Two external datasets were then used for validation of the InceptionResNetV2 CNN model. On the primary dataset, a 0.994 AUROC was achieved, with 94.2% sensitivity and 97.4% specificity. The external datasets achieved 0.972 and 0.988 AUROCs, with 94.9% and 95.1% sensitivities, and 96.5% and 97.3% specificities respectively.

Retinitis pigmentosa
In 2018, Masumoto et al. trained a CNN on UWFIs of RP [69]. Using UWF and UWF fundus autofluorescence (UWF-FAF) images, they trained a VGG-16 CNN to classify images based on whether they contained RP. UWFIs and UWF-FAFs from RP and healthy patients respectively were used in their dataset. RP was diagnosed based on corresponding data from clinical history, UWF-FAF, and electroretinograms (ERGs) according to the International Society for Clinical Electrophysiology of Vision standards. The UWF CNN achieved a 0.998 AUROC while the UWF-FAF CNN achieved a 1.000 AUROC. The UWF CNN achieved 99.3% sensitivity and 99.1% specificity, while the UWF-FAF CNN achieved 100% sensitivity and 99.5% specificity. There were no statistically significant differences between the sensitivities and specificities of the UWF and UWF-FAF CNNs.
Antaki et al. published a study exploring the use of AutoML technologies for classifying RD, RP, and RVO from UWFIs, as previously discussed in the section on RD [39]. They trained a DL model through the Google Cloud AutoML platform using RP and normal UWFIs. The binary classification of RP achieved an 88.0% sensitivity and 100% specificity when the confidence level of the system was set to 0.5, and a 0.942 AUPRC. When repeated using the data from Masumoto et al., the system achieved an AUPRC of 1.0, with sensitivity, specificity, and PPV all reaching 100% and no misclassifications made by the AutoML model [39,69].

Pachychoroid
A single peer-reviewed study on using UWFIs in detecting pachychoroid disease has been published. Kim et al. used an AutoML platform to classify UWFIs based on the presence of pachychoroid disease [70]. Specifically, the authors trained Google AutoML Vision on UWF indocyanine green angiography (UWF-ICGA) images classified into pachychoroid and non-pachychoroid categories by two retinal specialists. They trained two models: the first used all images in their original orientation, while the second horizontally flipped left-eye images such that all images were of the same laterality. The first model achieved precision, accuracy, sensitivity, and specificity values of 0.8182, 0.8367, 0.8182, and 0.8519 respectively, while the second model achieved 0.8636, 0.8776, 0.8636, and 0.8889 respectively. However, the mean precision, accuracy, sensitivity, and specificity scores of three retina specialists were 0.9048, 0.9388, 0.9500, and 0.9643. These results indicate that training the AutoML model with images of the same laterality led to better results, but that neither model reached the precision or recall of retina specialists.

Retinal vein occlusion
Three peer-reviewed studies exist on using DL on UWFIs in RVO, two published by Nagasato et al. and the third by Antaki et al. [39,71,72]. The first, published in 2018, used UWFIs to detect central RVO (CRVO). The study used UWFIs from CRVO patients and non-CRVO healthy subjects, classified by a single retina specialist. A VGG-16-based DNN was trained on the dataset, fine-tuned from parameters pre-trained on ImageNet. After comparing 40 DL models obtained over 40 learning cycles, they used the model with the highest rate of correct answers for evaluation. The model achieved a 0.989 AUROC, 98.4% sensitivity, and 97.9% specificity. They also trained a support vector machine (SVM) algorithm to detect CRVO from UWFIs, which achieved a 0.895 AUROC, 84.0% sensitivity, and 87.5% specificity. The DL model achieved significantly greater results on all measures compared to the SVM (p < 0.001).
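AUROC, the headline metric throughout these studies, can be computed directly from labels and scores via the Mann-Whitney formulation; the NumPy sketch below is a generic illustration, not code from any of the studies:

```python
import numpy as np

def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive receives a higher score than a randomly
    chosen negative (ties count 0.5)."""
    labels = np.asarray(labels, dtype=bool)
    pos, neg = np.asarray(scores)[labels], np.asarray(scores)[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Perfectly separated scores give an AUROC of 1.0:
perfect = auroc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
# Overlapping scores give an intermediate value (here 0.75):
mixed = auroc([1, 1, 0, 0], [0.9, 0.3, 0.4, 0.1])
```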
In 2019, the same group completed a similar study using a DL model on UWFIs of branch RVO (BRVO) patients. They used the same VGG-16 model and DNN parameters on a BRVO dataset, training the DNN on BRVO and non-BRVO healthy UWFIs classified by a single retinal specialist, and similarly tested an SVM model. The DNN achieved a 0.976 AUROC, 94.0% sensitivity, and 97.0% specificity, while the SVM achieved a 0.857 AUROC, 80.5% sensitivity, and 84.3% specificity. The authors demonstrated the ability of a DNN to accurately detect BRVO and its superiority over SVMs for this task.
In 2021, Antaki et al. published a study exploring the use of the Google Cloud AutoML platform for classifying RD, RP, and RVO from UWFIs [39]. The binary classification of RVO achieved an 84.9% sensitivity and 100% specificity when the confidence level of the system was set to 0.5, and a 0.967 AUPRC. While the sensitivity was lower than that of Nagasato et al., the AutoML model achieved comparable specificity [71,72].

Myopia
In 2020, Shi et al. published a study examining the ability of a DL system to detect myopia using UWFIs [73]. For this task, they used a custom CNN, the Myopia Detection Network (MDNet), which combined dense connections and Residual Squeeze-and-Excitation attention for detecting myopia. The CNN combined attention dense blocks, transition blocks, convolutional layers, max-pooling layers, and a dense layer to make full use of shallow features and improve information flow.
They trained the CNN on left and right UWFIs. The study defined severe myopia as a spherical equivalent (SE) more negative than -6 diopters (D), moderate myopia as between -6 D and -3 D, and mild myopia as between -3 D and 0 D. As myopia could be measured quantitatively, the ground truth dataset was categorized directly from spherical equivalent values without human grading. Images were then cropped to a 400 × 400 pixel region of interest centered on the optic disk and including the macula.
The study authors used mean absolute error (MAE) as the main evaluation index, along with root-mean-square error (RMSE) and mean absolute percent error (MAPE). The CNN achieved optimal results with an MAE of 1.1150 D, an RMSE of 1.4520 D, and a MAPE of 24.99%. These results show that myopia can be estimated within reasonable error using DL and UWFIs.
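The three error indices can be computed as follows; the example values are hypothetical spherical equivalents, not data from the study:

```python
import numpy as np

def regression_errors(actual, predicted):
    """The three indices used in the study: MAE, RMSE, and MAPE
    (MAPE assumes no actual value is zero)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    err = predicted - actual
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    mape = np.mean(np.abs(err / actual)) * 100.0
    return mae, rmse, mape

# Hypothetical spherical equivalents (diopters) vs. model predictions:
mae, rmse, mape = regression_errors([-6.0, -3.0, -1.5], [-5.0, -3.5, -1.5])
```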

Idiopathic macular hole
A single peer-reviewed study on detecting idiopathic macular hole (IMH) using UWFIs and DL has been published. In 2018, Nagasawa et al. trained a CNN on normal and IMH images [45]. Ground truth diagnosis was ascertained by a single retinal specialist who conducted ophthalmoscopy and reviewed OCT imaging. The CNN achieved a 0.999 AUROC, 100% accuracy, 100% sensitivity, and 99.5% specificity, and classified a series of 50 test images at an average speed of 32.80 ± 7.36 s. The authors also tested human ophthalmologists on the same UWF test images; they achieved an 80.6 ± 5.9% accuracy, 69.5 ± 15.7% sensitivity, and 95.2 ± 4.3% specificity, and required an average of 838 ± 199.16 s to classify the same 50 images. From this study, it is clear that IMH is diagnosed more accurately and rapidly by the CNN than by trained ophthalmologists.

Retinal hemorrhage
Li et al. published a study using a DL system to screen for RH from a dataset of RH and non-RH UWFIs [74]. The dataset was categorized by three retinal specialists, with disagreements adjudicated by a more experienced retinal specialist. The study used InceptionResNetV2, with weights pre-trained for ImageNet classification used for CNN initialization. On the primary dataset, the CNN achieved a 0.999 AUROC, 98.9% sensitivity, 99.4% specificity, and 99.3% accuracy. Two external datasets used for further testing achieved 0.998 and 0.997 AUROCs, 96.7% and 97.6% sensitivities, 98.7% and 98.0% specificities, and 98.4% and 98.0% accuracies respectively. On an external dataset, an ophthalmologist with five years of training achieved a 95.9% sensitivity and a 99.5% specificity, while an ophthalmologist with three years of training achieved 92.6% and 98.9% respectively. The ophthalmologists scored lower sensitivities than the trained CNN, but their specificities were close to those of the CNN.

Sickle cell retinopathy
A single study, published in 2020, explores using DL with UWFIs to diagnose SCR. Specifically, the study from Cai et al. explored the detection of sea fan neovascularization (SFN) from UWFIs of patients with sickle cell hemoglobinopathy [75]. The dataset was categorized by two retinal specialists, using corresponding UWF-FA data when available. The study notes that detecting potentially asymptomatic SFN provides the opportunity for prophylactic scatter laser photocoagulation, which can help reduce rates of vision loss from proliferative SCR. An InceptionV4 CNN, pre-trained on the ImageNet dataset, was trained on the image set for 100 iterations. After training, the CNN achieved a 0.988 AUROC, 97.0% accuracy, 97.4% sensitivity, and 97.4% specificity. Only a single image received a false-negative classification from the CNN, due to a severe lid artifact obscuring the retinal vasculature.

Quality assessment
Three studies have been published on using DL methods for quality assessment of UWFIs [76-78]. The first, published in 2020 by Calderon-Auza et al., focuses on using CNNs as a teleophthalmology support system to determine the quality of images provided [76]. Low-quality images were defined as those in which factors such as eyelash and/or eyelid artifacts obstructed the image or made the optic disc indistinguishable. Their proposed system uses four steps to determine UWFI quality: it first detects the optic disc (OD), then performs quality analysis on the OD, then detects obstructions (e.g., eyelash shadows) in the region of interest (ROI), and finally segments the vessels of the image. For OD detection, a Faster Region-Based CNN (FR-CNN) for feature extraction with the AlexNet CNN architecture was used; on their dataset, this configuration achieved an accuracy, sensitivity, and specificity of 0.9254, 0.9643, and 0.4424 respectively. Images determined to contain an OD were then used as the dataset for the OD quality analysis step, for which VGG-16 achieved a 0.8612 accuracy, 0.9113 sensitivity, and 0.8064 specificity in classifying ODs by quality. For obstruction analysis in the ROI, centered on the optic disk and macula, a SegNet-trained CNN achieved an accuracy of 1.0, owing to the low number of artifacts in the training and test sets. Finally, for vessel segmentation in the ROI, a SegNet architecture with VGG-16 proposed by the authors achieved a 0.9784 accuracy, 0.7169 sensitivity, and 0.9816 specificity on their dataset.
While the study from Calderon-Auza et al. is a proof of concept of the uses of DL for detecting low-quality UWFIs, it also provides examples of these techniques in practice. For this reason, this study provides readers with a clear implementation of the uses of DL in UWFIs for a multi-step process in tele-ophthalmology.
Li et al. proposed and designed a classification system using a "U-Net style" CNN and UWF-FA in 2020 [77]. UWF-FA images were graded as ungradable, poor, good, or best by two image analysts, with disagreements adjudicated by a third grader. The CNN achieved 90.5% sensitivity and 87.0% specificity for distinguishing between gradable and ungradable images, and 78.9% sensitivity and 94.1% specificity for distinguishing between optimal quality (good, best) and limited quality (poor, ungradable) images. The authors calculated the overall accuracy of the classifier as 89.0% for gradable vs. ungradable classification and 89.3% for optimal vs. limited quality. The model also achieved a 0.920 AUROC.
In 2020, Li et al. proposed a DL-based image filtering system (DLIFS) to filter out poor-quality UWFIs in an automated fashion, such that only images of sufficient quality would be used in subsequent AI diagnostic systems [78]. Images were identified as poor-quality or good-quality by 3 retina specialists, with disagreements adjudicated by a more experienced retinal specialist. Image quality was categorized as poor if more than 1/3 of the fundus was obscured, if macular vessels could not be identified or > 50% of the macular area was obscured, or if the vessels within 1 disc diameter of the OD margin could not be identified. From this dataset, they trained InceptionResNetV2 with weights pre-trained for ImageNet to classify the quality of each inputted UWFI. The trained DLIFS achieved a 0.996 AUROC, 96.9% sensitivity, and 96.6% specificity. On two external test datasets, the DLIFS achieved 0.994 and 0.997 AUROCs, 95.6% and 96.6% sensitivities, and 97.9% and 98.8% specificities respectively.

Segmentation and localization
Five peer-reviewed studies have been published on segmentation and localization using UWFIs, all of which focus on vessel segmentation [58,79-82]. The first, from Ding et al., presented a method to detect retinal vessels in UWF-FA [58]. In this study, the authors developed a method to produce vessel segmentation maps without previously labelled ground-truth datasets, relying primarily on cross-modality transfer and human-in-the-loop (HITL) learning. The HITL approach allowed the DL system to predict the vessels and respond to human feedback on whether it had segmented the vessels correctly and accurately; over multiple iterations, this led to complete segmentation of the vessels. The authors reduced manual annotation effort by first using morphological analysis to segment the vessels in a preliminary fashion. This was followed by a cross-modality approach that transferred vessel maps from UWFIs to UWF-FA images using robust chamfer alignment in an Expectation-Maximization framework. These were combined in the HITL iterative DL process for detection of retinal vessels.
The first step in the pipeline, relying on cross-modality transfer, trained a DNN on a dataset of ground truth color UWFIs with UWF-FA images from the same patient eye taken at the same time. Specifically, the DNN was trained on existing labelled UWFIs to extract the vessel maps from unlabeled UWFIs. These detected vessel maps were then geometrically aligned and transferred to the UWF-FA. These new vessel maps, aligned to UWF-FA, served as the approximate ground truth for training a DNN for vessel detection in UWF-FA images. From this point, the DNN for detecting vessel segmentation was continually run starting from the approximate ground truth from the UWFI, until the DNN did not produce maps with new changes or more vessels segmented.
The process of producing vessel maps was approached as one that would be best suited for a generative adversarial network (GAN), in producing an output image of vessel segmentation from an input UWF-FA image.
The authors then evaluated their method of reducing the annotation burden by calculating the number of pixels added and removed at each iteration. After 7 iterations, approximately 19,300 (2.0%) new pixels were added and 14,100 (1.4%) pixels were removed. Vessel detection on their primary dataset achieved an AUROC of 0.980 and a Dice coefficient of 0.829. In validating their approach on an external dataset, they achieved a maximal 0.987 AUROC, with significant improvements over traditional morphological techniques for vessel segmentation.
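The Dice coefficient used here to score vessel overlap can be computed from two binary masks as follows (a generic illustration, not the authors' code):

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice coefficient between two binary segmentation masks:
    2|A intersect B| / (|A| + |B|); 1.0 for identical non-empty masks."""
    a, b = np.asarray(mask_a, bool), np.asarray(mask_b, bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# Tiny hypothetical predicted and ground-truth vessel masks:
pred  = np.array([[1, 1, 0], [0, 1, 0]])
truth = np.array([[1, 0, 0], [0, 1, 1]])
score = dice(pred, truth)  # 2*2 / (3 + 3) = 0.667
```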
The same team of Ding et al. then published a method to segment vessels from color UWFIs via iterative multi-modal registration and learning [79]. In this project, they similarly utilized concurrently captured UWF-FA images to segment the vessels from UWFIs. The first step requires multi-modal registration of the vessels segmented first from UWF-FA using a pre-trained DNN to the UWFIs, using parametric chamfer alignment. The second step utilized a learning method to mitigate the noisy labels due to the differences in the UWF-FA and UWFIs modalities. The detected UWFIs vessel maps are then used for the registration in the following iteration, allowing for iterative improvement until the segmented vessel maps are accurate. After this training, the DNN can detect vessels from UWFIs without concurrently captured UWF-FA images. On their primary dataset, they achieved an AUROC of 0.987 and maximal Dice coefficient of 0.987. After training their DNN, they evaluated the model on an external dataset of UWFIs, achieving an AUPRC of 0.886.
Nunez do Rio et al. published a study in 2020 that explored the use of DL-based segmentation for quantification of retinal capillary non-perfusion (CNP), a useful metric for assessing retinal ischemia, using UWF-FA [80]. They used UWF-FA because it is a high-resolution modality with clearly defined retinal vasculature. They trained a U-Net-style CNN on 75 UWF-FA images manually graded for CNP to segment and extract the vasculature; twenty images were also segmented manually by an expert grader. To standardize the CNP measurement, a circular grid of rings of increasing radius was centered on the foveal avascular zone (FAZ). The segmentation model achieved a 0.82 AUROC. Between the manually graded and automatically segmented images, an inter-grader Dice similarity coefficient (DSC) of 65.51 was achieved. In comparing the assessment of CNP between the CNN model and the grader, a Kappa score of 0.55 was achieved. The authors conclude that this method allows for DL-based segmentation and a quantifiable measurement of CNP from UWF-FA.
Wang et al. published a study in 2020 that utilized a multi-task Siamese network with deep convolution to separate retinal arteries from retinal veins [81]. They did so on an FI dataset (DRIVE), a UWFI dataset (WIDE), and an OCT dataset (INSPIRE) [83-85]. Using these datasets, they first segmented the vessels with a CNN-based approach, followed by skeletonization of the vessels. They then built a graph representing the vascular network by finding branching and end points on the skeleton map, after which errors such as twinborn nodes produced by overlapping vessels were removed by morphological analysis of the skeleton, producing a refined vascular graph. They then used the Convolution Along Vessel method to extract visual features by convolving the image along the vessel segments, and geometric features of the vessels by tracking the direction of blood flow. Following this, the Siamese network was trained to classify vessel types from the visual features of vessel segments, and to estimate the similarity of every two connected segments by comparing their visual and geometric features, separating the vasculature into individual trees of arteries and veins. On the WIDE dataset of UWFIs, they achieved an accuracy of 94.5%.
In 2021, Sevgi et al. published a study that explored the ability to extract the cumulative retinal vessel area (RVA) from UWF-FA images using CNN-based DL segmentation [82]. For this study, they extracted the RVA from the available UWF-FA image frames. The frame containing the maximum RVA was considered the optimum early phase, while a frame taken ≥ 4 min later that closely mirrored the RVA of the early image was considered the late-phase frame. Image analysts then evaluated the selected pairs. A total of 1,578 UWF-FA sequences from 66 sessions were used to create cubic splines, and 13,980 UWF-FA sequences from 462 sessions were used for evaluation. Appropriate images for both phases were successfully identified in 85.2% of sessions, with 90.7% of early and 94.6% of late frames successfully identified.

Generative image synthesis using GANs
At the time of writing, four studies involving generative adversarial networks (GANs) and UWFIs have been published [86-89]. GANs, designed in 2014 by Goodfellow et al., are ML frameworks that utilize two competing neural networks to generate new data [90]. The first neural network, the "generator," generates candidate data. The second, the "discriminator," is trained on the data that is to be modelled. As the generator produces data, the discriminator rejects synthesized samples that do not sufficiently resemble the training data. Through an iterative process, the generator becomes more effective at generating data that represents the target distribution, until synthetic data close to the ground-truth dataset is produced [91]. These approaches have been effective at generating realistic-appearing data such as human faces [92].
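The adversarial training loop described above can be sketched in a few lines of PyTorch on toy 1-D "images"; the network sizes, learning rates, and data are illustrative only:

```python
import torch
import torch.nn as nn

# Toy generator and discriminator over 64-dimensional "images".
G = nn.Sequential(nn.Linear(16, 64), nn.Tanh())    # noise -> sample
D = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())  # sample -> P(real)
bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

real = torch.randn(8, 64)   # stand-in for a batch of real images
noise = torch.randn(8, 16)

# Discriminator step: push D(real) toward 1 and D(G(z)) toward 0.
opt_d.zero_grad()
d_loss = bce(D(real), torch.ones(8, 1)) + \
         bce(D(G(noise).detach()), torch.zeros(8, 1))
d_loss.backward()
opt_d.step()

# Generator step: push D(G(z)) toward 1, i.e. fool the discriminator.
opt_g.zero_grad()
g_loss = bce(D(G(noise)), torch.ones(8, 1))
g_loss.backward()
opt_g.step()
```

In a real image-synthesis pipeline, the linear layers would be convolutional (or CycleGAN-style encoder-decoders), but the alternating two-step optimization is the same.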
Ju et al. published another study in 2020, where they utilized GANs to produce labelled datasets of UWFIs from labelled fundus image datasets [87]. They noted that due to the differences in FIs and UWFIs, labelled datasets of FIs could not be used for classifying UWFIs. For this reason, they used a GAN to generate synthetic UWFIs for training. Using a quantitative, classification-based consistency regularization method, they ensured that the pathologies present in labelled FIs were similarly present in corresponding generated UWFIs. FIs were first labelled by three ophthalmologists, requiring 2/3 ophthalmologists to agree on the label for the image for it to be included in the dataset. The authors tested the generated images with and without the consistency regularization method and found that it improved the generation of matching features. The first step in this process required using target UWFIs to train a target-task model, which helps to regulate the quality of generated data. Following this, pseudo-labels were generated for the generated UWFIs. Finally, they used the original UWFI samples and the generated samples to train the target-task model together.
To test that the generated UWFIs were properly pseudo-labelled and carried the disease pathology of interest, they then classified the images that contained DR using a ResNet50-based residual neural network. They similarly validated their synthetic UWFIs on vessel segmentation and lesion detection tasks, and repeated this on two public FI datasets as external validation. The study authors effectively succeeded at producing high-quality UWFIs that mirrored FIs in pathology, and mirrored "natural" UWFIs in image quality and complexity.
In 2020, Xie et al. published a study proposing a GAN which used an attention encoder (AE) and generation flow network to build a UWFI classifier for retinal pathologies found in patients under the age of eighteen (i.e., Coats disease, familial exudative vitreoretinopathy, morning glory syndrome, RP, and DR) [88]. The two datasets used contained a total of 3,072 abnormal and 1,518 normal UWFIs respectively. The goal of this project was to harness the adversarial learning between the generator and the discriminator to build robustness into their classification model. Their proposed method achieved higher classification accuracy (84.75% and 97.25%) than classifiers based on a standard CNN architecture such as ResNet-50 (77.35% and 87.95%).
In 2020, Yoo et al. used GANs in the direction opposite to Ju et al. [87,89]: they used a GAN architecture to produce synthetic FIs from UWFIs. Specifically, they used the CycleGAN architecture to translate UWFIs to FIs while maintaining the structure, pathology, and lesions of the original UWFIs, without introducing new or artificial features into the generated FI. The authors began with a dataset of UWFIs and FIs, which two ophthalmologists reviewed for image quality. The GAN was trained on this dataset and then tested on a held-out set of UWFIs to generate synthetic FIs. Image registration was applied to crop the region of interest on the input UWFIs, centred on the optic disk and fovea, for conversion into an FI. After training the CycleGAN model for 40 epochs, the model successfully translated UWFIs to FIs with high fidelity to the original UWFI structure and pathologies. For example, UWFIs with DR microaneurysms and blot hemorrhages, GON, RD, CRVO, drusen, and retinal atrophy all had their specific lesions transferred to FIs successfully. Finally, the authors calculated the structural similarity index (SSIM) between the generated FIs and the ground-truth FIs, achieving an average SSIM of 0.802, indicating strong similarity between the generated and ground-truth images.
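The SSIM metric used above compares two images via their luminance, contrast, and covariance, yielding 1.0 for identical images. The sketch below computes SSIM over one global window to keep the formula visible; production implementations (and, presumably, the study) average SSIM over many local windows, and the random images here are stand-ins for real fundus data.

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, data_range: float = 1.0) -> float:
    """Global (single-window) SSIM between two images with values
    in [0, data_range]. Real implementations slide a window over the
    image and average the local scores."""
    c1 = (0.01 * data_range) ** 2   # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2   # stabilizes the contrast term
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
img = rng.random((64, 64))
noisy = np.clip(img + 0.1 * rng.standard_normal((64, 64)), 0, 1)
print(round(ssim_global(img, img), 3))   # identical images -> 1.0
print(ssim_global(img, noisy) < 1.0)     # noise lowers the score
```

An average SSIM of 0.802 between generated and ground-truth FIs, as Yoo et al. report, therefore indicates that the translated images preserve most of the structural content of the originals.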

Systemic diseases
UWF imaging is also being used in conjunction with DL to predict systemic and neurological factors. Because UWF imaging is a rich image format, exploratory studies have investigated whether retinal changes can be associated with features such as an individual's age, vascular status, and neurological status.

Age and brachial-ankle pulse-wave velocity
Nagasato et al. published a study in 2020 demonstrating the ability to predict both patient age and brachial-ankle pulse-wave velocity (baPWV), a marker of vascular health, from UWFIs using DL [93]. For each patient included in the study, baPWV was recorded on the same day the UWFIs were taken. The images were then processed into three versions: the entire image (the total image), a cropped region of the optic disk and macula (the central region), and the total image with the central region masked with black pixels (the peripheral region). Each of these processed image sets was used as a separate dataset, and the authors compared their performance for DL prediction of age and baPWV. The UWFIs served as input to a VGG-16 based CNN trained to predict patient age and baPWV. The results showed that the total, central, and peripheral images were all able to predict the age and baPWV of a patient with statistical significance (p < 0.001 for the correlation between predicted and actual values in all three datasets). Standardized regression coefficients of 0.833 and 0.390 were achieved for age prediction and baPWV prediction, respectively. In conclusion, the authors showed that UWFIs can be used to make specific predictions of a patient's age and baPWV.
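The standardized regression coefficients reported above can be understood as a regression on z-scored variables; with a single predictor the coefficient reduces to the Pearson correlation between predicted and actual values. The sketch below illustrates the computation on invented toy ages, not the study's data.

```python
import numpy as np

def standardized_coefficient(predicted: np.ndarray, actual: np.ndarray) -> float:
    """Standardized regression coefficient of `actual` regressed on
    `predicted`. With one predictor this equals the Pearson r between
    the two (z-scored) variables."""
    zp = (predicted - predicted.mean()) / predicted.std()
    za = (actual - actual.mean()) / actual.std()
    return float((zp * za).mean())

# Toy example: DL-predicted ages vs. true ages for six patients.
true_age = np.array([34.0, 41.0, 52.0, 58.0, 63.0, 70.0])
pred_age = np.array([36.0, 40.0, 49.0, 60.0, 61.0, 72.0])
print(standardized_coefficient(pred_age, true_age))  # close to 1 for accurate predictions
```

On this toy data the coefficient is near 1; the 0.833 (age) and 0.390 (baPWV) values in the study indicate, respectively, strong and moderate agreement between predictions and measurements.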

Alzheimer's disease
In 2020, Wisely et al. used multiple imaging modalities to train a DL model to identify symptomatic Alzheimer's disease (AD) [94]. The authors trained on UWFIs, UWF-FAF images, color maps of ganglion cell-inner plexiform layer (GC-IPL) thickness, and superficial capillary plexus en face OCTA, drawn from the eyes of cognitively healthy subjects and of patients with symptomatic AD as confirmed by two expert neurologists. The DL model took these imaging modalities as input, along with OCT and OCTA quantitative data and patient data. It used a shared-weight image feature extractor to extract modality-agnostic features, which were then passed to a modality-specific function in a fully connected layer. After training, the authors tested the model on each imaging modality individually, as well as on combinations of the data. Inputted alone, UWFIs achieved an AUROC of 0.450, UWF-FAF images 0.618, OCTA 0.582, and GC-IPL 0.809. All images inputted together achieved an AUROC of 0.829; all images with quantitative data, 0.830; all images with all data, 0.836; and GC-IPL with quantitative and patient data achieved the highest AUROC of 0.841. These findings indicate that GC-IPL has the strongest individual predictive value for symptomatic AD and that adding further imaging modalities (i.e., OCTA, UWF-FAF, and UWF) did not significantly improve predictive value in this case. The predictive value of UWFIs alone for symptomatic AD was low.
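The shared-weight, multimodal design described above can be sketched in miniature: a single projection (standing in for a shared CNN) is applied identically to every modality, and the resulting modality-agnostic features are concatenated with clinical data before the classifier head. Everything below (the linear "extractor", the modality names, the clinical values) is an illustrative assumption, not the authors' architecture.

```python
import numpy as np

def extract(image: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Toy shared-weight feature extractor: one linear projection plus
    a nonlinearity, applied identically to every imaging modality."""
    return np.tanh(image.ravel() @ weights)

rng = np.random.default_rng(2)
W = 0.1 * rng.standard_normal((16 * 16, 4))   # shared weights, 4 features
modalities = {name: rng.random((16, 16))
              for name in ("UWF", "UWF-FAF", "GC-IPL", "OCTA")}

# Same extractor (same W) for every modality -> modality-agnostic features,
# then fused with clinical data for a downstream classifier head.
features = np.concatenate([extract(img, W) for img in modalities.values()])
clinical = np.array([72.0, 1.0])              # e.g. age, sex (toy values)
fused = np.concatenate([features, clinical])
print(fused.shape)  # 4 modalities x 4 features + 2 clinical values
```

Sharing weights across modalities keeps the parameter count low and forces the extractor to learn features that transfer across image types, which is the stated motivation for this design.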

Discussion
Across a broad range of domains and diseases, DL has been demonstrated to be useful when used in conjunction with UWF imaging. In the detection and classification of disease, DL models have been accurate, sensitive, and specific across a variety of ophthalmic disorders. In this review, we summarized the use of DL in detecting DR, RD, glaucoma, AMD, RP, pachychoroid, RVO, IMH, RH, and SCR from UWFIs. While disease detection has been the most published application of DL with UWFIs, its use in quality assessment of UWFIs and in segmenting and localizing the structures of the retina should not be overlooked. Similarly, the high resolution of each UWFI allows for novel generative uses with GANs. Finally, authors have demonstrated the novel utility of UWF imaging for estimating a patient's age and for estimating vascular health via baPWV.

Benefits and risks of deep learning with UWFI
As the studies discussed above show, DL used with UWF imaging is diagnostically accurate and in many cases exceeds the accuracy of trained ophthalmologists. The detection of ophthalmic diseases is a clear use for DL in UWF imaging: with automated DL systems for detection and diagnosis, vision-threatening pathology is more likely to be detected early. Because visual acuity (VA) often does not return as many ophthalmic disorders progress, early and highly sensitive detection of these disorders benefits patient care. In multiple studies discussed above, DL systems achieved sensitivities greater than those of trained ophthalmologists but, in some cases, specificities lower than those of their human counterparts [46]. Thus, while DL models have achieved impressive accuracies, a reduced specificity compared to humans may lead to more false-positive diagnoses, and the risk of unnecessary medical or surgical intervention increases if DL model outputs are followed without question. Even in such cases, where DL models are more sensitive but less specific than ophthalmologists, clinical use of DL models remains beneficial: the increased sensitivity of the DL model paired with the equal or greater specificity of the ophthalmologist may produce a synergistic effect on accuracy. In a clinical environment, this pairing can reduce both false negatives, via the DL model's superior sensitivity, and false positives, via human specificity equal to or greater than that of the DL model.
As DL models continue to learn associations between disease classifications and UWFI features, poor explainability becomes a risk. Because DL models typically do not explain the associations they form, associations may arise between unexpected image features and disease classifications. For example, biases present in the training datasets may become codified in the DL models' associations; this reproduction of human biases has been seen in past AI implementations in healthcare [95]. It can be mitigated by using large datasets with multiple graders and reviewers to minimize individual human bias. Furthermore, to help ophthalmologists understand the image features driving a disease classification, Grad-CAM can be applied to produce heatmaps of the image regions and lesions most strongly associated with a disease type [96,97].
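The final aggregation step of Grad-CAM can be illustrated with plain arrays: each feature map of the last convolutional layer is weighted by the global average of its gradient with respect to the class score, the weighted maps are summed, and a ReLU keeps only positive evidence. The activations and gradients below are synthetic stand-ins for a real network's tensors.

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM heatmap from a final conv layer.

    activations: (K, H, W) feature maps A^k
    gradients:   (K, H, W) d(class score)/dA^k
    Returns an (H, W) heatmap normalized to [0, 1].
    """
    # alpha_k: global-average-pooled gradients (one weight per channel).
    alphas = gradients.mean(axis=(1, 2))
    # Weighted sum of feature maps, then ReLU to keep positive evidence.
    cam = np.maximum((alphas[:, None, None] * activations).sum(axis=0), 0.0)
    if cam.max() > 0:
        cam /= cam.max()
    return cam

rng = np.random.default_rng(1)
acts = rng.random((8, 7, 7))            # 8 synthetic feature maps
grads = rng.standard_normal((8, 7, 7))  # synthetic gradients
heatmap = grad_cam(acts, grads)
print(heatmap.shape)         # (7, 7)
print(heatmap.min() >= 0.0)  # True: the ReLU guarantees non-negativity
```

In practice the low-resolution heatmap is upsampled and overlaid on the UWFI, so the ophthalmologist can see which retinal regions and lesions drove the model's classification.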
It should also be noted that there may be instances where UWF imaging adds no benefit to DL-based classification or disease detection. As discussed previously, for pathologies whose diagnosis does not rely on the retinal periphery, the added view of the retina may not be beneficial. This is particularly relevant to the use of UWF imaging for glaucoma, where OCT may be a richer and more targeted data type. While a comparison of alternative modalities is outside the scope of this review, it remains worthy of consideration for any practitioner interested in applying DL models to the detection and classification of ophthalmic disease.

The future of ultra-widefield imaging and deep learning
In this review, several novel methods using DL and UWFIs have emerged. In particular, GANs have proven useful for translating UWFIs to other imaging modalities, such as FIs, with high fidelity [89]. Furthermore, the ability of DL to translate existing FIs to UWFIs offers the opportunity to convert the decades of FIs available to ophthalmologists into a novel imaging modality, which can further strengthen DL training in the future.
UWF imaging and DL also have the potential to improve understanding of the pathophysiology of ophthalmic and non-ophthalmic disorders. While not yet demonstrated by the papers included here, DL could be used as an exploratory method: with a sufficiently large and carefully classified dataset, Grad-CAM interpretability could highlight lesions on UWFIs that predispose to, or increase the likelihood of, a specific ophthalmic disorder. When combined with other data types and imaging modalities, as exemplified by Wisely et al. in their study on AD detection, DL becomes a viable exploratory option for furthering understanding of ophthalmic pathophysiology [94].
As demonstrated by the quality assessment studies discussed in this review, DL may also support the expanded use of tele-ophthalmology. If general practitioners or technicians can perform UWF imaging for patients in remote areas, quality assessment methods can determine whether the images are of sufficient quality for an ophthalmologist to review remotely or at a later date. Other possibilities include using DL with UWF imaging for population-based screening of retinal and ophthalmic diseases. DL's utility, therefore, lies not only in diagnosing and detecting diseases, but also in improving access to ophthalmic services for individuals in remote regions or with minimal access to ophthalmology expertise.

Future directions
In future studies, we hope to systematically explore the differences among the included studies and quantitatively compare the accuracies of the proposed DL models using meta-analysis methods. Due to the heterogeneity of data and aims across studies of UWF imaging and DL, such an analysis was not feasible at this time.