Detecting abnormal fundus images by employing deep transfer learning

Background: To develop and validate a deep transfer learning (DTL) algorithm for detecting abnormalities in fundus images from nonmydriatic fundus photography examinations. Methods: A total of 1,295 fundus images collected from January 2017 to December 2018 at Yijishan Hospital of Wannan Medical College were used to develop and validate the DTL algorithm for detecting abnormal fundus images. The DTL model was developed using 929 fundus images (normal 370, abnormal 559); the abnormal images included maculopathy, optic neuropathy, vascular lesions, choroidal lesions, vitreous disease, and cataracts. We tested our model on a subset of the publicly available Messidor dataset (366 images) and evaluated the testing performance of the DTL model for detecting abnormal fundus images. Results: In the internal validation dataset (n=273 images), the AUC, sensitivity, accuracy, and specificity of the DTL for correctly classifying fundus images were 0.997, 97.41%, 97.07%, and 96.82%, respectively. For the test dataset (n=273 images), the AUC, sensitivity, accuracy, and specificity were 0.926, 88.17%, 87.18%, and 86.67%, respectively. Conclusion: In this evaluation, the DTL showed high sensitivity and specificity for detecting abnormal fundus images across a range of fundus diseases. Further research is necessary to improve this method and to evaluate the applicability of the DTL in community health care centers.

Key words: Fundus images; Deep transfer learning; Development and validation; Artificial intelligence.

Background
Retinal disease is one of the main causes of blindness worldwide, and the most common types of retinal conditions are dysfunctional retinal pigment epithelium and degenerating photoreceptors. Aging, diabetes, trauma, retinal vessel occlusion, hypertensive retinopathy, retinitis, and family history can all result in retinal disease. With the growth of the aging population and the rising prevalence of high myopia and diabetes, visual disabilities will continue to increase [1]. At present, the diagnosis of retinal diseases relies mainly on manual examination of the retinal vessels, optic disc, fovea, and lesions by eye experts. As the prevalence of vision disabilities increases [2], early detection and effective treatment are the keys to avoiding vision loss. Community health care centers, with their concentrated populations, comprehensive monitoring, and capacity for analyzing and evaluating individual or group health, are well placed to provide large-scale screening and early diagnosis. However, one of the main barriers to implementing widespread screening is the shortage of medical resources, particularly in low- and middle-income countries [3]. Given these concerns, developing a safe and effective screening program for early intervention to prevent currently incurable blinding conditions is essential.
Retinal fundus images have become one of the main references for screening and diagnosing retinal diseases. Recently, several research teams have investigated artificial intelligence-assisted systems based on fundus photographs to screen for retinal diseases. However, many of these studies have been devoted to identifying diabetic retinopathy (DR) [4] and glaucoma [5,6], and studies aiming to classify normal versus abnormal images across multiple categories of retinal disease have been very limited.
Artificial intelligence (AI) using machine learning algorithms, such as support vector machines (SVMs), naive Bayes classifiers, and convolutional neural networks (CNNs), has received extensive attention after demonstrating that it could perform at least as well as humans in image classification tasks [4,7]. As digital imaging modalities rapidly develop, image processing, computer vision, and machine learning are being used to automatically detect retinal lesions based on color fundus photographs. This is of great significance for the implementation of computationally assisted retinal disease detection and the promotion of large-scale screenings [8]. Deep transfer learning is a machine learning method that leverages existing knowledge to solve different but related domain problems [9]. Consistent with past studies, transfer learning is a highly effective technique, especially in domains where only limited data are available [10]. Compared with traditional image recognition methods, the essential advantage of DTL is that it does not rely on manual labeling of a large quantity of training data and therefore does not require much cost or time for data collection. The purpose of this study is to develop and validate an effective transfer learning algorithm for detecting abnormal fundus photographs, enabling accurate and timely referral from a small multicategorical retinal disease image database. Additionally, this work offers new insights for screening programs on how to efficiently build a detection model with a few labeled fundus photographs and some related graph data.

Image dataset characteristics
A total of 1,295 fundus images were selected from the Yijishan Hospital of Wannan Medical College from January 2017 to December 2018 in this retrospective study. These images included normal and abnormal fundus photographs, the latter including maculopathy, optic neuropathy, vascular lesions, choroidal lesions, vitreous disease, cataracts, and low-quality photographs. An image was labeled as poor quality and removed from the training and validation datasets in any of the following situations: blurred areas accounted for 50% or more of the image; only one or neither of the macula lutea and the optic disc was visible; or the vessels in the macular region could not be distinguished. After removing 366 poor-quality images, the deep transfer learning (DTL) model was developed using the remaining 929 retinal fundus images (normal 370, abnormal 559) from January 2017 to December 2018. Figure 1 shows the workflow of this study. The images were extracted from the ophthalmic clinics, inpatient wards, and physical examination centers of our hospital. Three datasets were applied for DTL training (normal 254, abnormal 402), internal validation (normal 116, abnormal 157), and testing (normal 155, abnormal 251). The training dataset was used to adjust common parameters (weights, biases, etc.) of the network, and the test dataset was applied to evaluate the performance of the DTL after training with important metrics such as accuracy, specificity, and sensitivity. Images were captured with common conventional desktop retinal cameras and the Topcon and NIDEK digital retinography systems. In this study, three licensed ophthalmologists were invited for image labeling. Normal images were labeled as 0, and abnormal images were labeled as 1. Fundus images were classified between November and December 2018. The images were randomly assigned to the ophthalmologists, each ophthalmologist classified between 100 and 300 fundus photographs, and each image was classified more than three times.
Images that obtained two or more consistent labels were transferred into a subgroup and made available for the study. In this process, the labeling outcomes were blinded. A senior ophthalmologist resolved controversial image labels. A total of 656 fundus images were randomly selected from the 929 images as the training dataset, and the remaining images were used as the internal validation dataset. To improve the accuracy of image recognition with only a small training dataset, several data preprocessing steps were implemented for normalization and standardization. To evaluate model performance, an independent subset of the Messidor database was used as the test dataset: 366 fundus images (normal 155, abnormal 251) were randomly selected from the Messidor dataset. To provide a standardized image format for the subsequent deep learning and final automated testing, all images were anonymized and saved in JPG format, and the black borders were cropped, since convolutional neural networks are sensitive to color when extracting features.
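The consensus-labeling and random-split procedure above can be sketched as follows. This is a minimal illustration, not the authors' code; the function names (`consensus_label`, `split_dataset`) and the fixed seed are hypothetical.

```python
import random
from collections import Counter

def consensus_label(labels):
    """Return the majority label (0 = normal, 1 = abnormal) when two or
    more graders agree; return None for controversial images, which in
    the study were referred to a senior ophthalmologist."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= 2 else None

def split_dataset(images, n_train=656, seed=42):
    """Randomly split the 929 labeled images into a training set of 656
    and an internal validation set of 273, as described above."""
    shuffled = list(images)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]
```

With 929 input images this yields the 656/273 training/validation split reported in the text.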

Data processing
Data preprocessing can detect trends, minimize noise, underline important relationships, and flatten the variable distribution in a time series [11]. In this study, several data preprocessing steps were performed to normalize the images against variation: removing meaningless photographs in which important retinal information was lost due to shooting angle, lighting, or media opacities; cropping the black edges while preserving the crucial regions; adjusting brightness to balance the color of the images; reducing noise; and enhancing contrast. All images in the dataset had a resolution of 3,352 × 3,364 pixels.
To improve the accuracy of image recognition with a small database and avoid overfitting, data augmentation was applied to the preprocessed data to expand the range of training samples while keeping the prognostic features of each image. Because convolutional neural networks are highly invariant to transformations such as rotation and mirroring, these operations are well suited to color fundus photographs [12]. Figure 2 shows the process of training dataset augmentation in Python. The probability parameter is the fraction of input images to which a given operation is applied. Data augmentation expanded the original small training dataset to 7,000 images, comprising 3,500 normal and 3,500 abnormal fundus images. In this study, the Inception-ResNet-v2 architecture was applied to achieve transfer learning. It can help to overcome the difficulty of obtaining large manually labeled datasets and reduce computational costs: our model demands relatively low computational performance while maintaining effective classification results. To achieve the transfer, we remove the dense layer and the softmax layer of the pretrained network.
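The augmentation step above (random mirroring and rotation applied with a given probability, expanding the training set to a target size) can be sketched like this. It is a minimal NumPy illustration, not the study's Figure 2 code; the parameter names and probabilities are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, p_flip=0.5, p_rotate=0.5):
    """Randomly mirror and rotate a fundus image array (H x W x C).
    Each p_* is the probability that the operation is applied,
    mirroring the probability parameter described above."""
    out = img
    if rng.random() < p_flip:
        out = np.fliplr(out)                        # horizontal mirror
    if rng.random() < p_rotate:
        out = np.rot90(out, k=int(rng.integers(1, 4)))  # 90/180/270 degrees
    return out

def expand_dataset(images, target_size):
    """Repeatedly augment images, cycling through the originals, until
    the dataset reaches target_size (the study expanded the training
    set to 7,000 images)."""
    out = list(images)
    while len(out) < target_size:
        out.append(augment(images[len(out) % len(images)]))
    return out
```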
We need to eliminate these last two layers because the dimensions of the dense layer and softmax layer must equal the number of classes in our task. Then, we add adaptation layers to construct the new architecture. On this basis, the source model pretrained on a large-scale dataset is transferred to the small target dataset: the model weights and image features of all but the last two layers are extracted as the input of the new dense layer and softmax layer to perform our specific task. We then fine-tune the convolutional layers by unfreezing and updating the pretrained weights to classify medical images. In the target task, a modified softmax layer outputs two categories (Fig. 3). An exponentially decaying learning rate [13] asymptotically reduces the learning rate to stabilize the model in the later stage of training. The Adam optimizer is an adaptive learning rate optimization algorithm specifically designed for training deep neural networks. In this study, the transferred Inception-ResNet-v2 uses an Adam optimizer with an exponentially decaying learning rate.
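The adaptation described above (drop the pretrained head, add new adaptation layers and a 2-way softmax, and train with Adam under an exponential learning-rate decay) can be sketched in Keras. This is a minimal sketch under stated assumptions, not the authors' implementation: the adaptation-layer width (256), the decay schedule constants, and `weights=None` (used here only to avoid downloading the ImageNet weights; in practice `weights="imagenet"` would be used for the transfer) are all illustrative.

```python
import tensorflow as tf

def build_dtl_model(num_classes=2, lr0=1e-3):
    # Pretrained Inception-ResNet-v2 without its dense/softmax head.
    base = tf.keras.applications.InceptionResNetV2(
        include_top=False, weights=None, input_shape=(299, 299, 3))
    base.trainable = False  # freeze first; unfreeze later to fine-tune

    # Adaptation layers: pooling, a new dense layer, and a 2-way softmax.
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(base.input, out)

    # Adam optimizer with an exponentially decaying learning rate.
    schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        lr0, decay_steps=1000, decay_rate=0.96)
    model.compile(optimizer=tf.keras.optimizers.Adam(schedule),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

Fine-tuning would then set `base.trainable = True` (optionally only for the top blocks) and continue training at the decayed learning rate.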

Results
The manual classification of retinal fundus images was completed in November and December 2018, and DTL training and validation were completed in January 2019. Figure 4 shows the training process performance of the model. The training accuracy increased rapidly and reached a plateau after approximately 30,000 training steps. As training continued, a learning rate lower than the initial setting became more favorable; therefore, the exponentially decaying learning rate proved beneficial.
The internal validation performance of the model is presented in Fig. 5. On the internal validation dataset (normal 116, abnormal 157), the AUC, sensitivity, accuracy, and specificity of the DTL for correctly classifying fundus images were 0.997, 97.41%, 97.07%, and 96.82%, respectively. A total of 273 images were randomly selected from the test dataset to validate the performance of the DTL. On this test dataset, the AUC, sensitivity, accuracy, and specificity of the DTL were 0.926, 88.17%, 87.18%, and 86.67%, respectively (Fig. 6). Table 1 summarizes these results.
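The sensitivity, specificity, and accuracy figures above follow from the standard confusion-matrix definitions. A minimal sketch (with `binary_metrics` as a hypothetical helper, labeling abnormal as 1 and normal as 0):

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy from binary ground-truth
    labels and hard predictions (1 = abnormal, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)   # abnormal images correctly flagged
    specificity = tn / (tn + fp)   # normal images correctly passed
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, accuracy
```

The AUC additionally requires the model's continuous softmax scores rather than hard labels (e.g., via `sklearn.metrics.roc_auc_score`).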

Discussion
In this study, the DTL model achieved robust performance in abnormal fundus image detection, and the AUC, sensitivity, accuracy, and specificity of the DTL were 0.926, 88.17%, 87.18%, and 86.67%, respectively, in an independent subset of the test dataset.
AI-based automated detection of retinal diseases using deep learning and transfer learning systems has been reported in several studies. The initial focus was on deep learning technology. Ting et al. [14] validated their deep learning system (DLS) using 494,661 retinal images, demonstrating that the DLS had high sensitivity and specificity for identifying diabetic retinopathy and related eye diseases: for the detection of any DR, the AUC was 0.94-0.96; for possible glaucoma, 0.942; and for AMD, 0.931. Similarly, Li et al. [15] described the development and validation of an artificial intelligence-based, web-based deep learning algorithm, trained on 71,043 retinal images, for the detection of referable diabetic retinopathy. Testing against an independent multiethnic dataset achieved an AUC, sensitivity, and specificity of 0.955, 92.5%, and 98.5%, respectively. Stevenson et al. [16] reported a proof-of-concept AI system trained on 4,435 images; the classifiers for AMD and vascular occlusion both achieved accuracies of 99.1%, sensitivities over 99%, and specificities of 88.9%. In contrast, our independent testing performance (AUC 0.926, sensitivity 88.17%, accuracy 87.18%, specificity 86.67%) was relatively low. This may be attributed to the outputs of our model being divided into only a normal group and an abnormal group, the latter including a multitude of disease states; thus, some rare lesions and microlesions failed to be detected by the DTL. Previous studies have demonstrated that AI will become a tool to quickly and reliably detect and diagnose eye diseases based on medical imaging. AI-based deep learning can be used with high sensitivity and accuracy in the detection and identification of fundus diseases, and the application of AI in ophthalmology may increase accessibility and achieve high efficiency in large-scale eye disease screening programs.
Although these studies have shown outstanding results, some limitations should be considered. First, most of the studies required a large manually labeled dataset for training and validation, which demands considerable time, manpower, and material resources, and diagnostic criteria can vary by region. Second, more thorough analysis of false-negative cases is needed to identify the relevant features. By comparison, our study is, to our knowledge, the first to develop a DTL to detect abnormal fundus images using a small dataset.
Deep transfer learning classification has been used for many years in disease screening research. Santin et al. [17] performed transfer learning to characterize abnormal cartilage by using the pretrained VGG16 neural network and adapting its final layers to a binary classification problem. The AUC, sensitivity, and specificity of their study were 0.72, 83%, and 64%, respectively.
In an independent sample of 189 new thyroid images, the AUC was 0.70. Like our study, theirs deployed a small dataset, but the performance of the Inception-ResNet-v2 architecture was significantly better than that of VGG16. Similarly, Heisler et al. [18] demonstrated three different transfer learning methods to identify the cones in a small set of AO-OCT images using a base network trained on AO-SLO images, all of which obtained results similar to those of a manual rater.
Using the results from the fine-tuning (Layer 5) method, they calculated four different cone mosaic parameters that were similar to the results found in AO-SLO images, showing the utility of their approach.
In this study, the reasons for false-negative cases in the testing datasets were analyzed. Highly myopic fundus images accounted for more than half of all false-negative cases. This may be attributed to our experts labeling mildly myopic fundus images as normal; consequently, the model confused mildly myopic fundus images with pathologic myopia images. Likewise, the false-positive cases included mildly myopic fundus images. Other causes of false negatives included peripheral retinal microlesions, vascular microlesions, optic neuritis, and congenital optic neuropathy.
This study presented an automated screening model trained with a relatively small number of fundus images. It attains clinically acceptable performance in abnormal fundus image detection and will benefit medical institutions with no retinopathy screening program or a lack of experienced ophthalmologists. Additionally, the study shows that our proposed model detects abnormal fundus images with high accuracy and reproducibility even though it was trained on a limited dataset. The DTL will permit users to utilize relation-labeled graph data to construct a detection model for the target image data. The transfer learning algorithm therefore shows a promising prospect for screening retinal disease in community health care centers, and the techniques described in this study have great potential to apply to image classification in other medical fields.
DTL is surprisingly effective in image classification. However, our study in its current state has several limitations. First, because our experts labeled mildly myopic fundus images as normal in the training set, the DTL trained on this set learned a skewed prior probability for eye disease detection, which may cause a high false-negative rate. Second, our dataset is not large and includes only patients from a local clinical setting. At present, the algorithm cannot operate independently or match professional evaluation, but it can triage abnormal fundus images with obvious diagnoses so that ophthalmologists can focus on more difficult cases.

Conclusions
In conclusion, the current project demonstrated that deep transfer learning has a promising future in the diagnosis of various diseases with high accuracy and robustness based on multidomain data. In future work, we will be dedicated to adding more auxiliary domain information to our model and to exploring a screening algorithm that classifies retinal pathologic lesions and provides treatment recommendations. Further steps include improving this method and validating and evaluating its applicability in community health care centers.

Figure 1 Illustration of the proposed procedure in this study.
Figure 4 The accuracy and the learning rate of the training process.
Figure 5 Receiver operating characteristic (ROC) curves of deep transfer learning in the internal validation dataset.
Figure 6 Receiver operating characteristic (ROC) curves of deep transfer learning in the testing dataset.

Figure 7
Examples of fundus images showing predictions of the DTL: a, b, c, d, e, f, and g: abnormal fundus images predicted as abnormal (true positives); h, i, and j: abnormal fundus images predicted as normal (false negatives).
