Study Datasets
This study used the EyePACS dataset for CL-based pretraining and for training the referable vs non-referable DR classifier. EyePACS is a public-domain fundus dataset containing 88,692 images from 44,346 individuals (both eyes, OD and OS) and is designed to be balanced across race and sex. After removing fundus images that met the exclusion criteria, the final training dataset contained 70,969 fundus images, of which 57,722 were non-referable DR and 13,247 were referable DR. An independent testing dataset from the UIC retina clinic was used for the target task of DR classification. This dataset contains 2,500 images from 1,250 patients (both eyes, OD and OS). Among the 1,250 subjects (mean [SD] age, 53.37 [11.03] years), 818 (65.44%) were male and 432 (34.56%) were female. Detailed demographic information for the UIC subjects is provided in Table 1. There was no statistically significant difference in the distribution of age, sex, or hypertension between the non-referable and referable DR groups (ANOVA, P = 0.32, 0.18, and 0.59, respectively).
Table 1
Demographics characteristics of non-referable and referable DR subjects from the testing dataset at UIC.
Characteristic | Non-referable DR | Referable DR |
Number of subjects | 750 | 500 |
Sex (male/female) | 458/292 | 360/140 |
Age, years (mean ± SD) | 50.36 ± 10.64 | 56.37 ± 11.84 |
Age range, years | 28–73 | 32–78 |
Duration of disease, years (mean ± SD) | 14.23 ± 10.22 | 19.32 ± 12.94 |
Diabetes type | Type II | Type II |
Insulin dependent (Y/N) | 117/633 | 389/111 |
HbA1C, % | 6.6 ± 4.1 | 8.1 ± 3.2 |
HTN prevalence, % | 69 | 81 |
DR: diabetic retinopathy; SD: standard deviation; HbA1C: glycated hemoglobin; HTN: hypertension.
Framework for contrastive learning-based pretraining
Our FundusNet framework consists of two primary steps. First, we perform self-supervised pretraining on unlabeled fundus images from the training dataset using contrastive learning to learn visual representations. Once this model has been trained, its weights are transferred to a secondary classifier model for supervised fine-tuning on labeled fundus images. Figure 1 summarizes the framework.
To learn visual representations effectively, we adopt and modify the SimCLR framework [21], a recently proposed self-supervised approach based on contrastive learning. In this method, the model learns representations by maximizing the agreement between two differently augmented versions of the same data using a contrastive loss (more details on the contrastive loss are provided in the Materials and Methods section). This contrastive learning framework (Fig. 1a) teaches the model to distinguish between similar and dissimilar images. Given a random sample of fundus images, the FundusNet framework takes each image x and augments it twice, creating two versions of the input image, xi and xj. The two images are encoded by a ResNet50 network (Fig. 1b), generating two encoded representations hi and hj. These representations are then transformed by a non-linear multi-layer perceptron (MLP) projection head, yielding two final representations, zi and zj, which are used to calculate the contrastive loss. Based on the loss over the augmented pairs generated from each batch of input images, the encoder and projection head improve over time, and the learned representations place similar images closer together in the representation space. The CL framework contains a ResNet50 encoder (convolutional and pooling layers with skip connections) and a projection head (dense and ReLU layers) that maps the representation. The batch size of the CL pretraining pipeline has been shown to have a significant effect on pretraining [21–23] and, therefore, on the performance of the target model. To test this, we trained our FundusNet model with batch sizes from 32 to 4096 (doubling the batch size at each step; Table 4). The model was trained for 100 epochs or until the loss saturated.
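For concreteness, the sketch below shows one way the pairwise contrastive (NT-Xent) loss over a batch of augmented views can be computed in PyTorch; the temperature value is an illustrative assumption rather than the exact FundusNet setting.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, temperature=0.5):
    """Normalized temperature-scaled cross-entropy (NT-Xent) loss.

    z_i, z_j: [N, D] projection-head outputs for the two augmented views
    of a batch of N images. Each view's positive is its counterpart view;
    the remaining 2(N - 1) views in the batch serve as negatives.
    """
    n = z_i.shape[0]
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)   # [2N, D], unit-norm rows
    sim = z @ z.T / temperature                            # cosine similarity matrix
    sim.fill_diagonal_(float("-inf"))                      # exclude self-similarity
    # Row i's positive sits at index i + N (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```

With a ResNet50 encoder, hi and hj are 2048-dimensional; an MLP projection head (e.g., 2048 → 512 → 128, an assumed configuration) produces the zi, zj passed to this loss.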
Improving representation learning through neural style transfer (NST)
A key finding from CL-based self-supervised pretraining is that the choice of augmentations and transformations is critical to learning good representations. In adopting and modifying the SimCLR framework, which was originally designed for classification of natural images, we found that standard image augmentation techniques such as flipping and rotation did not generate good representations (zi and zj in Fig. 1a) from fundus images. A study by Geirhos et al. [20] demonstrated that CNNs used in computer vision tasks are often biased toward texture, whereas humans rely primarily on global shape features to distinguish classes. Increasing shape bias by randomizing the texture environment can therefore improve the accuracy and generalizability of a CNN model. NST manipulates the low-level texture of an image (its style) while preserving its semantic content, and it has previously been shown to improve robustness to domain shift in CNNs for computer vision tasks [24, 25]. In our study, we integrated an NST-based augmentation technique into the CL pipeline, based on convolutional style transfer from non-medical style sources (i.e., art, paintings, etc.). The NST replaces the style of a fundus image (primarily texture, color, and contrast) with that of a randomly selected non-medical image while preserving the semantic content (global objects and shapes such as microaneurysms and vasculature) required for disease detection. The NST convolution methodology was adopted from AdaIN style transfer [19, 20]. The style source was artistic paintings from Kaggle's 'Painter by Numbers' dataset (79,433 paintings), downloaded from https://www.kaggle.com/c/painter-by-numbers. In the CL pretraining, the NST-based augmentation was combined with standard augmentation techniques such as rotation, flipping, color distortion, crop-and-resize, and Gaussian blur, with a 70% probability of applying NST defined in the pretraining protocol. To quantify the performance improvement in detecting referable DR from integrating NST into our pipeline, we also trained two baseline CL frameworks: one with only the original SimCLR augmentations [21], and a second, state-of-the-art lesion-based CL model that used lesion patches instead of whole fundus images (with the original SimCLR augmentations).
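An augmentation policy of this kind could look roughly like the sketch below; apply_adain_style is a hypothetical placeholder for the AdaIN style-transfer step, and the crop size, rotation range, and jitter strengths are assumptions, so this only illustrates how the 70% NST probability can be combined with the standard transforms.

```python
import random
import torchvision.transforms as T

def apply_adain_style(image):
    # Placeholder for AdaIN style transfer using a painting drawn at random
    # from the 'Painter by Numbers' dataset; returns the image unchanged here.
    return image

# Standard SimCLR-style augmentations (parameter values are illustrative).
standard_augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomRotation(degrees=30),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.GaussianBlur(kernel_size=23),
])

def contrastive_view(image, p_style=0.7):
    """Generate one augmented view: NST with 70% probability, then standard transforms."""
    if random.random() < p_style:
        image = apply_adain_style(image)
    return standard_augment(image)

# Each training image x yields two views: x_i, x_j = contrastive_view(x), contrastive_view(x)
```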
Referable vs non-referable DR classification training schemes
Using the weights of the pretrained network as the initialization, we trained an end-to-end supervised model for the downstream DR classification task (referable vs non-referable DR). We trained a ResNet50 encoder network with standard cross-entropy loss, a batch size of 256, the ADAM optimizer, and random augmentations (Gaussian blur, resizing, rotation, flipping, and color distortion). To benchmark the FundusNet results, we also trained two separate fully supervised baseline models (ResNet50 and InceptionV3 encoder networks, both initialized with ImageNet weights). Both baseline models are based on DL models in the literature that have achieved state-of-the-art diagnostic accuracy in detecting referable DR [16, 26]. The same hyperparameter search (learning rate: logarithmic grid search between 10−6 and 10−2; optimizer: ADAM or SGD; batch size: 32, 64, 128, or 256) and training protocols were maintained for FundusNet and both baseline networks.
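A minimal fine-tuning sketch under the settings stated above (ResNet50 encoder, cross-entropy loss, ADAM, batch size 256) is given below; the checkpoint path, learning rate, and weight-loading details are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Initialize the encoder from the contrastive-learning checkpoint
# (file name and state-dict layout are hypothetical).
model = resnet50(weights=None)
state = torch.load("fundusnet_cl_encoder.pth", map_location="cpu")
model.load_state_dict(state, strict=False)        # checkpoint holds encoder weights only
model.fc = nn.Linear(model.fc.in_features, 2)     # referable vs non-referable DR head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr chosen by grid search; value assumed here

def train_one_epoch(loader, device="cuda"):
    """One pass over a DataLoader (batch size 256) of augmented, labeled fundus images."""
    model.to(device).train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```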
To further investigate whether the CL-pretrained model performs well with less training data (and ground truth), we gradually reduced the training dataset from 100% to 10% (in steps of 10%) and conducted the downstream classification training for both the CL-pretrained model and the ImageNet-pretrained baseline models. After identifying the best hyperparameters and fine-tuning the models for each experiment, we chose the model with the best performance on the validation dataset (5-fold cross-validation). The final models were tested on the independent testing dataset from UIC. In terms of encoder networks, we compared three families of architectures in our experiments (VGG, ResNet, and Inception).
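The label-fraction experiment might be organized as in the sketch below; train_fn and eval_fn are placeholders for the fine-tuning and AUC-evaluation routines, and the stratified subsampling is one reasonable (assumed) way to draw each label fraction.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

FRACTIONS = (1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)

def label_fraction_experiment(X, y, train_fn, eval_fn):
    """Fine-tune on decreasing label fractions with 5-fold cross-validation."""
    results = {}
    for frac in FRACTIONS:
        if frac < 1.0:
            # Stratified subsample keeps the referable / non-referable ratio.
            X_sub, _, y_sub, _ = train_test_split(
                X, y, train_size=frac, stratify=y, random_state=0)
        else:
            X_sub, y_sub = X, y
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
        aucs = [eval_fn(train_fn(X_sub[tr], y_sub[tr]), X_sub[va], y_sub[va])
                for tr, va in cv.split(X_sub, y_sub)]
        results[frac] = float(np.mean(aucs))
    return results
```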
FundusNet performance on real-life clinical test data
The FundusNet model pretrained with style transfer augmentation achieved an average AUC of 0.91 on the independent test dataset from UIC, outperforming the state-of-the-art supervised baseline models (ResNet50 and InceptionV3) trained with ImageNet weights [16] (AUCs of 0.80 and 0.83, respectively) and the CL baseline models trained with the SimCLR and lesion-based frameworks (AUCs of 0.83 and 0.86, respectively) (Tables 2 and 3). The significant performance difference on the test set relative to the baseline models indicates that FundusNet generalized better through our pretraining framework because of the enhanced representation learning enabled by NST. The NST augmentation, designed specifically to promote learning of global geometric features, enabled more discriminative visual representations of retinal pathologies, improving overall classification performance compared with CL models pretrained with generic augmentation techniques.
Table 2
Classification performance of FundusNet framework for referable vs non-referable DR compared to fully supervised baseline models.
Models | AUC (95% CI) | P-value (vs FundusNet) | Sensitivity (95% CI) | P-value (vs FundusNet) | Specificity (95% CI) | P-value (vs FundusNet) |
FundusNet | 0.91 (0.898–0.930) | Ref | 0.90 (0.895–0.917) | Ref | 0.85 (0.830–0.862) | Ref |
Baseline 1 (ResNet50) | 0.80 (0.783–0.820) | P < 0.001 | 0.81 (0.793–0.834) | P < 0.001 | 0.74 (0.731–0.758) | P < 0.005 |
Baseline 2 (InceptionV3) | 0.83 (0.801–0.853) | P < 0.001 | 0.84 (0.822–0.848) | P < 0.001 | 0.79 (0.786–0.819) | P < 0.05 |
DR: diabetic retinopathy; CI: confidence interval; AUC: area under the ROC curve; Ref: reference. P values from DeLong's test for comparing pairwise AUCs.
Table 3
Evaluation of model performance on the test set by augmentation and contrastive learning technique, compared with the state of the art.
Method | AUC (95% CI), 100% training data | AUC (95% CI), 10% training data |
SimCLR [21] | 0.83 (0.80–0.85) | 0.71 (0.66–0.73) |
Lesion-based CL (confidence threshold 0.7) [18] | 0.86 (0.81–0.87) | 0.75 (0.71–0.79) |
Lesion-based CL (confidence threshold 0.8) [18] | 0.87 (0.84–0.89) | 0.75 (0.74–0.78) |
FundusNet (our final proposed model) | 0.91 (0.898–0.930) | 0.81 (0.77–0.84) |
DR: diabetic retinopathy; CI: confidence interval; AUC: area under the ROC curve; CL: contrastive learning. The confidence threshold refers to the threshold set in [18] to filter out low-confidence predictions of lesion patches from fundus images.
To investigate the label efficiency of the FundusNet model, we trained it on different fractions of the labeled training data and tested each resulting model on the test dataset, comparing against the performance of the baseline models. Fine-tuning experiments were conducted over five folds of the training data, and the results were averaged. Figure 2 shows how performance varies with label fraction for both FundusNet and the baseline supervised models on the testing dataset. The CL-pretrained FundusNet model retains its AUC performance even when the labels are reduced to 10%, whereas the performance of the baseline models degrades much more. When the amount of training data is reduced from 100% to 10%, the AUC of FundusNet drops from 0.91 to 0.81 on the UIC test data, whereas the drop is larger for the baseline models (0.80 to 0.58 for ResNet50 and 0.81 to 0.63 for InceptionV3). Importantly, FundusNet matches the performance of the baseline models using only 10% of the labeled data when tested on the independent UIC test data (FundusNet AUC 0.81 when trained with 10% of the labeled data vs 0.80 and 0.81, respectively, for the baseline models trained with 100% of the labeled data).
In the experiment to evaluate the optimal batch size for CL pretraining, we observed that the CL framework learned better image representations when a batch contained more negative examples (i.e., augmented pairs generated from the other images in the batch); therefore, larger batch sizes yielded better performance (Table 4; AUC of 0.77 for batch size 32 vs 0.91 for batch size 2048 on the test dataset). However, larger batch sizes also require more compute resources. Because the AUC did not improve significantly at batch size 4096, the optimal batch size was chosen as 2048. Among the encoder networks, ResNet50 provided the best classification performance on both the validation and test datasets, compared with the VGG and Inception architectures (Table 5).
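This batch-size dependence follows directly from the contrastive loss: a batch of N images produces 2N augmented views, and each view is contrasted against 2(N − 1) negatives, so larger batches supply far more negatives per anchor, as the small calculation below illustrates.

```python
# Negatives available per anchor in SimCLR-style contrastive pretraining.
for batch_size in (32, 256, 2048, 4096):
    negatives = 2 * (batch_size - 1)   # all other views in the batch except the single positive
    print(f"batch size {batch_size:>4} -> {negatives:>4} negatives per anchor")
```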
Table 4
Effect of batch size on FundusNet performance on detecting referable DR.
Batch size | AUC (SD) – test dataset | P value (vs 32 batch size) | P value (vs 2048 batch size) |
32 | 0.77 (0.039) | REF | < 0.001* |
64 | 0.78 (0.058) | 0.18 | < 0.001* |
128 | 0.78 (0.045) | 0.16 | < 0.001* |
256 | 0.80 (0.077) | 0.06 | < 0.001* |
512 | 0.86 (0.082) | 0.02* | < 0.001* |
1024 | 0.87 (0.034) | < 0.01* | < 0.01* |
2048 | 0.91 (0.014) | < 0.001* | REF |
4096 | 0.91 (0.011) | < 0.001* | 0.08 |
AUC: area under the ROC curve; SD: standard deviation; Ref: reference; '*' indicates a significant difference. P values from DeLong's test for pairwise AUC comparisons (each batch size vs batch size 32 in the third column and vs batch size 2048 in the fourth column). These comparisons indicate whether the AUC at each batch size differs significantly from the AUC at batch size 32 (the initial batch size) and at batch size 2048 (the optimal batch size).
Table 5
Classification performance using different encoder networks.
Encoder architecture | AUC (SD) on test dataset |
ResNet50 | 0.91 (0.01) |
ResNet152 | 0.90 (0.032) |
VGG16 | 0.82 (0.044) |
InceptionV2 | 0.89 (0.019) |
InceptionV3 | 0.90 (0.064) |
AUC: area under the ROC curve; SD: standard deviation.