Our results demonstrate robust overall performance of the CFDL model for predicting sex from retinal fundus photos. The AUROC of 0.93 from this framework, which does not require coding expertise, suggests significant capability of the CFDL platform for this task. Our code-free model’s performance is comparable with the Poplin et al. model AUROC of 0.97, which was designed and tuned by machine learning experts (Table 1). Our model was trained on a similar (UK Biobank), albeit significantly smaller dataset, as it did not include the additional 1.5 + million EyePACS fundus photos which Poplin et al. also utilized for training.
To our knowledge, two other studies have attempted to perform this image classification task. Yamashita et al. performed logistic regression on several features that were identified to be associated with sex. These features included papillomacular angle, retinal vessel angles and retinal artery trajectory25. They achieved an AUROC of 0.78, which further underscores the limitations of a classical machine learning approach, utilizing human-identified features for such novel tasks. Deep learning, even utilizing our CFDL approach, seems to outperform manual feature engineering significantly. Various studies have shown retinal morphology differences between the sexes, including retinal and choroidal thickness26–28. Others have demonstrated variation of ocular blood flow and have suggested the effect of sex hormones, but thus far, consensus is lacking29,30. The coder-engineered deep learning model developed by Dieck et al. for this task, which also included an image preprocessing step, demonstrated an accuracy of 82.9%, which was lower than our automated code-free approach31. The retinal features apparent to domain experts for this task may go unanswered, as the power of deep learning in integrating population-level patterns from billions of pixel-level variations is impossible for humans to match.
Performance of our model was slightly worse with external validation on the Moorfields dataset, which is typical when deep learning models are evaluated with datasets dissimilar from their training data distribution32,33. Specifically, the Moorfields dataset was obtained from a tertiary referral center, and 42.9% of the fundus photos contained foveal pathology. In eyes without foveal pathology, the external validation accuracy was within 1.5% of the Biobank validation set. The worse performance in pathologic eyes suggests the significance of the fovea for sex prediction, and was similarly demonstrated in the attention maps of Poplin et al. The region-based saliency maps we generated suggest that the optic nerve and vascular arcades are also regions of importance (Fig. 2). In the study by Poplin et al., when subgrouped for diabetic retinopathy (DR) presence, their model similarly trended towards worse performance for pathologic images compared with healthy controls. Furthermore, the ophthalmologists in that study “repeatedly reported highlighting of the macula for the gender predictions” when interpreting the attention maps22. These findings highlight the importance of considering machine learning performance only in context of the specific training and evaluation datasets utilized. This is especially critical for our task, when the salient features of an input image are unclear to domain experts.
Ungradable images from the UK Biobank validation dataset were labeled as such by retina specialists to the guidelines of lacking adequate visibility of the macula, optic nerve, and vascular arcades (Table S2). However, those images demonstrated only a slight reduction in model performance. Furthermore, the model shows similar salient regions in ungradable input images as in gradables (Fig. 2). This suggests that the model is sensitive to signal in poor quality images from subtle pixel-level luminance variations, which are likely indifferentiable to humans. This finding underscores the promising ability of deep neural networks to utilize salient features in medical imaging which may remain hidden to human experts.
Through characterization of high-dimensional data, our findings suggest that deep learning will be a useful tool for the exploration of novel disease and biomarker associations. Clinician driven research, particularly through the use of AutoML, has the potential to move this field forward. Crucially, AutoML as a platform does not fully automate the process of machine learning. Data preparation remains an essential manual step. As demonstrated by population differences in our external validation dataset, tasks such as equitable and representative acquisition, cleaning, and subgrouping of datasets remain important factors for the production of useful models. Clinicians are uniquely positioned to understand both the complexities of the clinical data, and the use-cases for the design of clinically relevant production algorithms.
While our deep learning model was specifically designed for the task of sex prediction, we emphasize that this task has no inherent clinical utility. Instead, we aimed to demonstrate that AutoML could classify these images independent of salient retinal features being known to domain experts, that is, retina specialists cannot readily perform this task. We intended to show that our framework’s performance may be comparable to state of the art algorithms designed for the same task by coders. This portends for the capacity of AutoML, utilized by clinician use-case experts, to design models for tasks where specific retinal features have not been categorized. Examples of such use-cases include cardiovascular and neurological disease characterization from retinal photos.
Limitations
Our study had several limitations. The design of the CFDL model was inherently opaque due to the framework’s automated nature with respect to model architecture and hyperparameters. While this opacity is not unique to CFDL, there is potential to further reduce ML explainability due to lack of insight of model architectures and parameters employed. Although we compared our performance to other models via AUROC, we were unable to compare performance using clinically relevant metrics such as sensitivity and specificity, as these were not provided by the authors of the other studies34. The UK Biobank dataset, composed of a generally healthy Caucasian population, was not fully representative of the general UK population, and demonstrates potential for algorithmic bias35–37. Although we attempted to address this with an external validation population with a higher prevalence of pathology, our patient level data was limited and did not include additional demographic information. Since both datasets were from UK populations and de-identified, there is the potential of overlap at the patient level.
Through our investigations of predicting other novel systemic signals from fundus photos, we noted several inherent limitations of the CFDL platform. Utilizing buckets of varying range, which were necessary due to lack of support for continuous variable prediction, we were unable to successfully predict age. Experiments to predict smoking status resulted in models with significantly lower AUC (0.64) as compared with Poplin et al. (0.71)22. We have engaged the platform development team, and aim to repeat our experiments as new platform features are released.