The study (registered ISRCTN28276860) was approved by the National Research Ethics Service in the UK (REC no: 19/LO/1385) and adhered to the Declaration of Helsinki. All participants provided written informed consent. Patients recruited retrospectively provided consent for their pseudo-anonymised loading-phase data to be used for the study, while patients recruited prospectively provided consent for the assessments to be performed as per the study protocol and for their pseudo-anonymised data to be used for the study.
Study design
Study participants
The cohort was composed of treatment-naïve adults aged 50–100 years diagnosed with active MNV across 10 clinics in the UK. Therapy consisted of three intravitreal aflibercept injections with a follow-up within 10 weeks of the last injection. Patients were recruited between 18 December 2019 and 4 August 2021. The retrospective cohort included patients who initiated aflibercept therapy before the start date of the study and the prospective cohort included patients who were diagnosed and initiated aflibercept after December 2019. Only those who had spectral domain OCT scans with Heidelberg Spectralis OCT (V6.15, Heidelberg Engineering GmbH, Heidelberg, Germany) at baseline and at follow-up visit post loading-phase were included. Other exclusion criteria were co-existent ocular conditions that, in the opinion of the local clinic’s investigator, could affect the visual acuity or macular fluid resolution, such as epi-retinal membrane and vitreomacular traction. There were no visual acuity restrictions (Figure 1).
As is often the case in real-world studies, there were some clinic- or patient-specific protocol variations. As such, we identified two main treatment protocols: eyes belonging to the standard protocol sub-cohort had a therapy course of 105–135 days, and about 8 weeks between third injection and final observation. Eyes belonging to the short protocol sub-cohort had a therapy course of 75–104 days, and about 4 weeks between last injection and final observation. There was no significant difference in injection interval between protocols (Figure 2).
Study assessments
All assessments and treatment procedures were performed as per standard of care. The OCT raster scan was centred at the fovea (range 20° × 25° to 30° × 20°) using 19, 25, 31 or 49 line scans. Each clinic employed one or more image acquisition protocols, which differed in resolution and the overall number of line scans as per the above. OCT scans were compulsory at baseline and final visit for study inclusion.
Data collection
Demographic data included age at presentation, sex, and ethnicity. Clinical data collected in a web-based database included visual acuity using standardized ETDRS (Early Treatment Diabetic Retinopathy Study) letter charts, central retinal subfield thickness (CST), total retinal volume (TRV) and dates of OCT and injections at all visits. The AI system used only features that were measured in the baseline exam, prior to therapy commencement.
Grading of OCT images
The OCT images were graded at Moorfields Eye Hospital, London, UK, by four medical retina fellows with at least 5 years’ experience in grading OCT images (ShC, ST, EK, SwC). All images were graded on hospital computers with 20-inch screens. The fellows were initially trained on a set of 50 OCT images and obtained a kappa28 of 0.94 for inter-grader agreement on macular fluid assessment, which is considered a high level of agreement. Each scan was corrected for any segmentation errors and foveal centring before grading. Firstly, the baseline scans were verified for the presence of MNV based on the Consensus on Neovascular Age-related Macular Degeneration Nomenclature (CONAN)29 classification. Cases with no evidence of MNV were excluded (Figure 1). Subsequently, the macula scans at final follow-up were reviewed for the presence or absence of any residual macular fluid (intra- or sub-retinal fluid).
Outcome definition
A good response to aflibercept loading-phase (denoted ‘0’) was defined as total absence of both intra- and sub-retinal macular fluid in the macular volume scan at the final follow-up. A suboptimal response was defined as presence of any macular fluid (denoted ‘1’) at the final follow-up.
AI system development
Data splitting
To develop an AI-based treatment response prediction system, we divided all OCT scans into training, validation and test sets. First, 20% of all patients were randomly selected for the test set; 20% of the remaining patients were then selected for validation.
Splits were generated at the patient level to avoid having the same patient in the train and test sets. We stratified by protocol (standard or short), binary treatment outcome and scan size. Due to the short protocol sub-cohort size, we subsequently merged all non-training samples of the short protocol to a single hold-out test set. All algorithms were thus only tuned to perform well on the validation set of the standard protocol, or the training set of the short protocol (see the transfer learning step in the Supplementary Material). Both sub-cohorts’ training sets were further split into five folds, again by randomly sampling patients, stratified by clinic and outcome.
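The patient-level split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name `patient_level_split` is hypothetical, and the stratification by protocol, outcome and scan size is omitted for brevity (only the patient-level grouping and the 20%/20% proportions are shown).

```python
import random
from collections import defaultdict

def patient_level_split(records, seed=0):
    """Split (patient_id, scan) records into train/val/test at the
    patient level, so no patient appears in more than one split.
    Stratification (protocol, outcome, scan size) is omitted here."""
    by_patient = defaultdict(list)
    for pid, scan in records:
        by_patient[pid].append(scan)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = round(0.2 * len(patients))            # 20% of all patients
    n_val = round(0.2 * (len(patients) - n_test))  # 20% of the remainder
    test = set(patients[:n_test])
    val = set(patients[n_test:n_test + n_val])
    split = {"train": [], "val": [], "test": []}
    for pid in patients:
        key = "test" if pid in test else "val" if pid in val else "train"
        split[key].extend(by_patient[pid])
    return split
```

Grouping by patient before sampling is what guarantees that two scans of the same eye can never end up on both sides of the train/test boundary.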
Pre-processing
All OCT scans were first subject to a standardised pre-processing algorithm consisting of three steps. First, the raw images were loaded scan-by-scan in their order in the volume and resized by linear interpolation to a unified height of 200 pixels and width of 300 pixels. Subsequently, we subsampled all individual OCT scans to yield a total of 20 scans per eye: for eyes with more than 20 scans, 20 scans were selected at regular intervals from the total, and for eyes with fewer than 20 scans, scans were evenly duplicated. Lastly, we normalised each resulting 20 × 200 × 300 data matrix by mapping its 1st and 99th intensity percentiles to −1 and 1, respectively. We did not normalise to a common reference due to the large variability in intensity between studies.
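The depth-unification and percentile-normalisation steps can be sketched as below (the 2D resize step is omitted). Function names are hypothetical; a `numpy.linspace` index conveniently covers both the subsampling case (more than 20 scans) and the even-duplication case (fewer than 20 scans).

```python
import numpy as np

def to_fixed_depth(volume, depth=20):
    """Pick `depth` B-scans at regular intervals along the stack;
    volumes with fewer scans get evenly duplicated scans instead."""
    n = volume.shape[0]
    idx = np.linspace(0, n - 1, depth).round().astype(int)
    return volume[idx]

def percentile_normalise(volume):
    """Map the volume's 1st/99th intensity percentiles to -1/+1,
    a per-volume affine rescaling (no common intensity reference)."""
    p1, p99 = np.percentile(volume, [1, 99])
    return 2.0 * (volume - p1) / (p99 - p1) - 1.0
```

Normalising per volume rather than against a shared reference sidesteps the scanner-to-scanner intensity variability noted in the text.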
Clinical features were normalised as follows. For numerical features (e.g. age), we extracted the mean and standard deviation (SD) from the training set only and used them to scale all values (including the validation and test sets) to a mean of 0 and SD of 1, by subtracting the training mean and dividing by the training SD. After normalising, missing values were imputed by the mean (i.e. 0). CST, the only strictly positive and strongly skewed variable, was first transformed by the natural logarithm before normalisation. Binary (sex) and multi-class variables (clinic) were one-hot encoded.
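The train-only scaling with mean imputation can be written in a few lines. This is an illustrative sketch (hypothetical helper names); note how imputing missing values *after* z-scoring makes the imputed value exactly the training mean.

```python
import numpy as np

def fit_scaler(train_values):
    """Mean/SD computed from the training set only (for CST, apply
    np.log to the values before calling this)."""
    v = np.asarray(train_values, dtype=float)
    return v.mean(), v.std()

def transform(values, mean, sd):
    """Z-score any split with the training statistics; NaNs (missing
    values) become 0, i.e. the training mean after scaling."""
    z = (np.asarray(values, dtype=float) - mean) / sd
    return np.where(np.isnan(z), 0.0, z)
```

Fitting the scaler on the training set alone prevents information from the validation and test sets leaking into the model inputs.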
Augmentation
Images were further pre-processed by implementing a two-dimensional (2D) wavelet decomposition via a Daubechies order 1 wavelet,30 to subsequently only use ‘edges’ as image input, again individually standardised to a mean of 0 and SD of 1. The size of the input matrix was not changed by the wavelet transformation. A simple on-off Boolean hyperparameter controlled whether models were trained purely on the original OCT scans or on their wavelet transformations.
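A Daubechies order 1 wavelet is the Haar wavelet, so one level of the 2D decomposition can be written directly in numpy, as sketched below. This is not the authors' implementation: it shows a single decomposition level, whose bands are half-size in each dimension (size-preserving variants such as the stationary transform, or upsampling the detail bands, would keep the original input shape as the text describes). The 'edge' content lives in the three detail bands.

```python
import numpy as np

def haar2d(img):
    """One-level 2D Haar (Daubechies order 1) decomposition.
    Returns the approximation (LL) and the three detail ('edge')
    bands: LH/HL/HH capture horizontal/vertical/diagonal edges."""
    a = (img[0::2] + img[1::2]) / 2.0   # low-pass over row pairs
    d = (img[0::2] - img[1::2]) / 2.0   # high-pass over row pairs
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh
```

A smooth region gives near-zero detail coefficients, while a retinal-layer boundary produces a strong response, which is why the detail bands serve as an 'edges only' input.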
We also introduced a simple foreground detection algorithm, aimed at finding both the upper and lower background of the relevant retinal layers by reviewing each OCT scan column by column and marking the start of tissue. Since a salient ‘edge’ typically separates the retinal layers from the upper background, we based upper-background detection on a composite criterion: seen from the top of the image, tissue was detected either at a significant ‘edge’ (based on the wavelet transformation) of at least two SDs above the wavelet mean, or where luminosity surpassed a baseline value, calculated per image column and regressed towards the luminosity of the whole image by 50%. For the lower background, where in general no distinct ‘edge’ is present, we used only the simple luminosity threshold, again calculated per column and regressed towards the whole-image luminosity by 50%.
OCT image backgrounds were then set to 0 for all further training steps. Both upper and lower luminosity threshold values were controlled (multiplied) by hyperparameters pertaining to a respective luminosity percentile value, to enable coarser to finer background detection.
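A simplified version of the luminosity part of this foreground detection is sketched below. The wavelet 'edge' criterion for the upper boundary and the percentile-based hyperparameters are omitted; only the column-wise threshold, its 50% regression towards the whole-image mean, and the zeroing of background pixels are shown. All names are illustrative.

```python
import numpy as np

def zero_background(scan, frac=0.5):
    """Per column, a luminosity threshold is regressed `frac` of the
    way towards the whole-image mean; everything outside the first
    and last supra-threshold rows is treated as background and set
    to 0, keeping the retinal tissue band in between."""
    out = np.zeros_like(scan)
    global_mean = scan.mean()
    for c in range(scan.shape[1]):
        col = scan[:, c]
        thr = (1 - frac) * col.mean() + frac * global_mean
        rows = np.nonzero(col > thr)[0]
        if rows.size:
            out[rows[0]:rows[-1] + 1, c] = col[rows[0]:rows[-1] + 1]
    return out
```

Regressing the per-column threshold towards the global mean stabilises the detection in dark or noisy columns where a purely local threshold would be unreliable.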
In addition to these image transformations, we applied image augmentations in every training epoch. These affine transformations were applied to each 2D scan, consistently for each OCT volume. The augmentations consisted of a rotation from −20° to +20°, translation of up to 10 pixels, flipping the image left to right, and scaling of the image, either zooming in, by 0 to 50%, or out, by 0 to 20%.
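The key point of the augmentation scheme is that one set of affine parameters is drawn per epoch and reused across all 20 scans of a volume. The sketch below illustrates that pattern with hypothetical helpers; only the flip and translation are actually applied (rotation and scaling would need an interpolating resampler, e.g. from scipy, and are left as recorded parameters).

```python
import numpy as np

def sample_augmentation(rng):
    """Draw one set of affine parameters, to be reused for every
    2D scan in a volume so the augmentation stays consistent."""
    return {
        "angle": rng.uniform(-20, 20),    # degrees (application omitted)
        "shift": int(rng.integers(-10, 11)),  # pixels, horizontal
        "flip": bool(rng.integers(2)),        # left-right flip
        "zoom": rng.uniform(0.8, 1.5),    # 20% out to 50% in
    }

def apply_simple(scan, params):
    """Apply only the flip and translation parts with numpy."""
    out = scan[:, ::-1] if params["flip"] else scan
    return np.roll(out, params["shift"], axis=1)
```

Applying identical parameters to every scan preserves the geometric consistency of the volume, which a per-scan random augmentation would destroy.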
Model pipeline
The fuse-med-ml31 (version 0.1.10, open source python) library was used throughout the image classification process. Given the binary labels, the ensuing classification task was attempted by three different model types, using the pre-processed OCT images or the clinical features as input.
First, we employed two different artificial neural network architectures: a) a three-dimensional (3D) class activation maps network,32 and b) a 2.5D Resnet50 network.33
The former was a pure 3D convolutional neural network (i.e., 3D kernels) with six convolutional layers.34 The latter consisted of a single 2D Resnet50 network35,36 that was applied individually to every 2D scan (i.e., 2D kernels), followed by averaging of the output features over all scans, which resulted in a single volume-level output (thus ‘2.5D’).
Both networks were adapted with an input layer to accommodate the size of the OCT volumes (20 scans in depth, with a single channel each), as well as a classification layer based on the cross-entropy loss.
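The '2.5D' scheme reduces to a simple functional pattern, sketched here framework-agnostically (the real model uses a Resnet50 backbone and a cross-entropy head; `extract_2d` and `classify` below are stand-ins for those components).

```python
import numpy as np

def predict_2p5d(volume, extract_2d, classify):
    """'2.5D' classification: a 2D backbone runs on every B-scan of
    the (20, H, W) volume, the per-scan feature vectors are averaged
    into a single volume-level feature, and one head classifies it."""
    feats = np.stack([extract_2d(scan) for scan in volume])  # (20, F)
    return classify(feats.mean(axis=0))                      # scalar
```

Averaging after the backbone makes the prediction invariant to the order of the B-scans and keeps the parameter count that of a single 2D network.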
Hyperparameters included CNN type, batch size, learning rate, dropout rate, weight decay, background detection level and wavelet transformation, as well as the training folds (1 out of 5 for cross-validation). Models were trained for up to 100 epochs using the Adam optimizer,37 after which we selected the best epoch on the cross-validation held-out set (1 out of 5). We ran hyperparameter tuning, randomly sampling parameters and choosing the best model based on its performance on the validation dataset. The final model was a 2.5D Resnet50 with input wavelet transformation, a batch size of 4, learning rate of 3.4 × 10⁻⁶, weight decay of 0.036 and dropout rate of 0.24.
Secondly, we trained a simple image classification model based on a 2D Fast Fourier Transformation (FFT) of every OCT scan. Given the sparsity of natural images in the Fourier domain, we first reduced the overall number of features by approximately 85%, retaining only the most dominant frequencies. The transformed one-dimensional vectors, representing the 2D frequency maps, were used as direct input to an l1-norm regularised logistic regression (Lasso38), with the volume labels applied to each scan. The strength of regularisation was tuned in 5-fold cross-validation. The final model in general reached a similar level of performance as the final CNN (see Results) but was subsequently dropped by the ensemble feature selection (see below).
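The FFT feature-reduction step can be sketched as below (the Lasso fit itself is omitted). This is an assumption-laden illustration: the paper does not specify how 'dominant' frequencies were chosen, so here they are ranked by magnitude; in practice the retained indices would be fixed on the training set and reused for all splits.

```python
import numpy as np

def fft_features(scan, keep=0.15):
    """2D FFT of one B-scan, flattened, reduced to the `keep`
    fraction (~15%, i.e. an ~85% reduction) of coefficients with the
    largest magnitude. Returns the retained indices and magnitudes."""
    spec = np.fft.fft2(scan).ravel()
    k = max(1, int(keep * spec.size))
    idx = np.argsort(np.abs(spec))[-k:]
    return idx, np.abs(spec[idx])
```

The resulting short magnitude vector is what would feed the l1-regularised logistic regression, which then prunes the frequencies further.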
Thirdly, we trained a logistic regression with l2-norm regularisation on all clinical features (i.e. age, sex, (log) CST, visual acuity and TRV), standardised as described above. The strength of regularisation was again tuned in 5-fold cross-validation, after which the model was applied to the validation and test sets.
All models were either trained on standard or short protocol data, never on both simultaneously. However, since the performance of applying the CNN image classifier of the standard protocol sub-cohort on the short protocol sub-cohort was poor, we used the final CNN trained on standard protocol sub-cohort data to retrain on the short protocol sub-cohort tuning set, in order to employ the benefits of transfer learning.39,40 Thus, we report three different kinds of model performances: models trained and tested on the standard protocol sub-cohort data, trained on standard data and tested on short protocol data (in general the least promising), and trained on short protocol data (CNN also pretrained on standard protocol sub-cohort data) and tested on short protocol data.
Finally, we trained an ensemble of all three model types to reach a single classifier. To this end, we used the outcome scores of all models [0-1] on the training data to also train an ensemble in the form of a logistic regression classifier with l2-norm regularisation, again tuned in 5-fold cross-validation. We ran a forward feature selection algorithm41 for the standard protocol data, selecting the CNN as well as the clinical feature classifier based on the prediction performance on the validation data. Adding the FFT-based classifier did not further improve the performance on validation. This procedure was not repeated for the short protocol dataset.
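The forward feature selection over candidate models can be expressed generically, as sketched below. The `score` callable stands in for validation performance (e.g. AUROC of the ensemble built from the selected models); the greedy loop adds whichever candidate improves it most and stops when none does, which is how a weaker model (here, the FFT classifier) gets dropped.

```python
def forward_select(candidates, score):
    """Greedy forward selection: repeatedly add the candidate model
    that most improves the validation score; stop when no remaining
    candidate helps. Returns the chosen subset and its score."""
    chosen = []
    best = score(chosen)
    improved = True
    while improved:
        improved = False
        for c in [c for c in candidates if c not in chosen]:
            s = score(chosen + [c])
            if s > best:            # keep the best candidate this pass
                best, add, improved = s, c, True
        if improved:
            chosen.append(add)
    return chosen, best
```

With scores mimicking the paper's outcome (CNN strong, clinical features a small gain, FFT slightly harmful), the procedure selects the CNN and clinical classifiers and excludes the FFT model.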
Statistical analysis
Univariate associations of clinical features and outcome were estimated using the Fisher exact test for binary features and the Mann–Whitney U-test for continuous features. We corrected for multiple hypotheses by employing the Benjamini–Hochberg false discovery rate (FDR) correction. Area under the receiver operating characteristic curve (AUROC), sensitivity and specificity are reported with a 95% confidence interval (CI). A p-value <0.05 was considered to indicate a statistically significant difference.
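The Benjamini–Hochberg correction used here is a short step-up procedure; a minimal numpy implementation of the adjusted p-values is sketched below (statistical packages such as statsmodels provide equivalent routines).

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR control):
    sort p-values, scale the i-th smallest by n/i, then enforce
    monotonicity from the largest p-value downwards and cap at 1."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    adj = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1.0)
    out = np.empty(n)
    out[order] = adj        # restore the original feature order
    return out
```

A feature is then called significant when its adjusted p-value falls below the chosen FDR level.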