3.1 Image collection
We collected 411 fundus images and excluded 90 because of poor image quality. The resulting dataset contained 200 RP fundus images from 107 patients and 121 non-RP fundus images from 94 patients. These 321 images were separated into a training set (159 RP and 96 non-RP images) and a test set (41 RP and 25 non-RP images). The patient demographic characteristics are shown in Table 1.
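A stratified random split can approximately reproduce the train/test partition described above; the following is a minimal sketch in Python (scikit-learn), assuming the image file paths and binary labels have already been gathered. The placeholder file names, split ratio, and random seed are illustrative assumptions, not the study's actual code.

```python
# Sketch of a stratified train/test split (illustrative, not the study's code).
from sklearn.model_selection import train_test_split

image_paths = [f"img_{i:03d}.png" for i in range(321)]  # placeholder file names
labels = [1] * 200 + [0] * 121                          # 1 = RP (200), 0 = non-RP (121)

train_paths, test_paths, y_train, y_test = train_test_split(
    image_paths, labels,
    test_size=66 / 321,   # reproduces the 255-image training / 66-image test split
    stratify=labels,      # keep the RP : non-RP ratio similar in both sets
    random_state=0,
)
```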
3.2 5-fold cross-validation
When the three CNNs were compared by 5-fold cross-validation, Inception V3 attained the best accuracy (Table 2): 0.749 ± 0.110 for VGG16, 0.883 ± 0.120 for ResNet50, and 0.897 ± 0.078 for Inception V3.
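A minimal sketch of this comparison is shown below, assuming the preprocessed fundus images X and binary RP labels y are available as NumPy arrays; the classification head, optimizer, input size, and epoch count are assumptions for illustration rather than the study's actual training configuration.

```python
# Sketch of 5-fold cross-validation over three ImageNet-pretrained backbones.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold

BACKBONES = {
    "VGG16": tf.keras.applications.VGG16,
    "ResNet50": tf.keras.applications.ResNet50,
    "InceptionV3": tf.keras.applications.InceptionV3,
}

def build_model(backbone_cls, input_shape=(299, 299, 3)):
    """ImageNet-pretrained backbone with a binary (RP / non-RP) head."""
    base = backbone_cls(include_top=False, weights="imagenet",
                        input_shape=input_shape, pooling="avg")
    out = tf.keras.layers.Dense(1, activation="sigmoid")(base.output)
    model = tf.keras.Model(base.input, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

def cross_validate(X, y, n_splits=5, epochs=10):
    """Return mean and SD of validation accuracy for each backbone."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    results = {}
    for name, backbone_cls in BACKBONES.items():
        accs = []
        for train_idx, val_idx in skf.split(X, y):
            model = build_model(backbone_cls)
            model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
            _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
            accs.append(acc)
        results[name] = (np.mean(accs), np.std(accs))
    return results
```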
After training, we evaluated the models on the test dataset, where Inception V3 again yielded the best performance (Table 2). The model made incorrect predictions for only 2 non-RP images and classified the remaining 64 test images (41 RP and 23 non-RP) correctly (Table 3).
Table 2 Performance metrics of the deep learning models.
Cross-validation accuracy is reported as mean ± standard deviation (SD). Cross-validation divides the dataset into multiple smaller groups and uses each group in turn as a validation set to assess the model's generalizability and reliability. AUROC (area under the receiver operating characteristic curve), reported as a percentage, was obtained on the test dataset and measures how well the model distinguishes between the two classes, with higher values indicating better performance.
| Model | Cross-Validation Accuracy (± SD) | Test Dataset AUROC, % |
|---|---|---|
| VGG16 | 0.749 ± 0.110 | 95.21 |
| ResNet50 | 0.883 ± 0.120 | 97.85 |
| Inception V3 | 0.897 ± 0.078 | 99.32 |
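For reference, a test-set AUROC like the values in Table 2 can be computed from the predicted RP probabilities; the sketch below assumes that `model` is the classifier retrained on the full training set and that `X_test` and `y_test` hold the held-out test images and labels.

```python
# Sketch of computing the test-set AUROC from predicted probabilities.
from sklearn.metrics import roc_auc_score

y_prob = model.predict(X_test).ravel()   # predicted probability of RP per image
auroc = roc_auc_score(y_test, y_prob)    # threshold-independent ranking metric
print(f"Test AUROC: {auroc * 100:.2f}%")
```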
Table 3 Confusion matrix of the Inception V3 model on the test dataset.
The table shows how the Inception V3 model classified the test images into the two categories, RP and non-RP.
| | Predicted: RP | Predicted: Non-RP |
|---|---|---|
| True: RP | 41 | 0 |
| True: Non-RP | 2 | 23 |
Figure 2 Receiver operating characteristic (ROC) curve of the Inception V3 model with the average of the ophthalmologists and the medical student plotted for comparison.
ROC curve of the Inception V3 model, with the average performance of the ophthalmologists and the performance of the medical student shown as orange and green triangles, respectively. The horizontal axis shows the false positive rate (FPR) and the vertical axis shows the true positive rate (TPR).
Figure 2 shows the performance of the Inception V3 model together with the accuracy of the ophthalmologists and the medical student. Inception V3 achieved the highest AUROC of the three models listed in Table 2.
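A plot like Figure 2 can be produced by overlaying the raters' operating points on the model's ROC curve; in the sketch below, the rater coordinates are taken from Table 4 (FPR = 1 - specificity, TPR = recall), while `y_test` and `y_prob` are assumed to come from the evaluation step sketched earlier and the plotting details are illustrative.

```python
# Sketch of an ROC plot with rater operating points overlaid (cf. Figure 2).
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label="Inception V3")
plt.scatter([1 - 1.00], [0.9512], marker="^", color="orange",
            label="Ophthalmologists (mean)")   # specificity and recall from Table 4
plt.scatter([1 - 0.96], [0.7317], marker="^", color="green",
            label="Medical student")           # specificity and recall from Table 4
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.legend()
plt.show()
```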
3.3 Visualization
To confirm the validity of the Inception V3-based model and to verify that it did not rely on features unrelated to RP, we performed Grad-CAM analysis to visualize which pixels the model regarded as important during prediction.
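A minimal Grad-CAM sketch for such a classifier is shown below (TensorFlow/Keras, following the standard GradientTape recipe); the choice of "mixed10" as the target convolutional layer and the single-sigmoid output are assumptions about the architecture, and `image` is assumed to be a preprocessed fundus image matching the model's input size.

```python
# Sketch of Grad-CAM for an Inception V3-based binary classifier.
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name="mixed10"):
    """Return a [0, 1] heatmap of regions that push the RP score upward."""
    grad_model = tf.keras.Model(
        model.input,
        [model.get_layer(conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, pred = grad_model(image[np.newaxis, ...])  # add batch axis
        score = pred[:, 0]                        # sigmoid probability of RP
    grads = tape.gradient(score, conv_out)        # d(score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))  # global-average-pool the gradients
    cam = tf.reduce_sum(weights[:, tf.newaxis, tf.newaxis, :] * conv_out, axis=-1)
    cam = tf.nn.relu(cam)[0]                      # keep positive evidence only
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```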
Figure 3 Grad-CAM analysis of misdiagnosed samples. Representative original fundus color images and the corresponding Grad-CAM heatmaps are shown for each diagnostic category. The Grad-CAM heatmaps show the regions that contributed most to the model's decision: red and yellow indicate a high contribution, blue indicates a low contribution.
Our analysis revealed that the heatmaps predominantly focused on the peripheral regions. However, in the false-positive cases the heatmaps were centered around the macular region, indicating a potential misfocus. The model correctly identified all RP images. Figure 4 shows two RP images that were misdiagnosed as non-RP by an ophthalmologist but were correctly predicted to be RP by the model. As these examples show, the model was able to correctly diagnose RP images that an ophthalmologist would have missed during screening.
Figure 4 RP images that only the AI model diagnosed correctly.
Representative original fundus color images and the corresponding Grad-CAM heatmaps are shown.
3.4 Performance comparison with ophthalmologists and a medical student
The model and the ophthalmologists demonstrated comparable performance in terms of accuracy, recall, specificity, and precision, and the model outperformed the medical student in RP detection. The results are shown in Table 4.
Table 4 Performance comparison of our AI model, the ophthalmologists, and the medical student.
The table displays accuracy, recall, specificity, and precision for four ophthalmologists (labeled 1-4), their mean values, the medical student, and the machine learning model.
| | Accuracy, % | Recall, % | Specificity, % | Precision, % |
|---|---|---|---|---|
| Ophthalmologist 1 | 95.45 | 92.68 | 100 | 100 |
| Ophthalmologist 2 | 98.48 | 97.56 | 100 | 100 |
| Ophthalmologist 3 | 96.97 | 95.12 | 100 | 100 |
| Ophthalmologist 4 | 96.97 | 95.12 | 100 | 100 |
| Ophthalmologists Mean | 96.97 | 95.12 | 100 | 100 |
| Medical student | 81.82 | 73.17 | 96 | 96.77 |
| Model | 96.97 | 95.35 | 100 | 100 |
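For completeness, the metrics in Table 4 follow the standard confusion-matrix definitions; the helper below is a minimal sketch, with RP treated as the positive class (TP = RP called RP, TN = non-RP called non-RP, FP = non-RP called RP, FN = RP called non-RP).

```python
# Sketch of the standard metric definitions used in Table 4.
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, recall, specificity, and precision with RP as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),         # sensitivity for RP
        "specificity": tn / (tn + fp),    # true-negative rate for non-RP
        "precision": tp / (tp + fp),      # positive predictive value
    }
```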