Performance Comparison of Detection Network Models
Esophageal endoscopy images were assigned to one of two classes according to the presence or absence of lesions (normal or with lesions), and the results detected by the AI model were analyzed. Data with esophageal lesions were defined as true, and data without esophageal lesions were defined as false. A prediction was considered successful when the IoU between the predicted area and the correct-answer (ground-truth) area was 0.5 or higher.
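The IoU success criterion can be illustrated with a minimal sketch. It assumes axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates and a criterion of IoU at or above 0.5; the function names `iou` and `is_successful` are hypothetical and are not taken from the study's code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_successful(pred_box, gt_box, iou_threshold=0.5):
    """Assumed success rule: the prediction counts if IoU >= iou_threshold."""
    return iou(pred_box, gt_box) >= iou_threshold
```

For example, a predicted box that covers the lower 60% of a ground-truth box of the same width has an IoU of 0.6 and would count as a successful prediction under this rule.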
To assess precision, sensitivity, and FPPI as a function of the confidence threshold, the esophageal cancer detection models YOLOv5 and RetinaNet were compared on detection results with a threshold value of 0.1 or more, as shown in Table 2. On the WLI dataset, YOLOv5 achieved a precision of 93.7%, a sensitivity of 89.9%, and an FPPI of 0.06, whereas RetinaNet achieved a precision of 96.1%, a sensitivity of 88.4%, and an FPPI of 0.035. On the NBI dataset, YOLOv5 achieved a precision of 86.5%, a sensitivity of 84.0%, and an FPPI of 0.13, whereas RetinaNet achieved a precision of 98.4%, a sensitivity of 91.3%, and an FPPI of 0.014.
The WLI evaluation set comprised 402 data points: 201 normal data points without tumors and 201 data points with tumors. YOLOv5 identified 179 of the 201 data points with tumors as having tumors (TP) and 20 as not having tumors (FN), and it identified 12 of the 201 normal data points as having tumors (FP). RetinaNet identified 176 of the 201 data points with tumors as having tumors (TP) and 23 as not having tumors (FN), and it identified 7 of the 201 normal data points as having tumors (FP). Figure 3 shows examples of true detections, in which the tumor location predicted by the detection model can be compared with the actual tumor location.
The NBI evaluation set comprised 69 data points with tumors. YOLOv5 identified 58 of the 69 data points as having tumors (TP) and 11 as not having tumors (FN), and nine normal data points without tumors were identified as having tumors (FP). RetinaNet identified 63 of the 69 data points as having tumors (TP) and six as not having tumors (FN), and one normal data point without tumors was identified as having a tumor (FP).
Figure 4 shows FP and FN results from the internal data, comparing the tumor location predicted by the detection model with the actual tumor location. Almost all FP results arose because shadows in normal data were predicted as lesions. In addition, as shown in Fig. 4b, when the lesion was very small and far away, the nearby crystal area was predicted as a lesion, producing an FP. The main cause of FN results was esophageal inflammation of the mucous membrane, as shown in Fig. 4c. In addition, a lesion that occupied the entire area, as shown in Fig. 4d, could not be predicted.
Figure 5 shows the overall performance of the models as precision–recall curves for the internal data. In general, the closer the curve is to the upper-right corner, the better the performance of the model. Both detection models identified the positive class well while also accounting for the number of negative cases incorrectly classified as positive. For a detection model, recall, which corresponds to a low FN rate, is the more important metric.
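A precision–recall curve of the kind described here is typically traced by sweeping the confidence threshold. The following is a minimal sketch with made-up toy detections, not data from the study. The function name `pr_points` and the representation of each detection as a (confidence, matches_lesion) pair are assumptions for illustration.

```python
def pr_points(detections, n_lesions, thresholds):
    """Return (threshold, precision, recall) for each confidence threshold.

    detections: list of (confidence, matches_lesion) pairs, where
                matches_lesion is True if the box meets the IoU criterion.
    n_lesions:  total number of ground-truth lesions.
    """
    points = []
    for t in thresholds:
        kept = [is_tp for conf, is_tp in detections if conf >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        # Convention (an assumption): precision is 1.0 when nothing is kept.
        precision = tp / (tp + fp) if kept else 1.0
        recall = tp / n_lesions
        points.append((t, precision, recall))
    return points


# Toy data for illustration only.
toy = [(0.95, True), (0.90, True), (0.80, False),
       (0.60, True), (0.30, False), (0.15, True)]
curve = pr_points(toy, n_lesions=5, thresholds=[0.1, 0.5, 0.85])
```

On the toy data, raising the threshold from 0.1 to 0.85 raises precision from about 0.67 to 1.0 while recall falls from 0.8 to 0.4, which is the trade-off a precision–recall curve visualizes.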
Table 2
Performance evaluation metrics for detection models based on confidence thresholds from internal data
| Imaging modality | Model | Precision (95% CI) | Sensitivity (95% CI) | FPPI (95% CI) |
| White-light imaging | YOLOv5 | 0.937 (0.892–0.951) | 0.899 (0.857–0.924) | 0.06 (0.043–0.075) |
| White-light imaging | RetinaNet | 0.961 (0.88–0.984) | 0.884 (0.80–0.954) | 0.035 (0.016–0.043) |
| Narrowband imaging | YOLOv5 | 0.865 (0.824–0.913) | 0.840 (0.763–0.88) | 0.13 (0.07–0.267) |
| Narrowband imaging | RetinaNet | 0.984 (0.951–0.99) | 0.913 (0.842–0.944) | 0.014 (0.008–0.035) |
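The point estimates in Table 2 can be related to the reported TP, FP, and FN counts with the standard definitions of precision and sensitivity. The sketch below is illustrative only: the function name `detection_metrics` is hypothetical, and the FPPI denominators (201 for WLI, 69 for NBI) are assumptions inferred from the fact that they reproduce the tabulated values; the text does not state them explicitly.

```python
def detection_metrics(tp, fp, fn, n_images):
    """Precision, sensitivity (recall), and false positives per image."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    fppi = fp / n_images  # n_images is an assumed denominator, see lead-in
    return precision, sensitivity, fppi


# Internal WLI, YOLOv5: counts as reported in the text.
wli_yolo = detection_metrics(tp=179, fp=12, fn=20, n_images=201)

# Internal NBI, RetinaNet: counts as reported in the text.
nbi_retina = detection_metrics(tp=63, fp=1, fn=6, n_images=69)
```

With these inputs, the YOLOv5 WLI values come out close to 0.937, 0.899, and 0.06, and the RetinaNet NBI values close to 0.984, 0.913, and 0.014, in line with the table.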
For external validation, esophageal endoscopic images were again assigned to two classes according to the presence or absence of lesions (normal without lesions, and with lesions), and the results detected by the AI model were analyzed. To assess precision, sensitivity, and FPPI as a function of the confidence threshold, the esophageal cancer detection models YOLOv5 and RetinaNet were compared on detection results with a threshold value of 0.1 or more, as shown in Table 3. On the WLI dataset, YOLOv5 achieved a precision of 83.4%, a sensitivity of 79.4%, and an FPPI of 0.158, whereas RetinaNet achieved a precision of 88.3%, a sensitivity of 70.2%, and an FPPI of 0.092. On the NBI dataset, YOLOv5 achieved a precision of 85.6%, a sensitivity of 71.3%, and an FPPI of 0.119, whereas RetinaNet achieved a precision of 88.3%, a sensitivity of 81.1%, and an FPPI of 0.106. The WLI evaluation set comprised 488 data points with tumors. YOLOv5 identified 387 of the 488 tumors as data with tumors (TP) and 100 as data without tumors (FN), and 77 normal data points without tumors were identified as data with tumors (FP). RetinaNet identified 342 of the 488 tumors as data with tumors (TP) and 145 as data without tumors (FN), and 45 normal data points without tumors were identified as data with tumors (FP). Figure 6 shows examples of true detections, in which the tumor location predicted by the detection model can be compared with the actual tumor location.
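One plausible reading of how each data point is assigned to TP, FN, or FP, using the 0.1 confidence threshold and the IoU criterion of 0.5, can be sketched as follows. The function name `image_outcome`, the representation of each detection as a (confidence, IoU with ground truth) pair, and the choice to count a lesion image whose detections all miss the lesion as FN are assumptions; the source does not spell out these details.

```python
def image_outcome(detections, has_lesion, conf_threshold=0.1, iou_threshold=0.5):
    """Classify one image as "TP", "FN", "FP", or "TN".

    detections: list of (confidence, iou_with_ground_truth) pairs;
                the IoU value is ignored for normal images.
    has_lesion: True if the image contains an annotated lesion.
    """
    # Keep only detections at or above the confidence threshold.
    kept = [(conf, iou) for conf, iou in detections if conf >= conf_threshold]
    if has_lesion:
        # Assumed rule: at least one kept box must overlap the lesion enough.
        hit = any(iou >= iou_threshold for _, iou in kept)
        return "TP" if hit else "FN"
    # On a normal image, any kept detection is a false positive.
    return "FP" if kept else "TN"
```

Under this reading, a lesion image whose only detection has a confidence of 0.05 is an FN even if the box is well placed, because the detection falls below the 0.1 threshold.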
Table 3
Performance evaluation metrics for detection models based on confidence thresholds from external data
| Imaging modality | Model | Precision (95% CI) | Sensitivity (95% CI) | FPPI (95% CI) |
| White-light imaging | YOLOv5 | 0.834 (0.761–0.88) | 0.794 (0.756–0.85) | 0.158 (0.086–0.194) |
| White-light imaging | RetinaNet | 0.883 (0.82–0.956) | 0.702 (0.657–0.782) | 0.092 (0.057–0.18) |
| Narrowband imaging | YOLOv5 | 0.856 (0.817–0.893) | 0.713 (0.68–0.76) | 0.119 (0.007–0.142) |
| Narrowband imaging | RetinaNet | 0.883 (0.853–0.947) | 0.811 (0.766–0.89) | 0.106 (0.08–0.181) |
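The source does not state how the 95% confidence intervals in Table 3 were obtained. As an illustration only, a percentile bootstrap is one common way to attach an interval to a metric such as sensitivity. The sketch below applies it to the external WLI counts reported for YOLOv5 (387 TP, 100 FN); the function names, the number of resamples, and the seed are assumptions, and the resulting interval is not expected to match the tabulated one.

```python
import random


def sensitivity(outcomes):
    """Sensitivity from a list of per-image outcomes ("TP" or "FN")."""
    tp = outcomes.count("TP")
    fn = outcomes.count("FN")
    return tp / (tp + fn) if (tp + fn) else 0.0


def bootstrap_ci(outcomes, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(outcomes)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    stats = []
    for _ in range(n_resamples):
        # Resample the images with replacement and recompute the metric.
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        stats.append(metric(sample))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# External WLI, YOLOv5: TP and FN counts as reported in the text.
outcomes = ["TP"] * 387 + ["FN"] * 100
ci_low, ci_high = bootstrap_ci(outcomes, sensitivity)
```

Resampling whole images keeps each resample the same size as the original evaluation set, so the spread of the recomputed metric reflects sampling variability at that sample size.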
The NBI evaluation set comprised 288 data points with tumors. YOLOv5 identified 167 of the 288 tumors as data with tumors (TP) and 67 as data without tumors (FN), and 28 normal data points without tumors were identified as data with tumors (FP). RetinaNet identified 190 of the 288 tumors as data with tumors (TP) and 44 as data without tumors (FN), and 25 normal data points without tumors were identified as data with tumors (FP). Figure 7 shows FP and FN results from the external data, comparing the tumor location predicted by the detection model with the actual tumor location. Almost all FP results arose because shadows in normal data were predicted as lesions. In addition, as shown in Fig. 7b, the overall lesion was predicted as an FP. The main cause of FN results was that only a part of the lesion was present, as shown in Fig. 7c. The esophageal inflammation of the mucous membrane shown in Fig. 7d could not be predicted to be a lesion. Figure 8 shows the overall performance of the models as precision–recall curves for the external data. In general, the closer the curve is to the upper-right corner, the better the performance of the model. Both detection models identified the positive class well while also accounting for the number of negative cases incorrectly classified as positive. For a detection model, recall, which corresponds to a low FN rate, is the more important metric: high precision can come at the cost of low recall, in which case the model misses most of the tumor data.