Our results showed that with the aid of AI-CAD, specificity, PPV, and accuracy significantly improved regardless of the radiologist's experience level. These results are consistent with previous studies that also found significantly improved specificity and PPV with the same AI-CAD program [6, 8, 13-15]. However, AUC did not significantly improve after AI-CAD was implemented, except in independent reading by inexperienced radiologists. Some earlier studies found significantly improved AUC when AI-CAD was used for breast US, particularly when AI-CAD assisted inexperienced radiologists [10, 13, 14], who initially showed significantly lower diagnostic performance than experienced radiologists without AI-CAD. However, the overall AUCs for both the inexperienced and experienced groups in our study were higher than in previous studies (0.868 and 0.922, respectively), which may have left less room for improvement. This difference from previous studies may also be due to the type and number of images selected for review: the previous studies used video clips for image analysis or pre-selected the CAD interpretation results [13, 14], whereas we used representative still images of breast masses, with the AI-CAD analysis performed individually by each radiologist.
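The diagnostic metrics discussed above can be made concrete with a minimal sketch. This is illustrative only (hypothetical reader data, not the study's analysis code): specificity, PPV, and accuracy are derived from a confusion matrix, and AUC is computed as the Mann-Whitney probability that a malignant case receives a higher suspicion score than a benign one.

```python
# Illustrative sketch only: hypothetical data, not the study's analysis code.

def confusion_metrics(truth, calls):
    """Specificity, PPV, and accuracy from binary labels (1 = malignant)."""
    tp = sum(t == 1 and c == 1 for t, c in zip(truth, calls))
    tn = sum(t == 0 and c == 0 for t, c in zip(truth, calls))
    fp = sum(t == 0 and c == 1 for t, c in zip(truth, calls))
    return {"specificity": tn / (tn + fp),
            "ppv": tp / (tp + fp),
            "accuracy": (tp + tn) / len(truth)}

def auc(truth, scores):
    """Empirical AUC: probability that a random malignant case scores
    higher than a random benign case (ties count half)."""
    pos = [s for t, s in zip(truth, scores) if t == 1]
    neg = [s for t, s in zip(truth, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

truth  = [1, 1, 1, 0, 0, 0, 0, 0]                   # hypothetical ground truth
calls  = [1, 1, 0, 1, 0, 0, 0, 0]                   # a reader's binary calls
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.5]   # suspicion scores
print(confusion_metrics(truth, calls))  # specificity 0.80, PPV ~0.67, accuracy 0.75
print(round(auc(truth, scores), 3))     # 0.867
```

Because AUC depends on the ranking of scores rather than a single decision threshold, it can remain flat even when threshold-dependent metrics such as specificity and PPV improve, which is consistent with the pattern observed above.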
Currently, no guidelines designate when AI-CAD should be implemented in US interpretation. Thus, we compared two different workflows in this study: sequential and independent reading. For sequential reading, radiologists first assessed lesions without the assistance of AI-CAD and then were allowed to modify their assessments after the AI-CAD results were made available. In contrast, for independent reading, the AI-CAD results were available to the radiologists from the beginning, and the radiologists assessed the breast masses on US with these results in mind. Our results showed that specificity, PPV, and accuracy were higher in independent reading than in sequential reading, regardless of the radiologist's experience level. In addition, the AUC of the inexperienced radiologists significantly increased in independent reading (0.862 to 0.891, P=0.027). A previous study that compared the two workflows in breast US using a different AI-CAD platform found results similar to ours, in that AI-CAD provided greater benefit in independent reading for both experienced and inexperienced radiologists; contrary to our results, however, that study showed significantly improved AUC [7]. These findings indicate that the time point at which AI-CAD results for breast US are made available can affect the diagnostic performance of radiologists and should be considered in the real-world application of AI-CAD.
In addition to diagnostic performance, significantly higher rates of change in BI-RADS categories were seen in independent reading than in sequential reading, particularly for radiologists in the inexperienced group. Changes in final assessment from BI-RADS category 2 or 3 to category 4a or higher, or vice versa, are important because they can alter the critical decision of whether or not to perform biopsy. These conversion rates were also significantly higher in independent reading than in sequential reading for both experienced and inexperienced radiologists, suggesting that the workflow in which AI-CAD is implemented can also influence the clinical management of patients, as was seen in a previous study [10].
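The conversion rate described above, i.e., the fraction of lesions whose final assessment crosses the biopsy threshold between the two readings, can be sketched as follows. The category set and data are hypothetical, with BI-RADS 4a taken as the lower bound for a biopsy recommendation as in the text.

```python
# Hypothetical sketch: counts assessments that cross the biopsy threshold.
# BI-RADS 2/3 -> no biopsy; 4a and above -> biopsy recommended.
BIOPSY = {"2": False, "3": False, "4a": True, "4b": True, "4c": True, "5": True}

def conversion_rate(before, after):
    """Fraction of lesions whose management recommendation flips
    (biopsy vs. no biopsy) between two readings of the same lesions."""
    crossed = sum(BIOPSY[b] != BIOPSY[a] for b, a in zip(before, after))
    return crossed / len(before)

# Two hypothetical readings of four lesions: two cross the threshold.
print(conversion_rate(["3", "4a", "2", "4b"], ["4a", "4a", "2", "3"]))  # 0.5
```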
Prior studies have reported considerable variability among radiologists in the evaluation of the US BI-RADS lexicon and final assessments [16]. In our study, 6 radiologists with variable experience in breast imaging showed fair to substantial agreement for descriptors and final assessments, within the ranges reported by previous studies [16]. Overall agreement for all BI-RADS descriptors and final assessments improved with AI-CAD. Moreover, independent reading with AI-CAD showed higher interobserver agreement for shape, margin, orientation, posterior features, and final assessments. However, when radiologists were subgrouped according to experience level, agreement for most BI-RADS descriptors did not significantly increase, and agreement on final assessments even decreased slightly among experienced radiologists. Agreement in our study was generally lower than in previous studies in which AI-CAD improved interobserver agreement for final assessments [8, 10, 13], possibly due to the inclusion of many radiologists from different training backgrounds and institutions.
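As a hedged illustration of how pairwise interobserver agreement of the kind reported above is commonly quantified, the sketch below computes Cohen's kappa for two raters; a multi-reader study such as this one may instead use a multi-rater statistic (e.g., Fleiss' kappa) or averaged pairwise values, and the descriptor data here are hypothetical.

```python
def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters scoring the same lesions:
    observed agreement corrected for chance agreement."""
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n                   # observed
    cats = set(r1) | set(r2)
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)  # by chance
    return (po - pe) / (1 - pe)

# Hypothetical "shape" descriptor calls from two readers on four lesions
r1 = ["oval", "round", "oval", "irregular"]
r2 = ["oval", "oval", "oval", "irregular"]
print(round(cohens_kappa(r1, r2), 3))  # 0.556 (moderate agreement)
```

On the conventional interpretation scale, kappa values of 0.21-0.40 are fair, 0.41-0.60 moderate, and 0.61-0.80 substantial, which corresponds to the "fair to substantial" range described above.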
This study has several limitations. The most notable is its retrospective data collection from a single institution. However, to reflect real-world practice, we selected breast images from a consecutive population according to the benign-to-malignant ratio and the proportion of BI-RADS final assessments reported for real-time US in preceding research using AI-CAD [10]. Second, pre-selected static images of breast masses were used for analysis. Analysis of video clips, which include a series of images covering the entire breast lesion, may result in higher interobserver variability arising from the selection of the representative image, and this may affect the diagnostic performance and interobserver agreement in a multi-reader study. Third, we used the same set of images for sequential and independent reading. Although there was a 4-week washout period between the two reading sessions, some images may have remained in the radiologists' memory, which might have affected their assessments.
In conclusion, using AI-CAD to interpret breast US improves the specificity, PPV, and accuracy of radiologists regardless of experience level. Greater improvement in diagnostic performance and interobserver agreement may be achieved when AI-CAD is implemented in an independent reading workflow, especially for inexperienced radiologists.