Go beyond image-based benign-malignant classification: AI can identify responsible frames better than physicians in breast ultrasound screening videos

Breast cancer is the most common cancer in the world and the single leading cause of cancer mortality in women. Heavy workloads and a shortage of ultrasound specialists impede the penetration of breast cancer screening. To reduce the burden on sonographers and empower junior physicians, we propose a novel framework, FEBrNet, which integrates a deep learning architecture with the idea of entropy from information theory. FEBrNet auto-selects responsible frames from ultrasound screening videos using an Entropy Reduce method and classifies breast nodules using Artificial Intelligence (AI). A combination of 13702 images and 1066 videos from breast ultrasound exams is used to train and test the robustness of the proposed framework. Reader studies show that FEBrNet has diagnostic performance equivalent or even superior to that of ultrasound specialists, and that overall physician performance improves when using FEBrNet's recommended frames and corresponding predictions. Therefore, merging FEBrNet into the clinical ultrasound screening workflow might bring actual benefit by helping to address the scarcity of sonographers, thereby increasing the use of ultrasound screening in cancer prevention.


Introduction
Breast cancer is the most common cancer in the world 1 and the single leading cause of cancer mortality in women. 2 According to Global Cancer Statistics 2020, the incidence of breast cancer in women has exceeded that of lung cancer, with more than 2.3 million new cases, accounting for 30% of female cancer patients and 11.7% of all cancer patients. 1 Effective, early screening is of great importance for improving the 5-year survival rate. Studies have shown that it can reduce the mortality rate by 38-48%, as well as local and distant recurrence rates. 3 At present, screening and follow-up mainly rely on two imaging modalities: mammography and ultrasound. [4][5][6] Mammography has been regularly adopted in breast screening programs in many countries. 7,8 With a higher sensitivity in dense breasts, ultrasound is also widely used worldwide. 9 Meanwhile, for low-income nations or regions, ultrasound can be a practical option for large-scale screening because it is affordable, portable, and radiation-free. 10 In addition, ultrasound can potentially detect malignant foci obscured by thick glands, as a supplement to mammography. 11 However, the current uptake of breast cancer screening with ultrasonography is insufficient. One primary reason is the shortage of experienced ultrasound specialists. 12 In the United States, 40 million breast screening examinations take place, half of them in women with dense tissue, requiring more than 6 million work hours per year. 13 In other regions with a larger female population with dense breasts, the number of required ultrasound examinations might be a multiple of that in the United States. This burden exceeds what physicians can sustain, resulting in serious physical health concerns. 14 Along with the physician shortage, the geographical distribution of experienced physicians is also unequal.
Access to fast, reliable ultrasound testing in underdeveloped areas stands in stark contrast to that in developed regions, primarily because developed regions offer physicians better education, training and opportunities that underdeveloped areas are simply not able to match. 15,16 Means of reducing the burden on sonographers and empowering junior physicians have long been sought. Automated Breast Ultrasound (ABUS) is a technology that automates image acquisition in the hope of boosting repeatability and decreasing operator reliance. 17 It has shown high consistency with handheld ultrasound (HHUS) performed by competent doctors. [18][19][20][21] However, the large size and cost of the equipment hinder its use.
Artificial intelligence (AI) algorithms are capable of extracting a large number of quantitative features from breast ultrasound screening, thereby improving clinical detection accuracy. Shen et al. showed that, with the help of AI, false positive rates could be decreased by 37.4% and the number of requested biopsies reduced by 27.8%, while maintaining the same level of sensitivity. 22 Zhang et al. established an artificial intelligence system able not only to identify cancer but also to predict its molecular subtypes. 23 Dong et al. demonstrated that an AI system can provide explainable metrics such as diagnosis-based regions in addition to increasing accuracy, sensitivity, and specificity. 24 Further, AI can also be used across a variety of ultrasound hardware configurations, including traditional handheld ultrasound, ABUS, and portable handheld ultrasound, to address a variety of clinical needs in a range of scenarios. [25][26][27][28]
Most preceding works rely on fixed frames (static images) previously selected by sonographers, and the model's prediction is made using these frames rather than the original video that records the entire screening process. In other words, AI operates on data already filtered by sonographers, which can lead to a series of problems. To begin with, video may contain information invisible to human eyes, which could be vital for AI to unleash its capabilities. Second, physician-selected frames might not fully represent the screening video, either because junior physicians lack the necessary knowledge or because busy senior physicians lack the time for scrutiny. Whatever the cause, this could lead to uncertainty in the subsequent diagnosis and obvious bias in AI performance.
Therefore, if AI is able to work on the screening video and recommend responsible frames that it deems necessary for diagnosis, it will possibly streamline the process for sonographers, particularly less experienced physicians, and potentially improve diagnostic performance.
To make fuller use of both the insights contained in breast screening data and AI as a powerful feature-extracting tool, we propose a novel framework, FEBrNet, which integrates a deep learning architecture with the entropy method from information theory. 29 In FEBrNet, all frames in the breast screening video are processed with AI models in parallel, followed by an Entropy Reduce method that auto-selects key frames with mutually distinctive features and significant contributions to breast lesion diagnosis. A combination of 13702 images and 1066 videos from breast ultrasound exams is used to train and test the robustness of the proposed framework. Our multi-center, multi-reader experiment shows that overall physician performance improves when aided by FEBrNet's recommended frames and corresponding predictions.

Results
Reader studies experiment design
We conducted four reader studies on the same video test set to compare the performance of the AI system and physicians, as well as to assess the benefits of using AI to aid physicians.
Complete AI diagnosis (Complete-AI): FEBrNet with a DenseNet or MobileNet backbone diagnoses the videos, and its performance is evaluated as the number of responsible frames employed is varied.
Complete physician diagnosis (Complete-Phy): Six physicians independently read the original videos and make diagnoses.
Physicians select frames, then AI diagnoses (Phy-AI): AI diagnoses based on the responsible frames chosen from the video test set by two senior physicians.
AI selects frames, followed by physician diagnosis (AI-Phy): FEBrNet offers physicians the top three responsible frames and its prediction for each video. Physicians make diagnoses based on this information, and their diagnostic performance is evaluated.
Both 'Complete-Phy' and 'AI-Phy' use the same six physicians. Physicians are classified into three groups (junior, medium-level, and senior, each with two physicians) based on their experience. We conduct 'Complete-Phy' first, followed by 'AI-Phy' one month later, long enough for physicians to forget their previous diagnoses.
FEBrNet could achieve comparable performance with a limited subset of frames from original video
In this part, we examine the performance of FEBrNet and the effect of the number of frames chosen on prediction performance. Both the MobileNet and DenseNet121 backbone-based FEBrNet perform well in binary classification. MobileNet reaches a highest accuracy of 84.25%, and DenseNet121 a highest accuracy of 84.93%. When the number of responsible frames is no less than 3, MobileNet's AUPR and AUROC values fluctuate around 0.875, whereas DenseNet121 performs slightly better, with AUPR and AUROC around 0.885. Figure 1 illustrates how precision, recall, and F1-Score vary for MobileNet and DenseNet121 as the number of responsible frames increases. In Figure 1(a), when fewer than 3 frames are used, MobileNet performs poorly, but its performance picks up as the number of responsible frames increases. MobileNet reaches its peak performance when only 3 frames are used and plateaus when more frames are added. As seen in Figure 1(b), DenseNet121 performs well in most situations; even when just the top 2 responsible frames are used for prediction, it achieves 80.26% recall, 80.14% accuracy, and a 0.808 F1-Score. When more than 15 frames are employed, the effect of adding further frames is minimal, with no increase in F1-Score.
FEBrNet achieves more balanced precision and recall when allowed to directly analyze original videos
We also examine the performance of FEBrNet on the responsible frames chosen by senior physicians from each video (Table 1). MobileNet obtains 82.31% accuracy and DenseNet 81.63%. One observation is the imbalance between precision and recall, which suggests a tendency to reject more true positive cases in exchange for greater precision, resulting in a poor F1-Score (0.776 and 0.808 for MobileNet and DenseNet respectively). This disparity is more noticeable in FEBrNet with the MobileNet backbone, with 97.83% precision and just 64.29% recall. In contrast, when the AI processes the original video files, FEBrNet with the MobileNet backbone achieves 84.25% accuracy with a 0.855 F1-Score, and FEBrNet with the DenseNet121 backbone achieves 84.93% accuracy with a 0.864 F1-Score. FEBrNet thus achieves a better F1-Score when it is allowed to select frames on its own and make the diagnosis. Overall, in our experiment, using AI to directly analyze original videos and predict malignancy is more stable than relying on previously picked frames.
FEBrNet outperforms physicians and can improve physicians' accuracy of diagnosis
... MobileNet backbone, suggesting that even a light convolutional neural network has the potential to achieve diagnostic capacity equivalent to that of senior physicians.
With the assistance of FEBrNet, the performance of all physicians improved. Senior-1, senior-2, and medium level-1 were able to beat FEBrNet with the MobileNet backbone, while senior-1 and senior-2 also reached the performance of FEBrNet with the DenseNet backbone.
As indicated in Table 2, when FEBrNet was used, the accuracy and F1-Score of all physicians improved. Senior-1, senior-2, and junior-2 improved their precision and recall at the same time, while medium level-1 and junior-1 benefited more from enhanced recall. The improvements in accuracy and F1-Score are greater for physicians with lower baseline performance.

Discussion
The conflict between limited medical resources and a growing population of patients with breast cancer has long been a dilemma in public health, especially in underdeveloped regions. [33][34][35][36] To tackle this problem, efforts have been made to improve the convenience and efficiency of breast screening. Artificial intelligence has the potential to address this issue, and its value has been demonstrated by prior studies. [22][23][24] Beyond using AI on static breast images, in this research we take a step further by processing ultrasound screening videos and auto-selecting key responsible frames. To the best of our knowledge, this is the first time deep learning models have been combined with the entropy method of information theory to process each frame of ultrasound screening videos. According to the results of our multi-reader studies, FEBrNet has shown its ability to diagnose breast cancer by outperforming physicians as well as improving physicians' diagnostic accuracy, particularly for those with limited expertise and lower baseline performance.
Responsible frames are vital in clinical workflows for locating key diagnostic information and predicting malignancy. As one major contribution of this work, the Entropy Reduce method addresses the issue of choosing responsible frames while avoiding the repeated selection of visually identical frames. As shown in Figure 3, we use a video taken from a 45-year-old female patient with BI-RADS 4c and pathologically confirmed invasive breast cancer as a simple example to demonstrate the capacity of the Entropy Reduce method. When frames are sorted by FScore (a variable that evaluates the apparent degree of malignancy at the image level; a detailed definition can be found in the Method part), the top three frames are frames 26, 39, and 27, with FScores of 21.94, 21.29, and 21.03. These frames appear relatively similar in Figure 3(a) and are very close in time sequence. After using Principal Component Analysis (PCA) 37 to compress the feature matrices into two dimensions for display in Figure 3(b), it is obvious that the three frames lie close together in feature space. In Figures 3(c) and 3(d), the same approach is used to assess the top three responsible frames (frames 26, 111 and 96) chosen by the Entropy Reduce method. While the FScore of frame 111 is low, it is considerably distant from the first responsible frame (frame 26) in feature space. The top three frames selected by the Entropy Reduce method are scattered, echoing their different visual attributes in Figure 3(d).
We also notice that FEBrNet can identify features that are easily neglected by physicians. In Figure 4(a), FEBrNet proposes two frames indicative of malignancy, showing architectural distortion, that were overlooked by physicians during their earlier diagnosis. Figure 4(b) illustrates a more typical situation in which clinicians have difficulty determining the lesion's malignancy because the physician-chosen frames show a combination of benign (parallel orientation) and malignant (not circumscribed margin) features. FEBrNet increases the certainty about the likelihood of malignancy by supplying frames with an additional malignant feature (calcification). With the ability to choose appropriate responsible frames, FEBrNet could alleviate sonographers' daily burden and enable physicians with less expertise to perform ultrasound breast screening. Hence, FEBrNet can potentially help address the scarcity of sonographers and contribute to the widespread use of ultrasound screening for early diagnosis.
Although FEBrNet has partly solved the problem of selecting responsible frames for sonographers, some areas still need further investigation. The first task is to find out how to obtain high-quality ultrasound screening videos, which serve as the foundation for all subsequent work. AI has shown great potential for assisting in the acquisition of high-quality ultrasound data, 26,38 and several studies have revealed the advantages of applying AI to assist ultrasound imaging. 39,40 One next endeavor is to develop a navigational AI for breast ultrasound screening. Furthermore, once the responsible frames have been obtained, a tool is needed to analyze the features of the images. There are explainable aspects of FEBrNet worth examining, including currently established diagnostic features (e.g. margin, calcification and other features in BI-RADS) 41 and features that have not yet been discovered. In the long term, we anticipate the construction of an AI-based ultrasound screening system that streamlines guided ultrasound imaging, video data processing, diagnosis, and explanation generation.
FEBrNet leaves room for further exploration, and our current work has certain limitations. First, the quantity of breast ultrasound data included in this research is limited, and more multicenter breast ultrasound data from a broader patient group should be incorporated. Second, our study focuses only on breast nodule malignancy classification, although its applicability might be expanded to include disease subtype categorization, molecular phenotype prediction, and other tasks. Finally, ultrasound screening for a variety of disorders may be possible using the FEBrNet architecture, and more indications should be evaluated to ensure its robustness.

Ethical Approval and Informed Consent
This study obtained ethical approval from the Institutional Review Board of the Shenzhen People's Hospital. The approval included the collection of data on implied consent. We only used retrospective data and the patients were not actively involved in the study. The requirement of written informed consent was waived by the Institutional Review Board.

Data sources and entry criteria
This retrospective study was conducted in accordance with the procedures specified by the hospitals that participated. The Ethics Committees of the Cancer Hospital of The Chinese Academy of Sciences and Shenzhen People's Hospital authorized this research. To ensure the quality of the data, we based the experiment's inclusion and exclusion criteria on clinical guidelines.
The following criteria apply to data inclusion: (1) Ultrasound detection of breast nodules; (2) Nodule diameter must be between 5.0-30.0mm; (3) Breast tissue surrounding the nodule must be at least 3.0mm thick; (4) Nodules must be BIRADS 0, 2, 3, 4a, 4b, 4c, or 5; (5) No intervention or surgery on the nodule has been performed before the ultrasound test; (6) Patients must undergo surgery or biopsy within one week of the ultrasound data collection and obtain pathological results.
The following criteria are used to exclude data: (1) normal breasts (BIRADS category 1); (2) a history of breast surgery or interventional therapies; (3) image quality is poor; (4) clinical data for the case are insufficient, and pathological outcomes are untraceable.
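The inclusion and exclusion criteria above can be read as a single eligibility check per nodule. The following is a minimal sketch under stated assumptions: the `NoduleCase` record and all of its field names are hypothetical, the numeric bounds are treated as inclusive, and "within one week" is taken to mean at most 7 days. It is an illustration of the criteria, not the screening code used in the study.

```python
from dataclasses import dataclass
from typing import Optional

# BI-RADS categories accepted by inclusion criterion (4); category 1 is excluded.
ELIGIBLE_BIRADS = {"0", "2", "3", "4a", "4b", "4c", "5"}


@dataclass
class NoduleCase:
    """Hypothetical record layout; the field names are illustrative only."""
    diameter_mm: float
    surrounding_tissue_mm: float
    birads: str
    prior_intervention: bool            # earlier surgery or intervention on the nodule
    days_to_pathology: Optional[int]    # None when no pathology result is traceable
    image_quality_ok: bool


def is_eligible(case: NoduleCase) -> bool:
    """Return True when a case meets every inclusion criterion and no exclusion criterion."""
    return (
        5.0 <= case.diameter_mm <= 30.0          # assumed inclusive bounds
        and case.surrounding_tissue_mm >= 3.0
        and case.birads in ELIGIBLE_BIRADS
        and not case.prior_intervention
        and case.days_to_pathology is not None
        and 0 <= case.days_to_pathology <= 7     # "within one week", assumed <= 7 days
        and case.image_quality_ok
    )
```

Expressing the criteria as one predicate makes it easy to audit why a given case was dropped, since each condition maps to one numbered criterion.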

Study population and data distribution of image set
The study comprised 13702 2D ultrasound breast nodule images with pathology results acquired from 3448 female patients between 2020.10 and 2021.10 (9177 images from 2457 patients with benign pathology, 4545 images from 991 patients with malignant pathology), as stated in Table 3.
All images used are grayscale ultrasound images, from each of which a region of interest (ROI) is extracted. All non-object regions in the ultrasound image are eliminated. The image dataset is used to build the CNN image classifier in the first step of FEBrNet, which is then transferred to a video classifier.

Study population and data distribution of video set
As shown in Table 4, the study includes 1066 ultrasound breast nodule films with pathology results from 440 female patients between 2020.10 and 2021.10 (546 videos from 237 patients with benign pathology and 520 videos from 203 patients with malignant pathology). Additionally, we gathered the physician-chosen responsible frames for each video in the dataset, which are the frames that two senior physicians confirm include significant characteristics indicative of malignancy (a variable number of responsible frames for each video, including raw frames and annotated frames).
The video dataset is used to train the random forest feature classifier, which processes the pretrained CNN image classifier's features. To prevent information leakage during model training, we ensure that the patients in the video data set do not overlap with the patients in the image data set.

Statistical evaluation
In this paper, we ran 3 trials to evaluate FEBrNet:
1. Results for eleven different numbers of responsible frames are compared, using AUROC, AUPR, Accuracy, Sensitivity, Specificity, Recall, Precision, and F1-Score, to assess the impact of the number of responsible frames on FEBrNet's performance.
3. 'Complete physician diagnosis', 'Complete AI diagnosis' and 'AI selects frames, followed by physician diagnosis' are compared, using a scatter plot of FEBrNet and physician performance, AUROC, AUPR, Accuracy, Sensitivity, Specificity, Recall, Precision, and F1-Score, to evaluate how FEBrNet compares with physicians when diagnosing alone and whether physicians benefit from the assistance of FEBrNet.
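The threshold-based metrics listed above all follow from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the function names are our own, and AUROC and AUPR are omitted because they require ranked prediction scores rather than counts. Note that sensitivity and recall are the same quantity.

```python
def _ratio(numerator: float, denominator: float) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return _ratio(2 * precision * recall, precision + recall)


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute threshold-based metrics from confusion-matrix counts (malignant = positive)."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": recall,               # identical to recall
        "specificity": _ratio(tn, tn + fp),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }
```

As a consistency check on the reported values, taking 97.83% as precision and 64.29% as recall for the MobileNet backbone on physician-selected frames, `round(f1_score(0.9783, 0.6429), 3)` gives 0.776, which matches the F1-Score reported for that configuration.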

Model section: Philosophy and DataFlow of FEBrNet
In this part, we illustrate the philosophy of FEBrNet, its computational operations and its data flow step by step. In short, there are four steps in the workflow of FEBrNet: 1) feature distillation, 2) entropy matrix generation, 3) responsible frame recommendation, and 4) binary classification. Figure 5 depicts the data flow of FEBrNet. In the first step, state-of-the-art deep learning models are trained to acquire knowledge from static breast ultrasound images. The backbone of the model is transferred to the second step for parallel feature extraction from breast ultrasound videos, which are independent of the static images. Together with key weights from the first step, frame-by-frame feature vectors are concatenated into a new feature matrix (the feature entropy matrix). Third, we design a new Entropy Reduce method to select a subset of all frames to represent the entire ultrasound video for the purpose of breast lesion diagnosis. Meanwhile, binary classification as benign or malignant is also conducted based on the feature entropy matrix to assist physicians in the diagnosis.

Model section: Feature distillation
Two major merits stand out in feature distillation: 1) Pre-accumulated physician-selected images contain a plethora of breast lesion features, in particular malignant ones. Therefore, to create a model capable of extracting essential malignant characteristics, Convolutional Neural Networks (CNN) can be pretrained on a relatively large ultrasound image dataset. 2) To a large extent, the backbone of the pretrained model can accelerate video feature extraction compared with training a parallel model from scratch.
Here, most standard CNN models, including DenseNet, ResNet, and MobileNet, could be used. DenseNet and MobileNet are used in this experiment to compare a lightweight model (3,230,914 parameters for MobileNet) with a more sophisticated model (7,039,554 parameters for DenseNet).
We split the entire image dataset containing 13702 2D images into three portions in our research (train : valid : test = 8:1:1), with images of the same patient appearing in only one of the subsets. For data augmentation, random rigid transformations of the original image are used to mimic the image displacement, zoom, and flip that might occur as physicians scan for nodules in real-world clinical procedures. The specific methods used include rotation, zoom, translation, and flip, as well as grayscale adjustment. Binary Cross-entropy Loss is used to calculate the classification loss and adjust the network weights.
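The requirement that all images of a patient fall into a single subset means the split has to be made over patients rather than over images. A minimal sketch of such a grouped split is given below. The function name, the `seed` parameter, and the choice to apply the 8:1:1 ratio at the patient level (so that the image-level ratio is only approximately 8:1:1) are assumptions for illustration, not details taken from the study.

```python
import random
from collections import defaultdict


def split_by_patient(image_patient_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign image indices to train/valid/test so that no patient spans two subsets.

    image_patient_ids[i] is the patient identifier of image i.
    """
    # Group image indices by patient.
    by_patient = defaultdict(list)
    for index, patient_id in enumerate(image_patient_ids):
        by_patient[patient_id].append(index)

    # Shuffle patients reproducibly, then cut the patient list by ratio.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_train = int(len(patients) * ratios[0])
    n_valid = int(len(patients) * ratios[1])
    groups = {
        "train": patients[:n_train],
        "valid": patients[n_train:n_train + n_valid],
        "test": patients[n_train + n_valid:],   # remainder goes to the test subset
    }

    # Expand each patient group back into image indices.
    return {
        name: [index for patient_id in ids for index in by_patient[patient_id]]
        for name, ids in groups.items()
    }
```

The same grouping idea applies to keeping the image and video datasets patient-disjoint: membership is decided once per patient, and every image or video inherits it.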

Model section: Entropy matrix generation
Various nodular features are scattered over the temporal sequence of frames. The principle of entropy matrix generation is to map all features of all frames to a high-dimensional space, where all features of a single frame can be represented as one vector. Transferred from step 1, the backbone of the pretrained CNN model serves as a feature extractor that distills essential features from each frame of the video in parallel and creates feature matrices. The feature matrices reveal the feature intensity of each frame in each feature dimension, where the number of dimensions is determined by the backbone model (1024 dimensions for the DenseNet121 backbone). By incorporating the weights of the final layers from step 1, which represent the information indicative of malignancy, we obtain the feature entropy matrices.
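The text does not spell out how the final-layer weights are incorporated, so the following is only one plausible reading, stated as an assumption: each frame's k-dimensional feature vector is multiplied element-wise by the k weights that connect the pooled features to the malignant output. The function name and the plain-list representation are illustrative, and the CNN backbone is replaced here by precomputed feature vectors.

```python
def feature_entropy_matrix(frame_features, malignant_weights):
    """Weight each frame's feature vector element-wise by final-layer weights.

    frame_features: list of per-frame feature vectors (n_frames x k), assumed to come
        from the pretrained backbone.
    malignant_weights: list of k weights, assumed to be the final-layer weights
        associated with the malignant class.
    Returns an n_frames x k matrix (list of lists).
    """
    k = len(malignant_weights)
    matrix = []
    for features in frame_features:
        if len(features) != k:
            raise ValueError("feature vector length must match the weight vector")
        matrix.append([f * w for f, w in zip(features, malignant_weights)])
    return matrix


# Toy example with 2 frames and k = 3 feature dimensions.
toy = feature_entropy_matrix([[1.0, 2.0, 0.5], [0.0, 1.0, 4.0]], [0.5, -1.0, 2.0])
```

Under this reading, a large positive entry means that a feature is both strongly present in the frame and strongly tied to malignancy by the trained classifier head.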

Model section: Responsible frame recommendation
Here, feature entropy matrices are used to rank the contribution of frames to breast nodule diagnosis. A key variable we define is the FScore, which is the sum of the values of the feature entropy matrix over all feature dimensions for a given frame. The higher the FScore, the more characteristics the frame contributes that indicate possible malignancy. FScore assists in locating the frame that contributes the most to the possibility of malignancy, which we define as the first responsible frame of the model's prediction. However, since adjacent frames usually share similar image signatures and have very close FScores, the frames with the largest and second-largest FScore may look almost identical.
To choose a comprehensive set of responsible frames with varied features, we first extend the concept of FScore from a single image to a collection of frames, because a video is, in essence, a collection of frames. Second, we propose a novel Entropy Reduce method to select a minimal subset of frames with the lowest sum of entropy values. This set of frames is considered to have the highest likelihood of representing the entire video. The essential philosophy of the Entropy Reduce method is a greedy mechanism, in which the next frame is repeatedly searched for so as to reduce the entropy sum of all selected frames, until the entropy sum cannot be reduced further. More mathematical illustrations and examples can be found in the Supplement.
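To make the greedy mechanism concrete, the sketch below uses a stand-in objective, because the exact entropy definition is given only in the Supplement and is not reproduced here. As an assumption, the "entropy" of a selected set is taken to be the gap between the per-dimension maximum over all frames and the per-dimension maximum over the selected frames, summed over dimensions. With this choice the first pick is always the frame with the highest FScore, and later picks favor frames that add features not yet covered. All names are illustrative.

```python
def fscore(frame_row):
    """FScore of one frame: sum of its feature entropy values over all dimensions."""
    return sum(frame_row)


def select_responsible_frames(entropy_matrix, max_frames=None, tol=1e-12):
    """Greedily pick frames until the stand-in residual can no longer be reduced.

    entropy_matrix: n_frames x k list of lists.
    Returns the selected frame indices in order of selection.
    """
    limit = len(entropy_matrix) if max_frames is None else max_frames
    # Video-level ceiling: strongest value of each feature dimension over all frames.
    video_max = [max(column) for column in zip(*entropy_matrix)]
    selected, pooled, best_residual = [], None, float("inf")

    while len(selected) < limit:
        choice, choice_pooled, choice_residual = None, None, best_residual
        for index, row in enumerate(entropy_matrix):
            if index in selected:
                continue
            # Per-dimension maximum if this frame were added to the selection.
            trial = list(row) if pooled is None else [max(a, b) for a, b in zip(pooled, row)]
            residual = sum(v - t for v, t in zip(video_max, trial))
            if residual < choice_residual - tol:
                choice, choice_pooled, choice_residual = index, trial, residual
        if choice is None:          # no remaining frame lowers the residual
            break
        selected.append(choice)
        pooled, best_residual = choice_pooled, choice_residual
    return selected


# Frame 1 is a near-duplicate of frame 0; frames 2 and 3 carry different features.
toy_matrix = [
    [5.0, 0.0, 0.0],
    [4.8, 0.0, 0.1],
    [0.0, 3.0, 0.0],
    [0.0, 0.0, 2.0],
]
by_fscore = sorted(range(len(toy_matrix)), key=lambda i: -fscore(toy_matrix[i]))[:3]
by_greedy = select_responsible_frames(toy_matrix)
```

In this toy case, ranking by FScore alone returns frames 0, 1 and 2, including the near-duplicate, whereas the greedy selection returns frames 0, 2 and 3 and then stops because adding frame 1 no longer lowers the residual.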

Model section: Binary classification
For the classification task, we re-organize the feature entropy matrix of all frames or of the selected responsible frames (option 1 or 2 in Figure 1) with a MaxPool operation, as used in deep learning, to compress the information into a vector of consistent shape (1*k). This vector holds the maximum contribution of the video in each feature dimension and is indicative for classifying the video as benign or malignant. We refer to the compressed feature entropy matrix as the video feature entropy matrix.
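The MaxPool compression described above amounts to taking a column-wise maximum over the chosen frames. A minimal sketch follows; the function name and the optional `frame_indices` argument (to switch between all frames and the selected responsible frames) are illustrative assumptions.

```python
def video_feature_vector(entropy_matrix, frame_indices=None):
    """Max-pool an n_frames x k feature entropy matrix into a single length-k vector.

    frame_indices: optional list of frame indices to pool over (for example the
        selected responsible frames); all frames are used when it is None.
    """
    rows = entropy_matrix if frame_indices is None else [entropy_matrix[i] for i in frame_indices]
    if not rows:
        raise ValueError("at least one frame is required")
    # Column-wise maximum: the strongest contribution in each feature dimension.
    return [max(column) for column in zip(*rows)]


toy_matrix = [
    [5.0, 0.0, 0.0],
    [4.8, 0.0, 0.1],
    [0.0, 3.0, 0.0],
    [0.0, 0.0, 2.0],
]
pooled_all = video_feature_vector(toy_matrix)
pooled_subset = video_feature_vector(toy_matrix, frame_indices=[0, 2, 3])
```

Because the output always has length k regardless of how many frames a video contains, videos of different durations can be passed to the same fixed-input classifier.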
In FEBrNet, we employ a classic machine learning model to analyze video feature entropy matrices and produce the final benign-malignant predictions. Here, a random forest 30 with 1024 feature estimators and a maximum depth below 10 is trained on our video training set. The classification is made based on entropy matrices instead of the original videos for two reasons: 1) the feature entropy matrices already encapsulate the key information from previous steps needed for classification; 2) undesirable noisy frames from the continuous screening process might otherwise undermine classification accuracy.

Conclusion
To summarize, we developed a system called FEBrNet, which is capable of auto-selecting responsible frames from ultrasound screening videos and classifying breast nodules using AI. The results of our reader studies indicate that FEBrNet has diagnostic performance equivalent or even superior to that of ultrasound specialists. It has the potential to empower physicians with less expertise to conduct better breast ultrasound screening and achieve better accuracy. Therefore, merging FEBrNet into the clinical ultrasound screening workflow might bring actual benefit by helping to address the scarcity of sonographers, thereby increasing the use of ultrasound screening in cancer prevention.

Declarations
Conflict of Interest

Figures
Figure 1
Recall, Precision and F1-Score of FEBrNet when using different numbers of responsible frames.
(a) When fewer than three frames are used, MobileNet's recall falls and precision increases as the number of responsible frames grows.
(b) DenseNet121 works well in most circumstances. Using more than 15 responsible frames had little influence on precision and recall.

Figure 2
Comparison of 'Complete physician diagnosis', 'Complete AI diagnosis' and 'AI selects frames, followed by physician diagnosis'.
When physicians diagnose alone, FEBrNet with the DenseNet backbone outperforms all of them. With the assistance of FEBrNet, all physicians' performance is enhanced, and physicians 1 and 2 exhibit diagnostic capability comparable to that of the AI.

Cases in which FEBrNet discovered malignant features overlooked by physicians.
Comparison of frames selected by physicians and by FEBrNet reveals that FEBrNet is capable of identifying features that are easily overlooked by physicians.