Deep Learning-Based Breast Cancer Diagnosis at Ultrasound: Initial Application of a Weakly-Supervised Algorithm Without Image Annotation

Original Research

Conventional deep learning (DL) algorithms require full supervision in the form of region-of-interest (ROI) annotation, which is laborious and often biased. We aimed to develop a weakly-supervised DL algorithm that diagnoses breast cancer at ultrasound (US) without image annotation. Weakly-supervised DL algorithms were implemented with three networks (VGG16, ResNet34, and GoogLeNet) and trained using 1000 unannotated US images (500 benign and 500 malignant masses). Two sets of 200 images (100 benign and 100 malignant masses) were used as internal and external validation sets. For comparison with fully-supervised algorithms, ROI annotation was performed manually and automatically. Diagnostic performance was calculated as the area under the receiver operating characteristic curve (AUC). Using the class activation map, we determined how accurately the weakly-supervised DL algorithms localized the breast masses. For the internal validation set, the weakly-supervised DL algorithms achieved excellent diagnostic performance, with AUC values of 0.92–0.96, which were not statistically different (all Ps > 0.05) from those of fully-supervised DL algorithms with either manual or automated ROI annotation (AUC, 0.92–0.96). For the external validation set, the weakly-supervised DL algorithms achieved AUC values of 0.86–0.90, which were not statistically different from (Ps > 0.05), or higher than (P = 0.04, VGG16 vs. automated ROI annotation), those of fully-supervised DL algorithms (AUC, 0.84–0.92). In the internal and external validation sets, the weakly-supervised algorithms localized 100% of malignant masses, except for ResNet34 (98%). The weakly-supervised DL algorithms developed in the present study were feasible for US diagnosis of breast cancer, with well-performing localization and differential diagnosis.


Introduction
Ultrasound (US) is the mainstay of differential diagnosis between benign and malignant breast masses and has traditionally been used in diagnostic settings, with renewed interest in its use in screening settings 1,2 . Despite such wide applicability, breast US has intrinsic limitations, including interobserver variability in diagnostic performance that is often worse among non-experts 3 . This interobserver variability contributes to a high rate of false-positives, causing unnecessary biopsies and surgeries. With expectations of overcoming these limitations, there has been growing interest in the application of deep learning (DL) technology to breast US diagnosis [4][5][6] . Conventional approaches using DL algorithms have involved full supervision, which requires an image annotation process usually performed by humans drawing the region of interest (ROI) of the lesion. Even with automated ROI segmentation methods, verification of the ROI by humans is still needed. As DL is a data-driven technology, the time- and labor-intensive image annotation process may hinder the development of well-performing models because of the need for massive training data. Moreover, manual annotation can be biased, as this task necessarily involves subjective pre-judgment of the lesion. DL with weak supervision (weakly-supervised DL) is a form of DL in which unannotated images with only image-level labels (i.e., malignant and benign) are used in training for differential diagnosis and localization [7][8][9] . Weakly-supervised DL has the following advantages. For developing a DL-based algorithm, a method without image annotation can compile large-scale image sets in a time- and labor-saving manner. For clinical applications, weakly-supervised DL algorithms allow us to use the entire image as input to the trained model, leading to an improvement in workflow efficiency over fully-supervised algorithms, as the additional task of marking lesions can be avoided.
Despite these benefits of weakly-supervised DL algorithms, only a few studies have demonstrated their feasibility in radiology. Weakly-supervised DL algorithms have been evaluated on magnetic resonance imaging (MRI) and chest x-ray images and demonstrated good diagnostic performance in the classification of breast lesions and thoracic disease 10,11 . However, weakly-supervised DL algorithms have not been well studied in breast US images.
The main hypothesis of this work is that weakly-supervised DL algorithms for US images are feasible for diagnosing breast masses and comparable to conventional fully-supervised DL algorithms.
The purpose of this study was to develop a weakly-supervised DL algorithm that detects breast masses in US images and makes a differential diagnosis between benignity and malignancy synchronously.

Material and Methods
The Institutional Review Board (IRB) of Kyungpook National University Chilgok Hospital approved this retrospective study, and all methods were carried out in accordance with relevant guidelines and regulations. The requirement for informed consent was waived by the IRB of Kyungpook National University Chilgok Hospital.

Datasets
We retrospectively collected 1400 US images of breast masses from 971 patients at two institutions (institution A: A University A′ Hospital; institution B: B University Hospital; Fig. 1) for the training and validation sets. Although multiple masses per patient were allowed, only the single most representative image of each mass was used. Among the 1400 images, 700 showed cancers confirmed by biopsy or surgery, and 700 showed benign masses confirmed by biopsy (n = 163) or at least 2 years of follow-up imaging (n = 537). The training set contained 500 benign and 500 malignant masses obtained from institution A (data collection period: January 2011-August 2013). The validation sets were divided into internal and external validation sets, each with 200 images of 100 benign and 100 malignant masses. Images for internal validation were temporally split from institution A (data collection period: September 2013-July 2014) and were not used for algorithm training. Images for external validation were consecutively obtained from institution B (data collection period: May 2011-August 2015). All breast US images were extracted from picture archiving and communication systems and stored in JPEG format. For the training and internal validation sets obtained from institution A, a single US machine (Philips) was used to generate images, while multiple US machines (Philips, GE, and Siemens) were used for the external validation set obtained from institution B.

Image annotation and preprocessing
Images were anonymized by minimal trimming of the image edges to eliminate body marks and text annotations. For the weakly-supervised DL algorithms, no further data curation was performed, to test the feasibility of the proposed system without ROI annotation (Fig. 2). For comparison with fully-supervised DL algorithms, ROI annotation was performed using two methods: manual drawing and automated DL-based segmentation. For manual drawing, a radiologist (W.H.K.; with 11 years of experience in breast US) marked ROIs and made binary masks for each mass using an in-house drawing tool. For automated DL-based segmentation, we employed the deep segmentation network U-Net, which was developed to segment medical images 12 . After the ROI annotation, we extracted a square image with a fixed margin of 30 pixels that enclosed the corresponding mass, resized the image to 224 × 224 pixels, and normalized the pixel intensities to the range 0 to 1 using the maximum intensity value.
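To illustrate, the cropping and normalization step can be sketched as follows. This is a minimal NumPy-only sketch under our own naming (`preprocess_roi` is not from the authors' code); it uses nearest-neighbor resizing for self-containedness, whereas a production pipeline would use proper interpolation.

```python
import numpy as np

def preprocess_roi(image, mask, margin=30, size=224):
    """Crop a square region enclosing the mass (binary mask) with a fixed
    pixel margin, resize to size x size, and scale intensities to [0, 1]."""
    ys, xs = np.nonzero(mask)
    # Center of the mass bounding box and half-side of the padded square.
    cy, cx = (ys.min() + ys.max()) // 2, (xs.min() + xs.max()) // 2
    half = max(ys.max() - ys.min(), xs.max() - xs.min()) // 2 + margin
    # Clip the square window at the image borders.
    top, bottom = max(cy - half, 0), min(cy + half + 1, image.shape[0])
    left, right = max(cx - half, 0), min(cx + half + 1, image.shape[1])
    crop = image[top:bottom, left:right].astype(np.float32)
    # Nearest-neighbor resize to the network input size.
    iy = np.linspace(0, crop.shape[0] - 1, size).round().astype(int)
    ix = np.linspace(0, crop.shape[1] - 1, size).round().astype(int)
    resized = crop[np.ix_(iy, ix)]
    # Normalize pixel intensities to [0, 1] by the maximum intensity value.
    return resized / resized.max()
```

Note that this ROI extraction applies only to the fully-supervised branch; the weakly-supervised branch takes the entire image.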

Deep Classification Models
For deep classifiers, we employed three representative convolutional neural networks (CNNs) that have achieved state-of-the-art performance in various computer vision tasks: VGG16, ResNet34, and GoogLeNet [13][14][15] . Details of VGG16, ResNet34, and GoogLeNet are given in their original publications [13][14][15] . To test the performance of discriminative localization by the weakly-supervised DL algorithms, we extended the classification models with a global average pooling (GAP) layer added to the final convolutional layer of each model 10,16 . The GAP averages each feature map (f_k) of the last convolutional layer into a feature score (F_k) as follows:

F_k = (1/Z) Σ_{x,y} f_k(x, y)

where x and y are the spatial indices of f_k and Z is the number of spatial locations. The number of feature maps is the same as the number of classes (C). The models then perform linear classification using a fully-connected layer followed by a softmax function. The fully-connected layer with learnable weights (W = {w_{c,k}, b_c}) calculates a class score (S_c) for each class as follows.

S_c = Σ_k w_{c,k} F_k + b_c
The class scores are given to the softmax function to yield the predicted probabilities of all classes. The predicted probability (p_c) of each class and the probability of malignancy (POM) were calculated as follows:

p_c = exp(S_c) / Σ_{c′} exp(S_{c′}), POM = p_malignant
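Under these definitions, the GAP classification head can be written compactly. The following NumPy sketch uses our own shapes and names, not the authors' code:

```python
import numpy as np

def gap_head(feature_maps, W, b):
    """GAP classification head.

    feature_maps: (K, H, W) activations f_k of the last convolutional layer.
    W: (C, K) fully-connected weights w_{c,k}; b: (C,) biases b_c.
    Returns the softmax probabilities p_c over the C classes.
    """
    F = feature_maps.mean(axis=(1, 2))   # GAP: feature scores F_k, shape (K,)
    S = W @ F + b                        # class scores S_c, shape (C,)
    e = np.exp(S - S.max())              # numerically stable softmax
    return e / e.sum()

# With classes ordered (benign, malignant), the probability of
# malignancy (POM) is simply probs[1].
```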

Discriminative localization
The class activation map (M_c) of each class can be acquired by merging the feature maps using the weights learned in the estimation of the class scores.

M_c(x, y) = Σ_k w_{c,k} f_k(x, y)
The relative intensity of M_c is scaled using min-max normalization for inter-subject comparison and visualization. The scaled class activation map (M′_c) is acquired as follows:

M′_c = (M_c − min(M_c)) / (max(M_c) − min(M_c))
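A minimal sketch of the CAM computation under the same naming (the upsampling of the map back to input resolution, which the visualization would need, is omitted; the function name is ours):

```python
import numpy as np

def class_activation_map(feature_maps, W, class_idx):
    """CAM for one class: weighted sum of the feature maps f_k using the
    fully-connected weights w_{c,k} of that class, then min-max scaled.

    feature_maps: (K, H, W); W: (C, K). Returns an (H, W) map in [0, 1].
    """
    # M_c(x, y) = sum_k w_{c,k} * f_k(x, y)
    cam = np.tensordot(W[class_idx], feature_maps, axes=1)
    # Min-max normalization for inter-subject comparison and visualization
    # (small epsilon guards against a constant map).
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-12)
```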

Performance metrics and statistical analysis
For differential diagnosis, we used the area under the receiver operating characteristic curve (AUC) as the primary metric for comparing algorithm performance, and the DeLong test for comparing the AUCs of two correlated receiver operating characteristic (ROC) curves. The exact McNemar test was used to test differences in sensitivity and specificity. Discriminative localization was regarded as correct when the segmented area overlapped the manually annotated area. Fisher's exact test was used to compare rates of correct and incorrect localization between benign and malignant masses with the weakly-supervised DL algorithm. All statistical analyses were performed using MedCalc statistical software, version 17.1 (Mariakerke, Belgium). Two-tailed P values of <0.05 were considered statistically significant.
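As a sanity check on these metrics, the AUC can be computed directly from its rank (Mann-Whitney) formulation, and the localization criterion reduces to a simple overlap test. This NumPy sketch uses our own function names:

```python
import numpy as np

def auc(labels, scores):
    """AUC as the Mann-Whitney statistic: the probability that a randomly
    chosen malignant case (label 1) scores higher than a randomly chosen
    benign case (label 0), counting ties as one half."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diffs = pos[:, None] - neg[None, :]
    return ((diffs > 0).sum() + 0.5 * (diffs == 0).sum()) / (len(pos) * len(neg))

def localization_correct(cam_region, annotated_region):
    """Localization counts as correct when the region identified by the
    algorithm overlaps the manually annotated area at all."""
    return bool(np.logical_and(cam_region, annotated_region).any())
```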

Results

Baseline characteristics of data sets
Baseline characteristics of the training set and internal/external validation sets are described in Table 1. The mean ages of patients in the training, internal validation, and external validation sets were 49 years
For the external validation set, the weakly-supervised DL models achieved high diagnostic performance, although slightly lower than in the internal validation set, with AUC values of 0.89, 0.86, and 0.90 for the VGG16, ResNet34, and GoogLeNet models, respectively (Table 3, Fig. 3). The AUCs of the fully-supervised DL models with manual annotation were 0.91, 0.89, and 0.92, respectively; with automated annotation, they were 0.85, 0.84, and 0.87, respectively. The AUCs of the weakly-supervised DL models were not statistically different from those of the fully-supervised DL models with manual ROI annotation (all Ps > 0.05). For the VGG16 network, the AUC was significantly higher for the weakly-supervised DL model than for the fully-supervised DL model with automated ROI annotation (P = 0.04); the ResNet34 and GoogLeNet networks showed no significant differences between the weakly-supervised and fully-supervised DL models with automated ROI annotation (all Ps > 0.05).
Sensitivities of the weakly-supervised DL models were 91% (91/100), 78% (78/100), and 88% (88/100) for the VGG16, ResNet34, and GoogLeNet models, respectively, and the specificities were 72% (72/100), 80% (80/100), and 76% (76/100), respectively. The sensitivities did not significantly differ between weakly-supervised and fully-supervised DL models for VGG16 and GoogLeNet (all Ps > 0.05). For ResNet34, the sensitivity was lower in the weakly-supervised DL model than in the fully-supervised model with manual annotation (P < 0.001) but not significantly different from the fully-supervised DL model with automated ROI annotation (P = 0.66). The specificity of the weakly-supervised DL model was not significantly different from that of the fully-supervised DL models with manual ROI annotation for VGG16 and GoogLeNet (all Ps > 0.05) and was lower for ResNet34 (P < 0.001). The specificity was higher in the weakly-supervised DL model than in the fully-supervised DL model with automated ROI annotation, with statistical significance or borderline significance (P < 0.001, P = 0.07, and P = 0.04 for VGG16, ResNet34, and GoogLeNet, respectively).
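For reference, the per-class rates above follow directly from the validation counts; a minimal sketch using the VGG16 weakly-supervised figures reported here (the function name is ours):

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# VGG16 weakly-supervised model, external validation set:
# 91/100 malignant and 72/100 benign masses classified correctly.
sens, spec = sensitivity_specificity(tp=91, fn=9, tn=72, fp=28)  # 0.91, 0.72
```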

Discussion
In this study, we found that the weakly-supervised DL algorithms provided excellent diagnostic performance (AUC: 0.86-0.96) that was not inferior to that of the fully-supervised DL algorithms with manual (AUC: 0.89-0.96) and automated annotation (AUC: 0.84-0.96). Furthermore, the weakly-supervised DL algorithms correctly localized benign and malignant masses at nearly perfect rates (96%-100%). This excellent classification and localization performance was achieved even with our relatively small dataset and in the external validation set with different breast imagers and US equipment. Taken together, our results suggest that weakly-supervised DL algorithms are feasible for detecting and diagnosing breast cancer in US images through a highly efficient data-curation process in which image-based classification can be made without manual or automated annotation.
Classification methods using DL algorithms can be categorized into region-based and image-based classification. Most previous studies have used region-based classification, in which regions (usually lesions) are necessarily determined prior to the classification task, either manually (including semi-automatically) or automatically. Studies using manually determined regions showed high diagnostic performance, with AUCs of 0.84-0.94 depending on cases and learning strategies [17][18][19] 20 . In other studies, regions were determined more inclusively by having a human crop the image, with excellent diagnostic performance (AUC, 0.913 and 0.951) 21,22 . Automated determination of regions has been proposed using various region-proposal methods. Diagnostic performance using automatically determined regions was suboptimal or the metrics were not well demonstrated: the highest accuracy was 87.5% in a study using DenseNet 13 , sensitivity for malignant lesions was 60.6% in a study using fully convolutional networks 23 , and no overall diagnostic metrics were reported in a study using Faster R-CNN 24 . Image-based classification with weakly-supervised DL algorithms has been proposed in the present study and our previous work 25 . In our previous work, we proposed a box convolution network with VGG-16 that learns kernel sizes and offsets of convolution filters from given datasets. We found that our proposed model had higher diagnostic accuracy and localization performance than VGG-16 or dilated VGG-16. However, our previous work focused on the box convolution network; we did not compare our model with fully-supervised DL algorithms, and external validation and generalization to other networks were not evaluated. In the present study, using three representative networks and external validation test sets, the feasibility of weakly-supervised DL algorithms was demonstrated in comparison with fully-supervised DL algorithms.
Weakly-supervised DL algorithms serve more closely as human-mimicking algorithms than fully-supervised DL algorithms in differentiating malignant from benign breast masses in US images. Human-established algorithms employed in the Breast Imaging Reporting and Data System (BI-RADS) take into account comprehensive sonographic features of both the mass and the surrounding breast tissue. Hence, weakly-supervised DL, which uses the information of the images in their entirety (not confined to the mass or its vicinity), may have advantages over fully-supervised DL; the proposed algorithm can learn a significant portion of the BI-RADS lexicon describing information outside the mass (e.g., posterior features, architectural distortion) that is known to be helpful for differential diagnosis 26,27 .
The GAP used in our study enforces feature maps that preserve spatial information relevant to the classes, so that they can be used to interpret the decisions of the CNN models 8,28 . This method of identifying areas that contribute to the differential diagnosis, using GAP with CAM, leads toward the concept of eXplainable AI (XAI) 29,30 . XAI, or responsible AI, is an emerging paradigm for overcoming the inherent "black box problem" of deep frameworks, wherein it is impossible for us to understand how decisions are reached. CAM gives us insight into the decision-making process implemented by AI. In addition, we believe that weakly-supervised DL with CAM may facilitate the development of DL-aided detection frameworks that highlight clinically significant regions for healthcare providers 31,32 .
An important caveat of the present study is that our proposed weakly-supervised DL algorithm was not trained with a large-scale dataset, in keeping with our feasibility objectives. Further studies are needed using datasets from various institutions, imagers, and US equipment. Another limitation is that time and labor efficiency was not directly quantified because of the complexity of the data curation process.

Table 3. Diagnostic performance metrics of weakly-supervised and fully-supervised deep learning algorithms in the external validation set.

Table 4. Metrics for discriminative localization of benign and malignant breast masses in the weakly-supervised deep learning algorithm.

Ultrasound images show a 6-mm oval, circumscribed mass considered as benign (unchanged during the 55-month follow-up period), which was predicted as benign with POMs of 0.434, 0.006, and 0.006, respectively.

Figure 1. Overview of the data acquisition.

Figure 2
Overview of weakly-supervised and fully-supervised deep learning (DL) algorithms for breast mass classification and localization. The weakly-supervised DL algorithm does not require image annotation of the region of interest (ROI) of the lesion, whereas the fully-supervised DL algorithm requires tumor segmentation (manual or automated) and cropping for the ROI before input to the classifiers. For the weakly-supervised DL algorithm, a class activation map (CAM) is generated to visualize the region detected by this algorithm using a global average pooling layer (GAP) that is added to the final convolutional layer. Ultrasound images show a 29-mm oval, hypoechoic mass with macrocalcifications considered as benign (unchanged during the 46-month follow-up period), which was predicted as malignancy with POM of