The Establishment of a Transformer-Based Computer-Aided Diagnosis Model to Improve the Classification Consistency of BI-RADS-US 3–5 Nodules Among Radiologists: A Multi-Center Study

Background: Significant differences exist in classification outcomes for radiologists using the ultrasonography-based breast imaging-reporting and data system to diagnose category 3–5 (BI-RADS-US 3–5) breast nodules, due to a lack of clear and distinguishing image features. As such, this study investigates the use of a transformer-based computer-aided diagnosis (CAD) model to improve BI-RADS-US 3–5 classification consistency. Methods: Five radiologists independently performed BI-RADS-US annotations on a breast ultrasonography image set collected from 20 hospitals in China. The data were divided into training, validation, testing, and sampling sets. The trained transformer-based CAD model was then used to classify test images, for which sensitivity, specificity, and accuracy were calculated. Variations in these metrics among the 5 radiologists were analyzed by referencing BI-RADS-US classification results for the sampling test set, provided by CAD, to determine whether classification consistency (the kappa value), sensitivity, specificity, and accuracy had improved. Results: Classification accuracy for the CAD model applied to the test set was 95.7% for category 3 nodules, 97.6% for category 4a nodules, 95.6% for category 4b nodules, 94.2% for category 4c nodules, and 97.5% for category 5 nodules. Adjustments were made to 1,583 nodules in the sampling test set, as 905 were classified to a higher category and 678 to a lower category. As a result, the accuracy, sensitivity, and specificity of classification by each radiologist improved, with the consistency (kappa values) for all radiologists increasing to >0.60. Conclusions: The proposed transformer-based CAD model improved BI-RADS-US 3–5 nodule classification by individual radiologists and increased diagnostic consistency.


Background
Breast cancer exhibits the highest incidence among malignancies in women, and its early diagnosis and treatment significantly reduce mortality rates. [1,2] Among commonly used breast examination techniques, ultrasonography is the most convenient and economical modality, involving no radiation and relatively low cost. However, the quality of ultrasonography depends directly on operator expertise and experience, especially scanning technique, lesion detection ability, and the description and interpretation of images. [3] The breast imaging reporting and data system for ultrasonography (BI-RADS-US) is an attempt to normalize and standardize the terminology used to describe a series of appearances in ultrasound images and to classify breast nodules from category 1 through category 6 according to the probability of malignancy. [4] However, this probability for category 4 nodules varies widely (2–95%), and specific classification criteria for subcategories 4a, 4b, and 4c lack clear definitions. In addition, there are no clear criteria for distinguishing category 4a from category 3 nodules, or category 4c from category 5 nodules. [5,6] Hence, BI-RADS-US classifications of category 3–5 nodules differ significantly between hospitals and individual radiologists. As a result, a given breast nodule may be over- or under-treated in response to a diagnosis. For example, misclassification of a benign nodule into category 4 or above increases the psychological burden and medical expenses for a patient, while misclassification of a malignant nodule into category 3 can cause life-threatening delays in treatment.
Computer-aided diagnosis (CAD) models can bypass conventional subjective diagnoses by humans. In recent years, the expanded availability of breast imaging datasets has facilitated end-to-end deep learning, thereby enabling objective diagnosis of breast nodules. While CAD models can be a highly effective aid for radiologists in diagnosing disease, their performance is closely related to the size of the training set, as larger sets comprised of higher-quality images produce better diagnostic outcomes. While acquiring annotated images marked by experienced radiologists can be difficult, open databases have allowed the application of machine learning in a variety of fields. For example, convolutional neural networks (CNNs) have been applied to the segmentation of ultrasound images, [7] the diagnosis of benign and malignant breast nodules, [6,8,9] and BI-RADS-US classification, [10,11] achieving satisfactory results. However, the self-attention mechanism of the transformer has outperformed conventional CNNs for visual tasks. [12] Transformers were originally developed for natural language processing, [13] but have since been applied to medical imaging research. [14,15] It has also been suggested that transformers focus more on shape recognition and exhibit higher computational efficiency and scalability than CNNs, which rely more on texture recognition. As such, this study utilized a transformer to establish a CAD model for the classification of BI-RADS-US 3–5 nodules, thereby providing classification references for radiologists in an attempt to improve diagnostic accuracy and consistency. This retrospective clinical study only involved the collection of age data, breast nodule images, imaging system models, and pathological results for patients. It did not interfere with individual treatment plans, and an exemption from informed patient consent was approved by the hospital ethics committee.

Methods
Inclusion criteria were satisfied by images (1) produced using a high-frequency probe (≥12 MHz), (2) containing only one nodule, and (3) exhibiting nodules with identifiable boundaries. Exclusion criteria were applied to images exhibiting (1) no nodules, (2) clear cysts, (3) more than one nodule, (4) nodules too large to display a complete outline, and (5) poor quality or unclear nodule boundaries. The open-source "cornerstonejs" and "cornerstoneTools" JavaScript frameworks were used to establish a breast nodule image annotation platform. Five ultrasound radiologists independently performed BI-RADS-US classification and labeling of all sample images, with 8 (DR1), 11 (DR2), 12 (DR3), 15 (DR4), and 19 (DR5) years of experience in breast ultrasonography. A senior radiologist with 21 years of experience served as the "referee" for final classification. Consistency was achieved to the degree possible, as the 6 radiologists jointly discussed and formulated specific criteria for the classification of BI-RADS-US 3–5 nodules, based on their experience and recent literature. [4,5,16,17] See Table 1 for further details.
The five radiologists annotating the images performed BI-RADS-US classification based on image features alone, without knowing patient ages, clinical symptoms, or pathological results. A rectangular box on the labeling platform was used to mark the nodule margin prior to classification and label selection (see Figure 1). The following two strategies were adopted when the BI-RADS-US results were inconsistent. (1) Images with a 4:1 consensus were classified according to the majority opinion. (2) All other disagreements were finalized by the referee, who was allowed to access pathological results in making the final diagnosis to provide a more accurate database for the model. Sample images were randomly divided into development and test sets using a 7:3 ratio. The development set was subsequently and randomly subdivided into training and validation sets using an 8:2 ratio.
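The two-strategy consensus rule above can be sketched in a few lines; this is a minimal illustration, and the function name and inputs are hypothetical rather than taken from the study's code:

```python
# Sketch of the label-consensus rule: a 4:1 (or unanimous) majority decides
# the label; any other split is settled by the referee radiologist.
from collections import Counter

def consensus_label(annotations, referee_label):
    """Resolve five radiologists' BI-RADS-US labels for one image."""
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    if votes >= 4:           # 4:1 consensus or unanimous agreement
        return label
    return referee_label     # all other disagreements go to the referee
```

For example, `consensus_label(["4a", "4a", "4a", "4a", "3"], "4b")` returns the majority label `"4a"`, while a wider split falls through to the referee's decision.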
A CAD model was constructed by modifying a hierarchical vision transformer architecture for the localization of breast nodules and BI-RADS-US classification. We introduced a multi-resolution feature extraction process to extract lesion features and classify them through the attention mechanism. The model included four sets of block 1 networks and one set of block 2 networks used for feature extraction from input images at different resolutions. In the first four transformer stages, each block 1 set included a window-based multi-head self-attention mechanism for feature extraction. The last group was composed of two transformer structures, the first of which was used to encode the input feature vector, while the second was used to obtain BI-RADS-US categories (i.e., 3, 4a, 4b, 4c, and 5) by decoding the encoded feature vectors generated in the previous step.

Input images of size 224×224×3 pixels were equally divided into 56×56 image blocks of size 4×4. The images were input to the first block 1 set for feature extraction, producing 56×56×96 feature maps. The first output was then divided into 28×28 image blocks of size 8×8, which were input to the second block 1 set for feature extraction and generation of 28×28×192 feature maps. Similarly, the third and fourth block 1 sets produced 14×14×384 and 7×7×768 feature maps, respectively. The 7×7×768 feature maps were subsequently input to block 2 for BI-RADS-US category classification. The input feature maps for block 1 sets 1–4 were first divided into image blocks of a specified size. The image blocks were then shifted, which modified feature information distributions for blocks of varying sizes, allowing the attention to be focused on a wider area. In block 2, position coding information was first added to the coded image block, which was then converted into a one-dimensional vector and input to another transformer. BI-RADS-US categories were used as queries to detect nodular areas in each category.
Final output results included the detected nodular areas and the BI-RADS-US category to which each nodule belonged. Figure 2 provides a flowchart of the data processing model.
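The resolution and channel progression described above can be verified with a small shape-bookkeeping sketch; the windowed self-attention inside each block is omitted, and only the spatial/channel arithmetic stated in the text is shown:

```python
# Shape walk-through of the four block 1 stages: stage 1 embeds 4x4 patches
# of a 224x224x3 image; every later stage halves the spatial grid and
# doubles the channel count, as described in the text.

def stage_shapes(image_size=224, patch_size=4, base_channels=96, stages=4):
    """Return the (height, width, channels) feature-map shape after each stage."""
    shapes = []
    side = image_size // patch_size      # 224 / 4 = 56
    channels = base_channels             # 96
    for _ in range(stages):
        shapes.append((side, side, channels))
        side //= 2                       # spatial grid halves per stage
        channels *= 2                    # channel count doubles per stage
    return shapes

# stage_shapes() -> [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```

The final 7×7×768 shape matches the feature maps handed to block 2 for classification.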
After CAD model training, 500 breast nodule images with consistent diagnostic results from both the model and the radiologists were randomly selected from the test set to form a sampling test set, consisting of 100 images for each BI-RADS-US 3–5 category. The five radiologists then re-classified the sampling test set by referencing the BI-RADS-US classification results provided by the CAD model. Changes in diagnostic sensitivity, specificity, and accuracy for various nodule categories, including their categorical adjustments, were observed to determine whether classification consistency had improved.
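Assembling the sampling test set amounts to stratified random sampling of 100 images per category; one way this could be done is sketched below (the grouping helper and fixed seed are assumptions for illustration, not the study's code):

```python
# Stratified random draw: `per_category` images for each BI-RADS-US label.
import random
from collections import defaultdict

def build_sampling_set(images, labels, per_category=100, seed=0):
    """Randomly select `per_category` images per label; returns the pooled sample."""
    by_label = defaultdict(list)
    for img, lab in zip(images, labels):
        by_label[lab].append(img)
    rng = random.Random(seed)            # fixed seed for reproducibility
    sampled = []
    for lab in sorted(by_label):
        sampled.extend(rng.sample(by_label[lab], per_category))
    return sampled
```

With the five categories 3, 4a, 4b, 4c, and 5, this yields the 500-image sampling test set described above.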

Results
This study involved 3,317 patients exhibiting 1 breast nodule and 661 patients exhibiting 2 or more breast nodules. Each nodule was represented in 1–90 images, with an average of 5.36 ± 5.91 images. The set of 5,057 total breast nodules included 2,390 benign nodules (10,041 images) and 2,667 malignant nodules (11,291 images), as shown in Table 2. The maximum diameter of the nodules ranged from 0.30 to 10.98 cm, with an average diameter of 2.00 ± 1.13 cm. A total of 34 ultrasonography machines were included in the study (see Figure 3).

The jointly formulated classification criteria (Table 1) included features such as catheter changes (unusual duct diameter or branch-like changes) and edema, enhanced echo of the surrounding tissues, or tissue thickening. Category 4a nodules satisfied one of the listed criteria; category 4b, two criteria or any one item marked with "*"; category 4c, three criteria or any two items marked with "*"; and category 5, four or more criteria.

Table 3 shows the category 3–5 breast nodule distributions for the training set, validation set, and test set. The sensitivity, specificity, and accuracy of CAD for various categories in the test set, determined using the BI-RADS-US classification standards finalized by the referee radiologist, are provided in Table 4. As seen in Table 5, referencing the CAD model significantly improved the diagnostic sensitivity, specificity, and accuracy of the five radiologists (P < 0.05). Among the 21,332 classified images, 1,416 (6.64%) exhibited the same annotations by all five radiologists and 5,620 (26.3%) included a 4:1 inconsistency. A weighted kappa test indicated that the k value between any two radiologists for all images was less than 0.6 (0.33–0.57), representing fair or moderate consistency. The k values for the sampling test set also suggested fair or moderate consistency (0.30–0.62); only the k values of DR1 vs DR2 and DR1 vs DR3 were slightly higher than 0.6. After referencing the CAD model, the sampling test set images were readjusted and the k values increased to above 0.60 (0.67–0.85), as shown in Figure 4.
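The weighted kappa statistic used for these pairwise comparisons measures chance-corrected agreement between two raters, penalizing disagreements by their ordinal distance. A minimal linear-weighted implementation over the five ordinal BI-RADS-US categories might look like the following (a sketch; the study does not specify its weighting scheme):

```python
# Linear-weighted Cohen's kappa for two raters over ordinal categories.
CATEGORIES = ["3", "4a", "4b", "4c", "5"]  # ordinal BI-RADS-US labels

def weighted_kappa(labels_a, labels_b, categories=CATEGORIES):
    """Return the linear-weighted kappa between two raters' label lists."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(labels_a)
    # Observed joint distribution of the two raters' labels
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(labels_a, labels_b):
        obs[idx[a]][idx[b]] += 1.0 / n
    # Marginal distributions; expected agreement assumes independence
    pa = [sum(obs[i][j] for j in range(k)) for i in range(k)]
    pb = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)     # linear disagreement weight
            num += w * obs[i][j]
            den += w * pa[i] * pb[j]
    return 1.0 - num / den
```

Perfect agreement yields kappa = 1, chance-level agreement yields kappa near 0, and values above 0.60 (as reached after referencing the CAD model) are conventionally read as substantial agreement.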

Discussion
This study established a multi-center breast nodule image dataset, which was used to train a transformer-based CAD model to assist radiologists with classification. This approach increased the accuracy, sensitivity, and specificity of BI-RADS-US classification for category 3–5 breast nodules, and diagnostic consistency among the five radiologists was also improved. BI-RADS-US was designed to interpret image features and breast nodule descriptions, striving for more uniform definitions among ultrasound radiologists. However, a clear clinical standard for BI-RADS-US classification has yet to be established, and discrepancies are common in practice, especially for category 3–5 breast nodules, affecting the diagnosis and treatment of disease.
In this study, classification criteria were agreed upon in advance to reach a common consensus. However, kappa values across the 21,332 images were at most 0.574, consistent with the findings of Berg et al and Jales et al. [18,19] This demonstrates that individual radiologists produced varying BI-RADS-US classification results, due to differences in the recognition of specific image features, despite preemptively established diagnostic standards. To improve classification consistency, radiologists in the study of Berg et al provided feedback to correct their results and increased kappa values from 0.53 to 0.59. That study also suggested that immediate feedback was helpful for correcting classifications, thereby improving consistency regardless of radiologist experience.
The emergence of artificial intelligence (AI) has provided a mechanism for avoiding subjective effects on the classification of breast nodules. By training with large quantities of data, AI forms a unique algorithm that provides radiologists with objective reference information for disease diagnosis. This study involved 14,296 (67.01%) images in which the final result could not be determined by simply following the majority opinion. The referee radiologist finalized classification by referencing pathological results, to provide a more accurate dataset for the CAD model. This approach produced an accuracy in the identification of category 3–5 breast nodules comparable to that of the referee radiologist, indicating improved classification capabilities. Lee et al also demonstrated that classification accuracy for benign and malignant breast nodules was significantly higher with the aid of a CAD model: the area under the curve (AUC) for inexperienced radiologists increased from 0.65 to 0.71, while the AUC for the experienced group increased from 0.83 to 0.84. [20] Similar studies have shown that young radiologists and radiologists lacking experience in breast ultrasound diagnosis benefited significantly from the use of a CAD model, especially for category 4a breast nodules, thereby minimizing unnecessary biopsies. [21,22] The radiologists participating in this study each had several years of experience, and classification consistency among them reached kappa values greater than 0.6 after referencing the CAD model. In addition, the sensitivity, specificity, and accuracy of classification improved with the use of CAD, informing adjustments to the initial diagnoses. A reduction to category 3 occurred in 54.4 instances on average, exhibiting a consistency rate of 81.99% with pathological diagnoses. These categorical decreases avoid unnecessary biopsies, medical costs, and psychological burden for patients. [16] An increase to category 4a or above occurred in 29.4 instances on average, exhibiting a consistency rate of 46.26% with pathological diagnoses; this rate below 50% may be related to the wide range of positive predictive values for category 4a–5 breast nodules (2–95%). These categorical increases provide patients and surgeons with a more credible basis for biopsy or surgery, thereby avoiding delays in the diagnosis and treatment of breast cancer. Results also indicated CAD to be a useful diagnostic tool for senior radiologists.
This multi-center study included images collected from 20 hospitals in China and involved a variety of cases and pathological results. Conventional ultrasonography systems were used, providing a foundation for the establishment of a model with good robustness and generalizability. The number of images for a single breast nodule ranged from 1 to 90, which may be a result of varying image-collection routines used by radiologists at the same or different hospitals. Unlike previous studies, [6,8,23–25] image features such as nodule shape, orientation, edges, margin, internal echo, posterior echo, and variability in surrounding tissue were not extracted as part of this study, in order to minimize the influence of human-selected rules in the machine learning algorithm. Rather, this study relied on task-specific feature extraction and the long-distance feature capture capabilities of the self-attention and multi-head mechanisms used in transformer technology, to achieve end-to-end autonomous learning. [26] A primary limitation of this study was basing breast nodule classification on individual images rather than individual cases, since multi-section scanning is typically adopted in clinical practice. As a result, different sections of the same nodule may produce different classification results due to variations in image features. A correlation analysis was not conducted between the classification and diagnostic results for the same nodule in different sections. In addition, previous studies have shown that AI combined with 2-dimensional ultrasonography and blood flow data from color Doppler ultrasonography may achieve better diagnostic results for benign and malignant breast nodules. [8] This study did not involve blood flow data, spectral Doppler, elastography, or contrast-enhanced ultrasonography, which should be included in future research to further improve the CAD model.

Conclusions
This multi-center study introduced a CAD model, based on transformer technology, offering high accuracy for the classification of BI-RADS-US 3–5 nodules. Application of this model significantly improved diagnostic consistency among radiologists, confirming the value of AI for breast ultrasonography.

Consent for publication
Not applicable.

Availability of data and materials

Due to the privacy of patient data, the dataset generated or analyzed in this study is not available to the public. The authors will provide relevant data upon reasonable request.

Figure 1

The annotation platform. The yellow rectangular outline in the middle of the right panel indicates the marked nodular area. BI-RADS-US classification was performed in the upper right corner (indicated by the yellow arrow)

Figure 2
A flowchart for data processing in the CAD model. The name and primary structure of each data processing block are included in the flowchart.

Figure 3
Various ultrasonography machines and corresponding image quantities used in the study.

Figure 4

k values for the full dataset and the sampling test set, before and after referring to the CAD model. The green, blue, and orange histograms represent k values for the whole dataset, the sampling test set before referring to the CAD model, and the sampling test set after referring to the CAD model, respectively. The p values were all below 0.05.