1. Study design and participants
This is a multicenter study with both retrospective and prospective components. The study was conducted in accordance with the principles of the Declaration of Helsinki, and the study protocol was approved by the Ethics Committee of our hospital (2019KT35). Informed consent was waived for the retrospective part and obtained from all participants for the prospective part (Reg. No. NCT03708978). The study consisted of three parts. The first part was the retrospective construction and internal verification of the AI system. The second part tested whether the diagnostic efficiency of doctors assisted by the AI system was higher than that of doctors alone. The third part prospectively verified the effect of the system in multicenter clinical practice (see Figure 1).
1.1 Participants of the first part
We retrospectively enrolled patients who were admitted to our hospital for screening clinical symptoms from October 1, 2014 to September 30, 2016. Figure 1 shows the study flowchart. Table S1 shows the study sites and patients enrolled in this part. Patients were included if they had complete clinical data, mammogram data, and either a pathological diagnosis or more than 2 years’ follow-up after the first examination. Patients were excluded if image quality was insufficient for segmentation or if the location of the lesion was inconsistent between the mammogram and the pathological results.
1.2 Participants of the second part
To determine the effectiveness of the model for improving the accuracy of diagnosis, we collected mammograms from six centers from October 1 to 31, 2015, and conducted a complete cross-sectional evaluation of the developed diagnostic system with participation of 12 radiologists.
Figure S1 illustrates the collection of mammography data and the selection of participants. Power and sample size were estimated with the step-by-step procedure proposed by Hillis et al. for planned multi-reader receiver operating characteristic (ROC) studies. For 12 evaluators and a power of at least 0.80, detecting an area under the curve (AUC) difference of 0.05 required 200 mammograms (70 pathologically confirmed malignant cases, 30 false-positive cases confirmed by pathology or follow-up, and 100 negative cases).
To ensure enough mammograms to reach the final sample size, we collected at least 14 cancer patients, 6 false-positive patients, and 20 negative patients in each center (data collected from centers E and F were combined because of the small number of cases in those centers). The inclusion and exclusion criteria are shown in Figure 1.
To ensure image quality, all cases in this part were reviewed by three radiologists with more than 25 years of experience in mammography. Pathology or follow-up results were available for each case. After review, 3 patients with unqualified image quality and 66 patients with very obvious symptoms of breast cancer were excluded.
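The per-center minimums can be checked against the 200-case target with a short calculation. This is only an arithmetic sketch: it assumes that pooling centers E and F leaves five collection units, and the labels A to D are placeholders for centers that the text does not name.

```python
# Per-unit minimum case counts quoted in the text.
PER_UNIT_MINIMUM = {"malignant": 14, "false_positive": 6, "negative": 20}

# Assumption: six centers, with E and F pooled, give five collection units.
# "A".."D" are placeholder names, not taken from the study.
COLLECTION_UNITS = ["A", "B", "C", "D", "E+F"]


def minimum_totals(per_unit, units):
    """Multiply each per-unit minimum by the number of collection units."""
    return {kind: count * len(units) for kind, count in per_unit.items()}


totals = minimum_totals(PER_UNIT_MINIMUM, COLLECTION_UNITS)
print(totals, sum(totals.values()))
```

Under this assumption the minimums reproduce the 70 malignant, 30 false-positive, and 100 negative cases required by the power calculation.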
1.3 Participants of the third part
To further investigate the clinical application of the model, we prospectively applied it in six centers. Patients undergoing mammography in each center were prospectively and consecutively enrolled from April 5, 2018 to May 4, 2018. The inclusion and exclusion criteria are shown in Figure 1. There were no specific exclusion criteria in terms of demographic or clinical characteristics for participants without lesions.
2. Quality control of mammogram images
All mammogram images were stored using a picture archiving and communication system (PACS) in digital imaging and communications in medicine (DICOM) format. The two standard views were the craniocaudal (CC) and the mediolateral oblique (MLO). To ensure image quality, all cases were reviewed by 3 radiologists with more than 25 years of experience in mammography.
All pathological results were obtained from the pathology report and reviewed by an experienced pathologist. Pathological tissue was obtained by hollow needle biopsy or surgery and was stained with hematoxylin and eosin (H&E).
3. Radiologist’s annotations
Six certified and experienced radiologists, each with at least 5 years of experience (range, 5-10 years) and who had read an average of 250,000 mammograms, annotated the images. The six radiologists were trained on 800 mammograms before independently drawing regions of interest (ROIs). The delineation principles were as follows: (1) manual delineation along the edge of the lesion; (2) inclusion of all suspicious parts of the tumor in the sketch; (3) inclusion of spiculations (burrs) at the edge as far as possible; (4) when the label was generated, the characteristics of the lesion were marked according to the Breast Imaging-Reporting and Data System (BI-RADS) (2013 edition), including lesion type (mass, calcification, structural distortion, asymmetry), distribution characteristics, and pathological or follow-up results. In case of doubt, the radiologist consulted three other experienced radiologists, and a decision was reached after discussion.
4. Algorithm development
Following the successful application of deep learning (DL), we established the model (http://mgshow.yizhun-ai.com/), which contains several modules for automatic analysis of mammograms. It comprises three deep neural models: the lesion detection module, the lesion matching module, and the malignant degree assessment module, which together constitute a complete system for breast lesion analysis (Figure S2). The overview of our system is illustrated in Figure 2.
(a) Lesion detection module
We use Faster R-CNN to detect suspicious lesions in all of the images of one patient. Faster R-CNN is one of the state-of-the-art methods in object detection. It has two stages: the first stage generates box proposals, and the second stage refines the box localization and predicts the class of each object. We use ResNet-50 as the backbone network and adopt a feature pyramid network to improve the detection of small lesions.
Because breast images are very large and contain background areas with no information, we pre-process the mammogram images before sending them to the neural networks. We crop the foreground area of each image with a simple thresholding method and then resize the images to a pixel spacing of 0.15 mm. As shown in 1, the detector takes the four images of different views as inputs, and outputs bounding boxes and lesion classes (i.e., mass and calcification) for detected suspicious lesions. In our problem, mass and calcification can appear at the same location, so we use a sigmoid function instead of softmax to generate the objectness score for each class. This modification allows an object to be identified as both mass and calcification. In practice, if a predicted box has high confidence in both mass and calcification, we call this lesion a mass with calcification.
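The pre-processing and the per-class sigmoid decision described above can be sketched as follows. This is a minimal illustration in plain Python, not the system's implementation: the threshold values, the example source pixel spacing, and all function names are assumptions introduced here.

```python
import math


def crop_foreground(image, threshold=0):
    """Crop a 2-D image (list of rows) to the bounding box of pixels > threshold."""
    rows = [r for r, row in enumerate(image) if any(v > threshold for v in row)]
    cols = [c for c in range(len(image[0])) if any(row[c] > threshold for row in image)]
    if not rows or not cols:
        return image  # nothing above the threshold, leave the image unchanged
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]


def resized_shape(shape, spacing_mm, target_spacing_mm=0.15):
    """Pixel dimensions after resampling so that one pixel covers target_spacing_mm."""
    scale = spacing_mm / target_spacing_mm
    return tuple(max(1, round(n * scale)) for n in shape)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def label_lesion(mass_logit, calc_logit, threshold=0.5):
    """Score each class independently, so one box can carry both labels."""
    is_mass = sigmoid(mass_logit) >= threshold
    is_calc = sigmoid(calc_logit) >= threshold
    if is_mass and is_calc:
        return "mass with calcification"
    if is_mass:
        return "mass"
    if is_calc:
        return "calcification"
    return "background"
```

With softmax the two class scores would compete and sum to one, so a box could not be confidently both mass and calcification; independent sigmoids remove that constraint.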
(b) Lesion matching module
The matching module indicates whether a pair of detected candidates are different views of the same lesion. In the clinical practice of mammogram examination, it is essential to combine the information of multiple views (MLO and CC). Most of the time, a lesion can be recognized in both the MLO and CC views. If a mass can be found in only one view, radiologists may consider that it is caused by overlapping glands rather than by a lesion. Following this principle, it is natural to reduce false positives in the CAD system by matching lesions between the MLO and CC views.
In our model, we use a neural model to conduct lesion matching. The matching model follows the detector and takes the features of the detected proposals of suspicious lesions as input. We use the vertex coordinates, the sizes of the proposals, the probabilities of each class, and the depth of the proposals in the gland as input features. Because matching should use the information of all proposals, we use an attention model to predict the relationship of all lesion pairs. The input of the model is the concatenated features mentioned above, and the output is the probability of being a real lesion pair for every possible pair. Lesions with low probabilities are removed during the output process.
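The idea of scoring every CC/MLO proposal pair and discarding unmatched candidates can be illustrated with a scaled dot-product score, the basic operation behind attention. This is a simplified stand-in, not the attention model used in the system: the projection matrices, the feature vectors, and the 0.5 cut-off are all hypothetical.

```python
import math


def project(vec, weights):
    """Apply a linear projection given as a list of weight rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def pair_probabilities(cc_feats, mlo_feats, w_query, w_key):
    """Return probs[i][j]: probability that CC proposal i and MLO proposal j match."""
    dim = len(w_query)
    queries = [project(f, w_query) for f in cc_feats]
    keys = [project(f, w_key) for f in mlo_feats]
    probs = []
    for q in queries:
        row = []
        for k in keys:
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            row.append(1.0 / (1.0 + math.exp(-score)))  # sigmoid of the scaled score
        probs.append(row)
    return probs


def keep_matched(probs, threshold=0.5):
    """Indices of CC and MLO proposals whose best pair probability reaches the threshold."""
    n_mlo = len(probs[0]) if probs else 0
    keep_cc = [i for i, row in enumerate(probs) if row and max(row) >= threshold]
    keep_mlo = [j for j in range(n_mlo) if max(row[j] for row in probs) >= threshold]
    return keep_cc, keep_mlo


# Toy example with identity projections (placeholder for learned weights).
identity = [[1.0, 0.0], [0.0, 1.0]]
probs = pair_probabilities([[2.0, 0.0], [-2.0, 0.0]], [[2.0, 0.0]], identity, identity)
print(keep_matched(probs))
```

In a trained model the projections would be learned and the feature vectors would hold the coordinates, sizes, class probabilities, and depth described above.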
(c) Malignant degree assessment module
We use a CNN based on ResNet to estimate the malignant degrees of lesions. In our model, we treat malignant degree assessment as an ordinal regression problem. Ordinal regression algorithms solve multi-class classification problems in which the labels have a strong ordinal relationship. In our problem, BI-RADS can represent a lesion’s degree of malignancy. BI-RADS sometimes provides more information than pathological results, since pathological results only tell us whether a lesion is malignant, whereas BI-RADS indicates how likely a lesion is to be malignant. Therefore, we use BI-RADS to train our model. Experimentally, with large amounts of BI-RADS annotations confirmed by experts, we find that the performance of our system is better than when pathological results are used as labels, even when we evaluate the system against the pathological results.
Following previous work, we solve the ordinal regression problem by combining several binary classification problems. We choose ResNet-18 as our backbone, which is one of the state-of-the-art classification models in deep learning. In our data, there are 8 labels ('false positive', 'BI-RADS 2', 'BI-RADS 3', 'BI-RADS 4A', 'BI-RADS 4B', 'BI-RADS 4C', 'BI-RADS 5' and 'BI-RADS 6'). Since few lesions in our training data are 'BI-RADS 2' or 'BI-RADS 6', we treat 'BI-RADS 2' the same as false-positive candidates and merge 'BI-RADS 6' with 'BI-RADS 5'. Therefore, our model outputs 5 logits for each lesion: the first logit predicts whether the BI-RADS of a lesion is larger than 'BI-RADS 3', the second logit predicts whether it is larger than 'BI-RADS 4A', and so on. Since we want the network to output the probability that a lesion is malignant, we add a fully connected layer to process the result of the ordinal regression, which can be seen as a simple linear combination.
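Ordinal regression by binary decomposition can be sketched as below. This is an illustration under stated assumptions: the six merged categories are ranked in the order listed, logit k is taken to answer "is the rank greater than k", and the weights of the final linear layer are placeholders. The exact threshold assigned to each logit in the actual system may differ.

```python
import math

# Assumed rank order after merging rare categories.
RANKS = ["false positive", "BI-RADS 3", "BI-RADS 4A", "BI-RADS 4B", "BI-RADS 4C", "BI-RADS 5"]
MERGE = {"BI-RADS 2": "false positive", "BI-RADS 6": "BI-RADS 5"}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def encode(label):
    """Turn a label into K-1 cumulative binary targets: target k is 1 if rank > k."""
    rank = RANKS.index(MERGE.get(label, label))
    return [1 if rank > k else 0 for k in range(len(RANKS) - 1)]


def decode(logits):
    """Predicted rank is the number of binary decisions answered 'yes'."""
    rank = sum(1 for z in logits if sigmoid(z) > 0.5)
    return RANKS[rank]


def malignancy_probability(logits, weights, bias):
    """Linear combination of the ordinal logits followed by a sigmoid."""
    return sigmoid(sum(w * z for w, z in zip(weights, logits)) + bias)


print(encode("BI-RADS 4A"))
print(decode([3.0, 2.0, -1.0, -2.0, -4.0]))
```

Because the targets are cumulative, a mistake between neighboring categories changes only one binary target, which is how the encoding expresses the ordering of the labels.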
The online demo is shown in the appendix. To train the models, the collected mammograms were divided chronologically into a training dataset (~80%) and a validation dataset (~20%). We trained the models during the first part of our study and further evaluated the established system in the next two parts.
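A chronological split can be sketched as follows: examinations are ordered by date and the earliest ~80% are used for training. The record layout (a dict with a "date" key) and the function name are assumptions made for illustration.

```python
from datetime import date


def chronological_split(exams, train_fraction=0.8):
    """Sort exams by date; the earliest fraction trains, the rest validates."""
    ordered = sorted(exams, key=lambda exam: exam["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical records, one per month.
exams = [{"id": i, "date": date(2015, 1 + i, 1)} for i in range(10)]
train, validation = chronological_split(exams)
print(len(train), len(validation))
```

Splitting by time rather than at random keeps every validation case later than every training case, which mimics how the model would meet new patients.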
5. Auxiliary efficacy of the model
We evaluated the effectiveness of the model in detecting and diagnosing mammograms by monitoring the performance of 12 radiologists under different reading conditions (see Figure S3).
The 12 radiologists had an average of 9.5 years (range, 3 to 25 years) of experience, were certified under the Mammography Quality Standards Act, and had read more than 5000 mammograms per year over the past two years.
The 12 radiologists were blinded to any information about the patients, including prior imaging and histopathological reports. The assessment consisted of two stages. Each radiologist received separate training prior to the first evaluation. The purpose of the training was to familiarize radiologists with the evaluation criteria and with the functions and operation of the AI-aided diagnosis model. In addition, the 12 radiologists were informed that the rate of malignancy in the assessed dataset was higher than in clinical practice.
For each case, the radiologists assigned a BI-RADS category (range, 1-5) and labeled the suspicious lesion as benign or malignant; normal patients without lesions were counted as negative. The radiologists also scored each case on a difficulty scale of 1-9 (9 represents the highest difficulty).
The evaluation was undertaken on an in-house developed workstation, using a 12-MP Mammography Display System that was calibrated to the medical grayscale standard display function of digital imaging. Radiologists read the images with the AI system, which allowed them to freely adjust the window width and window level and to zoom and pan. Ambient lighting was set to about 45 lux.
6. Prospective clinical applications of the model
Prior to the application of the model in each center, nineteen radiologists in the six centers were trained to use the model on 200 cases. Their median experience in mammography diagnosis was 9.5 years (range, 5–26 years), and the mean number of mammograms read each year during the past 2 years was approximately 6500 (range, 1400–13 000). The purpose of the training was to make all the radiologists proficient with the operating system and application interface, so that they could use the model freely in routine clinical mammography.
Mammograms were read by radiologists with the DL model at six centers. The model automatically identified suspicious lesions and reported a percentage of malignancy for reference, and it also automatically generated structured reports. The reading time of each case was automatically recorded by the system. Pathological and follow-up results were taken as the gold standard for the diagnosis of benign and malignant lesions, and the consensus of three radiologists with more than 20 years of experience was taken as the gold standard for the detection of lesions, so as to observe the clinical effect of the DL model.
7. Statistical analysis
The Clopper–Pearson method was applied to calculate exact confidence intervals for the accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of the model in detecting and diagnosing mammographic lesions (see Appendix). We used the free-response receiver operating characteristic (FROC) curve to indicate the detection ability of the model and to further analyze its diagnostic ability in different types of lesions. The ROC curve was plotted, and the AUC was used to evaluate the diagnostic performance of the model. All statistical tests were two-sided with a significance level of 0.05. Statistical analyses were performed using R version 3.5.1.
The end point was the comparison of the AUC, sensitivity, specificity, and reading time of the 12 radiologists when reading independently and when reading with the model. P<0.05 indicated a statistically significant difference between the two reading conditions. In the present study, if a radiologist did not mark the malignant lesion within the true quadrant of the lesion, the case was recorded as negative for that reader.
The reading time of each case was automatically measured by the workstation software. The paired sample t-test or Wilcoxon rank-sum test was used to compare the average reading time under the two reading conditions (reading alone and reading with the model), and the relationship between reading time and difficulty score was further analyzed. For this analysis, outliers (defined as more than 1.5 times the standard deviation of the data) were removed.
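The Clopper–Pearson interval is the exact binomial confidence interval for a proportion such as sensitivity or specificity. The sketch below computes it in plain Python by bisection on the binomial tail probabilities. It is illustrative only: the study used R, and the confusion-matrix counts in the example are invented.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root(f, iterations=100):
    """Bisection on [0, 1] for a function that increases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes out of n trials."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _root(lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _root(lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper


def diagnostic_counts(tp, fp, tn, fn):
    """(successes, trials) pairs for each reported proportion."""
    return {
        "accuracy": (tp + tn, tp + fp + tn + fn),
        "sensitivity": (tp, tp + fn),
        "specificity": (tn, tn + fp),
        "PPV": (tp, tp + fp),
        "NPV": (tn, tn + fn),
    }


# Invented counts, for illustration only.
for name, (k, n) in diagnostic_counts(tp=90, fp=20, tn=80, fn=10).items():
    low, high = clopper_pearson(k, n)
    print(f"{name}: {k / n:.3f} ({low:.3f}, {high:.3f})")
```

In R, `binom.test` reports the same exact interval.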
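The outlier rule and the paired comparison can be sketched as follows. The text does not say what the 1.5 standard deviations are measured from, so this sketch assumes the distance from the sample mean; the reading times are invented, and only the paired t statistic is computed (a p-value would additionally need the t distribution with n - 1 degrees of freedom).

```python
from math import sqrt
from statistics import mean, stdev


def inlier_indices(values, k=1.5):
    """Indices of values within k sample standard deviations of the mean (assumed rule)."""
    centre, spread = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - centre) <= k * spread]


def paired_t_statistic(alone, assisted):
    """t statistic of the per-case differences between the two reading conditions."""
    diffs = [a - b for a, b in zip(alone, assisted)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Invented reading times in seconds; the last case is an outlier.
alone = [60, 62, 58, 61, 59, 200]
assisted = [50, 51, 49, 52, 48, 55]

keep = inlier_indices(alone)
t_value = paired_t_statistic([alone[i] for i in keep], [assisted[i] for i in keep])
print(keep, round(t_value, 2))
```

Filtering on the same case indices in both conditions keeps the remaining observations paired.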
8. Outcomes and follow-up
Definition of malignant lesions: a lesion was defined as malignant if the same lesion received a pathological diagnosis of malignancy within 2 years of the patient's first mammogram at the hospital. Definition of benign lesions: (1) the pathological diagnosis of the same lesion within 2 years was benign; or (2) the patient was followed up for more than 2 years, and mammography performed more than 2 years after the first mammography examination indicated a benign lesion, without pathological diagnosis. The follow-up plan is given in the supplementary files.
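The outcome definitions can be written as a small decision rule. This is a sketch: the argument names, the use of months as the time unit, and returning None for cases that meet neither definition are assumptions made for illustration.

```python
def reference_label(pathology=None, months_to_pathology=None,
                    followup_months=0, followup_mammogram_benign=False):
    """Return 'malignant', 'benign', or None when neither definition applies."""
    has_pathology = pathology is not None and months_to_pathology is not None
    # Pathological diagnosis of the same lesion within 2 years decides the label.
    if has_pathology and months_to_pathology <= 24:
        return pathology
    # Without pathology: benign on mammography after more than 2 years of follow-up.
    if pathology is None and followup_months > 24 and followup_mammogram_benign:
        return "benign"
    return None


print(reference_label(pathology="malignant", months_to_pathology=3))
print(reference_label(followup_months=30, followup_mammogram_benign=True))
print(reference_label(followup_months=12, followup_mammogram_benign=True))
```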