A computer-aided mass diagnosis system based on perceptive features learned from quantitative mammography radiology report: An observer-based study

Background: Computer-aided diagnosis (CAD) systems can provide a reference for radiologists in breast mass classification. This study aimed to verify whether a CAD model based on perceptive features learned from quantitative BI-RADS descriptions can help radiologists improve their diagnostic performance for breast masses in mammography. Methods: A retrospective multi-reader multi-case (MRMC) study was conducted to evaluate a CAD model established on perceptive features. Digital mammograms of 416 patients with breast masses were collected from 2014 to 2017, including 231 benign and 185 malignant masses. Of these, 214 (109 benign, 105 malignant) were selected randomly to train the CAD model, which consisted of a perceptive feature extractor and a classifier. The remaining 202 patients formed the test set, from which 51 patients (29 benign and 22 malignant) were selected for the observer study. Six radiologists, divided into three groups (junior, middle-seniority, and senior), evaluated the 51 patients without and with support from the CAD model. BI-RADS category, benign or malignant diagnosis, probability of malignancy, and diagnosis time were recorded during the two evaluation sessions. Results: In the MRMC evaluation, the average AUC of the six radiologists with CAD support was significantly higher than that without support (0.896 vs. 0.850, p = 0.02). Both average sensitivity and average specificity increased (p = 0.0253). More cases were assessed as BI-RADS 4 and fewer as BI-RADS 2 or 3. Five radiologists showed comparable diagnosis time per case with and without CAD support, and one radiologist showed a significant decrease when the CAD model was involved. Conclusion: The CAD model could improve radiologists' diagnostic performance for breast masses without increasing diagnosis time.


Background
Full-field digital mammography (FFDM) is considered an effective method for breast cancer screening [1,2]. In both developed and developing countries, it has become the first option for routine medical examination [3]. However, the growing number of screening examinations in women has increased radiologists' workload. The overall diagnosis time per patient has shown an upward trend, indicating decreased work efficiency of radiologists [4]. Lu et al. showed that, given China's huge population, the increasing workload of radiologists was accompanied by a decline in work efficiency [5]. In addition, Karssemeijer et al. illustrated a positive correlation between the increased workload of radiologists and the demand for breast screening [6].
Due to the lack of experienced radiologists, junior radiologists participate in screening prematurely without adequate training, resulting in lower diagnostic accuracy and sensitivity for breast cancer and an increased risk of misdiagnosis and missed diagnosis [7,8,9]. Since Chinese women generally have dense breasts, it is even more difficult for junior radiologists to recognize the characteristics of breast cancer, especially the margins and shape of cancers presenting with a mass as the main sign [10,11,12]. Friedewald et al. illustrated that radiologists, especially inexperienced junior ones, showed decreased diagnostic sensitivity in dense breasts [13]. Broeders et al. argued that junior radiologists were inexperienced with the characteristics of breast cancer masses, which led to inaccurate BI-RADS category assessment and affected patient prognosis [14].
Computer-aided diagnosis (CAD) systems were introduced as auxiliary methods to improve radiologists' diagnostic efficiency. For the breast mass classification task, feature extraction is an important step. Conventional CAD methods extract several hand-crafted features from the region of interest (ROI) to form a feature vector for each mass; these features fall into three types: intensity, shape, and texture [15]. In addition, deep learning has recently been used for feature design. Deep learning models can learn latent features directly from the ground truth, so that more representative features can be designed [16,17,18]. For example, Jiao et al. used a convolutional neural network (CNN) pre-trained on ImageNet as a feature extractor for breast masses in breast cancer diagnosis [19]. Kooi et al. combined deep features with conventional hand-crafted features to distinguish true masses from normal mammary tissue. Their results showed that the combined feature set performed best in the classification stage [20].
Hand-crafted features are designed based on human experience, but they are not task-specific, so they may not perform well in medical image analysis. A deep learning model learns features by optimizing its weights according to the task objective, but this procedure lacks human experience as a reference, which is important in the clinic. To combine radiologists' clinical experience with deep learning in designing proper features for mass diagnosis, this study proposed a training scheme that takes BI-RADS descriptions into consideration. These features are referred to as perceptive features, and a CAD model was established based on them. An observer study was also conducted to evaluate the use of this model in assisting radiologists in diagnosis. The results showed that radiologists, especially junior ones, significantly improved their performance with the support of our proposed model without increasing their workload.

Materials and methods
To ensure that a deep learning-based model could obtain sufficient quantitative ability, we used the BI-RADS characteristic descriptions of a mass to train a CNN as a perceptive feature extractor. To realize this goal, description quantification, feature extractor training, and classifier training were needed. An observer study was then conducted to verify the clinical utility of this model. These steps are described in detail as follows.

Dataset and mammogram collection
We retrospectively retrieved samples between April 2014 and October 2017 from Nanfang Hospital of Southern Medical University, Guangzhou, China. Since this study aimed to establish a CAD model for benign and malignant classification of breast masses, each collected case had only one mass in a unilateral breast. The pathological result of each mass was used as the classification ground truth. Cases with calcification, architectural distortion, or asymmetries were not considered, to avoid ambiguous ground truth, because each case had only one biopsy result. There was no positive sign in the contralateral breast (BI-RADS category 1 or 2). In addition, all collected cases had bilateral craniocaudal (CC) and mediolateral oblique (MLO) images, clinical medical history, radiology reports, and operative and pathological findings. Women with implants, lesions not fully visible, or large lesions occupying almost the whole breast in the CC and/or MLO mammograms were excluded.
In total, 416 cases met the above inclusion criteria. Of these, 214 were chosen randomly as the training set, which was used to train the feature extractor and the mass classification model. The remaining 202 cases were used as an independent test set to evaluate the model, out of which 51 cases were randomly selected for the observer study. All cases were anonymized and represented by new IDs. The characteristics of the population and digital mammograms are shown in Table 1. Digital mammography was performed using the Selenia Dimensions system (Hologic, Bedford, MA, USA). The size of each image was 3328 × 2560 pixels with a pixel spacing of 0.06 mm.
Other detailed information about these 416 cases is provided in Appendix A.

Region of interest selection
The ROIs of all masses were marked by an experienced radiologist (X. Liao, with 15 years' experience in digital mammography) by delineating the mass boundaries. Three radiologists (GG. Qin, L. Zhang, and WG. Chen, also with 15 years' experience) reviewed the ROIs; none of them took part in the observer study. If there was any disagreement about the location or shape of a mass among these three radiologists, the final ROI was determined by voting.

Mass classification model
Quantification of the BI-RADS description
We hoped that our CAD model could extract perceptive features that reflect semantic characteristics such as human visual perception and diagnostic experience. This experience is reflected in radiology reports. Several descriptions defined in the BI-RADS lexicon for a mass, including shape, margin, and density, are the main factors radiologists use to diagnose breast cancer. Different descriptions correspond to different probabilities of malignancy. For example, irregular shapes or indistinct margins are correlated with suspicious findings, whereas oval shapes or circumscribed margins are correlated with benign findings. To employ these descriptions in perceptive feature design, we use them as the ground truth to train a regression network; the feature vector in the last fully connected layer can then be regarded as the desired perceptive feature. However, the training procedure requires quantitative ground truth instead of text descriptions, so these descriptions must be quantified. To relate the quantification to the classification task, we quantify descriptions as a malignancy probability: descriptions of malignant, uncertain, and benign findings are quantified as 1, 0.5, and 0, respectively. Details of the quantification are shown in Table 2 and an example is shown in Figure 1.
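As a minimal sketch, this quantification step reduces to a lookup table mapping descriptors to 0, 0.5, or 1. The descriptor lists and the three attributes below are illustrative assumptions; the authoritative mapping (covering all five quantified descriptions) is defined in Table 2.

```python
# Illustrative sketch of the BI-RADS description quantification. The
# descriptor lists here are assumptions; Table 2 defines the actual mapping.
# Benign-leaning findings map to 0, uncertain to 0.5, malignant-leaning to 1.
DESCRIPTOR_SCORES = {
    "shape": {"oval": 0.0, "round": 0.5, "irregular": 1.0},
    "margin": {"circumscribed": 0.0, "obscured": 0.5,
               "indistinct": 1.0, "spiculated": 1.0},
    "density": {"low": 0.0, "equal": 0.5, "high": 1.0},
}

def quantify(report):
    """Convert a parsed report, e.g. {"shape": "irregular", ...},
    into a numeric ground-truth vector for the regression network."""
    return [DESCRIPTOR_SCORES[attr][report[attr]]
            for attr in ("shape", "margin", "density")]

print(quantify({"shape": "irregular", "margin": "spiculated", "density": "high"}))
# -> [1.0, 1.0, 1.0]
```

Each training mass thus receives a fixed numeric target vector derived purely from its report text, before any image is seen by the network.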

Stage 1: Feature Extractor
The backbone of the feature extractor is the classical CNN VGG16 [21]. It was originally designed to classify objects in natural images and has played an important role in computer-aided diagnosis. In this study, we made slight modifications to VGG16 to meet our needs.
A 2-channel patch of size 288 × 288 centered at the mass is the input of the network. One channel is the original FFDM and the other is a binary mask representing the ROI. The remaining convolution layers, activation functions, and pooling layers are the same as in the original VGG16 network. Then, three fully connected layers with rectified linear unit (ReLU) activations and dropout operations convert the feature map into a feature vector. Finally, the network outputs a 5-dimensional vector representing the predicted quantitative descriptions of the input mass. The specific architecture of this feature extractor is shown in Figure 2.
Mean squared error was used as the loss function. The weights of each layer were initialized randomly from a standard normal distribution and updated by an Adam optimizer during training. Both CC-view and MLO-view masses were fed into the same feature extractor. During training, the masses of the same patient in CC-view and MLO-view FFDM shared the same quantitative BI-RADS descriptions.
After the VGG16 network was trained, we discarded its output layer so that the remaining network outputs a 128-dimensional vector, which is the perceptive feature vector we need.

Stage 2: Benign and malignant classification
These features are then used to train a classifier that distinguishes benign from malignant masses. A stepwise regression feature selection method and a linear discriminant analysis (LDA) classifier were employed for this purpose.
In the training of stepwise regression and LDA, we did not differentiate between masses from CC-view and MLO-view images; that is, a lesion-wise classification model was considered. In the test process, the malignancy probabilities output by the model for the CC-view and MLO-view images of the same case were averaged for case-wise evaluation.
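The second stage can be sketched as follows, substituting scikit-learn's LDA for the authors' MATLAB implementation and using synthetic feature vectors (assumed already reduced by stepwise selection):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy data: 40 lesions x 10 selected perceptive features,
# labels 0 = benign, 1 = malignant (synthetic, for illustration only).
rng = np.random.default_rng(0)
y_train = np.repeat([0, 1], 20)
X_train = rng.normal(size=(40, 10)) + y_train[:, None] * 0.8

# Lesion-wise classifier: CC-view and MLO-view lesions are pooled.
clf = LinearDiscriminantAnalysis().fit(X_train, y_train)

def case_pom(feat_cc, feat_mlo):
    """Case-wise probability of malignancy: the lesion-wise LDA
    probabilities from the CC and MLO views are averaged."""
    poms = clf.predict_proba(np.vstack([feat_cc, feat_mlo]))[:, 1]
    return float(poms.mean())

pom = case_pom(rng.normal(size=(1, 10)), rng.normal(size=(1, 10)))
assert 0.0 <= pom <= 1.0
```

Averaging the two views yields one POM per case, which is the quantity used in all subsequent evaluations.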

Model Selection and Test
Ten-fold cross-validation was used. The 214 training cases were randomly divided into ten folds. In each run, nine folds were used as the training set and one fold as the validation set. The feature extractor was trained until the loss on both sets plateaued. After ten repetitions, every fold had been used as the validation set once and ten trained models were obtained.
During the test process, the ten trained models were applied to the 202 independent test cases, yielding ten predicted scores for each case.
Model fusion often yields a model that outperforms each individual one. To fuse the ten trained models, the average probability of malignancy (POM) across the ten models was calculated for each case and used in the multi-reader multi-case (MRMC) evaluation and the stand-alone study.
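The fusion step reduces to a per-case average over the ten model outputs, for example:

```python
import numpy as np

def fuse_poms(per_model_poms):
    """per_model_poms: array of shape (n_models, n_cases);
    returns the fused POM for each case."""
    return np.asarray(per_model_poms).mean(axis=0)

# Toy example with two models and three cases.
poms = np.array([[0.9, 0.2, 0.6],
                 [0.8, 0.1, 0.7]])
print(fuse_poms(poms))  # [0.85 0.15 0.65]
```

In the study the same averaging is applied over the ten cross-validation models rather than two.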

Observer evaluation
MRMC evaluation
The evaluation was separated into two sessions, one without our model's results as a reference and one with this reference. Both sessions were performed on the same 51 cases. The interval between the two sessions was more than 15 days.
To account for intra-reader variability over time on the same cases, more than one radiologist participated in this observer study. In total, six radiologists performed the evaluation: two junior radiologists with 2 years' experience (reader 1: CY. Feng and reader 2: MW. Ma), two middle-seniority radiologists with 4 years' experience (reader 3: ZY. Xu and reader 4: SN. Wang), and two senior radiologists with 6 years' experience (reader 5: JF. Wu and reader 6: H. Zeng).
In the first session, each radiologist reviewed the FFDM of each case without support from our proposed model (unaided evaluation). Only the original images and the ROI of each mass in both CC-view and MLO-view FFDM were provided. In the second session, in addition to the FFDMs and ROIs of the masses, the POM calculated by our proposed classification model was also provided to the radiologists (aided evaluation). In both sessions, the BI-RADS category, benign or malignant classification, POM (ranging from 0 to 100%), and reading time were recorded by the observers for statistical analysis. An example record table is provided in Appendix B.
Before the two sessions, the six radiologists were briefed on the evaluation criteria and instructions. They were informed that each case had only one lesion. They were blinded to clinical history, results of other examinations, and palpation findings. Twenty example cases, not included in the 51 observation cases, were used to train the radiologists in this process.

Stand-alone evaluation
The performance of the classification model on the 202 independent cases was also compared with that of two senior experienced radiologists (reader 7: WM. Xu and reader 8: CJ. Wen) who did not participate in the MRMC evaluation. Each radiologist read the FFDM of the 202 independent test cases without any reference and gave a POM for each case. The ROC performances of the radiologists and the classification model were compared to show the difference between humans and our model.

Statistical analysis
In the MRMC study, the area under the receiver operating characteristic (ROC) curve (AUC) was calculated from the POMs assessed by the radiologists in each session. ROC curves were obtained by ranking all POMs given by a radiologist on a case set in ascending order. The true positive rate (TPR) and false positive rate (FPR) were calculated at the probability threshold of each ranked POM. Plotting the TPRs on the y-axis against the FPRs on the x-axis yields the ROC curve for that radiologist on that case set.
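This threshold-sweep procedure is standard ROC analysis; a minimal sketch with scikit-learn (substituting for the R packages used in the study) on toy reader POMs:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy reader data: ground truth (0 = benign, 1 = malignant) and the
# reader-assessed POMs for six cases (illustrative values only).
y_true = np.array([0, 0, 1, 1, 1, 0])
pom = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2])

# Sweep thresholds over the ranked POMs to get (FPR, TPR) pairs.
fpr, tpr, thresholds = roc_curve(y_true, pom)
auc = roc_auc_score(y_true, pom)
print(auc)  # 8 of the 9 benign/malignant pairs are ranked correctly: 8/9
```

The AUC equals the probability that a randomly chosen malignant case receives a higher POM than a randomly chosen benign case, which is why it is a natural summary of reader performance.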
To test the significance of the difference between the two evaluation sessions, a Wald (z) test was used to yield a p-value under the null hypothesis that the two sessions had the same AUC. For the comparison of sensitivity and specificity between the two sessions, the benign and malignant assessments were compared with the biopsy-proven ground truth, and a binary-version MRMC analysis was implemented to yield a p-value. The average diagnosis time per case was calculated for each radiologist in each session, and a paired t-test was used to yield a p-value for the difference between the two sessions.
In the stand-alone study, ROC curves and AUCs were used to compare the performance of the senior radiologists and our proposed CAD model, and the p-value was calculated by an ROC test.
The R packages 'iMRMC', 'ROCR', and 'pROC' were used for the statistical analysis in this study.

Parameter Selection
The learning rate of the Adam optimizer was initialized to 0.0001. It was decayed twice, by a factor (gamma) of 0.1, at epoch 30 and epoch 60. Stepwise regression and the LDA classifier were implemented in MATLAB 2018a with default parameters.
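Assuming PyTorch (the paper does not state the framework), this schedule corresponds to a multi-step decay with milestones at epochs 30 and 60:

```python
import torch

# Sketch of the stated schedule: Adam starting at 1e-4, decayed by a
# factor of 0.1 at epochs 30 and 60 (dummy parameter, illustration only).
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.Adam(params, lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60], gamma=0.1)

lrs = []
for epoch in range(90):
    optimizer.step()      # the actual training step would go here
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs[0], lrs[30], lrs[60])  # approximately 1e-4, 1e-5, 1e-6
```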

ROC performance
In the MRMC study, readers significantly improved their diagnostic performance with the support of our proposed CAD model. The average AUC increased from 0.850 to 0.896 (p = 0.0209). Changes in the ROC curves are shown in Figure 3 and specific AUC changes are shown in Table 3. Five of the six radiologists' AUCs increased with reference to our proposed model, while one radiologist's AUC decreased.

Benign and malignant evaluation
The sensitivity and specificity of the radiologists in the two sessions are shown in Table 4. All radiologists' sensitivities with model support were higher than or equal to those without model support, most obviously for the junior-group radiologists. Five of the six radiologists' specificities were higher than or equal to those without model support; only reader 5's specificity decreased. The binary MRMC analysis showed that the performance improvement was significant (p = 0.0253).
Table 4 The difference in sensitivity and specificity in the three groups with and without model reference

BI-RADS Evaluation
In the MRMC evaluation, all readers adjusted the BI-RADS category of some cases with model support, mainly among BI-RADS 2, 3, and 4. The assessments tended toward an increase in BI-RADS 4, and fewer cases were assigned BI-RADS 2 or 3 with the model reference. In total, 80 cases' BI-RADS assessments increased while 48 cases' assessments decreased. More details are shown in Appendix C.

Diagnosis time
Diagnosis time was recorded in seconds by each radiologist using timer software. The average diagnosis time per case for the radiologists in the two sessions is shown in Table 5. Five of the six radiologists had comparable diagnosis efficiency. Reader 6's diagnosis time decreased significantly from 56.96 to 43.96 seconds with CAD support (p = 0.01). The two senior radiologists showed a larger decrease in diagnosis time than the other radiologists. Figure 4 shows the reading time comparison for all readers and for reader 6.
Table 5 The mean diagnosis time for radiologists in the multi-reader multi-case study.

Stand-alone study
In the stand-alone study, the ROC curves of our proposed model and of the two senior experienced radiologists on the 202 independent cases are shown in Figure 5. Our proposed model achieved an AUC of 0.913. Reader 7 and reader 8 achieved AUCs of 0.969 and 0.988, respectively. Both radiologists performed better than our CAD model. This result shows that although the model improved the diagnostic efficiency and accuracy of radiologists with less than 6 years' experience, it cannot match the performance of radiologists with more than 8 years' experience.

Discussion and conclusions
In this study, we proposed a deep-learning-based perceptive feature extractor for breast masses in order to establish a CAD model for benign and malignant mass classification, and we evaluated this CAD model through an observer study. The results showed that the CAD model could help radiologists improve diagnostic accuracy without increasing diagnosis time.
The perceptive feature extractor used quantitative BI-RADS descriptions instead of biopsy-proven results to optimize its weights. This brought two benefits to the CAD model. First, the BI-RADS descriptions were obtained from radiologists, so human visual perception and clinical experience were integrated into the weights during optimization, making the features learned by the CNN easier to interpret. Second, compared with using a CNN directly to establish a CAD model, training the feature extractor first and then using a classifier to complete the diagnosis divides the process into two stages, which is more consistent with the process of clinical diagnosis.
In the observer study, the AUCs showed that radiologists had better diagnostic performance with CAD model support than without it. In addition, the diagnostic sensitivities of all radiologists increased when the model was involved [22]. Except for reader 5, the readers' specificities increased or remained the same when the model reference was used. In particular, the junior radiologists' AUCs and sensitivities increased more than those of the middle-seniority and senior radiologists, while their specificities remained unchanged [23,24,25].
In addition, radiologists adjusted their BI-RADS category assessments after the CAD model was involved. More cases were assessed as BI-RADS 4 and fewer as BI-RADS 2 or 3, which suggests that the model could assist radiologists in observation by increasing their attention to the most suspicious characteristics.
Considering radiologists' daily workload, we did not want the CAD model to increase their diagnosis time per patient. The average diagnosis time showed that most radiologists had comparable diagnosis time, and one radiologist showed a significant decrease [26]. This suggests that the model does not increase radiologists' workload.
The stand-alone study showed that the AUC of the CAD model was lower than that of senior radiologists with 8 years of mammogram reading experience. This demonstrates that although the CAD model can help radiologists improve their diagnostic accuracy, it cannot replace radiologists in making diagnoses.
Our study has some limitations. First, this is not an end-to-end CAD system. Automatic mass detection and segmentation were not explored; to generate a classification probability, the radiologist must first locate the breast mass and mark its contour. Second, this is a single-center study. Different regions have different population distributions, and we did not explore whether this CAD model can be applied to populations outside our institution.
In the future, we will extend this model into an end-to-end CAD system. The model should detect and segment masses automatically so that radiologists can obtain not only the reference POM but also mass location and shape information, which may provide more useful cues. Since this increases the complexity of the whole task, more data are needed to train a stable system. We plan to collect more FFDM data containing masses from different centers to address the limitation of data size and to develop a multi-center study.

Ethics approval and consent to participate
This study obtained ethics approval and consent from the Medical Ethics Committee of Nanfang Hospital. The clinical ethics project code is NFEC-2018-037.
In this study, all methods were carried out in accordance with the relevant guidelines and regulations provided by the Medical Ethics Committee of Nanfang Hospital. The requirement for participants' consent was waived by the Medical Ethics Committee of Nanfang Hospital.

Consent for publication
The requirement for consent for publication was waived by the Medical Ethics Committee of Nanfang Hospital.

Availability of data and materials
The datasets analyzed during the current study are not publicly available due to patient privacy information protection but are available from the corresponding author on reasonable request.