Patient characteristics
A total of 24866 breasts (mean age 49.8 years ± 10.3) were included. Of these, 10745 (43.2%) were confirmed by biopsy or surgical pathology (7266 malignant and 3479 benign), and 14119 (56.8%) had no findings on more than 2 years of follow-up imaging (Table 1).
Table 1
| Dataset | Total | Malignant | Benign | No-findings | Age, years (mean (SD)) | Patch-level | Breast-level | ROI |
| Training | 11299 | 3515 | 1091 | 6693 | 51.3 (10.5) | | Train | × |
| Validation | 2824 | 844 | 277 | 1703 | 51.2 (10.3) | | Validation | × |
| Testing 1 | 3517 | 529 | 992 | 1996 | 46.4 (9.2) | | Test | × |
| Testing 2 | 6226 | 1378 | 1119 | 3729 | 48.1 (10.0) | | Test | × |
| Auxiliary | 1000 | 1000 | | | | Train/Validation | | √ |
Note.— Data are numbers of breasts. ROI = region of interest. BI-RADS = Breast Imaging Reporting and Data System. SD = standard deviation.
Performance of AIS
The performance of AIS is shown in Figure S1 and Table 2. When discriminating breasts with malignant lesions from non-malignant breasts (benign group plus no-findings group), AIS achieved an AUC of 0.995 (95% CI, 0.992–0.997), 0.933 (95% CI, 0.919–0.947), and 0.947 (95% CI, 0.939–0.955) in the validation set, testing set 1, and testing set 2, respectively. When discriminating breasts with malignant lesions from breasts with benign lesions, AIS achieved an AUC of 0.988 (95% CI, 0.980–0.995), 0.910 (95% CI, 0.893–0.927), and 0.936 (95% CI, 0.926–0.946) in the validation set, testing set 1, and testing set 2, respectively.
Table 2
The performance of the AIS.
| Type | Dataset | AUC | Sensitivity (%) | Specificity (%) | Accuracy (%) | F1 |
| Malignant vs. Non-Malignant | Validation | 0.995 (0.992, 0.997) | 96.5 [814/844] (95.0, 97.6) | 98.9 [1959/1980] (98.4, 99.3) | 98.2 [2773/2824] (97.6, 98.7) | 0.970 |
| Malignant vs. Non-Malignant | Testing 1 | 0.933 (0.919, 0.947) | 81.1 [429/529] (77.5, 84.3) | 95.0 [2838/2988] (94.1, 95.7) | 92.9 [3267/3517] (92.0, 93.7) | 0.774 |
| Malignant vs. Non-Malignant | Testing 2 | 0.947 (0.939, 0.955) | 84.0 [1158/1378] (82.0, 85.9) | 96.0 [4654/4848] (95.4, 96.5) | 93.3 [5812/6226] (92.7, 94.0) | 0.848 |
| Malignant vs. Benign | Validation | 0.988 (0.980, 0.995) | 96.1 [811/844] (94.5, 97.3) | 97.1 [269/277] (94.4, 98.8) | 96.3 [1080/1121] (95.1, 97.4) | 0.975 |
| Malignant vs. Benign | Testing 1 | 0.910 (0.893, 0.927) | 80.3 [425/529] (76.7, 83.6) | 91.1 [904/992] (89.2, 92.8) | 87.4 [1329/1521] (85.6, 89.0) | 0.816 |
| Malignant vs. Benign | Testing 2 | 0.936 (0.926, 0.946) | 84.4 [1163/1378] (82.4, 86.3) | 92.9 [1040/1119] (91.3, 94.4) | 88.2 [2203/2497] (86.9, 89.5) | 0.888 |
Note.— Numbers in brackets are numerator/denominator, and numbers in parentheses are 95% CIs. AUC = area under the receiver operating characteristic curve. CI = confidence interval. AIS = artificial intelligence system.
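As a reading aid, the point estimates in Table 2 can be recomputed from the bracketed counts. The sketch below does this for testing set 1 (malignant vs. non-malignant). It assumes that malignant is the positive class and that F1 is the usual harmonic mean of precision and recall; the text does not state either explicitly, and the function name `binary_metrics` is illustrative only.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, accuracy (all in %) and F1.

    Assumption: the positive class is 'malignant'.
    """
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return sensitivity, specificity, accuracy, f1


# Testing set 1 counts from Table 2: 429 of 529 malignant breasts and
# 2838 of 2988 non-malignant breasts were classified correctly.
tp, fn = 429, 529 - 429
tn, fp = 2838, 2988 - 2838

sens, spec, acc, f1 = binary_metrics(tp, fn, tn, fp)
print(f"sensitivity={sens:.1f}%  specificity={spec:.1f}%  "
      f"accuracy={acc:.1f}%  F1={f1:.3f}")
# prints: sensitivity=81.1%  specificity=95.0%  accuracy=92.9%  F1=0.774
```

Under these assumptions the recomputed values agree with the Testing 1 row of Table 2.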
The results of AIS vs. BI-RADS
Clinical characteristics and mammographic assessments are listed in Table S1.
The benefit of AIS for stratifying BI-RADS 3–4 subgroups. — Against the pathological results, BI-RADS categorization alone achieved an accuracy of 73.15% to 84.09%, a sensitivity of 85.51% to 92.32%, a specificity of 58.80% to 74.28%, and an F1 of 0.716 to 0.898 across the validation and testing sets, whereas AIS yielded an accuracy of 85.06% to 96.82%, a sensitivity of 76.64% to 96.99%, a specificity of 90.55% to 96.30%, and an F1 of 0.802 to 0.979 (Fig. 3A and Table 3). The AUC of AIS was significantly higher than that of BI-RADS categorization in the validation set (0.988 vs. 0.828, P < 0.001), testing set 1 (0.892 vs. 0.843, P < 0.001), and testing set 2 (0.934 vs. 0.873, P < 0.001).
Table 3
Stratified analysis of BI-RADS 3 and 4.
| Type | Dataset | AUC | Sensitivity (%) | Specificity (%) | Accuracy (%) | F1 |
| BI-RADS | Validation | 0.828 (0.797, 0.859) | 92.3 [613/664] (90.0, 94.2) | 58.8 [127/216] (51.9, 65.4) | 84.1 [740/880] (81.5, 86.5) | 0.898 |
| BI-RADS | Testing 1 | 0.843 (0.821, 0.866) | 85.5 [366/428] (81.8, 88.7) | 65.1 [427/656] (61.3, 68.7) | 73.2 [793/1084] (70.4, 75.8) | 0.716 |
| BI-RADS | Testing 2 | 0.873 (0.858, 0.888) | 89.7 [1041/1161] (87.8, 91.4) | 74.3 [566/762] (71.0, 77.3) | 83.6 [1607/1923] (81.8, 85.2) | 0.868 |
| AIS | Validation | 0.988 (0.978, 0.997) | 97.0 [644/664] (95.4, 98.2) | 96.3 [208/216] (92.8, 98.4) | 96.8 [852/880] (95.4, 97.9) | 0.979 |
| AIS | Testing 1 | 0.892 (0.871, 0.912) | 76.6 [328/428] (72.3, 80.6) | 90.5 [594/656] (88.0, 92.7) | 85.1 [922/1084] (82.8, 87.1) | 0.802 |
| AIS | Testing 2 | 0.934 (0.923, 0.945) | 84.8 [984/1161] (82.6, 86.8) | 92.4 [704/762] (90.3, 94.2) | 87.8 [1688/1923] (86.2, 89.2) | 0.893 |
Note.— Numbers in brackets are numerator/denominator, and numbers in parentheses are 95% CIs. BI-RADS = Breast Imaging Reporting and Data System. AUC = area under the receiver operating characteristic curve. CI = confidence interval. AIS = artificial intelligence system.
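The 95% CIs attached to each proportion in Table 3 are binomial intervals, but the text does not say which interval method was used. The sketch below uses the Wilson score interval purely as one common choice (an assumption, not the authors' stated method), so its bounds can differ slightly from the tabulated ones. The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion.

    Assumption: Wilson is swapped in here for illustration; the source
    does not state how its confidence intervals were computed.
    z = 1.959964 corresponds to a two-sided 95% interval.
    """
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# AIS sensitivity in testing set 1 (Table 3): 328 of 428 malignant breasts.
lower, upper = wilson_interval(328, 428)
print(f"point estimate = {100 * 328 / 428:.1f}%")
print(f"Wilson 95% interval = ({100 * lower:.1f}%, {100 * upper:.1f}%)")
```

An exact (Clopper-Pearson) interval would be somewhat wider than the Wilson interval for the same counts, which is one plausible reason for small differences from the table.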
Fig. 3B shows a sunburst plot of the benefit of AIS over BI-RADS categorization. Among the 3887 breasts in the validation and testing sets, BI-RADS categorization produced false-negatives in 6.0% [233/3887] (2.1% [83/3887] BI-RADS 3 and 3.9% [150/3887] BI-RADS 4A) and false-positives in 13.2% [514/3887] (11.6% [449/3887] BI-RADS 4B and 1.7% [65/3887] BI-RADS 4C). In contrast, AIS correctly reclassified 11.0% [427/3887] of the breasts, all rated BI-RADS 4B or 4C, as benign, thus avoiding excessive surgical treatment. AIS also correctly reclassified 3.2% [126/3887] of the breasts, all rated BI-RADS 3 or 4A, as malignant, thus avoiding a missed window for intervention. Furthermore, AIS classified 90.6% [1830/2020] of malignant BI-RADS 4B and 4C cases as malignant, ensuring necessary surgery, and 96.3% [1079/1120] of benign BI-RADS 3 and 4A cases as benign, avoiding unnecessary surgery. Overall, AIS misclassified only 5.9% [231/3887] of all BI-RADS 3–4 cases that BI-RADS categorization had stratified correctly.
In general, AIS appropriately reassigned 83.1% [427/514] of the false-positives (BI-RADS 4B and 4C) to the benign group and 54.1% [126/233] of the false-negatives (BI-RADS 3 and 4A) to the malignant group. AIS therefore helps to prevent both overtreatment and delayed treatment.
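The reclassification arithmetic behind these percentages can be checked directly from the bracketed counts. The sketch below is only a bookkeeping aid: the variable names are illustrative, and the grouping of the 3887 breasts into four cells is inferred from the denominators quoted in the text.

```python
total = 3887  # BI-RADS 3-4 breasts in the validation and testing sets

# Outcome of BI-RADS categorization against pathology (counts from the text).
birads_false_pos = 514   # benign breasts rated BI-RADS 4B or 4C
birads_false_neg = 233   # malignant breasts rated BI-RADS 3 or 4A
birads_true_pos = 2020   # malignant breasts rated BI-RADS 4B or 4C
birads_true_neg = 1120   # benign breasts rated BI-RADS 3 or 4A
assert (birads_false_pos + birads_false_neg
        + birads_true_pos + birads_true_neg) == total

# What AIS did within each of those groups.
ais_fixed_false_pos = 427   # BI-RADS false-positives that AIS called benign
ais_fixed_false_neg = 126   # BI-RADS false-negatives that AIS called malignant
ais_kept_true_pos = 1830    # BI-RADS true-positives that AIS also called malignant
ais_kept_true_neg = 1079    # BI-RADS true-negatives that AIS also called benign

# Cases that BI-RADS had right but AIS got wrong: 190 + 41 = 231.
ais_new_errors = ((birads_true_pos - ais_kept_true_pos)
                  + (birads_true_neg - ais_kept_true_neg))

share_fp_fixed = 100 * ais_fixed_false_pos / birads_false_pos
share_fn_fixed = 100 * ais_fixed_false_neg / birads_false_neg
share_new_errors = 100 * ais_new_errors / total

print(f"false-positives corrected: {share_fp_fixed:.1f}%")   # 83.1%
print(f"false-negatives corrected: {share_fn_fixed:.1f}%")   # 54.1%
print(f"newly introduced errors:   {share_new_errors:.1f}% "
      f"[{ais_new_errors}/{total}]")                          # 5.9% [231/3887]
```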
The sensitivity and specificity analysis under different subgroups. — Fig. 4 and Table 4 compare the sensitivity and specificity of BI-RADS categorization with those of AIS within different subgroups. For the gland type subgroups, AIS yielded a 3.7% [39/1047] increase in sensitivity and a 9.4% [141/1500] increase in specificity within the dense subgroup, but a slight decrease (0.2% [2/805] in sensitivity and 1.9% [4/210] in specificity) within the non-dense subgroup. For the lesion type subgroups, AIS increased sensitivity and specificity by 4.2% [44/1052] and 20.8% [153/735], respectively, within the soft-tissue subgroup, and decreased them by 1.2% [9/769] and 7.5% [28/375], respectively, within the calcification subgroup. For the no-sign subgroup, BI-RADS categorization classified all breasts as benign, resulting in a sensitivity of 0.0% [0/31] and a specificity of 99.7% [598/600], although 4.9% [31/631] of these breasts were proven to be malignant. In contrast, AIS identified 19.4% [6/31] of the malignant breasts at a specificity of 99.7% [598/600]. For the cancer type subgroups, AIS exhibited higher sensitivity and specificity within the IDC/ILC and DCIS subgroups, and lower sensitivity and specificity within the rare subgroup.
Table 4
Stratified analysis of gland type, lesion type and cancer type.
| Type | Sensitivity (%): AIS | Sensitivity (%): BI-RADS | Sensitivity (%): ∆ | Specificity (%): AIS | Specificity (%): BI-RADS | Specificity (%): ∆ | AIS AUC (95% CI) |
| Gland Type | | | | | | | |
| Non-dense | 93.7 [754/805] | 93.9 [756/805] | -0.2 [2/805] | 67.1 [141/210] | 69.0 [145/210] | -1.9 [4/210] | 0.937 (0.922, 0.952) |
| Dense | 89.8 [940/1047] | 86.0 [901/1047] | +3.7 [39/1047] | 85.1 [1277/1500] | 75.7 [1136/1500] | +9.4 [141/1500] | 0.919 (0.907, 0.931) |
| Lesion Type | | | | | | | |
| Calcifications | 92.6 [712/769] | 93.8 [721/769] | -1.2 [9/769] | 69.1 [259/375] | 76.5 [287/375] | -7.5 [28/375] | 0.938 (0.925, 0.952) |
| Soft tissue | 93.2 [980/1052] | 89.0 [936/1052] | +4.2 [44/1052] | 74.7 [549/735] | 53.9 [396/735] | +20.8 [153/735] | 0.909 (0.895, 0.922) |
| No sign | 19.4 [6/31] | 0.0 [0/31] | +19.4 [6/31] | 100.0 [600/600] | 99.7 [598/600] | +0.3 [2/600] | 0.799 (0.690, 0.907) |
| Cancer Type | | | | | | | |
| IDC or ILC | 92.8 [1482/1597] | 90.3 [1442/1597] | +2.5 [40/1597] | 85.4 [1456/1704] | 74.8 [1275/1704] | +10.6 [181/1704] | 0.941 (0.932, 0.949) |
| DCIS | 83.0 [137/165] | 80.6 [133/165] | +2.4 [4/165] | 77.5 [1321/1704] | 74.8 [1275/1704] | +2.7 [46/1704] | 0.870 (0.834, 0.905) |
| Rare | 84.4 [76/90] | 91.1 [82/90] | -6.7 [6/90] | 54.3 [926/1704] | 74.8 [1275/1704] | -20.5 [349/1704] | 0.877 (0.831, 0.922) |
Note.— For sensitivity calculations, the threshold for AIS is chosen to match the specificity of BI-RADS classification. For specificity calculations, the threshold for AIS is chosen to match the sensitivity of BI-RADS classification. Numbers in brackets are numerator/denominator. AIS = artificial intelligence system. BI-RADS = Breast Imaging Reporting and Data System. AUC = area under the receiver operating characteristic curve. CI = confidence interval. IDC/ILC = invasive ductal carcinoma/invasive lobular carcinoma. DCIS = ductal carcinoma in situ.
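The matched-threshold comparison described in the note to Table 4 can be sketched as follows: pick the AIS score cutoff whose specificity on the benign breasts equals the BI-RADS specificity, then read off the AIS sensitivity at that cutoff. Everything concrete here is an assumption for illustration: the scores are made-up toy values, a breast is called malignant when its score is strictly above the threshold, and the tie handling and function names are hypothetical rather than taken from the study.

```python
import math


def threshold_for_specificity(benign_scores, target_specificity):
    """Smallest threshold whose specificity on benign_scores is at least
    target_specificity. A case is called malignant when score > threshold."""
    ordered = sorted(benign_scores)
    # Number of benign cases that must fall at or below the threshold.
    needed = math.ceil(target_specificity * len(ordered) - 1e-9)
    if needed == 0:
        return float("-inf")
    return ordered[needed - 1]


def sensitivity_at(malignant_scores, threshold):
    """Fraction of malignant cases with a score above the threshold."""
    hits = sum(score > threshold for score in malignant_scores)
    return hits / len(malignant_scores)


# Toy AIS scores (hypothetical, not study data).
benign = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.90]
malignant = [0.30, 0.50, 0.65, 0.75, 0.80, 0.85, 0.95, 0.99]

birads_specificity = 0.80  # assumed reference value to match
threshold = threshold_for_specificity(benign, birads_specificity)
ais_specificity = sum(score <= threshold for score in benign) / len(benign)
ais_sensitivity = sensitivity_at(malignant, threshold)

print(threshold, ais_specificity, ais_sensitivity)
# prints: 0.6 0.8 0.75
```

The specificity comparison works the same way with the roles swapped: the threshold is chosen on the malignant scores to match the BI-RADS sensitivity, and specificity is then read off the benign scores.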
Re-assessment of BI-RADS 0 with AIS assistance. — After AIS was applied to all mammograms classified as BI-RADS 0 in the testing sets, radiologists were able to identify malignant lesions in 7 of 43 breasts, with a specificity of 96.7% [352/364] (two representative breasts are shown in Figure S2).
Counterbalance-designed AIS-assisted study
Among the 1302 breasts assessed in the AIS-assisted study, standalone AIS achieved an AUC of 0.926 (95% CI, 0.911–0.942), a sensitivity of 83.4% [529/634], a specificity of 93.7% [626/668], and an accuracy of 88.7% [1155/1302]. Without assistance, the radiologists achieved AUCs of 0.787 to 0.919, sensitivities of 51.0% [323/634] to 94.5% [599/634], specificities of 45.7% [305/668] to 96.4% [644/668], and accuracies of 69.4% [904/1302] to 87.5% [1139/1302]. With assistance, they achieved AUCs of 0.808 to 0.925, sensitivities of 54.1% [343/634] to 93.8% [595/634], specificities of 53.3% [356/668] to 96.0% [641/668], and accuracies of 73.0% [951/1302] to 88.4% [1151/1302]. All 10 readers showed an increase in AUC, ranging from 0.003 to 0.047, and the improvement was significant for 5 readers (P < 0.05, Fig. 5, Table S2). The average AUC across the 10 readers improved significantly with AIS assistance, from 0.870 to 0.888 (P = 0.001). The use of AIS resulted in a trend toward increased sensitivity in 8 of 10 radiologists, with improvements ranging from 2.2% [14/634] to 10.3% [65/634], while the average specificity increased slightly in the mid-level and senior groups. Without AIS, the intraclass correlation coefficient (ICC) for the 10 readers was 0.629 (95% CI, 0.609–0.649); with the support of AIS, the ICC increased to 0.672 (95% CI, 0.653–0.690) (Table S3). Two representative breasts, one with dense parenchyma (Fig. 6) and one with mass lesions (Figure S3), were difficult for radiologists to assess; with the help of AIS, most readers downgraded the BI-RADS category in benign cases and upgraded it in malignant cases.