Impact of Different Mammography Systems on Artificial Intelligence Performance in Breast Cancer Screening

Artificial intelligence (AI) tools may assist breast screening mammography programmes, but limited evidence supports their generalisability to new settings. This retrospective study used a three-year dataset (1/04/2016-31/03/2019) from a UK regional screening programme. The performance of a commercially available breast screening AI algorithm was assessed with a pre-specified and a site-specific decision threshold to evaluate whether its performance was transferable to a new clinical site. The dataset consisted of women who attended routine screening (50-70 years), excluding technical recalls, self-referrals, and those with a previous mastectomy, complex physical requirements or without the four standard image views. In total, 55,916 screening attendees (mean age, 60 ± 6 [SD] years) met the inclusion criteria. The pre-specified threshold resulted in high recall rates (48.3%; 21,929/45,444), which reduced to 13.0% (5,896/45,444) following threshold calibration, closer to the observed service level (5.0%; 2,774/55,916). Recall rates also increased approximately three-fold following a software upgrade on the mammography equipment, requiring per-software version thresholds. Using software-specific thresholds, the AI algorithm would have recalled 277/303 (91.4%) screen-detected cancers and 47/138 (34.1%) interval cancers. AI performance and thresholds should be validated for new clinical settings before deployment, while quality assurance systems should monitor AI performance for consistency.


Introduction
A recent United Kingdom (UK) National Screening Committee review (3,4) concluded that evidence was insufficient to support the implementation of AI in routine breast cancer screening. The review identified limited evidence on sources of variability, impact on interval cancers detected between screening cycles, and performance of a pre-set threshold to classify recall or no recall. In addition, evidence for the transferability of AI models is inconsistent (5-7).
We evaluated a commercial AI software (8) using data from a UK Screening Programme to determine whether its performance transferred to an external dataset generated with different mammography equipment. The AI software is CE-marked (CE: Conformité Européenne), indicating compliance with applicable European Union (EU) regulations. This study evaluates generalisability of the AI tool using consecutively acquired clinical data, comparing stand-alone performance to the dual reporting system in the UK screening service.

Sample
The Proportionate Review Sub-committee of the London-Bloomsbury Research Ethics Committee approved this retrospective study (20/LO/0563). Secondary use of de-identified data negated the requirement for individual consent. Public Benefit and Privacy Panel (PBPP) approval was obtained (1920-0258). National Health Service (NHS) Grampian clinical data and mammograms were collected from the Scottish Breast Screening Service (SBSS) (12/02/2016-31/03/2020). Full-field digital mammography (FFDM) images were acquired on five mammography X-ray units of the same make and model (make: Hologic; model: Selenia Dimensions) with no known differences at study commencement. All units conform to NHS breast cancer screening quality standards (9). The standard imaging protocol consisted of two views per breast [craniocaudal (CC) and mediolateral oblique (MLO)]. As part of routine screening, two readers interpreted each set of images, with a third reader arbitrating in cases of disagreement.
During the study period, mammograms in the screening centre were routinely read by a pool of 11 readers with 1 to 20 years of experience, led by GL.

Data Processing
SBSS clinical data were transferred to the Grampian Data Safe Haven (DaSH).
Mammograms from the breast screening picture archiving and communication system (PACS) were transferred to the Safe Haven Artificial Intelligence Platform (SHAIP) developed by Canon Medical Research Europe (10). "Hidden in Plain Sight" (11) de-identification was performed.
Mia™ (version 2.0.1), developed by Kheiron Medical Technologies (vendor), assessed mammograms for potential malignancies in SHAIP. Mia™ was previously trained and tested on images acquired on Hologic, GE Healthcare, Siemens and IMS Giotto mammography equipment. Mia™, an ensemble of deep learning algorithms, uses the four standard image views (FFDM CC and MLO views for each breast) to generate a continuous output ranging from 0 to 1 (malignancy prediction value). The malignancy prediction values were linked to the clinical data in DaSH. Mia™'s performance was evaluated using a pre-specified threshold (≥0.1117 indicates recall) (8) and a site-specific threshold.
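As a minimal illustration of the decision rule described above (a Python sketch with a hypothetical function name; the deployed product is not this code), converting a continuous malignancy prediction value into a recall opinion is a simple threshold comparison:

```python
def recall_decision(malignancy_score: float, threshold: float = 0.1117) -> bool:
    """Recall when the AI's continuous malignancy prediction value
    (ranging from 0 to 1) meets or exceeds the decision threshold."""
    return malignancy_score >= threshold

print(recall_decision(0.25))  # True  (above the pre-specified 0.1117)
print(recall_decision(0.05))  # False
```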
Mia™'s performance was evaluated by academic health data scientists (CFDV, JAD) in DaSH (12), which the vendor could not access. The vendor ran Mia™ within SHAIP, without access to the clinical outcomes, to provide the Mia™ malignancy prediction values. The vendor also provided the Mia™ decision thresholds.

Threshold Calibration
Mia™ had not previously been evaluated on images from Hologic Selenia Dimensions mammography equipment. The initial evaluation identified variability in algorithm performance. The vendor was provided with a validation dataset (16,204 screens) to generate a site-specific decision threshold. This subset included all screening data from 200 confirmed positives (women with histologically confirmed cancer), 4,000 confirmed negatives (women negative for cancer with a negative 3-year follow-up screening and no interval cancer) and 8,000 unconfirmed negatives (Appendix E1).
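The vendor's calibration method is not described in this study, but one simple hypothetical approach illustrates the idea: choose the threshold as the validation-set score at the rank corresponding to a target recall rate, so that roughly that fraction of screens would be recalled. A Python sketch (function and variable names are assumptions):

```python
def threshold_for_target_recall(scores, target_recall_rate):
    """Pick the decision threshold as the validation-set score at the
    rank matching the target recall rate (hypothetical approach; the
    vendor's actual calibration method is not described in the study)."""
    ranked = sorted(scores, reverse=True)
    k = int(target_recall_rate * len(ranked))  # number of screens to recall
    if k == 0:
        return float("inf")  # threshold above all scores: recall nothing
    return ranked[k - 1]

# Recalling ~30% of 10 validation screens keeps the top 3 scores
scores = [0.05, 0.12, 0.2, 0.35, 0.5, 0.65, 0.8, 0.9, 0.95, 0.99]
print(threshold_for_target_recall(scores, 0.3))  # 0.9
```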

Statistical Analysis
A receiver operating characteristic (ROC) curve was plotted, and the area under the curve (AUC) and confidence interval (CI; DeLong method (13)) were calculated. Positive screens were defined as histologically confirmed cancers detected through standard screening.
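The AUC used above can be illustrated with a minimal pure-Python sketch (the study itself used the pROC package in R; names here are hypothetical, and the DeLong CI is not sketched). AUC equals the probability that a randomly chosen positive screen receives a higher malignancy prediction value than a randomly chosen negative, the Mann-Whitney formulation:

```python
def auc(scores_pos, scores_neg):
    """AUC via the Mann-Whitney statistic: the proportion of
    (positive, negative) score pairs where the positive scores
    higher; ties count as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(scores_pos) * len(scores_neg))

# Perfectly separated scores give AUC = 1.0
print(auc([0.9, 0.8], [0.1, 0.2]))  # 1.0
```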
Sensitivity, specificity, positive and negative predictive values (PPV and NPV, respectively), and the cancer detection and recall rates of Mia™, with CIs (Clopper-Pearson method (14)), were calculated for the pre-specified and site-specific thresholds. The cancer detection rate was quantified as the number of screen-detected cancers with a (Mia™) recall opinion divided by the total number of screens. The pre-specified threshold was evaluated on the entire dataset after exclusions (original dataset) and on the subset not used to calibrate the threshold (test dataset). The site-specific threshold was evaluated using the test dataset.
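These per-threshold metrics follow directly from a 2×2 confusion matrix. An illustrative Python sketch (the study's analysis was in R; the counts in the usage example are reconstructed from the reported site-specific-threshold results, 277/303 cancers recalled and 5,896/45,444 screens recalled, and are approximate):

```python
def screening_metrics(tp, fp, fn, tn):
    """Summary metrics for one decision threshold, matching the
    definitions in the text (cancer detection rate = screen-detected
    cancers with an AI recall opinion / total screens)."""
    n = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "recall_rate": (tp + fp) / n,
        "cdr_per_1000": 1000 * tp / n,
    }

# Counts reconstructed from the reported test-dataset results
m = screening_metrics(tp=277, fp=5619, fn=26, tn=39522)
print(round(m["sensitivity"] * 100, 1))  # 91.4
print(round(m["recall_rate"] * 100, 1))  # 13.0
```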
Furthermore, Mia™'s performance was compared with that of the first reader (Reader 1). Mia™ was not compared with the second reader because, in the UK, second readers can access the first reader's opinion and therefore do not read independently.
As an exploratory sub-analysis, the site-specific threshold performance on the test dataset was stratified by mammography unit. Differences across units were assessed using Pearson chi-squared (specificity, recall and cancer detection rate) and Fisher exact (sensitivity) tests.
Interval cancers (cancers not detected during routine screening but identified between screening rounds) were analysed separately. Following individual review, all readers in the clinical team met regularly to form a consensus on cancer visibility on prior screening mammograms (15): 1, no visible lesion; 2, lesion visible on review in hindsight; 3, lesion clearly visible; and Occult, lesion not visible on screening or subsequent symptomatic imaging. The proportion of interval cancers that Mia™ indicated for recall (with the updated threshold) was determined and stratified by consensus opinion.
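The stratified proportion described above amounts to a simple group-by computation. A hypothetical Python sketch (the actual analysis was performed in R; data layout and names are assumptions):

```python
from collections import Counter

def recall_proportion_by_consensus(interval_cancers):
    """interval_cancers: iterable of (consensus_grade, ai_recalled) pairs.
    Returns, per consensus visibility grade, the proportion of interval
    cancers the AI flagged for recall."""
    totals, recalled = Counter(), Counter()
    for grade, flagged in interval_cancers:
        totals[grade] += 1
        recalled[grade] += int(flagged)
    return {g: recalled[g] / totals[g] for g in totals}

# Toy data: grade 2 cancers are recalled half the time
data = [(1, False), (1, False), (2, True), (2, False), (3, True)]
print(recall_proportion_by_consensus(data))  # {1: 0.0, 2: 0.5, 3: 1.0}
```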
Statistical analyses were performed in R (version 4.0.3; Appendix E3). ROC curves, AUCs and CIs were generated using the pROC package (16). Sample size information is available in Appendix E2. P < 0.05 was considered to indicate a statistically significant difference.

Data availability
The statistical output alongside the relevant R code is available in Appendix E3. Access to the raw SBSS data and mammograms (de-identified participant data) is subject to the required approvals (e.g. PBPP, NHS R&D, REC approval) and data agreements being in place. More information can be found on the DaSH website: https://www.abdn.ac.uk/iahs/facilities/grampian-data-safe-haven.php.

Results
The mean age was 60 years (SD, 6.0 years); 450 patients had histologically confirmed screen-detected breast cancer, and 156 interval cancers were detected in follow-up (Table 1).
For the pre-specified threshold (original dataset: 55,916 screens and 450 cancers), sensitivity and specificity were 97.3% and 52.7%, respectively (Table 2). The recall rate was 47.7% and the cancer detection rate was 7.8 per thousand. For the test dataset (45,444 screens and 303 cancers, excluding screens used for threshold calibration), sensitivity and specificity were 98.3% and 52.1%, respectively; the recall rate was 48.3% and the cancer detection rate was 6.6 per thousand.

Threshold calibration
An initial site-specific threshold of 0.2938 was generated. This threshold revealed a step change in recall rate at a set point for each mammography unit (Figure 2b). Review of image headers revealed that the increase in recalls correlated with a mammography unit software update. The AI algorithm was not updated during the study. All units had the same software before the update (version 1.7). Per-software version thresholds were generated to ensure stability of recall rates (Appendix E1). Owing to the small number of positive studies in the post-software update subset, the vendor was provided with 35 additional positive studies (from Mammography Unit 4, post-software upgrade) to reduce the threshold's susceptibility to noise.
Two site-specific thresholds were generated across all mammography units: 0.2712 pre-upgrade and 0.4319 post-upgrade.
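Operationally, per-software version thresholds amount to keying the decision threshold on the software version recorded in the image header. A hypothetical Python sketch using the reported threshold values (how the deployed system would select the threshold is not described in the study):

```python
# Site-specific thresholds reported in the study, keyed on whether the
# mammography unit's software upgrade has been applied.
THRESHOLDS = {False: 0.2712, True: 0.4319}  # pre- / post-upgrade

def recall(score: float, post_upgrade: bool) -> bool:
    """Apply the software-version-specific decision threshold to a
    malignancy prediction value (hypothetical wrapper)."""
    return score >= THRESHOLDS[post_upgrade]

# A score of 0.30 is recalled pre-upgrade but not post-upgrade
print(recall(0.30, post_upgrade=False))  # True
print(recall(0.30, post_upgrade=True))   # False
```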
Applying the new thresholds to the test dataset resulted in a sensitivity of 91.4%, specificity of 87.6%, recall rate of 13.0% and cancer detection rate of 6.1 per thousand.

AI performance split by mammography X-ray unit and lesion size
Mia™ performance with the site-specific thresholds differed significantly across mammography units for specificity (p < 0.001) and recall rate (p < 0.001), but not for sensitivity (p = 0.51) or cancer detection rate (p = 0.93) (Table 2). We found no evidence of a difference in the sensitivity of Mia™ between small and large tumours (91.0% [162/178] and 93.7% [104/111], respectively; p = 0.55).

Interval cancers
The test dataset contained 138 interval cancers (ICs). Using the site-specific thresholds, Mia™ would have recalled 47/138 (34.1%) interval cancers.

Discussion
AI performance could be affected by different mammography systems, impacting deployment in new settings. In this study, local calibration and per-software version thresholds were required to reduce recall rates from 47.7% to 13.0%. Mia™, post-threshold optimisation, had a higher recall rate than Reader 1 (13.0% vs 5.4%) but would have detected more cancers (277 vs 261), including some missed by routine dual reporting (47/138 interval cancers). The UK acceptable recall rate is <9% in a double reading setting with arbitration (18). The Mia™ false positive rate was higher than in routine clinical practice, suggesting that Mia™ would be best used combined with human reader input, as recommended by the vendor. Economic and operational evaluations are required across possible implementation scenarios.
Our results are supported by previous research observing issues relating to the generalisability of radiology AI models (5,7,19). Furthermore, we have established that AI performance can be influenced by different mammography systems. The AI had previously been calibrated on a range of mammography units, including the Hologic Lorad Selenia, an older model of the unit employed here (Hologic Selenia Dimensions). The software update applied to the mammography units included several enhancements that may affect image characteristics. Human reader performance was not adversely affected following the update. Independent verification of vendor-reported transferability of thresholds, using the same mammography unit and software version elsewhere, is needed.
A user-definable threshold could allow centres to perform threshold recalibration themselves.
However, many centres would struggle to gather enough data and/or lack the technological expertise to adjust the thresholds successfully. A national implementation and validation framework for AI in breast cancer screening, alongside representative national datasets, could help set AI decision thresholds and quality assurance standards.
Study strengths include the use of a retrospective, unenriched dataset consecutively acquired in a dual reporting screening setting, with sufficient follow-up to capture screen-detected and interval cancers. The AI was not trained on the dataset. Exclusions were minimal (3.9%).
Study limitations include the following: 1) the evaluation of a single AI product; 2) the single-centre setting; 3) a predominantly white Caucasian sample; and 4) the unavailability of detailed interval cancer information due to COVID-19-related delays. Post-hoc analyses of performance stratified by mammography unit and lesion size were not adequately powered and require further evaluation in larger studies.
Different mammography systems can substantially affect AI performance. AI performance and decision thresholds should be validated when applied in new clinical settings. Quality assurance systems, including change management, should monitor AI algorithms for consistent performance. The AI required threshold calibration, with software-specific thresholds, for optimal performance.

Figure 2:
a: Mia™ receiver operating characteristic curve on the original dataset with the pre-specified threshold. The original dataset was not used to establish the pre-specified threshold. b: Rise in recall rate after an event for the four mammography X-ray units. The vertical dashed line indicates the date of a software upgrade. A fifth unit, a floating service mobile unit, was not upgraded during the study timeline and is not included in this figure.

Note: Chi-squared tests (or Fisher exact tests when there were small counts in the contingency table) were performed to determine whether the pre-set threshold performance was significantly different from the site-specific threshold performance, and whether the site-specific threshold performance was significantly different from Reader 1 performance on screen-detected cancers. Sensitivity, specificity, recall and cancer detection rate were significantly different between the pre-set and site-specific thresholds (p < 0.001). There were significant differences between the site-specific threshold and Reader 1 for specificity, recall and cancer detection rate (p < 0.001), but not for sensitivity (p = 0.067).

Figure 1:
Flow diagram showing the generation and composition of the original, test and validation datasets. Exclusions are indicated in the white boxes. The vendor-recommended exclusions are indicated in the shaded outer box. Confirmed positives are women with histologically confirmed cancer. Confirmed negatives are women negative for cancer with a negative 3-year follow-up screening and no interval cancer. DICOM = Digital Imaging and Communications in Medicine, UK = United Kingdom
Occult: no lesion visible on the prior screening mammogram, nor on the follow-up mammogram. Occult lesions usually present as palpable masses not discernible on, or outwith, the mammographic image. UK = United Kingdom

Table 2:
Mia™ Performance on Screen-detected Cancers
Cancer detection rate: Unit 4 was excluded from the per-unit comparison of cancer detection rate. 35 additional positive studies from this unit were provided to the vendor; therefore, the cancer detection rate reported for this unit (for the test dataset) was artificially low.

Mia™ recall rate range pre and post software update: monthly recall rate range pre and post software update with the initial recalibrated site-specific threshold (0.293801); minimum and maximum recall rates shown.
data_full$Mia_recall_overall <- ifelse(data_full$MIA >= new_threshold_overall, 1, 0)
# Convert reader opinions to 1 (recall) & 0 (no recall)