Automated Labeling of Chest X-ray Images using a Quantitative Explainable Atlas-Based AI Model

The inability to accurately and efficiently label large, open-access medical imaging datasets limits the widespread implementation of artificial intelligence models in healthcare. There have been few attempts, however, to automate the annotation of such public databases; one approach, for example, focused on labor-intensive, manual labeling of subsets of these datasets to be used to train new models. In this study, we describe a method for standardized, automated labeling based on similarity to a previously validated, explainable AI (xAI) model, using an atlas-based approach, for which the user can specify a quantitative threshold for a desired level of accuracy, the "probability-of-similarity" (pSim) metric. We showed that our xAI model, by calculating the pSim values for each feature based on comparison to its training-set-derived reference atlas, could automatically label the external datasets to a user-selected, high level of accuracy, equaling or exceeding that of human experts.


The implementation of medical artificial intelligence (AI) into clinical practice in general, and radiology practice in particular, has in large part been limited by the time, cost, and expertise required to accurately label very large imaging datasets, which can serve as "platinum level" ground truth for training clinically relevant AI models. The ability to automatically and efficiently annotate large external datasets, to a user-selected level of accuracy, may therefore be of considerable value in developing impactful medical AI models that bring added value to, and are widely accepted by, the healthcare community. Such an approach not only has the potential to benefit re-training to improve the accuracy of existing AI models, but also, through the use of explainable, atlas-based methodology [1], may help to standardize labeling of open-source datasets [2][3][4][5], for which the provided labels can be noisy, inaccurate, or absent. Such standardization may, in turn, reduce the number of datapoints required for accurate model building, facilitating training and re-training from initially small but well-annotated datasets [1,6].

In this study, we develop and demonstrate a method for standardized, automated labeling based on similarity to a previously validated explainable AI model (xAI), using an atlas-based approach for which the user can specify a quantitative threshold for a desired level of accuracy (the "probability-of-similarity," or pSim, metric). Specifically, we applied our existing AI model for detection of five different chest X-ray (CXR) imaging features (cardiomegaly, atelectasis, pulmonary edema, pneumonia, and pleural effusion) to three large open-source datasets, CheXpert [2], MIMIC [3], and NIH [4], and compared the resulting labels to those of 7 human expert radiologists. Of note, there is an inverse relationship between the selected pSim threshold value and the number of cases identified (i.e., "captured") by the model from the external dataset; in other words, the higher the threshold for likelihood of similarity, the fewer the cases that will be identified from the external database as "similar" to the model-labeled cases.

We showed that our xAI model, by calculating the pSim values for each feature based on comparison to the model's training-set-derived reference atlas, could automatically label the external datasets to a user-selected, arbitrarily high level of accuracy, equaling or exceeding that of human experts. Although the pSim threshold value required to achieve "maximal" similarity varies by feature, once that value is identified, based on comparison of model labels to expert-labeled ground truth, it can then be applied to the remaining external dataset to identify cases likely to be positive for that feature at a pre-determined, high confidence level.

Fig. 1 | The previously validated five-feature CXR detection xAI model, using an explainable atlas-based

Results
approach. a, b, The xAI model calculates "patch similarity" and "confidence" probabilities, based on class activation mapping (CAM) [7,8] and the predicted probability from the model, for each feature. c, The harmonic mean of the patch-similarity and confidence outputs is then used to calculate a "probability of similarity" (pSim) for each feature.

The lowest possible pSim threshold required for 100% PPV or NPV corresponds to the maximal "correct capture rate", as shown in Fig. 2, panel 2.

Also, as shown in the text boxes in Fig. 2, panels 3 and 4, as well as in Fig. 3, model accuracy compared favorably to that of the available pooled public labels of the external, open-source datasets. Figure 3 additionally shows that the automated-labeling model's AUROC performance compared favorably to that of the individual expert radiologists, for each feature, at both the pSim = 0 "baseline" labeling threshold and the "optimal" pSim labeling threshold (i.e., the lowest pSim value achieving 100% accuracy).

For each of the five auto-labeled features (Fig. 5), we compared: (i) the percent of positively auto-labeled CXRs "captured" from the three pooled, full public datasets (from Table 1); (ii) the percent of cases with complete agreement between the model and all 7 expert readers (from Fig. 4); (iii) the lowest pSim value such that PPV = 1 (graphed as "1-pSim@PPV1"); and (iv) the lowest pSim value such that NPV = 1 (graphed as "1-pSim@NPV1"). Higher values (e.g., cardiomegaly, pleural effusion) corresponded to greater, and lower values (e.g., pneumonia, pulmonary edema) to lesser, model auto-labeling efficiency and confidence. Of note, for atelectasis, "1-pSim@PPV1" was higher than "1-pSim@NPV1", indicating greater confidence that the model is correct in "ruling in" this feature (i.e., correctly auto-labeling true positives) than in "ruling out" this feature (i.e., correctly auto-labeling true negatives). This relationship was reversed for the other four features (e.g., greater confidence that the model can correctly "rule out" than "rule in" pneumonia or pulmonary edema).

Fig. 3 caption (fragment): ... (blue squares; n = number of available labeled external cases per feature). ROC curves (y-axis: sensitivity; x-axis: 1 − specificity) are shown for both the "baseline" pSim = 0 threshold (magnified box) and the "optimal" pSim threshold (i.e., the lowest pSim threshold achieving 100% accuracy, as per Fig. 2, panels 3 and 4).

Fig. 5 caption (fragment): ... (iii) the lowest pSim value such that PPV = 1 (graphed as "1-pSim", from Fig. 2, panel 3), and (iv) the lowest pSim value such that NPV = 1 (graphed as "1-pSim", from Fig. 2, panel 4). Features with higher y-axis values (e.g., cardiomegaly, pleural effusion) correspond to those with greater model auto-labeling efficiency/confidence; features with lower y-axis values (e.g., pneumonia, pulmonary edema) correspond to those with lesser efficiency/confidence. Of note, in the graph for atelectasis, "1-pSim@PPV1" is higher than "1-pSim@NPV1", which can be interpreted as greater confidence that the model is correct in "ruling in" the feature (i.e., correctly auto-labeling true positives) than in "ruling out" the feature (i.e., correctly auto-labeling true negatives); this relationship is reversed for the other 4 features (e.g., greater confidence that the model can correctly "rule out" than "rule in" pneumonia or pulmonary edema).

Discussion

We showed that our xAI model, by calculating the pSim values for each feature based on comparison to its "remembered" training-set-derived reference atlas, could automatically label a subset of the external data at a user-selected, arbitrarily high level of accuracy, equaling or exceeding that of the human experts (Fig. 3).

As shown in Fig. 2, the pSim value used for annotation reflects a trade-off between the accuracy of image labeling (i.e., the higher the pSim value, the more accurate the labels) and the efficiency of image labeling (i.e., the higher the pSim value, the fewer the examinations that the model selects for annotation).

To evaluate the efficiency of our automated-labeling approach, we applied our xAI model to the three full public datasets and compared the five auto-labeled features according to the following parameters: (a) the percent of positively auto-labeled CXRs from the three pooled public datasets (i.e., the "capture rate"); (b) the percent of cases with complete agreement between the model and all 7 expert readers; (c) the lowest pSim value for annotation such that all positive cases captured are true positives (i.e., the "optimal" pSim for PPV = 1); and (d) the lowest pSim value for annotation such that all negative cases captured are true negatives (i.e., the "optimal" pSim for NPV = 1). We found a strong correlation between the magnitudes of these parameters for each of the annotated features, as shown in Fig. 5. It is noteworthy that the positive "capture rates" from the three pooled public datasets also strongly correlated with the "capture rates" graphed in Fig. 2, panel 2, for the subset of examinations (n = 90-100) labeled by both the model and the radiologist experts. Moreover, the parameter values reported for each feature corresponded well with the kappa values for inter-observer variability shown in Fig. 6.

Together, these results suggest that the overall accuracy and efficiency of the auto-labeling model, applied to the full public datasets at the "optimal" pSim for each feature, are similar to the accuracy and efficiency of the model as applied to the subset of examinations annotated by the 7 expert radiologists. These results also suggest greater auto-labeling efficiency, with higher confidence in label accuracy, for cardiomegaly and pleural effusion, two of the more objective findings in CXR interpretation, and lesser auto-labeling efficiency, with lower confidence in label accuracy, for pneumonia and pulmonary edema, two of the more subjective assessments in CXR interpretation.
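Parameter (c), the lowest pSim threshold at which every captured positive case is a true positive, can be found by a simple sweep over candidate thresholds on an expert-labeled subset. The sketch below illustrates one way to do this; the function name and array-based interface are our own, not from the study's codebase:

```python
import numpy as np

def optimal_psim_for_ppv1(psim_values, pred_labels, true_labels):
    """Lowest pSim threshold at which every captured positive is a true positive.

    psim_values: per-case pSim scores; pred_labels/true_labels: 0/1 arrays.
    Returns (threshold, capture_rate), where capture_rate is the fraction of
    positive predictions retained at that threshold.
    """
    psim_values = np.asarray(psim_values, dtype=float)
    pred = np.asarray(pred_labels).astype(bool)
    true = np.asarray(true_labels).astype(bool)
    # candidate thresholds: the observed pSim values, in ascending order
    for t in np.sort(np.unique(psim_values)):
        captured = pred & (psim_values >= t)
        # PPV = 1 when every captured positive prediction is truly positive
        if captured.sum() and np.all(true[captured]):
            return float(t), float(captured.sum() / max(pred.sum(), 1))
    return 1.0, 0.0
```

The same sweep with the labels inverted would give the "optimal" pSim for NPV = 1 in parameter (d).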
Indeed, the larger the quantity 1 − pSim_optimal for a given feature (where 0 < pSim < 1 and pSim_optimal is the minimum pSim value such that PPV/NPV = 1), the more reliable and robust the labeling for that feature, based on similarity to the "remembered" reference atlas derived from the model's NLP training set.

A noteworthy aspect of our approach relates to system deployment. We can apply the pSim threshold to each class independently, selecting a "low" pSim value for high-conspicuity features with high inter-rater agreement, and a "high" pSim value for noisier, more subjective, non-specific features with lower inter-rater agreement, the latter at the cost of generating fewer labeled examinations (i.e., a lower "capture rate"). Employing pSim values helps quantify which features of the AI model are most reliably annotated and which need to be improved, making it possible to measure system robustness. Deploying the xAI system is also HIPAA compliant, as no patient-identifiable source data need be stored by the mode-selection module (Fig. 1).

Another technical capability of our system is the "re-annotation" mode, which was not used in the current study, but which may be of value in real-world clinical settings. This mode is a component of the "explainability" functionality of our model, by which the system can prompt or query a human user if the pSim value for feature detection falls below a pre-selected threshold. More generally, the re-annotation mode has the potential to be applied to other medical AI models as a safety feature, alerting users that there is a measurable, quantitative probability of interpretive error.

Our auto-labeling AI model reflects several characteristics of human intelligence [31] in general, and radiologist-mimicking behavior in particular.
Specifically, our system is "smart," in that it can access its "memory" of examination features present in the training set and quantitatively estimate their similarity to features in the new, external examination data. The 1 − pSim_optimal metric for each feature provides a measure of the "intelligence" of the system for efficient, accurate labeling, and its value (between 0 and 1) reflects the quality (i.e., ground-truth accuracy) of the NLP-derived dataset used for initial training. The model can also provide feedback to users through its "explainability" functionality, by displaying examples of the features under consideration from its reference atlas together with their associated pSim values; this interaction offers the user an additional level of confidence that the model is performing as intended. In this regard, our system can be viewed as an "augmented intelligence" tool to improve the accuracy and efficiency of medical imagers.

Indeed, one limitation of our model is that its labeling accuracy and efficiency are directly proportional to the quality of the initial training set. This may help explain why cardiomegaly and pleural effusion, two high-conspicuity features routinely and correctly described in the radiology reports identified by NLP for model training, have higher efficiency metrics (Fig. 5) than pulmonary edema and pneumonia, which are more non-specific and variably assessed by different radiologists.
This also may help explain why the "1 − pSim_optimal values for NPV = 1" in Fig. 5 are higher than the "1 − pSim_optimal values for PPV = 1" for all features except atelectasis: atelectasis is a lower-conspicuity, more non-specific feature typically noted in CXR radiology reports only when it is present, but not mentioned when it is absent (i.e., the model "learned" from its NLP-derived training set to have a higher level of certainty, and hence a higher 1 − pSim_optimal value, when atelectasis is present than when it is absent). Pulmonary edema and pneumonia, on the other hand, are typically described in CXR reports with a higher level of certainty when they are definitely absent (e.g., "no evidence of pulmonary edema or pneumonia") than when they are possibly present (e.g., "cannot exclude pulmonary edema or pneumonia").

Another limitation of our model is that our proposed xAI system requires substantial computational resources and storage space to provide the prediction basis and to operate the mode-selection module. Because the explainable modules have been designed to operate independently, however, we can differentially deploy an xAI system with adjusted capabilities according to the specifications of a given server.

In summary, we have developed and demonstrated an explainable AI model for automated labeling of five different CXR imaging features, at a user-selected, quantitative level of confidence, based on similarity to the reference-atlas library of an existing, validated model. The ability to automatically, accurately, and efficiently annotate large medical imaging datasets may be of considerable value in developing important, high-impact AI models that bring added value to, and are widely accepted by, the healthcare community. This approach might not only benefit re-training to improve the accuracy of existing AI models, but also help to standardize labeling of open-source datasets, for which the provided labels can be noisy, inaccurate, or absent. Such standardization may, in turn, reduce the amount of data required for accurate model building, facilitating training and re-training from initially small, but well-annotated, datasets.

Methods

Datasets. DICOM (Digital Imaging and Communications in Medicine) images were de-identified before data analyses. To make a consistent dataset, we chose only examinations that had associated radiology reports, view-position information (e.g., AP/PA projections, "portable", etc.), and essential patient identifiers (including but not limited to medical record number, age, or gender). If an examination had multiple CXR images, only a single CXR image was included. We randomly selected 1000 images for each view position as a test set; the remaining examinations, from non-overlapping patients, were separated into training and validation sets (Supplementary Fig. 1).

Labeling of the development and test datasets. The labels for the training and validation sets were determined exclusively from the automated NLP assignments, whereas those for the test set were determined by consensus of three U.S.
board-certified radiologists at our institution (further details provided in Supplementary Table 1).

Patch-atlas creation. To create the patch atlas, we search for the main contours on a high-resolution (512 × 512) CAM generated for each class, select a bounding box to include each outline, define it as the patch, and save it (one or two patches per CAM are considered in this study). For each feature, patches are saved as typical, representative patterns from only those CXR images with an AI-model predicted probability of greater than or equal to 0.9. We then train a cosine-metric-based UMAP model using the patches for all features [22]. The UMAP model transforms the patches into coordinates in a two-dimensional embedding space, such that the smaller the Euclidean distance in this space, the higher the cosine similarity. For the automated labeling method, therefore, the patch atlas consists of the coordinates of all patches in the two-dimensional embedding space, together with the UMAP model itself (Fig. 1b).

Patch similarity value calculation. The patch similarity (Fig. 1b) is proposed as a quantitative metric that enables the AI model to interpret a new patch based on the prediction basis. The metric is calculated as the percentile of how close a patch of a test image is to the prediction basis of K patches in the embedding space, where K is the size of the patch atlas.

Confidence value calculation. As shown in Fig. 1b, we propose the confidence metric, based on the distribution atlas, as a measure of the trust level between the positive and negative predicted probabilities for a feature. This quantitative metric is defined with equations (5) and (6). Assuming that a predicted probability is for class c, we calculate its percentile in the positive distribution atlas and one minus its percentile in the negative distribution atlas; the difference between the two percentiles is then calculated as the confidence.
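As a rough illustration of the two metrics just described, the sketch below computes (i) a percentile-style patch similarity from 2-D embedding coordinates and (ii) a confidence value as the difference between the percentile of a predicted probability in the positive distribution atlas and one minus its percentile in the negative atlas. This is a simplified reading of the method, assuming empirical percentiles over stored atlas values; all names are illustrative, and the exact percentile definitions in equations (5) and (6) may differ:

```python
import numpy as np

def patch_similarity(test_xy, atlas_xy):
    """Percentile of how close a test patch is to the K atlas patches
    in the 2-D UMAP embedding space (simplified interpretation)."""
    test_xy = np.asarray(test_xy, dtype=float)
    atlas_xy = np.asarray(atlas_xy, dtype=float)
    # distance from the test patch to its nearest atlas patch
    d_test = np.linalg.norm(atlas_xy - test_xy, axis=1).min()
    # reference distribution: each atlas patch's nearest-neighbour distance
    pairwise = np.linalg.norm(atlas_xy[:, None, :] - atlas_xy[None, :, :], axis=-1)
    np.fill_diagonal(pairwise, np.inf)
    d_ref = pairwise.min(axis=1)
    # percentile: fraction of reference distances at least as large as d_test
    return float(np.mean(d_ref >= d_test))

def confidence(p, pos_probs, neg_probs):
    """Difference between the percentile of p in the positive distribution
    atlas and (1 - percentile) in the negative distribution atlas."""
    pct_pos = np.mean(np.asarray(pos_probs) <= p)          # percentile among positives
    pct_neg = 1.0 - np.mean(np.asarray(neg_probs) <= p)    # reverse percentile among negatives
    return float(abs(pct_pos - pct_neg))
```

With well-separated positive and negative probability distributions, the confidence approaches 1; with heavily overlapping distributions, it approaches 0 at the crossover point, mirroring the p-value analogy drawn below.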
Because the predictive ability of the xAI model for each feature is related to the shape and degree of intersection of the two probability density curves (positive and negative) on the distribution atlas, the confidence metric, as defined in equations (5) and (6), provides a quantitative measure analogous to a p-value between different statistical distributions. In other words, the higher the confidence value for a label, the higher the likelihood that the test image maps to the correct label, and the lower the likelihood of incorrect mapping. Moreover, this metric can quantify different levels of confidence according to the different distributions of feature characteristics on the distribution atlas for each class of the model, even at identical predicted probabilities.

pSim calculation and pSim threshold selection. Our automated dataset-labeling method calculates the pSim value as the harmonic mean between confidence and patch similarity (pSimilarity in equation 7) for each test image:

pSim = 2 · confidence · pSimilarity / (confidence + pSimilarity)    (7)

The pSim threshold for each feature is chosen as the lowest pSim value that can achieve 100% PPV and NPV, as per Fig. 2.

An additional functionality of our model design is a "mode selection" algorithm, which, using the selected pSim threshold value, determines both the image label (positive, negative, or unlabeled) and a "mode" (self-annotation or re-annotation) for each test image, as per Fig. 1 and Supplementary Table 3. The "re-annotation" mode, which was not applied in the current study, can be used as part of the "explainability" functionality of the model to prompt the system to alert or query a human user if the pSim value for a feature falls below a selected threshold.
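Equation (7) and the mode-selection step can be sketched in a few lines. This is a minimal illustration only: the label/mode logic is a simplified stand-in for the full algorithm in Supplementary Table 3, and the 0.5 probability cutoff is an assumption, not a value from the study:

```python
def psim(confidence, patch_similarity):
    """pSim as the harmonic mean of confidence and patch similarity (eq. 7)."""
    if confidence + patch_similarity == 0.0:
        return 0.0
    return 2.0 * confidence * patch_similarity / (confidence + patch_similarity)

def mode_select(pred_prob, psim_value, psim_threshold, prob_cutoff=0.5):
    """Simplified mode selection: auto-label only when pSim clears the
    feature's threshold; otherwise leave unlabeled and defer to a human."""
    if psim_value >= psim_threshold:
        label = "positive" if pred_prob >= prob_cutoff else "negative"
        return label, "self-annotation"
    return "unlabeled", "re-annotation"
```

The harmonic mean is small whenever either input is small, so a case is confidently auto-labeled only when both the patch similarity and the confidence are high.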
Statistical analyses. To assess the statistical significance of the AUROCs, we calculated 95% CIs using a non-parametric bootstrap approach via the following process: first, 1000 cases were randomly sampled, with replacement, from the test dataset of 1000 cases, and the DCNN models were evaluated on the sampled test set. After running this process 2,000 times, 95% CIs were obtained using the interval between the 2.5th and 97.5th percentiles of the distribution of AUROCs. The 95% CIs of the percentage accuracy, sensitivity, and specificity of the models at the selected operating point were calculated using binomial proportion CIs. The training, validation, and test datasets were divided without overlapping patients or duplicated cases.
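The bootstrap procedure described above can be reproduced as follows. This is a self-contained sketch: the rank-based AUROC helper is ours (equivalent to the Mann-Whitney U formulation), and for brevity degenerate single-class resamples are skipped rather than redrawn:

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via the Mann-Whitney U statistic (no external dependencies)."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    # fraction of (positive, negative) pairs ranked correctly; ties count half
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def bootstrap_auroc_ci(y_true, scores, n_boot=2000, seed=0):
    """95% CI via non-parametric bootstrap: resample cases with replacement,
    re-evaluate the AUROC, and take the 2.5th/97.5th percentiles."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # sample n cases with replacement
        yb = y_true[idx]
        if yb.min() == yb.max():             # skip resamples with only one class
            continue
        stats.append(auroc(yb, scores[idx]))
    return float(np.percentile(stats, 2.5)), float(np.percentile(stats, 97.5))
```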