REPRODUCIBLE AND CLINICALLY TRANSLATABLE DEEP NEURAL NETWORKS FOR CANCER SCREENING

Cervical cancer is a leading cause of cancer mortality, with approximately 90% of the 250,000 deaths per year occurring in low- and middle-income countries (LMIC). Secondary prevention with cervical screening involves detecting and treating precursor lesions; however, scaling screening efforts in LMIC has been hampered by infrastructure and cost constraints. Recent work has supported the development of an artificial intelligence (AI) pipeline on digital images of the cervix to achieve an accurate and reliable diagnosis of treatable precancerous lesions. In particular, WHO guidelines emphasize visual triage of women testing positive for human papillomavirus (HPV) as the primary screen, and AI could assist in this triage task. Published AI reports have exhibited overfitting, lack of portability, and unrealistic, near-perfect performance estimates. To surmount these recognized issues, we implemented a comprehensive deep-learning model selection and optimization study on a large, collated, multi-institutional dataset of 9,462 women (17,013 images). We evaluated relative portability, repeatability, and classification performance. The top performing model, when combined with HPV type, achieved an area under the receiver operating characteristic (ROC) curve (AUC) of 0.89 within our study population of interest, and a low total extreme misclassification rate of 3.4%, on held-aside test sets. Our work is among the first efforts at designing a robust, repeatable, accurate, and clinically translatable deep-learning model for cervical screening.


RESULTS
In this work, we conducted a comprehensive, multi-stage model selection and optimization approach (Fig. 1, Fig. 2), utilizing a large, collated multi-institution, multi-device, and multi-population dataset of 9,462 women (17,013 images) (Table 1), in order to generate a diagnostic classifier optimized for 1. repeatability; 2. classification performance; and 3. HPV-group combined risk stratification (Fig. 2) (see METHODS).

REPEATABILITY ANALYSIS

Table 2 summarizes the repeatability analysis (Stage I), reporting the median and interquartile range of the quadratic weighted kappa (QWK), as well as the adjusted linear regression β, for each design choice within each design choice category: model architecture, loss function, balancing strategy, ground truth mapping, and dropout. Here, we adopted a conservative approach: we kept design choices whose median QWK and corresponding adjusted β values were relatively close and not clearly distinguishable from one another, and dropped only the clearly worst performing choices. For instance, we kept both the "3 level subsets" (β = -0.026) and the "5 level all patients" (β = -0.025) design choices within the "Multilevel Ground Truth" design category and passed them through to Stage II.

CLASSIFICATION PERFORMANCE ANALYSIS

Table 3 summarizes the classification performance analysis (Stage II), reporting the median and interquartile ranges for each of our two key classification metrics, 1. Youden's index and 2. extreme misclassifications (a computation of both is sketched in code at the end of this section), as well as the adjusted linear regression β for each design choice. As in Stage I, we evaluated the metrics both overall and within each design choice category, dropping the worst performing design choices at this stage in a two-level approach.

In the first level, we examined Youden's index across all design choices and dropped the worst performing choices; this resulted in 3 choices (SWT architecture, no balancing, 5-level ground truth), or 17.6% of the remaining choices, being dropped, and amounted to dropping choices with a median Youden's index of <150 (Table 3). In the second level, we considered extreme misclassifications: […] had a high median % of normal predicted as precancer+ (27.4%), while sampling 2:1:1 had a high median % of precancer+ predicted as normal (24.3%). The "3 level subsets" ground truth mapping was dropped for practical reasons: it was generated from the 5-level map by omitting the GL and GH labels in an attempt to create further distinction, or discontinuity, between the three classes (normal, GM, precancer+) during model experimentation. Both the "5 level all patients" and the "3 level subsets" ground truth mappings are impractical given the limited clinical data (HPV, histology, and/or cytology) we anticipate having available in the field to generate 5 distinct levels of ground truth, rendering retraining, validation, and implementation of these approaches challenging.

Fig. 4a and Table 4 highlight the 10 best performing models that emerge following Stages I, II, and III of our model selection approach. All 10 models perform similarly among HPV positive women in the full 5-study set, while showing notable differences per study, as seen in the NHS subset of the full 5-study set, measured by the combined HPV-AVE AUC. The NHS subset represents women who are closer to the screening population we would expect in the field when considering deployment of our model, since it is a population-based cohort study (35); hence, AUC on the NHS subset represents a truer metric for model comparison. The models in Fig. 4a and Table 4 are listed in decreasing order of AUC on the HPV positive NHS subset. Fig. 4b plots the ROC curves for the top 4 of the 10 models highlighted in Table 4.
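As an illustration of the two Stage II metrics, the minimal sketch below computes Youden's index and the extreme misclassification rates from hypothetical three-class predictions (0 = normal, 1 = gray zone, 2 = precancer+), collapsing to a binary precancer+ vs. <precancer task for the index; the scale on which the study reports Youden's index may differ, and the arrays are illustrative only.

```python
import numpy as np

def stage2_metrics(y_true, y_pred):
    """Youden's index (precancer+ vs. <precancer) plus extreme
    misclassification rates for three-class predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    # Collapse to binary: precancer+ (class 2) vs. <precancer (classes 0, 1).
    tp = np.sum((y_true == 2) & (y_pred == 2))
    fn = np.sum((y_true == 2) & (y_pred != 2))
    tn = np.sum((y_true != 2) & (y_pred != 2))
    fp = np.sum((y_true != 2) & (y_pred == 2))
    youden = tp / (tp + fn) + tn / (tn + fp) - 1.0
    # Extreme misclassifications: the two boundary classes confused.
    extreme_fn = np.mean(y_pred[y_true == 2] == 0)  # precancer+ called normal
    extreme_fp = np.mean(y_pred[y_true == 0] == 2)  # normal called precancer+
    return youden, extreme_fn, extreme_fp

print(stage2_metrics([0, 0, 1, 2, 2, 2], [0, 2, 1, 2, 2, 0]))
```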

DISCUSSION

[…] the safety of a patient. Therefore, it is essential that models designed with the goal of clinical deployment be specifically optimized for improved repeatability and clinical translation.
Our work addresses these concerns of reliability and clinical translatability. We optimize our model selection approach with improved repeatability as the primary stage (Stage I) of our selection criterion, ensuring that only design choices that produce repeatable, reliable predictions across multiple images from the same woman's visit are passed through to the next stage of evaluation for classification performance. Our work builds on prior work highlighting improvements in the repeatability of model predictions made by certain design choices (36,37). Our work also stands out among the paucity of current approaches that have utilized AI and DL for cervical screening (21-24); as aforementioned, these are largely plagued by overfitting and give no consideration to repeatability. The dearth of work investigating the repeatability of AI models designed for clinical translation in the current DL and medical image classification literature has meant that no rigorous study, to the best of our knowledge, has employed repeatability as a model selection criterion. We posit that our work could motivate further efforts to include repeatability as a key criterion for clinical AI model design.

Subsequent design choices of our work are optimized to improve clinical translatability. Prior work (21-24) has shown that while binary classifiers for image-based cervical precancer+ detection can achieve competitive performance in a given internal seed dataset, they translate poorly when tested in different settings; uncertain cases can be misclassified, and predictions tend to oscillate between the two classes. This oscillation phenomenon could prevent a precancer+ woman from accessing further evaluation (i.e., a false negative) or direct a normal woman through unnecessary, potentially invasive tests (i.e., a false positive). False negatives are especially problematic in LMIC, where screening is limited, and represent a missed opportunity to detect and treat precancer via excisional, ablative, or surgical methods in order to avert cervical cancer (13,38). By incorporating a multi-class approach and a loss function that heavily penalizes extreme misclassifications, we improve the reliability of the model-predicted normal and precancer+ categories, and further ensure that women […] that, we believe, should become standard for cancer classifier design, in particular for neoplasms with well-known clinical causative agents.

Our prior work has informed us that the HPV positive women in the NHS subset better represent a typical screening population: specifically, the NHS subset represents women who tested HPV positive in a given population with an intermediate HPV prevalence (35). The other 4 subsets within the full 5-study dataset comprise women referred from HPV-based/cytology-based referral clinics; these represent a colposcopy population, which has a higher disease prevalence. We optimize each stage (I, II, and III) of our model selection approach on the full 5-study dataset to better capture the variability in cervical appearance on imaging. At the end of this selection, we find that our top models do not perform meaningfully differently among HPV positive women in the full 5-study dataset, as highlighted by the similar HPV-AVE AUC values across the models in the "HPV positive 5 study" column of Table 4.
For the final selection of the top candidates, given our goal of using AVE as a triage tool for HPV positive women in a screening setting, we therefore narrow our focus to the combined HPV-AVE AUC in the NHS HPV positive subset ("HPV positive NHS" column of Table 4; Fig. 4) for each model on Test Set 1, and confirm the performance of the top candidates on Test Set 2 (Table 5, Fig. 5a).

Despite the multi-institutional, multi-device, and multi-population nature of our final, collated dataset, the use of multiple held-aside test sets, and the exhaustive search space utilized for our algorithm choices, our work may be limited by sparse external validation. Forthcoming work will evaluate our model selection choices on several additional external datasets, assessing out-of-the-box performance as well as various transfer learning, retraining, and generalization approaches. Future work will additionally optimize our final model choice for use on edge devices, thereby promoting deployability and translation in LMIC.

In this work, we utilized a large, multi-institutional, multi-device, and multi-population dataset of 9,462 women (17,013 images) as a seed and implemented a comprehensive model selection approach to generate a diagnostic classifier, termed AVE, able to classify images of the cervix into "normal", "gray zone", and "precancer+" categories. Our model selection approach investigates various choices of model architecture, loss function, balancing strategy, dropout, and ground truth mapping, and optimizes for 1. improved repeatability; 2. classification performance; and 3. high-risk HPV-type-group combined risk stratification. Our best performing model uniquely 1. alleviates overfitting by incorporating spatial MC dropout to regularize the learning process; 2. achieves strong repeatability of the predicted class across repeat images from the same woman; 3. addresses rater and model uncertainty on ambiguous cases by utilizing a three-level ground truth and QWK as the loss function to penalize extreme (between boundary class) misclassifications; and 4. achieves strong additional risk stratification when combined with the corresponding HPV type group within our screening population of interest. While our initial goal is to implement AVE primarily to triage HPV positive women in a screening setting, we expect our approach and selected model to also provide reliable predictions both for images obtained in the colposcopy setting and in the absence of HPV results. Our model selection approach is generalizable to other clinical domains as well: we hope our work will foster additional, carefully designed studies that focus on alleviating overfitting and improving the reliability of model predictions, in addition to optimizing for improved classification performance, when deciding to use an AI approach for a given clinical task.

METHODS

This study set out to systematically compare the impact of multiple design choices on the ability of a deep neural network (DNN) to classify cervical images into delineated cervical cancer risk categories. We combined images of the cervix from five studies (Supp. Table 1) into a large convenience sample for analysis. We subsequently labelled the images using three distinct multi-level ground truth labelling approaches: 1. a 5-level map, which included normal, gray-low (GL), gray-middle (GM), gray-high (GH), and precancer+ (termed "5 level all patients"); 2. a 3-level map, which combined the three intermediate labels (GL, GM, GH) into a single gray zone (termed "3 level all patients"); and 3. an additional 3-level map, which excluded the GL and GH labels and considered only the normal, GM, and precancer+ labels (termed "3 level subsets").
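As a minimal sketch, the three labelling approaches can be expressed as the following mappings; the label names and integer encodings are assumptions, since the study does not specify its internal encoding.

```python
# "5 level all patients": the full five-level ground truth.
FIVE_LEVEL_ALL = {"normal": 0, "GL": 1, "GM": 2, "GH": 3, "precancer+": 4}

# "3 level all patients": the three intermediate labels merge into one gray zone.
THREE_LEVEL_ALL = {"normal": 0, "GL": 1, "GM": 1, "GH": 1, "precancer+": 2}

# "3 level subsets": GL and GH cases are excluded from the data entirely,
# leaving a deliberate discontinuity between the three remaining classes.
THREE_LEVEL_SUBSETS = {"normal": 0, "GM": 1, "precancer+": 2}
```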
The choice of multi-level ground truth labelling for model selection was motivated by our previous work and intuition revealing the failure of binary models, as well as by our specific clinical use case. Table 1 highlights the population-level and dataset-level characteristics of our final, collated dataset used for training and evaluation, including the distribution of histology, cytology, HPV types, population-level study, age, and number of images per patient within each of the five ground truth classes.

We subsequently identified four key design decision categories that were systematically implemented, intersected, and compared: model architecture, loss function, balancing strategy, and implementation of dropout, as highlighted in Fig. 1. The choice of balancing strategy for a particular model determined the ratios of the randomly chosen train and validation sets used during training. We subsequently trained multiple classifiers using combinations of these design choices and generated predictions on a common test set ("Test Set 1"), which allowed for comparison and ranking of approaches based on repeatability, classification performance, and HPV type-group combined risk stratification. Finally, we confirmed the performance of the top models on a second test set ("Test Set 2") to mitigate the impact of chance on the best performing approaches.

Analysis population

The convenience sample was split using random sampling into four sets for use in the evaluation of algorithm parameters. For the initial splits, women were randomly selected into either training, validation, or test ("Test Set 1"), at rates of 60%, 10%, and 20%, respectively. An additional hold-back test set ("Test Set 2") of 10% of the total women was selected and used to confirm the findings of the best models from Test Set 1. All subsets maintained the same study and ground truth proportions as the full set (Table 1, Supp. Table 2). All images associated with the selected visit for each woman were included in the set to which the woman was assigned; 7,359 women (77.8%) had ≥ 2 images. For a woman identified as precancer or worse (precancer+), the visit at or directly preceding the diagnosis was selected; for a woman identified as any of the gray zone categories (GL, GM, GH), the visit associated with the abnormality was selected; and for a woman identified as normal, a study visit was randomly selected for inclusion if there was more than one. […]
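A minimal sketch of such a per-woman split follows, assuming a hypothetical one-row-per-woman DataFrame with "study" and "ground_truth" columns; splitting within (study, ground truth) strata keeps all of a woman's images in a single set while approximately preserving the 60/10/20/10 proportions. This illustrates the splitting logic and is not the study's code.

```python
import numpy as np
import pandas as pd

def split_women(women: pd.DataFrame, seed: int = 0) -> pd.Series:
    """Assign each woman to train/val/test1/test2, stratified by
    (study, ground_truth) so every set keeps the same proportions."""
    rng = np.random.default_rng(seed)
    fractions = {"train": 0.6, "val": 0.1, "test1": 0.2, "test2": 0.1}
    assignment = pd.Series(index=women.index, dtype=object)
    for _, stratum in women.groupby(["study", "ground_truth"]):
        # Shuffle the women in this stratum, then cut into four blocks.
        idx = rng.permutation(stratum.index.to_numpy())
        bounds = np.cumsum([int(round(f * len(idx))) for f in fractions.values()])
        for name, part in zip(fractions, np.split(idx, bounds[:-1])):
            assignment.loc[part] = name
    return assignment
```

Because the split is performed on women rather than images, all images from a woman's selected visit inherit her assignment, which is what makes the paired-image repeatability analysis on Test Set 1 possible.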

Ethics

All study participants signed a written informed consent prior to enrollment and sample collection. All five studies were reviewed and approved by multiple Institutional Review Boards, including those of the National Cancer Institute (NCI) and the National Institutes of Health (NIH), as well as within the institution/country where each study was conducted.

[…] Here, α is a weighting factor used to address class imbalance, also present in standard cross-entropy loss implementations, γ ≥ 0 is a tunable focusing parameter, and p_t is the model's estimated probability for the true class. […] Here, σ is the sigmoid function, ŷ is the model's output, and y is the level-encoded ground truth (both loss families are sketched in code at the end of this section).

Three balancing strategies were evaluated to deal with the dataset's class imbalance: weighting the loss function, modifying the loading sampler, and rebalancing the training and validation sets. These strategies were applied only during the training process and were compared against training without balancing. To emphasize the least frequent labels, one approach was to apply weights to the loss function in proportion to the inverse of the occurrence of each class label. A second approach was to reweight the loading sampler to present images associated with each label either equally or with specific weights (2:1:1, 1:1:2, or 1:1:4; Normal : Gray Zone : Precancer+). The final balancing strategy, henceforth termed "remove controls", involved randomly removing "normal" (class 0) women from the training and validation sets and reallocating them to Test Set 1, in order to better rebalance the training and validation set labels; in this approach, a total of 2,383 women (4,555 images) from the initial train set and 410 women (780 images) from the initial validation set were reallocated to the test set. The final class balance in the train and validation sets for the "remove controls" balancing strategy amounted to ~40% normal : 40% gray zone (including GL, GM, and GH) : 20% precancer+ (Supp. Table 3).

Finally, we evaluated multiple approaches to dropout during training to alleviate overfitting and regularize the learning process by randomly removing neural connections from the model (49). Spatial dropout drops entire feature maps during training: a rate of 0.1 was applied after each dense layer for the DenseNet models, and after each residual block for the ResNet and ResNeSt models. The Swin Transformer models were used as implemented in (45). Monte Carlo (MC) dropout, which can be thought of as a Bayesian approximation (50), was additionally implemented by enabling dropout during inference and averaging 50 MC samples. MC models in this work refer to models trained using dropout combined with the inference prediction derived from the 50 forward passes.
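A minimal sketch of the MC dropout inference just described, assuming a hypothetical PyTorch model whose dropout is implemented with torch.nn.Dropout/Dropout2d modules; the only study-specified detail reflected here is the averaging of 50 stochastic forward passes with dropout kept active.

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, images, n_samples=50):
    """Average class probabilities over n_samples stochastic forward passes."""
    model.eval()
    # Re-enable only the dropout layers so their masks stay stochastic
    # while batch-norm statistics remain frozen in eval mode.
    for module in model.modules():
        if isinstance(module, (torch.nn.Dropout, torch.nn.Dropout2d)):
            module.train()
    probs = torch.stack(
        [torch.softmax(model(images), dim=1) for _ in range(n_samples)]
    )
    return probs.mean(dim=0)  # MC-averaged class probabilities
```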

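The two loss families referenced earlier in this section (a focal loss with weighting factor α and focusing parameter γ, and a CORAL-style ordinal loss on level-encoded ground truth) can be sketched in PyTorch as below; tensor shapes, default hyperparameters, and the exact formulations used in the study are assumptions.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=None, gamma=2.0):
    """Multiclass focal loss: cross-entropy scaled by (1 - p_t)^gamma so hard,
    misclassified examples dominate; alpha is an optional per-class weight."""
    ce = F.cross_entropy(logits, target, reduction="none")
    p_t = torch.exp(-ce)  # model's estimated probability for the true class
    loss = (1.0 - p_t) ** gamma * ce
    if alpha is not None:
        loss = alpha[target] * loss  # class-imbalance weighting factor
    return loss.mean()

def coral_loss(logits, target, num_classes=3):
    """CORAL-style ordinal loss: y is level-encoded into num_classes - 1
    binary indicators (y > k), each trained with sigmoid + binary CE."""
    levels = torch.arange(num_classes - 1, device=target.device)
    level_targets = (target.unsqueeze(1) > levels).float()
    return F.binary_cross_entropy_with_logits(logits, level_targets)
```

Here the focal-loss logits have shape (batch, num_classes), while the CORAL logits have shape (batch, num_classes - 1), one binary task per class boundary.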
Statistical analysis
Our model selection approach (Fig. 2) consisted of three stages, each utilizing model predictions on Test Set 1. After selection of the 10 best models following Stage III, we further evaluated their performance on Test Set 2 to confirm the results from Test Set 1.

In Stage I of our model selection approach, we evaluated models based on their ability to classify pairs of cervical images reliably and repeatedly, termed the repeatability analysis. We calculated QWK values on the discrete class outcomes for paired images from the same woman and visit for all models, computing the mean, median, and interquartile range of the QWK for each design choice. We subsequently ran an adjusted multivariate linear regression of the median QWK against the various design choice categories and computed the β values and corresponding p-values for each design choice, holding the design choice with the highest median QWK within each design choice category as the reference. This allowed us to gauge the relative impact of the various design choices within each of the model architecture, loss function, balancing strategy, dropout, and ground truth categories.

In Stage II of our approach, we evaluated classification performance based on two key metrics: 1. Youden's index, which captures the overall sensitivity and specificity, and 2. the degree of extreme misclassifications; this is termed the classification performance analysis. We computed both sets of metrics for each of the design choices within each design choice category. Our choice to include misclassification of the extreme classes (i.e., precancer+ classified as normal, an extreme false negative, and normal classified as precancer+, an extreme false positive) as metrics was motivated by the importance of these metrics for triage tests (51). As in the repeatability analysis, we calculated the mean, median, and interquartile ranges for these metrics, as well as […]

In Stage III, in order to assess the ability of a model to further stratify HPV-associated risk, we ran logistic regression models on a binary precancer+ vs. <precancer variable. These models were adjusted for hierarchical HPV type group and the model-predicted class. We subsequently calculated the difference in AUC between the model adjusted for both predicted class and HPV type group and the model adjusted only for HPV type group, and highlighted the 10 models with the best additional stratification (Table 4, Fig. 4).

Finally, we computed additional classification performance metrics (1. % precancer+ predicted as normal; and 2. % normal predicted as precancer+) and repeatability metrics (1. the % 2-class disagreement between image pairs; and 2. QWK values on the discrete class outcomes for paired images from the same woman) for each of the top 10 models on Test Set 2 (Table 5, Fig. 5), in order to further confirm the performance of these models.
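The Stage III quantity, i.e., the additional stratification gained by adding the model-predicted class to an HPV-type-group-only model, can be sketched as below; the DataFrame column names and the one-hot encoding are assumptions, and the study's regression specification may differ in detail.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def added_stratification(df: pd.DataFrame) -> float:
    """AUC(HPV type group + predicted class) minus AUC(HPV type group only)
    for a binary precancer+ vs. <precancer outcome."""
    y = df["precancer_plus"]  # 1 = precancer+, 0 = <precancer
    X_hpv = pd.get_dummies(df[["hpv_type_group"]].astype(str))
    X_both = pd.get_dummies(df[["hpv_type_group", "predicted_class"]].astype(str))
    auc_hpv = roc_auc_score(
        y, LogisticRegression(max_iter=1000).fit(X_hpv, y).predict_proba(X_hpv)[:, 1]
    )
    auc_both = roc_auc_score(
        y, LogisticRegression(max_iter=1000).fit(X_both, y).predict_proba(X_both)[:, 1]
    )
    return auc_both - auc_hpv
```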
Figure 1: […] (see Table 1, Supp. Table 1, and Supp. Methods for a detailed description and breakdown of the studies by ground truth) used to generate the final dataset in the middle panel, which is subsequently used to generate a train and validation set, as well as two separate test sets. The intersections of model selection choices in the bottom panel are used to generate a compendium of models trained using the corresponding train and validation sets and evaluated on Test Set 1, optimizing for repeatability, classification performance, reduced extreme misclassifications, and combined risk stratification with high-risk human papillomavirus (HPV) types. Test Set 2 is utilized to verify the performance of the top candidates that emerge from evaluation on Test Set 1. SWT: Swin Transformer; QWK: quadratic weighted kappa; CORAL: consistent rank logits loss, as described in the METHODS section.

Table 3: […] (Table 2). Rows shaded in salmon indicate design choices filtered out at this stage due to poor classification performance (as captured by Youden's index). Rows shaded in gray indicate design choices subsequently filtered out due to a combination of poor classification performance (as captured by the rate of extreme misclassifications) and/or practical reasons. SWT: Swin Transformer; ref: reference category.

Supplementary Methods (partial): The Natural History Study (NHS) is a population-based prospective study carried out in Guanacaste, Costa Rica between 1993 and 2000 (35). This cohort enrolled women followed in either an active cohort, with visits every 6-12 months, or a passive cohort, screened once during follow-up between 5 and 7 years after enrollment. Screening visits included collection of specimens for cytology, human papillomavirus (HPV) testing, and digital images, while histology was collected among women with abnormal colposcopic evaluation. Cytology was assessed via both conventional and liquid-based methods, as well as a first-generation automated approach. HPV testing by MY09/MY11 polymerase chain reaction (PCR) consensus primers was performed on samples collected by Dacron swabs; however, these results were not used for colposcopy referral during the study. Two cervical images per visit were collected at each screening visit using a Cervigram cerviscope, and these were later digitized and compressed for storage (55).

[…] (LSIL). Women were followed for 2 years, with screening visits every 6 months. Screening visit specimen collection included two cervical specimens, one for liquid-based cytology and one for HPV testing, as well as cervical images. Referral to colposcopy and histologic sampling varied by study visit, including enrollment referral following the referral cytology result as well as the randomized HPV result, referral from a follow-up visit due to high-grade squamous intraepithelial lesion (HSIL) cytology, and exit colposcopy for all women. Type-specific HPV results were not used for patient management (56). Cytologic diagnoses were based on ThinPrep slides created from […]

Supplementary Table 2: Detailed breakdown of the full 5-study dataset by set (train, validation, test 1, test 2), study, and ground truth. ni = total # images; nw = total # women; (a) ground truth ratios (by images or women) within each set (train/validation/test 1/test 2) = total # (images or women) in the ground truth category of the set ÷ total # (images or women) in the set; (b) proportion of total (images or women) in each set (train/validation/test 1/test 2) = total # (images or women) in the set ÷ total # (images or women) in the full dataset.

Supplementary Table 3: Detailed breakdown of the rebalanced dataset after the "remove controls" balancing strategy, by set (train, validation, test 1, test 2), study, and ground truth.
ni = total # images; nw = total # women; (a) ground truth ratios (by images or women) within each set (train/validation/test 1/test 2) = total # (images or women) in the ground truth category of the set ÷ total # (images or women) in the set; (b) proportion of total (images or women) in each set (train/validation/test 1/test 2) = total # (images or women) in the set ÷ total # (images or women) in the full dataset.