Transferability of Deep Learning Models for Focus Quality Assessment in Digital Pathology

Out-of-focus sections of whole slide images are a significant source of false positives and other systematic errors in clinical diagnoses. As a result, focus quality assessment (FQA) methods must be able to quickly and accurately differentiate between focus levels in a scan. Recently, deep learning methods using convolutional neural networks (CNNs) have been adopted for FQA. However, the biggest obstacles impeding their wide usage in clinical workflows are their generalizability across different test conditions and their potentially high computational cost. In this study, we focus on the transferability and scalability of CNN-based FQA approaches. We carry out an investigation on ten architecturally diverse networks using five datasets with stain and tissue diversity. We evaluate the computational complexity of each network and scale this to realistic applications involving hundreds of whole slide images. We assess how well each full model transfers to a separate, unseen dataset without fine-tuning. We show that shallower networks transfer well when used on small input patch sizes, while deeper networks work more effectively on larger inputs. Furthermore, we introduce neural architecture search (NAS) to the field and learn an automatically designed low-complexity CNN architecture using differentiable architecture search, which achieves competitive performance relative to established CNNs.


Introduction
Digital pathology is an expanding field focused on using Whole Slide Image (WSI) scans to facilitate the clinical workflow 1, 2 . A critical issue in this field is reliable and efficient quality control (QC) for the scanned images. With a global shortage of trained pathologists, automated QC methods are an attractive option for digital pathology [3][4][5] . Making effective diagnoses from whole slides requires high-quality images, which can be affected by lighting conditions, the optical system, and the scanner's sensor itself [6][7][8][9] . In this context, QC refers to Focus Quality Assessment (FQA), which differentiates between varying degrees of in-focus and out-of-focus sections of an image.
Recently, deep learning models based on convolutional neural networks (CNNs) have emerged as viable FQA methods 2, 10-23 . Open source platforms such as HistoQC 13 , CellProfiler 3.0 17 and ImageJ 24, 25 also leverage deep learning models for FQA. Moreover, artificial intelligence (AI) is increasingly used for medical diagnosis in digital pathology [26][27][28][29][30][31][32][33][34] . Out-of-focus regions in an image are a major contributor to systematic errors in these diagnoses 2,35,36 , highlighting the importance of reliable FQA methods to accompany these diagnosis tools. However, two major barriers have slowed the adoption of deep learning methods in clinical workflows. The first is their undertested transferability to diverse imaging conditions, and the second is their potentially high computational cost and scalability to the extremely high scanning throughput of practical clinical workflows 3,10 .
Dynamic imaging conditions mean that FQA methods must generalize to different datasets. This requires that a variety of tissue types, stain types, and resolutions be used to train the model 13,37,38 . Unfortunately, there has until recently been a shortage of large and diverse datasets for FQA purposes 39 , which increases the risk of overfitting data-driven models 40 . Additionally, a principal advantage of digital pathology technology is that scanners can process images much faster than human pathologists, with some able to scan hundreds of images at a time 6,26,40 . In clinical settings, it is ideal for scans to be completed during night-time hours so that they are ready for diagnosis the following day 10 . QC pipelines must therefore keep pace with this high scanning throughput. The computational complexity of an FQA method should not be a restricting factor and is therefore as important to evaluate as its performance 10,40,41 .
A drawback of deep learning models for FQA is that they are not as easily applicable as knowledge-based methods [42][43][44][45][46][47][48] , which have low computational complexity and can be applied without adjustment or tailoring 10 . Conversely, when transferring CNNs to other computer vision applications, it is usually recommended to use a process called fine-tuning [49][50][51][52][53] , where the network already has a majority of its parameters set and is then trained to adjust the remaining parameters. This is used when the datasets and computational resources for training are limited 49 . However, fine-tuning increases the resources spent transferring the network to different scanners, and would ideally not be necessary for high performance. While foregoing fine-tuning has efficiency advantages, it could raise concerns regarding the explainability of the model and trust in AI-based decision making 3,27 .

Figure 1 shows our process for the evaluation of each deep learning model. The images are first normalized 27,39 , then used to train a CNN. The trained model, including the fully connected layer, is then tested on the same dataset on which it was trained to validate the success of the training. Afterwards, the full model is tested on a separate dataset it has not yet seen to evaluate the transfer process 13,54,55 , without the use of fine-tuning. Numerous metrics are used to assess each deep learning model, including computational complexity, layer probing quality metrics, and a spatial focus quality distribution.

Table 1 shows an overview of existing deep learning models used for FQA purposes. The models are split into two categories based on the architecture of the CNN being used:

1. Lightweight CNNs: these refer to CNNs that are optimized towards efficiency by containing only one convolutional layer [10][11][12][13][14][15]22 . Some of these networks are able to perform FQA even more quickly than knowledge-based methods 14,15 . They also directly address the scalability problem, as their low computational complexity allows gigabytes of WSI data to be assessed for focus quality in a matter of hours. Shallow CNNs can also be trained well using a limited number of samples 14,22 , which reduces the effort needed for training. These networks generally perform worse than deeper CNNs, but can still provide reasonable FQA performance because blur is assumed to be encoded in low-level features in a well-controlled environment 10, 22 .

2. Deep CNNs: these CNNs have multiple layers and are optimized towards obtaining an accurate result using a robust framework 2, 16-21 . These networks focus on achieving high accuracy and being highly generalizable to different tissue types, stain types, and resolutions. Their additional layers enable the network to capture fine features that may be missed by a shallow network 56 . A drawback of deep CNNs is that they are prone to overfitting when trained on small datasets, which means more effort must be put into dataset creation 14 . Beyond a critical depth, CNNs can also be overparametrized, which worsens their ability to generalize 57 .

[Table 1: overview of existing deep learning FQA methods, with columns for author, year, method type, organ variety, stain variety, transferability, scalability, and a method description.]

Table 1 also shows which studies focused on scalability and transferability, the two key pillars of a successful FQA method 3,10 . The majority of studies using lightweight CNNs do not examine the transferability of the model, while the majority of studies using deep CNNs do not examine the scalability of their model. Three studies 10, 15, 18 examined both the transferability and the scalability of the method, which is not a large enough representation to draw conclusions about the characteristics of a model and method that translate to successful FQA. Additionally, transferability cannot effectively be compared between studies because of major differences in the methodologies 41 .
Additionally, of the papers that examined deep CNN methods for transferability, three studies 2, 16, 18 used a variety of tissue and stain types. A variety of WSIs is necessary to confirm that the methods can be generalized to FQA in digital pathology. To perform a more systematic analysis of how well a model can generalize, diverse datasets should be used in the training process before transferring the models to a separate dataset that they have not yet seen for testing 13,54,55 . While many of the models in Table 1 were custom-made for the application, none of them use neural architecture search (NAS) to arrive at an optimal architecture. Automating the architecture design process requires less effort on the part of human researchers while still performing well compared to human-designed CNNs [58][59][60][61][62] . Networks that are tailored for digital pathology datasets may deliver higher-performing FQA.
Heatmaps are used in four studies 2, 10, 18, 20 to spatially represent the FQA of a deep learning model. Spatial representations are important for better visualizing which features the model is able to capture well and which features it misses. They are also important for understanding the characteristics of a dataset that may make it more or less transferable to other applications 2, 10, 20 .
In this paper, we make the following contributions to improving FQA in digital pathology: Experiment Design: We train ten architecturally diverse CNNs on five datasets of different stain types, tissue types, resolutions, and input patch sizes. We evaluate the computational complexity of each CNN, and scale this to realistic scanning applications. We use these methods to draw conclusions about the effect of input patch size, tissue diversity, and stain diversity on the transferability of a deep learning model to other datasets.
Architecture Design: We develop an automatically designed architecture using differentiable architecture search 58 on the same diverse datasets to evaluate the performance of searched architectures relative to conventional CNNs.
Validation Methods: We use the knowledge gain and mapping condition metrics 63 to evaluate how well the model is learning as well as its degree of stability. We use ROC and PR metrics to evaluate the performance of each network when transferred to another dataset (a minimal sketch of this computation follows below). We use focus quality heatmap representations to understand the spatial distribution of focus quality in an image.
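The ROC and PR transfer numbers reported later are areas under the respective curves, computed from patch-level focus labels and predicted scores. The sketch below assumes scikit-learn's implementations and illustrative names; it is not code from the original study.

```python
# Illustrative computation of the ROC and PR transfer metrics from
# patch-level labels and predicted focus scores (assumes scikit-learn).
from sklearn.metrics import roc_auc_score, average_precision_score

def transfer_metrics(y_true, scores):
    """y_true: binary focus labels; scores: a model's patch-level outputs."""
    return {
        "ROC": roc_auc_score(y_true, scores),           # area under the ROC curve
        "PR": average_precision_score(y_true, scores),  # area under the PR curve
    }
```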

Dataset Selection
This section describes the datasets used for this experiment, with information about each dataset summarized in Table 2. Datasets with the suffix "64", such as FocusPath64 43 , are copies of the original dataset with a 64 x 64 input patch size.

[Table 2, listing the training/testing patch size, number of patches, training/testing split, and mean hue for each dataset, and Figure 3 appear here.]

FocusPath 43 : The FocusPath 43 dataset contains 8640 patches of size 1024 x 1024 extracted from nine differently stained slides. This dataset is useful for the development of CNNs geared towards FQA methods due to its diverse distribution of colors and stains relative to other datasets. A stain distribution can be found in the Supplementary Materials, with a sample hue distribution for this dataset shown in Figure 2.
DeepFocus 18 : The DeepFocus 18 dataset contains 118800 patches of size 64 x 64, consisting of 16 different slides with 4 types of stains. This dataset, alongside the FocusPath64 43 and BioImage64 16 datasets, was useful for determining the effect that a varying patch size can have on the quality of training and transferability to other datasets. This dataset has less stain diversity than FocusPath 43 , and slightly less hue diversity as well, with a standard deviation 1.8% smaller than that of FocusPath 43 .
BioImage 16 : The Broad BioImage Dataset 16 consists of 52224 patches of size 696 x 520. BioImage 16 was useful for investigating the effect that grayscale images have on the quality of training and transferability to other datasets. It has been observed that color information can positively enhance the FQA performance of a CNN 14 . Successful FQA methods should be able to distinguish focus levels regardless of the colors in an image, so this study also seeks to further investigate the effect of color information on transfer performance.
TCGA 64 : The TCGA@Focus dataset contains 14371 image patches in total, with 11328 patches labelled in-focus and 3043 patches labelled out-of-focus. This dataset was chosen due to its wide spectrum of tissue textures and colors.
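The mean hue statistics in Table 2 and the distributions in Figure 2 are based on a 1% sampling of pixels. A minimal sketch of such an estimate is shown below; the file handling and binning details are assumptions rather than the study's exact procedure.

```python
# Sketch of a hue relative-frequency estimate as in Figure 2: sample ~1% of
# pixels per patch, convert to HSV, and histogram the hue channel.
import numpy as np
from PIL import Image

def hue_distribution(image_paths, sample_frac=0.01, bins=64):
    counts = np.zeros(bins)
    rng = np.random.default_rng(0)
    for path in image_paths:
        hues = np.asarray(Image.open(path).convert("HSV"))[..., 0].ravel()
        n = max(1, int(hues.size * sample_frac))   # 1% pixel sampling
        sample = rng.choice(hues, size=n, replace=False)
        counts += np.histogram(sample, bins=bins, range=(0, 255))[0]
    return counts / counts.sum()                   # relative frequency
```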

CNN Analysis
This experiment trained ten architecturally diverse CNN models on the five training datasets. The first is FocusLiteNN 10 , a lightweight CNN built for FQA, which was trained using 1, 2, and 10 input channels. Other architectures used include EONSS 11 , DenseNet-13 65 , MobileNetv2 66 , and ResNet 67 variants. We apply differentiable architecture search (DARTS) 58 due to its suitability for convolutional architectures and its scalability advantages, and we refer to our searched architecture as DARTS-FQA. The DARTS-FQA search space uses a three-cell system in which every cell is a reduction cell. Each cell is a directed acyclic graph built from four nodes, each representing an ordered sequence of feature maps. Figure 4 shows the DARTS-FQA reduction cell architecture. In a reduction cell, all operations adjacent to the input nodes have a stride of two, halving the pixel resolution of the image. Over 60 epochs, the algorithm learns the operations on the edges of these acyclic graphs from a few candidate operations in the search space, selecting those that yield the highest validation accuracy 58 . After the search has completed, the model is frozen and transferred to evaluation. A 3-layer, 20 input channel model based on the searched architecture is trained for 120 epochs using the Adam optimizer, with full details found in the Supplementary Materials.
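At the core of DARTS is a continuous relaxation of the discrete operation choice on each edge of the cell graph: every candidate operation is applied and the results are blended with softmax-weighted architecture parameters, which are learned on validation data and later discretized. The sketch below is a minimal, simplified rendering of this idea; the candidate set, channel handling, and weight placement are illustrative assumptions rather than the exact DARTS-FQA search space.

```python
# Minimal sketch of the DARTS continuous relaxation on one edge of a cell.
# Candidate operations and sizes are illustrative, not the DARTS-FQA set.
import torch
import torch.nn as nn
import torch.nn.functional as F

CANDIDATES = {
    "conv_3x3":     lambda c, s: nn.Conv2d(c, c, 3, stride=s, padding=1),
    "max_pool_3x3": lambda c, s: nn.MaxPool2d(3, stride=s, padding=1),
    "skip":         lambda c, s: nn.Identity() if s == 1 else nn.AvgPool2d(1, stride=s),
}

class MixedOp(nn.Module):
    """One edge of the cell: a softmax-weighted mixture of candidate ops."""
    def __init__(self, channels, stride):
        super().__init__()
        self.ops = nn.ModuleList([f(channels, stride) for f in CANDIDATES.values()])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture weights

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

# After the 60-epoch search, each edge keeps only its highest-weight
# operation, e.g. list(CANDIDATES)[mixed_op.alpha.argmax()].
```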

CNN Complexity
FLOPs and GPU latency were the two parameters used to investigate the computational complexity of these CNNs for 4 different randomly cropped input patch sizes: 64, 128, 235, and 300. To fairly compare complexity, all models were evaluated on a Windows station with an Intel Core i7-10875H CPU @ 2.30GHz and an NVIDIA GeForce GTX 1650 Ti. For latency measurements, the GPU was given time to initialize, and the latency was calculated as the average of 100 trials. Figure 5 shows that the GPU latency and FLOP cost increase with the input patch size when evaluating a single image, especially for the deepest networks such as ResNet 67 . With shallower networks, the change is less noticeable because the time frames are much shorter and more stochastic. Figure 5 also confirms that the FocusLiteNN 10 networks have the lowest GPU latency per image, with a reduction in time per image of 88.7% from DenseNet-13 65 , and 98.7% from ResNet101 67 for a single 64 x 64 input patch. These FocusLiteNN 10 models as well as EONSS 11 also save on FLOPs per image, with a reduction of 99.2% between FocusLiteNN (10-channel) 10 and ResNet101 67 for a single 64 x 64 input patch.
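A minimal sketch of the latency measurement protocol described above, assuming PyTorch and illustrative names; warm-up iterations stand in for the GPU initialization period:

```python
# Sketch of per-image GPU latency measurement: warm up the GPU, then
# average the forward-pass time over 100 trials (assumes PyTorch).
import time
import torch

def gpu_latency(model, patch_size, trials=100, warmup=10, device="cuda"):
    model = model.to(device).eval()
    x = torch.randn(1, 3, patch_size, patch_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):       # let the GPU initialize and cache kernels
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(trials):
            model(x)
        torch.cuda.synchronize()      # wait for queued kernels to finish
    return (time.perf_counter() - start) / trials   # seconds per image
```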

Scalability for High Throughput Scanning
These complexity metrics are most relevant when scaled to larger processes, to see how they affect digital pathology systems on a daily basis; a smaller latency can scale to days saved in diagnosis processes. Assuming a slide at 0.5µm/pixel@20X magnification containing an approximately 1cm x 1cm tissue section, which translates to a pixel size of 25000 x 25000 for each WSI, Table 4 shows how GPU latency scales for each network. The two smaller patch sizes are worse at scale than the larger patch sizes, even though they require less GPU inference time and fewer FLOPs for a single input image. For a network such as EONSS 11 , decreasing the input patch size from 128 to 64 increases the time by a factor of 4.36. This is the difference between a successful overnight QC session and one that carries over into the next working day. Smaller networks such as FocusLiteNN 10 can achieve high throughput scanning in only 0.3% of the time spent processing the WSIs using ResNet101 67 , completing the task in just over 3 hours. If performance is not severely impacted, there is an advantage to using larger input patch sizes, and especially to using shallower CNNs.
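The scaling in Table 4 follows directly from the patch count: halving the patch size roughly quadruples the number of patches tiling a WSI, which outweighs the lower per-patch cost. A back-of-the-envelope sketch, assuming non-overlapping patches:

```python
# Rough scaling of per-patch latency to hours per 300 WSIs, assuming
# non-overlapping patches tiling a 25000 x 25000 pixel slide.
def hours_per_batch(latency_s_per_patch, patch_size, wsi_px=25000, n_wsi=300):
    patches_per_wsi = (wsi_px // patch_size) ** 2
    return latency_s_per_patch * patches_per_wsi * n_wsi / 3600.0

# (25000 // 64) ** 2 = 152100 patches vs. (25000 // 128) ** 2 = 38025:
# roughly a 4x patch-count increase, consistent with the 4.36x time
# factor observed for EONSS above.
```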

Training Performance Metrics
Assessing the training performance is important for understanding which models will train and transfer better when no fine-tuning is applied. The metrics chosen for this purpose are the Knowledge Gain and the Mapping Condition. The Knowledge Gain 63 quantitatively encodes the useful information carried over each convolution layer, which serves as a representation of how well a network is learning or gaining knowledge. The Mapping Condition 63 quantitatively encodes the sensitivity of the convolution mapping between a layer's inputs and outputs. A low mapping condition is valuable when paired with a high knowledge gain, as this combination indicates good stability and a good ability to map input features to output features. If the mapping condition is high and the knowledge gain is low, the layers are very sensitive to input perturbations and do not show desirable mapping capabilities 63 . For the ten CNNs used in the experiment, these metrics were averaged between the input and output channels, then across each layer, and finally over the five trials used in the experiment. Figures 6a, 6b, and 6c plot the Knowledge Gain against the Mapping Condition for each network. A point near the top left of this plot is desired, as this means the network has a low (stable) mapping condition and a high knowledge gain. For BioImage 16 , a change in patch size does not significantly affect the knowledge gain and mapping condition, shown by the proximity of each pair of points. For the FocusPath 43 dataset, the input patch size has a noticeable effect on the knowledge gain and mapping condition. Generally, the larger patch sizes have a worse mapping condition performance. This could be because smaller input patches have fewer features, meaning that input perturbations are less likely to impact the output.
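The exact formulations of these layer-probing metrics are given in ref. 63. As a loose illustration only, the sketch below approximates them through a singular value decomposition of each unfolded convolution kernel, with the condition number standing in for mapping sensitivity and dominant-mode energy standing in for retained knowledge; this is an assumption based on the descriptions above, not the formulation used in the study.

```python
# Illustrative SVD-based approximation of layer probing (NOT the exact
# metrics of ref. 63): condition number ~ mapping sensitivity,
# dominant singular-value energy ~ retained knowledge.
import torch

def probe_layer(conv_weight, top_frac=0.25):
    w = conv_weight.reshape(conv_weight.shape[0], -1)  # unfold (out, in*kH*kW)
    s = torch.linalg.svdvals(w)                        # descending order
    mapping_condition = (s[0] / s[-1].clamp_min(1e-12)).item()
    energy = s**2 / (s**2).sum()
    k = max(1, int(len(s) * top_frac))                 # arbitrary cutoff
    knowledge_gain = energy[:k].sum().item()
    return knowledge_gain, mapping_condition
```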
Interestingly, across all three datasets, deeper networks have a worse average knowledge gain and a worse mapping condition than shallower networks. On BioImage 16 , for example, FocusLiteNN (10-channel) 10 shows an 84.7% knowledge gain improvement and an 87.8% mapping condition improvement over DenseNet-13 65 , and 146% and 97.4% respective improvements over ResNet-101 67 . DARTS-FQA 58 falls outside this trend, with a lower knowledge gain but a more stable mapping condition relative to networks of similar complexity, likely because tailoring the network to a specific dataset makes it much more stable on that dataset. Overall, these deeper networks would benefit from fine-tuning before transferring to other datasets, which is less necessary for shallower networks.

Results
The training and testing sets were randomly shuffled and distributed before each trial, and each model was trained five times on each dataset. Each model was then tested on a testing set from the same dataset on which it was trained before being transferred to TCGA@Focus 64 . The input patches were normalized by color before being passed into the CNN, for both the training and testing cases. A full description of training hyperparameters, evaluation metrics, and validation results can be found in the Supplementary Materials.

Discussion

Table 5 shows the ROC and PR performance for each network and dataset when transferred to the TCGA 64 dataset, and Figure 7 shows the best and worst two ROC and PR curves for selected datasets. More ROC and PR curves can be found in the Supplementary Materials. In general, the FocusPath 43 dataset has the highest transferability to the TCGA 64 dataset, with an average ROC of 93.98% and PR of 87.10%. DeepFocus 18 has the worst, with respective averages of 68.87% and 43.31%, which is unacceptable for pathology applications when compared to ROC results from other studies 10,18 . Table 2 shows that FocusPath 43 has a higher organ and stain diversity than any of the other training datasets, which suggests that organ and stain diversity does have an impact on the transferability of the model. The main issue with DeepFocus 18 is the distribution of in-focus and out-of-focus classes: only 9% of the patches are labelled in-focus. Since there are so few positive cases, it does not take many false positives to heavily impact the precision. This highlights the importance of a dataset with balanced focus levels as well.

As suggested by the training quality metrics, deep CNNs do not perform best on every dataset. Rather, complex networks such as ResNet 67 exhibit better transferability when trained on datasets with large input patch sizes. ResNet-101 67 shows 3.15% better ROC and 6.74% better PR on BioImage 16 compared to BioImage64 16 , and similar performance on FocusPath 43 compared to FocusPath64 43 . Conversely, lightweight networks such as FocusLiteNN 10 perform better when trained on datasets with small input patch sizes, with 1.47% better ROC and 3.58% better PR on FocusPath64 43 compared to FocusPath 43 . Figure 8 reinforces this trend, as datasets with small input patch sizes display negative trends, while datasets with larger input patch sizes have positive trends. While the average transfer performance is worse for the FocusLiteNN 10 networks, the best performance achievable with a shallower network, using FocusLiteNN (2-channel) 10 on FocusPath64 43 , is only 1.13% lower than the best performance measured in this study, using DenseNet-13 65 on FocusPath 43 . Networks such as FocusLiteNN 10 and EONSS 11 scale so well to high-throughput scanning that using a smaller patch size is acceptable. Using FocusLiteNN (1-channel) 10 with a 64 x 64 input patch size still processes 300 WSIs in just over three hours, while largely outperforming deeper networks on FocusPath64 43 , BioImage64 16 , and DeepFocus 18 , as shown in Table 5. Table 5 also shows that FocusPath 43 has an average advantage over BioImage 16 in ROC (2.3%) and PR (3.4%) performance, as does FocusPath64 43 over BioImage64 16 . This supports the notion that color information is helpful for network performance and transferability 14 .
The DARTS-FQA 58 architecture performed competitively with the other CNNs, scoring the best ROC performance by 0.18% for BioImage64 16 and the second-best ROC and PR performance for DeepFocus 18 . DARTS-FQA 58 performed best relative to networks of similar complexity (DenseNet-13 65 , MobileNetv2 66 ) on BioImage64 16 and DeepFocus 18 , but slightly worse on FocusPath64 43 . This suggests that architecture search can offer clear transferability advantages for certain datasets, though the effort involved in optimizing these networks makes simpler, more generic networks such as FocusLiteNN 10 valuable for efficient and reliable transfer performance as well.

Figure 9 shows an example of a visual FQA representation of a whole slide image. Using the models trained on each dataset, a heatmap was produced from the predicted focus scores of each network. The background was filtered out of the image, each heatmap was normalized, and the heatmaps were then interpolated to make them visually smoother. The same patch size and stride were used as when testing each network. From the heatmaps analyzed in this paper, there is no clear pattern in how deeper networks and coarsely sampled datasets predict the focus quality distributions, which affects their overall explainability. While Figure 9b is somewhat aggressive in its evaluation, the Supplementary Materials display numerous examples that are not aggressive enough. In contrast, shallower networks and smaller input patch sizes, such as the one shown in Figure 9c, seem to spatially represent the focus quality distribution more closely to expected human perception.
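A minimal sketch of the heatmap construction just described, assuming patch scores laid out on a regular grid and a precomputed tissue mask; the function names and the interpolation choice are illustrative:

```python
# Sketch of the focus-quality heatmap: mask out background patches,
# normalize the predicted scores, and interpolate for a smoother display.
import numpy as np
from scipy.ndimage import zoom

def focus_heatmap(patch_scores, grid_shape, tissue_mask, upscale=8):
    """patch_scores: per-patch focus predictions; tissue_mask: True on tissue."""
    grid = np.asarray(patch_scores, dtype=float).reshape(grid_shape)
    grid[~tissue_mask] = np.nan                         # filter out background
    lo, hi = np.nanmin(grid), np.nanmax(grid)
    grid = (grid - lo) / (hi - lo + 1e-8)               # normalize to [0, 1]
    return zoom(np.nan_to_num(grid), upscale, order=1)  # smooth interpolation
```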
From the samples analyzed in this study, models trained on datasets with smaller input patch sizes generally produce spatial representations that match human perception more closely, which gives them an explainability advantage. If a pathologist wants to understand how the model reached a classification result, the spatial distributions should match what the pathologist would expect. If they do not match, there should be patterns that pathologists can identify to explain why a network's spatial distribution deviates from their expectations. Further work is needed to discover patterns in the spatial distributions of deep models trained with coarse sampling, such as on BioImage 16 and FocusPath 43 .

Conclusions
This paper sought to identify the characteristics of datasets and CNN architectures that make them more transferable and more scalable for FQA applications in digital pathology. The FocusPath dataset 43 showed the best transferability performance with an average ROC of 94.0% and PR of 87.1%. FocusPath's 43 tissue, stain, and hue diversity were major contributors to this success. Color information in the dataset also had a small transferability advantage of 2.03% when compared to the grayscale BioImage 16 dataset. DeepFocus 18 showed the worst transfer performance with an average ROC of 68.9% and PR of 43.3%, which can be attributed to a heavy imbalance between out-of-focus and in-focus patches.
The DARTS-FQA network performed quite well against the other nine CNNs, achieving intermediate to above-average performance on each of the datasets. This is encouraging for a low-cost algorithm, and further optimization of these networks could allow them to significantly outperform other networks of similar complexity.
Shallower networks showed excellent scalability, with the FocusLiteNN 10 networks processing 300 WSIs in just over 4 hours in the slowest case. They also show acceptable transferability, and perform even better than deep networks and searched architectures when trained on datasets with small input patch sizes. Furthermore, shallower networks have much better training quality metrics than their deeper counterparts, which suggests that fine-tuning is less essential for them. Shallower networks are also more easily explainable when showing focus quality heatmaps, as these align better with human perception, which can help build confidence in AI-based solutions for QC in digital pathology. Overall, using diverse datasets to train shallow networks on small input patches can help optimize scalability without sacrificing significant transferability performance.

Data Availability
The FocusPath 43

Table 2: WSI Dataset Information, listed in alphabetical order. Number of classes refers to the number of focus levels used in the dataset. The subscript in the Mean Hue column represents the standard deviation.

Table 3: CNN architecture and complexity summary. The #FLOPs and latency are listed for an input patch size of 64.

Table 4: Hours taken per GPU to process 300 WSIs. Each WSI has a 25k x 25k pixel size. The shortest time is in green.

Table 5: Performance metrics for transfer to TCGA@Focus 64 . The best PR and ROC result for each network is in green. The subscript indicates the standard deviation over the 5 trials.

Figure 1: An overview of the data preparation, training, and evaluation pipeline for FQA.

Figure 2: Hue relative frequency distribution for all colored datasets using a 1% sampling of pixels.

Figure 3: Examples of patches from each dataset at various focus levels. The number in the top right corner of each image corresponds to the focus level. For TCGA, "IF" stands for "in-focus" and "OF" stands for "out-of-focus".

Figure 9: Focus quality heatmaps for a WSI from the TCGA 64 dataset. Red corresponds to lower focus quality (out-of-focus), while blue corresponds to higher focus quality.