Tissue Segmentation from Whole-Slide Images Using Lightweight Neural Networks

Pathology slides of malignancies are segmented using lightweight convolutional neural networks (CNNs) that may be deployed on mobile devices. This is made possible by preprocessing candidate images to make CNN analysis tractable and to exclude regions unlikely to be diagnostically relevant. In a training phase, labeled whole-slide histopathology images are first downsampled and decomposed into square tiles. Tiles corresponding to diseased regions are analyzed to determine boundary values of a visual criterion, image entropy. A lightweight CNN is then trained to distinguish tiles of diseased and non-diseased tissue and, if more than one disease type is present, to discriminate among these as well. A segmentation is generated by downsampling and tiling a candidate image, and retaining only those tiles with values of the visual criterion falling within the previously established extrema. The sifted tiles, which now exclude much of the non-diseased image content, are efficiently and accurately classified by the trained CNN. Tiles classified as diseased tissue -- or, in the case of multiple possible subtypes, as the dominant subtype in the tile set -- are combined, either as a simple union or at a pixel level, to produce a segmentation mask or map. This approach was applied successfully to two very different datasets of large whole-slide images, one (PAIP2020) involving multiple subtypes of colorectal cancer and the other (CAMELYON16) single-type breast-cancer metastases. Scored using standard similarity metrics, the segmentations exhibited notably high recall, even when tiles were large relative to tumor features. With segmentations that can be generated locally and broadcast widely, efficiencies in utilizing expert resources can be achieved; high recall scores ensure that clinicians using the map as an aid to their practice will observe the full extent of the diseased area.


Introduction And Overview
Imaging modalities such as magnetic resonance imaging (MRI) produce clear, high-resolution tissue images that may be analyzed for the presence of disease or abnormality. Substantial strides have been made in automating this analysis in order to assist clinicians in making diagnostic classifications.
Actually identifying and labeling diagnostic regions within a medical image represents a separate, and more difficult, computational task known as segmentation. Although often pursued alongside classification, segmentation is far more granular and therefore more challenging. CNNs have been used to segment images, including medical images of tissue, into distinct labeled regions 1 . In particular, they have been applied to "patch-wise" techniques that analyze small regions surrounding each pixel 2,3 and "fully convolutional" approaches that make predictions for all pixels at once 4,5 . The U-Net architecture 6 , developed expressly for biomedical tissue segmentation, builds on the fully convolutional architecture and is now routinely used 7 . These approaches generally process the entire image to be segmented and, as such, are subject to the size constraints affecting CNNs generally 8,9 . CNNs perform best at image sizes below 600 x 600 pixels; larger images entail complex architectures that are difficult to train, perform slowly, and require significant memory resources. Images this small can cover only a small anatomic region at a resolution sufficient to retain key detail. This may be sufficient for pedagogical purposes or if a region of interest can be localized in advance; patch-wise techniques, for example, can be applied to discrete image regions 11 . These limitations preclude CNN-based segmentation (as opposed to classification) of large images in which subtle disease patterns may be present at unknown locations, if at all 12 .
More generally, the subtleties distinguishing disease subtypes from each other and from undiseased tissue typically necessitate use of complex neural-network architectures. The original U-Net implementation had 23 convolutional layers 13 , for example, and other deep-learning techniques that avoid hand-engineered image features -- that is, which learn directly from labeled images -- also involve architectures with 16 or more convolutional layers 14 . These approaches require substantial computational and memory capacity to process the large number of parameters (16 million in the case of U-Net, for example 15 ) inherent in their architectures. Although the computational capacity of mobile devices such as tablets and phones continues to grow, it is inevitably limited by battery life and the need to perform foreground tasks. Integrating complex deep-learning systems into clinical practices that continue to evolve toward telemedicine represents a substantial challenge.
The approach described here performs segmentation at the level of an image tile rather than a pixel, although pixel-level segmentation based on the tiles is also possible. This reduces segmentation to a tractable classification task that may be carried out using simple CNN architectures. As will be seen, even the largest medical images -- whole slides of histopathology samples, which may have sizes exceeding 100,000 pixels in each dimension -- can be handled.
Successful implementation depends on optimal tile sizing and selection. Larger tiles provide more anatomic information to the CNN for classification. But tile size also dictates the resolution of the segmentation and CNN complexity, as well as affecting tile selection. In particular, selection is based on a visual criterion that will locate all tiles corresponding to the diseased region while excluding as many non-qualifying tiles as possible. By presenting for CNN classification only those tiles that satisfy the visual criterion, spurious classifications are minimized and wasteful CNN processing is avoided.
As discussed in earlier work 16 , image entropy reflects the degree of nonredundant information -- the information diversity -- in a region of pixels. For pure classification exercises, it is useful to compare the entropy of a tile with that of the larger image from which it is drawn. This approach is unsuited to segmentation analysis, however, because it does not directly discriminate between tissue types. Instead, using a training set of ground-truth slides annotated by pathology experts, tiles are derived from the diagnostic regions and their minimum and maximum entropy values noted. These values serve as boundaries or "rails" that constrain selection of tiles from a test image to be segmented: a candidate tile is retained only if its image entropy lies on or within the rails.
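The rail computation itself is straightforward. The following pure-Python sketch is illustrative only (a production pipeline would operate on arrays, and the function names are ours); a tile is represented as a flat list of 8-bit grayscale pixel values:

```python
import math

def image_entropy(gray_pixels):
    """Shannon entropy (in bits) of a tile's 8-bit grayscale histogram."""
    counts = [0] * 256
    for p in gray_pixels:
        counts[p] += 1
    n = len(gray_pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def establish_rails(diagnostic_tiles):
    """Minimum and maximum entropy over ground-truth diagnostic tiles."""
    entropies = [image_entropy(t) for t in diagnostic_tiles]
    return min(entropies), max(entropies)

def within_rails(tile, lo, hi):
    """Retain a candidate tile only if its entropy lies on or within the rails."""
    return lo <= image_entropy(tile) <= hi
```

A uniform tile has zero entropy and falls below any rails derived from annotated disease regions; only tiles whose information diversity matches that of the diagnostic training tiles survive the sift.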
As shown in Fig. 1, the results of this initial sift can be striking.
As illustrated, tying the sifting criterion to the ground-truth training images instead of the image under examination, as is common in CNN-based segmentation systems, produced significantly better performance -- but only if the visual criterion is well chosen. Tile-based classification and segmentation systems commonly select tiles based on image density 17 or background percentage 18 . Even if distortions due to variations in staining and digital acquisition of slides can be overcome 12 , all of the undiseased tissue regions will still be present in the resulting tiles, complicating the classification task.
The technique considered here may be applied to whole-slide images as follows. First, an annotated training set of images is obtained and downsampled. The degree of downsampling and the tile size into which the downsampled image is decomposed represent parameters specific to the tissue morphology under study. Optimized together, they permit the use of source images small enough to be processed on mobile devices and a tile size that balances segmentation accuracy with resolution. The training images are segregated into two sets, one with just the diseased regions and the other with disease regions masked. Training tiles are then created from each set. The tiles overlap sufficiently so their total, after initial sifting to remove tiles with excessive background area, is adequate for training -- generally tens of thousands of tiles per class. The data redundancy resulting from overlap, it is found, matters less than tile population; for training purposes, quantity is quality.
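Decomposition into overlapping tiles reduces, per axis, to a stride computed from the tile size and overlap fraction. A minimal sketch follows; the anchoring of the final tile to the image edge is our assumption, not a detail stated in the text:

```python
def tile_origins(length, tile, overlap):
    """Top-left coordinates along one axis for overlapping tiles.

    `overlap` is the fraction (0 <= overlap < 1) by which adjacent tiles
    share area along the axis; the final tile is anchored to the image
    edge so no pixels are dropped.
    """
    stride = max(1, int(round(tile * (1.0 - overlap))))
    origins = list(range(0, max(length - tile, 0) + 1, stride))
    if length > tile and origins[-1] != length - tile:
        origins.append(length - tile)
    return origins

def decompose(width, height, tile, overlap):
    """All (x, y) tile origins for a downsampled whole-slide image."""
    return [(x, y)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]
```

Raising the overlap fraction multiplies the tile count without requiring additional source images, which is how the per-class tile populations described below are equalized.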
If there is only one disease type, the classification task is binary. Particularly in studies of malignancy, however, multiple disease subtypes often exist and must be reckoned with. Even if the objective is simply to highlight disease regions of whatever subtype, combining diagnostic subtypes into a single training class is frequently a mistake; any visually significant differences among morphologies will likely skew classification and degrade accuracy. For example, one subtype may, to the CNN, appear less distinct from normal tissue than from the other subtype. The technique we describe here is suited to multiclass training with a softmax activation function and should be employed in this way when subtypes can be visually distinguished.
The minimum and maximum entropies of the diagnostic tiles, assessed over all disease types, are obtained. The nondiagnostic (i.e., normal tissue) tiles are then further sifted based on the entropy rails thus established. (Of course, in practice, sifting based on background content and entropy can take place simultaneously.) The CNN is then trained to distinguish between diagnostic and non-diagnostic tiles and, in the case of multiple disease types, among those classes as well.
Following CNN training, a new whole-slide image may be segmented by first resampling at the lower resolution and decomposing the resampled image into overlapping tiles whose size matches the training tiles. Once classified, these tiles are used to generate a segmentation mask or segmented image. In a binary classification, pixel-level diagnostic probabilities may be computed by averaging (or otherwise combining) the tile-level probabilities for each pixel. This approach substantially improves the resolution obtainable by simple tile overlap without probability averaging. Softmax probabilities are more complex, however, and better segmentations may be obtained simply by taking the union of relevant overlapping tiles. In a multinomial classification, a tile is considered relevant if it corresponds to the dominant subtype, i.e., the subtype accounting for the majority of diagnostic tiles. Intriguingly, the accuracy of the segmentation is not necessarily compromised by an inaccurate subtype classification.
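For the binary case, the pixel-level combination just described can be sketched as follows: each pixel's score is the mean of the probabilities of every tile covering it. Pure Python is used for clarity, and the names are illustrative:

```python
def pixel_probability_map(width, height, tile, tiles):
    """Average tile-level diagnostic probabilities down to pixels.

    `tiles` is a list of ((x, y), p) pairs: a tile origin and the CNN's
    probability that the tile shows diseased tissue.  Each pixel's score
    is the mean over every tile that covers it; uncovered pixels score 0.
    """
    total = [[0.0] * width for _ in range(height)]
    cover = [[0] * width for _ in range(height)]
    for (x, y), p in tiles:
        for row in range(y, min(y + tile, height)):
            for col in range(x, min(x + tile, width)):
                total[row][col] += p
                cover[row][col] += 1
    return [[total[r][c] / cover[r][c] if cover[r][c] else 0.0
             for c in range(width)] for r in range(height)]
```

Thresholding the resulting map yields a mask whose resolution is finer than the tile grid, since overlapping tiles vote on each pixel independently.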
Two very different benchmark datasets were utilized in this study. Both consist of extremely large whole-slide images, which would be difficult to segment using conventional deep-learning techniques. The annotated training and validation slides of the PAIP2020 challenge [1] were selected to investigate segmentation of multiple disease subtypes. This dataset contains annotated whole slides exhibiting different degrees of microsatellite instability (MSI), a molecular phenotype of colorectal cancer that arises from a defective DNA mismatch repair system. MSI status in colorectal cancer has prognostic and therapeutic implications. In particular, a high degree of MSI (MSI-H) is associated with a better prognosis than a low degree (MSI-L). Histology samples are typically classified as MSI-H or MSI-L for diagnostic purposes.
We employed the CAMELYON16 19 dataset to investigate segmentation of a single disease type -- cancer metastases in lymph nodes. Metastatic involvement of lymph nodes corresponds to a poorer prognosis for survival of breast cancer and, like microsatellite instability in colorectal cancer, is difficult and time-consuming to diagnose from visual examination of histopathology images. In the CAMELYON16 whole-slide images, a sample is either malignant or normal, i.e., the lymph node contains no metastatic tissue. Although the binary classification task may seem simpler, the CAMELYON16 dataset is especially challenging in that diagnostic regions may be quite small -- in some cases just a few pixels out of billions. Whereas the tumor regions in the PAIP2020 dataset tend to be large and contiguous, the CAMELYON16 lymph-node lesions often presented as a dusting of tiny features. The latter morphology tested the lower limits of useful tile sizes.
Where tile sizes were small enough relative to diagnostic features to produce meaningful segmentations, strong performance was observed. For the PAIP2020 dataset, segmentations achieved mean Jaccard similarity scores modestly exceeding 0.70 and recall (or sensitivity) scores well in excess of 0.90 (and as high as 0.96). Recall represents the proportion of the diagnostic region that actually appears in the segmentation map; high recall scores ensure that clinicians using the map as an aid to their practice will observe the full extent of the diseased area. Similar results were obtained for CAMELYON16 images with sufficiently large diagnostic features, although average scores were lower due to inclusion of all results that did not fail entirely. Four rescaled sets of the whole-slide images were prepared such that, in each set, the longer dimension of the rescaled images did not exceed 3500, 4500, 6000, or 8000 pixels. Each of these image sizes seemed potentially adequate to preserve the diagnostically significant anatomy without being so large as to limit the utility of tiles ranging in size from 200 x 200 to 600 x 600 pixels. The tile-preparation, training, and testing procedures discussed below were carried out for each image set, and the 6000-pixel set emerged as the best performer. The images in this set had an average size of 8. Within each training subset, the data were further imbalanced by the typically smaller image area occupied by the tumor. Consequently, to obtain similar numbers of tumor and non-tumor tiles, we overlapped them to different degrees. These ranged from 80% to 96% overlap depending on the image size and the number of images in each training class (MSI-H, MSI-L, and non-tumor).
Training tiles were sifted based on background fraction, with majority-background tiles excluded. The background of tumor images -- MSI-H and MSI-L images were preprocessed identically -- consisted of a solid black region surrounding the tumor areas, while non-tumor images included black regions corresponding to the tumor locations with the remainder of the slide unmodified. We identified background regions by creating, for each tile, an 8-bit grayscale counterpart. Tiles were excluded if a majority of their grayscale pixel values were above 235 (at least nearly white) or below 15 (at least nearly black). The higher grayscale limit was chosen so that staining would still register as background but light-colored tissue regions would not. About 55% of the generated tiles survived background sifting; this fraction was consistent across tile sizes. Each image subset had about 180,000 training tiles evenly split among the MSI-H, MSI-L, and non-tumor classes.
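A sketch of the background sift follows, under our reading of the criterion that a tile is excluded when its near-white and near-black pixels together form a majority (the thresholds 235 and 15 are from the text; the function name and combined-majority interpretation are assumptions):

```python
def is_background_tile(gray_pixels, white=235, black=15):
    """Exclude a tile whose 8-bit grayscale pixels are mostly near-white
    (blank slide area) or near-black (masked regions)."""
    background = sum(1 for p in gray_pixels if p > white or p < black)
    return background > len(gray_pixels) / 2
```

Because this test and the entropy-rail test are both per-tile predicates, they can be evaluated in a single pass as each tile is generated.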
For each tile size, the maximum and minimum image entropies of qualifying tumor tiles (with no distinction drawn between MSI-L and MSI-H tiles) were noted. For each image subset, test tiles were prepared by decomposing each subset test image into overlapping tiles and sifting tiles based on background content and image entropy. In particular, majority-background tiles and tiles whose image entropies were not on or between the entropy rails were excluded. The same procedures -- image rescaling, tile generation, and exclusion based on background content and entropy -- were carried out on the PAIP2020 validation set. Values of the entropy rails were quite consistent -- within 1% -- across tile sizes ranging from 200 x 200 to 650 x 650; the values for the 400 x 400 tiles were 6.42 and 7.46. As will be seen, the CAMELYON16 tiles behaved very differently at small tile sizes, and the spread between maximum and minimum entropy values can effectively limit the minimum usable tile size.
The CNN architecture employed in this study was selected to minimize the number of convolutional layers and consequent trainable parameter count. Three dropout layers mitigated the risk of overfitting to the small dataset. We trained for 75 epochs in each training/test partition using a batch size of 16, a categorical cross-entropy loss function, an Adam optimizer, a learning rate of 0.0001, softmax activation, and random horizontal and vertical flip data augmentation. More significant data augmentation resulted from the degree of tile overlap noted above. Source code for this model has been posted. [1] Training and testing were carried out separately for each of the three tile subsets. In each case, the model was saved after each of the 75 training epochs. It was unclear a priori whether models producing the most accurate subtype classifications would also generate the best segmentations; therefore, segmentations were obtained using all models with classification accuracies exceeding 60%. In fact, as noted below, models exhibiting poor classification performance sometimes produced good segmentations. Best segmentation and classification performance were found to occur with 400 x 400 pixel tiles.
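The posted source code is authoritative; purely as a hedged illustration, a Keras model with five convolutional layers and the stated training configuration (three dropout layers, softmax output, Adam at a learning rate of 0.0001, categorical cross-entropy) might be assembled as follows. Filter counts, pooling, and dropout rates are our assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(tile_size=400, n_classes=3):
    # Five convolutional layers with three interspersed dropout layers;
    # only the compile settings below are taken from the text.
    model = keras.Sequential([
        layers.Input((tile_size, tile_size, 3)),
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.25),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.25),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Random horizontal/vertical flip augmentation, as stated in the text.
augment = keras.Sequential([layers.RandomFlip("horizontal_and_vertical")])
```

Such a model has orders of magnitude fewer parameters than the U-Net discussed below, which is what makes on-device inference plausible.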
Following analysis by the CNN, the tiles of a candidate image have been sifted twice: first by the entropy rails prior to processing and then by the softmax activation function. Because of the high degree of overlap among tiles, the union of all tiles classified as MSI-H or MSI-L (whether or not the tumor-level classification is correct) was used to approximate the tumor region. The resulting segmentations, each based on an average of about 2000 test tiles, were assessed against the corresponding segmentation masks in terms of Jaccard similarity, precision, and recall. The Jaccard score quantifies the degree of overlap between the prediction P and the ground truth T: J(P, T) = |P ∩ T| / |P ∪ T|. This metric is closely related to the Dice coefficient.
Precision represents the proportion of pixels classified as positive (i.e., as tumor pixels) that are, in fact, positive, while recall corresponds to the proportion of all positive pixels correctly classified as such. In terms of true positives (TP), false positives (FP), and false negatives (FN): precision = TP / (TP + FP) and recall = TP / (TP + FN). The Jaccard, precision, and recall scores obtained for masks produced with the best-performing models are shown in Table 1. Scores on the validation set were comparable to those obtained for the different training/test subsets, particularly on a relative basis among the assessed metrics. Accordingly, the training subsets were generally representative and unbiased. Classification accuracies were obtained by majority vote from label counts corresponding to MSI-H and MSI-L classifications for each image. Better classification accuracy, 0.90 on the validation set, was achieved by combining label counts produced by the three models. This reflects the likelihood that the classification error attributable to each model has some degree of independence from the others, so at least some of the overall error is eliminated by averaging. Surprisingly, successful segmentation was largely independent of proper image classification. The three images incorrectly classified by model 49, for example, had mean Jaccard, precision, and recall scores of 0.67, 0.77, and 0.85, respectively. This was true despite mapping with tiles of the dominant, and therefore incorrect, classification. Mapping with all tiles classified as either tumor subtype invariably produced lower-quality segmentations. Similarly, although the models producing the best segmentation metrics also delivered the most accurate classifications, a few models exhibiting relatively poor classification performance generated unexpectedly good segmentations.
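All three metrics follow directly from pixel-level TP, FP, and FN counts, as in this minimal sketch over flattened binary masks (the conventions for empty denominators are ours):

```python
def similarity_metrics(pred, truth):
    """Jaccard, precision, and recall for flat binary (0/1) masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if t and not p)
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 1.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return jaccard, precision, recall
```

Note that Jaccard = TP / (TP + FP + FN), which is why it bounds both precision and recall from below; the Dice coefficient is 2J / (1 + J).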
Our five-layer CNN performs favorably compared with U-Net applied to a single downscaled image. To make this assessment, the Keras platform was used to create a U-Net model configured to process 256 x 256 pixel images based on a frequently cited code example 20 . U-Net performs pixel-level binary classification based on a decision boundary, which is a hyperparameter of the architecture and ranges from 0 to 1. An initial value of 0.5 is common. For benchmarking purposes, we trained this model on 20 images from the ISBI Challenge 21 , a dataset of neuronal structures, rescaled to pixel dimensions of 256 x 256. U-Net is known to excel at segmenting neural tissue containing sharply defined structures with clear contrast 10 . To approximate the effect of tile overlap, various forms of data augmentation were employed: width and height shifts, shear, and zoom, all set at 0.05, and random horizontal flips, as were utilized on the PAIP2020 training images. Tested on 10 ISBI images, this U-Net model delivered a mean Jaccard score of 0.90, a mean precision of 0.94, and a mean recall of 0.95.
The ISBI neuronal images featured well-defined patterns that were structurally similar across the dataset. This is not the case for the colorectal cancer images, on which the U-Net model performed poorly following the same training and test procedure (see Table 2). Performance improved as the decision boundary was reduced, but all metrics fall well below those achieved with the five-layer CNN on tiles drawn from much larger images. There is simply not enough anatomy visible in a 256 x 256 colorectal cancer image to support accurate segmentation. Also noteworthy is the size of the U-Net model at over 31 million parameters.

Without modification, masks generated by tile overlap as described above have blocky edges with stepped features, the roughness of which depends on the tile size and the degree of overlap. Although image blurring is to be avoided, it is possible to smooth the edges while preserving their sharpness using morphological operations based on a structuring element or kernel (Fig. 2), which defines a neighborhood shape and size. Using a circular kernel to first shrink ("erode") and then expand ("dilate") white mask regions results in progressively rounder, smoother edges as the kernel size increases.
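This erode-then-dilate sequence is a morphological opening; in practice one would likely apply OpenCV's cv2.morphologyEx with a cv2.MORPH_ELLIPSE structuring element. The dependency-free sketch below shows the mechanics on a small binary mask; clipping the kernel at image edges is one possible boundary convention, assumed here:

```python
def disk_kernel(radius):
    """Offsets of a circular (disk-shaped) structuring element."""
    return [(dr, dc)
            for dr in range(-radius, radius + 1)
            for dc in range(-radius, radius + 1)
            if dr * dr + dc * dc <= radius * radius]

def _morph(mask, kernel, require_all):
    # require_all=True -> erosion; require_all=False -> dilation.
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            hits = [mask[r + dr][c + dc]
                    for dr, dc in kernel
                    if 0 <= r + dr < rows and 0 <= c + dc < cols]
            out[r][c] = 1 if (all(hits) if require_all else any(hits)) else 0
    return out

def smooth(mask, radius):
    """Morphological opening: erode, then dilate, with a circular kernel."""
    k = disk_kernel(radius)
    return _morph(_morph(mask, k, True), k, False)
```

Opening removes protrusions and isolated specks smaller than the kernel while leaving larger regions essentially intact, which is exactly the stepped-edge behavior described above.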
Segmentation maps resulting from softer-edged masks have less visual distraction -- they are more user-friendly -- but what are ultimately aesthetic considerations obviously cannot trump accuracy. Fortunately, as indicated in Table 3, the effect of smoothing on the accuracy metrics is minimal over a visually significant range of kernel sizes. More concerning is the loss of diagnostic visual elements that can occur when the kernel size becomes a significant fraction of the tile size (Fig. 2(d)), which imposes an upper limit on smoothing. Although Table 3 shows results for only one model, equivalent results were obtained for the best models of the other two image subsets.

Table 3 -- Effects of edge smoothing, isomorphic shrinkage, and use of entropy rails on similarity metrics. With the rails omitted, tiles were sifted based only on the amount of background, with majority-background tiles excluded.

Fig. 3 illustrates the practical benefit of high recall combined with at least acceptable precision and Jaccard scores. In the best case, 3(a), the segmentation includes the entire lesion and very little else.
Even for the worst performer, 3(c), nearly all diagnostically relevant tissue is captured, with spurious highlighting confined to regions immediately surrounding the lesion. Given this pattern, it seemed plausible that isomorphically shrinking the diagnostic mask regions might improve segmentation quality. As shown in Table 3, however, shrinking by 5% has minimal impact on overall (Jaccard) similarity while mean recall diminishes significantly. The effect of a 10% reduction is worse. The reason for this is the uneven distribution of misclassified pixels around a lesion; the error margin in some regions is larger than in others, so the beneficial and deleterious effects of an isomorphic size reduction largely cancel out.
Finally, Table 3 shows the improvement provided by sifting using entropy rails rather than simple background thresholding. While not dramatic, the effect -- particularly on recall -- is appreciable.

CAMELYON16 Slides
The CAMELYON16 dataset consists of whole-slide images provided in multilevel TIFF format. The dataset includes segmentation masks prepared by expert pathologists for 111 of these slide images, which have an average uncompressed size of 88,816 x 55,352 pixels. While a few of the images feature large tumor regions, such as that shown in Fig. 1(b), the majority have small lesions that may themselves consist of archipelago-like clusters of minuscule features (see Fig. 4).
This necessitated a much larger rescaled image size. In order to be classified properly as a tumor tile, at least half the tile area must be occupied by tumor tissue; the tile must contain enough image information to permit the CNN to distinguish reliably among classes. For the contours of a tile-based segmentation to exhibit reasonable fidelity to the represented tumor region, the tile size must be smaller (ideally, considerably smaller) than that region. And finally, if possible, the rescaled image should be small enough to be stored and processed on a mobile device. Balancing these considerations ultimately led to a maximum dimension of 15,000 pixels for the rescaled image.
Ninety of the 111 annotated tumor-containing whole-slide images were selected for training and validation, and 20 of the remaining 21 annotated images served as the test set. Tiles were prepared at different sizes for the tumor and non-tumor portions of each image as described above. The criterion of fidelity dictated a maximum practical tile size of 400 x 400. As shown in Fig. 5, the spread between minimum and maximum entropy values increased substantially below 200 x 200, the size that produced the best segmentations. For this dataset at the selected degree of image rescaling, smaller tumor tiles had insufficient visual diversity (presumably arising from insufficiently distinctive anatomic detail) to be well-characterized by the entropy criterion. As a consequence, fewer tiles were rejected during preprocessing.
To obtain roughly equal sets of tumor and non-tumor tiles after majority-background sifting, the tumor tiles at size 200 x 200 and above were overlapped by amounts ranging from 86% to 90% and the non-tumor tiles were overlapped by amounts ranging from 50% to 67%. At each size, enough tiles were removed at random from the class having the larger resulting population to equalize the tumor and non-tumor tile sets. These training sets ranged in size from 245,735 tiles of each class at 100 x 100 pixels to 21,576 tiles of each class at size 400 x 400.
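Selecting a per-class overlap to equalize tile populations amounts to a small search; the sketch below (function names and the 1% search granularity are ours) finds the smallest overlap yielding at least a target tile count along one axis:

```python
def tile_count(length, tile, overlap):
    """Number of tile positions along one axis at a given overlap fraction."""
    stride = max(1, int(round(tile * (1.0 - overlap))))
    return (length - tile) // stride + 1

def overlap_for_count(length, tile, target):
    """Smallest overlap (in 1% steps) giving at least `target` tiles along
    one axis, or None if even 99% overlap is insufficient."""
    for pct in range(0, 100):
        if tile_count(length, tile, pct / 100.0) >= target:
            return pct / 100.0
    return None
```

Because tile count grows roughly as 1 / (1 - overlap), the minority class (here, tumor tissue) can be brought up to the majority class's population by modestly increasing its overlap, with any residual imbalance removed by random discards as described above.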
The CNN architecture used for this binary classification task was unchanged from that described above, but for training we used a binary cross-entropy loss function and sigmoid activation. Once again, models were saved after each training epoch, but overfitting set in much earlier -- generally after 25 epochs.
Best performance was observed with 200 x 200 pixel tiles. As shown in Table 4, the models that achieved highest classi cation accuracies also produced the best segmentations, but the similarity metrics other than recall were only fair. The large tile size relative to the size of tumor features in some images resulted in a few lesions escaping detection altogether. In other cases, the tumor area was fully captured or nearly so, resulting in high recall scores, but overall similarity suffered due to the large tile size; the tissue approximations, in other words, were coarse. Still, recall scores were above 0.9 for 41% of the images that received a score and above 0.5 for 88% of those images. In most cases, that is, the tumor regions were reasonably well covered despite the small feature sizes. Where feature sizes were large relative to the tile size (as in Fig. 1), performance was comparable to that achieved with the PAIP2020 dataset.
At 150 x 150 and 100 x 100 pixels, tile classification failed altogether. Despite training accuracies that exceeded 99%, none of the test tiles were classified as tumor and the resulting segmentation masks contained no white regions. That this might occur was suggested by the sudden increase in the spread between minimum and maximum entropy values in Fig. 5. With insufficient anatomic information in the tiles to distinguish between tumor and non-tumor tissue, the CNN seems to have overfit immediately to the training tiles. Some of the segmentation masks produced by models 20 and 14 (four, in the case of model 14) showed no relevant features and received scores of zero for all metrics. These corresponded to images with tumor features that were small relative to the tile size and spread out, so no tiles intercepted enough tumor tissue to trigger a positive classification.

Discussion
The primary objective of this work is to make accurate tissue segmentation possible for telemedicine and other applications that utilize computationally limited devices and bandwidth-limited communication. In traditional clinical practice, a complex or ambiguous case may be reviewed by more than one pathologist; with access to immunohistochemistry and molecular information in addition to whole-slide images of H&E-stained biopsy samples, an expert team can manually create tissue segmentations and diagnose disease with extremely high accuracy. But not every biopsy can be evaluated -- at least not immediately -- by specialists at a comprehensive cancer center. If relatively small biopsy images can be communicated instantly to disease experts who can analyze them on mobile devices, expert resources can be assembled ad hoc; a virtual team need not pore over the same image shoulder-to-shoulder. A screening pathologist, remote from the biopsy site and from other pathologists, might post an electronic request for additional perspectives -- with links to the rescaled image and the lightweight model employed. Available responders, whether in the hospital or on an airplane, can comment or use their own favored lightweight models to generate alternative segmentations. Such "expert crowdsourcing" can beneficially match immediate clinical needs with ready expertise.
Achieving this objective requires computational simplicity, limited storage requirements, and sufficient consistency and reliability across tissue types. Since the goal is to assist rather than outperform clinicians, recall represents the most important similarity metric. A high recall ensures that most or all of the diseased region is made visible in the segmentation. So long as precision and Jaccard scores are within reason (i.e., the tissue misclassified as diseased is not excessive), the segmentation will be useful as diagnostic guidance.
Tiling is a mechanically simple array operation. By preprocessing tiles as they are generated to exclude those with image entropies outside previously established boundaries, CNN classification accuracy is enhanced and fewer tiles need be classified. The spread between minimum and maximum entropy values is a critical parameter. As the gap widens, the fraction of tiles excluded -- and hence the utility of the criterion for preprocessing -- diminishes. Moreover, because of the logarithmic nature of the entropy function, relatively small increases or decreases in the gap width have significant effects on the proportion of tiles excluded. When the gap expands suddenly, the capacity to distinguish between tumor and non-tumor tiles disappears. This abrupt change, which occurred with the smallest CAMELYON16 tiles but not similarly sized PAIP2020 tiles, reveals a close tie between image entropy and CNN performance.
Ultimately, the parameter of chief importance is tile size, since this quantity dictates the resolution of the segmentation -- certainly if tiles are overlaid to form the map or mask, but even pixel-level segmentations are ultimately derived at a tile level. Classification accuracy also depends strongly on tile size and its companion variable, the degree of image rescaling; together, these determine how much anatomy a tile will contain. Although it may seem tempting simply to maximize tile-level classification accuracy, this may fail to optimize segmentation resolution; some models were mediocre classifiers yet delivered very good segmentations, so peak classification accuracy does not guarantee the best segmentations. The approach we describe here requires systematic exploration of the parameter space for a particular tissue type and slide magnification, but once established, the resulting parameter values should prove robust across similar slides.
The foregoing discussion centered on whole-slide images because they are difficult to analyze computationally and unwieldy to transmit and store; as a result, the benefits of the described approach are most pronounced. Preprocessing with entropy rails is useful for any image too large to be processed directly by a CNN, however. Many medical images, from MRI scans to simple X-rays, are currently downsampled to satisfy CNN input requirements at the sacrifice of image detail. Tiling the original image rather than rescaling it preserves this detail. Indeed, if rescaling to an intermediate size is unnecessary, a hyperparameter -- the degree of rescaling -- is eliminated from consideration and the overall processing flow is simplified. Training and testing are further simplified by assessing the spread between minimum and maximum tile entropies. Those tile sizes with widening spreads can probably be omitted from consideration since they are unlikely to perform well. We believe that image tiling and sifting based on entropy rails offers a versatile alternative to many automated image-analysis protocols now in use -- an alternative that is readily implemented on mobile devices and whose constituent models and images are easily shared.

Fig. 4 -- Tumor features in representative CAMELYON16 image. Left, ground-truth mask with enlargement of tumor region, which consists of a cluster of features each only a few pixels in extent. Right, segmentation mask for the same image produced by the best-performing model. Only the denser regions of the tumor were recognized and segmented properly, and the size of a tile -- an example of which appears toward the upper right -- limits the ability both to detect and represent tumor features.