2.1 PAIP2020 Slides
The PAIP2020 training image dataset consisted of 47 whole slides — 12 of which were labeled as MSI-H and the remaining 35 as MSI-L — provided in multilevel SVS format. The slides had an average uncompressed size of 116,214 x 88,095 pixels and contained varying amounts of non-tissue background. The dataset also included binary segmentation masks defining the tumor regions. An unannotated, unlabeled validation set consisted of 31 additional slides.
Four rescaled sets of the whole-slide images were prepared such that, in each set, the longer dimension of the rescaled images did not exceed 3500, 4500, 6000, or 8000 pixels. Each of these image sizes seemed potentially adequate to preserve the diagnostically significant anatomy without being so large as to limit the utility of tiles ranging in size from 200 x 200 to 600 x 600 pixels. The tile-preparation, training and testing procedures discussed below were carried out for each image set, and the 6000-pixel set emerged as the best performer. The images in this set had an average size of 8.5 Megapixels (MP) and a maximum size of 13.5 MP. (The maximum possible image size would be 36 MP, or 6000 x 6000 pixels.)
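The rescaling step can be sketched as follows; this is a minimal illustration assuming the openslide-python package, and the function name is ours, not that of the code actually used.

```python
# Sketch: rescale a whole-slide image so that its longer dimension does not
# exceed a target size (assumes the openslide-python package).
import openslide

def rescale_slide(svs_path, out_path, max_dim=6000):
    slide = openslide.OpenSlide(svs_path)
    w, h = slide.dimensions                    # full-resolution (level 0) size
    scale = min(1.0, max_dim / max(w, h))      # shrink factor; never upscale
    thumb = slide.get_thumbnail((int(w * scale), int(h * scale)))
    thumb.save(out_path)                       # Pillow image; e.g., PNG output
```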
The provided ground-truth masks were first used to create new, separate images of the tumor and non-tumor portions of each image. To cope with the unbalanced training dataset, three different subsets of the 47 training slides were defined. Each subset included 29 training images (8 MSI-H images and 21 MSI-L images) and 18 test images (4 MSI-H images and 14 MSI-L images) to preserve, in each subset, a training/test split above 60/40. Each of the MSI-H test sets was unique, i.e., contained no images found in any other test set.
Within each training subset, the data were further unbalanced by the typically smaller image area occupied by the tumor. Consequently, to obtain similar numbers of tumor and non-tumor tiles, we overlapped the tiles to different degrees, ranging from 80% to 96% overlap depending on the image size and the number of images in each training class (MSI-H, MSI-L, and non-tumor).
Training tiles were sifted based on background fraction, with majority-background tiles excluded. The background of tumor images — MSI-H and MSI-L images were preprocessed identically — consisted of solid black surrounding the tumor regions, while non-tumor images included black regions corresponding to the tumor locations, with the remainder of the slide unmodified. We identified background regions by creating, for each tile, an 8-bit grayscale counterpart. Tiles were excluded if a majority of their grayscale pixel values were above 235 (nearly white) or below 15 (nearly black). The higher grayscale limit was chosen so that staining would still register as background but light-colored tissue regions would not. About 55% of the generated tiles survived background sifting; this fraction was consistent across tile sizes. Each image subset had about 180,000 training tiles evenly split among the MSI-H, MSI-L, and non-tumor classes.
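The tile-generation and background-sifting logic can be sketched as follows; this is a minimal illustration assuming NumPy, with names and default parameter values that are ours rather than the original code's.

```python
# Sketch of overlapped tile generation with background sifting. `image` is
# assumed to be a PIL.Image; names and defaults are illustrative.
import numpy as np

def qualifying_tiles(image, tile=400, overlap=0.90, white=235, black=15):
    """Yield (origin, tile) pairs that are not majority background."""
    gray = np.asarray(image.convert("L"))           # 8-bit grayscale counterpart
    rgb = np.asarray(image.convert("RGB"))
    stride = max(1, int(tile * (1.0 - overlap)))    # 90% overlap -> 40-px step
    h, w = gray.shape
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            g = gray[y:y + tile, x:x + tile]
            background = (g > white) | (g < black)  # near-white or near-black
            if background.mean() <= 0.5:            # exclude majority background
                yield (x, y), rgb[y:y + tile, x:x + tile]
```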
For each tile size, the maximum and minimum image entropies of qualifying tumor tiles (with no distinction drawn between MSI-L and MSI-H tiles) were noted; these extremes serve as "entropy rails." For each image subset, test tiles were prepared by decomposing each subset test image into overlapping tiles and sifting the tiles based on background content and image entropy. In particular, majority-background tiles and tiles whose image entropies did not lie on or between the entropy rails were excluded. The same procedures — image rescaling, tile generation, and exclusion based on background content and entropy — were carried out on the PAIP2020 validation set. Values of the entropy rails were quite consistent — within 1% — across tile sizes ranging from 200 x 200 to 650 x 650; the values for 400 x 400 tiles were 6.42 and 7.46. As will be seen, the CAMELYON16 tiles behaved very differently at small tile sizes, and the spread between maximum and minimum entropy values can effectively limit the minimum usable tile size.
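In code, the rail computation and the corresponding sifting test might look like the following sketch, assuming scikit-image; the function names are illustrative.

```python
# Sketch of entropy-rail computation and sifting (assumes scikit-image).
# The rails are the extreme entropies of qualifying training tumor tiles;
# for 400 x 400 tiles they were 6.42 and 7.46.
from skimage.measure import shannon_entropy

def entropy_rails(tumor_tiles):
    """tumor_tiles: iterable of qualifying tumor tile arrays."""
    entropies = [shannon_entropy(t) for t in tumor_tiles]
    return min(entropies), max(entropies)

def within_rails(tile, rails):
    lo, hi = rails
    return lo <= shannon_entropy(tile) <= hi   # keep tiles on or between rails
```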
The CNN architecture employed in this study was selected to minimize the number of convolutional layers and consequent trainable parameter count. Three dropout layers mitigated the risk of overfitting to the small dataset. We trained for 75 epochs in each training/test partition using a batch size of 16, a categorical cross-entropy loss function, an Adam optimizer, a learning rate of 0.0001, softmax activation, and random horizontal and vertical flip data augmentation. More significant data augmentation resulted from the degree of tile overlap noted above. Source code for this model has been posted.[1]
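The exact architecture appears in the posted source code[1]; the Keras sketch below merely reflects the configuration described above, with layer counts and widths that are illustrative assumptions.

```python
# Keras sketch of a small CNN with three dropout layers and a softmax head,
# compiled with Adam (lr = 0.0001) and categorical cross-entropy.
# Layer counts and widths are illustrative; the actual model is in the repo.
from tensorflow.keras import layers, models, optimizers

def build_model(tile=400, n_classes=3):
    model = models.Sequential([
        layers.Input(shape=(tile, tile, 3)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.25),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.25),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),  # MSI-H, MSI-L, non-tumor
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```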
Training and testing were carried out separately for each of the three image subsets. In each case, the model was saved after each of the 75 training epochs. It was unclear a priori whether the models producing the most accurate subtype classifications would also generate the best segmentations; therefore, segmentations were obtained using all models with classification accuracies exceeding 60%. In fact, as noted below, models exhibiting poor classification performance sometimes produced good segmentations. The best segmentation and classification performance occurred with 400 x 400 pixel tiles.
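Per-epoch saving is a standard Keras callback; a minimal sketch follows, with an illustrative file-naming pattern.

```python
# Save a model file after every epoch so any epoch's model can be revisited.
from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint("model_{epoch:02d}.h5", save_best_only=False)
# model.fit(train_tiles, train_labels, epochs=75, batch_size=16,
#           callbacks=[checkpoint])
```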
Following analysis by the CNN, the tiles of a candidate image have been sifted twice: first by the entropy rails prior to processing and then by the softmax activation function. Because of the high degree of overlap among tiles, the union of all tiles classified as MSI-H or MSI-L (whether or not the tumor-level classification is correct) was used to approximate the tumor region. The resulting segmentations, each based on an average of about 2000 test tiles, were assessed against the corresponding segmentation masks in terms of Jaccard similarity, precision, and recall. The Jaccard score quantifies the degree of overlap between the prediction P and the ground truth T:

J(P, T) = |P ∩ T| / |P ∪ T|
This metric is closely related to the Dice coefficient.
Precision represents the proportion of pixels classified as positive (i.e., as tumor pixels) that are, in fact, positive, while recall corresponds to the proportion of all positive pixels correctly classified as such. In terms of true positives (TP), false positives (FP), and false negatives (FN),

Precision = TP / (TP + FP)        Recall = TP / (TP + FN)
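Concretely, the union mask and the three metrics can be computed as in the following minimal sketch, assuming NumPy; the function names are illustrative.

```python
# Sketch: the predicted tumor mask is the union of the footprints of all
# tiles classified as MSI-H or MSI-L; the metrics follow the definitions above.
import numpy as np

def union_mask(shape, tumor_tile_origins, tile=400):
    """tumor_tile_origins: (x, y) corners of tiles classified as tumor."""
    mask = np.zeros(shape, dtype=bool)
    for x, y in tumor_tile_origins:
        mask[y:y + tile, x:x + tile] = True     # paint each tile footprint
    return mask

def similarity(pred, truth):
    """Jaccard, precision, and recall for boolean masks pred and truth."""
    tp = np.logical_and(pred, truth).sum()      # true positives
    fp = np.logical_and(pred, ~truth).sum()     # false positives
    fn = np.logical_and(~pred, truth).sum()     # false negatives
    jaccard = tp / (tp + fp + fn)               # |P ∩ T| / |P ∪ T|
    return jaccard, tp / (tp + fp), tp / (tp + fn)
```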
The Jaccard, precision, and recall scores obtained for masks produced with the best-performing models are shown in Table 1. Scores on the validation set were comparable to those obtained for the different training/test subsets, particularly on a relative basis among the assessed metrics, suggesting that the training subsets were generally representative and unbiased. Classification accuracies were obtained by majority vote from the label counts corresponding to MSI-H and MSI-L classifications for each image. Better classification accuracy, 0.90 on the validation set, was achieved by combining the label counts produced by the three models. This reflects the likelihood that the classification error attributable to each model has some degree of independence from the others, so at least some of the overall error is eliminated by averaging.
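Both the per-image vote and the three-model combination reduce to simple counting, as in this sketch with illustrative names.

```python
# Sketch of image-level classification by majority vote over tile labels,
# and of pooling label counts across the three models.
def classify(msi_h_tiles, msi_l_tiles):
    return "MSI-H" if msi_h_tiles > msi_l_tiles else "MSI-L"

def ensemble_classify(counts_per_model):
    """counts_per_model: [(msi_h_tiles, msi_l_tiles), ...], one pair per model."""
    total_h = sum(h for h, _ in counts_per_model)
    total_l = sum(l for _, l in counts_per_model)
    return classify(total_h, total_l)   # pooled counts gave 0.90 accuracy
```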
| Model | Mean Jaccard Score (training) | Mean Precision (training) | Mean Recall (training) | Classif. Accuracy (training) | Mean Jaccard Score (validation) | Mean Precision (validation) | Mean Recall (validation) |
|---|---|---|---|---|---|---|---|
| Set 1: model 27 | 0.69 | 0.75 | 0.91 | 0.89 | 0.64 | 0.72 | 0.84 |
| Set 2: model 49 | 0.63 | 0.65 | 0.96 | 0.83 | 0.62 | 0.69 | 0.91 |
| Set 3: model 11 | 0.65 | 0.71 | 0.89 | 0.89 | 0.58 | 0.77 | 0.74 |
| Average | 0.66 | 0.70 | 0.92 | 0.87 | 0.60 | 0.70 | 0.86 |
Table 1 – The best models, labeled by epoch number, were identified for each image subset and their segmentation masks compared against those prepared by expert pathologists. These models were tested on the validation set and produced roughly similar scores.
Surprisingly, successful segmentation was largely independent of proper image classification. The three images incorrectly classified by model 49, for example, had mean Jaccard, precision, and recall scores of 0.67, 0.77, and 0.85, respectively. This was true despite mapping with tiles of the dominant, and therefore incorrect, classification. Mapping with all tiles classified as either tumor subtype invariably produced lower-quality segmentations. Similarly, although the models producing the best segmentation metrics also delivered the most accurate classifications, a few models exhibiting relatively poor classification performance generated unexpectedly good segmentations.
Our five-layer CNN performs favorably compared with U-Net applied to a single downscaled image. To make this assessment, the Keras platform was used to create a U-Net model configured to process 256 x 256 pixel images based on a frequently cited code example [20]. U-Net performs pixel-level binary classification based on a decision boundary, a threshold ranging from 0 to 1 that is applied to the per-pixel output; an initial value of 0.5 is common. For benchmarking purposes, we trained this model on 20 images from the ISBI Challenge [21], a dataset of neuronal structures, rescaled to pixel dimensions of 256 x 256. U-Net is known to excel at segmenting neural tissue containing sharply defined structures with clear contrast [10]. To approximate the effect of tile overlap, various forms of data augmentation were employed: width and height shifts, shear, and zoom, all set at 0.05, as well as the random horizontal flips used on the PAIP2020 training images. Tested on 10 ISBI images, this U-Net model delivered a mean Jaccard score of 0.90, a mean precision of 0.94, and a mean recall of 0.95.
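Applying the decision boundary is a single thresholding step over the per-pixel sigmoid output, as in this minimal sketch; the function name is illustrative.

```python
# Sketch of the decision boundary: a threshold applied to U-Net's
# per-pixel sigmoid output (0.5, 0.05, and 0.001 are examined in Table 2).
import numpy as np

def binarize(prob_map, threshold=0.5):
    """prob_map: H x W array of sigmoid outputs in [0, 1]."""
    return np.asarray(prob_map) > threshold
```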
The ISBI neuronal images featured well-defined patterns that were structurally similar across the dataset. This is not the case for the colorectal cancer images, on which the U-Net model performed poorly following the same training and test procedure (see Table 2). Performance improved as the decision boundary was reduced, but all metrics fell well below those achieved with the five-layer CNN on tiles drawn from much larger images. There is simply not enough anatomy visible in a 256 x 256 colorectal cancer image to support accurate segmentation. Also noteworthy is the size of the U-Net model, at over 31 million parameters.
| Set 2 - U-Net (PAIP2020 Training Slides) | Mean Jaccard Score | Mean Precision | Mean Recall |
|---|---|---|---|
| Threshold = 0.5 | 0.48 | 0.76 | 0.53 |
| Threshold = 0.05 | 0.53 | 0.74 | 0.62 |
| Threshold = 0.001 | 0.56 | 0.69 | 0.70 |
Table 2 – Performance of U-Net model trained and tested on the second colorectal cancer image subset (29 training images, 18 test images).
Without modification, masks generated by tile overlap as described above have blocky edges with stepped features, the roughness of which depends on the tile size and the degree of overlap. Although image blurring is to be avoided, it is possible to smooth the edges while preserving their sharpness using morphological operations based on a structuring element or kernel (Fig. 2), which defines a neighborhood shape and size. Using a circular kernel to first shrink (“erode”) and then expand (“dilate”) white mask regions results in progressively rounder, smoother edges as the kernel size increases.
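With OpenCV, for example, this erode-then-dilate smoothing (a morphological opening) might be implemented as follows; the kernel sizes correspond to those examined in Table 3, and the function name is illustrative.

```python
# Sketch of erode-then-dilate smoothing with a circular (elliptical) kernel,
# assuming OpenCV.
import cv2

def smooth_mask(mask, kernel_size=100):
    """mask: 8-bit binary mask (0/255); larger kernels give rounder edges."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                       (kernel_size, kernel_size))
    eroded = cv2.erode(mask, kernel)     # shrink white regions
    return cv2.dilate(eroded, kernel)    # re-expand, leaving rounded edges
```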
Segmentation maps resulting from softer-edged masks have less visual distraction — they are more user-friendly — but such ultimately aesthetic considerations obviously cannot trump accuracy. Fortunately, as indicated in Table 3, the effect of smoothing on the accuracy metrics is minimal over a visually significant range of kernel sizes. More concerning is the loss of diagnostic visual elements that can occur when the kernel size becomes a significant fraction of the tile size (Fig. 2(d)), which imposes an upper limit on smoothing. Although Table 3 shows results for only one model, equivalent results were obtained for the best models of the other two image subsets.
| Set 2 - Model 49 (Training Slides) | Edge smoothing (kernel size = 150 pixels) | Edge smoothing (kernel size = 100 pixels) | Edge smoothing (kernel size = 50 pixels) | Unsmoothed | Isomorphic shrink 5% | No rails (unsmoothed) |
|---|---|---|---|---|---|---|
| Mean Jaccard Score | 0.64 | 0.64 | 0.63 | 0.63 | 0.64 | 0.62 |
| Mean Precision | 0.67 | 0.66 | 0.66 | 0.66 | 0.70 | 0.68 |
| Mean Recall | 0.94 | 0.95 | 0.96 | 0.96 | 0.89 | 0.90 |
Table 3 – Effects of edge smoothing, isomorphic shrinkage, and use of entropy rails on similarity metrics. With the rails omitted, tiles were sifted based only on the amount of background, with majority-background tiles excluded.
Fig. 3 illustrates the practical benefit of high recall combined with at least acceptable precision and Jaccard scores. In the best case, Fig. 3(a), the segmentation includes the entire lesion and very little else. Even for the worst performer, Fig. 3(c), nearly all diagnostically relevant tissue is captured, with spurious highlighting confined to regions immediately surrounding the lesion. Given this pattern, it seemed plausible that isomorphically shrinking the diagnostic mask regions might improve segmentation quality. As shown in Table 3, however, shrinking by 5% has minimal impact on overall (Jaccard) similarity while mean recall diminishes significantly; the effect of a 10% reduction is worse. The reason is the uneven distribution of misclassified pixels around a lesion: the error margin in some regions is larger than in others, so the beneficial and deleterious effects of an isomorphic size reduction largely cancel out.
Finally, Table 3 shows the improvement provided by sifting using entropy rails rather than simple background thresholding. While not dramatic, the effect — particularly on recall — is appreciable.
2.2 CAMELYON16 Slides
The CAMELYON16 dataset consists of whole-slide images provided in multilevel TIFF format. The dataset includes segmentation masks prepared by expert pathologists for 111 of these slide images, which have an average uncompressed size of 88,816 x 55,352 pixels. While a few of the images feature large tumor regions, such as that shown in Fig. 1(b), the majority have small lesions that may themselves consist of archipelago-like clusters of minuscule features (see Fig. 4).
These small features necessitated a much larger rescaled image size, subject to competing constraints. For a tile to be classified properly as a tumor tile, at least half of its area must be occupied by tumor tissue, and the tile must contain enough image information to permit the CNN to distinguish reliably among classes. For the contours of a tile-based segmentation to exhibit reasonable fidelity to the represented tumor region, the tile size must be smaller (ideally, considerably smaller) than that region. And finally, if possible, the rescaled image should be small enough to be stored and processed on a mobile device. Balancing these considerations ultimately led to a maximum dimension of 15,000 pixels for the rescaled images.
Ninety of the 111 annotated tumor-containing whole-slide images were selected for training and validation, and 20 of the remaining 21 annotated images served as the test set. Tiles were prepared at different sizes for the tumor and non-tumor portions of each image as described above. The criterion of fidelity dictated a maximum practical tile size of 400 x 400. As shown in Fig. 5, the spread between minimum and maximum entropy values increased substantially below 200 x 200, the tile size that produced the best segmentations. For this dataset at the selected degree of image rescaling, smaller tumor tiles had insufficient visual diversity (presumably arising from insufficiently distinctive anatomic detail) to be well characterized by the entropy criterion. As a consequence, fewer tiles were rejected during preprocessing.
To obtain roughly equal sets of tumor and non-tumor tiles after majority-background sifting, the tumor tiles at size 200 x 200 and above were overlapped by amounts ranging from 86% to 90% and the non-tumor tiles were overlapped by amounts ranging from 50% to 67%. At each size, enough tiles were removed at random from the class having the larger resulting population to equalize the tumor and non-tumor tile sets. These training sets ranged in size from 245,735 tiles of each class at 100 x 100 pixels to 21,576 tiles of each class at size 400 x 400.
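The equalization step amounts to random down-sampling of the larger class, as in this minimal sketch with illustrative names.

```python
# Sketch: equalize the tumor and non-tumor tile populations by removing
# tiles at random from the larger class.
import random

def equalize(tumor_tiles, non_tumor_tiles, seed=0):
    rng = random.Random(seed)   # fixed seed for repeatability
    n = min(len(tumor_tiles), len(non_tumor_tiles))
    return rng.sample(tumor_tiles, n), rng.sample(non_tumor_tiles, n)
```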
The CNN architecture used for this binary classification task was unchanged from that described above, except that training used a binary cross-entropy loss function and sigmoid activation. Once again, models were saved after each training epoch, but overfitting set in much earlier — generally after 25 epochs.
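Relative to the three-class Keras sketch above, only the classification head and the loss change; one way to express this, again as an assumption-laden sketch rather than the actual code, follows.

```python
# Sketch: adapt the earlier three-class Sequential sketch for binary
# classification by swapping the head and loss.
from tensorflow.keras import layers, optimizers

def to_binary(model):
    model.pop()                                       # drop the softmax layer
    model.add(layers.Dense(1, activation="sigmoid"))  # tumor vs. non-tumor
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```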
Best performance was observed with 200 x 200 pixel tiles. As shown in Table 4, the models that achieved highest classification accuracies also produced the best segmentations, but the similarity metrics other than recall were only fair. The large tile size relative to the size of tumor features in some images resulted in a few lesions escaping detection altogether. In other cases, the tumor area was fully captured or nearly so, resulting in high recall scores, but overall similarity suffered due to the large tile size; the tissue approximations, in other words, were coarse. Still, recall scores were above 0.9 for 41% of the images that received a score and above 0.5 for 88% of those images. In most cases, that is, the tumor regions were reasonably well covered despite the small feature sizes. Where feature sizes were large relative to the tile size (as in Fig. 1), performance was comparable to that achieved with the PAIP2020 dataset.
At 150 x 150 and 100 x 100 pixels, tile classification failed altogether. Despite training accuracies that exceeded 99%, none of the test tiles were classified as tumor and the resulting segmentation masks contained no white regions. That this might occur was suggested by the sudden increase in the spread between minimum and maximum entropy values in Fig. 5. With insufficient anatomic information in the tiles to distinguish between tumor and non-tumor tissue, the CNN seems to have overfit immediately to the training tiles.
| Test Slides | Mean Nonzero Jaccard Score | Mean Nonzero Precision | Mean Nonzero Recall | Classif. Accuracy |
|---|---|---|---|---|
| Model 14 | 0.30 | 0.34 | 0.74 | 0.99 |
| Model 20 | 0.39 | 0.45 | 0.73 | 0.98 |
| Average | 0.35 | 0.40 | 0.74 | 0.99 |
Table 4 – Once again the best models, labeled by epoch number, were identified and their segmentation masks compared to ground-truth masks prepared by expert pathologists. Three (in the case of model 20) or four (in the case of model 14) of the segmentation masks showed no relevant features and received scores of zero for all metrics. These corresponded to images with tumor features that were small relative to the tile size and spread out, so no tiles intercepted enough tumor tissue to trigger a positive classification.
[1] https://github.com/stevenjayfrank/A-Eye.