Data collection and processing
We first manually segmented stroma on histology whole slide images (WSIs) provided by the FNLCR Molecular Characterization (MoCha) group, with results reviewed by a pathologist. The WSIs were cut into patches of 1000 x 1000 pixels to make manual segmentation more manageable. Customized Python scripts were then created, forming a pipeline to facilitate data processing. Briefly, under the guidance of a pathologist, the image patches were segmented with GIMP [20], and the segmented layers were extracted and subsequently converted to black-and-white binary images. The generated annotations were reviewed by the pathologist. Methods from the NumPy [21] and PIL Image modules [22,23] were used to merge the patches back into the original WSIs. Methods from the PIL Image module were then used to extract smaller patches of 256 x 256 pixels from the original images and ground truth labels (also called masks or segmentation maps) using a sliding window approach, increasing the dataset size. The smaller patches were converted to NumPy arrays, which were fed as input into our deep learning models. Overlapping patches were cropped from the WSIs to help smooth the output predictions. Ultimately, we obtained 10,240 patches in our dataset (excluding test set images), 75% of which we used for training and 25% for validation.
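As an illustration of this step, the sliding window cropping might look like the following minimal Python sketch; the 128-pixel stride (which produces the overlap), the function name, and the file-handling details are assumptions rather than the exact scripts used.

```python
import numpy as np
from PIL import Image

def extract_patches(image_path, mask_path, patch_size=256, stride=128):
    """Crop an image and its binary mask into overlapping patches with a
    sliding window; a stride smaller than patch_size produces the overlap."""
    image = Image.open(image_path)
    mask = Image.open(mask_path).convert("L")  # ground truth as single-channel binary mask
    width, height = image.size
    patches, mask_patches = [], []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            box = (left, top, left + patch_size, top + patch_size)
            patches.append(np.array(image.crop(box)))
            mask_patches.append(np.array(mask.crop(box)))
    # Stack into arrays that can be fed directly to the deep learning models
    return np.stack(patches), np.stack(mask_patches)
```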
To compare the semantic image segmentation results between the U-Net model [11] and the DeepLabV3+ model [14], the publicly available MICCAI Gland Segmentation (GlaS) Challenge pathology image dataset [24] was downloaded [25]. The dataset consisted of .bmp (bitmap) images split into a training set of 85 images, a test A set of 60 images, and a test B set of 20 images. The training and test A sets contained images of six different sizes: (574 x 433), (589 x 533), (775 x 522), (567 x 430), (578 x 433), and (581 x 442). All test B images were of size (775 x 522). All images were downloaded with their corresponding ground truth labels, which outline glandular objects in colon tissue, and they were pre-processed into forms that the deep learning models could use as input. Briefly, customized Python scripts were created to convert the labels to binary black-and-white images and crop the images and ground truth labels into corresponding 256 x 256 patches via a sliding window approach. The data were then converted into NumPy arrays and fed into the models as input. We used the patches created from the training and test A sets as our combined training dataset and the test B set as our test set.
For the classification of carcinoma and sarcoma, all images were downloaded from the NCI Patient-Derived Model Repository (PDMR) [26]. As a pilot study, we selected only low-magnification (4x) image data: 244 carcinoma images from 9 different carcinoma subtypes and 180 sarcoma images from 7 different sarcoma subtypes. These images are provided by PDMR as regions of interest (ROIs) extracted from the original WSIs. We tried to select a diverse range of images with vastly different appearances to improve the generalization abilities of our models. After obtaining data from the PDMR database, the data were split into training, validation, and test sets: 10% of the carcinoma and sarcoma images were first randomly set aside as test set images, then 20% of the remaining images were randomly set aside as the validation set, leaving the rest as the training dataset. After splitting up the images, each set was converted to overlapping patches via a sliding window approach. Tables 1 and 2 show the details of the dataset splits, after patch extraction, used for our binary classification study between sarcoma and carcinoma and our multi-class classification study between subtypes of carcinoma, respectively. In our models, we also applied data augmentation techniques (horizontal flip, dimensional shift, rotation, zoom, etc.) to our image set to increase the generalization capabilities of our classification models, given our small dataset.
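The split described above can be summarized with a short sketch; the helper name and the fixed random seed are illustrative assumptions, not the exact scripts used.

```python
import random

def split_rois(roi_paths, test_frac=0.10, val_frac=0.20, seed=42):
    """Randomly set aside 10% of ROIs for testing, then 20% of the
    remainder for validation; the rest become the training set."""
    paths = list(roi_paths)
    random.Random(seed).shuffle(paths)          # seed chosen for reproducibility (assumption)
    n_test = int(len(paths) * test_frac)        # rounded down, as in Table 2
    test = paths[:n_test]
    remaining = paths[n_test:]
    n_val = int(len(remaining) * val_frac)
    val = remaining[:n_val]
    train = remaining[n_val:]
    return train, val, test
```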
Table 1. Dataset split implemented for binary classification of sarcoma and carcinoma
| Tumor type | ROI Images (Total) | ROI Images (Train) | ROI Images (Val.) | ROI Images (Test) | Patches (Train) | Patches (Val.) | Patches (Test) |
|---|---|---|---|---|---|---|---|
| Carcinoma | 244 | 187 | 33 | 24 | 18711 | 3089 | 2612 |
| Sarcoma | 180 | 138 | 24 | 18 | 13951 | 2702 | 1928 |
Table 1. Dataset split implemented for binary classification of sarcoma and carcinoma. 10% of the carcinoma and sarcoma images were first randomly set aside to be used as test set images, and 20% of the remaining images were randomly set aside to be used as the validation set, leaving the other images to be used as the training dataset. After splitting up the ROI images, each set was then converted to patches. Val.: Validation set.
Due to the small number of ROI images of each tumor type, especially in the dataset split of our multi-class classification study of carcinoma subtypes, patch extraction was again used to increase dataset size [17]. An internal assessment suggested that patch extraction was a more effective data preparation technique than alternatives such as downsizing (unpublished internal results). Similar to the pipeline we used for the segmentation tasks, customized Python scripts were created to add the necessary padding and then strategically crop the large images into 256 x 256 patches with an overlapping shift using a sliding window approach. We also applied a threshold and discarded patches in which more than 90 percent of the pixels were white or grey background pixels (defined as pixels whose R, G, and B values are all greater than 240).
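A minimal sketch of this background filter is shown below; the function name is an assumption, but the rule it encodes (more than 90 percent of pixels with R, G, and B all above 240) follows the description above.

```python
import numpy as np

def is_mostly_background(patch, pixel_threshold=240, background_fraction=0.9):
    """Return True if more than 90% of the patch's pixels are white/grey
    background, i.e. R, G, and B values all greater than 240."""
    background = np.all(patch[:, :, :3] > pixel_threshold, axis=-1)
    return background.mean() > background_fraction

# Keep only informative patches:
# patches = [p for p in patches if not is_mostly_background(p)]
```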
Table 2. Dataset split implemented for multi-class classification of subtypes of carcinoma
| Cancer Type | ROI Images (Total) | ROI Images (Training) | ROI Images (Validation) | ROI Images (Test) | Patches (Training) | Patches (Validation) | Patches (Test) |
|---|---|---|---|---|---|---|---|
| Adenocarcinoma - colon | 26 | 20 | 4 | 2 | 2175 | 326 | 147 |
| Adenocarcinoma - pancreas | 25 | 19 | 4 | 2 | 2094 | 341 | 490 |
| RCC, clear cell adenocarcinoma | 36 | 27 | 6 | 3 | 2374 | 406 | 328 |
| Renal cell carcinoma, NOS | 25 | 19 | 4 | 2 | 1541 | 222 | 90 |
| Adenocarcinoma - rectum | 25 | 19 | 4 | 2 | 3174 | 725 | 161 |
| Laryngeal squamous cell carcinoma | 19 | 15 | 3 | 1 | 1414 | 285 | 91 |
| Pharyngeal squamous cell carcinoma | 38 | 28 | 7 | 3 | 2664 | 994 | 411 |
| Lung adenocarcinoma | 25 | 19 | 4 | 2 | 1546 | 286 | 142 |
| Squamous cell lung carcinoma | 25 | 19 | 4 | 2 | 1478 | 313 | 194 |
| Total | 244 | 185 | 40 | 19 | 18460 | 3898 | 2054 |
Table 2. Dataset split implemented for multi-class classification of subtypes of carcinoma. Patches were extracted from each carcinoma subtype dataset due to the limited number of ROI images with ground truth labels in each category. 10% of the images of each carcinoma subtype (rounded down) were first randomly set aside to be used as test set images, and 20% of the remaining images were randomly set aside to be used as the validation set, leaving the other images to be used as the training dataset. After splitting up the ROI images, each set was then converted to patches.
Segmentation model architectures and performance evaluation
We used the U-Net and DeepLabV3+ models for our segmentation studies. Named after the shape of its architecture, the U-Net convolutional neural network works well for biomedical image segmentation, even with few training images [11]. Developed by Google, DeepLabV3+ has achieved state-of-the-art results on the PASCAL VOC 2012 and Cityscapes datasets (89.0% and 82.1%, respectively), highlighting the model's accurate segmentation abilities [14]. For best performance, the model utilizes an Aligned Xception backbone modified to support atrous separable convolutions and batch normalization. A Keras implementation of the DeepLabV3+ model was retrieved from GitHub [27] and integrated into our pipeline, while we recreated the U-Net model from scratch. Both the U-Net and DeepLabV3+ model architectures were built in Python 3 using the Keras [28] and TensorFlow [29] modules and were trained on an Ubuntu workstation with two NVIDIA GTX 1080 GPUs. Multi-GPU parallelism was used to increase computational throughput and significantly decrease training time [30].
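A minimal sketch of the data-parallel wrapping in Keras is shown below, assuming a Keras 2.x release that provides multi_gpu_model and an already-constructed model object; the optimizer and loss shown are placeholders rather than the settings actually used.

```python
from keras.utils import multi_gpu_model

# Replicate the single-GPU model on both GTX 1080s; each training batch is
# split across the replicas and gradients are merged on the CPU.
parallel_model = multi_gpu_model(model, gpus=2)   # model: an existing Keras model
parallel_model.compile(optimizer='adam', loss='binary_crossentropy')  # placeholders
```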
To fairly assess their abilities, both models used the Dice coefficient [31,32] to measure similarity to the ground truth labels, and the negative of the Dice score was used as the loss value that the models learned to minimize. K-fold cross-validation and Dropout layers were also utilized to help prevent overfitting and improve the models' generalization abilities. Customized Python wrapper functions acted as pipelines to handle construction of the model architectures, auxiliary function calls for training, model parameter setting and tuning, and performance evaluation. To improve model performance, hyperparameters such as the learning rate, batch size, and number of training epochs were fine-tuned based on previous advice [20]. Depending on the user's available computational resources, the "output stride" variable can also be decreased from 16 to 8 when training the DeepLabV3+ model to achieve slightly better segmentation accuracy at the cost of higher computational complexity [14]. After training, the models output test accuracy values as an indicator of model performance. We also predicted masks on test set images that were not seen during training, and these mask predictions were output alongside their corresponding images and ground truth labels to visually evaluate performance.
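A minimal Keras implementation of this loss might look like the following sketch; the smoothing constant is an assumption added to avoid division by zero and is not taken from the original scripts.

```python
import keras.backend as K

def dice_coefficient(y_true, y_pred, smooth=1.0):
    # Flatten the masks and measure overlap between prediction and ground truth
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_loss(y_true, y_pred):
    # The models minimize the negative Dice score, as described above
    return -dice_coefficient(y_true, y_pred)
```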
In our work to train a model to automatically detect and segment stroma in WSIs, our U-Net model was also trained with a customized validation scheme. The model was run on a server in a Python Virtual Environment with the following packages: Keras 2.1.2, TensorFlow-GPU 1.4.1, Pillow 4.3.0, NumPy 1.13.3 and Scikit-Image 0.13.1.
We compared the performance of the U-Net and DeepLabV3+ models on the MICCAI GlaS image dataset. In training the models, an initial learning rate of 0.0001 (1e-4) was used, and 4-fold cross-validation was utilized. Test set images were then used for prediction by the best-trained models, and the results were visually evaluated for performance comparison. For the purpose of comparison, we also trained some models with a customized validation scheme in addition to training others with conventional k-fold cross-validation. In general, the DeepLabV3+ model was trained with batch sizes of 8 (due to memory constraints), while U-Net was trained with batch sizes of 16. We also fine-tuned other hyperparameters. Pipelines for U-Net and DeepLabV3+ with a similar organizational structure were created and run in Python virtual environments similar to those used for the stroma image data.
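The cross-validation loop might be organized roughly as follows; scikit-learn's KFold, the build_unet() constructor, and the epoch count are illustrative assumptions, and dice_loss and dice_coefficient refer to the loss sketched above.

```python
import numpy as np
from keras.optimizers import Adam
from sklearn.model_selection import KFold

# images, masks: NumPy arrays of 256 x 256 patches and their binary masks.
kfold = KFold(n_splits=4, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in kfold.split(images):
    model = build_unet()                            # hypothetical model constructor
    model.compile(optimizer=Adam(lr=1e-4), loss=dice_loss, metrics=[dice_coefficient])
    model.fit(images[train_idx], masks[train_idx],
              validation_data=(images[val_idx], masks[val_idx]),
              batch_size=16, epochs=50)             # epoch count is a placeholder
    fold_scores.append(model.evaluate(images[val_idx], masks[val_idx], verbose=0)[1])
print("Mean validation Dice across folds:", np.mean(fold_scores))
```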
Tumor type and tumor subtype classification model architectures and performance evaluation
A simple convolutional neural network (CNN) was trained from scratch as the reference point [33]. Weight decay and Dropout layers were used at the end of the simple CNN. Additional layers were added as needed, and the number of filters in existing layers was also adjusted to optimize model performance. We used a sigmoid activation at the end of the binary classification models and a softmax activation at the end of the multiclass classification models. Class labels (tumor type or subtype) were assigned using Keras' ImageDataGenerator class and a customized directory structure.
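A baseline along these lines might look like the following sketch; the filter counts, layer depth, and regularization strength are illustrative and not the exact architecture used.

```python
from keras import models, layers, regularizers

def build_simple_cnn(num_classes=1):
    """Small CNN baseline with weight decay and Dropout near the output.
    A sigmoid head is used for binary classification, softmax for multiclass."""
    model = models.Sequential()
    model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(256, 256, 3)))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(128, (3, 3), activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(256, activation='relu',
                           kernel_regularizer=regularizers.l2(1e-4)))  # weight decay
    if num_classes == 1:
        model.add(layers.Dense(1, activation='sigmoid'))
    else:
        model.add(layers.Dense(num_classes, activation='softmax'))
    return model

# Labels are inferred from the directory structure, e.g.:
# from keras.preprocessing.image import ImageDataGenerator
# train_gen = ImageDataGenerator(rescale=1./255, horizontal_flip=True).flow_from_directory(
#     'patches/train', target_size=(256, 256), batch_size=32, class_mode='binary')
```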
The models trained with transfer learning made use of the VGG-16 [18] and InceptionResNetV2 [19] networks as feature extractors, two models of greatly varying complexity that both come pre-packaged with Keras. We trained the transfer learning models [15,16] with the selected pre-trained models as a convolutional base, and additional fully connected layers (Dense layers) were attached to the end of the model to act as the classifier. Although three different classifiers (Dense layers, global pooling layers, and linear support vector machines) were tested to optimize the models, Dense layers were selected to be used in this study.
We trained the transfer learning models with the techniques of feature extraction and fine-tuning [30]. A customized function was created to unfreeze layers and then recompile the model for fine-tuning with a new learning rate. The models were first trained with VGG-16 and InceptionResNetV2 acting as feature extractors, meaning that the model was trained with a completely frozen (untrainable) convolutional base for a set number of epochs. Then the top layers of the convolutional base were unfrozen, and the model was trained with fine-tuning for additional epochs.
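The two-stage procedure for the VGG-16 variant might look roughly like the following sketch; the classifier size, learning rates, and the block5_conv1 unfreezing point are assumptions rather than the exact settings used.

```python
from keras import models, layers
from keras.applications import VGG16
from keras.optimizers import Adam

# Stage 1: feature extraction with a frozen VGG-16 convolutional base.
conv_base = VGG16(weights='imagenet', include_top=False, input_shape=(256, 256, 3))
conv_base.trainable = False

model = models.Sequential()
model.add(conv_base)
model.add(layers.Flatten())
model.add(layers.Dense(256, activation='relu'))   # Dense classifier attached on top
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))  # binary carcinoma/sarcoma head
model.compile(optimizer=Adam(lr=1e-4), loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(...) for the feature-extraction epochs

# Stage 2: fine-tuning. Unfreeze the top convolutional block and recompile
# with a smaller learning rate before training for additional epochs.
def unfreeze_top_layers(model, conv_base, from_layer='block5_conv1', new_lr=1e-5):
    conv_base.trainable = True
    trainable = False
    for layer in conv_base.layers:
        if layer.name == from_layer:
            trainable = True
        layer.trainable = trainable
    model.compile(optimizer=Adam(lr=new_lr), loss='binary_crossentropy',
                  metrics=['accuracy'])

unfreeze_top_layers(model, conv_base)
# model.fit(...) for the fine-tuning epochs
```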
To evaluate model performance, we used the loss and accuracy metrics pre-packaged with the Keras module. For performance comparison, we also derived AUCs of ROC curves for each trained model with customized R scripts that used the pROC R package [34]. The confidence and accuracy of the binary and multiclass classification models in their test set ROI image predictions were also assessed by Keras' built-in accuracy metric, which we used as the test accuracy (patch-wise classification score) for performance evaluation. In addition, to examine performance in more detail, we also output image-wise classification scores and classification results on randomly selected ROIs using test set prediction functions designed for our classification tasks, similar to the pipeline we built for U-Net and DeepLabV3+.
In outputting predictions for binary classification between carcinoma and sarcoma, each test set image is turned into patches (patches with too many background pixels are again discarded) and fed into the model to predict class labels. The model outputs a prediction score between 0 and 1 for each patch, where values closer to 0 indicate the model is more confident the patch is carcinoma, and values closer to 1 indicate the model is more confident the patch is sarcoma. We then average the prediction scores over all input patches to compute the image-wise classification score for each ROI image. If the ROI ground truth label is carcinoma, we denote the ROI as correctly classified when the image-wise classification score is less than 0.5 and incorrectly classified when it is greater than 0.5. Similarly, ROIs with the ground truth label of sarcoma are correctly predicted if the image-wise classification score is greater than 0.5 and incorrectly predicted if it is less than 0.5. We define this to be the classification result of each test set ROI. We used the image-wise classification scores and classification results of all individual test set ROIs as input for the ROC analysis used to compare all models.
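A minimal sketch of this image-wise scoring is given below; the function name is an assumption, but the averaging and 0.5 threshold follow the procedure described above.

```python
import numpy as np

def classify_roi_binary(model, roi_patches):
    """Average the sigmoid prediction scores over all patches of an ROI.
    Scores below 0.5 map to carcinoma, scores above 0.5 to sarcoma."""
    scores = model.predict(np.stack(roi_patches)).ravel()
    image_wise_score = float(scores.mean())
    predicted_label = 'sarcoma' if image_wise_score > 0.5 else 'carcinoma'
    return image_wise_score, predicted_label
```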
To output test set image predictions for our multiclass classification models of the different carcinoma subtypes, we derived a simple customized method that produces a final class label prediction by taking the mode of the patch class labels. First, each ROI is turned into patches; we obtain the softmax-activated prediction results of the model and assign a predicted class to each patch according to its largest probability value. The frequency of each predicted class among the patches is then tallied, and the predicted class label for the input image is taken to be the most frequently predicted class among all of its individual patches. If the predicted class label matches the ground truth label, we denote the ROI as correctly classified. We used this prediction function on our test set ROIs to output the classification results for our multiclass classification models.
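A minimal sketch of this majority-vote rule is given below; the function name and class_names argument are assumptions, while the argmax-per-patch and mode-over-patches logic follows the description above.

```python
import numpy as np

def classify_roi_multiclass(model, roi_patches, class_names):
    """Assign each patch its argmax softmax class, then take the most
    frequently predicted class as the ROI-level label."""
    probs = model.predict(np.stack(roi_patches))       # shape: (n_patches, n_classes)
    patch_labels = np.argmax(probs, axis=1)            # per-patch predicted class
    counts = np.bincount(patch_labels, minlength=len(class_names))
    return class_names[int(np.argmax(counts))]         # mode of the patch labels
```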