An Annotation-free Whole-slide Training Approach to Pathological Classification of Lung Cancer Types by Deep Neural Network

Deep learning for digital pathology is hindered by the extremely high spatial resolution of whole slide images (WSIs). Most studies adopt patch-based methods, which require well-annotated data for training; such annotations are typically produced by laborious free-hand contouring on the WSI by experts. To alleviate the annotation burden on experts and to benefit from scaling up the amount of data, we develop a whole-slide training method that classifies lung cancer types from entire WSIs using only slide-level diagnoses. Our method leverages unified memory to offload the excessive memory consumption to host memory, allowing a classifier to be trained on entire slides of hundreds of millions of pixels. Experiments were conducted on a lung cancer dataset containing 9,662 digital slides covering the major histological types. The results show that the proposed method achieves AUCs of 0.950 and 0.924 for adenocarcinoma and squamous cell carcinoma, respectively, on a separate testing set. Furthermore, critical regions highlighted by the class activation map (CAM) technique applied to our model show a high correspondence to cancerous areas annotated by pathologists.


Introduction
Lung cancer is among the most frequently diagnosed cancers and has been the leading cause of cancer-related mortality worldwide, including in Taiwan, in recent decades 1 . Non-small cell lung cancer (NSCLC) accounts for about 85% of newly diagnosed lung cancer cases, with two major histological types, adenocarcinoma and squamous cell carcinoma, accounting for nearly 50% and 30% of NSCLC, respectively 2 . Invasive adenocarcinoma of the lung is a malignant epithelial tumor exhibiting five major patterns: lepidic, acinar, papillary, micropapillary, and solid. Squamous cell carcinoma is a malignant epithelial tumor with squamous differentiation and/or keratinization. Proper pathologic diagnosis can be challenging in many cases since morphological differences among lung cancer types are subtle. Examples of pathological features of adenocarcinoma and squamous cell carcinoma are shown in Figure 1.
In the past few years, deep neural networks, especially convolutional neural networks (CNNs), have taken over as the dominant method for image recognition since their performance surpassed most traditional image analysis algorithms in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 [3][4][5][6] . In medical fields, deep learning algorithms have also been demonstrated to attain human-level performance on several tasks, including tumor identification and segmentation in CT or MRI 7-8 , cardiovascular risk assessment from images of the eye 9 , and pneumonia detection from chest X-rays 10 . However, analysis of digital whole slide images (WSIs) remains challenging because of their extremely high spatial resolution compared to other medical images such as X-ray, CT, or MRI.
Multiple instance learning (MIL) follows the same two-stage workflow as the traditional patch-based method while organizing the training procedure in a different way. The main idea of MIL for slide-level cancer classification is that if the highest-scoring patches (the K most probable patches) on a slide are identified as carcinoma, the slide should be classified as cancer, and vice versa. Moreover, recent studies show that even state-of-the-art weak supervision methods still cannot attain the average performance of strong supervision methods in most computer vision fields, such as object detection, semantic segmentation, and instance segmentation.
In this work, we implemented unified memory (UM) CNNs to train models with huge inputs. Additionally, we trained models in parallel and distributed across supercomputing clusters. Experiment results show that our method can be applied to WSI classification directly and outperforms the MIL method. Our contributions are summarized as follows: 1). To the best of our knowledge, this is the first study evaluating methods that use only slide-level annotations, namely MIL and our proposed method, for classifying lung specimens into adenocarcinoma, squamous cell carcinoma, or non-cancer. 2). Experiment results demonstrate the superior performance of our proposed method, which achieved AUC scores of 0.950 and 0.924 for adenocarcinoma and squamous cell carcinoma respectively, outperforming MIL. 3). Critical regions highlighted by the class activation map (CAM) 23 technique applied to our model reveal a high correspondence to annotations provided by pathologists.

Results
All experiments were conducted with slides scaled to 4x magnification as input to train both the whole-slide model and the MIL model. For the whole-slide training method, the size of inputs was 20,000 x 20,000 pixels on average. We experimented with both global average pooling (GAP) and global max pooling (GMP) layers as the feature aggregation layer for our proposed method. For MIL models, the size of instances was set to 224 x 224 pixels without overlapping to train the classifier, and instance predictions were aggregated by the max-pooling operation to derive the final prediction. Unless specified, models were trained and evaluated by different

Comparison of Overall Model Performance
We describe the evaluation method and results of the experiments. Models with different settings for classifying the main lung cancer types, particularly adenocarcinoma and squamous cell carcinoma, are measured using the area under the receiver operating characteristic curve (AUC).
As shown in Figure 2, the whole-slide training method with GAP layers interestingly yields AUC scores close to 0.5: 0.546 (0.516-0.576; p-value=0.00330) for adenocarcinoma and 0.528 (0.486-0.571; p-value=0.188) for squamous cell carcinoma on this classification task, which indicates that the model captures only a limited amount of information from its inputs and performs at a near random-guessing level. Although GAP layers are widely adopted by state-of-the-art CNN models [3][4][5][6] for natural image classification, GAP layers applied to high-resolution images are prone to losing the subtle information presented by tiny features. This inefficiency leads to a significant downgrade in model performance compared to the whole-slide training method with GMP layers.
In addition, the learning curves of both MIL and our method are illustrated in Figure 3. They show a significant difference between the two methods in the early training stage. The performance of our proposed method increases sharply during the first few epochs and converges gradually over the remaining training time. This is a typical learning pattern when training DNNs, since models tend first to seek and learn features that are easily separable in high-dimensional spaces, which contributes to a drastic improvement in accuracy.
As training proceeds, most of the obvious features have already been exploited and model performance gradually saturates. Subtle features, instead, are extracted in later training stages as the model seeks other high-dimensional separating planes to minimize the loss, thereby refining itself. In contrast, the learning curves of MIL are relatively flat in the first few epochs.
The MIL training procedure relies on its work-in-progress model to choose representative tiles from whole slide images before training the classifier. However, a model in the early training stage behaves like a random guesser since it has not yet learned any information. Wrongly selected tiles inevitably misguide the model and thus slow down the convergence rate.

Visualization
Though CNNs have achieved impressive performance on the current classification task, it is even more intriguing to understand how the models make decisions. Visualization is the most straightforward way to investigate how models learn to solve a given task. Different visualization approaches are applied to the two models since their internal properties differ. For the MIL model, we derived a prediction map simply by feeding all the tiles cropped from a WSI into the patch-level classifier. These predictions indicate the probability of each local area containing cancerous features.
The whole-slide method, on the other hand, does not permit this approach because no patch-level classifier is available. Instead, we adopted the class activation map (CAM) 23 to visualize critical regions.
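For concreteness, the CAM is obtained by projecting the final convolutional feature maps onto the classifier weights of the target class, yielding one activation value per spatial location. The following is a minimal NumPy sketch; the function name and the min-max normalization are illustrative, not part of the original CAM implementation:

```python
import numpy as np

def class_activation_map(feature_maps: np.ndarray, class_weights: np.ndarray) -> np.ndarray:
    """Project final conv feature maps onto one class's classifier weights.

    feature_maps: (H, W, C) activations from the last convolutional block.
    class_weights: (C,) weights connecting pooled features to the class logit.
    """
    # Weighted sum over channels gives one activation per spatial location.
    cam = np.tensordot(feature_maps, class_weights, axes=([2], [0]))  # (H, W)
    # Min-max normalize to [0, 1] so the map can be rendered as a heatmap.
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam
```

The resulting low-resolution map is then upsampled to the slide dimensions and overlaid as a heatmap.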
As shown in Figure 4, both the MIL model and the whole-slide model could discover representative information, highlighted by heatmaps, after iteratively learning from slide-level diagnoses. Furthermore, our method revealed a more comprehensive ability to highlight all suspicious areas on the slide, especially small lesions. This may explain why the whole-slide method achieves better performance.

Throughput Comparison
To overcome the GPU memory limitation, our approach leverages Unified Memory (UM) to offload temporary data to host memory. However, the low access rate of system memory significantly reduces overall performance and lengthens the training time. Therefore, we increased UM efficiency by optimizing memory access and converting the model into a mixed precision graph [26] during training. Throughput improvements were measured by the number of slides processed per second. To establish a common baseline among training methods and the various speed optimizations, all slides were resized to 10,000 pixels in both width and height beforehand. As shown in Figure 5, the throughput of the whole-slide training method can be sped up by 3.53x by incorporating both optimized memory access and mixed precision training.
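The following TensorFlow 1.x-style snippet sketches one common way to enable these two optimizations, namely unified-memory oversubscription (a GPU memory fraction above 1.0 backs allocations with CUDA managed memory) and the automatic mixed precision graph rewrite; it is illustrative rather than a verbatim excerpt of our pipeline:

```python
import tensorflow as tf  # TensorFlow 1.x-style API

# A memory fraction above 1.0 tells TensorFlow to back GPU allocations with
# CUDA Unified Memory, oversubscribing physical GPU memory into host memory.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=4.0)
config = tf.ConfigProto(gpu_options=gpu_options)

optimizer = tf.train.AdamOptimizer(learning_rate=1e-5)
# Rewrite the graph to compute in float16 where numerically safe, keeping
# float32 master weights; this roughly halves activation memory traffic.
optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)

# train_op = optimizer.minimize(loss)
# with tf.Session(config=config) as sess: ...
```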
Additionally, we deployed the training pipeline on [Anonymous Computing Center], a multi-GPU, multi-node supercomputing environment. Given a hardware configuration of 4 computing nodes with 16 GPUs in total, the training process achieved a 52.75x throughput compared to the single-GPU, non-optimized baseline. As illustrated in Figure 5, the scaling efficiency was 93.32%, which implies that little performance overhead is paid when training models with our proposed method in a distributed fashion.
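As a sketch of how such data-parallel scaling is typically wired up, the snippet below uses Horovod as an illustrative framework (one process per GPU, launched with, e.g., `horovodrun -np 16`); the choice of framework is an assumption for exposition, not necessarily what runs on our cluster:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each worker process to a single GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate with the worker count and wrap the optimizer so
# gradients are averaged across all GPUs via ring all-reduce.
optimizer = tf.train.AdamOptimizer(learning_rate=1e-5 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)

# Broadcast initial weights from rank 0 so every worker starts identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```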
The MIL method, which does not involve Unified Memory, was 3.53x faster than our proposed method in model training (Figure 5A).
In contrast, the inference throughput of our method outperforms that of MIL by 1.93x (Figure 5C). The large discrepancy between the training and inference speeds of our method is due to their different memory consumption. In our setting, with a Tesla V100 GPU and 10,000 x 10,000 slide inputs, all the temporary data produced during the inference phase fits in GPU memory. Thus, model inference with our method yields better throughput without the overhead incurred by Unified Memory access.
In general, given the setting of 8 GPUs in the training phase, it took around two weeks for our proposed method and one week for MIL to reach convergence on the 5,045-slide training dataset. As for the inference phase, completing all 1,397 slides in the test set on a single GPU took nearly 4.5 hours with our proposed method and about 9 hours with MIL.

Discussion
Generating detailed annotations on WSIs is extremely laborious, cumbersome, and costly. For example, it takes an expert half an hour to an hour to annotate a single WSI into different types of regions, for instance normal tissue, carcinoma, and fibrosis regions, and even then the annotation covers only part of the WSI. Borders between tissue types are often ambiguous, leading to inconsistent annotations between pathologists. The high variability of tissue morphology makes it difficult to cover all possible examples during annotation.
These annotation shortcomings make deep learning models strongly biased by expert-defined annotations and hinder comprehensive learning. To avoid the requirement for enormous annotation effort and the selection biases of experts, most recent studies seek weak supervision methods that train deep learning models to relate a WSI directly to its clinical diagnosis.
Training cancer classifiers without detailed annotations alleviates the annotation burden on experts and allows deep neural network models to benefit from huge amounts of raw slide data with clinical diagnoses. In previous work, methods for detecting cancers with strongly supervised models trained on patch-wise annotations [11][12][13][14][15][16][17][18][19] still outperformed weakly supervised models. However, research on weakly supervised models for cancer detection is steadily increasing, since a model trained in a strongly supervised manner is limited by how its targets are annotated. Although weak supervision is challenging, it makes it easier to derive and update annotations according to different goals.
Although MIL can be trained with weak labels such as slide diagnoses, it has several drawbacks that may affect model performance: 1). Wrongly selected tiles in early training iterations may trap models in a local minimum due to the random initialization of models. In some situations, models even stop improving after the first few epochs. 2). MIL utilizes the k-most representative tiles while abandoning relevant information contained in the other tiles. However, the number k of representative tiles is a hyperparameter, and the truly informative regions vary from slide to slide. These k-most representative tiles may either take irrelevant tiles as positives or lack the capacity to include atypical patterns that are crucial to diagnosis.
Instead of modifying algorithms and the training pipeline into a weak supervision form to bypass the out-of-memory issue, we leveraged unified memory (UM) to train CNNs with huge image inputs directly, without modifying the training pipeline. UM enables GPUs to access host memory directly, which expands the total capacity for temporary data from the gigabyte level to the terabyte level. Although the UM mechanism can swap memory between GPUs and the host, it drastically slows down the forward and backward propagation of CNNs due to the frequent exchange of data between GPUs and the host.
In this study, we demonstrated that CNNs can identify features of adenocarcinoma and squamous cell carcinoma of the lung using merely slide-level labels. Instead of using MIL, a commonly used training procedure for weak supervision tasks, we leveraged UM to enable whole-slide training. Our proposed method achieved AUCs of 0.950 and 0.924 for adenocarcinoma and squamous cell carcinoma, respectively. It also showed a high correspondence between the class activation map (CAM) and cancerous areas identified by experts.
Importantly, improving both classification accuracy and lesion localization appears feasible by acquiring more data. Comparing the results of the cross-site experiment against the single-site experiments, a significant improvement was obtained merely by augmenting the dataset.
Compared to Coudray et al. 14 , who trained patch-level models with 1,634 slides and tens of millions of patch-level annotations, and Campanella et al. 24 , who trained MIL models with tens of thousands of slides, our model achieved competitive performance with only 7,003 weakly-labeled slides.
The lesion localization of our model via CAM showed good coverage in most cases. It should be noted, however, that the semantics of CAM in our current method differ slightly from locating cancer cells. Areas highlighted by CAM are those highly related to the prediction. In other words, deep neural networks (DNNs) use whatever features are distinguishable across the given dataset to classify an image into groups, which may bring side-effects such as contextual bias. For instance, a classifier trained to distinguish cars and boats will highlight not only the boat itself but also the water, since boats are always accompanied by water. In our case, the CAM for squamous cell carcinoma highlighted not only cancerous regions but also necrotic regions (Figure 6). Although necrosis can be caused by other conditions such as injury, infection, or inflammation, squamous cell carcinoma tends to consume nutrients at the border of the tissue, leading to ischemia in inner regions and, eventually, necrosis. Hence, necrosis becomes a signature when distinguishing squamous cell carcinoma from adenocarcinoma, since it is rare to obtain biopsies from patients with only injury, infection, or inflammation, and thus such data are far scarcer in our current dataset.
Since deep learning models collect all possible clues to make decisions, such a weak relation, or halo effect, is learned as useful for classifying adenocarcinoma and squamous cell carcinoma. One way to resolve this issue is to add a small number of slides with detailed annotations that specify cancer cells. These annotations provide hints for the model to separate cancerous representations from weakly related ones. Such an integration, leveraging both slide-level annotations and limited detailed annotations, could be developed to achieve a more comprehensive and precise model in the future.

Multiple Instance Learning
Most binary image classification tasks can be formulated as a multiple instance learning (MIL) problem, in which an image is treated as a bag of instances and a bag is labeled positive if and only if it contains at least one positive instance. This property makes it possible to train models with bag-level labels by applying positive bag-level labels to critical instances in positive bags and negative bag-level labels to all instances in negative bags. During training, MIL iteratively selects high-score instances from each bag while training a classifier.
More specifically, the MIL method can be separated into two alternating steps, instance selection and classifier optimization, as illustrated in Figure 1. During instance selection, an instance selector, which computes the probability of being positive for the instances of each bag, is used to mine the K most positive instances from each bag. With the selected instances, a classifier is trained to maximize the probability of instances selected from positive bags while minimizing the probability of instances selected from negative bags.
At the end of MIL training, the classifier is able to mine the most relevant patterns of positive cases, and the label of any given bag can be inferred by naive aggregation methods such as taking the maximum score (max-pooling) or averaging the scores of the K most positive instances (average-pooling).
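The two alternating steps can be condensed into the following PyTorch-style sketch; the function and variable names are ours, and a binary bag label is assumed for brevity:

```python
import torch

def mil_epoch(model, bags, bag_labels, optimizer, k=1):
    """One MIL epoch: (1) instance selection, then (2) classifier optimization.

    bags: list of tensors, each (num_instances, C, H, W) holding one bag's patches.
    bag_labels: 1-D tensor of 0/1 bag-level labels.
    """
    loss_fn = torch.nn.BCEWithLogitsLoss()
    # Step 1: score every instance with the current work-in-progress model
    # and keep the k highest-scoring instances of each bag.
    model.eval()
    selected = []
    with torch.no_grad():
        for bag in bags:
            scores = model(bag).squeeze(1)                  # (num_instances,)
            topk = scores.topk(min(k, len(scores))).indices
            selected.append(bag[topk])
    # Step 2: train the classifier, assigning each selected instance the label
    # of its bag (positive bags -> positive labels, negative bags -> negative).
    model.train()
    for instances, label in zip(selected, bag_labels):
        optimizer.zero_grad()
        logits = model(instances).squeeze(1)
        loss = loss_fn(logits, label.expand_as(logits).float())
        loss.backward()
        optimizer.step()
```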

Baseline MIL Method on Lung Cancer Type Classification
The lung cancer typing task with only slide-level labels can be considered a MIL problem, since a slide is marked as either adenocarcinoma, squamous cell carcinoma, or non-cancer tissue.
As illustrated in Figure 7, we treated each slide as an independent bag and cropped 224x224-pixel patches inside each bag as instances to train a classifier, aggregating the instance predictions with the most widely adopted max-pooling method. For the instance classifier, a ResNet-50 3 with fixup initialization 25 was implemented. Since lung cancer type classification relies mainly on inspecting tissue-level morphology rather than cell-level morphology, as shown in Figure 1, we generated instances at 4x magnification to provide a sufficient observation scope for the classifier. The slides were augmented by flipping, translation, and rotation, followed by intensity normalization, before generating instances.
We removed instances belonging to the background with a color filter, which reduced the total number of instances by 75% and sped up the whole training process. To select representative instances from each bag, we set K=1 to pick the instance most likely to be an adenocarcinoma or squamous cell carcinoma instance, and labeled the selected instance of each bag with its corresponding slide-level annotation.
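A minimal sketch of such a background filter follows; the saturation-based rule and the thresholds are illustrative (glass background on H&E slides is near-white and therefore unsaturated, while stained tissue is not):

```python
import numpy as np

def is_background(patch: np.ndarray, sat_thresh: float = 0.05, frac: float = 0.75) -> bool:
    """Flag a 224 x 224 RGB patch as background when most pixels are unsaturated."""
    rgb = patch.astype(np.float32) / 255.0
    max_c, min_c = rgb.max(axis=2), rgb.min(axis=2)
    # HSV saturation: (max - min) / max, guarding against division by zero.
    saturation = (max_c - min_c) / np.maximum(max_c, 1e-8)
    return float((saturation < sat_thresh).mean()) > frac
```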

Whole-slide Training Method
As a workaround for the hardware memory constraint, MIL alters the typical CNN training pipeline and thus introduces several drawbacks. First, instances are selected nearly at random in the early training phase because of the random initialization of the classifier. These early selections strongly affect the classifier's selection strategy in the subsequent training process. In some situations, wrongly selected instances, for instance those containing accompanying hyperplastic tissue rather than the cancer cells themselves, may lead the classifier into a local minimum so that it stops improving after a few epochs.
Second, the K most representative instances may omit informative regions or take irrelevant regions into account, which limits the model's ability to learn the targets comprehensively. Since taking irrelevant instances as positives is the more severe problem for the training procedure, most implementations set K=1 to take only the most relevant instance, with the result that models lack the capacity to include atypical patterns that are crucial to diagnosis.
Alternatively, we propose a whole-slide training method that combines the standard CNN architecture with the unified memory (UM) mechanism to support inputs of hundreds of millions of pixels directly, training the models as usual (Figure 7). By using UM, we give GPUs direct access to host memory, which provides terabytes of capacity for the intermediate data produced during training.

MIL Assumption in the Whole-slide Training
Training a classifier on natural images and training one on WSIs are totally different scenarios due to the difference in image scale. Typically, an image with 224 x 224 spatial resolution is condensed into 2,048 feature maps with 7 x 7 spatial resolution after the multiple stacks of convolutional and downsampling layers in a ResNet. Since these layers perform sliding-window operations, each 7 x 7 feature map preserves the spatial arrangement of the input image. More precisely, these layers can be viewed as a function in which each pixel of the feature map encodes a certain region of the original input into a single 2,048-dimensional embedding vector.
The region of the original input that a single feature-map pixel projects onto is referred to as its receptive field 28 . Information beyond a receptive field cannot be encoded. According to the operations of ResNet-50, the receptive field of the final feature map is 483 x 483 pixels, which is larger than the common input size of 224 x 224. As a result, the receptive field of every pixel on the final feature maps already covers the entire image.
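The 483-pixel figure follows from the standard recurrence r ← r + (k − 1) · j and j ← j · s applied over the layers, where k is the kernel size, s the stride, and j the cumulative stride. The short script below reproduces it for the original ResNet-50 layout, assuming the stride-2 1 x 1 convolution sits at the start of each downsampling stage:

```python
def receptive_field(layers):
    """Cumulative receptive field of a stack of (kernel_size, stride) layers."""
    r, j = 1, 1
    for k, s in layers:
        r += (k - 1) * j  # each layer widens the field by (k - 1) strides
        j *= s            # and multiplies the cumulative stride by s
    return r

# ResNet-50: 7x7/2 stem, 3x3/2 max-pool, then bottleneck stages; 1x1 convs
# (k = 1) never widen r, but the strided ones still double j.
resnet50 = [(7, 2), (3, 2)]
resnet50 += [(3, 1)] * 3             # conv2_x: 3 bottlenecks
resnet50 += [(1, 2)] + [(3, 1)] * 4  # conv3_x: 4 bottlenecks, first one strided
resnet50 += [(1, 2)] + [(3, 1)] * 6  # conv4_x: 6 bottlenecks
resnet50 += [(1, 2)] + [(3, 1)] * 3  # conv5_x: 3 bottlenecks

print(receptive_field(resnet50))  # 483
```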
Finally, a global average pooling (GAP) layer is commonly applied to average the 7 x 7 x 2,048 feature maps into a 1 x 1 x 2,048 vector. However, the receptive field of any given pixel on the final feature maps no longer covers the whole image when the input is enlarged to tens of thousands of pixels in height and width. This difference is critical for cancer classification of WSIs: since malignant regions may be tiny relative to the whole tissue on a positive slide, only very few receptive fields cover the critical areas.
With majority-voting aggregation, i.e., global average pooling (GAP), at the end of the feature maps, the critical signals are further diluted by signals from feature maps that are irrelevant to cancer patterns. This ultimately constrains the model's ability to identify slides with small cancerous areas.
Inspired by MIL, we replace the GAP layer with a global max pooling (GMP) layer, which keeps only the maximum value of each of the 2,048 channels over all spatial positions, as shown in Figure 8. Large values in the embedding vector imply that meaningful features have been extracted. By adopting GMP, those large values are kept, and thus the distinguishable signals behind them are preserved.
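A toy example makes the contrast concrete: when a single strong activation occupies a tiny region of a large feature map, GAP dilutes it by the number of spatial positions, whereas GMP preserves it (illustrative snippet):

```python
import torch

# One channel over a 64 x 64 grid, with a strong activation confined to a
# 2 x 2 "small lesion" region.
fmap = torch.zeros(1, 1, 64, 64)
fmap[0, 0, 30:32, 30:32] = 10.0

gap = fmap.mean(dim=(2, 3))  # global average pooling
gmp = fmap.amax(dim=(2, 3))  # global max pooling

print(gap.item())  # ~0.0098: the signal is averaged over 4,096 positions
print(gmp.item())  # 10.0:    the signal survives intact
```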

Experiment Setup
Along with the training progress, the kernel weights were gradually updated by stochastic gradient descent (SGD). We used the Adam optimizer 30 , with an initial learning rate of 0.00001 that decays to 0.000001 when the validation loss does not improve for 16 epochs, to train the model, and we evaluated performance every 704 training steps.
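This schedule maps naturally onto a reduce-on-plateau callback; the Keras-style sketch below is illustrative rather than a verbatim excerpt of our code:

```python
import tensorflow as tf

# Adam starting at 1e-5, dropping toward 1e-6 once the validation loss
# has not improved for 16 epochs.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
plateau = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.1, patience=16, min_lr=1e-6)

# model.compile(optimizer=optimizer, loss="categorical_crossentropy")
# model.fit(..., validation_data=..., callbacks=[plateau])
```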

Statistics
We use the area under the receiver operating characteristic curve (AUC) as the evaluation metric to measure the slide-level performance of the different methods. The 95% confidence interval was obtained using a bootstrap approach as follows: stratified sampling with replacement of n slides from the testing set, where n is the size of the testing set, followed by computing the AUC. We repeated the sampling procedure to derive 10,000 bootstrap samples and report the 2.5 and 97.5 percentile values. When comparing the AUCs of two models, the p-value was also calculated by the bootstrap approach with the same parameters and a one-sided hypothesis. To evaluate the significance level of a single model's AUC, we adopted a dummy model that always returns 0.5 as the null hypothesis; in this case, the p-value was calculated by the bootstrap method with a two-sided hypothesis.
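The stratified bootstrap for the AUC confidence interval can be sketched as follows (hypothetical helper name):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=10_000, seed=0):
    """Resample slides with replacement within each class, recompute the AUC,
    and report the 2.5 and 97.5 percentiles of the bootstrap distribution."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    pos, neg = np.where(y_true == 1)[0], np.where(y_true == 0)[0]
    aucs = np.empty(n_boot)
    for b in range(n_boot):
        idx = np.concatenate([rng.choice(pos, len(pos)), rng.choice(neg, len(neg))])
        aucs[b] = roc_auc_score(y_true[idx], y_score[idx])
    return np.percentile(aucs, [2.5, 97.5])
```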
For the throughput tests, we collected the elapsed time for a model to train on a batch and repeated the same procedure 30 times. The throughput value is reported as the sample mean, and the 95% confidence interval was derived as 1.96 times the sample standard deviation around the mean.
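In code, this amounts to the following (hypothetical helper name):

```python
import numpy as np

def throughput_ci(batch_times, batch_size):
    """Mean throughput (slides/s) over repeated timed batches, with a
    95% interval taken as +/- 1.96 sample standard deviations."""
    rates = batch_size / np.asarray(batch_times)  # slides per second
    mean, sd = rates.mean(), rates.std(ddof=1)
    return mean, (mean - 1.96 * sd, mean + 1.96 * sd)
```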