Dataset
The ACDC-LungHP dataset(9, 15) contains 200 H&E stained biopsy samples with cancer. All samples have been digitalized by a digital slide scanner (3DHISTECH Pannoramic 250) with objective magnifications of 20x. The cancer regions on tissue level for each WSI have been manually annotated by experienced pathologists. Among them 150 samples with reference standards are released as training data. The remaining 50 samples are test data. Whole-Slide images are in TIFF format, manual annotations are in XML format. Detailed information about the dataset can be found at the published paper and web site(9, 15). In the clinical practice, more than one sample from the same biopsy will be scanned. The dataset did not annotate all samples. Specifically, if samples have a similar shape, the pathologist only annotated one sample for the WSI. Because most of these slides contain multiple samples, these unannotated samples must be ignored during model training. In order to exclude these unused tissue samples, we use the ASAP(Automated Slide Analysis Platform) software to annotate the region of interest(ROI) areas. For every slide, a ROI annotation XML file was created.
Data preprocessing
The ACDC-LungHP dataset contains a training dataset and a test dataset; however, the competition has finished and the ground-truth annotations of the test dataset are not publicly available. Therefore, the training dataset was randomly divided into training, validation, and test sets at the slide level, with 100, 25, and 25 slides, respectively. Besides data splitting, preprocessing included the creation of ROI and tumor masks and the generation of image patches. For every slide, a binary ROI mask file and a tumor mask file were created from the corresponding XML annotation files using the ASAP library.
The patch size was 512 × 512 pixels, a common choice in image segmentation tasks and WSI analysis (16); in preliminary experiments, there was no obvious difference between patch sizes of 448 × 448, 512 × 512, and 576 × 576. Whole-slide images are commonly stored at multiple resolutions to allow efficient image loading. Patches were generated at slide levels 0, 1, and 2, which correspond to magnifications of 20X, 10X, and 5X, respectively. Magnifications of 2.5X or lower were not considered, because in preliminary experiments model performance degraded due to the small sample size. For each slide in the training dataset, a bounding-box list was created first, containing both regular-grid, non-overlapping bounding boxes and random bounding boxes; the ratio of non-overlapping to random patches was 2:1. A bounding box was removed if less than 1% of its area lay inside the ROI. For every remaining bounding box, an image patch and a tumor mask patch were created simultaneously. For images in the validation and test datasets, only non-overlapping image patches were created. The file-naming rules for regular and random patches differ: the filename of a regular patch contains its row and column position on the WSI, so that during inference the predicted tumor mask of the whole slide can be reconstructed by combining the predicted mask patches. Details of data preprocessing can be found in the source code.
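The bounding-box generation described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function name, the 2:1 grid-to-random ratio, and the 1% ROI filter follow the text, while the variable names are our own.

```python
import numpy as np

PATCH = 512  # patch side length in pixels

def grid_and_random_boxes(height, width, roi_mask, min_roi_frac=0.01, rng=None):
    """Build the bounding-box list for one slide: regular non-overlapping
    grid boxes plus half as many random boxes (2:1 ratio), keeping only
    boxes whose ROI coverage is at least `min_roi_frac`. `roi_mask` is a
    binary array at the same resolution as the slide level in use."""
    rng = rng or np.random.default_rng(0)
    boxes = []
    # regular grid boxes; row/col are kept so filenames can encode position
    for r in range(height // PATCH):
        for c in range(width // PATCH):
            boxes.append((r * PATCH, c * PATCH, "grid", r, c))
    # random boxes: half the number of grid boxes (grid:random = 2:1)
    for _ in range(len(boxes) // 2):
        y = int(rng.integers(0, height - PATCH))
        x = int(rng.integers(0, width - PATCH))
        boxes.append((y, x, "random", None, None))
    # drop boxes containing less than 1% ROI area
    return [b for b in boxes
            if roi_mask[b[0]:b[0] + PATCH, b[1]:b[1] + PATCH].mean() >= min_roi_frac]
```

For a fully annotated 1024 × 1024 region this yields four grid boxes and two random boxes; with an empty ROI mask every box is filtered out.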
Neural networks
From the point of view of computer vision, lung cancer detection is usually treated as a semantic segmentation task. For semantic segmentation in medical image analysis, U-Net (17) and its variants are the most popular models. Ensemble learning works best with base models that are both highly accurate and diverse (18). To boost the performance of the ensemble model, two U-Net variants, U-Net and U-Net++ (19), with two different encoders, ResNet34 (20) and DenseNet121 (21), were selected as base models.
In preliminary experiments, U-Net and U-Net++ achieved comparable performance. However, the more complicated attention U-Net (22) and R2U-Net (23) not only consumed more GPU memory but also performed worse, so they were not adopted. With the same U-Net variant, choosing ResNet34 and DenseNet121 as encoders gave better, or at least equal, performance compared with other encoders such as MobileNetV3, EfficientNet, and Res2Net, so these two were chosen. During training, the encoders were initialized with ImageNet pre-trained weights.
Loss function
In the ACDC-LungHP challenge, the Dice similarity coefficient (DICE) was used as the evaluation metric, so it is natural to use the Dice loss function. However, with the Dice loss, sensitivity was markedly lower than specificity, which is undesirable for lung cancer screening. The weighted binary cross-entropy (BCE) loss can flexibly balance false positives and false negatives by setting different weights; moreover, the cross-entropy loss has smooth gradients. Every model was therefore trained independently with each of the two loss functions, i.e., the Dice loss (24, 25) and the weighted BCE loss. The mathematical formulae of the loss functions are:
$${\mathcal{L}}_{W\text{-}BCE}=-\left(\beta\, y\log\left(\hat{y}\right)+\left(1-y\right)\log\left(1-\hat{y}\right)\right)$$
$${\mathcal{L}}_{Dice}=1-\frac{2y\hat{y}+\epsilon }{y+\hat{y}+\epsilon }$$
Here \(y\) is the ground-truth label and \(\hat{y}\) is the predicted probability; because of label smoothing (26), \(y\) is not always exactly 0 or 1. The parameter β is the penalty weight for false negatives, and based on experimental results it was set to 4. In the Dice loss, \(|y|\) and \(|\hat{y}|\) are computed with a soft sum over pixels. ε is a small constant (1e-7) that avoids division by zero.
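The two losses can be sketched in PyTorch as follows. The function names are illustrative; the β = 4 default and the ε = 1e-7 smoothing term follow the text, and `y_hat` is assumed to already be a probability (e.g. after a sigmoid).

```python
import torch

def weighted_bce_loss(y_hat, y, beta=4.0, eps=1e-7):
    """Weighted BCE: beta penalizes false negatives (y = 1, y_hat small)
    more heavily. Probabilities are clamped away from 0/1 for stability."""
    y_hat = y_hat.clamp(eps, 1 - eps)
    return -(beta * y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()

def dice_loss(y_hat, y, eps=1e-7):
    """Soft Dice loss: |y| and |y_hat| are approximated by sums over pixels."""
    inter = (y * y_hat).sum()
    return 1 - (2 * inter + eps) / (y.sum() + y_hat.sum() + eps)
```

Note the asymmetry of the weighted BCE: a confident false negative costs roughly β times as much as an equally confident false positive, which is what pushes sensitivity up relative to the plain Dice loss.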
Training strategies
To enlarge the sample size and improve the generalizability of the model, image augmentation was used during training (27). Compared with augmenting images before training, the on-the-fly implementation not only saves time but is also more flexible. The augmentation operations included random horizontal and vertical flipping and random brightness and contrast modifications. The Albumentations (28) library and the PyTorch dataset class were used to implement real-time image augmentation. After augmentation, all pixel values were normalized to (0–1). Technical details about image augmentation can be found in the source code.
Adam (29) with Lookahead (30) (k = 5, alpha = 0.5) was used as the optimizer. Automatic mixed-precision training (31) was used to speed up training and inference and to save GPU memory. Label smoothing (ε = 0.1) was used to calibrate probabilities and improve generalization (32). The batch size was set to 64, and the number of epochs was set to 4, 7, and 10 for slide levels 0 (20X magnification), 1 (10X magnification), and 2 (5X magnification), respectively. The initial learning rate was set to 1e-4 and multiplied by a factor of gamma = 0.1 after every epoch until it reached 1e-6. Every model was trained 3 times under the same settings, and the model with the minimum validation loss was chosen as the best model. In our experiments, performance was insensitive to these hyper-parameters.
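The learning-rate schedule (decay by 0.1 each epoch, floored at 1e-6) can be expressed with a `LambdaLR`. This is a sketch: the Lookahead wrapper is not part of core PyTorch and comes from third-party packages, so it is omitted here, and the `Conv2d` stands in for a real U-Net variant.

```python
import torch

model = torch.nn.Conv2d(3, 1, 3, padding=1)  # stand-in for a U-Net variant

# Adam as the inner optimizer; in the paper it is additionally wrapped in
# Lookahead (k=5, alpha=0.5) from a third-party implementation.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Multiply the LR by 0.1 after every epoch, but never drop below 1e-6.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda epoch: max(0.1 ** epoch, 1e-6 / 1e-4))
```

Calling `scheduler.step()` once per epoch walks the rate 1e-4 → 1e-5 → 1e-6, after which the floor keeps it constant.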
Inference
An unweighted average was used to combine the results of multiple models. Lung cancer detection can be regarded as a pixel classification task, in which every pixel is classified as cancer or non-cancer. The prediction for each pixel is:
probj = \(\frac{\sum _{i=1}^{N}{W}_{i}\,{p}_{ij}}{\sum _{i=1}^{N}{W}_{i}}\)

resultj = \({prob}_{j}\ge threshold\)
The number of base models is denoted by N, Wi is the weight of model i, and pij is the predicted probability of model i for pixel j. For simplicity, instead of being learned by a meta-learner (33), Wi is set to 1 for all models (unweighted average). probj is the predicted cancer probability, which ranges from 0 to 1. The final predicted class of pixel j is denoted by resultj, where 1 stands for cancer and 0 for non-cancer. The default value of 0.5 was used as the threshold.
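The per-pixel formulae above vectorize naturally over whole probability maps. The following sketch (function name ours) implements the general weighted form; `weights=None` reproduces the unweighted average used in the paper.

```python
import numpy as np

def ensemble_predict(prob_maps, weights=None, threshold=0.5):
    """Weighted average of per-model probability maps followed by
    thresholding. prob_maps has shape (N, H, W); with weights=None all
    W_i = 1, i.e. the unweighted average."""
    prob_maps = np.asarray(prob_maps, dtype=np.float64)
    if weights is None:
        weights = np.ones(len(prob_maps))
    weights = np.asarray(weights, dtype=np.float64)
    # prob_j = sum_i W_i * p_ij / sum_i W_i, computed for every pixel j
    prob = np.tensordot(weights, prob_maps, axes=1) / weights.sum()
    return (prob >= threshold).astype(np.uint8)
```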
To improve training and inference speed, patches outside the ROI areas and background patches were removed during patch generation. During inference, the missing patches must therefore be added back so that the patches can be combined into a whole image: for every missing image patch, a predicted mask patch with all black pixels was created. The patch combination algorithm was implemented with the NumPy concatenate operation along the height and width dimensions. Please refer to the source code for more details.
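The reconstruction step can be sketched as below. The function name and the dictionary keyed by the (row, column) position encoded in the patch filenames are our own framing of the description above.

```python
import numpy as np

def assemble_mask(patches, n_rows, n_cols, patch=512):
    """Rebuild the whole-slide mask from predicted patches keyed by
    (row, col). Positions missing from `patches` (skipped background or
    non-ROI patches) are filled with all-black patches before stitching
    with np.concatenate along width, then height."""
    black = np.zeros((patch, patch), dtype=np.uint8)
    rows = []
    for r in range(n_rows):
        row = [patches.get((r, c), black) for c in range(n_cols)]
        rows.append(np.concatenate(row, axis=1))   # stitch along width
    return np.concatenate(rows, axis=0)            # stitch along height
```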
Performance metrics
Accuracy, sensitivity, specificity, and the DICE coefficient were used for performance evaluation (34, 35). Taking positive to mean cancer and negative to mean non-cancer, we use the standard notations TP, TN, FP, and FN for true/false positives and negatives.
Accuracy is the number of correct predictions divided by the total number of predictions. Sensitivity (true positive rate) is the probability of a positive test, conditioned on truly being positive. Specificity (true negative rate) is the probability of a negative test, conditioned on truly being negative. The Dice coefficient is equivalent to F1, the harmonic mean of precision and recall. All these metrics are bounded between 0 and 1 (perfect). The mathematical formulas are as follows:
Accuracy=\(\frac{TP+TN}{TP+TN+FP+FN}\)
Sensitivity =\(\frac{TP}{TP+FN}\)
Specificity =\(\frac{TN}{TN+FP}\)
Fβ=\(\frac{(1+{\beta }^{2})TP}{(1+{\beta }^{2})TP+{\beta }^{2}FN+FP}\)
Dice = F1=\(\frac{2TP}{2TP+FN+FP}\)
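The four metrics follow directly from pixel-level confusion counts; a minimal sketch (function name ours) for binary masks:

```python
import numpy as np

def segmentation_metrics(pred, target):
    """Pixel-level accuracy, sensitivity, specificity, and Dice computed
    from binary masks (1 = cancer, 0 = non-cancer)."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    tp = np.sum(pred & target)
    tn = np.sum(~pred & ~target)
    fp = np.sum(pred & ~target)
    fn = np.sum(~pred & target)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "dice": 2 * tp / (2 * tp + fn + fp),
    }
```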
Experimental settings
Hardware: 2 × Intel Xeon E5-2620 v4 CPUs, 256 GB memory, 2 × Nvidia RTX 3090 GPUs
Software: Ubuntu 20.04, CUDA 11.3, Anaconda 4.10.
Programming language and libraries: Python 3.8, PyTorch 1.10, Torchvision, OpenCV, NumPy, SciPy, scikit-learn, Matplotlib, Pandas, Albumentations, segmentation_models_pytorch, CuPy, OpenSlide-Python, tqdm. Detailed information about these software libraries can be found in the requirements.txt file of the source code.