Datasets
The study was registered at clinicaltrials.gov (NCT02024113), approved by the Medical Ethics Review Committee of the Amsterdam UMC, and registered in the Dutch trial register (trialregister.nl, NTR3508). All patients gave informed consent for study participation and for the use of their data in (retrospective) scientific research. Two datasets acquired at two institutions were included in this study; both datasets followed the recommendations of the EARL accreditation program [19, 20]. All images were converted to Standardized Uptake Value (SUV) units before segmentation in order to normalize for differences in injected tracer dose and patient weight. This paper focuses on the segmentation process, not on lesion detection. Therefore, before segmentation, a large bounding box was drawn around every lesion, also including a large number of non-tumor voxels, as illustrated in Figure 1. The bounding box was placed randomly, so that the tumor did not always appear in the middle but at different locations in the box. This step was performed to prevent the CNN from learning the location of the object instead of other, more important characteristics. As a CNN requires all inputs to have the same size, each bounding box had a size of 64 x 64 x 64 voxels.
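The random box placement could be implemented as in the following sketch; function and variable names are illustrative, and it is assumed that every lesion fits inside the 64 x 64 x 64 box:

```python
import numpy as np

def random_crop_around_lesion(suv_image, lesion_mask, box_size=64, rng=None):
    """Cut a box_size^3 crop that contains the whole lesion but places it
    at a random offset, so the lesion is not always centered in the box."""
    rng = np.random.default_rng() if rng is None else rng
    coords = np.argwhere(lesion_mask)                  # voxel indices of the lesion
    lo, hi = coords.min(axis=0), coords.max(axis=0) + 1
    start = []
    for axis in range(3):
        # Any start position that keeps the lesion inside the box
        # and the box inside the image (assumes the lesion fits).
        min_start = max(0, hi[axis] - box_size)
        max_start = min(lo[axis], suv_image.shape[axis] - box_size)
        start.append(rng.integers(min_start, max_start + 1))
    slices = tuple(slice(s, s + box_size) for s in start)
    return suv_image[slices], lesion_mask[slices]
```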
Training and testing dataset
For training, validating, and testing the segmentation approaches, 96 images of patients with Stage III-IV NSCLC were included. Patients fasted for at least six hours before the scan and were scanned 60 minutes after tracer injection. All images were acquired on a Gemini TF Big Bore (Philips Healthcare, Cleveland, OH, USA). For attenuation correction, a low-dose CT was performed. All images were reconstructed to a voxel size of 4 x 4 x 4 mm using the vendor-provided BLOB-OS-TOF algorithm. More details about the patient cohort can be found in previous studies [21]. Five-fold cross-validation was performed, in which 70% of the images were randomly assigned to training, 10% to validation, and 20% to independent testing.
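One way to realize this split is sketched below with scikit-learn; the exact assignment procedure and the random seeds are assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

patients = np.arange(96)                              # one index per patient image
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_val_idx, test_idx in kf.split(patients):    # 80% train+val / 20% test
    # Split the remaining 80% into 70% training and 10% validation,
    # i.e. 7/8 and 1/8 of the train+validation pool.
    train_idx, val_idx = train_test_split(train_val_idx, test_size=0.125,
                                          random_state=42)
    print(len(train_idx), len(val_idx), len(test_idx))
```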
Test-Retest dataset
For a fully independent test-retest evaluation, ten PET/CT scans of patients with Stage III and IV NSCLC were analyzed. These ten patients underwent two whole-body PET/CT scans on two consecutive days. Images were acquired on a Gemini TF PET/CT scanner (Philips Healthcare, Cleveland, OH, USA) at a different institution (Amsterdam University Medical Center). Patient fasting time, the interval between tracer injection and scan start, as well as the reconstruction algorithm and voxel size were the same as in the previously described dataset. A total of 28 lesions were included in the analysis.
Reference segmentations
The reference segmentations used for training, validating, and testing the algorithms were obtained by applying an automatic segmentation that identified all voxels with an SUV above 2.5 as tumor (hereafter SUV2.5). The segmentations were then manually adjusted by an expert medical physicist (RB) with more than twenty years of experience in PET tumor segmentation. This approach was chosen because manual adaptation of a (semi-)automatic algorithm has been demonstrated to be more robust than a purely manual segmentation [22].
Segmentation Algorithms
All segmentation algorithms were implemented in Python 3.6 using the libraries keras and scikit-learn.
Convolutional Neural Network (CNN)
A 3D CNN following the U-Net architecture proposed by Ronneberger et al. [23] was implemented with the keras library. U-Net is one of the best-known and most frequently used CNN architectures for biomedical image segmentation, and it was specifically designed for scenarios in which only a small number of training examples is available. An illustration of the architecture used is displayed in Figure 2. A U-Net consists of an encoding and a decoding part. In the encoding part, the images are successively down-sampled while the number of features is increased; in the decoding part, the images are up-sampled while the number of features decreases. Both parts consist of three layers, each comprising one convolutional block (two convolutional layers with a kernel size of 5, each followed by a Rectified Linear Unit (ReLU) activation), a max-pooling layer for down-sampling in the encoding part or a convolutional up-sampling layer in the decoding part, a batch normalization layer to improve network convergence, and a drop-out layer to avoid overfitting. Due to the relatively small dataset, the CNN was trained with 8 initial features in the first layer. The number of layers and initial features were determined iteratively until the validation accuracy was optimal and at the same time comparable to the accuracy on the training set; the latter is important because a large difference between training and validation accuracy is an indication of overfitting. Details about training and validation accuracy for different numbers of initial features can be found in supplemental tables S1 and S2. The CNN was trained for 1000 epochs with a batch size of 25. The learning rate was set to 0.001 and an Adam optimizer was used for weight adaptation. The negative Dice coefficient, which measures the overlap of two segmentations, was used as loss function (see the sketch below); a Dice coefficient of 1 reflects perfect overlap. Note that a U-Net requires all input images to have the same size, which motivated the fixed 64 x 64 x 64 bounding boxes described above.
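The loss function can be written compactly in keras; the smoothing constant added for numerical stability is an assumption:

```python
import tensorflow as tf
from tensorflow.keras import backend as K

def dice_coefficient(y_true, y_pred, smooth=1.0):
    # Dice = 2 * |A ∩ B| / (|A| + |B|); 1 reflects perfect overlap.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_loss(y_true, y_pred):
    # Training minimizes the negative Dice coefficient.
    return -dice_coefficient(y_true, y_pred)

# Usage with the settings from the text:
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
#               loss=dice_loss)
```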
To increase the amount of training data and to avoid overfitting, data augmentation was performed. This included rotations between -20 and 20 degrees, shifts in the width and height directions of up to 20% of the side length, rescaling of the images by up to 25%, intensity stretching, and the addition of Gaussian noise.
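Part of this augmentation pipeline could be implemented with scipy.ndimage as sketched below; the interpolation orders, the rotation plane, and the noise level are assumptions, and rescaling and intensity stretching are omitted for brevity:

```python
import numpy as np
from scipy import ndimage

def augment(volume, mask, rng):
    """Randomly rotate, shift and add noise to one 3D training sample."""
    angle = rng.uniform(-20, 20)                       # rotation in degrees
    volume = ndimage.rotate(volume, angle, axes=(1, 2), reshape=False, order=1)
    mask = ndimage.rotate(mask, angle, axes=(1, 2), reshape=False, order=0)
    shift = rng.uniform(-0.2, 0.2, size=2) * volume.shape[1]  # <= 20% of side length
    volume = ndimage.shift(volume, (0, *shift), order=1)
    mask = ndimage.shift(mask, (0, *shift), order=0)
    volume = volume + rng.normal(0.0, 0.05, volume.shape)     # Gaussian noise
    return volume, mask
```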
For training, testing, and applying the CNN, the dataset was divided into smaller (<= 12.8 ml) and bigger tumors. This threshold was chosen empirically, as it led to the best performance. For each tumor size category, a separate CNN was trained. Splitting the dataset by lesion size was performed because it led to more accurate and repeatable segmentations (illustrated in supplemental material, section 4). To train the two separate networks, lesions were assigned to a size category based on the volume of the ground-truth mask and used to train the corresponding CNN. After training and testing the CNNs, the appropriate CNN for a specific lesion was selected based on an initial estimate of the tumor size. This estimate was obtained using a majority vote (MV) segmentation, which uses four standard threshold approaches as input (see the explanation below and supplemental material, section 5). The MV segmentation was chosen for this task because it provided the most accurate segmentations when compared with manual segmentations in previous work [7] and is easy to implement. This initial MV segmentation was performed only to select the corresponding CNN, i.e., to distinguish between smaller and bigger lesions, as in the sketch below.
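The size-based model selection could look as follows, assuming a majority_vote_segmentation helper such as the one outlined in the conventional-segmentation section below; function names are illustrative:

```python
def select_cnn(suv_image, small_cnn, large_cnn, voxel_volume_ml=0.064):
    """Pick the size-specific U-Net from an initial majority-vote estimate.
    majority_vote_segmentation() stands in for the MV approach sketched below;
    0.064 ml corresponds to the 4 x 4 x 4 mm voxels used in this study."""
    mv_mask = majority_vote_segmentation(suv_image)
    volume_ml = mv_mask.sum() * voxel_volume_ml
    return small_cnn if volume_ml <= 12.8 else large_cnn
```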
Textural feature segmentation (TF)
In the TF segmentation approach, textural features of voxel neighborhoods were used for the voxel-wise segmentation of the tumor. A separate segmentation was performed for every view (axial, sagittal, coronal), and the summed probability was used to generate the final segmentation. The workflow of the TF segmentation for one image view is illustrated in Figure 3. As illustrated, every voxel was regarded as the center of a scanning window. For each scanning window, statistical and textural features were calculated using the open-source software pyradiomics [24]. The feature space was then reduced by selecting the most important features for the segmentation task, which were identified with a random forest.
Next, a random forest classifier was trained to classify each voxel as tumor or non-tumor, and the trained random forest was then applied to the testing dataset. The probability images of the three orientations, which contain information on the certainty of the classifier's decisions, were summed to obtain the final classification. All voxels with a summed probability of more than 1.8 were included in the final tumor segmentation, as in the sketch below. A more detailed description of the algorithm can be found in the supplemental material and in Pfaehler et al. [25].
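The per-view training and probability fusion could be set up as follows; the pyradiomics feature extraction is omitted, and the number of trees as well as the variable names are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def tf_segmentation(train_features, train_labels, test_features, threshold=1.8):
    """Train one random forest per view on scanning-window features
    (pyradiomics feature extraction omitted) and fuse the probabilities.
    train_features/test_features map each view to a (n_voxels, n_features) array."""
    summed = 0.0
    for view in ("axial", "sagittal", "coronal"):
        forest = RandomForestClassifier(n_estimators=100, random_state=0)
        forest.fit(train_features[view], train_labels)        # one row per voxel
        summed = summed + forest.predict_proba(test_features[view])[:, 1]
    return summed > threshold        # summed tumor probability > 1.8
```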
To evaluate how well the AI-based segmentations matched the reference segmentation, the overlap between the AI-based and reference segmentations was analyzed using the Jaccard coefficient, as explained later.
Conventional segmentation algorithms
The repeatability of the AI-based segmentations was compared with that of two established segmentation algorithms:
- 41%SUVMAX: all voxels with intensity values higher than 41% of the maximum SUV (SUVMAX) are regarded as tumor
- SUV4: all voxels with a SUV higher than 4 are included in the segmentation
Moreover, two majority vote (MV) approaches combining four frequently used thresholding approaches were included in the comparison. Both MV approaches were previously found to be more repeatable than conventional approaches [7]. The four underlying segmentation algorithms were the SUV4 and 41%SUVMAX methods described above, a segmentation including all voxels with an SUV above 2.5, and a 50%-of-SUVMAX threshold-based segmentation with background correction. The two MV segmentation methods are (see the combined sketch after this list):
- MV2: the consensus of at least two of the four standard approaches
- MV3: the consensus of at least three of the four standard approaches
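A minimal sketch of how the four thresholds and the majority vote could be combined is shown below; the background correction of the 50%SUVMAX method is omitted for brevity, so this is an illustration rather than the exact implementation:

```python
import numpy as np

def majority_vote_segmentation(suv, votes_needed=2):
    """Combine four standard threshold segmentations by majority vote.
    votes_needed=2 corresponds to MV2, votes_needed=3 to MV3."""
    suv_max = suv.max()
    masks = [
        suv > 0.41 * suv_max,    # 41%SUVMAX
        suv > 4.0,               # SUV4
        suv > 2.5,               # SUV2.5
        suv > 0.50 * suv_max,    # 50%SUVMAX (background correction omitted here)
    ]
    votes = np.sum(masks, axis=0)
    return votes >= votes_needed
```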
Evaluation of Segmentation Algorithms
For the evaluation of the segmentation algorithms, several metrics were calculated. Data analysis was performed in Python 3.6.2 using the packages numpy and scipy.
Agreement of AI and reference segmentations
In order to determine the agreement between the AI and reference segmentations, the Jaccard coefficient (JC) was calculated. The JC is defined as the ratio between the intersection and the union of two labels A and B and thus quantifies their overlap:

JC = |A ∩ B| / |A ∪ B|
A JC of 1 indicates perfect overlap, while a JC of 0 indicates that there is no overlap at all.
Furthermore, as the JC does not contain information about volume differences, the ratio between the volumes of the AI and reference segmentations was calculated:

volume ratio = V_AI / V_reference
A volume ratio above 1 indicates an over-estimation and a volume ratio below 1 an under-estimation of the volume; a ratio of 1 represents perfect agreement. Finally, the distance between the centers of mass of the segmentations (barycenter distance) was calculated, whereby a barycenter distance close to 0 indicates perfect agreement.
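The three agreement metrics can be computed directly from the binary masks, as in the following sketch; function names are illustrative, and the 4 mm isotropic voxel size is taken from the reconstruction settings above:

```python
import numpy as np
from scipy import ndimage

def jaccard(a, b):
    # JC = |A ∩ B| / |A ∪ B|
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def volume_ratio(ai_mask, ref_mask):
    # > 1: over-estimation, < 1: under-estimation of the volume
    return ai_mask.sum() / ref_mask.sum()

def barycenter_distance(a, b, voxel_size_mm=4.0):
    # Euclidean distance between the centers of mass of the two masks.
    ca = np.array(ndimage.center_of_mass(a))
    cb = np.array(ndimage.center_of_mass(b))
    return np.linalg.norm(ca - cb) * voxel_size_mm
```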
Repeatability evaluation
The repeatability of the segmentation approaches was evaluated by comparing the differences in segmented volumes across days. For this purpose, the percentage test-retest difference (%TRT) was calculated relative to the mean volume of the two scans:

%TRT = 100 × (V_retest − V_test) / ((V_test + V_retest) / 2)
The %TRT measures the proportional difference in segmented volume between the two consecutive scans. Moreover, the repeatability coefficient (RC), defined as 1.96 × standard deviation(%TRT), was calculated. Additionally, intraclass correlation coefficients (ICC) were calculated using a two-way mixed model with single measures, checking for agreement. An ICC between 0.9 and 1 indicates excellent repeatability and an ICC between 0.75 and 0.9 good repeatability [26]. If a lesion was completely missed by one segmentation approach, it was discarded from the analysis so that the same set of lesions was analyzed for all segmentation approaches. A sketch of the %TRT and RC computations is given below.
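A minimal numpy sketch of these repeatability measures follows; the use of the sample standard deviation (ddof=1) is an assumption, and the ICC computation is omitted:

```python
import numpy as np

def trt_percent(vol_test, vol_retest):
    """Percentage test-retest difference, relative to the mean of both scans."""
    mean_vol = (vol_test + vol_retest) / 2.0
    return 100.0 * (vol_retest - vol_test) / mean_vol

def repeatability_coefficient(trt_values):
    """RC = 1.96 x standard deviation of the %TRT values."""
    return 1.96 * np.std(trt_values, ddof=1)
```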
The accuracy metrics of the AI-based segmentations as well as the %TRT of all approaches were compared using the Friedman test. The Friedman test is a non-parametric test that assumes neither a normal distribution of the data nor independence of observations. It compares the rank of each data point instead of only comparing mean or median values; this means that a segmentation algorithm providing consistently more accurate results is ranked higher even if its mean or median is lower. As the Friedman test only indicates that a significant difference exists somewhere in the data, a post-hoc Nemenyi test was performed to assess which methods differed significantly. P-values below 0.01 were considered statistically significant. A Benjamini-Hochberg correction was applied to correct for multiple comparisons. The sketch below illustrates how this comparison could be set up.
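The following sketch uses scipy for the Friedman test; the input array layout and the use of the external scikit-posthocs and statsmodels packages for the post-hoc steps are assumptions:

```python
import numpy as np
from scipy import stats

def compare_methods(values_by_method):
    """Friedman test on an (n_lesions, n_methods) array of metric values."""
    statistic, p_value = stats.friedmanchisquare(*values_by_method.T)
    return statistic, p_value

# Post-hoc pairwise comparisons, e.g. with the external scikit-posthocs package:
# import scikit_posthocs as sp
# pairwise_p = sp.posthoc_nemenyi_friedman(values_by_method)
# Benjamini-Hochberg correction, e.g. with statsmodels:
# from statsmodels.stats.multitest import multipletests
# rejected, p_adj, _, _ = multipletests(p_values, alpha=0.01, method="fdr_bh")
```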