Datasets
The study was registered at clinicaltrials.gov (NCT02024113), was approved by the Medical Ethics Review Committee of the Amsterdam UMC, and was registered in the Dutch trial register (trialregister.nl, NTR3508). All patients gave informed consent for study participation and for the use of their data for (retrospective) scientific research. Two datasets acquired at two institutions were included in this study; both followed the recommendations of the EARL accreditation program [19, 20]. Before segmentation, all images were converted to Standardized Uptake Value (SUV) units in order to normalize for differences in injected tracer dose and patient weight. The focus of this paper is the segmentation process and not lesion detection. Therefore, before segmentation, a large bounding box was drawn around every lesion, which also included a large number of non-tumor voxels, as illustrated in Fig. 1. The position of the bounding box was chosen randomly so that the tumor did not always appear in the middle of the box but at different locations within it. This step was performed to prevent the CNN from learning the location of the object instead of other, more important characteristics.
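The random placement of the bounding box can be illustrated with a minimal sketch. The function name `random_bounding_box`, the box size, and the uniform sampling of the box position are assumptions made for illustration; the text does not specify how the random position was drawn.

```python
import numpy as np

def random_bounding_box(lesion_mask, box_shape, rng):
    """Return slices of a box of `box_shape` that contains the whole lesion
    at a randomly chosen position (hypothetical helper, not the authors' code)."""
    coords = np.argwhere(lesion_mask)
    lo = coords.min(axis=0)        # first lesion voxel along each axis
    hi = coords.max(axis=0) + 1    # one past the last lesion voxel
    slices = []
    for axis in range(lesion_mask.ndim):
        size = box_shape[axis]
        dim = lesion_mask.shape[axis]
        if size > dim or size < hi[axis] - lo[axis]:
            raise ValueError("box does not fit the image or the lesion")
        # Valid start positions keep the lesion inside the box
        # and the box inside the image.
        start_min = max(0, hi[axis] - size)
        start_max = min(lo[axis], dim - size)
        start = int(rng.integers(start_min, start_max + 1))
        slices.append(slice(start, start + size))
    return tuple(slices)

# Synthetic example: a small lesion in a 20 x 20 x 20 image.
rng = np.random.default_rng(0)
mask = np.zeros((20, 20, 20), dtype=bool)
mask[5:8, 10:12, 3:6] = True
box = random_bounding_box(mask, (10, 10, 10), rng)
crop = mask[box]
```

Because the start position is sampled independently for each axis, the lesion can end up anywhere inside the box, including close to its border.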
Training and testing dataset
For training, validating, and testing the segmentation approaches, 96 images of patients with NSCLC Stage III - IV were included. Patients fasted for at least six hours before scan start and were scanned 60 minutes after tracer injection. All images were acquired on a Gemini TF Big Bore (Philips Healthcare, Cleveland, OH, USA). For attenuation correction, a low-dose CT was performed. All images were reconstructed to a voxel size of 4 × 4 × 4 mm using the vendor-provided BLOB-OS-TOF algorithm. More details about the patient cohort can be found in previous studies [21]. The images were split randomly into training, validation, and testing sets: 56 images (286 lesions) were used for training, 14 images (98 lesions) for validation, and 26 images (171 lesions) for independent testing.
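A random split at image level with the set sizes given above could be sketched as follows. The integer identifiers and the seed are placeholders; the actual randomization procedure is not described in the text.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed, chosen only for reproducibility
image_ids = np.arange(96)         # placeholder identifiers for the 96 images

# Shuffle once, then cut into 56 / 14 / 26 images.
shuffled = rng.permutation(image_ids)
train_ids, val_ids, test_ids = np.split(shuffled, [56, 56 + 14])
```

Splitting by image rather than by lesion keeps all lesions of one scan in the same subset.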
Test-Retest dataset
For a fully independent test-retest evaluation, ten PET/CT scans of patients with Stage III and IV NSCLC were analyzed. These ten patients underwent two whole-body PET/CT scans on two consecutive days. Images were acquired on a Gemini TF PET/CT scanner (Philips Healthcare, Cleveland, OH, USA) at a different institution (Amsterdam University Medical Center). Patient fasting time, time between tracer injection and scan start, reconstruction algorithm, and voxel size were the same as in the previously described dataset. A total of 28 lesions were included in the analysis.
Reference segmentations
The reference segmentations used for training, validating, and testing the algorithm were obtained by applying an automatic segmentation which identified all voxels with an SUV above 2.5 as tumor (hereafter SUV2.5). The segmentations were then manually adjusted by an expert medical physicist (RB) with more than twenty years of experience in PET tumor segmentation. This approach was chosen as it has been demonstrated that the manual adaptation of a (semi-)automatic segmentation is more robust than a purely manual segmentation [22].
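The automatic SUV2.5 step amounts to a fixed threshold on the SUV image. The sketch below uses a synthetic array and hypothetical function names; the voxel volume follows from the 4 × 4 × 4 mm voxel size stated earlier, and the manual adjustment is of course not reproduced.

```python
import numpy as np

VOXEL_VOLUME_ML = 4 * 4 * 4 / 1000.0   # 4 x 4 x 4 mm voxel = 0.064 ml

def suv_threshold_mask(suv_image, threshold=2.5):
    """Label every voxel with an SUV strictly above `threshold` as tumor."""
    return np.asarray(suv_image) > threshold

def mask_volume_ml(mask, voxel_volume_ml=VOXEL_VOLUME_ML):
    """Segmented volume in ml for a binary mask."""
    return float(np.count_nonzero(mask)) * voxel_volume_ml

# Synthetic SUV values, for illustration only.
suv = np.array([[1.0, 2.5, 3.0],
                [4.2, 0.5, 2.6]])
suv25_mask = suv_threshold_mask(suv)
suv25_volume = mask_volume_ml(suv25_mask)
```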
Segmentation Algorithm
All segmentation algorithms were implemented in Python 3.6 using the libraries keras and scikit-learn.
Convolutional Neural Network (CNN)
A 3D CNN following the U-Net architecture proposed by Ronneberger et al. [23] was implemented with the keras library. U-Net is one of the best-known and most frequently used CNN architectures for biomedical image segmentation, as it was specifically designed for scenarios in which only a small number of training examples is available. More details about the architecture and the configuration used can be found in the supplemental material.
In order to increase the amount of training data and to avoid over-fitting, data augmentation was performed. This included rotations between −20 and 20 degrees, shifts in the width and height directions of up to 20% of the side length, rescaling of the images by up to 25%, intensity stretching, and the addition of Gaussian noise to the image.
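The augmentation steps listed above could be sketched with scipy.ndimage as shown below. This is not the authors' keras pipeline: the rotation plane, the interpolation orders, the multiplicative form of the intensity stretching (factor 0.9 to 1.1), and the noise level are all assumptions made for illustration.

```python
import numpy as np
from scipy import ndimage

def _geometric(arr, angle, shift, scale, order):
    """Rotate in the (y, x) plane, shift, and rescale about the volume center."""
    out = ndimage.rotate(arr, angle, axes=(1, 2), reshape=False,
                         order=order, mode="nearest")
    out = ndimage.shift(out, shift, order=order, mode="nearest")
    center = (np.array(arr.shape) - 1) / 2.0
    inverse_scale = np.full(arr.ndim, 1.0 / scale)   # diagonal of the inverse map
    offset = center * (1.0 - 1.0 / scale)            # keeps the center fixed
    return ndimage.affine_transform(out, inverse_scale, offset=offset,
                                    order=order, mode="nearest")

def augment(image, mask, rng, noise_sd=0.05):
    """Return one randomly augmented (image, mask) pair (illustrative sketch)."""
    angle = rng.uniform(-20.0, 20.0)                       # degrees
    shift = (0.0,
             rng.uniform(-0.2, 0.2) * image.shape[1],      # height
             rng.uniform(-0.2, 0.2) * image.shape[2])      # width
    scale = rng.uniform(0.75, 1.25)                        # rescaling within 25%
    # The same geometric transform is applied to image and mask;
    # nearest-neighbor interpolation keeps the mask binary.
    aug_image = _geometric(image.astype(float), angle, shift, scale, order=1)
    aug_mask = _geometric(mask.astype(np.uint8), angle, shift, scale, order=0) > 0
    aug_image = aug_image * rng.uniform(0.9, 1.1)          # assumed intensity stretch
    aug_image = aug_image + rng.normal(0.0, noise_sd, size=aug_image.shape)
    return aug_image, aug_mask

# Synthetic example volume with a cubic "lesion".
rng = np.random.default_rng(1)
image = rng.random((16, 16, 16))
mask = np.zeros((16, 16, 16), dtype=bool)
mask[6:10, 6:10, 6:10] = True
aug_image, aug_mask = augment(image, mask, rng)
```

Intensity changes and noise are applied to the image only, since the label mask must stay consistent with the geometry but not with the intensities.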
For training, testing, and applying the CNN, the dataset was divided into smaller (≤ 12.8 ml) and bigger tumors. The threshold was chosen experimentally, as this value led to the best performance. One separate CNN was trained for each of the two size categories. The dataset was split by lesion size because this led to more accurate and repeatable segmentations (illustrated in supplemental material Sect. 2.1). For training, the tumor size was determined by calculating the volume of the ground truth mask. For testing and applying the CNN, an initial estimate of the tumor size was obtained from the majority vote (MV) segmentation of four established threshold approaches (see supplemental material, Sect. 3). The MV segmentation was chosen for this task because, in previous work, it yielded the most accurate segmentation when compared with manual segmentations [7] and because it is easy to implement.
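The size-based routing can be sketched as a majority vote over binary masks followed by a volume check. The number of agreeing methods required (`min_votes`) and the function names are assumptions; the four underlying threshold methods are described only in the supplemental material and are represented here by toy masks.

```python
import numpy as np

VOXEL_VOLUME_ML = 0.064      # 4 x 4 x 4 mm voxels
SIZE_THRESHOLD_ML = 12.8     # threshold between smaller and bigger tumors

def majority_vote(masks, min_votes):
    """Keep voxels that at least `min_votes` of the input masks label as tumor."""
    votes = np.sum(np.stack(masks).astype(int), axis=0)
    return votes >= min_votes

def select_network(initial_mask):
    """Choose which of the two CNNs to apply, based on the estimated volume."""
    volume_ml = np.count_nonzero(initial_mask) * VOXEL_VOLUME_ML
    return "small" if volume_ml <= SIZE_THRESHOLD_ML else "large"

# Toy 1D masks standing in for the four threshold-based segmentations.
m1 = np.array([1, 1, 0, 0], dtype=bool)
m2 = np.array([1, 1, 1, 0], dtype=bool)
m3 = np.array([1, 0, 1, 0], dtype=bool)
m4 = np.array([1, 0, 0, 0], dtype=bool)
mv_strict = majority_vote([m1, m2, m3, m4], min_votes=3)
mv_loose = majority_vote([m1, m2, m3, m4], min_votes=2)

small_lesion = np.ones((3, 3, 3), dtype=bool)      # 27 voxels, about 1.7 ml
large_lesion = np.ones((10, 10, 10), dtype=bool)   # 1000 voxels = 64 ml
```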
Textural feature segmentation (TF)
In this segmentation approach, textural features of voxel neighborhoods were used for the voxel-wise segmentation of the tumor. For every view (axial, sagittal, coronal), a separate segmentation was performed, and the majority vote of the three views was regarded as the final segmentation. The workflow of the TF segmentation for one view is illustrated in Fig. 2. As illustrated, every voxel was regarded as the center of a scanning window. For each scanning window, statistical and textural features were calculated using the open-source software pyradiomics [24]. The feature space was then reduced by selecting the features most important for the segmentation task, as identified by a random forest.
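Feature selection by random forest importance could look like the following sketch with scikit-learn. The feature matrix is synthetic, and the number of trees and the number of retained features (`n_keep`) are placeholders; the actual features come from pyradiomics and the actual selection criterion is described in the supplemental material.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: one row per scanning window, one column per feature.
rng = np.random.default_rng(0)
n_windows, n_features = 400, 12
features = rng.normal(size=(n_windows, n_features))
labels = (features[:, 0] + 0.5 * features[:, 3] > 0).astype(int)  # toy tumor labels

# Fit a forest and rank the features by impurity-based importance.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(features, labels)
ranking = np.argsort(forest.feature_importances_)[::-1]

n_keep = 4                                   # hypothetical number of retained features
selected = ranking[:n_keep]
reduced_features = features[:, selected]     # reduced feature space
```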
Next, a random forest classifier was trained to classify each voxel as tumor or non-tumor. The trained random forest was then applied to the testing dataset. The probability images of the three orientations were combined in order to obtain the final classification. A probability image contains information about how certain the classifier is that it made the right decision. All voxels with a summed probability of more than 1.8 were included in the tumor mask. A more detailed description of the algorithm can be found in the supplemental material and in Pfaehler et al. [25].
In order to evaluate how well the AI-based segmentations match the reference segmentation that was used for training, the segmentation results and the reference segmentation were compared in terms of accuracy.
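The combination of the three probability images reduces to a voxel-wise sum followed by the 1.8 threshold. The sketch below uses made-up probabilities and a hypothetical function name.

```python
import numpy as np

def combine_views(p_axial, p_sagittal, p_coronal, threshold=1.8):
    """Sum the per-view tumor probabilities and keep voxels above `threshold`."""
    summed = np.asarray(p_axial) + np.asarray(p_sagittal) + np.asarray(p_coronal)
    return summed > threshold

# Made-up tumor probabilities for three voxels in each view.
p_ax = np.array([0.9, 0.5, 0.2])
p_sag = np.array([0.8, 0.6, 0.1])
p_cor = np.array([0.7, 0.6, 0.3])
tumor_mask = combine_views(p_ax, p_sag, p_cor)   # sums: 2.4, 1.7, 0.6
```

With a maximum possible sum of 3, a threshold of 1.8 corresponds to a mean tumor probability above 0.6 across the three views.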
Conventional segmentation algorithm
The repeatability of the AI-based segmentations was compared with that of two established segmentation algorithms:
Moreover, two majority vote (MV) approaches combining four frequently used thresholding approaches were included in the comparison. Both MV approaches have been demonstrated in previous work to be more repeatable than conventional approaches. The underlying segmentation algorithms are explained in the supplemental Sect. 3 and are also described in previous work [7]. The two MV segmentation methods include:
Evaluation of segmentation algorithms
For the evaluation of the segmentation algorithms, several metrics were combined. The data analysis was performed in Python 3.6.2 using the packages numpy and scipy.
Accordance of AI segmentation and reference segmentation
In order to determine the accordance of the AI and reference segmentations, the Jaccard Coefficient (JC) was calculated. The JC is defined as the ratio between the size of the intersection and the size of the union of two labels A and B and gives an indication of their overlap:
$$\text{JC}=\frac{|\text{A}\cap \text{B}|}{|\text{A}\cup \text{B}|}$$
A JC of 1 indicates perfect overlap, while a JC of 0 indicates that there is no overlap at all.
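For binary masks, the JC can be computed directly with numpy, as in the sketch below. Returning 1 for two empty masks is a convention assumed here; the text does not cover that case.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard Coefficient of two binary masks of equal shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0   # assumed convention: two empty masks overlap perfectly
    return float(np.logical_and(a, b).sum() / union)

seg = np.array([1, 1, 0, 0])
ref = np.array([0, 1, 1, 0])
jc = jaccard(seg, ref)   # intersection of 1 voxel, union of 3 voxels
```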
Furthermore, as the JC does not contain information about volume differences, the percentage MATV difference between the obtained and the reference segmentation was calculated, expressed as the ratio \(\frac{{MATV}_{SEGM}}{{MATV}_{REF}}\). A value above 1 indicates an over-estimation and a value below 1 an under-estimation of the volume; a value of 1 indicates identical volumes. Finally, the distance between the centers of mass (barycenter distance) of the two segmentations was calculated. A barycenter distance close to 0 indicates good agreement.
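Both measures can be sketched for binary masks as follows. The barycenter is computed here as the unweighted mean voxel position and converted to mm with the 4 mm voxel size; whether the original analysis weighted voxels by SUV is not stated, so this is an assumption.

```python
import numpy as np

def volume_ratio(seg, ref):
    """MATV of the segmentation divided by MATV of the reference."""
    return float(np.count_nonzero(seg) / np.count_nonzero(ref))

def barycenter_distance_mm(seg, ref, voxel_size_mm=(4.0, 4.0, 4.0)):
    """Euclidean distance between the (unweighted) centers of mass in mm."""
    c_seg = np.argwhere(seg).mean(axis=0)
    c_ref = np.argwhere(ref).mean(axis=0)
    return float(np.linalg.norm((c_seg - c_ref) * np.asarray(voxel_size_mm)))

# Toy example: same-sized cube shifted by one voxel along the first axis.
ref_mask = np.zeros((8, 8, 8), dtype=bool)
ref_mask[2:4, 2:4, 2:4] = True
seg_mask = np.zeros((8, 8, 8), dtype=bool)
seg_mask[3:5, 2:4, 2:4] = True

ratio = volume_ratio(seg_mask, ref_mask)
distance = barycenter_distance_mm(seg_mask, ref_mask)
```

The toy example shows why the measures complement each other: the volume ratio is 1 although the masks are displaced by 4 mm.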
Repeatability evaluation
The repeatability of the segmentation approaches was evaluated by comparing the differences of segmented volume across days. For this purpose, the percentage Test-Retest difference (%TRT) was calculated:
The %TRT gives a measure of the proportional differences in segmented volume between the two consecutive scans. Moreover, the repeatability coefficient (RC), which is defined as 1.96 × standard deviation(%TRT), was calculated. Additionally, intraclass correlation coefficients (ICC) were calculated using a two-way mixed model with single measures checking for agreement. An ICC between 0.9 and 1 indicates excellent and an ICC between 0.75 and 0.9 indicates good repeatability [26]. If a lesion was completely missed by one segmentation approach, it was discarded from the analysis in order to analyze the same dataset for all segmentation approaches.
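The repeatability metrics can be sketched in numpy as shown below. Note the assumptions: the %TRT is taken here as the difference between the two days divided by their mean, times 100, which is a common definition but is not confirmed by the text; the sample standard deviation (ddof = 1) is used for the RC; and the ICC follows the standard single-measures, absolute-agreement formula for a two-way model.

```python
import numpy as np

def percent_trt(v_day1, v_day2):
    """Assumed definition: (day2 - day1) / mean(day1, day2) * 100."""
    v1 = np.asarray(v_day1, dtype=float)
    v2 = np.asarray(v_day2, dtype=float)
    return (v2 - v1) / ((v1 + v2) / 2.0) * 100.0

def repeatability_coefficient(trt):
    """RC = 1.96 x standard deviation of %TRT (sample SD assumed)."""
    return 1.96 * float(np.std(trt, ddof=1))

def icc_agreement_single(data):
    """Two-way ICC, single measures, absolute agreement.
    `data` has one row per lesion and one column per scan day."""
    x = np.asarray(data, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    col_means = x.mean(axis=0)
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between lesions
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between days
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))         # residual
    return float((msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n))

# Made-up volumes (ml) on two days, for illustration only.
trt = percent_trt([10.0, 20.0], [12.0, 20.0])
rc = repeatability_coefficient(trt)
icc_identical = icc_agreement_single([[1, 1], [2, 2], [3, 3]])
icc_offset = icc_agreement_single([[1, 2], [2, 3], [3, 4]])
```

The second ICC example has a constant offset between the days: the agreement ICC drops to 2/3 although the ranking of the lesions is preserved, which is the behavior expected when checking for agreement rather than consistency.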
The accuracy metrics of the AI-based segmentations as well as the %TRT of all approaches were compared using the Friedman test. The Friedman test is a non-parametric test which does not assume a normal distribution of the data or independence of observations. It compares the rank of each data point instead of only comparing mean or median values. This means that if a segmentation algorithm consistently yields more accurate results, it will be ranked higher even though its mean or median might be lower. As the Friedman test only indicates whether there was a significant difference in the data, a Nemenyi test was performed in order to assess which methods differed significantly. P-values below 0.01 were considered statistically significant. A Benjamini-Hochberg correction was applied in order to correct for multiple comparisons.
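A minimal sketch of the Friedman test with scipy and of a Benjamini-Hochberg adjustment in numpy is given below. The data are random numbers, not study results; the number of methods is arbitrary; and the Nemenyi post hoc test is not shown because it is not part of scipy.

```python
import numpy as np
from scipy import stats

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

# Synthetic %TRT values: 28 lesions (rows) x 3 hypothetical methods (columns).
rng = np.random.default_rng(0)
trt_values = rng.normal(loc=0.0, scale=10.0, size=(28, 3))

# Friedman test on the paired measurements, one sample per method.
statistic, p_value = stats.friedmanchisquare(*trt_values.T)

# Adjust a set of made-up pairwise p-values for multiple comparisons.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
significant = adjusted < 0.01   # significance level used in the text
```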