Datasets
In this study we used two different datasets of baseline DLBCL 18F-FDG PET/CT scans: the HOVON-84 dataset [16] was used to train the CNN model, whereas the PETAL dataset [17] was used for external validation of the model's performance. After correction for IPI, there were no significant differences in survival between the PETAL and HOVON-84 studies [18]. Both studies were approved by institutional review boards and all included patients provided written informed consent for the use of their data. The use of all data within the PETRA imaging database was approved by the institutional review board of the VU University Medical Center (JR /20140414).
HOVON-84. Three hundred seventy-three DLBCL patients who underwent baseline 18F-FDG PET/CT in the HOVON-84 trial (EudraCT 2006-005174-42) were considered for this study. The main inclusion/exclusion criteria of this trial can be found elsewhere [16]. Of these, 317 diagnosed DLBCL patients were eligible for this analysis; missing essential DICOM (Digital Imaging and Communications in Medicine) information was the main reason for exclusion. Other reasons included: quality control (QC) outside of range, incomplete whole-body scans, no FDG-avid lesions, or plasma glucose out of range. Furthermore, 7 patients were lost to follow-up within 2 years and 14 other patients died within 2 years of unrelated causes. This led to a total of 296 DLBCL patients included in the study, of whom 244 were classified as TTP0 and 52 as TTP1.
PETAL. The external validation was performed using diagnosed DLBCL patients from the PETAL trial (EudraCT 2006-001641-33) who underwent baseline 18F-FDG PET/CT. The eligibility criteria for the PETAL trial are described elsewhere [17]. Initially, the trial included 1098 patients. Patients were excluded for the following reasons: a diagnosis other than DLBCL, incomplete scans or scans with artefacts, missing DICOM information, QC or glucose out of range, or no FDG-avid lesions. This led to a total of 395 DLBCL patients with associated baseline 18F-FDG PET/CT scans available for this study. Of these, 12 underwent a treatment other than R-CHOP, 24 were lost to follow-up within 2 years, and 19 died without progression, leaving a total of 340 patients, of whom 279 were classified as TTP0 and 61 as TTP1.
Quality Control of scans. The participating sites provided the scans in DICOM format, and the scans were subsequently anonymised. For QC we used the criteria described in the EANM guidelines: the mean standardized uptake value (SUVmean) of the liver should be between 1.3 and 3.0 and the plasma glucose level below 11 mmol/L [3]. The QC criteria are described in detail elsewhere [4].
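As a minimal illustration, the two numeric QC criteria above can be expressed as a simple check (a sketch; the function and variable names are hypothetical):

```python
def passes_qc(liver_suv_mean: float, plasma_glucose_mmol_l: float) -> bool:
    """EANM-based QC check described in the text: liver SUVmean must lie
    in [1.3, 3.0] and plasma glucose must be below 11 mmol/L."""
    return 1.3 <= liver_suv_mean <= 3.0 and plasma_glucose_mmol_l < 11.0
```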
Maximum Intensity Projections
The ACCURATE tool was used to obtain so-called lesion masks, i.e. images that contain only the lymphoma tumour segmentation(s) [19]. Tumours were segmented using a SUV threshold of 4.0 [20], and any physiological uptake adjacent to tumours was manually deleted. The conversion to MIP images was performed using an in-house developed CNN preprocessing tool, which generates coronal and sagittal MIPs of size 275 × 200 × 1 with a pixel size of 4 × 4 mm. MIPs were generated for the (binary) lesion masks, for lesion MIPs (MIPs containing only tumours) and for the complete PET scans. Examples of these coronal and sagittal MIPs can be found in Fig. 1. The MIPs were normalized by a fixed maximum intensity value (SUV = 40); values above this maximum were truncated to prevent the normalization from being driven by the SUV of high-uptake organs such as the bladder.
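A minimal sketch of this normalization step, assuming the MIPs are available as NumPy arrays in SUV units (function and variable names are hypothetical):

```python
import numpy as np

SUV_MAX = 40.0  # fixed normalization ceiling described in the text

def normalize_mip(mip_suv: np.ndarray) -> np.ndarray:
    """Truncate intensities at SUV 40 and scale to [0, 1], so that
    high-uptake organs (e.g. the bladder) do not drive the normalization."""
    return np.clip(mip_suv, 0.0, SUV_MAX) / SUV_MAX
```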
Data sampling scheme
Since the classes in the training dataset were highly unbalanced (244 TTP0 vs 52 TTP1), we applied a data sampling scheme in which the TTP0 patients were divided into 5 equally stratified data subsets: 4 subsets of 49 patients (subsets A-D) and 1 subset of 48 patients (subset E). Additionally, 3 randomly selected TTP0 patients belonging to other subsets were added to each subset in order to reach a total of 52 TTP0s per subset (51 for subset E), matching the total number of TTP1s. Eventually, each subset contained a total of 104 patients with a 50% prevalence of each class (103 for subset E). The details of this scheme can be found in Fig. 2. Each subset (A to E) was trained using 5-fold cross validation (for each cross-validation run, the data were split into a training set (80%) and an internal validation set (20%)).
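A sketch of this sampling scheme, assuming patient identifiers are held in NumPy arrays (`ttp0_ids` with 244 entries and `ttp1_ids` with 52; the names and seed are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)              # seed is an assumption
shuffled = rng.permutation(ttp0_ids)         # 244 TTP0 patient IDs
ttp0_subsets = np.array_split(shuffled, 5)   # sizes 49, 49, 49, 49, 48

balanced_subsets = []
for i, subset in enumerate(ttp0_subsets):
    # borrow 3 random TTP0s from the other subsets to reach 52 (51 for E)
    others = np.concatenate([s for j, s in enumerate(ttp0_subsets) if j != i])
    extra = rng.choice(others, size=3, replace=False)
    # each balanced subset pairs its TTP0s with all 52 TTP1 patients
    balanced_subsets.append(np.concatenate([subset, extra, ttp1_ids]))
```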
Convolutional Neural Network
The CNN consists of two branches: one receives the coronal MIP as input and the other the sagittal MIP, and their outputs are merged in a final step to yield the prediction. Coronal and sagittal MIPs are thus analysed independently but in parallel by identical multi-layer architectures. The design consists of 4 convolution layers. In a CNN, a convolution layer applies different filters over the image to extract low-level features (e.g. edges, gradients) in earlier layers and high-level features in deeper layers. In our CNN, the number of feature maps doubles at each convolution layer, starting at 16 in the first layer and increasing to 128 in the last, while their spatial dimensions progressively decrease through pooling. Each convolution layer uses the rectified linear unit (ReLU) activation function and is followed by a spatial dropout of 0.35, meaning that 35% of the feature maps are randomly dropped during training in order to prevent overfitting. After the dropout of each of the first three convolution layers, a MaxPooling layer performs dimensionality reduction; these 3 MaxPooling layers have pool sizes of (3,3), (3,3) and (2,2). After the last convolution and SpatialDropout layer, the CNN is connected to a GlobalAveragePooling2D (GAP2D) layer, also known as a 'flattening' layer, which 'flattens' the output of the convolution layers into a less complex shape (i.e. a 2D tensor). The coronal and sagittal outputs are then concatenated and passed to the final dense layer or fully connected layer (FCL), which generates an output for the two classes, TTP0 and TTP1. A softmax function converts these outputs into class probabilities, which form the final CNN output: the probability of a TTP longer than 2 years, P(TTP0), and the probability of a TTP within 2 years, P(TTP1), which add up to 1. The classifier was compiled using the Adam optimizer with a learning rate (LRt) of 0.00005 and a decay rate (DR) of 0.000001. The CNN structure is illustrated in Fig. 3.
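A minimal Keras sketch of this two-branch architecture follows. The filter counts, pool sizes, dropout rate and optimizer settings are taken from the text; the convolution kernel size, padding and loss function are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def mip_branch(name: str):
    """One branch: 4 conv layers (16 to 128 filters, ReLU), SpatialDropout2D
    of 0.35 after each, max pooling after the first three, GAP2D at the end."""
    inp = layers.Input(shape=(275, 200, 1), name=f"{name}_mip")
    x = inp
    pools = [(3, 3), (3, 3), (2, 2), None]  # pool size per conv block
    for filters, pool in zip([16, 32, 64, 128], pools):
        x = layers.Conv2D(filters, (3, 3), activation="relu", padding="same")(x)
        x = layers.SpatialDropout2D(0.35)(x)
        if pool is not None:
            x = layers.MaxPooling2D(pool_size=pool)(x)
    return inp, layers.GlobalAveragePooling2D()(x)

cor_in, cor_out = mip_branch("coronal")
sag_in, sag_out = mip_branch("sagittal")
merged = layers.Concatenate()([cor_out, sag_out])
output = layers.Dense(2, activation="softmax")(merged)  # [P(TTP0), P(TTP1)]

model = models.Model(inputs=[cor_in, sag_in], outputs=output)
model.compile(
    # the study used a decay rate of 1e-6; on older Keras versions this can
    # be passed as Adam(..., decay=1e-6)
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss="categorical_crossentropy",
    metrics=[tf.keras.metrics.AUC()],
)
```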
In this study we trained the model following 3 different training schemes, illustrated in Supplemental Fig. 1: training based on 1) only lesion MIPs (lesion MIP CNN); 2) lesion masks followed by regular MIPs (MIP CNN); and 3) lesion masks followed by MIPs after removal of the brain, i.e. brain-removed MIPs (BR-MIP CNN). The network architecture was kept the same across schemes. This approach was followed in order to explore whether the model could be trained to recognize pathological patterns.
The lesion MIP CNN uses only the lesion MIPs as input (Fig. 1). These MIPs contain only the tumours, including their intensity values. The lesion MIPs of both coronal and sagittal views were used to train and validate the model for 200 epochs.
The MIP CNN uses both lesion masks and MIPs for training. The lesion masks contain the location of the tumours but not their intensity values. The training of the MIP CNN consisted of two sequential steps. First, the lesion masks of both coronal and sagittal views were used to train and validate the model for 200 epochs. In the second step, the model pre-trained on the lesion masks was re-trained and re-validated for another 300 epochs, this time using the regular coronal and sagittal MIP images instead. The same patients were used for training in both steps.
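Under this scheme, the second step simply continues fitting the same weights on a different input type; a sketch, with the array names and batch size as hypothetical placeholders:

```python
# Step 1: train on the binary lesion masks (coronal + sagittal) for 200 epochs.
model.fit([cor_masks, sag_masks], labels, epochs=200, batch_size=16,
          validation_data=([cor_masks_val, sag_masks_val], labels_val))

# Step 2: continue training the same model on the regular MIPs
# (or brain-removed MIPs for the BR-MIP CNN) for another 300 epochs.
model.fit([cor_mips, sag_mips], labels, epochs=300, batch_size=16,
          validation_data=([cor_mips_val, sag_mips_val], labels_val))
```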
The BR-MIP CNN follows the same training process as the MIP CNN but uses MIPs with the brain removed instead of the original MIPs (Fig. 1). Brain removal was performed to provide greater consistency across the dataset, since not all scans included the head. The brain-removal process is described in detail in Supplemental Material 1 [21].
For both the MIP CNN and the BR-MIP CNN, the classifier requires only the MIP images to make predictions. The idea behind these two CNNs was to generate a classifier whose predictions are free of observer-dependent tumour segmentation.
Plausibility of the CNN
To better understand the CNN predictions, we further investigated the output of the model by exploring the association between P(TTP1) and two PET-extracted features, MTV and Dmaxbulk, since both have shown potential as prognostic markers in DLBCL. The extraction of PET features has been explained in previous studies [4]. Moreover, we synthetically removed the tumours from the MIP images to simulate tumour-free data and evaluated the CNN predictions on these data. The tumours were masked using the lesion masks generated with the in-house built CNN preprocessing tool, and the voxel values corresponding to the tumours were replaced by the average of the voxel intensities excluding the background. This process is shown in Supplemental Fig. 2.
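A sketch of this masking step on a single MIP, assuming NumPy arrays, a boolean lesion mask and a zero-valued background (names are hypothetical):

```python
import numpy as np

def remove_tumours(mip: np.ndarray, lesion_mask: np.ndarray) -> np.ndarray:
    """Replace tumour pixels with the mean intensity of the remaining
    non-background pixels, yielding a synthetic tumour-free MIP."""
    body = (mip > 0) & ~lesion_mask   # neither tumour nor (zero) background
    out = mip.copy()
    out[lesion_mask] = mip[body].mean()
    return out
```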
To facilitate interpretation, the TTP1 probabilities obtained from the CNN were calibrated by fitting a logistic regression with the probabilities as input and the TTP0/TTP1 labels as outcome [22]. The fitted regression coefficients were then applied to the CNN TTP1 probabilities generated after tumour removal, so that the tumour-free MIP probabilities were calibrated in the same way.
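A sketch of this calibration step using scikit-learn, with hypothetical variable names:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit on the original CNN probabilities against the TTP0/TTP1 labels ...
calibrator = LogisticRegression()
calibrator.fit(p_ttp1.reshape(-1, 1), labels)

# ... then apply the same fit to the tumour-free probabilities.
p_ttp1_calibrated = calibrator.predict_proba(
    p_ttp1_tumour_free.reshape(-1, 1))[:, 1]
```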
Statistical Analysis
The receiver operating characteristic (ROC) curve and the area under the curve (AUC) were used to evaluate CNN performance. During training, the fold with the highest cross-validated (CV-)AUC across the 5 folds was retained; see Supplemental Material 2 for more details [23]. The PETAL dataset was used to externally validate the CNN model. This performance evaluation was carried out for the lesion MIP CNN, the MIP CNN and the BR-MIP CNN. In addition, the AUCs were compared with the performance of an IPI-based prediction model, which defines patients with an IPI risk score of 4 or higher as high-risk patients (i.e. TTP1) [4]. The association between the TTP1 probabilities and the PET-extracted features (MTV and Dmaxbulk) was assessed using Pearson's correlation coefficient.
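These evaluation steps map onto standard library calls; a sketch, with hypothetical variable names:

```python
from sklearn.metrics import roc_auc_score, roc_curve
from scipy.stats import pearsonr

auc = roc_auc_score(labels, p_ttp1)       # e.g. external validation AUC
fpr, tpr, _ = roc_curve(labels, p_ttp1)   # points of the ROC curve

r_mtv, _ = pearsonr(p_ttp1, mtv)          # association with MTV
r_dmax, _ = pearsonr(p_ttp1, dmax_bulk)   # association with Dmaxbulk
```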