Clinical Data
In this study, mpMRI data from patients with histologically confirmed primary PCa were used (histopathological samples obtained by biopsy). The data consist of a single internal data set divided into two groups: n = 15 patients with and n = 122 patients without whole-mount histology data. Examinations were acquired between 2008 and 2019 on clinical 1.5 T (Avanto, Aera & Symphony, Siemens, Erlangen, Germany) and 3 T (Tim TRIO, Siemens, Erlangen, Germany) MRI systems. Images were acquired with surface phased-array (body matrix) coils in combination with integrated spine-array coils; no endorectal RF coil was used. The study was approved by the institutional ethics review board (Proposal Nr. 476/14 & 476/19) and patients gave written informed consent.
The imaging protocol was as follows: T2-weighted turbo spin echo (TSE) images in transverse, sagittal and coronal orientation, and DWI with an echo planar imaging sequence in transverse orientation. DWI data were acquired with b-values of 50, 400 and 800 s/mm². From the DWI data, a synthetic high b-value image was calculated for each patient. To this end, \(\mathrm{ADC}\) and \(S_{0}\) were fitted pixel-wise according to eq. (1):
\(S = S_{0}\, e^{-b\, \mathrm{ADC}}\)   (1)
Using these fitted values, a synthetic diffusion-weighted image with a b-value of 1400 s/mm² was calculated, as is routine practice in clinical settings.
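For illustration, the following is a minimal MATLAB sketch of the pixel-wise mono-exponential fit of eq. (1) and the computation of the synthetic b = 1400 s/mm² image; the variable names (dwi, bValues) and the log-linear least-squares formulation are assumptions and not necessarily the authors' implementation.

```matlab
% Sketch: voxel-wise fit of eq. (1) and synthetic high b-value image.
bValues = [50 400 800];            % acquired b-values in s/mm^2
bSynth  = 1400;                    % synthetic high b-value
[nx, ny, nz, nb] = size(dwi);      % dwi: 4-D array, one volume per b-value (assumption)

S    = reshape(double(dwi), [], nb);       % voxels x b-values
logS = log(max(S, eps));                   % avoid log(0)

% Linear least-squares fit of log(S) = log(S0) - b*ADC for every voxel
X      = [ones(nb, 1), -bValues(:)];       % design matrix
coeffs = X \ logS.';                       % 2 x nVoxels: [log(S0); ADC]
S0  = exp(coeffs(1, :));
ADC = coeffs(2, :);

% Synthetic diffusion-weighted image at b = 1400 s/mm^2
synthHighB = reshape(S0 .* exp(-bSynth .* ADC), nx, ny, nz);
```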
The protocol included additional dynamic contrast-enhanced imaging, which was not part of the CNN-based analysis.
Patient data were separated into a training cohort and a test cohort: the training cohort consists of a large irradiation and prostatectomy group (\(n_{\mathrm{irr}}\) = 122), and the test cohort consists only of a prostatectomy group (\(n_{\mathrm{prost}}\) = 15) from which whole-organ histopathology slices were available. The mpMRI data used to train the CNN contained T2-weighted images and apparent diffusion coefficient (ADC) maps together with synthetic high b-value images (b = 1400 s/mm²). For all 137 in-house mpMRI examinations (\(n_{\mathrm{irr}}\) and \(n_{\mathrm{prost}}\)), the entire PG (PG-Rad) and the PCa within the prostate (PCa-Rad) were contoured by two experienced radiation oncologists. As in (37, 38), PCa tissue (PCa-Histo) in the whole-mount histology data from the test cohort was stained with hematoxylin and eosin. Tumor contours were then delineated by experts and digitized. As an intermediate step, these whole-mount histology slices were registered with the corresponding ex vivo T2-weighted MRI using MITK software (MITK Workbench 2015.5.2). The histopathology slices and ex vivo MRI images were registered using anatomical landmarks, prioritizing the agreement between the prostate capsule contours, the urethra and cysts. Automatic interpolation was performed to generate 3D volumes. The ex vivo MRI images along with the histology-based tumor contours (PCa-Histo) were imported into the radiation therapy planning system Eclipse v15.1 (Varian Medical Systems, USA). Here, a careful manual co-registration of the ex vivo MRI (PCa-Histo) and the in vivo MRI (PCa-Rad) was performed using anatomical landmarks, allowing for non-rigid deformation. All contours (PCa-Histo, PCa-Rad) were then transferred to the corresponding in vivo MRI image (cf. Figure 1).
For data preprocessing, the mpMRI sequences were cropped to a smaller FOV around the prostate gland and then registered and interpolated to a voxel size of 0.78 × 0.78 × 3 mm³. Because the full image volumes would result in very long computation times, calculations were performed on patches of size 64 × 64 × 16 voxels extracted at random center locations within the original image. To account for class imbalance, the probability of the center voxel belonging to the class background (BG), PCa or PG was set to 33% each, and a 70% chance of a random 2D rotation in the axial plane was added for data augmentation.
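A minimal sketch of this class-balanced patch sampling and rotation augmentation is given below; the label encoding, the variables labelVol and imgVol, and the helper cropPatch are assumptions for illustration only.

```matlab
% Sketch: class-balanced patch sampling with random axial rotation.
patchSize = [64 64 16];
classVals = [0 1 2];                 % assumed numeric labels: 0 = BG, 1 = PCa, 2 = PG

% Pick the class of the patch center with equal probability (~33% each)
targetClass = classVals(randi(3));
idx = find(labelVol == targetClass);                     % candidate center voxels
[cx, cy, cz] = ind2sub(size(labelVol), idx(randi(numel(idx))));
imgPatch = cropPatch(imgVol,   [cx cy cz], patchSize);   % hypothetical cropping helper
lblPatch = cropPatch(labelVol, [cx cy cz], patchSize);

% With 70% probability, apply a random in-plane (axial) rotation
if rand < 0.7
    angle = 360 * rand;
    for s = 1:patchSize(3)           % rotate slice by slice in the axial plane
        imgPatch(:, :, s) = imrotate(imgPatch(:, :, s), angle, 'bilinear', 'crop');
        lblPatch(:, :, s) = imrotate(lblPatch(:, :, s), angle, 'nearest',  'crop');
    end
end
```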
Convolutional Neural Network
A patch-based 3D CNN of the U-net architecture (34) was trained for the automatic segmentation of PCa and the PG. The network was implemented in MATLAB® (2020a, MathWorks, Inc., Natick/MA) using the Deep Learning Toolbox. The CNN consists of 3 encoder blocks for downsampling with max-pooling, 3 decoder blocks for upsampling with transposed convolution layers (kernel size: 2×2×2, stride: 2, padding: 1), and skip connections realized by concatenation. The convolution blocks consist of 3×3×3 convolutions with stride and padding of 1, followed by batch normalization and Rectified Linear Unit (ReLU) activation, except for the last convolution, where a 1×1×1 convolution without padding, batch normalization and a softmax activation function were used.
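The following sketch illustrates how such building blocks could be assembled with Deep Learning Toolbox layers; the filter counts, the number of convolutions per block, the layer names and the final classification layer are assumptions, not the authors' exact network definition.

```matlab
% Sketch of U-net building blocks with Deep Learning Toolbox layers.
numFilters = 32;                                   % assumed filter count

convBlock = [
    convolution3dLayer(3, numFilters, 'Stride', 1, 'Padding', 1)
    batchNormalizationLayer
    reluLayer
    convolution3dLayer(3, numFilters, 'Stride', 1, 'Padding', 1)
    batchNormalizationLayer
    reluLayer];

% Encoder stage: convolution block followed by max-pooling downsampling
encoderStage = [convBlock; maxPooling3dLayer(2, 'Stride', 2)];

% Decoder stages upsample with transposed convolutions and use
% concatenation-based skip connections (channel dimension = 4 for 3-D data)
upLayer   = transposedConv3dLayer(2, numFilters, 'Stride', 2);
skipLayer = concatenationLayer(4, 2, 'Name', 'skip_concat');

% Network head: 1x1x1 convolution, batch normalization, softmax,
% plus a voxel-wise classification layer (Computer Vision Toolbox)
headLayers = [
    convolution3dLayer(1, 3)                       % 3 classes: BG, PCa, PG
    batchNormalizationLayer
    softmaxLayer
    pixelClassificationLayer];
```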
The CNN was trained with optimal hyperparameters (learning rate 0.001, patch size 64 × 64 × 16) obtained by a Bayesian optimization scheme to maximize segmentation performance within 150 epochs on an NVIDIA RTX 2080 GPU. During the testing phase, the mpMRI data from the test cohort (prostatectomy group) were used to evaluate the network prediction. The resulting segmentation was evaluated by computing the Dice-Sørensen coefficient (DSC) between the prediction and the ground truth.
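As a brief illustration of the evaluation metric, a minimal sketch of the DSC for a single class is shown below; predMask and gtMask are assumed binary 3D masks of the prediction and the ground truth.

```matlab
% Dice-Sorensen coefficient: 2*|P & G| / (|P| + |G|)
dsc = 2 * nnz(predMask & gtMask) / (nnz(predMask) + nnz(gtMask));
```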
3D Grad-CAM for Segmentation
The Grad-CAM method proposed by (26) is generalized to be applied to a pre-trained CNN with fixed learned weights in a segmentation task. Yang et al. (28) extended the Grad-CAM method to 3D Grad-CAM. A schematic of the 3D Grad-CAM is shown in Figure 2. Here, for understandability, let \(\{A^{k}(\vec{x})\}_{k=1}^{K}\) be the set of selected feature maps of interest from the \(K\) kernels of the last convolutional layer of the CNN, and let \(y^{c}(\vec{x})\) be the raw score of the CNN for a chosen class \(c\) before softmax activation. The Grad-CAM method first computes the gradients \(G^{c,k}(\vec{x})\) of the class scores \(y^{c}(\vec{x})\) with respect to all \(N\) voxels of each feature map \(A^{k}(\vec{x})\) of the convolutional layer:
\(G^{c,k}(\vec{x}) = \dfrac{\partial y^{c}(\vec{x})}{\partial A^{k}(\vec{x})}\)   (2)
These gradients are then globally average-pooled over all three spatial dimensions to obtain the neuron importance weights \(\omega^{c,k}\):
\(\omega^{c,k} = \dfrac{1}{N}\sum_{\vec{x}} G^{c,k}(\vec{x})\)   (3)
Then, a heat map \(H^{c}(\vec{x})\) is computed by summing the feature maps \(A^{k}(\vec{x})\) multiplied by their corresponding weights \(\omega^{c,k}\), followed by a ReLU activation to suppress negative contributions:
\(H^{c}(\vec{x}) = \mathrm{ReLU}\!\left(\sum_{k} \omega^{c,k} A^{k}(\vec{x})\right)\)   (4)
Segmentation is essentially a classification of each voxel of the input image \(I(\vec{x})\) into one of the target classes \(c\) with class scores \(y^{c}(\vec{x})\). Thus, following the method proposed in (39), we generalize the 3D Grad-CAM method to segmentation by averaging the class score \(y^{c}(\vec{x})\) over the set of voxels \(\mathcal{M}\) of the output segmentation mask, as in eq. (5):
\(\bar{y}^{c} = \dfrac{1}{|\mathcal{M}|}\sum_{\vec{x} \in \mathcal{M}} y^{c}(\vec{x})\)   (5)
The algorithm was implemented using the dlfeval function of the Deep Learning Toolbox in MATLAB® (2020a, MathWorks, Inc., Natick/MA).
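The sketch below illustrates how eqs. (2) to (5) can be computed with dlfeval and dlgradient; the layer names ('final_conv', 'feature_relu'), the dlnetwork object dlnet and the input/mask variables are assumptions and do not reproduce the authors' exact implementation.

```matlab
% Sketch: 3-D Grad-CAM for a segmentation network via automatic differentiation.
function heatMap = gradCamSeg(dlnet, dlX, classIdx, segMask)
    heatMap = dlfeval(@camGradients, dlnet, dlX, classIdx, segMask);
end

function heatMap = camGradients(dlnet, dlX, classIdx, segMask)
    % Forward pass: raw class scores (before softmax) and the selected
    % feature maps A^k (assumed layer names)
    [scores, A] = forward(dlnet, dlX, 'Outputs', {'final_conv', 'feature_relu'});

    % Eq. (5): average the class score over the voxels of the segmentation mask
    classScore = scores(:, :, :, classIdx, :);
    yBar = sum(classScore .* segMask, 'all') / max(sum(segMask, 'all'), 1);

    % Eq. (2): gradients of the averaged score w.r.t. the feature maps
    G = dlgradient(yBar, A);

    % Eq. (3): global average pooling over the three spatial dimensions
    w = mean(G, [1 2 3]);

    % Eq. (4): weighted sum of feature maps followed by ReLU
    heatMap = relu(sum(w .* A, 4));
end
```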
Evaluation of Heat Maps
The quality of the generated heat maps with respect to their localization ability is evaluated using the Intersection over Union (IoU) metric. For this, as proposed in (40), the generated heat maps of the test images are min-max normalized. They are then thresholded at different intensity values \(\delta\) to generate binary masks \(L^{c}(\delta)\) by setting intensity values above \(\delta\) to one and values below \(\delta\) to zero. Finally, we calculate the IoU \(\mathrm{Loc}^{c}(\delta)\) between the ground-truth label \(y_{\mathrm{Gt}}^{c}\) and the binary map \(L^{c}(\delta)\) for a class \(c\) thresholded at value \(\delta\) for the test image \(I(\vec{x})\) as
\(\mathrm{Loc}^{c}(\delta) = \dfrac{\left|L^{c}(\delta) \cap y_{\mathrm{Gt}}^{c}\right|}{\left|L^{c}(\delta) \cup y_{\mathrm{Gt}}^{c}\right|}\)   (6)
A higher value of \(\mathrm{Loc}^{c}(\delta)\) indicates a better localization of the heat map for the target class.
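A minimal sketch of this localization score follows; heatMap and gtMask are assumed to be a Grad-CAM heat map and the binary ground-truth mask, and the set of thresholds is an illustrative assumption.

```matlab
% Sketch: min-max normalization, thresholding and IoU of eq. (6).
heatMap = (heatMap - min(heatMap(:))) / (max(heatMap(:)) - min(heatMap(:)));

deltas   = 0.1:0.1:0.9;                     % assumed threshold values
locScore = zeros(size(deltas));
for i = 1:numel(deltas)
    L = heatMap >= deltas(i);               % binary map L^c(delta)
    locScore(i) = nnz(L & gtMask) / nnz(L | gtMask);   % IoU with ground truth
end
```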
As sanity checks, the model randomization test and the independent cascaded randomization test proposed in (41) are used to study the sensitivity of the heat maps to the learned parameters of the CNN. For the model randomization test, we generate heat maps from an untrained U-net model with random weights and biases, which are then compared to the heat maps from the trained network. For the independent cascaded randomization test, the weights of the convolutional layers in the decoder and encoder blocks are independently randomized from the top to the bottom of the network in a cascading manner, and heat maps are generated after each randomization step. Finally, we compare the mutual information and the structural similarity index (SSIM) between the heat maps generated from the learned model with fixed weights and those from the model randomization test and the independent cascaded randomization test.
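A possible way to compare two heat maps for these sanity checks is sketched below, using the ssim function of the Image Processing Toolbox and a simple histogram-based mutual information estimate; the bin count and the variable names (trainedMap, randomizedMap) are assumptions.

```matlab
% Sketch: SSIM and histogram-based mutual information between two heat maps.
similarity = ssim(randomizedMap, trainedMap);        % structural similarity index

nBins = 64;                                          % assumed number of bins
joint = histcounts2(trainedMap(:), randomizedMap(:), nBins);   % joint histogram
pxy   = joint / sum(joint(:));                       % joint probability
px    = sum(pxy, 2);                                 % marginal of trainedMap
py    = sum(pxy, 1);                                 % marginal of randomizedMap
pxpy  = px * py;                                     % product of marginals
nz    = pxy > 0;                                     % avoid log(0)
mi    = sum(pxy(nz) .* log(pxy(nz) ./ pxpy(nz)));    % mutual information in nats
```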