Dosimetric Impact of Deep Learning-Based CT Auto-Segmentation on IMRT Treatment Planning for Prostate Cancer

Background: Automatic segmentation algorithms are commonly evaluated using geometric metrics. An evaluation based on dosimetric parameters might be more relevant in clinical practice, but is still lacking in the literature. The aim of this study was to investigate, for the first time, the impact of state-of-the-art 3D U-Net-generated organ delineations on dose optimization in intensity-modulated radiation therapy (IMRT) for prostate patients. Methods: A database of 69 computed tomography (CT) images with prostate, bladder, and rectum delineations was used for single-label 3D U-Net training with a Dice similarity coefficient (DSC)-based loss. Volumetric modulated arc therapy (VMAT) plans were generated for both manual and automatic segmentations with the same optimization settings. These were chosen to give consistent plans when applying perturbations to the manual segmentations. Contours were evaluated in terms of DSC, average Hausdorff distance (HD), and 95% HD. Dose distributions were evaluated with the manual segmentation as reference using dose-volume histogram (DVH) parameters and a 3%/3 mm gamma criterion with a 10% dose cut-off. The Pearson correlation coefficient between DSC and the dosimetric metrics (gamma index and DVH parameters) was calculated. Results: The 3D U-Net-based segmentation achieved a DSC of 0.87(0.03) for prostate, 0.97(0.01) for bladder, and 0.89(0.04) for rectum. The mean and 95% HD were below 1.6(0.4) mm and 5(4) mm, respectively. The DVH parameters V50/65/70 Gy for the rectum and V60/65/70 Gy for the bladder showed agreement between dose distributions within ±5% and ±2%, respectively. The DVH parameters for prostate and prostate + 3 mm margin (surrogate clinical target volume) showed good target coverage for the 3D U-Net segmentation, with the exception of one case. The average gamma pass-rate was 85%. A comparison between geometric and dosimetric metrics showed no strong statistically significant correlation between them. Conclusions: The 3D U-Net developed for this work achieved state-of-the-art geometric performance. The study highlighted the importance of dosimetric evaluation on top of standard geometric parameters and concluded that the automatic segmentation is sufficiently accurate to assist physicians in manually contouring organs in CT images of the male pelvic region, which is an important step towards a fully automated workflow in IMRT.


Background
The anatomical structure of the male pelvic region, with the prostate surrounded by seminal vesicles, bladder, and rectum, makes intensity-modulated radiation therapy (IMRT) a favorable technique for the treatment of localized prostate cancer [1][2][3]. However, due to variable bladder and rectal filling, random shifts, and deformations of neighboring organs, online adaptation of the treatment plan would be necessary in order to take full advantage of modern radiotherapy techniques [4,5].
Recontouring of the target volume (TV) and organs at risk (OARs) is an important step in treatment plan adaptation. Previous studies have shown that manual delineation is not only time-consuming (on the order of several minutes) but also prone to inter- and intra-physician variability [6][7][8].
To address these problems, considerable scientific efforts have been made to develop efficient automatic segmentation tools. Previously, auto-segmentation methods such as (multi-)atlas-based and hybrid techniques were considered state-of-the-art [9]. Over time, methods based on convolutional neural networks (CNN) [10] gained more attention [11,12]. Milletari et al. [13] proposed a 3D fully convolutional neural network architecture trained end-to-end on magnetic resonance (MR) prostate images, referred to as V-Net, and introduced a novel objective function based on the Dice similarity coefficient (DSC). Balagopal et al. [14] presented a hybrid network with an additional 2D localization network prior to the 3D segmentation network to delineate prostate, bladder, rectum, and femoral heads on pelvic computed tomography (CT) images. In order to overcome the challenges of low soft tissue contrast and blurry boundaries in CT images, Wang et al. [15] and Tong et al. [16] additionally focused on edge enhancement techniques. Sultana et al. [17] proposed a two-stage network combining U-Net and generative adversarial network (GAN) architectures [18] for structure localization followed by precise prediction of the organ delineation.
Evaluation metrics that are commonly used to measure segmentation performance focus purely on geometric accuracy. The most frequently used are the DSC, the mean, 95%, or maximal Hausdorff distance (HD), the positive predictive value (PPV), and the sensitivity [19]. The two main ideas behind them are: a) a pixel-wise comparison of ground-truth and predicted segmentation and b) measuring the distance between the ground-truth and the predicted contours. What carries a higher relevance in clinical practice, however, is the dosimetric accuracy and the quality of the treatment plans that can be achieved on the basis of the predicted segmentations. At the time of writing, no studies exist that have investigated and quantified the dosimetric impact of CT organ delineations for prostate cancer patients obtained from deep CNNs.
In this work, a state-of-the-art 3D U-Net architecture for automatic organ segmentation in CT images of low-grade prostate cancer patients was trained. The training was carried out separately for the bladder, prostate, and rectum. Since in patients with low-grade prostate cancer the tumor is located only in the prostate, seminal vesicles were not considered for segmentation. Realistic volumetric modulated arc therapy (VMAT) plans with the same optimization settings were created for all test cases using the manual segmentations and the automatic segmentations obtained from the 3D U-Net. This made it possible, for the first time, to also assess the dosimetric impact of deep learning delineations.
To this end, the quality of the treatment plans optimized on the automatically generated contours was compared with the reference plans in terms of dose-volume histogram (DVH) parameters, conformity index (CI), and gamma pass rate. In addition, a standard contour-based analysis based on DSC as well as average and 95% HD was performed. Geometric and dosimetric evaluation metrics were then compared using the Pearson correlation coefficient to investigate a possible correlation between them.

Database
The dataset used in this study consisted of 69 CT images, along with delineated structures associated with the low-grade prostate cancer treatment performed at the Klinikum Großhadern of the Ludwig Maximilian University of Munich. Patients with substantial CT artifacts, e.g., due to fiducial markers or metal bone implants, were not included in this study. The use of an ultrasound probe for prostate monitoring during irradiation in several cases did not interfere with CT imaging of the pelvic region; such cases were therefore also included. CT data were acquired with a Toshiba Aquilion LB CT scanner (Canon Medical Systems, Japan) using 512 × 512 pixels in the axial plane and a variable number of slices. The voxel size was 1.074 × 1.074 × 3 mm³. OARs, in particular bladder and rectum, were delineated by a trained radiation oncologist and stored as point clouds (DICOM RT-structs). The prostate contours were redrawn under the supervision of a trained physician according to guidelines for low-grade (stage I and II) prostate tumor patients. Images and segmentations (converted into binary masks) were resampled to a 1 × 1 × 1 mm³ spacing grid, which was advantageous for the subsequent data augmentation at the training stage. While aiming to minimize the influence of contour conversion between the DICOM RT-struct format, defined on a 1.074 × 1.074 × 3 mm³ grid, and binary masks, defined on a 1 × 1 × 1 mm³ grid, we found that employing nearest neighbor interpolation introduced negligible alterations to the structures. Finally, the dataset was split into training, validation, and test sets of 47, 11, and 11 images, respectively.
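The resampling of a binary mask from the native grid to the 1 × 1 × 1 mm³ grid with nearest neighbor interpolation can be sketched as follows. This is a minimal NumPy illustration, not the conversion pipeline used in the study: the function name `resample_mask_nearest` and the voxel-centre convention are assumptions made for the example.

```python
import numpy as np

def resample_mask_nearest(mask, spacing, new_spacing=(1.0, 1.0, 1.0)):
    """Resample a binary mask to a new voxel spacing by nearest-neighbor lookup.

    Hypothetical helper: assumes both grids share the same origin and that
    voxel centres sit at (index + 0.5) * spacing.
    """
    mask = np.asarray(mask)
    spacing = np.asarray(spacing, dtype=float)
    new_spacing = np.asarray(new_spacing, dtype=float)
    # The physical extent is preserved, so the voxel count scales with the spacing ratio.
    old_shape = np.array(mask.shape, dtype=float)
    new_shape = np.maximum(1, np.round(old_shape * spacing / new_spacing)).astype(int)
    index_lists = []
    for axis in range(mask.ndim):
        # Centre of each output voxel in mm, mapped to the input voxel containing it.
        centres_mm = (np.arange(new_shape[axis]) + 0.5) * new_spacing[axis]
        idx = np.floor(centres_mm / spacing[axis]).astype(int)
        index_lists.append(np.clip(idx, 0, mask.shape[axis] - 1))
    return mask[np.ix_(*index_lists)]

# Toy example: two 3 mm slices become six 1 mm slices.
toy = np.zeros((4, 4, 2), dtype=np.uint8)
toy[1:3, 1:3, 1] = 1
resampled = resample_mask_nearest(toy, spacing=(1.074, 1.074, 3.0))
print(resampled.shape)
```

Because every output voxel copies the value of one input voxel, the result stays strictly binary, which is the property that makes nearest neighbor lookup attractive for label masks.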

3D U-Net
The 3D U-Net presented here is based on the V-Net architecture [13], developed initially for prostate delineation on MR images. The encoding arm of the network is composed of five levels (including the lowest one), each comprising one (1st level), two (2nd level), or three (3rd-5th levels) convolutional layers and having 16, 32, 64, 128, and 256 channels, respectively. The kernel size was set to 5 × 5 × 5 and the stride to 1 × 1 × 1, and group normalization was applied after each convolution. The output of a given level is used in the subsequent one as input for the first convolution and is added to the output of the last convolution, thus creating a residual connection. For downsampling between the network levels, a convolution with a kernel of size 2 × 2 × 2 and stride 2 was used. The PReLU activation was applied throughout the network. The decoding arm of the 3D U-Net is built in an analogous way, with up-convolutions increasing the image size instead. The output of each level of the encoding arm (before the downsampling) is concatenated with the corresponding input of the decoding arm. The last layer of the network uses a softmax activation followed by thresholding at 0.5 to produce two binary masks representing the segmentation of the structure and the background. For this project, only the segmentation of the structure is relevant.
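To make the encoder geometry concrete, the following sketch tabulates the feature-map size at each of the five levels. It assumes the 128 × 128 × 128 input crop used in this work and size-preserving ("same") padding for the 5 × 5 × 5 convolutions; the padding mode is not stated in the text and is an assumption here, as is the helper name `encoder_levels`.

```python
def encoder_levels(input_size=128, channels=(16, 32, 64, 128, 256)):
    """Return (level, edge length, channels, number of convolutions) per encoder level.

    Assumes 'same' padding, so only the stride-2 downsampling changes the size.
    """
    convs_per_level = (1, 2, 3, 3, 3)  # 1st, 2nd, and 3rd-5th levels
    levels = []
    size = input_size
    for level, (ch, n_conv) in enumerate(zip(channels, convs_per_level), start=1):
        levels.append((level, size, ch, n_conv))
        size //= 2  # 2x2x2 convolution with stride 2 halves each axis
    return levels

for level, size, ch, n_conv in encoder_levels():
    print(f"level {level}: {size}^3 voxels, {ch} channels, {n_conv} conv layer(s)")
```

Under these assumptions the lowest level operates on 8 × 8 × 8 feature maps with 256 channels.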

Data augmentation
The data augmentation, applied with probability p_aug to each input pair, i.e., an image and its segmentation, included 3D rotations around the image center (always aligned with the prostate center of mass), translations, B-spline-based deformations, and zooming. Translations can be described by three parameters [x_trans, y_trans, z_trans] denoting the maximal translation distances along each axis. Similarly, Euler rotations can be denoted by the maximal rotation angles [α, β, γ] around the superior-inferior, anterior-posterior, and medial-lateral axes, respectively. Zooming resizes each axis by a factor randomly drawn from [l_min, l_max]. The pixel intensities were truncated to fit the soft tissue window [I_min, I_max] and subsequently rescaled to [-1, 1]. The deformation process is characterized by a grid of n × n × n control points and random shifts drawn from a Gaussian distribution with parameters [µ, σ]. In the last step of the augmentation pipeline, a central part of each image was cropped to 128 × 128 × 128 due to memory limitations on the GPU. Nevertheless, the clinically relevant high dose regions close to the prostate were not affected by the cropping. While setting the initial values for the data augmentation parameters, special care was taken not to introduce strong artifacts or create unrealistic deformations.
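The two deterministic steps of this pipeline, intensity windowing with rescaling and the central crop, can be sketched in NumPy as below. The function names are hypothetical, the default window uses the values reported later in the paper (I_min = −150 HU, I_max = 150 HU), and the crop assumes the array is already centered on the prostate center of mass.

```python
import numpy as np

def window_and_rescale(image_hu, i_min=-150.0, i_max=150.0):
    """Clip HU values to the soft tissue window and map them linearly to [-1, 1]."""
    clipped = np.clip(np.asarray(image_hu, dtype=float), i_min, i_max)
    return 2.0 * (clipped - i_min) / (i_max - i_min) - 1.0

def center_crop(volume, size=128):
    """Cut a size^3 block around the array center (assumed to be the prostate center)."""
    starts = [(dim - size) // 2 for dim in volume.shape]
    if min(starts) < 0:
        raise ValueError("volume is smaller than the requested crop")
    return volume[tuple(slice(s, s + size) for s in starts)]

print(window_and_rescale([-1000.0, -150.0, 0.0, 150.0, 2000.0]))
print(center_crop(np.zeros((160, 160, 140)), size=128).shape)
```

Air and bone both saturate at the window limits, so the network input only resolves contrast within the soft tissue range.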

Training
Training on single-label data was performed separately for three regions of interest: prostate, rectum, and bladder. Each model was trained on an NVIDIA Quadro P6000 GPU with the Keras implementation of the Adam optimizer (β1 = 0.9, β2 = 0.999, ε = 1e−07) and the Dice loss function applied to both the segmentation and the background. The set of hyperparameters to be optimized can be divided into two sub-groups: a) data augmentation related parameters, such as maximal translation shifts, rotation angles, zooming and soft tissue window limits, B-spline deformation parameters, and augmentation probability, and b) training related parameters, such as the learning rate and the number of epochs. The hyperparameters were optimized via a random search. Training with a given set of hyperparameters was continued until the desired network performance was achieved, defined as the stage at which the loss function evaluated on the validation data reaches a stable minimum.
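A Dice loss applied to both the structure and the background channel can be illustrated with the NumPy sketch below. The exact formulation used for training is not given in the text; this version uses plain sums in the denominator (the V-Net paper uses squared sums), averages the two channels with equal weight, and adds a small smoothing constant, all of which are assumptions for illustration.

```python
import numpy as np

def soft_dice(pred, target, eps=1e-7):
    """Soft Dice overlap between a probability map and a binary target."""
    intersection = np.sum(pred * target)
    denominator = np.sum(pred) + np.sum(target)
    return (2.0 * intersection + eps) / (denominator + eps)

def dice_loss_two_channel(prob_fg, target_fg):
    """Average Dice loss over the foreground (structure) and background channels."""
    prob_fg = np.asarray(prob_fg, dtype=float)
    target_fg = np.asarray(target_fg, dtype=float)
    dice_fg = soft_dice(prob_fg, target_fg)
    dice_bg = soft_dice(1.0 - prob_fg, 1.0 - target_fg)
    return 1.0 - 0.5 * (dice_fg + dice_bg)

target = np.zeros((4, 4, 4))
target[1:3, 1:3, 1:3] = 1.0
print(dice_loss_two_channel(target, target))        # perfect prediction
print(dice_loss_two_channel(1.0 - target, target))  # fully inverted prediction
```

Including the background channel keeps the loss informative even when the structure occupies only a small fraction of the cropped volume.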

Treatment planning
For all test cases, single-arc photon VMAT treatment plans were generated from scratch using a research version of the commercial treatment planning system (TPS) RayStation (version 8.99, RaySearch, Sweden). All plans aimed at a total dose of 74 Gy in 37 fractions. The generic beam model of an Elekta Synergy Linac (Elekta, Sweden) with Agility multi-leaf collimator was used. For each test case, two treatment plans were optimized on the same planning CT image, one based on the expert segmentation and one based on the 3D U-Net segmentation of rectum, bladder, and prostate. In both scenarios, in accordance with our facility's clinical guidelines, a PTV margin of 6 mm (posterior 5 mm) was applied around the prostate. The same optimization settings, i.e., the same objectives and weights for planning target volume (PTV), bladder, and rectum, were used for both manual and automatic segmentation. Settings were chosen using the expert segmentation such that a PTV coverage of V95% = 100% was achieved, while the dose to OARs was below the recommendations of the QUANTEC report [20]. Care was additionally taken to choose optimization settings that produce consistent planning results when small perturbations are applied to the manual segmentation. For this, the original RT-structs were converted to binary masks and back to DICOM RT-structs. Then a new plan was generated with the same optimization settings and dosimetrically compared to the initial plan based on the original RT-structs. For all test cases, deviations in the considered OAR and target DVH parameters (see the following section) were below ±2% for the perturbed scenario. These settings were then used to optimize a treatment plan using the 3D U-Net segmentations without further user interaction. Table 1 summarizes the goals of the treatment plan along with the importance of each factor.

Data evaluation
In order to evaluate the network-generated contours, DSC, average HD, and 95% HD (defined as the 95th percentile of the distances between boundary points) were calculated for all test cases with the expert delineations as the reference ground truth. Since there is no clear boundary between the rectum and colon, evaluation of the network predictions was limited to the slices containing the ground truth segmentation, i.e., no additional penalty was applied for the colon misclassification. The entire data evaluation was restricted to the 128 × 128 × 128 volume.
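The geometric metrics named above can be sketched with NumPy as follows. Definitions of the average and 95% HD vary between implementations; this illustration pools the nearest-neighbor distances of both directions before taking the mean and the 95th percentile, which is an assumption and not necessarily the variant used in the study. Surface voxels are taken to be mask voxels with at least one 6-connected background neighbor, and the brute-force distance matrix is only practical for small examples.

```python
import numpy as np

def dice_coefficient(a, b):
    """Dice similarity coefficient of two binary masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / total if total else 1.0

def surface_points(mask, spacing=(1.0, 1.0, 1.0)):
    """Coordinates (mm) of mask voxels that touch the background along any axis."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, constant_values=False)
    inner = (slice(1, -1),) * mask.ndim
    interior = np.ones_like(mask)
    for axis in range(mask.ndim):
        for shift in (-1, 1):
            # A voxel is interior only if every face neighbor is also in the mask.
            interior &= np.roll(padded, shift, axis=axis)[inner]
    return np.argwhere(mask & ~interior) * np.asarray(spacing, dtype=float)

def hausdorff_metrics(pred, truth, spacing=(1.0, 1.0, 1.0)):
    """Average and 95th-percentile surface distance, pooled over both directions."""
    p = surface_points(pred, spacing)
    t = surface_points(truth, spacing)
    d = np.linalg.norm(p[:, None, :] - t[None, :, :], axis=-1)
    pooled = np.concatenate([d.min(axis=1), d.min(axis=0)])
    return {"average_hd": float(pooled.mean()), "hd95": float(np.percentile(pooled, 95))}

truth = np.zeros((10, 10, 10), dtype=bool)
truth[2:6, 2:6, 2:6] = True
pred = np.zeros_like(truth)
pred[3:7, 2:6, 2:6] = True  # same cube shifted by one voxel
print(dice_coefficient(pred, truth), hausdorff_metrics(pred, truth))
```

For full-resolution organs, a distance transform or a k-d tree would replace the pairwise distance matrix to keep memory use manageable.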
The dose distributions for predicted and ground truth contours were compared using a 3D global gamma analysis with a (3%, 3 mm) criterion, where only voxels receiving at least 10% of the prescribed dose were considered. Additionally, the CI defined by Paddick [21] was calculated. This index has an ideal value of one, and plan quality decreases with decreasing index value. Both dose distributions were also compared in terms of clinically relevant target and OAR DVH parameters. For prostate and prostate + 3 mm (surrogate CTV), values of D98%, D2%, and V95% were determined. Similarly, V50/65/70 Gy for the rectum and V60/65/70 Gy for the bladder were calculated. All DVH parameters were determined using the ground truth segmentations and the dose distributions optimized either on the predicted or on the ground truth contours.
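The DVH parameters and the Paddick conformity index can be computed from a dose grid and binary masks as in the sketch below. It assumes equally sized voxels, so volumes reduce to voxel counts, and it leaves the reference isodose for the CI as a parameter because the level used in the study is not stated here; the function names are hypothetical.

```python
import numpy as np

def v_dose(dose, mask, threshold_gy):
    """V_x: percentage of the structure volume receiving at least threshold_gy."""
    d = dose[mask]
    return 100.0 * np.count_nonzero(d >= threshold_gy) / d.size

def d_volume(dose, mask, volume_percent):
    """D_x%: minimum dose received by the hottest volume_percent of the structure."""
    return float(np.percentile(dose[mask], 100.0 - volume_percent))

def paddick_ci(dose, target_mask, isodose_gy):
    """Paddick CI = TV_PIV^2 / (TV * PIV); undefined if no voxel reaches isodose_gy."""
    piv = dose >= isodose_gy                      # prescription isodose volume
    tv_piv = np.count_nonzero(piv & target_mask)  # target volume covered by it
    tv = np.count_nonzero(target_mask)
    return tv_piv ** 2 / (tv * np.count_nonzero(piv))

# Toy example: the isodose fully covers an 8-voxel target but spans 16 voxels.
dose = np.zeros((4, 4, 4))
dose[1:3, 1:3, :] = 74.0
target = np.zeros((4, 4, 4), dtype=bool)
target[1:3, 1:3, 1:3] = True
print(v_dose(dose, target, 70.0), d_volume(dose, target, 98.0))
print(paddick_ci(dose, target, isodose_gy=0.95 * 74.0))
```

In the toy example the target is fully covered, yet the CI drops to 0.5 because half of the isodose volume lies outside the target, which shows that the index penalizes spill as well as under-coverage.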
To investigate the correlation between the dosimetric and geometric metrics, the Pearson correlation coefficient [22] was calculated between 1) the DSC of the prostate and the gamma index, 2) the average DSC and the gamma index, and 3) the DSC and the DVH parameters.
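A correlation analysis of this kind can be sketched as below. The Pearson coefficient follows its standard definition; the p-value is estimated with a permutation test, which is swapped in here to keep the example free of statistics libraries and is not necessarily the test used in the study. The DSC and gamma values in the usage example are invented for illustration and are not the study data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def permutation_p_value(x, y, n_perm=5000, seed=0):
    """Two-sided p-value: how often shuffled data correlate at least as strongly."""
    rng = np.random.default_rng(seed)
    observed = abs(pearson_r(x, y))
    y = np.asarray(y, dtype=float)
    hits = sum(abs(pearson_r(x, rng.permutation(y))) >= observed for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

# Hypothetical per-patient values, for illustration only.
dsc_prostate = [0.82, 0.85, 0.85, 0.86, 0.87, 0.88, 0.88, 0.89, 0.90, 0.91, 0.91]
gamma_pass = [71.0, 93.0, 74.0, 82.0, 85.0, 94.0, 86.0, 88.0, 90.0, 87.0, 92.0]
print(pearson_r(dsc_prostate, gamma_pass))
print(permutation_p_value(dsc_prostate, gamma_pass))
```

With only 11 test cases, a single outlier can change both the coefficient and its p-value noticeably, so inspecting the scatter plot alongside the coefficient is advisable.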

Hyperparameter optimization
The following hyperparameter values led to satisfactory results: p_aug = 0.93; rotation angles α = 20°, β = γ = 10°; translation shifts x_trans = y_trans = z_trans = 10 mm; l_min = 0.9, l_max = 1.1; I_min = −150 HU, I_max = 150 HU; grid control points n × n × n = 15 × 15 × 15; µ = 0, σ = 30. After 20k epochs with a batch size of two, all loss functions had converged with no signs of overfitting. A learning rate of 10⁻³ performed best.
Figure 1 illustrates ground truth and automatically generated delineations of prostate, rectum, and bladder for three test patients. Images with the best, closest to the average, and the worst values of DSC for prostate are displayed. Table 2 collects the results of the geometric analysis for all test patients. Mean DSC values (standard deviation) of 0.87(0.03), 0.97(0.01), and 0.89(0.04) were achieved for the prostate, bladder, and rectum, respectively. The highest average DSC value was observed for the bladder, which can be attributed to its relatively large size. A slightly worse performance was observed for the rectum, followed by the prostate. The values of the average HD were 1.6(0.4) mm, 0.95(0.2) mm, and 1.4(0.7) mm for prostate, bladder, and rectum, respectively. The values of the 95% HD show the same trend: 4(1) mm, 2.5(0.5) mm, and 5(4) mm for prostate, bladder, and rectum, respectively.
Figure 2 illustrates dose distributions of three exemplary patients with the highest, the average, and the lowest gamma pass-rate. The reference dose distribution optimized using the ground truth contours, the 3D U-Net dose distribution optimized using the predicted delineations, and their difference are shown. Deviations from the reference plan were found to be in the range of ±10% and were located primarily outside of the prostate. The largest differences were found close to the borders of the PTV region, where dose gradients are steep. Higher differences can be observed approximately 6 mm from the prostate contour, corresponding to the PTV margin. The quantitative results of the dosimetric comparison are summarized in Table 3. The CI for the reference plans ranges from 0.81 to 0.88, with an average (standard deviation) of 0.85(0.03). For the plans calculated on 3D U-Net-generated contours, the CI ranges from 0.69 to 0.88, with an average of 0.78(0.06). The gamma pass-rates (3%, 3 mm) were between 71% and 94%, with an average value of 85%.
Figure 3 illustrates differences between clinically relevant DVH parameters of the two optimized dose distributions. For rectum and bladder, all the differences are below 5% and 2%, respectively. No clear trend of increased or decreased bladder and rectum dose for the 3D U-Net segmentation-based plans was found. Similarly, differences for the target volume are mostly below 3 Gy/2% for D98, D2, and V95, apart from one outlier (patient 59) where the network struggled to delineate the prostate, which is also reflected in the relatively low DSC of 0.82 and gamma pass-rate of 71%. No tendencies for the D2 parameter were observed, but the 3D U-Net-based plans tend to have reduced values of D98 and V95 for both prostate and prostate + 3 mm, indicating a slight reduction of target coverage, which is in line with the reduced CI values.

Pearson correlation coefficient
The Pearson correlation coefficient between the DSC of the prostate and the gamma index was 0.67 (p = 0.023), which indicates a moderate positive correlation. No statistically significant results were obtained for the other parameters (significance level p < 0.05).
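As a plausibility check, and assuming that the correlation was computed over the n = 11 test patients and assessed with the usual two-sided t-test (neither is stated explicitly), the reported coefficient can be converted into a t statistic with n − 2 = 9 degrees of freedom:

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
  = \frac{0.67\,\sqrt{11-2}}{\sqrt{1-0.67^{2}}}
  \approx \frac{2.01}{0.742}
  \approx 2.71
```

For 9 degrees of freedom, this value lies between the two-sided critical values for p = 0.05 (2.262) and p = 0.02 (2.821), which is in line with the reported p = 0.023.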

Discussion
In this work a 3D U-Net has been successfully trained and applied for CT-based organ segmentation in the male pelvic area. In contrast to current literature, the evaluation of the network's performance was based not only on commonly used geometric metrics, but also on clinically relevant dosimetric parameters.
Satisfactory performance was observed with regard to the geometric accuracy of the contour delineation, indicating a high degree of similarity between automated and manual segmentations. The best results were observed for bladder segmentation, followed by the rectum and prostate. The best values of DSC and HD for the bladder can be explained, firstly, by its simple geometry and, secondly, by its relatively large size, which makes an incorrect prediction of a group of edge pixels less relevant compared with the correctly classified central part of the organ. The low contrast of the prostate on the CT images makes its segmentation the most challenging. The boundaries of the rectum were satisfactory, with the exception of one case (patient 32) in which a substantial portion of the colon was misclassified as part of the rectal contour. Since the rectum-colon boundary is visually difficult to identify and is not located in the high dose region, we decided to reduce the penalty for this type of misclassification during the final evaluation (testing) by truncating the volume of interest to the axial slices that contained the ground truth segmentation.
Quantitative test outcomes showed state-of-the-art network performance in terms of DSC, mean HD, and 95% HD. The 2D-3D hybrid network for localization and subsequent organ segmentation proposed by Balagopal et al. [14] achieved a DSC of 0.9 for prostate, 0.95 for bladder, and 0.84 for rectum. The edge-calibrated multitask network by Tong et al. [16] showed an overall bladder, rectum, and prostate segmentation performance of DSC = 0.89. The U-Net-GAN hybrid architecture by Sultana et al. [17] similarly achieved DSC = 0.90. A more detailed comparison is shown in Table 4. In all studies, bladder achieved the highest segmentation accuracy, followed by prostate and rectum. Due to GPU memory limitations, images were cropped around the prostate center of mass, causing truncation of parts of the bladder and rectum in some cases. On the one hand, this could have made it easier to predict the outer walls; on the other hand, it reduced the organ volume. Since these factors have opposite effects on DSC and are small in themselves, the net effect on DSC is deemed negligible, while the value of HD might have been slightly underestimated. The truncated sections were always located in the low dose region, and therefore dosimetric analysis and plan optimization were not affected.
In the additional dosimetric analysis, target volume D98, D2, and V95 of the plans optimized using 3D U-Net contours were found to differ only slightly from the reference plans based on expert delineations; in particular, a trend towards lower D98 and V95 was observed, as shown in Fig. 3. Major deviations were observed in only one case (patient 59), which can be explained by an incorrect prostate contour that is shifted towards the bladder, as can be seen in Fig. 1.
DVH parameters for OARs also showed satisfactory results. Treatment plans optimized on 3D U-Net-generated contours resulted in a slightly lower dose to the rectum, with the greatest deviations of ±4% as measured by V50/65/70 Gy. No considerable differences were found for the bladder, with deviations of the order of ±2% for V60/65/70 Gy. Consequently, the results suggest that plans optimized on automatically generated contours do not overdose the neighboring OARs, i.e., bladder and rectum.
The average value of the CI was 0.78(0.06) for the plans optimized on 3D U-Net-generated contours and 0.85(0.03) for the reference plans. The lower average CI confirms the slightly worse target coverage. More pronounced deviations between the dose distributions were captured by the gamma index analysis, showing pass-rates of 71-94% with a mean value of 85%.
The only statistically significant correlation was found between the DSC of the prostate and the gamma index, and the Pearson coefficient indicated only a moderate positive correlation. No statistically significant correlation was found between the gamma pass-rate and the DSC values of OARs, or between the DVH parameters and the DSC. On the contrary, we observed that it is not uncommon for patients to show a very similar DSC for the prostate, which is the most important segmentation for the treatment planning of prostate cancer, while showing a very different gamma pass-rate, e.g., DSC_Pat.43 = DSC_Pat.90 = 0.85 while γ_Pat.43 = 93 and γ_Pat.90 = 74, or even DSC_Pat.44 = 0.88 and DSC_Pat.81 = 0.91 while γ_Pat.44 = 94 and γ_Pat.81 = 87. This leads to the conclusion that a high geometric similarity between contours does not necessarily result in a high-fidelity dose distribution optimized using these contours. Since the dosimetric analysis is ultimately more relevant clinically, the results of this study highlight that it should always be carried out in addition to the geometric analysis.
Another important factor to consider is the contour conversion between two formats: the point cloud format (DICOM RT-Struct) used by the contouring software as well as the TPS, and the binary masks required for CNN training. The use of linear interpolation in the conversion pipeline caused the original contours to shrink, which led to differences in structural volume and subsequently to underdosage in the PTV. The nearest neighbors interpolation instead did not introduce any noticeable differences during structure conversion.
One possible improvement to this study could be to prepare separate training images for the bladder and rectum by cropping images around their centers of mass and adjusting the soft tissue window to more closely match their HU range. This could help create more precise contours, but should not significantly affect the dosimetric analysis, as the parts of the OAR structures relevant for treatment planning are located in close vicinity of the prostate, which was used as the center for cropping in this study. Furthermore, prostate patients with tumor stages III and IV could be included in future studies by including the seminal vesicles in the prostate contour or by training a separate network. However, this is a challenging task, since in clinical practice the CTV/PTV might contain different proportions of the seminal vesicles depending on the exact tumor stage. CTV/PTVs including the seminal vesicles might therefore show more pronounced variations between patients, and thus more training data would be required.

Conclusions
A 3D U-Net was successfully trained for organ segmentation on CT images of the male pelvic region. The geometric accuracy measured with DSC, mean HD, and 95% HD shows state-of-the-art performance of our algorithm. Analysis based on clinically relevant DVH parameters of VMAT plans did not show excessive dose to OARs and demonstrated good target volume coverage in most cases. Nevertheless, the gamma pass rate was not always acceptable, indicating that human review is crucial. No strong statistically significant correlation between geometric and dosimetric metrics was observed, suggesting that both types of analysis should be included in the evaluation of automatic organ segmentation in the scope of radiotherapy.

Figure 2: Dose distributions of three test patients showing (left) the worst, (middle) the average, and (right) the best agreement, quantified by the gamma index, of the treatment plan optimized on (top) the manual segmentation and (middle) the 3D U-Net segmented images. Additionally, relative dose differences are presented. For improved visibility, dose below 25% of the dose prescribed to the PTV and deviations below 0.4% on the difference plot are not displayed. Ground truth contours of (green) rectum, (blue) bladder, and (red) prostate are also shown.