Patients and Image Acquisition
A total of 200 prostate cases and 70 head-and-neck (H&N) cases undergoing carbon-ion pencil beam scanning (C-PBS) therapy at our treatment center participated in this study. The study was conducted with the approval of the Institutional Review Board (N21-001) and performed in accordance with the Declaration of Helsinki. All patients provided informed consent for use of the data from their medical records. During image acquisition, all patients were positioned on the treatment table with immobilization devices (urethane resin cushions [Moldcare, Alcare, Tokyo, Japan]) and low-temperature thermoplastic shells (Shell Fitter, Kuraray Co., Ltd., Osaka, Japan).
Planning CT images and DRR image projection
Treatment planning CT image data were acquired under breath-hold in exhalation using a 320-detector CT (Aquilion One Vision, Canon Medical Systems, Otawara, Japan). Imaging conditions were based on our clinical protocols using automatic exposure control [32]. Reconstructed CT slice thicknesses were 2.0 mm for prostate cases and 1.0 mm for H&N cases. The image field-of-view was 500 mm for both treatment sites.
A pair of DRR images was generated by projecting the CT data (converted to X-ray attenuation coefficients) along the X-ray imaging beam path using our in-house software [33]:
$$q\left(x,y\right)=\sum _{k=1}^{n}\Delta L\cdot {\mu }_{k} , \quad \left(1\right)$$
where q(x, y) is the projection ray sum at DRR image position (x, y), ΔL is the calculation grid size (= 1 mm in this study), μk is the X-ray attenuation coefficient at the kth calculation point along the ray, and n is the number of calculation points on the ray path.
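For illustration, Eq. (1) can be sketched in a few lines of NumPy, assuming a parallel-beam simplification in which the rays align with one axis of the volume; the actual in-house software [33] traces diverging rays through the CT volume along the imaging geometry. The function names and the water attenuation value are illustrative assumptions, not the implementation in [33].

```python
import numpy as np

def hu_to_mu(ct_hu: np.ndarray, mu_water: float = 0.02) -> np.ndarray:
    """Convert Hounsfield units to linear attenuation coefficients (1/mm).

    mu_water = 0.02 /mm is an illustrative value; the effective value
    depends on the imaging beam energy.
    """
    return mu_water * (1.0 + ct_hu / 1000.0)

def project_drr(mu: np.ndarray, delta_l: float = 1.0) -> np.ndarray:
    """Eq. (1): accumulate delta_L * mu_k along each ray.

    Parallel-beam simplification: rays run along axis 0 of the attenuation
    volume, which is assumed resampled onto a 1-mm calculation grid.
    """
    return delta_l * mu.sum(axis=0)

# Illustrative usage: project a CT volume (in HU) into a ray-sum image.
drr = project_drr(hu_to_mu(np.zeros((100, 128, 128))))
```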
The CT image was shifted to position the tumor at the center of the DRR image. In some cases, the CT image did not completely cover the DRR image near its edges because of the small number of CT slices, degrading DRR image quality. To solve this, we added CT slices before the first slice and after the last slice (extended CT image region), as sketched below. The DRR image matrix size and pixel size were 768 × 768 pixels and 388 × 388 µm, respectively, the same dimensions as the FPD images. The DRR computation was programmed in CUDA (Compute Unified Device Architecture, ver. 10.1) with Microsoft Visual Studio 2013 (Microsoft Corp., Redmond, WA, USA) in a Windows 10 environment using a GPU (graphics processing unit) on an NVIDIA board (Quadro RTX 8000, NVIDIA Corporation, Santa Clara, CA, USA), which is equipped with 4608 CUDA cores and 48 GB of memory, allowing a processing speed of more than 16.3 Tflops in single-precision calculation [34].
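The slice extension can be sketched as follows. Whether the appended slices replicate the end slices or use another fill is not specified in the text; the hypothetical helper below assumes edge replication.

```python
import numpy as np

def extend_ct_region(ct: np.ndarray, n_slices: int) -> np.ndarray:
    """Pad the CT volume along the slice axis (axis 0) so that rays near
    the DRR edges always intersect CT data.

    The padding content is an assumption; here the first and last slices
    are replicated ('edge' mode).
    """
    return np.pad(ct, ((n_slices, n_slices), (0, 0), (0, 0)), mode="edge")
```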
Fluoroscopic images
Digital fluoroscopic images from the prostate and H&N cases were acquired by imaging systems installed in the treatment room [12]. The X-ray imaging systems were set up according to the method of Mori [35]. The distance from the X-ray tube to the FPD was 239 cm, and that from the X-ray tube to the room isocenter was 169 cm.
For the patient setup verification process, we performed 2D-3D image registration of the pair of FPD images and the planning CT data [35], coregistering patient anatomical structures on the FPD images to those on the DRR images within a mean of 0.87 mm and 0.61 degrees, expressed as the square root of the sum of squares over the three respective dimensions [36]. The prostate treatment protocol uses 90° and 270° beam angles. The H&N treatment protocol uses more than two beam angles, extending the angular range (−10° to +10°) by rotating the treatment couch around its long axis (ϕ: standard International Electrotechnical Commission [IEC] tabletop roll angle). The number of treatment fractions was 12 and 16 for prostate and H&N, respectively.
Network architecture
Our DNN was a modified 2D convolutional autoencoder with shortcut connections (U-net) [37] (Fig. 1), consisting of an “encoder block” and a “decoder block.” The encoder block extracts features representing the input data at reduced spatial dimensions via a combination of the following hidden layers: a convolutional layer (stride size of 1 × 1), a rectified linear unit (ReLU) layer, and an instance normalization layer [38]. Spatial dimensions were reduced using a convolutional layer with a stride size of 2 × 2, and the number of output channels was doubled after each reduction (64, 128, 256, 512, and 1024).
The decoder block reconstructed feature representations of the input data and included a combination of upsampling layers, instance normalization layers, ReLU layers, dropout layers [39], and convolutional layers. The upsampling layers doubled the spatial dimensions; subsequently, the number of output channels was halved. A dropout layer (rate = 0.2) was added to avoid overfitting. A convolutional layer with a kernel size of 1 × 1 and a single output channel was added to export a single grayscale image for clinical use; the convolutional kernel size for all other layers was 3 × 3. The shortcut connection is one solution for avoiding vanishing gradients in deep architectures [40–42]. Each shortcut runs from the ReLU layer preceding the strided (2 × 2) convolutional layer in the encoder block to the point just before the dropout layer in the decoder block; however, no shortcut was applied around the last strided (2 × 2) convolutional layer [43], because the input DRR image and the ground-truth FPD image were not perfectly registered owing to interfractional positional changes (misalignment).
Although the original U-net uses pooling and batch normalization layers [40], we replaced the pooling layers with convolutional layers (stride size of 2 × 2) and the batch normalization layers with instance normalization layers to transfer the image style of the ground-truth image (FPD image) onto the output image (synthetic FPD image).
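A minimal Keras sketch of these encoder/decoder blocks is given below, assuming TensorFlow 2.4 with the tensorflow_addons package for the InstanceNormalization layer. The exact in-block layer ordering and the use of concatenation for the shortcut connections are assumptions based on the description above, not the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_addons as tfa  # provides InstanceNormalization

def encoder_stage(x, filters):
    # 3x3 conv (stride 1) -> instance norm -> ReLU; the ordering is assumed.
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = tfa.layers.InstanceNormalization()(x)
    skip = layers.ReLU()(x)  # shortcut source: ReLU before the strided conv
    # Spatial reduction by a strided conv (replaces pooling); channels double.
    down = layers.Conv2D(filters * 2, 3, strides=2, padding="same")(skip)
    return down, skip

def decoder_stage(x, skip, filters):
    x = layers.UpSampling2D(2)(x)                     # double spatial size
    x = layers.Conv2D(filters, 3, padding="same")(x)  # halve the channels
    x = tfa.layers.InstanceNormalization()(x)
    x = layers.ReLU()(x)
    if skip is not None:  # shortcut merged just before the dropout layer
        x = layers.Concatenate()([x, skip])
    x = layers.Dropout(0.2)(x)
    return layers.Conv2D(filters, 3, padding="same")(x)

def build_generator(input_shape=(384, 384, 1)):
    inp = tf.keras.Input(input_shape)
    x, skips = inp, []
    for f in (64, 128, 256, 512):  # channel progression up to 1024
        x, s = encoder_stage(x, f)
        skips.append(s)
    skips[-1] = None  # no shortcut around the last strided convolution
    x = layers.Conv2D(1024, 3, padding="same", activation="relu")(x)
    for f, s in zip((512, 256, 128, 64), reversed(skips)):
        x = decoder_stage(x, s, f)
    # Final 1x1 conv exports a single grayscale channel.
    return tf.keras.Model(inp, layers.Conv2D(1, 1)(x))
```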
Network training
A total of 4000 and 2000 image pairs (DRR and FPD images) from the prostate and H&N cases, respectively, were randomly selected for the DNN training process. In this process, each pair of FPD and DRR images was subdivided into subimages (144 × 144 pixels) by varying the crop position, rotation angle (±3.0° in 0.1° steps, plus rotations of ±90° and 180°), and flipping in the left-right or up-down directions (a sketch of this augmentation follows this paragraph). All FPD and DRR images were resized to 384 × 384 pixels with bicubic interpolation, and pixel values were normalized to the range 0–1. A total of 50,000 and 10,000 subimage pairs were prepared for the prostate and H&N cases, respectively, with care to exclude subimages containing the irradiation port cover, air, or bowel gas.
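The sketch below illustrates one way to generate such a subimage pair, assuming SciPy for the rotations; the uniform random sampling strategy and the function name are assumptions for illustration.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

def augment_pair(drr: np.ndarray, fpd: np.ndarray, size: int = 144):
    """Produce one augmented 144 x 144 subimage pair from a 384 x 384
    DRR/FPD pair; identical transforms are applied to both images."""
    # Small random rotation (+/-3.0 deg in 0.1-deg steps), optionally
    # combined with a +/-90 or 180 deg rotation.
    angle = rng.integers(-30, 31) * 0.1 + rng.choice([0, 90, 180, -90])
    drr = ndimage.rotate(drr, angle, reshape=False, order=3)
    fpd = ndimage.rotate(fpd, angle, reshape=False, order=3)
    # Random left-right / up-down flips.
    if rng.random() < 0.5:
        drr, fpd = drr[:, ::-1], fpd[:, ::-1]
    if rng.random() < 0.5:
        drr, fpd = drr[::-1, :], fpd[::-1, :]
    # Random crop position.
    y = rng.integers(0, drr.shape[0] - size + 1)
    x = rng.integers(0, drr.shape[1] - size + 1)
    return drr[y:y + size, x:x + size], fpd[y:y + size, x:x + size]
```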
Treatment couch
The edge of the treatment couch included in the FPD image increased pixel values locally (marked as light green lines in Figs. 1a and 1c). The treatment couch positions on the FPD and DRR images did not always match owing to small differences in patient positioning. In the worst case, the treatment couch edge was absent from some images, resulting in large pixel value inconsistencies between the input and ground-truth images. To avoid this, we applied image processing to remove the treatment couch edge from the CT image before calculating the DRR image.
Irradiation port cover
The edge of the irradiation port cover was visualized on the FPD image (marked as arrows in Figs. 1a and 1c) but not on the DRR image.
Bowel gas
For the prostate cases, bowel gas positions differed between the FPD and DRR images, possibly due to interfractional changes, causing pixel value inconsistencies. To avoid this, we outlined bowel gas regions of interest (ROIs) manually on the FPD images (marked with dotted yellow lines in Fig. 2a), and a DRR image without bowel gas was calculated after resetting the Hounsfield units (HU) within the gas collections to 0 (a sketch follows). The contours of the upper thighs/male genitalia were variable; as a result, pixel values were sometimes inconsistent between the FPD and DRR images (marked as light blue dashed lines in Fig. 2a).
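A minimal sketch of the HU reset, assuming the manual ROI is available as a boolean mask resampled onto the CT grid (the helper name is hypothetical):

```python
import numpy as np

def reset_gas_hu(ct_hu: np.ndarray, gas_mask: np.ndarray) -> np.ndarray:
    """Set HU inside manually outlined bowel-gas ROIs to 0 before the DRR
    calculation; gas_mask is a boolean array derived from the ROI contours."""
    corrected = ct_hu.copy()
    corrected[gas_mask] = 0
    return corrected
```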
Air
In the H&N cases, regions external to the patient on the FPD images included the treatment couch edge and/or air. Pixel values for air varied on the FPD images but were uniformly zero on the DRR images, which could affect DNN prediction accuracy. To solve this problem, a mask was applied to the DRR image using a pixel value threshold of 0 (marked as a light blue line in Fig. 2d). The same patient mask was applied to the paired FPD image at the same position (marked as a light blue dotted line in Fig. 2c). The proportion of air was kept at < 40% of each subimage (a sketch follows).
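A sketch of this masking step, assuming the patient mask is derived by thresholding the DRR at a pixel value of 0 and reused for the paired FPD image; the function name and return convention are assumptions.

```python
import numpy as np

def mask_air(drr: np.ndarray, fpd: np.ndarray, max_air_fraction: float = 0.4):
    """Mask regions outside the patient using the DRR zero-value threshold,
    apply the same mask to the paired FPD image, and report whether the
    air fraction stays below 40% of the subimage."""
    patient = drr > 0                      # threshold at pixel value 0
    air_fraction = 1.0 - patient.mean()
    return drr * patient, fpd * patient, air_fraction < max_air_fraction
```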
Parameter optimization
The DNN parameters were optimized to predict an FPD image from a DRR image as follows. The optimization process was run for 3000 epochs with a batch size of 70, using stochastic gradient descent (SGD) to minimize the loss. We did not set early stopping criteria; instead, we monitored the learning curves by plotting the loss values for the training and validation data, and stopped the optimization process when the curves no longer improved or showed overfitting.
We calculated two types of loss: content loss and perceptual loss (Fig. 1c).
Content loss was calculated as the mean absolute error (MAE; the L1 loss function), which improves robustness to outliers (image noise and image artifacts) caused by misalignment between the DRR and FPD images:
$${L}_{content}=\frac{1}{n}\sum _{i=1}^{n}\left|{I}_{i}^{true}-{I}_{i}^{pred}\right| , \quad \left(2\right)$$
where \({I}_{i}^{true}\) and \({I}_{i}^{pred}\) are the ith pixel values in the ground-truth image and the predicted image, respectively, and n is the total number of pixels in the image.
Perceptual loss was assessed using a pre-trained VGG19 model [44] as a feature extractor; it sums errors over six feature maps output by the respective convolutional layers (Fig. 1b). The error was computed from the squared differences between these features for the ground-truth FPD image and the predicted FPD image.
Perceptual loss (Lperceptual) was defined as:
$${L}_{perceptual}=\sum _{k\in P}\left[\sum _{i=1}^{n}{\left({V}_{k}\left({I}_{i}^{true}\right)-{V}_{k}\left({I}_{i}^{pred}\right)\right)}^{2}\right] , \quad \left(3\right)$$
where \({V}_{k}\left({I}^{true}\right)\) and \({V}_{k}\left({I}^{pred}\right)\) are the features of the kth layer of the VGG network for the ground-truth image and the predicted image, respectively, and P = {1, 2, 5, 10, 15, 20}.
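A minimal TensorFlow sketch of Eq. (3) follows, assuming that the indices in P map directly onto Keras VGG19 layer indices and that grayscale images are tiled to three channels before feature extraction; the input scaling/preprocessing details are likewise assumptions.

```python
import tensorflow as tf

# Feature extractor: ImageNet-pretrained VGG19 without the top layers.
_vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
_extractor = tf.keras.Model(
    _vgg.input, [_vgg.layers[k].output for k in (1, 2, 5, 10, 15, 20)])
_extractor.trainable = False

def perceptual_loss(y_true, y_pred):
    # Grayscale -> 3 channels to match the VGG19 input (an assumption).
    f_true = _extractor(tf.image.grayscale_to_rgb(y_true))
    f_pred = _extractor(tf.image.grayscale_to_rgb(y_pred))
    # Eq. (3): sum of squared feature differences over the six layers.
    return tf.add_n([tf.reduce_sum(tf.square(a - b))
                     for a, b in zip(f_true, f_pred)])
```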
Finally, we calculated the total loss using the following equation:
$${L}_{total}=2.0\cdot {L}_{content}+{10}^{-5}\cdot {L}_{perceptual} , \quad \left(4\right)$$
The learning rate, momentum, and decay were set to 10⁻⁵, 0.9, and 10⁻⁵, respectively. The learning rate was decreased by 2 × 10⁻⁹ every 4 epochs (a configuration sketch follows). The deep learning framework TensorFlow 2.4 was used in a Windows 10 64-bit environment with a single GPU on the NVIDIA Quadro RTX 8000 board.
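A configuration sketch consistent with Eqs. (2) and (4) and the stated hyperparameters, assuming the per-4-epoch decrement is implemented as a Keras learning rate scheduler (perceptual_loss as sketched above; the callback-based schedule is an assumption):

```python
import tensorflow as tf

def content_loss(y_true, y_pred):
    # Eq. (2): mean absolute error (L1).
    return tf.reduce_mean(tf.abs(y_true - y_pred))

def total_loss(y_true, y_pred):
    # Eq. (4): weighted sum of the content and perceptual terms.
    return (2.0 * content_loss(y_true, y_pred)
            + 1e-5 * perceptual_loss(y_true, y_pred))

# SGD with the stated learning rate, momentum, and decay; `decay` is the
# legacy per-update decay argument available in TensorFlow 2.4.
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-5, momentum=0.9,
                                    decay=1e-5)

# Decrease the learning rate by 2e-9 every 4 epochs.
lr_schedule = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch, lr: lr - 2e-9 if epoch > 0 and epoch % 4 == 0 else lr)

# model.compile(optimizer=optimizer, loss=total_loss)
# model.fit(train_pairs, epochs=3000, batch_size=70, callbacks=[lr_schedule])
```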
Post-processing
The ground-truth FPD images showed image noise from scattered radiation, whereas the predicted FPD images did not fully reproduce this scatter. Thus, we added image noise to the predicted FPD images.
The ground-truth FPD images contained scattered radiation at the original image size (768 × 768 pixels), whereas all training data were resized by half (384 × 384 pixels). The predicted FPD image was therefore resized to 768 × 768 pixels, and Gaussian noise was added (mean: 0; standard deviation: 0.001 and 0.0005 for the prostate and H&N cases, respectively). Finally, the noise-added synthetic FPD image was downsized to 384 × 384 pixels again (see the sketch below).
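A minimal sketch of this post-processing step, assuming scikit-image for the resizing (bicubic interpolation is an assumption):

```python
import numpy as np
from skimage.transform import resize

def add_scatter_noise(pred_384: np.ndarray, sigma: float) -> np.ndarray:
    """Upsample the predicted FPD image to the original size, add zero-mean
    Gaussian noise (sigma = 0.001 prostate / 0.0005 H&N), downsize again."""
    up = resize(pred_384, (768, 768), order=3)
    up += np.random.normal(0.0, sigma, size=up.shape)
    return resize(up, (384, 384), order=3)
```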
Evaluations
We evaluated the quality of the synthetic FPD images using 2700 and 640 ground-truth FPD images for the prostate and H&N cases, respectively; these image data differed from the training data. The synthetic FPD images were compared with the ground-truth FPD images using the MAE, the peak signal-to-noise ratio (PSNR), and the structural similarity index measure (SSIM) [45], metrics widely used to quantify the similarity between two images (a sketch follows). We also compared the image quality of the synthetic FPD images with that of the DRR images to characterize the performance of our DNN.
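The three metrics can be computed as sketched below, assuming scikit-image implementations of PSNR and SSIM and pixel values normalized to [0, 1]; the function name is illustrative.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(gt: np.ndarray, syn: np.ndarray) -> dict:
    """MAE, PSNR, and SSIM between a ground-truth and a synthetic FPD image;
    pixel values are assumed to be normalized to [0, 1]."""
    return {
        "MAE": float(np.mean(np.abs(gt - syn))),
        "PSNR": peak_signal_noise_ratio(gt, syn, data_range=1.0),
        "SSIM": structural_similarity(gt, syn, data_range=1.0),
    }
```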
The computation time for the prediction (not including the model file import) was evaluated.