Deep neural network-based synthetic image digital fluoroscopy using digitally reconstructed tomography

We developed a deep neural network (DNN) to generate X-ray flat panel detector (FPD) images from digitally reconstructed radiography (DRR) images. FPD and treatment planning CT images were acquired from patients with prostate and head and neck (H&N) malignancies. The DNN parameters were optimized for FPD image synthesis. The synthetic FPD images were compared with the corresponding ground-truth FPD images using the mean absolute error (MAE), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM). The image quality of the synthetic FPD images was also compared with that of the DRR images to characterize the performance of our DNN. For the prostate cases, the MAE of the synthetic FPD image improved (0.12 ± 0.02) over that of the input DRR image (0.35 ± 0.08). The synthetic FPD image showed a higher PSNR (16.81 ± 1.54 dB) than the DRR image (8.74 ± 1.56 dB), while the SSIMs of both images were almost the same (0.69). For the H&N cases, all metrics for the synthetic FPD images improved (MAE 0.08 ± 0.03, PSNR 19.40 ± 2.83 dB, and SSIM 0.80 ± 0.04) compared with those for the DRR images (MAE 0.48 ± 0.11, PSNR 5.74 ± 1.63 dB, and SSIM 0.52 ± 0.09). Our DNN successfully generated FPD images from DRR images. This technique should be useful for increasing throughput when images from two different modalities are compared by visual inspection.


Introduction
With recent advances in deep neural network (DNN) techniques, research and development in medical image processing has accelerated to the point where it exceeds the performance of conventional image processing [1-3]. These benefits have been applied to image-guided radiation therapy (IGRT) and treatment planning: auto-segmentation [4,5], automated planning [6], planning CT image synthesis from cone-beam CT [7], deformable image registration [8], and image synthesis. Intramodality image synthesis transforms images from one modality to another form of the same modality, e.g., CT to low-dose CT [16,17], MRI T1-weighted to T2-weighted images [18,19], and PET [20,21]. Intermodality image synthesis transforms images from one modality to another: MR to CT [22-25], CT to MR [26,27], and PET to CT [28-30]. Compared with image denoising and increasing image resolution [1, 2, 31], the major difficulties of image synthesis are large differences in pixel values, differences in visualized structures, and alignment error.
There have been no instances in which an image synthesis DNN has been used to generate an FPD image from a DRR image. With this purpose in mind, we developed an image synthesis DNN for FPD image data, and compared the quality of the synthetic FPD image data with those of the original FPD and DRR images of the prostate and head and neck (H&N).

Patients and image acquisition
A total of 200 prostate and 70 H&N cases undergoing carbon-ion pencil beam scanning therapy (C-PBS) at our treatment center participated in this study. The study was conducted with the approval of the Institutional Review Board (N21-001) and performed in accordance with the Declaration of Helsinki. All patients provided informed consent for the use of data from their medical records. During image acquisition, all patients were positioned on the treatment table with immobilization devices (urethane resin cushions [Moldcare, Alcare, Tokyo, Japan]) and low-temperature thermoplastic shells (Shell Fitter, Kuraray Co., Ltd., Osaka, Japan).

Planning CT image and projecting DRR image
Treatment planning CT image data were acquired under breath-hold in exhalation using a 320-detector CT (Aquilion One Vision, Canon Medical Systems, Otawara, Japan). Imaging conditions were based on our clinical protocols using automatic exposure control [32]. Reconstructed CT slice thicknesses were 2.0 mm for the prostate and 1.0 mm for H&N cases. Image field-of-view was 500 mm for both diseases.
A pair of DRR images was generated by projecting the CT data (converted to X-ray attenuation coefficients) along the X-ray imaging beam path using our in-house software [33]:

q(x, y) = Σi μi ΔL,

where q(x, y) is the projection ray sum at DRR image position (x, y), μi is the attenuation coefficient at the ith calculation point along the ray, and ΔL is the calculation grid size (= 1 mm in this study).
The CT image was shifted to position the tumor at the center of the DRR image. For some cases, the DRR image could not completely cover the CT image near its edges owing to the small number of CT slices, degrading DRR image quality. To solve this, we added additional CT slices before the first slice; the same process was applied after the last CT slice (extended CT image region). The DRR image matrix size and pixel size were 768 × 768 pixels and 388 × 388 μm, respectively, the same dimensions as the FPD images. The DRR computation was implemented in CUDA (Compute Unified Device Architecture, ver. 10.1) with Microsoft Visual Studio 2013 (Microsoft Corp., Redmond, WA, USA) in a Windows 10 environment on an NVIDIA GPU board (Quadro RTX 8000, NVIDIA Corporation, Santa Clara, CA, USA), which is equipped with 4608 CUDA cores and 48 GB of memory, allowing a processing speed of more than 16.3 Tflops in single-precision calculation [34].
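As an illustration, the ray-sum projection described above can be sketched as follows. This is a minimal CPU sketch assuming parallel rays along one volume axis, not the in-house cone-beam CUDA software; the function name and toy geometry are illustrative.

```python
import numpy as np

def drr_ray_sum(mu_volume: np.ndarray, delta_l: float = 1.0) -> np.ndarray:
    """Project a volume of attenuation coefficients into a DRR.

    q(x, y) = sum_i mu_i * delta_l, accumulated along the ray path.
    For simplicity, rays here run parallel along axis 0; the clinical
    system traces diverging rays from the X-ray source to the FPD.
    """
    return mu_volume.sum(axis=0) * delta_l

# Toy example: a 4-voxel-thick uniform block (mu = 0.02 /mm, 1 mm grid)
volume = np.full((4, 3, 3), 0.02)
drr = drr_ray_sum(volume, delta_l=1.0)  # every pixel: 4 * 0.02 = 0.08
```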

Fluoroscopic images
Digital fluoroscopic images of the prostate and H&N cases were acquired by imaging systems installed in the treatment room [12]. The X-ray imaging systems were set up according to the method of Mori [13]. The distance from the X-ray tube to the FPD was 239 cm, and that from the X-ray tube to the room isocenter was 169 cm.
For the patient setup verification process, we performed 2D-3D image registration of the pair of FPD images with the planning CT data [13], coregistering patient anatomical structures on the FPD images to those on the DRR images within a mean of 0.87 mm and 0.61°, expressed as the square root of the sum of squares over the three respective dimensions [35]. The prostate treatment protocol uses 90° and 270° beam angles. The H&N treatment protocol uses more than two beam angles, rotating the treatment couch around its long axis (ϕ: standard International Electrotechnical Commission [IEC] tabletop rolling angle) to extend the range of angles (− 10° to + 10°). The number of treatment fractions was 12 for prostate and 16 for H&N.

Network architecture
Our DNN was a modified 2D convolutional autoencoder with shortcut connections (U-net) [36] (Fig. 1), consisting of an "encoder block" and a "decoder block." The encoder block extracts features representing the input data with reduced spatial dimensions via a combination of the following hidden layers: a convolutional layer (stride size of 1 × 1), a rectified linear unit (ReLU) layer, and an instance normalization layer [37]. Spatial dimensions were reduced using a convolutional layer with a stride size of 2 × 2, and the number of output channels was doubled after each reduction (64, 128, 256, 512, and 1024).
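The encoder stage described above might be sketched as follows in TensorFlow/Keras. This is an assumption-laden sketch, not the authors' code: instance normalization is not in core Keras, so a minimal custom layer is used, and the helper names are hypothetical.

```python
import tensorflow as tf

class InstanceNorm(tf.keras.layers.Layer):
    """Minimal instance normalization: per-sample, per-channel statistics."""
    def call(self, x):
        mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)
        return (x - mean) / tf.sqrt(var + 1e-5)

def encoder_block(x, channels):
    """Conv (stride 1) -> ReLU -> InstanceNorm, then a stride-2 conv that
    halves the spatial dimensions while doubling the channel count."""
    x = tf.keras.layers.Conv2D(channels, 3, strides=1, padding="same")(x)
    x = tf.keras.layers.ReLU()(x)
    x = InstanceNorm()(x)
    skip = x  # kept for the decoder's shortcut connection
    x = tf.keras.layers.Conv2D(channels * 2, 3, strides=2, padding="same")(x)
    return x, skip
```

Applied to a 144 × 144 subimage with 64 channels, the output is 72 × 72 × 128 and the retained skip tensor is 144 × 144 × 64.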

Irradiation port cover
The edge of the irradiation port cover was visualized on the FPD image (marked as arrows in Fig. 1a and c) but not on the DRR image.

Bowel gas
For the prostate cases, bowel gas positions differed between the FPD and DRR images, possibly due to interfractional changes or pixel value inconsistency. To avoid this, we manually outlined bowel gas regions of interest (ROIs) on the FPD images (marked with dotted yellow lines in Fig. 2a) and corrected the DRR image by resetting the Hounsfield units (HU) in the gas collections to 0, removing the bowel gas. The contours of the upper thighs and male genitalia were variable; as a result, pixel values were sometimes inconsistent between the FPD and DRR images (marked as light blue dashed lines in Fig. 2a).

Air
In the H&N cases, regions external to the patient included the treatment couch edge and/or air on the FPD images. Pixel values for air were measured inconsistently on the FPD images but were zero on the DRR images, which could affect DNN prediction accuracy. To solve this problem, a mask was applied to the DRR image using a pixel value threshold of 0 (marked as a light blue line in Fig. 2d). This patient mask was applied to the other FPD image of the pair at the same position (marked as a light blue dotted line in Fig. 2c). The proportion of air was kept at < 40% of each subimage.

Parameter optimization
The DNN parameters were optimized to predict an FPD image from a DRR image for respective regions as follows: the optimization process was performed for 3000 epochs with a batch size of 70 using stochastic gradient descent (SGD) to minimize loss. We did not set early stopping criteria; however, the learning curves for the training and validation data were checked by plotting loss values for the training data and validation data. When the learning curves did not improve, or showed overfitting, we stopped the optimization process.
We calculated two types of loss: content loss and perceptual loss (Fig. 1c).
Content loss was calculated using the mean absolute error (MAE), because the L1 loss improves robustness to outliers (image noise and artifacts) arising from misalignment between the DRR and FPD images.

The decoder block reconstructed the feature representations of the input data and comprised a combination of upsampling layers, instance normalization layers, ReLU layers, dropout layers [38], and convolutional layers. The upsampling layers doubled the spatial dimensions; subsequently, the number of output channels was halved. A dropout layer (rate = 0.2) was added to avoid overfitting. A convolutional layer with a kernel size of 1 × 1 and a single output channel was added to export a single grayscale image for clinical use; the convolutional kernel size for all other layers was 3 × 3. The shortcut connection is one solution for avoiding vanishing gradients in deep architectures [39-41]. It was applied from the ReLU layer before each stride-2 × 2 convolutional layer in the encoder to the point before the dropout layer in the decoder block; however, it was not applied at the last stride-2 × 2 convolutional layer [42], because the input DRR image and the ground-truth FPD image were not perfectly registered owing to interfractional positional changes (misalignment).
Although the original U-net uses pooling and batch normalization layers [39], we replaced the pooling layers with convolutional layers (stride size of 2 × 2) and the batch normalization layers with instance normalization layers to transfer the image style of the ground-truth image (FPD image) onto the output image (synthetic FPD image).
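Putting the encoder, decoder, and shortcut connections together, a toy U-net-style synthesizer might look like the sketch below. This is an illustration under stated assumptions, not the paper's network: the depth is reduced from five levels, and the paper additionally omits the shortcut at one level to tolerate DRR/FPD misalignment (all skips are kept here for brevity).

```python
import tensorflow as tf

def instance_norm(x):
    mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)
    return (x - mean) / tf.sqrt(var + 1e-5)

def build_unet(size=144, depth=3, base=64):
    """Toy encoder-decoder with concatenated shortcut connections placed
    before a dropout layer in the decoder, ending in a 1x1 conv that
    exports a single grayscale channel."""
    inp = tf.keras.Input((size, size, 1))
    x, skips, ch = inp, [], base
    for _ in range(depth):  # encoder: conv -> ReLU -> instance norm
        x = tf.keras.layers.Conv2D(ch, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.Lambda(instance_norm)(x)
        skips.append(x)
        x = tf.keras.layers.Conv2D(ch * 2, 3, strides=2, padding="same")(x)
        ch *= 2
    for skip in reversed(skips):  # decoder: upsample, halve channels
        ch //= 2
        x = tf.keras.layers.UpSampling2D()(x)
        x = tf.keras.layers.Conv2D(ch, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.Concatenate()([x, skip])  # shortcut connection
        x = tf.keras.layers.Dropout(0.2)(x)
        x = tf.keras.layers.Conv2D(ch, 3, padding="same", activation="relu")(x)
    out = tf.keras.layers.Conv2D(1, 1, padding="same")(x)  # grayscale output
    return tf.keras.Model(inp, out)
```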

Network training
A total of 4000 and 2000 image pairs (DRR and FPD) from the prostate and H&N cases, respectively, were randomly selected for the DNN training process. All FPD and DRR images were resized to 384 × 384 pixels with bicubic interpolation, and pixel values were normalized to the range 0-1. Each pair of FPD and DRR images was then subdivided into subimages (144 × 144 pixels) while varying the position, the rotation angle (± 3.0° in 0.1° steps, plus rotations of ± 90° and 180°), and left-right or up-down flipping. A total of 50,000 and 10,000 subimage pairs were prepared for the prostate and H&N cases, respectively, with care to exclude subimages containing the irradiation port cover, air, or bowel gas.
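The subimage extraction and augmentation described above might be sketched as follows. The angles, subimage size, and flip options follow the text, but the sampling policy and function name are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np
from scipy.ndimage import rotate

def random_subimage_pair(drr, fpd, size=144, rng=None):
    """Cut a matched DRR/FPD subimage pair with random position, a small
    rotation (±3.0° in 0.1° steps), a 90°-multiple rotation, and random
    left-right / up-down flips, applied identically to both images."""
    if rng is None:
        rng = np.random.default_rng()
    angle = rng.integers(-30, 31) * 0.1          # ±3.0°, 0.1° step
    drr = rotate(drr, angle, reshape=False, order=1)
    fpd = rotate(fpd, angle, reshape=False, order=1)
    y = rng.integers(0, drr.shape[0] - size + 1)  # random position
    x = rng.integers(0, drr.shape[1] - size + 1)
    sub_d, sub_f = drr[y:y+size, x:x+size], fpd[y:y+size, x:x+size]
    k = rng.integers(0, 4)                        # 0°, 90°, 180°, 270°
    sub_d, sub_f = np.rot90(sub_d, k), np.rot90(sub_f, k)
    if rng.random() < 0.5:                        # left-right flip
        sub_d, sub_f = np.fliplr(sub_d), np.fliplr(sub_f)
    if rng.random() < 0.5:                        # up-down flip
        sub_d, sub_f = np.flipud(sub_d), np.flipud(sub_f)
    return sub_d, sub_f
```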

Treatment couch
The edge of the treatment couch included on the FPD image increased pixel values (marked as light green lines in Fig. 1a and c). The treatment couch positions on the FPD and DRR images did not always match because of small differences in patient positioning; in the worst case, the treatment couch edge was absent from some images, resulting in large pixel value inconsistencies between the input and ground-truth images. To avoid this, we applied image processing to the treatment couch region.

The learning rate, momentum, and decay were set to 10⁻⁵, 0.9, and 10⁻⁵, respectively. The learning rate was decreased by 2 × 10⁻⁹ every 4 epochs. The deep learning framework TensorFlow 2.4 was used in a Windows 10 64-bit environment with a single GPU on the NVIDIA Quadro RTX 8000 board.
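The optimizer settings stated above can be expressed as a short Keras configuration sketch. The schedule helper is an assumption about how the stepwise decrement was implemented; the paper's additional weight decay of 10⁻⁵ is noted in a comment because the corresponding argument differs across TensorFlow versions.

```python
import tensorflow as tf

# SGD settings from the text: learning rate 1e-5, momentum 0.9
# (the paper also applies a decay of 1e-5; the API for this varies
# across TensorFlow versions, so it is omitted here)
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-5, momentum=0.9)

def schedule(epoch, lr0=1e-5, step=2e-9):
    """Learning rate decreased by 2e-9 every 4 epochs, as stated."""
    return lr0 - step * (epoch // 4)

lr_callback = tf.keras.callbacks.LearningRateScheduler(schedule)
```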

Post-processing
The ground-truth FPD images contained image noise from scattered radiation, which the predicted FPD image did not fully reproduce. We therefore added image noise to the predicted FPD image.
The ground-truth FPD detected scattered radiation at the original image size (768 × 768); however, all training data were resized by half (384 × 384). The predicted FPD image was therefore resized to 768 × 768, and Gaussian image noise was added (mean: 0; standard deviation: 0.001 and 0.0005 for the prostate and H&N cases, respectively). Finally, the synthetic FPD image with noise was resized back to 384 × 384.
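The post-processing steps above can be sketched as follows; this is a minimal illustration with a cubic-spline resize standing in for the paper's interpolation, and the function name is hypothetical.

```python
import numpy as np
from scipy.ndimage import zoom

def add_scatter_noise(pred_384, sigma=0.001, rng=None):
    """Upsample the predicted FPD image to 768x768, add Gaussian noise
    (mean 0; sigma 0.001 for prostate, 0.0005 for H&N), then resize
    back to 384x384, as described in the text."""
    if rng is None:
        rng = np.random.default_rng()
    up = zoom(pred_384, 2.0, order=3)            # 384 -> 768
    noisy = up + rng.normal(0.0, sigma, up.shape)
    return zoom(noisy, 0.5, order=3)             # 768 -> 384
```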

Evaluations
We evaluated the quality of the synthetic FPD images using 2700 and 640 ground-truth FPD images for the prostate and H&N cases, respectively; these image data differed from the training data. The synthetic FPD images were compared with the ground-truth FPD images using the MAE, the peak signal-to-noise ratio (PSNR), and the structural similarity index measure (SSIM) [44], metrics that are widely used to quantify the similarity between two images. We also compared the image quality of the synthetic FPD image with that of the DRR image to understand the performance of our DNN, and evaluated the computation time for the prediction (not including the model file import).

Content loss (L_content) was defined as:

L_content = (1/n) Σi |Ii_true − Ii_pred|,  (1)

where Ii_true and Ii_pred are the ith pixel values in the ground-truth and predicted images, respectively, and n is the total number of pixels in the image.

Perceptual loss was assessed using a pre-trained VGG19 model [43] as a feature extractor. It was calculated as the sum over six feature maps output from the respective convolutional layers (Fig. 1b); the error for each feature map was computed as the mean square error between the features of the ground-truth FPD image and those of the predicted FPD image. Perceptual loss (L_perceptual) was defined as:

L_perceptual = Σk∈P MSE(Vk(I_true), Vk(I_pred)),  (2)

where Vk(I_true) and Vk(I_pred) are the features of the kth layer of the VGG network for the ground-truth and predicted images, respectively, and P = {1, 2, 5, 10, 15, 20}.

Finally, we calculated the total loss as the weighted sum of the two terms:

L_total = w_content L_content + w_perceptual L_perceptual,  (3)

where the weight factors w_content and w_perceptual were adjusted manually.
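The content, perceptual, and total losses described above might be sketched as follows in TensorFlow. This is an illustrative sketch, not the authors' implementation: `weights=None` keeps it self-contained (the paper used a pre-trained VGG19), the input size and the default weight factors are assumptions, and single-channel images are tiled to the three channels VGG expects.

```python
import tensorflow as tf

def content_loss(y_true, y_pred):
    """L1 (MAE) content loss: robust to outliers from DRR/FPD misalignment."""
    return tf.reduce_mean(tf.abs(y_true - y_pred))

def make_perceptual_loss(layer_ids=(1, 2, 5, 10, 15, 20)):
    """Perceptual loss summing feature MSEs over VGG19 layers P."""
    vgg = tf.keras.applications.VGG19(include_top=False, weights=None,
                                      input_shape=(144, 144, 3))
    extractor = tf.keras.Model(vgg.input,
                               [vgg.layers[k].output for k in layer_ids])
    def loss(y_true, y_pred):
        # tile grayscale images to the 3 channels VGG expects
        f_true = extractor(tf.tile(y_true, [1, 1, 1, 3]))
        f_pred = extractor(tf.tile(y_pred, [1, 1, 1, 3]))
        return tf.add_n([tf.reduce_mean(tf.square(a - b))
                         for a, b in zip(f_true, f_pred)])
    return loss

def total_loss(y_true, y_pred, perceptual, w_c=1.0, w_p=1.0):
    """Weighted sum of content and perceptual terms (weights tuned by hand)."""
    return w_c * content_loss(y_true, y_pred) + w_p * perceptual(y_true, y_pred)
```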

Prostate region
Large collections of bowel gas and the irradiation port cover edge were visualized on the ground-truth FPD image (marked as yellow and red arrows, respectively, in Fig. 3a). These regions were not visible in the same positions on the DRR image (Fig. 3b). The same anatomical shape was extended in the CT slice direction (superior to inferior) because of the extended CT image region (marked as a blue dotted rectangle in Fig. 3b and d). The quality of the input DRR image compared with the ground-truth FPD image was MAE = 0.35, PSNR = 8.43 dB, and SSIM = 0.67. The synthetic FPD image was close to the ground-truth FPD image (Fig. 3c); the metrics improved to 0.13, 16.05 dB, and 0.64, respectively. Since the port edge was not contained in the input DRR image, it was not visualized on the synthetic FPD image. However, bony structures with very low pixel values on the input DRR image were not completely reproduced at FPD image quality (marked as yellow arrows in Fig. 3c). To clarify the differences between the ground-truth and synthetic FPD images, we subtracted the synthetic FPD image from the ground-truth FPD image (Fig. 3d); bowel gas positional variations caused large pixel value differences.

Results for image quality averaged over all prostate cases are summarized in Fig. 4 and Table 1. The box ranges between the 25th and 75th percentiles for the synthetic FPD image were lower than those for the DRR image for all image quality metrics (Fig. 4a and c). MAE for the synthetic FPD image was improved (= 0.12 ± 0.02) from that for the input DRR image (= 0.35 ± 0.08) (Fig. 4a). The synthetic FPD image showed a higher PSNR value (= 16.81 ± 1.54 dB) than the DRR image (= 8.74 ± 1.56 dB) (Fig. 4b). Although SSIM values for both images were almost the same (= 0.69), the number of outliers for the synthetic FPD image was smaller than that for the DRR image (Fig. 4c). Computation time was 89.6 ± 11.0 msec (Table 1).


Head and neck region
For the H&N region, the ground-truth FPD and input DRR images are shown in Fig. 5a and b, respectively. The irradiation port cover edge and earlobe were observed in the horizontal direction (marked with red and blue arrows in Fig. 5a, respectively); however, these were not included in the input DRR image (Fig. 5b). The skull curvature emphasized the range of thickness depending on the angle of the incident beam to the skull surface shape (marked as red arrows in Fig. 5b). Quality metrics of the input DRR image compared with the ground-truth FPD image were 0.60, 4.25 dB, and 0.38 for MAE, PSNR, and SSIM, respectively. However, the quality of the synthetic FPD image was much closer to that of the ground-truth FPD image by visual inspection, and the image quality metrics were also improved (MAE 0.04, PSNR 25.93 dB, and SSIM 0.86) compared with those for the input DRR image (Fig. 5c). The irradiation port cover edge and earlobe were not visualized on the synthetic FPD image. The smoothness of the skull curvature was improved to the same degree as that of the ground-truth FPD image. Figure 5d shows image differences between the input DRR and synthetic FPD images; relatively large pixel value differences were attributable to misalignment (red arrow in Fig. 5d).

Averaged over all H&N cases, all image quality metrics for the synthetic FPD images were improved (MAE 0.08 ± 0.03, PSNR 19.40 ± 2.83 dB, and SSIM 0.80 ± 0.04) compared with those for the DRR images (MAE 0.48 ± 0.11, PSNR 5.74 ± 1.63 dB, and SSIM 0.52 ± 0.09), and the computation time was 90.3 ± 10.6 msec (Table 1).

Objects that were not visualized on the DRR image were not visualized on the synthetic FPD image. For example, the earlobe was absent from the synthetic FPD image because it was absent from the input DRR image (Fig. 5). If an object that was invisible on the input DRR image but visible on the ground-truth FPD image had been visualized on the synthetic FPD image, the DNN would have performed well on the training data but not on other images (overfitting). To avoid this problem, we checked the learning curve and the predicted images during the network training process.

Ideally, however, the DRR image should be created so as to visualize the objects that appear on the ground-truth FPD image; this would bring the quality of the synthetic FPD image closer to that of the ground-truth FPD image.
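The MAE, PSNR, and SSIM comparisons reported above can be computed as in the following sketch (illustrative, using scikit-image rather than the authors' evaluation code; the function name is hypothetical).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def compare_images(ground_truth, candidate):
    """MAE / PSNR / SSIM between a ground-truth FPD image and either a
    synthetic FPD or a DRR image (pixel values normalized to 0-1)."""
    mae = float(np.mean(np.abs(ground_truth - candidate)))
    psnr = peak_signal_noise_ratio(ground_truth, candidate, data_range=1.0)
    ssim = structural_similarity(ground_truth, candidate, data_range=1.0)
    return mae, psnr, ssim
```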
We optimized DNN parameters for each region, given the significant differences in anatomical structures and image contrast. In our next study, we plan to evaluate the application of a single DNN across multiple anatomical sites.

Interfractional variation
We used DRR and ground-truth FPD images for the DNN training process because we wanted the DNN to generate realistic image quality. Since the planning CT and ground-truth FPD images were acquired on different days, organ and bowel gas positions differed interfractionally. We excluded image regions affected by interfractional changes from the training subimages; however, this might not exclude the differences completely, especially for moving bowel gas. As long as original FPD images are used as training data, this problem cannot be completely resolved. One approach would be to calculate a mimic FPD image by a Monte Carlo simulation-based DRR calculation, such that the quality of the mimic FPD image is close to that of the original FPD image. Such a mimic FPD image would not include interfractional or intrafractional anatomical changes or misalignment between the mimic FPD image and the planning CT image, so this approach could also be applied to the thoracoabdominal region. However, the quality of a mimic FPD image would not be completely identical to that of an original FPD image, and it is questionable whether sufficient synthetic FPD image quality can be obtained from a DNN trained with mimic FPD images.

Loss function
The DRR image had lower spatial resolution and less image noise than the FPD image; it is thus more difficult to predict a synthetic FPD image from a DRR image than a synthetic DRR image from an FPD image. Generally, L1 and L2 losses (such as the MAE and mean square error [MSE]) are used as loss functions for image synthesis. However, these loss functions evaluate distortion only and can therefore predict blurred images similar to the DRR image. To reproduce the information and minute structures included in the original FPD image, it is necessary to evaluate image quality in terms of both distortion and perceptual quality [45]. We used content loss and perceptual loss to evaluate distortion and quality, respectively. The weight factors for the respective loss functions were adjusted manually in this study (Eq. 3); when more than two hyper-parameters are used, however, it is arduous to determine them manually. One solution would be a single dynamic loss function combining several loss functions, such as in a generative adversarial network (GAN). GANs are also widely used for image synthesis with U-net generators [28,46]. Although we did not use a GAN in this study, a GAN could also generate synthetic FPD images.

Discussion
We developed a DNN to synthesize FPD images from DRR images and compared the image quality of the synthetic and original FPD images for the prostate and H&N regions. The quality of the synthetic FPD images was close to that of the original FPD images. Computation time for the prediction was approximately 90 msec per image on average and was almost the same for the prostate and H&N regions.

Image quality
We evaluated the quality of the synthetic FPD images using the MAE, PSNR, and SSIM. The image quality metrics of the synthesized FPD images were improved over those of the DRR images for both the prostate and H&N regions, except for the SSIM in the prostate region, where the values for the synthetic FPD image and the DRR image were roughly the same (= 0.69). SSIM compares local patterns of pixel intensities that have been normalized for luminance and contrast, and therefore reflects changes in structural information; because the preservation or loss of structural information is critical for this metric, bowel gas positional change between images might be a major cause of this result. Also, we did not evaluate metal implants or prosthesis replacements in this study, because we did not have image data that included them.

Acknowledgements …Technology, and of our institute hospitals, for their support and discussion. We thank Libby Cone, MD, MA, from DMC Corp. (www.dmed.co.jp) for editing drafts of this manuscript.
Funding None.

Declarations
Conflict of interest Drs. Hirai and Sakata are employed by the Toshiba Corporation, Kawasaki, Japan.
Ethical approval The study was approved by the Institutional Review Board of our institution (N21-001).

Applications to other IGRTs
Our group developed machine learning-based markerless tumor tracking software and integrated it into clinical applications [10,11,47]. The machine learning training process was performed using DRR images, while the markerless tumor tracking detects the tumor position on FPD images in real time. The use of different image modalities in the training and prediction processes makes it difficult to detect the tumor position. If the markerless tracking training process used synthetic FPD image data instead of DRR image data, tracking detection accuracy could be improved.

Conclusion
Synthesizing images is challenging here because of the significantly lower contrast of FPD images and the substantial difference in image quality between the two modalities. The DRR image, created from the CT image with a fixed value for each material, provides consistent image quality; the quality of the FPD image, however, depends on the object size and imaging conditions (such as kV and mAs) and can therefore vary. This variability presents a greater challenge in our study than in other image synthesis tasks such as MRI-to-CT conversion. Despite these challenges, our DNN successfully generated an FPD image from a DRR image, with quality close to that of the ground-truth FPD image. This technique has the potential to boost throughput when images from two different modalities are visually compared in IGRT for both photon and particle beam therapy.