Patients and image acquisition
Image data from 283 patients with prostate tumours treated at our centre were randomly selected. The study was conducted with the approval of our Institutional Review Board (N21-002) and performed in accordance with the Declaration of Helsinki. All patients were immobilized on the treatment couch with a urethane resin cushion (Moldcare®, Alcare, Tokyo, Japan) and low-temperature thermoplastic shells (Shell Fitter®, Kuraray Co., Ltd., Osaka, Japan). No water restriction was imposed, and the rectum was emptied by the patient's own effort or with a laxative or enema. Prior to treatment, two or three fiducial markers were implanted into the prostate.
Treatment planning CT image
Planning CT image data were acquired under breath-hold in exhalation using a 320-detector CT (Aquilion One Vision, Canon Medical Systems, Otawara, Japan) in the simulation room. CT imaging conditions were based on our clinical protocols, with a tube voltage of 120 kV. X-ray tube current was adjusted by automatic exposure control [10]. The reconstructed CT image pixel size and the slice thickness were 0.976 mm × 0.976 mm and 2.0 mm, respectively.
X-ray image
X-rays were acquired with a pair of X-ray tubes installed in the treatment room and two indirect flat panel detectors (FPDs) (PaxScan 3030+®, Varian Medical Systems, Palo Alto, CA, USA) with amorphous silicon receptors, installed on the right and left sides of the vertical irradiation port at 35° and − 35°, respectively. Image size was 768 × 768 pixels, and pixel size was 388 µm (2 × 2 binning of the original 194 µm pixels). The distance between the isocentre of the room and the X-ray tube (ISO) was 1690 mm, and the source-image receptor distance (SID) was 2390 mm. This allowed acquisition of an image measuring approximately 210 × 210 mm.
X-ray images used in this study were acquired after the patient setup verification process using 2D-3D image registration software [11]. This software coregistered patient anatomical structures on X-ray images to those on the reference DRR images which were generated from the planning CT.
Training data
Three types of DRR images were generated for the training input data (upper panel in Fig. 1). All DRR images were generated using our in-house software [12], which was programmed in CUDA (Compute Unified Device Architecture, ver. 10.1, NVIDIA Corporation, Santa Clara, CA, USA) with Microsoft Visual Studio 2013 (Microsoft Corp., Redmond, WA, USA) in a Windows 10 environment, using a graphics processing unit (GPU) on an NVIDIA board (GeForce GTX 1080, NVIDIA Corporation, Santa Clara, CA, USA).
Pre-processing
The planning CT was shifted so that the centre of the tumour coincided with the isocentre of the radiation field. Because the treatment couch positions on the planning CT and X-ray images were not always identical for the same point in the patient, image processing was applied to remove the treatment couch from the planning CT data before generating the DRR images. In addition, some CT data could contain missing areas when converted to DRR images because the number of slices was insufficient. The first and last slices were therefore extended, based on the CT central slice, to ensure a sufficient number of CT slices.
All-density DRR image
An all-density DRR image was generated by projecting the planning CT data (converting HU numbers to X-ray attenuation coefficients) along the virtual X-ray source-to-FPD beam path.
$$q = \sum_{k=1}^{n} \Delta L \cdot \mu_{k} \qquad (1)$$
where q is the ray-sum value at a given pixel position on the DRR image, μ<sub>k</sub> is the X-ray attenuation coefficient at the k-th sample point along the ray, n is the number of sample points, and ΔL is the calculation grid size (= 1 mm in this study).
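The ray-sum projection in Eq. (1) can be sketched as a simple accumulation of attenuation coefficients sampled along one virtual source-to-FPD ray. This is a minimal illustration, not the in-house GPU implementation: the function name, nearest-neighbour sampling, and voxel-coordinate conventions are assumptions.

```python
import numpy as np

def ray_sum(mu_volume, start, direction, n_steps, grid_size=1.0):
    """Accumulate attenuation along one virtual ray (Eq. 1): q = sum(dL * mu_k).

    mu_volume : 3D array of attenuation coefficients (converted from HU).
    start     : ray entry point in voxel coordinates (z, y, x).
    direction : unit vector of the ray in voxel coordinates.
    n_steps   : number of sample points n along the ray.
    grid_size : calculation grid size dL in mm (1 mm in this study).
    """
    q = 0.0
    pos = np.asarray(start, dtype=float)
    step = np.asarray(direction, dtype=float) * grid_size
    for _ in range(n_steps):
        # Nearest-neighbour sampling of the volume (an assumption; real DRR
        # generators typically use trilinear interpolation).
        idx = np.round(pos).astype(int)
        if all(0 <= idx[d] < mu_volume.shape[d] for d in range(3)):
            q += grid_size * mu_volume[tuple(idx)]
        pos += step
    return q
```

For a uniform volume with μ = 0.02 mm⁻¹ traversed over 100 steps of 1 mm, the ray sum is 100 × 1 mm × 0.02 mm⁻¹ = 2.0.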
The projection angles, pixel size, and image size were ± 35°, 388 × 388 µm, and 768 × 768 pixels, respectively. To reflect the pixel-value variation in X-ray images caused by variations in the X-ray imaging dose, we generated DRR images at three different image qualities by changing the weighting of the HU numbers. For each quality, three additional DRR images were generated by randomly shifting the position ± 5 mm and ± 2° along the respective axes from the reference position. A total of 24 DRR images (= 2 directions × 3 image qualities × 4 positions) were therefore generated from one CT volume dataset.
Bone DRR image
Patient position was registered as closely as possible to the reference position before treatment beam irradiation. Higher-contrast structures such as bones affect the extracted feature map more than lower-density structures. Because misalignment of the bony structures could decrease DNN segmentation accuracy, bone DRR images were generated from regions of ≥ 100 HU in the CT data and positioned randomly within ± 5 mm along the respective axes from the position of the all-density DRR image, which had already been randomly shifted ± 5 mm and ± 2°.
Gas DRR image
Bowel gas collections on the planning CT were segmented using a threshold of − 300 HU. Gas DRR images were then generated to match the locations of the all-density DRR images that had already been randomly shifted.
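The two HU thresholds above, ≥ 100 HU for bone and < − 300 HU for bowel gas (gas lies near − 1000 HU, soft tissue near 0 HU), amount to simple binary masks over the CT volume. A minimal sketch, with an illustrative helper name:

```python
import numpy as np

def threshold_masks(ct_hu):
    """Binary masks used before DRR projection: bone as >= 100 HU,
    bowel gas as < -300 HU. ct_hu is a CT volume in Hounsfield units."""
    bone_mask = ct_hu >= 100
    gas_mask = ct_hu < -300
    return bone_mask, gas_mask
```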
Post-processing
All datasets were resized to 256 × 256 pixels, and the pixel values were normalised to the range 0–1. We randomly selected one all-density DRR image from the three image-quality groups and applied gamma correction with a correction factor drawn randomly from 0.3–1.7. Finally, we added uniform random noise in the range 0–0.01, and the pixel values were normalised to the range 0–1 again.
Network architecture
Our DNN was a modified 2D convolutional autoencoder with skip connections (U-Net). It output gas segmentation data from the input all-density DRR and bone DRR images (Fig. 2a). The DNN consisted of two ‘encoder blocks’ and one ‘decoder block’, which sequentially shared the features of the two images (FuseUNet) [13].
The two encoder sections performed feature extraction on the input all-density DRR image and the input bone DRR image; this approach has often been used in image segmentation models [14]. Each encoder comprised three repetitions of a convolutional layer with a kernel size of 3 × 3 pixels (Conv), a batch normalization layer (BN) [15], a rectified linear unit layer (ReLU) [16] and a max-pooling layer (Pooling), followed by two further Convs. The feature map of the bone DRR was merged with the feature map of the all-density DRR in a ‘Fuse Block’, which doubles the number of channels by concatenating the feature maps (Fig. 2b). Each stream was then reduced to half its spatial dimensionality by the Pooling layer. Most methods double the number of channels after the Pooling layer, but we kept the number of channels unchanged because the Fuse Block already doubled it. The encoder channel transitions were (64, 128, 256, 512, 1024) for the all-density DRR stream and (64, 128, 256, 512) for the bone DRR stream.
In the decoder section, a Conv was used to match the number of channels of the encoder feature map combined in each skip connection, and an Upsampling layer doubled the spatial dimensionality. The outermost skip connection was, however, removed to prevent features that had not been fully extracted from directly affecting the output. The Fuse Block features from the encoder were then combined to share the spatial location information of the bowel gas, and the Conv, BN, and ReLU were repeated as in the encoder section. The final layer was a Conv with a kernel size of 1 × 1 followed by a sigmoid activation function.
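The Fuse Block's channel doubling by concatenation and the subsequent 2 × 2 pooling can be illustrated at the level of feature-map shapes. This is a shape-only NumPy sketch under our reading of Fig. 2b, not the trained TensorFlow network; the function names are assumptions.

```python
import numpy as np

def fuse_block(feat_all, feat_bone):
    """Fuse Block: concatenate the two streams' (H, W, C) feature maps
    along the channel axis, doubling the number of channels."""
    return np.concatenate([feat_all, feat_bone], axis=-1)

def max_pool_2x2(feat):
    """2 x 2 max pooling: halve each spatial dimension of an (H, W, C) map."""
    h, w, c = feat.shape
    return feat.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))
```

For example, fusing two 64-channel maps yields a 128-channel map, and pooling then halves each spatial dimension while leaving the channel count unchanged, consistent with the stated channel transitions.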
Network training
We used 6688 image sets (all-density DRR, bone DRR, and gas DRR images) for 209 cases. The DNN parameters were optimized to predict gas DRR images from the all-density and bone DRR images. Adam was used as the optimisation method with a learning rate of 0.0001. The batch size was set to 16 and the number of epochs to 2000.
We used the Focal Tversky loss, which addresses the class-imbalance problem [17]. The Tversky Index allows flexible adjustment of the balance between false positives (FPs) and false negatives (FNs) in the Dice score.
$$TI_{c}=\frac{\sum_{i=1}^{N} p_{ic}\, g_{ic}+\epsilon}{\sum_{i=1}^{N} p_{ic}\, g_{ic}+\alpha \sum_{i=1}^{N} p_{i\bar{c}}\, g_{ic}+\beta \sum_{i=1}^{N} p_{ic}\, g_{i\bar{c}}+\epsilon} \qquad (2)$$
$$FTL=\sum_{c}\left(1-TI_{c}\right)^{\gamma} \qquad (3)$$
where \(p_{ic}\) is the predicted probability that the i th pixel belongs to the bowel gas class c and \(p_{i\overline{c}}\) the probability that it does not, and g is defined in the same way for the ground truth. N is the total number of pixels in the image, and ε is a small constant that prevents division by zero. The Focal Tversky loss is the Tversky Index with the focusing parameter γ applied.
The parameters α and β adjust the balance between FPs and FNs, and γ adjusts the balance with the background area. In this study, α = 0.7, β = 0.3, and γ = 0.75.
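Equations (2) and (3) for a single foreground class, with the parameter values above, can be written directly in NumPy. This is an illustrative re-implementation, not the authors' TensorFlow training code; the function name is an assumption.

```python
import numpy as np

def focal_tversky_loss(pred, truth, alpha=0.7, beta=0.3, gamma=0.75, eps=1e-7):
    """Focal Tversky loss for one foreground class (Eqs. 2-3).

    pred  : predicted bowel-gas probabilities in [0, 1].
    truth : binary ground-truth mask (1 = gas).
    Following Eq. (2), alpha weights the false-negative term
    (p_not-c * g_c) and beta the false-positive term (p_c * g_not-c);
    gamma focuses training on hard, low-TI examples.
    """
    p, g = pred.ravel(), truth.ravel()
    tp = np.sum(p * g)                # soft true positives
    fn = np.sum((1.0 - p) * g)        # missed gas pixels
    fp = np.sum(p * (1.0 - g))        # gas predicted on background
    ti = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return (1.0 - ti) ** gamma
```

A perfect prediction gives TI = 1 and a loss of 0; a completely wrong prediction gives TI ≈ 0 and a loss close to 1.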
We used the deep learning framework TensorFlow 2.6 with an NVIDIA RTX A5000 GPU (24 GB VRAM) on Ubuntu 18.04 LTS.
Evaluations
In clinical treatment, our DNN predicts bowel gas collections from X-ray images acquired in the treatment room. However, the low quality of real X-ray images can cause manual delineation errors, making it difficult to evaluate DNN detection accuracy quantitatively. In a similar situation, other researchers [18] used DRR images rather than real X-ray images for evaluation, but the qualities of their training and test data were identical and did not reflect an actual clinical situation.
We used synthetic X-ray images converted from DRR images using a pre-trained DNN previously developed by our group (unpublished data) (lower panel in Fig. 1). The synthetic X-ray images provided accurate ground-truth gas segmentation data without any delineation error, allowing our DNN segmentation accuracy to be evaluated quantitatively. A total of 102 image sets (synthetic X-ray, bone DRR, and gas DRR images) were obtained for 51 cases, all of which differed from the training data.
Moreover, we evaluated DNN segmentation accuracy on real X-ray images, even though the ground-truth gas collections might include delineation error. To minimize this error, one person delineated the bowel gas collections on the X-ray images, and a certified medical physicist with over 20 years of clinical experience checked them carefully and modified them manually where required. A total of 102 X-ray images were obtained for the 51 cases.
Bowel gas segmentation accuracy was evaluated using recall, precision, Intersection over Union (IoU) and Dice coefficient.
$$IoU=\frac{TP}{TP+FN+FP} \qquad (4)$$
$$Recall=\frac{TP}{TP+FN} \qquad (5)$$
$$Precision=\frac{TP}{TP+FP} \qquad (6)$$
$$Dice=\frac{2TP}{2TP+FP+FN} \qquad (7)$$
where TP, TN, FP and FN are true positive, true negative, false positive, and false negative, respectively.
Since the DNN outputs a bowel gas probability map with a range of 0–1, a probability threshold must be determined to select bowel gas collections. We calculated Dice coefficients as a function of the probability threshold over the range 0–1 in steps of 0.005 using the training data. The Dice coefficient was used because it is an evaluation index that considers the balance between FPs and FNs [19, 20].
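The threshold selection described above, sweeping probability thresholds in 0.005 steps and keeping the one that maximises the Dice coefficient (Eq. 7), can be sketched as follows. The function names are illustrative; the sweep would be run over the training data.

```python
import numpy as np

def dice(pred_mask, truth_mask):
    """Dice coefficient 2TP / (2TP + FP + FN) for binary masks (Eq. 7)."""
    tp = np.sum(pred_mask & truth_mask)
    fp = np.sum(pred_mask & ~truth_mask)
    fn = np.sum(~pred_mask & truth_mask)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

def best_threshold(prob_map, truth_mask, step=0.005):
    """Sweep probability thresholds over 0-1 in `step` increments and return
    the threshold that maximises the Dice coefficient."""
    thresholds = np.arange(0.0, 1.0 + step, step)
    scores = [dice(prob_map >= t, truth_mask) for t in thresholds]
    return thresholds[int(np.argmax(scores))]
```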