Predictive gamma passing rate of 3D detector array-based volumetric modulated arc therapy quality assurance for prostate cancer via deep learning

1. Introduction

Intensity-modulated radiation therapy (IMRT) and volumetric modulated arc therapy (VMAT) are modern radiation therapy techniques that are widely used in many institutions [1, 2]. These techniques are advantageous for many treatment sites owing to their high dose conformity to the target volume while sparing organs at risk. Because IMRT and VMAT beams are complex, several guidelines recommend measurement-based verification, using an ionization chamber together with film or a multi-dimensional detector array, as patient-specific quality assurance (QA) before treatment [3–5]. Measurement-based QA requires device setup and irradiation, both of which are time consuming in clinical QA practice [6]. Furthermore, when the result of a verification does not meet the acceptance criteria, it is necessary to analyze the reasons for the failure, create an alternative plan, and perform the measurements again. Many factors, including the complexity of the treatment plan, quality of the beam modeling, and accuracy of the phantom setup, can cause patient-specific QA to fail [7–9]. Analyzing these causes is difficult and time consuming. Nevertheless, QA needs to be performed successfully as part of the standard clinical workflow.

Several authors have reported on the relationship between the accuracy of dose delivery, expressed for example as the gamma passing rate (GPR), and IMRT complexity metrics, including the modulation complexity score (MCS) devised by McNiven et al., multileaf collimator (MLC) travel, gantry speed, and other parameters of a treatment plan [10–13]. Shiba et al. proposed another approach to predicting the GPR using a dose uncertainty potential accumulation technique [14, 15]. Dosimetry-based prediction methods using artificial intelligence have also been reported to predict the GPR well. Interian et al. predicted the GPR of a two-dimensional (2D) detector array via a convolutional neural network (CNN) using the fluence maps of IMRT plans [16]. Tomori et al. predicted the GPR of Gafchromic film via a CNN model using the planar dose distribution on a QA phantom, regions of interest, and monitor units for 60 seven-field IMRT plans [17]. These studies predicted the GPR of 2D IMRT measurements via deep learning based on the dose information on the detector. On the other hand, 3D VMAT measurement is the mainstream of IMRT QA in actual clinical practice [18, 19]. Multi-dimensional detectors for 3D VMAT measurements, such as ArcCHECK (Sun Nuclear Corporation, Melbourne, FL, USA) or Delta4 (ScandiDos, Uppsala, Sweden), have been designed to sample the entire beam area. These detectors can verify the dose distribution with a true composite method, as recommended by the American Association of Physicists in Medicine Task Group 218 report (TG-218) [20]. Therefore, a deep learning-based prediction model for the GPR, extended to VMAT QA using a 3D detector array, may be a useful clinical support tool. Additionally, the previous studies evaluated the accuracy of the prediction using the deviation between the measured and the predicted GPR. An accuracy evaluation based on deviation alone does not address the tendency of the prediction error, such as a bias toward overestimation or underestimation of the GPR. To apply a CNN model to clinical practice, clarifying this prediction tendency is as important as the prediction accuracy.

In this study, we developed a prediction model for the GPR of the prostate VMAT QA using a 3D detector array via an algorithm based on deep learning using a dose distribution on a cylindrical detector plane extracted from the 3D dose distribution. A CNN model is proposed to predict the GPR for four tolerances of 3%/3 mm, 3%/2 mm, 2%/3 mm, and 2%/2 mm, and the accuracy of the prediction model is evaluated. In addition, a cumulative frequency histogram of the prediction error was created to clarify the bias of prediction error.

2. Materials And Methods

2.1 Clinical equipment and workflow for patient-specific verification

In this study, the use of clinical materials was approved by the Institutional Review Board of Hiroshima University (E-1656-3). The schematic workflow for predicting the GPR is depicted in Fig. 1. One hundred thirty-five prostate VMAT plans with dual arcs and 2-degree control-point spacing were collected retrospectively. All plans were created using the Eclipse treatment planning system (TPS) ver. 13.5 (Varian Medical Systems, Palo Alto, CA, USA). The prescription dose to the planning target volume was 74 Gy (2 Gy/37 fractions) or 78 Gy (2 Gy/39 fractions). All plans used 10-MV X-rays of the TrueBeam linear accelerator (Varian Medical Systems, Palo Alto, CA, USA) with a Millennium 120 MLC. Optimization and dose computation were performed with a 2.5-mm grid spacing. The Acuros XB algorithm ver. 13.5 was used for dose calculation.

The ArcCHECK dosimetry system was used for verification measurements. ArcCHECK is a helical 3D detector array consisting of 1386 diodes, each with a 0.8 $\times$ 0.8-mm² active area, embedded in the cylindrical wall of a phantom. A mass density of 1.105 was assigned to the phantom for dose calculation with Acuros XB. The verification plan parameters on the ArcCHECK phantom were exported in the digital imaging and communications in medicine (DICOM) format from the TPS to the linear accelerator and to the cylindrical dose generator described in Section 2.2.

For the verification measurement, a MultiPlug™, made of polymethyl methacrylate, was inserted into the ArcCHECK phantom. Dose calibration was performed before each verification measurement. The accuracy of the dose distribution of all QA plans was verified using gamma analysis [21]. Tolerances of 3%/3 mm, 3%/2 mm, 2%/3 mm, and 2%/2 mm were used with a 10% dose threshold, absolute dose mode, and global normalization. The 3%/2-mm tolerance is the one recommended in the TG-218 report [20]. The measured GPR (mGPR) was calculated using the SNC Patient software (Sun Nuclear Corporation, Melbourne, FL, USA).
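To make the gamma criteria above concrete, the following is a minimal one-dimensional sketch of a gamma passing rate with global normalization and a low-dose threshold. It is not the SNC Patient implementation: the function name, the exhaustive search over reference points, and the choice to apply the threshold to the measured dose relative to the maximum calculated dose are all assumptions made for illustration.

```python
import math

def gamma_passing_rate(ref_dose, meas_dose, positions_mm,
                       dose_tol_pct=3.0, dta_mm=2.0, threshold_pct=10.0):
    """Percentage of evaluated points with gamma <= 1 (1D, global normalization)."""
    max_ref = max(ref_dose)
    dose_tol = dose_tol_pct / 100.0 * max_ref   # global normalization (assumed: max calculated dose)
    cutoff = threshold_pct / 100.0 * max_ref    # low-dose threshold (assumed: applied to measured dose)
    passed = evaluated = 0
    for x_m, d_m in zip(positions_mm, meas_dose):
        if d_m < cutoff:
            continue  # points below the threshold are excluded from the analysis
        evaluated += 1
        # Gamma is the minimum combined distance/dose metric over all reference points.
        gamma = min(
            math.sqrt(((x_r - x_m) / dta_mm) ** 2 + ((d_r - d_m) / dose_tol) ** 2)
            for x_r, d_r in zip(positions_mm, ref_dose)
        )
        if gamma <= 1.0:
            passed += 1
    return 100.0 * passed / evaluated if evaluated else float("nan")

# Hypothetical 1D profiles sampled every 1 mm (illustrative numbers only).
positions = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
calculated = [0.0, 10.0, 50.0, 100.0, 50.0, 10.0, 0.0]
measured = [0.0, 10.0, 50.0, 200.0, 50.0, 10.0, 0.0]
print(gamma_passing_rate(calculated, measured, positions))
```

In this toy profile five points exceed the threshold and only the central point, with a gross dose error, fails the 3%/2-mm criterion.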

2.2 CNN dataset

A schematic of the input data creation for the CNN model is depicted in Fig. 2. The cylindrical dose distribution, $D_{\text{Cyl}}(\theta, Z)$, was calculated from the 3D dose distribution, $D_{\text{Dcm}}(X, Y, Z)$, of the verification plans exported from the TPS to our in-house cylindrical dose generator software using

$$D_{\text{Cyl}}(\theta, Z) = D_{\text{Dcm}}(R\sin\theta, R\cos\theta, Z)$$

where $R$ is the distance from the gantry-rotation axis to the detector elements (10.4 cm), and $Z$ is the longitudinal coordinate. Angle $\theta$ ranged from $-180^\circ$ to $180^\circ$. The cylindrical dose was saved in 8-bit portable network graphics (PNG) format with 360 $\times$ 211 pixels. The CNN model analyzed the cylindrical dose and calculated a predicted GPR (pGPR), i.e., a predicted value of mGPR. The mGPR values were normalized to the range 0.0 to 1.0. Of the 135 cases selected for this study, 110 were used for modeling (training and validation), and the remaining 25 were used for testing the trained CNN model. The 110 modeling cases were divided into five folds for cross-validation. The 25 test cases were used only for verifying the prediction accuracy.
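The cylindrical sampling described above can be sketched as follows. This is not the in-house cylindrical dose generator: the function names, nearest-neighbour sampling, the assumption that the gantry-rotation axis passes through a known voxel index, and the scaling of the 8-bit image to its own maximum are illustrative assumptions. Only the radius (10.4 cm), the angular range, and the 2.5-mm grid come from the text.

```python
import numpy as np

def cylindrical_dose(dose_xyz, spacing_mm, center_idx, radius_mm=104.0, n_theta=360):
    """Sample D_Cyl(theta, Z) = D_Dcm(R sin(theta), R cos(theta), Z) by nearest neighbour.

    dose_xyz   : 3D array indexed as [x, y, z]
    spacing_mm : isotropic in-plane voxel size (assumption)
    center_idx : (ix, iy) voxel index of the gantry-rotation axis (assumption)
    """
    nx, ny, _ = dose_xyz.shape
    theta = np.deg2rad(np.linspace(-180.0, 180.0, n_theta, endpoint=False))
    ix = np.rint(center_idx[0] + radius_mm * np.sin(theta) / spacing_mm).astype(int)
    iy = np.rint(center_idx[1] + radius_mm * np.cos(theta) / spacing_mm).astype(int)
    if ix.min() < 0 or ix.max() >= nx or iy.min() < 0 or iy.max() >= ny:
        raise ValueError("detector circle falls outside the dose grid")
    # Fancy indexing yields shape (n_theta, nz); transpose to rows = Z, columns = theta.
    return dose_xyz[ix, iy, :].T

def to_uint8(image):
    """Scale a dose image to 8 bits relative to its own maximum (assumed normalization)."""
    peak = image.max()
    if peak <= 0:
        return np.zeros(image.shape, dtype=np.uint8)
    return np.rint(255.0 * image / peak).astype(np.uint8)

# Synthetic dose that varies only along Z, on a 2.5-mm grid with 211 slices.
dose = np.zeros((101, 101, 211)) + np.arange(211)
cyl = cylindrical_dose(dose, spacing_mm=2.5, center_idx=(50, 50))
print(cyl.shape, to_uint8(cyl).dtype)
```

With 211 longitudinal samples and 360 angular samples, the output array matches the 211 × 360 input size listed in Table 1.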

2.3 Structure of CNN model

A schematic of the CNN model architecture is depicted in Fig. 3. This model has a total of 18 layers, including four convolution layers, four max pooling layers, four activation layers, one flatten layer, and three dense layers. All activation layers were rectified linear units (ReLU). The ReLU sets output values below 0 to 0 when outputting features and makes learning with images more efficient [22]. In addition, a dropout layer was used to improve the robustness of the network through the random removal of neurons and to reduce the impact of overfitting [23]. The drop rate of the dropout layer in our CNN model was set to 0.25. After the flatten layer, three dense layers were set with 128, 32, and 4 neurons, so that the GPR was predicted for the four tolerances. The mean squared error (MSE) was used as the loss function in the final layer. The number of parameters of the CNN model is summarized in Table 1. The architecture of the CNN model was implemented using the publicly available Keras framework with TensorFlow as the backend [24, 25]. The network of the CNN model was optimized by adaptive moment estimation (Adam). The Adam optimizer is characterized by high computational efficiency and robustness [26]. Furthermore, a high predictability can be obtained without significantly changing the hyperparameter settings. The learning rate, $\alpha$, was set to 0.001, and the exponential decay rates, $\beta_1$ and $\beta_2$, were set to 0.9 and 0.999, respectively. Our CNN model was trained for 200 epochs. The training was performed on a personal computer (PC) with an Intel® Core™ i3-8100K 4.0 GHz CPU and 8 GB of RAM.
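For readers unfamiliar with Adam, the update rule for a single scalar parameter can be sketched as below, using the learning rate and decay rates quoted above. The function name and the value of the small constant `eps` are assumptions (the text does not state it); in practice the framework applies this update to every weight.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)  # eps value is an assumption
    return param, m, v

# Toy example: a few steps on f(x) = x^2, whose gradient is 2x.
x, m, v = 1.0, 0.0, 0.0
for step in range(1, 4):
    x, m, v = adam_step(x, 2.0 * x, m, v, step)
print(x)
```

Because of the bias correction, the first step has a magnitude close to the learning rate regardless of the gradient scale.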

Table 1

Network architecture and number of parameters.
Layer	Output	Kernel	Padding	Stride	Number of parameters
Input (cylindrical dose)	211 × 360 × 1				0
Convolution 1	106 × 180 × 8	5 × 5	2 × 2	2 × 2	208
ReLU 1	106 × 180 × 8				0
MaxPooling 1	53 × 90 × 8	2 × 2	0	2 × 2	0
Convolution 2	53 × 90 × 16	5 × 5	2 × 2	1 × 1	3216
ReLU 2	53 × 90 × 16				0
MaxPooling 2	26 × 45 × 16	2 × 2	0	2 × 2	0
Convolution 3	26 × 45 × 32	5 × 5	2 × 2	1 × 1	12832
ReLU 3	26 × 45 × 32				0
MaxPooling 3	13 × 22 × 32	2 × 2	0	2 × 2	0
Convolution 4	13 × 22 × 64	5 × 5	2 × 2	1 × 1	51264
ReLU 4	13 × 22 × 64				0
MaxPooling 4	6 × 7 × 64	2 × 3	0	2 × 3	0
Dropout	6 × 7 × 64				0
Flatten	2688				0
Dense 1	128				344192
Dense 2	32				4128
Dense 3	4				132
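The output sizes and parameter counts in Table 1 can be checked with a few lines of arithmetic. The sketch below assumes standard convolution and pooling size formulas with floor rounding; the function names are illustrative and are not part of the authors' code.

```python
def conv_out(size, kernel, stride, pad):
    """Output length of a convolution along one axis."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel):
    """Output length of max pooling with stride equal to the kernel and no padding."""
    return (size - kernel) // kernel + 1

def trace_network(height=211, width=360):
    # (filters, convolution stride, pooling kernel) for the four blocks of Table 1
    blocks = [(8, 2, (2, 2)), (16, 1, (2, 2)), (32, 1, (2, 2)), (64, 1, (2, 3))]
    channels, shapes, params = 1, [], []
    for filters, stride, (pool_h, pool_w) in blocks:
        height = conv_out(height, 5, stride, 2)
        width = conv_out(width, 5, stride, 2)
        params.append(5 * 5 * channels * filters + filters)  # weights + biases
        height, width = pool_out(height, pool_h), pool_out(width, pool_w)
        channels = filters
        shapes.append((height, width, channels))
    features = height * width * channels  # size after the flatten layer
    for units in (128, 32, 4):
        params.append(features * units + units)  # dense weights + biases
        features = units
    return shapes, params

shapes, params = trace_network()
print(shapes)
print(params, sum(params))
```

The traced shapes after each pooling layer and the per-layer parameter counts agree with the corresponding rows of Table 1.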

2.4 Evaluation of prediction accuracy

In this study, five prediction models were developed through fivefold cross-validation of the 110 modeling cases. To reduce the generalization error, the GPR of each test case was predicted with each of the five models, and the average of these five pGPRs was defined as the final prediction result. From the mGPR and pGPR of the 25 test cases, the difference between pGPR and mGPR, i.e.,

$$\text{dGPR}=\text{pGPR}-\text{mGPR}$$

and the standard deviation (SD) of dGPR, mean absolute error (MAE), root mean squared error (RMSE), and Pearson’s correlation coefficient (CC) were calculated. In addition, a cumulative frequency histogram of the prediction error was created to clarify the prediction error tendency of our CNN model. The cumulative frequency histogram was evaluated separately for overestimate errors (pGPR > mGPR) and underestimate errors (pGPR < mGPR).
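The evaluation quantities listed above can be sketched as follows. The function name, the use of the sample SD (n − 1 in the denominator), and the example numbers are assumptions for illustration and are not data from this study.

```python
import math

def evaluate(pgpr_per_model, mgpr):
    """Average the per-model predictions, then compute dGPR statistics."""
    n_models = len(pgpr_per_model)
    pgpr = [sum(values) / n_models for values in zip(*pgpr_per_model)]
    dgpr = [p - m for p, m in zip(pgpr, mgpr)]  # dGPR = pGPR - mGPR
    n = len(dgpr)
    mean_d = sum(dgpr) / n
    sd = math.sqrt(sum((d - mean_d) ** 2 for d in dgpr) / (n - 1))  # sample SD (assumed)
    mae = sum(abs(d) for d in dgpr) / n
    rmse = math.sqrt(sum(d * d for d in dgpr) / n)
    mean_p, mean_m = sum(pgpr) / n, sum(mgpr) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(pgpr, mgpr))
    var_p = sum((p - mean_p) ** 2 for p in pgpr)
    var_m = sum((m - mean_m) ** 2 for m in mgpr)
    cc = cov / math.sqrt(var_p * var_m)  # Pearson's correlation coefficient
    return {
        "mean": mean_d, "sd": sd, "mae": mae, "rmse": rmse, "cc": cc,
        # Sorted error magnitudes, split by sign, for the cumulative frequency histograms.
        "over": sorted(d for d in dgpr if d > 0),
        "under": sorted(-d for d in dgpr if d < 0),
    }

# Hypothetical predictions from two models for three cases (illustrative only).
result = evaluate([[90.0, 80.0, 95.0], [92.0, 82.0, 97.0]], [90.0, 83.0, 95.0])
print(result["mae"], result["rmse"], result["over"], result["under"])
```

Keeping the overestimate and underestimate magnitudes in separate sorted lists allows a cumulative frequency curve to be drawn for each sign independently.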

3. Results

The measured GPR values for each tolerance in the modeling and test sets are shown in Table 2. In the modeling set, the median mGPR was lower than the mean mGPR for all tolerances. Figure 4 shows the loss function for the modeling set. The validation loss indicates that the CNN model was sufficiently trained at 200 epochs without overfitting. Figure 5 depicts the correlation between pGPR and mGPR for the four tolerances (3%/3 mm, 3%/2 mm, 2%/3 mm, and 2%/2 mm). The diagonal line on the graph represents a perfect prediction. The 2%/2-mm tolerance had the strongest correlation. The mean values and SDs of dGPR, together with the MAE, RMSE, and CC, are summarized in Table 3. The difference between the mean measured and mean predicted GPR was less than 1% for each tolerance. The CC values were 0.67, 0.70, 0.66, and 0.73 for the 3%/3-mm, 3%/2-mm, 2%/3-mm, and 2%/2-mm tolerances, respectively. These results show a moderate to strong correlation between pGPR and mGPR. Figure 6 depicts the dGPR for each test case. The maximum differences were $-$4.3%, $-$5.8%, $-$4.9%, and $-$7.0% for the 3%/3-mm, 3%/2-mm, 2%/3-mm, and 2%/2-mm tolerances, respectively. Figure 7 depicts the cumulative frequency histograms of the overestimate and underestimate errors for the four tolerances. The probability of an underestimate error was higher than that of an overestimate error in our CNN model: 72%, 60%, 68%, and 56% of the cases were underestimated for the respective tolerances. The overestimate errors with the cumulative frequency probability within 5% were 3.1%, 4.5%, 4.0%, and 4.4% for the 3%/3-mm, 3%/2-mm, 2%/3-mm, and 2%/2-mm tolerances, respectively. Similarly, the underestimate errors with the cumulative frequency probability within 5% were 3.8%, 4.9%, 3.2%, and 7.0% for the 3%/3-mm, 3%/2-mm, 2%/3-mm, and 2%/2-mm tolerances, respectively. Full training took approximately 1 h in the PC environment used in this study.

Table 2

Summary of measured GPR values (%) for each tolerance in modeling and test set.
Set	Tolerance	Mean	SD	Median
Modeling set	3%/3 mm	93.1	2.8	92.9
	3%/2 mm	88.0	4.4	87.8
	2%/3 mm	89.4	3.1	89.3
	2%/2 mm	81.6	4.7	81.4
Test set	3%/3 mm	93.2	2.9	93.5
	3%/2 mm	88.7	4.1	87.6
	2%/3 mm	90.0	3.3	89.3
	2%/2 mm	82.5	4.8	80.9

Table 3

Summary of evaluation items for each tolerance in test set.
Tolerance	dGPR (mean$\pm$SD)	MAE	RMSE	CC*
3%/3 mm	$-$0.9$\pm$2.2	2.1	2.4	0.67
3%/2 mm	$-$0.7$\pm$2.9	2.5	3.0	0.70
2%/3 mm	$-$0.4$\pm$2.5	2.2	2.6	0.66
2%/2 mm	$-$0.7$\pm$3.3	2.8	3.4	0.73
SD: standard deviation; MAE: mean absolute error; RMSE: root mean squared error; CC: correlation coefficient. * p < 0.01.

4. Discussion

In this study, the GPR of prostate VMAT QA using ArcCHECK was predicted via deep learning using the cylindrical dose distribution derived from the calculated 3D dose distribution in the ArcCHECK phantom. A moderate to strong correlation was shown between pGPR and mGPR, which suggests that the features of the dose distribution on the cylindrical detector plane have a direct relationship with mGPR.

Several groups have reported GPR prediction for IMRT QA. Ono et al. proposed a machine learning-based method that predicts the GPR measured with ArcCHECK from 28 complexity metrics [27]. The CC values and SDs achieved using this method were 0.57 and 2.1–2.4% at 3%/3 mm and 0.55 and 5.4–5.8% at 2%/2 mm. The CC values and SDs achieved in our study were 0.67 and 2.2% at 3%/3 mm and 0.73 and 3.3% at 2%/2 mm, respectively. Because our CNN model achieved these results with the cylindrical dose distribution alone, we consider the dose distribution on the detector to be an appropriate input for predicting the GPR. Most of the complexity metrics are values integrated over all control points, and they may lose some plan features. Because the 3D dose distribution at the detector has the potential to be more directly related to the GPR value, it is a recommended input for predicting the GPR value.

Our results can also be compared with those of a deep-learning-based method. Tomori et al. proposed a deep-learning-based method using the 2D dose distribution on Gafchromic film [17]. They obtained RMSEs of 1.1%, 1.5%, 1.5%, and 2.2% for the 3%/3-mm, 3%/2-mm, 2%/3-mm, and 2%/2-mm tolerances, respectively. These values are smaller than ours (2.4%, 3.0%, 2.6%, and 3.4%). Because our results were obtained with an ArcCHECK system with a 1-mm pitch detector element, and the mGPR values were markedly different, a direct comparison of these values does not provide a fair benchmark. The distribution of mGPR in this study differs from that in the previous study. Tomori et al. had few cases with an mGPR lower than 95%, and the SD of mGPR was 0.59% for the 3%/3-mm tolerance. Our study had 100 cases (74% of the total 135 cases) with mGPR values lower than 95%, and the SD of mGPR was 3.0%. Our coefficient of variation (CV) values of mGPR were 0.03, 0.05, 0.04, and 0.06 for the 3%/3-mm, 3%/2-mm, 2%/3-mm, and 2%/2-mm tolerances, respectively. These values were markedly larger than those of Tomori et al. (0.01, 0.01, 0.01, and 0.02). The distribution of mGPR is considered to affect the accuracy of the prediction [28]. In both the previous studies and our study, the prediction accuracy for the 2%/2-mm tolerance was worse than that for the 3%/3-mm tolerance [17, 28]. It would be difficult for the CNN model to predict the GPR accurately for tight tolerances, which include smaller GPR values, because of the larger variation in mGPR. For the correlation between pGPR and mGPR, we achieved CC values of 0.67, 0.70, 0.66, and 0.73 for the 3%/3-mm, 3%/2-mm, 2%/3-mm, and 2%/2-mm tolerances, respectively, despite the larger variation in mGPR. This result also demonstrates the predictability obtained by combining the CNN model with the cylindrical dose distribution. The dose distribution of ArcCHECK is the dose on the entry and exit surfaces measured with diodes at 2.9 cm below the surface. Such doses could retain more features than a planar dose at the center because they are less composite. Thus, the deep learning-based prediction method using the dose distribution on the detector may be more suitable for the GPR of 3D VMAT measurements. Applying the CNN model to GPRs measured for the same dataset by multiple devices (e.g., Gafchromic film, 2D or 3D detector arrays) may provide useful insight into the features suitable for deep learning.
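As a worked check of the CV values quoted above, the CV is the SD divided by the mean. The short sketch below uses the test-set means and SDs from Table 2; that the quoted CVs were computed from the test set is an inference, made because these inputs reproduce the reported numbers after rounding.

```python
# Test-set mean and SD of mGPR (%) for each tolerance, copied from Table 2.
test_set = {
    "3%/3 mm": (93.2, 2.9),
    "3%/2 mm": (88.7, 4.1),
    "2%/3 mm": (90.0, 3.3),
    "2%/2 mm": (82.5, 4.8),
}

# Coefficient of variation = SD / mean, rounded to two decimals.
cv = {tolerance: round(sd / mean, 2) for tolerance, (mean, sd) in test_set.items()}
print(cv)
```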

In this study, the probability of an underestimate error was higher than that of an overestimate error. This bias is attributed to the over-representation of cases with a GPR value considerably lower than the mean value. Because the GPR values are close to the upper limit (100%), there are no cases with considerably higher GPR values. In the modeling set, the median GPR was lower than the mean GPR. Only cases with a considerably lower GPR value may have contributed to the learning. Therefore, the CNN model could be biased toward low predictions, for which the GPR values are less restricted, and the proportion of underestimated cases may have increased as a result. When introducing a CNN model into clinical practice, it is essential to pay attention to the error characteristics of the prediction model. Setting separate tolerances for underestimate and overestimate errors is recommended because the prediction error may not follow a normal distribution.

There are some limitations to this study. The prediction was limited to a single treatment site, with 135 prostate plans. To apply our method to clinical practice and simplify the QA process for other treatment cases, it is necessary to broaden the range of treatment sites covered by the prediction. Additionally, the expected clinical advantage was not described, although the practical advantage of the CNN model is important. The results of this study come from a limited number of 135 cases, and some of the mGPR values were lower than the acceptance criteria recommended by TG-218. Thus, we consider that clinical feedback needs to be discussed carefully. This study used only the dose distribution to develop the CNN model. To improve the predictability of the CNN model, further studies may be necessary that use other components related to dosimetric accuracy, such as the dose uncertainty potential and complexity metrics of the treatment plan, as additional input data. Such a study will be performed to improve the CNN model.

References

Chow JC and Jiang R. Prostate volumetric-modulated arc therapy: dosimetry and radiobiological model variation between the single-arc and double-arc technique. J Appl Clin Med Phys. 2013; 14(3): 3–12.
Gorayski P, Fitzgerald R, Barry T, et al. Volumetric modulated arc therapy versus step-and-shoot intensity modulated radiation therapy in the treatment of large nerve perineural spread to the skull base: a comparative dosimetric planning study. J Med Radiat Sci. 2014; 61(2): 85–90.
Ezzell GA, Burmeister JW, Dogan N, et al. IMRT commissioning: multiple institution planning and dosimetry comparisons, a report from AAPM Task Group 119. Med Phys. 2009;36:5359–5373.
Ezzell GA, Galvin JM, Low D, et al. Guidance document on delivery, treatment planning, and clinical implementation of IMRT: report of the IMRT subcommittee of the AAPM radiation therapy committee. Med Phys. 2003;30:2089–2115.
Hartford AC, Galvin JM, Beyer DC, et al. American College of Radiology (ACR) and American Society for Radiation Oncology (ASTRO) practice guideline for intensity-modulated radiation therapy (IMRT). Am J Clin Oncol. 2012;35:612–617.
Van Esch A, Bohsung J, Sorvari P, et al. Acceptance tests and quality control (QC) procedures for the clinical implementation of intensity modulated radiotherapy (IMRT) using inverse planning and the sliding window technique: experience from five radiotherapy departments. Radiother Oncol. 2002;65:53–70.
Bedford JL, Warrington AP. Commissioning of volumetric modulated arc therapy (VMAT). Int J Radiat Oncol Biol Phys 2009; 73(2): 537–545.
Ling CC, Zhang P, Archambault Y, et al. Commissioning and quality assurance of RapidArc radiotherapy delivery system. Int J Radiat Oncol Biol Phys. 2008;72(2):575–581.
García-Vicente F, Fernández V, Bermúdez R, et al. Sensitivity of a helical diode array device to delivery errors in IMRT treatment and establishment of tolerance level for pretreatment QA. J Appl Clin Med Phys. 2012;13(1):111–123.
McNiven AL, Sharpe MB, Purdie TG. A new metric for assessing IMRT modulation complexity and plan deliverability. Med Phys. 2010;37:505–515.
Masi L, Doro R, Favuzza V, Cipressi S, Livi L. Impact of plan parameters on the dosimetric accuracy of volumetric modulated arc therapy. Med Phys. 2013;40:071718.
Wang J, Jin X, Peng J, Xie J, Chen J, Hu W. Are simple IMRT beams more robust against MLC error? Exploring the impact of MLC errors on planar quality assurance and plan quality for different complexity beams. J Appl Clin Med Phys. 2016;17:147–157.
Sumida I, Yamaguchi H, Das IJ, et al. Organ-specific modulation complexity score for the evaluation of dose delivery. J Radiat Res. 2017;58:675–684.
Shiba E, Saito A, Furumi M, et al. Predictive gamma passing rate by dose uncertainty potential accumulation model. Med Phys. 2019;46:999–1005.
Shiba E, Saito A, Furumi M, et al. Predictive gamma passing rate for three-dimensional dose verification with finite detector elements via improved dose uncertainty potential accumulation. Med Phys. 2020;47:1349–1356.
Interian Y, Rideout V, Kearney VP, et al. Deep nets vs expert designed features in medical physics: an IMRT QA case study. Med Phys. 2018;45:2672–2680.
Tomori S, Kadoya N, Takayama Y, et al. A deep learning-based prediction model for gamma evaluation in patient-specific quality assurance. Med Phys. 2018;45:4055–4065.
Masi L, Casamassima F, Doro R, Francescon P. Quality assurance of volumetric modulated arc therapy: evaluation and comparison of different dosimetric systems. Med Phys. 2011;38:612–621.
Feygelman V, Zhang G, Stevens C, Nelms BE. Evaluation of a new VMAT QA device, or the “X” and “O” array geometries. J Appl Clin Med Phys. 2011;12:146–168.
Miften M. TH-A-BRC-03: AAPM TG218: measurement methods and tolerance levels for patient-specific IMRT verification QA. Med Phys. 2016;43:3852–3853.
Low DA, Harms WB, Mutic S, Purdy JA. A technique for the quantitative evaluation of dose distributions. Med Phys. 1998;25:656–661.
Nair V and Hinton GE. Rectified linear units improve restricted Boltzmann machines. Proc. 27th Int'l Conf Mach Learn. 2010:807–814.
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15:1929–1958.
Chollet F. Keras: deep learning for humans. https://github.com/fchollet/keras.
Abadi M, Agarwal A, Barham P, et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467. 2016.
Kingma DP, Ba J. Adam: a method for stochastic optimization. ICLR; 2014 (arXiv:1412.6980).
Ono T, Hirashima H, Iramina H, et al. Prediction of dosimetric accuracy for VMAT plans using plan complexity parameters via machine learning. Med Phys. 2019;46:3823–3832.
Li J, Wang L, Zhang X, et al. Machine learning for patient-specific quality assurance of VMAT: Prediction and classification accuracy. Int J Radiation Oncol Biol Phys. 2019;105(4):893–902.
