Comparing log file to measurement-based patient-specific quality assurance

Recent technological advances have made it possible to perform patient-specific quality assurance (QA) without time-intensive measurements. The objectives of this study are to: (1) compare how well the log file-based Mobius QA system agrees with measurement-based QA methods (ArcCHECK and portal dosimetry, PD) in passing and failing plans, and (2) evaluate their error sensitivities. To these ends, ten phantom plans and 100 patient plans were measured with ArcCHECK and PD on VitalBeam, while log files were sent to Mobius for dose recalculation. Gamma evaluation was performed using criteria 3%/2 mm, per TG218 recommendations, and non-inferiority of the Mobius recalculation was determined with statistical testing. Ten random plans were edited to include systematic errors, then subjected to QA. Receiver operating characteristic curves were constructed to compare error sensitivities across the QA systems, and clinical significance of the errors was determined by recalculating dose to patients. We found no significant difference between Mobius, ArcCHECK, and PD in passing plans at the TG218 action limit. Mobius showed good sensitivity to collimator and gantry errors but not to MLC bank shift errors, although it could flag discrepancies in treatment delivery. Systematic errors were clinically significant only at large magnitudes; such unacceptable plans did not pass QA checks at the TG218 tolerance limit. Our results show that Mobius is not inferior to existing measurement-based QA systems, and can supplement existing QA practice by detecting real-time delivery discrepancies. However, it is still important to maintain rigorous routine machine QA to ensure reliability of machine log files.


Introduction
Intensity modulated radiation therapy (IMRT) and volumetric modulated arc therapy (VMAT) are widely used treatment modalities that offer high degrees of conformity to the target volume [1]. Due to the complexity and precision involved in these types of treatments, a reliable and robust quality assurance (QA) process is necessary to prevent errors that would compromise treatment quality and safety [1]. The current paradigm of quality assurance for IMRT and VMAT most commonly involves pre-treatment patient-specific QA (PSQA) [2].
To improve PSQA efficiency and error detectability, alternative PSQA methods, such as independent secondary calculations and log file-based PSQA, have increasingly been studied and adopted [10][11][12][13][14][15][16]. However, there remain concerns about the reliability and accuracy of these methods [14,17]. Such concerns notwithstanding, both independent dose recalculation and log file-based QA still hold promise in supplementing PSQA practice. Commercial solutions involving these methods have since been made available.
There is thus a need to establish confidence in the use of these newer PSQA methods. Previous studies, including that of Basavatia et al., have compared the performance of Mobius against traditional measurement-based methods in terms of plan failure concordance and dosimetric agreement [14,16]. This study adds to the existing literature by making use of TG218 recommendations in the analysis of measurement-based PSQA [2]. In this study, we assess the validity of the log file-based Mobius QA system by comparing it to two measurement-based methods (namely, ArcCHECK and Varian Portal Dosimetry) while using the newer TG218-recommended standards (i.e. global gamma analysis with criteria 3%/2 mm) [2]. The specific aim of this study is twofold: first, to quantitatively determine the degree of agreement between the three QA systems in passing and failing plans, and second, to determine the sensitivity of the three QA systems in detecting systematic errors.

Phantom and patient treatment plans
Ten phantom plans and 100 patient treatment plans were analysed in this study. The phantom plans consisted of five IMRT and five VMAT plans planned on solid water, and were derived from TG119 test plans. Treatment sites for the patient plans included prostate (41 plans), head and neck (HNN, 24 plans), abdomen (11 plans), pelvis (10 plans), brain (6 plans), breast (3 plans), arm (2 plans), lung (1 plan), and spine (1 plan). All treatment plans used either a 6 MV or 10 MV beam, and were calculated on Eclipse version 15.6 with the Analytical Anisotropic Algorithm (Varian Medical Systems, Palo Alto, CA). Each plan went through QA using Mobius (Varian Medical Systems, Palo Alto, CA), ArcCHECK (Model 1220; Sun Nuclear Corporation, Melbourne, FL), and Varian Portal Dosimetry (Varian Medical Systems, Palo Alto, CA). The plans were delivered on a Varian VitalBeam (Varian Medical Systems, Palo Alto, CA) equipped with a Millennium 120 multi-leaf collimator.

Mobius QA
The Mobius system comprises both the Mobius3D calculation engine and the MobiusFX module. Mobius3D uses a collapsed cone convolution (CCC) algorithm to recalculate patient dose distributions on the patient CT. MobiusFX captures LINAC log files and verifies delivery parameters against the original plan.
Each plan was sent to Mobius for a plan check. LINAC log files were automatically pushed to Mobius when the plan was delivered at the LINAC; MobiusFX extracted the various delivery parameters for recalculation of patient dose distribution on the patient CT, and also for verification against the original plan. Global gamma analysis at gamma criteria 3%/2 mm was automatically performed between the recalculated dose distributions (incorporating log file positions) and the original dose distributions. The dose threshold was set to 10%. The gamma pass rate (GPR) for each plan was recorded.
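To make the pass-rate calculation concrete, the following is a minimal one-dimensional sketch of a global gamma analysis in Python. It is not the implementation used by Mobius, SNC Patient, or Portal Dosimetry; the function name, the uniform sampling grid without interpolation, and the profile values are assumptions chosen purely for illustration.

```python
def gamma_pass_rate(ref_dose, eval_dose, spacing_mm,
                    dose_crit=0.03, dta_mm=2.0, threshold=0.10):
    """Global 1D gamma pass rate in percent (illustrative sketch only).

    ref_dose, eval_dose: dose samples on the same uniform grid.
    spacing_mm: grid spacing in mm.
    dose_crit: dose-difference criterion as a fraction of the maximum
               reference dose (global normalisation), e.g. 0.03 for 3%.
    dta_mm: distance-to-agreement criterion in mm.
    threshold: reference points below this fraction of the maximum
               reference dose are excluded from the analysis.
    """
    d_max = max(ref_dose)
    dose_tol = dose_crit * d_max
    cutoff = threshold * d_max

    evaluated = 0
    passed = 0
    for i, d_ref in enumerate(ref_dose):
        if d_ref < cutoff:
            continue  # low-dose point, excluded by the dose threshold
        evaluated += 1
        # Gamma squared is the minimum, over all evaluated points, of the
        # combined normalised distance and normalised dose difference.
        gamma_sq = min(
            ((j - i) * spacing_mm / dta_mm) ** 2
            + ((d_eval - d_ref) / dose_tol) ** 2
            for j, d_eval in enumerate(eval_dose)
        )
        if gamma_sq <= 1.0:
            passed += 1
    return 100.0 * passed / evaluated


# Hypothetical profile sampled every 1 mm.
reference = [0.0, 20.0, 60.0, 100.0, 60.0, 20.0, 0.0]
perturbed = [0.0, 20.0, 60.0, 110.0, 60.0, 20.0, 0.0]  # 10% error at the peak
print(gamma_pass_rate(reference, perturbed, spacing_mm=1.0))
```

Clinical implementations work on 2D or 3D dose grids and interpolate between samples, so this sketch only conveys how the dose-difference, distance-to-agreement, and dose-threshold settings enter the GPR.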

ArcCHECK QA
The 3D diode array ArcCHECK was used to measure the dose delivered by each plan. All available ArcCHECK corrections (angular, heterogeneity, and field size corrections) were applied, as was measurement uncertainty. The expected dose to ArcCHECK was calculated in Eclipse. Both measured and calculated dose files were exported to the SNC Patient software for global gamma analysis at gamma criteria 3%/2 mm. Dose threshold was set to 20%, which is our institution's current practice. GPRs were recorded.

Portal dosimetry
For each plan, predicted portal fluences were calculated on Eclipse using the portal dose image prediction (PDIP) algorithm. Source-to-imager distance was set to 100 cm. Measured EPID images were automatically captured by ARIA, and evaluated against the predicted fluence in the Portal Dosimetry module. For ease of comparison across QA methods, composite images were used for global gamma analysis at gamma criteria 3%/2 mm. Dose threshold was set to 10%. The GPRs were recorded.

Statistical analysis
Statistical analysis was performed using R Statistical Software (version 4.1.2) and the RVAideMemoire package [18,19]. Cochran's Q test was used to determine if the three QA systems passed or failed plans similarly, with passing thresholds at the TG218-recommended action and tolerance limits of 90% and 95% GPR. This was repeated for sites with at least 10 plans, namely prostate, HNN, abdomen, and pelvis, in order to investigate site-dependence. A p-value of less than 0.05 was considered statistically significant. Post-hoc analysis was conducted using pairwise McNemar's tests with Bonferroni correction.
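The analysis itself was carried out in R. As a language-neutral illustration of the two tests, the sketch below computes Cochran's Q for three QA systems and a Bonferroni-corrected McNemar's test in Python. The function names and pass/fail data are hypothetical, and the exact binomial form of McNemar's test is an assumption; the software used in the study may apply a different variant.

```python
import math


def cochran_q(outcomes):
    """Cochran's Q for rows of (mobius, arccheck, pd) flags, 1 = pass, 0 = fail.

    Returns (Q, p), or None when every plan has identical results across
    the three systems, in which case the statistic is undefined.
    """
    k = 3  # number of QA systems; the p-value below is valid for k = 3 only
    if any(len(row) != k for row in outcomes):
        raise ValueError("each row must hold exactly three outcomes")
    col_totals = [sum(row[j] for row in outcomes) for j in range(k)]
    row_totals = [sum(row) for row in outcomes]
    n = sum(row_totals)
    denom = k * n - sum(r * r for r in row_totals)
    if denom == 0:
        return None
    q = (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom
    # Chi-square survival function with k - 1 = 2 degrees of freedom.
    p = math.exp(-q / 2.0)
    return q, p


def mcnemar_exact(first, second, n_comparisons=1):
    """Two-sided exact McNemar p-value with Bonferroni correction."""
    b = sum(1 for x, y in zip(first, second) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(first, second) if x == 0 and y == 1)
    n = b + c  # only discordant pairs carry information
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail * n_comparisons)


# Hypothetical pass/fail results for ten plans: (Mobius, ArcCHECK, PD).
plans = [(1, 1, 1)] * 5 + [(0, 1, 1)] * 3 + [(0, 0, 1)] * 2
q, p = cochran_q(plans)
mobius = [row[0] for row in plans]
pd_flags = [row[2] for row in plans]
print(q, p, mcnemar_exact(mobius, pd_flags, n_comparisons=3))
```

The `None` return corresponds to the situation in which all systems give identical results for every plan, so that no p-value can be reported.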

Sensitivity to intentional errors
Ten IMRT plans that had passed all three QA systems at the action limit (90%) were randomly selected and manipulated to simulate three types of systematic errors. The systematic errors introduced were either at or beyond the tolerances set out in Task Group 142 [20], and were as follows:
1. Gantry angle errors of 1°-3°.
2. Collimator angle errors of 1°-4°.
3. MLC bank shift errors of 1 mm, 2 mm, 3 mm, and 5 mm (both banks, towards patient left).
The effects of these errors on the GPR at 3%/2 mm gamma criteria were analysed on the respective QA platforms. In addition, the clinical consequences of the simulated errors were determined by recalculating patient doses of the error plans on Eclipse. It was assumed that the errors would persist throughout the entire treatment course. Dose-volume histogram (DVH) statistics for each plan were recorded. The clinical impact of the errors was determined by quantifying the percentage change in dose coverage to the PTVs.
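As a rough illustration of how a change in PTV coverage might be quantified, the Python sketch below compares an error-free and an error-containing dose distribution. The coverage metric (percentage of PTV voxels receiving at least 95% of the prescription), the reporting of the change in percentage points, and all dose values are assumptions for illustration; the text above does not specify the exact DVH statistic used.

```python
def ptv_coverage(ptv_doses, prescription, level=0.95):
    """Percentage of PTV voxels receiving at least level x prescription."""
    covered = sum(1 for d in ptv_doses if d >= level * prescription)
    return 100.0 * covered / len(ptv_doses)


def coverage_change(baseline_doses, error_doses, prescription, level=0.95):
    """Change in PTV coverage in percentage points (negative = loss)."""
    return (ptv_coverage(error_doses, prescription, level)
            - ptv_coverage(baseline_doses, prescription, level))


# Hypothetical voxel doses in Gy for a 60 Gy prescription.
baseline = [60.0, 61.0, 59.0, 58.0, 62.0, 60.0, 57.5, 60.0, 61.0, 60.0]
with_error = [60.0, 61.0, 59.0, 56.0, 62.0, 60.0, 55.0, 60.0, 61.0, 60.0]
print(coverage_change(baseline, with_error, prescription=60.0))
```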
Receiver operating characteristic (ROC) curves are frequently used to visualise the tradeoff between sensitivity and specificity at different thresholds. In this study, sensitivity refers to the probability that an error plan will have a GPR below a specific GPR threshold, while specificity refers to the probability that an error-free plan will have a GPR above that same threshold. The ROC curve can then be constructed by plotting the sensitivities and specificities of the system when using different GPR thresholds. ROC curves for each type of error at each magnitude were constructed using the pROC package in R [21]. Sensitivities and specificities at the action limit were also calculated.
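The definitions above can be sketched in a few lines of Python. This is not the pROC implementation; the function names and GPR values are hypothetical, and treating a GPR exactly equal to the threshold as a pass is an assumption.

```python
def sens_spec(error_gprs, clean_gprs, threshold):
    """Sensitivity and specificity of a GPR threshold.

    Sensitivity: fraction of error plans with GPR below the threshold.
    Specificity: fraction of error-free plans with GPR at or above it.
    """
    sens = sum(1 for g in error_gprs if g < threshold) / len(error_gprs)
    spec = sum(1 for g in clean_gprs if g >= threshold) / len(clean_gprs)
    return sens, spec


def roc_points(error_gprs, clean_gprs):
    """(1 - specificity, sensitivity) pairs for every candidate threshold."""
    values = sorted(set(error_gprs) | set(clean_gprs))
    thresholds = values + [values[-1] + 1.0]  # last threshold fails every plan
    points = []
    for t in thresholds:
        sens, spec = sens_spec(error_gprs, clean_gprs, t)
        points.append((1.0 - spec, sens))
    return points


def auc(error_gprs, clean_gprs):
    """Area under the ROC curve as the probability that a random error plan
    has a lower GPR than a random error-free plan (ties count one half)."""
    score = 0.0
    for e in error_gprs:
        for c in clean_gprs:
            if e < c:
                score += 1.0
            elif e == c:
                score += 0.5
    return score / (len(error_gprs) * len(clean_gprs))


# Hypothetical GPRs (%) for five error-free plans and their error versions.
clean = [99.1, 98.4, 97.0, 95.2, 93.5]
error = [96.0, 91.2, 88.7, 85.0, 79.9]
print(sens_spec(error, clean, threshold=90.0))
print(auc(error, clean))
```

Evaluating `sens_spec` at a threshold of 90% gives the sensitivity and specificity at the action limit, while `roc_points` sweeps the threshold to trace the full curve.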

Comparison of pass/failure rates
The ten phantom plans passed all QA systems above the tolerance limit (95%). We are therefore able to establish confidence in all three QA systems. For patient plans, PD generally had the highest GPRs, while ArcCHECK and Mobius GPRs were more similar to each other. These data are included in the Appendix.
The p-values from Cochran's Q test are summarised in Table 1. No significant difference was found between the QA systems at the action limit, regardless of site. Prostate and abdomen had exactly the same results across the QA systems at the action limit and thus did not have p-values. Significant differences between the QA systems were found at the tolerance limit; post-hoc analysis revealed that this significance stemmed from PD passing plans more frequently than both Mobius and ArcCHECK. This was also reflected in prostate and HNN plans.
Table 2 shows the truth tables for plans passing or failing the action limit and the tolerance limit between Mobius and ArcCHECK, and between Mobius and PD. There is some agreement between ArcCHECK and Mobius. For both action and tolerance limits, Mobius marked the highest number of plans as failures, but a subset of plans failing ArcCHECK QA still passed Mobius. There were five plans that failed ArcCHECK but passed Mobius at the action limit; this increased to 17 at the tolerance limit. Figure 1 shows the breakdown of these plans. On the other hand, all disagreements between Mobius and PD arose from plans that failed the Mobius check; PD passed almost all cases except for one at the 95% tolerance limit.

Sensitivity to intentional errors
Since all ten selected error-free plans passed at the action limit, it is possible to determine how sensitive the different systems are to the respective errors when using the action limit as a threshold. These sensitivities are shown in Table 3. ArcCHECK measurements were the most sensitive to gantry errors, Mobius to collimator errors, and PD to MLC bank shift errors.
In addition to GPR, Mobius also detects errors by comparing beam delivery parameters to the plan. Figure 3 shows an example of a plan with problematic delivery: the MLC bank was shifted by 5 mm, exceeding the user-defined tolerance. Despite the plan passing the 90% action limit, Mobius still flagged it.
Changes to patient dose distributions due to the intentional errors were also analysed. A collimator rotation angle error of 4° caused PTV dose coverage to decrease by more than 5% in two plans (one HNN, one pelvis), while an MLC bank shift of 5 mm did the same in three plans (two brain, one prostate). No other errors caused changes to PTV dose coverage that exceeded 5%. These results are summarised in Fig. 4. All five plans with significantly decreased PTV coverage would have failed all QA checks at the tolerance limit. However, two plans with 5 mm MLC bank shift errors (one brain, one prostate) passed the action limit on Mobius.

Discussion
We first compared both phantom and patient plans' GPRs to quantitatively determine the agreement between QA systems when passing plans above a certain GPR threshold. Basavatia et al. had previously concluded that Mobius performed similarly to the other measurement-based systems when comparing whether patient plans passed by using 90% GPR at gamma criteria 3%/3 mm as the passing threshold [16]. Our study supports these findings while using the TG218-recommended action limit. When using the TG218-recommended universal tolerance limit, Mobius consistently failed more patient plans than both PD and ArcCHECK, but the vast majority of plans that passed ArcCHECK and PD still passed Mobius. As such, it may be useful to use Mobius as a first screen to determine if a plan requires a further measurement-based QA check.
Tolerance limits could be calculated separately for plan cases with different complexities, as plans with higher complexities may have a larger deviation in GPRs [2]. Both Au et al. and Song et al. stratified the investigated plans by treatment site when comparing the Mobius system with other traditional QA methods, and their data suggest that GPR is treatment site- and QA system-dependent [22,23]. Our results support these studies: prostate plans were more likely to fail Mobius than ArcCHECK, while HNN plans were more likely to fail ArcCHECK but not Mobius. The prostate plans that failed Mobius generally used very large fields and required long measurements on the ArcCHECK. Given Mobius' limitations with modelling larger field sizes and off-axis regions, these failures may not be entirely unexpected and the ArcCHECK results may be more reliable. Improvements to Mobius' modelling method may resolve such issues in the future. Conversely, HNN plans tend to be more complex with higher degrees of MLC modulation, and the ArcCHECK's limited resolution may reduce its reliability in correctly picking out poor patient plans. It would be valuable to study the sources of these discrepancies in more detail. Nevertheless, if Mobius were to be used as the first screen in a PSQA workflow, measuring HNN plans regardless of their pass rates on Mobius may be a good approach.
The higher failure rates on Mobius may also be indicative of other differences in the techniques. First, our PD measurements use the perpendicular composite measurement technique, which is not recommended by TG218, but nevertheless remains common in QA practice across institutions worldwide [24,25]. The fact that these composite PD measurements rarely failed QA provides further evidence that such measurements may not be useful in detecting poor plans. Second, the higher failure rates on Mobius as compared to ArcCHECK could reflect the fact that its GPRs come from comparing doses on heterogeneous patient CTs instead of homogeneous phantoms, as is the case for most measurement-based PSQA.
It is thus evident that TG218's action and tolerance limits for measurement-based QA may not be entirely appropriate for PSQA using Mobius. However, TG218 also recommends setting locally defined process-based tolerance and action limits when universal limits are less appropriate [2]. Hence, while TG218's focus was on different types of measurement-based QA systems, the statistical process control methodology described within can also be referenced to determine appropriate limits for Mobius, especially for treatment sites where discrepancies with measurement-based QA results are larger.
We next investigated the sensitivity of the three QA systems in detecting systematic intentional errors, in terms of whether an error plan fails a pre-determined GPR threshold. Mobius proved to be the most sensitive to collimator rotation errors, but ArcCHECK outperformed it in gantry angle error detection, and PD was the most sensitive to MLC bank shift errors. Still, Mobius was able to achieve good sensitivity of at least 70% for 3° gantry and collimator angle errors when using the action limit threshold. These findings partially support Au et al.'s work, which found that Mobius was able to detect 2° collimator angle errors and 3 mm MLC bank shift errors when assessed at 2%/2 mm gamma criteria [22].
Fig. 2 ROC curves for selected errors when evaluated with gamma criteria 3%/2 mm. a-b are for gantry errors, c-d are for collimator errors, and e-g are for MLC errors. In g, the ROC curves for AC and PD overlap completely. In general, the ROC curves show that the QA systems have some ability to pick out small gantry and collimator angles, but have more difficulty with MLC errors
An underlying assumption here is that the GPR is sensitive to such errors. This is the only means through which one can assess sensitivity for ArcCHECK and PD without the use of other software (such as 3DVH for ArcCHECK). However, GPR has been reported to be insensitive to small errors under several test conditions [7,11]. Therefore, it may be prudent to look towards other means of detecting errors.
Mobius offers a solution to this by determining if delivery parameters match the ones in the original plan. This allows the user to pinpoint the error. However, a core limitation of this method is that accuracy of the log files is assumed [12]. If the log files are not accurate because the LINAC calibration is off, then it is unlikely for Mobius to be able to detect delivery errors that arise. As an example, Agnew et al. had previously reported a discrepancy in the observed and log file-recorded MLC position [17]. Daily wear and tear of motors that control MLCs will also contribute to discrepancies between log file records and actual MLC positions [26]. This demonstrates the importance of establishing confidence in the accuracy of the log files through rigorous routine machine quality assurance.
We recalculated the plans with errors on the treatment planning system to determine how the DVHs would change, and the extent to which these errors resulted in clinically unacceptable plans. Surprisingly, gross errors did not affect clinical goals much, contrary to several previous studies that had also examined systematic errors [27,28]. Only very large errors in MLC bank shifts (5 mm) and collimator angle errors (> 3°) had a clear detrimental effect on the coverage of the PTV. These findings could be due to two main factors: the plans we studied did not have small PTVs, and were also fairly robust, in that they had exceeded their respective planning goals by a large margin. However, it was still surprising to see that the dosimetric changes were mostly inconsequential. This was also reflected in Lehmann et al.'s study, where systematic errors only caused clinically significant errors some of the time [29]. Nevertheless, given the small number of plans investigated, it would be

Conclusion
Our aim in this study was twofold: first, we applied TG218-recommended limits to determine how well the Mobius QA system agreed with ArcCHECK and PD. Second, we quantified the sensitivity of the three QA systems to three types of induced systematic error: gantry angle, collimator angle, and MLC bank shift errors. The results showed that at the action limit, there are no significant differences between the three QA systems in passing and failing plans. Mobius GPRs are also fairly sensitive to collimator and gantry angle errors, but its main advantage over the other QA systems lies in its ability to flag deviations in beam delivery during treatment. Therefore, Mobius can be a useful supplement to current PSQA practice. However, log file data should still be routinely evaluated against actual machine parameters. Many of the introduced systematic errors did not appear to have significant clinical consequences: only gross errors in collimator angle (> 3°) and MLC bank shifts (5 mm) appeared to reduce PTV coverage by more than 5%. The fact that all three QA systems were largely able to detect erroneous plans with significant clinical consequences is also reassuring. A future study with a larger number of plans may be a worthwhile endeavour.
Fig. 4 Changes in PTV percentage dose due to error type, according to magnitude. PTV dose coverage only appeared to be adversely affected (more than 5% decrease) for large collimator and MLC errors

Funding The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.