Clinically oriented contour evaluation using geometric and dosimetric indices based on simple geometric transformations

Background: In radiotherapy, geometric indices are often used to evaluate the accuracy of contouring; however, their ability to identify contouring errors is limited, mainly because they do not take any clinical context into account. To study the relationship between geometric indices and dosimetric indices, four different types of targets were selected, and systematic and random geometric errors were introduced into their delineation.
Materials and Methods: The C-shaped target described in the report of Task Group 119 (TG-119) of the American Association of Physicists in Medicine (AAPM) and the targets of three actual cases (oropharyngeal cancer, metastatic spine cancer, and prostate cancer) were selected as the test contours to be modified. Python software was used to perform translation, scaling, rotation, and sine function transformations, which introduced systematic and random errors into these contours. The geometrically transformed contours were regarded as reference contours, that is, contours in which the systematic and random errors had been corrected. The corresponding dosimetric indices were obtained from the original dose distribution of the radiotherapy plan, and the correlations (R²) between geometric and dosimetric indices were quantified through linear regression. The Wilcoxon signed rank test was used to compare how well the geometric indices discriminated between transformations in different spatial directions.
Results: The correlations between the geometric and dosimetric indices were inconsistent across the targets. For systematic errors, except for the sine function transformation (R²: 0.023–0.04, p > 0.05), the geometric transformations of the C-shaped target’s planning target volume (C-PTV) were correlated with the dosimetric indices D98% and Dmean (R²: 0.689–0.988), and 80% of these correlations were strong (R² > 0.8).
For the random errors, the correlation coefficients of the actual cases were also high (R² > 0.384, p < 0.05). The Wilcoxon signed rank test showed that only the volumetric geometric indices of the C-PTV had p-values less than 0.05.
Conclusions: Clinically, the contour accuracy of a region of interest cannot be assessed on the basis of geometric indices alone; they should be combined with dosimetric indices.


Background
Contouring of the target and organs at risk (OARs) is a key step in radiotherapy, especially with the highly modulated radiotherapy technologies currently in use, such as intensity-modulated radiotherapy (IMRT) and volumetric modulated arc radiotherapy (VMAT). Inaccuracy in the contouring process can introduce serious systematic errors that persist through every subsequent step of the patient's radiotherapy, such as treatment planning and patient set-up [1][2][3]. The commonly used slice-by-slice manual approach and the interpolation-based semi-automatic approach are time-consuming and resource-intensive, and their results are susceptible to inter-observer variation; furthermore, the accuracy of contouring depends on the residents' clinical experience and on the rational and efficient use of imaging techniques. Computed tomography (CT) images provide useful anatomical information and the electron density needed for dose calculation in radiotherapy plans. However, compared with registered CT and magnetic resonance imaging (MRI), CT images alone have poor soft-tissue contrast, and in some cases the targets and organs at risk cannot be clearly displayed. Improper selection of imaging techniques may therefore also lead to differences between residents' delineations [4][5][6].
Methods for assessing the accuracy of contouring are generally divided into two categories: subjective evaluations and quantitative evaluations. Subjective evaluation is based only on the experience and personal preferences of the evaluators. In one such scheme, evaluators were instructed to turn off the original contour display and grade all research contours on three levels: useful as test contours (= 1), useful with minor edits (= 2), and not useful (= 3), where minor edits meant that the test contours would be acceptable after minor modifications [7]. This evaluation method is strongly affected by individual differences among the evaluators and requires considerable time. Most contour accuracy studies therefore rely directly on quantitative evaluation, which uses geometric indices to characterize the similarity between the test contour and the reference contour [8]. Widely used geometric indices include distance-type indices (e.g., the maximum (HD), mean (HDmean), and 95% (HD95) Hausdorff distances) and volumetric indices (e.g., the Dice similarity coefficient (DSC) and the Jaccard index) [9]. Although these indices are easy to calculate, they do not consider the clinical effect and may lack clinical relevance [10,11]. Given a reference contour, the clinical approach to assessing the accuracy of radiotherapy (RT) contours is to determine and predict the deviation of the dosimetric indices based on the dose distribution of the treatment plan [10,12,13,14]. However, the relationships between geometric indices and dosimetric indices require further study. In addition, different geometric indices have different properties, yet some automatic segmentation studies select geometric indices arbitrarily to evaluate the contouring results [15][16][17].
In the present study, we explored how accurately geometric indices evaluate contours affected by systematic and random errors. Contour errors were artificially introduced through four geometric transformations: translation, scaling, rotation, and sine function transformation. Based on these transformations, we investigated the correlations between the distance-type (HD, HDmean, HD95) and volumetric (DSC, Jaccard) geometric indices and the dosimetric indices (D98%, Dmean, D2%), and we explored the ability of the geometric indices to distinguish contours produced by the same type of transformation in different directions.

Contouring
Four different types of targets were selected for this study: the C-shaped target, oropharyngeal cancer, metastatic spine cancer, and prostate cancer (Fig. 1). The C-shaped target was delineated on the water phantom according to the TG-119 report [18], and the remaining three types of targets were outlined by senior residents in our research institution. The target structures were exported from the RayStation treatment planning system (TPS; RaySearch, Stockholm, Sweden) as a DICOM file; the position information of the contours was read by in-house Python software (version 3.7.3) and used to perform the geometric transformations. Finally, the transformed structures were imported back into RayStation as a DICOM file. The targets before transformation (original contours) were regarded as delineations by junior residents (test contours), and the transformed targets were regarded as reference contours, that is, the contours after senior doctors had corrected the systematic and random errors.
In this study, contour errors were introduced in the form of geometric transformations. The translation transformations were divided into three cases: the right, anterior, and posterior directions. Starting from the location of the original contours, the contours were shifted in 1 mm steps, 10 times in each of the right, anterior, and posterior directions, to obtain the reference contours (see Additional file 1: Supplementary Figures S1-S3 for the contours after translation in the right, anterior, and posterior directions, respectively). Through the seven geometric transformations used in this study, systematic errors were introduced into the C-shaped target.
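The four transformation types can be sketched on the vertices of a single axial contour. This is a minimal illustration and not the in-house software: the function names, the use of the vertex centroid as the transformation centre, scaling by a factor (the study's equidistant scaling may instead be a fixed margin), and the radial form of the sine perturbation are all assumptions, as is the square example contour. The axes are assumed to follow the DICOM patient-based convention (+x toward the patient's left, +y toward the posterior), so a shift to the right is a negative x offset.

```python
import math

def centroid(points):
    """Mean of the contour vertices, used here as the transformation centre."""
    n = len(points)
    return sum(x for x, _ in points) / n, sum(y for _, y in points) / n

def translate(points, dx=0.0, dy=0.0):
    """Shift every (x, y) vertex by (dx, dy) in mm."""
    return [(x + dx, y + dy) for x, y in points]

def scale(points, factor):
    """Scale about the centroid (factor > 1 expands, factor < 1 shrinks)."""
    cx, cy = centroid(points)
    return [(cx + factor * (x - cx), cy + factor * (y - cy)) for x, y in points]

def rotate(points, degrees):
    """Rotate counter-clockwise about the centroid by the given angle."""
    cx, cy = centroid(points)
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [(cx + c * (x - cx) - s * (y - cy),
             cy + s * (x - cx) + c * (y - cy)) for x, y in points]

def sine_perturb(points, amplitude, cycles):
    """Assumed sine transformation: move each vertex radially by
    amplitude * sin(cycles * polar angle) relative to the centroid."""
    cx, cy = centroid(points)
    out = []
    for x, y in points:
        r = math.hypot(x - cx, y - cy)
        theta = math.atan2(y - cy, x - cx)
        r_new = r + amplitude * math.sin(cycles * theta)
        out.append((cx + r_new * math.cos(theta), cy + r_new * math.sin(theta)))
    return out

# Hypothetical 20 mm x 20 mm square contour on one slice (coordinates in mm).
square = [(0.0, 0.0), (20.0, 0.0), (20.0, 20.0), (0.0, 20.0)]

# Ten reference contours shifted toward the patient's right in 1 mm steps.
right_shifts = [translate(square, dx=-float(k)) for k in range(1, 11)]
```

Applying the same functions slice by slice to a real structure would require reading the contour data from the RT structure file, which is outside the scope of this sketch.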
To introduce random errors, five CT slices were randomly selected on which the position information of the structure was kept unchanged; the remaining CT slices were divided into three parts, to which the above geometric transformations were applied at random. In this way, 20 delineation results were obtained for each target.

Geometric indices
In this study, we chose five widely used geometric indices for the evaluations: three distance-type indices (maximum, mean, and 95% HD) and two volumetric indices (DSC and Jaccard). These five geometric indices were calculated with 3D Slicer version 4.10.2 [19], which is open-source software. The HD was calculated on the RT-DICOM structures. The HD indices calculated by 3D Slicer are bi-directional distances, which are symmetric; this type of distance is more stable than the unidirectional distance calculated by other methods.
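For orientation, the five indices can be sketched on small point and voxel sets. This is an illustrative sketch only: it works on voxel coordinates rather than RT structure surfaces, it pools the two directed distance lists to obtain the mean and 95% values (one common convention; tools differ on this point, and no claim is made that this matches 3D Slicer), and the two example masks are hypothetical.

```python
import math

def directed_distances(a, b):
    """For each point in a, the distance to its nearest point in b."""
    return [min(math.dist(p, q) for q in b) for p in a]

def percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def hausdorff_indices(a, b):
    """Bi-directional HD, mean HD and HD95 from the pooled directed distances."""
    pooled = directed_distances(a, b) + directed_distances(b, a)
    return {
        "HD": max(pooled),
        "HDmean": sum(pooled) / len(pooled),
        "HD95": percentile(pooled, 95),
    }

def dice(a, b):
    """Dice similarity coefficient of two voxel sets."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    """Jaccard index (intersection over union) of two voxel sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical masks: a 10 x 10 voxel square and the same square shifted by 2 voxels.
test_voxels = [(x, y) for x in range(10) for y in range(10)]
ref_voxels = [(x + 2, y) for x, y in test_voxels]

hd = hausdorff_indices(test_voxels, ref_voxels)
dsc = dice(test_voxels, ref_voxels)
jac = jaccard(test_voxels, ref_voxels)
```

Because both directed distance lists enter the calculation, swapping the test and reference sets leaves all three distance values unchanged, which is the symmetry referred to above.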

Dosimetric indices
The original clinical plans for the four targets all used IMRT. The C-shaped target met the requirements of the simple version in the TG-119 report, with a prescription of 5000 cGy to 90% of the target volume. The prescriptions for oropharyngeal cancer, metastatic spine cancer, and prostate cancer were 5400, 3000, and 5600 cGy, respectively, to 95% of the target volume. The dose grid was 2 mm. After geometric transformation, the RT structures were imported into the original clinical plans, and D98%, Dmean, and D2% of the PTV were obtained from the dose distribution. According to the ICRU-83 report [20], these dosimetric indices represent the near-minimum dose, the mean dose, and the near-maximum dose received by the target, respectively. In this study, the dose differences (ΔD) of the three dosimetric indices D98%, Dmean, and D2% were calculated and normalized to their respective clinical goals [13]. Here, ΔDx denotes the normalized dose difference of index x, where x represents the type of dosimetric index.
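A minimal sketch of how such indices could be derived from the voxel doses inside a structure is given below. It assumes equal-volume voxels; the sign convention of the normalized difference (test minus reference) is an assumption, since the text states only that the difference is normalized to the clinical goal. The dose values are synthetic and the function names are hypothetical.

```python
import math

def dose_at_volume(doses, volume_percent):
    """D_v%: the highest dose received by at least v% of the voxels
    (equal voxel volumes assumed)."""
    s = sorted(doses, reverse=True)
    k = math.ceil(len(s) * volume_percent / 100.0)
    return s[max(k, 1) - 1]

def dose_indices(doses):
    """Near-minimum (D98%), mean, and near-maximum (D2%) dose of a structure."""
    return {
        "D98%": dose_at_volume(doses, 98),
        "Dmean": sum(doses) / len(doses),
        "D2%": dose_at_volume(doses, 2),
    }

def normalized_difference(test_value, reference_value, clinical_goal):
    """Dose difference in percent of the clinical goal.
    Assumed sign convention: test contour minus reference contour."""
    return 100.0 * (test_value - reference_value) / clinical_goal

# Synthetic voxel doses in cGy for a test and a reference structure.
test_doses = [5000 + i for i in range(100)]
ref_doses = [4900 + i for i in range(100)]

test_idx = dose_indices(test_doses)
ref_idx = dose_indices(ref_doses)
delta_d98 = normalized_difference(test_idx["D98%"], ref_idx["D98%"], 5000)
```

In practice the voxel doses would come from sampling the plan's dose grid inside each structure, and partial-volume weighting at the structure boundary would change the values slightly.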

Analysis
SPSS version 21.0 (SPSS, Chicago, IL, USA) was used for linear regression analysis. The coefficient of determination R² was used to quantify the correlations between the geometric indices (maximum, mean, and 95% HD; DSC; and Jaccard) and the dosimetric indices (D98%, Dmean, and D2%). Two-sided p-values were obtained, and p-values < 0.05 were considered significant. In addition, the geometric indices obtained from the equidistant scaling transformations and from the translations of the C-shaped PTV in the right, anterior, and posterior directions were compared, and the differences between them were tested for statistical significance using the Wilcoxon signed rank test in SPSS. Finally, the feasibility of assessing contour accuracy with geometric indices was analyzed from scatterplots of the geometric indices versus the dose difference.
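The regression step can be illustrated with an ordinary least-squares fit of a dose difference against a geometric index. This sketch is not the SPSS procedure: it computes only the slope, intercept, and R² (the squared Pearson correlation in the single-predictor case) and omits the p-value, which would require the t distribution. The data points are hypothetical.

```python
def linear_fit_r2(x, y):
    """Least-squares line y = slope * x + intercept and its R²."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2

# Hypothetical HD values (mm) and normalized dose differences (%).
hd_values = [1.0, 2.0, 3.0, 4.0, 5.0]
dose_diffs = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, r2 = linear_fit_r2(hd_values, dose_diffs)
```

A high R² here says only that a straight line describes these points well; it does not by itself say whether a given index value corresponds to a clinically acceptable dose difference.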

Results
Linear regression analysis was carried out on the geometric indices and dosimetric indices (Tables 1, 2 and 3). Table 4 shows the results of the Wilcoxon signed rank test between the geometric indices of the translation transformations in the right, anterior, and posterior directions and of the equidistant scaling transformations in opposite directions. Across the five directions analyzed for the C-PTV, the p-values of HD, HDmean, and HD95 were all greater than 0.05, whereas the p-values of DSC and Jaccard were all less than 0.05, indicating statistically significant differences.
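The paired comparison behind Table 4 can be illustrated with a small exact implementation of the Wilcoxon signed rank test. This is a sketch under stated assumptions: zero differences are dropped, tied absolute differences receive average ranks, and the two-sided p-value comes from enumerating all sign assignments, which is practical only for small samples. SPSS may report an asymptotic p-value instead, so the numbers need not agree exactly. The example inputs are hypothetical.

```python
from itertools import product

def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed rank test for paired samples.
    Returns (W, p) with W = min(W+, W-)."""
    d = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(d)
    if n == 0:
        return 0.0, 1.0  # identical samples: no evidence of a difference
    if n > 20:
        raise ValueError("exact enumeration is intended for n <= 20")
    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    w_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    w = min(w_plus, w_minus)
    # Null distribution: every rank is positive or negative with equal probability.
    total = sum(ranks)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= w:
            extreme += 1
    return w, extreme / 2 ** n

# Hypothetical paired index values from two transformation directions.
same_hd = wilcoxon_signed_rank([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
shifted = wilcoxon_signed_rank(
    [2, 4, 6, 8, 10, 12, 14, 16, 18, 20], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
)
```

When two directions yield identical index values at every step, all paired differences are zero and the test cannot detect a difference, which mirrors the situation described for the distance-type indices.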
Figs. 2-6 show the relationships between the geometric indices of the targets and the dose differences (ΔD98%, ΔDmean). For the targets with random or systematic errors, the relationships were inconsistent across the geometric indices. In addition, for the water phantom target (C-PTV) with systematic errors, the relationships were non-monotonic.

Discussion
Many studies have shown that, although it is important to quantify the degree of variation or uncertainty in contouring, it is more important to determine the dose difference and clinical impact [10,11,13,14,17,21,22]. In earlier work, van Rooij et al. [14] studied the accuracy of deep learning-based automatic delineation of organs at risk in the head and neck region using both geometric and dosimetric indices, and analyzed the correlation between the geometric index SDC (Sørensen-Dice similarity coefficient) and the dose difference. They found a weak correlation between the SDC and ΔD across all automatically segmented OARs (r = -0.24, p = 0.002), but the correlation was not specific to a particular OAR or patient. This is partly similar to the results of our study. We found that the geometric indices obtained by geometric transformation were significantly correlated with the dosimetric indices overall, but the correlation was not consistent across the different forms of geometric transformation. There was a strong and significant correlation between the geometric and dosimetric indices for the translation, scaling, and rotation transformations of the C-PTV with systematic errors, whereas the correlation for the sine function transformation was weak or not significant. After the periodic sine function transformation of fixed amplitude, the contour of the C-PTV changed very little (see Additional file 1: Supplementary Figure S7 for the contours after the sine function transformation), and the HD values were less than 2 mm (Fig. 2b); the dose difference was therefore small and, correspondingly, the correlation was weak. This also reflects the importance of contour training for junior residents: when the repeatability of contouring is high and the contour difference is small, the dose difference is also small.
The correlation coefficient obtained for the anterior translation of the C-PTV was lower than those of the other two translation transformations. To avoid delivering a high dose to the surrounding organs at risk (Fig. 1a), physicians try to keep high-dose areas away from the organs at risk when designing the original radiotherapy plan. A structure in a low-dose region is less likely to show an obvious dose change, even if its contour varies greatly in spatial extent [23]. When the anterior translation occurs in this area, the minimum dose (D98%) of the target is almost unchanged, which results in a weaker correlation coefficient, and the correlation coefficient of Dmean is higher than that of D98%. This is consistent with the study by Lim et al. [10], which found that the correlation between geometric and dosimetric indices was affected by the goals of the treatment plan. Taken together, these studies show that the correlation between geometric indices and dosimetric indices can be affected by many factors, such as the method of geometric transformation, the relative positions of the target and organs at risk, and the constraint goals of the radiotherapy plan.
Beasley et al. [24] reported that, when measured with a suitable spatial metric, a higher geometric accuracy of the contour should be reflected in a smaller dose difference, and vice versa. At the same time, many reports on auto-segmentation technology select the geometric index arbitrarily when evaluating the acceptability of contouring results. According to the Wilcoxon signed rank test results in Table 4, for the geometric transformations of the C-PTV, the distance-type geometric indices HD, HDmean, and HD95 could not express the difference between the translations in the right, anterior, and posterior directions, or the difference between equidistant expansion and reduction. However, there were significant differences between the volumetric geometric indices obtained from the different transformation directions of this irregularly shaped target. The HD values of the equidistant scaling transformations were the same for the C-PTV, but their clinical effects were different. As shown in Fig. 2b, when HD = 1.547 mm, the dose difference was within 5% for the equidistant expansion, whereas it was beyond -5% for the equidistant reduction. In such cases, the geometric index values obtained from the same type of geometric transformation in different directions were not distinguishable. Moreover, for a given target contour, the corresponding dose differences differ even when different types of geometric indices have the same value. When the distance-type geometric index value was about 3 mm, the dose difference corresponding to HDmean was already close to 30%. For this reason, although HDmean had the highest correlation coefficient, we consider that HDmean alone is not suitable as an index for evaluating contour accuracy. Compared with HD and HD95, it is too sensitive, and its gradient is too steep to reflect the actual clinical situation.
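Why a distance-type index can fail to separate expansion from reduction while a volumetric index does not can be seen in an idealized case. The sketch below uses two concentric circles rather than the C-PTV, so it illustrates the geometry only; the radii and the 3 mm margin are hypothetical, and the function name is an assumption.

```python
def concentric_indices(r_test, r_ref):
    """HD, DSC and Jaccard for two concentric circles with radii r_test and r_ref.
    Areas are compared up to the common factor pi, which cancels in both ratios."""
    hd = abs(r_test - r_ref)          # every boundary point is offset by the same margin
    area_test, area_ref = r_test ** 2, r_ref ** 2
    overlap = min(area_test, area_ref)  # the smaller circle lies inside the larger one
    dsc = 2 * overlap / (area_test + area_ref)
    jac = overlap / max(area_test, area_ref)
    return hd, dsc, jac

# Test contour of radius 20 mm; reference expanded or reduced by the same 3 mm margin.
expanded = concentric_indices(20.0, 23.0)
reduced = concentric_indices(20.0, 17.0)
```

Both cases give the same HD of 3 mm, yet the DSC and Jaccard values differ between expansion and reduction, because the same margin removes or adds a different fraction of the area.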
Since there is no independent reference standard for distance-type geometric index values, it is impossible to compare contour accuracy across different structures. This means that even if two different structures have the same geometric index value, this does not indicate that the quality of their contours is the same. As can be seen in Figs. 2-6, the contour quality of two structures (oropharyngeal cancer and prostate cancer) with the same geometric index value is debatable.
For volumetric geometric indices, many studies consider the agreement between the reference contour and the test contour to be good if the DSC is higher than the commonly reported value of 0.7 [22,25,26].
Our research showed that when the DSC and Jaccard values of the anterior translation were between 0.5 and 0.7, the corresponding dose differences were still very small (Fig. 2c); at the same time, there were cases in which the DSC values were greater than 0.7 and the corresponding dose differences were large. These two contradictory situations show that setting an acceptance threshold for DSC is not reliable.
This study introduced systematic and random geometric errors through translation, scaling, rotation, and sine function transformations, and analyzed the feasibility of using geometric indices for clinical evaluation and their ability to identify the direction of transformation. Although the geometric indices reflected the geometric difference between a test contour and a reference contour, and the correlation coefficients between the geometric indices and the dosimetric parameters in this study were relatively high, the relationships between geometric and dosimetric indices were not consistent among different geometric indices, transformation forms, and targets. It is therefore not sound to use geometric indices alone to evaluate the clinical acceptability of contouring results. In addition, our current research is based on simulation experiments with geometric transformations; the relationship between geometric and dosimetric indices should be explored further using the actual contouring results of junior residents.

Conclusion
At present, guidance on evaluating contours with geometric indices is lacking, and a normative framework is therefore needed. We found that the relationships between the geometric indices and dosimetric indices were not consistent, which indicates that evaluating contouring results with geometric indices alone is inaccurate. The clinical acceptability of contouring results cannot be judged by geometric indices alone. Therefore, we suggest that dosimetric indices be added to evaluations of delineation accuracy, which can help explain the clinical dose-response relationship of delineation more comprehensively and accurately.