Variation in Contouring
In all cases, both DLAS and ABAS segmented the muscles with good overall representation. Fig.1 shows an example of the DLAS, ABAS, and manual contours. Contour variability was greatest for MP structures.
Fig.2 shows metrics of geometric and spatial similarity for all the structures manually delineated by the three clinicians. Overall, both T and MP were associated with lower values for DSC, recall, and precision compared with M and LP. Higher values for MSD and HD95/HD were observed for T and MP. Among all structures, T had the highest HD95/HD. More specifically, the mean DSC for M, T, LP, and MP ranged from 0.82±0.06 to 0.90±0.02, with an overall mean of 0.86±0.05. The mean HD and HD95 ranged from 0.42±0.08 to 1.46±0.85 and from 0.20±0.03 to 0.40±0.17, respectively, with overall means of 0.82±0.53 and 0.31±0.13 (unit: cm). The mean MSD ranged from 0.05±0.01 to 0.11±0.05, with an overall mean of 0.08±0.04 (unit: cm). The overall means of the six metrics are shown in each sub-figure and were used as the reference values for calculating scores.
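As an illustration of the overlap metrics reported above, the following is a minimal sketch that assumes the standard voxel-wise definitions of DSC, precision, and recall. The function name, the representation of a contour as a set of voxel indices, and the example data are hypothetical; the source does not describe the actual implementation. One contour is treated as the reference (for example, one observer or the manual contour).

```python
def overlap_metrics(test_voxels, reference_voxels):
    """Return (DSC, precision, recall) for two contours given as voxel-index sets."""
    test, reference = set(test_voxels), set(reference_voxels)
    if not test or not reference:
        raise ValueError("both contours must contain at least one voxel")
    overlap = len(test & reference)                # voxels shared by both contours
    dsc = 2.0 * overlap / (len(test) + len(reference))
    precision = overlap / len(test)                # fraction of test voxels that are correct
    recall = overlap / len(reference)              # fraction of reference voxels recovered
    return dsc, precision, recall


# Hypothetical 2D example: a 10x10 reference contour and a test contour
# of the same size shifted by two voxels along x.
reference = {(x, y) for x in range(10) for y in range(10)}
test = {(x, y) for x in range(2, 12) for y in range(10)}
dsc, precision, recall = overlap_metrics(test, reference)
print(f"DSC={dsc:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Because the two example contours have equal size, all three metrics coincide here; they diverge when one contour is larger than the other.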
Table 2 summarizes the DLAS and ABAS geometric indices for MM segmentations. DLAS was superior to ABAS for all quantitative metrics. More specifically, DSC was 0.86±0.03 and 0.83±0.04 for DLAS and ABAS, respectively, as compared to the inter-observer variation baseline of 0.86±0.05. HD95 was 0.30±0.09 for DLAS and 0.37±0.13 for ABAS, as compared to the baseline of 0.31±0.13. MSD was 0.08±0.02, 0.11±0.03, and 0.08±0.04 for DLAS, ABAS, and baseline, respectively. Overall, DLAS achieved performance equivalent to the mean inter-observer variation for the quantitative metrics, with a smaller standard deviation (SD), except for precision. These results demonstrate that DLAS is more geometrically accurate and reproducible than ABAS; the difference was statistically significant (p<0.05) for all metrics except precision.
Fig.3 shows overall improvements in geometric metrics for each MM pair when using DLAS, as compared to ABAS. Mean DSC for MM structures ranged from 0.79±0.05 to 0.85±0.04 for ABAS, and from 0.83±0.03 to 0.89±0.02 for DLAS. With DLAS, mean recall for all structures was also higher, while mean precision was similar to that of ABAS or slightly worse for some structures. Among the auto-segmented MM structures, MP had the lowest DSC and recall values, and LP had the lowest MSD value. However, T had larger HD/HD95 values than the other structures, which can be explained by the larger volume of the T muscles. Except for precision, a paired t-test indicated that DLAS performed better than ABAS for all metrics of each MM structure, with statistical significance (p < 0.05).
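The paired comparison above can be sketched as follows. This is a minimal illustration of the paired t statistic using only the Python standard library; the function name and the per-case DSC values are hypothetical and are not the study data. The p-value is not computed here: it would come from a t distribution with n - 1 degrees of freedom, for example via `scipy.stats.ttest_rel`.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]          # per-case differences
    sd = statistics.stdev(diffs)                   # sample SD of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-case DSC values for one structure (not study data).
dsc_dlas = [0.86, 0.88, 0.84, 0.87, 0.85]
dsc_abas = [0.83, 0.84, 0.82, 0.85, 0.80]
t, df = paired_t_statistic(dsc_dlas, dsc_abas)
print(f"t={t:.2f}, df={df}")
```

For df = 4, the two-sided critical value at the 0.05 level is 2.776, so a t statistic above that threshold would correspond to p < 0.05 in this toy example.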
The overall scores achieved by the two methods for each muscle are summarized in Fig.4. The highest scores were achieved for T by both methods. For most muscle pairs, DLAS-generated structures had mean scores above 50 while ABAS-generated structures had mean scores below 50, all with statistical significance (p < 0.05), which indicates that ABAS is inferior to the reference established from the inter-observer variation.
Table 3 shows the percentage (%) of cases in which auto-segmentation performed worse than manual segmentation, as judged against the mean DSC of inter-observer variation for each muscle. The percentage of cases that performed worse than manual segmentation ranged from 20.7% to 65.5% for DLAS, and from 41.4% to 96.6% for ABAS. The chi-square test showed that the difference was statistically significant for most of the structures (p<0.05). These results indicate that DLAS performance is superior to that of ABAS and that ABAS segmentations require more contour revision to achieve equivalence. Among all MMs, T segmentations with either DLAS or ABAS had the fewest cases performing worse than manual segmentation.
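The case-level comparison above can be sketched as follows. This is a minimal illustration, assuming a case counts as "worse" when its DSC falls below the inter-observer mean and assuming a Pearson chi-square test on a 2x2 table without continuity correction; the source does not state which variant was used. The function names and the counts are hypothetical.

```python
import math


def percent_worse(dsc_values, baseline):
    """Percentage of cases whose DSC is below the baseline."""
    worse = sum(1 for v in dsc_values if v < baseline)
    return 100.0 * worse / len(dsc_values)


def chi_square_2x2(worse_a, total_a, worse_b, total_b):
    """Pearson chi-square for a 2x2 table (no continuity correction): (chi2, p)."""
    table = [[worse_a, total_a - worse_a], [worse_b, total_b - worse_b]]
    n = total_a + total_b
    rows = [total_a, total_b]
    cols = [worse_a + worse_b, n - worse_a - worse_b]
    if 0 in rows or 0 in cols:
        raise ValueError("chi-square is undefined when a row or column total is zero")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Hypothetical counts: 6 of 29 cases worse for method A, 19 of 29 for method B.
chi2, p = chi_square_2x2(6, 29, 19, 29)
print(f"chi2={chi2:.2f}, p={p:.4f}")
```

If SciPy is available, `scipy.stats.chi2_contingency` with `correction=False` should give the same statistic for a 2x2 table.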
Dosimetric impact of variation in contouring
Fig.5 shows dosimetric endpoints for DLAS and ABAS segmentations of paired MMs. Box plots show the ∆dose of each muscle for DLAS and ABAS. The mean ∆D98%, ∆D95%, ∆D50%, and ∆D2% for most of the structures were less than 10%. However, ∆D98% and ∆D95% were large in some cases; for example, ∆D98% of T-L, LP-L, and MP-L reached up to 100% in three cases. In addition, in one case ∆D50% of MP-L was more than 50% (absolute dose greater than 10 Gy). Among these cases, ipsilateral MMs showed larger dose variation than the contralateral muscles. These findings indicate that, for organs in a steep dose gradient, segmentation variability of several millimeters may drastically change MM dosimetric endpoints. Comparison of ∆dose for DLAS and ABAS revealed generally similar results; the difference was not statistically significant for most of the cases (p>0.05). However, dose to MMs with DLAS more closely matched that of manual segmentations than did dose with ABAS.
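To illustrate why a small contour shift in a steep dose gradient can change the dosimetric endpoints so much, the following is a minimal sketch. It assumes that Dx% is the dose received by at least x% of the structure volume (the (100 - x)th percentile of voxel doses, with equal voxel volumes) and that ∆dose is the relative difference between the auto-segmented and manual contours; the source does not give the exact ∆dose formula. The function names and dose values are hypothetical.

```python
import math


def dose_at_volume(doses, volume_pct):
    """D_x%: dose received by at least x% of the voxels (equal voxel volumes assumed)."""
    if not doses:
        raise ValueError("structure contains no voxels")
    ordered = sorted(doses)
    # D_x% is the (100 - x)th percentile, found by linear interpolation.
    pos = (100.0 - volume_pct) / 100.0 * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def delta_dose_pct(auto_doses, manual_doses, volume_pct):
    """Relative difference (%) in D_x% between auto and manual contours (assumed definition)."""
    d_manual = dose_at_volume(manual_doses, volume_pct)
    if d_manual == 0:
        raise ValueError("manual D_x% is zero; relative difference is undefined")
    return 100.0 * (dose_at_volume(auto_doses, volume_pct) - d_manual) / d_manual


# Hypothetical linear gradient of 1 Gy per voxel: the manual contour covers
# voxels receiving 1..50 Gy, the auto contour is shifted by two voxels (3..52 Gy).
manual = [float(d) for d in range(1, 51)]
auto = [float(d) for d in range(3, 53)]
d50_change = delta_dose_pct(auto, manual, 50)
d98_change = delta_dose_pct(auto, manual, 98)
print(f"dD50%={d50_change:.1f}%  dD98%={d98_change:.1f}%")
```

In this toy gradient the same two-voxel shift changes D50% by under 10% but roughly doubles D98%, because D98% sits at the low-dose edge of the structure where the absolute dose is small.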