Comprehensive and Clinically Accurate Head and Neck Organs at Risk Delineation via Stratified Deep Learning: A Large-scale Multi-Institutional Study

Accurate organ at risk (OAR) segmentation is critical to reduce the radiotherapy post-treatment complications. Consensus guidelines recommend a set of more than 40 OARs in the head and neck (H&N) region, however, due to the predictable prohibitive labor-cost of this task, most institutions choose a substantially simplified protocol by delineating a smaller subset of OARs and neglecting the dose distributions associated with other OARs. In this work we propose a novel, automated and highly effective stratified OAR segmentation (SOARS) system using deep learning to precisely delineate a comprehensive set of 42 H&N OARs. SOARS stratifies 42 OARs into anchor, mid-level, and small&hard subcategories, with specifically derived neural network architectures for each category by neural architecture search (NAS) principles. We built SOARS models using 176 training patients in an internal institution and independently evaluated on 1327 external patients across six different institutions. It consistently outperformed other state-of-the-art methods by at least 3-5% in Dice score for each institutional evaluation (up to 36% relative error reduction in other metrics). More importantly, extensive multi-user studies evidently demonstrated that 98% of the SOARS predictions need only very minor or no revisions for direct clinical acceptance (saving 90% radiation oncologists workload), and their segmentation and dosimetric accuracy are within or smaller than the inter-user variation. These findings confirmed the strong clinical applicability of SOARS for the OAR delineation process in H&N cancer radiotherapy workflows, with improved efficiency, comprehensiveness, and quality.

Head and neck (H&N) cancer is one of the most common cancers worldwide 1 . Radiation therapy (RT) is an important and effective treatment for H&N cancer 2 . In RT, the radiation dose to normal anatomical structures, i.e., organs at risk (OARs), needs to be limited to reduce posttreatment complications, such as dry mouth, swallowing difficulties, visual damage, and cognitive decline 3-6 . This requirement demands accurate OAR delineation on the planning computed tomography (pCT) images used to configure the radiation dosage treatment. Recent consensus guidelines recommend a set of more than 40 OARs in the H&N region 7 .
Nevertheless, precise manual delineation of this quantity of OARs is an overwhelmingly demanding task that requires great clinical expertise and time efforts, e.g., > 3 hours for 24 OARs 8 . Due to the factors of patient overload and shortage of experienced physicians, long patient waiting times and/or undesirably inaccurate RT delineations are more common than necessary, reducing the treatment efficacy and safety 9 . To shorten time expenses, many institutions choose a simplified (sometimes overly simplified) OAR protocol by contouring a small subset of OARs (e.g., only the OARs closest to the tumor). Dosimetric information cannot be recorded for non-contoured OARs although it is clinically important to track for analysis of post-treatment side effects 10 . Automatic and accurate segmentation of a comprehensive set of H&N OARs is of great clinical benefit in this context.
OARs are spatially densely distributed in the H&N region and often have complex anatomical shapes, large size variations, and low CT contrasts. Conventional atlas-based methods previously enjoyed a prominent history 11-15 , but significant amounts of editing efforts were found to be unavoidable 8,16 . Atlas-based methods heavily rely on the accuracy and reliability of deformable image registration that can be very challenging due to OARs' large shape variations, normal tissue removal, tumor growth, and image acquisition differences.
Volumetric deformable registration methods often take many minutes or even hours to compute.
Deep learning approaches have shown substantial improvements for improving segmentation accuracy and efficiency as compared to atlas-based methods 17 . After early patch-based representation 18 , fully convolutional network is the dominant formulation on segmentation [19][20][21][22] or adopting a segmentation-by-detection strategy 23 24 when the number of considered OARs is often fewer than or around 20. With a greater number of OARs needed to be segmented, deep network optimization may become increasingly difficult. From an early preliminary version of this work 25 , we introduced a novel stratified deep learning framework to segment a comprehensive set of H&N OARs by balancing the OARs' intrinsic spatial and appearance complexity with adaptive neural network architectures. The proposed system, stratified organ at risk segmentation (SOARS), divides OARs into three levels, i.e., anchor, midlevel, and small & hard (S&H) according to their complexity. Anchor OARs are high in intensity contrast and low in inter-user variability and can be segmented first to provide informative location references for the following harder categories. Mid-level OARs are low in contrast but not inordinately small. We use anchor-level predictions as additional input to guide the mid-level OAR segmentation. S&H OARs are very small in size or very poor in contrast. Hence, we use a detection by segmentation strategy to better manage the extremely unbalanced class distributions across the entire volume. Besides this processing stratification, we further deploy another stratification by using neural architecture search (NAS) to automatically determine the optimal network architecture for each OAR category since it is unlikely the same network architecture suits all categories equally. We specifically formulate this structure learning problem as differentiable NAS 26,27 , allowing automatic selection across 2D, 3D or Pseudo-3D (P3D) convolutions with kernel sizes of 3 or 5 pixels at each convolutional block.
SOARS achieves the state-of-the-art performance in segmenting 42 OARs in a single institution cross-validation evaluation 25 , but essential questions remain unclear regarding to its clinical applicability and generality: (1) does SOARS generalize well into a large-scale multiinstitutional evaluation?; (2) how much manual editing effort is required before the predicted OARs can be considered as clinically accepted?; (3) how well does the segmentation accuracy of SOARS compare towards inter-user variation?; and more critically, (4) what are the dosimetric variations brought by OAR differences in the downstream RT planning stage? To adequately address these questions, we first enhance SOARS by replacing the segmentation backbone of P-HNN 28 with UNet 29 and conduct the NAS optimization based on the UNet architecture. Then, we extensively evaluate SOARS on an external set of 1327 unseen H&N cancer patients from six institutions (one internal and five external). Using another 30 randomly selected external patients, we further conducted three subjective user studies: (1) physician's assessment of the revision effort and time spent when editing on predicted OARs; (2) a comparison of contouring accuracy between SOARS and the inter-user variation; and (3) in the intensity modulated RT (IMRT) planning, a dosimetric accuracy comparison using different OAR contours (SOARS, SOARS + physician editing, and physician's manually labeling).

Datasets for training and evaluation
In this multi-institutional retrospective study, we collected, in total, 1533 H&N cancer patients  Table 1.   Table 2.
Independent internal testing dataset. Next, for independent evaluation, we collected 326 patients from CGMH between 2012 and 2020 as another internal testing dataset besides the training-validation. OAR labels in this cohort were extracted from those generated during the clinical RT contouring process that senior physicians confirmed. Depending on the H&N cancer types or tumor locations, a range of 18 to 42 OAR contours were generally available for each patient in this cohort.
Multi-institutional external testing dataset. For quantitative external evaluation, 1001 patients were collected from five different institutions located in various areas of mainland China between 2013 and 2019 (external testing dataset). Each patient is accompanied by the clinical RT treatment OAR contours, ranging from 13 to 25 OARs, depending on their institutionalspecific RT protocols. Senior physicians of each institution examined these clinical OAR contours to ensure they met the delineation consensus guidelines 7 . Detailed patient statistics and subject characteristics of these five external institution datasets are given in Table 1.
Multi-user testing dataset. To further evaluate the clinical applicability of SOARS, 30 nasopharyngeal cancer (NPC) patients were randomly selected from one external institution (FAH-ZU) to form a multi-user testing dataset. In this cohort, each patient contained 13 OAR contours, the tumor target volume contours, and the IMRT plan originally generated by the clinical team at FAH-ZU. First, two senior physicians (both with >10 years' experience in treating H&N cancers) edited the SOARS predicted 42 OARs (resulting in SOARS-revised contours) and recorded the editing time to assess the revision efforts required for making SOARS predicted OAR contours to be clinically accepted. One senior physician manually edited the 13 OARs used in FAH-ZU's RT protocol, while the other senior physician edited the other 29 OARs not included in FAH-ZU's RT protocol. Second, another physician with 4 years' experiences manually contoured the 13 OARs in FAH-ZU's protocol (denoted as human reader contours).
Then, using the clinical treatment contours of the 13 OARs as gold-standard references, we compared the contouring accuracy of SOARS, SOARS-revised, and the human reader. Third, while keeping the original dose grid in RT plan, we replaced the clinical reference OAR contours by the SOARS, SOARS-revised, human reader contours respectively, to analyze the impact on OAR dose metrics. This helps determine if differences in OAR contouring would produce clinically relevant differences of radiation doses received by the OARs in the downstream dose planning stage. The overview of the multi-user evaluation is illustrated in Fig. 1.

Performance on the CGMH internal testing dataset
The quantitative performance of SOARS in the internal testing dataset is summarized in Table   3. SOARS achieved a mean Dice score coefficient (DSC), Hausdorff distance (HD) and average surface distance (ASD) of 74.8%, 7.9mm and 1.1mm, respectively, among 42 OARs. For stratified OAR categories, mean DSC, HD and ASD for anchor OARs were 86.9%, 5.0mm and 0.7mm, respectively; for mid-level OARs were 74.6%, 12.4mm and 1.9mm, respectively; and for S&H were 67.2%, 3.7mm and 0.7mm, respectively. In comparison, the previous state-of-the-art H&N OAR segmentation approach UaNet 24 had a significantly worse performance (DSC: 69.8% vs 75.3%, HD: 8.8 vs 7.9mm, ASD: 1.6 vs 1.1mm; all p<0.001). UaNet adopted a modified version of 3D Mask R-CNN 30 , which decoupled the whole task into detection followed by segmentation. Although UaNet achieved one of the previous best performances, it lacked dedicated stratified learning to adequately handle a larger number of OARs, possibly accounting for the markedly inferior segmentation accuracy compared to SOARS. Among three stratified OAR categories, S&H OARs exhibited the largest gap between SOARS and UaNet (DSC: 67.2% vs 59.4%, HD: 3.7 vs 4.7mm, ASD: 0.7 vs 1.2mm; all p<0.001). This result further confirmed the advantage of SOARS, which employed an adaptively tailored processing workflow and an optimized network architecture towards a particular category of OARs. Fig. 3 shows several qualitative comparisons on the internal testing dataset.

Performance on the multi-institutional external testing dataset
The overall quantitative external evaluation and the individual external institution evaluation results are shown in Table 4. SOARS achieved a mean DSC, HD95, and ASD of 78.0%, 6.7mm and 1.0mm, respectively, among 25 H&N OARs overall. These represented significant performance improvement (p<0.001) as compared against the UaNet (4% absolute DSC increase, 16% HD reduction, and 40% ASD reduction). For individual institutions, average DSC scores of SOARS ranged from 76.9% in FAH-XJU to 80.7% in GPH, while most institutions yielded approximately 78% DSC. HD values of SOARS were from 5.9mm in FAH-ZU to 8.1mm in SMU; and ASD obtained from 0.9mm in FAH-ZU and GPH to 1.3mm in SMU and FAH-XJU.
Although the OAR numbers varied for external institutions (due to differences among institutional specific RT treatment protocols), these quantitative performance metrics are generally comparable against the internal testing performance levels, demonstrating that SOARS' generality and accuracy hold well to this large-scale external dataset. SOARS consistently and statistically significantly outperforms (p<0.001) UaNet in external evaluation (UaNet had a mean DSC, HD and ASD of 74.3%, 8.0mm and 1.4mm, respectively). SOARS outperforms UaNet in 21 out of 25 OARs on all metrics, with an average DSC improvement of ~4% and relative distance error reductions of 17.5% for HD and 28.5% for ASD.

Assessment of editing effort in multi-user testing dataset
In 30 multi-user evaluation patients, assessment from two senior physicians showed that the vast majority (1237 of 1260 = 42 OAR types × 30, or 98%) of OAR instances produced by SOARS were clinically acceptable or required only very minor revision (no revision: 729 (58%); revision < 1 minute: 508 (40%)). Only 23 (2%) OAR instances had automated delineation or contouring errors that required 1-3 minutes of moderate modification efforts. None OAR instances required > 3 minutes of major revision.

Inter-user contouring accuracy in multi-user testing dataset
The contouring accuracy of SOARS, SOARS-revised and human reader in the multi-user testing dataset is shown in Table 5. It is observed that SOARS consistently yielded higher or comparable performance in all 13 OARs (used in FAH-ZU's RT protocol) as compared to the performance of the human reader (a physician with 4 years' experience). Overall, SOARS achieved significantly improved quantitative results (p<0.001) in mean DSC (0.82 vs 0.79), HD (4.3 vs 6.1mm) and ASD (0.6 vs 1.0mm). 11 out of 13 OAR types demonstrated marked improvements when comparing SOARS with the human reader. On the other hand, by comparing the contouring accuracy between SOARS and SOARS-revised, they have showed very similar quantitative performances (mean DSC: 0.82 vs 0.83, HD: 4.3 vs 3.9mm, and ASD: 0.6 vs 0.5mm). Note that SOARS derived contours (both SOARS and SOARS revised) have significantly better performance as compared to those of the human reader, representing the inter-user segmentation variation. Results from the inter-user variation and the previous revision effort assessment validated that SOARS can be readily serving as an alternative "expert" to output high-quality automatically delineated OAR contours, where very minor or no manual efforts are usually required on further editing the SOARS' predictions.

Dosimetric accuracy in multi-user testing dataset
Although OAR contouring accuracy reflects the OAR delineation quality, we can further examine its impact on the important downstream dose planning step. The quantitative dosimetric accuracy of various OAR sets, i.e., SOARS, SOARS-revised, and human reader, is illustrated in Table 5 and Fig. 5 (c), and the relationship between contouring accuracy and dosimetric accuracy is plotted in Supplementary Fig. 2 and Fig. 3. It was observed that, for SOARS, the dosimetric differences in mean dose and in maximum dose were 4.8% and 3.5%, respectively, averaged across all 13 OARs using 30 patients. These were statistically significantly smaller (p<0.001) than those of the human reader contours (6.1% and 5.0%), and comparable to those of SOARS-revised (4.7% and 3.4%). More specifically, using SOARS predictions, only 25 out of 390 (6%) OAR instances among 30 patients had a mean dose variation larger than 10%, and no OAR has a mean dose difference larger than 30%. In comparison, using the human reader contours, 49 out of 390 (12%) OAR instances among 30 patients had a mean dose variation larger than 10%, and 12 OAR instances with a mean dose difference larger than 30%. SOARSrevised contours generally had comparable performance with SOARS. Similar trends were observed for the differences in maximum dose. These results demonstrated that the high contouring accuracy of SOARS evidently led to high dosimetric accuracy in the dose planning stage. Fig. 5 (a, b) shows qualitative dosimetric examples and dose-volume histograms (DVH) for using three substitute OAR sets (SOARS, SOARS-revised, human reader). We observed that doses received by most OARs from SOARS and SOARS-revised matched more closely to the clinical reference doses than those from the human reader.

Discussion
In this multi-institutional study, we presented a novel Stratified OAR Segmentation deep learning model, SOARS, that can be used to automatically delineate 42 H&N OARs as the current most comprehensive clinical protocol. By stratifying the organs into three different OAR categories, the processing workflows and segmentation architectures (computed by NAS) were optimally tailored. As such, SOARS is a well-calibrated synthesis of organ stratification, multistage segmentation, and NAS. SOARS was trained using 176 patients from CGMH and extensively evaluated on 1327 unseen patients from six institutions (326 from CGMH and 1001 from five other external medical centers). It achieved a mean DSC and ASD of 74.9% and 1.3mm, respectively, in 42 OARs from the CGMH internal testing and generalized well to the external testing with a mean DSC of 78.0% and ASD of 1.0mm, respectively, in 25 OARs.
SOARS consistently outperformed the previous state-of-the-art method UaNet 24 by 3-5% absolute DSC and 16-32% of relative ASD in all six institutions. In a multi-user study, 98% of SOARS-predicted OARs required no revision or very minor revision from physicians before they were clinically accepted, and the manual contouring time can be reduced by 90% (from 106.4 to 10.5 minutes). In addition, the segmentation and dosimetric accuracy of SOARS were comparable to or smaller than the inter-user variation. In this work, from the OAR contouring quality, we further analyzed the OAR dosimetric accuracy in the subsequent dose planning step. The dosimetric differences in mean dose and in maximum dose were used as dose metrics consistently with previous work 33 . Overall, the vast majority of SOARS-predicted OAR instances had the mean and maximum dose variance no larger than 10%, which was comparable to or smaller than the inter-user dose variations in our experiment. This variation was also smaller than the previously reported inter-user dose variations in six H&N OAR types 33 , where quite few are larger than 30% or even above 50%.
For individual OARs, we observed that the optic chiasm and optic nerve (left and right) exhibited increased dose variation (10-30%) in a small portion of patients ( Supplementary Fig. 2 and Fig.   3). This phenomenon was consistently observed in SOARS, SOARS-revised, and the human reader contours. This indicated that dosimetries in areas consisting of these OARs are sensitive to the contouring differences, suggesting that more attention should be required to delineate the above OAR types for NPC patients.
Our study had several limitations. First, the external testing datasets do not have a complete set or the same amount of 42 OAR types. This reflects real-world situations among different institutions. Manually labeling 42 OARs for all 1001 external testing patients is impractical (estimated to require ≥3 hours per patient). Hence, we chose to use the existing clinically labeled OAR types to supplement for testing. Second, the multi-user testing dataset of FAH-ZU contains only 13 clinical reference OAR types according to its RT protocol. Thus we evaluated the inter-user variation of segmentation and dosimetric accuracy using these 13 OARs instead of the complete 42 OAR types. Nevertheless, these 13 OAR types included those from the three different OAR categories of anchor, mid-level, and S&H. We believe the performance from these would reflect the real inter-user variation with a larger number of OAR types. Third, to evaluate the dose metrics, we kept the original planning dose grid the same while replacing the original clinical OAR contours with substitute contours by SOARS, SOARSrevised, and the human reader. It would be interesting if we can further use the substitute OAR contours and original tumor target volumes to generate new planning dose grids to evaluate the OAR dose metrics, which might affect the tumor target dose distributions as well. It would also be helpful to conduct a randomized clinical trial comparing the side effects and life quality as outcomes of manual and SOARS assisted OAR contouring. This could further validate the clinical value of SOARS. We leave these for our future works.
To conclude, we introduced and developed a stratified deep learning method to segment the most comprehensive 42 H&N OAR types in radiotherapy planning. Through extensive multiinstitutional validation, we demonstrated that our SOARS model achieved accurate and robust performance and produced comparable or higher accuracy in OAR segmentation and the subsequent dose planning than the inter-user variation. Physicians needed very minor or no revision for 98% of the OAR instances (when editing on SOARS predicted contours) to warrant clinical acceptance. SOARS could be implemented and adopted in the clinical radiotherapy workflow for a more standardized, quantitatively accurate, and efficient OAR contouring process with high reproducibility.

Methods
The SOARS framework is illustrated in Fig. 2. It consists of three processing branches to stratify the anchor, mid-level, and S&H OAR segmentation, respectively. Stratification manifested first in the distinct processing workflow used for each OAR category. We next stratified neural network architectures by using differentiable neural architecture search (NAS) 26,27 to search a distinct network structure for each OAR category. We will explain each stratification process below.

Processing Stratification in SOARS
SOARS first segmented the anchor OARs. Then, with the help of predicted anchor OARs, midlevel and S&H OARs were segmented. For the most difficult category of S&H OARs, SOARS first detected their center locations and then zoomed in accordingly to segment the small OARs.
For the backbone of all three branches, we adopted the UNet structure implemented in the nnUNet framework 34 , which has demonstrated leading performance in many medical image segmentation tasks. We tailored each UNet with NAS, which is explained in the subsequent subsection.
We denoted the training data of instances as = { , , , } =1 , where , , , and were the input pCTs and ground-truth masks for anchor, mid-level, and S&H OARs, respectively. The indexing parameter was dropped for clarity. We used boldface to denote vector-valued volumes and used vector concatenation as an operation across all voxel locations.
Anchor branch: Assuming there are anchor classes, we first used the anchor branch to generate OAR prediction maps for every voxel location, , and every output class, : where UNet functions, parameters, and the output prediction maps were denoted as (⋅), (⋅) and ̂, respectively. Anchor OARs are easy and robust to segment based on their own CT image appearance and spatial context features. Consequently, they provided highly informative location and semantic cues to support the segmentation of other OARs.

Mid-level branch:
Most mid-level OARs are primarily soft tissue, which have limited contrast and can be easily confused with other structures with similar intensities and shapes.
Hence, we incorporated the anchor predictions into mid-level learning. Specifically, the anchor predictions and the pCT were concatenated to create a multi-channel input [ ,̂]: Small & hard branch: Considering the low contrast and extremely unbalanced class distributions for S&H OARs across the entire CT volume, direct S&H OAR segmentation is challenging. Here, we further decoupled this branch into a detection followed by segmentation process. Because the H&N region has relatively stable anatomical spatial distribution, detecting rough locations of S&H OARs is a much easier and reliable task. Once the OAR center was approximately determined, a localized region can be cropped out to focus on segmenting the fine boundaries in a zoom-in fashion. The detection was implemented using a simple yet effective heat map regression approach and the heat map labels were generated at each organ center using a 3D Gaussian kernel 35,36 . Let (⋅) denote the UNet function for the detection module, we also combined the anchor branch predictions with pCT as the detection input: where ̂ were the predicted heat maps of S&H OARs. Given the regressed heat map ̂, the pixel location corresponding to the highest value was extracted to crop a volume of interest (VOI) using three times the extent of the maximum size of the OAR of interest. Then, SOARS segmented the fine boundaries of S&H OARs within the VOI. Let denote the cropped VOI in pCT. The S&H OAR segmentation was implemented as: ) . (4)

Automatic Neural Architecture Search in SOARS
Considering the significant statistical variations in OAR appearance, shape, and size, it is unlikely that the same network architecture would suit each OAR category equally. Hence, SOARS automatically searches the more suitable network architectures for each branch, adding an additional dimension to the stratification. We conducted the differentiable NAS 26,27 on top of the network structure of UNet 29 . The NAS search space included 2D, 3D, and pseudo-3D convolutions with either kernel sizes of 3 or 5. Fig. 2 The architecture was learned in a differentiable fashion. We made the search space continuous by relaxing the selection of (⋅ ; × × ) to a softmax function over . For operations, we define a set of learnable logits for each. The weight for an operation is defined as = exp( ) ∑ exp ( ) , and the combined output is ′ = ∑ . As the result of NAS, we selected the operation with the top weight to be the searched operation. We used the same scheme to search the segmentation network architecture for all three branches (excluding the S&H detection module) and trained SOARS using the final auto-searched architecture. The searched network architectures for each branch are listed in Fig. 3. The implementation details are reported in the supplementary materials.

Quantitative evaluation of contouring accuracy
For the internal and external testing datasets, the contouring accuracy was quantitatively evaluated using three common segmentation metrics 37 38 , i.e., Dice similarity coefficient (DSC), Hausdorff distance (HD) and average surface distance (ASD). Additionally, for quantitative comparison, we also trained and tested the previous state-of-the-art H&N OAR segmentation method, UaNet 24 . For the model development of UaNet, we used the default parameter setting from original authors 24 as these have been already specifically tuned for the head and neck OARs. We applied the same training-validation split as ours to ensure a fair comparison.

Human experts' assessment of revision efforts
An assessment experiment by human experts was conducted to evaluate the editing efforts needed for the predicted OARs to be clinically accepted. Specifically, using the multi-user testing dataset, two senior physicians (both >10 years of experience in treating H&N cancers) were asked to edit SOARS predictions of 42 OARs according to the consensus guideline 7 .
Besides the pCT scans, other clinical information, and imaging modality such as MRI (if available) were also provided to physicians as reference. The edited OAR contours were denoted as SOARS-revised. Four manual revision categories were designated as no revision required, revision required in <1 minute (minor revision), revision required in 1-3 minutes (moderate revision), and revision required in >3 minutes (major revision).

Inter-user contouring evaluation
Using the multi-user testing dataset, we further asked a board-certified radiation oncologist with 4 years' experience specialized in treating H&N cancers to delineate the 13 OAR types in FAH-ZU's RT protocol manually. Patients' pCT scans along with their clinical information and other available medical images (including MRI) were provided to the physician. The labeled OAR contours were denoted as human reader contours. Then, we compared the contouring accuracy between SOARS, SOARS-revised, and the human reader using the evaluation metrics of DSC, HD and ASD. The contouring performance of SOARS-revised and the human reader represents the inter-user variation in OAR contouring.

Inter-user dosimetric evaluation
Differences in the OAR contouring accuracy would not, by itself, indicate whether such differences are clinically relevant in terms of radiation doses received by the OARs. Therefore, we further quantified the dosimetric impact brought by the OAR contouring differences. × 100% where OARsubstitute represents the OAR contours by SOARS, SOARS-revised, and the human reader, respectively, while OARref and Doseplan represent the original clinical OAR contours and the dose plans in the actual RT treatment, respectively. The dose-volume histogram (DVH) was also plotted for qualitative illustration. The dose/DVH statistics were generated using Eclipse 11.0 (Varian Medical Systems Inc., Palo Alto, CA).

Statistical Analysis
The Wilcoxon matched-pairs signed rank test was used to compare the evaluation metrics in paired data, while Manning-Whitney U test was used to compare the unpaired data. All analyses were performed by using R 39 . Statistical significance was set at two-tailed p<0.05.

Data availability
The data support the findings of this study are available from the corresponding author upon reasonable request. The image data utilized in this study on the head & neck organs at risk (OARs) segmentation is not publicly available due to the data privacy consideration and restricted permission of the current study.

Code availability
The baseline UNet used in this study is implemented in the nnUNet deep learning framework,

Author contributions
For three first co-authors, D.G. was responsible for the data cleaning, deep learning model  Note: others of tumor sites include tumors located at brain, nasal cavity, or lymph node metastasis. Note: PS, NAS represent processing stratification and neural architecture search, respectively. The unit for Hausdorff distance (HD) and average surface distance (ASD) is in mm.   Table 5. Quantitative comparisons between SOARS, SOARS-revised and human reader contour accuracy on the multi-user testing dataset of 30 patient. DSC, HD and ASD represent Dice similarity coefficient, Hausdorff distance, and average surface distance, respectively. Difference in mean dose and difference in maximum dose are calculated using the equation (6) and (7), respectively. DSC higher the better, while HD, difference in mean dose and difference in maximum dose lower the better. SOARS and SOARS-revised results are compared to human reader results using Wilcoxon matched-pairs signed rank test, and bold and highlighted values represent the best performance and significant improvement (calculated using Wilcoxon matched-pairs signed rank test) as compared between UaNet and SOARS, respectively. . We further randomly collected 30 nasopharyngeal cancer patients from FAH-ZU to form a multi-user testing dataset to evaluate the clinical applicability of SOARS, including the effort for manual revision, comparison to the inter-user OAR segmentation accuracy and comparison to the inter-user OAR dosimetric accuracy.