2.1. Patient data and data preparation
A total of 216 postoperative patients with pathologically proven cervical cancer who received external beam radiotherapy at our treatment center from 2017 to 2020 were included. All patient information was anonymized, and only the CT image data and contour data were used for analysis. All patients were immobilized in the supine position with a thermoplastic mold, and radiotherapy simulation CT images were acquired on a Siemens SOMATOM Definition CT (Siemens Healthcare, Forchheim, Germany) or a Revolution CT (GE Medical Systems, Madison, USA). Each scan was reconstructed with a matrix size of 512 × 512 pixels and a slice thickness of 3 mm. The lymph node subregions in the pelvic area usually include the para-aortic lymph node (PAN), common iliac, external iliac, internal iliac, obturator, presacral and groin nodal regions, which were named CTV_PAN, CTV_common iliac, CTV_external iliac, CTV_internal iliac, CTV_obturator, CTV_presacral and CTV_groin, respectively, in this study. The 216 patients were divided into two cohorts of 152 and 64 cases. For the 152 patients, each lymph node subregion was contoured manually by a junior oncologist in RayStation (Version 4.7.5, RaySearch Laboratories AB, Stockholm, Sweden), then modified where needed and approved by a senior oncologist with more than 15 years of experience. These 152 cases were then randomly divided into training (n = 96), validation (n = 36) and test (n = 20) groups. The other cohort of 64 cases was used to verify the feasibility of clinical application.
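The random division of the 152 contoured cases can be illustrated with a short sketch. The case identifiers, the function name `split_cohort` and the fixed seed are hypothetical; the text does not state how the randomization was performed.

```python
import random


def split_cohort(case_ids, n_train=96, n_val=36, n_test=20, seed=0):
    """Randomly partition case IDs into training, validation and test groups."""
    if len(case_ids) != n_train + n_val + n_test:
        raise ValueError("group sizes must add up to the number of cases")
    shuffled = list(case_ids)
    # A seeded generator keeps the split reproducible (the seed is an assumption).
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical anonymized identifiers for the 152 contoured cases.
case_ids = [f"case_{i:03d}" for i in range(152)]
train_ids, val_ids, test_ids = split_cohort(case_ids)
```

Splitting by case rather than by slice keeps all slices of one patient in the same group.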
Before DL training, the images passed through a preprocessing module consisting of four steps. First, the CT images were normalized. Second, the external body structure was segmented from each CT image with the Otsu algorithm [24]. Third, the center of the external body in each image was shifted to the center of the image. Finally, all images were resized from 512 × 512 × N to 256 × 256 × N, where N is the number of slices.
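The four preprocessing steps can be sketched with NumPy as follows. This is a minimal illustration, not the implementation used in the study. The intensity window in `normalize`, the slice-first array layout (N, H, W), the use of a plain Otsu threshold without further cleanup of the body mask, the wrap-around shift, and resizing by 2 × 2 block averaging are all assumptions, and every function name is hypothetical.

```python
import numpy as np


def normalize(volume, lo=-1000.0, hi=1000.0):
    """Clip to an assumed intensity window and rescale to [0, 1]."""
    v = np.clip(volume.astype(np.float32), lo, hi)
    return (v - lo) / (hi - lo)


def otsu_threshold(image, bins=256):
    """Return the threshold that maximizes the between-class variance."""
    hist, edges = np.histogram(image, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(hist).astype(np.float64)   # pixel count up to each bin
    w1 = w0[-1] - w0                          # pixel count above each bin
    s0 = np.cumsum(hist * centers)            # cumulative intensity sum
    m0 = s0 / np.maximum(w0, 1)               # mean of the lower class
    m1 = (s0[-1] - s0) / np.maximum(w1, 1)    # mean of the upper class
    between = w0 * w1 * (m0 - m1) ** 2
    return centers[np.argmax(between)]


def center_body(slice_2d, mask):
    """Shift the slice so that the centroid of the body mask sits at the image center."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return slice_2d
    dr = int(round((slice_2d.shape[0] - 1) / 2 - rows.mean()))
    dc = int(round((slice_2d.shape[1] - 1) / 2 - cols.mean()))
    # np.roll wraps around the border; acceptable here because the border is air.
    return np.roll(slice_2d, (dr, dc), axis=(0, 1))


def downsample_2x(slice_2d):
    """Halve the in-plane size by averaging 2 x 2 blocks (e.g. 512 -> 256)."""
    h, w = slice_2d.shape
    return slice_2d.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def preprocess(volume):
    """Apply the four steps slice by slice to a volume of shape (N, H, W)."""
    out = []
    for sl in normalize(volume):
        mask = sl > otsu_threshold(sl)
        out.append(downsample_2x(center_body(sl, mask)))
    return np.stack(out)
```

For a volume of shape (N, 512, 512), `preprocess` returns an array of shape (N, 256, 256) in which the body is centered in every slice.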
2.2. Network architectures for DL and training process
Inspired by the advantages of residual networks [25], we designed an encoder-decoder 3D network with residual blocks (Fig 1) for our segmentation task. In the residual blocks and down-sampling modules of the encoder, the stride of the convolution layers was adjusted. The decoder was designed as the combination of a hybrid channel decoder (HC-decoder) and an enhanced channel decoder (EC-decoder), with the aim of handling the large differences in target volume between cases. In the HC-decoder, features were received from the encoder through skip connections. The output features of the upper convolution layer were mixed and denoted as CMi. An attention network was added to the EC-decoder [26]. Each attention block combined a primary branch and a mask branch. The primary branch was composed of two residual blocks, and its output feature was denoted as Ri(x). The mask branch adopted bottom-up and top-down structures, and its output feature was denoted as Mi(x). The mask branch also acted as a control gate on the primary branch, and the gated output was denoted as AMi. Formula (1) shows the relationship between them. Finally, the information from the HC-decoder (CMi) and the EC-decoder (AMi) was superimposed and fused to form the output of the DL network. To better capture context features and preserve spatial information, the dense atrous convolution (DAC) module and residual multi-kernel pooling (RMP) were used to connect the decoder with the encoder [27].
A separate model was trained for each lymph node sub-region, and the model with the highest performance was retained. Adaptive moment estimation (Adam) was used as the optimizer. The loss function was the combination of the cross-entropy loss and the Dice loss. The patch size was 64 × 256 × 256. Training ran for 500 epochs with an initial learning rate of 0.001, which was reduced to one-tenth of its current value whenever the validation loss had not improved for 20 epochs. The network was implemented in the TensorFlow framework (Version 2.1), and the whole training process was run on an NVIDIA GeForce RTX 2080 Ti.
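The loss and the learning-rate rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the training code of the study: the two loss terms are simply added (the text does not give their weights), the loss is written for a binary foreground mask, and the names `combined_loss` and `PlateauSchedule` are hypothetical.

```python
import math


def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss for flat lists of foreground probabilities and 0/1 labels."""
    inter = sum(p * t for p, t in zip(probs, target))
    return 1.0 - (2.0 * inter + eps) / (sum(probs) + sum(target) + eps)


def cross_entropy_loss(probs, target, eps=1e-7):
    """Mean binary cross-entropy over all voxels."""
    total = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def combined_loss(probs, target):
    """Sum of the two terms; equal weighting is an assumption."""
    return cross_entropy_loss(probs, target) + dice_loss(probs, target)


class PlateauSchedule:
    """Divide the learning rate by 10 after `patience` epochs without improvement."""

    def __init__(self, lr=0.001, factor=0.1, patience=20):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = math.inf
        self.stale = 0  # epochs since the validation loss last improved

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr


# The validation loss stops improving after the first epoch.
schedule = PlateauSchedule()
history = [schedule.step(1.0) for _ in range(21)]
```

In TensorFlow, a comparable rule is available as `tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=20)`; whether this callback was used in the study is not stated.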
2.3. Evaluation metrics
The segmentation results were evaluated quantitatively with the Dice similarity coefficient (DSC) and the Hausdorff distance (HD). The DSC is defined by formula (2):
$$\mathrm{DSC}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|} \tag{2}$$
where X and Y are the predicted and ground truth segmentations, respectively. The DSC measures the overlap between the two regions relative to their total size [28]. Its value ranges from 0 to 1, and a higher DSC indicates greater similarity between the two contours being compared. To assess subtle differences between the ground truth and auto-segmentation contours in small regions, the HD between the contours was also calculated. The HD is sensitive to positional differences and is defined as follows:
$$\mathrm{HD}(X, Y) = \max\{h(X, Y),\, h(Y, X)\}, \qquad h(X, Y) = \max_{x \in X} \min_{y \in Y} \lVert x - y \rVert$$
where h(X,Y) is the largest of the distances from each point in one point set to its nearest point in the other point set [29]. In our research, the 95% HD value was used, i.e. the 95th percentile of the distances between the contour boundaries.
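Both metrics can be illustrated on small point sets. The sketch below treats each segmentation as a set of voxel or boundary-point coordinates and is not the evaluation code of the study. The percentile variant shown here takes the larger of the two directed 95th-percentile distances, which is one common convention and an assumption about how the 95% HD was computed; all function names are hypothetical.

```python
import math


def dice(x, y):
    """DSC = 2|X ∩ Y| / (|X| + |Y|) for two sets of voxel coordinates."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0  # two empty segmentations are treated as identical
    return 2.0 * len(x & y) / (len(x) + len(y))


def directed_distances(a, b):
    """Distance from every point in a to its nearest point in b."""
    return [min(math.dist(p, q) for q in b) for p in a]


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def hausdorff(x, y, q=100.0):
    """Symmetric Hausdorff distance; q=95 gives a 95th-percentile variant.

    Both point sets must be non-empty.
    """
    x, y = list(x), list(y)
    return max(percentile(directed_distances(x, y), q),
               percentile(directed_distances(y, x), q))


# Two 2 x 2 squares that overlap in one column.
X = {(0, 0), (0, 1), (1, 0), (1, 1)}
Y = {(0, 1), (1, 1), (0, 2), (1, 2)}
dsc_xy = dice(X, Y)
hd_xy = hausdorff(X, Y)
hd95_xy = hausdorff(X, Y, q=95.0)
```

The brute-force nearest-point search is quadratic in the number of points, so it suits only small examples; distances are in voxel units unless the coordinates are scaled by the voxel spacing.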
As shown in the overall flowchart (Fig 2), after confirming that the auto-segmentation model for each lymph node sub-region achieved good performance, we evaluated the feasibility of assembling overall lymph node CTVs by combining different numbers of sub-regions, using data from the other cohort of 64 patients in 6 different clinical situations. For each situation, different sub-regions were selected and combined into a new CTV, named CTV_LNAT. The CTV_LN of each patient, which was the clinically irradiated target volume, was treated as the ground truth contour against which CTV_LNAT was compared.
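The assembly step can be sketched as a union of the selected sub-region masks. This assumes that "combining" means a voxel-wise union, which the text does not state explicitly. The toy voxel sets and the helper `assemble_ctv` are hypothetical, and the selection shown does not correspond to any of the 6 clinical situations.

```python
def assemble_ctv(subregions, selected):
    """Return the union of the voxel sets of the selected sub-regions."""
    missing = [name for name in selected if name not in subregions]
    if missing:
        raise KeyError(f"unknown sub-regions: {missing}")
    combined = set()
    for name in selected:
        combined |= subregions[name]
    return combined


def dice(x, y):
    """DSC between two voxel sets."""
    if not x and not y:
        return 1.0
    return 2.0 * len(x & y) / (len(x) + len(y))


# Toy auto-segmented sub-regions as sets of (slice, row, column) voxels.
subregions = {
    "CTV_common iliac": {(0, 0, 0), (0, 0, 1)},
    "CTV_obturator": {(0, 0, 1), (0, 1, 1)},
    "CTV_groin": {(5, 5, 5)},
}

# Hypothetical selection of sub-regions for one clinical situation.
ctv_lnat = assemble_ctv(subregions, ["CTV_common iliac", "CTV_obturator"])

# Toy ground truth standing in for the clinically irradiated CTV_LN.
ctv_ln = {(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0)}
dsc = dice(ctv_lnat, ctv_ln)
```

Voxels shared by neighboring sub-regions are counted once in the union, so overlapping sub-region contours do not inflate the assembled volume.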