In this study, we developed a 3D U-Net algorithm for automated detection and segmentation of pelvic LNs on DWI images of patients with clinically suspected PCa. The results in the testing set and the external dataset confirmed the feasibility of automated LN detection and segmentation, which may aid in LN staging, quantitative measurement of tumour burden and image-guided treatment of PCa patients.
The U-Net algorithm has been widely used for organ and lesion segmentation on MRI images, such as the prostate [22, 23] and prostate lesions [24]. However, this method is not easily applied to LN lesions. Lymphadenopathies in the pelvis vary greatly in shape and size, making it difficult to discriminate true LN regions from other regions. LN lesions usually occupy only a small part of the image volume relative to the background, and this class imbalance makes segmentation more difficult. Additionally, nonspecific high-intensity mimics produce a considerable number of FPs, which usually results in lower specificity. Last but not least, the inter- and intraoperator variability of manual annotation remains a longstanding bottleneck for automated image segmentation [25], and there is currently no reliable substitute for manual labelling.
Therefore, to capture as many LN voxels as possible, all visible LNs in the algorithm development dataset were manually annotated. Moreover, LNs are nodular structures that can be confused with tubular structures on 2D images, because both may appear blob-like in a single slice. Since 3D information helps to distinguish the two, the 3D U-Net model was selected for segmentation algorithm development [26]. To improve the efficiency and interoperator reliability of data annotation in this study, annotations performed by the two junior radiologists were corrected by an expert radiologist, and the two sets of annotations after correction achieved higher Dice scores (Mask 3 vs. Mask 4: 0.88 ± 0.06) than those before correction (Mask 1 vs. Mask 2: 0.75 ± 0.03). The results confirmed the reliability of manual annotation as ground truth for LN segmentation.
The algorithm in this study was trained to segment all visible LNs on DWI images, whether healthy or metastatic. Given that LNs with a short diameter of more than 0.8 cm were considered suspicious for metastasis and of greater clinical significance, a cut-off threshold was set to filter out contiguous structures with a short diameter of less than 0.8 cm. The remaining LNs were used for the performance assessment of suspicious metastatic LNs. In clinical practice, radiologists usually measure and record the short diameter and volume of the largest LN instead of all metastatic LNs. Therefore, in this study, we also analysed the detection and segmentation performance of the model for the largest LNs. The N-staging was automatically generated based on the quantitative measurements (short diameter and volume) of the largest LN and was sent to the structured report of PCa.
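The size filter and the largest-LN rule described above can be sketched as follows. This is a minimal illustration, not the implementation used in this study: the `LymphNode` record, the function names, the choice to rank LNs by short diameter, the treatment of a diameter of exactly 0.8 cm as suspicious, and the N0/N1 rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Cut-off from the text; whether exactly 0.8 cm counts as suspicious is an assumption.
SHORT_DIAMETER_CUTOFF_CM = 0.8


@dataclass
class LymphNode:
    """Hypothetical record holding measurements of one segmented LN."""
    short_diameter_cm: float
    volume_cm3: float


def suspicious_nodes(nodes: List[LymphNode],
                     cutoff_cm: float = SHORT_DIAMETER_CUTOFF_CM) -> List[LymphNode]:
    """Keep LNs whose short diameter reaches the cut-off; drop the smaller ones."""
    return [n for n in nodes if n.short_diameter_cm >= cutoff_cm]


def largest_node(nodes: List[LymphNode]) -> Optional[LymphNode]:
    """Return the LN with the largest short diameter, or None if there are no LNs.

    Ranking by short diameter (rather than volume) is an assumption.
    """
    return max(nodes, key=lambda n: n.short_diameter_cm, default=None)


def n_stage(nodes: List[LymphNode],
            cutoff_cm: float = SHORT_DIAMETER_CUTOFF_CM) -> str:
    """Assumed rule: N1 if the largest LN reaches the cut-off, otherwise N0."""
    largest = largest_node(nodes)
    if largest is not None and largest.short_diameter_cm >= cutoff_cm:
        return "N1"
    return "N0"


example = [LymphNode(0.5, 0.1), LymphNode(1.2, 0.9), LymphNode(0.8, 0.3)]
print(len(suspicious_nodes(example)), n_stage(example))
```

Size alone is only a surrogate for metastasis, so the stage produced by such a rule is a measurement-based suggestion rather than a diagnosis.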
Our results showed that the model achieved good segmentation accuracy of pelvic LNs with an average Dice score and VS of 0.76 ± 0.15 and 0.82 ± 0.14, respectively, in the testing set. Furthermore, the segmentation accuracy of suspicious LNs was significantly higher than that of all LNs (Dice score: 0.85 vs. 0.76, P = 0.009; VS: 0.86 vs. 0.82, P = 0.046), which indicated that the segmentation model performed better with large LNs. The average short diameter and volume of the LNs were also measured to quantitatively evaluate the segmentation performance. The short diameter is regarded as the reference value for determining the existence of suspicious metastatic LNs, and volume is of high significance for the evaluation of tumour load and treatment response. In our results, both the short diameter and volume of the automated segmentations showed a close correlation with manual annotations on all LNs, suspicious LNs and the largest LNs (all with R > 0.80).
To evaluate the segmentation performance of the 3D U-Net model, a volume-based metric (VS) was used in addition to the commonly used overlap-based metrics (Dice, TPR and PPV). The Dice coefficient, which directly compares the overlap between automated segmentation and manual annotation, is the most commonly used metric for evaluating medical image segmentation [27]. The TPR measures the proportion of positive voxels in the manual annotation that are also identified as positive by the automated segmentation, and the PPV measures the proportion of positive voxels in the automated segmentation that are also positive in the manual annotation. VS is a similarity measure that compares the volumes of the manual and automated segmentations without regard to their overlap [21, 28]. With a high VS, the model might be an accurate and convenient tool to assess tumour burden in LNs.
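For illustration, the four metrics can be computed from voxel-wise counts of true positives (TP), false positives (FP) and false negatives (FN). The sketch below is not the evaluation code of this study. The function name `segmentation_metrics` is hypothetical, and the VS formula, 1 - |FN - FP| / (2TP + FP + FN), is the commonly used definition and is assumed here to match the one in [21, 28].

```python
import numpy as np


def segmentation_metrics(pred, truth):
    """Voxel-wise Dice, TPR, PPV and VS for two binary masks of equal shape.

    pred  : automated segmentation (non-zero = LN voxel)
    truth : manual annotation      (non-zero = LN voxel)
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    if pred.shape != truth.shape:
        raise ValueError("masks must have the same shape")

    tp = np.count_nonzero(pred & truth)    # voxels positive in both masks
    fp = np.count_nonzero(pred & ~truth)   # positive only in the automated mask
    fn = np.count_nonzero(~pred & truth)   # positive only in the manual mask

    def ratio(num, den):
        # An empty denominator means the metric is undefined for this pair.
        return num / den if den else float("nan")

    total = 2 * tp + fp + fn               # |pred| + |truth|
    return {
        "dice": ratio(2 * tp, total),
        "tpr": ratio(tp, tp + fn),
        "ppv": ratio(tp, tp + fp),
        # Assumed definition of volumetric similarity (see lead-in).
        "vs": 1.0 - ratio(abs(fn - fp), total),
    }


truth = np.array([1, 1, 1, 0, 0, 0])
pred = np.array([0, 1, 1, 1, 1, 0])
print(segmentation_metrics(pred, truth))
```

In this toy case the automated mask is one voxel larger than the manual one (4 vs. 3 voxels), so VS stays high even though only two voxels overlap. This is why VS is read alongside the overlap-based metrics rather than in place of them.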
The lesion detection approach was built on the segmentation result obtained with the 3D U-Net model. Unlike the segmentation assessment, the detection assessment focused on the suspicious LNs in the testing set and external dataset, because the detection of small LNs (short diameter < 0.8 cm) is of little clinical significance. In the external dataset, in addition to the detection and segmentation assessment of the 3D U-Net, we evaluated the clinical value of the model for assessing suspicious LNs by comparing the consistency of LN staging (N0 or N1) between different readers. Cohen’s kappa coefficient between the model and the expert radiologist was significantly higher than that between the resident and the expert radiologist (0.922 vs. 0.844), which confirmed the feasibility of clinical application and the potential of the model as a tool for improving the diagnostic accuracy of LN staging by less experienced residents.
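As a reminder of how the agreement statistic is obtained, Cohen's kappa compares the observed agreement between two readers with the agreement expected by chance from their label frequencies. The sketch below uses made-up N-stage labels, not data from this study, and the function name `cohen_kappa` is hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of cases")

    n = len(rater_a)
    # Observed agreement: fraction of cases with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the marginal label frequencies, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in labels) / (n * n)

    if expected == 1.0:
        # Both raters used one and the same label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only (four hypothetical patients).
model = ["N0", "N0", "N1", "N1"]
expert = ["N0", "N1", "N1", "N1"]
print(cohen_kappa(model, expert))
```

Here the readers agree on three of four cases (observed agreement 0.75) while chance agreement is 0.5, giving a kappa of 0.5.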
Several limitations of this study need to be pointed out. First, the current algorithm cannot provide the exact anatomical location of a particular LN, and a future CNN trained with multiclass annotations of different regional LNs (obturator nodes, external iliac nodes, internal iliac nodes, and common iliac nodes) may be helpful for anatomical localization. Second, the 3D U-Net detects, segments and measures LNs larger than 0.8 cm in the short axis but does not render a diagnosis of metastatic LNs. Here, LNs suspicious of metastasis were identified based on their short diameter, which is of reference value to some extent but cannot represent true metastatic LNs. More LN information from PET/CT, MR lymphography or pelvic LN dissection may be necessary for reliable metastatic LN diagnosis. Third, we confined our proof-of-concept study to the pelvic area, but it should be extended to the whole body in the future. A further step towards a faster and more comprehensive automated analysis that provides a global tumour burden in PCa patients is the detection and quantification of PCa lesions and skeletal metastases. Last, more data are needed to construct a more robust segmentation model, and multicentre data should be collected to further consolidate the generalization ability of our model.
In conclusion, this study confirmed the feasibility of the 3D U-Net CNN for automated detection and segmentation of LNs on pelvic DWI images. This may present a promising step towards a clinically helpful deep learning-based tool that can provide a comprehensive and objective assessment of tumour burden in PCa patients.