Patients
This was a retrospective study of patients receiving routine care at two tertiary care medical centers. Ethics committee approval was granted by the local institutional ethics review board (protocol 2016-SRFA-093), with a waiver of written informed consent. All procedures involving human participants were performed in accordance with the 1964 Declaration of Helsinki and its later amendments.
For both cohorts, the local medical record databases were searched to identify patients with pathologically confirmed PCa. The inclusion criteria were as follows: i) PCa treated with radical prostatectomy and ii) a standard prostate mpMRI examination within 4 weeks before surgery. Patients who did not undergo radical prostatectomy, or who had previously received surgery or adjuvant therapy for PCa, were excluded (prior interventions for benign prostatic hyperplasia or bladder outflow obstruction were acceptable). In total, 746 consecutive patients from Center 1 (January 2015 to June 2019) and 103 patients with PCa from Center 2 (January 2017 to December 2019) who underwent standard prostate mpMRI and radical prostatectomy were enrolled. The patient enrollment procedure is summarized in the supplementary data (Fig. S1).
Clinical variables included age, PSA level, PSA density, biopsy Gleason score, number of positive cores, and perineural invasion. Histopathological outcomes, including surgical Gleason score, positive surgical margin, histological ECE, and histological seminal vesicle invasion, were also recorded. All biopsy and surgical specimens were prepared and examined according to the ISUP 2005 recommendations by two pathologists, each with 10 years of experience in urologic pathology. Histopathological ECE, defined as tumor extension through the prostatic capsule into the periprostatic fat, was the primary clinical endpoint of this study.
Patients in the Center 1 dataset were randomly split into training (n = 596) and test (n = 150) sets for model development and internal validation, respectively. The 103 patients from Center 2 were used for external validation.
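A minimal Python sketch of such a patient-level split is shown below, using scikit-learn's `train_test_split` with an integer `test_size` so that exactly 150 patients are held out. The random seed and the absence of stratification by ECE status are assumptions; the text reports neither.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder identifiers for the 746 Center 1 patients (hypothetical).
center1_ids = np.arange(746)

# An integer test_size holds out exactly that many patients.
# random_state=42 is an arbitrary, assumed seed; no stratification is applied here.
train_ids, test_ids = train_test_split(center1_ids, test_size=150, random_state=42)

print(len(train_ids), len(test_ids))  # 596 150
```

Splitting on patient identifiers, rather than on slices, keeps all images from one patient in a single set.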
Image Acquisition and Analysis
At both academic institutions, patients underwent prostate mpMRI with a pelvic phased-array coil on the same model of 3.0-T MR scanner (Skyra; Siemens Healthcare, Erlangen, Germany). The scanning protocol comprised transverse T1-weighted imaging; transverse, coronal, and sagittal T2-weighted imaging (T2WI); and transverse DWI. Apparent diffusion coefficient (ADC) maps were calculated from the DWI data using a mono-exponential model. The scanner types and imaging parameters are summarized in the Supplementary Materials (Table S1).
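The mono-exponential model assumes S(b) = S0 · exp(−b · ADC), so ln S is linear in b and the ADC can be estimated voxel-wise by least squares on the log signal. The sketch below is illustrative only: the b-values used for the ADC maps are not stated in this section (they would be listed in Table S1), so the values in the usage example are assumptions, and the scanner software may use a different fitting routine.

```python
import numpy as np

def fit_adc(bvalues, signals):
    """Voxel-wise mono-exponential ADC fit by log-linear least squares.

    Model: S(b) = S0 * exp(-b * ADC), so ln S = ln S0 - b * ADC.
    bvalues: shape (K,), in s/mm^2. signals: shape (K, ...), one image per b-value.
    Returns ADC (mm^2/s) with the spatial shape of a single image.
    """
    b = np.asarray(bvalues, dtype=float)
    s = np.asarray(signals, dtype=float)
    # Clip to avoid log(0) in background voxels, then flatten the spatial axes.
    log_s = np.log(np.clip(s, 1e-6, None)).reshape(len(b), -1)
    design = np.column_stack([np.ones_like(b), -b])  # columns: ln S0, ADC
    coef, *_ = np.linalg.lstsq(design, log_s, rcond=None)
    return coef[1].reshape(s.shape[1:])

# Usage with synthetic, noise-free data; the b-values are assumptions.
b = np.array([0.0, 800.0, 1500.0])
true_adc = np.array([[0.8e-3, 1.2e-3], [1.6e-3, 2.0e-3]])
signals = 1000.0 * np.exp(-b[:, None, None] * true_adc)
adc = fit_adc(b, signals)
```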
All images were retrospectively interpreted according to the ESUR guidelines by two genitourinary radiologists at the two institutions (reader 1, 15 years of experience with prostate MRI; reader 2, 10 years of experience with prostate MRI), who were blinded to the pathological results and all clinical information. Staging with mpMRI was performed using the ECE grading system introduced by Mehralivand et al. [17], which combines three imaging features: capsular contact length (CCL) of 15 mm or greater, capsular irregularity or bulge, and frank breach of the capsule. The grades are defined as follows: i) grade 0, no suspicion of pathological ECE; ii) grade 1, either CCL of 15 mm or greater or capsular irregularity or bulge; iii) grade 2, both CCL of 15 mm or greater and capsular irregularity or bulge; and iv) grade 3, frank ECE visible at mpMRI.
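The grading rules above reduce to a small decision function. This is a sketch of the rules as stated in this section; `ece_grade` and its argument names are hypothetical.

```python
def ece_grade(ccl_mm: float, irregular_or_bulge: bool, frank_breach: bool) -> int:
    """Return the mpMRI ECE grade (0-3) from the three imaging features."""
    if frank_breach:
        return 3  # frank ECE visible at mpMRI
    long_contact = ccl_mm >= 15.0  # capsular contact length of 15 mm or greater
    if long_contact and irregular_or_bulge:
        return 2
    if long_contact or irregular_or_bulge:
        return 1
    return 0  # no suspicion of pathological ECE
```

Frank breach is checked first because it determines grade 3 regardless of the other two features.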
Construction of Deep Learning Networks
Image annotation and preprocessing: Segmentation of the prostate and PCa was performed with in-house software (Oncology Imaging Analysis version 2; Shanghai Key Laboratory of MR, ECNU, Shanghai, China) by two experienced genitourinary radiologists. A prior-attention map was generated from the prostate and PCa annotations. Diffusion-related sequences were aligned to T2WI, and all images were resampled to an in-plane resolution of 0.5 × 0.5 mm. Patches of 200 × 200 pixels were then cropped and Z-score normalized so that intensities were on a similar scale before being input into the model. Details of image annotation and preprocessing are described in Supplementary Sections 1–2.
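The resampling, cropping, normalization, and channel-stacking steps could look roughly as follows. Several details are assumptions not stated in the text: linear interpolation via `scipy.ndimage.zoom`, DWI and ADC already aligned to the T2WI grid with the same spacing, a crop centered on a supplied point (for example, the prostate centroid) in resampled pixel coordinates, zero padding at the borders, and Z-scoring of each channel within its patch. The helper names are hypothetical.

```python
import numpy as np
from scipy.ndimage import zoom

TARGET_SPACING_MM = 0.5  # in-plane resolution stated in the text
PATCH_SIZE = 200         # patch size stated in the text

def resample_slice(img, spacing_mm):
    """Resample a 2D slice from its native (row, col) spacing to 0.5 x 0.5 mm."""
    factors = (spacing_mm[0] / TARGET_SPACING_MM, spacing_mm[1] / TARGET_SPACING_MM)
    return zoom(img, factors, order=1)  # linear interpolation (assumed)

def crop_patch(img, center, size=PATCH_SIZE):
    """Crop a size x size patch around center (row, col), zero-padding at borders."""
    half = size // 2
    padded = np.pad(img, half, mode="constant")
    r, c = int(round(center[0])) + half, int(round(center[1])) + half
    return padded[r - half:r + half, c - half:c + half]

def zscore(patch):
    """Z-score normalize a patch to zero mean and unit variance."""
    return (patch - patch.mean()) / (patch.std() + 1e-8)

def make_input(t2, dwi, adc, spacing_mm, center):
    """Stack co-registered T2WI, DWI, and ADC patches into a (3, 200, 200) input."""
    channels = [zscore(crop_patch(resample_slice(x, spacing_mm), center))
                for x in (t2, dwi, adc)]
    return np.stack(channels, axis=0)

# Usage with synthetic 160 x 160 slices at an assumed 0.625 mm native spacing.
rng = np.random.default_rng(0)
t2, dwi, adc = (rng.random((160, 160)) for _ in range(3))
x = make_input(t2, dwi, adc, spacing_mm=(0.625, 0.625), center=(100, 100))
print(x.shape)  # (3, 200, 200)
```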
Network architecture: A two-dimensional ResNeXt, an established CNN architecture, with a convolutional block attention module (CBAM) was used to analyze the mpMRI images, with the high-resolution T2WI, high-b-value (b = 1500 s/mm²) DWI, and ADC images concatenated as input [26]. The model output was a prediction of ECE. For each patient in the training dataset, a single leading slice, the one with the largest tumor cross-section, was used for model development. To guide the ResNeXt network to emulate the judgments of the experts who labeled the target lesions, we introduced a prior-attention guide unit that feeds the attention map into the CBAM [27]; the resulting network is referred to as PAGNet. The attention map was generated from the annotations of the whole prostate and the tumor lesion, and higher values in the map denote regions that warrant greater focus. Ensemble learning with 5-fold cross-validation was used in the training stage; in the inference stage, the predictions of the five independently trained models were averaged to give the final predicted ECE risk. Details of attention map generation, network architecture, and analysis are described in Supplementary Sections 3–6 and Fig. 1.
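The 5-fold ensemble scheme can be illustrated with a lightweight stand-in. In the sketch below, a scikit-learn `LogisticRegression` on generic feature vectors replaces the ResNeXt/CBAM network purely to keep the example self-contained; stratified folds and the seed are further assumptions. One model is trained per fold, and the inference-stage prediction is the mean probability of the five models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def train_fold_ensemble(X, y, n_splits=5, seed=0):
    """Train one model per cross-validation fold and return all of them.

    LogisticRegression is a stand-in for the ResNeXt/CBAM network (assumption).
    """
    models = []
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, _val_idx in folds.split(X, y):
        # The held-out fold (_val_idx) would normally drive model selection or early stopping.
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        models.append(model)
    return models

def ensemble_predict(models, X):
    """Final ECE risk: mean predicted probability across the fold models."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

# Usage on synthetic features (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)
models = train_fold_ensemble(X, y)
risk = ensemble_predict(models, X[:10])
```

Averaging fold models makes use of every training patient while each model still has a held-out fold for monitoring.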
Postprocessing: Because a tumor can span several imaging slices whereas ECE may be present on only some of them, we proposed two approaches for postprocessing the predicted outputs for each patient. The single-slice (SS) prediction is derived from the preset leading slice. The multi-slice (MS) prediction is derived from all slices covering the tumor, and the highest slice-level prediction among them is used as the final MS prediction.
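The two postprocessing rules can be written as follows, assuming per-slice ECE probabilities and a binary tumor mask of shape (slices, H, W) for one patient; the function names are hypothetical, and selecting the leading slice by largest mask area follows the definition given for model training.

```python
import numpy as np

def leading_slice_index(tumor_mask):
    """Index of the slice with the largest tumor cross-section; mask shape (slices, H, W)."""
    areas = tumor_mask.reshape(tumor_mask.shape[0], -1).sum(axis=1)
    return int(np.argmax(areas))

def single_slice_prediction(slice_probs, tumor_mask):
    """SS prediction: the model output on the leading slice."""
    return float(np.asarray(slice_probs)[leading_slice_index(tumor_mask)])

def multi_slice_prediction(slice_probs, tumor_mask):
    """MS prediction: the highest output among all slices that contain tumor."""
    has_tumor = tumor_mask.reshape(tumor_mask.shape[0], -1).any(axis=1)
    return float(np.max(np.asarray(slice_probs)[has_tumor]))

# Usage: four slices, tumor on slices 1-3, largest cross-section on slice 2 (synthetic).
mask = np.zeros((4, 10, 10), dtype=bool)
mask[1, :1, :5] = True   # area 5
mask[2, :2, :10] = True  # area 20
mask[3, :1, :10] = True  # area 10
probs = [0.9, 0.7, 0.3, 0.4]  # slice 0 has no tumor, so 0.9 is ignored
ss = single_slice_prediction(probs, mask)  # 0.3
ms = multi_slice_prediction(probs, mask)   # 0.7
```

The example shows why the two rules can disagree: the slice with the largest tumor area is not necessarily the slice where ECE is most evident.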
Integration of PAGNet and Clinical Variables
Finally, we evaluated whether integrating clinical factors into the DL network improved diagnostic performance. PSA level, age, biopsy Gleason score, percentage of positive cores, and biopsy perineural invasion were added to PAGNet to form PAGNet+C, in which the clinical variables were fed directly into the penultimate fully connected (FC) layer of PAGNet by increasing its number of neurons.
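A minimal NumPy sketch of this kind of fusion is given below. It assumes that the clinical variables have already been standardized, that the image branch yields a 64-dimensional penultimate feature vector (the actual width is not given here), and a single sigmoid output unit. Concatenating the five clinical values onto that vector is equivalent to widening the layer by five neurons before the output layer. The weights are placeholders, not trained parameters.

```python
import numpy as np

N_IMAGE_FEATURES = 64  # hypothetical width of PAGNet's penultimate FC layer
N_CLINICAL = 5         # PSA, age, biopsy Gleason score, % positive cores, perineural invasion

def fused_head(image_features, clinical, weights, bias):
    """Append standardized clinical variables to the penultimate FC features
    (widening that layer by N_CLINICAL neurons) and apply a sigmoid output unit."""
    fused = np.concatenate([image_features, clinical], axis=1)  # (batch, 64 + 5)
    logits = fused @ weights + bias                             # (batch,)
    return 1.0 / (1.0 + np.exp(-logits))                        # ECE probability

# Usage with placeholder (untrained) parameters and synthetic inputs.
rng = np.random.default_rng(0)
feats = rng.normal(size=(2, N_IMAGE_FEATURES))
clin = rng.normal(size=(2, N_CLINICAL))  # assumed already z-scored
w = 0.01 * rng.normal(size=N_IMAGE_FEATURES + N_CLINICAL)
p = fused_head(feats, clin, w, bias=0.0)
```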
Performance of the Deep Diagnostic Model
To evaluate the performance and clinical applicability of the deep diagnostic model, all cases were assessed independently by the AI, by the human experts, and by expert-AI interaction. For expert-AI interaction, the expert score was upgraded when the AI assessment was positive, although the highest score of 3 remained unchanged; conversely, the expert score was downgraded when the AI assessment was negative, although the lowest score of 0 remained unchanged. To assess the effect of clinicopathological variables on performance, subgroup analyses were performed stratified by lesion size, D’Amico risk group [9], and PI-RADS score [28].
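The expert-AI interaction rule can be sketched as below. The one-grade step and the binary AI call (for example, an AI probability at or above some operating threshold) are assumptions, since the text does not state how far a score moves or how the AI output was dichotomized; `expert_ai_score` is a hypothetical name.

```python
def expert_ai_score(expert_score: int, ai_positive: bool) -> int:
    """Move an expert ECE grade (0-3) one step toward the AI call, clipped to 0-3.

    Assumes a step of one grade and a binary AI assessment.
    """
    if not 0 <= expert_score <= 3:
        raise ValueError("expert_score must be between 0 and 3")
    step = 1 if ai_positive else -1
    return min(3, max(0, expert_score + step))
```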
Statistical Analysis
Inter-reader variability was evaluated using the observed inter-reader agreement and Cohen’s kappa. Model performance was evaluated against histopathology as the reference standard using receiver operating characteristic (ROC) analysis. The expert, AI, and expert-AI interaction assessments were compared using summary ROC (SROC) curves derived from a Bayesian meta-analysis, which allows assessment of both the independent and pooled performance of all methods. For each comparison, results were presented in contingency tables, from which diagnostic accuracy was calculated; the unit of analysis was the patient. Performance characteristics, including the area under the ROC curve (AUC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and overall accuracy, were reported.
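For illustration, the agreement statistics and the patient-level contingency-table metrics could be computed with scikit-learn as follows. The 0.5 operating threshold and the use of unweighted kappa are assumptions; the Bayesian SROC analysis, which was performed in Stata, is not reproduced here. The helper names are hypothetical.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix, roc_auc_score

def reader_agreement(grades_1, grades_2):
    """Observed agreement and unweighted Cohen's kappa between two readers' ECE grades."""
    g1, g2 = np.asarray(grades_1), np.asarray(grades_2)
    return {"agreement": float(np.mean(g1 == g2)), "kappa": cohen_kappa_score(g1, g2)}

def diagnostic_metrics(y_true, y_score, threshold=0.5):
    """Patient-level AUC plus contingency-table metrics at an assumed threshold."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "auc": roc_auc_score(y_true, y_score),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Usage with synthetic labels (1 = histopathological ECE) and model scores.
m = diagnostic_metrics([1, 1, 1, 0, 0, 0, 0, 1], [0.9, 0.8, 0.3, 0.2, 0.6, 0.1, 0.4, 0.7])
a = reader_agreement([0, 1, 2, 3], [0, 1, 2, 2])
```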
In addition, the clinical usefulness and net benefit of the models were assessed using decision curve analysis (DCA). DCA estimates the net benefit of a model at a given threshold probability as the proportion of true positives minus the proportion of false positives, with false positives weighted by the odds of the threshold probability. SROC curves were estimated using Stata 15, DCA was performed in R, and all other statistics were computed in Python with SciPy (v1.4.1) and scikit-learn (v0.22). All reported statistical tests were two-sided, with significance set at P < 0.05.
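At threshold probability p_t, the net benefit is NB = TP/n − (FP/n) × p_t/(1 − p_t), where n is the number of patients and a patient counts as positive when the predicted risk is at least p_t. The study computed DCA in R; the Python sketch below only illustrates the formula, together with the conventional "treat all" reference curve. The function names are hypothetical.

```python
import numpy as np

def net_benefit(y_true, y_prob, thresholds):
    """Net benefit of a model at each threshold probability (each must be < 1)."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob, dtype=float)
    n = y_true.size
    out = []
    for pt in thresholds:
        treat = y_prob >= pt              # patients classified as ECE-positive
        tp = np.sum(treat & y_true)
        fp = np.sum(treat & ~y_true)
        out.append(tp / n - fp / n * pt / (1.0 - pt))  # false positives weighted by threshold odds
    return np.array(out)

def net_benefit_treat_all(y_true, thresholds):
    """Reference curve: every patient classified as positive."""
    prevalence = np.mean(np.asarray(y_true).astype(bool))
    pt = np.asarray(thresholds, dtype=float)
    return prevalence - (1.0 - prevalence) * pt / (1.0 - pt)

# Usage with a toy cohort of four patients.
nb = net_benefit([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.1], thresholds=[0.2, 0.5])
nb_all = net_benefit_treat_all([1, 1, 0, 0], thresholds=[0.2, 0.5])
```

A model is clinically useful at a given threshold when its curve lies above both the "treat all" curve and the "treat none" line (net benefit of zero).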