Artificial intelligence is a promising prospect for the detection of prostate cancer extracapsular extension with mpMRI: a two-center comparative study

Balancing the preservation of urinary continence and sexual potency against the achievement of negative surgical margins is clinically important yet difficult to accomplish. Accurate detection of extracapsular extension (ECE) of prostate cancer (PCa) is thus crucial for determining appropriate treatment options. We aimed to develop and validate an artificial intelligence (AI)–based tool for detecting ECE of PCa using multiparametric magnetic resonance imaging (mpMRI). Eight hundred and forty-nine consecutive PCa patients who underwent mpMRI and prostatectomy without previous radiotherapy or hormonal therapy at two medical centers were retrospectively included. The AI tool was built on a ResNeXt network embedded with a spatial attention map of experts’ prior knowledge (PAGNet), trained on 596 patients. Model validation was performed in 150 internal and 103 external patients. Performance was compared between the AI, two experts using a criteria-based ECE grading system, and expert-AI interaction. An index PAGNet model using a single-slice image yielded the highest areas under the receiver operating characteristic curve (AUC): 0.857 (95% confidence interval [CI], 0.827–0.884), 0.807 (95% CI, 0.735–0.867), and 0.728 (95% CI, 0.631–0.811) in the training, internal, and external validation data, respectively. The performance of the two experts (AUC, 0.632 to 0.741 vs 0.715 to 0.857) was lower (paired comparison, all p values < 0.05) than that of the AI assessment. When the experts’ interpretations were adjusted by the AI assessments, their performance improved. Our AI tool, showing improved accuracy, offers a promising alternative to human experts for ECE staging using mpMRI.


Introduction
Preoperative staging of prostate cancer (PCa) is critical for guiding the treatment selection of patients and preventing both under- and over-treatment [1]. The presence of extracapsular extension (ECE) (stage T3a or higher), found in one-third of all newly diagnosed PCa patients [2,3], is associated with high rates of positive surgical margins and early biochemical recurrence after radical prostatectomy (RP) [4]. Resection of the neurovascular bundle (NVB) is recommended in these patients with the aim of decreasing positive surgical margins, which may substantially affect urinary continence and sexual potency [5,6]. To date, the balance between preserving urinary continence as well as sexual potency and achieving negative surgical margins for RP remains a challenge; accurate detection of ECE would thus have a significant impact on treatment planning and prediction of outcomes in patients with PCa.
Historically, digital rectal examination (DRE) has been the principal approach for clinical T-staging of PCa [7]. Clinical staging and risk stratification models combining prostate-specific antigen (PSA) and the Gleason score of the prostate biopsy with the DRE-derived clinical T-stage have been designed to obtain more accurate predictions of PCa aggressiveness, disease mortality, and biochemical recurrence [8][9][10]. However, DRE is a rather subjective approach with potential interobserver variability and is at risk of underestimating the extent of anteriorly located tumors [11]. In the last few decades, multiparametric magnetic resonance imaging (mpMRI) has been widely used to characterize PCa and determine the clinical stage, and the use of MRI instead of DRE leads to significant upstaging of clinical T-stage and risk grouping [12][13][14][15]. Nevertheless, despite considerable efforts such as alternative high-resolution imaging and new grading approaches, the diagnostic accuracy of MRI for staging T3a disease remains poor, with a heterogeneous sensitivity of 30-70% [16][17][18][19]. This heterogeneity may reflect the lack of standard evaluation criteria [20], and the high level of expertise required of radiologists for accurate interpretation, together with interobserver variability, limits its consistency and availability [21].
Recently, artificial intelligence (AI), particularly deep learning (DL), has been proposed as a promising solution to many medical imaging tasks, including organ segmentation, lesion detection, and disease classification [22][23][24][25][26]. AI does not rely on the predefined representations of low-level visual features that early machine learning approaches required. Instead, AI can learn to discover task-specific features such as anatomic localization, tumor contact, neurovascular bundles, or direct evidence of abnormalities in periprostatic adipose tissue, which are the cornerstones of imaging-based detection and staging of PCa. With a sufficient supply of expertly annotated examples, an appropriately designed model can learn to emulate the judgments of the expert clinicians who provide the annotations.
Therefore, in this study, we hypothesized that an AI-based tool trained on a large dataset of high-quality labels would produce automated ECE staging capable of emulating the diagnostic acumen of a team of experienced radiologists. We further hypothesized that when the AI-based tool is provided to assist experts' interpretation, expert-based performance in ECE staging of PCa with mpMRI would improve. We tested these hypotheses by building a ResNeXt-based deep classification and detection model embedded with a spatial attention map of the radiologists' prior knowledge for imaging interpretation of ECE in patients with PCa [27]. We then validated it by comparing the performance of the AI with expert-based interpretation and expert-AI interaction on two independent cohorts from two tertiary care medical centers with detailed outcome information.

Patients
This was a retrospective study involving routine care at two tertiary care medical centers. Ethics committee approval was granted by the local institutional ethics review board (protocol 2019-SR-396), with a waiver of written informed consent. All procedures conducted in the studies involving human participants were in accord with the 1964 Helsinki Declaration and its later amendments.
The two primary cohorts were identified by evaluating the local medical-record databases for patients with pathologically confirmed PCa. The inclusion criteria were as follows: (i) PCa with RP and (ii) a standard prostatic mpMRI exam within 4 weeks prior to surgical intervention. Patients without RP or with histories of previous surgeries or adjuvant therapies for PCa (interventions for benign prostatic hyperplasia or bladder outflow obstruction were deemed acceptable) were excluded. Finally, a total of 746 consecutive patients between Jan 2015 and Jun 2019 from Center 1 and 103 consecutive patients between Jan 2017 and Dec 2019 from Center 2 were enrolled. The patient enrollment procedures are summarized in the supplementary data (Fig. S1).
Clinical variables included age, PSA level, PSA density (PSAD), biopsy Gleason score, number of positive cores, and perineural invasion. Histopathological outcomes such as surgical Gleason score, positive surgical margin, presence of ECE, and presence of seminal vesicle invasion were also determined. All biopsies and surgical specimens were prepared and examined by two pathologists, each with 10 years of experience in urologic pathology, according to the International Society of Urological Pathology (ISUP) 2005 and 2014 recommendations [28,29]. Histopathological ECE, defined as tumor breaking through the prostatic capsule into the periprostatic fat, was the primary clinical endpoint of this study.
Patients from Center 1 were randomly split into training (n = 596) and test (n = 150) groups for model development and internal validation. A cohort of 103 patients from Center 2 was used for external validation.

Image acquisition and analysis
Patients at the two academic institutions underwent a pelvic phased-array prostatic mpMRI examination on a 3.0 T MR scanner (Skyra; Siemens Healthcare, Erlangen, Germany). The scanning protocol combined transverse T1-weighted imaging; transverse, coronal, and sagittal T2-weighted imaging (T2WI); and transverse DWI sequences. The apparent diffusion coefficient (ADC) was measured using DWI with a mono-exponential fitting model. The scanner types and imaging parameters are summarized in the supplementary data (Table S1).
All images were retrospectively interpreted based on the guidelines of the European Society of Urogenital Radiology (ESUR) by two genitourinary radiologists at the two institutions (reader 1, 15 years of experience with prostate MRI; reader 2, 10 years of experience with prostate MRI) who were blinded to the pathological results and all clinical information [30]. Staging assessment with mpMRI was performed using the criteria-based grading system for ECE introduced by Mehralivand et al. [17]. Imaging diagnosis of ECE is based on a three-tier grading approach (plus grade 0 for no suspicion) using capsular contact length (CCL) of 15 mm or greater, capsular irregularity or bulge, and frank breach of the capsule: (i) grade 0, no suspicion of pathological ECE; (ii) grade 1, either CCL of 15 mm or greater or capsular irregularity or bulge; (iii) grade 2, both CCL of 15 mm or greater and capsular irregularity or bulge; and (iv) grade 3, frank ECE visible at mpMRI.
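The grading criteria above can be restated as a small decision function. The sketch below simply re-encodes the published criteria; the function and variable names are illustrative and are not taken from the study's software.

```python
def ece_grade(ccl_mm: float, irregularity_or_bulge: bool, frank_ece: bool) -> int:
    """Map mpMRI findings to the criteria-based ECE grade (0-3)."""
    if frank_ece:
        return 3                       # grade 3: frank ECE visible at mpMRI
    long_contact = ccl_mm >= 15.0      # capsular contact length criterion
    if long_contact and irregularity_or_bulge:
        return 2                       # grade 2: both criteria present
    if long_contact or irregularity_or_bulge:
        return 1                       # grade 1: either criterion present
    return 0                           # grade 0: no suspicion of ECE
```

For example, a lesion with 20 mm of capsular contact and a capsular bulge but no frank breach maps to grade 2.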

Construction of deep learning networks
Image annotation and preprocessing Segmentation of the prostate and PCa was performed with in-house software (Oncology Imaging Analysis version 2; Shanghai Key Laboratory of MR, ECNU, Shanghai, China) by two experienced genitourinary radiologists. A prior-attention map was generated from the annotations of the prostate and PCa. Diffusion-related sequences were aligned onto T2WI, and all images were resampled to an in-plane resolution of 0.5 × 0.5 mm². Patches of 200 × 200 pixels were then cropped and Z-score normalized to bring intensities to a similar scale before being imported into the model. The details of image annotation and preprocessing are described in Supplementary Sections 1-2.
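The cropping and Z-score normalization steps can be sketched as follows; registration and resampling are omitted, and the array sizes and names are assumptions for illustration, not the authors' actual pipeline.

```python
import numpy as np

def center_crop(img: np.ndarray, size: int = 200) -> np.ndarray:
    """Crop a size x size patch around the image centre."""
    h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def z_score(img: np.ndarray) -> np.ndarray:
    """Normalize intensities to zero mean and unit variance."""
    return (img - img.mean()) / (img.std() + 1e-8)

# stand-in for one resampled T2WI slice at 0.5 x 0.5 mm resolution
slice_t2 = np.random.default_rng(0).random((320, 320))
patch = z_score(center_crop(slice_t2))   # 200 x 200, zero mean, unit variance
```

The same crop/normalize pair would be applied to the T2WI, DWI, and ADC channels before stacking them as model input.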
Architecture of network A two-dimensional ResNeXt, a convolutional neural network model previously shown to be effective, with a convolutional block attention module (CBAM) was used to analyze the mpMRI images through the concatenated use of high-resolution T2WI, high-b-value (1500 s/mm²) DWI, and ADC [27]. The output of the model was the prediction of ECE. For each training case, a single leading-slice image with the largest cross-section of the tumor was used for model development. To guide the ResNeXt network to emulate the judgments of the experts who provided the annotations of the targeted lesion, we introduced a prior-attention guide (PAGNet) unit by inputting the attention map into the CBAM [31]. The attention map was generated based on the annotations of the whole prostate and the tumor lesion, with high values in the attention map denoting regions deserving focus. Ensemble learning with 5-fold cross-validation was used during the training stage, and in the inference stage the average prediction of the five independent models was treated as the final prediction of ECE risk. Details of attention map generation, network architecture, and analysis are described in Supplementary Sections 3-6 and Fig. 1.
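The core idea of the prior-attention guide, gating a feature map with a spatial attention map that is biased toward the expert annotations, can be sketched in NumPy. This is not the actual PAGNet/CBAM implementation: the channel-mean "learned" response, the weights, and the shapes are placeholders chosen purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prior_guided_spatial_attention(features, prior_map, w=1.0, b=0.0):
    """Gate a (C, H, W) feature map with a spatial attention map biased by
    an expert prior (H, W); annotated prostate/tumour regions get more weight."""
    learned = features.mean(axis=0)                 # stand-in for a learned spatial response
    attn = sigmoid(w * (learned + prior_map) + b)   # fuse prior with learned map, squash to (0, 1)
    return features * attn[None, :, :]              # broadcast the gate over channels

rng = np.random.default_rng(0)
feats = rng.random((8, 16, 16))                     # toy feature map
prior = np.zeros((16, 16))
prior[4:12, 4:12] = 3.0                             # high prior inside the annotated region
gated = prior_guided_spatial_attention(feats, prior)
```

Because the attention gate lies in (0, 1), features inside the annotated region are attenuated less than those outside it, which is the qualitative behaviour the prior-attention unit is meant to produce.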
Post-process For each patient, the tumor can involve several imaging slices while the ECE may involve only some of them, so we proposed two approaches to post-process the predicted outputs. One is a single-slice (SS)-based prediction, derived from a preset leading-slice image. The other is a multi-slice (MS)-based prediction, derived from images covering the entire tumor, among which the highest predicted result was used as the final MS prediction.
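Both post-processing rules reduce to simple aggregation over per-slice probabilities. The numbers below are made up for illustration; note how a high probability on a non-leading slice can push the MS prediction above the SS prediction.

```python
# hypothetical per-slice ECE probabilities, one per tumour-bearing slice
slice_probs = [0.12, 0.81, 0.78, 0.25]
leading_idx = 2                        # preset leading slice (largest tumour cross-section)

ss_pred = slice_probs[leading_idx]     # single-slice (SS) prediction: leading slice only
ms_pred = max(slice_probs)             # multi-slice (MS) prediction: highest slice probability
```

Here the SS prediction is 0.78 while the MS prediction is 0.81, driven by a non-leading slice.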
Integration of PAGNet and clinical variables Finally, we evaluated the integrative effect of clinical factors on the DL networks to improve diagnostic performance. The PSA, age, biopsy Gleason score, percentage of positive cores, and biopsy perineural invasion were added to the PAGNet model (PAGNet+C), in which clinical information was appended directly to the penultimate fully connected (FC) layer of PAGNet by increasing the number of neurons.
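The PAGNet+C fusion described above amounts to concatenating a clinical feature vector onto the penultimate FC activations and widening the final layer accordingly. The sketch below uses random stand-in values and hypothetical dimensions; it is not the study's code.

```python
import numpy as np

rng = np.random.default_rng(1)

deep_features = rng.random(256)                   # stand-in penultimate FC activations
clinical = np.array([6.5, 68.0, 7.0, 0.4, 1.0])   # PSA, age, biopsy Gleason,
                                                  # % positive cores, perineural invasion

combined = np.concatenate([deep_features, clinical])  # widened FC input (256 + 5 neurons)
W = rng.random((2, combined.size))                    # final FC weights (2 output classes)
b = np.zeros(2)
logits = W @ combined + b                             # ECE-negative / ECE-positive logits
```

In training, the widened final layer lets the network weigh the five clinical neurons against the 256 image-derived neurons, which, as discussed later, can be difficult when the clinical vector is so much smaller.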

Performance of deep diagnostic model
To evaluate the performance and clinical applicability of the deep diagnostic model, all data assessments were conducted independently by the AI, by human experts, and through expert-AI interaction. For expert-AI interaction, the expert score was upgraded when the AI returned a positive assessment, except that the highest score of 3 remained unchanged; conversely, the expert score was downgraded when the AI returned a negative assessment, with the lowest score of 0 remaining unchanged. To assess the effect of pathological variants on performance, the assessments were also conducted in subgroups stratified by lesion size, D'Amico risk group [9], and PI-RADS score [30].
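The interaction rule can be written as a clamped adjustment of the expert grade. This is our reading of the rule described above; a one-step adjustment is assumed, since the text does not state the step size explicitly.

```python
def adjust_expert_grade(expert_grade: int, ai_positive: bool) -> int:
    """Adjust an expert ECE grade (0-3) using the AI's binary assessment:
    upgrade one step on a positive AI call, downgrade one step on a negative
    call, clamped so grades stay within the 0-3 range."""
    if ai_positive:
        return min(expert_grade + 1, 3)   # grade 3 stays at 3
    return max(expert_grade - 1, 0)       # grade 0 stays at 0
```

For example, an expert grade of 1 with a positive AI assessment becomes 2, while a grade of 3 is left unchanged.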

Statistical analysis
Inter-reader variability was evaluated using inter-reader agreement and Cohen's kappa. Model performance was evaluated against the histopathological "ground truth" using a receiver operating characteristic (ROC) analysis. An inter-method comparison between the experts, the AI, and expert-AI interaction was performed using a summary ROC (SROC) curve through a Bayesian meta-analysis, which allows an assessment of the independent and pooled performance of all methods. For each comparison, contingency tables were used to present the results and calculate the diagnostic accuracy; the unit of assessment for these tables was one patient. Performance characteristics such as the area under the ROC curve (AUC), sensitivity, specificity, and overall accuracy were also reported.
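Cohen's kappa compares observed agreement with the agreement expected by chance from each rater's marginal frequencies. A self-contained sketch follows, equivalent in spirit to `sklearn.metrics.cohen_kappa_score`; the toy data are illustrative only.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters scoring the same cases."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    cats = set(labels_a) | set(labels_b)
    # observed agreement: fraction of cases where the raters match
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement: product of each rater's marginal frequency per category
    p_exp = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats)
    return (p_obs - p_exp) / (1 - p_exp)
```

Perfect agreement yields kappa = 1, while agreement no better than chance yields kappa = 0; a value of 0.343, as reported below for the two readers, indicates fair agreement.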
Second, the clinical usefulness and net benefits of the models were assessed using a decision curve analysis (DCA). The DCA estimates the net benefit of a model from the difference between the numbers of true positives and false positives, weighted by the odds of the selected threshold probability of risk. SROC was estimated using Stata 15, DCA was estimated using R, and other statistical values were estimated using Python with scipy (v1.4.1) and the scikit-learn package (v0.22). All reported p values were two-sided, with statistical significance set at 0.05.
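The net-benefit quantity that DCA plots follows directly from the description above: true positives per patient minus false positives per patient weighted by the threshold odds. The counts below are made up for illustration.

```python
def net_benefit(tp: int, fp: int, n: int, pt: float) -> float:
    """Decision-curve net benefit at threshold probability pt:
    (TP/n) - (FP/n) * pt/(1 - pt)."""
    return tp / n - (fp / n) * (pt / (1.0 - pt))

# e.g. 30 true positives and 10 false positives among 100 patients,
# evaluated at a 20% threshold probability (odds 0.25)
nb = net_benefit(tp=30, fp=10, n=100, pt=0.2)   # 0.30 - 0.10 * 0.25 = 0.275
```

Sweeping `pt` over a range of thresholds and plotting `net_benefit` for each model produces the decision curves reported in Fig. S2.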

Comparison of deep network model
To determine the impacts of the two post-processing approaches on the prediction, the performance of SS-based versus MS-based assessments is shown in Fig. 2a. To illustrate the robustness of the index PAGNet model, deep generative features from the penultimate layer of each network were extracted and plotted using a t-distributed stochastic neighbor embedding (t-SNE) analysis (Fig. 2b). The features of the index PAGNet showed better intraclass aggregation and interclass separation.

Performance and clinical application of deep learning models
The performances of ECE prediction by the AI, the experts, and expert-AI interaction are summarized in Fig. 3. Regarding the expert-based grading approach for ECE interpretation, inter-reader agreement was observed in 437/849 (51.5%) observations, with a kappa value of 0.343. The performance of the two experts (AUC, 0.632 to 0.741 vs 0.715 to 0.857) was significantly lower (paired comparison of ROC, all p values < 0.05) than that of any of the AI assessments in the three cohorts. When expert-AI interaction was performed, whereby the expert's interpretation was adjusted by the AI assessment, the performance of the two experts for ECE assessment was significantly improved. To provide a more complete picture of the assistant role of AI to experts, the independent and integrated effects of the experts, the AI, and expert-AI interaction were evaluated using SROC curves and forest plots with a Bayesian meta-analysis (Fig. 4).

Clinical implications of deep network models
The benefit of applying the index PAGNet model in clinical practice is shown with a DCA in Fig. S2. The PAGNet-derived probability demonstrated improved clinical risk prediction across threshold probabilities of ECE ≤ 60%. The graph demonstrated better clinical risk prediction when using the PAGNet or expert-AI interaction approach as compared to expert-based grading. Additionally, we evaluated the performance of PAGNet for ECE staging in subgroups of patients with different tumor sizes, D'Amico risk levels, and PI-RADS scores. Compared to the experts and other AI-based models, the index PAGNet model showed a higher NPV for tumor size < 1.5 cm, low D'Amico risk, and PI-RADS 3 lesions, and a higher PPV for tumor size ≥ 1.5 cm, intermediate/high D'Amico risk, PI-RADS 4, and PI-RADS 5 lesions (Fig. 5).

Discussion
Fig. 3 Diagnostic performance of different methods for ECE staging in the training, internal, and external validation cohorts. Yellow color highlights the best performance of the dedicated methods. AUC, area under the curve. Optimal cutoffs were selected using the Youden J-index.

To tailor the most suitable surgical approach for patients with PCa in terms of nerve sparing, urologists are required to balance the risk of ECE against the benefits of NVB preservation before RP is delivered. Expert-based assessment of ECE using DRE and/or MRI is highly heterogeneous [16]. In this study, we developed and validated an AI-based tool to preoperatively assess the ECE stage of localized PCa using mpMRI. Our study contributes an important methodology, accompanied by model interpretability, to address a critical question in clinical tumor staging of PCa. Our results from a cohort of 849 patients with RP at two tertiary care medical centers show the promise of a deep diagnostic model for ECE staging and the potential utility of this tool for improving performance and reducing inter-reader variability. Our study has several innovations compared with previous relevant research. First, to our knowledge, this is the first study to apply an automatic AI tool to ECE staging in patients with PCa. This is of clinical relevance, as imaging detection of ECE remains a great challenge with conventional approaches, and AI can provide potential improvement by training on high-throughput imaging features [32]. Our results revealed that our AI tool is capable of discriminating ECE in a quantitative and objective manner and performs better than expert-based scoring methods in both internal and external validation [33][34][35].
Second, in our approach, the proposed model generated a prior-attention probability map by gating the networks to learn potentially useful features across the boundaries between the prostate and tumor, thus making our model more robust and interpretable compared with traditional black-box learning approaches. Third, through Grad-CAM, our tool not only allows for a direct diagnosis of ECE but also provides a predicted region that is highly suspicious for ECE. This is a significant advancement compared with traditional predictive nomograms that only provide a binary prediction [2,3,36]. This approach might be more applicable for radiologists in real-world clinical scenarios.
Our results have several clinical implications. First, from a clinical perspective, ECE most likely occurs in the pericapsular regions of the leading imaging slice, and the accuracy of this assumption needs to be carefully clarified. Taking this into account, two analysis strategies, i.e., SS versus MS, were proposed to optimize the predictions of our networks. The results revealed that the SS-based analysis performed better than the corresponding MS-based approach. This implies that, to a certain extent, overprediction occurs in the MS analysis, leading to false-positive predictions on non-leading imaging slices. The finding is partly consistent with our primary assumption that the pericapsular region on the leading imaging slice is the location with the highest probability of ECE. Focusing the attention learning on the leading slice can provide a more effective assessment of ECE status. Second, previous studies have revealed the critical roles of clinical characteristics such as PSA, PSAD, and biopsy findings for ECE prediction [37,38]. Nevertheless, adding these clinical factors did not significantly improve the performance of the PAGNet networks. This inconsistency might be explained by the fact that the hidden FC features of our PAGNet networks are far more numerous than the embedded clinical factors, and the training cohort is relatively small compared with the number of deep-layer features. It is therefore difficult for deep networks to extract critical information from limited, sparse clinical features, a manifestation of the curse of dimensionality. This, in fact, demonstrated that our state-of-the-art AI tool can offer powerful discriminability in ECE staging even without invasive biopsy information. Third, although the expert-based ECE grading system used in our study has potential advantages over the traditional non-standardized reporting method [19], we did observe large inter-observer variances in ECE grade interpretation.
The inter-reader agreement was fair in our two cohorts, which differs significantly from that of Park et al. [19]. The positive rates in each interpreted ECE grade in our cohorts are comparable to those of Mehralivand et al. [17] but markedly lower than those of Park et al. [19]. In addition, we conducted a head-to-head comparison of performance between the AI, the experts, and expert-AI interaction. We demonstrated that AI-based assessment has higher accuracy than expert-based assessment, and the results of expert-AI interaction show that our AI could be of great assistance to radiologists in improving diagnostic performance. Fourth, we further elaborated a subgroup analysis of PAGNet, the results of which supported that our AI achieved higher PPVs for tumors of larger size, higher D'Amico risk score, and higher PI-RADS score. Therefore, personalized surgical treatment of patients with PCa is feasible when MRI-derived ECE risk and other risk-based approaches are combined.
Although encouraging results were obtained, several limitations warrant mention. First, the DL model was trained on single-center data; although the validation data were collected from two medical centers, the cohort size was still limited. In addition, in our external validation cohort, the diagnostic performance of all AI-based methods decreased markedly compared with that in the training and internal validation cohorts. Currently, multicenter application of AI-based approaches may be challenged by sample size, imaging settings, and study cohorts. Grad-CAM provides a means of visualizing model behavior, but it remains difficult to explain the essential reason for the heterogeneity of the model in external validation. More data samples from external sites used to train the model may increase performance, and quantitative analysis of model interpretation needs to be explored, which is one of our ongoing works. Therefore, a prospective multicenter randomized trial is required to validate the model in clinical scenarios before it can be made routinely available.
In conclusion, we proposed an AI-based tool embedded with a spatial attention map of the experts' prior knowledge for ECE staging using mpMRI. The tool performed better than expert-based interpretation and provided assistance to radiologists. The interpretability of our AI-based approach is particularly important for building a trustworthy automatic classification and detection tool for clinical application and for facilitating a streamlined patient-management process.

Fig. 4 The sensitivity, specificity, and summary receiver operating characteristic (SROC) curves of the AI, experts, and expert-AI interaction for ECE staging in the training, internal, and external validation cohorts. Plots are individual and combined sensitivity, specificity, and area under SROC curves of the diagnostic methods using a meta-regression analysis, by which the integrated effects of AI and experts are evaluated.

Availability of data and material The imaging studies and clinical data used for algorithm development are not publicly available because they contain private patient health information. Interested users may request access to these data; institutional approvals along with signed data use agreements and/or material transfer agreements may be needed. Derived result data supporting the findings of this study are available upon reasonable request.

Declarations
Conflict of interest The authors declare that there is no conflict of interest.
Ethics approval and consent to participate This study was retrospective and approved by the local Research Ethics Board of The First Affiliated Hospital of Nanjing Medical University (protocol 2019-SR-396), and informed patient consent was waived. All procedures performed in studies involving human participants were in accordance with the 1964 Helsinki declaration and its later amendments.