Integration of Clinical Identifications With Deep Transferable Imaging Feature Representations Can Help Predict Prostate Cancer Aggressiveness and Outcome

Objective: To develop a generalizable machine learning platform, designated PI-Risk, which incorporates clinicians’ prior identifications with deep transferable imaging feature representations into predictive models for PCa Gleason grade. Patients and Methods: This retrospective study included 1442 biopsy-naïve patients from two tertiary care medical centers between January 2014 and December 2019. We investigated an interpretable risk assessment model (PI-Risk) for risk stratification of PCa. The performance of the PI-Risk model was independently tested on 232 internal test datasets and on 539 external validation datasets. Model performance was evaluated against a “ground truth” of imaging-histopathologic annotations using receiver operating characteristic (ROC) analysis. Detection rates (true positive, true negative, false positive and false negative rates) were reported using confusion matrix analysis. The Cox model’s performance was evaluated using Harrell’s concordance index (C-index), calibration curves and Kaplan–Meier survival analysis. Results: PI-Risk, which integrates 10 risk factors, was built for accurate risk stratification. In the multinomial regression model, predicted IntraT-Rad G0 and PI-RADS score 2 were two independent predictors of G0 stage. Predicted IntraT-Rad G1 (OR, 2.42; 95% CI, 2.13–2.86; p < 0.001) and PSA 4-10 ng/ml were two independent predictors of G1 stage. PSA 10-20 ng/ml, predicted PeriT-DLR-SqueezeNet G1 and predicted IntraT-Rad G3 were three independent predictors of G2 stage. PI-RADS score 5 was the independent predictor of G3 stage. PI-RADS score 5 and PSA > 100 ng/ml were two independent predictors of G4 stage. Combined use of PSA 4-10 ng/ml, PI-RADS 1-2 and stacked IntraT-Rad G0-1 resulted in an excellent negative predictive value (NPV, 94.1%) for clinically insignificant (CIS) disease, and combined use of PSA > 100 ng/ml and PI-RADS 5 resulted in a high positive predictive value (PPV, 79.8%) for high-risk PCa.
In follow-up, patients stratified by PI-Risk showed significantly different biochemical recurrence (BCR) rates after surgery. Conclusions: PI-Risk can offer a noninvasive alternative tool to stratify PCa aggressiveness and is a step towards PCa risk stratification, including stratification of disease progression risk. We found that PI-Risk > G3 and PSA level ≥ 20 ng/ml were significantly associated with worse BCR-free survival, implying the prognostic relevance of PI-Risk assessment for short- and long-term patient management. Combined use of PI-Risk and PSA yielded a C-index of 0.76 for predicting 3-yr BCR in our primary cohort. Because this was a preliminary result in a small population, the prognostic value of PI-Risk in external independent cohorts warrants further validation.


Introduction
Men diagnosed with the same prostate cancer (PCa) often show significant heterogeneity in clinical outcomes.
In general, low-risk PCa is mostly indolent and would never progress or cause harm to the patient if left untreated [1,2]. Similarly, intermediate-risk PCa has lower biochemical recurrence (BCR) rates and significantly better survival rates than high-risk PCa [3]. The Gleason score is currently the best prognostic factor of PCa and is used clinically to determine PCa aggressiveness and guide treatment planning. However, owing to random sampling, biopsies might misestimate the Gleason score compared with radical prostatectomy (RP) specimens [4,5]. Additionally, the widespread use of prostate-specific antigen (PSA) screening and the introduction of reduced PSA thresholds for biopsy have contributed to a significant increase in unnecessary biopsies in men who do not have PCa [6,7]. Therefore, a noninvasive tool for accurate risk stratification of biopsy-naïve patients would have a significant impact on clinical decision making, treatment planning and outcome prediction, and would spare patients from painful biopsies and their accompanying risk of complications.
Multiparametric magnetic resonance imaging (mpMRI), a clinically used tool for PCa detection, can also provide additional information on PCa aggressiveness and prognosis through its high-resolution characterization of tumor heterogeneity and cellularity [8,9]. Previous studies have indicated that PCa with progressive pathological characteristics is associated with significantly different gray-level imaging patterns on T2-weighted imaging (T2WI) and diffusion-weighted imaging (DWI) [10,11]. Several studies have also shown that machine learning or deep learning on high-dimensional image features can provide improved diagnostic and predictive accuracy for localized PCa [12,13,14].
However, the increased dimensionality relative to limited cohort sizes in clinical settings, as well as the inherently complex networks of internal correlations between the measured tumor types and images, present unique computational challenges [15]. Additionally, there are numerous multimodal feature representations, such as clinical and demographic indicators and large-scale imaging identifications generated in varying clinical settings. A clinical tool should leverage the integration and interactions of these modalities for risk stratification. However, such analyses face several challenges. First, effective approaches for integrating multimodality data are lacking, especially in the context of gray-level imaging features. Second, although high-throughput deep networks have matured to a point that enables detailed discoveries of diseases in task-specific settings, the limited cohort size and high dimensionality of data increase the possibility of false-positive discoveries and overfitting. Deep generative approaches, such as deep transferable learning, can translate complex and high-dimensional images into relevant computational feature representations. Third, as every algorithm has its strengths and weaknesses, no single algorithm works best for every problem. Stacked-ensemble learning uses meta-algorithms to learn from the predictions of various algorithms and builds a stacked-ensemble model from them, which increases prediction accuracy and reduces the false positive rate of the base model predictions [16].
In this study, we introduce a generalizable auto machine learning (ML) platform, designated PI-Risk, which incorporates clinical identifications with high-dimensional images into predictive models for PCa Gleason grade. The imaging phenotypes of observations are quantitatively characterized by radiomic descriptors of the intratumoral region and by deep transferable learning feature representations of the region around the tumor. We built an end-to-end sparse model to integrate multimodal data representations for multi-task classification. Our study included 1,442 biopsy-naïve patients from two tertiary care medical centers, consisting of 671 patients for model development, 232 patients for internal test and 539 patients for external validation.

Study cohort
This retrospective study was approved by the local Institutional Review Board, and the need for written informed consent was waived. All consecutive patients who underwent prostate mpMRI at two tertiary care medical centers were reviewed. All procedures performed in studies involving human participants were in accordance with the 1964 Helsinki declaration and its later amendments.
The primary cohort was drawn from the medical record databases of the two institutions and comprised patients with histologically proven disease between January 2014 and December 2019. The inclusion criteria were as follows: (1) patients with biopsy- or prostatectomy-proven PCa; (2) standard prostate 3.0 T MRI performed within 4 weeks before the biopsy or prostatectomy; (3) availability of standard histologic tissue slices of dissected prostatectomy specimens. Patients were excluded for (1) absence of biopsy, surgical intervention or medical records within 8 weeks after MRI examination (n = 9554); (2) inadequate imaging quality or imaging exams from outside institutions (n = 141); (3) previous surgery, radiotherapy or drug therapy for PCa (interventions for benign prostatic hyperplasia or bladder outflow obstruction were deemed acceptable) (n = 436). Finally, 903 patients from center 1 and 539 patients from center 2 were eligible for clinical evaluation.
As a standard part of patient management in our two medical centers, lesions with a Prostate Imaging Reporting and Data System (PI-RADS) [17] score ≥ 3 underwent fusion or cognitive targeted MR-guided biopsy in conjunction with systematic biopsy, performed by five urologists who each had prior experience of at least 1000 TRUS-guided prostate needle biopsies. Patients with PI-RADS 1-2 underwent TRUS-guided systematic biopsy. Two highly experienced uropathologists reviewed the available histopathological slides according to the 2014 WHO/ISUP recommendation [18]. From histopathology, we primarily defined biopsy-benign, Gleason score 3+3, 3+4, 4+3 and ≥ 4+4 as the G0, G1, G2, G3 and G4 groups, respectively. We secondarily grouped G0-1 into clinically insignificant (CIS) disease, G2-3 into intermediate-risk PCa and G4 into high-risk PCa.
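The grouping rule above can be written out as a small lookup. This is a minimal sketch for illustration only: the function name `gleason_to_group`, the dictionary `RISK_TIER`, and the convention of passing `None` for a benign biopsy are assumptions, not part of the study's software. Treating any score of 6 or lower as G1 is also an assumption, since the text only names 3+3.

```python
# Grade groups as defined in the text:
# benign -> G0, 3+3 -> G1, 3+4 -> G2, 4+3 -> G3, >= 4+4 -> G4.
def gleason_to_group(primary=None, secondary=None):
    """Map a Gleason pattern pair to G0..G4 (None, None means biopsy-benign)."""
    if primary is None and secondary is None:
        return "G0"
    total = primary + secondary
    if total <= 6:
        return "G1"  # assumption: scores of 6 or lower are treated like 3+3
    if total == 7:
        # 3+4 and 4+3 share the same sum but belong to different groups
        return "G2" if primary == 3 else "G3"
    return "G4"  # Gleason score 8 or higher (>= 4+4)


# Secondary grouping used for the clinical risk tiers in the text.
RISK_TIER = {
    "G0": "CIS",
    "G1": "CIS",
    "G2": "intermediate-risk",
    "G3": "intermediate-risk",
    "G4": "high-risk",
}

if __name__ == "__main__":
    for pair in [(None, None), (3, 3), (3, 4), (4, 3), (4, 5)]:
        group = gleason_to_group(*pair)
        print(pair, group, RISK_TIER[group])
```

Keeping the two mappings separate mirrors the text, where the five-class grouping is the primary modeling target and the three-tier grouping is derived from it.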
The PI-Risk model was primarily designed for multi-task classification of G0, G1, G2, G3 and G4 disease. We randomly split the data of center 1 into a training group (n = 671) and a test group (n = 232) for model development and internal test, respectively. We also used the data of 539 patients from center 2 for external validation. A flow diagram of patient selection with inclusion and exclusion criteria is shown in supplementary Fig. S1.

Follow-up
The first postoperative visit was 6 weeks after RP, and patients were then followed up at intervals of 3 to 6 months based on PSA. The time of BCR was recorded. Patients were censored in case of emigration or on 30 July 2020, whichever came first. BCR was defined according to previously reported criteria [19,20].

Prostate mpMRI
All imaging exams were performed on two 3.0 T MRI scanners with pelvic phased-array coils (MAGNETOM Skyra; Siemens Healthcare, Munich, Germany) at the two institutes. The mpMRI consisted of T2WI in three planes, DWI with a high b-value of 1500 s/mm² and an apparent diffusion coefficient (ADC) map in the axial plane (supplementary data, Table S1).

Lesion Segmentation
The entire volume of interest (VOI) of each lesion was segmented using in-house software (Oncology Imaging Analysis v2; Shanghai Key Laboratory of MRI, ECNU, Shanghai, China) based on histopathologic-imaging matching by two dedicated radiologists (reader 1 and reader 2, with 3-yr and 5-yr experience of prostate imaging, respectively). The contours of the VOIs were then rechecked in consensus with a board-certified radiologist (reader 3, with 15-yr experience of prostate imaging). In patients with RP (n = 1006), postsurgical ex vivo prostates were processed using a previously described protocol [21]. Key steps included sectioning, digitization, and annotation of cancer regions by highly experienced urological pathologists. The histopathologic specimens were then assembled into pseudo-whole-mount sections and coregistered to the MRI using a previously described registration method [21]. In this way, regions of annotated PCa were mapped onto the images to produce the ground truth maps. In total, histopathologic-imaging matched specimens were identified. In patients without RP (n = 436), all subjects underwent MRI/TRUS-fusion targeted biopsy followed by 11-gauge core systematic needle biopsy. A central challenge in image labeling is the presence of ambiguous regions, where the true tumor boundary cannot be deduced from the image and multiple equally plausible interpretations exist. To address this, the VOI of each lesion was drawn twice by each of two independent radiologists. The region that overlapped between the two instances was taken as the final VOI of the targeted lesion. Because imaging correlation with whole-mount prostatectomy specimens was not achievable in our retrospective data, the unit of assessment in this study was per-patient. When patients had multiple lesions, only the index lesion with the largest size and/or highest Gleason score was assessed.
Development, performance, and validation of predictive models
Volumetric radiomics features were extracted from the target lesions using the open-source Python package Pyradiomics [22]. Image normalization was performed using a method that remapped the histogram to fit within µ ± 3σ (µ: mean gray level within the VOI; σ: gray-level standard deviation). A total of 2,553 radiomic features, including intensity, shape, texture and wavelet features, were computed from the target volume on T2WI, DWI with a b-value of 1500 s/mm² and ADC images, providing rich descriptions of the heterogeneity of the entire-volumetric intratumor regions (IntraT-Rad).
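As an illustration of the µ ± 3σ normalization described above, the following minimal sketch clips gray levels to the interval [µ − 3σ, µ + 3σ] and rescales them linearly to [0, 1]. The function name `normalize_voi`, the use of the population standard deviation, and the [0, 1] output range are assumptions made for the example; the text does not specify these details.

```python
from statistics import mean, pstdev


def normalize_voi(values, n_sigma=3.0):
    """Clip gray levels to mu +/- n_sigma * sigma and rescale to [0, 1].

    `values` is a flat sequence of gray levels sampled inside the VOI.
    """
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # A constant region has no contrast; map it to the mid-point.
        return [0.5 for _ in values]
    lo = mu - n_sigma * sigma
    hi = mu + n_sigma * sigma
    # Clip each value into [lo, hi], then rescale linearly to [0, 1].
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    # Hypothetical gray levels with one extreme outlier.
    voi = [0.0] * 20 + [100.0]
    print(normalize_voi(voi))
```

Clipping before rescaling keeps a few extreme voxels from compressing the rest of the histogram into a narrow band, which is the usual motivation for a µ ± 3σ window.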
The IntraT-Rad features focus on the inner regions of PCa. We further investigated a tumor-related region around the target lesion using novel deep transferable learning feature representations (PeriT-DLR). PeriT-DLR features were measured directly on MRI data using an image embedding toolbox (https://github.com/biolab) through five pre-trained deep neural networks, i.e., DeepLoc, Inception v3, SqueezeNet, VGG-16 and VGG-19, as embedders [23]. To obtain representative imaging features of the target lesion, we used the hand-cropped VOI as an attention gate for each embedder when analyzing PeriT-DLRs (i.e., regions around the PCa) in the center slice of an MRI scan. Each embedder calculates a feature vector for each image and returns an enhanced image descriptor. For image embedding, we used the penultimate layer of each embedder to produce new image profiles, serving as another set of imaging feature representations (PeriT-DLRs) in parallel to IntraT-Rad for PCa. The detailed parameters of each embedder are summarized in the supplementary data (Table S2).
Reducing the dimension of the feature space aims to select informative characteristics and to reduce the risk of bias and potential overfitting. To obtain the quantitative imaging hallmarks, we first assessed multi-scale imaging profiles, including 2553 vectors from Pyradiomics, 6144 vectors from Inception v3, 3000 vectors from SqueezeNet, 12288 vectors from VGG16, 12288 vectors from VGG19 and 1536 vectors from DeepLoc, using the mean decrease Gini index (MDGI) calculated by a Random Forest algorithm. The MDGI represents the importance of individual features for correctly classifying a residue into linker and non-linker regions. The MDGI was calculated by classifying 200 randomly selected linker features and 200 non-linker features, and the mean MDGI was calculated as the MDGI averaged over 100 trials. The mean MDGI z-score of each feature was calculated as $z_i = (x_i - \mu)/\sigma$, where $x_i$ is the individual MDGI of the given feature, and $\mu$ and $\sigma$ are the mean and standard deviation of all MDGIs, respectively. Vector elements with an MDGI z-score larger than 2.0 were selected as optimum feature candidates. Next, MDGI-selected features from each embedder were analyzed using an auto stacked-ensemble ML approach based on an open-source auto ML platform (https://github.com/awslabs/autogluon).
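The z-score selection step can be sketched as follows, assuming the mean MDGI of each feature has already been computed. The function name `select_by_zscore`, the feature names and the importance values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def select_by_zscore(importances, threshold=2.0):
    """Return feature names whose importance z-score exceeds `threshold`.

    `importances` maps a feature name to its mean MDGI (already averaged
    over repeated Random Forest runs).
    """
    values = list(importances.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return []  # all features equally important; nothing stands out
    return [
        name
        for name, value in importances.items()
        if (value - mu) / sigma > threshold
    ]


if __name__ == "__main__":
    # Hypothetical importances: nine weak features and one strong one.
    mdgi = {f"f{i}": 0.01 for i in range(9)}
    mdgi["f9"] = 0.5
    print(select_by_zscore(mdgi))
```

A fixed z-score cut-off selects features relative to the importance distribution of each embedder, so the number of retained features can differ between embedders.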
The first layer of our auto ML framework has 5 base learners: k-nearest neighbors (kNN), AdaBoost, Random Forest, Logistic Regression (LR) and Support Vector Machine (SVM). Their outputs are concatenated and fed into the next layer, which itself consists of multiple stacker models. These stackers then act as base models for an additional layer. The framework employs only random search for hyperparameter tuning, together with automated model selection, ensembling, feature engineering, data preprocessing and data splitting, and thus offers an attractive way to deploy high-performance stacked-ensemble models. We performed a random search over the parameter configurations and chose the parameters with the best score, based on the log-loss of the ML model on 5-fold cross-validation datasets.
The outputs of the ML predictor indicated the relative risk that the patient had G0, G1, G2, G3 or G4 disease.
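The log-loss criterion used for model selection can be illustrated with a short sketch for the five-class setting (G0 to G4). The function name `multiclass_log_loss`, the clipping constant `eps` and the example probabilities are assumptions made for the illustration; the sketch does not reproduce the auto ML platform's internal scoring code.

```python
import math


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    `y_true` holds class indices (0..4 for G0..G4) and `probs` holds one
    predicted probability vector per patient.
    """
    total = 0.0
    for label, p in zip(y_true, probs):
        # Clip to avoid log(0) when a model is over-confident and wrong.
        p_true = min(max(p[label], eps), 1.0 - eps)
        total -= math.log(p_true)
    return total / len(y_true)


if __name__ == "__main__":
    # Hypothetical predictions for three patients.
    labels = [0, 2, 4]
    predicted = [
        [0.7, 0.1, 0.1, 0.05, 0.05],
        [0.1, 0.2, 0.5, 0.1, 0.1],
        [0.05, 0.05, 0.1, 0.2, 0.6],
    ]
    print(multiclass_log_loss(labels, predicted))
```

In a 5-fold setting, this score would be computed on each held-out fold and averaged, and the configuration with the lowest mean log-loss would be retained.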
To evaluate the synergistic effect of multimodal features for the prediction of Gleason grade, the 6 new imaging signatures were integrated with 4 clinical variables: patient age (≤ 60 yrs, > 60 yrs), PSA level (4-10 ng/ml, 10-20 ng/ml, 20-100 ng/ml and > 100 ng/ml), location of observation (peripheral zone [PZ], transition zone [TZ]) and the PI-RADS score from radiologists' reports. An interpretable risk assessment model (PI-Risk) was finally developed using multinomial LR with an elastic net penalty. The PI-Risk model is based on proportionally converting each regression coefficient in the multivariate logistic regression to a 0- to 100-point scale. The effect of the variable with the highest β coefficient (absolute value) is assigned 100 points. The points are added across independent variables to derive total points, which are converted to predicted probabilities (Pi). The performance of the PI-Risk model was independently tested on 232 internal test datasets and on 539 external validation datasets. The entire flowchart of the auto ML analysis for PI-Risk model development is shown in Fig. 1.
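The point-scaling rule described above can be sketched as below. The coefficient values are hypothetical and are not the fitted PI-Risk coefficients; the function names `coefficients_to_points` and `total_points` are likewise assumptions. The sketch covers only the scaling and summation steps and leaves out the conversion of total points back to predicted probabilities.

```python
def coefficients_to_points(betas):
    """Scale regression coefficients to a 0-100 point scale.

    The factor with the largest absolute coefficient receives 100 points
    and every other factor is scaled proportionally.
    """
    max_abs = max(abs(b) for b in betas.values())
    if max_abs == 0:
        raise ValueError("all coefficients are zero")
    return {name: 100.0 * abs(b) / max_abs for name, b in betas.items()}


def total_points(points, present_factors):
    """Sum the points of the factors that apply to one patient."""
    return sum(points[name] for name in present_factors)


if __name__ == "__main__":
    # Hypothetical coefficients, for illustration only.
    betas = {"PI-RADS 5": 2.0, "PSA > 100 ng/ml": -1.0, "age > 60 yrs": 0.5}
    points = coefficients_to_points(betas)
    print(points)
    print(total_points(points, ["PI-RADS 5", "age > 60 yrs"]))
```

Because absolute values are used, the sign of each coefficient (protective versus harmful) has to be tracked separately when a nomogram axis is drawn.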

Predictors of clinical outcome
Additionally, we prospectively evaluated a Cox model using 5 clinical-imaging risk factors, including age, PSA, PI-RADS score and predicted PI-Risk, to assess the incremental value of our imaging signatures for predicting biochemical recurrence (BCR) of PCa after RP in 462 PCa patients who underwent RP.

Statistical Analysis
Using biopsy and/or prostatectomy specimens as the reference standard, lesions were divided into G0, G1, G2, G3 and G4 groups. Quantitative variables were expressed as mean ± standard deviation (mean ± SD) or median and range, as appropriate. Model performance was evaluated against a "ground truth" of imaging-histopathologic annotations using receiver operating characteristic (ROC) analysis. Detection rates (true positive, true negative, false positive and false negative rates) were reported using confusion matrix analysis. The Cox model's performance was evaluated using Harrell's concordance index (C-index), calibration curves and Kaplan-Meier survival analysis. All statistical tests were two-sided, and a p-value less than 0.05 was considered statistically significant.
[Table 1; Fig. S4]
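For readers unfamiliar with Harrell's C-index, the following minimal sketch shows how it is computed from follow-up times, event indicators and predicted risk scores. It is a simplified illustration: the function name `harrell_c_index` and the example data are hypothetical, pairs with tied event times are skipped, and it is not the statistical software used in the study.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable patient pairs ranked correctly by risk.

    A pair (i, j) is comparable when patient i had the event (e.g. BCR)
    and a shorter follow-up time than patient j. The pair is concordant
    when patient i also has the higher predicted risk; ties in risk
    count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Hypothetical follow-up in months, event flags and risk scores.
    months = [5, 10, 15, 20]
    bcr = [1, 1, 0, 1]
    risk = [0.9, 0.7, 0.4, 0.2]
    print(harrell_c_index(months, bcr, risk))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the C-index of 0.76 reported for the combined use of PI-Risk and PSA.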

Development, performance, and validation of PI-Risk model
PI-Risk, which integrates 10 risk factors, is presented as interpretable nomograms (Fig. 2) (Fig. 3).
As part of this study, we considered the predictive value of simplified combinations of the independent factors in PI-Risk (Fig. 4).

Discussion
The Gleason score is the determining factor for treatment planning of PCa and for postoperative survival prediction [24]. In this study, we proposed a collaborative framework that integrates clinicians' prior knowledge and deep transferable image feature representations into an interpretable PI-Risk tool to improve the prediction of Gleason score. This integrated approach to data analysis can be generalized under the 'task-free image embedding with privileged deep networks' paradigm described by Zupan et al [23]. This study contributes important methodology. Deep image embedding by 5 pretrained models is core to our study. Feature learning with problem-specific algorithms is implicit; however, training a deep network usually requires a large number of images, which limits its utility.
Deep image embedding does not need training on a closely related set of images. A deep model pretrained on a sufficiently large number of diverse images may infer useful features from a broad range of new image sets. This idea was proposed by Zupan et al [23], who explored a democratized image analytics toolbox that integrates deep learning embedding. In our experiments, we used the dedicated embedders to build 6 new imaging hallmarks, which provide additional information for stratifying patients into groups with G0 to G4 risks. Even without incorporating clinical indicators, the new hallmarks can determine G0, G1, G2, G3 and G4 disease with accuracies of 0.806-0.918, 0.718-0.789, 0.591-0.680, 0.604-0.655 and 0.775-0.792, respectively. An auto ML platform using stack ensembling is another core component of our method. Unlike prior approaches that focused on the task of combined algorithm selection and hyperparameter optimization, our approach performs advanced data processing, deep learning and model ensembling.
It automatically recognizes the data type in each column for robust data preprocessing, including special handling of high-dimensional imaging datasets. In particular, owing to its ability to employ multi-layer stack ensembling, which uses the aggregated predictions of the base models as features, the stacker model can improve upon shortcomings of the individual base predictions and exploit interactions between base models to offer enhanced predictive power.
For medical data, the potential phenotype information conveyed in images is more complex than simple variables; it is also delicate and thus needs to be analyzed more carefully. PSA recurrence is currently the strongest clinical end point of PCa, driving almost all initial disease management decisions after primary treatment [24]. It has been demonstrated that patients with high-risk PCa have significantly worse BCR-free survival than low- to intermediate-risk groups [28,29]. Results from some studies indicate that imaging findings such as PI-RADS and radiomics features have prognostic value for BCR after prostatectomy [20,30,31]. Our preliminary results also suggest that PI-Risk stratification has a potential role in predicting the risk of disease progression preoperatively. We found that PI-Risk > G3 and PSA level ≥ 20 ng/ml were significantly associated with worse BCR-free survival, implying the prognostic relevance of PI-Risk assessment for short- and long-term patient management. Combined use of PI-Risk and PSA yielded a C-index of 0.76 for predicting 3-yr BCR in our primary cohort. Because this was a preliminary result in a small population, the prognostic value of PI-Risk in external independent cohorts warrants further validation.
There are several limitations of our research. First, although the data of this study originated from two medical centers with internal and external validation, the cohort size was still limited for our data-driven approach, which is expected to benefit from larger data sets. A larger study population will be needed to optimize the performance of the model.
Second, part of our external data used MRI-guided biopsy as the reference standard. Although targeted prostate biopsy is regarded as a reliable method for PCa detection, its accuracy might be affected by technical variations in the features or operation of equipment [32,33]. Third, the deep transfer learning currently used only the center slice instead of the full 3D tumor volume, so the comparison between IntraT-Rad features and PeriT-DLR features may not be comprehensive. Center-slice analysis has been shown to perform very close to 3D volume analysis in many cancer imaging studies; thus, our results on the PeriT-DLR features are still informative. In our next-step research, we will implement a 3D-based deep learning approach by leveraging more powerful computational resources.

Conclusions
In summary, we proposed an interpretable tool for PCa aggressiveness assessment. We provided a robust auto-ML framework for integrating multimodality data in relation to PCa aggressiveness. The interpretability of PI-Risk is particularly important for building trustworthy auto-ML tools for clinical applications. Our study on two cohorts showed that PI-Risk may serve as a useful alternative to enhance the stratification and prognostication of biopsy-naïve patients. Further evaluation of our methods in a multi-center setting is needed and is a goal of our future work.

Declarations
Ethics Committee approval was granted by the local institutional ethics review board, and the requirement of written informed consent was waived.
Conflict of Interest statement: The authors declare that they have no conflicts of interest.
Interpretable nomogram (a) and estimated odds ratios (OR) with forest plots (b) for the PI-Risk model. Only the significant (p < 0.05) factors in the regression model are shown in the forest plots; shown are the mean and the lower and upper bounds of the 95% confidence intervals of the OR.
Clinical application of independent factors, i.e., PSA level, IntraT-Rad signature and PI-RADS score, as well as their simplified combinations, for determining clinically insignificant (CIS) disease, intermediate-risk PCa and high-risk PCa. For determining CIS (a), PSA 4-10 ng/ml, IntraT-Rad G0-1 and PI-RADS 1-2 were combined, which resulted in the highest negative predictive value (marked in yellow). For determining high-risk PCa (b), PSA > 100 ng/ml and PI-RADS 5 were combined, resulting in the highest positive predictive value (marked in yellow). This simple approach can provide additional clinical information for better treatment decision making.