A Deep CNN Approach For Predicting Cumulative Incidence Based On Pseudo-Observations

Background: Prognostic models are of high relevance in many medical application domains. However, many common machine learning methods have not been developed for direct applicability to right-censored outcome data. Recently there have been adaptations of these methods to make predictions based only on structured data (such as clinical data). Pseudo-observations have been suggested as a data pre-processing step to address right-censoring in deep neural networks. There is theoretical backing for the use of pseudo-observations to replace the right-censored response outcome, and this allows algorithms and loss functions designed for continuous, non-censored data to be used. Medical images have been used to predict time-to-event outcomes with deep convolutional neural network (CNN) methods using a Cox partial likelihood loss function under the assumption of proportional hazards. We propose a method to predict the cumulative incidence from images and structured clinical data by combining pseudo-observations and convolutional neural networks.
Results: The performance of the proposed method is assessed in simulation studies and a real data example in breast cancer from The Cancer Genome Atlas (TCGA). The results are compared to the existing convolutional neural network with Cox loss. Our simulation results show that our proposed method performs similarly to or even outperforms the comparator, particularly in settings where neither the dependent censoring nor the survival time follows proportional hazards and the sample size is large. The results of the application to the TCGA data are consistent with the simulation results for small sample settings, where both methods perform similarly.
Conclusions: The proposed method facilitates the application of deep CNN methods to time-to-event data and allows for the use of simple and easy-to-modify loss functions, thus contributing to modern image-based precision medicine.

cancer, tumors with predominantly micro-papillary and solid patterns have been associated with a poorer prognosis [4]. With the advent of digital pathology, whole-slide images (WSIs) of stained tissue sections are becoming increasingly available. This may provide the opportunity to accurately predict individual prognoses at scale using image data paired with other clinical information, and to provide clinicians with decision support that can guide clinical management decisions to enhance personalized treatment and thus improve patient care [1][2].
Few machine learning methods were originally developed for survival outcomes, and thus most existing machine learning methods for survival outcomes are adaptations. This is also true for image analysis methods. CNN methods have been used and adapted to address the task of predicting time-to-event outcomes from WSIs. Recent works [7-11] have used WSIs with CNNs for survival predictions. They applied convolutional layers to extract features of the images using convolutional kernels and pooling operations, followed by a sequence of fully connected layers where the terminal layer outputs a predicted risk associated with the image. These risks are plugged into the Cox partial likelihood and the network is trained using back-propagation and an optimization algorithm. These prior works combined modern CNN models with Cox regression for prediction of time-to-event outcomes, keeping the assumption of proportional hazards. This stipulates no effect modification by time, which can be restrictive or even unrealistic [12]. Furthermore, the negative partial log-likelihood is a relatively complicated loss function that can be challenging to implement in existing CNN frameworks.
… such as Aalen's linear hazard model [34], boosted Cox regression [35] or random forest [36].
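For reference, the negative Cox partial log-likelihood used as a loss in the works cited above can be sketched as below. This is a minimal numpy version with Breslow-style risk sets and no tie correction, not the implementation of any of the cited works; variable names are illustrative.

```python
import numpy as np

def neg_cox_partial_loglik(risk_scores, times, events):
    """Negative Cox partial log-likelihood (Breslow, no tie correction).

    risk_scores: predicted log-risks eta_i (e.g. from a network's final layer)
    times, events: observed times and event indicators (1 = event, 0 = censored)
    """
    risk_scores = np.asarray(risk_scores, dtype=float)
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    loglik = 0.0
    for i in np.flatnonzero(events == 1):
        at_risk = times >= times[i]  # subjects still under observation at T_i
        loglik += risk_scores[i] - np.log(np.sum(np.exp(risk_scores[at_risk])))
    return -loglik
```

Note that each term couples a subject to its entire risk set, which is why this loss is more awkward to express in standard per-sample mini-batch frameworks than an ordinary regression loss.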
Once pseudo-observations are obtained, the sample data for the analysis is given by D = {(I^(1), X^(1), θ̂^(1)(τ)), …, (I^(n), X^(n), θ̂^(n)(τ))}. D can be used to train any CNN model to predict the individual risk of experiencing the main event before time τ; the resulting predictions would be similar to those we would have obtained from the corresponding uncensored sample, had such data been available.
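Concretely, for a single event type the jackknife pseudo-observations θ̂_i(τ) = n·F̂(τ) − (n−1)·F̂^(−i)(τ), with F̂ = 1 − Ŝ from the Kaplan-Meier estimator, can be computed as in this sketch (with competing risks, the Aalen-Johansen estimator would take Kaplan-Meier's place; function names are illustrative):

```python
def km_survival(times, events, tau):
    """Kaplan-Meier estimate of S(tau); events: 1 = event, 0 = censored.
    Deaths are processed before censorings at tied times."""
    s, at_risk = 1.0, len(times)
    for t, d in sorted(zip(times, events), key=lambda p: (p[0], -p[1])):
        if t > tau:
            break
        if d == 1:
            s *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return s

def pseudo_observations(times, events, tau):
    """Jackknife pseudo-observations of F(tau) = 1 - S(tau).

    times, events: equal-length lists; returns one value per subject."""
    n = len(times)
    f_full = 1.0 - km_survival(times, events, tau)
    return [n * f_full
            - (n - 1) * (1.0 - km_survival(times[:i] + times[i + 1:],
                                           events[:i] + events[i + 1:], tau))
            for i in range(n)]
```

A useful sanity check: with no censoring, the pseudo-observations reduce exactly to the binary indicators 1{T_i ≤ τ}, which is what makes them a natural regression response for non-censored loss functions.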

Convolutional neural networks are a class of neural networks that can be applied to data that spatially encode information in an evenly spaced grid topology, such as images or time series. Compared to multi-layer perceptrons (MLPs), CNNs share weights between their kernels, or filters, which drastically reduces the number of model parameters; this is based on the assumption of translational invariance of these filters. Through a hierarchical structure of consecutive convolutional layers, CNNs can learn representations of increasing complexity, such that the first layers typically extract unspecific low-level features such as corners and edges, whereas the last layers encode more abstract concepts that are more specific to the training data. This structure of convolutional layers is typically followed by one or more fully connected layers, which are comparable to an MLP that weights the activations of the final convolutional layers, resulting in a model output. We refer the readers to [5, 37] for a more detailed

exposition.
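To make the weight-sharing point concrete: one small kernel is reused at every spatial position, so its parameter count (k × k) does not grow with the image size. A plain numpy sketch of a "valid" convolution as deep learning libraries implement it (technically cross-correlation):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """2-D 'valid' convolution (cross-correlation, as in most DL libraries).

    The same k x k kernel is applied at every position, so the layer has
    k*k parameters regardless of the image dimensions."""
    h, w = image.shape
    k = kernel.shape[0]
    out = np.empty((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out
```

A dense layer mapping the same image to the same output size would instead need a separate weight for every input-output pixel pair, which is the parameter reduction described above.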
Pseudo-Observation (PO) CNN
Our proposed PO-CNN procedure enables a CNN to be fitted using a PO-based response to predict the cumulative incidence from images and structured clinical data. The pipeline of the proposed framework is shown in Figure 2. The implementation that only includes the clinical data as intermediate input is a multi-output regression, which is related to the multi-task supervised learning approach [38]. It is of note that although it is technically possible to use the image information in the fitting of the censoring model, we do not believe this is practical or necessary. Instead, fitting a model for the censoring distribution based solely on the set of available clinical covariates is likely sufficient and much more feasible in practice. Thus, we assume that it is sufficient to condition on the clinical covariates when modeling the censoring in the IPCW-PO. This is slightly more restrictive than a Cox-based CNN, in which all inputs are included in some way in the final layer and are thus accounting for censoring; we investigate this in the simulations.
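As an illustration of the loss side of this pipeline, the multi-output variant can regress the network's outputs on the pseudo-observations over the whole grid of time points with an ordinary squared-error loss. The shapes and the optional per-time-point weighting below are our assumptions for the sketch, not the authors' exact implementation:

```python
import numpy as np

def po_multi_output_loss(pred, pseudo_obs, weights=None):
    """Squared-error loss between predicted cumulative incidences and
    pseudo-observations over a grid of time points.

    pred, pseudo_obs: shape (n_subjects, n_time_points)
    weights: optional per-time-point weights, shape (n_time_points,)
             (hypothetical knob for heterogeneity across time points)
    """
    pred = np.asarray(pred, dtype=float)
    pseudo_obs = np.asarray(pseudo_obs, dtype=float)
    sq_err = (pred - pseudo_obs) ** 2
    if weights is not None:
        sq_err = sq_err * np.asarray(weights, dtype=float)
    return float(np.mean(sq_err))
```

Because pseudo-observations are ordinary continuous responses, this loss is a standard regression objective that any deep learning framework handles per mini-batch, in contrast to the risk-set coupling of the Cox partial likelihood.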

For all methods that follow we used a Residual Network model [39] with 18 layers (ResNet18), although any desired model could be used. The typical block of layers in ResNet is i) convolution; ii) non-linear activation function; iii) batch normalization; iv) pooling and v) dropout. We used a ResNet model pre-trained on ImageNet [40, 41] to initialize the weights instead of random initialization. We added to the terminal layer of ResNet a simple fully connected layer followed by tanh(·) as the activation function. We used PyTorch for the CNN implementation.
Simulations
We used images from the CIFAR-10 dataset as the basis of our simulated data. The CIFAR-10 dataset consists of color images (32 × 32 × 3) in 10 classes. We denote with y ∈ {0, …, 9} the class of image I. We generated the true survival time based on the class that each image represents as well as independent covariates. We generated nine independent covariates X_1, …, X_9 from the standard normal distribution and one binary covariate X_10. We present six cases corresponding to different survival and censoring time models. For each case, we consider sample sizes of N = 1000 and 5000. For each sample size, 80% of the observations were randomly sampled for training, while the remaining 20% were set aside as a test set. The simulations were repeated 100 times each. The accuracy of the prediction of the cumulative incidence at four percentiles of the observed times was assessed using the area under the ROC curve (AUC). The prediction at each time point was compared to the true binary outcome of having an event prior to the given time point of interest. This latter variable is not censored since we know the exact survival time. The pseudo-observations PO and IPCW-PO were computed for a grid of time points corresponding to the 20th, 30th, 40th and 50th percentiles of the overall time distribution.
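The survival times in the cases below can be generated by inverse-transform sampling: the cumulative hazard evaluated at the event time follows a standard exponential distribution. A sketch for a baseline hazard of the form λ_0(t) = 2t (as in Case 1), where `eta` stands in for the case-specific linear predictor such as 1.7y + (0.3 + 0.6 cos(y))X_10 + 0.2X_1:

```python
import numpy as np

def invert_cum_hazard(e, eta):
    """Solve Lambda(T) = e for T, where Lambda(t) = t**2 * exp(eta)
    is the cumulative hazard of lambda(t) = 2t * exp(eta)."""
    return np.sqrt(e * np.exp(-eta))

def sample_survival_time(eta, rng):
    """Inverse-transform sampling: since Lambda(T) ~ Exp(1),
    draw a standard exponential and invert the cumulative hazard."""
    return invert_cum_hazard(rng.exponential(), eta)
```

The other proportional hazards cases follow the same recipe with their own baseline hazards; the gamma-distributed cases can be drawn directly from a gamma random number generator.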
We did not tune any hyper-parameters. All simulations were run using a learning rate of 0.0001. For each simulation, we applied both approaches, single output and multi-output, to both IPCW-PO and unweighted PO.
The six cases are as follows:
Case 1. The true survival time was generated from a proportional hazards model. T was generated with hazard function
λ_T(t|y, X) = λ_{T,0}(t) exp{1.7y + (0.3 + 0.6 cos(y))X_10 + 0.2X_1}
where λ_{T,0}(t) = 2t. We randomly selected around 30% of observations to be right-censored at time C generated from a uniform distribution on (0, T).
Case 2. The true survival time was generated under a proportional hazards model as in Case 1, but the censoring time was generated from a proportional hazards model with baseline hazard λ_{C,0}(t) = 12t. The censoring percentage is around 20%.

Case 3. The true survival time was generated from a proportional hazards model where
λ_T(t|y, X) = λ_{T,0}(t) exp{y − 1.6 cos(y)X_10 + 0.3X_1X_10}
and λ_{T,0}(t) = 0.7t. The censoring time was generated using a gamma distribution with shape parameter equal to exp{−1.8X_10 + 1.4X_1 + 1.5X_10X_1} and scale parameter equal to y. The censoring percentage is around 43%.
Case 4. The true survival time was generated using a gamma distribution with shape parameter equal to exp{0.5y + 0.2X_10 cos(y) + 1.5X_1 + 1.2X_10}. We randomly selected 30% of observations to be right-censored at time C generated from a uniform distribution on (0, T).
Case 5. The true survival times are non-proportional hazards as in Case 4, and the censoring time was generated from a proportional hazards model with baseline hazard λ_{C,0}(t) = 0.01t. The censoring percentage is around 60%.

Case 6. The true survival times and the censoring times are both generated using gamma distributions. The shape parameter is exp{0.7y + 0.4X_10y − 0.1X_1X_10 + 0.1yX_1} and exp{3.8X_10 + 5.2X_1 − 3.3X_10X_1} for the survival and censoring time, respectively. The scale parameter was set equal to y. The censoring percentage is around 65%.
Figure 3 and Figure 4 show the simulation results. When the sample size is small (Figure 3), IPCW-PO-CNN single output had the best performance across the different cases, even in cases that did not require IPC-weights, with a median AUC above that of all other models. However, this method and its multi-output version tended to show high variability when the censoring time was generated from a non-proportional hazards model (Case 3 and Case 6). The latter behavior was expected since the censoring model is misspecified. The second-best median performance was that of the unweighted PO-CNN single output. The Cox-CNN demonstrated results similar to PO-CNN, even in the non-proportional hazards cases, something that was surprising and that we suspect is due to the sample size.
Table 3 in the appendix summarizes these variables. The breast histopathology image dataset is composed of 710 WSIs, each of which was tiled into image patches that span 512 × 512 pixels at 20X magnification. Tiling WSIs into smaller image patches and assigning the patient-level label to each image patch is a common strategy in digital pathology due to current memory constraints. Models are then fitted to these image patches in a weakly supervised manner. In order to only predict risk scores from cancer tissue regions, we deployed a cancer detection model to exclude benign tissue. Each WSI, which is associated with a patient, is linked to the clinical data.
Theoretically, multiple time points should perform as well as or better than single time points, because information is shared across the time points. Comparative results are presented in Table 1 and Table 2.
Lastly, we have not investigated tuning hyper-parameters in great detail other than the learning rate. The CNN model as well as the two implementations (single output and multi-output) can be tuned, and we think that with further hyper-parameter tuning, better performance might be achieved. In the multi-output implementation, the criterion used to combine the losses of the multiple outputs is an important hyper-parameter to tune. Our suspicion is that the poor performance seen there in the simulation is caused by a lack of tuning. Furthermore, the real application poses extra challenges. For instance, one should tune the aggregation procedure applied to get a per-slide prediction per individual. Also, one might consider weighting the loss function at each time point to account for the heterogeneity across time points. When these latter factors are tuned, higher performance can potentially be achieved. Although this is a future line of work for the authors, it in no way detracts from the fact that PO-CNN is clearly a useful alternative to the Cox-CNN model that allows for simple and easier-to-modify loss functions.

In this work, we proposed a method that uses classical pseudo-observations as the outcome in deep CNN methods to predict the cumulative incidence using images and structured clinical data. Compared with Cox regression based models, the proposed method is more flexible as it does not assume proportional hazards. The proposed method facilitates the application of deep CNN methods to time-to-event data with simple and easily modified loss functions. This work contributes to modern image-based precision medicine by providing an alternative to the Cox loss in CNN image analysis for prediction of cumulative incidence or risk before a given time point.
Declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Availability of data and materials
… is measured in days, and the vital status of the patient, i.e., whether the patient is alive/censored (status=0) or dead (status=1).
Image pre-processing
Each WSI was preprocessed before inclusion into this study. As a first step, tissue masks were generated. To this end, WSIs were down-sampled by a factor of 32 and converted to the HSV color space. Tissue masks were generated by applying a pixel-wise logical operation between a mask obtained by applying the Otsu threshold [51] to the saturation channel and one obtained by applying a cutoff of 0.75 to the hue channel. We subsequently performed morphological opening and closing to remove salt-and-pepper noise from the binary masks. WSIs were then tiled with 50% overlap at 20X magnification into image patches spanning 598 × 598 pixels, where 598 pixels correspond to 271 µm of tissue section, while discarding tiles with less than 50% tissue content. Tiles were then color-normalized using the method described in [52]. We subsequently applied a cancer detection CNN to identify cancer regions and excluded all tiles that were predicted to belong to a benign tissue region.
Furthermore, out-of-focus tiles with a low variance were excluded; the variance was computed after filtering each tile with a Laplacian.
Figure 4. Boxplots of AUC values for the prediction of the cumulative incidence at the 20th, 30th, 40th and 50th percentiles of the overall time across 100 simulated datasets of sample size 5000, using different methods: cox, CNN with Cox PH layer; po1, PO-CNN single output; po2, PO-CNN multi-output; po3, IPCW-PO-CNN single output; and po4, IPCW-PO-CNN multi-output.
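The Otsu-thresholding step of the tissue-mask generation described above can be sketched as follows. This is a minimal numpy re-implementation for illustration only; in practice a library routine such as skimage.filters.threshold_otsu would be used.

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's threshold on 8-bit pixel values: choose the cut that
    maximizes the between-class variance of the two pixel classes."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0.0
    for t in range(256):
        w0 += hist[t]                 # pixels in the lower class (<= t)
        if w0 == 0:
            continue
        w1 = total - w0               # pixels in the upper class (> t)
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

Applying the returned threshold to the saturation channel and combining the result pixel-wise with the hue cutoff mask yields the tissue mask described above.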