Computational biomarker predicts lung ICI response via deep learning-driven hierarchical spatial modelling from H&E

24 Determining which lung cancer patients are likely to respond to immune checkpoint inhibitors (ICI) 25 remains a crucial challenge. Existing FDA-approved biomarkers lack sensitivity and specificity for 26 identifying treatment candidates. To overcome this problem, we present a computational biomarker for 27 predicting ICI response directly from routine H&E stained whole slide images of the initial biopsy. To 28 achieve this, we developed an end-to-end deep learning system (EPL-GNN) that performs hierarchical 29 spatial modeling on whole slide images to learn both spatial and morphological features from 2.1 billion 30 cells and output a response score for each patient. The computational biomarker was trained and 31 evaluated on the largest reported cohort of stage 4 lung cancer patients with ICI treatment response 32 (N=583), resulting in an AUC of 0.69 and sensitivity of 91% on the independent test cohort, which 33 compares favorably to PD-L1 immunohistochemistry (IHC) with an AUC of 0.68 and sensitivity of 57%, 34 and tumor mutation burden (TMB) with an AUC of 0.62. The EPL-GNN model correctly identified 81% of 35 the patients with a negative PD-L1 IHC result as responders. Visualizations of the hierarchical spatial 36 * These authors contributed equally model revealed potential cellular patterns that correspond to ICI treatment response. In addition to the 37 increased sensitivity achieved by the EPL-GNN model, H&E-based Computational Biomarkers offer a 38 faster, less expensive, more objective and reproducible alternative or adjunct to existing IHC or 39 sequencing based biomarkers. 40


Abstract
Determining which lung cancer patients are likely to respond to immune checkpoint inhibitors (ICI) 25 remains a crucial challenge. Existing FDA-approved biomarkers lack sensitivity and specificity for 26 identifying treatment candidates. To overcome this problem, we present a computational biomarker for 27 predicting ICI response directly from routine H&E stained whole slide images of the initial biopsy. To 28 achieve this, we developed an end-to-end deep learning system (EPL-GNN) that performs hierarchical 29 spatial modeling on whole slide images to learn both spatial and morphological features from 2.1 billion 30 cells and output a response score for each patient. The computational biomarker was trained and 31 evaluated on the largest reported cohort of stage 4 lung cancer patients with ICI treatment response 32 (N=583), resulting in an AUC of 0.69 and sensitivity of 91% on the independent test cohort, which 33 compares favorably to PD-L1 immunohistochemistry (IHC) with an AUC of 0.68 and sensitivity of 57%, 34 and tumor mutation burden (TMB) with an AUC of 0.62. The EPL-GNN model correctly identified 81% of 35 the patients with a negative PD-L1 IHC result as responders. Visualizations of the hierarchical spatial 36 * These authors contributed equally model revealed potential cellular patterns that correspond to ICI treatment response. In addition to the 37 increased sensitivity achieved by the EPL-GNN model, H&E-based Computational Biomarkers offer a 38 faster, less expensive, more objective and reproducible alternative or adjunct to existing IHC or 39 sequencing based biomarkers. 40 Main 41 Immuno-oncology by immune checkpoint blockade 42 Lung cancer remains the leading cause of cancer death worldwide in 2020 [1]. Immune checkpoint 43 inhibitors (ICI) are increasingly being used in treatment protocols for non-small cell lung cancer (NSCLC). 44 While ICI therapy can lead to dramatic responses and clinical remission, clinical benefit is limited to a 45 subset of all lung cancer patients. Thus, identifying the patients who benefit from ICI therapy is among 46 the largest challenges in clinical practice. Extensive explorations for biomarkers have been undertaken 47 to identify clinical features that correspond to tumor response to ICI, spanning from demographic 48 associations to complex genomic investigations [2,3,4]. To date, the Food and Drug Administration 49 (FDA) of the United States has approved two biomarkers which are used as testing modalities for 50 predicting response to ICIs: PD-L1 protein expression by immunohistochemistry (IHC) protocol and 51 tumor mutation burden (TMB) by large panel somatic next-generation sequencing (NGS) assays [5]. 52 While these biomarkers improve response rates relative to unselected populations, they suffer from 53 challenges. PD-L1 IHC suffers from the need to expend tumor tissue for testing, high preanalytical 54 variability, and significant interobserver variability [6,7,8]. TMB requires large amounts of tumor tissue 55 with frequent technical failures and the threshold for what is considered "high" TMB has not been 56 standardized across platforms, although efforts in this regard are ongoing [9]. NGS as a methodology is 57 limited by important aspects such as a turnaround time of 10 or more days, logistical complexity of 58 transferring physical material between laboratories, technical challenges, high cost, and frequency of 59 quantity insufficient samples [10]. The challenges mentioned above in performing the currently 60 approved methodologies often prevent patients from being assessed for clinical benefit from ICI. 61

62
Computational pathology is an emerging field of technologies that is rapidly gaining clinical acceptance. 63 The FDA recently authorized the use of a computational model for clinical use in prostate cancer [11]. 64 Computational biomarkers create opportunities to overcome many of the limitations of tissue-based 65 biomarkers as they utilize existing standardized protocols, such as H&E histopathology images, and thus 66 no additional tissue is consumed. In the era of COVID-19, the FDA has cleared digital pathology image 67 viewers for clinical sign out [12]. Like the digitization of radiographic images, the digitization of 68 pathology opens up many additional use cases for digital microscopy images. For example, digital 69 pathology allows for remote viewing by histopathologists and immediate retrieval of archived cases. 70 The adoption of digital pathology opens up the opportunity for deployment of systems that can predict 71 response to therapy, which we refer to as computational biomarkers. 72 Computational models for predicting response to ICI in lung cancer 73 The current computational pathology efforts for predicting ICI response in lung cancer fall into two 74 categories: First, computational models that reproduce existing biomarkers, such as PD-L1 IHC and TMB,  75 from histopathology images [13,14]. Second, computational models that correlate clinical data directly 76 to prognosis [15,16,17]. Computational models trained to learn from manual PD-L1 IHC interpretations 77 remain subjected to the variation of staining protocols between laboratories and the subjective 78 interpretations of the pathologists who provide the labels for training [6,7,8] . More importantly, due 79 to the fact that these models are attempting to replicate manual IHC interpretation, their predictive 80 power is logically limited at the upper-bound by the limit of PD-L1 IHC expression. Likewise, models 81 trained to predict TMB suffer from the same limitation. These approaches that predict proxy biomarkers 82 instead of outcome cannot surpass the limitations of the IHC or NGS biomarkers they were trained on. Overall workflow and EPL-GNN 101 Figure 1 presents the high-level workflow of the computational system. As ICI target the interaction 102 between ligands and receptors on lymphocytes and tumor cells, previous studies suggest the correlation 103 between tumor immune microenvironment and the ICI treatment response [19,20,21]. Based on this 104 hypothesis, we first consider a computational quantitative analysis of lymphocyte-tumor cell 105 interactions on digital histology slides. A state-of-the-art deep-learning based cell nuclei detector, 106 vector-oriented confidence accumulation (VOCA) [22], is used to determine the spatial location of 107 lymphocytes and tumor cells on all H&E-stained whole slides from the patient cohort. Clusters of 108 lymphocytes and tumor cells are identified using spatial clustering, specifically the DBSCAN algorithm 109 [23], and described by their enclosing polygons. We define the intersections, which represent extensive 110 colocalization of lymphocytes and tumor cells as infiltration hotspots (Figure 2a). Two scores describing 111 slide level infiltration are calculated (c.f. methods Section). We construct cell graphs for each infiltration 112 hotspot tile: 1) each pair of cells within a Euclidian distance of 25 microns is connected by an edge., 2) 113 each cell is described by morphology features that were extracted using a self-supervised deep learning 114 model called Nuc2Vec [24]. In aggregate, EPL-GNN performs hierarchical spatial analysis based on these 115 cell graphs and the slide infiltration scores to generate a probability score for response to ICI. 116

126
EPL-GNN adopted the end-to-end part learning (EPL) framework [25], which is a recent breakthrough for 127 end-to-end WSI-outcome mapping, to learn and model a wide array of features over the entire whole 128 slide image. It has proven to be successful both on traditional tasks like cancer identification [25] as well 129 as prognosis predictions [26]. The EPL-GNN system presented here extends this model into the spatial 130 domain by replacing the encoder of EPL with a graph neural network (GNN). The GNN encoder performs 131 spatial modeling on the cell graph of hotspot tiles through graph convolutional layers and maps the 132 infiltration hotspot tiles to a feature space that represents both cell morphology and cell-cell 133 interactions. Next, EPL-GNN groups feature embeddings of all tiles from a slide into k clusters, each of 134 which represents a subtype of learned cell patterns. For each cluster, the tile nearest to the cluster 135 centroid in feature space was selected by EPL-GNN as representative for the subtype (Figure 2e). The 136 final aggregation layer integrates tile-level subtype feature vectors and the slide-level infiltration scores 137 to output the response probability. During training, the embedding vectors of all tiles belonging to a 138 cluster are pushed towards the corresponding cluster centroid for a compact subtype representations 139 and easier differentiation to other clusters. For model introspection and explainability the locations of 140 tiles belonging to each cluster can be visualized on the infiltration heatmaps and represent distinct 141 spatial patterns (Figure 2f). In summary, EPL-GNN performs a deep hierarchical spatial modeling at both 142 cellular and tissue level, aggregates a variety of information from all scales over the whole slide, and 143 predicts the ICI response. The whole system is trained end-to-end and requires only an H&E slide as 144 input to predict the response for a patient.  threshold for determining positivity of PD-L1 IHC has been established at 50%, which is the cutoff used 179 for FDA approval of pembrolizumab and atezolizumab [27,28,29]. The 50% cutoff was established 180 based on the ability of multiple observers to achieve agreement. Lowering the cutoff when manually 181 interpreting PD-L1 IHC stains to increase sensitivity of the IHC test is not practical since inter-observer 182 variability significantly goes up when reducing the threshold below 0.5 [8]. At this cutoff, PD-L1 has a 183 sensitivity of 57% on the test set. At inference stage, EPL-GNN is a deterministic system that produces 184 consistent, reproducible outputs. Using the same cutoff as PD-L1 IHC of 0.5 at the softmax score of the 185 neural network, the sensitivity of EPL-GNN is 91%. Importantly, the computational biomarker identifies 186 responders with low PD-L1 expression. Of the 43% of responders with low PD-L1 expression, EPL-GNN 187 rescued 81% of these patients (Figure 3b).

215
Interpretable spatial heterogeneity 216 Investigators have speculated on the relevance of tumor cell and immune cell interactions for patient 217 response to immunotherapies [30,31]. EPL-GNN can not only be used as a robust, highly sensitive 218 biomarker for lung ICI, but can also shed light on the spatial heterogeneity of cancer at multiple scales 219 by model introspection and visualization. 220 We first visualize the importance of each cluster estimated by absolute value of the gradients with 221 respect to the slide loss of the features as described by [25]. The model focuses on the feature vector of 222 one representative tile for each cluster in feature space ( Figure 2e) and combines the information of all 223 clusters for the final prediction. Figure 4a shows the representative tiles of ten responders and non-224 responders with high model confidence. Each column represents one slide/case, and each row stands 225 for a feature cluster. The importance of each cluster is color-coded and mapped back onto the original 226 slide (Figure 4b). 227 The importance for response prediction of each single cell nucleus can also be calculated from the 228 learned parameters of the GNN encoder (c.f. methods Section). The importance score of each individual 229 cell reflects the cumulative information from a cell graph of its vicinity rooted at the center of the 230 nucleus. It includes information about cell morphology and the interaction with the surrounding cells 231 (Figure 4c). Being able to visually represent the outputs of the model at various resolutions in situ allows 232 in-depth analysis of the computational biomarker. One can observe that non-responders have a higher 233 proportion of cluster 1 (p=0.026), while cluster 2 shows a trend towards enrichment in responders 234 (p=0.085) (Figure 4d). 235 236 The hematoxylin and eosin (H&E) stained tissue is the cornerstone of surgical pathology. Every tissue 249 reviewed by a surgical pathologist receives an H&E stain and thus H&E stained slides are the most 250 ubiquitous and standardized images generated in histopathology [32]. The ability for a computational 251 biomarker to utilize H&E stained slides allows for broad adoption at point of care as no additional 252 technical procedures are required. In contrast, models trained to predict ICI response from IHC slides 253 require cutting additional sections and applying less standardizable protocols with high pre-analytical 254 variability. 255 The challenge of ICI response prediction 256 To overcome the challenge of ICI response prediction from H&E slides, we approached the problem 257 from multiple different angles. Our initial efforts included simple hypothesis-driven cell-based spatial 258 statistics, traditional convolutional neural network (CNN) based tile classification, and explorative end-259 to-end mapping. After modest results with these techniques, we eventually converged on a hybrid of 260 hypothesis-driven and explorative hierarchical spatial modelling (EPL-GNN). 261 The driving hypothesis for modeling the relationship between tumor immune microenvironment and ICI 262 response is that patients with a high level of tumor infiltrating lymphocytes (TILs) are more likely to 263 respond to ICI treatment in multiple cancer types [19,20,21]. In our experiments, the most predictive 264 metrics are two scores describing the infiltration of lymphocytes over the whole slides: infiltration 265 degree and infiltration density (c.f. methods Section, Table 2). Next, we considered traditional 266 convolutional neural network (CNN) based tile classification. The simplest baseline is to train a CNN 267 model that maps each image tile of a slide to a response probability score, and then average their scores 268 as the response probability for the whole slide. Alternatively, one can average the feature vectors of all 269 tiles to represent the global slide feature vector which can then be mapped to the response score. Both 270 of these traditional CNN approaches didn't show strong signal for ICI response (Table 2). We also tested 271 multiple instance learning (MIL), which has proven to be tremendously successful for cancer detection 272 on datasets with tens of thousands of patients [33]. However, we could not achieve a performance 273 comparable to other methods for ICI response prediction with MIL (Table 2). We thus concluded that 274 the local morphological features carried in single tiles alone do not contain adequate information to 275 predict patient outcome of ICI treatment. 276 To overcome these deficiencies, we explored the EPL framework [25] for end-to-end modeling on 277 diverse features over whole slides. The original EPL model uses a CNN as encoder to learn subtypes of 278 morphological features. While this approach improved beyond slide-wide infiltration features, it is 279 known that CNNs are suboptimal for learning Cartesian, spatial features [34]. Motivated by the 280 hypothesis that the relevant features for ICI response are likely carried by lymphocyte-tumor cell 281 interactions, we replaced the EPL encoder with a graph neural network (GNN) which learns local spatial 282 features on cell graphs. Finally, the two scores describing global infiltration were concatenated to the 283 slide embedding followed by the EPL aggregator. This hybrid approach of hypothesis-driven and 284 explorative hierarchical spatial modelling displayed significant improvement over all competing methods 285 listed in Table 2. 286 Research in computational pathology often focuses on diagnostic problems, including the detection and 293 classification of cancer [33], cancer grading [35], or metastasis analysis [36]. These approaches use 294 human assessment as gold standard and produce a computational replacement or adjunct tool. 295 Alternatively, predictive computational biomarkers are trained against clinical outcome as gold standard, 296 for example response to treatment in case of ICIs. However, for tasks such as ICI response prediction, 297 the location and nature of predictive features are unknown to pathologists. An approach that is able to 298 model on a wide array of information over the whole slide is a prerequisite. 299 The prediction function for cancer identification in literature can be summarized as a weighted average 300 of tumor features in tiles: ∑ , where is the learnable linear weights, and is the 301 feature in tile . EPL-GNN describes a slide in terms of subtypes of features, and maps from the 302 concatenation of these features to the response: )), where 303 represents a multi-layer perceptron aggregator, and is the activation function for the output. This 304 approach enables end-to-end explorative studies for tasks in which the predictive features are unknown 305 a priori. 306 The largest reported cohort for lung ICI response 307 The patient cohort presented here is the largest reported dataset for lung ICI response prediction based 308 on RECIST criteria to-date. There are multiple challenges of obtaining a large and relevant dataset for 309 this study. These include the identification of patients who meet inclusion criteria (c.f. Data Section) and 310 have response to ICI documented in a standard protocol such as RECIST. Likewise, these patients must 311 have a pre-treatment biopsy and long-term follow-up. While we consider the size of the cohort a 312 strength of the study, the patients were all treated at the same academic medical center. Multiple 313 attempts were made to identify external cohorts, but no satisfactory dataset of any significant size could 314 be obtained. With growing adoption of ICIs in clinical practice, we hope to overcome this challenge in 315 future work. 316

317
We present the first computational biomarker to predict ICI response directly from H&E images. The 318 system outperforms the FDA-approved biomarkers for predicting response to ICI therapy in terms of 319 cost, speed, reproducibility and increased sensitivity. This was achieved by curating the largest cohort of 320 stage 4 lung cancer patients with ICI treatment and RECIST response quantification. The novel deep 321 learning architecture for spatial pathology was developed to analyze more than 2 billion lymphocytes 322 and cancer cells. Given that the system relies only on standard H&E images, this computational 323 biomarker could be easily implemented in clinical practice to overcome many of the limitations of the 324 current FDA-approved biomarkers. 325 Method 326 Data 327 Patients were selected for metastatic NSCLC treated with PD-L1 blockade-based immunotherapy 328 between 2013-2019 for the discovery and validation cohort. In both cohorts, those treated with 329 combination therapy of PD-(L)1 and chemotherapy were excluded. Inclusion criteria were digitized H&E 330 images as well as available outcome data from their response to PD-(L)1 therapy. Objective overall 331 response was determined by RECIST, performed by a blinded thoracic radiologist. Patients who did not 332 progress were censored at the time of their last available imaging assessment. Clinical data was locked 333 as of 09/2019. All handling of slide image data was through a digital slide viewer developed at MSK [37]. 334 Lymphocyte and tumor cell detection with VOCA 335 We used a state-of-the-art cell nuclei detection method called vector-oriented confidence accumulation 336 (VOCA) [22] for the detection of lymphocytes and tumor cells. Over 6000 lymphocytes and tumor cells 337 were manually annotated on a software called "MSK Slide Viewer" [38] and used as the ground truth to 338 train the VOCA model. The final model was used to for detection of more than 2 billion cells on the 339 whole cohort. The threshold of filtering out false positive detections was also learned automatically by 340 the model. The coordinates of each lymphocyte and tumor cell were saved as the input for further 341 feature extraction. 342 Cell feature extraction with Nuc2Vec

343
For each detected nucleus, we extracted the surrounding image patch of size 64x64 pixels, rescale it to 344 128x128, and did a forward pass through a resnet-34 model pretrained with the Nuc2Vec method on 345 slides from the MSKCC-IMPACT patient cohort, to obtain a 128-dimenstional embedding vector. Briefly, 346 Nuc2Vec utilizes a contrastive learning method to learn vector embeddings of local image patches 347 centered around each nucleus, such that the Euclidean distance in the embedding space is 348 approximating a metric of the morphological similarity between nuclei. Thus, the embedding vector 349 represents the morphology of each nucleus. For a detailed description of the method, training dataset, 350 and quality of the learned vector embeddings, we refer to [24]. 351

352
Using the detection results of VOCA, we applied DBSCAN [23] to identify cell clusters, represented as 353 polygons, of both tumor cells and lymphocytes. For lymphocytes, we consider cells belonging to the 354 same cluster if they are located less than 40 pixels (20 microns) from each other, while 50-pixel was 355 chosen for tumor cells. The polygons generated by DBSCAN represent cell clusters, and the intersection 356 between the lymphocyte clusters and tumor cell clusters are considered potential infiltration regions. In 357 addition to the local spatial configuration of cells, we calculated two scores capturing infiltration 358 information at the whole slide level. One is the infiltration degree: 359 This score describes the degree of infiltration over the whole slide defined as the ratio between the area 360 of lymphocyte-tumor cell cluster intersection and the area of tumor cell clusters. The second score is: 361 The infiltration density counts the number of lymphocytes within the lymphocyte-tumor cell cluster 362 intersection and divides it by the area of tumor cell clusters. Experiments of using these two scores 363 alone for ICI response prediction showed a predictive signal for ICI (c.f. Table 2). 364

Graph Neural Networks (GNN) of cells 365
From all infiltration hotspots we sampled image tiles of size 112x112 microns without overlap. The cell 366 graphs on these image tiles are used as input to the GNN encoder. To construct the cell graphs, the 367 nodes represent the detected cells and are described using the feature representation from Nuc2Vec. 368 Finally, cell graphs are constructed as radius graphs [39] where all co-occurrences within 25 microns are 369 modeled as an edge in the graph. 370 GNN encoder architecture 371 We constructed a GNN encoder based on graph isomorphism networks (GIN) [40]. The model contains 372 four GIN layers, each is followed by a top-K pooling layer [41]. The output feature of each cell is the 373 concatenation of the node features after each convolution, and the graph readout is the sum of all 374 concatenated node features. The importance scores for each node at each layer are learned by the 375 top-K pooling function. Also, the layer attribution can be estimated by the gradient as in the original 376 EPL paper [25].Therefore, the node importance score is calculated as: 377 The GNN encoder gradually pools important cell patterns for the response prediction and encodes the 379 cellular information of each infiltration tile, then multiple groups of these encoded cellular patterns are 380 represented by their centroid in feature space and aggregated. Along with the quantitative slide level 381 infiltration scores, these graphical deep features can produce a slide response score. Details of the 382 overall framework architectures and training strategy are described in [25]. 383 EPL framework for end-to-end WSI outcome mapping 384 End-to-end part learning [25] models a WSI as consisting of groups of image tiles of similar features. 385 Each group of tiles form a cluster in feature space, and the one tile nearest to the centroid is used as the 386 representative tile for that cluster. feature vectors, each of length are concatenated and fed forward 387 to the aggregation module, which can be multiple fully connected layers, to generate the output score. 388 Meanwhile, the tile features of each cluster are also pushed towards the centroid to ensure that 389 representative tile can approximate the common features of the cluster. In general, this framework 390 enables the end-to-end WSI-outcome mapping, and theoretically can be applicable for any slide-wide 391 labels (c.f. [25] for details). 392