Utilizing Electronic Health Records (EHR) and Tumor Panel Sequencing to Demystify Prognosis of Cancer of Unknown Primary (CUP) patients

Cancer of unknown primary (CUP) is a type of cancer that cannot be traced back to its original site and accounts for 3-5% of all cancers. It does not have established targeted therapies, leading to poor outcomes. We developed OncoNPC, a machine learning classifier trained on targeted next-generation sequencing data from 34,567 tumors from three institutions. OncoNPC achieved a weighted F1 score of 0.94 for high confidence predictions on known cancer types (65% of held-out samples). When applied to 971 CUP tumors from patients treated at the Dana-Farber Cancer Institute, OncoNPC identified actionable molecular alterations in 23% of the tumors. Furthermore, OncoNPC identified CUP subtypes with significantly higher polygenic germline risk for the predicted cancer type and significantly different survival outcomes, supporting its validity. Importantly, CUP patients who received first palliative intent treatments concordant with their OncoNPC-predicted cancer sites had significantly better outcomes (H.R. 0.348, 95% C.I. 0.210 - 0.570, p-value 2.32 × 10⁻⁵). OncoNPC thus provides evidence of distinct CUP subtypes and offers the potential for clinical decision support for managing patients with CUP.

[...]: 0.789 and 0.791, respectively). Across 10 cancer groups (grouped by sites and treatment options; Table 1), OncoNPC achieved an overall weighted F1 score of 0.824 (weighted precision and recall: 0.829 and 0.826, respectively). Despite the evident class imbalance across cancer types, OncoNPC showed well-balanced precision across the cancer types (Fig. 2a) and cancer groups (Fig. 2b). Thresholding on prediction confidence ($p_{\max}$, the maximum posterior probability across all labels) further increased the performance: weighted F1 score of 0.830 with 91.6% remaining samples at $p_{\max} \geq 0.5$ and 0.942 with 65.2% remaining samples at $p_{\max} \geq 0.9$ (Fig. 2c, 2d). While rarer cancer types had generally lower overall performance, increasing the $p_{\max}$ threshold reduced this difference between common/rare cancer types (Fig. 2c, 2d). At $p_{\max} \geq 0$, common cancer types in the upper [...] Table 1). The OncoNPC perfor[...]

Applying OncoNPC to CUP tumor samples

We applied OncoNPC to classify 971 CUP tumors from patients who were admitted to DFCI and [...] S1b). This demonstrates that OncoNPC is resistant to making overconfident wrong predictions. Despite the slightly lower prediction confidence, over half of the CUP tumors (518 out of 971) could still be classified with high confidence (i.e., prediction probability > 0.8), and multiple clas[...]

Germline PRS-based validation on CUP tumor samples

We hypothesized that, if OncoNPC was accurately identifying latent primary cancers, the classified CUP cancer types would exhibit increased germline risk for the corresponding cancers. To that end, [...] counts; 9.52%) and fusions (7 counts; 3.33%) as shown in Fig. 5a. The four most frequent oncogenic mutations were in PIK3CA, KRAS, ALK, and ERBB2 genes, occurring in CUP tumor samples classified as BRCA (PIK3CA and ERBB2 genes) and NSCLC (KRAS, ALK, and ERBB2 genes).
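The confidence thresholding reported in the performance results above (keeping only samples whose maximum posterior probability, p_max, reaches a cutoff, then recomputing the weighted F1 score) can be sketched as follows. This is a minimal illustration and not the OncoNPC implementation: the function names, the toy labels, and the toy probabilities are all assumptions made for the example, and the weighted F1 is computed as the support-weighted mean of per-class F1 scores.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


def evaluate_at_threshold(y_true, probs, labels, threshold):
    """Keep samples with p_max >= threshold.

    Returns (weighted F1 on the kept samples, fraction of samples kept).
    """
    kept_true, kept_pred = [], []
    for truth, row in zip(y_true, probs):
        p_max = max(row)  # maximum posterior probability across labels
        if p_max >= threshold:
            kept_true.append(truth)
            kept_pred.append(labels[row.index(p_max)])
    if not kept_true:
        return float("nan"), 0.0
    return weighted_f1(kept_true, kept_pred), len(kept_true) / len(y_true)


# Hypothetical toy data: three labels, four tumors with known cancer types.
labels = ["NSCLC", "BRCA", "OTHER"]
y_true = ["NSCLC", "BRCA", "NSCLC", "BRCA"]
probs = [
    [0.95, 0.03, 0.02],  # confident and correct
    [0.10, 0.85, 0.05],  # confident and correct
    [0.40, 0.45, 0.15],  # low confidence, wrong
    [0.30, 0.60, 0.10],  # moderate confidence, correct
]

f1_all, kept_all = evaluate_at_threshold(y_true, probs, labels, 0.0)
f1_high, kept_high = evaluate_at_threshold(y_true, probs, labels, 0.8)
print(f"p_max >= 0.0: F1={f1_all:.3f}, kept={kept_all:.0%}")
print(f"p_max >= 0.8: F1={f1_high:.3f}, kept={kept_high:.0%}")
```

In this toy example, raising the threshold discards the one misclassified low-confidence sample, so the F1 score rises while the retained fraction falls, which is the same trade-off the reported results describe.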

Overall, among the eligible CUPs whose prediction confidences are greater than 0.5 (N = 794; see

We note that, as this was not a randomized analysis, a potential concern may be systematic [...] Table 2 for comparison of the two groups across the measured covariates). To minimize biases from potential confounders and move towards a predictive estimate of treatment concordance on patient survival, we adopted two estimation strategies: multivariable Cox regression [32] (i.e., [...] prediction uncertainty (i.e., entropy of predicted probability distribution over the considered cancer types) were inversely associated with receiving concordant treatment (coefficient -1.259, 95% C.I. [...]
[...]. As a true positive reference, we repeated the above procedure for the CKP tumor samples. [...] excluded patients with CUP that were lost to follow-up at the time of tumor sequencing and those whose primary cancer types were predicted with low probability (see Supplementary Fig. S6). The [...]

As we were interested in the counterfactual causal impact of the OncoNPC-treatment concordance, we utilized the principles of causal inference to account for potential patient heterogeneity and confounding. Specifically, we estimated the effect of treatment concordance specified by the indicator variable, $A$, which was 1 when the first palliative treatment for a patient with CUP was concordant with the corresponding OncoNPC prediction and 0 otherwise. Our analyses make the following identifiability assumptions: [...] It means that given patient $i$'s set of covariates $X_i$, the patient's treatment concordance $A_i$ is as good as random.

• Consistency: $T_i^{a_i} = T_i$, which means that a counterfactual outcome $T_i^{a_i}$ for patient $i$ is the observed outcome for the patient with a treatment concordance $a_i$.

In addition to the above identifiability assumptions, we made the independent censoring assumption (i.e., $C_i \perp\!\!\!\perp T_i \mid X_i$) and the independent entry assumption given the covariates (i.e., $E_i \perp\!\!\!\perp T_i \mid X_i$).

We adopted two different estimation strategies to obtain the impact of treatment concordance: [...]. In this formulation, each individual is weighted by the corresponding IPTW, $w_i$, and we obtained [...]

In the Cox proportional hazard regression framework, we estimated the hazard function of patient $i$ as follows: $\lambda(t \mid A_i, X_i) = \lambda_0(t)\exp(\alpha A_i + \beta^{T} X_i)$, where $\alpha, A_i \in \mathbb{R}$ and $\beta, X_i \in \mathbb{R}^m$ ($m$ is the number of measured confounders). Under the above identifiability assumptions and validity of the estimation model, $e^{\alpha}$ is the hazard ratio capturing the causal effect of the treatment concordance $A$. Finally, under the assumption of no ties between event times across the patients, the parameters $\alpha$ and $\beta$ are estimated by maximizing the following partial likelihood [...]

(h) A subset of CUP patients with detailed treatment data were evaluated for treatment-specific outcomes.

[...] Table 1). The sensitivity for each cancer type or cancer group is shown below each confusion matrix and the sample size is shown to the left of each confusion matrix.
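For reference, the textbook form of the Cox partial likelihood that matches the hazard model above, under the stated no-ties assumption, is sketched below. It is the standard expression and not necessarily the exact formula used in the original analysis. The observed event time $t_i$, the event indicator $\delta_i$ (1 if the event was observed, 0 if censored), and the risk set $R(t_i)$ (patients under observation and still at risk just before $t_i$) are notation introduced here for illustration.

```latex
L(\alpha, \beta)
  = \prod_{i \,:\, \delta_i = 1}
    \frac{\exp\left(\alpha A_i + \beta^{T} X_i\right)}
         {\sum_{j \in R(t_i)} \exp\left(\alpha A_j + \beta^{T} X_j\right)}
```

The baseline hazard $\lambda_0(t)$ cancels in each ratio, which is why $\alpha$ and $\beta$ can be estimated without specifying it.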