Extraction of genomic features
To develop CUPLR, we constructed a harmonized dataset from two large pan-cancer WGS datasets from the Hartwig Medical Foundation (Hartwig) and Pan-Cancer Analysis of Whole Genomes consortium (PCAWG). The raw sequencing reads were analyzed with the same mutation calling pipeline to construct a catalogue of uniformly called simple and complex mutations. The harmonized dataset consisted of tumors from 6820 patients across 33 different cancer types (Figure 1a). In contrast to many previously published papers [4, 5, 12, 13], this dataset includes a large proportion of samples taken from metastatic lesions, which is relevant for TOO classification as CUP samples are by definition from patients with metastatic cancer.
A wide range of features (n=4122) were extracted for classifying cancer type based on driver/passenger and simple/complex mutations (Figure 1c). First, we determined the presence of gain of function (amplifications and activating mutations) and loss of function (deep deletions and biallelic loss) events in 205 cancer associated genes. These genes were selected based on having enrichment of gain and/or loss of function events in at least one cancer type (see methods). Second, we calculated the mutational load of single base substitutions (SBS), double base substitutions (DBS), and indels for each sample. Third, we determined the number of contributing mutations to the SBS, DBS and indel signatures from the COSMIC catalog [14]. Fourth, the number of SBSs in each 1Mb bin across the genome (n=3071) was calculated to determine the RMD [17]. Mutational signatures and RMD were normalized by the mutational load of the respective mutation type to account for differences in mutational load across samples. Fifth, for each sample we determined the copy number change of each chromosome arm relative to the genome ploidy [18]. Sixth, gender was determined based on sex chromosome ploidy [19]. Lastly, we parsed the called simple and complex structural variants to determine: (i) the SV load per sample; (ii) the number of large deletions and duplications stratified by size, as well as the number of long interspersed nuclear element (LINE) insertions, normalized by the total number of SVs; (iii) the presence of complex SV events; and (iv) the presence of gene fusions and viral sequence insertions [19, 20].
Classifier training
The extracted genomic features were then used to develop CUPLR, a classifier consisting of two components (Figure 1d). Firstly, an ensemble of binary random forest classifiers each discriminates one cancer type versus other cancer types (i.e. one-vs-rest). We chose to use an ensemble of binary classifiers as opposed to one multiclass classifier so that feature selection could be performed per cancer type, since different features are important for each cancer type. Additionally, we chose to use random forests over other algorithms (e.g. neural networks) as they can natively handle different feature types (continuous, Boolean, categorical, etc.) without requiring feature values to be scaled, which also improves model interpretability. The second component of CUPLR is an ensemble of isotonic regressions to calibrate the probabilities produced by each random forest. Random forests tend to be overconfident at probabilities towards 0 and underconfident at probabilities towards 1, and this bias varies between random forests [21]. The calibration we have performed here ensures that probabilities are comparable between random forests. Furthermore, calibration allows for the probabilities to have the following intuitive interpretation: a probability of e.g. 0.8 means that there is an 80% chance of a prediction being correct (this relationship does not hold for raw random forest probabilities).
We used 6139 samples for training and held out 681 samples as an independent test set, with both having the same cancer type proportions. Training of the main random forest ensemble involved several steps. Briefly, due to the sheer number of RMD bins (3071) as well as their sparsity, non-negative matrix factorization (NMF) was for each cancer type performed on the RMD bins to reduce these to 44 cancer type-specific RMD profiles. Then, for each cancer type, univariate feature selection was performed (to remove irrelevant features) with 502 features ultimately being selected (219 numeric and 283 boolean; Supplementary data 3). This was followed by class resampling (to alleviate imbalances in the number of samples for each cancer type), and subsequently training of the binary random forest itself (Supplementary figure 1). The above training procedure was applied to all samples of the training set to produce the final random forest ensemble. The random forest ensemble training procedure was then subjected to stratified 15-fold cross-validation to obtain cancer type probabilities for the training samples. These probabilities were then used to train the ensemble of isotonic regressions for calibrating the random forest probabilities (Figure 1b).
Performance of CUPLR
To assess the performance of CUPLR, we used the cancer type predictions based on the isotonic regression calibrated cross-validation probabilities, as well as the predictions upon applying CUPLR to the held out test set (Figure 1b). Both training (n=6139) and test sets (n=681) had the same cancer type proportions. Based on cross-validation predictions (Figure 2), CUPLR could classify all samples with an accuracy (i.e. fraction correctly classified) of 91%. Similar performance was observed for the test set across cancer types, with an accuracy of 89% across all samples (Supplementary figure 5). Certain cancer types show differences in test and CV accuracy which was due to low sample sizes (Supplementary figure 6).
High misclassification rates for certain cancer types could likely be explained by shared cancer type characteristics (Figure 2a). This could be due to a common developmental origin, such as with Uterus being misclassified as Ovary (12%) due to both being gynecological cancers [22], and Biliary being misclassified as Pancreas (31%) and Liver (8%) due to being cancers of the foregut [23, 24]. Cancer subtypes were also often misclassified as other subtypes, which was the case for Lung_SC (small cell lung cancer) towards Lung_NSC (non-small cell lung cancer) (37%); Kidney_Papillary towards Kidney_ClearCell (48%); and Sarcoma_Leiomyo (46%) and Sarcoma_Lipo (10%) towards Sarcoma_Other (sarcomas other than leiomyosarcoma, liposarcoma or gastrointestinal stromal tumors). Lastly, Gastrointestinal_NET (gastrointestinal neuroendocrine tumors) was occasionally misclassified as Pancreas_NET (pancreatic neuroendocrine tumors) (12%) which may, at least partially, reflect cancer type mis-annotation amongst these samples due to neuroendocrine tumors having similar morphological features and shared anatomical location [25].
Thus far, we have assessed performance based on whether the highest probability cancer type is the correct cancer type (i.e. top-1 accuracy; Figure 2a,b). However, if we consider whether the correct cancer type is amongst the top-2 highest probabilities (i.e. top-2 accuracy; Figure 2c), overall accuracy increases from 91–95%, with the greatest increases being for the cancer subtypes including Lung_SC (57–88%), Kidney_Papillary (48–83%), and Sarcoma_Leiomyo (46–89%). A large gain in accuracy was also observed for Biliary (41–80%) which was often misclassified as Pancreas. Similar increases in accuracy were seen based on predictions on the held out test set (Supplementary figure 7). The top-2 (and even top-3) probabilities of CUPLR are particularly useful for differential diagnosis purposes to narrow down potential TOOs when routine diagnostics are not fully conclusive.
Added predictive value of SV related features
When examining the most important feature types from each random forest within CUPLR (Figure 3a), RMD profiles (‘rmd’) were consistently the most predictive of cancer type (in line with the findings from Jiao et al. 2020 [4]), as well mutational signatures (‘sigs’) including those with known cancer type associations such as SBS4 (associated with smoking [14]) in lung cancer (Figure 3b). As these mutation features are derived from genome-wide SBSs and indels, we assessed whether the presence of certain confounding factors that affect the SBS and indel genomic landscape (including DNA repair deficiencies [26, 27], chemotherapy treatment [28, 29], smoking history [30]) may lead to more incorrect predictions. However, these confounding factors showed minimal impact on classification performance (Supplementary notes, Supplementary table 1, Supplementary figure 12).
In addition to RMD profiles and mutational signatures, gender (as expected) was highly important for predicting cancers of the reproductive organs including breast, cervical, ovarian, prostate, and uterine cancer (Figure 3b, Supplementary figure 8). Notably, SV related features were important for classifying certain cancer types (Figure 3b). This included known cancer type specific gene fusions such as TMPRSS2-ERG in Prostate [31], KIAA1549-BRAF in CNS_PiloAstro (pilocytic astrocytomas) [32], CTNNB1-PLAG1 and MYBL1-NFIB in HeadAndNeck_SG (salivary gland cancer) [33, 34], and FUS-DDIT3 in Sarcoma_Lipo (liposarcoma) [35]. Of note, while EML4-ALK fusions are known to be specific to non-small cell lung cancer [36], this fusion does not appear amongst the top 15 most important features in Figure 3b since it has low feature importance (Supplementary data 4) due to only occurring in a minor proportion of non-small cell lung cancer samples (Supplementary figure 11). We also find known viral DNA integrations as important features, including human papillomavirus (viral_ins.HPV) in Cervix [37] and Merkel cell polyomavirus (viral_ins.MCPyV) in Skin_Other (non-melanoma skin cancer) [38]. Lastly, the number of SVs in the largest complex SV cluster (sv.complex.largest_cluster) which we use as a proxy for the presence of chromothripsis was also predictive for liposarcomas, which are known to frequently harbor chromothripsis events. However, in contrast to the features mentioned above, the presence of chromothripsis alone is not sufficient for classifying a tumor as liposarcoma as chromothripsis is also highly prevalent in other tumor types [16].
To further assess the added value of SV related features, we excluded entire feature types from the training and examined the decrease in classifier performance (Supplementary figure 9). Indeed, removing gene fusion features (‘fusion’) resulted in a large drop in accuracy for HeadAndNeck_SG (salivary gland cancer; 43–23%) and Sarcoma_Lipo (liposarcoma; 80–63%), with removal of the viral integration features (‘viral_ins’) leading to decreased accuracy for Cervix (88–65%). Exclusion of simple and complex SV features (‘sv’) led to lower accuracy for Sarcoma_Lipo (80–73%) as well as Sarcoma_Leiomyo (leiomyosarcoma; 50–45%), likely due to the presence of chromothripsis (sv.complex.largest_cluster; Figure 3b) being important for the separation of these two cancer types, which are often misclassified as each other (Figure 2a). Similarly, removal of the simple and complex SV features (‘sv’) resulted in a large drop in accuracy for Lung_SC (small cell lung cancer; 53–41%) (Supplementary figure 9), likely because 1-10kb deletions and 10-100kb duplications distinguish Lung_NSC from Lung_SC (Figure 3, Supplementary figure 8).
When compared to existing WGS-based CUP classifiers (Supplementary figure 10), CUPLR also showed improved performance for cancer types where SV related features were important, including for CNS_PiloAstro, Lung_NSC, and Prostate. While Sarcoma_Lipo had no comparable cancer type in existing classifiers, prediction of all sarcoma subtypes by CUPLR (except for Sarcoma_Leiomyo) outperformed osteosarcoma prediction by PCAWG consortium classifier [4] (PCAWG-NN; Bone-Osteosarc) and sarcoma prediction from the Salvadores et al. [5] classifier (Salvadores-SVM; SARC). Overall, CUPLR performs similarly for the remaining cancer types (except for head and neck, myeloid and pancreatic neuroendocrine and thyroid cancers) when compared to existing classifiers, with CUPLR also being able to classify more cancer (sub)types.
In summary, the incorporation of all feature types resulted in the best performance, with SV related features being highly important for specific cancer types. To our knowledge, we are the first to show the added value of SV related features for cancer type classification.
Graphical prediction report
Aside from cancer type probabilities, CUPLR also outputs explanations as to which features support these probabilities. These allow one to verify the predictions based on existing knowledge, and could be included in diagnostic reports to support decision making, e.g. in molecular tumor boards. The feature explanations are based on feature contribution calculations which enable feature importance determination at the sample level (rather than at the cohort level as in Figure 3). Specifically, feature contributions represent the total gain (or loss) in probability upon passing a feature through a random forest [39]. To ease the interpretation of CUPLR’s output, we have implemented a graphical report (Figure 4) which can be generated per patient that shows the cancer type probabilities (left panel), feature contributions for the top features for the top cancer types (middle panel). Also shown are the corresponding feature values in the patient in relation to the average feature value amongst patients in the training set (right panel).
We will use patient DO35949 as an example demonstration of the graphical report (Figure 4a). Since the CNS_PiloAstro probability (1.00) was much higher than the probability of other cancer types, only one cancer type (i.e. panel row) is shown, with up to 3 panel rows being shown when more than one cancer type probability is high (such as for patient HMF004199A; Figure 4b). The top feature for DO35949 was the presence of a KIAA1549-BRAF fusion (middle panel) which is on average found in 74% of CNS_PiloAstro patients (blue label), but 0% in all other patients (grey label). Since this feature is of Boolean type, a feature value of 1 (red label) indicates the presence of the KIAA1549-BRAF fusion in DO35949 (whereas a feature value of 0 would indicate absence). On the other hand, the mutational load of SNVs, DBSs and indels (mut_load.snv, mut_load.dbs, mut_load.indel) is on average lower in DO35949 (red label) and CNS_PiloAstro patients (blue label) compared to all other patients (grey label). Conversely, the proportion of 1-10Mb duplications (sv.DUP_[1e+06,1e+07]), is on average higher in DO35949 (red label) and CNS_PiloAstro patients (blue label) compared to all other patients (grey label). For all features, a red arrow (going right or left)highlights whether the feature value in DO35949 is higher or lower (respectively) than that in all other patients. Additionally, this relationship is also indicated in text in the feature contribution panel (middle panel) for non-Boolean features.
Patient HMF004199A (Figure 4b) is an example of a situation where more than one cancer type probability is high, with the probability of Lung_NSC (non-small cell lung cancer) being 0.84 and Lung_SC (small cell lung cancer) being 0.75 (Figure 4b). Here, two feature panels are shown for both of these cancer types to aid with resolving this uncertainty. Since RB1 loss is specific to small cell lung cancer and rarely occurs in non-small cell lung cancer [40, 41], it is more likely that the correct predicted cancer type is Lung_SC and not Lung_NSC. Indeed, this patient was diagnosed with small cell lung cancer.
This graphical report presents the detailed machine learning-based classification output of CUPLR in a human readable format. We acknowledge that the output is highly detailed, which is inevitable due to the large amounts of data used by the algorithm. However, as these details may not be necessary in all circumstances, we have implemented the option in the software to hide the feature contribution and/or feature value panels in the final graphical output.
Feature contributions aid in clarifying the primary tumor location of CUPs
To showcase how CUPLR could be used in a real life clinical setting, we applied CUPLR to 149 tumors for which primary tumor location was unknown from the Hartwig dataset, and examined the top predicted cancer type together with the top contributing features for each sample (Supplementary data 2). Of the 75 samples with a probability ≥0.8 for the top cancer type, the cancer type for 29 of these samples could be confidently determined purely based on these samples exhibiting features with well-known cancer type associations. This included 22 samples with signatures of smoking (SBS4 and ID13 [14]) and predicted as Lung_NSC, 1 sample with signatures of ultraviolet radiation (SBS7, DBS1 [14]) exposure and predicted as Skin_Melanoma, 5 samples with APC mutations and predicted as Colorectum [42], and 1 sample with a TMPRSS2-ERG fusion [31] and predicted as Prostate.
An additional 15 of the 75 samples could be confidently annotated based on having known cancer type specific features in conjunction with having a high contribution of the RMD profile of the predicted cancer type. For example, KRAS mutations most commonly occur in pancreatic cancer but are also frequent in other cancer types such as colorectal or lung cancer (Supplementary figure 11) [43]. For 3 samples with a KRAS mutation and predicted as Pancreas, the top contributing feature was rmd.Pancreas.1 (RMD profile of pancreatic cancer) thereby giving us confidence that the Pancreas classification was correct. Likewise, in conjunction with the respective RMD profile, 10 samples could be clarified as Gastric based on the presence of the signature of ROS damage (SBS17) [14], LINE insertions [44], and/or fragile site deletions [45]; 1 sample as HeadAndNeck_Other based on having HPV integration [46]; and 1 sample as Uterus based on having a PIK3R1 mutation [47] (Supplementary data 2). For the remaining 31 of the 75 samples, no known cancer type specific features could be identified, although the RMD profiles were the primary contributors of the top cancer type probability for these samples. Most of these samples were predicted as cancer types with highly specific RMD profiles respective to each cancer type, including Breast, Colorectum, Liver, Kidney_ClearCell, Ovary, and Urothelial (Supplementary data 2, Supplementary figure 11), and classification of cancer types achieved ≥0.9 accuracy (Figure 2a). The majority of these samples are thus likely correctly annotated, although additional evidence would be required for validation and final diagnosis. This could for example be based on histopathological examinations, but these were unfortunately not available for the retrospectively analyzed samples that were included here. However, such information would be readily available in routine diagnostics.