Dataset preparation
We used 2 sources of data in this work. The largest one is 8085 samples from TCGA were collected and further split into training dataset and test dataset. For independent validation, another dataset from GEO was obtained. The GEO datasets containing metastatic 42 samples were collected by searching the keywords for each cancer on the GEO website (see Table S1 for more detail). Samples were prepared as described in Materials and Methods and RNA-seq data were prepared for the predictions by the trained random forest model. All datasets described here were summarized in Fig. 1.
First, 8085 RNA-seq profiles of samples in TCGA [33] covering 21 main cancers (all cancer abbreviations are supplied in Table 1) were collected. The TCGA data were quantified by RSEM method. In TCGA, lung cancer was further classified as lung squamous cell carcinoma and lung adenocarcinoma since they are classified as two major tumors in TCGA. We also merged the two cancers, colon adenocarcinoma (COAD) and rectum adenocarcinoma (READ) to COADREAD, since they are similar to each other [34]. T-SNE plot for COAD and READ also showed that they can hardly be distinguished (Fig. 2a). Table 1 showed the distribution of the collected data of all cancers. The TCGA dataset contains 7713 primary tumor samples and 372 metastatic tumor samples. We checked the distribution of each cancer (shown in Table 1), and found that 352 (94.62%) metastatic tumor samples belong to Skin Cutaneous Melanoma (SKCM), which is much higher than that in the primary tumor samples (1.04%). Due to the distinct distribution of the primary and metastatic tumor, the SKCM samples were removed from the data in case it would affect the results, leaving 7633 samples covering 19 cancers in the primary tumor sample dataset and 20 metastatic tumor samples covering 5 cancers (BRCA, CESC, COAD, HNSC and THCA). The 20 metastatic tumor samples were used as an independent validation dataset.
Table 1
The number of patients per cancer from TCGA database.
Cancer
|
Amount
|
Percentage
|
Full name
|
BLCA
|
301
|
3.90%
|
Bladder Urothelial Carcinoma
|
BRCA
|
1056
|
13.69%
|
Breast invasive carcinoma
|
CESC
|
258
|
3.35%
|
Cervical squamous cell carcinoma and endocervical adenocarcinoma
|
COAD
|
451
|
5.85%
|
Colon adenocarcinoma
|
GBM
|
153
|
1.98%
|
Glioblastoma multiforme
|
HNSC
|
480
|
6.22%
|
Head and Neck squamous cell carcinoma
|
KIRC
|
526
|
6.82%
|
Kidney renal clear cell carcinoma
|
KIRP
|
222
|
2.88%
|
Kidney renal papillary cell carcinoma
|
LAML
|
173
|
2.24%
|
Acute Myeloid Leukemia
|
LGG
|
439
|
5.69%
|
Brain Lower Grade Glioma
|
LIHC
|
294
|
3.81%
|
Liver hepatocellular carcinoma
|
LUAD
|
486
|
6.30%
|
Lung adenocarcinoma
|
LUSC
|
428
|
5.55%
|
Lung squamous cell carcinoma
|
OV
|
261
|
3.38%
|
Ovarian serous cystadenocarcinoma
|
PAAD
|
142
|
1.84%
|
Pancreatic adenocarcinoma
|
PRAD
|
379
|
4.91%
|
Prostate adenocarcinoma
|
READ
|
153
|
1.98%
|
Rectum adenocarcinoma
|
SKCM
|
80
|
1.04%
|
Skin Cutaneous Melanoma
|
STAD
|
415
|
5.38%
|
Stomach adenocarcinoma
|
THCA
|
500
|
6.48%
|
Thyroid carcinoma
|
UCEC
|
516
|
6.69%
|
Uterine Corpus Endometrial Carcinoma
|
Among the 7633 primary tumor samples, 7579 were frozen samples and 49 were Formalin-Fixed Paraffin Paraffin-embedded (FFPE) samples and 5 samples were ambiguous since they were labeled different types in their sample or portion information. Since FFPE samples sustained from RNA degradation and chemical modifications [35, 36], the sample storage method was also taken as a factor that might influence the classification accuracy.
Since there might be outliers in the data, such as wrong labels or technical noise from the surgery or the experiment, we performed outlier removal using one-class support vector machine [37]. The outlier detection was performed as described in Materials and Methods. After outlier removal, 7174 samples were left, and the sample number per cancer were listed in Table 2.
Table 2
Sample numbers from the TCGA frozen dataset for each cancer after removing outliers by SVM.
Cancer
|
Original number
|
Number after removal
|
BLCA
|
298
|
283
|
BRCA
|
1041
|
985
|
CESC
|
257
|
244
|
COADREAD
|
591
|
562
|
GBM
|
153
|
145
|
HNSC
|
480
|
455
|
KIRC
|
522
|
493
|
KIRP
|
222
|
211
|
LAML
|
173
|
162
|
LGG
|
439
|
417
|
LIHC
|
294
|
278
|
LUAD
|
476
|
448
|
LUSC
|
428
|
404
|
OV
|
261
|
247
|
PAAD
|
142
|
130
|
PRAD
|
375
|
356
|
STAD
|
415
|
393
|
THCA
|
500
|
477
|
UCEC
|
512
|
484
|
Random forest algorithm performed well in frozen datasets during cross validation
We ran 10-fold cross validations on the frozen datasets to evaluate the algorithm and feature number to use. For most cases, it’s relatively less costly to use panels to determine the expression level of the specific genes. However, the coverage of the panel is also confined to the cost of the panel. Our goal is to lower the cost (gene number) while keeping the primary tracing accuracy as high as possible. We used the of random forest algorithm to choose the most discriminating genes (see Materials and Methods). We chose the best G genes from the training set in each fold of cross validation, where G is an integer between 10 and 200 with 10 as the step size.
The average accuracies of all 10-fold cross validations were plotted against the feature numbers (Fig. 2b). In the frozen dataset, the accuracy stabilized after 90 genes (accuracy = 97.55%). When using more genes, the accuracy increases for less than 0.1% for each increase of gene number. Thus, 90 were set as the optimal feature number for frozen datasets. The confusion matrix of the cross validation of the 90 genes from the frozen dataset was plot as Fig. 2c.
Selected genes were significantly enriched in both tissue-specific functions and tissue-general functions.
We selected 90 genes for further usage in frozen dataset. The log2 transformed average expression values for selected genes in different cancers were plotted as heatmaps (Fig. 3). In some extent, the heatmap explained how those genes could be used to distinguish different cancer types. Some of the genes can be very informative to make the judgment of the primary origin, as they have a much different expression in some cancers. For example, SCGB2A2 has an extremely low expression in LAML compared to other cancers while it has a very high expression in BRCA, showing the gene expression levels could be used to distinguish different cancer types.
We also performed enrichment analysis on the gene set. The enrichment results were shown in Fig. 4. The genes were enriched significantly in development of human organs, such as urogenital system development, endocrine system development, thyroid gland development, prostate gland morphogenesis, renal system development, lung cell differentiation and reproductive system development et al. There are other genes involving in hormone metabolic process, hormone-mediated signaling pathway. Remarkably, some genes involved in the organ or tissue-specific development are more likely to be differentially expressed in tumors and normal tissues. For example, HOXB13 gene, which encodes nuclear transcription factor which belongs to the HOX family, is known to involve in embryonic development and in the termination of the differentiated tissue [38, 39]. The previous studies revealed that HOXB13 was overexpressed in numerous cancers, such as prostate cancer [40], breast cancer [41], ovarian cancer [42], cervical cancer [43], which demonstrated that HOXB13 might be a promising biomarker of carcinogenesis in certain types of cancer. In addition, SOX17 gene encodes a member of the SOX family (SRY-related HMG-box) of transcription factors involved in multiple developmental processes [44], as well as in carcinogenesis [45]. It has been reported that the SOX17 expression was down-regulated in several types of cancer cells [46–48], and it was associated with a poor prognosis of cancers. [48–50]. Remarkably, we found ESR1 was a gene signature of our 90 genes panel. ESR1 encodes a nuclear estrogen receptor which is involved in the regulation of cell proliferation and differentiation in target tissues. The overexpressed ESR1 gene results in two-thirds of breast cancer and it has been defined the ER + subtype of breast cancer [51]. ESR1 is known as a biomarker to evaluate endocrine resistance, disease relapse and an overall poor prognosis [52–54].
The model trained from TCGA frozen dataset performed well on FFPE dataset in specific cancer types.
We first checked whether the 90-gene set could help to discriminate the samples from different cancers. t-SNE were used to visualize the samples from 19 cancer types of the frozen dataset (Fig. 5a) and the samples the FFPE dataset (Fig. 5b) using only 90 genes. The samples from different cancer types were clustered well using much fewer genes. However, the FFPE dataset was less clustered, indicating the FFPE samples might not be a good choice for predicting primary origin.
We trained a model using 90 genes on the 7194-sample frozen dataset. We then predicted the 49-sample FFPE dataset using the trained model, gaining the overall accuracy of 59.18%. We checked the precisions for each tumor type in the predictions of FFPE samples, and found that the STAD and COADREAD were not well classified since COADREAD samples were prone to be predicted as STAD, leading to a low precision on STAD and a low recall on COADREAD. This indicated that the two tumor types tended to be similar after being fixed by formalin. Thus, we need to take the sample storage method into account in further investigations. There are 6 samples having a low confidence in predicting, thus we set a threshold of 0.3 for the max probability of all cancers, leaving 43 predictable samples, which lifted the accuracy to 67.44%. We plotted the confusion matrix for the FFPE dataset in Fig. 5c.
The model trained from TCGA dataset performed well in independent datasets.
For testing our method, we prepared 2 independent datasets: (1) the metastatic dataset from TCGA containing 20 metastatic samples; (2) a metastatic dataset from GEO containing 42 samples.
The model trained from TCGA primary tumor dataset achieves 94.74% accuracy in metastatic tumor dataset for 19 predictable samples. The model was tested on the independent metastatic dataset in TCGA, outputting the result shown in Table S2 and Table S3.
To expand the validation dataset, we further validated the trained models on the GEO datasets. The STAR-RSEM pipeline was used to extract the expression value for all genes [55]. The GEO metastatic dataset contained 42 predictable samples and an accuracy of 74.29% was achieved. We plotted the confusion matrix for the FFPE dataset in Fig. 5d. Specifically, GSE83132 contains samples from cell lines, which might lead to inaccurate RNA expression profiling.