Pan-cancer classification
In the current study, we select 12 cancer types from the open TCGA database, although the approach is not limited to these 12 types. We use four types of data: gene expression RNA-seq data from the IlluminaHiSeq platform with Pancan normalization, methylation data from the Methylation450k platform, copy number variation data from the Gistic2 platform, and mutation data from the TCGA GDC Data Portal. Features that contain NA values in the original data are removed. For methylation data, we map probes to their corresponding genes and take the average over the probes of each gene as the gene-level value. In addition, clinical information is obtained for the 12 cancer types. In total, we obtain 6174 patients; the number of patients for each cancer type is given in Table 1. We generate patient features by integrating gene mutation data with the gene vectors obtained through the multiple affinity network, so that each patient is represented by a 128-dimensional feature vector.
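The preprocessing described above (dropping features with NA values, then averaging methylation probes per gene) can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function names, the probe identifiers, and the probe-to-gene annotation are hypothetical, and the values are toy data.

```python
from collections import defaultdict


def drop_na_features(matrix):
    """Keep only features (rows) with no missing value (None) in any patient."""
    return {feat: vals for feat, vals in matrix.items()
            if all(v is not None for v in vals)}


def average_probes_by_gene(probe_matrix, probe_to_gene):
    """Collapse probe-level rows to one row per gene by averaging.

    probe_matrix:  dict probe_id -> list of values, one per patient
    probe_to_gene: dict probe_id -> gene symbol (hypothetical annotation)
    """
    grouped = defaultdict(list)
    for probe, vals in probe_matrix.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # probes without annotation are ignored
            grouped[gene].append(vals)
    # Column-wise (per-patient) mean over all probes of the same gene.
    return {gene: [sum(col) / len(col) for col in zip(*rows)]
            for gene, rows in grouped.items()}


# Toy data: four probes, two patients.
probes = {
    "cg01": [0.2, 0.4],
    "cg02": [0.6, 0.8],
    "cg03": [0.5, None],  # contains an NA value, so the feature is removed
    "cg04": [0.1, 0.3],
}
annotation = {"cg01": "TP53", "cg02": "TP53", "cg03": "TP53", "cg04": "KRAS"}

gene_values = average_probes_by_gene(drop_na_features(probes), annotation)
```

Here the TP53 row is the per-patient mean of the two remaining TP53 probes, while KRAS keeps the values of its single probe.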
To test the feasibility of MANS, we use the lightGBM algorithm to predict the grouping of patients. The constructed patient features are used as the input to the algorithm. For the patients of the 12 cancer types, we treat patients with a single cancer type as positive samples and patients with all other cancer types as negative samples. We randomly select 80% of the samples as the training set to train the model and 20% as the test set for validation. The final AUC value is the average over 50 repetitions of 5-fold cross-validation. As seen in Fig. 2 (red represents MANS), the AUC values of STAD and HNSC among the 12 cancer types are 0.86 and 0.90, respectively, while the other AUC values are higher than 0.90, with an average AUC value of about 0.94. THCA, SKCM, READ, and UCEC perform best, with AUC values above the average. The high AUC values indicate that most patients are assigned to the correct cancer type. In addition, we compare MANS with three other methods (NES, NBS, and ECC). In most cases, the AUC of MANS is higher than that of the other three methods. The AUC value of MANS for THCA is equal to that of NES but higher than those of NBS and ECC. For LUAD and STAD, the AUC of MANS is lower than that of NES but higher than those of NBS and ECC. Overall, MANS predicts cancer type more accurately than the compared methods in most cases.
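The one-vs-rest labeling and the AUC metric used in this evaluation can be illustrated with a short sketch. It assumes that a classifier (lightGBM in the study) has already produced one score per test patient; the scores and cancer labels below are made up for illustration, and the helper names are hypothetical.

```python
def one_vs_rest_labels(cancer_types, positive_type):
    """Label patients of one cancer type as 1 and all other patients as 0."""
    return [1 if t == positive_type else 0 for t in cancer_types]


def roc_auc(labels, scores):
    """AUC as the probability that a random positive sample receives a
    higher score than a random negative sample (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(auc_values):
    """Average AUC over repeated runs, e.g. repeated cross-validation folds."""
    return sum(auc_values) / len(auc_values)


# Toy example: five test patients and hypothetical classifier scores for THCA.
types = ["THCA", "STAD", "THCA", "HNSC", "STAD"]
scores = [0.9, 0.2, 0.4, 0.6, 0.1]

labels = one_vs_rest_labels(types, "THCA")
auc = roc_auc(labels, scores)
```

In this toy case, five of the six positive-negative pairs are ranked correctly. The pairwise definition is quadratic in the number of samples; it is used here only because it makes the meaning of the metric explicit.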
In addition, we apply the same strategy to predict patient type using only a single data type (Fig. 3). The AUC values of MANS are lower than those obtained with methylation data alone for COAD and SKCM, and lower than those obtained with CNV data alone for THCA, but in the vast majority of cases the AUC values of MANS are higher than the results obtained with a single data type. This shows that using multi-omics data is more powerful than using single-omics data.
Cancer subtype identification
Classification methods can assign pan-cancer patients to different cancer types. A further aim of our work is to subdivide patients with the same cancer type into subtypes. Tumor staging describes the severity and extent of involvement of malignant tumors based on the primary tumor within the individual and the degree of dissemination. The greater the extent of involvement, the worse the patient's prognosis. Through the TNM staging system, patients are classified into stage I, stage II, stage III and stage IV[31, 32]. We therefore combine the gene vectors obtained from the pan-cancer data with tumor staging information to obtain patient features for each cancer type, and use them as input to the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. For each cancer type, we obtain different numbers of clusters by varying the neighborhood distance \(\varepsilon\) and the minimum neighborhood sample size \(MinPts\). Cancer subtype studies usually set the number of clusters to between 2 and 6. In the current study, we cluster patients with the same cancer type into k classes. For each k, the Silhouette Coefficient is calculated using a Python package, and we plot the Silhouette Coefficient averaged over 10 runs against k. As shown in Fig. 4, the peak positions represent the optimal number of clusters for CESC, LUSC, and SKCM (Supplementary Fig. 1 provides the optimal number of clusters for the 12 cancer types). The darker orange color of the matrix in the figure shows the corresponding clustering results. To test whether the identified subtypes are clinically distinct, a survival analysis is performed. In addition, we provide the confidence interval (\(95\% CI\)), hazard ratio (\(HR\)), and median survival for each subtype.
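The model-selection step above, choosing the number of clusters at the peak of the Silhouette Coefficient, can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the code used in the study: it uses Euclidean distance, assigns a silhouette of 0 to points in singleton clusters, and takes ready-made cluster labels (in the study they come from DBSCAN, whose noise points would need separate handling). The toy points and candidate labelings are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette s(i) = (b - a) / max(a, b) over all points, where a is
    the mean distance to the own cluster and b is the smallest mean distance
    to any other cluster."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: s(i) = 0 by convention
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def best_k(points, labelings):
    """Return the k whose labeling has the highest silhouette score.

    labelings: dict k -> list of cluster labels (one per point)
    """
    return max(labelings, key=lambda k: silhouette_score(points, labelings[k]))


# Toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
candidates = {2: [0, 0, 1, 1], 3: [0, 1, 2, 2]}
chosen_k = best_k(points, candidates)
```

For these points the two-cluster labeling scores higher than the three-cluster one, so it is selected, which mirrors reading the optimal k off the peak of the curve.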
Based on the Silhouette Coefficients, we divide CESC into two subtypes and draw Kaplan-Meier survival curves. \(HR=1.48\), the \(95\% CI\) is (1.11, 1.97), and the median survival is 237 and 720, respectively. \(P=7.0\times 10^{-3}\) \((P<0.05)\) indicates a significant prognostic difference between CESC0 and CESC1. Similarly, we divide LUSC into two subtypes and draw Kaplan-Meier survival curves. \(HR=1.46\), the \(95\% CI\) is (1.11, 1.93), and the median survival is 601 and 1135, respectively. \(P=7.3\times 10^{-3}\) \((P<0.05)\) indicates a significant prognostic difference between LUSC0 and LUSC1. We divide SKCM into three subtypes and draw Kaplan-Meier survival curves. \(HR=1.27\), the \(95\% CI\) is (1.08, 1.49), and the median survival is 439, 824 and 2369, respectively. \(P=4.8\times 10^{-3}\) \((P<0.05)\) indicates a significant prognostic difference among SKCM0, SKCM1 and SKCM2. Of the 12 cancer types, MANS identifies subtypes that are significantly associated with patient survival in 10. This further demonstrates the effectiveness of MANS. We provide clustered heat maps and survival curves for each of the 12 cancer types in Supplementary Fig. 2 and Supplementary Fig. 3, respectively.
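For readers less familiar with survival analysis, the sketch below shows how a Kaplan-Meier curve and the median survival of one subtype can be computed from follow-up times and event indicators. It is a minimal illustration using the standard product-limit estimator; the follow-up data are invented, and the function names are hypothetical rather than taken from the study's code.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times:  follow-up time of each patient
    events: 1 if death was observed at that time, 0 if the patient was censored
    Returns a list of (time, survival probability) at each observed event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Process all patients sharing the same follow-up time together.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed  # deaths and censored patients leave the risk set
    return curve


def median_survival(curve):
    """Smallest time at which the estimated survival drops to 0.5 or below;
    None if the curve never reaches 0.5."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Toy subtype: five patients, the third one is censored at time 12.
times = [5, 8, 12, 20, 30]
events = [1, 1, 0, 1, 1]

curve = kaplan_meier(times, events)
median = median_survival(curve)
```

Comparing such curves between subtypes, for example with a log-rank test, is what yields the P-values reported for each cancer type.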
To further test whether MANS improves subtype identification, we use a single data type in each of the 12 cancer types for comparison with MANS. The same framework as MANS is used to obtain patient features, cluster members are obtained by DBSCAN clustering, and the subtype results under a single data type are presented in Table 2 (the number in parentheses indicates the number of clusters). We find that for most cancer types, the optimal number of clusters is almost the same across the single data types. Among the 12 cancer types, subtypes identified based on CNV data (CNV-MANS) are significantly associated with survival in 10 cancer types, subtypes identified based on mRNA data (mRNA-MANS) in 9 cancer types, and subtypes identified based on methylation data (methylation-MANS) in 8 cancer types. A smaller P-value indicates a more significant association. Although the subtypes identified from a single data type are significantly associated with survival in most cancer types, the P-values of CNV-MANS, methylation-MANS and mRNA-MANS are larger than those of MANS in most cancer types (bold indicates best, italic indicates second best). This indicates that subtype identification based on a single data type is inferior to MANS.
Table 2
Comparison of single data types and MANS for 12 cancer types
Cancer | CNV | mRNA | methylation | MANS
BRCA | 0.06(4) | 0.05(4) | 0.06(4) | 0.04(4)
CESC | 8.0e-4(2) | 4.5e-3(2) | 3.0e-3(2) | 7.0e-3(2)
COAD | 0.04(2) | 0.05(2) | 0.06(2) | 0.03(2)
HNSC | 0.01(2) | 1.8e-3(2) | 6.2e-3(2) | 6.1e-3(2)
LUAD | 0.04(2) | 0.01(5) | 0.08(5) | 0.03(2)
LUSC | 0.01(2) | 0.01(2) | 7.4e-3(2) | 7.3e-3(2)
OV | 0.02(2) | 0.02(2) | 0.03(2) | 0.03(2)
READ | 6.0e-3(2) | 0.014(2) | 0.036(2) | 0.01(2)
SKCM | 2.2e-6(4) | 5.1e-6(4) | 1.4e-6(2) | 7.4e-7(3)
STAD | 0.06(2) | 0.06(2) | 0.06(2) | 0.07(2)
THCA | 0.03(2) | 0.037(2) | 0.081(2) | 0.09(2)
UCEC | 0.04(6) | 0.01(5) | 5.9e-3(6) | 4.8e-3(6)
Finally, we also compare MANS with SNF and NES. We use SNF and NES to cluster each cancer type into \(k=2,3,4,5,6\) clusters separately and select the clustering with the smallest P-value as the final result. Because SNF requires every data type to contain the same samples, the usable OV sample size is too small (10 samples), so no SNF analysis is performed for OV. Table 3 shows that, in most data sets, MANS outperforms SNF and NES in identifying subtypes significantly associated with survival (bold indicates best, italic indicates second best), indicating that MANS is a feasible approach for identifying cancer subtypes.
Table 3
MANS compared with SNF and NES under 12 cancer datasets
Cancer | SNF | NES | MANS
BRCA | 0.15(4) | 0.01(3) | 0.04(4)
CESC | 0.05(4) | 0.01(2) | 7.0e-3(2)
COAD | 0.03(2) | 0.04(2) | 0.03(2)
HNSC | 0.04(5) | 0.04(4) | 6.1e-3(2)
LUAD | 0.04(5) | 0.01(2) | 0.03(2)
LUSC | 0.09(2) | 5.0e-3(4) | 7.3e-3(2)
OV | NA | 6.0e-5(5) | 0.03(2)
READ | 0.01(2) | 0.14(2) | 0.01(2)
SKCM | 0.02(5) | 9.0e-4(3) | 7.4e-7(3)
STAD | 9.5e-7(2) | 0.28(4) | 0.07(2)
THCA | 0.02(2) | 0.18(3) | 0.09(2)
UCEC | 4.6e-5(2) | 5.0e-3(2) | 4.8e-3(6)
Parameter sensitivity
To test the robustness of MANS, this section analyzes its sensitivity to the parameters using the pan-cancer data. For each parameter, we vary its value over a wide range while keeping the other parameters at their default values. For each parameter, 10 tests are performed and the average AUC value is used as the final evaluation of MANS performance.
Fig. 5 shows the performance of MANS under different parameters. The AUC value fluctuates very little as the parameters change, indicating that MANS is stable. We suggest choosing the embedding dimension d in [64, 128], the walk length l in [60, 80], the number of walk nodes m in [10, 20), the window size s in [10, 15], and the network weight switch w as 0.2.
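The one-parameter-at-a-time protocol used in this section can be sketched as follows. This is only an outline of the procedure: the default values are assumptions chosen from within the suggested ranges, and `dummy_evaluate` is a placeholder that returns synthetic numbers. It does not run MANS and its output carries no experimental meaning.

```python
import random

# Assumed defaults, picked from within the ranges suggested in the text.
DEFAULTS = {"d": 128, "l": 80, "m": 10, "s": 10, "w": 0.2}


def sweep_parameter(evaluate, name, values, defaults=DEFAULTS, repeats=10):
    """Vary one parameter while holding the others at their defaults.

    evaluate: callable (params dict, run index) -> AUC of one run
    Returns a dict mapping each tested value to its AUC averaged over `repeats` runs.
    """
    results = {}
    for value in values:
        params = dict(defaults)
        params[name] = value
        scores = [evaluate(params, run) for run in range(repeats)]
        results[value] = sum(scores) / repeats
    return results


def dummy_evaluate(params, run):
    """Placeholder for training and scoring the model (NOT the real pipeline).
    Returns a synthetic score near 0.9 so that the sketch is runnable."""
    rng = random.Random(f"{sorted(params.items())}-{run}")
    return 0.9 + rng.uniform(-0.01, 0.01)


# Sweep the embedding dimension d over a few hypothetical values.
auc_by_d = sweep_parameter(dummy_evaluate, "d", [32, 64, 128, 256])
spread = max(auc_by_d.values()) - min(auc_by_d.values())
```

A small `spread` across the tested values is the kind of evidence that would support calling the method stable with respect to that parameter.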