Multi-Affinity Network integration based on multi-omics data for tumor Stratification

DOI: https://doi.org/10.21203/rs.3.rs-2154033/v1

Abstract

Tumor stratification facilitates clinical applications such as diagnosis and targeted treatment of patients. Abundant multi-omics data have advanced the study of tumor stratification, and many omics-fusion methods have been proposed. However, most methods require that all omics data contain the same samples. In this study, we propose a Multi-Affinity Network integration method based on multi-omics data for tumor Stratification, called MANS. MANS removes the requirement that the omics data to be fused must contain identical samples. Another novelty is that the subdivision of a single cancer type into its subtypes is unsupervised. MANS first constructs affinity networks from the similarity matrices calculated between genes. It then integrates multi-omics information by performing biased random walks over the multiple affinity networks to capture the neighborhood relationships of genes. Finally, patient features are constructed from the somatic mutation profiles. We classify pan-cancer samples with the lightGBM algorithm, reaching an average AUC of approximately 0.94, and further subdivide each cancer into subtypes with an unsupervised clustering algorithm. Among the 12 cancer types, MANS identifies subtypes with significant differences in patient survival for 10 of them. In conclusion, MANS is a potent precision oncology tool.

Introduction

Cancer is currently one of the leading causes of human death worldwide, and it can occur anywhere in the body[1]. Cancer (malignant tumor) is an abnormal lesion of local tissue cells caused by carcinogenic factors acting at the genetic level[2]. The genetic mutations involved can vary greatly between patients. Therefore, accurately classifying tumor patients into the corresponding cancer types and further subdividing them into the corresponding subtypes is a challenge.

With the advent of next-generation sequencing and high-throughput technologies, a considerable amount of omics data has been generated, including DNA methylation, copy number variation, gene mutation, gene expression, and so on. TCGA[3] and ICGC[4] are open databases that provide different types of omics data. These rich omics data help us better analyze biological processes and tumor heterogeneity. Because cancer is heterogeneous, a single data type may not be sufficient to characterize it accurately. Integrating multi-omics data to identify cancer subtypes has been shown to be more powerful than using a single data type[5]. The current mainstream approaches to integrating multi-omics data fall into two categories. A flexible strategy, called late integration, is to first analyze each data type individually and then integrate the results. Although this strategy is more powerful than using single-omics data, it does not capture the associations between different omics[6, 7]. The other strategy is middle integration, in which the data are fused after preprocessing. Since networks naturally represent interactions between biological entities, network-based integration and analysis of multi-omics data have become increasingly popular[8-11]. For example, Wang et al. proposed similarity network fusion (SNF)[5], which constructs a similarity network for each data type and then fuses them into one similarity network using information propagation. Zhao et al. proposed a method for the fusion of molecular and clinical networks (MCNF)[12], which allows numerical and non-numerical data to be fused in a network and determines the optimal number of clusters automatically. When integrating multi-omics data, some samples or genes have measurements in only some of the omics layers; for example, a sample or gene may have copy number variation data but no mRNA data. However, most integration methods, such as SNF and ANF, require all data types to contain the same samples.

In recent years, it has been found that although a few genes are frequently mutated in cancer, most cancer-causing mutations occur in genes that are only rarely mutated. This suggests that the somatic mutation profiles of tumors are heterogeneous and vary significantly between and within cancers[13]. The most common way to account for this variability is to map tumor mutations onto molecular networks[14]. Many methods follow this principle, such as network-based stratification (NBS) for identifying cancer subtypes[15] and network-embedded stratification (NES)[16], both of which propagate gene mutation information through networks[17].

In summary, we propose a Multi-Affinity Network integration method based on multi-omics data for tumor Stratification, called MANS. MANS can handle samples or genes that are measured in only some omics layers, and it can classify pan-cancer samples into the corresponding cancer types and further subdivide them into the corresponding subtypes. Firstly, the correlation between genes is calculated and converted into an affinity network using k-nearest neighbors (KNN)[18]. Then, the comprehensive similarity between genes is obtained by random walks over the multiple affinity networks. Next, somatic mutations are introduced to construct patient features. Finally, the Light Gradient Boosting Machine (lightGBM)[19] algorithm is used to classify the pan-cancer samples, and each cancer is further subdivided into subtypes by an unsupervised clustering algorithm. A comparison with other methods (such as NES and NBS) shows that MANS achieves better results in terms of clinical relevance and biological significance.

Material And Methods

Data

In the current study, we select 12 cancer types from the open TCGA database, although the method is not limited to these 12 types. We download the corresponding mRNA expression, DNA methylation, copy number variation and somatic mutation data. Gene-level copy number variation (CNV) is estimated for the 12 cancer types using the GISTIC2 method. For the mRNA expression data, we use the IlluminaHiSeq pancan-normalized dataset, whose values are generated by combining the "gene expression RNAseq" values across all TCGA cohorts and mean-centering each gene before extracting the transformed data belonging to each cohort. For methylation data, we use the HumanMethylation450 platform, and probes are mapped to the corresponding genes. Somatic mutation data are obtained from the TCGA GDC Data Portal. In total, 6,174 samples with 17,350 genes from 12 cancer types are obtained. The pan-cancer data and the corresponding sample sizes are described in detail in Table 1. In addition, we obtain clinical data for these patients from TCGA, such as survival time and tumor stage.

Table 1

12 cancer types and corresponding sample numbers

Cancer   Full name                               Patient number
BRCA     Breast invasive carcinoma               1067
CESC     Cervical squamous cell carcinoma        307
COAD     Colon adenocarcinoma                    464
HNSC     Head and neck squamous cell carcinoma   512
LUAD     Lung adenocarcinoma                     629
LUSC     Lung squamous cell carcinoma            559
OV       Ovarian serous cystadenocarcinoma       511
READ     Rectum adenocarcinoma                   164
SKCM     Skin cutaneous melanoma                 470
STAD     Stomach adenocarcinoma                  439
THCA     Thyroid carcinoma                       501
UCEC     Uterine corpus endometrial carcinoma    551

Methods

The core idea of MANS is to embed multiple affinity networks into the same vector space. The framework of MANS is shown in Fig. 1. The similarity between genes in each data type is first calculated and then transformed into an affinity network. A random walk is then performed over the multiple affinity networks to obtain the relationships between genes, which allows us to capture the integrated structure of the affinity networks. The next step is to obtain a low-dimensional vector representation of the genes using skip-gram and to combine it with the somatic mutation profiles to construct patient features. Finally, the pan-cancer samples are classified using the lightGBM algorithm and each cancer is further subdivided by the unsupervised DBSCAN algorithm[20].

Networks construction

In the natural sciences, the Pearson correlation coefficient is widely used to measure the degree of correlation between two variables. Therefore, we use the Pearson correlation coefficient to calculate the correlation between genes. The similarity of genes \({g_i}\) and \({g_j}\) is calculated by Eq. 1:

$$S(g_i,g_j)=\frac{\operatorname{cov}(g_i,g_j)}{\sigma_{g_i}\sigma_{g_j}}, \tag{1}$$

where \(\operatorname{cov}(g_i,g_j)\) denotes the covariance of genes \(g_i\) and \(g_j\), and \(\sigma_{g_i}\) and \(\sigma_{g_j}\) denote the standard deviations of \(g_i\) and \(g_j\), respectively. We then convert the similarity matrix \(S(g_i,g_j)\) into a distance matrix \(D(g_i,g_j)=\sqrt{1-S(g_i,g_j)}\), and further convert it into an affinity matrix \(W(i,j)\) using KNN. When calculating the affinity matrix, we use the definition of the scaling factor \(\delta_{ij}\) for genes \(g_i\) and \(g_j\) from [12]:

$$\delta_{ij}=\alpha\times\operatorname{mean}\left(D(g_i,N_i)\right)+\alpha\times\operatorname{mean}\left(D(g_j,N_j)\right)+\beta\times D(g_i,g_j), \tag{2}$$

where \(N_i\) denotes the set of nearest neighbors of \(g_i\) obtained by KNN, and \(\operatorname{mean}(D(g_i,N_i))\) denotes the average distance between \(g_i\) and its neighbors in \(N_i\). The affinity matrix \(W(i,j)\) is defined as:

$$W(i,j)=\frac{1}{\sqrt{2\pi}\,\delta_{ij}}\,e^{-\frac{D^{2}(g_i,g_j)}{2\delta_{ij}^{2}}}, \tag{3}$$

where \(\frac{1}{{\sqrt {2\pi } {\delta _{ij}}}}\) is the normalization constant.
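For clarity, the construction of one affinity network (one per omics data type) can be sketched in Python as follows. The neighborhood size k and the coefficients \(\alpha\) and \(\beta\) in Eq. 2 are not reported here, so the values below are illustrative assumptions rather than the settings used in the paper.

```python
import numpy as np

def affinity_network(X, k=20, alpha=0.5, beta=1.0):
    """Sketch of Eqs. 1-3 for one omics data type.
    X: genes x samples matrix; k, alpha and beta are illustrative values."""
    S = np.corrcoef(X)                                # Eq. 1: Pearson similarity between genes
    D = np.sqrt(np.clip(1.0 - S, 0.0, None))          # similarity -> distance
    # average distance from each gene to its k nearest neighbours (self excluded)
    mean_knn = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)
    # Eq. 2: delta_ij = alpha*mean(D(g_i,N_i)) + alpha*mean(D(g_j,N_j)) + beta*D(g_i,g_j)
    delta = alpha * mean_knn[:, None] + alpha * mean_knn[None, :] + beta * D
    # Eq. 3: Gaussian kernel with normalisation constant 1/(sqrt(2*pi)*delta)
    W = np.exp(-D ** 2 / (2.0 * delta ** 2)) / (np.sqrt(2.0 * np.pi) * delta)
    return W
```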

In our research, we employ 17,350 genes, so the affinity network constructed by the above method is very large, which is not conducive to computation. We therefore introduce a weight switch \(\omega\) to reduce the complexity of the network; the resulting local network is more time- and space-efficient than the global network. The weight switch \(\omega\) is defined as follows:

$$\omega_{(g_i,g_j)}=\begin{cases}0, & \text{if } W(i,j)<\omega,\\ 1, & \text{if } W(i,j)\geqslant\omega,\end{cases} \tag{4}$$

where \({\omega _{\left( {{g_i},{g_j}} \right)}}\) represents whether gene \({g_i}\) and gene \({g_j}\) are connected. In this study, \(\omega\) is set to 0.2.
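A minimal sketch of the sparsification in Eq. 4, under the assumption that the weight switch \(\omega\) is compared against the affinity values \(W(i,j)\):

```python
import numpy as np

def sparsify(W, omega=0.2):
    """Eq. 4: keep an edge only if its affinity reaches the weight switch omega."""
    A = (W >= omega).astype(int)
    np.fill_diagonal(A, 0)        # drop self-loops
    return A
```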

Random walks on multi-affinity networks

Since the network is high-dimensional and complex, performing inference over the entire network can be difficult. Network representation learning (network embedding) has been proposed to address this problem[21]. It converts each node in the network into a low-dimensional latent representation by learning a mapping function. Different network representation learning algorithms define node neighborhoods differently. For example, LINE[22] and its improved versions consider only the first- and second-order neighbors of a node. DeepWalk[23] uses a random walk strategy (a special kind of Markov chain) to define neighbors, but its walk selects the next node uniformly at random and does not favor neighbors that are more similar to the current node. In this work, we use a biased random walk so that more similar neighbor nodes are preferred. After the walk moves from node t to the current node u, the unnormalized probability of visiting the next node v is proportional to \(\alpha_{pq}(t,v)\cdot\omega_{uv}\), where the bias term \(\alpha_{pq}(t,v)\) is defined as:

$$\alpha_{pq}(t,v)=\begin{cases}\frac{1}{p}, & d_{tv}=0,\\[2pt] 1, & d_{tv}=1,\\[2pt] \frac{1}{q}, & d_{tv}=2,\end{cases} \tag{6}$$

where \(\omega_{uv}\) is the weight of the edge between nodes u and v, \(d_{tv}\) is the shortest-path distance between the previous node t and the candidate node v, p is the return hyperparameter that controls how likely the walk is to revisit the previous node, and q is the in-out hyperparameter that controls how likely it is to move farther away. We randomly select a node from the multiple affinity networks as the start of each walk and choose every next node according to the above probability, generating a sequence of nodes.
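The biased walk itself can be sketched as follows; the adjacency and weight containers and the stopping rule at dead ends are assumptions made to keep the example self-contained.

```python
import random

def biased_walk(adj, weights, start, length=80, p=1.0, q=1.0):
    """Node2vec-style biased random walk over one affinity network (sketch).
    adj: dict mapping each node to the set of its neighbours;
    weights: dict mapping an edge (u, v) to its weight omega_uv;
    p and q are the return and in-out hyperparameters of Eq. 6."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = list(adj[cur])
        if not nbrs:                      # dead end: stop the walk early
            break
        if len(walk) == 1:                # first step: no previous node yet
            walk.append(random.choice(nbrs))
            continue
        prev = walk[-2]
        probs = []
        for v in nbrs:
            if v == prev:                 # d_tv = 0: return to the previous node
                bias = 1.0 / p
            elif v in adj[prev]:          # d_tv = 1: neighbour of the previous node
                bias = 1.0
            else:                         # d_tv = 2: move farther away
                bias = 1.0 / q
            probs.append(bias * weights[(cur, v)])
        total = sum(probs)
        r = random.uniform(0, total)
        acc = 0.0
        for v, pr in zip(nbrs, probs):
            acc += pr
            if acc >= r:
                walk.append(v)
                break
    return walk
```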

The skip-gram model was originally used in natural language processing to capture the semantic information of words by learning a vector for each word in a text[24]; semantically similar words are mapped to nearby points in the embedding space. After obtaining the sequences of nodes, we treat each node as a word and use the skip-gram model to learn node representations[25, 26]. In this study, we generate 10 walk sequences for each node, the length of each random walk is set to 80, and each gene is represented by a 128-dimensional vector.
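Treating each node as a word, the embeddings can then be learned with a skip-gram implementation such as gensim's Word2Vec. The sketch below assumes that library; the number of training epochs is an illustrative choice.

```python
from gensim.models import Word2Vec

def embed_walks(walks, dim=128, window=10):
    """Learn gene embeddings from random-walk sequences with skip-gram (sketch).
    walks: list of node sequences from the biased walks; node IDs are cast to
    strings because gensim treats each token as a word. Parameter names follow
    gensim >= 4.0."""
    sentences = [[str(node) for node in walk] for walk in walks]
    model = Word2Vec(sentences=sentences, vector_size=dim, window=window,
                     sg=1,                # skip-gram rather than CBOW
                     min_count=0, workers=4, epochs=5)
    return {node: model.wv[node] for node in model.wv.index_to_key}
```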

Constructing patient features

We construct patient features from somatic mutation profiles. For each sample, the vectors of the genes mutated in that sample are fused to generate a new 128-dimensional feature representing the patient. In constructing patient features, we find from the somatic mutation profiles that genes mutate differently across cancer types. As shown in Fig. 1(C), some genes are mutated only in specific cancer types while others are mutated in all cancer types. To eliminate the effect of differing gene mutation frequencies, we define a weight \(\omega_g\):

$$\omega_{g}=\frac{c_i(g)}{t(g)}, \tag{7}$$

where \(c_i(g)\) represents the number of mutations of gene g in cancer type i and \(t(g)\) represents the number of mutations of gene g across all cancer types.
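A possible sketch of the weighted patient-feature construction is shown below; mean aggregation of the weighted gene vectors is an assumption for illustration, since the fusion operator is not spelled out above.

```python
import numpy as np

def mutation_weights(mutation_counts):
    """Eq. 7: w_g = c_i(g) / t(g), computed for every cancer type.
    mutation_counts: dict cancer_type -> dict gene -> mutation count."""
    totals = {}
    for counts in mutation_counts.values():
        for gene, c in counts.items():
            totals[gene] = totals.get(gene, 0) + c
    return {cancer: {gene: c / totals[gene] for gene, c in counts.items()}
            for cancer, counts in mutation_counts.items()}

def patient_feature(mutated_genes, gene_vectors, weights, dim=128):
    """Fuse the embeddings of the genes mutated in one sample into a single
    patient vector, scaling each gene by its weight w_g (mean fusion is an
    illustrative assumption)."""
    vecs = [weights.get(g, 1.0) * gene_vectors[g]
            for g in mutated_genes if g in gene_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```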

Supervised classification model and Unsupervised clustering model

LightGBM is a gradient boosting framework that uses tree-based learning algorithms[19, 27]. A histogram algorithm first discretizes continuous floating-point feature values into k integers and constructs a histogram of width k. A leaf-wise growth strategy with depth restrictions is then used for the decision trees. Compared with the XGBoost algorithm[28], its training is faster and its memory usage is lower.

When performing pan-cancer classification, patients of the target cancer type are treated as positive samples and patients of all other cancer types as negative samples. The performance of the model is evaluated by the AUC value[29].
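The one-vs-rest evaluation could look roughly like the following sketch using the lightGBM scikit-learn interface; hyperparameters are left at library defaults rather than any settings of the paper.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def one_vs_rest_auc(X, cancer_labels, target):
    """One-vs-rest classification of a single cancer type with LightGBM (sketch).
    X: patient features (n_samples x 128); cancer_labels: cancer type per patient;
    target: the cancer type treated as the positive class. The 80/20 split mirrors
    the evaluation described in the Results."""
    y = (np.asarray(cancer_labels) == target).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=0)
    clf = LGBMClassifier().fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```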

We use the DBSCAN algorithm to cluster patients within a single cancer type. Different parameter settings may yield different numbers of clusters, so we obtain different numbers of clusters by varying the neighborhood distance \(\varepsilon\) and the minimum neighborhood sample size \(MinPts\), and then determine the optimal number of clusters by the average Silhouette Coefficient[30].
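A minimal sketch of this parameter sweep, assuming scikit-learn's DBSCAN and silhouette_score; the grids of \(\varepsilon\) and \(MinPts\) values are illustrative.

```python
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def best_dbscan(X, eps_grid=(0.5, 1.0, 1.5, 2.0), minpts_grid=(5, 10, 20)):
    """Sweep DBSCAN's neighbourhood distance eps and MinPts and keep the clustering
    with the highest Silhouette Coefficient (sketch). X: numpy array of features."""
    best_labels, best_score = None, -1.0
    for eps in eps_grid:
        for minpts in minpts_grid:
            labels = DBSCAN(eps=eps, min_samples=minpts).fit_predict(X)
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            if not 2 <= n_clusters <= 6:       # plausible subtype range used in the paper
                continue
            mask = labels != -1                # ignore noise points when scoring
            score = silhouette_score(X[mask], labels[mask])
            if score > best_score:
                best_labels, best_score = labels, score
    return best_labels, best_score
```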

Results

Pan-cancer classification

In the current study, we select 12 cancer types from the open TCGA database, although the method is not limited to these 12 types. We use four types of data: gene expression RNAseq from the IlluminaHiSeq pancan normalization platform, methylation data from the Methylation450k platform, copy number variation from the GISTIC2 platform, and mutation data from the TCGA GDC Data Portal. Features containing NA values in the original data are removed. For methylation data, we map the probes to the corresponding genes and take the average value as the expression. In addition, clinical information is obtained for the 12 cancer types. In total, we obtain 6,174 patients; the specific number of patients for each cancer type is given in Table 1. We generate patient features by integrating the gene mutation data with the gene vectors obtained from the multiple affinity networks, so each patient is represented by a 128-dimensional feature vector.

To test the feasibility of MANS, we use the lightGBM algorithm to predict the cancer type of each patient, with the constructed patient features as input. Among the patients of the 12 cancer types, we treat the patients of a single cancer type as positive samples and the patients of the other cancer types as negative samples. We randomly select 80% of the samples as the training set and 20% as the test set. Five-fold cross-validation averaged over 50 repetitions is used to obtain the final AUC value. As shown in Fig. 2 (red represents MANS), the AUC values of STAD and HNSC among the 12 cancer types are 0.86 and 0.90, respectively, while the other AUC values are higher than 0.90, with an average AUC of about 0.94. THCA, SKCM, READ, and UCEC perform best, with AUC values greater than the average. These high AUC values indicate that the majority of patients are identified accurately. In addition, we compare MANS with three more advanced methods. In most cases, MANS outperforms the other three methods. The AUC value of MANS for THCA is equal to that of NES but higher than those of NBS and ECC; for LUAD and STAD, MANS is lower than NES but higher than NBS and ECC. In conclusion, MANS is more powerful in predicting the type of cancer.
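For reference, the repeated cross-validation described above could be approximated as in the following sketch with scikit-learn utilities; the exact fold handling and random seeds of the original experiments are not reported.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def repeated_cv_auc(X, y, n_splits=5, n_repeats=50):
    """Average AUC over repeated 5-fold cross-validation (sketch; the random
    seed is arbitrary). y is the binary one-vs-rest label vector."""
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                 random_state=0)
    scores = cross_val_score(LGBMClassifier(), X, y, scoring="roc_auc", cv=cv)
    return float(np.mean(scores))
```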

In addition, we perform the same prediction strategy using only a single data type (Fig. 3). The AUC values of MANS are lower than those using only methylation data for COAD and SKCM and lower than those using only CNV data for THCA, but in the vast majority of cases the AUC values of MANS are higher than the results obtained with any single data type. This shows that using multi-omics data is more powerful than using single-omics data.

Cancer subtype identification

Classification methods can classify pan-cancer samples into different cancer types. Another aim of our work is to subdivide patients with the same cancer type into the corresponding subtypes. Tumor staging describes the severity and extent of involvement of a malignant tumor based on the primary tumor and its degree of dissemination; the greater the extent of involvement, the worse the patient's prognosis. Through the TNM staging system, patients are classified into stage I, stage II, stage III and stage IV[31, 32]. Therefore, we combine the gene vectors obtained from pan-cancer with tumor staging information to obtain patient features for each cancer type and input them into the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. For each cancer type, we obtain different numbers of clusters by varying the neighborhood distance \(\varepsilon\) and the minimum neighborhood sample size \(MinPts\). Cancer subtype studies usually set the number of clusters between 2 and 6. In the current study, we cluster patients with the same cancer into k classes. For each k, the Silhouette Coefficient is calculated using the corresponding Python package, and we plot curves based on the Silhouette Coefficient averaged over 10 runs. As shown in Fig. 4, the peak positions indicate the optimal numbers of clusters for CESC, LUSC, and SKCM (Supplementary Fig. 1 provides the optimal numbers of clusters for all 12 cancer types); the darker orange color of the matrix in the figure shows the corresponding clustering results. To test whether the identified subtypes are clinically distinct, a survival analysis is performed. In addition, we provide the confidence interval (\(95\% CI\)), Hazard Ratio (\(HR\)), and median survival for each subtype.

By calculating the Silhouette Coefficient, we divide CESC into two subtypes and draw Kaplan-Meier survival curves: \(HR=1.48\), the \(95\% CI\) is (1.11, 1.97), and the median survival times are 237 and 720, respectively. \(P=7.0\times10^{-3}\) (\(P<0.05\)) indicates a significant prognostic difference between CESC0 and CESC1. Similarly, LUSC is divided into two subtypes: \(HR=1.46\), the \(95\% CI\) is (1.11, 1.93), and the median survival times are 601 and 1135, respectively; \(P=7.3\times10^{-3}\) (\(P<0.05\)) indicates a significant prognostic difference between LUSC0 and LUSC1. SKCM is divided into three subtypes: \(HR=1.27\), the \(95\% CI\) is (1.08, 1.49), and the median survival times are 439, 824 and 2369, respectively; \(P=4.8\times10^{-3}\) (\(P<0.05\)) indicates a significant prognostic difference among SKCM0, SKCM1 and SKCM2. Of the 12 cancer types, MANS identifies subtypes significantly associated with patient survival for 10 of them, which further demonstrates the effectiveness of MANS. We provide clustered heat maps and survival curves for each of the 12 cancer types in Supplementary Fig. 2 and Supplementary Fig. 3, respectively.
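The subtype survival comparison can be reproduced in spirit with the lifelines package, as in the sketch below (the paper does not state which implementation it used); variable names are illustrative.

```python
from lifelines import KaplanMeierFitter
from lifelines.statistics import multivariate_logrank_test

def compare_subtypes(durations, events, subtypes):
    """Plot Kaplan-Meier curves per subtype and test for a survival difference.
    durations: survival times; events: 1 if the death was observed, 0 if censored;
    subtypes: the cluster label assigned to each patient."""
    for s in sorted(set(subtypes)):
        idx = [i for i, lab in enumerate(subtypes) if lab == s]
        km = KaplanMeierFitter()
        km.fit([durations[i] for i in idx], [events[i] for i in idx],
               label=f"subtype {s}")
        km.plot_survival_function()
    # multivariate log-rank test across all subtypes
    result = multivariate_logrank_test(durations, subtypes, events)
    return result.p_value
```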

To further test whether MANS improves subtype identification, we compare MANS with each single data type in the 12 cancer types. The same framework as MANS is used to obtain patient features, cluster memberships are obtained by DBSCAN clustering, and the subtype results under each single data type are presented in Table 2 (the number in parentheses indicates the number of clusters). We find that for most cancer types, the optimal number of clusters based on a single data type is almost the same as that of MANS. Among the 12 cancer types, the subtypes identified based on CNV data (CNV-MANS) are significantly associated with survival for 10 cancer types, those based on mRNA data (mRNA-MANS) for 9 cancer types, and those based on methylation data (methylation-MANS) for 8 cancer types. A smaller p-value indicates a more significant association. Although the subtypes identified from a single data type are significantly associated with survival for most cancer types, the p-values of CNV-MANS, methylation-MANS and mRNA-MANS are larger than those of MANS in most cancer types (bold indicates best, italic indicates second best). This indicates that stratification based on a single data type is inferior to MANS.

Table 2

Comparison of single data types and MANS for 12 cancer types

Cancer   CNV          mRNA        methylation   MANS
BRCA     0.06(4)      0.05(4)     0.06(4)       0.04(4)
CESC     8.0e-4(2)    4.5e-3(2)   3.0e-3(2)     7.0e-3(2)
COAD     0.04(2)      0.05(2)     0.06(2)       0.03(2)
HNSC     0.01(2)      1.8e-3(2)   6.2e-3(2)     6.1e-3(2)
LUAD     0.04(2)      0.01(5)     0.08(5)       0.03(2)
LUSC     0.01(2)      0.01(2)     7.4e-3(2)     7.3e-3(2)
OV       0.02(2)      0.02(2)     0.03(2)       0.03(2)
READ     6.0e-3(2)    0.014(2)    0.036(2)      0.01(2)
SKCM     2.2e-6(4)    5.1e-6(4)   1.4e-6(2)     7.4e-7(3)
STAD     0.06(2)      0.06(2)     0.06(2)       0.07(2)
THCA     0.03(2)      0.037(2)    0.081(2)      0.09(2)
UCEC     0.04(6)      0.01(5)     5.9e-3(6)     4.8e-3(6)

Finally, we also compare MANS with SNF and NES. We use SNF and NES to cluster each cancer type into \(k=2,3,4,5,6\) clusters and select the result with the smallest p-value as the final result. Since SNF requires every data type to contain the same samples and only 10 such OV samples are available, no SNF analysis is performed for OV. Table 3 shows that MANS identifies subtypes with more significant survival differences than SNF and NES in most data sets (bold indicates best, italic indicates second best), indicating that MANS is feasible for identifying cancer subtypes.

Table 3

MANS compared with SNF and NES under 12 cancer datasets

Cancer   SNF          NES          MANS
BRCA     0.15(4)      0.01(3)      0.04(4)
CESC     0.05(4)      0.01(2)      7.0e-3(2)
COAD     0.03(2)      0.04(2)      0.03(2)
HNSC     0.04(5)      0.04(4)      6.1e-3(2)
LUAD     0.04(5)      0.01(2)      0.03(2)
LUSC     0.09(2)      5.0e-3(4)    7.3e-3(2)
OV       NA           6.0e-5(5)    0.03(2)
READ     0.01(2)      0.14(2)      0.01(2)
SKCM     0.02(5)      9.0e-4(3)    7.4e-7(3)
STAD     9.5e-7(2)    0.28(4)      0.07(2)
THCA     0.02(2)      0.18(3)      0.09(2)
UCEC     4.6e-5(2)    5.0e-3(2)    4.8e-3(6)

Parameter sensitivity

To test the robustness of MANS, we perform a parameter sensitivity analysis on the pan-cancer data. For each parameter, we vary its value over a wide range while keeping the other parameters at their default values. For each setting, 10 tests are performed and the average AUC value is used as the final evaluation of MANS performance.

Figure 5 shows the performance of MANS under different parameters. The AUC values fluctuate very little as the parameters change, indicating that MANS is stable. We suggest choosing the embedding dimension d in [64, 128], the walk length l in [60, 80], the number of walks per node m in [10, 20), the window size s in [10, 15], and the network weight switch w as 0.2.

Discussion

Cancer is now one of the leading causes of human death worldwide. Tumor stratification facilitates clinical applications such as patient diagnosis and targeted therapy, and thus has important implications for precision oncology. We construct multiple affinity networks based on multi-omics data, and then combine the comprehensive gene information obtained from these networks by biased random walks with gene mutation information to construct patient features. Unlike most methods for integrating multi-omics data, MANS does not require each data type to contain the same samples, which gives it better access to the complementary information in each data type. The average AUC value achieved by MANS in predicting cancer types is as high as 0.94. When patients of each cancer type are further subdivided into subtypes, the subtypes identified by MANS are significantly associated with survival time in the clinical data.

Although we provide a powerful tumor stratification method based on multi-omics integration, it still has some limitations. First, MANS cannot handle non-numerical data, and the frequency of mutations varies widely across tumor types. In addition, we only integrate CNV, mRNA, methylation and mutation data, and not other types of data such as miRNA or exon expression; adding these may improve the performance of MANS. Second, we extract gene vectors for individual cancer types directly from pan-cancer data; fusing multi-omics data for each cancer type separately may further improve MANS. Finally, tumor stratification might also be improved by using other computational methods within this framework, for example spectral clustering and hierarchical clustering from machine learning, or graph attention networks and graph convolutional networks from deep learning. In summary, we provide a powerful tool for precision oncology.

Abbreviations

TCGA: The Cancer Genome Atlas; ICGC: International Cancer Genome Consortium; CGC: Cancer Gene Census; NCG: Network of Cancer Genes; BRCA: Breast invasive carcinoma; CESC: Cervical squamous cell carcinoma; COAD: Colon adenocarcinoma; HNSC: Head and neck squamous cell carcinoma; LUAD: Lung adenocarcinoma; LUSC: Lung squamous cell carcinoma; READ: Rectum adenocarcinoma; OV: Ovarian serous cystadenocarcinoma; SKCM: Skin cutaneous melanoma; STAD: Stomach adenocarcinoma; THCA: Thyroid carcinoma; UCEC: Uterine corpus endometrial carcinoma; KNN: K-Nearest Neighbor; lightGBM: Light Gradient Boosting Machine; XGB: XGBoost; DBSCAN: Density-Based Spatial Clustering of Applications with Noise;

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and materials

This study uses the gene expression, copy number variation and DNA methylation data of the BRCA, CESC, COAD, HNSC, LUAD, LUSC, OV, READ, SKCM, STAD, THCA and UCEC cancer types from the open TCGA database, available online at https://xenabrowser.net/datapages/. Gene mutation data are obtained from https://doi.org/10.1093/bioinformatics/btaa1099.

Competing interests

The authors declare that they have no competing interests.

Funding

This work has been supported by the National Natural Science Foundation of China (61902216, 61972226, 62172254 and 61972236), and the Natural Science Foundation of Shandong Province (No. ZR2018MF013).

Authors’ contributions

S.Z.S. provided the methodology. S.Z.S. and F.L. designed the algorithm. S.J.L., X.K.L. and J.L.S. arranged the datasets and performed the analysis. S.Z.S. drafted the manuscript. F.L., Y.L. and J.X.L. reviewed and edited the manuscript. All authors read and approved the final manuscript.

Acknowledgements

Not applicable.

References

  1. Ferlay J, Soerjomataram I, Dikshit R, Eser S, Mathers C, Rebelo M, Parkin DM, Forman D, Bray F: Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. International journal of cancer 2015, 136(5):E359-E386.
  2. Hanahan D, Weinberg RA: Hallmarks of cancer: the next generation. cell 2011, 144(5):646-674.
  3. The Cancer Genome Atlas Research Network: Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008, 455(7216):1061-1068.
  4. The International Cancer Genome Consortium: International network of cancer genome projects. Nature 2010, 464(7291):993-998.
  5. Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, Haibe-Kains B, Goldenberg A: Similarity network fusion for aggregating data types on a genomic scale. Nature methods 2014, 11(3):333-337.
  6. Shen R, Olshen AB, Ladanyi M: Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 2009, 25(22):2906-2912.
  7. Mo Q, Wang S, Seshan VE, Olshen AB, Schultz N, Sander C, Powers RS, Ladanyi M, Shen R: Pattern discovery and cancer gene identification in integrated cancer genomic data. Proceedings of the National Academy of Sciences 2013, 110(11):4245-4250.
  8. Ma’ayan A: Introduction to network analysis in systems biology. Science signaling 2011, 4(190):tr5-tr5.
  9. Zhu L, You Z-H, Huang D-S, Wang B: t-LSE: a novel robust geometric approach for modeling protein-protein interaction networks. PloS one 2013, 8(4):e58368.
  10. Lee J-H, Zhao X-M, Yoon I, Lee JY, Kwon NH, Wang Y-Y, Lee K-M, Lee M-J, Kim J, Moon H-G: Integrative analysis of mutational and transcriptional profiles reveals driver mutations of metastatic breast cancers. Cell discovery 2016, 2(1):1-14.
  11. Zhao X-M, Liu K-Q, Zhu G, He F, Duval B, Richer J-M, Huang D-S, Jiang C-J, Hao J-K, Chen L: Identifying cancer-related microRNAs based on gene expression data. Bioinformatics 2015, 31(8):1226-1234.
  12. Zhao L, Yan H: MCNF: A novel method for cancer subtyping by integrating multi-omics and clinical data. IEEE/ACM transactions on computational biology and bioinformatics 2019, 17(5):1682-1690.
  13. Duan H, Li F, Shang J, Liu J, Li Y, Liu X: scVAEBGM: Clustering Analysis of Single-Cell ATAC-seq Data Using a Deep Generative Model. Interdisciplinary Sciences: Computational Life Sciences 2022:1-12.
  14. Zhang W, Ma J, Ideker T: Classifying tumors by supervised network propagation. Bioinformatics 2018, 34(13):i484-i493.
  15. Hofree M, Shen JP, Carter H, Gross A, Ideker T: Network-based stratification of tumor mutations. Nature methods 2013, 10(11):1108-1115.
  16. Liu C, Han Z, Zhang Z-K, Nussinov R, Cheng F: A network-based deep learning methodology for stratification of tumor mutations. Bioinformatics 2021, 37(1):82-88.
  17. Cowen L, Ideker T, Raphael BJ, Sharan R: Network propagation: a universal amplifier of genetic associations. Nature Reviews Genetics 2017, 18(9):551-562.
  18. Keller JM, Gray MR, Givens JA: A fuzzy k-nearest neighbor algorithm. IEEE transactions on systems, man, and cybernetics 1985(4):580-585.
  19. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu T-Y: Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems 2017, 30.
  20. Ester M, Kriegel H-P, Sander J, Xu X: A density-based algorithm for discovering clusters in large spatial databases with noise. In: kdd: 1996; 1996: 226-231.
  21. Yang C, Liu Z, Zhao D, Sun M, Chang E: Network representation learning with rich text information. In: Twenty-fourth international joint conference on artificial intelligence: 2015; 2015.
  22. Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q: Line: Large-scale information network embedding. In: Proceedings of the 24th international conference on world wide web: 2015; 2015: 1067-1077.
  23. Perozzi B, Al-Rfou R, Skiena S: Deepwalk: Online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining: 2014; 2014: 701-710.
  24. Mikolov T, Chen K, Corrado G, Dean J: Efficient estimation of word representations in vector space. arXiv preprint arXiv:13013781 2013.
  25. Grover A, Leskovec J: node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining: 2016; 2016: 855-864.
  26. Ribeiro LF, Saverese PH, Figueiredo DR: struc2vec: Learning node representations from structural identity. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining: 2017; 2017: 385-394.
  27. Kotsiantis SB: Decision trees: a recent overview. Artificial Intelligence Review 2013, 39(4):261-283.
  28. Li F, Chu X, Dai L, Wang J, Liu J, Shang J: Effects of Multi-Omics Characteristics on Identification of Driver Genes Using Machine Learning Algorithms. Genes 2022, 13(5):716.
  29. Lobo JM, Jiménez-Valverde A, Real R: AUC: a misleading measure of the performance of predictive distribution models. Global ecology and Biogeography 2008, 17(2):145-151.
  30. Rousseeuw PJ: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics 1987, 20:53-65.
  31. Sobin LH, Gospodarowicz MK, Wittekind C: TNM classification of malignant tumours: John Wiley & Sons; 2011.
  32. Edge SB, Compton CC: The American Joint Committee on Cancer: the 7th edition of the AJCC cancer staging manual and the future of TNM. Annals of surgical oncology 2010, 17(6):1471-1474.