MCluster-VAEs: an end-to-end variational deep learning-based clustering method for subtype discovery using multi-omics data

Background: The discovery of cancer subtypes based on unsupervised clustering helps provide precise diagnoses, guide treatment and improve patients' prognoses. Compared with single-omics data, multi-omics data can improve clustering performance because it provides a comprehensive landscape for understanding biological systems and mechanisms. However, heterogeneous data from multiple sources introduce high complexity and different kinds of noise, which are detrimental to the extraction of clustering information.
Methods: We propose an end-to-end deep learning-based method, Multi-omics Clustering Variational Autoencoders (MCluster-VAEs), that can extract cluster-friendly representations from multi-omics data. First, a unified network architecture with an attention mechanism is developed to model multi-omics data precisely. Then, using a novel objective function built with the variational Bayes technique, the model is trained to effectively obtain the posterior estimate of the clustering assignments.
Results: Compared with twelve other state-of-the-art multi-omics clustering methods, MCluster-VAEs achieved outstanding performance on benchmark datasets from the TCGA database. On the Pan Cancer dataset, MCluster-VAEs achieved an adjusted Rand index of around 0.78 for cancer category recognition, an increase of more than 18% over the other methods. Furthermore, survival analysis and clinical parameter enrichment tests on ten cancer datasets demonstrate that MCluster-VAEs delivered comparable or even better results than many typical integrative methods.
Conclusions: These results demonstrate that MCluster-VAEs is a powerful new tool for cancer subtype discovery using multi-omics data.


Background
Cancer, being a highly complex and heterogeneous genomic disease, shows variability in tumor responsiveness to therapy (1). This problem forms the basis of one of the critical areas of cancer research, i.e. the development of excellent subtype discovery methods. These methods are designed to decouple the heterogeneity of cancer and, accordingly, divide cancer patients into different groups to better understand the pathogenesis of cancer and to improve clinical treatment (2). Since the occurrence and development of cancer are related to several different biological layers and molecular systems, it is more pertinent to identify cancer subtypes based on multi-omics data rather than single-omics data (3).
The most commonly used methods to recognize cancer subtypes are clustering algorithms, such as hierarchical clustering (4), spectral clustering (5) and K-means (6). These methods, designed for single-source data, cannot address the dimensional redundancy and heterogeneous information integration brought by multi-omics data. Lately, the advent of high-throughput technologies has proliferated multi-omics data, leading to the development of multi-omics clustering methods (7). These methods are based on either additional regularization controlling the problem of dimensionality (3,8,9), the fusion of similarity networks (10-14), the factorization of multiple matrices (15,16), or simple linear probabilistic models (17,18). However, all these methods can only deal with shallow, largely linear relationships in the data.

Variational Bayes
In order to solve the above inference task, we introduce a trainable parametric variational distribution q_φ(y, z | x) and maximize the evidence lower bound:

L(θ, φ; x_i) = E_{q_φ(y_i, z_i | x_i)}[log p_θ(x_i | y_i, z_i)] − KL(q_φ(y_i, z_i | x_i) ‖ p_θ(y_i, z_i))   (2)

The term L(θ, φ; x_i) is called the (variational) evidence lower bound (ELBO). It is never greater than the marginal log-likelihood, so the marginal likelihood can be maximized indirectly by updating the generative parameters θ and the variational parameters φ. This approach was first proposed by Kingma and Welling (31) and has been applied in several generative models, including the famous VAE (31).
We can assume that q_φ(y, z | x) has the mean field form and can be decomposed into a product:

q_φ(y_i, z_i | x_i) = q_φ(y_i | x_i) q_φ(z_i | x_i)

Figure 1C illustrates this inference path, and the ELBO can be derived accordingly.
A widely used deep learning-based approach for multi-omics information integration is intermediate integration (32), which first processes each omics dataset using dedicated layers, concatenates the outputs of the dedicated layers, and then uses further layers to integrate the features extracted from each omics dataset. This approach enables the most suitable dedicated layers to be used for each omics dataset and can hence extract more predictive features.
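Assuming the per-omics encoders have already been defined, the intermediate integration idea can be sketched in plain Python (a stand-in for the actual neural network layers, not the paper's implementation):

```python
def dense(x, weights, bias):
    """One fully connected layer with ReLU activation."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def intermediate_integration(omics_inputs, dedicated_params, shared_params):
    """Each omics vector passes through its own dedicated layer; the
    outputs are concatenated and fed to a shared integration layer."""
    features = []
    for x, (w, b) in zip(omics_inputs, dedicated_params):
        features.extend(dense(x, w, b))   # dedicated layer per omics
    w, b = shared_params
    return dense(features, w, b)          # shared fusion layer
```

The key design property is that each omics gets its own parameters (`dedicated_params`), so layer sizes and depths can be tuned per data type before fusion.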
However, the intermediate integration approach does not consider the differing contributions of the multiple omics datasets. In order to make more effective use of multi-omics data, we add an attention module that adaptively weights the features extracted from each omics; the fused representation is then used to predict the clustering assignments.

Conditional entropy annealing trick
The most unusual term in the ELBO is the conditional entropy term. It tries to minimize the KL divergence between the y-posterior and a uniform prior, which seems to work in the opposite direction of clustering. However, it is the existence of this term that makes reasonable clustering possible. In fact, this term is a Bayesian regularization term against the bias of maximum likelihood: it ensures that every category of y_i has enough chance to join the training process, so that the model does not quickly fall into a very bad local optimum.
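This regularizing effect can be made explicit: for a categorical posterior over K clusters, the KL divergence to the uniform prior equals a constant minus the entropy of the posterior, so minimizing it maximizes the conditional entropy and keeps every category in play:

```latex
\mathrm{KL}\!\left( q_\phi(y_i \mid x_i) \,\|\, \mathrm{Uniform}(K) \right)
  = \sum_{c=1}^{K} q_\phi(y_i = c \mid x_i)\,
    \log \frac{q_\phi(y_i = c \mid x_i)}{1/K}
  = \log K - H\!\left( q_\phi(y_i \mid x_i) \right)
```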
In practice, model training often collapses when equation (5) is used directly as the loss function: all clustering assignments tend to become the same value. This problem may come from the enormous contribution of the high-dimensional multi-omics data to the reconstruction term L_rec. To solve it, we need to increase the weight of the conditional entropy term. Therefore, we propose a conditional entropy annealing trick for experiments on real data, which multiplies the conditional entropy term L_centropy by a large weight at the beginning of training and then gradually reduces this weight during training so that all categories of y_i are taken into account. The weight schedule is described in Supplementary Note 3. In the following experiments, we will show that this trick is very helpful in training MCluster-VAEs.
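As an illustration only (the exact schedule used here is given in Supplementary Note 3), such a weight annealing can be sketched as an exponential decay from a large initial value toward 1; the starting weight `w0` and decay rate are hypothetical:

```python
import math

def centropy_weight(epoch, w0=100.0, decay=0.05):
    """Hypothetical annealing schedule: start with a large weight w0 on
    the conditional entropy term and decay exponentially toward 1.0."""
    return 1.0 + (w0 - 1.0) * math.exp(-decay * epoch)

def total_loss(l_rec, l_kl, l_centropy, epoch):
    """Weighted loss: the annealed weight keeps the conditional entropy
    term competitive with the large reconstruction term early on."""
    return l_rec + l_kl + centropy_weight(epoch) * l_centropy
```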

Gumbel softmax reparameterization
The training speed of MCluster-VAEs is relatively slow, because all categories of y_i need to be passed through the model when calculating the loss function. One way to solve this problem is to apply the reparameterization trick to the discrete variable y_i as well. Following the techniques introduced by Jang and Maddison (33,34), we use the Gumbel softmax distribution to approximate the categorical distribution and achieve reparameterization. The Gumbel softmax distribution can be sampled as follows:

y_{ic} = exp((log π_{ic} + g_c) / τ) / Σ_{c'} exp((log π_{ic'} + g_{c'}) / τ),   g_c ~ Gumbel(0, 1),

where π_{ic} is the posterior probability of category c of sample i, and τ is an annealing factor (temperature) that decreases gradually during the whole training. The pseudo-code of MCluster-VAEs with and without the Gumbel softmax trick is shown in Supplementary Note S3.
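The Gumbel softmax sampler referenced above (the standard Jang/Maddison construction, not the paper's own code) can be written in a few lines:

```python
import math, random

def gumbel_softmax_sample(probs, tau=1.0):
    """Draw one relaxed one-hot sample from a categorical distribution
    `probs` via the Gumbel softmax trick.
    Gumbel(0, 1) noise: g = -log(-log(u)), u ~ Uniform(0, 1)."""
    gumbels = [-math.log(-math.log(random.random())) for _ in probs]
    logits = [(math.log(p) + g) / tau for p, g in zip(probs, gumbels)]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

With a low temperature the sample is nearly one-hot, so a single differentiable sample can replace the explicit sum over all K categories in the loss.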

Datasets and preprocessing
The mRNA expression, miRNA expression, DNA methylation (450K) and copy number alteration data (abbreviated mRNA, miRNA, methylation and CNA) from the TCGA database were used in this study. MCluster-VAEs and the comparison methods were tested in the following two settings.
The Pan Cancer dataset consisted of a total of 8,314 samples across 32 types of cancer. These cancer types originated from different types of tissues and dissection positions. Therefore, the Pan Cancer dataset had a natural categorical structure (32 cancer types) and could be used to test the performance of clustering algorithms. We preprocessed this dataset using a six-step approach: 1. filter samples to ensure that they exist in all four omics; 2. perform a log transformation of mRNA and miRNA; the remaining steps are described together with the dataset statistics in Supplementary Table S1.
Secondly, ten cancer types from the TCGA dataset were used in this study (26). These datasets, with the above-mentioned four omics, were preprocessed with the same six-step approach.
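A minimal sketch of the first two preprocessing steps (sample filtering across omics and log transformation), assuming each omics dataset is a dict mapping sample ID to a feature vector; the log2(x + 1) transform is an assumption, since the text only says "log transformation":

```python
import math

def filter_common_samples(*omics):
    """Step 1: keep only samples present in all omics datasets."""
    common = set(omics[0])
    for o in omics[1:]:
        common &= set(o)
    return [{s: o[s] for s in sorted(common)} for o in omics]

def log_transform(data):
    """Step 2: log2(x + 1) transform, commonly applied to expression counts."""
    return {s: [math.log2(v + 1.0) for v in vec] for s, vec in data.items()}
```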

Evaluation criteria and comparison algorithms
Since the cancer types were known, the Pan Cancer dataset was used to test whether the clustering algorithms could correctly recover the labels. Performance was measured by unsupervised clustering accuracy (ACC), adjusted Rand index (ARI), normalized mutual information (NMI) and the clustering F measure (F1); their definitions are given in the supplementary material.
On the ten single-cancer datasets, the significance of survival analysis and the number of enriched clinical parameters were used to measure the performance of MCluster-VAEs and the comparison algorithms. The assumption is that if clusters of samples exhibit significant differences in clinical parameters and survival, they differ in a biologically meaningful way. The clinical parameters included gender, age at diagnosis, pathologic T, pathologic M, pathologic N and pathologic stage. The four latter parameters are discrete pathological parameters, measuring the progression of the tumor (T), metastases (M), cancer in lymph nodes (N) and the total progression (stage). We measured the significance of survival differences among the obtained clustering assignments using the log-rank test (35), enrichment for discrete parameters using the chi-square test, and enrichment for numeric parameters using the Kruskal-Wallis test. According to Rappoport and Shamir (36), the accuracy of the pure log-rank, chi-square and Kruskal-Wallis tests might be influenced by small sample sizes and unbalanced cluster assignments. Therefore, we followed their advice and used the corresponding permutation tests to estimate empirical P values; more details on the permutation tests can be found in their study (36). The P values were then corrected for multiple hypotheses using Bonferroni correction for each cancer and corresponding method. Note that cancer subtypes that are biologically different may have similar survival, and the same is true for the enrichment of clinical parameters.
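For reference, the adjusted Rand index can be computed directly from the pairwise contingency counts; a compact implementation of the standard formula (not the paper's own code) is:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """ARI from the contingency table: chance-corrected agreement of two
    partitions, 1.0 for identical clusterings, near 0 for random ones."""
    n = len(true_labels)
    pairs = Counter(zip(true_labels, pred_labels))
    a = Counter(true_labels)   # row sums of the contingency table
    b = Counter(pred_labels)   # column sums
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Note that ARI is invariant to a relabeling of the clusters, which is why it suits unsupervised evaluation against known cancer types.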
However, these measures are widely used for clustering assessment, including in most multi-omics clustering studies (12,26,36-38). To evaluate the clustering performance of MCluster-VAEs, we compared it with twelve state-of-the-art clustering methods.

Results

The proposed MCluster-VAEs and the state-of-the-art methods were applied to the Pan Cancer dataset, which is labeled with known cancer types, with the number of clusters set equal to the number of cancer types (i.e., 32). The clustering performance (ACC, ARI, F1 and NMI) is shown in Table 1. Except for SNF, Spectral and NEMO, which produce deterministic results, all methods were repeated five times to avoid the influence of randomness.

As mentioned earlier, the excellent performance of MCluster-VAEs may come from its "one-step" framework, which guides the model to learn cluster-friendly representations. In the visualization of the learned representations, the clusters produced by MCluster-VAEs showed a greater degree of separation. In addition, the points representing STAD tended to mix with the points representing COAD for the three "two-step" methods (light blue circle). This confusion could be caused by the fact that these two cancers are both digestive tract cancers. However, the representations learned by MCluster-VAEs separated STAD from COAD. MCluster-VAEs also made some improvements on LIHC and CHOL (green circle). The above results suggest that MCluster-VAEs can learn cluster-friendly representations and produce better clusters than "two-step" methods.

To determine the necessity of using multi-omics data, we compared the clustering results of MCluster-VAEs based on the four single-omics datasets and on the multi-omics data. MCluster-VAEs removed the attention module when processing single-omics data, because the fusion of multiple omics features was no longer required, while the other parameter settings remained unchanged. As shown in Figure 3A, CNA alone was insufficient for accurate cancer classification. In comparison to methylation, mRNA and miRNA data alone, the combined use of multi-omics data increased the ARI by approximately 18%, 19% and 33%, respectively. In addition, the ARI of the multi-omics data had a smaller distribution range than the ARI values of the single-omics data. The conclusions for ACC, F1 and NMI are consistent with those for ARI. These results indicate that MCluster-VAEs can give more accurate and stable clustering results using multi-omics information than with single-omics datasets.

The role and influence of the attention mechanism
How to integrate data from multiple sources is a major challenge in multi-omics studies. Deep learning-based methods have an inherent advantage, because a neural network can contain multiple independent layers to extract appropriate information for each omics, after which a shared layer fuses this information. This architecture, called intermediate integration, is easy to implement and has been used in many studies (22,25,26). However, different omics data contain varying amounts of clustering information, which the intermediate integration architecture can neither identify nor use to explain the contribution of each omics. The solution provided by this study is the attention mechanism, which adaptively learns the weight of each omics for each sample. The distribution of the attention scores of each omics is shown in Figure 3B. For example, certain omics contributed less to PRAD and BRCA than to other cancers, whereas CNA had a more significant effect on OV (Supplementary Figure S4). In conclusion, the above results show that the attention mechanism can help with more explainable and accurate multi-omics data integration.
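A minimal sketch of such per-sample attention fusion, assuming each omics encoder has already produced an embedding of the same dimension; scoring each embedding with a learned vector and softmax-normalizing the scores is one common realization (the paper's exact attention layer may differ):

```python
import math

def softmax(xs):
    m = max(xs)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attention_fuse(embeddings, score_weights):
    """Fuse per-omics embeddings into one vector. Each embedding is
    scored by a dot product with a learned weight vector, the scores
    are softmax-normalized into attention weights, and the fused
    vector is the weighted sum of the embeddings."""
    scores = [sum(w * e for w, e in zip(score_weights, emb))
              for emb in embeddings]
    attn = softmax(scores)               # one weight per omics, sums to 1
    dim = len(embeddings[0])
    fused = [sum(a * emb[d] for a, emb in zip(attn, embeddings))
             for d in range(dim)]
    return fused, attn
```

The per-sample weights `attn` are the kind of quantity visualized in Figure 3B: they quantify how much each omics contributes to that sample's representation.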

Gumbel softmax and conditional entropy weight annealing
The exact form of MCluster-VAEs was time-consuming, especially when the number of clustering categories was large. The Gumbel softmax version also performed better than the exact version (Figure 4). This improvement in ACC might be because the reparameterization trick improved the gradient variance and gave more space for exploration.
Another challenge of multi-omics data analysis is the curse of dimensionality, which also appears in single-omics analysis but is more severe in multi-omics analysis because of the integration of multiple data sources. One of its effects on MCluster-VAEs is that the reconstruction term overwhelms the KL divergence regularization. As a result, q_φ(y_i | x_i) quickly converges to a unique clustering assignment during the early stage of training. This is a "lazy behavior" of model training, resulting in convergence to a bad local minimum.
Figure 4. Monitoring records of the training process of MCluster-VAEs on the Pan Cancer dataset. "WA" means using the conditional entropy weight annealing trick; "Gumbel" means using the Gumbel softmax reparameterization trick; "none" means neither trick was used.

To solve this problem, we proposed the conditional entropy weight annealing trick. The annealing factor provides obvious gradient signals for each clustering category during backpropagation. Figure 4 shows the entropy, conditional entropy and ACC during the whole training period on the Pan Cancer dataset. When weight annealing was not applied, the conditional entropy declined sharply to zero in the first 10-20 epochs; correspondingly, the entropy soon stalled at a low level until the end of training. When the weight annealing trick was used, the conditional entropy still declined to zero but more slowly, the entropy rose to a higher level, and the ACC also rose to a higher value. Unexpectedly, the use of the Gumbel softmax reparameterization also eased the convergence problem, which may be because the reparameterization improved the efficiency of gradient backpropagation. Overall, by using the conditional entropy weight annealing trick and the Gumbel softmax trick, the convergence problem of MCluster-VAEs can be solved effectively.
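The two monitored quantities can be computed directly from the matrix of posterior cluster probabilities; a sketch (entropy of the average assignment versus the average per-sample conditional entropy, both in nats) is:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def monitor(posteriors):
    """posteriors: one probability vector q(y_i | x_i) per sample.
    Returns (entropy of the mean assignment, mean conditional entropy).
    A collapsed model has both values near zero; a healthy one keeps
    the first high while the second shrinks."""
    n = len(posteriors)
    k = len(posteriors[0])
    mean_q = [sum(q[c] for q in posteriors) / n for c in range(k)]
    cond = sum(entropy(q) for q in posteriors) / n
    return entropy(mean_q), cond
```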

Integrative clustering across ten cancer types with thirteen methods
We applied MCluster-VAEs and the other twelve state-of-the-art clustering methods to the ten TCGA multi-omics datasets. The clustering assignments and clinical parameters were used to calculate corrected empirical P values for the survival analysis and the enrichment tests (36). To ensure a fair comparison, we followed the setup of Yang's study. The -log10 P values of differential survival and the numbers of enriched clinical parameters of all methods on the ten datasets are shown in Figure 5, indicating that MCluster-VAEs can reveal survival differences in cancer datasets. On several datasets, MCluster-VAEs enriched multiple clinical parameters, whereas the other methods enriched no more than one. MCluster-VAEs also obtained better results than the comparison methods on the other datasets. Therefore, we can conclude that our method outperformed most of the typical methods on many TCGA datasets.
MCluster-VAEs was also applied to the single-omics data of the ten single-cancer datasets. These results again proved that MCluster-VAEs can effectively use the multiple sources of information in multi-omics data. Combined with the results on the Pan Cancer dataset, we have reason to believe that MCluster-VAEs achieves better integration of multi-omics data for clustering tasks.
Figure 7. Performance of MCluster-VAEs based on four omics or single-omics data on the ten cancer datasets. The x-axis is the number of clinical parameters enriched in the clusters, and the y-axis measures the differential survival between clusters (-log10 of the permuted log-rank test P values). Colors indicate the omics data applied. Here, MOmics represents the four omics data combined, mRNA represents mRNA expression, methy denotes DNA methylation (450K), miRNA represents miRNA expression and CNA represents copy number alterations.
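The empirical P values used throughout this comparison come from permutation tests; a generic sketch of the procedure recommended by Rappoport and Shamir, with a placeholder test statistic standing in for the actual log-rank or chi-square statistic, is:

```python
import random

def permutation_pvalue(statistic, labels, data, n_perm=1000, seed=0):
    """Empirical P value: shuffle cluster labels, recompute the test
    statistic, and count permutations at least as extreme as observed.
    `statistic(labels, data)` can be any test statistic."""
    rng = random.Random(seed)
    observed = statistic(labels, data)
    labels = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if statistic(labels, data) >= observed:
            hits += 1
    # add-one smoothing keeps the estimate strictly positive
    return (hits + 1) / (n_perm + 1)

def bonferroni(pvals):
    """Bonferroni correction for multiple hypotheses."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]
```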

Identification of marker genes in the BRCA dataset
It is crucial to demonstrate that the subtypes identified by MCluster-VAEs are biologically interpretable, as an application in medicine. To identify essential genes among the five subtypes of breast carcinoma (BRCA) identified by MCluster-VAEs (i.e., C1, C2, C3, C4 and C5), we calculated the signal-to-noise ratio (SNR) of each gene in a one-vs-rest differential procedure, treating one subtype as one group and the other four subtypes as the other group. A gene with a high absolute SNR shows significant differential expression in its subtype. We selected the ten genes with the highest SNR and the ten genes with the lowest SNR for each BRCA subtype, so that a total of one hundred marker genes were filtered and listed in Supplementary Table S6. The analysis of these marker genes (Supplementary Table S8) confirms the ability of MCluster-VAEs to acquire reasonable cancer subtypes from the original multi-omics data without cancer-related prior information.
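The one-vs-rest SNR screening can be sketched as follows, assuming the common signal-to-noise definition SNR = (μ1 − μ0) / (σ1 + σ0) between the subtype group and the rest (the paper does not spell out its exact formula):

```python
import math

def snr(group1, group0):
    """Signal-to-noise ratio between two groups of expression values:
    difference of means divided by the sum of standard deviations."""
    def mean(v):
        return sum(v) / len(v)
    def std(v):
        m = mean(v)
        return math.sqrt(sum((x - m) ** 2 for x in v) / len(v))
    return (mean(group1) - mean(group0)) / (std(group1) + std(group0))

def one_vs_rest_snr(expr, labels, subtype):
    """SNR per gene for one subtype vs the remaining subtypes.
    expr: list of samples, each a list of gene expression values."""
    in_idx = [i for i, l in enumerate(labels) if l == subtype]
    out_idx = [i for i, l in enumerate(labels) if l != subtype]
    n_genes = len(expr[0])
    return [snr([expr[i][g] for i in in_idx],
                [expr[i][g] for i in out_idx]) for g in range(n_genes)]
```

Selecting the ten highest-scoring and ten lowest-scoring genes per subtype, over five subtypes, yields the one hundred marker genes described above.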

Figure 5. Performance of the algorithms on the ten cancer datasets. The x-axis is the number of clinical parameters enriched in the clusters, and the y-axis measures the differential survival between clusters (-log10 of the permuted log-rank test P values). Colors indicate the clustering methods.
Figure 6. Kaplan-Meier survival plots of MCluster-VAEs on the ten cancer datasets.

Supplementary Files
This is a list of supplementary files associated with this preprint: supplementary.docx