NMFNA: a non-negative matrix factorization network analysis method for identifying communities and characteristic genes from pancreatic cancer data

Background: Pancreatic cancer (PC) is a common type of digestive system disease. Comprehensive analysis of different types of PC genetic data plays a crucial role in understanding its biological mechanisms. Currently, Non-negative Matrix Factorization (NMF) based methods are widely used for data analysis. Nevertheless, it is a challenge for them to integrate and decompose different types of data simultaneously. Results: In this paper, a Non-negative Matrix Factorization Network Analysis method, NMFNA, is proposed, which introduces a graph regularized constraint to NMF, for identifying communities and characteristic genes from two-type PC data. Firstly, three PC networks, i.e., methylation network (ME), copy number variation network (CNV), and the bipartite network between them, are constructed by both ME and CNV data of PC downloaded from the TCGA database, using the Pearson Correlation Coefficient. Then, the NMFNA is proposed to detect core communities and characteristic genes from these three PC networks effectively due to its introduced graph regularized constraint, which is the highlight of NMFNA. Finally, both gene ontology enrichment analysis and pathway enrichment analysis are performed to deeply understand the biological functions of detected core communities. Conclusions: Experimental results demonstrated that the NMFNA facilitates the integration and decomposition of two types of PC data simultaneously and can serve as an alternative method for detecting communities and characteristic genes from multiple genetic networks. demo


Background
Pancreatic cancer (PC) is an extraordinarily deadly and highly invasive disease of the digestive system [1]. Consequently, analyzing multiple biological data comprehensively of the whole biological system from the PC data has become a hot topic in the field of Bioinformatics [2], and various methods for understanding tumorigenesis of PC have been proposed. For instance, Akashi et al. [3] applied the lasso penalized Cox regression method to identify genes that were relevant to PC progression. To predict biomarkers of PC, Yang et al. [4] selected differentially expressed genes and pathways using a threshold screening method. Gong et al. [5] combined the doubly regularized Cox regression method with the integration of pathway information to detect a comprehensive set of candidate genes with PC survival. Kwon et al. [6] introduced the support vector machine to evaluate the diagnostic performance of PC tissues' biomarkers. Tao et al. [7] performed a comprehensive search of electronic literature sources to mine information on PC data. Li et al. [8] applied the k-nearest neighbor method and random forest algorithm to identify suspicious genes of PC.
In addition to above-mentioned methods, several Non-negative Matrix Factorization (NMF) based methods also have been proposed to detect pathogenic factors associated with PC. For instance, Mishra et al. [9] used the NMF to distinct molecular subtypes of PC methylome data. Wang et al. [10] proposed the maximum correntropy-criterion based NMF (NMF-MCC) to cluster the PC gene expression data. Zhao et al. [11] proposed the NMF bi-clustering method to determine the subtypes of PC. Ding et al. [12] proposed the orthogonal Non-negative Matrix Tri-Factorizations (Tri-NMF) with class aggregate distribution and multi-peak distribution to distinguish different molecular subtypes of cancers. Zhang et al. [13] introduced the joint NMF (jNMF) method to identify characteristic 5 genes of cancers, which can project different types of expression data onto a common coordinate system. Xiao et al. [14] proposed the graph regularized non-negative matrix factorization (GRNMF) method to infer potential associations between miRNAs and pancreatic neoplasms.
Nevertheless, most of them only considered individual genetic effects and ignored interaction effects among different features. It is widely believed that interaction effects among different features could unveil a large portion of the unexplained heritability of cancers [15]. To capture these interaction effects, several NMF based network analysis methods have been proposed due to network facilitating presenting interactions between genes. Yang et al. [16] introduced the integrative NMF (iNMF) to analyze the multi-omics matrices jointly, which included a sparsity option for decomposing heterogeneous data. Chen et al. [17] adopted a novel non-negative matrix factorization framework in a network manner (NetNMF) to identify modular patterns of cancers by integrating both gene expression network and miRNA network, which might ignore the spatial geometric characteristics of the data while analyzing the gene expression network. Gao et al. [18] proposed the integrated graph regularized non-negative matrix factorization (iGMFNA) model for clustering and network analysis. Nevertheless, rather than fusing and analyzing multiple networks directly, the iGMFNA decomposed the integrated data and used the decomposed sub-matrix to construct the network, which might weaken the internal connection between nodes in the network. Zheng et al. [19] integrated methylation (ME) network and copy number variations (CNV) network to identify molecular subtypes of ovarian cancer, which provides new insight into early disease diagnosis. Though these NMF based network analysis methods are not used for PC, they show their successes in capturing cancer patterns involving interaction 6 effects, which provides a direction to unveil the unexplained heritability of PC.
In this paper, we proposed a Non-negative Matrix Factorization Network Analysis method, NMFNA for short, based on graph regularized constraint, to detect core communities and characteristic genes of two-type PC data. Firstly, the Pearson Correlation Coefficient (PCC) was used to construct the ME network, CNV network, and the bipartite network between them by both ME and CNV data of PC downloaded from the TCGA database. Then, the NMFNA is used to integrate and decompose these networks, as well as to detect core communities and characteristic genes from them. Finally, we performed Gene Ontology (GO) enrichment analysis and pathway enrichment analysis to validate the biological functions of detected core communities. The experimental results demonstrated that the NMFNA can reveal the mechanisms of how the two types of features interact with each other and can serve as an alternative method for identifying communities and characteristic genes from multiple genetic networks.

Method Typical NMF Methods
The NMF [20] and its variants have been increasingly applied to identify communities in a biological network. Given a gene expression data matrix X mn  , the NMF decomposes it into two non-negative matrices U mk  and V kn  such that X UV Euclidean distance between X mn  and its approximation matrix obtained by the multiplication of U mk  and V kn  , is applied to minimize the factorization error, which can be written as, , min . . , .
In addition to two-factor NMF, three-factor NMF also plays an important role in the process of matrix factorization. Different from two-factor NMF, three-factor NMF absorbs different scales of X mn  , U mk  , V kn  through an extra factor S : X USV  . A three-factor NMF variant, TriNMF, was used in NMFNA [12], which is defined as: where U mk  and V kn  are used to identify communities from different levels, S is an kk  matrix presenting the relations between identified communities. The update process to minimize the objective function of (4) is,  , respectively [17]. The update process to minimize the objective function of the NetNMF is,

Non-negative Matrix Factorization Network Analysis Method
To improve the performance of analyzing multi-type networks, as well as identifying 9 communities and characteristic genes, we proposed the NMFNA method by incorporating the graph regularized constraint [21], which can indicate the inherent geometrical structure of the input networks. In other words, graph regularized constraint ensures that interacted genes in the represented Euclidean space are also close to each other in the low-dimensional space, which can be defined as, where Z ij represents the similarity between gene . . , , , .
Similarly, R  are used to measure the connection between communities from the ME network and CNV network, respectively. k is the number of dimensionality reduction.
According to the NetNMF [17],  and  are used to balance three Frobenius norms, which are also set to − is used to identify the one-to-one interactions between communities from the ME network and CNV network. The tuning parameter  is used to adjust the closeness between the interactive genes. 10 To solve the optimization problem of NMFNA, the strategy of multiplicative update rules is adopted. ( (14) is, The partial derivative of f with respect to matrices S 11 , S 22 , G 1 and G 2 are as follows, Through Karush-Kuhn-Tucher (KKT) conditions [22] The iterative process, the schematic diagram, and the convergence proof of NMFNA are 11 shown in Table 1, Fig.1, and Additional file 1, respectively.

Determination of Communities
Two types of communities could be identified from the ME network and CNV network while G 1 and G 2 were obtained, respectively. For the ME network matrix, z-scores ij x were calculated as ME weights at each column, and those MEs whose weights were higher than the threshold were considered as members of the corresponding community. Here, the threshold was set to 2  = according to the reference [17].
The matrix S 12 indicated the similarity between G 1 and G 2 , which was obtained by the decomposing matrix R 12 , R G S G T 12 1 12 2  . As a diagonal matrix, R 12 enforced the i th community of the ME network only to interact with the i th -community of the CNV network.
Furthermore, the matrix S 11 represented the weight of GG T 11 in the reconstruction of R 11 . In the study, the gene numbers in communities indicated their importance. Hence we used the non-diagonal elements in S 11 to indicate the importance of relationships of communities from the ME network. Similarly, the non-diagonal elements in S 22 were used to indicate the importance of relationships of communities from the CNV network. Same to previous studies [23,24], this study also detected core communities from the ME network and CNV network, which refer to the communities with the most gene nodes. 12

Analyses of Communities and Characteristic Genes
GO enrichment analysis was carried out to annotate genes and identify biological features from core communities by using the DAVID online tool (https://david.ncifcrf.gov/) [25].
Besides, the KOBAS online analysis database (version 3.0) (http://kobas.cbi.pku.edu.cn/) [26] where () Dx , () Bx and () Ex are degree centrality, betweenness centrality, and eccentricity centrality of gene x . Betweenness centrality and eccentricity centrality only represent global feature of genes in the network, while degree centrality only represents a local feature of genes in the network [27]. Hence, the Multi-measure Score combines both global and local features of genes. By examining the Multi-measure Score of each gene in the network, we can detect the most suspicious characteristic genes.

Results and discussions Datasets
The two-type Pancreatic cancer data were downloaded from the TCGA database (https://tcgadata.nci.nih.gov/tcga/), which contains ME data and CNV data. The datasets of 13 PC are summarized in Table 2. We calculated the adjacency matrix R 11 to represent the ME network, R 22 to represent the CNV network and R 12 to represent the bipartite network between them by PCC, respectively. In the network , where E is the edge set and V is the node set. Element to be the same for both terms [28]. From Fig. 2A it is seen that , and x C is the members of community x . To reduce the error of the decomposition results, we iterated multiple times of NMFNA. Fig. 2B shows that the convergence of NMFNA is relatively stable after iterating 200 times. In addition, k is determined by SVD method. Specifically, the first inflection point based on SVD, k 6834 = , was used as the dimensionality reduction value to approximate the PC matrix. 14

Analyses of GO and Pathway Enrichment
To demonstrate the validity of NMFNA, we compared it with NetNMF, TriNM, and NMF on PC data. Core communities were selected from the ME network and CNV network using NMFNA, NetNMF, TriNMF, and NMF methods, respectively. These core communities were used to perform the GO and pathway enrichment analysis. Fig. 3 shows numbers of GO terms and pathways (which satisfy P-Value 0.05  ) respectively. From Fig. 3A and Fig. 3B, it is seen those core communities identified by NMFNA include more GO terms and pathways than those identified by other methods from the ME network and CNV network.
To further verify the biological significance of communities detected by NMFNA, we visualized the significant GO annotations ( P-Value 0.05  ) in Fig. 4 and Fig. 5. In both of them, GO terms mainly included three functional groups: molecular function, biological process, and cellular component. In Fig. 4, the core community from the ME network was mainly E − ) has been identified as the key target in PC [36], which promoted the process of antitumor drug development. And the detailed information of genes enriched in each GO term is shown in Table 3.

Identification of significant pathways
Furthermore, P-Values in common pathways from core communities obtained by four methods were also compared, which were listed in Table 4. It is seen that, in terms of P-Values, as well as P-Values that corrected by False Discovery Rate (FDR), NMFNA performs best among all compared methods, implying that pathways enriched by NMFNA core communities were more accurate than those enriched by core communities of other comparison methods.
In particular, two pathways, i.e., transmembrane transport of small molecules and 16 arachidonic acid metabolism, have been reported to be associated with PC [37,38]. To aid early diagnosis, the metabolism pathway can be used, individually or in combination, to differentiate people with PC from their healthy peers. Also, three metabolism-related pathways, including lipid metabolism and autophagy, glutamine-regulatory enzymes [39], and Akt/c-Myc pathway [40], have been identified to directly affect the growth of PC cells.
Metabolic pathways are known as one of the most significant pathways in biological processes. Among them, the small molecule metabolic process [41] has been found as the enriched pathway for the biological process of PC. The reference [42] reported that the identification of immune system-related regulation pathways could provide new insights for PC treatment and prognosis.
To further verify core communities detected by NMFNA, we listed details of the top 10 pathways according to their P-Values in Fig 6, in which, the size of each node denotes the number of genes that are enriched in the corresponding pathway. The deeper the color of the node, the smaller the P-Value of the pathway is, therefore the more important the pathway is. Fig. 6A shows the top 10 pathways of the NMFNA core community from the ME network.
Similarly, Fig. 6B shows the top 10 pathways of the NMFNA core community from the CNV network. Among these pathways, transport of small molecules pathway, metabolism of xenobiotics by cytochrome P450 pathway, and arachidonic acid metabolism pathway have been confirmed to be associated with PC. Specifically, transmembrane transport of small molecules pathway has been reported to filter downregulated differentially expressed genes of PC [37]; metabolism of xenobiotics by cytochrome P450 pathway has been considered as an important pathway associated with the progression of cancer [43]; the arachidonic acid metabolism pathway and its downstream pathways have been demonstrated to promote the 17 progress of PC [38].

Detection of Characteristic Genes
To detect characteristic genes of core communities, we only considered edges with strong PCC (>0.8) values in these two networks. The Multi-measure Score of each gene was regarded as an indicator of its contribution to the interactions among genes in the ME network and CNV network. The larger the value of the Multi-measure Score is, the more significant the gene is. After sorting genes of core communities in descending order using the Multi-measure Score, we respectively selected the top 10, top 30, and top 50 characteristic genes to sum over their total scores from both ME and CNV networks (Table 6), which showed the importance of characteristic genes related to PC. Besides, Table 5 listed the names of top 10, top 30, and top 50 characteristic genes detected by NMFNA method. It is seen that in Table 6 the scores of NMFNA were larger than those of other compared methods, which indicated that characteristic genes detected by NMFNA were more relevant to PC than characteristic genes detected by other compared methods.
Besides, total PC-related genes of the top 10 characteristic genes detected by compared methods in both ME network and CNV network were listed in Table 7. It is seen that the NMFNA detected more PC-related genes than other methods. We further analyzed the biological functions of these PC-related genes detected by NMFNA. TNKS1BP1 has been reported to regulate actin cytoskeleton and cancer cell invasion, which might also be related to the progression of PC [32]. The level of TK1 is up-regulated 4-fold in the mice PC specimen, therefore we speculated that it might also play a potential role in the human PC [33]. The variation of KCNJ1 is associated with diabetes [44], which is the major risk factor of PC. As an independent prognostic marker, SDC1 has been identified to up-regulate in PC [45]. SDC1 is an important paralog of SDC3, hence we speculated that SDC3 may also be related to PC. NR3C2 has been identified as the target of miR-135b-5p [46], which may promote migration, invasion, and EMT of PC cells. USF1 is a protein-coding gene, which has been verified to be associated with Hyperlipidemia. As a protein-coding gene, TTC21A and its important paralog TTC21B are associated with several diseases including spermatogenic failure and non-syndromic male infertility. MAN1C1 is a protein-coding gene, which is associated with breast abscess, alcohol-induced mental disorder, and other diseases.
It is worth noting that USF1, TTC21A, and MAN1C1 have been marked as PC associated genes in the GeneCards database.

Conclusions
To identify core communities and characteristic genes that are associated with PC, in this

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and material
The datasets of Pancreatic cancer that support the findings of this study are available in https://portal.gdc.cancer.gov/projects/TCGA-PAAD. The demo codes of NMFNA are available online at https://github.com/CDMB-lab/NMFNA.      28