Network-Based Metric Space for Phenotypic Stratication of Samples Using Transcriptome Proles

With the advancements of high-throughput sequencing technology, several recent studies addressed the clinical/phenotypic stratiﬁcation of samples by utilizing transcriptome data. However, existing stratiﬁcation methods lack efﬁcient utilization of gene interaction information, and furthermore, handling more than 20,000 genes causes the curse of high dimensionality that hinders elucidating the linkage between genetic proﬁles and clinical/phenotypic differences. To overcome these challenges, we propose a network-based two-step computational framework. We ﬁrst reduce dimensions of transcriptome to a few tens of dimensions by mapping transcriptome to protein interaction network followed by performing network propagation algorithm and clustering analysis. Then, each network is converted into a single numeric metric by utilizing information theoretic quantiﬁcation of gene expression abnormality, which results in a single sample mapping to a metric space generated by each subnetwork in the form of vectors. The proposed network-based stratiﬁcation method was used to analyses Pan-Caner dataset and Oryza sativa dataset. Extensive experiments showed that our method generates a metric space that captures data-speciﬁc biological functions and improves the stratiﬁcation performance compared to existing methods. Therefore, the proposed method successfully stratiﬁed the samples, addressing the problem in the complex gene space. The proposed method is implemented in Python and available at https://github.com/Sunginyoung/net_stratiﬁcation.


Introduction
Biological state of an organism can be defined by the various regulatory mechanisms of genes. When regulatory disorder or malfunction occurs in the regulation mechanism within cells, the state of the organism becomes an abnormal state out of normal state. For example, cancer is an abnormal state that results from uncontrolled cell proliferation or evasion of programmed cell death [1][2][3][4] . In the case of plants, external stimuli such as drought stress alter the growth or survival by changing major physiological processes such as photosynthesis, water systems and hormone metabolism 5,6 . One of the ways to infer the state of cells is to measure the state of gene expression. With the development of sequencing technology, transcriptome data of biological samples are measured and accumulated in biobanks such as The Cancer Genome Atlas (TCGA) or Gene Expression Omnibus (GEO) 7,8 . Therefore, using huge amount of transcriptome data, it will be possible to measure the dysfunction of the regulatory mechanism according to the degree of abnormality, and furthermore, achievable to stratify the samples in a clinically and phenotypically meaningful direction through the degree of dysfunction 9,10 .
Therefore, for stratification of samples, it is necessary to be able to define the distance between samples. However, it is difficult to calculate the difference between the genetic profile and the clinical or phenotype of the samples using transcriptome data for more than 20,000 genes. In the classical method, it is assumed that all genes are on the same line, that is, have equal importance regardless of order, and the gene expression represented as a vector to measure the distance between sample using basic mathematical methods such as Euclidean distance or Pearson's correlation 11,12 . Furthermore, there have several efforts to reduce gene space using prior knowledge, such as hallmark geneset in cancer or organ specific gene signature in the Human Protein Atlas (HPA) or to consider interactions between genes through biological pathways or experimental analysis [13][14][15][16] . However, these methods may not reflect the numerous biological mechanisms that can arise from relationship between genes to compare samples by interpreting gene relationships one-dimensionally. On the other hand, there are methods that consider interactions between genes using biological networks such as protein interaction network or gene regulation network 17,18 . The overview of the proposed method. From gene expression data and biological template network, STEP 1: To generate condition specific subnetworks, gene interactions were re-calculated based on propagation score, through a network propagation algorithm with DEGs as seeds. Then, in a redefined weighted network, subnetworks were created from the results of gene clustering using a community detection algorithm. STEP 2: To measure phenotypic changes using a network, NetJSD was calculated using information theory and network structure in each subnetwork. Then, a single sample is mapped into the metric space coordinated by each subnetwork. Finally, samples are stratified in the metric space in clinically and phenotypically meaningful ways. However, since most existing studies are using the whole network, these approaches are still insufficient for finding meaningful information unique to a given data set without using prior knowledge within a network where many information is convoluted. Therefore, we attempted to find meaningful information by reflecting the properties of a given dataset without prior knowledge and to measure the abnormality of sample using information theory, in the biological template network that can consider the interaction of genes. In this paper, we proposed a network-based two-step computational framework for constructing metric space for measuring and stratifying samples of different phenotypes. We first reduced the dimension of 20,000 transcript genes in the template network to tens of dimensions. By mapping the transcript to a public protein interaction network, we performed gene clustering through two network analysis algorithms: network propagation and community detection algorithm, and generated tens of subnetwork from clustering result 19,20 . Then, in each subnetwork, the abnormality 2/13 in gene expression in each sample was transformed into a single metric value using network structure and information theory. As a result, a single sample is mapped into the metric space coordinated by each subnetwork in the form of a vector. We used two different public datasets: 14 cancer types from the Pan-Cancer Atlas and drought response over time for three Oryza sativa cultivars from GEO, to evaluated the proposed method. In the pan-cancer dataset, we showed that the proposed method generates subnetworks reflecting the properties of the given data in the pathway enrichment test and metric space coordinated by the subnetworks successfully stratifies the samples in the survival analysis for four clinical endpoints. In particular, we evaluated the phenotypic changes in the sample in a metric space coordinated by the subnetworks and outperformed existing methods in survival analysis for all endpoints. In the Oryza sativa cultivar dataset, we showed phenotypic changes over time and cultivars as drought stress continued. Therefore, the metric space generated by the proposed method efficiently captured clinical or phenotype information of samples and successfully stratified the samples, addressing the problem in the high-dimensional gene space of transcriptome data.

Results and Discussion
The overview of our method is showed as Figure 1. The proposed two-step computational method aimed to generate networkbased metric space to measure abnormality compared to the normal state and the stratify the samples using the measured values (see details in the Methods section). This method constructed condition specific subnetworks of given gene expression data and protein-protein interaction (PPI) network using two network analysis algorithms in the first step ( Figure 1. STEP 1). For the given gene expression data of normal and query samples, a network propagation algorithm was performed in PPI network using seed genes as different expression genes between normal and query 21 . As a result of propagation, the weight of edge of the PPI network was redefined using the propagation score (or diffusion score) of each gene. And then, gene clustering was performed using a community detection algorithm on the PPI network with redefined edge weight 20 . The given data specific subnetworks were constructed by constraining the network size for the result of gene clustering (here, we constrained the number of nodes and edge weight). With the subnetworks result from the first step, samples are mapped into a metric space generated by each subnetwork utilizing information theoretic quantification values in the second step ( Figure 1. STEP 2). For each subnetwork, each sample was measured phenotypic changes from normal samples utilizing network level Jensen-Shannon divergence and network structure level importance. By combining these two changes as NetJSD (Equation 7), the abnormality was defined for each query sample. Finally, all query samples map into a metric space coordinated by subnetworks in the form of vectors.

Descriptions of datasets
The proposed network-based stratification method required gene expression data of normal and query samples and biological template network. For gene expression dataset, we used with two different public datasets: cancer patients from Pan-Cancer Atlas and drought response over time for three Oryza sativa cultivars from GEO. Pan-Cancer Atlas provided expression data profiled by TCGA in matched tumors and adjacent normal samples for 33 cancer types (TCGA RNASeqV2 http://www.cancergenome.nih.gov/). Among them, we used cancers with both normal and tumor patient samples present and with more than 300 tumor patient samples, and as a result, we evaluated 14 cancer types: Breast Invasive Carcinoma Furthermore, we evaluated Oryza sativa cultivar(commonly known as Asian rice) dataset to showed that the proposed method can stratify samples while representing biological knowledge in addition to cancer patient dataset with clinical information. Oryza sativa cultivar dataset with drought response over time was obtained from GEO under the accession number of GSE142470 and GSE74465. The GSE142470 dataset consists of four subtypes of 20 rice cultivars 22 . The GSE74465 dataset consists of three time points (0h, 1h and 6h) RNA sequencing measured to drought resistance for three rice cultivars: Nipponbare (Oryza sativa ssp. japonica, Nip), Nipponbare AP2 transgenic plants (AP2), Vandana (Van) 23 .
For biological template network, we used protein-protein interaction (PPI) network where nodes are genes and edges are interactions. From STRING portal (https://string-db.org), we obtained two different species PPI networks: Homo sapiens and Oryza sativa. STRING generated PPI networks with literature-based functional interactions 24 . Also, STRING provided combined scores that probabilistically calculated the interactions between gene for each edge 25 , but these edge scores are not used in this paper. Homo sapiens PPI network consisted of 19,354 genes and 5,879,727 edges and Oryza sativa PPI network consisted of 25,106 genes and 4,474,524 edges.

3/13
Cancer Normal Tumor Node Edge Network Group1 Group2   For each cancer type, we mapped samples into a metric space coordinated by its subnetworks in the form of a vector of NetJSD values. Then, we divided samples into two groups using a K-means clustering algorithm to evaluate how well the proposed method metric space stratified the samples. As a result of K-means clustering, each cancer type was divided the samples into two groups as shown in 'Group1' and 'Group2' in Table 1 and Figure 2 shows the distribution of each group for the average value of NetJSD across all subnetworks of each cancer type, showing that the average NetJSD between two groups 4/13  . The x-axis represents Principal Component 1 and the y-axis represents Principal Component 2. The color of the point represents the average value of NetJSDs and the shape of the point represent the K-means groups: Group1 is a blue square, Group2 is an orange triangle. (c) Subsubnetwork of Subnetwork 11. This network is a subsubnetwork for 10 genes with large differences between the two groups in Subnetwork 11, where the estrogen signaling pathway is concentrated: the bigger the difference, the bigger the letters and circles and the redder the color.

Pan-Cancer dataset
is significantly different. Using K-means two groups, we perform survival analysis on four clinical endpoints: OS, DSS, DFI and PFI (Kaplan-Meier curves for each cancer type in Supplementary figure). As a result, we showed that the log rank p-value of the survival analysis between two groups was below 0.1, which was significant result in all cancer types except COAD and BLCA. Interestingly, the endpoint with the lowest p-value for each cancer was different, which can be said to capture the properties of the cancers in the TCGA dataset 29 . Therefore, we showed that the proposed method reduces the complex genetic space by reflecting the properties of a given cancer type and generates a metric space that stratifies cancer patient samples clinically meaningful way. For more detailed interpretation and evaluated the results, we used the BRCA dataset to analyze the function of the subnetworks, the measured phenotypic change for each subnetwork and the sample stratification result compare to the existing methods.

BRCA dataset
The proposed method generated 12 subnetworks with 13,012 gene and 379,546 edges for the BRCA dataset, as shown in Table 2. As a result of the pathway enrichment test using KEGG to investigate the biological function of the subnetworks, the subnetworks were enriched in cancer-related pathways such as PI3k-Akt signaling pathway (hsa04151), Cell cycle (hsa04110),  Table 1 Table 3. Performance comparison result. The table shows the p-value of the survival analysis results on four clinical endpoints using the three existing methods (Euclidean distance, tITH and TSD), the network level JSD value without the network structure level importance and the propose method. The bold value represented the best performance for each endpoint.
Olfactory transduction (hsa04740) and Pathwaysin cancer (hsa05200), and in particular, it showed enrichment results in breast cancer-related pathways such as Estrogen signaling pathway (hsa04915) and Breast cancer (has05224) (detailed in Supplementary file 2) 27, 28, 30-35 . Therefore, we showed that the propose method generated subnetworks that capture the properties of breast cancer patient samples. Using subnetwork generated BRCA dataset, we mapped samples into a metric space in the form of a vector with length 12 and then, we divided samples into two groups using K-means clustering algorithm. Figure 3 shows the results of BRCA dataset mapped into a metric space. Figure 3 (a) shows the NetJSD results for each sample measured in each subnetwork and K-means groups of each sample (Group 1 is blue and Group 2 is orange) and Figure 3 (b) is the result of mapping samples NetJSD values into a two-dimensional space using Principal Component Analysis (PCA, Group1 is a blue square, Group2 is an orange triangle and the color is the average NetJSD value of the entire subnetwork). And Figure 3 (c) shows the subsubnetwork for top genes with the large differences between the two groups (the bigger the difference, the bigger the letters and circles and the redder the color).We showed that the NetJSD values between two groups differ significantly and samples from the same group have similar NetJSD trends across most of the subnetworks, as shown in Figure 2 and Figure 3 (b). In particular, when the difference between the two groups was measure, it showed an interesting result that the top genes belongs to all Subnetwork 11, which is enrich in the Estrogen signaling pathway (Figure 3 (c)). Furthermore, we performed survival analysis on four clinical endpoints and produced a Kaplan-Meier (KM) plot to investigate the BRCA dataset stratification in a metric space (Figure 4). As a result, we showed that the log rank p-value is significant on all endpoints except DSS and the higher the NetJSD values (that is, Group2 with the larger NetJSD) the worse the prognosis. Therefore, these results showed that a BRCA metric space successfully represents abnormalities in cancer patient samples and, consequently, well stratified samples for clinical prognosis.
We performed comparison analysis to evaluate how well samples are stratified in a clinically meaningful way. We compared our method with classical mathematical method (Euclidean distance) and two existing methods (tITH 18 and TSD 13 ). In addition, we compared the values of the network level JSD without using network structure information (called JSD) to evaluated the importance of the combination of two proposed changes. Each comparison method was calculated each samples abnormality in the following way. Euclidean distance is a classical and basic distance method of measuring the length of a linear segment between two points. From Euclidean distance, we measured the distance of the sample from the normal sample by setting each sample as a point in the gene space generated using 15,477 genes in STRING network. tITH is a method of measuring the heterogeneity of the query sample compared to the normal sample using information theory in the network. From tITH, we measured the distance of the sample from the normal sample using STRING with 15,477 genes and 387,144 links. And TSD is a method of measuring the distance between normal and query using information theory and rank correlation for tissue specific 6/13 signature genes provided by HPA. From TSD, we measured the distance of the sample from the normal sample using 156 breast specific signature genes in HPA. To compare the patient stratification results using the phenotypic change from each method, we divided sample into two groups using K-means clustering and performed log rank test to analyze the survival of the two groups. Table 3 shows the results of the log rank p-value at the four clinical endpoints of each method (Kaplan-Meier curves for each comparison method in Supplementary figure). As a results, the proposed method outperformed the existing method in all endpoints. In addition, combining two changes, network level JSD and network structure level importance, showed better results than Network JSD with only one change. All methods showed low p-value results in OS or DFI and TSD showed the best results among comparison methods. Through the result of TSD using the prior knowledge provided by HPA, it can be said that selecting important genes according to the given dataset in the complex gene space shows better result. In addition, when comparing the JSD and NetJSD results with tITH result, it can be argued that reducing the gene space by generating subnetworks to fit the given dataset shows better performance than using the entire template network. As a results, our method significantly reduced the gene space without using prior knowledge and was able to stratify samples more than other methods in the generated metric space.

Oryza sativa dataset
In addition to the cancer patient dataset with clinical information, we evaluated changes in plants caused by stressful environment.
As the drought stress continued in three Oryza sativa cultivars, we showed how each cultivar changes over time in the standard rice conditions and what difference exist between cultivars. The proposed method generated 11 subnetworks with 2,865 genes and 35,171 edges for the Oryza sativa dataset, as shown in Table 4. To investigate the function of the reduced gene space by the proposed method, we performed pathway enrichment test using plant organism gene set enrichment website PlantGSEA 36 . All genes constituting the metric space were enriched in metabolic pathways such as Biosynthesis of secondary metabolites (osa01110) and Glycerophospholipid metabolism (osa00564) 37,38 and each subnetwork was enriched in the pathways shown in the Table 4 (detailed in Supplementary file 3). Interestingly, the subnetworks showed highly enrichment in pathways known as drought resistance or survival maintenance such as the photosynthetic pathway (osa00195), the phosphatidylinositol signaling pathway (osa04070), and the oxidative phosphorylation pathway (osa00190) 23,39,40 . As a result, we showed that the proposed method generate Oryza sativa subnetworks composed of genes that respond to the drought stress and are involved in biological processes for sustaining life. Figure 5 shows the result of nine Oryza sativa samples mapped to a metric space. Figure 5 (a) shows the NetJSD result of each sample measured in the subnetworks and Figure 5 (b) shows the values measured in each gene of consisting the subnetwork. And Figure 5 (c) showed the NetJSD values at 0h and 6h for each cultivar in the photosynthetic pathway, one of the pathway where the relationship between rice and drought resistance has been most studied 41,42 . We interpreted the results for the Oryza sativa dataset from two perspectives: time-wise and cultivar-wise. For time-wise perspective, most of the cultivars showed an increase in NetJSD values as the drought continued. In particular, the difference between 0h when drought stress start and 6h when the drought stress was sufficiently sustained was showed that 6h were on average 1.5 times greater than 0h in all cultivar in all subnetworks. These results were also confirmed in the photosynthetic pathway. In Figure 5 (c), the values of each gene in the pathway increased over time. And in particular, it showed that genes such as Os01g0869800 (chlorophyll a/b binding protein) and Os07g0513000 (ATP synthase subunit gamma, chloroplastic) that changes significantly over time, i.e., showed large abnormal expression compared to the normal rice state, were associated with drought stress [43][44][45] . Therefore, it can be showed that each cultivar represents in increasing direction in the proposed metric space for sustained drought stress and Each sample on the Oryza sativa metric space is presented as a vector with NetJSD values. Row is subnetworks which are axes of the space and columns is Oryza sativa samples. For each sample, it is denoted by cultivar (Nip is silver, AP2 is grey and Van is black) and time point (0h is blue, 1h is orange and 6h is green) and each subnetwork is denoted by in Table 4. (b) Box plots of the abnormality of each gene in the metric space at each time in each cultivar. The x-axis is color-coded (upper left) with the over of time for each cultivar, and the y-axis is the abnormality value for all genes in the metric space. (c) Photosynthetic pathway in Oryza sativa metric space. The x-axis represents cultivar type, and the y-axis represents time points 0h and 6h. The color of genes in the network represents NetJSD. The size of genes represents network centrality life sustaining efforts over time.
For cultivar-wise perspective, it can be showed that NetJSD increased in the order of Nip, AP2 and Van in most subnetworks. This result can be argued that the NetJSD in each subnetwork reflects the phenotypic properties of the rice cultivar well over time. Since Nip is a subspecies of Oryza sativa japonica and AP2 is a drought-resistant transgenic plant of Nip, AP2 will show stronger resistance to drought than Nip 46 . On the other hand, Van is a drought-resistant improved Indian highland cultivar developed in tropical japonica/aus cross, which is a strong drought-tolerant Southeast Asian rice cultivar, and is expected to survive with stronger resistance to drought than other cultivars [47][48][49][50] . Therefore, these results showed that the Oryza sativa metric space successfully represented the degree of activation of life sustaining mechanisms and the drought stress response for each cultivar, and as a result, the proposed method could stratify the three Oryza sativa cultivars over time.

Conclusion
In this paper, we proposed a novel network-based stratification method that constructs the metric space for measuring phenotype changes quantitatively using transcriptome data. To effectively handle the space of more than 20,000 genes, we developed a two-step computational framework. This method performed gene clustering in biological networks, through this result, generated condition specific subnetworks that reflect the properties of the given data. Then, for each subnetwork, we measured the distance between patients or samples using network structures and information theory to quantify the degree of abnormality compared to the normal state. As a result of experiments on Pan-caner dataset and GEO rice dataset, we showed that the proposed metric space consists of the meaningful biological functions that were specifically to the given data and successfully 8/13 stratifies the transcriptome data samples. In addition, it showed outperformed the conventional transcriptome data-based distance measurement methods in survival analysis. In particular, we showed that the results of combining two changes, network level JSD and network structure level importance, which measure the changes both gene itself and surrounding gens in the network can capture the actual phenotypic changes in the biological samples. Therefore, the proposed metric space adroitly captured the clinical or phenotype information of samples and successfully stratified the transcriptome data samples in clinically or phenotypically meaningful ways. Moreover, the proposed computational framework has potential applications as it proposes a method to measure the difference between samples no matter what kind of data comes in. As a future work, it will be much more possible to create a linear metric space for easier interpretation and comparison of samples. We are exploring several computational methods of this, including a density-based clustering method that could generate a linear ordering of data points while performing clustering analysis.

Methods
In this paper, we proposed a network-based two-step computational method: constructing condition specific subnetworks (STEP 1) and mapping samples into a metric space (STEP 2) (Figure 1).

Construction of condition specific subnetworks
Public biological network is a powerful resource that help to understand cellular mechanisms through proteins and their interaction. Among them, STRING protein-protein interaction (PPI) network is a well-known protein interaction networks that comprehensively reflects functional and physical interactions of proteins using public available resources 24 . However, since STRING predicted the interaction with an approximate probability (called confidence score in STRING) that predicted link exists between two enzymes in the same metabolic map in the KEGG, there is a limit to measuring and interpreting phenotypic change specifically for given data using the whole network 51 . Therefore, we constructed give gene expression data specific subnetworks using two network analysis algorithms: network propagation algorithm and community detection algorithm.

Re-defining of gene interactions using network propagation
In the public biological network, STRING PPI network, we used largest connected component network with a threshold of edge weight as 0.7 to use a template network with high confidence and low false positive 52 . To re-define the genetic relationships, that is, edge weight in the template network by reflecting the properties of the given gene expression data, we used a network propagation algorithm 53 . Network propagation is a method that can show the impact across the network through random and repetitive walks with nodes connected to the seed nodes, starting with given seed nodes. Here, we used genes with large expression difference between the normal and query samples as seed nodes. For initial probability distribution P 0 and adjacency matrix W, the probability distribution as follows: where t is the propagation step and α is restart rate that describes the trade-off between prior information and network smoothing.
In this paper, we used the random walk with restart (RWR)-based method for prioritizing candidate gens using global network distance measurement and random walk analysis to define the similarity of network 21 . As result of the RWR-based method, we obtained a propagation or diffusion score of each gene. And then, we constructed a network with given data specific edge weight by re-defining the interaction of two nodes as the product of the propagation scores of the two interacted nodes.

Gene clustering using community detection
To perform gene clustering by reflecting the properties of given data to a large-scale network, we used the Louvain method on a network with re-defining edge weights a network propagation algorithm 20 . In the Louvain method, the modularity of the weighted graph is defined as: for gene v and w in network node set N, where m is the sum of all edge weights in the network, A vw is an edge weight between genes v and w, k v is the sum of edge weights with neighbors of the node v, δ is Kronecker delta function and c v is the communities of the genes. Using the above modularity definition, the Louvain method first found small communities by optimizing modularity locally on the entire node, then grouped each small community into one node and repeated this method to perform community detection. As a result of a community detection algorithm, we obtained clusters of genes that maximizes modularity. Finally, we defined given data specific subnetworks by constraining the network size, here we constrained the number of nodes and edge weight, and through this result, the complex gene space is reduced to few tens of subnetwork dimension.

9/13
Map into a metric space For each subnetwork, we measured the phenotypic changes of each sample compare to normal sample from two perspectives: network level Jensen-Shannon divergence (JSD) and network structure level importance. By combining these two measured values, each sample is mapped into a metric space generated by subnetworks in the form of vector.

Calculation of network level JSD
In information theory, JSD is a method of measuring the difference between two probability distributions, with symmetrical and finite values of Kullback-Leibler divergence (KLD). Motivated by the 18 , we measured how different the influence of genes on neighboring genes in network was compared to normal, based on information theory. For each gene in the network, the probability distribution of the gene v is defined using the expression value of the neighbors N v : where e w is expression value of gene w in N v . Then, the KLD of gene v between normal sample distribution P N (v) = p N (w)wherew ∈ N v and query sample distribution P Q (v) = p Q (w)where|w ∈ N v is defined as: Utilizing the KLD defined in the network, we defined the network level JSD of gene v between normal and query as follows: where P M = 1 2 P N + P Q .

Calculation of network structure level importance
When defining the phenotypic change of a query sample, the probabilistic change based on information theory calculated the change of the same value for gene set with different expression but the same proportion. To compensate for this, we measured the importance of each gene in the network structure. Closeness centrality of node is a value that evaluate the degree of the centrality of the node in the network and is defined as the reciprocal for the sum of shortest path lengths between node and all other nodes in the network: where N is set of nodes in network, |N| is the number of nodes and d(v, w) is shortest pathway between v and w. We used the centrality obtained by each gene in the network as the importance of the gene in a network structure.

Calculation of NetJSD
In the subnetworks constructed utilizing the properties of given data in the template network, we defined the phenotypic change in each gene of the subnetwork as the product of the two level changes, network level JSD and network structure level importance. And then, the average value of the change value of all genes in the subnetwork was defined as the phenotypic change of sample in the subnetwork NetJSD: As a result, we computed NetJSD on all subnetworks and then mapped a single sample S into a metric space generated by subnetworks in the form of following vector: where M is the number of subnetworks.