Prioritizing Cancer lncRNA Modulators via Integrated lncRNA-mRNA Network and Somatic Mutation Data

Background: Long noncoding RNAs (lncRNAs) represent a large category of functional RNA molecules that play a significant role in human cancers. lncRNAs can act as gene modulators that affect the biological processes of multiple cancers.
Methods: Here, we developed a computational framework that uses a lncRNA-mRNA network and mutations in individual genes of 9 cancers from TCGA to prioritize cancer lncRNA modulators. Our method screens risk cancer lncRNA regulators based on an integration of multiple lncRNA functional networks and 3 scoring methods on the network.
Results: Validation analyses revealed that our method was more effective than prioritization based on a single lncRNA network. The method showed high predictive performance, and the highest ROC score was 0.836 in breast cancer. Notably, the scores of 5 lncRNAs were abnormally high, and these lncRNAs appeared in 9 cancers. According to the literature, these 5 lncRNAs are experimentally supported. Analyses of the prioritized lncRNAs revealed that they are enriched in various cancer-related biological processes and pathways.
Conclusions: Together, these results demonstrate the ability of this method to identify candidate lncRNA molecules and improve insight into the pathogenesis of cancer.


Introduction
Cancer is a major public health problem worldwide and a leading cause of death in many countries (1). It is a complex disease involving DNA abnormalities, transcriptomic alterations and epigenetic aberrations (2), and whole-genome sequencing efforts have uncovered the genomic landscapes of common forms of human cancer (3). The Cancer Genome Atlas (TCGA) has provided a large volume of data from human samples and uncovered molecular alterations at the DNA and RNA levels (4).

Mutations are important markers of cancer genes, and the somatic mutation landscapes and signatures of major cancer types have been reported and archived by international cancer genome projects such as TCGA and ICGC (5). In recent years, increasing experimentally supported evidence has suggested that lncRNAs as genes

The gold-standard gene set was obtained from the Cancer Gene Census (CGC) database, which includes 616 cancer genes (16). To verify the accuracy of the model, we chose the 9 cancers with the most experimentally confirmed data in the Lnc2Cancer database, which includes 148 lncRNAs for these 9 cancers.

The lncRNA-mRNA functional network
In our study, the lncRNA-mRNA functional network was a fusion of a lncRNA-mRNA co-expression network, a lncRNA-mRNA ceRNA network and a lncRNA-protein interaction network. We used gene expression data from TCGA and lncRNA expression data from TANRIC. Pearson correlation was used to construct the lncRNA-mRNA co-expression network, and a lncRNA-mRNA pair was retained if it met the following criteria: corr(lncRNA, mRNA) > 0.8 and FDR < 0.05. We used lncRNA-miRNA pairs and mRNA-miRNA pairs to construct the lncRNA-mRNA ceRNA network.

Scoring scheme of genes
We defined $N_i$ as the number of non-synonymous mutations of gene $i$ in the somatic mutation data. Meanwhile, we screened the differentially expressed genes according to

[equation missing in source]

where $n_i$ represents the normalized score of differentially expressed gene $i$ and $p_i$ represents the P-value of differentially expressed gene $i$. We formulated the score of a gene from its mutation occurrences $N_i$ and $n_i$:

[equation missing in source]

Scoring scheme of lncRNAs
We designed three ways of using the scores of direct neighbor genes and the edge weights in our network to prioritize lncRNAs. The first method, "Smax", defines a lncRNA's score as the largest product of a direct neighbor gene's score and the corresponding edge weight:

$$S_j = \max_{i} \left( G_i \cdot w(i, j) \right)$$

The second method, "Ssum", defines a lncRNA's score as the sum of its direct neighbor genes' scores multiplied by the edge weights:

$$S_j = \sum_{i} G_i \cdot w(i, j)$$

where $S_j$ represents the score of lncRNA $j$, $G_i$ the score of gene $i$, and $w(i, j)$ the edge weight between lncRNA $j$ and gene $i$; the index $i$ runs over the direct neighbor genes of lncRNA $j$.

The third method, "NWsum", defines a lncRNA's score as the sum of its direct neighbor genes' scores, each divided by the number of that gene's direct neighbors:

$$S_j = \sum_{i} \frac{G_i}{D_i}$$

where $D_i$ represents the number of direct neighbors of gene $i$.
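The three lncRNA scoring schemes can be illustrated with a minimal Python sketch. It assumes the gene scores have already been computed and are supplied as a plain dictionary; the function name `score_lncrnas`, the data layout and the toy values are hypothetical and are not taken from the study.

```python
from typing import Dict, Tuple


def score_lncrnas(
    gene_scores: Dict[str, float],
    edges: Dict[Tuple[str, str], float],
    gene_degree: Dict[str, int],
) -> Dict[str, Dict[str, float]]:
    """Score each lncRNA with Smax, Ssum and NWsum.

    gene_scores: precomputed score of each gene (assumed given).
    edges:       (lncRNA, gene) -> edge weight in the integrated network.
    gene_degree: number of direct neighbors of each gene.
    """
    # Collect the direct neighbor genes of every lncRNA.
    neighbors: Dict[str, list] = {}
    for (lnc, gene), weight in edges.items():
        neighbors.setdefault(lnc, []).append((gene, weight))

    result = {}
    for lnc, nbrs in neighbors.items():
        weighted = [gene_scores[g] * w for g, w in nbrs]
        result[lnc] = {
            "Smax": max(weighted),   # largest weighted neighbor score
            "Ssum": sum(weighted),   # sum of weighted neighbor scores
            # Each neighbor score divided by that gene's own degree.
            "NWsum": sum(gene_scores[g] / gene_degree[g] for g, _ in nbrs),
        }
    return result


# Toy data (hypothetical values, for illustration only).
toy_gene_scores = {"g1": 2.0, "g2": 4.0, "g3": 1.0}
toy_edges = {("lncA", "g1"): 0.5, ("lncA", "g2"): 1.0, ("lncB", "g3"): 0.5}
toy_gene_degree = {"g1": 2, "g2": 4, "g3": 1}

scores = score_lncrnas(toy_gene_scores, toy_edges, toy_gene_degree)
ranking = sorted(scores, key=lambda l: scores[l]["NWsum"], reverse=True)
```

In this sketch NWsum ignores the edge weights, following the verbal definition above; sorting by any of the three scores yields the candidate ranking.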

As an example of NWsum, if a lncRNA had 5 direct neighbor genes with scores $G_1, \dots, G_5$, and these genes had 2, 4, 1, 1 and 3 direct neighbors respectively, the score of this lncRNA would be:

$$S = \frac{G_1}{2} + \frac{G_2}{4} + \frac{G_3}{1} + \frac{G_4}{1} + \frac{G_5}{3}$$

A summary of validated cancer-related lncRNAs
Literature mining is an effective way to collect a "gold standard" for a large number of disease-related molecules, because the reported associations are supported by experimental methods such as Western blot and luciferase reporter assays. In this study, we used a set of validated lncRNAs for 9 cancers from the Lnc2Cancer database (http://www.bio-bigdata.net/lnc2cancer/), which contains experimentally supported cancer-related lncRNAs curated from thousands of articles. Because many new data have emerged since the database was published, we also manually collected recent lncRNA-cancer associations and added them to test our method.

Degree centrality and betweenness centrality are two important indicators of network topology. Generally, the larger the degree of a node, the higher its degree centrality and the more important the node is in the network. Betweenness centrality is equal to the number of shortest paths from each node to all others that pass through a given node; as an important global geometric quantity, it reflects the role and influence of that node in the entire network. Degree centrality and betweenness centrality were calculated using the R package "igraph" (17).
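To make the two topological measures concrete, the following is a minimal pure-Python sketch for an unweighted, undirected network. It uses Brandes' algorithm for betweenness and is only an illustration, not the igraph implementation used in the study; the function name and the toy star network are hypothetical.

```python
from collections import deque


def degree_and_betweenness(edge_list):
    """Return (degree, betweenness) dictionaries for an undirected graph."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    betweenness = dict.fromkeys(adj, 0.0)

    for source in adj:
        # Breadth-first search counting shortest paths from `source`.
        order = []
        preds = {node: [] for node in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths
        dist = dict.fromkeys(adj, -1)
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # Back-propagate pair dependencies (Brandes' accumulation step).
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                betweenness[w] += delta[w]

    # Each unordered pair was counted from both endpoints.
    for node in betweenness:
        betweenness[node] /= 2.0
    return degree, betweenness


# Toy star network: one hub lncRNA connected to three genes (hypothetical).
toy_degree, toy_betweenness = degree_and_betweenness(
    [("lncA", "g1"), ("lncA", "g2"), ("lncA", "g3")]
)
```

In the toy star network the hub lies on the shortest path between every pair of genes, which is the pattern of a high-degree, high-betweenness node.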

A Kaplan-Meier survival analysis was performed using the clinical data from TCGA, and statistical significance was assessed using the log-rank test. The survival curve was drawn using the R package "survival". All analyses were performed in R 3.6.0.
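For readers unfamiliar with the Kaplan-Meier estimator, a minimal Python sketch of the product-limit calculation is given below. It is an illustration only and does not reproduce the R "survival" workflow or the log-rank test; the function name and the toy follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times:  follow-up time of each patient.
    events: 1 if the event (death) was observed, 0 if censored.
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set.
        at_risk -= removed
    return curve


# Toy data (hypothetical): four patients, the second one censored.
toy_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Censored patients contribute to the risk set until their last follow-up time but never trigger a drop in the curve, which is why the second step in the toy example is larger than the first.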

Overview
A general workflow of the method is given in Fig. 1. To prioritize lncRNA molecules, the first step was to score genes based on somatic mutation data and gene expression data.

The third step was to score lncRNAs based on the gene scores and the integrated lncRNA-mRNA network. We used three methods to score lncRNAs and then ranked them by score. The higher a lncRNA ranks, the more likely it is to be a risk cancer lncRNA.

[...] degree and betweenness of cancer lncRNAs were significantly higher than those of candidate lncRNAs (Wilcoxon test, P < 0.001). These results indicated that cancer lncRNA nodes are key factors in the network and play a regulatory role for a large number of genes, which is consistent with previous research (18).

First, to appraise the method for prioritizing cancer genes, we used gene scores calculated from somatic mutation data and gene expression data for 9 cancers from TCGA, together with a protein-protein interaction network, STRING v10. The gold standard was a high-confidence gene set from the Cancer Gene Census (CGC) database, including 616 cancer driver genes.

To assess the performance of our method, ROC analysis was performed for each type of cancer, and the AUC value was used to determine the quality of the method. For [...] was glioblastoma multiforme (GBM); however, the level of change did not exceed 10%.

These results showed that although the absence of part of the network had some impact on the performance of the method, the effect was small (Fig. 3D). This was because the key nodes in the network had very high degree and betweenness, so even if part of the network was missing, the result was largely unaffected. All the results indicated that our network was robust.
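The AUC used throughout this evaluation can be computed directly from the ranked scores. The sketch below is a minimal Python illustration based on the pairwise (Mann-Whitney) interpretation of AUC; the function name and the toy scores and labels are hypothetical, and the relative-change calculation is only an assumed way of expressing a comparison between a full and a perturbed network.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative.

    scores: prioritization score of each candidate.
    labels: 1 for gold-standard (known cancer) entries, 0 otherwise.
    Ties between a positive and a negative count as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical values): full network vs. a perturbed network.
toy_labels = [1, 0, 1, 0, 0]
auc_full = roc_auc([0.9, 0.8, 0.7, 0.6, 0.5], toy_labels)
auc_perturbed = roc_auc([0.9, 0.8, 0.55, 0.6, 0.5], toy_labels)
relative_change = abs(auc_full - auc_perturbed) / auc_full
```

The pairwise loop is quadratic in the number of candidates, which is acceptable for a sketch; a rank-based formula gives the same value more efficiently on large candidate lists.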

Analysis of high-risk cancer lncRNA modulators
Because the NWsum algorithm performed slightly better than the other two algorithms, we used the lncRNAs predicted by NWsum for further analysis.

[...] number of experimentally supported miRNAs and genes were 134 and 616 (Fig. 4D). We found that 96 miRNAs and 166 genes had both an interaction association and confirmed evidence, including miR-106a, let-7b, FUS and EWS (Fig. 4E). This high overlap indicated that they had a significant association with the 9 cancers (P < 0.001 and P < 0.001, respectively). We thus speculated that these 5 lncRNAs are likely to be high-risk clinical factors, and further experiments need to be carried out.

[...] binding sites for Lv2 was 1.32-fold that of Lv3 (P < 0.001) (Fig. 5B). Previous research has shown that some specific human lncRNAs differ in evolutionary conservation beyond primates but have proven to be both functional and therapeutically relevant (26).

The UCSC phyloP score was used to calculate the conservation score of lncRNAs. Due [...] Table S2). Thirteen of the top 20 lncRNAs were differentially expressed.

The lncRNA TINCR, which cannot be identified by differential expression, was identified by our method. This indicated that our method can identify cancer lncRNAs that are missed by the differential expression method.

For the top 20 lncRNAs that had not been verified by experiment, we use the way [...]

In summary, our method can accurately identify stomach cancer-associated lncRNA molecules; furthermore, it can identify lncRNAs that cannot be identified by differential expression.

[...] networks. This shows that the experimentally supported interaction network has higher accuracy than the network obtained by computational methods. In our integrated network, the degree and median of experimentally supported cancer lncRNAs were significantly higher than those of other lncRNAs.

In addition to using ROC analysis to evaluate the overall prediction results, we also analyzed the top-ranked lncRNAs. When biological characteristics were compared, lncRNA ranks and characteristic scores showed a consistent trend.

For more accurate analysis, we identified 5 lncRNAs (MALAT1, NEAT1, FENDRR, CRNDE, TUG1) that ranked in the top 5 in 9 cancers. These 5 lncRNAs have been documented in many publications and are closely related to various cancers. In the STAD study, we found that this method is accurate and complements the lncRNAs found by [...]

The workflow for prioritizing cancer lncRNA modulators.
Step 1: Score the mutated gene.
Step 3: Score the lncRNA modulators by combining gene's score and network.