Contribution Value (cid:0) A Indicator for Measuring the Contribution of ncRNAs to Transcriptome

Background (cid:0) The expression difference multiple log2 FC and P value are often the main basis for screening ncRNA after high-throughput sequencing. However, the above two indicators can’t well reect the regulatory effect of nonprotein coding RNA (ncRNA) on mRNA. Therefore, we propose a new indicator, Contribution value (C value), to characterize the contribution of ncRNAs to transcriptome transformation. Results (cid:0) In this study, we analyzed multiple data sets from mice and humans. We take all differential expression mRNAs as a parent set and the GO and KEGG enrichment analysis were performed on it. C value can be simply regarded as the sum of the product of the richfactor of each ncRNA target genes participating in each term/pathway and the P value of that term/pathway obtained from the parent set. We found that C value was superior to log2 FC and P value in all operation results. Conclusions (cid:0) We show that the C value, which takes into accounts the the KEGG pathways and GO terms involved in the development of the disease, provides a measure of another dimension compared with the log2FC and P value. These hidden interactions between ncRNAs and their target genes may provide more comprehensive analysis.

log2FC and P value. These hidden interactions between ncRNAs and their target genes may provide more comprehensive analysis.
Background RNA sequencing is an extremely sensitive method for analyzingdifferential expression RNA [1]. Compared with traditional capillaryelectrophoresis sequencing, the massive parallel sequencingprovides more data at a lower cost. In contrast to genemicroarrays, they are not limited to known RNAs, but can also beused to analyze the entire transcriptome of tissues and organs,thus providing biological information about the possible functionof annotation genes or new genes [2]. Since the beginning of theHuman Genome Project, the application of high-throughput sequencinghas developed rapidly. Sequencing techniques have been used tostudy the characteristics of dynamic genomic loci ranging fromsimple model organisms to bigger species such ashumans [3,4].
One of the most important applications of RNA sequencing is tocompare the differences in the expression of the non-codingRNAs (ncRNAs). NcRNAs refers to a kind of RNAs that can betranscribed from genome but not translated into proteins and canperform their own biological functions at the RNA level, includingrRNA, tRNA, snRNA, lncrna, microRNA and others. They play importantroles in normal development, physiology and disease [5]. Comparedwith other ncRNAs, microRNAs (miRNAs) and long noncoding RNAs(lncRNAs) are involved in regulating gene expression in manybiological processes.
MiRNAs are non-coding single-stranded microRNAs, ranging from about 20 to 30 nucleotides in length.
MiRNAs areestimated to regulate the translation of more than 60% ofprotein-coding genes [6]. Some miRNAs degrade mRNAs by fullybinding to a target, or inhibit the expression of proteintranslation regulatory genes by partially binding mRNAs [7]. OthermiRNAs may act as master regulators of a process.
High throughput sequencing is a common method for ncRNAresearch. People often select genes with high expressiondifferences for follow-up function research [13,14]. In thistraditional way, the expression difference multiple log2 FC and Pvalues are often the main basis for screening ncRNA, onlyconsidering the change of expression amount. Based on the powerfultargeting function of ncRNA to mRNA, we think that a new index canbe proposed to characterize the contribution of genes totranscriptome transformation, this method can assist the selectionof target gene after high-throughput sequencing.
We collected the existing sequencing results, including skeletalmuscle denervation, Alzheimer's disease, prostate cancer, gastriccancer, and adipocyte differentiation. C57BL / 6 mice were used asthe model of skeletal muscle denervation, APP / PS1 mice as themodel of Alzheimer's disease, prostate cancer, gastric cancer, andadipocyte differentiation samples were all from human [15][16][17][18][19]. Foreach sequencing result, we take all DE mRNAs as a parent set andthe GO and KEGG enrichment analysis were performed on it. Then, thepredicted target mRNAs of each DE ncRNA (miRNA and lncRNA) and DEmRNAs were cross-labeled as a subset, which was also used for theGO and KEGG enrichment analysis. Finally, C value can be simplyregarded as the sum of the products of the richfactor of each ncRNAtarget genes participating in each term/pathway and the P value ofthat pathway obtained from the parent set. The richfactor is equalto the proportion of the number of each subset participating ineach term/pathway to the total number of genes in thatterm/pathway. Our proposed mathematical model for calculating C value for eachncRNA takes into accounts the P value for each enriched GO term andKEGG pathway and the percentage of ncRNAs target genes in thatpathway. We expect to use C value to characterize the contributionof genes to transcriptome transformation, and to assist in theselection of target genes after high throughput sequencing.

Results
The total C value of each DE ncRNA is equal to the sum of BPvalue, CC value, MF value and KEGG value We calculated the Cvalue of eachDE miRNA in skeletal muscledenervation, Alzheimer's disease, prostate cancer andgastric cancerdata setsrespectively.In addition, we calculate the C value of eachlncRNA in skeletal muscle denervation and adipocyte differentiationdata sets.Using the mathematical modelwe designed, the C values ofeach DE miRNA based on biological process (BP), cellular component(CC), molecular function (MF) and KEGG analysis can be obtained,and we call these C values as BP value, CC value, MF value and KEGGvalue respectively. The total C value of each DE miRNA is equal tothe sum of BP value, CC value, MF value and KEGG value. The DEmiRNAs were sorted with the total C value to obtain the 10 DEmiRNAs with maximum C value, named as top10 C value miRNAs (Table1-4). The top10 DE miRNAs with maximum absolute Log2 FC (top10 FCmiRNAs), and the top10 DE miRNAs with minimum P value (top10 Pvalue miRNAs), were obtained by sorting the DE miRNAs according tothe absolute Log2 fold FC and P value respectively(TableS1-8).Similarly, DE lncRNAs are processed in the same way to obtaintop5 C value lncRNAs, top5 FC lncRNAs, top5P value lncRNAs foradipocyte differentiation and top10 C value lncRNAs, top10 FClncRNAs, top10 P value lncRNAsfor skeletal muscle denervation. (Table  S9-14).    In each data set, the most signi cant enriched BP term, CCterm, MF term, KEGG pathway and the most involved BP term, CC term,MF term, KEGG pathway were obtained by DEmRNAs enrichmentanalysis.We took the intersections of DE mRNAs with the predictedtarget genes of top10 C value miRNAs, top10 FC miRNAs and top10 Pvalue miRNAs respectively, and then calculated the proportion ofthese intersections in the above pathways/terms. It was found thatthe proportion of top10 C Value miRNAs target mRNAs wassigni cantly larger than that of top10 FC miRNAs, top10 P valuemiRNAs in the above terms/pathways (Fig.1). Furthermore, weenriched KEGG pathways based on DE mRNAs and pathways with p valueless than 0.01 to generate an annotation network (Fig.2). Accordingto the distance from the centre, the annotation network was dividedinto three regions: core region, subcore region, and noncoreregion (Fig. 2). In the annotation network, the predicted targetgenes of top10 C value miRNAs, top10 FC miRNAs and top10 p valuemiRNAs were labeled in red (Fig. S1). It was found that the totalnumber of top10 C value miRNAs' target genes and their proportionin each region were larger than those of top10 FC miRNAs, and top10P value miRNAs (Fig.3).
Based on extensive literature, we identi ed 14 skeletal musclegrowth regulatory miRNAs, 6 Alzheimer's disease associated miRNAs,7prostate cancer associated miRNAs, 6gastric cancer associatedmiRNAsand found that when DE miRNAs were sorted by C value, thesequence number accumulation value of thesemiRNAs was signi cantlysmaller than that of the other two indexes, which means that thesemiRNAs sequences increased integrally (Fig. 4). When sorting by Cvalue versus sorting by absolute Log2 FC/ P value, most of thedisease critical miRNAs ranked up (Fig. 4).
C value is superior to log2 FC and P value in lncRNAsoperation results We got a conclusion similar to miRNA in the result of lncRNAoperation based on skeletal muscle denervation and adipocytedifferentiation data sets. In skeletal muscle denervationdata set,we calculated the proportion of the predicted target genes of top10C value lncRNAs, top10 FC lncRNAs, and top10 P value lncRNAs in the7 most enriched terms/pathways respectively, and found that theproportion of the genes regulated by top10 C value lncRNAs waslarger than that of top10 FC lncRNAs and top10 P value lncRNAs (Fig. 5A). Then, the predicted target genes of top10 C valuelncRNAs, top10 FC lncRNAs and top10 P value lncRNAs were labeled inred in KEGG annotation network (Fig. 5B-D). It was found that thetotal number of top10 C value lncRNAs' target genes and theirproportion in each region (total: 87, core region: 17.9%, subcoreregion: 17%, non-core region: 19.7%) were larger than those oftop10 FC lncRNAs (total: 65, core region: 11.5%, subcore region:12.7%, non-core region: 16.9%), and top10 P value lncRNAs (total:61, core area: 10.3%, subcore area: 11.8%, non-core area: 16.9%) (Fig. 5E-F).
Since there are relatively few DElncRNAs and DE mRNAsinadipocyte differentiation data set, we take top5C value lncRNAs,top5 FC lncRNAs, top5 P value lncRNAs and draw the KEGG networkwithout partition (Fig. S2). The proportion of the genes regulatedby top5 C value lncRNAs was larger than that of top5 FC lncRNAs andtop5 P value lncRNAs in GO terms and KEGG annotation network (Fig.6A-B). And when DE lncRNAs were sorted by C value, theadipocytedifferentiation associatedlncRNAs sequences increased integrallythan that of the other two indexes (Fig. 6C-D).

Discussion
MiRNAs play an important role in transcriptome regulation andthey can directly or indirectly in uence the biological processesbefore and after gene transcription [20][21][22]. LncRNA (> 200nucleotides) has recently been studied as endogenousncRNA, whichhas multiple mRNA characteristics, including polyadenylation [23].Recent studies have shown that lncRNA, as a different type ofnoncoding RNA, can be classi ed according to genome location ortranscription direction, and regulate gene expression [24].If we canquantify this regulatory capacity of ncRNAs, the results mustin turn indicate how much miRNAs are involved in a particularbiological process. We designed a new mathematical model tocalculate Contribution value, and tried to use this index tocharacterize the contribution of each DE ncRNA to genome-widechanges in total transcriptome sequencing results. C value can besimply regarded as the sum of the product of the richfactor of eachncRNA target genes participating in each pathway and the P value ofthat pathway obtained from the parent set. The richfactor is equalto the proportion of the number of each subset participating ineach pathway to the total number of genes in that pathway. And theparent set is a collection of all DE mRNAs.
To test the superiority of C value as a measure of ncRNAscontribution to genome-wide changes, we compared it with absoluteLog2 FC and P value. Log 2 FC re ects the expression change ofncRNAs and P value re ects how signi cant the change is. The twoindexes of each DE RNA were obtained after the traditional wholetranscriptome sequencing, and many follow-up studies have partiallyreferenced Log2 FC and P values in selecting the target gene [13,14]. As miRNAs, we compared C value with Log2 FC, P value, andfound that the proportion of top10 C value miRNAs target mRNAs wassigni cantly larger than that of top10 FC miRNAs, top10 P valuemiRNAs in the GO terms and KEGG pathway. Further, we analyzed theannotation network of KEGG pathways with P value less than 0.01 andfound that the total number of top10 C value miRNAs' target genesand their proportion in each region were larger than those of top10FC miRNAs, and top10 P value miRNAs. At the same time, we searcheda large number of literatures and got 14 skeletal muscle growthregulatory miRNAs [25][26][27][28][29][30][31][32][33][34][35][36][37][38], 6 Alzheimer's disease associatedmiRNAs [39][40][41][42], 7 prostate cancer associated miRNAs [17,43,44],6gastric cancer associated miRNAs [45,46].Their overall rankingwent up after sorted by C value. For lncRNAs, there were similarresults. Ranking according to C value, top lncRNAs showed strongerregulation on GO terms and KEGG pathways. Moreover, the proportionof top C value lncRNAs' target genes in the core region of KEGGannotation network is also larger than that of top lncRNAs selectedby other two indicators. And adipocyte differentiationassociatedlncRNAs [19] ranked up after sorted by C value.

Conclusions
Based on the above evidence and the rationality of themathematical model, Contribution value can be used as an indicatorto measure the contribution of ncRNA to the transcriptome. In theanalysis of whole transcriptome sequencing, C value, as asupplement to log2 FC and P value, makes the selection of targetgene more reasonable.

Prediction of ncRNAs' target mRNAs
MiRNA: MiRNAs target genes prediction software,miRanda(http://www.microrna.org/) [47], uses a weighted dynamicprogramming algorithm to calculate the optimal sequencecomplementarity between a mature microRNA and a given mRNA. The keyextension of Smith-Waterman algorithm is that the alignment scoreis the weighted sum of the matching and mismatching scores of basepairs (including G: U jitter) and gap penalties. Weights areposition-dependent, re ecting the relative importance of the 5'and 3' regions in a nely adjustable manner. The weight of eachposition can be optimized to re ect experimental facts andphysical principles.

GO and KEGG pathway enrichment analysis
Gene Ontology Analysis: GO is a database established by GeneOntology consortium(http://www.geneontology.org),which includes three parts: molecular function, biological processand cell composition. Fisher exact test and x2 test were used.Enrichment analysis of differentially expressed genes or ncRNAstarget genes was performed using GOseq R software package, and genelength bias was corrected. The corrected P value less than 0.05 wasconsidered to be signi cantly enriched by differentially expressedgenes.
Using hypergeometric analysis, the pathway of KEGG/GO was foundto be signi cantly enriched in the candidate gene compared withthe whole genome background. The formula for this analysis: N is the total number of genes in the GO annotation database; Mis the total number of genes in the database belonging to a GOsubclass; n is the total number of genes in the database thatrequire GO enrichment analysis; i is the number of genes in nbelonging to M.So, we can calculate the probability of whether thegene set n is enriched in class M. Theoretically, the smaller the Pvalue is, the higher the signi cance of pathway enrichment is.
In this study, GO and KEGG enrichment analysis were performed onall DE mRNAs which was called as the parent set. Then, thepredicted target genes of DE ncRNAs was cross-labeled with DEmRNAs, and the subsets were used for GO analysis(BP, CC, MF) andKEGG enrichment analysis.

C value mathematical model and itscalculation
The C value of each DE ncRNA is calculated using the followingmathematical model: RichFactor is equal to the proportion of the number of eachsubset participating in each pathway to the total number of genesin that pathway; P value is the p value of the pathway in theparent set; n represents the number of pathways enriched bysubset.