Data description
The long-read transcriptome data presented here were from mouse normal endothelial cells SVEC4-10 and mouse tumor endothelial cells 2H-11 culture under laboratory condition,with three samples each.(Sample H1, H2, H3, from TECs 2H-11 and Sample S1, S2, S3 from NECs SVEC4-10). Totally, 7,145,224, 5,738,625 and 8,308,634 clean reads were generated from three samples of 2H-11, with an average read length of 1186 bp、1047bp and 916 bp and an N50 of 1560bp,1419 bp and 1219 bp. 5,487,487, 4,739,190 and 21,496,454 clean reads were generated from three samples of SVEC4-10, with an average read length of 1191 bp,1188bp and 1086 bp and an N50 of 1527bp,1500 bp and 1380 bp, respectively (Table. 1).
Table 1
Overview of Nanopore all sample clean reads
Sample ID
|
Read Num
|
Base Num
|
N50
|
Mean Length
|
Max Length
|
Mean Q score
|
H1
|
7,145,224
|
8,475,179,085
|
1560
|
1186
|
28,619
|
Q12
|
H2
|
5,738,625
|
6,011,317,116
|
1419
|
1047
|
25,972
|
Q12
|
H3
|
8,308,634
|
7,613,609,482
|
1219
|
916
|
60,662
|
Q12
|
S1
|
5,487,487
|
6,537,183,635
|
1527
|
1191
|
487,578
|
Q12
|
S2
|
4,739,190
|
5,634,114,176
|
1500
|
1188
|
89,959
|
Q12
|
S3
|
21,496,454
|
23,346,764,762
|
1380
|
1086
|
171,171
|
Q12
|
For both 2H-11 and SVEC4-10, the length of clean reads was distributed among 1 kb~10+ kb, and the most abundant length was 1 kb (Fig. 1A and B ).In addition, the average quality (Q) scores of majority of clean reads of 2H-11 and SVEC4-10 were Q13 (Fig. 2A and B). After discarding rRNA,6,905,496 , 5,509,170 , 7,939,379 , 5,312,938 , 4,535,096 and 20,822,758 clean reads were gained, and among them 89.10%, 88.57%, 88.66%, 88.96%, 87.27% and 89.23% were identified as being full-length(Table. 2).
Table 2
Summary of full-length clean reads
Sample ID
|
Number of Clean Read (except rRNA)
|
Number of full-length reads
|
Full-Length Percentage (FL%)
|
H1
|
6,905,496
|
6,152,877
|
89.10%
|
H2
|
5,509,170
|
4,879,500
|
88.57%
|
H3
|
7,939,379
|
7,038,880
|
88.66%
|
S1
|
5,312,938
|
4,726,331
|
88.96%
|
S2
|
4,535,096
|
3,957,596
|
87.27%
|
S3
|
20,822,758
|
18,581,090
|
89.23%
|
The length of full-length clean reads from 2H-11 and SVEC4-10 was distributed among 1 kb~10+ kb; the most abundant lengths for both were 1 kb (Fig. 3A and B).After removing redundant reads, the lengths of remaining full-length transcripts of both 2H-11 and SVEC4-10 were ranged from 1 kb to 9 kb, with the largest group of 1 kb (Fig. 4A and B).
DEG identification and selection
Transcripts of differentially expressed genes (DEGs) were characterized using isoform sequencing (Iso-Seq) analyses of 2H-11 cells and SVEC4-10 cells. The results revealed obvious differences between 2H-11 cells groups and the SVEC4-10 groups. We discovered 1847 up-regulated and 1202 down-regulated DEGs in the 2H-11 cells. The hierarchical clustering heat map, MA plot, and volcano plots were generated to represent the up- and down-regulated genes (logFC ±2 and p<0.001 ). Fig. 5A represents the heatmap of up- and down-regulated genes in red and green, respectively. The volcano plot (Fig. 5B) and the MA plot (Fig. 5C) visualize the differences between measurements taken in SVEC4-10 cells and 2H-11cells DEGs.
Gene annotation and functional classification of DEGs
All the DEGs were uploaded to the GO Enrichment Analysis tool and database for annotation. The biological processes (BPs), cellular components (CCs), molecular functions (MFs), and pathways were predicted in the significantly enriched GO terms of the differentially express genes (Fig. 6). The enrichment trend of some DEGs in these GO secondary functions is different from the enrichment trend of all genes. It is important to analyze whether these functions are related to the difference.
The most DEGs are involved in various BPs, such as cellular process (GO:0009987), single-organism process (GO:0044699), metabolic process (GO:0008152), biological regulation (GO:0065007), response to stimulus (GO:0050896), multicellular organismal process (GO:0032501), signaling (GO:0023052), cellular component organization or biogenesis (GO:0071840), developmental process (GO:0032502) and localization (GO:0051179) functions. Among these BPs, some DEGs are involved in multi-organism process (GO:0051704; 191genes), immune system process (GO:0002376; 214genes), locomotion (GO:0040011; 161genes) and biological adhesion (GO:0022610; 114 genes), growth (GO:0040007; 61genes) and behavior (GO:0007610; 58genes).The enrichment trend of these DEGs in these GO secondary functions is different from the enrichment trend of all genes ,so these functions are related to differences.
The most DEGs are involved in various CCs, such as cell (GO:0005623), cell part (GO:0044464), organelle (GO:0043226), membrane (GO:0016020) ,membrane part (GO:0044425), organelle part (GO:0044422), macromolecular complex (GO:0032991) and membrane-enclosed lumen (GO:0031974) functions.Among these CCs, these DEGs are involved in the inextracellular region (GO:0005576; 225genes), extracellular region part (GO:0044421; 169genes), synapse (GO:0045202; 47genes), cell junction (GO:0030054; 65genes) and synapse part (GO:0044456; 31genes).The enrichment trend of these DEGs in these GO secondary functions is different from the enrichment trend of all genes, so these functions are related to differences.
The most DEGs are involved in various MFs, such as binding (GO:0005488) and catalytic activity (GO:0003824) functions. Some DGEs are involved in the antioxidant activity (GO:0016209; 13genes), chemoattractant activity (GO:0042056; 7genes), and chemorepellent activity (GO:0045499; 8genes).The enrichment trend of these DEGs in these GO secondary functions is different from the enrichment trend of all genes so these functions are related to differences.
Pathway analysis
Pathway analysis helps elucidate data from canonical prior structured knowledge in the form of pathways. It allows finding distinct cell processes, diseases, or signaling pathways that are statistically associated with the selection of DEGs. It is always used to analyze whether the DEGs are significantly different in a certain pathway (over-presentation). Pathway significant enrichment analysis takes pathway in the KEGG (Kyoto Encyclopedia of Genes and Genomes) database as the unit, and applies hypergeometric test to find pathways that are significantly enriched in DEGs compared with the entire gene background. The results of enrichment analysis of the DEGs are shown in the Fig. 7 The Fig. 7A shows the top 20 pathways with the smallest significant Q value.The Fig. 7B indicates the network view of DEGs pathways in 2H-11 and SVEC4-10.
They are involved in various KEGG pathways, such as the Renin secretion (ko04924; 39 genes), Pancreatic secretion(ko04972; 40 genes), Fluid shear stress and atherosclerosis (ko05418; 37 genes), Human papillomavirus infection (ko05165; 73 genes), Rap1 signaling pathway (ko04015; 44 genes), Pathways in cancer (ko05200; 88 genes), Antigen processing and presentation(ko04612; 26 genes),Proteoglycans in cancer (ko05205; 40 genes), Focal adhesion (ko04510 41 genes), MAPK signaling pathway (ko04010; 52 genes), Protein digestion and absorption༈ko04974; 23 genes),
ECM-receptor interaction(ko04512; 24 genes), Human cytomegalovirus infection༈ko05163; 46 genes༉, Type I diabetes mellitus༈ko04940; 19 genes༉, Phagosome༈ko04145; 39 genes༉, PI3K-Akt signaling pathway༈ko04151; 70 genes༉, Allograft rejection༈ko05330; 18 genes༉, Glutathione metabolism༈ko00480; 17 genes༉, Graft-versus-host disease༈ko05332; 18 genes) and Toxoplasmosis༈ko05145; 22 genes).
Here, we mainly show the signal pathways of some genes related to our research. There are a large number of DEGs enriched in Pathways in cancer (ko05200; 88 genes; Table. 3) and PI3K-Akt signaling pathway (ko04151; 70 genes; Table. 4). A relatively large number of DEGs are enriched in Proteoglycans in cancer (ko05205; 40 genes; Table. 5), Focal adhesion (ko04510; 41 genes; Table. 6) and MAPK signaling pathway (ko04010; 52 genes; Table. 7). There is also a small number of genes enriched in Antigen processing and presentation ( ko04612; 26 genes; Table. 8), Protein digestion and absorption ( ko04974; 23genes; Table. 9), ECM-receptor interaction (ko04512; 24genes; Table. 10) and Glutathione metabolism (ko00480; 17 genes; Table. 11).
Table 8
Genes involved Antigen processing and presentation pathway. (p-value = 0.00011321866609603849; rich factor =2.1770077367)
#ID
|
gene_name
|
FDR
|
log2FC
|
regulated
|
ENSMUSG00000035929
|
H2-Q4
|
0.004570336
|
-1.537433763
|
down
|
ENSMUSG00000024401
|
Tnf
|
9.88E-09
|
4.921767952
|
up
|
ONT.10206
|
ONT.10206
|
1.16E-07
|
-3.858448298
|
down
|
ENSMUSG00000024308
|
Tapbp
|
5.22E-27
|
-2.26916257
|
down
|
ONT.21272
|
ONT.21272
|
0.007145755
|
1.664993518
|
up
|
ENSMUSG00000073411
|
H2-D1
|
1.30E-15
|
-1.634302503
|
down
|
ENSMUSG00000060550
|
H2-Q7
|
8.82E-05
|
-1.084974005
|
down
|
ENSMUSG00000006611
|
Hfe
|
0.003328646
|
1.733234191
|
up
|
ENSMUSG00000059970
|
Hspa2
|
0.000972136
|
-1.211418313
|
down
|
ENSMUSG00000021477
|
Ctsl
|
2.43E-25
|
-1.747916795
|
down
|
ENSMUSG00000090877
|
Hspa1b
|
5.02E-10
|
3.433264195
|
up
|
ENSMUSG00000022216
|
Psme1
|
1.10E-29
|
-2.01144189
|
down
|
ENSMUSG00000067212
|
H2-T23
|
5.69E-24
|
-2.877428307
|
down
|
ENSMUSG00000037649
|
H2-DMa
|
3.11E-09
|
2.724576686
|
up
|
ENSMUSG00000037321
|
Tap1
|
7.27E-07
|
-1.763033392
|
down
|
ENSMUSG00000091971
|
Hspa1a
|
3.55E-12
|
5.290791162
|
up
|
ENSMUSG00000053835
|
H2-T24
|
0.000633033
|
-2.850276057
|
down
|
ENSMUSG00000060802
|
B2m
|
1.27E-23
|
-1.720470365
|
down
|
ENSMUSG00000061232
|
H2-K1
|
3.87E-26
|
-2.025866713
|
down
|
ENSMUSG00000092243
|
Gm7030
|
0.001190601
|
-1.912033582
|
down
|
ONT.6732
|
ONT.6732
|
1.82E-05
|
-2.524698523
|
down
|
ENSMUSG00000079197
|
Psme2
|
1.21E-13
|
-1.611409427
|
down
|
ENSMUSG00000073402
|
Gm8909
|
0.000921921
|
-2.408210504
|
down
|
ONT.22061
|
ONT.22061
|
8.03E-08
|
3.399291437
|
up
|
ENSMUSG00000056116
|
H2-T22
|
1.63E-23
|
-1.966873441
|
down
|
Table 9
Genes involved Protein digestion and absorption pathway. (p-value = 0.000288132612123349; rich factor = 2.17062146892655)
#ID
|
gene_name
|
FDR
|
log2FC
|
regulated
|
ENSMUSG00000048126
|
Col6a3
|
9.48E-34
|
8.237114364
|
up
|
ENSMUSG00000004098
|
Col5a3
|
7.53E-06
|
2.343294368
|
up
|
ENSMUSG00000001435
|
Col18a1
|
1.90E-09
|
3.471710732
|
up
|
ENSMUSG00000031273
|
Col4a6
|
5.24E-16
|
-3.063792325
|
down
|
ENSMUSG00000029163
|
Emilin1
|
1.05E-13
|
5.538029232
|
up
|
ENSMUSG00000056174
|
Col8a2
|
4.13E-05
|
-3.614177874
|
down
|
ENSMUSG00000032332
|
Col12a1
|
0.009306321
|
1.305619343
|
up
|
ENSMUSG00000001918
|
Slc1a5
|
2.74E-07
|
1.266888836
|
up
|
ENSMUSG00000054342
|
Kcnn4
|
0.001042811
|
2.792801361
|
up
|
ENSMUSG00000027966
|
Col11a1
|
9.47E-09
|
4.925101847
|
up
|
ENSMUSG00000026576
|
Atp1b1
|
2.01E-15
|
-2.7595007
|
down
|
ENSMUSG00000026837
|
Col5a1
|
1.22E-06
|
1.149962411
|
up
|
ENSMUSG00000027820
|
Mme
|
1.98E-07
|
-4.569286255
|
down
|
ENSMUSG00000040907
|
Atp1a3
|
1.23E-21
|
5.920207798
|
up
|
ENSMUSG00000031274
|
Col4a5
|
4.62E-06
|
-1.233139489
|
down
|
ENSMUSG00000029661
|
Col1a2
|
1.27E-06
|
1.30884117
|
up
|
ENSMUSG00000001119
|
Col6a1
|
1.36E-10
|
-1.735704123
|
down
|
ENSMUSG00000025650
|
Col7a1
|
4.29E-06
|
2.572385689
|
up
|
ENSMUSG00000020241
|
Col6a2
|
7.99E-12
|
-2.197425083
|
down
|
ENSMUSG00000031502
|
Col4a1
|
3.11E-11
|
-1.438161228
|
down
|
ENSMUSG00000024053
|
Emilin2
|
4.99E-17
|
5.955417654
|
up
|
ENSMUSG00000068196
|
Col8a1
|
2.15E-15
|
2.664904277
|
up
|
ENSMUSG00000000958
|
Slc7a7
|
0.000135723
|
3.515025572
|
up
|
Table 10
Genes involved ECM-receptor interaction pathway. (p-value = 0.000348331597598506; rich factor = 2.10448476549127)
#ID
|
gene_name
|
FDR
|
log2FC
|
regulated
|
ENSMUSG00000048126
|
Col6a3
|
9.48E-34
|
8.237114364
|
up
|
ENSMUSG00000041936
|
Agrn
|
2.72E-06
|
-1.629566318
|
down
|
ENSMUSG00000028364
|
Tnc
|
4.22E-12
|
2.223658118
|
up
|
ENSMUSG00000023885
|
Thbs2
|
3.24E-31
|
8.006758795
|
up
|
ENSMUSG00000001281
|
Itgb7
|
7.71E-05
|
2.925093206
|
up
|
ENSMUSG00000038486
|
Sv2a
|
0.001262385
|
-1.707618944
|
down
|
ENSMUSG00000027111
|
Itga6
|
3.34E-07
|
-1.825360992
|
down
|
ENSMUSG00000015647
|
Lama5
|
3.23E-16
|
-2.641045482
|
down
|
ENSMUSG00000031273
|
Col4a6
|
5.24E-16
|
-3.063792325
|
down
|
ENSMUSG00000001507
|
Itga3
|
6.06E-10
|
-1.777849405
|
down
|
ENSMUSG00000040998
|
Npnt
|
4.08E-08
|
-3.24447338
|
down
|
ENSMUSG00000034687
|
Fras1
|
3.50E-09
|
-4.652846498
|
down
|
ENSMUSG00000017009
|
Sdc4
|
6.03E-09
|
-1.518175322
|
down
|
ENSMUSG00000031274
|
Col4a5
|
4.62E-06
|
-1.233139489
|
down
|
ENSMUSG00000029661
|
Col1a2
|
1.27E-06
|
1.30884117
|
up
|
ENSMUSG00000028047
|
Thbs3
|
2.46E-10
|
2.24046441
|
up
|
ENSMUSG00000001119
|
Col6a1
|
1.36E-10
|
-1.735704123
|
down
|
ENSMUSG00000005087
|
Cd44
|
0.000121501
|
1.123040059
|
up
|
ENSMUSG00000042284
|
Itga1
|
1.01E-07
|
-2.184817206
|
down
|
ENSMUSG00000022817
|
Itgb5
|
3.96E-08
|
2.06121516
|
up
|
ENSMUSG00000020241
|
Col6a2
|
7.99E-12
|
-2.197425083
|
down
|
ENSMUSG00000031502
|
Col4a1
|
3.11E-11
|
-1.438161228
|
down
|
ENSMUSG00000039952
|
Dag1
|
1.18E-05
|
-1.128763692
|
down
|
ENSMUSG00000029304
|
Spp1
|
4.87E-133
|
5.690870707
|
up
|
Table 11
Genes involved in Glutathione metabolism pathway. (p-value = 0.000621008; rich factor = 2.366449275)
#ID
|
gene_name
|
FDR
|
log2FC
|
regulated
|
ENSMUSG00000018339
|
Gpx3
|
6.25E-15
|
6.07451606
|
up
|
ENSMUSG00000032348
|
Gsta4
|
3.32E-39
|
-2.885542505
|
down
|
ENSMUSG00000058135
|
Gstm1
|
4.63E-17
|
2.362659014
|
up
|
ENSMUSG00000074183
|
Gsta1
|
0.001087574
|
2.431146261
|
up
|
ENSMUSG00000038155
|
Gstp2
|
5.86E-09
|
2.328303859
|
up
|
ENSMUSG00000008540
|
Mgst1
|
1.70E-12
|
-1.768900074
|
down
|
ENSMUSG00000074604
|
Mgst2
|
2.62E-05
|
2.166938591
|
up
|
ENSMUSG00000039062
|
Anpep
|
2.81E-05
|
3.75229385
|
up
|
ENSMUSG00000028597
|
Gpx7
|
8.31E-74
|
4.851252354
|
up
|
ENSMUSG00000026688
|
Mgst3
|
3.90E-17
|
2.527299541
|
up
|
ENSMUSG00000001663
|
Gstt1
|
1.06E-12
|
2.740719466
|
up
|
ENSMUSG00000028961
|
Pgd
|
3.43E-07
|
1.073122031
|
up
|
ENSMUSG00000029864
|
Gstk1
|
1.31E-15
|
5.024276188
|
up
|
ENSMUSG00000027603
|
Ggt7
|
5.36E-05
|
1.792209318
|
up
|
ENSMUSG00000025934
|
Gsta3
|
8.73E-17
|
-6.331940412
|
down
|
ENSMUSG00000011179
|
Odc1
|
9.84E-12
|
-1.805655481
|
down
|
ENSMUSG00000027890
|
Gstm4
|
0.006745519
|
1.774064896
|
up
|
Gene Set Enrichment Analysis(GSEA)
GSEA results showed that the differentially expressed gene set was enriched for some functional gene networks that are clearly associated with endothelial cells, such as neuron differentiation (13 genes), bicellular tight junction (10 genes), andapical plasma membrane (22 genes), filopodium (8 genes), double-stranded RNA binding (10 genes), RNA-DNA hybrid ribonuclease activity (19 genes) and guanyl nucleotide binding (31 genes), as well as some novel signal pathways such as Ribosome biogenesis in eukaryotes (40 genes), Cytosolic DNA-sensing pathway (17 genes) ,Renin secretion (32 genes) and Pancreatic secretion (32 genes) (Fig. 8).
Alternative splicing profiles
The main types of gene alternative splicing (AS) are as follows: (A) Exon skipping; (B) Alternative 3'splicing site; (C) Mutually exclusive exon; (D) Alternative 5'splice site; (E) Intron retention. From the analysis results of the Astalavista software, We collected statistics on the occurrence of the above 5 alternative splicing events in the transcripts of 6 samples, and obtained a statistical graph of the number of predicted alternative splicing events in each sample.(Fig. 9) After analysing the data of 2H-11 and SVEC4-10, we detected Exon skipping was the most frequent AS events and Mutually exclusive exon was the rarest AS events. The prevalence of different AS events was similar between 2H-11 and SVEC4-10, indicating relevant pathogenesis of these two types of cells. It is worthy that one gene might possess several alternative splicing patterns.For each sample and every possible splice event, percent-splice-in (PSI) value was calculated, which is the ratio of normalized read counts indicating inclusion of a transcript element over the total normalized reads for that event (both inclusion and exclusion reads). Because the number of samples>=4, PSI-Sigma uses delta PSI>0.1 and Pvalue<0.01 to filter by default to get the filtered differential alternative splicing events.(Table. 12) Changes in average PSI values when comparing groups of samples mean a shift in splicing patterns between the groups or a splice event.
Table 12
Differentially spliced genes (DSGs) between 2H-11 and SVEC4-10.
GeneID
|
Target_Exon
|
Event_Type
|
ΔPSI (%)
|
T-test_p-value
|
FDR(BH)
|
Gm20425
|
9:103189546-103189594
|
A3SS
|
93.96
|
2.76E-08
|
3.29E-07
|
H2-K1
|
17:33996477-33996503
|
A3SS
|
75.73
|
3.38E-05
|
5.31E-05
|
Zscan21
|
5:138117836-138117920
|
SES
|
62.35
|
2.07E-05
|
6.94E-05
|
Gm20425
|
9:103189594-103189644
|
TSS|A5SS
|
61.62
|
6.72E-05
|
7.99E-04
|
Gm20425
|
9:103189594-103189642
|
TSS|A5SS
|
61.62
|
6.72E-05
|
7.99E-04
|
Gm20425
|
9:103189594-103189595
|
A5SS
|
61.62
|
6.72E-05
|
7.99E-04
|
Gm20425
|
9:103189546-103189594
|
A3SS
|
60.74
|
3.22E-06
|
3.83E-05
|
Npm1
|
11:33157037-33157048
|
A3SS
|
55.35
|
7.34E-08
|
8.22E-08
|
Arl14ep
|
2:106969298-106969558
|
A3SS
|
42.95
|
1.40E-03
|
2.81E-03
|
H2-K1
|
17:33996477-33996503
|
TSS|A3SS
|
41.31
|
1.70E-06
|
2.68E-06
|
Ube2t
|
1:134962610-134962786
|
TSS|A5SS
|
38.81
|
6.13E-04
|
1.11E-03
|
Eif4a2
|
16:23112351-23113166
|
TSS|A3SS
|
37.09
|
1.76E-03
|
2.59E-03
|
Fancl
|
11:26403322-26403378
|
SES
|
35.54
|
2.95E-03
|
3.29E-03
|
Mpnd
|
17:56009330-56009586
|
SES
|
34.69
|
1.26E-03
|
2.02E-03
|
Nmrk1
|
19:18632000-18632081
|
TSS|A5SS
|
34.09
|
1.84E-04
|
3.13E-04
|
Mtg1
|
7:140138380-140138380
|
A5SS
|
33.03
|
1.85E-04
|
9.90E-04
|
4833420G17Rik
|
13:119472139-119472166
|
A3SS
|
32.56
|
5.86E-03
|
7.24E-03
|
Apol9b
|
15:77733626-77733796
|
SES
|
32.1
|
3.58E-05
|
5.03E-05
|
Flywch1
|
17:23770235-23770332
|
SES
|
31.53
|
7.81E-03
|
1.19E-02
|
Eif4a2
|
16:23112351-23112461
|
SES
|
30.77
|
1.76E-03
|
2.59E-03
|
Target Exon: Genomic coordinates of the alternative exon. |
Event Type: Category of the splicing event |
A3SS: Alternative 3’splice site. A5SS: Alternative 5’s plice site. |
SES: Single exon skipping. TSS:Transcription start site(Alternative 5’first site) |
ΔPSI (%): the average difference of PSI values in group 2H-11 and group SVEC4-10 |
T-test p-value: p-value derived from two-sample t-test.
FDR (BH): false discovery rate based on the p-values.
Differentially spliced genes (DSGs) between 2H-11 and SVEC4-10 were analysed and the top 20 significantly altered AS events were summarized in Table. 12. A3SS(Alternative 3’splice site) and SES(Single exon skipping) were dominant AS types, and several genes (Gm20425, H2-K1, Zscan21, Npm1) demonstrated two AS events with opposite preference in 2H-11 and SVEC4-10. It is obvious that most AS patterns were more active in TECs than NECs. Moreover, the results demonstrated significant difference of AS patterns among different stages of 2H-11 and SVEC4-10 (all FDR<0.001), which indicate that certain alternative splicing patterns might show significant differences between TECs and NECs.
Alternative polyadenylation (APA) analysis
Polyadenylation refers to the covalent linkage of polyadenylic acid to messenger RNA (mRNA) molecules. In the process of protein biosynthesis, this is part of the way to produce mature mRNA ready for translation. In eukaryotes, polyadenylation is a mechanism that interrupts mRNA molecules at their 3'ends. The polyadenylic acid tail (or poly A tail) protects mRNA from exonuclease attack.And it is very important for the termination of transcription, export of mRNA from the nucleus and translation. The alternative polyadenylation (APA) of precursor mRNA may contribute to the diversity of the transcriptome and the coding ability of the genome and the regulatory mechanism of genes. We use TAPIS pipeline to identify APA.[27] The APA identified by each sample is shown in Fig. 10A and B.Moreover,we use DREME to analyze the sequence of 50bp upstream of the polyA site of all transcripts and the identified motif is shown in Fig. 10C and D.
Find fusion transcript
Use the consensus sequence before de-redundancy to screen each sample for fusion transcripts according to the following conditions:(1) must map to 2 or more loci; (2) minimum coverage for each loci is 5% and minimum coverage in bp is >= 1 bp; (3) The total length of all loci compared must account for more than 95% of the total length of the transcript; (4) The distance between the two loci must be more than 10kb.The fusion transcripts obtained from 6 samples ranged from 33 to 108.
Sequence prediction of coding region of new gene
The TransDecoder (v3.0.0) software is based on the length of the Open Reading Frame (ORF), the log-likelihood score, the comparison of the amino acid sequence and the protein domain sequence of the Pfam database, etc. It can identify reliable potential coding sequence (Coding Sequence, CDS) from the transcript sequence. Use TransDecoder software to predict the coding region sequence and its corresponding amino acid sequence of the new transcript obtained. A total of 19,021 orf were obtained this time, of which 9,954 were complete orf. The predicted sequence length distribution of the entire ORF region encoding protein is shown in the Fig. 11.
Transcription factor analysis
Transcription factors are proteins that can bind to a specific nucleotide sequence upstream of a gene. These proteins can regulate the binding of RNA polymerase to the DNA template, thereby regulating gene transcription. Animal transcription factor identification uses the animal transcription factor database-animalTFDB 3.0 [12]. This project predicts a total of 2,047 transcription factors from the new transcripts obtained. We only show the top 20 transcription factors Family information .Statistics on the number of different types of transcription factors are shown in Fig. 12.
long noncoding RNA (LncRNA) prediction
Because lncRNA does not encode protein, it is possible to determine whether the transcript is an lncRNA by screening the transcript for coding potential. Four methods of CPC[13] analysis, CNCI[14] analysis, CPAT[15], pfam[16] protein domain analysis were used to predict the newly discovered transcripts for lncRNA. A total of 954 lncRNA transcripts were predicted by the four methods. In order to visually display the analysis results, the noncding transcripts identified by the above 4 analysis softwares are added to the 4 analysis results to take the intersection. Then according to the position of the lncRNA on the reference genome annotation information (gff), the lncRNA is classified as shown in Fig. 13.