Identification of Novel Transcripts in 34 different pig tissues
To discover and map novel transcripts, we performed RNA-sequencing in 34 different pig tissues. An average of 48.48 million reads per tissue were sequenced from 116 strand-specific and paired-end cDNA libraries (Table S1); of these sequences, an average of 43.97 million reads (90.7%) per sample passed the strict quality control (QC). The 1,495 million high-quality reads (376.7 Gb, 135-fold genome coverage) were aligned to the pig genome (Sus scrofa 11.1) and 1,223 million mapped fragments (310.1 Gb, 110.8-fold genome coverage) with an average alignment rate of 88.29% were recovered (Table S2).
A total of 2,486,239 transcripts with 29,270 genes were produced across all tissues. In these transcripts, 60,578 transcripts with 20,401 coding genes and 3,486 non-coding genes were annotated in the pig Ensembl database. The remaining 2,281,529 novel transcripts were from 26,493 loci without any annotated information. Full details for each tissue are available in Table S3. We determined known protein coding genes with FPKM (fragments per kilobase of exon per million fragments mapped) >0.1 among these non-redundant transcript isoforms in least one tissue. These novel transcripts enhance pig non-coding gene annotation and increase the number of average transcripts of each gene in pig.
Classification of Alternative Splicing Types
AS events are classified into twelve basic types, among them, we counted 4 basic types of AS events including exon skipping, alternative acceptor 5’ site, intron retention and alternative acceptor 3’ site (Fig. 1A). We detected 138,403 AS events and 29,270 genes across all 34 tissues. Moreover, a gene has 4.73 AS events on average. Alternative donor sites was the most common AS event type (60,851, 44%), the other 3 types share the similar percentages (Exon skipping, 19%; Alternative acceptor site, 19%; Intron retention, 18%) (Fig. 1A). The number of novel AS events increased 1.5 to 9.4 times compared to original ones. The largest difference was showed in novel alternative donor sites in terms of quantity, which are accounting for 44% in all novel AS events (Fig. 1B).
The novel AS divide into two types, one is the novel AS of known genes and the other one is the novel transcript of unknown genes. The number of novel transcripts in different tissues was showed in Table S4. We found that much more novel transcripts of unknown genes were detected than those of known genes. We further compared the number of original and novel transcripts of per annotated genes. Fig. 2A showed that more annotated isoforms per gene, more novel isoforms. Compare to the original isoforms, novel isoforms did not increase too much, maybe because of the limited isoforms of per gene. The number would not change much even with improved methods.
The number of AS events among different genes could be various. In our analysis, the AS events number of AHNAK gene is up to 391. However, thousands of genes only have one AS event. In order to explain the difference existing in these genes, we performed cluster analysis on them. We considered the genes whose number of AS events is greater than 20 belong to a group, and those equals 1 belong to another group. Then we performed KEGG analysis on these two group (Fig. 2B, Table S5). For genes with AS events =1, altogether 24 pathways were significantly enriched, including phosphatidylinositol signaling system (P value is 6.30E-09, the number of genes is 59), inositol phosphate metabolism (P value is 2.80E-06, the number of genes is 43), glycerophospholipid metabolism (P value is 1.30E-04, the number of genes is 49) and so on. These genes were mostly involved in lipid metabolism. Other significant pathway mainly related with the cancer and reproduction. However, for genes with ASE > 20, only 7 pathways were significantly enriched. These pathways mainly involve in neuroactive and some immune-related diseases. The second significant pathway (olfactory transduction, P value is 4.9E-145) is enriched with 639 genes.
Then we further compared the genes distribution with one AS events and more than 20 AS events in the different tissues (Fig. 2C). Average 1.33 genes with one AS event were expressed in one tissue. However, there were altogether 5,857 genes in 34 tissues, which means many genes were expressed in multiples tissues simultaneously. For the genes with more than 20 AS events, average 27 were expressed in one tissue with 2,131 genes altogether.
Tissue-specificity of Alternative splicing in different tissues
The tissue specific AS events in pig transcriptome were detected by Jekroll-splicingexpress . We defined those were found to be expressed above 0.1 FPKM in only one tissue type, with expression of <0.1 FPKM in all others to be tissue specific transcripts. The number of specific transcripts found varied significantly in different tissues, with PBMC (Peripheral Blood Mononuclear Cell) possessing the most unique isoforms (1, 633) and pancreas possessing the least (133; Table 1). Further examination of these tissue-specific transcripts showed that 86% of the genes that encode them were also only expressed in a single tissue with above 0.1 FPKM, revealing that these transcripts’ tissue specificity occurs at the transcriptional level. The remaining 14% of tissue-specific transcripts had other isoforms expressed in other different tissues, indicating that their tissue specificity likely results from AS (Fig. 3A).
In order to identify developmentally regulated changes in AS, Stringtie was used to quantify transcript expression across all tissues. To assess the relationship between the different tissues and their splicing profiles, we have computed the number of isoforms and related genes in different tissues (Table S3). Totally, 7,595 novel isoforms from 2,421 known non-coding genes and 136,537 novel isoforms from 15,385 coding genes were identified. Herein we considerate that the annotated transcripts are those with Ensemble IDs. Adult pig testes had the highest number isoforms when compared with other tissues, followed by thymus, which is similar to humans . The number of AS in adult pig testis is 2,459 more than that in infant testis. This difference may arise from sexual maturity of pig since testis is the genital organ and generates sperm.
We then analyzed the expression of novel and known transcripts in different tissues (Table S6). The average level of expression for novel transcripts in pancreas, liver, longissimus, dorsi, spleen, prostate, adrenal gland and breast these tissues have a higher expression (FPKM>10) than known transcripts. By comparing the difference between the highest expression and the lowest expression in the genes with different AS numbers, it can be found that the more isoforms of a gene, the greater difference in expression (Fig. 3B). Our results found that 116,439 genes were tissue specific, of which 109,773 genes were new detected. In the meanwhile, 5,683 genes were found to be expressed in all tissues, of which 110 are first detected (Table S7). Our results will provide useful resource to improve transcript abundance.
Potential Effect of Novel Alternative Splicing on the Pig Proteome
Translated DNA sequences of novel transcripts were aligned to UniProt database by DIAMOND software . In order to determine their possible effect on the pig proteome, computationally predicted proteins encoded by novel isoforms of known genes were compared to the annotated proteins generated by their similar known transcript. Total of 1,184 AS were mapped to 361 new proteins (Table S8). Our results have improved the annotation of unknown proteins.
HMMER3 was used to identify conserved domains within novel and known proteins, which were then compared in a pairwise manner. One novel transcript of APOBEC3B in tissue of uterus was mapped to a protein (UniProt identifier: D3U1S2_PIG). By comparing their most similar known (UniProt identifier: F1SNY0_PIG) and novel transcript, transcript length changed from 3083 bp to 3072 bp. The annotated protein contains two APOBEC-like N-terminal domains, while a new isoform lacks one APOBEC-like N-terminal domain potentially altering its function significantly (Fig. 4A). For coding exons, the number of novel transcript was less 3 than known one (Fig. 4C). It has been showed that DNA ligase 3 (LIG3) proteins have been proposed to active leukemia via transcription error . Our results found that one novel transcript (UniProt identifier: I3L9T2_PIG) of LGI3 have gained a domain (DNA ligase 3 BRCT domain) than the most similar known transcript (UniProt identifier: I3LN18_PIG; Fig. 4B). On the other hand, the novel transcript has one less the coding exons than known one (Fig. 4D). In the meanwhile, we have predicted the protein conformations of these four transcripts (Fig. 4E&4F). From the protein conformations, we can also see the domains’ loss or gain.
Conserved AS between pig and human
The pig is generally considered a promising medical model for human disease and pig orthologs of many human disease-associated genes have been identified. In order to investigate the orthologous relationship between humans and pigs at protein level, we aligned the detected pig proteins with the human proteins using blast software and the criteria of a minimum length ≥ 50 bp and identity ≥ 80%. We identified 4,168 (56.97%) pig proteins with orthology to human proteins. We used protein MX2 as an example (Fig. 4G). The protein sequence was aligned to the human proteins. Six conserved domains were found, including DLP_1, DYNc, Dynamin_M, Dynamin_N, GED and CrfC (Fig. 4H; Table S9). These conserved proteins provide an indication of genome evolution and allow us to further explore the potential functional similarity between human and pig genomes.
Effect of SNP on Pig Alternative Splicing
Deep RNA-seq sequencing can help us detect sequence variations associated with AS. Herein we grouped these AS into 3 types: canonical splicing with a GT–AG intron (i.e., GT and AG splicing signals at donor and acceptor sites, respectively); Semi-canonical splicing with GC–AG or AT–AC introns; Novel splicing without GT–AG introns. The semi-canonical and novel splicing are also called aberrant splicing together. SNP occurs in different regions of a gene would change the expression of transcript, then change protein abundance, especially when SNP occurs in the exon region (Table S10). The change of exon would cause synonymous or nonsynonymous mutation in proteins. Our results showed some aberrant splicing in the exon lead to non-synonymous mutations in codons, more seriously, some translations are terminated in advanced in some tissues (Table S11).
Related SNP, genes and codons were described in the Table 2 when an aberrant splicing occurs more than 20 tissues. Our result showed that 6 novel splicing cause non-synonymous mutation (Table 2). Particularly, we found a missense SNV (exon3:c.T376A:p.S126T) within Synapse associated protein 1 (SYAP1) gene that generated a Ser -to- Thr substitution, leading to a change in the SYAP1 protein conformation space.
Validation of kwon and novel transcripts by RT-PCR
The reliability of transcripts identification in this study was verified by RT-PCR of 8 randomly selected novel and 2 known transcripts in 6 tissues (Fig. 5A). For transcripts identified in all tissues, the RT-PCR analysis showed 5 (TCONS_00011100, TCONS_01815428, TCONS_01410294, TCONS_00028775 and TCONS_02428049) were expressed in all tissues, and 1 other (TCONS_01245051) were detected in 5 tissues. TCONS_01413087 and TCONS_01387293 were successfully validated as tissue-specific transcripts in kidney and thyroid respectively. These results confirm the accuracy of our identification and characterization of the pig transcripts.