Profiling novel alternative splicing within multiple tissues puts insight into the porcine genome annotation

Background Alternative splicing (AS) is a process that mRNA precursor splices intron to form the mature mRNA. AS plays important roles in contributing to transcriptome and proteome divert. However, to date there is no research about pig AS in genome-wide level by RNA sequencing. Results To characterize the AS in pigs, herein we detected genome-wide transcripts and events by RNA sequencing technology (RNA-seq) 34 different tissues in Duroc pigs. In total, we identified 138, 403 AS events and 29, 270 expressed genes. We found alternative donor site was the most common AS form, which is accounted for 44% of the total AS events. The percentage of the other 3 AS forms (Exon skipping, Alternative acceptor site and Intron retention) are all around 19%. The results showed that the most common AS events (alternative donor site) can produce different transcripts or different proteins which affect the biological process. Among these AS events, 109, 483 were novel AS events, and the number of alternative donor splice site has increased the most (Accounting for 44% of the novel AS events). Conclusions The expression of gene with tissue specific AS events showed that the functions of these genes were consistent with the tissue function. AS increased proteome diversity and resulted in novel proteins that gained and lost important functional domains. In summary, these findings extend genome annotation and highlight roles that AS acts in tissue identity in pig.


Background
Alternative splicing (AS) is a regulated process, which can generate multiple transcripts from a single gene and therefore plays an extremely important role in expanding protein diversity. AS has effects on 95% of multi-exonic genes in human and it also presented a high AS proportion in other animals [1][2][3]. AS have four basic modes, which are exon skipping, alternative donor site, alternative acceptor site and intron retention (IR), of which the most common is exon skipping in animal [4]. In addition to these four forms, there are two other main mechanisms by which different mRNAs may be generated from the same gene, multiple promoters [5] or multiple polyadenylation sites [6].
Many alternatively spliced isoforms play important roles in the biological timing and development of tissue [7][8][9][10]. It is becoming clear that a large amount of AS contribute to the acquisition of adult tissue functions and identity in human during tissue development [11]. AS can also reveal new regulatory mechanisms when genes simultaneously express multiple isoforms, like causing human diseases [12]. Protein coding genes always have several alternatively spliced isoforms, which reveals the importance of AS's function on gene expression [13]. Proteins generated by different AS also have the potential to gain or lose domain, which may change significantly. Those protein products might have same, similar or even opposite biological functions, then affect the performance of phenotypes [14]. Point mutation like single nucleotide polymorphism (SNP), which may be the result of changes in alternative splicing [15][16][17][18][19][20], have been shown to have substantial phenotypic variation affect pre-mRNA splicing [21]. AS could lead to early termination of transcription by premature stop codons [22]. AS regulation plays roles in multiple eukaryotic biological processes, including cell growth [23], lung cancer [24], chromatin modification [25] and tissue development [26].
As the best medical model for human disease, pig has similar anatomical and physiological structure with human [27]. Various research groups have attempted to understand the role of AS in the pig by using expressed sequence tags (EST). A total of 1, 223 genes with average 2.8 splicing variants were detected among 16, 540 unique genes via EST data [28]. It was reported that more AS could be produced by RNA-seq than EST technology in human and plants [1,29]. AS in pigs found to play an essential role in regulation of gene expression in genital tissues [30,31]. However, to our best knowledge, there is no 4 research about pig AS in genome-wide level by RNA sequencing.
In this study, we constructed a profiling AS within multiple tissues to extend the porcine AS genome annotation. To achieve this aim, we performed genome-wide transcripts identification in 34 normal tissues of the pig using over 340 Gb of sequence data generated from 116 RNA sequencing (RNA-seq) libraries. We identified 2, 486, 239 transcripts and 2, 430, 911 novel transcripts. Novel and known transcripts were examined for tissue-specific expression patterns and new isoforms were investigated to determine how their divergence from annotated transcripts affects their protein coding and regulation by SNP. The data reported here provides a valuable resource for enhancing the understanding and utilization of the pig AS.

Identification of Novel Transcripts in 34 different pig tissues
To discover and map novel transcripts, we performed RNA-sequencing in 34 different pig tissues. An average of 48.48 million reads per tissue were sequenced from 116 strandspecific and paired-end cDNA libraries (Table S1); of these sequences, an average of 43.97 million reads (90.7%) per sample passed the strict quality control (QC). The 1,495 million high-quality reads (376.7 Gb, 135-fold genome coverage) were aligned to the pig genome (Sus scrofa 11.1) and 1,223 million mapped fragments (310.1 Gb, 110.8-fold genome coverage) with an average alignment rate of 88.29% were recovered (Table S2).
A total of 2,486,239 transcripts with 29,270 genes were produced across all tissues. In these transcripts, 60,578 transcripts with 20,401 coding genes and 3,486 non-coding genes were annotated in the pig Ensembl database. The remaining 2,281,529 novel transcripts were from 26,493 loci without any annotated information. Full details for each tissue are available in Table S3. We determined known protein coding genes with FPKM (fragments per kilobase of exon per million fragments mapped) >0.1 among these non-redundant transcript isoforms in least one tissue. These novel transcripts enhance pig non-coding gene annotation and increase the number of average transcripts of each gene in pig.

Classification of Alternative Splicing Types
AS events are classified into twelve basic types, among them, we counted 4 basic types of AS events including exon skipping, alternative acceptor 5' site, intron retention and alternative acceptor 3' site ( Fig. 1A). We detected 138,403 AS events and 29,270 genes across all 34 tissues. Moreover, a gene has 4.73 AS events on average. Alternative donor sites was the most common AS event type (60,851, 44%), the other 3 types share the similar percentages (Exon skipping, 19%; Alternative acceptor site, 19%; Intron retention, 18%) (Fig. 1A). The number of novel AS events increased 1.5 to 9.4 times compared to original ones. The largest difference was showed in novel alternative donor sites in terms of quantity, which are accounting for 44% in all novel AS events (Fig. 1B).
The novel AS divide into two types, one is the novel AS of known genes and the other one is the novel transcript of unknown genes. The number of novel transcripts in different tissues was showed in Table S4. We found that much more novel transcripts of unknown genes were detected than those of known genes. We further compared the number of original and novel transcripts of per annotated genes. Fig. 2A showed that more annotated isoforms per gene, more novel isoforms. Compare to the original isoforms, novel isoforms did not increase too much, maybe because of the limited isoforms of per gene. The number would not change much even with improved methods.
The number of AS events among different genes could be various. In our analysis, the AS events number of AHNAK gene is up to 391. However, thousands of genes only have one AS event. In order to explain the difference existing in these genes, we performed cluster analysis on them. We considered the genes whose number of AS events is greater than 20 6 belong to a group, and those equals 1 belong to another group. Then we performed KEGG analysis on these two group (Fig. 2B, Table S5). For genes with AS events =1, altogether 24 pathways were significantly enriched, including phosphatidylinositol signaling system (P value is 6.30E-09, the number of genes is 59), inositol phosphate metabolism (P value is 2.80E-06, the number of genes is 43), glycerophospholipid metabolism (P value is 1.30E-04, the number of genes is 49) and so on. These genes were mostly involved in lipid metabolism. Other significant pathway mainly related with the cancer and reproduction. However, for genes with ASE > 20, only 7 pathways were significantly enriched. These pathways mainly involve in neuroactive and some immunerelated diseases. The second significant pathway (olfactory transduction, P value is 4.9E-145) is enriched with 639 genes.
Then we further compared the genes distribution with one AS events and more than 20 AS events in the different tissues (Fig. 2C). Average 1.33 genes with one AS event were expressed in one tissue. However, there were altogether 5,857 genes in 34 tissues, which means many genes were expressed in multiples tissues simultaneously. For the genes with more than 20 AS events, average 27 were expressed in one tissue with 2,131 genes altogether.

Tissue-specificity of Alternative splicing in different tissues
The tissue specific AS events in pig transcriptome were detected by Jekroll-splicingexpress [32]. We defined those were found to be expressed above 0.1 FPKM in only one tissue type, with expression of <0.1 FPKM in all others to be tissue specific transcripts. The number of specific transcripts found varied significantly in different tissues, with PBMC (Peripheral Blood Mononuclear Cell) possessing the most unique isoforms (1, 633) and pancreas possessing the least (133; Table 1). Further examination of these tissue-specific transcripts showed that 86% of the genes that encode them were also only expressed in a 7 single tissue with above 0.1 FPKM, revealing that these transcripts' tissue specificity occurs at the transcriptional level. The remaining 14% of tissue-specific transcripts had other isoforms expressed in other different tissues, indicating that their tissue specificity likely results from AS (Fig. 3A).
In order to identify developmentally regulated changes in AS, Stringtie was used to quantify transcript expression across all tissues. To assess the relationship between the different tissues and their splicing profiles, we have computed the number of isoforms and related genes in different tissues (Table S3). Totally, 7,595 novel isoforms from 2,421 known non-coding genes and 136,537 novel isoforms from 15,385 coding genes were identified. Herein we considerate that the annotated transcripts are those with Ensemble IDs. Adult pig testes had the highest number isoforms when compared with other tissues, followed by thymus, which is similar to humans [33]. The number of AS in adult pig testis is 2,459 more than that in infant testis. This difference may arise from sexual maturity of pig since testis is the genital organ and generates sperm.
We then analyzed the expression of novel and known transcripts in different tissues ( Table S6). The average level of expression for novel transcripts in pancreas, liver, longissimus, dorsi, spleen, prostate, adrenal gland and breast these tissues have a higher expression (FPKM>10) than known transcripts. By comparing the difference between the highest expression and the lowest expression in the genes with different AS numbers, it can be found that the more isoforms of a gene, the greater difference in expression (Fig.   3B). Our results found that 116,439 genes were tissue specific, of which 109,773 genes were new detected. In the meanwhile, 5,683 genes were found to be expressed in all tissues, of which 110 are first detected (Table S7). Our results will provide useful resource to improve transcript abundance. potentially altering its function significantly (Fig. 4A). For coding exons, the number of novel transcript was less 3 than known one (Fig. 4C) Fig. 4B). On the other hand, the novel transcript has one less the coding exons than known one (Fig. 4D). In the meanwhile, we have predicted the protein conformations of these four transcripts (Fig. 4E&4F). From the protein conformations, we can also see the domains' loss or gain.

Conserved AS between pig and human
The pig is generally considered a promising medical model for human disease and pig orthologs of many human disease-associated genes have been identified. In order to investigate the orthologous relationship between humans and pigs at protein level, we 9 aligned the detected pig proteins with the human proteins using blast software and the criteria of a minimum length ≥ 50 bp and identity ≥ 80%. We identified 4,168 (56.97%) pig proteins with orthology to human proteins. We used protein MX2 as an example (Fig.   4G). The protein sequence was aligned to the human proteins. Six conserved domains were found, including DLP_1, DYNc, Dynamin_M, Dynamin_N, GED and CrfC ( Fig. 4H; Table S9). These conserved proteins provide an indication of genome evolution and allow us to further explore the potential functional similarity between human and pig genomes.

Effect of SNP on Pig Alternative Splicing
Deep RNA-seq sequencing can help us detect sequence variations associated with AS.
Herein we grouped these AS into 3 types: canonical splicing with a GT-AG intron (i.e., GT and AG splicing signals at donor and acceptor sites, respectively); Semi-canonical splicing  Table   S11).
Related SNP, genes and codons were described in the Table 2 when an aberrant splicing occurs more than 20 tissues. Our result showed that 6 novel splicing cause nonsynonymous mutation ( Table 2). Particularly, we found a missense SNV (exon3:c.T376A:p.S126T) within Synapse associated protein 1 (SYAP1) gene that generated a Ser -to-Thr substitution, leading to a change in the SYAP1 protein conformation space.

Validation of kwon and novel transcripts by RT-PCR
The reliability of transcripts identification in this study was verified by RT-PCR of 8 randomly selected novel and 2 known transcripts in 6 tissues (Fig. 5A). For transcripts identified in all tissues, the RT-PCR analysis showed 5 (TCONS_00011100, TCONS_01815428, TCONS_01410294, TCONS_00028775 and TCONS_02428049) were expressed in all tissues, and 1 other (TCONS_01245051) were detected in 5 tissues.
TCONS_01413087 and TCONS_01387293 were successfully validated as tissue-specific transcripts in kidney and thyroid respectively. These results confirm the accuracy of our identification and characterization of the pig transcripts.

Discussion
In the present study, we profiled novel AS from 34 different tissues in Duroc pigs via deep RNA sequencing, which provided insight into the porcine genome annotation. A total of 138,403 AS events and 29,270 expressed genes were detected, 4.73 events in each gene on average. The average number of AS events in each gene has improved 0.69 times compared with 2.8 splicing variants by EST data [11]. Previous studies have revealed that exon skipping, alternative donor site, alternative acceptor splice site, and intron retention are the four basic AS types. Exon skipping is thought to be the most prevalent AS type in animals [4]. In our study, alternative donor site is the most common AS form, which is accounted for 44% of the total AS events. The percentage of the other 3 AS forms are all around 19%. The results showed that the most common AS events (alternative donor site) can produce different transcripts or different proteins which affect the biological process.
Among these AS events, 109, 483 were novel AS events, and the number of alternative donor splice site has increased the most (Accounting for 44% of the novel AS events).
Regardless of the splicing mechanism, this large increase in new alternatively spliced transcripts will improve the diversity of pig proteome dramatically.

11
We found genes with less AS events have more chance to express in the tissue-specific manner. Genes with only one AS event were majoring located on the pathway of lipid metabolism. For those with more than 20 AS events belonged to pregnancy or disease related signal pathways. Whether these genes in different kinds of pathways is related to the number of their AS events need further research, we detected a total of tissue specific 15,606 AS events were in 3,743 genes based on our AS profiling analysis. We found the function of genes with tissue specific AS is nearly consistent with their related tissues. In genes not only confirm the biological properties of known genes, but can also be used to predict the potential function of unknown genes in the pig genome.
In generally, majority of the novel splicing events lead to new proteins. We revealed that novel AS often contained domain losses or gains relative to their most similar known isoform, then even lead to opposite function [14]. In our analysis, we observed unknown proteins database was computationally predicted for all novel transcripts. 1184 AS were mapped to 361 new proteins.. And we took gene APOBEC3B and LIG3 as examples. By comparing the known and novel transcript, transcript length had changed. The novel transcript of APOBEC3B and LIG3 loss and gain a domain predicted by HMMER3, respectively (Fig. 5A&D).
SNPs affect protein-binding sites (or mutations in the binding proteins themselves) and contribute to aberrant splicing. Some nonsynonymous mutations even cause transcription to stop in advance. We found a missense SNV (exon3:c.T376A:p.S126T) within Synapse associated protein 1 (SYAP1) gene that generated a Ser -to-Thr substitution. We can see the protein conformation of these two AS predicted by Swiss-model have changed from the Serine to Threonine.

Conclusion
In summary, our analysis greatly expanded the pig transcriptome to include more than

Read alignment to the reference Sus scrofa 11.1 genome
The RNA-Seq raw dat were trimmed based on the quality control for downstream analyses by following steps: BBmap automatically detected the adapter sequence of reads and removed reads containing illumina adapters; the Q20, Q30 and GC content of the clean data were calculated by FASTQC for quality control and filtering; homopolymer trimming to 3′ end of fragments and removed the N bases of 3′ end were carried out by Fastx toolkit (version 0.0.14). The resulting sequences then were mapped to reference genome

Alternative splicing events in pig transcriptome
Asprofile (v1.0.4) software was used to classify and count the AS events in each sample [39], whose results consists of twelve AS events types. In our analysis, only exon skipping, alternative donors ite, alternative acceptor site and intron retention were included. Tissue specific AS events in pig transcriptome were detected by jekroll-splicingexpress software (https://bitbucket.org/jekroll/splicingexpress/).

Comparison of Novel and Known Protein Domains
Translated DNA sequences of novel transcripts were aligned to UniProt database by

SNP compared with Alternative Splicing Variations
Here we mainly used the pvaas software [40] to detect the SNV mutation associated with the novel AS, which is mainly to determine the correlation between the SNV and AS by Fisher exact test. Its reliability were assessed by P value after correcting by the Benjamini-Hochberg procedure. Further filtering were performed in order to ensure the accuracy of SNV: The minimum mutation frequency should be greater than 5% (the minimum mutation reads number accounts for more than 5% of AS); the mutant reads number >= 5; the corrected P value <0.001; the reads number of new AS>=10.

Known and Novel AS Validation by PCR
RNAs from 6 pig tissues including lung, kidney, brain, gut, thyroid and lymph were transcribed into cDNA using PrimeScript™ RT reagent Kit with gDNA Eraser ( TaKaRa Bio) for PCR reaction. Two reverse transcribed reaction system were conducted. DNA-free contained 1.0 μg RNA, 2.0 μL 5×gDNAEraser Buffer, 1.0 μL gDNA Eraser 7 μL RNase-free

Availability of Data and Materials
The sequenced RNA-Seq raw data for 34 pig tissues is available from NCBI Sequences Read Archive with the BioProject number: PRJNA392949.  Table S1. Information of sequence data from 34 pig tissues.