CNS transcriptome assembly and annotation
L. stagnalis is a widely used model organism in understanding the fundamental mechanisms of neural function due to its simple and well-characterized CNS. A de novo assembly of the L. stagnalis CNS transcriptome from 100 bp single-end reads has previously been published [50], but the completeness of the assembly had been hindered by the lack of a genome reference. With the aid of a preliminary, recently assembled L. stagnalis genome, here we employed a combination of genome-guided and de novo approaches to assemble reads from four 150bp paired-end libraries that we had prepared from four adult central ring ganglia and the aforementioned published 100bp single-end library (Sadamoto et al. 2012) to create a new and improved L. stagnalis CNS transcriptome assembly (Figure 1). After correcting for erroneous bases and removing unfixable reads (Table S3), each of the five libraries had at least 88% of the reads mapped to the genome with unique location (Table S4). The mapped reads in each library were assembled in a genome-guided manner using multiple assemblers and the unmapped reads were pooled for de novo assembly using Trinity. Altogether, this resulted in 22 assemblies containing 1,651,924 transcripts in total (Table S5). To identify putative protein-coding transcripts (Figure 1), we analyzed this set of sequences using the Evigene pipeline to identify 196,514 non-redundant transcripts that contained a complete or 3’-partial (containing start but not stop codon) open reading frame (ORF) (Table S6). Potential transcript artifacts and noise were filtered out, resulting in a transcript set containing 68,094 transcripts. As ORFs can arise spuriously [52], we further identified protein domain-containing transcripts by annotation against the Pfam database and/or signaling peptide prediction by both SignalP and Phobius, resulting in 17,832 sequences as the final set of predicted protein-coding transcripts. Annotation of the top 20 most highly expressed predicted protein-coding transcripts showed that the majority were annotated, or predicted as signaling peptides (Table 1).
Comparison and aggregation of current assembly with previously published L. stagnalis CNS transcriptome
To compare the number and quality of predicted protein-coding transcripts uncovered in our current assembly with those in the previously published expressed sequence tags (EST) library [49] and de novo assembly [50], these two earlier sets of sequences were processed as described above to identify potential protein coding transcripts. We found 1,946 sequences (out of 10,375 in total) that contained at least one protein domain and/or were annotated as a signaling peptide by both Phobius and SignalP in the EST library, and 11,742 such sequences (out of 116,355 in total) in the de novo assembled library. BUSCO analysis showed that the predicted protein-coding transcripts identified in the current assembly contained 92.7% of single-copy ortholog genes present in >90% eukaryotic species, higher than the de novo assembly (91.5%) and the EST library (13.4%) (Table 2).
Comparison of the amino acid sequence length across the three sets of predicted protein coding sequences (Figure 2A) found that the mean (ESTs: 196 aa; de novo: 470 aa; current: 569 aa), median (ESTs: 209 aa; de novo: 353 aa; current: 415 aa), and maximum (ESTs: 313; de novo: 8,195 aa; current: 13,109 aa) sequence lengths in the current assembly were all higher than those in the ESTs library and de novo assembly. Comparison of unique Nr hits in the three sets of predicted protein-coding sequences showed that 7,198 Nr hits were found in all three, with 3,119, 1,338, and 316 hits found only in the current assembly, de novo assembly, and ESTs library, respectively (Figure 2B). Taken together, these findings indicate that the current combined genome-guided and de novo assembly improved the coverage and sequence length of the L. stagnalis CNS protein-coding transcriptome.
As shown in Figure 2B, each of the three transcriptome libraries captured sequences absent from the other two. Therefore, we combined the three sets of sequences (31,520 sequences in total) to create the most comprehensive predicted protein-coding sequences library to date, which were further clustered into a collection of 16,447 non-redundant predicted protein-coding sequences. BUSCO analysis showed that this set contains 96.2% of 978 single-copy orthologs present in >90% of all eukaryotic species, higher than all three libraries when analyzed individually (Table 2). We next examined this set of sequences to characterize gene expression in the L. stagnalis CNS.
Comparisons of CNS transcriptomes of key neuroscience model organisms
This improved L. stagnalis CNS transcriptome provides an unprecedented opportunity to carry out comparative studies of CNS gene expression in a number of commonly used vertebrate and invertebrate neuroscience model organisms, so as to gain insights into commonalities and differences that may inform future studies. To this end, we analyzed publicly deposited RNA-seq libraries to characterize gene expression in the CNS of adult Mus musculus (mouse) [53], Xenopus tropicalis [53], Danio rerio (zebrafish) [54] , Drosophila melanogaster (fruit fly) (PRJNA320764) and Caenorhabditis elegans [55] (Table S1). Due to differing levels of non-coding RNA annotation for each species, we have focused our analyses on protein-coding transcripts. Annotation of the top 20 expressed transcripts in each of these five species (Table S7-11), in the same manner as done for L. stagnalis (Table 1), found a predominance of transcripts involved in energy production, protein synthesis, and signaling, in both vertebrate and invertebrate species. To gain a complementary perspective, we characterized the functional categorization of CNS protein-coding transcripts amongst L. stagnalis and these five common model organisms (Figure 3) using their KOG annotations. Interestingly, the general distribution was largely similar, with "Signal transduction mechanisms" being the most abundant category overall. Nevertheless, in all three of the vertebrate species examined, more transcripts were functionally classified as "Signal transduction mechanisms", "Transcription", and "Cytoskeleton" than in the three invertebrate species. Conversely, in the three invertebrate species, more transcripts were identified to be functionally involved in "Carbohydrate transport and metabolism", "Lipid transport and metabolism", and "Translation, ribosomal structure and biosynthesis" than those in the three vertebrate species.
We further clustered CNS protein-coding transcripts in the six species into orthologous groups to identify core CNS functions that are mediated by orthologous genes across distantly related species. In total, we identified 12,022 orthogroups, with 3,729 orthogroups containing genes from all of the six species examined. Interestingly, L. stagnalis shares more orthogroups with the three vertebrate species than with C. elegans or fruit fly (Figure 4). Functional enrichment of mouse genes (as they have the best annotation quality) in orthogroups shared amongst all species identified a variety of key CNS functions, including “Myelination” (GO:0042552), “Glutamate receptor signaling pathway” (GO:0007215), “Mitochondrial respiratory chain” (GO:0005746), and “ionotropic glutamate receptor binding” (GO:0035255) (Figure 5; Table S12). These findings were complemented by Reactome (Table S13) and KEGG (Table S14) pathway analyses of the same set of mouse genes, which further identified known core CNS functions including “Axon guidance” (REAC:R-MMU-422475), “Cellular response to hypoxia” (REAC: R-MMU-2262749), “cAMP signaling pathway” (KEGG:04024) and “Insulin signaling pathway” (KEGG:04910). Furthermore, we identified 222 transcript factors whose binding motifs were enriched in the promoter region of this set of mouse genes (Table S15), identifying both well-characterized neuronal transcription factors such as CREB1 and AP2, but also novel proteins that may provide targets of future studies. We also performed similar enrichment analyses of genes in the 2,154 vertebrate-only and 88 invertebrate-only orthogroups, in order to gain insights into differential functional specializations of the vertebrate and invertebrate CNS. Enrichment of mouse genes in the vertebrate-only orthogroups showed a constellation of GO (Figure 6; Table S16), Reactome (Table S17) and KEGG (Table S18) terms related to immune/inflammatory processes, including “Leukocyte chemotaxis” (GO:0030595), “Regulation of interferon-beta production” (GO:0032648), “Complement and coagulation cascades” (KEGG:04610), and “Receptor-type tyrosine-protein phosphatases” (REAC:R-MMU-388844). Enrichment of fruit fly genes in the invertebrate-only orthogroups found terms relating to response to chemical stimuli, such as “Sensory perception of chemical stimulus” (GO:0007606), “Peptide receptor activity” (GO:0001653), and “G-protein coupled peptide receptor activity” (GO:0008528) (Figure 7; Table S19). No enriched Reactome or KEGG pathways were found for this group of genes.
Identification of ion channels and ionotropic receptors expressed in the L. stagnalis CNS and interspecies comparison
Ion channels and ionotropic neurotransmitter receptors play critical roles in regulating neuronal excitability and synaptic transmission in the CNS. As the simple and well-characterized L. stagnalis CNS is widely employed to study the fundamental mechanisms of neuronal function and circuit function [4, 56-58], we next sought to identify ion channels and ionotropic receptors expressed in the L. stagnalis CNS. We identified 211 protein-coding transcripts that contained both 1) ion channel or ionotropic receptor protein domains and 2) transmembrane helices to characterize the complement of ion channels and ionotropic receptors expressed in the L. stagnalis CNS. A proportion of sequences in this set were substantially shorter or longer than their best matched sequence in the Nr database. However, for the sake of completeness, we have included all in our analysis as some may be informative as splice isoforms. Sequence alignment-based phylogenetic analysis of these transcripts showed that they encode a wide range of ionotropic neurotransmitter receptors (Figure 8; Tables S20-22), K+ (Figure 9; Table S23), Ca2+ (Figure 10; Table S24), Na+ (Figure 11; Table S25), Cl- (Figure 12; Table S26), cation (Figure 13; Table S27), and transient receptor potential (TRP) (Figure 14; Table S28) channels. Only 31 of the 211 putative ion channel/ionotropic receptor transcripts were annotated as a L. stagnalis sequence in the Nr database, indicating that majority of the sequences were identified here in this species for the first time.
We further examined the degrees of protein sequence similarity between ion channel/ionotropic receptor transcripts identified in L. stagnalis and those in these five model organisms through the BLASTp program and resulting bitscores of BLASTp hits. We plotted the standardized bitscore of each L. stagnalis ion channel/receptor against its top BLASTp hit in each species in a Clustergrammer heatmap (link provided in the figure legend), where bitscores lower and higher than the average are shown in blue and red, respectively (Figure 15). Each row corresponds to a putative L. stagnalis ion channel/ionotropic receptor transcript and each column is one of the five species examined. First, ranking the species (columns) by the absolute sum of standardized bitscores, such that the species with the highest degree of sequence similarity with L. stagnalis is on the left of the graph, we observed that putative L. stagnalis ion channel/ionotropic receptor transcripts showed the most sequence similarity with their homologs in fruit fly and X. tropicalis, and the least with mouse and C. elegans (Figure 15A). Then, by sorting the transcripts according to their class (ex. Ca2+ channels, K+ channels, glutamate receptors, etc.), we found that L. stagnalis transcripts encoding Ca2+ channels showed the highest degrees of sequence similarities with their homologs as compared to all other classes (Figure 15A). Finally, to more closely examine the individual L. stagnalis transcripts that showed the highest sequence similarities with other species, we sorted the transcripts (rows) by their absolute bitscore sum and focused on the top 20 transcripts (Figure 15B). Indeed, in this list we observed numerous putative Ca2+ channel transcripts, such as ryanodine receptors (RyR), inositol 1,4,5-trisphosphate receptors, and voltage-gated Ca2+ channels (both L- and non-L-type). However, there were also a diversity of cation, K+ and TRP channels in this list. Taken together, these findings provide not only a resource for investigating the molecular mechanisms of CNS neural transmission and neuronal excitability in L. stagnalis, but also a foundation for better understanding the conservation and evolution of these processes across species.