Transcriptome-wide single nucleotide polymorphisms related to electric organ discharge differentiation among African weakly electric fish (genus

Background: African weakly electric fish generate weakly electric organ discharges (EODs) for orientation and communication. In the mormyrid genus Campylomormyrus, waveform and duration of discharges are species-specific and elongated pulses (> 4ms) are a derived trait. Pulse duration is presumed to depend on electrocyte geometry/excitability or hormonal signaling.Tissue-specific transcriptome analyses across species with different pulse durations have revealed differential expression of genes related to electrocyte excitability. There are also point mutations in ion channel genes which relate to pulse duration. However, a comprehensive assessment of expressed Single Nucleotide Polymorphisms (SNPs) throughout the entire transcriptomes of African weakly electric fish, with the potential to identify further genes influencing EOD duration, is still lacking. This is of particular value, as discharge duration is likely based on multiple cellular mechanisms and various genes. Here, we want to identify non-synonymous SNPs in transcriptomes of Campylomormyrus species differing by EOD duration. Results: We identified 27 candidate genes with inferred positive selection among Campylomormyrus species and at least one non-synonymous substitution specific to Campylomormyrus tshokwe, a species with elongated EODs. Inferred candidate genes had mainly functions linked to transcriptional regulation, cell proliferation, cell differentiation and apoptosis. Further, we identified 27 GO terms and 2 KEGG categories ('Transcription' and 'Cancer: Specific types') for which C. tshokwe (derived elongated EOD) significantly more frequently exhibited a species-specific expressed substitution than C. compressirostris (ancestral short EOD). Conclusion: We performed the first transcriptome-wide SNP analysis among weakly electric fish species (genus Campylomormyrus) and provide putative candidate genes and cellular mechanisms potentially involved in the determination of an elongated discharge in C.

tshokwe. Most likely, transcriptional regulation as well cell proliferation, cell differentiation and apoptosis take part in the determination of pulse durations. The processes of cell growth and density are pivotal for tissue morphogenesis and might determine the shape of electric organs. This supports the observed correlation between cell geometry/tissue structure of the electric organ and discharge duration. The inferred expressed SNPs putatively related to EOD duration are a valuable resource for future investigations on this peculiar trait, which may focus on their functional implications.
African weakly electric fishes of the family Mormyridae (Osteoglossiformes: Teleostei) comprise numerous sympatric closely related species differing in their electric sense.
Either the anterior or the posterior face of electrocytes gives rise to several finger-like evaginations fusing in a stalk, serving as the interface to the electromotor neuron. Those stalks can penetrate the electrocyte and occur as single-or multiple stalk-systems [15][16][17][18].
The electrocyte faces can be smooth or unevenly invaginated with papillae or folds increasing the membrane's surface [15,16]. The membrane is excitable and packed with different ion channels, such that each single cell is independently capable of generating an action potential [16]. The simultaneous release of all action potentials forms the weak pulse-type discharge which varies among closely related species in the number and orientation of phases as well as in its duration. In several African mormyrid weakly electric fish, slight differences in EOD characteristics occur between sexes and populations, while this was never detected in our focus genus Campylomormyrus, rendering their EOD a species-specific trait [14,18]. Currently, 15 Campylomormyrus species are morphologically described and for 9 of them the EOD is known. Their waveforms can be bi-or triphasic with a putatively ancestral pulse duration mostly shorter than 400µs ( Campylomormyrus compressirostris; C. tamandua). A significantly longer EOD (> 4ms) occurs in C. tshokwe, C.
numenius and C. rhynchophorus and is assumed to be the derived character state within the genus [22,23].
In mormyrids, there is evidence that different patterns of electrocyte geometry, like electrocyte penetrations, cause different EOD waveforms. The electric organ of several mormyrid species has been histologically analyzed and a relation between electrocytes being stalk-penetrated or not and the EOD phase number could be observed [15][16][17][18][19].
Further, species with a multiple-stalk system are found to have a longer EOD than those with a single-stalk system [19]. Stalk size as well as the size of the electrocytes also seem to play a role in determining pulse duration. This is observed in Brienomyrus where longer EODs emitted by males coincide with thicker electrocytes and larger stalks, relative to females [24,25]. A comparable correlation can also be found among differentially discharging species of Campylomormyrus [18]. Moreover, a link between the increase of the electrocyte surface at the anterior face and EOD length has been found after hormone treatment (e.g., with testosterone) [24][25][26][27]. As these experiments are mainly performed in species with sex-specific EOD differences, it is not clear, whether hormones also contribute to inter-specific variation.
Further crucial components for the generation of an EOD are different types of ion channels integrated in the electrocyte membrane. They are responsible for generating the action potential, as they regulate the in-and out-flux of sodium and potassium ions. Hence, variations in the abundance or molecular structure of ion channels may also contribute to different pulse durations. All these components are encoded in the genome. Sequence variation of genes coding for major cellular components like the ion channels has been assessed in relation to EOD duration in several mormyrid species [28][29][30][31], but could not fully explain the observed variation in EOD characteristics.
Mutations can affect a particular phenotypic trait in different ways. Single point mutations in the coding region of a particular gene can change the protein sequence and hence, its function, resulting in variation of a phenotypic trait [32][33][34]. Often, however, traits have a quantitative inheritance, i.e., they depend on several or many genes and multiple mutations in these genes may contribute to trait variability. Regarding the ion channels, SNPs have been detected between wave-type and pulse-type African weakly electric fish in one of the two paralogous genes of the voltage-gated potassium ion channel kcna7a [35]. Among Campylomormyrus species with shorter and longer EODs, SNPs are also found in one paralog of the voltage-gated sodium channel gene ( scn4aa) [31]. In those genes, nonsynonymous SNPs likely alter the protein function and thus, can contribute to a change of the EOD duration. Mutations can also affect gene expression by impairing promoter and transcription factor interactions, modifying the function of regulatory proteins like activators/repressors, or changing mRNA conformations (stability) [36][37][38]. Indeed, the elongated EOD in C. tshokwe is associated with significantly elevated expression in several ion channel genes [30] but the regulatory elements behind this up-regulation are not yet known. EOD characteristics are likely encoded by multiple genes being part of various cellular processes. Thus, an identification of genes and processes related to the EOD in weakly electric fish will provide further insights into its evolution and the mechanisms which cause the different pulse duration.
In this study we aim at identifying further genes potentially influencing the pulse duration by using RNA sequencing data of the electric organ of three Campylomormyrus and one Gnathonemus species with different EOD durations [40]. Specifically, we look for putative candidate genes with non-synonymous SNPs which are related to EOD duration. Such expressed single mutations may have an impact either on gene expression or protein functions. Furthermore, we put our data in the context of gene ontology to infer those biological processes which are likely involved in the determination of EOD duration. As EOD differentiation is likely due to divergent selection among closely related species, genes with species-specific mutations may be more common for those biological processes and mechanisms which are relevant for a derived elongation of pulse duration. Conceptually, we screen two closely related species with short and long EOD, respectively, for speciesspecific non-synonymous SNPs, having two further species with ancestral EOD characteristics as outgroup to discern putatively derived from ancestral SNP states.

Transcriptome assemblies
In total, 66986804, 117994270, 41018678 and 99097520 high quality reads were used to       categories (see Additional file 4). Among these, 11 categories exhibited significant differences in the percentage of candidate SCO-sequences among species (Figure 2). In C.
tshokwe the 2 categories 'Transcription' and 'Cancer: specific types' were significantly overrepresented in terms of its proportion of candidate SCO-sequences, while in C.      . Asterisks (*) depict significance at p<0.05.

Discussion
Within mormyrid species, several mechanisms are discussed to explain differences in EOD characteristics, i.e., electrocyte and stalk geometry, variation in hormone levels, and cell membrane excitability [18,19,43]. In this study, we identified 27 genes, which (1)  fate [44]. Additional three genes ( tsr2, mapkbp1 and cd40) play a role in the regulation of NFкB signaling pathway [45][46][47]. The protein complex NfкB is a transcription factor regulating the expression of several target genes being responsible for cell proliferation and differentiation as well as apoptosis. Neuroscientific studies proposed that the NFкB pathway mediates long-term changes in synaptic structures and neuronal plasticity via gene expression regulation [48,49]. The co-repressor interacting with RBPJ1 ( cir1) negatively regulates the NOTCH signaling pathway which is essential in cell-cell communication and cell differentiation processes at embryonic and adult stages [50][51][52]. It also plays a role in neuronal function and development, angiogenesis and cardiac valve homeostasis [53][54][55][56]. Furthermore, three genes code for zinc finger proteins ( znf32, znfx1 and znf678). This class of nucleic acid binding proteins has a zinc finger domain interacting with DNA and thus, acts as transcription factor. While the target genes of two of the zinc finger proteins (znfx1 and znf678) are unknown in zebrafish, a knock-down of znf32 suppresses the SOX2 transcription which in turn enhances the regeneration of the nervous lateral line system [57]. In other model organisms, these three zinc finger proteins are associated with cancer pathways and epigenetic methylation [60][61][62][63]. Transcriptional regulation is also linked to the gene coding for Annexin A3 ( anxa3b) and the EWS RNA binding protein 1 (ewsr1) [62,63]. The relevance of this mechanism is corroborated by the GO terms 'cell communication', 'signal transduction' and 'transcription factor complexes' being significantly overrepresented among candidate SCO-sequences in C. tshokwe compared to C. compressirostris (see Additional file 3). Moreover, the KEGG orthology comparison of C. tshokwe yields the category 'Genetic Information Processing' (including processes of transcriptional regulation) to be significantly overrepresented among candidate SCO-sequences, compared to its occurrence in the entire transcriptome ( Figure   3). Furthermore, the KEGG level B category 'transcription' is more abundant among candidate SCO-sequences of C. tshokwe than those of C. compressirostris (Figure 2). These categories refer to RNA polymerase, basal transcription factors and splicosome. Our results reveal that genes related to processes and components of transcriptional regulation, exhibit an accelerated evolution in C. tshokwe with species-specific non-synonymous substitutions and inferred positive (divergent) selection. This emphasizes the importance of gene expression regulation for differences in EOD length among species. Indeed, previous studies already showed differential gene expression among C. tshokwe and C. compressirostris [30,40], indicating the impact of transcriptional regulation on EOD waveforms. Transcriptional regulation would provide also a feasible explanation for the remarkable changes in EOD waveforms during the ontogeny of a Campylomormyrus fish [64], as expression levels may change during ontogeny [65,66]. Experimental evidence on ontogenetic expression levels is however still lacking for our target taxa.
Our regulatory candidate genes are mainly associated with cell proliferation, cell differentiation and apoptosis, which holds true also for some non-regulatory candidate genes like the Vascular Cell Adhesion Molecule 1 ( vcam1), DNA heat shock protein family member A1 (dnaja1) and REV3-like DNA directed polymerase subunit zeta ( rev3l) ( Table 2).
The functions of their encoded proteins are related to cell expansion, cell survival and cell fate [67][68][69][70]. Furthermore, besides its regulatory function, Annexin A3 is mainly responsible for blood vessel and vascular cords formation [71,72]. According to the MAPP results, the observed species-specific substitutions in the genes ewsr1 and rev3l are likely to alter the protein function. The function of EWS RNA binding protein 1 ( ewsr1) is shown to maintain miotic integrity and proneural cell survival in early developmental stages of zebrafish. A knock-down of this gene results in abnormalities of miotic spindles, followed by apoptosis and leading to a reduction of the proneural cell number and disorganization of neuronal networks during the early development stages [68]. Knock-down experiments of the second gene, rev3l, yield disorganized tissue with significantly reduced cell density [69]. A substitution at a functionally important site in these genes might lead to a functional loss or neofunctionalization and thus, to a modification of subsequent processes affecting cell fate.
The association of cell proliferation processes to EOD duration is not only supported by many candidate genes (Approach I), but also by the KEGG categories 'Human Diseases' and 'Cancer: Specific types' which are significantly more abundant among candidate SCO-sequences of C. tshokwe. Indeed, most of the cancer pathways are linked to cell proliferation, cell differentiation and apoptosis, all crucial processes of tissue morphogenesis. Expressed genetic variation in these candidate genes may hence contribute to variation in electric organ tissue structures (e.g., multi-or single stalk systems) or cell morphs, supporting the hypothesis of an association between EOD duration and cell geometry [19,24,25].
Due to the teleost-specific whole genome duplication 350 mya [73], paralogous copies of essentially all genes emerged, and many of them were retained during teleost evolution.
Some of the paralogs may have retained their ancestral function or deteriorated into pseudogenes, but others underwent a neofunctionalization. One of our candidate genes, anxa3, is known to have two paralogs in zebrafish ( anxa3a/ anxa3b) [74]. In C. tshokwe only one gene copy, anax3b, was found to be expressed in the electric organ. Annexin A3 has a similar function as some voltage gated ion channels shown to exhibit a electric organspecific expression [30], i.e., the increase of membrane permeabilization activity and the influx regulation of calcium ions (Ca2+) [75] which points out the known importance of ion activity and related proteins during EOD generation.

Conclusion
To our best knowledge, this is the first transcriptome-wide SNP analysis among African geometry not only to affect the shape of EOD pulses, but also their duration.
Biochemical or physiological experiments via, e.g., knock-out trials were out of the scope of this study. Consequently, we can so far not proof a direct link between the inferred candidate genes and processes and the actual electric organ/electrocyte features. Thus, our study provides hypotheses about genes and processes relevant for EOD duration, which future research could build upon.

Transcriptome assemblies
We used RNA sequencing data from the electric organ (EO) of three Campylomormyrus species (C. compressirostris, C. tshokwe and C. tamandua) and the closely related species G. petersii to assemble tissue-specific transcriptomes ( Figure 4A). We downloaded the Illumina reads from the Sequence Read Archive (SRA) with the accession number SRP050174 [40]. The processing of raw reads (quality filter, adapter trimming, etc.) was achieved as described in Lamanna et al. (2015) [40]. Filtered paired-end reads of the four species were assembled de novo into separate transcriptomes using Trinity v. 2.2.0 with default parameters [76]. The four tissue-specific assemblies were tested for transcriptome completeness using BUSCO v3 [77]. For this purpose, transcriptomes were compared to the core gene set of Actinopterygii (state: 2018). The assemblies had been analyzed with the TransDecoder 3.0.1 pipeline to obtain the longest open reading frame for the transcripts [78]. The four transcriptomes served as input for the subsequent orthology analysis for which Orthofinder 1.1.10 was used (Figure 4) [79]. The Orthofinder tool is based on an allversus-all blast of amino acid sequences, followed by a first sequence clustering taking into account the normalized bit score of the blast results. Afterwards, orthologous genes are selected and a final clustering by the Markov Cluster algorithm results in discrete orthogroups. Orthofinder distinguished between orthogroups with multiple sequences per species and Single Copy Orthogroups (SCO; one sequence per species). Further analyses were applied only to the SCOs to ensure analytical comparison among orthologous genes.
The four nucleotide sequences of each SCO had been aligned codon-wise using PRANK v.
140110 (default parameters) [80] and trimmed to equal length by a customer bash script.
To discard remaining paralogs, YASS 1.15 [81] was used. It compared all sequences pairwisely and outputs the similarity for each pair in percentage. Single copy orthogroups with a similarity value below 90% were discarded from our data set. This procedure ensured that our retained genes are either clearly distinguishable from an ancient paralog (i.e., identified as separate SCOs) or do not have a paralog, at least not expressed in the electric organ.
Furthermore, sequences shorter than 200bp were removed. Finally, a randomly chosen subset of SCOs was checked manually for correct filter criteria confirming the performance of the bioinformatical scripts and tools. The resulting data set of SCOs served as basis for our subsequent analyses ( Figure 4).
Approach I: Identification of candidate genes potentially related to EOD elongation in C.
tshokwe First, the ratio of non-synonymous (dN) to synonymous (dS) nucleotide substitution rates was calculated for all SCOs using the codeml package implemented in PAML v. 4.9 [82].
Therefore, the sequence alignments of each SCOs were loaded to codeml separately and the site model M0 was chosen to compute the respective omega value (ω = dN/dS). The ω value is an indicator of selective pressures on genes. A ratio significantly greater than 1 indicates positive selection. A ratio of 1 indicates neutral evolution at the protein level. A ratio less than 1 indicates selection to conserve the protein sequence (i.e., purifying selection). By using Geneious R 8.1.9, SCOs with ω > 1 were screened manually for non-synonymous species-specific SNPs occurring only in the sequence of C. tshokwe (elongated EOD), when compared to the species with short EOD ( C. compressirostris, C. tamandua and G. petersii; Figure 4B). As C. tshokwe and C. compressirostris are closely related, the direct comparison of these two species allowed us to focus on genes which experienced genetic changes since they diverged from their last common ancestor. Single Copy Orthogroups with ω > 1 and non-synonymous, species-specific SNPs in C. tshokwe were considered as putative candidate genes for an elongated EOD ( Figure 4B). Nucleotide sequences of these SCOs were blasted against the current nucleotide database of NCBI (nt database) using the BLASTn algorithm [83]. We also blasted sequences against the protein database of the Zebrafish Information Center (ZFIN) and the UniProtKB/Swiss-Prot database using the BLASTx algorithm. Putative candidate genes were assigned to GO terms using Blast2GO Approach II: Annotation comparisons of orthogroups between C. compressirostris and C. tshokwe The second approach aimed at identifying cellular and molecular mechanisms which play a role in the differentiation of EOD duration. Therefore, all SCO alignments were converted into a VCF file format using the tool SNP-sites to call SNPs from multiple sequence alignments [87]. Singe copy orthogroups with non-synonymous species-specific SNPs in C.
tshokwe and C. compressirostris were isolated separately, creating two candidate SCO data sets. Subsequently, ω values for each remaining SCO were determined with the Ka/Ks_Calculator 2.0 using the model selection according to the AICs (MS) [88]. Its underlying calculation deviates from the M0 site model (codeml), as it relies on a pairwise calculation across all input sequences and outputs a ω value for each possible combination.
Only SCO with ω > 1 for the pairing of C. tshokwe and C. compressirostris were retained in the candidate SCO data sets ( Figure 4C). As we wanted to look for over-/underrepresentation of candidate SCO-sequences in certain biological processes, the candidate SCO data sets of both species were analyzed separately by a GO term annotation  Figure 4C). We uploaded each data set separately to KAAS and used the same parameters and databases as in approach I. The GO term and KEGG pathway annotations were used for four annotation comparisons ( Figure   4C). respective data set) were calculated and the respective 95% confidence intervals were determined using the online tool of the Allto Market Research web site [89]. The data were illustrated as bar chart with Microsoft Excel 2010.  Author's contribution: JC set up the bioinformatics pipeline, carried out the analyses, and drafted the manuscript. FK participated in the manuscript drafting and supervision. RT conceived and supervised the study and contributed to manuscript finalization. All authors read and approved the manuscript.

Figure 1
Comparative analysis of GO term annotations between the candidate SCO data set of C. tshokwe and C. compressirostris. The figure represents the proportional ratio of candidate SCO-sequences between C. tshokwe and C. compressirostris for any GO-term (y-axis), relative to the total candidate SCO-sequence number for the respective GO term (x-axis). The 95% confidence interval for an equal ratio (50:50) is depicted as the gray shaded area, rendering dots (i.e., GO terms) outside of the area significant (red circles).  KEGG level A category assignments among the candidate SCO sequences, compared to the entire transcriptome, for C. tshokwe (red) and C. compressirostris (blue), respectively. Bars represent the proportion of sequences/transcripts for each KEGG level A category (x axis). The error bars indicate the 95% confidence interval, taking the total absolute number of candidate SCO-sequences/transcripts into account (confidence limits of proportions). Asterisks (*) depict significance at p<0.05.

Figure 4
Overview of the workflow of used data-analytical approaches. Shown are the major bioinformatical steps to create an input data set (A), steps for potential candidate gene identification (B), and the computational steps to create the candidate SCO data sets as well as their three annotation comparisons (C).