Genome sequencing and assembly
The Lymnaea stagnalis genome was sequenced using a combination of both Illumina paired-end and mate-pair reads. The resulting draft assembly spans 943 Mb and comprises 6,640 scaffolds with lengths ranging from 2,000 to 7,106,496 bp, as outlined in Table 1. This assembly is 250 Mb lower than the previously measured genome size of 1.193 Gb (C-value = 1.22 pg 69). This discrepancy is consistent with the general trend observed in eukaryotes (see 70).
Table 1
Summary of the L. stagnalis genome assembly.
Metric
|
Final assembly
Contigs
|
Final assembly
Scaffolds
|
Assembly size
|
873,879,680
|
942,996,421
|
Number
|
64,200
|
6,640
|
N50
|
35,351
|
957,215
|
L50
|
6,921
|
297
|
N90
|
7,115
|
151,064
|
L90
|
27,651
|
1,178
|
Largest
|
315,608
|
7,106,496
|
GC%
|
|
37.55%
|
%N’s
|
|
7.33%
|
Repeat content
Repetitive elements spanned 37.89% of the 943 Mb-genome. Using a semi-automated process (described in the materials and methods section), we annotated the L. stagnalis genome with 2,643 transposable element (TE) consensus sequences (provisionally available via INRAE data repository:
(https://entrepot.recherche.data.gouv.fr/privateurl.xhtml?token=1ad8e7c4-fa43-4002-b99f-c236eae7ba01), giving 883,147 TE copies (24,190 of which are full-length copies (FLCs). In the consensus library, 1,517 TEs were classified as retrotransposons (Class I) and 774 as DNA transposons (Class II), 243 were not classified and 105 as Potential Host Genes (PHGs). Most of these consensus sequences (2,604, i.e., 99%) have one or more full-length copy, indicating that a consensus sequence is likely a reasonable estimate of the common ancestor of any one TE. The resulting annotation was used to mask the TE landscape in the genome.
Annotation
Following repeat masking of the genome, we employed ab initio, transcriptome- and homology-based methods to predict a set of 22,499 protein-coding genes (see Table 2 for detailed gene characteristics). This number is somewhat lower than in other annotated gastropod genomes (e.g., 23,851 genes in L. gigantea, 28,683 in A. californica, 36,943 in B. glabrata), and lower than the average number of genes found in a panel of 20 representative Metazoa (see OrthoFinder analysis).
Table 2
Characteristics of the L. stagnalis genome annotation.
Parameter
|
value
|
Assembly size
|
943 Mbp
|
# predicted genes
|
22,499
|
# predicted genes without intron
|
3,492 (15.5%)
|
# exon per gene avg:med
|
7.59:5
|
Gene size (nt) avg:med
|
15,299:8,502
|
CDS size (nt) avg:med
|
1,375:993
|
Intron size (nt) avg:med
|
1,611:604
|
Complete BUSCOs
|
98.5% (vs, odb10 Eukaryota)
|
Complete and single-copy BUSCOs
|
97.3%
|
Complete and duplicated BUSCOs
|
1.2%
|
Fragmented BUSCOs
|
1.2%
|
Missing BUSCOs
|
0.3%
|
Total BUSCO groups searched
|
255
|
# predicted genes with domain*
|
19,387 (86%)
|
# predicted genes with InterPro* domain*
|
16,622 (74%)
|
# predicted genes with Pfam* domain*
|
15,088 (67%)
|
*at least one protein domain |
Mitochondrial genome
BLAST searches returned mitogenomic sequence similarities for the full length of Lsta_scaffold2639. As indicated by a 127 bp sequence overlap at the termini, this scaffold represents a circular sequence. Nanopore reads confirmed that the assembled sequence is contiguous, and gene annotations revealed the complete standard gene complement for a metazoan mitogenome. The 13,834 bp mitogenome of L. stagnalis (Genbank MW221941) is 28.21% GC-rich, different from the 37.55% AT content of the nuclear genome. Several protein coding genes are delimited by downstream tRNA genes, causing the ORF to terminate with a truncated stop codon (T- or TA-) that is completed by polyadenylation of the mRNA transcripts. This mitogenome sequence has 94% identity with another, independently characterised L. stagnalis mitogenome (13,807 bp, MT874495). The different lengths of these two mitogenomes are mostly due to indels in regions with tRNA- and rRNA genes, a two bp indel occurs at the 5’ start of ND4L. These mitogenome sequences align closely but contain abundant single nucleotide variants (including several non-synonymous substitutions) within protein coding genes. The organisation of the 22 tRNA genes, two rRNA genes plus 13 protein coding genes (Supplementary Figure SF1) of the L. stagnalis mitogenome is generally similar to that of other lymnaeid (and hygrophilid) snail species, with a few minor differences in the location of tRNA genes.
Receptor identification
Receptor Tyrosine Kinases
A total of 32 gene predictions (including manually curated versions) were identified as Receptor Tyrosine Kinases (RTKs), distributed across 18 of the 20 families or types (see Supplementary Table ST1). Among these, 12 gene predictions did not contain expected (canonical) domains. We found no gene prediction representing type III or XI RTKs, while two gene predictions could not be assigned to a specific RTK type.
Nuclear receptors
In total, 47 nuclear receptors (NRs) were identified from automatic and manual gene predictions, 39 of which containing the complete sequence of both conserved domains, i.e., the DNA-binding domain (DBD) and ligand binding domain (LBD) (Supplementary Tables ST2 and ST3). All 6 traditional NR families were represented as follows: NR1 (28 predictions), NR2 (12), NR3 (3), NR4 (1), NR5 (2), NR6 (1). However, in terms of NR nomenclature, these corresponded to only 29 different NRs, as several gene predictions were found to have the same annotation (e.g., NR1F1_RORA, represented by four different gene predictions, that share low levels of domain sequence similarity (from 28.9–41.5%).
Significantly, we found two NRs with 2DBDs and one LBD. The occurrence of such an atypical NR structure was first reported in 2006 in the parasitic platyhelminth Schistomosa mansoni 71 and further confirmed in other animal species, including protostomes and deuterostomes (both non-chordate and chordates; see 72), suggesting an origin that predates the protostome-deuterostome split. Using the metazoan-scale orthology analysis performed in this study, the two sequences identified in L. stagnalis belong to the same orthogroup (OG00004255), which is composed of 33 sequences from molluscan (Gastropoda and Bivalvia), annelid, and platyhelminth taxa, plus the hemichordate Saccoglossus kowalevskii. However, in this OG, no protein other than the two L. stagnalis’ possess 2 DBDs, which may reflect conflicts in orthology search due to this very particular NR structure. Although a stable phylogeny of 2DBD-NRs is still to be established, Wu and LoVerde 72 proposed to group them into a new family (NR7). Accordingly, we found higher sequence similarity among the four DBDs of these two 2DBD-NRs than with any other L. stagnalis NR or with their corresponding best hits (see green clade in Fig. 1 and Supplementary Figure SF2).
To take the analysis of the unusual 2DBD-NRs further, we concatenated the DBD and LBD domains from 60 other 2DBD-NRs representative of the metazoan groups studied by Wu and Loverde 72 (see Supplementary Table ST4). A phylogenetic analysis of these domains suggests that the two L. stagnalis receptors are members of two different groups (A vs B; Fig. 2 and Supplementary Figure SF3), albeit with low statistical support (see branch bootstrap values). Therefore, following the recommendation of Wu et al. 2023, L. stagnalis has two NR7 genes, GSLYST00003556001 (NR7A) and GSLYST00001311001 (NR7B), corresponding respectively to C. gigas 2DBDγ and 2DBDδ 73.
Among the incomplete NRs, four genes were found to lack either the entire LBD (1 gene) or the DBD (3 genes) and should thus be classified as members of the NR0 family (although this grouping is not monophyletic). We tentatively annotated them following their BLAST best-hit (H. sapiens and/or nr). On this basis, the single DBD-only gene was annotated as NR1A2_THRB, despite the lack of a true thyroid system in Mollusca but see 74, and the three LBD-only genes as NR1D2_REVERBB, NR1H3_LXRA, and NR2F1_COUP-TFI, respectively. Note that NR1D2 and NR1H3 annotations were shared by at least one other gene prediction which contained both the DBD and LBD. Another specific case was GSLYST00019008001 (and its related transcript). Although we could not find an open reading frame (ORF) that would generate both the DBD and LBD, two alternative ORFs would generate sequences with either of the domains. Ambiguously, each of these sequences has a different BLAST best-hit in H. sapiens: NR1C1_PPARA for the DBD-only ORF, and NR1H3_LXRA for the LBD-only ORF. In addition to the observation that NR1C1 and NR1H3 are already represented by two and three other gene predictions respectively, this suggests that GSLYST00019008001 and its corresponding transcript do not constitute a functional NR.
Despite extensive searches for alternative ORFs, 5 NRs were found to have an incomplete domain sequence (lacking part of either the DBD or the LBD). These might represent non-functional gene copies, as suggested by the observed density of nucleotide repeats in the problematic areas. While this should be confirmed through specific re-sequencing, the fact that some of these and other genes with complete domains share the same functional annotation (see NR1C3_PPARG and NR2E1_TLL, Supplementary Table ST2) tends to support the hypothesis of non-functionality. One last specific situation is that of NR2E3 (PNR), which is represented by two protein predictions, neither of which have both domains in full (an incomplete DBD, and the LBD is either complete or missing).
G protein-coupled receptors
InterProScan identified a significant GPCR hit for a total of 892 gene predictions, four of which were discarded due to incompleteness or other conflicting annotations (see Supplementary Table ST5). The remaining 888 gene predictions are distributed among the five main GPCR families (GRAFS system 75) and the ancestral cAMP 76: class A - Rhodopsin-like (784); class B – adhesion/secretin (65); class C – Glutamate (31); class E – cAMP (3); class F – Frizzled (5). Of these, 418 were members of five orthogroups (OGs) found to be expanded within the L. stagnalis genome (orthology analysed at metazoan level, Z-score > 2, Table 3; see also Supplementary Information for further details), while the remainder (470) was distributed across 135 non-expanded OGs. Most gene predictions belonging to the five expanded GPCR OGs were identified as Rhodopsin-like (class A) 7 trans-membrane (7TM; IPR17452). These five expanded OGs encompassed a high proportion (89 to 100%) of molluscan proteins from the six taxa included in the orthology analysis, and four of these OGs were predominantly exclusive to gastropods (85 to 100%; Table 3).
Table 3
Summary statistics for GPCR Orthology Groups, with a focus on molluscan representatives. Percentage values indicate for each orthogroup the proportion of proteins belonging either to Mollusca, Gastropoda, or to L. stagnalis. Abbreviations: OG, Orthogroup. A Z-score value (number of standard deviations from the mean) > 2.0 indicates significant expansion.
OrthoGroup
|
InterProScan
|
OG size
|
Mollusca
|
Gastropoda
|
L. stagnalis
|
|
Z-score
|
OG0000008
|
Rhodopsin_R (peptide, FMRFamide)
|
1,215
|
1085
(89%)
|
1035
(85%)
|
369
(30%)
|
2.537
|
OG0001159
|
Rhodopsin_R (peptide, FMRFamide)
|
72*
|
71
(99%)
|
71
(99%)
|
24
(33%)
|
2.336
|
OG0002276
|
Rhodopsin _R (peptide, FMRFamide)
|
49*
|
48
(98%)
|
48
(98%)
|
17
(35%)
|
2.452
|
OG0002279
|
Rhodopsin_R
|
49
|
49
(100%)
|
49
(100%)
|
22
(45%)
|
3.106
|
(2 GPCRs)
|
OG0002601
|
Rhodopsin_R (peptide, FMRFamide)
|
45
|
38
|
22
|
4
|
< 2
|
Absent in Deuterostoma and Arthropoda
|
OG0010705
|
Rhodopsin (peptide, FMRFamide)
|
8
|
7
|
7
|
4
|
< 2
|
OG0000001
|
Rhodopsin_R (peptide)
|
3,444
|
911
|
561
|
139
|
< 2
|
OG0004949
|
Rhodopsin_R (biogenic amine)
|
30
|
30
(100%)
|
27
(90%)
|
14
(47%)
|
3.347
|
OG0000121
|
Rhodopsin_R (peptide and biogenic amine)
|
286
|
268
|
115
|
29
|
< 2
|
OG0000012
|
Adhesion_R
|
1,072
|
310
|
186
|
42
|
< 2
|
*in OG0001159, the single non-molluscan member was B. floridae, in OG0002276 C. elegans. |
7TM-class A Peptide GPCRs: a predominance of FMRFamide R-like
Three of the five expanded OGs with a GPCR hit (OG000008, 0G0001159, OG0002276) consisted exclusively of Rhodopsin-like receptors from the group of (7-transmembrane-domain) 7tmA_FMRFamide_R-like receptors (peptide), representing in total 424 L. stagnalis proteins. However, OG0002279, consisting of 49 exclusively Gastropoda proteins, was heterogeneous with only 2 GPCR hits for L. stagnalis proteins. The 22 other L. stagnalis proteins in this OG included BLASTp annotations against bacterial proteins, casting doubt on the statistical reliability of this OG. By contrast, OG0002601, although not expanded in L. stagnalis, was clearly an FMRFamide receptor OG (45 proteins), apparently absent in Deuterostomia and Arthropoda. Molluscs were the most represented taxa in this group, with 38 proteins, 22 of which belong to Gastropoda, and 4 to L. stagnalis. OG0010705 is also a small, non-expanded, FMRFamide receptor OG (8 proteins), with 7 molluscan representatives, all gastropods from the Heterobranchia order. However, the fact that one protein from this OG belongs to the arthropod F. candida questions the validity of this small group as a true OG. Finally, an extra-set of 44 L. stagnalis gene predictions from various and heterogeneous OGs were annotated as FMRFamide_R-like, although only 18 gene predictions had a complete 7TM motif (as determined with NCBI Conserved Domains database search tool). In summary, about 450 GPCRs could be confidently designated as FMRFamide_R-like. For completion, an OG of biogenic amine GPCRs (G0004949), which was also expanded in L. stagnalis, contained 13 receptors to muscarinic acetylcholine and one receptor to adrenaline.
We hypothesised that FMRFamide receptor proliferation may have supported the evolution of simultaneous hermaphroditism in gastropods. To test this hypothesis, we refined the analysis of orthology by focusing on gastropods, and used available genomic resources to balance hermaphroditic species (A. californica, B. glabrata, Bulinus truncatus, Candidula unifasciata, Elysia chlorotica, L. stagnalis, Radix auricularia, Lottia gigantea) against gonochoristic species (separate genders, Batillaria attramentaria, Gigantopelta aegis, Haliotis rubra, Littorina saxatilis, Pomacea canaliculata, Potamopyrgus antipodarum) (see Supplementary Information, Orthology analyses). Note that L. gigantea is a protandrous sequential hermaphrodite while all other species are simultaneously hermaphroditic. This analysis generated nearly 29,500 OGs (Fig. 3; Supplementary Table ST6). Because all simultaneous hermaphrodites in our analysis are in the Euthyneura infraclass of molluscs, we also identified various OGs specific to this clade, including FMRFamide receptors, c-type lectins, cytochrome P450 cyclodipeptide synthase-associated, (see Supplementary Table ST7 for domain annotation of the main euthyneura-specific OGs).
The search for FMRFamide-R among L. stagnalis orthologs retrieved 483 genes distributed in 89 different OGs. FMRFamide-R groups previously found expanded in L. stagnalis appeared no more enriched, which is not surprising as all members of the analysis were gastropods. We excluded Radix auricularia from this analysis because that dataset was abnormally small (less than 18,000 protein predictions in total), and that species was recurrently absent from OGs otherwise well represented by other Hygrophila species (casting further doubt on the completeness of its predicted proteome). Excluding R. auricularia, the number of FMRF-R amide genes was significantly higher (ANOVA F-value [1,11] = 10.21, P = 0.009) in hermaphrodites (simultaneous and sequential: 289.86 ± 149.10 genes) than in gonochoric gastropods (89.17 ± 36.88 genes). After excluding L. gigantea (a sequential hermaphrodite), simultaneous hermaphrodites were more significantly enriched in FMRFamide receptor genes (329.17 ± 117.04) than non-hermaphrodite species (ANOVA F-value [1,10] = 22.95, P < 0.001).
Peptide GPCRs: a high level of ligand diversity
OG0000001 is comprised of peptide GPCRs and is the second largest OG determined by OrthoFinder with 3,444 genes representative of all taxa included. Although not expanded in L. stagnalis, this OG contains 139 L. stagnalis gene predictions, all assigned as peptide GPCRs, and predicted to bind a great diversity of molecules (more than 40 categories; see supplementary Table ST8). OG0000121 grouped gene predictions which shared four annotations with OG0000001 (in bold in Supplementary Table ST8) or were identified as other peptide receptors (neurotensin, somatostatin, FF neuropeptide, NPY peptide, opioid).
Among the rhodopsin-like gene predictions, a structural pattern was observed in the genome, with a spatial clustering of similar genes. Whether all members of a given cluster are functional genes or pseudogenes cannot be determined based on our data, yet it is to be noted that they do not all possess a complete 7TM motif and these clusters were also often interspersed with repetitive regions. In line with its high level of occurrence, FMRFamide_R-like was either the main or exclusive annotation of 30 of the 46 largest groups (size: 4 to 26 tandem GPCR genes), and also appeared in other groups. Among these, the proportion of transmembrane domains with at least 6 helices varied from 50 to 100%. Furthermore, 10 of the 30 groups of FMRFamide_R-like genes contained genes which also exhibited an additional viral chemokine GPCR hit (PHA02638 or PHA03087). Details on cluster annotations are given in Supplementary Table ST9.
Consistent with the levels of diversity observed in metazoan GPCR classes, classes B, C, E and F were more moderately represented in L. stagnalis. After discarding gene predictions with transmembrane domains containing less than 6 helices (considered as incomplete), we found 37 proteins representative of class B (12 B1_hormone receptors, 19 B2_adhesion, and 6 B3_Methuselah), 17 of class C (3 GABA B, 12 metabotropic glutamate, and 2 orphan GPR158), a single class E (cAMP), and five class F receptors (four Frizzled and one Smoothened protein) (see Supplementary Table ST10). Several other predicted proteins with significant matches to GPCRs from either of these classes were also identified but discarded due to incomplete or total lack of a transmembrane domain.
Spatial expression of FMRFamide receptors
While the association between FMRFamide receptor expansion and simultaneous hermaphroditism is strong (as indicated by OG composition), it is problematic that the Euthyneura are monophyletic and are also the only simultaneous hermaphrodites used in these analyses. Therefore, to clarify whether some of these receptors are associated with the hermaphroditic condition in L. stagnalis (as opposed to other euthyneuric traits such as detorsion), we searched for differences in the expression of receptors in our transcriptomic datasets, and accordingly a selection was used for in situ expression analysis. The differential expression of some FMRFamide receptors between male (prostate) and female (albumen) reproductive organs drove our selection of 4 of these receptors (GSLYST00019397001, GSLYST00001331001, GSLYST00016909001, GSLYST00019383001). In addition, we investigated the expression patterns of three differentially expressed “traditional” nuclear receptors (GSLYST00008490001, GSLYST00009627001 and GSLYST00020139001) and the two 2DBD-NRs (GSLYST00001311001 and GSLYST00003556001). We also included ovipostatin (GSLYST00000615001) as a reference because it is known to be differentially upregulated in the prostate gland relative to the albumen gland 77. The spatial expression profiles for these genes were striking in their similarity in early veligers; in general, these genes were robustly expressed in ectodermal cells of the foot, head and mantle edge (Fig. 4). The albumen and prostate glands are fully differentiated organs with separate reproductive functions, and we were able to detect qualitatively distinct spatial expression patterns for these genes in one or both reproductive organs. As expected, ovipostatin generated a strong signal in sections of the prostate gland, while effectively no signal was detected in the albumen gland (Fig. 4). In addition, the expression patterns of the two 2DBD-NRs in early embryos (early and late blastulae, early and late gastrulae) indicates that these gene products play likely critical roles in early developmental processes (Fig. 5).