The gene structure and super-hypervariability of the complete Penaeus monodon Dscam gene


Abstract

Background
In pancrustaceans, the Down syndrome cell adhesion molecule (Dscam) is an extraordinarily complex, single-locus gene, with the potential for generating thousands of isoforms by combining alternative splicing exons. In the present study, we used two advanced sequencing approaches, Illumina and PacBio, with hybrid assembly to analyze the entire Dscam genomic structure in Penaeus monodon.

Results
The P. monodon Dscam (PmDscam) genome was ~250 kbp, with a total of 175 constitutive and alternative splicing exons. Analysis of PmDscam cDNA and genomic sequences revealed a conserved architectural structure consisting of an extracellular region with hypervariable Ig domains, a transmembrane domain, and a cytoplasmic tail. While the numbers of splicing exon variants in N-terminal Ig2, N-terminal Ig3 and the entirety of Ig7 were previously reported to be 28, 43 and 19, we now show that there are in fact 26, 81 and 26 alternative exons in these regions, respectively. We also identified two alternative variants of two exons in the cytoplasmic tail, as well as 7 cytoplasmic tail elements that can either be included or skipped. The presence of three stop codon sites in the cytoplasmic tail region means that alternative splicing is involved in the selection of the stop codon.

Conclusions
In total, alternative splicing provides for 54,756 potential combinations in the extracellular region, plus 512 potential combinations in the cytoplasmic tail, all derived from one PmDscam genome locus.
We have also established a public-facing PmDscam genome database (http://pmdscam.dbbs.ncku.edu.tw/) to facilitate future research on characterizing the involvement of Dscam in pancrustacean immunity.

Background
Dscam is a member of the immunoglobulin (Ig) superfamily, and it was first identified in the human genome in relation to the development of neuronal connectivity (1). This gene also has important roles in nervous system development in insects (2)(3)(4). The typical structure of Dscam consists of 10 Ig domains and six fibronectin type III repeats connected to a transmembrane domain and a cytoplasmic tail. Our new transcriptomics data also reveal a relatively complex PmDscam cytoplasmic tail structure that is distinct from insect Dscam. Several highly conserved functional motifs were discovered in the cytoplasmic tail. In addition to our analysis of the PmDscam genomic structure, we also found that most exons in the genome were selected in both nervous and immune-related cells. In different combinations, the various alternatively spliced exons in the extracellular regions as well as the cytoplasmic tail could generate up to 20 million distinct protein isoforms. Taken together, these findings highlight the substantial diversification of Dscam structure. We also provide a draft of the complete genome of tiger shrimp Dscam, which is accessible via our public-facing PmDscam database.

Whole genome sequencing and genome assembly
The procedures illustrated in Figure 1 produced a first draft M2 assembly which had the highest contiguity of any assembly that we generated, with an N50 of 5.1 kb in 2.2 million contigs. The final assembly size was 2.6 Gb (Table S2; Figure S1). After gaps in the PmDscam sequences were closed by PCR amplification and sequenced using the Sanger sequencing platform, a final corrected M2 assembly was produced (Fig. 1A). The completely constructed draft of the Penaeus monodon Dscam genome has a size of approximately 260 kbp (Fig. 2). Figure 2 also shows how the three platforms and the transcriptomics data contributed to this construction.

Penaeus monodon Dscam gene organization
Previously we reported the full-length cDNA of PmDscam (16). Here, the assembled P. monodon PmDscam genome reveals the complete PmDscam gene structure. Excluding exon 1, the PmDscam gene contains 175 exons (Fig. 3), of which 38 are constitutive and 137 are alternatively spliced.
Analysis of the genomic sequences in combination with the cDNAs revealed that the PmDscam gene (Fig. 3) has a much more complex cytoplasmic tail than those of other pancrustacean and arthropod species, e.g., Daphnia and Drosophila, while the overall gene organization is otherwise similar. Unfortunately, however, we were unable to identify the 5'-UTR of Dscam located in exon 1 even though this has been identified in other crustacean species (4,15). The gene organization of PmDscam consists of two main parts: the extracellular region (Fig. 3A) and the cytoplasmic tail (Fig. 3B). The extracellular region of PmDscam has three alternatively spliced exon clusters, exons 4, 6 and 15, which consist of 26, 81 and 26 variable exons, respectively (Fig. 3A). Meanwhile, the cytoplasmic tail has two alternatively spliced exon clusters, exons 32 and 44, which contain 2 variable exons each (Fig. 3B). The mature mRNA consists of 44 exons or fewer, and each cDNA sequence contains only one of each of the variable exons.
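The exon counts quoted above can be cross-checked with a few lines of Python. This is only a sketch: the cluster sizes and the number of constitutive exons are taken from the text, but the dictionary labels are illustrative, and the idea that the 137 alternatively spliced exons are simply the sum of the five clusters is an assumption that happens to reproduce the reported totals.

```python
# Cluster sizes as reported in the text (exon cluster -> number of variable exons).
# The labels are illustrative; summing these five clusters to obtain the
# "137 alternatively spliced exons" is an assumption, not a stated method.
ALTERNATIVE_CLUSTERS = {
    "exon 4 (Ig2)": 26,
    "exon 6 (Ig3)": 81,
    "exon 15 (Ig7)": 26,
    "exon 32 (cytoplasmic tail)": 2,
    "exon 44 (cytoplasmic tail)": 2,
}
CONSTITUTIVE_EXONS = 38  # reported in the text


def tally_exons(clusters, constitutive):
    """Return (alternative exon count, total exon count)."""
    alternative = sum(clusters.values())
    return alternative, alternative + constitutive


alternative, total = tally_exons(ALTERNATIVE_CLUSTERS, CONSTITUTIVE_EXONS)
print(alternative, total)  # 137 175
```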

Identification of PmDscam hypervariable regions
To identify the hypervariable sequences of Ig2, Ig3 and Ig7 in the PmDscam genome, the conserved amino acid sequences of isoform variants from each domain were searched for in the genome. The multiple hypervariable exons were checked and corrected manually, and a total of 26, 81 and 26 spliced forms of the exons encoding Ig2, Ig3 and Ig7 were detected, respectively. These numbers are in contrast to those in Chou et al. (2011), where the numbers of exon variants in Ig2, Ig3 and Ig7 were reported to be 28, 43 and 19, respectively. The isoform sequences from each domain were aligned using Clustal Omega and Genedoc software, and the resulting amino acid sequences are shown in Figure 4. Each of the detected hypervariable regions shows several conserved amino acids, some partly conserved amino acids and a number of variable amino acid sequences. Assuming that these alternative variants can be selected independently, the extracellular region of PmDscam can potentially generate at least 54,756 unique isoforms (26 × 81 × 26 = 54,756). We note that one of the Ig7 variants has an abnormal length (Fig. 4C), although the significance of this, if any, is unclear.
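The isoform count follows directly from the independence assumption stated above: one variant is chosen from each cluster, so the cluster sizes multiply. A minimal sketch (the function and variable names are illustrative):

```python
from math import prod

# Number of alternative exons per hypervariable cluster, as reported in the text.
EXTRACELLULAR_VARIANTS = {"exon 4 (Ig2)": 26, "exon 6 (Ig3)": 81, "exon 15 (Ig7)": 26}


def count_isoforms(variants_per_cluster):
    """Number of combinations when one variant is picked independently per cluster."""
    return prod(variants_per_cluster.values())


extracellular_isoforms = count_isoforms(EXTRACELLULAR_VARIANTS)
print(f"{extracellular_isoforms:,}")  # 54,756
```

If some variants were never co-selected in practice, the true number would be lower; the product is an upper bound under the stated assumption.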
The first four Ig domains of Dscam were reported to have a horseshoe conformation, and parts of Ig2 and Ig3 contribute to two composite surface epitopes, epitope I and epitope II (21). Although these two epitopes are not well conserved in insects (21), they are highly conserved among crustaceans (15). Epitope I is responsible for homophilic binding specificity, while epitope II was hypothesized to bind to non-Dscam ligands (21). Here, we used PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred) to identify the two epitopes located in the Ig2 (exon 4) and Ig3 (exon 6) spliced variants. The epitope I and epitope II sequence logos in exon 4 and exon 6 were then generated using WebLogo (http://weblogo.berkeley.edu/). In exon 4, the approximately 12 amino acids before conserved residue 16I and the 13 amino acids after conserved residue 41W were considered to belong to epitope I and II, respectively (Fig. 5A). In exon 6, the 8 amino acids after conserved residue 9K(R) were considered to belong to epitope I, and the 8 amino acids before the conserved LLC motif were considered to belong to epitope II (Fig. 5B).

Detection of PmDscam isoform variants expressed in different tissues
To confirm whether the isoform variants of the three hypervariable exons (exon 4, 6 and 15) obtained from the genome sequence are actually present in shrimp, amplicons spanning the hypervariable exons were amplified from hemocytes and nerve tissue from ten individual shrimp using gene-specific primers (Fig. 6A). After cloning and sequencing, the obtained nucleotide sequences were BLASTed against our PmDscam genome database. Almost all the isoform variants of exon 4, 6 and 15 found in the PmDscam genome were expressed (Fig. 6B-6D). Among the exon 4 variants, isoform 1 and isoform 15 were not found in either hemocytes or nerve tissue, while isoform 19 was absent only from nerve (Fig. 6B). For exon 6, isoforms 10, 38, 51, 52, 70 and 72 were absent from both hemocytes and nerve (Fig. 6C), while isoforms 4, 7, 10, 15 and 16 of the exon 15 domain were also absent from both tissues (Fig. 6D). Interestingly, fewer exon 15 isoforms were detected in hemocytes than in nerve tissue (Fig. 6D). Overall, these results suggest that most of the isoform variants of the hypervariable exons are expressed in shrimp.
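To make "almost all" concrete, the isoform numbers listed above can be turned into detection counts per cluster. The sketch below uses only the cluster sizes and the isoforms reported as absent from both tissues; the names are illustrative, and tissue-specific absences (such as exon 4 isoform 19, missing only from nerve) are deliberately not modelled.

```python
# Cluster sizes and isoforms reported as absent from BOTH hemocytes and nerve tissue.
CLUSTER_SIZES = {"exon 4": 26, "exon 6": 81, "exon 15": 26}
ABSENT_FROM_BOTH = {
    "exon 4": {1, 15},
    "exon 6": {10, 38, 51, 52, 70, 72},
    "exon 15": {4, 7, 10, 15, 16},
}


def detected_counts(cluster):
    """Return (variants detected in at least one tissue, variants in the genome)."""
    total = CLUSTER_SIZES[cluster]
    return total - len(ABSENT_FROM_BOTH[cluster]), total


for cluster in CLUSTER_SIZES:
    detected, total = detected_counts(cluster)
    print(f"{cluster}: {detected}/{total} variants detected ({detected / total:.0%})")
```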

A complex cytoplasmic tail organization
In our previous study (16), although we successfully identified several cytoplasmic tail isoforms of PmDscam, we were only able to identify PmDscam element 0 to element 8 (with elements 0-5 corresponding to exons 31-38; the numbering of the elements corresponds to the exons in Daphnia Dscam). However, the earlier analysis contained several errors, and some of the downstream functional protein motifs were still missing. Here, using Drosophila and Daphnia Dscam protein sequences to search for additional putative elements against our transcriptomics sequence, we were able to identify the cytoplasmic tail of PmDscam from exon 31 to the stop codon in exon 44 (Fig. 7A).
We named these exons according to the order in which they are located in the PmDscam genome.
The amino acid sequences of each cytoplasmic element are shown in Table 2. Differences between the naming system used in Chou et al. (2011) and the exons in Figure 7 include: exons 36, 37 and 38, which were previously thought to be variants C, B and A of element 5, respectively, and the amino acid sequences from exon 39 to exon 44, which were grouped together as element 8. Two alternative kinds of transmembrane domain were found in exon 32; this is like Drosophila but unlike Daphnia Dscam (6). Interestingly, mutually exclusive alternative splicing was also found in exon 44, with both of the two alternative exons containing the stop codon. In fact, based on the PmDscam genomic sequence, exon 44.1 and exon 44.2 are located in the same area but are translated in different reading frames, and this results in the expression of two different elements. Further, we found a rare case in which, if exon 43 is included, it is always followed by exon 44.1, and the resulting nucleotide sequence produces a stop codon at the very first codon of exon 44.1 (Fig. 7A). As noted previously (16), there is a poly(A) site on exon 31. When translation continues to the next exon (i.e. exon 32.1 or 32.2), the normal, membrane-bound form of Dscam is produced, but when this poly(A) tail is added, it results in the production of the tail-less form of PmDscam. This tail-less form has been found in several crustaceans, but not in insects (10,16,24). Bioinformatics  Table 2). The other functional motifs of Dscam, which are highly conserved among crustaceans and insects, were predicted with the Simple Modular Architecture Research Tool (SMART) version 4.0 and are also shown in Figure 7C and Table 2. The PmDscam cytoplasmic tail includes important protein motifs that correspond to those in Drosophila and Daphnia Dscam, even though many of the amino acid sequences in each exon share a percent identity of less than 50% (Table 2).
Taken together, PmDscam exhibits a cytoplasmic tail arrangement that is the most complex to have so far been reported in any arthropod. Information on both the nucleotide and amino acid sequences of the extracellular region and cytoplasmic tail of PmDscam is now publicly accessible from our shrimp Dscam in-house database (http://pmdscam.dbbs.ncku.edu.tw/).

The PmDscam ORF
The complete full-length PmDscam, including both the extracellular region and the cytoplasmic tail, is shown in Figure 8. The open reading frame (ORF) of PmDscam contains 6,135 bp encoding a predicted protein of 2,045 amino acid residues, although the lengths of the nucleotide and amino acid sequences vary as a result of the alternative splicing of hypervariable exons. The putative signal peptide predicted by SignalP 3.0 is located at the N-terminus. Domain homology analysis using SMART software showed that the deduced amino acid sequence contained ten tandem repeat immunoglobulin domains (Ig), six fibronectin type III domains (FNIII) and thirteen elements in the cytoplasmic tail. The hypervariable sequences in Ig2, Ig3 and Ig7 are indicated. The conserved cell attachment RGD motif (Arg-Gly-Asp) is located between the Ig6 and Ig7 domains at amino acids 595 to 597. The mutually exclusive alternative splicing elements 1 and 13 in the cytoplasmic tail are also indicated.

Discussion
During the past decade, several approaches, including BAC end sequencing, linkage map construction, transcriptome sequencing and whole-genome sequencing, have been used to investigate the genome and genetic properties of crustaceans (26)(27). However, the large and highly repetitive sequences of the crustacean genome complicate genome assembly and other genetic studies (26,28). Furthermore, crustacean genomes show substantial variations in size. For example, the genomes of caridean shrimp (Exopalaemon carinicauda) and white shrimp (Litopenaeus vannamei) are 5.73 and 2.3 Gb, respectively (28)(29), while the Penaeus monodon genome size was estimated to be ~2.1 Gb. In the present study, the P. monodon whole-genome sequence analysis was conducted using state-of-the-art genomics techniques, including a combination of short-read Illumina and long-read PacBio sequencing and hybrid assembly. A Penaeus monodon Dscam (PmDscam) genome, ~250 kb, was assembled, corrected and analyzed (Fig. 2A).
We reported previously (16) that PmDscam has a typical Dscam domain architecture similar to arthropod Dscam (9). The extracellular region has 10 immunoglobulin domains and six fibronectin type III domains, with half of the second and third Ig domains and the entire Ig7 domain encoded by arrays of near-duplicate exons. The FNIII6 of the extracellular region is followed by a transmembrane domain and a cytoplasmic tail (5)(6). Diversity of the hypervariable regions, i.e. the Ig2, Ig3 and Ig7 domains, occurs through mutually exclusive alternative splicing, which ensures that in mature mRNA there is only one exon selected from each array cluster (7). In the present study, we found that the PmDscam genome has a total of 175 exons, with five variable regions: the extracellular exon clusters 4, 6 and 15, and two cytoplasmic tail exon clusters (32 and 44), which had two alternative splicing exons each (Fig. 3A, 3B). In contrast to our previous study, which reported finding 28, 43 and 19 alternative sequences for N-terminal Ig2, N-terminal Ig3 and the entire Ig7, respectively (16), Figure 4 shows that the correct numbers are in fact 26, 81 and 26. There are two reasons for these discrepancies. In the previous study, isoforms with only a single amino acid difference were counted as distinct isoforms even though they were more likely to have resulted from sequencing errors. This would have artificially inflated the earlier figures.
Conversely, a number of isoforms were simply not found in the Chou et al. (2011) study. In insects, hypervariability is also produced by mutually exclusive RNA splicing that occurs in clusters of alternative splicing exons (2,5). In a comparison of hypervariable exons among arthropods, PmDscam had the largest number of exon variants (3,8,15,22). In PmDscam, we found that each alternative splicing exon cluster has a different level of conservation: exon 4 variants share a higher amino acid similarity with one another than do the variants of exons 6 and 15 (Fig. 4). Based on alternative splicing in those hypervariable exons, we infer that there are at least 54,756 and 512 possible combinations for the extracellular region and cytoplasmic tail, respectively. It is noteworthy that P. monodon can generate the most Dscam isoforms among currently known arthropod species. For example, P. monodon can generate more Dscam isoforms than crab (30,600), Drosophila (19,008) and Daphnia (3,264) (6,8,15).
Since the presence of Dscam in nerve cells and immune-related cells or hemocytes implies it might have a role in both the nervous and immune systems (2,4,30), we investigated the population distribution of the PmDscam hypervariable exons which encode Ig2, Ig3 and Ig7 in both hemocytes and nerve tissues. The populations of exon 4 (Fig. 6B) and exon 6 (Fig. 6C) variants were similar in hemocytes and nerve tissue, whereas there was a higher diversity of exon 15 variants in nerve tissues compared to hemocytes (Fig. 6D). As noted above, parts of Ig2 and Ig3 contribute to the two composite surface epitopes, epitope I and epitope II (21). Epitope I is important for homophilic binding specificity, whereas epitope II may be involved in non-Dscam binding (21). The two epitopes located in the exon 4 and exon 6 spliced variants were also detected in PmDscam (Fig. 5), suggesting that they may function as proposed by Meijers et al. (2007). In addition, the amino acid sequences of those two epitopes were highly similar to those of EsDscam, suggesting that, as in crab (15), PmDscam may bind to specific pathogens and regulate phagocytosis.
Transcriptomic data were used to determine the unknown exon sequences in the cytoplasmic tail of shrimp Dscam. Here, unlike Dscam from other arthropods, PmDscam had not only two alternative exons that encode transmembrane domains, but also two alternative exons that contain stop codons in the cytoplasmic tail (Fig. 7B). Several functional domains that are conserved among arthropod Dscams were discovered, including an SH2-binding motif, an SH3-binding motif, an ITAM motif, a polyproline motif and a PDZ motif (Fig. 7C; Table 2). These small binding motifs are involved in specific protein-protein interactions in cellular signal transduction (31)(32). The SH2/SH3-binding motif interacts with Dock and then activates axon guidance in Drosophila (5). The ITAM motif was reported to be involved in downstream protein tyrosine kinase (PTK)-mediated immunoreceptor signaling after ligand binding, and it regulates the expression of surface membrane receptors (6,33). The PDZ motif determines which exons are present on the cytoplasmic tail (34). However, no immunoreceptor tyrosine-based inhibition motif (ITIM; (I/S/V/L)XYXX(V/L)) was found in PmDscam. The ITIM motif is also missing from crab Dscam (23,35), which suggests that these two crustaceans may have only positive transmembrane signaling.
In Daphnia, the cytoplasmic tail can include or exclude the ITIM or ITAM motif, implying variable signal capacity (6). Like other arthropod Dscams, the RGD (Arg-Gly-Asp) motif that is recognized by integrin family members (36) was also present between Ig6 and Ig7 in the PmDscam extracellular region.
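Short linear motifs such as ITIM and RGD can be screened for with regular expressions built from their consensus patterns. The following is a minimal sketch, not the procedure used in this study (motifs here were predicted with SMART): the ITIM pattern is transcribed from the consensus (I/S/V/L)XYXX(V/L) quoted above, and the peptide is a made-up example, not a PmDscam sequence.

```python
import re

# Consensus patterns written as regular expressions. A lookahead is used so
# that overlapping matches are also reported.
MOTIF_PATTERNS = {
    "RGD": re.compile(r"(?=(RGD))"),
    "ITIM": re.compile(r"(?=([ISVL].Y..[VL]))"),  # (I/S/V/L)XYXX(V/L)
}


def find_motifs(protein):
    """Return {motif name: [(1-based start position, matched residues), ...]}."""
    return {
        name: [(m.start() + 1, m.group(1)) for m in pattern.finditer(protein)]
        for name, pattern in MOTIF_PATTERNS.items()
    }


# Hypothetical peptide used purely for illustration.
example_peptide = "MKTARGDLLSVIYAEVPQ"
hits = find_motifs(example_peptide)
print(hits)
```

A consensus match is only a candidate site; whether it is functional depends on its sequence context and would need experimental support.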
Additionally, we found that alternative splicing also produced exon variability in the cytoplasmic tail  Table 2), suggesting that these alternative PDZ domains may interact with different proteins that are located in various parts of the cellular membrane (39). Isoforms with or without these motifs may have important differences in signaling capacity and regulation of expression of surface membrane receptors (40).
Although most of the alternative splicing exons in PmDscam and Drosophila Dscam have a relatively low amino acid identity (Table 2), there is also a high (>50%) level of amino acid conservation in 3 out of 5 of the constitutive domains, suggesting that shrimp and insect Dscam might share a common ancestor. Finally, we would note that PmDscam has a great diversity of isoforms, and its complex cytoplasmic tail structure enables >20 million isoforms via alternative splicing of the extracellular regions and cytoplasmic tail. This is a considerably greater number of isoforms than in any other crustacean or arthropod (5-6,10,15,23).
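One way to arrive at the reported 512 cytoplasmic tail combinations, and at a total above 20 million, is sketched below. The decomposition is an assumption: it treats the two mutually exclusive clusters (two variants each) and the 7 include-or-skip elements described in the Results as fully independent, which the text does not state explicitly.

```python
# Reported in the text.
EXTRACELLULAR_COMBINATIONS = 54_756


def cytoplasmic_combinations(exclusive_clusters=2, variants_per_cluster=2,
                             optional_elements=7):
    """Combinations from mutually exclusive clusters and include/skip elements.

    Assumes every choice is independent (an assumption made for this sketch).
    """
    return variants_per_cluster ** exclusive_clusters * 2 ** optional_elements


tail = cytoplasmic_combinations()
total = EXTRACELLULAR_COMBINATIONS * tail
print(f"{tail:,} tail combinations, {total:,} combinations overall")
```

Under full independence the product is 28,035,072. Constraints such as exon 43 always being followed by exon 44.1 would reduce this upper bound, which is consistent with the more conservative ">20 million" quoted above.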

Conclusions
Combining all the data obtained from genomics, transcriptomics and cDNA, we successfully generated an in-house database (http://pmdscam.dbbs.ncku.edu.tw/) of PmDscam that supports BLAST searches of the nucleotide and amino acid sequences of the extracellular regions and cytoplasmic tail. This database should be useful for researchers who need to identify isoforms of each hypervariable exon. We are confident that this PmDscam genome as well as our in-house database will be useful resources for research into the involvement of Dscam in pancrustacean immunity.

Whole-genome sequencing
To construct the complete Dscam genome (PmDscam) for the tiger shrimp Penaeus monodon, we first used a combination of traditional, next-generation, and new third-generation sequencing strategies to assemble a draft genome (Fig. 1A). For the Illumina whole-genome sequencing, genomic DNA was extracted from the muscle tissue of an adult female (F09) collected from the coastal waters of Taiwan, following the standard phenol-chloroform procedure. Using the standard operating protocol provided by Illumina (San Diego, CA, USA), two different types of insert library for sequencing were constructed: paired-end libraries for small inserts (180, 350, and 500 bp), and mate-pair libraries for large inserts (2, 5, and 8 kb) (Table S1). Paired-end sequencing was performed using the Illumina HiSeq platform, and a total of 585.60 Gb of raw reads (293.03 Gb from the small insert libraries and 292.57 Gb from the large insert libraries) were generated (Table S1). After quality control to remove low-quality reads, PCR duplicates and adapter sequences, we obtained 486.22 Gb (224.06X genome coverage) of clean data for subsequent assembly.
In addition, to improve the assembly quality and increase the scaffold N50, we adopted the PacBio (Pacific Biosciences) single-molecule real-time sequencing strategy. Pleopod genomic DNA (F40) was extracted using the Blood and Cell Culture DNA Midi Kit (Qiagen) for construction of a 20-kb insert-size library. A total of 29 SMRT cells were sequenced on the PacBio RS II platform, producing ~17.9 Gb of long-read data with a read length N50 of 11.6 kb (mean 9.14 kb) (Table S1).
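Fold coverage is the total number of sequenced bases divided by the genome size. The sketch below assumes the flow-cytometry estimate of ~2.17 Gb as the genome size; the text does not say which value was used, but this one reproduces the reported 224.06X for the clean Illumina data. The PacBio coverage it prints is derived here and is not a figure stated in the text.

```python
# Assumed genome size (Gb): the flow-cytometry estimate for P. monodon.
GENOME_SIZE_GB = 2.17


def fold_coverage(total_bases_gb, genome_size_gb=GENOME_SIZE_GB):
    """Average sequencing depth: sequenced bases divided by genome size."""
    return total_bases_gb / genome_size_gb


illumina_coverage = fold_coverage(486.22)  # clean Illumina data
pacbio_coverage = fold_coverage(17.9)      # PacBio long reads (approximate yield)

print(f"Illumina: {illumina_coverage:.2f}X, PacBio: {pacbio_coverage:.2f}X")
```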

De novo genome assembly
As Figure 1A shows, for the preliminary genome assembly, we first assembled the Illumina short reads using two different programs, Allpaths-LG (41) and Velvet (42), separately. The ALLPATHS assembly had a higher N50 length (6,606 bp vs. 2,458 bp) and a much lower contig number (251,428 vs. 2,003,807) than the VELVET assembly, but the total contig length (1,101,722,092 bp) was only half of the VELVET assembly (2,167,365,623 bp). The VELVET assembly contig length was very close to the full length of the P. monodon genome (~2.17 Gb) as estimated by flow cytometry (43).
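For readers unfamiliar with the metric, N50 is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. A minimal sketch with toy lengths (not data from this study):

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("N50 is undefined for an empty assembly")


# Toy example: total length 30, half is 15; 10 + 8 = 18 reaches it at the 8-unit contig.
toy_n50 = n50([2, 10, 4, 8, 6])
print(toy_n50)  # 8
```

This illustrates why an assembly can have a higher N50 yet be less complete: N50 is relative to the assembled length, not to the true genome size.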
To improve the scaffold N50, a third assembly was produced. This was a hybrid assembly (the DBG2OLC assembly) combining both the Illumina short reads and PacBio long reads data. However, due to computational limitations, not all Illumina data were used for this assembly. To obtain an optimum assembly that had both contiguity and completeness and could serve as a practical genome database, the three assemblies were sequentially merged using quickmerge (46).
For this process, the DBG2OLC assembly (most contiguous and least complete) was merged with the ALLPATHS assembly (the next most contiguous but more complete), and the result was then merged with the VELVET assembly to produce the first draft M2 assembly (Fig. 1A; Table S2). Default merging parameters (python merge_wrapper.py ${hybridpath} ${selfpath} -hco 5 -c 1.5 -l 10000) were used, with the exception of the -l parameter (minimum size cutoff for seed contigs for merging), which was changed because the low average contig size across the genome would have prevented merging had the ordinary cutoff been used. The M2 assembly was polished using one round of Quiver (47) error correction and one round of Pilon (48) error correction, as described in Chakraborty et al. (2016). All available PacBio data and all available non-mate-pair Illumina data were used for polishing.
To fill the gaps found in some parts of the genome and to confirm the sequences, Sanger sequencing was performed using cDNA and genomic DNA samples. Total RNA samples were isolated from hemocytes using REzol™ C&T reagent (Protech Technology, Taiwan) according to the manufacturer's protocol. First-strand cDNA synthesis was performed using SuperScript® II Reverse Transcriptase (Invitrogen) according to the manufacturer's instructions. Genomic DNA was extracted from the pleopods of individual shrimp using a DNA extraction kit (GeneReach Biotechnology Corp.).
The hemocyte cDNA and pleopod genomic DNA were used as templates for PCR amplification of the exon and intron fragments using gene-specific primers (Table 1). The PCR products were separated by agarose gel electrophoresis and purified prior to cloning. The purified DNA fragments were cloned into RBC T&A cloning vector (RBC Bioscience, Taiwan) and sequenced using M13F and M13R universal primers. The resulting Sanger sequences were then merged with the first draft M2 assembly to produce the corrected M2 assembly (Fig. 1A).

Transcriptome sequencing and assembly
Paired-end sequencing was performed on an Illumina NextSeq500 (Genomics BioSci & Tech Co.), and the paired-end reads were assembled using Trinity (v.2.1.1; 49) in strand-specific mode (--SS_lib_type RF). For functional classification, annotations were determined using BLAST with the Flybase database, and analysis was conducted using PANTHER (50). For the gene-to-gene correlation network, annotations were determined using BLAST with the NCBI-PM and EMBL-CDS databases, and analysis was conducted using the ContigViews (51) web server.
The transcriptomics database was used to search for the remaining exons located in the cytoplasmic tail region. To obtain the sequence of the cytoplasmic tail, several conserved sequences of PmDscam (Table S3; 16) were first used to search against the transcriptomics database. Then, all of the nucleotide sequences were translated to amino acid sequences and BLASTed against the NCBI database. The obtained sequences were analyzed, and both the nucleotide and amino acid sequences of each exon were identified. Finally, the PmDscam genome database was searched with the nucleotide sequence of each exon to find the location of those exons on the PmDscam genome (Fig. 1B). Submission of the corresponding sequences to the NCBI database is in progress, and the complete set of exon sequences for PmDscam has already been uploaded to our in-house database (please see section 3.5).

Identification of PmDscam hypervariable regions and sequence analysis
To obtain the hypervariable sequences of the PmDscam exons encoding Ig2, Ig3 and Ig7, we first searched the corrected M2 assembly to find the locations of the conserved amino acid sequences of previously known PmDscam isoform variants from each domain (16). To ensure that every potential isoform variant was included, we then aligned all matching variants and used the conserved sequences from each variable region as a guide to search for all the possible exons in the PmDscam genome sequences. The exons from each hypervariable region were named according to the order of their location in the PmDscam genome sequence.
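Searching a nucleotide assembly for a conserved amino acid sequence amounts to translating the DNA in all six reading frames and looking for the peptide in each translation. The sketch below illustrates that idea only; it is not the tool used in this study (which is not named here), it requires exact matches, and the test sequence is a made-up example. Because each frame is translated without regard to introns, it can only find peptides encoded within a single exon.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_peptide(genome, peptide):
    """Return (strand, frame, nucleotide offset) for each exact peptide match.

    Offsets are 0-based and, for the '-' strand, are relative to the
    reverse-complemented sequence.
    """
    genome = genome.upper()
    hits = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            start = protein.find(peptide)
            while start != -1:
                hits.append((strand, frame, frame + 3 * start))
                start = protein.find(peptide, start + 1)
    return hits


# Hypothetical sequence: two filler bases followed by codons for M-K-R-G-D-W.
example_hits = find_peptide("CCATGAAACGTGGTGATTGG", "RGD")
print(example_hits)
```

In practice a similarity search that tolerates mismatches (for example tblastn) would be more appropriate for hypervariable exons, since only part of each variant is conserved.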

Diversity of hypervariable regions in immune-related tissues
To investigate the expression of hypervariable exons in shrimp, hemocytes and nerve tissues were collected from ten individual shrimp and used to amplify the hypervariable regions of PmDscam. For the hemocyte samples, hemolymph was drawn from the ventral sinus using a sterile 1-ml syringe with anticoagulant solution and centrifuged at 10,000 × g for 1 min at 4 °C to separate the hemocytes. Then, for both the hemocytes and excised nerve tissue, total RNA was extracted from each sample using REzol™ C&T reagent (Protech Technology, Taiwan) following the manufacturer's instructions. The extracted RNA was used as a template to synthesize first-strand cDNA with SuperScript® II Reverse Transcriptase (Invitrogen) according to the manufacturer's instructions. To obtain the cDNA sequences of the Ig2, Ig3 and Ig7 variable exons, we performed nested polymerase chain reaction (PCR) using two sets of oligonucleotide primer pairs specific to PmDscam. The first amplification used the primers D-F16 and D-R30 (Table 1). The PCR reaction mixture contained 0.2 mM dNTP, 1.5 mM MgCl2, 0.2 µM of each primer and 2X Taq DNA Polymerase Mastermix-RED (Bioman). The PCR reaction was carried out as follows: 94 °C for 5 min, then 35 cycles of 94 °C for 30 sec, 55 °C for 30 sec and 72 °C for 2 min, followed by a final extension at 72 °C for 10 min. The PCR product was then diluted and used as the template for the second amplification of the nested PCR with the primers D-F24 and D-R30 (Table 1) in the presence of 1 unit of TaKaRa Ex Taq polymerase (Takara). The PCR reaction was carried out as described above. The PCR products were purified and cloned into RBC T&A cloning vector (RBC Bioscience, Taiwan). Individual colonies (n=20) containing insert fragments from each sample were selected randomly and sequenced using M13F and M13R universal primers. BLAST was used to check that the obtained sequences corresponded to our PmDscam genome database.
Isoform sequences were aligned with Clustal Omega (http://www.ebi.ac.uk/uniprot/).

The PmDscam Database
The PmDscam database was constructed on a LAMP (Linux+Apache+MySQL+PHP) system. The web interface is written in PHP. BLAST algorithms (52), including blastn, blastp and blastx, were used for sequence alignment, with the e-value set to 10e-10 as default. There are 175 P. monodon Dscam exons in the PmDscam database. Users can input multiple sequences in FASTA format to perform an analysis. All the blast results for each sequence will be shown.

Availability of data and materials
All data generated or analyzed during this study are included in this published article and its supplementary information files.

Competing interests
The authors declare that they have no competing interests.

Tables
Table 1 Nucleotide sequence of the primers used.
