3.2 De-nova Sequencing analysis
The total coverage of raw data of the whole genome was estimated to be 56x. The total number of Illumina raw sequencing reads was about 489387876 and total base pairs of the raw reads were 73897569276. The total GC content was about 38.01 percentage and AT about 61.99 percentage. The raw reads have the phred quality score of 94.62 percentage for Q20 and 88.13 percentage for Q30. Quality of FASTQ data has improved by using tools such as, Trimmomatic (Bolger et al. 2014), another popular trimming adapter tool. FASTQC (Simon Andrews 2010) is a Java-based QC tool providing per-base and per-read quality profiling features. The raw reads were quality filtered using a flexible read trimming tool Trimmomatic and the raw sequences were checked for adapter contamination and poor-quality bases by FASTQC software. This tool removes the Illumina sequencing adapters and low-quality sequences (Phred score > 30). The total and quality filtered reads with the length and the percentage of reads above Q30 were mentioned in Table.2. The best k-mer length was estimated using Kmergenie. The best k-mer identified by this algorithm was used for the genome size estimation using Jellifish and Genomescope Table.3, Fig. 2. GenomeScope2 model fitting of the k-mer distribution analysis estimated a genome size 1.31Gbp, which is comparatively smaller than the previously reported value by (Swathi et al. 2018), and (Kawato et al. 2021) and has low heterozygosity of about 0.81% (Fig. 2), whereas the unique sequences were found to be 77.9% and about 1.02Gbp (Table.3). The best k-mer selected was used for the de novo genome assembly using ABySS v.2.1.5 Table.2.
Table 2
Summary of Kmer analysis and genome size estimation
SI.No | Particulars | Number |
1 | Estimated K (best) | 69 |
2 | Estimated Heterozygosity | 0.81% |
3 | Estimated Genome Haploid Length | 1,308,850,805 bp |
4 | Estimated Genome Repeat Length | 288,637,448 bp |
5 | Estimated Genome Unique Length | 1,020,213,357 bp |
BUSCOs scores of the final assembly indicate a fairly complete genome assembly with the completed BUSCO of 52.8%, and have the fragmented genes of about 17.7%. Similarly, the slight difference observed between the genome size and the initial size estimation is unlikely to be a consequence of erroneous assembly duplication, as duplicated BUSCOs scores is 2.0%. The quality of the genome assembly is further supported by the Jellifish and Genomescope analysis; overall, these statistics indicate that the C Pseudogracilirostris draft genome assembly here presented is fairly complete, non-redundant, and useful resource for various applications (Fig. 2).
3.2.1 Repeat masking and Repeat annotation
The repeats in the assembled genome were masked and annotated using a de novo repeats library and RepeatMasker (Table 3). Notably, repeated repeats make up 24.60 percent of the genomic sequences. According to the percentage of bases that were mask using RepeatModeler, the de novo repeat elements made up 24.60 percent (247 Mb) of the assembled genome. Simple sequence repeats (SSR), which made up a significant fraction of the genome (7.26 percent), were one of the distinguishing characteristics. Retroelements (3.19 percent), LINEs (2.37 percent), and L2/CR1/Rex are three other significant repeat classes (1.05 percent). Other minor repeat families found in the C included the low complexity regions (0.97 percent), LTR elements (0.82 percent), Gypsy/DIRS1 (0.79 percent), DNA transposons (0.65 percent), Penelope (0.48 percent), small RNA (0.45 percent), hobo-Activator (0.28 percent), RTE/Bov-B (0.27 percent), R1/LOA/Jockey (0.24 percent), and Tc1-IS630-Pogo (0. assembly of the pseudogracilirostris genome.
Table 3
Summary repeats identified from Shrimp genome
Elements | Number of elements* | length occupied | percentage of sequence |
Retroelements | 119335 | 31987404 bp | 3.19% |
SINEs: | 0 | 0 bp | 0.00% |
Penelope | 20195 | 4832234 bp | 0.48% |
LINEs: | 99932 | 23743059 bp | 2.37% |
CRE/SLACS | 3264 | 651164 bp | 0.06% |
L2/CR1/Rex | 43007 | 10558906 bp | 1.05% |
R1/LOA/Jockey | 8571 | 2391652 bp | 0.24% |
R2/R4/NeSL | 0 | 0 bp | 0.00% |
RTE/Bov-B | 14949 | 2720789 bp | 0.27% |
L1/CIN4 | 1227 | 61535 bp | 0.01% |
LTR elements: | 19403 | 8244345 bp | 0.82% |
BEL/Pao | 126 | 37428 bp | 0.00% |
Ty1/Copia | 0 | 0 bp | 0.00% |
Gypsy/DIRS1 | 17150 | 7908568 bp | 0.79% |
Retroviral | 1009 | 203099 bp | 0.02% |
DNA transposons | 27558 | 6549101 bp | 0.65% |
hobo-Activator | 10032 | 2782116 bp | 0.28% |
Tc1-IS630-Pogo | 8197 | 1985750 bp | 0.20% |
En-Spm | 0 | 0 bp | 0.00% |
MuDR-IS905 | 0 | 0 bp | 0.00% |
PiggyBac | 131 | 34120 bp | 0.00% |
Tourist/Harbinger | 1404 | 306575 bp | 0.03% |
Other (Mirage, P-element, Transib) | 243 | 69465 bp | 0.01% |
Rolling-circles | 7479 | 1076387 bp | 0.11% |
Unclassified: | 886502 | 120058713 bp | 11.96% |
Total interspersed repeats: | | 158595218 bp | 15.80% |
Small RNA: | 29150 | 4517743 bp | 0.45% |
Satellites: | 74 | 14407 bp | 0.00% |
Simple repeats: | 1148445 | 72870646 bp | 7.26% |
Low complexity: | 130707 | 9779534 bp | 0.97% |
* most repeats fragmented by insertions or deletions have been counted as one element |
The genomes of P. chinensis and P. vannamei have higher proportion of DNA transposons and low complexity repeats than other shrimp. In general, large-scale DNA editing of retrotransposons, by simultaneously generating large numbers of mutations, may have accelerated their exaptation during mammalian evolution (Carmi et al. 2011). Similarly, inverted SINE repeats promotes RNA editing by adenosine to inosine deamination by being part of longer RNAs, it creates potential novelties in both coding and regulatory sequences (Daniel et al. 2014). The role of SSRs in adaptive evolution was recently demonstrated for shrimp (Yuan et al. 2021). It was considered that only negligible amount of Transposable elements (TEs) was present in eukaryotic genomes, although long before sequencing began, Later it was known that it accounts for major proportion of genomes (Britten and Kohne 1968). The proportion of SSRs reported here for C. Pseudogracilirostris was found to be the lower than P. indicus genome (49.31%). It was now recognized that the proportion of TEs in the genome can differ widely depending on the organism, it ranges from 3% in the yeast (Carr et al. 2012) to a large proportion that almost the entire genome about > 80% in the maize (Meyers et al. 2001). And particularly human genome was rich in repetitive sequences, which was about 45%, as per International Human Genome Sequencing Consortium, 2001.
3.2.2 Gene Prediction and Gene Annotations
The genes were predicted using Ab initio method from the repeat masked genome using Augustus v.2.5.5. (Table S1) A total of 14101 genes are been identified, from 63100629 of total gene length. The putative functions for the predicted genes were assigned using BLASTx and non-redundant protein database (Table 4).
Table 4
Summary of BLASTX annotation
SI.No | Particulars | Number |
1 | Total number of genes subjected to BLASTx | 14101 |
2 | Total number of genes assigned with function | 12345 |
3 | Total number of genes with no similarity | 1756 |
3.3 Functional annotation and pathway analysis using EggNOG mapper, Panther db and pathway common analysis
About 9395 transcripts (66.62 percent of the total sequences) had hits in the EggNOG database, of which 5231 (55.67 percent), 6476 (68.93 percent), and 3867 (41.16 percent) were connected to the relevant GO keywords, KEGG orthology functional annotation, and gene symbols, respectively. By mapping the genes with Panther database resulted a list of 19 under Biological process 1. Developmental process (GO:0032502), 2. Multicellular organismal process (GO:0032501), 3. Cellular process (GO:0009987),4. Reproduction (GO:0000003), 5. Localization (GO:0051179), 6. Reproductive process (GO:0022414), 7. Biological adhesion (GO:0022610), 8. Immune system process (GO:0002376), 9. Biological regulation (GO:0065007), 10. growth (GO:0040007), 11. Signalling (GO:0023052), 12.Metabolic process (GO:0008152), 13. Biological process involved in interspecies interaction between organisms (GO:0044419), 14. Pigmentation (GO:0043473), 15. response to stimulus (GO:0050896), 16. Biological phase (GO:0044848), 17. Behaviour (GO:0007610), 18. Rhythmic process (GO:0048511), 19. Locomotion (GO:0040011). The relative function of genes under 4 major process (Developmental (31), cellular (30), immune system (20) and reproduction (24)) was clustered using pathway common and are summarised in supplementary Tables (2,3,4,5). A total of 31 genes that mapped under Developmental process (actS, APC, APC2, CNTN5, DAG1, DDR2, DLX2, eda, FLI1, FRK, GAL3ST1, GBX, HAND2, HCK, HCRTR2, ILK, LHX1, MMP14, mrdB, mreD, NR1A2, NR1H4, NR4A2, PTK7, rodA, RTN1, RXRA, TCP1, THRB, WDR77, ZEB1) gets clustered and involved in the pathways of of SUMOylation of intracellular receptors, Nuclear Receptor transcription pathway and Thyroid hormone mediated signaling pathway (Table S2). 30 genes that mapped under Cellular process (ABHD12, ACLY, APBB2, carA, CETN1, DDX46, DSC1, GDF2, HCRTR2, HIP1, lysS, MTHFS, MYC, NCAPD3, NSF, NUP54, P4HB, POLE, PPRC1, PRP39, PRPF39, RICTOR, SARDH, SIX4, SLC, 2A2, TPR, TPS, TRPC4AP, valS, XPOT) gets clustered and involved in the pathway of Cellular modified amino acid catabolic process (Table S3). 20 genes that mapped under Immune system process (ABHD12, ACLY, APBB2, carA, CETN1, DDX46, DSC1, GDF2, HCRTR2, HIP1, lysS, MTHFS, MYC, NCAPD3, NSF, NUP54, P4HB, POLE, PPRC1, PRP39, PRPF39, RICTOR, SARDH, SIX4, SLC, 2A2, TPR, TPS, TRPC4AP, valS, XPOT) gets clustered and involved in the pathways of CLEC7A/inflammasome pathway, DEx/H-box helicases activate type I IFN and inflammatory cytokines production, DEx/H-box helicases activate type I IFN and inflammatory cytokines production, Toll-like Receptor Cascades, Cellular response to mechanical stimulus and Response to lipopolysaccharide (Table S4). 24 genes that mapped under Reproductive process (DDX3X, DMRT2, DMRTA, ERCC4, HUS1, lhr, LIS1, MCD1, MPS1, NANOS1, NCAPD3, OXTR, PAFAH1B1, ppk1, RAD50, RAD51C, RFA1, RPA1, SCC1, TDRD12, TDRD5, ttk, TUBGCP5, WDR77) gets clustered and involved in the pathways of HDR through Homologous Recombination, Homologous DNA Pairing and Strand Exchange, Presynaptic phase of homologous DNA pairing and strand exchange, Processing of DNA double-strand break ends, HDR through Single Strand Annealing, Homology Directed Repair, HDR through Homologous Recombination (HRR) or Single Strand Annealing (SSA), DNA Double-Strand Break Repair, G2/M DNA damage checkpoint, Double-strand break repair via homologous recombination, Recombination repair, Nucleotide-excision repair, Meiotic cell cycle, Meiotic nuclear division, Reciprocal meiotic recombination, Meiotic cell cycle process, Homologous recombination, Meiotic cell cycle checkpoint signalling, Meiotic recombination, Oocyte construction, Oocyte axis specification, Telomere organization, Telomere maintenance, Nucleotide-excision repair, Negative regulation of telomere capping, Regulation of TP53 Activity through Phosphorylation (Table S5.1, S5.2, S5.3, S5.4 and S5.5).
3.4 Species conservation and phylogenetic analysis
The protein sequences uploaded in NCBI database were retrieved for all the Malacostraca species with enough number of sequences. A total of 15 Malacostraca species (Armadillidium nasatum, Armadillidium vulgare, Cherax quadricarinatus, Eriocheir sinensis, Gammarus roeselii, Homarus americanus, Hyalella Azteca, Macrobrachium nipponense, Penaeus indicus, Penaeus japonicus, Penaeus monodon, Penaeus vannamei, Portunus trituberculatus, Procambarus clarkia and Trinorchestia longiramus) along with C. Pseudogracilirostris were analysed and 2,81,390 genes were involved in orthogroups representing conserved genes/gene families in various animals (Fig. 3). The percentage of genes included in orthogroups were 88.8% among the total genes of 3, 17,043. A proportion of unassigned genes were about 35,653 representing 11.2%. A total of 21685 orthogroups were shared by all the animal species involved in this study. Highest number of species specific orthogroups was found to be 5294. The highest number of genes in species-specific orthogroups was 26377 (Table. 5). Further investigation showed that there were 7396 orthogroups and a total of 11619 genes involved in the orthogroups of C. pseudogracilirostris. Approximately 3.7 percent of the genes from C. pseudogracilirostris were implicated in orthogroups, compared to 8.3 percent of the genes from other species. Macrobrachium nipponense was found to be the species that was most closely related to Caridina Pseudogracilirostris in the phylogenetic analysis of all 16 malacostracan species. The number of genes duplicated in C. pseudogracilirostris was 1856, but in Homarus americanus, it was around 39322, and in Cherax quadricarinatus, it was approximately 91.
Table 5
Orthologous genes analysis between 16 species of Malacostraca
Overall Statistics |
Number of species | 16 |
Number of genes | 317043 |
Number of genes in orthogroups | 281390 |
Number of unassigned genes | 35653 |
Percentage of genes in orthogroups | 88.8 |
Percentage of unassigned genes | 11.2 |
Number of orthogroups | 21685 |
Number of species-specific orthogroups | 5294 |
Number of genes in species-specific orthogroups | 26377 |
Number of C. Pseudogracilirostris- specific orthogroups | 7396 |
Number of genes in C. Pseudogracilirostris -specific orthogroups | 11619 |
Percentage of genes in species-specific orthogroups | 8.3 |
Percentage of genes in C. Pseudogracilirostris orthogroups | 3.7 |
Mean orthogroup size | 13 |
Median orthogroup size | 9 |
G50 (assigned genes) | 20 |
G50 (all genes) | 18 |
O50 (assigned genes) | 3524 |
O50 (all genes) | 4453 |
Number of orthogroups with all species present | 1 |
Number of single-copy orthogroups | 0 |
Table 6
Functions of mapped genes in neurogenesis pathway.
Function | Genes involved in performing the function |
Regulation of synapse structure or activity | APP, DAG1, EPHB1, GRIN1, NEDD4, NRCAM, PAFAH1B1, ROBO2, SLIT1, and TIAM1. |
Regulation of synapse organization | APP, DAG1, EPHB1, GRIN1, NEDD4, NRCAM, PAFAH1B1, ROBO2, SLIT1, and TIAM1. |
Telencephalon development | CNTN2, GRIN1, PAFAH1B1, ROBO1, ROBO2, RYK, SLIT1, SLIT2, and SMO. |
Telencephalon cell migration | CNTN2, PAFAH1B1, ROBO1, SLIT1, and SLIT2. |
Forebrain cell migration | CNTN2, PAFAH1B1, ROBO1, SLIT1, and SLIT2 |
Forebrain development | APP, CNTN2, GRIN1, NOTCH1, NR4A2, PAFAH1B1, ROBO1, ROBO2, RYK, SLIT1, SLIT2, SMO, and WNT4 |
Regulation of axonogenesis | CNTN2, GRIN1, L1CAM, NRCAM, PAFAH1B1, PAK1, ROBO1, ROBO2, RYK, SLIT2, TIAM1, and ULK2. |
Axon extension | L1CAM, NRCAM, PAFAH1B1, PAK1, RYK, SLIT1, SLIT2, SLIT3, and ULK2. |
Neuron projection extension | L1CAM, NRCAM, PAFAH1B1, PAK1, RYK, SLIT1, SLIT2, SLIT3, TIAM1, and ULK2. |
Positive regulation of axonogenesis | L1CAM, PAFAH1B1, PAK1, ROBO1, ROBO2, SLIT2, and TIAM1 |
Central nervous system projection neuron | EPHB1, NR4A2, PAFAH1B1, and SLIT2 |
Central nervous system neuron development | CNTN2, EPHB1, NR4A2, PAFAH1B1, ROBO1, ROBO2, and SLIT2 |
Central nervous system neuron axonogenesis | EPHB1, NR4A2, PAFAH1B1, and SLIT2 |
Central nervous system neuron differentiation | CNTN2, EPHB1, NR4A2, PAFAH1B1, ROBO1, ROBO2, RYK, SLIT2, and SMO |
Olfactory lobe development | ROBO1, ROBO2, SLIT1, and SLIT2 |
Olfactory bulb interneuron differentiation | ROBO1, ROBO2, and SLIT2 |
Olfactory bulb development | ROBO1, ROBO2, SLIT1, and SLIT2 |
Brain morphogenesis | PAFAH1B1, SLIT1, and SMO |
Axon choice point recognition | APP, ROBO1, ROBO2, and ROBO3 |
Regulation of commissural axon pathfinding by SLIT and ROBO | ROBO1, ROBO2, SLIT1, SLIT2, and SLIT3 |
Activation of RAC1 | PAK1, RAC1, ROBO1, and SLIT2 |
Netrin-1 signaling | PAK1, RAC1, ROBO1, SLIT1, SLIT2, and SLIT3 |
Retinal ganglion cell axon guidance | EPHB1, NRCAM, ROBO2, SLIT1, and SLIT2 |
Next-generation sequencing (NGS) technologies and information of genomic sequence help in achieving our aim to decode unfound genes and its evolutionary mysteries. It is well known that innate immunity-related molecules, such as cytokines, toll-like receptors, the complement family, and molecules of acquired immunity-related, such as MHC and antibody receptors, are also expressed in the brain and play important roles in brain development (Morimoto and Nakajima 2019). Generally, shrimps do not have the classical adaptive immune system like T cells and specific memory of antigen in order to survive under poor environmental conditions (Hoffmann et al. 1999), (Hauton and Smith 2007). Crustaceans such as shrimp, prawns, crayfish, lobster and crabs are farmed widely to improve the global demand through intensive aquaculture techniques (Hauton and Smith 2007). Studies of specialized adaptive functions, such as studies of digestive functions, can be revealed through transcriptomic analysis(Wang et al. 2021). However further genome annotation and RNA sequence analysis will reveal the physiological functions and their role in evolutionary adaptations will be revealed