De novo assembly, annotation and comparative genomics study on the draft genome of Indian brackish water shrimp Caridina pseudogracilirostris

doi:10.21203/rs.3.rs-2228983/v1


Research Article


https://doi.org/10.21203/rs.3.rs-2228983/v1

This work is licensed under a CC BY 4.0 License

Version 1



The caridean shrimp Caridina pseudogracilirostris (order Decapoda, family Atyidae) is commonly found in the brackish waters of the southwestern coastal regions of peninsular India. The draft genome of this shrimp was sequenced on the Illumina NovaSeq 6000 platform. We obtained a draft genome assembly of C. pseudogracilirostris (1.3 Gbp; 603,962 scaffolds; scaffold N50 = 2641 bp; 35.71% GC; 52.8% BUSCO completeness). Repetitive elements account for 24.60% of the genomic sequence, with a high proportion of simple sequence repeats (SSRs) spanning 7.26% of the entire genome. Other major repeat classes are retroelements (3.19%), LINEs (2.37%) and L2/CR1/Rex (1.05%). A total of 14101 genes were identified with AUGUSTUS. The predicted genes were functionally annotated using EggNOG-mapper, and the genes with database hits were sorted by biological process using the PANTHER database. Genes associated with developmental process (31), cellular process (30), immune system process (20) and reproductive process (24) were further analyzed in Pathway Commons and narrowed down to genes involved in regulatory pathways. We conducted a comparative study with 15 crustacean species using OrthoFinder, which provided the phylogenetic species tree and identified a total of 7396 orthogroups. C. pseudogracilirostris showed only 3.7% orthologous genes.

Invertebrate

Caridina pseudogracilirostris

NGS

Crustacean

shrimp

neurogenesis

Arthropods (chelicerates, myriapods, crustaceans, and hexapods) are the most diverse phylum on Earth and have adapted to all major habitats in all major ecosystems. They are recognizable by their articulated limbs and chitin-based cuticle, often mineralized with calcium compounds. Decapod crustaceans, part of the class Malacostraca, include several well-known groups such as crabs, lobsters, crayfish, shrimp, and prawns (Mente 2008a). The first group of decapods to diverge was the Dendrobranchiata (prawns), which appeared in the Late Ordovician, some 455 million years ago (Wolfe et al. 2019). The remaining group, Pleocyemata, subsequently divided into the crawling/walking Reptantia, which include the lobsters and crabs, and the swimming shrimp groups. High species diversity first arose in the Jurassic and Cretaceous, when modern coral reefs appeared and spread around the world; marine decapods rely on coral reefs as a habitat (Lloyd et al. 2008). Within the order Decapoda, the two suborders Dendrobranchiata and Pleocyemata are distinguished by gill and leg structure and by how the larvae develop. Various prawn species of the suborder Dendrobranchiata are commonly referred to as "shrimp," including the "white shrimp," Litopenaeus setiferus, and the tiger shrimp, Penaeus monodon. Within Pleocyemata, all groups except Stenopodidea and Caridea generally prefer to walk rather than swim, and they make up a clade called Reptantia (Mente 2008b).

Decapods exhibit tremendous diversity due to a number of genetic alterations and innovations that have been favored throughout their evolutionary history, but it is still difficult to link the diversity of phenotypes to the underlying genetic alterations (Thomas et al. 2020). Such innovations could be caused by a variety of genomic mechanisms, but a thorough phylogenetic analysis of the underlying molecular changes has not been done. Whole genome data must be linked to a reliable phylogenetic framework in order to track these transitions at the genomic level. Currently, crayfish serve as the foundation for most scientific knowledge on the ecology and general biology of freshwater decapods (Thorp and Rogers 2011), most likely because they are common and are a favorite food in some regions of the world. Interestingly, decapods, bivalves, and echinoderms rarely develop neoplastic and age-related disorders, even though certain species can live for more than 100 years. Their evolutionary success must have involved genome rearrangements, pointing to undiscovered and advantageous traits that could open new perspectives on a variety of topics. The discovery of the underlying molecular and regulatory systems may inspire fresh concepts for the creation of highly resilient organisms that are commercially advantageous. Beyond their commercial importance and their significance in evolutionary history, crustaceans could also be used as models for neuroscience research. Generally speaking, early neurogenesis in arthropods exhibits significant heterogeneity in the number, types, and morphologies of neurons and glia (Stollewerk 2016). The architecture of the brain is shaped by the final positions and connections of these cells. Therefore, alterations during evolution in the developmental processes that control the production and differentiation of brain cells must have led to the formation of the nervous systems of decapods.

The main objectives of this study are the initial discovery, curation, and comparison of the genes most relevant to understanding neurogenesis in this species relative to other species. We selected a widely available freshwater caridean shrimp for our investigation and established practical methods to keep the animals in the laboratory for extended periods. Caridina pseudogracilirostris is a brackish water shrimp belonging to the family Atyidae and the infraorder Caridea. Its extensive natural range stretches from Kanyakumari to Cochin (Thomas et al. 1973; Soundharapandiyan et al. 2022). It is an algae-eating species that lives in mangroves and marshes. Due to its small size, ease of capture, transparent body, strong tolerance of fluctuations in a variety of environmental conditions, and reproductive capacity, Caridina pseudogracilirostris is a viable experimental animal for studies of crustacean adaptation. The samples for this investigation were gathered from several backwater sampling sites. The backwaters form a distinctive habitat whose freshwater aquatic life differs from that seen in rivers and the sea. The purpose of this work is to characterize the whole genome sequence of this shrimp and compare it with the genomes of related species in order to identify shared genes as well as novel genes that may be useful in other contexts.

In recent decades, sequencing technologies, specifically next-generation sequencing (NGS), have been widely used in scientific research and clinical applications. NGS allows for higher sequencing throughput and lower sequencing costs, and the results can be accurate when the experimental and data analysis methods are developed and optimized. Quality control and data preprocessing are therefore highly important for obtaining high-quality, high-confidence downstream data and for reducing false positives and false negatives (He et al. 2020). At present, there are many software programs for data quality preprocessing. Trimmomatic (Bolger et al. 2014) includes a variety of processing steps for read trimming and filtering, but its main algorithmic innovations relate to the identification of adapter sequences and quality filtering. Selecting an appropriate k-mer size is needed to achieve a good-quality de novo assembly. Chikhi and Medvedev (2014) devised a method to select the best k-mer size for de novo genome assembly. Other methods have also been proposed, such as estimating the number of informative features (for example paths with variations or repeats) for different k-mer sizes from an FM-index over the reads (Simpson 2014), or using an optimal k-mer range for de novo read error correction (Ilie et al. 2011; Schulz et al. 2014). Eukaryotic genomes in particular contain a varying but significant proportion of repeated elements throughout the sequence. These repeats are of interest because they can originate from transposons or viral insertions and can have direct effects on the expression of genes (Muñoz-López and García-Pérez 2010). Repeats nevertheless complicate the analysis of genomic data, a difficulty that can be reduced with the support of dedicated software. After these preprocessing steps, the sequences are subjected to gene prediction and annotation.

Phylogenetic orthology inference for comparative genomics provides a framework for understanding the evolution and diversity of life on Earth and enables the extrapolation of biological knowledge between organisms. Most orthology inference tools infer phylogenetic relationships between gene sequences from pairwise sequence similarity scores obtained from an all-vs-all BLAST (Camacho et al. 2009) search, or from faster alternatives to BLAST such as DIAMOND (Buchfink et al. 2015) or MMseqs2 (Steinegger and Söding 2017). Widely used methods include InParanoid (Östlund et al. 2010), OrthoMCL (Li et al. 2003), OMA (Altenhoff et al. 2011), and OrthoFinder (Emms and Kelly 2015). Each application takes a different approach to interpreting sequence similarity scores, and they produce different outputs. Some find orthogroups, some find paralogs and orthologs, and some do all three. The aim of this work is to understand the phylogenetic relationship of C. pseudogracilirostris with other species of Malacostraca and to find gene clusters in its genome sequence that are specifically involved in the neurogenesis pathway.

2.1 Shrimp Collection and Rearing

Live shrimps were collected from different locations within the sampling site at Rajakkamangalam estuary, Kanyakumari, Tamilnadu, India (8°07'17.9"N, 77°22'19.3"E). The study samples were collected by the convenience sampling method. Fresh shrimps were euthanized using tricaine (MS-222) and fixed in 4% paraformaldehyde. Shrimp identification was carried out based on the morphological features described earlier by Thomas et al. (1973). The Minimum Information about any (x) Sequence (MIxS) data is presented in Table 1.

Table 1

MIxS mandatory information for samples.
Item	Definition
Investigation type	Eukaryote
Project name	NGS Whole genome sequencing
Organism	Caridina pseudogracilirostris
Classification	Animalia (Kingdom) Arthropoda (Phylum) Crustacea (Subphylum) Malacostraca (Class) Eucarida (Superorder) Decapoda (Order) Pleocyemata (Suborder) Caridea (Infraorder) Atyidae (Family) Caridina (Genus) Caridina pseudogracilirostris (Species)
Submitted_to SRA database	Bioproject: PRJNA847710 BioSample: SAMN28951408 : NGS_1004 (TaxID: 1042303)
Tissue type	Whole shrimp
Geographic location (country and/or sea, region)	Rajakkamangalam estuary, Kanyakumari district. India
Environment	Brackish water
Collection date	20-07-2019
Sequencing technology	Illumina Novaseq6000
Assembly	ABySS de novo assembler
Annotation source	BLASTx, KEGG

Further, a purposive sampling method was followed to collect Caridina and to avoid other types of shrimps. Samples were collected using a long aquarium net (30.5 cm x 30.5 cm) and packed immediately in large polyethylene bags containing fresh water. Packed samples were brought to the lab within 20 hours and transferred to laboratory tanks at the Aquaculture Facility (Aquaneering USA). The handling of and experimentation on animals were carried out in accordance with the ARRIVE guidelines and followed the U.K. Animals (Scientific Procedures) Act, 1986, and associated guidelines.

2.2 DNA extraction

The sample for genome sequencing was collected from an adult C. pseudogracilirostris maintained in the Aquatic Physiology facility of Sathyabama Institute of Science and Technology, Chennai. The animal was acclimated in tap water aquaria at 26 ± 2°C for one month. The live shrimp was euthanized in MS-222 and dissected immediately under ice-cold conditions, and the tissue was transferred immediately into 100% ethanol. DNA isolation was carried out using the Qiagen DNeasy Blood & Tissue DNA isolation kit as per the manufacturer's instructions. DNA quantity was measured with a Nanodrop (Thermo, USA) and confirmed by 1% agarose gel electrophoresis.

2.3 Library preparation and genome sequencing

Genomic DNA for the TruSeq Nano DNA protocol was quantified on the TapeStation (Agilent) and diluted. The genomic DNA was sheared using an S2 Ultrasonicator with the settings provided in the TruSeq DNA protocol. Library preparation was carried out according to the manufacturer’s instructions. Adaptor enrichment was performed with multiple cycles of PCR according to the manufacturer’s instructions.

2.4.1 Technical validation

The raw reads were quality filtered using the flexible read-trimming tool Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic) version 0.39 (Bolger et al. 2014). It uses a pipeline-based architecture, allowing individual “steps” (adapter removal, quality filtering, and so on) to be applied to each read or read pair in the order specified by the user. Each step can work on the reads in isolation or on the combined pair, as appropriate. The tool tracks read pairing and stores “paired” and “single” reads separately. The raw sequences were also checked for adapter contamination and poor-quality bases with FastQC v0.11.8 (Simon Andrews 2010).
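The step-wise design described above can be illustrated with a minimal Python sketch. This is not Trimmomatic's code: the function names (`trailing_trim`, `min_length`, `process_pairs`) and the toy reads are hypothetical, and the two steps are only loosely modeled on Trimmomatic's TRAILING and MINLEN steps. The sketch shows how steps are applied in a user-defined order and how a pair whose mate is dropped ends up in the "single" output.

```python
from typing import Callable, List, Optional, Tuple

Read = Tuple[str, List[int]]             # (bases, per-base Phred scores)
Step = Callable[[Read], Optional[Read]]  # a step returns None to drop the read

def trailing_trim(min_quality: int) -> Step:
    """Remove bases from the 3' end while their quality is below min_quality."""
    def step(read: Read) -> Optional[Read]:
        seq, qual = read
        end = len(seq)
        while end > 0 and qual[end - 1] < min_quality:
            end -= 1
        return seq[:end], qual[:end]
    return step

def min_length(n: int) -> Step:
    """Drop reads shorter than n bases."""
    def step(read: Read) -> Optional[Read]:
        return read if len(read[0]) >= n else None
    return step

def run_steps(read: Read, steps: List[Step]) -> Optional[Read]:
    """Apply the steps in the order given; stop as soon as the read is dropped."""
    current: Optional[Read] = read
    for step in steps:
        if current is None:
            return None
        current = step(current)
    return current

def process_pairs(pairs, steps):
    """Return (paired, single): pairs where both mates survive, and orphaned mates."""
    paired, single = [], []
    for fwd, rev in pairs:
        f, r = run_steps(fwd, steps), run_steps(rev, steps)
        if f is not None and r is not None:
            paired.append((f, r))
        else:
            single.extend(x for x in (f, r) if x is not None)
    return paired, single

steps = [trailing_trim(30), min_length(4)]
pairs = [
    (("ACGTACGT", [35] * 8), ("TTGCAAGC", [35] * 8)),
    (("ACGTACGT", [35] * 8), ("TTGCAAGC", [35, 35] + [10] * 6)),
]
paired, single = process_pairs(pairs, steps)
print(len(paired), "pair kept;", len(single), "read kept as single")
```

In the second toy pair the reverse read is trimmed to two bases and then dropped by the length filter, so its forward mate is stored as a single read.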

2.4.2 Kmer size selection and Genome size estimation

The best k-mer length was estimated using KmerGenie v. 1.7051 (http://kmergenie.bx.psu.edu/). The best k-mer length identified by this algorithm was used for genome size estimation with Jellyfish v.2.3.0 (https://github.com/gmarcais/Jellyfish) and GenomeScope v. 1.0 (http://qb.cshl.edu/genomescope/).
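The idea behind k-mer based genome size estimation can be sketched in a few lines of Python. The sketch is a simplification under stated assumptions: Jellyfish counts k-mers far more efficiently, and GenomeScope fits a statistical model to the k-mer histogram rather than using the single-peak shortcut shown here. The function names and the toy histogram are hypothetical.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_histogram(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())

def estimate_genome_size(hist, min_depth=2):
    """Total k-mer occurrences divided by the depth of the main peak.

    K-mers seen fewer than min_depth times are treated as likely
    sequencing errors and ignored (an assumption of this sketch).
    """
    usable = {depth: n for depth, n in hist.items() if depth >= min_depth}
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers // peak_depth

# Toy histogram: an error peak at depth 1 and a main peak at depth 10.
toy_hist = {1: 50, 9: 10, 10: 100, 11: 10}
print(estimate_genome_size(toy_hist))  # (90 + 1000 + 110) // 10 = 120
```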

2.4 De novo assembly and annotation

De novo genome assembly of the shrimp was performed from the Illumina reads using ABySS v.2.1.5 (https://www.bcgsc.ca/resources/software/abyss) (Simpson et al. 2009), a de Bruijn graph-based, single-k de novo assembler intended for short paired-end reads and large eukaryotic genomes. The raw Illumina reads were quality filtered using stringent filtering criteria (Phred score > 30) and used for best k-mer prediction and k-mer-based genome size estimation. The reads were assembled into contigs and scaffolds with ABySS using the best k-mer selected. The scaffolds were subjected to gap filling and error correction. The resulting scaffold-level genome was used for repeat masking and genome annotation, and the predicted genes were assigned putative functions. The complete workflow for the genome assembly and annotation is presented in Fig. 1.
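The de Bruijn graph approach mentioned above can be illustrated with a toy Python sketch. It is not ABySS: it ignores reverse complements, sequencing errors, paired-end information and scaffolding, and the function names are hypothetical. Each distinct k-mer becomes an edge between its (k-1)-mer prefix and suffix, and a contig is obtained by walking along nodes that have exactly one way in and one way out.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Build a de Bruijn graph whose edges are the distinct k-mers of the reads."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))  # prefix -> suffix
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    for src, dst in sorted(edges):
        out_edges[src].append(dst)
        in_degree[dst] += 1
    return out_edges, in_degree

def extend_contig(out_edges, in_degree, start):
    """Follow the unambiguous path from start, appending one base per step."""
    contig, node = start, start
    visited = {start}
    while len(out_edges.get(node, [])) == 1:
        nxt = out_edges[node][0]
        if in_degree[nxt] != 1 or nxt in visited:
            break  # branch point or cycle: stop the contig here
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig

reads = ["ACGTT", "CGTTG", "GTTGA"]   # overlapping toy reads
out_edges, in_degree = build_de_bruijn(reads, k=4)
print(extend_contig(out_edges, in_degree, "ACG"))  # ACGTTGA
```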

2.4.3 Genome evaluation

Genome assembly and annotation completeness were assessed quantitatively, based on evolutionarily informed expectations of gene content, using the open-source software BUSCO (Benchmarking Universal Single-Copy Orthologs) (Simão et al. 2015).

2.4.4 Repeat masking and Repeat annotation

Whole genome sequences can contain a high proportion of repetitive elements (∼80%) and show high heterozygosity in the raw genome (Yu et al. 2015). Repeats should therefore be identified and masked before proceeding to genome annotation. Transposable elements (TEs) in particular were identified using RepeatModeler V 1.08 by creating a de novo repeat library, and the genome was then masked and annotated using RepeatMasker V 4.1.1. RepeatModeler is a de novo TE family identification and modelling package. At its heart are three de novo repeat-finding programs (RECON, RepeatScout and LTRharvest/LTR_retriever) that employ complementary computational methods to identify repeat element boundaries and family relationships from the provided sequence data. RepeatModeler assists in automating the runs of the various algorithms on a given genomic database, clustering redundant results, refining and classifying the families, and producing a high-quality library of TE families suitable for use with RepeatMasker. RepeatMasker screens DNA sequences for interspersed repeats and low-complexity DNA sequences (A.F.A. Smit et al.) by aligning the input genome sequence against a library of known repeats (Tarailo-Graovac and Chen 2009). The output of the program is a detailed annotation of the repeats present in the query sequence, as well as a modified version of the query sequence in which all the annotated repeats have been masked. Sequence comparisons in RepeatMasker are performed by the program cross_match, an efficient implementation of the Smith-Waterman-Gotoh algorithm.
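The masking step itself is simple once repeat coordinates are known. The following Python sketch illustrates it under stated assumptions: the interval format (0-based, end-exclusive) and the function names are hypothetical and do not reproduce RepeatMasker's output files. Hard masking replaces repeat bases with N, whereas soft masking converts them to lowercase so that the sequence is retained.

```python
def mask_sequence(seq, repeats, soft=False):
    """Mask annotated repeat intervals in a sequence.

    repeats: iterable of (start, end) pairs, 0-based and end-exclusive.
    soft=False replaces bases with 'N'; soft=True lowercases them.
    """
    chars = list(seq)
    for start, end in repeats:
        for i in range(max(0, start), min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)

def masked_fraction(masked_seq):
    """Fraction of positions that are masked (N or lowercase)."""
    masked = sum(1 for c in masked_seq if c == "N" or c.islower())
    return masked / len(masked_seq)

genome = "ACGTACGTAC"
repeat_intervals = [(2, 5)]                      # toy annotation
hard = mask_sequence(genome, repeat_intervals)   # "ACNNNCGTAC"
soft = mask_sequence(genome, repeat_intervals, soft=True)  # "ACgtaCGTAC"
print(hard, soft, masked_fraction(hard))
```

Gene predictors that accept soft-masked input can still read through masked regions, which is one reason soft masking is often preferred before annotation.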

2.4.5 Gene Prediction

The first step of genome annotation is to find all the genes in a given genomic sequence. The genes were predicted ab initio from the repeat-masked genome using Augustus v.2.5.5 (http://augustus.gobics.de/). AUGUSTUS is based on a generalized hidden Markov model (GHMM) that defines probability distributions for the various sections of genomic sequences. Introns, exons, intergenic regions, etc. correspond to states in the model, and each state emits DNA sequences according to pre-defined emission probabilities (Stanke and Morgenstern 2005). Augustus was used to create de novo gene models from the scaffolds, which were later confirmed by BLAST searches.
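How a hidden Markov model assigns a state to each base can be shown with a toy Viterbi decoder in Python. This is a deliberately reduced illustration, not AUGUSTUS: it is a plain two-state HMM rather than a GHMM with explicit state durations, it has no intron or splice-site states, and every probability below is invented for the example.

```python
import math

STATES = ("intergenic", "exon")
START = {"intergenic": 0.5, "exon": 0.5}
TRANS = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1},
    "exon": {"intergenic": 0.1, "exon": 0.9},
}
# Invented emission probabilities: AT-rich intergenic, GC-rich exon.
EMIT = {
    "intergenic": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "exon": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    backpointers = []
    for base in seq[1:]:
        prev = scores[-1]
        current, pointer = {}, {}
        for s in STATES:
            # Best predecessor state for ending in state s at this position.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            current[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
            pointer[s] = best
        scores.append(current)
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]

path = viterbi("ATATATGCGCGCATATAT")
print(path)
```

For this toy sequence the decoder labels the central GC-rich block as "exon" and both AT-rich flanks as "intergenic".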

2.4.6 Gene Annotation

The functions of the predicted genes were annotated using BLASTx, developed by the National Center for Biotechnology Information (NCBI), against the previously characterized non-redundant protein database. BLASTx compares a nucleotide query sequence, translated in all six reading frames (both strands), against a protein sequence database.
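The six-frame translation that precedes the protein-level search can be sketched in Python as follows. The sketch assumes the standard genetic code and unambiguous A/C/G/T input; the function names and frame labels are hypothetical, and the alignment and scoring that BLASTx performs afterwards are not shown.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )

def six_frame_translation(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames

frames = six_frame_translation("ATGGCCTAA")
print(frames)
```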

2.5 Functional annotation and pathway analysis.

EggNOG-mapper is a tool for fast functional annotation of novel sequences (genes or proteins) using precomputed eggNOG-based orthology assignments. The GO and KO annotations were retrieved, and the genes were sorted using the PANTHER database. Based on the PANTHER analysis, the genes involved in developmental process, cellular process, immune system process, and reproductive process were further analyzed using Pathway Commons.

2.6. Comparative genomics and phylogenetic studies

This analysis was carried out using OrthoFinder (https://github.com/davidemms/OrthoFinder), which provides highly accurate orthogroup inference together with phylogenetic inference of orthologs, rooted gene trees, gene duplication events, the rooted species tree, and comparative genomics statistics. The protein sequences of Armadillidium nasatum, Armadillidium vulgare, Cherax quadricarinatus, Eriocheir sinensis, Gammarus roeselii, Homarus americanus, Hyalella azteca, Macrobrachium nipponense, Penaeus indicus, Penaeus japonicus, Penaeus monodon, Penaeus vannamei, Portunus trituberculatus, Procambarus clarkii and Trinorchestia longiramus, along with those of Caridina pseudogracilirostris, were used for the phylogenetic and orthology studies.

3.1 Library preparation and genome sequencing:

The DNA was isolated from the tissue sample, and its concentration was quantified as 116.724 ng/µl. The sample passed the NGS library QC, carried out on the TapeStation (TruSeq Nano DNA − 350), at a concentration of 65.13 ng/µl. Library preparation was carried out using the Illumina TruSeq Nano DNA Library (350 bp PE insert), and the library was sequenced on the NovaSeq 6000 (2x150 bp paired-end read length). In total, 160 GB of data were produced during sequencing.

3.2 De novo sequencing analysis

The total coverage of the raw whole-genome data was estimated to be 56x. The total number of Illumina raw sequencing reads was 489387876, and the raw reads comprised 73897569276 base pairs. The GC content was 38.01 percent and the AT content 61.99 percent. The Phred quality scores of the raw reads were 94.62 percent for Q20 and 88.13 percent for Q30. The raw reads were quality filtered using Trimmomatic (Bolger et al. 2014), a flexible read-trimming tool, and the sequences were checked for adapter contamination and poor-quality bases with FastQC (Simon Andrews 2010), a Java-based QC tool providing per-base and per-read quality profiling. Trimmomatic removes the Illumina sequencing adapters and low-quality sequences, so that sequences with a Phred score > 30 are retained. The total and quality-filtered reads, with their length and the percentage of reads above Q30, are given in Table 2. The best k-mer length was estimated using KmerGenie, and the best k-mer identified by this algorithm was used for genome size estimation with Jellyfish and GenomeScope (Table 3, Fig. 2). GenomeScope2 model fitting of the k-mer distribution estimated a genome size of 1.31 Gbp, which is smaller than the values previously reported by Swathi et al. (2018) and Kawato et al. (2021), and a low heterozygosity of about 0.81% (Fig. 2), whereas the unique sequences amounted to 77.9%, or about 1.02 Gbp (Table 3). The best k-mer selected was used for de novo genome assembly with ABySS v.2.1.5 (Table 2).
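The Q20 and Q30 thresholds quoted above follow from the definition of the Phred score, Q = -10 log10(p), where p is the probability that a base call is wrong. The short Python sketch below shows the conversion and how a FASTQ quality string is decoded; the Phred+33 ASCII offset is an assumption (it is the usual encoding for recent Illumina data), and the function names are hypothetical.

```python
import math

def phred_to_error(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred score corresponding to error probability p."""
    return -10 * math.log10(p)

def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]

def fraction_at_least(scores, threshold):
    """Fraction of bases whose score is at or above the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)

scores = decode_quality("I?5")   # Phred scores 40, 30 and 20
print(scores, phred_to_error(30), fraction_at_least(scores, 30))
```

A Q30 base therefore has an expected error rate of about 1 in 1000, and a Q20 base about 1 in 100.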

Table 2

Summary of Kmer analysis and genome size estimation
SI.No	Particulars	Number
1	Estimated K (best)	69
2	Estimated Heterozygosity	0.81%
3	Estimated Genome Haploid Length	1,308,850,805 bp
4	Estimated Genome Repeat Length	288,637,448 bp
5	Estimated Genome Unique Length	1,020,213,357 bp
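The sequencing totals in the text and the entries of Table 2 can be cross-checked with simple arithmetic. The Python sketch below only uses numbers quoted in this section; taking coverage as total raw bases divided by the estimated haploid genome length is an assumption about how the 56x figure was obtained.

```python
# Values quoted in the text and in Table 2.
raw_reads = 489_387_876
raw_bases = 73_897_569_276
haploid_length = 1_308_850_805   # estimated haploid genome length (bp)
repeat_length = 288_637_448      # estimated repeat length (bp)
unique_length = 1_020_213_357    # estimated unique length (bp)

mean_read_length = raw_bases / raw_reads
coverage = raw_bases / haploid_length
unique_fraction = unique_length / haploid_length

print(f"mean read length: {mean_read_length:.1f} bases")
print(f"raw coverage:     {coverage:.1f}x")
print(f"unique fraction:  {100 * unique_fraction:.1f}%")
print("repeat + unique equals haploid length:",
      repeat_length + unique_length == haploid_length)
```

The coverage works out to roughly 56x and the unique fraction to 77.9%, in line with the reported values. The mean read length comes out at 151 bases, one more than the nominal 150 bp; the text does not explain the difference.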

BUSCO scores of the final assembly indicate a fairly complete genome assembly, with 52.8% complete BUSCOs and about 17.7% fragmented BUSCOs. Similarly, the slight difference observed between the assembled genome size and the initial size estimate is unlikely to be a consequence of erroneous assembly duplication, as the duplicated BUSCO score is 2.0%. The quality of the genome assembly is further supported by the Jellyfish and GenomeScope analysis; overall, these statistics indicate that the C. pseudogracilirostris draft genome assembly presented here is a fairly complete, non-redundant, and useful resource for various applications (Fig. 2).
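The remaining BUSCO categories follow from the three percentages given above. The sketch assumes the usual BUSCO convention that complete BUSCOs are the sum of single-copy and duplicated BUSCOs, and that complete, fragmented and missing BUSCOs add up to 100%; the missing and single-copy values are derived here and are not reported in the text.

```python
complete = 52.8     # % complete BUSCOs (reported)
fragmented = 17.7   # % fragmented BUSCOs (reported)
duplicated = 2.0    # % duplicated BUSCOs (reported)

single_copy = complete - duplicated        # complete = single-copy + duplicated
missing = 100.0 - complete - fragmented    # remainder not found in the assembly

print(f"single-copy: {single_copy:.1f}%  missing: {missing:.1f}%")
```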

3.2.1 Repeat masking and Repeat annotation

The repeats in the assembled genome were masked and annotated using a de novo repeat library and RepeatMasker (Table 3). Notably, based on the percentage of bases masked with the de novo library built by RepeatModeler, repetitive elements made up 24.60 percent (247 Mb) of the assembled genome. Simple sequence repeats (SSRs), which made up a significant fraction of the genome (7.26 percent), were one of its distinguishing characteristics. Retroelements (3.19 percent), LINEs (2.37 percent), and L2/CR1/Rex (1.05 percent) are three other significant repeat classes. Other minor repeat families found in the C. pseudogracilirostris genome assembly included low complexity regions (0.97 percent), LTR elements (0.82 percent), Gypsy/DIRS1 (0.79 percent), DNA transposons (0.65 percent), Penelope (0.48 percent), small RNA (0.45 percent), hobo-Activator (0.28 percent), RTE/Bov-B (0.27 percent), R1/LOA/Jockey (0.24 percent), and Tc1-IS630-Pogo (0.20 percent).

Table 3

Summary repeats identified from Shrimp genome
Elements	Number of elements*	length occupied	percentage of sequence
Retroelements	119335	31987404 bp	3.19%
SINEs:	0	0 bp	0.00%
Penelope	20195	4832234 bp	0.48%
LINEs:	99932	23743059 bp	2.37%
CRE/SLACS	3264	651164 bp	0.06%
L2/CR1/Rex	43007	10558906 bp	1.05%
R1/LOA/Jockey	8571	2391652 bp	0.24%
R2/R4/NeSL	0	0 bp	0.00%
RTE/Bov-B	14949	2720789 bp	0.27%
L1/CIN4	1227	61535 bp	0.01%
LTR elements:	19403	8244345 bp	0.82%
BEL/Pao	126	37428 bp	0.00%
Ty1/Copia	0	0 bp	0.00%
Gypsy/DIRS1	17150	7908568 bp	0.79%
Retroviral	1009	203099 bp	0.02%
DNA transposons	27558	6549101 bp	0.65%
hobo-Activator	10032	2782116 bp	0.28%
Tc1-IS630-Pogo	8197	1985750 bp	0.20%
En-Spm	0	0 bp	0.00%
MuDR-IS905	0	0 bp	0.00%
PiggyBac	131	34120 bp	0.00%
Tourist/Harbinger	1404	306575 bp	0.03%
Other (Mirage, P-element, Transib)	243	69465 bp	0.01%
Rolling-circles	7479	1076387 bp	0.11%
Unclassified:	886502	120058713 bp	11.96%
Total interspersed repeats:		158595218 bp	15.80%
Small RNA:	29150	4517743 bp	0.45%
Satellites:	74	14407 bp	0.00%
Simple repeats:	1148445	72870646 bp	7.26%
Low complexity:	130707	9779534 bp	0.97%
* most repeats fragmented by insertions or deletions have been counted as one element
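The totals in Table 3 can be checked against its individual rows. The sketch below assumes that the repeat categories do not overlap, so that their lengths can simply be added; the grouping of rows into the interspersed total is inferred from the numbers themselves.

```python
# Lengths in bp, taken from Table 3.
retroelements = 31_987_404
dna_transposons = 6_549_101
unclassified = 120_058_713
total_interspersed = 158_595_218

other_classes = {
    "rolling_circles": 1_076_387,
    "small_rna": 4_517_743,
    "satellites": 14_407,
    "simple_repeats": 72_870_646,
    "low_complexity": 9_779_534,
}

interspersed_sum = retroelements + dna_transposons + unclassified
all_repeats = total_interspersed + sum(other_classes.values())

print("interspersed rows add up to the reported total:",
      interspersed_sum == total_interspersed)
print(f"all repeat classes: {all_repeats / 1e6:.1f} Mb")
```

Under this assumption the repeat classes add up to about 247 Mb, matching the figure quoted in the text.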

The genomes of P. chinensis and P. vannamei have a higher proportion of DNA transposons and low complexity repeats than other shrimp. In general, large-scale DNA editing of retrotransposons, by simultaneously generating large numbers of mutations, may have accelerated their exaptation during mammalian evolution (Carmi et al. 2011). Similarly, inverted SINE repeats that are part of longer RNAs promote RNA editing by adenosine-to-inosine deamination, creating potential novelties in both coding and regulatory sequences (Daniel et al. 2014). The role of SSRs in adaptive evolution was recently demonstrated for shrimp (Yuan et al. 2021). Transposable elements (TEs) were once considered to make up only a negligible part of eukaryotic genomes, although it was known long before genome sequencing began that repeated sequences account for a major proportion of genomes (Britten and Kohne 1968). The proportion of SSRs reported here for C. pseudogracilirostris is lower than that of the P. indicus genome (49.31%). It is now recognized that the proportion of TEs in the genome differs widely between organisms, ranging from 3% in yeast (Carr et al. 2012) to more than 80% of the genome in maize (Meyers et al. 2001). The human genome in particular is rich in repetitive sequences, at about 45% (International Human Genome Sequencing Consortium 2001).

3.2.2 Gene Prediction and Gene Annotations

The genes were predicted ab initio from the repeat-masked genome using Augustus v.2.5.5 (Table S1). A total of 14101 genes were identified, with a total gene length of 63100629 bp. Putative functions were assigned to the predicted genes using BLASTx against the non-redundant protein database (Table 4).

Table 4

Summary of BLASTX annotation
SI.No	Particulars	Number
1	Total number of genes subjected to BLASTx	14101
2	Total number of genes assigned with function	12345
3	Total number of genes with no similarity	1756

3.3 Functional annotation and pathway analysis using EggNOG mapper, Panther db and pathway common analysis

About 9395 transcripts (66.62 percent of the total sequences) had hits in the EggNOG database, of which 5231 (55.67 percent), 6476 (68.93 percent), and 3867 (41.16 percent) were associated with GO terms, KEGG orthology functional annotations, and gene symbols, respectively. Mapping the genes against the PANTHER database produced a list of 19 categories under Biological process: 1. Developmental process (GO:0032502), 2. Multicellular organismal process (GO:0032501), 3. Cellular process (GO:0009987), 4. Reproduction (GO:0000003), 5. Localization (GO:0051179), 6. Reproductive process (GO:0022414), 7. Biological adhesion (GO:0022610), 8. Immune system process (GO:0002376), 9. Biological regulation (GO:0065007), 10. Growth (GO:0040007), 11. Signalling (GO:0023052), 12. Metabolic process (GO:0008152), 13. Biological process involved in interspecies interaction between organisms (GO:0044419), 14. Pigmentation (GO:0043473), 15. Response to stimulus (GO:0050896), 16. Biological phase (GO:0044848), 17. Behaviour (GO:0007610), 18. Rhythmic process (GO:0048511), 19. Locomotion (GO:0040011). The genes under four major processes (developmental (31), cellular (30), immune system (20) and reproductive (24)) were clustered by function using Pathway Commons and are summarised in supplementary Tables S2, S3, S4 and S5.

A total of 31 genes mapped under Developmental process (actS, APC, APC2, CNTN5, DAG1, DDR2, DLX2, eda, FLI1, FRK, GAL3ST1, GBX, HAND2, HCK, HCRTR2, ILK, LHX1, MMP14, mrdB, mreD, NR1A2, NR1H4, NR4A2, PTK7, rodA, RTN1, RXRA, TCP1, THRB, WDR77, ZEB1) clustered into the pathways of SUMOylation of intracellular receptors, Nuclear Receptor transcription pathway and Thyroid hormone mediated signaling pathway (Table S2).

The 30 genes mapped under Cellular process (ABHD12, ACLY, APBB2, carA, CETN1, DDX46, DSC1, GDF2, HCRTR2, HIP1, lysS, MTHFS, MYC, NCAPD3, NSF, NUP54, P4HB, POLE, PPRC1, PRP39, PRPF39, RICTOR, SARDH, SIX4, SLC2A2, TPR, TPS, TRPC4AP, valS, XPOT) clustered into the pathway of Cellular modified amino acid catabolic process (Table S3).

The 20 genes mapped under Immune system process (ABHD12, ACLY, APBB2, carA, CETN1, DDX46, DSC1, GDF2, HCRTR2, HIP1, lysS, MTHFS, MYC, NCAPD3, NSF, NUP54, P4HB, POLE, PPRC1, PRP39, PRPF39, RICTOR, SARDH, SIX4, SLC2A2, TPR, TPS, TRPC4AP, valS, XPOT) clustered into the pathways of CLEC7A/inflammasome pathway, DEx/H-box helicases activate type I IFN and inflammatory cytokines production, Toll-like Receptor Cascades, Cellular response to mechanical stimulus and Response to lipopolysaccharide (Table S4).

The 24 genes mapped under Reproductive process (DDX3X, DMRT2, DMRTA, ERCC4, HUS1, lhr, LIS1, MCD1, MPS1, NANOS1, NCAPD3, OXTR, PAFAH1B1, ppk1, RAD50, RAD51C, RFA1, RPA1, SCC1, TDRD12, TDRD5, ttk, TUBGCP5, WDR77) clustered into the pathways of HDR through Homologous Recombination, Homologous DNA Pairing and Strand Exchange, Presynaptic phase of homologous DNA pairing and strand exchange, Processing of DNA double-strand break ends, HDR through Single Strand Annealing, Homology Directed Repair, HDR through Homologous Recombination (HRR) or Single Strand Annealing (SSA), DNA Double-Strand Break Repair, G2/M DNA damage checkpoint, Double-strand break repair via homologous recombination, Recombination repair, Nucleotide-excision repair, Meiotic cell cycle, Meiotic nuclear division, Reciprocal meiotic recombination, Meiotic cell cycle process, Homologous recombination, Meiotic cell cycle checkpoint signalling, Meiotic recombination, Oocyte construction, Oocyte axis specification, Telomere organization, Telomere maintenance, Negative regulation of telomere capping, and Regulation of TP53 Activity through Phosphorylation (Tables S5.1, S5.2, S5.3, S5.4 and S5.5).

3.4 Species conservation and phylogenetic analysis

Protein sequences deposited in the NCBI database were retrieved for all Malacostraca species with a sufficient number of sequences. A total of 15 Malacostraca species (Armadillidium nasatum, Armadillidium vulgare, Cherax quadricarinatus, Eriocheir sinensis, Gammarus roeselii, Homarus americanus, Hyalella azteca, Macrobrachium nipponense, Penaeus indicus, Penaeus japonicus, Penaeus monodon, Penaeus vannamei, Portunus trituberculatus, Procambarus clarkii and Trinorchestia longiramus) were analysed along with C. pseudogracilirostris, and 281,390 genes were assigned to orthogroups representing conserved genes or gene families across these animals (Fig. 3). Of the 317,043 genes in total, 88.8% were included in orthogroups, and 35,653 genes (11.2%) remained unassigned. A total of 21685 orthogroups were identified across the species in this study. There were 5294 species-specific orthogroups, containing 26377 genes (Table 5). Further investigation showed that 7396 orthogroups and a total of 11619 genes were involved in the orthogroups of C. pseudogracilirostris. Genes in C. pseudogracilirostris orthogroups made up approximately 3.7 percent of all genes, compared with 8.3 percent for genes in species-specific orthogroups (Table 5). Macrobrachium nipponense was the species most closely related to Caridina pseudogracilirostris in the phylogenetic analysis of all 16 malacostracan species. The number of duplicated genes was 1856 in C. pseudogracilirostris, compared with around 39322 in Homarus americanus and approximately 91 in Cherax quadricarinatus.

Table 5

Orthologous gene analysis among 16 species of Malacostraca
Overall Statistics
Number of species	16
Number of genes	317043
Number of genes in orthogroups	281390
Number of unassigned genes	35653
Percentage of genes in orthogroups	88.8
Percentage of unassigned genes	11.2
Number of orthogroups	21685
Number of species-specific orthogroups	5294
Number of genes in species-specific orthogroups	26377
Number of orthogroups containing C. pseudogracilirostris genes	7396
Number of C. pseudogracilirostris genes in orthogroups	11619
Percentage of genes in species-specific orthogroups	8.3
Percentage of genes in C. pseudogracilirostris orthogroups	3.7
Mean orthogroup size	13
Median orthogroup size	9
G50 (assigned genes)	20
G50 (all genes)	18
O50 (assigned genes)	3524
O50 (all genes)	4453
Number of orthogroups with all species present	1
Number of single-copy orthogroups	0
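Table 5 reports G50 and O50 without defining them. As we understand the OrthoFinder documentation, G50 is the orthogroup size such that 50% of genes are in orthogroups of that size or larger, and O50 is the smallest number of orthogroups needed to reach that 50% (analogous to N50 and L50 for assemblies). The sketch below illustrates this on toy data; the function name `g50_o50` and the example sizes are hypothetical and are not taken from this study.

```python
def g50_o50(orthogroup_sizes):
    """Return (G50, O50) for a list of orthogroup sizes (genes per orthogroup).

    Orthogroups are visited from largest to smallest; the walk stops at the
    first orthogroup where the running gene count reaches half of all genes.
    """
    sizes = sorted(orthogroup_sizes, reverse=True)
    half = sum(sizes) / 2
    running = 0
    for count, size in enumerate(sizes, start=1):
        running += size
        if running >= half:
            # size -> G50, count -> O50
            return size, count
    raise ValueError("orthogroup_sizes must not be empty")


# Toy example (hypothetical sizes): 26 genes in total, half is 13.
# The two largest orthogroups (10 + 6 = 16 genes) already cover half.
toy_sizes = [10, 6, 4, 3, 2, 1]
g50, o50 = g50_o50(toy_sizes)
print(g50, o50)
```

Under this reading, the "assigned genes" rows of Table 5 would be computed from orthogroup sizes only, while the "all genes" rows would presumably also count each unassigned gene as a group of size one; that detail is an assumption here.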

Table 6

Functions of mapped genes in neurogenesis pathway.
Function	Genes involved in performing the function
Regulation of synapse structure or activity	APP, DAG1, EPHB1, GRIN1, NEDD4, NRCAM, PAFAH1B1, ROBO2, SLIT1, and TIAM1
Regulation of synapse organization	APP, DAG1, EPHB1, GRIN1, NEDD4, NRCAM, PAFAH1B1, ROBO2, SLIT1, and TIAM1
Telencephalon development	CNTN2, GRIN1, PAFAH1B1, ROBO1, ROBO2, RYK, SLIT1, SLIT2, and SMO
Telencephalon cell migration	CNTN2, PAFAH1B1, ROBO1, SLIT1, and SLIT2
Forebrain cell migration	CNTN2, PAFAH1B1, ROBO1, SLIT1, and SLIT2
Forebrain development	APP, CNTN2, GRIN1, NOTCH1, NR4A2, PAFAH1B1, ROBO1, ROBO2, RYK, SLIT1, SLIT2, SMO, and WNT4
Regulation of axonogenesis	CNTN2, GRIN1, L1CAM, NRCAM, PAFAH1B1, PAK1, ROBO1, ROBO2, RYK, SLIT2, TIAM1, and ULK2
Axon extension	L1CAM, NRCAM, PAFAH1B1, PAK1, RYK, SLIT1, SLIT2, SLIT3, and ULK2
Neuron projection extension	L1CAM, NRCAM, PAFAH1B1, PAK1, RYK, SLIT1, SLIT2, SLIT3, TIAM1, and ULK2
Positive regulation of axonogenesis	L1CAM, PAFAH1B1, PAK1, ROBO1, ROBO2, SLIT2, and TIAM1
Central nervous system projection neuron	EPHB1, NR4A2, PAFAH1B1, and SLIT2
Central nervous system neuron development	CNTN2, EPHB1, NR4A2, PAFAH1B1, ROBO1, ROBO2, and SLIT2
Central nervous system neuron axonogenesis	EPHB1, NR4A2, PAFAH1B1, and SLIT2
Central nervous system neuron differentiation	CNTN2, EPHB1, NR4A2, PAFAH1B1, ROBO1, ROBO2, RYK, SLIT2, and SMO
Olfactory lobe development	ROBO1, ROBO2, SLIT1, and SLIT2
Olfactory bulb interneuron differentiation	ROBO1, ROBO2, and SLIT2
Olfactory bulb development	ROBO1, ROBO2, SLIT1, and SLIT2
Brain morphogenesis	PAFAH1B1, SLIT1, and SMO
Axon choice point recognition	APP, ROBO1, ROBO2, and ROBO3
Regulation of commissural axon pathfinding by SLIT and ROBO	ROBO1, ROBO2, SLIT1, SLIT2, and SLIT3
Activation of RAC1	PAK1, RAC1, ROBO1, and SLIT2
Netrin-1 signaling	PAK1, RAC1, ROBO1, SLIT1, SLIT2, and SLIT3
Retinal ganglion cell axon guidance	EPHB1, NRCAM, ROBO2, SLIT1, and SLIT2
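Because many genes in Table 6 recur across functions, it can help to invert the table and count how many functions each gene is listed under. The sketch below does this for four rows copied from Table 6; it is only an illustration of the bookkeeping, and the variable names are hypothetical.

```python
from collections import defaultdict

# Four rows copied from Table 6 (function -> genes); the remaining rows
# could be added in the same way.
function_to_genes = {
    "Telencephalon cell migration": ["CNTN2", "PAFAH1B1", "ROBO1", "SLIT1", "SLIT2"],
    "Olfactory bulb interneuron differentiation": ["ROBO1", "ROBO2", "SLIT2"],
    "Brain morphogenesis": ["PAFAH1B1", "SLIT1", "SMO"],
    "Activation of RAC1": ["PAK1", "RAC1", "ROBO1", "SLIT2"],
}

# Invert the mapping: gene -> list of functions it is listed under.
gene_to_functions = defaultdict(list)
for function, genes in function_to_genes.items():
    for gene in genes:
        gene_to_functions[gene].append(function)

# Rank genes by number of functions (descending), then alphabetically.
ranked = sorted(gene_to_functions.items(), key=lambda kv: (-len(kv[1]), kv[0]))
for gene, functions in ranked:
    print(gene, len(functions))
```

In this four-row subset, ROBO1 and SLIT2 each appear under three functions; the ranking for the full table would need all 23 rows.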

Next-generation sequencing (NGS) technologies and genomic sequence information help to identify previously uncharacterized genes and to clarify their evolutionary history. It is well known that innate immunity-related molecules, such as cytokines, toll-like receptors and the complement family, and acquired immunity-related molecules, such as MHC and antibody receptors, are also expressed in the brain and play important roles in brain development (Morimoto and Nakajima 2019). Shrimps generally lack the classical adaptive immune system, including T cells and antigen-specific memory, and rely on innate immunity to survive under poor environmental conditions (Hoffmann et al. 1999; Hauton and Smith 2007). Crustaceans such as shrimp, prawns, crayfish, lobsters and crabs are farmed widely through intensive aquaculture techniques to meet global demand (Hauton and Smith 2007). Specialized adaptive functions, such as digestive functions, can be studied through transcriptomic analysis (Wang et al. 2021). However, further genome annotation and RNA sequence analysis will be needed to reveal the physiological functions of these genes and their roles in evolutionary adaptation.

Here, we provide a draft genome assembly of the highly adaptable shrimp C. pseudogracilirostris. The 1.3 Gbp draft genome assembly has a BUSCO completeness of 52.8%. We anticipate that comprehensive sequencing datasets will be a useful resource for basic evolutionary and comparative studies among crustaceans, which have a major ecological role and climate impact in aquatic systems. Furthermore, applying knowledge of the specifically evolved transcript sequences of C. pseudogracilirostris to commercially significant species may make it possible to increase their resistance and adaptation to a variety of climatic conditions, which could boost production and help meet demand. Thus, deep phylogenetic relationships and comparative studies with other genera help in understanding related species. The nucleotide sequences of the C. pseudogracilirostris genome will offer genetic information necessary for understanding the evolution of decapods. However, further detailed comparative and functional genomics studies will be needed to understand the evolutionary and adaptation mechanisms of the brackish water shrimp C. pseudogracilirostris.

Acknowledgement:

The authors acknowledge the support of the NextGen Lab facility, Sathyabama Institute of Science and Technology.

Funding Statement:

The corresponding author RRK acknowledges the Department of Science and Technology (Govt. of India) for financial support (EMR/2014/000630).

Declaration of Conflicting Interests:

The authors declare no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Author Contribution statement:

Rajaretinam Rajesh Kannan: Conceptualization, Supervision, Writing, Review & Editing. Nandhagopal Soundharapandiyan: Methodology, Investigation, Writing – Original Draft, Software, Formal analysis. Carlton Ranjith Wilson Alphonse: Software, Formal analysis. Subramoniam Thanumalaya: Conceptualization, Review & Editing. Samuel Gnana Prakash Vincent: Conceptualization, Review & Editing.

Data availability statement:

The datasets generated and/or analysed during the current study are available in the NCBI SRA repository, Accession Number: SRR19611691.

Altenhoff AM, Schneider A, Gonnet GH, Dessimoz C (2011) OMA 2011: orthology inference among 1000 complete genomes. Nucleic Acids Res 39:D289. https://doi.org/10.1093/NAR/GKQ1238
Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120. https://doi.org/10.1093/BIOINFORMATICS/BTU170
Britten RJ, Kohne DE (1968) Repeated sequences in DNA. Hundreds of thousands of copies of DNA sequences have been incorporated into the genomes of higher organisms. Science 161:529–540. https://doi.org/10.1126/SCIENCE.161.3841.529
Buchfink B, Xie C, Huson DH (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods 12:59–60. https://doi.org/10.1038/NMETH.3176
Camacho C, Coulouris G, Avagyan V, et al (2009) BLAST+: Architecture and applications. BMC Bioinformatics 10:1–9. https://doi.org/10.1186/1471-2105-10-421
Carmi S, Church GM, Levanon EY (2011) Large-scale DNA editing of retrotransposons accelerates mammalian genome evolution. Nat Commun 2:. https://doi.org/10.1038/NCOMMS1525
Carr M, Bensasson D, Bergman CM (2012) Evolutionary Genomics of Transposable Elements in Saccharomyces cerevisiae. PLoS One 7:e50978. https://doi.org/10.1371/JOURNAL.PONE.0050978
Chikhi R, Medvedev P (2014) Informed and automated k-mer size selection for genome assembly. Bioinformatics 30:31–37. https://doi.org/10.1093/BIOINFORMATICS/BTT310
Daniel C, Silberberg G, Behm M, Öhman M (2014) Alu elements shape the primate transcriptome by cis-regulation of RNA editing. Genome Biol 15:. https://doi.org/10.1186/GB-2014-15-2-R28
Emms DM, Kelly S (2015) OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol 16:1–14. https://doi.org/10.1186/S13059-015-0721-2
Hauton C, Smith VJ (2007) Adaptive immunity in invertebrates: a straw house without a mechanistic foundation. Bioessays 29:1138–1146. https://doi.org/10.1002/BIES.20650
He B, Zhu R, Yang H, et al (2020) Assessing the Impact of Data Preprocessing on Analyzing Next Generation Sequencing Data. Front Bioeng Biotechnol 8:817. https://doi.org/10.3389/FBIOE.2020.00817
Hoffmann JA, Kafatos FC, Janeway CA, Ezekowitz RAB (1999) Phylogenetic perspectives in innate immunity. Science 284:1313–1318. https://doi.org/10.1126/SCIENCE.284.5418.1313
Ilie L, Fazayeli F, Ilie S (2011) HiTEC: accurate error correction in high-throughput sequencing data. Bioinformatics 27:295–302. https://doi.org/10.1093/BIOINFORMATICS/BTQ653
Kawato S, Nishitsuji K, Arimoto A, et al (2021) Genome and transcriptome assemblies of the kuruma shrimp, Marsupenaeus japonicus. G3 Genes|Genomes|Genetics 11:. https://doi.org/10.1093/G3JOURNAL/JKAB268
Li L, Stoeckert CJ, Roos DS (2003) OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes. Genome Res 13:2178. https://doi.org/10.1101/GR.1224503
Lloyd GT, Davis KE, Pisani D, et al (2008) Dinosaurs and the Cretaceous Terrestrial Revolution. Proc R Soc B Biol Sci 275:2483. https://doi.org/10.1098/RSPB.2008.0715
Mente E (2008a) Reproductive biology of crustaceans : case studies of decapod crustaceans. 16
Mente E (2008b) Reproductive biology of crustaceans : case studies of decapod crustaceans. Science Publishers
Meyers BC, Tingey S V., Morgante M (2001) Abundance, Distribution, and Transcriptional Activity of Repetitive Elements in the Maize Genome. Genome Res 11:1660–1676. https://doi.org/10.1101/GR.188201
Muñoz-López M, García-Pérez JL (2010) DNA Transposons: Nature and Applications in Genomics. Curr Genomics 11:115. https://doi.org/10.2174/138920210790886871
Östlund G, Schmitt T, Forslund K, et al (2010) InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res 38:D196. https://doi.org/10.1093/NAR/GKP931
Schulz MH, Weese D, Holtgrewe M, et al (2014) Fiona: a parallel and automatic strategy for read error correction. Bioinformatics 30:i356–i363. https://doi.org/10.1093/BIOINFORMATICS/BTU440
Simão FA, Waterhouse RM, Ioannidis P, et al (2015) BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31:3210–3212. https://doi.org/10.1093/BIOINFORMATICS/BTV351
Andrews S (2010) FastQC: A Quality Control tool for High Throughput Sequence Data. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed 8 Jun 2022
Simpson JT (2014) Exploring genome characteristics and sequence quality without a reference. Bioinformatics 30:1228–1235. https://doi.org/10.1093/BIOINFORMATICS/BTU023
Simpson JT, Wong K, Jackman SD, et al (2009) ABySS: A parallel assembler for short read sequence data. Genome Res 19:1117. https://doi.org/10.1101/GR.089532.108
Soundharapandiyan N, Thanumalayaperumal S, Rajaretinam RK (2022) Real-time imaging and developmental biochemistry analysis during embryogenesis of Caridina pseudogracilirostris. J Exp Zool Part A Ecol Integr Physiol 337:206–220. https://doi.org/10.1002/JEZ.2556
Stanke M, Morgenstern B (2005) AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res 33:. https://doi.org/10.1093/NAR/GKI458
Steinegger M, Söding J (2017) MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35:1026–1028. https://doi.org/10.1038/nbt.3988
Stollewerk A (2016) A flexible genetic toolkit for arthropod neurogenesis. Philos Trans R Soc B Biol Sci 371:. https://doi.org/10.1098/RSTB.2015.0044
Swathi A, Shekhar MS, Katneni VK, Vijayan KK (2018) Genome size estimation of brackishwater fishes and penaeid shrimps by flow cytometry. Mol Biol Reports 45:951–960. https://doi.org/10.1007/S11033-018-4243-3
Tarailo-Graovac M, Chen N (2009) Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinforma Chapter 4: https://doi.org/10.1002/0471250953.BI0410S25
Thomas GWC, Dohmen E, Hughes DST, et al (2020) Gene content evolution in the arthropods. Genome Biol 21:1–14. https://doi.org/10.1186/S13059-019-1925-7
Thomas MM, Pillai VK, Pillai NN (1973) Caridina pseudogracilirostris sp.nov. (Atyidae: Caridina) from the Cochin Backwater . J Mar Biol Assoc India 15:871–872
Thorp JH, Rogers DC (2011) Crayfish, Crabs, and Shrimp: Subphylum Crustacea, Class Malacostraca, Order Decapoda. F Guid to Freshw Invertebr North Am 157–168. https://doi.org/10.1016/B978-0-12-381426-5.00018-1
Wang Z, Tang D, Shen C, Wu L (2021) Identification of Genes Involved in Digestion from Transcriptome of Parasesarma pictum and Parasesarma affine Hepatopancreas. Thalass An Int J Mar Sci 38:93–101. https://doi.org/10.1007/S41208-021-00296-2
Wolfe JM, Breinholt JW, Crandall KA, et al (2019) A phylogenomic framework, evolutionary timeline and genomic resources for comparative studies of decapod crustaceans. Proceedings Biol Sci 286:. https://doi.org/10.1098/RSPB.2019.0079
Yu Y, Gu J, Jin Y, et al (2015) Panoramix enforces piRNA-dependent cotranscriptional silencing. Science 350:339–342. https://doi.org/10.1126/SCIENCE.AAB0700
Yuan J, Zhang X, Wang M, et al (2021) Simple sequence repeats drive genome plasticity and promote adaptive evolution in penaeid shrimp. Commun Biol 4:1–14. https://doi.org/10.1038/s42003-021-01716-y

No competing interests reported.

SupplementaryTables.doc
