Comparative analysis of the codon usage patterns in two closely related Marsupenaeus species based on comparative transcriptomics


 Background: Marsupenaeus japonicus, a major commercial shrimp species in the world, has two cryptic or sibling species, Marsupenaeus japonicus and Marsupenaeus pulchricaudatus. Due to the lack of genomic information, little is known about the correlations among codon usage bias, gene expression, and evolutionary trends in Marsupenaeus orthologs.Results: Using the CodonW 1.4.2 software, we performed the codon bias analysis of two Marsupenaeus species transcriptomes. The average contents of GC and ENc were 51.61% and 52.1 for VI (M. japonicus), 51.54% and 52.22 for VII (M. pulchricaudatus), respectively. Parity Rule 2 (PR2) plot analysis showed that purines (A and G) were used more frequently than pyrimidines (C and T) in two Marsupenaeus species. The average ENc value was 52.1 and 52.22 for M. japonicus and M. pulchricaudatus, respectively. Overall, orthologous genes that underwent positive selection (ω > 1) had a higher correlation coefficient than that experienced purifying selection (ω < 1). In M. japonicus, the relationships were highly significant positive about Axis 1 and A3, T3 and ENc (p < 0.01). However, all relationships in M. pulchricaudatus were the opposite. We determined 12 and 14 optimal codons for M. japonicus and M. pulchricaudatus, respectively. Two Marsupenaeus species had 31 different codon pairs. The results of multi-species clustering based on codon preference were consistent with traditional classification. Conclusions: We characterized the codon usage patterns of the two Marsupenaeus species and the evolutionary trends in Marsupenaeus orthologs, which provides new insights into the genetic divergence and the phylogenetic relationships of two Marsupenaeus species.


Background
The codon is the basic genetic code of all biological messenger RNA and 62 codons encode 20 different amino acids [1][2][3]. For different genes or genomes, the selection of synonymous codons is non-random, which is called codon bias (CUB) [4,5]. Codon preference is speci c to the organism and may be in uenced by GC content, gene expression level, and gene length [6][7][8]. Besides, codon usage patterns may affect the biological functions of mRNA biosynthesis, translation elongation rate, protein folding, and other downstream expressions [7,[9][10][11][12]. It is now thought that the CUB is mainly affected by natural selection and mutational pressure [13][14][15][16].
Most of the research in the past mainly studied the codon preference of species with genome-wide information [17][18][19]. Recent rapid development in next-generation sequencing has provided large amounts of genomic and transcriptome data. Machado et al detected and quanti ed strong selection on synonymous sites of Drosophila melanogaster by using deep genomic population sequencing [20]. Utilizing the Robo-seq and RNA-seq data, Chu et al studied how codon usage bias could impact the translation patterns of Arabidopsis thaliana [21]. Guan et al analyzed codon usage of Hirudinaria manillensis RNA-seq data and found that the genetic evolution was driven by mutation pressure and natural selection [22]. Additional studies of codon usage bias based on transcriptome data include Bombyx mori [23], Taenia multiceps [24], and Megalobrama amblycephala [25].
The kuruma shrimp (Marsupenaeus japonicus) includes two cryptic species, which are distributed mostly allopatrically but co-occur in the northern South China Sea [26]. Our previous study showed that there were obvious genetic differentiation and reproductive isolation between the two cryptic species [27]. We sequenced and compared the transcriptomes of two Marsupenaeus species, and obtained a large number of putative orthologs [28]. Yi et al compared and analyzed the codon bias of three Misgurnus species and revealed the potential importance of expression-mediated selection in shaping the genome evolution [29]. Song et al found that the complex correlations among gene expression, codon usage bias, and substitution rate in Arachis duranensis and Arachis ipaënsis orthologs [30].
In this study, we performed codon usage bias analysis based on two Marsupenaeus species transcriptomes using the CodonW software. We systematically compared the codon usage patterns of the two Marsupenaeus species and evaluated the comprehensive effects of various factors. Moreover, the results provided new insights into the genetic divergence and the phylogenetic relationships of two Marsupenaeus species.

Data collection and ltering
Raw Illumina sequences are accessible from NCBI Sequence Read Archive (SRA) (https://trace.ncbi.nlm.nih.gov/Traces/sra/) under accession SRR7786082 (Marsupenaeus pulchricaudatus) and SRR7786083 (Marsupenaeus japonicus). Transcript assembly and functional annotation were described in detail in our previous work [28]. A total of 14126 and 13695 unigenes with full-length CDS (Coding sequence) regions were identi ed from M. japonicus and M. pulchricaudatus libraries, respectively. Orthologous groups were screened using OrthoMCL with default settings [31]. Gene expression levels as fragments per kilobase million [32] were estimated by RSEM software [33]. Coding sequences of the other 9 species were downloaded from NCBI. All CDSs were ltered using the Omics hare online platform (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/), and those sequences with length less than 400 bp or unknown bases were eliminated.

Codon usage indices analysis
The GC1, GC2, GC3 contents were calculated using Perl scripts, and GC12 was the average value of GC1 and GC2. Using the CodonW 1.4.2 software (http://codonw.sourceforge.net), we performed codon bias analysis. The calculation indices included GC content, the nucleotide composition at the 3rd codon position (A3s, T3s, G3s, and C3s), the effective number of codons (ENc), the codon adaptation index (CAI), codon bias index (CBI), frequency of optional codon (Fop), relative synonymous codon usage (RSCU), and so on. The Parity Rule 2 (PR2) plot analysis was based on the third codon position, using A3/(A3 + T3) as the ordinate and G3/(G3 + C3) as the abscissa. The PR2 plot can be used to estimate the impact of selection and mutation pressure on codon usage bias [34].

ENc-plot and GO annotation
The effective number of codons (ENc), with a value between 20 and 61, is a key parameter to interpret codon bias. The value 20 indicates that only one synonymous codon is chosen and 61 represent no usage bias and all synonymous codons have the same probability. The lower the value for a coding sequence, the stronger the codon usage bias [35,36]. In general, a gene possesses strong codon usage bias when the ENc value lower than 35 [37,38]. The ENc-plot was drawn by Origin 2020 (OriginLab Corporation, USA), which uses ENc value as the ordinate and GC3s as the abscissa. The expected ENc values were calculated based on the equation: ENc(exp) = 2 + GC3s + 29/[ GC3s 2 + (1-GC3s) 2 ] [39]. CAI was an important index for estimating synonymous codon usage bias and gene expression level, and the higher CAI value signi es the stronger codon usage bias [40][41][42]. Gene ontology annotation was performed using Blast2GO v2.5 (E-value < 1e − 6 ) [43]. GO classi cations were compared among different groups of GC3s (High, Mid, and Low) using the Omicshare online platform (https://www.omicshare.com/tools/).

Correlation analysis
The codon usage patterns were often shaped by many factors, such as GC content, expression level, tRNA abundance, protein structure, and hydrophilicity [44,45]. We performed a correlation analysis between codon bias parameters and expression level (FPKM). Using the PAML toolkit [46], we calculate the non-synonymous substitution ratio (Ka) and synonymous substitution ratio (Ks). In genetics, the Ka/Ks (ω) can be used to determine whether there is selective pressure on protein-encoding genes [47,48]. The ω > 1 suggests that the gene evolved under positive selection and the ω closing to zero indicates that the gene is under heavy selection pressure [47,49].

Correspondence analysis (COA)
To further investigate the factors related to the codon usage pattern, the correspondence analysis was conducted by CodonW based on the RSCU values. The COA was used to compare the usage patterns of 59 codons (except Met, Trp, Taa, Tag, and Tga) and re ect the variation trend in codon usage. COA creates a series of orthogonal axes, which were used to estimate the main source of variation. Using SPSS v22, the relative coe cient between ten codon bias parameters and axis1 and axis2 was calculated, respectively.

Relative synonymous codon usage and optimal codons
According to Sharp et al [50], the relative synonymous codon usage (RSCU) was an index to measure the codon usage preference. The higher the RSCU value, the stronger the preference. Based on the calculated ENc values, 10% of the genes with extremely high and low ENc values were regarded as the high and low RSCU datasets [51]. The optimal codons were con rmed based on the △RSCU value and chi-square test [38,52,53].

Clustering and principal component analysis
The protein-coding sequences of nine species were downloaded from the ensemble database (http://asia.ensembl.org/index.html) and NCBI (https://www.ncbi.nlm.nih.gov/), and codon usage preference was analyzed using CodonW. The heatmap was performed based on RSCU values using the Omicshare online platform (https://www.omicshare.com/tools/). Based on the RSCU of 59 codons, the Principal component analysis (PCA) of 11 species were performed by Origin 2020 (OriginLab Corporation, USA).

Results
Nucleotide composition and PR2-plot analysis A total of 9414 and 9420 unigenes with length larger than 400 were screened from M. japonicus and M. pulchricaudatus libraries, respectively (Fig. 1A). The length distribution of the two groups was similar. In the M. japonicus, the mean contents of A and T nucleotide were 31.89% (SD = 10.45%) and 30.8% (SD = 9.04%), and the mean contents of C and G nucleotide were 33.63% (SD = 10.71%) and 28.03% (SD = 9.21%). In the M. pulchricaudatus, the average contents of A and T nucleotide were 31.86% (SD = 10.55%) and 30.8% (SD = 9.18%), and the average contents of C and G nucleotide were 33.56% (SD = 10.73%) and 28.27% (SD = 9.29%) (Fig. 1B). The average contents of GC were 51.61% and 51.54% for M. japonicus and M. pulchricaudatus, respectively. The mean contents of GC3s were 49.1% and 49.17% for M. japonicus and M. pulchricaudatus, respectively, which were signi cantly higher than that of GC12. The median of GC-biases [G3/(G3 + C3)] were 0.4563 and 0.4582 for M. japonicus and M. pulchricaudatus, and the median of AT-biases [A3/(A3 + T3)] were 0.5047 and 0.5051, respectively ( Fig. 1C and D). Parity Rule 2 (PR2) plot analysis showed that purines (A and G) were used more frequently than pyrimidines (C and T) in two Marsupenaeus species. The unbalanced use of the third base suggested that mutation pressure and natural selection contribute to codon usage bias.

Correlation analysis of codon usage parameters
All parameters have similar correlations between M. japonicus and M. pulchricaudatus (Fig. 2). The results indicated that FPKM was negatively correlated with T3s and A3s (p < 0.05), and positively correlated with other parameters (p < 0.05) in M. japonicus and M. pulchricaudatus. There was a signi cant (p < 0.05) positive correlation among T3s, A3s, and ENc values. These three values were negatively (p < 0.05) correlated with other parameters. Correlation analysis indicated that the third base content of synonymous codons signi cantly affects gene expression and codon usage bias. The signi cant correlation (p < 0.05) between GC3 and GC content indicated that the nucleotide contents play an important role in codon usage bias. The rst and second base content was often determined by natural selection and the third base content was affected by mutation pressure [54,55].  [35,37]. To explore the effect of GC3s on the codon usage bias, we performed the ENc-plot analysis. In gure S1A and S1B, most genes were aggregated close to the expected curve, which showed that codon usage bias was mainly affected by mutation pressure. We found lower ENc values in M. japonicus than M. pulchricaudatus. Meanwhile, we estimated the difference between the expected and the observed ENc values and calculated the (ENcexp -ENcobs)/ENcexp (Fig. S1C and S1D). The frequencies distribution of unigenes with values within 0 ~ 0.1 were highest, which showed that most ENc values from GC3s were larger than observed ENc values. Moreover, there was a signi cant positive correlation between GC3s and CAI values (Fig. S1E and S1F).

Go ontology (GO) annotation based on GC3s
To further understand the in uence of GC3s on the gene function, the CDSs with low, mid, and high GC3, including 1000, 1001, and 1005 genes in M. japonicus and 1005, 1001, and 1002 genes in M. pulchricaudatus, were performed GO annotation. Gene ontology terms provided three categories, including cellular component, molecular function, and biological process. Overall, these three groups' genes in two shrimp species presented similar functional categories (Fig. S2). In the biological process categories, with 13 subtypes, most corresponding genes were involved in cellular process, metabolic process, single-organism process, and biological regulation. Thirteen subtypes were annotated with the cellular component, and the highest gene number was observed in "cell part" and "cell" categories. In the molecular function, the "binding" was the highest category.
Correlation analysis between codon usage parameters and the substitution rate A total of 5036 pairs of single-copy orthologous genes were identi ed between M. japonicus and M. pulchricaudatus libraries [28]. Among these orthologs, the Ka/Ks values of 2491 pairs were calculated with the mean values were 0.002, 0.019, and 0.175 for Ka, Ks, and Ka/Ks (ω), respectively. There were 49 pairs of orthologous genes with a ω value greater than 1 (positive selection), and 2225 pairs with a ω value less than 1 (purifying selection).
Overall, orthologous genes that underwent positive selection (ω > 1) had a higher correlation coe cient than that experienced purifying selection (ω < 1), which could be more genes with ω < 1 lead to large differences. Almost all parameters had different signi cance levels with Ka, Ks, or Ka/Ks (Fig. 3). In the M. japonicus, the Ka/Ks of genes with ω less than 1 was positively correlated with ENc, A3s, and T3s (p < 0.01), but negatively correlated with other parameters (p < 0.01). There was no signi cant correlation between all parameters with the Ka/Ks of genes with ω greater than 1. However, GC content and CBI value were positively correlated with Ks, and G3s was negatively correlated with Ks. Besides, Fop and CBI values were positively correlated with Ka. In M. pulchricaudatus, the Ka/Ks of genes with ω less than 1 were positively correlated with A3s and T3s but negatively correlated with other parameters. Like M. japonicus, there was no signi cant correlation between all parameters with the Ka/Ks of genes with ω greater than 1. However, CBI and T3s values were positively correlated with Ka and Ks.

Correspondence analysis (COA)
Using the RSCU values, the correspondence analysis was used to investigate the factors related to codon usage patterns and re ect the variation trend in codon usage. The result indicated that the rst ve axes accounted for 43.8% and 44.3% of the amino-acid variation for M. japonicus and M. pulchricaudatus (Fig. 4A). In M. japonicus, Axis 1 and Axis 2 explain 25.16% and 6.54% of the variance, respectively. In M. pulchricaudatus, Axis 1 and Axis 2 explain 26.38% and 6.29% of the variance, respectively. In M. japonicus, the relationships were highly signi cant positive between Axis 1 and A3, T3, and ENc (p < 0.01), and others were signi cantly negative correlated (p < 0.01). However, all relationships in M. pulchricaudatus were opposite (Fig. 4B).
To identify the effect of GC content on codon bias, GC contents of genes were color-coded on the plot, which uses Axis 1 as the abscissa and Axis 2 as the ordinate (Fig. 4C for M. japonicus and 4D for M. pulchricaudatus). Overall, the distribution of GC content was the opposite along Axis 1. In M. japonicus, the larger value of Axis 1, the smaller the GC content. The negative correlation (-0.562 with p-value < 0.01) between Axis 1 and GC content was presented in the Fig. 6B. Instead, the larger value of Axis 1, the larger the GC content in M. pulchricaudatus, and the positive correlation was 0.7 (p < 0.01).

Determination of optimal codons
There were 32 codons with the RSCU value > 1 in M. japonicas and M. pulchricaudatus, respectively, which indicated that these codons were preferences for two Marsupenaeus species (Table S1). Expect for Trp and Met, the codons of Ala, Arg, Gly, Pro, Ser, and Thr had a higher bias. Besides, the codons with the RSCU value > 1 mainly end with C and A. Based on ENc values, we obtained the RSCU datasets of high and low expression genes and calculated the △RSCU value (Table S2). We determined 12 and 14 optimal codons for M. japonicus and M. pulchricaudatus, respectively (Table 1). In M. japonicus, 9 optimal codons were C-ending and 3 optimal codons were G-ending. In M. pulchricaudatus species, 9 optimal codons were C-ending and 5 optimal codons were G-ending. Most optimal codons were the same in two Marsupenaeus species. M. japonicus had a special ACC (Thr) codon, and M. pulchricaudatus had special CCG (Pro), GCG (Ala), and GGC (Gly) codons. Codon pairs in two Marsupenaeus species The synonymous codon that encodes two amino acids was called a duplex codon or codon pairs, which was more commonly used than the single codon. Two marsupenaeus species had different use frequency of codon pairs (Table 2), such as GlyAla (GGAGCU vs GGAGCA),GlnArg CAGAGA vs CAAAGA and GluAsn GAGAAC vs GAAAAU . In M. japonicus, the high-frequency codon pair of ArgArg was AGAAGA, while the optimal codon of Arg was CGC (Fig. S3). The high-frequency codon pair of AspAsp was GAUGAU, while the optimal codon of Asp was GAC. The high-frequency codon pair of GluGlu was GAAGAA, while the optimal codon of Glu was GAG. There were other inconsistencies, including GlyGly (GGAGGA) and Gly (GGC), HisHis (CAUCAU) and His (CAC), ProPro (CCACCA) and Pro (CCC), SerSer (AGCAGC) and Ser (UCG), ThrThr (ACAACA) and Thr (ACC/ACG), and ValVal (GUGGUG) and Val (GUC) (Fig. S3). In M. pulchricaudatus, the high-frequency codon pair of HisHis was CACCAC, while differentiates it from that of M. japonicus. The high-frequency codon pair of ProPro was CCACCA, while the optimal codon of Pro was CCC and CCG. The high-frequency codon pair of AlaAla was GCAGCA, while the optimal codon of Ala was GCG (Fig. S4). Codon pair utilization biases play an important role in protein synthesis by the interaction with tRNA isoacceptors [56]. Codon pair analysis enables us to get a clear picture of the codon usage bias during transcription and translation.

Multi-species clustering analysis
Based on the RSCU values of 59 codons (except Met, Trp, Taa, Tag, and Tga), the heat map (Fig. 5) showed that two Marsupenaeus species were clustered with Daphnia pulex and Litopenaeus vannamei, and then Crassostrea gigas. The Larimichthys crocea, Cyprinus carpio, and Danio rerio were classi ed into the same cluster. Homo sapiens and Mus musculus were clustered into one group. It was interesting that Drosophila melanogaster and mammals were grouped at rst, then Arthropoda and Granulifusus kiranus (Crassostrea gigas) joined in them. This may be mainly since D. melanogaster has a stronger codon preference than other arthropods. Similar to the clustering results, the PCA showed two Marsupenaeus species overlapped almost completely, and the relationship between C. gigas and arthropods was not as strong as the result of heat map clustering (Fig. 6). These clustering results were consistent with traditional species classi cation.

Discussion
Given the signi cant biological effects of different codon patterns, identifying these patterns in a given gene or genome is important to understand the molecular mechanisms of gene expression and to uncover the effects of long-term evolution on the genome [15,57,58]. Moreover, identifying these patterns is helpful to the phylogenetic analysis of species and to improve the expression of the target gene by optimizing codons [59][60][61][62].
In this study, we analyzed the codon preference of transcripts of two Marsupenaeus species, which was consistent on the whole. There was no signi cant difference in the content of AT and GC of the third base. The correlation between codon preference parameters was similar. The gene expression level (FPKM) was signi cantly negatively correlated with A/T3s, and the effective codon number ENc was signi cantly positively correlated with A/T3s. This result indicated that the third codon base signi cantly affects codon preference and gene expression level. The rst and second base content of a codon is usually affected by natural selection, while the third-base content is affected by mutation pressure [54,55]. The mean effective codon number (ENc) of the two cryptic species was 52.1 and 52.2, respectively, indicating the weak codon preference of M. japonicus transcripts. The PR2 analysis and ENc-GC3s correction analysis showed that the codon usage patterns of M. japonicus were in uenced by both mutation pressure and natural selection.
It was consistent on the whole that the correlation analysis between Ka/Ks value and codon preference parameters of orthologous genes in two cryptic species. In the group with ω < 1, the Ka/Ks of variety I was signi cantly positively correlated with ENc. In the group with ω > 1, Ka and Ks were signi cantly positively correlated with T3s, and the Fop and GC content of variety I was signi cantly positively correlated with Ka and Ks, respectively. Those orthologous genes under positive selective stress are usually associated with abiotic or biotic stress resistance [63,64]. Axis 1 was the most important evaluation index, which showed a highly signi cant correlation with C3 and GC3. The results of the correspondence analysis showed that the codon preference parameters of the two cryptic species had an opposite correlation with axis 1. The gene expression level was signi cantly positively correlated with GC content. The results from Camiolo et al showed that gene sequences with higher GC content showed a higher expression level and better codon preference [65]. Natural selection can improve gene transcription and translation e ciency by giving priority to the use of optimal synonymous codons [53,66].
In this study, RSCU and ENc values were combined to determine 12 optimal codons in the variety I, among which 9 ended in C and 3 ended in G, and 14 optimal codons in variety II, among which 9 ended in C and 5 ended in G. these results showed M. japonicus is genetically more likely to end in C/G, which was similar to the codon usage characteristic of carp (Cyprinus carpio L.), zebra sh (Danio rerio L.), Acanthopagrus schlegelii and Pagrus major [67,68]. This may be since the evolution of M. japonicus is mainly mutated from AT to CG. Al-Saif et al showed that reducing the proportion of UU or UA could enhance the resistance to mRNA attenuation, thus increasing protein expression [69]. The codon preference of different species is generally in uenced by mutation and natural selection pressure [70,71]. There were 31 different double codon pairs between the two cryptic species, and the optimization of the codon pair could improve the e ciency of protein translation than the single optimal codon [72][73][74]. The genetic distance of species is closely related to the codon preference difference, which can be used for species classi cation [25]. The results of multi-species heat map analysis and clustering based on RSCU values are consistent with traditional species classi cation, indicating that the size of inter-species codon preference difference can re ect the proximity of species, which is also veri ed in other species [74][75][76].
In conclusion, we systematically compared the codon usage patterns of two Marsupenaeus species and evaluated the comprehensive effects of various factors. This study provided a relatively comprehensive understanding of the correlations among codon usage bias, gene expression, and evolutionary trends in Marsupenaeus orthologs. Moreover, the results provided new insights into the genetic divergence and the phylogenetic relationships of two Marsupenaeus species.