Nucleotide composition and PR2-plot analysis
A total of 9414 and 9420 unigenes with length larger than 400 were screened from M. japonicus and M. pulchricaudatus libraries, respectively (Fig. 1A). The length distribution of the two groups was similar. In the M. japonicus, the mean contents of A and T nucleotide were 31.89% (SD = 10.45%) and 30.8% (SD = 9.04%), and the mean contents of C and G nucleotide were 33.63% (SD = 10.71%) and 28.03% (SD = 9.21%). In the M. pulchricaudatus, the average contents of A and T nucleotide were 31.86% (SD = 10.55%) and 30.8% (SD = 9.18%), and the average contents of C and G nucleotide were 33.56% (SD = 10.73%) and 28.27% (SD = 9.29%) (Fig. 1B). The average contents of GC were 51.61% and 51.54% for M. japonicus and M. pulchricaudatus, respectively. The mean contents of GC3s were 49.1% and 49.17% for M. japonicus and M. pulchricaudatus, respectively, which were significantly higher than that of GC12. The median of GC-biases [G3/(G3 + C3)] were 0.4563 and 0.4582 for M. japonicus and M. pulchricaudatus, and the median of AT-biases [A3/(A3 + T3)] were 0.5047 and 0.5051, respectively (Fig. 1C and D). Parity Rule 2 (PR2) plot analysis showed that purines (A and G) were used more frequently than pyrimidines (C and T) in two Marsupenaeus species. The unbalanced use of the third base suggested that mutation pressure and natural selection contribute to codon usage bias.
Correlation analysis of codon usage parameters
All parameters have similar correlations between M. japonicus and M. pulchricaudatus (Fig. 2). The results indicated that FPKM was negatively correlated with T3s and A3s (p < 0.05), and positively correlated with other parameters (p < 0.05) in M. japonicus and M. pulchricaudatus. There was a significant (p < 0.05) positive correlation among T3s, A3s, and ENc values. These three values were negatively (p < 0.05) correlated with other parameters. Correlation analysis indicated that the third base content of synonymous codons significantly affects gene expression and codon usage bias. The significant correlation (p < 0.05) between GC3 and GC content indicated that the nucleotide contents play an important role in codon usage bias. The first and second base content was often determined by natural selection and the third base content was affected by mutation pressure [54, 55].
The average ENc values were 52.1 and 52.22 for M. japonicus and M. pulchricaudatus, respectively. There were 268 (2.85%) and 249 (2.64%) genes with ENc value equal to 61 for M. japonicus and M. pulchricaudatus, which represent all synonymous codons have the same probability. There were 187 (1.99%) and 133 (1.41%) genes with ENc value less than 35 for M. japonicus and M. pulchricaudatus, and the minimum values were 23.6 and 27.46, respectively. The 35 value was the standard of codon bias [35, 37]. To explore the effect of GC3s on the codon usage bias, we performed the ENc-plot analysis. In figure S1A and S1B, most genes were aggregated close to the expected curve, which showed that codon usage bias was mainly affected by mutation pressure. We found lower ENc values in M. japonicus than M. pulchricaudatus. Meanwhile, we estimated the difference between the expected and the observed ENc values and calculated the (ENcexp - ENcobs)/ENcexp (Fig. S1C and S1D). The frequencies distribution of unigenes with values within 0 ~ 0.1 were highest, which showed that most ENc values from GC3s were larger than observed ENc values. Moreover, there was a significant positive correlation between GC3s and CAI values (Fig. S1E and S1F).
Go ontology (GO) annotation based on GC3s
To further understand the influence of GC3s on the gene function, the CDSs with low, mid, and high GC3, including 1000, 1001, and 1005 genes in M. japonicus and 1005, 1001, and 1002 genes in M. pulchricaudatus, were performed GO annotation. Gene ontology terms provided three categories, including cellular component, molecular function, and biological process. Overall, these three groups' genes in two shrimp species presented similar functional categories (Fig. S2). In the biological process categories, with 13 subtypes, most corresponding genes were involved in cellular process, metabolic process, single-organism process, and biological regulation. Thirteen subtypes were annotated with the cellular component, and the highest gene number was observed in “cell part” and “cell” categories. In the molecular function, the “binding” was the highest category.
Correlation analysis between codon usage parameters and the substitution rate
A total of 5036 pairs of single-copy orthologous genes were identified between M. japonicus and M. pulchricaudatus libraries [28]. Among these orthologs, the Ka/Ks values of 2491 pairs were calculated with the mean values were 0.002, 0.019, and 0.175 for Ka, Ks, and Ka/Ks (ω), respectively. There were 49 pairs of orthologous genes with a ω value greater than 1 (positive selection), and 2225 pairs with a ω value less than 1 (purifying selection).
Overall, orthologous genes that underwent positive selection (ω > 1) had a higher correlation coefficient than that experienced purifying selection (ω < 1), which could be more genes with ω < 1 lead to large differences. Almost all parameters had different significance levels with Ka, Ks, or Ka/Ks (Fig. 3). In the M. japonicus, the Ka/Ks of genes with ω less than 1 was positively correlated with ENc, A3s, and T3s (p < 0.01), but negatively correlated with other parameters (p < 0.01). There was no significant correlation between all parameters with the Ka/Ks of genes with ω greater than 1. However, GC content and CBI value were positively correlated with Ks, and G3s was negatively correlated with Ks. Besides, Fop and CBI values were positively correlated with Ka. In M. pulchricaudatus, the Ka/Ks of genes with ω less than 1 were positively correlated with A3s and T3s but negatively correlated with other parameters. Like M. japonicus, there was no significant correlation between all parameters with the Ka/Ks of genes with ω greater than 1. However, CBI and T3s values were positively correlated with Ka and Ks.
Correspondence analysis (COA)
Using the RSCU values, the correspondence analysis was used to investigate the factors related to codon usage patterns and reflect the variation trend in codon usage. The result indicated that the first five axes accounted for 43.8% and 44.3% of the amino-acid variation for M. japonicus and M. pulchricaudatus (Fig. 4A). In M. japonicus, Axis 1 and Axis 2 explain 25.16% and 6.54% of the variance, respectively. In M. pulchricaudatus, Axis 1 and Axis 2 explain 26.38% and 6.29% of the variance, respectively. In M. japonicus, the relationships were highly significant positive between Axis 1 and A3, T3, and ENc (p < 0.01), and others were significantly negative correlated (p < 0.01). However, all relationships in M. pulchricaudatus were opposite (Fig. 4B).
To identify the effect of GC content on codon bias, GC contents of genes were color-coded on the plot, which uses Axis 1 as the abscissa and Axis 2 as the ordinate (Fig. 4C for M. japonicus and 4D for M. pulchricaudatus). Overall, the distribution of GC content was the opposite along Axis 1. In M. japonicus, the larger value of Axis 1, the smaller the GC content. The negative correlation (-0.562 with p-value < 0.01) between Axis 1 and GC content was presented in the Fig. 6B. Instead, the larger value of Axis 1, the larger the GC content in M. pulchricaudatus, and the positive correlation was 0.7 (p < 0.01).
Determination of optimal codons
There were 32 codons with the RSCU value > 1 in M. japonicas and M. pulchricaudatus, respectively, which indicated that these codons were preferences for two Marsupenaeus species (Table S1). Expect for Trp and Met, the codons of Ala, Arg, Gly, Pro, Ser, and Thr had a higher bias. Besides, the codons with the RSCU value > 1 mainly end with C and A. Based on ENc values, we obtained the RSCU datasets of high and low expression genes and calculated the △RSCU value (Table S2). We determined 12 and 14 optimal codons for M. japonicus and M. pulchricaudatus, respectively (Table 1). In M. japonicus, 9 optimal codons were C-ending and 3 optimal codons were G-ending. In M. pulchricaudatus species, 9 optimal codons were C-ending and 5 optimal codons were G-ending. Most optimal codons were the same in two Marsupenaeus species. M. japonicus had a special ACC (Thr) codon, and M. pulchricaudatus had special CCG (Pro), GCG (Ala), and GGC (Gly) codons.
Table 1
The optimal codons based on high and low levels of expression. AA: amino acids.
| | M. japonicus | M. pulchricaudatus |
AA | Codon | RSCU-H | RSCU-L | △RSCU | RSCU-H | RSCU-L | △RSCU |
Val | GUC | 1.557 | 0.995 | 0.563 | 1.454 | 0.984 | 0.470 |
Ser | UCG | 1.005 | 0.682 | 0.324 | 1.202 | 0.717 | 0.486 |
Pro | CCC | 1.552 | 0.906 | 0.646 | 1.764 | 0.920 | 0.844 |
| CCG | 0.896 | 0.614 | 0.282 | 1.047 | 0.606 | 0.441 |
Thr | ACC | 1.756 | 0.959 | 0.797 | 1.627 | 1.001 | 0.626 |
| ACG | 1.170 | 0.698 | 0.472 | 1.356 | 0.720 | 0.636 |
Ala | GCG | 0.836 | 0.564 | 0.271 | 1.002 | 0.537 | 0.464 |
Tyr | UAC | 1.602 | 0.944 | 0.658 | 1.609 | 0.943 | 0.666 |
His | CAC | 1.551 | 0.986 | 0.565 | 1.541 | 0.998 | 0.544 |
Asn | AAC | 1.605 | 0.966 | 0.639 | 1.590 | 0.998 | 0.591 |
Asp | GAC | 1.470 | 0.991 | 0.479 | 1.545 | 0.986 | 0.559 |
Glu | GAG | 1.329 | 0.924 | 0.405 | 1.465 | 0.917 | 0.548 |
Cys | UGC | 1.407 | 0.929 | 0.478 | 1.469 | 0.980 | 0.488 |
Arg | CGC | 1.832 | 0.835 | 0.997 | 2.120 | 0.830 | 1.289 |
Gly | GGC | 1.819 | 1.001 | 0.818 | 2.108 | 0.978 | 1.130 |
Codon pairs in two Marsupenaeus species
The synonymous codon that encodes two amino acids was called a duplex codon or codon pairs, which was more commonly used than the single codon. Two marsupenaeus species had different use frequency of codon pairs (Table 2), such as GlyAla (GGAGCU vs GGAGCA),GlnArg༈CAGAGA vs CAAAGA༉༌and GluAsn༈GAGAAC vs GAAAAU༉. In M. japonicus, the high-frequency codon pair of ArgArg was AGAAGA, while the optimal codon of Arg was CGC (Fig. S3). The high-frequency codon pair of AspAsp was GAUGAU, while the optimal codon of Asp was GAC. The high-frequency codon pair of GluGlu was GAAGAA, while the optimal codon of Glu was GAG. There were other inconsistencies, including GlyGly (GGAGGA) and Gly (GGC), HisHis (CAUCAU) and His (CAC), ProPro (CCACCA) and Pro (CCC), SerSer (AGCAGC) and Ser (UCG), ThrThr (ACAACA) and Thr (ACC/ACG), and ValVal (GUGGUG) and Val (GUC) (Fig. S3). In M. pulchricaudatus, the high-frequency codon pair of HisHis was CACCAC, while differentiates it from that of M. japonicus. The high-frequency codon pair of ProPro was CCACCA, while the optimal codon of Pro was CCC and CCG. The high-frequency codon pair of AlaAla was GCAGCA, while the optimal codon of Ala was GCG (Fig. S4). Codon pair utilization biases play an important role in protein synthesis by the interaction with tRNA isoacceptors [56]. Codon pair analysis enables us to get a clear picture of the codon usage bias during transcription and translation.
Table 2
The different duplex codons of two Marsupenaeus species
Codons | M. japonicus | M. pulchricaudatus | Codons | M. japonicus | M. pulchricaudatus |
Arg_Pro | AGGCCA | AGACCA | Phe_Thr | TTCACA | TTCACC |
Asn_Ile | AACATT | AACATC | Pro_His | CCTCAT | CCTCAC |
Asn_Leu | AACCTC | AACCTG | Pro_Lys | CCCAAG | CCAAAG |
Asp_Pro | GATCCA | GACCCA | Pro_Ser | CCTTCA | CCATCA |
Asp_Val | GATGTG | GATGTT | Pro_Val | CCAGTG | CCTGTG |
Cys_Met | TGTATG | TGCATG | Ser_Gln | TCACAG | AGCCAG |
Gln_Arg | CAGAGA | CAAAGA | Ser_Met | TCCATG | TCAATG |
Gln_Ile | CAGATC | CAGATT | Thr_Leu | ACCCTC | ACTTTG |
Glu_Asn | GAGAAC | GAAAAT | Trp_His | TGGCAC | TGGCAT |
Gly_Ala | GGAGCT | GGAGCA | Trp_Tyr | TGGTAC | TGGTAT |
Gly_Val | GGAGTG | GGTGTT | Tyr_Ala | TATGCT | TATGCA |
His_His | CATCAT | CACCAC | Tyr_Pro | TACCCA | TATCCA |
Leu_Leu | CTGCTG | CTCCTC | Val_Cys | GTGTGC | GTGTGT |
Leu_Met | CTGATG | TTGATG | Val_Gly | GTGGGC | GTTGGA |
Lys_Glu | AAAGAA | AAGGAA | Val_Ser | GTGTCA | GTCAGC |
Phe_Ala | TTTGCA | TTTGCT | | | |
Multi-species clustering analysis
Based on the RSCU values of 59 codons (except Met, Trp, Taa, Tag, and Tga), the heat map (Fig. 5) showed that two Marsupenaeus species were clustered with Daphnia pulex and Litopenaeus vannamei, and then Crassostrea gigas. The Larimichthys crocea, Cyprinus carpio, and Danio rerio were classified into the same cluster. Homo sapiens and Mus musculus were clustered into one group. It was interesting that Drosophila melanogaster and mammals were grouped at first, then Arthropoda and Granulifusus kiranus (Crassostrea gigas) joined in them. This may be mainly since D. melanogaster has a stronger codon preference than other arthropods. Similar to the clustering results, the PCA showed two Marsupenaeus species overlapped almost completely, and the relationship between C. gigas and arthropods was not as strong as the result of heat map clustering (Fig. 6). These clustering results were consistent with traditional species classification.