Codon usage pattern
To investigate codon usage bias in S. lemnae, the relative synonymous codon usage (RSCU) values of the coding sequences were assessed. Among the preferred codons for all amino acids (except methionine and tryptophan) in S. lemnae coding sequences, 26 are A/U-ended (twelve A-ended; fourteen U-ended) and the remaining two are C/G-ended (one C-ended; one G-ended). Therefore, most preferred codons in S. lemnae are A-ended or U-ended (Table 1). Among the preferred codons, six, namely GGA (G), UCA (S), CCA (P), ACU (T), GCU (A) and AGA (R), have RSCU values > 1.6, whereas the RSCU values of the remaining ones are > 1 and < 1.6 (Table 1). The nucleotide composition (A/T-rich) and the RSCU analysis (A/U-ended) show that the preferred codons have been influenced mostly by compositional constraints, which reflect mutation pressure. Additionally, 25 preferred codons are A/U-ended in highly expressed genes of S. lemnae, whereas 21 are C/G-ended in lowly expressed genes (Table 2), which also shows that A-ended or U-ended codons are preferentially used in highly expressed genes. These results indicate that mutation pressure determines the third codon position and affects the codon usage pattern.
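As an illustration of the RSCU measure used above, the following is a minimal sketch rather than the pipeline actually used in this study. RSCU is taken in its usual sense: the observed count of a codon divided by the count expected if all synonymous codons of that amino acid were used equally. The function name `rscu` and the use of the standard genetic code are assumptions made for the example; a ciliate-specific translation table may be required for S. lemnae and can be passed in through the `table` argument.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order (an assumption for this sketch;
# ciliates may need a different table).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(sequences, table=CODON_TABLE):
    """Return {codon: RSCU} pooled over in-frame coding sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper().replace("T", "U")
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in table:
                counts[codon] += 1

    # Group sense codons into synonymous families.
    families = defaultdict(list)
    for codon, aa in table.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for aa, codons in families.items():
        if len(codons) < 2:  # single-codon amino acids (Met, Trp) are skipped
            continue
        total = sum(counts[c] for c in codons)
        if total == 0:  # amino acid not observed
            continue
        for c in codons:
            # observed count / (family total / number of synonymous codons)
            values[c] = counts[c] * len(codons) / total
    return values
```

A codon with RSCU above 1 is used more often than expected under equal usage, which is the sense in which codons are called preferred here.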
To investigate the potential influences of biological evolution (mutation pressure or natural selection) on codon usage patterns, the RSCU values of the codons in S. lemnae coding sequences were calculated and compared with those of five model organisms (H. sapiens, C. elegans, S. cerevisiae, T. thermophila and P. caudatum). We find that S. lemnae shares eight preferred codons with H. sapiens, fourteen with C. elegans, twenty-one with S. cerevisiae, twenty-one with T. thermophila, and nineteen with P. caudatum (Table 1). These results show that the similarity in codon usage pattern between S. lemnae and H. sapiens is lower than that between S. lemnae and C. elegans, S. cerevisiae, T. thermophila or P. caudatum. The codon usage bias of S. lemnae thus differs greatly from that of higher eukaryotes and resembles that of lower eukaryotes. It is generally believed that the more closely related two species are, the more similar their codon usage patterns should be.
Mutational pressure versus natural selection on codon usage pattern
To gain insight into whether the codon usage bias of S. lemnae was driven by mutational pressure alone or also by other environmental factors, we analyzed the correlation coefficients among overall nucleotide composition (A, T, G, C, GC), nucleotide composition at the third codon position (A3s, T3s, G3s, C3s, GC3s) and ENC (Table 3). The A content has a significant positive correlation with A3s and G3s, but a significant negative correlation with C, T, G, GC, C3s, T3s and GC3s. The C content has a significant positive correlation with G, GC, C3s, GC3s and ENC, but a significant negative correlation with A, T, A3s, T3s and G3s. The T content has a significant positive correlation with T3s, but a significant negative correlation with A, C, G, GC, A3s, C3s, G3s, GC3s and ENC. The G content has a significant positive correlation with C, GC, C3s, G3s, GC3s and ENC, but a significant negative correlation with A, T, A3s and T3s. The GC content has a significant positive correlation with C, G, G3s, C3s and GC3s, but a significant negative correlation with A, T, A3s and T3s. The ENC value has a significant positive correlation with C, G, GC, C3s, G3s and GC3s, a significant negative correlation with T, A3s and T3s, and no correlation with A. These results suggest that mutation pressure may affect the codon usage bias of S. lemnae.
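The kind of analysis summarized in Table 3 can be sketched as follows. This is an illustration under stated assumptions, not the procedure used here: the text does not say which correlation coefficient was computed, so Pearson's r is assumed, and third-position composition is simplified to the base frequencies over all third codon positions rather than over synonymous positions only. The function names `composition` and `pearson` are hypothetical.

```python
from math import sqrt


def composition(seq):
    """Overall and third-position base frequencies of one in-frame CDS."""
    seq = seq.upper().replace("U", "T")
    third = seq[2::3]  # every third base, starting at codon position 3
    comp = {b: seq.count(b) / len(seq) for b in "ATGC"}
    comp["GC"] = comp["G"] + comp["C"]
    for b in "ATGC":
        # Simplification: all third positions, not only synonymous ones.
        comp[b + "3"] = third.count(b) / len(third)
    comp["GC3"] = comp["G3"] + comp["C3"]
    return comp


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)
```

In use, `composition` would be applied to every gene and `pearson` to each pair of resulting columns, for example the per-gene A values against the per-gene A3 values.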
To study the relative contribution of the two major factors, natural selection and mutational pressure, to the codon usage pattern, we performed principal component analysis (PCA) on the RSCU values to identify the major trends of codon usage in S. lemnae genes. A plot of PC1 against PC2 shows important features of the codon usage pattern in S. lemnae genes (Fig. 2a). Axis1 (PC1) accounted for 13.4% of the total variation, whereas Axis2 (PC2), Axis3 and Axis4 accounted for 10.2%, 6.6% and 4.6%, respectively. Together, Axis1 to Axis4 explain 34.8% of the cumulative variance, which indicates that no single factor determines the codon usage pattern in S. lemnae.
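The share of variation attributed to each axis can be illustrated with a small PCA sketch. It assumes a matrix with one row per gene and one column per codon RSCU value, and it extracts the leading axes of the covariance matrix by power iteration with deflation so that it needs only the standard library. This is a teaching sketch, not the software used for Fig. 2; in practice a linear algebra library would normally be used. The function name `pca` and its parameters are hypothetical.

```python
from math import sqrt


def pca(rows, n_components=2, n_iter=500):
    """Return (explained_variance_ratios, axes) for a genes-by-codons matrix."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the columns.
    cov = [[sum(x[i] * x[j] for x in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    total_var = sum(cov[i][i] for i in range(p))

    ratios, axes = [], []
    for _component in range(n_components):
        v = [1.0 / (i + 1) for i in range(p)]  # arbitrary non-zero start vector
        for _step in range(n_iter):
            w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the variance along the axis just found.
        eigval = sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))
        ratios.append(eigval / total_var)
        axes.append(v)
        # Deflation: subtract the component just found before the next pass.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(p)]
               for i in range(p)]
    return ratios, axes
```

The returned ratios correspond to the per-axis percentages of total variation, and their running sum to the cumulative variance.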
To characterize the codon usage patterns of different types of genes, ribosomal-related genes were analyzed by PCA (Fig. 2b). The ribosomal genes of S. lemnae clearly cluster on the right side of PC1, indicating that compositional constraints are a major factor in codon usage bias (CUB); that is, mutation pressure mostly shaped the codon usage pattern, although other factors also contribute strongly.
Additionally, correlation analysis was performed between the first two axes and the nucleotide constraints of the S. lemnae genome (Table 4). The results show that Axis1 is positively correlated with A and A3s, whereas it is negatively correlated with C, G, GC, C3s, GC3s and ENC. Meanwhile, Axis2 is not significantly correlated with C, T, G, GC, A3s, C3s, T3s, G3s, GC3s or ENC. These results also support the key role of mutational pressure in shaping the codon usage bias of S. lemnae.
To investigate the effect of natural selection on codon usage bias in S. lemnae, correlation analysis was performed between amino acid properties (GRAVY and AROMA values) and codon bias indices (Axis1, Axis2, ENC and GC3s) (Table 5). The results indicate that Axis1 has a significant negative correlation with AROMA and GRAVY, and Axis2 has a significant positive correlation with AROMA and GRAVY, whereas the correlations with ENC, GC and GC3s are not significant. Earlier studies found that the aromaticity and hydropathy of the encoded proteins correlate significantly with the base composition of third codon positions in some prokaryotes, several eukaryotes and viral genomes [17,18,19,20,21]. However, no such correlation has been reported in any of the ciliate genomes studied so far. To our knowledge, this is the first demonstration that the aromaticity and hydrophobicity of the encoded amino acids, acting through natural selection, have an important effect on the codon usage pattern of S. lemnae.
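The two protein indices can be sketched as follows, assuming the conventional definitions: GRAVY as the mean Kyte-Doolittle hydropathy of the residues, and AROMA as the fraction of aromatic residues (Phe, Tyr, Trp). These definitions are assumed because the text does not spell them out, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein):
    """Mean hydropathy over the standard residues of a protein sequence."""
    residues = [aa for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


def aromaticity(protein):
    """Fraction of standard residues that are aromatic (F, Y or W)."""
    residues = [aa for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    return sum(aa in "FYW" for aa in residues) / len(residues)
```

Per-gene values from these two functions would then be correlated with the per-gene Axis1, Axis2, ENC and GC3s values.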
ENC-GC3s plot analysis
To study the factors (mutation pressure, natural selection) influencing the codon usage patterns of S. lemnae coding sequences, we performed ENC-GC3s plot, PR2 plot and neutrality plot analyses.
The degree of codon bias is reflected by the effective number of codons (ENC). ENC values range from 20 to 61, with codon bias increasing as the value approaches 20. Genes expressed at low levels contain numerous rare codons and have higher ENC values, approaching 61. By convention, an ENC value of 35 is used as the threshold for strong codon bias [22]. The ENC values of S. lemnae genes range from 24.8 to 61, and most exceed 35, so the codon bias of most genes is weak (Supplementary Table S2). The average GC3 content is 33.7%, and the GC1 and GC2 contents are 38.68% and 30.1%, respectively, indicating that the codons are composed mostly of A and U.
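To make the ENC scale concrete, the sketch below follows the commonly used formulation of the effective number of codons: a homozygosity F = (n Σp² − 1)/(n − 1) is computed for each amino acid, averaged within each degeneracy class, and the classes are combined as the number of single-codon amino acids plus the sum of (class size / mean F). It assumes the standard genetic code and is not claimed to be the exact implementation used for Supplementary Table S2; the function name `enc` is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order (an assumption for this sketch).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def enc(seq, table=CODON_TABLE):
    """Effective number of codons of one in-frame CDS, or None if undefined."""
    seq = seq.upper().replace("T", "U")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))

    families = defaultdict(list)
    for codon, aa in table.items():
        if aa != "*":
            families[aa].append(codon)

    # Homozygosity F per amino acid, grouped by degeneracy class.
    f_by_class = defaultdict(list)
    for codons in families.values():
        k = len(codons)
        n = sum(counts[c] for c in codons)
        if k < 2 or n < 2:
            continue
        sum_p2 = sum((counts[c] / n) ** 2 for c in codons)
        f = (n * sum_p2 - 1) / (n - 1)
        if f > 0:
            f_by_class[k].append(f)
    mean_f = {k: sum(v) / len(v) for k, v in f_by_class.items()}

    # Usual fallback when the three-fold class (Ile) is missing.
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2

    class_sizes = Counter(len(c) for c in families.values())
    nc = float(class_sizes[1])  # single-codon amino acids (Met, Trp)
    for k, size in class_sizes.items():
        if k == 1:
            continue
        if k not in mean_f:
            return None
        nc += size / mean_f[k]
    # ENC cannot exceed the number of sense codons (61 in the standard code).
    return min(nc, float(sum(len(c) for c in families.values())))
```

A gene that uses a single codon for every amino acid scores 20, and a gene that uses all synonymous codons equally reaches the upper limit of 61.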
The relationship between ENC and GC3s is shown in Fig. 3. If the ENC value of a gene lies on or just below the continuous curve of expected ENC values, its codon bias is constrained only by G3+C3 mutational bias [23]. In the figure, most genes lie above or far below the expected curve, and only a few lie on or just below it. This indicates that the codon usage patterns have been influenced by mutation pressure and other factors, including natural selection.
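The expected curve in an ENC-GC3s plot is usually drawn from the relation ENC_exp = 2 + s + 29 / (s² + (1 − s)²), where s is the GC3s fraction. The sketch below assumes that standard relation; the function names are hypothetical, and how far a gene must fall from the curve before it counts as "far below" is left to the analyst.

```python
def expected_enc(gc3s):
    """Expected ENC when codon bias is shaped by GC3s alone (gc3s in 0..1)."""
    s = gc3s
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)


def enc_deviation(observed_enc, gc3s):
    """Observed minus expected ENC; negative values lie below the curve."""
    return observed_enc - expected_enc(gc3s)
```

For example, `expected_enc(0.5)` gives 60.5, the top of the curve, and `expected_enc(0.0)` gives 31.0.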
PR2 plot analysis
The relationship between purines (A and G) and pyrimidines (T and C) at the third codon position of a subset of amino acids in each gene was analyzed by PR2 plot mapping. A gene lying at the center of the plot, where both coordinates are 0.5, shows no bias from selection or mutation pressure [24]. According to Supplementary Fig. S2, most of the genes are distributed in the upper right of the plane, indicating that the frequency of A is higher than that of T, and the frequency of G is higher than that of C. These results demonstrate that the codon usage pattern of S. lemnae is also shaped by mutation pressure and other factors, including natural selection.
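The coordinates of a gene in a PR2 plot can be sketched as follows, taking x = G3/(G3 + C3) and y = A3/(A3 + U3). Restricting the count to four-fold degenerate codon families of the standard genetic code is an assumption made for this example, since the text does not list which amino acids were used; `FOURFOLD_PREFIXES` and `pr2_point` are hypothetical names.

```python
from collections import Counter

# First two bases of four-fold degenerate families in the standard code
# (Ala, Gly, Pro, Thr, Val and the four-fold parts of Leu, Ser, Arg).
FOURFOLD_PREFIXES = {"GC", "GG", "CC", "AC", "GU", "CU", "UC", "CG"}


def pr2_point(seq, prefixes=FOURFOLD_PREFIXES):
    """Return (G3/(G3+C3), A3/(A3+U3)) for one in-frame CDS, or None."""
    seq = seq.upper().replace("T", "U")
    third = Counter(
        seq[i + 2]
        for i in range(0, len(seq) - 2, 3)
        if seq[i:i + 2] in prefixes
    )
    gc = third["G"] + third["C"]
    au = third["A"] + third["U"]
    if gc == 0 or au == 0:  # coordinates undefined for this gene
        return None
    return third["G"] / gc, third["A"] / au
```

A point with both coordinates above 0.5 falls in the upper right of the plane, meaning G exceeds C and A exceeds U at the third position.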