Comprehensive analysis of codon usage pattern of whole genome in protozoa, Stylonychia lemnae

Background: Codon usage pattern is an important evolutionary feature in genomes widely observed in many organisms. Stylonychia lemnae is a classical model single-celled eukaryote, and a quintessential ciliate typified by dimorphic nuclei: a germline micronucleus and a vegetative macronucleus. Analysis of codon usage pattern of S. lemnae macronucleus genome helps in understanding evolution at molecular level and acquires signiﬁcance in mRNA translation, design of transgenic and new gene discovery. Results: The codons of the macronucleus genome sequence of S. lemnae were analyzed and 20,750 coding sequences (CDS) were screened. The overall codon usage of S. lemnae is similar and slightly biased. The value of effective number of codons (ENC) showed that the overall extent of codon usage bias in S. lemnae is relatively high. Nucleotide analysis showed that the overall codon usage is biased toward A- and U-ending codons. The phylogenetic analysis indicated that ciliate is independent evolutionary origins from a common ancestor. The RSCU analysis showed that the codon usage pattern of S. lemnae is more similar to that of Thtrahymana thermophila and Paramecium caudatum . Correlation analysis, ENC-GC 3S plot, and PR2 plot indicated that the codon usage patterns of S. lemnae are not only influenced by mutational pressure but also by natural selection, but neutrality plot analysis showed that the latter plays a major role. C onclusions : Codon usage patterns in eukaryotes are not determined by translational efficiency, but also are determined by the genome. Our study is the ﬁrst attempt to evaluate the codon usage pattern of S.lemnae macronucleus genome to better understand the evolutionary changes. These results built the base for further research on the molecular evolution of S. lemnae .

. The sequences data were incomplete for S. lemnae codon analysis in previous studies, and it is not clear whether the influencing factor on its codon usage with each other, such as mutation pressure [14] or natural selection [15].
Heritable Variations in Stylonychia sp is reported firstly in 1915, and causing who the main fact has considered to be cumulative effects of selection [16] .But due to the lack of genomic data, the molecular mechanism of genetic change cannot be elucidated. Basing

Phylogenetic analysis of S. lemnae based on 18s ribosomal DNA (18s-rDNA)
To determine the phylogenetic relationship of S. lemnae, a phylogenetic tree was drawn

Codon composition analysis
CodonW software was used to analyze the macronucleus genome coding sequence of the S. lemnae. The GC content ranged from 22.1% to 58.6% in the gene of S. lemnae. The GC content was mainly distributed in 25% ~ 40% with an average of 33.7%, indicating that AT was enriched in the genome ( Supplementary Fig. S1).

Relative synonymous codon usage analysis
The patterns of synonymous codon usage in S. lemnae coding sequences were assessed by RSCU analysis. Among 26/ preferred codons of corresponding all amino acids (except Methionine and Tryptophan) in S. lemnae coding sequences, 26 are A/U-ended (twelve Aended; fourteen U-ended) and the remaining are C/G ended (one C-ended; one G-ended).
Therefore, most of preferentially used codons in S. lemnae are A-ended or U-ended codons (Table 1). By analyzing the 26 preferred codons, we can find that the RSCU values of six codons, GGA (G), UCA (S), CCA(P), ACU(T), GCU (A) and AGA(R) are >1.6, whereas the RSCU values of the remaining are also found to be > 1 and< 1.6 ( Table 1). Nucleotide composition (A/T-rich) and RSCU analysis (A/U-ended) show that selection of the preferred codons has been influenced by compositional constraints, which indicated that nature selection mostly, shaped its codon pattern.
Based on these data, it was found that in S. lemnae coding sequences, 25 are A/U-ended in highly expressed gene, but 21 are C/G end in lowly-expressed gene ( Table 2). It verifies A-ended or U-ended codons are preferentially used codons of highly expressed genes.
To determine the potential influences (mutation pressure or natural selection) on the codon usage patterns, the RSCU values of the codons in S. lemnae coding sequences were calculated and then were compared with five model organism (H. sapiens, C. elegans, S. cerevisiae, T. thermophila and P. caudatum). We find that preferred codons is eight between S. lemnae and H.Sapiens; fourteen between S. lemnae and C. elegans; twenty one between S. lemnae and S. cerevisiae; twenty one between S. lemnae and T.thermophila ; and nineteen between S. lemnae and P. caudatum (Table 1). In all, the similarity in codon pattern between S. lemnae and H. Sapiens is lower than that among C. elegansS. cerevisiae T.thermophila or P. caudatum. The codon usage bias of S. lemnae differs greatly from that of higher eukaryotes and is similar to that of lower eukaryotes. These results suggest that the selection pressure maybe affect the codon usage pattern of S. lemnae.

Correlation analysis
To determine whether the codon usage patterns of S. lemnae coding sequences are mainly influenced by mutation pressure or natural selection, we performed a correlation analysis between the nucleotide compositions and the third base of synonymous codons( Table 3).
The results show that the A content has a significant positive correlation with the content of A3s and G3s, but has a significant negative correlation with the content of C, T, G, GC, C3s, T3s and GC3s. The C content has a significant positive correlation with the content of G, GC, C3s, GC3s and ENC, but has a significant negative correlation with the content of A, T, A3s T3s and G3s. The T content has a significant positive correlation with T3s contents, but has a significant negative correlation with the content of A, C, G, GC, A3s, C3s, G3S, GC3s and Enc. The G content has a significant positive correlation with the content of G, GC, C3s, G3s, GC3s and ENC, but has a significant negative correlation with A, T, A3s and T3s content. The GC contents has a significant positive correlation with the content of CG, G3s, C3s, GC3s, but has a significant negative correlation with the content of A, T, A3s and T3s. The ENC value has a significant positive correlation with the content of C, G, GC, C3s, G3s and GC3s, but has a significant negative correlation with the content of T, A3s and T3s. These results indicate that compositional constraints under mutation pressure may affect the codon usage pattern for S. lemnae.
To study the relative contribution of two major factors, i.e., natural selection and mutational pressure on codon usage, we performed PCA analysis taking RSCU scores to find out major trends of codon usage in S. lemnae genes. A plot of PC1 and PC2 showed important features of the codon usage pattern in S. lemnae genes (Fig.2a). From this analysis major trends in codon usages were detected in which axis1 (PC1) accounted for 13.4%, whereas axis2 (PC2), axis3 and axis4 accounted for 10.2%, 6.6%, and 4.6% of total variation in S. lemnae. Axis1-axis4 explaining 34.8% of the cumulative variances, which indicates that no single factor influence condon usage patteren in in S. lemnae.
In order to characterize the codon usage patterns from different types of genes, ribosomal related genes was statistical analysis by PCA (Fig. 2b).It was clearly seen that ribosomal genes of S. lemnae were clustered on the right side of PC1, and indicate that compositional constraints are a major factor in CUB, that is n mutation pressure mostly shaped its codon pattern, but other factors are also powerful.
Additationlly, Correlation analysis was also performed to determine the correlations between the first two axes and nucleotide constraints of S. lemnae genome ( Table 4). The results show that the Axis1 is positively correlated with the A and A3s, whereas it is negatively correlated with the contents of C, G, GC, C3s, GC3s and ENc. Meanwhile, Axis2 is insignificant correlated with the C, T, G, GC, A3s,C3s,T3s, G3s, GC3s and ENC. Overall, these results indicating that mutation pressure has played a major role in shaping the codon usage patterns of S. lemnae genomes.
To determine the potential influence of natural selection, correlation analysis was performed between the characters of aminoacid (Gravy values and Aroma) and the codon bias (Axis1, Axis2, ENC, and GC 3s) ( Table 5). Our analysis indicates that Axe1 have a significant negative correlation with AROMA and GRAVY, and Axis 2 has a significant positively correlation with AROMA and GRAVY. Earlier it is found that AROMA and hydropathy of the encoded proteins has a significant correlation with the base composition of third codon positions in some other prokaryotes, several eukaryotes and viral genomes [17,18,19,20,21].However, there is no report of such correlation in any of the ciliate genomes studied so far. To our knowledge, this is for the first time; a correlation has been demonstrated between the synonymous codon usages in genes of S. lemnae genomes. All in, the aromaticity and hydrophobicity of amino acid have effect on the codon usage pattern of S. lemnae, which reveal that the importance of nature selection.

ENC-GC 3S plot analysis
To determine whether the codon usage patterns of S. lemnae coding sequences have been shaped by mutation pressure, natural selection or both, we constructed ENC-GC 3S plot, PR2 plot and neutrality plot analysis.
The degree of codon bias is reflected by the size of ENC value. ENC value ranges from 20 to 61, with the level of base composition bias increasing as the ENC values approach 20.
Similarly, genes expressed at low levels contain numerous rare codons, with a higher ENC value approach 61. The convention uses 35 as the criterion for biased bias [22]. The ENC values of the S. lemnae genome range from 24.8 to 61, and most of them are more than 35, so the codon bias of the gene is weak. The average GC3 content is 33.7%, GC1 and GC2 are 38.68% and 30.1%, respectively, indicating that the codon base composition is mostly A and U.
The association analysis between ENC -GC3 is shown in Fig. 3. If ENC value of genes will lie on or just below the continuous curve of the expected ENC values, it indicates that the codon bias is only constrained by a G3+ C3 mutational bias [23]. In the figure, more ENC value of genes loci are on the top or far below the curve of the expected ENC values, but a little ENC value of genes lie on or just below the expected curve. It indicates that the codon usage patterns have not only been influenced by mutation pressure, but also mainly influenced by other factors, such as natural selection.

PR2-plot plot analysis.
The relationship between purine(A and G) and primidines (T and C) of partial amino acids of each gene was analyzed by PR2-plot mapping. According to Supplementary Fig. S2, most of the genes are distributed on the high right of the plan, indicating that the frequency of A is higher than T, and the frequency of G is higher than C. If the codon bias of S. lemnae is completely affected by random mutation, it shows that A=U and G=C, that is, the use frequency of purine base is equal to that of the pyrimidine base. The use frequency of A differs from that of T, G differ from C indicate that the formation of codon bias is weakly influenced by random mutation, and is strongly influenced by mutation pressure, natural selection, and other factors in S. lemnae.

Neutral Plot Analysis
A neutrality plot was constructed to determine the extent of influence between mutation pressure and natural selection by comparing the value of GC 12 and GC 3. When the value of GC 12 is statistically significantly correlated to GC 3 and the slope of the regression line is close to 1 in the neutrality plot, mutation pressure is regarded as the main force forming the codon usage bias. Conversely, if selection is the dominant factor, then the slope of the regression line is close to 0. The analysis show that no correlation is observed between the value of GC 12 and GC 3 (r =0.286, P > 0.05) which seemed indicative of mutation pressure playing a little role in codon usage bias of S. lemnae genome (Fig.   4).then, after calculating the slope of the regression in the neutrality plot, this was the case. The slope of the regression line was calculated to be 0.2016, high-lighting the relative GC 3 (natural selection) is 79.94%, while the relative constraint on neutrality (mutation pressure) is 20.16%. Compared with mutation pressure, natural selection is the dominant factor in shaping the codon usage pattern of S. lemnae genes.

Conclusions And Discussion
According to our result, Phylogenetic relationships of S. lemnae are similar to Tetrahymena sp., Paramecium sp. and S. cerevisiae, but are far from C. elegans and H.sapien. It indicated that the origin and early evolution of many eukaryotes remain unknown or ambiguous, the codon usage pattern is non-universal and influenced by evolutionary processes.
ENC is a simple measure of the degree of codon usage bias. Generally, when the ENC value is less than 35, the codon usage bias is high in a given gene [24].From Supplementary Base composition is an important feature of a genome and is the main factor that affects codon usage pattern. The organisms with AT-rich genome, such as T. thermophila and P.
tetraurelia, tend to use A or T at the third position in coding sequence [25, 26,27].
Likewise, AT-rich genome in S. lemnae preferes A/T as the third condon. However, The The organisms with GC rich are Archea, Bacteria, Hordeum vulgare, Triticum aestivum, fungi and Oryza sativa [28,29,30].In all, mutation pressure is the main factor affecting the codon usage bias of the species [31].
Surprisingly, highly expressed genes with G/C end are significantly more frequent in Tetrahymena and Paramecium, but it with A/T end is significantly most frequent in S.
lemnae. Most optimal codons end in T or A, rather than G or C in RSCU analysis (T >A > C >G). PR2 plot shows that T and C are used more frequently than A and G in the third base of synonymous codons in S. lemnae complete coding regions. On the contrary, C and T end are preferred in the 3' end of genes in Tetrahymena and Paramecium [32].In addition, G3S and G 3S are strongly positively correlated with ENC, but are strongly negatively correlated with axis1 and axis2. However, A3S and U 3S are strongly negatively correlated with ENC, but are strongly positively correlated with axis1 and axis2. In general, the high the high A/T content in the third base of synonymous codons, related to the low ENC value, which indicate that the codon bias is relatively high. In another respect, because the A, T, C, G, GC, A3s, T3S, C3S, G3s and GC3s content have a correlation with the two principle axes, which also reveal that mutation pressure from base composition is an important factor shaping the codon usage patterns.
However, according to our PR2 plot, ENC-GC 3S plot, the codon usage patterns of S. lemnae have also been influenced by natural selection, such as the hydropathicity and the aromaticity, which plays a major role in shaping the codon usage bias by the result of neutrality plot in S. lemnae. From Table 4, AROMO and GRAVY have a negative correlation with Axis1 while which have a positive correlation with Axis1 and Axis2. It firstly indicates the role of hydropathicity and aromaticity forming the codon usage pattern of the cliate.
In short, our analysis revealed that the codon usage bias in S. lemnae is low and natural selection such as aromaticity, hydropathicity, is the main factor that affects codon usage variation in S. lemnae. Mutation pressure of nucleotide composition is also an important factor influencing codon usage bias. The evolution of S. lemnae probably reflects a dynamic process of mutation and natural selection to adapt its codon usage. This study of codon usage patterns in S. lemnae is the first reveal information about molecular evolution, and also builds the base for analysis alternative genetic codes of hypotrichous ciliates.

Methods
The complete S. lemnae genome is obtained from the S. lemnae genome database (http://stylo.ciliate.org/index.php/). A total of 21,061 coding sequences were downloaded.
Then we deleted the short sequence whose length is less than 300 bp. Finally, we extracted 20,750 CDSs which had correct start and stop codons with exact multiple of three bases to do data analysis.

Nucleotide composition analysis
The following nucleotide contents of CDS of S. lemnae macronucleus genomes were calculated by CodonW software (http://codonw.sourceforge.net), including (i) frequency of occurrence of the nucleotides (A%, C%, U%, G%, GC%); (ii)frequency of each nucleotide at the third position of the synonymous codons (A 3s %, C 3s %, U 3s %, and G 3s %); (iii) frequencies of occurrence of nucleotides G+C at the first (GC 1 ), second (GC 2 ), and third base of codon (GC 3 ); (iv) mean frequencies of nucleotides G+C at the first and second position (GC 1,2 ).

Relative Synonymous Codon Usage (RSCU) Analysis
RSCU was defined as the ratio of the observed frequency of a specific codon to the expected value, if the each codon of synonymous codons group was used equally [33].The codon with RSCU value is less than 1.0, it means that this codon has relative negative codon usage bias, while the value more than 1.0 has positive codon usage bias; when the RSCU value of codon is close to 1.0, it means that this codon is chosen equally and randomly [34]. In this study, the RSCU values of S. lemnae were calculated by CodonW software, the RSCU values for H.sapiens were retrieved from Singh R.K. [35], the RSCU values of C.elegans were retrieved from Stenico M [36], the RSCU values of S.cerevisiae were retrieved from Lloyd A.T [37], and the the RSCU values of P.caudatum and T.thermophila were retrieved from Barth D and Hannah M.W [38,39], respectively.
Effective number of codons analysis. The ENC was calculated to quantify the codon usage bias of gene and genome level, which is the best estimator of absolute codon usage bias [40,41]. ENC values ranging from 20 to 61 don't require any prior knowledge or a reference set. The value of 20 indicates extreme codon usage bias and the value to 61 indicates no bias. When the ENC value is less than or equal to 35, it is generally believed that the gene has an obvious codon bias.

Determination of optimal codons
Optimal codons definition referred to the analytical method of Liu et al [42]. Gene expression levels were determined by CAI values. The method is as follows: Calculate the CAI of each gene, arrange it of each gene based on CAI values, and take 5% of total genes with the highest CAI valuea to form a high. The RSCU values of each codon in the two data sets is calculated, and then the RSCU values of the high dataset is subtracted that of the low dataset. If difference is greater than 0.08, the corresponding codon is defined as the optimal codon. The software used in this study includes CodonW 1.4.2, SPSS 20 and Excel2016.

ENC-GC 3S plot analysis
The ENC-GC 3S plot was used to analysis the influence of the GC 3S content on codon usage. The expected ENC value for each GC 3S was calculated using the following formula:

Parity rule 2 analysis
The Parity rule 2 (PR2) plot analysis was used to explore the effects of mutation and natural selection on the codon usage of genes. In this PR2 plot, take the value of AU-bias

Neutral Plot Analysis
The neutrality plot was used to examine the extent of the effect of mutation pressure and natural selection on the codon usage patterns by plotting the GC12 values against the GC3 values. In this plot, mutation pressure is assumed to be the main force shaping codon usage when the regression line falls near the diagonal. Alternatively, the regression curve tends to tilt or parallel to the horizontal axis which indicates the dominant role of natural selection on the codon usage bias [21,45].

Principal Component Analysis (PCA).
PCA was used to analyze the major trends in synonymous codon usage of genes mutation.
We normalize the data according to the method by Sharp and Li [46]. In each PC, the score of the gth gene (yg) gene was normalized by the mean (m) and standard deviation (SD), expressed as zg = (yg -m) / SD. Highly expressed genes were a zg score greater than 5.17 (theoretically covering only the range of 1.5% of total genes). Then gene expression levels were identified as the major trends in PC score changes in genes. In addition, we analyzed the distribution of PC scores for constitutively overexpressed genes (encoding ribosomal proteins).

Phylogenetic analysis
A phylogenetic tree was constructed based on the nucleotide sequences of the coding