Phylogenetic analysis of S. lemnae based on 18S ribosomal DNA (18S rDNA)
To determine the phylogenetic position of S. lemnae, a phylogenetic tree was constructed (Fig. 1). S. lemnae clusters with Stylonychia sp. with high support (ML/BI: 86/0.76). This group is then sister to Tetrahymena sp. and S. cerevisiae with lower support (ML/BI: 78/0.76 and 60/0.45, respectively), clusters with C. elegans and H. sapiens in a poorly supported clade (ML/BI: 55/0.43), and also groups with Paramecium sp.
Codon composition analysis
CodonW was used to analyze the coding sequences of the S. lemnae macronuclear genome. The GC content of individual genes ranged from 22.1% to 58.6%, with most genes falling between 25% and 40% and an average of 33.7%, indicating that the genome is AT-rich (Supplementary Fig. S1).
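As a rough illustration of the per-gene GC calculation summarized above, a minimal sketch follows. CodonW reports this value directly; the function name and the toy sequences are hypothetical and are not taken from the study.

```python
def gc_content(cds):
    """Return the GC percentage of a coding sequence, ignoring ambiguous bases."""
    seq = cds.upper().replace("U", "T")
    counted = [b for b in seq if b in "ACGT"]  # skip symbols such as N
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for b in counted if b in "GC")
    return 100.0 * gc / len(counted)


# Hypothetical toy sequences, not genes from the study
genes = {"toy_gene_1": "ATGAAATTTGGATAA", "toy_gene_2": "ATGGCCGCAGGATGA"}
per_gene = {name: gc_content(seq) for name, seq in genes.items()}
mean_gc = sum(per_gene.values()) / len(per_gene)
```

The per-gene values give the range and distribution, and their mean gives the genome-wide average of the kind reported here.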
Relative synonymous codon usage analysis
The patterns of synonymous codon usage in S. lemnae coding sequences were assessed by RSCU analysis. Among the preferred codons for all amino acids (except methionine and tryptophan), 26 are A/U-ended (twelve A-ended; fourteen U-ended) and the remaining are C/G-ended (one C-ended; one G-ended). Therefore, most of the preferentially used codons in S. lemnae are A-ended or U-ended (Table 1). Of the 26 preferred codons, six have RSCU values > 1.6, namely GGA (G), UCA (S), CCA (P), ACU (T), GCU (A) and AGA (R), whereas the remaining codons have RSCU values > 1 and < 1.6 (Table 1). Together, the A/T-rich nucleotide composition and the A/U-ended preferred codons show that the choice of preferred codons has been influenced by compositional constraints, indicating that natural selection has mostly shaped the codon usage pattern.
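For reference, RSCU is the observed count of a codon divided by the mean count of all synonymous codons for the same amino acid, so a value above 1 marks a codon used more often than expected under equal usage. The sketch below assumes the standard genetic code for simplicity; ciliate genomes often use variant nuclear codes, so the table would need to be swapped accordingly. The function name and toy input are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code (an assumption; replace for a variant nuclear code)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """Return {codon: RSCU} pooled over in-frame coding sequences."""
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper().replace("U", "T")
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for aa, codons in families.items():
        if aa == "*" or len(codons) == 1:  # skip stops, Met and Trp
            continue
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        for c in codons:
            # observed count / (family total / family size)
            values[c] = counts[c] * len(codons) / total
    return values


toy = rscu(["GGAGGAGGAGGT"])  # hypothetical input: three GGA and one GGT
```

In this toy input GGA accounts for three of the four glycine codons, so its RSCU is 3.0, and the values within one synonymous family always sum to the family size.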
Based on these data, 25 of the preferred codons are A/U-ended in highly expressed genes, whereas 21 are C/G-ended in lowly expressed genes (Table 2). This confirms that A-ended or U-ended codons are preferentially used in highly expressed genes.
To determine the potential influences (mutation pressure or natural selection) on the codon usage patterns, the RSCU values of the codons in S. lemnae coding sequences were calculated and compared with those of five model organisms (H. sapiens, C. elegans, S. cerevisiae, T. thermophila and P. caudatum). S. lemnae shares eight preferred codons with H. sapiens, fourteen with C. elegans, twenty-one with S. cerevisiae, twenty-one with T. thermophila and nineteen with P. caudatum (Table 1). Overall, the similarity in codon usage pattern between S. lemnae and H. sapiens is lower than that between S. lemnae and C. elegans, S. cerevisiae, T. thermophila or P. caudatum. The codon usage bias of S. lemnae therefore differs greatly from that of higher eukaryotes and resembles that of lower eukaryotes. These results suggest that selection pressure may affect the codon usage pattern of S. lemnae.
Correlation analysis
To determine whether the codon usage patterns of S. lemnae coding sequences are mainly influenced by mutation pressure or natural selection, we performed a correlation analysis between the overall nucleotide composition and the composition at the third position of synonymous codons (Table 3). The A content has a significant positive correlation with A3s and G3s, but a significant negative correlation with C, T, G, GC, C3s, T3s and GC3s. The C content has a significant positive correlation with G, GC, C3s, GC3s and ENC, but a significant negative correlation with A, T, A3s, T3s and G3s. The T content has a significant positive correlation with T3s, but a significant negative correlation with A, C, G, GC, A3s, C3s, G3s, GC3s and ENC. The G content has a significant positive correlation with C, GC, C3s, G3s, GC3s and ENC, but a significant negative correlation with A, T, A3s and T3s. The GC content has a significant positive correlation with C, G, G3s, C3s and GC3s, but a significant negative correlation with A, T, A3s and T3s. The ENC value has a significant positive correlation with C, G, GC, C3s, G3s and GC3s, but a significant negative correlation with T, A3s and T3s. These results indicate that compositional constraints under mutation pressure may affect the codon usage pattern of S. lemnae.
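A correlation of this kind can be sketched as follows. This is a simplified illustration, not the procedure used in the study: the third-position fraction here counts every codon, whereas indices such as A3s are restricted to synonymous codons, and no significance test is included. All function names are hypothetical.

```python
import math


def base_fraction(cds, base):
    """Fraction of the whole sequence made up of `base`."""
    seq = cds.upper().replace("U", "T")
    return seq.count(base) / len(seq)


def third_position_fraction(cds, base):
    """Fraction of third codon positions occupied by `base` (all codons)."""
    seq = cds.upper().replace("U", "T")
    seq = seq[: len(seq) - len(seq) % 3]  # drop a trailing partial codon
    thirds = seq[2::3]
    return thirds.count(base) / len(thirds)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("one variable has no variance")
    return sxy / math.sqrt(sxx * syy)
```

Applied across genes, `pearson_r` on the per-gene A fractions and A3 fractions would give one cell of a table like Table 3.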
To study the relative contributions of the two major factors, natural selection and mutation pressure, to codon usage, we performed a principal component analysis (PCA) on RSCU values to identify the major trends of codon usage in S. lemnae genes. A plot of PC1 against PC2 shows the main features of the codon usage pattern (Fig. 2a). Axis 1 (PC1) accounted for 13.4% of the total variation, whereas axis 2 (PC2), axis 3 and axis 4 accounted for 10.2%, 6.6% and 4.6%, respectively. Axes 1 to 4 together explain 34.8% of the cumulative variance, which indicates that no single factor determines the codon usage pattern in S. lemnae.
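The share of variance carried by the first axis can be illustrated with a small dependency-free sketch. It assumes a genes x codons matrix of RSCU values and uses power iteration on the covariance matrix; in practice a library PCA routine would be more robust, and the function name is hypothetical. The last lines only re-add the percentages reported above.

```python
import math


def pc1_explained_fraction(rows, iterations=500):
    """Fraction of total variance captured by PC1 of a samples x features matrix."""
    n, p = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two rows")
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    total = sum(cov[i][i] for i in range(p))  # trace = total variance
    if total == 0:
        raise ValueError("data have no variance")

    # Non-uniform start vector; power iteration can still fail if the start
    # happens to be orthogonal to the leading eigenvector.
    v = [float(i + 1) for i in range(p)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]

    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue / total


# Percentages of variation reported for axes 1 to 4
reported_axes = [13.4, 10.2, 6.6, 4.6]
cumulative = sum(reported_axes)  # about 34.8
```

When the leading axes each carry only a small share, as here, the variation is spread over many directions rather than dominated by one.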
To characterize the codon usage patterns of different types of genes, ribosome-related genes were analyzed by PCA (Fig. 2b). The ribosomal genes of S. lemnae clearly cluster on the right side of PC1, which indicates that compositional constraints are a major factor in codon usage bias (CUB); that is, mutation pressure has mostly shaped the codon pattern, although other factors also contribute strongly.
Additionally, correlation analysis was performed between the first two axes and the nucleotide composition of the S. lemnae genome (Table 4). Axis 1 is positively correlated with A and A3s, whereas it is negatively correlated with C, G, GC, C3s, GC3s and ENC. Axis 2, in contrast, is not significantly correlated with C, T, G, GC, A3s, C3s, T3s, G3s, GC3s or ENC. Overall, these results indicate that mutation pressure has played a major role in shaping the codon usage patterns of the S. lemnae genome.
To determine the potential influence of natural selection, correlation analysis was performed between amino acid properties (GRAVY and AROMA values) and codon bias (Axis 1, Axis 2, ENC and GC3s) (Table 5). Axis 1 has a significant negative correlation with AROMA and GRAVY, and Axis 2 has a significant positive correlation with AROMA and GRAVY. The aromaticity and hydropathy of encoded proteins have previously been found to correlate significantly with the base composition of third codon positions in some prokaryotes, several eukaryotes and viral genomes [17,18,19,20,21]. However, no such correlation has been reported in any of the ciliate genomes studied so far. To our knowledge, this is the first time such a correlation has been demonstrated for synonymous codon usage in S. lemnae genes. Overall, the aromaticity and hydrophobicity of amino acids affect the codon usage pattern of S. lemnae, which reveals the importance of natural selection.
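The two protein indices can be sketched as below, assuming the usual definitions: GRAVY as the mean Kyte-Doolittle hydropathy over all residues, and aromaticity as the fraction of Phe, Tyr and Trp residues. The function names are hypothetical, and the exact definitions used by the software in the study are not stated in the text.

```python
# Kyte-Doolittle hydropathy values (assumed scale)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein):
    """Mean hydropathy of a protein sequence; unknown residues are skipped."""
    scores = [KD[aa] for aa in protein.upper() if aa in KD]
    if not scores:
        raise ValueError("no scorable residues")
    return sum(scores) / len(scores)


def aromaticity(protein):
    """Fraction of residues that are Phe, Tyr or Trp."""
    residues = [aa for aa in protein.upper() if aa in KD]
    if not residues:
        raise ValueError("no scorable residues")
    return sum(aa in "FYW" for aa in residues) / len(residues)
```

Per-gene values of these two indices are what would be correlated against Axis 1, Axis 2, ENC and GC3s.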
ENC-GC3s plot analysis
To determine whether the codon usage patterns of S. lemnae coding sequences have been shaped by mutation pressure, natural selection or both, we performed ENC-GC3s plot, PR2 plot and neutrality plot analyses.
The degree of codon bias is reflected by the ENC value, which ranges from 20 to 61; codon usage bias becomes stronger as the ENC value approaches 20. Conversely, genes expressed at low levels contain numerous rare codons and have higher ENC values, approaching 61. By convention, an ENC value of 35 is used as the threshold for strong codon bias [22]. The ENC values of S. lemnae genes range from 24.8 to 61, and most of them exceed 35, so the codon bias of these genes is weak. The average GC3 content is 33.7%, and GC1 and GC2 are 38.68% and 30.1%, respectively, indicating that codons are composed mostly of A and U.
The relationship between ENC and GC3 is shown in Fig. 3. If the ENC values of genes lie on or just below the continuous curve of expected ENC values, codon bias is constrained only by a G3 + C3 mutational bias [23]. In the figure, most genes lie above or far below the expected curve, and only a few lie on or just below it. This indicates that the codon usage patterns have been influenced not only by mutation pressure but also, and mainly, by other factors such as natural selection.
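The expected curve in such a plot is commonly drawn from the relation ENC = 2 + s + 29 / (s^2 + (1 - s)^2), where s is the GC3s fraction. The sketch below assumes this standard form, which the text does not spell out; the function names are hypothetical.

```python
def expected_enc(gc3s):
    """Expected ENC under mutational bias alone, for GC3s given as a fraction."""
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must be a fraction between 0 and 1")
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)


def enc_shortfall(observed_enc, gc3s):
    """How far a gene lies below the expected curve (negative if above it)."""
    return expected_enc(gc3s) - observed_enc


# Points for drawing the expected curve across the full GC3s range
curve = [(s / 100.0, expected_enc(s / 100.0)) for s in range(0, 101)]
```

A gene with a shortfall near zero sits on the curve, while a large positive shortfall marks a gene whose bias is stronger than composition alone would predict.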
PR2 plot analysis
The relationship between purines (A and G) and pyrimidines (T and C) at the third codon position of selected amino acids in each gene was analyzed with a PR2 plot. According to Supplementary Fig. S2, most genes are distributed in the upper right of the plane, indicating that A is more frequent than T and G is more frequent than C. If the codon bias of S. lemnae were shaped entirely by random mutation, then A = U and G = C would be expected; that is, purines and pyrimidines would be used at equal frequency. The unequal use of A and T, and of G and C, indicates that codon bias in S. lemnae is only weakly influenced by random mutation and is strongly influenced by mutation pressure, natural selection and other factors.
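One common way to place a gene on a PR2 plot is to take third-position counts from four-fold degenerate codon families and compute G3/(G3 + C3) for the x axis and A3/(A3 + T3) for the y axis. The sketch below assumes that convention and the standard genetic code; which amino acids were used in the study is not stated, and the names are hypothetical.

```python
# First two bases of four-fold degenerate families in the standard code:
# Ala, Gly, Pro, Thr, Val, Leu (CUN), Arg (CGN), Ser (UCN)
FOURFOLD_PREFIXES = {"GC", "GG", "CC", "AC", "GT", "CT", "CG", "TC"}


def pr2_point(cds):
    """Return (G3/(G3+C3), A3/(A3+T3)) over four-fold degenerate codons."""
    seq = cds.upper().replace("U", "T")
    counts = dict.fromkeys("ACGT", 0)
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if codon[:2] in FOURFOLD_PREFIXES and codon[2] in counts:
            counts[codon[2]] += 1
    gc = counts["G"] + counts["C"]
    at = counts["A"] + counts["T"]
    if gc == 0 or at == 0:
        raise ValueError("not enough four-fold degenerate codons")
    return counts["G"] / gc, counts["A"] / at


x, y = pr2_point("GCAGCAGCTGGGGGC")  # hypothetical toy sequence
```

The point (0.5, 0.5) corresponds to A = T and G = C; values above 0.5 on both axes place a gene in the upper right of the plane.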
Neutrality plot analysis
A neutrality plot was constructed to determine the relative influence of mutation pressure and natural selection by comparing GC12 with GC3. When GC12 is significantly correlated with GC3 and the slope of the regression line is close to 1, mutation pressure is regarded as the main force shaping codon usage bias. Conversely, if selection is the dominant factor, the slope of the regression line is close to 0. No significant correlation is observed between GC12 and GC3 (r = 0.286, P > 0.05), which suggests that mutation pressure plays only a minor role in the codon usage bias of the S. lemnae genome (Fig. 4). The slope of the regression line confirms this. The slope is 0.2016, which means that the relative constraint on GC3 (natural selection) is 79.84%, while the relative neutrality (mutation pressure) is 20.16%. Compared with mutation pressure, natural selection is therefore the dominant factor in shaping the codon usage pattern of S. lemnae genes.
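The calculation behind a neutrality plot can be sketched as follows, assuming GC12 is the mean of the GC fractions at codon positions 1 and 2, that GC12 is regressed on GC3 by ordinary least squares, and that the slope and its complement are read as the shares of mutation pressure and selection. The function names are hypothetical.

```python
def gc12_gc3(cds):
    """Return (GC12, GC3) as fractions for one in-frame coding sequence."""
    seq = cds.upper().replace("U", "T")
    seq = seq[: len(seq) - len(seq) % 3]  # drop a trailing partial codon
    if not seq:
        raise ValueError("sequence shorter than one codon")

    def gc(pos):
        bases = seq[pos::3]
        return sum(b in "GC" for b in bases) / len(bases)

    return (gc(0) + gc(1)) / 2.0, gc(2)


def regression_slope(x, y):
    """Ordinary least-squares slope of y on x (here: GC12 on GC3)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x has no variance")
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def neutrality_split(slope):
    """Return (mutation pressure %, natural selection %) from the slope."""
    return 100.0 * slope, 100.0 * (1.0 - slope)


mutation_pct, selection_pct = neutrality_split(0.2016)  # slope reported above
```

A slope near 1 would assign nearly all of the variation to mutation pressure, whereas a slope near 0 assigns it to selection.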