LTR/TEs have been shown to make up bulk of many monocotylous plant genomes, and they are in general randomly inserted across the genome (26). This TE insertion mechanism makes a big trouble because the earlier inserted TEs could be fragmented or truncated by the later insertions at stochastic sites, as a matter of fact huge number of TE fragments are observed in plant genomes (3,26,27). It is rational to compare the nucleotide changes between the two LTR sequences of one complete TE to estimate the insertion time point in the evolution, while it is a big challenge to define the truncated TE pieces without intact LTR sequences (28). Considering the huge number of TEs caused by sequential burst events and the stochastic nucleotide substitutions under low nature selection stress, we developed the pipeline Matrix-TE with GPDF analysis to successfully synthesized the intact and truncated TEs and calculated the Ks of TE burst events in rice genomes (Figure 1).
Matrix-TE approach optimized and synthesized the scripts in LTR_finder, EMBOSSY, BioEdit, MEGA and BLASTN packages, to integrate the most abundant TE super families in one super matrix, and quantitatively investigate the individual TE content.
TE matrix and clusters
Rice genome and TE statistics. LTR/TEs ORF length and clusters were presented in Figure 2 (A,B,C,D). The two cultivated Oryza sativa rice genome sizes, gene size and total TE size were generally similar for 93-11 and Nip accessions (29). The number of annotated full length LTR/TE, ORFs over 1500bp and ORFs in super matrix were more in Indica than Japonica rice genomes (Figure 2. E). 2010 of 93-11 and 1520 of Nip ORFs which containing the majority of TE clusters were presented in the super matrix. In which the length of individual TE cluster were the same (Figure 2. A,B), and the identity of individual cluster were greater than 95% (Figure 2. C,D).
Then the TE clusters were extracted and annotated, 93-11 and Nip genomes contained 8 and 6 clusters separately (Figure 2. F). They shared the same 6 TE clusters with the annotation OS-type1 (C1), OS-type2 (C2), OS-typeRT (C3), OS-typePHA (C4), Ty1/Copia (C5) and Ty3/Gypsy (C6). In which the OS-type1 (C1) and OS-type2 (C2) TEs were unclassified, and Cluster 6 (C6), Cluster 7 (C7) and Cluster 8 (C8) in 93-11 genome were 3 subfamilies of Ty3/Gypsy and they were combined as C6C7C8 in the following analysis.
Phylogenetic trees and reference sequences of TE clusters. The 6 TE clusters of 93-11 and Nip genomes were constructed phylogenetic trees separately (Figure 3. A,B). ref-C1, ref-C2, ref-C3, and ref-C4 and ref-C5 on top branch of the trees were selected as the reference TE sequences for both 93-11 and Nip. ref-C6C7C8 and ref-C6 on top branch of the trees were selected as the reference TE sequences for 93-11 and Nip, separately. And the sequences of all the reference TE sequences were presented in Supplementary Table S1.
Genome scanning by ref-LTR/TE and TE peak differences. 93-11 and Nip genomes were scanned with the reference TE sequences extracted from 93-11 and Nip accessions, separately. All the TE super families including intact and truncated sequences were classified and combined according to their annotations (Figure 4. A). The histogram showed that OS-type1, OS-type2, OS-typeRT and OS-typePHA TE super families were almost the same contents between 93-11 and Nip genomes. While the contents of Ty1/Copia and Ty3/Gypsy super families were significantly different between 93-11 and Nip genomes (Figure 4. B,C). All the Ty1/Copia and Ty3/Gypsy sequences including intact and truncated TE were normalized to distribution curves according to their identities with reference TE sequences. And the individual peaks P-Copia and P-Gypsy were observed in Nip and 93-11 genomes, separately (Figure 4. D,E). P-Copia peak was quite sharp in Nip, while weak in 93-11 genome. P-Gypsy peak was strong in 93-11, but hardly observed in Nip genome.
Stochastic SNP distribution on TE and GPDF analysis
SNP distribution across TE. Although the nucleotide substitution rate is usually low in plants, rice genome accumulated abundant SNPs in the genome sequences, especially in TE sequences as they experienced very low nature selection stress in million years of evolution (3,30,31,32). SNP distribution was observed across Ty1/Copia and Ty3/Gypsy coding sequences in both 93-11 and Nip genomes, without conserved regions. And the SNP level was little higher for Ty3/Gypsy gene body than Ty1/Copia gene body (Figure 5. A,B).
GPDF fit of P-Copia and P-Gypsy. On the bases of stochastic nucleotide substation analysis of TEs, the individual identity distribution peaks P-Copia and P-Gypsy were extracted from Figure 4, and the peak curves were fitted well by the mathematic model Gaussian Probability Density Function, with high R square values (Figure 5. C,D). The average nucleotide substitution ratio (Ks) of peaks P-Copia and P-Gypsy were calculated as 2.58σ (3). By GPDF analysis, the Ks of P-Copia was 0.0043, while the Ks of P-Gypsy was 0.018.
Rice LTR/TE transcription activities
RNA-seq data mapping. Each of the total RNA for mild condition, hot, cold and salt stress treated samples was sequenced 20 million paired end reads, and the sequencing data were mapped on rice genomes. For both 93-11 and Nip genomes, majority over 95% reads were mapped in gene coding regions, while few reads less than 0.5% were mapped in LTR/TE regions (Table 1). Further indicating the low transcription activities of LTR/TEs in rice species (33).
Table 1
mapping rate of the RNA-seq data for mild condition, hot, cold and salt stress treated rice samples
|
Reads in genes (% total reads)
|
Reads in TE ORFs (% total reads)
|
|
mild
|
42℃
|
5℃
|
NaCl
|
mild
|
42℃
|
5℃
|
NaCl
|
Indica (93-11)
|
95.75
|
95.16
|
95.11
|
95.51
|
0.18
|
0.20
|
0.20
|
0.15
|
Japonica (Nip)
|
97.57
|
97.35
|
97.41
|
97.05
|
0.18
|
0.42
|
0.26
|
0.27
|
Full length LTR/TE cloning from cDNA. 13 full length LTR/TE sequences were cloned from cDNA in 93-11 rice, while 20 full length LTR/TE sequences were cloned from cDNA in Nip rice, under stress treatment conditions. The agarose gel results presented the full length cDNA clones of LTR/TEs, generally the size of transcribed TEs were distributed from 1 kb to 5 kb under variable stress conditions (Figure 6). The length and sequence details of the LTR/TE clones were presented in Supplementary Table S2 and Table S3.
Expression level of rice LTR/TEs under stress treatments. The classified and annotated TE clones were applied to quantification. The RT domain (reverse transcriptase) sequences were amplified in the RT-PCR (2). And the quantified TE expression levels were shown in Figure 7. With the TE expressions under mild conditions as control, significant TE expressions induced by salt and hot stresses were observed in 93-11 rice plants, while hot and cold stressed induced TE expressions were observed in Nip rice plants.