Comparative Evolution and Transcription Activity of Transposable Elements in Indica and Japonica Rice

Transposable elements (TEs) are able to diversify plant gene expression and function, sequentially promote plant variety and evolution. However, there is lack of ecient approach to investigate the evolution behavior and transcription activity of TEs in plants. Here we developed a pipeline Matrix-TE to comprehensively evaluate the super-families, differentiation and transcription activity of LTR/TEs in Indica and Japonica rice, the two considerable important and closely related monocots.


Background
TEs have been extensively studied for decades, and are extremely important for monocotylous and dicotyledonous plants, as TEs cause plant gene expression and function changes by insert mutations and regulatory element introduction (1). With recent advances in genomics and bioinformatics, it becomes reality to systematically evaluate the role of TEs for plant genome evolution (2,3). The long terminal repeat transposable elements (LTR/TEs) are retrotransposons which make up the majority of total TEs in most plant genomes (4). Owing to the huge copy number accumulation of LTR/TEs in plant genomes, comprehensive interpreting of the insertion time point and insertion quantity would help to determine the biological role of LTR/TEs for plant evolution history (3).
The genomes of the most widely cultivated Oryza sativa subspecies Indica and Japonica were constructed and updated in the past two decades, genomic variations and wild rice genomes were sequentially studied to reveal the origin and evolution of genus Oryza (5)(6)(7)(8)(9)(10)(11). The most abundant Ty1/Copia and Ty3/Gypsy were class I retrotransposons in rice genomes and they went through very low nature selection constraints after the insertion in the genome. Then they diverged from the ancestral sequences and become pretty polymorphism over million years of evolution (6). Single nucleotide polymorphisms (SNPs), structural variations and pan-genome constructing represent rice species genetic diversity, further providing new perspectives on rice diversity and evolutionary history (7). The Oryza species span ~ 15 million years of evolution, and the divergence time of cultivated Indica and Japonica rice was estimated at 0.55 million years ago (Mya) (8). Comparative annotation and phylogenetic analysis of LTR-RTs in domesticated and wild rice developed by Joshua C. Stein et. al. revealed several lineage-speci c TE burst events within 2.5 million years (8). Molecular phylogenetic studies have been extensively carried out to investigate the relationships of Indica and Japonica rice, and the results indicated that they were originated independently (9). A well assembled Indica genome displayed signi cantly more solo-LTRs sequences compared with Japonica, indicating the two subspecies had experienced independent ampli cation or loss of LTRs after their divergence (10). It is essential to understand the LTR/TEs differences and similarities for the comprehensive interpreting of rice genome origin and evolution (11).
LTR/TEs content are abundant in monocotylous plants, particularly in wheat and maize as Ty3/Gypsy and Ty1/Copia super families constitute more than half of the entire genome sequences (12)(13)(14)(15)(16). Large quantity of complete and truncated LTR/TEs were identi ed in wheat D and A subgenomes, intact and solo LTR sequences were applied to the estimation of TE insertion time. Sequential bursts of TEs followed by silencing over the past 3 million years were established by this approach (12,13). The comparison of the TEs in the maize B73, W22 and Mo17 genomes revealed high level of variations even in very close subspecies (14,15,16). TE mediated gene silencing in maize indicated the epigenetic regulation activities in development (17).
Most TEs of various plant species have accumulated huge number of mutations and truncation events in the evolution time course, however a number of TE families remain active and might generate new insertions in some genomes (18)(19)(20)(21)(22)(23). Non-autonomous DNA transposons MITEs have been identi ed in both Indica and Japonica rice, and they are ampli ed during stress activations (19,23). The repetitive nature and variable mutations make TEs challenging to analyze in genomic studies. RNA-seq based procedures have been developed to screen active L1 elements in animal and plant cells (21,22). Analysis of 3000 rice genomes demonstrated that mPing TEs are likely active in the past century (24). A software named TRACK-POSON were developed to detect the insertion polymorphisms of TEs in rice genomes (25). Here we developed the approach Matrix-TE to systematically and quantitatively study the complete and truncated TEs in very close subspecies Indica and Japonica rice genomes, based on BLASTN and GPDF algorithms (Fig. 1). All LTR/TE super families were scanned and restricted in one super matrix by their ORFs and sequence identities, then classi ed and phylogenetic clustered. The peaks of TEs were analyzed by Gaussian Probability Density Function, and the nucleotide substitution ratio Ks of the subspecies individual TE peaks were calculated. Finally, stress induced expression of TEs in Indica and Japonica rice were cloned and quanti ed.
Results LTR/TEs have been shown to make up bulk of many monocotylous plant genomes, and they are in general randomly inserted across the genome (26). This TE insertion mechanism makes a big trouble because the earlier inserted TEs could be fragmented or truncated by the later insertions at stochastic sites, as a matter of fact huge number of TE fragments are observed in plant genomes (3,26,27). It is rational to compare the nucleotide changes between the two LTR sequences of one complete TE to estimate the insertion time point in the evolution, while it is a big challenge to de ne the truncated TE pieces without intact LTR sequences (28). Considering the huge number of TEs caused by sequential burst events and the stochastic nucleotide substitutions under low nature selection stress, we developed the pipeline Matrix-TE with GPDF analysis to successfully synthesized the intact and truncated TEs and calculated the Ks of TE burst events in rice genomes ( Figure 1).
Matrix-TE approach optimized and synthesized the scripts in LTR_ nder, EMBOSSY, BioEdit, MEGA and BLASTN packages, to integrate the most abundant TE super families in one super matrix, and quantitatively investigate the individual TE content.

TE matrix and clusters
Rice genome and TE statistics. LTR/TEs ORF length and clusters were presented in Figure 2 Supplementary  Table S1.
Genome scanning by ref-LTR/TE and TE peak differences. 93-11 and Nip genomes were scanned with the reference TE sequences extracted from 93-11 and Nip accessions, separately. All the TE super families including intact and truncated sequences were classi ed and combined according to their annotations ( Figure 4. A). The histogram showed that OS-type1, OS-type2, OS-typeRT and OS-typePHA TE super families were almost the same contents between 93-11 and Nip genomes. While the contents of Ty1/Copia and Ty3/Gypsy super families were signi cantly different between 93-11 and Nip genomes ( Figure 4. B,C). All the Ty1/Copia and Ty3/Gypsy sequences including intact and truncated TE were normalized to distribution curves according to their identities with reference TE sequences. And the individual peaks P-Copia and P-Gypsy were observed in Nip and 93-11 genomes, separately (Figure 4. D,E). P-Copia peak was quite sharp in Nip, while weak in 93-11 genome. P-Gypsy peak was strong in 93-11, but hardly observed in Nip genome.
Stochastic SNP distribution on TE and GPDF analysis SNP distribution across TE. Although the nucleotide substitution rate is usually low in plants, rice genome accumulated abundant SNPs in the genome sequences, especially in TE sequences as they experienced very low nature selection stress in million years of evolution (3,30,31,32). SNP distribution was observed across Ty1/Copia and Ty3/Gypsy coding sequences in both 93-11 and Nip genomes, without conserved regions. And the SNP level was little higher for Ty3/Gypsy gene body than Ty1/Copia gene body ( Figure  5. A,B).
GPDF t of P-Copia and P-Gypsy. On the bases of stochastic nucleotide substation analysis of TEs, the individual identity distribution peaks P-Copia and P-Gypsy were extracted from Figure 4, and the peak curves were tted well by the mathematic model Gaussian Probability Density Function, with high R square values ( Figure 5. C,D). The average nucleotide substitution ratio (Ks) of peaks P-Copia and P-Gypsy were calculated as 2.58σ (3). By GPDF analysis, the Ks of P-Copia was 0.0043, while the Ks of P-Gypsy was 0.018.

Rice LTR/TE transcription activities
RNA-seq data mapping. Each of the total RNA for mild condition, hot, cold and salt stress treated samples was sequenced 20 million paired end reads, and the sequencing data were mapped on rice genomes. For both 93-11 and Nip genomes, majority over 95% reads were mapped in gene coding regions, while few reads less than 0.5% were mapped in LTR/TE regions (Table 1). Further indicating the low transcription activities of LTR/TEs in rice species (33). conditions. The agarose gel results presented the full length cDNA clones of LTR/TEs, generally the size of transcribed TEs were distributed from 1 kb to 5 kb under variable stress conditions ( Figure 6). The length and sequence details of the LTR/TE clones were presented in Supplementary Table S2 and Table  S3.

Discussion
Advances in sequencing technology have been extensively applied to resolve the complex plant genomic regions accurately (34). Particularly the long sequencing reads generated by single-molecule technology, together with accurate short read sequencing technologies have dramatically improved representations of TE repeats in plant genomes (35). The huge number copies and non-conserved nucleotide substitution sites nature of TEs make them rough to be systematically and quantitatively studied. We established the approach Matrix-TE based on BLASTN and GPDF algorithms to comprehensively evaluate TE burst events and decline behaviors in plant genomes. Actually we have applied the primary method in cotton genome evolution analysis this year (3), and now it has been improved and optimized thoroughly. For Matrix-TE approach, it has overcome the issues including full length and fragmented piece TEs, and the stochastic nucleotide substitution sites in the coding regions. By using GPDF to t TE identity distribution curves, it is suitable to apply Matrix-TE in big plant genome analysis.
Six TE super families were identi ed for Indica and Japonica rice genomes in the super matrix, indicating the six types of TEs have underwent burst events in the rice genome evolution (Fig. 2). However, no obvious TE burst event was detected in Arabidopsis genome by this method, it might be ascribed to the much smaller Arabidopsis genome size (Supplementary Figure S1) (36). As it has been described that the genome size expansion of Malvales plants was highly correlated with TE contents (3). We used the ORFs of full length TEs to generate the super matrix, and extracted the clusters with high identities (95%) to gure out ref-LTR/TEs. This provided a much e cient strategy to do TE analysis, because any truncated or fragmented TE sequence pieces with diverse SNPs could be scanned out and collected by BLASTN algorithm for the subsequent steps.
The Ty1/Copia and Ty3/Gypsy distribution curves showed signi cantly differences between Indica and Japonica genomes, while the other four types of TEs were almost the same distributed in the two genomes ( Fig. 4. D,E and Supplementary Figure S2. A,B,C,D). It is quite interesting that individual TE peaks P-Copia and P-Gypsy were observed for Japonica and Indica rice genome separately. And it is strongly suggested that more Ty1/Copia TE insertions in Japonica genome, while more Ty3/Gypsy TE insertions in Indica genome (Fig. 4. A,B,C).
Considering the stochastic nucleotide substitutions across TE gene body and low TE transcription activities in rice gnomes, we attempted to resolve the TE distribution curves by Gaussian Probability Density Function (Fig. 5. A,B and Table 1). The intense issue was the normalizing of intact and truncated TE pieces quantitatively, therefore we generated the raw data points of TE identities with 30 bp units, as the optimized setting of short DNA sequences for BLASTN was 30 bp in practice (37). Then we found the symmetrical single peak of TE unit identity distribution curve was tted by GPDF perfectly, indicating GPDF was an appropriate model for stochastic events (Fig. 5. C,D).
The transcription activity of LTR/TEs was hardly observed in plants, mainly caused by the dilution of RNA-seq mapping intensity by numerous TE repeat copies (2). However, some approaches were developed to rescue the TE transcript pro ling (38). We assumed the LTR/TEs with full length ORFs which encoded functional proteins were potentially active (2), the RNA-seq data and TE cDNA clones of rice samples support this hypothesis (Table 1 and Fig. 6). Then the stress induced LTR/TEs transcription were quanti ed by the RT domain sequences ampli cation (39). Particularly the high induced expression levels of LTR/TE were observed in hot stress treated rice plants ( Fig. 7. A,B). However, the potential biological functions of these induced LTR/TE transcripts were expected in the future.

Conclusions
Generally, we developed the approach Matrix-TE on the basis of BLASTN and GPDF algorithms, and applied it to comprehensively and quantitatively investigate LTR/TE types and contents in the close subspecies Indica and Japonica rice genomes. The individual TE burst events P-Copia and P-Gypsy were observed in Japonica and Indica rice, separately. RNA-seq and RT-PCR methods indicated that LTR/TE transcripts were induced by hot, cold and high salt stress conditions. The optimized Matrix-TE approach and procedures probably could be used in other plant species with big genomes like wheat and maize.

TE Matrix generation and GPDF analysis
Indica and Japonica genomes.
The two cultivated Oryza Sativa Indica (93-11) and Japonica (Nip) genomes were applied for the analysis pipeline ( Figure 1). The genome assembly ID were GCA_003865215.1 for Indica (93-11) and GCA_003865235.1 for Japonica (Nip), separately. As the two updated genomes were assembled by PacBio with long reads sequencing technologies (29). Then the full length LTR/TEs of 93-11 and Nip genomes were annotated by LTR_ nder software with default parameters (3).
TE ORFs matrix generation.
The ORFs were extracted from the previous full length LTR/TEs by Getorf Script in EMBOSS package. All the ORFs were sorted by length from longest to shortest, and the ORFs with length less than 1500 bp were discarded. Then the sequence identity matrix was calculated by BioEdit software with default parameters (2,3).
ORF clustering and phylogenetic analysis.
The LTR/TE ORFs in the previous matrix with sequence identity over 95% were extracted and grouped by their sequence length and homology. The molecular phylogenetic trees of the ORF groups were generated by MEGA software with low homologous sequences as the root. The ORF sequences on the top branch of the phylogenetic trees were determined as the reference TE sequences (ref-TE-seq) in the following steps (3).
Genome scanning and GPDF analysis.
The Indica (93-11) and Japonica (Nip) genomes were scanned by the reference TE sequences with BLASTN software and the following parameters: The blast result sequences containing complete and truncated TE sequences with identity relative to ref-TE-seq, were fragmented into 30 bp units and the sequence identity of each unit was a data point in the following step (3). All the data points for the individual TE super family were used to generate an identity distribution curve. Then the peaks of the curve were tted by Gaussian Probability Density Function and the average nucleotide substitution ratio Ks of each TE peak were de ned as 2.58 standard deviations (σ).
Single nucleotide polymorphism distribution across Ty1/Copia and Ty3/Gypsy gene body were calculated for 93-11 and Nip genomes, and the SNP densities were labeled along with TE gene bodies. The individual TE burst peaks were compared between Indica and Japonica rice genomes, and the different TE burst events between them might be correlated with rice subspecies differentiation (29,30,40).

Rice stress conditions and TE transcription activity
Rice plants growth and stress conditions. The 93-11 and Nip rice were planted and stress treated as the conditions described previously (41,42). Generally, rice seedlings were cultivated on culture medium in incubator for 14 days, with cycles of 14 h of light and 10 h of dark at 28 ℃. Then the rice plants were grouped and group I were used as control (mild conditions), group II were treated at 42 ℃ for 12 h, group III were treated at 5 ℃ for 5 days, group IV were treated under 200 mM NaCl for 5 days. And the leafs of four plant groups were collected for RNA extraction (42).
Rice RNA sequencing and mapping.
The rice RNA was extracted by using FastPure Plant Total RNA Isolation kit (Vazyme Biotech) with the instructions (2). The RNA quality was examined by 1% agarose gel, and the cDNA was prepared by HiScript III 1st Strand cDNA Synthesis kit (Vazyme Biotech). The total RNA samples were applied to RNAseq by Illumina HiSeq 3000 platforms. The RNA-seq clean data were mapped to the full length ORF sequences of Indica and Japonica TEs separately with Hisat2 software, and the transcribed TEs were screened by HOMER packages (2).
Rice transcribed TE cloning and RT-PCR quanti cation.
The full length TEs supported by RNA-seq data were cloned by using Phanta Max Super-Fidelity DNA Polymerase (Vazme Biotech), and validated by 1% agarose gel analysis ( Figure 6). The TE transcription activities of rice under mild conditions, hot, cold and salt stresses were quanti ed by RT-PCR with UBQ5 as reference gene (Figure 7) (43). The primers used for cloning and RT-PCR were presented in the Supplementary Table S4 and Table S5, separately.

Declarations
Ethics approval and consent to participate Not applicable