2.1. Genome transposon annotation of blackleg causal pathogen
In an effort to clarify the transposon size, categories, counts, we enforced genome transposon annotation for three blackleg causal pathogen strains and made statistical analysis based on annotation results.
Firstly, genome transposon annotation of different pathogen strains was conducted by running EDTA program. Pathogen strain genomes were downloaded from NCBI database, including three genomes of Leptophaeria biglobosa 'brassicae' B3.5 (Lbb B3.5), Leptophaeria biglobosa 'brassicae' CA1 (Lbb CA1), and Leptophaeria maculans' brassicae 'V23.1.3 (Lmb V23.1.3). Then, unknown LTR retrotransposons were extracted from transposon library of pathogen and reclassified through software DeepTE. Finally, the pathogen genome was conducted transposon annotation again by homologous alignment with the reclassified transposon library of pathogen. When implementing transposon annotation by the EDTA program, we additionally afforded a SINE_LINE database sorted from website (https://sines.eimb.ru/) for annotating various kinds of transposons. When running the DeepTE program, the fungal classification model of unknown LTR retrotransposons originated from official website of DeepTE (https://github.com/LiLabAtVT/DeepTE). After completing genome transposon annotation of three pathogenic strains, the annotation results were organized and analyzed in R Studio. The parameter settings of EDTA annotation and DeepTE classification were as follows.
EDTA.pl --genome genome.fna --species others --sensitive 0 --anno 1 --threads 20 --curatedlib ../../11.software/SINE_LINE/SINE_LINE.fa
Python /pub/software/DeepTE/DeepTE.py -I LTR_unknown.fa -sp F -m_dir ../../11.software/Model_dir/Fungi_model -fam LTR 1 > DeepTE.log 2>&1
EDTA.pl --genome genome.fna --step anno --overwrite 1 --anno 1 --threads 20
2.2. Statistics of size, proportion and counts of various transposons of blackleg causal pathogen
The above transposon annotation results of three pathogen strains were imported into RStudio, converted into the table format, then merged together. At the same time, genome index files of pathogen were constructed in Linux and were also imported into RStudio for calculating genome size of pathogen. Finally, size, proportion and counts of different types of transposons were counted and displayed in RStudio. The shell command of constructing genome index files was as follows.
samtools faidx genome.fna
2.3. Transposon divergence analysis of blackleg causal pathogen
In order to learn about the divergence between genome transposons and their corresponding ancestor sequences, we carried out transposon divergence analysis for three different blackleg causal pathogen strains.
2.3.1. Analysis of transposon divergence based on identity
Through the above EDTA annotation program, non-redundant transposon libraries of different blackleg causal pathogen strains were obtained. Then, genome transposons of pathogen were made alignment with ancestor sequences in transposon library for calculating identity. Afterwards, transposon divergence of different pathogen strains was statistically analyzed based on identity. The specific steps were as follows. The transposon annotation file gene.fna.mod.EDTA.RM.gff3 from software RepeatMasker of EDTA annotation program was converted into table format and made further arrangement in Rstudio. Finally, the sorted transposon annotation files of three pathogen strains were merged together, and transposon identity of different pathogen strains was counted and displayed in RStudio.
2.3.2. Analysis of transposon divergence based on Kimura coefficient
Firstly, Kimura coefficient was calculated based on alignment results between genome transposons of pathogen and ancestral sequences in transposon library. Then, for each type of transposons, proportion of transposons with different Kimura coefficient was calculated in different pathogen strain genomes. The specific steps were as follows. To start with, Kimura coefficient was calculated based on the following shell commands.
cp ../../23.EDTA/sp_Lbi/genome.fna.mod.EDTA.anno/genome.fna.mod.cat.gz ./
gunzip genome.fna.mod.cat.gz
/pub/software/RepeatMasker/util/calcDivergenceFromAlign.pl -s TE.div genome.fna.mod.cat
grep -P "^\d+|Div" TE.div > TE.divsum
Afterwards, the result file TE.divsum containing size information of various transposons with different kimura coefficients was imported into Rstudio, and made further arrangement. Meanwhile, the genome index files were also imported into RStudio for obtaining genome size of pathogen. Finally, proportion of various transposons with different kimura coefficients were calculated and displayed in RStudio. The shell command of constructing genome index file of pathogen was shown in Statistics of size, proportion and counts of transposons of blackleg causal pathogen.
2.4. Insertion time estimation of intact LTR retrotransposons of blackleg causal pathogen
As mentioned above, LTR retrotransposons were usually the main cause of plant genome expansion. Moreover, LTR retrotransposons played an important role in maintaining chromatin structure, centromere function, as well as in regulating expression level of effector protein encoding genes. So as to additionally understand LTR retrotransposons of blackleg causal pathogen, we performed the insertion time estimation for them. Because the estimation method was based on difference between two long terminal repeat sequences at both ends, we only analyzed the insertion time of intact LTR retrotransposons in this paper.
Firstly, intact LTR retrotransposons were extracted according to the annotation file geneme.fna.mod.EDTA.TEAnno.gff3 and the corresponding pathogen strain genome sequences, and reclassified by using software TEsorter. The relevant shell commands were as follows.
awk \ '$3~"LTR_retrotransposon"' ../../23.EDTA/sp_Lbi/genome.fna.mod.EDTA.TEanno.gff3 | grep "Method = structural" >intact.LTR.gff3
awk -F';' '{print $1}' intact.LTR.gff3 |sed 's/ID=//' |awk '$7!="?"' |awk '{print"JACTNT"$1".1""\t"$4 − 1"\t"$5"\t"$9"\t.\t"$7}' >intact.LTR.bed6
awk -F';' '{print $1}' intact.LTR.gff3|sed 's/ID=//' |awk '$7=="?"'|awk '{print"JACTNT"$1".1""\t"$4 − 1"\t"$5"\t"$9"\t.\t+"}' >>intact.LTR.bed6
bedtools getfasta -fi ../../12.ref/Lbi.genome.fna -bed intact.LTR.bed6 -fo - -name -s | awk -F'::' '{print $1}' | seqtk seq -l 60 > intact.LTR.fa
TEsorter -db rexdb intact.LTR.fa
Afterwards, intact LTR retrotransposon annotation file of pathogen genome was imported into Rstudio, converted into table format, made further arrangement. Then, certain columns such as Sequence name, and so on were extracted from the above sorted results. Similarly, another result file gene.fna.mod.pass.list containing insertion time of intact LTR retrotransposon was imported into RStudio, made arrangement and extracted certain columns such as Insertion time, and so on. Besides, the result file intact.LTR.fa.rexdb.cls.tsv containing classification information of intact LTR retrotransposons was also imported into RStudio, and the above extracted information columns from different result files were merged together. Finally, insertion time of intact LTR retrotransposons was analyzed and displayed in RStudio.
2.5. Phylogeny tree construction of intact LTR retrotransposons of blackleg causal pathogen
For the sake of grasping phylogeny relationship among intact LTR retrotransposons from different blackleg causal pathogen strains, we also implemented phylogeny tree construction for them based on their amino acid sequences of intermediate RT domains.
2.5.1. Phylogeny tree construction of intact Ty1/Copia retrotransposons
To begin with, based on the classification result files intact.LTR.fa.rexdb.dom.tsv and intact.LTR.fa.rexdb.dom.faa of software TEsorter, amino acid sequences of RT domains of intact Ty1/Copia superfamily retrotransposons were extracted in Linux. Afterwards, these obtained amino acid sequences were performed multiple sequence alignment via software Mafft. Eventually, according to the multiple sequence alignment results, phylogeny tree was constructed through software Iqtree and visualized on website server ( https://itol.embl.de/) of software ITOL. The relevant shell commands were as follows.
grep "Ty1_copia" ../../31.intactLTR/sp_Lbi/intact.LTR.fa.rexdb.dom.tsv | grep -P "\-RT\t" | cut -f 1 | seqtk subseq ../../31.intactLTR/sp_Lbi/intact.LTR.fa.rexdb.dom.faa - > LbbCA1.RT.raw.fa
awk -F':' '{print $1}' LbbCA1.RT.raw.fa | sed 's/\Class_I\/LTR\///'|sed 's/>/> LbbCA1-/' > LbbCA1.RT.fa
grep "Ty1_copia" ../../31.intactLTR/sp_Lbib/intact.LTR.fa.rexdb.dom.tsv | grep -P "\-RT\t" | cut -f 1 | seqtk subseq ../../31.intactLTR/sp_Lbib/intact.LTR.fa.rexdb.dom.faa - > LbbB3.5.RT.raw.fa
awk -F':' '{print $1}' LbbB3.5.RT.raw.fa | sed 's/\Class_I\/LTR\///'|sed 's/>/> LbbB3.5-/' > LbbB3.5.RT.fa
grep "Ty1_copia" ../../31.intactLTR/sp_Lma/intact.LTR.fa.rexdb.dom.tsv | grep -P "\-RT\t" | cut -f 1 | seqtk subseq ../../31.intactLTR/sp_Lma/intact.LTR.fa.rexdb.dom.faa - > LmbV23.1.3.RT.raw.fa
awk -F':' '{print $1}' LmbV23.1.3.RT.raw.fa | sed 's/\Class_I\/LTR\///'|sed 's/>/> LmbV23.1.3-/' > LmbV23.1.3.RT.fa
cat LbbCA1.RT.fa LbbB3.5.RT.fa LmbV23.1.3.RT.fa > RT.aln
mafft --auto RT.fa > RT.aln
iqtree -s RT.aln -bb 1000 -nt AUTO -pre Ty1_copia.iqtree 1 > iqtree.log 2>&1 &
2.5.2. Phylogeny tree construction of intact Ty3/Gypsy retrotransposons
Likely, amino acid sequences of RT domains of intact Ty3/Gypsy superfamily retrotransposons were also extracted from the classification result files intact.LTR.fa.rexdb.dom.tsv and intact.LTR.fa.rexdb.dom.faa of software TEsorter. Afterwards, the remaining steps were shown in Phylogeny tree construction of intact Ty1/Copia retrotransposons. The relevant shell commands were as follows.
grep "Ty3_gypsy" ../../31.intactLTR/sp_Lbi/intact.LTR.fa.rexdb.dom.tsv | grep -P "\-RT\t" | cut -f 1 | seqtk subseq ../../31.intactLTR/sp_Lbi/intact.LTR.fa.rexdb.dom.faa - > LbbCA1.RT.raw.fa
awk -F':' '{print $1}' LbbCA1.RT.raw.fa | sed 's/\Class_I\/LTR\///'|sed 's/>/> LbbCA1-/' > LbbCA1.RT.fa
grep "Ty3_gypsy" ../../31.intactLTR/sp_Lbib/intact.LTR.fa.rexdb.dom.tsv | grep -P "\-RT\t" | cut -f 1 | seqtk subseq ../../31.intactLTR/sp_Lbib/intact.LTR.fa.rexdb.dom.faa - > LbbB3.5.RT.raw.fa
awk -F':' '{print $1}' LbbB3.5.RT.raw.fa | sed 's/\Class_I\/LTR\///'|sed 's/>/> LbbB3.5-/' > LbbB3.5.RT.fa
grep "Ty3_gypsy" ../../31.intactLTR/sp_Lma/intact.LTR.fa.rexdb.dom.tsv | grep -P "\-RT\t" | cut -f 1 | seqtk subseq ../../31.intactLTR/sp_Lma/intact.LTR.fa.rexdb.dom.faa - > LmbV23.1.3.RT.raw.fa
awk -F':' '{print $1}' LmbV23.1.3.RT.raw.fa | sed 's/\Class_I\/LTR\///'|sed 's/>/> LmbV23.1.3-/' > LmbV23.1.3.RT.fa
cat LbbCA1.RT.fa LbbB3.5.RT.fa LmbV23.1.3.RT.fa > RT.aln
mafft --auto RT.fa > RT.aln
iqtree -s RT.aln -bb 1000 -nt AUTO -pre Ty1_copia.iqtree 1 > iqtree.log 2>&1 &
2.6. Gene density and LTR retrotransposon density statistics of blackleg causal pathogen
Commonly, the lower LTR retrotransposon density, the higher gene density. In an effort to figure out the relationship between gene density and LTR retrotransposon density of blackleg causal pathogen, we executed statistical analysis of gene density and LTR retrotransposon density for three blackleg causal pathogen strains. Firstly, contigs greater than or equal to 1M were selected out and conducted sliding window analysis with size of 1M and step length of 200K. Then the position and other information of sliding windows were sorted into a bed format file. Combined with gene structure annotation file and LTR retrotransposon annotation file, gene density and LTR retrotransposon density of sliding windows were analyzed and displayed in RStudio. There was no available gene structure annotation file for pathogen strain Lbb B3.5 in NCBI database. Therefore, the gene structure annotation file of Lbb B3.5 was obtained by making advantage of software BRAKER and MAKER. The specific annotation process was shown in our previous work (Tian et al, unpublished). The relevant shell commands in the analysis were as follows.
samtools faidx ../../12.ref/Lbi.genome.fna
awk '{print $1"\t"$2}' ../../12.ref/Lbi.genome.fna.fai |awk -F"\t" '{if($2 > = 1000000) print $1"\t"$2}' >genome.txt
bedtools makewindows -g genome.txt -w 1000000 -s 200000 > chr_sw.bed3
awk '{print $1"\t"$2"\t"$3"\t"NR"\t.\t."}' chr_sw.bed3 > chr_sw.bed6
awk '$3=="gene"' ../../12.ref/Lbi.genes.gtf|bedtools coverage -a chr_sw.bed6 -b - -counts -F 0.5 > chr_sw.gene.density.txt
awk '$3=="Copia_LTR_retrotransposon"' ../../31.intactLTR/sp_Lbi/modify_intact.LTR.gff3 | bedtools coverage -a chr_sw.bed6 -b - -counts -F 0.5 > chr_sw.intact_Copia.density.txt
awk '$3=="Copia_LTR_retrotransposon"' ../../23.EDTA/sp_Lbi/modify_genome.fna.mod.EDTA.TEanno.gff3 | bedtools coverage -a chr_sw.bed6 -b - -counts -F 0.5 > chr_sw.truncated_Copia.density.txt
awk '$3=="Gypsy_LTR_retrotransposon"' ../../31.intactLTR/sp_Lbi/modify_intact.LTR.gff3 | bedtools coverage -a chr_sw.bed6 -b - -counts -F 0.5 > chr_sw.intact_Gypsy.density.txt
awk '$3=="Gypsy_LTR_retrotransposon"' ../../23.EDTA/sp_Lbi/modify_genome.fna.mod.EDTA.TEanno.gff3 | bedtools coverage -a chr_sw.bed6 -b - -counts -F 0.5 > chr_sw.truncated_Gypsy.density.txt
awk '{print $7}' chr_sw.truncated_Copia.density.txt > truncated_Copia.density.txt
awk '{print $1}' truncated_Copia.density.txt |paste - chr_sw.intact_Copia.density.txt |awk '{print $8=$8+$1; print $0}' |awk '{print $1 = null; print$0}' |awk -v OFS="\t" '$1=$1' >chr_sw.Copia.density.txt
awk '{print $7}' chr_sw.truncated_Gypsy.density.txt > truncated_Gypsy.density.txt
awk '{print $1}' truncated_Gypsy.density.txt |paste - chr_sw.intact_Gypsy.density.txt |awk '{print $8=$8+$1; print $0}' |awk '{print $1 = null; print$0}' |awk -v OFS="\t" '$1=$1' >chr_sw.Gypsy.density.txt
2.7. Solo-LTR counts and intact LTR retrotransposon counts statistics of blackleg causal pathogen
To further understand the LTR retrotransposons in blackleg causal pathogen, the LTR retrotransposon were also divided into solo-LTR and intact LTR retrotransposons. Afterwards, we carried out the statistical analysis of solo-LTR and intact LTR retrotransposons in three pathogen strains. Firstly, based on the genome transposon annotation file and the transposon library index file of blackleg causal pathogen, the R script solo_intact_ltr_edta.R was enforced to generate the result file soloLTR_RM.tsv containing position of INT domain of LTR retrotransposons, distance, upstream and downstream, away from the nearest long terminal repeat sequence and so on. Afterwards, the obtained result file soloLTR_RM.tsv was combined with the intact LTR retrotransposon annotation intact.LTR.gff3 to analyze solo-LTR retrotransposon counts and intact LTR retrotransposon counts. Finally, solo-LTR retrotransposon counts and intact LTR retrotransposon counts were analyzed and displayed in RStudio. The relevant shell commands were as follows.
ln -s ../../23.EDTA/sp_Lbi/genome.fna.mod.EDTA.anno/genome.fna.mod.EDTA.RM.gff3
ln -s ../../23.EDTA/sp_Lbi/genome.fna.mod.EDTA.final/genome.fna.mod.EDTA.TElib.fa
samtools faidx genome.fna.mod.EDTA.TElib.fa
Rscript ../../11.software/solo_intact_ltr_edta.R -g genome.fna.mod.EDTA.RM.gff3 -T genome.fna.mod.EDTA.TElib.fa.fai
The main function of R script solo_intact_ltr_edta.R was as follows. Through the genome transposon annotation file and the genome index file of pathogen, INT domains and long end repeat sequences of pathogen genome were extracted, ending up with a annotation file containing positions of INT domains and long end repeat sequences, alignment scores, coverage, etc. Meanwhile, the distance between the INT domain and its upstream and downstream nearest long terminal repeat sequences was also calculated. If only upstream or downstream long terminal repeat sequence of INT domain was within the given distance range, then the LTR retrotransposon in the position was considered as the Solo-LTR retrotransposon. The R script would finally produce a solo-LTR retrotransposon annotation file containing location of INT domain, the distance between the INT domain and its upstream and downstream the nearest long terminal repeat sequence, and so on.
2.8. Pan-genome transposon analysis of blackleg causal pathogen
So as to identify more genome transposons and perform common and special transposon statistical analysis among different pathogen strains, we conducted pan-genome transposon analysis of blackleg causal pathogen by using four genomes: Lmb V23.1.3, Leptosphaeria maculans ‘brassicae’ CAN1 (Lmb CAN1), Lbb CA1, and Lbb B3.5. Firstly, Transposon library of each pathogen strain was constructed by software EDTA, and then the four transposon libraries were clustered to construct pan-genome transposon library. Upon utilizing the constructed pan-genome transposon library, each pathogen strain was made transposon annotation again by running EDTA program. Then common and special intact LTR retrotransposons among four pathogen strains and their insertion time were counted and analyzed based on the second annotation results. When initially constructing transposon library of pathogen strains, the parameter setting of software EDTA referred to Genome transposon annotation of blackleg causal pathogen, except for the parameters anno and curatedlib. The parameter value of anno was changed from 1 to 0. And the parameter curatedlib was removed. The related shell commands were as follows.
ln -s ../23.EDTA/Lbb_B35/genome.fna.mod.EDTA.TElib.fa Lbb_B35.genome.fna.mod.EDTA.TElib.fa
ln -s ../23.EDTA/Lbb_CA1/genome.fna.mod.EDTA.TElib.fa Lbb_CA1.genome.fna.mod.EDTA.TElib.fa
ln -s ../23.EDTA/Lmb_CAN1/genome.fna.mod.EDTA.TElib.fa Lmb_CAN1.genome.fna.mod.EDTA.TElib.fa
ln -s ../23.EDTA/Lmb_V2313/genome.fna.mod.EDTA.TElib.fa Lmb_V2313.genome.fna.mod.EDTA.TElib.fa
ln -s ../../TEAnalysis/11.software/SINE_LINE/SINE_LINE.fa
ls *.TElib.fa > liblist.txt
perl /pub/anaconda3/envs/EDTA/share/EDTA/util/make_panTElib.pl -liblist liblist.txt -mincov 0.70 -miniden 70 -threads 20 --curatedlib SINE_LINE.fa
The shell commands of genome transposon reannotation of pathogen were as follows.
source activate EDTA
cd genome.fna.mod.EDTA.combine
rm genome.fna.mod.LTR.TIR.Helitron.fa.stg1
ln -s ../../../24.make_panTElib/liblist.txt.panTE.fa genome.fna.mod.LTR.TIR.Helitron.fa.stg1
cd ..
EDTA.pl --genome genome.fna --species others --sensitive 0 --anno 1 --step final --force 1 --threads 20 --overwrite 1
source deactivate
2.8.1. Analysis of common and special pan-genome intact LTR retrotransposons in blackleg causal pathogen
Pan-genome intact LTR retrotransposons of four pathogen strains were extracted through shell commands and reclassified through software TEsorter. The specific steps were shown in the above Insertion time estimation of intact LTR retrotransposons of blackleg causal pathogen. Afterwards, pan-genome intact LTR retrotransposon IDs of pathogen strain were extracted and removed redundancy in RStudio. Finally, common and special pan-genome intact LTR retrotransposon counts were visualized by Venn plot.
2.8.2. Insertion time estimation of common and special pan-genome intact LTR retrotransposons of blackleg causal pathogen
In RStudio, the pan-genome intact LTR retrotransposon annotation file including insertion time was made arrangement referring to Insertion time estimation of intact LTR retrotransposons of blackleg causal pathogen. Afterwards, insertion time of pan-genome intact LTR retrotransposons was displayed in RStudio.