Identification of TEs using the RepeatMasker RepBase library
The analyses focused on the three published genomes from the B. tabaci cryptic species complex (MEAM1, MED/Q, and SSA-ECA). TEs within these genomes were initially identified with RepeatMasker using the RepBase library (RepBase RepeatMasker-edition20180826). TE coverage identified with the RepeatMasker RepBase library was substantially lower than that reported in the respective genome publications (Table 1): MEAM1 (18.92% vs 43.82% published), MED/Q (17.28% vs 40.29% published), and SSA-ECA (13.41% vs 38.52% published).
The RepBase library was searched for B. tabaci-specific TEs, and 282 different TE consensus sequences were found. This indicates that only some of the TE consensus sequences identified in the genome publications were submitted to RepBase, and that with these submitted consensus sequences fewer than half of the published TEs could be identified. An attempt was made to locate the remaining consensus sequences; however, no publication of these sequences could be found.
The RepBase library was then tested on a Drosophila melanogaster genome (release 6 [50]) to determine whether the anomalies seen for the hemipteran genomes in this study applied more widely. The RepBase library identified 17.44% TE genome coverage, while published results from different Drosophila studies report that <20% of the genome consists of TEs [51–54]. The result was thus in line with what has been reported for the species, confirming that the library was being searched correctly.
Table 1
Repetitive elements identified in the three whitefly genomes
TE content as reported by the respective genome studies, as identified using the last publicly available RepBase library (RepBase RepeatMasker-edition20180826), and as identified using the custom-built repeat library produced by the workflow described in this study. Values are percentages of genome coverage.
| | MEAM1 Published | MEAM1 RepBase | MEAM1 Custom Library | MED/Q Published | MED/Q RepBase | MED/Q Custom Library | SSA-ECA Published | SSA-ECA RepBase | SSA-ECA Custom Library |
|---|---|---|---|---|---|---|---|---|---|
| DNA | 29.25 | 18.07 | 25.28 | 15.66 | 16.48 | 23.42 | 25.94 | 12.92 | 19.86 |
| Retroelements | | 0.86 | 2.6 | | 0.61 | 2.65 | | 0.42 | 1.72 |
| LINE | 0.96 | 0.61 | 1.25 | 3.18 | 0.57 | 0.96 | 0.44 | 0.38 | 0.94 |
| SINE | 0.16 | 0.04 | 0.17 | 0.96 | 0.04 | 0.18 | 0.16 | 0.04 | 0.08 |
| LTR | 0.49 | 0.21 | 1.19 | 18.5 | 0.19 | 1.51 | 0.08 | 0.07 | 0.7 |
| Unknown | 12.96 | 0 | 16.26 | 1.99 | 0 | 14.81 | 11.9 | 0 | 15.22 |
| Total | 43.82 | 18.92 | 44.14 | 40.29 | 17.28 | 40.88 | 38.52 | 13.41 | 36.8 |
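As a quick arithmetic check on the Total row of Table 1, the share of the published TE coverage recovered by each library can be computed per genome. The sketch below is illustrative only: the dictionary layout and the helper name `recovered_fraction` are assumptions made for this example and are not part of the study's workflow.

```python
# Totals (% genome coverage) copied from the Total row of Table 1.
# The data layout and helper name are illustrative assumptions.
TOTALS = {
    "MEAM1": {"published": 43.82, "repbase": 18.92, "custom": 44.14},
    "MED/Q": {"published": 40.29, "repbase": 17.28, "custom": 40.88},
    "SSA-ECA": {"published": 38.52, "repbase": 13.41, "custom": 36.80},
}


def recovered_fraction(row: dict, library: str) -> float:
    """Return a library's coverage as a fraction of the published coverage."""
    return row[library] / row["published"]


if __name__ == "__main__":
    for genome, row in TOTALS.items():
        repbase = recovered_fraction(row, "repbase")
        custom = recovered_fraction(row, "custom")
        print(f"{genome}: RepBase {repbase:.1%} of published, custom {custom:.1%}")
```

For all three genomes the RepBase fraction falls below one half, whereas the custom-library totals lie close to the published totals.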
The results of the TE identification using the RepeatMasker RepBase library show that this library could not be used to characterize and compare the TEs within the whitefly genomes. To resolve this, an annotation workflow was developed to standardize TE identification across the whitefly genomes. The published whitefly genomes had used different TE identification tools: MEAM1 and SSA-ECA used a DNA transposon-specific tool, while MED/Q used an LTR-specific tool. A standardized annotation workflow allows a fairer comparison across the three genomes. A species-specific custom-built repeat library was created for each genome studied, using the same range of tools to identify and classify TEs within each genome. The workflow combines structure-based and de novo methods to identify elements, and classifies the identified elements using sequence similarity, structural features, and machine learning (for details see methodology section).
The performance of the annotation workflow was validated on a well-characterized genome to determine its suitability for annotating TEs in less well-characterized insect genomes. The D. melanogaster genome (release 6 [50]) was chosen for the validation because its TE annotation is known to be among the most accurate, with several iterations of reference genome releases and TE information released alongside them [50, 55]. The workflow was compared against the RepeatMasker RepBase library because the latter draws on a database that incorporates updates from several TE studies and libraries, including the TE annotation from the D. melanogaster genome releases [24, 56].
Interspersed repeats covered 17.44% of the D. melanogaster genome using the RepeatMasker library, compared to 16.88% using the species-specific custom-built library (Table 2). Most of the repeats found were LTRs, and a difference of 0.46% in this category was seen between the RepeatMasker and custom-built libraries. SINEs were the least common elements: the RepeatMasker library identified 81 bp of SINEs while the custom-built library found none (0 bp). For DNA transposons a difference of 0.58% was observed between the two libraries, while a difference of 0.42% was observed in the detection of LINEs. A difference of <1% in the total TEs identified and of <1% in each of the orders supports the capability of the workflow to identify the TEs within a genome.
Table 2
RepeatMasker output using the RepeatMasker RepBase library and the species-specific custom-built library for the Drosophila melanogaster genome
Comparison of the results of the identification of TEs using RepeatMasker RepBase library and the species-specific repeat library in the D. melanogaster genome. The custom-built repeat library was built using the workflow described in the study.
| | RepBase (%) | Custom Library (%) |
|---|---|---|
| DNA | 1.79 | 1.21 |
| LINE | 4.93 | 4.50 |
| SINE | <0.001 | 0.00 |
| LTR | 10.68 | 10.22 |
| Unclassified | 0.04 | 0.34 |
| Total Interspersed Repeats | 17.44 | 16.88 |
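The validation criterion (a difference of <1% between the two libraries for the total and for each order) can be checked directly against Table 2. The following is a minimal sketch, assuming the table values are entered by hand; the name `abs_differences` is hypothetical, and the `<0.001` entry is represented by its upper bound. Because the inputs are the rounded values from the table, a computed difference may differ in the last decimal place from the figure quoted in the text.

```python
# Percent genome coverage copied from Table 2 (D. melanogaster): (RepBase, custom).
# "<0.001" for SINEs under RepBase is entered as 0.001, i.e. an upper bound.
TABLE2 = {
    "DNA": (1.79, 1.21),
    "LINE": (4.93, 4.50),
    "SINE": (0.001, 0.00),
    "LTR": (10.68, 10.22),
    "Unclassified": (0.04, 0.34),
    "Total interspersed repeats": (17.44, 16.88),
}


def abs_differences(table: dict) -> dict:
    """Absolute RepBase-versus-custom difference, in percentage points."""
    return {name: abs(repbase - custom) for name, (repbase, custom) in table.items()}


if __name__ == "__main__":
    for name, diff in abs_differences(TABLE2).items():
        status = "within" if diff < 1.0 else "outside"
        print(f"{name}: {diff:.2f} percentage points ({status} the 1% threshold)")
```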
TEs in arthropod genomes
The validated workflow was used to identify the TE content of each of the target genomes (Figure 1), producing a custom-built species-specific library for each genome studied. In addition to the three whitefly genomes (MEAM1, MED/Q, and SSA-ECA), three other hemipteran genomes were included, namely Acyrthosiphon pisum (ACPIS), Diaphorina citri (DIPSY), and Myzus persicae (MYPER). Each of the three whitefly genomes had a higher TE content (an average of 40.61% genome coverage of TEs) than each of the three non-whitefly genomes (an average of 25.01% TE genome coverage). MEAM1 had the highest TE content across the six genomes at 44.14%, while ACPIS had the highest TE content amongst the non-whitefly genomes at 34.54%. SSA-ECA had the lowest TE content amongst the whitefly genomes at 36.80%, which was still higher than the TE content of the ACPIS genome. MYPER had the lowest TE content across the six genomes at 17.52%.
The relationship between genome size and TE content across the six genomes was tested using Spearman's rank correlation (Figure 2). TE coverage was positively correlated with genome size (r = 0.93, p = 0.006). The highest TE content (44.14%) across the six genomes was in the MEAM1 genome (615 Mbp), while the smallest genome, MYPER (347 Mbp), had the lowest TE content at 17.52%. Amongst the whitefly genomes, SSA-ECA had the smallest genome size (538.48 Mbp) and the lowest TE genome coverage (36.80%).
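To make the rank-based test concrete, the sketch below computes Spearman's coefficient as the Pearson correlation of the rank vectors, with tied values given the mean of their ranks. It is a plain-Python illustration, not the software used in the study; the function names are assumptions. Only three genome sizes are quoted in this section, so the demonstration uses those three genomes and cannot reproduce the coefficient reported for all six, and no p-value is computed.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sample is constant")
    return cov / (var_x * var_y) ** 0.5


# Only the genome sizes quoted in the text: MEAM1, SSA-ECA, MYPER (Mbp),
# paired with their TE genome coverage (%). This is a partial data set.
GENOME_SIZE_MBP = [615.0, 538.48, 347.0]
TE_COVERAGE_PCT = [44.14, 36.80, 17.52]

if __name__ == "__main__":
    rho = spearman_rho(GENOME_SIZE_MBP, TE_COVERAGE_PCT)
    print(f"rho for the three quoted genomes: {rho:.2f}")
```

Where SciPy is available, `scipy.stats.spearmanr` provides the coefficient together with a p-value.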
Difference in the distribution of TE content between genomes
There was no statistically significant difference (p = 0.09) in genome size between the whitefly genomes (average 603.92 Mbp) and the non-whitefly genomes (average 458.24 Mbp). This allows the two groups to be compared without the results being substantially biased by variation in genome size. The distribution of TEs as a percentage of the genome was compared across the six genomes. The majority of the classified elements within the whitefly genomes were DNA transposons, at an average of 22.85% across the three genomes. MEAM1 had the highest DNA transposon coverage amongst the three whitefly genomes at 25.28%, while SSA-ECA had the lowest at 19.86%. Retrotransposons were found at a much lower average of 2.32% coverage in the whitefly genomes, with LTRs the most abundant order across the three at an average of 1.13%, followed by LINEs at an average of 1.05%.
For the three non-whitefly genomes, DNA transposons were the most abundant in ACPIS (14.06%) and MYPER (8.35%) while retrotransposons were the most abundant class in the DIPSY genome (6.68%). An average of 4.34% coverage was identified as retrotransposons within the non-whitefly genomes. LINEs were the most abundant retrotransposon order in ACPIS (2.32%) and MYPER (1.86%) while SINEs were the most abundant in DIPSY (3%).
Across the four orders of TEs, SINEs were the least abundant at an average of 0.58% (0.14% for the whitefly genomes and 1.01% for the non-whitefly genomes). Amongst the six genomes, DIPSY had the highest percentage of SINEs at 3%, while this TE order was not detected in MYPER.
The distribution of the TE orders was then compared between the two groups of genomes (Figure 3). A two-sample t-test was used for orders that had the same variance in the two groups (DNA transposons, LTRs, and LINEs), while a Wilcoxon rank-sum test was used for SINEs because their genome coverage in the two groups was not normally distributed. Mean genome coverage differed significantly between the whitefly and the non-whitefly genomes for DNA transposons (p = 0.01) and LINEs (p = 0.008), while no significant difference was found for LTRs (p = 0.7856) or SINEs (p = 0.6625). The whitefly genomes contained significantly more DNA transposons and significantly fewer LINEs than the three non-whitefly hemipteran genomes studied.
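As an illustration of the equal-variance two-sample t-test used for the DNA transposon, LTR, and LINE comparisons, the sketch below computes the pooled-variance t statistic and its degrees of freedom with the standard library only. The function name `pooled_t` is an assumption. The demonstration data are deliberately partial: the whitefly values are the custom-library DNA transposon coverages from Table 1, but only two non-whitefly DNA transposon values (ACPIS and MYPER) are quoted in this section, so the result does not reproduce the reported p-value.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def pooled_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Student's two-sample t statistic (equal variances) and degrees of freedom."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two observations")
    df = na + nb - 2
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    standard_error = sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / standard_error, df


# DNA transposon coverage (% of genome), custom library.
WHITEFLY_DNA = [25.28, 23.42, 19.86]       # MEAM1, MED/Q, SSA-ECA
NON_WHITEFLY_DNA_PARTIAL = [14.06, 8.35]   # ACPIS, MYPER only (DIPSY not quoted)

if __name__ == "__main__":
    t_stat, df = pooled_t(WHITEFLY_DNA, NON_WHITEFLY_DNA_PARTIAL)
    print(f"t = {t_stat:.2f} on {df} degrees of freedom (partial data)")
```

Converting the statistic to a p-value requires the t distribution; where SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=True)` returns both, and `scipy.stats.ranksums` offers a Wilcoxon rank-sum test of the kind used for SINEs.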
Lastly, a proportion of the identified TEs remains unclassified. Across the six genomes, an average of 13.70% genome coverage is unclassified (15.43% for the whitefly genomes and 11.98% for the non-whitefly genomes). The relative proportions of the elements will therefore be subject to change when these unclassified elements become classified; nevertheless, the very high proportion of identified DNA transposons in the whitefly genomes means that DNA transposons will remain the largest group of elements identified within all three whitefly genomes analyzed (Supplementary Table 2).
TE superfamilies across the genomes
Each TE from the different orders can be further classified into superfamilies on the basis of their monophyletic origin and homology of motifs [27, 56, 57]. Superfamilies were identified in each genome (Table 3). A total of 98 TE superfamilies were identified in the whitefly genomes and 89 in the non-whitefly genomes. Of these, 69 TE superfamilies were present in both groups (39 DNA transposon, eight LTR, 19 LINE, and three SINE). Most of the superfamilies identified were DNA transposons, with a total of 66 different superfamilies, of which 19 were unique to the whitefly genomes and eight were unique to the non-whitefly genomes. SINEs had the fewest superfamilies, with 11, of which four were unique to the whitefly genomes and another four were unique to the non-whitefly genomes. LINEs had the most superfamilies amongst the retrotransposons, with 29 different superfamilies, of which three were unique to the whitefly genomes and seven were unique to the non-whitefly genomes.
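The superfamily counts above can be cross-checked with simple bookkeeping: for each order, the number of distinct superfamilies should equal the shared count plus the two group-specific counts. The sketch below is illustrative, and the helper name `infer_missing` is hypothetical. The group-specific LTR counts are not quoted in the text, so they are inferred here by subtraction, under the assumption that the group totals (98 and 89) are made up entirely of the four orders listed.

```python
# Counts quoted in the text. LTR group-specific counts are not quoted, so they
# are left as None and inferred from the group totals (an assumption).
SHARED = {"DNA": 39, "LTR": 8, "LINE": 19, "SINE": 3}
WHITEFLY_ONLY = {"DNA": 19, "LTR": None, "LINE": 3, "SINE": 4}
NON_WHITEFLY_ONLY = {"DNA": 8, "LTR": None, "LINE": 7, "SINE": 4}
WHITEFLY_TOTAL, NON_WHITEFLY_TOTAL = 98, 89


def infer_missing(only: dict, group_total: int, shared: dict) -> dict:
    """Fill the single unquoted (None) entry so counts add up to the group total."""
    missing = [order for order, n in only.items() if n is None]
    if len(missing) != 1:
        raise ValueError("expected exactly one unquoted count")
    known = sum(n for n in only.values() if n is not None)
    filled = dict(only)
    filled[missing[0]] = group_total - sum(shared.values()) - known
    return filled


if __name__ == "__main__":
    whitefly = infer_missing(WHITEFLY_ONLY, WHITEFLY_TOTAL, SHARED)
    non_whitefly = infer_missing(NON_WHITEFLY_ONLY, NON_WHITEFLY_TOTAL, SHARED)
    for order in SHARED:
        distinct = SHARED[order] + whitefly[order] + non_whitefly[order]
        print(f"{order}: {distinct} distinct superfamilies")
```

With the quoted figures, the DNA transposon, LINE, and SINE sums agree with the totals of 66, 29, and 11 given in the text.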
MEAM1 had the greatest number of superfamilies at 82, while MYPER had the fewest at 61. In all genomes, DNA transposon superfamilies were the most numerous, with an average of 47 in the whitefly genomes and 36 in the non-whitefly genomes. MED/Q and MEAM1 had the greatest number of DNA transposon superfamilies at 49 and 48 respectively, while DIPSY had the fewest at 30. SINE superfamilies were the least numerous, at an average of four superfamilies. DIPSY had the greatest number of SINE superfamilies with seven, while no SINE superfamilies were identified in MYPER.
Table 3
Repeat Superfamilies identified within the genomes
The table summarizes the number of superfamilies found in each class of TEs in each of the genomes. DNA, DNA transposons; LINE, long interspersed nuclear elements; SINE, short interspersed nuclear elements; LTR, long terminal repeats.
| | DNA | LINE | LTR | SINE | Total |
|---|---|---|---|---|---|
| MEAM1 | 48 | 20 | 9 | 5 | 82 |
| MED/Q | 49 | 20 | 6 | 4 | 79 |
| SSA-ECA | 44 | 18 | 9 | 4 | 75 |
| ACPIS | 43 | 18 | 5 | 1 | 67 |
| DIPSY | 30 | 23 | 6 | 7 | 66 |
| MYPER | 36 | 18 | 7 | 0 | 61 |
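The per-genome totals and group averages derived from Table 3 can be verified with a few lines of code. This is a minimal sketch with the table entered by hand; the helper names `row_total_ok` and `group_mean` are assumptions for illustration.

```python
from statistics import mean

# Superfamily counts copied from Table 3: (DNA, LINE, LTR, SINE, Total).
TABLE3 = {
    "MEAM1": (48, 20, 9, 5, 82),
    "MED/Q": (49, 20, 6, 4, 79),
    "SSA-ECA": (44, 18, 9, 4, 75),
    "ACPIS": (43, 18, 5, 1, 67),
    "DIPSY": (30, 23, 6, 7, 66),
    "MYPER": (36, 18, 7, 0, 61),
}
COLUMNS = ("DNA", "LINE", "LTR", "SINE", "Total")
WHITEFLY = ("MEAM1", "MED/Q", "SSA-ECA")
NON_WHITEFLY = ("ACPIS", "DIPSY", "MYPER")


def row_total_ok(row: tuple) -> bool:
    """Check that the four order counts add up to the stated total."""
    return sum(row[:4]) == row[4]


def group_mean(genomes: tuple, column: str) -> float:
    """Mean number of superfamilies of one order across a group of genomes."""
    index = COLUMNS.index(column)
    return mean(TABLE3[g][index] for g in genomes)


if __name__ == "__main__":
    for genome, row in TABLE3.items():
        print(f"{genome}: total consistent = {row_total_ok(row)}")
    print(f"Mean DNA superfamilies, whitefly: {group_mean(WHITEFLY, 'DNA'):.1f}")
    print(f"Mean DNA superfamilies, non-whitefly: {group_mean(NON_WHITEFLY, 'DNA'):.1f}")
```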
TE superfamilies identified across the three whitefly genomes were analyzed further (Figure 4A). A total of 63 superfamilies were common to the three whitefly genomes: 39 DNA transposon, six LTR, 16 LINE, and two SINE superfamilies. Aside from the common superfamilies, each genome had superfamilies that were identified only in that genome. MED/Q had the highest number of unique superfamilies at ten, consisting of six DNA transposon superfamilies (hAT-hAT19, Kolobok-E, Kolobok-T2, PIF-ISL2EU, TcMar, and TcMar-Sagan), two LINE superfamilies (CR1-Zenon and Daphne), and two SINE superfamilies (SINE2 and tRNA-V). Seven unique superfamilies were identified in the MEAM1 genome, consisting of four DNA transposon superfamilies (Crypton-S, hAT-hAT1, P-Fungi, and TcMar-Cweed), two LTR superfamilies (ERVL and Caulimovirus), and one SINE superfamily (tRNA-L2). SSA-ECA had the fewest unique superfamilies at six, consisting of four DNA transposon superfamilies (hAT-hATw, IS, TcMar-ISRm11, and TcMar-Stowaway) and two LTR superfamilies (DIRS and Foamy).
Repeat superfamilies identified across the three non-whitefly genomes were also analyzed (Figure 4B). A total of 44 superfamilies were found to be common across the three non-whitefly genomes, of which 26 superfamilies were identified as DNA transposon superfamilies, four as LTRs, and 14 as LINEs. Unique superfamilies were also identified within each genome. DIPSY had the greatest number of unique superfamilies at 15 (three DNA transposon superfamilies, one LTR superfamily, five LINE superfamilies, and six SINE superfamilies) while MYPER had the least at five (one DNA transposon superfamily, three LTR superfamilies, and one LINE superfamily).
A further comparison was performed between the 63 superfamilies common to the whitefly genomes and the superfamilies of the three non-whitefly hemipteran genomes analyzed with the same workflow (Figure 4C). A total of 35 superfamilies were common across all the groups (21 DNA transposons, four LTRs, and ten LINEs). Nine superfamilies were present in all three non-whitefly genomes but were not amongst the superfamilies common to all the whitefly genomes. However, seven of these nine superfamilies were present in one or two of the three whitefly genomes analyzed. Lastly, 11 of the 63 superfamilies common to all whitefly genomes were found only in the whitefly genomes and in none of the three other hemipteran genomes. Nine of these 11 superfamilies were DNA transposons (CMC-Chapaev-3, EnSpm/CACTA, ISL2EU, Kolobok, Mariner, PIF-Spy, Sola-2, TcMar-Tc4, and Zator), while the remaining two were LINE superfamilies (Nimb and L2B).
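The common and unique groupings reported for the superfamily comparisons correspond to simple set operations: the intersection of the per-genome sets gives the common superfamilies, and each genome's set minus the union of the others gives its unique superfamilies. The sketch below illustrates this for the whitefly genomes. It is not the study's code, the function name `partition` is an assumption, and the sets are deliberately partial: they contain only the superfamily names quoted in the text (the genome-specific names plus the 11 whitefly-only common names), not the full set of 63 common superfamilies.

```python
from typing import Dict, Set, Tuple

# The 11 superfamilies named in the text as common to all whitefly genomes.
COMMON_NAMED = {
    "CMC-Chapaev-3", "EnSpm/CACTA", "ISL2EU", "Kolobok", "Mariner", "PIF-Spy",
    "Sola-2", "TcMar-Tc4", "Zator", "Nimb", "L2B",
}

# Partial per-genome sets: named common superfamilies plus named unique ones.
WHITEFLY_SETS: Dict[str, Set[str]] = {
    "MEAM1": COMMON_NAMED | {
        "Crypton-S", "hAT-hAT1", "P-Fungi", "TcMar-Cweed",
        "ERVL", "Caulimovirus", "tRNA-L2",
    },
    "MED/Q": COMMON_NAMED | {
        "hAT-hAT19", "Kolobok-E", "Kolobok-T2", "PIF-ISL2EU", "TcMar",
        "TcMar-Sagan", "CR1-Zenon", "Daphne", "SINE2", "tRNA-V",
    },
    "SSA-ECA": COMMON_NAMED | {
        "hAT-hATw", "IS", "TcMar-ISRm11", "TcMar-Stowaway", "DIRS", "Foamy",
    },
}


def partition(genomes: Dict[str, Set[str]]) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Return (superfamilies in every genome, superfamilies unique to each genome)."""
    common = set.intersection(*genomes.values())
    unique = {}
    for name, members in genomes.items():
        others = set().union(*(m for n, m in genomes.items() if n != name))
        unique[name] = members - others
    return common, unique


if __name__ == "__main__":
    common, unique = partition(WHITEFLY_SETS)
    print(f"common (named subset): {len(common)}")
    for genome, names in unique.items():
        print(f"{genome}: {len(names)} unique superfamilies")
```

Applied to complete per-genome superfamily lists, the same two operations would yield the counts shown in a three-way Venn diagram.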