DOI: https://doi.org/10.21203/rs.2.24663/v1
Background
Fusarium equiseti is a plant pathogen with a wide range of hosts and diverse effects, including probiotic activity. However, the underlying molecular mechanisms remain unclear, hindering its effective control and utilization. In this study, the Illumina HiSeq 4000 and PacBio platforms were used to sequence and assemble the whole genome of Fusarium equiseti D25-1.
Results
The assembly included 16 fragments with a GC content of 48.01%, gap number of zero, and size of 40,776,005 bp. There were 40,110 exons and 26,281 introns having a total size of 19,787,286 bp and 2,290,434 bp, respectively. The genome had an average copy number of 333, 71, 69, 31, and 108 for tRNAs, rRNAs, sRNAs, snRNAs, and miRNAs, respectively. The total repetitive sequence length was 1,713,918 bp, accounting for 4.2033% of the genome. In total, 13,134 functional genes were annotated, accounting for 94.97% of the total gene number. Toxin-related genes, including two related to zearalenone and 23 related to trichothecene, were identified. A comparative genomic analysis supported the high quality of the F. equiseti assembly, exhibiting good collinearity with the reference strains, 3,483 species-specific genes, and 1,805 core genes. A gene family analysis revealed more than 2,500 single-copy orthologs. F. equiseti was most closely related to Fusarium pseudograminearum based on a phylogenetic analysis at the whole-genome level.
Conclusions
Our comprehensive analysis of the whole genome of F. equiseti provides basic data for studies of gene expression, regulatory and functional mechanisms, evolutionary processes, as well as disease prevention and control.
Fusarium equiseti, also known as Gibberella intricans in its sexual stage, has complex and diverse beneficial or pathogenic properties. As a probiotic, it can produce many metabolites beneficial to plants or humans. For example, it produces linamarase [1], equisetin, cellulase, polyketides [2], glucose, alcohol [3], and various metabolites that inhibit hepatitis virus infections [4]. Additionally, Marín et al. (2012) have found that F. equiseti can produce certain type A and B trichothecenes, butenolides, beauvericin, zearalenone, and fusarochromanone [5]. F. equiseti is also employed in plant insect control [6–7], benzopyrene degradation, and heavy metal pollution abatement [8]. Conversely, F. equiseti itself also acts as a plant pathogen with a broad host range and produces various toxins. In fact, it is a pathogen of multiple cereals, including wheat [9], corn [10], triticale, oats, and barley [11], and other crops, such as cowpea [12], beans [13], melon [14], lettuce [15], sunflower [16], jujube, and radish [17]. Some flowers, trees, and Chinese medicinal plants, including hickory [18], white pine seedlings [19], cumin [20], fishtail palm [21], hybrid cymbidium, clover, rice flatsedge [22], white mangrove [23], mulberry [24], largehead atractylodes rhizome, and Chinese magnolia vine fruit [25] can also be infected with F. equiseti.
However, the mechanisms underlying the beneficial or pathogenic properties of F. equiseti are not fully understood. A direct and effective approach for clarifying the action mechanisms of F. equiseti is the identification of the genetic factors directly related to its functions. For instance, Stępień et al. have analyzed the tef-1α sequences of F. equiseti, as well as the toxin-related genes PKS13, PKS4, and TRI5 [26], and Kari has successfully cloned a protease gene [27]. However, there are few in-depth studies on the functional genes. Moreover, the traditional methods for functional gene discovery are insufficient to meet the modern-day research demands due to the high error rates and low efficiencies.
Advances in whole-genome sequencing technologies have enabled numerous key discoveries. In this study, the complete genomic sequence of Fusarium equiseti D25-1 was determined using multiple sequencing platforms. The assembly was characterized, including analyses of the genomic structure, with the aim of providing data to support the discovery of beneficial or harmful genes and to improve our understanding of the molecular pathogenesis.
Sequencing was performed using the Illumina HiSeq 4000 platform. A total of 56,458,678 reads were produced from D25-1, with an insert size of 500 bp and a read length of 125 bp. The proportion of adapter sequences was 0.25%, the proportion of repetitive sequences was 2.38%, and the low-quality sequences accounted for only 7.04%. After filtering out 9.69% of the sequences from the raw data (7,057 Mb), 6,372 Mb of clean data were obtained (Table 1).
Insert size (bp) | Read length (bp) | Raw data (Mb) | Adapter (%) | Duplication (%) | Total reads | Filtered reads (%) | Low-quality filtered reads (%) | Clean data (Mb) |
---|---|---|---|---|---|---|---|---|
500 | 125 | 7,057 | 0.25 | 2.38 | 56,458,678 | 9.69 | 7.04 | 6,372 |
Polymerase Read Number | Mean Read Length (bp) | Polymerase Read Bases (bp) | Polymerase Read Quality | Sub-read Number | Sub-read Mean Length (bp) | Sub-read Bases (bp) | Subread Quality | Utilization Ratio |
---|---|---|---|---|---|---|---|---|
294,876 | 15,189 | 4,957,271,514 | 0.84 | 447,618 | 9,970 | 4,463,016,422 | 0.84 | 0.92 |
In the base distribution analysis, the frequencies of A and T, as well as the frequencies of G and C, were correlated, indicating an equilibrium base composition. Further, base quality values were relatively high, suggesting a low error rate and high-quality reads (Fig. 1).
Given the large quantities of adapter sequences and low-quality or erroneous sequences in the raw sequencing data obtained using the PacBio platform, extensive filtering was performed to generate clean data, which were deposited at DDBJ/ENA/GenBank under accession number QOHM00000000. The lengths and qualities of the reads were well distributed (Fig. 2.). The version described in this paper is version QOHM01000000. Summary statistics are provided in Table. 2.
Prior to assembly, the genome size, heterozygosity, and repetitive sequence information were determined by a K-mer analysis. The D25-1 genome was approximately 44.69 Mb in size, with a coverage depth of 144.34×. The details are shown in Fig. 3.
Data assembly was performed by combining the data from the Illumina HiSeq4000 and PacBio sequencing platforms. As shown in Table 3, the D25-1 genome was well spliced. A total of 16 Fusarium equiseti chromosomes with a total length of 40,776,005 bases were assembled. The longest chromosome was 8,344,890 bp, and the shortest chromosome was 2,316 bp. The gap number was 0, and the GC content was 48.01%.
Seq Type | Total Number | Total Length (bp) | N50 Length (bp) | N90 Length (bp) | Max Length (bp) | Min Length (bp) | Gap Number (bp) | GC Content (%) |
---|---|---|---|---|---|---|---|---|
Scaffold | 16 | 40,776,005 | 6,178,397 | 2,783,306 | 8,344,890 | 2,316 | 0 | 48.01 |
Contig | 16 | 40,776,005 | 6,178,397 | 2,783,306 | 8,344,890 | 2,316 | - | 48.01 |
Through calculating GC content and average depth, one can analyze whether a GC bias exists. If not seriously biased, the corresponding scatter diagram will assume a shape similar to that of Poisson distribution and have a peak near the GC content estimate of the genome. The more the graph deviates from this peak, the lower the depth is. The GC bias in the genome of D25-1 was analyzed after assembly, and a Poisson distribution was observed, suggesting no substantial bias (Fig. 4).
Gene prediction was performed using the assembly to obtain the open reading frames and gene length distribution, as summarized in Table 4. The total number, total length, average length, and percentage of total chromosome lengths were analyzed for the genes, exons, CDS, and introns. The introns were abundant but short, accounting for a relatively low proportion of the genome.
Type | Total Number | Total Length (bp) | Average Length (bp) | Length/Genome Length (%) |
---|---|---|---|---|
Gene | 13,829 | 22,077,720 | 1,596.48 | 54.14 |
Exon | 40,110 | 19,787,286 | 493.33 | 48.53 |
CDS | 13,829 | 19,787,286 | 1,430.85 | 48.53 |
Intron | 26,281 | 2,290,434 | 87.15 | 5.62 |
Type | Copy number | Average Length (bp) | Total Length (bp) | Percentage of Genome (%) |
---|---|---|---|---|
tRNA | 333 | 86.91 | 28,943 | 0.0710 |
rRNA (de novo prediction) | 71 | 116.01 | 8,237 | 0.0202 |
sRNA | 69 | 66.69 | 4,602 | 0.0113 |
snRNA | 31 | 140.12 | 4,344 | 0.0107 |
miRNA | 108 | 55.77 | 6,024 | 0.0148 |
Genes with lengths between 500–999 bp were most abundant (n = 3,663), followed by those of 1000–1499 bp. Genes shorter than 200 bp and longer than 10,000 bp were infrequent (Fig. 5).
The non-coding RNAs in the Fusarium equiseti D25-1 genome were analyzed. With a copy number of 333, an average length of 86.91 nt, and a total length of 28,943 nt, tRNAs accounted for 0.0171% of the entire genome. The rRNA copy number was 71; the average length was 116.01 nt, and the total length was 8237 nt, accounting for 0.0202% of the genome. sRNAs represented 0.0113% of the genome, with a copy number of 69, an average length of 66.69 nt, and a total length of 4,602 nt. Small nuclear rRNAs accounted for 0.0107% of the genome, with a copy number of 31, an average length of 140.12 nt, and a total length of 4,334 nt. The microRNA copy number was 108; the average length was 55.77 nt, and the total length was 6,024 nt, accounting for 0.0148% of the genome.
Repetitive sequences, including DNA transposons, tandem repeats (TRs), and transposable elements, have important roles in chromosomal spatial structure, regulation of gene expression, and genetic recombination. Transposable elements are further classified into long terminal repeats (LTRs) and non-LTRs, and the latter category includes long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs). Tables 6 and 7 list the repetitive sequences in the Fusarium equiseti D25-1 genome, as determined by various prediction algorithms and databases. For instance, the TRs predicted by the TRF software spanned 224,134 bp, which accounted for only 0.5497% of the genome. The total predicted repetitive sequences spanned 1,713,918 bp, accounting for 4.2033% of the genome.
Method | Repeat Size (bp) | Percentage of Genome (%) |
---|---|---|
Repbase | 561,215 | 1.3763 |
ProMask | 420,889 | 1.0322 |
de novo | 1,169,760 | 2.8687 |
TRF | 224,134 | 0.5497 |
Total | 1,713,918 | 4.2033 |
Repbase TEs | ProteinMask TEs | De novo TEs | Combined TEs | ||||||
---|---|---|---|---|---|---|---|---|---|
Type | Length (bp) | % of Genome | Length (bp) | % of Genome | Length (bp) | % of Genome | Length (bp) | % of Genome | |
DNA | 220,036 | 0.5396 | 195,998 | 0.4807 | 164,159 | 0.4026 | 311,230 | 0.7633 | |
LINE | 63,265 | 0.1552 | 59,402 | 0.1457 | 27,174 | 0.0666 | 98,240 | 0.2409 | |
LTR | 279,482 | 0.6854 | 165,768 | 0.4065 | 352,607 | 0.8647 | 527,778 | 1.2943 | |
SINE | 2,627 | 0.0064 | 0 | 0.0000 | 9,329 | 0.0229 | 11,283 | 0.0277 | |
Other | 0 | 0.0000 | 0 | 0.0000 | 0 | 0.0000 | 0 | 0.0000 | |
Unknown | 1,940 | 0.0048 | 0 | 0.0000 | 619,726 | 1.5198 | 621,666 | 1.5246 | |
Total | 561,215 | 1.3763 | 420,889 | 1.0322 | 1,169,760 | 2.8687 | 1,530,931 | 3.7545 |
As shown in Table 8, 13134 genes, accounting for 94.97% of all the genes in D25-1 genome, were annotated after BLAST searches against all databases. Over 70% of annotations were based on the NR, NOG, and IPR databases. Furthermore, 1422 genes (10.28%) were annotated using the PHI database.
Sample | Total | ARDB | CAZy | COG | GO | IPR | KEGG | KOG |
---|---|---|---|---|---|---|---|---|
D25-1 | 13829 | 2 | 397 | 1516 | 7261 | 9995 | 4646 | 2231 |
Proportion | 0.01% | 2.87% | 10.96% | 52.5% | 72.27% | 33.59% | 16.13% | |
Sample | NOG | NR | P450 | PHI | SwissProt | T3SS | VFDB | Overall |
D25-1 | 11332 | 12640 | 1209 | 1422 | 3284 | 3804 | 55 | 13134 |
Proportion | 81.94% | 91.4% | 8.74% | 10.28% | 23.74% | 27.5% | 0.39% | 94.97% |
Genes encoding toxins were mined based on the annotation results. Two genes related to zearalenone were identified (D25-1_GLEAN_10000531 and D25-1_GLEAN_10000533). D25-1_GLEAN_10000531 was related to the non-reductive iterative type I polyketide synthase involved in the synthesis of zearalenone, whereas D25-1_GLEAN_10000533 was related to the highly reductive iterative type I polyketide synthase (Table 9).
Gene ID | Identity | Database | Database gene ID | Function |
---|---|---|---|---|
D25-1_GLEAN_10000531 | 54.84 | KEGG | pcs:Pc21g12450 | Zearalenone synthase, nonreducing iterative type I polyketide synthase |
D25-1_GLEAN_10000533 | 57.27 | KEGG | pcs:Pc21g12440 | Zearalenone synthase, highly reducing iterative type I polyketide synthase |
Additionally, 23 genes related to trichothecene mycotoxin were identified, including 19 genes related to the active efflux pump of trichothecene mycotoxin, two related to trioxyacetyltransferase of trichothecene mycotoxin, and two related to the biosynthesis of trichothecene mycotoxin. Many of these genes were obtained by searches against the NOG database (Table 10).
Gene ID | Identity (%) | Database | Database gene ID | Function |
---|---|---|---|---|
D25-1_GLEAN_10001590 | 42.83 | sordNOG | JGI67026 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10001811 | 76.04 | ascNOG; euNOG; fuNOG; opiNOG | EFQ35752 | Trichothecene 3-O-acetyltransferase |
D25-1_GLEAN_10002084 | 87.89 | hypNOG; necNOG | FGSG_10823P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10002452 | 76.6 | hypNOG; necNOG; sordNOG | FGSG_12768P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10003513 | 78.59 | hypNOG | NechaP36294 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10003680 | 94.59 | hypNOG; necNOG | FGSG_08749P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10005110 | 91.85 | hypNOG; necNOG | FGSG_05352P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10005621 | 80.1 | hypNOG; necNOG; sordNOG | XP_387212.1 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10006448 | 69.65 | necNOG | NechaP53897 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10006579 | 88.71 | ascNOG; hypNOG; necNOG; opiNOG; sordNOG | XP_383902.1 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10006973 | 71.71 | NOG; ascNOG; euNOG; hypNOG; fuNOG; necNOG; opiNOG; sordNOG | FVEG_00056T0 | Trichothecene 3-O-acetyltransferase |
D25-1_GLEAN_10006980 | 85.6 | NOG; ascNOG; euNOG; hypNOG; fuNOG; necNOG; opiNOG; sordNOG | FGSG_03537P0 | Trichothecene biosynthesis |
D25-1_GLEAN_10009924 | 89.23 | ascNOG; fuNOG; opiNOG | FGSG_02343P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10010021 | 85.26 | hypNOG; necNOG | FGSG_12141P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10010553 | 91.59 | necNOG | FVEG_04795T0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10011281 | 86 | hypNOG | FOXG_01267P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10011497 | 83.28 | hypNOG; sordNOG | FGSG_11815P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10011948 | 82.65 | hypNOG; necNOG; sordNOG | XP_381015.1 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10013026 | 92.83 | necNOG; sordNOG | XP_389873.1 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10013028 | 94.49 | ascNOG; fuNOG; opiNOG; hypNOG; sordNOG | FGSG_09701P0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10013718 | 54.74 | necNOG | FVEG_04795T0 | Fungal trichothecene efflux pump (TRI12) |
D25-1_GLEAN_10011175 | 53.63 | NR | gi|751354543|gb|KIL92265.1| | tri7-trichothecene biosynthesis gene cluster |
D25-1_GLEAN_10009924 | / | IPR | PF06609 | Fungal trichothecene efflux pump (TRI12) |
As summarized in Table 11, the newly assembled F. equiseti genome had the most intact sequence, with 16 contigs and an N50 of 6,178,397 bp, indicating a high-quality assembly. In a comparative analysis, the best-assembled F. oxysporum genome only had 33 scaffolds, with an N50 of 4,490,135 bp, which was superior to the assembly results for other strains. The differences in quality might be explained by the differences in assembly technology and platform. In this study, data were obtained using the third-generation PacBio platform and second-generation Illumina platform for joint assembly; whereas, most previously reported data had been obtained using the second-generation sequencing technology alone.
Sample_Name | Seq Type | Total Number | Total Length (bp) | N50 Length (bp) | N90 Length (bp) | Max Length (bp) | Min Length (bp) | Gap Number (bp) | GC Content (%) |
---|---|---|---|---|---|---|---|---|---|
Bipolaris sorokiniana | Scaffold | 154 | 34,409,167 | 1,789,485 | 1,003,746 | 3,642,493 | 2,011 | 1,196,549 | 49.84 |
F. avenaceum | Scaffold | 83 | 41,590,745 | 1,436,644 | 424,894 | 4,337,333 | 602 | 29,282 | 48.47 |
F. oxysporum | Scaffold | 33 | 52,908,293 | 4,490,135 | 2,466,030 | 6,470,671 | 4,587 | 0 | 47.67 |
F. pseudograminearum | Scaffold | 281 | 36,973,259 | 8,840,934 | 7,724,594 | 11,688,822 | 502 | 40,400 | 47.75 |
Nectria haematococca | Scaffold | 209 | 51,286,497 | 1,255,602 | 96,667 | 4,937,060 | 865 | 56,137 | 50.79 |
F. equiseti D25-1 | Contig | 16 | 40,776,005 | 6,178,397 | 2,783,306 | 8,344,890 | 2,316 | - | 48.01 |
Figure 6 summarizes the genomic collinearity results. The highest collinearity (i.e., greatest conservation) was found for F. pseudograminearum, which had high similarity with F. equiseti D25-1, followed by F. avenaceum and F. oxysporum, whereas B. sorokiniana had the lowest similarity. In addition, the gene number in the reference B. sorokiniana genome was 5,133; however, in a collinearity analysis, only 37.12% of the target genes were covered. The N. haematococca reference genome had 7,123 genes, covering 51.51% of the target genes. The F. pseudograminearum genome had 10,052 genes, covering 72.69% of the target genes. The F. avenaceum genome had 9884 genes, covering 71.47% of the target genes. The F. oxysporum genome had 10,236 genes, covering 74.02% of the target genes. Such findings might be related to the large number of reference genes from F. oxysporum.
Based on core-pan genome analysis, the six strains shared 1,805 core genes. As an outgroup, Bipolaris sorokiniana had the most species-specific genes (n = 8,912), followed by Nectria haematococca (n = 5,759), Fusarium oxysporum (n = 4,946), Fusarium equiseti (n = 3,483), Fusarium avenaceum (n = 2,614), and Fusarium pseudograminearum (n = 2,299), respectively (Fig. 7).
COG was used to annotate the core genes (Fig. 8) in four modules (cellular processes, genetic information storage and transmission, metabolism, and unknown function). Regarding cellular processes, 3 genes were associated with cell cycle control, cell division, and chromosome partitioning; 14 were involved in cell wall/membrane/envelop biogenesis; 1 gene was related to cell mobility; 5 were related to defense mechanisms; 1 was associated with the cytoskeleton; 1 was associated with intracellular trafficking, secretion, and vesicular transport; 51 were related to posttranslational modification, protein turnover, and chaperones; and 6 were related to signal transduction mechanisms. Five terms in the genetic information storage and transmission module were overrepresented; 1 gene was responsible for chromatin structure and dynamics; 1 was related to RNA processing and modification; 17 were related to replication, recombination, and modification; 14 were related to transcription; and 80 were related to translation, ribosomal structure, and biogenesis. In the metabolism module, 84 genes were associated with amino acid transport and metabolism; 63 were related to carbohydrate transport and metabolism; 38 were associated with coenzyme transport and metabolism; 69 were related to energy production and conversion; 20 were related to inorganic ion transport and metabolism; 58 were linked with lipid transport and metabolism; 35 were related to nucleotide transport and metabolism; and 36 were related to secondary metabolite biosynthesis, transport, and catabolism.
Genes specific to the Fusarium equiseti D25-1 genome were annotated using COG (Fig. 9) based on four modules (cellular processes, genetic information storage and transmission, metabolism, and unknown function). In the cellular processes module, 2 genes were associated with cell cycle control, cell division, and chromosome partitioning; 3 were associated with cell wall/membrane/envelop biogenesis; 1 gene was related to defense mechanisms; 5 were related to posttranslational modification, protein turnover, and chaperones; and 5 were related to signal transduction mechanisms. In the genetic information storage and transmission module, 17 genes were associated with replication, recombination, and modification, and 7 were related to translation, ribosomal structure, and biogenesis. In the metabolism module, 12 genes were associated with amino acid transport and metabolism; 13 were related to carbohydrate transport and metabolism; 7 were associated with coenzyme transport and metabolism; and 13 were related to secondary metabolites biosynthesis, transport, and catabolism.
Phylogenetic analysis based on the whole genomes (Fig. 10) indicated that F. equiseti was most closely related to F. pseudograminearum, followed by F. avenaceum and F. oxysporum.
As shown in Table 12, the clustered genes in the outgroup taxon Bipolaris sorokiniana accounted for 72% of the total genes, whereas the genes assigned to clusters in Fusarium solani accounted for over 90% of the genes. Typically, Fusarium equiseti D25-1 gene families exceed 60%, with 54 species-specific gene families.
Sample | Gene Number | Clustered Genes | Unclustered Genes | Families | Unique Families |
---|---|---|---|---|---|
Bipolaris sorokiniana | 12,154 | 8,692 | 3,462 | 5,708 | 213 |
Fusarium oxysporum | 16,792 | 15,288 | 1,504 | 8,726 | 139 |
Nectria haematococca | 15,647 | 14,091 | 1,556 | 7,901 | 109 |
Fusarium equiseti D25-1 | 13,829 | 12,654 | 1,175 | 8,493 | 54 |
Fusarium avenaceum | 13,092 | 12,476 | 616 | 8,181 | 9 |
Fusarium pseudograminearum | 12,348 | 11,446 | 902 | 8,117 | 7 |
Single-copy orthologs were detected in the genome of each strain (n > 2,500). Multiple-copy orthologs were also detected in each genome, but the counts differed among the species. For instance, Fusarium oxysporum and Nectria haematococca had high numbers of multi-copy orthologs (Fig. 11).
Gene family-based clustering analysis (Fig. 12) indicated that F. equiseti D25-1 was most similar to F. pseudograminearum, followed by F. avenaceum and F. oxysporum. These results were consistent with the results of the genome-wide sequence analysis, as well as the evolutionary position of F. equiseti in a genome-based phylogenetic analysis.
The whole-genome sequencing assembly of Fusarium equiseti D25-1 was based on a combination of results from Illumina Hiseq 4000 and PacBio platforms. The primary assembly was carried out on Illumina Hiseq 4000 platform, and then the rough analysis of the genomic data and pre-determination of the genome size were conducted, providing a basic reference for an advanced assembly on PacBio platform. This platform provides more accurate results, and the assembly effect is better.
Meanwhile, the gene composition and distribution, and the number of non-coding RNA genes, repeat sequences, and transposon types were identified. Non-coding RNAs play a crucial role in the regulation of mRNA translation, localization, and stability, in addition to other processes [28]. Numerous studies have shown that miRNAs play key roles in the development of cancer [29, 30]. Thus, the roles of F. equiseti miRNAs in pathogenesis should be evaluated. Transposons can affect the coding capacity of genes and can even lead to chromosomal rearrangements [31]. Identification, classification, and annotation of transposons can be accomplished more accurately over whole genomes. Therefore, the genome-wide annotation of F. equiseti D25-1 transposons can provide a basis for the detection of functionally and evolutionarily important genes, including those associated with pathogenicity.
On the basis of these basic data, gene function was annotated. Gene function annotation was used to comprehensively annotate the basic functions of F. equiseti D25-1 through various databases. Numerous studies have shown that F. equiseti is pathogenic [32, 33]. Additionally, fusarium can produce multiple toxins, including zearalenone, trichothecene mycotoxin, moniliformin, and fumonisins [34]. Hence, the genes that confer pathogenicity to fungi were also annotated using the carbohydrate-related enzyme (CAZy), PHI, KEGG, and cytochrome P450 databases. The derived data can be used to explore the factors and mechanism(s) underlying the pathogenicity of F. equiseti.
Currently, the genomic sequences of many plant pathogenic fungi can be retrieved from the NCBI database. This study selected Fusarium oxysporum, Nectria haematococca, Fusarium pseudograminearum, Fusarium avenaceum, and Bipolaris sorokiniana as the reference strains for Fusarium equiseti D25-1 because these strains are crop root rot pathogens. For example, F. pseudograminearum and N.haematococca can cause root rot in barley [35] and physic nut [36], respectively. Our study showed that these pathogens have many common genes, such as the core genes described here. This similarity may cause them to give rise to the same disease. In addition, genes belonging to many families have been identified in the genomes of the six strains through gene family analysis, and the functional characteristics of each gene family need to be further studied.
In this study, the complete genome sequence of F. equiseti D25-1 was assembled based on the Illumina HiSeq 4000 and PacBio platforms. The primary assembled genome had a size of 40.55 Mb and a GC content of 47.92%. The advanced assembly included 16 chromosomes, with a GC content of 48.01%, a total size of 40,776,005 bp, and a gap number of 0. The introns, exons, gene lengths, non-coding RNA, and repetitive sequences were characterized. Based on 14 databases, 13,134 functional genes were annotated in the Fusarium equiseti D25-1 genome, accounting for a high proportion (94.97%) of the total genes. Moreover, multiple genes were related to cellular processes, metabolism, molecular functions, and pathogenicity, including two genes related to zearalenone and 23 genes associated with trichothecene mycotoxin. High collinearity with the genome sequence of the reference strain was observed, with 3,483 specific genes, 1,805 common genes, and over 2,500 single-copy orthologs. The genome had the highest sequence similarity with the genome of F. pseudograminearum, followed by F. avenaceum, providing insight into its evolutionary position. The genomic data provide a basis for studies of gene expression, regulatory mechanisms, functional mechanisms, evolutionary processes, and disease control in F. equiseti and other fungi.
F. equiseti D25-1 was isolated from the rotten root of highland barley in our preliminary study; it was identified as a pathogen of highland barley root rot disease according to Koch's postulates.
F. equiseti D25-1 was inoculated into a 500-mL flask containing 200-mL sterile potato dextrose broth and cultured at 25 °C in a 120 r/min shaker for 5 d. After filtration, mycelia were collected by centrifugation at 7,168 g for 1 min, followed by DNA extraction using the OMEGA Fungal DNA Kit (Biel, Switzerland) as per the manufacturer’s instructions.
After quality control, DNA samples were used to construct a library. Large DNA fragments (genomic DNA, BACs, or long-range PCR products) were randomly cleaved by Covaris or Bioruptor ultrasonication to generate a series of DNA fragments with the main band of 800 bp or less. Next, T4 DNA Polymerase, Klenow DNA Polymerase, and T4 PNK were used to repair the ensuing sticky ends into blunt ends, and the base "A" was added to the 3' ends to allow the DNA fragments to become ligated to special adapters containing "T" at their 3' ends. Next, the desired ligation product was identified by electrophoresis, and the DNA fragments with adapters at both ends were amplified by PCR. Finally, cluster preparation and sequencing were performed using a qualified library (sequencing was completed with the BGI Tech APAC microbial line).
The genomic DNA was processed by g-TUBE into fragments of appropriate size, after which damage- and end-repair were performed. Both ends of the DNA fragments were ligated to the hairpin adapters to form a dumbbell structure designated SMRTbell. The annealed SMRTbell was mixed with polymerase at the bottom of ZWM for final sequencing (sequencing was completed with the BGI Tech APAC microbial line).
To filter low-quality data and thereby ensure the accuracy and reliability of the subsequent analyses, 1–125 bp of read 1 and read 2 were first obtained. Reads with quality values ≤ 2 were removed, as were those with a total of 40% "N" bases. Adapters and duplications were eliminated.
The raw sequencing data from the PacBio platform contained the adapter sequences, low-quality sequences, erroneous sequences, and other reads requiring processing. To improve the assembly results, the following steps were taken: 1) polymerase reads < 1,000 bp were removed; 2) polymerase reads with qualities < 0.80 were removed; 3) sub-reads were extracted from polymerase reads, and adapter sequences were removed; 4) sub-reads < 1,000 bp were removed.
The sequencing data were assembled using a variety of tools. 1) During the sub-read correction, multiple algorithms were used for self-correction (Pbdagcon and FalconConsensus) or hybrid correction (Proovread) of the sub-reads to obtain highly reliable corrected reads. 2) For corrected read assembly, Celera and Falcon were used to finalize the optimal assembly. 3) For assembly correction, single-base correction (GATK) was performed using the second-generation Illumina short reads to obtain a highly reliable assembly sequence. 4) For scaffolding, SSPACE Basic v2.0 was used based on the second-generation Illumina long reads, and the scaffold gaps were filled using PBjelly2 [37–39].
Gene component analyses were primarily performed using Homology [40], SNAP [41], Augustus [42], and GeneMark-ES [43]. Homology prediction was implemented in GeneWise. SNAP and Augustus were used for predictions using training sets for the reference species.
For the non-coding RNA analysis, rRNAs were detected by comparing the rRNA library with the non-coding RNAs predicted by RNAmmer 1.2 [44]. The domains and secondary structures of the tRNAs were predicted using tRNAscan 1.3.1 [45]. The sRNAs were identified by comparing with the Rfam 9.1 [46] database using Infernal.
Transposons were predicted via three approaches–Repeat Masker 4-0-6 (using Repbase database), RepeatProteinMasker (using its own transposon protein library), or de novo (a database of transposon sequences was generated using buildXDFDatabase, and then a transposon model was constructed according to this database using Repeat Modeler, followed by transposon identification from the constructed model using Repeat Masker). The tandem repeats were predicted using TRF 4.04 (Tandem Repeat Finder) [47].
The protein sequences were functionally annotated. Gene sequences were aligned with the available sequences in databases to obtain corresponding annotations. The highest quality alignment was chosen for gene annotation. Functional annotation was performed by BLAST searches against the following databases: Gene Ontology (GO) [48], Kyoto Encyclopedia of Genes and Genomes (KEGG) [49], Cluster of Orthologous Groups of proteins (COG) [50], Swiss-Prot [51], Trembl 2016, NR 2015, EggNOG 4.5, Antibiotic Resistance Genes Database (ARDB) 1.1 [52], Pathogen Host Interactions (PHI) 4.0 [53], Fungal Cytochrome P450 Database1.1 [54], Carbohydrate-Active enzymes Database (CAZy), virulence factor database (VFDB) [55], Type III secretion system Effector protein (T3SS) 1.0 [56], and TransportDB 2.0. Finally, the genes related to toxin production were identified.
A comparative genomic analysis was performed with the whole-genome sequences of Bipolaris sorokiniana (INSDC: AEIN00000000.1), Fusarium avenaceum (INSDC: JPYM00000000.1), Fusarium oxysporum (INSDC: MABQ00000000.2), Fusarium pseudograminearum (INSDC: AFNW00000000.1), and Nectria haematococca (INSDC: ACJF00000000.1) as the reference.
The sequence of the target fungus was ordered according to that of the reference fungus using MUMmer [57]. Protein set P1 of the target fungus was aligned with protein set P2 of the reference fungus. P1 was first aligned with P2 using BLASTP by setting P2 as the database, and the best hit for each protein was selected. The reciprocal alignment was then performed (by setting P1 as the database). Finally, the results with the best hit values for both alignments were retained, and the average of two consistent values was obtained.
Genes from the reference genome were used as the gene pool. Predicted genes from Query samples were BLAST-searched against the gene pool, and the BLAST results were filtered by length and identity. The BLAST coverage ratios of the genes from the gene pool and Query samples were calculated separately. Finally, a tree was constructed based on the multiple sequence alignment obtained using Muscle by the neighbor-joining method implemented in Treebest [58]. Completed core and specific genes were annotated using BLAST and the COG database [59].
Gene families were constructed for the reference genes and the genes of the target fungus. Protein sequences were aligned using BLAST, and redundancy was eliminated using SOLAR. The gene family database TreeFam was used for the clustering analysis using Hclustersg. The protein alignment results for gene families were converted to the amino acid sequences of the CDS regions using Muscle [60]. The gene family tree was constructed based on the multiple sequence alignment results by the neighbor-joining method using Treebest.
Basic local alignment search tool
Gene Ontology
Kyoto Encyclopedia of Genes and Genomes
Cluster of Orthologous Groups of proteins
Antibiotic Resistance Genes Database
Pathogen Host Interactions
Carbohydrate-Active enzymes Database
Virulence factor database
Type III secretion system Effector protein
Non-supervised Orthologous Groups
Availability of data and materials
The datasets generated and/or analyzed in the current study can be obtained from the corresponding/first author or DDBJ/ENA/GenBank repository under the accession number QOHM00000000, https://www.ncbi.nlm.nih.gov/bioproject/?term=QOHM00000000&utm_source=gquery&utm_medium=search.
Acknowledgements
Not applicable
Funding
This study was supported by the Youth Fund for Gansu Academy of Agricultural Sciences
(Grant/Award Number: 2019GAAS34) and the Special Fund for Agro-Scientific Research in the
Public Interest (201503112).
Author information
Affiliations
Institute of Plant Protection, Gansu Academy of Agricultural Sciences, Lanzhou, 730070, China
Xueping Li, Yonghong Qi, Yonggang Liu & Minquan Li
College of Prataculture, Gansu Agricultural University, Lanzhou, 730070, China
Jianhong Li
Contributions
Xueping Li was the lead investigator of this study. Jianhong Li assisted in DNA extraction.
Yonghong Qi helped to collect the samples. Yonggang Liu and Minquan Li served as consultants on
the experimental designs and suggested some modifications.
Corresponding author
Correspondence to Xueping Li
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.