Determination and Structural Analysis of the Whole-Genome Sequence of Fusarium equiseti D25-1

Background Fusarium equiseti is a plant pathogen with a wide range of hosts and diverse effects, including probiotic activity. However, the underlying molecular mechanisms remain unclear, hindering its effective control and utilization. In this study, the Illumina HiSeq 4000 and PacBio platforms were used to sequence and assemble the whole genome of Fusarium equiseti D25-1. Results The assembly included 16 fragments with a GC content of 48.01%, gap number of zero, and size of 40,776,005 bp. There were 40,110 exons and 26,281 introns having a total size of 19,787,286 bp and 2,290,434 bp, respectively. The genome had an average copy number of 333, 71, 69, 31, and 108 for tRNAs, rRNAs, sRNAs, snRNAs, and miRNAs, respectively. The total repetitive sequence length was 1,713,918 bp, accounting for 4.2033% of the genome. In total, 13,134 functional genes were annotated, accounting for 94.97% of the total gene number. Toxin-related genes, including two related to zearalenone and 23 related to trichothecene, were identi�ed. A comparative genomic analysis supported the high quality of the F. equiseti assembly, exhibiting good collinearity with the reference strains, 3,483 species-specic genes, and 1,805 core genes. A gene family analysis revealed more than 2,500 single-copy orthologs. F. equiseti was most closely related to Fusarium pseudograminearum based on a phylogenetic analysis at the whole-genome level.


Background
Fusarium equiseti, also known as Gibberella intricans in its sexual stage, has complex and diverse bene cial or pathogenic properties.As a probiotic, it can produce many metabolites bene cial to plants or humans.For example, it produces linamarase [1], equisetin, cellulase, polyketides [2], glucose, alcohol [3], and various metabolites that inhibit hepatitis virus infections [4].Additionally, Marín et al. (2012) have found that F. equiseti can produce certain type A and B trichothecenes, butenolides, beauvericin, zearalenone, and fusarochromanone [5].F. equiseti is also employed in plant insect control [6][7], benzopyrene degradation, and heavy metal pollution abatement [8].Conversely, F. equiseti itself also acts as a plant pathogen with a broad host range and produces various toxins.In fact, it is a pathogen of multiple cereals, including wheat [9], corn [10], triticale, oats, and barley [11], and other crops, such as cowpea [12], beans [13], melon [14], lettuce [15], sun ower [16], jujube, and radish [17].Some owers, trees, and Chinese medicinal plants, including hickory [18], white pine seedlings [19], cumin [20], shtail palm [21], hybrid cymbidium, clover, rice atsedge [22], white mangrove [23], mulberry [24], largehead atractylodes rhizome, and Chinese magnolia vine fruit [25] can also be infected with F. equiseti.However, the mechanisms underlying the bene cial or pathogenic properties of F. equiseti are not fully understood.A direct and effective approach for clarifying the action mechanisms of F. equiseti is the identi cation of the genetic factors directly related to its functions.For instance, Stępień et al. have analyzed the tef-1α sequences of F. equiseti, as well as the toxin-related genes PKS13, PKS4, and TRI5 [26], and Kari has successfully cloned a protease gene [27].However, there are few in-depth studies on the functional genes.Moreover, the traditional methods for functional gene discovery are insu cient to meet the modern-day research demands due to the high error rates and low e ciencies.
Advances in whole-genome sequencing technologies have enabled numerous key discoveries.In this study, the complete genomic sequence of Fusarium equiseti D25-1 was determined using multiple sequencing platforms.The assembly was characterized, including analyses of the genomic structure, with the aim of providing data to support the discovery of bene cial or harmful genes and to improve our understanding of the molecular pathogenesis.

Illumina HiSeq4000 data
Sequencing was performed using the Illumina HiSeq 4000 platform.A total of 56,458,678 reads were produced from D25-1, with an insert size of 500 bp and a read length of 125 bp.The proportion of adapter sequences was 0.25%, the proportion of repetitive sequences was 2.38%, and the low-quality sequences accounted for only 7.04%.After ltering out 9.69% of the sequences from the raw data (7,057 Mb), 6,372 Mb of clean data were obtained (Table 1).In the base distribution analysis, the frequencies of A and T, as well as the frequencies of G and C, were correlated, indicating an equilibrium base composition.Further, base quality values were relatively high, suggesting a low error rate and high-quality reads (Fig. 1).

PacBio data
Given the large quantities of adapter sequences and low-quality or erroneous sequences in the raw sequencing data obtained using the PacBio platform, extensive ltering was performed to generate clean data, which were deposited at DDBJ/ENA/GenBank under accession number QOHM00000000.The lengths and qualities of the reads were well distributed (Fig. 2.).The version described in this paper is version QOHM01000000.Summary statistics are provided in Table .2.

Genome Evaluation
Prior to assembly, the genome size, heterozygosity, and repetitive sequence information were determined by a K-mer analysis.The D25-1 genome was approximately 44.69 Mb in size, with a coverage depth of 144.34×.The details are shown in Fig. 3.

Assembly statistics
Data assembly was performed by combining the data from the Illumina HiSeq4000 and PacBio sequencing platforms.As shown in Table 3, the D25-1 genome was well spliced.A total of 16 Fusarium equiseti chromosomes with a total length of 40,776,005 bases were assembled.The longest chromosome was 8,344,890 bp, and the shortest chromosome was 2,316 bp.The gap number was 0, and the GC content was 48.01%.

GC content
Through calculating GC content and average depth, one can analyze whether a GC bias exists.If not seriously biased, the corresponding scatter diagram will assume a shape similar to that of Poisson distribution and have a peak near the GC content estimate of the genome.The more the graph deviates from this peak, the lower the depth is.The GC bias in the genome of D25-1 was analyzed after assembly, and a Poisson distribution was observed, suggesting no substantial bias (Fig. 4).

Gene components
Gene prediction was performed using the assembly to obtain the open reading frames and gene length distribution, as summarized in Table 4.The total number, total length, average length, and percentage of total chromosome lengths were analyzed for the genes, exons, CDS, and introns.The introns were abundant but short, accounting for a relatively low proportion of the genome.Genes with lengths between 500-999 bp were most abundant (n = 3,663), followed by those of 1000-1499 bp.Genes shorter than 200 bp and longer than 10,000 bp were infrequent (Fig. 5).

Non-coding RNAs
The non-coding RNAs in the Fusarium equiseti D25-1 genome were analyzed.With a copy number of 333, an average length of 86.91 nt, and a total length of 28,943 nt, tRNAs accounted for 0.0171% of the entire genome.The rRNA copy number was 71; the average length was 116.01 nt, and the total length was 8237 nt, accounting for 0.0202% of the genome.sRNAs represented 0.0113% of the genome, with a copy number of 69, an average length of 66.69 nt, and a total length of 4,602 nt.Small nuclear rRNAs accounted for 0.0107% of the genome, with a copy number of 31, an average length of 140.12 nt, and a total length of 4,334 nt.The microRNA copy number was 108; the average length was 55.77 nt, and the total length was 6,024 nt, accounting for 0.0148% of the genome.

Repetitive sequences
Repetitive sequences, including DNA transposons, tandem repeats (TRs), and transposable elements, have important roles in chromosomal spatial structure, regulation of gene expression, and genetic recombination.Transposable elements are further classi ed into long terminal repeats (LTRs) and non-LTRs, and the latter category includes long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs).Tables 6 and 7 list the repetitive sequences in the Fusarium equiseti D25-1 genome, as determined by various prediction algorithms and databases.For instance, the TRs predicted by the TRF software spanned 224,134 bp, which accounted for only 0.5497% of the genome.The total predicted repetitive sequences spanned 1,713,918 bp, accounting for 4.2033% of the genome.

Gene annotation
As shown in Table 8, 13134 genes, accounting for 94.97% of all the genes in D25-1 genome, were annotated after BLAST searches against all databases.Over 70% of annotations were based on the NR, NOG, and IPR databases.Furthermore, 1422 genes (10.28%) were annotated using the PHI database.Genes encoding toxins were mined based on the annotation results.Two genes related to zearalenone were identi ed (D25-1_GLEAN_10000531 and D25-1_GLEAN_10000533).D25-1_GLEAN_10000531 was related to the non-reductive iterative type I polyketide synthase involved in the synthesis of zearalenone, whereas D25-1_GLEAN_10000533 was related to the highly reductive iterative type I polyketide synthase (Table 9).Additionally, 23 genes related to trichothecene mycotoxin were identi ed, including 19 genes related to the active e ux pump of trichothecene mycotoxin, two related to trioxyacetyltransferase of trichothecene mycotoxin, and two related to the biosynthesis of trichothecene mycotoxin.Many of these genes were obtained by searches against the NOG database (Table 10).

Comparative genomic analyses
As summarized in Table 11, the newly assembled F. equiseti genome had the most intact sequence, with 16 contigs and an N50 of 6,178,397 bp, indicating a high-quality assembly.In a comparative analysis, the best-assembled F. oxysporum genome only had 33 scaffolds, with an N50 of 4,490,135 bp, which was superior to the assembly results for other strains.The differences in quality might be explained by the differences in assembly technology and platform.In this study, data were obtained using the third-generation PacBio platform and second-generation Illumina platform for joint assembly; whereas, most previously reported data had been obtained using the second-generation sequencing technology alone.

Structural variation
Figure 6 summarizes the genomic collinearity results.The highest collinearity (i.e., greatest conservation) was found for F.
sorokiniana had the lowest similarity.In addition, the gene number in the reference B. sorokiniana genome was 5,133; however, in a collinearity analysis, only 37.12% of the target genes were covered.The N. haematococca reference genome had 7,123 genes, covering 51.51% of the target genes.The F. pseudograminearum genome had 10,052 genes, covering 72.69% of the target genes.The F.
avenaceum genome had 9884 genes, covering 71.47% of the target genes.The F. oxysporum genome had 10,236 genes, covering 74.02% of the target genes.Such ndings might be related to the large number of reference genes from F. oxysporum.
COG was used to annotate the core genes (Fig. 8) in four modules (cellular processes, genetic information storage and transmission, metabolism, and unknown function).Regarding cellular processes, 3 genes were associated with cell cycle control, cell division, and chromosome partitioning; 14 were involved in cell wall/membrane/envelop biogenesis; 1 gene was related to cell mobility; 5 were related to defense mechanisms; 1 was associated with the cytoskeleton; 1 was associated with intracellular tra cking, secretion, and vesicular transport; 51 were related to posttranslational modi cation, protein turnover, and chaperones; and 6 were related to signal transduction mechanisms.Five terms in the genetic information storage and transmission module were overrepresented; 1 gene was responsible for chromatin structure and dynamics; 1 was related to RNA processing and modi cation; 17 were related to replication, recombination, and modi cation; 14 were related to transcription; and 80 were related to translation, ribosomal structure, and biogenesis.In the metabolism module, 84 genes were associated with amino acid transport and metabolism; 63 were related to carbohydrate transport and metabolism; 38 were associated with coenzyme transport and metabolism; 69 were related to energy production and conversion; 20 were related to inorganic ion transport and metabolism; 58 were linked with lipid transport and metabolism; 35 were related to nucleotide transport and metabolism; and 36 were related to secondary metabolite biosynthesis, transport, and catabolism.
Genes speci c to the Fusarium equiseti D25-1 genome were annotated using COG (Fig. 9) based on four modules (cellular processes, genetic information storage and transmission, metabolism, and unknown function).In the cellular processes module, 2 genes were associated with cell cycle control, cell division, and chromosome partitioning; 3 were associated with cell wall/membrane/envelop biogenesis; 1 gene was related to defense mechanisms; 5 were related to posttranslational modi cation, protein turnover, and chaperones; and 5 were related to signal transduction mechanisms.In the genetic information storage and transmission module, 17 genes were associated with replication, recombination, and modi cation, and 7 were related to translation, ribosomal structure, and biogenesis.In the metabolism module, 12 genes were associated with amino acid transport and metabolism; 13 were related to carbohydrate transport and metabolism; 7 were associated with coenzyme transport and metabolism; and 13 were related to secondary metabolites biosynthesis, transport, and catabolism.Phylogenetic analysis based on the whole genomes (Fig. 10) indicated that F. equiseti was most closely related to F. pseudograminearum, followed by F. avenaceum and F. oxysporum.

Gene family analysis
As shown in Table 12, the clustered genes in the outgroup taxon Bipolaris sorokiniana accounted for 72% of the total genes, whereas the genes assigned to clusters in Fusarium solani accounted for over 90% of the genes.Typically, Fusarium equiseti D25-1 gene families exceed 60%, with 54 species-speci c gene families.
Single-copy orthologs were detected in the genome of each strain (n > 2,500).Multiple-copy orthologs were also detected in each genome, but the counts differed among the species.For instance, Fusarium oxysporum and Nectria haematococca had high numbers of multi-copy orthologs (Fig. 11).
Gene family-based clustering analysis (Fig. 12) indicated that F. equiseti D25-1 was most similar to F. pseudograminearum, followed by F. avenaceum and F. oxysporum.These results were consistent with the results of the genome-wide sequence analysis, as well as the evolutionary position of F. equiseti in a genome-based phylogenetic analysis.

Discussion
The whole-genome sequencing assembly of Fusarium equiseti D25-1 was based on a combination of results from Illumina Hiseq 4000 and PacBio platforms.The primary assembly was carried out on Illumina Hiseq 4000 platform, and then the rough analysis of the genomic data and pre-determination of the genome size were conducted, providing a basic reference for an advanced assembly on PacBio platform.This platform provides more accurate results, and the assembly effect is better.
Meanwhile, the gene composition and distribution, and the number of non-coding RNA genes, repeat sequences, and transposon types were identi ed.Non-coding RNAs play a crucial role in the regulation of mRNA translation, localization, and stability, in addition to other processes [28].Numerous studies have shown that miRNAs play key roles in the development of cancer [29,30].Thus, the roles of F.
equiseti miRNAs in pathogenesis should be evaluated.Transposons can affect the coding capacity of genes and can even lead to chromosomal rearrangements [31].Identi cation, classi cation, and annotation of transposons can be accomplished more accurately over whole genomes.Therefore, the genome-wide annotation of F. equiseti D25-1 transposons can provide a basis for the detection of functionally and evolutionarily important genes, including those associated with pathogenicity.
On the basis of these basic data, gene function was annotated.Gene function annotation was used to comprehensively annotate the basic functions of F. equiseti D25-1 through various databases.Numerous studies have shown that F. equiseti is pathogenic [32,33].
Hence, the genes that confer pathogenicity to fungi were also annotated using the carbohydrate-related enzyme (CAZy), PHI, KEGG, and cytochrome P450 databases.The derived data can be used to explore the factors and mechanism(s) underlying the pathogenicity of F. equiseti.
Currently, the genomic sequences of many plant pathogenic fungi can be retrieved from the NCBI database.This study selected Fusarium oxysporum, Nectria haematococca, Fusarium pseudograminearum, Fusarium avenaceum, and Bipolaris sorokiniana as the reference strains for Fusarium equiseti D25-1 because these strains are crop root rot pathogens.For example, F. pseudograminearum and N.haematococca can cause root rot in barley [35] and physic nut [36], respectively.Our study showed that these pathogens have many common genes, such as the core genes described here.This similarity may cause them to give rise to the same disease.In addition, genes belonging to many families have been identi ed in the genomes of the six strains through gene family analysis, and the functional characteristics of each gene family need to be further studied.

Conclusion
In this study, the complete genome sequence of F. equiseti D25-1 was assembled based on the Illumina HiSeq 4000 and PacBio platforms.The primary assembled genome had a size of 40.55 Mb and a GC content of 47.92%.The advanced assembly included 16 chromosomes, with a GC content of 48.01%, a total size of 40,776,005 bp, and a gap number of 0. The introns, exons, gene lengths, non-coding RNA, and repetitive sequences were characterized.Based on 14 databases, 13,134 functional genes were annotated in the Fusarium equiseti D25-1 genome, accounting for a high proportion (94.97%) of the total genes.Moreover, multiple genes were related to cellular processes, metabolism, molecular functions, and pathogenicity, including two genes related to zearalenone and 23 genes associated with trichothecene mycotoxin.High collinearity with the genome sequence of the reference strain was observed, with 3,483 speci c genes, 1,805 common genes, and over 2,500 single-copy orthologs.The genome had the highest sequence similarity with the genome of F.
pseudograminearum, followed by F. avenaceum, providing insight into its evolutionary position.The genomic data provide a basis for studies of gene expression, regulatory mechanisms, functional mechanisms, evolutionary processes, and disease control in F. equiseti and other fungi.

Materials
F. equiseti D25-1 was isolated from the rotten root of highland barley in our preliminary study; it was identi ed as a pathogen of highland barley root rot disease according to Koch's postulates.

Strain cultivation and genomic DNA extraction
F. equiseti D25-1 was inoculated into a 500-mL ask containing 200-mL sterile potato dextrose broth and cultured at 25 °C in a 120 r/min shaker for 5 d.After ltration, mycelia were collected by centrifugation at 7,168 g for 1 min, followed by DNA extraction using the OMEGA Fungal DNA Kit (Biel, Switzerland) as per the manufacturer's instructions.

Illumina HiSeq 4000 sequencing
After quality control, DNA samples were used to construct a library.Large DNA fragments (genomic DNA, BACs, or long-range PCR products) were randomly cleaved by Covaris or Bioruptor ultrasonication to generate a series of DNA fragments with the main band of 800 bp or less.Next, T4 DNA Polymerase, Klenow DNA Polymerase, and T4 PNK were used to repair the ensuing sticky ends into blunt ends, and the base "A" was added to the 3' ends to allow the DNA fragments to become ligated to special adapters containing "T" at their 3' ends.Next, the desired ligation product was identi ed by electrophoresis, and the DNA fragments with adapters at both ends were ampli ed by PCR.Finally, cluster preparation and sequencing were performed using a quali ed library (sequencing was completed with the BGI Tech APAC microbial line).

PacBio sequencing
The genomic DNA was processed by g-TUBE into fragments of appropriate size, after which damage-and end-repair were performed.Both ends of the DNA fragments were ligated to the hairpin adapters to form a dumbbell structure designated SMRTbell.The annealed SMRTbell was mixed with polymerase at the bottom of ZWM for nal sequencing (sequencing was completed with the BGI Tech APAC microbial line).

Illumina HiSeq 4000 data ltering
To lter low-quality data and thereby ensure the accuracy and reliability of the subsequent analyses, 1-125 bp of read 1 and read 2 were rst obtained.Reads with quality values ≤ 2 were removed, as were those with a total of 40% "N" bases.Adapters and duplications were eliminated.

PacBio data ltering
The raw sequencing data from the PacBio platform contained the adapter sequences, low-quality sequences, erroneous sequences, and other reads requiring processing.To improve the assembly results, the following steps were taken: 1) polymerase reads < 1,000 bp were removed; 2) polymerase reads with qualities < 0.80 were removed; 3) sub-reads were extracted from polymerase reads, and adapter sequences were removed; 4) sub-reads < 1,000 bp were removed.

Genome assembly
The sequencing data were assembled using a variety of tools. 1) During the sub-read correction, multiple algorithms were used for selfcorrection (Pbdagcon and FalconConsensus) or hybrid correction (Proovread) of the sub-reads to obtain highly reliable corrected reads.2) For corrected read assembly, Celera and Falcon were used to nalize the optimal assembly.3) For assembly correction, single-base correction (GATK) was performed using the second-generation Illumina short reads to obtain a highly reliable assembly sequence.4) For scaffolding, SSPACE Basic v2.0 was used based on the second-generation Illumina long reads, and the scaffold gaps were lled using PBjelly2 [37][38][39].

Gene prediction
Gene component analyses were primarily performed using Homology [40], SNAP [41], Augustus [42], and GeneMark-ES [43].Homology prediction was implemented in GeneWise.SNAP and Augustus were used for predictions using training sets for the reference species.

Non-coding RNA analysis
For the non-coding RNA analysis, rRNAs were detected by comparing the rRNA library with the non-coding RNAs predicted by RNAmmer 1.2 [44].The domains and secondary structures of the tRNAs were predicted using tRNAscan 1.3.1 [45].The sRNAs were identi ed by comparing with the Rfam 9.1 [46] database using Infernal.

Repetitive sequence analysis
Transposons were predicted via three approaches-Repeat Masker 4-0-6 (using Repbase database), RepeatProteinMasker (using its own transposon protein library), or de novo (a database of transposon sequences was generated using buildXDFDatabase, and then a transposon model was constructed according to this database using Repeat Modeler, followed by transposon identi cation from the constructed model using Repeat Masker).The tandem repeats were predicted using TRF 4.04 (Tandem Repeat Finder) [47].

Structural variation
The sequence of the target fungus was ordered according to that of the reference fungus using MUMmer [57].Protein set P1 of the target fungus was aligned with protein set P2 of the reference fungus.P1 was rst aligned with P2 using BLASTP by setting P2 as the database, and the best hit for each protein was selected.The reciprocal alignment was then performed (by setting P1 as the database).Finally, the results with the best hit values for both alignments were retained, and the average of two consistent values was obtained.

Core-pan genome
Genes from the reference genome were used as the gene pool.Predicted genes from Query samples were BLAST-searched against the gene pool, and the BLAST results were ltered by length and identity.The BLAST coverage ratios of the genes from the gene pool and Query samples were calculated separately.Finally, a tree was constructed based on the multiple sequence alignment obtained using Muscle by the neighbor-joining method implemented in Treebest [58].Completed core and speci c genes were annotated using BLAST and the COG database [59].

Gene family analysis
Base       Core and speci c gene counts.The number in the white center circle is the number of the genes common to all the six strains, and each number in the other circles represents the number of species-speci c genes of the strain labeled with the same color as the corresponding circle.
Page 24/27 Annotation of core genes by COG.The ordinate and abscissa show the annotation type and corresponding gene number, respectively.
Annotation of species-speci c genes by COG.The ordinate and abscissa show the annotation type and corresponding gene number, respectively.
Gene family statistics.The abscissa and different colors represent the different strains and the number of different types of genes, respectively.
and quality distribution.The gure on the left shows the distribution of the bases after ltration.The X-axis represents the base positions of read 1 and read 2, and the vertical axis represents the distribution percentage of each base.The gure on the right shows the mass distribution of the bases, and each point in the gure represents the mass value of the base at the corresponding position in a read.

Figure 5 Gene
Figure 5

Table 8
Summary of overall annotation results

Table 9
Genes related to zearalenone