Assembly, annotation, and comparison of Macrophomina phaseolina isolates from strawberry and other hosts

DOI: https://doi.org/10.21203/rs.2.9364/v1

Abstract

Background Macrophomina phaseolina is traditionally considered a broad host range fungal pathogen, but one genotype was recently shown to exhibit a host preference/specificity on strawberry. This fungus lacked a high-quality genome assembly and annotation, and little is known about genomic differences between isolates from different hosts. Results This study used PacBio sequencing and Hi-C scaffolding to provide nearly complete genome assemblies from M. phaseolina isolates recovered from strawberry and alfalfa with 60 or fewer scaffolds and N50 metrics at 4.3 and 5.0 Mb, respectively. The genomes were annotated with MAKER using multiple RNA-Seq libraries generated in this study, and over 13,000 protein-coding genes were predicted. Unique groups of orthologous genes for each M. phaseolina isolate were identified when compared to several closely related fungal species. Comparative genomics between the isolates reveal large-scale structural rearrangements including chromosomal inversions and translocations. An additional 30 isolates of M. phaseolina from various hosts, including several from strawberry, were sequenced with Illumina and compared to the high-quality strawberry isolate assembly. No conserved structural rearrangements were identified among the isolates from strawberry compared to those from other hosts, but some candidate genes were identified that were largely present in isolates pathogenic on strawberry and absent in isolates not pathogenic on this host. Conclusions High-quality reference genomes of M. phaseolina generated in this study have allowed for comparative genomics to be used to investigate the genomic changes that might have led to a host preference toward strawberry and will enable future comparative genomics studies. Having more complete scaffolds allows for structural rearrangements to be more fully assessed and ensures a greater representation of all the genes that are truly present in the genome. Work with the additional isolates in this study suggests that some genes are predominately only present in isolates that are pathogenic on strawberry, but additional work is needed to confirm the role of these genes in pathogenesis. Additional work is also needed to complete the scaffolding of smaller contigs representing smaller chromosomes identified in the strawberry genotype assembly and to determine how gene expression plays a role in pathogenicity.

Background

Macrophomina phaseolina is a haploid, clonally reproducing ascomycete fungus that causes damping off, stem rot, and charcoal rot on a wide range of over 500 host species including soybean, corn, wheat, and strawberry [1–3]. This pathogen is soilborne and can survive multiple growing seasons by forming resting structures called microsclerotia, which are melaninized structures formed from 50-200 cells [3]. Some studies have shown that the survival and symptom severity caused by M. phaseolina increases with warmer soil temperatures, ranging from 28-35°C [4, 5]. Typically, M. phaseolina has been thought to be a broad host range pathogen, with the same isolate being able to infect multiple plant species and thus posing a risk to growers planting a variety of crops in a field with a history of M. phaseolina [1, 6]. However, recent studies suggest that M. phaseolina may exhibit some degree of host preference and that each isolate may not pose an equal risk to all crops [7]. Specifically, our recent work studying M. phaseolina collected from strawberry and other hosts around California support the hypothesis that some isolates of M. phaseolina exhibit a strong host preference toward strawberry [8].

Recently, an increased incidence of M. phaseolina in strawberry fields has been observed as the growers shift away from using methyl bromide as a preplant fumigant as required by changes in pesticide regulations [8–10]. As a result, the 1.8-billon dollar, 15,000-hectare strawberry industry in California is under increasing threat from this pathogen that causes crown rot and death of the plant (USDA Noncitrus Fruits and Nuts 2016 Summary, NASS 2017). Interestingly, this pathogen is found in cooler, coastal strawberry growing regions of California despite its documented preferences for warmer climates [4, 5]. A large population study to understand potential regional and host differences of M. phaseolina from strawberry and other hosts growing mainly in California was conducted using SSR markers [11]. The results from an initial study using 65 SSR markers and 15 California isolates show that the majority of isolates recovered from strawberry cluster within a distinct clade [8]. A larger, unpublished study using 24 SSR markers and over 460 M. phaseolina isolates from California and other parts of the world support the grouping of most strawberry isolates into a single genotype (Marina Ramon and Frank Martin, personal communication). Understanding the genetics behind this grouping might lead to a better understanding of why some isolates of M. phaseolina exhibit a strong host preference for strawberry despite M. phaseolina typically being known as a broad host range pathogen.

To begin to answer genomic questions, a high quality, complete reference genome is needed to which other isolates can be compared. Prior to this study, the only published draft genome of M. phaseolina was sequenced from an isolate that was recovered from jute [1]. This isolate was sequenced using older technology including Illumina and 454 resulting in a 48.9 Mb assembly consisting of 1506 contigs with an N50 of 151 kb [1]. The goal of the current study was not only to sequence isolates of M. phaseolina representing the main strawberry genotype and other genotypes incapable or weakly infecting strawberry, but also to use modern advances in sequencing technology, including long read PacBio, in order to provide a higher quality assembly with fewer fragments and more complete contigs. In recent years, the use of long read PacBio sequencing technology has lived up to the promise of delivering more complete genomes and has successfully been used to de novo assemble or dramatically improve the assembly of several plant-pathogenic fungi including Verticillium dahliae [12], Botrytis cinerea [13], Fusarium oxysporum [14], Magnaporthe oryzae [15], Sclerotinia sclerotiorum [16], and Colletotrichum higginsianum [17]. In fact, Faino et al. 2016 found that a PacBio-based assembly alone, rather than a PacBio and Illumina hybrid assembly provided the highest quality assembly for V. dahliae.

To achieve the goal of producing a very high-quality reference genome for M. phaseolina, an isolate recovered from strawberry in Santa Barbara county in 2011, 11-12, was selected to provide the DNA for the reference Pac Bio assembly because it was representative of the main strawberry genotype and exhibited a host preference for strawberry [8]. In addition, an updated annotation was completed in this study using ten separate RNA-Seq libraries of 11-12 fungal tissue grown in different conditions to provide a strong set of raw transcript evidence on which to base new structural gene annotations. Because assemblies made with these new technologies provide significant advantages when investigating genome structure but may have sequence and structural bias when compared to Illumina-based assemblies, a second isolate from a different host named Al-1 recovered from alfalfa from California in 2013. In this current study, additional polishing of the PacBio-based genome assembly was done using Illumina-based mate pair and paired end sequencing in order to correct any sequencing errors and further improve the quality of the final assembly [18]. In addition, Dovetail Genomics (Santa Cruz, CA) Hi-C technology was used to join contigs into scaffolds, break any contigs that were erroneously joined in the initial assembly and fill in gaps, thereby improving the long-range scaffolds of the assembly [19, 20]. Together, the data from these two isolates were assembled into high-quality genomes which were used to investigate how large and small scale structural changes in the genome may provide the basis for potential mechanisms of evolution that could lead to host preference. Overall, the goals of the work in this manuscript were to 1) provide a complete high-quality genome assembly for Macrophomina phaseolina, 2) provide an updated annotation for M. phaseolina and 3) investigate the genomic changes that may have contributed to strawberry host preference of some isolates of M. phaseolina.

Methods

M. phaseolina isolate growth and DNA extraction

Cultures of M. phaseolina were grown on potato dextrose broth (PDB) at 28°C for two days. The cultures were then blended with sterile water and grown for another 2 days at 28°C to produce a hyphal mat. This tissue was strained, rinsed with sterile water and blotted dry. The tissue was lyophilized and DNA was extracted using the Blood and Cell Culture kit (Qiagen, Redwood City, CA) with the modifications described in the Qiagen user-developed protocol for the isolation of genomic DNA from plants and filamentous fungi using the Qiagen Genomic tip with the exception that buffer G2 in the kit was used as the lysis buffer. The quality of the DNA was tested by running the samples on a 0.8% agarose gel and the concentration measured using a Qubit (Thermo Fisher Scientific, Waltham, MA). The samples were submitted for PacBio RSII sequencing to the DNA Technologies Core of the University of California, Davis Genome Center. Fourteen SMRT cells were sequenced for the isolate representing the main strawberry genotype (11-12; SAMN0894525) and 10 SMRT cells were sequenced for an isolate recovered from alfalfa representing another genotype (Al-1; SAMN08448485). The DNA was also submitted to Michigan State University Research Technology Support Facility Genomics Core (East Lansing, MI) for mate pair library preparation and Illumina paired-end sequencing. The mate pair libraries were prepared using the Illumina Nextera Mate Pair Library Preparation Kit. The inserts for the mate pair libraries were prepared by fragmenting the DNA sample and running it on a polyacrylamide gel to separate the fragment sizes. The samples were run on an Agilent Bioanalyzer DNA 12000 chip which showed that the mate pair insert fragment ranges for isolate 11-12 were 3.5, 5.0, and 7.0 kb for expected 3, 6, and 9 kb insert libraries, respectively. These libraries were sequenced using Illumina HiSeq 2500 High Output flow cell (v4) using 125 bp paired end sequencing with the HiSeq SBS reagents. For the Al-1 sample, mate pair inserts of 3.1, 5.7, and 8.5 kb were created by fragmentation and were sequenced on an Illumina MiSeq v2 flow cell to provide 150 bp paired end data. DNA for these two isolates was also sent to the USDA-ARS Genomics and Bioinformatics Research Unit (Stoneville, MS) for a single MiSeq run using the same library preparation techniques noted above to provide 300 bp paired end data. The paired end and mate pair data for 11-12 and Al-1 is available on NCBI under accession number SAMN08294525 and SAMN08448485, respectively.

Additional sequencing was completed using DNA extracted from lyophilized tissue of 30 isolates of M. phaseolina using the Qiagen DNeasy Plant Mini Kit following the manufacturer’s directions for lyophilized plant tissue except with the tissue lysis step being increased to one hour. This DNA was submitted to either Michigan State University Research Technology Support Facility Genomics Core or DNA Technologies Core of the University of California, Davis, Genome Center for 150 bp paired end sequencing using the Illumina HiSeq4000. The data can be located at NCBI under accession numbers SAMN09764508-SAMN09764537 (for a full list of isolates and the corresponding accession numbers see Table 1).

M. phaseolina isolate growth and RNA extraction

Isolate 11-12 of M. phaseolina was grown under a variety of conditions to express a wide range of genes for RNA sequencing to be used for genome annotation. All the samples were started on PDB and were grown at 28°C for two days. Three samples were blended with water and were transferred to 15°C, 35°C, or 24-hour light and were grown for an additional two days. One sample was grown at 28°C for 7 days after it was blended. Other cultures that were started on PDB were transferred to a different media by removing only the growing culture with a rubber policeman and rinsing it briefly with distilled water before blending it with a new media and transferring it to a new set of petri plates to grow for 2 days at 28°C. The other media used included PDB with 2% and 4% NaCl and PDB adjusted to a pH of 3.5 or 8 with HCl or NaOH. Additional media included 100 mM ammonium sulfate and sterilized macerated strawberry crown extract. All the fungal tissue was collected from the various growth conditions by straining the media from the tissue, rinsing the tissue with sterile water several times, and blotting the tissue dry. The tissue mat was then rolled into a cylinder and flash-frozen in liquid nitrogen and stored at -80°C until the RNA was extracted.

RNA was extracted using the Qiagen RNeasy Plant Mini Kit from flash-frozen tissue grown in each of the conditions described above. The RNA was submitted to the Michigan State University Research Technology Support Facility Genomics Core and was prepared using the Illumina TruSeq Stranded mRNA Library prep kit and was sequenced using 125 bp paired end reads from the Illumina HiSeq 2500. The reads were submitted to NCBI (SAMN09699088, SAMN09699089, SAMN09699090, SAMN09699091, SAMN09699092, SAMN09699093, SAMN09699094, SAMN09699095, SAMN09699096, SAMN09699097).

Genome assembly with FALCON

The genomes of M. phaseolina isolates 11-12 and Al-1 were assembled from the PacBio data using the FALCON assembly pipeline v. 0.3.0 [28]. First, the reads were converted from the raw bax.h5 format to subreads.fasta using the pbh5tools v. 0.8.0. FALCON was run for 11-12 using the following parameter options: input_type = raw, length_cutoff = 10000, length_cutoff_pre = 10000, pa_HPCdaligner_option = -v -B128 -t16 -e.70 -l1000 -s400 -T4, ovlp_HPCdaligner_option = -v -B128 -t32 -h60 -e.96 -l500 -s1000 -T4, pa_DBsplit_option and ovlp_DBsplit_option = -x500 -s100, falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 6, and overlap_filtering_setting = --max_diff 300 --max_cov 500 --min_cov 5 --bestn 10 --n_core 15. FALCON was run for AL-1 using the same parameters that were used for 11-12 except for the following changes: length_cutoff = 15000, length_cutoff_pr = 11000, and overlap_filtering_setting = --max_diff 100 --max_cov 450 --min_cov 10 --bestn 10 --n_core 15. The resulting p_ctg.fa output file was used as the input genome assembly for subsequent analyses.

Genome polishing with PILON

The mate pair and paired end reads used to polish the FALCON-based assemblies were first trimmed using NxTrim v. 0.4.1 [29] using the –justmp and –separate options and or Trimmomatic v. 0.36 [30]. Bowtie2 (v 2.2.6) was used to map the trimmed mate pair and paired end reads to the FALCON-based assemblies of each genome [31]. Bowtie2 was run for 11-12 data using the –very-sensitive option and with the -I parameter set at 2000, 3500, or 4500 for a minimum fragment length and the -X parameter set at 5000, 6500, or 9000 for a maximum fragment length for the ~ 3.4 kb, 6kb, and 9kb mate pair insert libraries, respectively. Bowtie2 was run using the –very-sensitive option with the -I parameter set at 1500, 4200, or 6500 and the -X parameter set at 4500, 7200, or 10500, respectively for the mate pair libraries that were approximately 3 kb, 6 kb, and 9 kb for Al-1 data. The mapped reads were converted to sorted bam files using SAMTools v. 1.17 [32, 33] and were used as –jumps (mate-pair) or –frags (paired-end) input data for PILON v. 1.16 [18] to polish the FALCON-based assemblies. The corrected fasta file from PILON that had SNPs, indels, gaps, and local misassemblies fixed was used as the reference sequence for a second round of paired-end and mate-pair read mapping using the parameters for Bowtie2 described above. The sorted bam files from the second round of Bowtie2 mapping were used for a second PILON run using the same parameters described above. The final assembly from each second round of PILON-based polishing was used as the input assembly for Hi-C analysis by Dovetail (Santa Cruz, CA).

Hi-C Scaffolding

Hyphal mats of M. phaseolina from isolates 11-12 and Al-1 were collected during the log phase of growth in PDB, rinsed, and partially dried on filter paper by pulling a low vacuum. The dried tissue was flash-frozen in liquid nitrogen and sent to Dovetail Genomics (Santa Cruz, CA). The assembly was scaffolded following Hi-C analysis using the HiRise method and the gaps in the scaffolds were filled in with PBJelly using the previously described PacBio subreads. The contigs from the Dovetail gap-filled assembly were analyzed using CLC Genomics Workbench v. 9.5.3. The Align Contigs tool in the Genome Finishing Module was used to align the contigs to themselves using BLAST [34] with a word size of 20, a maximum e-value of 0.01, and minimum match size of 100. Contigs were eliminated from the final assembly that were more than 99.5% contained within another contig and had greater than 99.5% identity to that contig. Contigs less than 1000 bp were also removed. The contigs were reordered from longest to shortest and renamed so that the longest contig was named Contig 1 and the shortest contig was Contig n, where n equals the number of total contigs in the assembly. The final assemblies for 11-12 (PRJNA428521) and Al-1 (PRJNA432410) are available on NCBI. Assemblies were evaluated for quality and completeness using QUAST v. 4.3 [35] and BUSCO2 v. 2.0 [36] with both the eukaryotic and fungal dataset.

Genome Annotation

MAKER [37] was used to structurally annotate both genomes, using RNA-Seq reads produced from 11-12 and predicted protein sequences from closely-related fungal species (Fusarium oxysporum GCA_00149955, Macrophomina phaseolina MS6 GCA_000302655, Penicillium chrysogenum GCA_000710275, Grosmannia clavigera GCA_000143105, Aspergillus nidulans GCA_000149205, Trichoderma reesi GCA_000167675, and Podospora anserina GCA_000226545). Trimmomatic v. 0.36 was used to trim the RNA Seq reads using the default settings with the minimum read length set at 50. The trimmed paired reads were then mapped to each genome using STAR aligner v. 2.3.0e [38]. SAMTools v.0.1.19 was used to convert the aligned sam file into a sorted bam file. Trinity v. 2.4.0 [39] was used to create gene models using the –genome-guided_bam option, a maximum intron length of 5000, the jaccard clipping option, and the SS_lib_type RF. The fasta gene models produced by Trinity from each RNA-Seq data set were used as EST evidence when running MAKER v. r1228. Repeat masking was done within MAKER RepeatMasker [21] using the RepBase for all model organisms with soft masking. The specific transposable elements were curated from the output of RepeatMasker for each genome. De novo gene predictions were made using SNAP [40], Augustus [41] and Genemark v. 4.32 [42]. Specific MAKER options were set so that the AED_threshold was 1, the pred_flank was 200, the split_hit was 10,000, and the single_length was 250. A MAKER standard gene set was made which included genes that had evidence and or Pfam support with a cutoff of 1e-10 and encoded a protein of at least 50 amino acids. The transcripts file, protein sequences, gff files, and final assemblies are available at Data Dryad (https://www.datadryad.org). (Data Dryad DOI will be provided after the paper is accepted)

Functional annotation

The MAKER standard gene set was subject to gene functional annotation using Trinotate v3 [43]. Both protein and transcript sequences were queried against the Swiss-Prot database [44]

using BLASTP and BLASTX [34], respectively. In addition, protein sequences were scanned for conserved pFAM domains in the pFAM-A database [45] using HMMER v3 [46]. Both SignalP [47] and TMHMM [48] were used to identify signal peptide and transmembrane domains in the predicted protein sequences. Outputs from BLASTX, BLASTP, HMMER, SignalP, and TMHMM were used to generate the final functional annotation report (available on Data Dryad DOI). To characterize the carbohydrate-active enzymes in the fungal genome, protein sequences were searched against the dbCAN HMMdb release v7 [49], and the identified carbohydrate enzymes were assigned to appropriate CAZy enzyme classes (GHs, GTs, PLs, CEs, CBMs, and AAs).

Assembly and sequencing of other isolates of M. phaseolina

Illumina sequencing data from the additional 30 isolates of M. phaseolina were analyzed using CLC Workbench (Table 1). The reads were trimmed using the Illumina TruSeq adapter set with a quality trim setting at 0.05 and a minimum read length of 30 bp. These trimmed reads were assembled into draft assemblies using CLC Workbench using the de novo assembly tool with an automatic bubble size and with a word size that was manually adjusted from 25-50 to provide the assembly with the best N50. The assemblies were used to create BLAST databases for downstream analyses within CLC Workbench and were also exported for downstream analyses using other programs.

Comparative structural genomics

To evaluate genome structural rearrangement, M. phaseolina files were prepared for Assemblytics [50] using MUMmer v. 3.23 [51] with the following settings: -l 100, -c 500. The output file from MUMmer was used as the input for Assemblytics using the unique sequence length of 10000, a maximum variant size of 10000, and a minimum variant size of 1. A progressive MAUVE alignment using MAUVE v. 2.3.1 [52] was also run with the final 11-12 assembly as the reference sequence to which the final Al-1 assembly was aligned.

Functional comparative genomics

Orthologs between M. phaseolina from strawberry and other isolates of M. phaseolina as well as 17 other closely-related fungal species were identified using OrthoFinder v. 1.1.3 [53]. The fungal taxonomy and protein sequences used in this analysis can be found in Supplementary Table 1. OrthoVenn was used to create plots and to do additional GO term enrichment analyses of specific groups [54].

Identification of genes unique to the main strawberry genotype

Illumina-based paired end mRNA-sequencing was analyzed for 13 isolates not pathogenic on strawberry and 17 isolates pathogenic on this host to identify genes that may be unique to M. phaseolina isolates from the main strawberry genotype. These reads were trimmed using Trimmomatic and were mapped to the 11-12 genome using STAR [38] with the –outFilterMultimapNmax set to 1 and the –outFilterMismatchNmax set to 0 so that only the reads that were perfectly mapped to the genome were included in the output .sam file. The file was then converted into a sorted .bam file that was used as an input for HTSeq-count from HTSeq v. 0.6.1p1 [55]. HTSeq-count was used with the -f bam, -s no and -t gene parameters using the gff3 file from 11-12 as the gene model input. The read counts per gene from each of the isolates was analyzed using the DESeq package v. 1.32.0 [56] in R v. 3.5.0. Each set of read counts were normalized according to the library size and the counts for each gene for each isolate were exported to Excel. Excel was used to sort the count data files to identify genes that had no reads mapping from the nonpathogenic isolates and had several reads mapping from the pathogenic isolates. These genes were considered as candidate genes unique to the main strawberry genotype for downstream analyses. Additional analysis of candidates was done using the BLAST functions in CLC in which the 11-12 candidate gene sequences were used as queries to BLAST against a database created from the assembly of each isolate. The top BLAST hit from each assembly was considered as the homolog for the candidate gene in that isolate and was used to manually check for the presence of indels or SNPs compared to the gene model from the 11-12 isolate using sequence alignments in CLC Workbench.

Pathogenicity Testing

M. phaseolina isolates were tested for pathogenicity on strawberry by growing cultures on potato dextrose agar (PDA) at 25°C and allowing the isolates to colonize sterile toothpicks placed on top of the growing cultures. The colonized toothpicks were incubated for 10 days and were used as the inoculum source. Prior to inoculation, strawberry plants cv. Albion were grown in the greenhouse from crowns for 28 days. Plants were inoculated by making a 6 mm deep hole in each crown with a metal probe and then placing the colonized toothpick inside the hole. Sterile toothpicks were placed in each crown as a negative control and toothpicks colonized with isolate 11-12 were used as a positive control. For each trial, 8 plants were inoculated per treatment and were placed in a growth chamber set at 30°C and were watered as needed. The plants were rated from 1-4 weeks post inoculation using a 1-3 scale in which a rating of 1 was a symptomless plant, a rating of 2 indicated a moderate amount of wilt and die back, and a rating of 3 indicated complete collapse and death of the plant. The final ratings were analyzed using the nonparametric Kruskal-Wallis test in R followed by pairwise comparisons using the Wilcoxon rank sum test; isolates were considered to be pathogenic if the adjusted P value compared to the negative control was < 0.05 [57] (Table 1). Each isolate was tested in 1-3 separate trials.

Results

In the process of generating the final assemblies for 11-12 and Al-1, several methods and iterations of assembly were used to generate the highest quality assembly. Before the FALCON-based assemblies were selected as the final assembly, the HGAP assembly pipeline was used to generate an assembly for the 11-12 isolate with the PacBio data, but it had a lower N50 (3.3 Mb) than the FALCON assembly (N50 = 4.3 Mb) and did not run successfully with the input data for Al-1. The Genome Finishing Module of the CLC Genomics Workbench (Qiagen, Redwood City, CA) was also used to assemble both genomes using the PacBio data, but in both cases resulted in an assembly with a shorter total length (44 Mb) and over 120 contigs in each assembly. After the FALCON-based assemblies were selected as the best PacBio-based assemblies, a polishing step with PILON using paired end and mate-pair Illumina reads greatly improved the base calling, with 113,895 and 73,557 SNP and indel changes made after the first PILON run for the Al-1 and the 11-12 assemblies, respectively. During the subsequent round of read mapping and polishing, a higher percentage of the Illumina reads mapped, and 3,984 and 3,784 changes were made by PILON for the Al-1 and the 11-12 final assemblies, respectively, resulting in the highest quality, polished FALCON-based assembly.

The final PILON assemblies were further improved using Dovetail HiRise with PacBio gapfilling to combine some contigs into scaffolds. For 11-12, the input assembly of 102 contigs was broken once and was joined 19 times, resulting in 84 scaffolds. The joins made by the initial Dovetail HiRise scaffolding were made by adding 100 Ns at the junction, but 6 of these joins were subsequently filled in with the PacBio data from the initial sequencing, leaving 13 gaps filled with 25-100 Ns. For Al-1, the input assembly of 27 contigs was not broken, but a single join was made with Dovetail HiRise resulting in 26 final scaffolds. The gap created by this contig joining was eliminated by the PacBio gapfill resulting in a final assembly with no gaps and no Ns. To investigate the identity of the smaller (< 100 kb) contigs in each assembly, the assemblies were aligned to themselves using a BLAST-based tool in CLC Genomics Workbench, and contigs that were 99.5% contained within another contig and had greater than 99.5% identity to that contig were considered assembly errors and were eliminated along with contigs less than 1000 bp. The genome statistics for the final assemblies can be found in Table 2.

High-quality, nearly complete assemblies were generated for both the M. phaseolina isolate from strawberry, 11-12, and the isolate from alfalfa, Al-1 (Table 2). Both isolates had very high scaffold N50 scores, at 4.3 Mb and 5.0Mb for the 11-12 and Al-1 isolate, respectively. The sequence coverage of the M. phaseolina genomes from strawberry and alfalfa was 150x for the PacBio reads and over 200x for the Illumina reads. The distribution of the lengths of the main scaffolds for the 11-12 and the Al-1 scaffolds indicate that there are 13 main scaffolds that may be considered chromosomes which are all 870 kb-6.8Mb. No telomeres were detected in the 11-12 assembly, but 7 out of 18 scaffolds of the Al-1 assembly had a telomere on one end of the scaffold. The smaller 47 contigs of 11-12 range from 4,726 to 99,276 bp with contig 15 at 94,974 bp the mitochondrial genome. The 5 smaller contigs of Al-1 range from 2,215 to 49,130 bp with contigs 14, 16, 17, and 18 representing the mitochondrial genome. When mapping the paired end Illumina reads generated from the 11-12 isolate and the Al-1 isolate back to the PacBio-based assemblies, > 96% of the trimmed reads mapped back to the assembly, indicating that the assemblies were of high quality and that the base calling could be largely confirmed with a different sequencing technology. A lower percentage of the paired reads mapped from the isolate of origin to the other sequenced isolate, indicating some genomic divergence between the isolates. Overall, all the assemblies had good BUSCO scores using both the eukaryotic and the fungal databases, indicating that the core set of genes were fully present and correctly assembled within each assembly (Table 2).

Transposable elements (TEs) were identified in each genome using RepeatMasker (Table 3). Overall, the total number of transposable elements identified in each genome assembly was similar, with 2882 TEs identified in the 11-12 assembly and 2703 TEs identified in the Al-1 assembly. Major types of TEs identified included Class I retrotransposons including LTRs (long terminal repeats) and LINES (long interspersed nuclear elements) as well as Class II transposons including DNA transposons and helitrons [21]. Some groups of transposons were unique to the 11-12 isolate including the L2 category of LINE transposons that were found on 8 of the 13 main contigs as well as CRE type of LINE transposon that was only found on contig 8. The Penelope type of LINE was only present in the 11-12 genome and is a unique type of TE that is typically associated with animal genomes [22]. Within the 11-12 genome, this TE type was more prevalent in the smaller contigs than in the 13 main contigs. Both genomes had a large number of Copia and Gypsy LTRs, which are known to be both abundant and diverse in fungal genomes [23]. In contrast, only the 11-12 genome containing endogenous retrovirus 4 (ERV4) LTRs, which were only found within the smaller contigs of the genome and have previously been identified in other fungal genomes [24].

Genome annotation for M. phaseolina isolates 11-12 and Al-1

MAKER was used to annotate the genomes of 11-12 and Al-1 using a set of protein evidence from closely related fungi and an RNA-Seq based transcript dataset from the 11-12 isolate. MAKER used several de novo gene predictors including AUGUSTUS, SNAP, and GENEMARK to begin the annotation process for these two M. phaseolina isolates. A final list of MAKER standard genes was created from the predicted genes that had Pfam support or transcript evidence. In total, the 11-12 isolate contains 14,103 annotated genes and the Al-1 isolate contains 13,443. For the Al-1 genome, all these annotated genes are on the 13 main contigs, with no MAKER-annotated genes located on the five smaller (< 100 kb) contigs (four of which are mitochondrial). Alternatively, there are 146 annotated genes on the 47 smaller contigs (4,726 to 99,276 bp) of 11-12, ranging from 0-18 annotated genes per contig. The functional (PFAM) annotations of these genes identified on the 11-12 small contigs included helicases, retrotransposons, translation initiation factors, and oxidoreductases, and aspartyl proteases.

The carbohydrate-active enzyme composition of a fungus can be used to determine its ecological niche and host specificity. To compare the carbohydrate-active enzymes from the two sequenced M. phaseolina isolates with other plant pathogenic and non-plant associated fungi, the carbohydrate active enzymes encoded by each fungus in the comparison were determined and classified using the CAZy database (Table 4). Carbohydrate active enzymes are classified into six families, and the observed enzyme compositions are conserved within the M. phaseolina isolates. As expected, this comparison shows that the polysaccharide lyase (PL) gene family expanded in the plant pathogenic fungi relative to the non-plant associated fungi (e.g., A. niger and N. crassa). The PL encoded in the M. phaseolina cleave different forms of pectins, such as pectate lyases in the PL1, PL3, and PL9 families, and rhamnogalacturonan lyases in the PL4 family. Additionally, the glycoside hydrolase (GH) enzymes (e.g., GH88 and GH105) that degrade the products generated by PL enzymes are also found in the M. phaseolina genomes. The enhanced capacity in the pectinolytic machinery suggests strong plant cell wall degradation activity that likely enables their infection and colonization in plant tissues.

Genome sequence comparison between 11-12 and Al-1 reveals large-scale genome rearrangement

When comparing the full genome assembly of 11-12 to Al-1, large scale structural rearrangements and smaller scale indels and SNPs were observed. The dot plot from a MUMmer analysis visualized using Assemblytics indicates that the genomes are largely collinear (Figure 1). Some scaffolds have been inverted or translocated. For example, a portion of the Al-1 genome collinear to contig 2 of the 11-12 isolate can be found inverted and translocated to contig 8 of the Al-1. Large portions of the Al-1 genome that are collinear with 11-12 contigs 3, 6, and 7 are also shown to be translocated to a different contig of the Al-1 isolate. Of the larger contigs, contig 8 shows the most changes with large portions of the contig broken and or translocated and inverted between the 11-12 and Al-1 isolates.

Similar pairwise whole-genome comparisons were made between the 11-12 isolate and the Illumina assemblies of other M. phaseolina isolates using MUMmer and Assemblytics. Overall, the genomes of both genotypes, either pathogenic on strawberry or nonpathogenic on this host, were collinear with the 11-12 strawberry isolate, with some small segments of the genome showing translocations or inversions but with none of these rearrangements being consistently associated with pathogenicity on strawberry. In general, the genomes of isolates that were nonpathogenic on strawberry exhibited more insertions and deletions compared to the 11-12 assembly, with a higher proportion of deletions, compared to those identified among the isolates pathogenic on strawberry. However, a couple of pathogenic isolates also had a high number of insertions and deletions compared to the 11-12 assembly and some nonpathogenic isolates had very few indels, indicating that using the trends of indels alone is not an accurate predictor of pathogenicity on strawberry. Additionally, the quality of the genome assembly may affect the analyses given that the 11-12 long-read assembly is composed largely of 13 scaffolds while the Illumina-based assemblies of the other M. phaseolina isolates contain over 1,000 contigs each.

A progressive MAUVE analysis between 11-12 and Al-1 revealed similar translocations and inversions as well as several syntenic blocks between the two genome assemblies (Figure 2). Contig 1 of both assemblies, which is over 6 Mb and the longest contig of each isolate assembly, has two syntenic blocks. Some portions of the Al-1 genome similar to scaffold 2 of the 11-12 assembly can be found inverted and translocated on scaffold 8 in the Al-1 assembly. The third scaffold of each assembly has partial synteny, but a large portion of Al-1 syntenic to the 11-12 assembly can be found on Al-1 scaffolds 4, 5 (inverted) and 9. A portion of the Al-1 genome syntenic to scaffold 6 of the 11-12 assembly has a portion that is inverted and present on scaffold 5 of the Al-1 assembly. Several blocks of the Al-1 assembly syntenic to the 11-12 scaffold 7 can be found on the Al-1 scaffold 3. Scaffold 8 of 11-12 has a region from position 359,168 to 449,477 that appears to be a unique sequence region without a clear corresponding match in the Al-1 assembly. Scaffold 11 of the 11-12 assembly also seems to contain unique genomic regions and does not have many syntenic regions with Al-1.

Identification of unique gene clusters from M. phaseolina main strawberry genotype

An analysis of all the orthologous genes, including the jute M. phaseolina assembly [1] as well as seventeen other isolates including several other fungi within the Dothideomycetes class (Supplemental Table 1), were analyzed with OrthoFinder resulting in a phylogenetic tree (Figure 3; Percentage of shared orthologous genes between species is in Supplemental Table 2) in which the isolates of M. phaseolina are tightly grouped with each other and with the other members of the Botryosphaeriales order, as expected. This phylogeny also indicated that the isolates of M. phaseolina from alfalfa and jute are more closely related to each other than to the strawberry isolate, providing additional support that the host specificity aspect of the strawberry isolate may be associated with a unique genetic component. This phylogeny helped to inform the selection of isolates that were used to create OrthoVenn comparisons with smaller groupings of isolates. The OrthoFinder analysis also identified 2 unique orthogroups for the main strawberry genotype of M. phaseolina with 6 genes represented. The isolate from alfalfa did not have any specific orthogroups and the isolate from jute had 4 specific orthogroups.

In an analysis with OrthoVenn using the M. phaseolina protein sequences from strawberry, jute, and alfalfa isolates, 81 clusters containing a total of 203 proteins were found to be unique to strawberry with 1465 singletons also identified as being unique (Figure 4A). Within the group of 81 unique clusters, enriched GO terms included those associated with a molecular function (cytidylate kinase activity, phosphotransferase activity, annealing helicase activity, four-way junction helicase activity, and ATP-dependent helicase activity) and those associated with biological process (pyrimidine nucleotide biosynthetic process and DNA strand renaturation). An analysis between the selected fungi within the Botryosphaeriales order with OrthoVenn showed that the M. phaseolina isolate from strawberry had 74 unique clusters of orthologous genes with enriched GO terms including cytidylate kinase activity and phosphotransferase activity (Figure 4B). The strawberry isolate also had 1327 singletons without an orthologous gene. Within the same OrthoVenn analysis, the three M. phaseolina uniquely shared 1694 orthologous groups which had enriched GO terms including molecular functions of monooxygenase activity and heme binding and biological process terms including menthol biosynthetic process, sphingolipid biosynthetic process, and terpenoid biosynthetic process. Another OrthoVenn analysis between genera of fungi that commonly infect strawberry – Macrophomina, Fusarium, Colletotrichum, Verticilium, and Sclerotinia – 383 unique orthologous groups were identified in the M. phaseolina isolate from strawberry and 3947 singletons of that isolate were identified (Figure 4C). Enriched GO terms for the unique orthologous groups include the molecular function groups of heme binding and oxidoreductase activity.

Identification of candidate M. phaseolina genes associated with isolates pathogenic on strawberry

A comparative genomics approach using the annotated genome from the strawberry genotype was used to identify candidate genes that may be unique to M. phaseolina isolates that are capable of infecting strawberry. Through this process, the sequencing reads from each of the 30 other sequenced isolates of M. phaseolina – including isolates that were both pathogenic and not pathogenic on strawberry - were perfectly mapped to the 11-12 genome and HT-Seq was used to count the number of reads that mapped to each annotated gene model (Table 1). Genes with no or low counts in the isolates that were not pathogenic on strawberry and with high counts in the isolates that were considered pathogenic on strawberry were selected after the first round of read mapping. Within CLC Workbench, BLAST was used to compare the gene model from the 11-12 isolate to similar genes in the other isolates to identify potential truncations or SNPs that could eliminate a candidate if these mutations cause significant changes in the predicted protein sequence among isolates in the main strawberry genotype. Non-synonymous mutations in candidate genes among the isolates non-pathogenic on strawberry allowed some candidate genes that had a high mapped read count to a candidate gene model in 11-12 to remain as pathogenicity-associated gene candidates. After this selection process, 10 candidate genes were identified, and all of them are located within a ~ 200 kb region of Contig 8 of the 11-12 assembly. Interestingly, this area of Contig 8 is broken in a dotplot alignment between 11-12 and Al-1 (Fig. 1) and shows rearrangements and indels within the MAUVE alignment (Fig. 2).

The candidate genes identified in Table 5 were typically absent in all 13 of the nonpathogenic isolates and present in all 19 of the pathogenic isolates except in the case of isolates 14-26 and 14-27. These isolates were both recovered from strawberry plants in Ventura, California, and were found to be nonpathogenic on strawberry in the toothpick inoculations, yet had sequences of candidate genes that were identical to those found in 11-12. Several of these genes, including M11_12_v1_01446, M11_12_v1_01448, M11_12_v1_01455, M11_12_v1_01460, M11_12_v1_01490, M11_12_v1_01493, and M11_12_v1_01501 were also listed as a member of a unique orthologous grouping or as singletons for M. phaseolina of strawberry in all the OrthoVenn comparisons. Additionally, for some candidate genes – M11_12_v1_01458, M11_12_v1_01460, M11_12_v1_01463, and M11_12_v1_01466 - isolates Al-1 and 14-4 had similar gene sequences, but SNPs or indels changed the predicted protein sequence when compared to that of 11-12. For gene M11_12_v1_01493, the predicted proteins from all the nonpathogenic isolates except for 14-26 and 14-27 were truncated compared to the 11-12 gene model and were predicted to be metalloproteinases while the full-length gene model from 11-12 was predicted to encode a membrane dipeptidase. This is also the only gene among the candidate genes that was predicted by SignalP to have a signal peptide. When looking at the transcript support for each candidate gene model, all the genes except for M11-12_v1_01448, M11_12_v1_01493, and M11-12_v1_01501 had full or partial support from the transcripts predicted by Trinity using the RNA-Seq data from isolate 11-12 provided during the gene annotation. Furthermore, when examined individually, many of the RNA-Seq datasets provided support for the expression of the candidate genes, with M11_12_v1_01446, M11_12_v1_01448, M11_12_v1_01463, and M11_12_v1_01501 being highly expressed in all the libraries. A couple of genes, M11_12_v1_01455 and M11_12_v1_01493, had higher gene expression in the dataset from the crown tissue medium compared to other growth conditions, but additional replicated experiments are needed to confirm this preliminary observation.

Discussion

Overall, this study provides two high-quality assemblies for M. phaseolina and explores genetic differences between M. phaseolina isolates collected from different hosts. The use of third generation sequencing technology, including long read PacBio sequencing, allowed for a genome that was more complete and contiguous compared to the previously published genome that was assembled using older short-read technology [1]. Furthermore, the assemblies presented in this manuscript are of very high quality, with sequence coverage over 100x and thorough genome polishing completed using PILON with Illumina-based mate pair and paired end libraries. Beyond using advancements in sequencing technology, this work also utilized new methods in scaffolding genomes from Dovetail Genomics, which used Hi-C proximity ligation, the HiRise software, and PacBio gapfilling to result in a final assembly with no gaps within the contigs in the case of the Al-1 isolate and 13 gaps within contigs in the case of the 11-12 isolate. Overall, this approach has produced very high-quality genomes with an N50 of 4.3 and 5.0 Mb for the strawberry and alfalfa isolates, respectively. This is a significant improvement over the previous reference assembly of M. phaseolina of an isolate recovered from jute which had an N50 of 151 kb [1]. (The statistics for the jute isolate MphMS6_1.0 in Table 2 were generated by using the publicly available assembly, GCA_00030202655.1, which is assembled at the contig and not the scaffold level.) The total genome size for all three isolates was similar, with the strawberry isolate at 51.3 Mb, the alfalfa isolate at 49.8 Mb, and the jute isolate at 48.9 Mb. Compared to the 454 and Illumina short read technology used in the assembly of the jute isolate [1], the use of the long-read PacBio technology and Dovetail scaffolding of Hi-C data greatly decreased the total number of scaffolds and the L75 and increased the length of the longest contig, as shown in Table 2. The improved sequencing technology was used at a much higher coverage (150x for PacBio and 200x for Illumina) compared to the 66x coverage for the jute assembly.

While the assemblies presented in this work are very high quality and are an improvement over the last published genome sequence from a jute isolate, there are areas where future work could improve the quality of a M. phaseolina assembly to make a fully complete chromosome-level assembly. First, one shortcoming of this study is that FALCON, an assembler designed for diploids, was used to assemble a haploid genome [25]. In this case, the primary contig file was used as the basis for the final assembly, but a smaller associate contig file was also produced by the assembler which contained ~ 50 small contigs per assembly that were each <70 kb. These contigs were compared by BLAST to the final assembly and were found to be fully contained within the primary contigs and were discarded. Similarly, several shorter contigs were identified in the primary contig-based assembly of both isolates even following the Dovetail scaffolding of Hi-C data, and many of these contigs were fully redundant within the genome. Because the sequences were fully duplicated within larger chromosomes with nearly 100% sequence identity, these shorter contigs were considered assembly errors – perhaps due to the exceedingly high amount of raw input data at nearly 200x genome coverage - and were removed to produce the final number of contigs in the final assembly.

Still, there are some smaller contigs, especially within the 11-12 isolate with a largely unknown identity or placement within the larger genome. A couple of these contigs that are < 100 kb are the mitochondrial genome of M. phaseolina, including Contig 15 in 11-12 and Contigs 14, 16, 17, and 18 in Al-1. In the Al-1 assembly, only one small contigs, Contig 15 at 41,517 bp remains unplaced in the final assembly and does not contain any annotated transcripts. In contrast, the 11-12 assembly has 46 small contigs remaining in the genome assembly, with 34 of them having at least one annotated gene and repeat. The 11-12 genome also had a greater number and diversity of transposable elements, including unique Penelope LINE transposons and ERV4 LTR transposons found largely within the small contigs which may contribute to the greater number of small contigs in 11-12 compared to Al-1. Additional methods of scaffolding, including the electronic mapping used by Nabsys (Providence, RI), could help to place some of these small contigs into larger scaffolds [26]. Further biological studies may also indicate that some of these chromosomes have a role in pathogenicity similar to those of Fusarium oxysporum where smaller, accessory chromosomes are well-associated with pathogenicity [14].

The genome annotation pipeline MAKER was used to determine the structural annotations of the genes for both M. phaseolina isolates sequenced in this study. Despite being a widely used and reliable pipeline, there are still limitations with the structural annotation based on the input data available and the ability of the algorithms to correctly predict gene models. For example, the RNA-Seq data from 11-12 was used to annotate the Al-1 assembly and some of the proteins used as supporting evidence were from different classes of fungi. Furthermore, aside from annotating every gene by hand, a perfect genome annotation is unattainable given the resources that are available; however, the computer-generated annotations provide a solid foundation for genome comparisons and provide valuable information for beginning functional annotation.

The functional annotations, while valuable for providing a starting point for characterizing the predicted proteins and orthologous genes identified in the assembly, are limited in that they use tools that base annotations upon proteins and domains that have been previously characterized in model organisms that may or may not be closely related to the organism that is being studied. As a result, many predicted proteins are characterized as a “gene of unknown function” and several of the clusters of orthologous genes determined through OrthoFinder or OrthoVenn don’t have predicted functions or GO annotations. These tools are still useful in sorting the data that we have based on the annotation databases that are currently available, but more work needs to be done in the field of fungal functional genomics before these predictions can be more accurate and analysis like GO term enrichment can be more meaningful.

When searching for candidate genes associated with isolates representing the main strawberry genotype, 10 genes from 11-12 were identified, including enzymes, proteins involved in DNA binding, and several genes with unknown function. Despite a very rigorous pipeline to identify genes that were unique to isolates in the strawberry genotype, identical genes from isolates 14-26 and 14-27, that were initially recovered from strawberry but nonpathogenic on this host in our tooth pick inoculations (and representing a different pathogen genotype), were identified for each candidate. This anomaly could be due to the limited type of testing that was done in this study to determine if an isolate was pathogenic (toothpick inoculation of the stem). A single, highly susceptible cultivar, Albion, was used to perform pathogenicity tests, and it is possible that these two isolates that were originally recovered from strawberry, may be pathogenic on other genotypes of strawberry. Less than 2% of the isolates recovered from strawberry were not in the main strawberry genotype; limited testing with one of these isolates revealed it was pathogenic but had a much lower level of virulence on strawberry when compared to isolates from the main strawberry clade when using soil based inoculum (B. Smith and F. Martin, unpublished). In addition, it is possible that the presence of one or a few genes is not sufficient to cause disease, but that the expression and coordination of several genes is essential for enabling an isolate of M. phaseolina to infect strawberry. Perhaps further study examining gene expression or a transcriptional network during infection of strawberry by M. phaseolina will reveal a more complex role for some of these candidate genes that is tied not only to the presence or absence of the gene model but that is linked to gene expression under certain infection-favoring conditions. In order to fully test whether the candidate genes are responsible for conferring pathogenicity to M. phaseolina on strawberry, a knockout study using new technology like CRISPR would need to be done [27].

Conclusions

In summary, this work provides high-quality annotated genome assemblies for Macrophomina phaseolina isolates from strawberry and alfalfa hosts. These PacBio-based assemblies are an improvement over the previously published assembly from an isolate recovered from jute in that these assemblies have fewer scaffolds, longer scaffolds and all the raw data from the sequencing and annotation is publicly available. Comparative genomics of these more complete assemblies has revealed large-scale structural rearrangements in terms of chromosomal inversions and translocations. In addition, thirty additional isolates from strawberry and other hosts were sequenced with Illumina in this study, all of which provide valuable comparative genomics resources for this and other studies involving M. phaseolina. In this study, an analysis of gene presence/absence created a short list of candidate genes associated with isolates pathogenic on strawberry that may be involved in host preference to this host.

Abbreviations

11-12: Macrophomina phaseolina isolated from strawberry in 2011 Al-1: Macrophomina phaseolina isolated from alfalfa BLAST: Basic Local Alignment Search Tool BUSCO: Benchmarking Universal Single Copy Orthologs CAZY: Carbohydrate-Active enzymes CRISPR: Clustered Regularly Interspaced Short Palindromic Repeats GO: Gene Ontology NCBI: National Center for Biotechnology Information PacBio: Pacific Biosciences PDA: potato dextrose agar PDB: potato dextrose broth SMRT: Single Molecule Real-Time SNP: single nucleotide polymorphism QUAST: Quality Assessment Tool for Genome Assemblies

Declarations

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Availability of data and materials

Digital dryad storage (provided upon publication) fasta and .gff files for the genome

The datasets described in this manuscript are available through NCBI.

Illumina sequencing of 30 isolates of M. phaseolina – BioProject PRJNA484400

RNA sequencing of 11-12 libraries - SRA SUB4317410

Sequencing and genome assembly for Al-1 isolate – BioProject PRJNA432410

Sequencing and genome assembly for 11-12 isolate – BioProject PRJNA428521

Competing interests

The authors declare that they have no competing interests.

Funding

The authors gratefully acknowledge funding from the USDA- California Department of Food and Agriculture’s Specialty Crop Block Grant Program grant SCB14052 and the California Strawberry Commission that supported this work.

Author Contributions

AKB designed the experiments, conducted that data analysis, and wrote the manuscript. KLC contributed to experimental design, provided scripts for genome annotation, and provided the resources and technical support for all the bioinformatic data analysis. JW conducted the functional annotation. MLR extracted DNA for PacBio sequencing and conducted initial population studies. FNM worked on initial genome assemblies, provided guidance for experimental design and data analysis and obtained the funding to conduct this work. All authors have read, edited, and approved the final manuscript.

Acknowledgements

We thank Steve Koike formerly at the University of California Cooperative Extension for assistance in collecting and providing various samples of Macrophomina phaseolina. We thank John Johnston at Michigan State University for maintaining servers and uploading programs. We thank Brett Smith and Sam Cude for assistance in conducting pathogenicity tests and Brian Scheffler of the USDA-ARS Genomics and Bioinformatics Research Unit in Stoneville, MS for the MiSeq data for isolates 11-12 and Al-1.

References

1. Islam MS, Haque MS, Islam MM, Emdad EM, Halim A, Md Q, Hossen M, Hossain MZ, Ahmed B, Rahim S, Rahman MS, Alam MM, Hou S, Wan X, Saito JA, Alam M: Tools to kill: Genome of one of the most destructive plant pathogenic fungi Macrophomina phaseolina. BMC Genomics 2012, 13:1.

2. Gupta GK, Sharma SK, Rameke R: Biology, epidemiology and management of the pathogenic fungus Macrophomina phaseolina (Tassi) Goid with special reference to charcol rot of Soybean. J Phytopathol 2012, 160:167–180.

3. Kaur S, Dhillon GS, Brar SK, Vallad GE, Chand R, Chauhan VB: Emerging phytopathogen Macrophomina phaseolina : biology, economic importance and current diagnostic trends. Crit Rev Microbiol 2012, 38:136–151.

4. Zveibil A, Mor N, Gnayem N, Freeman S: Survival, Host–Pathogen Interaction, and Management of Macrophomina phaseolina on Strawberry in Israel. Plant Dis 2012, 96:265–272.

5. Mihail JD: Macrophomina phaseolina: Spatio-temporal dynamics of inoculum and of disease in a highly susceptible crop. Phytopathology 1989, 79:848–855.

6. Khan AN, Shair F, Malik K, Hayat Z, Khan MA, Hafeez FY, Hassan MN: Molecular identification and genetic characterization of Macrophomina phaseolina strains causing pathogenicity on sunflower and chickpea. Front Microbiol 2017, 8(JUL):1–11.

7. Su G, Suh R, Schneider W, Russin JS: Host specialization in the charcol rot fungus, Macrophomina phaseolina. Phytopathology 2001, 91:120–126.

8. Koike ST, Arias RS, Hogan CS, Martin FN, Gordon TR: Status of Macrophomina phaseolina on Strawberry in California and Preliminary Characterization of the Pathogen. Int J Fruit Sci 2016, 16:148–159.

9. Mertely J, Seijo T, Peres N: First Report of Macrophomina phaseolina Causing a Crown Rot of Strawberry in Florida. Plant Dis 2005, 89:434–434.

10. Koike ST: Crown Rot of Strawberry Caused by Macrophomina phaseolina in California. Plant Dis 2008, 92:1253–1253.

11. Arias RS, Ray JD, Mengistu A, Scheffler BE: Discriminating microsatellites from Macrophomina phaseolina and their potential association to biological functions. Plant Pathol 2011, 60:709–718.

12. Faino L, Seidl MF, Datema E, Van Den Berg GCM, Janssen A, Wittenberg AHJ, Thomma BPHJ: Single-molecule real-time sequencing combined with optical mapping yields completely finished fungal genome. MBio 2015, 6:1–11.

13. Faino L, Farmer AD, Papasotiriou DG, Zhou S, Seidl MF, Cottam E, Edel D, Hahn M, Schwartz DC, Dietrich RA, Widdison S, Scalliet G: A gapless genome sequence of the fungus Botrytis cinerea. Mol. Plant Path. 2016, 8(20 17):75–89.

14. Van Dam P, Fokkens L, Ayukawa Y, Van Der Gragt M, Ter Horst A, Brankovics B, Houterman PM, Arie T, Rep M: A mobile pathogenicity chromosome in Fusarium oxysporum for infection of multiple cucurbit species. Sci Rep 2017, 7:1–15.

15. Bao J, Chen M, Zhong Z, Tang W, Lin L, Zhang X, Jiang H, Zhang D, Miao C, Tang H, Zhang J, Lu G, Ming R, Norvienyeku J, Wang B, Wang Z: PacBio Sequencing Reveals Transposable Elements as a Key Contributor to Genomic Plasticity and Virulence Variation in Magnaporthe oryzae. Mol Plant 2017, 10:1465–1468.

16. Derbyshire M, Denton-Giles M, Hegedus D, Seifbarghi S, Rollins J, Kan J Van, Seidl MF, Faino L, Mbengue M, Navaud O, Raffaele S, Hammond-Kosack K, Heard S, Oliver R: The complete genome sequence of the phytopathogenic fungus Sclerotinia sclerotiorum reveals insights into the genome architecture of broad host range pathogens. Genome Biol Evol 2017, 9:593–618.

17. Dallery JF, Lapalu N, Zampounis A, Pigné S, Luyten I, Amselem J, Wittenberg AHJ, Zhou S, de Queiroz M V., Robin GP, Auger A, Hainaut M, Henrissat B, Kim KT, Lee YH, Lespinet O, Schwartz DC, Thon MR, O’Connell RJ: Gapless genome assembly of Colletotrichum higginsianum reveals chromosome structure and association of transposable elements with secondary metabolite gene clusters. BMC Genomics 2017, 18:1–22.

18. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q, Wortman J, Young SK, Earl AM: Pilon: An integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 2014, 9.

19. Putnam NH, Connell BO, Stites JC, Rice BJ, Hartley PD, Sugnet CW, Haussler D, Rokhsar DS: Chromosome-scale shotgun assembly using an in vitro method for long-range linkage.Genome Res 2016, 26:342–350.

20. Teh BT, Lim K, Yong CH, Ng CCY, Rao SR, Rajasegaran V, Lim WK, Ong CK, Chan K, Cheng VKY, Soh PS, Swarup S, Rozen SG, Nagarajan N, Tan P: The draft genome of tropical fruit durian (Durio zibethinus). Nat Genet 2017, 49:1633–1641.

21. RepeatMasker Open-4.0 [http://www.repeatmasker.org]

22. Arkhipova IR: Distribution and phylogeny of penelope-like elements in eukaryotes. Syst Biol 2006, 55:875–885.

23. Castanera R, López-Varas L, Borgognone A, LaButti K, Lapidus A, Schmutz J, Grimwood J, Pérez G, Pisabarro AG, Grigoriev I V., Stajich JE, Ramírez L: Transposable Elements versus the Fungal Genome: Impact on Whole-Genome Architecture and Transcriptional Profiles. PLoS Genet 2016, 12:1–27.

24. Wolters PJ, Faino L, van den Bosch TBM, Evenhuis B, Visser RGF, Seidl MF, Vleeshouwers VGAA: Gapless genome assembly of the potato and tomato early blight pathogen Alternaria solani. Mol Plant-Microbe Interact 2018, 31:692–694.

25. Chin CS, Peluso P, Sedlazeck FJ, Nattestad M, Concepcion GT, Clum A, Dunn C, O’Malley R, Figueroa-Balderas R, Morales-Cruz A, Cramer GR, Delledonne M, Luo C, Ecker JR, Cantu D, Rank DR, Schatz MC: Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods 2016, 13:1050–1054.

26. Kaiser MD, Davis JR, Grinberg BS, Oliver JS, Sage JM, Seward L, Bready B: Automated Structural Variant Verification In Human Genomes Using Single-Molecule Electronic DNA Mapping. bioRxiv 2017:140699.

27. Shi TQ, Liu GN, Ji RY, Shi K, Song P, Ren LJ, Huang H, Ji XJ: CRISPR/Cas9-based genome editing of the filamentous fungi: the state of the art. Appl Microbiol Biotechnol 2017, 101:7435–7443.

28. Chin C-S, Peluso P, Sedlazeck FJ, Nattestad M, Concepcion GT, Clum A, Dunn C, O’Malley R, Figueroa-Balderas R, Morales-Cruz A, Cramer GR, Delledonne M, Luo C, Ecker JR, Cantu D, Rank DR, Schatz MC: Phased Diploid Genome Assembly with Single Molecule Real-Time Sequencing. bioRxiv 2016(October):56887.

29. O’Connell J, Schulz-Trieglaff O, Carlson E, Hims MM, Gormley NA, Cox AJ: NxTrim: Optimized trimming of Illumina mate pair reads. Bioinformatics 2015, 31:2035–2037.

30. Bolger AM, Lohse M, Usadel B: Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 2014, 30:2114–2120.

31. Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods 2012, 9:357.

32. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25:2078–2079.

33. Li H: A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 2011, 27:2987–2993.

34. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403–10.

35. Gurevich A, Saveliev V, Vyahhi N, Tesler G: QUAST: Quality assessment tool for genome assemblies. Bioinformatics 2013, 29:1072–1075.

36. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva E V., Zdobnov EM: BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 2015, 31:3210–3212.

37. Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, Moore B, Holt C, Alvarado AS, Yandell M: MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 2008, 18:188–196.

38. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR: STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 2013, 29:15–21.

39. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, Di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 2011, 29:644–652.

40. Korf I: Gene finding in novel genomes. BMC Bioinformatics 2004, 5:1–9.

41. Ter-hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M: Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res 2008:1979–1990.

42. Borodovsky M, McIninch J: GeneMark: parallel gene recognition for both strands. Comput Chem 1993, 17:123–133.

43. Bryant DM, Johnson K, DiTommaso T, Tickle T, Couger MB, Payzin-Dogru D, Lee TJ, Leigh ND, Kuo T-H, Davis FG, Bateman J, Bryant S, Guzikowski AR, Tsai SL, Coyne S, Ye WW, Freeman RM, Peshkin L, Tabin CJ, Regev A, Haas BJ, Whited JL: A Tissue-Mapped Axolotl De Novo Transcriptome Enables Identification of Limb Regeneration Factors. Cell Rep 2017, 18:762–776.

44. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O’Donovan C, Redaschi N, Yeh LS: UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 2004, 32:115D–119.

45. Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A: The Pfam protein families database: Towards a more sustainable future. Nucleic Acids Res 2016, 44:D279–D285.

46. Eddy SR, Wheeler TJ: HMMER-biosequence analysis using profile hidden Markov models. URL http//hmmer janelia org 2007.

47. Petersen TN, Brunak S, von Heijne G, Nielsen H: SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods 2011, 8:785.

48. Möller S, Croning MDR, Apweiler R: Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics 2001, 17:646–653.

49. Zhang H, Yohe T, Huang L, Entwistle S, Wu P, Yang Z, Busk PK, Xu Y, Yin Y: DbCAN2: A meta server for automated carbohydrate-active enzyme annotation. Nucleic Acids Res 2018, 46:W95–W101.

50. Nattestad M, Schatz MC: Assemblytics: a web analytics tool for the detection of assembly-based variants. bioRxiv 2016(June):44925.

51. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol 2004, 5:R12.

52. Darling AE, Mau B, Perna NT: Progressivemauve: Multiple genome alignment with gene gain, loss and rearrangement. PLoS One 2010, 5.

53. Emms DM, Kelly S: OrthoFinder: Solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol 2015, 16:1–14.

54. Wang Y, Coleman-Derr D, Chen G, Gu YQ: OrthoVenn: A web server for genome wide comparison and annotation of orthologous clusters across multiple species. Nucleic Acids Res 2015, 43:W78–W84.

55. Anders S, Pyl PT, Huber W: HTSeq-A Python framework to work with high-throughput sequencing data. Bioinformatics 2015, 31:166–169.

56. Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol 2010, 11:R106.

57. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B 1995:289–300.

Tables

Table 1. All isolates of Macrophomina phaseolina sequenced and pathogenicity tested in addition to the 11-12 and Al-1 isolates.

Isolate

Host of isolate origin

Location of isolate origin (California)

Pathogenic on Strawberry

NCBI Data

Accession

11-21

Cantaloupe

Los Banos

No

SAMN09764508

11-22

Cantaloupe

Los Banos

No

SAMN09764509

13-10

Watermelon

Imperial County

No

SAMN09764510

13-11

Watermelon

Imperial County

No

SAMN09764511

14-177

Sunflower

Imperial County

No

SAMN09764512

14-24

Strawberry

Stanislaus County

No

SAMN09764513

14-26

Strawberry

Stanislaus County

No

SAMN09764514

14-27

Strawberry

Stanislaus County

No

SAMN09764515

14-4

Almond

Fresno

No

SAMN09764516

14-45

Lima bean

Santa Clara County

No

SAMN09764517

14-48

Lima bean

Santa Clara County

No

SAMN09764518

16-13

Strawberry

Santa Barbara County

No

SAMN09764519

11-14

Strawberry

Ventura County

Yes

SAMN09764520

11-5

Strawberry

California nursery

Yes

SAMN09764521

12-27

Strawberry

Santa Cruz County

Yes

SAMN09764522

13-30

Strawberry

Monterey County

Yes

SAMN09764523

13-45

Strawberry

Monterey County

Yes

SAMN09764524

14-134

Strawberry

Santa Clara County

Yes

SAMN09764525

14-140

Pepper

Santa Clara County

Yes

SAMN09764526

14-170

Strawberry

Oxnard County

Yes

SAMN09764527

14-20

Strawberry

Ventura County

Yes

SAMN09764528

14-21

Strawberry

Ventura County

Yes

SAMN09764529

14-22

Strawberry

Ventura County

Yes

SAMN09764530

14-62

Strawberry

Santa Maria County

Yes

SAMN09764531

15-47

Strawberry

Salinas

Yes

SAMN09764532

15-69

Strawberry

Salinas

Yes

SAMN09764533

15-87

Strawberry

California nursery

Yes

SAMN09764534

15-89

Strawberry

California nursery

Yes

SAMN09764535

16-14

Strawberry

San Luis Obispo

Yes

SAMN09764536

16-15

Strawberry

San Luis Obispo

Yes

SAMN09764537

Table 2. Genome statistics for Macrophomina phaseolina isolates from strawberry (11-12), alfalfa (Al-1) and jute (published Islam et al. 2012).

Isolate

Genome Size (Mb)

# of Scaffolds

N50

Mb

L75

Long Contig Mb

% 11-12 reads mapped

% Al-1 reads mapped

BUSCO euk

BUSCO fungi

# of genes

11-12

51.3

601

4.3

8

6.8

96.4%

93.7%

98.3%

99.3%

14103

Al-1

49.8

182

5.0

8

6.8

85.0%

98.1%

98.3%

99.0%

13443

Jute

48.9

1506

0.15

205

1.1

84.9%

93.5%

98.3%

98.3%

14249

1 One contig represents the mitochondrial genome

2 Four contigs represent the mitochondrial genome

Table 3. Transposable elements identified in Macrophomina phaseolina genomes from an isolates recovered from strawberry (11-12) and an isolate recovered from alfalfa (Al-1).

Transposable Element Category

11-12

Al-1

DNA

490

398

LINE/Penelope

109

LINE/Tad1

269

179

LINE/CRE

1

LINE/L2

12

LINE (other)

189

158

LTR/Copia

550

385

LTR/Gypsy

1173

1522

LTR/ERV4

17

LTR (other)

59

43

RC/Helitron

13

15

LINE: Long interspersed nuclear element

LTR: Long terminal repeat

RC: Rolling circle


Due to technical limitations, Table 4 is only available as a download in the supplemental files section.


Table 5. Candidate genes associate with isolates in the main strawberry genotype that are aggressive on strawberry.

Candidate Gene

Start

Stop

Gene Length

PFAM Domain

Signal Peptide

M11_12_v1_01446

467679

468583

904

Ribosomal protein L36

no

M11_12_v1_01448

458879

462158

3279

Myb-like DNA-binding domain

no

M11_12_v1_01455

468323

470605

2282

NA

no

M11_12_v1_01458

523857

525833

1976

Major Facilitator Superfamily, Sugar (and other) transporter, Organic Anion Transporter Polypeptide (OATP) family

no

M11_12_v1_01460

530438

531136

698

Fungal Zn(2)-Cys(6) binuclear cluster domain, Zinc knuckle

no

M11_12_v1_01463

516694

519283

2589

Aldehyde dehydrogenase family

no

M11_12_v1_01466

519472

521373

1901

Cytochrome P450

no

M11_12_v1_01490

654497

656140

1643

NA

no

M11_12_v1_01493

623466

624632

1166

Membrane dipeptidase (Peptidase family M19)

yes

M11_12_v1_01501

655801

656697

896

Protein of unknown function (DUF3408)

no