Genome Assembly
In the process of generating the final assemblies for isolates 11-12 and Al-1, several methods and iterations of assembly were used to generate the highest quality assembly. Before the FALCON-based assemblies were selected as the final assembly, the HGAP assembly pipeline was used to generate an assembly for the 11-12 isolate with the PacBio data, but it had a lower N50 (3.3 Mb) than the FALCON assembly (N50 = 4.3 Mb) and did not run successfully with input data for isolate Al-1. The Genome Finishing Module of the CLC Genomics Workbench (Qiagen, Redwood City, CA) was also used to assemble both genomes using the PacBio data, but in both cases resulted in an assembly with a shorter total length (44 Mb) and over 120 contigs in each assembly. After the FALCON-based assemblies were selected as the best PacBio-based assemblies, a polishing run with PILON using paired-end and mate-pair Illumina reads greatly improved the base calling, with 113,895 and 73,557 SNP and indel changes made after the first PILON run for the Al-1 and the 11-12 assemblies, respectively. A second PILON polishing step was done and yielded a higher percentage of Illumina reads mapping. In this step, 3,984 and 3,784 additional SNP changes were made by PILON in the Al-1 and the 11-12 assemblies, respectively, resulting in the final highest quality, polished FALCON-based assembly.
The final PILON assemblies were further improved using Dovetail HiRise with PacBio gapfilling to combine some contigs into scaffolds. For 11-12, the input assembly of 102 contigs was broken once and joined 19 times, resulting in 84 total contigs/scaffolds. The joins made by the initial Dovetail HiRise scaffolding were made by adding 100 Ns at the junction, but 6 of these joins were subsequently filled in with the PacBio data from the initial sequencing, leaving 13 total gaps filled with 25-100 Ns. For Al-1, the input assembly of 27 contigs was not broken, but a single join was made with Dovetail HiRise resulting in 26 final contigs. The gap created by this contig joining was eliminated by the PacBio gapfill resulting in a final assembly of isolate Al-1 with no gaps and no Ns. To investigate the identity of the smaller (< 100 kb) contigs in each assembly, the assemblies were aligned to themselves using a BLAST-based tool in CLC Genomics Workbench, and contigs that were 99.5% contained within another contig and had greater than 99.5% identity to that contig were considered assembly errors and were eliminated along with contigs less than 1000 bp. After the Dovetail and CLC analyses of the genome, the final contig/scaffold counts were 60 and 18 for the 11-12 and Al-1 genomes, respectively. All of the contig/scaffolds and their lengths can be found in Supplemental Table 1. For ease of reference, the large segments of DNA in the final assemblies, which are largely gap-free, have all been named as “Contig #” with the Contig 1 being the longest contig for each assembly and Contig n being the shortest contig for each assembly with n representing the total number of contigs. The genome statistics for the final assemblies can be found in Table 1.
High-quality, nearly complete assemblies were generated for both the M. phaseolina isolate from strawberry, 11-12, and the isolate from alfalfa, Al-1 (Table 1). Both isolates had very high N50 scores, at 4.3 Mb and 5.0 Mb for the 11-12 and Al-1 isolate, respectively. The sequence coverage of the M. phaseolina genomes from strawberry and alfalfa was 150x for the PacBio reads and over 200x for the Illumina reads. The distribution of the lengths of the main contigs for 11-12 and Al-1 indicate that there are 13 large contigs that are all 870 kb - 6.8 Mb. The smaller 47 contigs of 11-12 range from 4,726 to 99,276 bp with Contig 15 (94,974 bp) being the mitochondrial genome. The 5 smaller contigs of Al-1 range from 2,215 to 49,130 bp with Contigs 14, 16, 17, and 18 representing the mitochondrial genome. No telomeres were detected in the 11-12 assembly, but 7 out of 14 nuclear contigs of the Al-1 assembly had a telomere on one end. When mapping the paired-end Illumina reads generated from the 11-12 isolate and the Al-1 isolate back to the PacBio-based assemblies of their respective isolates, > 96% of the trimmed reads mapped back to the assembly, indicating that the assemblies were of high quality and that the base calling could be largely confirmed with a different sequencing technology. A lower percentage of the paired-end reads mapped from the isolate of origin to the other sequenced isolate, indicating some genomic divergence between the isolates. Overall, the assemblies had good BUSCO scores using both the eukaryotic (<98%) and the fungal (<99%) databases, indicating that the core set of genes were fully present and correctly assembled within each assembly (Table 1).
Transposable elements (TEs) were identified in each genome using RepeatMasker (Table 2). Overall, the total number of transposable elements identified in each genome assembly was similar, with 2,882 TEs identified in the 11-12 assembly and 2,703 TEs in the Al-1 assembly. Major types of TEs identified included Class I retrotransposons, including LTRs (long terminal repeats) and LINES (long interspersed nuclear elements) as well as Class II transposons including DNA transposons and helitrons [22]. Some groups of transposons were unique to the 11-12 isolate, including the L2 category of LINE transposons that were found on 8 of the 13 main contigs as well as the CRE type of LINE transposon that was only found on Contig 8. The Penelope type of LINE was only present in the 11-12 genome and is a unique type of TE that is typically associated with animal genomes [23]. Within the 11-12 genome, this TE type was more prevalent in the smaller contigs than in the 13 main contigs. Both genomes had a large number of Copia and Gypsy LTRs, which are known to be both abundant and diverse in fungal genomes [24]. In contrast, only the 11-12 genome contained endogenous retrovirus 4 (ERV4) LTRs, which were only found within the smaller contigs of the genome and have previously been identified in other fungal genomes [25].
Genome annotation for M. phaseolina isolates 11-12 and Al-1
MAKER was used to annotate the genomes of 11-12 and Al-1 using a set of protein evidence from closely related fungi and an RNA-Seq based transcript dataset from the 11-12 isolate. MAKER used several de novo gene predictors including AUGUSTUS, SNAP, and GENEMARK to begin the annotation process for these two M. phaseolina isolates. A final list of MAKER standard genes was created from the predicted genes that had Pfam support and or transcript evidence. In total, the 11-12 isolate contained 14,103 annotated genes and the Al-1 isolate contained 13,443. For the Al-1 genome, all of these annotated genes are on the 13 main contigs, with no MAKER-annotated genes found on the 41.5 kb contig or the four contigs representing the mitochondrial genome. Alternatively, there are 146 annotated genes on the 47 smaller contigs (4,726 to 99,276 bp) of 11-12, ranging from 0-18 annotated genes per contig. The functional (Pfam) annotations of these genes identified on the 11-12 small contigs included helicases, retrotransposons, translation initiation factors, oxidoreductases, and aspartyl proteases.
The carbohydrate-active enzyme composition of a fungus can be used to determine its ecological niche and host specificity. To compare the carbohydrate-active enzymes from the two sequenced M. phaseolina isolates with other plant pathogenic and non-plant associated fungi, the carbohydrate-active enzymes encoded by each fungus in the comparison were classified using the CAZy database (Table 3). Carbohydrate-active enzymes are classified into six families, and the observed enzyme compositions are conserved within the M. phaseolina isolates. As expected, this comparison shows that the polysaccharide lyase (PL) gene family expanded in the plant pathogenic fungi relative to the non-plant associated fungi (e.g., A. niger and N. crassa). The PLs encoded in the M. phaseolina cleave different forms of pectins, such as pectate lyases in the PL1, PL3, and PL9 families, and rhamnogalacturonan lyases in the PL4 family. Additionally, the glycoside hydrolase (GH) enzymes (e.g., GH88 and GH105) that degrade the products generated by PL enzymes are also found in the M. phaseolina genomes. The enhanced capacity in the pectinolytic machinery suggests strong plant cell wall degradation activity that likely enables their necrotrophic infection and colonization of plant tissue.
Genome sequence comparison between 11-12 and Al-1 reveals large-scale genome rearrangement
When comparing the full genome assembly of 11-12 to Al-1, large scale structural rearrangements and smaller scale indels and SNPs were observed. The dot plot from a MUMmer analysis visualized using Assemblytics indicates that the genomes are largely collinear (Figure 1). Some contigs have been inverted or translocated. For example, a portion of Contig 2 of the 11-12 assembly is collinear and found inverted and translocated to Contig 8 of the Al-1 assembly. Large portions of 11-12 Contigs 3, 6, and 7 are also shown to be collinear and translocated to a different contig of the Al-1 isolate. Of the larger contigs, Contig 8 shows the most changes with large portions broken and or translocated and inverted in the Al-1 assembly relative to the 11-12 assembly.
Similar pairwise whole-genome comparisons were made between the 11-12 assembly and the Illumina assemblies of other M. phaseolina isolates using MUMmer and Assemblytics. Overall, the genomes of all isolates, either pathogenic on strawberry or nonpathogenic on this host, were collinear with the 11-12 strawberry isolate, with some small segments of the genome showing translocations or inversions but with none of these rearrangements being consistently associated with the strawberry genotype. In general, the genomes of isolates that were nonpathogenic on strawberry exhibited more insertions and deletions compared to the 11-12 assembly, with a higher proportion of deletions, compared to those identified among the isolates pathogenic on strawberry. However, a couple of pathogenic isolates also had a high number of insertions and deletions compared to the 11-12 assembly and some nonpathogenic isolates had very few indels, indicating that using the trends of indels alone is not an accurate predictor of pathogenicity on strawberry. The different sequencing technologies used for the genome assemblies may affect the analyses given that the 11-12 long-read assembly was composed largely of 13 contigs while the Illumina-based assemblies of the other M. phaseolina isolates contain over 1,000 contigs each.
A progressive MAUVE analysis between the 11-12 and Al-1 assemblies revealed translocations and inversions as well as several syntenic blocks between the two genomes (Figure 2). Contig 1 of both assemblies, which is over 6 Mb and the longest contig of each assembly, has two syntenic blocks. Some portions of Contig 8 of the Al-1 genome are similar to Contig 2 of the 11-12 assembly but are inverted. The third contig of each assembly has partial synteny, but a portion of the 11-12 contig can be found predominantly on Al-1 Contig 5 (inverted) and to a lesser extent Contigs 4 and 9 (Figure 1). A portion of the Al-1 genome syntenic to Contig 6 of the 11-12 assembly is inverted and present on Contig 5 of the Al-1 assembly along with portions of 11-12 Contigs 3 and 12. Several blocks of the 11-12 Contig 7 can be found on the Al-1 Contig 3 and 11. Contig 8 of 11-12 has a region from position 359,168 to 449,477 that appears to be a unique sequence region without a clear corresponding match in the Al-1 assembly. Contig 11 of the 11-12 assembly also seems to contain unique genomic regions and does not have many syntenic regions with Al-1.
Identification of gene clusters from M. phaseolina unique to the main strawberry genotype
An analysis of all the orthologous genes, including the jute M. phaseolina assembly [1] as well as seventeen other taxa, including several other fungi within the Dothideomycetes class (Supplemental Table 2), were analyzed with OrthoFinder. In the resulting phylogenetic tree, the isolates of M. phaseolina were tightly grouped with each other and with the other members of the Botryosphaeriales order, as expected (Figure 3; Percentage of shared orthologous genes between species is in Supplemental Table 3). This phylogeny also indicated isolates of M. phaseolina from alfalfa and jute were more closely related to each other than to the strawberry isolate, providing additional support that the host specificity aspect of the strawberry isolate genotype may be associated with evolutionary divergence. This phylogeny helped to inform the selection of isolates that were used to create OrthoVenn2 comparisons with smaller groupings of isolates. The OrthoFinder analysis also identified 2 unique orthogroups for the main strawberry genotype of M. phaseolina with 6 genes represented. The isolate from alfalfa did not have any specific orthogroups and the isolate from jute had 4 specific orthogroups.
In an analysis with OrthoVenn2 using the M. phaseolina protein sequences from strawberry, jute, and alfalfa isolates, 80 clusters containing a total of 223 proteins were found to be unique to the strawberry genotype with 1,338 singletons also identified as being unique (Figure 4A). Within the group of 80 unique clusters, enriched GO slim terms, included those associated with a molecular function (nucleotide binding, oxidoreductase activity, hydrolase activity, and cofactor binding) and those associated with biological process (pyrimidine nucleotide biosynthetic process, cellular aromatic compound metabolic process, and others). Common among all three M. phaseolina isolates were 10,678 orthologous clusters with enriched GO terms associated with oxidoreductase activity (115 groups, p-value 2x10-6) and terpenoid biosynthetic process (48 groups, p-value 6.5x10-5). These two enriched GO terms provide support and additional insight into the potential infection mechanisms of M. phaseolina relative to other fungi. Oxidoreductase activity in terms of hydrogen peroxide production has been positively correlated with virulence in some strains of M. phaseolina on chickpea and sunflower [26] and terpene-derived secondary metabolites produced by filamentous fungi like Fusarium have been implicated in toxicity to plants [27].
An analysis between the selected fungi within the Botryosphaeriales order with OrthoVenn2 showed that the M. phaseolina isolate from strawberry had 75 unique clusters of orthologous genes with enriched GO slim terms including phosphorus metabolic process and oxidoreductase activity (Figure 4B). The strawberry isolate also had 1,231 singletons without an orthologous gene. Within the same OrthoVenn2 analysis, the three M. phaseolina uniquely shared 1,729 orthologous groups which had enriched GO terms including the molecular function of oxidoreductase activity (80 groups, p-value < 1x10-9) and biological processes of pathogenesis (29 groups, p-value 5x10-8) and terpenoid biosynthetic process (14 groups, p-value 3.6x10-4). All of the fungi in this analysis shared 5,739 orthologous clusters, which included enriched GO terms including the biological processes of pathogenesis (48 groups, p-value 9x10-10), terpenoid biosynthetic process (19 groups, p-value 2.4x10-9), and vesicle-mediated transport (44 groups, p-value 2.8x10-7). Several GO terms related to oxidoreductase activities were also enriched among all of the fungi in this analysis.
Another OrthoVenn2 analysis among genera of fungi that commonly infect strawberry – Macrophomina, Fusarium, Colletotrichum, Verticilium, and Sclerotinia – had 410 unique orthologous groups identified in the M. phaseolina isolate from strawberry along with 4,110 singletons (Figure 4C). Enriched GO terms for the unique orthologous groups include the molecular function of oxidoreductase activity (15 groups, p-value 1.3x10-12) and the biological processes of shamixanthone biosynthetic process (3 groups, p-value 2.5x10-4) and terpenoid biosynthetic process (5 groups, p-value 2.5x10-4). All five fungi shared enriched GO terms including those associated with the biological process of transmembrane transport (60 groups, p-value 1.5x10-14), the molecular function of zinc ion binding (18 groups, p-value 1.1x10-13), and the biological process of pathogenesis (31 groups, p-value 8.4x10-10). Also enriched were genes with GO terms that are associated with the breakdown of cell walls and thus perhaps pathogenicity, including polysaccharide catabolic process (14 groups, p-value 8.3x10-8), cellulose catabolic process (11 groups, p-value 1.5x10-6), and xylan catabolic process (7 groups, p-value 1x10-5).
Identification of candidate M. phaseolina genes associated with isolates pathogenic on strawberry
A comparative genomics approach using the annotated genome from the strawberry genotype was used to identify candidate genes that may be unique to M. phaseolina isolates that are capable of infecting strawberry. The sequencing reads from each of the 30 other sequenced isolates of M. phaseolina – including isolates that were both pathogenic and not pathogenic on strawberry - were perfectly mapped to the 11-12 genome (Table 4). These 30 isolates were selected because they represented the breadth of M. phaseolina genetic diversity and were used in a companion project to find a genomic locus that was used as a diagnostic tool to detect isolates pathogenic on strawberry [28]. In order to focus on differences only within the annotated genes in this study, HT-Seq was used to count the number of reads that mapped to each annotated gene model. Genes with no or low counts in the isolates that were not pathogenic on strawberry and with high counts in the isolates that were considered pathogenic on strawberry were selected after the first round of read mapping. Within CLC Workbench, BLAST was used to compare the gene model from the 11-12 isolate to similar genes in the other isolates to identify potential truncations or SNPs that could eliminate a candidate if these mutations were to cause significant changes in the predicted protein sequence among isolates in the main strawberry genotype. Non-synonymous mutations in candidate genes among the isolates non-pathogenic on strawberry allowed some candidate genes that had a high mapped read count to a candidate gene model in 11-12 to remain as pathogenicity-associated gene candidates. After this selection process, 10 candidate genes (Table 5) were identified, and all of them are located within a ~ 200 kb region of Contig 8 of the 11-12 assembly. Interestingly, this region is broken in a dotplot alignment between 11-12 and Al-1 (Fig. 1) and shows rearrangements and indels within the MAUVE alignment (Fig. 2).
The candidate genes identified in Table 5 were typically absent in all 13 of the nonpathogenic isolates and present in all 19 of the isolates of the strawberry genotype pathogenic on this host except in the case of isolates 14-26 and 14-27. These isolates were recovered from diseased strawberry plants in Ventura, California, are not in the strawberry genotype, and were found to be nonpathogenic on strawberry in the toothpick inoculations, yet had sequences of candidate genes that were identical to those found in 11-12. Several of these genes, including M11_12_v1_01446, M11_12_v1_01448, M11_12_v1_01460, M11_12_v1_01490, M11_12_v1_01493, and M11_12_v1_01501 were also listed as a member of a unique orthologous grouping or as singletons for M. phaseolina of strawberry in all the OrthoVenn2 comparisons. Additionally, for some candidate genes – M11_12_v1_01458, M11_12_v1_01460, M11_12_v1_01463, and M11_12_v1_01466 - isolates nonpathogenic on strawberry, Al-1 and 14-4, had similar gene sequences, but SNPs or indels changed the predicted protein sequence when compared to that of 11-12. For gene M11_12_v1_01493, the predicted proteins from all the nonpathogenic isolates except for 14-26 and 14-27 were truncated compared to the 11-12 gene model and were predicted to be metalloproteinases while the full-length gene model from 11-12 was predicted to encode a membrane dipeptidase. This is also the only gene among the candidate genes that was predicted by SignalP to have a signal peptide. When looking at the transcript support for each candidate gene model, all the genes except for M11-12_v1_01448, M11_12_v1_01493, and M11-12_v1_01501 had full or partial support from the transcripts predicted by Trinity using the 11-12 RNA-Seq datasets provided during the gene annotation. Furthermore, when examined individually, many of the RNA-Seq datasets provided support for the expression of the candidate genes, with M11_12_v1_01446, M11_12_v1_01448, M11_12_v1_01463, and M11_12_v1_01501 being highly expressed in all the transcript datasets. A couple of genes, M11_12_v1_01455 and M11_12_v1_01493, had higher gene expression in the dataset from the crown tissue medium compared to other growth conditions, but additional experiments are needed to confirm the significance of this observation.