De novo genome assemblies of A. flavus isolates
A total of 225 isolates were newly collected and sequenced as part of this study (Figure 1a, 1b). In addition, we downloaded raw sequencing data for 121 previously sequenced isolates from NCBI sequence read archive. Among these, 7 isolates were sequenced by Fountain et al. (2020b) [24], 94 sequenced for studying pan-metabolomics [11], and 20 draft genomes from [54]. Overall, a total of 346 isolate genomes were included in this study, representing the diversity from 11 states (California, Arizona, Oklahoma, Texas, Louisiana, Mississippi, Georgia, Florida, North Carolina, Pennsylvania and Iowa) of the United States and soils of crops including peanut, sunflower, soybean, sorghum, corn, sesame and cotton. Of these 346 isolates, 177 were aflatoxigenic and 169 were non-aflatoxigenic. For sclerotia, there were 128 isolates with large sclerotia (L-strains) and 218 isolates showing small sclerotia (S-strains). Finally, for mating types, 159 showed mating type MAT1-1 and 187 isolates showed mating type MAT1-2 (Supplementary Table 1).
De novo genome assemblies were developed for 346 A. flavus isolates using paired end Illumina sequencing reads. The mean number of contigs and scaffolds in the assemblies for soil isolates were 597 and 570, with average of 726 and 697 contigs and scaffolds, respectively. Average length of contigs and scaffolds of soil isolates was 89 kb and 98 kb, respectively. The assemblies of isolates collected from corn dust had a total of 111 and 131, contigs and scaffolds, respectively, the lowest number of the newly sequenced isolate genomes. In contrast, the assemblies of isolates collected from corn leaf tissues had the highest number 151 and 167 of contigs and scaffolds, respectively. The assembly mean contig N50 and mean scaffold N50 were 403,194 bp and 523,939 bp. Mean genome size of the assemblies of corn isolates and soil isolates was 36.9 Mb and 37.6 Mb, respectively. It is observed that soil genomes are slightly longer than corn isolates, with maximum 43.2 Mb genome size in soil isolates and maximum 38.5 Mb in corn isolates. Among the corn isolates there was no significant difference in number of contings and scaffolds collected from different corn parts such as silk, leaf, tassel and dust, and no significant difference in the genome sizes of corn isolates was observed (Supplementary Table 2).
To perform population genomics studies, we retrieved ~1.02 million single nucleotide variants (SNVs) using AF13 as reference, approximately 28 SNVs per kilo bases (kb) among 346 A. flavus isolates. Large number of SNVs across the isolates indicated prominent genetic diversity in the species of A. flavus. On an average of 1,021,529 SNVs per isolate (ranged from 795,241- 1,027,945 SNVs) were used for diversity analysis. An average of 4,871 SNVs showed heterozygous calls, and 1,016,658 SNVs were homozygous per isolate. Only homozygous SNV calls were used for population genomics studies (Supplementary Figure 1;Supplementary Table 3).
Diversity in the population of 346 A. flavus isolates
A total of four genetic clusters (K=4) were identified in the collection of 346 A. flavus isolates using population structure analysis. The largest genetic cluster (P1) included 179 isolates, with majority of isolates from soil. Second largest cluster (P2) included 158 isolates grouped from soil as well as corn. Third and fourth clusters were minor with merely 5 and 4 isolates, respectively. Highly toxigenic isolates collected from the state of Louisiana were grouped in clusters 3 and 4 (Figure 1d). Four principle components identified using discriminant analysis of principle components further confirmed the presence of four genetic clusters (Figure 1e). Linkage-disequilibrium (LD) decay for each genetic cluster was calculated and plotted r2 values against distance between two SNVs (Kb) using ggplot2. Rapid LD decay was observed for genetic cluster P1 (10 Kb), followed by genetic cluster P3 (12 Kb), cluster P4 (18 Kb) and cluster P2 (25 Kb). The population growth results in reduction of LD the largest genetic cluster (P1) showed least LD decay (Figure 1f).
Genome-wide nucleotide diversity (π), fixation index (FST) and Tajima’s D diversity was analyzed individually for each cluster to understand the distinctness of genetic clusters. Genome-wide nucleotide diversity (π) for each cluster was visualized using circos plot, showed significant number of hot spots with high nucleotide diversity in each cluster (Figure 1g). Highest average nucleotide diversity (π=3.2 × 10-3) was observed for cluster P1, followed by cluster P4 (π=2.6 × 10-3), and cluster P3 (π=2.0 × 10-3). The lowest nucleotide diversity was observed in genetic cluster P2 (π=8.4 × 10-4) (Figure 1h). Further, the chromosome wise nucleotide diversity across 346 isolates was represented using boxplots. Chromosome 8 showed highest nucleotide diversity (mean π=5.0 × 10-3, with range 1.9 × 10-2 – 3.8 × 10-7), followed by chromosomes 5, 6 and 7 (mean π=3.7 × 10-3, with range 1.85 × 10-2 – 5.5 × 10-7), chromosomes 3 and 4 (mean π=3.5 × 10-3, with range 1.7 × 10-2 – 5.4 × 10-7), and chromosomes 1 and 2 (mean π=3.2× 10-3, with range 1.9 × 10-2 – 2.5 × 10-5) (Supplementary figure 2). The population differentiation was investigated by calculating fixation index (FST) between the pairs of genetic clusters. Fixation index (FST) value can range from 0 to 1. Where, FST =0 indicates the two clusters are completely sharing each other, whereas FST =1 means there is no sharing or complete differentiation between two clusters. In this study, higher FST values (>0.6) of the cluster P2 with clusters P1, P3 and P4 indicated that P2 cluster is most differentiated cluster than other three clusters. For instance, fixation index (FST) of 0.822 was recorded between clusters P2 and P4, followed by 0.814 between P2 and P3, and 0.619 between P2 and P1. Lowest fixation index (FST =0.346) was recorded between the clusters P3 and P4, followed by P1 and P3 (FST =0.346), P1 and P4 (FST =0.460) (Figure 1h).
Genome-wide Tajima’s D diversity index was calculated for each cluster to study the abundance of rare alleles in each cluster. Tajima’s D diversity helps to determine whether the cluster is evolving neutrally, or under selection pressure. Tajima’s D=0, means the cluster is evolving neutrally without any evidence of selection, Tajima’s D<0, means excess of rare alleles in the clusters or population expansion, Tajima’s D>0, means scarcity of rare alleles in cluster or population contraction. Tajima’s D for cluster P1 was -0.13, indicating that cluster P1 is evolving neutrally with partial selection sweep. Interestingly, Tajima’s D for cluster P2 was -1.77, indicated abundance of rare alleles in P2. The Tajima’s D for cluster P3 (0.81) and P4 (0.81) was greater than zero indicating that these two populations lack rare alleles and therefore the clusters are resulted in the population contraction (Figure 1i). Chromosome wise Tajima’s D index was higher for chromosome 6 and 8 (Supplementary Figure 3).
Whole-genome phylogeny and characteristics of genetic clusters
Whole genome phylogeny was constructed and confirmed the presence of 4 clusters within population of 346 A. flavus isolates. Morphological characteristics of each cluster were investigated for their origin (geographical site), source (soil or corn), mating types, sclerotia types, toxigenicity, genome size (Mb), and number scaffolds in each assembly (Figure 2). In the major cluster P1, the majority of isolates were from soil (136) and 43 were from corn. In cluster P1, a total of 126 isolates had small sclerotia (S-types), and 53 had large sclerotia types. In case of mating types, cluster P1 included 94 isolates with mating type MAT1-1, whereas 85 were MAT1-2. Interestingly, in cluster P1, a total of 111 isolates were aflatoxigenic and 68 were non-aflatoxigenic. Second major cluster P2 included a total of 103 isolates from soil and 55 from corn. In this cluster 58 isolates were aflatoxigenic and 100 isolates were non-aflatoxigenic. There were 74 isolates with large sclerotia (L) types and 84 with small sclerotia (S) types. In case of mating types, 61 isolates showed (MAT1-1) types and 97 showed (MAT1-2) types. In cluster P1, majority of isolates were aflatoxigenic, whereas in P2 majority of isolates were non-aflatoxigenic. Cluster P3 and P4 were minor genetic clusters with 5 and 4 isolates, respectively. In genetic cluster P3 all isolates were from collected from soil with aflatoxigenic, small sclerotia types, and four with MAT1-1 and one MAT1-2 mating types. In cluster P4, three isolates were aflatoxigenic and a one isolate was non-aflatoxigenic from the state of Iowa. All aflatoxigenic isolates in cluster P3 and P4 were from the states of Louisiana and Oklahoma (Figure 2).
310 kb insertion is rare and identified only in soil samples
Recently a highly toxigenic A. flavus strain AF13 was sequenced, and a 310 kb insertion was identified in AF13 genome, but was absent in isolate NRRL3357. This 310 kb insertion includes a bZIP transcription factor named atfC [14].
The presence of 310 kb region was analyzed across 346 isolates, there were only 5 isolates with this 310 kb insertion. Interestingly, all these 5 isolates are from collected from soil samples in the state of Louisiana, and all are toxigenic isolates grouped in cluster P3 and P4 (Figure 2). However, there were about 57% shared identity with this 310 kb insertion among the other 341 isolates. Therefore, 310 kb insertion is rare in A. flavus species and the isolates with this insertion are highly toxigenic [14] (Supplementary Table 1).
AflaPan contains 7,315 core and 10,540 accessory genes
Genome assemblies 346 isolates were used to develop pan-genome framework for A. flavus, called AflaPan genome. A pan-genome represents total genes in the species, including core genes shared by all the individuals of the species and accessory genes that are not present in all the individuals of the species. In this study, we collected 346 genomes of A. flavus and identified 17,855 unique orthologous gene clusters in this pangenome, called AflaPan genome. In AflaPan, of the 17,855, there were 7,315 orthologous gene clusters as core genome (41.1% of the pangenome) present in all 346 genomes, 3,887 as softcore orthogroups in >95% genomes (21.7% of the pangenome), 3,132 as shell genes in 5-95% genomes (17.5% of the pangenome), and 3,521 as cloud genes present in less than 5% of the genomes (19.7% of the pangenome) (Figure 3a, 3b). We also report here 5,994 novel gene clusters that were not annotated in the AF13 reference genome [14]. Of the new gene clusters, 451 were identified in soft-core (12% of total soft-core genes), 2,069 (66% of total shell genes) in shell and 3,474 (99% of total cloud genes) in cloud (Figure 3c).
The gene clusters in this AflaPan genome were compared with the gene models in current reference genomes AF13 and NRRL3357 of A. flavus. Reference genomes AF13 and NRRL3357 contains of 13,188 and 13,487 annotated gene models, respectively. However, in AflaPan, a total of 17,855 unique orthologous gene clusters were annotated. The AflaPan genome includes more diverse genome segments that were missing in current single reference genome assemblies (Figure 3d).
Significant differences between the lengths of protein sequences were observed in core and accessory genomes. The average protein length in core genome was 579 amino acids (AAs), with a range of 68 to 6885 AAs. While the average protein lengths in soft-core genome was 463 AAs, with range of 60 to 7781 AAs. Average protein length in shell genome was 361 AAs, with the range of 50 to 3917 AAs. In clouds and singletons, the average protein length was 228 AAs and 233 AAs, respectively. Therefore, the conservative proteins were longer in length. Overall, the average protein length was decreased from core-softcore-shell-clouds and singletons (Figure 3e; Supplementary Table 4; Supplementary Table 5).
Furthermore, this Aflapan was closed and captured comprehensive genetic diversity of the intraspecies of A. flavus. Any addition of A. flavus genomes will not substantially increase the number of pan-genes or core genes of A. flavus pan-genome (Figure 3f). The number of pan-genes and core genes will be not substantially increased beyond this large collection of 346 A. flavus genome in this AflaPan (Figure 3g).
Toxin producing gene clusters in AflaPan
The orthologous gene clusters in AflaPan were annotated using blasts in InterProScan, SWissProt, and FungiDB databases. A total of 17,038 orthogroups were annotated using FungiDB database. There were 12,691 orthologous gene clusters annotated using SwissProt and 15,740 annotated using InterProScan. Therefore, there were a total of 12,500 orthogroups annotated using all three databases and 399 orthologous gene clusters were not annotated using either of the three databases (Supplementary Table 5). Based on the annotations a total of 89 toxins producing orthologous gene clusters were identified in core genome. Toxins producing gene clusters (TPCs) in core genome include 3-hydroxyacyl-CoA dehydrogenase-like protein, 6-hydroxy-D-nicotine oxidase, acyl-CoA dehydrogenase, satratoxin biosynthesis SC3 cluster protein, astacin-like metalloprotease toxin, aflatoxin B1 (AFB1) aldehyde reductase, viriditoxin biosynthesis cluster protein D (VdtD), latrotoxin producing gene cluster (LTX), AM-toxin biosynthesis protein or cytochrome P450 monooxygenase AMT3, AAL-toxin biosynthesis cluster protein, cercosporin toxin biosynthesis cluster, viridicatumtoxin synthesis proteins (vrtL and vrtT), aflatoxin biosynthesis protein O, and transcription activator AMTR1 (AM toxin). Cytochrome P450 monooxygenase is potent mycotoxins producing gene clusters, and in AflaPan, we identified ALT8, aflU, aflV, and AKT7 toxin producing cytochrome P450 monooxygenases. TPCs in core genome indicated that these clusters are widely present in the species A. flavus and annotated in reference genomes AF13 and NRRL3357.
A total of 43 TPCs identified in soft-core genome including 5 new gene clusters integrated from soil isolates of Louisiana and Oklahoma. New TPCs in soft-core genome included toxin subunit YenA2, taurine hydroxylase-like protein SAT17, and cytochrome P450 monooxygenase AKT7. Similarly, in shell genome, a total of 45 TPCs were annotated. Interestingly, a total of 23 new gene clusters were identified in shell genome linked with toxin production. New TPCs in shell genome include MFS gliotoxin efflux transporter (gliA), called gliotoxin biosynthesis protein A, alpha-latrocrustotoxin-Lt1a, killer toxin subunits alpha/beta, double-stranded DNA deaminase toxin A, and acyl-CoA dehydrogenase AFT10-1. Particularly, in shell genome there were 33 new TPCs integrated from soil isolates and 3 were from corn samples. In cloud genome, a total of 36 TPCs were identified, and all were new or non-AF13 TPCs. Among singletons we identified 8 new TPCs and all were originated from soil samples of Georgia and Mississippi. The TPCs in singletons included C3 and PZP-like alpha-2-macroglobulin domain-containing protein, ADP-ribosylating toxin CARDS (Community-Acquired Respiratory Distress Syndrome toxin), norsolorinic acid reductase (aflE), FAD-dependent monooxygenase, aflatoxin cluster transcriptional coactivator (aflS), cytochrome P450 monooxygenase (alfG), and abhydrolase domain-containing protein AFT2. Here we observed that the extended genome portions of soil included new toxin producing genes (Supplementary Table 5). In order to check the association of new orthologous gene clusters with toxin production and secondary metabolite production, AflaPan-genome wide association analysis (AflaPan-GWAS) was conducted using phenotypes of total aflatoxins produced.
Secondary metabolite producing gene clusters
A. flavus also produces a variety of secondary metabolites which may be useful for pharmaceutical industries. In current AF13 and NRRL3357 reference genomes of A. flavus [14] there were 80 and 78 secondary metabolite producing gene clusters. In this AflaPan genome, there were a total of 160 secondary metabolite producing gene clusters annotated for 10 important secondary metabolites, including aflatrem (14), aflavarin (3), aflatoxin (63), asparasone (6), cyclopiazonic acid (25), ditryptophenaline (3), kojic acid (3), leporins (4), piperazines (19), and ustiloxin (20).
Aflatrem is tremorgenic mycotoxin, the only class of mycotoxins impacting central nervous system [55]. In AflaPan, a total of 14 orthogroups were annotated as aflatrem biosynthetic gene clusters. Among them, 8 orthogroups identified in core, and 5 in soft-core and 1 in shell genome. It indicated that aflatrem producing gene clusters are highly conserved in A. flavus.
Aflavarin is also an important sclerotial metabolite exhibiting insecticidal activity [55]. In AflaPan, a total of 3 aflavarin producing orthologous gene clusters were annotated, and there were two in core genome and one in cloud genome.
Aflatoxin is a potent mycotoxin produced by A. flavus classified as polyketide synthase (PKS) backbone genes. Based on annotation, there were a total of 63 aflatoxin producing secondary metabolite gene clusters identified in this AflaPan genome. Among them, 20 were identified in core, 10 in soft-core, 21 in shell genome, 9 in cloud, and 3 as singletons. Aflatoxin producing gene clusters are uniformly distributed in core and accessory genomes of AflaPan. Moreover, Pan-GWAS analysis was conducted using phenotyping data of aflatoxin produced by 225 isolates, which uncovered 391 orthogroups associated with toxin production, including these 63 aflatoxin producing secondary metabolite gene clusters.
Asparasone is another sclerotial metabolite of structural class PKS [56]. In AflaPan, a total of 6 asparasone producing orthologous gene clusters were identified. Among them, there were one in each core and shell, 2 in soft core, and 2 in cloud genome of AflaPan. Cyclopiazonic acid (CPA) is an indole-tetramic acid mycotoxin, which belongs to structural group dimethyl tryptophan synthases (DMATs) or hybrid PKS (polyketide synthases)-NRPSs (non-ribosomal peptide synthetases) produced by a number of species of Aspergillus [56]. A total of 25 CPA producing orthogroups were identified in AflaPan. Among them, 7 identified in core, 4 in soft core, 6 in shell, and 8 in cloud genomes of AflaPan. Ditriptophenaline (DTP) produces diketopiperazine alkaloid in A. flavus. There were three DTP producing orthogroups identified in cloud genome of AflaPan.
Kojic acid (5-hydroxy-2-hydroxymethyl-1, 4-pyrone) belongs to structural group oxydoreductases, a skin lightning agent widely used in cosmetics industries. Moreover, it also has antibacterial activity and therefore used as pharmaceuticals, pesticides and insecticides [57]. In AflaPan, three orthologous gene clusters were identified associated with Kojic acid biosynthesis. Two of them were in core and one was identified in cloud genome of AflaPan. Leporin A is an antiinsectan N-methoxy-2-pyridone sclerotial metabolite isolated from the sclerotia of A. leporis. Leporin belongs to structural group hybrid PKS-NRPS. In AflaPan, four orthogroups were identified linked with leporin biosynthesis. Among them, three were in soft-core genome and one was identified as a singleton in soil isolate GA6_18.
Piperazines are class of saturated N-heterocycles, acting as central nervous system stimulants widely used in medicinal chemistry [58]. In AflaPan, a total of 19 orthogroups identified as piperazines biosynthesis gene clusters. Among them, there were 5 identified in core, 2 in soft-core, 2 in shell, 9 in clouds, and one as singleton identified in soil isolate GA6_18. Interestingly, in AflaPan, there were 10 new orthogroups identified associated with Piperazines biosynthesis. Ustiloxin B is a mycotoxin with cyclic peptide belonging to structural group ribosomally synthesized and post-translationally modified peptides (RiPPs). In AflaPan, there were 20 orthogroups annotated as Ustiloxin B biosynthetic clusters. Among them, there were 9 identified in core genome, 5 in soft-core genome, 3 in shell genome, and 3 in cloud genome of AflaPan (Supplementary Table 6).
Pan-GWAS uncovered gene clusters associated with aflatoxin biosynthesis pathways
Pan-GWAS analysis using PAV matrix of 17,855 orthogroups and the newly collected 225 isolates of A. flavus uncovered new and existing genes associated with aflatoxin production. Pan-GWAS analysis identified a total of 391 orthogroups significantly associated with aflatoxin production (Figure 4). Of these 391, a total of 91 orthogroups were not annotated either with NCBI blast or InterProScan. There were 369 orthogroups (94.4%) from shell genome. Important orthologous gene clusters with significant association to aflatoxin production were listed in Table 2. Surprisingly, of the 369 shell genes, there were 256 (69.4%) orthogroups which were not annotated and may be absent in the AF13 reference assembly. The new orthogroups significantly associated with aflatoxin production were potential intermediates in various aflatoxin biosynthetic pathways in A. flavus. There were 15 orthogroups from soft-core and seven from cloud genome significantly associated with aflatoxin production, including aspergilol biosynthesis cluster protein, aspyridones biosynthesis protein E (apdE), and imizoquin biosynthesis cluster protein D (imqD) (Supplementary Table 7).
Among highly associated aflatoxin producing gene clusters there were cytochrome P450 monooxygenase (apdE, acrF, aflV, AKT7, aneD), ABC multidrug transporters (atrB, atrD, MDR2), aflatoxin biosynthesis regulatory protein R (aflR), aflatoxin biosynthesis protein S (aflS), demethylsterigmatocystin 6-O-methyltransferase (aflO), aflatoxin biosynthesis protein T (aflT), fatty acid synthase alpha subunit (aflA), beta-cyclopiazonate dehydrogenase, and fumagillin biosynthesis polyketide synthase. aflaoxin producing orthologous gene clusters homologous to aflatoxin producing genes in other fungi were also showed strong association with aflatoxin production in this AflaPan, for instance, terrein biosynthesis cluster protein (terG) from A. terreus, ABC multidrug transporter (atrB) from A. nidulans, Azaphilone biosynthesis cluster protein (azaJ) from A. niger, double-stranded DNA deaminase aflatoxin A from Burkholderia cenocepacia, glutathione S-transferase omega from Saccharomyces cerevisiae, and AM-toxin biosynthesis regulator 1 (AMTR-1) from Alternaria alternata (Supplementary Table 7).