Genome sequencing and annotation
The genome of C. arbuscula NRRL 3705 was sequenced by Illumina Miseq technology and third generation sequencing technology (Single-Molecule Real-Time sequencing technology) with over 100X coverage [21]. This method can efficiently decode difficult but important genomic areas, such as methylated regions, repetitive elements and non-coding regions for possible gap-free eukaryotic genome assembly. Sub-read distribution analysis confirmed high quality of the 20-kb library (Additional file 1: Fig S1). Moreover, we had RNA-Seq results serving as a reference for genome annotation. Combining genomic data and transcriptome analysis makes genome assembly and annotation more accurate. The details of data generation are listed in Additional file 2. The completeness of genome assembly and annotation with single copy orthologs test results suggested a well completed annotation set, with 94.1% of the Fungi BUSCOs being present within the RefSeq annotation set, and 4.8% of those fragmented. Details of BUSCO analysis are presented in Additional file 3. The genome was finally assembled with a size of approximately 45.01 Mb, comprising 91 contigs as displayed by circos-plots (Fig. 1) with an N50 length of 1,530,317 bp,which is larger than the genome size of Calcarisporium sp. (Table 1) [22]. A total of 10,001 genes were predicted and the average gene length is 1,365 bp. The total coding region of C. arbuscula is 13.6 Mb, accounting for 30.2% of the entire genome. Statistics analysis for gene length distribution of C. arbuscula showed that 752 genes have a length over 2,500 bp (Additional file 1: Fig S2). Notably, compared to other commonly studied filamentous fungi, C. arbuscula has a relatively large genome, since most other fungal genomes are less than 40 Mb in size (Table 1) [23-26]. In addition, there are also sections about non-coding proteins (Additional file 2). We also performed RNA-Seq with wild-type strains of C. arbuscula and we found 9005 genes being expressed under laboratory conditions. This accounts for 96.72% of total genes in the corresponding genomes (Additional file 4). Based on the fragments per kilobase of transcript per million mapped reads (FPKM) values, we divided the expressed genes into nine tiers (Additional file 4).
The annotations of 10,001 protein-encoding genes are reported in Additional file 5. Among all coding genes, 9,397 genes could be annotated by various databases (Additional file 1: Fig S3). Using the Non-Redundant Protein Database for protein annotation, we found that 8816 genes were protein-encoding genes (Additional file 1: Fig S3), which account for 88.16% of the total coding genes. KEGG analysis revealed that the products of most genes are involved in metabolism: carbohydrates (≈ 321 proteins), amino acids (≈ 252 proteins), or lipids (≈ 161 proteins) (Additional file 1: Fig S4). These data suggested that C. arbuscula may produce a large number of enzymes involved in its rich metabolic processes. Gene Ontology (GO) functional classification of C. arbuscula (Additional file 1: Fig S5) also showed that most genes are involved in catalytic activity (≈ 3649 proteins / strain) and metabolic processes (≈ 3654 proteins / strain). In addition, Eukaryotic Orthologous Groups (KOG) functional classification showed that many genes are involved in posttranslational modifications (Additional file 1: Fig S6).
Taxonomy
C.arbuscula NRRL 3705 is an endophytic fungus in fruit-bodies of Russulaceae, producing aurovertin-type mycotoxins that are potent against F0F1-ATPase and breast cancers [14-16]. According to fungus taxonomy, it belongs to Calcarisporium, Hypocreales, Pezizomycotina, Ascomycota. Spores of C. arbuscula NRRL 3705 develop after culturing on potato dextrose agar (PDA) medium for 5 days at 25°C. The filamentous fungus displays high sporulation and the conidial heads are yellow (Additional file 1: Fig S7).
The phylogenetic analysis performed in this study used several reference genes (ITS, SSU, LSU, TEF and RPB2) and revealed the close relationship between the sequenced strain and other strains. The multilocus analysis was performed on our isolate with 17 reference strains (NCBI accession number available in Additional file 6). For species delimitation, the aligned sequences matrix of (ITS, SSU, LSU, TEF and RPB2) sequences data for Calcarisporium and for Cordyceps militaris and C. brongniartii as outgroup taxa. The phylogenetic tree was constructed with maximum likelihood and Bayesian analysis and resulted with high bootstrap values (Fig. 2). This tree illustrated that C. arbuscula NRRL3750 was most closely related to C. arbuscula 111.57 and C. arbuscula 144.52.
Repetitive elements
Repeated sequences play an important role in maintaining the spatial structure of chromosomes, gene expression regulation and genetic recombination of fungi [29]. A total of 1,387,508 bp of repeat sequences were identified in the C. arbuscula genome, including LTR retrotransposons, DNA transposons, long interspersed repeated elements (LINEs), tandem repeat sequences (TR) and mini-satellite DNA. Interestingly, the majority of repetitive sequences (63.6%) are tandem repeat sequences, whereas the dispersed repetitive-sequence just accounts for 30.39%. Notably, the highest percentage of all repeat sequences is TR, at 38.29% (Fig. 3).
Predicted candidate secreted effectors involved in virulence and pathogenicity
Secreted effectors play critical roles in virulence and pathogenicity [30, 31]. The signal peptide prediction tool SignalP was used to identify proteins containing the cross-membrane structures [32]. In total, there were 762 possible secretion proteins identified, accounting for 7.62% of all predicted proteins (10,001). Based on alignment analysis of all predicted proteins against the pathogen-host interaction (PHI) database [33], 228 out of 10,001 (2.29%) predicted proteins were related to pathogenicity, of which 21 (0.21%) putative PHI-related proteins were potential secreted effectors.
After whole proteome BLAST against the database of fungal virulence factors (DFVF) [34], 398 out of 10,001 (39.8%) predicted proteins encoded within the C. arbuscula genome were identified to share identity with proteins implicated in virulence, of which 63 (0.63 %) putative DFVF-related proteins were predicted to be secreted. Furthermore, 62 of these secreted proteins were predicted to be involved in pathogen-host interactions (Fig. 4).
P450 enzymes not only participate in the production of important internal metabolites, but also play an important role in adaptation to different environments by modifying harmful chemicals [35]. By BLASTP, the amino acid sequences of all C. arbuscula proteins were compared to the Fungal Cytochrome P450 Database (FCPD), 177 out of 10,001(1.77%) was identified as putative CYP450 enzymes, part of which are involved in fungal virulence factor and pathogen-host interaction (Fig. 4).
Carbohydrate-active enzymes
Carbohydrate-active enzymes (CAZy) play an important role in carbohydrate degradation, modification and biosynthesis in fungi [36]. CAZy is also a Carbohydrate-Active enZYmes Database [37], a specialized database of carbohydrate enzymes, which includes a family of related enzymes that catalyze the degradation, modification, and biosynthesis of carbohydrates. CAZy’s are divided into five main categories: Glycoside Hydrolase (GH) [38], Glycosyl Transferase (GT) [39], Polysaccharide Lyases (PL) [40], Carbohydrate Esterases (CE) [41], and Oxidoreductase (Auxiliary Activitie, AA). In addition, it also contains Carbohydrate-Binding Module (CBM) proteins [42]. In the C. arbuscula genome, 386 proteins were identified as CAZymes,part of which are involved in pathogen-host interactions (Fig. 4). The highest proportion (62.29%) of all CAZymes belonged to the GH category (Additional files 1: Fig S8). Based on genomic information, we compared the potential for hydrolysis with eight Aspergillus species. Notably, C. arbuscula NRRL 3705 contains a large amount of glycoside hydrolases GH18 and GH2, more than found in other fungi (Additional file 7).
Secondary metabolite biosynthetic gene clusters in C. arbuscula NRRL 3705
The secondary metabolites of fungi constitute a rich resource of bioactive compounds with potential pharmaceutical values as antibiotics, cholesterol-lowering drugs and antitumor drugs [1]. Interestingly, genes encoding the biosynthetic pathway responsible for the production of such secondary metabolites are often spatially clustered together; such a compendium of genes is referred to as a ‘secondary metabolite biosynthesis gene cluster’ [45]. Based on profile hidden Markov models of genes that are specific for certain types of gene clusters and antiSMASH 4.0, we identified 65 gene clusters for secondary metabolites in C. arbuscula NRRL 3705. Among them, 23 and 12 gene clusters containing genes encoding polyketide synthases (PKS) and non-ribosomal peptides synthases (NRPS), respectively, were identified. In addition, there are gene clusters for terpenes, PKS/NRPS hybrids, indoles and other types of natural products (Additional file 8). Some of these gene clusters are highly similar to known gene clusters (Table 2).
Aurovertin biosynthetic gene cluster
Aurovertins are a class of toxic polyketides harboring a unique structure of a 2, 6-dioxabicyclo-[3.2.1]-octane (DBO) ring system and a conjugated a-pyrone moiety [16, 46]. Due to the unusual polyketide-derived structure, aurovertins have been shown to have potent antiviral, antitumor and antibacterial activities. C. arbuscula is capable of predominantly producing aurovertins (Fig. 5a), and the biosynthetic gene cluster for these mycotoxins has been identified [16]. In addition, an LC-MS analysis was performed on a methanol extract obtained from a 7-days-old culture of C. arbuscula NRRL3705 grown on a PDA plate at 25°C (Additional files 1: Fig S9 and Fig S10). We have also performed a phylogenetic analysis of the aurovertin-related gene cluster of different strains (Additional files 1: Fig S11). The aurovertin biosynthetic gene cluster was mainly composed of seven genes, including aurA, aurB, aurC, aurD, aurE, aurF and aurG. However, some genes were missing in the cluster after genome annotation by automatic bioinformatics.
According to antiSMASH 4.0, there are totally 4 genes in gene cluster 23 (aurovertin biosynthetic gene cluster). These genes encode a PKS (A05996), a SAM-dependent methyltransferase (A05995), a FAD-dependent monooxygenase (A05994), and an acetyltransferase (A05993) (Fig. 5b). We found that there is a 7 kb spacer between gene A05993 and A05994, which was re-predicted by the web-based software Softberry (http://www.softberry.com/)and we found that this spacer contains three known genes: aurD, aurE and aurF. This is consistent with the gene cluster reported previously [16]. This also indicates that there are certain defects in genome sequencing and automatic NR annotation.
Other SM clusters for mycotoxin biosynthesis
Based on antiSMASH predictions, C. arbuscula NRRL 3705 has the potential to produce a variety of mycotoxins. Aflatoxin (AFT), a class of toxic secondary metabolites originally produced by Aspergillus parasiticus, is highly toxic, carcinogenic, mutagenic and teratogenic [47]. Cluster 60 is composed of 15 genes and contains a PKS (A09345), a putative ketoreductase (A09348), a transcription factor (A0934) and two cytochrome P450 monooxygenases (A09346 and A09350). PKS (A09345) of cluster 60 shows high sequence identity with AflC (a polyketide synthase involved in aflatoxin biosynthesis) of A. ochraceoroseus (protein coverage: 98%; identity: 79%). Moreover, the two cytochrome P450 monooxygenases of cluster 60 show highest sequence identity with AflV (protein coverage: 99%; identity: 83%) and AflG (protein coverage: 96%; identity: 79.8%) of A. ochraceoroseus (Fig. 6). These in silico data suggested that C. arbuscula potentially produces compounds and derivatives structurally related to aflatoxin.
In addition, we found that C. arbuscula has the potential to produce alternariol (AOH) [48]. Alternariol, a secondary metabolite produced by Alternaria and other fungi, is harmful to animals and plants. One polyketide synthase (PKS 19) from Parastagonospora nodorum has shown to be responsible for AOH production in this fungus. Surprisingly, we found two candidate gene clusters (cluster 35 and cluster 44) with high similarity to the alternariol biosynthetic gene cluster from P. nodorum SN15. Cluster 35 is composed of 9 genes. The backbone gene encodes a PKS (A07007), while other genes encode an NAD+-binding protein (A07006), an acyl-CoA-acyltransferase (A07005), an aldehyde dehydrogenase (A07004), an integral membrane (A07003), an arginosuccinate synthetase (A07002), an ABC transporter (A07001), a putative capsule polysaccharide biosynthesis protein (A07008) and a transcriptase (A07009) (Fig. 7a). Cluster 44 is also composed of 9 genes that encode a PKS, four putative signal sequence proteins and a transcription factor (Fig. 7b). The PKS from cluster 35 shares higher identity with PKS 19 (72%) than with the PKS from cluster 44 (44%). Considering that the cluster for AOH biosynthesis in P. nodorum contains one PKS and four tailoring enzymes (O-methyl transferase OmtI, monooxygenase MoxI, short chain dehydrogenase like protein SdrI and an estradiol dioxygenase DoxI), cluster 35 is more likely responsible for the biosynthesis of AOH. In contrast, cluster 44 lacks multiple enzymes, as shown above. Therefor we hypothesize that cluster 35 is more likely the putative SM cluster for biosynthesis of AOH. However, this needs to be validated by further genetic and biochemical analysis.
The non-ribosome polypeptide synthase is responsible for the synthesis of peptide secondary metabolites, such as surugamides and ferricrocin [49, 50]. In C. arbuscula, 12 putative NRPS genes were found. According to antiSMASH and MIBiG, cluster 37 and cluster 59 show sequence similarity with the biosynthetic gene cluster of destruxin, a secondary metabolite of non-ribosomal cyclic hexapeptides with insecticidal and pharmaceutically active activities (Additional file 8).
Gene clusters for biosynthesis of other secondary metabolites
In the C. arbusclua genome, seven hybrid NRPS / PKS gene clusters were also predicted. Among them, cluster 26, cluster 41 and cluster 58 show similarity to the gene clusters of aculeacin A, citrinin and isoflavipucine, respectively (Additional file 8) [51-53]. Moreover, we also predicted eleven terpene genes, whose products remain to be determined. In addition to the PKS, NRPS, and hybrid NRPS/PKS gene clusters, we also identified 12 gene clusters likely to produce indoles, one hybrid indole-t1PKS and one hybrid t1PKS-terpene. Overall, we have performed a thorough comparative analysis and shown as much information on gene clusters as possible after antiSMASH and gene cluster homology comparisons. We have shown some predicted chemical structures of natural products synthesized by the corresponding gene clusters, while most gene clusters share low sequence identity with others, or produce unknown natural products, which also raises high interests for further investigation of these gene clusters experimentally.
Gene cluster expression by RNA-Seq analysis
Although C. arbuscula contains a large number of biosynthetic gene clusters, this fungus rarely produces secondary metabolites, except aurovertins [16]. It is very likely that under laboratory culture conditions, expression of the core genes of most gene clusters is low. Therefore, RNA-Seq assays were performed and gene expression was evaluated based on FPKM from RNA-Seq. Using three housekeeping genes (gpdA, tubC and actA) as reference genes, we found that only the PKS gene in cluster 23 (aurovetin biosynthetic gene cluster) was expressed at the highest level in core genes of all gene clusters (Fig. 8). These results confirmed that most gene clusters are expressed at low levels or silenced.
In addition, nine gene clusters contain pathway-specific transcription factors (Additional file 8), which provides the possibility to activate these gene clusters by overexpression of these transcription factors to obtain secondary metabolites.