Sub-proteome enrichment
To obtain a representative proteome of P. cinnamomi, we used vegetative mycelia and transient, short-lived zoospores, as these are the dominant cell types that grow in and initiate infection of hosts. In addition, we extracted soluble secreted proteins (the secretome) from the mycelia; secreted proteins are widely studied because of their role in pathogen-host interactions. The purity of the mycelia and zoospores was assessed under a stereoscope (Figure 1). The mycelial preparation showed no evidence of contamination by other cell types (Figure 1A): the large mass of mycelia had not produced zoospores or their precursors (sporangia) under this method of in vitro cell culture. Similarly, vegetative mycelia were not observed in the zoospore preparation (Figure 1B).
1D SDS-PAGE was run to visualise the sub-proteomes of each cell type (Figure 2A). The banding patterns differed between the sub-proteomes, indicating differences in total protein content. The extracellular proteome was enriched in lower molecular weight proteins, whereas proteins from the mycelia and zoospores spanned the whole mass range. To test the purity of the secretome, the enzyme activity of the cytoplasmic marker GAPDH, which should be present only in small amounts in the secretome, was measured (Figure 3b) [34]. The mycelia and zoospores had similar GAPDH activity, approximately 4.7 and 4.8 mU/mg protein, respectively. GAPDH was also detected in the secretome, but at a lower level (1.6 mU/mg protein). The RP-HPLC UV total ion count traces indicated differing protein content between the three sub-proteomes, as the majority of the peaks did not match in intensity and retention time (Figure 3c). Of the proteins detected in the mycelia and zoospores, 45% and 41%, respectively, were predicted by WolfPSORT to be localised intracellularly (Table 1). The secretome was enriched in proteins predicted to be extracellular, at 18% compared with 5% in both the mycelia and zoospores.
Validation of V1 gene models using sub-proteome spectra
The mass spectra were used to validate the draft annotation of the P. cinnamomi genome. The annotations acquired from JGI Mycocosm (assembly annotation version 1.0) were designated in this study as ‘V1’ and the annotation set containing subsequently manually edited loci was designated ‘V2’.
Non-redundant peptide matches (at least two 95% confident peptides) resulted in 2,554, 1,362 and 2,304 proteins from the mycelia, secretome and zoospores, respectively. From these data, 2,764 unique proteins from the V1 predicted gene set were identified (Figure 4). A total of 526, 215 and 432 proteins were unique to the mycelia, secretome and zoospores, respectively, which indicates that a wide range of the proteome was detected. The mycelia and zoospores had more unique protein identifications than the secretome, which may reflect the expected lower mass range of an extracellular proteome, with some proteins falling below the acquisition detection limits.
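The per-sample and sample-exclusive counts above are set operations on protein identifiers. A minimal sketch of that bookkeeping follows; the function name and the toy identifiers are hypothetical and are not taken from the study's data.

```python
def exclusive_counts(groups):
    """For each sample, count protein IDs that occur in no other sample."""
    counts = {}
    for name, ids in groups.items():
        # Union of the identifiers seen in every other sample
        others = set().union(*(v for k, v in groups.items() if k != name))
        counts[name] = len(set(ids) - others)
    return counts


# Toy identifiers for illustration only (hypothetical, not study data)
samples = {
    "mycelia": {"p1", "p2", "p3", "p4"},
    "secretome": {"p2", "p5"},
    "zoospores": {"p3", "p4", "p6", "p7"},
}

# Non-redundant total across all samples
unique_total = len(set().union(*samples.values()))
# Identifications exclusive to each sample
exclusive = exclusive_counts(samples)
```

Applied to the real identification lists, `unique_total` would correspond to the 2,764 proteins and `exclusive` to the 526, 215 and 432 sample-specific proteins reported above.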
When the spectra were matched to the 4,874,027 open reading frames (ORFs) generated from the 6-frame translation, 2,752, 1,355 and 2,334 ORFs were identified from the mycelia, secretome and zoospores, respectively (Table 2). Although this allows more peptides to be matched to the genome than with the V1 annotation, some redundancy is expected from matches to reading frames that do not form genes. The false discovery rate for all mass spectral analyses was <0.1% using the Protein Pilot decoy database method, which is within the generally accepted limits for large-scale proteomic data [35, 36]. Of the V1 proteins detected by mass spectrometry, 2,398 had additional support from assigned GO terms and/or PFAM domains.
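As an illustration of how a 6-frame ORF database can be generated, the sketch below translates both strands in all three frames and splits each translation at stop codons. It assumes the standard genetic code, stop-to-stop ORFs and an arbitrary minimum length (`min_len`); the ORF definition and length cut-off used in this study are not stated in this section, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    # Codons containing ambiguous bases are translated as 'X'
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_orfs(seq, min_len=2):
    """Yield (strand, frame, peptide) for stop-to-stop ORFs in all six frames.

    min_len is an assumed cut-off in amino acids, not the study's value.
    """
    seq = seq.upper()
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for orf in translate(template[frame:]).split("*"):
                if len(orf) >= min_len:
                    yield strand, frame, orf


orfs = list(six_frame_orfs("ATGGCCTAAGGA"))
```

Because every frame is translated regardless of gene structure, most of the resulting ORFs do not correspond to genes, which is the source of the redundancy noted above.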
Annotating new gene models by homology criteria
Although there is peptide support for a large number of the V1 genes, some intron and exon boundaries are expected to be annotated incorrectly, and such errors can be detected using spectral data. These data can also be used to detect new genes. A total of 23,457 unique high-confidence peptides matched to the 6-frame ORFs were mapped back to their genomic locations. Of these, 22,443 peptides mapped completely within coding exon boundaries, 274 mapped partly within exons (i.e. spanned boundaries) and 287 mapped within 200 bp of boundaries (Figure 2B, 2C). A further 453 peptides mapped more than 200 bp from exon boundaries (Figure 2A). Furthermore, the frame test applied more stringent criteria for matching the frame of these peptides to the corresponding V1 annotations (Table 4). A total of 1,010 peptides did not match the frame of the corresponding CDS features or were further than 200 bp from any gene model. This suggested 438 gene features with potentially incorrect boundaries. These were considered as candidates for new gene models.
To select peptide candidates likely to result in the alteration of V1 genes or the curation of new genes, BLASTp was used. Peptides that returned significant hits to other Phytophthora species were used to manually edit existing genes and curate new ones (Table 4). This greatly reduced the number of potential edited and new genes, owing to both the redundancy of the 6-frame peptides and the rigorous BLASTp parameters used for peptide matches. Of the peptides with conflicting boundaries, 70 showed significant homology to other Phytophthora species. Of the peptides that were further than 200 bp from any gene, 118 returned significant BLASTp hits, suggesting the presence of previously unannotated genes in the P. cinnamomi genome. The homologous sequences were transferred onto the P. cinnamomi genome and the annotations were manually integrated, taking into consideration differences in the genome and features such as introns.
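The four mapping categories described above can be expressed as a simple interval comparison. The sketch below is illustrative only: it assumes inclusive coordinates on a single contig and strand, the function and category names are hypothetical, and it does not reproduce the frame test.

```python
def classify_peptide(pep_start, pep_end, exons, window=200):
    """Classify a mapped peptide interval against a list of (start, end) exons."""
    # Fully contained in one exon
    for ex_start, ex_end in exons:
        if ex_start <= pep_start and pep_end <= ex_end:
            return "within_exon"
    # Overlaps an exon but extends past its boundary
    for ex_start, ex_end in exons:
        if pep_start <= ex_end and ex_start <= pep_end:
            return "spans_boundary"
    # No overlap: distance to the nearest exon edge
    gaps = [max(ex_start - pep_end, pep_start - ex_end) for ex_start, ex_end in exons]
    if gaps and min(gaps) <= window:
        return "near_boundary"
    return "distant"


# Toy exon coordinates (hypothetical)
exons = [(1000, 1500), (2000, 2600)]
labels = [
    classify_peptide(1100, 1160, exons),  # inside the first exon
    classify_peptide(1480, 1540, exons),  # crosses the end of the first exon
    classify_peptide(1600, 1660, exons),  # 100 bp downstream of the first exon
    classify_peptide(3000, 3060, exons),  # 400 bp from the nearest exon
]

# Consistency check on the counts reported in the text
category_total = 22_443 + 274 + 287 + 453
```

The four reported categories sum to the 23,457 mapped peptides, so every peptide falls into exactly one category.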
Using these criteria, a total of 60 genes were edited, which equates to an error rate of approximately 2% of the detected proteome (60 of 2,764 proteins). The CDS coordinates of the edited genes are shown in Additional file 1. Of these, 44 were modified by extending the exon boundaries and 16 were the result of merging genes. Additionally, 23 new, previously undefined genes were annotated (Table 5). These annotations were uploaded to GenBank under accessions MT820633-MT820655. The edited annotations are referred to by their original annotation identifier with a ‘V2’ suffix, as listed in Additional files 2 and 3. In summary, we identified errors in 60 V1 genes, which were manually altered, and added a further 23 annotations to the gene set of P. cinnamomi.
Validating edited and new genes
The edited genes were subsequently analysed for total peptide support and for differences in functional assignment compared with the original annotation. Peptides within the edited regions were manually counted (Table 6). Of the extended genes, only one had no supporting information other than a single 95% confident peptide in the extended portion of the gene (e_gw1.28.366.1_V2). All other extended genes had support from two or more high-confidence peptides and/or a homologous functional assignment. Similarly, only one merged gene had a single peptide supporting the merged region of the annotation (gw1.160.19.1_V2). All others were supported by two or more high-confidence peptides, which is the general requirement for protein identification in proteomics [37]. Genes were analysed for GO terms, PFAM domains and KEGG orthologues (KO) to determine whether the altered boundaries changed their functional annotation (Table 6). Details of each functional annotation are shown in Additional files 2 and 3.
The original mass spectra were matched to the set of new genes (using Protein Pilot; see Methods) to determine how many peptides supported each gene (i.e. to determine whether any genes were the product of single peptide matches) (Table 7). Of the 23 new genes identified, one had support from only one high-confidence peptide (MT820633). The remaining 22 genes had two or more supporting peptides. All new genes were detected in the mycelia and most were also identified in the secretome and zoospores (Additional file 4).
To further support this new gene set, the protein sequences were analysed for function by assignment of PFAM domains, GO terms and KOs (Table 7). Details of these annotations for each entry are shown in Additional file 4. The new annotations were also screened for virulence factors using PHI-BASE. None returned a significant hit to any known virulence factor.
Codon adaptation index
The codon adaptation indices (CAIs) were calculated for the set of new features and compared with those of the V1 gene set to identify significant differences in codon usage and distribution that could indicate possible causes of errors and missed genes (Figure 8). The distribution of the CAIs of the new set differed significantly (t-test, p < 0.05) from that of the predicted gene set, suggesting a higher proportion of less common codons in the new set. The CAIs of the new set were also significantly lower than those of all original annotations that had high-confidence supporting peptides. Each new gene was also analysed for unusual codon usage, primarily the use of start codons other than ATG (methionine) and the absence of a terminating stop codon (Table 8). Only one new annotation, MT820649, had abnormal codon usage: no start codon was annotated at the correct locus.
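For reference, the CAI of a coding sequence is the geometric mean of the relative adaptiveness of its codons, where each codon's relative adaptiveness is its frequency in a reference gene set divided by that of the most frequent synonymous codon. The sketch below follows this general definition under stated assumptions: the reference set, the software and the handling of unobserved codons used in this study are not given in this section, so the 0.5 pseudocount and all function names here are assumptions.

```python
import math
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons(cds):
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]


def relative_adaptiveness(reference_cds):
    """Weight per codon: its count divided by the top synonymous codon count."""
    counts = Counter(
        c for cds in reference_cds for c in codons(cds) if c in CODON_TABLE
    )
    synonyms = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        synonyms[aa].append(codon)
    weights = {}
    for aa, group in synonyms.items():
        if aa == "*" or len(group) == 1:
            continue  # stop codons, Met and Trp carry no codon-choice information
        top = max(counts[c] for c in group)
        if top == 0:
            continue  # amino acid absent from the reference set
        for c in group:
            # Assumed pseudocount of 0.5 for unobserved codons avoids log(0)
            weights[c] = max(counts[c], 0.5) / top
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of the scored codons in cds."""
    logs = [math.log(weights[c]) for c in codons(cds) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy reference set (hypothetical): GCT preferred over GCC, AAA over AAG
weights = relative_adaptiveness(["GCTGCTGCTGCC", "AAAAAAAAG"])
preferred = cai("GCTAAA", weights)
rare = cai("GCCAAG", weights)
```

A gene built from the preferred codons of the reference set scores 1, and the score falls as rarer synonymous codons are used, which is the direction of the difference reported for the new gene set.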