MOnSTER identified five CLUMPs containing known motifs characteristic of oomycete effector protein sequences
Characteristic motifs of oomycete effector proteins, such as RxLR, -dEER and LxLFLAK-HVLVxxP, are well known in the literature [15]. We therefore applied our novel tool, MOnSTER, to oomycete effectors to test its ability to recover well-characterized motifs. We compiled a set of 4752 oomycete proteins, comprising 1743 effectors and 3009 non-effectors, from five oomycete species. We performed motif discovery on this set of proteins using MERCI and STREME and identified 265 significantly enriched motifs (see methods for further details). We then fed these motifs to MOnSTER and obtained 11 CLUMPs (Supplementary table 3), using the Davies-Bouldin score as the criterion to cut the tree. By selecting CLUMPs with a MOnSTER score greater than the median of the overall scores, we identified six CLUMPs (CLUMP7, 4, 10, 6, 2 and 9). The five best-scoring CLUMPs, according to the MOnSTER score, correspond to the known motifs (Fig. 2). Supplementary Fig. 2 also shows that the motifs are grouped into two clades: the two characteristic motifs of CRN-effectors (LxLFLAK and HVLVxxP) form a separate subclade on the right, while the RxLR and -dEER motifs fall into the left clade, mirroring the effector families to which they belong. More precisely, RxLR motifs are divided into two different CLUMPs: CLUMP6, containing only RYLR and RFLR motifs, and CLUMP10, containing other RxLR motifs and included in the same subclade as the dEER motif (CLUMP2). The last of the six selected CLUMPs (CLUMP9) contains no known motifs, perhaps suggesting a novel putative motif of oomycete effectors to investigate. Since the characterization of oomycete effectors is beyond the scope of this article, we did not consider this last CLUMP for further analysis.
In support of this choice, CLUMPs 7, 4, 10, 6 and 2 are present in 1205/1743 effectors (~70% of the sequences in the positive dataset), whereas adding the last significant CLUMP (CLUMP9) detects only two more sequences.
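The median-based selection of CLUMPs described above can be sketched as follows. This is a minimal illustration, not the MOnSTER implementation: the scores are hypothetical values chosen only to reproduce the reported ranking, and the `inclusive` flag is an assumption. With 11 distinct scores, a strict "greater than" comparison keeps five CLUMPs, whereas an inclusive comparison keeps six, so the sketch exposes both.

```python
from statistics import median

# Hypothetical MOnSTER scores for 11 CLUMPs (illustrative values, not from the study).
clump_scores = {
    "CLUMP1": 0.21, "CLUMP2": 0.58, "CLUMP3": 0.30, "CLUMP4": 0.81,
    "CLUMP5": 0.25, "CLUMP6": 0.64, "CLUMP7": 0.90, "CLUMP8": 0.18,
    "CLUMP9": 0.47, "CLUMP10": 0.72, "CLUMP11": 0.33,
}


def select_clumps(scores, inclusive=True):
    """Return the CLUMPs passing the median threshold, best-scoring first."""
    threshold = median(scores.values())
    if inclusive:
        # Assumption: the CLUMP sitting exactly on the median is kept.
        kept = [name for name, score in scores.items() if score >= threshold]
    else:
        kept = [name for name, score in scores.items() if score > threshold]
    return sorted(kept, key=scores.get, reverse=True)


selected = select_clumps(clump_scores)
print(selected)
```

With an odd number of CLUMPs, the median is itself one of the scores, which is why the choice between a strict and an inclusive comparison changes the number of selected CLUMPs by one.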
Thus, we investigated the occurrences and co-occurrences of the five selected CLUMPs in oomycete effectors and non-effectors (Supplementary Fig. 3). For the effectors, we analyzed the two distinct families in detail. In total, 68% of the RxLR-effectors in the positive dataset contain motifs from the CLUMPs associated with the RxLR motif (CLUMP10, 6 and 2). In particular, CLUMP10 and 6 occur alone in 41% of the RxLR-effectors (1238 of the 1743 effectors are RxLR-effectors), while 19% of the RxLR-effectors contain these CLUMPs together with the CLUMP representing the dEER motif (CLUMP2). This reflects the importance of the RxLR motifs in the effector sequences and the role of the attached dEER [51]. In contrast, the co-occurrence of the CLUMPs specific for LxLFLAK and HVLVxxP (CLUMP7 and 4) accounts for 67% of the CRN-effector sequences in the positive dataset (377 of the 1743 effectors). The high co-occurrence rate of CLUMP7 and 4 agrees with the LxLFLAK and HVLVxxP motifs marking the beginning and the end of the DWL domain in the Crinkler-effector family [33]. In the negative dataset, by contrast, only 15% of the sequences contain CLUMP motifs, with a marked decrease in CLUMP co-occurrences. Overall, co-occurrences are present in around 30% of positive sequences and in 1% of negative ones.
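The occurrence and co-occurrence fractions reported above can be computed along the following lines. This is a minimal sketch under stated assumptions: motifs are treated as exact substrings (real motifs may be degenerate), and the CLUMP-to-motif mapping and the toy sequences are hypothetical, not taken from the dataset.

```python
# Hypothetical CLUMP-to-motif mapping; real CLUMPs hold many more motifs.
clump_motifs = {
    "CLUMP10": ["RSLR", "RLLR"],
    "CLUMP6": ["RFLR", "RYLR"],
    "CLUMP2": ["DEER"],
}


def clumps_in_sequence(sequence, clump_motifs):
    """Return the set of CLUMPs with at least one motif found in the sequence."""
    return {
        clump
        for clump, motifs in clump_motifs.items()
        if any(motif in sequence for motif in motifs)
    }


def summarize(sequences, clump_motifs):
    """Fractions of sequences with >= 1 CLUMP and with >= 2 distinct CLUMPs."""
    hits = [clumps_in_sequence(seq, clump_motifs) for seq in sequences]
    n = len(sequences)
    any_clump = sum(1 for h in hits if h) / n
    co_occurrence = sum(1 for h in hits if len(h) >= 2) / n
    return any_clump, co_occurrence


# Toy sequences (hypothetical).
toy_effectors = [
    "MKLSAVRFLRAAESDEERGG",  # CLUMP6 and CLUMP2 co-occur
    "MRTALLRSLRTTPPAAKK",    # CLUMP10 only
    "MAAGGTTPPLLKKVVAA",     # no CLUMP
    "MSSRYLRGGAA",           # CLUMP6 only
]

any_clump, co_occurrence = summarize(toy_effectors, clump_motifs)
print(any_clump, co_occurrence)
```

Running the same summary on the positive and on the negative dataset gives the two pairs of fractions that are compared in the text.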
Previous research showed that the characteristic motifs of oomycete effectors have strong sequence position preferences [52]–[54]. We therefore plotted the positions of CLUMP occurrences in the positive versus the negative dataset (Supplementary Fig. 4). Indeed, the CLUMPs are concentrated at the beginning of the sequence in positive sequences and, conversely, spread along the sequence in negative dataset proteins. More precisely, the five most interesting CLUMPs are concentrated in the first 40% of the sequence, with a preference for the very beginning and for positions around 30% of the sequence, probably corresponding to the N-terminal region of the protein in which the target motifs lie.
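Positional preferences of this kind can be summarized by expressing each motif start as a percentage of the sequence length and counting starts per bin. The sketch below assumes exact substring matching and ten equal-width bins; both choices, and the function names, are illustrative assumptions rather than the procedure used for Supplementary Fig. 4.

```python
def relative_positions(sequence, motif):
    """Start of each (possibly overlapping) motif match, as % of sequence length."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(100.0 * start / len(sequence))
        start = sequence.find(motif, start + 1)
    return positions


def position_histogram(sequences, motifs, n_bins=10):
    """Count motif starts per relative-position bin across all sequences."""
    counts = [0] * n_bins
    for seq in sequences:
        for motif in motifs:
            for pos in relative_positions(seq, motif):
                # Clamp to the last bin so a position of 100% cannot overflow.
                counts[min(int(pos * n_bins / 100), n_bins - 1)] += 1
    return counts


# Toy example (hypothetical sequence): the motif starts at residue index 2 of 10.
histogram = position_histogram(["AARFLRAAAA"], ["RFLR"])
print(histogram)
```

Comparing the histogram obtained for the positive dataset with the one for the negative dataset shows whether occurrences concentrate in particular regions of the sequence.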
Altogether these results highlight the ability of MOnSTER to identify CLUMPs containing biologically relevant motifs.
MOnSTER identified six CLUMPs characteristic of nematode candidate parasitism proteins
The application of MOnSTER to the oomycete effectors served as a proof of concept of our methodology. We then moved to the characterization of nematode candidate parasitism sequences, for which no characteristic motifs have been identified yet. We collected a set of 4395 proteins from 13 nematode species, including 546 well-known candidate parasitism proteins and 3849 proteins in the negative dataset. By running motif discovery as for the previous dataset, we found 269 motifs enriched in the candidate parasitism protein sequences. By applying MOnSTER with the previous configuration, the 269 input motifs were grouped into 11 CLUMPs. The six best-scoring CLUMPs were selected using the median as the significance threshold (Supplementary table 4). As in the oomycete results, we observe two main clades (Fig. 3): the second and third best-scoring CLUMPs (CLUMP2 and 5, respectively) form a single clade, while the other significant CLUMPs (CLUMP1, 3, 7 and 10) are distributed in the bigger clade together with the non-significant ones. Overall, we found at least one occurrence of one of the six CLUMPs in almost 60% of the sequences from the positive dataset, compared with 5% of the sequences from the negative dataset.
Then we investigated the presence of the six CLUMPs in each of the 13 PPN species present in the dataset. Figure 4 shows the abundance of the six best-scoring CLUMPs in each species, arranged according to their phylogenetic tree. The first three species are the most represented in the positive dataset. Interestingly, very distant species show similar CLUMP frequencies, suggesting that they might share common sequence-level characteristics for accomplishing similar functions. Furthermore, we could identify characteristic CLUMPs even for species represented in the dataset by very few sequences, reinforcing the previous observation. Overall, this analysis suggests that CLUMPs might be associated with the functional properties of PPNs.
Finally, we focused on the positional preferences of CLUMPs in candidate parasitism protein sequences (Supplementary Fig. 5). In general, we observe a difference in the position preferences of the best-scoring CLUMPs between positive and negative dataset sequences. In candidate parasitism proteins (positive dataset), the six CLUMPs tend to occur more frequently in the middle of the sequences, with higher abundance at central (around 50% of the sequence) and more terminal (around 70%) positions. The same CLUMPs are rare in the central positions of the negative dataset protein sequences. In contrast to oomycete effectors, whose characteristic CLUMPs occur mainly at the beginning of the sequence, PPN candidate parasitism proteins show a different pattern of occurrences, favoring central to C-terminal positions.
Co-occurrences of different CLUMPs are associated with functional protein domains
We investigated the co-occurrence patterns of CLUMPs in the PPN candidate parasitism protein sequences (all possible combinations of co-occurrences are reported in Supplementary Fig. 6). Overall, CLUMPs tend to co-occur more frequently in the sequences of the positive dataset than in those of the negative one, despite the positive set being smaller. Co-occurrences of the six selected CLUMPs are found in 30% of the candidate parasitism protein sequences, but in less than 1% of the sequences from the negative dataset. As observed for oomycetes, some CLUMPs tend to be present alone, while others tend to co-occur with specific CLUMPs. This suggests that different classes of nematode candidate parasitism proteins might exist, similar to the oomycete effectors. Interestingly, among the 311 candidate parasitism proteins bearing at least one occurrence of one of the six selected CLUMPs, 72 do not have a predicted signal peptide; these represent 55% of the proteins in the positive dataset that lack a signal peptide. Of note, this percentage is similar to the percentage of proteins bearing both the CLUMPs and the signal peptide, suggesting that CLUMPs characterize sequence properties beyond the type of secretion. Furthermore, candidate parasitism proteins with and without the signal peptide show similar patterns of CLUMP co-occurrence, with slightly more co-occurrences in the sequences lacking the signal peptide (Supplementary Fig. 7). Importantly, there is no relationship between sequence length and the number of co-occurrences, possibly suggesting a functional role for CLUMP co-occurrences (Supplementary Fig. 8).
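One simple way to check whether longer sequences merely accumulate more CLUMPs by chance is to correlate sequence length with the number of distinct CLUMPs per sequence. The text does not state which statistic underlies Supplementary Fig. 8, so the Pearson coefficient below is an assumption chosen for illustration, and the toy values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: sequence lengths and distinct CLUMPs per sequence.
lengths = [120, 250, 310, 480, 600]
n_clumps = [2, 1, 3, 1, 2]

r = pearson(lengths, n_clumps)
print(round(r, 3))
```

A coefficient close to zero would be in line with the absence of a relationship between length and co-occurrence reported above; a rank-based coefficient would be a reasonable alternative if the relationship were expected to be monotonic but not linear.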
To further inspect a putative functional role of CLUMPs in candidate parasitism protein sequences, we queried the sequences having at least one CLUMP or a co-occurrence of multiple CLUMPs against several protein domain databases (see supplementary information; results in Fig. 5 and Supplementary table 5). Among the 311 candidate parasitism protein sequences bearing at least one occurrence of at least one of the six CLUMPs, 84 also have at least one occurrence of a known protein domain. The most recurrent hits are the coil domain, the intrinsically disordered domain and the signal peptide (SP), followed by the pectate lyase domain, glycosyl hydrolase family 5, the Stichodactyla toxin (ShK) domain, the 14-3-3 family and the cysteine-rich domain. Importantly, none of these domains was found in the sequences from the negative dataset bearing at least one occurrence of at least one of the six CLUMPs. Interestingly, we observe an almost exclusive association between CLUMPs and functional domains, mainly when multiple CLUMPs co-occur in candidate parasitism protein sequences.
The strongest associations that we observe are between the co-occurrence of CLUMPs 7 and 10 and the glycosyl hydrolase family 5 domain on the one hand, and between the co-occurrence of CLUMPs 3, 7 and 10 and the cysteine-rich domain on the other. Specifically, all 23 candidate parasitism protein sequences containing the co-occurrence of CLUMP7 and 10 also bear the glycosyl hydrolase family 5 domain. By inspecting the positions of CLUMP occurrences within the sequences, we observed that the two CLUMPs flank the domain: CLUMP7 is consistently present at the beginning of the sequence, and consequently of the domain, while CLUMP10 mostly concentrates at the end of the domain, around 60–80% of the sequence (Supplementary Fig. 9). These genes are poorly characterized in nematodes and likely result from horizontal gene transfer [55], [56]. Similarly, all 17 sequences presenting the co-occurrence of CLUMPs 3, 7 and 10 also contain the cysteine-rich domain. Cysteine-rich domain and CAP proteins are known to be involved in the virulence of nematodes [57]. They are expressed in both plants and pathogens; in the latter, they contribute to virulence by suppressing the host’s immune responses and promoting colonization. Interestingly, these sequences do not contain disordered regions or coil domains, consistent with the unique conserved sandwich fold with a large central cavity of these kinds of proteins [58]. Sixteen out of 19 sequences presenting the co-occurrence of CLUMPs 2 and 3 also have the 14-3-3 family domain, a eukaryotic-specific protein family with a general role in signal transduction [59]. In these sequences, we observe only one motif from CLUMP2 (KDKM) and four from CLUMP3 (NKDKAC, KMKG, PTHPIR, PTHP). Thirteen out of 34 sequences bearing only CLUMP1 also contain the pectate lyase domain. Of note, these sequences do not contain coiled or disordered regions, and only seven show the presence of the SP.
Pectate lyase enzymes in nematodes facilitate penetration of pectin-containing plant cell walls [60]. Numerous recent reports showed that these enzymes are produced in specialized nematode gland cells and secreted during the parasitism process. In sedentary endoparasitic nematodes, this occurs mainly during juvenile migration through the root tissue, when these enzymes play a crucial role in macerating the plant tissue and thus facilitating infection [61]. Finally, eight out of 22 sequences bearing the co-occurrence of CLUMPs 2 and 5 also contain the ShK domain. Although the exact biological function of the ShK domain remains unclear, previous reports have shown that this domain might be associated with immunosuppression [62], [63].
Overall, these findings highlight that specific CLUMP co-occurrences are associated with specific functional domains with roles in invasion and/or infection, and might suggest different classes of candidate parasitism proteins across species.
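The association between a CLUMP combination and a domain, as quantified in this section (for example, 23 out of 23 or 13 out of 34 sequences), amounts to counting, for each exact set of CLUMPs, how many sequences also carry the domain. The sketch below shows this tabulation on hypothetical annotated records; the record layout and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def domain_association(records, domain):
    """For each exact CLUMP combination, return (n_with_domain, n_total)."""
    totals = defaultdict(int)
    with_domain = defaultdict(int)
    for clumps, domains in records:
        key = frozenset(clumps)  # exact combination, independent of order
        totals[key] += 1
        if domain in domains:
            with_domain[key] += 1
    return {key: (with_domain[key], totals[key]) for key in totals}


# Hypothetical records: (CLUMPs found in a sequence, domains annotated on it).
records = [
    ({"CLUMP7", "CLUMP10"}, {"glycosyl hydrolase family 5"}),
    ({"CLUMP7", "CLUMP10"}, {"glycosyl hydrolase family 5", "signal peptide"}),
    ({"CLUMP1"}, {"pectate lyase"}),
    ({"CLUMP1"}, set()),
    ({"CLUMP1"}, {"signal peptide"}),
]

assoc = domain_association(records, "glycosyl hydrolase family 5")
for combination, (hits, total) in assoc.items():
    print(sorted(combination), f"{hits}/{total}")
```

Keying on the exact combination keeps sequences bearing CLUMPs 7 and 10 separate from those bearing CLUMPs 3, 7 and 10, which matters here because the two combinations are associated with different domains.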
CLUMP screening identified a novel effector in M. incognita validated by in situ hybridization
To inspect whether the newly identified CLUMPs could also help to find new effectors, we focused on selecting a novel putative effector to validate experimentally. We first selected all proteins of the Meloidogyne incognita proteome bearing a signal peptide for secretion and no transmembrane domain. We then screened these sequences and retrieved those containing at least one motif from the six significant CLUMPs. Among them, 23% contain at least one occurrence of motifs in CLUMP5 (Supplementary Table 6). Since this is the most abundant CLUMP in this species, we focused on it to identify a putative candidate to validate experimentally. By literature mining, we refined our list by removing seven sequences that had already been experimentally validated in previous studies (Supplementary Table 6). We then filtered out any candidates having homologs in species other than root-knot nematodes and more than two gene copies, to avoid dealing with multigene families, according to [42]. Finally, among the eight remaining new putative effector sequences, we studied the expression pattern of one candidate, MiEFF72 (Minc3s00056g02931), by in situ hybridisation (ISH, see supplementary information). A specific signal was detected in the subventral oesophageal gland cells of pre-parasitic J2s after hybridisation with digoxigenin-labelled MiEFF72 antisense probes (Fig. 6A). No signal was detected in pre-parasitic J2s with sense negative controls. MiEFF72 fused to the C-terminus of GFP was transiently expressed in N. benthamiana leaf epidermis. GFP fluorescence was detected in the cytoplasm and in cytoplasmic vesicles (Fig. 6B). These findings suggest that MiEFF72 may be secreted and play a role in planta in nematode parasitism.
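The successive filters used to shortlist candidates can be expressed as a single predicate over pre-annotated proteins. The sketch below is illustrative only: the `Protein` fields, the motif strings and the toy entries are hypothetical, all annotations (signal peptide, transmembrane domain, homology, copy number, prior validation) are assumed to have been computed beforehand with external tools, and the homolog and copy-number criteria are assumed to be applied as two separate exclusion rules.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    protein_id: str
    sequence: str
    has_signal_peptide: bool
    has_transmembrane_domain: bool
    gene_copies: int
    homologs_outside_rkn: bool  # homologs outside root-knot nematodes
    already_validated: bool     # reported as an effector in earlier studies


def select_candidates(proteins, motifs, max_copies=2):
    """Return IDs of proteins passing every filter described in the text."""
    kept = []
    for p in proteins:
        if not p.has_signal_peptide or p.has_transmembrane_domain:
            continue  # must be predicted secreted and not membrane-anchored
        if not any(motif in p.sequence for motif in motifs):
            continue  # must carry at least one motif of the chosen CLUMP
        if p.already_validated:
            continue  # discard effectors already validated experimentally
        if p.homologs_outside_rkn or p.gene_copies > max_copies:
            continue  # assumption: each criterion excludes on its own
        kept.append(p.protein_id)
    return kept


# Hypothetical motifs standing in for CLUMP5 (not the real motif set).
clump5_motifs = ["KKGA", "PNGT"]

proteome = [
    Protein("cand_A", "MKLLAVKKGASTT", True, False, 1, False, False),
    Protein("cand_B", "MKLLAVKKGASTT", False, False, 1, False, False),
    Protein("cand_C", "MKLLAVPNGTSTT", True, True, 1, False, False),
    Protein("cand_D", "MKLLAVPNGTSTT", True, False, 4, False, False),
    Protein("cand_E", "MKLLAVAAAASTT", True, False, 1, False, False),
    Protein("cand_F", "MKLLAVKKGASTT", True, False, 2, False, True),
    Protein("cand_G", "MKLLAVKKGASTT", True, False, 2, True, False),
]

candidates = select_candidates(proteome, clump5_motifs)
print(candidates)
```

Each toy entry fails a different filter except `cand_A`, which makes it easy to see what every step removes.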