Protein type and lifestyle contribute to the evolution of low complexity regions in proteins of plant colonizing fungi

Background: Interactions between plants and fungi range from mutual symbiosis to parasitism. Fungi secrete proteins (termed effectors) to establish different forms of interactions with host plants. Such proteins are thought to coevolve with their molecular plant targets, and this can favor the emergence of novel alleles of fungal effector proteins. Low complexity regions in protein sequences are abundant in eukaryotes and were shown to contribute to the formation of novel protein sequences. This suggests that low complexity regions may play a role in the evolution of effector proteins. Several effector proteins with low complexity regions were functionally characterized in plant colonizing fungi that showed diverse lifestyles and belonged to different taxonomic groups. To investigate if low complexity regions in fungal effector proteins could contribute to the evolution of different plant-fungus interaction types, I employed publicly available genomic data from 121 species of plant colonizing fungi representing six different lifestyles and three phyla. I classified proteins in each species as cytoplasmic, secreted-non effector or effector protein and predicted low complexity regions in all protein sequences. Results: I found that the fraction of proteins that contain a low complexity region differs between cytoplasmic and secreted proteins. Moreover, the fraction of a protein sequence that spans a low complexity region differed on average between cytoplasmic and secreted proteins. Inferring homologous relationships between effector proteins revealed that this fraction is higher in recent compared to ancestral proteins, suggesting that low complexity regions contribute to the formation of novel effector alleles. Furthermore, a principal component analysis and the results of a generalized linear model showed that the lifestyle of different fungi contributes to the evolution of low complexity regions. 
Likewise, the relative position of low complexity regions differed between cytoplasmic and secreted proteins, between ancestral and recent effector proteins, and between effectors of different lifestyles. Conclusions: Protein type and lifestyle contribute to the evolution of low complexity regions in proteins of plant colonizing fungi, but molecular and evolutionary mechanisms explaining the differences between different protein types and proteins of different lifestyles remain to be elucidated.

Background
Over the last 400 million years, plants and fungi have shared a long history of coevolution, and mutualistic symbiotic interactions with fungi might have already supported the origin of the first land plants [1][2][3]. Today, diverse types of interactions can be found between plants and fungi, ranging from parasitism to mutualistic symbiosis [4][5][6]. Pathogenic fungi employ distinct strategies to colonize host plants and to obtain nutrients from them. Necrotrophic fungi kill their host plant and feed on dead plant tissue, whereas biotrophic fungi depend on the survival of the host plant to complete their life cycle. Some biotrophic species strictly depend on their host plant for survival (obligate biotrophs), while others can also grow as free-living organisms on artificial media (facultative biotrophs). Finally, hemibiotrophic fungi switch from an initial biotrophic to a later necrotrophic feeding strategy during plant colonization [7]. Understanding the molecular basis of pathogenic and symbiotic interactions is important, because they have a great influence on natural and agricultural ecosystems [8][9][10][11][12][13][14][15][16].
Pathogens, as well as symbionts, secrete proteins that modulate interactions with their host plants. Such proteins are termed effectors and fulfill their function in the apoplastic space between invading fungal hyphae and plant cells or are taken up into the plant cytoplasm. Effectors promote fungal colonization by suppressing plant immune responses, shielding invading hyphae or altering host cell physiology and metabolism in favor of the fungus [6,7,17][18][19][20][21][22][23]. Genes encoding effector proteins in pathogenic fungi are thought to coevolve antagonistically with their plant targets, and this may lead to the emergence of new alleles in effector genes [24].
Low complexity regions in protein sequences are very abundant in eukaryotes and are characterized by a high enrichment in one or a few amino acids [25,26]. Their emergence is thought to be linked to mitotic replication slippage or meiotic recombination [26,27]. Two different scenarios are discussed to explain the abundance of low complexity regions. One hypothesis proposes that low complexity regions are merely neutral spacers between protein folds [28], or that low complexity regions are excised from the mature protein sequence and therefore have no structural or functional role [29]. Moreover, the high diversification of low complexity regions between species suggests that such regions evolve neutrally [30]. An opposite scenario proposes that the presence of low complexity regions is adaptive [31,32], as they, for example, increase mRNA stability [33] and underlie protein-protein interactions [31]. Furthermore, the prevalence of low complexity regions in antigenic loci supported the idea that low complexity regions contribute to antigen diversification [34]. Low complexity regions are an important source of phenotypic variation [35], and this innovation can form the basis of adaptations [36]. In sum, these characteristics of low complexity regions suggest that they also contribute to the evolution of fungal effector proteins.
Several examples of effectors containing low complexity regions have been functionally characterized in plant colonizing fungi that belong to different taxonomic groups and that employ different lifestyles [37,38]. Therefore, I sought to investigate if the evolution of low complexity regions varies between species with different lifestyles. To this end, I used publicly available genomic data of 121 plant colonizing fungi representing six different lifestyles, from symbiotic to necrotrophic and wood degrading species. I compared typical features of low complexity regions, namely the fraction of a protein sequence that spans a low complexity region, amino acid composition, and relative positions of low complexity regions in a protein sequence. The presented results reveal differences between cytoplasmic and secreted proteins, ancestral and recent protein sequences, and between different fungal lifestyles.

Results
First, I predicted low complexity regions in all protein sequences of the 121 investigated species (supplementary table 1) and determined the fraction of proteins with low complexity regions. This yielded one value for each species and protein type (Fig. 1). I then compared this fraction between cytoplasmic proteins, secreted non-effector proteins, and effector proteins. The fraction of protein sequences with low complexity regions was lowest in secreted non-effector proteins in 119 species (Fig. 1). Exceptions to this general trend were the two obligate biotrophic species Blumeria graminis f.sp. tritici (short name 'Blugrt') and Erysiphe necator (short name 'Erynec'), both belonging to the Ascomycota. I found this pattern in only 29 to 71 species in 10,000 random permutations (see Methods). Moreover, in 118 species the fraction of proteins with low complexity regions was lower in secreted proteins (effectors and non-effectors) than in cytoplasmic proteins (Fig. 1). The three exceptions to this general trend were Taphrina deformans (a facultative biotrophic pathogen belonging to the Ascomycota; abbreviation 'Tapdef') as well as the two Basidiomycete symbionts Tulasnella calospora (abbreviation 'Tulcal') and Piriformospora indica (abbreviation 'Pirind'). This trend occurred in only 14 to 52 species in 10,000 random permutations. In conclusion, these observations indicate that the fraction of proteins with low complexity regions does not evolve by chance. However, evolutionary mechanisms that could explain systematic differences in the presence of low complexity regions between cytoplasmic and secreted proteins remain to be identified. For example, the high fraction of cytoplasmic proteins with low complexity regions could suggest that low complexity regions are functionally important; hence, their presence could be advantageous and selected. Alternatively, the occurrence of low complexity regions could be neutral in cytoplasmic proteins, and therefore low complexity regions accumulate in cytoplasmic proteins. Likewise, it remains to be elucidated if the presence of low complexity regions in secreted proteins is generally disadvantageous, which would then explain the low fraction of secreted proteins with low complexity regions. In particular, the evolutionary and molecular mechanisms that underlie differences between secreted non-effector proteins and effector proteins remain to be elucidated. In summary, the fraction of proteins with low complexity regions differed between cytoplasmic and secreted proteins, but the observed trend was largely consistent between different lifestyles and phyla (Fig. 1A to Fig. 1F).
Next, I calculated the fraction of each protein sequence that spans a low complexity region, yielding one value for each protein in each species (supplementary table 2). I found that the median of these fractions was highest in effector proteins for 95 species (Fig. 2). This number ranged from 21 to 58 species in 10,000 random permutations, again indicating that this pattern does not arise by chance. Moreover, the median fraction of protein sequences spanning a low complexity region did not evolve by chance in the investigated protein categories (Table 1). Intriguingly, I found the highest median values in the group of effector proteins (Table 1). Together with the analysis of the protein fraction with low complexity regions, this finding indicates that low complexity regions are less common in effector proteins than in cytoplasmic proteins, but span on average a larger fraction of the sequence when they occur. Previous studies reported that low complexity regions differ in their amino acid composition, that is, certain amino acids were found to be overrepresented in low complexity regions [39][40][41]. My analysis revealed over-representation of certain amino acids as well; however, no lifestyle-specific or phylum-specific enrichments could be identified (supplementary table 3).
To investigate further a putative role of low complexity regions in the emergence of novel effector alleles, I inferred homologous relationships between all 73,484 effector sequences (supplementary table 2). In two independent analyses, I used the native effector protein sequences or sequences where I replaced low complexity regions with 'X' as unknown amino acid, because low complexity regions can complicate the search for homology [42]. For both analyses, I reconstructed families of homologous sequences with OrthoFinder [43] (supplementary table 2, supplementary table 4, and supplementary table 5). Next, I aimed to identify all families of homologous effector proteins that contain at least one member from each species. This set of proteins represents likely ancestral sequences, as they are conserved in all species; however, no family of homologous proteins contained members from all species (Table 2). Therefore, I used those families of homologous proteins that covered the largest number of species as a proxy for truly ancestral sequences ( [...] [41], suggesting that this observation reflects a general trend in eukaryotic proteins.
Table 2 Groups of homologous proteins and number of their members for ancestral and species-specific proteins as identified by OrthoFinder with native and masked protein sequences
To gain more fine-grained insights into the contribution of fungal lifestyles to the fraction of protein sequences that span a low complexity region, phylogenetic information needs to be taken into account [44]. However, obtaining accurate alignments and phylogenetic trees is challenging in this data set, because the effector protein sequences used represent hundreds of millions of years of evolution [45].
To investigate potential layered effects between protein type, lifestyle, and phylogenetic relationships (phylum), I fitted a generalized linear model to the data of all proteins, regardless of their type (supplementary table 2). Specifically, I used the formula "fraction of protein sequences that span a low complexity region" ~ protein type * lifestyle * phylum. I then used the results to rank the models with different fixed-term effects according to the Bayesian information criterion, and I found that all three parameters together best explain the observed data (Table 3). In summary, the fraction of low complexity regions in a protein sequence is higher in younger protein sequences, indicating that low complexity regions contribute to the formation of novel alleles. Moreover, the results obtained from a principal component analysis and a generalized linear model suggest that lifestyle contributes to the evolution of the fraction of effector protein sequences that span low complexity regions.
A previous study indicated that low complexity regions could play a position-dependent role, and proteins in which low complexity regions tended to localize towards the termini had a larger number of interaction partners [46]. To investigate if low complexity regions show different localization patterns in my set of fungal proteins, I determined the relative position of low complexity regions in all types of proteins, that is, cytoplasmic proteins, secreted non-effector proteins, and effector proteins. Figure 4 shows the result for each low complexity region in each protein and species. In 115 species, the median relative position of low complexity regions in cytoplasmic proteins was located closer to the N-terminus than the median relative position of low complexity regions in secreted proteins (effectors and non-effectors [...] urcinatum (an Ascomycete symbiont, short 'Tubaes'), and Wolfiporia cocos (a Basidiomycete wood degrading fungus, short 'Wolcoc'). This suggests that the position of low complexity regions in general evolves differently between cytoplasmic and secreted proteins, and this conclusion is corroborated by results from 10,000 random permutations, where low complexity regions of cytoplasmic proteins were located closest to the N-terminus in only 14 to 51 species. Following the results reported by Coletta and colleagues [46], this would indicate that cytoplasmic proteins with low complexity regions have more interaction partners than secreted proteins with low complexity regions. In 52 species, the median relative position of low complexity regions in secreted non-effectors was closer to the N-terminus than in effectors, and in 69 species, the opposite trend was observed. This is consistent with randomized samples, where low complexity regions were located closer to the N-terminus in secreted non-effectors compared to effectors in 39 to 83 species, suggesting that the relative localization of low complexity regions is similar between different types of secreted proteins (effectors and non-effectors). To investigate further if the observed median values of relative positions evolved by chance, I randomly assigned each protein to one protein type (cytoplasmic, secreted non-effector, and effector). I found that the median relative position in the different protein types does not evolve by chance (Table 4).
1) Reported are the minimum and maximum median values that are obtained from 10,000 random permutations
2) An observed value is considered to evolve by chance if it lies between the minimum and maximum median values obtained from 10,000 random permutations
[...] difference in the relative positions of low complexity regions between anciently and recently emerged protein sequences (Table 2, supplementary table 2, supplementary table 4). I observed that the relative position is closer to the N-terminus in ancient proteins (Fig. 5A; P-value = 0.01875, Wilcoxon Rank-Sum test).
I observed a similar trend when I used natural protein sequences to infer homologous relationships (supplementary Fig. 2A;

Discussion
Several studies showed that effector proteins containing low complexity regions contribute to virulence in fungi that represent diverse taxonomic groups and different strategies of plant colonization. Therefore, I sought to investigate if differences in lifestyle could contribute to the evolution of low complexity regions. Previous studies suggested that fungal lifestyle is connected to the secretome composition because secreted effector proteins aid in plant colonization [7]. For example, a comparative study comprising fungi with different plant colonization strategies showed that necrotrophic and hemibiotrophic species possess a larger repertoire of plant cell wall degrading enzymes compared to biotrophic and symbiotic fungi [7]. Additionally, secreted non-effector proteins could contribute to the evolution of different lifestyles. For instance, secreted serine proteases are involved in the determination of fungal lifestyles [47]. Based on these findings, I hypothesized that effector proteins with low complexity regions could also play a role in the lifestyle evolution of plant colonizing fungi.
To investigate this idea, I made use of comparative genomics. This approach has several potential shortcomings that may influence the results presented here. First, this strategy depends critically on the availability of high-quality genome assemblies and annotations. For example, re-sequencing of the Verticillium dahliae genome with single-molecule real-time sequencing together with optical mapping considerably improved the genome assembly [48]. Recent advances in the assembly of fungal genome sequences were also reported for the plant pathogens Botrytis cinerea [49] and Ramularia collo-cygni [50], and it is conceivable that future comparative genomics studies will benefit from further improvements in genome assemblies and annotations.
Second, I did not consider unconventionally secreted (effector) proteins in my analysis, although there are several reports of unconventionally secreted effector proteins that play important roles in virulence [51][52][53]. Such proteins could be identified in silico with SecretomeP [54,55]. However, this software was designed for mammalian and bacterial sequences, and no method is available for the screening of proteins originating from non-mammalian eukaryotes [56]. For example, applications to plant proteins yielded only unreliable results [57]. Therefore, it is questionable whether SecretomeP would be well suited for the identification of potential fungal effector proteins. Thus, I did not consider Rsp3 of Ustilago maydis [58]. Third, effectors were found to localize and function in different plant cellular compartments [59]. Effectors were also shown to be expressed in a colonization-stage dependent manner, that is, some effectors are already expressed when the fungus grows on the plant surface, whereas others are expressed only during late infection stages [60][61][62][63][64]. Such data are not yet available for a larger number of pathosystems, which makes it difficult to integrate them in current comparative studies.
Fourth, the investigated species differ not only in their mode of colonization but also in their host range, that is, the number of host plants they can infect. This number can range from one to several hundred species [4], and it was shown that codon usage is one genomic factor that contributes to the evolution of host ranges [65]. It is challenging to include host range data in the present analysis because such information often focuses on economically important plants [66]. [...] [68], and the broad host range pathogen Verticillium dahliae [69]. Moreover, different strains can encode different alleles of the same effector gene; this was, for example, reported for the low complexity region-containing Ustilago maydis effector Rsp3 [58].
Several reports suggest that effector proteins containing low complexity regions can play important roles in virulence [37,70], and results presented in the present work suggest that different fungal lifestyles contribute to the evolution of low complexity regions in effector proteins.

Conclusions
A comparative genomics study with 121 plant colonizing fungi representing six different lifestyles showed that protein type (cytoplasmic or secreted), protein age (ancestral proteins conserved in most species or recent species-specific proteins), and lifestyle of different fungal species contribute to the evolution of low complexity regions in effector proteins. Future work is required to elucidate the evolutionary and molecular mechanisms that explain the observed differences between protein types (especially secreted non-effectors and secreted effectors), and between different lifestyles.
Methods
To study the evolution of low complexity regions in proteins of plant colonizing fungi, I employed publicly available genome data from 21 necrotrophic, 34 hemibiotrophic, 11 obligate biotrophic, 19 facultative biotrophic, and 24 symbiotic fungi. Furthermore, I included genomic data from 12 wood degrading species as a contrasting set, because wood degrading fungi also obtain nutrients from plant material, but do not establish an interaction with the living plant [71]. Among all investigated species, 39 are Basidiomycota, 81 are Ascomycota, and one belongs to Glomeromycota (supplementary table 1). All species considered in this study, lifestyle information, taxonomic classifications, and sources of protein-coding sequences are listed in supplementary table 1.
All protein-coding sequences were initially filtered according to two criteria. If a sequence length was not a multiple of three (that is, the sequence contained invalid codons) or if a sequence contained a non-terminal stop codon (that is, the sequence represents a potential pseudogene), it was not considered for further analysis. The total number of annotated genes and the number of genes that passed the two filtering steps in each species are listed in supplementary table 1.
All valid protein-coding sequences were translated to amino acid sequences and assigned as cytoplasmic protein, secreted (but non-effector) protein or secreted effector protein. To define the total set of predicted secreted proteins, SignalP 4.1 [72] was used to identify N-terminal secretion signal peptides, and TMHMM 2.0c [73], as well as Phobius 1.01 [74], were employed to identify transmembrane domains. C-terminal endoplasmic reticulum retention signals (ERRS) were predicted with ps_scan 1.88 [75] using the prosite pattern PS00014 (ER_TARGET). This prosite pattern considers the consensus sequence [KRHQSA]-[DENQ]-E-L for the prediction of an ERRS.
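The two sequence filters and the ERRS consensus described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study (which relied on ps_scan). The function names are hypothetical, the standard nuclear stop codons TAA, TAG, and TGA are assumed, and the pattern is anchored at the C-terminus on the assumption that the retention signal ends the protein.

```python
import re

# Assumption: standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Consensus [KRHQSA]-[DENQ]-E-L, anchored at the C-terminus (assumption).
ERRS_PATTERN = re.compile(r"[KRHQSA][DENQ]EL$")


def is_valid_cds(cds: str) -> bool:
    """Return True if the coding sequence passes both filters:
    length is a multiple of three and no stop codon occurs before the last codon."""
    cds = cds.upper()
    if len(cds) == 0 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # Only the final codon is allowed to be a stop codon.
    return not any(codon in STOP_CODONS for codon in codons[:-1])


def has_errs(protein: str) -> bool:
    """Return True if the protein ends with the ERRS consensus.
    A trailing '*' (translated stop) is ignored."""
    return ERRS_PATTERN.search(protein.rstrip("*")) is not None
```

For example, `is_valid_cds("ATGTAAAAA")` is rejected because of the internal TAA codon, and `has_errs("MKTLLKDEL")` matches because the sequence ends in KDEL.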
A protein was considered as secreted if (i) a secretion signal peptide could be found, (ii) no transmembrane domain was identified downstream of the signal peptide, and (iii) no ERRS was found. It is conceivable that this set of secreted proteins does not only contain putative effector proteins, but also "housekeeping" proteins that are, for example, required for fungal cell wall synthesis or modification. Therefore, I employed three prediction programs to identify genuine effector proteins, namely Localizer 1.0.3 [76], ApoplastP 1.0 [77], and EffectorP 2.0 [78]. I considered a predicted secreted protein as a putative effector if at least one of the three programs gave a positive prediction. The numbers of cytoplasmic proteins, secreted non-effector proteins and effector proteins in each species are listed in supplementary table 1, and the assigned classification for each protein is shown in supplementary table 2.
I inferred homologous relationships between protein sequences with two approaches using OrthoFinder 2.2.6 [43]. In the first approach, I employed the native protein sequences, and in a second approach, I masked all low complexity regions with 'X' (that is, unknown amino acids) to rule out the possibility that low complexity regions affect the search for homologs. Default settings were used in both cases. Searches for homologous protein sequences were performed only for effector proteins, because they are involved in establishing interactions with host plants. Supplementary table 2 lists all proteins and provides information about the group of homologs to which each analyzed protein belongs in the two analyses. Supplementary table 4 and supplementary table 5 show the number of effector proteins from each species in each group of homologs that were identified by using masked or native protein sequences.
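The classification rules above amount to a small decision procedure, sketched here in Python for illustration only. The `Predictions` record and its field names are hypothetical and simply hold the yes/no outcome of each prediction program. How the TMHMM and Phobius results were combined is not specified in the text, so a single flag stands in for "transmembrane domain downstream of the signal peptide".

```python
from dataclasses import dataclass


@dataclass
class Predictions:
    """Hypothetical container for per-protein prediction outcomes."""
    signal_peptide: bool    # secretion signal peptide found
    tm_after_signal: bool   # transmembrane domain downstream of the signal peptide
    errs: bool              # ER retention signal found
    localizer: bool         # positive prediction by Localizer
    apoplastp: bool         # positive prediction by ApoplastP
    effectorp: bool         # positive prediction by EffectorP


def classify(p: Predictions) -> str:
    """Assign a protein to one of the three protein types."""
    # Criteria (i) to (iii) must all hold for a secreted protein.
    secreted = p.signal_peptide and not p.tm_after_signal and not p.errs
    if not secreted:
        return "cytoplasmic"
    # A secreted protein is a putative effector if any program is positive.
    if p.localizer or p.apoplastp or p.effectorp:
        return "effector"
    return "secreted non-effector"
```

Note that a positive effector prediction is only evaluated for proteins that already satisfy the three secretion criteria, mirroring the order of the steps in the text.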
I used three programs to scan protein sequences for the presence of low complexity regions, namely SEG [79], DisEMBL [80], and fLPS [81]. All programs were run with default settings. As a conservative approach for assigning low complexity regions, I considered only protein regions that were identified as low complexity regions by all three programs. Relative positions of low complexity regions in protein sequences were calculated by dividing the midpoint position of a low complexity region by the protein length.
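The consensus step and the derived quantities can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this study: regions are taken as 1-based inclusive (start, end) intervals, the consensus is computed as the residue-level intersection of the three predictions, and the midpoint is the mean of start and end. All function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumption: 1-based, inclusive (start, end)


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the stretches covered by both interval lists."""
    out = []
    for s1, e1 in a:
        for s2, e2 in b:
            start, end = max(s1, s2), min(e1, e2)
            if start <= end:
                out.append((start, end))
    return sorted(out)


def consensus(seg: List[Interval], disembl: List[Interval],
              flps: List[Interval]) -> List[Interval]:
    """Keep only regions reported by all three programs."""
    return intersect(intersect(seg, disembl), flps)


def relative_position(region: Interval, protein_length: int) -> float:
    """Midpoint of the region divided by the protein length."""
    start, end = region
    return ((start + end) / 2) / protein_length


def lcr_fraction(regions: List[Interval], protein_length: int) -> float:
    """Fraction of the protein sequence that spans a low complexity region."""
    covered = set()
    for start, end in regions:
        covered.update(range(start, end + 1))
    return len(covered) / protein_length


def mask(sequence: str, regions: List[Interval]) -> str:
    """Replace low complexity residues with 'X' (unknown amino acid)."""
    chars = list(sequence)
    for start, end in regions:
        for i in range(start - 1, end):
            chars[i] = "X"
    return "".join(chars)
```

With this definition, a relative position close to 0 indicates a region near the N-terminus and a value close to 1 a region near the C-terminus.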
Amino acid enrichments or depletions in low complexity regions compared to non-low complexity regions were determined as follows. For each protein, the frequency of each amino acid was determined in low complexity regions and non-low complexity regions. This analysis was done separately for each species and each protein category (that is, cytoplasmic proteins, secreted non-effector proteins, and effector proteins). Significant differences in amino acid frequencies were identified with the Wilcoxon rank-sum test, followed by Bonferroni correction to account for multiple testing. Specifically, I multiplied each P-value obtained with the Wilcoxon rank-sum test by 20 (the number of tested amino acids), and considered differences as significant if the P-value was smaller than 0.05 after this correction.
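The correction step can be written down in a few lines. The sketch below assumes that raw P-values from the rank-sum test are already available per amino acid (for instance from a statistics library) and only illustrates the multiplication by 20 and the 0.05 threshold. The function name is hypothetical, and capping corrected values at 1 is a common convention added here, not something stated in the text.

```python
from typing import Dict, Set, Tuple


def bonferroni(p_values: Dict[str, float], n_tests: int = 20,
               alpha: float = 0.05) -> Tuple[Dict[str, float], Set[str]]:
    """Multiply each raw P-value by the number of tests and flag significant ones."""
    # Capping at 1.0 is an added convention so corrected values stay valid probabilities.
    corrected = {aa: min(1.0, p * n_tests) for aa, p in p_values.items()}
    significant = {aa for aa, p in corrected.items() if p < alpha}
    return corrected, significant
```

For instance, a raw P-value of 0.004 becomes 0.08 after correction and is no longer considered significant, whereas a raw P-value of 0.001 becomes 0.02 and remains significant.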
To assess if observed results can be explained by chance alone, I performed random permutations. Specifically, the total set of proteins comprises x_E effector proteins, x_S secreted non-effector proteins, and x_C cytoplasmic proteins. From the total set of proteins, I randomly assigned x_E proteins as effectors, x_S proteins as secreted non-effectors, and x_C proteins as cytoplasmic. This assignment was done without replacement (that is, the total number of proteins did not change, and each protein is assigned to only one category) and repeated 10,000 times. For each repetition, I noted the median fraction of a protein sequence that is spanned by a low complexity region and the relative position of a low complexity region. Moreover, I recorded (i) in how many cases the fraction of proteins with low complexity regions was lowest in secreted non-effector proteins (Fig. 1), (ii) the number of cases where the fraction of proteins with low complexity regions was lower in secreted proteins (effectors and non-effectors) than in cytoplasmic proteins (Fig. 1), (iii) the number of cases where the median fraction of a protein sequence that spans a low complexity region was highest in effectors (Fig. 2), (iv) in how many cases the median relative position of a low complexity region was closest to the N-terminus in cytoplasmic proteins (Fig. 4), and (v) the number of cases where the relative position of low complexity regions was located closer to the N-terminus in secreted non-effectors compared to effectors (Fig. 4).
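The permutation scheme can be sketched as a label shuffle, which keeps the three group sizes fixed and therefore corresponds to assignment without replacement. The code below is a minimal illustration, not the implementation used for the reported results. The function names and the fixed random seed are assumptions, and the decision rule follows the criterion that an observed value is attributed to chance if it lies between the minimum and maximum of the permuted medians.

```python
import random
from statistics import median
from typing import List, Sequence, Tuple


def permutation_range(values: Sequence[float], labels: Sequence[str],
                      target: str, n_perm: int = 10000,
                      seed: int = 0) -> Tuple[float, float]:
    """Return the minimum and maximum median of `target` over random relabelings."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    shuffled: List[str] = list(labels)
    medians = []
    for _ in range(n_perm):
        # Shuffling the labels preserves the number of proteins per category.
        rng.shuffle(shuffled)
        group = [v for v, lab in zip(values, shuffled) if lab == target]
        medians.append(median(group))
    return min(medians), max(medians)


def outside_range(observed: float, lowest: float, highest: float) -> bool:
    """True if the observed value lies outside the range of permuted medians."""
    return observed < lowest or observed > highest
```

A typical use would be to pass the per-protein fractions (or relative positions) of one species together with the protein type labels, and then compare the observed median of, say, the effector group with the returned range.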
To examine if the three characteristics protein type, lifestyle, and phylum could have a layered effect

Declarations
Ethics approval and consent to participate - not applicable.
Competing interests - the authors declare that they have no competing interests.
Funding - this project did not receive specific funding.
Authors' contributions - G. S. conceived the study, designed and performed analyses, interpreted results, and wrote the manuscript.

Figures
Figure 1 Fraction of proteins that contain low complexity regions. In each species, the fraction of cytoplasmic, secreted non-effectors and effectors with at least one low complexity region was determined (vertical axis [...]
[...] Effector proteins are classified as ancestral or species-specific (horizontal axis) and their sequence fraction spanning a low complexity region is shown on the vertical axis.
OrthoFinder was used to reconstruct groups of homologous proteins by using masked protein sequences (low complexity regions are replaced with 'X' as unknown amino acid).
