The Subordinate Role of Pseudogenization over Recombinative Deletion Following Polyploidization in Angiosperms

doi:10.21203/rs.3.rs-4427674/v1

Download PDF

Article

The Subordinate Role of Pseudogenization over Recombinative Deletion Following Polyploidization in Angiosperms

https://doi.org/10.21203/rs.3.rs-4427674/v1

This work is licensed under a CC BY 4.0 License

Version 1

posted

You are reading this latest preprint version

Extensive gene loss is a key feature of the rediploidization process following polyploidization. Despites this, the molecular mechanisms behind such gene loss are unclear, whether it is through pseudogenization, i.e., the accumulation of deleterious mutations leading to the non-functionality of genes, or through DNA deletion from the genome by recombination. Here, we investigated the extent of pseudogenization post-polyploidization by identifying pseudogenes in collinear segments retained from ancient polyploidy events in twelve angiosperms. The number of identified polyploidy-derived pseudogenes among the investigated species was drastically lower than what would be expected if pseudogenization and DNA deletion contribute equally to gene loss. Because this number could be affected by the decreased pseudogene detectability due to mutation accumulation, and the decreased collinearity over-time, we addressed these issues by simulating neutrally evolving (pseudo)genes and investigating a neo-polyploid sugarcane genome, respectively. Our simulations indicated that in the absence of DNA deletion mechanisms, pseudogenes should be detectable for a longer evolutionary period than observed in the real paleo-polyploid genomes. Furthermore, despite being still polyploid, the neo-polyploid sugarcane genome has already experienced significant gene loss, with only one-third of these loss events associated with pseudogenization. Taken together, these findings imply that gene loss post-polyploidization primarily takes place via deletion through recombinative DNA deletion. This assertion is further reinforced by the identification of an apparent initial phase of rapid gene loss, and an enrichment of pseudogenes in regions of low recombination. Moreover, we hypothesize that some of the found pseudogenes may still govern a function, indicated by their non-neutral Ka/Ks ratio and overlap with lncRNAs.

Biological sciences/Evolution/Molecular evolution

Biological sciences/Genetics/Genomics/Comparative genomics

Biological sciences/Genetics/Genomics/Genome evolution

Polyploidization, the whole genome multiplication (WGM) of the total set of chromosomes, is a prominent evolutionary phenomenon that has been observed all across the plant lineage, especially in angiosperms^1–3. After formation, polyploidy is very often followed by drastic changes in the genome, such as genomic rearrangements and loss events, which may eventually lead to a return to a diploid state⁴. This process, referred to as rediploidization, significantly contributes to the evolutionary importance of polyploidy, influencing speciation, diversity, and evolutionary innovation^5–8. Interestingly, a typical contemporary angiosperm genome has undergone one or more rounds of polyploidization, each time followed by rediploidization^9,10. This suggests that rediploidization plays a pivotal role in the long-term evolutionary trajectory following polyploidization. Namely, rediploidization has been postulated to mitigate the drawbacks associated with polyploidy, such as meiotic instability, gene dosage imbalance, and increased nutrient demands, while concurrently generating variation, e.g., in gene copies and genomic rearrangements, upon which selection can operate^5,11. As such, rediploidization can facilitate speciation through the differential loss of duplicate genes and distinct genomic rearrangements¹².

One of the key features of post-WGM genome evolution and rediploidization is extensive gene loss. To this end, the fraction of retained WGM-derived duplicate genes appears to rapidly decrease over time, exhibiting an apparent exponential distribution^4,13,14. This rapid pervasive loss of genes can be attributed to the genome-wide redundancy for all genes after WGM. As such, it has been hypothesized that one copy of the duplicated gene pair may get lost without negatively impacting the fitness of an organism as there is still another functional copy of the gene present in the genome¹³. Indeed, after gene and genome duplication, gene loss or non-functionalization is considered as the most probable evolutionary path^13,15–17. Interestingly, despite this redundancy, duplicate genes are not simply lost at random. Some duplicate genes involved in specific functional categories are preferentially retained after WGM⁴. Several models have been put forward to explain this specific biased retention of duplicate pairs^15,18. For example, according to the dosage balance hypothesis, duplicate genes whose proteins participate in a pathway and/or multi-protein complex are preferentially retained after polyploidization due to dosage constraints, specifically through selection against the disruption of the stoichiometric balance of that pathway or multi-protein complex¹⁹. This hypothesis can explain the functional bias of retained duplicate genes towards transcription regulation, signaling, and developmental processes^20–22.

Although duplicate gene retention has been extensively studied, a major gap in our understanding of rediploidization pertains to the processes by which genes are lost. As a result, little is known about the mechanistic underpinnings of gene loss following WGM. In principle, gene loss can occur through two notable mechanisms, DNA deletion and pseudogenization. DNA deletion refers to the physical removal of a segment of DNA from the genome. The deleted DNA segment may encompass one or multiple genes. Hereby, genetic processes leading to DNA deletion are mainly related to recombination or mobilization of a transposable element²³. Illegitimate recombination, i.e., the recombination of two non-homologous sequences, has been suggested as the most prominent mechanism in angiosperms for gene loss^4,24–27. Pseudogenization, on the other hand, refers to the process whereby a gene acquires deleterious mutations leading to the non-functionality of that gene, followed by further degradation through mutation. A pseudogene is generally defined as a genomic sequence that has sequence similarity to a gene, i.e., a functional paralog, but is itself not functional anymore²⁸. In animals and ferns, pseudogenization has long been considered the most prominent mechanism for gene loss^29–31. Although originally considered rarer in angiosperms, several studies have indicated that pseudogenization may not be as limited as previously thought^32–37. Moreover, Xie et al. (2019) and Mascagni et al. (2021) investigated pseudogenes that originated from WGMs and pinpointed a number ranging from hundreds to thousands WGM-derived pseudogenes in seven and five paleo-polyploid species, respectively, suggesting pseudogenization plays a role in gene loss after WGM in angiosperms.

The two mechanisms of gene loss, DNA deletion and pseudogenization, may have very different biological implications. Importantly, DNA deletion results in the complete removal of the sequence from the genome, whereas after pseudogenization, the sequence is still retained, albeit in a non-functional state. As such, a pseudogene sequence may still have the potential to evolve and (re)gain a functional role, a capability that is not possible after DNA deletion. Indeed, a growing body of research indicates that at least some pseudogenes exhibit some form of functionality^{28,34,38–41}. Pseudogenes may regain their original function through the reversal of deleterious mutations if the selective environment shifts⁴², or may acquire a novel protein-coding function, e.g., as a truncated protein. Besides protein-coding potential, pseudogenes may also play a significant role in the origin of non-coding RNA^28,40,43. For example, several studies have put pseudogenes forward as an important source of long non-coding RNAs (lncRNAs), which govern regulatory functions^36,44–46. Gene loss through pseudogenization may thus be evolutionary appealing as it is flexible and may lead to functional novelty and diversity. However, DNA deletion leads to a reduction of the genome size which may be selectively advantageous, e.g., to reduce resource requirements⁴⁷.

Here, we investigated the relative importance of pseudogenization in comparison to recombinative DNA deletion and analyzed its temporal evolution during the process of rediploidization. To this end, we identified pseudogenes that trace back to WGMs in twelve different paleo-polyploid plant genomes. All these paleo-polyploid plant species have undergone at least one WGM at various timings in their evolution but are currently functionally diploid, allowing us to understand gene loss in the process of rediploidization. Furthermore, to obtain a more thorough comprehension of gene loss in a shorter timeframe following WGM, we investigated pseudogenization in a neo-autopolyploid sugarcane genome that has not undergone complete rediploidization. Although there have been two studies that identified WGM-derived pseudogenes^32,36, the current study is the first to investigate the relationship between these pseudogenes and the process of rediploidization. Our results indicate that a large fraction of duplicate genes appear to rapidly get lost through DNA deletion following WGM, most likely via recombination-related processes. Nonetheless, WGM-derived pseudogenes are identifiable, albeit in much lower numbers than expected given their rediploidization histories. Moreover, at least some of these pseudogenes have acquired a functional role as lncRNAs.

Identification of pseudogenes in paleo-polyploid species

We investigated the extent of pseudogenization in twelve paleo-polyploid plant species that underwent WGMs at different times in their evolutionary past (Figure 1A and Supplementary Table 1). The PseudoPipe pipeline⁴⁸ was utilized to identify putative pseudogenes. Pseudogenes were identified as genomic sequences that reside in the non-coding regions of a genome, encompassing all areas outside the coding exons, and have sequence similarity with a functional gene. In total, we obtain a highly variable total number of putative pseudogenes for different species, ranging from 6,989 in Arabidopsis thaliana to 236,487 in Zea mays (Supplementary Figure 1 and Supplementary Data 1). These putative pseudogenes were classified into three classes in accordance to PseudoPipe: (1) “retro-transposed (or processed) pseudogenes” that originated by retro-transposition of a reverse transcribed mRNA, (2) “duplicated pseudogenes” that originated from the duplication of a gene, and (3) “fragmented pseudogenes”, encompassing pseudogenes characterized by a highly fragmented structure, making it challenging to attribute them to a specific origin (Materials and Methods). In addition to these generic classes, we further defined a class of “WGM-derived pseudogenes”, encompassing our focal interest, for pseudogenes initially classified as “duplicated” or “fragmented” that were located in intergenic regions. This classification is based on their presence in collinear “homoeologous” (i.e., derived from a WGD or hybridization event) segments, which are defined as two or more regions within a genome containing several paralogous (pseudo)genes that largely maintain their order and orientation. As such, a “duplicated” or “fragmented” pseudogene that forms a homologous pair with a functional paralogous gene, is assumed to be “WGM-derived”. Duplicated pseudogenes that are not WGM-derived are then considered to be derived from small-scale duplication (SSD). The WGM-derived pseudogenes are consistently the smallest class, ranging from 61 in Sorghum bicolor to 1,761 in Glycine max (Figure 1B). Most WGM-derived pseudogenes came from the fragmented rather than the duplicated class, suggesting that pseudogenes of this class generally also have a fragmented structure. We compared our results with previous studies, revealing variability in the identification of pseudogenes that can be attributed to the absence of a precise definition for pseudogenes (Supplementary Text 1).

The total number of pseudogenes exhibit a moderately significant correlation with genome size (Spearman’s ρ = 0.57, p-value = 0.003) and the number of retro-transposons in the genome (Spearman’s ρ = 0.59, p-value = 0.02) (Supplementary Table 2 for more correlations between pseudogenes and genomic features). In contrast, the total number of pseudogenes does not significantly correlate with the timing of the most recent WGM in a species (Spearman’s ρ = -0.43, p-value = 0.20). Moreover, the number of fragmented pseudogenes also correlates significantly with the number of retrotransposons in the genome (Spearman’s ρ = 0.77, p-value = 0.03). Together with the large number of retro-transposed pseudogenes, which was very often larger than the number of duplicated (SSD- and WGM-derived) pseudogenes (Supplementary Figure 1), the significant correlations suggest that a large fraction of the pseudogenes originated from retro-transposition events. Surprisingly, the number of retro-transposed pseudogenes itself does not correlate with the number of retrotransposons in the genome (Spearman’s ρ = 0.36, p-value = 0.18). A potential explanation for the lack of correlation may be that retro-transposed pseudogenes were identified using stringent cut-off criteria, resulting in a large number of them being categorized as fragmented pseudogenes.

In all species analyzed, fragmented pseudogenes formed the most abundant class, suggesting that a large proportion of pseudogenes has experienced extensive fragmentation. This may be the result of recombinational processes that lead to rearrangements and deletions in the genome. Fragmentation is also reflected in the alignment coverage of pseudogenes against their functional paralog, which is consistently lower for WGM-derived pseudogenes than the alignment coverage observed between pairs of paralogous genes within collinear segments, i.e., WGM-derived duplicate genes or anchor pairs (Figure 1C, Supplementary Text 2, and Supplementary Figure 2). The protein percent identity between WGM-derived pseudogenes and their functional paralogous gene is relatively high when compared to the percent identity between WGM-derived duplicate genes, especially for species with an ancient WGM (Figure 1C, Supplementary Text 2, and Supplementary Figure 2). However, only looking at protein percent identity gives an incomplete picture as it overlooks the alignment coverage. Multiplying the protein percent identity by the alignment coverage provides a metric for the protein percent identity including non-aligned parts, i.e., global protein percent identity. This combined metric tends to be higher for WGM-derived duplicate genes (Supplementary Figure 3).

The number of WGM-derived pseudogenes is strongly negatively correlated with the timing of the most recent WGM (Spearman’s ρ = -0.79, p-value = 0.01, Figure 1F). The more recent the WGM event, the more WGM-derived pseudogenes are identified, suggesting that pseudogenes are lost over time. In concordance, we find significant enrichments of various meiosis and RNA/DNA-related Gene Ontology (GO) terms in paleo-polyploid species with a WGM less than 35 MYA, while for paleo-polyploid species with a WGM more than 35 MYA, we observed no significant enrichments (Supplementary Text 3 and Supplementary Table 4). We would indeed expect that WGM-derived pseudogenes are enriched in functional categories associated with meiosis and RNA/DNA-related processes, as these are underrepresented in retained duplicate genes^21,22,51. The lack of enrichment observed in ancient paleo-polyploids (WGM > 35 MYA) suggests that these pseudogenes have disappeared over time, making it impossible to identify such enrichments.

The number of WGM-derived pseudogenes is also significantly positively correlated with the total number of genes in the genome (Spearman’s ρ = 0.83, p-value = 0.001), the number of WGM-derived duplicate genes (Spearman’s ρ = 0.87, p-value = 0.00014, Figure 1D), and the number of collinear segments (Spearman’s ρ = 0.93, p-value = 0.00013, Figure 1E). These strong correlations are likely a reflection of the method that was used to identify WGM-derived pseudogenes. The number of WGM-derived pseudogenes could be impacted by several factors, a concern that we discuss in the following sections.

Why so few WGM-derived pseudogenes?

Considering the rediploidization histories of the species and introducing a null hypothesis that pseudogenization and DNA deletion contribute equally to gene loss after WGM, we would expect to observe a comparable number of WGM-derived pseudogenes as DNA deletions to reflect this balanced contribution. However, the empirical data show that although the total number of identified pseudogenes is large, often surpassing the total number of functional genes within the given species, the number of WGM-derived pseudogenes is relatively limited compared to the total number of lost genes after WGM (Supplementary Figure 1). For example, in A. thaliana, which underwent its most recent WGM about 50 MYA, only about 21% of the total number of genes are retained as gene anchor pairs, suggesting that about 79% were lost or translocated (Supplementary Table 3). However, we only find 91 WGM-derived pseudogenes in A. thaliana, suggesting that the large majority of duplicate genes were lost through DNA deletion. Species with a more recent WGM, such as G. max, which underwent its most recent WGM about 13.6 MYA, tend to have more WGM-derived pseudogenes (1,761 pseudogenes), but the number is still modest compared to the gene number (52,872 genes) and the 48% of WGM-derived duplicate genes that were lost or translocated after the WGM.

We further fitted two competing models on gene loss over time to gain deeper insight into the gene loss processes after WGMs⁵². Hereby, we plotted the number of WGM-derived pseudogenes and gene pairs in the paleo-polyploids against the corresponding timing of the most recent WGM. For the two models of (pseudo)gene loss we consider, each represents a distinct dynamic of loss over time. The first model consists of one phase with one rate whereby (pseudo)genes get lost passively and independently, characterized by a single-exponential decay curve. The second model consists of two phases with two distinct rates and accommodates an initial phase whereby a group of genes may get lost simultaneously. The two-phase model can be approximated by a double-exponential decay curve, which decays more rapidly at the start and subsequently slows down⁵². Therefore, the initial phase of fast gene loss may account for the genome-wide redundancy that is expected after a WGM. Interestingly, fitting a double-exponential curve to the WGM-derived pseudogene counts against the timing of the most recent WGM consistently reverted to a single-exponential curve with an r² value of 0.65, corresponding to the one-phase model. In contrast, for the WGM-derived gene pair data, the double-exponential function provided a better fit than the single-exponential function with r² values of 0.8 and 0.56, respectively (Figure 1F). In addition, the Akaike information criterion (AIC) and Bayesian information criterion (BIC) measures also favors the two-phase model over the one-phase model for the WGM-derived gene pairs, while the one-phase model is favored for the WGM-derived pseudogene counts (Supplementary Table 5 and Supplementary Table 6). Similar findings are uncovered when using the timing of WGMs represented by the peaks observed in distributions of synonymous substitutions per synonymous site (K_s) for all duplicate genes from Vanneste et al.⁵⁰ (Supplementary Figure 4, Supplementary Table 5, and Supplementary Table 6).

While merely fitting a curve does not guarantee the correctness of the underlying assumed model, the notable differences in fits between WGM-derived pseudogenes and WGM-derived gene pairs are intriguing. If the fits reflect the essence of the underlying models on gene loss processes, our findings indicate that genes undergo an initial rapid loss phase, involving a rapid loss of genes right after WGM events, followed by a phase of gene loss with a slower rate. The pseudogenes, in contrast, show a consistent loss rate, suggesting it could not solely contribute to the rapid gene loss after WGM events. Although both models do not consider the mechanisms by which this simultaneous loss happens, our results here would suggest that DNA deletion dominate the gene loss processes after WGMs over pseudogenization. These observations may indicate that pseudogenization only has limited importance for gene loss after WGMs compared to DNA deletion in angiosperms as suggested before^26,27. However, before jumping to this conclusion, we further explore whether the life span of pseudogenes and the degradation of collinear segments may lead to potential biases and complicate the interpretation of WGM-derived pseudogene numbers.

Simulating the evolution of neutrally evolving pseudogenes indicates a longer expected identifiability span than observed in paleo-polyploids

A first potential bias which may have contributed to the low number of WGM-derived pseudogenes in the paleo-polyploids studied is that a pseudogene may rapidly accumulate mutations, rendering it undetectable due to its low sequence similarity with its functional paralogous gene, i.e., falling below the cut-offs of pseudogene identification. To interrogate this potential bias, we adopted the classical gene duplication model whereby gene duplicates are under relaxed functional constraint and free to accumulate mutations^13,17,53. Further, we assume that after obtaining a deleterious mutation, e.g., a premature stop codon or frameshift mutation, it becomes a pseudogene and continues to evolve without selective constraint⁵⁴. As such, we would expect that the more ancient a WGM, the more difficult to identify WGM-derived pseudogenes. Based on the above assumptions, we simulated the sequence evolution of one copy of the duplicate genes without any selective constraint to provide insight in the timeframe by which pseudogenes may be detectable. To implement the simulation, we retrieved all coding sequences (CDS) of A. thaliana and simulated a WGM event by generating duplicates with identical sequences. Next, we simulated the evolution of these duplicates over at least 100 million years using a spontaneous substitution rate of 7 × 10^-9 substitutions per site per year and 1- to 3-bp insertion and deletion rates of 0.3 × 10⁻⁹ and 0.6 × 10⁻⁹ per site per year, respectively⁵⁵. Per one million years, we translated and aligned both the evolved and ancestral sequences to record alignment statistics, most notably protein percent identity, alignment coverage, and E-value of the pairwise alignment. Importantly, for pseudogene identification in the paleo-polyploid species, we did not compare the pseudogene to its ancestral gene, but rather to its functional paralog, which also evolved, albeit under various selection pressures. To address this in the simulations, we additionally ran simulations using twice the spontaneous substitution and indel rates, which is then equivalent to evolving both duplicate genes neutrally. As such, we can consider the simulation following 1 × the spontaneous mutation rates function as a lower boundary of divergence between a pseudogene and its functional paralog, while the simulation following 2 × the spontaneous mutation rates functions as an upper boundary. We expect that the actual sequence divergence between a pseudogene and its functional paralog is somewhere in between these two boundaries (Materials and Methods).

By tracking the protein percent identity and coverage of the simulated protein relative to its ancestral form over time, our simulation reveals alignment plateaus for simulations with both mutation rates. For protein percent identity, the plateau is reached at a median of around 30% identity after between 40 and 80 MY for the simulations using 1 × and 2 × the spontaneous mutation rates, respectively (Figure 2A). Similarly, alignment coverage also reaches a plateau at a median of about 23% after 120 MY and 60 MY using respectively 1 × and 2 × the spontaneous mutation rates (Figure 2B and Supplementary Figure 5). In addition, the variation of both alignment statistics increases through time. By further exploring the “plateaus” in these alignment statistics, we found that they can be attributed to the pairwise alignment algorithm, which is inherently designed to seek similarities between two sequences to align most of the sequences. This, hence, underscores the importance of alignment significance, most often expressed as E-value, i.e., the number of expected alignments that would score at least as high as the observed alignment just by random chance⁵⁶. As expected, the E-value increases over time, indicating that the obtained alignments become less significant over time (Supplementary Figure 6). Next, we employed the thresholds utilized for duplicated (protein percent identity ≥ 40%, alignment coverage ≥ 5% and E-value < 10^-5) and fragmented pseudogenes (protein percent identity ≥ 20%, alignment coverage ≥ 5% and E-value < 10^-5) for the paleo-polyploids to track the fraction of unidentifiable simulated pseudogenes over time. Adhering to the criteria established for fragmented pseudogenes, half of the pseudogenes should remain detectable on average 55 MY, ranging between 37 and 73 MY using the upper and lower boundaries, respectively (Figure 2C). Similarly, when applying the more stringent criteria of duplicated pseudogenes, half of the simulated pseudogenes are expected to remain detectable on average 42 MY, ranging between 28 and 55 MY (Figure 2D).

Therefore, considering that approximately 79% of the duplicate genes are lost after the most recent WGM timing of A. thaliana (50 MYA) (Supplementary Table 3), assuming that pseudogenization has contributed equally as deletion events (50%), and taking the average percentage of pseudogenes that are detectable in the simulations at the most recent WGM timing of A. thaliana (50% and 41% using the cut-offs of respectively the fragmented and duplicated pseudogenes), we expect to see a proportion between 16-20% of the number of pre-WGM genes that are detectable as pseudogenes. Moreover, considering that approximately 21% of the WGM-derived duplicate genes are retained, we would thus expect to see a proportion between 13-16% of the current number of genes that are detectable as pseudogenes, which is much larger than what we observe (91). Similarly, for Populus trichocarpa, considering that approximately 58% of the duplicate genes were lost after its most recent WGM (34.7 MYA) (Supplementary Table 3), the same null hypothesis and that nearly all pseudogenes should remain identifiable following its most recent WGM timing, as it has a much slower spontaneous mutation rates of 1.33 × 10^− 10per site per year⁵⁷ than A. thaliana (Supplementary Figure 7), we would expect to see a proportion of about 20% of the current number of genes that are detectable as pseudogenes. This is again much higher than the number we observe (1,394). Moreover, these proportions are also much higher than the numbers found by Xie et al.³⁶ and Mascagni et al.³². Although the above calculations are based on the strong assumption that all genes that have no collinear paralog have lost its paralog from the genome, these calculations give us a more robust indication that the number of identified WGM-derived pseudogenes is, indeed, lower than expected, and could not be attributed to the loss of detectability. In addition, following the spontaneous mutation rates of A. thaliana, almost all simulated WGM-derived pseudogenes are identifiable after 20 to 45 MY (Figure 2C and D), indicating that pseudogenes where a WGM occurred less than 20 MYA, e.g., G. max, M. domestica, and Z. mays, should be readily identifiable.

We acknowledge that two factors may affect our simulations. First, only using the spontaneous mutation rates of A. thaliana and P. trichocarpa may not accurately reflect reality for other species. Nevertheless, the two selected species are among the species with the highest and lowest spontaneous mutation rates among angiosperms, respectively⁵⁸. The obtained results may thus represent a minimal and maximal expected identifiability time span for pseudogenes, with pseudogenes in other species experiencing an expected identifiability time span in between these two species. Second, assuming immediate loss of selective constraint after duplication is of course not realistic in our simulation, reflected by our observation that both WGM-derived pseudogenes from A. thaliana and P. trichocarpa tend to have higher protein percent identity than the simulated results but comparable alignment coverage. Indeed, our simulations using the spontaneous mutation rate of A. thaliana revealed that frameshifts and premature stop codons could rapidly accumulate within the first 5 MY, indicating that pseudogenization can happen easily in the absence of selection (Supplementary Figure 8 and Supplementary Figure 9). In reality, before the acquisition of a deleterious mutation, the duplicated genes may initially be still under some selective constraint, resulting in a greater protein percent identity than anticipated under a scenario of complete neutral evolution. That being said, a delayed process of pseudogenization by selection constraints would only prolong the detectability of pseudogenes hence leading to a higher number of pseudogenes than our simulation. As such, by considering both factors, we would expect much more WGM-derived pseudogenes than the ones observed in all the studied paleo-polyploids. Since the real data diverge from our simulation results, we must consider other explanations beyond the loss of detectability resulting from the rapid accumulation of mutations in neutrally evolving pseudogenes.

Gene loss in a neo-polyploid sugarcane genome

Another potential bias which may limit WGM-derived pseudogene identification is the ability to detect collinear segments. After WGM, intragenomic collinearity typically diminishes over time because of genome rearrangements and gene loss itself⁵⁹. This decline of collinearity through time is supported by a significant negative correlation between the timing of the most recent WGM and both the number of collinear segments (Spearman’s ρ = -0.84, p-value = 0.022) and the number of WGM-derived gene pairs (Spearman’s ρ = -0.77, p-value = 0.033). As such, the more ancient a WGM, the more challenging it becomes to find collinear segments, and subsequently to identify WGM-derived pseudogenes therein. Moreover, the number of WGM-derived pseudogenes itself significantly correlates with both the number of collinear segments and number of anchor pairs (Figure 1D and E), indicating that the ability to detect collinear segments surely influences the identification of WGM-derived pseudogenes. Nonetheless, for species that underwent a WGM very recently and are still polyploid, i.e., neo-polyploids, we would expect to observe extensive collinearity. As such, if the loss of collinearity is the main reason for the low number of WGM-derived pseudogenes and gene loss equally occurs through pseudogenization and DNA deletion, we would expect to identify both loss categories comparably in neo-polyploid species.

To this end, we investigated the auto-octoploid Saccharum spontaneum ‘AP85-441’ (sugarcane) genome, which has been sequenced to the sub-genome level, enabling direct comparisons between its four sub-genomes⁶⁰. S. spontaneum underwent two rounds of WGDs less than 0.8 MYA⁶¹, resulting in four sub-genomes, referred to as the A, B, C and D sub-genomes. Overall, the collinearity in sugarcane is relatively well maintained between sub-genomes, with only a few inversions and deletions (Supplementary Figure 10). In total, 14,929 (18%), 19,980 (24%), and 28,913 (35%) of the 82,773 genes had paralogs in two, three, and four sub-genomes, leaving 18,951 (23%) genes that were only found in one sub-genome (Figure 3A). Of these genes that were present on only one sub-genome, 1,372 were never found in a collinear segment. The above numbers clearly indicate that extensive gene loss has already been going on in this sugarcane genome, as also indicated by Zhang et al.⁶⁰. Moreover, this sugarcane genome is not an exception in gene loss dynamics as studies on other neo-polyploid genomes report similar decreases in copy number of WGM-derived paralogs, such as in an autotetraploid potato genome⁶² and an autotetraploid Cyclocarya paliurus genome⁶³.

Interestingly, different enriched GO terms were found for different levels of paralog retention. For genes that had paralogs in all four sub-genomes, we find enriched GO terms related to metabolic processes and responses. For genes that had paralogs in three sub-genomes, we find significantly enrichment in GO terms related to meiosis. For genes that had paralogs in two sub-genomes, we find significant enrichment in GO terms related to RNA/DNA modification. Finally, for genes that had a copy in only one sub-genome, we find significantly enriched GO terms related to transposition and ribonuclease activity (Supplementary Data 2 and Supplementary Text 4). In addition, genes that encode for proteins that form interactions with other proteins are significantly enriched in the genes that were retained in all sub-genomes or in three of the four sub-genomes (hypergeometric test p-value = 0), while this was not the case for genes that were retained in one or two of the sub-genomes (hypergeometric test p-value > 0.05). Interactions include both direct physical interactions, such as those within multi-protein complexes, and indirect interactions, like those occurring together in metabolic pathways. This observation aligns with the dosage balance hypothesis, which proposes a selective constraint against disturbing the stoichiometric equilibrium of multi-protein complexes^20,21.

To investigate pairwise sub-genome level gene loss, a new strategy was adopted for identifying pseudogenes, involving pairwise comparisons between the four sub-genomes of sugarcane. Hereby, rather than first identifying pseudogenes and subsequently collinear segments, we made use of the extensive collinear segments between two sub-genomes to determine genes for which WGM-derived paralogs were missing (i.e., genes that have not retained paralogs on all four sub-genomes) and identify pseudogenes for these missing paralogs (Materials and Methods). When comparing any two sub-genomes, around 90% of the genes from the two sub-genomes are located in collinear segments, with around 50% forming an anchor pair with a gene on the other sub-genome (Figure 3B). Hence, the remaining 40% of the genes that lack a paralogous gene within these segments, may signify two evolutionary scenarios: 1) the gene or its paralog is translocated; or 2) the gene lost its paralog, either through pseudogenization or DNA deletion (Supplementary Figure 11). If a translocation occurred, the paralog should still be present in the genome, but no longer form a paralogous pair within a collinear segment. Between any two sub-genomes, on average, we find that 6.4% of the genes had a paralog outside of a collinear segment, suggesting translocation. Subsequently, in cases where no paralog was found outside the collinear segment, pseudogenes were sought within the corresponding region of the collinear segment containing the gene. Between any two sub-genomes, on average, we find that 11% of the genes had a paralogous pseudogene within the collinear segment. Finally, between any two sub-genomes, on average, we find that 21.2% of the genes were present in collinear segments, but had no paralogous pseudogene or gene counterpart anywhere in the genomes, indicating DNA deletion events have occurred in the sub-genomes. To further corroborate that these genes lost their counterparts via DNA deletions, we checked if their neighboring paralogs also tend to get lost, as DNA deletion may not necessarily only affect one gene, but may affect a group of neighboring genes. Our results show that genes without counterparts are clustered together with an average 1.33 adjacent genes without counterparts indicating these genes are lost together likely through DNA deletions. In contrast, for randomly selected genes an average of only 0.57 adjacent genes without counterparts is found (Supplementary Figure 12).

Together, although the degradation of collinear segments and the lifespan of pseudogenes influence pseudogene identification, our findings demonstrate that DNA deletion, rather than pseudogenization, primarily drives gene loss following WGMs in both paleo- and neo-polyploids. Furthermore, while some pseudogenes are eventually eliminated through DNA deletion, those that persist may provide evolutionary spandrel that contribute to novel biological functions^34,39,40,43.

Pseudogenes are mostly present in regions of low recombination

It is expected that there is a negative correlation between recombination rate and total pseudogene number across the genome, as (pseudo)genes would be more likely removed from the genome by recombination in regions of high recombination⁶⁴. Indeed, a significant negative correlation between the distribution of pseudogenes and the recombination rate has been reported already for G. max³⁶. To further strengthen this finding, we explored the correlation in G. max, A. thaliana, O. sativa, and Z. mays. For all species analyzed, except Z. mays, a significant negative correlation was observed between total number of pseudogenes and local recombination rate (Figure 4 and Supplementary Figure 13-Supplementary Figure 15). Z. mays exhibits a surprisingly uniform pseudogene distribution, which may be caused by the high abundance of retrotransposons in the maize genome. Indeed, we find a low positive Pearson correlation of 0.16 (p-value = 1.49 × 10^-7) between local pseudogene number and retrotransposon pseudogene number (per window of 1 Mb), suggesting a link between retrotransposon and pseudogene abundance. A. thaliana, G. max and O. sativa have a significant Pearson correlation coefficient of respectively -0.35 (p-value = 4.18 × 10^-8), -0.58 (p-value = 1.4 × 10^-107) and -0.66 (p-value = 1.33 × 10^-91) (Supplementary Table 9).

Importantly, when analyzing different pseudogene classes separately, i.e., WGM-derived, SSD-derived, retro-transposed, and fragmented pseudogenes, there is a consistent negative correlation between pseudogene number and recombination rate, except for the WGM-derived pseudogenes. In all species, except A. thaliana, a significant positive correlation with local recombination rate is detected for the WGM-derived pseudogenes, with Pearson correlation coefficients ranging from 0.13 to 0.45 for different species (Supplementary Table 9). These positive correlations with recombination rate further explain why we observe so few WGM-derived pseudogenes in comparison to other pseudogene classes, as they are predominantly found in regions characterized by high recombination rates, different from other types of pseudogenes. Also, considering that these WGM-derived pseudogenes are retained in collinear segments with functional genes, the density of which also has a positive correlation with recombination ⁶⁵, our results further suggest that the remaining WGM-derived pseudogenes may be under stronger selections than other classes of pseudogenes.

Pseudogenes exhibit a non-neutral K_a/K_s ratio

To further explore the selection pressures on pseudogenes, we estimated the ratio of non-synonymous substitutions per non-synonymous site (K_a) to synonymous substitutions per synonymous site (K_s). Naively, we would expect the ratio of K_a/K_s to be equal to 1 for neutrally evolving pseudogenes, while functional genes are expected to have a K_a/K_sratio lower than 1 because of purifying selection. Whereas pseudogenes do exhibit a higher K_a/K_sratio than that of functional paralogous genes, they have K_a/K_sratio less than 1 in general, in contrast to the anticipated pattern of being exempt from selection with a K_a/K_sratio of 1 (Figure 5 and Supplementary Figure 16). Similar patterns have been observed for pseudogenes in e.g., rice and barley^32,33,35, reflecting that the estimation of K_a/K_sratios for pseudogenes and their functional paralogous genes may be affected by the accumulation of synonymous substitutions in the functional paralogous gene and by the period before pseudogenization in which the gene may be under purifying selection. This may be particularly true for pseudogenes originating from WGMs, as they are located in the same genomic area as the ancestral gene and, therefore, could potentially function as usual.

That being said, it is possible that at least some of the pseudogenes have adopted a novel function and are still under selective constraint. By definition, pseudogenes are expected to have no functionality because they lost their coding potential through detrimental mutations. However, non-functionality is not evaluated during the identification of pseudogenes, as it necessitates labor-intensive experimental verification. Several studies have demonstrated that some of the pseudogenes are still transcribed, with some of them playing significant functional roles^34,39,40,43. For example, pseudogenes have been considered as an important source of long non-coding RNAs (lncRNAs), which govern regulatory functions^36,44–46.

We hence assessed whether our pseudogene complement overlapped with the lncRNA complement in A. thaliana and S. lycopersicum, taken from the PLncDB database v2.0⁶⁶, considering an overlap when at least half of the pseudogene and lncRNA sequences coincided. As such, we find that 297 and 178 of the pseudogenes overlap with lncRNAs in A. thaliana and S. lycopersicum, respectively (Supplementary Table 10). An empirical null distribution was established by randomly shuffling lncRNA coordinates and assessing the overlap between our pseudogenes and the randomly shuffled lncRNAs (Materials and Methods). Hereby, we find that pseudogenes in both species had a significant overlap with a p-value of 0 and 0.049, respectively (Supplementary Figure 17). Moreover, after filtering the fragmented pseudogenes, we find an overlap between 106 pseudogenes and lncRNAs in S. lycopersicum with a p-value of 0 (Supplementary Figure 18), whereas the fragmented pseudogenes overlap with 72 lncRNAs, which is not significant for S. lycopersicum (p-value = 0.99, Supplementary Figure 19). Despite the relatively low overlap between WGM-derived pseudogenes and lncRNAs, with twelve and two overlaps in A. thaliana and S. lycopersicum, respectively, this is statistically significant when compared to an empirical null distribution. This significance is attributed to the limited number of WGM-derived pseudogenes in the two particular species (Supplementary Figure 20). For SSD-derived pseudogenes, there are 57 and 75 pseudogenes having a significant overlap with lncRNAs in A. thaliana and S. lycopersicum, respectively (Supplementary Figure 21). In contrast, retro-transposed pseudogenes do not show a significant overlap with lncRNAs in both species (Supplementary Figure 22). Together these results suggest that pseudogenes, in particular the ones originating from (both SSD and WGM) duplications, may lead to the origin of lncRNAs which may govern a regulatory function.

In this study, we investigated the role of pseudogenization in the process of rediploidization following WGM. We examined twelve paleo-polyploid genomes with varying timings of WGMs to identify pseudogenes arising from such events. A lower number of WGM-derived pseudogenes was observed than would be expected under a null hypothesis where pseudogenization and DNA deletion contribute equally to gene loss after WGM. After addressing potential biases that could cause a low number of pseudogenes, our analyses reject the null hypothesis but embraces a subordinate role of pseudogenization over DNA deletion following WGMs in the studied angiosperms. Different from other classes of pseudogenes, the positive correlation between the retention of WGM-derived pseudogenes and recombination rates along chromosomes indicate that recombinative DNA deletion is the reason for the limited number of such pseudogenes. Finally, some pseudogenes, especially the ones from duplications, either from SSDs or WGMs, may still govern a function, indicated by their non-neutral K_a/K_s ratio and overlap with lncRNAs.

Data retrieval

Twelve species were selected that differed in the occurrence and timing of WGMs: Amborella trichopoda, Arabidopsis thaliana, Brassica oleracea, Brassica rapa, Glycine max, Malus domestica, Oryza sativa, Populus trichocarpa, Solanum lycopersicum, Sorghum bicolor, Vitis vinifera, and Zea mays. The genomes of the analyzed paleo-polyploid species were retrieved from PLAZA v5.0⁶⁷. The timings of the WGMs were obtained from Vanneste et al.⁵⁰ for all species except Vitis vinifera, which was obtained from Jiao et al.⁶⁸(Supplementary Table 1). For Amborella trichopoda, we refrained from providing an estimated timing due to the ongoing controversy regarding whether a WGD occurred before the divergence of angiosperms, with A. trichopoda being the earliest extant diverging lineage^69–71. A species tree depicting the relationships between the paleo-polyploid species was generated using TimeTree 5⁴⁹. Timings of WGMs were annotated on the tree using shinyWGD (https://github.com/li081766/shinyWGD).

The assembly and annotation of the haploid auto-octoploid Saccharum spontaneum AP85-441 (sugarcane) genome, which underwent a WGD less than 0.8 MYA⁶¹, was obtained from the site of the respective research group (https://www.life.illinois.edu/ming/downloads/Spontaneum_genome/)⁶⁰.

Pseudogene identification in genomes of paleo-polyploid species

First, repeats were identified in the genomes of the paleo-polyploid species and masked using RepeatMasker v4.1.1⁷². Next, putative pseudogenes were identified using the publicly available PseudoPipe pipeline⁴⁸. Briefly, PseudoPipe attempts to identify pseudogenes by searching for homologous sequences of functional genes in the non-coding part of the genome. The proteome of a species is searched against the repeat- and exon-masked genome of that species using TBLASTN⁷³ with an E-value cut-off of 1 × 10⁻⁴and further processed (see ⁴⁸for further details). Pseudogenes are classified as (1) “processed” if they lack introns and possess a poly-A tail, (2) “duplicated” if they comprise multiple exons and lack a poly-A tail, and (3) “fragmented” if they show sequence similarity to a gene but fail to meet the criteria for processed and duplicated pseudogenes, or fall below the thresholds of the other two classes. Both processed and duplicated pseudogenes must exhibit a minimum of 40% protein identity with the corresponding functional paralogous protein and an E-value lower than 1 × 10^-10. Moreover, processed pseudogenes are specifically required to align more than 70% with the functional protein. After running PseudoPipe, some further post-processing was performed. Putative pseudogenes with more than 30 bps overlap with an exon were removed and pseudogenes were filtered based on some minimal criteria: they must align with at least 5% of the functional protein, have a minimum of 20% protein identity, and an E-value smaller or equal to 1 × 10^-5. Furthermore, in cases where pseudogenes exhibited an overlap of more than 30 bps, only the best pseudogene hit was preserved, determined on the following order of priority: E-value, alignment coverage, and protein percent identity. Subsequently, i-ADHoRe v3.0⁷⁴ was executed to identify regions within the genome with similar gene content and order, i.e., collinear segments. As input for i-ADHoRe, we extracted gene-pseudogene pairs from the duplicated and fragmented classes of PseudoPipe, filtering out intronic pseudogenes as these do not likely originate from a WGM. In addition, gene pairs were extracted from an all-vs-all BLASTP run, whereby following criteria were met: a maximum E-value of 1 × 10^-5, an alignment coverage of at least 30%, and a difference in length no more than the length of the smallest protein. Following parameters were used for i-ADHoRe: (1) A minimum number of 3 anchorpairs, i.e. pairs of putative homologous (pseudo)genes in a collinear segment, (2) a maximum gap size that should exist between two points of 35, (3) a maximum cluster gap size of 40, (4) maximum number of gaps in alignment of 40, (5) a probability cut-off of 0.01, and (6) a q-value cut-off of 0.75. Pseudogenes that form homologous pairs with its functional paralogous gene within collinear segments were considered to be WGM-derived. Collinear segments were visually inspected using GenoPlotR⁷⁵.

Fitting single- and double-exponential function to WGM-derived numbers

To explore the evolution through time of the number of WGM-derived pseudogenes and gene pairs, these quantities were graphed against the timing of the most recent WGM, utilizing both absolute (as million years ago) and relative (as K_s) timing of the most recent WGM in each species. Subsequently, we fitted a single- and double-exponential function to these datasets using the lmfit library⁷⁶ in Python. The single-exponential function corresponds to the simple passive or random loss model whereby (pseudo)genes are lost randomly. The double-exponential function approximates the two-phase model described by Inoue et al.⁵². This model comprises an initial rapid phase, characterized by the simultaneous loss of multiple genes, followed by a second phase with a lower rate of loss. Regression fits were evaluated and compared through R-squared, Akaike information criterion (AIC), Bayesian information criterion (BIC) and root-mean-square error (RMSE) statistics.

Simulation of pseudogene evolution

To understand the expected evolution of genes without selective constraints, we conducted simulations. These simulations represent the situation where redundant genes are immediately released from selective constraints and can freely accumulate mutations at the spontaneous substitution rate after a WGM. These simulations were performed using the phylogenetic sequence simulator AliSim of IQ-TREE v2.2.2.6⁷⁷. For this, we used the spontaneous substitution rate of 7 × 10^-9 substitutions per site per generation of 1 year and the rate of 1- to 3-bp deletions as 0.6 × 10⁻⁹ per site per year and the rate of 1- to 3-bp insertions as 0.3 × 10⁻⁹ per site per year estimated for Arabidopsis thaliana⁵⁵. In addition, we explored multiples of this substitution rate (× 2, × 10, / 10), along with the estimated spontaneous substitution rate of Populus trichocarpa as 1.33 × 10^− 10per site per year⁵⁷. Employing 1 × the spontaneous mutation rate represents the case whereby a pseudogene evolves neutrally, while the functional gene remains identical to the ancestral gene. As the functional gene will also accumulate mutations, albeit under selection, employing 1 × the spontaneous mutation rate provides a lower boundary of expected divergence. Employing 2 × the spontaneous mutation rate leads to the pseudogene evolving twice as fast from the ancestral sequence compared to the substitution rate. This is equivalent to having two sequences that evolve at the spontaneous mutation rate, and thus both genes evolving as neutral pseudogenes. Consequently, this serves as an upper boundary of expected divergence. The evolution of all current CDS sequences of A. thaliana were simulated for 100 million years in steps of 1 million year. Each step consisted of a simulation along a simple two-branch tree, using a GTR model, taking into account the %GC of A. thaliana (36%) and following a Lavalette distribution (with parameter a = 0.5 and max = 3) for both insertion and deletion size. The used branch length represents the expected amount of substitutions per site. For example, a branch length of 0.007 was used to represent the spontaneous substitutions per site per million year for A. thaliana. For each step the obtained simulated sequence of the previous simulation was used as input. After each simulation step, the simulated sequence was aligned to the original translated CDS sequence using tfasty from the FASTA v36.3.8d suite⁷⁸. This is the aligner that was also used within PseudoPipe for re-alignment. By using the same aligner, we ensure that the observed differences between reality and simulation are not due to a different alignment method. For the obtained alignment, summary statistics were calculated: percent identity between the evolved and ancestral protein, percentage of the ancestral functional protein that is aligned by the evolved protein, E-value of the alignment between the evolved and ancestral protein, number of frameshifts in the simulated DNA sequence, and the number of stop codons in the simulated DNA sequence. Furthermore, we applied the cut-offs that were used for paleo-polyploids (see above) to infer the detectability of pseudogenes over time.

Pseudogene identification in the neo-polyploid sugarcane genome

For the neo-polyploid sugarcane genome, first collinear segments were inferred between sub-genomes using i-ADHoRe v3.0⁷⁴, using gene pairs extracted from an all-vs-all BLASTP run (E-value < 1 × 10^− 10, alignment coverage ≥ 30%, and protein percent identity ≥ 40%). We utilized the same parameters for i-ADHoRe v3.0 as for the paleo-polyploids (see above), except that only collinear segments of level 2 were inferred. Furthermore, only collinear segments from two distinct sub-genomes located on the same chromosomes were preserved. A riparian plot was generated to show the syntenic relationships between sub-genomes using GENESPACE⁷⁹. Subsequently, we identified genes within these collinear segments that had no paralogous gene within these collinear segments, and inferred the possible fates of their missing paralogs. First, we searched for paralogous genes outside the collinear segment on the sub-genome where the paralog was absent. For this purpose, genes with at least 60% protein percent identity and at least 70% alignment coverage were considered paralogs. In this case, we assume that the gene or its paralog has translocated in or out of the collinear segment after WGM. Second, for the remaining genes that had no paralog within a collinear segment, we searched for pseudogenes within the relevant region of that segment. This was done by searching for similar sequences in the non-coding part of the corresponding region of the collinear segment using exonerate⁸⁰ with the “protein2genome” model and a maximum intron size of 6000. To do this, genic regions in the collinear segment were masked using “maskfasta” of the BEDtools suite⁸¹. Hits that match the same gene were merged if they were within 6000 bp, and only the hit with the longest length was retained. Next, the inferred hits were re-aligned with the corresponding gene using exonerate with the “affine:local” model. Hits with at least 20% protein percent identity and at least 20% alignment coverage were retained. Finally, MACSE v2.06⁸² was used to align each pseudogene hit to its functional paralogous gene. From these alignments summary statistics were calculated: number of frameshift mutations, number of premature stop codons, the absence of a start or stop codon, alignment coverage, and protein percent identity.

Link between pseudogenization and recombination rate

To explore the relationship between the distributions of pseudogenes and recombination rate across the genome, we obtained genetic maps and their corresponding physical maps for A. thaliana⁸³, G. max⁸⁴, O. sativa⁸⁵ and Z. mays⁸⁶. Subsequently, we estimated local recombination rates using the MareyMap R package⁸⁷. A similar approach was taken as Brazier and Glémin⁶⁵. Hereby, a two-degree polynomial loess regression was fitted to the marey map (i.e., map showing the relationship between genetic and physical distances) to obtain recombination rate estimates, expressed in centimorgan (cM) per megabase (Mb). Hereby, the span was set to 0.2 and negative estimates were set to 0. Subsequently, the average recombination rate of the markers and number of pseudogenes were calculated per bin of 500 kb across the genome. Using these binned estimates, we evaluated the correlation between pseudogene number and recombination rate.

Calculation of evolutionary distances: K_a, K_s and K_a/K_s

To get insight into pseudogene evolution, we calculated the number of non-synonymous substitutions per non-synonymous site (K_a), the number of synonymous substitutions per synonymous site (K_s) and the ratio between the two (K_a/K_s) for all pseudogenes. To do this, pseudogene nucleotide sequences were aligned to their functional paralogous gene using MACSE v2.06⁸². MACSE can accommodate pseudogenes by recognizing them as less reliable sequences, assigning them a reduced penalty for frameshifts and stop codons in comparison to functional genes. Next, the alignments were converted to AXT format, and evolutionary distances were calculated using KaKs_Calculator 2.0⁸⁸ through model averaging (MA). Moreover, K_a, K_s and K_a/K_s were also calculated for WGM-derived gene pairs, i.e., anchor pairs.

Link between pseudogenes and lncRNAs

To investigate the relation between pseudogenes and long non-coding RNAs (lncRNAs), we assessed whether there was overlap between our inferred pseudogenes and lncRNAs in the genome of Arabidopsis thaliana and Solanum lycopersicum using “intersect” of the BEDtools suite⁸¹ within Python with the pybedtools library⁸⁹. For an overlap to be acknowledged, a minimum of 50% of the pseudogene and lncRNA needed to be contained in the overlap. The coordinates of the lncRNAs were obtained from the PLncDB database v2.0⁶⁶. To evaluate whether the observed number of overlaps was significantly higher than expected by chance, we generated an empirical null distribution by randomly rearranging lncRNAs on the genome using the “shuffle” function of the pybedtools library⁸⁹, and determining the number of overlapping pseudogenes with these randomly shuffled lncRNAs. This process was carried out for 1,000 iterations. Subsequently, P-values were computed for the observed counts by determining the fraction of iterations that exhibit an equal or more extreme value compared to our observed number. The assessment of overlap was conducted for all pseudogenes, as well as separately for each class of pseudogenes.

Functional assessment of (pseudo)genes

To assess whether groups of (pseudo)genes were enriched in specific functional categories, Gene Ontology (GO) enrichment analyses were performed. GO annotations were obtained from PLAZA v5.0⁶⁷ for the paleo-polyploid species. For the sugarcane genome, GO terms were obtained from PLAZA for a different sugarcane genome and linked to the neo-polyploid sugarcane genome by identifying the best BLASTP hit for each gene, ensuring an E-value below 1 × 10^-5. GO enrichment analyses were performed using the clusterProfiler v4.0 R package⁹⁰. Hereby, all GO-annotated genes of that analysis were used as background. Pseudogenes were assumed to have the same functional category as their functional paralogous genes. GO enrichment analyses were performed for WGM-derived pseudogenes (1) in each species, (2) in all paleo-polyploid species combined, and (3) further stratified based on species that underwent its most recent WGM less than 35 MYA and those that underwent its most recent WGM more than 35 MYA. In addition, for the neo-polyploid sugarcane genome, GO enrichment analyses were performed for pseudogenes and for genes that were retained at different levels on the sub-genomes, i.e., retained on all four, three, two or one sub-genome(s). Moreover, for the neo-polyploid genome it was inferred whether genes were part of protein-protein interaction (PPI) networks (both functional associations and physical links) by annotating the proteomes of the neo-polyploid species using STRING 12.0⁹¹. Next, it was assessed whether genes that were retained at different levels were enriched in such PPI genes by performing hypergeometrical tests.

Acknowledgements

This work was supported by the European Research Council under the European Union’s Horizon 2020 Research and Innovation program (No. 833522) and from Ghent University (Methusalem funding, BOF.MET.2021.0005.01) (to Y.V.d.P.).

Author contributions

E.C., Z.L. and Y.V.d.P. conceived and managed the project. E.C. conducted the genomic analyses of this study. E.C. and Z.L. wrote the manuscript with support of Y.V.d.P. All authors read and approved the manuscript.

Code availability

The code used in this study is available on GitHub at: https://github.com/ewoutcrombez/PseudogeneProject_scripts.