Identification of pseudogenes in paleo-polyploid species
We investigated the extent of pseudogenization in twelve paleo-polyploid plant species that underwent WGMs at different times in their evolutionary past (Figure 1A and Supplementary Table 1). The PseudoPipe pipeline48 was utilized to identify putative pseudogenes. Pseudogenes were identified as genomic sequences that reside in the non-coding regions of a genome, encompassing all areas outside the coding exons, and have sequence similarity with a functional gene. In total, we obtain a highly variable total number of putative pseudogenes for different species, ranging from 6,989 in Arabidopsis thaliana to 236,487 in Zea mays (Supplementary Figure 1 and Supplementary Data 1). These putative pseudogenes were classified into three classes in accordance to PseudoPipe: (1) “retro-transposed (or processed) pseudogenes” that originated by retro-transposition of a reverse transcribed mRNA, (2) “duplicated pseudogenes” that originated from the duplication of a gene, and (3) “fragmented pseudogenes”, encompassing pseudogenes characterized by a highly fragmented structure, making it challenging to attribute them to a specific origin (Materials and Methods). In addition to these generic classes, we further defined a class of “WGM-derived pseudogenes”, encompassing our focal interest, for pseudogenes initially classified as “duplicated” or “fragmented” that were located in intergenic regions. This classification is based on their presence in collinear “homoeologous” (i.e., derived from a WGD or hybridization event) segments, which are defined as two or more regions within a genome containing several paralogous (pseudo)genes that largely maintain their order and orientation. As such, a “duplicated” or “fragmented” pseudogene that forms a homologous pair with a functional paralogous gene, is assumed to be “WGM-derived”. Duplicated pseudogenes that are not WGM-derived are then considered to be derived from small-scale duplication (SSD). The WGM-derived pseudogenes are consistently the smallest class, ranging from 61 in Sorghum bicolor to 1,761 in Glycine max (Figure 1B). Most WGM-derived pseudogenes came from the fragmented rather than the duplicated class, suggesting that pseudogenes of this class generally also have a fragmented structure. We compared our results with previous studies, revealing variability in the identification of pseudogenes that can be attributed to the absence of a precise definition for pseudogenes (Supplementary Text 1).
The total number of pseudogenes exhibit a moderately significant correlation with genome size (Spearman’s ρ = 0.57, p-value = 0.003) and the number of retro-transposons in the genome (Spearman’s ρ = 0.59, p-value = 0.02) (Supplementary Table 2 for more correlations between pseudogenes and genomic features). In contrast, the total number of pseudogenes does not significantly correlate with the timing of the most recent WGM in a species (Spearman’s ρ = -0.43, p-value = 0.20). Moreover, the number of fragmented pseudogenes also correlates significantly with the number of retrotransposons in the genome (Spearman’s ρ = 0.77, p-value = 0.03). Together with the large number of retro-transposed pseudogenes, which was very often larger than the number of duplicated (SSD- and WGM-derived) pseudogenes (Supplementary Figure 1), the significant correlations suggest that a large fraction of the pseudogenes originated from retro-transposition events. Surprisingly, the number of retro-transposed pseudogenes itself does not correlate with the number of retrotransposons in the genome (Spearman’s ρ = 0.36, p-value = 0.18). A potential explanation for the lack of correlation may be that retro-transposed pseudogenes were identified using stringent cut-off criteria, resulting in a large number of them being categorized as fragmented pseudogenes.
In all species analyzed, fragmented pseudogenes formed the most abundant class, suggesting that a large proportion of pseudogenes has experienced extensive fragmentation. This may be the result of recombinational processes that lead to rearrangements and deletions in the genome. Fragmentation is also reflected in the alignment coverage of pseudogenes against their functional paralog, which is consistently lower for WGM-derived pseudogenes than the alignment coverage observed between pairs of paralogous genes within collinear segments, i.e., WGM-derived duplicate genes or anchor pairs (Figure 1C, Supplementary Text 2, and Supplementary Figure 2). The protein percent identity between WGM-derived pseudogenes and their functional paralogous gene is relatively high when compared to the percent identity between WGM-derived duplicate genes, especially for species with an ancient WGM (Figure 1C, Supplementary Text 2, and Supplementary Figure 2). However, only looking at protein percent identity gives an incomplete picture as it overlooks the alignment coverage. Multiplying the protein percent identity by the alignment coverage provides a metric for the protein percent identity including non-aligned parts, i.e., global protein percent identity. This combined metric tends to be higher for WGM-derived duplicate genes (Supplementary Figure 3).
The number of WGM-derived pseudogenes is strongly negatively correlated with the timing of the most recent WGM (Spearman’s ρ = -0.79, p-value = 0.01, Figure 1F). The more recent the WGM event, the more WGM-derived pseudogenes are identified, suggesting that pseudogenes are lost over time. In concordance, we find significant enrichments of various meiosis and RNA/DNA-related Gene Ontology (GO) terms in paleo-polyploid species with a WGM less than 35 MYA, while for paleo-polyploid species with a WGM more than 35 MYA, we observed no significant enrichments (Supplementary Text 3 and Supplementary Table 4). We would indeed expect that WGM-derived pseudogenes are enriched in functional categories associated with meiosis and RNA/DNA-related processes, as these are underrepresented in retained duplicate genes21,22,51. The lack of enrichment observed in ancient paleo-polyploids (WGM > 35 MYA) suggests that these pseudogenes have disappeared over time, making it impossible to identify such enrichments.
The number of WGM-derived pseudogenes is also significantly positively correlated with the total number of genes in the genome (Spearman’s ρ = 0.83, p-value = 0.001), the number of WGM-derived duplicate genes (Spearman’s ρ = 0.87, p-value = 0.00014, Figure 1D), and the number of collinear segments (Spearman’s ρ = 0.93, p-value = 0.00013, Figure 1E). These strong correlations are likely a reflection of the method that was used to identify WGM-derived pseudogenes. The number of WGM-derived pseudogenes could be impacted by several factors, a concern that we discuss in the following sections.
Why so few WGM-derived pseudogenes?
Considering the rediploidization histories of the species and introducing a null hypothesis that pseudogenization and DNA deletion contribute equally to gene loss after WGM, we would expect to observe a comparable number of WGM-derived pseudogenes as DNA deletions to reflect this balanced contribution. However, the empirical data show that although the total number of identified pseudogenes is large, often surpassing the total number of functional genes within the given species, the number of WGM-derived pseudogenes is relatively limited compared to the total number of lost genes after WGM (Supplementary Figure 1). For example, in A. thaliana, which underwent its most recent WGM about 50 MYA, only about 21% of the total number of genes are retained as gene anchor pairs, suggesting that about 79% were lost or translocated (Supplementary Table 3). However, we only find 91 WGM-derived pseudogenes in A. thaliana, suggesting that the large majority of duplicate genes were lost through DNA deletion. Species with a more recent WGM, such as G. max, which underwent its most recent WGM about 13.6 MYA, tend to have more WGM-derived pseudogenes (1,761 pseudogenes), but the number is still modest compared to the gene number (52,872 genes) and the 48% of WGM-derived duplicate genes that were lost or translocated after the WGM.
We further fitted two competing models on gene loss over time to gain deeper insight into the gene loss processes after WGMs52. Hereby, we plotted the number of WGM-derived pseudogenes and gene pairs in the paleo-polyploids against the corresponding timing of the most recent WGM. For the two models of (pseudo)gene loss we consider, each represents a distinct dynamic of loss over time. The first model consists of one phase with one rate whereby (pseudo)genes get lost passively and independently, characterized by a single-exponential decay curve. The second model consists of two phases with two distinct rates and accommodates an initial phase whereby a group of genes may get lost simultaneously. The two-phase model can be approximated by a double-exponential decay curve, which decays more rapidly at the start and subsequently slows down52. Therefore, the initial phase of fast gene loss may account for the genome-wide redundancy that is expected after a WGM. Interestingly, fitting a double-exponential curve to the WGM-derived pseudogene counts against the timing of the most recent WGM consistently reverted to a single-exponential curve with an r2 value of 0.65, corresponding to the one-phase model. In contrast, for the WGM-derived gene pair data, the double-exponential function provided a better fit than the single-exponential function with r2 values of 0.8 and 0.56, respectively (Figure 1F). In addition, the Akaike information criterion (AIC) and Bayesian information criterion (BIC) measures also favors the two-phase model over the one-phase model for the WGM-derived gene pairs, while the one-phase model is favored for the WGM-derived pseudogene counts (Supplementary Table 5 and Supplementary Table 6). Similar findings are uncovered when using the timing of WGMs represented by the peaks observed in distributions of synonymous substitutions per synonymous site (Ks) for all duplicate genes from Vanneste et al.50 (Supplementary Figure 4, Supplementary Table 5, and Supplementary Table 6).
While merely fitting a curve does not guarantee the correctness of the underlying assumed model, the notable differences in fits between WGM-derived pseudogenes and WGM-derived gene pairs are intriguing. If the fits reflect the essence of the underlying models on gene loss processes, our findings indicate that genes undergo an initial rapid loss phase, involving a rapid loss of genes right after WGM events, followed by a phase of gene loss with a slower rate. The pseudogenes, in contrast, show a consistent loss rate, suggesting it could not solely contribute to the rapid gene loss after WGM events. Although both models do not consider the mechanisms by which this simultaneous loss happens, our results here would suggest that DNA deletion dominate the gene loss processes after WGMs over pseudogenization. These observations may indicate that pseudogenization only has limited importance for gene loss after WGMs compared to DNA deletion in angiosperms as suggested before26,27. However, before jumping to this conclusion, we further explore whether the life span of pseudogenes and the degradation of collinear segments may lead to potential biases and complicate the interpretation of WGM-derived pseudogene numbers.
Simulating the evolution of neutrally evolving pseudogenes indicates a longer expected identifiability span than observed in paleo-polyploids
A first potential bias which may have contributed to the low number of WGM-derived pseudogenes in the paleo-polyploids studied is that a pseudogene may rapidly accumulate mutations, rendering it undetectable due to its low sequence similarity with its functional paralogous gene, i.e., falling below the cut-offs of pseudogene identification. To interrogate this potential bias, we adopted the classical gene duplication model whereby gene duplicates are under relaxed functional constraint and free to accumulate mutations13,17,53. Further, we assume that after obtaining a deleterious mutation, e.g., a premature stop codon or frameshift mutation, it becomes a pseudogene and continues to evolve without selective constraint54. As such, we would expect that the more ancient a WGM, the more difficult to identify WGM-derived pseudogenes. Based on the above assumptions, we simulated the sequence evolution of one copy of the duplicate genes without any selective constraint to provide insight in the timeframe by which pseudogenes may be detectable. To implement the simulation, we retrieved all coding sequences (CDS) of A. thaliana and simulated a WGM event by generating duplicates with identical sequences. Next, we simulated the evolution of these duplicates over at least 100 million years using a spontaneous substitution rate of 7 × 10-9 substitutions per site per year and 1- to 3-bp insertion and deletion rates of 0.3 × 10−9 and 0.6 × 10−9 per site per year, respectively55. Per one million years, we translated and aligned both the evolved and ancestral sequences to record alignment statistics, most notably protein percent identity, alignment coverage, and E-value of the pairwise alignment. Importantly, for pseudogene identification in the paleo-polyploid species, we did not compare the pseudogene to its ancestral gene, but rather to its functional paralog, which also evolved, albeit under various selection pressures. To address this in the simulations, we additionally ran simulations using twice the spontaneous substitution and indel rates, which is then equivalent to evolving both duplicate genes neutrally. As such, we can consider the simulation following 1 × the spontaneous mutation rates function as a lower boundary of divergence between a pseudogene and its functional paralog, while the simulation following 2 × the spontaneous mutation rates functions as an upper boundary. We expect that the actual sequence divergence between a pseudogene and its functional paralog is somewhere in between these two boundaries (Materials and Methods).
By tracking the protein percent identity and coverage of the simulated protein relative to its ancestral form over time, our simulation reveals alignment plateaus for simulations with both mutation rates. For protein percent identity, the plateau is reached at a median of around 30% identity after between 40 and 80 MY for the simulations using 1 × and 2 × the spontaneous mutation rates, respectively (Figure 2A). Similarly, alignment coverage also reaches a plateau at a median of about 23% after 120 MY and 60 MY using respectively 1 × and 2 × the spontaneous mutation rates (Figure 2B and Supplementary Figure 5). In addition, the variation of both alignment statistics increases through time. By further exploring the “plateaus” in these alignment statistics, we found that they can be attributed to the pairwise alignment algorithm, which is inherently designed to seek similarities between two sequences to align most of the sequences. This, hence, underscores the importance of alignment significance, most often expressed as E-value, i.e., the number of expected alignments that would score at least as high as the observed alignment just by random chance56. As expected, the E-value increases over time, indicating that the obtained alignments become less significant over time (Supplementary Figure 6). Next, we employed the thresholds utilized for duplicated (protein percent identity ≥ 40%, alignment coverage ≥ 5% and E-value < 10-5) and fragmented pseudogenes (protein percent identity ≥ 20%, alignment coverage ≥ 5% and E-value < 10-5) for the paleo-polyploids to track the fraction of unidentifiable simulated pseudogenes over time. Adhering to the criteria established for fragmented pseudogenes, half of the pseudogenes should remain detectable on average 55 MY, ranging between 37 and 73 MY using the upper and lower boundaries, respectively (Figure 2C). Similarly, when applying the more stringent criteria of duplicated pseudogenes, half of the simulated pseudogenes are expected to remain detectable on average 42 MY, ranging between 28 and 55 MY (Figure 2D).
Therefore, considering that approximately 79% of the duplicate genes are lost after the most recent WGM timing of A. thaliana (50 MYA) (Supplementary Table 3), assuming that pseudogenization has contributed equally as deletion events (50%), and taking the average percentage of pseudogenes that are detectable in the simulations at the most recent WGM timing of A. thaliana (50% and 41% using the cut-offs of respectively the fragmented and duplicated pseudogenes), we expect to see a proportion between 16-20% of the number of pre-WGM genes that are detectable as pseudogenes. Moreover, considering that approximately 21% of the WGM-derived duplicate genes are retained, we would thus expect to see a proportion between 13-16% of the current number of genes that are detectable as pseudogenes, which is much larger than what we observe (91). Similarly, for Populus trichocarpa, considering that approximately 58% of the duplicate genes were lost after its most recent WGM (34.7 MYA) (Supplementary Table 3), the same null hypothesis and that nearly all pseudogenes should remain identifiable following its most recent WGM timing, as it has a much slower spontaneous mutation rates of 1.33 × 10− 10 per site per year57 than A. thaliana (Supplementary Figure 7), we would expect to see a proportion of about 20% of the current number of genes that are detectable as pseudogenes. This is again much higher than the number we observe (1,394). Moreover, these proportions are also much higher than the numbers found by Xie et al.36 and Mascagni et al.32. Although the above calculations are based on the strong assumption that all genes that have no collinear paralog have lost its paralog from the genome, these calculations give us a more robust indication that the number of identified WGM-derived pseudogenes is, indeed, lower than expected, and could not be attributed to the loss of detectability. In addition, following the spontaneous mutation rates of A. thaliana, almost all simulated WGM-derived pseudogenes are identifiable after 20 to 45 MY (Figure 2C and D), indicating that pseudogenes where a WGM occurred less than 20 MYA, e.g., G. max, M. domestica, and Z. mays, should be readily identifiable.
We acknowledge that two factors may affect our simulations. First, only using the spontaneous mutation rates of A. thaliana and P. trichocarpa may not accurately reflect reality for other species. Nevertheless, the two selected species are among the species with the highest and lowest spontaneous mutation rates among angiosperms, respectively58. The obtained results may thus represent a minimal and maximal expected identifiability time span for pseudogenes, with pseudogenes in other species experiencing an expected identifiability time span in between these two species. Second, assuming immediate loss of selective constraint after duplication is of course not realistic in our simulation, reflected by our observation that both WGM-derived pseudogenes from A. thaliana and P. trichocarpa tend to have higher protein percent identity than the simulated results but comparable alignment coverage. Indeed, our simulations using the spontaneous mutation rate of A. thaliana revealed that frameshifts and premature stop codons could rapidly accumulate within the first 5 MY, indicating that pseudogenization can happen easily in the absence of selection (Supplementary Figure 8 and Supplementary Figure 9). In reality, before the acquisition of a deleterious mutation, the duplicated genes may initially be still under some selective constraint, resulting in a greater protein percent identity than anticipated under a scenario of complete neutral evolution. That being said, a delayed process of pseudogenization by selection constraints would only prolong the detectability of pseudogenes hence leading to a higher number of pseudogenes than our simulation. As such, by considering both factors, we would expect much more WGM-derived pseudogenes than the ones observed in all the studied paleo-polyploids. Since the real data diverge from our simulation results, we must consider other explanations beyond the loss of detectability resulting from the rapid accumulation of mutations in neutrally evolving pseudogenes.
Gene loss in a neo-polyploid sugarcane genome
Another potential bias which may limit WGM-derived pseudogene identification is the ability to detect collinear segments. After WGM, intragenomic collinearity typically diminishes over time because of genome rearrangements and gene loss itself59. This decline of collinearity through time is supported by a significant negative correlation between the timing of the most recent WGM and both the number of collinear segments (Spearman’s ρ = -0.84, p-value = 0.022) and the number of WGM-derived gene pairs (Spearman’s ρ = -0.77, p-value = 0.033). As such, the more ancient a WGM, the more challenging it becomes to find collinear segments, and subsequently to identify WGM-derived pseudogenes therein. Moreover, the number of WGM-derived pseudogenes itself significantly correlates with both the number of collinear segments and number of anchor pairs (Figure 1D and E), indicating that the ability to detect collinear segments surely influences the identification of WGM-derived pseudogenes. Nonetheless, for species that underwent a WGM very recently and are still polyploid, i.e., neo-polyploids, we would expect to observe extensive collinearity. As such, if the loss of collinearity is the main reason for the low number of WGM-derived pseudogenes and gene loss equally occurs through pseudogenization and DNA deletion, we would expect to identify both loss categories comparably in neo-polyploid species.
To this end, we investigated the auto-octoploid Saccharum spontaneum ‘AP85-441’ (sugarcane) genome, which has been sequenced to the sub-genome level, enabling direct comparisons between its four sub-genomes60. S. spontaneum underwent two rounds of WGDs less than 0.8 MYA61, resulting in four sub-genomes, referred to as the A, B, C and D sub-genomes. Overall, the collinearity in sugarcane is relatively well maintained between sub-genomes, with only a few inversions and deletions (Supplementary Figure 10). In total, 14,929 (18%), 19,980 (24%), and 28,913 (35%) of the 82,773 genes had paralogs in two, three, and four sub-genomes, leaving 18,951 (23%) genes that were only found in one sub-genome (Figure 3A). Of these genes that were present on only one sub-genome, 1,372 were never found in a collinear segment. The above numbers clearly indicate that extensive gene loss has already been going on in this sugarcane genome, as also indicated by Zhang et al.60. Moreover, this sugarcane genome is not an exception in gene loss dynamics as studies on other neo-polyploid genomes report similar decreases in copy number of WGM-derived paralogs, such as in an autotetraploid potato genome62 and an autotetraploid Cyclocarya paliurus genome63.
Interestingly, different enriched GO terms were found for different levels of paralog retention. For genes that had paralogs in all four sub-genomes, we find enriched GO terms related to metabolic processes and responses. For genes that had paralogs in three sub-genomes, we find significantly enrichment in GO terms related to meiosis. For genes that had paralogs in two sub-genomes, we find significant enrichment in GO terms related to RNA/DNA modification. Finally, for genes that had a copy in only one sub-genome, we find significantly enriched GO terms related to transposition and ribonuclease activity (Supplementary Data 2 and Supplementary Text 4). In addition, genes that encode for proteins that form interactions with other proteins are significantly enriched in the genes that were retained in all sub-genomes or in three of the four sub-genomes (hypergeometric test p-value = 0), while this was not the case for genes that were retained in one or two of the sub-genomes (hypergeometric test p-value > 0.05). Interactions include both direct physical interactions, such as those within multi-protein complexes, and indirect interactions, like those occurring together in metabolic pathways. This observation aligns with the dosage balance hypothesis, which proposes a selective constraint against disturbing the stoichiometric equilibrium of multi-protein complexes20,21.
To investigate pairwise sub-genome level gene loss, a new strategy was adopted for identifying pseudogenes, involving pairwise comparisons between the four sub-genomes of sugarcane. Hereby, rather than first identifying pseudogenes and subsequently collinear segments, we made use of the extensive collinear segments between two sub-genomes to determine genes for which WGM-derived paralogs were missing (i.e., genes that have not retained paralogs on all four sub-genomes) and identify pseudogenes for these missing paralogs (Materials and Methods). When comparing any two sub-genomes, around 90% of the genes from the two sub-genomes are located in collinear segments, with around 50% forming an anchor pair with a gene on the other sub-genome (Figure 3B). Hence, the remaining 40% of the genes that lack a paralogous gene within these segments, may signify two evolutionary scenarios: 1) the gene or its paralog is translocated; or 2) the gene lost its paralog, either through pseudogenization or DNA deletion (Supplementary Figure 11). If a translocation occurred, the paralog should still be present in the genome, but no longer form a paralogous pair within a collinear segment. Between any two sub-genomes, on average, we find that 6.4% of the genes had a paralog outside of a collinear segment, suggesting translocation. Subsequently, in cases where no paralog was found outside the collinear segment, pseudogenes were sought within the corresponding region of the collinear segment containing the gene. Between any two sub-genomes, on average, we find that 11% of the genes had a paralogous pseudogene within the collinear segment. Finally, between any two sub-genomes, on average, we find that 21.2% of the genes were present in collinear segments, but had no paralogous pseudogene or gene counterpart anywhere in the genomes, indicating DNA deletion events have occurred in the sub-genomes. To further corroborate that these genes lost their counterparts via DNA deletions, we checked if their neighboring paralogs also tend to get lost, as DNA deletion may not necessarily only affect one gene, but may affect a group of neighboring genes. Our results show that genes without counterparts are clustered together with an average 1.33 adjacent genes without counterparts indicating these genes are lost together likely through DNA deletions. In contrast, for randomly selected genes an average of only 0.57 adjacent genes without counterparts is found (Supplementary Figure 12).
Together, although the degradation of collinear segments and the lifespan of pseudogenes influence pseudogene identification, our findings demonstrate that DNA deletion, rather than pseudogenization, primarily drives gene loss following WGMs in both paleo- and neo-polyploids. Furthermore, while some pseudogenes are eventually eliminated through DNA deletion, those that persist may provide evolutionary spandrel that contribute to novel biological functions34,39,40,43.
Pseudogenes are mostly present in regions of low recombination
It is expected that there is a negative correlation between recombination rate and total pseudogene number across the genome, as (pseudo)genes would be more likely removed from the genome by recombination in regions of high recombination64. Indeed, a significant negative correlation between the distribution of pseudogenes and the recombination rate has been reported already for G. max36. To further strengthen this finding, we explored the correlation in G. max, A. thaliana, O. sativa, and Z. mays. For all species analyzed, except Z. mays, a significant negative correlation was observed between total number of pseudogenes and local recombination rate (Figure 4 and Supplementary Figure 13-Supplementary Figure 15). Z. mays exhibits a surprisingly uniform pseudogene distribution, which may be caused by the high abundance of retrotransposons in the maize genome. Indeed, we find a low positive Pearson correlation of 0.16 (p-value = 1.49 × 10-7) between local pseudogene number and retrotransposon pseudogene number (per window of 1 Mb), suggesting a link between retrotransposon and pseudogene abundance. A. thaliana, G. max and O. sativa have a significant Pearson correlation coefficient of respectively -0.35 (p-value = 4.18 × 10-8), -0.58 (p-value = 1.4 × 10-107) and -0.66 (p-value = 1.33 × 10-91) (Supplementary Table 9).
Importantly, when analyzing different pseudogene classes separately, i.e., WGM-derived, SSD-derived, retro-transposed, and fragmented pseudogenes, there is a consistent negative correlation between pseudogene number and recombination rate, except for the WGM-derived pseudogenes. In all species, except A. thaliana, a significant positive correlation with local recombination rate is detected for the WGM-derived pseudogenes, with Pearson correlation coefficients ranging from 0.13 to 0.45 for different species (Supplementary Table 9). These positive correlations with recombination rate further explain why we observe so few WGM-derived pseudogenes in comparison to other pseudogene classes, as they are predominantly found in regions characterized by high recombination rates, different from other types of pseudogenes. Also, considering that these WGM-derived pseudogenes are retained in collinear segments with functional genes, the density of which also has a positive correlation with recombination 65, our results further suggest that the remaining WGM-derived pseudogenes may be under stronger selections than other classes of pseudogenes.
Pseudogenes exhibit a non-neutral Ka/Ks ratio
To further explore the selection pressures on pseudogenes, we estimated the ratio of non-synonymous substitutions per non-synonymous site (Ka) to synonymous substitutions per synonymous site (Ks). Naively, we would expect the ratio of Ka/Ks to be equal to 1 for neutrally evolving pseudogenes, while functional genes are expected to have a Ka/Ks ratio lower than 1 because of purifying selection. Whereas pseudogenes do exhibit a higher Ka/Ks ratio than that of functional paralogous genes, they have Ka/Ks ratio less than 1 in general, in contrast to the anticipated pattern of being exempt from selection with a Ka/Ks ratio of 1 (Figure 5 and Supplementary Figure 16). Similar patterns have been observed for pseudogenes in e.g., rice and barley32,33,35, reflecting that the estimation of Ka/Ks ratios for pseudogenes and their functional paralogous genes may be affected by the accumulation of synonymous substitutions in the functional paralogous gene and by the period before pseudogenization in which the gene may be under purifying selection. This may be particularly true for pseudogenes originating from WGMs, as they are located in the same genomic area as the ancestral gene and, therefore, could potentially function as usual.
That being said, it is possible that at least some of the pseudogenes have adopted a novel function and are still under selective constraint. By definition, pseudogenes are expected to have no functionality because they lost their coding potential through detrimental mutations. However, non-functionality is not evaluated during the identification of pseudogenes, as it necessitates labor-intensive experimental verification. Several studies have demonstrated that some of the pseudogenes are still transcribed, with some of them playing significant functional roles34,39,40,43. For example, pseudogenes have been considered as an important source of long non-coding RNAs (lncRNAs), which govern regulatory functions36,44–46.
We hence assessed whether our pseudogene complement overlapped with the lncRNA complement in A. thaliana and S. lycopersicum, taken from the PLncDB database v2.066, considering an overlap when at least half of the pseudogene and lncRNA sequences coincided. As such, we find that 297 and 178 of the pseudogenes overlap with lncRNAs in A. thaliana and S. lycopersicum, respectively (Supplementary Table 10). An empirical null distribution was established by randomly shuffling lncRNA coordinates and assessing the overlap between our pseudogenes and the randomly shuffled lncRNAs (Materials and Methods). Hereby, we find that pseudogenes in both species had a significant overlap with a p-value of 0 and 0.049, respectively (Supplementary Figure 17). Moreover, after filtering the fragmented pseudogenes, we find an overlap between 106 pseudogenes and lncRNAs in S. lycopersicum with a p-value of 0 (Supplementary Figure 18), whereas the fragmented pseudogenes overlap with 72 lncRNAs, which is not significant for S. lycopersicum (p-value = 0.99, Supplementary Figure 19). Despite the relatively low overlap between WGM-derived pseudogenes and lncRNAs, with twelve and two overlaps in A. thaliana and S. lycopersicum, respectively, this is statistically significant when compared to an empirical null distribution. This significance is attributed to the limited number of WGM-derived pseudogenes in the two particular species (Supplementary Figure 20). For SSD-derived pseudogenes, there are 57 and 75 pseudogenes having a significant overlap with lncRNAs in A. thaliana and S. lycopersicum, respectively (Supplementary Figure 21). In contrast, retro-transposed pseudogenes do not show a significant overlap with lncRNAs in both species (Supplementary Figure 22). Together these results suggest that pseudogenes, in particular the ones originating from (both SSD and WGM) duplications, may lead to the origin of lncRNAs which may govern a regulatory function.