Low genomic variations in S. polyrhiza
We sequenced the genomes of 131 globally distributed S. polyrhiza genotypes with an average of ~ 25 X coverage. Together with previously published samples1,2, we analysed the genomic diversity of 228 S. polyrhiza individuals across five continents (Supplementary Data 1). We identified 1,241,981 high-quality biallelic single-nucleotide polymorphisms (SNPs) and 166,075 short insertions and deletions (INDELs, less than 50 bp of length). Based on an updated genome annotation of S. polyrhiza (see Section 1.1 of the Supplementary Information), we found that most of the SNPs (70.3%) are in the intergenic regions (Supplementary Fig. 1). Of all the SNPs located in the protein-coding regions, 61,039 were identified as nonsynonymous and 44,287 as synonymous (Supplementary Data 2). Consistent with our previous study, the genome-wide nucleotide diversity at neutral sites is 0.0013 (Supplementary Table 1), the lowest among multicellular eukaryotes. The species-wide efficacy of selection (πN/πS ratio) is 0.39, the highest among studied organisms43, indicating relatively weak purifying selection in S. polyrhiza, despite its large effective population size1,2.
In addition to SNPs and small INDELs, we also characterize the genome-wide structural variations (SVs, ≥ 50 bp in length) in S. polyrhiza (see Section 1.2 of the Supplementary Information). We identified 3,205 high-quality SVs, including 2,089 deletions, 291 duplications, and 825 insertions. The abundance of the SVs is substantially lower than all reported multicellular organisms44,45. Among all identified SVs, 155 duplications and 169 deletions affected protein-coding genes (Supplementary Table 2). Interestingly, in comparison to the genome-wide average, the MADS-box genes (see Section 1.3 of the Supplementary Information) showed a higher tendency to overlap with SVs (permutation tests; P = 0.063). In addition, we also observed a significant overrepresentation of small INDELs in the MADS-box genes (permutation tests; P = 0.008). These results indicate that MADS-box genes might be under positive or relaxed purifying selection in S. polyrhiza.
Population structure and demographic history of S. polyrhiza
Because S. polyrhiza is facultatively asexual, genotypes collected from the geographic proximity can be derived from the same clonal family. Using a previously established grouping threshold that was developed in S. polyrhiza2, we identified 159 likely clonal families (Supplementary Data 3).
Population structure and principal component analyses revealed four populations in S. polyrhiza (Fig. 1a and 1b). Consistent with our previous study, the four populations are largely concordant with their geographic origins, namely America, South East Asia (SE-Asia), Europe, and India (Supplementary Fig. 2), with a few exceptions that can be due to recent migration events or artefacts during long-term duckweed maintenance.
We inferred the population history with a Maximum Likelihood (ML) phylogeny combined with Approximate Bayesian Computation (ABC). Using Colocasia esculenta (from the Araceae family) as an outgroup, the maximum likelihood phylogeny of all 228 genotypes indicated an early split of the American population from the other populations and subsequent splits of the Indian and European populations from SE-Asia (Fig. 1c). The European population constitutes the most recent split (Fig. 1c-d). Here, genotypes collected from the transcontinental region (e.g. Russia) showed intermediate features of SE-Asian and European populations, suggesting this as a likely migration route. Furthermore, the Indian population possibly originated via Thailand and Vietnam, as genotypes from these countries show intermediate features between Indian and SE-Asian populations.
We modelled the demographic history using an ABC modelling approach to further validate the evolutionary history of the four populations in S. polyrhiza. Three plausible demographic scenarios were simulated, allowing for either the SE-Asian, American or an additional putative population to function as the ancestral population (Supplementary Fig. 3). We found that the scenario, in which the American population and Asian population were derived from an additional putative ancestral population, constituted the most supported model (Fig. 1d). While the American population was separated from other populations around 1 million generations ago, the European population was derived from the SE-Asian population only 12,000 generations ago (see Section 1.4 of the Supplementary Information).
Determinants of genomic diversity among populations
Among the four populations, nucleotide diversity (π) and the efficacy of selection (πN/πS ratio) varied among populations (Fig. 2b). While the SE-Asian population has the highest π and lowest πN/πS ratio, the American population has the lowest π and highest πN/πS ratio. Interestingly, while the European population has a much smaller π compared to the SE-Asian population, the πN/πS ratio of the European population remains similar to the latter, likely due to its recent split from the SE-Asian population.
Using genome-wide SNPs, we found that linkage disequilibrium (LD) is comparable to Arabidopsis thaliana46, suggesting considerable historical sexual reproduction in S. polyrhiza. However, the extent of LD decay varied substantially among populations (Fig. 2b and Supplementary Fig. 4). While the Asian population showed the most rapid LD decay (about 12 kb at r2 = 0.2), the European population had very long LD blocks (> 100 kb). The Indian and American populations had intermediate LD decay. This pattern is further supported by the recombination rates calculated for all four populations, where the Asian population had the highest recombination rate compared to the other three (Fig. 2b). Different LDs and recombination rates found among populations indicate that the frequencies of sexual reproduction varied among populations. In addition, we found that the variations of heterozygosity in S. polyrhiza showed a similar pattern with the genomic diversity and recombination rate among four populations (Fig. 2 and Supplementary Fig. 5).
Interestingly, the changes in genomic diversity and levels of heterozygosity are associated with two SVs involving MADS-box genes that are involved in sexual reproduction. One SV is an 84 bp insertion at the last coding sequence (CDS) of gene SpGA2022_005278, a homolog of AGL62 from the Mα subclade of MADS-box genes (Supplementary Fig. 6). In A. thaliana, AGL62 is a transcription factor that suppresses endosperm cellularization by activating the expression of a putative invertase inhibitor, InvINH1, in the micropylar region of the endosperm47,48 (Supplementary Fig. 7). The insertion likely disrupted the function of the AGL62-like gene, thus reducing the suppression of endosperm development, which is likely required for sexual reproduction (Fig. 2a). Consistently, we found the insertion was at a higher abundance in the SE-Asian population (87.5%) than in other populations (Fig. 2c and 2d). In addition, the insertion positively correlates with heterozygosity within the European population (Supplementary Table 3 and Supplementary Fig. 8). Another SV is a 69 bp deletion at 1.8 kb upstream of SOC1 (SpGA2022_007306, Supplementary Fig. 5), the ortholog of which is a positive regulator of the flowering process in A. thaliana49 (Fig. 2a). The deletion was exclusively found in the Indian population with the alternate allele frequency of 73% (Fig. 2c). It is plausible that the deletion, due to its disruption potential at the cis-regulatory region, reduces the ability of the SOC1-like gene to respond to the upstream floral activators (e.g. CO) in S. polyrhiza, thus reducing the frequency of sexual reproduction in the Indian population (Fig. 2d). Consistently, this deletion negatively correlates with heterozygosity in the Indian population (Supplementary Table 3, Supplementary Fig. 9).
Population epigenomic diversity in S. polyrhiza
As changes in sexual reproduction can also alter epigenomic dynamics, we further investigated the patterns of population epigenomic diversity in S. polyrhiza. We selected five individuals from each population and quantified their shoot DNA methylation levels at single-base resolution using whole genome bisulphite sequencing. Similar to a recent study40, we found that only 1.5% of cytosines are methylated in S. polyrhiza (7.2% of CpG, 2.2% of CHG, and 0.1% of CHH; see Supplementary Table 4), and the average species-wide methylation levels are much lower than all other investigated angiosperm species (Supplementary Fig. 10)31,50. The hierarchical clustering of 20 methylomes in CpG, CHG, and CHH contexts show high consistency with their genetic similarity (Supplementary Fig. 11–13). Few discrepancies were mostly found within the same population or between the recently diverged SE-Asian and European populations, which indicates that population demography can strongly affect epigenomic variations in S. polyrhiza.
We then compared the genome-wide weighted methylation level (wML) among populations. For CpG methylation, the four populations did not differ at genome-wide levels (Fig. 3a). However, CpG methylation level in gene bodies (but not in TEs) was higher in the American population than the other three populations (Fig. 3d, g). For CHG, the Indian population had the lowest genome-wide methylation level among all four populations (Fig. 3b, 3e and 3h). Interestingly, for CHH, the American population had the lowest genome-wide methylation level (P < 0.01, pairwise Wilcoxon test; Fig. 3c), while the European and Indian populations showed a gradual reduction of methylation in comparison to the SE-Asian population. The pattern was the same for both gene bodies and TEs (P < 0.01, pairwise Wilcoxon test; Fig. 3f and 3i). The genome-wide reduction of CHH methylation is consistent with the hypothesis that clonal reproduction reduces CHH methylation, and the effects gradually accumulate over clonal generations51.
The footprint of selection on the genome
To identify the genomic signature of selection at the species level we performed genome-wide scan. To reduce false positives, we used the µ-statistics from RAiSD52, the composite likelihood ratio CLR statistic from SweeD53, and the T statistic from LASSI54. We found 69 genes showed strong signature of selection using all three methods (Supplementary Fig. 14). Manual inspection indicated that several orthologs of these genes are related to gametogenesis (e.g., NOTCHLESS) and embryogenesis (e.g., NUP214, CPSF, CDK, AGP, and ACR4) in Arabidopsis thaliana55–59. Further enrichment analysis indeed showed that embryo lethal genes were enriched in these 69 genes (P = 0.016, \({\chi }^{2}\) test). In addition, the A. thaliana orthologs of several genes under selection are also associated with controlling sexual reproduction, including floral development (DRMY1 and ACR4)55,60, flowering time (NF-Y AT2G27470, NF-YAT1G72830, and CPSF), pollen development (EFOP3, ELMOD, and CLC) 57,61−65, seed development (NUP214, NF-Y AT2G27470 and NF-YAT1G72830, and Transducin/WD40 )56,61,66. Furthermore, among these 69 genes, we also found several genes involved leaf development and vascularity (SECA2, RbgA, PHABULOSA/PHAVOLUTA)67–69, light signaling (NF-Y, CCR4-NOT, and PPP)61,70,71, root development (GEND1, WAVY, and ACR4)55,72,73 and DNA damage repair (ATM and Xrcc3)74,75 (Supplementary Data 4).
To further understand the selection that drove the evolution within individual populations, we identified the signature of positive selection in a three-population tree using patterns of linked allele frequency differentiation and calculating the corresponding composite-likelihood ratio (CLR, see Methods). In total, we found 1,883 genes on the SE-Asian branch, 593 genes on the Indian branch and 401 genes on the European branch (Fig. 4a; see Section 1.5 of the Supplementary Information) which showed strong signatures of selection (top 1% of CLR values).
We found that genes under positive selection in the European branch are enriched with reproduction and development-related GO terms (Supplementary Fig. 15). Among these, SpGA2022_013448, in chromosome 9, is an ortholog of FLOWERING LOCUS KH DOMAIN (FLK) that delays flowering by up-regulating FLC family members in A. thaliana76. This gene showed a strong signature of selection in the European branch but not in other branches (Fig. 4c and d). Similarly, SpGA2022_006111, on chromosome 3, is an ortholog of the A. thaliana BIG BROTHER (BB) that negatively regulates floral organ size and is also under selection in Europe77 (Fig. 4c).
In the SE-Asian population, we found that gene SpGA2022_051517, a CHROMOMETHYLASE3 (CMT3) homolog in A. thaliana, is involved in maintaining CpG, particularly CHG methylation17. This is consistent with the higher CpG and CHG methylation levels observed in the SE-Asian population when compared to the European and Indian populations (Fig. 3a and 3b). Within the Indian population, we found that five MADS-box genes have been under selection exclusively along this branch. Given that there are 43 MADS-box genes in the genome, the fact that five of them have been targeted by selection, constitutes a significant enrichment of such genes under selection (P = 0.0075, Fisher’s Exact test). For example, SpGA2022_013078 is an ortholog of AGAMOUS-LIKE6 (AGL6), which is involved in flower and meristem identity specification in rice78; SpGA2022_052274, a homolog of APETALA3 (AP3), is involved in the petal and stamen specification in A. thaliana79; and SpGA2022_006905 belongs to the SHORT VEGETATIVE PHASE (SVP-group) which controls the time of flowering and meristem identity80.
We found 77 genes under positive selection (top 1% CLR values) in both the European and Indian populations (Fig. 4b), significantly more genes than expected by chance (P < 2.2e-16, Fisher’s Exact Test). Among these, gene SpGA2022_055195, a homolog to CYP78A9 of cytochrome P450 monooxygenases in A. thaliana, belongs to a highly conserved gene family CYP78A. Previous studies in A. thaliana and other species found that CYP78A9 plays a critical role in promoting cell proliferation during flower development and further impacts seed size81–83. Overall, these data consistently suggest that genes involved in reproduction and development were under selection in Indian and European populations, which might have led to reduced sexual reproduction in these two populations.