Phylogenomics of the genus Glycine sheds light on polyploid evolution and life-strategy transition

Polyploidy and life-strategy transitions between annuality and perenniality often occur in flowering plants. However, the evolutionary propensities of polyploids and the genetic bases of such transitions remain elusive. We assembled chromosome-level genomes of representative perennial species across the genus Glycine including five diploids and a young allopolyploid, and constructed a Glycine super-pangenome framework by integrating 26 annual soybean genomes. These perennial diploids exhibit greater genome stability and possess fewer centromere repeats than the annuals. Biased subgenomic fractionation occurred in the allopolyploid, primarily by accumulation of small deletions in gene clusters through illegitimate recombination, which was associated with pre-existing local subgenomic differentiation. Two genes annotated to modulate vegetative–reproductive phase transition and lateral shoot outgrowth were postulated as candidates underlying the perenniality–annuality transition. Our study provides insights into polyploid genome evolution and lays a foundation for unleashing genetic potential from the perennial gene pool for soybean improvement. Assemblies of six representative perennial Glycine genomes and a comparison with annual soybean genomes reveal evolutionary patterns, differentiation and adaptation of annual and perennial genomes and mechanisms driving subgenome fractionation.

diploid genomes range in size from 941 to 1,374 Mb, with 55,376-58,312 annotated protein-coding genes (Supplementary Table 6). The assembled allopolyploid genome is 1,948 Mb, harbouring 113,697 annotated protein-coding genes (Supplementary Table 6). Approximately 40-62% of these genomes are composed of transposable elements (TEs) (Fig. 2a and Supplementary Table 6). vulgaris were used as outgroup species to generate a rooted tree. Stars indicate sporadic distribution of perennial Glycine species. The divergence times of species or higher taxa are labelled. b, Synteny plot showing genomic collinearity among annual and perennial Glycine species using G. max as a reference (Chr, chromosome). c, A genomic rearrangement of chromosome 8 between A/A t and D/D t . The TE density along each chromosome is plotted above and below the D and A genomes, respectively. Red vertical lines indicate assembled gaps among contigs. d, Frequencies of genomic rearrangements in G. max and perennial Glycine species evaluated using P. vulgaris as a reference. *P < 0.05, chi-squared test. Phylogeny and karyotype stability. A phylogeny of these perennials and the annual species was constructed with 830 single-copy orthologous genes using Medicago truncatula and common bean (Phaseolus vulgaris), which diverged from the Glycine lineage ~40-60 and ~20- 30 Ma, respectively 7 , as outgroups. The evolutionary relationships, as reflected by the phylogenetic tree ( Fig. 1a and Supplementary Fig. 3), are consistent with previous reports based on a limited number of genes, although the dates for the split of the perennial lineage from the annual species and the speciation of individual perennials were estimated to be earlier than the previous estimates with a different method 14 (Fig. 1a). Differences in dates are probably due to updated calibrations from divergence times in the legume family from nuclear phylogenomic analyses 7 rather than from individual plastid genes 15 .
To understand relative genomic stability between the annual and the perennial diploid genomes, we identified large segment rearrangements including inversions, translocations and insertions/deletions (InDels) of >50 kilobase pairs (kb) in size that can be clearly defined by comparison with P. vulgaris as an outgroup and determined the numbers and genomes in which a specific rearrangement occurred. Overall, the perennial diploids exhibited a high level of chromosomal conservation ( Fig. 1b and Extended Data Fig. 1a,d). Only 183 genomic rearrangements were identified by pairwise comparisons with the soybean reference genome Williams 82 (av2) (Supplementary Table 7). Of these rearrangements, 89.6-95.7% were supported by the BESs mapped to respective genomes, while the remaining rearrangements do not have available BESs to validate (Supplementary Table 7). Ten randomly selected inversion events were also validated by PCR-based amplification of fragments harbouring the inversion junction sites ( Supplementary  Fig. 4, additional Supplementary Fig. 1 and Supplementary Table 8). The rearrangements include both perennial-specific and species-specific ones ( Fig. 1c and Extended Data Fig. 1a). Comparison among multiple species/genomes enabled validation of many genomic rearrangements and defining the relative timing and nature of these events, such as the occurrence of the inversion of a ~3.4 Mb fragment involving 398 genes in the annual species (Extended Data Fig. 1b) and the reorganization of chromosome 8 of the D/D t genome by a combination of inversion, deletion and translocation events, which resulted in reduction of >20 Mb of local genomic sequence as compared with chromosome 8 of the A/A t genome (Fig. 1c). The reduction was caused by a translocation of a segment to chromosome 10 of the D/D t genome (Extended Data Fig. 1c). Intriguingly, the ancestral telomeric region adjacent to the site where these rearrangements occurred has been retained as one of the telomeric regions of the rearranged chromosome in the D/D t genome ( Fig. 1c and Extended Data Fig. 1c). The gene density along this chromosome has been reshaped primarily by accumulation of TEs in the restructured pericentromeric regions and reduction of TE sequences in the restructured chromosomal arm (Fig. 1c). Overall, fewer genomic rearrangements have occurred in the perennial than in the annual species as revealed by comparison with P. vulgaris (Fig. 1b, Data and Supplementary Table 6). Individual LTR-RT families exhibited distinct spectra of amplification among the species as reflected by their relative abundance and estimated insertion times (Fig. 2a Family size (×1,000) 317.4 Mb (23.1%) of the genome. By contrast, few LTR-RTs belonging to these two families were seen in the annual genomes. Given that 98.2% of intact LTR-RTs were estimated to be amplified in the last 5 Ma (Supplementary Tables 9 and 10), it is apparent that LTR-RT amplification was largely responsible for local genomic differentiation. On the other hand, the pace and degree of LTR-RT DNA loss are also striking. Of the 905 intact LTR-RTs in the perennials estimated to be amplified ~7 Ma and thus considered to be inserted before the split of the perennial and annual lineages, none was found to have intact orthologues in the annuals and fewer than a quarter of these elements had detectable remnants at putatively orthologous sites in the annuals, suggesting a loss of at least 94.3% DNA from the 'original' copies. In addition, a large amount of LTR-RT DNA has been removed by formation of solo LTRs through unequal recombination (Extended Data Fig. 2a). Besides LTR-RTs, Mutator and Helitrons are the two major TE superfamilies contributing to the perennial genome size variation and differentiation of intergenic regions (Fig. 2a).
Centromeres in most plant species are composed of long arrays of centromeric satellite repeats (CSRs), which are often interrupted by centromere-enriched retrotransposons (CRs) 16,17 . In the annual soybean, two subfamilies of CSRs, G m -Cent1 and G m -Cent2 (Extended Data Fig. 3), were previously identified and found to mark two distinct subsets of the 20 chromosomes by fluorescence in situ hybridization 18 and interpreted as evidence for the palaeo-allopolyploid origin of soybean. Intriguingly, few copies of these CSRs were detected in the perennial genomes. Compared with the annual G. max (G m ) genome, which harbours ~38 Mb of CSRs, the F genome contains ~7 Mb, of CSRs, while the B, C, A and D genomes only contain 75, 5, 0.4 and 12 kb of the CRS-homologous sequences, respectively (Fig. 2c). Analysis of abundant tandem repeats in the perennial Glycine species and examination of their distributions along chromosomes did not identify any CSR-like repeats in B, C, A and D genome (Supplementary Fig. 5 and Supplementary Table 11). On the basis of the average sequence divergence, the G m -Cent1 and G m -Cent2 were estimated to have diverged ~8 Ma and CRS homologues in the F genome (dubbed G f -Cent) were slightly more similar to G m -Cent2 than G m -Cent1 ( Fig. 2d and Extended Data Figs. 3 and 4a). On the basis of the relative frequencies of physical adjacency of G m -Cent1/G m -Cent2 and specific retrotransposon sequences detected with the Illumina sequences, we identified Gmr17 (enriched in the G m -Cent1 sequences) and Gmr01 (enriched in the G m -Cent2 sequences) ( Fig. 2e-g), as putative CR families in the annual genomes, whereas no similar retrotransposon families were found in the perennial genomes ( Fig. 2e,f,h and Extended Data Fig. 4b-d).
Super-pangenome framework and evolutionary architecture. The pangenomes that have been constructed to date are limited to primarily cultivated accessions of one species, lacking representation of genomic diversity at the genus level 19 . To facilitate future development of a pangenome (so-called a super-pangenome) of representative species within Glycine and to understand genome evolution at the local-and micro-scales, we construct a super-pangenome framework using the sequenced diploid Glycine genomes. Rather than defining presence/absence of gene families that would not reflect the gain/loss of individual genes, we identified and compared orthologous genes among the sequenced perennial species (Methods).
A total of 109,827 non-redundant genes were annotated in the five perennial diploids, of which, 31,936 (29%) are shared by all five perennials as orthologues and referred to as perennial core genes, while the remaining are referred to as perennial non-core genes ( Fig. 3a and Supplementary Table 12). By contrast, a total of 129,006 non-redundant genes were annotated in the 26 G. soja/ G. max accessions 5 , of which 31,564 (24.5%) are shared by all these annual accessions and are referred to as annual core genes, while the remaining are referred to as annual non-core genes (Supplementary  Table 12 and Extended Data Fig. 5a). The higher proportion of core genes identified in perennial species than in the annual species is due to the smaller sample size involved in analysis (5 versus 26). When only five randomly selected annual genomes were analysed, as expected, a higher proportion of core genes were observed in the annual species than in the perennial species (Extended Data Fig. 5b). Of the 31,936 perennial core genes, 17,922 (56.2%) overlap with the annual core genes, 8,704 (27.2%) overlap with the annual non-core genes and 5,310 (16.6%) were perennial-specific core genes (Fig. 3b). Of the 77,891 perennial non-core genes, 7,022 (9.0%) overlap with the annual core gene set, 6,745 (8.7%) overlap with the annual non-core genes and 64,124 (82.3%) were identified as perennial-specific non-core genes (Fig. 3b). The shared orthologues between any two of the five perennial species after several million years of independent evolution account for 77.4-83.5% of all non-redundant genes annotated in the compared species (Fig. 3a and Supplementary Table 12), whereas the shared orthologues between a G. max and a G. soja accession that diverged ~0.13-0. 62 Ma on the basis of the uncorrelated lognormal clock (UCLN) models ( Fig. 1a and Supplementary Fig. 3) make up 84.5% of all non-redundant genes in the compared two accessions (Supplementary Table 12). These observations may suggest a much higher rate of non-core gene formation in the annuals than in the perennials or may be partially caused by much more sources of gene expression data used for gene annotation from the annuals than the perennials that were used for gene annotation.
Overall, the 17,922 core genes shared between the perennials and annuals exhibited lower rates of synonymous substitution (K s ) and non-synonymous substitution (K a ) and stronger intensities of purifying selection (ω, that is, K a /K s ,) than the 6,745 non-core genes shared between the perennials and annuals in both lineages but neither the core genes nor the non-core genes showed differences in K s , K a and ω between the two lineages ( Fig. 3c-e). Among these shared genes, duplicates retained from the ~10 Ma WGD showed lower rates of K a and stronger intensities of purifying selection than singletons in both lineages but no differences in K s , K a and ω were detected between the two lineages ( Fig. 3f-h).
Duplicates were more conservative between the annuals and the perennials than the singletons, as 83.7% (23,671 in the D genome), 86.3% (21,537 in the F genome) of duplicates were shared with the G. max genome compared to 46.8% (6,386 in D) and 50.9% (6,036 in F) for singletons (Extended Data Fig. 5c,d). In addition, a higher ratio of duplicates to singletons was observed in the core gene sets (4.7:1) than in the non-core gene set in the perennial diploids (1:1.6) (Supplementary Fig. 6 and Supplementary Tables 13 and 14). Interestingly, a higher ratio of duplicates to singletons was observed in the annual genome than in the perennial genomes and the rapid emergence of non-core genes in the annuals, many of which do have allelic copies in subsets of the annual accessions, appears to be partly caused by the rapid turnover of singletons (Extended Data Fig. 5e). These observations suggest that the two life-history strategies may have distinguishable effects on gene evolution during the course of the diploidization process after the split of the annual and perennial lineages.

Adaptive gene evolution underlying the life-history strategy transition.
Little is known about the genetic basis of life-history strategy transitions between perenniality and annuality in flowering plants 20 . It is suggested that regulatory networks controlling flowering are involved as the outcomes of adaptive response to particular environments 21,22 . Mutagenesis analysis revealed that SQUAMOSA PROMOTER BINDING PROTEIN-LIKE 15 (SPL15), the orthologue of Arabidopsis thaliana FLOWERING LOCUS C (FLC), in the perennial crucifer Arabis alpina contributes to perenniality through reducing flowering duration to facilitate a return to vegetative development, preventing floral transition of some branches for polycarpic growth and conferring flowering response to winter temperatures that restrict flowering to spring 21,22 . Nevertheless, no natural mutations responsible for such transitions have been identified.
In an attempt to propose candidate genes underlying the perenniality-annuality transition in Glycine, we performed genome-wide screen for all orthologous genes shared by these perennial diploids and their orthologous genes shared by seven wild annual soybean accessions used for construction of a wild soybean pangenome 4 to determine whether they have undergone adaptive evolution during the divergence of the two subgenera (Methods). Among all orthologues tested, 52 exhibited evidence of adaptive evolution between the two subgenera and high level of purifying selection within each subgenus (K a /K s < 0.45, an average K a /K s of soybean genes orthologous to Arabidopsis flowering genes) (Supplementary  Tables 15 and 16). On the basis of gene ontology (GO) analysis, six genes are predicted to be associated with the cell differentiation and flower development processes (Supplementary Table 15 and Extended Data Fig. 6) and two of them showed extreme level of purifying selection within both the annual and perennial subgenera ( Fig. 4a,b). One of the two genes is orthologous to the Arabidopsis PLANT HOMOLOGOUS TO PARAFIBROMIN (PHP) (Fig. 4c). In Arabidopsis, the loss-of-function php mutation causes misregulation c-e, K a (c), K s (d) and K a /K s (e) of the core genes versus non-core genes in the annual and perennial species. f-h, K a (f), K s (g) and K a /K s (h) of singletons versus duplicates in the annual and perennial species. The data for the five perennial species were combined. In the boxplots, the box defines the interquartile range, the centre represents the median, and box bounds represent the lower and upper quartiles; **P < 0.01; P values calculated from Student's two-sided t-test are shown above each chart.
of FLC and activation of SPL15 and FLOWERING LOCUS T (FT) expression, resulting in conditioned accelerated phase transition from vegetative growth to flowering 23 . The other gene is orthologous to the Arabidopsis DWARF14 (D14) that encodes a receptor of strigolactones. In Arabidopsis, the loss-of-function d14 mutation increases numbers of lateral shoots and delays axillary bud development, lateral shoot outgrowth and flowering time of lateral inflorescences 24 . The functional counterpart of D14 has been identified in rice and the loss-of-function d14 mutation exhibits increased shoot branching with reduced plant height 25 . In addition to the observed point mutations in the D14 orthologues between the annual and perennial subgenera, the orthologues also exhibited distinct gene structures between the subgenera (Fig. 4d). Given that both the annual accessions and perennial species are distributed in highly diverse habitats, the adaptive evolution of the PHP and D14 orthologues between the annual and perennial groups, the highest intensities of purifying selection of these orthologues within each of the two groups (Fig. 4e) and the functions of these genes in Arabidopsis as revealed by mutagenesis (Fig. 4f,g) could explain the genetic basis underlying the life-history strategy transition, although explicability should by no means be interpreted as proof of the transition.

Biased subgenomic fractionation in the recent allopolyploid.
Loss of redundant genes (fractionation) is a common phenomenon following polyploidy 26,27 . Biased fractionation, in which most gene loss occurs in one subgenome, is thought to occur when the genomes brought together in a polyploid are differentiated, with the density of TEs in progenitor genomes being a critical 'parental legacy' that produces epigenetic regulatory differences between subgenomes, leading to subgenome-wide lower transcription and ultimately to loss of many of the more weakly expressed homoeologues 28,29 . We first evaluated the collinearity of genes among the A, D, A t and D t genomes/subgenomes, focusing on rearrangements each involving DNA >50 kb. Compared with the more diverse perennial genomes such as F and B, which produced seven rearrangements per million years (Myr) and B and C, which produced 6.7 rearrangements per Myr, the A and D genomes exhibited a higher rate of rearrangements (11 per Myr), with more rearrangements in D than A (Supplementary Table 7). A and A t , and D and D t , are highly conserved, with only ten transpositions between A and A t and six transpositions between D and D t identified (Fig. 5a). Of the ten transpositions between A and A t , three occurred in A involving 113 genes and seven occurred in A t involving 314 genes. Of the six transpositions between D and D t , three occurred in D involving 123 genes and three occurred in D t involving 146 genes. It remains unclear if the transpositions detected in the A t and D t subgenomes occurred before or after the allopolyploidization event.
Comparison among the A, D, A t and D t genomes revealed loss of 7,351 genes from the A t and D t subgenomes of the allopolyploid, with a higher number being lost from D t (4,109) than from A t (3,242) (Fig. 5c, Supplementary Fig. 7a and Supplementary Table 17). In both subgenomes, loss was higher for loci where only one ~10 Ma WGD homoeologue was retained in the A or D genome: 60.4% of the lost genes were deduced to be singletons in their respective subgenomes before their losses, 39.6% were deduced to be members of ~10 Ma WGD homoeologues (Fig. 5d). More singletons were lost in D t than in A t , whereas no difference in the number of losses of duplicates was observed between the two subgenomes (Fig. 5e). and D14 (d) loci, respectively, between the two subgenera. The blank boxes in the annuals indicate extended exon boundaries corresponding to intronic sequences in the perennials. e, K a /K s comparison within a subgenus and between the subgenera. A/A, comparison within the annual Glycine subgenus; P/P, comparison within the perennial Glycine subgenus; A/P, comparison between annual and perennial Glycine subgenera. f,g, Model of genetic pathways mediated by the PHP (f) and D14 (g) orthologues, respectively, underlying the life-history strategy transition and adaptation.
Only 0.14% of the ~10 Ma WGD-derived homoeologues lost both copies from either the A t or D t subgenomes (Fig. 5e). On the basis of the orthologues of the 7,351 genes lost in A t and D t that are present in the A and D genomes, we found that 55.4% and 76.2% of the non-core genes present in both A and D were lost from the A t and D t subgenomes, respectively, whereas 39.8% and 49.6% of the core genes present in both A and D were lost from the A t and D t subgenomes. It is apparent that non-core genes had a higher tendency of loss than the core genes in both subgenomes (Supplementary Table 18). As a class, genes in the A and D genomes that have lost a homoeologue in the A t or D t subgenome have higher K a /K s , relative to loci where both homoeologues are retained (Fig. 5f), suggesting that such genes were under weaker purifying selection in the parental genomes before their loss following allopolyploidy. Consistent with their greater dispensability, genes experiencing a loss from either homoeologous genome are more weakly expressed than are Chr., chromosome. c, Losses of homoeologous genes from the A t and D t subgenomes. d, Losses of singletons and duplicated genes derived from the ~10 Ma WGD in the recent allopolyploid. e, Biased losses of singletons and unbiased losses of duplicated genes derived from the ~10 Ma WGD between A t and D t . f, For each locus in A t or D t , the K a /K s ratios of their A and D orthologues were calculated. These values were then pooled separately for loci where either the A t or D t homoeologue had been lost and compared with the pooled value for loci in which both homoeologues were retained. K a and K s of genes in A and D were calculated on the basis of comparison with their respective orthologues in P. vulgaris. g, Negative correlations between expression levels of A and D orthologues and rates of losses of their corresponding genes from A t and D t , respectively. h, The majority of gene losses from the A t or D t subgenome occurred in clusters each harbouring fewer than five genes. i, Pairs of adjacent genes missing in either the A t or D t subgenome are co-expressed in the parental genome. j, Pairs of adjacent genes in the A or D genomes for which only one of the two loci is lost from the A t or D t subgenome are not co-expressed. k, Different categories of gene losses and pseudogenizations in A t and D t . l, Exemplification of a small deletion by illegitimate recombination in the A t subgenome that resulted in a premature stop codon. In the boxplots (f), the box defines the interquartile range, the centre represents the median and box bounds represent the lower and upper quartiles; *P < 0.05 and **P < 0.01. P values calculated from Chi-square test (c-e,g,h), Student's two-sided t-test (f) and Pearson correlation test (i,j) are shown above each chart.
genes where both homoeologues are retained (Fig. 5g). Thus, the A t subgenome was probably dominant over the D t subgenome on the formation of the allopolyploid. The biased fractionation between the A t and D t subgenomes by gene losses was probably predetermined by genomic features of their diploid progenitors particularly the higher expression, on average, of the A homoeologues ( Supplementary Fig. 7b). This could be due to the distribution of TEs distal to the promoter regions of adjacent genes that is associated with levels of gene expression (Supplementary Fig. 7d). Weaker gene expression is also reflected by lower levels of purifying selection on D homoeologues ( Supplementary Fig. 7c). We found that gene losses in A t or D t tended to occur in clusters each harbouring fewer than five genes ( Fig. 5h and Supplementary Fig. 7e) and that the counterparts of those lost genes within a cluster that are retained in A or D generally showed a pattern of co-expression ( Fig. 5i; r 2 = 0.155, P = 0.012). By contrast, such a pattern of co-expression was not seen between the A or D orthologues of a single lost gene in A t or D t and its adjacent gene ( Fig. 5j; r 2 = 0.014, P = 0.508). The clustered gene losses are probably the consequences of homoeologous exchange, which has been demonstrated to be a cause of gene presence/absence variation in Brassica napus 30 , cotton (Gossypium hirsutum) 31 , peanut (Arachis hypogaea) 32 and other allopolyploids 33 , although its effectiveness appears to vary greatly among these allopolyploids.
Illegitimate recombination for genomic fractionation. The loss of a gene, as described above, was defined when it was annotated in A, D and only one of the two subgenomes (A t and D t ). After careful manual inspection of individual genes, we found that only 13.1% of the 7,351 gene losses are complete absence of the gene sequences, whereas 45.3% and 6.3% of the losses were pseudogenized genes caused by small InDels and point mutations, respectively, and 4.4% were genes interrupted by TE insertions (Fig. 5k and Supplementary  Fig. 8). The remaining 30.9% of gene losses were defined as 'losses' as they were not annotated as genes due to the lack of expression by the standard gene annotation pipeline (Methods). Nevertheless, this category of 'losses' in the A t and D t subgenomes was proportional to the ratio of total lost genes in respective subgenomes and was not excluded from our analysis.
In an attempt to shed light on the mechanism(s) that gave rise to the small deletions resulting in pseudogenization that was largely responsible for subgenome fractionation, we examined 2,850 small deletions with clearly defined boundaries in 2,315 genes in the allopolyploid. We found that 31.2% of these deletions were flanked by short repeats of 2-18 bp, as exemplified in Fig. 5l, a hallmark of illegitimate recombination 34 , suggesting that illegitimate recombination that gradually deleted genic sequences was a key mechanism for subgenome fractionation. Classical homologous recombination generally requires homologous sequences of at least 20 bp (ref. 35 ). Thus, such small deletions are unlikely to be formed by homoeologous exchange.
Allopolyploidization did not trigger extensive TE amplification. It has been predicted that allopolyploidization by interspecific hybridization triggers 'genomic shock' that would lead to widespread activation of TEs 36 . If this happened in the recently synthesized Glycine allopolyploid, as a consequence, a large number of identical or nearly identical TEs, particularly LTR-RTs between the A t and D t subgenomes would be observed. We therefore extracted the reverse transcriptase sequences from 1,202 copia-type and 3,070 gypsy-type intact LTR-RTs which inserted into the genome within the last 350,000 years to construct their phylogenetic relationships, respectively. Out of the 4,272 LTR-RTs examined, only 38 showed a high level of sequence identity (for example, ≥99%) between the A t and D t subgenomes (Extended Data Figs. 7 and 8). By contrast, none of the LTR-RTs from A and D genomes showed such a level of sequence identity (Extended Data Figs. 9 and 10). These observations suggest that the allopolyploidization event did not lead to massive amplification of LTR-RTs. Given the association of gene expression and adjacent TEs, it is likely that TE-mediated transcriptome modification, rather than TE proliferation, is the main consequence of the polyploidy-induced 'genome shock' .

Discussion
We have generated chromosome-level genome assemblies of six representative Glycine perennial species to construct a super-pangenome framework of the Glycine genus. A total of 109,827 non-redundant protein-coding genes were annotated in the perennials, of which ~70% were absent in the annual soybean pangenome. This represents a huge repertoire of genetic potential for improvement of the annual crop through distant hybridization, targeted modification or replacement of soybean genes or direct use through de novo domestication of the perennial soybeans into new crops. In addition to the genomic data, this study elucidates the propensities and consequences of polyploid genome evolution, possible genetic determinants of the life-history strategy transition and the causes and mechanisms for subgenome fractionation.
Our analyses revealed several diverged genomic features in the perennials compared with the annual soybeans, including a lower rate of genomic rearrangements, a lower ratio of solo LTRs to intact LTR-RTs and a slower pace of non-core gene emergence. Genomic rearrangements are generally formed by repeat-mediated recombination events 37 , while solo LTRs are the products of unequal intra-element recombination; thus, the differences in rates of rearrangements and solo LTR formation probably reflect different rates of unequal recombination. As unequal recombination often correlates positively with allelic recombination 38 , such differences are likely to be associated with distinct generation times of the perennials and annual species. The slower pace of formation of non-core genes in the perennials may also be associated with the reduced generation times 39 , which would reduce recombination-induced mutations that could lead to pseudogenization and gradual degradation of genic sequences. Intriguingly, the substitution rates as reflected by K s did not show notable differences between the perennials and annuals. One possible explanation is the potential effect of recombination on the effectiveness of natural selection 38 . Alternatively, it may be explained by the relatively short time frame for independent evolution of the annual and perennial lineages relative to the much longer divergence from P. vulgaris, which was used as a reference for calculation of evolutionary rates.
Plant centromeres typically harbour CSRs as integral for maintaining centromere functions 40 . Some CSRs differ drastically between close relatives, such as sibling Oryza species 41 and some are highly conserved even between species diverged for millions of years, such as maize and rice (Oryza sativa) 42 . Given such a short time of divergence between the annual and perennial lineages, the lack of typical CSR sequences in the perennials is a surprising observation. Similarly, however, centromere satellites identified in the annual Brachypodium distachyon were not detected in a few examined perennial species of the Brachypodium genus 43 . Although many plant species have been sequenced, centromeric sequences have only been characterized in a few species, primarily a few major annual crops. Thus, it remains challenging to determine whether the CSRs were lost in these perennials or were born in the annuals and whether they were associated with the life-history strategy transition.
Subgenome expression biases have been reported in various allopolyploids since the advent of transcriptomic studies 44 but only with genome sequences of polyploids and their diploid progenitors is it possible to understand the parental legacies that probably lead to expression and fractionation biases. This is now known for polyploids in an increasing number of genera, including such models as Brassica napus 45 and Gossypium hirsutum 46 . The perennial soybean allopolyploid, G. dolichocarpa, which has been used as a model for studying various aspects of polyploid evolution 47,48 , now becomes another such species. The availability of complete genomes of its A and D diploid progenitors allows precise definition of the two subgenomes and the genome-wide distribution of TEs, permitting detailed characterization of patterns of subgenomic fractionation. Whereas homoeologous exchange is increasingly appreciated as a mechanism of genome evolution in allopolyploids 33 , less is known about illegitimate recombination, which we describe as a major force in the evolution of G. dolichocarpa. The genomic sequences reported here enhance the value of Glycine as a model system for exploring what Wendel proclaimed 26 as the "wondrous cycles of polyploidy". Perennial Glycine includes several other allopolyploid species that combine overlapping sets of diploid genomes 14 and evidence to date has revealed commonalities among them that could be evidence of emergent properties of polyploidy 48,49 . The genomes of these species, most of which include either G. tomentella or G. syndetika as progenitors, hold additional clues to the patterns and processes of polyploid evolution.

Methods
Plant materials and genome sequencing. Five diploid perennial Glycine species (G. falcata (FF), G. stenophita (BB), G. cyrtoloba (CC), G. syndetika (AA) and G. tomentella D3 (DD)) and one recent allotetraploid species (G. dolichocarpa (G. syndetika × G. tomentella D3)), were obtained from Cornell University. Young leaves were collected 3 weeks after planting in the greenhouse and used for DNA extraction using the cetyltrimethylammonium bromide method 50 . For Illumina short-read sequencing, DNA was physically sheared to a size of ~500 bp and sequenced with an Illumina HiSeqX platform to generate 150 bp paired-end reads at a mean of ~100× coverage. High-molecular-weight DNA (>20 kb) libraries were created following Pacbio's standard protocol and sequenced using a PacBio CLR platform at ~60× coverage (Novogene). Hi-C libraries were created and sequenced at Phase Genomics at a minimum of ~100× coverage (Phase Genomic).
Genome assembly and construction of pseudomolecule chromosomes. Quality trimming of raw Illumina reads removed adaptor sequences and low-quality bases (Q < 20), as well as contamination with mitochondrial, chloroplast and PhiX reads, using Trimmomatic (2:30:10) 51 . Long reads generated by PacBio were self-corrected using its built-in pipeline. The size of each perennial Glycine species was estimated using GenomeScope 52 on the basis of k-mer (21-31 nucleotides) analysis. Error-corrected reads were assembled using FALCON v.0.3.0 (ref. 53 ). Multiple assembly runs were tested with adjusted parameters and the best assembly was selected on the basis of contig N50. The acquired draft genome then underwent two rounds of error correction using Quiver 54 and PILON 55 to obtain contig-level assembly. The draft assembly was further scaffolded using Hi-C reads. Briefly, Hi-C reads were aligned to the draft genome using BWA-MEM 56 and processed using SAMBLASTER and SAMTOOLS 57 (bwa mem -5SP [assembly.fasta] [fwd_hic. fastq] [rev_hic.fastq] | samblaster | samtools view -S -h -b -F 2316). Quality control of the alignments was then performed using the Phase Genomics open-source Hi-C alignment QC tool and scaffolding was carried out with the Phase Genomics Proximo Hi-C genome scaffolding platform to obtain chromosome-level assembly.
Error correction and assembly evaluation. Species-specific paired BAC end sequences (BESs) generated from the SoyMapII project (https://soybase.org/ soymap2/) were used to check and correct the orientation of scaffolded contigs. First, paired reads were mapped independently to the assembled genome using BLAST+ (ref. 58 ) and at least one of the paired reads that mapped uniquely to the assembly was kept for downstream analysis. Screening was then performed for paired reads that spanned two contigs. The order and orientation of contigs were manually adjusted on the basis of the expected insert size (~150 kb) and the orientation of mapped reads. The final statistics for each corrected assembly was generated with QUAST 59 and the completeness of the assemblies was evaluated using BUSCO 11 and CEGMA 12 .
Annotation of repetitive sequences. Repetitive elements were identified through a combination of de novo and structure-based approaches. The de novo prediction of LTR-RTs was performed with LTR harvest 60 and LTR_finder 61 . Identified LTR-RT candidates were further screened for the integrity of their LTR polypurine tract and primer binding site signatures. A retrotransposon database was compiled from the intact LTR-RTs together with known retrotransposons from G. max (Williams 82 (av2)) 3 and sequences downloaded from Repbase 62 to form a confident retrotransposon database and this was used to mask the genome for identification of nested LTR-RTs. The whole process was repeated until no new predictions could be made. Gypsy-and copia-LTR-RTs were distinguished according to their sequence similarity and the order of integrase, reverse transcriptase and RNase H sequences. Identified LTR-RTs were further classified into different families on the basis of the 80-80-80 rules suggested by Wicker et al. 63 using SiLiX 64 . Non-LTR-RT short interspersed nuclear elements (SINEs) were identified using SINE_scan with default parameters 65 . Helitrons were identified with HelitronScanner 66 . A similar pipeline was used to identify long interspersed nuclear elements (LINEs) and DNA TEs, such as Mutator, hAT, CACTA, Mariner, PIF-Harbinger and Ping elements. Briefly, enzymes specific to each type of TE were compiled and used to perform local BLASTx searches against assembled genomes. Upstream and downstream sequences were then extracted on the basis of the expected length of each type of transposon and the patterns of target site duplications and terminal inverted repeats were screened for signatures distinguishing each type of transposon. The identified candidate sequences of each type of transposon were aligned, manually checked and clustered into families. Families that contained two or more members were kept. The insertion times of LTR-RTs were calculated as described by Du et al. 67 . Briefly, the Jukes-Cantor distance (K) of two LTRs was calculated in MEGA X 68 and the insertion time (T) was estimated as T = K/2r, where a substitution rate (r) of 1.3 × 10 -8 per synonymous sites per year was used 69 .

Identification of centromeric repeats and associated retrotransposons.
To identify centromeric repeats in perennial Glycine species, we performed a genome-wide search for all tandem repeats using Tandem Repeats Finder 70 . The identified repeats were then clustered into families at 80% similarity using SiLiX 64 The abundance of each family was estimated by k-mer analysis using Illumina short-read sequences with Jellyfish 71 and was compared with known centromeric repeat sequences from G. max (Williams 82 (av2)) 3 with BLAST+ 58 . The distribution of the top five most abundant repeats along each chromosome was manually checked to assess whether they possessed the features of centromere repeats. A neighbour-joining tree was created for identified centromeric repeats on the basis of random sampling of 200 repeats from each family, using MEGA X 68 . Divergence among three types of identified centromeric repeats was calculated on the basis of pairwise comparisons of 200 random sample repeats from each type using the formula above. CRs were identified by mapping paired-end short reads to a database created from the LTR repeats and centromeric repeats, using BLASTn 58 . LTR repeats that were linked by paired-end reads to centromeric repeats within a reasonable distance (~500 bp) were considered as candidate CRs. The degree of association of each CR was calculated using the formula: index = the number of mapped reads/(sequencing depth × family size).
Annotation of protein-coding genes. RNA-seq libraries were created using a mixture of leaf, root and flower tissue for each perennial Glycine species. Quality-trimmed reads were assembled de novo using Trinity and a reference-based assembly was generated using the HISAT2 and StringTie pipeline 72 . The de novo and reference-based assemblies were combined and the identical transcripts were removed using CD-HIT 73 . Fragments per kilobase per million mapped reads (FPKM) values generated by StringTie were used to estimated transcript abundances. Annotation of the genome was performed using the MAKER-P pipeline 74 . Species-specific TE libraries as well as the repeat database from G. max (Williams 82 (av2)) 3 and Repbase were used to mask the genome by RepeatMasker 75 . Assembled transcripts were used as evidence for expressed sequence tags and proteins downloaded from the NCBI protein database for G. max (Williams 82 (av2)) 3 and other flowering plant species 76 were used as evidence for proteins. To obtain confident annotations, we first performed gene model training with the ab initio gene prediction tools used by MAKER-P, including SNAP 77 and AUGUSTUS 78 and then conducted two rounds of MAKER annotation to obtain the final annotation file.

Phylogenetic analysis.
A set of clearly defined orthologous genes among M. truncatula, P. vulgaris, G. max (Williams 82), G. soja (accession W05) and selected perennial Glycine species were used to establish the phylogenetic relationships of these species and to estimate their divergence times. In total, 830 nuclear loci that were one-to-one orthologues for all species were aligned with MAFFT and alignments were trimmed with TrimAl using the 'automated1' option. Individual loci were concatenated into a super matrix using SequenceMatrix 79 to export in nexus format. A time-calibrated phylogeny was inferred using BEAST v.2.6.4 (ref. 80 ) using a birth-death tree before, a gamma site model with estimated substitution rate and proportion of invariant sites, with either a UCLN 81 or local random clocks 82 using secondary calibration points with a normal distribution before Koenen et al. 83 . Three different analyses were conducted using the following parameters: (1) a UCLN clock prior with the root age of phylogeny constrained . Constraint dates were based on a recent nuclear phylogenomic study of legumes and are older than previous estimates of divergence times based on plastid gene sequences 83 . The two UCLN models were run for 600 million generations sampling every 1,000 generations, while the random clocks model was run for 900 million generations sampling every 1,000 generations. The Markov chain Monte Carlo runs were checked for convergence using Tracer 84 for effective sample size values >200 after a 10% burnin. The maximum clade credibility tree annotated with median node ages and 95% HPD was generated with TreeAnnotator (part of the BEAST v.2.6.4 package) after a 10% burnin. Parameter set 1 using UCLN was chosen for downstream interpretations primarily because there is evidence that the random clocks may be over fitting the models for age estimates 83,85 while the estimate for the divergence of P. vulgaris and Glycine as a calibration before for parameter set 1 (19.81-32.85 MA) is closer to previous estimates in the literature 86  Genome collinearity analysis. Chromosome-level comparison of synteny among 26 recent reported annual Glycine accessions 5 , Williams 82 and our perennial Glycine species was performed. Amino acid sequences from each pair of species selected were compared using all-against-all BLASTP (e value <1 × 10 −10 ; similarity >95% and coverage >80). Protein pairs belonging to the reciprocal best hits were extracted and fed into MCScanX 87 to identify syntenic blocks between each pair of species. Except for P. vulgaris, for which each protein was expected to have two orthologous pairs in Glycine species, proteins were classified into 1:1 pairs (putative orthologues) on the basis of the symmetrical best hit within syntenic blocks that had the lowest P value between the two genomes. Other homologous genes within each family were considered to be co-orthologues or in-paralogues.
Gene classification. Perennial core orthologues were defined as those genes shared by all five diploid perennial Glycine species for which a 1:1 orthologous pair could be identified within syntenic regions. The remaining genes were considered to be non-core genes. Genes were further categorized on the basis of their duplication modes with the DupGen_finder pipeline 88 (https://github.com/qiao-xin/DupGen_ finder) using P. vulgaris as the outgroup reference. All-to-all BLASTp was performed between each studied Glycine species and P. vulgaris (e value 1 × 10 -10 ; similarity >70%; coverage >70%; the top five matches were kept if more than five hits met the preset requirements). The same parameters were applied for all-against-all self-genome BLASTp comparisons, except that the similarity was set >80%.
Comparison of K a , K s and K a /K s (ω) among annual and perennial Glycine species. The K a , K s and K a /K s were calculated for the annual and perennial Glycine groups with the KaKs_Calculator package using the YN method 89 . G. soja accessions W01, W02 and W03 were selected as representative species for annual Glycine species to ensure the consistency of our core gene analysis and minimize the effect of human selection during domestication. For each orthologous gene pair, the K a , K s and K a /K s were calculated on the basis of its orthologous gene identified in P. vulgaris as a reference. Coding sequences and their corresponding amino acid sequences were aligned using MUSCLE 90 and transformed into codon-aligned format with PAL2NAL 91 , low-quality alignments (similarity <85, coverage <85) were removed and the resulting alignment files were manually checked and fed into KaKs_Calculator. For the comparison between core genes and non-core genes, only genes shared by annual and perennial Glycine groups were selected. The evolutionary rate and selective pressure between annual and perennial Glycine species was tested using a Student's t-test at a significance level of P ≤ 0.05.

Comparative analysis of genes as singletons and duplicates.
For calculating the percentage of duplicated genes shared between each perennial species and G. max, we only focus on duplicated genes generated by WGD. Duplicates shared with G. max were defined as both copies (homoeologues) are retained in both species on the basis of the 1:1 orthologous gene pair identified above. The percentage of singletons shared between each perennial species and G. max was also calculated on the basis of the 1:1 orthologous pair identified. For the comparison of K a , K s and K a /K s between the annual and the perennial groups as WGS duplicated genes and singletons, only core genes shared by annual and perennial groups were selected.
Detection of genome rearrangements. Genome rearrangements among selected Glycine species were initially identified using the SyRI pipeline 92 with default parameters. Briefly, each pair of genomes was aligned using the NUCMer utility 93 (--maxmatch -c 500 -b 500 -l 100) and filtered with a delta-filter module (-m -i 90 -l 100). The resulting files were fed to SyRI to detect variations in genome structure. Identified genome rearrangements were further filtered on the basis of the genome synteny file generated above. Only rearrangements occurring within syntenic blocks were kept for further analysis. The comparison of genome stability among annual and perennial species was calculated using P. vulgaris as a reference. The variation within shared syntenic regions among P. vulgaris and each pair of genomes of Glycine species was compared. Structural rearrangements were considered to have occurred in species that differed from P. vulgaris. Identified structure variations were further validated using species-specific BESs as describe above. Ten inversions with clear boundaries were also selected for PCR validation.
Glycine gene orthologues related to perenniality. Homologous genes with syntenic regions shared by all seven G. soja accessions 4 and perennial Glycine species were selected for comparative analyses between the annual and perennial groups and within each group. Seven G. soja accessions were selected instead of three wild G. soja accessions sequenced by Liu et al. 5 to ensure that the sample size is comparable for downstream statistic analyse. Identification of genes showing adaptive evolution between annual and perennial Glycine groups was performed according to the approach described by McDonald and Kreitman 94 and the divergence was corrected by Jukes-Cantor 95 (http://mkt.uab.es/mkt). Genes that function in the flowering time-controlling pathway were identified on the basis of the annotation of their homologous genes in Arabidopsis using blastp (e value <1 × 10 -10 ). A neighbour-joining tree was created for each reference gene from Arabidopsis and its candidate orthologous gene in each Glycine species. The constructed phylogenetic tree was manually checked and genes that clustered within the same clade as the Arabidopsis gene were considered to be bona fide orthologous genes in each Glycine species. For each pair of species studied, the K a , K s and K a /K s were calculated for orthologous gene pairs using the method described above. The mean values of K a /K s from all shared genes or flowering genes were used as two levels of thresholds for purifying selection. GO enrichment analysis was performed on SoyBase (https://www.soybase.org/). Single nucleotide polymorphisms (SNPs) that distinguished annual and perennial Glycine species were extracted and their genotypes in annual Glycine populations were analysed in 302 soybean resequencing datasets 96 to identify SNPs related to annual/perennial speciation. The impact of identified sequence variations on protein structure was predicted with I-TASSER 97 .
Subgenome exchange in the allopolyploid. Structural variations between each subgenome (A t A t and D t D t ) and its ancestor genomes, G. tomentella D3 (AA) and G. syndetika (DD), were identified using SyRI as described above. Genetic exchanges between the two subgenomes in G. dolichocarpa were identified by mapping Illumina paired-end short reads of the two diploid contributors G. tomentella D3 and G. syndetika to the assembled tetraploid genome (A t A t / D t D t ). The mapping density along the chromosomes was calculated as density (d) = the number of reads from A/ the number of reads from D using a 10 kb window and was compared with syntenic regions of a simulated tetraploid genome (AA/DD) to ensure that no bias was caused by an unequal sequencing coverage. When d was notably <1 on the A t genome, this indicated a translocation from D t to A t and vice versa.
Mechanism of gene fractionation and deletion in the allopolyploid. The 1:1 orthologous pairs shared by G. tomentella D3 (A) and G. syndetika (D) were extracted and the presence/absence of an orthologous pair in the A t and D t genomes was examined. A subset of genes that had lost their orthologous pair either in A t or D t but not in both genomes was then selected to study the mechanism of gene fractionation and deletion. A Chi-squared test was performed to test whether the number of A/A t lost genes differed significantly from the number of D/D t lost genes (P ≤ 0.05) genome-wide. To study local effect on gene deletion, 200 fragments (100 kb) were random sampled from each subgenome and the number of genes lost in A t and D t of each orthologous pair was counted. Genes in each orthologous pair were also categorized as singletons or duplicates, on the basis of their duplication status in the A and D genomes. For the orthologous pair that one gene was annotated as singleton and another was annotated as duplicates, the percentage of lost genes in each category was calculated and tested according to their deviation from a 1:1 ratio, using a Chi-squared test. The K a /K s values and transcript abundance were also measured for the lost and conserved orthologous gene pairs, on the basis of their counterparts from the A and D genomes. To test the effect of TE insertion on gene expression, the distance of the nearest TE inserted into the upstream region of a gene was identified using bedtools 98 (closest -id -D a) and the correlation was compared for the orthologous pairs between A and D genomes. The distance for lost genes along a chromosome was calculated according to the number of genes that separated them. If this number was smaller than five, the two lost genes were considered to be in a cluster; otherwise, the gene was defined as an independently lost gene. To test whether lost genes had similar local expression levels, the correlation between the gene expression of the two nearest lost genes within the same cluster was calculated and compared with the correlation between the expression of the lost gene and its nearest conserved gene. For genes without an annotated orthologous pair in the tetraploid genome, genomic DNA from either the A or D genome was used as a query to perform BLASTn against the syntenic regions of the tetraploid subgenome to study the mechanism of fractionation/gene deletion (e value 1 × 10 -10 ; word_size 30; -qcov_hsp_perc 0.3). If genomic DNA could be identified, a local alignment of each exon in the lost gene was performed using BLAST+ and was analysed to examine the presence of mutations, indels and transposon insertions. This helps to identify the mechanism that regulates gene deletion/loss. Flanking sequences of each deletion identified were further manually checked for the presence of illegitimate recombination.

Investigation of TE activation following the formation of allopolyploid.
To study whether the formation of allopolyploid by interspecific hybridization triggers 'genomic shock' , which would lead to the activation and transposition of TEs between subgenomes, the predicted reverse transcriptases of LTR-RTs that were estimated to be amplified within the last 350,000 years were extracted from the A, D, A t and D t genomes. A neighbour-joining tree with 200 bootstrap replications was constructed for gypsy-and copia-type LTRs using MEGA X 68 for A t /D t and A/D genomes, respectively. The sequence divergence of transposons located on different subgenomes within each clade was calculated using the same general approach. TEs transposed between subgenomes were identified as those located on different subgenomes but placed in the same clade.
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
Raw sequences generated during this study are deposited in the public repository of the National Center for Biotechnology Information under accession number PRJNA44023. The annotated assemblies are deposited in the European Nucleotide Archive under accession number PRJEB44023. Additionally, the assembled genome data and gene annotation have been deposited in SoyBase (http://soybase. org/data/v2/Glycine/) for future visualization of interspecific genome content comparison that is under development.