Genomic insights into the recent chromosome reduction and polyploidization of complex autopolyploid sugarcane S. spontaneum

S. spontaneum is a founding Saccharum species that contributes stress resistance to the genetic background of modern sugarcane cultivars. Here, we have assembled the autopolyploid S. spontaneum Np-X genome with ancestral form into 40 pseudo-chromosomes in 10 homologous groups, revealing the recent chromosome reduction and polyploidization that occurred in Saccharum. The paleo-duplicated chromosomal pairs exhibit functional redundancy in Saccharum and underwent ssion followed by fusion accompanied by centromeric spreading around 0.80 million years ago (Mya) before evolving into their current forms with basic chromosome numbers x = 9 and x = 8 in S. spontaneum, likely in a stepwise manner. WGDs occurred independently in Saccharum species around 1.5 Mya. Highly diverse chromatin structures exist among homologous chromosomes despite their high collinearity, and the re-structuring of NpChr5 and NpChr8 might have suppressed switching of chromatin structure from inactive to active. Resequencing of 116 sugarcane accessions elucidated that the S. spontaneum originated from North India and that the basic chromosome numbers x = 8, x = 9, and x = 10 originated independently, indicating that recent chromosome reduction rather than polyploidization has driven the adaptive evolution of Saccharum. Our study provides genomic resources and suggests new directions for accelerating sugarcane improvement and advances our knowledge of the evolution of auto-polyploids. genome assembly for S. spontaneum Np-X using circular consensus sequencing (CCS) 12 and performed resequencing of 102 S. spontaneum accessions with various ploidy levels and basic chromosome numbers. In addition to describing the very recent chromosome reduction and polyploidization events in Saccharum, our study presents a hereditary blueprint for the genomic basis of sugarcane biology and sheds new light on the evolution of auto-polyploids. gene dened, (24.1%) alleles, (35.9%) (30.1%) (0.1%) with one, with an average of 3.4 alleles gene These results conrm that the chromosome-level are A/B compartment further investigated for the genes with a complete set of four alleles in the collinearity regions. genes alleles 2,240 genes conserved alleles, for all four alleles GO term analysis three type genes genes all four alleles in catabolic process FDR oxidoreductase FDR multicellular FDR catabolic the genes two three transferase S. S. 1.2 and 2.0 the estimation of ~1.6 Mya on the Ks values of orthologous gene pairs. The occurrence of the P6 peak ~0.89 Mya specically in Np-X and AP85-441 is consistent with the estimation of 0.8 Mya on the analysis of orthologous gene pairs. Two insertion identities of 91.8–92.4% (P8, ~0.55 Mya) and 96.4–96.8 (P9, Mya) Np-X TE insertion specically in Np-X. P2 Mya) P5 Mya) (Npp.04A015560.1, Npp.04A015590.1, Npp.04A015600.1, and Npp.04A015610.1) putatively involved in plant cell wall biosynthesis processes were also identied in the selective sweeps 44,45 , which could provide a foundation for the morphological differentiation of the three populations. In addition, some gene functions related to stress responses were detected, e.g., LOT1(Npp.05D013900.1) which enhances ABA response and plant drought tolerance 46 ; two genes (Npp.10B011680.1 and Npp.10B011690.1) putatively encoding PER3 and Npp.01B018240.1, which have been implicated in responses to low temperatures in plants 47,48 ; two genes (Npp.06C016710.1 and Npp.06C016720.1) with functions related to cold tolerance and that are induced by drought and ABA; two genes (Npp.01B030420.1 and Npp.05D025320.1) encoding a putative ethylene response factor; and 16 genes, such as CYP81F (Npp.01B031390.1 and Npp.05D025440.1), which likely encode enzymes putatively involved in oxidation-reduction processes and might be related to tolerance of drought stress 49,50 . These results indicated that the natural genomic sweep regions were likely involved in polysaccharide metabolism and stress tolerance during the chromosome reduction process. the molecular mechanisms by which sugar accumulation diverged among Saccharum species. have larger contributions to and adaptive implications on the population scale. transposable element using RepeatModeler novo and homology strategies including two de novo repeat-nding programs, 80 and RepeatScout 81 , we imported RepeatMasker to identify and cluster repetitive elements. Tandem repeats were identied using the Tandem Repeat Finder (TRF) 82 , and unknown TEs were classied using TEclass 83 . Next, the from the above to identify telomeres and centromeres. We also integrated results from LTR_FINDER 84 and LTRharvest 85 and removed false positives from the initial predictions using the LTR_retriever pipeline 86 . These LTRs were also classied as either intact or non-intact LTRs.


Introduction
Polyploidization is considered a major force in the evolution of plants, particularly in angiosperms, as fossil records indicated that up to 70% of the owering plant species originated shortly after polyploidization 1 . Polyploids can generally be classi ed into two major categories: auto-polyploids and allo-polyploids [2][3][4] . Although auto-polyploids are considered to have arisen by whole genome duplication (WGD) of identical chromosome sets within a single species 5,6 , the study of their genome evolution is hampered by the much lower availability of information than for allopolyploids due to the paucity of homologous chromosome-level genome assemblies in various auto-polyploid taxa.
Modern sugarcane (Saccharum spp., Poaceae) is a crucial crop with an economic value of 90 billion USD that provides 80% of the world's sugar and 40% of its ethanol yield, and its production by weight exceeds that of any other food crops such as rice, wheat, or maize (FAO, 2017). The sugar content of sugarcane juice varies from 1% to 24% Brix 7 . S. spontaneum is a founding Saccharum species that is widely distributed across a very large region from the Mediterranean to the Paci c. S. spontaneum is the genetic donor of stress tolerance to the genetic background of modern sugarcane hybrids, which has been a major breakthrough during sugarcane breeding. This species exhibits wide variation in chromosome numbers, which range from 2n = 40 to 2n = 128, and has ploidy levels ranging from 4x to 16x 7 . Moreover, S. spontaneum is particularly notable for the highest known degree of polyploidy within its genus as an hexadecaploid 7 , and exhibits three basic chromosome numbers 8 , x = 8, x = 9, or x = 10. The extensive variation in ploidy levels of S. spontaneum presents an extreme case for the study of the evolution of auto-polyploidy genomes in plants.
Recently, a mosaic monoploid genome for the modern sugarcane cultivar R570 (382 Mb) 9 , a haploid genome assembly for S. spontaneum AP85-441 (3.13 Gb) 10 , as well as a representative gene space assembly of the hybrid sugarcane variety SP80-3280 (4.26 Gb) 11 and an auto-octoploid S. o cinarum assembly (Zhang et al., under review) have been released. These genome assemblies have provided an opportunity to jointly characterize the evolutionary history of Saccharum. S. spontaneum Np-X (2n = 4x = 40), which grows along the Himalayas at over 1300 m above sea level, has the lowest total number of chromosomes among natural Saccharum accessions, and is the only accession with a basic chromosome number of x = 10 in S. spontaneum accession to our knowledge, so it likely represents the ancestral karyotype of S. spontaneum and the intermediate karyotype for the evolution of Saccharum. Here, we have generated a high-quality auto-tetraploid genome assembly for S. spontaneum Np-X using circular consensus sequencing (CCS) 12 and performed resequencing of 102 S. spontaneum accessions with various ploidy levels and basic chromosome numbers. In addition to describing the very recent chromosome reduction and polyploidization events in Saccharum, our study presents a hereditary blueprint for the genomic basis of sugarcane biology and sheds new light on the evolution of autopolyploids.

Results
An allele-de ned genome for S. spontaneum Circular consensus sequencing (CCS) can produce accurate reads from noisy individual subreads, representing a better tradeoff between read length and accuracy, thereby providing a better tool to tackle highly complex auto-polyploid genomes. We obtained a total of 52 Gb of PacBio CCS long reads from 909 Gb of raw sequence data and 417 Gb of Illumina short reads on the Sequel and HiSeq X platforms, respectively (Supplementary Table S1). The Canu (v1.9) 13 software package was used to perform initial de novo assembly of the S. spontaneum Np-X genome and yielded an initial contig-level assembly with an N50 of 405 Kb, a signi cant improvement relative to the previous S. spontaneum AP85-441 genome, which had an N50 of 45 Kb (Table 1 and Supplementary Note). The total size of this initial contig-level genome assembly was 2.76 Gb, which accounts for 97.53% of the Np-X genome size estimated by genome survey based on K-mers (2n = 4x, ~2.83Gb) ( Table 1, Supplementary  Fig. S1 and Supplementary Note). A total of 59.97 billion (98.73%) of 60.74 billion Illumina short reads could be aligned and covered 99.05% of the assembly (Supplementary Table S2), supporting that a high-quality contig-level genome had been obtained for further processing.
We then used ~105x of high-throughput chromosome conformation capture (Hi-C) data to scaffold the allele-aware chromosome-level autotetraploid S. spontaneum Np-X genome using ALLHiC 10,14 (Supplementary Fig. S2 and Table S3). A total of 2.74 Gb (98.96%) of contigs were anchored into 40 pseudo-chromosomes comprised of 10 homologous groups with four allelic chromosomes in each group ( Fig. 1 and Supplementary Table S4). Comparison of homologous haplotypes A, B, C, and D revealed 8.20 million SNPs, 0.66 million insertion/deletion (InDels), and 4,033 structural variations (SVs) among 10.73 Mb of sequence with heterozygosity of 1.39% in the S. spontaneum Np-X genome (Supplementary Fig. S3 and Supplementary Table S5). A total of 241 (96.27%) complete gene models among the 248 ultra-conserved core eukaryotic genes (CEGs) in CEGMA and 1,389 (96.46%) among 1,440 conserved genes in BUSCO were recalled in our assembly (Supplementary   Table S6, S7). The assembly allowed us to predict 30 potential centromeric regions with lengths ranging from 0.59 to 8.63 Mb and 43 telomere regions with monomer copy number from 104 to 1,218 along the 40 chromosomes (Supplementary Table S8, S9).

Allele-de ned genome annotation
Allele-de ned gene annotation is the key for characterizing autopolyploid genomes. Based on our homologous chromosome-level assembly, we annotated a total of 123,128 protein-coding gene models (Table 1 and Supplementary Table S4) and 93.5% (115,166) of these gene models were functionally annotated via searches of the NR, GO, KEGG, Swiss-Prot, and KOG databases (Supplementary Table S10 and Extended Data 1). Due to the polyploid nature of this genome, many annotated gene models are naturally allelic variations of the same gene. Analysis of allele de nition showed that 35,830 gene models were well de ned, including 8,645 (24.1%) genes with four alleles, 12,861 (35.9%) with three, 10,781 (30.1%) with two, and 3,543 (0.1%) with one, with an average of 3.4 alleles per gene (Table 1, Supplementary Table S11 and Extended Data 2). These results con rm that the chromosome-level S. spontaneum Np-X assembly and gene annotation are well-organized and allele-de ned.
We identi ed a total of 1,590 Mb transposable elements (TEs) accounting for 57.52% of the assembled S. spontaneum Np-X genome (Supplementary Table S12), similar to their proportions in the genomes of AP85-441 10 Table  S13).
The genome re-structuring in S. spontaneum The genome of S. spontaneum Np-X shows strict synteny with that of sorghum except for a sorghum-speci c inversion on SbChr04 10,15 (Fig. 1c). The decrease in the basic chromosome number from 10 to 8 in S. spontaneum AP85-441 was caused by ssion following by fusion of its ancestral sorghum chromosome homologs 5 and 8 as well as of rice chromosome homologs 11 and 12, which are paleo-duplicated chromosome pairs (PdCPs) that originated from the ρ event of the Poaceae 10,16,17 . However, the chromosome forms and numbers resulting from these events were not found in S. spontaneum Np-X ( Fig. 1c and 1d). Relative to sorghum, two inversions occurred in S. spontaneum AP85-441 chromosomes APChr2AB and APChr7AB and three chromosomal fragments occurred in AP85-441 chromosome APChr6ABD 10 , but these chromosomal variations were not detected in S. spontaneum Np-X or the related genus Miscanthus (Supplementary Fig. S4). These results indicated that chromosomal inversions occurred after the event resulting in the chromosome number decrease in S. spontaneum AP85-441, further con rming that S. spontaneum Np-X retained the chromosome forms of the last common ancestor (LCA) of S. spontaneum.
To analyze the scenario of the hypothesized rejoining of Chr5 and Chr8, we analyzed synteny between these S. spontaneum Np-X and S. spontaneum AP85-441 chromosomes (Fig. 2) and found that the recombination breakpoints in S. spontaneum Np-X were located on the centromeres of NpChr5 and NpChr8. We found that S. spontaneum AP85-441 APChr5 is comprised of the ancestral short arms of S. spontaneum Np-X NpChr5 and NpChr6, and APChr6 is comprised of the ancestral long arms of NpChr5 and NpChr7, and that centromere-speci c sequences were retained and spread over longer distances in both APChr5 and APChr6 (Fig. 2b). However, although APChr2 is comprised of the ancestral long arms of NpChr8 and NpChr2, its centromere-speci c sequences were retained from only ancestral NpChr2. In contrast, APChr7, which is composed of the ancestral short arms of NpChr8 and NpChr2, retained two centromere-speci c sequences from the two Np-X chromosomes. Comparative analysis of the homologous genomic regions in Np-X and AP85-441 showed that a 6.2 Mb genomic region of NpChr5 containing 34 genes and a 6.1 Mb segment of NpChr8 containing 16 genes appears to have been lost in the centromeric regions of the AP85-441 genome (Fig. 2c, Supplementary Table S14 and S15).
To further investigate the concerted evolution of the PdCPs (NpChr5 and NpChr8) in Saccharum, close paralogous gene pairs were identi ed in Chr5 and Chr8 in Np-X as well as in the corresponding chromosomes of AP85-441 (Chr5A (57-89 Mb) and Chr6D (54-90 Mb) compared with Chr2 (98-125 Mb) and Chr7 (62-83 Mb)), Miscanthus 16 (Chr10 and Chr14), sorghum 15 (Sb05 and Sb08), and rice 18 (R11 and R12) (Fig. 2a). The numbers of pairs of gene conversions in each genome in descending order are: 848 pairs in rice, 213 in sorghum, 265 in Np-X, and 114 in AP85-441 ( Supplementary Fig. S9, S10). Obviously, AP85-441 retained far fewer converted genes pairs than did Np-X. These phenomena could be the reason that re-structuring of the PdCPs results in the absence of homeologous genes during the concerted evolution of the chromosomal ancestors of NpChr5 and NpChr8.

Gene redundancy on PdCPs in Saccharum
The genes on NpChr05 and NpChr08 displayed lower expression levels than those located on the other chromosomes in the examined tissues, and a similar phenomenon was observed in the AP85-441 (Fig. 2e) and S. o cinarum genomes ( Supplementary Fig. S11). Moreover, the homologous regions between these two chromosomes displayed signi cantly higher average synonymous substitution rates (Ks, 0.036) within than did the homologous regions of other chromosomes (0.022) in Np-X (Fig. 2f), suggesting a more rapid evolution of the PdCPs.

Change in the chromatin status of PdCPs across the Poaceae
To investigate the evolution of the three-dimensional (3D) genome in the S. spontaneum Np-X auto-tetraploid, 1,937 million valid interaction read pairs were used to construct chromatin interaction maps for each set of homolog chromosomes at 500-Kb resolution. The frequencies of intrachromosomal interactions displayed a rapid decrease with increasing linear distance ( Supplementary Fig. S12). We found that 32.9-64.2% and 35.8-67.1% of regions on the homologous group chromosomes exist as compartment A (active regions) and compartment B (inactive regions), respectively (Supplementary Fig. S13 and S14), indicating that the distribution of chromatin compartment is disproportionate in the autopolyploid.
We further analyzed chromatin status in PdCPs of AP85-441 10 , Np-X, sorghum 15 , and rice 18 . In these four genomes, 44.8-47.5% and 52.5-55.2% of genomic regions exist in compartment A and in compartment B, respectively (Supplementary Table S17). We found that the genes of the homologous chromosomes of NpChr2/5/7/9 (74.0%-87.0%) reside mainly in the most conserved regions among the four genomes, while the genes (41.1-49.9%) that underwent switching from the B to A compartment reside mainly in the homologs of NpChr2/6 (Supplementary Fig. S17 and Fig. S18). It is noteworthy that NpChr5 and NpChr8 displayed the lowest degree of regions exhibiting B to A compartment switching among the homologous chromosome sets in Np-X compared to AP85-441 (Fig. 2d), and all these gene located in the regions of B to A compartment switching are higher expressed in AP85-441 than in Np-X ( Supplementary Fig. S19), indicating that the re-structuring of NpChr5 and NpChr8 might have suppressed switching of chromatin status from inactive to active.
TEs have long been considered as key genomic features that can trigger genome instability and are thus an important source of evidence for exploring the evolutionary history of genomes. In the four genomes (Np-X, AP85-441 10 , S. o cinarum, sorghum 15 , and Miscanthus 16 ) we examined, LTR retrotransposons appear to have undergone continuing and recent ampli cation bursts ranging from 0 to 2 Mya (Fig. 3b) The most recent LTR/Gypsy sequence was used as a reference to identify TE hits, and 10 distinct insertion peaks with identities ranging from 65% to 98% were detected in the four genomes ( Fig. 3c and 3d). A Gaussian probability density function (GPDF) analysis estimated that the earliest TE insertion events (P1 and P3) occurred at ~2.0 Mya in Saccharum (

Genes related to the key characteristics of Saccharum
Given that Np-X exhibits the ancestral chromosome forms of S. spontaneum, comparative analysis of the genes in the three Saccharum and sorghum genomes may offer clues to the evolution of key agronomic characteristics of Saccharum. We thus analyzed the core gene families related to sugar accumulation, photosynthesis, and leaf width in Saccharum.
Genes encoding sugar metabolism enzymes and sugar transporters: We identi ed the key gene families related to sugar metabolism including sucrose phosphate synthase (SPS) 19,20 , invertase (INV) 21 , sucrose synthase (SUS) 22,23 , and fructokinase (FRK) 24 from the three Saccharum genomes ( Fig. 4a and Supplementary Table S18). Compared to sorghum, the number of INV genes had undergone expansion in S. o cinarum, which exhibited high sugar content, but were relatively conserved in both of the S. spontaneum genomes, indicating that expansion of INV genes might be a prerequisite for evolution of sugarcane with high sugar content.
Sugar transporters are crucial for the allocation of sugars to sink and source tissues during the development of plants [25][26][27] . We identi ed a total of 119 genes that are likely members of sugar transporter superfamilies including PLT, STP/HXT, INT, VGT, pGlcT, SFP, TMT, and SWEET ( Fig. 4a and Supplementary Table S18). The PLT, SUT, and SFP gene families expanded in the two S. spontaneum genomes, and the STP and VGT families showed more expansion in Np-X than in AP85-441, indicating that the expansion of the STP and VGT families in S. spontaneum Np-X could be a transitional intermediate state. In contrast, the SWEET gene family only expanded in S. spontaneum Np-X, suggesting that contraction of this gene family might have occurred in S. spontaneum AP85-441 after genome re-structuring. In addition, 19 STP and 23 PLT genes had expanded due to tandem duplications in AP85-441, while expansion of the PLT family was only observed in Np-X, indicating that the tandem duplications of STP genes in Saccharum occurred after speciation.
C4 photosynthesis pathway: The C4 photosynthesis pathway was discovered in sugarcane 28,29 and has been considered to have higher energy e ciency necessary for generating large biomass. We identi ed members of nine core gene families that participate in the photosynthesis pathway (CA, PEPC, PEPC-K, NAD-ME, PPDK, NADP-MDH, NADP-ME, PPEK-RP and PEPCK) in the three Saccharum genomes and that of sorghum ( Fig. 4a and Supplementary Table S19). Compared to sorghum, gene expansions occurred in the NAD-ME gene family in both Np-X and AP85-441, but not in S. o cinarum (Fig. 4a, Supplementary Fig. 25 and Table S19). NpPEPC1, NpPEPC-k1, NpNADP-MDH2, NpNADP-ME2, NpPPDK1, NpPPDK-RP1, NpCA1, and NpCA2, are preferentially expressed in leaves, suggesting that these eight genes might be related to C4 photosynthesis in Np-X, which is consistent with observations in AP85-441 10 and S. o cinarum (Zhang et al., under review). However, with the exception of NpPEPC1, transcripts of other genes putatively involved in photosynthesis displayed much lower abundance (average TPM = 21.0) than in either AP85-441 (average TPM = 216.9) or S. o cinarum (average TPM = 320.9) (Fig. 4b). Thus, it seems that the types of component genes involved in the C4 photosynthesis pathway in Saccharum might have converged as well as the regulation of their respective expression.
Narrow leaf genes: S. o cinarum has much larger leaves than S. spontaneum, and among S. spontaneum varieties, Np-X has much smaller leave than AP85-411 (Fig. 1a). About six NARROW LEAF (NAL) genes controlling leaf width have been reported in rice [30][31][32][33][34][35] , and among them, NAL1 was shown to have the strongest effect on leaf width. We identi ed 13 NAL genes in the three Saccharum genomes (Fig. 4a, 4c and Supplementary Fig.  S26), and gene expression analysis showed NAL1 and NAL10 transcripts to be expressed at much lower abundance in Np-X than in either AP85-441 or LA-Purple, and both appear to be down-regulated compared to their expression in stem (Fig. 4c), suggesting these as candidate genes affecting leaf width.
The origination and independent polyploidization of S. spontaneum We generated a total of 4,682 Gbp of resequencing reads for 102 S. spontaneum genomes and the genomes of 14 related species for population genetics analysis, using the reference S. spontaneum Np-X genome ( Fig. 5a and Supplementary Table S20). A total of 3,345,380 high-con dence variants including 3,140,400 SNPs and 204,980 InDels were identi ed in these data. Using the genomes of the 14 related species as outgroup (including Sorghum, Miscanthus, S. o cinarum and S. robustum), principal component analysis (PCA) of SNPs showed substantial genetic diversity among S. spontaneum groups (Fig. 5b). PC1 (35.81%) and PC2 (15.64%) separated the S. spontaneum accessions from the outgroup accessions while PC1 vs. PC3 clearly divided the S. spontaneum accessions into four groups, which was consistent with our Maximum Likelihood (ML)-based phylogenetic analysis ( Fig. 5a and 5c). The four groups displayed continuous geographic distribution from the Indian subcontinent to eastern and southern Asia. Group I accessions were primarily distributed in the Pakistan, India, and southwestern China (Himalayan region); Group II accessions were mainly distributed in southeastern China including Fujian and Taiwan provinces, and the Philippines; Group III accessions were mainly distributed in southern China including Yunnan, Guangdong, and Hainan provinces, and in Myanmar and Laos; and group IV accessions were primary distributed in Indonesia and India (Fig. 5b). It is noteworthy that the recorded progenitor S. spontaneum of modern sugarcane hybrids, Glagah, clustered into group IV.
Linkage Disequilibrium (LD) decays among the resequenced accessions were relatively low with an average rate of 0.086, supporting that these accessions had not experienced arti cial selection ( Supplementary Fig. S27). The π values of Groups were 0.24×10 -3 , 0.31×10 -3 , 0.29×10 -3 , and 0.26×10 -3 , respectively ( Supplementary Fig. S28), which veri es the results of our LD analysis. Genetic differentiation values (F ST ) between any two of the four groups ranged from 0.047 to 0.084 ( Supplementary Fig. S28), and each of the neighbor group pairs exhibited lower F ST values in comparison to any two of the other four groups.
Further, we estimated admixture proportions and individual ancestry based on the SNP data set (Fig. 5d, Supplementary Fig. S29 and Fig. S30). In the admixture plot at K = 5 (Fig. 5d), each of the ve groups exhibited distinct relative monophyletic ancestry matching our ML phylogenetic analysis, and further supports that a low level of genetic exchange between the four S. spontaneum groups due to their geographic isolation. Only weak gene ow was detected between Group I and Group II (P-value = 2.2e-308; F_statistic = 0.397) (Fig. 5e). Further, population structure analysis showed that 12 accessions in Group I had some Group II ancestry and that 11 accessions in Group II had some Group I ancestry (Fig. 5d), supporting the limited gene ow between the two groups. All of these ndings indicate that the four S. spontaneum groups evolved relatively independently after they originated on the Indian subcontinent (Group I) and spread step by step to the regions where Group II, Group III, and Group IV are now distributed.
Consistent with the results of a previous study 10 , our phylogeny showed that S. spontaneum accessions of diverse ploidy are clustered together within each group, con rming that the different ploidy levels of S. spontaneum were originated and diverged independently in each of these four groups.
The evolution of basic chromosome numbers in S. spontaneum With the availability of the S. spontaneum Np-X genome (x = 10), basic chromosome numbers can be determined based on the coverage of mapping reads on the restructured chromosomes (NpChr5 and NpChr8), speci cally by looking at read coverage across breakpoints for evidence of restructuring. Examples of x=9 typically has signs of restructuring around NpChr5, and not NpChr8 (Fig. 6a).
Totals of 4, 7, and 91 of the S. spontaneum accessions have basic chromosome numbers of x = 10, x = 9 and x = 8, respectively. It is noteworthy that the S. spontaneum accessions with x = 8 were distributed across all four groups, while the accessions with x = 10 and x = 9 formed a clade that was nested within Group I and were mainly located in Pakistan, northern India, and Tibet China (Fig. 5c). As accessions with basic chromosome numbers of x = 9 and x = 10 are only found in Group I, while those with basic chromosome numbers of x = 8 appear in all four groups of S. spontaneum accessions, we assumed that the uid ploidy levels have evolved independently from ancestral progenitors with basic chromosome numbers of x = 8 in three groups: Group II, Group III, and Group IV. In addition, gene ow is undetected between any pairs of these three groups of accessions (Fig. 5e). We explored the population demographic history of S. spontaneum using SNP data with a PMSC (Pairwise Sequentially Markovian Coalescence) model 31 , which revealed that the S. spontaneum population experienced a prominent effective population (N e ) expansion (N e ~ 500,000) around 12 to 14 Kya (thousand years ago) and a subsequent N e contraction (N e ~ 10,000-60,000) around 8 to 1.4 Kya (Fig. 6d). Interestingly, the time of the N e expansion corresponded to that of the Marine Isotope Stage (MIS) 5e interglacial period, the warmest interval of a warming phase during which the earth emerged from an extreme glacial phase 36-40 , according to benthic oxygen isotope data. The time of the N e contraction corresponded to the Younger Dryas cold event that occurred ~1.1-1.2 Kya during which the global climate changed dramatically [41][42][43] . The S. spontaneum population of accessions with a basic chromosome number of x = 10 diverged demographically from the S. spontaneum populations with basic chromosome numbers of x = 9 and x = 8 with a much smaller N e in recent history.
The accessions of S. spontaneum with a basic chromosome number of x = 8 exhibit a high level of genetic diversity and are distributed over a broader geographical range than the other two forms of accessions with basic chromosome numbers of x = 9 and x = 10, indicating that the genome re-structuring likely contributed to the vigor and adaptation of S. spontaneum with a basic chromosome number of x = 8.
To investigate the genetic basis of likely tness advantage of x = 8 in S. spontaneum, XP-CLR was used to detect natural selective sweeps. We identi ed selective sweep signatures for 25.7 Mb of the genome sequence with 338 candidate genes putative involved in the reduction of basic chromosome number from x = 10 to x = 8 ( Supplementary Fig. S33 and Extended Data 5). The identi ed genes located within selective sweep regions are mainly enriched in the molecular function 'polysaccharide binding', suggesting divergence in the genetic control of related processes after the genome re-structuring in S. spontaneum. It is noteworthy that 13 of the genes found in the selective sweeps putatively encode glycosyltransferase, glycosyl hydrolases, STARCH SYNTHASE 1 (Npp.10B005850.1), sucrose-phosphate synthase (Npp.03C039020.1), cellulose synthase (Npp.01B008750.1), and SWEET (Npp.04A016530.1), indicating that divergence of genes involved in carbohydrate metabolism pathways likely occurred after the S. spontaneum genome re-structuring. Interestingly, four tandem FUTs (Xyloglucan fucosyltransferase) (Npp.04A015560.1, Npp.04A015590.1, Npp.04A015600.1, and Npp.04A015610.1) putatively involved in plant cell wall biosynthesis processes were also identi ed in the selective sweeps 44,45 , which could provide a foundation for the morphological differentiation of the three populations. In addition, some gene functions related to stress responses were detected, e.g., LOT1(Npp.05D013900.1) which enhances ABA response and plant drought tolerance 46 ; two genes (Npp.10B011680.1 and Npp.10B011690.1) putatively encoding PER3 and Npp.01B018240.1, which have been implicated in responses to low temperatures in plants 47,48 ; two genes (Npp.06C016710.1 and Npp.06C016720.1) with functions related to cold tolerance and that are induced by drought and ABA; two genes (Npp.01B030420.1 and Npp.05D025320.1) encoding a putative ethylene response factor; and 16 genes, such as CYP81F (Npp.01B031390.1 and Npp.05D025440.1), which likely encode enzymes putatively involved in oxidation-reduction processes and might be related to tolerance of drought stress 49,50 . These results indicated that the natural genomic sweep regions were likely involved in polysaccharide metabolism and stress tolerance during the chromosome reduction process.

Discussion
The allele-de ned genomes in our study clearly suggested that the founding Saccharum species are auto-polyploid, as a low level of divergence was detected among alleles based on Ks estimates (Ks = 0.00 or 0.01). Very recently, a whole-genome analysis demonstrated that Miscanthus originated from an allo-tetraploid event, con rming the independent WGD before the divergence of the Miscanthus and Saccharum genera 51 , which disagrees with a previous study that suggested Saccharum shared an allopolyploid event with Miscanthus 16 . In S. spontaneum, the WGD of the x = 10 form (Ks = 0.01) likely occurred prior to that of the x = 8 form (Ks = 0.00) according to estimates of allelic divergence. Considering the x = 10 form to be the common ancestor, the WGD for x = 8 form must have been followed by reduction of the basic chromosome number from x = 10 to x = 8. Thus, we conclude that the different basic chromosome numbers of S. spontaneum originated from independent WGDs, and that the WGD in Np-X was close in time to the split of the x = 10 and x = 8 forms, as alleles within Np-X and Np-X/85-411 show similar levels of synonymous nucleotide divergence (Ks = 0.01). Likewise, the polyploidization of two founding Saccharum species, S. spontaneum and S. o cinarum, likely occurred independently. S. spontaneum and S. o cinarum are very recent auto-polyploids, and S. spontaneum became clearly separated from the rest of the Saccharum species (Fig. 3f). S. spontaneum and S. o cinarum likely originated from a common ancestor with a basic chromosome number of x = 10 at ~1.6 Mya. Further, S. robustum distributes phylogenetically together with S. o cinarum (Fig. 5c) and exhibits fewer interchromosomal rearrangements than does S. o cinarum 52 . Thus, we assumed these three Saccharum species originated from the autopolyploidization event.
Studies of recent genomic polyploidization are rare because genomic analysis of auto-polyploids is notoriously challenging. The auto-polyploid Saccharum genomes can bridge ancient and recent polyploidization events as the Saccharum genomes have experienced a recent WGD (less 0.80 Mya) and retained a set of PdCPs that originated from a much older WGD that occurred ~80 Mya affecting all cereal species. Ks estimates for the PdCPs NpChr5 and NpChr8 in S. spontaneum Np-X indicate relatively higher levels of nucleotide diversity between alleles (Fig. 2f). Interestingly, the NpChr5 centromere-speci c sequences were found to have been retained and spread on both APChr5 and APChr6. Such functional redundancy likely resulted in genes on one PdCP, NpChr5, displaying the lowest transcript abundance among genes on the 10 S. spontaneum Np-X chromosome groups (Fig 2d). Ancestral NpChr5 split into two major segments and experienced translocations in the ancestors of NpChr06 and NpChr07 in all of the examined accessions with a basic chromosome number of x = 9 ( Fig. 1d and Fig. 2ab). Therefore, we hypothesize that the ancestral NpChr5 split was the rst step in the chromosome number reduction followed by rearrangement of ancestral NpChr8 to evolve into the x = 8 form. However, we cannot exclude the possibility that the ancestral NpChr8 split was the initial step for the chromosome reduction as we might not have identi ed all of the accessions with a basic chromosome number of x = 9. We might answer the questions raised by Kim et al 16 and consider the split of NpChr5 in x = 9 form as a stepping stone in the process of chromosome number reduction rather than assuming that the x = 9 form originated from crosses between the x = 8 and x = 10 forms. However, we can still hypothesize that S. spontaneum evolved after recent polyploidization followed by chromosome number reduction in diploids (Fig. 6e).
Although there is no signi cant global homologous genome dominance in S. spontaneum, over 60% of its genes display differences in allelic expression 10 . The auto-tetraploid S. spontaneum Np-X genome harbors four alleles for each gene, which results in broad redundancy of gene function in this species compared with diploid species. Our analysis of A/B compartment organization showed that only 1,164 (13.7%) of alleles sets were located in the regions in which all four alleles are conserved among the four homogenous groups, suggesting that auto-polyploidy might alleviate the functional redundancy caused by multiple alleles by dynamically regulating the spatial structure of chromatin in some regions.
Similarly, in allo-tetraploid cotton, asymmetric A/B compartment switching was revealed between two subgenomes during polyploidization 53 , which suggests that distinct spatial structures were retained in two sets of chromosomes even though they exhibit similar genome sequences.
However, regions of NpChr5 and NpChr8 displayed the highest degrees of A to B compartment switching among the homologous chromosome sets in S. spontaneum Np-X compared to S. spontaneum AP85-441 (Fig. 2d). Thus, chromosome rearrangement might play as important roles in evolutionary dynamics as does polyploidization in auto-polyploids.
S. spontaneum Np-X plants have much narrower leaf widths and lower biomass compared to those of S. spontaneum AP85-441 and S. o cinarum. Biomass production is mainly attributed to photosynthetic productivity in plants 54 , and the recent polyploidization in S. spontaneum Np-X has likely affected only the regulation of its C4 photosynthesis pathway as the core gene family members in the photosynthesis pathway are conserved but their expression levels vary among the three species (Fig 4b). The candidate C4 genes display much lower transcript abundances in S. spontaneum Np-X than in either S. spontaneum AP85-441 or S. o cinarum. Sugarcane performs an NADP-ME sub-type of photosynthesis, in which expression of the C4-type NpNADP-ME2 (TPM = 32.59), is much lower than that of APNADP-ME (TPM = 519.63) or SoNADP-ME (TPM = 277.49), re ecting the relatively lower e ciency of decarboxylation and lower activity of photosynthesis in Np-X. However, transcript abundances of the sugarcane homolog of the most important gene controlling leaf width identi ed in rice 55,56 , NAL1, in descending order are SoNAL1, APNAL1, and NpNAL1 (Fig. 4c), which are correlated to the leaf width of the three species. Understandably, because of the recent WGD and chromosome reduction in Saccharum, the transcriptional regulation rather than gene gain/loss is likely a driving force in the strong environmental adaptability of S. spontaneum, and the divergence of transcriptional regulation of genes controlling the photosynthesis pathway and leaf width might form the genetic basis of the divergence in biomass productivity among Saccharum species.
The recent divergence of Saccharum less than 1.6 Mya has resulted in great variation in sugar content of its species. Among the gene families related to sugar metabolism, INV is the only gene family expanded in the high sugar-content species S. o cinarum while the other genes examined were relatively conserved in the Saccharum genome. Whereas, in sugar transporter super families, with tandem duplication, the STP gene family had expanded in AP85-411, and the PLT family had expanded in both S. spontaneum varieties. The members of these gene families appear to have evolved independently in Saccharum species. Genes encoding proteins involved in sugar metabolism, such as SPS 57 , FRK 24 , SUSY 58 , and sugar transporters such as SUT 59 , SWEET 60 display differential expression between S. spontaneum and S. o cinarum. Similar to our explorations of the genetic basis of biomass, the regulation of sugar accumulation via transcriptional regulation rather than the gain or loss of functional genes is the main avenue of exploration of the molecular mechanisms by which sugar accumulation diverged among Saccharum species.
S. spontaneum is widely distributed from Indonesia to New Guinea and Japan through the Indian subcontinent to the Mediterranean and Africa. A previous study suggested that S. spontaneum accessions could only be divided into two major groups 61 because a limited number of molecular markers in that study were not su cient for exploring the genetic divergence of these large and highly polyploid genomes. Based on whole genome sequencing, our previous study showed that 64 S. spontaneum accessions could be divided into three distinct groups 10 , and, the current study clustered the largest genetic collection thus far of 102 representative S. spontaneum accessions, into four groups. The phylogeny and geographical distribution indicated that the S. spontaneum was originated from the North of India, a biodiversity hotspots near the Himalayas with a climate that has been in uenced by both east and south Asian monsoons 62 , and then radiated along Middle East, East Asia and South-East Asia mainly with founder species of basic chromosome number of x=8. Obviously, S. spontaneum with x=8 is more adaptive to environment and facilitating diversi cation and radiations. However, the x = 10 form in Saccharum including S. spontaneum, S. robustum, and S. o cinarum has only very limited geographical distributions. This phenomenon might be explained by the recent WGD only generating genetic redundancy rather than genetic variation, while both the recent genome re-structuring as well as genes involved in selective sweeps for functions related to polysaccharide metabolism and stress tolerance, have larger contributions to the adaptive evolution in S. spontaneum. Methods Sampling and sequencing. Fresh young leaves were collected from individual S. spontaneum Np-X plants cultivated in a greenhouse kept at 25-30℃ with 16 h light per day. Genomic DNA were isolated from these young leaves using a CTAB method 63 . The extracted DNA was then packaged with dry ice and sent to BioMarker company (Beijing, China) for construction of circular consensus sequencing (CCS) libraries and Illumina short read libraries that were subsequently sequenced on the PacBio Sequel and Illumina Hiseq platforms, respectively. Totals of ~52 Gb of PacBio HiFi reads and ~417 Gb of Illumina reads were generated for de novo assembly of the S. spontaneum Np-X genome.
Total RNA was extracted from mature stems and leaves of S. spontaneum Np-X using TRIzol (Invitrogen, Carlsbad, California). These two RNA samples were sent to Novogene (Beijing, China) for construction of transcriptome libraries and were sequenced on an Illumina platform.
Hi-C library construction and sequencing. Fresh tender leaves were collected from S. spontaneum Np-X plants and used to prepare chromatin crosslinked to DNA and xed with formaldehyde as described previously 10 . The xed samples were sent to BioMarker (Beijing, China) for Hi-C library construction and sequencing. Two Hi-C libraries were constructed after digestion with HindIII restriction endonuclease and sequenced on the Illumina HiSeq platform. Finally, a total of ~290 Gb (~105x) Hi-C reads were obtained for genome scaffolding (Supplementary Table S3).
Genome survey for genome size estimation. About 417 Gb of Illumina short reads from S. spontaneum Np-X was ltered using Trimmomatic 64 (v.0.36) software with default parameters. The clean reads were used to create Kmer set using Jelly sh 65 (v. 2.2.6). The genome size was estimated using K-mer (K=21) frequency-based methods with K-mer number/K-mer Depth. The genome size of S. spontaneum Np-X was estimated as 2,832.7 Mb with a heterozygosity rate of 1.14% based on a 21-Kmer distribution (Supplementary Fig. S1).
Genome assembly and scaffolding. About 52 Gb of CCS clean reads was used to assemble the S. spontaneum Np-X genome with Canu (v1.9) 13 using the optimized parameters "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50", Hicanu 66 with parameters batOptions="-eg 0.0 -sb 0.001 -dg 0 -db 3dr 0 -ca 2000 -cp 200", and Hi asm 67 with parameter "-l0 -u", respectively. Assemblies of 2.76 Gbp and N50 length of 405 Kbp by Canu (v1.9) 13 , 2.97 Gb and an N50 length of 3,541 Kbp by HiCanu 66 , and 2.89 Gb and an N50 length of 1,962 Kbp by HiFiasm 67 resulted. However, many chimeric contigs were identi ed in the latter two assemblies so that they could not be scaffolded with Hi-C using ALLHiC 14 . We attempted to solve the chimera problem using Hi-C interaction signals, but the scaffolds still aborted after chimera correction. After a comprehensive comparison of the results obtained by the above different assembly methods, the best-assembled Canu (v1.9) 13 version was nally chosen to serve as the reference genome for subsequent analysis.
A total of 290 Gb Hi-C reads were mapped on the contig-level assembly using BWA (v.0.7.15) 68 software with default parameters. The mapping results were pruned using an in-house script that generated a BAM le. All the contigs were reordered and scaffolded using the ALLHiC 10,14 pipeline. Finally, we generated a pseudo-chromosome assembled genome that included 40 chromosomes with a total length of 2,760 Mb and N50 of 67.73 Mb (Table 1). We mapped about ~60.7 Gb of Illumina short reads onto the assembled chromosome-level genome using Bowtie2 69 to estimate the quality of the assembly.
Gene annotation. Transcriptomic data were obtained from RNAs isolated from stem and leaf tissues of S. spontaneum Np-X. Trimmomatic 64 (v. 0.36) software was used to further lter RNA-Seq data, and then HISAT2 70 , which blocks duplicates, was used to align reads to the reference genome sequence. After reads were aligned, different coverage thresholds were set according to the sequencing depth of each aligned region to obtain reliable intron and optimal transcript information. Then we used TransDecoder (https://github.com/TransDecoder/TransDecoder/wiki) to predict the ORFs of optimal transcripts and de ne gene models. The optimal gene models were then screened and trained using AUGUSTUS 71 software. We chose protein sequences from closely related species of maize, sorghum, rice, sugarcane, and used them as input for Genewise (https://github.com/brewsci/homebrew-bio/blob/master/Formula/genewise.rb) software for gene prediction in the S. spontaneum Np-X genome.
We next collected exon information for homologous proteins and transcripts, and collected intron information by comparing reads. AUGUSTUS 71 software was used to combine the above intron and exon information for gene prediction. The results of the above three methods were integrated, and then the PFAM database 72 was used for screening to obtain nal gene prediction results. Finally, we used a Perl script to analyze the nal assembled genome for eukaryotic genes and obtained a total of 123,128 high-con dence gene models.
Gene function annotation. Gene functions in the S. spontaneum Np-X genome were predicted using the best matches of the alignments as queries against the eggNOG 73 , Nr 74 and Swiss-Prot 75 databases using BLASTP (E-value = 1e-5). The eggNOG, Nr, and Swiss-Prot databases were downloaded to our local server. Unigenes were then used to query the NCBI non-redundant nucleotide sequence (Nt) database using BLASTN with a cut-off E-value of 1e-5. We used a Perl script provided by EBI (https://www.ebi.ac.uk/) for InterPro annotation. The script sends the sequences to the o cial web server at InterProScan 76 for InterPro annotation and returns the results to our local server. In order to determine whether the proteins encoded by these genes might participate in any functional pathways, all gene models were aligned (E-value = 1e-5) to the KO database. The putative functions of the gene models were predicted and classi ed using the Clusters of Orthologous Group (COG) database and the Eukaryotic Orthologous Groups (KOG) database. The online KEGG Automatic Annotation Server was used to assign assembled sequences to Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways.
Genome collinearity and evaluation. Collinearity between the S. spontaneum Np-X, S. spontaneum AP85-441 10 , sorghum 15 , Miscanthus 51 and rice 77 genomes was analyzed using MCScanX 78 with default parameters, and all of the orthologous and paralogous gene pairs were identi ed based on synteny blocks. The paired synonymous substitution rates (Ks) were calculated using the Nei-Gojobori method (https://github.com/tanghaibao/bio-pipeline/blob/master/synonymous_calculation/synonymous_calc.py.) Identi cation of A and B compartments. The principal component analysis method in the HiTC package (v. 1.28.0) 79 was applied to identify A and B compartments in S. spontaneum Np-X, S. spontaneum AP85-441, Sorghum and Rice. For this analysis, each chromosome was divided into consecutive 500-kb regions from the contact maps. Genomic regions belonging to the A compartments usually contain more genes than those corresponding to B compartments. To identify regions that had switched A/B compartment status during genome polyploidization, we considered only regions that showed changes of rst principal component (PC1) values from positive to negative or vice versa in both biological replicates.
Repeat sequence annotation. Consensus transposable element (TE) sequences were generated using RepeatModeler with a combination of de novo and homology strategies including two de novo repeat-nding programs, RECON 80 and RepeatScout 81 , which we imported into RepeatMasker (http://www.repeatmasker.org/) to identify and cluster repetitive elements. Tandem repeats were identi ed using the Tandem Repeat Finder (TRF) package 82 , and unknown TEs were classi ed using TEclass 83 . Next, the outputs from the above processes were used to identify telomeres and centromeres. We also integrated results from LTR_FINDER 84 and LTRharvest 85 and removed false positives from the initial predictions using the LTR_retriever pipeline 86 . These LTRs were also classi ed as either intact or non-intact LTRs.
LTR burst time estimation. The most recent and longest LTR/Gypsy sequence was selected as the representative sequence for detecting additional TE hits in the genomes. Full-length and truncated LTRs with various lengths and identities were identi ed across genomes, and then each sequence (length = l) was divided into 30-bp units to determine the number of dots (TE hits) (n = l/30) with the same identity following a previously reported method 87 . All dots were used to generate a box plot according to their identities. Single peaks in the TEs identity distribution curves were separated for GPDF tting and burst time calculation, and the average nucleotide substitution ratio (K) was de ned as 2.58 standard deviations (σ). The formula t = K/r was used to calculate the TE burst time point for single peaks, where r refers to the nucleotide substitution rate for sugarcane species (6.5×10 −9 ).
Population genomics analysis. The high-con dence set of variants was then used to perform population genetic analysis. Values of π and Tajima's D were calculated using VCFtools 89 in 500-Kb windows and 100-Kb steps based on the high-con dence ltered SNPs. PLINK 90 was used to perform principal component analysis and to transform the VCF le into a Plink binary le for input. The results of PCA were plotted using R 91 . ADMIXTURE 92 was then applied to infer population strati cation among the 102 sugarcane accessions using the prede ned number of genetic clusters K from 1 to 10. After the best value of K was calculated, the population structure of S. spontaneum was inferred using fastStructure 93 for K = 1 through K = 10. A maximum likelihood tree was constructed using RaxML 94 and the format conversion of the input le was performed using vcf2phylip (https://github.com/edgardomortiz/vcf2phylip). Finally, Figtree (https://github.com/rambaut/ gtree/releases) was used to visualize the tree.
Analysis of population demographic history. We inferred a demographic history for S. spontaneum by applying the PSMC (Pairwise Sequentially Markovian Coalescence) model 35 to the complete diploid genome sequences. This method reconstructs the history of changes in population size over time using the distribution of the most recent common ancestor (tMRCA) between two alleles in an individual. Consensus sequences were obtained using SAMtools 88 . Bases with low sequencing depth (less than a third of the average depth) or high depth (twice the average depth) were masked. The analysis was performed using the following parameters: -N25 -t15 -r5 -p '4+25*2+4+6'.The mutation rate per generation per site was 6.5 ´ 10 -9 and g = 2. PSMC modeling was performed using a bootstrapping approach, with sampling performed 100 times to estimate the variance of the simulated results.

Declarations Data deposit
All raw sequencing data were deposited into the Sequence Read Archive (under Bioproject accession PRJNA721787). Genome assemblies and annotation les are stored in NCBI at the same accession and SGD (Sugarcane Genome Database) (http://sugarcane.zhangjisenlab.cn/sgd/html/index.html). Source data are provided with this paper.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.