Pacbio HiFi and ONT sequencing
A total of 70 rice accessions from Oryza AA genome group were including in this study, comprising 51 Asian rice O. sativa, 11 wild O. rufipogon, four African rice O. glaberrima and four wild O. barthii samples (Supplementary Table 1). The rice plants were cultivated at the China National Rice Research Institute (CNRRI), Hangzhou, China. For Pacbio HiFi sequencing, we newly generated Pacbio HiFi data for 46 accessions, and the HiFi reads for 24 accessions were adopted from previous studies (Wu et al., 2023; Shang et al., 2023; Zhang et al., 2022; Sedeek et al., 2023). High-molecular-weight DNA from a single individual was extracted from young seedlings for each accession using the CTAB method. DNA degradation and contamination was scanned on 1% agarose gels. DNA purity and concentration were quantified using a NanoDrop One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA) and a Qubit 4.0 Fluorometer (Invitrogen, USA). According to the methods in the SMRTbell prep kit 3.0 kit manual, HiFi libraries for sequencing were prepared, including DNA quality control, DNA shearing and cleanup, DNA repair and A-tailing reaction, adapter ligation, nuclease treatment and library size selection. Sequencing was conducted on the PacBio Revio platform following the manufacturer's instructions..PacBio HiFi reads with quality scores over Q20 were generated from raw sequencing reads using PacBio Circular Consensus Sequencing tool ccs (min passes = 3, min RQ = 0.99). The sequencing depth ranged from 23.8× to 49.5× across genomes with an average of 30.4× (Supplementary Table 1).
For ONT sequencing of ten samples, genomic DNA was extracted using a QIAGEN Genomic DNA extraction kit (Cat #13323, QIAGEN). DNA purity and concentration were determined using a NanoDrop One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA)and a Qubit 3.0 Fluorometer (Invitrogen, USA), respectively. Long DNA fragments were selected using the PippinHT system (Sage Science, USA) via gel cutting. Subsequently, DNA repair and adapter ligation were performed using an SQK-LSK110 kit. The prepared DNA library was then loaded onto a PromethION flow cell (R10.4.1). Base modification information was called using Guppy (v6.4.6). Reads with quality scores less than seven were discarded. For 50 samples, ONT reads were from previous studies (Shang et al., 2022; Zhang et al., 2022) and downloaded from NCBI Sequence Read Archive (BioProjects PRJNA656318 and PRJNA692836) and National Genomics Data Center (BioProject PRJCA008812). Additionally, available whole-genome NGS data for 64 accessions were adopted from previous studies (Shang et al., 2022; Shang et al., 2023) (Supplementary Table 1).
CENH3 Chromatin immunoprecipitation and sequencing (ChIP-seq)
ChIP experiments for ten samples were performed following standard protocols (Nagaki et al., 2003). Approximately 5 g of 4-week-old seedlings were well grounded in liquid nitrogen for 1 hour and then transferred into 50 ml Eppendorf tube with 20 ml nuclei isolation buffer (10 mM Tris-HCl pH 7.5, 3 mM CaCl2, 2 mM MgCl2, 0.1 mM PMSF, 1 EDTA-free protease inhibitor cocktail tablets (Roche) and 0.5% Tween 40). After incubating on ice for 15 min, the mixture was filtered through one layer of Miracloth (Calbiochem 475855). The liquid was centrifuged at 600 rcf for 10 min at 4℃. Discard the supernatant and resuspend the nuclei pellet with 5 ml 1 × TBS buffer (10 mM Tris-HCl pH 7.5, 3 mM CaCl2, 2 mM MgCl2, 0.1 mM PMSF, 1 EDTA-free protease inhibitor cocktail tablets (Roche)) with 25% sucrose, and then added on top of 1 × TBS buffer with 50% sucrose gently and centrifuged at 1,500 rcf for 10 min at 4℃. Collect the nuclei and digest the chromatin into nucleosome by MNase in MNB buffer (10% sucrose, 50 mM Tris-HCl pH 7.5, 4 mM MgCl2, 1 mM CaCl2, 0.1 mM PMSF) for 6 min at 37℃. The nucleosomes were incubated with pre-immune rabbit serum or CENH3 antibody at 4℃ overnight, and was then incubated with rProtein A SepharoseTM Fast Flow (GE Healthcare) for 2 hours at 4℃. Preimmune rabbit serum was used as the negative control. The rProtein A Sepharose was collected by centrifugation at 13,000 rpm for 1 min at 4℃, and washed once using Buffer A (50 mM Tris-HCl pH 7.5, 10 mM EDTA, 0.2 mM PMSF) with 50 mM, 100 mM, and 150 mM NaCl. Immunoprecipitated DNA-protein complexes were eluted twice from the beads with elution buffer (1% SDS, 50 mM NaCl, 20 mM Tris-HCl pH 7.5, 5 mM EDTA) at 65℃ for 15 min. The DNA was extracted with equal volume of phenol:chloroform (1:1) and the aqueous layer was transferred to new tube. The aqueous layer was then precipitated with isopropanol, sodium acetate, and glycogen at -20℃ for 2 hours, followed by washing with 70% alcohol once and air-drying. The DNA extraction, precipitation, and washing were performed by centrifugation at 12,000 rpm for 10 min at room temperature, then resolved with TE buffer and stored at -20 ℃. The anti-CENH3 ChIP-seq library was constructed using rabbit polyclonal anti-CENH3 against the rice-specific peptide. ChIP samples were PCR amplified for 12 cycles and sequenced (2×150bp) on the MGISEQ-200 platform.
Genome assembly
Among the 70 assemblies analyzed, 67 were newly assembled for this study. Three previously released assemblies (NIP, MH63 and ZS97), assembled completely using extremely high sequencing depth of long reads, were downloaded from the Rice Information GateWay (RIGW) and Rice Super Pan-genome Information Resource Database (Shang et al., 2023; Song et al., 2021). Newly generated PacBio HiFi reads with adaptor sequences were filtered using HiFiAdapterFilt (Sim et al., 2022). For 50 accessions with both PacBio HiFi and ONT reads, we integrated HiFi and Nanopore data together to generate draft de novo assemblies using Hifiasm (0.19.5-r587)(Cheng et al., 2024) and Verkko (v1.4)(Rautiainen et al., 2023), respectively. For the remaining 17 accessions with only PacBio HiFi reads available, we used the HiFi-only mode of Hifiasm and Verkko for assembly, respectively. The scaffolded assembly of Verkko was split into contigs and used to scaffold and extend Hifiasm contigs using a in-house script. The extended contigs (larger than 10 Kb) were then scaffolded on chromosomes using RagTag with no further chimeric splitting (v2.1.0, scaffold -f 2000 --remove-small --aligner minimap2)(Alonge et al., 2022), using T2T assembly of NIP as the reference genome (Shang et al., 2023). To correct base-level assembly errors and mitigate potential PacBio HiFi sequencing homopolymer errors, we polished the gap-filled assemblies using both HiFi and NGS reads. High-frequency k-mers (top 0.02%) were generated using meryl (v1.4)(Rhie et al., 2020) and Winnowmap2 (v2.03)(Jain et al., 2022) aligned HiFi reads against the assemblies. NextPolish2 (Hu et al., 2024) was then employed for polishing the assembly with default parameters. Two k-mer datasets (21-mer and 31-mer) were prepared using yak (https://github.com/lh3/yak). For five red rice accessions (Sedeek et al., 2023) and one weedy accession YCW03 (Wu et al. 2023), assemblies were not polished using short reads, due to their unavailability. To prevent potential over-scaffolding based on homology, we detected and trimmed the potential mis-joins at both ends of each chromosome. Telomere regions were identified by searching the rice telomeric repeat motif TTTAGGG with the occurrence of at least five consecutive repeats. Non-telomeric sequences outside the telomeric region (< 2 Mb) were then pruned from the assemblies.
Assembly quality assessment
We first manually checked and corrected the potential large structural errors in assembly by aligning to the NIP reference. Assembly quality for each genome was evaluated using a series of indices from completeness, correctness and continuity. For completeness, NGS short reads and long reads (PacBio HiFi and ONT) were aligned to their corresponding chromosomes for each sample with BWA mem (0.7.17-r1188)(Li & Durbin, 2009) and Winnowmap2 (v2.03)(Jain et al., 2022) in default parameters, respectively. The mapping statistics was generated by using samtools flagstat (v1.18-15-g9a59467) (Li et al., 2009). Averagely 97.62% of short reads were aligned, while wild rice O. rufipogon had significantly lower mapping rates (~95.23%) compared to O. sativa (~98.35%), due to the nature of higher heterozygosity in wild population (Supplementary Table 1). ~99.92% and ~98.71 of PacBio HiFi and ONT reads were mapped. The mapping coverage along each chromosome was calculated using samtools depth and the coverage distribution was checked visually to detect assembly collapsed regions in 10-Kb windows. For correctness, we used QV and discordant k-mers in VerityMap to assess the base accuracy. NGS short reads and Pacbio HiFi reads were split into 21-mers and combined as a k-mer library using meryl (v1.4) (Rhie et al., 2020) and rare k-mers were removed (meryl gt 1). Assembly correctness indicator QV and assembly completeness indicator k-mer completeness were calculated using Merqury (v1.3) (Rhie et al., 2020). It should be noted that NGS data used here was sequenced after polymerase chain reaction (PCR) amplification, thus we analyzed the potential effects of PCR on QV. We sequenced PCR-free NGS libraries for four samples (CW06, CW15, SL044 and SL177) using Illumina Novaseq 6000. Using only NGS k-mer library to calculate QV, PCR NGS data yielded a lower QV than PCR-free data, but varied among samples (46.4 for PCR-free QV versus 25.8 for PCR QV in CW06, 47.1 versus 23.2 in CW15, 47.4 versus 40.1 in SL044 and 43.5 versus 44.2 in SL177). We also employed VerityMap (Bzikadze et al., 2022) to detect assembly errors and heterozygosity sites in a reference-free mode. VerityMap identifies discordant k-mers between the assemblies and PacBio HiFi reads. High frequency of discordant k-mers indicate potential errors in local assembly. For continuity, we detected potential assembly gaps with low-confidence read supports using CRAQ (Li et al., 2023) and GCI (Chen et al., 2024)(Supplementary Fig. 1). We aligned the PacBio HiFi reads and NGS reads against corresponding assemblies and calculated AQI values to identify regional and structural assembly errors based on clipped alignment information. GCI is a continuity inspector for complete genome assembly, by utilizing stringently filtered alignments from multiple aligners (minimap2 and Winnowmap2 used in this study) to detect assembly gaps. We quantified the assembly overall continuity using GCI scores at the whole-genome and chromosome levels.
Evaluation of centromere assembly completeness
To assess the assembly quality of rice centromere sequences, we first determined the centromere regions based on the enrichment of rice CEN155 satellites. The quality assessment metrics (e.g. QV. GCI) were calculated for each centromere (Supplementary Fig. 2). We zoomed in and manually checked the raw and curated reads mapping coverage and the issues reported by GCI and VerityMap in the centromere regions. Additionally, given the challenges in assembling highly repetitive sequences accurately and completely, we compared the centromere sequences solely generated by Hifiasm (Cheng et al., 2024) and Verkko (Rautiainen et al., 2023) to eliminate the effects of assembly algorithms. For each accession, two sets of contig assemblies from Hifiasm and Verkko with no gap filling and polishing were anchored onto chromosomes using RagTag, respectively. The CEN155 consensus sequence was then aligned to each chromosome using BLASTN (v2.5.0+, -use_index true -task megablast -evalue 1e-6). We counted the copy numbers of CEN155 satellites in each centromere and compared the centromeres with no gaps in both Hifiasm and Verkko assemblies (Supplementary Fig. 2).
Phylogeny and ancestry analysis
Short Illumina paired-end reads from each sample were aligned to the T2T NIP reference assembly (Shang et al., 2023) using BWA (Li & Durbin, 2009). For six samples lacking NGS short reads (YCW03, Pulut Hitam-2, Zag, Cempo Ireng, Balatinaw, and Cempo Abang), simulated 120-bp reads were generated from their genome assemblies and mapped against the T2T NIP assembly. Whole-genome SNPs across all 70 samples were jointly called following the GATK best practices (McKenna et al., 2010). A phylogenetic tree encompassing these 70 rice accessions was constructed using FastTreeMP (Price et al., 2009) with 1000 bootstrap replicates. This tree was utilized to elucidate the taxonomic relationships and groupings among the rice accessions.
Gene annotation
Gene elements were annotated using a homology-based approach. We employed Gmap to align transcripts from the MSU v7.0 annotation of Nipponbare with the parameters '-n 0 --min-trimmed-coverage 0.7 --min-identity 0.7', and kept alignments with at least 70% coverage and 90% identity (Wu & Watanabe, 2005). Cleaned RNA-seq reads were mapped to the corresponding genome assembly using TopHat2 (Kim et al., 2013), and gene expression was quantified using Cufflinks (Trapnell et al., 2010).
TE annotation
To characterize repeat elements across the 70 high-quality rice genomes, a hybrid approach combining de novo prediction and homology-based search strategies was employed. Initially, a de novo repeat database was constructed using assemblies from 16 representative rice accessions (including 18XHB-57, 18XHB-83, CW03, CW06, CW11, CW15, MH63, NH271, NH285, NIP, SL028, SL039, SL044, SL121, YZ-2 and ZS97). The de novo transposable element (TE) library was generated separately for each accession using RepeatModeler with the parameter '-LTRStruct' (Tarailo-Graovac & Chen, 2009). These individual TE libraries were integrated into a raw library consisting of 37,589 records, and subsequently clustered and merged into a non-redundant library. To remove redundant records, pairwise alignment of TEs from each accession and TE library of rice in repbase (Repbase 28.09) using BLAST were performed. Records showing alignment lengths over 60% of the element's length and identities over 60% were considered as redundant records. A total of 23,541 redundant records were eliminated from the final library. For further filtering, specific accession de novo TEs library and that from Repbase was used to mask 16 representative accessions by RepeatMasker separately (Tarailo-Graovac et al., 2009). TEs, that did not overlap with records in other libraries, had an accumulated length over 5 Kb and more than 10 instances in presence, were selected as candidate records. Subsequently, 341 non-redundant TEs identified in 2 to 15 accessions were clustered based on similarity using CD-HIT (Li & Godzik, 2006). The final non-redundant library comprised 3,847 sequences, including 3,321 records from the rice TE library (Repbase 28.09), 341 homologous TEs and 185 accession-specific TEs. All genome assemblies were masked against the newly-built non-redundant TE library using RepeatMasker. For elements classified as 'UNKNOWN' by RepeatModeler in more than 50% of cases, a convolutional neural network-based approach (deepTE) was applied for reclassification (Yan et al., 2020). Full-length elements were defined based on the annotation results, initially merged based on identity and alignment length to obtain non-redundant or overlapping entries. Specifically, full-length LTR retrotransposons were characterized by the canonical LTR-Internal-LTR structure, with the intervals between LTR and internal units being less than 10% of the consensus length, and the overall length deviating by no more than 10% from the consensus. To ensure accuracy, candidate full-length LTRs meeting these criteria were extracted and compared against consensus sequences, applying a minimum coverage threshold of 90%. Solo-LTRs were defined as units lacking TE annotations of the same family within a distance of 10% of the consensus length.
Centromere annotation
The Tandem Repeat Annotation and Structural Hierarchy (TRASH) pipeline (Wlodzimierz et al., 2023), available at https://github.com/vlothec/TRASH, was utilized for the prediction of centromeric satellite repeats across genomes. The pipeline was configured with parameters "--simpleplot --horclass CEN155 --frep 2 --par 5 --k 1000". Genomic sequences were segmented into 1-Kb windows to assess local k-mer counts and identify repetitive regions. Score of each window was determined based on the proportion of repeated k-mers, with regions scoring above a defined threshold considered to contain repeats. Tandem repeats within these windows were characterized by the distances between identical k-mers, aiding in the identification of consensus sequences such as CEN155. To pinpoint centromeric regions within each rice chromosome, all satellite repeats identified as CEN155 from the TRASH results were extracted and mapped according to their chromosomal positions. CEN155 arrays, indicative of centromeric blocks, were defined where the spacing between repeat units did not exceed 150 Kb. The largest contiguous block on each chromosome was considered as the candidate centromeric region. Subsequently, the CEN155 consensus sequence for each rice accession was aligned back to its respective genome assembly using BLASTN (v2.5.0+) under default settings. These alignments were integrated into the centromeric blocks based on their proximity to adjacent CEN155 arrays. Final validation of centromeric regions involved manual inspection, confirming regions supported by both TRASH predictions and BLASTN alignments as definitive centromere locations. Synteny analysis was performed to validate the predicted centromere boundaries across 69 rice accessions by extracting and examining the 800-kb flanking regions surrounding the centromere in the NIP reference genome for collinearity with other rice centromere assemblies. Sequence identity heat maps were plotted for each chromosome using StainedGlass (Vollger et al., 2022), providing a visualization strategy to confirm centromere boundaries.
Centromere similarity and haplotyping
Centromere classification was initiated based on pairwise similarity. Firstly, pairwise alignments were performed between all centromeres divided into 2-Kb windows using minimap2 (v.2.24) with parameters (-f 5000 -s 400 -ax ava-ont --dual=yes --eqx) (Li, 2018). Segment windows with identities over 95% were merged, and the cumulative length was designated as the identical length between centromeres. The cumulative length between different centromeres was normalized by the length of the query centromere, highlighting structural variations such as deletions causing asymmetry between the upper and lower triangles in the heat map (Fig. 2). Next, specific haplotypes were assigned to each centromere. Haplotypes were manually defined based on identity heat maps generated from 2-Kb windows of different centromeres. Candidate reference haplotypes were established, comprising sequences from XI and GJ that shared at least 95% of full length with two or more accessions. The longest regions among candidate centromeres for each haplotype were selected as reference haplotype sequences. Subsequently, windows split into 2-Kb segments from each centromere were classified into corresponding haplotypes based on the identity among all haplotypes, when the alignment covered at least 85% of the window size. In cases where multiple haplotypes exhibit the same identity, the window was assigned to the haplotype with the fewest mismatches against reference haplotype sequences.
Structural variations and SNPs around centromeres
HiFi reads from each accession were aligned to the reference assembly of NIP using minimap2 (Li, 2018) with parameters (-ax map-hifi -R "@RG\tID:${pre}\tLB\tSM:${pre}") and SVision (v.1.4) (--min_mapq 10 --min_sv_size 50) (Lin et al., 2022) was utilized to generate structural variation (SV) calls. SVs within a 500-Kb upstream and downstream region of each centromere were extracted for statistical analysis. Phylogenetic trees of all accessions were constructed based on SNPs located in the 500-Kb upstream and downstream regions of each centromere with FasttreeMP (Price et al., 2009).
Satellite clustering
The CEN155 repeats identified by TRASH were classified as canonical satellites if their lengths ranged from 140 bp to 170 bp. Across the 70 rice genomes, a total of 249,952 unique CEN155 sequences were detected. Initially, satellites occurring frequently (>30 times) across all accessions were selected. Additionally, satellites predominantly (>90%) found in specific populations (GJ, XI, AUS, Oruf, Obar, and Ogla) were included, despite lower frequencies ranged from 5 to 30. A collection of 6,665 representative CEN155 repeats were finally obtained. These selected satellites were aligned using MUSCLE (Edgar, 2022) with default parameters, followed by phylogenetic categorization into 15 subfamilies using FastTreeMP (Price et al., 2009). To further classify the remaining 243,287 unique satellites, Levenshtein edit distances were calculated in an all-vs-all comparison against the 6,665 categorized satellites. Each unique satellite was assigned to the superfamily of the assigned CEN155 with the closest edit distance.
CENP-B box and pJα motif analysis
To investigate the DNA binding sites of rice centromeres, we introduced a scoring matrix based on the CENP-B box sequence (5'-PyTTCGTTGGAAPuCGGGA-3') and the pJα motif (5'-TTCCTTTTPyCACCPuTAG-3'), originally characterized in the human centromeres. Each k-mer (k= 17) from the query sequences underwent global alignment scoring using a dynamic programming algorithm against the two binding motif sequences. Additional penalties and bonuses were applied at conserved loci. All match scores of all monomer in NIP were used to explore the criteria for defining B-box-like elements. The screening score was set to be top 1% of all scores with a threshold set at 24, denoting the match of all conserved sites. According to the score distribution, the k-mers with scores above 25/27 were considered as candidate B-box/pJα-like elements. To enable rapid scanning of genome-wide elements across multiple genomes, the k-mers harboring B-box-like elements from NIP genome were integrated using MEME v5.4.1 with parameters '-w 17 -dna -revcomp' to construct a B-box-like motif library (Bailey et al., 2015). Subsequently, B-box-like elements were identified across the whole genome with e-value < 1e-05 against the motif library for B-box and pJα motif separately, utilizing FIMO in MEME with the parameters '--max-strand'.
Structural organization analysis of CEN155 arrays
We resolved the structure of satellite arrays in a progressive compression strategy, merging identical units iteratively (Fig. 4a). Homogenization regions of monomer (moHRs) were identified where more than five consecutive satellites from the same family were present. TEs embedded in the satellite arrays were masked. Fine analysis of satellite organization within each moHR was performed using identity dot plots and NTRprism. Pairs of identical satellites within moHRs were selected to investigate internally local collinearity using DAGchainer (Haas et al., 2004). NTRprism (Altemose et al., 2022) was employed to infer the most abundant tandem repeat periodicities by counting the interval lengths between adjacent instances of each 6-mer. For a canonical HOR array, a distinct peak in the interval spectra was expected at the HOR unit length (Altemose et al., 2022). Using the moHR-compressed satellite array, we defined homogenization regions of dimer (diHRs) containing at least five replicates of dimers. For determining homogenization regions of multimer (muHRs), a de Bruijn graph was initially constructed based on the dimer-compressed satellite string of each centromere, where the vertex set comprises all monomer superfamilies and the edge set consists of all pairs of its consecutive monomers superfamily. For instance, for a dimer-compressed centromere sequence “ABCDABCD”, the edge set would include the pairs A-B{2}, B-C{2}, C-D{2}, D-A{1}. Candidate multimers were identified by exhaustively enumerating all cycles in the graph using a depth-first search (DFS) strategy. We observed that ancient multimers might not exhibit a unified head-to-tail pattern like in the human centromeres, due to local expansion, random insertion or deletion. Candidate multimer calls were filtered out based on criteria of multiplicity less than 2, traversal less than 3 and adjacent multimer spacing greater than three times the multimer length. The multiplicity of an edge (A, B) in the graph was defined as the number of times the monomers family A followed the monomer family B in each dimer-compressed centromere. Traversal represented the occurrence frequency of a cycle (multimer) in each dimer-compressed centromere string. To identify multimer hierarchically, candidate nested-multimers were mapped onto the discrete dimer-compressed centromere and detect overlapping regions with continuity and occurrence. If the merged multimer presented better continuity and higher occurrence, it would replace the nested one.
Centrophilic LTR analysis
To identify centrophilic LTRs, which preferentially localize within centromeres, we conducted a statistical analysis based on the annotation of full-length LTRs. LTRs that constituted more than 90% of all full-length LTRs in centromeres and exhibited a significantly higher preference for the centromere region compared to the chromosome arms were designated as centrophilic LTRs. To characterize the core domains of these LTRs, ORFs within the internal domain of consensus elements from the pan-TE database were extracted using getorf in EMBOSS (Rice et al., 2000). Subsequently, domain predictions were performed using HMMER (v3.4) (-E 1e-5) against Pfam hidden Markov models, including PF00026, PF13650, PF08284, PF13975, PF00077, PF09668 for protease; PF03732 for GAG; PF00665, PF13683, PF17921, PF02022, PF09337, PF00552 for integrase; PF17917, PF17919, PF13456 for RNase H; and PF00078 for reverse transcriptase (Mistry et al., 2021). Given the presence of non-autonomous LTRs in the pan-TE library, the phylogenetic relationships of LTRs were resolved separately based on the 5’ LTR, 3’ LTR, and internal sequences. Multiple sequence alignments were conducted using MAFFT (--anysymbol –maxiterate 1000), and conserved sites were extracted using trimal (Katoh et al., 2013; Capella-Gutiérrez et al., 2009). Phylogenetic trees were constructed using FastTreeMP (-nt -gtr) and visualized with iTOL (Price et al., 2009; Letunic & Bork, 2021). Pairwise alignments among full-length centrophilic LTRs were performed and visualized using VISTA (Mayor et al., 2000). Identity between 5’ LTR and 3’ LTR of centrophilic LTRs was calculated based on full-length LTRs existing within centromeres. Alignments for each LTR pair were conducted using MAFFT, and identities were computed using ape (Paradis & Schliep, 2019). To map the insertion sites of SZ-22 elements, 150-bp regions flanking each element were extracted and queried against the CEN155 database using BLASTn. Centromeric elements were filtered based on close alignments to the same monomer (2,258 out of 3,673 sequences). The midpoint between aligned flanking sequences determined the insertion positions, and their distribution along the CEN155 consensus sequence was visualized.
Pairwise comparison of rice centromeres
We introduced a new algorithm termed SynPan-CEN, to compared satellite arrays in single monomer scale. This approach leverages satellite clustering information and pairwise genetic distances to identify syntenic pairs of CEN155 satellites between rice centromeres. Initially, we conducted an exhaustive all-to-all comparison of monomer sequences from two satellite arrays using pairwise edit distance (ED) to quantify similarity. We refined synteny construction by conservatively selecting monomer pairs in a correct order, incrementally increasing the ED threshold until the framework encompassed more than 20% of total monomers or 100 monomers. Using DAGchainer software (Haas et al., 2004), which computes chains of syntenic genes based on sequence homology and genomic coordinates, we identified syntenic satellite regions and syntelogs. DAGchainer employs a scoring function considering the distance between neighboring elements on each DNA molecule and BLAST E-value scores between homologs, facilitating the identification of maximally scoring chains of ordered element pairs. For the analysis here, DAGchainer parameters (-g 1 -D 20 -A 6 -Z 50 -e -3f) were set to retain blocks with the highest scores, with non-overlapping monomer pairs added to complement the chain backbone. Using a window of ten syntenic pairs, we applied sliding windows along the fixed chain backbone to extract each region. Local syntenic pairs were identified using DAGchainer, maintaining consistency with the parameters mentioned above. Blocks with the highest scores were preserved, and additional non-overlapping syntenic monomer pairs were incorporated. To refine our results, we incrementally tested various ED thresholds and observed a significant impact on identifying optimal syntenic pairs of CEN155 satellites. Syntenic satellite pairs increased as ED values were below 15, but decreased beyond this threshold. This phenomenon is likely due to the rapid accumulation of potential syntenic paths after exceeding the ED threshold, which complicates the determination of the optimal paths by DAGchainer (Haas et al., 2004). In cases where sliding windows presented excessive alignment possibilities, we subdivided windows into two sub-regions, each containing five syntenic pairs, mitigating potential mismatches.
Sequence divergence and mutation rate in centromeres
Edit distances between syntenic CEN155 satellite pairs inferred by SynPan-CEN were calculated. Continuous gaps at the beginning or end of a monomer alignment were systematically removed to minimize potential mismatches. Within alignments, consecutive gaps were treated as a single transition, and insertion counts were used exclusively to quantify edit distances. To optimize the alignment process, we compared three widely-used programs—MUSCLE (Edgar, 2022), MAFFT (Rozewicki et al., 2019), and ClustalW (Thompson et al., 2002)—using default parameters for pairwise monomer sequence alignments. Based on alignment performance and computational efficiency, MAFFT and ClustalW consistently demonstrated superior reliability compared to MUSCLE. Consequently, MAFFT was selected for following analyses. A phylogenetic tree was constructed using all syntenic satellites between chromosomes, where the synteny information was visualized with links in the circle tree plot. Orthologous pairs were identified based on their proximity to each other in the tree structure. The satellite pair, where the two CEN155 were clustered as a monoclade, were confirmed as orthologs, while pairs lacking such proximity were more likely to be paralogs.
DNA methylation analysis using PacBio HiFi data
For the 46 samples whose HiFi reads were newly generated in this study, 5mC methylation probability was determined by combining kinetics from multiple passes in HiFi sequencing. HiFi reads with methylation MM and ML tags were aligned against the corresponding assembly using pbmm2 (v1.2.0) (https://github.com/PacificBiosciences/pbmm2). We employed pb-CpG-tools (https://github.com/PacificBiosciences/pb-CpG-tools) to call base-level methylation status in the model ‘pileup_calling_model.v1’. Given that HiFi-based 5mC signals were predominantly enriched in the CG islands, we focused exclusively on the methylation signals at CG sites along chromosomes, therefore excluding methylation information for CHH and CHG contexts.
DNA methylation analysis using ONT data
ONT sequencing data was utilized for methylation analysis. Methylation probabilities in CG, CHH, and CHG contexts were called using the ONT reads basecaller Dorado (v.0.5.3) from the raw pod5 files, with the basecalling model options: ‘--modified-bases-models dna_r10.4.1_e8.2_400bps_hac @v4.3.0_5mC_5hmC@v1 --emit-moves’. The base-called read sequences were aligned against their corresponding genome assemblies. Modkit (v.1.0, https://github.com/nanoporetech/modkit) was utilized to generate summary counts of modified and unmodified bases in bedMethyl files. DNA methylation levels were quantified using ‘modkit pileup’ based on only primary alignments. The consistency of methylation quantification between PacBio HiFi reads and ONT reads was assessed by randomly selecting 10,000 CG sites, using Pearson correlation as the evaluation metric.
ChIP-seq data alignment and processing
CENH3 ChIP-seq reads from 13 accessions (ten newly generated in this study and three previously released) were used to determine the functional centromere regions. Paired-end raw reads were subjected to adapter trimming and filtering using fastp (v.0.23.4) (Chen et al., 2023), with the following options: ‘--cut_front_window_size 4 --cut_front_mean_quality 20 -3 --cut_tail_window_size 4 --cut_tail_mean_quality 20 --cut_right --cut_right_window_size 4 --cut_right_mean_quality 20’. Filtered read sequences were aligned to their corresponding genome assembly using Bowtie2 (v.2.3.5.1) (Langmead et al., 2019) with parameters ‘--very-sensitive --no-mixed --no-discordant --maxins 800’. Reads that aligned to multiple locations were filtered using samtools (v.1.7) (Li et al., 2009). The enrichment level of CENH3 for each 1-Kb window was calculated using bamCompare from the Deeptools package (v.3.5.4.post1) (Ramírez et al., 2016) with the parameters ‘--binSize 1000 --outFileFormat bedgraph --operation log2 -p 5 --extendReads’. Base-level CENH3 enrichment was calculated using bamCompare with the parameters ‘--binSize 1 --numberOfProcessors 40 --operation ratio --outFileFormat bedgraph’. Custom R scripts were used to visualize CENH3 occupancy profiles across the 12 chromosomes and CEN155 satellite repeat arrays.