Technology-enabled great leap in deciphering plant genomes

Plant genomes provide essential resources for studying many aspects of plant biology and its applications (e.g. breeding). From 2000 to 2020, 1,031 genomes of 788 plant species were sequenced. From 2021 to 2023, 1,857 genomes of 841 plant species, including 622 newly sequenced species, were assembled, a great leap. The 1,857 newly assembled genomes, some of which are telomere-to-telomere (T2T) assemblies, cover the major phylogenetic clades, and many of them are of high quality. This achievement is mainly attributed to significant advances in both sequencing technologies and assembly software. A database named N3: plants, genomes, technologies was developed to accommodate the metadata associated with the sequenced plant genomes. Finally, we discuss the challenges involved in building huge, gap-free and single-cell genomes, as well as future opportunities in plant genomic studies.


Introduction
Most plant species have a complex genome, characterized by large size, extremely high repeat content, high heterozygosity, and polyploidy 1,2. Genome sequencing plays an important role in providing unparalleled genomic resources for a deeper understanding of the biology and evolution of plants 3. Since the publication of the first plant reference genome, that of Arabidopsis thaliana, in 2000 4, notable improvements in sequencing technology and in computational algorithms for genome assembly have allowed a great many plant genomes to be sequenced and assembled from 2000 to 2020, including 1,031 de novo genomes of 788 different plant species 5. These genomes provide a myriad of high-quality genomic resources for studies of plant functional genomics, population genetics, evolution, and breeding. However, given that there are approximately 500,000 green plant species (Viridiplantae) 6, the number of plant species sequenced so far is just the tip of the iceberg. More plant genomes need to be decoded in order to gain a comprehensive understanding of plant biology and evolution and so to better serve humanity in the long run, which largely relies on advances in sequencing and assembly technologies.
At the beginning of plant genome sequencing, the predominant approach was Sanger sequencing, the first-generation sequencing technology 7. Gradually, next-generation sequencing (NGS) technologies (e.g., Illumina) overtook the first-generation technology, especially for large-scale genome resequencing 8. With further technological innovations, we are now in the third-generation sequencing (TGS) era, characterized by the capability of generating very long reads to overcome the drawback of the short reads of NGS 9. The TGS technologies include the Pacific Biosciences (PacBio) continuous long reads (CLR) / circular consensus sequencing (CCS), the Oxford Nanopore Technologies (ONT) sequencing platforms, and so on 10,11. Although the CLR method has played an important role in PacBio sequencing for several years, high fidelity (HiFi) reads generated by CCS have attracted more and more attention due to a significant improvement in sequencing accuracy 12. The accuracy of the ONT sequencing platforms is still not as high as that of the PacBio platforms, but ONT has the advantage of being able to generate ultra-long reads, which can resolve large repeats or compensate for HiFi's coverage dropouts 13. Meanwhile, multiple software tools developed using different algorithms offer the opportunity to create more complete assemblies 14. From the short-read contig assemblers Velvet 15, ABySS 16, ALLPATHS-LG 17, and SOAPdenovo 18 to the TGS-based assemblers Canu 19, Mecat2 20, Necat 21, and Falcon 22, more high-quality genomes have been assembled. Hifiasm, a software tool that uses HiFi reads as input for contig assembly, has been applied in generating haplotype-resolved de novo assemblies, and even telomere-to-telomere (T2T) assemblies when HiFi reads are combined with ultra-long ONT reads 5,23.
Here, we update the progress of plant genome sequencing and find that the number of plant genomes released in the past two years (1,857 genomes from 841 plant species, from January 2021 to June 2023) exceeds that released in the first twenty years (2000-2020). We further analyze the technologies used in plant genome sequencing and assembly, shedding light on the technology-enabled great leap in plant genome research and identifying the challenges faced by the plant community in further advancing genome sequencing and assembly in the years to come.

Results
A rapid increase of sequenced plant genomes in the past two years
During the past several years, we have witnessed rapid and substantial progress in sequencing technology and computing algorithms, making genome assembly, and high-quality assembly in particular, considerably easier than it was about 10 years ago. Compared with the 1,031 genomes of 788 plant species published from 2000 to 2020 5, we observed a great leap in de novo genomes in the past two years, with a total of 1,857 newly generated genomes representing 841 plant species (including 622 newly sequenced species). As a result, a total of 2,836 genomes of 1,410 plant species are available to date (Figure 1A). In addition to the surge in quantity, the quality of plant genome assemblies has also increased rapidly, with 72.0% of assemblies at the chromosome level in the past two years compared to 44.1% in the first twenty years (Figure 1A). There were no significant changes in the size of the sequenced genomes between the first 20 years and the last two years (Figure 1B). Moreover, technological innovation provides the impetus for the sequencing and assembly of plant pan-genomes. Among these de novo genomes, those assembled by pan-genome projects accounted for a considerable part, contributing to the surge in plant genome sequencing in the past two years (Figure 1A).
Compared to the plant species sequenced in the first 20 years 5, those sequenced in the last two years showed similar diversity across Viridiplantae, but taxonomic gaps persist. Plant species from several new clades, such as Acorales, Buxus, Chloranthales, and Cycads, were covered in the last two years (Figure 2). However, no species of the ANA (Amborellales, Nymphaeales and Austrobaileyales) clade have been sequenced in the past two years. In both the previous two decades and the last two years, the most sequenced species are angiosperms, including the monocot Commelinids and the eudicot Asterids and Rosids (Figure 2) (see Table S1 for the details of all the sequenced species).
To eliminate the effect of the number of pan-genomes on the analysis results, we reanalyzed all the plant species sequenced since 2000 after excluding the pan-genomes. We found that plant species from a total of 91 orders have been assembled so far, and most of the sequenced genomes were from Poales (Commelinids), Brassicales (Rosids), Fabales (Rosids), Rosales (Rosids), and Lamiales (Asterids) (Supplementary Table 1). Although the number of published genomes varies among the other orders, Poales (nearly 130 and 190 genomes in the first twenty years and the past two years, respectively) was consistently the most studied. Within Poales, Poaceae, which includes major food crops (such as rice, wheat, and maize), was the main target and had many more sequenced genomes (nearly 129 and 160 in the first twenty years and the past two years, respectively) than other families in both time periods (Supplementary Figure 1). Two countries (China and the United States) contributed the most to plant de novo assembly in both time periods, with China's contribution to the sequenced genomes rising from about 30% to 60% (Supplementary Figure 2; Supplementary Table 1). The plant genomes released in the first 20 years 5 were reported in 767 articles in nearly 100 different journals, with Nature Genetics, Nature Communications, GigaScience, The Plant Journal, and DNA Research being the top five journals in terms of the number of plant genomes published. The genomes reported in the past two years were published in 829 articles in nearly 100 different journals (Supplementary Table 1), with Horticulture Research, Nature Communications, Frontiers in Plant Science, The Plant Journal, and G3-Genes Genomes Genetics becoming the top journals (Supplementary Table 1; Supplementary Figure 2). Based on a word cloud analysis of the titles of the recent 829 articles, the sequenced genomes contributed to a diverse range of study topics, including the origin, domestication, and evolution of plant species, biosynthetic pathways of key substances, resistances, molecular mechanisms regulating plant biological processes, agronomic traits of crops, and molecular breeding (Supplementary Figure 3). Compared with the previous 20 years, the biggest change is that "chromosome level" appeared more frequently, while "draft" appeared less frequently (Supplementary Figure 3).

Pan-genomes, haplotype-resolved and T2T assemblies
During the past two years, in addition to the generation of more de novo reference genomes for plant species that had not been sequenced previously, genomic studies of plant species that already had a reference shifted to generating pan-genomes, haplotype-resolved genomes and telomere-to-telomere (T2T) assemblies. A pan-genome consists of the core genome and the unique genome of all individuals of a species 24. In the last two years, construction of plant pan-genomes has been attempted in soybean 25, Arabidopsis 26, barley 27, wheat 28, cotton 29, tomato 30, potato 31, and rapeseed 32, and especially in the main crops rice 33,34 and maize 35,36. Overall, de novo genomes of plant species from seven families, namely Poaceae, Brassicaceae, Solanaceae, Fabaceae, Malvaceae, Cucurbitaceae, and Rutaceae, have been assembled and used in pan-genome construction (Table 1). As for the sequencing platforms applied in the construction of pan-genomes, TGS was used for 97% of the assemblies (12% by HiFi and 85% by other TGS platforms), while 3% of the genomes were assembled from NGS reads only. No ONT ultra-long reads have yet been used in pan-genome assembly (Supplementary Figure 4). While the quality of the pan-genomes assembled in the first half of 2021 is relatively low, the mean contig N50 of pan-genomes assembled afterwards is generally in the range of 4 Mb to 30 Mb, indicating relatively high continuity (Supplementary Figure 4). Currently, the typical number of de novo genomes used in the assembly of a pan-genome is 10-15 (Supplementary Figure 4). The most used software tools for contig assembly of pan-genomes are Canu, Falcon, and Hifiasm (Supplementary Table 1). Pilon dominates among the polishing tools. 3D-DNA, LACHESIS, ALLMAPS, and Bionano Solve are the most used for scaffolding (Supplementary Table 1).
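The pan-genome definition given above (a core genome plus unique genomes) can be illustrated with a minimal Python sketch. It assumes each accession has already been reduced to a set of gene family identifiers; the function name, the accession names and the gene IDs are hypothetical and are not taken from any of the cited pan-genome studies, which use their own clustering and presence/absence pipelines.

```python
from collections import Counter


def partition_pan_genome(gene_sets):
    """Partition gene families by their presence across accessions.

    gene_sets maps an accession name to the set of gene families found in it.
    Returns (pan, core, unique):
      pan    - families present in at least one accession
      core   - families present in every accession
      unique - families present in exactly one accession
    """
    if len(gene_sets) < 2:
        raise ValueError("need at least two accessions to separate core from unique")
    # Count in how many accessions each family occurs.
    counts = Counter(family for families in gene_sets.values() for family in set(families))
    n_accessions = len(gene_sets)
    pan = set(counts)
    core = {family for family, c in counts.items() if c == n_accessions}
    unique = {family for family, c in counts.items() if c == 1}
    return pan, core, unique


# Hypothetical toy input: three accessions with made-up gene family IDs.
accessions = {
    "acc_A": {"g1", "g2", "g3"},
    "acc_B": {"g1", "g2", "g4"},
    "acc_C": {"g1", "g5"},
}
pan, core, unique = partition_pan_genome(accessions)
```

In this toy input, g2 is shared by two of the three accessions and so is neither core nor unique; real studies often report such families as a separate dispensable or shell class.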
A haplotype-resolved, or phased, genome separates heterozygous genomic regions and aids genome annotation, mutation detection, evolutionary analysis, gene function characterization, comparative genomics, and so on 1,37. Haplotype-resolved genomes are currently available for 17 orders, with the top five being Poales, Solanales, Vitales, Malvales, and Sapindales (Supplementary Table 1). Most haplotype-resolved assemblies utilized TGS, and 37% of them used HiFi reads. The most commonly used software tools for assembling haplotype-resolved genomes are Hifiasm, Falcon, and Canu, with Hifiasm being used 2-3 times more often than Falcon and Canu (Supplementary Table 1).
A T2T genome refers to a high-quality, complete genome assembly that includes all centromeres and repetitive regions, with high accuracy, continuity, and integrity 13. The quality of published T2T assemblies is variable because different criteria have been used to define a T2T genome. In this study, only assemblies in which at least half of the chromosomes have both ends, i.e., telomeres, assembled were considered T2T genomes. With this criterion, a total of 30 genomes from 20 species have been assembled to the T2T level. All of these assemblies belong to angiosperms, including Poales, Brassicales, Rosales, Solanales, and so on. The majority (≥22) of the 30 T2T assemblies are diploid (Supplementary Table 1). Most T2T assemblies adopted a combination of PacBio, ONT, and NGS sequencing technologies, and nearly 90% of these T2T genomes used HiFi reads. For contig assembly, other assemblers were also used in addition to Hifiasm to produce an integrated assembly.
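The T2T criterion used in this study (at least half of the chromosomes with both telomeres assembled) can be written as a short check. The sketch below is illustrative only: it assumes telomere presence at each chromosome end has already been determined by some upstream step, and the function name and input layout are hypothetical.

```python
def meets_t2t_criterion(chromosomes):
    """Return True if at least half of the chromosomes have both telomeres.

    chromosomes is a list of (has_start_telomere, has_end_telomere) pairs,
    one pair per chromosome in the assembly.
    """
    if not chromosomes:
        return False
    both_ends = sum(1 for start, end in chromosomes if start and end)
    # Integer comparison avoids floating-point issues for odd chromosome counts.
    return 2 * both_ends >= len(chromosomes)


# Hypothetical assembly with five chromosomes, three of them telomere-to-telomere.
example = [(True, True), (True, True), (True, False), (False, False), (True, True)]
result = meets_t2t_criterion(example)
```

With three of five chromosomes complete at both ends, this example satisfies the criterion; with only two of five it would not.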

The innovation of sequencing technologies and assembly software/algorithm
Regarding the sequencing strategies used in the assembly of plant genomes reported during the last two years, TGS was the dominant technology (91.2%); only 8.8% of the sequenced genomes were based solely on NGS (Figure 3A). No apparent relationship was observed between genome size and contig N50, i.e., big genomes can also have a highly contiguous assembly (Figure 3A). Although the CLR method on the Sequel and RS II platforms has played an important role in PacBio sequencing in the past two years, HiFi reads generated by the CCS method attracted more and more attention (used by 17.4% of the sequenced genomes) due to their improved accuracy (Figure 3A). HiFi reads were adopted for assemblies of almost all genome sizes and generated a significantly higher contig N50 than traditional TGS reads, indicating the advantages of HiFi reads for genome assembly (Figure 3A). The use of HiFi reads increased significantly from around the beginning of 2022, with a utilization frequency of 21.1% in 2022 and 28.5% in 2023, compared with 5.1% in 2021 (Supplementary Figure 5; Supplementary Table 1). The number of genomes assembled with ONT ultra-long reads (read N50 ≥100 kb) remains low.
Generating a genome usually involves three stages: contig assembly, polishing, and scaffolding. In the past two years, de novo assemblers including Canu, Falcon, Hifiasm, NextDenovo, and wtdbg were used in the contig assembly of most sequenced genomes, particularly Hifiasm, whose use showed a significantly increasing trend (Figure 3B; Figure 3C). Polishing of assemblies (except pan-genomes) usually used NGS reads generated by Illumina's HiSeq, NovaSeq, and MiSeq series and BGI's DNBSEQ and MGISEQ series, and typically required multiple iterations. Approximately 23% of assemblies were polished using Pilon, followed by a combination of Racon and Pilon (17%) and of Arrow and Pilon (11%) (Supplementary Table 1). The top five software tools used in polishing genomes were Pilon, Racon, Arrow, NextPolish, and Quiver (Figure 3C). Five approaches, namely physical linkage maps, genetic maps, mate-pair linkage reads, high-throughput chromosome conformation capture (Hi-C) reads, and homology-based approaches, were generally used in the scaffolding stage. Currently the focus of this stage is chromosome-level assembly, with Hi-C reads being the major contributor: 84% of chromosome-scale genomes used Hi-C reads, alone or in combination with other data such as genetic maps (RagTag) and Bionano optical maps (Bionano Solve), to link contigs into scaffolds (Supplementary Table 1). The most common scaffolding software tools that use Hi-C reads to build chromosomes are 3D-DNA, LACHESIS, and ALLHIC (Figure 3C).

The N3 database: a hub of sequenced plant genomes and assembly technologies
To provide the details of the plant genomes collected and analyzed in this study, such as the sequencing platforms and assembly tools used for each published plant reference and de novo genome, we built a database named N3: plants, genomes, technologies (http://ibi.zju.edu.cn/n3/) (Figure 4). It includes all 2,836 available plant reference/de novo genomes collected in our previous study 5 and in this study. The database consists mainly of four modules: Statistics, Search, Pan&T2T, and Links. The 'Statistics' module summarizes the total number of genomes/species that have been de novo assembled so far and the technologies (sequencing and assembly) used in generating the genomes. In the 'Search' module, users can search for information on any available plant genome by species name, sequencing platform and assembly software. The 'Pan&T2T' module presents information about the available plant pan-genomes and T2T assemblies. In the 'Links' module, we provide web links to the main sequencing platforms and assembly software.

Discussion
A total of 2,836 genomes from 1,410 plant species have been sequenced to date. However, compared to the total number of plant species, this is just the tip of the iceberg, and many species, including the basal groups of angiosperms (e.g. Amborellales, Nymphaeales, and Austrobaileyales), are yet to be sequenced because of their negligible economic value. Even where crop species of great economic value have been sequenced, the genomes of their closely related species have received little attention, although they are useful and essential for understanding the evolutionary trajectories of crops and for providing superior alleles for agronomic traits. Although the quality of the genomes assembled during the past two years has improved significantly compared to those assembled in the previous two decades, the assembly quality of many genomes, particularly the giga-genomes (with a genome size >10 Gb), still needs to be improved, mainly because of the massive repetitive sequences and high heterozygosity of plant genomes 38,39.
With further advances in sequencing technologies, the heavy workload involved in sequencing large genomes can be partially addressed by higher sequencing output, and repetitive sequences can be resolved by long and ultra-long reads. A significant increase in the number of plant giga-genomes is thus expected in the near future.
Long-read sequencing is key to high-quality genome assembly. In recent years, HiFi reads have been preferred, especially in haplotype-resolved and T2T assembly. Although ONT ultra-long reads can reach megabase lengths, their base accuracy is not as high as that of HiFi reads. Longer HiFi reads and more accurate ONT ultra-long reads at lower sequencing costs are expected in the near future, which will accelerate the acquisition of more complete plant (pan-)genomes and provide opportunities to study hidden regions along chromosomes (e.g. centromeres). The centromere is an important component of the chromosome and plays a key role in mitosis and meiosis 40. Unfortunately, due to their highly repetitive sequences and complex structures, centromeric regions have rarely been explored 41. The recently published genomes of Actinidia chinensis and Daucus carota, for example, used HiFi reads and ONT ultra-long reads to identify all chromosomal centromeres 14,42. It is expected that more and more plant genomes with intact chromosomal centromeres will become available in the future.
Single-cell genome sequencing can reveal cell heterogeneity in tissues and cell developmental trajectories. It has quickly become a research hotspot in recent years and will continue to be one of the future research directions. Compared with single-cell genomics studies in humans and microbiomes 43-45, those in plants are still in their infancy. Single-cell genomics involves sequencing thousands or more genomes in parallel and so depends heavily on sequencing capacity, which is still a challenge for most sequencing platforms. To facilitate studies on single-cell genomics, in addition to enhancing sequencing throughput, computational algorithms and data storage that can handle the increasing volume of sequencing data need to be improved.

Methods
To compile as complete a list as possible of plant genome assemblies of the Viridiplantae clade, which consists of green algae and land plants, we searched genome-related papers in the Web of Science (not including bioRxiv), downloaded papers on genome assembly from Published Plant Genomes (https://www.plabipd.de/plant_genomes_pa.ep), and conducted an extended search based on the genomes mentioned in the published articles. The complete list of the genomes sequenced during the past two years and the approximately 830 papers describing the genomes are provided in Supplementary Table 1. All the 2,836 genomes of 1,410 plant species published so far and the 1,600 related papers (including our previous study 5) are available in the N3 database (http://ibi.zju.edu.cn/n3/).
For the purpose of this study, we only used de novo assembled genomes with a contig N50 size over 10 kb. For a plant accession with multiple genome assemblies, all assemblies were included in Supplementary Table 1 and the N3 database, but the accession was counted only once; updated genomes and different haplotypes of the same accession were also counted only once. The family and order of the plant species are based on NCBI Taxonomy (https://www.ncbi.nlm.nih.gov/) and Wikipedia (https://www.wikipedia.org/). Species were classified according to the Angiosperm Phylogeny Group IV (APG IV) (http://www.mobot.org/MOBOT/research/APweb/) and the phylogenetic inferences from one thousand plant transcriptomes 46. The estimated genome size of all species was based on k-mer analysis or flow cytometry. If the genome size of a species was not mentioned in the source article, we used the estimated size of the species given in Published Plant Genomes. Country/institution refers to the region where the first publication was located. Different versions of a software tool used in genome assembly were treated as the same tool. For the pan-genome assemblies, we also counted only the de novo assemblies and excluded those with a contig N50 size less than 10 kb. To exclude the influence of pan-genome assemblies on the analysis results, we separately analyzed the genome assemblies after excluding pan-genomes and therefore generated two datasets, one with all genomes and another without pan-genomes.
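As an illustration of the inclusion rules above, the following Python sketch computes a contig N50 from contig lengths and counts accessions once each after applying the 10 kb threshold. It is a minimal sketch under stated assumptions, not the scripts used in this study: the function names and record layout are hypothetical, and the threshold is treated as strictly greater than 10 kb, following the "over 10 kb" wording.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (in bp).

    N50 is the length of the shortest contig in the smallest set of longest
    contigs that together cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def count_accessions(assemblies, min_contig_n50=10_000):
    """Count distinct accessions with at least one assembly above the threshold.

    assemblies is an iterable of (accession_id, contig_n50_bp) records, so an
    accession with several assemblies, updates or haplotypes is counted once.
    """
    return len({acc for acc, contig_n50 in assemblies if contig_n50 > min_contig_n50})


# Hypothetical records: acc1 has two assemblies, acc2 falls below the threshold.
records = [("acc1", 50_000), ("acc1", 120_000), ("acc2", 8_000), ("acc3", 10_001)]
toy_n50 = n50([10, 20, 30, 40])
n_counted = count_accessions(records)
```

For the toy lengths 10, 20, 30 and 40, the two longest contigs (40 + 30 = 70) already cover half of the total of 100, so the N50 is 30.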
We built the N3 database using the SQLyog visual tool (https://webyog.com/product/sqlyog/) and processed and analyzed the data using Python and R scripts.
Tables
Table 1. Summary of the studies on plant pan-genomes and T2T genomes in the last two years (from January 2021 to June 2023)

Figure 1

Figure 3

Figure 4