Genome Survey, Sequencing, and Assembly
This study evaluated the size, repeat content, heterozygosity, and other genome parameters of white clover. After quality control, Illumina sequencing yielded 59 Gb of data[18]. A BLAST search of 10,000 randomly selected clean reads against the NT database showed that 98.79% of the reads mapped. K-mer analysis, performed to estimate the complexity of the genome, predicted a genome size of 1075 Mb, with 1.68% repeat and 68.80% heterozygous sequences[22].
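The k-mer estimate rests on a simple relationship: genome size is roughly the total number of k-mer instances divided by the depth of the main peak in the k-mer frequency histogram. The following is a minimal sketch of that calculation, not the pipeline used in this study. The function name, the toy histogram, and the error cutoff are all illustrative assumptions; real data from a heterozygous genome would need a dedicated tool to separate heterozygous and homozygous peaks.

```python
def estimate_genome_size(histogram, error_cutoff=5):
    """Estimate genome size from a k-mer frequency histogram.

    histogram: dict mapping k-mer depth -> number of distinct k-mers seen at that depth.
    error_cutoff: depths below this value are treated as sequencing errors
                  (assumed value, not taken from the study).
    Returns (estimated_size, peak_depth).
    """
    # Discard low-depth k-mers, which mostly come from sequencing errors.
    filtered = {depth: n for depth, n in histogram.items() if depth >= error_cutoff}
    if not filtered:
        raise ValueError("no k-mers at or above the error cutoff")
    # Main peak: the depth carrying the largest number of distinct k-mers.
    peak_depth = max(filtered, key=filtered.get)
    # Total k-mer instances = sum of depth * distinct k-mers at that depth.
    total_kmers = sum(depth * n for depth, n in filtered.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (invented numbers, for illustration only).
toy_histogram = {1: 900, 2: 300, 10: 50, 20: 400, 21: 350, 40: 30}
size, peak = estimate_genome_size(toy_histogram)
print(f"peak depth = {peak}, estimated size = {size:.1f} bp")
```

In this toy case the k-mers at depth 40 stand in for repeated sequence: they contribute twice as many instances per distinct k-mer as those at the main peak.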
Third-generation sequencing (TGS) with PacBio HiFi technology was used for the initial assembly of the white clover genome, building on traditional next-generation sequencing (NGS) assembly methods[23]. Compared with second-generation sequencing, TGS overcomes several NGS shortcomings in genome assembly: it does not require polymerase chain reaction (PCR) amplification, produces long reads, and has no guanine-cytosine (GC) preference, which makes assembly from PacBio HiFi reads an effective strategy[24]. High-quality HiFi reads were obtained after parameter comparison of the output data. The HiFi reads were 1.89 Mbp in size, with an N50 of 1.63 kbp.
After eliminating heterozygous and redundant contigs, the assembled genome (1095 Mb) had 380 contigs, with an N50 of 14 Mbp and a maximum contig size of 53 Mbp. The average GC content of the assembled genome was 33.64% (Table 1), close to that of the previously assembled Trifolium pratense genome (33.60%)[25, 26]. To evaluate the quality and completeness of the assembly, we mapped the sequencing reads back to the assembly and found that 89.22% of the reads were properly paired, while the BUSCO completeness score was 98.50%. These results indicate that the assembly is highly complete[27].
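For readers unfamiliar with the contiguity statistics quoted here, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length, and GC content is the fraction of G and C bases. The sketch below illustrates both definitions on toy inputs; the function names and the example values are assumptions made for illustration and are not taken from the study's tooling.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the total length is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def gc_content(sequence):
    """Return the fraction of G and C bases among A, C, G, and T bases."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return (seq.count("G") + seq.count("C")) / counted


# Toy contig lengths in Mb (invented): 40 + 30 = 70 covers half of the 100 Mb total.
toy_n50 = n50([40, 30, 15, 10, 5])
toy_gc = gc_content("GGCCAATT")
print(toy_n50, toy_gc)
```

Ambiguous bases such as N are excluded from the GC denominator here; that is a design choice of this sketch, and other tools may count them differently.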
Table 1
Summary statistics for the Trifolium repens genome
Assembly | | |
Genome assembly | Estimated genome size | 1075 Mb |
| Total length of assembly | 1096 Mb |
| Number of contigs | 380 |
| Contig N50 | 14 Mb |
| Largest contig | 53 Mb |
| Number of scaffolds | 202 |
| Scaffold N50 | 65 Mb |
| Chromosome coverage (%) | 95.06% |
| GC content of genome | 33.64% |
Annotation | | |
| | Total length |
Transposable elements | Total | 672 Mb (61.37%) |
| Retrotransposon | 448 Mb (40.91%) |
| DNA transposon | 140 Mb (12.81%) |
| | Copies |
Noncoding RNAs | rRNAs | 10,984 |
| tRNAs | 2,024 |
| miRNAs | 662 |
| snRNAs | 1,352 |
Gene models | Number of genes | 90,128 |
| Mean gene length | 3,604 bp |
| Mean coding sequence length | 1,592 bp |
Table 2
Annotated gene model statistics for each species compared
Organism | Number of genes | Mean CDS length (bp) | Exons per transcript | Mean exon length (bp) | Mean intron length (bp) |
Vigna radiata | 29006 | 1430 | 7.6 | 293 | 449 |
Glycine max | 54881 | 1391 | 8 | 295 | 413 |
Trifolium medium | 119102 | 306 | 1.4 | 219 | 172 |
Cicer arietinum | 28772 | 1393 | 7.7 | 291 | 418 |
Medicago truncatula | 36079 | 1428 | 6.9 | 324 | 393 |
Trifolium repens | 90128 | 1592 | 5 | 341 | 490 |
Scaffold Construction and Curation
High-throughput chromatin conformation capture (Hi-C) fixes chromatin within intact cell nuclei and captures the chromosomal sites that interact with each other[19, 28]. High-throughput sequencing of these captured contacts yields a high-resolution map of interactions among chromatin regulatory elements, from which the whole-genome spatial distribution of chromatin DNA is determined[12, 19, 28]. In this study, Hi-C generated 270 Gb of data, of which 180 Gb (about 160-fold genome coverage) was used to construct chromosome-level super-scaffolds. The Hi-C-assisted assembly had a total scaffold length of 1096 Mb and a scaffold N50 of 65 Mbp, substantially better than the previously reported white clover assembly (scaffold N50 = 122 kb)[17]. In total, 1.04 Gb of genome sequence was anchored to 16 chromosomes, accounting for 95.06% of the entire genome. Interaction signals were much stronger within chromosomes than between them[29], and the heat map of genome-wide interactions further verified the accuracy of the assembly (Fig. 1). Table 1 summarizes the assembly information. Together, these results demonstrate the high accuracy of the Hi-C-assisted assembly.
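The coverage and anchoring figures in this paragraph follow from simple ratios, which can be checked as in the sketch below. The input values are the rounded numbers quoted in the text, and the helper names are illustrative assumptions. Because the inputs are rounded, the anchored fraction comes out near 94.9% rather than exactly the reported 95.06%, which was presumably computed from unrounded lengths.

```python
GB = 1_000_000_000
MB = 1_000_000


def fold_coverage(data_bp, genome_bp):
    """Sequencing depth: bases sequenced divided by genome length."""
    return data_bp / genome_bp


def anchored_fraction(anchored_bp, assembly_bp):
    """Share of the assembly placed on chromosomes."""
    return anchored_bp / assembly_bp


# Rounded values quoted in the text.
coverage = fold_coverage(180 * GB, 1096 * MB)
anchored = anchored_fraction(1.04 * GB, 1096 * MB)
print(f"coverage = {coverage:.1f}x, anchored = {anchored:.1%}")
```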
Genome Annotation
Gene functions were inferred from homology alignments, and repetitive sequences were identified by prediction. We constructed a repeat sequence library and annotated 2,023,411 repeat sequences. MITEs (miniature inverted-repeat transposable elements) and LTR (long terminal repeat) retrotransposons were identified by structure-based prediction, and these elements accounted for 61.37% and 37.75% of the total sequences, respectively. Copia and Gypsy elements accounted for 13.56% and 11.49% of LTR retrotransposons, respectively, and an additional 4,092 simple repeats were found in the assembled genome. We also predicted 13 types of ncRNA, totaling 15,520 ncRNAs.
After removing gene models containing premature stop codons or frameshifts, we obtained 90,128 high-confidence gene models and 91,690 transcripts using RNA-seq and de novo prediction strategies. These gene models were unevenly distributed across the 16 chromosomes.
Each gene had approximately one transcript on average, and the average lengths of white clover genes and transcripts were 3,604 bp and 1,697 bp, respectively. Each transcript contained an average of 5 exons, with an average exon length of 341 bp. We also compared the white clover genome with those of five closely related species: Medicago truncatula, Trifolium medium, Vigna radiata, Cicer arietinum, and Glycine max. T. medium had the most genes (119,102), while V. radiata (29,006) and C. arietinum (28,772) had the fewest. All species had similar average coding sequence (CDS) lengths except T. medium (306 bp) (Table 2).
Using the NR, Swiss-Prot, eggNOG, GO, and KEGG databases, we predicted the functions of the annotated genes[30]. We annotated 88,094, 61,830, 77,722, 52,992, and 26,979 genes using the NR, Swiss-Prot, eggNOG, GO, and KEGG databases, respectively. A Venn analysis integrating the five databases revealed 21,825 genes annotated in all of them (Table S1; Fig. 2).
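The Venn analysis reduces to set operations on the gene identifiers annotated by each database: the shared core is the intersection of all five sets. The sketch below shows the idea on a toy example. The gene identifiers (g1 to g5) and the function name are hypothetical; only the database names come from the text.

```python
def shared_genes(annotations):
    """Return the genes annotated in every database.

    annotations: dict mapping database name -> set of annotated gene IDs.
    """
    gene_sets = list(annotations.values())
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


# Hypothetical gene IDs, for illustration only.
toy_annotations = {
    "NR": {"g1", "g2", "g3", "g4"},
    "Swiss-Prot": {"g1", "g2", "g3"},
    "eggNOG": {"g1", "g2", "g4"},
    "GO": {"g1", "g2"},
    "KEGG": {"g1", "g5"},
}

common = shared_genes(toy_annotations)
annotated_anywhere = set.union(*toy_annotations.values())
per_database = {name: len(genes) for name, genes in toy_annotations.items()}
print(sorted(common), len(annotated_anywhere), per_database)
```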
Gene Family and Evolution Analysis
Closely related species tend to share more extensive collinear blocks between their genomes. Collinearity analysis suggested that T. repens and M. truncatula are relatively closely related. Moreover, the 16 chromosomes of T. repens and the eight chromosomes of M. truncatula showed good collinearity (Fig. 3), indicating chromosomal conservation after species divergence[26].
The T. repens genome assembled in this study was compared with the genomes of seven other species: G. max, V. radiata, M. truncatula, T. medium, C. arietinum, Arabidopsis thaliana, and T. pratense. OrthoMCL clustering analysis showed that the 90,128 white clover genes clustered into 25,840 gene families. Arabidopsis had the most gene families (26,382), and T. repens shared 6,194 gene families with the seven other species (Fig. 4a). CAFE was used to study changes in gene families among species at a family-wide p-value threshold of 0.05. The analysis showed that the red trifoliate significantly expanded 1,245 gene families but contracted one gene family during evolution (Fig. 4b)[31]. Furthermore, GO functional enrichment analysis showed that T. repens gene families were associated with biological processes, molecular function, cellular components, and environmental resistance, which could explain its excellent agronomic traits (Table S2).
We constructed a phylogenetic tree based on the results of protein family clustering and found that T. repens formed a monophyletic group with V. radiata, G. max, T. pratense, T. medium, M. truncatula, and C. arietinum[32]. White clover was most closely related to T. pratense and T. medium, with an estimated divergence time of 15.5 million years ago (Fig. 4b).
Whole-genome duplication (WGD) events are important milestones in plant evolution and a driving force for plant adaptation to various environments[33]. WGD provides abundant genetic material for expanding gene families or generating new genes. It also enhances the adaptability of plants to the environment and accelerates their evolution by generating genetic variation. To explore the evolutionary history of T. repens, we used the distribution of synonymous substitution rates (Ks) between paralogous genes to assess gene duplication and loss in its genome. The data suggested that the divergence of T. repens and T. pratense occurred after the WGD events. Both T. repens and T. pratense experienced a WGD event at a Ks value of 0.13 (Fig. 4c); however, T. repens showed an additional WGD event at a Ks value of 0.6 (Fig. 4d).
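WGD events are typically read off a Ks distribution as peaks: a burst of paralog pairs created at the same time shares similar Ks values. The sketch below illustrates one simple way to locate such peaks by binning Ks values and picking local maxima. It is not the method used in this study, which does not state how peaks were called. The Ks values are invented, and the bin width and minimum count are arbitrary assumptions.

```python
def ks_peaks(ks_values, bin_width=0.05, min_count=3):
    """Return the centers of histogram bins that are local maxima.

    ks_values: iterable of Ks values for paralogous gene pairs.
    bin_width, min_count: illustrative assumptions, not values from the study.
    """
    counts = {}
    for ks in ks_values:
        index = int(ks / bin_width)
        counts[index] = counts.get(index, 0) + 1
    peaks = []
    for index, count in sorted(counts.items()):
        left = counts.get(index - 1, 0)
        right = counts.get(index + 1, 0)
        # A peak must be well populated, exceed its left neighbor, and not be below its right one.
        if count >= min_count and count > left and count >= right:
            peaks.append(round((index + 0.5) * bin_width, 4))
    return peaks


# Invented Ks values clustered near 0.13 and 0.6, plus scattered background pairs.
toy_ks = [0.03, 0.11, 0.12, 0.13, 0.13, 0.14, 0.22, 0.36,
          0.47, 0.58, 0.61, 0.62, 0.63, 0.64, 0.71, 0.92]
peaks = ks_peaks(toy_ks)
print(peaks)
```

The reported peak positions are bin centers, so their resolution is limited by the bin width; a kernel density estimate or mixture model would give smoother estimates on real data.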