Genome sequencing and assembly
On the basis of K-mer distribution assessment (K = 25), the estimated genome size of sudangrass (2n = 20) was 741.34 Mb with heterozygosity of 0.225% and repeat contents of 58.4% (Table 1, Supplementary Table 1, Supplementary Fig. 1). DNBseq, ONT, and Hi-C data were combined to yield a high-quality and chromosome-level reference genome. In total, 106.32 Gb of ONT long reads (~ 143.41x coverage of the genome), 113.78 Gb (~ 153.55x coverage of the genome) of DNB clean reads, 107.5Gb (~ 145.07x coverage of the genome) of Hi-C data were produced, resulting in 442–fold coverage of the genome. The assembled genome was 715.95 Mb with scaffold N50 of 71.60 Mb and contig N50 of 28.97 Mb (Table 1). The longest contig and scaffold were 44.15 Mb and 86.04 Mb, respectively. The Hi-C data were used to order and anchor the assembled sequences onto the 10 chromosomes (Supplementary Fig. 2). The genome contained 44.33% GC.
The mapping rate was 95.72% when BGI sequencing reads and the assembled genome were aligned. BUSCO analysis was carried out to evaluate the quality of the assembled genome and showed that sudangrass assembly contained 97.9% of the highly conserved genes common across eukaryotes (Supplementary Table 2). In addition, 142 syntenic blocks and 1,489 paralog groups were identified based on self-alignment of 35,243 annotated genes, indicating that the sudangrass genome has undergone frequent segmental duplications and interchromosome fusions in its evolutionary history (Fig. 1).
Table 1
Assembly and annotation of the sudangrass genome
Estimated genome size (Mb) | 741.34 |
Assembled genome size (Mb) | 720.10 |
Number of Scaffolds | 10 |
Number of N50 Scaffolds | 5 |
Number of contigs | 115 |
Number of N50 contigs | 11 |
Longest Chr (Mb) | 86.04 |
GC content (%) | 44.33 |
Transposable elements (%) | 67.33 |
Predicated protein-coding genes | 35,243 |
Average gene length (bp) | 3316.36 |
Average exon length (bp) | 282.50 |
Genome annotation
RNA sequence and homology searches were used to identify protein-coding genes in the sudangrass genome. Overall, 35,243 protein-coding genes were identified. On average, gene length was 3.32 kb and exon length was 282.50 bp (Supplementary Table 3). A total of 90.25% (31,086 genes) of the 35,243genes were annotated by homology to known proteins, domains or transcripts (Supplementary Table 4). In total, 1647 transcription factors (TFs) in 56 TF families were identified in the genome. These included 188 MYB, 160 bHLH and 94 WRKY (Supplementary Table 5).
The sudangrass genome contained 510.09 Mb repetitive sequences (71.25% of the genome) using homology-based and de novo methods, of which 3.68% were tandem repeat and interspersed repeats (26.36 Mb). Long terminal repeats (LTRs) retrotransposons was the most abundant interspersed repeats, representing 57.1% of the genome (394.71 Mb), followed by DNA transposons at 10.12% (72.48 Mb). The non-LTR retrotransposons LINEs (long interspersed nuclear elements) and SINEs (short interspersed nuclear elements) accounted for 2.08% (14.88 Mb) of the genome (Supplementary Table 6). Genes annotated as ncRNAs included 2,751 miRNAs, 690 tRNAs, 672 rRNAs, and 4,670 snRNAs (Supplementary Table 7).
Comparison of gene families
A gene family cluster evaluation from the whold genomes of four Sorghum genus (S. bicolor, S. proniquum, S. bicolor ssp. verticilliflorum and sudangrass) was performed. The four genomes shared 17,155 gene families and 448 gene families were sudangrass-specific (Fig. 2A). These 448 gene families consisting of 3,141 genes were used to perform GO analysis (cutoff < 0.05). These genes were enriched in cell component (endoplasmic reticulum lumen) and 15 molecular function categories (ATPase-coupled cation transmembrane transporter activity, metal ion binding etc.) (Supplementary Fig. 3 and Supplementary Table 8).
The CAFE program was used to perform gene expansion and contraction analysis. Comparative genomic analyses were performed among nine representative plant species (Fig. 2B) and 276 and 905 gene family expansion and contractions, respectively, were found in sudangrass, after its divergence from S. bicolor, indicating that more sudangrass gene families have undergone contraction than expansion. Among the gene families, 638 significantly expanded genes and 603 significantly contracted genes were found at the 0.05 level. KEEG analysis showed that expanded genes were enriched in 45 pathways including acridone alkaloid biosynthesis, flavonoid biosynthesis and circadian rhythm (Supplementary Table 9 and Supplementary Fig. 4). Contracted genes were enriched in 17 pathways including isoflavonoid biosynthesis and stilbenoid, diarylheptanoid and gingerol biosynthesis etc. (Supplementary Table 10 and Supplementary Fig. 5).
A divergence tree was constructed based on expansion-contraction of 186 single-copy orthologs. The results indicated that the Sorghum genus diverged from Z. mays about 21.6 Mya (16.2–27.4), and sudangrass diverged from S. bicolor about 1.1 Mya (0.6–1.7) (Fig. 2B). We also constructed a phylogenetic tree using 17 sequenced plant varieties. In this phylogenetic tree, sudgangrass (S722) was most closely related to the five cultivated sorghums BTx623, Rio, SC187, RTx430 and BTx642 and distinct from wild relatives and cultivated sorghums from Africa (Supplementary Fig. 6).
Chromosome changes compared to the sorghum genome
Comparisons of the sudangrass genome with the sorghum reference genome by chromosome using Chromeister (Pérez-Wohlfeil et al. 2019) indicated largely collinearity between the two, except between sudangrass chromosome 4 and the reference chromosome 5 (Fig. 3). In this case, while the middle of the two chromosomes was largely colinear, both arms were inverted (Fig. 3). Between sudangrass chromosome 1 and the reference chromosome 1 and between sudangrass chromosome 6 and the reference chromosome 7, it was the middle portion of the chromosomes that were inverted (Fig. 3).
The LTR analysis
Transposons, especially LTRs, are important to the evolution of genome structure. LTRs were identified and compared among the four sorghum species. In total, 3076, 7019, 6005 and 1940 intact LTRs were found in sudangrass, S. bicolor, S. bicolor ssp. verticilliflorum and S. proniquum, respectively (Fig. 4A). There was no significant proliferation of LTRs in the last 2 Mya for S. proniquum. But for S. bicolor, S.bicolor ssp. verticilliflorum and sudangrass, there has been continuous and substantial LTR accumulation in the last 2 Mya (Supplementary Fig. 6). For S. bicolor and S. bicolor ssp. verticilliflorum, LTR number expanded three times more than S. proniquum. For sudangrass, LTR number only expanded 0.5 times more than S. proniquum.
Approximately 81.21%, 81.72%, 79.30% and 84.03% of the intact LTRs had at least one protein domain in the four species (S. bicolor, S. bicolor ssp. verticilliflorum, S. proniquum, and sudangrass), respectively. There was no significant difference among the four species in the number of LTR with protein domain. As in other species, the majority of LTRs were Ty1/copia or Ty3/gypsy. Specifically, 70.96%, 69.69%, 64.14% and 59.92% were Ty3/gypsy in S. bicolor, S. bicolor ssp. verticilliflorum, S. proniquum and sudangrass, respectively, and the respective number for Ty1/copia were 10.76%, 9.61%, 19.89% and 21.29%. Compared to S. proniquum, the number of Ty1/copia was 0.5, 0.7 and 1.0 times more in S. bicolor ssp. verticilliflorum, S. bicolor and sudangrass, respectively. But for Ty3/gypsy, it was 2.4, 3.0 and 1.48 times more in the three species, respectively, compared to that in S. proniquum.
We also compared the expansion curve in the four species (Fig. 4B and 4C). For Ty3/gypsy, there was no sharp peak in S. proniquum. For sudangrass, S. bicolor and S. bicolor ssp. verticilliflorum, there were a burst in 1.06, 0.22 and 0.35 Mya for Ty3/gypsy and a burst in 0.32, 0.12 and 0.21 Mya for Ty1/copia, respectively. These showed that S. bicolor and S. bicolor ssp. verticilliflorum had similar expansion pattern and sudangrass had distinct expansion pattern compared to the other two species (Fig. 4B and 4C).
HCN-p in mini core/sudangrass accessions and GWAS
We measured leaf HCN-p (in ppm) in 2-week-old seedlings of 227 mini core sorghum and seven sudangrass accessions including S722. HCN-p ranged from 189–717 with an average of 381 in the mini core accessions while in the sudangrass accessions it ranged from 71–233 with an average of 140. Overall, sorghum accessions had higher HCN-p than sudangrass accessions (p = 0.0000383; Fig. 5).
We performed GWAS with the mini core HCN-p data using 6,094,317 SNPs and 957,449 indels. No markers were associated with HCN-p with − log10(P) value greater than the threshold of 8.08 but we found one QTL with the highest − log10(P) value (7.94 and 7.47 for the two SNPs and 7.73 for the indel) mapped to chromosome 1 (Fig. 6A and 6B), in the same gene cluster mapped by Hayes et al. (2015). The two SNPs (Fig. 6C) and one indel (Fig. 6D) with the strongest association with HCN-p were all located in the 3’ UTR of Sobic.001G012300 which encodes CYP79A1 (Fig. 6E), the enzyme that catalyzes the first committed step of dhurrin biosynthesis, converting L-tyrosine into (Z)-p-hydroxyphenylacetaldehyde oxime (Gleadow et al. 2014; Laursen et al. 2016).