A chromosome-scale genome sequence of sudangrass (Sorghum sudanense) highlights the genome evolution and regulation of dhurrin biosynthesis

Sudangrass is more similar to US commercial sorghums than to cultivated sorghums from Africa sequence-wise and contain significantly lower dhurrin than sorghums. CYP79A1 is linked to dhurrin content in sorghum. Sudangrass [Sorghum sudanense (Piper) Stapf] is a hybrid between grain sorghum and its wild relative S. bicolor ssp. verticilliflorum and is grown as a forage crop due to its high biomass production and low dhurrin content compared to sorghum. In this study, we sequenced the sudangrass genome and showed that the assembled genome was 715.95 Mb with 35,243 protein-coding genes. Phylogenetic analysis with whole genome proteomes demonstrated that the sudangrass genome was more similar to US commercial sorghums than to its wild relatives and cultivated sorghums from Africa. We confirmed that at seedling stage, sudangrass accessions contained significantly lower dhurrin as measured by hydrocyanic acid potential (HCN-p) than cultivated sorghum accessions. Genome-wide association study identified a QTL most tightly associated with HCN-p and the linked SNPs were located in the 3’ UTR of Sobic.001G012300 which encodes CYP79A1, the enzyme that catalyzes the first step of dhurrin biosynthesis. As in other grasses such as maize and rice, we also found that copia/gypsy long terminal repeat (LTR) retrotransposons were more abundant in cultivated than in wild sorghums, implying that crop domestication in the grasses was accompanied by increased copia/gypsy LTR retrotransposon insertions in the genomes.


Introduction
Sudangrass [Sorghum sudanense (Piper) Stapf, also known as S. bicolor ssp. drummondii] is a hybrid between grain sorghum and its wild relative S. bicolor ssp. Verticilliflorum (Wiersema et al. 2007) and has been grown as a forage crop together with forage sorghums (Beck et al. 2013). However, when used as forage, sorghum poses a risk of cattle poisoning by dhurrin [p-hydroxy-(S)-mandelonitrile-β-Dglucoside], the precursor of hydrocyanic acid (HCN) (Hayes et al. 2015). Dhurrin is produced by sorghum and related species mainly in the leaves with much lower concentration in the stems and panicles and is measured by hydrocyanic acid potential (HCN-p) (De et al. 2011). Sudangrass contains lower concentration of dhurrin than sorghum-sudangrass hybrids, forage sorghum, shattercane, Johnsongrass, grain sorghum and sorghum almum (hybrid between grain sorghum and Johnsongrass) (De et al. 2011;Gorz et al. 1977;Jieqin Li, Lihua Wang and Paul W. Bible contributed equally to this work. Loyd et al. 1970;McBee et al. 1980;Provin and Pitt 2003). Because of its low dhurrin content and high forage yield, sudangrass has been used to produce sudangrass-sorghum hybrids which display high forage biomass heterosis (Lu et al. 2011;Zhan et al. 2004). Within the cultivated sorghum, there is tremendous variation in dhurrin content in the leaves with a 17-fold difference between the highest and lowest dhurrin-containing entries (Hayes et al. 2015). Dhurrin plays a role of defense against herbivores and pathogens (Gleadow and Møller 2014).
In this study, we sequenced and assembled the whole genome of sudangrass S722, and characterized the evolution of the sudangrass genome. We also measured HCN-p of 227 the mini core accessions (Upadhyaya et al. 2009) and seven sudangrass accessions, performed genome-wide association studies (GWAS) and identified a strongly associated quantitative trait locus (QTL) which included two SNP and one indel markers. These markers were all located in the 3' UTR of Sobic.001G012300 coding for CYP79A1, the enzyme that catalyzes the first step of dhurrin synthesis (Gleadow and Møller 2014;Laursen et al. 2016).

Genome sequencing and assembly
Genomic DNA of sudangrass S722 was isolated from fresh leaf tissues using DNAsecure Plant Kit (Qiagen, Cat.No. DP320). After DNA isolation, 20 kb DNA fragments were recovered using the BluePippin system (Sage Science, Beverly, MA). A sequencing library was constructed with SQK-LSK109 kit and sequenced using PromethION 48 (Oxford Nanopore Technologies-ONT, UK). For shortread sequencing, DNA size of 350 bp were generated using the Covaris E210 Ultrasonicator (Woburn, MA) and the library was sequenced with BGI's PCR DNBSEQ™ (Shenzhen, China). Similarly, a 350 bp Hi-C sequencing library was constructed and sequenced using an MGI-2000 Sequencer (Shenzhen, China). After sequencing, the short-read sequencing data were processed using SOAP-nuke1.5.6 (Chen et al. 2018) with the following setting to obtain clean reads:"-n 0.01 -l 20 -q 0.1 -i -Q 2 -G -M 2 -A 0.5 -d".
For genome assembly, the long-read ONT sequencing data were de novo assembled into contigs using NECAT assembler (Chen et al. 2021). Three rounds of correction were applied subsequently using Racon v1.3.3 (Vaser et al. 2017) before Pilon (Walker et al. 2014) was used to improve the accuracy of the genome assembly by integrating the short-read sequencing data. Lastly, Hi-C technology (Lieberman-Aiden et al. 2009) was used to anchor primary contigs to pseudo-molecules and remove redundancy.

Genome annotation
For genome annotation, repetitive elements in the sudangrass genome were identified through a combination of homologbased and de novo approaches. RepeatMasker v4.0.7 and RepeatProteinMask v4.0.7 (Tarailo-Graovac and Chen 2009) were used to detect repeats by aligning against Repbase database v21.12 (Bao et al. 2015). LTR-FINDER (Xu and Wang 2007) and RepeatModeler (Saha et al. 2008) were employed to build a de novo repetitive sequence library. RepeatMasker v.4.0.7 and Tandem Repeats Finder v.4.09 (Benson 1999) was then used to identify repetitive sequences and tandem repetitive sequences with the combined library, respectively.
MAKER pipeline (Cantarel et al. 2008) was employed to annotate protein-coding genes from de novo prediction as well as homology-based and transcriptome-based prediction. Protein sequences of Hordeum vulgare, Sorghum bicolor, Triticum urartu, Zea mays B73 and Oryza sativa (msu 7.0) were used for homology-based prediction. Non-redundant isoforms were used for transcriptome-based prediction. The above result was used to train the parameters for Augustus (Stanke et al. 2006) to conduct the de novo prediction. Lastly, we combined the models trained by Augustus with protein sequences of the five species and isoforms for annotation. The combined results were filtered for the proteins with more than 50% repetitive sequences or protein length less than 100 aa. The final gene set was obtained by further filtering based on transcript sequences. Gene functions were assigned by software eggNOG-Mapper (Huerta-Cepas et al. 2017) and blast according to the best match of the alignment to the public databases, including NCBI Non-Redundant Protein Sequence (NR) databases, Kyoto Encyclopedia of Genes and Genomes (KEGG), Gene Ontology (GO), Swis-sProt and Clusters of Orthologous Groups (COG).

Gene family analysis
The proteomes of sudangrass and three other Sorghum genus species were used to cluster gene families. OrthoFinder (Emms and Kelly 2015) was used to identify protein paralogous and orthologous sequences among the four species with the default parameters. The Venn diagram was drawn using OrthoVenn2 (Xu et al. 2019).

Phylogenetic analysis
Proteomes were collected from 17 sequenced genomes. These included sudangrass (S722), the Sorghrum bicolor reference (Sbicolor proteome v3.  (Tao et al. 2021), the wild relative S. propinquum (S369-1) (Tao et al. 2021), Miscanthus sinensis (v7.1, Phytozome) (Mitros et al. 2020), and the outgroup Zea mays reference (v4, Phytozome). For each genome, the longest isoform amino acid sequences were selected and processed using the Orthofinder pipeline to detect orthogroups and orthologs (Emms and Kelly 2015;Emms and Kelly 2019). For orthogroup sequence alignments, the '-M msa' option was used (orthofinder -t 8 -a 8 -M msa -f prots_folder/). The '-M msa' option invokes a more sensitive sequence alignment using MAFFT (Katoh and Standley 2013) v 7.481. Single-gene families shared by sudangrass and eight other plant species were identified by OrthoFinder. RAxML was used to construct the maximum likelihood phylogenetic tree using the GTRGAMMA model with Arabidposis thaliana as an out group with a bootstrap value of 1000. The MCMCtree program of PAML (Yang 2007) was used to estimate the divergence time. The times of divergence between Arabidposis thaliana and Oryza sativa (115-308 Mya), and Oryza sativa and Zea mays (42-52 Mya) were from the TimeTree database (http:// timet ree. org/).
To explore the evolutionary history of the gene families, gene expansion and contraction analysis was performed using CAFE 5 (Mendes et al. 2020) with default parameters. Gene trees calculated by Orthofinder were retrieved, and gene families were extracted based on sequence similarity using Orthofinder. Gene and species trees were constructed by Orthofinder using FastTree (Price et al. 2010) version 2.1.11 from multiple sequence alignments. The phylogenetic relationship between species (called the "species tree") was inferred by Orthofinder based on conserved single copy proteins using the STRIDE (Emms and Kelly 2017) and STAG (Emms 2018) algorithms.

Chromosomal structural change tracking
To assess global structural changes, sorghum genome (S. bicolor v3.1.1) was compared to the sudangrass genome using CHROMEISTER (Pérez-Wohlfeil et al. 2019) (command: "CHROMEISTER -query genome1.fa -db genome2. fa -out outputfile"). The generated dot-plot was drawn using the compute_score.R R script provided by CHROMEISTER to create image files. Before analysis, each genome assembly was filtered to use only its chromosomes. Any genome with only scaffold-level assembly quality was filtered to its 10 largest scaffolds (all > 19 Mb). Large scale structural changes and their corresponding coordinates were calculated using CHROMEISTER 's detect_events.py script (command: "detect_events.py outputfile.raw.txt png"). Centromere positions were estimated based on Paterson et al. (2009).

Long terminal repeat retrotransposons (LTR-RTs) analysis
LTR_FINDER and LTRharvest (Xu and Wang 2007) were used to search intact long terminal repeat retrotransposons (LTR-RTs) against the genome sequences. LTR-RT insertion time was calculated using T = K/2r, where K is the divergence rate measured by the Jukes-Cantor model with K = − 3/4 × ln (1-d × 4/3), and r is the base substitution rate (Steinbiss et al. 2009). The base substitution rate (r) was set at 1.3 × 10 −8 substitutions/site/year (Ma and Bennetzen 2004). All LTR-RTs were grouped into Ty1/copia, Ty3/gypsy or other superfamilies based on their structure and proteins domains using LTRdigest (Ossowski et al. 2010). The density graph of LTR-RTs for the Sorghum genus was drawn using ggplot2 in R.

HCN-p and GWAS
Sorghum mini core collection comprises 242 landraces from 57 countries representative of the International Crops Research Institute for the Semi-Arid Tropics' (ICRISAT) sorghum global landrace collection (Upadhyaya et al. 2009). Two hundred twenty seven sorghum mini core and seven sudangrass accessions were planted in plastic trays in triplicate. The plants were grown at 28 ℃ under 12 h light and 12 h dark photoperiod. Two weeks after planting, the first leaf was sampled in triplicate to measure HCN-p as described by Gorz et al. (1977). t-test between sorghum and sudangrass accessions was performed with Microsoft® Excel 365 using "Two Sample Assuming Unequal Variances" from its Analysis ToolPak.
GWAS was performed according to Wang et al. (2021) using 6,094,317 SNPs and 957,449 indels. Kinship matrix (K) was generated using EMMAX (Kang et al 2010). GWAS analyses were performed using EMMAX with Q matrix. The modified Bonferroni correction was used to determine the genome-wide significance thresholds of the GWAS, based on a nominal level of α = 0.05 corresponding to raw P values of 8.2 × 10 -9 or − log 10 (P) values of 8.08. Candidate genes were identified using the reference sequence Sorghum bicolor v3.1.1 curated at Phytozome (Goodstein et al. 2012) 13 (https:// phyto zome-next. jgi. doe. gov/).

Genome sequencing and assembly
On the basis of k-mer distribution assessment (k = 25), the estimated genome size of sudangrass (2n = 20) was 741.34 Mb with heterozygosity of 0.225% and repeat contents of 58.4% (Table 1, Supplementary Table1, Supplementary Fig. 1). DNBseq, ONT, and Hi-C data were combined to yield a high-quality and chromosome-level reference genome. In total, 106.32 Gb of ONT long reads (~ 143.41 × coverage of the genome), 113.78 Gb (~ 153.55 × coverage of the genome) of DNB clean reads, 107.5 Gb (~ 145.07 × coverage of the genome) of Hi-C data were produced, resulting in 442-fold coverage of the genome. The assembled genome was 715.95 Mb with scaffold N50 of 71.60 Mb and contig N50 of 28.97 Mb ( Table 1). The longest contig and scaffold were 44.15 Mb and 86.04 Mb, respectively. The Hi-C data were used to order and anchor the assembled sequences onto the 10 chromosomes ( Supplementary Fig. 2). The genome contained 44.33% GC.
The mapping rate was 95.72% when BGI sequencing reads and the assembled genome were aligned. BUSCO analysis was carried out to evaluate the quality of the assembled genome and showed that sudangrass assembly contained 97.9% of the highly conserved genes common across eukaryotes (Supplementary Table 2). In addition, 142 syntenic blocks and 1,489 paralog groups were identified based on self-alignment of 35,243 annotated genes, indicating that the sudangrass genome has undergone frequent segmental duplications and interchromosome fusions in its evolutionary history (Fig. 1).

Genome annotation
RNA transcript sequence and homology searches were used to identify protein-coding genes in the sudangrass genome. Overall, 35,243 protein-coding genes were identified. On average, gene length was 3.32 kb and exon length was 282.50 bp (Supplementary Table 3). A total of 90.25% (31,086 genes) of the 35,243 genes were annotated by homology to known proteins, domains or transcripts (Supplementary Table 4). In total, 1647 transcription factors (TFs) in 56 TF families were identified in the genome. These included 188 MYB, 160 bHLH and 94 WRKY TFs (Supplementary Table 5).

Comparison of gene families
A gene family cluster evaluation from the whole genomes of four Sorghum genus (S. bicolor, S. propinquum, S. bicolor ssp. verticilliflorum and sudangrass) was performed. The four genomes shared 17,155 gene families and 448 gene families were sudangrass-specific ( Fig. 2A). These 448 gene families consisting of 3,141 genes were used to perform GO analysis (cutoff < 0.05). These genes were enriched in cell component (endoplasmic reticulum lumen) and 15 molecular function categories (ATPase-coupled cation transmembrane Comparative genomic analyses were performed among nine representative plant species (Fig. 2B) and 276 and 905 gene family expansion and contractions, respectively, were found in sudangrass after its divergence from sorghum, indicating that more sudangrass gene families have undergone contraction than expansion. Among the gene families, 638 significantly expanded genes and 603 significantly contracted genes were found at the 0.05 level. KEGG analysis showed that expanded genes were enriched in 45 pathways including acridone alkaloid biosynthesis, flavonoid biosynthesis and circadian rhythm (Supplementary Table 9 and Supplementary Fig. 4). Contracted genes were enriched in 17 pathways including isoflavonoid biosynthesis and stilbenoid, diarylheptanoid and gingerol biosynthesis, etc. (Supplementary Table 10 (Fig. 2B). We also constructed a phylogenetic tree using 17 sequenced plant genomes. In this phylogenetic tree, sudgangrass (S722) was most closely related to the five cultivated sorghums BTx623, Rio, SC187, RTx430 and BTx642 and distinct from wild relatives and cultivated sorghums from Africa ( Supplementary Fig. 6).

Chromosome changes compared to the sorghum genome
Comparisons of the sudangrass genome with the sorghum reference genome by chromosome using Chromeister (Pérez-Wohlfeil et al. 2019) indicated largely collinearity between the two, except between sudangrass chromosome 4 and the reference chromosome 5 (Fig. 3). In this case, while the middle of the two chromosomes including possibly the centromere was largely colinear, both arms were inverted (Fig. 3). Between sudangrass chromosome 1 and the reference chromosome 1 and between sudangrass chromosome 6 and the reference chromosome 7, it was the middle portion of the chromosomes that were inverted (Fig. 3).

The LTR-RTs analysis
Transposons, especially LTR-RTs, are important to the evolution of genome structure. LTR-RTs were identified and compared among the four sorghum species. In total, 3076, 7019, 6005 and 1940 intact LTR-RTs were found in sudangrass, S. bicolor, S. bicolor ssp. verticilliflorum and S. propinquum, respectively (Fig. 4A). There was no significant proliferation of LTR-RTs in the last 2 Mya for S. propinquum. But for S. bicolor, S. bicolor ssp. verticilliflorum and sudangrass, there has been continuous and substantial LTR-TR accumulation in the last 2 Mya ( Supplementary  Fig. 6). For S. bicolor and S. bicolor ssp. verticilliflorum, LTR-TRs expanded three times more than S. propinquum. For sudangrass, LTR-TRs only expanded 0.5 times more than S. propinquum.
Approximately 81.21%, 81.72%, 79.30% and 84.03% of the intact LTR-RTs had at least one protein domain in the four species (S. bicolor, S. bicolor ssp. verticilliflorum, S. propinquum, and sudangrass), respectively. There was no significant difference among the four species in the number of LTR-TRs with protein domain. As in other species, the majority of LTR-RTs were Ty1/copia or Ty3/gypsy. Specifically, 70.96%, 69.69%, 64.14% and 59.92% were Ty3/gypsy in S. bicolor, S. bicolor ssp. verticilliflorum, S. propinquum and sudangrass, respectively, and the respective number for Ty1/copia were 10.76%, 9.61%, 19.89% and 21.29%. Compared to S. propinquum, the number of Ty1/copia was 0.5, 0.7 and 1.0 times more in S. bicolor ssp. verticilliflorum, S. bicolor and sudangrass, respectively. But for Ty3/gypsy, it was 2.4, 3.0 and 1.48 times more in the three species, respectively, compared to that in S. propinquum.
We also compared the expansion curve in the four species ( Fig. 4B and 4C). For Ty3/gypsy, there was no sharp peak in S. propinquum. For sudangrass, S. bicolor and S. bicolor ssp. verticilliflorum, there were a burst in 1.06, 0.22 and 0.35 Mya for Ty3/gypsy and a burst in 0.32, 0.12 and 0.21 Mya for Ty1/copia, respectively. These showed that S. bicolor and S. bicolor ssp. verticilliflorum had similar expansion pattern and sudangrass had distinct expansion pattern compared to the other two species (Fig. 4B and 4C).

HCN-p in mini core/sudangrass accessions and GWAS
We measured leaf HCN-p (in ppm) in 2-week-old seedlings of 227 mini core sorghum and seven sudangrass accessions including S722. HCN-p ranged from 189-717 with We performed GWAS with the mini core HCN-p data using 6,094,317 SNPs and 957,449 indels. No markers were associated with HCN-p with − log 10 (P) value greater than the threshold of 8.08 but we found one QTL with the highest − log 10 (P) value (7.94 and 7.47 for the two SNPs and 7.73 for the indel) mapped to chromosome 1 (Fig. 6A and 6B), in the same gene cluster mapped by Hayes et al. (2015). The two SNPs (Fig. 6C) and one indel (Fig. 6D) with the strongest association with HCN-p were all located in the 3' UTR of Sobic.001G012300 which encodes CYP79A1 (Fig. 6E), the enzyme that catalyzes the first committed step of dhurrin biosynthesis, converting L-tyrosine into (Z)-p-hydroxyphenylacetaldehyde oxime (Gleadow and Møller 2014;Laursen et al. 2016). While the top two HCN-p accessions had a haplotype of AAG in 1,146,009, 1,146,058, and 1,146,087 and bottom two accessions' haplotype was GG- (Table 2).

Discussion
In this study, we sequenced the sudangrass genome using DNBseq, ONT, and Hi-C with combined 442-fold coverage of the genome. The genome size of 715.95 Mb with 35,243 protein-coding genes is similar to Sorghum bicolor v3.1.1 (BTx623)'s 800 Mb and 34,211 genes (McCormick et al. 2018). In the 13 assembled Sorghum genomes, the genome size ranges from 468-737 Mb and number of genes from 31,898-36,952 (Tao et al. 2021). For the other four sorghum genomes curated at Phytozome 13, the genome size/gene number are 679 Mb/35,115 for BTx642 (DOE-JGI 2022a), 694 Mb/35,490 for Rio (Cooper et al. 2019), 674 Mb/34,601 for Tx430 (DOE-JGI 2022b), and 688 Mb/32,101 for SC187 (DOE-JGI 2022c). The similarity is also reflected in the protein sequence comparisons among the genomes, especially with the five genomes curated at Phytozome 13. Phylogenetic analysis based on the proteomes shows that sudgangrass (S722) is most closely related to the five sequenced US commercial sorghums curated at Phytozome 13: BTx623, Rio, SC187, RTx430 and BTx642 (Supplementary Fig. 6) which all produced panicles under field conditions. Although BTx623, SC187, RTx430 and BTx642 are all known grain sorghum varieties, Rio as a sweet sorghum variety can also produce grain yield comparable to BTx623 in some environments in addition to its high biomass and stem juice sugar content (Murray et al. 2008). In contrast, sudangrass produces lower grain yield than sorghum (Li JQ unpublished results) and in this aspect sudangrass is not as fully domesticated using grain yield (Preece et al. 2017) as a criterion. Additional phylogenetic analysis based on expansion-contraction of 186 single-copy orthologous genes showed that sorghum and maize diverged 21.6, sorghum and switchgrass 46.7, and sorghum and rice 89.5 mya. Similar analysis based on a 1150-gene coalescent tree and a 180-gene concatenated super-matrix produces a slightly lower estimates of 17. 34, 39.57, and 81.43 mya, respectively (Huang et al. 2022). This supports our data that sorghum and sudangrass may have diverged over a million years ago.
Comparing the sudangrass genome with other grasses, we found that there seems to be a correlation between the degree of domestication and the copy number of copia/gypsy LTR-RTs. In the sudangrass and the five US commercial sorghum clade shown in Supplementary Fig. 6, the LTR-RT copy number was 3076 for sudangrass S722 and 7019, 5844, 7203, 7308, and 7467 for BTx623, Rio, RTx430, SC187, and BTx642, respectively. Among the five varieties, the four  -p (ppm) in the sorghum mini core accessions than in the seven sudangrasses. ** indicates significant difference at p < 0.01 (p =0.0000383). X inside each box in the boxplot represents the mean and horizontal line the median value of each data group commercial grain sorghums (BTx623, RTx430, SC187, and BTx642) contain similar copy number, but Rio has lower copy number (Supplementary Fig. 6) and is grown as sweet sorghum for its high stem juice sugar content (Murray et al. 2008). All other cultivated sorghums or wild relatives have copy number lower than the four grain sorghums (Supplementary Fig. 6). Our results confirm findings by studies in other grasses. For example, in cultivated indica and japonica rice gypsy accounted for 21% and 22% of their respective genomes while this number ranges from 9-14% in wild rice relatives (Li et al. 2017). In another rice study, the 50 most abundant LTR retrotransposons numbered from 1885 ~ 3224 in cultivated indica and japonica rice and 337 ~ 1491 in wild rice relatives (Stein et al. 2018). Similarly in maize, in a study of 91 improved, 24 landrace and 10 teosinte maize accessions, Zhang and Qi (2019) found that landraces contained more copia/gypsy than teosinte (793,808 vs. 460,394) although improved maize contained far less than teosinte (154,136 vs. 460,394) on average. These results suggest that crop domestication at least in the grasses was accompanied by increased copia/gypsy LTR-RT insertions in the genomes. This is not surprising as LTR-RTs can influence gene expression and evolution through their cis-regulatory motifs if they are in or close to them before or after activation, and they can be activated by biotic and abiotic elicitors (Galindo-González et al. 2017). This is also supported by our LTR-RT proliferation data. For example, we found that for S. bicolor (cultivated sorghum) and S. bicolor ssp. verticilliflorum (progenitor of cultivated sorghum; Tao et al. 2021), LTR-RT number expanded three times more than S. propinquum (a perennial, rhizomatous wild sorghum) and in sudangrass, LTR-TR number only expanded 0.5 times more than S. propinquum. In addition, there was no sharp peak in Ty3/gypsy expansion in S. propinquum, but for sudangrass, S. bicolor and S. bicolor ssp. verticilliflorum, there were a burst in 1.06, 0.22 and 0.35 Mya for Ty3/gypsy and a burst in  0.32, 0.12 and 0.21 Mya for Ty1/copia, respectively. Further studies will be needed to link those activities with sorghum domestication and evolution. Because green forage sudangrass and sorghum present dhurrin poisoning danger to cattle, we analyzed 227 sorghum mini core and seven sudangrass accessions for HCN-p and found that sorghum accessions contained higher HCN-p than sudangrass accessions in general. This is in agreement with previous studies that showed lower leaf HCN-p in sudangrass compared to sorghum (Gorz et al. 1977;Loyd and Gray 1970;McBee and Miller 1980). For example, Gorz et al. (1977) analyzed nine grain sorghum A/R lines, six sweet sorghums, 13 forage sorghums, and nine sudangrass varieties with average/(range) HCN-p of 1027/(861 ~ 1208), 1130/(976 ~ 1341), 983/(80 ~ 1560), and 445/(286 ~ 614) ppm, respectively.
In mapping HCN-p using SNP/indel markers, we found that the two SNPs (Fig. 6C) and one indel ( Fig. 6D) with the strongest association with HCN-p were all located in the 3' UTR of Sobic.001G012300 which encodes CYP79A1 (Fig. 6E), the enzyme that catalyzes the first committed step of dhurrin biosynthesis (Gleadow and Møller 2014;Laursen et al. 2016). The CYP79A1 gene is critical to dhurrin biosynthesis as antisense plants reduces HCN from 221.4 µg/g to 76.2 µg/g (Pandey et al. 2019) and missense mutations P414L (Blomstedt et al. 2012) or C493Y (Skelton 2014) in the gene shuts down dhurrin production. One possible reason is that transcript levels of CYP79A1 and CYP71E1 are almost perfectly correlated (R = 0.956) (Choi et al. 2020) and CYP71E1 catalyzes the second of the three steps in dhurrin biosynthesis (Gleadow et al. 2014;Laursen et al. 2016). These GWAS results are based on HCN-p measurements of 2-week-old seedling leaves. Future studies may need to use data from seedling stage as well as other stages relevant to forage sorghum/sudangrass production. Regardless, these results indicate that it is possible to shut down dhurrin production to develop dhurrin-free sudangrass and sorghums.
In conclusion, we have sequenced a sudangrass genome producing a chromosome level assembly that is 715.95 Mb in size containing 35,243 genes. Phylogenetic analysis with whole genome proteomes showed that the sudangrass genome was more similar to but distinct from the cultivated US sorghums. We confirmed that at seedling stage, sudangrass accessions contained significantly lower HCN-p than cultivated sorghum accessions. GWAS identified a QTL most tightly associated with HCN-p and the linked SNPs were located in the 3' UTR of Sobic.001G012300 which encodes CYP79A1, the enzyme that catalyzes the first step of dhurrin biosynthesis. In addition, LTR-RT bursts may be associated with crop domestication. These insights will guide future studies to interrogate the regulatory mechanisms of the dhurrin pathway.
Authors contribution statement LW prepared samples for genome sequencing; WT, JZ, PJ, JD and JZ measured HCN-p in the seedlings of the mini core accessions and sudangrass; PWB performed chromosome alignment and phylogenetic analysis; YW performed statistic test for HCN-p between sudangrass and sorghums, and drafted the manuscript; QZ designed the HCN-p experiment, JL and LW performed GWAS, phylogenetic and genome and transposon analyses. All authors reviewed and approved the final manuscript.
Funding This work was supported by Anhui Provincial Natural Science Fund (2008085MC73), the National Natural Science Foundation of China (31971993), Anhui Provincial Key R&D Programme (202004b11020003) and the Key Project of Natural Science Research of Anhui Provincial Education Department (KJ2021ZD0108).

Data availability
The raw sequencing data of sudangrass have been deposited in NCBI with the Bioproject ID PRJNA831289. The assembled genome of sudangrass have been deposited in NCBI with Bioproject ID PRJNA830304.

Declaration
Conflict of interest We declare that there are no conflicts of interest.