Genome assembly and annotation
A de novo draft assembly of the Cas genome was produced from 58.7 Gb of PacBio Sequel I raw data using the Canu assembler pipeline. The draft assembly has a total length of 417.64 Mb, comprising 24 440 contigs with an N50 contig size of 21.3 kb. The assembly length accounts for only 43.6% of the expected genome size, yet it recovers 62.6% of the universal single-copy orthologs as complete (Table 1). This may indicate the accumulation of repetitive DNA or transposable elements (TE) throughout the evolutionary history of the species (Rojas-Gomez et al. 2020), which contribute to the larger genome size but are hard to sequence and assemble. Partial or complete genome duplication is also a common feature of plant genomes (Wendel et al. 2016). The GC content of the Cas draft assembly was 40.05%, similar to that of P. guajava (Feng et al. 2020) and E. grandis (Myburg et al. 2014), the two most closely related species with sequenced genomes.
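The summary statistics quoted above (N50, GC content, and the genome size implied by the 43.6% figure) follow standard definitions, which can be sketched as below. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the function names and the toy contigs are hypothetical, and the expected genome size is back-calculated from the two reported figures rather than taken from the text.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def gc_percent(sequences: Iterable[str]) -> float:
    """GC content in percent, counting only unambiguous A/C/G/T bases."""
    gc = 0
    acgt = 0
    for seq in sequences:
        s = seq.upper()
        g_c = s.count("G") + s.count("C")
        gc += g_c
        acgt += g_c + s.count("A") + s.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def implied_genome_size_mb(assembly_mb: float, percent_covered: float) -> float:
    """Genome size implied when an assembly covers a given percentage of it."""
    return assembly_mb / (percent_covered / 100.0)


# Hypothetical toy contigs, for illustration only.
toy_contigs = ["ATGCGC", "ATAT", "GG", "A"]
toy_n50 = n50(len(c) for c in toy_contigs)   # 4
toy_gc = gc_percent(toy_contigs)             # 6 of 13 bases are G or C

# Reported values from the text: 417.64 Mb covering 43.6% of the expected size.
implied_mb = implied_genome_size_mb(417.64, 43.6)
print(toy_n50, round(toy_gc, 2), round(implied_mb, 1))
```

Taking the reported figures at face value, the last call gives an expected genome size of roughly 958 Mb.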
Structural annotation of the draft assembly resulted in 59 036 gene models. After filtering out incomplete genes, overlapping exons, and duplicates, a total of 52 422 genes were retained. This is a larger number of predicted protein-coding genes than in the sequenced genomes of the myrtaceous species L. scoparium (Mānuka) (31 220 genes), E. grandis (36 376 genes) and P. guajava (25 601 genes) (Thrimawithana et al. 2019; Myburg et al. 2014; Feng et al. 2020). Of the retained gene models, 6 106 are mono-exonic and 46 316 are multi-exonic. On average, multi-exonic genes are composed of 5.9 exons and 4.9 introns. Median CDS length was 1 279 nt, but multi-exonic gene size varies greatly, with CDS lengths ranging from 209 nt to 14 373 nt. For mono-exonic genes, the average length was 948.6 nt. This is consistent with reports of plants having more protein-coding genes of smaller size and with fewer exons per gene than other organisms (Ramirez-Sanchez et al. 2016).
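The gene-structure figures above can be derived from exon coordinates alone. The following sketch shows one way to do so, assuming each gene model is reduced to a list of coding-exon intervals; the `GeneModel` class, the `summarize` function and the toy gene models are hypothetical and are not taken from the annotation pipeline used here.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Dict, List, Tuple


@dataclass
class GeneModel:
    gene_id: str
    exons: List[Tuple[int, int]]  # 1-based, inclusive (start, end) of coding exons

    @property
    def cds_length(self) -> int:
        return sum(end - start + 1 for start, end in self.exons)


def summarize(models: List[GeneModel]) -> Dict[str, float]:
    """Split gene models into mono- and multi-exonic and summarize each class."""
    mono = [m for m in models if len(m.exons) == 1]
    multi = [m for m in models if len(m.exons) > 1]
    summary: Dict[str, float] = {"n_mono": len(mono), "n_multi": len(multi)}
    if multi:
        exon_counts = [len(m.exons) for m in multi]
        summary["mean_exons_multi"] = mean(exon_counts)
        # A gene with k exons has k - 1 introns.
        summary["mean_introns_multi"] = mean(c - 1 for c in exon_counts)
        summary["median_cds_multi"] = median(m.cds_length for m in multi)
    if mono:
        summary["mean_cds_mono"] = mean(m.cds_length for m in mono)
    return summary


# Hypothetical toy gene models, for illustration only.
toy_models = [
    GeneModel("g1", [(1, 300)]),
    GeneModel("g2", [(1, 100), (201, 350)]),
    GeneModel("g3", [(1, 90), (200, 289), (400, 519)]),
]
toy_summary = summarize(toy_models)
print(toy_summary)
```

Because every multi-exonic gene has exactly one intron fewer than it has exons, the mean intron count is always the mean exon count minus one, which matches the reported 5.9 exons and 4.9 introns.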
Functional annotation was conducted against the non-redundant set of genes from the KEGG database. Of the 52 422 complete gene models, 15.55% (8 153) showed homology with KEGG orthologs. The majority of the annotated genes correspond to genetic information processing pathways, such as proteins involved in transcription and translation, as well as folding, sorting and degradation. The second largest group of annotated genes (%) corresponds to environmental information processing pathways, including membrane transport and signal transduction. The third largest group corresponds to carbohydrate metabolism pathways. The complete distribution of the functional categories of the annotated genes is presented in Fig. 2. The remaining gene models, that is, those without annotation, were classified as hypothetical proteins. Many of these gene models showed homology with other hypothetical proteins reported in the NCBI database from both closely and distantly related species (data not shown). Because there is no reported evidence of their function in vivo, these hypothetical proteins remain uncharacterized.
Genes involved in flavonoid biosynthesis
The most important flavonoids in Cas are proanthocyanidins, anthocyanins, and flavonols (quercetin aglycones). All three classes are being intensively investigated for their potential benefit to human health and are the compounds that confer functional food properties. For example, proanthocyanidins are thought to help maintain urinary tract health (Howell 2002; Foo et al. 2000), anthocyanins are important as antioxidants (Yan et al. 2002; Sun et al. 2002), and flavonols are implicated in anti-plasmodic, hepatoprotective, anti-inflammatory, and anti-cancer bioactivities, among others (Hollman and Katan 1997; Ren et al. 2003). Given the importance of the flavonoids, in this study we focus on analyzing the presence of genes involved in their biosynthesis. For this, we obtained the KEGG reference pathways for flavonoid biosynthesis (KEGG modules: M00137 and M00138; KEGG map: map00941, Fig. 3) and phenylpropanoid biosynthesis (KEGG map: map00940, Fig. 4). We then compared our Cas draft assembly with the annotated genes of Eucalyptus grandis (rose gum) [egr] in the KEGG repository. According to the KEGG pathway assignments, 17 isoforms were annotated for chalcone synthase (K00660; [EC:2.3.1.74]), 4 for trans-cinnamate 4-monooxygenase (K00487; [EC:1.14.14.91]), 3 for chalcone isomerase (K01859; [EC:5.5.1.6]), 6 for naringenin 3-dioxygenase (K00475; [EC:1.14.11.9]), 1 for anthocyanidin synthase (K05277; [EC:1.14.20.4]) and 2 for flavanone 4-reductase (K13082; [EC:1.1.1.219 1.1.1.234]); these genes are part of the flavonoid biosynthetic pathway (Fig. 3). In addition, 10 isoforms were annotated for phenylalanine ammonia lyase (K10775; [EC:4.3.1.24]) and 9 for 4-coumarate-CoA ligase (K01904; [EC:6.2.1.12]), as part of the phenylpropanoid biosynthetic pathway (Fig. 4 and Table S1).
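Isoform counts such as those listed above amount to tallying how many gene models were assigned to each KEGG ortholog (KO) of interest. A minimal sketch of that tally follows, assuming the assignments are available as a two-column, tab-separated gene-to-KO table; the table layout, the gene identifiers and the function name are hypothetical, and only the KO identifiers and enzyme names come from the text.

```python
from collections import Counter
from typing import Iterable

# KO identifiers and enzyme names as listed in the text above.
PATHWAY_KOS = {
    "K10775": "phenylalanine ammonia lyase",
    "K00487": "trans-cinnamate 4-monooxygenase",
    "K01904": "4-coumarate-CoA ligase",
    "K00660": "chalcone synthase",
    "K01859": "chalcone isomerase",
    "K00475": "naringenin 3-dioxygenase",
    "K13082": "flavanone 4-reductase",
    "K05277": "anthocyanidin synthase",
}


def count_isoforms(lines: Iterable[str]) -> Counter:
    """Count gene models per pathway KO from 'gene_id<TAB>KO' lines."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[1]:
            continue  # gene model without a KO assignment
        ko = fields[1]
        if ko in PATHWAY_KOS:
            counts[ko] += 1
    return counts


# Hypothetical toy assignments, for illustration only.
toy_table = [
    "cas_g0001\tK00660",
    "cas_g0002\tK00660",
    "cas_g0003\tK01859",
    "cas_g0004\t",          # unannotated gene model
    "cas_g0005\tK02999",    # KO outside the pathways of interest
]
toy_counts = count_isoforms(toy_table)
for ko, n in sorted(toy_counts.items()):
    print(ko, PATHWAY_KOS[ko], n)
```

Keeping the KO list in one dictionary makes it straightforward to extend the tally to further steps of the pathway without touching the counting logic.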
The genes and regulatory mechanisms involved in the flavonoid pathway are of great interest in Cas, as they affect the timing and location of flavonoid biosynthesis as well as the specific flavonoids produced. The central pathway for flavonoid biosynthesis is conserved in most plants but, depending on the species, a group of enzymes such as isomerases, reductases, hydroxylases and several Fe2+/2-oxoglutarate-dependent dioxygenases can modify the basic flavonoid skeleton, generating different subclasses of flavonoids (Martens et al. 2010). Likewise, flavonoid biosynthesis is tissue specific, developmentally regulated, and can be induced by a variety of environmental factors, including light, UV radiation, fungal infection, interaction with microorganisms, and wounding, among others (Winkel-Shirley 2001). This study presents the first annotation of genes involved in flavonoid biosynthesis for Cas. These genes can be targeted for manipulation of flavonoid biosynthesis through various means or used as markers for the selection of desirable flavonoid profiles through breeding. For example, in cranberry it has been shown that an important aspect of anthocyanins as antioxidants is the specific aglycone, as well as the glucoside, since this affects both the antioxidant potential and the bioavailability (Satué-Gracia et al. 1997; Wang et al. 1997).
This study provides, as a genetic resource, the first draft assembly of the Cas genome, built from long PacBio reads. This assembly forms the basis for future genomic and molecular studies of this crop. In particular, it opens the way to studying the evolution of secondary metabolites, mainly of the polyphenolic type, which give this crop its special characteristics as a functional food.