A draft genome assembly of “Cas” (Psidium friedrichsthalianum (O. Berg) Nied.): an indigenous crop of Costa Rica untapped

Psidium friedrichsthalianum (O. Berg) Nied. is a tropical tree species in the Myrtaceae family, natively distributed from southern Mexico to eastern Venezuela and Ecuador and commonly known as "Cas'', "Costa Rican guava" or “Sour Guava”. The “Cas” produces a fruit with a rather distinctive acidic flavor and has bioactive compounds with biological potential equal or greater than common Guava; is considered an indigenous crop in Costa Rica with characteristics as a functional food untapped. This species has not been completely domesticated, and can be found in home gardens, paddocks, small groups, and, more recently, in small and medium sized plantations. Also, the plantations of this species do not have technical and scientific support or agronomic promotion from industry, nor are there genetic resources or germplasm readily available to farmers. This limits its commercial development and the implementation of selection or genetic improvement programs. In this study, we present the first draft assembly of the Cas genome using PacBio long reads and the Canu assembly pipeline. Our draft assembly has a total length of 417.64 Mb, with 24 440 contigs and a N50 contig size of 21.3 Kb. Structural annotation resulted in 59 036 gene models. Functional annotation was conducted against the non-redundant set of genes from the KEGG database. Of the 52 422 complete genes models, 15.55% (8 153) presented homology with KEGG orthologs. The genes found in our Cas draft assembly were compared to those found in Eucalyptus grandis W. Hill. in the KEGG repository. According to the KEGG pathway assignments, 33 isoforms were annotated as part of the flavonoid biosynthetic pathway. In addition, 19 isoforms were annotated as part of phenylpropanoid biosynthetic pathway. The results of this study provide an overview of the first draft of the Cas genome assembly using PacBio long reads. This new genomic resource represents the basis for exploring the genetic potential of this crop with characteristics as a functional food.

Recent studies have shown that fruit of Cas (P. friedrichsthalianum) has bioactive compounds and biological potential equal or greater than guava (P. guajava) (González et al. 2012;Cuadrado-Silva et al. 2017a). This tree produces phenolic compounds, mainly quercetin aglycones and proanthocyanidins, derived from epi-catechin units. Other compounds such as ellagitannins and benzophenones have also been putatively identified (Rojas-Garbanzo et al. 2021, 2019Cuadrado-Silva et al. 2017a, b;Granados-Chinchilla et al. 2016;Flores et al. 2013;González et al. 2012). Previous studies have categorized it as a functional food due to its association with positive health effects. Anti-diabetic, anti-plasmodic, hepatoprotective, antioxidant, antimicrobial, anti-inflammatory, and anti-platelet activity are some of the reported bioactivities for leaves and fruit extracts (Cuadrado-Silva et al. 2017b;González et al. 2012; Fig. 1 a Image of a Cas tree (P. friedrichsthalianum) located in Tacacorí, Alajuela, Costa Rica (DMS: 10°03 0 07.3'' N-84°12 0 52.3'' W) b Cas fruits c Cas tree flowers hermaphrodites d Cas tree trunk e Cas tree branch, characterized by being reddish and having foliage of intense green color Flores et al. 2015Flores et al. , 2013Granados-Chinchilla et al. 2016;Rojas-Garbanzo et al. 2021).
In Costa Rica, the fruit of Cas is consumed mainly as fresh fruit and is used to prepare traditional foods, such as pasteurized soft drinks, ice creams, juices, desserts, and jams. It is sold at grocery stores and farmers markets (León 2000). In Costa Rica, this species has not been completely domesticated, and can be found in home gardens, paddocks, small groups, and, more recently, in small and medium-sized plantations. In general, Cas is considered an indigenous crop in Costa Rica with characteristics as a functional food untapped. The plantations of this species do not have technical and scientific support or agronomic promotion from industry, nor are there genetic resources or germplasm readily available to farmers. This limits its commercial development and the implementation of selection or genetic improvement programs.
Previous work on the assembly of the complete genome of economically important members of the Myrtaceae, such as Psidium guajava L. (Feng et al. 2020), Eucalyptus grandis W. Hill. (Myburg et al. 2014) and Leptospermum scoparium JRForst. & G.Forst (Thrimawithana et al. 2019) has been published. In this study, we present the first draft assembly of the Cas genome (P. friedrichsthalianum) using PACBio long reads and the Canu assembly pipeline (Koren et al. 2017). This new genomic resource represents the basis for exploring the genetic potential of this crop with characteristics as a functional food.

Sampling permits
All the needed permits for sampling leaf tissue from Cas were requested by Comisión de Biodiversidad-UCR at Universidad de Costa Rica (under permit # 133-2018) and Mr. Alfonso Ruíz, owner of the farm where the sampling was carried out.
Plant material, DNA preparation, and genome sequencing Plant material used for this study was collected from a small plantation located at Tacacori, Alajuela, Costa Rica (DMS: 10°03 0 07.3'' N-84°12 0 52.3'' W). Leaf tissue was collected from mature trees and transported to the laboratory at room temperature in sterile polypropylene tubes (50 mL) containing silica gel. Genomic DNA was extracted from 4 samples of 50 mg each of dry leaf tissue, using the DNeasy Plant Mini kit (QIAGEN, Inc.). DNA size, quantity, and integrity was assessed with an Agilent's TapeStation system (model 4200, CA, USA). The sample with the highest molecular weight and DNA integrity was selected for sequencing. SMRTbell library preparation (template size 4-10 Kb) and next-generation sequencing was performed by the North Carolina State University Genomic Sciences Laboratory (Raleigh, NC, USA) using the PacBio Sequel I system.
Genome assembly, gene prediction, and annotation Raw reads (BAM files) from each SMRTcell were converted to FASTQ format, using bam2fastq, and collated into a single file. These long reads were corrected, trimmed and assembled using Canu v2.0 (Koren et al. 2017), with corrected Error Rate set to 0.155, min Read Length set to 500, min Overlap Length set to 400, and genome size equal to 958 Mbp, based on previous flow cytometry estimates (Rojas-Gómez et al. 2021). Genome assembly quality statistics were estimated with QUAST v5.0.2 (Gurevich et al. 2013). For comparison purposes, the Guava genome sequence (GenBank assembly accession: GCA_016432845.1) was set as a reference for estimating missassemblies, genome fraction, duplication ratio, and potential transposable elements with QUAST. Assembly completeness was assessed with BUSCO v4.1.2, using the Eudicotyledons (odb10) database (Seppey et al. 2019). Structural annotation and gene predictions were obtained with Augustus v3.3.3 (Stanke et al. 2008) and filtered with gFACs v1.1.2 (Caballero and Wegrzyn 2019), excluding incomplete gene models and redundant exons.
Functional annotation was performed with Blas-tKOALA (Kanehisa et al. 2016) and the non-redundant set of genes from the KEGG database (Kanehisa and Goto 2000;Kanehisa 2019;Kanehisa et al. 2021), using default parameters. Additionally, annotation of the flavonoid biosynthesis pathway was performed using modules M00137 and M00138, from genes annotated in Eucalyptus grandis W.Hill, a member of the Myrtaceae family and the closest relative for which there is data available.

Results and discussion
Genome assembly and annotation A de novo draft assembly of the Cas genome was produced using 58.7 Gb of PacBio Sequel I raw data and the Canu assembler pipeline. The draft assembly has a total length of 417.64 Mb, with 24 440 contigs and a N50 contig size of 21.3 Kb. The assembly length accounts for 43.6% of the expected genome size (Rojas-Gomez et al. 2020) and yet it presents 62.6% complete universal single-copy orthologs (Table 1). This may indicate the accumulation of repetitive DNA or transposable elements (TE) throughout the evolutionary history of the species, which contributes to the larger genome size but are hard to sequence and assemble. Partial or complete genome duplication is also a common feature in plant genomes (Wendel et al. 2016). The GC content of the Cas draft assembly was 40.05%, similar to that of P. guajava L. (Feng et al. 2021) and E. grandis W. Hill. (Myburg et al. 2014), the two most closely related species with sequenced genomes. Furthermore, when comparing the Cas assembly to the latest Guava reference sequence, the aligned genome fraction was 21.57% with a duplication ratio of 1.38. This indicates that a fraction of the Cas assembly shares a strong homology with the Guava reference, hinting at conserved regions in the Psidium spp. genomes. Conversely, up to 60 of the missassemblies between the Cas and Guava sequence can be attributed to potential transposable elements (TE) of different lengths, which along with the duplication ratio would help explain the differences in genome size, as previously mentioned.
Structural annotation of the draft assembly resulted in 59 036 gene models. However, after filtering out incomplete genes, overlapping exons, and duplicates, a total of 52 422 genes were retained. This is a larger number of predicted protein-coding genes than in the sequenced genomes of the myrtaceous species, Leptospermum scoparium JRForst. & G.Forst (31 220 genes), Eucalyptus grandis W. Hill. (36 376 genes) and Psidium guajava L. (25 601 genes) (Thrimawithana et al. 2019;Myburg et al. 2014;Feng et al. 2021). In addition, of the remaining gene models, 6106 correspond to mono-exonic genes, while 46 316 correspond to multi-exonic genes. On average, multi-exonic genes are composed of 5.9 exons and 4.9 introns. Median CDS length was 1 279 nt, but multiexonic gene size varies greatly with CDS ranging from 209 to 14 373 nt. For mono-exonic genes the average length was 948.6 nt. This is consistent with reports of plants having more protein coding genes of smaller size and with fewer exons per gene than other organisms (Ramirez-Sanchez et al. 2016).
Functional annotation was conducted against the non-redundant set of genes from the KEGG database. Of the 52 422 complete genes models, 15.55% (8 153) presented homology with KEGG orthologs. The majority of the annotated genes correspond with genetic information processing pathways, such as proteins involved in transcription and translation processes, as well as folding, sorting and degradation. The second largest group of annotated genes correspond to environmental information processing pathways, including membrane transport and signal transduction. Finally, the third largest group of annotated genes correspond to carbohydrate metabolism pathways. A complete distribution of the functional categories of the annotated genes is presented in Fig. 2. The remaining gene models, that is the ones without annotation, were classified as hypothetical proteins. Many of these gene models showed homology with other hypothetical proteins reported in the NCBI database from both closely and distantly related species (data not shown). Because there is not reported evidence of their function in vivo these hypothetical proteins remain uncharacterized.

Genes involved in flavonoid biosynthesis
The most important flavonoids in Cas are proanthocyanidins, anthocyanins, and flavonols. All three classes are being intensively investigated for their potential benefit to human health and are the compounds that provide functional food properties. For example, proanthocyanidins are thought to help in maintaining urinary tract health (Howell 2002;Foo et al. 2000), anthocyanins are important as antioxidants (Yan et al. 2002;Sun et al. 2002), and flavonols are implicated in anti-plasmodic, hepatoprotective, anti-inflammatory, and anti-cancer bioactivities, among others (Hollman and Katan 1997;Ren et al. 2003). Thus, considering the importance of the flavonoids, in this study we focus on analyzing the presence of genes involved in its biosynthesis. For this, we obtained the KEGG reference pathway for flavonoid biosynthesis (KEGG modules: M00137 and M00138; KEGG map: map00941, Fig. 3) and phenylpropanoid biosynthesis (KEGG map: map00940, Fig. 4). Then, we compared our Cas draft assembly with annotated genes of Eucalyptus grandis W. Hill. in the KEGG repository.
According to the assignments of the KEGG pathway in the flavonoid biosynthetic pathway, 4 isoforms were annotated for Trans-cinnamate 4-monooxygenase (K00487, [EC: 1.14.1491]), 17 isoforms were annotated for Chalcone synthase (K00660, [EC: 2.3.1.74]), 3 isoforms were annotated for Chalcone isomerase (K01859, [EC: 5.5.1.6]), 6 isoforms were annotated for Naringenin 3-dioxygenase (K00475, [EC: 1.14.11.9]), 1 isoform were annotated for Anthocyanidin synthase (K05277, [EC: 1.14.20.4]) and 2 isoforms were annotated for Flavanone 4-reductase (K13082, [1.1.1.234])( Fig. 3 and Table S1). In addition, 10 isoforms were annotated for Phenylalanine ammonia lyase (K10775, [EC: 4.3.1.24]) and 9 isoforms were annotated for 4-coumarate-CoA ligase (K01904, [EC: 6.2.1.12]); as part of the phenylpropanoid biosynthetic pathway (Fig. 4 and Table S1). These genes annotated in our study express enzymes that are precursors in the formation of different flavonoids previously identified in Cas, for  (Fig. 3 and Fig. 4) (Flores et al. 2013;Rojas-Garbanzo et al. 2019. The identification of genes for flavonoid biosynthesis is very important not only to clarify the expression and accumulation of bioactive molecules of Cas, but also for the application of biotechnology to improve its biosynthesis. To date, most of the articles published in Cas have focused on the chemical characterization of its metabolites. The number of isoforms reported for the genes annotated in the flavonoid and phenylpropanoid biosynthetic pathways could be related to the number of duplications in our Cas assembly; several authors have suggested that the events of partial or complete duplication of plant genomes are common and expected characteristics, as well, may be important in the evolution of genome sizes in plants (Wang et al. 2016;Wendel et al. 2016). While we concede that the number of isoforms reported may be affected by the nature of the annotation methods (in silico prediction) and reference database used, our results are consistent with isoforms reports in the similar biosynthetic pathway of different species. For example: Theobroma cacao L. (Argout et al. 2011), Glycyrrhiza uralensis (Mochida et al. 2016) and Carthamus tinctorius L. (Chen et al. 2018).
The genes and regulatory mechanisms involved in the flavonoid pathway are of great interest in Cas, as these affect the temporal and spatial flavonoid biosynthesis as well as the specific flavonoids produced. It has been shown that the central pathway for flavonoid biosynthesis is conserved in most plants, but depending on the species, a group of enzymes, such as Fig. 3 Genes annotated for Cas in pathway for flavonoid biosynthesis (KEGG reference, map00941). Red lines correspond to flavonoid biosynthetic upstream and downstream pathway. Green boxes correspond to the genes annotated in our study for Cas (EC: 1.14.1491: Trans-cinnamate 4-monoxygenase, EC: 2.3.1.74: Chalcone synthase, EC: 5.5.1.6: Chalcone isomerase, EC: 1.14.11.9: Naringenin 3-dioxygenase, EC: 1.1.1.234: Flavanone 4-reductase, EC: 1.14.20.4: Anthocyanidin synthase). Likewise, the legends in red correspond to number of isoforms reported in our study for each of these annotated genes. These genes annotated for Cas express enzymes that are precursors in the formation of different flavonoids, which have been previously identified in this species. Thus, the red stars in the figure mark these flavonoids for Cas (Naringenin (Flavanone), Quercetin (Flavonol), Gallocatechin (Monomeric flavanol), Epigallocatechin (Monomeric flavanol) and Myricetin (Hexahydroxyflavone) isomerases, reductases, hydroxylases and several Fe 2? /2-oxoglutarate-dependent dioxygenases can modify their basic flavonoid skeleton, generating different subclasses of flavonoids (Martens et al. 2010). Likewise, it has been shown that flavonoid biosynthesis is tissue specific, regulated by development and can be induced by a variety of environmental factors, including light, UV radiation, fungal infection, interaction with microorganisms, wounding, among others (Winkel-Shirley 2001). This study presents the first annotation of genes involved in flavonoid biosynthesis for Cas. These genes can be targeted for manipulation of flavonoid biosynthesis through various means or used as markers for the selection of desirable flavonoid profiles through breeding. For example, in cranberry it has been shown that an important aspect of anthocyanins as antioxidants is the specific aglycone, as well as the glucoside, since this affects both the antioxidant potential and the bioavailability (Satué-Gracia et al. 1997;Wang et al. 1997).
The results of this study provide an overview of the first draft of the Cas genome assembly using PacBio reads. This assembly represents the basis for new genomic and molecular studies in the future of this crop. In particular, it delves into the evolution of secondary metabolites, mainly of the polyphenolic type, which give this crop special characteristics as a functional food.
Data availability The whole-genome sequence of Cas has been deposited at the DDBJ/ENA/GenBank databases under Fig. 4 Annotated genes for Cas in pathway for phenylpropanoid biosynthesis (KEGG reference, map 00,940). Green boxes correspond to functional annotation of genes for Cas related to flavanone biosynthesis as part of the phenylpropanoid biosynthetic pathway (EC: 4.3.1.24: Phenylalanine ammonialyase, EC: 1.14.1491: Trans-cinnamate 4-monoxygenase, EC: 6.2.1.12: 4-coumarate -CoA ligase). The legends in red color inform the number of isoforms reported in our study for each of these genes annotated in Cas BioProject PRJNA725514, under the accession JAGVVM000000000. The version described in this paper is version JAGVVM010000000. Available data includes raw reads, draft assembly sequence, and gene models. https:// www.ncbi.nlm.nih.gov/search/all/?term=PRJNA725514