Flow cytometry indicated that the genome size of M. alternifolia is approximately 300 Mb (Supplement Table S3), similar to the 345 Mb estimated by k-mer analysis. These estimates are consistent with the 356 Mb reported previously by Calvert (2017). They are also similar to the genome sizes of Metrosideros polymorpha (347 Mb) [14] and Leptospermum scoparium (297 Mb) [15], but smaller than those of Eucalyptus pauciflora (595 Mb) and Eucalyptus grandis (640 Mb) [12, 13], all of which belong to the family Myrtaceae. The genome of M. alternifolia is classified as “very small” according to the definition of genome size [30]. Such a small genome will facilitate genome research and molecular manipulation, and may provide insights into genome size variation among tree genomes.
A GC content that is too high (>65%) or too low (<25%) may cause sequence bias during whole-genome sequencing and may negatively affect genome assembly. The mean GC content in our data was 41.9%, consistent with the 42% reported by Voelker. According to the GC content and depth analysis, two of the three detected clusters were derived from M. alternifolia. However, the de novo assembled genome was 595 Mb (Table 1), about twice the genome size estimated by flow cytometry (300 Mb) and by k-mer frequency distribution analysis (345 Mb). A similarly inflated assembly length was also observed in Metrosideros polymorpha [31]. The notably larger assembly length in M. alternifolia may indicate that more than one haplotype was assembled. Based on these observations, we speculate that this species may be a diploid with a highly heterozygous genome.
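The GC thresholds and the size comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in this study: the function names `gc_content` and `gc_bias_flag` are hypothetical, and only the numeric values (25%, 65%, 41.9%, 300 Mb, 345 Mb, 595 Mb) are taken from the text.

```python
def gc_content(seq: str) -> float:
    """Return the GC percentage of a DNA sequence, ignoring N and other ambiguity codes."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def gc_bias_flag(gc_percent: float, low: float = 25.0, high: float = 65.0) -> str:
    """Classify a GC percentage against the thresholds quoted in the text."""
    if gc_percent < low:
        return "low"
    if gc_percent > high:
        return "high"
    return "ok"


# Values reported in the text (Mb).
assembly_mb = 595
estimates_mb = {"flow cytometry": 300, "k-mer": 345}

# Ratio of assembly length to each independent genome size estimate.
ratios = {method: assembly_mb / size for method, size in estimates_mb.items()}

if __name__ == "__main__":
    print(gc_bias_flag(41.9))
    for method, ratio in ratios.items():
        print(f"assembly / {method} estimate = {ratio:.2f}")
```

A ratio approaching 2 is what would be expected if both haplotypes of a heterozygous diploid were assembled as separate sequences, which is the interpretation proposed above.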
Genome heterozygosity reflects the reproductive lifestyle of a species, such as self-incompatibility or outcrossing, but can compromise the quality of a genome assembly. Our k-mer analysis identified two depth peaks: the main peak (depth = 37) was about twice as high as the minor peak (depth = 17), suggesting that the M. alternifolia genome is highly heterozygous. The heterozygosity determined using JELLYFISH was 0.8% (Supplement Table S2), which is lower than that of Eucalyptus grandis (1.0%) [12] and Eucalyptus pauciflora (1.5%) [13]. Nevertheless, the M. alternifolia genome lies at the threshold for a highly heterozygous genome (heterozygosity ≥0.8%) [27].
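As an illustration of how two peaks can be read from a k-mer depth histogram, the sketch below locates local maxima in a synthetic histogram. The histogram is a toy model constructed for this example (two Gaussian-shaped peaks placed at the depths 17 and 37 reported above, plus a low-depth error tail); it is not the data from this study, and `find_peaks`, `toy_histogram`, and `estimate_genome_size` are hypothetical helper names. The size formula is the commonly used approximation of total k-mers divided by the homozygous peak depth, applied here only to made-up numbers.

```python
import math


def find_peaks(hist, min_depth=5):
    """Return depths that are strict local maxima, skipping the low-depth error region."""
    peaks = []
    for depth in sorted(d for d in hist if d >= min_depth):
        left = hist.get(depth - 1, 0.0)
        right = hist.get(depth + 1, 0.0)
        if hist[depth] > left and hist[depth] > right:
            peaks.append(depth)
    return peaks


def estimate_genome_size(total_kmers, homozygous_peak_depth):
    """Common approximation: genome size = total k-mers / depth of the homozygous peak."""
    return total_kmers / homozygous_peak_depth


def toy_histogram():
    """Synthetic depth -> count histogram (assumption: Gaussian-shaped peaks)."""
    hist = {}
    for depth in range(1, 81):
        minor = 1000.0 * math.exp(-((depth - 17) ** 2) / (2 * 4.0 ** 2))
        main = 2000.0 * math.exp(-((depth - 37) ** 2) / (2 * 5.0 ** 2))
        errors = 50000.0 / depth ** 4  # sequencing-error k-mers pile up at low depth
        hist[depth] = minor + main + errors
    return hist


if __name__ == "__main__":
    peaks = find_peaks(toy_histogram())
    print("peak depths:", peaks)
    print("depth ratio:", max(peaks) / min(peaks))
```

In a heterozygous diploid, k-mers spanning heterozygous sites are expected near half the depth of k-mers shared by both haplotypes, so a minor peak at roughly half the depth of the main peak is the usual signature.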
Flowers of M. alternifolia are hermaphroditic, protandrous, insect-pollinated, and predominantly outcrossing, with occasional geitonogamy (self-pollination among flowers within the inflorescence) [32, 33]. The high heterozygosity of the M. alternifolia genome is likely the result of outcrossing [34], and it substantially complicates genome sequencing and assembly [35].
Contig N50 and BUSCO (RRID: SCR_015008) scores are commonly used to assess and compare genome assemblies. A higher N50 length indicates an assembly with fewer and larger contigs. The contig N50 of the M. alternifolia genome assembled with SOAPdenovo is 1,021 bp, lower than that reported previously for this species (8,778 bp) [11], which may be related to the short-read assembly method. However, it is much higher than that of Mikania cordata (312 to 352 bp), which was assembled with the same SOAPdenovo software [10]. Genome completeness, estimated by BUSCO using the viridiplantae dataset, was 88.7%, indicating relatively high genome completeness. Nevertheless, both short-read assemblies gave much lower values than assembly of Pacific Biosciences reads with MaSuRCA. Repetitive regions cover 40.43% of the genome, which is similar to the proportions in Eucalyptus pauciflora (44.77%) and Eucalyptus grandis (41.22%) [12].
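For readers unfamiliar with the N50 statistic discussed above, a minimal sketch of its standard definition follows: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name `n50` and the example lengths are illustrative only and are not taken from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 for an empty input)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


if __name__ == "__main__":
    # Toy contig lengths in bp (hypothetical values).
    contigs = [80, 70, 50, 40, 30, 20]
    print(n50(contigs))
```

Because N50 depends only on the length distribution, it measures contiguity rather than correctness, which is why it is reported alongside a completeness measure such as BUSCO.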
A total of 44,369 protein-coding genes were predicted in the assembled M. alternifolia genome, more than in Metrosideros polymorpha (39,305) [14], Eucalyptus grandis (36,376) [13], and Leptospermum scoparium (31,220) [15]. A total of 1,109 TFs were annotated in the whole genome. C2H2, Myb-related, and bHLH were the three most abundant TF families, accounting for 17.5%, 14.2%, and 12.7%, respectively. SSR analysis is conventionally performed in genome surveys; here, 457,661 SSRs were identified, with A/T, AG/CT, and AAG/CTT being the most common mono-, di-, and trinucleotide motifs, respectively. This result shows the same pattern as that in Camellia sinensis [36]. The genomic SSRs identified here should be useful for assessing the genetic diversity of M. alternifolia.
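The motif notation used above (A/T, AG/CT, AAG/CTT) groups each repeat unit with its reverse complement. The sketch below shows one way such grouping and a simple SSR scan could be implemented with regular expressions. It is not the tool used in this study: the minimum repeat counts in `MIN_REPEATS` are assumed values chosen for illustration, and `canonical_motif` and `find_ssrs` are hypothetical names.

```python
import re

# Assumed minimum repeat counts per unit length (illustrative, not from this study).
MIN_REPEATS = {1: 10, 2: 6, 3: 5}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_motif(motif):
    """Pick one representative among all rotations of a motif and its reverse complement."""
    reverse_complement = motif.translate(COMPLEMENT)[::-1]
    rotations = [
        m[i:] + m[:i]
        for m in (motif, reverse_complement)
        for i in range(len(m))
    ]
    return min(rotations)


def find_ssrs(seq):
    """Return (start, canonical motif, repeat count) for mono-, di-, and trinucleotide SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A unit of fixed length followed by at least (min_rep - 1) exact copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip homopolymers reported as di- or trinucleotide repeats (e.g. "AA").
            if unit_len > 1 and len(set(motif)) == 1:
                continue
            repeats = len(match.group(0)) // unit_len
            hits.append((match.start(), canonical_motif(motif), repeats))
    return hits


if __name__ == "__main__":
    toy = "CCG" + "A" * 12 + "GGC" + "AG" * 7 + "TTC" + "CTT" * 5 + "GCA"
    for start, motif, repeats in find_ssrs(toy):
        print(start, motif, repeats)
```

With this grouping, a CTT repeat on one strand and an AAG repeat on the other are counted under the same motif class, which matches the paired notation used in the text.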
In conclusion, we report a draft genome sequence of M. alternifolia. This cost-effective strategy based on short-read data allowed us to determine its basic genomic parameters: a small genome size, moderate GC content, and a high heterozygosity rate. We also characterized its genome annotation, including repeat element annotation, SSR identification, and TF prediction at the genome level. This M. alternifolia draft genome will be useful for future functional and comparative genomics research.