Genome Survey Sequencing of Nomocharis forrestii, Assembly of Its Complete


Background: Nomocharis is a genus closely related to Lilium in the family Liliaceae. Studying how the uplift of the Qinghai-Tibet Plateau has influenced plants and their biological diversity is useful, and Nomocharis is one such group of plants; research on this genus should be especially informative, considering the genetic diversity of flowers. However, the genetic information of Nomocharis has not been fully elucidated. Results: As a first step toward a complete Nomocharis reference genome, we performed a genome survey. Next-generation sequencing (NGS) was used for de novo sequencing of the entire Nomocharis forrestii genome. Sequencing yielded approximately 137.4 Gb of high-quality data, with a total sequencing depth of approximately 63× and a Q30 ratio of 91.95%. The estimated genome size was approximately 2.17 Gb, the repetitive sequence content was approximately 84.7%, the heterozygosity rate was 3.99%, and the estimated GC content of the genome was 43%. Furthermore, an annotated circular chloroplast gene map was generated, and a preliminary evolutionary analysis was performed. In addition, a total of 78,045 high-quality SSR markers were developed. Conclusion: Nomocharis forrestii has a heterozygous genome of 2.17 Gb, its SSR markers are predominantly dinucleotides, and its chloroplast genome shows the highest homology with that of Lilium bakerianum, followed by Lilium distichum. To the best of our knowledge, this report describes the first de novo whole-genome sequencing and assembly performed for Nomocharis. The results of this study may provide new resources for the future genetic analysis and molecular breeding of Nomocharis.

the subsequently analyzed genome. The statistics of the N. forrestii sequencing data are shown in Table 1.

Clean data of high-quality reads were assembled using SOAPdenovo software based on a de Bruijn graph. The total length of the obtained genome sequence was 689 Mb, and the specific assembly results are shown in Table 2.

Genome Size Estimation and Genome Survey
Using 137.4 Gb of data for 17-mer analysis, the total number of K-mers was determined to be 1.3 × 10^11, and the expected K-mer depth was observed to be 75. According to the formula (genome size = total number of K-mers/expected K-mer depth), the genome size was calculated to be approximately 1.73 Gb, whereas GenomeScope software estimated it to be 2.17 Gb (Fig. 1). In our experience, K-mer calculations may underestimate the size of complex genomes because homologous K-mers are overlooked; therefore, the GenomeScope prediction was considered more accurate, and the genome size of N. forrestii was estimated to be 2.17 Gb.

The annotation results indicate that the chloroplast genome of the sample is a circular, double-stranded molecule. Similar to most higher plant chloroplast genomes, it contains two inverted repeats (IRs), namely, IRA and IRB; between the inverted repeats, there is a large single-copy region (LSC) and a small single-copy region (SSC) (Fig. 2).

genes, of which 7 genes have more than 2 copies. The total GC content of the chloroplast genome was determined to be approximately 37.0%. All chloroplast genes and classifications are shown in Table 3.

The chloroplast genome plays an important role in the reconstruction of plant phylogeny and evolutionary history. In our research, we used the complete sequences of 25 chloroplast genomes (15 of which are from Liliaceae) and constructed a phylogenetic tree using MEGAX software [24,25]. The tree was inferred with the neighbor-joining method [26] using 1000 bootstrap replicates and is drawn to scale; evolutionary distances were computed with the maximum likelihood method [27] and are expressed as the number of base substitutions per site, with all ambiguous positions removed for each sequence pair.
Branch lengths in the tree represent evolutionary distances, and the percentage of bootstrap replicate trees supporting each grouping is shown next to the corresponding branch. Reconstructing molecular phylogenetic relationships from the complete chloroplast genome sequence strongly supports the phylogenetic relationships of Liliaceae plants. In this study, it was observed that Nomocharis forrestii and Lilium bakerianum have the highest homology, followed by Lilium distichum (Fig. 3).

accounted for 0.17%, and hexanucleotides accounted for 0.30% (Fig. 4a).

Among dinucleotide SSR markers, AT/AT repeat motifs accounted for 35.67%, AG/CT motifs accounted for 48.34%, AC/GT motifs accounted for 15.05%, and CG/CG motifs accounted for only 0.94% (Fig. 4b). Among the predominant trinucleotide SSR markers, the AAT/ATT, AAG/CTT and ATC/ATG repeat motifs accounted for 28.13%, 28.00% and 12.93%, respectively (Fig. 4c).

SSR markers classified by the number of repeated motifs are summarized in Fig. 5. Dinucleotide and trinucleotide SSR markers were considerably more common than other SSR markers. In general, the number of SSR markers was observed to decrease as the length of the repeated motif increased.

The genome of garlic, a member of the Liliaceae family, has been reported previously. The size of the sequenced garlic genome is 16.24 Gb, accounting for 96.1% of the total garlic genome [29]. Among the representative monocots, the genome size of indica rice is 430 Mb, and the functional coverage is 92% [30]; the genome size of japonica rice is 420 Mb, and the assembly coverage is 93% [31]. The genome size of maize is 2.3 Gb [32]. According to our genome survey data, using all clean data for GenomeScope analysis, the estimated size of the N. forrestii genome was 2.17 Gb.
Compared with the garlic genome, the whole genome of N. forrestii is small, but it is relatively large among monocots. With the development of NGS technology, whole-genome sequencing has begun to be widely employed in horticultural plants, which may play an important role in understanding the key genes of N. forrestii.

GC content directly affects sequencing bias [33]. GC content outside the 25-65% interval may cause sequence bias in Illumina sequencing, a notable problem that affects genome assembly [34]. The GC content of N. forrestii is 43.0%, which is higher than that of potato (34.8-36.0%) [35], Luffa cylindrica (37.9%) [36], and humans (41%) but lower than that of Gracilariopsis lemaneiformis (48%) [37].

From the 1,155,548,885-bp genome survey sequence, 34,552 SSRs, excluding mononucleotide repeats, were identified. The density of SSRs in the genome of N. forrestii is therefore estimated to be approximately 29.90 SSR/Mb, which is considerably lower than the 135.50 SSR/Mb measured in Arabidopsis [38] and the 117.57 SSR/Mb detected in Luffa cylindrica. Among the dinucleotide repeat motifs, AG/CT was the most abundant type, accounting for 48.34%, followed by AT/AT, accounting for 35.67%; among the trinucleotide repeats, AAT/ATT and AAG/CTT account for approximately the same proportion, 28.13% and 28.00%, respectively; among the other polynucleotide repeats, AAAT/ATTT, AAAAT/ATTTT and AAAAAG/CTTTTT account for the highest proportions, and all are A/T-rich motifs. This phenomenon is in keeping with the findings obtained in studies of other species, such as L. cylindrica [36], rice [39], and Arabidopsis [40].

Chloroplasts play important roles in the study of evolution and metabolism.
The assembly and analysis of the whole chloroplast genome may also provide evidence to determine the evolutionary position and phylogeny of N. forrestii. The results of this study also indicate that Nomocharis evolved from the genus Lilium.

Total genomic DNA was extracted from fresh leaves using the CTAB method [16].
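The SSR density quoted in the discussion above follows directly from the reported counts (34,552 SSRs in 1,155,548,885 bp). The sketch below reproduces that arithmetic and, purely for illustration, shows one way perfect SSRs with 2-6 bp motifs could be located using regular expressions. The function names and the minimum repeat thresholds are assumptions made for this example; the text does not state which SSR search tool or thresholds were used.

```python
import re

def ssr_density_per_mb(n_ssrs, surveyed_bp):
    """SSR density expressed as SSRs per megabase of surveyed sequence."""
    return n_ssrs / (surveyed_bp / 1e6)

# Hypothetical thresholds: motif length -> minimum number of tandem repeats.
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}

def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    return motif not in (motif + motif)[1:-1]

def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, motif, repeat_count) for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for motif_len, n in sorted(min_repeats.items()):
        # Group 2 captures the motif; \2 requires at least n-1 further copies.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (motif_len, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(2)
            if is_primitive(motif):  # skips e.g. AAAA or ATAT reported as 4-mers
                hits.append((m.start(), motif, len(m.group(1)) // motif_len))
    return hits

# Counts reported in the text.
print(f"{ssr_density_per_mb(34_552, 1_155_548_885):.2f} SSR/Mb")

# Toy sequence: (AG)7 and (AAT)5 separated by CC spacers.
toy = "CC" + "AG" * 7 + "CC" + "AAT" * 5 + "CC"
print(find_ssrs(toy))
```

The primitive-motif check keeps a long dinucleotide tract from being counted a second time as a tetranucleotide or hexanucleotide repeat.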

Illumina Sequencing Data Analysis and Assembly
The Illumina HiSeq platform (Illumina Inc., San Diego, CA, USA) was used for genome sequencing. Sequencing was performed by Shaanxi Baiai Gene Information Technology Co., Ltd. Clean data were obtained through strict quality evaluation and filtering of the raw Illumina sequencing data. SOAPdenovo software (version 1.05, BGI, Beijing, China; https://github.com/aquaskyline/SOAPdenovo2) [41], which is based on a de Bruijn graph, was employed to assemble the clean data of high-quality reads. After assembly, the GC content of the assembled genome was quantified.
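The text does not describe how GC content was quantified. As a minimal sketch, assuming it is simply the fraction of G and C among the unambiguous bases of the assembled sequences, the calculation could look like the following. The function name `gc_content` and the toy sequences are hypothetical.

```python
from collections import Counter

def gc_content(sequences):
    """GC fraction over A/C/G/T bases; N and other ambiguity codes are ignored."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    gc = counts["G"] + counts["C"]
    acgt = gc + counts["A"] + counts["T"]
    if acgt == 0:
        raise ValueError("no unambiguous bases found")
    return gc / acgt

# Toy scaffolds standing in for an assembly (not data from the study).
toy_scaffolds = ["ATGCGC", "NNATAT", "ggcc"]
print(f"GC content: {gc_content(toy_scaffolds):.1%}")
```

Excluding N from the denominator matters for scaffold-level assemblies, where gaps are padded with N and would otherwise pull the estimate down.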

Genome Size Estimation and Genome Survey
Clean data from high-quality reads were used for K-mer analysis. Based on the frequency distribution of K-mers (k = 17), we used GenomeScope (https://github.com/schatzlab/genomescope) to estimate the characteristics of the genome (genome size, repeat content, and heterozygosity rate) [42]. Each read was scanned with a 17-bp window and a 1-bp step, and the total number of K-mers and their corresponding frequencies were counted. Next, the peak value (Peak_depth) was identified from the K-mer depth distribution curve. Finally, the genome size was calculated according to the formula Genome Size = K-mer_num/Peak_depth [16].

Assembly and Analysis of Chloroplast Genome

The chloroplast genome was assembled directly with NOVOPlasty (https://github.com/ndierckx/NOVOPlasty) [43]; the reference sequence was NC_035592.1 of L. bakerianum. The chloroplast genes of the sample were annotated with CPGAVAS (http://47.96.249.172:16014/analyzer/home) [24]. The annotation results were plotted using OGDRAW (https://chlorobox.mpimp-golm.mpg.de/OGDraw.html) [44].
MEGAX (https://www.megasoftware.net) was used to construct a phylogenetic tree from the complete chloroplast genome sequences of N. forrestii and 24 other chloroplast genomes using the neighbor-joining method.
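The K-mer procedure described under Genome Size Estimation and Genome Survey (slide a window of k bases along each read in 1-bp steps, build the depth distribution, take its peak, and divide the total K-mer count by that peak) can be sketched as follows. This is an illustrative toy, not the pipeline used in the study: the function names and the `min_depth` cutoff for ignoring error K-mers are assumptions, and reverse-complement (canonical) K-mers are not merged as dedicated K-mer counters usually do.

```python
from collections import Counter

def kmer_counts(reads, k=17):
    """Count every k-mer seen when sliding a k-bp window in 1-bp steps."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts

def depth_histogram(counts):
    """Map depth -> number of distinct k-mers observed at that depth."""
    return Counter(counts.values())

def peak_depth(hist, min_depth=2):
    """Depth with the most distinct k-mers, ignoring likely error k-mers."""
    candidates = {d: n for d, n in hist.items() if d >= min_depth}
    return max(candidates, key=candidates.get)

def genome_size(total_kmers, depth):
    """Genome Size = K-mer_num / Peak_depth."""
    return total_kmers / depth

# Values reported in the text: 1.3e11 K-mers at an expected depth of 75.
print(f"{genome_size(1.3e11, 75) / 1e9:.2f} Gb")  # about 1.73 Gb

# Toy check: an 8-bp "genome" read four times with k = 3.
counts = kmer_counts(["ACGTTGCA"] * 4, k=3)
peak = peak_depth(depth_histogram(counts))
print(peak, genome_size(sum(counts.values()), peak))
```

With the reported values, the formula gives roughly 1.73 Gb, matching the manual estimate in the text; the 2.17 Gb figure comes from GenomeScope's model fit rather than from this simple ratio.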