Genome survey sequencing and identication of genomic SSR markers for Batocera horseldi (Coleoptera: Cerambycidae)

The white-striped longhorn beetle Batocera horseldi (Coleoptera: Cerambycidae) is a polyphagous wood-boring pest that causes substantial damage to the lumber, fruit and nut industry. Here, next-generation sequencing was used to generate a whole genome survey dataset to provide fundamental information of its genome and develop genome-wide microsatellite markers for it. The genome size of B. horseldi was estimated as approximate 520 Mb by using K-mer analyses, and its heterozygosity ratio and repeat sequence ratio were 0.26% and 51.03%, respectively. The assembled genome was 528.56Mb with GC content of 35.40%. A total of 121750 microsatellite motifs were identied. The most frequent repeat motif was mononucleotide with a frequency of 85.84%, followed by 8.08% of dinonucleotide, 5.04% of trinonucleotide, 0.73% of tetranonucleotide, 0.20% of pentanonucleotide and 0.12% of hexanonucleotide motifs. The AT/AT, TA/TAand GA/TC repeats were the most abundant motifs of dinucleotide motifs, and AAT/ATT, TAA/TTA and ATA/TAT were the most abundant motifs of trinucleotide motifs, respectively. ninety six pairs of SSR primers were randomly selected for PCR amplication and agarose gel electrophoresis detection, among which 56 pairs of primers can be effectively amplied to obtain the target fragment. In summary, various candidate microsatellite markers were identied and characterized in this study using genome survey analysis.


Introduction
Batocera hors eldi (Coleoptera: Cerambycidae) is a polyphagous wood-boring pest widely distributed in China [1]. The larval and adults feed on more than 20 trees belonging to taxonomically distant plant families such as Juglandaceae, Salicaceae and Fagaceae with annular damage on bark. Completion the whole life cycle needs 2 years. Adults emerge in the early summer and feed on phloem of host plants, for instance, Viburnum awabuki, Betula luminifera [2] and rosaceous plants [3], for nutrient supplement until they are sexually mature. After copulation, females deposit their eggs under the bark of the host plant, The larvae were bore into the inner bark immediately after hatching and then feed on the xylem for at least 20 months before pupation, which decreases plant tness or even plant death, which limited lumber, fruit and nut industry, for example Eucalyptus, loquat, Juglans regia and so on. However, traditional methods, such as pesticides or manual capturing, for controlling this wood-boring pest are ine cient, because of its complete protection lifecycle and the crypticity of larvae [4]. Hence, research identi ed volatile components of host [5], analysed feeding, encounter and mating behaviors [6] to develop effective methods to control B. hors eldi, and explored its molecular basis of olfactory recognition.
Recently, the genome survey sequencing by next-generation sequencing (NGS) has come out as a main, e cient and cost-effective scheme for generating a large scale of genetic and genomic information for different species [7][8][9][10][11]. Genome survey sequencing (GSS) based on the NGS platform has been proven utile in identifying genome-wide microsatellite markers in many non-model species especially [12][13]. An increased molecular marker density will enable better genome-wide association and molecular heritable variation studies [14][15]. Genome survey studies also provide information about genome structure of organisms, including estimates of genome size, levels of heterozygosity, and repeat contents, phylogenetic analysis, and genetic amelioration of bene cial features [10,[16][17][18].
Microsatellites or simple sequence repeats (SSRs) are short tandem-repeated motif (1-6bases) that are found in both non-coding and coding regions of the genome and are characterized by a high degree of length polymorphism [19], which can be identi ed in the genome survey e ciently [20]. SSRs have made one of the most useful tools for detecting genetic diversity, genetic link age mapping, genetic structure, germplasm and evolution analysis, so the SSRs are the most widely used molecular marker system. However, conventional ways to isolate and develop microsatellite primers were time-and energy-wasting because it must create enriched microsatellite libraries [19,21]. Recently developed NGS platforms offer opportunities for high-throughput and inexpensive genome sequencing and rapid marker development, such as in Aphis glycines [22], Apis cerana [8], 23 mosquito species [23] and twenty-nine beetles [11]. Besides revealed genome features of simple sequence repeats are generally conserved at the family level in insects [24], and patterns of evolutionary conservation of SSRs suggest a faster rate of genome evolution in Hymenoptera than in Diptera [25].
In this study, we performed the rst genome survey of B. hors eldi and identi ed SSRs through genome data. Our main aim is to establish a basic genomic and genetic foundation for future studies on the B. hors eldi.

Methods
Sample collection and genome survey sequencing B. hors eldi larval was collected from damaged J. regia in Bama County (24•13-N,107•31-E), Guangxi, China in October 2018 and fed damaged J. regia cut logs. After pupation, adult eclosion, and eggs hatch, the next generation larval were fed was stored in 95% ethanol at − 80℃. Then extracted total genomic DNA by a standard phenol-chloroform method. DNA was treated with RNase A to produce pure, RNA-free DNA. Two paired-end DNA libraries were constructed with the insert size of 350 bp, and then sequenced using the Illumina HiSeq2500 platform (Illumina Inc., San Diego, CA, USA) following the manufacturer's protocol. The library construction and sequencing were performed at Biomarker in Beijing.

Data analysis
Clean data were procured by performing strict quality assessments and conducting data ltering based on the Illumina sequencing raw data. After removing low-quality reads, all clean data were used to perform K-mer analysis. Based on the results of the K-mer analysis, information on peak depth and the number of predicted best K-mer were obtained and used to estimate the size of the genome. Its relationship was expressed using the following algorithm: Genome size = K-mer num/peak depth, where K-mer_num is the total number of predicted best K-mer, and peak depth is the expected value of the K-mer depth. Also, the heterozygosity ratio and repeat sequence ratio were estimated following the description [26], based on the K-mer analysis. K-mer analyses were performed using the software GCEv1.0.0 [27] and KmerGenie v1.7039 [28], respectively. The paired-end information was then used to join the unique contigs into scaffolds. The high-quality read data were assembled using the de Bruijngraph-based SOAPdenovo software (version 1.05, BGI, Beijing, China) [29].

Development and validation of SSR primers
Based on the SSR loci screened above, Primer Premier 3 (version1.14) software was used to conduct batch design of primers on both sides of SSR sequences. The length of the primers was set as 18-26 bp, and the optimal length was 23 bp. The ampli cation target fragment was the rst to the last ve base fragments of the repeat sequence, and the length of the ampli ed product was between 100 and 400 bp. 96 pairs of SSR primers were randomly selected for preliminary screening, and the primers needed were synthesized by Beijing Qingke Xinye Biotechnology Co., Ltd. Each pair of primers was ampli ed by PCR using DNA of longicornis variegata samples and detected by 1% agarose gel electrophoresis. Primers with single, clearly visible bands and the same size as the target fragment were screened out. The PCR reaction system was 25.0 µL, including 2×T5 Super PCR Mix 12.5 µL (Beijing Qingke Xinyou Biotechnology Co., Ltd.), forward and reverse primers 1.0 µL, DNA template 1.0 µL, and 9.5 µL ddH 2 O complement system.The PCR reaction procedure was pre-denaturation at 98 ℃ for 3 min, denaturation at 98 ℃ for 10 s, optimal temperature annealing for 10 s, extension at 72 ℃ for 10 s, 35 cycles of reaction, extension at 72 ℃ for 2 min.

Results
Genome sequencing and estimation of genome size A total of 76.66 Gb raw data were generated by sequencing the genome survey library with 350 bp inserts. A total of 29.07Gb clean data with 91.70% of Q30 were obtained after ltering. This was an approximately 54.99x coverage of the B. hors eldi estimated genome size. All the clean data were used for genome sequence assembly and K-mer analysis. The 19-mer frequency distribution derived from the sequencing reads is plotted in Fig 1; the peak of the 19-mer distribution was 41, and the total K-mer count was 2,557,1694,595. As a result, the genome size was estimated to be 528.56 Mb and the heterozygosity ratio and repeat sequence ratio were 0.26% and 51.03%, respectively (Table 1). A scatterplot of the genomic GC content and sequencing depth can provide information on sequencing data bias. The GC content of B. hors eldi genome was 35.40% (Fig 2).
The B. hors eldi genome assembly consisted of 1,224,119 scaffolds with a total length of 470,309,436 bp and a scaffold N50 length of 714 bp. In addition, it's genome assembly consisted of 1,469,229 contigs with a total length of 466,713,970 bp and a scaffold N50 length of 437 bp ( Tri-nucleotide SSR markers, the AAT/ATT repeat motifs, the TAA/TTA repeat motifs and the ATA/TAT repeat motifs accounted for 23.78%, 17.23% and 16.80%, respectively (Table 4).

Ampli cation test of SSR primers for B. hors eldi
To ensure the reliability of SSR data, 96 pairs of SSR primers were randomly selected from the developed SSR primer library for ampli cation test, including 61 pairs of dinucleotide repeat SSR primers, 35 pairs of Trinucleotide repeat SSR primers. Agarose gel electrophoresis was used for screening, A total of 56 pairs of SSR primers with single bands, clearly visible and the same size as the target fragment were obtained, The ampli cation e ciency was 58.33%, and primer sequences information were shown in the Table 5.

Discussion
There are 532 reference and representative insect genomes submitted to NCBI database Recently, the K-mer method has been successfully applied for estimating genome size using genome survey sequences without prior knowledge of the genome size for non-model species [27,33]. Based on Illumina sequencing data, the genome size was estimated using K-mer depth distribution of sequenced reads [34]. K-mer depth distribution was affected by the existence of a heterozygote and repeating sequences in the genome [27]. Based on K-mer (K=19) analysis, genome size of B. hors eldi was estimated to be about 528.56 Mb. In this study, the GC content is 35.40%, and repeat sequence ratio was 51.03%. For the genome assembly, if the heterozygosity rate is higher than 0.5%, it is di cult to assemble, and if higher than 1%, it is even more di cult [34], 0.26% heterozygosity ratio in B. hors eldi, which can assemble normally. More and more insect genomes were reported especially in 2018 to now. However, a major challenge now confronting insect science is how best to use genomic data to improve our understanding of insect biology and mine insect genomes for functionally a liated genes [35].
Genomic microsatellite markers, which are reliable, highly polymorphic, multi-allelic, and easy to amplify, are widely used in population genetics, link age analysis, evolutionary studies and so on [36]. Queir´osetal [37] suggested that reliable and accurate estimates of genetic diversity can be obtained using random microsatellites distributed throughout the genome because selecting the most polymorphic markers will overestimate parameters of genetic diversity, leading to misinterpretations of the actual genetic diversity, which is particularly important for managed and threatened populations. A total of 121750 microsatellite motifs were identi ed, that included mononucleotide, dinonucleotide, trinonucleotide, tetranonucleotide, pentanonucleotide and hexanonucleotide motifs. Furthermore the repetitive sequence of genomic DNA, not only affects the advanced structure of chromosomes but is also useful in genome evolution [38][39].
However, the most abundant percentage of microsatellite motifs in B. hors eldi is mononucleotide, which is different from the most abundant di-and trinucleotide types in insect SSRs [24] and should explore further.

Conclusions
Our study was a rst draft genome and stands as a useful reference for further studies on the whole genome sequencing of B. hors eldi. At the same time, a large number of SSR loci were developed through Genome survey sequencing to provide a basis for the genetic research.    Average sequencing depth of B. hors eldi genome data