Development and Characterization of Novel Microsatellite Markers for a Dominant Parasitoid (Stethynium Empoasca) in Tea Plantations Using High-throughput Sequencing

Background: Stethynium empoasca is an egg parasitoid of the predominant tea pest, and is regarded as the most promising candidate for both augmentative and conservative biological control in tea plantations. However, little is presently known about its biology, ecology, and genetics. Methods: Novel microsatellite markers were developed for S. empoasca from transcriptome sequences generated using high-throughput sequencing. The identied markers were then validated and characterized using 98 individuals from ve geographically distinct populations through the tests of Hardy– Weinberg equilibrium and linkage disequilibrium as well as the analyses of genetic diversity, genetic differentiation, and gene ow. Results: A total of 54,520 microsatellites were identied from 117Mb clean sequences. By assessing with ve geographical populations, a total of 18 loci were demonstrated to be polymorphic, stable, and repetitive. The genetic variation over the 18 loci varied with allele number ranging from 2 to 7, polymorphism information content from 0.234 to 0.634; and observed and expected heterozygosity from 0.166 to 0.686 and 0.276 to 0.753, respectively. Meanwhile, the 18 loci showed a low frequency of null alleles (0 – 0.107), and the inbreeding coecient revealed a range from −0.143 to 0.173, with the exception of loci Ste36 (0.4076). Based on analysis of these 18 loci, the assessed populations showed low to moderate levels of genetic differentiation, in which individuals clearly grouped into two clusters. And the recent dispersal rates between the geographically distinct populations were identied to be much lower (0.011 – 0.250) than the individual gene exchange rates within a population (0.683 – 0.939). Conclusion: The identied 18 microsatellite markers could reveal a pattern of spatial structure and gene ow in S. empoasca populations according to geographic variability. This work provides an important basis for future studies on how these markers can be used in studies of the biology, genetics, and ecology of this important parasitoid. The ndings can further provide important information for the development of biological control strategies in tea plantations. Additionally, our study rearms the importance and eciency of high-throughput sequencing in microsatellite marker development for non-model species lacking reference genome information.


Background
Biological control is an important pest management strategy in agriculture, and its application has increasingly grown, with the public becoming more knowledgeable and less tolerant of the use of pest controls that are potentially harmful to people and the environment [1]. In tea plantations, the tea green leafhopper Empoasca onukii is the predominant pest, causing 15%-50% loss in the annual yield [2]. At present, control strategies for E. onukii have mainly relied on the use of synthetic insecticides, and e cient biological control strategies are still lacking. Field surveys have demonstrated that the solitary egg parasitoid Stethynium empoasca Subba Rao is the most common natural enemy of E. onukii, with a high rate of parasitism (up to 30%) [3,4]. As an egg parasitoid, S. empoasca tend to offer more effective biological control as they can immediately prevent further herbivory, in contrast to most larvae parasitoids, that commonly keep the host alive for prolonging nutrient assimilation [5]. Thus, S. empoasca can be the most promising candidate for augmentative biological control, and also an object for conservation biological control. However, little is currently known about its biology, ecology, and genetics. Knowledge about the biological characteristics, ecological preference, genetic variation, and gene ow of parasitoid populations is needed in order to increase their adoption and success as biological control agents [6], as well as to favor the development of an effective habitat management strategy for parasitoid conservation [7].
The lack of research on S. empoasca may largely be due to the fact that it is small in size (597.15-637.23µm) [8], and traditional observation methods could face challenges when studying such a minute parasitoid. Alternatively, molecular markers are capable of detecting minute quantities of parasitoid DNA. In fact, molecular marker-based techniques have been widely used for evaluating the genetic diversity and phylogenetic relationships of natural enemy populations, and are increasingly being recognized as valuable tools in facilitating ecological studies (e.g., population dynamics and dispersal) on small-sized parasitoids [6,9]. Among the available markers, microsatellites are widely used owing to their various desirable traits, such as high mutation rate and polymorphism, co-dominant inheritance, and genome-wide distribution, and their use in analysis produces results with good reproducibility.
The traditional approach of microsatellite development using enriched libraries is time-consuming and labor-intensive [10]. Hence, high-throughput sequencing-based approaches can be a good alternative. The advantages of this technology are that it makes the development of novel microsatellite markers e cient and, importantly, it also makes it possible to investigate and characterize microsatellites at a genome-wide scale [11].
In this study, we aimed to identify and characterize microsatellites for S. empoasca from transcriptome sequences. This is the rst studying reporting on novel microsatellites for S. empoasca. These developed markers will facilitate further biological, ecological, and genetic studies of S. empoasca.

Sample collection and DNA extraction
A total of 98 S. empoasca adult females were collected from ve sampling sites in China (Table 2; Fig. 2), and were used for the identi cation of the polymorphic loci and in the performance of population genetic analyses. Genomic DNA were individually extracted with a TIANamp Micro DNA Kit (TIANGEN, Beijing, China) following the manufacturer's instructions. Additionally, 50 adult females collected from FZ were used for the initial test of selected primer pairs with genomic DNA, also extracted using the TIANamp Micro DNA Kit. Prior to use, these samples were stored in absolute ethanol and were frozen at −80℃.
2.2 Sequencing, microsatellites mining, and primer design A total of 100 S. empoasca adult females collected from FZ were used to generate the transcriptome sequencing libraries using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, USA). After clustering on a cBot Cluster Generation System using TruSeq PE Cluster Kit v3-cBot-HS (Illumia), the library preparations were sequenced on an Illumina HiSeq platform. The generated transcriptome sequences were assembled using Trinity [12] with the default parameters.
The microsatellites were identi ed from the transcriptome sequences using MISA [13]. The microsatellite search criteria were as follows: at least 10, 6, 5, 5, and 5 repeats for the mono-, di-, tri-, tetra-, penta-, and hexa nucleotide motifs, respectively. The remaining parameters were used at default settings. An optimal primer set for each microsatellite was designed using Primer3 with default settings.

Primer testing and polymorphism detection
In total, 50 primer pairs were selected and synthesized for validation. The primer effectiveness and polymorphism were initially tested using the mixed genomic DNA of 50 S. empoasca adult females collected from FZ. The ampli cations were prepared using a Golden Easy PCR System (TIANGEN, Beijing, China)in a nal volume of 10μL, consisting of 5 μL Master Mix, 0.5μL forward primer, 0.5μL reverse primer, 0.5μL (12.5ng) template DNA, and 3.5μL ddH 2 O.
The ampli cations were performed in a BIO-RAD C1000 TOUCH TM thermal cycler under the following conditions: 4 min at 94℃; 30 cycles of 30s at 94℃, 30s at 56℃, and 30s at 72℃; with a nal extension at 72℃ for 7 min. The PCR products were visualized using electrophoresis on a 2% agarose gel for the primary screening of the primers, and those primers that successfully produced positive PCR ampli cations were considered to be valid.
A total of 37 primers, selected in previous steps, were further used to test for polymorphism and98 female adults from FZ, AX, and WYS were characterized ( Table 2). The forward primers that were used were randomly tagged with one of the four universal primers (Tail A: GCCTCCCTCGCGCCA; Tail B: GCCTTGCCAGCCCGC; Tail C: CAGGACCAGGCTACCGTG; Tail D: CGGAGAGCCGAGAGGTG) [14] at the 5'end, and the reverse primers were the same as previously used. Additionally, the corresponding universal primers labeled with uorescent dye (Tail A-FAM; Tail B-TAMRA; Tail C-HEX; Tail D-ROX) were also used. Six or seven primers, tagged unique universal primers or different allele size ranges, were pooled into one group and co-ampli ed using multiplex PCR. The ampli cations were prepared using a Multiplex PCR Kit (TIANGEN, Beijing, China)in a nal volume of 10μL containing 1 μL Multi HotStar Buffer, 0.8μL (40μM) dNTP S , 0.2μL Multi HotStar, 0.48μL or 0.56μL primer mix (0.02μL for each forward primer and 0.02μL for the corresponding tail, and 0.04μL for reverse primer), and 6.02μL or 5.94μL ddH 2 O. The ampli cations were performed in a BIO-RAD C1000 TOUCH TM thermal cycler under the following conditions: 15 min at 95℃; 40 cycles of 30s at 94℃, 90s at 58℃, and 90s at 72℃; with a nal extension at 72℃ for 10 min. The uorescently labeled PCR products were analyzed on an ABI 3730XL capillary sequencer (Applied Biosystems, USA) using the GeneScan 500 LIZ size standard (Applied Biosystems, USA) to score the fragment size. The genotypes and allele sizes were then visualized using GENEMAPPER 4.0 (Applied Biosystems, USA) and further checked manually. In the case of failed ampli cations, reactions were repeated twice to obtain results in the case of occasional technical mistakes. Finally, 18 microsatellite loci were found to be polymorphic, and their genotypic data were selected for further analyses of the genetic characteristics of the population from which they corresponded to, such as in terms of genetic diversity and population structure.

Statistical analysis
The linkage disequilibrium (LD) between each pair of loci at each population and the Hardy-Weinberg equilibrium (HWE) for each locus at each population were tested using GENEPOP 4.7.0 [15]. Null allele frequencies were estimated for each locus within each population using the expectation maximization algorithm in FreeNA [16] with 1000 bootstrap replicates. Population genetic diversity indices, such as the total number of alleles (A T ) and the unbiased expected heterozygosity (H E ) [17]were estimated in GENCLONE 2.0 [18], while the observed heterozygosity (H o ) and polymorphism information content (PIC) were obtained using the macros Microsatellite Tools [19]. Given the effect of sample size on diversity statistics, the standardized number of alleles (A S ) for a minimum sample size of nine diploid individuals was calculated in GENCLONE using the rarefaction approach. The allelic richness (A R ) and allelic richness of private alleles (P AR ) were also computed on a minimum sample size of nine diploid individuals using the rarefaction method in HP-RARE 1.1 [20]. Additionally, the inbreeding coe cient (F IS ) for each population was reported using FSTAT 2.9.3 [21].
Population differentiation was rst investigated by calculating Nei's genetic distance [17,22] and Provesti's genetic distance of paired populations using the R package 'adegenet' 2.0.0 [23]. The pairwise F ST values estimated in ARLEQUIN 3.5.2 [24] and the corrected pairwise F ST values computed using the ENA correction method [16] in FreeNA were compared in order to examine the effect of null alleles on the population differentiation analyses. The population structures were then explored using three independent programs: STRUCTURE 2.3.4 [25], Discriminant Analysis of Principal Components (DAPC) [26], and TESS3r [27,28]. STRUCTURE was run using the admixture model with correlated allele frequencies and no prior location information. The clustering test was performed using 30 independent runs for each K value (from 1 to 10) and 200,000 Markov chain Monte Carlo (MCMC)iterations after a burn-in of 100,000 iterations for each run. The delta K method [29] was used to determine the optimal value of K. The membership coe cient matrices (Q-matrices) of the inferred optimal K values were combined using CLUMPP 1.1.2 [30], and then visualized using DISTRUCT 1.1 [31]. The model-free analysis, DAPC, was performed to interpret and visualize the genetic correlation among individuals de ned by sampling sites by retaining 200 principal components and all of the linear discriminants. The analysis was nished in R package 'adegenet' 2.0.0. As a spatially explicit clustering method, TESS3r was performed by incorporating the speci c geographic coordinates with the genotype data. A similar range of K values (from 1 to 10) was tested with 30 independent runs for each K, and the default cross-validation criterion was used to evaluate the model performance. Finally, the data were further analyzed to evaluate the contribution of the geographic distance on S. empoasca genetic differentiation among populations using GLMM in the R package 'stats' [32] and a simple Mantel test in the R package 'ecodist' [33]. A signi cant result would provide direct evidence for a pattern of isolation by geographic distance (IBD). The geographic distances were computed as the shortest distance between two sample sites using the "Vincenty (ellipsoid)" method [34] in R package 'GEOSPHERE' 1.5-10 [35], while the genetic dissimilarity was developed using pairwise F ST , Nei's, and Provest's genetic distance. BAYESASS 3.0.4 [36] was used to estimate the dispersal rates over a few generations of populations. Ten runs were conducted to combine the results with 20,000,000 MCMC interactions, sampled every 100 interactions after a burn-in of 2,000,000 for each run.

Microsatellites in transcriptome sequences
In total, 44,388,868 raw reads of paired-end (PE) sequences were generated from transcriptome libraries constructed by theIllumina HiSeq platform. Removal of adapter and poly-N sequences as well as low-quality reads resulted in 6.51 GB of sequence, including 43,375,184 clean reads. After assembly, a total of 116,907,081 bp sequences, including 68134 unigenes, were produced for microsatellite mining. The unigenes had a mean size of 1716 bp and an N50 of 2676 bp, which indicates they are much longer than those of previous studies on Parantica sita (unigenes = 886 bp) [37], Dorcus hopei (N50 = 1218 bp) [38] and Carposina sasakii (N50 = 286 bp) [10]. This difference may be explained by the coverage of the sequencing reads and the method of assembly.

Development and performance of microsatellite loci
The Primer3 [41,42] generated a possible 47,490 primers for analysis of the detected microsatellites, of which 50 primer pairs (39 trinucleotide, 7 tetranucleotide, 2 pentanucleotide, and 2 complex repeat types) were randomly selected for the validation test. In the test, 13 primer pairs did not produce any visible ampli cation and 10 failed to produce stable ampli cation, while 9 loci were monomorphic and the ampli cation of 18 showed they were polymorphic (Table 1).
Through the assessment of ve natural S. empoasca populations (N=98; XC, QCC, TF, FZ, and AX; Table 2 and Fig. 2), the average frequency of the null alleles for the 18 polymorphic loci was determined to range from 0 to 0.107 (Table 1), yielding little effect on the estimation of the genetic diversity and structure [16]. Among the 18 loci, 38 of the 765 locus-locus pairs in a population showed linkage disequilibrium (LD; P < 0.05; Table S2), and 7 of the 153 locus pairs across all populations (Ste24-Ste47, Ste24-Ste10, Ste49-Ste50, Ste26-Ste21, Ste47-Ste44, Ste26-Ste44, and Ste9-Ste44) exhibited linkage disequilibrium (P < 0.05; Table S2). However, none of the locus pairs were consistently signi cant in all ve populations. Four of the 95 loci-population pairs deviated from Hardy-Weinberg equilibrium (HWE; P< 0.05; Table S3), and two (Ste26 and Ste36) of the loci across all populations deviated from HWE (P < 0.05; Table 1). However, a global HWE test across all loci and populations showed non-signi cant deviation (P = 0.5992), and the test on each of the populations based on all loci also indicated non-signi cant deviation (P > 0.05; Table 2). The presence of LD and the deviation from HWE are not necessarily due to the characteristics of the loci [10]. They can be caused by a variety of reasons, such as the Wahlund effect [43], the structure of the subpopulations or occurrence of a bottleneck [9,44], or simply the biological properties of the species [10].  (Table 1). These seem to be species-speci c results, as both similar (e.g., C. sasakii) [10] and comparatively different (e.g., F. occidentalis,Trichinella spiralis ) [11] [45] data have previously been reported in other studies. The loci with polymorphism information content (PIC) values of <0.25 and >0.5 are commonly considered to indicate low-and high polymorphism, respectively [9]. The PIC values of our study loci ranged from 0.250 to 0.634, except for the loci of Ste47 (0.234) and Ste42 (0.243; Table 1), indicating a medium-to-high level of polymorphism. The inbreeding coe cient (F IS ) revealed a range from −0.143 to 0.173, with the exception of loci Ste36 (0.4076, Table 1 Table 2). The populations showed a low level of F IS , with values ranging from 0.038 to 0.057 (Table 2), indicating a low chance of the study individual coming from the same egg brood and, thus, suggesting the effectiveness of the sampling strategy and the analyzed results. Because of the low frequency of null alleles, the pairwise F ST estimation yielded similar results both with and without using the ENA correction method, with values ranging from 0.0106 (TF-AX) to 0.099 (FZ-QCC; Fig. 3a). Comparatively higher values (>0.5) were detected between FZ and other populations, while lower values were detected between TF and others. Consistently, the Nei's and Provesti's genetic distance among populations ranged from 0.034 to 0.150 and 0.131 to 0.264, respectively, with the largest values between FZ and other populations, and the lowest between TF and others (Fig. 3b). These ndings suggest a moderate differentiation between FZ and other populations, and a low differentiation between TF and others [47]. This conclusion was further con rmed by the population structure analyses using both spatially and non-spatially clustering models, as well as a model-free program. The three methods consistently and clearly divided the 98 individuals into two distinct clusters, with one cluster containing samples from FZ and another composed of individuals from the other sites ( Fig. 4 and S1). Meanwhile, gene ow analysis provided additional and more direct evidence. Multiple runs in BayesAss consistently indicated that the recent dispersal rates between the geographically distinct populations were much lower (0.011-0.250) than the individual gene exchange rates within a population (0.683-0.939; Fig. 5). This suggests that S. empoasca tends undergo gene exchange within populations with little gene ow occurring between. The TF population showed a somewhat different pattern, with some gene ow towards XC, QLL, and AX, although it still had relatively low rates of 0.248, 0.250, and 0.199, respectively (Fig. 5). The distinct spatial structure and limited gene ow between the geographically distinct populations was expected, as small insects are commonly known to lack a strong natural ability to disperse [9,48,49]. Unexpectedly, both the logit link generalized linear model (GLMM) and simple Mantel analysis indicated that the geographic distance was not signi cantly correlated with the pairwise F ST , Provesti's genetic distance, and Nei's genetic distance (Table S4), suggesting a lack of isolation by distance (IBD) pattern in S. empoasca. This might be caused by the insu cient number of populations examined or by the relevant effect of human activity. Nevertheless, a pattern of genetic differentiation and gene ow was found based on the 18 loci in this study, indicating that these microsatellite markers could be used to discriminate geographic populations as well as to facilitate other genetic and ecology studies on S. empoasca.

Conclusions
In this study, we developed and characterized 18 microsatellite markers for a parasitoid of the predominant tea pest from the transcriptome sequences generated using high-throughput sequencing. By assessing 98 individuals from ve geographic ally distinct populations, 18 loci were shown to be polymorphic, and were veri ed to be able to reveal a pattern of spatial structure and gene ow in S. empoasca populations. This work provides an important basis for future studies on how these markers can be used in studies of the biology, genetics, and ecology of this important parasitoid. These ndings can further provide important information for the development of biological control strategies in tea plantations. Additionally, our study rea rms the importance and e ciency of high-throughput sequencing in microsatellite marker development for non-model species without a reference genome.   stanardized total number of alleles (for 9 specimens); A R : average allelic richness (for 9 speciemens); P AR : private allelic richness; F IS : inbreeding coe cient. Refer to Fig. 2 for the sample sites of the populations. Figure 1 Counts of different microsatellite (SSR) motif types with different repeat type in Stethynium empoasca.

Figure 2
Sampling sites (or locations) of the ve Stethynium empoasca populations used to access the 18 microsatellite loci in this study. Refer to Table 2 for the coordinates of the populations' sample sites. Figure 3