3.1 Microsatellites in transcriptome sequences
In total, 44,388,868 raw reads of paired-end (PE) sequences were generated from transcriptome libraries constructed by theIllumina HiSeq platform. Removal of adapter and poly-N sequences as well as low-quality reads resulted in 6.51 GB of sequence, including 43,375,184 clean reads. After assembly, a total of 116,907,081 bp sequences, including 68134 unigenes, were produced for microsatellite mining. The unigenes had a mean size of 1716 bp and an N50 of 2676 bp, which indicates they are much longer than those of previous studies on Parantica sita (unigenes = 886 bp) [37], Dorcus hopei (N50 = 1218 bp) [38] and Carposina sasakii (N50 = 286 bp) [10]. This difference may be explained by the coverage of the sequencing reads and the method of assembly.
About 40% (27,172) of the assembled genes were found to contain microsatellite sequences, resulting in a total of 54,520 candidate microsatellites being identified. The total relative abundance was 466.35 loci/Mb, which is higher than that of Frankliniella occidentalis (322.01 loci/Mb) [11] detected in the genome and Pachypeltis micranthus (40.47loci/Mb) [39] in the transcriptomic expressed sequence tags, as well as for Aphis glycines (139.01 loci/Mb) [40] as identified in transcript sequences. However, it is unclear whether this difference is species-specific difference or due to the different methods used for sequencing and assembly. Nevertheless, our results indicate that microsatellites are abundant in S. empoasca and have great potential as a tool for further biological, ecological, and population studies. The identified microsatellites contained 40,021 (73.41%) mononucleotides, 8393 (15.39%) dinucleotide, 5764 (10.57%) trinucleotide, 265 (0.49%) tetranucleotide, 17 (0.03%) pentanucleotide, and 60 (0.11%) hexanucleotide motifs. After mononucleotide, dinucleotide motifs were the most abundant, followed by trinucleotide and tetranucleotide (Table S1). The quantity of microsatellites decreased as the corresponding motif repeats increased (Fig. 1). These findings are in agreement with previous studies on other insects, like F. occidentalis [11], C. sasakii [10], and A. glycines [40]. The 15 most abundant microsatellite motifs accounted for 92.75% of all of the detected microsatellites, which were (A)n, (T)n, (AT)n, (TA)n, (AAT)n, (TTA)n, (ATT)n, (TAA)n, (ATA)n, (TAT)n, (TC)n, (AG)n, (GA)n, (CAG)n, and (TG)n.
3.2 Development and performance of microsatellite loci
The Primer3 [41, 42] generated a possible 47,490 primers for analysis of the detected microsatellites, of which 50 primer pairs (39 trinucleotide, 7 tetranucleotide, 2 pentanucleotide, and 2 complex repeat types) were randomly selected for the validation test. In the test, 13 primer pairs did not produce any visible amplification and 10 failed to produce stable amplification, while 9 loci were monomorphic and the amplification of 18 showed they were polymorphic (Table 1).
Through the assessment of five natural S. empoasca populations (N=98; XC, QCC, TF, FZ, and AX; Table 2 and Fig. 2), the average frequency of the null alleles for the 18 polymorphic loci was determined to range from 0 to 0.107 (Table 1), yielding little effect on the estimation of the genetic diversity and structure [16]. Among the 18 loci, 38 of the 765 locus–locus pairs in a population showed linkage disequilibrium (LD; P < 0.05; Table S2), and 7 of the 153 locus pairs across all populations (Ste24–Ste47, Ste24–Ste10, Ste49–Ste50, Ste26–Ste21, Ste47–Ste44, Ste26–Ste44, and Ste9–Ste44) exhibited linkage disequilibrium (P < 0.05; Table S2). However, none of the locus pairs were consistently significant in all five populations. Four of the 95 loci-population pairs deviated from Hardy–Weinberg equilibrium (HWE; P< 0.05; Table S3), and two (Ste26 and Ste36) of the loci across all populations deviated from HWE (P < 0.05; Table 1). However, a global HWE test across all loci and populations showed non-significant deviation (P = 0.5992), and the test on each of the populations based on all loci also indicated non-significant deviation (P > 0.05; Table 2). The presence of LD and the deviation from HWE are not necessarily due to the characteristics of the loci [10]. They can be caused by a variety of reasons, such as the Wahlund effect [43], the structure of the subpopulations or occurrence of a bottleneck [9, 44], or simply the biological properties of the species [10].
The genetic variation over the 18 polymorphic loci inferred based on four of the populations varied with allele number (AT), ranging from 2 to 7 (mean of 4.06); average allelic richness (AR; for nine specimens of each population), from 2 to 4.17; and observed (HO) and expected (HE) heterozygosity from 0.166 to 0.686 and 0.276 to 0.753, respectively (Table 1). These seem to be species-specific results, as both similar (e.g., C. sasakii) [10] and comparatively different (e.g., F. occidentalis,Trichinella spiralis ) [11] [45] data have previously been reported in other studies. The loci with polymorphism information content (PIC) values of <0.25 and >0.5 are commonly considered to indicate low- and high polymorphism, respectively [9]. The PIC values of our study loci ranged from 0.250 to 0.634, except for the loci of Ste47 (0.234) and Ste42 (0.243; Table 1), indicating a medium-to-high level of polymorphism. The inbreeding coefficient (FIS) revealed a range from −0.143 to 0.173, with the exception of loci Ste36 (0.4076, Table 1). The high FIS in Ste36 might be due to the low HO (0.166; Table 1) [46].
The genetic diversity over S. empoasca populations, estimated based on the 18 loci with varied AT, ranged from 85 to 99, AS (the standardized number of alleles for a minimum sample size of nine diploid individuals) from 77.547 to 85.682, AR from 2.76 to 2.95, HO from 0.424 to 0.476, HE from 0.448 to 0.522, and PAR from 0.03 to 0.17 (Table 2). The populations showed a low level of FIS, with values ranging from 0.038 to 0.057 (Table 2), indicating a low chance of the study individual coming from the same egg brood and, thus, suggesting the effectiveness of the sampling strategy and the analyzed results. Because of the low frequency of null alleles, the pairwise FST estimation yielded similar results both with and without using the ENA correction method, with values ranging from 0.0106 (TF–AX) to 0.099 (FZ–QCC; Fig. 3a). Comparatively higher values (>0.5) were detected between FZ and other populations, while lower values were detected between TF and others. Consistently, the Nei’s and Provesti’s genetic distance among populations ranged from 0.034 to 0.150 and 0.131 to 0.264, respectively, with the largest values between FZ and other populations, and the lowest between TF and others (Fig. 3b). These findings suggest a moderate differentiation between FZ and other populations, and a low differentiation between TF and others [47]. This conclusion was further confirmed by the population structure analyses using both spatially and non-spatially clustering models, as well as a model-free program. The three methods consistently and clearly divided the 98 individuals into two distinct clusters, with one cluster containing samples from FZ and another composed of individuals from the other sites (Fig. 4 and S1). Meanwhile, gene flow analysis provided additional and more direct evidence. Multiple runs in BayesAss consistently indicated that the recent dispersal rates between the geographically distinct populations were much lower (0.011–0.250) than the individual gene exchange rates within a population (0.683–0.939; Fig. 5). This suggests that S. empoasca tends undergo gene exchange within populations with little gene flow occurring between. The TF population showed a somewhat different pattern, with some gene flow towards XC, QLL, and AX, although it still had relatively low rates of 0.248, 0.250, and 0.199, respectively (Fig. 5). The distinct spatial structure and limited gene flow between the geographically distinct populations was expected, as small insects are commonly known to lack a strong natural ability to disperse [9, 48, 49]. Unexpectedly, both the logit link generalized linear model (GLMM) and simple Mantel analysis indicated that the geographic distance was not significantly correlated with the pairwise FST, Provesti’s genetic distance, and Nei’s genetic distance (Table S4), suggesting a lack of isolation by distance (IBD) pattern in S. empoasca. This might be caused by the insufficient number of populations examined or by the relevant effect of human activity. Nevertheless, a pattern of genetic differentiation and gene flow was found based on the 18 loci in this study, indicating that these microsatellite markers could be used to discriminate geographic populations as well as to facilitate other genetic and ecology studies on S. empoasca.