Genome Sequencing and Characterization of Microsatellite Loci
Between six and eight million paired-end reads were generated per sample (Table 2). Most reads had the expected sequence length (250 bp) and were of sufficiently high quality to pass the filtering; only ca. 1% of the reads of each sample were filtered out.
Between 90,000 and 115,000 contigs were successfully assembled per sample using the accepted reads. The average contig length ranged from 2,000 to 3,000 bp with an N50 around 7,000 bp.
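For readers unfamiliar with the N50 statistic reported here, the following is a minimal sketch of how it can be computed from a list of contig lengths. The function name and the toy lengths are illustrative assumptions, not data or code from this study.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembled length."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (lengths in bp), not data from this study.
toy_lengths = [8000, 7000, 3000, 2000, 1000, 500]
toy_n50 = n50(toy_lengths)
toy_mean = sum(toy_lengths) / len(toy_lengths)
```

In this toy assembly the N50 (7,000 bp) is well above the mean contig length, the same pattern as in the values reported for the samples: a few long contigs hold most of the assembled bases.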
A total of about 9,000 VNTR loci were found per sample using the contigs; about 8,000 of them were identified as perfect or pure microsatellites. As per the QDD3 glossary, we defined pure microsatellites as loci composed of a single, 2–6 bp motif, with no interruption, and at least five repetitions [30]. Some 135,000 primer pairs were generated per sample from the VNTR loci found and characterized.
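The definition of a pure microsatellite given above (a single 2–6 bp motif, uninterrupted, at least five repetitions) can be illustrated with a short regular-expression sketch. This is not the QDD3 implementation; the function name, the exclusion of homopolymer runs, and the toy sequence are assumptions made for illustration only.

```python
import re


def find_pure_microsatellites(seq, min_repeats=5):
    """Return (start, motif, n_repeats) for uninterrupted tandem repeats
    of a 2-6 bp motif repeated at least `min_repeats` times."""
    # The lazy quantifier {2,6}? makes the regex try the shortest motif first,
    # so ATATATATAT is reported as (AT)5 rather than under a longer motif.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        motif = m.group(1)
        if len(set(motif)) == 1:
            continue  # homopolymer run (e.g. AAAA...), effectively a 1 bp motif
        hits.append((m.start(), motif, len(m.group(0)) // len(motif)))
    return hits


# Hypothetical toy sequence: (AT)6 and (ACG)5 embedded in flanking bases.
toy_seq = "TTCC" + "AT" * 6 + "GGC" + "ACG" * 5 + "TT"
toy_hits = find_pure_microsatellites(toy_seq)
```

Because a tandem repeat can be read in several frames, the reported motif depends on where the match starts. A production tool would normally also collapse rotations and reverse complements into one canonical motif.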
Overall, the VNTRs of three C. cyanellus samples from each locality had very similar genomic characteristics (Fig. 2). Almost 90% of all the VNTR loci found met the definition of perfect microsatellites; about 9% were categorized as compound microsatellites and the rest as minisatellite loci (Fig. 2a). Compound microsatellites were defined as microsatellite loci followed by 3–4 tandem repetitions of a 2–6 bp motif separated by a distance equal to or shorter than the longest of the two tandem repetitions. Minisatellite loci were defined as two or more perfect or compound microsatellite loci in the same target region [30].
The frequency of the VNTR classes was inversely related to motif size (Fig. 2b). Dinucleotide VNTRs accounted for approximately 65% of all the loci, followed by trinucleotides with about 30%, tetranucleotides with 5%, and so on. The average number of reiteration units seen in VNTR loci was close to the minimum threshold (five) employed by the QDD3 pipeline. The mean number of repetitions was 8.9 for dinucleotides, 5.5 for trinucleotides, 5.3 for tetranucleotides, 5.3 for pentanucleotides, and 8.4 for hexanucleotides; however, a higher number of repetitions was found in loci with shorter motifs (Fig. 2c; Dinucleotides: median = 5, Q3 = 9, Max = 66; Trinucleotides: median = 5, Q3 = 6, Max = 25; Tetranucleotides: median = 5, Q3 = 5, Max = 16).
We also found differences in the frequency of the various motifs in the VNTR loci. AT was the most common motif in dinucleotide VNTRs, and CG was the rarest (Fig. 2d); AAT and AGC were the most and least common motifs, respectively, in trinucleotide loci (Fig. 2e); AAAT was the most common motif in tetranucleotide loci, while ACAG and AACC were the rarest, with only one locus each.
Validation of Microsatellite Loci
Fourteen of the 16 primer pairs synthesized amplified successfully and yielded single PCR products of the expected size; one locus failed to amplify and another produced reading issues (Supplementary Material 2). The amplification success rate of the 14 markers was 95.2%, with only 12 missing data points (out of 14 × 18 = 252 genotype calls) after amplifying 18 C. cyanellus samples. Most of these loci amplified fragments whose lengths differed from one another in steps of three nucleotides. This result was expected, as primers were designed only on perfect trinucleotide microsatellites; it facilitated the assignment of raw fragment lengths to allele classes, a step often referred to as binning [36]. However, Ccy09 and Ccy12 showed a mostly continuous distribution of raw fragment sizes, with differences of less than three nucleotides. Based on this, we binned the alleles of these two loci into mononucleotide (1 bp) allele classes (Fig. 3).
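As an illustration of the binning step, the sketch below snaps raw fragment sizes to the nearest allele class on a regular grid, using a 3 bp period for trinucleotide loci and a 1 bp period for loci such as Ccy09 and Ccy12. It is a simplified assumption about how binning can be done, not the procedure of [36]; the function name, the choice of anchor, and the fragment sizes are hypothetical.

```python
def bin_fragments(raw_sizes, period, anchor=None):
    """Snap raw fragment sizes (bp) to the nearest allele class.

    Allele classes are assumed to lie on the grid anchor + k * period.
    If no anchor is given, the smallest raw size (rounded) is used.
    """
    if anchor is None:
        anchor = round(min(raw_sizes))
    return [anchor + period * round((s - anchor) / period) for s in raw_sizes]


# Hypothetical raw sizes for a trinucleotide locus (3 bp allele classes).
tri_raw = [201.2, 203.9, 204.3, 207.1, 209.8]
tri_binned = bin_fragments(tri_raw, period=3)

# Hypothetical raw sizes for a locus binned into 1 bp allele classes.
mono_raw = [150.1, 151.2, 151.9, 153.0]
mono_binned = bin_fragments(mono_raw, period=1)
```

A fixed grid is the simplest possible choice. Real fragment data often drift with fragment size, which is why dedicated binning tools estimate the class boundaries from the data rather than assuming them.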
Polymorphism Analyses
The number of alleles in polymorphic loci ranged from 2 to 16, and the observed heterozygosity from 0.11 to 0.82 (Table 3). Ten loci showed moderate polymorphism (4–6 alleles), four were marginally polymorphic (2–3 alleles), and only two were highly polymorphic (9 and 16 alleles). The effective number of alleles (AE) was about half the number of observed alleles, suggesting the presence of a few dominant alleles in the loci.
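The relationship between the number of observed alleles and the effective number of alleles can be made concrete with a small sketch. It uses the standard definitions AE = 1 / Σp² and HE = 1 − Σp² without any small-sample correction, which may differ from the estimators used by the software employed in this study; the function name and the genotypes are hypothetical.

```python
from collections import Counter


def locus_diversity(genotypes):
    """Summarize one diploid locus.

    genotypes: list of (allele_a, allele_b) tuples, one per individual.
    """
    n = len(genotypes)
    counts = Counter(allele for g in genotypes for allele in g)
    freqs = [c / (2 * n) for c in counts.values()]
    homozygosity = sum(p * p for p in freqs)  # expected homozygosity
    return {
        "n_alleles": len(counts),
        "HO": sum(a != b for a, b in genotypes) / n,  # observed heterozygosity
        "HE": 1 - homozygosity,  # expected heterozygosity (uncorrected)
        "AE": 1 / homozygosity,  # effective number of alleles
    }


# Hypothetical genotypes: allele 201 dominates, 204 and 207 are rare.
toy_genotypes = [(201, 201), (201, 201), (201, 204), (201, 207)]
toy_stats = locus_diversity(toy_genotypes)
```

With three observed alleles but one of them at a frequency of 0.75, AE is only about 1.7, which is the kind of gap between observed and effective allele numbers described above.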
Eight loci showed high (>0.20) absolute FIS values, indicating large differences between their expected (HE) and observed (HO) heterozygosity. These loci displayed positive FIS values, indicating a deficit of heterozygotes relative to Hardy-Weinberg expectations. Polymorphism information content (PIC) ranged from 0.190 to 0.889, with a mean of 0.494.
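The two statistics in this paragraph can be sketched as follows, assuming the usual definitions FIS = 1 − HO/HE and PIC = 1 − Σpᵢ² − Σᵢ<ⱼ 2pᵢ²pⱼ². The study does not state which estimators its software used, so the formulas, function names, and allele frequencies here are assumptions for illustration.

```python
from itertools import combinations


def fis(ho, he):
    """Inbreeding coefficient; positive values mean fewer heterozygotes
    were observed than expected under Hardy-Weinberg equilibrium."""
    return 1 - ho / he


def pic(freqs):
    """Polymorphism information content from allele frequencies."""
    homozygosity = sum(p ** 2 for p in freqs)
    cross = sum(2 * (p ** 2) * (q ** 2) for p, q in combinations(freqs, 2))
    return 1 - homozygosity - cross


# Hypothetical locus with three alleles.
toy_freqs = [0.5, 0.3, 0.2]
toy_he = 1 - sum(p ** 2 for p in toy_freqs)  # 0.62
toy_fis = fis(ho=0.40, he=toy_he)  # heterozygote deficit
toy_pic = pic(toy_freqs)
```

In this toy case an observed heterozygosity of 0.40 against an expected 0.62 gives an FIS of about 0.35, above the 0.20 threshold used in the text.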
The moderate genetic diversity observed in these markers resulted in relatively high per-locus Probability of Identity (PID) values. This parameter measures the probability of two unrelated individuals having the same genotype [37]. PID values ranged from 60.1% (Ccy11) to 0.1% (Ccy12). However, when PID was computed over the full set of 14 markers using the product rule, it reached a value of 1.299 × 10⁻¹² in unrelated individuals and 1.563 × 10⁻⁴ in siblings.
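The sketch below shows how multiplying per-locus values under the product rule can yield a very small multilocus PID even when individual loci are weakly informative. It assumes the commonly used theoretical expressions for unrelated individuals (Σpᵢ⁴ + Σᵢ<ⱼ(2pᵢpⱼ)²) and for siblings (0.25 + 0.5Σpᵢ² + 0.5(Σpᵢ²)² − 0.25Σpᵢ⁴), which may not match the exact estimator applied in this study; the function names and allele frequencies are hypothetical.

```python
from itertools import combinations
from math import prod


def pid_unrelated(freqs):
    """Probability that two unrelated individuals share a genotype at one locus."""
    same_homozygote = sum(p ** 4 for p in freqs)
    same_heterozygote = sum((2 * p * q) ** 2 for p, q in combinations(freqs, 2))
    return same_homozygote + same_heterozygote


def pid_sibs(freqs):
    """Probability that two full siblings share a genotype at one locus."""
    s2 = sum(p ** 2 for p in freqs)
    s4 = sum(p ** 4 for p in freqs)
    return 0.25 + 0.5 * s2 + 0.5 * s2 ** 2 - 0.25 * s4


def combined(per_locus_values):
    """Product rule: multiply per-locus values, assuming independent loci."""
    return prod(per_locus_values)


# Hypothetical allele frequencies: one weakly and one more informative locus.
toy_loci = [[0.9, 0.1], [0.25, 0.25, 0.25, 0.25]]
per_locus = [pid_unrelated(f) for f in toy_loci]
overall = combined(per_locus)
overall_sibs = combined(pid_sibs(f) for f in toy_loci)
```

The product rule assumes that loci are independent, so linkage between markers would make the combined value optimistic. The sibling value stays much higher than the unrelated value because relatives share alleles by descent.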
Since we examined only a few (2–3) samples per locality, we assessed the ability of these 14 markers to detect genetic structure by counting the private alleles of each locus. Eleven of the 14 loci possessed at least one private allele at one locality (Fig. 4). Ten of these 11 loci had few (1–3) private alleles, and locus Tuxt06 had eight private alleles across six different localities.
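Counting private alleles, that is, alleles observed at a single locality only, can be sketched as follows for one locus. The function name, locality labels, and allele sizes are hypothetical and serve only to illustrate the counting.

```python
from collections import defaultdict


def private_alleles(alleles_by_locality):
    """Return, for one locus, the alleles found at only one locality.

    alleles_by_locality: dict mapping locality name -> set of observed alleles.
    """
    localities_with = defaultdict(set)
    for locality, alleles in alleles_by_locality.items():
        for allele in alleles:
            localities_with[allele].add(locality)

    private = {locality: set() for locality in alleles_by_locality}
    for allele, localities in localities_with.items():
        if len(localities) == 1:
            private[next(iter(localities))].add(allele)
    return private


# Hypothetical alleles (fragment sizes in bp) observed at three localities.
toy_locus = {
    "site_A": {201, 204},
    "site_B": {201, 207},
    "site_C": {201, 204, 210},
}
toy_private = private_alleles(toy_locus)
n_private = sum(len(alleles) for alleles in toy_private.values())
```

With only two or three samples per locality, an allele may look private simply because it was not sampled elsewhere, so such counts are better read as an indication of potential rather than as an estimate of population structure.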