Next-generation sequencing, isolation and characterization of 14 microsatellite loci of Canthon cyanellus (Coleoptera: Scarabaeidae)

We used Illumina paired-end sequencing to isolate and characterize microsatellites of Canthon cyanellus, a Neotropical roller dung beetle, encompassing several lineages within its distribution range. We examined C. cyanellus specimens collected at eight different localities in Mexico (two or three specimens per locality). We initially performed amplification tests with 16 loci, but two of which were unsuccessful. The 14 remaining microsatellites were polymorphic, with 2–16 alleles each. The expected and observed heterozygosity ranged from 0.11 to 0.76 and from 0.20 to 0.78, respectively. These microsatellites will help to assess structure at the population and lineage levels, identify zones of potential hybridization between lineages, and draw a more precise geographic delimitation of C. cyanellus lineages.


Introduction
Canthon cyanellus is a Neotropical roller dung beetle (Scarabaeidae: Scarabaeinae) that feeds on dung and carrion but only reproduces on the latter [1,2]. This diurnal species inhabits forests, edges, and hedgerows [3]. Similar to other dung beetle species, C. cyanellus provides ecological services such as nutrient cycling, carrion burial, seed dispersal, and fly control [4]. The cuticular color of C. cyanellus varies widely across its distribution from northern Brazil to Texas [5,6]. The coloration polymorphism ranges from a predominantly reddish-brown in northern South America and Central America to entirely green or even blue in tropical regions of Mexico and Texas. This variation in cuticular color enabled Halffter [7] to establish three subspecies. C. cyanellus cyanellus LeConte, 1859 is distributed in Mexico along the Pacific and Gulf of Mexico slopes to Texas; it is mostly green, but green-or blue-colored specimens can be found in the northern plains of the Gulf of Mexico [7]. C. cyanellus sallei is distributed from Guatemala to Peru [8]; its dorsal surface is predominantly reddish-brown with greenish shades, elytra that are reddish-brown with greenish-black markings, and a light colored pronotal disc, usually with reddish or yellowish-brown hues. C. cyanellus violetae is endemic to Chiapas in southern Mexico, with pronotum and pygidium that are typically a reddish-brown [7]. However, the taxonomy of this group has been one of the most complex of the Canthon genus. The coloration polymorphism and the absence of diagnosable and reliable morphological characters have created uncertainty about its genealogical limits, which has led to the conflicting recognition and delimitation of the species or subspecies that comprise it.
The reproductive behavior, chemical communication, and ecology of C. c. cyanellus have been extensively studied [1-4, 9-14, 14-21]. Ortíz-Domínguez and collaborators [19] carried out experimental interpopulation crosses of C. c. cyanellus from the Gulf of Mexico coastal plains. These authors analyzed the composition of cuticular hydrocarbons used in sexual recognition and chemical communication [22]. Male-female interactions between individuals from populations geographically separated by more than 600 km were mainly aggressive. They fought for a food ball instead of rolling it cooperatively, as occurs between male-female pairs from the same population. Crosses between males and females from distant populations showed low fecundity and fertility, suggesting that C. cyanellus is undergoing an incipient reproductive isolation process.
Phylogeographic studies based on one nuclear (ITS2) and two mitochondrial (COI and 16S) loci resolved the C. cyanellus genealogy into seven clades from tropical regions of Mexico and one clade from Colombia [23,24]. Five of these clades were well supported, while the remaining clades were not. The geographic distribution of haplotypes suggests limited historical gene flow and recent dispersal events between some clades [23,24]. Nolasco-Soto and collaborators [24] used COI to estimate genetic distances between all the clades inferred for the "cyanellus" group. These authors found that the interclade divergence value surpassed the 2.2% threshold used to recognize separate insect species, according to Ratnasingham and Hebert [25]. Only two clades (SPS and SGM) were below this threshold (2.16%). However, the morphological variation of the aedeagus in the C. cyanellus populations studied by Nolasco-Soto and collaborators [24] did not align with the recovered clades, which therefore could not be separated into different species. Studies by Nolasco-Soto and collaborators [23,24] indicate that the phylogeographic history of C. cyanellus in Mexico is complex.
In this context, we used Illumina sequencing to isolate and characterize microsatellite loci of the C. cyanellus complex. Microsatellite markers (short DNA fragments, 100-500 bp) are polymorphic, codominant markers that consist of tandem repeats of simple sequence motifs with high mutation rates (10 −6 -10 −2 mutations per site per generation) [26]. Characteristics of microsatellite markers can contribute to a better understanding of the population structure of this species and enable the quantification of gene flow between clades/populations. The geographical delimitation of the populations and lineages of C. cyanellus will allow for the study of ecological and geographical factors that may influence their genetic differentiation and diversification in the Neotropics.

Sample collection and DNA extraction
Specimens of C. cyanellus were collected from eight localities in the Pacific and the Gulf of Mexico slopes, the Sierra Madre Oriental, and the southeast and southwest of the state of Chiapas (see Table 1 and Supplementary Material 2). The geographic coordinates and elevation (masl) were recorded at each location. Twenty to thirty baited pitfall traps were placed at each location and left for 48 h; traps were filled with damp soil and fish or squid as bait. All the beetles collected were transported alive to the laboratory, where they were subsequently frozen at − 70 °C. Both hind legs of each specimen were excised and ground to extract DNA following the DNeasy Blood & Tissue kit (QIAGEN) protocol.

Next-generation sequencing, De Novo contig assembly, and selection of microsatellites
Three genomic DNA samples of C. cyanellus (sampled from Tuxpan, Los Tuxtlas and El Vergel localities) were used to construct the Illumina library following the Illumina TruSeq DNA PCR-Free kit sequencing library protocol (Illumina Inc. USA). The library was sequenced using a 250-bp paired-end run on a MiSeq platform (Illumina Inc. USA). The quality of raw reads was checked with the software FastQC v0.11.3 [27] and filtered with cutadapt v1.8.3 [28]. Quality filtering consisted of clipping adapters, removing sequences shorter than 20 nt, and trimming the 3′ end to a threshold of -q = 20. The software SPAdes 3.10.1 [29] was used for de novo assembly, and the QDD 3.1.2 pipeline [30] was used for the detection, characterization, and primer design of the variable number tandem repeat (VNTR) loci; both programs were run with their default parameters.
The following criteria (based on Meglécz et al. [30]) were used to select primer pairs for testing from the set of primer design sequences detected in the "Tuxpan" and "Los Tuxtlas" samples: (i) select only one primer pair per locus, (ii) the sequence should contain only pure microsatellites in the target region with the motifs AAT or AAG; (ii) the primer alignment score of the amplified sequence should be 2 or 3; (iii) markers should be selected from different ranges (at least 20 bp between each other) for PCR product length to facilitate multiplexing; and (iv) right and left primer lengths should be ≤ 60 bp.

Polymorphism analyses
The total number of alleles (A), effective number of alleles (A E ), observed (H O ) and expected heterozygosity (H E ), inbreeding index (F IS ), and number of private alleles per locality were estimated using the macro GenAlEx 6.503 [33]. Finally, we calculated the polymorphism information content (PIC) and the probability of identity (P ID ) for each locus using the software PowerMarker 3.25 [34] and Gimlet 1.3.3 [35], respectively.

Genome sequencing and characterization of microsatellite loci
Between six and eight million paired-end reads were generated per sample (Table 2). Most reads had the expected sequence length (250 bp) and were of sufficiently high quality to pass the filtering; only ca. 1% of the reads of each sample were filtered out. Between 90,000 and 115,000 contigs were successfully assembled per sample using the accepted reads. The average contig length ranged from 2000 to 3000 bp with an N50 of approximately 7000 bp.
A total of approximately 9000 VNTR loci were found per sample using the contigs; approximately 8000 of them were identified as perfect or pure microsatellites. As per the QDD3 glossary, we defined pure microsatellites as loci composed of a single, 2-6 bp motif, with no interruption, and at least five repetitions [30]. Some 135,000 primer pairs were generated per sample from the VNTR loci found and characterized.
Overall, the VNTRs of three C. cyanellus samples from each locality had a very similar genomic characterization (Fig. 1). Almost 90% of all the VNTR loci found met the definition of perfect microsatellites; approximately 9% were categorized as compound microsatellites, and the rest were categorized as minisatellite loci (Fig. 1a). Compound microsatellites were defined as microsatellite loci followed by 3-4 tandem repetitions of a 2-6 bp motif separated by a distance equal to or shorter than the longest of two tandem repetitions. Minisatellite loci were defined as two or more perfect or compound microsatellite loci in the same target region [30]. The frequency of the VNTR classes was inversely related to motif size (Fig. 1b). Dinucleotide VNTRs accounted for approximately 65% of all the loci, followed by trinucleotides with approximately 30% and tetranucleotides with 5%.
The average number of reiteration units seen in VNTR loci was close to the minimum threshold (five) employed by the QDD3 pipeline. The mean number of repetitions was 8.9 for dinucleotides, 5.5 for trinucleotides, 5.3 for tetranucleotides, 5.3 for pentanucleotides, and 8.4 for hexanucleotides; however, a higher number of repetitions was found in loci with shorter motifs (Fig. 1c; Dinucleotides: median = 5, Q3 = 9, Max = 66; Trinucleotides: median = 5, Q3 = 6, Max = 25; Tetranucleotides: median = 5, Q3 = 5, Max = 16). We also found differences in the frequency of the various motifs in the VNTR loci. AT was the most common motif in dinucleotide VNTRs, and CG was the rarest (Fig. 1d); AAT and AGC were the most and least common motifs, respectively, in trinucleotide loci (Fig. 1e); AAAT was the most common motif in tetranucleotide loci, while ACAG and AACC were the rarest, with only one locus each.

Validation of microsatellite loci
Fourteen of the 16 primer pairs synthesized were successfully amplified and yielded single PCR products of the expected size; PCR amplification was not achieved in one locus, and reading issues occurred in another locus (Supplementary Material 3). The amplification success rate of the 14 markers was 95.2%, with only 12 missing data after amplifying 18 C. cyanellus samples. Most of these loci amplified fragments with lengths differing between them in three nucleotides. This result was expected, as primers were designed only on trinucleotide-perfect microsatellites; this facilitated the assignment of raw fragment lengths-often referred to as binning-into allele classes [36]. However, Ccy09 and Ccy12 showed a mostly continuous distribution of raw fragment sizes with differences of less than three nucleotides. Based on this, we decided to bin their alleles into the mononucleotide allele class (Supplementary Material 4).

Polymorphism analyses
The number of alleles in polymorphic loci ranged from 2 to 16, and the observed heterozygosity ranged from 0.11 to 0.82 (Table 3). Ten loci showed moderate polymorphism (4-6 alleles), four were marginally polymorphic (2-3 alleles), and only two were highly polymorphic (9 and 16 alleles). The effective number of alleles (A E ) was approximately half the number of observed alleles. The moderate genetic diversity observed in these markers conferred a relatively high probability of identity (P ID ). This parameter measures the probability of two unrelated individuals having the same genotype [37]. P ID values ranged from 60.1% (Ccy11) to 0.1% (Ccy12). However, when P ID was computed over the full set of 14 markers using the product rule, it reached a value of 1.299 × 10 -12 in unrelated individuals and 1.563 × 10 -4 in siblings.
Since we examined only a few (2-3) samples per locality, we measured the ability of these 14 markers to measure genetic structure by identifying the number of private alleles of each locus. Eleven of the 14 loci possessed at least one private allele at one locality (Fig. 2). Ten of these 11 loci had few (1-3) private alleles, and locus Tuxt06 had eight private alleles across six different localities.

Characterization of microsatellite loci
Perfect microsatellite loci were the most common type of VNTR in C. cyanellus genomes, which has been similarly reported for the coconut rhinoceros beetle Oryctes rhinoceros (Linnaeus, 1758) [38] and the red flour beetle Tribolium castaneum (Herbst, 1797) [39]. This is in contrast to a recent comparison of 29 beetle genomes that showed imperfect microsatellites were the most common short-sequence repeat type [40]. Comparisons between genomic characterizations of microsatellite loci must be taken with caution due to a current lack of standardization of detection criteria and VNTR classes.
The most frequent VNTR class has been reported to be a highly heterogeneous feature of beetle genomes, particularly in Scarabaeidae species [40]. However, after excluding mononucleotide microsatellites (as we did in our analyses), dinucleotides and trinucleotides seem to be the most frequent microsatellite motif sizes [40,41]. Our results are consistent with this observation, as we found a trend of decreasing VNTR frequency with an increase in motif size. Our results are also consistent with the general pattern of short sequence repeats to favor poly(A) and poly(T) motifs over poly(G) and poly(C) in eukaryotic genomes [40,42,43].

Non-tandem repeat polymorphisms in Ccy09 and Ccy12
The raw allele reads of Ccy09 and Ccy12 were not consistent with the expected three-nucleotide difference in size; as a result, we binned these loci as mononucleotide microsatellites. These markers might show evidence of nontandem repeat polymorphisms, such as SNPs or indel mutations, and thus would not be perfect microsatellites. Complex mutational processes that produce variations in allele length at microsatellite loci have been reported in several eukaryote groups, including yeasts [44,45], fish [46], and insects [39].
Imperfect microsatellites have pros and cons when used as molecular markers for DNA fingerprinting and population genetics compared to perfect microsatellites [47]. Additional alleles provide more information that is useful for paternity testing and mapping studies and can achieve a  higher resolution of population structure [47,48]. Ccy09 and Ccy12 showed higher genetic diversity and PIC values than the other loci. However, the full advantages of imperfect microsatellites can only be achieved by using DNA sequencing. Moreover, mutational models of imperfect microsatellites are more complex than those of perfect microsatellites [47].

Polymorphism analyses
Overall, our markers displayed moderate levels of genetic diversity with an effective number of alleles that suggests there are few dominant alleles per locus. Additionally, a deficit of heterozygote samples was found compared to what would be expected under HW equilibrium. This was expected due to the small sample size used to evaluate the primers and the use of specimens from multiple localities to sample as many alleles as possible. Since a potential hidden genetic structure might be causing the heterozygote deficitoften referred to as the Wahlund effect [49] -, we decided not to test for departure from HW equilibrium.
More accurate estimates of diversity for each marker could be obtained by increasing the genotyped sample size. However, with this small sample size, we estimated a P ID of 1.299 × 10 -12 for unrelated individuals and 1.563 × 10 -4 for siblings over the full set of 14 markers. Based on these values, the probability of sampling two unrelated individuals with the same genotypes across the 14 markers is 1 in ca. 770 billion and 1 in 6000 for sibling beetles. This highlights the highly informative power of the full set of 14 microsatellite markers. Finally, the robustness of these markers to measure the genetic structure in C. cyanellus populations and lineages should be evaluated using larger number of samples from each locality.

Conclusion
The set of 14 C. cyanellus microsatellites that were isolated and characterized provides a valuable genetic tool to assess structure at the population or lineage levels and analyze hybridization processes. These microsatellites will help to better understand the geographic distribution of the lineages in this species complex. Further studies including additional populations from Central and South America would contribute to the establishment of a sound taxonomic delimitation for these lineages.