Characterization of novel genotyping-by-sequencing (GBS)-based simple sequence repeats (SSRs) and their application for population genomics of Capoeta aculeata (Valenciennes, 1844)

The species Capoeta aculeata (Valenciennes, 1844) is one of the most important freshwater species endemic to Iran. However, the investigation of the population genetic structure of this species is limited by the low number of molecular markers currently described. In this study, we implemented next generation sequencing technology to identify polymorphic microsatellite markers and investigate the population genetic structure of C. aculeata sampled from three geographical sites in Iran. We characterized and developed 36 novel polymorphic microsatellite markers and these loci were examined in 120 individuals from three populations occurring in the Zagros basin. The average number of alleles per locus varied from 1.7 to 16 (average = 7.89). The results showed that, the polymorphism information content (PIC) of these simple sequence repeat (SSR) loci varied from 0.254 to 0.888. The observed heterozygosity (HO) per locus ranged from 0.170 to 0.881, while the expected heterozygosity (HE) per locus was from 0.170 to 0.881. Among these SSR loci, 20 loci deviated significantly from the Hardy–Weinberg equilibrium after Bonferroni correction (p < 0.05). These microsatellite markers could provide a valuable tool for future population and conservation genetics studies of C. aculeate and other closely related species.


Introduction
The genus Capoeta is a herbivorous cyprinid genus and highly diversified with 14 species and widely distributed in water bodies of Western Iran. Kohgiluyeh and Boyer-Ahmad Province in the southwest part of Iran is a region with high number of endemism in some freshwater fish species including Capoeta aculeata (Valenciennes, 1844) [1]. However, populations of this species have experienced significant declines over the last 20 years, due primarily to habitat loss and fragmentation caused by water storage and land clearing. The previous phylogenetic and phylogeographic studies found that populations of C. aculeata are different from the Capoeta species [2][3][4][5].
Simple sequence repeat (SSR), also known as microsatellite markers are highly variable DNA sequences, distributed throughout eukaryotic genomes [6,7]. Given its co-dominant inheritance, bi-parental mode of inheritance, polymerase chain reaction (PCR)-based, high amplification success rate and levels of polymorphism, microsatellite markers have been used in a wide range of applications in population genetics, molecular ecology, conservation genetics and quantitative trait loci (QTL) identification studies [8][9][10]. Nevertheless, the conventional methods for developing SSR markers are believed to be the key element which limits the application of microsatellite markers due to they 1 3 are expensive, low efficiency, laborious and time-consuming [11,12]. In fact, SSR markers are tremendously effective tools in population genetic studies because they might reveal the distinct population Segments (DPSs) even in fine-scale genetic structure studies. Until now, microsatellite loci have not been reported in this species [13].
The major advantage of Next Generation Sequencing (NGS) technique is their ability to produce large amounts of sequence data and it can provide wide genetic information of simple sequence repeat (SSR) or microsatellite loci [14,15]. Restriction-site associated DNA sequencing (RAD-seq) is a powerful tool to develop microsatellite and single nucleotide polymorphism (SNP) markers, which uses Illumina nextgeneration sequencing (NGS) platforms to discover informative volume of independent markers in a lot of specimens at the same time [16][17][18].
The reads created by RAD-seq can be grouped according to the enzyme recognition sequence, so this procedure could develop the accuracy and precision of contigs assembly, and consistently expand the success rate of polymorphic microsatellite markers development [19,20]. In comparison with other techniques, restriction site-associated DNA sequencing (RAD-seq) gradually becomes the most promising method and can give genome-wide genotypic and sequence data adjacent to single restriction enzyme cut sites and generate reduced representation genomic libraries [21,22]. Recently, the characterization and development of RAD-seq have been used to report microsatellite markers in many cyprinid species, such as Ctenopharyngodon Idella [23], Schizothorax prenanti [24] and Xenocypris argentea [25]. However, the genetic background of C. aculeata is limited and no microsatellite locus has been characterized for this species. In this study, RAD-seq was used to isolate and develop the first resource of microsatellite loci for C. aculeata from Iran.

Materials and methods
In this study, fin clips of 120 individuals were collected from three wild C. aculeata populations inhabiting the Beshar, Khersan, Maroun rivers in Kohgiluyeh and Boyer-Ahmad Province in Iran (Fig. 1, Table 1). All procedures were approved by the Committee on Animal Care and Use, and Committee on the Ethics of Animal Experiments of Agricultural Research, Education and Extension Organization (AREEO) and Islamic Azad University. The modified method [26] was used to extract genomic DNA using the commercial DN-easy blood and tissue kit (Qiagen™, Germany) following the manufacturer's instructions. The quality of DNA was assessed using a Nanodrop spectrophotometer (Nanodrop ND1000) and 1% agarose-gel electrophoresis, after which DNA concentration of each sample was diluted to 100 ng in a 100 μL volume and stored at -20 °C [27].

Screening of microsatellite loci by PCR
Total genomic DNA was isolated and purified from fin clips (about 30-50 mg) of each individual using the commercial DN-easy blood and tissue kit (Qiagen™, Germany) following the manufacturer's instructions. Then, the final DNA concentration of each sample was standardized to 100 ng in a 10 μL volume [26]. The quality and concentrations of DNA samples were checked on 1.0% agarose gels and Nanodrop 1000 spectrophotometer, respectively.

RAD library construction
RAD library preparation was carried out following the protocol of [28]. The RAD sequencing of ten samples per each population was performed to develop microsatellite markers. Briefly, genomic DNA from each individual was digested for 2 h at 37 °C in a 40 μL reaction at specific sites with EcoRI-HF restriction enzymes according to the RAD protocol [29]. Adapter P1 containing individual-specific nucleotide barcodes was ligated to the digested product. Ligation reaction was performed in a 40 μL volume along with 5 μL of ligation buffer, 0.85 μL of P1 adapter, and 300 units of ligase for 2 h at 25 °C. DNA was subsequently ligated to a second adapter (P2), which contains T overhangs. The ligated materials were then pooled, purified, eluted and subjected to PCR enrichment. Total genomic DNA samples were sheared to fragment DNA to an average size of 350 bp and a pool of adapter-ligated molecules with random, variable ends was generated. DNA fragments with lengths of 250-500 bp were isolated using the Qiagen Min Elute Gel Extraction kit (Qia-gen™, Germany). The libraries constructed were sequenced on the Illumina HiSeq 4000 platform (BGI, China) using 150 bp paired-end reads.
The quality control of the raw sequences was initially evaluated with the program FastQC (Babraham Bioinformatics) [30]. Reads were controlled with Cutadapt to eliminate potential ligated materials of Illumina adapters, permitting a 10% mismatch rate in the adapter sequence [31]. During this process, the programs process_radtags module and clone_filter from the software Stacks version 2.52 (http:// catch enlab. life. illin ois. edu/ stacks/) were used to demultiplex data and trim adapter sequences and remove over-represented sequences, respectively [32]. To avoid low-quality reads with artificial bias, the following thresholds were set [33]: reads with adapter contamination were removed; reads with ≥ 10% unidentified nucleotides were removed; reads with > 50% bases having Phred quality < 20 were removed; putative duplication reads generated by PCR amplification in library construction were discarded; reads were checked for presence of the partial 5 bp EcoRI motif (AATTC). De novo assembly was performed using software SPAdes [34]. The contigs shorter than 100 bp were removed from the assembly. To evaluate the quality of assembly, the reads involved in alignment were re-mapped onto the assembled genome using BWA program (version 0.7.17) [35].
All clean sequences were aligned to the reference assembly using BWA program. Microsatellite motif repeats of mono to hexa-nucleotide microsatellites were identified from the assembled genome using MIcroSAtellite identification tool (MISA) (http:// pgrc. ipk-gater sleben. de/ misa/). The following criteria of motifs were required to be met: at least 10 repetitive motifs for mononucleotide, 6 repetitive motifs for di-nucleotide, and 5 repetitive motifs for trito hexa-nucleotide. The number of microsatellite loci, repeat motifs, number of repeats, motifs sequence length, repeat type, start and end location of the repeat sequence, and microsatellite sequence, were analyzed using MISA. The software TFPGA ver. 1.3 [36] based on Nei's [37] was used to construct Unweighted Pair Group Method with Arithmetic mean (UPGMA) dendrogram based on Nei's [37] unbiased distance.
A total of 36 polymorphic loci were selected for validation by PCR. Polymerase chain reaction (PCR) reactions were performed in a final volume of 25 μL consisting of 1.2 μL of template DNA (100 ng), 2.5 μL of 10 × PCR buffer, 2.5 μL of Mg 2+ (20 mM), 0.85 μL of dNTP mixture (2.5 mM each), 0.30 μL of each primer (10 pmol/μL), 17.35 μL of ddH2O, and 0.3 μL of 1 U/μL Taq DNA polymerase (Qiagen™, Germany). Amplified products were checked on 1% agarose gel electrophoresis and sequenced using ABI Prism 3730 automated DNA analyzer (Applied Biosystems).  Table 1   Table 1 The samples of C. aculeata collected for population genetic analysis. Sampling localities, numbers (see Fig. 1 The PCR amplification procedure was performed in a thermal cycler (Bio-Rad, USA) using following program: first denaturation 94 °C, 30 s, one cycle; denaturation 94 °C, 30 s, 12 cycles; a specified annealing temperature for each primer pair for 30 s, 60 s, 12 cycles; then the products were extended at 72 °C for 7 min (for annealing temperatures of each primer pair, see Table 2). The PCR products were electrophoresed on 1.2% agarose gels, stained with ethidium bromide followed by visualization under UV light. SSR loci with single bands were chosen as candidates for additional validation.

Validation of microsatellite polymorphism
For validated SSR loci, each of the selected primer sequences was validated with 120 single individual of C. aculeata from three sampling locations in Iran. From fragment sizing and reporting, GeneMapper (v4.1) was used to analyze genotype data. The negative controls with nuclease-free sterile water were included in each PCR experiment. We randomly selected 80 microsatellite loci to be amplified in nine from three different populations (three individual per population) of Capoeta aculeate for preliminary testing of all the primers. To investigate polymorphism of the loci, non-denaturing 12% polyacrylamide gel electrophoresis (PAGE) was used to screen for size variants. After the initial tests, 36 primer pairs were observed to be potentially useful and further used to test polymorphism of the loci. We carried out multiplex PCR according to the manufacturer's instructions. We used the programs MICRO-CHECKER v. 2.2.3 [38] to screen for scoring errors and allelic dropout and FREENA to estimate the frequency of null alleles in all loci [39]. For validated loci, statistics including the number of alleles (N A ), the effective number of alleles (Ne), the observed heterozygosity (H O ), expected heterozygosity (H E ) indices and inbreeding coefficient (F IS ) were calculated in GenAlex v6.5 [40]. The Hardy-Weinberg distribution (HWE) with Bonferroni correction was estimated at the population level using Arlequin v3.0 [41]. The inbreeding coefficient index (F IS ) and the polymorphism information content (PIC) were estimated using Genepop V4.7 [42].

Results
In the present study, the total GBS raw reads varied from 11.4 to 22.38 million, with total raw nucleotides ranging from 0.93 Gb to 2.96 Gb. GBS sequence analysis and microsatellite mining were performed using a modified GBS analysis workflow without using a reference genome (Fig. 2). After filtering low-quality reads and reads that lacked enzyme cutting sites by the process_radtags component of Stacks v1.39 software, the total number of clean reads kept in the libraries varied from 10.81 to 21.74 million, with an average of 15.72 million. The summary of sequencing reads generated for 30 samples of C. aculeata were listed in Table 3.
In the present study, the average of the allele numbers inspected at each locus varied between 1.7 for locus Ca04 to 16 for locus Ca25 ( Table 3). The results showed that, the polymorphism information content (PCI) of these SSR loci varied from 0.254 to 0.888. The observed heterozygosity (Ho) per locus ranged from 0.170 to 0.881, while the expected heterozygosity (He) per locus was from 0.213 to 0.884. The final set of all microsatellite loci were most informative (PIC > 0.50). Across all samples, 20 SSR loci showed significant deviations from the Hardy-Weinberg equilibrium after Bonferroni correction, (p < 0.05) ( Table 3).
The AMOVA implied that there was significant differences (p < 0.001) among three regions. The AMOVA also partitioned of total 23.53% genetic variation among the populations and 59.83% of the total within the individuals, indicating that much of the variation was within the individuals (Table 4). Results from the AMOVA analysis were supported by the UPGMA dendrogram for overall population differences (p < 0.001). The UPGMA dendrogram was constructed based on Nei's unbiased genetic distance among all the examined populations of C. aculeata by using the software TFPGA. The dendrogram showed the populations of C. aculeata clustered onto two major branches (Fig. 4). The upper cluster included the Khersan and Beshar River populations while the lower cluster showed Maroun population.

Discussion
For non-model organisms, microsatellite loci were generally identified by 5̕ anchor PCR [43], transcriptome sequence analysis [44,45], and microsatellite library screening [46] and development of microsatellite loci through cross-species amplification of closely related species [47]. In the present study, RAD-seq method was used for isolating microsatellite loci in C. aculeata. In contrast with conventional approaches, RAD-seq approach is considered to be faster       and profitable. Furthermore, this method can produce a large number of microsatellite markers for one time [17,25].
Population structure genetic is very important for the sustainability of many species. For example for Labeo bata [48], blunt snout bream Megalobrama amblycephala [49], sea cucumber Holothuria scabra [50], red snapper Lutjanus campechanus [51], pearl oyster Pinctada fucata [52] catfish, Rhamdia sp. [53], yellow catfish Pseudobagrus fulvidraco [54], Persian sturgeon (Acipenser persicus) [55], narrowclawed crayfish Pontastacus leptodactylus [27] and fan mussel Pinna nobilis [56]. Conservation management plans with no prior information of the genetic background could result in disturbance to the genetic structure with adverse effects on the gene pools of wild populations [18]. Until now, the microsatellite loci have not been carefully developed for C. aculeata which has posed a serious obstacle to conservation and management of this species [13]. The current study detected 16,706 microsatellite markers in C. aculeata genome using MISA, which accounted for 2.12% of the total genome sequence. The relative abundance of microsatellite sequences was calculated at 1.16 loci per kb of C. aculeata genomes. In general, the frequencies of microsatellite loci are expected to decrease with increasing repeat length due to longer repeats have a higher possibility of being mutated [57], and this tendency has been identified in many organisms [58][59][60]. According to Nei's unbiased genetic distance among all the investigated populations of C. aculeata, the UPGMA dendrogram clustered the populations onto two major branches indicating their genetic relatedness partially following their geographic distribution. The populations lying in the same cluster with restricted genetic distance despite larger geographic distance indicate the stock management structure of fisheries department in the province.

Conclusion
This study is the first report of SSR loci in C. aculeate developed using RAD-seq. In the present study we observed SSR loci for Capoeta aculeata from contigs assembled from GBS data based on the Illumina HiSeq platform. From the identified contigs, 36 polymorphic SSR markers were characterized. Moreover, we also indicated that NGS data is a useful method for genotyping of this species. Therefore these SSR markers would provide an invaluable resource for population genetics and natural resource conservation in C. aculeata [57][58][59][60].
Acknowledgements This work was supported by the Research Fund for the department of Fisheries of the Islamic Azad University of Tehran, Iran.
Author contributions HG made the literature, sample collection, material preparation. SPHS has major contribution to the preliminary study design and interpretation of data. The first draft of the manuscript was written by HAA. SN designed the experiment, institutional support and supervision of the preliminary experiment and made the data analysis. MSM has significant contributions to the laboratory work and interpretation of the results. All authors read and approved the final manuscript.

Data availability
The datasets generated are available from the corresponding author upon request.

Conflict of interest
The authors declare that they have no conflict of interest.

Ethical approval
The main research project researcher of this work was approved for animal ethics by Iranian Fisheries Sciences Research Institute (IFSRI) for scientific purposes development, Iran. All applicable guidelines for the care and use of animals were followed by the authors. The experiment was approved for Animal Welfare by the Institutional Animal Welfare Committee of Agricultural Research, Education and Extension Organization (AREEO), Iran.
Informed consent This research does not involve humans and therefore informed consents are not applicable.

Consent to participate (ethics)
Informed consent was obtained from all individual participants included in the study.