Samples studied
Four Manila clam (R. philippinarum) samples, three from the Adriatic Sea (Italy: Chioggia, N= 30; Porto Marghera, N= 30 [45]; and Po river mouth N= 25) and one from the Atlantic Ocean (Spain: Galicia, N= 25), were studied. Four common edible cockle (C. edule) samples from the European Atlantic area (Somme Bay, France, N= 30; Campelo, Spain, N= 30; Miño River mouth, Spain, N= 30; and Ría Formosa, Portugal, N= 30) were used from regions with extractive cockle activities [46]. Three samples of brown trout (Salmo trutta) from Duero River basin in the Iberian Peninsula (Águeda, N= 15; Omaña, N= 20; and Pisuerga, N= 17), two of them representing different mitochondrial pure native lineages (Atlantic and Duero, [47, 48]) and one from the hybrid zone (Omaña; [49]), were evaluated. Two populations of silver catfish (R. quelen), a Neotropical freshwater species distributed from the Northeast of Los Andes to the centre of Argentina and living in fluvial and coastal lagoon environments [50], were analysed. These samples came from Sauce Lagoon (N=10) and Uruguay River Basin (N= 11) belonging to two divergent lineages [29]. Finally, two nearby populations without genetic differentiation of small-spotted catshark (S. canicula) from North Sea (N= 13) and Irish Sea (N= 15) were analysed [30]. All information from samples analysed is summarized in supplementary material (Table S6).
Library preparation
DNA extraction and 2b-RAD libraries preparation using AlfI IIB RE followed the same protocol except for small-spotted catshark [30] where CspCI IIB RE was used instead. The libraries were sequenced using Illumina sequencing platforms (i.e. HiSeq 1500 for small-spotted catshark, NextSeq500 for the remaining species) following a 50 bp single-end chemistry. For details see [29, 30].
Bioinformatic Analysis
Building-loci pipelines: background
STACKS 2.0 and Meyer’s 2b-RAD v2.1 were the building-loci pipelines chosen for comparing their performances using a de novo approach within the broad genome and population genetics species scenarios selected. Meyer’s 2b-RAD v2.1 pipeline and STACKS building-loci pipelines have some similarities on their strategy; roughly, both are based on stacking reads into putative loci by sequence similarity, assuming that each locus correspond to a single place in the species genome. Nevertheless, there are many differences on how loci are built and how the user can control the existing options and genotyping strategies. STACKS, with a de novo approach, works firstly at the individual level demanding a number of identical reads to build a locus (i.e –m parameter in ustacks), while the 2b-RAD v2.1 pipeline works with a combined subset of samples to build a global reference panel to which align every read. For genotyping, STACKS uses a chi square test to call a heterozygote or a homozygote (i.e. –alpha and --gt-alpha), whereas nucleotide frequencies based on allele read depth at each position and sample are used for genotyping in the 2b-RAD v2.1 pipeline . Accordingly, huge differences in the raw SNP number were obtained in preliminary analysis in our study previous analysis. Anyway, we tried to apply the highest number of common parameters in both pipelines to be consistent among comparisons.
STACKS 2.0 pipeline can be summarized as follows: (1) Raw sequence reads were demultiplexed and filtered according to different criteria such as quality, uncalled bases and read length (process_radtags); (2) reads from each individual were clustered into putative loci, and polymorphic nucleotide sites identified (ustacks); (3) putative loci were grouped across individuals and catalogues of RAD-loci, SNPs and alleles were constructed (cstacks); (4) putative loci from each individual were matched against the catalogue (sstacks); (5) the data were transposed to be oriented by locus, instead of by sample (tsv2bam); (6) all individuals were genotyped at each called SNP (gstacks); and (7) SNPs were finally subjected to population genetics filters (populations) and results written in different output files (e.g. GENEPOP[51], STRUCTURE[52] file formats).
The de novo approach used for the 2b-RAD v2.1 pipeline can be summarized as follows: (1) from a subset of high quality reads (Phred quality scores ≥ 30 at all positions) from all samples, a global de novo reference catalogue was built by clustering these reads with the BuildRef.pl script and cd-hit 4.6.8 [53,54]; (2) every read was aligned against the reference catalogue using a mapping program (in our case Bowtie 1.1.2 [55]); (3) allele frequencies were counted at each position (SAMBasecaller.pl) and genotypes determined from that information (NFGenotyper.pl); and (4) genotypes called across samples were combined into a single genotype matrix with samples as columns and loci as rows (CombineGenotypes.pl). Perl scripts belonging to the last version are publicly available (https://github.com/Eli-Meyer/2brad_utilities/) and earlier versions on request.
A reference genome-based pipeline was used for Manila clam and brown trout, the species with chromosome assembly level reference genomes. In this case, the differences in the pipelines of STACKS 2 and Meyer’s 2b-RAD v2.1 are lower. For instance, STACKS 2 reduces the number of modules necessary from six to two when using reference genome and the building of putative loci is conditioned by the step using a short-read aligner. Our goal was not to compare this option between pipelines, but to take as reference the genome-based approach to be compared with the de novo approach as a gold standard within each pipeline.
Building-loci pipelines: analysis
After demultiplexing raw data, several filtering criteria were applied: (1) all reads were trimmed and filtered by the RE recognition site to retain only those sequences of 36 bp (or 32 bp from CII RE catshark) centred on the RE recognition site using our own Perl scripts and Trimmomatic 0.38 [56]; and (2) process_radtags (module belonging to STACKS) was used to remove reads with uncalled bases (-c option). Other parameters were species-specific (e.g. window sliding size, -w; score limit, -see Table S7A). To check raw and filtered reads quality, FastQC 0.11.7 [57] was used. STACKS input sequences were oriented in the same orientation using our own Perl script to avoid oversplitting. To say, the set of input reads for both building-loci pipelines in each species was always the same.
At the building-loci pipeline step, the parameters considered were: (1) minimum number of identical reads to create a stack (default values were used); (2) maximum number of mismatches between RAD loci within and between individuals (-M 2/-n 2 for fish or -M 3/-n 3 for bivalves and their analogous parameters with ALT pipeline); (3) indels were discarded (i.e. --disable-gapped at different modules); (4) SNP calling model and alpha cut-off: their default values were used in the STA pipeline (STACKS 2.0), while in the ALT pipeline (Meyer’s 2b-RAD v2.1) we considered a range of 0.1-0.2 to determine the genotype at each position (default values are 0.01-0.25); to say, when the frequency of the less frequent allele was lower than 0.1 the genotype was called as homozygous while frequencies higher than 0.2 were called as heterozygous; intermediate allele frequencies for the less frequent allele were called as uncertain (see Tables S7B and S7C).
When a reference genome was available, Bowtie 1.1.2 was used as short read aligner. The number of mismatches allowed between reads and the reference genome using the -v alignment mode was 2 mismatches for brown trout and 3 mismatches for Manila clam, the same as mentioned above. Only reads which aligned to a single site in the reference genome were considered (-m 1). The same parameter values were used in STACKS modules shared between reference-based genome and de novo approaches (i.e. gstacks and populations modules).
SNP filtering steps and creation of datasets
The raw SNP panels of the STA and ALT pipelines were filtered using the same parameters for consistency, retaining only a set of markers and alleles represented across the individuals genotyped. Filters were applied in the same order for each dataset (Fig. 1), as recommended [9]: i) SNPs with more than two alleles were excluded (BIAL filter); ii) a minimum locus coverage of eight reads was chosen (Min coverage filter); iii) RAD-loci with more than three SNPs were excluded for further analyses; iv) SNPs were retained only if the less frequent allele was represented at least three times at the whole species sample (MAC, minimum allele count, filter); v) less than 40% of missing data in each population for a SNP to be retained (POP filter); vi) SNPs were excluded when they did not adjust to Hardy-Weinberg (HW) expectations (P < 0.01) in more than half of the populations analysed (HW filter); and vii) only the first SNP per RAD-locus was retained when several SNPs were called in the same RAD-locus to avoid redundant information.
According to the aforementioned criteria, four SNP panels were tested and compared: (1) STA and (2) ALT SNP panels were further used to obtain (3) common (COM) and (4) merged (MER) panels. When reference genome was available three additional SNP panels were obtained (i.e. RG, RG-STA, RG-ALT). RAD-loci of these panels from both pipelines were compared to identify shared and private RAD-loci. In order to do this, cd-hit-est was used to cluster similar RAD-loci taken from the two pipeline catalogues with the same threshold of similarity (-c) used in the clustering steps of building-loci pipelines (i.e. two mismatches maximum for fish species and three mismatches maximum for bivalve species). Furthermore, we used a -g value of 1 when clustering sequences to meet the established similarity threshold and a band alignment width (-b) of 1 to avoid previously separated sequences due to indels at specific clusters. This procedure rendered COM and MER SNP panels created using a customized Perl script (see Supplementary material). SNPs at shared RAD-loci between STA and ALT panels were selected after the sixth filtering step, since the first SNP at shared RAD-loci could be different after the full filtering pipeline (Fig. 1). MER panel was finally created by taking the SNPs from “private” ALT and STA pipelines RAD-loci (i.e. those from cd-hit-est clusters with only STA or ALT pipelines RAD-tags) plus the COM SNP panel previously obtained. Again, only the first SNP per RAD locus was retained to avoid redundant information. The GENEPOP files of all shared SNP panels were compared to quantify their genotyping differences using own Perl script (see Supplementary material, customised_scripts.zip).
Comparison of outputs and population genetics analyses.
The results of the aforementioned pipelines were compared using both quantitative (i.e. number of SNPs) and biological criteria (population genetics data). Filtered GENEPOP files were obtained using customized Perl scripts. These files were transformed for subsequent analyses using the PGDSPIDER 2.1.1.5 software [58]. Firstly, the consequences of filtering over the number of RAD-loci and SNPs were evaluated for each combination of species-pipeline; secondly, common and private RAD-loci/SNPs between the two pipelines were obtained for each species. Finally, biological interpretations from each pipeline/species were compared, including basic population genetics results (i.e. genetic diversity levels and population structure).
Observed and expected heterozygosity (Ho and He, respectively), inbreeding coefficient (FIS, using 1000 bootstrap iterations to estimate their 95% confidence intervals) and allelic richness were calculated per population using R’s package diveRsity [59]. Global FST was calculated with Genepop R package [51]. STRUCTURE software [52], using R package ParallelStructure [60], was used to define the most likely number of population units (K) present with LOCPRIOR model with correlated allele frequencies model, testing K values from 1 to the number of sampling localities in the species dataset + 1 with 10 replicates composed by 100,000 Markov chain Monte Carlo (MCMC) replicates and a burn-in period of 10,000 steps. Structure results were parsed with Structure Harvester [61], which implements the Evanno’s method [62] to detect the most likely number of clusters according to the data. CLUMPAK [63] was used to merge runs with the same K that suggested similar patterns of structuring and to obtain cluster membership plots. As a second approach to detect population structure, a Discriminant Analysis of Principal Components (DAPC) analysis was performed based on genetic data, as implemented in R package adegenet [64,65]. The optimal number of principal components to be used was estimated with the cross-validation method implemented in the package and from one to three discriminant components were retained according to the amount of population structure variation they explained. Finally, outlier loci potentially under selection (OL), i.e. those showing higher or lower differentiation values (i.e. FST) across populations than the neutral background, were detected using the Bayesian approach implemented in BAYESCAN 2.1 [66] with default parameters. Loci with a Log10 posterior odds (PO) higher than 1.5 were retained as potential outliers for later comparison among the four datasets resulting from the two pipelines.