In this study, new genetic markers for goldsinny wrasse were developed and validated by: I) sequencing one goldsinny wrasse sample with high coverage to build a genomic reference de novo, II) selecting the longest contigs from the assembly, and III) mapping individual reads from 60 fish from four populations (15 from each) against the reference contigs to call SNPs. A random subset of the high-quality SNPs obtained (1 SNP/~kbp, in total 33 235), were then used to, IV) run population genetic analyses to explore basic population parameters, and to characterize genetic differentiation among the populations as well as to look for signs of selection. Finally, V) a set of 173 candidate SNP loci showing large divergence between northern and southern Scandinavian populations was selected and validated using Agena MassARRAY® iPLEX platform with an additional set of samples (N=94).
5.1. Samples and sequencing
60 goldsinny wrasse from four locations were used to search for and develop SNP markers against the reference contigs from one individual. 16 individuals (including a single fish used to build a reference genome) were collected from Varberg (VAR) in south-western Sweden, 15 from Bodø (BOD) in Northern Norway, 15 from Isle of Mull in Scotland, UK (SCO), and 15 from Galicia region (GAL) in north-western Spain (Fig. 1). These samples were collected in 2014-2016, and represent a subset of the samples used in a previous study of population genetic structure (Jansson et al. 2017). Samples were collected in compliance with EU Directive 2010/63/EU, and the national legislations in each country. Fish were killed upon catch and samples were taken immediately or killed and whole fish stored frozen until sampling in laboratory facilities. Details on data collection and DNA extraction are provided in Jansson et al. 2017. Samples were selected I) to cover the species’ north-eastern Atlantic distribution area, II) to represent genetically most diverged populations from this area (based on results in Jansson et al. 2017), and III) being of good DNA quality and quantity. DNA quality was assessed by running samples on a 1% agarose gel, measuring their absorbance by a Nanodrop spectrophotometer, and by estimating DNA quantity with Qubit Quant-iT kit from Invitrogen. Superior samples (size >10 000bp, Abs260/280 = 1.8-2.0 and Abs260/230 = 1.8-2.4) from each population were selected for sequencing (see below). 4 µg of DNA from the selected reference individual (VAR-ref-77; sex unknown) and 2.5 µg from each selected individual from the four populations were sent to the Norwegian Sequencing Centre (NSC) for sequencing. At the NSC, Illumina TruSeq adapter ligation was used to construct libraries, DNA fragmented to 300 bp target size, and barcoded to enable individual identification. Sequencing was done using an Illumina HiSeq X instrument producing 2x150 bp paired-end (PE) reads. The reference individual was run in a single lane, whereas the rest of the samples (60) were pooled in four lanes according to their origin (15 individuals/lane, all individual barcoded). Each lane produced between 302-316 GB of data as FASTQ files; for the reference individual, we obtained 451.1 million read pairs, and for pooled population samples 455.8–474.5 million read pairs per lane (on average 30.7±12.9 x106 read pairs per individual, range 16.0-102.4 x106).
5.2. Building and evaluating the de novo reference
Read quality was checked with FastQC (Andrews 2010), and followed by trimming of sequences with Trimmomatic (v. 0.32; Bolger et al. 2014) where adapter contaminants and low-quality reads were removed using the following setting: phred score ≥ 33, leading and trailing low-quality bases removed, scanning in 4 bp windows and cutting when the average quality per base drops below 15, minimum length for a sequence to be retained = 36 bp. Only correctly paired, high-quality reads were retained for the next phase, and consisted of 92.5% (~417 million) of the initial raw reads. Genome size and average sequencing coverage was estimated using K-mer statistics with a K-mer size of 32 and software kmx (K-Mer indeXing; available at: https://github.com/ketil-malde/kmx). Expectation-maximization was used to fit a negative binomial distribution to the error K-mers, and Poisson distributions for haploid, diploid, and repeat K-mers.
To build a de novo reference, we used SOAPdenovo2 (v.2.0.4; Luo et al. 2012) with default parameter values and increasing K-mer sizes (bp) from 43 to 123 (every tenth), and with the assembler’s possible maximum value of 127. Of these 10 assemblies, the assembly with K-mer size 127 produced the longest continuous scaffolds (based on scaffold length and N50), and was selected as our genomic reference. The reference quality was assessed based on contig lengths and their total amount (i.e. fragmentation), content of core genes (i.e. how many percent of ray-finned fish core genes were found using BUSCO (Benchmarking Universal Single-Copy Orthologs, v. 2.0.1; Simão et al. 2015)), and by checking the mapping success of each individual against the reference (as % of reads mapped).
5.3. SNP calling, selection and validation
Individual reads from all 60 fish were aligned against the de novo reference genome using mem method in BWA (v. 0.7.5a; Li & Durbin, 2009a), using default parameters. Sorting, indexing and variant calling was done with SAMtools (v.0.1.19; Li & Durbin, 2009b), and followed by filtering with BCFtools (v. 0.1.17; Danecek et al. 2011). Supplementary alignments were removed, and due to computational constraints, only reference scaffolds >20kbp were included. The vcfR package (Knaus & Grünwald 2017) in R (version 3.2.2; R core team 2015) was used to scrutinize quality and quantity of the aligned sequences, and the following filtering criteria (per population) was chosen for any SNP to be accepted: min_QUAL=600, min_DP=20, max_DP=999, min_MQ=0, and max_MQ=90. Due to observed discrepancies between the expected genotypes from mapping data and the observed genotypes from the genotyping platform (see 2.3 in Results), an additional a posteriori step of SNP selection was implemented. The output of the mapping was re-examined, and the average of the individual phred-scaled quality score was calculated for each of the 173 SNPs within Bodo and Varberg samples separately. The average individual phred-scaled mapping quality ranged from 0 to 100 in the Bodo sample and from 0 to 80 in the Varberg sample. SNPs with both average quality larger than 50 in Bodo sample and larger than 40 in Varberg samples, were selected as a subset of the 74 most robust markers.
Two separate data sets were created to: 1) Study population genomic patterns and divergence among the populations. For this, a random subset of the high-quality SNPs was selected along the longest scaffolds so that the physical distance between them was ≥1000bp. This resulted in a data set of 33 866 SNPs. 2) Select putatively diagnostic SNPs for pair-wise separation of the four populations. To do this, the SNPs showing highest pairwise Fst between each population pair were selected from each scaffold. These SNPs were thereafter ranked from the highest to lowest pairwise FST (for each pair separately) down to a value of ~0.15, leaving a few hundred candidate loci per pair (~450-850; Tables S1a-f). To validate the SNPs that separated the Northern and Southern Scandinavian populations, 231 of the top-SNPs were selected and organized into eight multiplex groups (28-30 SNPs in each; Supplementary Table 2) using the MassARRAY® Typer 4.0 Assay Design software (Agena Bioscience). Putatively diagnostic SNPs for other pair-wise comparisons were developed (Supplementary Tables 1b-f), but not validated in this study. Genotyping to validate the diagnostic SNPs developed between the northern and southern Scandinavian samples was performed on a MassARRAY® Typer 4.0 Analyser (Agena Bioscience). Only loci that were polymorphic and produced clear clustering patterns, were selected leaving 173 SNPs. 47 samples from Bodø in Northern Norway and Varberg on the Swedish West coast representing roughly the current edges of the species’ distribution area in Scandinavia, were genotyped and analysed.
5.4. Population analyses of genome wide data
First, 609 of the 33 866 SNPs showing no or very little variation (MAF<0.01) were removed. Remaining loci were checked for possible deviations in Hardy-Weinberg equilibrium (HWE) using pegas package v. 0.11 (Paradis 2010) in R. 22 SNPs not in HWE after False Discovery Rate correction (FDR; Benjamini & Hochberg 1995) in at least two of the studied populations, were removed from the final dataset (Nloci= 33 235). Linkage disequilibrium was tested by computing the correlation coefficient (r2) between all pairs of loci that were physically linked on the same scaffold. In addition, a random set of 100 000 pairs of loci was sampled across scaffold, to produce a reference distribution of r2 values without physical linkage. The R package hierfstat (v. 0.04-28; Goudet 2005) was used to calculate gene diversity (HS), and observed heterozygosity (Ho) and overall FIS within populations. Pairwise FST values (Weir & Cockerham 1984) were estimated for each population pair and over all samples using the same package, and their statistical significance was determined comparing the observed values with 95% confidence limits determined by 1000 bootstrap repeats. For examination of genetic relatedness between samples, R packages poppr (v.2.8.1; Kamvar et al. 2014) and ape (v. 5.2; Paradis et al. 2004) were utilized to reconstruct a genetic distance tree based on information from all 33 235 SNPs using UPGMA algorithm and 100 bootstrap replicates to assess branch support.
To investigate population structure further, and test assignment probability of individuals back to populations based on their genotypes, we performed a Discriminant Analysis of Principal Components (DAPC; Jombart et al. 2010) using the R package adegenet (v. 2.1.1; Jombart 2008; Jombart and Ahmed 2011). This is a multivariate method useful for exploring separation of groups in large genomic datasets, even when the general level of divergence between populations is low (Jombart et al. 2010). The method identifies structuring alleles, and maximizes among-group variation by first transforming the genotype data into principal components (PCs), followed by discriminant analysis (DA) to define the groups. The number of clusters (K) was determined by the find.clusters function with a maximum K set to 10. The lowest BIC (Bayesian Information Criterion) value was obtained with K=4. To avoid model overfitting, the optimal number of PCs was determined via α-optimization to be five (Fig. S2), which was then used in the following DAPC analysis and to define group membership (i.e. individual assignment probabilities to predefined populations).
To estimate if and to what extent natural selection could explain divergence between populations, the pcadapt (v. 3.0.4; Luu et al. 2017) package in R was employed to perform a genome scan to detect markers potentially influenced by selection. This was done for the whole dataset, and separately for the Scandinavian samples only. The method first performs a principal component analysis (PCA) to ascertain the underlying population structure, and thus allows uncertainty of origin and admixed individuals. Based on obtained ‘scree plots’ from the PCA analysis, the optimal value of PCs was four for the whole dataset (Fig. S5a), and two for the Scandinavian dataset (Fig. S6a), and K=4 and 2, respectively, were used for the subsequent pcadapt analyses. SNPs with minor allele frequency below 0.05 were removed from the analysis leaving 32 712 SNPs in the whole and 31 941 SNPs in the Scandinavian datasets. A locus was considered as an outlier if its q-value threshold was below 0.1. Q-value estimation includes FDR and adjust p-values accordingly. Q-values were determined by qvalue package (v. 2.12.0; Storey et al. 2015) in R. Another selection test, BayeScan (v.2.1; Foll and Gaggiotti, 2008) was run for comparison for both datasets. Default parameter setting was used (prior odds 10, samples size 5000, thinning interval 10000, pilot runs 20, pilot run length 5000 and additional burn-in 50000). The decision whether a locus was under selection was based on q values (<0.05 suggests selection). Possible clustering of multiple (>2) selected SNPs along same contigs was checked.
5.5. Validation and population analysis using putatively diagnostic loci
All 94 genotyped samples produced reliable genotypes with more than 70% of the used 173 loci and were included in the analyses. Because these SNPs were picked as candidates for separating Scandinavian populations best, neutrality and thus HWE cannot be assumed. It was nevertheless investigated, together with other basic population parameters using same methods and approaches as given for the genome wide data analysis. Moreover, each locus pair was tested for linkage disequilibrium using the snpStats (v.; 1.32.0; Clayton 2018) package in R and visualized with LDheatmap (v. 0.99-5; Shin et al. 2006).
As for the genomic dataset, population subdivision and individual assignment was inspected with the DAPC package. To evaluate the discriminatory power for the baseline data, an additional Monte-Carlo cross-validation through resampling procedure was run with R package assignPOP (v. 1.1.4; Chen et al. 2018). The resampling scheme contained 50%, 70% and 90% of the individuals from both populations, and top (based on FST) 10%, 25%, 50% or all loci. Each resampling event was repeated 30 times.