Sampling and molecular methods
In total, 180 mature individual worms were haphazardly collected from nine sites (20 per site) across the global geographic range between August 2013 and March 2014, from North Africa at the species’ southern boundary, along the Atlantic Coast, to the Northern Irish Sea in the north, as well as from two sites on the Mediterranean coast (Figure 1). Individuals were placed immediately in RNA-later (Qiagen Inc., Crawley), maintained at room temperature for between 48 hours and one week and then stored at -20°C until DNA extraction.
Genomic DNA was extracted using chaotropic “Chaos” buffer (4M guanidine thiocyanate; 0.5% N-lauroyl sarcosine; 25μM tris(hydroxymethyl)aminomethane (Tris), pH 8.0; 0.1M 2-mercaptoethanol, 0.5% N-lauroyl sarcosine) (modified from Fukami et al. 2004). Worms were homogenised within the buffer at room temperature for at least 48 hours before adding an equal volume of phenol extraction buffer (PEB) and double the volume of phenol:chloroform:isoamyl alcohol (25:24:1). From here the protocol followed a phenol-chloroform extraction with isopropanol precipitation [84]. An additional ammonium acetate (3M) precipitation step was added to further purify DNA. DNA was resuspended in sterile distilled H2O, quality assessed using a NanoDrop 8000 spectrophotometer (Thermo Scientific, Wilmington, DE), visualised on 1% agarose gels, and quantified using the Quant-IT PicoGreen dsDNA Assay Kit (Life Technologies, Carlsbad, California). Single-end Restriction-site Associated DNA (RAD) libraries were prepared by digesting DNA from each individual using the EcoRI and MseI restriction enzymes. Samples were individually barcoded and sequenced (Ecogenics, Zurich-Schlieren, Switzerland) using an Illumina NextSeq 500 (Illumina, San Diego, California).
SNP detection and summary statistics
All analyses were carried out on the Analyses and Bioinformatics for Marine Science (ABiMS) Galaxy Platform at the Station Biologique de Roscoff (galaxy.sb-roscoff.fr; [85]). Samples with mean Phred scores of <25 or read coverage of <150,000 were removed from the dataset and sequences were trimmed to 60bp length using Trimmomatic version 0.36.4 [86]. STACKS version 1.46.0 [87] was used to de novo map the sequences and call SNPs by sequentially running “ustacks”, “cstacks” and “sstacks” using the parameters suggested to be optimal for population genetic inference by [88] namely, allowing a minimum depth of coverage of three (ustacks: m; stack coverage), a maximum of two mismatches between reads for a single individual (ustacks: M), and four mismatches between primary and secondary reads (ustacks: N). These settings have been found to increase the number of loci and reduce the SNP and allele error rates within a dataset [88]. Furthermore, the low value of M used here reduces the risk of paralogous loci being merged into the same SNP [90]. The number of mismatches allowed between loci was set at three (cstacks: n; [90]).The removal or break up of highly repetitive RAD tags was permitted within the program (ustacks: deleverage).
The programme populations from the STACKS suite, was used to process the SNP data [87]. RADtags with at least 10x depth of read coverage were retained to ensure accuracy of heterozygous SNP calls (populations: p; [48]). Lumped paralogs can be a concern in high coverage data sets [91], but 10x coverage is considered low [89] and thus reduces this to a low risk factor. A locus had to be present in 70% of the populations to be included in the analysis, which allows for mutations in restriction sites that may cause loci to dropout in certain lineages [48, 92] and the minor allele frequency within populations (populations: min_maf) was set at > 0.01 [90]. In order to maximise SNP discovery, the percentage of individuals required within a population to process a locus was set at 50% (populations: r). Only a single SNP per RADtag was retained for subsequent analyses to avoid problems of non-independence between markers [10]. The STRUCTURE file output by populations was transformed in PGD Spider version 2.1.0.3 [93] for use in all further analyses. Each population and locus was tested for deviation from Hardy-Weinberg equilibrium and linkage disequilibrium in ARLEQUIN version 3.5.2.2 [94] and significance was assessed following Bonferroni correction for multiple tests. Loci that showed null alleles in multiple populations, or with significant deviation from Hardy–Weinberg equilibrium, were removed from further analyses to reduce the risk of inclusion of paralogous sequence variants [95].
Genomic summary statistics were calculated in ARLEQUIN version 3.5.2.2 [94] as mean number of alleles per locus (N alleles), nucleotide diversity (θ), expected heterozygosity (He) and observed heterozygosity (Ho). Separate analyses of molecular variance (AMOVA) were carried out based on site of origin and genetic clusters identified in STRUCTURE (see below) using 16,000 permutations.
An outlier approach was used to identify loci that had a higher or lower FST than expected under a neutral model of selection (under positive or balancing selection, respectively). Outliers were estimated in two ways: firstly, using BAYESCAN version 2.1 [96], which uses a Bayesian approach to estimate the posterior probability that a locus is affected by selection, and was run using 20 pilot runs of 5,000 iterations each, a total of 1,050,000 iterations (sample size of 100,000 and a thinning interval of 10), and a burn-in of 50,000. Only loci with a posterior probability (P) ≥ 0.95 with a prior odd of 10 were considered as outliers. BAYESCAN has consistently outperformed other outlier detection methods in terms of lack of false positives [4, 17, 97]. Outliers were also calculated in ARLEQUIN using 100,000 simulations, 500 demes, and a maximum expected heterozygosity of 0.5 [98]. Outlier SNPs were identified using a P-value of ≤ 0.05. Only SNPs identified as outliers by both BAYESCAN and ARLEQUIN were conservatively selected as candidate loci under balancing or directional selection. Subsequent analyses of population structure and seascape genomics were conducted separately for neutral and outlier loci.
Population structure
Pairwise FST were calculated in ARLEQUIN version 3.5.2.2 [94] and significance in certainty of the estimator assessed following Bonferroni correction for multiple tests. Straight-line distances between sites were calculated using ArcGIS Version 10.1 (ESRI, 2011) and shortest ocean distances between sites were calculated in R version 3.1.2 [99] using the package marmap [100], where distance was calculated excluding positive elevation [98]. Isolation-by-distance (straight line and shortest ocean) was tested in ARLEQUIN (Version 3.5.2.2) using a Mantel test [101] with 10,000 permutations. To visualise these relationships, neighbor-joining (nj) tree estimation plots were constructed in R package ape 5.3 [102], following the methods outlined in [103], and were constructed using FST and shortest ocean distances for each pairwise (site) combination. Divergence between sites was also assessed using a Principle Component Analysis (PCA) with the R package adegenet 2.1.2 [104], with 30 components and two discriminant functions retained in the analysis. Hierarchical structure was also considered by carrying out a second PCA excluding the most divergent site, as identified in the first PCA.
Genetic clusters (populations) were inferred using Bayesian analyses carried out in STRUCTURE version 2.3.4 [52]. The analyses assumed admixture, correlated allele frequencies, and were run with 100,000 burn-in cycles and 100,000 Markov Chain Monte Carlo runs. The number of populations (K) was considered between 1 and 10, with 10 replicates per K. This was repeated with and without location priors, and using a hierarchical approach to resolve fine-scale genetic structure by removing highly differentiated populations [44]. The most likely value of K was inferred by estimating ΔK using the Evanno method [105], as implemented in Structure Harvester version 0.6.94 [106].
Hydrodynamic models are increasingly used as predictors of larval dispersal (gene flow) in marine systems (e.g. [20, 107, 108] and many others). Probabilistic dispersal simulations of potential larval connectivity were run using the Connectivity Modeling System (CMS, Paris et al. 2013) paired with 3-hrly Hybrid Coordinate Ocean Model (HYCOM) and Navy coupled ocean data assimilation (NCODA) global 1/12° daily outputs from 2004, 2010, and 2012 [110]. These years were selected to represent neutral, negative and positive phases of the North Atlantic Oscillation (NAO) respectively, and are thus likely to reflect average dispersal patterns over the last decade [111, 112]. Vertical velocities were calculated using the Eulerian continuity equation, and a random diffusive kick was added at each hourly dispersal time-step to better capture the potential effects of sub-gridscale turbulent diffusion (horizontal = 15 m2s-1, vertical = 0.05 m2s-1; after Kough et al. 2016). HYCOM is a global model appropriate for large landscape continental-scale dispersal simulations, but does not include tides. Larvae were considered as passive particles even though diel vertical migrations may occur [42]. Simulations can therefore be over-dispersive, as the inclusion of tides and behaviour is liable to be retentive [114].
In the simulations, 100 propagules, each representing an S. alveolata larva, were released in a Lagrangian framework (so that each theoretical larvae was tracked individually), from each of nine locations (Figure 1), and their dispersal under two different release scenarios predicted: (1) a conservative scenario with propagule releases every 6-h over one week in April for each year, centred on the spring tide, and (2) an optimistic scenario which modelled a daily release on every day of the year. The optimistic scenario was designed to simulate all the potential pathways of dispersal if reproduction occurs throughout the year, which has been suggested in S. alveolata [42], thus capturing the full variability of potential larval fates. Larval competency to settle was assumed after 4 weeks, and a maximum PLD of 12 weeks applied based on field and laboratory estimates [42, 49]. Larvae were permitted to settle when located within 5km of a known reef site (Additional file 1). Additional (non-release) sites were also included as reported in The Global Biodiversity Information Facility (GBIF.org) in which larvae could also settle if encountered. Larval exchange counts (i.e. the source-sink contribution) were generated after 5, 8, 10, and 12 weeks PLD. Model outputs were processed in Matlab (Mathworks, Version R2016a). Pairwise matrices of larval exchange counts, standardised using proportions, were generated from larval fates averaged across the three focus years to compare against genetic pairwise FST matrices. Daily positions of larvae throughout the simulation were used to generate cumulative 2-D maps of larval density to highlight the predicted pathways of dispersal and areas of potential settlement
The proportion of settlers reaching each study site from each release site under the four larval duration scenarios (5, 8, 10 and 12-weeks) in the ocean circulation modelling simulations were inverted and used to form a matrix of pairwise dispersal resistances (i.e. the proportion of larvae that did not reach each site). Mantel tests [101] were carried out in ARLEQUIN (Version 3.5.2.2) [94] using 16,000 permutations to assess the correlation between pairwise FST (using putatively neutral loci) and dispersal. A Bonferroni correction was applied to assess the significance of the Mantel tests [115].
Loci under selection
Locally collected temperature data provides insight into the conditions that S. alveolata experience at a spatial and temporal scale that cannot be achieved using satellite data [116] and can be used for comparison with genomic data, to test genotype-environment associations [117]. Temperature data were obtained from Seabra et al. (2015) [118] who used biomimetic temperature loggers placed in the intertidal zone of exposed shores [116, 119]. At five sites, temperature data were collected with a resolution of 0.5°C at mid-intertidal level every hour between July 2010 and July 2014. High- and low-tide temperatures were also identified to the nearest hour. At the remaining four sites (i.e. English Channel, Balearic Sea, Tyrrhenian Sea and North Africa), an iButton (Maxim, Munich) was placed in the mid-littoral zone, collecting temperature data with a resolution of 0.5°C every 3 hours between April 2014 and April 2015. Temperature readings closest to high- and low-tide were identified.
S. alveolata experience tidally-driven fluctuations in water and air temperatures. Sixteen metrics were calculated to describe variation in temperature at each site as follows: mean temperature, maximum temperature, minimum temperature, temperature range (maximum-minimum), standard deviation of the mean, 95th – 5th percentile (to exclude outlier temperatures), average daily temperature range (maximum-minimum per day), and average daily standard distribution (standard deviation of the mean), per site. High-tide (water) and low-tide (air) temperature values were also used separately to calculate mean temperature, maximum temperature, minimum temperature, and temperature range (maximum-minimum) for water and air temperatures, respectively (Table 3). Larvae of S. alveolata can be found in the water column all year round [42], suggesting that they are reproductively active throughout the year, thus temperature data from all months were included in the analyses.
We utilised a gene-environment association software, BAYENV2 [120], to test for covariance between SNP allele frequencies and the 16 thermal variables. BAYENV2 controls for neutral population structure by first estimating a covariance matrix among populations for all loci and then accounting for that covariance in the gene-environment association test [121, 122]. The programme was run with 100,000 iterations for both neutral parameterisation and association testing [123]. Outlier loci were classed as those with a Bayes Factor > 3 [8, 124, 125].