Sampling, library preparation and sequencing
A total of 50 tiger shark individuals (Galeocerdo cuvier) from both the Indo-Pacific (IP) and the Atlantic Ocean (AO) were sampled off (from west to east) Brazil (BRA), Reunion Island (RUN), North Coast of Australia (AUSN; North Territory), East Cost of Australia (AUSE; Sunshine Coast), Coral Sea (COR) and New Caledonia (NCA). Sharks were grouped into six populations based on their sampling site (Fig. 1; Table 1). Total genomic DNA was extracted from muscle tissue or fin clips preserved in 96% ethanol using QIAGEN DNeasy Blood and Tissue kit (Qiagen, Hilden, Germany) according to the manufacturer's protocols. Double-digest restriction-associated DNA (ddRAD) libraries were prepared following (46) using EcoRI and MspI restriction enzymes, a 400-bp size selection, and a combination of two indexes and 24 barcodes to pool 48 individuals per lane. The genomic libraries obtained were sequenced with a HiSeq 2500 Illumina sequencer (single-end, 125 bp).
Rad-seq de novo assembly
Raw reads were first demultiplexed and quality filtered through the process_radtags.pl pipeline in stacks v.2.5 (47). In the absence of a reference genome of G. cuvier or of closely related species, RAD-seq loci (125 bp sequences) were de novo assembled under the denovo_map.pl pipeline in stacks. Preliminary results based on parameters m = 3 (minimum read depth to create a stack), M = 3 (number of mismatches allowed between loci within individuals), and n = 3 (number of mismatches allowed between loci within catalogue) found an average depth of ~ 10x (see Results). Despite the absence of a clear cut-off indicating an acceptable coverage value above which genotype calling may be considered reliable, simulation results suggest that ~ 10x may produce inconsistent calling under different algorithm (48). To prevent this, we used on a genotype free estimation of allele frequencies implemented in the software angsd v.0.923 (Analysis of Next Generation Sequencing Data; (30)), which has been proven more efficient for low to medium coverage next-generation sequencing (NGS) data than SNPs calling algorithms (30). We describe below the bioinformatics steps required to apply angsd to Rad-seq data from a non-model organism and the downstream population genetic analyses applied to the filtered datasets.
Assembly pipeline and filtering
angsd requires a reference sequence to work, which we were lacking. To circumvent this issue, we followed the approach described in (49) by creating an artificial reference sequence from loci previously assembled by stacks under the parameter m = 3, N = 3, M = 3 (based on the results of Mona, Bertorelle, Benazzo, in preparation). To this end, we concatenated the consensus sequences of each locus spaced by a stretch of Ns and then map reads back from individual fastq files using the bwa-mem algorithm with default parameters (50). Using custom bash scripts coupled with angsd, we then discarded: (i) sites with coverage < 3x (-minIndDepth = 3, corresponding to m in the first assembly performed by stacks) and/or of low quality (based on the per base alignment score, -baq = 1 flag); (ii) low quality bases and poorly aligned reads (-minQ and -minMapQ and -C flags with default values); (iii) SNPs present in the last 5 bp of each locus and SNPs genotyped as heterozygous in 80% or more of the individuals; (iv) loci with more than 5 SNPs that might be the result of paralog RAD loci alignment on the reference. Specific filters were further added according to the downstream analyses performed.
Population structure
A single reference sequence was created for all populations and we retained sites shared by at least 80% of the samples. The PCA was computed with PCAngsd v.0.97 based on genotype likelihood (51). Admixture was then investigated by running the non-negative matrix factorization algorithm (nmf) implemented in PCAngsd which is based on the same covariance matrix inferred for the PCA. The number of ancestral populations (K) was automatically chosen by PCAngsd to be e + 1, where e is the optimal number of significant principal components depicting population structure, resulting from the Velicier’s minimum average partial test run on the covariance matrix. The sparseness regularization parameter α (used to reduce the noise in low depth NGS data) that best fitted the data was tested between 0 and 100 and it was chosen by comparing the resulting likelihood following (51). We generated pairwise site allele frequency likelihood files and then computed FST with the realSFS program in angsd (52) using SNPs with a minor allele frequency ≥ 0.05 (-minMaf flag). The significance of each pairwise FST comparison was evaluated with 1,000 permutations by randomly allocating individuals to one of the two populations. We finally tested isolation by distance (IBD) using a Mantel test (53) and plotted the relationship between genetic vs. geographic distances.
We applied an approximate Bayesian computation (ABC) approach similar to previous studies (7, 31, 32) in all sampling sites to further investigate the presence of population structure. This approach is particularly helpful in the Atlantic Ocean (AO) where only one locality was sampled (Brazil; BRA population). Briefly, we designed three demographic models (Figure S3): (1) NS (No Structure) which represents a panmictic population where \({N}_{mod}\), the modern effective size instantly changes to \({N}_{anc}\), the ancestral effective size, at \({T}_{s}\)(time shift) generations; (2) FIM (Finite Island Meta-population) which represents a finite island meta-population model composed of 100 demes exchanging symmetrically \(Nm\) migrants per generation with each other. All demes were instantaneously colonised, \({T}_{col}\) generations ago, from an ancestral population of size \({N}_{anc}.\) (3) SS (Stepping-Stone) which represents a stepping-stone model where the 100 demes are arranged in a two-dimensional grid and where migration is only allowed symmetrically in both directions between the four nearest neighbouring demes. We performed 50,000 coalescent simulations under each model using fastsimcoal v.2.6.0.3 (54) extracting parameters from the priors distributions displayed in Table S1. Model selection was evaluated by the random forest classification method implemented in the abcRF package in R (55). We used the SFS, θπ and TD as summary statistics and further added the first two axes of the Linear Discriminant Analysis in the dataset as suggested by (55) to increase the classification method accuracy. The number of trees was chosen by checking the evolution of the out-of-bag error.
Genetic diversity and effective population size variation
We created one reference sequence per population in order to maximise the number of loci assembled. We filtered the sites with missing data by setting the -minInd flag in angsd to the total number of individuals in each population. The filtered dataset was then used to generate a site allele frequency likelihood (saf) file, where genotype likelihoods were computed using the SAMtools method (-GL = 1 flag). The folded site frequency spectrum (SFS) was directly computed from the filtered saf datasets through the realSFS program (52). Nucleotide diversity (θπ), Watterson’s theta based on segregating sites (θw;(56)) and Tajima’s D (TD; (57)) were computed with custom script from the SFS. Significance of TD was evaluated after 1,000 coalescent simulations of a constant population model with size θw. We reconstructed the variation in the effective population size (Ne) through time by running the stairwayplot v.0.2 software (58), where the composite likelihood is evaluated as the difference between the observed (folded) SFS and its expectation under a specific demographic history.
Population divergence and migration rate estimation
Based on the results of the previous analyses, we devised five alternative Isolation/Migration (IM) models of divergence between IP and AO regions using the composite likelihood method implemented in fastsimcoal (54). We presented in Fig. 3 the model richest in parameters, the remaining four representing simplified versions nested within it. Hereafter, a brief description of the five models going from the most complex to the simplest: (a) IM-full: the two ocean regions with their respective modern effective population sizes, \({N}_{mo{d}_{IP}}\) and \({N}_{mo{d}_{AO}}\), diverged at \({T}_{div}\) from an ancestral population of effective population size \({N}_{anc}\). Due to the stairwayplot results, we allowed the two modern effective population sizes \({N}_{mo{d}_{IP}}\) and \({N}_{mo{d}_{AO}}\) to change to \({N}_{an{c}_{IP}}\) and \({N}_{an{c}_{AO}}\) following an exponential dynamic in \({T}_{{s}_{IP}}\) and \({T}_{{s}_{AO}}\) years respectively. Migration is defined by two time periods: \({m}_{1}\) representing the migration rate occurring between time 0 until \({T}_{mig}\) and \({m}_{2}\) between time \({T}_{mig}\) until \({T}_{div}\). The migration matrix in each time period is asymmetric: for instance, \({m}_{{1}_{AO/IP}}\) represents the forward migration rate from AO to IP and \({m}_{{1}_{IP/AO}}\) from IP to AO. In summary, the model is defined by thirteen parameters: five effective population sizes, four migration rates and four historical events; (b) IM-anc: same as IM-full with ancestral migration only between \({T}_{mig}\) and \({T}_{div}\). The model is defined by eleven parameters, the two \({m}_{1}\) migration rates being removed; (c) IM-rec: same as IM-anc, but with recent migration only occurring between time 0 until \({T}_{mig}\), keeping only the two \({m}_{1}\) migration rates; (d) IM-bsc: the classic model where migration is constant from time 0 until \({T}_{div}\) (i.e. \({\text{m}}_{1}={\text{m}}_{2}=\text{m}\) (28)). We modelled the variation in effective size of the two regions similarly to the other models, for a total of ten parameters; (e) IM-div: a pure divergence model with no migration. This is defined by eight parameters: the five effective population sizes and three historical events (\({T}_{div},{T}_{{s}_{IP}}\), \({T}_{{s}_{AO}}\) ). The analyses are based on the folded 2D-SFS computed by angsd between six individuals from Brazil (representing the AO) and six from the Indo-Pacific (IP). This sample size was chosen to obtain a balanced design and to maximise the number of SNPs shared among the two ocean basins. Similarly, in each basin, we selected the individuals presenting the smaller proportion of missing data to further increase the number of joint SNPs. To maximize the observed 2D-SFS we applied the following options in fastsimcoal: -N 300,000 (number of coalescent simulations), -L 40 (number of expectation-maximization (EM) cycles), and -C 10 (minimum observed SFS entry count considered for parameter estimation). For all model parameters we used wide search ranges with uniform distributions (Table 2). We ran each model 100 times in order to determine the maximum likelihood parameters and to perform model selection using the Akaike’s information criterion comparing the best run of each model (54). To check the robustness of the model selection procedure and to take into account the presence of linked sites in our dataset, we further examined the likelihood distribution obtained based on 100 expected 2D-SFS simulated under the parameters estimated in the best run of each model, each approximated with 106 coalescent simulations. This procedure is needed to take into account the variance in the likelihood estimation given our dataset: if the distributions obtained by the various models do overlap, the difference in the estimated likelihoods of our models is not significant (59). Finally, we determined the confidence interval of the parameter estimated under the best run of our best model by parametric bootstrapping. The 2D-SFS was bootstrapped 100 times using fastsimcoal and each of these datasets was analysed under the same conditions as the original data (one hundred independent runs for each dataset). Calibrating the molecular clock is crucial to obtain accurate estimates of demographic parameters and historical events, but it is challenging when fossil records and/or orthologous loci from an outgroup are lacking. Here, all demographic inferences were performed using the Rad-seq mutation rate of µ = 1.93 x 10−8 per site and per generation previously used for the tiger shark (7), and the generation time was set to 10 years (24, 60).