Sampling and DNA extraction
A total of 231 samples distributed throughout North Africa, including 152 tissue samples collected in the field and 79 samples obtained from museum collections, were used in this study (Table S4 and Figure 1). Tissue samples were collected from road-killed and live trapped animals during several field expeditions in North-West Africa or received from collaborators between November 2011 to February 2015 (57,59,60; Table S4). Tissue samples were preserved in 96% ethanol for genetic analyses at the moment of collection. A total of 54 samples were already used in previous studies, for cytb (51 samples) and ʋWF (21 samples) (17,22); Table S4), but their genomic DNA was re-extracted and analysed for all markers used in this study. Additionally, 10 samples of J. orientalis were extracted and included as an outgroup species (Table S4). Extractions of the genomic DNA from tissue samples were performed using EasySpin Kit, following the “Genomic DNA Minipreps Tissue Kit” protocol. Extractions of museum samples were performed in a separate and autonomous facility, under sterile conditions, using the QIAamp® DNA Micro Kit (QIAGEN), following the “Isolation of Total DNA from Nail Clippings and Hair” protocol. Extracted DNA was stored at 20ºC.
DNA amplification and sequencing
One mitochondrial locus (cytochrome b, cytb, 897 bp) and seven nuclear loci were amplified, including two candidate genes for colour morph variation (the complete coding region of the melanocortin 1 receptor, MC1R; and a fragment of the exon 2 of the Agouti gene and part of an intron), one X-linked gene (intron 5 from the developing brain, homeobox gene, DBX, gene) and four autosomal genes (exon 10 from the growth hormone receptor, GHR; exon 1 from alpha-2B adrenergic receptor, ADRA2B; exon 1 from the interstitial retinoid binding protein, IRBP; and exon 28 from the ʋon Willebrand factor, ƲWF), producing a total of 5,369 bp. Partial amplification of the cytb gene (897 bp) was performed for the entire set of samples (231 samples, contemporaneous and museum) using two primer pairs previously designed for Jaculus species (Jac1Fw, Jac1Rv, Jac4Fw, Jac4Rv; 17). The reconstruction of the DNA fragment with the museum samples was done in several steps to produce overlapping sequences in order to obtain the entire fragment. In some cases, only a short fragment (325 bp) of the gene was amplified, which was obtained combining two pairs of primers, Jack4Fw and Jack1Rv (Primers, references and PCR conditions for cytb are described in Table S5). The short fragment was only used to confirm the phylogeny with the long fragment. Nuclear loci and microsatellites were amplified only on samples collected during field work (152 samples; Table S4). PCR products of both mitochondrial and nuclear genes were purified with a commercial kit (Qiagen) and both strands were sequenced on an ABI 3130xl Genetic Analyser (AB Applied Biosystems). For the autosomal genes, sequencing of both strands was performed in an external laboratory (Macrogen Inc.). Additionally, available sequence data for the cytb gene of our target species (164 sequences) was downloaded from GenBank and included in the analyses (Table S6).
Sequence alignment and phylogenetic analyses
Each sequence was verified and manually aligned using SEQSCAPE v2.6 (58). Alignments for each locus were generated with CLUSTAL W (59) implemented in ClustalX v2.0 (60) and edited manually in BIOEDIT v7.1.3 (61) in order to minimize the number of base pairs in the alignment spanned by insertion/deletions (indels). Polymorphic positions for each sequence from nuclear loci were carefully examined to ensure precise and consistent identification of double peaks in heterozygotes. Heterozygous sequences for indels were resolved manually from offset chromatogram peaks, combing the reverse and forward sequences (62). Nuclear haplotypes were inferred using PHASE v2.1 (63,64) with three runs performed for each locus with 10,000 burn-in steps and 10,000 interactions. Input files were created in SEQPHASE (65). Phased heterozygotes holding indels were included in SEQPHASE as “known haplotype pairs”. Haplotypes presenting probability phase calls below 80% were discarded from the analysis to ensure that only reliable haplotypes were used in downstream analyses. The indels observed in the DBX (21 and 42 bp; Figure S5) and in the partial Agouti gene (8 bp) were coded manually and were included in network reconstruction but excluded in further analyses due to their large sizes. Haplotypes for the cytb gene were inferred with DnaSP v5 (66).
Phylogenetic analyses were performed for the cytb locus. The Akaike information criterion (AIC; (67) was used to select the best-fit model of sequence evolution for each locus alignment among the 88 available in the software jModelTest v2.1.4 (71, Table S7). The phylogenetic relationships between haplotypes were inferred by the Maximum-Likelihood (ML) approach in PHYML v3.0 (69) and the Bayesian phylogenetic inference (BI) implemented in MrBayes v3.2.0 (70). ML analyses were performed with 1000 bootstrap pseudo replicates. Bayesian posterior probabilities were assessed from two runs with four chains of 1 million generations for the nuclear genes 50 million generations for cytb, with a sampling frequency that provided a total of 10,000 samples for each run, discarding 25% of them as burn-in. Tracer v1.5 (71) was applied to evaluate the convergence of the ESS values (effective sample size) for each analysis. Resulting trees were drawn with FIGTREE v1.3.1 (72).
Haplotype networks were generated for each nuclear gene individually using parsimony calculations in TCS v1.21 (73) considering gaps as a fifth state. Each indel of the DBX5 and Agouti locus was considered as a single mutational step, regardless of the corresponding size (Figure 2). Analyses were performed for each locus with a connection limit of 95%. DBX locus presented disconnected haplotypes and so networks were redrawn with the connection limit fixed at 90% in order to link the more unrelated groups and see the number of mutational steps among them. Networks were edited using tcsBU (74). The cytb haplotype network was performed with the R packages “pegas” (75) and “ape” (76).
Species delimitation and species tree inference
Alignments were first tested for the presence of within-locus recombination in SPLITSTREE v4.13.1 (77), and the ones found to be significant were further analysed with IMgc (78) to reduce the dataset to the largest non-recombinant blocks. Moreover, in order to validate the assignment of individuals to the two previously described mitochondrial lineages (16,17,20–22), the program Bayesian Phylogenetics and Phylogeography (BP&P) v3.1 was used to assess the status of species delimitation. Our analyses included the mtDNA and the seven single copy nuclear DNA. Due to the large sample size of our dataset, only 30 individuals, chosen randomly, were analysed for each lineage on each locus. The same outgroup sequences of J. orientalis were used for this analysis. Population size parameters (θ) and divergence time at the root of the species tree (τ) were estimated with the gamma prior G(2, 1000), while the Dirichlet prior was assigned to all other divergence time parameters. We used “algorithm 0” with the fine-tune parameter set to default. Each species delimitation model was assigned equal prior probability. For the MCMC, samples were collected for 1,000,000 generations, with a sampling interval of 2 and a burn-in of 10%. Each analysis was run 3 times to confirm consistency among runs.
The same dataset was also used to infer the species tree by applying the multispecies coalescent model that is implemented in *BEAST (40), part of the BEAST v2.3.0 package (79). The input file was produced with the application BEAUti v2.3.0, also included in the BEAST package. Preliminary analyses were carried out to evaluate which clock-like evolution model best fits the data by comparing a relaxed with a strict molecular clock. Based on these trial runs the final analysis was accomplished with an uncorrelated lognormal relaxed clock, using the HKY+I+G substitution model for cytb. Analyses of the nuclear loci were carried out with an HKY (+I for ƲWF, ADRA2B, IRBP, MC1R and Agouti) substitution model under a strict molecular clock (Table S5).
Times of divergence were estimated using cytb as the reference gene. A fossil-based calibration of substitution rates was not possible due to the poor fossil record of Jaculus in North Africa. Similarly, the well-known calibration point Muridae-Rodentia was not used due to the likely saturation effect associated with the ancientness of the divergence between Muridae and Dipodidae. Instead, we used the average cytb substitution rate estimated for rodent species (0.176 substitutions/site/Myr; 83). Following these assumptions, the prior of the relaxed clock standard deviation was set to a normal distribution with a mean of 0.176 with sigma fixed at 0.05. This mutation rate was used in all subsequent analyses. The coalescent constant population size was used as tree prior and all the remaining priors were set to defaults. Three independent runs of 500 million generations were implemented, sampling trees and parameter estimators every 50,000 generations for all loci. The convergence of the runs was verified after the removal of a 10% burn-in using TRACER v1.5. Visual inspection of trace plots indicated a good sampling of all parameters for each *BEAST independent runs, with effective population sizes (ESS) above 1000, suggesting a good convergence of all parameters. The results from all runs were combined with LogCombiner v2.3.0, and the subsequent maximum clade credibility summary trees with posterior probabilities for each node were generated with TreeAnnotater v2.3.0 from the BEAST package. All the trees were visualized and edited with FIGTREE v1.3.1.
Isolation-with-Migration analyses
Species tree inferences performed with *BEAST incorporate the uncertainty associated with the coalescent process while estimating the phylogeny. However, it does not assume the possibility of occurrence of gene flow after the initial split. Thus, models of isolation-with-migration (IM) (27) implemented in the IMa2 software (24–26) were applied to infer whether gene flow has occurred between the two putative species. This method estimates the multi-locus effective population sizes (for present and ancestral populations), divergence times and migration rates under a model of isolation with migration (IM) (25,27). Analyses were performed with the mtDNA and the seven single copy nuclear DNA, and by considering the two Jaculus species as populations. After several preliminary runs, two independent runs with different starting seeds were performed by sampling 200,000 genealogies per locus with 10% burn-in. Chain convergence was assessed by inspecting ESS values (ESS > 500) and by checking trend plots to verify whether each parameter had a normal distribution. We used a geometric model with the first heating term (ha) set to 1.05 and the second (hb) to 0.95 sampling through 80 chains (hn). Priors for population size, migration rates and splitting times were set to 15, 0.5, and 15, respectively, after assessing the convergence of runs in preliminary analyses. The HKY mutation model was applied to all loci and the same substitution rate as in *BEAST was applied to cytb (here scaled by the length of the locus [897 bp]: 1.96e-04, ranging from 1.40e-04 to 2.52e-04) was used in order to obtain the results in demographic units, considering 1 year of generation time (80). Moreover, the log-likelihood ratio test (LLR) described by Nielsen and Wakeley (27) was used to assess whether migration rates were significantly different from zero, sampling over 400,000 trees, as implemented in the Load-Genealogy mode (L-mode) of IMa2.
Population genetics and demographic analyses
Total (Dxy) and net (Da) divergences between lineages were calculated using p-distance parameter in MEGA v5.1. Additionally, the divergence among several related rodent species, based on published data, was inferred for comparison analysis (28,29,38,30–37). Standard deviations for these divergences were estimated from 10,000 bootstrap replications. Nucleotide diversity (π), theta computed from the number of segregating sites ( ), and haplotype diversity (Hd) were calculated per lineage for each locus analysed. Three test statistics, Tajima’s D (81), Fu’s Fs (82) and R2 (83) were performed to investigate deviations from its neutral expectations, which could imply recent population expansion and/or signs of selection. Significance was evaluated through 10,000 coalescent simulations. These statistics were assessed per locus for each lineage in DnaSP v5. Calculations were made separately for the entire data set and for the non-recombinant portions obtained with IMgc.
The dynamics of effective population sizes through time of the two lineages of Jaculus sp. were inferred with Extended Bayesian Skyline Plots (EBSP; 87), using a linear model in BEAST v2.3.0 and inputted through BEAUti v2.3.0. The same non-recombinant dataset used for species tree inference was analysed. The evolutionary models for each locus of each lineage were estimated in jModelTest v2.1.4, which resulted in similar models to the ones previously obtained (Table S7). After preliminary analyses the evolutionary rates of the mitochondrial and nuclear loci were set to a strict molecular clock. The prior for the mean distribution of population sizes was optimized according to the population sizes estimated in preliminary runs, where different population size models were compared (Gamma, uniform, and exponential distributions) by comparing the ESS values, and was set with a coalescent prior and a constant population size (84). Remaining priors were set as default. The MCMC parameters were the same as applied in *BEAST analysis. TRACER v1.5 was used to assess the convergence of the independent runs. Results of the separate runs were combined with LogCombiner v2.3.0, part of the BEAST package, after discarding 10% burn-in.
Microsatellite selection and optimization
Since there were no specific microsatellite markers available for Jaculus spp. or closely related species, a microsatellite library was developed through high-throughput genomic sequencing (454 pyrosequencing) at GenoScreen (http://www.genoscreen.fr/en/) using J. jaculus individuals from distinct regions in North Africa. Detailed description of the optimization procedure can be found in supplementary material (supplementary information S1). After optimization we used two multiplexes amplifying seven and four markers each, as well as two additional loci that had to be amplified individually in separate PCR reactions (Table S8).
Microsatellite genotyping
A total of 148 contemporary samples were genotyped for 13 microsatellite loci. Multiplex and individual reactions, primer concentrations and amplification conditions are summarized in supplementary information S1. Allele data were obtained using GENEMAPPER v4.0 (Applied Biosystems 2006). Sizing bin windows were created manually and the automated scoring was checked by three independent observers to minimize genotyping errors. In order to assure consistency of results, 30% of the dataset was repeatedly genotyped in three independent runs. Inconsistent genotypes (~2% of all genotypes) were considered as missing data.
Microsatellite analysis
As the sampling was continuous across the distribution and it is hard to delimit populations, these analyses were performed considering the two Jaculus species as two different populations. MICROCHECKER v2.2.3 (85) was used to assess the presence of genotyping errors due to null alleles and allele dropout. Linkage disequilibrium (LD) and deviations from Hardy-Weinberg Equilibrium (HWE) were estimated with GENEPOP on the Web (genepop.curtin.edu.au). The significance of the analysis were inferred according to the Bonferroni correction (0.05/[number of populations*number of loci]), and confirmed with three independent runs. Loci presenting significant deviations from HWE and from LD assumptions and with large amount of missing data (above 40%) were discarded from further analyses. Measures of genetic diversity and differentiation, such as allele frequencies, mean number of alleles sampled per locus and population and the corresponding allelic richness, observed (Ho) and expected (He) heterozygosity, and F-statistics were estimated with FSTAT v1.2 (86). Individual-by-individual genetic distances that were used to compute a Principle Coordinate Analyses (PCoA) were calculated with GENALEX v6.0 (87). The number of clusters and the quantification of admixture between lineages were inferred with the Bayesian Clustering software STRUCTURE v2.3.3 (88). Analyses were accomplished by applying the admixture model with correlated allele frequencies. The software was run for the number of clusters (K) between 1 and 10 with 5 replicates of 1,000,000 MCMC iterations for each K value, following a burn-in period of 100,000 steps. Three independent analyses were performed to ensure similar posterior probabilities between runs. STRUCTURE HARVESTER v0.6.92 (39) was used to determine the probability of each K value. The most likely number of clusters (populations) was assessed using the mean values of likelihood [L(K)] and Delta K (89).
Niche overlap
Resemblance of ecological niches between species was tested: for overlap using Schoener’s D Index (which ranges from 0, no overlap; to 1, totally overlap), for niche equivalency (i.e. whether the niche overlap is constant when randomly reallocating the occurrences of both entities among the two ranges), and for niche similarity (i.e. whether the environmental niches are more similar than what expected by chance; 90). The PCA-environment ordination approach developed by Broennimann et al. (91) was used for analyses. Tests were performed for two regions and scales, for the entire North Africa at ~5x5 km scale and for North-West Africa only (i.e. Mauritania and southern Morocco) at ~1x1 km scale, over two types of background data, composed by: (1) topo-climatic, including two topographic, altitude and slope, and 19 bioclimatic variables; and (2) habitat variables, including six Euclidian distances to habitat categories. Altitude and the 19 bioclimatic variables were downloaded from WorldClim (www.worldclim.org/bioclim). Slope was derived a digital elevation model using the “slope” function from ArcGIS (ESRI 2011). Four of the habitat variables were constructed from land-cover categories for the years 2004–2006, which are likely descriptors of species natural habitats and showed a reasonable spatial representation in both study areas (i.e. sparse vegetation, bare, rocky and sandy areas). The remaining two habitat variables were constructed from spatial representation of water features (secondary rivers and rock pools) which were digitized from the cartographic maps (92). Distance to these six habitat categories was computed using the “Euclidian distance” function from ArcGIS. For the North African region, a total of 125 records for J. jaculus and 122 records for J. hirtipes were included, after reducing spatial clustering by removing records located at lower than ~10 km distance from each other using the “occ.desaggragation” function (91). For the North-West region, a total of 59 records for J. jaculus and 97 J. hirtipes were retained, using ~1 km as distance threshold to remove records and reduce spatial clustering. In both scales, the background area was delimited accordingly to a minimum convex polygon.