Mussel specimens and ddRAD-Seq Library Preparation
Genomic DNA for this study was derived from DNA extractions of specimens prepared by previous studies 20,21,23. Cyprogenia aberti specimens were collected from rivers in Arkansas, Missouri, and Kansas, which occur within the Ouachita (Ouachita, Caddo, and Saline rivers) and the Ozark (Fall, Spring, St. Francis, Black, Current, Buffalo, and Strawberry rivers) highland regions (Fig. 2, Table S1). C. stegaria specimens were collected from the Eastern Highland Region (Licking, Salt, and Green rivers in Kentucky and the Clinch River in Tennessee).
Samples were collected nondestructively using cytology brushes and stored in the buffer supplied with the Puregene Buccal Cell Kit (Qiagen, Hilden, Germany) and DNA was extracted following the manufacturer’s instructions. Mantle biopsies were conducted by taking ~2mm2 tissue and storing it in 95% ethanol (see Chong et al. 23 for details). DNA was extracted from tissue samples using the DNeasy Blood & Tissue Kits (Qiagen, Hilden, Germany). All Cyprogenia specimens were grouped according to their sampling locations. Sites with small sample sizes were pooled together with the most geographically proximate site. In addition, 14 Dromus dromas specimens were collected from the Clinch River, TN for this study and were used as an outgroup for phylogenetic analysis and for genetic comparison at genus level. A total of 214 samples (200 Cyprogenia sp. +14 Dromus sp.) were used in this study. Of these, 24 samples were randomly chosen as replicates to explore STACKS 58,59 parameters for de novo assembly and SNP discovery (see below).
DNA libraries were prepared with modifications of Peterson et al. 60. Briefly, a total of 200ng of genomic DNA for each sample was normalized into 20ng/ul, which was then digested using high fidelity restriction enzymes Pst1 and Msp1 (New England Biolabs). Resulting fragments were ligated to custom-made P1 containing sample-specific barcodes and forward primer annealing sites and P2 adapters with index and reverse primer annealing sites. Each individual was then PCR-amplified using Phusion Polymerase, and following PCR conditions: 25 cycles at 98 ºC for 20 s, 60 ºC for 30 s and 72 ºC for 40 s, with an initial denaturation step at 98 ºC for 30 s and a final extension step at 72 ºC for 12 min. PCR products were electrophoresed in 1.5% agarose gels and only samples with smeared PCR products were chosen for the next steps. Individually barcoded and indexed samples were then pooled into libraries and size-selected (325 to 500bp) using a Blue Pippin (Sage Science) so that the remaining adapters and unbound primers/primer dimers can be removed. Each library was quantified and quality-checked using an Agilent BioAnalyzer 2100 (Agilent Technologies).
A total of 238 samples (214 samples +24 replicates) were divided into 5 lanes (48 samples per lane, with one lane having 46 samples) and subjected to sequencing. Libraries were single-ended sequenced (100-bp) in separate flow cells on an Illumina HiSeq 3000 (Illumina Inc.) at the DNA Facility of Iowa State University. A simplified flowchart of our methods including ddRAD raw sequence reads, STACKS parameter selection, resultant datasets and data analyses is shown in Fig. S1.
ddRAD locus assembly, STACKS parameter selection, and sample/loci filtering
The STACKS software pipeline, version 1.46 58,59 was used for demultiplexing, quality filtering, de novo locus assembly, and SNP discovery (see Supplementary Information). We evaluated a variety of parameter settings in STACKS according to the recommendations of Mastretta-Yanes et al. 61 to determine which combination balanced an acceptable SNP error rate and the number of loci retained (Details in Supplementary Information). A total of 156 specimens (148 Cyprogenia + 8 Dromus) were selected for phylogenetic and population genetic analyses. In this study, we applied the following STACKS population constraints to the 156 specimens. In STACKS, -p: refers to the minimum number of samples that a locus must be present in for inclusion of the locus in phylogenetic analyses: 1) for the phylogenetic study, only loci that were present in >60% of all samples (i.e. 94 of 156) were retained, 2) for the population genetic study, only loci that were present in 50% of four species-groups (i.e. -p 2, here 2 of 4: C. stegaria_Eastern, C. aberti_Ozark, and C. aberti_Ouachita, and Dromus dromas), according to Chong et al. 23, and in >60% of all samples within each group (i.e. -r 0.6) were recovered. Additionally, concatenated SNP data for all samples were also used for phylogenetic analysis of C. aberti specimens identified by conglutinate egg colors (14 red and 17 brown). Sample information for the 156 specimens used in this study is shown in Table S1. Raw ddRAD-seq data for each specimen are deposited into the NCBI SRA database (BioProject: PRJNA454895) under the accession numbers listed in Table S1.
Population genetic diversity and structure
The STACKS populations program (-p 2, -r 0.6) was used to retain only those loci that were present in at least two of the four species groups (C. stegaria_Eastern, C. aberti_Ozark, C. aberti_Ouachita, and D. dromas) and present in over 60% of all samples within each group. In total, 14,890 RAD loci were obtained and 7,243 SNPs (consisting of only the first SNP per ddRAD-Seq locus) were recovered and used for subsequent population genetic analyses.
Measures of genetic diversity for each of 10 sampling locations (rivers) were calculated using GENODIVE 62. Pairwise FST values to assess levels of genetic differentiation between locations were estimated following standard ANOVA 63 using GENEPOP v.4.7 64. Significance was tested through genic differentiation for each population pair using the exact G test with default Markov chain parameters (Dememorisation: 10000, Batches: 100, Iterations per batch: 5000). Significance level was adjusted for multiple tests using Bonferroni correction.
To explore the effect of missing data in each locus, we compared pairwise FST’s calculated for both raw data and corrected data in which missing data were replaced with randomly drawn alleles based on the overall allele frequencies using GENODIVE 62.
GENODIVE was further used to conduct a hierarchical Analysis of Molecular Variance (AMOVA) 65 to assess the partitioning of genetic variation among highlands, populations within highlands, individuals within populations and within individuals. Two separate AMOVA analyses were conducted (three highland regions vs. two highland regions). The comparison of three highland regions compared the Ozark, Ouachita, and Eastern highlands, which mainly accounts for variance between Cyprogeniaaberti and C. stegaria. Whereas the comparison of two highland regions (Ozark vs Ouachita highlands), account for variance within C. aberti. Analyses were carried out using an Infinite Allele Model (IAM) and 999 permutations to assess its significance, where missing data were replaced by randomly drawn alleles based on overall allele frequencies.
We examined the distributions of individual multilocus genotypes in multivariate space, to assess the geographic structuring among individuals. We carried out a principal component analysis (PCA) based on 7,243 SNP loci across all samples and Cyprogenia samples only using the adegenet package 66 implemented in R 67. A scatter diagram was plotted based on factor scores along the first two PC axes accounting for the most variation. We also conducted a Discriminant Analysis of Principal Components (DAPC) using an interactive web interface for DAPC distributed with the package [using R command: adegenetServer(what = "DAPC")]. DAPC aims to provide an efficient description of genetic clusters using a few synthetic variables that are constructed as linear combinations of the original variables (alleles) which have the largest between-group variance and the smallest within-group variance 68. We used the default parameters except selecting the option of suggested number of PCA components.
To infer the most likely number of genetic clusters, we applied a Bayesian clustering approach implemented in STRUCTURE 2.3.4 69. The admixture ancestry model was used with correlated allele frequencies and no prior on population membership. The initial analysis of the complete dataset was conducted 10 times for each K value (the number of inferred genetic clusters) from 1 to 10 with 200,000 Markov Chain Monte Carlo (MCMC) generations and a burn-in of 100,000. The optimal K value was inferred from ad hoc posterior probability models of [Pr(X|K)] 69 and deltaK 70 using STRUCTURE HARVESTER 71. To assess the presence of substructure within major clusters, subsequent analyses were conducted within each of main highland regions, using the same parameters as described above. A STRUCTURE bar plot to infer the genetic composition of individuals was visualized using DISTRUCT v.1.1 72.
Phylogenetic Analyses
To explore the effect of missing data, we compared data sets generated by four different filtering criteria using the STACKS population program. We varied this criterion in order of strictness (i.e., inclusion of loci shared by fewer to more samples); -p 78, -p 94, -p 109, and -p 125, which only retains loci present in at least 50%, 60%, 70% and 80% of the 156 samples respectively. Using the STACKS population program (--phylip_var) and the D. dromas specimens as an outgroup, these four data sets of concatenated SNPs were used to investigate evolutionary relationships among the156 specimens. These four different filtering criteria respectively generated concatenated SNPs of 15,829, 9,673, 4,674, and 1,283, each with informative sites of 7,440, 4,395, 2,014, and 511. Phylogenetic trees were constructed with Maximum likelihood (ML) analyses in IQtree 1.5.6 73,74 with 1000 ultrafast bootstrap replicates and 10,000 iterations. The TVM+R3 model of sequence evolution was selected following a Bayesian Information Criterion (BIC) model selection implemented in IQtree 1.5.6. ML trees constructed from different filtering criteria produced nearly identical topologies (Fig. S4), therefore we used a data set in which SNPs present in at least 60% of samples (-p 94) for downstream analyses.
We used MrBayes v 3.2 75 to construct phylogenetic trees using Bayesian inference (BI) with the GTR-I-GAMMA model (General Time Reversible model with a proportion of invariable sites and a gamma-shaped distribution of rates across sites), which includes a parameter for site heterogeneity. BI employed four simultaneous Monte Carlo Markov chains (one cold and three heated) with 1,000,000 generations and sampled every 500 generations. The first 25% of the data points were discarded as burn-in. We confirmed that both ML and Bayesian tree searches achieved convergence (ML tree: Bootstrap correlation coefficient of split occurrence frequencies = 0.99, Bayesian tree: PSRFs = 1.000592). The consensus trees from both ML and BI were illustrated using FigTree v 1.4.3 (http://tree.bio.ed.ac.uk/software/Figtree/).
Species tree and species delimitation
Species tree inference and species delimitation in a Bayesian framework using SNP data were investigated using SNAPP 76 and MODEL_SELECTION implemented in BEAST version 2 77. Species Trees were inferred directly from SNP markers by bypassing gene trees in a full coalescent analysis, using a polynomial-time algorithm that computes the likelihood of a species tree directly from the markers under a finite-sites model of mutation effectively integrating over all possible gene trees 76. Out of 148 samples for Cyprogenia, a total of 40 samples for 10 sampling sites (four samples per each sampling site) were aligned and selected based on two sampling criteria, random sampling of individuals or by selecting the individual samples with the highest coverage (smallest number of missing SNPs) using a custom Python script. Coalescent-based species tree was inferred using XML file with following model parameters; with selecting “Cal mutation rates”, unselecting "Include non-polymorphic sites" and selecting "Use Log Likelihood Correction" in Model Parameter, and the chain length of 5,000,000 with storing every 1,000 in MCMC, and the rest of parameters were set to default.
For species delimitation exploration, candidate species delimitation models were explored using marginal likelihood estimation (MLE) of each competing species delimitation model, ranking models by marginal likelihood, and using Bayes factors 78 to assess support for model rankings. For species delimitation using genetic data and coalescent methods, we designed six candidate species delimitation models; Model1: current taxonomy (two Cyprogenia species), Model2: split C. aberti based on highland into two species (C. aberti was split to Ozark (Black, St. Francis, Spring) and Ouachita (Caddo, Ouachita, Saline)), Model3: intermix half of two candidate C. aberti species from each other (Half of Ozark and Ouachita were equally intermixed), Model4: reassign C. aberti (Black from Ozark was moved to Ouachita, and Saline from Ouachita was moved to Ozark), Model5: reassign C. aberti (Saline from Ouachita was moved to Ozark), Model6: reassign C. aberti (Black from Ozark was moved to Ouachita). For current and alternative species delimitation models, we edited the XML file for marginal likelihood estimation, with the path sampling steps of 36, MCMC sample length of 100,000 for each path sampling step, alpha of 0.3, burnInPercentage=10, and preBurnin of 10,000. Our path sampling parameters might be low, and a thorough analysis may require more path sampling steps and higher MCMC. However, given computational time and capacity, and similar results from various parameter sets applied, such parameter sets are reasonable.
Bayes factor (BF) calculations based on marginal likelihood estimation (MLE) of each competing species delimitation model are made against the current taxonomy model (Model 1), subtracting the MLE values for two models, and then multiplying the difference by two (BF = 2 x (Model1 – alternative model). The strength of support from BF comparisons of competing models was evaluated using the framework of Kass and Raftery 78. The BF scale is as follows: 0 < BF < 2 is not worth more than a bare mention, 2 < BF < 6 is positive evidence, 6 < BF < 10 is strong support, and BF > 10 is decisive. All Bayes factor (BF) calculations are made against the current taxonomy model (Model 1). Therefore, positive BF values indicate support for current taxonomy model, and negative BF values indicate support for alternative model.
Trace files generated by SNAPP were inspected and convergence of chain by checking Effective Sample Size (ESS) for various variables was evaluated using the program Tracer (http://tree.bio.ed.ac.uk/software/tracer). Posterior distribution of species trees was summarized to identify the topology with the best posterior support using TreeAnnotator implemented in BEAST version 2 77. The target tree file generated with Maximum clade credibility (MCC) and Median heights option in TreeAnnotator was displayed in FigTree.
Evolutionary process and demographic history
We investigated the demographic history and evolutionary process of each of Cyprogenia sampling locations in the three highland regions using the Approximate Bayesian Computation (ABC) approach implemented in DIYABC 2.1 79. To evaluate the past demographic changes of each of sampling locations, we followed methods designed by Cabrera and Palsbøll 80. Five demographic models are evaluated. Model 1: CON (constant population size), Model 2: DEC (a single instantaneous decrease in population size), Model 3: INC (a single instantaneous increase in population size), Model 4: INCDEC (a single instantaneous increase followed by a single instantaneous decrease in population size), Model 5: DECINC (a single instantaneous decrease followed by a single instantaneous increase in population size). We followed prior model parameterization designed by Cabrera and Palsbøll 80 with the slight modification of generation time (T) and a uniform distribution with a range of 10 to 20,000 (N). We ran 5,000,000 simulated datasets for evaluating past demographic event and 3,000,000 simulated datasets for evaluating evolutionary scenario (C. aberti_Ouachita and C. aberti_Ozark). These prior distributions are reasonable given the species’ presumed census size (100-20,000 mussels) and generation time (5-9 years per generation) 81. The current and ancestral Ne of each of sampling sites was estimated from posterior distribution of parameters from the best fit model, which were estimated with 1% selected datasets of simulated datasets for a selected evolutionary model. The best fit evolutionary scenario for two highland regions was selected from comparisons of six evolutionary scenarios through the ABC computation of the posterior probabilities of each scenario and by their goodness-of-fit to data sets with principal component analysis and posterior checking of summary statistics using the model checking option implemented in DIYABC (see detail in Supplementary information). To minimize effect of missing data in the analyses, we used only loci that were found in all populations using independent parameters of populations program in STACKS for each of three Highlands (-p 3 for 3 populations in Ozark Highland, and -p 3 for 3 populations in Ouachita Highland, and -p 4 for four populations in Eastern Highland). Only polymorphic loci were employed in the ABC analyses. No mutation model parameterization was required for SNPs.
Genetic relationship among specimens identified by conglutinate lure color
The color of conglutinate lures (red and brown colored eggs) of included specimens were identified previously 23, and a total of 31 C. aberti specimens with known conglutinate lure color (14 with red conglutinates and 17 with brown conglutinates) were selected to examine genetic relationships between specimens. Concatenated SNPs obtained from a moderate criterion (-p 94) described earlier were used for phylogenetic analyses. Maximum likelihood (ML) and Bayesian trees were constructed to examine phylogenetic relationships for total of 33 specimens, 31 C. aberti specimens, and two C. stegeria specimens used as outgroups. The TVM+I+G4 model of sequence evolution was selected according to a Bayesian Information Criterion (BIC) implemented in IQtree 1.5.6.
Quality comparison between genetic materials using non-destructive and destructive methods
To investigate the utility of non-destructive methods as source material for genomic library creation, we compared the results obtained from samples collected using cytology brushes and mantle-tissue biopsies. First, we compared the retention rate of samples from each method after the aforementioned filtering steps. Then, we compared the quality and amount of ddRAD sequence reads between genetic samples using both methods. In addition, we estimated and compared differences in SNP error rates between replicated samples from each sample type. Lastly, we mapped and compared ddRAD fastq sequences from all samples to four genome sequences (bivalve, human, yeast and a bacterium) and PhiX sequences using FastQ_Screen (https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/). Genome sequences of a bivalve, Mytilus galloprovincialis, are available from GenBank (accession: LNJA000000000). Genome sequences of human (Homo sapiens), and yeast (Saccharomyces_cerevisiae) are available from ftp://ftp.ensembl.org/pub/current_fasta. Genome sequences of a bacterium (Escherichia coli) are available from GenBank (accession: U00096.3). PhiX sequences are available from Refseq accession NC_001422.1.