Microhaplotype Panel Development
SNP discovery and amplicon design
Double-digest RAD sequencing:
To discover suitable genomic targets for development of microhaplotype markers, we used a modified double-digest restriction site-associated DNA sequencing approach (ddRAD-seq; Peterson et al., 2012) on 32 individuals from 10 populations of O. mykiss and one population sample of coastal cutthroat trout, O. clarkii clarkii (Ascertainment samples; Supp. Table 1). Following double restriction enzyme digest using EcoR1 and Sph1, a five base pair barcode was ligated to each sample before pooling for size selection that targeted fragments in the range of 300–400bp using a Pippin Prep (Sage Science). The samples were sequenced in one run on a Miseq instrument (Illumina inc. Shen et al., 2005) using a 600-cycle paired-end sequencing kit.
ddRAD demultiplexing:
Raw reads obtained from the Illumina sequencing run were pre-filtered based on their average Phred-scaled base quality score (≥ 33). Using the process_radtags component of Stacks v1.48 (Catchen et al., 2013), reads were truncated to 325bp and then demultiplexed based on the unique 5bp individual barcode to assign reads to their corresponding individuals. Finally, Stacks was used to assemble reads into loci within and across individuals and populations with a minimum depth of coverage (-m) of four, the distance allowed between stacks of two (M), and distance between catalog loci of two (n).
Loci filtering:
Among the 29,024 potential loci, a total of 5,959 had more than two SNPs that were observed in at least 10 or more of the 32 individuals sequenced. Since our target amplicon insert length in the final panel was 100–105 bases, we selected only the loci that had two or more SNPs within 100–105 bases of each other and had enough non-variable bases on either side those variants to attempt primer design. This left us with 3,049 potential loci. Given the short lengths of the targeted sequences and the whole genome duplication event in the common ancestor of salmonid fishes (Berthelot et al., 2014; Lien et al., 2016), the risk of obtaining amplicons containing fully or even partially paralogous genomic regions was very high (Pearse et al., 2019). Consensus sequences from the 3,049 candidate loci were mapped to themselves using BLAT (Kent 2002). A strict filtering was applied using R (R core team 2022), which removed 2,218 ‘duplicate’ targets that fully or partially matched another target locus. These ‘duplicates’ reflect mainly bioinformatics errors (Stacks errantly splitting reads into separate loci) and/or repetitive elements or paralogous regions in the genome. With only unique genomic regions represented in our filtered dataset, BLAT was used again to map our 831 remaining targets to an existing chromosome-scale genome assembly (Pearse et al., 2019) and using the same strict filtering, produced a list of 385 potential targets for the design of microhaplotype markers. Finally, we used the graphical interface of Stacks to assess overall variability, the potential for primer design, and the number of potential population-specific alleles of our loci in the 32 sequenced individuals and used these criteria to select loci for primer design.
Primer design:
We designed primers for 192 variable loci with the software Primer 3 (Untergasser et al., 2012) implemented in Geneious v.R11 (Kearse et al., 2012), using the Santa Lucia (1998) melting temperature (Tm) calculation and salt correction method. The length range of primers was 18–27 bp (target length of 20 bp) and contained between 25 and 50% GC bases (optimal content of 50%), allowing a max Tm difference of 2°C between primers and otherwise using Primer 3 default parameters. We targeted an optimal product size of 130 bp (in the range of 90–145 bp), because short and uniform lengths of target sequences are important factors for PCR success (i.e., uniform amplification among loci). Following initial testing and evaluation, loci with poor amplification or low polymorphism were removed from the panel resulting in the final list of 124 presumably neutral loci (Supp. Table 2).
Adaptive genetic variation and sex informative loci:
In addition to the newly discovered loci from the ddRAD data, markers for several previously identified functional gene regions were added to the panel. First, five microhaplotype markers were designed within the chromosomal inversion complex present on chromosome Omy05, known to be strongly associated with expression of anadromous or resident migratory life-history phenotypes in some O. mykiss populations (Pearse et al., 2014, 2019). Second, we designed primers for microhaplotype loci targeted on five SNPs associated with run-timing in the Greb1L gene region on Omy28 (Waples et al. 2022). Finally, the ‘Omy-Y1-2Sexy’ locus was included by using the primers from Brunelli, Steele, and Thorgaard (2010). This marker amplifies only when the Y chromosome is present (i.e., in males) and has been shown to be highly accurate in identifying males and females in coastal California steelhead (Rundio et al., 2012; Pearse et al., 2019).
Genotyping-by-sequencing
To be able to conduct ‘genotyping in thousands’ (GT-seq), Campbell, Harmon, and Narum (2015) developed a genotyping by sequencing (GBS) method to optimize the sequencing capacity of NGS technologies for population genetics and parentage studies. We used GT-seq to sequence up to 384 individuals with 135 microhaplotype loci in a single Illumina MiSeq® run, using a 150-cycle paired-end approach. All other details of the thermal cycling and library preparation are as in Baetscher et al. (2018).
Bioinformatics processing and panel finalization
Reads were de-multiplexed by the MiSeq Analysis Software (Illumina inc. Shen et al., 2005). Paired-end reads were combined using the Fast Length Adjustment of SHort reads (FLASH Magoč and Salzberg 2011) and mapped to the Stacks consensus sequences for the target loci using the Burrows-Wheeler Aligner (BWA-MEM, Li and Durbin 2009). Mapped reads were converted from Sequence Alignment/Map (SAM) files to Binary Alignment/Map (BAM) files with SAMtools (Li et al., 2009). We identified variable sites using FreeBayes (Garrison and Marth 2012); the positions of all SNPs for each locus were recorded in a variant call format (VCF) file. This VCF was then passed to microhaplot (Ng et al., DOI: 10.5281/zenodo.820110), a microhaplotype dedicated software implemented as an R (R core team 2022) package and associated Shiny app (http://shiny.rstudio.com/), which assembled the SNPs for each amplicon into a microhaplotype using the SAM files. After filtering loci for a minimum read depth coverage of 10 per individual and an allelic balance ratio of 0.3, the software was then used to export the microhaplotypes for downstream analyses.
Because the inference of genetic sex with the ‘Omy-Y1-2Sexy’ locus is based on non-amplification in females, the expected read depth for females is zero. However, due to individual barcode misidentification and other genotyping errors, some reads could possibly be incorrectly assigned to females. Thus, based on the observed distribution of read counts the threshold for the inference of female sex was set to a maximum of 5 reads.
Four sequencing runs were performed, with from one to four plates of 96 individuals at the same time. After each run, the variability across loci was assessed, and loci were filtered according to the following criteria: read depth, inconsistent allelic balance across individuals, deviation from Hardy-Weinberg equilibrium (HWE), and the presence of more than two haplotypes per individual, likely due to paralog loci or index sequencing errors (Larsson et al., 2017). Finally, primers associated with extremely high read-depth loci were diluted or removed, in order to limit their over-representation in the sequencing pools. The final panel was composed of 124 loci for parentage and population genetic analysis, 10 adaptive loci in the Omy05 and greb1L regions, and one locus to identify genetic sex in our individuals.
Phylogeographic Utility
In addition to the individuals genotyped for the panel development, samples from 28 additional populations were genotyped along with additional samples for some populations that were already included. This study is based on 124 loci genotyped in a total of 1,831 tissue samples from 58 populations (Supp. Table 1).
Genetic diversity estimates (i.e., He, Ho, Na, Fis, pairwise Fst) were calculated for each population and across all populations using the MStoolkit (Park, 2008) as well as the R package ‘diveRsity’ (R core team 2022; Keenan et al., 2013), and Allelic Richness (AR) was calculated using the software HPrare v1.1 (Kalinowski 2005). Genetic distances between populations were also analyzed with phylogenetic trees from the software PHYLIP (Felsenstein, 1993). A neighbour-joining tree (Saitou & Nei, 1987), representing genetic distance between populations, was calculated with PHYLIP using pairwise chord distances (Cavalli-Sforza & Edwards, 1967). The stability of the tree topology was examined using the Seqboot program, with 1,000 bootstrap replicates. Discriminant analysis of principal components (DAPC) was performed with the R package ‘adegenet’ (Jombart, Devillard & Balloux, 2010). The number of axes that maximize the results of the DAPC was defined by a “DAPC Cross-Validation” test (Jombart & Collins, 2015), and 150 PCA components were kept for the population panel, for each DAPC. In order to observe core patterns more clearly, each of the multivariate analyses had clearly differentiated populations removed for the next DAPC. Finally, individual-based ancestry evaluations were also implemented using the model-based clustering program STRUCTURE (Pritchard et al., 2000). Values of K were evaluated as follows: K = 2–10, 15, 20, 25, and 30. STRUCTURE output was then analyzed with CLUMPAK (Kopelman et al., 2015) and DISTRUCT (Rosenberg, 2004).