Sample collection and SLAF-seq
To ensure the accuracy of the sample source, a total of 135 P. echinogaster samples (Fig. 1) were collected at nine different locations along the coast of China (15 individuals at each location) from October 2017 to December 2017. External morphological identification of the samples was performed mainly by referring to Nakabo [12] and Li et al. [13]. For each fresh P. echinogaster sample, sterile scissors and tweezers were used to obtain back muscle tissues. All muscle tissues were stored in a cryogenic freezer at -80°C for future experiments. The phenol–chloroform method was used to extract the genomic DNA from each sample. Then, 1% agarose gel electrophoresis and an Invitrogen Qubit fluorometer were used to assess the degradation degree and concentration of the genomic DNA. Qualified DNA was submitted to Biomarker Technologies Corporation (Beijing, China) for library construction and sequencing.
According to the genome size and guanine+cytosine (G+C) content of P. argenteus, enzymatic digestion prediction was performed. The analysis software for SLAF enzymatic digestion prediction independently developed by Biomarker Technologies Corporation [20] was used to predict the digestion of the reference genome, and the optimal digestion scheme was selected according to the following principles: (1) the percentage of enzymatic fragments located in the repeat sequences was as low as possible; (2) the enzymatic fragments were distributed as evenly as possible across the genome; (3) the length of each enzymatic fragment was consistent with that of the specific experimental system; and (4) the number of obtained enzymatic fragments (SLAF tags) agreed with the expected number of tags. With the HaeIII restriction enzyme, the genomic DNA of each qualified sample was digested separately. Each obtained digestion fragment (SLAF tag) had a poly (A) tag added to the 3' end, the dual-index sequencing adaptors were attached [21], and this new sequence was amplified by polymerase chain reaction (PCR), purified, and mixed. The target fragments were taken out by cutting the gel. After the samples were qualified by the library, they were sequenced using the Illumina HiSeq 2500 platform. To evaluate the accuracy of enzyme digestion, Nipponbare was selected as the control for sequencing.
SNP detection and screening
The raw data obtained from sequencing were characterized using the dual-index sequencing adaptor. The raw reads of each sample were obtained, and reads with a quality score lower than 30 were excluded. Based on sequence similarity, the remaining high-quality reads of each sample were clustered, and the reads with a similarity greater than 98% were considered to have clustered as a single SLAF tag. A SLAF tag with sequence differences between samples can be defined as a polymorphic SLAF tag. The sequence with the greatest sequencing depth for each SLAF tag was taken as the reference sequence. The Burrows-Wheeler alignment tool was used [22] to align the reads to the SLAF tags, both GATK [23] and SAMTOOLS [24] were used to perform single nucleotide polymorphism (SNP) calling, and the overlapping SNPs in the GATK and SAMTOOLS results were used as the final SNP dataset. The generated SNPs were saved in a variant call format (VCF) file. To ensure the accuracy of subsequent analyses, VCFtools [25] was used to screen SNPs with the following parameters: -MAF 0.01 (minimum allele frequency> 0.01); --max-missing 0.1 (filtering out the genotypes with less than 90% data); --min-meanDP 150 (minimum mean depth of coverage > 150); --min-alleles 2 --max-alleles 2 (only two alleles); --minGQ 98 (quality score> 98); --minQ 30 (retaining the loci with a quality score> 3); --remove-indels (excluding the loci containing indels); and -HWE 0.05 (excluding the loci with P < 0.05 in the Hardy-Weinberg equilibrium test).
Genetic diversity and population genetic structure
To describe the genetic diversity levels of all genome-wide SNPs of the nine P. echinogaster populations, with TASSEL software (version 5.2.31) [26], the nucleotide diversity (π) and Tajima's D value of each SNP in each P. echinogaster population were estimated. With Circos software [27], the π, expected heterozygosity (HE), and Tajima's D of each SNP were visualized.
Using the “populations” module of Stacks software (version 1.34) [28], the genetic diversity levels of the nine populations, including the polymorphic loci, π, observed heterozygosity (HO), HE, and inbreeding coefficient (FIS), were statistically analyzed. Arlequin software (version 3.5.2.2) [29] was used to estimate the pairwise genetic differentiation (FST) of P. echinogaster, and 10,000 permutations were used to analyze the significance of FST.
Based on all SNPs, four methods were used to estimate population structure and individual clustering within a population. (1) PGDSpider software (version 2.0.5.2) [30] was used to convert the VCF file of all SNPs to STR format, and then ADMIXTURE software, based on the maximum likelihood method (version 1.3.0) [31], was used to evaluate ancestors. This software uses a fast numerical optimization algorithm to achieve a fast analysis rate. During analysis, the range of genetic clusters (K) was set to 2-7, and each analysis was repeated 10 times. Finally, based on the cross-validation (CV) error corresponding to the optimal K value, the clustering results were plotted. (2) Using TASSEL software (version 5.2.31) [26], neighbor-joining (NJ) trees of different P. echinogaster populations were constructed to clarify the phylogenetic relationships between all individuals in the P. echinogaster populations. The acquired NJ trees were visualized with iTOL software (https://itol.embl.de/). (3) Using PGDSpider (version 2.0.5.2) [30], the VCF file of all SNPs was converted to STR format. The adegenet package of R software [32] was used to perform principal component analysis (PCA) of all individuals and visualize the populations and interpopulation relationships. (4) NetView P software, which adopts the k-nearest neighbor (KNN) algorithm, was used to visualize the network topology among all individuals to accurately and thoroughly explore the refine population structure of P. echinogaster. Before running NetView P software, the range of KNN values was first set to 1-60, and the optimal KNN value was determined according to the Fast-Greedy, Infomap, and Walktrap algorithms. In other words, the genetic similarity of individuals was determined based on the optimal resolution of the genetic structure, and then the network topology of all individuals was obtained based on the optimal KNN value. (5) Arlequin (version 3.5.2.2) [29] was used for analysis of molecular variance (AMOVA) in order to detect differentiation between groups (FCT) and differentiation between populations within a group (FSC). In this study, the P. echinogaster populations were divided into three groups based on geographic location: Dalian, Tianjin, Qingdao, and Nantong; Zhoushan and Xiamen; and Shantou, Zhuhai, and Zhanjiang.
Analysis of gene flow between populations
PGDSpider (version 2.0.5.2) [30] was used to convert the VCF file of all SNPs to the GENEPOP format, and the gene flow between the nine populations was analyzed using divMigrate-Online (https://popgen.shinyapps.io/divMigrate-online/).
Prediction of the genomic regions and functions of P. echinogaster populations that are under habitat selection
Lositan [33] and BayeScan [34] were used to screen outlier SNPs based on FST. First, Lositan software was used to screen outlier SNPs by comparing their genetic differentiation and distribution of heterozygosity. The parameters in Lositan software were set as follows: 100,000 simulations, confidence interval (CI) = 0.995, false discovery rate (FDR) = 0.05, and the infinite allele model. BayeScan software was then used to screen outlier SNPs by comparing differences in allele frequency between populations. The parameters of BayeScan software were set to the default, and the FDR = 0.05. The outlier SNPs found by Lositan software after five runs and the outlier SNPs obtained by BayeScan software that overlapped were used as the final set of outlier SNPs.
To identify the functions of genomic regions under habitat selection, we first extracted genome sequences containing SNPs under environmental-driven selection. Then, Blast2GO software was used to compare SLAF tags containing outlier SNPs with the NCBI nr and Swiss-Prot protein databases, and the homologous protein sequences with the highest sequence similarity to SLAF tags were obtained (E-values < 1E-5). Using Blast2GO software, the functional properties of the obtained homologous protein sequences were classified based on GO and KEGG databases.