Summary
To investigate adaptive selection to climate and the potential for evolutionary rescue in K. calcarata across Amazonia, we used genomic data derived from RAD-sequencing. After controlling for neutral genomic variation, we identified distinct genotypes adapted to different parts of the climate gradient. Using environmental niche modelling, we predicted distributions of each genotype until 2100, considering deforestation as a potential barrier to dispersal and a cause of habitat loss. Finally, we estimated the potential for evolutionary rescue of each genotype by analyzing the overlap between the areas with range loss of one genotype and the areas of permanence or expansion of the other genotypes.
Genetic sampling, processing and population structure
We sampled 112 individuals of all Kentropyx species within the known range of K. calcarata in Amazonia, the Atlantic Forest, and the Cerrado savannas (information on specimens, museums, and localities are available in Supporting Information, Table S1).
We extracted genomic DNA from each sample (liver and muscles) using the Macherey-Nagel® Mini Kit with a high salinity protocol with proteinase K. We then visualized fragment sizes in agarose gels and measured DNA concentration and quality with Qubit™ 3.0 Fluorometer and Nanodrop, selecting only samples with a low degree of DNA fragmentation and normalizing the concentration to 20 ± 2 ng DNA for each 50 µl solution (TE buffer). RAD-sequence library preparation as detailed in 65 was then performed by Floragenex, Inc. (Eugene, OR, USA). Individual samples were digested with the SbfI enzyme, linked with barcoded RAD ligators, multiplexed, sonicated, and then size-selected to a range within 300–500 bp. After PCR amplification, DNA sequencing was performed on a 2x100bp Illumina HiSeq platform. This procedure resulted in ~ 8,400,000 reads per individual with a length of 100 base pairs. We deposited the demultiplexed raw-sequencing data in the Sequence Read Archive (details to be provided upon acceptance).
We demultiplexed and assigned the reads for each sample and performed a de novo assembly and SNP calling using iPyrad 0.9.55 37, checking the quality of demultiplexed samples using MultiQC 66. For the iPyrad pipeline, we allowed one mismatch from individual barcodes, clustering the reads of each sample (minimum size of 35 base pairs) and across the samples (de novo assembly), sequence coverage of 6x, and minimum Phred score of 33. We selected a minimum clustering threshold of 0.90 after testing values from 0.85 to 0.95 for missingness and number of recovered SNPs 67. We allowed a maximum of two alleles and a maximum proportion of 0.5 for heterozygous sites per locus, including only loci present in 75% of the individuals.
From the pipeline above, we obtained 139,205 SNPs used in a principal component analysis (PCA) in iPyrad to identify samples clustered with other K. calcarata samples or with the remaining species. To check whether these samples would be similarly grouped regardless of the approach, we additionally inferred a species tree using SVDquartets 68 with the tetrads function of iPyrad in 100 bootstraps. tetrads uses a multi-species coalescent method from quartet trees inferred from unlinked SNPs. After these steps, 66 samples across 33 localities with 348,109 SNPs (~ 20% missing sites) were consistently assigned to K. calcarata using both methodologies and used in the following steps (Fig. S1). We then used VCFtools v. 0.1.16 69 to filter SNPs with Minor Allele Frequency (MAF) higher than 0.05 70. We also selected only one SNP per RAD stack (cluster of loci derived from a RAD de novo assembly) to decrease the probability of sampling linked SNPs, thus, minimizing linkage disequilibrium scripts provided by 58. We kept only SNPs with less than 20% of missing data, obtaining a final dataset of 30,589 SNPs for downstream analyses of population structure and genome-environment association.
We inferred the number of genetic clusters (k), population structure, and degree of admixture for the 66 K. calcarata individuals using sparse nonnegative matrix factorization (sNMF) 38. We calculated least-squares estimates of ancestry proportions for k ancestral populations varying from one to 10 and evaluated model fit with an entropy criterion based on 100 cross-validation repetitions. Next, we chose the value of k with the lowest entropy, with a 5% tolerance error 38. Additionally, we produced a PCA from the same SNP data and used a scree plot to identify the number of significant components. These analyses were implemented in the R-package LEA 2.0 71.
Genome-environment analyses
For the genome-environment association and niche modeling (ENM) analyses, we downloaded 19 bioclimatic variables from the CHELSA v2.139. We used data for the current climate and projections for 2040, 2070 and 2100 based on the Coupled Model Intercomparison Project Phase 6 CMIP6 scenarios 72,73. We considered Shared Socioeconomic Pathways (SSPs) corresponding to the highest emission scenario (SSP 5-8.5, comparable to RCP8.5 from CIMP5) and a moderate (middle-range) scenario SSP3-7.0. We did not include optimistic scenarios in our analyses as they underestimate emissions since 2000 42,43. For each emission scenario, we used all five Global Circulation Models (GCMs) available for download from the CHELSA database (GFDL-ESM4, IPSL-CM6A-LR, MPI-ESM1-2-HR, MRI-ESM2-0 and UKESM1-0-LL), which were selected following the ISIMIP3b bias adjustment fact sheet 74.
We selected the following bioclimatic variables for the subsequent analyses: Bio4 (Temperature Seasonality), Bio5 (Max Temperature of Warmest Month), Bio15 (Precipitation Seasonality) and Bio18 (Precipitation of Warmest Quarter). First, we selected bioclimatic variables representing temperature and precipitation extremes and climate seasonality instead of annual averages. Such extremes temperatures are more strongly associated with local extinctions that have already occurred due to climate change worldwide 52. We included only variables with considerable variation across the sampled localities, which after visual inspection of the environmental gradients resulted in 10 out of the 19 variables. To minimize multicollinearity among predictors, we calculated their Variation Inflation Factors (VIFs) with the usdm R package 75, keeping only variables with VIF lower than 5.
For mapping the current and future estimates of forest cover and land use in the study area, we used a high-resolution (1 km2) global database for the years 2010, 2050, and 2100 derived from Li et al. 45, which are based on remote sensing data, GCMs, future land use simulations (human development predictions), and dispersal constraints (cellular automata). We used the available moderate (AB1) and extreme (A2) scenarios, which have previously been used to model dispersal and range loss in South America 17 and closely match patterns of habitat loss in Amazonia. We transformed the five available landscape classes (forest, grassland, farmland, urban, and barren) into the two relevant classes for our study: forest (habitat of K. calcarata) and non-forest (all remaining categories - Supplementary Material, Fig. S10), the last considered as barriers for organisms’ movement in the downstream analyses. These non-forest areas also include some of the largest Amazon Basin rivers as well (when wider than 1 km – the resolution of the layer). The forest/barrier layer for the present time was used for delimiting current ranges by cropping the ENMs predictions for current climates (see below). We used the future projections to assess corridors and barriers for dispersal in range projections until 2100.
We carried out genome-environment association analyses to identify signatures of climate-driven genetic variation based on associations between allele frequencies and local climates 16, using the four selected bioclimatic variables, the filtered SNPs, and accounting for the underlying neutral genetic structure (sNMF and PCA, as described above). We employed two approaches to identify candidate SNPs with a signature of climate-driven genetic variation, selecting those recovered in both methods for downstream analyses 15,76. First, we used redundancy analysis (RDA) for detecting outlier loci, that is, candidate SNPs under climate selection. RDA enables the simultaneous detection of candidate SNPs across multiple environmental predictors, showing how groups of SNPs covary in the multivariate space, even for weak associations from each locus 16. To control for the neutral genetic structure, we used as covariates the first three axes of the PCA (> 80% of the variation) calculated from the K. calcarata SNPs (Forester et al. 2018a). We followed Capblancq & Forester 77, estimating Mahanolabis distances from the RDA scores (number of axes explaining at least 80% of the variation), which we then corrected for the inflation factor and transformed into p-values. Loci with P-values lower than 0.01 divided by the number of SNPs (3.3 x 10− 7 - Bonferroni correction) were considered candidate SNPs.
The second genome-environment association method we implemented was a latent factor mixed model with regularized least-squares using ridge penalty LFMM 2 –40. LFMM screens each SNP for signatures of local adaptation using the neutral genetic structure (k value derived from sNMF) as latent unobserved variables 78. Unlike RDA, LFMM is implemented against only one environmental variable each time. Therefore, we ran LFMM for all SNPs against the first three PCA axes, which explained > 80% of the variation of the same bioclimatic variables used for RDA 79. We performed additional runs with k-1 and k + 1, selecting the run with the lowest genomic inflation factor (λ). We then corrected the p-values by the genomic inflation factor while ensuring a uniform distribution of P-values along the interval from 0–1, with a peak of values close to 0 (candidate loci), visualized through histograms 80. As for the RDA procedure, we considered loci with p-values inferior to 0.01 divided by the number of SNPs as candidate SNPs.
Climate-adapted genotypes, modeling ranges and evolutionary potential
Each genome-environment association method may detect a set of candidate SNPs that do not entirely match, potentially leading to incongruences when allele mapping frequencies 70. We used the most conservative approach of retaining for the downstream analyses only the SNPs selected in both RDA and LFMM 76. However, we also checked for the spatial congruence of the allele frequencies for all possible sets of candidate SNPs recovered (RDA only, LFMM only, RDA-LFMM intersection, or RDA-LFMM union), highlighting any potential differences in downstream analyses.
We reran the RDA including only each set of candidate SNPs to maximize the spread of the adaptive genetic variation across the environmental space (adaptively enriched RDA) 77. To aid in visualizing of the degree of spatial congruence among the distinct sets of candidate SNPs, we predicted the genetic similarity for every pixel in the study area using the enriched adaptively RDA and environmental predictors 77,81.
To classify individuals as potentially adapted to distinct portions of the environmental space, hereafter climate-adapted genotypes, we used a k-means classification procedure as suggested in Carvalho et al. 82 on the adaptively enriched RDA scores (weighted by RDA axis importance) of each set of candidate SNPs. To select the best number of k-means clusters, we used 30 indexes included at the R package NbClust 83, selecting the value indicated by the majority of the indexes. From this classification, we used ENMs constrained by dispersal barriers (non-forested areas, see below) to predict current and future ranges of each climate-adapted genotype and compared these results to the predicted range of the species using all records.
Considering the potentially small number of localities representing each class of climate-adapted genotypes, we complemented the geographical distribution of each class with verified occurrence records for the species retrieved from Sheu et al., 34. We used the predicted values of genetic similarity for every pixel described above to classify the occurrence records into the respective climate-adapted genotypes. Only records in which the predicted values of genetic similarity felt within the observed values for each genotype were included. We then filtered these additional records according to the geographical proximity to the observed climate-adapted genotypes (i.e., sampled individuals with genetic information), with increasingly larger buffers for each genotype up to the maximum extension in which the buffers do not overlap.
To model the distribution of the species and each climate-adapted genotype we used three different classes of algorithms for an ensemble forecasting 41, including an envelope method (DOMAIN), logistic regression (GLM), and maximum entropy (Maxent)84–86. We performed ten replications for each combination of variables for each method, using 10-fold cross-validation to evaluate the results. To make the results comparable we kept the same proportion of randomly sampled pseudo-absences for each modeled class (2x the number of unique occurrences) 87. We obtained the final ensemble models using a True Skill Statistics (TSS) weighting 88. Finally, we projected the models into the current climate conditions and averaged the predictions for the five distinct GCMs for the SSP 3–7.0 (moderate/middle-range) and SSP 5-8.5 (extreme) emission scenarios for 2040, 2070 and 2100. To calculate the current ranges (binary maps) of each climate-adapted class from the ENM outputs, we applied a threshold of the minimum training presence calculated considering only the records of individuals with genomic data confirmed for each genotype class, allowing to retrieve the minimal conditions for the presence of each genotype. We also noticed that commonly used thresholds such as maximum sensitivity + maximum specificity (max se + sp), although maximizing evaluation metrics such as max TSS and max Kappa 89, tended to crop out even areas with the presence of samples with genomic data. Finally, we cropped the resulting binary predictions for the present by the current forest cover.
We used dispersal simulations to predict whether each class of climate-adapted individuals would be able to track shifts in suitable climates and habitats until 2100, become extinct (low suitability or forest loss), or persist in each location. For this, we used a cellular automata model of dispersal based on kernel densities and barrier constraints, as implemented in the R package MigClim44. We ran MigClim until 2100 with intermediate steps in 2040 and 2070, using the ENMs predictions of habitat suitability for these years to indicate the potential to dispersal into unoccupied cells, where values lower than the minimum training presence were considered unsuitable (e.g., the site is considered unoccupied or extinct). We used the 2050 (for the above-mentioned intermediate steps in 2040 and 2070) and 2100 projections of land cover as strong barrier constraints, that is, dispersal was not enabled between two diagonally adjacent non-forest barrier cells. Although no estimates of dispersal for K. calcarata exists, general estimates of range shifts due to climate change across multiple taxonomy groups range from a few meters to ~ 2 km per year 90. Therefore, we used maximum distance steps of 30 cells as the maximum distance possible to move between timesteps on each modeled interval (current-2040, 2040–2070 and 2070–2100), which is equivalent to ~ 1 km per year.
The results from the ENM-based dispersal simulations were then used to indicate which parts of the ranges of distinct climate-adapted genotypes will remain stable until 2070 and 2100 (climatically suitable and no deforestation), which new areas will become available and reachable through dispersal, and the range loss (climatically unsuitable or deforested). The potential for evolutionary rescue was estimated as the areas where the predicted distribution of one of the genotypes in 2100 overlaps with areas where the other genotype has lost its range (Fig. 1).