Study Species
Fragaria vesca L., Rosaceae, is an herbaceous perennial species with wide geographic distribution (Europe, northern Asia, North America, and northern Africa)56. It reproduces both clonally through stolons and sexually through seeds, although its sexual reproduction is very rare in natural conditions52.
Plant collection and growth
We selected 21 natural populations of F. vesca from three European countries, Italy, Czechia and Norway (see Table S6 for geographic locations and climatic characteristics of the selected populations) between May and July 2018. We chose these countries as they represented the southern limit (Italy), the core (Czechia) and the northern limit (Norway) of the native range of F. vesca distribution in Europe. To increase the environmental difference among the populations’ sites, we sampled the populations following a climatic (mostly corresponding to altitudinal) gradient within each country.
For each population, we collected mature, fully developed leaves of 4 individuals directly from the field conditions (n = 84), we dried them in silica gel and used them for whole genome bisulfite sequencing (WGBS) analysis, see later. We then dug up the same ramets plus additional three (n = 147) and planted them individually following a random block design in 70 × 40 × 20 cm pots filled with a commercial mixture of compost and sand located in the common garden of the Institute of Botany of the Czech Academy of Sciences in Průhonice, Czechia (49.994°N, 14.566°E) one to ten days after their collection (see Table S6 for the climatic characteristics of the common garden). Plants were grown under a shading coverage reducing 50% of the light to simulate natural light levels at most of the localities. We let the plants propagate clonally for one year. Then, we selected the biggest offspring ramet of at least the third generation from every clone and we collected mature, fully developed leaf samples and froze them immediately in liquid nitrogen. These samples were later used for WGBS (n = 147). From a subset of 3 plants per population (n = 63), we also collected mature leaf samples for RNA-sequencing in the same way as samples for WGBS.
WGBS library preparation and sequencing
We extracted genomic DNA from individual plants using the Qiagen DNeasy Plant Mini Kit, following the manufacturer’s instructions with minor modifications. We prepared libraries for WGBS using the NEBNext Ultra II DNA Library Prep Kit and EZ-96 DNA Methylation-Gold MagPrep (ZYMO), and we sequenced paired-end reads on HiSeq X Ten (Illumina, San Diego, CA), using a sequencing coverage per sample of 30x. See Appendix I for more information on the DNA extraction and library preparation protocols.
All the sequencing data (WGBS and RNA-sequencing) can be found in the European Nucleotide Archive (ENA, www.ebi.ac.uk/ena/), under the project PRJEB51609.
Methylation and DMR calling
We used the EpiDiverse WGBS pipeline for bisulfite reads mapping and methylation calling (https://github.com/EpiDiverse/wgbs), which was specifically designed for non-model plant species (i.e. species that have not been extensively studied)57. The pipeline performed quality control (FastQC), base quality and adaptor trimming (cutadapt), bisulfite-aware mapping (erne-bs5) and non-conversion rate calculation, duplicates detection (Picard MarkDuplicates), alignment statistics and methylation calling (Methyldackel). In the mapping step, we used the most recent version of the genome of F. vesca v4.0.a258,59. On average, the sequencing produced 97,142,389 reads per sample (see Table S7 for detailed information), of which 96% mapped successfully to the genome after retaining only uniquely-mapping reads. We calculated the bisulfite non-conversion rate using the chloroplast genome, which is naturally unmethylated60, and we found an average bisulfite non-conversion rate among samples of 0.38% (see Table S7). We obtained individual bedGraph files of methylated positions for each sample and sequence context.
For PCA and RDA analyses, we then combined the individual bedGraph files from both field and garden conditions in a multisample unionbed file using custom scripts and bedtools61. In order to compare the field with the garden conditions, we used only the samples for which we had WGBS data for both conditions (n = 84 per condition). We retained all the cytosines having coverage ≥ 5 in at least 80% of the samples (total methylated positions: 1,644,729, 2,574,494 and 12,335,916, respectively for CG, CHG and CHH). We then performed PCA on Hellinger-transformed methylated positions, and using custom scripts with the R function prcomp in the stats package (v3.5.1)62, and colored the PCAs using either country of origin, mean temperature or precipitation averaged over 7 years before the sampling year (2011-2018). To test for significance of the differences among countries, growth conditions, climatic conditions of origin of the plants, we performed RDA with the RDA function in the vegan package (v2.6.4)63. For DNA methylation, we performed four separate RDA analyses, using in all of them Hellinger-transformed methylated positions as independent variable and in 1) country of origin as predictor and growth conditions (field, garden) as a covariate to account for the effect of growth condition on the plants’ methylomes; 2) growth conditions as predictor and country as a covariate; 3-4) mean temperature or precipitation averaged over 7 years before the sampling year (2011-2018) as predictors and growth conditions as a covariate. We tested the statistical significance of the RDA analyses using a permutation test with 499 permutations.
We called DMRs using the EpiDiverse DMR pipeline (https://github.com/EpiDiverse/dmr)57 and using the DMR caller metilene with default parameters64. We used populations as groups, and we called DMRs separately for all the pairwise comparisons between the populations from the field, and the populations from the garden (i.e. we never compared a field population with a garden population). We used as input individual bedGraph files filtered for cytosine coverage ≥ 5. Separately for field and garden conditions, we then combined the output bed files in a multisample bed file using custom scripts and bedtools (v2.27.1)61, and we merged the overlapping DMRs obtained from all the pairwise comparisons with bedtools. We used only the samples for which we had WGBS data for both conditions (n = 84 per condition). We obtained 82,546 CG-DMRs, 49,459 CHG-DMRs and 211,363 CHH-DMRs for field, and 71,856 CG-DMRs, 37,795 CHG-DMRs and 138,807 CHH-DMRs for garden.
To assess the number of DMRs overlapping with gene promoters, gene bodies and TEs, we intersected these bed files with gene and TE annotations. For genes, we used the gene annotations v4.0.a2 downloaded from the Genome Database for Rosaceae (GDR) (https://www.rosaceae.org/species/fragaria_vesca/genome_v4.0.a2)65, while for TEs we used an annotation carried out using the EDTA annotation pipeline v1.9.666 on the substituted genome using default parameters, kindly provided by67.
SNP calling
We inferred Single Nucleotide Polymorphisms (SNPs) from WGBS data using the EpiDiverse SNP pipeline with default parameters (https://github.com/epidiverse/snp)57,68. For the DMR variance decomposition analysis, separately for field and garden conditions, we then combined the output individual VCF files into multisample VCF files using BCFtools (v1.9)69. As above, we used only the samples for which we had WGBS data for both conditions (n = 84 per condition). Using VCFtools (v0.1.16)69, we filtered the variants successfully genotyped in 80% of individuals, with a minimum quality score of 30 and a minimum mean depth of 3.
For PCA and RDA analyses, we combined the individual VCF files from both field and garden (n = 84 per condition) into a multisample VCF file and performed the same filtering steps as above. We also filtered for Minor Allele Frequency (MAF) ≥ 0.05, and pruned for Linkage Disequilibrium (LD) with an LD threshold (r2) of 0.2 for SNP pairs in a sliding window of 50 SNPs, sliding by 5. After filtering, we were able to retain 76 669 SNPs. We plotted the PCAs with custom scripts in R, using Hellinger-transformed SNPs. We performed RDA analysis similar to methylation (see above), and using Hellinger-transformed SNPs as independent variables.
DMR variance decomposition analysis and GO enrichment
To assess the amount of methylation variance explained by cis-variants, trans-variants and climatic variation, we ran three mixed models for each individual DMR for both field and garden conditions, as in16. Briefly, for cis-variants, we used an Isolation-By-State (IBS) matrix generated with PLINK (v1.90b6.12) (http://pngu.mgh.harvard.edu/purcell/plink/)70 using variants within 50kb from the DMR middle point. For trans-variants, we used an IBS matrix obtained from variants filtered for Minor Allele Frequency (MAF) ≥ 0.01 and pruned for Linkage Disequilibrium (LD) with an LD threshold (r2) of 0.8 for SNP pairs in a sliding window of 50 SNPs, sliding by 5. For climatic variation, for both field and garden conditions, we calculated a Euclidean distance matrix between climatic data from all the field sites, which we reversed and normalized to obtain a similarity matrix in a 0 to 1 range. The climatic data included mean, maximum and minimum temperature, and precipitation, all averaged over 7 years before the sampling year (2011-2018). We sourced climatic data at a resolution of 0.1 × 0.1° (v20.0e) from the European gridded dataset E-OBS, available through the C3S Climate Data Store (CDS) website (https://cds.climate.copernicus.eu/cdsapp#!/home)71. For methylation variants, we use the merged DMRs obtained from all the pairwise comparisons with bedtools, separately for field and garden (see above). As above, we used only the samples for which we had WGBS data for both conditions (n = 84 per condition). We extracted average methylation of the resulting DMRs from all the samples with the function regionCounts from the R package methylKit (v1.16.1)72, using a minimum cytosine coverage of 5.
We plotted circos plots using the R package circlize (v0.4.9)73. We performed the correlation analysis between number of cis-, trans-, climate-, unexplained-DMRs and number of genes and TEs using the Pearson correlation method. For this analysis, we calculated the DMR, gene and TE counts assigned to 1-kb genomic bins, and performed the correlation between DMR count and gene or TE count. We then ran a GO enrichment analysis for cis-, trans-, climate- and unexplained-predicted DMRs, separately for each sequence context and for field and garden conditions. We extracted DMR-related gene promoters with bedtools, and performed a GO enrichment analysis using the R package clusterProfiler (v3.18.1)74 with an FDR-adjusted P value < 0.05.
Genome-wide association (GWA) analysis
To assess the putative genetic basis of the climate-predicted DMRs, we ran GWA analysis for the garden conditions, including all the available samples (7 plants per population, n = 147) to increase the statistical power of the analysis. We ran GWA analysis as described in16, using the R package rrBLUP (4.6.1)75. For genetic variants, we imputed the missing genotype calls with BEAGLE 5.2 (Browning et al., 2018) and filtered for MAF > 0.04. After filtering, we were able to retain 83,095 SNPs. We corrected for population structure using an IBS matrix obtained from variants filtered for MAF ≥ 0.01 and pruned for Linkage Disequilibrium (LD) with an LD threshold (r2) of 0.8 for SNP pairs in a sliding window of 50 SNPs, sliding by 5. We used individual average DMR-promoter methylation for each sequence context as phenotype, calculated with the regionCounts methylKit function (v1.16.1)72, using a minimum cytosine coverage of 5. We performed GWA for all the climate-predicted DMRs overlapping promoters, and we used the Bonferroni correction method to select the threshold P value. We then excluded the climate-predicted DMRs overlapping promoters with significant GWA peaks from the analysis testing for correlation between DMR methylation and gene expression.
RNA-sequencing and correlation of climate-predicted DMRs with gene expression
We collected mature leaf samples from 3 randomly selected plants per population from garden condition (total samples: n = 63), and we snap-froze them in liquid nitrogen. We extracted mRNA using the Nucleospin RNA Plus kit (Macherey Nagel), following the manufacturer’s instructions with minor modifications. To improve RNA quality and yield from F. vesca, we used an increased amount of lysis buffer (500 µl) together with 100 µl of EDTA (0.5 M, pH=8) and PVPP (polyvinylpolypyrrolidone). The cDNA library and sequencing (PE150, 6 Gb per sample of raw data) were performed by Novogene Co., Ltd, Cambridge, using an Illumina NovaSeq 6000 platform. On average, we obtained 22,252,025 raw reads. We trimmed adaptors with cutadapt (v1.16) and assessed sequencing quality with MultiQC (v1.10.1)76. We aligned the reads to the Fragaria vesca genome (v4.0.a2) using STAR (Spliced Transcripts Alignment to a Reference) (v2.7.1a)77, assembled them into transcripts and quantified using StringTie (v2.1.5)78.
For PCA and RDA analyses, we normalized the read counts with the R function DESeq in the DESeq2 package (v1.30.1)79 and applied a variance stabilizing transformation with the function vst from the same package. We performed PCA with custom scripts using the function plotPCA in the DESeq2 package using Hellinger-transformed read counts, and RDA with the RDA function in the vegan package, using Hellinger-transformed read counts as independent variable and country of origin as predictor. As for methylation and SNP data, we tested the statistical significance of the RDA analyses using a permutation test with 499 permutations.
For the correlation analysis between methylation and gene expression, for each sample, we normalized the Fragments Per Kilobase of transcript per Million mapped reads (FPKM) values and extracted the genes adjacent to the directly environmentally induced DMR-related promoters. For each of these DMRs, we combined DMR-promoter methylation, expression of the adjacent gene, sample ID and gene ID in the same file. We removed lines containing missing values in the gene expression data, and performed correlation analysis with the remaining genes (572, 856 and 3,955 genes in the CG, CHG and CHH contexts, respectively). We performed Spearman correlation between methylation and expression of each individual gene, and selected only the genes with a statistically significant correlation (P < 0.05). We considered these correlations statistically significant, as they were twofold more than the ones expected by chance (at P-value = 0.05, corresponding to 29, 43 and 198 random significant correlations in the CG, CHG and CHH contexts, respectively). We plotted heatmaps using the R function pheatmap in the pheatmap package (v1.0.12)80. For methylation, we used the methylation values of the climate-predicted DMRs presenting statistically significant correlation with gene expression, and for expression we used the normalized FPKM values of those genes.