Species and Study Sites
Arabidopsis halleri subsp. gemmifera is a diploid (2n = 16), self-incompatible, and perennial montane plant (Al-shehbaz and O’kane 2002; Kolnik and Marhold 2006). In Japan, this plant is distributed in a wide range of altitudinal and latitudinal gradients. Its leaves are glabrous or sparsely hairy, but the ecotypes in the highland areas in Mt. Ibuki and Mt. Fujiwara in central Japan have dense trichomes on the leaves and stems (Fig. 1a, b). A previous genome-wide association analysis suggested that the two highland ecotypes at Mt. Ibuki and Mt. Fujiwara evolved independently from each other (Kubota et al. 2015), though they have similar morphological characteristics.
We used plants growing in Mt. Ibuki, where the highland habitats are characterized by relatively low vegetation heights, bright environment near the ground, and heavy snow in winter, whereas the lowland habitats are characterized by dark forest floor and relatively mild winter weather (Honjo and Kudo 2019). We harvested the leaf samples from an individual plant at 48 positions along the altitude of Mt. Ibuki in 2007 and 2008 (Table S1). The lowest and highest sampling sites were at 359 m and 1,317 m above the sea level, respectively, and their horizontal distance was approximately 2.8 km (Fig. 1c). To avoid the sampling of same clones, the sampling positions were at least 3 m apart from each other. The mean interval of the altitude and horizontal distance between the sampled plants was 20.4 m and 59.6 m, respectively. We also used genome information reported in Kubota et al. (2015), which was obtained from A. halleri plants growing at altitude of 380, 600, 1,000 and 1,250 m (five plants per site) in 2009 and 2010.
DNA Extraction, Individual-based Sequencing, and Data Processing
We extracted the genomic DNA from the dried leaf samples of collected 48 individuals using the DNeasy Plant Kit (QIAGEN). Thereafter, we prepared the DNA libraries using the TruSeq Nano DNA Low Throughput Library Prep Kit (Illumina). We generated reads using the Illumina HiSeq X Ten, and obtained 270 GB of data from the 48 samples. These raw read sequences are available in the DNA Data Bank of Japan Sequenced Read Archive under the accession number DRA010696. Furthermore, we added previously posed sequence reads of 20 individuals collected at the altitudes of 380, 600, 1,000 and 1,250 m on Mt. Ibuki (Kubota et al. 2015). We trimmed the low-quality reads (more than half of the nucleotides with quality score less than 30) using FASTX-toolkit v0.0.14 (http://hannonlab.cshl.edu/fastx_toolkit). After the trimming, we mapped the reads of total 68 individuals to the reference genome of A. halleri (Briskine et al. 2017) for each individual by the alignment algorism, BWA-MEM v0.6.2 (Li 2013) with the default parameters. We removed data of 6 individuals due to their low coverage depth and eventually used 62 individuals. We removed PCR duplication by samtools v0.1.8 (Li et al. 2009) rmdup. Using bcftools mpileup (v0.1.8) (Li 2011), we employed SNP calling for total 62 individuals at once. Using bcftools filter, we treated genotypes of individuals whose coverage depth (DP) were lower than 6 or whose genotype quality (GQ) were lower than 20 as missing data. After SNP calling, we removed INDEL variant loci and trimmed loci whose minor allele frequencies (MAF) were lower than 5% or whose missing rates were higher than 10% by vcftools v0.1.15 (Danecek et al. 2011). Biallelic SNPs were used for our analysis.
Population Structure Analysis
To estimate the population structure of 62 individuals that were distributed continuously along the altitudes at Mt. Ibuki, we employed genetic clustering analysis with ADMIXTURE ver. 1.3.0 (Alexander et al. 2009). For the population structure analysis, we removed the loci that were in the linkage disequilibrium with each other by plink –indep-pairwise 50 10 0.1 (plink v1.90b4) (Chang et al. 2015). For each value of K (the number of subpopulations) ranging 1 to 5, we performed independent runs. To determine the optimal number of subpopulations for the 62 individuals, we calculated the cross-validation (CV) error for each K value. The CV error would be minimized when the number of K was best or appropriate for the data. Based on this result, we divided the individuals into highland, intermediate, and lowland subpopulations.
Detecting altitude-dependent genomic region
We sought the genomic regions accumulated in the intermediate subpopulation in relation to the Scenario 2 and 3 using an FST approach. For whole-genome, we calculated the window-averaged FST values within 10 kilobase pair (kbp) windows that included one or more SNP sites. Using vcftools --weir-fst-pop, we calculated the window-averaged FST between three combinations of subpopulations, highland and lowland, highland and intermediate, intermediate and lowland (FST_HL, FST_HI and FST_IL, respectively) in each 10 kbp window. To find the windows that were differentiated depending on the altitude, we identified windows that had higher FST_HL values (included in the higher 1% quantile) as the altitude-dependent windows (ADW). Then, among ADWs, windows that had extremely high FST_IL (included in the higher 1%) and low FST_HI (included in the lower 70%) were identified, which were considered to match the Scenario 2H (Table 1). Similarly, among ADWs, windows that had extremely high FST_HI and low FST_IL were also identified, which were considered to match the Scenario 2L (Table 1). We also selected windows that had lower FST_HL values. Among them, windows that had higher FST_IL and FST_HI were identified, which were considered to match the Scenario 3 (Table 1). We did not identify windows that are consistent with the Scenario 1 and that are included in the regions A and B in Fig. 2, because we were interested in the genes under the selective pressure in the intermediate subpopulation.
Heterozygote Frequency Analysis for Altitude-Dependent windows
We sought genome regions showing extremely high or low heterozygote frequency in the intermediate subpopulation in relation to the Scenario 1b and 1c. We assumed that the genotype frequency of a neutral SNP is in the Hardy–Weinberg equilibrium and that of SNP under the selective pressure is deviated from the equilibrium. Therefore, for each bi-allelic SNP, an expected heterozygote frequency (HEXP) was calculated by a minor allele frequency (p) and the Hardy–Weinberg equilibrium.
HEXP = 2 × p × (1 − p)
We calculated the average of HEXP and HOBS (observed heterozygote frequency in the intermediate subpopulation) of all SNPs located in each 10 kbp window (WA_HEXP and WA_HOBS, respectively). In this calculation, we used the same genome windows as those used in the above-mentioned FST approach. Then, the differences between the WA_HOBS and WA_HEXP in the intermediate subpopulation, ΔH, was calculated for each 10 kbp window.
ΔH = (WA_HOBS – WA_HEXP)
ΔH changes between –1.0 and +1.0 and is higher if the heterozygote frequency is large. We defined the Homozygote- and Heterozygote-accumulated windows that have ΔH value smaller and larger than 1% of total windows in the intermediate subpopulation, respectively. Then, the Homozygote- and Heterozygote-accumulated windows that are also the altitude-dependent window (ADW) were defined as Homozygote- and Heterozygote-accumulated ADWs, respectively. The Homozygote-accumulated ADWs were considered to match the Scenario 1b or 2. The Heterozygote-accumulated ADWs were considered to match the Scenario 1c.
Homozygote frequency in the intermediate subpopulation is expected to be high in both Scenarios 1b and 2. To find genes that are specifically consistent with the Scenario 1b rather than Scenario 2, we further assessed the average of the allele frequency of the highland alleles (W_AFHigh) in the intermediate subpopulation for each 10 kbp window. W_AFHigh would be large in the Scenario 2H and smaller in the Scenario 2L, whereas in the Scenario 1b, it would be neither extremely large nor small. Therefore, we categorized the Homozygote-accumulated ADWs (HA-ADWs) into three groups and defined as the Highland- (W_AFHigh > 0.66), the Lowland- (W_AFHigh < 0.33), and the Coexisting-HA-ADWs (0.33 < W_AFHigh < 0.66). The Coexisting-HA-ADWs were considered to match the Scenario 1b.
Annotation of Genes Including Candidate Genomic Regions
We investigated functional genes included in the identified genomic regions by annotation to General Feature Format (GFF) file of the reference genome of A. halleri (Briskine et al. 2017). We also referred to Briskine et al. (2017) that performed alignment against Arabidopsis thaliana coding sequences (TAIR10) and investigated the functions of candidate genes with TAIR database (https://www.arabidopsis.org/index.jsp). Using snpeff-4.5covid19-1, we investigated the number of SNPs included in the identified genome regions and categorized into synonymous, missense, nonsense or intergenic variants (Cingolani et al. 2012).