Population sampling
We sampled a piece of leaf of 737 individuals from 41 wild populations of H. middendorffii complex throughout its distribution range in the Japanese Archipelago, which included 10 populations of var. middendorffii (HK_01–10), one population of var. exaltata (Sado), 29 of var. esculenta (TO_01–07, CH_01–03, CH_05–10, KI_01–07, KY_01–04, OK_01 and OK_02), and one population of var. esculenta f. musashiensis (CH_04). Identification of the samples were undertook by KM and SS. In sampling of wild populaions, appropriate permissions have been obtained based on local laws, in necessary. Detailed sampling information and voucher specimen is provided in Table S1 in Supplementary file 1. This species is known to have very small populations in southwest Japan, and regional populations KY and OK used in this study are all populations that remain (The distribution of regional population is derived from the results of STRUCTURE analysis, see results). The Hemerocallis species are known to hybridize between different species under natural conditions [26, 55–57], and the inclusion of possible recent hybrids can be problematic in estimating population genetic diversity and demography. Hence, during the sampling we collected leaf samples from individuals with open flowers, which can be diagnostic of morphological species. In addition, we included the other congeners (a few individuals of H. citrina Baroni var. vespertina (H.Hara) M.Hotta, H. fulva L. var. littorea (Makino) M.Hotta, H. fulva L. var. longituba (Miq.) Maxim., H. lilioasphodelus L. var. thunbergii (Baker) M.Hotta) in the following genotyping experiments to detect possible hybrids. The genetic structure analysis of the congeneric species and H. middendorffii demonstrated that no recent interspecific hybrids were included in the genotype data of 737 H. middendorffii samples (data not shown).
DNA extraction and genotyping methods
The leaf samples were immediately dried with silica-gel beads and kept in the dark at room temperature until DNA extraction. The leaf materials (ca. 1.0 cm2) were then stored at -80 °C in a deep freezer (MDF-C8V1-PJ, Panasonic, Japan) and were pulverized with a TissueLyser (Qiagen. Hilden, Germany). After polysaccharides removal with wash medium described in Setoguchi and Ohba [58] from the resulting powder, total DNA was extracted using the CTAB (Cetyltrimethylammonium bromide) method [59]. For genetic analysis, we used 12 EST-SSR markers developed for H. middendorffii [60]. The total PCR (polymerase chain reaction) reaction volume was 6 µL, which contained approximately 0.5 ng DNA, 3 µL of 2 × QIAGEN Multiplex PCR Master Mix (QIAGEN), 0.01 µM of forward primer, 0.2 µM of reverse primer, and 0.1 µM of fluorescently labeled universal primer which is attached to one of four different universal sequences added to 5' end of each locus-specific forward primer. PCR amplification was performed with an initial denaturation at 94 °C for 30 min followed by 35 cycles at 94 °C for 30 sec, 55 °C for 3 min, and 72 °C for 1 min followed by a final extension at 68 °C for 30 min. Amplified products were loaded onto an ABI 3130 auto-sequencer (Applied Biosystems, Foster City, California, USA) using the GeneScan LIZ-600 size standard (Applied Biosystems), POP7 polymer (Applied Biosystems) and 36-cm capillary array. Fragment size was determined using GeneMapper (Applied Biosystems).
Data analysis
Genetic structure in the Japanese Archipelago
We employed a model-based Bayesian algorithm implemented in STRUCTURE 2.3.4 [61] to detect genetic structure within the H. middendorffii samples. The STRUCTURE analysis was performed according to the assumption that each individual had admixed ancestral origins and different gene pools retained their correlated allele frequency (along with the “correlated allele frequencies model” described by Falush et al. [62]) and that the sampling locations were designated as prior information to obtain better parameter estimates [63]. Twenty independent simulations were run for each K (K = 1–20), with 100,000 burn-in steps followed by 100,000 Markov chain Monte Carlo steps. We computed the corresponding ΔK ad hoc statistics [64] with STRUCTURE HARVESTER 0.6.94 (http://taylor0.biology.ucla.edu/struct_harvest) [65]. Replicates were combined with CLUMPP [66] and bar plots of assignment probabilities were built using DISTRUCT 1.1. [67]
The genetic structure among populations was explored by constructing neighbor-joining (NJ) [68] and UPGMA networks based on DA distance [69] among populations, and the significance of interior branches of the tree was evaluated based on bootstrap values using a software POPTREE2 [70]. We also calculated pairwise DA distance among populations using GenAlEx 6.51 [71]. This matrix was then used to construct Neighbor-net using Splits Tree4 [72].
Because both model-based and distance-based clustering analyses revealed six population groups corresponding to geographic distributions (see results), all populations were classified to six regional groups; that is HK (Hokkaido), TO (Tohoku district), CH (Chubu district), KI (Kinki district), KY (Kyoto prefecture), and OK (Oki Isls). The following genetic diversity and demographic analyses were performed separately for the population groups.
Isolation by distance analysis
To determine the extent of gene flow within a regional population, the presence of significant isolation-by-distance patterns within the six regional populations was investigated by applying the Mantel test with 999 permutations to the pairwise relationship between the log-transformed geographic distances (transformed to natural logarithms) and pairwise FST / (1 – FST) values between populations [73] using GenAlEx 6.51.
Calculation of genetic diversity
The levels of within-population genetic diversity were evaluated by calculating four genetic diversity statistics using GenAlEx 6.51: average number of alleles per locus (NA), Ho, and fixation index (FIS). AR and PAR were calculated by HP-Rare [74]. By the calculation of allelic richness using the rarefaction method, which can be used to compare allele diversity between populations with different sizes, the standard sample size was set to more than six individuals in this study. M ratio, which evaluates the effect of a population bottleneck, was calculated for the populations with sample size more than 10 individuals following the method described by Garza and Williamson [35]. We also compared the levels of genetic variation (AR, Ho, He, FIS, and FST) among regional populations as defined above. For testing the effect of population grouping, regional genetic diversity and two-sided P-values after 1,000 permutations were assessed using FSTAT 2.9.4 [75].
Genetic differentiation among regions and taxa
Previous genetic studies of H. middendorffii were performed to reveal phylogenetic relationships between H. middendorffii populations, using universal molecular markers, i.e. chloroplast DNA [25, 76], Random Amplified Polymorphic DNA (RAPD) [38] and anonymous single nucleotide polymorphisms (SNPs) from genotyping-by-sequencing [77]. These molecular phylogenetic studies showed that some intraspecific clades were weakly associated with geography. However, because these studies analyzed only one or a few samples per population, no information is available on genetic diversity and demographic changes of regional populations.
To evaluate the contribution rate of genetic differentiation among regions and taxa, we performed a hierarchical analysis of molecular variance [78]. Calculation of regional F-statistics (FRT, FSR, FST, FIS, and FIT) and their significance using the permutation approach were carried out using the program GenAlEx 6.51. Genetic diversity was hierarchically partitioned into four levels: (1) among taxa (a) or among regions (b), (2) among populations, (3) among individuals, and (4) within individuals. In the analysis of the region model, we set up seven regional groups, i.e., the six regional groups defined by STRUCTURE analysis and one group consisted of the Sado Isl. population (which was genetically “admixed”; see results).
Demographic analysis
We used a full-likelihood Bayesian method MSVAR 1.3 [79], which is efficient at detecting population declines and expansions from microsatellite data, which provided neither too weak nor too recent event. MSVAR is also superior to moment-based methods (the M ratio test and Bottleneck) for detecting changes in population size, independent of time and the severity of the event [80]. For the analysis based on coalescent simulations, we employed 10 microsatellite markers that appeared to follow the stepwise mutation model (excluded two markers: hem19141 and hem15494). We estimated the current and ancestral effective population sizes of regional populations (NC and NA), and the time since the population size started to change. Prior distributions were set as follows: starting value for current and ancestral population size was 103 for all loci, mutation rate was 10− 4, starting values for time since decline/expansion for all loci were 104, log10 scale for the prior means and variance of current size, ancestral size, mutation rate and time since decline/expansion were (4, 1), (4, 1), (-4, 2), and (4.5, 2), respectively. Generation time was assumed to be 10 years, by considering 4 years as the time to maturation and 6 years as continuing reproduction period.
To infer the changes in each regional population size, we used R package VAREFF, an approximate likelihood method with a Monte Carlo Markov Chain approach [81]. VAREFF is especially useful to provide evidence of transient changes in population size in the past. In this analysis, we set the parameters as follows: NBLOC: 10, JMAX: 5, MODEL: Geometric Model, and provided an additional coefficient (C) for G models: 0.2. MUTAT (mutation rate, assumed the same for all loci): 10 − 4, NBAR (prior value for the effective size): 3000, VARP1: 3, RHOCORN: 0, GBAR: 3000, VARP2: 3, DMAXPLUS: appropriate values were determined for each regional population. Diagonale: 0.5, NumberBatch (number of batches for metrop in MCMC): 10000, LengthBatch (length of batch for metrop in MCMC): 10, SpaceBatch (space of batch for metrop in MCMC): 10, Burnin (length of the burnin period): 10000, and AccRate: 0.25).
Paleodistribution modeling
Ecological niche modeling (ENM) is a powerful tool for revealing distribution changes in the past times. ENM can also provide a source of evidence independent from genetic data, which is particularly pertinent for species that are poorly represented in the fossil records. Once current occurrence data and environmental variables are collated in an ecological niche model, historical species distributions can be approximated by projecting species parameters onto a simulated past climate [82].
Because the results of our demographic analyses showed that effective population size changes happened since or after the last glacial period (see results), our paleodistribution modeling targeted the two historical times that represented either climatic extremes: The Last Glacial Maximum (LGM, about 21,000 years ago) and the Holocene optimum, (about 6,000 years ago). For the generation of ENM for H. middendorffii, a maximum entropy algorithm implemented in MaxEnt version 3.3.3 k [83] was used to relate occurrence records to the current climate variables. These records were collected either from our original field surveys or specimen records downloaded from the GBIF website (https://www.gbif.org/). After the removal of doubtful records, 271 records were used for model construction. We considered nine bioclimatic variables as predictors, which represented influential factors in terms of temperature and precipitation controls of the species and did not show significant clamping effects in model projections (annual mean temperature, temperature seasonality, mean temperature of the coldest quarter, mean temperature of the warmest quarter, annual precipitation, precipitation of the warmest quarter, precipitation of the coldest quarter, precipitation of the wettest quarter and precipitation of the driest quarter). Current climate and paleoclimate layers during the 21kya-CCSM (LGM) and 6kya-CCSM (Holocene optimum) in 30 arc-sec resolution were downloaded from the Worldclim database (https://www.worldclim.org/)[84]. In MaxEnt analysis, 10,000 background points were randomly generated for the study area (the Japanese Islands and Sakhalin), and other parameters were set as defaults, except regularization multiplier as 2, to avoid model overfitting. To evaluate the optimal quality of the prediction, 100 replicated bootstrapping was performed.
Relationship between paleodistribution and genetic diversity of the extant population
The values of genetic diversity indices estimated from microsatellite data and paleodistribution reconstructed from current climatic variables and species distribution records were estimated independently. Therefore, the combination of these two approaches can significantly improve our understandings of the past processes, which were shaped through population genetic diversity [85–87].
To test which of the glacial or interglacial climates were the limiting factor of the current population genetic diversity of H. middendorffii, we examined the correlation between genetic diversity estimates and climate suitability around the focal populations, as predicted by ENM. We analyzed three different genetic diversity measures (AR: allelic richness, PAR: private allelic richness, and asin He: arcsine-transformed gene diversity) of the current populations as a function of habitat suitability during the LGM (21kya) and the Holocene optimum (6kya). The distribution probability of H. middendorffii populations during the LGM and Holocene optimum were summed around each population locality, and hereafter, this value will be called “habitat suitability”. The projected distribution probability given for each 30-sec pixel within the study area was sampled and summed within multiple-ring buffers using ArcGIS 9.2 Desktop (ESRI, California, USA). Multiple-ring buffers were generated from all sampled locations except for the population number of individuals less than six using QGIS 2.14 (QGIS Development Team). We changed buffer radius from 5 to 300 km (5, 10, 25, 50, 100, 150, 200, 250, 300) as an estimated effect of the fixed term may vary according to the spatial scale with which habitat suitability was evaluated [20]. We used a generalized linear mixed model (GLMM) to analyze the genetic diversity indices with two fixed terms of habitat suitability at 21kya and 6kya, in which the loci as random intercepts, because diversity values at these loci are likely to be more related to each other than to the values from different loci. To explore the combination of spatial scales best explaining genetic variations in each genetic diversity index, we selected the most parsimonious model with minimum Akaike’s information criterion (AIC) value among the candidate models. We used R-package lme4 [88] to obtain AIC values among combinations of the different sizes of sampling radius of 21kya and 6kya (9 × 9 scales). Model selection was performed by R-package MuMIn [89]. Parameters of the best models were estimated by R-package lmerTest [90]. All statistical analyses were conducted in R 3.5.2 (R Development Core Team 2018).