Plant sampling
From 2015-2018, 3596 samples were collected from 128 populations representing the species complex A. chinensis Planch. including two main species of var. chinensis and var. deliciosa. Another variety, A. chinensis var. setosa, narrowly distributed in the Ali mountain of Taiwan island was not included in the present study due to be lack of sample. The field collection followed the ethics and legality of the local government and was permitted by the government. The formal identification of the plant material was undertaken by the National Actinidia Germplasm Repository (NAGR) of China, and voucher specimens were also deposited at NAGR. Each population consisted of either var. chinensis or var. deliciosa, and there were still some populations containing both varieties based on morphology in the sympatric regions. For each population, 6 to 30 individuals were sampled (Additional file 2: Table S1). To avoid collecting closely related individuals, sampled individuals were spaced by ca. 50-100 m. Fresh leaves were dried in silica gel in the field and then stored at -80 °C. In addition, one-year-old shoots were stored at 4 °C in a refrigerator at least 30 days to overcome dormancy and then placed in fresh water at room temperature to stimulate bud breakage for the collection of fresh leaf tissue. The latitude, longitude, and altitude of each sampled individuals were recorded using a global positioning system.
Ploidy analysis
The ploidy level of each individual was determined by estimating its relative DNA content using FCM. Briefly, fresh leaves were chopped in 0.5 mL of nuclear extraction buffer and were then filtered through a nylon sieve, followed by the addition of 2 mL 6-dianidino-2-phenylindole (DAPI) held at 4 °C. Two minutes later, these samples were used to estimate the ploidy levels on CyFlow Ploidy Analyser (Partec, Germany) automatically. Relative fluorescence intensity of a diploid var. chinensis cultivar, ‘Hongyang’ (2n = 58), whose chromosome number had been determined previously based on chromosome counts and the estimated genome size of 750 Mb [71], was used as an internal standard for each measurement [24].
DNA Isolation, Amplification, and Sequencing
For chloroplast DNA analysis, 49 out of 128 populations were selected to cover the entire distribution ranges of the A. chinensis complex. A modified CTAB method [72] was used to extract total genomic DNA from 608 individuals in this pool, then 1% agarose gels and NanoDrop 8000 (Thermo Fisher Scientific, Waltham, MA, USA) was employed to examine the quality and concentration of genomic DNA, respectively.
To identify chloroplast DNA (cpDNA) regions with sufficient variation, 12 individuals of var. chinensis and 12 var. deliciosa were randomly selected to undertake the preliminary screening of primer pairs for eight intron regions: ndhF-rpl132, ps16-trnQ, trnE-trnT, psbA-trnH, rpl16, trnL-trnF, trnL-trnT, and trnD-trnT (data not shown). Finally, three consistently amplified and variable non-coding intergenic spacer (IGS) regions ndhF-rpl132, ps16-trnQ, and trnE-trnT [73]( Additional file 2: Table S5) of the plastid genome were selected for further analysis. A total of 608 individuals from all 49 populations of the A. chinensis complex were sequenced. Amplification was carried out in a volume of 20 μL reaction solution containing 10 μL 2 × Taq PCR MasterMix ( Bioteke), 0.5 μL each primer (0.2 μM), 1 μL template DNA (ca. 50-100 ng) and 8 μL ddH2O. The thermo-cycling conditions for PCR are given as follows: 94 °C, 4 min; 35 × (94 °C, 30 s; 56 °C, 60 s; 72 °C, 60 s); and 72 °C, 10 min. PCR products were sequenced in an ABI 377XL DNA sequencer (Applied Biosystems). All haplotype sequences identified in this study were submitted to GenBank (MT812986-MT813020). Besides, for subsequent analysis, A. eriantha and A. polygama were used as outgroup and these chloroplast sequences were downloaded from GenBank (KY100979.1 and KX345297.1).
Analyses of genetic diversity and population structure
Three cpDNA fragments were aligned separately and trimmed in MEGA 6 [74] and combined into a single dataset using an online tool FaBox [75] for the phylogeographic analyses. During the analyses, indels (gaps) were treated as a single mutation event and coded as substitutions A or T (a third G or C will be used when three kinds of mutations coexisted). The number of cpDNA haplotypes, haplotype diversities (Hd), and nucleotide diversities (π) [76] for each variety were calculated using DnaSP 5.10 [77]. Haplotype distribution maps were constructed using ArcGIS 10.3. Total genetic diversity (HT) and within-population diversity (HS) were calculated with PermutCpSSR 2.0 [78].
The phylogeographic structure was inferred by comparing population differentiation for phylogenetically ordered (NST) and unordered (GST) haplotype using PermutCpSSR 2.0. A test of significance comprising 1000 permutations was used to determine if NST > GST. A significantly higher NST than GST usually indicates the presence of phylogeographical structure. The geographically and genetically distinguishable groups and the potential barriers between groups were analyzed using SAMOVA 2.0 [79]. Various SAMOVA were run, increasing the number of K groups until the percentage of explained variance among groups reached a limit (K = 2–20). In addition, a molecular variance (AMOVA) analysis was performed using Arlequin v3.5 [80] with significance tested using 1023 permutations to test genetic differentiation among populations and between varieties. Finally, estimates of pairwise FST/(1 − FST) were regressed against the pairwise natural logarithm of the geographic distance using a Mantel test with 999 random permutations in GENALEX 6.5 to test IBD patterns among the populations [81, 82]. Measures of pairwise FST distances were calculated using Arlequin, and the pairwise natural logarithms of geographic distance were calculated by GENALEX 6.5.
Haplotype relationships and divergence time estimation
To determine phylogenetic relationships among haplotypes, median-joining networks were constructed using NETWORK v5.0 [83]. The haplotypes of A. eriantha and A. polygama were also included in constructing haplotype network as well as in molecular dating analysis. we a Bayesian phylogenetic inference on all haplotypes using Markov chain Monte Carlo (MCMC) methods and a strict clock model in BEAST2 v2.5.0 were performed to estimate divergence time among lineages [84]. Meanwhile, a GRT model was selected based on the result of AIC from jModelTest v 2.1.10 test [85]. As a tree prior, the Yule model was specified. Two calibration points from previous studies of Actinidia [86] were used to constrain the nodes with a normal distribution prior. One point is the estimated split between A. polygama and the other two species (19 Myr, 95%CI: 14-24), and another is the estimated split between A. eriantha and A. chinensis (11 Myr, 95%CI: 4.7-17.3). The Monte Carlo Markov chain runs were performed every 1×109 generations, with sampling every 1×105 generations, following a burn-in of the initial 10% cycles. the convergence and effective sample size (ESS) > 200 for all parameters were assessed using Tracer version 1.7.1. After discarding the first 25% trees as burn-in, the rest of the trees were summarized in a maximum clade credibility tree with treeAnnotator version 2.6.0. Finally, the maximum clade credibility tree was visualized in FigTree version 1.4.3 (available at http://tree.bio.ed.ac.uk/ software/ figtree/).
Demographic history analyses and ancestral geographical area reconstruction
All populations were divided into four groups which represent the eastern 2x populations, eastern 4x populations, central-western 2x populations, and central-western 4x-6x populations, respectively. The mismatch distribution analysis using Arlequin 3.5 [80] was conducted to test whether each group has undergone recent demographic or spatial population expansion events. Multimodal mismatch distributions of pairwise differences between individuals are expected for populations at demographic equilibrium with a relatively stable size over time, whereas unimodal distributions are expected for the population that has experienced recent demographic expansions [87]. The goodness-of-fit of observed mismatch distributions to the theoretical distributions under a model of sudden expansion was tested with the raggedness index [88]. The significance of the raggedness index was obtained by examining the null distribution of 1000 coalescent simulations of these statistics. Small values of the raggedness index suggest sudden expansion, whereas high values of the raggedness index suggest stationary or genetic bottleneck. Second, for expanding lineage, the expansion parameter (τ) and its 95% confidence interval were converted into generation time (T) since expansion following the equation: T = τ/2μ [89], where μ is the neutral mutation rate of the entire cpDNA sequences per generation. The value for μ was calculated as μ = uk, where u is substitution rate in substitutions/site/generation (s/s/g) (here, 1.0 x 10-8; refer to Gaut [90] ), k is the average sequence length of the cpDNA region under this study (here, 1,854 bp), the expansion time was calculated assuming a generation time of 7 years for A. chinensis under natural conditions [86]. Finally, we also calculated Tajima's D [91] and FS statistics of Fu [92] using Arlequin 3.5 [80], which is based on 1000 random permutations. The mismatch distribution of the observed number of nucleotide differences between pairs of DNA sequences was computed using DnaSP 5.10 [77].
Ancestral range reconstruction was conducted to estimate possible historical patterns of geographical distribution for A. chinensis complex using the BBM (Bayesian binary MCMC) method implemented in RASP v.4.0 [93]. For BBM analysis, 10001 BEAST-generated trees and consensus tree without outgroups were used as topology input, excluding the first 1000 as burning. In the ancestral area reconstruction, the geographical areas of the A. chinensis complex were defined four biogeographical regions: EA, East; SO, South; CE, Central; and WE, West (Figure 4a) [54]. Ten BBM chains were run for 100,000 generations with a sampling frequency of 100. F81 was used as a state frequencies model according to the Akaike Information Criterion.
Ecological niche models
To study the niche differentiation and distributional changes for each variety, the Maxent Version 3.4.0 program [94, 95] was used to undertake to test of the ecological niche models and projected their potential distributions during three periods: the present, the Last Glacial Maximum (LGM, 21kya), and the Last Interglacial (LIG,120-140 kya). The distribution data were obtained from our field collection and the Chinese Virtual Herbarium (CVH, available at http://www.cvh.ac.cn/). For all CVH records, the taxonomic identity was verified by specimen images and removed the artificial cultivation records to ensure that geolocations were consistent with the known distribution ranges. After removing duplicates, 150 sites for var. chinensis and 90 sites for var. deliciosa were identified. The current and past environmental datasets of 19 bioclimate variables with spatial resolutions of 2.5 arc minute were downloaded from the WorldClim database [96]. To avoid over-fitting [97], strongly correlated bioclimatic parameters, according to Pearson’s coefficient (|r| > 0.8) using the Perl program ENMTools v.1.3 [98], were excluded. Finally, six bioclimatic variables (Additional file 2: Table S6) were retained for ecological niche model construction with Maxent Version 3.4.0. And the model quality was assessed by cross-validation comprising 100 replicates using 25% of the data for model testing, and the accuracy of each cross-validation test was evaluated using the area under the ROC curve (AUC) [99].
Ecology divergence analysis
To measure niche divergence, the environmental niches of the two varieties were compared using niche overlap and identity tests, in which the difference between the actual niches was contrasted with null models generated from randomly reshuffled occurrence points [100]. Niche identity was calculated using Schoener’s D similarity index [101] and Warren’s I [100] implemented in ENMTools version 1.3 [98, 100]. Both Schoener’s D and Warren’s I ranged from 0 (no niche overlap) to 1 (identical niches). One hundred pseudoreplicates of shuffling were conducted to generate null models for these statistics, and tested for significance; histograms were drawn using R 3.6.0 (http://www.rproject.org/). In addition, the same tests and analyses were also carried out for the three main different level ploidy (2x, 4x, and 6x) populations.
To better understand and visualize the similarities and differences between the environments in which var. chinensis and var. deliciosa occur, the principal components analyses (PCA) were conducted using the elevation and 19 bioclimatic variables from WorldClim [96, 102]. The same 240 occurrence data points used for ENMs have been extracted values in ArcGIS 10.3. And the matrices of environmental values were standardized before the PCAs were performed, simultaneously the distributions of species in environmental space were visualized in R 3.6.0.