Genome-wide characterization of genetic diversity and population structure of Valeriana officinalis L.

Background: Valeriana officinalis L. is one of the most important medicinal plant with a mild sedative, nervine, antispasmodic and relaxant effect. Despite a substantial number of studies on this species, population genomics has not yet been analyzed. The main aim of this study was: characterization of genetic variation of natural populations of V. officinalis in Poland and comparison of variation of wild populations and the cultivated form using Next Generation Sequencing based DArTseq technique. We also would like to establish foundations for genetic monitoring of the species in the future and to develop genetic fingerprint profile for samples deposited in gene bank and in natural sites in order to assess the degree of their genetic integrity and population structure preservation in the future. Results: The major and also the most astounding result of our work is the low level of observed heterozygosity of individual plants from natural populations despite the fact that the species is widespread in the studied area. Inbreeding, in naturally outcrossing species such as valerian, decreases the reproductive success. The analysis of the population structure indicated the potential presence of metapopulation in a broad area of Poland and the formation of a distinct gene pool in Bieszczady Mountains. The results also indicate the presence of individuals of the cultivated form in natural populations in the region where the species is cultivated for the needs of the pharmaceutical industry and this could lead to structural and genetic imbalance in wild populations. Conclusions: The DArTseq technology can be applied effectively in genetic studies of V. officinalis . The genetic variability of wild populations is in fact significantly lower than assumed. Individuals from the cultivated population are found in the natural environment and their impact on wild populations should be monitored.

3 with valerenic and acetoxyvalerenic acids considered as species-specific marker compounds. Valerian root and the extracts thereof are used in the states of anxiety or sleeping disorders. Their efficacy relates to an interplay between different groups of compounds rather than with individual compounds [2].

Plant description
V. officinalis belongs to the genus Valeriana L. that is a part of the family Valerianaceae, now considered also as a part of the Caprifoliaceae Juss. in the order Dipsacales Juss. ex Bercht. & J.Presl.
Approximately 200 species are listed within this cosmopolitan genus, several of which are of commercial interest. V. officinalis occurs natively throughout Eurasia, except for arctic and desert zones [3]. It occupies fresh habitats such as pastures, meadows, ditches, scrub, open woodland, along margins of watercourses but also dry calcareous soils. Common valerian is a highly morphologically differentiated perennial with short, straight rhizome, roots and stolons as underground organs. Leaves are sessile, pinnately separated with 6-10 pairs of lance-shaped leaflets. Flowers are numerous, hermaphrodite, arranged in terminal cymose inflorescence with white or pink five-lobed corolla [4]. V. officinalis sensu lato constitutes a collective species with controversial subordinate taxonomy and phylogeny. Besides a high degree of developmental plasticity, it is also characterized by a varying ploidy level (2n=14, 2n=28, 2n=42 or 2n=56) [5][6][7].

Problem
Wild growing medicinal and aromatic plants (MAPs) are widely available source for discovering new drugs. Herbal drug industry has been expanding very intensively in recent years. Currently, in developing countries simple herbal products are used as a primary source of medicines, while in western countries they are used as an alternative or complementary to conventional medicine.
However, according to World Health Organization (WHO) data, still about 80% of people depend on pharmaceuticals of plant origin [8]. Out of about 422 000 known plant species, about 12.5%, have been used for the medical purposes. It is estimated that about four thousands of herbal plants species are threatened with extinction [9]. It is caused by both overexploitation of raw materials from natural sites and reduction of natural habitats due to uncontrolled deforestation as well as rapid development 4 of agriculture and urban areas [10]. Climate change also threatens medicinal plants, which, like all other living members of the biosphere, are not immune to them. Particularly endemic species are vulnerable to changes caused by global warming in the geographical regions or ecosystems which they inhabit [11]. The conservation of medicinal plant resources should take place in several directions by in situ and ex situ conservation strategies, good agricultural practices and sustainable use solutions [12]. Awareness of genetic diversity levels significantly improves the efficiency of genetic resources conservation, sustainable use and accelerates breeding activities that lead to the production of forms suitable for cultivation. Genomic studies of MAPs remain behind model and important cultivated plant species.

Genotyping
There is a lack of complex molecular genetic studies of V. officinalis as well as for the majority of medicinal plants. Next Generation Sequencing (NGS) technology could provide efficient toolkit for genotyping studies of medicinal plants. NGS, that is based on massive parallel sequencing, has revolutionized plant science over the last years. This ultra-high-throughput, scalable and high-speed technology can be applied to organisms having an unknown genome sequence. Large amount of Single Nucleotide Polymorphism (SNP) markers can be obtained through the simplified genome sequencing like restriction site associated DNA sequencing (RADSeq), genotyping-by-sequencing (GBS) or other related techniques [13][14][15]. DArTseq (Diversity Array Technology sequencing) is one such method. DArT was first developed at the beginning of 2000's and has improved both throughput and accuracy in the detection of DNA polymorphisms without the need for prior information on DNA sequences through hybridization and solid-state surfaces [16]. Today's DArTseq technique is the result of a combination of DArT and NGS [17]. The use of combinations of endonucleases that target low-copy DNA sites rather than repetitive DNA fragments leads to an effective reduction in genome complexity. The resultant representation of the genome is composed of both constant and polymorphic fragments across individuals. Until now, DArTseq has been applied to many plant species, especially cultivated plants, due to its high throughput, genomic coverage and transferability. Relatively few papers on its use for genotyping of wild and medicinal plant species 5 have been published. It has been successfully used in Eucalyptus L'Hér [18], Cochlearia polonica In Table 1, geographical coordinates (latitude and longitude) were given, followed by the determination of common valerian abundancy, according to Braun-Blanquet scale. In February 2016, seeds were sown in greenhouse, into multi-pots (with hole diameters 2-3 cm) filled with peat substrate, to obtain seedlings. Each population was represented by 25 plants. In the end of May 2016, the seedlings were planted in the experimental field to obtain a field ex situ collection.

DArTseq markers analysis
The results of genotyping were generated as tables listing silicoDArT and SNP markers. SilicoDArts markers (dominant) that are analogous to microarray DArTs are derived from the sequencing of fragments from the genomic representation and are associated with variability within the restriction sites. SNP markers (co-dominant) are single nucleotide polymorphisms detected in sequenced fragments of genome representations. Both markers were then scored as binary data, indicating presence (1) or absence (0) of a marker in the genomic representation of each sample, as described by Cruz et al. [17]. In the next step both types of markers were tested for reproducibility (%), call rate (%) and polymorphism information content (PIC).

Genetic diversity analyses
The following coefficients were calculated for SNP markers: observed heterozygosity (H o ), Nei unbiased genetic diversity (uH e ), and inbreeding coefficient (F IS ) according to the following equation: Due to technical limitations, Equation 1 has been placed in the Supplementary Files section.
For SilicoDArT markers only Nei unbiased genetic diversity (uH e ) was calculated. Genetic distance, for both types of markers, was measured by using Jaccard coefficient [46] according to the formula:

Due to technical limitations, Equation 2 has been placed in the Supplementary Files section.
where a is the number of variables present in both individuals being compared, b is the number of variables present in the first individual and absent from the second individual and c is the number of variables present in the second individual and absent from the first one. The symmetric genetic distance matrices were used to assess the amount of variation among the assigned regional groupings by Analysis of Molecular Variance (AMOVA) with tests including 999 permutations. In order to test for a correlation between two genetic distance matrices and between genetic and geographical distances (in km) among populations, Mantel tests were performed (computing 10000 permutations).

Population structure analysis
Genetic structure was assessed using Agglomerative Hierarchical Clustering (AHC) according to Ward's method, Principal Coordinate Analysis (PCoA) and model-based Bayesian clustering algorithms. Admixture and correlated allele frequencies model was used to determine the number of clusters (K) in the range from one to ten. Ten independent runs were set for each K value with a burnin period of 10 000 and 50 000 Markov Chain Monte Carlo replications after burn-in with no prior information about the origin of individuals. The ΔK method was used to determine the most probable value of K.

Joint analysis
Generalized Procrustes Analysis (GPA) was used to obtained consensus configuration of SNP and SilicoDArT results and to reduce the scale effect.

Marker quality analysis SNP markers
A total of 65 510 SNP markers with an average reproducibility of 99.6% were obtained using the DArTseq method for 188 individuals of V. officinalis. Over 98% of SNPs had reproducibility above 95%, of which 55 101 was completely reproducible (Fig. 2a). The call rate was in the range from 20% to 100% (Fig. 2b). Approximately 63% of SNPs had a call rate above 75%. When the quality criteria were applied, i.e. reproducibility above 90% and call rate above 95%, 23 507 SNP markers were assigned to subsequent analysis. The polymorphism information content (PIC) was in the range of 0.01-0.5 with an average of 0.181 and a median of 0.129. As much as 40% of SNP markers had a PIC below 0.1, and only 1% had a maximum value of 0.5 (Fig.2c).

SilicoDArT markers
A total of 61 633 silicoDArT markers with an average reproducibility of 99.8% were obtained. All markers had the repeatability above 95% and as many as 58 6244 had the repeatability equal to 100% (Fig. 2d). The call rate ranged from 0.8 to 1 (Fig. 2e). About 80% of silicoDArT had a call rate above 90%. After applying quality criteria, 41 925 markers were approved for further analysis. The PIC value ranged from 0.01-0.5 to 0.170 and the median 0.123. The frequency analysis showed that over 40% of markers had a very low PIC, i.e. below 0.1, and only 0.02% had a maximum PIC value (0.5).

Genetic diversity SNP markers
In individual populations, the percentage of polymorphic loci ranged between 31-61% (Fig. 3a). The with uH e calculated for SNP markers. The mean genetic distance between 188 individuals was 0.861.
The maximum distance (0.941) was between individuals from population PL 401941 and PL 403192, while the minimum distance was 0.294 for individuals from population PL 406738 and PL 403192.

Mantel test
The comparison of genetic distance matrices obtained for SNPs and silicoDArTs showed a strong uphill (0.752; p<0.0001) linear relationship of them (Fig 4a). The test in which genetic distance matrices between populations were collated with a matrix of geographical distance showed a moderate positive relationship for both SNPs and silicoDArTs (0.304 and 0.329; p<0.0001 respectively) (Fig. 4b,   c).

AMOVA
The Analysis of Molecular Variance (AMOVA) for SNP markers of all samples showed that 23% of the variance in allele frequency could be attributed to differences between populations (Fig. 5). In case when only wild populations were analyzed, the value of differentiation among populations (ɸPT) coefficient was 0.199 (p<0.001). The hierarchical AMOVA of wild populations revealed that 78% of total variance was found among individuals, 16% was found among populations while the rest of variation (6%) was among regions. The genetic variation between the cultivated form and wild populations was 29%. AMOVA results of silicoDArTs indicated greater intrapopulation than interpopulation variation (~83% vs ~17%). The ɸPT coefficient for the natural populations exclusively was 0.166 (p<0,001). The hierarchical AMOVA of wild populations confirmed that the largest fraction of variation (82%) was within populations, while the interpopulation variance was 13% and interregions variation was restricted to 5%. 16% of the variation was found between the cultivated form and the natural populations.
Population structure

SNP markers
The Agglomerative Hierarchical Clustering based on SNP markers showed the presence of four main groups (Fig. 6a) The first three main coordinates of Principal Coordinate Analysis (PCoA) explained only 24.5% of the total variability (Fig. 6b, c). In the 1 and 2 coordinate system, the presence of four groups, i.e. the cultivated form, population originating from the Podkarpackie Voivodeship with a separate population PL 403183 and a large group consisting of populations from the other three Voivodships, was clearly visible. The third coordinate allowed to isolate as well as a group of individuals from Mazowieckie Voivodeship.
In order to further elaborate the genetic structure of the populations, a model-based Bayesian clustering was conducted using the Structure 2.3.4 software. In order to find the suitable number of clusters (K) it was tested in the range from 1 to 10 and was plotted against ΔK which showed a sharp peak for K=5 (Fig. 6d), hence, five distinct groups were found to contribute significant genetic information across tested accessions. The sub-populations were denoted as POP1 -POP5, and each sub-population contributed in 10%, 5%, 56%, 21% and 7% of total V. officinalis genetic makeup, respectively (Fig. 6d). The proportion of membership of individuals in each population was illustrated in the bar plot of the population assignment test in structure analysis (Fig 6e, f). Genotypes that the

SilicoDArT markers
Ward's hierarchical grouping showed the presence of five main clusters (Fig. 7a). The groups were identified in a very similar way to the above-mentioned SNP analysis. The only difference corresponded to the classification of PL 403183 from Podkarpackie Voivodship as a separate group. Also, PCoA results for silicoDArT data showed high agreement with those reported for SNP markers (Fig. 7b, c). The three main coordinates explained 14.9% of the total variation. Bayesian cluster analysis of silicoDArT markers using the ΔK method to determine optimal K values suggested that the most likely number of clusters was two (Fig. 7d), hence two distinct groups were found, and they correspond to cultivated and wild populations (Fig. 7e). The subpopulations were determined as POP1 and POP2 and constituted 11% and 89% of the total genetic information, respectively (Fig. 7f). The structure of the lower order for K=3 showed the existence of third group composed of populations originating from Podkarpackie Voivodeship (Fig. 7g, h).

Joint analysis
Generalized Procrustes Analysis was performed based on the first ten principal coordinates obtained from PCoA for SNP and silicoDArT data. The most effective reduction of the total variability was scaling and rotation (p<0.0001). The residuals by individuals after transformation were in the range 0.0001-0.009. On average, the highest value was observed in the cultivated form. The residuals by configuration after transformation were equal (0.169) indicating that both methods gave consensus results. A consensus test allowed to determine that the observed Rc (results of the consensus test) value was significantly higher than 95% of the results that were obtained when permuting data. The eigenvalues showed that the first three axes represented 53.56% of variability. The variability splitted between the methods showed that the first axis explained a bit more variation of SNP than silicoDArT data. The first three GPA factors explained 53.5% of total variance. Finally, scatterplots presented the consensus results. Due to the high consistency of the results obtained for SNP and silicoDArT analysis, as before, five major groups could be identified. There was a clearly visible distinctiveness between the cultivated form and natural populations. Among the wild populations, the ones coming from the Bieszczady Mountains were distinct and moreover they showed additional internal differentiation.
Also, a part of individuals from the Masovian region forms a separate aggregation. On the other hand, the population originating from Kujawsko-Pomorskie Voivodeship does not differ from the large group of populations collected in the Świętokrzyskie Voivodeship (Fig. 8).

Discussion 13
V. officinalis is a species commonly used in herbal medicine. It is widespread in Europe and Asia and is currently not considered to be endangered. Despite the great interest of researchers, this species has not yet received genomic characterization. Previous studies concerned mainly the content and specification of active compounds and their use for therapeutic purposes [1,[19][20][21][22]. Several studies using molecular markers, such as Amplified Fragment Length Polymorphism (AFLP) or Random Amplification of Polymorphic DNA (RAPD) were performed for this species, but none of them concerned the intraspecies diversity analysis [23][24][25]. In this context, Valeriana wallichii DC (syn. Valeriana jatamansi Jones) is much better recognized [26][27][28]. The NGS high throughput sequencing technology, i.e. RADseq, had already been used in the Valerianaceae phylogenetic research, but it only concerned the species native to South America and the V. officinalis was not included in it [29].

Utility of DArTseq markers
To the best of our knowledge, in the research presented in this paper, the NGS technology was used for the first time for the analysis of intraspecies variation within the Valeriana genus. Until now, the used here, DArTseq technique has been mainly applied for the analysis of crop and animal species [30][31][32][33][34][35]. For 188 individuals of V. officinalis, representing 19 wild populations and one cultivated form, over 65 000 SNP markers and over 61 000 silicoDArT markers with a very high reproducibility coefficient were obtained using this technique. For comparison, in Triticum durum Desf. DArTseq analysis allowed to identify more than 20 000 SNPs and 56 000 silicoDArTs [36]. In phylogenetic studies of the genus Secale, this technique allowed to identify about 14 thousand SNPs, and in studies of rye inbreeding lines almost 15 thousand silicoDArT and five thousand SNP markers were obtained [37,38]. In the population genomics of Eucalyptus species 54 000 SNP markers were recorded [18]. In earlier studies of V. officinalis using AFLP markers only 311 fragments were generated. The PIC value obtained in the presented study for both SNP and silicoDArT markers was quite low and amounted to 0.181 and 0.123, respectively. It was a derivative of a significant fraction of monomorphic loci. In the above-mentioned cereals studies the PIC value was significantly higher and amounted to 0.302 and 0.265 in wheat and 0.37 and 0.22 in rye for SNPs and silicoDArTs, respectively [36,38]. However, it should be noted that the genome of V. officinalis is more than 2.5 times smaller than the genome of 14 Secale cerale L. and more than 4 times smaller than the genome of T. durum [23,39,40]. On the other hand, similar to the our (~ 0.18) PIC values were obtained in the study of Ctenophorus caudicinctus, Litoria ewingii and Litoria paraewingi the Australian lizard and frogs [35]. Based on the above data, therefore, it can be concluded that the DArTseq technique is efficient and useful in the genetic analysis of V. officinalis.

Diversity
Due to the lack of earlier studies on genetic variability in the species V. officinalis, it is difficult to refer the level of parameters obtained in the presented paper to a species-specific diversity pattern. presence of V. officinalis on natural sites in that region has been proven [3,41]. From the point of view of the genetic structure of the species population, the similarity of genotypes occurring in the relatively remote Świętokrzyskie and Kujawsko-Pomorskie Voivodships also seems to be interesting.
This may indicate that a metapopulation with free gene flow occurs in a significant part of Poland. This hypothesis is supported by the widespread occurrence of valerian across the country, however, its verification will be possible only after investigating a larger number of natural populations, especially from regions not included in this study [3]. In the populations inhabiting the Mazowieckie Voivodeship, the presence of individuals of the genotype corresponding to the cultivated form 'Lubelski' was recognized. This is probably due to the presence of valerian plantations in the area. In other words, we are dealing with the uncontrolled spread of the cultivated form beyond the plantation area. Nonetheless, it is crucial that there is no evidence of crossbreeding wild and cultivated specimens. Valerian cultivation is very widespread and common in the Lubelskie Voivodeship, so in that area the probability of contamination of natural populations by the cultivated form is extremely high. Unfortunately, no environmental monitoring is carried out in this context.

Vulnerabilities
Although V. officinals is still considered to be a commonly occurring species not only in Poland, but also throughout its entire range of distribution, due to the loss and degradation of habitats, the abundance of this species may start to decline rapidly. Climate change and increasing periods of

Competing interests
The authors declare that they have no competing interests.

Funding
The molecular studies were financed by the National Science Centre from grant "MINIATURA I" no. Tables Table 1 The   Distribution of DArTseq data for quality parameters: a) SNP data reproducibility; b) SNP data call rate; c) SNP polymorphic information content (PIC) distribution; d) silicoDArT data reproducibility; e) silicoDArT data call rate; f) silicoDArT PIC distribution. 28 29 Figure 3 The results of diversity analysis: a) the percentage of polymorphic loci within the populations; b) contribution of unique alleles/loci in the studied populations; c) percentage of unique alleles/loci in the examined regions. Accession numbering in accordance with Tab. 1 30 Figure 4 Mantel correlation test between a) SNP and silicoDArT markers; b) geographical distance and genetic distance based on SNP markers; c) geographical distance and genetic distance based on DArTseq markers.
31 Figure 5 Results of the AMOVA framework for SNP and silicoDArT markers Results of joint analysis of SNP and silicoDArT data through Generalized Procrustes Analysis.
a) 2d plot of the first two axis of GPA; b) 2d plot of the first and the third axis of GPA

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.