Development of new SSR markers for C. formosensis
For a higher accuracy of court's judgment on illegal felling, it is necessary to establish a complete forensic system. Although some SSR markers of cypress have been published17,24-26, the reported detection rates for dried timber were only 20-40%. Therefore, it is necessary to develop more SSR markers as a contingency plan. When the sample is in a poor condition, more markers can be applied in order to achieve the threshold of combined power of discrimination (CPD) required for successful comparison between seized timbers and victim trees. In order to maximize potential loci, Next generation sequence (NGS) methods were used. In this study 3 DNA library were constructed. We used the Illumina MiSeq platform (2 × 301 bp; Illumina, San Diego, California, USA) to sequence the DNA libraries (Fig. 1 and Supplementary 1).
A total of 70,325,072 raw reads were produced. The raw reads were deposited in the NCBI BioProject (PRJNA454510). After quality-trimming to the raw reads with CLC Genomics Workbench version 10 (QIAGENE, Aarhus, Denmark), 70,319,509 contigs were generated with length between 133 and 146 bp in average. De novo assembly was then conducted with the following parameters: contig number 208,467, minimum length of contigs 18bp, maximum length contigs 108,928bp, and average length contigs 491bp. The sequence was assembled with software CLC Genomics Workbench version 10 and the length of assembly sequence was 102,281,642 bp.
A sum of 78,250 SSR containing sequences were screened by MISA (v 1.0, MicroSAtellite)27. We newly designed 100 candidate SSR primer pairs for testing in C. formosensis by BatchPrimer328.
There are 9 validated SSR markers are polymorphic (success rate 9.00%), which were registered in GenBank in NCBI (Table 1) and passed for cross-species tests (supplementary 2 and 3).
Different from the traditional SSR cloning method, with next generation sequencing technology, it is easy to obtain colossal amount of SSR containing sequences from sequenced genomes29. However, transforming candidate SSR primer pairs into validated SSR markers is still a time-consuming and expensive step. Qualified SSR markers need to succeed in PCR amplification, have good peak pattern quality with little stuttering, and be free of non-amplifying (invalid) alleles. The turnover rate from candidate SSR primer pairs to validated SSR markers varies from species to species29,30. The success rate in Chamaecyparis plants is between 5.24% and 9.27%8,17,24,25,31.
Developing C. formosensis individual identification system
In this study, newly developed 9 validated SSR markers and other 27 validated SSR markers17 polymorphic SSR markers were analyzed against 92 individuals from 4 geographic areas (MM, HV, GW, SY, Fig. 2 and Supplementary 1). The results of developed 36 SSR markers are summarized in Table 2. Among the 92 individuals in this study, each number of alleles of SSR is between 2 and 27, with average of 7.916. The levels of observed heterozygosity (Ho)32 are from 0.000 to 0.891, with average of 0.414. The levels of expected heterozygosity (He)32 are ranged from 0.103 to 0.906, with average of 0.565. Significant (P < 0.001) deviations of Hardy-Weinberg equilibrium (HWE)33 were detected in 23 SSR loci: Cred47, 225, 231, 236, 242, 248, 249, 250, 253, 260, 262, 276, 277, 280, 603, 610, 628, 640, 641, 674, 678, 682, 683. Ho is the actual proportion of heterozygous individuals contained in each locus within the population, whereas the He is the expected value estimated per HWE. Ho and He are the most popular parameters used in estimating genetic diversity in population. The population structural and even historical information can be obtained from Ho and He. Most marker within the 36 tested ones have higher He than Ho (all except Cred211, Cred220, Cred225, Cred248, Cred276, Cred281, Cred297, Cred298), suggesting that different loci have lower genetic diversity in C. formosensis and the population of C. formosensis is an inbred. HWE describes that under ideal conditions, gene frequency does not change over time or generation. However, there will always be one or more interfering factors affecting gene frequency in nature. Therefore, HWE is difficult to achieve in nature. In this study, 23 loci out of 36 markers deviated from HWE (63.89% deviation rate). The reason for this deviation could be artificial selection, non-panmixia or genetic drift.
Polymorphism information content or power of information content (PIC). PIC value is an index of relative ability of the SSR marker's genetic variability. The higher the polymorphism of marker's genotype, the higher the PIC value34. Polymorphic markers were highly informative (PIC > 0.50), reasonably informative (0.50 > PIC > 0.25), and slightly informative (PIC < 0.25). Power of discrimination (PD)35 refers to the ability of genetic markers to distinguish individuals within a population. Obviously, in a population with more allele types and evenly distributed genotypes, the low probability of two random individuals have the same genotype, and the greater probability of two random individuals can be identified by the system. Probability of identity (PI)36 is the probability of two individuals with the same genotype. PD = 1-PI . The value of PIC, PD and PI of individual markers reflects its identification ability in the individual identification system. The greater PIC and PD, the lower PI in value, suggesting the higher identification ability of the marker, and vice versa. The levels of PIC range from 0.097 to 0.876, with average 0.528. The levels of PD range from 0.102 to 0.885, with average 0.567. The levels of PI range from 0.114 to 0.897, with average 0.431. There were 19 out of 36 markers with PIC greater than 0.5, and the mean of these 36 markers PIC values was greater than 0.5, suggesting the markers have a high identification ability. The results of PD and PIcorrespond to those of PIC. The highly informative markers presented in PIC also show higher identification ability in PD and PI.
Significant linkages (P < 0.001) were detected among Cred35/229/277, Cred47/298, Cred231/249/253/262, Cred281/297, Cred603/683 and Cred640/678/682 with GENEPOP 4.237, suggesting the abovementioned group located in the same linkage group (Table 2). When identifying several independent polymorphic genetic markers at the same time (polymorphic markers located in different linkage groups), the combined probability of identity (CPI) is the product of the PI of each genetic marker. At this time, CPI will be greatly reduced, and combined power of discrimination (CPD) will become very high. As defined above, CPI + CPD = 1. The credibility of the individual identification system is calculated based on "Random match probability in population size and confidence levels’" published by Budowle et al (2000)38. Confidence levels (CL) = (1 – CPI)N,where N is number of individuals.
The individual identification system was applied to illegal felling cases8. When the seized timber and the victim tree are identified to be the same individual plant, under the considerations of fairness and objective, the court usually adopts 99.99%, 99% or 95% confidence level as the credibility standard39 (Wall 2002, ISO ISO/IEC 17025). In this study, the locus with the lowest PI within a linkage group was used to calculate the CPI (Table 3). The CPI decreased along with the accumulation of loci, and with the PI of each locus sorted in ascending order. The system can accumulate up to 28 loci without linkage. When reaching its maximum capability, even under most strict standard (confidence level 99.99%) by dictated by court, this system can be used to identify 60 million C. formosensis, with CPI as low as 1.652×10-12, CPD as high 0.999999999998348 (almost equal to 1), beyond the known population size 32.06 ± 3.20 million of C. formosensis40. Under ideal condition, a minimal of 6 loci can be applied to the system, with an identifiable C. formosensis population of 2,900 under 95% confidence level. The CPI is as low as 1.728×10-5, and CPD is as high as 0.999982712603209 (Table 3).
Population genetics analysis
The fixation index (Fst) is a measure of population differentiation due to genetic structure41. A higher Fst value means a higher degree of difference between populations. When Fst is less than 0.05, there is no differentiation among populations. When Fst is between 0.05 and 0.15, there is low differentiation among populations. On the other hand, estimation of number of migrants (Nm) is a measure of gene flow value41. When Nm is greater than one, it represents genes frequently exchange which counteracts the genetic drift and prevents the population differentiation42. When Nm is greater than 4, it would be a random mating population43. Fst and Nm analysis on the 4 geographic areas were conducted with GeneAlex 6.50344 (Table 4). The Fst value of HV and GW is 0.035, suggesting no inter-population differentiation. The highest Fst value 0.074 was found between HV and MM. The Fst values range from 0.056 to 0.065 were found between the rest geographic areas, indicating a low differentiation among geographic areas. The highest Nm value 6.832 was found between HV and GW, whereas the lowest value 3.141 between HV and MM. The Nm values of 4 geographic areas were all greater than 1 (between 3.141 and 6.832), suggesting a frequent gene exchange between the 4 geographic areas, which offsets genetic drift and prevents population differentiation. For GW/MM (Nm = 4.022), GW/HV (Nm = 6.382), the Nm values of the population are greater than 4, suggesting that they are random mating populations.
STRUCTURE analysis45,46 was used to analyze the population genetic structure of C. formosensis (Fig. 3), and the Delta K value was calculated to obtain the optimal number of clusters. K and Delta K were show in Fig. 3a and K=2, 3 and 4 are drawn respectively in Fig. 3b. These individuals of C. formosensis present the most likely three clusters: The SY located in Eastern Taiwan is an independent cluster, the MM located in Southwestern Taiwan is another standalone cluster, whereas the two HV and GW geographic areas are in the same genetic cluster. Combing the results of Fst (Table 4), Nm (Table 4) and STRUCTURE analysis (Fig. 3), the data show that the C. formosensis of the four geographical areas belongs to the same genetic population. The Fst and STRUCTURE analysis data suggests that the samples fall into three clusters. The hypothesis that Taiwan Island is one of the plant refuges during the Quaternary glaciation47,48 may help to explain the results. In the research on the historical biogeography and phylogeny of cypress48, C. formosensis in Taiwan Island was separated from Chamaecyparis in Japan at 2.9 million years ago. The arrival of the Quaternary glaciation caused the extinction of many species and led a continued retreat of species to lower latitudes, thus Taiwan Island became a refuge for many ancient species. After the glaciation, species spread from the refuge to the surrounding area and formed species diversity on a latitude gradient49. In our results, the low polymorphism of C. formosensis in the four geographic areas’ individuals may be from the same big population of C. formosensis during the glaciation. After the glacial retreat, C. formosensis spread out from the refuge. Due to the influence of terrain and distance, C. formosensis in different geographical areas has a tendency to form different clusters.
GENECLASS v. 2.050 was applied to analyse the provenance of 92 individuals independently and the probability of samples returning to the correct provenance is 95.00% (MM), 88.00% (HV), 69.57% (GW), and 100.00% (SY), with an overall mean correct rate of 88.04% (Table 5). Three HV individuals were mis-assigned to GW and four GW individuals were mis-assigned to HV, corresponding to the observation that HV and GW are the same cluster. However, 3 GW individuals were mis-assigned to MM, possibly because the geographic location of GW is between HV and MM, and therefore GW has characteristics of north and south at the same time. Likelihood, one MM was mis-assigned to GW, further supporting the inference that there is partial gene exchange between MM and GW. Overall, our data shows that the populations in eastern (SY) and western Taiwan (the rest populations) have distinct differences in genotype. Within the western populations, the northern (HV) and the southern one (MM) have obvious differences. Therefore, when seizing timbers in the futures, the genotype can be served as a prefilter to infer the geographic area of the victim tree if the provenance is found to be MM, HV or SY. A further inspection is required if the provenance is found to be GW because of the existence of gene exchange between nearby geographic area.