Discovery of novel SSR markers from transcriptome data of Astronium fraxinifolium Schott, a threatened tree species in Brazil

Astronium fraxinifolium is an endangered tree species from Brazil. Due to its high importance for environmental reforestation, as well as for the use of its wood, management programs for the conservation of this species are needed. Simple sequence repeats (SSR), or microsatellite markers, have been widely used in population genetic studies across diverse organisms. In this study, we report for the first time SSR markers for A. fraxinifolium, as well as their frequency and distribution, from transcriptome data.


Background
Forest fragmentation directly impacts the genetic diversity, population structure, mating system and gene flow of tree populations. Population genetic studies based on genetic markers are key to understanding the effects of anthropogenic interventions on natural populations, and to the conservation and improvement of tree species. Astronium fraxinifolium Schott (Anacardiaceae) trees occur discontinuously on rocky terrain in different Brazilian biomes, from Cerrado to Caatinga [1,2]. It is an insect-pollinated dioecious tree used for the restoration of degraded areas [2]. Due to the intense fragmentation of its biomes, A. fraxinifolium is classified as threatened with extinction, and the remaining populations are found as isolated trees along highway margins or in small forest fragments [3,4]. Thus, the development of genetic markers such as microsatellite loci (SSR) is urgently needed as a tool for investigating the genetic diversity, population structure, mating system, and gene flow of the remaining populations of the species.
Genomic studies in the genus Astronium are scarce. More recently, high-throughput sequencing technologies have accelerated the development of molecular markers in a broad range of organisms [5,6]. Searching these datasets has increased the chances of finding SSRs without any prior enrichment. As a result, the development of SSR markers from transcriptome sequences has become an effective tool for population genetic investigations [7,8], especially for endangered species [8]. Here, we developed a set of 20 polymorphic microsatellite loci for A. fraxinifolium and evaluated their frequency and distribution based on Illumina RNA-Seq data. These loci were then validated for reproducibility so that they can be used to assess genetic diversity, mating system and gene flow in this tree species. We also included an analysis of repeats and GO classification of these reads.

Identification and classification of SSR markers
In this work, the sequencing run produced 189 million paired-end reads, of which more than 95 million were successfully joined. Trinucleotide motifs were the most abundant, followed by di-, hexa-, penta-, and tetranucleotide motifs (approximately 41.8, 37.5, 9.8, 7.6, and 3.4%, respectively; Table 1). Of these, on average 32% of the sequences had enough flanking sequence for primer design, except for the tetra- and trinucleotide loci, where this was 40.9 and 36.9%, respectively. From an initial screening, more than 125 thousand sequences (redundant sequences; available upon request) were identified with tandem repeats, and we then designed twenty primer pairs for amplification and testing in a population study.
Insert Table 1
The SSR functional annotation was classified under the three major categories: cellular component, molecular function and biological process.

Genetic diversity of natural populations
The use of SSRs derived from RNA-Seq studies increases the success of amplification, including in related species, because the flanking sites in transcribed regions are conserved. All designed microsatellite primer pairs amplified successfully and were polymorphic in the studied populations, with 4 to 11 alleles detected per locus and per-locus values ranging from 0.346 to 0.857 (Table 2), which indicates that the markers are useful for population studies.
Insert Table 2
Table 2 reports, for each locus: number of alleles per locus; polymorphism information content; observed heterozygosity; expected heterozygosity; fixation index; null allele occurrence; genetic differentiation between populations; P < 0.05.
Despite the history of fragmentation of the A. fraxinifolium populations, the SSR loci showed a large amount of genetic variation. In Ilha Solteira, observed heterozygosity ranged from 0 to 0.944 (mean 0.674) and expected heterozygosity from 0.533 to 0.871 (mean 0.741). In Selvíria, observed heterozygosity ranged from 0.133 to 1.0 (mean 0.668) and expected heterozygosity from 0.606 to 1.0 (mean 0.797).
The fixation index ranged from −0.394 to 0.494 (mean 0.090) in Ilha Solteira and from −0.069 to 0.783 (mean 0.162) in Selvíria. Null alleles were observed in four and six loci in Ilha Solteira and Selvíria, respectively. After sequential Bonferroni correction, genotypic linkage disequilibrium (LD) was observed in four pairs of loci in Ilha Solteira and in three pairs in Selvíria (Table 3). The values in Table 3 represent the probability of genotypic disequilibrium after 19,000 permutations of alleles among individuals. Probability after Bonferroni correction: P = 0.000263 (α = 0.05).
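The two numerical relationships in this paragraph can be checked with a short sketch. It assumes the common definition of the fixation index, F = 1 − Ho/He, and assumes that the Bonferroni threshold divides α by the number of locus pairs (20 loci give 190 pairs). The function names are illustrative and are not taken from FSTAT.

```python
from math import comb


def fixation_index(h_obs: float, h_exp: float) -> float:
    """Fixation index under the assumed definition F = 1 - Ho/He."""
    if h_exp <= 0:
        raise ValueError("expected heterozygosity must be positive")
    return 1.0 - h_obs / h_exp


def bonferroni_threshold(n_loci: int, alpha: float = 0.05) -> float:
    """Per-test threshold when every pair of loci is tested for LD."""
    n_pairs = comb(n_loci, 2)  # 20 loci -> 190 pairwise tests
    return alpha / n_pairs


if __name__ == "__main__":
    # Mean Ho and He reported in the text for each population.
    print(round(fixation_index(0.674, 0.741), 3))  # Ilha Solteira
    print(round(fixation_index(0.668, 0.797), 3))  # Selvíria
    print(round(bonferroni_threshold(20), 6))
```

Applied to the population means, this approximation agrees with the reported mean fixation indices (0.090 and 0.162) and with the stated threshold P = 0.000263. Note that a mean of per-locus F values need not equal F computed from mean heterozygosities in general.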
Insert Table 3
To test the genetic similarity between populations, we used DAPC and STRUCTURE analyses. The genetic differentiation between populations (0.363) was high, although a large part of the genetic variation is distributed within rather than among populations. The high genetic differentiation between the populations was expected given their geographical distance (50 km). The species is pollinated by bees, for which limited travel distances have been reported [1], which can explain the high genetic differentiation between the two populations. The PCA showed a clear differentiation of the two populations, with some individuals mixed between them. Furthermore, DAPC and both assignment probability tests (from the adegenet package and STRUCTURE) gave similar results for population structure, with two distinct populations (Fig. 2).
Fig. 2 (caption fragment) Colors reflect each population: Ilha Solteira (red) and Selvíria (blue). c Assignment test from the adegenet package. d STRUCTURE results from analysis at optimal K = 2. c, d Each column and its colors reflect the genetic assignment of individuals: in c, Ilha Solteira (brown) and Selvíria (blue); in d, Ilha Solteira (green) and Selvíria (red).
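For reference, the differentiation value discussed above is described in the methods as based on Hedrick's statistic. Assuming this refers to Hedrick's standardized measure, it can be written as follows, where H_S is the mean within-population expected heterozygosity, H_T the total expected heterozygosity, and k the number of populations. These symbols are introduced here for illustration and are not taken from the original text.

```latex
G_{ST} = \frac{H_T - H_S}{H_T},
\qquad
G'_{ST} = \frac{G_{ST}\,(k - 1 + H_S)}{(k - 1)\,(1 - H_S)}
\;\;\overset{k = 2}{=}\;\;
\frac{G_{ST}\,(1 + H_S)}{1 - H_S}
```

Because highly polymorphic loci cap the maximum attainable G_ST well below 1, the standardized form rescales it by that maximum, which makes values comparable across loci with different heterozygosity.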

Discussion
In the A. fraxinifolium transcriptome data, tri- and dinucleotide motifs predominated, together accounting for more than 79% of all identified contigs. Trinucleotide repeats can expand or contract without perturbing the reading frame, and so may leave protein structure unaffected [9,10]. When all repeat numbers were analyzed, SSRs with more than 10 repeats corresponded to less than 7.8% of all SSRs (Table 4). SSR frequency decreased as motif length increased, as reported for Magnolia wufengensis [11]. AG/CT motifs accounted for more than 24.3% of SSRs, making them the most abundant motif in this species, followed by TCT/AGA repeats with less than 6% (Fig. 3). This frequency of AG/CT repeats is higher than that found in other species such as bamboo (17.11%) [12], but lower than in Magnolia (37.8%) [11]. High frequencies of AG repeats have also been reported for other plant species, and it has been suggested that they could be related to the mutational mechanisms that generate SSRs or to selective pressure on particular sequences [9,10,11,13,14].
Insert Table 4
The microsatellite markers developed were efficient in differentiating the sampled populations genetically. The average observed and expected heterozygosity levels were above those reported for populations of Astronium graveolens [15] and for other tropical species, such as Cedrela fissilis (Meliaceae) [16], Campomanesia xanthocarpa (Myrtaceae) [17], Myracrodruon urundeuva [18] and Eugenia uniflora L. (Myrtaceae) [19], which confirms the existence of high genetic variability in the populations studied here. Therefore, these genetic markers are reliable for population genetic studies, such as investigating pollen and seed dispersal patterns, which helps to explain the current distribution of natural populations and bears on the evolutionary history of a species.
Previous studies of other tree species show a wide range of pollen dispersal distances: Hymenaea stigonocarpa showed long-distance pollen dispersal of more than 8 km between the populations analyzed [20], and Ceiba pentandra reached 18 km [21,22]. However, these long distances are mainly related to dispersal by bats, which have a large feeding area. For A. fraxinifolium, which is pollinated by bees, distances of almost 6 km have been reported for the insects' feeding behavior [23,24]. These results indicate that further investigation of pollen and seed dispersal is necessary for the species. To date, few studies have been conducted on natural populations of A. fraxinifolium, and these focused on silvicultural traits [25]. Recently, SSR loci were described for A. graveolens, but they were not tested in other Astronium species [15]. Given this, the microsatellite markers developed in this work may be useful in genetic studies of diversity and genetic structure, gene flow and mating system, providing information for conservation, breeding and reforestation plans for the species. In addition, our study provides a database of more than 125 thousand expressed SSR sequences in the genome that will serve as a basis for studies of the consequences of forest fragmentation in the tropical forests of Brazil, thus contributing to the development of adequate strategies for the conservation of A. fraxinifolium and related species of the Anacardiaceae family.

Conclusions
This study reports the first SSR markers for A. fraxinifolium. The frequency and distribution of SSR motif types showed great diversity, with a predominance of trinucleotide motifs, as reported for other plant species. Functional annotation of SSRs can help future breeding programs in the selection of genes related to important developmental traits. The use of transcriptome-derived SSRs can also increase the rate of amplification in related species, due to the conserved flanking sequences of these loci. At the population level, these SSR markers showed sufficient polymorphism in both populations analyzed. Therefore, the results suggest that these markers can be used as tools for ecological and population genetic studies of genetic diversity, spatial genetic structure, mating system and gene flow, and can support the development of genetic conservation strategies and the management of fragmented populations and related species.

This software took the raw reads as input, processed them in a quality-sorting module, then joined the reads and completed the search for SSRs. Primer design was conducted in BatchPrimer3 v1.0 [28]. Functional annotation of SSR-containing coding sequences was performed in Blast2GO, using the Ensembl Plants database from UniProt [29].
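The SSR search step can be illustrated with a minimal regular-expression sketch. This is not the algorithm of the software used in the study: the minimum repeat counts in MIN_REPEATS are hypothetical thresholds chosen for the example, and the helper names are invented. The canonical_motif function shows one way to pool a motif with its rotations and reverse complement, which is how classes such as AG/CT are usually grouped.

```python
import re

# Hypothetical minimum repeat counts per motif length (di- to hexanucleotide).
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, motif, repeat_count) for perfect tandem repeats."""
    seq = seq.upper()
    hits = []
    for size, min_rep in sorted(min_repeats.items()):
        # A motif of `size` bases followed by at least (min_rep - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. AGAG, AAA).
            if any(size % d == 0 and motif == motif[:d] * (size // d)
                   for d in range(1, size)):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // size))
    return hits


def canonical_motif(motif):
    """Smallest rotation of the motif or of its reverse complement."""
    rev_comp = motif.translate(_COMPLEMENT)[::-1]
    rotations = [m[i:] + m[:i] for m in (motif, rev_comp)
                 for i in range(len(m))]
    return min(rotations)


if __name__ == "__main__":
    contig = "TTGC" + "AG" * 8 + "CCATG" + "TCT" * 6 + "GGA"
    for start, motif, count in find_ssrs(contig):
        print(start, motif, count, canonical_motif(motif))
```

Counting canonical motifs over all contigs would give a motif frequency table comparable in form to the one discussed for AG/CT and TCT/AGA, although the exact percentages depend on the thresholds used.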

Sample collection and DNA extraction for validation analysis
For the population analysis, genomic DNA was isolated from fresh leaves using the cetyltrimethylammonium bromide (CTAB) protocol [30] and quantified in a NanoDrop ND-1000 Spectrophotometer (NanoDrop Products, DE, USA); its integrity was verified in 1% agarose gels run in 1X TBE at a constant voltage of 5 V/cm. We selected 20 SSR loci for population validation. Polymerase chain reactions were performed in a Mastercycler thermocycler (Eppendorf, Hamburg, Germany) with the addition of an M13 tail for fluorescent labeling [31] in a reaction mixture …
Insert Table 5

Statistical population analysis
The number of alleles per locus, observed and expected heterozygosity, and polymorphism information content were estimated using CERVUS 3.03 [32]. The fixation index and genotypic linkage disequilibrium (LD) were estimated for each population using FSTAT [33]. To test whether the fixation index and LD values were significantly different from zero, we used Monte Carlo permutations and a Bonferroni correction (95%, α = 0.05). Micro-Checker v.2.2.3 [34] was used to detect the possible occurrence of null alleles, and genetic differentiation was estimated based on Hedrick's statistic [35]. The adegenet package [36] in the R environment was used to conduct the …
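The per-locus statistics named above can be sketched from diploid genotypes as follows. The sketch assumes the standard definitions: expected heterozygosity as 1 − Σp², with Nei's small-sample correction as an option, and the polymorphism information content formula of Botstein et al. Whether CERVUS applies the correction by default should be checked against its documentation. The function name and the example allele sizes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def locus_stats(genotypes, unbiased=False):
    """Summarize one locus from a list of (allele1, allele2) genotypes."""
    n = len(genotypes)
    if n == 0:
        raise ValueError("at least one genotype is required")
    counts = Counter(allele for pair in genotypes for allele in pair)
    freqs = [c / (2 * n) for c in counts.values()]

    # Observed heterozygosity: share of individuals carrying two alleles.
    h_obs = sum(a != b for a, b in genotypes) / n

    # Expected heterozygosity under Hardy-Weinberg proportions.
    sum_sq = sum(p * p for p in freqs)
    h_exp = 1.0 - sum_sq
    if unbiased:  # Nei's small-sample correction (assumed optional here)
        h_exp *= (2 * n) / (2 * n - 1)

    # Polymorphism information content.
    pic = 1.0 - sum_sq - sum(2 * p * p * q * q
                             for p, q in combinations(freqs, 2))

    return {"alleles": len(counts), "Ho": h_obs, "He": h_exp, "PIC": pic}


if __name__ == "__main__":
    # Hypothetical allele sizes (bp) for four individuals at one locus.
    example = [(150, 150), (150, 154), (154, 158), (150, 158)]
    print(locus_stats(example))
```

PIC is never larger than the uncorrected expected heterozygosity for the same allele frequencies, since it subtracts the share of matings that are uninformative about parental origin.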