De novo transcriptome assembly and development of EST-SSR markers for Pterocarpus santalinus L. f. (Red sanders), a threatened and endemic tree of India

The endemic and precious tree Pterocarpus santalinus L. f. (Red sanders) is a drought hardy species for conservation in peninsular India due to its high risk of illegal timber harvest. It is only found in Eastern Ghats of India, and has become threatened owing to overexploitation of its valuable timber. The development of genomic resources, particularly simple sequence repeat (SSR) markers, is essential for strict implementation of in situ conservation measures and application of DNA information based red sanders genetic resource management. However, a lack of genomic data and efficient molecular markers limit the study of its spatial and temporal population genetic structure, identification of diversity hotspots and tree improvement. The current study aims at comprehensive molecular characterization of red sanders and the somatic chromosome counts, flow cytometry and EST-SSR analyses. The results revealed that red sanders is diploid with 2n = 20 and the 2C genome size was 0.7872 ± 0.0561 pg for the first time in this species. A total of 3128 EST-SSRs were detected based on 25,854 de novo assembled unigenes from transcriptome data and primer sets designed for 1953 SSRs. Fifty-nine EST-SSR markers were evaluated for polymorphism in the natural populations of red sanders and 13 were found to be suitable for genetic analysis. Two major transcription factor families bHLH and ERF, responsible for abiotic stress and secondary metabolite synthesis were analysed which would provide the foundation for further research on production of medicinally important biocompounds.

In its native habit, red sanders grows in quartzite and shale geological formations (Raju 1999).
P. santalinus is a deciduous tree yielding extremely durable timber with wide variety of uses (Donga et al. 2017). The commercially valuable heartwood is red to purplish black in colour with interlocked grains, whereas the sapwood is pearly white in colour with little or no economic value. The heartwood has therapeutic properties and is also used to make carved wood products including musical instruments. The occurrence of wavy grained wood in nature is unusual and unpredictable; wood with such grain pattern is regarded as a unique commodity of high value.
Endemic species are more vulnerable for loss in genetic diversity than other taxa due to their limited geographic range (Coelho et al. 2020). Extreme exploitation and selective logging of mature trees in natural forests had lead to loss of genetic diversity and impairs the genetic structure of the populations. Human-imposed genetic diversity loss of red sanders has been faster in the last 50 years than at any previous period in history (Sarangi 2010). Biologists, ecologists, and biogeographers have been paying close attention to P. santalinus conservation (Kukrety et al. 2013;Pullaiah et al. 2019). Studies on endemism patterns provide critical information for conservation efforts, such as identifying hotspots and delineating seed zones. Identifying priority regions or hotspots where the genetic diversity is most threatened is essential when conservation resources are limited. Conservation units and their sizes are generally based on the theory of dynamic gene conservation which highlights the maintenance of evolutionary mechanisms within tree populations to safeguard their prospective for continuous adaptation (Aravanopoulos 2016). Such units are continuously managed with essential silvicultural operations, to facilitate genetic processes within the target tree populations. These units become the source of seeds for new plantations.
In such situation, spatially dispersed molecular data that incorporates genetic information of populations become vital, at a scale where regional decisions on conservation activities and investment are made. As population genetic studies have progressed, there has been an increasing appreciation for the value of regional patterns of genetic variants in adaptive and neutral molecular markers in the identification of strong seed zones for genetic conservation and restoration efforts. As DNA based markers have become more widely available, several indigenous tree species have been the subject of genetic studies (Hartvig et al. 2018;Castro et al. 2021). Researchers can analyse the present influence of drift and gene flow on populations using polymorphic markers inherited from both parents. In many trees with wider distribution, conservation efforts were guided by markers such as simple sequence repeats (Debbabi et al. 2021). Furthermore, assessing genetic structure and diversity combining both adaptive and neutral variations allows researchers to better understand the role of different microevolutionary processes in the formation of genetic diversity hotspots, which has major ecological, evolutionary, and conservation implications (Schueler et al. 2013). Certainly, it would assist in shedding insight on the processes that shaped contemporary genetic diversity hotspots at both the population and species levels. The majority of seed zonation and seed transfer recommendations have been based on ecological classifications, meteorological data, ecophysiological data, and field performance of provenances (Gomory et al.1998).
Despite the great economic importance and scarcity of information for conservation efforts, the use of genetic markers to analyse genetic variation in red sanders is highly restricted. Microsatellites or simple sequence repeat (SSR) markers generated from expressed sequence tags (EST) are increasingly being utilized to evaluate genetic diversity due to their relatively rapid development, cost-effectiveness and are less susceptible to null alleles (Uchiyama et al. 2013). They are developed from expressed regions of the genome with known or suggested functions with the potential to differentiate closely related individuals. Next generation sequencing based RNA-Seq approaches help to analyze the expression and function of genes in different plant species and generate wide variety of SSRs. RNA-Seq using Illumina is rapid, cheaper, and less reliant on existing genomic information than traditional library based techniques. In forest trees high throughput transcriptome sequencing has been successful in discovering SSR markers (Cornacini et al. 2021). These markers have a greater mutation frequency than random mutations, making them an excellent choice for building genomic maps, genotyping, assessment of genetic diversity and population structure. Although reported to be less polymorphic than genomic SSRs, these markers are superior in functional diversity in relation to adaptive variation and interspecific transferability. Until recently, very few studies on DNA markers have been developed for red sanders (Usha et al. 2013) and, so far, there is no report of SSR markers available for the species. Furthermore, there is no comprehensive publication on red sander genetic diversity based on molecular markers, indicating that this species has not received adequate attention for conservation and improvement. Consequently, applications of molecular markers for seed sources identification, conservation of genetic resources and genetic improvement are highly inadequate. The use of SSR markers in red sanders would speed up the tree improvement efforts by allowing researcher to link a seed lot to a seed source stand, study the levels of gene flow within and among populations and prioritise populations for ex situ and in situ conservation. Further, transcriptome analysis enables the identification of novel transcripts, differentially expressed genes, single copy protein coding genes and transcription factors, which could contribute to the discovery of key metabolic pathway candidates for functional studies and further application in red sanders improvement.
Hence, this study began with the objective to identify the exact number of chromosomes and genome size which is the foundation for any molecular breeding research in red sanders. The second objective was to generate and verify EST-SSR markers for use in red sanders, and to our knowledge, this is the first study to develop EST-SSRs by deep sequencing of transcriptome in Pterocarpus species. The transcriptome data obtained in this study will lay a foundation for further genetic analysis of P. santalinus and will be helpful for the discovery and functional annotation of new genes, mapping and marker based breeding.

Plant material
Young and fresh leaf sample from one seedling individual of P. santalinus from Tirupathi (TR) population was taken for RNA sequencing. To evaluate the polymorphisms of the EST-SSR markers developed from the transcriptome datasets and analyze the population genetic diversity of P. santalinus, samples were collected from a total of 42 individuals of P. santalinus from eight natural populations (2-11 individuals per population) in Andhra Pradesh, including Chittor East (CE), Puttor (PU), Mamandur North (MN), Mamandur South (MS) Tirupathi (TR), Balapalli (BP) Chammala (CH) and Srikalahasti (SKH) ( Table 1). Seedlings from the first five populations were used for chromosome count and DNA content estimation. The field data collections in all the natural distribution areas were undertaken along with the Andhra Pradesh State Forest Department authorities. All the leaf samples were silica dried and stored in room temperature before DNA isolation. In addition, three related species in Pterocarpus (P. marsupium Roxb., P. dalbergioides Roxb. ex DC and P. indicus Willd.) were used for testing the transferability of the polymorphic EST-SSR markers developed from P. santalinus. In each species three individuals were used for cross amplification purposes. Mitotic chromosome preparations from root tips were performed using seedlings from five different sources of red sanders. Young seedlings (3 months old) were root pruned to induce more adventitious roots and maintained in pots under greenhouse conditions in sand rich medium. The root tips (1-1.5 cm in length) were collected in three different months (April, July and October) in the morning hours (10.30-11.30). The sand free root tips were immediately immersed in a saturated solution of α-bromonaphthalene and incubated in darkness at room temperature for 3 h. The pretreated tips were subsequently fixed in Farmer's fixation solution (Ethanol:Glacial acetic acid, 3:1) overnight at 4 °C. Selected root tips were hydrolyzed in 1 N HCl for 13 min at 60 °C for maceration followed by enzyme treatment. Meristems were dissected using a scalpel blade, and digested in a mixture of 2.5% (v/v) cellulase from Aspergillus (Sigma-Aldrich) and 2.5%(v/v) pectinase from Aspergillus aculeatus (Sigma-Aldrich) in citrate buffer (pH 4.8) for 20 min at room temperature. Enzyme treated roots gently to release cells. The homogenate was filtered through a sterile nylon membrane filter cup, and the flow collected in a sterile Eppendorf tube. The tube was vortexed and incubated at room temperature for 5 min in Galbraith's buffer (Galbraith et al. 1983). The sample was then acquired into the flow cytometer (BD FACS-Verse, BD Biosciences, USA). Solanum lycopersicum L. 'StupickéPolní Rane' and Pisum sativum L. (pea) were used as references. The sample rate was adjusted to 40-50 nuclei/second and the DNA G 1 nuclei peaks were positioned by adjusting instrument gain settings. An FSC-SSC dot plot was generated to resolve the events based on size and granularity. A PI area to height plot was used to discriminate the doublets and the PI Area histogram to resolve the frequency distribution of the populations from singlets. The threshold discriminator was set on the PI channel to exclude all PI negative events. Measurements were taken up to 10,000 particles. The mean channel number of the G 1 sample peak was determined, and the DNA ploidy of P. santalinus was calculated. Three replicates were analysed with each replicate representing a separate individual. 2C value of sample was estimated according to the following formulae:

DNA isolation
Genomic DNA was isolated from 42 individuals of P. santalinus belong to 8 natural populations for validation of EST-SSR primer pairs designed in this study. Each DNA sample was extracted from 100 mg of silica dried leaf tissue. Leaves were homogenized with liquid nitrogen and DNA isolated using ArborEasy® DNA Isolation Kit (patent product of Institute of Forest Genetics and Tree Breeding, Coimbatore, India) according to the manufacturer's instructions with Proteinase K treatment. The quality of isolated genomic DNA was examined using 0.8% agarose gel electrophoresis, and the concentration was determined using a NanoDrop 8000 spectrophotometer (Nanodrop Technologies, USA). DNA samples were subsequently diluted for the PCR working concentration of 8.0 ng/μL and stored at − 20 °C for further use.
were then stained using leuco basic fuchsin under the dark condition for 30 min. Mitotic chromosome spreading was done by squashing method. Roots dyed with leuco basic fuchsin were immediately placed on a clean slide and added a drop of acetocarmine and squashed under the cover slip by pressing manually with thumb and observed using a light field microscope (Carl Zeiss, Oberkochen, Germany) equipped with AxioCam camera. The cells with well spread chromosomes were selected for counting followed by recording pencil marking of individual chromosomes. For each plant, at least five different root tips were used (one root tip per slide), and 5-10 metaphase plates were examined.
Flow cytometry Approximately 0.5 to 1 mg of young red sanders leaf tissue was excised and placed on a sterile petri plate. The tissue soaked in the extraction buffer was chopped RNA isolation, cDNA library preparation and sequencing Leaf sample was homogenized with liquid nitrogen and total RNA was extracted from each plant using the TRIzol-based RNA extraction kit (Thermo Fischer Scientific, USA). To ensure the accuracy of the data, the purity, concentration, and nucleic acid absorption peaks of the isolated RNA were detected using a Nanodrop spectrophotometer, and the RNA integrity was accurately tested with an Agilent 2100 instrument. Three µg of RNA from three biological replicates were pooled together and the sequencing library was prepared using the NEBNext Ultra RNA Library Prep Kit for Illumina (NEB, Ipswich, MA, USA) according to the manufacturer's instruction. Size selection was performed according to the manufacturer's protocol with the addition of AMPure XP beads (Beckman Coulter) for obtaining final library size of 400-600 bp. After the library quality assessment, cDNA libraries were sequenced on a flow cell using an Illumina HiSeq 2500 system (Illumina, San Diego, CA, USA) at Clevergene Biocorp Pvt. Ltd., Bengaluru, India. Raw data were stored in the sequence read archive (SRA) of NCBI (https:// submit. ncbi. nlm. nih. gov/ subs/ sra/) with the project id PRJNA782228.

Transcriptome assembly generation
Before assembly, the raw reads were filtered to remove poly A/T, low-quality sequences, and empty reads or reads with more than 10% of bases having Q < 30. Raw reads of transcriptome sequences were processed with Cutadapt 1.16.5 (Martin 2011) and Trimmomatic 0.36.5 (Bolger et al. 2014) tools to remove the adapter sequences and to trim low quality bases, respectively. The assembly of a de novo transcriptome using clean reads was performed using the short-read assembling program Trinity 2.9.1 software with parameters setting of minimum contig length of 200 bp (Haas et al. 2013). Trinity super transcript (trinity gene splice modeler) was used to cluster transcripts to obtain final unigenes. In order to validate the completeness of the reconstructed transcriptome analysis was performed with BUSCO v 5.0.0 software (Simao et al. 2015).

Functional annotation of unigenes
Identified red sanders unigenes were used to retrieve proteins with the highest sequence similarity along with putative functional annotations and macro level classification to view the distribution of gene functions. The open reading frames (ORFs) and coding sequences (CDS) were predicted from the unigenes using TransDecoder v 5.5.0 program (Haas et al. 2013). Unigenes and ORFs were annotated using BLASTX and BLASTP alignment tools (BLAST + v 2.12.0 software) with the e-value < 10 -5 against NCBI non-redundant protein (NR), UniProtKB, Gene Ontology (GO) and KEGG Orthology (KO) databases. The orthologous groups and protein function annotation were identified by WebMGA online server using COG, KOG and Pfam databases with default parameter (e < 10 -3 ) (http:// weizh ong-lab. ucsd. edu/ webMGA/ server/). The number of annotated unigenes linked with each GO term was estimated using this approach, which mapped all of the annotated unigenes to GO terms in the database.
Identification of single copy orthologs (SCOs) and molecular phylogeny Single-copy orthologs identification is crucial in functional classification of genes and phylogenetic analysis.  Table S1). The longest ORF sequences of all transcriptomes were retrieved using TransDecoder v 5.5.0 program (Haas et al. 2013). Orthofinder v 2.5.4 (Emms and Kelly 2015) tool was used to identify the SCO sequences. Each SCO groups were aligned using MUSCLE v5 program (Edgar 2004). The aligned sequences were concatenated into single alignment files and the phylogenetic tree was constructed using MEGA v 11.0 following maximum likelihood (ML) method with 1000 bootstrap replications, Jones-Taylor-Thornton (JTT) model (Tamura et al. 2021). A single heuristic search was performed by Nearest Neighbor interchange (NNI) method.

Molecular classification of bHLH and ERF family transcription factors
Transcription factors such as bHLH and ERF families, regulating the production of diverse array of secondary metabolites were identified from the red sanders unigenes. The protein coding longest ORF sequences from the transcriptome sequences of P. santalinus and T. tipu were BLAST against Arabidopsis thaliana using the plant transcription factor database (http:// plant tfdb. gao-lab. org/). T. tipu being a closely related tree species and belonging to Pterocarpus clade with information on transcriptome was chosen to classify the transcription factors. The protein sequences of bHLH, ERF transcription factor families were screened for phylogenetic analysis. The conserved domains and motifs of screened transcription factors were identified using MEME Suite v. 5.4.1 software (https:// meme-suite. org/ meme/) (Bailey et al., 2006) and NCBI Batch Web CD-search tool (https:// www. ncbi. nlm. nih. gov/ Struc ture/ bwrpsb/ bwrpsb. cgi) with default parameters. Full length amino acid sequences were aligned by MAFFT v 7.0 program using g-ins-i algorithm with BLOSUM 62 scoring matrix (Katoh and Standley 2013). The molecular phylogeny tree was constructed using the IQ TREE v 2.1.2 program (Minh et al. 2020) using default parameters with 1000 bootstrap replications. The groups/ clades of bHLH and ERF family transcription factors were named according to the classification followed in A. thaliana (Heim et al. 2003;Nakano et al. 2006). All sequences were renamed and listed in Supplementary file 1 Table S2 and Table S3.

Detection of EST-SSR markers and designing of primers
EST-SSR markers were detected in the assembled unigenes using the MicroSAtellite v.2.1 software (http:// pgrc. ipk-gater sleben. de/ misa/). Parameters were set with a minimum number of 7, 5, 5, 4, and 4 repeat units for identification of di-, tri-, tetra-, penta-, and hexanucleotide motifs, respectively and no mono-nucleotide repeats were considered. The maximum number of bases for two SSRs in an interrupted compound microsatellite was 100. EST-SSRs were detected and mined among the unigenes with length > 1000 bp. Unigenes with a sequence of more than 150 bp before and after the SSR region with known putative functions were used for primer design by Primer v3.0 (https:// bioin fo. ut. ee/ prime r3-0. 4.0/). Primer designing criteria consisted of primer length 18-23 bp, with optimum value 20 bp; T m 57-63 °C, with optimum value 60 °C; GC content 40-60%, with the optimum value 50%; maximum T m difference between forward and reverse primer 1.5 °C and product size range 100-300 bp optimum value 150 bp. Validation of these SSRs was carried out by synthesising 59 primer sets selected randomly.

PCR amplification and validation of EST-SSRs
The PCR amplification was performed in a 10 μL reaction volume, which included 1.5μL of template DNA (8 ng/µL), 0.5μL of the forward primer (10 μmol/L), 0.5μL of the reverse primer (10 μmol/L), 5.0μL of AmpliTaqGold™ 360 master mix, and 2.5 μL of double distilled water. PCR amplification was performed in a Veriti 96-well Thermal Cycler (Applied Biosystems™, USA) using the following temperature profile: pre-denaturation at 95 °C for 10 min; 35 cycles of denaturation at 95 °C for 45 s, the appropriate annealing temperature for 35 s, extension at 72 °C for 55 s; and a final extension at 72 °C for 7 min; and preservation at 4 °C. PCR products were resolved on vertical gel electrophoresis system (Cleaver scientific Ltd, UK) using 7% non-denaturing polyacrylamide gel with 1 × TAE buffer and then detected by silver staining (Huang et al. 2018).
The size and number of alleles present per marker was scored by AlphaEaseFC software (AlphaInnotech, San Leandro, CA, USA). Total number of polymorphic alleles, polymorphic information content (PIC) and heterozygosity (He) were calculated by Power Marker v 3.25 (Liu and Muse 2005).

Results and discussion
Chromosome counting and nuclear DNA content analysis The chromosome number of P. santalinus ascertained in this study was 2n = 20 (Fig. 1). However, the chromosome count observed in this study is different from earlier report (Bhaskar, 1981). Previous study on chromosome counts in P. santalinus showed 2n = 22 after observing seedlings obtained from 2 individuals trees growing in Mysore, which is a non native location. Recently Moraes et al. (2020) Table S4) indicating the possibilities of polyploidy and dysploidy. Nevertheless the species T. tipu, a sister species belonging to Pterocarpus clade was reported to have 2n = 20 (Coleman and Menezes 1980). Chromosome changes within species (dysploidy) were reported in dalbergioid legumes, which includes Pterocarpus clade and highlighted the importance of studying multiple populations of the species (Moraes et al. 2020). Accordingly, in this study after assessing five different populations, it was substantiated that the somatic chromosome number in P. santalinus was 20. Many publications suggest that T. tipu is phylogenically closely linked to the genus Pterocarpus, therefore such investigations may support these findings (Lavin et al. 2005;Klitgard et al. 2013;Hong et al. 2020).
The 2C nuclear DNA content of P. santalinus was estimated as 0.7872 ± 0.0561 pg, where the internal standards Solanum lycopersicum L. and Pisum sativum L. DNA content was 2.0 and 9.8 pg, respectively using flow cytometry (Fig. 2). The C-value (gametic DNA content) and the Cx value (genome content) were calculated based on the Fig. 1 Somatic chromosomes in root meristem cells of P. santalinus: A Chittor East, B Puttor, C Tirupathi, D Balapalli and E Chammala. Scale bars represent 10 μm. One individual was selected from each of the five natural populations estimated 2C nuclear DNA content of the studied plant and were expressed in terms of Mbp. Being diploid, the C-value and the 1Cx value were the same and half the 2C nuclear DNA content, i.e., 0.3936 pg or 393.6 Mbp. Thus, the 2C Nuclear DNA content is 0.7872 pg or 787.2 Mbp. In dalbergias, the estimated genome size ranged from 1.43-1.98 Gb (Hung et al. 2020), however no information available for Pterocarpus species. The genome size of P. santalinus has been described for the first time, which will serve as the foundation for whole genome sequencing and tree improvement.

Transcriptome assembly
Transcriptome research is one of the most essential tools for studying species biological processes. In the current study, using RNA-seq technology on the Illumina HiSeq 2500 platform, leaf transcriptome of P. santalinus was characterized with the objective of identifying EST-SSRs. Despite its economic importance (Pullaiah et al. 2019), there is a genetic information gap in the genus Pterocarpus, and no SSR markers are available. Totally 17,930,663 raw reads were generated and after the removal of adapters and sequences with poor quality, 16,182,076 sequences in the range of 20 to 151 bp were retained. De novo assembly produced 29,816 transcripts and 25,854 unigenes were obtained with total bases of 20,943,459 bp having mean transcript length of 810 bp with the N50 contig length of 1126 bp. Among the assembled transcripts about 90% had > 500 bp length (Table 2). Read mapping of the raw reads showed that 90.2% were aligned by Trinity 2.9.1 software. These results corroborated with the earlier findings on species such as Camelina sativa L. (Liang et al. 2013) where, 81.2% of the unigenes were annotated with gene descriptions or conserved protein domains. In the case of Myrciaria dubia Kunth McVaugh (Castro et al. 2020) de novo transcriptome assembly generated 68.19% of complete and partially assembled unigenes. Whereas in Stephanandra incisa (Thunb.) Zbl totally 35251unigenes were identified and 75.47% were annotated with Nr database . The use of PacBio SMRT sequencing technology for transcriptome sequencing resulted in high-quality transcripts longer than 1000 bp in many plant species , demanding third-generation technologies to be introduced for species like P. santalinus with little information on transcriptome and genome.

Functional annotation and classification
The 25,854 super transcripts were functionally annotated using NR, UniProtKB, KEGG, KOG databases (Supplementary Table 1). These annotated sequences provide a platform for further research into P. santalinus genetic diversity. By comparing the transcript sequence to UniProtKB with homologous species, most of the transcripts (75.8%) had best matches with Medicago truncatula Gaertn., a legume species and the second top-hit (9.3%) species was Actinidia chinensis var.chinensis Planch. (Supplementary file  1 Table S5) thus reflecting the scarcity of reports on the transcriptome of related Pterocarpus species. Among 25,854 unigenes 93.25% were assigned a specific or general function gene ontology terms, 63.26% of transcripts involved in molecular function, 56.39% as cellular components and 45.79% involved in biological process (Fig. 3a). The results of functional annotation in red sanders clearly indicated that the proportion of unigenes annotated in this study is higher than transcriptome data deficient plant species such as Panax vietnamensis Ha et Grushv. (Vu et al. 2020). Among all the unigenes involved in molecular function 6056 unigenes were responsible for catalytic activity, 5187 unigenes were responsible for protein binding and 4860 unigenes involved in response to stimulus. Among cellular components 8267 were involved in membrane bound organelles, 4445 were organelle parts and 3135 involved in plastids. KEGG annotation showed 1899 unigenes involved in the metabolism, 1921 unigenes in genetic information processing, 421 unigenes in signalling & cellular processing and 492 in environmental information processing ( Fig. 3b; Supplementary Table 1). Function annotation of the non-redundant unigenes was conducted by searching against the main databases. A total of 93.25% (24,110) of the non-redundant unigenes were annotated in UniProtKB, 20,608 unigenes in GO,18,476 unigenes in Pfam,8,754 unigenes in KEGG and 8,152 unigenes in KOG. Further, functional classification of P. santalinus was carried out with search against the Clusters of COG database, and 5163 unigenes allocated to 25 classes (Supplementary file 1 Fig. S1). The highest category was general function prediction (702, 13.6%), followed by amino acid transport and metabolism (270, 5.2%), and energy production and conversion (258, 5.0%). In KOG terms, the three major represented predictions were 233 transcripts to serine/threonine protein kinase family (KOG1187), 53 to E3 ubiquitin ligase family (KOG4628) and 47 to UDP-glucuronosyl and UDP-glucosyl transferase (KOG1192) (Supplementary file 1 Fig. S2). KEGG annotation provided 209 categories in which metabolism were represented maximum (Supplementary Table 1). The major pathways include biosynthesis of secondary metabolites, biosynthesis of amino acids, inositol phosphate metabolism, oxidative phosphorylation, RNA transport, ubiquitin mediated proteolysis, sulphur relay system, MAPK signalling pathway etc. These data reveals the active metabolic processes as well as synthesis of diverse metabolites in red sanders. In P. santalinus the wood has high value phytochemicals like alkaloids, phenols, saponins, glycosides, flavonoids, triterpenoids, sterols and tannins (Pullaiah et al. 2019) and these compounds help in adaptation for environmental stresses, such as drought. In the current study, we have recorded majority of unigenes for biosynthesis of secondary metabolites could enhance the utility value of red sanders.

Identification of single copy orthologs and phylogenetic tree construction
In many of the plant species, due to the non-availability of genomic resources, single copy protein coding genes are used as efficient markers for phylogenetic analysis (Li et al. 2017). In this study, Orthofinder identified 56,631 orthogroups from 15 species targeted for analysis. Within species group, an orthogroup is described as a collection of genes derived from a single gene of the last common ancestor (Emms and Kelly, 2015). The largest orthogroup contained 8419 genes (O50 = 8419) and 2683 orthogroups shared genes of all the 15 species. However, complete single copy genes were identified from 19 orthogroups only and were concatenated to generate the phylogenetic tree. The phylogenetic relationship of P. santalinus and the other plant species is shown in Fig. 4. In the orthologues phylogram, a clear separation between the outgroup Populus and Fabaceae was observed. As expected, the P. santalinus was placed within Pterocarpus clade along with T. tipu and confirmed the recent report on their status as sister taxa (Hong et al. 2020). Further all the members of Dalbergia genus grouped together followed by two Sesbania species and two additional Fabaceae members. While studying the Dalbergieae, it was suggested that Pterocarpus clade needs thorough phylogenetic analysis for their polyphyletic origin of species (Cordoso et al. 2013).

Classification of transcription factors
From the sequences of P. santalinus and T. tipu totally 606 and 486 transcription factor (TF) families were identified, respectively. Among these bHLH and ERF TFs were predominant with 105 and 77 respectively. The phylogenetic tree of bHLH family and ERF family transcription factors were represented in Fig. 5a and b. The phylogeny trees of bHLH and ERF were subdivided into 18 and 15 subgroups each (Supplementary file 1 Table S6 and Table S7) respectively, and majority of the groups had both the taxa. Generally the bHLH family proteins in plants are KEGG classification based on metabolism, genetic information processing and signalling and cellular processing grouped into 15 to 32 classes and increase in genome sequencing projects identify novel protein sequences.
Overwhelming evidences available to show that the bHLH genes have high relevance to abiotic stress tolerance and secondary metabolite biosynthesis including anthocyanins (Goossens et al. 2017;Guo et al. 2021). Similarly, the ERF transcription factors are another major group serve as important regulators in biotic and/or abiotic stress responses, disease resistance, plant hormone signal transduction, and metabolite regulation (Xie et al. 2019). Red sanders being the store house of plethora of phytochemicals with bioactive potential and have applications in pharmaceutical, ayurveda, cosmetics, liquor, and textile industries (Bulle et al. 2016;Pullaiah et al. 2019).
In-depth research on biosynthesis of key metabolites through RNA-Seq analysis would bring in more information on production of industrially important derivatives. The information generated in this study on transcription factor families would have intriguing avenues for translational research in P. santalinus, specifically in the areas of regulations of specialized metabolism. SSRs are frequently used in genetic analysis of plant species, and EST-SSRs are functional molecular markers with the benefits of easier and more efficient production, and more interspecific transferability when compared to genomic SSRs . EST-SSRs can be used to map the gene rich regions of chromosomes and obtain gene expression data because of their close link to functional genes.
In this study, all 25,854 unigenes were employed to mine potential EST-SSRs in P. santalinus and 3128 SSRs have been identified. Among these, a total of 3564 (13.8%) unigene sequences were found to encode 1953 potential EST-SSRs which had sufficient flanking regions to design the primer sets for PCR amplification (Supplementary Table 2). Of these, 746 unigenes contained more than one SSR and 344 SSRs were present in compound formation. There were 156 motif sequence types detected in 3128 identified SSRs having di, tri, tetra, penta, and hexa-nucleotide repeats with 3, 10, 12, 30 and 101 types respectively. Tri-nucleotide repeat motifs were the most abundant (1885, 60.26%) followed by di-nucleotide repeat motifs (818, 26.15%) and hexa-nucleotide repeat motifs (258, 8.25%), whereas penta-(105, 3.36%) and tetra-nucleotide repeat motifs (62, 1.98%) were found to be rare (Table 2). Abundance of tri-nucleotide repeats is common among the transcriptome of Legume members (Roorkiwal and Sharma, 2011;Wang et al. 2018), and many other species. Among the repeat types, AG/CT, AAG/CTT, AAC/GTT, ATC/ATG were predominant motifs and similar results were reported in most other Fabaceae plant species (Wang et al. 2014;Huang et al. 2016;Liu et al. 2019), Polymorphic validation of EST-SSRs in P. santalinus and transferability across related species All the 59 primers (Supplementary Table 3) were screened with four randomly selected individuals of P. santalinus and selected 13 primer pairs (Table 3; Supplementary file 1 Fig. S3) which amplified clearly and generated polymorphic alleles among the four selected individuals. All the 13 polymorphic primer pairs were used to analyze the 42 individuals from eight different geographic locations. Allele size was scored with AlphaEaseFC software (Alpha Innotech, USA). The 13 EST-SSR markers produced 151 polymorphic alleles (Table 3). The samples from Chamala population had highest percentage of polymorphism (84.6) and the lowest was from Balapalli population (46.2) ( Table 4). PIC value ranged from 0.7937 to 0.9185 with a mean value of 0.8639, genetic diversity ranged from 0.7792 to 0.9235 with a mean value of 0.8763 and heterozygosity ranged from 0.0714 to 0.5 with a mean value of 0.3223. The EST-SSR markers developed in this study are proven to be useful for genetic variability assessment of red sanders populations.
Cross-species transferability of 13 polymorphic SSR markers were examined in P. marsupium, P. dalbergioides and P. indicus and it was observed that all the 13 markers produced clear, sharp bands of expected product size, indicating the conserved sequences across species. Such cross-species transferability is a common phenomenon in legume plants (Datta et al. 2013;Singh et al. 2020) which increase the repository of SSR markers for related species where minimum genomic information available. Many species of the genus Pterocarpus have international significance and listed under the IUCN Red List of Threatened Species. These novel co-dominant markers would find their applications in gene flow and population genetic structure of P. santalinus, for identifying genetic diversity hotspots, genecological zones and development of suitable conservation guidelines. Using number SSR markers and accessions from entire natural distribution area, genetic diversity and population structure are being studied to obtain better knowledge about the structure and evolution of P. santalinus.

Conclusions
To the best of our knowledge, this is the first comprehensive report on confirmation of somatic chromosome number, estimation of genome size, generation of transcriptome data and development of EST-SSR markers in P. santalinus. This study provided an important genetic resource which would shed light on the role of major transcription factor families such as bHLH and ERF for elucidating their role in various secondary metabolite biosynthetic pathways in P. santalinus especially for those with medicinal value and allow for drug development. Additionally red sanders being an illegally traded commodity timber, the EST-SSR markers have become valuable resource for timber tracking through fingerprinting.
Author contributions Chromosome studies, conduct of PCR experiments, data analysis and, drafting the article were performed by Sindhu Agasthikumar, Aghila Samji, Nithishkumar Kannan and Vijayakumar Arivazhagan; Genomic DNA isolation for the samples was carried out by Aiswarya Munusamy; Conduct of RNA isolation and bioinformatic analysis was done by Maheswari Patturaj; Field collection of samples and

Data availability
The datasets generated during the current study are available in the sequence read archive (SRA) of NCBI (https:// submit. ncbi. nlm. nih. gov/ subs/ sra/) with the ID PRJNA782228.

Conflict of interests
The authors have no relevant financial or non-financial interests to disclose.