Chromosome counting and nuclear DNA content analysis
The chromosome number of P. santalinus ascertained in this study was 2n = 20 (Fig. 1). This count differs from the earlier report of 2n = 22 (Bhaskar, 1981), which was based on seedlings obtained from two individual trees growing in Mysore, a non-native location. Recently, Moraes et al. (2020) studied chromosome number evolution in the Pterocarpus clade and reported somatic chromosome counts of 20 and 40 for species such as Centrolobium tomentosum Guillemin ex Benth. (2n = 20), Stylosanthes hamata (L.) Taub. (2n = 20), S. scabra Vog. cv. Seca (2n = 40), S. seabrana B.L.Maass & 't Mannetje (2n = 20) and S. viscosa Swartz (2n = 20). Chromosome counts (parsed n) reported for different Pterocarpus species include 10, 11, 12, 17 and 22 (Rice et al. 2015; Supplementary File 1), indicating the possibility of polyploidy and dysploidy. Nevertheless, T. tipu, a sister species belonging to the Pterocarpus clade, was reported to have 2n = 20 (Coleman and Menezes, 1980). Chromosome number changes within species (dysploidy) have been reported in dalbergioid legumes, which include the Pterocarpus clade, highlighting the importance of studying multiple populations of a species (Moraes et al. 2020). Accordingly, five different populations were assessed in this study, which substantiated that the somatic chromosome number of P. santalinus is 2n = 20. Several studies suggest that T. tipu is phylogenetically close to the genus Pterocarpus, so its reported count of 2n = 20 may lend support to these findings (Lavin et al. 2005; Klitgard et al. 2013; Hong et al. 2020).
The 2C nuclear DNA content of P. santalinus was estimated by flow cytometry as 0.7872 ± 0.0561 pg, using Solanum lycopersicum L. and Pisum sativum L. as internal standards with DNA contents of 2.0 and 9.8 pg, respectively (Fig. 2). The C-value (gametic DNA content) and the Cx value (genome content) were calculated from the estimated 2C nuclear DNA content and expressed in Mbp. Because the species is diploid, the C-value and the 1Cx value are identical and equal to half the 2C nuclear DNA content, i.e., 0.3936 pg or 393.6 Mbp. Thus, the 2C nuclear DNA content is 0.7872 pg or 787.2 Mbp. In dalbergias, the estimated genome size ranged from 1.43 to 1.98 Gb (Hung et al. 2020); however, no information is available for Pterocarpus species. This is the first report of the genome size of P. santalinus, and it will serve as a foundation for whole genome sequencing and tree improvement.
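The relation between the 2C, C and Cx values can be checked with a short calculation. The following is a minimal Python sketch, not the authors' procedure: the function name `genome_size` and its parameters are hypothetical, the factor of 1000 Mbp per pg is the one implied by the figures quoted in the text, and the factor of 978 Mbp per pg is an assumption included only for comparison because it is widely used in the flow cytometry literature.

```python
# Conversion factors, in Mbp per picogram of DNA.
MBP_PER_PG_ROUNDED = 1000.0  # factor implied by the figures quoted in the text
MBP_PER_PG_ALTERNATIVE = 978.0  # assumption: commonly cited factor, not used in the text


def genome_size(two_c_pg, ploidy=2, mbp_per_pg=MBP_PER_PG_ROUNDED):
    """Return (1C pg, 1Cx pg, 1C Mbp, 2C Mbp) from a 2C estimate in pg."""
    one_c_pg = two_c_pg / 2.0        # gametic DNA content (C-value)
    one_cx_pg = two_c_pg / ploidy    # monoploid genome content (Cx value)
    return one_c_pg, one_cx_pg, one_c_pg * mbp_per_pg, two_c_pg * mbp_per_pg


one_c, one_cx, one_c_mbp, two_c_mbp = genome_size(0.7872)
print(f"1C = {one_c:.4f} pg = {one_c_mbp:.1f} Mbp; 2C = {two_c_mbp:.1f} Mbp")

# Same estimate with the alternative factor, for comparison only.
alt_one_c_mbp = genome_size(0.7872, mbp_per_pg=MBP_PER_PG_ALTERNATIVE)[2]
print(f"1C with 978 Mbp/pg = {alt_one_c_mbp:.1f} Mbp")
```

For a diploid the C-value and the 1Cx value coincide because `ploidy` is 2. The choice of conversion factor changes the Mbp figures by about 2%.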
Transcriptome assembly
Transcriptome research is one of the most essential tools for studying biological processes in a species. In the current study, the leaf transcriptome of P. santalinus was characterized using RNA-seq technology on the Illumina HiSeq 2500 platform, with the objective of identifying EST-SSRs. Despite its economic importance (Pullaiah et al. 2019), there is a genetic information gap in the genus Pterocarpus, and no SSR markers are available. In total, 17,930,663 raw reads were generated, and after removal of adapters and poor-quality sequences, 16,182,076 sequences in the range of 20 to 151 bp were retained. De novo assembly produced 29,816 transcripts, from which 25,854 unigenes were obtained, with total bases of 20,943,459 bp, a mean transcript length of 810 bp and an N50 contig length of 1126 bp. Among the assembled transcripts, about 90% were longer than 500 bp (Table 2). Mapping the reads back to the assembly generated with Trinity 2.9.1 showed that 90.2% were aligned. These results are consistent with earlier findings in species such as Camelina sativa L. (Liang et al. 2013), Myrciaria dubia (Kunth) McVaugh (Castro et al. 2020) and Ulmus wallichiana Planch. (Singh et al. 2021). The use of PacBio SMRT sequencing technology for transcriptome sequencing has resulted in high-quality transcripts longer than 1000 bp in many plant species, allowing third-generation technologies to be introduced for species with little genomic information (Wu et al. 2020).
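The N50 statistic quoted for the assembly is the length of the contig at which the cumulative length, counted from the longest contig downwards, first reaches half of the total assembled bases. A minimal Python sketch of that definition follows; the function name `n50` and the toy contig lengths are hypothetical and are not taken from the assembly reported here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical toy contig lengths (bp), for illustration only.
toy_lengths = [100, 200, 300, 400, 500]
toy_n50 = n50(toy_lengths)
toy_mean = sum(toy_lengths) / len(toy_lengths)
print(toy_n50, toy_mean)
```

In the toy data the N50 (400 bp) exceeds the mean (300 bp). The assembly shows the same pattern, with an N50 of 1126 bp against a mean length of 810 bp, because N50 weights long contigs more heavily than the mean does.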
Functional annotation and classification
The 25,854 super transcripts were functionally annotated using the NR, UniProtKB, KEGG and KOG databases (Supplementary Table 1). These annotated sequences provide a platform for further research into the genetic diversity of P. santalinus. When the transcript sequences were compared with homologous species in UniProtKB, most of the transcripts (75.8%) had their best matches with Medicago truncatula Gaertn., a legume species, and the second top-hit species (9.3%) was Actinidia chinensis var. chinensis Planch. (Supplementary File 1), reflecting the scarcity of transcriptome reports for related Pterocarpus species. Among the 25,854 unigenes, 93.25% were assigned specific or general functional gene ontology terms: 63.26% of transcripts were involved in molecular function, 56.39% in cellular components and 45.79% in biological processes (Fig. 3a). The functional annotation results clearly indicate that the proportion of unigenes annotated in red sanders is higher than in other plant species with limited transcriptome data, such as Panax vietnamensis Ha et Grushv. (Vu et al. 2020). Among the unigenes involved in molecular function, 6056 were responsible for catalytic activity, 5187 for protein binding and 4860 were involved in response to stimulus. Among cellular components, 8267 unigenes were assigned to membrane-bound organelles, 4445 to organelle parts and 3135 to plastids. KEGG annotation showed 1899 unigenes involved in metabolism, 1921 in genetic information processing, 421 in signalling and cellular processing and 492 in environmental information processing (Fig. 3b; Supplementary Table 1). Functional annotation of the non-redundant unigenes was conducted by searching against the main databases. A total of 93.25% (24,110) of the non-redundant unigenes were annotated in UniProtKB, 20,608 unigenes in GO, 18,476 in Pfam, 8,754 in KEGG and 8,152 in KOG. Further, functional classification of P. santalinus unigenes was carried out by searching against the Clusters of Orthologous Groups (COG) database, and 5163 unigenes were allocated to 25 classes (Supplementary File 1). The largest category was general function prediction (702, 13.6%), followed by amino acid transport and metabolism (270, 5.2%) and energy production and conversion (258, 5.0%). In KOG terms, the three most represented predictions were the serine/threonine protein kinase family (KOG1187; 233 transcripts), the E3 ubiquitin ligase family (KOG4628; 53 transcripts) and UDP-glucuronosyl and UDP-glucosyl transferase (KOG1192; 47 transcripts) (Supplementary Table 1). KEGG annotation yielded 209 categories, among which metabolism was the most represented (Supplementary File 1). The major pathways include biosynthesis of secondary metabolites, biosynthesis of amino acids, inositol phosphate metabolism, oxidative phosphorylation, RNA transport, ubiquitin-mediated proteolysis, sulphur relay system and MAPK signalling pathway. These data reveal the active metabolic processes as well as the synthesis of diverse metabolites in red sanders. The wood of P. santalinus contains high-value phytochemicals such as alkaloids, phenols, saponins, glycosides, flavonoids, triterpenoids, sterols and tannins (Pullaiah et al. 2019), and these compounds help in adaptation to environmental stresses such as drought. In the current study, the majority of unigenes were recorded for biosynthesis of secondary metabolites, which could enhance the utility value of red sanders.
Identification of single copy orthologs and phylogenetic tree construction
In many plant species for which genomic resources are not available, single-copy protein-coding genes are used as efficient markers for phylogenetic analysis (Li et al. 2017). In this study, all possible single-copy orthologs (SCOs) were identified for phylogenetic tree construction, to assess the relationship between P. santalinus and 11 other Fabaceae members by combining the publicly available transcriptomes. Three Populus species were included as outgroups. The Fabaceae members included 7 species of the tribe Dalbergieae (6 species belonging to the genus Dalbergia and T. tipu, a Pterocarpus clade species), 2 Sesbania species, Enterolobium cyclocarpum (Jacq.) Griseb. and Faidherbia albida. OrthoFinder identified 56,631 orthogroups from the 15 species targeted for analysis. Within a group of species, an orthogroup is defined as the set of genes descended from a single gene in the last common ancestor of those species (Emms and Kelly, 2015). The largest orthogroup contained 8419 genes (O50 = 8419), and 2683 orthogroups contained genes from all 15 species. However, complete single-copy genes were identified in only 19 orthogroups, and these were concatenated to generate the phylogenetic tree. The phylogenetic relationship of P. santalinus and the other plant species is shown in Fig. 4. In the ortholog-based phylogram, a clear separation between the outgroup Populus and Fabaceae was observed. As expected, P. santalinus was placed within the Pterocarpus clade along with T. tipu, confirming the recent report on their status as sister taxa (Hong et al. 2020). Further, all the members of the genus Dalbergia grouped together, followed by the two Sesbania species and the two additional Fabaceae members. In a study of the Dalbergieae, it was suggested that the Pterocarpus clade needs thorough phylogenetic analysis to resolve the polyphyletic origin of its species (Cordoso et al. 2013).
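The selection of single-copy orthogroups described above amounts to keeping only those orthogroups in which every species contributes exactly one gene. The following Python sketch illustrates that filter on toy data. It does not read OrthoFinder output files, and the function name, species labels and gene identifiers are all hypothetical.

```python
def single_copy_orthogroups(orthogroups, species):
    """Return IDs of orthogroups with exactly one gene in every species."""
    return sorted(
        og_id
        for og_id, members in orthogroups.items()
        if all(len(members.get(sp, [])) == 1 for sp in species)
    )


# Hypothetical toy data: orthogroup ID -> {species label: [gene IDs]}.
species = ["P_santalinus", "T_tipu", "Populus_sp"]
orthogroups = {
    # one gene per species: qualifies as single copy
    "OG1": {"P_santalinus": ["ps1"], "T_tipu": ["tt1"], "Populus_sp": ["po1"]},
    # two copies in one species: excluded
    "OG2": {"P_santalinus": ["ps2", "ps3"], "T_tipu": ["tt2"], "Populus_sp": ["po2"]},
    # missing from one species: excluded
    "OG3": {"P_santalinus": ["ps4"], "T_tipu": ["tt3"]},
}

sco = single_copy_orthogroups(orthogroups, species)
print(sco)
```

Requiring presence in all species and a single copy in each is a strict criterion, which is consistent with only a small number of orthogroups surviving the filter when 15 transcriptomes are compared.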
Classification of transcriptional factors
From the sequences of P. santalinus and T. tipu, a total of 606 and 486 transcription factor (TF) family members were identified, respectively. Among these, bHLH and ERF TFs were predominant, with 105 and 77 members, respectively. The phylogenetic trees of the bHLH and ERF transcription factor families are presented in Fig. 5a and Fig. 5b. The bHLH and ERF trees were subdivided into 18 and 15 subgroups, respectively (Supplementary File 1), and the majority of the subgroups contained sequences from both taxa. In general, bHLH family proteins in plants are grouped into 15 to 32 classes, and the growing number of genome sequencing projects continues to identify novel protein sequences. There is overwhelming evidence that bHLH genes are highly relevant to abiotic stress tolerance and to the biosynthesis of secondary metabolites, including anthocyanins (Goossens et al. 2017; Guo et al. 2021). Similarly, ERF transcription factors are another major group that serve as important regulators of biotic and abiotic stress responses, disease resistance, plant hormone signal transduction and metabolite regulation (Xie et al. 2019). Red sanders is a storehouse of a plethora of phytochemicals with bioactive potential, which have applications in the pharmaceutical, ayurveda, cosmetics, liquor and textile industries (Bulle et al. 2016; Pullaiah et al. 2019). In-depth research on the biosynthesis of key metabolites through RNA-Seq analysis would provide more information on the production of industrially important derivatives. The information generated in this study on transcription factor families opens avenues for translational research in P. santalinus, specifically in the regulation of specialized metabolism.
Identification of EST-SSRs from P. santalinus transcriptome
SSRs are frequently used in genetic analysis of plant species, and EST-SSRs are functional molecular markers that are easier and more efficient to produce and have greater interspecific transferability than genomic SSRs (Wu et al. 2020). Because of their close link to functional genes, EST-SSRs can be used to map the gene-rich regions of chromosomes and to obtain gene expression data. In this study, all 25,854 unigenes were searched for potential EST-SSRs in P. santalinus, and 3128 SSRs were identified. Among these, a total of 3564 (13.8%) unigene sequences were found to encode 1953 potential EST-SSRs with sufficient flanking regions for designing primer sets for PCR amplification (Supplementary Table 2). Of these, 746 unigenes contained more than one SSR, and 344 SSRs were present in compound formation.
A total of 156 motif sequence types were detected among the 3128 identified SSRs, comprising di-, tri-, tetra-, penta- and hexa-nucleotide repeats with 3, 10, 12, 30 and 101 types, respectively. Tri-nucleotide repeat motifs were the most abundant (1885, 60.26%), followed by di-nucleotide (818, 26.15%) and hexa-nucleotide repeat motifs (258, 8.25%), whereas penta- (105, 3.36%) and tetra-nucleotide repeat motifs (62, 1.98%) were rare (Table 2). An abundance of tri-nucleotide repeats is common in the transcriptomes of legume members (Roorkiwal and Sharma, 2011; Wang et al. 2018) and of many other species. Among the repeat types, AG/CT, AAG/CTT, AAC/GTT and ATC/ATG were the predominant motifs, and similar results have been reported in most other Fabaceae species (Wang et al. 2014; Huang et al. 2016; Liu et al. 2019).
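How perfect di- to hexa-nucleotide repeats can be located in a unigene sequence is illustrated by the regular-expression sketch below. This is a simplified Python illustration and not the tool used in this study: the minimum repeat numbers in `MIN_REPEATS` are assumed thresholds chosen for the example, the function names are hypothetical, and compound SSRs are not handled.

```python
import re

# Assumed minimum repeat counts per motif length (illustrative only).
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(unit):
    """True if the motif is not itself a repetition of a shorter motif."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq):
    """Return a list of (motif, repeat count, 0-based start), sorted by start."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A motif of unit_len bases followed by at least min_rep - 1 copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if not is_primitive(unit):
                continue  # e.g. AGAG is reported as AG; AAA is a homopolymer
            repeats = len(match.group(0)) // unit_len
            hits.append((unit, repeats, match.start()))
    return sorted(hits, key=lambda hit: hit[2])


# Hypothetical sequence containing (AG)7 and (AAG)5.
toy_seq = "TTGC" + "AG" * 7 + "CCATC" + "AAG" * 5 + "TTC"
toy_hits = find_ssrs(toy_seq)
print(toy_hits)
```

The `is_primitive` check prevents a long di-nucleotide run from being counted again as a tetra- or hexa-nucleotide repeat, which would otherwise inflate the counts for the longer motif classes.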
Polymorphic validation of EST-SSRs in P. santalinus and transferability across related species
All 59 primer pairs (Supplementary Table 3) were screened with four randomly selected individuals of P. santalinus, and 13 primer pairs (Table 3; Supplementary File 1) that amplified clearly and generated polymorphic alleles among the four individuals were selected. These 13 polymorphic primer pairs were used to analyze 42 individuals from eight different geographic locations. Allele size was scored with AlphaEaseFC software (Alpha Innotech, USA). The 13 EST-SSR markers produced 151 polymorphic alleles. Samples from the Chamala population had the highest percentage of polymorphism (84.6%) and those from the Balapalli population the lowest (46.2%) (Table 3). The PIC value ranged from 0.7937 to 0.9185 with a mean of 0.8639, genetic diversity ranged from 0.7792 to 0.9235 with a mean of 0.8763 and heterozygosity ranged from 0.0714 to 0.5 with a mean of 0.3223. The EST-SSR markers developed in this study have thus proven useful for assessing genetic variability in red sanders populations.
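Genetic diversity and PIC are both functions of the allele frequencies at a locus. The Python sketch below applies the standard formulas, gene diversity = 1 − Σp_i² and PIC = 1 − Σp_i² − ΣΣ 2p_i²p_j² (summed over allele pairs i < j), to hypothetical allele frequencies. The function names and the example frequencies are illustrative and are not taken from the data of this study, and the software actually used for these estimates may apply sample-size corrections that are omitted here.

```python
from itertools import combinations


def gene_diversity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content for one locus."""
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross_term


# Hypothetical locus with four equally frequent alleles.
toy_freqs = [0.25, 0.25, 0.25, 0.25]
toy_diversity = gene_diversity(toy_freqs)
toy_pic = pic(toy_freqs)
print(toy_diversity, toy_pic)
```

PIC is always at most equal to gene diversity, and both approach 1 as the number of alleles at similar frequencies grows, which is why loci with many alleles yield values close to the upper end of the ranges reported above.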
Cross-species transferability of the 13 polymorphic SSR markers was examined in P. marsupium, P. dalbergioides and P. indicus, and all 13 markers produced clear, sharp bands of the expected product size, indicating that the sequences are conserved across species. Such cross-species transferability is a common phenomenon in legumes (Datta et al. 2013; Singh et al. 2020) and increases the repository of SSR markers for related species for which minimal genomic information is available. Many species of the genus Pterocarpus have international significance and are listed on the IUCN Red List of Threatened Species. These novel co-dominant markers would find application in studies of gene flow and population genetic structure of P. santalinus, in identifying genetic diversity hotspots and genecological zones, and in developing suitable conservation guidelines. Genetic diversity and population structure are being studied using a number of SSR markers and accessions from the entire natural distribution area, to gain better knowledge of the structure and evolution of P. santalinus.