Frequency and distribution of SSRs in TF genes
De novo assembly of high-quality reads using the Trinity assembler yielded 194,558 non-redundant (NR) transcripts. A search for microsatellite loci in the individual assembled NR transcripts identified 16,867 SSR motifs (Table 1a). A BLAST search against Plant TFdbII confirmed 2776 TF-encoding transcripts harbouring 3687 SSRs (3263: perfect; 423: compound repeats) (Table 1b). Among the TF SSRs, di-repeats were the most abundant (2269; 61.5%), followed by tri- (1284; 34.8%), tetra- (58; 1.57%), hexa- (41; 1.11%) and penta-repeats (35; 0.94%). Furthermore, the most represented motifs were AG/CT, AT/TA and AC/GT among di-repeats; AAG/CTT, ACC/GGT, ATC/ATG, AGG/CCT and AAC/GTT among tri-repeats; AAAG/CTTT, AAAT/ATTT and AGAT/ATCT among tetra-repeats; AAAACC/GGTTTT, AACACC/GGTGTT and AACCAG/CTGGTT among hexa-repeats; and AAAAG/CTTTT, AAAAC/GTTTT and AAAAT/ATTTT among penta-repeats (Fig. S1). Further, localisation identified 1063, 752 and 1383 SSR loci in the 5'UTR, 3'UTR and ORF, respectively, with 33 loci in conserved functional domains (FD). Interestingly, the ORF showed a higher density of tri-repeats, while the 5' and 3' UTRs showed a high abundance of di-repeats. The higher tri-repeat abundance in the ORF is possibly due to stronger selection against frameshift mutations in the coding region than in the UTRs, which limits the expansion of non-triplet motifs (Table 2a, 2b and Fig. 1).
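The motif search described above can be illustrated with a minimal sketch. The tool and thresholds actually used are not stated in this section, so the minimum repeat counts in MIN_REPEATS and the function name find_ssrs are assumptions chosen only for illustration. The sketch detects perfect repeats only and ignores compound repeats.

```python
import re

# Assumed minimum repeat counts per motif length (hypothetical thresholds,
# not taken from the study).
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, motif, repeat_count) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for size, min_rep in sorted(min_repeats.items()):
        # A motif of `size` bases followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit, e.g. "AA" or
            # "AGAG", so a di-repeat is not also reported as a tetra-repeat.
            if any(motif == motif[:k] * (size // k)
                   for k in range(1, size) if size % k == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // size))
    return hits


if __name__ == "__main__":
    example = "TTT" + "AG" * 8 + "CCC" + "AAG" * 5 + "T"
    for start, motif, count in find_ssrs(example):
        print(start, motif, count)
```

Start positions returned by such a search could then be compared with predicted ORF coordinates to assign each locus to the 5'UTR, ORF or 3'UTR.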
Transcription factor derived MicroSatellite (TTFMS)
Of the 3687 SSR loci present in 2776 putative TFs of tea, 1843 SSR loci were successfully used to design flanking primer pairs, most of which localized to the ORF (925), followed by the 5'UTR (651) and 3'UTR (267). Among the TF SSRs localized to the ORF, 479 were tri-nucleotides, while 421 were di-nucleotides. In contrast, the 267 TF SSRs localized to the 3'UTR were dominated by di-nucleotides (178), followed by tri-nucleotides (81). Likewise, di-nucleotides (513) followed by tri-nucleotides (114) predominated among the TF SSRs in the 5'UTR (Fig. 1 and Fig. S1). The novelty of the tea transcription factor derived microsatellites was established by cross-referencing them against publicly available SSR marker resources in tea. In this study, a resource of 1843 novel TF-derived microsatellite markers [TTFMS: 1815; TTFDMS (Functional Domain): 28] was created for the first time in tea (https://www.ihbt.res.in/en/miscellaneous/genomic-resources).
To evaluate their polymorphic potential, a panel of 862 functionally relevant TTFMS markers localised in UTRs (5'UTR: 289 and 3'UTR: 129), ORF (416) and functional domains (28) of TFs involved in the regulation of yield, quality and stress-responsive genes was synthesized for experimental validation (Fig. 2 and Fig. S2). Of these, 689 SSR markers [TTFMS: 666, (79.8%); TTFDMS: 23] amplified products of the expected size in randomly selected diverse tea genotypes. Furthermore, the identification of 589 (68%) novel, highly polymorphic TF-derived SSR (TTFMS and TTFDMS) markers indicates their utility for efficient genotyping applications in tea (Table 3 and https://www.ihbt.res.in/en/miscellaneous/genomic-resources).
Polymorphic potential and phylogenetic analysis
In total, 2864 alleles were obtained by genotyping the 589 polymorphic novel TTFMS and TTFDMS markers in randomly selected diverse tea genotypes. Across the tested genotypes, the number of alleles (Na) ranged from 2 to 17 per locus, while the mean gene diversity (He) and observed heterozygosity (Ho) were 0.48 and 0.73, respectively. Furthermore, polymorphic information content (PIC) varied from 0.11 to 0.90 with a mean of 0.60 (Additional file 1). Interestingly, the gene diversity and PIC recorded here were comparable with an earlier EST-SSR marker based genetic diversity assessment of China tea genotypes (Yao et al. 2012). Additionally, SSR marker loci localized to UTRs showed higher polymorphic potential than those localized to ORF regions. Likewise, the higher polymorphic potential of SSR loci containing longer repeats may be attributed to their higher slippage rate relative to shorter repeats. Similarly, polymorphic TTFMS marker loci localized to the 5'UTR (736) were abundant owing to the prevalence of non-triplet microsatellites (mostly di-repeats), which are more prone to replication slippage. The current data identified 438 novel polymorphic TTFMS markers with high PIC values (>0.5), abundant in the ORF (220) and UTRs (5'UTR: 157; 3'UTR: 61). Being localized to UTR and ORF regions, TTFMS and TTFDMS marker loci may affect gene expression and function and hence may have greater relevance for genome mapping and trait dissection studies [22]. The polymorphic markers retrieved here underscore the functional importance of genic markers, which has also been reported earlier in crops such as rice [4], chickpea [23-24] and sugarcane [5]. Perhaps the most interesting observation of the study was polymorphism in the functional domains of TFs, as variation in conserved functional domains has been correlated with agronomic traits in plants and with various cancers and neural diseases in humans [25].
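For readers unfamiliar with the summary statistics reported above, the following minimal sketch shows how He, Ho and PIC can be computed for a single locus. It uses the standard textbook definitions (He = 1 - sum of squared allele frequencies; PIC as defined by Botstein et al.) and assumes diploid genotype calls with no missing data. The software used in the study is not stated in this section, and the function names and example allele sizes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def allele_frequencies(genotypes):
    """Allele frequencies from a list of diploid genotypes, e.g. [(100, 104), ...]."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def gene_diversity(freqs):
    """Expected heterozygosity, He = 1 - sum(p_i^2) (no small-sample correction)."""
    return 1.0 - sum(p * p for p in freqs.values())


def polymorphic_information_content(freqs):
    """PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    p = list(freqs.values())
    homozygosity = sum(x * x for x in p)
    cross_term = sum(2 * (a * a) * (b * b) for a, b in combinations(p, 2))
    return 1.0 - homozygosity - cross_term


def observed_heterozygosity(genotypes):
    """Ho = fraction of individuals carrying two different alleles."""
    return sum(1 for a, b in genotypes if a != b) / len(genotypes)


if __name__ == "__main__":
    # Hypothetical allele sizes (bp) at one SSR locus in four genotypes.
    locus = [(100, 104), (100, 100), (104, 104), (100, 104)]
    freqs = allele_frequencies(locus)
    print(gene_diversity(freqs))                   # 0.5
    print(polymorphic_information_content(freqs))  # 0.375
    print(observed_heterozygosity(locus))          # 0.5
```

Per-locus values computed this way would be averaged across markers to obtain means of the kind reported in the text.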
The marker resource developed in the current study opens an opportunity to understand the relevance of expansion/contraction and polymorphism of SSR motifs. Further, 15 novel TTFMS markers were used for diversity characterization of 115 tea genotypes (China hybrids) maintained at CSIR-IHBT (Table S2 and S3). The fifteen polymorphic marker loci individually distinguished all the tested genotypes and grouped them into five clusters. Interestingly, tea genotypes also grouped together broadly according to phenotypic traits and tea cultivation zones in the Kangra region (Palampur, Baijnath and Dharamshala). Clusters I (22 genotypes), III (29 genotypes) and IV (11 genotypes) represented the majority of genotypes from the Palampur and Baijnath regions, viz. Banuri Selection (BS), BGPs and Bhattu22, with regional distribution and phenotypic traits such as early flushing and drought tolerance. Interestingly, TH09 and T383, China hybrids from other geographical locations, also grouped in Cluster I. Cluster II included 55 China hybrid tea genotypes with phenotypic attributes (quality, yield and big leaves), mainly representing collections from the Palampur and Baijnath regions (KangraJat, KangraAsha, Mbal7, 10, 17, 18, Mpat02, 05, 12, Saloh 1, 2, Baijnath1 and Rajpura04). Nevertheless, tea cultivars from the Dharamshala region (CEF01, CEF02, CPF01, Sbari02, 04, Hoodle01, Khalag02 and 04) with desirable attributes such as early flushing, dark and broad green leaves and drought tolerance were found intermixed (Fig. 3).
Suitability of TTFMS and TTFDMS markers for genetic mapping
SSRs have been widely utilised for genetic map construction in various plant species. Therefore, to determine the utility of the newly developed polymorphic markers for genetic map construction in tea, the 589 candidate markers were tested in two bi-clonal populations. Parental lines along with 10 individuals (F1 populations) were genotyped with the novel TTFMS markers. Four different segregating alleles were identified and, overall, five expected segregation codes were recorded. Interestingly, of 265 SSR markers (80 of which were null), 185 markers representing the hk x hk (77), lm x ll (49), nn x np (47), ab x cd (9) and ef x eg (3) segregation patterns were found appropriate and can in future be utilised for constructing a genetic map and establishing marker-trait associations for targeted traits in tea (Additional file 2).
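The five segregation codes listed above follow the convention commonly used for outcrossing (CP-type) populations. As a rough illustration of how such codes can be derived from parental genotypes, the sketch below classifies a marker from the allele pairs of the two parents. The function name and example allele sizes are hypothetical, null alleles are not handled, and the study's own scoring procedure may differ.

```python
def segregation_code(parent1, parent2):
    """Classify a codominant marker from two diploid parental genotypes.

    Returns one of "hk x hk", "ef x eg", "ab x cd", "lm x ll", "nn x np",
    or None when both parents are homozygous (no segregation in the F1).
    """
    alleles1, alleles2 = set(parent1), set(parent2)
    het1, het2 = len(alleles1) == 2, len(alleles2) == 2

    if het1 and het2:
        shared = len(alleles1 & alleles2)
        if shared == 2:
            return "hk x hk"   # same two alleles in both parents
        if shared == 1:
            return "ef x eg"   # three alleles, one shared
        return "ab x cd"       # four distinct alleles
    if het1:
        return "lm x ll"       # only the first parent is heterozygous
    if het2:
        return "nn x np"       # only the second parent is heterozygous
    return None


if __name__ == "__main__":
    # Hypothetical parental allele sizes (bp) for three markers.
    print(segregation_code((150, 154), (150, 154)))
    print(segregation_code((150, 154), (150, 150)))
    print(segregation_code((150, 154), (158, 162)))
```

Markers for which the function returns None, or for which a parent fails to amplify, would not be informative for mapping in such a cross.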
Functional classification
Gene ontology (GO) analysis was performed to assign functions to the TF-encoding transcripts harbouring SSR motifs. In total, 7686 GO terms were assigned to 2128 transcripts, with 4538 (49%) GO terms in 15 biological process categories, 1401 (15.3%) GO terms in 7 molecular function categories and 3176 (34.8%) GO terms in 7 cellular component categories. Within biological processes, biological regulation (GO:0065007), regulation of metabolic processes (GO:0019222) and regulation of gene expression (GO:0010468) were most represented, followed by response to stimulus (GO:0050896), response to abiotic stress (GO:0009628), metabolic processes (GO:0008152), cellular macromolecule biosynthesis processes (GO:0034645) and transcription (GO:0006350). Likewise, in the cellular component category, cell (GO:0005623), cell part (GO:0044464), intracellular organelle (GO:0043229) and nucleus (GO:0005634) were most abundant. Similarly, in molecular functions, the sub-categories transcription regulator activity (GO:0030528) and transcription factor activity (GO:0003700) were, as expected, the most represented, followed by catalytic activity (GO:0003824), transferase activity (GO:0016740) and kinase activity (GO:0016301) (Fig. S3 and S4). Overall, GO categorisation of the TF genes harbouring SSRs highlighted their role in regulating the basic cellular machinery responsible for yield and quality, as well as the complex pathways involved in various biotic and abiotic stress responses.
Importance of SSR loci in TF genes of tea
TFs control the physiological and regulatory networks of various functional genes to maintain normal growth and responses to various biotic and abiotic stresses in higher plants [26]. The fact that a large proportion of the Arabidopsis genome (7.5%) encodes TFs clearly indicates their potential role in gene regulation [27]. In the current study, TFs harbouring SSRs were most abundant in the bHLH family, followed by the Myb-related, WRKY, C2H2, C3H, ERF, NAC, FAR1, G2-like and MYB TF families (Fig. 4a). Di-repeats were most abundant in the bHLH and Myb families, tri-repeats in bHLH and WRKY, tetra-repeats in bHLH, and penta-repeats in bHLH and MYB. Interestingly, the bHLH, Myb-related and MYB TF families were found to harbour more SSR loci than other TF families. Primers designed for SSR loci present in TF families, viz. bHLH (53), WRKY (37), C3H (32), Myb-related (31) and NAC (30), were found to be highly polymorphic and stable, indicating the potential functional significance of these markers (Fig. 4b). Most TF families, such as B3, bZIP, C2H2, C3H, G2-like, GRAS, MYB and WRKY, contained AT-rich SSR motifs. Moreover, the bHLH, FAR1, Myb-related and NAC families contained both AT- and GC-rich SSRs and, interestingly, the GATA TF family was the only one found to contain GC-rich repeats alone (Fig. S5). Furthermore, analysis of SSR loci in conserved functional domains revealed that tri-repeats were more frequent than di-repeats (Fig. 5).
Potential of codon reiteration in TFs of tea
SSR repeats in coding regions contribute to repetitive patterns in protein sequences, as tandem tri- and hexa-SSR repeats lead to single amino acid repeats (SAARs) (Fig. S6). SAAR formation, or codon reiteration, is a unique mechanism that increases protein size, and some codons are reiterated more than others. Interestingly, homogeneous codon reiteration and SAARs provide a huge opportunity for evolution through growth of the coding region followed by mutational modification of the added sequence immediately after its addition, which avoids the ill effects of certain repeats such as polyglutamine. The current data indicate that tri-repeats present in the ORF region of TF-encoding transcripts of tea contain abundant codons encoding serine (19%), followed by glycine (11%), leucine (11%), aspartic acid (10%), threonine (10%), glutamine (10%), glutamate (10%), proline (10%) and histidine (9%), with tyrosine the least abundant (1%) (Fig. S6). Of these, serine and glycine are considered to be the most favoured amino acids (AAs) of a polypeptide; however, proline and leucine, which are hydrophobic in nature, were also found to be abundant in the sequences of tea [28]. Furthermore, AA reiterations of different lengths in the coding regions of tea TFs were also identified; the codons and frequencies of reiterations containing five to ten and eleven to twenty AA residues are shown in Fig. S7a and S7b. Reiterants containing 5-10 amino acid residues (serine and leucine) were more numerous than those containing 11-20 (serine). Leucine, one of the most abundant AAs, is also among the most frequently reiterated in TFs of tea. In total, 42 transcripts were identified in which a second reiterant was found near another reiterant in the coding region. However, SSR repeats sometimes get interrupted by mutations; likewise, three contigs harbouring tri-repeat SSRs encoding histidine and glycine appear to have been interrupted by mutation over time [29] (Fig. S8a - S8d).
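As a simple illustration of how reiterants of the kind described above can be located, the sketch below scans a translated protein sequence for uninterrupted runs of a single amino acid and bins them into the two length classes used in the text (5-10 and 11-20 residues). The function names and the example sequence are hypothetical, and interrupted repeats are deliberately not detected.

```python
import re
from collections import Counter

# A run of five or more identical residues (one-letter amino acid codes).
SAAR_PATTERN = re.compile(r"([ACDEFGHIKLMNPQRSTVWY])\1{4,}")


def find_reiterants(protein):
    """Return (start, residue, run_length) for each single amino acid repeat."""
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in SAAR_PATTERN.finditer(protein.upper())]


def bin_by_length(reiterants):
    """Count reiterants per (length class, residue); longer runs are ignored."""
    bins = Counter()
    for _, residue, length in reiterants:
        if 5 <= length <= 10:
            bins[("5-10", residue)] += 1
        elif 11 <= length <= 20:
            bins[("11-20", residue)] += 1
    return bins


if __name__ == "__main__":
    # Hypothetical protein fragment with serine and leucine runs.
    fragment = "M" + "S" * 6 + "A" + "L" * 5 + "K" + "S" * 12
    runs = find_reiterants(fragment)
    print(runs)
    print(bin_by_length(runs))
```

A transcript with two or more entries in the list returned by find_reiterants would correspond to the case in which a second reiterant lies near the first.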