Identification and Validation of Intra-Species Transferability of Genome-Wide Functional SSR Markers in Glycine max

DOI: https://doi.org/10.21203/rs.3.rs-2651467/v1

Abstract

Genic codominant multiallelic markers are essential for characterizing the genetic variation, population diversity and evolutionary history of a species. Soybean (Glycine max) is a major legume crop, important both as a protein-rich pulse and as a high-recovery oilseed. To date, no genome-wide genic SSR markers have been reported for this important crop. This article aims to identify and validate regulatory gene-derived SSR markers in soybean. The coding sequences of Glycine max transcription factors (TFs) were downloaded from PlantTFDB, and SSRs were identified and localized using the Perl 5 script MISA (MIcroSAtellite identification tool). Primers flanking the SSRs were designed, the chromosomal distribution was determined, and Gene Ontology searches were performed using Blast2GO. Twenty randomly selected SSR markers were validated to check their intra-species transferability, and a genetic diversity study was performed. A set of 1138 simple sequence repeat markers from transcription factor coding genes was designed and designated as TF-derived SSR markers. They were anchored on the 20 G. max chromosomes, and the SSR motif frequency was one per 4.64 kb. Trinucleotide repeats were the most abundant, whereas tetra- and pentanucleotide repeats were the least frequent. The Gene Ontology search revealed diverse roles of SSR-containing TFs in soybean. Eight soybean accessions were analyzed with the twenty candidate genic SSR markers, and a principal coordinate analysis and a genic dissimilarity-based unweighted neighbor-joining tree were constructed. Our findings will serve as a potential functional marker resource for marker-assisted selection and genomic characterization of soybean.

Introduction

To conserve and utilize genetic resources, molecular markers are crucial for evaluating and understanding genetic diversity. Simple sequence repeats (SSRs) or microsatellites are multi-allelic codominant markers composed of tandemly repeated 1-15 nucleotide core units. Due to their desirable genetic characteristics, SSR markers have been widely used in a variety of applications, such as population diversity studies, genetic maps, marker-assisted selection, and genome-wide association studies. SSR markers can be developed from genomic libraries or from public sequence resources such as expressed sequence tags (ESTs) and transcriptome sequences, but developing them from genomic libraries is expensive and inefficient. In contrast, SSRs derived from ESTs lie in transcribed regions and may contribute to metabolism and gene evolution, which supports their role as "functional genetic markers" for identifying genes and quantitative trait loci (QTLs) for agricultural traits. The availability of genome sequences in recent years has facilitated the development of genome-wide functional SSR markers.

In plants, transcription factors (TFs) act as trans-activating factors and bind to the promoters of target genes. TFs are crucial for stress tolerance, growth, and evolution [1]. A thorough understanding of transcription factors is therefore necessary to understand the diversity of life on earth. Recent advances in array-based sequencing technologies have helped in the identification and annotation of many plant TFs [2]. TFs with well-characterized functional domains are available and continuously enriched, offering excellent candidate transcripts and serving as a valuable resource for developing new sequence-based functional microsatellite markers. The development of transcription factor-derived microsatellite markers in plants has been reported in only a few species, such as chickpea [3], Medicago truncatula [4], Lilium sp. [5], tea [6], flax [7] and Paeonia rockii [8]. These markers have potential use in marker-assisted genetic improvement and have applications in genotyping or fingerprinting. Based on cropping area and production, the legume family is the second most important food and forage source after the grass family [9]. The conservation of genomic sequences among different legume crops enables the transfer of technology from well-studied legume species such as Glycine max to others [10]. Many SSR markers have already been derived from G. max genomic or EST sequences [11]. However, no studies have yet addressed the development and use of TF-derived genic markers in G. max. New functional, codominant, multiallelic microsatellite markers with relatively high polymorphic potential, based on full-length G. max TF coding sequences, will be indispensable for various applications in genetics, genomics, and breeding programs. Therefore, this study began by analyzing the frequency and distribution of SSRs in the G. max TF genes, followed by developing and characterizing those markers and assessing their cross-species transferability [12]. In this study, we developed TF-derived SSR markers that will be useful for the assessment of genetic relationships, marker-assisted selection, and comparative genomic studies in leguminous and non-leguminous species.

Materials and Methods

Plant Material and DNA Isolation

A total of eight soybean genotypes (Supplementary Table S2), including cultivars and landraces, were used to examine the transferability of twenty randomly selected TF-derived SSR markers developed in this study. The accessions were collected for genetic diversity analyses from the ICAR-VPKAS Experimental Farm, Hawalbagh, Almora, Uttarakhand, India. DNA was extracted from leaf tissue using a CTAB protocol [13]. DNA quality was checked on 0.8% agarose gels, and DNA quantity was estimated using a NanoDrop ND1000 instrument (Thermo Scientific, Waltham, MA, USA).

Identification of SSRs and Primer Design

A total of 6150 TF coding sequences of Glycine max were downloaded from PlantTFDB [2] and used for the identification and localization of SSRs with the Perl 5 script MISA (MIcroSAtellite identification tool) [14]. The minimum length criteria were ten repeat units for mono-nucleotide repeats, six repeat units for di-nucleotide repeats, and five repeat units for tri-, tetra-, penta- and hexa-nucleotide repeats. The maximum interruption allowed between two SSRs in a compound SSR was 100 base pairs (bp). After identification of the SSRs, flanking primers were designed in batch mode using Primer3 software through Perl 5 interface modules. The parameters for primer design were as follows: amplicon size of 100–352 bp; primer length of 18–27 bases with an optimum of 20 bases; annealing temperature between 57 and 64°C with an optimum of 60°C; GC content of 46–50%.
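The search criteria above can be illustrated with a minimal Python sketch. This is not MISA itself (MISA is a Perl script); it is a simplified regular-expression scan that assumes perfect repeats only and uses the repeat thresholds and the 100 bp interruption limit stated in the text. The function names `find_ssrs` and `group_compound` and the example sequence are hypothetical.

```python
import re

# Minimum repeat counts per motif length, as stated in the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}
# Maximum distance (bp) between two SSRs that are merged into one compound SSR.
MAX_INTERRUPTION = 100


def find_ssrs(seq):
    """Return perfect SSRs as (start, end, motif, repeats); 0-based, end exclusive."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One motif copy captured, then at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AA" or "ATAT"); these are reported at the shorter unit length.
            if any(motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // unit_len))
    return sorted(hits)


def group_compound(ssrs, max_gap=MAX_INTERRUPTION):
    """Group position-sorted SSRs separated by at most max_gap bp into loci."""
    loci = []
    for ssr in ssrs:
        if loci and ssr[0] - loci[-1][-1][1] <= max_gap:
            loci[-1].append(ssr)
        else:
            loci.append([ssr])
    return loci


# Hypothetical example sequence: (AAC)5 and (AG)6 separated by 5 bp.
example = "GGT" + "AAC" * 5 + "TTGCA" + "AG" * 6 + "C"
ssrs = find_ssrs(example)
loci = group_compound(ssrs)
```

In this example the two repeats lie 5 bp apart, so they are grouped into a single compound locus; a locus holding one SSR corresponds to a simple repeat. The sketch ignores imperfect repeats and some overlapping matches, which a dedicated tool handles more carefully.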

Identification of Chromosomal Distribution of SSRs

The physical positions of the SSR-containing TFs were presented graphically using the MapChart tool (Voorrips 2002).

Functional Annotation

Functional annotation of the SSR-containing transcription factor genes based on Gene Ontology (GO) terms was performed using Blast2GO [15] and WEGO [16].

PCR Amplification

PCR amplifications were carried out in a final volume of 20 µL with a reaction mixture containing 50 ng of DNA template, 1× PCR buffer, 2.0 mM MgCl2, 2.5 mM dNTPs, 4 µM each primer, and 0.5 units of Taq polymerase (Thermo Scientific). The thermal cycling program comprised 5 min at 94°C, 35 cycles of 35 s at 95°C, 35 s at 60°C, and 30 s at 72°C, and a final extension step of 8 min at 72°C. PCR products were separated by electrophoresis on 3.5% agarose gels, and the banding patterns were visualized by ethidium bromide (EtBr) staining.

Genetic Diversity Analysis

The SSR profiles (alleles) were scored in a binary format as present (1) or absent (0) and used to determine the genetic relationships among the different soybean accessions (Supplementary Table S2). Only specific bands that could be scored unambiguously across all soybean accessions were used in this study. The polymorphism information content (PIC) of each SSR marker was calculated using the formula $\mathrm{PIC} = 1 - \sum_{i} p_i^{2}$, where $p_i$ is the frequency of the $i$th allele. A dendrogram was constructed using the unweighted pair group method with arithmetic mean (UPGMA) in DARwin software, based on the genetic identity matrix.
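The PIC formula above can be written as a short Python sketch. The function name `pic` and the allele calls in the example are hypothetical and serve only to show the arithmetic; they are not data from this study.

```python
from collections import Counter


def pic(allele_calls):
    """PIC = 1 - sum(p_i ** 2), where p_i is the frequency of the i-th allele."""
    counts = Counter(allele_calls)
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical locus scored in eight accessions with alleles a, b and c.
calls = ["a", "a", "a", "a", "b", "b", "c", "c"]
value = pic(calls)  # allele frequencies 0.5, 0.25 and 0.25
```

For the hypothetical calls the allele frequencies are 0.5, 0.25 and 0.25, so PIC = 1 - (0.25 + 0.0625 + 0.0625) = 0.625. Under this formula, a locus with k equally frequent alleles reaches the maximum value 1 - 1/k.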

Results and Discussion

Frequency and Distribution of SSRs in the G. max TF Genes

In the present study, a total of 6150 TF coding sequences of G. max with an average length of 1170 bp were mined for SSRs and used to design the TF-derived genic markers (Table 1). The microsatellite search with MISA detected a total of 1550 SSRs in the 6150 TF genes (a ratio of 25.2%), with a distribution frequency of one SSR locus per 4.6 kb. This frequency is higher than those in earlier reports on TF-derived SSRs in chickpea (one per 7.1 kb) [3] and on EST-derived SSRs in soybean (one per 7.7 kb) [17] and peanut (one per 7.3 kb) [18] among leguminous crops.
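The reported SSR frequency follows directly from the counts given above, as the following small Python check illustrates. The function name `ssr_density` is hypothetical; the inputs are the values stated in the text.

```python
def ssr_density(n_sequences, mean_length_bp, n_ssrs):
    """Return (total sequence length in kb, kb of sequence per SSR)."""
    total_kb = n_sequences * mean_length_bp / 1000
    return total_kb, total_kb / n_ssrs


# Values from the text: 6150 TF coding sequences, 1170 bp on average, 1550 SSRs.
total_kb, kb_per_ssr = ssr_density(6150, 1170, 1550)
print(f"{total_kb:.1f} kb searched, one SSR per {kb_per_ssr:.2f} kb")
```

The product of sequence number and mean length is about 7200 kb, and dividing by 1550 SSRs gives roughly one SSR per 4.64 kb, in agreement with Table 1.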

 
Table 1. Summary of TF-derived SSR searches in G. max

| Search item | Number |
|---|---|
| Total number of TFs examined | 6150 |
| Total number of identified SSRs | 1550 |
| Number of SSR-containing TFs | 1138 |
| Number of TFs containing more than 1 SSR | 282 |
| Repeat type: mononucleotide | 26 |
| Repeat type: trinucleotide | 1455 |
| Repeat type: pentanucleotide | 1 |
| Total length of sequences searched (kb) | 7200 |
| Frequency of SSRs | One per 4.64 kb |


Varied repeat motifs were detected in G. max TF-derived SSRs, and they were unevenly distributed among motif types and locations (Table 1). Investigation of those SSR motifs revealed that 282 (24.7%) TF genes contained more than one SSR. Of the 1138 total SSRs, 978 (85.9%) contained simple repeat motifs, while 160 (14%) were compound motifs. Among the simple repeat motifs, tri-nucleotide motifs were the most abundant (93.8%), followed by di- (3.1%), mono- (1.6%) and hexa-nucleotide motifs (1.1%). Only one tetra-nucleotide motif (AAAG/CTTT) and one penta-nucleotide motif (ACACT/AGTGT) were found in G. max TF sequences. Tri-nucleotide repeats have been reported as the most common motif for SSR markers developed in most species, followed by di- and tetra-nucleotide repeats. In cereals, tri-nucleotide repeats were the most frequent motif in ESTs (54–78%), whereas the frequency of di-nucleotides was 17.1–40.4% and that of tetra-nucleotides was 3–6% (Fig. 1). In wheat, more than 70% of the repeats found in coding sequences were tri-nucleotide repeats, whereas di-nucleotide repeats (~80%) were abundant in non-coding regions [19]. In the soybean TF genes examined in this study, the most abundant repeat type was tri-nucleotide, followed by di-nucleotide.

Interestingly, the abundance of tri-nucleotide repeats in the ORFs of G. max TF genes can be ascribed to the absence of frameshift mutations in coding sequences when these SSRs vary in length. AAC/GTT type tri-nucleotide repeats were the most frequent, followed by the ACC/GGT type. Among mono-nucleotides, A/T type repeats were found at high frequency, while only two C/G repeats were detected. Among di-nucleotides, the AT/AT repeat was the least abundant. Compared with the other repeat classes, the frequency of tetra- and penta-nucleotide repeats was very low. The tetra-, penta- and hexa-nucleotide repeat types observed, with their frequencies, were AGCATC/ATGCTG (5), AAAG/CTTT (1), ACACT/AGTGT (1), AACCCG/CGGGTT (2), AACCCT/AGGGTT (1), AAGCCC/CTTGGG (1), AAGGAG/CCTTCT (3), ACCAGC/CTGGTG (2), ACCATG/ATGGTC (2), ACTAGT/ACTAGT (2).
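The frameshift argument above can be made concrete with a short Python sketch: a change in repeat number shifts the reading frame unless the resulting length change is a multiple of three. The helper names are hypothetical. The second helper computes the plain reverse complement used in motif pairs such as AAC/GTT; some pairs listed in the text (for example AGCATC/ATGCTG) are written as a rotation of that reverse complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Reverse complement of a DNA motif, e.g. AAC -> GTT."""
    return motif.upper().translate(COMPLEMENT)[::-1]


def keeps_reading_frame(motif_length, repeat_change):
    """True if gaining or losing repeat units changes the length by a multiple of 3."""
    return (motif_length * repeat_change) % 3 == 0


# Motifs of each length class taken from the text.
for motif in ("A", "AT", "AAC", "AAAG", "ACACT", "ACTAGT"):
    frame_safe = all(keeps_reading_frame(len(motif), change)
                     for change in (-2, -1, 1, 2))
    print(motif, reverse_complement(motif), frame_safe)
```

Only the tri- and hexa-nucleotide motifs keep the reading frame for every change in repeat number; for the other motif lengths, most gains or losses of repeat units cause a frameshift.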

Chromosomal Distribution of SSR-Containing TF Genes

After removing redundant SSR loci, we mapped 1127 SSR-containing TF genes onto the 20 chromosomes using the MapChart tool; the remaining 11 SSRs belong to unassigned scaffold regions (Fig. 2). A total of 49 TFs having a compound SSR motif were anchored to different chromosomes. This may be due to a whole-genome duplication event or may result from tandem duplication. The number of mapped TF-derived SSR markers per chromosome varied from a minimum of 18 on chromosomes 1, 14, 15 and 18 to a maximum of 49 on chromosome 13, followed by chromosomes 6 and 10 (48 SSRs each).

Functional Classification of SSR-Containing TF Genes

In the present study, the potential functions of 176 SSR-containing TF genes were evaluated by searching against the Gene Ontology (GO) database using the Blast2GO and WEGO software. Figure 3 summarizes the categorization of these TF genes according to biological process, cellular component and molecular function. A total of 1138 TF genes were divided into 33 GO categories, and 1100 TF genes were fully annotated. Of these, 1013 genes were assigned to the cellular component category, 1009 to molecular function, and 913 to biological process. In the cellular component category, cell and cell part were the dominant terms (997 genes, 90.6%, for both), followed by organelle (989 genes, 89.9%); the least abundant was membrane-enclosed lumen (29 genes, 2.6%). Based on molecular function, the TF genes were classified into several groups: 932 TF genes (84.7%) were assigned to binding, followed by transcription regulation (637 genes, 57.9%), catalytic activity (56 genes, 5.1%), and structural molecule activity (1 gene, 0.6%). The rest had molecular transducer activity, antioxidant activity, or molecular function regulator activity (2 genes each, 0.2%). In the biological process category, the two most over-represented GO terms were cellular process (895 genes, 81.4%) and metabolic process (880 genes, 80.0%), followed by biological process (872 genes, 79.3%) and response to stress and response to stimulus (147 genes each, 13.4%).

Development of G. max TF derived Markers

Out of the 1138 SSR-containing TF genes, a total of 1339 primer pairs could be successfully designed from G. max TF genes (96.26%); for the remaining loci, either the sequences flanking the SSR were too short or the required criteria for primer design were not met. Details of the designed primer pairs are provided as supplementary data (Supplementary Table 1). A set of twenty markers with a total of 57 alleles was validated in eight different soybean genotypes (Fig. 4), and distinct polymorphism was observed. Since these functional markers belong to diverse gene families, they can subsequently be correlated with traits for use in plant breeding research related to hormone signaling, pathogen defense response, abiotic stress tolerance, and similar processes.

Genetic Diversity Analysis of Soybean Accessions

Twenty TF-derived SSR primer pairs were randomly selected as candidate transferable SSR markers in soybean, and their potential use in diversity studies was verified by testing them against eight soybean accessions (Supplementary Table 2). A total of 57 alleles were detected with the 20 polymorphic TF gene-derived microsatellite (TFGM) markers. The number of alleles produced per primer pair ranged from two (Glyma.10G029700.1.p, Glyma.11G216500.1.p, Glyma.13G236800.1.p, Glyma.09G207300.2.p and Glyma.04G242200.1.p) to four (Glyma.13G146400.1.p and Glyma.16G141300.1.p), with an average of 2.85. The highest polymorphism information content (PIC) value was observed for primer Glyma.16G031400.1.p (0.99) and the lowest for MtTF64 (0.08), and the average PIC value was 0.60 (Table 2).

Table 2. Details of the twenty polymorphic TF-derived SSR markers with their genetic parameter values

| Sl. No. | SSR-containing TF gene | No. of alleles | PIC value | TF family |
|---|---|---|---|---|
| 1 | Glyma.15G063300.1.p | 3 | 0.80 | GeBP family |
| 2 | Glyma.10G029700.1.p | 2 | 0.68 | C2H2 family |
| 3 | Glyma.11G216500.1.p | 2 | 0.73 | GRAS family |
| 4 | Glyma.20G014400.2.p | 3 | 0.63 | HD-ZIP family |
| 5 | Glyma.13G146400.1.p | 4 | 0.86 | C2H2 family |
| 6 | Glyma.13G236800.1.p | 3 | 0.84 | C3H family |
| 7 | Glyma.11G136600.1.p | 3 | 0.85 | G2-like family |
| 8 | Glyma.17G170100.1.p | 3 | 0.93 | ERF family |
| 9 | Glyma.07G109500.1.p | 3 | 0.95 | SBP family |
| 10 | Glyma.05G200400.1.p | 3 | 0.88 | Trihelix family |
| 11 | Glyma.20G005800.1.p | 3 | 0.90 | C2H2 family |
| 12 | Glyma.16G141300.1.p | 4 | 0.94 | GRAS family |
| 13 | Glyma.10G225200.1.p | 3 | 0.79 | Trihelix family |
| 14 | Glyma.09G117200.2.p | 3 | 0.70 | B3 family |
| 15 | Glyma.06G121300.2.p | 3 | 0.64 | GRAS family |
| 16 | Glyma.05G027000.1.p | 3 | 0.82 | MYB family |
| 17 | Glyma.13G236800.1.p | 2 | 0.79 | C3H family |
| 18 | Glyma.09G207300.2.p | 2 | 0.98 | G2-like family |
| 19 | Glyma.04G242200.1.p | 2 | 0.98 | MYB family |
| 20 | Glyma.16G031400.1.p | 3 | 0.99 | WRKY family |

As suggested earlier, PIC values greater than 0.5 indicate informative markers, and loci with PIC values greater than 0.7 are highly suitable for genetic mapping [20]. In the present study, all the SSR markers tested had PIC values greater than 0.5, and eighteen SSR markers showed PIC values of more than 0.7, which indicates the high level of polymorphism of these genic SSR markers and their potential use in genetic diversity studies and genetic mapping. Overall, the results highlight the value of the newly developed TFGM markers, which can be recommended for cultivar identification as well as for the assessment of genetic diversity in soybean genotypes.
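The allele summary and the PIC thresholds discussed above can be sketched in Python as follows. The allele counts are transcribed from rows 1 to 20 of Table 2; to keep the example short, the threshold classification is shown only for the PIC values of the first five rows. The function name `classify_pic` and the class labels are hypothetical.

```python
# Number of alleles per marker, transcribed from Table 2 (rows 1-20).
ALLELES = [3, 2, 2, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 2, 2, 2, 3]

total_alleles = sum(ALLELES)
mean_alleles = total_alleles / len(ALLELES)


def classify_pic(value, informative=0.5, mapping=0.7):
    """Classify a PIC value with the thresholds cited in the text [20]."""
    if value > mapping:
        return "highly suitable for mapping"
    if value > informative:
        return "informative"
    return "less informative"


# PIC values of the first five rows of Table 2.
first_rows = {
    "Glyma.15G063300.1.p": 0.80,
    "Glyma.10G029700.1.p": 0.68,
    "Glyma.11G216500.1.p": 0.73,
    "Glyma.20G014400.2.p": 0.63,
    "Glyma.13G146400.1.p": 0.86,
}
classes = {marker: classify_pic(v) for marker, v in first_rows.items()}
```

The allele counts in Table 2 sum to 57, which gives the reported average of 2.85 alleles per primer pair.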

Conclusions

Length variations of SSR motifs in regulatory genes can regulate the activity and function of those genes. Since few validated SSR markers from regulatory genes are available in other crops, the 1138 G. max Transcription Factor derived MicroSatellites (TTFMS) identified here will be a novel marker resource in soybean. Furthermore, the core set of 1100 functionally annotated TF SSR markers covers biological, cellular, and metabolic functions, and the validated markers showed desirable attributes, with PIC values greater than 0.5. The successful diversity characterization of eight soybean cultivars demonstrates the broad applicability of these novel marker resources. This study outlines the potential applications of these candidate markers in multiway genetic studies of G. max. The polymorphic TF-derived SSR markers belong to gene families such as bZIP, EREBP, bHLH, Myb-related, WRKY, C2H2, TCP, R2R3, GRAS, C3H, ERF, NAC, MADS and G2-like, which suggests that these markers may be associated with quality (flavonoid biosynthesis), resistance to different stresses, and other traits in soybean. GO results also show the potential of these TF-derived markers for the elucidation of major QTLs in many biological processes. Owing to their polymorphic potential, stability, functional relevance, and genome-wide coverage, the derived marker resources are expected to accelerate the characterization of soybean traits as well as molecular breeding efforts, and they give plant breeders a valuable opportunity for use in crop improvement programmes.

Statements and Declarations

Acknowledgments

The authors sincerely acknowledge ICAR-VPKAS, Almora, and ICAR-Indian Agricultural Research Institute, Jharkhand, for providing the opportunity to pursue this research.

Compliance with ethical standards

The authors declare that they have no conflicts of interest, that this is an original article, and that it has not been submitted to any other journal. The collaborative contributions have been clearly indicated and acknowledged.

No animal or human participation is involved.

Informed consent was obtained from all individual participants included in the study.

Data Availability

All relevant data are within the paper and its supporting information files. 

Supplementary Materials: Supplementary Tables 1 and 2

References

  1. Spitz, F.; Furlong, E.E.M. Transcription Factors: From Enhancer Binding to Developmental Control. Nat Rev Genet 2012, 13, 613–626, doi:10.1038/nrg3207.
  2. Jin, J.; Zhang, H.; Kong, L.; Gao, G.; Luo, J. PlantTFDB 3.0: A Portal for the Functional and Evolutionary Study of Plant Transcription Factors. Nucleic Acids Res 2014, 42, doi:10.1093/NAR/GKT1016.
  3. Kujur, A.; Bajaj, D.; Saxena, M.S.; Tripathi, S.; Upadhyaya, H.D.; Gowda, C.L.L.; Singh, S.; Jain, M.; Tyagi, A.K.; Parida, S.K. Functionally Relevant Microsatellite Markers From Chickpea Transcription Factor Genes for Efficient Genotyping Applications and Trait Association Mapping. DNA Research 2013, 20, 355–374, doi:10.1093/DNARES/DST015.
  4. Liu, W.; Jia, X.; Liu, Z.; Zhang, Z.; Wang, Y.; Liu, Z.; Xie, W. Development and Characterization of Transcription Factor Gene-Derived Microsatellite (TFGM) Markers in Medicago Truncatula and Their Transferability In. Molecules 2015, 20, 8759–8771, doi:10.3390/molecules20058759.
  5. Kumar Biswas, M.; Kumar Nath, U.; Howlader, J.; Bagchi, M.; Natarajan, S.; Abdul Kayum, M.; Kim, H.-T.; Park, J.-I.; Kang, J.-G.; Nou, I.-S. Exploration and Exploitation of Novel SSR Markers for Candidate Transcription Factor Genes in Lilium Species. Genes 2018, 9, 97, doi:10.3390/genes9020097.
  6. Parmar, R.; Seth, R.; Sharma, R.K. Genome-Wide Identification and Characterization of Functionally Relevant Microsatellite Markers from Transcription Factor Genes of Tea (Camellia Sinensis (L.) O. Kuntze). Scientific Reports 2022, 12, 1–14, doi:10.1038/s41598-021-03848-x.
  7. Saha, D.; Rana, R.S.; Das, S.; Datta, S.; Mitra, J.; Cloutier, S.J.; You, F.M. Genome-Wide Regulatory Gene-Derived SSRs Reveal Genetic Differentiation and Population Structure in Fiber Flax Genotypes. J Appl Genet 2019, 60, 13–25, doi:10.1007/S13353-018-0476-Z.
  8. Liu, N.; Cheng, F.Y.; Guo, X.; Zhong, Y. Development and Application of Microsatellite Markers within Transcription Factors in Flare Tree Peony (Paeonia Rockii) Based on next-Generation and Single-Molecule Long-Read RNA-Seq. J Integr Agric 2021, 20, 1832–1848, doi:10.1016/S2095-3119(20)63402-5.
  9. Gepts, P.; Beavis, W.; Brummer, E.; Shoemaker, R. Legumes as a Model Plant Family. Genomics for Food and Feed Report of the Cross-Legume Advances through Genomics Conference. 2005.
  10. Fredslund, J.; Madsen, L.H.; Hougaard, B.K.; Nielsen, A.M.; Bertioli, D.; Sandal, N.; Stougaard, J.; Schauser, L. A General Pipeline for the Development of Anchor Markers for Comparative Genomics in Plants. BMC Genomics 2006, 7, doi:10.1186/1471-2164-7-207.
  11. Liu, Y.L.; Li, Y.H.; Zhou, G.A.; Uzokwe, N.; Chang, R.Z.; Chen, S.Y.; Qiu, L.J. Development of Soybean EST-SSR Markers and Their Use to Assess Genetic Diversity in the Subgenus Soja. Agric Sci China 2010, 9, 1423–1429, doi:10.1016/S1671-2927(09)60233-9.
  12. Kujur, A.; Bajaj, D.; Saxena, M.S.; Tripathi, S.; et al. Functionally Relevant Microsatellite Markers from Chickpea Transcription Factor Genes for Efficient Genotyping Applications and Trait Association Mapping. DNA Research 2013, 20, 355–374, doi:10.1093/DNARES/DST015.
  13. Stefanova, P.; Taseva, M.; Georgieva, T.; Gotcheva, V.; Angelov, A. A Modified CTAB Method for DNA Extraction from Soybean and Meat Products. Biotechnol Biotechnol Equip 2013, 27, 3803–3810, doi:10.5504/BBEQ.2013.0026.
  14. Beier, S.; Thiel, T.; Münch, T.; Scholz, U.; Mascher, M. MISA-Web: A Web Server for Microsatellite Prediction. Bioinformatics 2017, 33, 2583–2585, doi:10.1093/BIOINFORMATICS/BTX198.
  15. Conesa, A.; Götz, S.; García-Gómez, J.M.; Terol, J.; et al. Blast2GO: A Universal Tool for Annotation, Visualization and Analysis in Functional Genomics Research. Bioinformatics 2005, 21, 3674–3676.
  16. Ye, J.; Fang, L.; Zheng, H.; Zhang, Y.; Chen, J.; Zhang, Z.; Wang, J.; Li, S.; Li, R.; Bolund, L.; et al. WEGO: A Web Tool for Plotting GO Annotations. Nucleic Acids Res 2006, 34, doi:10.1093/NAR/GKL031.
  17. Hisano, H.; Sato, S.; Isobe, S.; Sasamoto, S.; Wada, T.; Matsuno, A.; Fujishiro, T.; Yamada, M.; Nakayama, S.; Nakamura, Y.; et al. Characterization of the Soybean Genome Using EST-Derived Microsatellite Markers. DNA Research 2007, 14, 271–281, doi:10.1093/DNARES/DSM025.
  18. Bosamia, T.C.; Mishra, G.P.; Thankappan, R.; Dobaria, J.R. Novel and Stress Relevant EST Derived SSR Markers Developed and Validated in Peanut. PLoS One 2015, 10, e0129127, doi:10.1371/JOURNAL.PONE.0129127.
  19. Varshney, R.K.; Thiel, T.; Stein, N. In Silico Analysis on Frequency and Distribution of Microsatellites in ESTs of Some Cereal Species. Cell Mol Biol Lett 2002, 7, 537–546.
  20. Bandelj, D.; Jakše, J.; Javornik, B. Assessment of Genetic Variability of Olive Varieties by Microsatellite and AFLP Markers. Euphytica 2004, 136, 93–102, doi:10.1023/B:EUPH.0000019552.42066.10.