Identification of GGPPS genes in plants
A total of 134 GGPPS genes were identified from 17 plant species including green algae, mosses, fern, gymnosperms and angiosperms (Additional file 1:Table S1). Among these, one GGPPS gene was identified from C. reinhardtii, two from S. bicolor and P. patens, three from S. moellendorfii and M. truncatula, four from Z. mays and V. vinifera, five from T. cacao, six from S. tuberosum, seven from G. max and P. taeda, eight from O. sativa, 10 from G. arboreum, and G. raimondii, 12 from Arabidopsis, 25 from B. napus, and G. hirsutum. It is observed that from 17 species, GGPPS genes range from 1 (C. reinhardtii) to 25 (G. hirsutum and B. napus) indicating that GGPPS gene family underwent expansion during the evolution in land plants. Further, among three cotton species G. hirsutum contained maximum number of GGPPS genes, indicating the polyploidization and duplication effect on GhGGPPS gene family in G. hirsutum.
Evolutionary analysis of GGPPS genes
Next, we investigated the evolutionary relationships among 134 GGPPS genes of all observed plant species to assume evolutionary mechanisms leading to the formation and maintenance of multiple gene copies within them. The GGPPS phylogenetic tree revealed five main groups, referred to as GGPPS-a to GGPPS-e (Fig. 1). Group GGPPS-a with maximum number (36 genes) of GGPPS family genes from three species Arabidopsis, S. tuberosum, and B. napus while group GGPPS-e with minimum number (20 genes) of GGPPS genes from 11 species except S. bicolor, S. moellendorfii, P. patens, C. reinhardtii, Z. mays, and T. cacao. Moreover, group GGPPS-b contained 27 genes from nine species including G. max, P. taeda, V. vinifera, S. tuberosum, M. truncatula, T. cacao, G. arboreum, G. raimondii, and G. hirsutum while, group GGPPS-c contained 28 genes from 13 species except Arabidopsis, G. max, M. truncatula, and B. napus. Similarly, Group GGPPS-d contained 23 genes from six species including O. sativa, Z. mays, T. cacao, G. arboreum, G. raimondii, and G. hirsutum. Interestingly, 134 GGPPS genes from 17 species were randomly distributed to all groups and most orthologous and paralogous genes between allotetraploids and diploids were clustered close to each other in the same group, showing the expansion of the GGPPS gene family. The phylogenetic analysis showed that group GGPPS-c contained one GGPPS gene from first land plant green algae indicating that GGPPS gene family is derived from common ancestor which is in agreement with earlier studies that all trans-isoprenyl diphosphate synthases are derived from a common ancestor [22, 23]. Similarly, group GGPPS-c also contained two genes from moss and three genes from fern, the more number of GGPPS genes from moss and fern than algae. Group GGPPS-a, b, d and e contained GGPPS genes from gymnosperms and angiosperms but not from algae, moss and fern, indicating that these groups might be evolved after separation of algae, ferns and moss. However, GGPPS genes from G. hirsutum were present in almost all groups except GGPPS-a. Furthermore, the increase in the predicted number of GGPPSs in G. hirsutum than G. arboreum and G. raimondii demonstrated duplication events during polyploidization. Moreover, in phylogenetic tree the GGPPS genes from cotton showed close relationship with cacao GGPPS genes however their number and distribution differ in all groups (Fig. 1). For instance, in group GGPPS-c, GrGGPPS3, GhGGPPS17, GrGGPPS4, GhGGPPS5, GaGGPPS3, and GhGGPPS4 genes showed a close relationship with three cacao genes (TcGGPPS2 TcGGPPS4 and TcGGPPS3), supporting the hypothesis that cacao and cotton were closely related and probably derived from the same ancestors [24]. Additionally the results of phylogenetic analysis were also verified by constructing another phylogenetic tree using ME (Maximum evolution) method (Additional file 6: Fig. S1). Both phylogenetic trees displayed highly similar results.
Next, we generated another phylogenetic tree among three cotton species by NJ (neighbor-joining) method. Phylogenetic tree divided 45 cotton GGPPS genes into four groups from GGPPS-a to GGPPS-d (Fig. 2). Group GGPPS-c was the biggest group with 16 members while group GGPPS-d was the smallest group with four members of GGPPS genes. Out of 45 cotton GGPPS genes, four GGPPS genes (GaGGPPS1, GhGGPPS2, GhGGPPS15, and GrGGPPS2) form a separate group in the phylogenetic tree indicating that GhGGPPS2 and GhGGPPS15 are the orthologs of GaGGPPS1 and GrGGPPS2 and demonstrating that GhGGPPS2 and GhGGPPS15 might be evolved as the result of hybridization of GaGGPPS1 and GrGGPPS2 during evolution. However, GhGGPPS genes form three groups with GGPPS-a as a largest group (11 genes) followed by GGPPS-c (four genes) and GGPPS-b (ten genes), when another phylogenetic tree (NJ) was constructed among them (Additional file 7: Fig. S2).
Biophysical properties and sequence logos of GGPPS in G. hirsutum
Chemical and biophysical characteristics of 25 members of GGPPS gene family in G. hirsutum were predicted. Chromosomal position (start and end points), gene length (bp), coding sequence (CDS), protein length (aa), molecular weight (MW), isoelectric point (pI), and grand average of hydropathicity (GRAVY) of GhGGPPS genes were calculated. The gene length ranged from 620-6,749 bp. Eleven genes with no introns exhibited the lower gene length ranged from 620-1,118 bp, and increase in number of introns increases gene length. Two genes (GhGGPPS6 and GhGGPPS18) with 12 numbers of introns showed maximum gene length (6,749 and 6,657) compared to others (Additional file 2:Table S2). The coding sequence (CDS) ranged from 621-1932 bp and the numbers of amino acids (aa) ranged from 206-643aa, and GhGGPPS14 and GhGGPPS20 are most shortest and longest genes respectively. Molecular weights ranged from 22741-71107 Da with GhGGPPS20 as maximum (71107 Da) while GhGGPPS14 with minimum (22741Da) molecular weight. The value of isoelectric point ranged from 4.67-9.68. Two proteins GhGGPPS7 (8.5) and GhGGPPS11 (9.68) showed isoelectric point more than 7, indicating that they were alkaline proteins, while all others showed isoelectric point below 7 indicating that they all were acidic proteins. In addition, the average hydropathy value of each residue present in the sequences was calculated to evaluate the GRAVY (grand average of Hydropathicity) values of the proteins. The positive GRAVY values of the proteins revealed hydrophobicity however negative GRAVY scores revealed hydrophilicity. The GRAVY values showed that 14 proteins were hydrophilic while 11 proteins were hydrophobic (Additional file 2: Table S2). Subcellular localization indicated that 12 GhGGPPS proteins were located in chloroplast, six in cytoplasm, two in mitochondrial, two in plasma membrane and three in nucleus.
GGPPS protein sequence logos in Arabidopsis, rice and G. hirsutum will assist to discover and evaluate the pattern of GGPPS protein sequence conservation in other plant species, as sequence logos provide a more precise description of sequence similarity than consensus sequences. We generated the sequence logos of the conserved amino acid residues in Arabidopsis, rice and G. hirsutum to check whether the GGPPS family proteins were conserved during evolution (Fig. 3). Conserved amino acid residues analysis showed that the sequence logos among the three species were significantly conserved across the N and C terminal.
Intron/exon structure, protein motifs and promoter cis-elements of GhGGPPS genes
To further clarify the evolutionary relationships, gene structure and conserved motifs analysis of 25 GhGGPPS genes was performed (Additional file 8: Fig. S3a). The gene structure and conserved motifs distribution analysis divided the G. hirsutum GGPPS genes into three groups. Results indicated that 10 conserved protein motifs were randomly distributed among 25 GhGGPPS genes (Additional file 8: Fig. S3b). Most of the GhGGPPS proteins displayed similar motifs distribution pattern, suggesting they might have conserved functions. Motif 4 was present in almost all proteins except three (GhGGPPS25, GhGGPPS15, and GhGGPPS2) GhGGPPS proteins. Motif 10 was identified only in seven proteins (GhGGPPS8, GhGGPPS20, GhGGPPS6, GhGGPPS18, GhGGPPS7, GhGGPPS12, and GhGGPPS24) of group GGPP-b, however it was not identified in proteins of group GGPP-a and GGPP-c. Comparing to motif 10, motif 7 was identified not in group GGPP-b but in almost all proteins of group GGPP-a and GGPP-c.
Gene structural analysis indicated that the coding regions of all GhGGPPS genes were interrupted by 1–12 introns. Accounting 44% of the total GhGGPPS genes, 11 GhGGPPS genes lack introns, and two genes (GhGGPPS6 and GhGGPPS18) had maximum number (12 introns) of introns. Two GhGGPPS genes (GhGGPPS3 and GhGGPPS16) were interrupted by only one intron. However, some variations of exon and intron sizes were observed between GhGGPPS gene family. In most cases, GhGGPPS genes within the same group exhibited similar gene structures in regard to the distribution patterns, number and length of introns/exons (Additional file 8: Fig. S3c).
To analyze the possible roles of GhGGPPS genes in response to various responses, promoter of candidate GhGGPPS genes were used and searched for cis-elements. The GhGGPPS genes shared light responsive boxes and stress-related boxes. Additionally, hormones-related cis-elements including MeJA, salicylic acid, gibberellin, auxin, and abscisic acid were also found in the GhGGPPS genes promoters (Fig. 4). Based on the results, we observed that stressed and hormones-related cis-elements were existed in almost all GhGGPPS genes. Some of the gene promoter regions contained various elements for plant growth and development including circadian, endosperm expression, cell cycle regulation, and seed specific regulation. The identified motifs showed that GhGGPPS genes may be regulated by various cis-elements within the promoter during growth and development.
Chromosomal location, colinearity analysis and gene duplication of GhGGPPS
Chromosomal location of GhGGPPS genes were investigated on their corresponding chromosomes (At and Dt sub-genome chromosomes of G. hirsutum). The chromosomes distribution indicated that 23 genes out of 25 were unevenly distributed among 12 chromosomes while two genes (GhGGPPS11 and GhGGPPS25) were assigned to scaffolds. Uneven distribution of GhGGPPS genes on chromosomes suggested that genetic variation existed in the evolutionary process. Six chromosomes A01, A07, A08, A10, A11, and A13 from At sub-genome contained 12 genes and six chromosomes D01, D07, D09, D10, D11, and D13 from Dt sub-genome contained 11 genes (Additional file 9: Fig. S4). However there was no gene located on chromosome nine (A09) of At sub-genome as well as chromosome eight (D08) of Dt sub-genome. Six GGPPS genes GhGGPPS6, GhGGPPS7, GhGGPPS10, GhGGPPS18, GhGGPPS19, and GhGGPPS24 were located on six chromosomes of At /Dt sub-genome such as A07, A08, A11, D07, D09, and D13 respectively. Chromosome A01 and D01from At /Dt sub-genome have a higher number of GhGGPPS genes as compared to others.
Collinearity analysis of GhGGPPS gene family indicated that there was 29 pairs of orthologous/paralogous GhGGPPS genes in G. hirsutum and that most of GhGGPPS genes loci were conserved for both sub-genomes (At and Dt) (Additional file 3: Table S3, Fig. 5). Tandem and segmental duplication events are the main causes of gene-family expansion in G. hirsutum. To understand the gene duplication event within the G. hirsutum genome, we determined the tandem and segmental duplication during the evolution of GGPPS gene family in this study. According to whole genome duplication analysis, it is observed that 22 pairs of GhGGPPS genes experienced segmental duplication while one tandem and two were dispersed duplication events suggested that segmental duplication contributed deeply for the expansion of the GhGGPPS genes (Additional file 3: Table S3).
Tissues specific expression of GhGGPPS genes
Spatiotemporal expression of transcript is mostly correlated with the biological function of that gene. To investigate the tissue specific expression patterns of different GhGGPPS genes, RNA-seq data were downloaded from NCBI to generate heat map (Additional file 4: Table S4). We noted that all the genes were clustered according to their expression patterns in the vegetative organs (root, stem, and leaf), reproductive organs (torus, petal, stamen, pistil, and calycle), ovule (− 3, − 1, 0, 1, 3, 5, 10, 20, 25 and 35 DPA) and fiber (5, 10, 20, and 25 DPA) (Additional file 10: Fig. S5). According to the heat map all genes exhibited ubiquitous expression with no specific pattern. However, six genes (GhGGPPS2, GhGGPPS15, GhGGPPS8, GhGGPPS22, GhGGPPS3, and GhGGPPS16) showed higher expression in almost all vegetative, reproductive, ovule, and fiber tissues. In contrast, seven genes (GhGGPPS11, GhGGPPS19, GhGGPPS5, GhGGPPS4, GhGGPPS17, GhGGPPS7, and GhGGPPS14) showed very low expression in vegetative, reproductive, ovule and fiber tissues indicating that these genes were pseudo genes or could function with low transcripts in cotton development.
The different members of the same gene family can play different physiological functions by exhibiting their expression in different tissues/organs [25]. In earlier, different expression patterns of GGPPS gene family in A. thaliana tissues were observed in the different organs and seedlings [4]. To elucidate the roles of GhGGPPS genes in different tissues of upland cotton, 9 segmentally duplicated GhGGPPS genes were selected for qRT-PCR. Transcript level of GhGGPP4 and GhGGPPS9 showed significant and specific higher in primary developmental ovules from 0-7 DPA, indicating their potential roles in earlier development of ovule and the fiber cell initiation. Others like GhGGPPS1, GhGGPPS6 and GhGGPPS15 showed higher expression in later development stages of fiber, indicating that they might have important participation in fiber elongation and secondary cell deposition. GhGGPPS2, GhGGPPS3 and GhGGPPS8 showed conserved expression in almost all tissues, indicating they may play some conserved and basic roles in plant different development stages. Whereas expression profiles in non-reproductive tissues of GhGGPPS12 indicated its potential roles in vegetative development (Fig. 6).
Expression pattern of GhGGPPS gene family under abiotic and hormonal stresses
To further investigate the physiological and functional significance of GhGGPPS genes, we investigated the expression patterns of GhGGPPS genes under different stresses such as Cold, NaCl, PEG, and heat and hormonal treatments such as BL, GA, IAA, SA and MeJA. Firstly, the expression pattern of GhGGPPS genes under abiotic stresses were analyzed by RNA-seq data downloaded from NCBI, and a heat map was generated. Results depicted that the expression of GhGGPPS genes were up or down-regulated under different treatments and that all the genes were clustered according to their different responses (Additional file 11: Fig. S6). GGPPS10, GGPPS23, GGPPS12, GGPPS24, GGPPS9, GGPPS3, and GGPPS22 were up-regulated with almost all abiotic stresses. To confirm that, qRT-PCR was performed for 9 GhGGPPS selected genes under different abiotic stresses including Cold, NaCl, PEG, and heat. qRT-PCR results revealed that GhGGPPS1, GhGGPPS9, and GhGGPPS8 responded to almost all abiotic stresses. GhGGPPS1 and GhGGPPS9 exhibited their higher expression under 2h cold stress, while the higher expression of GhGGPPS8 was detected under 1 and 2h NaCl stress suggesting that these genes might play an important positive role in abiotic stresses. But GhGGPPS12 and GhGGPPS15 were only up-regulated by 6h drought stress not other any stresses conditon, indicating both of them playing specific roles in drought stress. Interestingly, GhGGPPS4 seems a negative regulator in abiotic stresses for its significant down-regulation in response to all the abiotic stresses. while GhGGPPS3 was up-regulated by 4h heat treatment (Fig. 7).
Secondly, we investigated the expression patterns of these genes under different hormonal treatments (BL, GA, IAA, SA and MeJA). Q-PCR results indicated that GhGGPPS1, GhGGPPS4, GhGGPPS9, and GhGGPPS15 were up-regulated with all hormonal stresses at different time point demonstrating that these genes are involved in all five hormonal signaling (Fig. 8). GhGGPPS1, GhGGPPS2, GhGGPPS3, GhGGPPS6, and GhGGPPS8 were preferentially induced with 1h SA and 0.5h GA treatments. In addition increased expression of GhGGPPS4 and GhGGPPS9 was observed when subjected to 3h SA stress. Higher transcript level of GhGGPPS12 and GhGGPPS15 were detected at 3 and 1h after MeJA treatment respectively, proposed that these two genes are important for MeJA signaling.