Identification, characterization and phylogenetic analysis of DIR genes
We identified 107 GbDIR genes (GbDIR1-GbDIR107) in G. barbadense and 107 GhDIR genes (GhDIR1-GhDIR107) in G. hirsutum (Additional file 8: Table S1 and S2). To understand the evolutionary relationships, a neighbor-joining (NJ) phylogenetic tree was constructed using the nucleotide sequences of 107 GbDIRs, 107 GhDIRs, 25 AtDIRs (Arabidopsis thaliana), 44 LuDIRs (Linum usitatissimum), 49 OsDIRs (Oryza sativa) and 35 PDIRs (Picea spp.) (Additional file 8: Table S3). As shown in Fig. 1, DIR genes were clearly divided into six subfamilies, designated as DIR-a, DIR-b/d, DIR-c, DIR-e, DIR-f and DIR-g. The topological structure is essentially in agreement with previous reports [17, 19, 21, 23]. Among these subfamilies, DIR-a consisted of 8 GbDIRs, 7 GhDIRs, 5 AtDIRs, 6 LuDIRs, 7 OsDIRs and 12 PDIRs, indicating a highly-conserved ancient clade. The largest group DIR-b/d consisted of 81 GbDIRs, 82 GhDIRs, 14 AtDIRs, 28 LuDIRs, 10 OsDIRs and 12 PDIRs, showing that DIR-b/d has expanded considerably in allotetraploid cotton. In contrast, DIR-f only included 5 GbDIRs, 4 GhDIRs, 3LuDIRs and 11 PDIRs, probably due to gene losses in Arabidopsis thaliana and Oryza sativa. DIR-e contained 13 GbDIRs, 14 GhDIRs, 6 AtDIRs, 6 LuDIRs, 3 OsDIRs but no PDIRs, suggesting a possible angiosperm-specific clade. DIR-c and DIR-g might be unique to monocotyledons, for they only contained OsDIRs. To validate the phylogenetic relationships, we reconstructed two phylogenetic trees using DIR protein sequences of G. barbadense and G. hirsutum (Additional file 1: Figure S1). The evolutionary relationships were highly consistent with that in Fig. 1. Furthermore, these DIR protein sequences were submitted to MEME to discover conserved motifs. As shown in Additional file 1: Figure S1, the adjacent clades carried similar motifs.
Most cotton DIR genes (especially members in DIR-a, DIR-e and DIR-f) held a classical gene structure without intron (Additional file 1: Figure S1). Notably, the DIR-b/d clade showed variable gene structures, indicating lower selection constraints. About 77% of GbDIRs and 82% of GhDIRs encoded proteins containing a signal peptide (Additional file 8: Table S1 and S2). As expected, subcellular localization prediction showed that they were mainly located in the extracellular space. In spite of having a signal peptide, several DIR proteins belonging to DIR-b/d were directed to the vacuole. In addition, DIR proteins located in the nucleus or the peroxisome were observed; the unexpected localization might signify novel functions. We also scanned cotton DIR protein sequences for N-glycosylation sites to explore their solubility and stability. Interestingly, DIR-e genes had much fewer N-glycosylation sites, displaying the evolution and divergence of cotton DIR gene family.
Chromosomal distribution, duplication and evolution
To determine the chromosomal distribution of cotton DIR genes, we marked their physical locations on chromosomes based on their annotation information (Additional file 2: Figure S2). All the 107 GhDIRs were mapped to 24 of the total 26 chromosomes except At06 and Dt06. The gene number varied from 1 to 15 across these chromosomes. Specifically, GhDIRs belonging to DIR-b/d were enriched on chromosomes At01, At04, At11, Dt01, Dt04 and Dt11, while DIR-e genes were mainly located on At10 and Dt10. Gene clusters were frequently observed, indicating a number of tandem duplication events. Because of the short divergence time, GbDIRs and GhDIRs were extremely similar in chromosomal distribution. GbDIRs were located on chromosomes except Dt02, At06 and Dt06, and the gene number ranged from 1 to 13. Six genes (GbDIR29 and GbDIR103-GbDIR107) were not mapped because of the incomplete location information.
To analyze the expansion of cotton DIR gene family, we identified segmental and tandem duplications in G. barbadense and G. hirsutum. The numbers of segmentally duplicated DIR genes and tandemly duplicated DIR genes were 13 and 31, respectively, in G. barbadense and 18 and 41 in G. hirsutum (Fig. 2a and 2b; Additional file 8: Table S4 and S5). Obviously, tandem duplication as the major impetus drove the expansion of cotton DIR gene family, corresponding to the above-mentioned gene clusters. Ks (substitution per synonymous site) values can be used to estimate the occurrence time of gene duplications and then to identify whole-genome duplication (WGD) events. Ks ranges of 0.4-0.6 (corresponding to Gossypium-specific WGD around 16.6 MYA) and 1.5-1.9 (corresponding to the paleo-hexaploidization event shared by eudicots around 130.8 MYA) were observed in G. raimondii [24]. Moreover, Ks peak of 0.03 accounts for the divergence between A-genome and D-genome progenitors [25]. Here we also calculated the Ks values of gene duplications to analyze their occurrence time; the LPB method was selected because its results were just in line with expectations (Additional file 8: Table S4 and S5). As shown in Fig. 2, Ks values of GbDIR14/GbDIR41, GbDIR79/GbDIR99 and GhDIR13/GhDIR41 were 0.60, 0.41 and 0.58, respectively, indicating that these three segmental duplications might be generated from Gossypium-specific polyploidization. The other segmental duplications were inferred to arise from the paleo-hexaploidization event, for their Ks values were in or near 1.5-1.9. Interestingly, segmentally duplicated DIR genes were mainly retained from the ancestral WGD event rather than the recent one, suggesting some functionally conserved genes through the evolution. Ks values of the tandem duplications ranged from 0.02 to 2.92 in G. barbadense and from 0.03 to 2.68 in G. hirsutum. Most of the tandem duplications had Ks value less than 0.5, showing a recent expansion of cotton DIR gene family. The lacking Ks values of 0.5-1.0 might mean gene losses and/or translocations due to chromosome re-packaging processes following polyploidization. Besides, some tandem duplications with Ks > 1.0 were observed, suggesting that these genes were quite conserved.
To further reveal evolutionary history of cotton DIR gene family, we integrated the duplication events with a phylogenetic tree containing 107 GbDIRs, 107 GhDIRs and 23 reported DIRs (Fig. 2c and Additional file 8: Table S6). Track i marked segmental duplications, while tracks ii, iii, iv and v (with artificially-created Ks ranges of 0.0-0.2, 0.2-0.6, 0.6-1.9 and 1.9-3.0, respectively) labeled tandem duplications. The values 0.6 and 1.9 referred to above-mentioned 0.4-0.6 and 1.5-1.9, respectively, and a cluster of Ks values around 0.1 set up the Ks range of 0.0-0.2 (Additional file 8: Table S4 and S5). The DIR-a clade, mainly involved in lignan biosynthesis, was basically established before the Gossypium-specific WGD event (Fig. 2c). Similarly, DIR-e was also a relatively ancient clade because of the lack of recent duplications. Clearly, DIR-b/d have experienced rapid expansion due to recent tandem duplications; we believed DIR-b/d-I, -II and -III arose in turn in the evolution. Among them, DIR-b/d-I covered tandem duplications and segmental duplications, forming a transition clade. DIR-b/d-II and the right half of DIR-b/d-III, lacking traces of duplication events, might have undergone large-scale gene losses and/or translocations. The left half of DIR-b/d-III, involved in the atropselective synthesis of gossypol, emerged recently due to tandem duplications. In addition, DIR-f was obviously inactive in cotton evolution.
The Ka/Ks ratio is a measure of selective pressure on proteins, and Ka/Ks > 1, = 1 and < 1 indicate positive selection (or molecular adaptation), neutral evolution, and purifying selection (or selective constraints), respectively. Here we calculated the Ka/Ks ratio for all the duplicated DIR gene pairs (Additional file 8: Table S4 and S5). Interestingly, those three gene pairs generated from Gossypium-specific WGD (i.e. GbDIR14/GbDIR41, GbDIR79/GbDIR99 and GhDIR13/GhDIR41) showed Ka/Ks ratios > 1, implying that positive selection might contribute to their surviving from gene losses and/or translocations. Ka/Ks ratios of the other segmentally duplicated DIR gene pairs were fairly low (0.14-0.35 in G. barbadense and 0.14-0.34 in G. hirsutum), suggesting the conserved functions due to purifying selection (Additional file 3: Figure S3a and S3b). Moreover, tandemly duplicated gene pairs belonging to DIR-a and DIR-e also showed very small Ka/Ks values (0.11-0.38 in G. barbadense and 0.13-0.39 in G. hirsutum), further showing the conservation of these two clades. In contrast, gene pairs in the DIR-b/d-III clade held larger Ka/Ks values (0.51-1.74 in G. barbadense and 0.53-1.04 in G. hirsutum), indicating weaker selective constraints (Additional file 3: Figure S3c and S3d). The moderate selection pressure might result in the rapid expansion of DIR-b/d-III.
Evolutionary history of cotton DIR gene family
To further verify the evolution of cotton DIR gene family, we constructed a phylogenetic tree consisting of 488 DIR genes from eight dicotyledonous species sharing a common paleo-hexaploid ancestor (Fig. 3a and Additional file 8: Table S7). After the paleo-hexaploidization event, Arabidopsis thaliana and Glycine max have severally undergone two rounds of WGD events, while no WGDs have been identified in Vitis vinifera and Theobroma cacao [26-29]. Moreover, the Gossypium genus have experienced lineage-specific WGD, divergence and hybridization. As shown in Fig. 3a, we identified 33, 54, 25, 32, 63, 67, 107 and 107 DIR genes in Vitis vinifera, Glycine max, Arabidopsis thaliana, Theobroma cacao, G. arboreum, G. raimondii, G. barbadense and G. hirsutum, respectively. We assigned these DIR genes to distinct clades (Fig. 3b). Although they underwent different rounds of WGDs, these species were quite consistent in the number of DIR-a genes (Being tetraploids, G. barbadense and G. hirsutum displayed DIR-a genes twice as many as other species). Similarly, apart from soybean, DIR-e was stable across species. Thus, DIR-a and DIR-e might have been set up before the divergence of Rosids, corresponding to the results inferred from the duplication events. In the above analysis, we found that DIR-f was evolutionarily inactive in G. barbadense and G. hirsutum. As expected, G. arboreum, G. raimondii, Arabidopsis thaliana and Theobroma cacao (all belong to Malvids) possessed less DIR-f genes. In particular, Arabidopsis thaliana lost the DIR-f clade during its evolution. DIR-b/d-I existed in all the eight species, DIR-b/d-II was absent in grape and soybean, and DIR-b/d-III was lineage-specific in cotton. Clearly, DIR-b/d-I, -II and -III really appeared in turn in evolution. It has been shown that DIR-b/d-III displayed two distinct clades and that the left clade arose later than the right. As shown in Fig. 3c, most genes in the left clade of DIR-b/d-III were in the collinear blocks, suggesting that DIR-b/d-III was established before the divergence of cotton species. The chromosome reciprocal translocation between At04 and At05 was also observed, which have been confirmed recently [30].
Expression patterns and transcriptional regulation
Studies have shown that DIR proteins are implicated in lignan, lignin and gossypol biosynthesis, which are all part of plant defense responses against pathogens. To investigate the role of cotton DIR genes in disease resistance, the expression patterns of GhDIRs were analyzed in response to V. dahliae and water (the check group) using a Verticillium wilt-resistant cultivar. As a result, about one quarter of GhDIRs were highly expressed in the check group (Fig. 4). For the “gossypol clade” (the left half of DIR-b/d-III), almost half of the members showed a fairly high expression level, indicating its importance in plant pre-formed defense. Once the seedlings were inoculated with V. dahliae, most of the highly expressed GhDIRs were dramatically down-regulated. It seems that V. dahliae weakened the functions of DIR genes by disturbing their expression to colonize cotton hosts. These down-regulated genes should be an important resource to understand plant-pathogen interaction. To verify the intriguing expression patterns, we also analyzed V. dahliae- and water-responsive expression of GhDIRs at 12 hpi in six other cultivars, and the log2(fold-change) values were presented in a heatmap (Additional file 4: Figure S4). After inoculation with V. dahliae, the cultivars S1 and S2 exhibited the largest number of down-regulated DIR genes, corresponding to their lowest Verticillium wilt resistance. In contrast with S1 and S2, the cultivars T3 and T4 showed more up-regulated genes in the DIR-a and DIR-e clades, and thus showed higher Verticillium wilt tolerance. Unlike T3, T4, S1 and S2, the cultivars T1 and T2 displayed quite a number of up-regulated DIR genes following inoculation with V. dahliae. Despite having different patterns in different cultivars, DIR genes might have contributed to cotton Verticillium wilt resistance.
Considering that lignin/lignin-like phenolics can affect cotton fiber quality, we analyzed the expression patterns of DIR genes during cotton fiber development (Fig. 5). As shown, the expression profiles in G. barbadense (Pima90-53 and Hai7124) were similar to that in G. hirsutum (HY405 and ND13), exhibiting low expression levels in the DIR-e, DIR-f and DIR-b/d-III clades. Several genes belonging to DIR-b/d-II (GbDIR25, GbDIR71 and GbDIR107 in G. barbadense; GhDIR27, GhDIR28, GhDIR75 and GhDIR76 in G. hirsutum) were preferentially expressed in cotton fibers of 20, 25 and 30 DPA, indicating potential functions in secondary wall development. GbDIR78 and GhDIR35 which were part of DIR-b/d-I showed high transcript levels in the fibers of 5, 10 and 15 DPA, suggesting their importance during cotton fiber elongation. Furthermore, GbDIR78 and GhDIR35 were highly homologous but differentially expressed. Two DIR-a genes GhDIR12 and GhDIR36 were highly expressed during secondary wall thickening, whereas their orthologous genes GbDIR13 and GbDIR35 exhibited quite low expression levels. The differential expression might have contributed to the different fiber quality in these two species.
Transcription factor binding sites (TFBS) provide cues for transcriptional regulation. A total of 266 JASPAR matrices were selected and fetched to identify potential TFBS in the promoter regions of GbDIRs and GhDIRs (Additional file 8: Table S8). Despite the biased JASPAR database and the strict threshold, TFBS were widely detected, including hormone-activated signaling pathway (ABA, IAA, ETH, GA, JA and SA), response to abiotic stresses (drought, salt and temperature), response to biotic stresses, and plant cell wall development (Fig. 6 and Additional file 5: Figure S5; Additional file 8:Table S9 and S10). In the DIR-a, DIR-e and DIR-f clades, GhDIR33, GhDIR36, GhDIR78, GhDIR80, GhDIR86 and GhDIR92 had no TFBS related to ABA signal transduction and showed little or no expression in the roots of cotton seedlings (Fig. 4); the others tended to be highly expressed, indicating the probable regulatory roles of ABA signaling in root-specific gene expression. Four DIR-b/d-II genes GhDIR27, GhDIR28, GhDIR75 and GhDIR76 were highly expressed in cotton fibers. Despite having extremely close phylogenetic relationships with these four genes, GhDIR6 and GhDIR64 exhibited quite low expression levels (Fig. 5b). One cause may be the lack of TFBS in their promoter regions (Fig. 6). Similarly, GbDIR6 and GbDIR60 differed from GbDIR25, GbDIR71 and GbDIR107 in TFBS occurrences and in transcript levels (Fig. 5a and Additional file 5: Figure S5). As another example, GhDIR36 carried more IAA-responsive TFBS than GbDIR35, corresponding to their differential expression in cotton fibers (Fig. 5 and Additional file 6: Figure S6a). Considering that GbDIR13 and GhDIR12 differed in transcript levels in cotton fibers but owned similar TFBS (Fig. 5 and Additional file 6: Figure S6b), their trans-acting TFs were analyzed. As shown, some TFs associated with IAA signaling, ETH signaling or plant cell wall development exhibited higher expression levels in G. hirsutum than in G. barbadense.
Functional characterization of GbDIR78 in Arabidopsis
RNA-seq data showed GbDIR78 was preferentially expressed during cotton fiber elongation and that GbDIR78 and GhDIR35 differed in transcript levels, implying the involvement of GbDIR78 in cell elongation. To further identify its functions, the ORF of GbDIR78 driven by a 35S promoter was transformed into A. thaliana plants. Two transgenic T3 lines OE2 and OE3 were generated, and the stable expression of GbDIR78 was confirmed by Real-time PCR and Western blot (Fig. 7a and 7b). Arabidopsis leaf trichomes can serve as a useful experimental system to dissect cotton fiber development because they partly share regulatory mechanisms [31-34]. Here, trichomes from the fifth rosette leaves of OE2, OE3 and WT plants were measured, and then we discovered that the transgenic plants owned significantly longer trichomes (Fig. 7c and 7d). Moreover, dark-grown hypocotyls were utilized to determine the role of GbDIR78 in cell elongation because their growth resulted from cell elongation rather than division [35, 36]. As a result, the seedlings of OE2 and OE3, compared with WT seedlings, showed significantly longer hypocotyls (Fig. 7e and 7f). As expected, the longer epidermal cells were observed in the hypocotyls of transgenic plants in a microscopic inspection (Fig. 7g). These results indicated that GbDIR78 can promote cell elongation and might have contributed to cotton fiber development.