Identification and features of AHL genes in cotton
To identify the AHL genes, the Blastp and Hmmer search program (HMMER3.0 package) was performed against the protein databases using the AtAHL protein sequences as queries. The candidate AHL genes were confirmed using PROSITE and InterProscan 65.0 software to search for the PPC and the AT-hook motifs. Finally, 12, 15, 21, 48, 51, 99 AHL genes were obtained from Physcomitrella patens (P. patens), Vitis vinifera (V. vinifera), Theobroma cacao (T. cacao), Gossypium raimondii (G. raimondii), Gossypium arboreum (G. arboreum), Gossypium hirsutum (G. hirsutum), respectively. The properties of identified AHLs in cotton were also analyzed by ExPASy (https://web.expasy.org/compute_pi/). The gene lengths of AHL genes in G.raimondii ranged from 684 bp to 8394 bp, which encoding polypeptides from 227 to 396 amino acid with predicted molecular weights ranging from 22.75 kD to 41.38 kDa. The theoretical pI ranged from 5.35 to 10.68 with charge from -4 to 18 (Table 1). The AHL genes in G.arboreum and G.hirsutum differed greatly in length (641-10972 bp), isoelectric point (5.3-10.6), molecular weight (17.22-45.29kDa) and charge (-5-19) (Additional file 1, 2).
Phylogenetic analysis and gene structures of AHL genes
To elucidate the evolutionary relationship of the AHL gene family in Gossypium, the maximum-likelihood phylogenetic tree was reconstructed by 1000 bootstrap replicates with AHL proteins from P. patens (Pp), A. thaliana (At), V. vinifera (Vv), T. cacao (Tc) and G. raimondii (Gr). The phylogenetic analysis showed that the AHLs were divided into two monophyletic clades, Clade-A and Clade-B, with 9 and 8 groups respectively (Figure 1). Each group in Clade-A (except for AHL-X, no corresponding AHL gene in A. thaliana, named AHL-X) was composed of one VvAHL, one TcAHL, different number of AtAHL and GrAHL respectively, Those groups in Clade-B were composed of various number of AHL genes from A. thaliana (At), V. vinifera (Vv), T. cacao (Tc) and G. raimondii (Gr). In G .raimondii, Clade-A contained 22 genes including the members of GrAHL1, GrAHL3, GrAHL5, GrAHL7, GrAHL9, GrAHL10, GrAHL13, GrAHL14, GrAHL-X1 and GrAHL-X2, while Clade-B contained 26 members including GrAHL15, GrAHL16, GrAHL17, GrAHL20, GrAHL22, GrAHL23, GrAHL24 and GrAHL25. Each group of GrAHL gene family usually contained 2-3 members, while the group of GrAHL17 had 8 members. The Group AHL15, AHL10 and AHL3 showed a more rigorous evolution pattern, with only one copy left in the genomes of the 4 examined species (Figure 1, Additional file 3). This result indicated the different characteristics and the patterns of evolution in various group. The members of AHLs in G. raimondii, G.arboreum and G.hirsutum showed the preferably relationship of one-to-one correspondence except for AHL17 and AHL23, there were 4 AHL17 members in G. raimondii, 6 in G. arboreum, and 9 in G. hirsutum (Additional file 4). The AHLs from P. patens evolved into two clades, suggesting an expansion of the AHL gene family in land plants posterior to the division between P. patens and the extant land plants [3].
Conservation of gene structure and motifs among AHLs in cotton
The AHL proteins were typically characterized by the presence of AT-hook motif(s) for binding DNA and PPC/DUF296 domain for nuclear localization and interaction with each other or themselves [4]. To investigate the presence of homologous domain sequences and the degree of conservation in the two domains, AT-hook motif(s) and PPC domain, we performed multiple sequence alignment to generate sequence logos of the two domains in cotton against the MEME website (http://meme-suite.org/tools/meme). 20 conserved motifs were predicted, and the specific amino acid sequences of each motif were also provided (Figure 2, Additional file 8). Based on the number and composition of the AT-hook motif(s) and PPC/DUF 296 domain, AHL proteins were classified into three types (Type-I/-II/-III), with Type-I AHLs forming Clade-B, and the other two types together diversifying in Clade-A. Two types of AT-hook motifs (H1 and H2) were found in the AHL proteins (Figure 2). Both of H1 and H2 in the AHL proteins shared the same conserved R-G-R-P core, showing that the ability of bind the minor groove of AT-rich B-form DNA. The conservation of H2 with a longer core R-G-R-P-R-K-Y heptapeptides was higher than that of the H1 in cotton. H1 was found only in Clade-B, while H2 or H1 plus H2 were found in Clade-A (Figure 2). The AHL proteins in T. cacao, the closest related species of cotton, contained three types , while the AHL proteins in grape has only two types, Type-Ⅰ and -Ⅲ. The conserved structure of AHL9, AHL5 from 4 species contained 2 AT-hook motifs, indicating the distinct function in development. Almost all AHL genes in cotton (except for Gorai.012G0247001, Ga04G1890.1, Gh_D04G0182.1 and Gh_A05G3407.1, named AHL-X5) contained AT-hooking motif (s) and PPC/DUF296 domain. We considered AHL-X5s (Table 1, Additional file 1 and 2) in cotton as pseudogenes, because they contained the most regions of PPC/DUF296 domains although lacking the AT-hooking motifs and core sequences (motif 2), so these four genes were used as the members of AHL family for further analysis.
To investigate the diversity of gene structure, we performed multiple sequence alignment to generate the exons/intron pattern using the GSDS (http://gsds.cbi.pku.edu.cn/chinese.php). The structures of AHL genes can be divided into two types, with intronless and multiple-exon. The 26 AHL genes in Clade-B showed intronless in G.raimondii, while those in Clade-A with 5-6 exons (Figure 2). Most of the AHL genes in both G. arboreum and G. hirsutum presented similar exon/intron gene structure. The exception was AHL20-2 in Clade-B, which had only one exon in G. raimondii, but its orthologous genes in both G. arboreum and G. hirsutum showed 4-5 introns in CDS, this indicating that the rapid evolution with intron-insertion (Additional file 5).
Chromosomal location and synteny analysis of AHL genes
A total of 48 GrAHL genes were unevenly mapped onto 13 chromosomes of G. raimondii. Each chromosome contained 3-6 AHL members usually. Chromosome 07 contained 8 AHL genes, while chromosome 10 and 11 had only one AHL, respectively (Figure 3). The distribution of GaAHL and GhAHL genes showed similar to G. raimondii but some AHL genes in scaffolds (Additional file 1 and 2).
We surveyed the collinear relationship among the orthologous AHL genes from V. vinifera, T. cacao and G. raimondii to investigate the putative clues of evolutionary events. There were 15, 21, 29 and 48 AHL genes in V. vinifera, T. cacao, A. thaliana and G. raimondii, respectively. Specific loss and expansion of AHL genes were found in four species. AHL16 and AHL17 were not found in V. vinifera, while AHL-X not in A. thaliana. Most of AHL genes showed one-to-one corresponding relationship in V. vinifera and T. cacao, while 2-4 orthologous genes were found in G. raimondii (Figure 3, Additional file 3). In order to investigate the pattern of gene duplication, MCScanX was used to analyze AHL gene family in G. raimondii, G. arboreum and G. hirsutum, the AHL genes in G. raimondii and G. arboreum showed the correspondent relationship between those from D-subgenome, A-subgenome in G.hirsutum respectively (Additional file 7). The result indicated that the expansion of GrAHL gene family were with segmental duplication or whole genome duplication (WGD), no tandem duplication were found (Figure 4).
Different evolution of AHL genes in A and D subgenomes of G. hirsutum
To explore the selective constrains among the orthologous AHL genes in G. raimondii, G. arboreum and G. hirsutum, we calculated Ks, Ka and the Ka/Ks ratio for the AHL gene pairs (Additional file 9 and 10). It is generally believed that the value of Ks was not affected by natural selection, but that of Ka was affected by natural selection. The Ka/Ks value can also explain positive selection (Ka/Ks>1), neutral selection (Ka/Ks=1) and negative selection (Ka/Ks<1) during the evolution. In this study, 48 and 51 orthologous AHL gene paris were identified by OrthoMCL between G. raimondii and D-subgenome of G. hirsutum (GrAHL/GhAHL-Dt), and bewteen G. arboretum and A-subgenome of G. hirsutum (GaAHL/GhAHL_At), resprectively. The distributions of Ka and Ks between each pairs were shown in Figure 5. The Ka of GrAHL/GhAHL-Dt ranged from 0.972745 to 1.08213, while Ks from 0.795064 to 1.08921. The Ka of GaAHL/GhAHL_At ranged from 0.915553 to 1.03866, while Ks from 0.899268 to 1.30387. 19 gene pairs of GrAHL/GhAHL-Dt (39.6%) with Ka/Ks>1 were subjected to positive selection, while 2 (AHL24-2 and AHL17-3) negative selection; 17 gene pairs of GaAHL/GhAHL_At (33.3%) with Ka/Ks<1 were subjected to negative selection, while only AHL17-8 positive selection. The result suggested that the GhAHL genes derived from G. raimondii and G. arboreum underwent various selection directions during the evolution.
Gene expression profiles of GhAHLs
To explore the possible biological functions of AHLs, we inspected the expression patterns of different AHL genes in G. hirsutum based on the RNA-seq data downloaded from CottonFGD (http://www.cottonfgd.com). The AHL genes from G. hirsutum were expressed in different temporal and spatial patterns. Most GhAHL genes in Clade-B (such as AHL20, AHL22, AHL23, AHL24) were found strongly up-regulated expression in the stem, but extremely lowly in fiber, ovule, leaf, petal, root, stamen and pistil. Some of GhAHL genes in Clade-A (such as, AHL1, AHL7, AHL9 and AHL10) showed an extensive expression activity in different organs, highly expressed in the fiber and ovule, suggesting a special function of these genes in the development of cotton ovule (Figure 6). Interestingly, two AHL20-2 genes inserted by introns in G. hirsutum showed higher expression activity in all organs and periods than other member in Clade-B. The expression of GrAHL showed similar pattern in different tissues (Additional file 6). The expression result showed that the AHLs within each clade shared similar expression patterns with each other; however, AHLs in one monophyletic clade exhibited distinct expression patterns from those in the other clade.
For verification the data of RNA-seq, the qRT-PCR of six selected AHL genes in G. hirsutum was performed to analyze the expression pattern in stem, root, leaf, flower and ovule (-3, -1, 0, 1, 3, 5 PDA). The results showed that two AHL genes (AHL22-1, AHL20-2) in Clade-B displayed higher expression in stem, and lower expression in leaf. Three AHL genes (AHL9-1, AHL7-1and AHL10) in Clade-A expressed highly in the early development of ovule. AHL16-1 expressed extremely lower in stem, root, leaf and flower (Figure 7). The result coincided with the data of the RNA-seq, suggesting that the data from CottonFGD (http://www.cottonfgd.com) were reliable.