Global dissection of the BAHD acyltransferase gene family in soybean: Expression profiling, metabolic functions, and evolution


 Background: Members of the BAHD acyltransferase (ACT) family play important roles in plant defence against biotic and abiotic stresses. Previous genome-wide studies explored different acyltransferase gene families, but not a single study was found so far on the overall genome-wide or positive selection analyses of the BAHD family genes in Glycine max . A better understanding of the functions that specific members of this family play in stress defence can lead to better breeding strategies for stress tolerance. Results: A total of 103 genes of the BAHD family (GmACT genes) were mined from the soybean genome, which could be grouped into four phylogenetic clades (I- IV). Clade III was further divided into two sub-clades (IIIA and IIIB). In each clade, the constituent part of the gene structures and motifs were relatively conserved. These 103 genes were distributed unequally on all 20 chromosomes, and 16 paralogous pairs were found within the family. Positive selection analysis revealed important amino acids under strong positive selection, which suggests that the evolution of this gene family modulated soybean domestication. Most of the expression of ACT genes in soybean was repressed with Al 3+ and fungal elicitor exposure, except for GmACT84 , which expression increased in these conditions 2- and 3-fold, respectively. The promoter region of GmACT84 contains the maximum number of stress-responsive elements among all GmACT genes and is especially enriched in MYB-related elements. Some GmACT genes showed expression specific under specific conditions, while others showed constitutive expression in all soybean tissues or conditions analysed. Conclusions: This study provided a genome-wide analysis of the BAHD gene family and assessed their expression profiles. We found evidence of a strong positive selection of GmACT genes. Our findings will help efforts of functional characterisation of ACT genes in soybean in order to discover their involvement in growth, development, and defence mechanisms.


Glycine max.
Although G. max is an important crop worldwide, little work on BAHD family genes has been carried out on this species when compared to other model and crop plant species. To date, only two BAHD genes were reported in soybean: GmIMaT1 and the two alleles of GmIMaT3 [2], GmMT7 [20] and GmIF7MaT [75], which respectively malonylate the isoflavones daidzin, genistin, and glycitin. The whole-genome sequence of soybean is publicly available at Phytozome and soyKB databases [67,70], making it possible to identify and analyse the BAHD genes as well as explore their expression patterns and create strong hypotheses on their metabolic functions. Sequence comparisons of BAHD genes in soybean with already characterised genes in other species will assist gene function predictions. In this study, we have executed a global analysis of all BAHD genes in the soybean genome. The chromosomal distribution, phylogeny, gene structure, duplication events, positive selection, and transcriptional dynamics in response to different stresses were analysed in this gene family.

Results And Discussion Identification of BAHD family members in the soybean genome
Considering only the genes coding the specific BAHD catalytic motif (HXXXD), we identified 103 genes (Additional files 1: File S3) in the soybean genome encoding proteins of the BAHD family. The annotation of soybean BAHD genes was based on their fall in the clade and position on the chromosome (GmACT1-99) along with the already characterised genes Glyma.18G268200 (GmIMaT1), Glyma.13G056100 (GmIMaT3) [2], Glyma.18G258000 (GmMaT2) and Glyma.18G268400 (GmMaT4) (Fig. 1A). Details about on physical parameters, such as Glyma gene ID, chromosome startend position, coding strand, exon number, predicted protein sequence length, molecular weight, isoelectric point (pI), and predicted subcellular location are listed in Additional files 1: File S3.
Predicted protein length varied from 274 (Glyma.13G007200_GmACT31) to 511 (Glyma.18G258000_GmMaT2) amino acid residues, which indicates high variation within members of the BAHD family. However, most of them (70 out of the 103 members) code a protein with 450-520 residues. The range of predicted molecular weight is from 30.93-56.70 kD whereas the majority of the genes code a product with a predicted molecular weight ranging 45-52 kD, which falls within the values of characterised BAHD genes [16]. The predicted pI value of soybean BAHD proteins ranged from acidic 5.05 (Glyma.10G061900_GmACT53) to basic 9.27 (Glyma.02G264600_GmACT2) (Additional files 1: File S3). The predicted subcellular localization indicates that BAHD proteins could be present in almost all organelles of the soybean cell. However, GmIMaT1 and GmIMaT3 are primarily localised in endoplasmic reticulum and cytosol, respectively [2]. Most BAHD proteins (37/103) were predicted to localize in the cytosol, followed by mitochondria (27/103)  Motif analysis and gene structure Variation in length and sequence was found in the conserved region of soybean BAHD genes (Additional files 1: File S4). Among the ten representative BAHD motifs searched and annotated through Pfam and SMART tools, three motifs were always present in each gene identified (Additional files 1: File S4). The biological importance of these recurring motifs required an in-depth investigation.
The gene architecture of soybean BAHD genes was analyzed to better understand the general features within each clade. The exon-intron structure of each gene is shown along with the phylogenetic tree (Fig. 1C). Genes within the same clade contain almost same length and number of exons and introns, with some exceptions (Fig. 1C) GmACT12 contains the longest intron in the whole set of soybean BAHD genes (Fig. 1C). In clade II, 14 genes contain only one, 8 genes contain two, and 3 genes contain three exons. GmACT21, GmACT22, GmACT32 and GmACT33 have long introns compared to the other clade II members. Clade III genes varied in exon number, but most contain just a single exon. Two clade III members (GmACT51 and GmACT72) contain the maximum exon number (4) in whole soybean BAHD family. Most members of clade IV (76%) contain only one exon while the others contain two exons. Among the 16 paralogous pairs, 10 did not show remarkable differences except for GmACT21-GmACT22 of clade II which a member of the pair has an untranslated region (UTR) longer than the other member. The other six paralogous pairs showed small variations in exon number or intron length (Fig. 1C).

Phylogenetic relationships of the soybean BAHD members
The evolutionary relationship of the soybean BAHD genes was explored by constructing a phylogenetic tree to compare them with 43 functionally characterised BAHD genes from different plant species (Fig. 2). On the basis of substrate donor and acceptor, the 103 soybean BAHD genes were classified into four major clades (I-IV) and clade III was further subdivided into two subclades (IIIA and IIIB) (Fig. 2). Our results based on the phylogenetic relationships of the characterised soybean ACT proteins would predict their correct enzymatic functions.
Clade I contains 20 soybean BAHD genes along with 16 characterised genes from different plant species (Fig. 1). Most members of this clade code for benzoyltransferases, alkyl-hydroxycinnamate ester, or cell-wall acyltransferases. They use benzoyl as a donor substrate to form benzyl benzoates, such as CbBEBT, NtBEBT [18], or salicyl benzoate (PtSABT) and involved in plant defence mechanisms against herbivory [12]. Many characterised BAHD enzymes of clade I are responsible for the formation of benzenoid compounds [7,12,18]. Among them, PtSABT from Populus trichocarpa uses different CoA thioesters and alcohols as a donor to catalyse the production of salicyl benzoate for the synthesis of a variety of acetates, benzoates and cinnamates [12]. PtBEBT from Populus trichocarpa is more selective to produce benzyl benzoate [12] and takes part in the synthesis of salicinoids. CbBEBT produces benzyl benzoate in Clarkia breweri that composes its floral aroma. The CbBEBT gene is highly expressed in the flower, and especially in the stigma [18]. These BEBT genes code for products that have the capability to produce compounds that are important for plant defence or aroma [7,12,18]. Four BEBT genes from soybean (GmACT3, GmACT4, GmACT5, GmACT20) have high similarity with the characterised BEBT genes. Similarly to CbBEBT, GmACT3 and GmACT4 also showed high expression levels in the flower, which leads us to hypothesize that it likely plays a function in catalyzing aroma compounds. These genes are also highly expressed in the stem and the leaf, like PtSABT and PtBEBT that produce defence compounds. Some members of clade I were also reported being involved in the biosynthesis of cutin and suberin, which are known to be lipid barriers on organ epidermis surfaces. These barriers regulate gas, water and solute fluxes [63] and play critical roles against pathogen attacks [9]. Alkaylhydroxycinnamoyl ester BAHD genes, such as AtASFT [50], StFHT [69], AtDCF [64], and AtFACT [37] use feruloyl, p-coumaroyl, sinapoyl, and caffeoyl as a donor to form aromatic suberin, alkyl, or ω-hydroxy fatty acids, which are used for the synthesis of cutin polymers that in turn are important components of the cuticle and assist in the deposition of suberin into the inner side of the primary cell wall of secondary shoots and roots [9]. They also play a role in the synthesis of root waxes that contain hydroxycinnamates in A. thaliana [37]. Ten genes from soybean fell in clade I and, thus have the same predicted transferase function ( Fig. 2; Additional files 1: File S3). Interestingly, we have found in soybean only one orthologue of disinapoylspermidine (AtSDT; [44]) and another of coumaroyl-spermidine (AtSCT; [44]) that were annotated as GmACT2 and GmACT12, respectively (Fig. 2) while all the other genes in this clade have multiple copies or paralogous in the soybean genome. Monolignols, which participate in the formation of lignin, are synthesised through the phenylpropanoid pathway, primarily through the activity of hydroxycinnamoyltransferases (HCTs). The BAHD-ACTs also used the hydroxycinnamic acid (HCA) as donor substrate to esterify monolignols [84]. These cell wall-related BAHD-ACTs also belong to clade I, such as OsPMT [84], BdPMT [62], and OsAT10 [5] that use p-coumaroyl-CoA as a donor substrate to form lignols and glucuronoarabinoxylan (GAX). These activities are responsible for producing the lignocellulosic biomass from grasses, which could be a major feedstock for the generation of liquid biofuels [62]. The soybean genes GmACT1, GmACT13, GmACT14, GmACT16, GmACT17, and GmACT19 were also predicted to have cell wall-related functions, and they may play essential roles in the production of liquid biofuel compounds.
attenuate and involved in defence mechanism against herbivores [60]. While other members of the clade II, i.e., HCT/HQT and their relatives produce different compounds like lignin as well as chlorogenic acid (CGA) and quinates, which play protective roles against UV photooxidation [27] and pathogenic resistance [58]. CGA has also been implicated in cell wall development [51] and root hair formation [57]. Seventeen soybean BAHD genes are closely related to TpHCT2, which product transfers hydroxycinnamoyl moieties (p-coumaroyl and caffeoyl) to malic acid ( Fig. 1). This transferase activity of TpHCT2 is involved in phaselic acid synthesis, particularly in the leaves of red clover (Trifolium pratense, Fabaceae) [74]. TpHCT2 is related to GmACT27, GmACT28, and GmACT45, which also have high and specific expression in leaves (Fig. 2), and may play a role in phaselic acid synthesis. Meanwhile, GmACT21, GmACT22, GmACT32, and GmACT33 fall closest with TpHCT1A/1B, which use caffeoyl or p-coumaroyl as a donor to form shikimates or malates [74] and play a major role in monolignol synthesis [29,74]. Lignin precursors are synthesised through p-coumaroyl derivatives rather than methoxylated analogues in response to pathogen attacks [80]. TpHCT activity was higher in the stem and capable of transferring p-coumaroyl moieties from the respective CoA to shikimic acid, which is critical for the biosynthesis of monolignol lignin precursors [29,74]. AtHCT expression dramatically changed the lignin composition, which demonstrates the essential function of HCT in phenylpropanoid metabolism [6,29]. HCT homologues have been identified in several vascular plant species [9]. In conformity with other members of this group, GmACT21, GmACT22, GmACT32 also presented high expression levels in the stem, and they may transfer p-coumaroyl moieties to respective acceptors with essential roles in lignin biosynthesis. Other four members of this clade (GmACT35, GmACT36, GmACT38, and GmACT39) are closely related to SHT from Arabidopsis, which uses hydroxycinnamoyl as a donor substrate to form hydroxycinnamoylspermidine [27]. This compound is abundant in pollen of many species, which indicates a role of these phenylpropanoids in the pollen [27]. AtSHT expression was high and specific to the tapetum cells of the anther during early flower development in order to supply the building blocks for cell wall development in the pollen [27]. The metabolic profiling of flower bud tissues in Arabidopsis revealed that AtSHT knockout or RNAi plants had a positive correlation between acyltransferase protein accumulation and the amounts of trihydroxyferuloyl and dihydroxyferuloyl sinapoyl spermidines [27]. GmACT35 and GmACT38 were also more highly expressed in the flower than any other tissues and are expected to have a function in the acylation of spermidine in the pollen.
Clade III is further divided into sub-clades IIIA and IIIB ( Medicago, which facilitate the transport or storage of the metabolites [44,89]. Among the 14 aliphatic ACTs in soybean, only two, GmIMaT1 and GmIMaT3 [2] have been functionally characterised and play a distinct role against abiotic and hormonal stresses [2]. Their transcription patterns correlated positively with the malonylisoflavonoid composition of the tissue in different parts of the plant under different stress conditions [2]. Another aliphatic group contains 6 soybean ACTs along with MtMaT5 and MtMaT6 [89] from Medicago that malonylate different anthocyanins, such as cyanidin, delphinidin, and pelargonidin conjugates at 3-O-glucosides [89]. These genes are very specific for anthocyanin modification at the C3 position, and they have no activity towards C5 or C7 of anthocyanin or other flavones or isoflavones [89]. The soybean genes that closely relate to MtMaT5 and MtMaT6 are predicted to function in anthocyanin modification at 3-O-glucoside (Additional file 1: File S3). This group of clade IIIA is most probably associated with the stability of the anthocyanin glucosides since anthocyanin acylation at 3-O sugar with the malonyl moiety has been suggested to increase its stability [16].
Clade IIIA also contains 7 soybean BAHD-ACT genes (GmACT52, GmACT57, GmACT61, GmACT63, GmACT72, GmACT76 and GmACT77) closely associated with the aromatic ACTs from Arabidopsis, At3At1 and At3At2 (Additional file 1: File S3; [44]). At3At1 and At3At2 use p-coumaroyl, feruloyl and caffeoyl CoAs as donors to specifically modify the C6 hydroxyl of glucose at three anthocyanin positions, but it did not have any affinity toward sinapoyl-CoA. We predict that these 7 aromatic ACTs will also prefer coumaroyl-CoA to modify the anthocyanin at the C3 position just like their closely related Arabidopsis orthologues ( Fig. 2; Additional file 1: File S3). We did not find in the soybean genome any aromatic gene that was closely related to a gene able to modify 5-O-glucosides, such as Gt5AT that acylates the C6 hydroxyl of 5-O-glucosyl residues of anthocyanin with either caffeoyl, pcoumaroyl [25] or aliphatic anthocyanin MaTs, such as At5MaT ( Fig. 2; [15]) and Ss5MaT1 from Arabidopsis thaliana (Brassicaceae) and Salvia splendens (Lamiaceae), respectively ( Fig. 2; [77]). Clade IIIB contains 9 GmACT genes from soybean with 7 characterised genes (Fig. 2). This clade contains malonyltransferases (MaTs) that catalyse flavonoids with malonyl-CoA to form malonates, acetyl CoA:deacetylvindoline 4-O-acetyltransferases (DAT) which catalyse the last step in vindoline biosynthesis [73], acylsugar acyltransferases (ASAT), which is important in the acylsucrose biosynthetic pathway and uses sucrose and acyl-CoA ester substrate specificity to contribute in different types of accumulated acylsucroses [24]. Acylsugars are polyesters of short-to mediumlength acyl chains of sucrose or glucose backbones that are produced in secretary glandular trichomes of many solanaceous plants, including the cultivated tomato (Solanum lycopersicum).
These compounds act against different insects through a mechanism involving in a multitrophic defence interaction involving their metabolization to volatile fatty acids [35]. Ss5MaT2 also fell in clade IIIB separated from other the MaTs, which is consistent with D'Auria analyses [16]. These BAHD-ACTs share the anthocyanin specific motif (NYFGNC) along with the BAHD HXXXD motif. The characterised member of clade IIIB (Ss5MaT2 from Salvia splendens) malonylates the anthocyanin glycosides bisdemalonylsalvianin and shisonin by transferring the malonyl moiety to the C4 position of the 5-O-glucosyl residue [76]. The soybean genes of clade IIIB also suggest forming salvianin, which has two malonyl moieties, and malonylshisonin, which plays an important role in flower colour [76]. Although the biochemical properties of Ss5MaT2 are similar to that of clade IIIA members, it fell into the clade with ASAT and DAT.
Clade IV contains 21 soybean BAHD genes along with five characterised genes (  [25]. HCBT produces the aromatic amide dianthramide in carnation [83]. Anthraniloyl-CoA: methanol AT (AMAT) uses anthraniloyl-CoA as a donor substrate to acylate methanol and produce a foxy odour compound (methyl anthranilate) in concord grapes [81]. Moreover, GmACT93 and GmACT96, which were predicted to involved in cutin formation and closely related to the gene DEFECTIVE IN CUTICULAR RIDGES (DCR) from Arabidopsis thaliana, Medicago truncatula, and Populus trichocarpa [61]. In the flower, DCR converts monomers into polymeric cutin, which is composed of fatty acid, glycerol and aromatic monomers [32]. Changes in epidermal cell differentiation and postgenital organ fusion were observed due to defective cuticle in DCR-deficient plants. DCR was also expressed in the root cap, lateral root primordia, and developing lateral root. Its root expression might point to involvement in the biosynthesis of suberin [61].
Downregulation of DCR in dcr mutants led to excessive root branching and changes of root architecture by influencing lateral root emergence and growth [61]. All in all, these analyses help us formulate educated hypotheses about the diversity and evolution of substrate specificity within the BAHD family in soybean. soybean BAHD genes tend to have high rates of recombination [32].

Chromosomal localization and duplication of soybean BAHD genes
Tandem and segmental duplications increased the gene numbers into a large family in different plant species [79]. Potential gene redundancies can be discovered by the identification of closely related paralogues in the genome whereas the accurate identification of true orthologues between species can lead to the creation of strong hypothesis of common gene function in other species. The soybean genome contains a high number (103) of BAHD genes as compared to 69 in Cynara cardunculus [49], 64 in A. thaliana [17], but less the 119 BAHD genes in the Oryza sativa genome [16]. This large gene family might be the result of the two rounds of whole-genome duplication (WGD) events that occurred 58-60 (the Papilionoideae allotetraploidization event) and 13 Mya (recent soybean allotetraploidization) in soybean [66]. The presence of paralogous genes in a plant genome is more often related to their functional category than the genetic proximity between species [67]. We found sixteen paralogous pairs in the soybean BAHD family (Additional file 2: Table S1). These paralogues are present in all the four phylogenetic clades (Additional file 2: Table S1). Clade I contains 7, clade II contains 3, clade III contains 2, and clade IV contains 4 paralogous pairs (Additional file 2: Table S1).
The presence of clear paralogues is evidence of common ancestry and helps to elaborate strong hypotheses about the functional conservation of BAHD genes in soybean along with those already characterised in different species. The selection history of coding sequences can be assessed through the analysis of base substitutions of the protein code called the K a /K s ratio. To investigate the deviation of duplicated BAHD acyltransferases, we determined the non-synonymous substitutions per site (K a ), the synonymous substitutions per site (K s ), and the respective K a /K s ratios for each paralogous pair. The partition of each pair was estimated through the K s values. All paralogous pairs had a K s value from 0.09 to 0.28, which is consistent with the soybean duplication events (Additional file 2: Table S1). The selection pressure can be determined through the K a /K s ratio. Most of the paralogous pairs showed negative Ka/Ks ratios (Additional file 2: Table S1) suggesting that the BAHD gene family in soybean experienced strong purifying selection pressure and was bound by strong evolutionary constraints to maintain its stability [31]. Divergence time analysis showed that these paralogous genes evolved between 7.38 to 22.95 million years ago (Mya), which is consistent with the recent allotetraploidization of the species that occurred around 13 Mya (Additional file 2: Table S1).

Positive and purifying selection of BAHD genes in soybean
The Ka/Ks ratio is used as a measure of the direction and magnitude of gene selection [38]. The Ka and Ks substitution rates in this study were estimated on the basis of simple probability theory.
Positive selection was considered significant with Ka/Ks ratio > 1 [59].  Table S2). The changes in codons, as well as sequences for pairwise comparison, were estimated at the Ka and Ks basis. The variability between Ka and Ks polymorphisms at the allelic distribution level presents quite an explicit proof for positive selection [30]. The allelic distribution in various coding sites can be revealed by the strapping test of neutrality, which is applicable to genomic data containing a large number of polymorphisms, along with the homogeneity test allow for comparing the frequency distribution of synonymous sites [3,46]. The changes in the coding region during the evolutionary time under analysis showed the cumulative presence of synonymous, non-synonymous and ambiguous (indels) codons (Additional file 3: Figure S1).
Synonymous and non-synonymous mutations showed low cumulative behaviour at the start site and then gradually increased during the evolutionary time analysed, whereas the opposite behaviour was observed for ambiguous codons at the start point and then acquired a static behaviour at the end of the code. The very same behaviour of synonymous, non-synonymous and ambiguous codons was found in all four clades of the BAHD family (Additional file 3: Figure S1).
We estimated the Likelihood Rate Test (LRT) through the empirical Bayes method at specific branching points and identified various diversifying selection sites. The mixed-effect model of evolution (MEME) was used to identify the diversifying selection of BAHD genes also on the basis of empirical Bayes procedure [85]. MEME identified 7, 17, 16, and 4 episodic diversifying coding sites (p < 0.05) in clades I-IV, respectively (Additional file 2: Table S3). The synonymous (α) and nonsynonymous (β) substitution rates were calculated, and coding sites with values β > α were considered as significant to determine the sites under diversifying selection. The maximum-likelihood estimation of β + > α in MEME was obtained for codons 60 and 434 in clade I, 189 and 463 in clade II, 277 and 643 in clade III, and 299 in clade IV (Additional file 2: Table S3). The prevalence of purifying or natural selection masked the episodic natural selection with the transient period of adaptive evolution. Thus, we could not identify sites under positive selection due to the sensitivity of tests used [55] and the stringent parameters chosen, such as p < 0.05 and the empirical Bayes factor threshold of analysis for selection of sites (> 15), so that the best outcomes were obtained when the selected branches were placed in the conserved lineage.
Ambiguities in the posterior gene-and site-specific distributions were calculated through FUBAR using the MCMC approach [3,54]. This analysis revealed the following number of sites in BAHD genes under pervasive diversifying selection: one site in clade I, four in clades II and III each, and two in clade IV (Additional file 2: Table S4). The diversifying evolutionary sites were detected through the random effect model using the rate distribution with a large number of parameters at the significance level at 95% confidence interval and Pr [β > α] values (Additional file 2: Table S4). The class rate weight at each coding site under positive selection was calculated through a bivariate general discrete distribution. The posterior mean values ranged between 0.80-0.92, which are closer to the value of the potential scale reduction factor (Additional file 2: Table S4)  The evolutionary fingerprint of soybean BAHD genes Nucleotide and amino acid substitutions in the codon model were used to identify synonymous and nonsynonymous substitutions [4,48]. The genetic algorithm in the codon model evolution was used to identify the evolutionary fingerprint in coding sites of BAHD genes. The codon model evolution used a phylogenetic Markov model that includes substitution rates, character frequencies [19], amino acid substitution rate clustering [72], and branch lengths estimated through the maximum-likelihood method. The 9052 models generated were used in codon model selection on the basis of likelihood log and modified Bayesian Information Criterion (mBIC). Selective effects with an exchangeable preference for particular amino acid were found in BAHD genes through a model that used the combined empirical codon and transition/transversion-related physiochemical parameters [36]. The model with log (L) value − 16360.3 was considered optimum for the amino acid substitution analysis.
We observed mBIC values ranging from 33701.1 for clade II to a maximum value of 81506.1 for clade IV (Additional file 2: Table S5). All clades showed three-class rates for the distribution of amino acids in different classes (Fig. 5) with the estimation of a single Ka/Ks substitution rate (Additional file 2: Table S5). A genetic multi-rate model algorithm was used to analyse the class rates in order to calculate substitution rates at the amino acid level during the evolutionary time scale (Fig. 5). The substitution rate in each class was calculated through genetic models using the Stanfel class parameters ( Fig. 5; [72]). The substitution rate distributed the amino acids into the three classes through evolutionary rate cluster, and the substitution pair ACGILMPSTV revealed 90% substitution, whereas DENQ had 50% substitution, and FWY and HKR presented < 50% substitution rates. The best fit model with mBIC values defined the clustering rate, and the numerical values of corresponding rate class substitution are inferred with maximum-likelihood estimation, which were computed evolutionary evidence ratio for the gene [4].

Analysis of cis-regulatory elements in promoter regions of BAHD genes in soybean
Extensive studies on ACTs indicate their roles in plant growth, development, and stress tolerance.
Many cis-acting regions are found in the upstream region of GmACTs genes according to our in silico analyses (Additional file 1: File S5). The 103 GmACT promoter regions contain cis-elements responsive to light (43%), hormones (29%), environmental stresses (20%), and plant growth (8%) (Fig. 4A). These regions contained transcription factor binding sites involved in response to plant hormones (Fig. 4B), biotic stresses and pathogen defence, MYB binding sites, and response to heat, low temperature, and drought (Fig. 4C) showed the most (12) MYB related elements, and also showed high expression values, up top sevenfold higher in response to Al 3+ and two-fold higher to fungal elicitor than the control (Additional file 1: File S1c, S1d). The promoter regions of paralogous pairs were also compared. Both members of each pair showed variation in cis-elements related to stress or hormonal responses or other signals in their promoter regions. Some hormonal, stress or growth-related elements were often present only in one member of the pair (Additional file 1: File S5). Although the GmACT paralogous pairs have little variation in protein length, the sequences of their promoter regions were highly different, which suggests divergent specific roles of each member. We concluded from this analysis that GmACTs may play vital roles in plant stress perception, signalling, or biotic and abiotic stress defence.

Expression patterns of soybean BAHD genes in different tissues
Expression data for the 103 soybean BAHD genes in different soybean tissues were extracted from Phytozome ( Fig. 6; Additional file 1: File S1a). Most of the genes showed specific tissue expression patterns related to their clade function. The benzenoid-related genes (GmACT3 and GmACT4) had maximum expression levels in above-ground organs, whereas GmACT5 is highly expressed only in the root ( Fig. 6; Additional file 1: File S1a). Benzenoid compounds are produced mostly in flowers and known to be involved in the aroma of numerous fruits and contribute to improving flavour quality [7].
Indeed, already characterised genes from this clade, PhBPBT from Petunia x hybrida and CbBEBT from Clarkia breweri are highly expressed in floral parts and consistently involved with benzylbenzoate compounds [7,18]. Monolignol synthetic genes (e.g., GmACT1 and GmACT16) showed a high expression level in root and pod. Likewise, OsPMT and BdPMT, which are specific to monolignols with p-coumarate and ubiquitously expressed but particularly high in the spikelets. When overexpressed, these genes led to a three-fold increase in pCA lignin compared to the control [62]. Monolignol-pCA conjugates integrate with lignin biosynthesis through oxidative radical coupling in order to generate the pCA appendages [62,84]. Polyamine conjugates are abundant in seeds and serve as nitrogen reserves during germination [23]. The Arabidopsis SDT gene is highly expressed in seeds and produce disinapoyl spermidine and its glucoside while the SCT is mainly expressed in roots and produce coumaroylated spermidines [43]. Soybean also contains SDT and SCT related genes, GmACT2 and GmACT12, respectively, which are expressed in all organs of the plant, although GmACT12 expression is barely detectable in the stem, leaves and roots, although spermidine accumulates at high levels in seeds and the root [43].
The clade II members, GmACT29 and GmACT30, showed high expression only in the seed. GmACT40 is expressed in the root and nodule, while GmACT38 is especially high in nodules and flowers ( Fig. 6; Additional file 1: File S1a). On the other hand, GmACT41, GmACT45, GmACT35, GmACT21, GmACT22, and GmACT32 showed high expression levels in all tissues ( Fig. 6; Additional file 1: File S1a) suggesting putative roles in cell wall development and plant defence mechanism by producing CGA or quinate-like compounds, given their close relatedness to TpHCT1A and TpHCT1B ( Fig. 2; [74]).
Phenolamide, an important compound of clade II genes, is often associated with floral induction and development. GmACT21, GmACT22, and GmACT32 showed high transcription levels in the stem and flowers, similarly to TpHCT1, which showed 4 to 5-fold higher expression in these organs than the leaf and played an important role in stem lignification by participating in monolignol biosynthesis [29,74].
GmACT27, GmACT28 and GmACT45 are similar at the amino acid and also showed the same expression pattern as TpHCT2 [74]. A reduction in the biosynthesis of phenolamides or their conjugates leads to male sterility in plants. Tri-substituted hydroxycinnamoyl spermidines were synthesised in the pollen coat, and this conjugation is catalysed by SHT [21]. GmACT22, GmACT32 and GmACT35 also had their highest expression in the flower and, given their phylogenetic position and expression patterns, these genes might be involved in the synthesis or conjugation of spermidines in the pollen coat of soybean.
Clade III genes, GmACT67 and GmACT72, are highly and specifically expressed in the flower, GmACT55 in the root, GmACT49, GmACT54, and GmACT65 in the nodule, whereas GmACT66 is expressed in the leaf, GmACT47 in stem and flower, GmACT57 in root and flower, and GmACT71 in seed and the pod ( Fig. 6; Additional file 1: File S1a). The putative isoflavonoid malonyltransferase genes, such as GmIMaT1, GmMaT2, GmIMaT3, GmMaT4, GmACT68, GmACT69, and GmACT70, showed high expression levels in almost all tissues. Other related members presented a more specific expression pattern, such as GmACT71 in seeds and pods, and GmACT50 in the leaf ( Fig. 6; Additional file 1: File S1a). The different expression patterns of these genes suggest putative roles of malonylated flavonoids in storage, defence mechanisms, symbiosis signalling, flower colour, and resistance against biotic and abiotic stresses. Most of clade IIIB members had high expression in leaf and stem, like the characterised members of this group. High and specific expression of CrDAT in the leaf correlates with the synthesis of vindoline from tabersonine [73]. Likewise, the expression of ASAT genes was higher in the trichomes of S. lycopersicum stem and S. pennellii leaf, which correlates with the accumulation of acylsugars in these organs [24].
DCR genes are specifically expressed in the epidermis of vegetative and reproductive organs.
Additionally, they also showed expression in the lateral root primordia and developing lateral roots [61]. The cuticle of different organs is the primary barrier against several biotic and abiotic stresses [71]. High expression of DCR genes in young leaves and stem is predictive of their role in the formation of cutin polymers in vegetative organs [61]. During leaf expansion, the cutin polymeric matrix is formed, and DCR plays a vital role to incorporate the hydroxyl fatty acid chains into the cutin polymer [65]. It is also important to convert monomer units into its polymeric form of cutin in the flower. DCR develops the plant lipid polyester for epidermal cell differentiation and patterning, trichome development, and protection from undesired postgenital fusions in aboveground organs [61]. Almost all clade IV genes showed high expression levels in seeds, the root, and nodules. Some genes, such as GmACT82 and GmACT95, presented particularly high expression levels in the root and flowers, GmACT94 and GmACT91 in seeds, GmACT99 in the root and leaves ( Fig. 6; Additional file 1: File S1a).

Expression profiles of BAHD genes during nodule development in soybean
The gene expression of BAHD members was analysed in RNA-Seq data derived from soybean nodules at different developing stages [1]. The results showed that BAHD genes were differentially expressed in the nodule at different developmental stages, from nodule development until its senescence ( Fig. 7; Additional file 1: File S1b). Most of clade II and clade IV genes showed higher expression in the root than nodules, while 50-80% genes of clades I and III had higher transcription levels in nodules than the root. The expression levels of GmACT39, GmACT40, GmACT41 (clade II), GmACT27 and GmACT33 (clade IV) were 10 to 40-fold higher in the initial stages of nodule development and reached up to > 80-fold in later stages (stage 5, pre-senescence) than in the root. Even though most of the clades I and III genes were expressed more highly in nodules than roots, their expression levels were not much higher than genes in clades II and IV, except for GmACT12 and GmACT68, which showed more than 2 and 70 fold higher expression than root, respectively ( Fig. 7; Additional file 1: File S1b).
The expression patterns of GmACT35, GmACT39, GmIMaT1, GmACT86, and GmACT87 gradually increased as the nodule developed, peaking at stage 5 and then dropping at senescence (stage 6), which indicates their roles in nodule development. Although the expression levels of GmIMaT1 and GmACT86 were higher in roots than nodules, their expression increased with nodule development.
Therefore, these genes may play a symbiotic role in the relationship between root and nodule development by modifying secondary metabolites in roots for secretion into the rhizosphere in order to attract rhizobia, such as isoflavonoids that act as a signalling molecule for inducing nod factor biosynthesis in rhizobia. The transcription level of GmIMaT3 was the highest after GmACT87 than all other BAHD genes in the different nodule stages analysed, and its expression remained static throughout the nodule development and almost similar with that in the root ( Fig. 7; Additional file 1: File S1b). The expression levels of GmACT15, GmACT18, GmACT40, GmACT41, GmMaT2, and GmACT75 were higher in the initial stages and gradually decreased as the nodule development advanced, reaching the lowest expression levels at its final stage ( Fig. 7; Additional file 1: File S1b).
Therefore, these genes may play a function in rhizobial interaction and nodule initiation and development.

Expression patterns of soybean BAHD genes in response to different stresses
The expression of BAHD family genes was analysed from RNA-Seq data obtained from plants subjected to acidic condition (low pH 4.0), aluminium treatment 50 µM Al 3+ (pH 4.0), and fungal elicitor after 2 and 24 h. Most of the BAHD genes expression decreased in response to low pH and Al 3+ , except for GmACT13 (clade I) and GmACT36 (clade II). Their expression levels doubled at low pH but decreased or remained the same at Al 3+ stress conditions ( Fig. 8; Additional file 1: File S1c). The expression of GmACT49, GmIMaT1, GmIMaT3 (clade III) and GmACT84 and GmACT92 (clade IV) increased dramatically in response to both stress conditions ( Fig. 8; Additional file 1: File S1c). The expression of GmACT49, GmACT84 and GmACT92 increased about 11, 7, 40 fold in response to low pH, and 6, 7, 70 fold in response to Al 3+ stress, respectively, whereas the expression of GmIMaT1 and GmIMaT3 significantly increased as compared to the control under these conditions ( Fig. 8; Additional file 1: File S1c). Therefore, the BAHD genes that showed elevated transcription levels after low pH and Al 3+ treatments could be good candidate genes for soybean to Al 3+ toxicity tolerance. The transcription levels of GmACT12 (clade I), GmACT39 (clade II), GmACT62, GmACT68, GmMaT2, GmMaT4, GmACT75 (clade III) and GmACT86 (clade IV) decreased from 5 to 70-fold, revealing their sensitivity towards low pH and Al 3+ stress conditions. Transcriptomic analyses in response to fungal elicitor (FE) showed that the expression of 29% of the BAHD genes (30 out of the 103) changed significantly, with 10 genes increasing and 20 decreasing their expression levels ( Fig. 9; Additional file 1: File S1d). The expression of GmACT35, GmACT36 (clade II) GmIMaT1, GmMaT2, GmACT50, GmACT57, GmACT58 (clade III), GmACT84, GmACT85 and GmACT92 (clade IV) increased while the other genes had their transcription level decreased. The expression of GmACT36, GmACT57, GmIMaT1, GmACT58, GmACT84 and GmACT92 increased almost 12, 20, 2, 40, 2, and 2 folds, respectively ( Fig. 9; Additional file 1: File S1d). This expression pattern suggests a role in resistance to fungal attack by the synthesis and secretion of modified metabolites.
Indeed, fungal and bacterial infections, or elicitor treatment in Solanaceae plants and Arabidopsis, increase the synthesis of coumaroyl or feruloyl tyramines [53,87]. The anti-microbial activity was demonstrated for dicoumaroyl-caffeoyl spermidine, which inhibits mycelial growth of Pyrenophora and reduces powdery mildew infection in barley seedlings [83]. Likewise, constitutive accumulation of tyramine derivatives in transgenic tomato promotes resistance to Pseudomonas syringae [11].
MtMaT3 expression increased in response to fungal infection and decreased in response to copper treatment [86]. The observed modulation of BAHD family genes in soybean in response to fungal elicitor may correlate with an increased acylation of active compounds against fungal attacks.
Expression patterns of soybean BAHD family genes in different soybean genotypes seed coat and cotyledons Soybean Illumina expression data of seed coat and cotyledon obtained from developing seed (stage 5, 380-450 mg in weight) of three varieties were downloaded [13] and analysed. Interestingly, some BAHD family genes were highly expressed in seed coat while expressed lowly in cotyledons, and vice versa. GmACT5 (clade I), GmACT25 (clade II), GmMaT2 and GmIMaT3 (clade III), and GmACT84 (clade IV) showed high expression in the cotyledon as compared to seed coat (Additional file 1: File S1e) suggesting a function in the storage of modified metabolites, or resistance against seed borers. The expression levels of GmACT4, GmACT15 (clade I), GmACT21, GmACT22, and GmACT35 (clade II), GmACT57, GmMaT4, GmACT70 (clade III), and GmACT79, GmACT81 (clade IV) were more specific and high in brown than blackl seed coats and almost absent in yellow seed coats (Additional file 1: File S1e) suggesting a role in anthocyanin modification.
Expression modulation of soybean BAHD family genes by MYB transcription factors, GmLEC2a and GmWRi1b in the hairy root heterologous system MYB transcription factors (TF) tightly regulate the biosynthesis of flavonoids and anthocyanins [41].
Many soybean MYB factors have been reported to either positively or negatively regulate isoflavonoid accumulation in roots and seeds or the production of anthocyanins and proanthocyanidins (PAs) in seeds [41]. Two soybean MYB TFs, including one CPC-like R3-MYB repressor (GmMYB3), and an R2R3-MYB activator (GmMYB7), were over-expressed in soybean hairy roots and an RNA-Seq analysis was performed. The expression of GmACT17 (clade I), GmACT23, GmACT40, GmACT41, GmACT43 (clade II), GmACT75 (clade III) and GmACT98 (clade IV) was upregulated by GmMYB3 (Additional file 1: File S1f; Additional file 3: Fig. S2) while the expression of GmACT9, GmACT12 (clade I), GmACT21, GmACT22, GmACT28 (clade II) was induced by GmMYB7 (Additional file 1: File S1f; Additional file 3: Fig. S2). The transcription levels of GmACT24 (clade II) and GmMaT2 (clade III) were 2 to 5-fold upregulated by both, GmMYB3 and GmMYB7, as compared to the control. Most of the induced genes belong to clades I and II, which are involved in the modification of different compounds that by adding ferulyl, caffeoyl, sinopoyl, p-coumaroyl, and hydroaxycinnamoyl moieties. On the other hand, the upregulated genes of clade III may be involved in the synthesis of flavonoids or anthocyanins since we observed the over-expression of MtPAR, which a previous report showed to lead a 10-fold increase in the flavonoid contents [41]. GmMYB3 and GmMYB7 negatively regulated most of the clade III genes, but not GmACT44 (clade II). The expression levels of GmACT54, GmACT57, GmACT62, GmACT64 and GmACT65 were suppressed in the MYB over-expressed soybean hairy roots (Additional file 1: File S1f; Additional file 3: Fig. S2). High isoflavonoid levels were reported in soybean hairy roots [28]. MtPAR in soybean hairy roots decreased contents of isoflavonoids, such as daidzin, glycitin, genistin, malonyldaidzin, and daidzein up to 3, 2, 4, 2.5, and 8-fold, respectively, as compared to control through repressing the IFS gene [42]. In fact, MYB-TF suppressed the expression of not only IFS but also other isoflavonoid biosynthesis genes. An active complex is formed to activate isoflavone biosynthesis via a MYB activator [42]. MtPAR also has a negative impact on isoflavone production through expression suppression of the biosynthesis genes [41].
GmACT56 and GmACT62 were highly expressed in roots along with other organs were upregulated but those root-expressed genes that were also expressed in nodules (e.g., GmACT92, GmACT98) were down-regulated. We also noticed that some of BAHD members with very low count values in different soybean organs were significantly upregulated (GmACT33) or downregulated (GmACT17) in the GmWRi1b-overexpressing hairy root system. Most of the clade IV genes that targeted acyl-lipids for wax or cuticle synthesis were down-regulated in the heterologous root system (Additional file 1: File S1h).
qRT-PCR validation of BAHD genes in developing nodules and in response to stresses In order to validate in silico expression and further reveal the involvement of BAHD genes in physiological processes, we performed quantitative reverse-transcription PCR (qRT-PCR). We analysed the expression profiles during nodule development and the response to Al 3+ and fungal elicitor exposures. To select GmBAHD genes for expression analyses, we considered the following criteria: (i) high expression in different tissues; (ii) representation of genes in all clades; (iii) inclusion of some paralogous pairs. The expression profiles revealed by qRT-PCR were generally consistent with the RNA-seq data. The expression of GmACT2 was higher in nodules than roots but static in all nodule developing stages, while the expression levels of GmACT21, GmACT22, GmACT55, GmACT68, GmACT79, GmACT81, GmACT84 and GmACT86 were higher in roots while lower but static during nodule development. GmACT12, GmACT15, GmACT41 expression was higher during the initial stages of nodule development but gradually decreased and remained low during nodule senescence.
GmACT44 expression was initially low and gradually increased and was found high expression during nodule senescence (Fig. 7B). GmACT84 expression was high in response to Al 3+ (Fig. 8B) and fungal elicitor (Fig. 9B) compared to the control while most of the other genes tested decreased their expression significantly or remained almost unchanged.

Conclusions
The findings described here will help reveal the roles that BAHD genes play in soybean as well as in other plant species. We identified 103 BAHD genes in the soybean genome spread over all its 20 chromosomes. The distribution of genes within the genome, their organization and structure suggest a complex evolutionary history of the BAHD family in this legume species. Several cases of segmentally duplicated genes were found, which could have originated during the genome duplication event. The expression dynamics of BAHD genes in different tissues of the mature plant, during nodule development, and in response to different stressors revealed that this family is broadly involved in soybean organogenesis and stress responses, which warrant further investigation on the molecular and physiological functions of members of this family.

Methods
Identification of BAHD family genes in the Glycine max genome BLASTP against the G. max proteome dataset obtained from the genome sequence (https://phytozome.jgi.doe.gov/pz/portal.html) was used to search for BAHD-coding genes using two characterised protein sequences from legumes as query: GmMT7 from soybean [20], and MaTs from Medicago truncatula [86,89]. The HXXXD motif was searched manually in each retrieved sequence to confirm it as a candidate BAHD protein. Features of BAHD proteins, such as the number of amino acid residues, along with the start-to-end position of their respective gene in the genome, and chromosome location were acquired from the Phytozome database. Physical protein parameters of each putative gene product, such as molecular weight (Mw) and isoelectric point (pI) were calculated from pI/Mw tool at ExPASy (http://www.expasy.org/tools/) using the default parameters.
Chromosomal location, phylogeny, and gene structure analysis of the BAHD genes PhenoGram Plot (http://visualization.ritchielab.psu.edu/phenograms/plot) was used to create the image of chromosomal locations of BAHD family genes using the chromosomal position information available at Phytozome. A phylogenetic analysis with functionally characterised proteins from different species and all BAHD proteins encoded in the soybean genome was conducted to explore the evolutionary relationships of this gene family. An unrooted phylogenetic tree was constructed following the Neighbour-Joining method in MEGA 6.0 [78]. The tool Gene Structures Display Server (http://gsds.cbi.pku.edu.cn) was used to determine the exon/intron structures of individual BAHD genes.

Calculation of K a and K s ratios
The Plant Genome Duplication Database (PGDD, http://chibba.agtec.uga.edu/duplication/) was used to obtain the K a and K s values of BAHD paralogous pairs. Divergence time (T) was calculated using the formula T = K s /2λ, where λ = 6.161029 × 10 − 9 for G. max [45]. Selection pressures on duplicated genes were estimated using the K a /K s ratio for each paralogous gene pair. Assumptions of negative, neutral and positive evolution were considered for the values < 1, =1, and > 1, respectively.
Diversifying evolutionary selection analysis A phylogenetic analysis was carried out by the HyPhy program on cDNA sequences using the Nei-Gojobori method and applied in MEGA 6.0 to calculate K a /K s ratios for neutrality analysis of BAHD genes within each clade. Maximum-likelihood based on the Kimura-2 parameter model was used to infer the evolutionary history of the genes within each clade. The detection of episodic diversification effect of individual coding sites was performed through different approaches. The episodic diversifying selection and pervasive positive selection were identified at the individual branch site level through the mixed-effect model of evolution (MEME). Markov Chain Monte Carlo (MCMC) was used in fast unconstrained Bayesian approximation (FUBAR) to ensure the strength over predefined sites through approximate Bayesian method [55]. The parameter ω = β/α was used in MEME to fit the data in the GTR nucleotide model as initial values. The β:β − ≤ α and β + parameters were used to measure the selection pressure, whereas β − , β + and α were used to estimate the variability of site-tosite rate substitution. LRT based on χ 2 asymptotic distribution was considered significant at p-value < 0.05.

Identification of conserved motifs and promoter region analysis
The analysis of conserved region within BAHD family genes was executed through MEME (http://meme-suite.org/), and their genomic assemblies were screened through the Pfam database (http://pfam.sanger.ac.uk). Regulatory elements and their analytical data were obtained using the Plant cis-acting Regulatory DNA Elements (PlantCARE) program (http://bioinformatics.psb.ugent.be/webtools/plantcare) to analyse the 1.5 kb promoter region upstream of the start codon for each BAHD gene.
Plant growth, RNA-Seq data analysis, and gene expression confirmation with qRT-PCR Soybean seeds cv. Williams 82 were germinated in vermiculite in a light chamber at 25 ± 2 ºC.
Samples at different nodulation stages (N-1 to N-6) were obtained [1]. The seedlings were hydroponically subjected to acidic treatment (pH 4.0) and 50 mM Al 3+ stress (under pH 4.0) [2,82] for 10 days before harvesting the roots for gene expression analysis.
The samples were collected, snap frozen in liquid N 2 , and stored at − 80 ºC for RNA extraction. Solexa sequencing libraries were synthesised and further sequenced. Data were utilised to quantify the expression of soybean BAHD family genes (i.e. the number of sequence reads/million reads aligned) in nodule developing stages (Additional files 1: File S1b), response in Al 3+ stressed roots (Additional files 1: File S1c) and fungal elicitor treated roots (Additional files 1: File S1d), genotypes with different seed coat colour (Additional files 1: File S1e), as well as hairy roots overexpression the following genes: MYB transcription factors (MYB3 and MYB7) (Additional files 1: File S1f), GmLEC2a (Additional files 1: File S1g), and GmWRi1b (Additional files 1: File S1h).           Expression pattern of BAHD family genes in response to fungal elicitor The expression level of different BAHD family genes in in response to fungal elicitor were retrieved from generated Solexa sequencing libraries, analyzed and represented by constructing the heat