Identification of BAHD family members in the soybean genome
We searched genes in the soybean genome encoding proteins containing the specific BAHD catalytic motif (HXXXD, where X means any amino acid residue) and identified 103 genes (Fig. 2A; Additional files 1: File S3). The predicted BAHD proteins showed large disparity in length, from 274 (GmACT31) to 511 (GmMaT2) amino acid residues. However, most of them (70 out of the 103 members) ranged between 450-511 residues. The range of predicted molecular weight is from 30.93-56.70 kDs, whereas the majority of the genes code a predicted product ranging 45-52 kDs, which falls within the values of characterised BAHD proteins . The predicted pI value of soybean BAHD proteins ranged from acidic 5.05 (Glyma.10G061900_GmACT53) to basic 9.27 (Glyma.02G264600_GmACT2) (Additional files 1: File S3). The predicted subcellular localization indicates that soybean BAHD proteins could be localized to several organelles of the cell. However, GmIMaT1 and GmIMaT3 are primarily localised to the endoplasmic reticulum and cytosol, respectively . Most BAHD proteins were predicted to be localized to the cytosol (37/103), followed by mitochondria (27/103), and the plasma membrane (20/103), whereas only 2 proteins are predicted to localize to the nucleus (Additional files 1: File S3). This result is expected, since many of the reactions of specialised metabolism involving BAHD enzymes occur in the cytosol and mitochondria prior to the products to be exported wither to an organelle for accumulation or outside the cell [7, 15, 16, 25, 26].
Phylogenetic relationships of the soybean BAHD members
The evolutionary relationship of the soybean BAHD genes was explored by constructing a phylogenetic tree to compare them with 43 functionally characterised BAHD genes from different plant species (Fig. 1). On the basis of phylogenetic relationships, which defines substrate donor and acceptor of the characterized proteins in the family, the 103 soybean BAHD genes were classified into four major clades (I-IV) and clade III was further subdivided into two subclades (IIIA and IIIB) (Fig. 1). Our results based on the phylogenetic relationships of the characterised soybean ACT proteins would predict their correct enzymatic functions.
Clade I contains 20 soybean BAHD genes along with 16 characterised genes from different plant species (Fig. 1). Most members of this clade code for benzoyltransferases, alkyl-hydroxycinnamate ester, or cell-wall acyltransferases. They use benzoyl as a donor substrate to form benzyl benzoates, such as CbBEBT, NtBEBT , or salicyl benzoate (PtSABT) and involved in plant defence mechanisms against herbivory . Many characterised BAHD enzymes of clade I are responsible for the formation of benzenoid compounds [27-29]. Among them, PtSABT from Populus trichocarpa uses different CoA thioesters and alcohols as a donor to catalyse the production of salicyl benzoate for the synthesis of a variety of acetates, benzoates and cinnamates . PtBEBT from Populus trichocarpa is more selective to produce benzyl benzoate  and takes part in the synthesis of salicinoids. CbBEBT produces benzyl benzoate in Clarkia breweri that composes its floral aroma. The CbBEBT gene is highly expressed in the flower, and especially in the stigma . These BEBT genes code for products that have the capability to produce compounds that are important for plant defence or aroma [27-29]. Four BEBT genes from soybean (GmACT3, 4, 5, 20) have high similarity with the characterised BEBT genes. Similar to CbBEBT, GmACT3 and 4 also showed high expression levels in the flower, leading us to hypothesize that it likely plays a function in catalyzing aroma compounds. These genes are also highly expressed in the stem and the leaf, alike PtSABT and PtBEBT, which produce defence compounds. Some members of clade I are also involved in the biosynthesis of cutin and suberin, which work as lipid barriers on epidermal surfaces of plant organs. These barriers regulate gas, water and solute fluxes  and play critical roles against pathogen attacks . Alkaylhydroxycinnamoyl ester BAHD genes, such as AtASFT , StFHT , AtDCF , and AtFACT  use feruloyl, p-coumaroyl, sinapoyl, and caffeoyl as a donor to form aromatic suberin, alkyl, or ω-hydroxy fatty acids, which are used for the synthesis of cutin polymers that in turn are important components of the cuticle and assist in the deposition of suberin into the inner side of the primary cell wall of secondary shoots and roots . They also play a role in the synthesis of root waxes that contain hydroxycinnamates in A. thaliana . Ten genes from soybean fell in clade I and, therefore having the same predicted transferase function (Fig. 1; Additional files 1: File S3). Interestingly, we found in soybean only one orthologue of disinapoylspermidine (AtSDT; ) and another of coumaroyl-spermidine (AtSCT; ), that are annotated as GmACT2 and 12, respectively (Fig. 1). Meanwhile all the other genes in this clade have multiple copies or paralogous in the soybean genome. Monolignols, which participate in lignin formation, are synthesised through the phenylpropanoid pathway, primarily through the activity of hydroxycinnamoyltransferases (HCTs). The BAHD-ACTs also used hydroxycinnamic acid (HCA) as a donor substrate to esterify monolignols . These cell wall-related BAHD-ACTs also belong to clade I, such as OsPMT , BdPMT , and OsAT10 , which use p-coumaroyl-CoA as a donor substrate to form lignols and glucuronoarabinoxylan (GAX). These activities are responsible for producing the lignocellulosic biomass in grasses, which could have a major yield impact on feedstocks for making biofuels . The soybean genes GmACT1, GmACT13, GmACT14, GmACT16, GmACT17, and GmACT19 were also predicted to have cell wall-related functions, and they may play essential roles in the production of liquid biofuel compounds.
Clade II of the soybean BAHD family contains 26 genes that cluster together with 9 characterised genes from other plants (Fig. 1). This clade includes many genes related to shikimate/quinate hydroxycinnamoyltransferases (HCT/HQT) and acyl-CoA N-acyltransferases, such as hydroxycinnamoyl-CoA, spermidinehydroxycinnamoyltransferase (SHT), hydroxycinnamoyl-CoA: N-hydroxycinnamoyl/benzoyltransferase (HCBT), and rosmarinic acid synthase, which is alternatively called hydroxycinnamoyl-CoA:hydroxyphenyllactate hydroxycinnamoyltransferase (RAS). These members use hydroxycinnamoyl, caffeoyl or p-coumaroyl as donor substrates to form chlorogenic acid, hydroxycinnamoylquinate, caffeoyl, or p-coumaroylquinate or malate, hydroxycinnamoyl glycerol, hydroxycinnamoylshikimate, hydroxyphenyllactates, anthramides, and spermidines [33, 39-44]. Clade II mostly contains acyl-CoA N-ACTs that catalyze N-acylation instead of O-acylation. These reactions catalyze hydroxycinnamoyl and amine moieties that are the entry point of the phenolamide pathway . Hydroxycinnamic acid (HCA) amides are classified as phytoalexins . Most of the plant N-HCTs that have been characterised so far are from A. thaliana and Nicotiana attenuate and involved in defence against herbivores . Meanwhile, other members of the clade II, such as HCT/HQT and their relatives, produce diverse types of compounds that play protective roles against UV photooxidation  and pathogenic resistance , such as lignin, chlorogenic acid (CGA) and quinates. CGA has been implicated in cell wall development  and root hair formation . Seventeen soybean BAHD genes are closely related to TpHCT2, which transfers hydroxycinnamoyl moieties (p-coumaroyl and caffeoyl) to malic acid (Fig. 1). The transferase activity of TpHCT2 is involved in phaselic acid synthesis, particularly in the leaves of red clover (Trifolium pratense, Fabaceae) . TpHCT2 is related to GmACT27, 28, and 45, that have high and specific expression in leaves (Fig. 1) and may play a role in phaselic acid synthesis. Meanwhile, GmACT21, 22, 32, and 33 fall closest with TpHCT1A/1B, which use caffeoyl or p-coumaroyl as a donor to form shikimates or malates  and play a major role in monolignol synthesis [40, 43]. Lignin precursors are synthesised through p-coumaroyl derivatives rather than methoxylated analogues in response to pathogen attacks . TpHCT activity was higher in the red clover stem and capable of transferring p-coumaroyl moieties from the respective CoA to shikimic acid, which is critical for the biosynthesis of monolignol lignin precursors [40, 43]. AtHCT expression dramatically changed the lignin composition, which demonstrates the essential function of HCT in phenylpropanoid metabolism [40, 50]. HCT homologues have been identified in several vascular plant species . In conformity with other members of this group, GmACT21, 22, and 32 also presented high expression levels in the stem, and they may transfer p-coumaroyl moieties to respective acceptors with essential roles in lignin biosynthesis. Other four members of this clade (GmACT35, 36, 38, and 39) are closely related to SHT from Arabidopsis thaliana, which uses hydroxycinnamoyl as a donor substrate to form hydroxycinnamoylspermidine . This compound is abundant in pollen of many species, which indicates a role of these phenylpropanoids in plant sexual reproduction . Indeed, AtSHT expression was high and specific to the tapetum cells of the anther during early flower development in order to supply the building blocks for cell wall development in the pollen . The metabolic profiling of flower bud tissues in Arabidopsis revealed that AtSHT knockout or RNAi plants had a positive correlation between acyltransferase protein accumulation and the amounts of trihydroxyferuloyl and dihydroxyferuloyl sinapoyl spermidines . GmACT35 and 38 were also more highly expressed in the flower than any other tissues and are expected to have a function in the acylation of spermidine in the pollen.
Clade III is further divided into sub-clades IIIA and IIIB (Fig. 1). Sub-clade IIIA contains 25 genes along with 19 already characterised from other species. Most of the characterised members of clade IIIA acylate phenolic glucoside anthocyanins or other flavonoids (flavones or isoflavone). Malonylation of clade IIIA ACTs depends on the position of the acceptor substrate. Clade IIIA can be divided into aliphatic and aromatic ACTs. Fourteen soybean BAHD genes fall in the aliphatic ACTs, and they are functionally annotated as isoflavonoid malonyltransferases (Additional file 1: File S3) along with the characterised At5MaT, At3At1, At3At2 from Arabidopsis and MtMaT1, MtMaT2 and MtMaT4 from Medicago truncatula (Fig. 1), which facilitate the transport or storage of metabolites [7, 35]. Among the 14 aliphatic ACTs in soybean, only two, GmIMaT1 and 3  have been functionally characterised and play a distinct role against abiotic and hormonal stresses . Their transcription patterns correlated positively with the malonylisoflavonoid composition of tissues in different parts of the plant under different stress conditions .Another aliphatic group in this clade IIIA contains 6 soybean ACTs along with MtMaT5 and MtMaT6  from Medicago that malonylate different anthocyanins, such as cyanidin, delphinidin, and pelargonidin conjugates at 3-O-glucosides. These genes are very specific for anthocyanin modification at the C3 position, and they have no activity towards C5 or C7 of anthocyanin or other flavones or isoflavones . The soybean genes that closely relate to MtMaT5 and MtMaT6 (Fig. 1) are predicted to function in anthocyanin modification at 3-O-glucoside (Additional file 1: File S3). This group of clade IIIA is most probably associated with the stability of the anthocyanin glucosides since anthocyanin acylation at 3-O sugar with the malonyl moiety has been suggested to increase its stability .
Clade IIIA also contains 7 soybean BAHD-ACT genes (GmACT52, 57, 61, 63, 72, 76 and 77) closely associated with the aromatic ACTs from Arabidopsis, At3At1 and At3At2 (Fig. 1; ). At3At1 and At3At2 use p-coumaroyl, feruloyl and caffeoyl-CoA as donors to specifically modify the C6 hydroxyl of glucose at three anthocyanin positions, but it does not have any affinity towards sinapoyl-CoA. We predict that these 7 aromatic ACTs also prefer coumaroyl-CoA to modify the anthocyanin at the C3 position just like their closely related Arabidopsis orthologues (Fig. 1; Additional file 1: File S3). We did not find in the soybean genome any aromatic gene that was closely related to a gene able to modify 5-O-glucosides, such as Gt5AT that acylates the C6 hydroxyl of 5-O-glucosyl residues of anthocyanin with either caffeoyl, p-coumaroyl  or aliphatic anthocyanin MaTs, such as At5MaT (Fig. 2; ) and Ss5MaT1 from Arabidopsis thaliana (Brassicaceae) and Salvia splendens (Lamiaceae), respectively (Fig. 1; ).
Clade IIIB contains 9 GmACT genes from soybean with 7 characterised genes (Fig. 1). This clade contains malonyltransferases (MaTs) that catalyse flavonoids with malonyl-CoA to form malonates, acetyl CoA:deacetylvindoline 4-O-acetyltransferases (DAT) which catalyse the last step in vindoline biosynthesis , acylsugar acyltransferases (ASAT), which is important in the acylsucrose biosynthetic pathway and uses sucrose and acyl-CoA ester substrate specificity to contribute in different types of accumulated acylsucroses . Acylsugars are polyesters of short- to medium-length acyl chains of sucrose or glucose backbones that are produced in secretary glandular trichomes of many solanaceous plants, including tomato (Solanum spp.). These compounds act against different insects through a mechanism involving in a multitrophic defence interaction involving their metabolization to volatile fatty acids . Ss5MaT2 also fell in clade IIIB separated from other the MaTs . These BAHD-ACTs share the anthocyanin-specific motif (NYFGNC) in addition to the BAHD motif (HXXXD). The characterised member of clade IIIB (Ss5MaT2 from Salvia splendens) malonylates the anthocyanin glycosides bisdemalonylsalvianin and shisonin by transferring the malonyl moiety to the C4 position of the 5-O-glucosyl residue . The soybean genes of clade IIIB also suggest forming salvianin, which has two malonyl moieties, and malonylshisonin, and plays an important role in flower colour . Although the biochemical properties of Ss5MaT2 are similar to that of clade IIIA members, it fell into the clade with ASAT and DAT.
Clade IV contains 21 soybean BAHD genes along with five characterised genes (Fig. 1; Additional file 1: File S3). The members of this clade have varied functions. Most are predicted to function as shikimate O-hydroxycinnamoyltransferase or transferase family, except for GmACT88 and GmACT97, which are predicted to function as an anthranilate N-hydroxycinnamoyl (HCBT)-related transferase, as well as GmACT79 and 81, which are predicted to function as a shikimate O-hydroxycinnamoyltransferase/quinate O-hydroxycinnamoyltransferases. Gt5At from Gentiana triflora, an anthocyanin aromatic acyltransferase, catalyses the transfer of hydroxycinnamic acid moieties from their CoA esters to the glycosyl groups of anthocyanins . HCBT produces the aromatic amide dianthramide in carnation . Anthraniloyl-CoA: methanol AT (AMAT) uses anthraniloyl-CoA as a donor substrate to acylate methanol and produce a foxy odour compound (methylanthranilate) in concord grapes . Moreover, GmACT93 and 96, which are closely related to the gene DEFECTIVE IN CUTICULAR RIDGES (DCR) from Arabidopsis thaliana, Medicago truncatula, and Populus trichocarpa  and predicted to involved in cutin formation. In the flower, DCR converts monomers into polymeric cutin, which is composed of fatty acid, glycerol and aromatic monomers . Changes in epidermal cell differentiation and post-genital organ fusion were observed due to defective cuticle in DCR-deficient plants. DCR was also expressed in the root cap, lateral root primordia, and developing lateral root. This expression profile points to its involvement in suberin biosynthesis . Downregulation of DCR in dcr mutants led to excessive root branching and changes of root architecture by influencing lateral root emergence and growth . All in all, these analyses help us formulate educated hypotheses about the diversity and evolution of substrate specificity within the BAHD family in soybean.
Motif analysis and gene structure
Variation in length and sequence was found in the conserved region (PF02458/IPR003480) of soybean BAHD genes (Additional files 1: File S4). Among the ten representative BAHD motifs searched and annotated through Pfam and SMART tools, three motifs (1, 4 and 6) were found related to BAHD transferase family (Additional files 1: File S4). The biological importance of these recurring motifs required an in-depth investigation. All members in clades I and II contained all 10 conserved motifs whereas clade III and IV members missed some of them (Fig. 2B). We found a typical motif pattern in most of the paralogous pairs. However, some presented differences in motif composition and arrangement, such as pairs GmACT94-99, 93-96, 31-46, 21-22, 52-72, 7-9 and 15-18 (Fig. 2B).
The architecture of soybean BAHD genes was analysed to establish general features within each clade. The exon-intron structure of each gene is shown along with the phylogenetic tree (Fig. 2C). Genes within the same clade contain almost the same length and number of exons and introns, with a few exceptions (Fig. 2C). Introns of clade I members are usually much larger than in other clades. Seventy percent of clade I members (14 out of 20) contain two exons while 15% (3/20) contain one and three exons (Fig. 2C; Additional files 1: File S3). GmACT12 contains the longest intron (13,062 bp) in the whole set of the BAHD gene family (Fig. 2C). In clade II, 14 genes contain only one, 8 genes contain two, and 3 genes contain three exons. GmACT21, 22, 32 and 33 have long introns compared to the other clade II members. Clade III genes varied in exon number, but 77.78% (28 out of 36) contains just a single exon. Two clade III members (GmACT51 and 72) contain the maximum exon number (4) in the whole soybean BAHD family. Most members of clade IV (76%) contain only one exon while the remaining ones contain two exons. Among the 16 paralogous pairs, 10 did not show remarkable differences except that for GmACT21-GmACT22 from clade II which GmACT22 has an untranslated region (UTR) longer (4982bp) while GmACT21 had (1212bp) at 5/ end. The other six paralogous pairs showed small variations in exon number or intron length (Fig. 2C).
Chromosomal localization and duplication of soybean BAHD genes
A chromosome ideogram was constructed on the basis of the physical position of the soybean BAHD-ACT genes (Additional file 1: File S3). All the 20 soybean chromosomes contain BAHD-ACT genes, although the number on each chromosome varies between 1 and 15 genes per chromosome (Fig. 3). Chromosome 18 has the highest number of genes (15%) followed by chromosomes 8 (11%) and 19 (10%) (Fig. 3; Additional file 1: File S3). On the other hand, chromosomes 1, 9 and 20 contain only a single gene each (Fig. 3; Additional file 1: File S3). The positions of BAHD genes on the soybean chromosome, often away from the centromere (Fig. 3), which explained their high recombination rates and active translation .
Tandem and segmental duplications expanded BAHD genes into a large family in different plant genomes . Gene redundancies can be predicted by the identification of closely related paralogues in the genome whereas the accurate identification of true orthologues between species can lead to the creation of strong hypothesis of common gene functions. The soybean genome contains a high number of BAHD genes (103) as compared to 69 in Cynara cardunculus , 64 in A. thaliana , but less the 119 BAHD genes in the Oryza sativa genome . This large gene family might be the result of the two rounds of whole-genome duplication (WGD) events in soybean that occurred 58-60 (the Papilionoidea allotetraploidization event) and 13 Mya (recent soybean allotetraploidization) . The presence of paralogous genes in a plant genome is more often related to their functional category than the genetic proximity between species . We found sixteen paralogous pairs in the soybean BAHD family (Additional file 2: Table S1). These paralogues are present in all four phylogenetic clades (Additional file 2: Table S1): Clade I contains 7 members; clade II has 3, clade III contains 2, and clade IV has 4 paralogous pairs (Additional file 2: Table S1). The presence of clear paralogues is evidence of a recent common ancestry and helps to elaborate strong hypotheses about the functional conservation of BAHD genes in soybean along with those already characterised in different species. The selection history of coding sequences can be assessing through the analysis of base substitutions of the protein code called the Ka/Ks ratio. To investigate the deviation of duplicated BAHD acyltransferases, we determined the non-synonymous substitutions per site (Ka), the synonymous substitutions per site (Ks), and the respective Ka/Ks ratios for each paralogous pair. The partition of each pair was estimated through the Ks values. All paralogous pairs had a Ks value from 0.09 to 0.28, which is consistent with the soybean duplication events (Additional file 2: Table S1). The selection pressure can be determined through the Ka/Ks ratio. Most of the paralogous pairs showed negative Ka/Ks ratios (Additional file 2: Table S1) suggesting that the BAHD gene family in soybean experienced strong purifying selection pressure and is bound by strong evolutionary constraints to maintain its stability . Divergence time analysis showed that these paralogous genes evolved between 7.38 to 22.95 million years ago (Mya), which is consistent with the recent allotetraploidization of the species that occurred around 13 Mya (Additional file 2: Table S1).
Positive and purifying selection of BAHD genes in soybean
The Ka/Ks ratio is used as a measure of the direction and magnitude of gene selection . The Ka and Ks substitution rates in this study were estimated on the basis of simple probability theory. Positive selection was considered significant with Ka/Ks ratio > 1 . Comparisons of nucleotide substitution per site in the 103 BAHD genes were carried out on the basis of the Ka/Ks ratio in each individual branch of the phylogenetic tree. Maximum Likelihood computations of Ka and Ks were conducted using the HyPhy software package . The analysis involved 26 nucleotide sequences from clade I, 20 from clade II, 36 from clade III and 21 from clade IV. All positions containing gaps or missing data were eliminated. The neutrality analysis based on maximum likelihood computation of Ka/Ks values of all soybean BAHD genes revealed 16 coding sites under positive selection between codons 69- 342 in clade I, 6 coding sites between codons 33-143 in clade II, 19 coding sites in clade III and 15 sites in clade IV under positive selection, all of which showed Ka/Ks ratios > 1 (Additional file 2: Table S2). The changes in codons, as well as sequences for pairwise comparison, were estimated at the Ka and Ks basis. The variability between Ka and Ks polymorphisms at the allelic distribution level presents quite an explicit proof for positive selection . The allelic distribution in various coding sites can be revealed by the strapping test of neutrality, which is applicable to genomic data containing a large number of polymorphisms, and that along with the homogeneity test, allows for comparing the frequency distribution of synonymous sites [67, 68]. The changes in the coding region during the evolutionary time under analysis showed the cumulative presence of synonymous, non-synonymous and ambiguous (indels) codons (Additional file 3: Figure S1). Synonymous and non-synonymous mutations showed low cumulative behaviour at the start site and then gradually increased during the evolutionary time analysed, whereas the opposite behaviour was observed for ambiguous codons at the start point and then acquired a static behaviour at the end of the code. The very same behaviour of synonymous, non-synonymous and ambiguous codons was found in all four clades of the BAHD family (Additional file 3: Figure S1).
We estimated the Likelihood Rate Test (LRT) through the empirical Bayes method at specific branching points and identified various diversifying selection sites. The mixed-effect model of evolution (MEME) was used to identify the diversifying selection of BAHD genes also on the basis of empirical Bayes procedure . MEME identified 7, 17, 16, and 4 episodic diversifying coding sites (p < 0.05) in clades I-IV, respectively (Additional file 2: Table S3). The synonymous (α) and non-synonymous (β) substitution rates were calculated, and coding sites with β > α were considered as significant to determine sites under diversifying selection. The Maximum-Likelihood estimation of β > α in MEME was obtained for nucleotide position 60 and 434 in clade I, 189 and 463 in clade II, 277 and 643 in clade III, and 299 in clade IV (Additional file 2: Table S3). The prevalence of purifying or natural selection masked the episodic natural selection with the transient period of adaptive evolution. Thus, we could not identify sites under positive selection due to the sensitivity of tests used  and the stringent parameters imposed, such as p < 0.05 and the empirical Bayes factor threshold of analysis for selection of sites (> 15), so that the best outcomes were obtained when the selected branches were placed in the conserved lineage.
Ambiguities in posterior gene- and site-specific distributions were calculated through FUBAR using the MCMC approach [71, 72]. This analysis revealed the following number of sites in BAHD genes under pervasive diversifying selection: one site in clade I, four in clades II and III each, and two in clade IV (Additional file 2: Table S4). The diversifying evolutionary sites were detected through the random effect model using the rate distribution with a large number of parameters at the significance level at 95% confidence interval and Pr [β>α] values (Additional file 2: Table S4). The class rate weight at each coding site under positive selection was calculated through a bivariate general discrete distribution. The posterior mean values ranged between 0.80-0.92, which are closer to the value of the potential scale reduction factor (Additional file 2: Table S4) and indicates MCMC convergence. Coding sites with β > α and empirical Bayes factor (EBF) >15 were considered to be under diversifying selection. The EBF values for each coding site under positive selection were calculated with a net effective sample size (Additional file 2: Table S4). PSRF values for each site under positive selection were <1.03, whereas the effective population size was >100. The detected selection across many coding sites was potentially improved through deducing the allocation of gene-specific selection parameters. Altogether, the coding sites under positive selection unveiled in this analysis provides strong evidence of diversifying selection in BADH genes under selection in the soybean lineage.
The evolutionary fingerprint of soybean BAHD genes
Nucleotide and amino acid substitutions in the codon model were used to identify synonymous and nonsynonymous substitutions [71, 73]. The genetic algorithm in the codon model evolution was used to identify the evolutionary fingerprint in coding sites of BAHD genes. The 9,052 models generated were used in codon model selection on the basis of likelihood log and modified Bayesian Information Criterion (mBIC). Selective effects with an exchangeable preference for particular amino acid were found in BAHD genes through a model that used the combined empirical codon and transition/transversion-related physiochemical parameters . The model with log (L) value -16,360.3 was considered optimum for the amino acid substitution analysis. We observed mBIC values ranging from 33,701.1 for clade II to a maximum value of 81,506.1 for clade IV (Additional file 2: Table S5). All clades showed three-class rates for the distribution of amino acids in different classes (Fig. 5) with the estimation of a single Ka/Ks substitution rate (Additional file 2: Table S5). A genetic multi-rate model algorithm was used to analyse the class rates in order to calculate substitution rates at the amino acid level during the evolutionary time scale (Fig. 5). The substitution rate in each class was calculated through genetic models using the Stanfel class parameters (Fig. 5; ). The substitution rate distributed the amino acids into the three classes through evolutionary rate cluster, and the substitution pair ACGILMPSTV revealed 90% substitution, whereas DENQ had 50% substitution, and FWY and HKR presented < 50% substitution rates. The best-fit model with mBIC values defined the clustering rate, and the numerical values of corresponding rate class substitution are inferred with Maximum-Likelihood estimation, which were computed evolutionary evidence ratio for the gene .
Analysis of cis-regulatory elements in promoter regions of BAHD genes in soybean
Extensive studies on ACTs indicate their roles in plant growth, development, and stress tolerance. Many cis-acting regions are found in the upstream region of GmACTs genes according to our in silico analyses (Additional file 1: File S5). The 103 GmACT promoter regions contain cis-elements responsive to light (43%), hormones (29%), environmental stresses (20%), and plant growth (8%) (Fig. 4A). These regions contained transcription factor binding sites involved in response to plant hormones (Fig. 4B), biotic stresses and pathogen defence, MYB binding sites, and response to heat, low temperature, and drought (Fig. 4C). The pattern of cis-acting regions differed among the soybean BAHD gene promoters (Additional file 1: File S5). GmACT3 (clade I) contains the maximum number of cis-elements (29), whereas GmACT82 (clade IV) contains 26 hormone-responsive elements (Additional file 1: File S5). GmACT34 contains the maximum number of ABA-responsive elements (8), and GmACT45 contains 9 ethylene-responsive motifs (Additional file 1: File S5). GmACT84 and 96 (clade IV), along with GmACT66 (clade III) contain the maximum number of environmental stress-response elements (18, 17, and 15, respectively) in their promoter regions. Members of clades III and IV contain the maximum number of MYB-related cis-elements (Additional file 1: File S5). GmACT84 showed the most (12) MYB related elements, and also showed high expression values, up top seven-fold higher in response to Al3+ and two-fold higher to fungal elicitor than the control (Additional file 1: File S1c, S1d). The promoter regions of paralogous pairs were also compared. Both members of each pair showed variation in cis-elements related to stress or hormonal responses or other signals in their promoter regions. Some hormonal, stress or growth-related elements were often present only in one member of the pair (Additional file 1: File S5). Although the GmACT paralogous pairs have little variation in protein length, the sequences of their promoter regions were highly different, which suggests divergent specific roles of each member. We concluded from this analysis that GmACTs likely play vital roles in plant stress perception, signalling, or biotic and abiotic stress defence.
Expression patterns of soybean BAHD genes in different tissues
Expression data for the 103 soybean BAHD genes in different soybean tissues were extracted from Phytozome (Fig. 6; Additional file 1: File S1a). Most of the genes showed tissue expression patterns that corresponds to their putative functions as predicted by their phylogenetic positions. The benzenoid-related genes (GmACT3 and 4) had maximum expression in above-ground organs, whereas GmACT5 was highly expressed only in the root (Fig. 6; Additional file 1: File S1a). Benzenoid compounds are produced mostly in flowers and known to be involved in the aroma of numerous fruits and contribute to improving flavour quality . Indeed, already characterised genes from this clade, PhBPBT from Petunia x hybrida and CbBEBT from Clarkia breweri are highly expressed in floral parts and consistently involved with benzylbenzoate compounds [27, 29]. Monolignol synthetic genes (e.g., GmACT1 and 16) showed high expression in root and pod. Likewise, OsPMT and BdPMT, which are specific to monolignols with p-coumarate and ubiquitously expressed but particularly high in spikelets. When overexpressed, these genes led to a three-fold increase in pCA lignin compared to the control . Monolignol-pCA conjugates integrate with lignin biosynthesis through oxidative radical coupling in order to generate the pCA appendages [36, 37]. Polyamine conjugates are abundant in seeds and serve as nitrogen reserves during germination . The Arabidopsis SDT gene is highly expressed in seeds and produce disinapoyl spermidine and its glucoside while being mainly expressed in roots to produce coumaroylated spermidines . Soybean also contains SDT and SCT related genes, GmACT2 and 12, respectively, which are expressed constitutively, even though GmACT12 expression is barely detectable in the stem, leaves and roots but spermidine still accumulates at high levels in seeds and roots .
The clade II members, GmACT29 and 30, showed high expression only in the seeds. GmACT40 is expressed in roots and nodules, while GmACT38 transcriptional activity is especially high in nodules and flowers (Fig. 6; Additional file 1: File S1a). On the other hand, GmACT21, 22, 32, 35, 41, and 45 showed expression in all tissues and at high levels (Fig. 6; Additional file 1: File S1a) suggesting putative roles in cell wall development and plant defence by producing CGA or quinate-like compounds, provided their close relatedness to TpHCT1A and TpHCT1B (Fig. 2; ). Phenolamide, an important compound produced by clade II BAHD enzymes, is often associated with floral induction and development. GmACT21, 22, and 32 showed particularly high transcription in the stem and flowers, similarly to TpHCT1, which was shown to have 4 to 5-fold higher expression in these organs than the leaf and plays an important role in stem lignification by participating in monolignol biosynthesis [40, 43]. GmACT27, 28 and 45 are very similar at the amino acid sequence and also showed the same expression pattern as reported for TpHCT2 . A reduction in the biosynthesis of phenolamides or their conjugates leads to male sterility in plants. Tri-substituted hydroxycinnamoyl spermidines were synthesised in the pollen coat, and this conjugation is catalysed by SHT . GmACT22, 32 and 35 also had their highest expression in the flower and, given their phylogenetic position and expression patterns, these genes might as well be involved in the synthesis or conjugation of spermidines in the pollen coat of soybean.
The clade III genes, GmACT67 and 72 are highly and specifically expressed in the flower, GmACT55 in the root, GmACT49, 54, and 65 in the nodule, whereas GmACT66 is expressed in the leaf, GmACT47 in stem and flower, GmACT57 in root and flower, and GmACT71 in seed and the pod (Fig. 6; Additional file 1: File S1a). The putative isoflavonoid malonyltransferase genes, such as GmMaT1, 2, 3, 4, GmACT68, 69, and 70, showed high expression levels in almost all tissues. Other related members presented a more specific expression pattern, such as GmACT71 in seeds and pods, and GmACT50 in the leaf (Fig. 6; Additional file 1: File S1a). The different expression patterns of these genes suggest putative roles of malonylated flavonoids in storage, symbiosis signalling, flower colour, and resistance against biotic and abiotic stresses. Most of clade IIIB members had high expression in leaves and the stem, similarly to the characterised members of this group. High and specific expression of CrDAT in leaves correlates with the synthesis of vindoline from tabersonine in Catharanthus roseus (Apocynaceae) . Likewise, the expression of ASAT genes was higher in the trichomes of S. lycopersicum stem and S. pennellii leaf, which correlates with acylsugars accumulation in these organs .
DCR genes are specifically expressed in the epidermis of vegetative and reproductive organs. Additionally, they also showed expression in the lateral root primordia and developing lateral roots . The cuticle of different organs is the primary barrier against several biotic and abiotic stresses . High expression of DCR genes in young leaves and stem is predictive of their role in the formation of cutin polymers in vegetative organs . During leaf expansion, the cutin polymeric matrix is formed, and DCR plays a vital role to incorporate the hydroxyl fatty acid chains into the cutin polymer . It is also critical to convert monomer units into its polymeric form of cutin in the flower. DCR develops the lipid polyester for the plant epidermal cell differentiation and patterning, trichome development, and protection from undesired post-genital fusions in aboveground organs . We noticed that almost all clade IV genes showed high expression levels in seeds, roots, and nodules. Some genes, such as GmACT82 and 95, presented particularly high expression levels in roots and flowers, GmACT94 and 91 in seeds, GmACT99 in roots and leaves (Fig. 6; Additional file 1: File S1a).
Expression profiles of BAHD genes during nodule development in soybean
From our analysis of RNA-Seq data derived from soybean nodules at different developing stages , BAHD genes were differentially expressed at different developmental stages of the nodule, since its development until senescence (Fig. 7; Additional file 1: File S1b). Most of clade II and clade IV genes showed higher expression in roots than nodules, while 50 and 80% genes in clades I and III showed the opposite pattern. The expression levels of GmACT39, 40, 41 (clade II), 27 and 33 (clade IV) were 10 to 40-fold higher in the initial stages of nodule development and reached up to >80-fold in later in the development (stage 5, pre-senescence) than roots. Even though most of clade I and III genes were expressed more highly in nodules than roots, their expression levels were not much higher than the genes in clades II and IV, except for GmACT12 and 68, which showed more than 2 and 70 fold higher expression than roots, respectively (Fig. 7; Additional file 1: File S1b). The expression patterns of GmACT35, 39, 86, 87, and GmMaT1 gradually increased as the nodule developed, peaking at stage 5 and then dropping at senescence (stage 6), which imply roles during nodule development. Although the expression levels of GmMaT1 and GmACT86 were higher in roots than nodules, their expression increased as the nodule developed. Therefore, they may play a symbiotic role in the relationship between root and nodule development by modifying secondary metabolites in roots for secretion into the rhizosphere in order to attract rhizobia, such as the isoflavonoids that act as a signalling molecule for inducing nod factor biosynthesis in rhizobia. The transcription level of GmMaT3 was the highest after GmACT87 than any other BAHD genes during the different nodule stages analysed, and its expression remained constant throughout nodule development and almost similar to that of the root (Fig. 7; Additional file 1: File S1b). The expression levels of GmACT15, 18, 40, 41, 75, and GmMaT2 were higher in the initial stages and gradually decreased as the nodule matured, reaching the lowest expression levels at senescence (Fig. 7; Additional file 1: File S1b). Therefore, these genes may function in rhizobial interaction or nodule initiation and development.
Expression patterns of soybean BAHD genes in response to different stresses
The RNA-Seq data available allowed us to analyse the expression profile of BAHD family genes in plants subjected to acidic condition (low pH 4.0), aluminium treatment 50 µM Al3+ (pH 4.0), and fungal elicitor after 2 and 24 h. Most of the BAHD genes expression decreased in response to low pH and Al3+ exposure, except for GmACT13 (clade I) and 36 (clade II). Their expression levels doubled at low pH but decreased or remained the same upon Al3+ stress (Fig. 8; Additional file 1: File S1c). The expression of GmACT49, GmMaT1 and 3 (clade III) as well as GmACT84 and 92 (clade IV) increased dramatically in response to both stresses (Fig. 8; Additional file 1: File S1c). The expression of GmACT49, 84 and 92 increased about 11-, 7-, 40-fold in response to low pH, and 6-, 7-, 70-fold in response to Al3+ stress, respectively. Meanwhile, the expression of GmMaT1 and 3 significantly increased under these conditions as compared to the control (Fig. 8; Additional file 1: File S1c). Therefore, the BAHD genes that showed elevated transcription levels after low pH and Al3+ treatments are good candidate for studying Al3+ toxicity tolerance mechanisms in soybean. On the other hand, the expression of GmACT12 (clade I), 39 (clade II), 62, 68, 75, GmMaT2 and 4, (clade III) as well as GmACT86 (clade IV) decreased between 5 and 70-fold, revealing their sensitivity to these stresses.
Transcriptomic analyses in response to fungal elicitor (FE) showed that the expression of 29% of the BAHD genes (30 out of the 103) changed significantly, with 10 genes increasing and 20 decreasing their expression (Fig. 9; Additional file 1: File S1d). The expression of GmACT35, 36 (clade II) GmMaT1 and2, GmACT50, 57, 58 (clade III), 84, 85 and 92 (clade IV) increased while the other genes decreased their transcription. An increase in expression was observed for GmACT58 (40-fold), 57 (20-fold), 36 (12-fold) as well as 84, 92, and GmMaT1 (2-fold each) (Fig. 9; Additional file 1: File S1d). This expression pattern suggests a role in resistance to fungal attack by the synthesis and secretion of modified metabolites. Indeed, fungal and bacterial infections, or elicitor treatment in Solanaceae plants and Arabidopsis, increase the synthesis of coumaroyl or feruloyl tyramines [79, 80]. The anti-microbial activity was demonstrated for dicoumaroyl-caffeoyl spermidine, which inhibits mycelial growth of Pyrenophora and reduces powdery mildew infection in barley seedlings . Likewise, constitutive accumulation of tyramine derivatives in transgenic tomato promotes resistance to Pseudomonas syringae . MtMaT3 expression increased in response to fungal infection and decreased in response to copper treatment . The observed modulation of BAHD family genes in soybean in response to fungal elicitor may correlate with an increased acylation of active compounds against fungal attacks, which warrants further analysis.
Expression of BAHD genes in seed coat and cotyledons of different soybean genotypes
Soybean RNA-Seq data of coat and cotyledon tissues dissected from developing seeds (stage 5, 380-450 mg in weight) of three varieties with distinct seed coat colours were downloaded  and analysed. Interestingly, some BAHD family genes were highly expressed in the seed coat while low levels were found in the cotyledons, and vice versa. GmACT5 (clade I), GmACT25 (clade II), GmMaT2 and GmIMaT3 (clade III), and GmACT84 (clade IV) showed higher expression in the cotyledon compared to the seed coat (Additional file 1: File S1e). This result suggests a function in the accumulation of modified metabolites, potentially related to resistance against seed borers. The expression levels of GmACT4, GmACT15 (clade I), GmACT21, GmACT22, and GmACT35 (clade II), GmACT57, GmMaT4, GmACT70 (clade III), GmACT79, and GmACT81 (clade IV) were higher in brown than black seed coats and almost absent in yellow seed coats (Additional file 1: File S1e), which suggests a role in anthocyanin modification.
Expression modulation of BAHD genes by MYB transcription factors, GmLEC2a, and GmWRi1b in the hairy root system
MYB transcription factors (TF) tightly regulate the biosynthesis of flavonoids and anthocyanins . Many MYB factors have been reported to either positively or negatively regulate isoflavonoid accumulation in soybean roots and seeds as well as the production of anthocyanins and proanthocyanidins (PAs) in seeds . Two soybean MYB TFs, a CPC-like R3-MYB repressor (GmMYB3) and an R2R3-MYB activator (GmMYB7), were overexpressed in soybean hairy roots to perform a transcriptome analysis on BAHD genes. The expression of GmACT17 (clade I), GmACT23, GmACT40, GmACT41, GmACT43 (clade II), GmACT75 (clade III) and GmACT98 (clade IV) was upregulated by GmMYB3 (Additional file 1: File S1f; Additional file 3: Fig. S2) while the genes GmACT9, GmACT12 (clade I), GmACT21, GmACT22, GmACT28 (clade II) were induced by GmMYB7 (Additional file 1: File S1f; Additional file 3: Fig. S2). The transcription levels of GmACT24 (clade II) and GmMaT2 (clade III) were upregulated 2 to 5-fold by both, GmMYB3 and GmMYB7, as compared to the control. Most of the induced genes belong to clades I and II, which are involved in the modification of diverse compounds by adding ferulyl, caffeoyl, sinopoyl, p-coumaroyl, and hydroaxycinnamoyl moieties to a substrate. On the other hand, the upregulated genes of clade III may be involved in the synthesis of flavonoids or anthocyanins since we observed overexpression of MtPAR, which a previous report showed to lead to a 10-fold increase in flavonoid contents . GmMYB3 and GmMYB7 negatively regulated most of the clade III genes, but not GmACT44 (clade II). The expression levels of GmACT54, GmACT57, GmACT62, GmACT64, and GmACT65 were suppressed in both GmMYB overexpressed soybean hairy roots (Additional file 1: File S1f; Additional file 3: Fig. S2).
Higher isoflavonoid levels were reported in hairy roots than the control . An active complex is formed via a MYB activator to prompt isoflavone biosynthesis. When the Medicago truncatula proanthocyanidin (PA) biosynthesis regulator (MtPAR) was heterologously expressed in soybean hairy roots decreased contents of isoflavonoids, such as daidzin, glycitin, genistin, malonyldaidzin, and daidzein up to 3, 2, 4, 2.5, and 8-fold, respectively, as compared to control through repressing the expression of the isoflavone synthase (IFS) gene . In fact, the Medicago MYB TF (MtPAR) suppressed the expression of not only IFS but also other isoflavonoid biosynthesis genes. Indeed, MtPAR has a documented negative impact on isoflavone production through expression suppression of the biosynthesis genes .
The expression of some BAHD acyltransferases were also significantly modulated in GmLEC2a-overexpressing hairy roots . While some genes specifically expressed in leaves (GmACT26) and roots (GmACT84, GmACT85) should not be native GmLEC2a targets of due to their distinct expression patterns, genes that are highly expressed in seeds (GmACT15, GmACT34), pods (GmACT71, GmACT99) or open flowers (GmACT75, GmACT97) could potentially be natural targets. Many of these genes were repressed in the GmLEC2a-overexpressing system (Additional file 1: File S1g). This result suggests that LEC2a may mainly upregulate the primary metabolism while repressing genes of the specialised metabolism, including those coding for enzymes that modify metabolic structures, such as BAHD genes.
Some of the specialised metabolic genes were also regulated by the soybean WRINKLED1 (GmWRI1b), an AP2/EREBP family TF that regulates fatty acid metabolism in seeds, as shown by their capability of modifying gene expression significantly in hairy roots . The expression of 63% (65 out of 103 genes) of the BAHD genes was modulated significantly in GmWRI1b-OE hairy roots, with 51% of the genes upregulated while were 49% downregulated (Additional file 1: File S1h). Interestingly, the two genes (GmACT2 and GmACT4) that have higher expression in above ground soybean parts i.e., stem, leaf, flower, were upregulated in GmWRi1b-OE hairy roots. BAHD genes that are expressed in specific organs, such as leaves (GmACT28), stems (GmACT47), and flowers (GmACT47, GmACT57, GmACT82) were also upregulated by GmWRi1b in hairy roots. Some genes expressed in root nodules (GmACT9, GmACT44, GmACT49) were upregulated, whereas others (GmACT54, GmACT87 and GmACT89) were downregulated. On the other hand, most of the genes specifically expressed in roots (GmACT18, GmACT21, GmACT22, GmACT23, GmACT36, GmACT40, GmACT41, GmACT43, GmACT86) and seeds (GmACT15, GmACT94) were downregulated in GmWRi1b-OE hairy roots. The genes GmACT56 and GmACT62 that were naturally highly expressed in roots and other organs were upregulated, but those root-expressed genes that were also expressed in nodules (e.g., GmACT92, GmACT98) were rather down-regulated. We also noticed that some BAHD genes with very low counts in all organ analysed were significantly upregulated (GmACT33) or downregulated (GmACT17) while most of the clade IV genes that targeted acyl-lipids for wax or cuticle synthesis were down-regulated in the GmWRi1b-overexpressing hairy roots. (Additional file 1: File S1h).
Validation of BAHD gene expression in developing nodules and in response to stresses via qRT-PCR
In order to validate the in silico expression analysis obtained from public RNA-Seq data, and further assess the involvement of BAHD genes in physiological processes, we performed quantitative reverse-transcription PCR (qRT-PCR). We analysed the expression profiles during nodule development and root responses to Al3+ and fungal elicitor exposures. To select BAHD genes for expression analyses in soybean, we considered the following criteria: (i) high expression in different tissues; (ii) representation of genes in all phylogenetic clades; (iii) inclusion of some paralogous pairs. The expression profiles revealed by qRT-PCR were generally consistent with the RNA-seq data. The expression of GmACT2 was higher in nodules than roots but remained static in all nodule developing stages. In contrast, the expression levels of GmACT21, GmACT22, GmACT55, GmACT68, GmACT79, GmACT81, GmACT84 and GmACT86 were higher in roots while lower but static during nodule development. GmACT12, GmACT15, GmACT41 expression was higher during the initial stages of nodule development but gradually decreased and remained low during nodule senescence. GmACT44 expression was initially low and steadily increased and was found high expression during nodule senescence (Fig. 7B). GmACT84 expression was high in response to Al3+ (Fig. 8B) and fungal elicitor (Fig. 9B) exposures compared to the control. At the same time, most of the other genes tested significantly decreased their expression or remained virtually unchanged.