Identification of BAHD family members in the soybean genome
Considering only the genes coding the specific BAHD catalytic motif (HXXXD), we identified 103 genes (Additional files 1: File S3) in the soybean genome encoding proteins of the BAHD family. The annotation of soybean BAHD genes was based on their fall in the clade and position on the chromosome (GmACT1–99) along with the already characterised genes Glyma.18G268200 (GmIMaT1), Glyma.13G056100 (GmIMaT3) , Glyma.18G258000 (GmMaT2) and Glyma.18G268400 (GmMaT4) (Fig. 1A). Details about on physical parameters, such as Glyma gene ID, chromosome start-end position, coding strand, exon number, predicted protein sequence length, molecular weight, isoelectric point (pI), and predicted subcellular location are listed in Additional files 1: File S3. Predicted protein length varied from 274 (Glyma.13G007200_GmACT31) to 511 (Glyma.18G258000_GmMaT2) amino acid residues, which indicates high variation within members of the BAHD family. However, most of them (70 out of the 103 members) code a protein with 450–520 residues. The range of predicted molecular weight is from 30.93–56.70 kD whereas the majority of the genes code a product with a predicted molecular weight ranging 45–52 kD, which falls within the values of characterised BAHD genes . The predicted pI value of soybean BAHD proteins ranged from acidic 5.05 (Glyma.10G061900_GmACT53) to basic 9.27 (Glyma.02G264600_GmACT2) (Additional files 1: File S3). The predicted subcellular localization indicates that BAHD proteins could be present in almost all organelles of the soybean cell. However, GmIMaT1 and GmIMaT3 are primarily localised in endoplasmic reticulum and cytosol, respectively . Most BAHD proteins (37/103) were predicted to localize in the cytosol, followed by mitochondria (27/103) and the plasma membrane (20/103), whereas only 2 proteins are predicted to localize in the nucleus (Additional files 1: File S3).
Motif analysis and gene structure
Variation in length and sequence was found in the conserved region of soybean BAHD genes (Additional files 1: File S4). Among the ten representative BAHD motifs searched and annotated through Pfam and SMART tools, three motifs were always present in each gene identified (Additional files 1: File S4). The biological importance of these recurring motifs required an in-depth investigation. All clade I and clade II members contained all 10 conserved motifs while clade III and IV members missed some of the motifs (Fig. 1B). We found a typical motif pattern in most of the paralogous pairs. However, some pairs presented differences in motif composition and arrangement, such as GmACT94-GmACT99, GmACT93-GmACT96, GmACT31-GmACT46, GmACT21-GmACT22, GmACT52-GmACT72, GmACT7-GmACT9 and GmACT15-GmACT18 (Fig. 1B).
The gene architecture of soybean BAHD genes was analyzed to better understand the general features within each clade. The exon-intron structure of each gene is shown along with the phylogenetic tree (Fig. 1C). Genes within the same clade contain almost same length and number of exons and introns, with some exceptions (Fig. 1C). Introns of most clade I members are quite larger than of those members in other clades. Seventy percent (14 out of 20) of clade I members contain two exons while 15% (3/20) contain one and three exons (Fig. 1C; Additional files 1: File S3). GmACT12 contains the longest intron in the whole set of soybean BAHD genes (Fig. 1C). In clade II, 14 genes contain only one, 8 genes contain two, and 3 genes contain three exons. GmACT21, GmACT22, GmACT32 and GmACT33 have long introns compared to the other clade II members. Clade III genes varied in exon number, but most contain just a single exon. Two clade III members (GmACT51 and GmACT72) contain the maximum exon number (4) in whole soybean BAHD family. Most members of clade IV (76%) contain only one exon while the others contain two exons. Among the 16 paralogous pairs, 10 did not show remarkable differences except for GmACT21-GmACT22 of clade II which a member of the pair has an untranslated region (UTR) longer than the other member. The other six paralogous pairs showed small variations in exon number or intron length (Fig. 1C).
Phylogenetic relationships of the soybean BAHD members
The evolutionary relationship of the soybean BAHD genes was explored by constructing a phylogenetic tree to compare them with 43 functionally characterised BAHD genes from different plant species (Fig. 2). On the basis of substrate donor and acceptor, the 103 soybean BAHD genes were classified into four major clades (I-IV) and clade III was further subdivided into two subclades (IIIA and IIIB) (Fig. 2). Our results based on the phylogenetic relationships of the characterised soybean ACT proteins would predict their correct enzymatic functions.
Clade I contains 20 soybean BAHD genes along with 16 characterised genes from different plant species (Fig. 1). Most members of this clade code for benzoyltransferases, alkyl-hydroxycinnamate ester, or cell-wall acyltransferases. They use benzoyl as a donor substrate to form benzyl benzoates, such as CbBEBT, NtBEBT , or salicyl benzoate (PtSABT) and involved in plant defence mechanisms against herbivory . Many characterised BAHD enzymes of clade I are responsible for the formation of benzenoid compounds [7, 12, 18]. Among them, PtSABT from Populus trichocarpa uses different CoA thioesters and alcohols as a donor to catalyse the production of salicyl benzoate for the synthesis of a variety of acetates, benzoates and cinnamates . PtBEBT from Populus trichocarpa is more selective to produce benzyl benzoate  and takes part in the synthesis of salicinoids. CbBEBT produces benzyl benzoate in Clarkia breweri that composes its floral aroma. The CbBEBT gene is highly expressed in the flower, and especially in the stigma . These BEBT genes code for products that have the capability to produce compounds that are important for plant defence or aroma [7, 12, 18]. Four BEBT genes from soybean (GmACT3, GmACT4, GmACT5, GmACT20) have high similarity with the characterised BEBT genes. Similarly to CbBEBT, GmACT3 and GmACT4 also showed high expression levels in the flower, which leads us to hypothesize that it likely plays a function in catalyzing aroma compounds. These genes are also highly expressed in the stem and the leaf, like PtSABT and PtBEBT that produce defence compounds. Some members of clade I were also reported being involved in the biosynthesis of cutin and suberin, which are known to be lipid barriers on organ epidermis surfaces. These barriers regulate gas, water and solute fluxes  and play critical roles against pathogen attacks . Alkaylhydroxycinnamoyl ester BAHD genes, such as AtASFT , StFHT , AtDCF , and AtFACT  use feruloyl, p-coumaroyl, sinapoyl, and caffeoyl as a donor to form aromatic suberin, alkyl, or ω-hydroxy fatty acids, which are used for the synthesis of cutin polymers that in turn are important components of the cuticle and assist in the deposition of suberin into the inner side of the primary cell wall of secondary shoots and roots . They also play a role in the synthesis of root waxes that contain hydroxycinnamates in A. thaliana . Ten genes from soybean fell in clade I and, thus have the same predicted transferase function (Fig. 2; Additional files 1: File S3). Interestingly, we have found in soybean only one orthologue of disinapoylspermidine (AtSDT; ) and another of coumaroyl-spermidine (AtSCT; ) that were annotated as GmACT2 and GmACT12, respectively (Fig. 2) while all the other genes in this clade have multiple copies or paralogous in the soybean genome. Monolignols, which participate in the formation of lignin, are synthesised through the phenylpropanoid pathway, primarily through the activity of hydroxycinnamoyltransferases (HCTs). The BAHD-ACTs also used the hydroxycinnamic acid (HCA) as donor substrate to esterify monolignols . These cell wall-related BAHD-ACTs also belong to clade I, such as OsPMT , BdPMT , and OsAT10  that use p-coumaroyl-CoA as a donor substrate to form lignols and glucuronoarabinoxylan (GAX). These activities are responsible for producing the lignocellulosic biomass from grasses, which could be a major feedstock for the generation of liquid biofuels . The soybean genes GmACT1, GmACT13, GmACT14, GmACT16, GmACT17, and GmACT19 were also predicted to have cell wall-related functions, and they may play essential roles in the production of liquid biofuel compounds.
Clade II of the soybean BAHD family contains 26 genes and cluster together with 9 characterised from other species (Fig. 2). This clade includes many genes related to shikimate/quinate hydroxycinnamoyltransferases (HCT/HQT) and acyl-CoA N-acyltransferases, such as hydroxycinnamoyl-CoA, spermidinehydroxycinnamoyltransferase (SHT), hydroxycinnamoyl-CoA: N-hydroxycinnamoyl/benzoyltransferase (HCBT), and rosmarinic acid synthase, which is alternatively called hydroxycinnamoyl-CoA:hydroxyphenyllactate hydroxycinnamoyltransferase (RAS). These members used hydroxycinnamoyl, caffeoyl or p-coumaroyl as donor substrates to form chlorogenic acid, hydroxycinnamoylquinate, caffeoyl, or p-coumaroylquinate or malate, hydroxycinnamoyl glycerol, hydroxycinnamoylshikimate, hydroxyphenyllactates, anthramides, and spermidines [27, 29, 43, 58, 64, 74, 83]. Clade II contains mostly acyl-CoA N-ACTs that catalyze N-acylation instead of O-acylation. These reactions catalyze the hydroxycinnamoyl and amine moieties that are the entry point of the phenolamide pathway . Hydroxycinnamic acid (HCA) amides are classified as phytoalexins . Most of the plant N-HCTs that have been characterised so far are from A. thaliana and N. attenuate and involved in defence mechanism against herbivores . While other members of the clade II, i.e., HCT/HQT and their relatives produce different compounds like lignin as well as chlorogenic acid (CGA) and quinates, which play protective roles against UV photooxidation  and pathogenic resistance . CGA has also been implicated in cell wall development  and root hair formation . Seventeen soybean BAHD genes are closely related to TpHCT2, which product transfers hydroxycinnamoyl moieties (p-coumaroyl and caffeoyl) to malic acid (Fig. 1). This transferase activity of TpHCT2 is involved in phaselic acid synthesis, particularly in the leaves of red clover (Trifolium pratense, Fabaceae) . TpHCT2 is related to GmACT27, GmACT28, and GmACT45, which also have high and specific expression in leaves (Fig. 2), and may play a role in phaselic acid synthesis. Meanwhile, GmACT21, GmACT22, GmACT32, and GmACT33 fall closest with TpHCT1A/1B, which use caffeoyl or p-coumaroyl as a donor to form shikimates or malates  and play a major role in monolignol synthesis [29, 74]. Lignin precursors are synthesised through p-coumaroyl derivatives rather than methoxylated analogues in response to pathogen attacks . TpHCT activity was higher in the stem and capable of transferring p-coumaroyl moieties from the respective CoA to shikimic acid, which is critical for the biosynthesis of monolignol lignin precursors [29, 74]. AtHCT expression dramatically changed the lignin composition, which demonstrates the essential function of HCT in phenylpropanoid metabolism [6, 29]. HCT homologues have been identified in several vascular plant species . In conformity with other members of this group, GmACT21, GmACT22, GmACT32 also presented high expression levels in the stem, and they may transfer p-coumaroyl moieties to respective acceptors with essential roles in lignin biosynthesis. Other four members of this clade (GmACT35, GmACT36, GmACT38, and GmACT39) are closely related to SHT from Arabidopsis, which uses hydroxycinnamoyl as a donor substrate to form hydroxycinnamoylspermidine . This compound is abundant in pollen of many species, which indicates a role of these phenylpropanoids in the pollen . AtSHT expression was high and specific to the tapetum cells of the anther during early flower development in order to supply the building blocks for cell wall development in the pollen . The metabolic profiling of flower bud tissues in Arabidopsis revealed that AtSHT knockout or RNAi plants had a positive correlation between acyltransferase protein accumulation and the amounts of trihydroxyferuloyl and dihydroxyferuloyl sinapoyl spermidines . GmACT35 and GmACT38 were also more highly expressed in the flower than any other tissues and are expected to have a function in the acylation of spermidine in the pollen.
Clade III is further divided into sub-clades IIIA and IIIB (Fig. 2). Sub-clade IIIA contains 25 genes along with 19 already characterised from other species. Most of the characterised members of clade IIIA acylate phenolic glucoside anthocyanins or other flavonoids (flavones or isoflavone). Malonylation of clade IIIA ACTs depends on the position of the acceptor substrate. Clade IIIA can be divided into aliphatic and aromatic ACTs. Fourteen soybean BAHD genes fall in the aliphatic ACTs, and they are functionally annotated as isoflavonoid malonyltransferases (Additional file 1: File S3) along with the characterised At5MaT, At3At1, At3At2 from Arabidopsis and MtMaT1, MtMaT2 and MtMaT4 from Medicago, which facilitate the transport or storage of the metabolites [44, 89]. Among the 14 aliphatic ACTs in soybean, only two, GmIMaT1 and GmIMaT3  have been functionally characterised and play a distinct role against abiotic and hormonal stresses . Their transcription patterns correlated positively with the malonylisoflavonoid composition of the tissue in different parts of the plant under different stress conditions .
Another aliphatic group contains 6 soybean ACTs along with MtMaT5 and MtMaT6  from Medicago that malonylate different anthocyanins, such as cyanidin, delphinidin, and pelargonidin conjugates at 3-O-glucosides . These genes are very specific for anthocyanin modification at the C3 position, and they have no activity towards C5 or C7 of anthocyanin or other flavones or isoflavones . The soybean genes that closely relate to MtMaT5 and MtMaT6 are predicted to function in anthocyanin modification at 3-O-glucoside (Additional file 1: File S3). This group of clade IIIA is most probably associated with the stability of the anthocyanin glucosides since anthocyanin acylation at 3-O sugar with the malonyl moiety has been suggested to increase its stability .
Clade IIIA also contains 7 soybean BAHD-ACT genes (GmACT52, GmACT57, GmACT61, GmACT63, GmACT72, GmACT76 and GmACT77) closely associated with the aromatic ACTs from Arabidopsis, At3At1 and At3At2 (Additional file 1: File S3; ). At3At1 and At3At2 use p-coumaroyl, feruloyl and caffeoyl CoAs as donors to specifically modify the C6 hydroxyl of glucose at three anthocyanin positions, but it did not have any affinity toward sinapoyl-CoA. We predict that these 7 aromatic ACTs will also prefer coumaroyl-CoA to modify the anthocyanin at the C3 position just like their closely related Arabidopsis orthologues (Fig. 2; Additional file 1: File S3). We did not find in the soybean genome any aromatic gene that was closely related to a gene able to modify 5-O-glucosides, such as Gt5AT that acylates the C6 hydroxyl of 5-O-glucosyl residues of anthocyanin with either caffeoyl, p-coumaroyl  or aliphatic anthocyanin MaTs, such as At5MaT (Fig. 2; ) and Ss5MaT1 from Arabidopsis thaliana (Brassicaceae) and Salvia splendens (Lamiaceae), respectively (Fig. 2; ).
Clade IIIB contains 9 GmACT genes from soybean with 7 characterised genes (Fig. 2). This clade contains malonyltransferases (MaTs) that catalyse flavonoids with malonyl-CoA to form malonates, acetyl CoA:deacetylvindoline 4-O-acetyltransferases (DAT) which catalyse the last step in vindoline biosynthesis , acylsugar acyltransferases (ASAT), which is important in the acylsucrose biosynthetic pathway and uses sucrose and acyl-CoA ester substrate specificity to contribute in different types of accumulated acylsucroses . Acylsugars are polyesters of short- to medium-length acyl chains of sucrose or glucose backbones that are produced in secretary glandular trichomes of many solanaceous plants, including the cultivated tomato (Solanum lycopersicum). These compounds act against different insects through a mechanism involving in a multitrophic defence interaction involving their metabolization to volatile fatty acids . Ss5MaT2 also fell in clade IIIB separated from other the MaTs, which is consistent with D’Auria analyses . These BAHD-ACTs share the anthocyanin specific motif (NYFGNC) along with the BAHD HXXXD motif. The characterised member of clade IIIB (Ss5MaT2 from Salvia splendens) malonylates the anthocyanin glycosides bisdemalonylsalvianin and shisonin by transferring the malonyl moiety to the C4 position of the 5-O-glucosyl residue . The soybean genes of clade IIIB also suggest forming salvianin, which has two malonyl moieties, and malonylshisonin, which plays an important role in flower colour . Although the biochemical properties of Ss5MaT2 are similar to that of clade IIIA members, it fell into the clade with ASAT and DAT.
Clade IV contains 21 soybean BAHD genes along with five characterised genes (Fig. 2; Additional file 1: File S3). The members of this clade have varied functions. Most are predicted to function as shikimate O-hydroxycinnamoyltransferase or transferase family, except for GmACT88 and GmACT97, which are predicted to function as an anthranilate N-hydroxycinnamoyl (HCBT)-related transferase, as well as GmACT79 and GmACT81, which are predicted to function as a shikimate O-hydroxycinnamoyltransferase/quinate O-hydroxycinnamoyltransferases. Gt5At from Gentiana triflora, an anthocyanin aromatic acyltransferase, catalyses the transfer of hydroxycinnamic acid moieties from their CoA esters to the glycosyl groups of anthocyanins . HCBT produces the aromatic amide dianthramide in carnation . Anthraniloyl-CoA: methanol AT (AMAT) uses anthraniloyl-CoA as a donor substrate to acylate methanol and produce a foxy odour compound (methyl anthranilate) in concord grapes . Moreover, GmACT93 and GmACT96, which were predicted to involved in cutin formation and closely related to the gene DEFECTIVE IN CUTICULAR RIDGES (DCR) from Arabidopsis thaliana, Medicago truncatula, and Populus trichocarpa . In the flower, DCR converts monomers into polymeric cutin, which is composed of fatty acid, glycerol and aromatic monomers . Changes in epidermal cell differentiation and postgenital organ fusion were observed due to defective cuticle in DCR-deficient plants. DCR was also expressed in the root cap, lateral root primordia, and developing lateral root. Its root expression might point to involvement in the biosynthesis of suberin . Downregulation of DCR in dcr mutants led to excessive root branching and changes of root architecture by influencing lateral root emergence and growth . All in all, these analyses help us formulate educated hypotheses about the diversity and evolution of substrate specificity within the BAHD family in soybean.
Chromosomal localization and duplication of soybean BAHD genes
A chromosome ideogram was constructed on the basis of the physical position of the soybean BAHD-ACT genes (Additional file 1: File S3). All the 20 soybean chromosomes contain BAHD-ACT genes, although the number on each chromosome varies between 1 and 15 genes per chromosome (Fig. 3). Chromosome 18 has the highest number of genes (15%) followed by chromosomes 8 (11%) and 19 (10%) (Fig. 3; Additional file 1: File S3). On the other hand, chromosomes 1, 9 and 20 contain only one gene each (Fig. 3; Additional file 1: File S3). Given their position, often on chromosome arms (Fig. 3), soybean BAHD genes tend to have high rates of recombination .
Tandem and segmental duplications increased the gene numbers into a large family in different plant species . Potential gene redundancies can be discovered by the identification of closely related paralogues in the genome whereas the accurate identification of true orthologues between species can lead to the creation of strong hypothesis of common gene function in other species. The soybean genome contains a high number (103) of BAHD genes as compared to 69 in Cynara cardunculus , 64 in A. thaliana , but less the 119 BAHD genes in the Oryza sativa genome . This large gene family might be the result of the two rounds of whole-genome duplication (WGD) events that occurred 58–60 (the Papilionoideae allotetraploidization event) and 13 Mya (recent soybean allotetraploidization) in soybean . The presence of paralogous genes in a plant genome is more often related to their functional category than the genetic proximity between species . We found sixteen paralogous pairs in the soybean BAHD family (Additional file 2: Table S1). These paralogues are present in all the four phylogenetic clades (Additional file 2: Table S1). Clade I contains 7, clade II contains 3, clade III contains 2, and clade IV contains 4 paralogous pairs (Additional file 2: Table S1). The presence of clear paralogues is evidence of common ancestry and helps to elaborate strong hypotheses about the functional conservation of BAHD genes in soybean along with those already characterised in different species. The selection history of coding sequences can be assessed through the analysis of base substitutions of the protein code called the Ka/Ks ratio. To investigate the deviation of duplicated BAHD acyltransferases, we determined the non-synonymous substitutions per site (Ka), the synonymous substitutions per site (Ks), and the respective Ka/Ks ratios for each paralogous pair. The partition of each pair was estimated through the Ks values. All paralogous pairs had a Ks value from 0.09 to 0.28, which is consistent with the soybean duplication events (Additional file 2: Table S1). The selection pressure can be determined through the Ka/Ks ratio. Most of the paralogous pairs showed negative Ka/Ks ratios (Additional file 2: Table S1) suggesting that the BAHD gene family in soybean experienced strong purifying selection pressure and was bound by strong evolutionary constraints to maintain its stability . Divergence time analysis showed that these paralogous genes evolved between 7.38 to 22.95 million years ago (Mya), which is consistent with the recent allotetraploidization of the species that occurred around 13 Mya (Additional file 2: Table S1).
Positive and purifying selection of BAHD genes in soybean
The Ka/Ks ratio is used as a measure of the direction and magnitude of gene selection . The Ka and Ks substitution rates in this study were estimated on the basis of simple probability theory. Positive selection was considered significant with Ka/Ks ratio > 1 . Comparisons of nucleotide substitution per site in the 103 BAHD genes were carried out on the basis of the Ka/Ks ratio in each individual branch of the phylogenetic tree. The neutrality analysis based on maximum likelihood computation of Ka/Ks values of all soybean BAHD genes revealed 16 coding sites under positive selection between codons 69–342 in clade I, 6 coding sites between codons 33–143 in clade II, 19 coding sites in clade III and 15 sites in clade IV under positive selection, all of which showed Ka/Ks ratios > 1 (Additional file 2: Table S2). The changes in codons, as well as sequences for pairwise comparison, were estimated at the Ka and Ks basis. The variability between Ka and Ks polymorphisms at the allelic distribution level presents quite an explicit proof for positive selection . The allelic distribution in various coding sites can be revealed by the strapping test of neutrality, which is applicable to genomic data containing a large number of polymorphisms, along with the homogeneity test allow for comparing the frequency distribution of synonymous sites [3, 46]. The changes in the coding region during the evolutionary time under analysis showed the cumulative presence of synonymous, non-synonymous and ambiguous (indels) codons (Additional file 3: Figure S1). Synonymous and non-synonymous mutations showed low cumulative behaviour at the start site and then gradually increased during the evolutionary time analysed, whereas the opposite behaviour was observed for ambiguous codons at the start point and then acquired a static behaviour at the end of the code. The very same behaviour of synonymous, non-synonymous and ambiguous codons was found in all four clades of the BAHD family (Additional file 3: Figure S1).
We estimated the Likelihood Rate Test (LRT) through the empirical Bayes method at specific branching points and identified various diversifying selection sites. The mixed-effect model of evolution (MEME) was used to identify the diversifying selection of BAHD genes also on the basis of empirical Bayes procedure . MEME identified 7, 17, 16, and 4 episodic diversifying coding sites (p < 0.05) in clades I-IV, respectively (Additional file 2: Table S3). The synonymous (α) and non-synonymous (β) substitution rates were calculated, and coding sites with values β > α were considered as significant to determine the sites under diversifying selection. The maximum-likelihood estimation of β+ > α in MEME was obtained for codons 60 and 434 in clade I, 189 and 463 in clade II, 277 and 643 in clade III, and 299 in clade IV (Additional file 2: Table S3). The prevalence of purifying or natural selection masked the episodic natural selection with the transient period of adaptive evolution. Thus, we could not identify sites under positive selection due to the sensitivity of tests used  and the stringent parameters chosen, such as p < 0.05 and the empirical Bayes factor threshold of analysis for selection of sites (> 15), so that the best outcomes were obtained when the selected branches were placed in the conserved lineage.
Ambiguities in the posterior gene- and site-specific distributions were calculated through FUBAR using the MCMC approach [3, 54]. This analysis revealed the following number of sites in BAHD genes under pervasive diversifying selection: one site in clade I, four in clades II and III each, and two in clade IV (Additional file 2: Table S4). The diversifying evolutionary sites were detected through the random effect model using the rate distribution with a large number of parameters at the significance level at 95% confidence interval and Pr [β > α] values (Additional file 2: Table S4). The class rate weight at each coding site under positive selection was calculated through a bivariate general discrete distribution. The posterior mean values ranged between 0.80–0.92, which are closer to the value of the potential scale reduction factor (Additional file 2: Table S4) and indicates MCMC convergence. Coding sites with β > α and empirical Bayes factor (EBF) > 15 were considered to be under diversifying selection. The EBF values for each coding site under positive selection were calculated with a net effective sample size (Additional file 2: Table S4). PSRF values for each site under positive selection were < 1.03, whereas the effective population size was > 100. The detected selection across many coding sites was potentially improved through deducing the allocation of gene-specific selection parameters. Altogether, the coding sites under positive selection unveiled in this analysis provide strong evidence of diversifying selection in BADH genes under selection in the soybean lineage.
The evolutionary fingerprint of soybean BAHD genes
Nucleotide and amino acid substitutions in the codon model were used to identify synonymous and nonsynonymous substitutions [4, 48]. The genetic algorithm in the codon model evolution was used to identify the evolutionary fingerprint in coding sites of BAHD genes. The codon model evolution used a phylogenetic Markov model that includes substitution rates, character frequencies , amino acid substitution rate clustering , and branch lengths estimated through the maximum-likelihood method. The 9052 models generated were used in codon model selection on the basis of likelihood log and modified Bayesian Information Criterion (mBIC). Selective effects with an exchangeable preference for particular amino acid were found in BAHD genes through a model that used the combined empirical codon and transition/transversion-related physiochemical parameters . The model with log (L) value − 16360.3 was considered optimum for the amino acid substitution analysis. We observed mBIC values ranging from 33701.1 for clade II to a maximum value of 81506.1 for clade IV (Additional file 2: Table S5). All clades showed three-class rates for the distribution of amino acids in different classes (Fig. 5) with the estimation of a single Ka/Ks substitution rate (Additional file 2: Table S5). A genetic multi-rate model algorithm was used to analyse the class rates in order to calculate substitution rates at the amino acid level during the evolutionary time scale (Fig. 5). The substitution rate in each class was calculated through genetic models using the Stanfel class parameters (Fig. 5; ). The substitution rate distributed the amino acids into the three classes through evolutionary rate cluster, and the substitution pair ACGILMPSTV revealed 90% substitution, whereas DENQ had 50% substitution, and FWY and HKR presented < 50% substitution rates. The best fit model with mBIC values defined the clustering rate, and the numerical values of corresponding rate class substitution are inferred with maximum-likelihood estimation, which were computed evolutionary evidence ratio for the gene .
Analysis of cis-regulatory elements in promoter regions of BAHD genes in soybean
Extensive studies on ACTs indicate their roles in plant growth, development, and stress tolerance. Many cis-acting regions are found in the upstream region of GmACTs genes according to our in silico analyses (Additional file 1: File S5). The 103 GmACT promoter regions contain cis-elements responsive to light (43%), hormones (29%), environmental stresses (20%), and plant growth (8%) (Fig. 4A). These regions contained transcription factor binding sites involved in response to plant hormones (Fig. 4B), biotic stresses and pathogen defence, MYB binding sites, and response to heat, low temperature, and drought (Fig. 4C). The pattern of cis-acting regions differed among the soybean BAHD gene promoters (Additional file 1: File S5). GmACT3 (clade I) contains the maximum number of cis-elements (29), whereas GmACT82 (clade IV) contains 26 hormone-responsive elements (Additional file 1: File S5). GmACT34 contains the maximum number of ABA-responsive elements (8), and GmACT45 contains 9 ethylene-responsive motifs (Additional file 1: File S5). GmACT84 and GmACT96 (clade IV), along with GmACT66 (clade III) contain the maximum number of environmental stress-response elements (18, 17, and 15, respectively) in their promoter regions. Members of clades III and IV contain the maximum number of MYB-related cis-elements (Additional file 1: File S5). GmACT84 showed the most (12) MYB related elements, and also showed high expression values, up top seven-fold higher in response to Al3+ and two-fold higher to fungal elicitor than the control (Additional file 1: File S1c, S1d). The promoter regions of paralogous pairs were also compared. Both members of each pair showed variation in cis-elements related to stress or hormonal responses or other signals in their promoter regions. Some hormonal, stress or growth-related elements were often present only in one member of the pair (Additional file 1: File S5). Although the GmACT paralogous pairs have little variation in protein length, the sequences of their promoter regions were highly different, which suggests divergent specific roles of each member. We concluded from this analysis that GmACTs may play vital roles in plant stress perception, signalling, or biotic and abiotic stress defence.
Expression patterns of soybean BAHD genes in different tissues
Expression data for the 103 soybean BAHD genes in different soybean tissues were extracted from Phytozome (Fig. 6; Additional file 1: File S1a). Most of the genes showed specific tissue expression patterns related to their clade function. The benzenoid-related genes (GmACT3 and GmACT4) had maximum expression levels in above-ground organs, whereas GmACT5 is highly expressed only in the root (Fig. 6; Additional file 1: File S1a). Benzenoid compounds are produced mostly in flowers and known to be involved in the aroma of numerous fruits and contribute to improving flavour quality . Indeed, already characterised genes from this clade, PhBPBT from Petunia x hybrida and CbBEBT from Clarkia breweri are highly expressed in floral parts and consistently involved with benzylbenzoate compounds [7, 18]. Monolignol synthetic genes (e.g., GmACT1 and GmACT16) showed a high expression level in root and pod. Likewise, OsPMT and BdPMT, which are specific to monolignols with p-coumarate and ubiquitously expressed but particularly high in the spikelets. When overexpressed, these genes led to a three-fold increase in pCA lignin compared to the control . Monolignol-pCA conjugates integrate with lignin biosynthesis through oxidative radical coupling in order to generate the pCA appendages [62, 84]. Polyamine conjugates are abundant in seeds and serve as nitrogen reserves during germination . The Arabidopsis SDT gene is highly expressed in seeds and produce disinapoyl spermidine and its glucoside while the SCT is mainly expressed in roots and produce coumaroylated spermidines . Soybean also contains SDT and SCT related genes, GmACT2 and GmACT12, respectively, which are expressed in all organs of the plant, although GmACT12 expression is barely detectable in the stem, leaves and roots, although spermidine accumulates at high levels in seeds and the root .
The clade II members, GmACT29 and GmACT30, showed high expression only in the seed. GmACT40 is expressed in the root and nodule, while GmACT38 is especially high in nodules and flowers (Fig. 6; Additional file 1: File S1a). On the other hand, GmACT41, GmACT45, GmACT35, GmACT21, GmACT22, and GmACT32 showed high expression levels in all tissues (Fig. 6; Additional file 1: File S1a) suggesting putative roles in cell wall development and plant defence mechanism by producing CGA or quinate-like compounds, given their close relatedness to TpHCT1A and TpHCT1B (Fig. 2; ). Phenolamide, an important compound of clade II genes, is often associated with floral induction and development. GmACT21, GmACT22, and GmACT32 showed high transcription levels in the stem and flowers, similarly to TpHCT1, which showed 4 to 5-fold higher expression in these organs than the leaf and played an important role in stem lignification by participating in monolignol biosynthesis [29, 74]. GmACT27, GmACT28 and GmACT45 are similar at the amino acid and also showed the same expression pattern as TpHCT2 . A reduction in the biosynthesis of phenolamides or their conjugates leads to male sterility in plants. Tri-substituted hydroxycinnamoyl spermidines were synthesised in the pollen coat, and this conjugation is catalysed by SHT . GmACT22, GmACT32 and GmACT35 also had their highest expression in the flower and, given their phylogenetic position and expression patterns, these genes might be involved in the synthesis or conjugation of spermidines in the pollen coat of soybean.
Clade III genes, GmACT67 and GmACT72, are highly and specifically expressed in the flower, GmACT55 in the root, GmACT49, GmACT54, and GmACT65 in the nodule, whereas GmACT66 is expressed in the leaf, GmACT47 in stem and flower, GmACT57 in root and flower, and GmACT71 in seed and the pod (Fig. 6; Additional file 1: File S1a). The putative isoflavonoid malonyltransferase genes, such as GmIMaT1, GmMaT2, GmIMaT3, GmMaT4, GmACT68, GmACT69, and GmACT70, showed high expression levels in almost all tissues. Other related members presented a more specific expression pattern, such as GmACT71 in seeds and pods, and GmACT50 in the leaf (Fig. 6; Additional file 1: File S1a). The different expression patterns of these genes suggest putative roles of malonylated flavonoids in storage, defence mechanisms, symbiosis signalling, flower colour, and resistance against biotic and abiotic stresses. Most of clade IIIB members had high expression in leaf and stem, like the characterised members of this group. High and specific expression of CrDAT in the leaf correlates with the synthesis of vindoline from tabersonine . Likewise, the expression of ASAT genes was higher in the trichomes of S. lycopersicum stem and S. pennellii leaf, which correlates with the accumulation of acylsugars in these organs .
DCR genes are specifically expressed in the epidermis of vegetative and reproductive organs. Additionally, they also showed expression in the lateral root primordia and developing lateral roots . The cuticle of different organs is the primary barrier against several biotic and abiotic stresses . High expression of DCR genes in young leaves and stem is predictive of their role in the formation of cutin polymers in vegetative organs . During leaf expansion, the cutin polymeric matrix is formed, and DCR plays a vital role to incorporate the hydroxyl fatty acid chains into the cutin polymer . It is also important to convert monomer units into its polymeric form of cutin in the flower. DCR develops the plant lipid polyester for epidermal cell differentiation and patterning, trichome development, and protection from undesired postgenital fusions in aboveground organs . Almost all clade IV genes showed high expression levels in seeds, the root, and nodules. Some genes, such as GmACT82 and GmACT95, presented particularly high expression levels in the root and flowers, GmACT94 and GmACT91 in seeds, GmACT99 in the root and leaves (Fig. 6; Additional file 1: File S1a).
Expression profiles of BAHD genes during nodule development in soybean
The gene expression of BAHD members was analysed in RNA-Seq data derived from soybean nodules at different developing stages . The results showed that BAHD genes were differentially expressed in the nodule at different developmental stages, from nodule development until its senescence (Fig. 7; Additional file 1: File S1b). Most of clade II and clade IV genes showed higher expression in the root than nodules, while 50–80% genes of clades I and III had higher transcription levels in nodules than the root. The expression levels of GmACT39, GmACT40, GmACT41 (clade II), GmACT27 and GmACT33 (clade IV) were 10 to 40-fold higher in the initial stages of nodule development and reached up to > 80-fold in later stages (stage 5, pre-senescence) than in the root. Even though most of the clades I and III genes were expressed more highly in nodules than roots, their expression levels were not much higher than genes in clades II and IV, except for GmACT12 and GmACT68, which showed more than 2 and 70 fold higher expression than root, respectively (Fig. 7; Additional file 1: File S1b). The expression patterns of GmACT35, GmACT39, GmIMaT1, GmACT86, and GmACT87 gradually increased as the nodule developed, peaking at stage 5 and then dropping at senescence (stage 6), which indicates their roles in nodule development. Although the expression levels of GmIMaT1 and GmACT86 were higher in roots than nodules, their expression increased with nodule development. Therefore, these genes may play a symbiotic role in the relationship between root and nodule development by modifying secondary metabolites in roots for secretion into the rhizosphere in order to attract rhizobia, such as isoflavonoids that act as a signalling molecule for inducing nod factor biosynthesis in rhizobia. The transcription level of GmIMaT3 was the highest after GmACT87 than all other BAHD genes in the different nodule stages analysed, and its expression remained static throughout the nodule development and almost similar with that in the root (Fig. 7; Additional file 1: File S1b). The expression levels of GmACT15, GmACT18, GmACT40, GmACT41, GmMaT2, and GmACT75 were higher in the initial stages and gradually decreased as the nodule development advanced, reaching the lowest expression levels at its final stage (Fig. 7; Additional file 1: File S1b). Therefore, these genes may play a function in rhizobial interaction and nodule initiation and development.
Expression patterns of soybean BAHD genes in response to different stresses
The expression of BAHD family genes was analysed from RNA-Seq data obtained from plants subjected to acidic condition (low pH 4.0), aluminium treatment 50 µM Al3+ (pH 4.0), and fungal elicitor after 2 and 24 h. Most of the BAHD genes expression decreased in response to low pH and Al3+, except for GmACT13 (clade I) and GmACT36 (clade II). Their expression levels doubled at low pH but decreased or remained the same at Al3+ stress conditions (Fig. 8; Additional file 1: File S1c). The expression of GmACT49, GmIMaT1, GmIMaT3 (clade III) and GmACT84 and GmACT92 (clade IV) increased dramatically in response to both stress conditions (Fig. 8; Additional file 1: File S1c). The expression of GmACT49, GmACT84 and GmACT92 increased about 11, 7, 40 fold in response to low pH, and 6, 7, 70 fold in response to Al3+ stress, respectively, whereas the expression of GmIMaT1 and GmIMaT3 significantly increased as compared to the control under these conditions (Fig. 8; Additional file 1: File S1c). Therefore, the BAHD genes that showed elevated transcription levels after low pH and Al3+ treatments could be good candidate genes for soybean to Al3+ toxicity tolerance. The transcription levels of GmACT12 (clade I), GmACT39 (clade II), GmACT62, GmACT68, GmMaT2, GmMaT4, GmACT75 (clade III) and GmACT86 (clade IV) decreased from 5 to 70-fold, revealing their sensitivity towards low pH and Al3+ stress conditions.
Transcriptomic analyses in response to fungal elicitor (FE) showed that the expression of 29% of the BAHD genes (30 out of the 103) changed significantly, with 10 genes increasing and 20 decreasing their expression levels (Fig. 9; Additional file 1: File S1d). The expression of GmACT35, GmACT36 (clade II) GmIMaT1, GmMaT2, GmACT50, GmACT57, GmACT58 (clade III), GmACT84, GmACT85 and GmACT92 (clade IV) increased while the other genes had their transcription level decreased. The expression of GmACT36, GmACT57, GmIMaT1, GmACT58, GmACT84 and GmACT92 increased almost 12, 20, 2, 40, 2, and 2 folds, respectively (Fig. 9; Additional file 1: File S1d). This expression pattern suggests a role in resistance to fungal attack by the synthesis and secretion of modified metabolites. Indeed, fungal and bacterial infections, or elicitor treatment in Solanaceae plants and Arabidopsis, increase the synthesis of coumaroyl or feruloyl tyramines [53, 87]. The anti-microbial activity was demonstrated for dicoumaroyl-caffeoyl spermidine, which inhibits mycelial growth of Pyrenophora and reduces powdery mildew infection in barley seedlings . Likewise, constitutive accumulation of tyramine derivatives in transgenic tomato promotes resistance to Pseudomonas syringae . MtMaT3 expression increased in response to fungal infection and decreased in response to copper treatment . The observed modulation of BAHD family genes in soybean in response to fungal elicitor may correlate with an increased acylation of active compounds against fungal attacks.
Expression patterns of soybean BAHD family genes in different soybean genotypes seed coat and cotyledons
Soybean Illumina expression data of seed coat and cotyledon obtained from developing seed (stage 5, 380–450 mg in weight) of three varieties were downloaded  and analysed. Interestingly, some BAHD family genes were highly expressed in seed coat while expressed lowly in cotyledons, and vice versa. GmACT5 (clade I), GmACT25 (clade II), GmMaT2 and GmIMaT3 (clade III), and GmACT84 (clade IV) showed high expression in the cotyledon as compared to seed coat (Additional file 1: File S1e) suggesting a function in the storage of modified metabolites, or resistance against seed borers. The expression levels of GmACT4, GmACT15 (clade I), GmACT21, GmACT22, and GmACT35 (clade II), GmACT57, GmMaT4, GmACT70 (clade III), and GmACT79, GmACT81 (clade IV) were more specific and high in brown than blackl seed coats and almost absent in yellow seed coats (Additional file 1: File S1e) suggesting a role in anthocyanin modification.
Expression modulation of soybean BAHD family genes by MYB transcription factors, GmLEC2a and GmWRi1b in the hairy root heterologous system
MYB transcription factors (TF) tightly regulate the biosynthesis of flavonoids and anthocyanins . Many soybean MYB factors have been reported to either positively or negatively regulate isoflavonoid accumulation in roots and seeds or the production of anthocyanins and proanthocyanidins (PAs) in seeds . Two soybean MYB TFs, including one CPC-like R3-MYB repressor (GmMYB3), and an R2R3-MYB activator (GmMYB7), were over-expressed in soybean hairy roots and an RNA-Seq analysis was performed. The expression of GmACT17 (clade I), GmACT23, GmACT40, GmACT41, GmACT43 (clade II), GmACT75 (clade III) and GmACT98 (clade IV) was upregulated by GmMYB3 (Additional file 1: File S1f; Additional file 3: Fig. S2) while the expression of GmACT9, GmACT12 (clade I), GmACT21, GmACT22, GmACT28 (clade II) was induced by GmMYB7 (Additional file 1: File S1f; Additional file 3: Fig. S2). The transcription levels of GmACT24 (clade II) and GmMaT2 (clade III) were 2 to 5-fold upregulated by both, GmMYB3 and GmMYB7, as compared to the control. Most of the induced genes belong to clades I and II, which are involved in the modification of different compounds that by adding ferulyl, caffeoyl, sinopoyl, p-coumaroyl, and hydroaxycinnamoyl moieties. On the other hand, the upregulated genes of clade III may be involved in the synthesis of flavonoids or anthocyanins since we observed the over-expression of MtPAR, which a previous report showed to lead a 10-fold increase in the flavonoid contents . GmMYB3 and GmMYB7 negatively regulated most of the clade III genes, but not GmACT44 (clade II). The expression levels of GmACT54, GmACT57, GmACT62, GmACT64 and GmACT65 were suppressed in the MYB over-expressed soybean hairy roots (Additional file 1: File S1f; Additional file 3: Fig. S2). High isoflavonoid levels were reported in soybean hairy roots . MtPAR in soybean hairy roots decreased contents of isoflavonoids, such as daidzin, glycitin, genistin, malonyldaidzin, and daidzein up to 3, 2, 4, 2.5, and 8-fold, respectively, as compared to control through repressing the IFS gene . In fact, MYB-TF suppressed the expression of not only IFS but also other isoflavonoid biosynthesis genes. An active complex is formed to activate isoflavone biosynthesis via a MYB activator . MtPAR also has a negative impact on isoflavone production through expression suppression of the biosynthesis genes .
Some of the BAHD acyltransferases were also significant up- or down-regulated in GmLEC2a-over-expressing hairy roots . While some genes specifically expressed in leaves (GmACT26) and roots (GmACT84, GmACT85) should not be native targets of GmLEC2a, genes that are highly expressed in seeds (GmACT15, GmACT34), pod (GmACT71, GmACT99) or open flowers (GmACT75, GmACT97) may be natural GmLEC2a targets. Many of such genes were repressed in the GmLEC2a-overexpressing system (Additional file 1: File S1g), suggesting that LEC2a may mainly upregulate the primary metabolism while repressing specialised metabolic genes, including those coding for enzymes that modify metabolic structures, such as BAHD genes.
Some of these secondary metabolic genes were also regulated by GmWRI1b, as shown by their up- or downregulation in GmWRI1b-OE transgenic hairy roots . There were 63% (65 out of 103 genes) BAHD genes with significantly modulated expression in GmWRI1b-OE hairy roots, being 51% of them upregulated and 47% downregulated (Additional file 1: File S1h). Interestingly, some genes with high expression in aboveground organs (e.g., GmACT2, GmACT4) were upregulated in GmWRi1b-OE hairy roots. BAHD genes that are expressed specifically in leaves (GmACT28), stems (GmACT47), and flowers (GmACT47, GmACT57, GmACT82) were upregulated in GmWRi1b-OE hairy roots. Nodule expressed genes (GmACT9, GmACT44, GmACT49) were upregulated whereas, GmACT54, GmACT87 and GmACT89 were downregulated. On the other hand, most of the genes specifically expressed in roots (GmACT18, GmACT21, GmACT22, GmACT23, GmACT36, GmACT40, GmACT41, GmACT43, GmACT86) and seeds (GmACT15, GmACT94) were downregulated in GmWRi1b-OE hairy roots. GmACT56 and GmACT62 were highly expressed in roots along with other organs were upregulated but those root-expressed genes that were also expressed in nodules (e.g., GmACT92, GmACT98) were down-regulated. We also noticed that some of BAHD members with very low count values in different soybean organs were significantly upregulated (GmACT33) or downregulated (GmACT17) in the GmWRi1b-overexpressing hairy root system. Most of the clade IV genes that targeted acyl-lipids for wax or cuticle synthesis were down-regulated in the heterologous root system (Additional file 1: File S1h).
qRT-PCR validation of BAHD genes in developing nodules and in response to stresses
In order to validate in silico expression and further reveal the involvement of BAHD genes in physiological processes, we performed quantitative reverse-transcription PCR (qRT-PCR). We analysed the expression profiles during nodule development and the response to Al3+ and fungal elicitor exposures. To select GmBAHD genes for expression analyses, we considered the following criteria: (i) high expression in different tissues; (ii) representation of genes in all clades; (iii) inclusion of some paralogous pairs. The expression profiles revealed by qRT-PCR were generally consistent with the RNA-seq data. The expression of GmACT2 was higher in nodules than roots but static in all nodule developing stages, while the expression levels of GmACT21, GmACT22, GmACT55, GmACT68, GmACT79, GmACT81, GmACT84 and GmACT86 were higher in roots while lower but static during nodule development. GmACT12, GmACT15, GmACT41 expression was higher during the initial stages of nodule development but gradually decreased and remained low during nodule senescence. GmACT44 expression was initially low and gradually increased and was found high expression during nodule senescence (Fig. 7B). GmACT84 expression was high in response to Al3+ (Fig. 8B) and fungal elicitor (Fig. 9B) compared to the control while most of the other genes tested decreased their expression significantly or remained almost unchanged.