Optimisation of the DAP-seq workflow and evaluation of transcription factor binding site reproducibility
The protocol of DAP-seq, applied to Eucalyptus MYB TFs in the present work, was adapted from the method described in Bartlett et al. (2017). A schematic view of the key steps is represented in Fig. 1. Briefly, a sonicated genomic DNA library (gDNA) was generated from E. grandis developing xylem tissues, ligated to sequencing adaptors, and purified using magnetic beads coated with HALO-tagged MYB TF translated in vitro. Purified gDNA fragments bound on MYB TFs were washed, isolated, released, amplified, and sequenced to identify TFBSs. In order to reduce potential technical limitations that could explain the many unsuccessful DAP-seq assays reported in previous studies using large-scale screens (O’Malley et al. 2016; Bartlett et al. 2017), we added additional QC to improve DAP-seq efficiency and reproducibility (Fig. 1a), as well as providing qualified improved standards for independent DAP-seq experiment comparison.
Optional QCs concern (1) the detection of HALO-TF protein successfully bound on beads and (2) the measurements of gDNA library enrichment at critical steps of the DAP-seq protocol (Figs. 1b to e). First, in addition to the input quantity of HALO-TF fusion protein, the quantity of TF bound to beads through HALO-ligand (chloroalkane) recognition likely influences DAP-seq success. As shown in Fig. 1C (left panel), we used an anti-HALO antibody to detect EgrMYB1, EgrMYB2, EgrMYB122, EgrMYB135, and EgrMYB137 fused to the HALO-tag in the in vitro translation reaction mix, along with a HALOTag positive control. We demonstrated that a fraction of TF-bound beads can be used to assay TF binding on beads (Fig. 1C, right panel) as an additional quality step during the DAP-seq assay and is appropriate in case of unsuccessful protein detection after in vitro production or unbound HALO-protein signal in the supernatant. Second, in addition to PCR quality controls proposed by Bartlett et al. (2017), efficiency and homogeneity of DNA recovery was assessed by capillary electrophoresis and/or qPCR before and after the steps of DNA sonication, adaptor ligation, DNA amplification, and library purification. A fast and efficient beads-based DNA purification method (SPRIselect®) was compared with gel purification. SPRIselect increased DNA recovery yield after library pooling and lowered the amount of 100–150 bp dimer contaminants detrimental for next-generation sequencing.
The reproducibility of DAP-seq data has not been well studied, and best practices for identifying reproducible TFBSs from DAP-seq data have not yet been established. Therefore, we implemented DAP-seq analysis of EgrMYB1, EgrMYB2, EgrMYB122, EgrMYB135, and EgrMYB137 across two independent laboratories, each with technical replication, and additionally with inter-lab replication for EgrMYB1 and EgrMYB2. In Laboratory 1, technical replication involved two independently prepared pre-binding gDNA libraries prepared as per Bartlett et al. 2017, with additional evaluation of different PCR cycle numbers during full-length adapter indexing to evaluate the effect of this parameter on sequence yield and duplication levels. In total, ~ 173 million paired clean reads were obtained from laboratory 1 (Supplementary Table S1). After testing different PCR cycle numbers during library preparation, we observed that 15 PCR cycles reduced the proportion of duplicated sequence reads but overall yielded lower quantities of total reads than 20 PCR cycles (Supplementary Table S1). Therefore, we opted to continue with 20 cycles. Additionally, substantial differences were observed between DAP-seq experiments performed on independent pre-binding gDNA libraries within the same experiment, where one library (E2) yielded significantly more sequence data than the other (E1) (Supplementary Table S1), despite yielding similar qPCR values in pre-binding library quality control as advised by Bartlett et al. (2017) (data not shown). This suggested that post-binding quality control (e.g., qPCR validation) could ensure low-quality libraries are discarded prior to sequencing.
In Laboratory 2, EgrMYB1, EgrMYB2 and EgrMYB137 DAP-seq datasets each included three technical replicates (Supplementary Table S1). To reduce library failure, we included additional capillary electrophoresis to monitor gDNA fragments of interest along DAP steps, and additional qPCR analyses (Fig. 1b, d). As advised by Bartlett et al. (2017), the average size of gDNA fragments was around 200 bp after sonication, 260 bp after minimal adaptor ligation (Fig. 1b), and 320 bp after PCR amplification using full-length adaptors (Fig. 1d). DNA purification dramatically decreased gDNA template (Fig. 1d): Ct rose from 7.3 ± 0.8 for ligated gDNA to 24.4 ± 2.7 for recovered DNA. The highest variability of gDNA template among and between library replicates was also observed at this step, but library homogeneity was restored post-PCR (Fig. 1d). Ligated-DNA amplification by PCR generated a high amount of detrimental 80–150 bp DNA fragments, likely synthesised from free adaptors (black arrows, Fig. 1e). Because Illumina technology is biased towards the sequencing of small fragments, we tested a bead-based post-DAP gDNA size selection, which enabled an efficient elimination of contaminating dimers while reducing gDNA loss compared to classical gel purification (Fig. 1e).
DAP-seq analysis was replicated within and, for EgrMYB2, between independent laboratories. We applied three quality metrics to assess libraries: first, libraries had to yield at least 1,000 DAP-seq peaks to be considered informative; second, they required a Fraction of Reads in Peaks (FRiP) score of ≥ 5% as per ENCODE signal-to-noise recommendations for ChIP-seq (Landt et al. 2012); third, their peaks had to exhibit enrichment for the expected AC-I (ACCTACC), AC-II (ACCAACC), or AC-III (ACCTAAC) canonical motif (Hatton et al. 1995; Raes et al. 2003; Patzlaff et al. 2003; Zhong et al. 2008; Zhou et al. 2009; Zhong and Ye 2012). We called peaks using MACS2(Zhang et al. 2008) and GEM(Guo et al. 2012) to compare the two peak callers. For libraries meeting the criteria, GEM generated more peaks than MACS2 (an average of 11,421 peaks per candidate for GEM versus 3,819 peaks per candidate from MACS2) and had a higher average FRiP score (21.6% versus 19.5%) (Supplementary Table S4). Therefore, we proceeded with GEM for peak calling. Based on our stringent criteria, five of the seven replicates for EgrMYB2 but only seven of sixteen libraries from EgrMYB1, EgrMYB122, EgrMYB135 and EgrMYB137 passed these criteria (Supplementary Table S4 and Supplementary Table S5), corresponding to a per-library success rate of 52%. By similar metrics, ours is a 3-fold improvement over the 17.3% (314 of 1,812 TFs) subjected to DAP-seq by O’Malley et al.(2016) that were successful. Furthermore, if we consider the additional QC steps implemented in Laboratory 2, our per-library success rate was ~ 78%.
For EgrMYB2, two technical replicates from Laboratory 1 (EgrMYB2_E1_15 and EgrMYB2_E1_20) had fewer than 1000 peaks each and FRiP scores below 5% and hence discarded (Supplementary Table S6). Since their peaks were still enriched for AC-II elements, the affinity enrichment of these libraries was likely successful, but the sequence coverage was insufficient to yield high peak numbers. The number of peaks in common between EgrMYB2 replicates ranged from 3,823 (Laboratory 2) to 15,263 (Laboratory 1) (Supplementary Table S6). To determine the final peak set for EgrMYB2 we merged the peak sets from each laboratory and took the overlapping set of peaks between laboratories. In doing so, only high confidence peaks were retained. The best pairwise overlap of peaks between replicates for EgrMYB135 and all three replicates for EgrMYB137 that passed our criteria were similarly merged. Finally, only single replicates for EgrMYB1 and EgrMYB122 yielded over 1,000 peaks and FRiP ≥ 5%. Therefore, while DAP-seq success rates are still far from 100%, by including multiple technical replicates per candidate we were able to discard poor replicates and obtain high-quality data for all five of our candidates.
Development of a random forest (RF) classifier for target gene assignment
In previous work, we observed that putative target genes linked to SCW-related EgrMYB DAP-seq binding sites located in open chromatin were overrepresented in systems biology datasets linked to wood formation biology, while those found outside of accessible chromatin were not (Brown et al. 2019). This suggests that true target genes could be distinguished from incidental DAP-seq peak associations based on complementary data that provide functional context. We sought to build on this premise and improve the accuracy of proximity-based target gene assignment by leveraging high-confidence target gene-DAP-seq peak associations and various functional omics data in Arabidopsis (Supplementary Table S2) to train and transfer a supervised ML classifier to similar features calculated from Eucalyptus data.
Brooks et al.(2019) identified direct target genes of 33 nitrogen-early response TFs in Arabidopsis root cells from protoplast-based post-translational induction assays (Bargmann et al. 2013). We used this data to train a random forest (RF) classifier to discriminate bona fide gene targets proximal to DAP-seq binding events by labeling candidate TFBS-target pairs as true (i.e., transient overexpression of the TF causes differential expression of the target gene) or false (no differential expression following transient TF induction). Of 33 TFs analysed by Brooks et al. (2019), 14 had DAP-seq data available (O’Malley et al. 2016) for classifier training. Classifier performance was evaluated for different classifiers trained on three negative sample sets comprising genes with undetected (UDG) or low (LEG) expression values versus random selection of non-differentially expressed genes. The rationale for this was that genes with undetected or low expression in the same tissue as a TF of interest are least likely to be true targets and include fewer false negatives (Song et al. 2020). One-way ANOVA analyses showed that the choice of algorithm and the negative training dataset significantly affected classifier performance, where post-hoc testing for AUC-ROC scores revealed that the RF classifier developed on matrices with UDGs as negative samples performed better than the other methods (Fig. 2a; P-value < 0.05).
An absolute pairwise correlation threshold of 0.7 was used to identify redundant features (Supplementary Figure S2). After removing the least important feature of each pair based on RF feature importance values, eleven features remained. Of these, the Pearson correlation coefficient (PCC) showed the highest RF feature importance, while features relating to DNase-seq and CNS data were the least important (Supplementary Figure S2). This suggests that the co-expression relationship between each TF and a possible target gene contributed the most information to the classifier. Removing the eleven redundant features reduced the performance of the RF classifier by a small degree (decrease in AUC-ROC of 0.018 and 0.019 for UDGs and LEGs negative sample sets, respectively). To better understand what categories of features were important for classifier performance, the RF classifier was evaluated on training and testing matrices with different combinations of features. The various feature sets all had a statistically significant effect on the AUC-ROC performance of the classifier (one-way ANOVA, P-value < 0.05) (Fig. 2b). RF classifier performance using seven features related to DAP-seq data exclusively was favourable compared to using only the co-expression data feature (P-value < 0.05, Tukey HSD), while integrating DAP-seq and co-expression learning features improved classifier performance when compared to using either of these feature sets alone (P-value < 0.05, Tukey HSD). Including features relating to DAP-seq peak overlap with open chromatin and/or CNS regions resulted in only a modest improvement in classifier performance with or without expression data already included as features (Fig. 2b). With this result, we proceeded with the set of eleven non-redundant features.
In evaluating the performance of the RF classifier against conventional distance-based target gene assignment, two baseline methods were considered. The RF classifier showed substantially higher precision, recall, and F1 scores than both the proximity-based approach that assigned the nearest gene to a DAP-seq peak, as well as one that assigned genes to DAP-seq peaks if the TSS was within 5 kb of the peak (Table 1; Supplementary Table S7). This was the case for all three types of negative training sets (UDGs, LEGs, and random selection). The RF classifier was between 5.5 and 7.7-fold more precise than distance-based methods and 1.6 to 3-fold more accurate than either baseline, particularly for the RF classifier trained on datasets with UDGs as negative samples which exhibited an F1 score 4.7 to 5-fold higher than the distance-based methods (Table 1). Overall, the RF classifier predicted a much greater proportion of true TF-target associations with superior accuracy.
Table 1
Evaluation of conventional distance-based methods and the random forest classifier trained on different negative samples
| Random forest classifier | Distance-based method |
| UDGsa | LEGsb | RANDOMc | Nearest gene to each DAP-seq peak | DAP-seq peak within 5 kb of gene TSSd |
Precision | 0.751 | 0.689 | 0.581 | 0.105 | 0.098 |
Recall | 0.711 | 0.723 | 0.595 | 0.243 | 0.377 |
F1 Score | 0.730 | 0.705 | 0.588 | 0.147 | 0.156 |
aUDGs: classifier trained on genes with undetected expression for the target tissue as negative samples.
bLEGs: classifier trained on lowly expressed genes (TPM < 5) for the target tissue as negative samples.
cRANDOMs: classifier trained on randomly selected non-differentially expressed genes as negative samples.
dTSS, Transcription Start Site
To assess the classifier’s robustness when applied to independent datasets, the RF classifier was iteratively re-trained and tested on independent training and testing submatrices selected from the 14 Arabidopsis TF datasets from Brooks et al. (2019). Five independent training/testing data splits were considered such that all datasets from the 14 TFs were used while maintaining an approximate 80/20 training/testing split, a balanced number of positive and negative samples in the training set and ensuring that samples from different TF families were in either the training or testing subsets but never both (Supplementary Table S8). One-way ANOVA analysis showed that the way in which the data was split had a negligible but significant effect on the performance of each of the negative sample sets (Supplementary Figure S3, P-value < 0.05). Despite the modest decrease in AUC-ROC metrics, the data indicate that the RF classifier performed robustly on independent DAP-seq datasets, with an AUC-ROC of ~ 0.72 when considering UDGs as the negative sample set. Since the latter was consistently superior compared to other negative sample sets, we retained the RF classifier trained on UDGs for gene target prediction from DAP-seq data.
We next implemented the RF classifier in Eucalyptus to improve gene target assignment of the final peak set of EgrMYB2 by direct transfer of the classifier using a similar set of 11 non-redundant Eucalyptus features. We obtained a total of 9,626 possible target genes using proximity-based peak assignment. Using the ML classifier and applying a probability (Pr) threshold Pr ≥ 0.5, we retained only 2,108 target genes and a mere 906 target genes at P ≥ 0.7. Using Gene Ontology (GO) analysis of the assigned target genes to gain insight into the possible biological roles that EgrMYB2 plays in woody plants such as Eucalyptus, we looked at the fold change of overrepresented GO terms and assessed the enrichment of some of these terms at the three cut-off points in comparison to the baseline method (pre-ML) (no cut-off, Pr ≥ 0.5 and Pr ≥ 0.7) as shown in Supplementary Table S9. Prior to the ML step, there was a significant overrepresentation of GO terms related to phenylpropanoid and lignin biosynthesis, in broad agreement with altered lignin profiles observed in EgrMYB2-overexpressing tobacco (Goicoechea et al. 2005). To examine the effect of the ML algorithm, we carried out GO enrichment analysis for post-ML datasets and observed even stronger enrichment for phenylpropanoid and lignin-related terms, but also a large number of (secondary) cell wall GO terms among the top-ranking terms pre-filtering (Supplementary Table S9), consistent with the ability of EgrMYB2 to increase SCW thickness(Goicoechea et al. 2005) as well as the SCW master regulator status of its homologs AtMYB46/83. This suggests that the ML algorithm is filtering incidental and false positive targets but retaining biologically plausible DAP-seq target gene associations.
Identification of EgrMYB1, EgrMYB2, EgrMYB122, EgrMYB135, and EgrMYB137 target genes
Despite more than a decade of research on the biological roles of EguMYB1 and EguMYB2 from Eucalyptus (Goicoechea et al. 2005; Legay et al. 2007; Legay et al. 2010), most of their target genes remain unknown. We set out to identify genome-wide binding sites for these candidates in E. grandis along with EgrMYB122, EgrMYB135, and EgrMYB137, as well as their target genes. We observed that the AC-rich element known for MYB-family TFs(O’Malley et al. 2016) was prominent in the peaks sets of all five TFs and highly similar to their closest Arabidopsis homologs (Supplementary Table S5). This result suggests that the DNA-binding specificity of these putative orthologs are conserved.
Using our optimised target gene ML classifier trained on Arabidopsis data and transferred to Eucalyptus, we inferred gene targets for all five SCW-related TF candidates. When we analysed the binding profiles of each TF relative to the transcription start site (TSS), we observed that EgrMYB2, EgrMYB135 and EgrMYB137 DAP-seq peak frequencies increased dramatically within 1,000 bp immediately upstream of the TSS, dropping again near the TSS and then increasing again over much of the gene body (Fig. 3a). Interestingly, EgrMYB1 and EgrMYB122 binding profiles, which were similar to each other, showed a preference for binding further upstream of the TSS (-4 kb to -1 kb), with infrequent binding occurring over most of the gene body (Fig. 3a). These differences in cis-regulatory architecture underpin the importance of incorporating such features into the target gene assignment process. We obtained an average of 7,059 target genes per candidate pre-ML using the baseline method of assigning peaks to proximal genes. We then used the confidence probability score from the classifier to filter possible false positive associations. Post-ML (but without a probability cut-off), we were left with an average of 5,894 target genes, further reducing to between 479 and 2661 targets when the cut-off of Pr ≥ 0.5 is implemented (Fig. 3b).
To gain insight into the potential biological role being played by each of these candidates at different ML probability thresholds, we carried out Gene Ontology (GO) enrichment analysis on target genes identified with DAP-seq-ML. Biological processes such as phenylpropanoid metabolism and phenylpropanoid biosynthesis were common and highly enriched across most of the gene target sets even pre-ML, suggesting that these candidates are involved in the regulation of lignin and SCW-related biological process (Supplementary Tables S9-S13). When a ML classifier threshold of Pr ≥ 0.5 was applied, several additional SCW-related GO terms such as cell wall thickening, SCW biogenesis, lignin biosynthesis, and cell wall polysaccharide biosynthesis were enriched for the transcription repressor EgrMYB1, and activators EgrMYB2 and EgrMYB137, consistent with their known links to SCW regulation (Supplementary Table S9, S10 and S11). This suggests that the ML classifier was successful in removing false positive targets and retaining true targets. Although the enrichment for some of these terms increased further at Pr ≥ 0.7, a number of terms were lost in the case of EgrMYB1, EgrMYB2 and EgrMYB137, suggesting that an intermediate cut-off was optimal in retaining biologically relevant gene targets (Supplementary Table S9, S10 and S11).
EgrMYB122 target genes, which were not significantly associated with any biological process pre-ML, were strongly enriched in phenylpropanoid pathway-associated GO terms at Pr ≥ 0.5 and Pr ≥ 0.7 (Supplementary Table S12). EgrMYB122 targeted five of the seventeen bona fide lignin genes (EgrPAL9, EgrC3H3, EgrC4H1, EgrCCoAOMT2 and Egr4CL1) described by Carocha et al. (2015). Additionally, much of the phenylpropanoid pathway is targeted, as shown by the KEGG pathway analysis (Supplementary Figure S4), suggesting that EgrMYB122 is involved in SCW-related biology. In contrast to the rest of the candidates, EgrMYB135 had no overrepresented GO terms at Pr ≥ 0.5 or Pr ≥ 0.7 (Supplementary Table 13). While several GO terms were mildly overrepresented in the unfiltered (pre-ML and post-ML) set of genes including phenylpropanoid pathway-associated terms, response to abiotic and biotic stress-related GO terms ranking highest and were the majority of the GO terms represented (Supplementary Table S13). This suggests that although EgrMYB135 might be involved in SCW-related biology, it might also be involved in stress-related biological processes.
Based on what appears to be a sensible DAP-seq-ML threshold of Pr ≥ 0.5, we constructed a SCW-related subnetwork linking the five candidate TFs to bona fide cellulose, hemicellulose, as described by Myburg et al. (2014) and lignin structural genes as described by Carocha et al. (2015), as well as SCW-related TFs (see Supplementary Methods S2) (Fig. 4). Overall, there were at least 51 TFs, 15 lignin-associated genes, 39 hemicellulose genes, and 5 cellulose-related genes linked to EgrMYB1, EgrMYB2, EgrMYB122, EgrMYB135, and/or EgrMYB137 in the network represented in Fig. 4. In addition to the core SCW-related cellulose synthases, a substantial number of lignin and hemicellulose genes are targets of these five MYB-family TFs. EgrMYB2 and EgrMYB137 seem to co-target a large number of SCW-related TFs, lignin and hemicellulose genes, suggesting a possible co-regulation of SCW-related biological processes.
Putative orthologs of TFs involved in SCW regulation in Arabidopsis were targeted by both EgrMYB2 and EgrMYB137 (Fig. 4), including SCW regulators NST1 (EgrNAC49) and SND2 (EgrNAC170), while SND3 (EgrNAC64) and MYB103 (EgrMYB60) are additionally targeted by both EgrMYB2 and EgrMYB137. Additionally, EgrMYB 2 targets SCW regulator SND1 (EgrNAC61; Laubscher et al. 2018). EgrMYB2, EgrMYB122 and EgrMYB137 target vessel-associated VND1/2 (EgrNAC146). Other TFs were targeted by EgrMYB135 and EgrMYB137, like VND6 (EgrNAC26; (Laubscher et al. 2018)) and REV, EgrMYB2 alone, like AtVND4/5 (EgrNAC50) and EgrMYB122 alone like VND7 (EgrNAC75). EgrMYB2 and EgrMYB137 shared a large number of common SCW-related targets (Fig. 4). A total of 62 bona fide SCW-related genes and TFs were common between EgrMYB2 and EgrMYB137 At least 15 out of 17 lignin genes involved in 10 steps of lignin biosynthesis previously described Carocha et al. (2015) are shown to be targets in this subnetwork of five MYB-family TFs of which 13 out of the 17 lignin genes are common targets of EgrMYB2 and EgrMYB137.
These results reveal for the first time the biologically relevant target genes from genome-wide in vitro binding sites of the EgrMYB TFs, with biological enrichments observed using GO analysis supporting our hypothesis of a SCW-related role for at least four of the five candidate TFs and EgrMYB135 predicted to target several SCW-related TFs and structural genes. Several overlapping gene targets were observed, with EgrMYB2 and EgrMYB137 sharing many more targets than the other TFs.
Functional characterization of EgrMYB137 in Arabidopsis, Populus, and Eucalyptus transgenic lines
In order to investigate the potential role of EgrMYB137 in SCW formation, we analysed transgenic Arabidopsis lines overexpressing either a native form of EgrMYB137 (Pro35S:EgrMYB137) or a dominant-repressive form (Pro35S:EgrMYB137-EAR) fused to an Ethylene-responsive element binding factor-associated Amphiphilic Repression motif (EAR), to transform it into a dominant repressor (Hiratsu et al. 2003). While EgrMYB137 overexpressors display no visible growth phenotype (Fig. 5a), the dominant repressors show reduced growth at the rosette stage and a deficiency in stem rigidity (Fig. 5a). To clarify the link between EgrMYB137 and wood-related traits, Pro35S:EgrMYB137 overexpressing lines were generated in an E. grandis Hairy Roots (HR) system (Plasencia et al. 2016) and Pro35S:EgrMYB137-EAR dominant-repressive lines generated in Populus tremula x P. alba. In poplar, Pro35S:EgrMYB137-EAR lines display a floppy stem as previously observed in Arabidopsis, and presented curled leaves (Fig. 5b). Phloroglucinol-HCl staining on bleached leaves was less intense in vascular tissues in dominant-repressive lines compared to control (Fig. 5c), suggesting a possible reduction of lignin content in leaf vascular tissues of Pro35S:EgrMYB137-EAR poplar lines.
Transverse sections of Arabidopsis, Populus and Eucalyptus HR transgenic lines overexpressing EgrMYB137 and/or EgrMYB137-EAR were analysed and stained with Phloroglucinol-HCl to investigate vascular tissue lignification. While no clear phenotype appeared for Arabidopsis EgrMYB137 over-expressing lines, dominant repressor lines in both Arabidopsis and Populus presented a weaker phloroglucinol staining compared to control lines (Fig. 6a, b), indicating a reduction in lignification. For Arabidopsis, this reduction in lignification was more visible in interfascicular fibres. In Eucalyptus grandis hairy roots, the overexpression of EgrMYB137 increased phloroglucinol staining of the xylem fiber cells compared to control (Fig. 6c). All these results suggest a positive role for EgrMYB137 in xylem cell lignification, confirmed by biochemical analysis of soluble lignin content using the acetyl bromide method (Ployet et al. 2019). In accordance with histological results, dominant repressor lines (Pro35S:EgrMYB137-EAR) in Arabidopsis and Populus exhibited a significantly lower lignin content compared to control lines (12.0 ± 3.1 vs 15.4 ± 0.6 and 15.7 ± 1.6 vs 19.5 ± 1.0 respectively, expressed in % of dry weight; Table 2) while overexpressing lines (Pro35S:EgrMYB137) in Eucalyptus HR had a higher lignin content (18.6 ± 0.4%) relative to control (16.2 ± 0.6%).
Table 2: Lignin content in EgrMYB137 overexpressing and dominant-repressor lines. Lignin content, expressed as percentage of dry weight, was measured through acetyl bromide method (AcBr) as described in Ployet et al., 2019. Measurements were performed on control lines (empty vectors), overexpressing lines (Pro35S:EgrMYB137) and dominant-repressor lines (Pro35S:EgrMYB137-EAR) obtained in 3 genetic backgrounds: Arabidopsis thaliana, Populus tremula x alba and Eucalyptus grandis. Asterisks indicate significant differences with control (Student’s t-test; n= 4 to 9; * p < 0.05, *** p < 0.001). All measurements were done in triplicate.
To gain a broader view of SCW composition in EgrMYB137 overexpression/dominant repression transgenic lines, we performed a non-destructive analysis on SCW global composition through Fourier-Transformed Infra-Red spectroscopy (Fig. 7a-c). PLS-DA performed on spectra normalised values showed a clear separation between Pro35S:EgrMYB137 lines, Pro35S:EgrMYB137-EAR lines and controls (Fig. 7d), suggesting significant changes in SCW composition. A Sparse-PLS-DA analysis allowed the identification of the most discriminant wavenumbers explaining the separation between controls and transgenic lines. These wave numbers pointed to 13 regions of the spectra that were previously related to SCW components (Fig. 7e). Most of these wavenumbers (12 out of 13) were associated with lignin structure and composition (Supplementary Table S13). Altogether, this evidence clearly demonstrate the positive role of EgrMYB137 in lignification.
To investigate EgrMYB137 role in SCW formation at the transcriptional level, targeted gene expression analyses were performed on Pro35S:EgrMYB137-EAR poplar lines and control lines by RT-qPCR. 53 genes involved in either SCW regulation or in SCW biosynthesis were profiled, the majority of which are close homologs of EgrMYB137 predicted direct targets. As a general trend, most of these genes are repressed in the three independent dominant repressive lines compared to control (Fig. 8). Among SCW regulators, PtVNS10, involved in fibre differentiation, is significantly repressed as well as three MYB TFs (PtrMYB216, PtrMYB125, PtrMYB221). Three out of seventeen tested polysaccharides biosynthesis genes were significantly down-regulated in Pro35S:EgrMYB137-EAR poplar lines, among them one cellulose synthase (PtrCesA8B) and two genes involved in hemicellulose biosynthesis (PtrGT47c and PtrGXM3). Regarding lignin biosynthesis, nine genes out of nineteen were significantly repressed in at least one of the dominant repressive lines, all but one of which are homologs of E. grandis genes predicted to be direct targets of EgrMYB137 (Fig. 4).