Structural and biochemical insight into a modular β-1,4-galactan synthase in plants

Rhamnogalacturonan I (RGI) is a structurally complex pectic polysaccharide with a backbone of alternating rhamnose and galacturonic acid residues substituted with arabinan and galactan side chains. Galactan synthase 1 (GalS1) transfers galactose and arabinose to either extend or cap the β-1,4-galactan side chains of RGI, respectively. Here we report the structure of GalS1 from Populus trichocarpa, showing a modular protein consisting of an N-terminal domain that represents the founding member of a new family of carbohydrate-binding module, CBM95, and a C-terminal glycosyltransferase family 92 (GT92) catalytic domain that adopts a GT-A fold. GalS1 exists as a dimer in vitro, with stem domains interacting across the chains in a ‘handshake’ orientation that is essential for maintaining stability and activity. In addition to understanding the enzymatic mechanism of GalS1, we gained insight into the donor and acceptor substrate binding sites using deep evolutionary analysis, molecular simulations and biochemical studies. Combining all the results, a mechanism for GalS1 catalysis and a new model for pectic galactan side-chain addition are proposed. The authors present a multidisciplinary approach investigating the mechanistic underpinnings of galactan synthase 1 and use their data to propose a new model for complex pectin biosynthesis.

Plants are the pre-eminent builders of complex carbohydrates, essential molecules of life that store and supply energy to nearly all organisms in the biosphere. The plant cell wall is a complex extracellular matrix composed of cellulose, hemicellulose, pectin, proteins and polyphenolic molecules. Plants are estimated to devote at least 10% of their genomes to constructing their cell walls 1 . However, compared with other natural polymers, such as DNA, RNA and proteins, far less is known about the synthesis and essential biology of the carbohydrates that constitute plant cell walls. An important reason why progress has been more challenging is that complex carbohydrate structures are not defined by sequence-based templates but are synthesized through the concerted actions of a diversity of carbohydrate-active enzymes, notably glycosyltransferases (GTs), whose functions and mechanisms of action are only slowly being revealed [2][3][4] . Galactan synthases (GalSs) are enzymes categorized as inverting GTs from family 92 (GT92; Pfam, PF01697) in the CAZy (http://www.cazy.org/) database [5][6][7] . GalS enzymes are involved in the synthesis of β-1,4-linked galactan side chains of pectic rhamnogalacturonan I (RGI) (Fig. 1).
RGI is a complex pectic polysaccharide found within the primary cell walls of vascular plants 8 . RGI consists of a backbone composed of the repeating disaccharide -2)-α-l-Rhap-(1-4)-α-d-GalpA-(1- (Fig. 1a). The complexity of this polysaccharide is further increased by substitution with lesser amounts of other monosaccharides and non-glycosyl substituents to the backbone Rhap and GalpA, respectively 8,9 . One such modification is RGI galactan side chains, which are extended by GalS enzymes. The GalS1 enzyme from Arabidopsis thaliana (AtGalS1) is a bifunctional enzyme that elongates β-1,4-galactan side chains of RGI by adding galactose (Gal) or terminates their extension by adding arabinopyranose (Arap). This study is focused on GalS1 from Populus trichocarpa (PtGalS1; Potri.005G258900; Extended Data Fig. 1). Analysis of the P. trichocarpa GalS1 activity confirms that it is also a bifunctional enzyme and is able to catalyse the synthesis of long β-1,4-galactan chains using galactotetraose as an acceptor as well as the termination of extension by addition of two Arap residues (Fig. 1c-e). We report the crystal structure of GalS1, which represents the first structure of a CAZy GT92 family member. This adds to the only two other structures that have been solved for enzymes involved in plant cell wall biosynthesis, the others being xyloglucan xylosyltransferase 1 (XXT1) 21 and xyloglucan fucosyltransferase 1 (FUT1) 22,23 . The general architecture of GalS1 comprises a C-terminal domain containing a glycosyltransferase A (GT-A) fold and an N-terminal domain that functions as an ancillary carbohydrate-binding module (CBM) that binds specifically to the backbone of RGI. This CBM is conserved across plant GT92 protein sequences present in Phytozome v12 and represents the founding member of a new CAZy family, CBM95.
The presence of a CBM in a glycosyltransferase such as GalS1 is unexpected, as these modules are much more commonly found within modular enzymes involved in carbohydrate deconstruction, such as hydrolases or lyases. Its presence became more intriguing as we performed biomolecular interaction studies and showed that the CBM95 module binds to the backbone of pectic RGI, while the GT92 catalytic domain interacts with β-1,4-galacto-oligosaccharides. Small-angle X-ray scattering (SAXS) experiments demonstrated that GalS1 exists as a dimer in solution. Collectively, this study provides insights into the function of both domains of the GT92 enzymes and suggests a new model for RGI synthesis in which the CBM95 is essential for enzymatic activity and stability and facilitates the ability of GalS1 to target and extend complex acceptor substrates, such as the repeating disaccharide backbone of RGI. Understanding the architecture and detailed mechanism of GalS1 will enable the utilization of galactan as a source for chemoenzymatic synthesis of tailored polysaccharides 24 for novel applications and the optimization of feedstocks for biomass valorization to chemicals and fuels via the alteration of the hexose to pentose ratio 24,25 .

Expression, purification and crystal structure of GalS1
GalS1 is classified in the CAZy database as a member of the GT92 family and, until now, no structural information for this family has been available. Additionally, GT92 does not share notable amino acid sequence similarities with other glycosyltransferase families. To investigate the structure of GalS1, it was expressed as a 'superfolder' green fluorescent protein (sfGFP) fusion protein (Extended Data Fig. 2) in HEK293S GnT1- cells as a soluble secreted fusion protein (122 mg l−1 estimated by using GFP fluorescence) and purified 26 using affinity and size exclusion chromatography before crystallization (Extended Data Fig. 3a-c). A truncated form of PtGalS1 was generated as a fusion protein containing an NH2-terminal signal sequence, an 8xHis tag, an AviTag, sfGFP, the tobacco etch virus (TEV) protease recognition site and amino acid residues 73-495 of PtGalS1 (Extended Data Fig. 2). The fusion tag and N-linked oligosaccharides were removed by TEV protease and endoglycosidase F1 treatment before crystallization to reduce structural heterogeneity. We solved two structures: apo-form GalS1 diffracting to 2.37 Å resolution and Mn2+-bound GalS1 diffracting to 2.56 Å resolution (Extended Data Fig. 4). Both structures lack residues 73-96 (residues 97-495 were observed) in the electron-density maps due to the highly flexible nature of the stem region. The crystal lattice contained 4 and 2 copies of GalS1 in the asymmetric unit in the apo state and the Mn2+-bound state, respectively. In addition to the polypeptide chain, the GalS1 structure showed 7 glycosylation sites. Each GalS1 monomer contained a stem region (residues 97-107) and two globular domains: a CBM95 (residues 108-221) connected by a linker region (residues 222-228) to a GT-A fold glycosyltransferase domain (residues 229-495) (Fig. 2a and Extended Data Fig. 2).
The core GT-A domain consists of 7 core β-strands (β3, β2, β1, β4, β5, β6, β7), with β5 and β7 in an antiparallel orientation, surrounded by a helix that includes the donor and acceptor binding sites. The glycosyltransferase core of GalS1 displayed some distant similarity (root-mean-square deviation (RMSD) ≥ 4.2 over ≥138 residues) to insect and mammalian β1,4-galactosyltransferases (β4GalTs) that transfer Gal from UDP-Gal to xylose or N-acetylglucosamine (GlcNAc), respectively, from CAZy family GT7 (Extended Data Fig. 4). However, the GalS1 structure is dissimilar to known β4GalTs; these differences possibly account for the ability of GalS1 to both extend and cap β-1,4-galactan side chains (Extended Data Fig. 5). We also tried to obtain UDP-bound, UDP-Gal donor-bound and acceptor-bound structures but were unsuccessful.

The oligomeric state of GalS1
Size exclusion chromatography and multi-angle light scattering (SEC-MALS) analysis of GalS1 indicated that it exists as a dimer in vitro (the calculated molecular weight of GalS1 is 103.9 kDa and the theoretical molecular weight is 97.6 kDa). The asymmetric unit in the apo state showed 4 molecules of GalS1 and 2 possibilities of dimer formation. To identify the correct monomer-monomer interactions (Fig. 2b), we examined the solution state of GalS1 by performing experiments where SAXS is coupled to SEC-MALS and quasi-elastic light scattering detection. SEC-SAXS-MALS experiments provide accurate measurement of molecular weight and provide information on particle shape. We observed a single peak eluting from the gel filtration column that corresponded to the GalS1 homodimer, as judged by molecular weight (MW) determined from SAXS and MALS (MW SAXS = 110 kDa, MW MALS = 119 kDa). Two conformers were built on the basis of two possible interfaces visualized in the crystal structure to determine dimer arrangement in solution: parallel (A:B) or antiparallel (A:C) (Fig. 2c). The antiparallel arrangement with the N-terminal stem region interacting across the chains matched the SAXS curve well, whereas the alternative dimer showed a poor match (χ² = 7.8 for the dimer and χ² = 254.3 for the alternative dimer; Fig. 2c). The residual discrepancy between the atomistic model and the SAXS data was due to the absence of glycans in our model and the flexibility of the disordered N-terminal region (Fig. 2c). Additionally, we reconstructed the SAXS envelope to further confirm the overall arrangement of the GalS1 homodimer in an antiparallel (A:C) orientation (Fig. 2c).
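The comparison of the two candidate dimer models against the SAXS curve can be illustrated with a minimal Python sketch. It assumes the commonly used reduced chi-square statistic with a single fitted scale factor; the text does not state which fitting program or exact formula was used, and the function names and the toy intensities below are hypothetical, not data from this study.

```python
def fit_scale(i_exp, i_model, sigma):
    """Least-squares scale factor c minimising sum(((I_exp - c*I_model)/sigma)**2)."""
    num = sum(e * m / s**2 for e, m, s in zip(i_exp, i_model, sigma))
    den = sum(m * m / s**2 for m, s in zip(i_model, sigma))
    return num / den


def reduced_chi2(i_exp, i_model, sigma):
    """Reduced chi-square between an experimental and a model scattering curve."""
    c = fit_scale(i_exp, i_model, sigma)
    total = sum(((e - c * m) / s) ** 2 for e, m, s in zip(i_exp, i_model, sigma))
    return total / (len(i_exp) - 1)


# Hypothetical toy curves (arbitrary units), for illustration only.
i_exp = [100.0, 80.0, 50.0, 20.0, 5.0]
sigma = [2.0, 2.0, 1.5, 1.0, 0.5]
model_a = [50.0, 40.0, 25.0, 10.0, 2.5]   # same shape as i_exp, different scale
model_b = [50.0, 30.0, 30.0, 5.0, 5.0]    # different shape

chi2_a = reduced_chi2(i_exp, model_a, sigma)
chi2_b = reduced_chi2(i_exp, model_b, sigma)
print(chi2_a, chi2_b)
```

A model whose curve has the same shape as the data gives a statistic near zero after scaling, whereas a model with the wrong shape gives a much larger value, which is the logic behind preferring one dimer arrangement over the other.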

The importance of the stem region of GalS1
We identified several potential interactions between two GalS1 monomers forming the homodimer (Extended Data Fig. 6a,b), including interactions between the N-terminal stem region of one monomer with the other. Comparative analysis of stem regions of GT92 proteins across different plant species indicated that conservation beyond residue Asp96 increases, indicating a conserved role in stability, activity or both (Extended Data Fig. 6f). To further investigate the role of the stem domain in dimer formation and activity, we generated a ΔSTEM construct lacking the stem region (Extended Data Fig. 2) and evaluated the activity of the truncated variant using galactotetraose as an acceptor. Despite the presence of the GT92 catalytic domain, the GalS1-ΔSTEM variant was inactive (Extended Data Fig. 6c,d). Comparison of thermal melting temperatures showed a decrease from 59.1 °C for wild type (WT) to 57.4 °C in the ΔSTEM variant and indicated that the protein was correctly folded but slightly less stable (Supplementary Table 1). SEC-MALS analysis of the ΔSTEM variant suggested that a portion of the protein was present as a higher molecular weight aggregate (nearly 8% of the total), in addition to the expected dimer (calculated MW is 84.5, theoretical MW of the dimer is 90.1 kDa) in solution (Extended Data Fig. 6e). These data suggested that the stem region plays an essential role in the structural and functional stability of GalS1, as its presence prevents higher-order aggregation of GalS1 in vitro but is not entirely responsible for dimerization. Construction and analysis of additional truncation variants may shed light on their role in dimerization.

GalS1 contains a family 95 carbohydrate-binding module
Inspection of the GalS1 structure revealed an additional domain at the N terminus of the protein (amino acids 108-221) that adopted a β-sandwich fold reminiscent of the CBM60 present in a xylanase from Cellvibrio japonicus (2XFD 27 ; RMSD for Cα of 4.4 over 64 residues; Fig. 3a) and a CBM61 from an endo-β-1,4-galactanase from Thermotoga maritima (2XOM 28 ; RMSD for Cα of 7.6 over 88 residues; Fig. 3b). To explore the function of this putative domain, we generated a GalS1-CBM95 construct (Extended Data Fig. 2) and evaluated its ability to bind various cell wall oligo- and polysaccharides using microscale thermophoresis (MST). We showed that GalS1-CBM95 specifically binds unbranched pectic RGI 29,30 isolated from non-adherent Arabidopsis thaliana mucilage. In contrast, GalS1-CBM95 did not interact with galactotetraose, polygalacturonic acid or xylohexaose, as judged by a signal-to-noise ratio below the cut-off of 5 that is minimally required to confirm binding. Most polysaccharide substrate binding happens through stacking interactions with aromatic residues on the CBM surface. Therefore, we mutated various exposed tyrosine and tryptophan residues on the surface of the CBM95 domain (Fig. 3c). Additionally, basic residues such as lysine have previously been shown to act as functional residues in pectin-binding CBMs, such as CBM77 (ref. 31 ), and inspection of the GalS1 structure revealed that several were present on the surface-exposed region of CBM95; these were also mutated (Fig. 3c). Recombinant CBM95 and the aforementioned mutant variants were expressed in HEK293 cells and purified using nickel-nitrilotriacetic acid. We studied the effects of mutating these residues on RGI binding. MST analysis of mutant variants using RGI as a substrate showed that K133A, W142A, Y199A, K206A and K209A displayed increases in the KD of 3- to 6-fold, whereas K144A, W166A and Y207A variants showed increases in the KD of 10- to 13-fold (Fig. 3d), indicating that the latter play a more prominent role in RGI interaction. The CBM95 of GalS1 does not share any sequence homology with any other CBMs in the CAZy database 7 (confirmed by personal communication with Dr Nicholas Terrapon, head of the CAZy database) and will be assigned as a new CBM family (see http://www.cazy.org/CBM95.html for an actively updated list of sequences and source organisms).
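To make the fold changes in dissociation constant concrete, the following Python sketch shows how a 1:1 binding isotherm and a fold change relative to wild type might be computed. It assumes the standard quadratic solution for single-site binding that is often used to fit MST titrations; the text does not specify the fitting model, and the function names and all numerical values are hypothetical placeholders rather than measurements from this study.

```python
import math


def fraction_bound(ligand, target, kd):
    """Fraction of target bound for 1:1 binding (all concentrations in the same units)."""
    total = ligand + target + kd
    return (total - math.sqrt(total * total - 4.0 * ligand * target)) / (2.0 * target)


def fold_change(kd_variant, kd_wt):
    """Fold increase in dissociation constant of a variant relative to wild type."""
    return kd_variant / kd_wt


# Hypothetical values in arbitrary concentration units, for illustration only.
kd_wt = 10.0
kd_variant = 30.0
half_point = fraction_bound(ligand=kd_wt, target=0.001, kd=kd_wt)
saturated = fraction_bound(ligand=1e6, target=1.0, kd=kd_wt)
print(half_point, saturated, fold_change(kd_variant, kd_wt))
```

When the target concentration is far below the dissociation constant, half of the target is bound at a ligand concentration equal to the dissociation constant, so a variant with a larger value needs proportionally more ligand to reach the same occupancy.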
We have established that the stem region is essential for GalS activity. To investigate whether the GT-A core domain is still catalytically active in the absence of the CBM, we generated a GalS1-ΔCBM variant (Extended Data Fig. 2) that lacks the entire CBM domain and the stem region. The GalS1-ΔCBM variant was successfully expressed as a soluble secreted fusion protein (75 mg l−1); however, it lacked detectable GalS activity (Fig. 3d), suggesting that both the CBM and the stem domain play a crucial role in enzyme stability/folding and catalysis.

Evolutionarily constrained residues in the GT92 family
Recently, a minimal structural unit for GT-A fold enzymes has been defined on the basis of deep mining of large sequence datasets, revealing 20 residues shared throughout the common glycosyltransferase core 32 . Unfortunately, GT92 family proteins were not included in the study due to a lack of structural information at the time of publication. To identify core conserved residue positions within the GT92 family, we generated an alignment of representative GT92 sequences with other GT-A fold sequences using a profile-based approach and the GalS1 structure as a template. For this, we aligned the GalS1 structure with other GT-A fold structures and used this structural alignment as a basis to then align a GT92 consensus to the GT-A profile alignment generated in a previous study 31 . Incorporating GT92 sequences into this alignment provided a comparative basis for mapping GT-A shared features and residues uniquely conserved in the GT92 family ( Fig. 4a and Extended Data Fig. 7a). Initially, on the basis of the profile alignment, several GT-A fold conserved motifs were mapped: the DXD motif that is involved in coordinating the metal ion and the donor sugar in metal-dependent GT-A fold enzymes (D331 and D333); the G-loop involved in donor binding (R397-K400) and the conserved xED motif harbouring the catalytic base (G412-H414) with H414 as the putative catalytic base (see below). On the basis of the alignments, H435 is predicted to function as the metal coordinating histidine at the C-terminal tail (C-His). In addition, the hydrophobic core residues that define GT-A fold enzymes are also present in GT92 (ref. 32 ). These include Y233, L234, Y235, M249, M253, F266, V267, F268, F328 and I403 (Extended Data Fig. 7b). 
Moreover, to identify GT92-specific residue positions, we performed a query-centric Bayesian partitioning with pattern selection 33 analysis on a set of 24,816 sequences that includes diverse GT-A fold sequence sets 32 and representative GT92 sequences, using the GT92 consensus sequence as the query. This resulted in a foreground cluster of 153 GT92 sequences defined by multiple residue positions uniquely conserved within these sequences, suggesting family-specific functions. These residues are highlighted in Fig. 4a along with the GT-A shared motifs. The most distinct GT92-specific feature was H414, which is invariant at this position across all GT92 sequences and is distinct from other GT-A fold enzymes, which largely conserve an Asp or a Glu that acts as a catalytic base. We also identified K400 as one of the most uniquely conserved features of GT92. This residue is part of the G-loop, which uniquely conserves a number of charged residues in contrast to smaller amino acids with shorter side chains such as Gly, Ala or Ser in other GT-A fold enzymes. Other GT92-specific features include cysteines (C236 and C316) that form a disulfide bond and other charged residues (D315, E334) within the GT-A domain (Extended Data Fig. 7a).
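The idea of a family-specific position, one that is invariant within the family but carries different residues in the wider GT-A background, can be illustrated with a much simpler stand-in than the Bayesian partitioning analysis described above. The Python sketch below is not that method: it only flags alignment columns that are invariant in a foreground set and whose residue never occurs at that column in the background set. The toy sequences and the function name are hypothetical.

```python
from collections import Counter


def family_specific_positions(foreground, background):
    """Return (column index, residue) for columns invariant in the foreground
    whose residue is absent from the same column of the background."""
    positions = []
    for i in range(len(foreground[0])):
        fg_residues = {seq[i] for seq in foreground}
        if len(fg_residues) != 1:
            continue  # not invariant within the family
        residue = next(iter(fg_residues))
        bg_counts = Counter(seq[i] for seq in background)
        if bg_counts[residue] == 0:
            positions.append((i, residue))
    return positions


# Hypothetical pre-aligned three-residue motifs, for illustration only.
foreground = ["GEH", "GEH", "AEH"]   # family of interest, His at the last column
background = ["GED", "GEE", "GED"]   # other families, Asp or Glu at that column

specific = family_specific_positions(foreground, background)
print(specific)
```

In this toy case only the last column is reported, mirroring qualitatively the pattern of an invariant histidine where other families carry an acidic residue. A real analysis would weigh conservation statistically across thousands of sequences instead of requiring strict invariance.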

Biochemical characterization of active site residues
The glycosyltransferase sequence comparisons combined with docking and molecular dynamics simulations analysis provided us with insight into the putative residues involved in substrate binding and catalysis in the GT92 domain of the enzyme. To understand the specific roles of these residues further, we performed multiple independent mutational analyses to study the effects of these non-conservative mutations on the activity of the enzyme. In the absence of a suitable acceptor substrate, most Leloir GTs can hydrolyze suitable nucleotide sugar donors. Key donor binding residues of the core GT-A domain were identified on the basis of a statistical analysis of evolutionary constraints acting on primary sequences and docking simulations. Since a structure of the Drosophila β4GalT7 D211N complex with manganese, UDP-Gal and xylobiose is available in the database (PDB ID 4M4K; Extended Data Fig. 5) showing key donor and acceptor binding residues 35 , we aligned apo PtGalS1 to Dmβ4GalT7 to pinpoint the key residues of GalS1 involved in the donor and the acceptor binding (Fig. 4b,c). We mutated to alanine several of these residues hypothesized to be involved in nucleotide sugar donor binding. GFP-fused mutated variants G242A, D331A, D333A, E334A, Q309A, K400A, H414A, H435A and H437A were expressed and purified in HEK293S WT cells. First, we quantified glycosyltransferase activity by UDP-Glo assay, using UDP-Gal or UDP-Arap as a donor and galactotetraose as an acceptor.
The results indicate that mutating residues of the DxD motif (D331A and D333A), as well as E334A, Q309A, K400A and H414A, reduces GalT and arabinopyranosyltransferase (ArapT) activity to a minimum compared with the wild-type enzyme, showing that these residues are essential for catalysis/binding (Fig. 4d). In contrast, mutant variants G242A, H435A and H437A displayed only slightly reduced GalT and ArapT activity, indicating that these residues are not directly involved in binding or catalysis (Fig. 4d). A key GT92-specific feature is a conserved disulfide bridge between C236 and C316. We generated a C236S single mutant and a C236S:C316S double mutant to evaluate the role of this disulfide bond in GalS1. Both mutations resulted in negligible protein expression (Supplementary Table 2), making purification difficult, and the resulting enzymes showed no detectable GalT or ArapT activity (Fig. 4d). We postulate that this disulfide formation is essential for protein folding or stability post expression. Comparison of chain A of GalS1 with chain A of β4GalT7 (bound with xylobiose; PDB ID: 4m4k) led to the identification of three residues, W166, R396 and D398, as potential acceptor binding residues. Mutating W166A and R396A did not substantially perturb GalT or ArapT activity; however, the D398A variant showed loss of both GalT and ArapT activity in the presence of the acceptor (Fig. 4d,e). The hydrolytic activity (or the ability of the enzyme to transfer the sugar to water in the absence of acceptor substrate) of most variants was comparable to that of WT GalS1 except in W166A, where hydrolytic activity was increased by nearly 2.5-fold. To better understand the contribution of all the residues studied above in binding the nucleotide sugar donor, we estimated the equilibrium dissociation constant (K D ) of UDP-Gal in WT and mutant variants using MST. The K D of UDP-Gal in D331A, K400A and H435A was in a similar range to that of WT GalS1. 
E334A, Q309A and H414A hampered the binding of UDP-Gal, whereas all other mutants showed improvement in the binding of UDP-Gal (Table 1). Taken together, in vitro GalT or ArapT assays and insights into enzyme-donor substrate interactions support that E334, Q309 and H414 directly interact with UDP-Gal/ UDP-Arap in GalS1 and are required for glycosyltransferase activity. The reduced activity of H435A and D333A (part of the DXD motif) is probably due to inability to interact with Mn 2+ . The utilization rate of UDP-Ara by D333A is 15-fold less than that by WT, but minimal activity may be retained due to altered binding between the two donors, as Arap and galactopyranose are structurally similar but arabinose lacks the C6 carbon. As hypothesized, H414 acts as a catalytic base since no other residues closer to the donor seem possible and mutating this residue to alanine leads to a complete loss of activity. G242 and H437 are part of variable loops allowing flexibility at the active site, explaining a partial decrease in the catalytic activity due to mutation. W166, R396 and D398 are involved in acceptor binding (Fig. 4c). We propose that this occurs through stacking interactions between the acceptor and the aromatic W166, and hydrogen bonding interactions with the planar polar side chains of R396 and D398. This is evident from these mutations which improved the K D of UDP-Gal and yet led to a drop in glycosyltransferase activity, particularly apparent in the D398A mutant variant. W166 is even more fascinating as it protrudes into the active site from the CBM95 domain, and the W166A variant has a 2.5-fold increase in the rate of NDP-sugar hydrolysis. Donor affinity was improved by 2-fold in the W166A variant but its GalT activity decreased, suggesting that W166 is directly involved in acceptor substrate-binding. 
This is further shown by a decrease in the dissociation constant (K D ) of the W166A mutant (nearly 7-fold), confirming its role in acceptor binding (Extended Data Fig. 8b). WT and its mutant variants were tested for GalT activity by polysaccharide analysis using carbohydrate gel electrophoresis (PACE). The result broadly agrees with the Glo-based assays, re-confirming the role of the selected residues in GalT activity (Extended Data Fig. 9).

Docking and molecular dynamics reveal putative substrate-bound complexes
The pursuit of crystallizing a ligand-bound structure was unsuccessful; however, we obtained an Mn 2+ ion-bound GalS1 structure that pointed to the binding pocket at the active site. We used this structure as a starting point for docking and molecular dynamics simulations studies to identify substrate binding modes and evaluate enzyme-substrate-bound structures. The Mn 2+ -bound GalS1 structure was equilibrated under fully solvated conditions since it was crystallized without the presence of a substrate. Molecular dynamics simulations of the monomeric Mn 2+ bound GalS1 were performed to explore the flexibility of the various structural domains of the GT-A fold (Fig. 5a) and provide an equilibrated receptor structure for docking the donor substrate. Blind docking studies of the donor substrate revealed that most bound poses were concentrated around the Mn 2+ -binding site. Targeted binding studies suggested that the Mn 2+ -binding site could accommodate the donor molecule with favourable binding energies, and showed configurational and geometric feasibility for hydrolysis on the basis of its proximity to the putative catalytic base H414. Molecular dynamics simulations of the UDP-Gal-Mn 2+ -GalS1 complex revealed that the substrate remains bound throughout the 100 ns simulations (Fig. 5b), illustrating a putative binding pose for the donor molecule at the active site. Distances between the donor sugar C1 and the putative catalytic base N during the molecular dynamics simulations were observed to be consistent with the hydrolysis of the sugar molecule even in the absence of the acceptor molecule and corroborated the experimental observation. Molecular dynamics simulations of the donor-bound state also provided equilibrated structures of the binary complex for initiating docking studies of the acceptor-bound ternary complex (Fig. 5c). 
Docking simulations with galactotetraose suggest that acceptor substrate binding is probably coordinated via key aromatic residues on the CBM95 domain of GalS1.
Furthermore, various docked conformations were observed to satisfy two critical requirements for the GalS1 reaction mechanism: (1) the orientation of the non-reducing end of the substrate into the active site and (2) proximity to the putative base. Figure 5d illustrates a putative binding pose for the ternary complex. Although this pose in itself does not represent a catalytically competent configuration, the ability of the active site to stabilize the ternary complex over tens of nanoseconds in the molecular dynamics simulations indicates that the complex may reach configurations conducive to catalysis according to the proposed reaction mechanism (Fig. 6a).
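The distance criterion used to judge whether a simulated pose places the donor sugar C1 near the putative catalytic base can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coordinates, the cutoff value and the function name are hypothetical, and the text does not report the threshold or the analysis software that was actually used.

```python
import math


def fraction_within(frames, cutoff):
    """Fraction of frames in which two atoms lie within `cutoff` (same units as coordinates).

    `frames` is a list of (atom_a_xyz, atom_b_xyz) coordinate pairs, one per trajectory frame.
    """
    close = sum(1 for a, b in frames if math.dist(a, b) <= cutoff)
    return close / len(frames)


# Hypothetical coordinates in angstroms for (donor C1, base N), for illustration only.
frames = [
    ((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)),   # 5 angstroms apart
    ((0.0, 0.0, 0.0), (0.0, 0.0, 4.0)),   # 4 angstroms apart
    ((0.0, 0.0, 0.0), (6.0, 8.0, 0.0)),   # 10 angstroms apart
]

occupancy = fraction_within(frames, cutoff=5.0)  # hypothetical cutoff
print(occupancy)
```

Tracking such a fraction over a trajectory gives a simple summary of how often a geometry compatible with the proposed chemistry is sampled, although it says nothing about reaction energetics.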

Discussion
The presence of a carbohydrate-rich cell wall is a ubiquitous feature of all plants. While we are beginning to understand the composition and diversity of the polysaccharide components in these walls, little is known about the molecular players involved in their synthesis. Recent studies on galactan interactions with cellulose in tension wood and its possible implications in stress-bearing and imparting flexibility and support to plant tissues highlight its complex role in the plant cell wall. Our ability to develop more refined synthetic biology approaches to design plant cell walls with enhanced properties for valorization of the fixed carbon locked within them requires detailed understanding of their biosynthetic processes at the molecular level. Re-engineering a biocatalyst such as GalS1 requires understanding of its active site, catalytic mechanism and interactions with other functional proteins or protein domains. Our structure of a plant β-1,4-galactan synthase revealed that GalS1 is a modular protein with an ancillary carbohydrate-binding module (CBM95) at its N terminus that binds specifically to the backbone of RGI. Further, we showed that the stem region plays a structural role in homodimer formation, interacting across GalS1 monomers in a 'handshake' pose, and is essential for both glycosyltransferase activity and protein stability. GalS1 adopts the glycosyltransferase A (GT-A) fold within the broader classification of GTs. Nucleotide-sugar binding site residues within the GT-A fold are highly conserved among family members 7,32 . Although we were unsuccessful in determining a donor- or acceptor-bound X-ray structure, we were able to identify crucial residues using sequence conservation information by mining over half a million GT-A fold sequences and comparing them to those in the GT92 family.
The majority of inverting GT-A enzymes that catalyse the formation of glycosidic linkages between a donor and an acceptor substrate through a single-step inverting catalytic mechanism utilize Asp or Glu within a conserved protein-associated xED motif as the catalytic base 36 . In contrast, our bioinformatics and docking data led to the hypothesis, confirmed by biochemical analyses, that H414 functions as the catalytic base in GalS1. We also determined that E334 is essential for UDP-Gal binding, extending the DxD motif to a DxDE motif in GT92. Docking with molecular dynamics simulations provides a powerful alternative to study active site residues and their contribution to binding or catalysis, and to evaluate the flexibility of loops to accommodate substrates at the active site [37][38][39][40] . Our molecular simulations support the proposed binding site and elucidate the critical active site residues in stabilizing the donor substrate (Fig. 5b). They also provide insight into acceptor binding at the active site, which is shown to be mediated via a series of hydrophobic interactions (with Y241, Y307, V413 and F310 in Fig. 5d) that are common among many GTs [41][42][43] .
A key finding was that GalS1 is a modular enzyme containing a CBM, a domain more commonly found in enzymes involved in carbohydrate deconstruction (GH) 44 , and has been less frequently associated with GTs so far. However, there are some notable examples of modular GTs that have functional CBMs. O-glycosylation in animals is initiated by UDP-GalNAc:polypeptide α-N-acetylgalactosaminyltransferases (GALNTs, mammals; PGANTs, Drosophila) from GT27 that contain a GlcNAc-binding CBM13 domain from the β-trefoil family. Mannoside N-acetylglucosaminyltransferase from GT54, involved in the synthesis of complex N-glycans in animals, contains a GlcNAc-binding CBM94 domain 45 . Finally, plant enzymes involved in starch synthesis contain CBM48 and CBM53 modules 46,47 . This number is likely to grow quickly as protein structure prediction tools facilitate identification of GTs that may contain additional modules. The glycosyltransferase domain of GalS1 functions to extend galactan side chains of RGI. In contrast, we demonstrated that the CBM95 binds to the RGI backbone-data that resulted in its classification as the founding member of a new family in the CAZy database. Identification and characterization of this new module led us to propose a new model in which the CBM95 functions to bring the GalS1 enzyme in proximity to the RGI backbone to enable chain elongation (Fig. 6b), potentially functioning to target regions of the polymer that are sparsely substituted. In Arabidopsis GalS1/GalS2/GalS3 triple mutants, the RGI backbone still has Gal substitutions even though elongated galactan chains are absent 6 . Taken together, all available genetic and biochemical evidence supports that GalS1 catalyses galactan chain extension but is not involved in attaching the initial Gal residue(s) to the RGI backbone. Thus, the GalT that adds the initial Gal residues to the RGI backbone is still unknown. 
It remains to be shown whether this is a common principle of the synthesis of complex polysaccharides such as RGI polysaccharide or a unique feature of galactan side-chain elongation.
The stem region of a glycosyltransferase is generally defined as the stretch of amino acids after the transmembrane domain that can be truncated without changing the enzyme's activity. Biologically, stem regions are proposed to function as flexible tethers that position the catalytic domain away from the membrane 48 . Several studies have also investigated the various roles of the stem regions of GTs in flexibility, orientation to the substrate 49,50 , interaction with other proteins or with itself, chaperone activity 51 , localization and stability 52 . The stem region and the catalytic domain are not clearly differentiated in GTs; however, for this study, we have defined the stem region as the residues that are not part of the globular domains and showed that it plays a critical role in dimer structure. Furthermore, thermo-stability and biochemical analyses of stem deletion mutants showed that this region is essential for stability and biochemical activity of GalS1.
GalS1 is a metal-dependent inverting glycosyltransferase (Fig. 1b). Combining structural information with computationally guided mutational analyses and molecular dynamics simulations, our data suggest that GalS1 utilizes an SN2 single-displacement reaction mechanism (Fig. 6a), similar to GalTs from CAZy family GT7 (refs. 35,53,54 ). In other GalTs 53 , it has been demonstrated that acceptor binding probably involves structural rearrangements of residues after donor binding to create an active site conducive for the reaction. His414 of GalS1 aligns (Fig. 4b,c) with the catalytic base in Drosophila β-1,4-galactosyltransferase 7 (PDB ID: 4M4K (ref. 35 )); moreover, the H414A mutant has negligible activity. These data suggest that His414 acts as the catalytic base to deprotonate the Gal C4 nucleophilic hydroxyl group of the acceptor, and that the carboxylic group in the DxDE motif stabilizes the divalent cation required to complete the GalT reaction (Fig. 6a). Investigation of the Mn 2+ -bound GalS1 structure shows that the conserved His435 is directly involved in coordination of Mn 2+ . These structural data, combined with alignments with GalT structures from other glycosyltransferase families, further suggest that His435 plays a direct role in coordination of Mn 2+ with oxygen of the β-phosphate in the nucleotide sugar donor (Fig. 4b).
Unlike GT7 GalTs that catalyse the addition of a single sugar, GalS1 catalyses the extension of β-1,4-galactan side chains composed of hundreds of monosaccharides in vivo. To efficiently process several hundreds of reactions in the Golgi compartment, we propose that GalS1 utilizes the N-terminal CBM to both target and anchor itself to the RGI backbone (Fig. 6b) through a combination of hydrophobic and ionic interactions (Fig. 3c). This is somewhat unusual in GTs but very common in glycoside hydrolases. Plant starch synthase III has a modular structure similar to that of GalS1, where the presence of an N-terminal starch-binding domain from CBM20 increases processivity of the enzyme 46,[55][56][57] . Similar to the CBM95 from GalS1, the CBM20 in starch-binding domain-containing protein 1 (STBD1), which is associated with glycogen metabolism and autophagy, plays an essential role in stability and facilitates interaction with glycogen-associated proteins 58 . Taken together, these data suggest that the presence of CBMs in a glycosyltransferase may play an enabling role for the efficient synthesis of long polysaccharides. Very recently, new machine learning approaches, including RoseTTAFold 59 and AlphaFold 60 , have been developed for the prediction of protein three-dimensional structures from their amino acid sequences and have been widely released. We compared the AlphaFold predicted structures of Arabidopsis and Populus GalS1 with the core domains of the empirically determined P. trichocarpa GalS1 structure and found that, in this case, they are very similar, with RMSD (using Cα) of 0.610 Å and 0.593 Å, respectively; the slight difference in the RMSD between the two is mainly due to the orientation of the stem domain (Extended Data Fig. 10). The experimental insight into the three-dimensional structure of GalS1 will accelerate the understanding of GalS1's role in tension wood and potential approaches to enzyme engineering and gene replacement.
In the future, it may be possible to use the knowledge gained from structural and functional analysis of diverse bacterial and eukaryotic galactosyltransferases to develop targeted engineering strategies to modify enzymes such as GalS1 to generate variants with altered donor, acceptor or regioselectivity to enzymatically generate new saccharide structures. However, the viability of this enzyme engineering approach will involve substantial concerted efforts in understanding the detailed mechanisms of inverting and retaining GTs, together with elucidating the molecular basis of donor and acceptor substrate selectivity. In particular, Gal binding to lectins has been implicated in tumour metastasis in mammals 61,62 , and unnatural substrates can pave the way to developing newer inhibitors. Also, in the future, we plan to design chimeric transferases or hydrolases with CBM95 domains to modify the activity of pectin-synthesizing and degrading enzymes.
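The Cα RMSD values quoted above summarize how far paired Cα atoms lie from each other once two structures have been superposed. The following is a minimal sketch of that calculation only; it assumes the superposition and residue pairing have already been done by a structure-alignment tool, and the function name `ca_rmsd` and the toy coordinates are illustrative, not taken from the GalS1 models.

```python
import math

def ca_rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) C-alpha positions.

    Assumes the two structures are already superposed and that the i-th
    entry of each list refers to the same aligned residue.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))

# Toy, made-up coordinates in angstroms (not from any GalS1 structure):
# the second set is the first one shifted by 0.6 angstroms along y.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
reference = [(0.0, 0.6, 0.0), (3.8, 0.6, 0.0), (7.6, 0.6, 0.0)]
toy_rmsd = ca_rmsd(model, reference)
print(f"C-alpha RMSD: {toy_rmsd:.3f} angstrom")
```

Because the value depends on which residues are paired, restricting the comparison to the core domains, as described above, reduces the contribution of flexible regions such as the stem.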

Cloning, protein expression and site-directed mutagenesis
The NΔ72GalS1 coding sequence was amplified from a complementary DNA template prepared from terminal buds of P. trichocarpa WT using primers (Supplementary Table 1) and cloned into mammalian expression vector pGEn2-DEST according to our standard protocols 26,63 ; the resulting construct is henceforth called pGEN2-DEST-GalS1WT or GalS1 WT. The resulting fusion proteins consisted of an N-terminal NH 2 -signal sequence, 8xHis tag (for purification), AviTag recognition site, sfGFP (for quantification) and the seven amino acid TEV protease recognition site, followed by the truncated coding region of GalS1. Mutated variants of GalS1 were generated by site-directed mutagenesis using the Q5 Site-Directed Mutagenesis kit (New England Biolabs) according to the manufacturer's instructions, using pGEN2-DEST-GalS1WT as a template and primers listed in Supplementary Table 2. Primers used to generate the GalS1-ΔSTEM (lacking residues 73-113) and GalS1-ΔCBM truncation variants, which lack the stem and the stem and CBM regions, respectively (Extended Data Fig. 2), are listed in Supplementary Table 1. Primers used to generate the construct for expression of CBM95 (residues 73-235; Extended Data Fig. 2) are listed in Supplementary Table 1. Mutated variants of CBM95 were generated by site-directed mutagenesis using the Q5 Site-Directed Mutagenesis kit (New England Biolabs) according to the manufacturer's instructions, using pGEN2-DEST-CBM95 as a template and primers listed in Supplementary Table 3. All constructs were confirmed by DNA sequencing (Eurofins). For transient expression, plasmid DNA was isolated using the PureLink HiPure Expi Plasmid Gigaprep kit or Maxiprep kit (Thermo Fisher) as suggested by the manufacturer. Plasmids were transfected into HEK cells (FreeStyle 293-F cell line, Life Technologies; HEK293S GnTI− cells, CRL-3022, ATCC 64 ) as described previously 63 .
Selenomethionine labelling of WT GalS1 was done by transfecting HEK293S GnTI− cells with pGEN2-DEST-GalS1WT in methionine-starved custom media for 6 h and then supplementing with 60 mg l −1 of selenomethionine. Soluble secreted fusion proteins were collected from the media on the sixth day. Schematics of domain organization of the full-length protein and constructs used in this study are shown in Extended Data Fig. 2. All DNA and protein sequences for construct design and phylogenetic comparison were analysed using Geneious Prime 2019.2.3.

Purification and characterization of GalS1 dimer
Extracellular media obtained from the culture were collected and clarified by sequential centrifugation, then passed through a 5 μm filter 26 . All purifications were carried out using HisPrep FF 16/10 or HisTrap FF 5 ml columns (Cytiva) on an ÄKTA Go or ÄKTA Pure 25 L (Cytiva) protein purification system 26 . Proteins were concentrated to 5 mg ml −1 using Amicon Ultra 15 ml centrifugal filter devices (10 kDa MWCO, Millipore) and stored at 4 °C. Proteins were further purified by gel filtration using a HiLoad 16/600 Superdex 200 pg column (Cytiva) in 50 mM HEPES and 400 mM NaCl at pH 7.5. Fractions were combined and dialysed overnight in 50 mM HEPES containing 100 mM NaCl at pH 7.5, concentrated to 2 mg ml −1 , aliquoted (200 μl) and flash-frozen in liquid nitrogen before storing at −80 °C.
For crystallization, purified GalS1 WT (94 mg) expressed in HEK293S GnTI− cells was treated with 5 mg of recombinant His-tagged GFP-TEV protease and His-tagged EndoF1 (refs. 26,63 ) at 4 °C for 24 h. Tag-free protein was further purified by a second round of immobilized metal affinity chromatography to remove the cleaved N-terminal sfGFP tag, His-tagged GFP-TEV and His-tagged GFP-EndoF1, concentrated to 5 mg ml −1 and loaded onto a Superdex 75 Increase 10/300 GL (Cytiva) column. The unbound fraction was collected and dialysed overnight into 50 mM HEPES and 100 mM NaCl (pH 7.5), and concentrated to 15 mg ml −1 . SEC-MALS was carried out in 50 mM HEPES and 250 mM NaCl (pH 7.5) on a Superdex 200 10/30 GL (Cytiva) column using an Agilent HPLC system coupled to an Optilab T rEX Refractive Index detector and a Mini Dawn Treos detector (Wyatt). Protein (20 μl, 2 mg ml −1 ) was injected using an autosampler. Analysis was done using ASTRA 6 HPLC software (Wyatt). Protein thermal shift assays were carried out using 5 μM of protein and 200X SYPRO Orange (Thermo Fisher) in 50 mM HEPES and 100 mM NaCl (pH 7.5) in a total volume of 50 μl in Hard-Shell 96-well WHT/CLR (Bio-Rad) plates using a CFX96 Real-Time system (Bio-Rad). SYPRO Orange fluorescence was read on the 'FRET' channel at each 30 s hold as the temperature was increased from 25 °C to 100 °C. The data were analysed using the JTSA online server (Bond, P. S., JTSA 2017, http://paulsbond.co.uk/jtsa).

Crystallization and structure determination
For crystallization trials, GalS1 (12 mg ml −1 ) was screened using the following crystallization screens: Berkeley Screen 65 , Crystal Screen, SaltRx, PEG/Ion, Index and PEGRx (Hampton Research) and MCSG-1 (Anatrace). Crystals of GalS1 were found in 0.1 M sodium citrate tribasic dihydrate (pH 5.0) and 10% w/v polyethylene glycol 6000. They were obtained after 2 d by the sitting-drop vapour-diffusion method, with the drops consisting of a mixture of 0.2 μl of protein solution and 0.2 μl of reservoir solution. Crystals of GalS1 were placed in a reservoir solution containing 20% (v/v) glycerol, then flash-cooled in liquid nitrogen. The X-ray datasets for GalS1 were collected at the Berkeley Center for Structural Biology beamline 8.2.2 at the Advanced Light Source at Lawrence Berkeley National Laboratory (LBNL). The diffraction data were recorded using an ADSC-Q315r detector and processed using the programme Xia2 (ref. 66 ).
The GalS1 crystal structure was determined using selenomethionine (Se-Met)-labelled protein by the single-wavelength anomalous dispersion method 67 with phenix.autosol 68 and phenix.autobuild 69 programmes within the Phenix suite 70,71 . The atomic positions obtained from the initial single-wavelength anomalous dispersion data set were used as a search model for molecular replacement against native GalS1 data and to initiate crystallographic refinement and model rebuilding. Structure refinement was performed using the phenix.refine programme 71 . Translation-libration-screw (TLS) refinement was used, with each protein chain assigned to a separate TLS group. Manual rebuilding using COOT 72 and the addition of water molecules allowed the construction of the final model. The final models of GalS1 and GalS1-Mn 2+ have an R factor of 0.197/R free of 0.247 and R factor of 0.235/R free of 0.267, respectively. RMSD differences from ideal geometries for bond lengths, angles and dihedrals were calculated with Phenix. The stereochemical quality of the final model of GalS1 was assessed by the programme MOLPROBITY v4.5.2. A summary of crystal parameters, data collection and refinement statistics can be found in Supplementary Table 4.

SAXS
SAXS was performed at the SIBYLS beamline at the Advanced Light Source 73,74 . For SEC-SAXS-MALS experiments, 60 μl containing 10 mg ml −1 GalS1 in 25 mM HEPES (pH 7.5) and 100 mM NaCl were used. SEC-SAXS-MALS data were collected at the ALS beamline 12.3.1 LBNL Berkeley, California. The X-ray wavelength was set at λ = 1.127 Å and the sample-to-detector distance was 2,100 mm, resulting in scattering vectors, q, ranging from 0.01 Å −1 to 0.4 Å −1 . The scattering vector is defined as $q = 4\pi\sin\theta/\lambda$, where 2θ is the scattering angle. The SAXS flow cell was directly coupled with an online Agilent 1260 Infinity HPLC system using a Shodex KW803 SEC column equilibrated with a running buffer as indicated above with a flow rate of 0.5 ml min −1 . Each sample was run through the SEC and 3 s X-ray exposures were collected continuously during a 30 min elution. The SAXS frames recorded before the protein elution peak were used to subtract all other frames. The subtracted frames were investigated by the radius of gyration ($R_g$) derived by the Guinier approximation $I(q) = I(0)\exp(-q^{2}R_{g}^{2}/3)$ with the limits $qR_g < 1.5$ (ref. 75 ). The elution peak was mapped by comparing the integral of ratios to background and $R_g$ relative to the recorded frame using the programme SCÅTTER. Uniform $R_g$ values across an elution peak represent a homogeneous sample. Final merged SAXS profiles, derived by integrating multiple frames at the top of the elution peak, were used for further analysis, including the Guinier plot, which determined aggregation-free state. Eluent was subsequently split 3 to 1 between the SAXS line and a series of detectors, including UV at 280 and 260 nm, MALS, quasi-elastic light scattering and the refractometer detector. MALS experiments were performed using an 18-angle DAWN HELEOS II light scattering detector connected in tandem to an Optilab refractive index concentration detector (Wyatt).
System normalization and calibration were performed with a BSA monomer using a 45 μl sample at 10 mg ml −1 in the same SEC running buffer and a refractive index increment (dn/dc) value of 0.19. The light scattering experiments were used to perform analytical scale chromatographic separations for MW determination of the principal peaks in the SEC analysis. UV, MALS and differential refractive index data were analysed using Wyatt Astra 7 software to monitor the homogeneity of the sample across the elution peak complementary to the above-mentioned SEC-SAXS signal validation.
Two atomistic models of the GalS1 dimer were built on the basis of close interfaces found in the crystal structure. The missing N-terminal region was modelled as a random coil using the programme MODELLER 9.25 (ref. 76 ). The experimental SAXS profiles were then compared to theoretical scattering curves generated from atomistic models using FOXS 77,78 . The SAXS envelope was restored in the P2 symmetry from the experimental data using the programme GASBOR 2.3 (ref. 79 ). The average SAXS envelope was determined from 10 reconstructions using the DAMAVER programme 80 . Structures and the SAXS envelopes were superimposed and visualized in CHIMERA 1.13.1 (ref. 81 ).
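The Guinier analysis described above amounts to a straight-line fit of ln I(q) against q², whose slope equals −R g ²/3. The sketch below illustrates this under stated assumptions: the data are synthetic and noise-free, the q values are sorted in ascending order, and the window-selection loop (keeping points with qR g below 1.5) is one simple way to apply the limit, not the procedure implemented in SCÅTTER. All function names are illustrative.

```python
import math

def scattering_vector(two_theta_rad, wavelength):
    """q = 4*pi*sin(theta)/lambda, where two_theta_rad is the full scattering angle."""
    return 4.0 * math.pi * math.sin(two_theta_rad / 2.0) / wavelength

def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def guinier_fit(q, intensity, qrg_max=1.5, n_start=5, max_iter=50):
    """Estimate Rg and I(0) from ln I(q) = ln I(0) - (Rg^2 / 3) * q^2.

    The fitting window is grown or shrunk until it contains exactly the
    points with q * Rg < qrg_max (q must be sorted in ascending order).
    """
    n_used = min(n_start, len(q))
    for _ in range(max_iter):
        if n_used < 3:
            raise ValueError("too few points in the Guinier region")
        xs = [qi * qi for qi in q[:n_used]]
        ys = [math.log(i) for i in intensity[:n_used]]
        slope, intercept = linear_fit(xs, ys)
        if slope >= 0:
            raise ValueError("non-negative slope: no Guinier region found")
        rg = math.sqrt(-3.0 * slope)
        n_new = sum(1 for qi in q if qi * rg < qrg_max)
        if n_new == n_used:
            break
        n_used = n_new
    return rg, math.exp(intercept), n_used

# Synthetic, noise-free profile with Rg = 35 angstroms and I(0) = 100 (made-up values).
q = [0.01 + 0.002 * i for i in range(46)]
intensity = [100.0 * math.exp(-((qi * 35.0) ** 2) / 3.0) for qi in q]
rg, i0, n_used = guinier_fit(q, intensity)
print(f"Rg = {rg:.2f} angstrom, I(0) = {i0:.1f}, points used = {n_used}")
```

With real data, curvature in the low-q part of this plot is the usual sign of aggregation, which is why a linear Guinier region is taken as evidence of an aggregation-free sample.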

Bayesian pattern-based evolutionary analysis of GT92
We first collected 259 GT92 sequences curated at the CAZy database (www.cazy.org) 7 . Using the alignment of the GalS1 structure and other GT-A fold structures as a template, we then used a profile-based alignment strategy, mapgaps 82 , to align them to the core GT-A fold profile generated in our previous study 32 . This alignment allowed the mapping of GT-A features into the GalS1 structure. A representative set of 24,816 sequences 32 was generated, including diverse GT-A fold families and GT92 sequences purged using an 80% sequence identity cut-off. This set was used to perform a query-centric Bayesian partitioning with pattern selection 33 analysis, with the GT92 consensus sequence as the query. This procedure clusters GT92 sequences into a distinct foreground group on the basis of alignment positions that are most conserved within the GT92 family and distinguish them from other GT-A fold enzymes grouped into the background.

Identification of critical active site residues in GalS1
Docking and molecular dynamics simulations were employed to deduce the donor (UDP-Gal) and acceptor (Gal 4 ) binding sites and poses in GalS1. The Mn-bound monomeric structure of GalS1 was considered for both molecular dynamics and docking studies. A sequential combination of molecular dynamics simulations conducted using the CHARMM v44a1 molecular dynamics engine and docking studies conducted using Autodock Vina v1.2.2 was used for modelling the enzyme-substrate complexes [83][84][85] . Considering that the Mn-bound crystal structure was elucidated in the absence of substrate molecules, the first set of simulations conducted were of the apo state (Mn-bound) of GalS1 under fully solvated conditions. The CHARMM36 forcefield was used for proteins 86 and ions, including Mn 2+ , and the TIP3P 87 forcefield for water molecules. The protonation states of the titratable amino acids in the proteins were estimated on the basis of the H++ package v4.0 88 , and disulfide linkages between residues 145-179, 236-316 and 369-447 were considered. A 100 ns unbiased simulation of the solvated Mn-bound state of GalS1 was conducted and snapshots from this run were considered for the donor molecule docking studies. An initial blind docking study was conducted wherein the whole GalS1 structure was considered for docking the donor molecule, followed by a more targeted docking study centred around the bound Mn 2+ ion. The targeted docking calculations involved a 40 × 40 × 40 Å box with a grid spacing of 0.375 Å, an exhaustiveness value of 128 and a total of 40 binding modes. The best binding pose was selected for conducting 100 ns production molecular dynamics simulations of the GalS1-UDP-Gal complex to evaluate the validity of the docked binding pose under fully solvated unbiased dynamical conditions.
Before these molecular dynamics simulations, a series of short, restrained simulations (totalling 2.24 ns) were conducted to ensure proper equilibration of the active site residues around the bound donor molecule. Snapshots chosen from the donor-bound simulations were then considered for docking studies of the acceptor molecule to obtain ternary complexes of Mn-bound GalS1 with UDP-Gal and Gal4 bound at the active site. Suitable docked poses of a putative ternary complex were then subjected to fully solvated unbiased molecular dynamics simulations. The CHARMM36 forcefield was also employed for the molecular dynamics simulations that involved the UDP-Gal and galactotetraose substrates 89 .
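As an illustration of the targeted docking parameters given above (40 × 40 × 40 Å box, 0.375 Å grid spacing, exhaustiveness 128, 40 binding modes), the sketch below writes them into a configuration file of the kind read by the AutoDock Vina 1.2 command-line programme. It assumes the standard Vina option names; the file names and the box centre are hypothetical placeholders, because the Mn 2+ coordinates and input files used in the study are not stated here.

```python
def vina_config(receptor, ligand, center, size=(40.0, 40.0, 40.0),
                spacing=0.375, exhaustiveness=128, num_modes=40):
    """Return the text of an AutoDock Vina configuration file.

    center is the (x, y, z) position of the box centre in angstroms;
    in the targeted run described above it would be the bound Mn2+ ion.
    """
    cx, cy, cz = center
    sx, sy, sz = size
    lines = [
        f"receptor = {receptor}",
        f"ligand = {ligand}",
        f"center_x = {cx}",
        f"center_y = {cy}",
        f"center_z = {cz}",
        f"size_x = {sx}",
        f"size_y = {sy}",
        f"size_z = {sz}",
        f"spacing = {spacing}",
        f"exhaustiveness = {exhaustiveness}",
        f"num_modes = {num_modes}",
    ]
    return "\n".join(lines) + "\n"

# Hypothetical file names and a placeholder box centre (not values from the study).
config_text = vina_config("gals1_mn_snapshot.pdbqt", "udp_gal.pdbqt",
                          center=(0.0, 0.0, 0.0))
print(config_text)
```

Centring the box on the metal ion restricts the search to the region where a metal-dependent GT-A enzyme is expected to bind the nucleotide sugar, which is the rationale for following the blind docking with a targeted run.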

Generation and purification of galactotetraose acceptor
The plasmid for heterologous expression of the β-1,4-galactanase GanA from Geobacillus stearothermophilus in pET9d was a kind gift provided by Dr Yuval Shoham 90 . His-tagged GanA was heterologously expressed in Escherichia coli and purified using nickel-nitrilotriacetic acid chromatography, and concentrated stocks (2.5 mg ml −1 ) were stored at −80 °C in 50 mM MES (pH 6.5), 100 mM NaCl and 10% (v/v) glycerol. Potato galactan (500 mg) (Megazyme, P-GALPOT) was dissolved in 50 mM MES (pH 6.5) to a final concentration of 10 mg ml −1 . GanA β-1,4-galactanase (50 μg) was added and the digestion was allowed to proceed for 3 h at 30 °C, with shaking at 1,000 r.p.m. Galacto-oligosaccharides were separated from the reaction mixture via diafiltration (10 kDa MWCO, Millipore). An additional 50 μg of galactanase was added to undigested potato galactan retained in the filter device. The digestion was repeated 5 times in total, with intermittent addition of enzyme and product removal. The galacto-oligosaccharides collected in the filtrates were pooled and lyophilised before loading onto a Bio-Gel P-2 (Bio-Rad) column (120 ml, self-packed column) attached to an HPLC, with water as a running buffer. The fractions were collected and analysed using matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) using a Microflex LT spectrometer (Bruker) as described below. Fractions containing galactotetraose were pooled and lyophilised.

Galactan synthase activity assays
All activated nucleotide sugars were purchased from CarboSource, Promega or Sigma. Screening of sugar-nucleotide donor specificities in the absence of acceptor substrate was done with the UDP-Glo glycosyltransferase assay (Promega) kit 34 . Reactions (20 μl) consisted of 100 μM individual UDP-sugars (UDP-Gal, UDP-Arap, UDP-Xyl, UDP-Glc, UDP-GalA, UDP-GlcA, UDP-GlcNAc and UDP-GalNAc) and 4 μg of purified GalS1 in 50 mM HEPES and 100 mM NaCl (pH 7) at 30 °C for 18 h. The reaction mixture (5 μl) was mixed with an equal amount of UDP-Glo reagent in a 384-well assay plate (Corning 4513) and incubated for 1 h at room temperature before measuring luminescence using a Synergy LX Multi-mode microplate reader (BioTek). A standard curve was used for quantification of the UDP produced.
The quantity of UDP formed as a by-product of the GalT reaction was determined using the UDP-Glo glycosyltransferase assay (Promega) according to the manufacturer's instructions, using either UDP-Gal (Promega) or UDP-Arap (CarboSource) as donor substrates. Standard GalT assays (20 μl) consisted of either UDP-Gal (250 μM) or UDP-Arap (400 μM) as activated nucleotide sugar donors, galactotetraose (400 μM) as an acceptor and 5 mM manganese(II) chloride in 50 mM HEPES (pH 7.0). Reactions were allowed to proceed at 30 °C for 2 h and the amount of UDP produced was determined as described above.
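Quantifying UDP from a standard curve, as described above, can be sketched as a linear fit of luminescence against known UDP concentrations followed by inversion of the fitted line for each sample. The example below assumes a linear response over the standard range; the standard concentrations and luminescence values are synthetic, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def udp_from_luminescence(rlu, slope, intercept):
    """Invert the standard curve: UDP concentration for a luminescence reading."""
    return (rlu - intercept) / slope

# Synthetic UDP standards (uM) with a perfectly linear, made-up response (RLU).
standard_udp_uM = [0.0, 1.5625, 3.125, 6.25, 12.5, 25.0]
standard_rlu = [200.0 + 4000.0 * c for c in standard_udp_uM]

slope, intercept = fit_line(standard_udp_uM, standard_rlu)
sample_udp_uM = udp_from_luminescence(40200.0, slope, intercept)
print(f"UDP in sample: {sample_udp_uM:.2f} uM")
```

Readings that fall outside the range of the standards should not be extrapolated from such a fit; diluting the reaction before detection keeps them within the calibrated range.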

Polysaccharide analysis by carbohydrate gel electrophoresis
Reactions (25 μl) consisted of 2 μg galactotetraose as galacto-oligosaccharide substrate, 200 μM UDP-Gal and 20 μg purified protein, and contained 10 mM MnCl 2 and 1% (v/v) Triton X-100 in 50 mM HEPES (pH 7.0). Reactions were incubated at 30 °C for 2 h and then terminated by heating at 100 °C for 5 min, followed by centrifugation at 10,000 × g for 10 min. Supernatants (15 μl) were mixed with 15 μl 3 M urea and 5 μl samples were loaded on large-format Tris-borate acrylamide gel prepared as described previously 91 and electrophoresed at 200 V for 30 min followed by 1,000 V for 1.5 h. The PACE gels were visualized with a G-Box gel doc system (Syngene) at Tumi-wavelength with a UV detection filter and long-wave UV tubes (365 nm emission).

MST
MST experiments to investigate the ability of the full-length protein and variants to bind UDP-Gal were performed on a NanoTemper Monolith NT.115 with blue/red filters, as previously described 92 . His-GFP-GalS1 (or variants) were diluted 200X in MST buffer 1 (1% Triton X-100, 10 mM MnCl 2 , 50 mM HEPES (pH 7.0)) and the final concentration yielded detectable fluorescent signals between 200 and 1,600 fluorescence units (FU). UDP-Gal solution (10 μl, 5 mM) was diluted 1:1 in 10 μl MST buffer 1 to make a 16-sample serial dilution from 2.5 mM to 76.3 nM. Purified protein (10 μl, 5 μg ml −1 ) was added to 10 μl of each ligand solution and incubated at room temperature for 10 min. Prepared samples were loaded into standard treated capillaries for measurements using 20% MST power with laser off/on times of 0 s and 10 s, respectively, at 22 °C. All experiments were repeated 3 times for each measurement.
MST experiments to evaluate the CBM95 were performed on a Monolith NT.115Pico (NanoTemper) equipped with blue/red filters. Non-adherent Arabidopsis thaliana seed coat mucilage, composed of almost pure RGI, was prepared according to a previously described method 93 . A protein solution of His-GFP-CBM95 (80 nM) or variants was prepared in MST buffer 2 (0.02% Tween 20, 10 mM MnCl 2 , 600 mM NaCl and 100 mM HEPES (pH 7.0)), mixed and centrifuged at 15,000 × g for 10 min to remove any potential aggregates. Substrate solutions (50 μl, 0.1 mg ml −1 ) of non-adherent mucilage, galactotetraose, polygalacturonic acid (Sigma) and xylohexaose (Megazyme) were mixed 1:1 with 50 μl of the CBM95 protein solution. Samples were incubated for 5 min in the dark before MST analysis. Four aliquots of standard capillaries were loaded with prepared samples and the binding was checked. Binding affinity was measured using a 16-sample serial dilution from 8.3 μM to 0.25 nM. Purified protein (10 μl, 160 nM) was added to 10 μl of each ligand solution and incubated at room temperature for 10 min. Prepared samples were loaded into standard treated capillaries for measurements using 40% MST power with laser off/on times of 0 s and 10 s, respectively, at 22 °C. All experiments were repeated at least 2 times for each measurement.
MST experiments to investigate the ability of the full-length protein and variants to bind galactotetraose were also performed on a NanoTemper Monolith NT.115 with blue/red filters, similar to CBMs as described above, except that binding affinity was measured using a 16-sample serial dilution from 5 mM to 153 nM of the acceptor.
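The ligand ranges quoted for the MST experiments are consistent with 16-point, twofold (1:1) serial dilutions, in which the last point is the first divided by 2¹⁵. The short sketch below reproduces the three ranges; the function name is illustrative, and whether the quoted concentrations refer to the ligand before or after mixing with protein is not specified in the text, so the values are simply those of the dilution series itself.

```python
def serial_dilution(start_conc, n_points=16, factor=2.0):
    """Concentrations of an n-point serial dilution, highest first."""
    return [start_conc / factor ** i for i in range(n_points)]

udp_gal = serial_dilution(2.5e-3)        # mol/l, UDP-Gal series
cbm_ligand = serial_dilution(8.3e-6)     # mol/l, CBM95 ligand series
galactotetraose = serial_dilution(5e-3)  # mol/l, acceptor series

for name, series in [("UDP-Gal", udp_gal),
                     ("CBM95 ligand", cbm_ligand),
                     ("galactotetraose", galactotetraose)]:
    print(f"{name}: {series[0]:.3g} M down to {series[-1]:.3g} M")
```

The lowest points come out at about 76.3 nM, 0.25 nM and 153 nM, matching the ranges stated for the three experiments.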

MALDI-TOF mass spectrometry
MALDI-TOF mass spectra of GalS1 saccharide reaction products were acquired using a Microflex LT spectrometer (Bruker). Reactions contained 4 μg of the WT GalS1 enzyme, 10 mM of UDP-Gal or UDP-Arap, 0.5 mM of galactotetraose and 1 mM manganese(II) chloride in 50 mM HEPES (pH 7.0) in a total volume of 20 μl. Reactions were allowed to proceed for 16 h at 25 °C. Aliquots (5 μl) of each reaction were mixed with 1 μl of Dowex-50 cation exchange resin (Bio-Rad) and incubated for 1 h on a microplate mixer. The tubes were centrifuged at 1,250 × g for 5 min. Of each sample, 1 μl was mixed with 1 μl matrix (2,5-dihydroxybenzoic acid, 100 mg ml −1 in 50% methanol) directly on the plate and blow-dried. Positive-ion spectra from 200 laser shots were summed to generate the spectrum for each sample.
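When reading such spectra, it helps to know the masses expected for the acceptor and its products: each added galactose (a hexose) increases the mass by about 162.05 Da and each added arabinose (a pentose) by about 132.04 Da. The sketch below computes monoisotopic m/z values under the assumption, not stated in the text, that the oligosaccharides are detected as singly charged sodium adducts, which is common for neutral sugars in positive-ion mode. The constants are standard monoisotopic masses and the function name is illustrative.

```python
# Monoisotopic masses in Da
HEXOSE_RESIDUE = 162.05282   # C6H10O5, e.g. one galactosyl unit
PENTOSE_RESIDUE = 132.04226  # C5H8O4, e.g. one arabinosyl unit
WATER = 18.01056             # H2O, accounts for the free reducing and non-reducing ends
SODIUM_ION = 22.98922        # Na+ (assumed adduct; not specified in the text)

def oligosaccharide_mz(n_hexose, n_pentose=0, adduct=SODIUM_ION):
    """m/z of a singly charged adduct of a linear hexose/pentose oligosaccharide."""
    neutral = n_hexose * HEXOSE_RESIDUE + n_pentose * PENTOSE_RESIDUE + WATER
    return neutral + adduct

acceptor_mz = oligosaccharide_mz(4)          # galactotetraose
gal_product_mz = oligosaccharide_mz(5)       # galactotetraose + 1 Gal
ara_product_mz = oligosaccharide_mz(4, 1)    # galactotetraose + 1 Ara

for label, mz in [("Gal4", acceptor_mz),
                  ("Gal5", gal_product_mz),
                  ("Gal4Ara1", ara_product_mz)]:
    print(f"{label}: m/z {mz:.2f}")
```

If potassium adducts or protonated ions dominate instead, only the `adduct` value changes; the 162.05 and 132.04 Da spacings between peaks are independent of that choice.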

Sequence analysis of the stem domain
To prepare the sequence alignments of the stem domain, the sequence region spanning residues 1-113 of GalS1 was searched against the National Center for Biotechnology Information (NCBI) database using PSI-BLAST. The top 100 sequences were taken for analysis, and hypothetical, predicted and low-quality protein sequences were removed beforehand. The sequences were aligned using the T-Coffee web server 94 ; the web logo was created using https://weblogo.berkeley.edu/logo.cgi.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
The diffraction data and crystallographic models that support the findings of this study are available at the Protein Data Bank (https://www.rcsb.org/) under PDB accession codes 8D3T and 8D3Z for GalS1 in the apo form and the GalS1 bound to Mn 2+ , respectively. The SAXS data and model have been deposited and are available from the SIMPLE SAXS (https://simplescattering.com) database under the accession code XSMHXTBH. Other data that support the findings of this study and any computer code used herein are available from the corresponding author upon request. Source data are provided as part of this paper.

Code availability
The software used for analysis of the crystallographic data is freely available online or from the authors of each software package. Any computer code used herein is available from the corresponding author upon request.

Tables S2 and S3. Each transfection experiment for mutant variants was performed at least once and two different SDS PAGE gels monitoring protein purification were generated. Data presented here are representative of the final purified product used in respective experiments. c, SEC-MALS-SAXS analysis of GalS1. SEC elution profile for GalS1, along with masses calculated from MALS and radii of gyration calculated from SAXS frames collected across the SEC elution peak. The masses confirm that untagged GalS1 is a dimer in solution (MW SAXS = 110 kDa, MW MALS = 119 kDa). Original uncropped images are provided in the Source Data.