Polymerization of the backbone of the pectic polysaccharide rhamnogalacturonan I

Rhamnogalacturonan I (RG-I) is a major plant cell wall pectic polysaccharide defined by its repeating disaccharide backbone structure of [4)-α-d-GalA-(1,2)-α-l-Rha-(1,]. A family of RG-I:Rhamnosyltransferases (RRT) has previously been identified, but synthesis of the RG-I backbone has not been demonstrated in vitro because the identity of Rhamnogalacturonan I:Galacturonosyltransferase (RG-I:GalAT) was unknown. Here a putative glycosyltransferase, At1g28240/MUCI70, is shown to be an RG-I:GalAT. The name RGGAT1 is proposed to reflect the catalytic activity of this enzyme. When RGGAT1 is incubated together with the rhamnosyltransferase RRT4, the combined activities of the two enzymes elongate RG-I acceptors in vitro into a polymeric product. RGGAT1 is a member of a new GT family categorized as GT116, which does not group into existing GT-A clades and is phylogenetically distinct from the GALACTURONOSYLTRANSFERASE (GAUT) family of GalA transferases that synthesize the backbone of the pectin homogalacturonan. RGGAT1 has a predicted GT-A fold structure but employs a metal-independent catalytic mechanism that is rare among glycosyltransferases.

Pectins are a galacturonic acid (GalA)-rich class of polysaccharides present in the cell wall of nearly every plant species and cell type. The traditionally recognized roles of pectic polysaccharides as structural components of plant cell walls have been studied extensively in model organisms, most notably Arabidopsis 1 , and also in biomass feedstock species including switchgrass and poplar 2 . Having well-established uses as a safe food additive and proposed roles in gut microbiome health, contemporary interest in pectin research extends to uncovering the positive health effects that are expected to result from human metabolic pathways affected by pectin consumption 3,4 . As a result of the chemical complexity and structural heterogeneity that exists within pectic polysaccharides, several challenges limit the current understanding of pectins as a family of functionally active macromolecules. These include difficulties in isolating homogeneous pectic domains for use in biological studies and characterizing the families of biosynthetic enzymes that synthesize the individual sugar linkages.

Article https://doi.org/10.1038/s41477-022-01270-3
[...] are branched at O-4 with side branches largely composed of arabinan, galactan and arabinogalactan 1 . GAUT family enzymes have not been shown to incorporate GalA into RG-I oligosaccharide acceptors 5,8 , which suggests that a distinct, unidentified GT family may function in polymerization of the RG-I backbone.
The in vivo functions of RG-I are poorly understood; however, the cell wall structural and compositional changes that occur during fruit ripening have provided some initial insight into RG-I function 9,10 . RG-I has been proposed to contribute to cell wall structural integrity through interactions with other polysaccharides and to cellular adhesion by interacting with HG within the primary wall and middle lamella 2,9,11 . The isolation of RG-I polysaccharides has previously required extensive sequential selective hydrolysis and extraction of pectin-rich tissues such as citrus peels, the major source of commercial pectins 12 . Seed mucilages, water-retentive polysaccharide fractions secreted by many species to maintain seed viability and hydration, have been identified as an ideal source of RG-I polysaccharide material for biosynthesis studies 13,14 . The polysaccharide components of seed mucilages vary across species, but Arabidopsis seed epidermal cells have been shown to secrete a non-adherent mucilage highly enriched in RG-I with minimal or no backbone substitution 15 . Arabidopsis seed mucilage RG-I is a polysaccharide with a molecular mass greater than 600 kDa 15 .
Several activities related to RG-I biosynthesis have been identified and added to the families of GTs categorized within the CAZy database. A family of RG-I:Rhamnosyltransferases, annotated as RRT, transfers Rha to RG-I acceptors, resulting in an α-4 linkage to GalA on the non-reducing end 16 . The discovery of the RRT activity resulted in the establishment of CAZy family GT106. The RRT clade in Arabidopsis has recently been expanded to 10 members, of which 5 have been shown to have RG-I:RRT activity 17 . Consistent with a role in RG-I mucilage synthesis, the founding member, RRT1, was discovered due to its high expression in the late stages of Arabidopsis seed development when mucilage production is elevated 16 . A GT family has also been identified that functions in the elongation of the RG-I-specific β-1,4-linked galactans. Each of the three members of the GALS family, a sub-clade of GT92 in Arabidopsis, has been confirmed to exhibit RG-I galactan synthase function 18,19 .
Synthesis of the RG-I backbone has not been demonstrated in vitro because Rhamnogalacturonan I:Galacturonosyltransferase (RG-I:GalAT), the enzyme that transfers GalA to Rha-containing RG-I acceptors, has not been identified. Here, At1g28240, a gene encoding a protein currently annotated as MUCILAGE-RELATED70 (MUCI70), was selected as a candidate RG-I:GalAT. Similar to RRT1, MUCI70 was originally discovered due to its high expression in the Arabidopsis seed coat during a developmental period consistent with upregulated RG-I biosynthesis 20 . The putative domain structure of MUCI70 has characteristics common with known GT families, including an N-terminal transmembrane domain and a predicted C-terminal putative GT domain currently annotated as DUF616 (PF04765) 21 . This putative GT domain has been predicted to be most closely related to GT8 family proteins 22 , but DUF616-domain proteins have not been identified as members of the GAUT1-related superfamily 23 . Mutants of MUCI70 have reduced staining of the mucilage that is released from seeds upon hydration and a reduced amount of both GalA and Rha in total mucilage extracts, phenotypes also observed in mutants of RRT1 16,20 . Concurrent with the work described in this study, MUCI70 was also identified in a genome-wide study of phenotypes resulting from single nucleotide polymorphisms, with mutant alleles of muci70 resulting in reduced molecular weight of the mucilage polysaccharide 24 . The association of MUCI70 expression with the size and composition of RG-I polysaccharides recovered from Arabidopsis seed mucilage supported a proposed function in RG-I biosynthesis. Here we show that MUCI70 is a GalAT that functions with RRT to synthesize the RG-I backbone.

RGGAT1 (MUCI70) adds GalA to RG-I acceptors in vitro
The purified MUCI70 protein was tested for the ability to transfer GalA to different pectin acceptors. RG-I acceptor oligosaccharides were generated by digesting Arabidopsis seed mucilage with a rhamnogalacturonan endohydrolase from Aspergillus aculeatus, RGase A 29,30 (Extended Data Fig. 2a). Following the method originally developed by Ishii et al. 31 , RG-I oligosaccharides of defined chain lengths were derivatized to include a 2-aminobenzamide (2AB) fluorescent tag at the reducing terminus and were purified from the mixture of digested mucilage (Extended Data Fig. 2b and Supplementary Fig. 1). Elongation of the oligosaccharide by transfer to the non-reducing end, as depicted in Fig. 3a, would be consistent with the elongation mechanism of the pectic biosynthetic GAUT 6,32 and RRT 16 enzyme families. The abbreviation RG-I (R) signifies RG-I oligosaccharides generated by digestion of RG-I with RGase A and resulting in non-reducing terminal rhamnose.
RG-I:GalAT activity was assayed by incubating MUCI70 with an RG-I (R) oligosaccharide acceptor of a degree of polymerization (DP) of 12 total sugar units. The hypothetical reaction scheme (Fig. 3a) represents elongation of a DP12 (R) to a DP13 (G) oligosaccharide. On the basis of a mass shift of 176 Da corresponding to the addition of a GalA monomer, MUCI70 catalysed the transfer of GalA to the RG-I (R) acceptor (Fig. 3b,c). The activity of MUCI70 is limited to the addition of a single GalA to this acceptor and does not catalyse the transfer of GalA to RG-I acceptors containing a GalA on the non-reducing end or to HG acceptors (Extended Data Fig. 3). On the basis of this activity, we propose the name RG-I:GALACTURONOSYLTRANSFERASE1 (RGGAT1) for this enzyme.
The initial test of RGGAT1 activity, described in the previous paragraph, used a high enzyme concentration (1 µM) that resulted in the complete conversion of a DP12 (R) acceptor to a product elongated by a single GalA monosaccharide. A separate set of reaction conditions was established to measure the biochemical parameters of the enzyme activity. In these reactions, the activity was tested using a lower enzyme concentration (50 nM) to limit the reaction progress. A 10 min incubation with the DP12 (R) acceptor under these conditions resulted in a 9.7% conversion of the DP12 (R) acceptor to the DP13 (G) product (Fig. 3d and Extended Data Fig. 4).

Biochemical characterization of RGGAT1 activity
The kinetics of RG-I:GalAT activity were determined using a commercial UDP-Glo assay that detects activity on the basis of the conversion of the UDP released during the glycosyltransfer reaction to a luminescent signal. The RGGAT1 reaction progress curve was monitored from 0 to 60 min using the DP12 (R) acceptor substrate and UDP-GalA as a donor, indicating a 6% conversion to products in a reaction containing 1 mM UDP-GalA, 100 µM acceptor and 50 nM enzyme when measured at 10 min (Extended Data Fig. 4b). Similar levels of activity were detected using anion exchange chromatography (Extended Data Fig. 4a) under equivalent reaction conditions, indicating that both methods are suitable for biochemical characterization of RG-I:GalAT activity. The pH optimum of RGGAT1 was 6.5 (Fig. 4a). Comparison of RGGAT1 activity using a series of RG-I acceptors revealed that RGGAT1 can detectably transfer GalA to acceptors of at least DP6, with an approximately 4-fold increase in activity with acceptors of DP ≥ 10 (Fig. 4b). DP6 was the smallest size acceptor purified from the Arabidopsis mucilage digest.
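The percent-conversion figures above can be related to enzyme turnover with a short calculation. The following is only a back-of-the-envelope sketch: the function names are hypothetical, and the apparent turnover it yields assumes a constant rate over the first 10 min, which is a simplification and not a value reported in this study.

```python
def product_formed_uM(acceptor_uM: float, percent_conversion: float) -> float:
    """Product concentration (µM) implied by a percent conversion of the acceptor."""
    return acceptor_uM * percent_conversion / 100.0


def apparent_turnover_per_min(product_uM: float, enzyme_nM: float, minutes: float) -> float:
    """Average number of transfers per enzyme molecule per minute.

    Assumes a constant rate over the whole interval (illustrative assumption).
    """
    enzyme_uM = enzyme_nM / 1000.0  # convert nM to µM
    return product_uM / enzyme_uM / minutes


# Conditions quoted in the text: 100 µM acceptor, 6% conversion, 50 nM enzyme, 10 min.
product = product_formed_uM(100.0, 6.0)
turnover = apparent_turnover_per_min(product, 50.0, 10.0)
print(f"product formed: {product:.1f} µM")
print(f"apparent turnover: {turnover:.1f} per min")
```

Under these assumptions, a 6% conversion corresponds to roughly 6 µM of product and on the order of 12 transfers per enzyme per minute.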
Michaelis-Menten kinetics were measured for the UDP-GalA donor and for acceptor oligosaccharides of different chain lengths (Fig. 4c). RGGAT1 has a Michaelis constant (K_M) for UDP-GalA of 110 µM. Using a range of acceptor concentrations from 0 to 100 µM, we were able to model Michaelis-Menten kinetics, with similar results for the DP12 and DP16 acceptors, yielding an estimated K_M of 28-31 µM; however, the estimated K_M of 294 µM for the DP8 acceptor was outside of the range of acceptor concentrations available for assay (Fig. 4d). The inability of the DP8 acceptor to saturate the active site under this range of concentrations resulted in a catalytic efficiency k_cat/K_M, where k_cat is the catalytic constant, that was >10-fold higher for the longer-chain acceptors.
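To illustrate why the DP8 acceptor could not saturate the enzyme within the assayed range, the Michaelis-Menten expression v/V_max = [S]/(K_M + [S]) can be evaluated at the highest acceptor concentration used (100 µM). This is a minimal sketch; the function name and dictionary labels are hypothetical, and the two long-chain entries simply take the ends of the reported 28-31 µM range.

```python
def fractional_saturation(substrate_uM: float, km_uM: float) -> float:
    """Fraction of V_max reached at a given substrate concentration
    under simple Michaelis-Menten kinetics: [S] / (K_M + [S])."""
    return substrate_uM / (km_uM + substrate_uM)


# Estimated acceptor K_M values quoted in the text (µM).
KM_ESTIMATES_UM = {
    "DP8": 294.0,
    "long-chain, lower estimate": 28.0,
    "long-chain, upper estimate": 31.0,
}

MAX_ACCEPTOR_UM = 100.0  # highest acceptor concentration assayed

for label, km in KM_ESTIMATES_UM.items():
    fraction = fractional_saturation(MAX_ACCEPTOR_UM, km)
    print(f"{label}: {fraction:.0%} of V_max at {MAX_ACCEPTOR_UM:.0f} µM")
```

With these values the longer acceptors reach more than three-quarters of V_max at 100 µM, whereas DP8 reaches only about a quarter, so its K_M estimate rests on extrapolation beyond the measured range.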
Most families of GT-A fold enzymes require divalent cations for activity because the cations coordinate interactions between the diphosphate of the sugar nucleotide donor and the enzyme active site DxD motif 33 . The most common divalent cation utilized by glycosyltransferases is Mn2+, which is also required for transferase activity by the HG biosynthetic complex GAUT1:GAUT7 6 . Following Ni2+-NTA affinity purification, RGGAT1 was dialysed against Chelex-100, a resin used to remove any residual metal ions by chelation. In the assays presented above, activity was observed without the addition of exogenous sources of metal ions to the reaction mixture, suggesting that RGGAT1 might not require metal ions for catalysis. To verify that RGGAT1 is a metal-independent GT, the enzyme was incubated in a MES buffer containing no additives (control), 10 mM EDTA or 10 mM MnCl2. After a 30 min incubation period, the assay was performed. Identical enzyme activity was observed in control reactions as well as in those containing EDTA or MnCl2. The results indicate that divalent cations are not required for activity (Fig. 4e).
In assays containing 50 nM enzyme, the reactions were limited to approximately 20% conversion of the acceptor, as measured at 60 min (Extended Data Fig. 4c). Increased reaction times, including overnight incubation of samples, did not result in complete conversion of the acceptor at limiting enzyme concentrations. When a phosphatase (potato apyrase) was included in the reaction, at least a 2-fold increase in the conversion of the acceptor was detected at 60 min (Extended Data Fig. 4c), suggesting that RGGAT1 is inhibited by UDP released during the transferase reaction.
Fig. 4. a, Reactions were incubated with 50 mM of MES buffer (blue circles) of pH 5.5-6.7 or HEPES buffer (red squares) of pH 6.7-8.0. The buffer MES pH 6.5 was used for standard condition assays. The black line represents the average value of n = 3 independent assays. Individual data points from the three assays are shown. b, RGGAT1 activity was measured in 10 min reactions containing 50 nM enzyme, 1 mM UDP-GalA and 100 µM of acceptors with degrees of polymerization ranging from DP6 to DP18 or no acceptor (no acc) using UDP-Glo. Error bars represent standard deviations of n = 3 independent experiments. c, Michaelis-Menten kinetics for the UDP-GalA donor. RGGAT1 was incubated for 10 min with 100 µM DP16 acceptor and variable concentrations of UDP-GalA (0-1,000 µM). Kinetic constants were calculated by nonlinear regression using GraphPad Prism. Error bars represent standard deviations from n = 4 independent experiments. Dotted lines represent 95% confidence intervals. K_M and k_cat are reported as mean ± s.e.m. d, Michaelis-Menten kinetics for RG-I oligosaccharide acceptors of DP8, DP12 and DP16. RGGAT1 was incubated for 10 min with 1 mM UDP-GalA and variable concentrations of the indicated acceptors (0-100 µM). Kinetic constants were calculated by nonlinear regression using GraphPad Prism. Error bars represent standard deviations of n = 3 independent experiments. Dotted lines represent 95% confidence intervals. K_M and k_cat are reported as mean ± s.e.m. e, RGGAT1 was incubated in a 50 mM MES pH 6.5 buffer (control) or with buffer containing either 10 mM EDTA or 10 mM MnCl2 for 30 min before the assay. After a 30 min incubation period, the enzyme was diluted and assayed as described. The final concentration during the reaction was 10 mM for EDTA and 0.25 mM for MnCl2. No difference in activity compared to the control reaction was detected. Error bars represent standard deviations of n = 3 independent experiments.

In vitro polymerization of RG-I by RGGAT1 and RRT4
The data presented established that RGGAT1 transfers a single GalA to RG-I acceptors. It has been previously shown that a family of RG-I:Rhamnosyltransferases (RRT) transfer Rha to RG-I acceptors 16 . If the linkages transferred by these two enzymes were consistent with the linkages of the GalA-Rha disaccharide repeat backbone of RG-I, then we predicted that the combined activities would result in polymerization of longer-chain RG-I polysaccharides through sequential addition of GalA and Rha to the non-reducing end of elongating acceptor oligosaccharides.
To purify a source of RG-I:RhaT activity, the coding sequences of the original four RRT enzymes with known activity 16 , truncated by their predicted N-terminal transmembrane domains, were cloned into the pGEn2 vector for expression in HEK293 cells. Compared to RGGAT1, all four members of the RRT family expressed relatively poorly in HEK293 cells, as indicated by the low fluorescence of secreted protein (Extended Data Fig. 5). Of the four proteins tested, RRT4Δ51 resulted in the highest yield of soluble protein. RRT4Δ51 was expressed in a 500 ml culture and purified using Ni2+-NTA affinity. The protein eluted from this purification was resolved on an SDS-PAGE gel under reducing (+DTT (dithiothreitol)) and non-reducing (−DTT) conditions (Fig. 5a). Under reducing conditions, the major band detected was consistent with the expected molecular weight of the RRT4Δ51 fusion protein, but the appearance of a higher-molecular-weight band under non-reducing conditions suggested that an aggregated form of the protein co-purified with the monomeric protein during Ni2+-NTA affinity chromatography. Size-exclusion chromatography was unable to separate the active monomer from the aggregates. Despite the lower apparent purity of this enzyme compared with RGGAT1, RRT4 was able to transfer Rha to RG-I acceptors containing a GalA residue on the non-reducing end. A product with an increased mass of 146 Da was detected when RRT4 was added to a reaction mixture containing UDP-Rha and a DP12-2AB (G) RG-I oligosaccharide acceptor (Extended Data Fig. 6), consistent with previously published data showing that RRT4 is an RG-I:RhaT 16 . Conversion of the RG-I acceptor oligosaccharide required higher enzyme concentrations (1-5 µM, Extended Data Fig. 6) than were necessary for the measurement of activity by RGGAT1, suggesting that the specific activity of RRT4 was low due to the relatively low expression and purity. To compensate for this low conversion efficiency, higher enzyme concentrations were used in reactions with RRT4 to maximize the observable polymerization of longer-chain polysaccharide products.
The potential for the combined activities of RGGAT1 and RRT4 to elongate RG-I acceptors was tested by incubating both enzymes (5 µM) in a reaction mixture containing 1 mM UDP-GalA, 1 mM UDP-Rha and 100 µM DP12 RG-I (G) acceptor. The reaction resulted in a series of peaks separated by 322 Da, consistent with the size of an RG-I disaccharide containing both GalA (176 Da) and Rha (146 Da) residues (Fig. 5b,c). The absence of a detectable intermediate mass resulting from a single Rha addition indicates that the GalA transfer reaction proceeded at a significantly faster rate than the Rha addition under these reaction conditions. The activities of RGGAT1 and RRT4 were limited to a single GalA or Rha transfer when incubated as individual enzymes, and neither RGGAT1 nor RRT4 was able to polymerize RG-I in the absence of the other enzyme (Extended Data Fig. 7).
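The mass increments quoted above (176 Da for GalA, 146 Da for Rha and 322 Da for the disaccharide repeat) follow from the monosaccharide formulas minus one water lost per glycosidic bond. The sketch below reproduces that arithmetic with rounded average atomic masses; the helper names are hypothetical, and the calculation ignores the ionization adducts that a mass spectrometer would actually observe.

```python
# Rounded average atomic masses (g/mol).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}

# Elemental compositions of the free sugars and of water.
GALA = {"C": 6, "H": 10, "O": 7}   # galacturonic acid, C6H10O7
RHA = {"C": 6, "H": 12, "O": 5}    # rhamnose, C6H12O5
WATER = {"H": 2, "O": 1}


def average_mass(formula: dict) -> float:
    """Average molecular mass for an elemental composition."""
    return sum(ATOMIC_MASS[element] * count for element, count in formula.items())


def residue_mass(sugar: dict) -> float:
    """Mass added to a chain per sugar: free sugar minus one water."""
    return average_mass(sugar) - average_mass(WATER)


gala_shift = residue_mass(GALA)
rha_shift = residue_mass(RHA)
repeat_shift = gala_shift + rha_shift

print(f"GalA residue: {gala_shift:.1f} Da")
print(f"Rha residue: {rha_shift:.1f} Da")
print(f"GalA-Rha repeat: {repeat_shift:.1f} Da")
```

Rounded to whole daltons, these values match the 176 Da, 146 Da and 322 Da spacings described in the text.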
Having established that the RG-I oligosaccharide acceptor can be elongated by at least 6 disaccharide repeat units when incubated with both RGGAT1 and RRT4 enzymes (Fig. 5c), the enzyme pair was tested for the ability to polymerize longer-chain RG-I polysaccharides. The enzymes were incubated with 2.5 mM UDP-GalA, 2.5 mM UDP-Rha and 25 µM of the DP12-2AB acceptor to create a 100:1 molar ratio of each donor molecule to the acceptor. If the reaction was able to consume the respective sugar nucleotide donors, it would theoretically result in the synthesis of an RG-I polysaccharide of DP212 as a result of the addition of 100 disaccharide units to the initial acceptor. At the indicated time points ranging from 0 to 12 h, aliquots were removed and products were detected using high-percentage polyacrylamide gels stained with alcian blue (Fig. 5d) and size-exclusion chromatography with refractive index detection (Fig. 5e). Both of these methods have previously been used to detect the polymerization of HG by GAUT family enzymes 6 . The reaction resulted in the synthesis of RG-I polysaccharides that increased in size during the 12 h incubation to a final mixture of polysaccharides of at least DP40 compared to RG-I standards of known size based on alcian blue staining in polyacrylamide gels. The products separated by size-exclusion chromatography were also coupled to a multi-angle light scattering (MALS) detector, which estimated a product size of DP130 for the RG-I polysaccharides synthesized in a 12 h incubation.
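The theoretical maximum of DP212 follows directly from the donor-to-acceptor ratio. A minimal sketch of that arithmetic is given below; the function names are hypothetical, and the second helper only converts the MALS size estimate into an average number of repeat units added, which is an illustrative derivation and not a value reported in the text.

```python
def theoretical_max_dp(acceptor_dp: int, donor_mM: float, acceptor_uM: float,
                       sugars_per_repeat: int = 2) -> float:
    """Largest DP reachable if every donor molecule is consumed.

    Assumes each of the two donors is present at `donor_mM` and that all
    acceptor chains are elongated equally.
    """
    repeats_per_acceptor = (donor_mM * 1000.0) / acceptor_uM  # molar ratio donor:acceptor
    return acceptor_dp + sugars_per_repeat * repeats_per_acceptor


def repeats_added(product_dp: float, acceptor_dp: int, sugars_per_repeat: int = 2) -> float:
    """Average number of disaccharide repeats added to reach `product_dp`."""
    return (product_dp - acceptor_dp) / sugars_per_repeat


# Conditions quoted in the text: DP12 acceptor at 25 µM, each donor at 2.5 mM.
max_dp = theoretical_max_dp(12, 2.5, 25.0)
print(f"theoretical maximum: DP{max_dp:.0f}")

# MALS estimate of the 12 h product quoted in the text: DP130.
print(f"repeats added for DP130: {repeats_added(130, 12):.0f}")
```

On this basis, the DP130 product estimated by MALS corresponds to about 59 of the 100 possible repeat units being added per acceptor chain.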
The in vitro polymerized RG-I polysaccharides were digested by two enzymes specific to the two linkages in the RG-I backbone. RG-I hydrolase from Aspergillus aculeatus (RGase A) is an endohydrolase that cleaves the [4)-α-d-GalA-(1,2)-α-l-Rha-(1,] linkage, resulting in oligosaccharides containing Rha residues on the non-reducing end 30 . Alternatively, RG-I lyase (RGase B) is an endolyase that cleaves the [2)-α-l-Rha-(1,4)-α-d-GalA-(1,] linkage, resulting in oligosaccharides containing 4,5-unsaturated GalA residues on the non-reducing end 30 . An RG-I polysaccharide was polymerized in vitro, as described above. After termination of the reaction by boiling, the polysaccharide was incubated with RG-I hydrolase or RG-I lyase for 1-12 h and the digested products were detected by alcian blue-stained PAGE (Fig. 5f). The ability of these two enzymes to degrade the in vitro polymerized RG-I polysaccharides confirmed that the linkages synthesized by RGGAT1 and RRT4 are the expected backbone linkages for an RG-I polysaccharide.
The sequential addition of GalA and Rha units to polymerize long-chain RG-I polysaccharides invites the hypothesis that RGGAT1 and RRT family enzymes interact and function as a biosynthetic complex. Co-expression in HEK293 cells of two HG biosynthetic enzymes, GAUT1 and GAUT7, resulted in the formation of a heterocomplex with enhanced expression compared with expression of the individual enzymes in the same system 6 . We tested whether co-expression of RGGAT1 with four RRT family members in HEK293 cells resulted in enhanced expression of RRT as a preliminary test of interactions between these two GT families. Only RGGAT1 protein was detected in all samples in sufficient amounts to observe monomer bands, suggesting that co-expression with RGGAT1 did not result in increased expression of any RRT family enzymes tested (Extended Data Fig. 8). Although no evidence currently exists for an RG-I biosynthetic heterocomplex, such a complex may require a specific permutation of RGGAT (GT116) and RRT (GT106) family members.

RGGAT1 is a GT116 family enzyme with a predicted GT-A fold
Before this study, RGGAT1 was not annotated as a member of any existing GT family in the CAZy database 34 . RGGAT1 has now been included as a member of the new family GT116 as a result of the GalA transferase activity presented here. At least 154 plant species and 143 bacterial species listed in the Pfam database have additional uncharacterized sequences containing a GT116 domain (previously DUF616) 21 . While some of the members of this family may function in pectin biosynthesis, a broad range of substrate utilization can exist within a single GT family. Rather than being grouped by substrate specificity, enzymes within a given GT family are predicted to share a similar overall structural fold 34 .
Despite the large number of GT families, glycosyltransferases have generally been found to belong to one of three different structural fold types 33 . The most common fold type, GT-A, includes the GT8 family that contains the GAUTs. We were interested in determining whether RGGAT1 is also predicted to share this fold as a basis for future studies on the structures and mechanisms of the pectin biosynthetic machinery. The GT-A fold shares elements of secondary structure that are highly conserved across many families, including a series of alternating α-helices and β-sheets that make up a Rossmann-like domain and four landmark active site motifs (DxD, G-loop, xED and C-His) 35 . Because RGGAT1 shares limited sequence similarities with other GT sequences, generating an accurate primary sequence alignment for the comparison of these motifs is difficult. Thus, we first used a sequence alignment-independent deep-learning-based method that was recently developed to determine GT fold type on the basis of primary sequence information using a module trained on nearly 50,000 GT sequences 36 .
In contrast with typical methods of sequence or structural alignment, this method recognizes patterns of conserved secondary structure shared within the GT fold classes and uses these common elements for GT fold prediction. By applying this method to 678 representative GT116/DUF616 sequences, the family of proteins was predicted to adopt a GT-A fold with high confidence (Extended Data Fig. 9). On the basis of the prediction that RGGAT1 contains structural features representative of the broad GT-A fold, we used AlphaFold2 (v2.0.1) 37 to model the structure of RGGAT1. The resulting protein structural model was generated with high confidence and conformed to the general structural features of a GT-A fold domain (Supplementary Fig. 2). A structural comparison with a well-characterized GT-A fold from a GT31 family protein 38 validates the prediction that the GT116 family of enzymes conforms to a GT-A fold, with a core alignment to the GT31 protein structure at a 3.2 Å root-mean-square deviation (RMSD) (Fig. 6a). The alignment has highest structural similarity in the secondary structural elements that are specific to this fold type, which include several α-helices and β-sheets of the Rossmann domain. The aligned structural model predicts that three of the GT-A fold common core conserved motifs are positioned into a putative active site (Fig. 6b). The DxD motif (DGK in RGGAT1), xED motif (RDQ) and G-Loop (EGC) are regions with substrate-binding and catalytic functions, but variations occur across the many GT-A fold families and contribute to the mechanistic diversity observed in this enzyme superfamily 35 . Additional variations occur in the hypervariable regions, which are regions of secondary structure that are specific to individual GT families and may contribute to the binding of acceptor substrates (Fig. 6b).
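For readers unfamiliar with the RMSD metric quoted above, the sketch below shows how it is computed for two sets of matched atom coordinates. It is illustrative only: it assumes the two structures are already superposed and that residues are already paired, whereas the alignment in this study was produced by a structure-matching tool that handles both steps. The function name and toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation (same units as the input, e.g. Å)
    between two equal-length lists of (x, y, z) coordinates.

    Assumes the coordinate sets are already superposed and paired
    one-to-one; no fitting is performed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        squared_sum += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Toy example: two atoms, each displaced by 3 Å along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
template = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
print(f"RMSD: {rmsd(model, template):.1f} Å")
```

An RMSD of a few ångströms over a restricted core, as reported here, is the kind of value expected when the overall fold is shared but loops and family-specific regions diverge.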
The classification of RGGAT1 as part of a new GT family suggested that GT116 is phylogenetically distinct from the existing GT families. We evaluated the phylogeny using a structure-based sequence alignment of RGGAT1 and related GT116 sequences to previously published GT-A profiles 35 . This analysis revealed that RGGAT1/GT116 does not group into existing GT-A clades. The observed metal-independent activity (Fig. 4e), which is uncommon among GT-A fold enzymes, is consistent with this family being divergent from other GT-A families with the same overall structural fold.
The AlphaFold2 structure of RGGAT1 was used to predict candidate residues with roles in binding to the donor and acceptor substrates. Molecular docking was performed with UDP-GalA and an RG-I (R) DP12 oligosaccharide substrate (Fig. 6c). Notably, a lysine residue (K363) was found to be well-positioned to interact with the diphosphate group of UDP-GalA. K363 is part of the DGK motif, which replaces the DxD motif that is normally highly conserved in GT-A fold enzymes 35 and has a crucial role in coordinating metal ions, typically Mn2+, that interact with the diphosphate common to nucleotide sugar donors 33 . Several additional interactions with UDP-GalA were also predicted from the docked structure, including D361 (also part of the DGK motif), K344, R393, D472 (part of the RDQ motif that replaces the canonical xED motif) and H508. One of the residues of the hypervariable region (K392) was predicted to interact with a carboxyl group of a GalA residue within the RG-I acceptor oligosaccharide.
The Arabidopsis genome includes 8 sequences with GT116/DUF616 domains (Extended Data Fig. 10). Using 9 plant species, including Arabidopsis, 4 ancestral lineages, and 4 angiosperms with applications as agricultural crops and biomass feedstocks, a phylogenetic tree was created from a total of 77 protein sequences containing GT116/DUF616 domains (Fig. 7a). This analysis expands on a similar previously created phylogenetic tree 20 , but here the 8 Arabidopsis sequences are grouped into 5 distinct clades with the inclusion of sequences from additional species. The GT116 family represents putative GTs that may be predicted to also have RG-I:GalAT activity, but proof of function in RG-I synthesis for other family members will require confirmation of enzyme activity.
One possible reason for the existence of an expanded GT family is that members with similar catalytic activities have a tissue-specific functional specialization. The availability of RNA-seq genome-wide Arabidopsis expression data has enabled facile analysis of differential gene expression 39 . The expression of the eight Arabidopsis GT116 family members was compared in six different tissues representing both early developmental and mature stages (Fig. 7b). RGGAT1 and several of the GT116 family members are expressed broadly in plant tissues. Combined with the observation that lower plant paralogues of RGGAT1 are also present in Clade A (Fig. 7a), RGGAT1 is likely to function beyond mucilage synthesis in other tissues. Two genes from Clade B, At4g09630 and At1g34550 (EMB2756), are highly expressed in seedlings and mature tissues (Fig. 7b). The enzymes coded for by these genes have relatively large amino acid chain lengths of 711 and 735 residues, respectively (Extended Data Fig. 10), suggesting that they have an expanded domain structure that could facilitate possible interactions with other RG-I biosynthetic enzymes or complex glycan acceptors. Due to the high expression profiles, these Clade B genes are putative targets for RG-I biosynthesis in other cell types and developmental stages.

Discussion
Pectins are a heterogeneous family of cell wall polysaccharides that have proven challenging to define as functional macromolecules. For most plant tissues, pectins are extracted as heteropolysaccharides composed of distinct domains that require enzymatic or chemical digestion for isolation 40,41 . One of the difficulties associated with the study of pectins is the existence of different backbone and side-chain structures. More than 60 individual transferase activities have been estimated to be necessary for synthesis of the full range of pectic glycan linkages 1 . Understanding the scope of plant cell wall biosynthetic machinery is further complicated by the existence of large families of GTs with sometimes redundant catalytic activities 42 .
All pectic polymers contain a homogalacturonan backbone (HG; repeating unit [-4-d-GalA-α-1-]) or rhamnogalacturonan backbone (RG-I; repeating unit [-4-α-d-GalA-1,2-α-l-Rha-1-]). Synthesis of the HG backbone is catalysed by at least six members of the GALACTURONOSYLTRANSFERASE (GAUT) family (GAUT1, 4, 10, 11, 13, 14 and the GAUT1:GAUT7 complex) 2,5,8,23 . The present work establishes that the α-1,2-GalA transferase that catalyses biosynthesis of the RG-I backbone is a novel GalAT and a founding member of family GT116. The enzyme was annotated in previous studies as MUCILAGE-RELATED70; a new name, RHAMNOGALACTURONAN GALACTURONOSYLTRANSFERASE1 (RGGAT1), is proposed here to distinguish this activity from the HG biosynthetic activity of the GAUT family. Genes homologous to RGGAT1 were identified in ancestral plant lineages and modern crops of industrial and agricultural interest (Fig. 7a), providing opportunities to study RG-I synthesis in plant species beyond the model organism Arabidopsis.
Two previous publications describing muci70 gene mutants yielded results that are consistent with the revelation that RGGAT1/MUCI70 functions in RG-I biosynthesis. The levels of RGGAT1/MUCI70 transcript measured from silique tissues were reduced by at least 60% in two T-DNA insertion mutant lines (muci70-1 and muci70-2) 20 . These knockdown mutants resulted in a reduction of the surface area of the mucilage layer released on hydration of seeds and at least a 50% reduction of both GalA and Rha from total mucilage 20 . A study of the macromolecular properties of Arabidopsis seed mucilage identified several natural variants containing single nucleotide polymorphisms in RGGAT1/MUCI70 that resulted in reduced molar mass of the mucilage polysaccharide 24 . Water-extracted mucilage, which has been shown to be mostly composed of a >600 kDa polysaccharide of unbranched RG-I 15 , was reduced in molar mass by >70% in the muci70-1 and muci70-2 mutants 24 . These mutants are transcriptional knockdowns of an RG-I:GalAT, but the reduction of this activity does not appear to have been compensated by the presence of up to 7 other putative RG-I:GalATs. This lack of compensation suggests that RGGAT1/MUCI70 may be functionally specialized for the production of the high-molecular-weight RG-I polysaccharides specifically synthesized by seed epidermal cells, but the expression analysis suggests that RGGAT1 also functions to synthesize RG-I within other tissues (Fig. 7b).
The seed mucilage phenotypes of muci70 mutants were instrumental in the discovery that RGGAT1 functions in RG-I biosynthesis in seed mucilage, but RG-I also exists broadly in other plant tissues 9 . RNA-seq data obtained from the database Transcriptome Variation Analysis (TraVA) indicate that some GT116 family members are transcribed in all Arabidopsis tissues 39 . One member of the family, At1g34550, is included within a curated dataset for mutants that result in an 'embryo-defective' phenotype 43 . These mutants are classified by the production of defective seeds due to arrested embryonic development. The goal of establishing the EMB dataset, currently containing 510 EMBRYO-DEFECTIVE genes, was to identify the minimal set of genes necessary for plant growth and development 43,44 . The EMBRYO-DEFECTIVE gene EMB2756, corresponding to At1g34550, encodes a GT116-domain protein that is a putative RG-I:GalAT. At the time of the original study, EMB2756 was classified as a protein of unknown function 44 . If EMB2756 is an RG-I:GalAT, then the 'embryo-defective' phenotype of the emb2756 mutant suggests that RG-I synthesis is an essential cellular function necessary for the completion of embryonic development.
[…] AlphaFold2 structure (red and green) to a GT31 structure (grey) (pdb: 6wmo) at an RMSD of 3.2 Å across 155 residues, validating that the 3D structure matches a GT-A fold topology. The secondary structure matching algorithm 64 in the molecular graphics programme Coot 0.9.7 was used to produce an alignment that was restricted to the core Rossmann α-helices (red) and β-sheets (green) shown as opaque structures. b, The AlphaFold2 structure of RGGAT1 has elements of the canonical GT-A fold structure, including β-sheets of the Rossmann fold (green), α-helix F (dark red) and three conserved motifs of the GT-A fold core (xED, G-Loop and DxD motifs, blue). Hypervariable region 2 (HV2, orange) has helices that are poorly aligned to the template. c, Docked structure of RGGAT1 with the donor UDP-GalA and acceptor RG-I (R) DP12 oligosaccharide. Selected amino acids predicted to interact with the donor and acceptor substrates are shown in stick representation. Dashed lines represent putative hydrogen bonding interactions within 2.7-3.4 Å distance. Residues in blue indicate putative GT-A motifs. Residues in orange are residues present on the HV2 region in contact with the acceptor. Based on a retaining mechanism, the acceptor nucleophile (red sphere) is deprotonated by the β-phosphate oxygen of the UDP donor, allowing a nucleophilic attack on the anomeric carbon of the GalA (yellow sphere) 33 . The side-chain amine of K363 (blue sphere) may function in place of a divalent cation to interact with the nucleotide phosphate diester of UDP-GalA.
Because the other members of the GT116 family have not yet been shown to have RG-I:GalAT activity, we have not proposed changing the gene annotations of the other family members to RGGAT. The GT116 family is complicated by the existence of the family member At5g46220 (TOD1), which has previously been identified as having alkaline ceramidase activity 45 . In addition to the substrates being lipids rather than polysaccharides, this activity has notable differences from the activity of RGGAT1, such as being calcium-dependent and having an optimal pH of 9.5 45 . Of the eight family members, TOD1 shares the least sequence identity with RGGAT1 (Extended Data Fig. 10), but it does have a characteristic DGK motif that was found to distinguish the GT116 family from other GT-A fold families. On identifying TOD1 as an alkaline ceramidase, Chen et al. noted that TOD1 has low sequence similarity to alkaline ceramidases from other organisms, including mammals and Saccharomyces cerevisiae 45 . Future studies will be needed to determine whether At5g46220 (TOD1) is a GT116 family member with RG-I:GalAT activity or whether it should be categorized as a separate family of alkaline ceramidases.
With the identification of RGGAT1, it is now possible to compare the catalytic properties of RG-I and HG backbone biosynthesis. Comparison of the in vitro synthesis rates reveals that for both backbones, the rate of transfer of GalA is dependent on the chain length of the acceptor. For RGGAT1, increases in activity for acceptors of lengths greater than DP8 appear to be due to an increased affinity of the enzyme for the longer-chain acceptors, as represented by an estimated ~10-fold lower Km value for DP ≥ 8 acceptors. In a study of HG biosynthesis by the GAUT1:GAUT7 complex, transfer to short-chain HG acceptors (DP ≤ 7) was also marked by reduced catalytic efficiency relative to acceptors of increased chain length (DP ≥ 11) 6 . Measurements of the reaction kinetics of pectin biosynthesis have been limited due to the resource-intensive requirement for purified acceptor substrates.
However, the results provided here suggest that RG-I elongation has a mechanism consistent with the previously discovered mechanism for HG 6 in which the transition to longer-chain oligosaccharides (approximately 10 sugar units) represents a considerable increase in the catalytic rate.
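The effect of a lower Km on the rate at a fixed, subsaturating acceptor concentration can be sketched with the Michaelis-Menten equation. This is a minimal illustration only: the Vmax and Km values below are hypothetical placeholders chosen to show a 10-fold difference in Km, not parameters measured in this study, and Vmax is assumed to be the same for both acceptors.

```python
def michaelis_menten_rate(substrate_um, vmax, km_um):
    """Initial rate v = Vmax * [S] / (Km + [S])."""
    return vmax * substrate_um / (km_um + substrate_um)

# Hypothetical parameters for illustration only (not values from this study).
VMAX = 1.0            # arbitrary rate units, assumed equal for both acceptors
KM_SHORT_UM = 1000.0  # assumed Km for a short-chain acceptor
KM_LONG_UM = 100.0    # assumed 10-fold lower Km for a long-chain acceptor
ACCEPTOR_UM = 100.0   # fixed acceptor concentration

rate_short = michaelis_menten_rate(ACCEPTOR_UM, VMAX, KM_SHORT_UM)
rate_long = michaelis_menten_rate(ACCEPTOR_UM, VMAX, KM_LONG_UM)
fold_increase = rate_long / rate_short
print(f"rate increase for the lower-Km acceptor: {fold_increase:.1f}-fold")
```

Under these assumed values, the gain in rate is smaller than the 10-fold change in Km because the acceptor concentration is comparable to the lower Km; the gain approaches 10-fold only when the acceptor concentration is far below both Km values.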
The metal-independent catalysis by RGGAT1 is unusual for GT-A fold enzymes 33 , and contrasts with the Mn2+-dependent catalysis by galacturonosyltransferases involved in HG biosynthesis 6,8 . The metal-independent activity is shared by the RRT-family enzymes 16,17 , allowing both GalAT and RhaT activities of RG-I backbone polymerization to occur without the addition of exogenous metal cations and suggesting a common feature of enzymes involved in RG-I biosynthesis. The DxD motif (Asp-x-Asp), which is highly conserved in metal-dependent GT-A fold enzymes 33 , is changed to Asp361-Gly-Lys363 in RGGAT1. Partial loss of the DxD motif was also found to have occurred in a GT from Bacteroides ovatus, BoGT6a, a member of one of the GT-A fold families with a metal-independent mechanism of catalysis 46 . This study of RGGAT1 illustrates the power of using new deep-learning-based tools 36 to investigate the relationships between anomalous mechanistic properties and predicted structures of newly discovered GTs. The initial structural model for RGGAT1 has provided a template for future studies of the unique catalytic properties of this enzyme and its divergence from related GT families.
Recent efforts have focused on discovering mechanisms by which pectin consumption contributes broadly to human health through proposed roles in metabolic pathways, including immune system function and cholesterol metabolism 47,48 . Because differences such as polymer size and sugar composition are likely to affect the bioactivity of pectins as components of dietary fibre, increasing emphasis has been placed on the need to develop methods to purify pectic glycans with reduced biological variability 41,49,50 . The development of in vitro tools for the controlled synthesis of pectic glycans presents an avenue for production of pure substrates for use in biological studies of pectin function.
Controlled chemoenzymatic methods have the potential for broader glycobiology applications, as similar methods have been explored for the synthesis of oligosaccharide domains for use in glycoconjugate vaccines 51 . Continued improvements to heterologous expression systems will allow for higher-yield purifications of GTs and the potential to expand the current capabilities of in vitro polysaccharide synthesis.

Extraction of Arabidopsis mucilage and purification of RG-I oligosaccharide acceptors
Arabidopsis mucilage used as the source of RG-I oligosaccharides was extracted using a scaled-up version of the protocol outlined previously 15 . Arabidopsis wild-type (Col-0) seeds (10 g total) were divided among five 50 ml conical tubes (2 g per tube) and mixed with deionized water to a total volume of 40 ml. Non-adherent mucilage was extracted by head-over-tail mixing for 3 h. The mixture of seeds and water containing extracted mucilage was centrifuged (2,500 × g, 5 min) and the water removed. The seeds were washed with water by mixing for 10 min and the water recovered after centrifugation. The extracted mucilage and water washes (600 ml total) were filtered using a polycarbonate filter of 3 µm pore size (Osmonics) and the filtrate lyophilised. Dry mucilage was resuspended in water at 10 mg ml−1 .
Recombinant rhamnogalacturonan hydrolase (RGase A from Aspergillus aculeatus) was obtained as a gift from Novo Nordisk as previously described 30 . Resuspended mucilage was digested under a range of RG-I hydrolase concentrations (0.01-1.0 µg ml −1 ) at 40 °C in a sodium acetate buffer (20 mM, pH 5.0). The resulting oligosaccharide mixtures were visualized by high-percentage PAGE and stained with a combination of alcian blue and silver staining (described below). For the scaled-up preparation of RG-I oligosaccharides, 50 mg mucilage was digested in a reaction containing 0.2 µg ml −1 RG-I hydrolase for 21 h in 6.5 ml total volume. This mixture was boiled to terminate the hydrolase reaction, dialysed against water using a 3,500 Da cut-off membrane (SpectraPor) and lyophilised. The resulting oligosaccharides contained Rha at the non-reducing end and were designated RG-I (R).
Resuspended mucilage was also digested by acid hydrolysis using 0.1 M hydrochloric acid at 80 °C for up to 48 h. The resulting oligosaccharide mixture was visualized by high-percentage PAGE, as above. The digested oligosaccharides were neutralized by addition of 0.1 M sodium hydroxide, dialysed against water using a 3,500 Da cut-off membrane and lyophilised. The resulting oligosaccharides contained GalA at the non-reducing end and were designated RG-I (G).
The lyophilised mixture of digested RG-I oligosaccharides was fluorescently labelled on the reducing end by resuspending at 10 mg ml −1 in 10% acetic acid containing 0.2 M 2-aminobenzamide (2AB) and 1 M sodium cyanoborohydride 31 . The mixture was incubated at 45 °C for 16 h, dialysed against water using a 3,500 Da cut-off membrane and lyophilised. After resuspension in water, the concentration of 2AB-labelled oligosaccharides was determined using UV-visible spectroscopy (Nanodrop) with a molar absorptivity coefficient for 2AB at 330 nm of 2,500 M −1 cm −1 . RG-I oligosaccharides were separated using a semi-preparative CarboPac PA-1 column (22 × 250 mm) connected to a Dionex system with fluorescence detection (excitation 330 nm, emission 420 nm). Peaks containing RG-I oligosaccharides ranging from DP6 to DP18 were separated using an ammonium formate gradient. Peaks enriched for the target oligosaccharides eluted as the ammonium formate concentration increased from 350 mM to 450 mM. Samples containing up to 10 µmol of RG-I oligosaccharides were injected into the system for semi-preparative scale purification. Individual peaks containing homogenous RG-I oligosaccharides were collected, dialysed against water using a 3,500 Da cut-off membrane and lyophilised. The purity of the collected fractions containing RG-I oligosaccharides was assessed by an analytical-scale injection of 5 nmol into a CarboPac PA-1 column (4 × 250 mm) and matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF-MS).
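The conversion from absorbance to oligosaccharide concentration follows the Beer-Lambert law. The sketch below is a minimal illustration using the molar absorptivity stated above; it assumes one 2AB label per oligosaccharide (reducing-end labelling) and an absorbance reading normalized to a 1 cm path length, and the function name and example reading are hypothetical.

```python
EPSILON_2AB_330NM = 2500.0  # M^-1 cm^-1, molar absorptivity of 2AB at 330 nm (from the text)

def labelled_oligo_concentration_um(absorbance_330, path_length_cm=1.0, dilution_factor=1.0):
    """Beer-Lambert law: c = A / (epsilon * l), returned in micromolar.

    Assumes one 2AB label per oligosaccharide and an absorbance value
    normalized to the given path length (1 cm by default).
    """
    if path_length_cm <= 0:
        raise ValueError("path length must be positive")
    molar = absorbance_330 / (EPSILON_2AB_330NM * path_length_cm)
    return molar * dilution_factor * 1e6

# Hypothetical reading: A330 = 0.25 at 1 cm path length, undiluted sample.
example_um = labelled_oligo_concentration_um(0.25)
print(f"{example_um:.1f} µM 2AB-labelled oligosaccharide")
```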

High-percentage PAGE
RG-I oligosaccharides were separated over a 30% acrylamide resolving gel (38 mM Tris, pH 8.8). Samples ranging from 300 ng (homogeneously purified DP12-2AB oligosaccharide) to 10 µg (undigested mucilage) were mixed with loading buffer (100 mM Tris, pH 6.8, 0.01% phenol red and 10% glycerol) and loaded into a stacking gel (5% acrylamide, 64 mM Tris, pH 6.8). Current (25 mA) was applied for up to 90 min. The gel was soaked for 20 min in a fixative solution (40% methanol, 10% acetic acid) and stained for 2 h in a solution of 0.1% alcian blue in 40% ethanol. After staining, the gel was washed with at least three changes of water for a total of 12 h. Silver staining and developing was completed using a silver staining kit (Bio-Rad). Staining was terminated by addition of 5% acetic acid. Gel images were captured using Bio-Rad Image Lab 5.2.1.

MALDI-TOF mass spectrometry
Negative ion mode MALDI-TOF-MS spectra were acquired using an LT Bruker Microflex spectrometer. Nafion 117 solution (Sigma) was applied to a Bruker MSP 96 ground steel target. RG-I oligosaccharides labelled with 2AB were mixed 1:1 with a 20 mg ml −1 2,5-dihydroxybenzoic acid matrix solution in 50% methanol. Purified RG-I oligosaccharides were resuspended at a concentration of at least 20 µM for detection by MALDI-TOF-MS. Reaction samples containing 100 µM acceptor oligosaccharides were diluted 1:4 in water containing 100 mM ammonium hydroxide before mixing with the sample matrix. Ammonium hydroxide was added to hydrolyse any sugar lactone structures present in the purified RG-I oligosaccharides. Data were collected with Bruker Daltonik FlexControl 3.0 software.

Cloning and expression of recombinant glycosyltransferases in HEK293 cells
All glycosyltransferase constructs were cloned for heterologous expression in HEK293F cells as previously described 6,25 . The expression construct for MUCI70Δ77 was cloned for use in a previous study 20 . The sequences for five Arabidopsis proteins (MUCI70/RGGAT1, RRT1, RRT2, RRT3 and RRT4) were analysed using The Arabidopsis Plant Membrane Protein Database (Aramemnon) to identify putative N-terminal transmembrane domains on the basis of the consensus results from hydrophobicity prediction servers. Primers for PCR amplification of the protein coding sequences truncated by the N-terminal transmembrane domain were designed with overhanging universal sequences for attB sites to enable Gateway cloning. The template for PCR amplification was complementary DNA produced from RNA extracted from 7 d old Arabidopsis seedlings for MUCI70Δ77 20 and RNA extracted from Arabidopsis leaf tissue for RRT1Δ61, RRT2Δ62, RRT3Δ54 and RRT4Δ51. Following the first round of PCR to amplify the truncated gene sequence, a second round of PCR was done using PCR products as templates to insert Gateway cloning-specific sequences using Universal Primers.
PCR products were inserted into the Gateway entry vector pDONR221 by reaction with BP clonase (Invitrogen). Sequences were verified after insertion into vector pDONR221 using M13F and M13R primers. The coding sequences were inserted into the mammalian destination vector pGEn2 by reaction with LR clonase (Invitrogen). All primer sequences used are listed in Supplementary Table 1.
Following LR cloning, the expression constructs containing each truncated coding region in the pGEn2 expression plasmids were purified using Purelink HiPure Plasmid Gigaprep kits (Invitrogen). Fusion proteins were expressed in HEK293F cells and cell culture medium containing secreted proteins was collected after an incubation of 6 d. Secreted proteins were purified from the medium using Ni 2+ -NTA affinity chromatography with a column (HisTrap HP, GE Healthcare) equilibrated in 50 mM HEPES buffer, pH 7.2, with 20 mM imidazole. The column was washed and protein was eluted in steps containing 40 mM, 100 mM and 300 mM imidazole. The protein in the fraction eluted with 300 mM imidazole was dialysed into a storage buffer containing 50 mM MES, pH 6.5, using the metal-chelating ion resin Chelex-100. Protein was dialysed against two changes of storage buffer for 4 h each. Recovered protein was concentrated in centrifugal concentrator units with a 30 kDa cut-off (Amicon Ultra-15, Millipore). The protein concentration was determined using UV-visible spectroscopy (Nanodrop).
Protein purity was assessed using SDS-PAGE. An aliquot containing 5-10 µg of the purified protein was mixed with loading buffer at a final concentration of 20 mM Tris-HCl, pH 6.8, 2% SDS, 5% glycerol and 0.01% bromophenol blue. For reducing SDS-PAGE, 25 mM DTT was included in the loading buffer. Samples were boiled for 5 min to denature proteins before loading into the gel (MINI-PROTEAN 4-15% gradient gel, Bio-Rad). Proteins were detected by Coomassie blue staining. Gels were destained by mixing with a solution of 40% methanol and 10% acetic acid, followed by repeated washes with water.
Enzyme activity measurements completed using the UDP-Glo glycosyltransferase assay (Promega) were carried out according to the manufacturer's instructions. A standard curve of UDP concentration vs luminescence established the linear range of the assay to be 50 nM-20 µM. From each 20 µl reaction, 5 µl aliquots were removed and mixed 1:1 with the UDP detection reagent at the indicated times to stop the reactions. All activity measurements were in duplicate, and the data report the averages. Unless noted otherwise, all assays were replicated in three independent experiments. The luminescence reading was converted to µM UDP released on the basis of comparison to a UDP standard curve carried out in duplicate for each set of reaction samples. Data were acquired using BioTek Gen5 3.05.11 software and imported into Microsoft Office Excel 2007 for conversion calculations. The UDP-GalA donor substrate was incubated with calf intestinal alkaline phosphatase (CIAP, Promega) to remove residual UDP from the sample. CIAP was removed from the UDP-GalA preparation by centrifugation using a Microcon 10 kDa centrifugal filter unit (EMD Millipore). The filtrate was collected and concentrated to 10 mM as determined by UV-visible spectroscopy (Nanodrop 1000, Thermo Fisher, v3.7.1) with a molar extinction coefficient for UDP of 10,000 M −1 cm −1 at 260 nm. Nonlinear regression Michaelis-Menten kinetics analysis was performed using Graphpad Prism 9.0.2 for Windows (www.graphpad.com).
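The conversion of luminescence to UDP concentration can be sketched as a linear standard-curve fit followed by interpolation of the averaged duplicate readings. This is a minimal illustration, not the spreadsheet calculation used in the study: the standard concentrations and luminescence values are hypothetical, and only the linear range (50 nM-20 µM) is taken from the text.

```python
LINEAR_RANGE_UM = (0.05, 20.0)  # linear range of the assay stated in the text

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def luminescence_to_udp_um(duplicate_rlu, slope, intercept):
    """Average duplicate readings and convert to µM UDP via the standard curve."""
    mean_rlu = sum(duplicate_rlu) / len(duplicate_rlu)
    udp_um = (mean_rlu - intercept) / slope
    low, high = LINEAR_RANGE_UM
    if not low <= udp_um <= high:
        raise ValueError("reading falls outside the linear range of the assay")
    return udp_um

# Hypothetical UDP standards (µM) and luminescence readings (relative light units).
standards_um = [0.05, 0.5, 5.0, 20.0]
standards_rlu = [250.0, 700.0, 5200.0, 20200.0]
slope, intercept = fit_line(standards_um, standards_rlu)

# Hypothetical duplicate readings from one reaction time point.
udp_released_um = luminescence_to_udp_um([10150.0, 10250.0], slope, intercept)
print(f"UDP released: {udp_released_um:.2f} µM")
```

Rejecting readings outside the linear range, rather than extrapolating, keeps the conversion within the region where the standard curve was validated.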
All RG-I synthesis activity measurements using anion exchange chromatography were done under similar conditions. All samples were boiled at the indicated time points to stop the reaction. From each 30 µl reaction, an aliquot of 25 µl containing the equivalent of 2.5 nmol total DP12-2AB acceptor was mixed with water and 100 mM ammonium hydroxide to a total volume of 1 ml. Ammonium hydroxide was included before injection to hydrolyse sugar lactone structures that resolve as peaks in the chromatogram in addition to the parent RG-I oligosaccharide structure. The sample was injected into a CarboPac PA-1 (4 × 250 mm) column and resolved using an ammonium formate gradient. From 5 to 45 min, the ammonium formate concentration was increased from 200 mM to 600 mM, resulting in a DP12-2AB acceptor with a retention time of 23 min and a DP13-2AB product with a retention time of 27.4 min. 2AB-labelled acceptors and reaction products were detected using an RF-2000 fluorescence detector set to high sensitivity. The peak areas of the DP12 acceptor and DP13 products were measured using Chromeleon 6.80. Percentage of acceptor conversion was calculated on the basis of the proportion of product peak area to total combined peak area of acceptors and products.
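The percentage of acceptor conversion described above is the product peak area divided by the combined area of acceptor and product peaks. A minimal sketch of that calculation follows; the function name and the peak areas are hypothetical, and equal fluorescence response per 2AB-labelled molecule is assumed.

```python
def percent_acceptor_converted(acceptor_area, product_areas):
    """Percentage conversion = product area / (acceptor + product area) * 100.

    Assumes each 2AB-labelled species gives the same fluorescence response
    per molecule, so peak area is proportional to molar amount.
    """
    product_total = sum(product_areas)
    total = acceptor_area + product_total
    if total <= 0:
        raise ValueError("total peak area must be positive")
    return 100.0 * product_total / total

# Hypothetical integrated peak areas for a DP12 acceptor and a DP13 product.
conversion = percent_acceptor_converted(acceptor_area=75.0, product_areas=[25.0])
print(f"{conversion:.1f}% of acceptor converted")
```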
For the test of metal dependence of RGGAT1, enzyme at a concentration of 4 µM was diluted into a mixture of MES buffer, pH 6.5, containing 10 mM of either EDTA, MnCl 2 or no additive. This mixture was incubated at room temperature for 30 min. Enzyme from this mixture was diluted in MES buffer and added to reactions containing a total EDTA concentration of 10 mM, a total MnCl 2 concentration of 0.25 mM or no additives and incubated under standard reaction conditions for 10 min. For assays containing alkaline phosphatase (Potato apyrase, Sigma A6535) to reduce inhibitory UDP formed during the reaction, a total of 0.2 U of the phosphatase was added to the reaction.

RG-I polymerization reactions
In vitro polymerization of RG-I was completed using conditions described for RG-I:GalAT activity with the following modifications. Two enzymes, RGGAT1 and RRT4, at concentrations of 5 µM were mixed with two nucleotide sugar donors, UDP-GalA and UDP-Rha, and with an RG-I oligosaccharide acceptor. For the detection of reaction products using MALDI-TOF-MS, UDP-GalA and UDP-Rha at a concentration of 1 mM and an RG-I (G), DP12-2AB acceptor at a concentration of 100 µM were incubated in a total volume of 20 µl. For the detection of reaction products by alcian blue-stained PAGE and size-exclusion chromatography, UDP-GalA and UDP-Rha at a concentration of 2.5 mM, an RG-I (R), DP12-2AB acceptor at a concentration of 25 µM, and 1 U potato apyrase (Sigma) were incubated in a total volume of 120 µl. Each reaction was incubated for the indicated time (0-12 h) and boiled. An aliquot of 6 µl of the full reaction, representing 300 ng of the starting DP12-2AB acceptor, was removed and mixed with loading buffer for detection by alcian blue-stained PAGE. An aliquot of 100 µl from the full reaction, representing 5 µg of the starting DP12-2AB acceptor, was removed and injected into a Superdex 75 10/300 GL column attached to an Agilent 1260 Infinity II high-pressure liquid chromatography system at a flow rate of 0.5 ml min−1 of 50 mM ammonium formate. RG-I products were detected by multi-angle light scattering coupled with size-exclusion chromatography (SEC-MALS) as described for the determination of RG-II molecular mass 52 . Detection was performed using an Optilab t-rEX differential refractometer (Wyatt Technology) connected in series with a Dawn Heleos 8 MALS detector. The molecular mass was calculated using a dn/dc value of 0.122 ml g−1 . Data were processed using ASTRA 7 software (Wyatt Technology).
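The acceptor amounts quoted for the PAGE and SEC aliquots can be checked with a short calculation. The sketch below is illustrative: the molar mass of roughly 2,000 g mol−1 for the DP12-2AB acceptor is an assumption inferred from the stated amounts, not a value given in the text, and the function name is hypothetical.

```python
ASSUMED_DP12_2AB_MOLAR_MASS = 2000.0  # g/mol, assumption inferred from the stated amounts

def aliquot_mass_ng(concentration_um, volume_ul, molar_mass_g_per_mol):
    """Mass of acceptor in an aliquot.

    µmol/l * µl = pmol; pmol * g/mol = pg; divide by 1000 to obtain ng.
    """
    picomoles = concentration_um * volume_ul
    return picomoles * molar_mass_g_per_mol / 1000.0

# 25 µM acceptor: 6 µl aliquot for PAGE and 100 µl aliquot for SEC-MALS.
page_ng = aliquot_mass_ng(25.0, 6.0, ASSUMED_DP12_2AB_MOLAR_MASS)
sec_ng = aliquot_mass_ng(25.0, 100.0, ASSUMED_DP12_2AB_MOLAR_MASS)
print(f"PAGE aliquot: {page_ng:.0f} ng; SEC aliquot: {sec_ng / 1000.0:.1f} µg")
```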

Prediction of DUF616 structural fold
Representative DUF616 sequences were collected using PSI-BLAST with the A. thaliana RGGAT1 sequence as query and a stringent e-value cut-off; 679 representative sequences were selected. These sequences were then passed through the fold prediction pipeline previously described 36 . In brief, NetSurfP2.0 53 was used to predict secondary structures for the 679 sequences. The 3-state secondary structure prediction results were then evaluated using the deep-learning model to calculate the reconstruction errors and the fold assignment score. On the basis of these scores, the final fold prediction was made. The average reconstruction error was well below the 95% confidence interval limit of 0.107, indicating that GT116 adopted a known fold, and the fold assignment score was positive and highest for the GT-A fold, indicating that members of the GT116 family adopt a GT-A fold.
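The final decision step described above (a reconstruction-error check followed by selection of the highest positive fold assignment score) can be sketched as follows. This is a simplified illustration of the decision logic only, not the published deep-learning pipeline: the function name, the per-sequence errors and the scores are hypothetical, and only the limit of 0.107 is taken from the text.

```python
RECONSTRUCTION_ERROR_LIMIT = 0.107  # 95% confidence interval limit stated in the text

def assign_fold(reconstruction_errors, fold_scores):
    """Return (fold, mean_error); fold is None if no known fold is supported.

    A known fold is accepted only if the mean reconstruction error is below
    the limit and the best fold assignment score is positive.
    """
    mean_error = sum(reconstruction_errors) / len(reconstruction_errors)
    if mean_error >= RECONSTRUCTION_ERROR_LIMIT:
        return None, mean_error
    best_fold = max(fold_scores, key=fold_scores.get)
    if fold_scores[best_fold] <= 0:
        return None, mean_error
    return best_fold, mean_error

# Hypothetical per-sequence errors and per-fold scores, for illustration only.
errors = [0.04, 0.05, 0.06]
scores = {"GT-A": 0.8, "GT-B": -0.2, "GT-C": -0.5, "GT-lyso": -0.4}
fold, mean_error = assign_fold(errors, scores)
print(fold, round(mean_error, 3))
```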

Generation of the RGGAT1 predicted model using AlphaFold2
A local version of AlphaFold2 (v2.0.1) 37 was used to generate models for the RGGAT1 GT-A domain. After an additional relaxation using Rosetta (v3.9) minimization 54 , a structural comparison was performed in PyMOL using the ceAlign 2.5 algorithm with a well-studied GT31 domain (pdb: 6wmo) 55 to validate that the RGGAT1 sequence indeed formed a GT-A fold enzyme.

Phylogenetic comparison of GT-A fold families
An expansive GT-A phylogenetic tree was previously published, providing evolutionary relationships between GT-A enzymes 35 . With this new enzyme family, we sought to update the tree. As this new GT-A family is highly divergent, it failed to map to previously published profiles of GT-A sequences. Thus, we opted to use the highest ranked AlphaFold2 predicted structure and performed a structure-based sequence alignment, comparing it with a GT31 domain. We additionally ran BLAST using the RGGAT1 sequence to collect divergent RGGAT1 sequences and create a consensus RGGAT1 sequence. We then added the RGGAT1 consensus and the AlphaFold2 sequence to the profile, and manually aligned the sequences on the basis of the structural alignment. As the three-dimensional (3D) topologies aligned quite well, we were able to integrate the RGGAT1 consensus sequence into the existing GT-A profiles. To verify that the profiles were accurate, we ran the software MapGaps 2.1 56 , which picks up sequences that match a constructed profile, and found that the profile matched RGGAT1 sequences.

Sequence analysis of DUF616-domain enzymes
The amino acid sequence for At1g28240 was searched using Pfam (https://pfam.xfam.org/). The corresponding family, DUF616 (PF04765), was sorted by species for A. thaliana. Redundant sequences were removed by manual curation. The amino acid position of the DUF616 domain for each of the eight unique Arabidopsis sequences was identified by searching individual sequences in Pfam. The amino acid sequence for At1g28240 was entered as a query sequence against A. thaliana (taxid:3702) using Protein BLAST (blast.ncbi.nlm.nih.gov). Each of the eight DUF616-domain-containing target sequences was identified. The residues aligned to the query, query coverage percentage, amino acid sequence identity percentage and sequence similarity percentage were taken from the BLAST results report.

Phylogenetic tree of DUF616-domain enzymes
The amino acid sequences for DUF616-domain proteins from 9 plant species were obtained from Phytozome v13 57 . The Biomart tool was used to extract protein sequences containing Pfam ID PF04765 from each selected species. In cases where more than one sequence was identified for each individual gene annotation, redundant sequences were removed by manual curation to create a list of 77 sequences from 9 species. MEGA11 software 58 was used to create an alignment, compute the best substitution model and construct a maximum-likelihood tree. The phylogenetic analysis was completed following the guidelines for this software as previously published 59 . The MUSCLE method was used for primary sequence alignment, and the LG model (G+I) with 500 bootstrap replicates was used. All data were imported into the Interactive Tree of Life tool for visualization 60 .

RNA-seq expression analysis
Average expression values were obtained from TraVA 39 (http://travadb.org/) as absolute read counts normalized using the median-of-ratio method. Expression values from selected tissues (germinating seeds 3, whole mature leaf, root without apex, flower 3, silique 8 and seeds 7) were used for comparison. A heat map corresponding to the mean expression values was plotted using the 'pheatmap' package in RStudio 2022.02.3.
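For readers unfamiliar with median-of-ratio normalization, the general technique can be sketched as follows. This is a minimal illustration of the standard method (per-sample size factors taken as the median ratio of each gene's count to its geometric mean across samples); it is not the TraVA implementation, whose details are not described here, and the gene names and counts are hypothetical.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factor for each sample.

    counts maps gene -> list of raw counts, one per sample. Genes with any
    zero count are skipped because their geometric mean is zero.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if any(c <= 0 for c in gene_counts):
            continue
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]

def normalize(counts):
    """Divide each sample's counts by that sample's size factor."""
    factors = size_factors(counts)
    return {g: [c / f for c, f in zip(v, factors)] for g, v in counts.items()}

# Hypothetical counts: sample 2 was sequenced at twice the depth of sample 1.
raw = {"geneA": [10, 20], "geneB": [100, 200], "geneC": [50, 100]}
factors = size_factors(raw)
normalized = normalize(raw)
print(factors, normalized["geneB"])
```

In this toy input the two samples differ only in sequencing depth, so normalization brings every gene to the same value in both samples.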

Molecular docking
Molecular docking studies were conducted on the AlphaFold2 protein model. We generated the acceptor substrate with the GLYCAM carbohydrate builder tool (GLYCAM Web, http://legacy.glycam.org). The donor substrate, UDP-GalA, was acquired from a UDP-phosphorylase crystal structure (pdb: 3OH1). The grid and docking parameters were created using AutoDock Tools 61 . Molecular docking was performed using AutoDock Vina with the Vina-Carb scoring function to treat carbohydrate molecules 62 using an 80 Å³ grid placed at the centre of the active site. After docking each molecule, the top scoring conformations were analysed together.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All data generated or analysed during this study are included in this published article (and its supplementary information files) or are available from the corresponding author upon request. UDP-GalA structure was accessed from Protein Data Bank: 3OH1 (https://www.rcsb.org/structure/3OH1). Plant genome sequences were accessed from Phytozome v13 57 .

Code availability
The code used to predict the glycosyltransferase fold structure was a deep-learning framework previously described in ref. 36 , available at https://www.nature.com/articles/s41467-021-25975-9. The published version of the code with the manuscript is available at https://doi.org/10.5281/zenodo.5173136.

Extended Data Fig. 1 | Expression of MUCI70Δ77 in HEK293 cells. MUCI70Δ77 was expressed in a total of two small-scale (20 mL) and one large-scale (250 mL) cultures. Total protein is the total GFP fluorescence from cells plus culture medium. Secreted protein is the fluorescence of cell-free medium. All samples were taken from a 100 µL aliquot from the cell culture after 6 days. MUCI70Δ77 was expressed with 93% secretion efficiency, defined as the proportion of secreted protein to the total protein fluorescence. Error bars represent the standard deviation from three biological replicates.

[…] a, Comparison of RGGAT1 activity using two independent methods. For anion exchange, percentage of acceptor converted was calculated based on the relative proportion of the peaks for the DP12 (R) acceptor and DP13 (G) in the fluorescence chromatogram. For UDP-Glo, activity was measured as a function of UDP released in a 10 min assay containing 1 mM UDP-GalA and 100 µM acceptor. This activity value was presented as "percentage of acceptor converted" based on the conversion that 1 µM UDP released is equal to conversion of 1% of the starting DP12 (R) acceptor to a DP13 (G) product. Reactions contained 50 nM enzyme. Error bars represent the standard deviation from three independent experiments. b, Progress curve of activity using UDP-Glo. In all assays, each point represents the average of duplicate luminescence readings. The blue (assay with 1 mM UDP-GalA) and red (assay with 100 µM UDP-GalA) lines represent the average activity from three independent assays containing 50 nM enzyme. The results from independent assays are shown as individual points. c, Percentage of acceptor conversion was enhanced by addition of a phosphatase (potato apyrase, Sigma A6132) to the reaction. Percentage of acceptor converted was measured as the relative proportion of the peak area of the product to the remaining acceptor at 60 minutes in a reaction containing 50 nM enzyme, 1 mM UDP-GalA, and 100 µM DP12-2AB (R) acceptor. Error bars represent the standard deviation from three independent experiments.

[…] 6). The proteins were purified by Ni2+-NTA affinity from the cell culture medium. Protein concentration was measured by fluorescence. Proteins were loaded into an SDS-PAGE gel based on an equal amount of fluorescence corresponding to an estimated 1 µg total protein. All samples were separated under reducing conditions (+DTT) to observe the presence of monomers. Proteins were compared to previously purified controls (Lanes 8-10). Lane 10, containing both RGGAT1 and RRT4 protein, was used as a control to demonstrate that the RGGAT1 and RRT4 monomers can be distinguished when an equal amount of both proteins was present. Although some RRT protein may be present in each co-expression lane, the results indicate that they were poorly expressed compared to RGGAT1. The gel represents a single experiment of the co-expression of RGGAT1 with RRT family members.

Fig. 9 | DUF616 family sequences are predicted to be a GT-A fold type. a, Reconstruction error (RE) values are calculated for DUF616 (n = 678) sequences and fall within 95% CI of the RE values for GT-A, B, C and lyso type folds, suggesting that DUF616 belongs to one of the known folds. The reference RE values (blue line) were combined from the training set consisting of 39,713 GT-A, GT-B, GT-C and GT-lyso sequences. b, RE values for the GT-A (n = 12,316), B (n = 20,397), C (n = 1,518), lyso (n = 5,482) and DUF616 (n = 678) sequences are shown as boxplots. Dotted lines mark the 95th and the 99th percentile upper bounds. Boxes show the first and third quartiles. The line within the box indicates the median value. The whiskers mark 1.5 times the interquartile range, excluding the outliers shown as individual diamonds. c, Highest Fold Assignment Scores are found to be for the GT-A1 subcluster for the DUF616 sequences, suggesting that the sequences from this novel family adopt a GT-A type fold. d and e, The RE values against subclusters GT-A1 and GT-B1 are plotted for DUF616 sequences. As seen, the RE values for GT-A1 are much closer to the true RE values, suggesting overall similarity in core structural fold.

March 2021
Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A description of any restrictions on data availability
- For clinical datasets or third party data, please ensure that the statement adheres to our policy

All data generated or analyzed during this study are included in this published article (and its supplementary information files) or are available from the corresponding author upon request. UDP-GalA structure was accessed from Protein Data Bank: 3OH1 (https://www.rcsb.org/structure/3OH1).

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.

Sample size
All proteins were expressed in HEK293F cells at least twice and the GalA transferase activity was confirmed from independently purified protein batches. All enzyme activity assay measurements were completed using different aliquots from a single expression and purification batch. The individual assays represent a minimum of two technical replicates and two or three independent experiments as indicated in the figure legends.
Sample sizes were chosen to supply the information needed to determine kinetic values and constants including kcat and Km. Design points, including the number of replicates, were calculated for mechanistic studies and included a high substrate concentration for kcat and a wide range of substrate concentrations for Km and the resulting specificity constants. Enzyme measurements were from the same protein batch to enable cross-referencing of kinetic results, and two technical replicates from two to three experiments were found to yield low error values.
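The reasoning behind these design points can be sketched in Python, assuming standard Michaelis-Menten kinetics (the statement above does not name a rate model, so this is an assumption). At substrate concentrations far above Km the rate approaches kcat times the enzyme concentration, which is why a high substrate concentration informs kcat; near Km the rate is most sensitive to Km, which is why a wide concentration range is needed. The function names and all numeric values below are hypothetical.

```python
def michaelis_menten_rate(substrate, enzyme, kcat, km):
    """Initial rate v = kcat * [E] * [S] / (Km + [S]).

    Assumes simple Michaelis-Menten kinetics; substrate and km must share
    one concentration unit, and the rate is in enzyme units per time.
    """
    return kcat * enzyme * substrate / (km + substrate)


def specificity_constant(kcat, km):
    """Specificity constant kcat / Km."""
    return kcat / km


# Hypothetical parameters, for illustration only (not measured values).
kcat = 2.0      # per second
km = 100.0      # micromolar
enzyme = 0.05   # micromolar

v_max = kcat * enzyme
# At [S] = Km the rate is half of Vmax.
v_at_km = michaelis_menten_rate(km, enzyme, kcat, km)
# At [S] = 100 * Km the rate is within about 1% of Vmax.
v_saturating = michaelis_menten_rate(100 * km, enzyme, kcat, km)
```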
Data exclusions
No data were excluded from the analyses.

Replication
All proteins were expressed at least twice and the GalA transferase activity was confirmed from independently purified protein batches. No results failed to replicate.
Randomization
Randomization was not applicable to this protein expression and characterization study since specific gene constructs were tested for enzyme activity and characteristics. There were no biological participants that required randomization.

Blinding
Blinding was not applicable to this protein expression and characterization study. No blinding was performed because the research was not qualitative and involved no human subjects. The data obtained were numerical.

Reporting for specific materials, systems and methods
We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.