Sequence polymorphisms encompassed in the acyl-ACP TE variant library
Six parental acyl-ACP TEs (i.e., CnFatB3, CvFatB1, CnFatB2, UaFatB1, CvFatB2, and CpFatB1) (Supplemental Figure 1) were selected to initiate this directed evolution study because prior characterizations had shown that these enzymes exhibit diverse substrate specificities and support diverse in vivo fatty acid productivities when expressed in E. coli 21. The directed evolution strategy that we implemented generated variant enzymes that were initially screened for enhanced fatty acid productivity in E. coli. Ninety-eight sequence polymorphisms that occur among the six parental acyl-ACP TEs were randomly recombined in vitro by a PCR-based reassembly of the acyl-ACP TE-coding sequences (see Methods).
An initial pilot study evaluated the diversity of the acyl-ACP TE sequences recoverable from the constructed variant library. As a control experiment, 47 colonies were randomly chosen from the initial transformants without the Neutral Red selection for enhanced fatty acid accumulation, and the acyl-ACP TE sequences were determined from the recovered plasmids. The sequences of these 47 variant acyl-ACP TEs all differ from each other and from the six parental acyl-ACP TE sequences that went into the design of the variant library. However, only two of the reassembled acyl-ACP TE sequences encode a fully translatable, full-length acyl-ACP TE protein. The majority of the recovered mutants in this small sub-sample contained nonsense mutations (e.g., premature stop codons) or frameshifts due to an insertion or deletion of a single nucleotide. These are likely due to misalignments during PCR assembly or errors introduced during the chemical synthesis of the oligomers that were used in the construction of the variant library.
Neutral Red colony-staining screen to identify hyperactive acyl-ACP TEs
Prior studies established that E. coli strain K27, the host used to propagate the variant acyl-ACP TE library, carries a mutation in acyl-CoA synthetase (fadD) that results in the over-production of free fatty acids 20. Indeed, when acyl-ACP TEs are expressed in this strain, there is a direct relationship between the level of acyl-ACP TE activity and the accumulation of free fatty acids 33, 44. Therefore, the acyl-ACP TE variant library was bulk screened by growing transformants on media plates supplemented with the pH indicator dye, Neutral Red. Because a higher accumulation of free fatty acids acidifies the media, the Neutral Red dye serves as a gauge of the fatty acid productivity of individual colonies. Figure 1a shows colonies on a typical Neutral Red-containing plate. The majority of the recovered colonies (~98%) displayed a light red/pink color, but about 2% of the colonies exhibited a more intense red color, indicative of increased fatty acid production resulting in acidification.
Based on this rationale, we selected 133 dark red-staining colonies and 77 light red/pink colored colonies and determined the fatty acid productivity of these strains. Among the 133 dark red-staining strains, 75% produced more than 600 µM of fatty acids, 50% produced more than 1000 µM of fatty acids, and 25% produced even more, reaching levels greater than 1200 µM of fatty acids (Figure 1b). In contrast, the majority of the strains identified as light red/pink colored colonies produced <100 µM of fatty acids; the maximum amount of fatty acid produced by these light red/pink colonies was 260 µM (Figure 1b). These results confirm that there is a correlation between the intensity of the color produced by Neutral Red staining of colonies and the fatty acid productivity of these strains.
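The 75%, 50%, and 25% statements above correspond to the first quartile, median, and third quartile of the productivity distribution. The following minimal Python sketch shows how such cut points can be read from a set of titers. The values in `titers_uM` are invented for illustration and are not the measured data, and the helper name `quartile_thresholds` is hypothetical.

```python
import statistics

# Hypothetical fatty acid titers (µM) for nine dark red-staining colonies.
# These numbers are invented for illustration; they are NOT the measured data.
titers_uM = [300, 500, 600, 800, 1000, 1100, 1200, 1400, 1700]


def quartile_thresholds(values):
    """Return (Q1, median, Q3) using the inclusive quantile method."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, median, q3


q1, median, q3 = quartile_thresholds(titers_uM)
# Roughly 75% of strains lie at or above Q1, 50% at or above the median,
# and 25% at or above Q3.
print(f"Q1={q1} µM, median={median} µM, Q3={q3} µM")
```

With real data, the same call would be applied separately to the dark red and the light red/pink groups so that the two distributions can be compared.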
Ultimately, approximately 30,000 colonies were screened, which resulted in the selection of 480 strains that were expected to express a higher fatty acid productivity based on enhanced Neutral Red staining (Supplemental Table 2). The fatty acid productivity of these strains was determined and compared to the productivity of strains expressing the original six parental acyl-ACP TEs that were used as guides for the design of the acyl-ACP TE variant library. The fatty acid productivity of the strains expressing these parental acyl-ACP TEs ranges from 100 µM to 900 µM (Figure 2, green data bars). Among the 480 colonies that were selected with the Neutral Red colony-staining assay, 151 expressed a fatty acid productivity higher than 600 µM, ranging up to a maximum of 1700 µM (Supplemental Table 3). These productivities are between 4- and 15-fold higher than those of five of the six parental acyl-ACP TEs. Even compared to the most productive parental acyl-ACP TE (i.e., CpFatB1), the productivities expressed by the variant acyl-ACP TEs are nearly 2-fold higher (Figure 2).
Sequences of acyl-ACP TE variants
The 175 acyl-ACP TE variants that expressed higher in vivo fatty acid productivities (ranging between 500 µM and 1700 µM) were sequenced. These sequences identified 26 distinct acyl-ACP TE variant proteins (Supplemental Figure 3). One of these variant proteins, TEGm162, recurred 147 times in the sequenced collection, TEGm204 was recovered 3 times, and TEGm198 was recovered twice; the remaining 22 sequences occurred uniquely in this collection (Supplemental Tables 2 and 3). None of the sequences recovered with the Neutral Red staining screen were among the original 47 randomly selected control sequence variants, which were isolated without the Neutral Red staining screen. Hence, these findings indicate that the Neutral Red staining screen strongly selects for acyl-ACP TE variants that express higher productivities of fatty acids. The collective average of the fatty acid productivity of the 147 independently isolated TEGm162 variants was 1170 ± 210 µM, and the average for the three TEGm204 variants was 1100 ± 140 µM. These productivities are ~30% higher than that of the most productive parental acyl-ACP TE (i.e., CpFatB1), and 10-fold higher than that of the least effective parental acyl-ACP TE (i.e., CnFatB3).
The 26 distinct acyl-ACP TE variant sequences selected by this directed evolution strategy (Supplemental Figure 3) were compared to each other and to the six parental acyl-ACP TEs that were used to initiate this study. These analyses demonstrate that the recovered acyl-ACP TE variants share an overall sequence identity of ~67%. Among the 307 amino acid positions of these recovered variant enzymes, polymorphisms occur at 100 positions, which is very close to the number of positions (98) we targeted for mutagenesis in the design of the variant library. The two additional polymorphic positions may be attributable to variants introduced by errors in DNA primer synthesis or by PCR errors.
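The position counts above can be derived from a multiple sequence alignment by checking, column by column, whether more than one residue occurs. The sketch below illustrates that bookkeeping on invented toy fragments, not real acyl-ACP TE sequences; the function names are hypothetical. One reading of the ~67% figure, which is an assumption here and not stated in the text, is that it is the invariant fraction of the alignment: with 100 of 307 positions polymorphic, 207 of 307 (about 67%) are identical across all variants.

```python
def polymorphic_positions(aligned):
    """Return 0-based alignment columns where more than one residue occurs."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to equal length")
    return [i for i in range(length) if len({seq[i] for seq in aligned}) > 1]


def percent_identity(seq_a, seq_b):
    """Percent of aligned columns at which two sequences share a residue."""
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Toy aligned fragments (invented for illustration only).
toy_alignment = ["MKAVLTGE", "MKSVLTGD", "MRAVLTGE"]
toy_polymorphic = polymorphic_positions(toy_alignment)

# Arithmetic on the counts reported in the text: 307 positions, 100 polymorphic.
invariant_pct = 100.0 * (307 - 100) / 307
print(toy_polymorphic, round(invariant_pct, 1))
```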
Hierarchical clustering analysis of these variant sequences identifies a majority clade that is most similar to two of the parental sequences, CvFatB1 and CpFatB1 (Figure 3a). Within this clade, variants TEGm413 and TEGm419 are closest in sequence to the CpFatB1 and CvFatB1 parents. These four proteins share ~64% amino acid identity, yet they express in vivo fatty acid productivities that range between ~240 µM and ~1390 µM (Figure 3b).
The substrate specificities of acyl-ACP TE variants
In addition to showing differences in in vivo fatty acid productivity, the six parental acyl-ACP TEs that were used to guide this directed evolution strategy also display differences in acyl-chain length substrate specificity. This variation provided an added opportunity to explore the relationship between amino acid sequence and substrate specificity attributes of acyl-ACP TEs. Therefore, we evaluated how substrate specificity evolved in the acyl-ACP TE variants that were selected for enhanced in vivo fatty acid productivity.
Figure 3b shows the fatty acid profiles produced by the 26 evolved acyl-ACP TE variants as compared to the six parental acyl-ACP TEs. Prior characterizations of the six parental acyl-ACP TEs, in the context of the structural differences among 31 naturally occurring diverse acyl-ACP TEs from plant and microbial sources, had categorized these parental enzymes into three classes, Class I to III 21. CvFatB2 and CnFatB2 are Class I enzymes that primarily hydrolyze acyl-ACPs of 14- and 16-carbon fatty acyl chains, CnFatB3 is a Class II enzyme that prefers 8- to 16-carbon acyl-chains, and CpFatB1, CvFatB1, and UaFatB1 are Class III enzymes that have a preference for 8-carbon acyl-chains 21. The 26 acyl-ACP TE variants generated by this directed evolution study are distributed somewhat unevenly among these three functional classes, with a preference for Class I and Class II enzymes (13 and 9 variants, respectively), and only four variants (TEGm162, TEGm169, TEGm258, and TEGm288) belonging to Class III acyl-ACP TEs. Although these 26 variant acyl-ACP TEs are classifiable among these categories, an analysis of variance (ANOVA) demonstrates that these substrate specificity classifications do not explain the variation in in vivo fatty acid productivity observed in the E. coli host (p-value >0.05).
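The ANOVA above asks whether mean productivity differs among the three specificity classes more than expected from the spread within each class. As a minimal sketch, the F statistic for a one-way ANOVA can be computed as below. The productivity values are invented placeholders, not the measured titers, and the function name is hypothetical. Converting F to a p-value requires the F distribution, which this sketch omits; a statistics package (for example `scipy.stats.f_oneway`) would normally be used for that step.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical productivities (µM) per class; invented for illustration only.
class_I = [700, 900, 1100]
class_II = [800, 1000, 1200]
class_III = [750, 950, 1150]

f_stat, df_b, df_w = one_way_anova_f([class_I, class_II, class_III])
print(f"F({df_b}, {df_w}) = {f_stat:.4f}")
```

A small F, as in this toy case, means the between-class differences are minor relative to the within-class spread, which is the pattern a non-significant p-value reflects.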
Machine learning model reveals structural constraints to substrate specificity
Because acyl-ACP TE classification based solely on sequence similarity and diversity does not fully predict the fatty acid productivity of these enzymes, we adopted an alternative classification strategy based on the fatty acid profiles. Thus, in addition to clustering the variant acyl-ACP TEs relative to their sequence similarity (Figure 3a), clustering was performed based on the fatty acid profiles produced when variant enzymes were expressed in vivo to evaluate their substrate specificity. These analyses not only evaluated the acyl-ACP TEs generated by the current in vitro directed evolution study, but also included previously characterized natural variants of acyl-ACP TE isolated from a wide variety of different phylogenetic clades 21, 33. Thus, collectively 57 acyl-ACP TE variants were analyzed, 26 being products of in vitro directed evolution selection and 31 being products of natural evolution selection. Hierarchical clustering that minimized within-cluster variance in substrate specificity separated the 57 acyl-ACP TE variants into three distinct clusters (Clusters A-C) (Figure 4a). A similar segregation pattern occurs upon principal component analysis (PCA) of these data (Figure 4b), and in combination the two primary principal components (PC1 and PC2) explain nearly 60% of the variation in the substrate specificity among these variants. PC1, which accounts for 43% of the variation in the fatty acid profiles, primarily separates Cluster C enzymes from Clusters A and B, while PC2 explains 16% of the variation and separates Cluster A from Cluster B and Cluster C (Figure 4b). This tripartite classification of the variants reflects the prior classification of naturally occurring acyl-ACP TE variants 21, which identified three classes of acyl-ACP TEs, with preferences for C14/C16 (Class I), C8 (Class III) or broad range chain-length (Class II) acyl-ACP TE substrates. Similarly in this study, Cluster A and Cluster C enzymes exhibit preferences for C8 and C14/C16 acyl-ACPs, respectively, whereas Cluster B enzymes have broader substrate specificity and are able to hydrolyze C8 to C16 acyl-ACPs (Figure 4c, 4d, 4e).
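Hierarchical clustering that minimizes within-cluster variance is commonly implemented with Ward's criterion, which merges at each step the two clusters whose union least increases the within-cluster sum of squares. The text does not name the linkage method, so treating it as Ward's criterion is an assumption. The sketch below applies that criterion to invented fatty acid profiles (fractions of C8, C10/C12, and C14/C16 products); the enzyme names and values are hypothetical.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length profiles."""
    return [sum(col) / len(points) for col in zip(*points)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_clusters(profiles, k):
    """Greedy agglomerative clustering down to k clusters (Ward criterion)."""
    clusters = [[name] for name in profiles]
    while len(clusters) > k:
        # Pick the pair of clusters that is cheapest to merge.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(
                [profiles[n] for n in clusters[ij[0]]],
                [profiles[n] for n in clusters[ij[1]]],
            ),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Hypothetical profiles: [fraction C8, fraction C10/C12, fraction C14/C16].
toy_profiles = {
    "enzA1": [0.80, 0.10, 0.10],
    "enzA2": [0.75, 0.15, 0.10],
    "enzB1": [0.35, 0.30, 0.35],
    "enzB2": [0.30, 0.35, 0.35],
    "enzC1": [0.05, 0.10, 0.85],
    "enzC2": [0.10, 0.10, 0.80],
}
print(sorted(ward_clusters(toy_profiles, 3)))
```

In this toy case the C8-preferring, broad-range, and C14/C16-preferring profiles fall into separate groups, mirroring the roles of Clusters A, B, and C described above.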
Manual comparisons of the recovered acyl-ACP TE sequence variants and their substrate specificities can provide constraints on the relationship between primary structure and substrate specificity of these enzymes. For example, by comparing the acyl-ACP TE sequence variants that are sorted into the same sequence-based hierarchical cluster, but are separated into different functional clusters based on substrate specificities (i.e., Clusters A-C; Figure 4a), one can heuristically identify those polymorphic residues that contribute to altered substrate specificity. We, however, developed a systematic machine learning approach, a random forest classification model, that improves on this manual strategy and quantitatively assesses the importance of each polymorphic amino acid residue in determining the substrate specificity of the acyl-ACP TE variants.
The random forest classification strategy utilized both binarized substrate specificity data and amino acid sequence data as described in the Methods. Substrate specificity was binarized according to the fatty acids produced when each variant enzyme was expressed in E. coli, and two acyl-ACP TEs were defined as sharing substrate specificity if they were members of the same Cluster (A, B or C) (Figure 4a). Conversely, two acyl-ACP TEs that had membership in separate Clusters were deemed to have different substrate specificities. After transforming and encoding the data, a random forest classifier was trained with all encoded data, and the mean feature importance scores for the 350 amino acid positions were calculated based on ten iterations of the model (Figure 5a and Supplemental Table 4a), which quantified the importance of individual residues in determining the substrate specificity of each acyl-ACP TE variant. A total of 174 residue positions, with importance scores ranging from ~0.5 to ~15, had a statistically significant impact on substrate specificity (i.e., corrected p-values <0.001; Supplemental Table 4a), and these are blue-highlighted in Figure 5a.
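The binarization described above can be pictured as building one training row per pair of enzymes, labeled 1 when both belong to the same Cluster and 0 otherwise. The sketch below illustrates that labeling on invented sequences. The per-position encoding used here (1 where the aligned residues differ) is an assumption made only for illustration; the actual encoding is the one given in the Methods. All names are hypothetical.

```python
from itertools import combinations


def pair_label(cluster_a, cluster_b):
    """1 if two enzymes fall in the same specificity cluster, else 0."""
    return int(cluster_a == cluster_b)


def pair_features(seq_a, seq_b):
    """ASSUMED encoding: 1 where the aligned residues differ, else 0."""
    return [int(a != b) for a, b in zip(seq_a, seq_b)]


def build_pair_dataset(enzymes):
    """enzymes maps name -> (aligned_sequence, cluster); returns (X, y, pairs)."""
    X, y, pairs = [], [], []
    for (name_a, (seq_a, cl_a)), (name_b, (seq_b, cl_b)) in combinations(
        enzymes.items(), 2
    ):
        X.append(pair_features(seq_a, seq_b))
        y.append(pair_label(cl_a, cl_b))
        pairs.append((name_a, name_b))
    return X, y, pairs


# Toy aligned fragments and cluster assignments (invented for illustration).
toy_enzymes = {
    "te1": ("MKAV", "A"),
    "te2": ("MKSV", "A"),
    "te3": ("MRAV", "C"),
}
X, y, pairs = build_pair_dataset(toy_enzymes)
print(pairs, X, y)
```

Rows of this kind could then be passed to any random forest implementation that reports per-feature importance scores, with one feature per alignment position.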
This list of residues was refined by a two-step approach. Initially, an incremental feature selection (IFS) approach was used that built a series of random forest models, in which each model added an additional residue to the evaluation process. The random forest classifier that included the 59 residue positions with the highest importance scores as the predictors exhibited optimal predictive performance, with a recall (i.e., true positive rate) of 70%, a specificity (i.e., true negative rate) of 91%, and a Matthews correlation coefficient (MCC) of 0.69 (Figure 5b; Supplemental Table 4a). Next, the list of 59 residues was further refined by pairwise comparisons of MCC scores using Student’s t-test between every pair of adjacent models (i.e., the model that included one additional residue position versus the previous model that did not include that additional residue) until all 59 residues were examined (Figure 5b and 5c). The final model that contained the top 22 residue positions (orange-highlighted in Figure 5a) reached the statistical plateau of MCC (q-value >0.05; Supplemental Table 4b), and thus these 22 residues were considered the most impactful in determining the substrate specificity of the enzyme.
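The performance metrics and the incremental feature selection loop described above can be sketched as follows. The confusion-matrix counts and the scoring function are invented placeholders; in practice the score for each nested feature set would come from training and evaluating a random forest model. The t-test comparison of adjacent models is not shown. All function names are hypothetical.

```python
import math


def recall(tp, fn):
    """True positive rate."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate."""
    return tn / (tn + fp)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 if any marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def incremental_feature_selection(ranked_features, score_fn):
    """Score nested feature sets (top 1, top 2, ...) and return the best one."""
    scores = [
        score_fn(ranked_features[:n]) for n in range(1, len(ranked_features) + 1)
    ]
    best_n = max(range(len(scores)), key=scores.__getitem__) + 1
    return ranked_features[:best_n], scores


# Hypothetical confusion-matrix counts (invented for illustration).
tp, fn, tn, fp = 7, 3, 18, 2
print(recall(tp, fn), specificity(tn, fp), round(mcc(tp, fp, tn, fn), 4))

# Hypothetical ranked positions and a placeholder scorer standing in for
# "train a model on these positions and return its MCC".
ranked = ["pos10", "pos42", "pos7", "pos99"]
placeholder_scores = {1: 0.2, 2: 0.5, 3: 0.7, 4: 0.6}
best, scores = incremental_feature_selection(
    ranked, lambda feats: placeholder_scores[len(feats)]
)
print(best, scores)
```

MCC is useful here because it combines all four confusion-matrix cells, so it stays informative when same-cluster and different-cluster pairs are unequally represented.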
Mapping these 22 residues onto a predicted three-dimensional structure of CvFatB2 indicates that the majority of these residues (17 of 22) are located in the N-terminal hot-dog domain structure (Figure 6). The other five residues are in the C-terminal hot-dog domain structure, among which are two residues that are adjacent to the catalytic residues we identified in a previous study 27.