Construction of a machine-learning model to predict the optimal gene expression level for efficient production of d-lactic acid in yeast

The modification of gene expression is being studied as a means of producing useful chemicals by metabolic engineering of the yeast Saccharomyces cerevisiae. When the expression levels of many metabolic enzyme genes are modified simultaneously, the expression ratios of these genes become diverse, and the relationship between the gene expression ratio and chemical productivity remains unclear. In other words, it is challenging to predict phenotypes from genotypes. However, the productivity of useful chemicals can be improved if this relationship is clarified. In this study, we aimed to construct a machine-learning model that can be used to clarify the relationship between gene expression levels and d-lactic acid productivity and to predict the optimal gene expression level for efficient d-lactic acid production in yeast. A machine-learning model was constructed using data on the expression of d-lactate dehydrogenase and glycolytic genes (13 dimensions) and on d-lactic acid productivity. The coefficient of determination of the completed machine-learning model was 0.6932 when using the training data and 0.6628 when using the test data. Using the constructed machine-learning model, we predicted the optimal gene expression level for high d-lactic acid production. We successfully constructed a machine-learning model to predict both d-lactic acid productivity and the suitable gene expression ratio for the production of d-lactic acid. The technique established in this study could be key for predicting phenotypes from genotypes, a problem faced by recent metabolic engineering strategies.


Introduction
With the recent development of metabolic engineering, which uses genetic recombination technology and other techniques to modify the intracellular metabolism of organisms, many studies have focused on the production of various useful chemicals using microorganisms (Hong and Nielsen 2012;Lennen and Pfleger 2012;Mitsui et al. 2019). Because the production of chemicals using microorganisms allows the use of renewable biomass resources and the production of chemicals under mild conditions, such as mild temperature and pressure, environmentally friendly biotechnological processes are expected to be established (Jullesson et al. 2015). In particular, the yeast Saccharomyces cerevisiae has been used in various studies as an excellent model organism for eukaryotic cells because it is safe, non-pathogenic, easy to genetically modify, and can be easily manipulated in the laboratory. In addition, culture technology has been established on an industrial scale and is used in a wide range of fields, including food and brewing fields, such as the production of alcoholic beverages and bread, and chemical fields, such as the production of 2,3-butanediol (Kim et al. 2013;Mitsui et al. 2022) and d-lactic acid (Ishida et al. 2006;Yamada et al. 2017a).
In yeast metabolic engineering, various studies have reported the production of useful chemicals by introducing foreign genes or disrupting metabolic pathway genes that produce byproducts. Examples include improving the production of fatty-acid-derived products by enhancing the supply of precursor metabolites (Lian and Zhao 2017), improving the production of isobutanol by disrupting genes that have competitive pathways (Matsuda et al. 2013), and producing xylitol from xylose by introducing a foreign gene encoding xylose reductase (Hallborn et al. 1991).
In eukaryotes, e.g., yeast, it is difficult to increase the metabolic rates in major metabolic pathways, such as glycolysis, to increase the productivity of target chemicals. Previous studies have reported that the overexpression of a single glycolytic enzyme does not affect ethanol productivity (Schaaff et al. 1989;Wang et al. 2019). Therefore, techniques to regulate the expression levels of numerous genes are being developed to increase the metabolic rates in major metabolic pathways (Guo et al. 2015). The δ-integration method enables the introduction of a large number of genes by homologous recombination targeting δ-sequences present at multiple locations in the genome (Sakai et al. 1990). We applied this technology to develop a cocktail δ-integration method that can improve the metabolic rate of major metabolic pathways while modifying the expression of many metabolic genes (Yamada et al. 2010, 2017c). Using the cocktail δ-integration method, we improved the metabolic rates of glycolysis (Yamada et al. 2017c) and productivity of 2,3-butanediol (Yamada et al. 2017b), β-carotene (Yamada et al. 2018), and patchoulol (Mitsui et al. 2020). In the production of d-lactic acid, a raw material for high-performance biodegradable plastics, we constructed a yeast strain YPH499/dPdA3-34/DLDH/1-18 producing d-lactic acid at a rate of 2.80 g/L/h by modifying the expression levels of d-lactate dehydrogenase (D-LDH) and 12 glycolytic enzymes via the cocktail δ-integration method (Yamada et al. 2017a). Thus, the cocktail δ-integration method is a very promising tool for the metabolic engineering of yeast.
However, when the expression levels of many metabolic enzyme genes are modified, the expression ratios of these genes become diverse, and the relationship between the gene expression ratio and chemical productivity remains unclear. In other words, it is challenging to predict phenotypes from genotypes. Because the expression ratios are so diverse, it is extremely difficult to comprehensively verify the relationship between the expression level of each metabolic enzyme and the productivity of chemicals. If this relationship were clarified, the productivity of useful chemicals could be improved.
In recent years, machine-learning technology has been used in biochemical engineering. For example, a yeast library was constructed in which the expression levels of five types of genes varied, and a machine-learning model was constructed with gene expression levels as input values and β-carotene production as output values (Zhou et al. 2018). In addition, the gene expression ratio that realizes efficient β-carotene production was predicted using the constructed model. In another study, a machine-learning model that could estimate the cell growth and fermentation characteristics of yeast was constructed and used for bioethanol production based on the composition data of corn hydrolysate and yeast fermentation data (Konishi 2020).
In this study, we aimed to construct a machine-learning model that could predict the optimal gene expression level for the efficient production of d-lactic acid. We first constructed a yeast library with various glycolytic enzyme genes and D-LDH expression levels, and we measured the gene expression levels, d-lactic acid production, and optical density at 600 nm (OD600) of each strain. Then, a machine-learning model was constructed based on the acquired data, with gene expression levels as input values and d-lactic acid productivity as output values. Finally, the model was used to identify genes that have a particularly large impact on d-lactic acid productivity and to predict the optimal gene expression levels for highly efficient d-lactic acid production.

Strains and media
Table 1 summarizes the features of the Escherichia coli and S. cerevisiae strains used in this study. E. coli strain HST08 (Takara Bio, Otsu, Japan) was used as a host for recombinant DNA manipulation. Recombinant E. coli cells were cultivated in Luria-Bertani (LB) medium (20 g/L LB Broth), supplemented with 100 µg/mL ampicillin sodium salt.
S. cerevisiae strain YPH499 (NBRC 10505) was used as the host for metabolic engineering. S. cerevisiae cells were cultivated on a yeast/peptone/dextrose (YPD) medium (10 g/L yeast extract, 20 g/L peptone, and 20 g/L glucose) or a synthetic complete drop-out (SCD) medium (6.7 g/L yeast nitrogen base without amino acids, 2.0 g/L Synthetic Complete Mixture, and 20 g/L glucose), supplemented with the appropriate amino acids and nucleic acids. For the plate medium, 20 g/L of agar was added to the medium. Reagents for media were purchased from Nacalai Tesque (Kyoto, Japan) or Formedium (Norfolk, UK).

Yeast cultivation
Yeast cultivation was performed in 1.0 mL of a YPD medium, using a 2-mL 96-well deep-well plate equipped with a gas-permeable seal, as well as a rotary plate shaker set to 30 °C and 1500 rpm. Fermentation was started by inoculation (5% v/v) of a preculture grown in a microplate containing 1.0 mL of an SCD medium for 48 h at 30 °C and 1500 rpm.

Analysis of growth and metabolites
OD600 was determined using a Multiskan GO microplate reader (Thermo Scientific, Rochester, NY, USA). The concentration of glucose was determined using the glucose CII-test kit (FUJIFILM Wako Pure Chemical Corporation, Osaka, Japan). The d-lactic acid concentration was determined using a d-lactic acid assay kit (Megazyme, Wicklow, Ireland).
The d-lactic acid productivity was calculated using Eq. (1).

$$\text{d-Lactic acid productivity}\ (\mathrm{g/L/OD_{600}}) = \frac{\text{d-lactic acid concentration}\ (\mathrm{g/L})}{\mathrm{OD_{600}}} \tag{1}$$

Real-time polymerase chain reaction analysis
Total RNA was isolated from yeast cells cultivated in a YPD medium for 16 h at 30 °C, and cDNA was synthesized as previously described (Yamada et al. 2017c). The transcription levels of glycolysis-related genes and D-LDH were quantified by real-time polymerase chain reaction as described previously (Yamada et al. 2017c).

Construction and evaluation of machine-learning model and prediction of d-lactic acid productivity
Two hundred datasets were prepared by setting the expression levels (fold change) of 12 types of yeast glycolytic genes and the D-LDH gene as explanatory variables and d-lactic acid productivity as the objective variable. Next, outliers contained in the datasets were detected using ensemble learning for outlier sample detection (Kaneko 2018), and the 19 datasets judged to be outliers were removed. Furthermore, the Yeo-Johnson transformation (Yeo and Richard 2000) was applied to the explanatory and objective variables to bring their distributions closer to a normal distribution.
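The Yeo-Johnson transformation mentioned above can be sketched in a few lines. This is a minimal illustration using only the standard library, not the code used in the study: the function name `yeo_johnson`, the example fold-change values, and the choice of λ = 0 are assumptions made for the example. In practice λ is usually estimated separately for each variable (for example by maximum likelihood), and how it was chosen in this study is not stated here.

```python
import math


def yeo_johnson(y: float, lam: float) -> float:
    """Yeo-Johnson power transform of a single value for a given lambda."""
    if y >= 0:
        if abs(lam) < 1e-12:              # lambda == 0 branch
            return math.log1p(y)
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:            # lambda == 2 branch
        return -math.log1p(-y)
    return -(((-y + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


# Hypothetical fold-change values for one gene across five strains.
fold_changes = [0.4, 0.9, 1.0, 2.5, 8.0]

# Illustrative lambda only; normally it is estimated from the data.
lam = 0.0
transformed = [yeo_johnson(v, lam) for v in fold_changes]
```

Unlike the Box-Cox transformation, the Yeo-Johnson transformation is defined for zero and negative values, and λ = 1 leaves the data unchanged.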
To build the machine-learning model, 25 different algorithms were examined using the Python package "PyCaret" (Ali 2020). The 181 datasets were split at random, with 20% used as the test data, and ten-fold cross-validation was performed.
For ensemble learning, bagging was performed on a machine-learning model trained with the MLPRegressor algorithm, and the resulting model was used as the metamodel for stacking. During stacking, machine-learning models trained with the 25 types of algorithms were used as candidate base models, and combinations of base models were searched for that would increase the coefficient of determination when using the test data (Supplementary Information). The prediction accuracy of the machine-learning model was evaluated using the coefficient of determination of the measured and predicted d-lactic acid productivity values.
d-Lactic acid productivity was predicted via the constructed machine-learning model using 100,000 datasets with randomly changed expression levels of 13 types of genes as input values. The optimal gene expression ratio for d-lactic acid production was determined based on the prediction of d-lactic acid productivity. The range of input values for the gene expression levels was limited to a practically feasible range (the range from the minimum to the maximum value of the gene expression levels of the strains in the yeast library constructed in this study).
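The random search described above can be sketched as follows. This is an illustration under stated assumptions, not the study's code: `toy_predict` is a placeholder scoring function standing in for the trained model, the bounds of 0.1 to 10.0 are hypothetical stand-ins for the observed minimum and maximum fold change of each gene, uniform sampling is assumed, and 10,000 samples are drawn instead of 100,000 to keep the example fast. The gene names are those reported in this study.

```python
import random

GENES = ["D-LDH", "HXT7", "HXK2", "PGI1", "PFK1", "PFK2", "FBA1",
         "TPI1", "TDH3", "PGK1", "GPM1", "ENO2", "PYK2"]


def toy_predict(profile):
    """Placeholder score, NOT the trained model: peaks when every gene is at 2.0."""
    return -sum((profile[g] - 2.0) ** 2 for g in GENES)


def random_profile(bounds, rng):
    """Draw one expression profile uniformly within the per-gene bounds."""
    return {g: rng.uniform(lo, hi) for g, (lo, hi) in bounds.items()}


def search_best(predict, bounds, n_samples, seed=0):
    """Score n_samples random profiles and return the best one with its score."""
    rng = random.Random(seed)
    best_profile, best_score = None, float("-inf")
    for _ in range(n_samples):
        profile = random_profile(bounds, rng)
        score = predict(profile)
        if score > best_score:
            best_profile, best_score = profile, score
    return best_profile, best_score


# Hypothetical bounds; the study used the min-max range observed in the library.
bounds = {g: (0.1, 10.0) for g in GENES}
best_profile, best_score = search_best(toy_predict, bounds, n_samples=10_000)
```

Restricting the samples to the observed range keeps every candidate inside the region covered by the training data, so the model is not asked to extrapolate.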

Construction and evaluation of D-LDH and glycolytic enzymes expression-modified yeast library
Strain YPH499/dPdA, in which ADH1 and PDC1 were disrupted to reduce ethanol formation (Yamada et al. 2017c), was transformed with the D-LDH-expressing plasmid pδU_LibDLDH, and the d-lactic acid-producing strain YPH499/dPdA/DLDH/A5 was selected (Supplementary Fig. 1). Then, strain YPH499/dPdA/DLDH/A5 was transformed with the D-LDH-expressing plasmid pδH_LibDLDH as well as plasmid pδH_Lib* for the modification of glycolytic enzyme expression. In this way, a yeast library was constructed in which the expression levels of D-LDH and 12 glycolytic enzymes were modified (Fig. 1).

Distribution of gene expression levels of transformants in the yeast library
The expression levels of D-LDH and 12 glycolytic genes in 200 strains evaluated for d-lactic acid productivity were measured, and their distribution is shown in Fig. 4. More than 50% of the strains showed a similar level of expression (0-2.0 times) to the parental strain for all genes. HXT7 and PYK2 had the highest proportion of strains with high expression levels.

Construction and evaluation of machine-learning model
A machine-learning model was constructed using a training dataset with D-LDH and glycolytic gene expression (13 dimensions) as input values and d-lactic acid productivity after 16 h of cultivation as output values. The accuracy of the prediction was evaluated by the coefficient of determination between the measured value of d-lactic acid productivity and the value predicted by the constructed machine-learning model.
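For reference, the coefficient of determination can be computed as in the sketch below. It assumes the common definition R² = 1 − SS_res/SS_tot; the exact definition used in the study is not stated here. The function name `r_squared` and the measured and predicted values are hypothetical.

```python
def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(measured) / len(measured)
    # Residual sum of squares: squared errors of the predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    # Total sum of squares: squared deviations from the mean of the measurements.
    ss_tot = sum((y - mean_y) ** 2 for y in measured)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted productivity values for four strains.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(measured, predicted)
```

Under this definition, a model that always predicts the mean of the measured values scores 0 and a perfect model scores 1. The function is undefined when all measured values are identical, because SS_tot is then zero.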
First, outlier detection (Kaneko 2018) was performed on the acquired dataset of 200 strains, and data that were detected as outliers were excluded from the dataset. This resulted in 181 training datasets, and the Yeo-Johnson transformation (Yeo et al. 2000) was performed to transform the data to follow a normal distribution.
Machine learning was then performed using multiple algorithms (Supplementary Table 2). First, a machine-learning model was constructed using MLPRegressor, a neural network-type algorithm, and was used as the metamodel for stacking after bagging. Then, further machine-learning models were constructed using 25 different algorithms and used for the stacking sub-model. Stacking was performed by combining the constructed metamodel and submodels to search for a combination that maximized the prediction accuracy. The coefficient of determination of the completed machine-learning model was 0.6932 when using the training data and 0.6628 when using the test data (Fig. 5).
Shapley additive explanations (SHAP) values were also calculated to evaluate the effect of increasing or decreasing the expression level of each gene on d-lactic acid productivity (Fig. 6). The expression level of PFK1, a glycolytic enzyme gene, showed a strong positive correlation with d-lactic acid productivity, while the expression level of ENO2 showed a strong negative correlation. The expression levels of PFK1, HXK2, FBA1, GPM1, PYK2, PGK1, and PFK2 were positively correlated with d-lactic acid productivity, whereas those of ENO2, TDH3, TPI1, HXT7, DLDH, and PGI1 were negatively correlated.
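To illustrate what such values measure, the sketch below computes exact Shapley values for a toy model with three hypothetical features, relative to a single baseline point. It is not the procedure used in the study: SHAP implementations generally estimate these values against a background dataset, and the explainer and background data used here are not stated. The function names, the toy model, and all numbers are assumptions made for the example.

```python
from itertools import permutations


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x relative to one baseline point.

    Averages each feature's marginal contribution over all feature orderings,
    so it is only practical for a handful of features.
    """
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev = f(current)
        for i in order:
            current[i] = x[i]        # switch feature i from baseline to its value
            value = f(current)
            phi[i] += value - prev   # marginal contribution of feature i
            prev = value
    return [p / len(orders) for p in phi]


def toy_model(v):
    """Hypothetical linear model: feature 0 raises, feature 1 lowers the output."""
    return 0.5 + 2.0 * v[0] - 1.0 * v[1] + 0.0 * v[2]


x = [3.0, 2.0, 5.0]          # hypothetical expression profile (fold change)
baseline = [1.0, 1.0, 1.0]   # parental-strain level, fold change = 1.0
phi = shapley_values(toy_model, x, baseline)
```

For a linear model, each value reduces to the coefficient multiplied by the feature's deviation from the baseline, and the values sum to the difference between the prediction at `x` and at the baseline. A positive value means the feature pushes the prediction up and a negative value means it pushes the prediction down.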

Prediction of gene expression levels using machine-learning model
The machine-learning model with the highest prediction accuracy was used to predict the optimal gene expression levels for high d-lactic acid productivity (Fig. 7). The gene expression levels predicted to show the highest d-lactic acid productivity were particularly high for glycolytic enzyme genes such as FBA1, PFK1, and PYK2. The predicted d-lactic acid productivity was 7.96 g/L; this was approximately 1.4 times higher than that of strain YPH499/dPdA/DLDH/

Discussion
When the distribution of the expression levels in the glycolysis-modified yeast library was evaluated, the expression levels of HXT7 and PYK2 were found to be particularly high (Fig. 4). A possible reason for the large number of transformants with high HXT7 expression is that increased expression of several glycolytic genes made glucose uptake into cells the rate-limiting step, so that transformants with enhanced HXT7 expression were preferentially selected. Previous studies have suggested that enhanced expression of HXT7, which encodes a hexose transporter, may increase glucose uptake, leading to increased ethanol and lactic acid productivity (Rossi et al. 2010;Tanino et al. 2012). However, the reason why there were many transformants with high PYK2 expression is unclear. One possibility is that overexpression of the foreign gene D-LDH causes pyruvate deficiency, and transformants with high PYK2 expression were preferentially selected for enhanced pyruvate production.
The calculation of SHAP values confirmed that the expression levels of many genes had a clear positive or negative correlation with d-lactic acid productivity (Fig. 6). In a previous study, YPH499/dPdA3-34/DLDH/1-18, whose d-lactic acid productivity was greatly improved compared to that of the parental strain YPH499/dPdAW/DLDH, was constructed by modifying the expression levels of 12 yeast glycolytic genes and D-LDH, as in this study (Yamada et al. 2017a). YPH499/dPdA3-34/DLDH/1-18 showed increased PFK1 expression and decreased ENO2 and TDH3 expression compared to the parental strain. This result was consistent with the calculated SHAP values obtained in this study. In addition, the expression level of D-LDH, which is expected to be directly related to d-lactic acid production, had little effect on d-lactic acid productivity (Fig. 6). This may be because the expression level of D-LDH in the parental strain YPH499/dPdA/DLDH/A5 was sufficiently high that modifying the D-LDH gene expression level did not significantly affect the productivity of d-lactic acid.
Stacking with the MLPRegressor as the metamodel resulted in the highest prediction accuracy (Fig. 5 and Supplementary Information). When constructing machine-learning models, it is important to compare multiple algorithms and determine the best one (Culley et al. 2020). In this study, the prediction accuracy and other aspects of the 25 different algorithms were compared and examined to determine the algorithm to be used (Supplementary Table 2). Neural-network-type algorithms such as MLPRegressor are capable of learning nonlinear relationships and are generally good at constructing models from high-dimensional datasets (Caldeira et al. 2011;Culley et al. 2020;Silva et al. 2012). Therefore, they are considered suitable for model construction from a relatively large dataset, as in this study. On the other hand, similar to this study, previous studies using PyCaret have shown that Passive Aggressive Regressor and Lasso Least Angle Regression have low prediction accuracy (Bhushan et al. 2022;Tsai et al. 2022). In addition, we initially considered employing algorithms based on decision trees (Quinlan 1986) and random forests (Breiman 2001) because of their high prediction accuracy. However, owing to their computational methods, these algorithms have difficulty predicting values beyond the range of the training data and were therefore excluded from the metamodel for stacking in this study.
The gene expression ratio suitable for the production of d-lactic acid was successfully predicted using a machine-learning technique (Fig. 7). Even d-lactic acid, a compound produced by a relatively short metabolic pathway, requires the manipulation of the expression levels of 13 enzymes. Therefore, for further development of metabolic engineering, techniques for manipulating the expression of a large number of genes are required. Artificial yeast chromosomes can introduce genes of hundreds of kbp into yeast (Gibson et al. 2008). Therefore, artificial chromosomes are considered an effective method for future metabolic engineering, which requires the manipulation of the expression of a large number of genes. The technology established in this study to predict the optimal gene expression level by machine learning could be applied to the production of various useful chemicals. We constructed a machine-learning model for d-lactic acid, which is produced via glycolysis, by acquiring data from a yeast library constructed using the cocktail δ-integration method. To date, we have constructed yeast strains producing 2,3-butanediol (Yamada et al. 2017b), which is produced via the glycolysis pathway, and β-carotene (Yamada et al. 2018) and patchoulol (Mitsui et al. 2020), which are produced via glycolysis and the mevalonate pathway, by the cocktail δ-integration method. In the future, this technology will be applicable for the production of these compounds and other useful chemicals produced through similar metabolic pathways.

Fig. 7 Gene expression levels of measured and predicted strains with maximum d-lactic acid productivity. The YPH499/dPdA/DLDH/A5/C69 strain had the highest d-lactic acid productivity in the constructed yeast library. The gene expression levels of the predicted strain were predicted using the completed machine-learning model. Fold changes were calculated relative to the parental strain YPH499/dPdA/DLDH/A5 using PDA1 as a housekeeping gene; thus, the red dotted line (fold change = 1.0) represents the gene expression levels of the parental strain.

Conclusion
We successfully constructed a machine-learning model to predict d-lactic acid productivity based on the measured expression levels of 12 glycolytic genes and D-LDH. In addition, the gene expression ratio suitable for the production of d-lactic acid was successfully predicted using the constructed machine-learning model. The technology established in this study to predict the optimal gene expression level by machine-learning may be applied in the production of various useful chemicals. The future challenge is to achieve the predicted gene expression levels using various techniques, including yeast artificial chromosomes. The technique established in this study could be a key technique for predicting phenotypes from genotypes, a problem faced by recent metabolic engineering strategies.