Dataset source and preprocessing. To evaluate the PreKcat model, we selected several representative public datasets and constructed new datasets to verify its accuracy.
DLKcat dataset. The DLKcat dataset was prepared as in the original publication [10]. It is the most comprehensive and representative dataset of enzyme sequences and substrate structures, drawn from the BRENDA and SABIO-RK databases. The dataset initially contained 17,010 unique samples; following the DLKcat preprocessing instructions, we excluded samples whose substrate simplified molecular-input line-entry system (SMILES) strings contained "." or whose kcat values were less than or equal to 0. This left 16,838 samples, encompassing 7,822 unique protein sequences from 851 organisms and 2,672 unique substrates. All kcat values were converted to a logarithmic scale. The dataset was divided into training and test sets at a 90%/10% ratio; this split was repeated five times to obtain five randomized datasets for downstream model training and testing, consistent with the previous publication.
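For concreteness, the filtering step can be sketched as follows; the file name and column names are placeholders, not the actual DLKcat file layout, and the log base is an assumption since the text only says "logarithmic scale".

```python
# Hedged sketch of the DLKcat filtering step; "dlkcat.tsv" and the
# column names are assumptions.
import numpy as np
import pandas as pd

df = pd.read_csv("dlkcat.tsv", sep="\t")
# Drop multi-component SMILES (containing ".") and non-positive kcat values
df = df[~df["smiles"].str.contains(r"\.", regex=True) & (df["kcat"] > 0)]
df["log_kcat"] = np.log10(df["kcat"])  # 16,838 samples should remain
```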
pH and temperature datasets. To predict the influence of environmental factors on kcat, we constructed two datasets containing enzyme sequences, substrate structures, and the corresponding pH or temperature values. We obtained the enzyme sequences, substrate names, and pH or temperature values from the UniProt database [7]. The corresponding substrate structures were downloaded from the PubChem database by substrate name, and SMILES representations were generated with a Python script [28]. The pH dataset comprised 636 samples covering 261 unique enzyme sequences and 331 unique substrates, yielding 520 unique enzyme-substrate pairs; pH values ranged from 3 to 10.5. The temperature dataset contained 572 samples covering 243 unique enzyme sequences and 302 unique substrates, yielding 461 unique enzyme-substrate pairs; temperature values ranged from 4 to 85 °C. To evaluate the performance of PreKcat on these datasets, we divided each dataset into a 20% training set and an 80% test set.
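A minimal sketch of the substrate-name-to-SMILES step, assuming the pubchempy package as the PubChem interface (the original script is not specified):

```python
# Hedged sketch: resolve a substrate name to a canonical SMILES string
# via PubChem, using the pubchempy package (an assumption; the paper's
# own script is not described in detail).
import pubchempy as pcp

def name_to_smiles(name):
    hits = pcp.get_compounds(name, "name")
    return hits[0].canonical_smiles if hits else None

print(name_to_smiles("L-tyrosine"))  # prints a tyrosine SMILES string
```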
Michaelis constant (Km) dataset. To assess the generalizability of PreKcat to related tasks, we used a representative dataset from a previous publication reporting state-of-the-art (SOTA) results [31], which contains data retrieved from BRENDA. This dataset consists of 11,722 samples comprising enzyme sequences, substrate molecular fingerprints, and corresponding Km values. We converted the substrate structures into SMILES representations and log10-transformed all Km values. To evaluate the performance of PreKcat on this dataset, we randomly divided it into 80% training data and 20% test data, consistent with the previous publication.
kcat / Km dataset. We constructed a new dataset from the BRENDA, UniProt, and PubChem databases [7, 28, 32]. This dataset comprises 910 samples consisting of enzyme sequences, substrate structures, and their corresponding kcat / Km values. We first obtained each enzyme's UniProt ID, the substrate name, and the kcat / Km value from the BRENDA database. The corresponding enzyme sequences and substrate structures were then retrieved from the UniProt and PubChem databases using the UniProt ID and the substrate name, respectively. We randomly divided the entire dataset into five parts to evaluate the performance of PreKcat.
Construction of PreKcat. We implemented the PreKcat framework using torch v. 1.10.1+cu113 and sklearn v. 0.24.2. PreKcat consists of a representation module and a machine learning module. The representation module generates effective representations of the enzyme sequence and the substrate structure. For the enzyme sequence, we used the ProtT5-XL-UniRef50 protein language model, which has been shown to be effective in predicting peptide and protein function [16]: every amino acid was converted into a 1024-dimensional vector on the last hidden layer, and the resulting vectors were averaged, giving a final 1024-dimensional enzyme representation. For the substrate structure, we generated a SMILES string and used a pretrained SMILES transformer to create a 1024-dimensional vector by concatenating the mean and max pooling of the last layer with the first outputs of the last and penultimate layers [18]. The representation module thus converts the enzyme sequence and substrate structure into numerical representations through unsupervised learning, making them easier for machine learning models to exploit. The second module was an Extra Trees model, a machine learning method that can effectively capture the relationship between the concatenated representation vector of the enzyme sequence and substrate structure and the kcat value [24]. All experiments were conducted in a Linux environment running Ubuntu 20.04.5 on a server with 64 cores and 4 NVIDIA GeForce RTX 3080 GPUs; we used a single core and a single GPU for training.
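A minimal sketch of this pipeline is shown below, assuming the publicly available Hugging Face checkpoint Rostlab/prot_t5_xl_uniref50 for ProtT5; the SMILES-transformer embedding is stubbed out because its pooling code is repository-specific.

```python
# Hedged sketch of the PreKcat representation + regression pipeline.
# "Rostlab/prot_t5_xl_uniref50" is assumed to be the ProtT5 checkpoint used.
import re
import numpy as np
import torch
from transformers import T5EncoderModel, T5Tokenizer
from sklearn.ensemble import ExtraTreesRegressor

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
encoder = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50").eval()

def embed_enzyme(sequence):
    """Average the 1024-d per-residue vectors of the last hidden layer."""
    spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))  # ProtT5 expects spaced residues
    ids = tokenizer(spaced, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**ids).last_hidden_state[0]     # (length + 1, 1024)
    return hidden[:-1].mean(dim=0).numpy()               # drop the end token, then average

def embed_substrate(smiles):
    """Placeholder: 1024-d vector from the pretrained SMILES transformer [18]."""
    raise NotImplementedError("use the pretrained SMILES-transformer repository here")

# X: concatenated 2048-d vectors; y: log-scale kcat values
# X = np.stack([np.concatenate([embed_enzyme(p), embed_substrate(s)]) for p, s in pairs])
model = ExtraTreesRegressor(n_jobs=-1)
# model.fit(X_train, y_train); y_pred = model.predict(X_test)
```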
Construction of EF-PreKcat. We developed a novel framework, called EF-PreKcat, which takes into account environmental factors such as pH and temperature. This two-layer framework comprises a base layer with two individual models: PreKcat and Revised PreKcat. The PreKcat model takes as input a concatenated representation vector of the protein and substrate, while the Revised PreKcat model uses the same concatenated representation vector combined with the pH or temperature value. Both models were trained using the Extra Trees algorithm. The meta layer consists of a linear regression model that takes the predicted kcat values from both PreKcat and Revised PreKcat as inputs. The pH and temperature datasets were divided into training and test sets, with the training set comprising 80% of the dataset. The training set was further split into two subsets: the first training set was 80% of the training set (64% of the entire dataset) and the second training set was 20% of the training set (16% of the entire dataset). The training process involved two steps. In the first step, PreKcat was trained on the DLKcat dataset without environmental factors, while Revised PreKcat was trained on the first training set of the pH or temperature dataset. In the second step, the linear regression model was trained on the second training set of the pH or temperature dataset, using the outputs of both base-layer models. Evaluation was performed on the test data of the pH or temperature dataset. Because performance may vary with the random training/test split, we averaged the results over three random splits to mitigate this risk.
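A hedged sketch of the two-layer stacking scheme with the 64%/16%/20% splits described above; all variable names (X_es, env, X_dlkcat, ...) are placeholders, not names from the original code.

```python
# Hedged sketch of EF-PreKcat: two Extra Trees base models stacked under
# a linear-regression meta model.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 80% train / 20% test, then 80/20 again inside the training set (64%/16%)
X_tr, X_te, e_tr, e_te, y_tr, y_te = train_test_split(X_es, env, y, test_size=0.2)
X1, X2, e1, e2, y1, y2 = train_test_split(X_tr, e_tr, y_tr, test_size=0.2)

base = ExtraTreesRegressor().fit(X_dlkcat, y_dlkcat)                # PreKcat (no env factor)
revised = ExtraTreesRegressor().fit(np.column_stack([X1, e1]), y1)  # Revised PreKcat

meta_X = np.column_stack([base.predict(X2),
                          revised.predict(np.column_stack([X2, e2]))])
meta = LinearRegression().fit(meta_X, y2)                           # meta layer

test_X = np.column_stack([base.predict(X_te),
                          revised.predict(np.column_stack([X_te, e_te]))])
y_pred = meta.predict(test_X)
```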
Evaluation metrics. To evaluate the performance of our model, we used several metrics comparing the predicted and experimentally measured kcat values: the coefficient of determination (R²) in Equation 1, the Pearson correlation coefficient (PCC) in Equation 2, the root mean square error (RMSE) in Equation 3, and the mean absolute error (MAE) in Equation 4. In these equations, $y_i$ denotes the experimentally measured kcat value, $\hat{y}_i$ the predicted kcat value, $\bar{y}$ the average of the experimentally measured kcat values, $\bar{\hat{y}}$ the average of the predicted kcat values, and $n$ the number of samples (which depends on the size of the selected dataset):

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \quad (1)$$

$$\mathrm{PCC} = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}} \quad (2)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \quad (3)$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert \quad (4)$$
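Equivalently, the four metrics can be computed with standard library calls; y_true and y_pred are placeholders for the measured and predicted log-scale kcat values.

```python
# The four evaluation metrics via scipy / scikit-learn equivalents.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

r2 = r2_score(y_true, y_pred)                        # Equation 1
pcc, _ = pearsonr(y_true, y_pred)                    # Equation 2
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # Equation 3
mae = mean_absolute_error(y_true, y_pred)            # Equation 4
```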
Feature importance analysis by SHAP. We utilized SHapley Additive exPlanations (SHAP), a unified framework for analyzing model interpretability, to compute the importance value of each feature [26]. The assigned SHAP value represents the significance of the feature, with higher values indicating greater importance; SHAP also indicates whether a feature's effect is positive or negative. This framework has been widely used to interpret feature importance in various biological problems, including Type IV Secreted Effector prediction and anticancer peptide prediction [43-44]. We applied SHAP to the kcat test set, which comprises 1,684 samples, based on the trained PreKcat model. The SHAP summary produced by TreeExplainer displays the magnitude, distribution, and direction of every feature effect: each dot represents a dataset sample, the x-axis position denotes the SHAP value, and the color represents the feature value. SHAP was implemented through a freely available Python package.
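The analysis reduces to a few calls to the shap package; model and X_test are placeholders for the trained Extra Trees model and the 1,684-sample test matrix from above.

```python
# SHAP feature-importance analysis with TreeExplainer on the trained
# Extra Trees model; the summary plot shows magnitude, distribution,
# and direction of every feature effect.
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)   # one SHAP value per feature per sample
shap.summary_plot(shap_values, X_test)        # dots: samples; x-axis: SHAP value; color: feature value
```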
t-SNE visualization. To better understand the distribution of embedded enzyme and substrate representations and explore the necessity of using machine learning, we utilized t-distributed stochastic neighbor embedding (t-SNE) to visualize the embedded enzyme and substrate vectors [25]. This widely used method has been employed to analyze feature distributions in biological tasks, such as antimicrobial peptide recognition and protein subcellular localization [45-46]. We calculated embedded vectors for all 16,838 samples, concatenated them, and input them into the t-SNE algorithm, which transformed them into two-dimensional vectors. We used the default parameters for the algorithm and normalized the resulting values of the two-dimensional vector for each sample for display, as shown in Equation 5, where $y_i^t$ denotes the $i$th value of the projected vector and $y^t$ denotes all values of the projected vector:

$$\tilde{y}_i^t = \frac{y_i^t - \min(y^t)}{\max(y^t) - \min(y^t)} \quad (5)$$
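A minimal sketch of the projection and the min-max normalization of Equation 5; X is a placeholder for the 16,838 × 2048 matrix of concatenated representations, and the normalization axis is an assumption.

```python
# t-SNE with default parameters, followed by per-axis min-max scaling
# for display (Equation 5).
from sklearn.manifold import TSNE

Y = TSNE(n_components=2).fit_transform(X)                       # (16838, 2)
Y_norm = (Y - Y.min(axis=0)) / (Y.max(axis=0) - Y.min(axis=0))  # scale each axis to [0, 1]
```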
Sample weight redistribution methods. We explored four different methods to adjust sample weights for accurate prediction of high kcat values. These representative weight redistribution methods were Directly Modified sample Weight (DMW), Cost-Sensitive re-Weighting (CSW), Class-Balanced re-Weighting (CBW), and Label Distribution Smoothing (LDS) [22, 29-30]. To enable a fair comparison, we employed 5-fold cross-validation on the entire dataset so that every sample received an independent prediction. We then divided the predicted kcat values into intervals and calculated RMSE and MAE separately for kcat values higher than 4 (logarithm value) and higher than 5 (logarithm value).
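The shared evaluation protocol can be sketched as follows; X and y are placeholders for the full feature matrix and log-scale kcat labels.

```python
# 5-fold cross-validated predictions for every sample, then RMSE/MAE
# restricted to the high-kcat intervals (log values above 4 and above 5).
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(ExtraTreesRegressor(), X, y, cv=5)
for cutoff in (4, 5):
    mask = y > cutoff
    err = y[mask] - y_pred[mask]
    print(f"log kcat > {cutoff}: RMSE={np.sqrt(np.mean(err**2)):.3f}, MAE={np.mean(np.abs(err)):.3f}")
```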
DMW method. The DMW method directly increases the weight of samples with kcat values higher than 4 (logarithm value). We explored several parameter settings, including weight multipliers (2, 5, 10, 20, 50, 100) and whether to normalize the weights. Across the twelve resulting model combinations, a weight coefficient of 10 without normalization was optimal; increasing or decreasing the coefficient resulted in higher RMSE and MAE when predicting high kcat values.
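In code, DMW reduces to a one-line weighting rule passed to the regressor's fit call; X_train and y_train are placeholders.

```python
# DMW sketch: a 10x sample weight (the reported optimum) for samples
# with log kcat above 4, without weight normalization.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

weights = np.where(y_train > 4, 10.0, 1.0)    # y_train holds log-scale kcat values
model = ExtraTreesRegressor().fit(X_train, y_train, sample_weight=weights)
```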
CSW method. The CSW method assigns different weights to different classes so that the model pays more attention to minority categories. Three variants (CSW, root CSW, and square CSW) were applied to all samples. All samples were divided into 131 bins of equal numeric width. For CSW, the weight of each sample was set to the reciprocal of the number of samples in its bin; root CSW and square CSW instead use the square root and the square of that weight, respectively. We found root CSW to be the most effective variant.
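A sketch of the three variants under the 131-bin scheme (y_train is a placeholder for the log-scale labels):

```python
# CSW sketch: bin labels into 131 equal-width intervals and weight each
# sample by the inverse bin count (CSW), its square root (root CSW, the
# best performer here), or its square (square CSW).
import numpy as np

edges = np.linspace(y_train.min(), y_train.max(), 131 + 1)
idx = np.clip(np.digitize(y_train, edges) - 1, 0, 130)   # bin index per sample
counts = np.bincount(idx, minlength=131).astype(float)   # samples per bin

w_csw = 1.0 / counts[idx]
w_root_csw = np.sqrt(w_csw)
w_square_csw = w_csw ** 2
```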
CBW method. The CBW method posits that the marginal value of adding new data points decreases as the size of the dataset increases. To reflect this, the effective number of samples is calculated using Equation 6, where n is the number of samples and β is a hyperparameter between 0 and 1:

$$E_n = \frac{1 - \beta^{n}}{1 - \beta} \quad (6)$$

The weight of each sample is then set to the reciprocal of the effective number of samples. We evaluated the CBW method with different β values (0.7, 0.75, 0.8, 0.85, 0.9, 0.99, 0.999, 0.9999) and found the optimum to be 0.9, which yielded the lowest prediction RMSE and MAE among these settings.
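Reusing the per-bin counts from the CSW sketch, the CBW weighting becomes:

```python
# CBW sketch (Equation 6): effective number of samples per bin with
# beta = 0.9, the reported optimum; counts and idx come from the CSW sketch.
beta = 0.9
effective_num = (1.0 - beta ** counts) / (1.0 - beta)   # E_n per bin
w_cbw = 1.0 / effective_num[idx]                        # inverse effective number per sample
```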
LDS methods. LDS is a simple, effective, and interpretable algorithm for tackling unbalanced datasets that exploits the similarity of nearby labels. It has been shown to be particularly effective in label regions where only a few samples exist, reducing prediction error dramatically. LDS convolves a symmetric kernel with the empirical label density distribution to generate a kernel-smoothed effective density distribution.
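A sketch with a Gaussian kernel; the kernel choice and width are assumptions, since LDS only requires a symmetric kernel.

```python
# LDS sketch: smooth the empirical label density with a symmetric
# (here Gaussian) kernel and weight samples by the inverse smoothed density.
import numpy as np
from scipy.ndimage import gaussian_filter1d

emp_density = np.bincount(idx, minlength=131).astype(float)  # idx from the CSW sketch
eff_density = gaussian_filter1d(emp_density, sigma=2)        # kernel width is an assumption
w_lds = 1.0 / eff_density[idx]
```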
Experimental validation of PreKcat. We applied PreKcat to accelerate the enzyme mining process. Specifically, we selected a crucial enzyme in the naringenin synthetic pathway, tyrosine ammonia lyase (TAL).
BLASTp. Basic Local Alignment Search Tool protein (BLASTp) is a widely used bioinformatics tool for sequence similarity search [47]. It is a protein-protein BLAST algorithm that compares the query protein sequence against a non-redundant protein database and retrieves similar sequences ranked by E-value, which estimates the number of alignments expected by chance. In this study, we ran BLASTp with RgTAL from Rhodotorula glutinis (AGZ04575) as the query to identify sequences with high similarity to TAL, and selected the top 1,000 hits by E-value for kcat prediction with the PreKcat model. Default parameters were used: a BLOSUM62 scoring matrix, a word size of 5, and an expectation threshold of 0.05.
Experimental materials. The plasmids and strains used in the experiments are detailed in Supplementary Tables 1-3. For strain maintenance, Luria-Bertani (LB) medium (10 g/L tryptone, 10 g/L NaCl, and 5 g/L yeast extract) was used. To produce naringenin in the different strains, MOPS (3-(N-morpholino)propanesulfonic acid) medium was used. All chemicals were reagent grade and purchased from Sigma-Aldrich (St. Louis, MO, USA). The NEBuilder® HiFi DNA Assembly Kit (E2621S) was purchased from NEB (Beverly, MA, USA) for plasmid construction.
Determination of enzymatic kinetic parameters. The predicted TALs were codon-optimized for expression in BL21(DE3), as described by Zhou et al. [35]. The enzyme kinetic parameters of the different TAL enzymes were evaluated using the same method as in that publication. Specifically, the candidate enzymes were tested in a 200 μL reaction volume containing purified protein (1 μg), different concentrations of L-tyrosine, and Tris-HCl buffer (90 μL, 50 mM, pH 8.5). The mixture was incubated at 40 °C for 30 min, and the appearance of p-coumaric acid was monitored at 315 nm [35]. One unit of enzyme activity was defined as the production of 1 μM p-coumaric acid per minute.
HPLC methods for naringenin detection. Naringenin was detected on an Agilent 1260 HPLC system (Waldbronn, Germany) equipped with a diode array detector (DAD, 1260 model VL+, G7115A) and a C18 column (3 × 100 mm, 2.7 μm). Detection was performed at 290 nm and 30 °C. A gradient elution was employed: 10% to 40% acetonitrile/water (vol/vol) over 5 min, 40% acetonitrile (vol/vol) for 7 min, 40% to 95% acetonitrile (vol/vol) over 3 min, and 95% to 10% acetonitrile (vol/vol) over 3 min, at a flow rate of 0.3 mL/min. Additionally, 0.3% acetic acid (vol/vol) was added to the mobile phases to facilitate the separation of naringenin.
REFERENCES
43. Chen, T. et al. T4SE-XGB: Interpretable Sequence-Based Prediction of Type IV Secreted Effectors Using eXtreme Gradient Boosting Algorithm. Front. Microbiol. 11, 580382 (2020).
44. Lv, Z., Cui, F., Zou, Q., Zhang, L. & Xu, L. Anticancer peptides prediction with deep representation learning features. Brief. Bioinform. 22, bbab008 (2021).
45. Veltri, D., Kamath, U. & Shehu, A. Deep learning improves antimicrobial peptide recognition. Bioinformatics 34, 2740–2747 (2018).
46. Pan, X. et al. Identification of Protein Subcellular Localization With Network and Functional Embeddings. Front. Genet. 11, 626500 (2021).
47. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).