Background: Enzyme activity is influenced by external environmental conditions, and it is often important for an enzyme to retain high activity under a specific condition. The usual approach is to first determine the enzyme's optimal condition, either by a gradient test or from its tertiary structure, and then to use protein engineering to mutate the wild-type enzyme for higher activity under the desired condition.
Results: In this paper, we investigate the optimal condition of an enzyme by analyzing its sequence directly. We propose an embedding method that represents the amino acids and the construct information as vectors in a latent space. These vectors capture the correlations between amino acids and sites in the aligned amino acid sequences, as well as their correlations with the optimal conditions. We crawled and processed the amino acid sequences of the glycoside hydrolase GH11 family and obtained 125 amino acid sequences with a known optimal pH. We used a probabilistic approximation method to learn the embeddings from these samples. Based on the embedding vectors, we design a computational score to determine the optimal condition of an enzyme; it achieves 80% accuracy on test proteins from the same family. We also give mutation suggestions intended to raise an enzyme's activity in the desired environment, and these suggestions are consistent with expert wet-lab experiments and analysis.
Conclusion: We propose a new computational method for sequence-based analysis of enzyme optimal conditions. Compared with the traditional process, which involves many wet-lab experiments and requires multiple mutations, this method can identify the desired protein for a target condition efficiently and effectively.
Keywords: Protein sequence analysis; Embedding; Bioinformatics
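
As a rough illustration of the scoring and mutation-suggestion steps summarized in the Results, the sketch below shows one way such a score could be computed from embedding vectors. It is not the method of the paper: the additive dot-product score, the two pH classes, the embedding dimension, the randomly initialized (rather than learned) vectors, and every function name are assumptions made purely for illustration.

```python
# Illustrative sketch only. The score definition, the condition classes, and
# all names below are assumptions; they are not taken from the paper.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"  # 20 residues plus the alignment gap
CONDITIONS = ["acidic", "alkaline"]    # hypothetical optimal-pH classes
DIM = 8                                # hypothetical embedding dimension


def random_vector(rng, dim=DIM):
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def init_embeddings(seq_len, seed=0):
    """Create placeholder embeddings; a real model would learn these."""
    rng = random.Random(seed)
    residue_site = {
        (site, aa): random_vector(rng)
        for site in range(seq_len)
        for aa in AMINO_ACIDS
    }
    condition = {c: random_vector(rng) for c in CONDITIONS}
    return residue_site, condition


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def condition_scores(aligned_seq, residue_site, condition):
    """Sum, over sites, the dot product of each (site, residue) vector
    with each condition vector (an assumed form of the score)."""
    return {
        c: sum(dot(residue_site[(i, aa)], vec)
               for i, aa in enumerate(aligned_seq))
        for c, vec in condition.items()
    }


def predict_condition(aligned_seq, residue_site, condition):
    """Return the condition with the highest score."""
    scores = condition_scores(aligned_seq, residue_site, condition)
    return max(scores, key=scores.get)


def suggest_mutation(aligned_seq, target, residue_site, condition):
    """Return (site, new_residue, gain) for the single substitution that
    raises the target-condition score the most; gap sites are skipped."""
    best = None
    for i, old in enumerate(aligned_seq):
        if old == "-":
            continue
        base = dot(residue_site[(i, old)], condition[target])
        for aa in AMINO_ACIDS[:-1]:  # never substitute in a gap
            if aa == old:
                continue
            gain = dot(residue_site[(i, aa)], condition[target]) - base
            if best is None or gain > best[2]:
                best = (i, aa, gain)
    return best
```

With learned vectors in place of the random placeholders, `predict_condition` would play the role of the optimal-condition classifier and `suggest_mutation` would rank single-site substitutions toward a target condition. Because the assumed score is additive over sites, the gain from a substitution depends only on the mutated site.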