Novel Few-shot Learning Neural Network for Predicting Carbohydrate-active Enzyme (CAZyme) Affinity Towards Fructo-oligosaccharides

Background: The enzymatic activity of the microbiome toward carbohydrates in the human digestive system is of enormous health significance. Predicting how carbohydrates in food intake may affect the distribution and balance of the gut microbiota remains a major challenge. Understanding the enzyme-substrate specificity of the carbohydrate-active enzymes (CAZymes) encoded by the vast gut microbiome will be an important step toward addressing this question. In this study, we seek to establish an in-silico approach to studying the enzyme-substrate binding interaction.

Results: We focused on key carbohydrate-active enzymes (CAZymes) and established a novel Poisson noise-based few-shot learning neural network (pFSLNN) for predicting the binding affinity of indigestible carbohydrates. This approach achieved higher accuracy than other classic FSLNNs, and we also formulated new algorithms for feature generation using only a few amino acid sequences. Sliding bin regression is integrated with mRMR for feature selection.

Conclusion: The resulting pFSLNN is an efficient model for predicting the binding affinity between CAZymes and common oligosaccharides. It can potentially be applied to binding affinity prediction for other protein-ligand interactions based on limited amino acid sequences.

… predicted to increase as more human gut microbiome samples are collected (Almeida, …) … to consist of rather various sequences from training samples.

Computing the loss at each round of neural network generation is a core feature of few-shot learning (Garcia, V. and Joan B., 2017). The loss function applied here is based on a prototypical neural network, adjusted to use accuracy rate instead of Euclidean distance. Among the various neural networks available, the prototypical neural network is the most reliable approach in this setting owing to its outstanding performance on small sample spaces in practice (Pan, Y. et al., 2019), often yielding prediction accuracy that surpasses human recognition (He, K. et al., 2016). Integrated with few-shot learning algorithms, the prototypical neural network achieved approximately 70% accuracy in 5-way 5-shot image classification (Richard, Z. et al., 2017).

In this study, the whole set of CAZyme CBMs of probiotic human microbiomes was obtained from the CAZy database (Lombard, V. et al., 2014). The more than 4,000 proteins were clustered with K-nearest neighbors according to their primary structure. This study introduces the novel idea of selecting anchor proteins as bases for feature generation, including cavity-site and protein-binding-site similarity calculated through fuzzy search against the binding-site fragment sequences of the anchor proteins.
Aiming to establish an improved few-shot learning model, we introduce data augmentation through Poisson noise, since it represents the distribution of amino acids in 1D. Previous research shows that site-substitution mutation of proteins can be described by the Poisson-correction method (Sadygov, R. G., 2018); especially when the substitution rate is independent among sites, Poisson correction best describes the scenario (Grishin, N. V., 1995). In this study, since the site-dependence of substitution is unknown, site-independent substitution is assumed. In addition, we mapped the data into several higher dimensions. We also compared the Poisson data augmentation with Gaussian, random, and salt-and-pepper noises.
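As a minimal illustration of this augmentation step (a sketch, not the authors' implementation; the feature encoding, the `lam` noise level, and the zero-mean re-centering are all assumptions), Poisson noise can be added to a 1-D feature vector as follows:

```python
import numpy as np

def poisson_augment(features, lam=0.1, n_copies=5, seed=0):
    """Return noisy copies of a 1-D feature vector.

    Poisson-distributed noise (mean `lam` per element) is drawn and
    re-centered so each augmented copy stays near the original,
    mimicking site-independent substitution (hypothetical sketch).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(features, dtype=float)
    # rng.poisson(lam) has mean lam; subtracting lam makes the noise zero-mean
    noise = rng.poisson(lam, size=(n_copies, x.size)) - lam
    return x + noise

aug = poisson_augment([0.2, 1.5, 0.7], lam=0.1, n_copies=4)
print(aug.shape)  # (4, 3)
```

Each augmented copy is a row; stacking many such copies inflates a few-shot training set without altering its mean feature values.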

The significance of this study is severalfold. First, it takes a first step toward understanding enzymatic function at the scale of the gut microbiome, a timely topic attracting much attention. Second, it establishes a generalized method pipeline for future few-shot learning studies in biology and is the first to apply FSL and noise augmentation to proteins. Since the enzyme-substrate binding predictions are based on primary structure instead of the tertiary structure used in most other studies, the time spent on protein simulation can be reduced. Third, it sets the first example for future studies of protein-substrate interactions performed with minimal data input and limited computational power at reasonable accuracy.

Using a small training set, though timelier, yields less accurate predictions with traditional machine learning algorithms. To achieve better performance, we adopted and modified the prototypical neural network for few-shot learning. This model suits our data set in two respects. First, most features of the data set resemble distances from specific anchor data, which gives each data point an inherent distance to a calculated prototype. This feature-generating technique implicitly encodes the protein evolutionary tree, in which proteins with similar functions from similar organisms closely resemble each other's ligand-binding-site structure. Second, since the training set is small, it is best to run multiple epochs of neural network formation to exploit the random selection of the starting point of linear regression, so that the network with the best cross-validation performance can be selected.
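The prototype computation underlying this model can be sketched as follows. This toy episode shows the standard Euclidean-distance variant of a prototypical network on 2-D embeddings; the paper's adjustment of using accuracy rate instead of Euclidean distance is not reproduced here:

```python
import numpy as np

def prototypes(support_x, support_y):
    """Mean embedding per class over the support set."""
    classes = np.unique(support_y)
    return classes, np.stack([support_x[support_y == c].mean(axis=0)
                              for c in classes])

def classify(query_x, classes, protos):
    """Assign each query point to the nearest prototype (Euclidean)."""
    d = ((query_x[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return classes[d.argmin(axis=1)]

# toy 2-way episode: two support points per class
sx = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
sy = np.array([0, 0, 1, 1])
cls, pro = prototypes(sx, sy)
pred = classify(np.array([[0.05, 0.1], [0.95, 0.9]]), cls, pro)
print(pred)  # [0 1]
```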
Applying prototypical neural network algorithms increases the F1 score of prediction by 18% compared with the next-best machine learning model, SVM (Table 1).

… where the value of λ represents the mutation rate. This mutation rate consists of both the total mutation rate, i.e., the probability that an amino acid site will mutate at all, and the amino acid-specific mutation rate, i.e., the probability of which amino acid the site will mutate into. The amino acid-specific mutation rates were generated by summarizing the occurrence frequency of each amino acid in the entire sample set, which likely represents the relative abundance of each amino acid. This method was compared with an evenly distributed model, and the former gave better prediction results. A range of total mutation rates was tested, and 10% gave the best result. Increasing the total mutation rate exhibits a possible trade-off between overfitting and information preservation.

… CAZyme relationships, together with the intriguing world of FSL modeling, are certainly worth substantial future work.
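The two-level mutation model described above (a total rate deciding whether a site mutates, and per-amino-acid frequencies deciding the replacement) can be sketched as follows; the function name and the uniform baseline shown are illustrative assumptions:

```python
import numpy as np

AAS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

def mutate(seq, aa_freq, total_rate=0.1, seed=0):
    """Site-independent substitution: each site mutates with
    probability `total_rate`; replacements are drawn from the
    empirical amino-acid frequencies of the sample set."""
    rng = np.random.default_rng(seed)
    out = []
    for aa in seq:
        if rng.random() < total_rate:
            out.append(str(rng.choice(AAS, p=aa_freq)))
        else:
            out.append(aa)
    return "".join(out)

uniform = np.full(20, 1 / 20)  # the evenly distributed baseline model
m = mutate("MKTAYIAKQR", uniform, total_rate=0.1)
print(len(m))  # 10
```

Replacing `uniform` with frequencies tallied from the sample set gives the better-performing variant reported above.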

Conclusion
The study focused on the binding of four typical resistant sugars with key carbohydrate-active enzymes (CAZymes) and established a novel Poisson noise-based few-shot learning neural network (pFSLNN) for predicting the binding affinity of indigestible carbohydrates. This approach achieved higher F1 scores than other classic FSLNNs by using Poisson noise augmentation, which had never before been applied in the FSL field. The Poisson augmentation is found to be optimal at a 10% noise level.

… The Rerank score was recorded. For each AAS-oligosaccharide pair, those with a Rerank score below -100 were labeled as 1, representing binding, and the others were labeled as 0, representing non-binding. Each AAS thus has four labels.
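The labeling rule from the docking step can be written compactly; a minimal sketch, with the function name and example scores as assumptions:

```python
def label_pairs(rerank_scores, threshold=-100.0):
    """Binarize docking Rerank scores: 1 = binding (score below the
    threshold), 0 = non-binding. One score per oligosaccharide."""
    return [1 if s < threshold else 0 for s in rerank_scores]

# one AAS docked against the four oligosaccharide ligands
print(label_pairs([-142.3, -87.5, -120.0, -99.9]))  # [1, 0, 1, 0]
```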

Anchor AAS selection
Assuming each AAS in different groups is distinct, one AAS from each group (10 in total) was selected as an anchor AAS. These AAS were not used as testing samples in the subsequent few-shot learning process. For those 10 AAS, residues within 6 Å (Biro, J. C., 2006) of the cavity site were recorded as cavity-related fragments, with connected residues assigned to the same fragment. Fragments of fewer than three residues were neglected. Sugar-binding fragments were also recorded based on the binding position of each oligosaccharide. These fragments were then searched for in each AAS.
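The grouping of cavity residues into fragments can be sketched as follows, assuming the set of residue indices within 6 Å of the cavity has already been computed from the structure:

```python
def cavity_fragments(near_indices, min_len=3):
    """Group residue indices (those within 6 A of the cavity) into
    runs of consecutive residues; fragments shorter than `min_len`
    are dropped, per the fragment-selection rule described above."""
    frags, run = [], []
    for i in sorted(near_indices):
        if run and i != run[-1] + 1:   # gap -> close the current fragment
            if len(run) >= min_len:
                frags.append(run)
            run = []
        run.append(i)
    if len(run) >= min_len:
        frags.append(run)
    return frags

print(cavity_fragments([4, 5, 6, 9, 10, 15, 16, 17, 18]))
# [[4, 5, 6], [15, 16, 17, 18]]  (the 2-residue run 9-10 is dropped)
```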

The key concept of the feature generation pipeline is to obtain the binding patterns of the 10 anchor AAS. Higher similarity in secondary structure, cavity fragments, and sugar-binding fragments between a tested AAS and an anchor AAS suggests a higher probability that the two proteins share the same protein-ligand binding pattern. An anchor AAS always attains the maximum available score when compared against its own secondary structure, cavity fragments, and sugar-binding fragments; thus, the anchors were taken as feature standards by the prototypical neural network and remained in the training set for each round of learning.
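The anchor-based feature idea can be illustrated with a toy similarity function; the longest-common-prefix score below is only a stand-in for the fuzzy-search alignment actually used, and the sequences are invented:

```python
import numpy as np

def anchor_features(sample, anchors, score):
    """Feature vector = similarity of `sample` to each anchor AAS
    under a scoring function. An anchor scored against itself
    attains the maximum available value."""
    return np.array([score(sample, a) for a in anchors])

def lcp(a, b):
    """Toy score: length of the longest common prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

anchors = ["MKTAY", "GGSTP"]
print(anchor_features("MKTAY", anchors, lcp))  # [5 0]
```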

Feature value matrices preparation
According to the secondary structure sequence returned by I-TASSER, the frequency of each AA appearing as each general secondary structure type (Helix, Sheet, Coil) was recorded. These data were used to predict secondary structure.

For each AAS sample, a total of 71 features (6 from the secondary structure score, 10 from binding cavity alignment, 10 from whole-sequence alignment, 40 from sugar-binding fragment alignment, 4 from sugar-binding whole-sequence alignment, and 1 from sample AAS length) were generated according to the AAS and the matrices mentioned above. Six features were generated for the secondary structure score: the estimated number of promoting AA and the estimated number of long consecutive strands for each of the three general secondary structure types. These parameters hint at the overall shape of the protein; for example, more helix-promoting AA combined with fewer helix strands suggests longer helix strands tending toward a rod shape. Ten features were generated from cavity fragment alignment: a fuzzy search algorithm (Algorithm 2) was applied to the cavity fragments generated from the anchor proteins on each sample AAS to search for the longest succeeding fragment chain. A higher score against either anchor … showing the consecutive occurrence of the same classic secondary structure over 5 times as a secondary structure strand.

The function that generates secondary structure scores, denoted GenSS(), was based on the secondary structure promotion matrix (sspM) and the secondary structure promotion threshold matrix (sspT). sspM recorded the frequency of each AA appearing in each classic secondary structure, normalized by the mean and standard deviation of the training set; this normalization keeps the values in a comparable statistical range.
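The sspM normalization can be sketched as follows; whether the normalization is applied over the whole matrix or per structure type is not stated in the text, so the global z-normalization here is an assumption (with a 3-AA toy count table):

```python
import numpy as np

def build_sspM(counts):
    """Z-normalize per-AA secondary-structure frequencies.

    `counts`: array of shape (n_AA, 3) holding how often each AA
    appears in each classic secondary structure (Helix, Sheet,
    Coil) across the training set. Normalizing by the set's mean
    and standard deviation keeps values in a statistical range.
    """
    freq = counts / counts.sum(axis=1, keepdims=True)  # row-wise frequencies
    return (freq - freq.mean()) / freq.std()

counts = np.array([[30, 10, 10], [5, 25, 20], [10, 10, 30]], dtype=float)
sspM = build_sspM(counts)
print(np.isclose(sspM.mean(), 0.0))  # True
```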
sspT contained three sets of thresholds: the promotion bar, the demotion bar, and the tolerance number. The two bars characterize each AA as promoting, indifferent, or demoting for each of the three secondary structure types. The promotion bar is the lower bound on the sspM value for a given AA to be characterized as starting or extending a secondary structure strand. The demotion bar is the upper bound on the sspM value for a given AA to be characterized as prohibiting or terminating a secondary structure strand. An AA with an sspM value between the two thresholds is considered indifferent to that classic secondary structure type. The tolerance number is the minimum number of AA in a secondary structure strand for the next indifferent AA to be viewed as a successor of the ongoing strand.
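One plausible reading of these threshold rules as a scanning procedure (one secondary-structure type at a time) might look like the sketch below; how the paper resolves an indifferent AA hitting a too-short strand is an assumption here (the strand is closed):

```python
def scan_strands(scores, ssp_t):
    """Scan per-residue sspM scores for one secondary-structure type.

    ssp_t = (promotion_bar, demotion_bar, tolerance): a residue
    scoring >= promotion_bar starts or extends a strand; one scoring
    <= demotion_bar terminates it; an indifferent residue (between
    the bars) extends the strand only once the strand is already at
    least `tolerance` residues long. Returns the strand lengths.
    """
    promo, demo, tol = ssp_t
    strands, cur = [], 0
    for s in scores:
        if s >= promo:                # promoting successor
            cur += 1
        elif s <= demo:               # demoting successor: terminate
            if cur:
                strands.append(cur)
            cur = 0
        else:                         # indifferent successor
            if cur >= tol:
                cur += 1              # tolerated as part of the strand
            elif cur:
                strands.append(cur)
                cur = 0
    if cur:
        strands.append(cur)
    return strands

# initial helix thresholds from the figure legend: sspT_H = (-0.5, -1.6, 2)
print(scan_strands([0.2, 0.3, -1.0, 0.1, -2.0], (-0.5, -1.6, 2)))  # [4]
```

The indifferent score (-1.0) at position 3 is absorbed because the strand already has 2 residues, illustrating the tolerance number.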

Generation process
The data input that generates those matrices consisted of the AAS training set, denoted … The difference from the Smith-Waterman algorithm is that Smith-Waterman aims to find local alignments between the two strands, neither of which must include the other, whereas the sliding-window algorithm is guaranteed to find consecutive, inclusive alignments. In addition, the Smith-Waterman algorithm aims to find the aligning strand, while the aim of the fuzzy search is to return the alignment score for each site.

For the whole-sequence alignment score, the shorter of the anchor sequence and the sample sequence was treated as a fragment. The same algorithm as in the previous section was applied, with the substituting and substituted AA assigned according to the compared lengths of the two AAS: AA were substituted from the sample AAS into the anchor AAS. The fuzzy-search alignment value was returned as the whole-sequence alignment score.

… was a matrix containing the ratio between sbVn and sbVm; a larger ratio represents a higher affinity of the substituted AA. sbEX_i was generated by applying an sbM_i filter, multiplied by a factor F, on aaEX:

sbEX_i = aaEX × (1 + F × (sbM_i − 1))    (9)

F was also optimized using Equation 8, with T replaced by F.

3.3 Whole-sequence sugar-binding
Four whole-sequence sugar-binding scores, one for each oligosaccharide ligand, were generated. The AA-sugar interaction matrix was obtained using the same method as above, except that AA across the whole sequence were counted instead of only AA residues within 5 Å of the sugar-binding site, and the average of the AA-sugar …

… and P19 (gray), using the align function in PyMOL, with alignment RMSD = 0.389; residue labels are shown in the corresponding colors. f) Sample AAS alignment of P5 (GlgX [B. glumae]) and P19 (GlgX [A. veronii]) around the two cavity fragments; aligned AA of a given secondary structure or belonging to a cavity fragment are shown in the color scheme. g) Sample AAS alignment of the non-cavity parts of P5 and P19, with AA of a given secondary structure shown in the color scheme. … (H) is shown as an example: a cursor scans over the input AAS to find the first H-promoting AA, and a secondary structure strand of H begins. When the strand length is greater than or equal to 5, the strand is considered a long strand. As the cursor proceeds, 3 types of successor AA (promoting, demoting, indifferent) can result in 5 cases. Initial sspT_H = (-0.5, -1.6, 2).

Tables
Table 1: Accuracy and F1 score for experimented models
Table 2: Percentage of selected feature types in the top 20 features