A normalized differential sequence feature encoding method based on amino acid sequences

doi:10.21203/rs.3.rs-2246007/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Background

Protein interactions are the foundation of all metabolic activities of cells, such as apoptosis, immune response, and metabolic pathways. To improve the performance of protein interaction prediction, an encoding method based on normalized differential sequence features (NDSF) of amino acid sequences is proposed.

Methods

NDSF jointly encodes the positional relationships between amino acids within a sequence and the correlation characteristics between sequence pairs. Features are then extracted from the encoded 174-dimensional human protein sequence vectors with two dimensionality reduction methods, principal component analysis (PCA) and locally linear embedding (LLE). The study compares the classification performance of four ensemble learning methods (AdaBoost, Extra Trees, LightGBM, XGBoost) on the PCA and LLE features, and uses cross-validation with grid search to find the best combination of parameters.
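The parameter search mentioned above, a grid search scored by cross-validation, can be illustrated with a minimal plain-Python sketch. This is not the study's implementation: the k-nearest-neighbour classifier is only a stand-in for the four ensemble methods, the data are synthetic 174-dimensional vectors rather than NDSF encodings, and every function name here is hypothetical.

```python
import itertools
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(fit, predict, X, y, params, k=5):
    """Mean accuracy over k folds for one parameter combination."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the held-out one.
        train_idx = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx], **params)
        hits = [1.0 if predict(model, X[j]) == y[j] else 0.0 for j in test_idx]
        scores.append(mean(hits))
    return mean(scores)


def grid_search(fit, predict, X, y, grid, k=5):
    """Try every combination in `grid`; return the best one and its CV accuracy."""
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = cross_val_accuracy(fit, predict, X, y, params, k)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Stand-in classifier (assumption): k-nearest neighbours, NOT one of the
# ensemble methods compared in the study.
def knn_fit(X, y, n_neighbors=3):
    return (X, y, n_neighbors)


def knn_predict(model, x):
    X, y, n_neighbors = model
    by_distance = sorted(
        range(len(X)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(X[j], x)),
    )
    votes = [y[j] for j in by_distance[:n_neighbors]]
    return max(set(votes), key=votes.count)


def make_toy_data(n_per_class=30, n_features=174, shift=1.0, seed=1):
    """Synthetic two-class data: Gaussian noise, class 1 shifted by `shift`."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(label * shift, 1.0) for _ in range(n_features)])
            y.append(label)
    return X, y


X, y = make_toy_data()
best_params, best_score = grid_search(
    knn_fit, knn_predict, X, y, {"n_neighbors": [1, 3, 5]}, k=5
)
print(best_params, round(best_score, 3))
```

The search simply keeps the parameter combination with the highest mean held-out accuracy. In practice a library routine such as scikit-learn's `GridSearchCV` provides the same procedure and would be the more usual choice for tuning ensemble classifiers.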

Results

The accuracy of NDSF is generally higher than that of the MOS coding method, and the loss and coding time can be greatly reduced. The bar chart of feature extraction results shows that classification accuracy is significantly higher with the linear dimensionality reduction method PCA than with the nonlinear method LLE. With XGBoost as the classifier, the model accuracy reaches 99.2%, the best performance among all models.

Conclusions

NDSF combined with PCA and XGBoost may be an effective strategy for classifying different human protein interactions.

No competing interests reported.
