Pathway-specific protein domains (PSPD) discrimination by using a hybrid feature space based on deep neural networks (DNN)

Pathway-specific protein domains (PSPDs) are important tools in drug development because they provide a fast, reliable, and inexpensive way of assessing complex new molecular targets in specific diseases. Protein architecture prevents a direct correlation from being drawn between signal transduction behavior and cellular structure. Accordingly, tissue factor pathway inhibitor 2 isoform 1 precursor proteins have been used to encode peptide sequence information into specific feature structures. A quantitative structure-activity classification model obtained through machine learning can predict pathway-specific protein interactions and new signaling peptides. We introduce a deep neural network (DNN)-based PSPD predictor, abbreviated as DNNPSPDs, as the first pathway-specific protein domain predictor built on five existing feature models, namely, the AAindex, pseudo-amino acid composition, amino acid composition, the composition mode of pseudo-amino acids, and dipeptide composition (DPC). A total of 900 proteins with undetermined roles collected from the PDB database are tested to evaluate the predictive power of this model. Various combinations of the available feature selection techniques are also applied to process a hybrid feature space. DNNPSPDs predicts PSPDs by using features that are automatically learned from primary protein sequences. The sequences of pathway-associated proteins are sequentially fed into and decoded in the neural network layers. Several classifiers are also employed for comparison. With DPC, DNNPSPDs achieves a prediction accuracy of 0.957 at a Matthews correlation coefficient (MCC) of 91.86%; the second-best configuration achieves a prediction accuracy of 0.936 at an MCC of 88.02%. DNNPSPDs achieves an ROC–AUC of 0.982, which is higher than that of the other machine learning classifiers.
A study using an alternative dataset reveals that our primary pathways, as pathway-specific protein domains, have accurate and reliable associations, thereby proving the viability of the proposed DNNPSPDs.


Introduction
Machine learning techniques are dominant statistical strategies for predicting pathway-specific protein domains (PSPDs). Creating a realistic feature set and selecting matching machine learning algorithms are the two main stages in predictions based on deep neural networks (DNNs) and long short-term memory networks. This study provides detailed information on each cell in the human pathway, which is associated with many proteins, with each protein serving one or more specific functions. A cell can receive numerous signals at the same time and use these data to build an integrated action plan. PSPDs have proteins as their functional units. Many of these proteins are involved in various biological processes, although some of them are connected to certain pathways. Recent advances in experimental methods have improved the pathway prediction capabilities of theoretical methods, such as those based on homology and protein pathway analysis. In this study, we propose a novel method for predicting protein pathways in cancer. Genetically complex diseases have recently attracted much research attention, and studying protein-pathway associations has become a key step in predicting disease pathways. A disease pathway captures the basic mechanism of disease pathogenesis and its characteristics [1]. Many studies on proteins and pathway associations have also begun to consider the components that are associated with cells, but the exact mechanistic effects of these pathways remain unclear. Other studies show that some machine learning methods perform better in pathological prediction by relying on biological networks.
To limit the combinatorial explosion, the prediction of human cancer pathways aims to identify signaling pathways in PSPD prediction [2]. However, previous studies have largely ignored the role of PSPDs in identifying the roles of proteins in the complex interactions and biological mechanisms that drive cellular processes. While the function of a protein needs to be verified by hand in a wet laboratory, scientists need a hypothesis before they can even attempt to determine its probable function. Biologists can use computers to generate these gene-function hypotheses. Studies show that genome sequencing has become a routine practice in determining the functions of proteins. However, the application of gene sequencing in laboratories remains controversial, which increases the importance of computational gene function prediction. Computational approaches are deemed suitable for function prediction because they generate inferences from experimental data that identify the similarities between a gene and known proteins. These approaches include sequence similarity tools, such as the Basic Local Alignment Search Tool (BLAST), which searches all previously recorded sequences and generates a list of possible roles for a query sequence.
Computational biology urgently requires new methods that can accurately reflect the nature of biological processes. Previous studies have attempted to capture this nature based on hierarchical multiple protein features. Unlike traditional sequence-based methods, this approach takes evolutionary relationships into account and has a better allocation function than the simple backbone of amino acids. Machine learning methods have also been used to predict whether a protein has a dual role. Genetically complex diseases have recently attracted much research attention, and studying PSPD associations plays an important role in predicting disease pathways. This study attempts to establish a connection between pathways and proteins, provide detailed information about the pathogenesis of a complex disease and its features, and highlight the role of signaling pathways in predicting protein-pathway associations [3]. Computational biology is a growing discipline that combines research methods with systems biology to explore various biological phenomena. Cellular classifications, such as PSPDs, allow us to examine the structures of cells. Nevertheless, one major challenge in systems biology is fully integrating genomic and proteomic knowledge with other data sources, such as printed literature, which can translate the original data into useful information and contribute to the present knowledge of biology [4]. The recognized proteins include tissue factor pathway inhibitor 2 isoform 1 precursor, with abundant differences among them. The PSPD includes a simple insert that serves as a retention signal for cells. It is needed for a standard alveolar septum structure during embryogenesis, normal gastrointestinal tract growth, normal Leydig cell development, and spermatogenesis. It is also needed for the normal production of oligodendrocytes and the normal formation of myelin in the spinal cord and cerebellum, and it plays an important role in wound healing. This PSPD is associated with a pathway (R-HSA-3000171). This study examines the relationship between protein group coherence and pathway assignment based on a functional association. To this end, we use 15 proteins and 4 protein-pathway associations (NP_001230133.1, NP_057665.2, NP_001135936.1, and NP_001135937.1), named evolutionarily conserved signaling intermediate in Toll pathway, mitochondrial isoform 1 precursor (Homo sapiens). These 15 proteins have been reviewed and manually annotated in Swiss-Prot.
The UniProt Knowledgebase (UniProtKB) is considered the main source of functional information on proteins given its complete, consistent, and powerful annotations. Each entry in UniProtKB (i.e., amino acid sequence, protein name or description, classification data, and reference information) carries as much annotation information as possible. The use of rule mining in predicting human pathways associated with prokaryotic UniProtKB data has received much attention in recent years [5]. Many researchers have also proposed innovative computational methods for predicting protein function from amino acid sequences. Compared with the mono-functional approach of a single protein, these methods can effectively predict the relationships of entire protein sets with specific biological processes. Many studies have also applied mining to examine the involvement of human proteins with multiple independent functions in pathways. Post-translational modifications, such as phosphorylation and ubiquitination, can significantly affect protein functions and often act as control mechanisms in signal transduction pathways [6]. Given that proteins communicate with one another, the interactions between proteins and their binding sites should be identified to facilitate hypothesis-driven research and explorations of regulatory networks [7]. The precise subcellular localization of proteins and their tissue distribution in vivo are also important in identifying protein function [8]. Checking whether any proteins are affected can also help indicate a pathway [9]. The National Center for Biotechnology Information (NCBI) Reference Sequence database is among the main repositories for DNA and protein sequences and features [10]. Swiss-Prot is another common source of protein information [11]. However, these databases do not contain information on many protein characteristics and functions that are too complex to understand.
In addition to visualizing protein interaction networks, network diagrams can also describe the roles of new molecules in large signaling networks. These networks have a large number of interacting proteins that can highlight the patterns of certain classes of molecules, pathways, or cellular processes. Motivated by previous studies that have largely focused on sequence characteristics, including amino acid composition (AAC), chain transfer distribution, dipeptide composition (DPC), and pseudo-amino acid composition (PAAC), and have employed various feature selection strategies, such as correlation-, variance-analysis-, minimum-redundancy-, and maximum-correlation-based feature selection, we develop a novel model called DNNPSPDs for protein function prediction [12][13][14][15][16][17][18][19]. Protein structures have been predicted in the literature by using several machine learning methods, such as artificial neural networks (ANNs). DNNs, as a subgroup of ANNs, have several hidden layers. These networks take low-level features as inputs and create increasingly abstract features at each subsequent layer. DNN-based approaches have been widely applied in the fields of computer vision and natural language processing. Recent improvements in computational capacity have also allowed the scientific community to apply DNN-based methods across various domains, including biomedical data analysis, where DNN algorithms are shown to outperform the conventional predictive methods used in bioinformatics and cheminformatics [20]. DNNs can be divided into two classes. Multi-task DNNs classify input instances into multiple predefined classes/tasks, whereas single-task DNNs aim to produce a binary prediction. DNNs have also been categorized into several classes based on their designs and features. The most common architectures include feedforward DNNs (i.e., multi-layer perceptrons), recurrent neural networks, restricted Boltzmann machines, and deep belief networks [21]. In this study, we introduce DNNPSPDs and demonstrate its excellent performance in PSPD prediction. Apart from tuning its parameters, we also compare the prediction performance of this model with that of other machine learning classifiers, such as the AdaBoost classifier (ABC), k-nearest neighbors (KNN) classifier, support vector classifier (SVC), linear discriminant analysis (LDA), gradient boosting classifier (GBC), random forest classifier (RFC), Gaussian naïve Bayes classifier (GNB), decision tree classifier (DTC), multi-layer perceptron classifier (MLPC), and extra trees classifier (ETC). We also adopt feature extraction protocols that have been successfully used in solving various biological problems to determine the best feature extraction method. We hypothesize that AAindex, AAC, DPC, PseAAC, and PAAC are the best feature extraction methods. We also propose a method based on the aforementioned machine learning classifiers.

Proposed model
By using a minimum-redundancy dataset, we construct DNNPSPDs as a novel machine learning model for predicting protein-pathway associations. Prior to the construction of this model, we investigate AAindex, AAC, DPC, PAAC, and PseAAC, implement a two-step feature selection protocol, and identify the optimal feature selection protocol for removing irrelevant features. We then compare the prediction performance of the five feature encoding models with that of DNNPSPDs and use the results as inputs to 10 machine learning classifiers. We also generate various feature space combinations to create hybrid spaces. Tenfold (k = 10) cross-validation tests are also conducted to evaluate the performance of these classifiers. Figure 1 illustrates the structure of the proposed model.

Datasets
We employ a quantitative approach built on a carefully assembled dataset, since a suitable dataset generally improves the success rate of machine learning models. We collect our data from the SMPDB, UniProtKB, and Swiss-Prot databases. A total of 900 protein sequences are collected, among which 115 are PSPDs (positive samples) and 121 are non-PSPDs (negative samples). These pathway proteins have also been used for predicting subcellular localizations [22]. We preprocess the downloaded datasets according to the protein-pathway and protein-non-pathway relationships, save them in CSV format, and then set the parameters of the proposed model.

Feature extraction techniques
Selecting the appropriate feature extraction techniques is a complex process that can greatly facilitate the exploration of biological features. These techniques also require tuning and fine adjustment [23]. The number of elements (n) in the feature vector varies with the type of feature. Generally, the feature vector (V) of the protein sequence with index i can be written as $V_i = (f_1, f_2, \ldots, f_n)$,
where $f_j$ is the value of element j. A feature dimension is also known as a component, vector, or column, and these terms are used interchangeably. The characteristics of proteins are usually derived from various sequences. If N sequences are available, their feature vectors can be stacked into a matrix with one row per sequence. This matrix can also be viewed as a table with N rows and n columns. While N is determined by the number of sequences, n depends on the applied feature extraction method. These features are generally classified into seven groups, including AAC, autocorrelation, composition, transition, and distribution, quasi-sequence order, and PseAAC. AAC is computed as the percentage of each amino acid in a peptide. Each of the 20 elements of the resulting vector corresponds to one amino acid. Amino acids are organic compounds that join together to form proteins, and both amino acids and proteins are fundamental components of any living being. Amino acids have also been described as the building blocks of peptides and proteins. Each amino acid consists of an amino group and a carboxyl group bound to a tetrahedral carbon, which is referred to as the α-carbon (alpha-carbon). The 20 standard amino acids differ from one another in their side chains. Each amino acid is unique in terms of its hydrophilicity, hydrophobicity, polarity, and charge. The features of amino acids in this study are extracted by using AAindex, PAAC, AAC, DPC, and PseAAC, all of which involve a PSPD sequence formulation that reliably classifies protein pathways. Machine learning techniques have been extensively applied to biological problems in order to predict the complex biological functions of proteins.

Amino acid index (AAindex)
We compute binary profiles by using AAindex. For example, when 10 AAindex properties are chosen, each amino acid is represented by 10 values, one per AAindex property. The resulting function returns the binary profile of the input amino acid indices. If the normalized AAindex score of a residue is negative, then this residue is assigned a value of 0; otherwise, a value of 1 is assigned.
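The binarization rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this study: the z-score normalization, the function names, and the three-residue `toy_index` are assumptions made for the example, and the toy values are not real AAindex entries.

```python
from statistics import mean, pstdev


def normalize_index(index):
    """Z-score normalize one amino acid index (dict: residue -> raw value).

    Z-scoring is an assumption; the text only says the scores are normalized.
    """
    values = list(index.values())
    mu, sigma = mean(values), pstdev(values)
    return {aa: (value - mu) / sigma for aa, value in index.items()}


def binary_profile(sequence, indices):
    """Return one row per residue with one bit per index.

    A bit is 0 if the normalized score is negative and 1 otherwise.
    """
    normalized = [normalize_index(index) for index in indices]
    return [[0 if norm[aa] < 0 else 1 for norm in normalized] for aa in sequence]


# Hypothetical toy index covering three residues only (not real AAindex values).
toy_index = {"A": 1.0, "G": -2.0, "L": 4.0}
profile = binary_profile("GAL", [toy_index])
print(profile)
```

With 10 indices passed in `indices`, each residue would be mapped to a row of 10 bits, which matches the 10-value representation described in the text.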
A total of 236 amino acid indexes are collected.
Using the complete AAindex accurately represents the physicochemical properties around the acetylation site but also introduces redundancy and noise. Twelve physicochemical properties are therefore eventually selected.
We improve the SVM classifier by using different descriptors and find that using APAAC can slightly improve its prediction accuracy. In this case, we use APAAC to extract details on hydrophobicity, hydrophilicity, and amino acid sequences.

Composition of amino acids (AAC)
We determine AAC by calculating the occurrence frequency of each amino acid. AAC is a feature encoding that has been commonly used to measure the occurrence frequency of the 20 amino acids within a given sequence fragment. AAC can be expressed as $\mathrm{AAC}(w) = n_w / T$, where $n_w$ denotes the number of occurrences of the native amino acid type w (w = 1, 2, 3, 4, …, 20) and T denotes the length of the protein sequence. Many studies from different fields, such as bioinformatics, have developed AAC construction techniques to distinguish different protein structure categories [25], membrane protein types, and protein contact numbers. We obtain predictive data with 21 characteristics from the proposed model. We set the target variable and calculate AAC. Given the absence of any missing values in our data, we do not check for null values.
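As an illustration of the frequency calculation described above, the following Python sketch computes a 20-dimensional AAC vector. The residue ordering in `AMINO_ACIDS` and the function name are assumptions made for the example; the source does not specify them.

```python
# The 20 standard amino acids in alphabetical one-letter order (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Return the fraction of each standard amino acid in a non-empty sequence."""
    length = len(sequence)
    return [sequence.count(aa) / length for aa in AMINO_ACIDS]


features = aac("ACCA")
print(features[:2])  # fractions of A and C
```

Appending the class label to these 20 values gives a 21-column record per protein, which may be what the 21 characteristics mentioned above refer to.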

Dipeptide composition (DPC)
We calculate the DPC of a given peptide sequence over its length. Every single peptide/protein... Sequences are classified based on DPC, which counts pairs of adjacent residues in a sequence (e.g., AA, AC, and AD). Various PSPD prediction and composition-based algorithms proposed in the literature use DPC to classify protein sequences [26]. DPC is the composition measured for each of the 400 possible dipeptides formed by the 20 amino acids. Compared with AAC, DPC provides additional local order information about a peptide/protein, as it considers pairs of adjacent amino acids [27]. We calculate DPC as $\mathrm{DPC}(i) = N_i / T$, where i = 1, 2, 3, …, 400, $N_i$ denotes the number of occurrences of dipeptide i, and T represents the total number of dipeptides in the sequence, counted over the 400 dipeptides that can be formed by 20 amino acids.
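The dipeptide counting described above can be sketched as follows. This is a hedged illustration, not the code used in this study: the function name, the dipeptide ordering, and the choice to skip pairs containing non-standard residues are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of standard amino acids: AA, AC, AD, ...
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dpc(sequence):
    """Return the 400 dipeptide frequencies of a sequence with at least 2 residues."""
    total = len(sequence) - 1  # number of adjacent residue pairs
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = sequence[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return [counts[d] / total for d in DIPEPTIDES]


features = dpc("AAC")
print(features[DIPEPTIDES.index("AA")], features[DIPEPTIDES.index("AC")])
```

The sequence "AAC" contains the two adjacent pairs AA and AC, so each receives a frequency of one half and the remaining 398 entries are zero.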

Features based on pseudo-amino acid composition (PseAAC)
Simple empirical descriptors have been widely used in the existing bioinformatics and biomedical literature. Introducing PseAAC [28] to represent protein sequences makes the problem more involved.
PseAAC uses values associated with factors that represent sequence-order information [29] and has been applied to problems such as the subcellular localization of mycobacterial proteins, the superfamily and family classification of snail toxins, and the determination of the quaternary structure of proteins [30,31].
iFeature employs a comprehensive sequence encoding scheme for pathway-related proteins that covers 53 types of feature descriptors. This tool also allows users to choose specific amino acid characteristics from the AAindex database and is equipped with a Python package and web server for extracting PseAAC features from protein and peptide sequences. Chou introduced the concept of PseAAC for estimating cellular protein attributes and proposed a set of discrete numbers based on traditional AAC to capture the potential patterns of sequence order. PseAAC has been effectively used in solving many biological problems [32]. The PseAAC vector is written as $P = [f_1, f_2, \ldots, f_{20}, f_{20+1}, \ldots, f_{20+\lambda}]^{T}$, where T denotes the transpose, $f_1, \ldots, f_{20}$ are the fractions of the 20 distinct amino acids, and the remaining components are amino acid correlation variables determined on the basis of charge, hydrophilicity, and hydrophobicity. PseAAC has also been used in representing RNA/DNA sequences [33].
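For reference, the components of a type-1 PseAAC vector are commonly written as below. This is a sketch of the standard formulation attributed to Chou, not necessarily the exact variant used in this study; the weight $w$, the number of tiers $\lambda$, the sequence length $L$, and the correlation function $\Theta$ are symbols introduced here for illustration, and their values are not given in the text.

```latex
P = [p_1, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\lambda}]^{T},
\qquad
p_u =
\begin{cases}
\dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \theta_k}, & 1 \le u \le 20,\\[2ex]
\dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \theta_k}, & 21 \le u \le 20+\lambda,
\end{cases}
\qquad
\theta_k = \frac{1}{L-k} \sum_{i=1}^{L-k} \Theta(R_i, R_{i+k})
```

Here $f_u$ is the occurrence frequency of amino acid $u$, $R_i$ is the residue at position $i$, and $\theta_k$ is the $k$-th tier sequence-order correlation factor, which averages a physicochemical distance $\Theta$ over all residue pairs that are $k$ positions apart.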

Hybrid features PSPDs
For the analysis, we build a hybrid PSPD model based on AAC, DPC, and PseAAC and then check for the presence of peptides in the training dataset of pathway-specific and non-pathway-specific protein domain motifs. Where an epitope includes pathway protein domain motifs, a weight of +1 is applied to its AAindex, PAAC, AAC, DPC, or PseAAC features, depending on the classifier used. Similarly, a weight of −1 is applied if the epitope is positive for non-pathway protein motifs.
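One simple way to form a hybrid feature space is to concatenate the individual feature vectors of each protein, as sketched below. This is an assumption about how the hybrid space could be assembled, not a description of the exact procedure used here; the function name and the toy values are hypothetical.

```python
def hybrid_features(*blocks):
    """Concatenate per-protein feature blocks column-wise.

    Each block is a list of feature vectors, one vector per protein, and all
    blocks must list the same proteins in the same order.
    """
    n_rows = len(blocks[0])
    if any(len(block) != n_rows for block in blocks):
        raise ValueError("all feature blocks must describe the same proteins")
    return [sum((list(block[i]) for block in blocks), []) for i in range(n_rows)]


# Hypothetical toy blocks for two proteins (shortened vectors, not real features).
aac_block = [[0.5, 0.5], [1.0, 0.0]]
dpc_block = [[0.25, 0.75, 0.0], [0.0, 0.0, 1.0]]
hybrid = hybrid_features(aac_block, dpc_block)
print(hybrid)
```

With full-length encodings, combining AAC (20 values) with DPC (400 values) in this way would give 420 columns per protein.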

Correlation Matrix
Figure 2 presents the feature correlation matrix, which is plotted with Pyplot at a figure size of 12 × 8 set through rcParams. The feature names are added to the axes of the correlation matrix by using xticks and yticks. No single feature is strongly correlated with our target, although some features show either negative or positive correlations with the target value.
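A correlation matrix of this kind can be computed and plotted as sketched below. The example is a minimal illustration with hypothetical feature names and toy values; the use of Pearson correlation, the color map, and the output path are assumptions. Matplotlib is imported only inside the plotting function so that the computation does not depend on it.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long, non-constant lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_matrix(columns):
    """Pairwise correlations of named columns (dict: name -> list of values)."""
    names = list(columns)
    return [[pearson(columns[a], columns[b]) for b in names] for a in names]


def plot_correlation(columns, path="correlation.png"):
    """Draw the matrix at 12 x 8 with named ticks (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without a display
    import matplotlib.pyplot as plt

    plt.rcParams["figure.figsize"] = (12, 8)
    names = list(columns)
    plt.imshow(correlation_matrix(columns), cmap="coolwarm", vmin=-1, vmax=1)
    plt.xticks(range(len(names)), names, rotation=90)
    plt.yticks(range(len(names)), names)
    plt.colorbar()
    plt.savefig(path)
    plt.close()


# Hypothetical toy data: two features and a binary target.
toy = {"f1": [0.1, 0.2, 0.3, 0.4], "f2": [0.4, 0.3, 0.2, 0.1], "target": [0, 0, 1, 1]}
matrix = correlation_matrix(toy)
```

In this toy data, `f1` rises with the target and `f2` falls with it, so the last row of the matrix contains one positive and one negative correlation.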

Figure 2. High correlation with our target value
We now describe the types of graphics that require only one command to produce and that provide a large amount of information. Figure 3 shows how the different sets of features and labels are distributed, which further reinforces the need for scaling. Each bar chart in the figure denotes a specific category of variables that needs to be examined before the implementation of machine learning. Our proposed model comprises an input layer, several hidden layers, and an output layer. Any neural network with two or more hidden layers is considered a DNN [34]. The layers in a DNN are fully connected, that is, the units of each hidden or output layer are connected to all units of the previous layer (Figure 4-A). The output values are computed sequentially, layer by layer (Figure 4-B), and are transformed in a non-linear manner until the final output is determined.
The rules of ReLU present another problem [35]. Along with sigmoid and tanh, ReLU is an activation function widely used in the literature that we also adopt in this study. We perform our optimization analyses and experiments by using Adam given its simple implementation, computational efficiency, and low memory requirements. Adam is also well suited to sparse gradients and high noise rates when updating its parameters. We employ the cross-entropy loss to avoid very slow weight updates. Cross-entropy is a nonnegative function, and a smaller loss corresponds to a better model performance. We therefore adopt cross-entropy as the objective (cost) function of our model. Several scholars have also used Adam [36] to minimize the cross-entropy loss at a dropout rate of 0.5 [37]. The output of the softmax function [38] is interpreted as a class probability.
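To make the Adam update concrete, the following sketch applies a single update to one scalar parameter. It is an illustration only: the hyperparameter defaults shown are the commonly cited ones and are assumptions here, since the text does not state the values used in the experiments.

```python
from math import sqrt


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter.

    m and v are the running first and second moment estimates, and t is the
    1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad          # update first moment
    v = beta2 * v + (1 - beta2) * grad * grad   # update second moment
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = adam_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1)
print(theta)
```

On the first step, the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate regardless of the gradient's magnitude. Only two running averages are stored per parameter, which is the source of the low memory requirement.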

Activation function
DNNs are built from neurons. A neuron accepts numbers that arrive from the terminal branches of the preceding neurons (here, the features of pathway-specific proteins). In each neural network layer, every input derived from a protein is multiplied by its weight, and the weighted inputs are summed over all incoming neurons.

The role of ReLU Activation Function
Many studies have highlighted ReLU as one of the most widely used activation functions. Accordingly, the application of ReLU in neural networks and deep learning has been intensively examined. We apply the following ReLU function as the activation function of all neurons: $R(z) = \max(0, z)$. As shown in Figure 5, ReLU is half-rectified (from the bottom): R(z) is equal to 0 if z is below 0 and takes a value of z if z is equal to or above 0, so its range is $[0, \infty)$. The output layer uses the softmax function, $\sigma(z)_j = e^{z_j} / \sum_{k=1}^{K} e^{z_k}$, where z is the input vector of the output layer and j indexes the output units (j = 1, 2, 3, …, K). If we have 10 output units, then z has 10 elements, which equals the total number of output neurons in the softmax layer. Similar to a sigmoid function, the softmax function keeps the output value of each unit between 0 and 1, yet it also normalizes the outputs so that they sum to 1 (Figure 4). The output of the softmax function is therefore a categorical probability distribution that gives the probability of each class. The majority of studies in this area have used Keras and TensorFlow to implement their models. The parameters used in the experiment are listed in Table 1. In our experiment, the cost (loss) function measures the distance between the predicted and actual values. We use cross-entropy as our cost function. Sigmoid functions are mostly used in shallow neural networks and require a low initialization power. Tanh functions are mostly used to address symmetric problems with two classes, whereas ReLU is often used in deep learning. The one-sided inhibition of ReLU leads to a sparse activation of neurons in a neural network. We use the ReLU activation function because it is easy to compute and offers good learning ability.
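The activation and loss functions discussed above can be written in a few lines of plain Python. This is a minimal sketch for illustration, not the Keras/TensorFlow code used in the experiments; the function names and the example inputs are assumptions.

```python
from math import exp, log


def relu(z):
    """Return 0 for negative inputs and z itself otherwise."""
    return max(0.0, z)


def softmax(z):
    """Map a list of scores to probabilities that sum to 1."""
    shift = max(z)  # subtracting the maximum avoids overflow in exp
    exps = [exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the true class."""
    return -log(probs[true_class])


probs = softmax([2.0, 0.0])
loss = cross_entropy(probs, 0)
print(relu(-3.0), relu(2.5), probs, loss)
```

The loss approaches 0 as the probability of the true class approaches 1 and grows without bound as that probability approaches 0, which is why a smaller cross-entropy indicates a better fit.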

Evaluation by cross-validation
We utilize various cross-validation approaches to analyze the statistical quality of our predictions. Specifically, we adopt independent dataset testing, tenfold cross-validation, and the k-fold test. The jackknife test has also been increasingly used in previous research to check the accuracy of various predictors and to evaluate the predictions of classifiers.
In our analysis, we divide our dataset into 10 parts and subject them to a cross-validation experiment.
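A tenfold split of this kind can be sketched as follows. The example is a simplified illustration that assumes unstratified, shuffled folds; the function name and the random seed are hypothetical, and 236 is the sum of the 115 positive and 121 negative samples reported in the Datasets section.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(236, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each sample appears in exactly one test fold, so every protein is used for testing once and for training nine times.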

Computational tools for experiments
We use qualitative/quantitative approaches along with some machine learning tools, such as MATLAB R2018a, to build and evaluate parts of the algorithms. Specifically, we utilize MATLAB to compute numerical descriptors from a protein sequence and Weka for the classification. We then examine the output of different classifiers. MATLAB is a fourth-generation programming language.

Results and performance evaluation

Performance evaluation of classifiers
Assume that M is a dataset that includes N samples, $X_i$ is the feature space, and $Y_i$ denotes the target label, where i ∈ M. We measure performance with Eqs. (16) to (19), which define the sensitivity (Sn), Matthews correlation coefficient (MCC), accuracy (Acc), and specificity (Sp):
$$\mathrm{Sn} = 1 - \frac{N^{+}_{-}}{N^{+}}, \qquad \mathrm{MCC} = \frac{1 - \left( \frac{N^{+}_{-}}{N^{+}} + \frac{N^{-}_{+}}{N^{-}} \right)}{\sqrt{\left( 1 + \frac{N^{-}_{+} - N^{+}_{-}}{N^{+}} \right) \left( 1 + \frac{N^{+}_{-} - N^{-}_{+}}{N^{-}} \right)}},$$
$$\mathrm{Acc} = 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}, \qquad \mathrm{Sp} = 1 - \frac{N^{-}_{+}}{N^{-}},$$
where $N^{+}$ is the total number of PSPD proteins investigated, $N^{+}_{-}$ is the number of PSPD proteins incorrectly identified as non-PSPD proteins, $N^{-}$ is the total number of non-PSPD proteins investigated, and $N^{-}_{+}$ is the number of non-PSPD proteins incorrectly identified as PSPD proteins.
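The four evaluation measures can be computed from the counts of a binary confusion matrix, as in the sketch below. The function name and the example counts are hypothetical and are not results from this study; the formulas are the standard confusion-matrix forms of sensitivity, specificity, accuracy, and MCC.

```python
from math import sqrt


def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"Sn": sensitivity, "Sp": specificity, "Acc": accuracy, "MCC": mcc}


# Hypothetical counts: 100 positives (90 found) and 100 negatives (80 rejected).
scores = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print(scores)
```

MCC ranges from −1 to 1 and, unlike accuracy, remains informative when the two classes are imbalanced.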

Impact of extraction algorithm
The proposed DNNPSPDs model demonstrates an excellent performance in formulating the protein sequences with PseAAC, DPC, AAindex, AAC, and PAAC. We use datasets with the same data points in various classification tasks and perform principal component analysis to condense the hybrid feature space. This approach demonstrates the usefulness of the individual and hybrid feature spaces for different classifiers. We use DNN in all analyses, while PseAAC, DPC, AAindex, AAC, and PAAC are used to determine the optimal parameters that have critical effects on the model development. We also conduct a content analysis to calculate the values of various parameters, and we use PSPD predictability as a metric to determine those parameters with

Comparison of predicted ROC-AUC score
Although multi-information fusion improves the prediction efficiency of a model to some degree, this approach also introduces redundant feature information, thereby affecting the classification accuracy of the model and reducing the calculation speed. We compare the average prediction, precision, and measurement performance of DNNPSPDs with those of other models. Figure 6 shows the ROC-AUC curves that correspond to various PSPD datasets. DNNPSPDs achieves the highest ROC-AUC score (0.983), followed by DPC (0.982), PseAAC (0.983), AAindex (0.965), AAC (0.943), and PAAC (0.815).

Performance of classifier using (AAC+PseAAC)
PseAAC combines AAC principles with sequence correlation variables. Table 5 in the supplementary material (Appendix B) presents the results of the classifiers that use the AAC and PseAAC hybrid feature spaces. ABC, LDA, KNN, and NB achieve accuracies of 0.915% 0.879% (86.82%), 85.58% (78.38%), and 84.31% (54.26%) when using AAC (PseAAC), respectively. Although we investigate 10 classifiers in this work, we focus only on the accuracies of the aforementioned classifiers, all of which achieve higher accuracies when using AAC instead of PseAAC, as shown in Figure 11, which reports the results of the classifiers that use PseAAC feature spaces. DNN achieves the best accuracy of 93.60%, followed by ABC.

Performance of classifiers using (DPC+PseAAC)
This process is repeated 10 times. The accuracy and Matthews correlation coefficient (MCC) can be used to test the output of different models. We also assess the efficiency of the classifiers in a compressed feature space. Table 4 shows the performance of these classifiers in the compressed feature space. When DPC and PseAAC are used as extraction models, DNN achieves the best accuracies of 95.72% and 93.60%, respectively. Meanwhile, ABC achieves the best accuracy (87.30%), sensitivity (96.00%), specificity (81.82%), precision (85.71%), recall (96.00%), and F1-measure (86.77%) when using AAC. Similarly, ABC achieves the best accuracy (86.82%), sensitivity (76.00%), specificity (81.82%), precision (82.61%), recall (76.00%), and F1-measure (85.22%) when using PseAAC. All classifiers are evaluated empirically on the individual and hybrid feature spaces. In sum, using the AAC and PseAAC hybrid feature space yields promising results for ABC. We also reduce the feature space by implementing the mRMR selection technique, which negatively affects the performance of the classifiers by removing some essential features of the space. Specifically, these classifiers perform better in the original feature space than in the compressed feature space after an empirical evaluation. We therefore attribute this performance to the DPC and PseAAC hybrid feature space, as well as to the discrimination capacity of AdaBoost.

Performance comparison of our model with existing models
Table 7 compares the performance of DNNPSPDs with some existing classification methods. Jung et al. proposed the ECMPP method [54], where five pathway-protein characteristics are used for the classification. This approach yields a precision of 85.71%, sensitivity of 96.00%, and specificity of 81.82% [54]. Meanwhile, the PSPD model introduced in [55], which bases its predictions on 10 classifiers with PSSM spaces, achieves a precision and sensitivity of 96.00% and 81.82%, respectively. Yang et al. proposed the IECMP model [56], which predicts PSPD proteins with a mixed-characteristic set classification and achieves a precision, sensitivity, and specificity of 86.40%, 87.80%, and 88.67%, respectively. The proposed DNNPSPDs model outperforms the IECMP model by 10.36% in terms of precision [56]. In addition, the computational model described in this paper will be made available through a web server.

Performance comparison of different classifiers
Among the examined classifiers, DNN achieves the best accuracies of 95.72% and 93.61% when using DPC and PseAAC, respectively. ABC shows the second-highest accuracy of 85.99% when using DPC, followed by KNN (85.99%). Among the remaining classifiers, ABC emerges as the best classifier, followed by KNN and LDA (with 90.66% accuracy) when using DPC, as shown in Figure 12 in the supplementary material (Appendix A).

Performance comparison of (AAC, DPC and PseAAC model)
We adopt qualitative and quantitative techniques to analyze the extraction of PSPD functional annotation features. Feature engineering is an important step in the application of machine learning methods. Given that AAC is the most popular feature of PSPDs, we use PAAC, AAIndex, AAC, DPC, and PseAAC to determine which model performs best in terms of feature construction, as shown in Figure 13 in the supplementary material (Appendix A).
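To make two of these extraction models concrete, the following minimal sketch computes amino acid composition (AAC, 20 features) and dipeptide composition (DPC, 400 features) for a single sequence and concatenates them into a hybrid vector. The function names and the toy sequence are hypothetical, and the sketch simply drops non-standard residues before counting, which is a simplifying assumption rather than the preprocessing used in this work.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac(sequence):
    """Amino acid composition: fraction of each standard residue (20 values)."""
    seq = [r for r in sequence.upper() if r in AMINO_ACIDS]
    n = len(seq)
    return [seq.count(a) / n if n else 0.0 for a in AMINO_ACIDS]

def dpc(sequence):
    """Dipeptide composition: fraction of each adjacent residue pair (400 values)."""
    # Assumption: non-standard residues are removed before pairs are formed.
    seq = "".join(r for r in sequence.upper() if r in AMINO_ACIDS)
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(pairs)
    counts = {}
    for pair in pairs:
        counts[pair] = counts.get(pair, 0) + 1
    return [counts.get(a + b, 0) / total if total else 0.0
            for a, b in product(AMINO_ACIDS, repeat=2)]

toy = "MKVLAAGLLK"           # hypothetical toy sequence
aac_vec = aac(toy)
dpc_vec = dpc(toy)
hybrid = aac_vec + dpc_vec   # simple concatenation as a hybrid feature vector
print(len(aac_vec), len(dpc_vec), len(hybrid))
```

Both vectors are normalized by sequence length, so proteins of different lengths map to feature vectors of the same size and scale.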

Case Study
Qualitative and quantitative techniques are also used to analyze 31 WNT1-inducible-signaling pathway protein 2 isoform 1 precursor proteins [Homo sapiens]. We analyze the relationship between the proteins and pathways by using the three aforementioned feature extraction models and then use the 10 classifiers to test the accuracy of these models. The analysis is conducted on 4 CCN6_human cellular communication network factor proteins, 9 G protein pathway suppressor 2 proteins, 2 TIP41, TOR signaling pathway regulator-like proteins, 10 epidermal growth factor receptor pathway substrate 8 proteins, and 6 disease-pathway association proteins named tissue factor pathway inhibitor isoform a precursor, as shown in Figure 14.

WNT1 inducible signalling pathway protein sequence
We present a protein-based pathway specificity prediction for protein domains to be used for classifying domain-specific pathways. The AAH74841.1 protein is a downstream regulator of the Wnt/Frizzled signaling pathway [58]. Table 8 presents the top 10 WNT1-inducible signaling pathway protein sequences together with their database entry IDs and evidence. This protein is linked to cell survival: it attenuates p53-mediated apoptosis in response to DNA damage by activating AKT kinase, and the anti-apoptotic protein Bcl-X(L) is upregulated. It is overexpressed in numerous cancers, including breast and colon tumors. In addition to intrinsic cellular changes, malignant cells release soluble microenvironmental signals, including WNT1-inducible signaling pathway protein 1 (WISP1), a secreted matricellular protein that is elevated in a number of cancers and has been associated with reduced survival rates [59]. We collect these proteins from the SIGNOR 2.0 database (the SIGnaling Network Open Resource); the pathway map evidence is taken from https://signor.uniroma2.it/, as shown in Figure 14.
(A) Activating AKT kinase, upregulating the Bcl-X(L) anti-apoptotic protein, supporting skin and melanoma fibroblasts, and facilitating the binding of the proteoglycans decorin and biglycan to skin fibroblasts in vitro.
(B) Controlling cell growth, regulating mitogenic signals, managing cell proliferation, and internalizing receptor tyrosine kinase (RTK)-type ligand-inducible receptors, especially EGFR; this protein plays a role in clathrin-coated pit (CCP) assembly, acts as a clathrin adapter for post-Golgi trafficking, and is involved in the maturation, invagination, or budding of CCPs, in the endocytosis of integrin beta-1 (ITGB1) and transferrin receptor (TFR), and in the internalization of ITGB1 as a DAB2-dependent cargo (which, in turn, requires DAB2).
(C) Interaction with the non-receptor tyrosine kinases ABL1 and/or ABL2 in the negative regulation of cell growth and transformation, regulation of EGF-induced pathway activation, cytoskeletal reorganization, and EGFR signaling. Together with EPS8, these functions participate in the transfer of signals from Ras to Rac. The ABI1, EPS8, and SOS1 trimeric complexes exhibit Rac-specific guanine nucleotide exchange factor (GEF) activity in vitro, and ABI1 tends to act as an adapter in the group. These functions also include ENAH-ABL1/c-Abl-mediated phosphorylation, recruitment of WASF1 to lamellipodia, control of the WASF1 protein level, control of dendritic outgrowth and branching in the brain, and determination of the form and number of synaptic neuron contacts.
(D) Direct inhibition of factor Xa and of VIIa/tissue factor activity, possibly by forming a quaternary Xa/LACI/VIIa/TF complex, which has an antithrombotic function and the potential to interact with plasma lipoproteins.

GPS2 protein pathway suppressor 2[human]
The G-protein pathway suppressor 2 protein is encoded by the GPS2 gene in humans. Table 9 lists the top 10 G-protein pathway suppressor 2 (GPS2) proteins in humans. GPS2 encodes a protein that participates in G protein-mitogen-activated protein kinase (MAPK) signaling cascades. When overexpressed in mammalian cells, it may effectively suppress RAS- and MAPK-mediated signaling and interfere with JNK activity, which suggests that signal repression may be a function of this gene [60]. GPS2 also acts as a regulator of B-cell development by inhibiting UBE2N/Ubc13, thereby restricting the activation of the Toll-like receptor (TLR) and B-cell antigen receptor (BCR) signaling pathways. It further acts as a main mediator of the mitochondrial stress response: in response to depolarization, it relocates to the nucleus following desumoylation and specifically promotes the expression of nuclear-encoded mitochondrial genes [61].

Tissue factor pathway inhibitor protein sequence
This inhibitor may play a role in controlling plasmin-mediated matrix remodeling; it inhibits trypsin, plasmin, factor VIIa/tissue factor, and factor Xa, and has no effect on thrombin. Table 10 presents the top 10 matrix protein-tissue factor pathway inhibitor 2 isoform 1 precursors [62]. Serine proteinase inhibitors play an essential role in tissue turnover. In this analysis, trypsin/elastase/plasmin inhibitors from the extracellular matrix of transformed human skin fibroblasts are isolated, and their partial amino-terminal amino acid sequences are determined. Substrate reverse zymography tracks the antitrypsin activity of these inhibitors. A computer-based homology search indicates that the amino acid sequence of the 31-kDa inhibitor is novel. Meanwhile, the 33-kDa inhibitor sequence is 70% to 90% similar to the amino-terminal sequence of the 32-kDa inhibitor known as tissue factor pathway inhibitor-2.

Table 9. GPS2 protein sequence and database entry IDs

Epidermal growth factor receptor protein sequence
This protein acts as an adapter that regulates actin cytoskeleton dynamics and architecture, thereby controlling many cellular protrusions. Depending on its association with other signal transducers, it may control different processes [63]. These processes include filopodia formation in axons, stereocilia length, dendritic cell migration, and cancer cell migration and invasion.

Future Direction
Machine learning is a valuable tool for modeling the interaction between protein structures and features that are derived from human pathways or multiple biological data sources. Algorithmic approaches that address the feasibility concerns of single-task networks will be useful for future innovations. We may build and evaluate a single-task, DNN-based protein feature prediction method based on these solutions. Although DNNPSPDs can improve pathway-specific protein prediction accuracy and precision to some extent, its predictability and algorithmic efficiency require further improvement. We shall also attempt to build on fundamental knowledge of proteins to produce more successful hidden features, to take biological meaning into account, and to incorporate specific effective algorithms, such as convolutional neural networks [65], capsule networks [66], and generative adversarial networks. The versatility of our work contributes to advancements in protein function analysis and in predicting associations with human pathways based on convolutional neural networks and several specific biological data sources, such as graph embedding features or the pathway involvement of protein-protein networks. Future studies should attempt to identify the best way of building a concept for our web-based software, in addition to utilizing an expanded vocabulary and different datasets.

Discussion
To the best of our knowledge, the application of deep learning algorithms to large-scale prediction of protein function has not been extensively examined in the literature. Moreover, previous experiments have only focused on small protein sets and functional groups. In these experiments, DNNs have been applied to predict protein functions from various types of input, such as amino acid sequences [67], 3D structural properties, non-protein networks, and molecular and functional aspects, using specific DNN feed-forward architectures (i.e., single- or multi-task feed-forward DNNs, recurrent neural networks, deep auto-encoder neural networks, and profound re-task) [68].
The technical complexity of research methods that restrict the size of the input data and the number of integrable functional classes presents a key barrier to designing realistic DNN prediction tools. Given this limitation, previous studies have only focused on a few protein families. Therefore, new analytical methods with high efficiency and applicability in real scenarios should be proposed to facilitate in vitro research on protein-pathway recognition.
In this work, we propose DNNPSPDs as a novel hierarchical multi-task deep learning approach for predicting protein term interactions from protein sequence records. We also conduct a robust analysis of the characteristics of the DNN-based predictive model. This work is among the first to use DNNs to predict sequence-based pathway-specific protein functions and contributes to the literature by creating a broad-based deep learning predictive framework with a 1 multi-task stack and 101 feed-forward DNNs that can predict thousands of functional concepts based on a protein or gene. We also examine the predictions for pathway-specific training instances, which present a major problem in the field of automated protein function prediction, by proposing an approach that enhances the quality of automated functional predictions that have been previously developed through machine testing.

Conclusion
PSPD functional annotation is an important challenge in the genomic era. The most widely used feature extraction models, such as AAindex, AAC, DPC, PAAC, and PseAAC, are used to represent protein pathway sequences. We adopt several analytical methods to predict the function of novel pathway-protein associations. However, PSPD profiles are complex models with many free parameters. The difficulty of controlling external parameters lies in setting position-specific residue scores and combining a structure with multiple sequence information. Our proposed method shows a higher accuracy than the other extant methods proposed in the literature. This study processes the data for protein-pathway association datasets and then trains and tests 10 machine learning models. Our proposed model achieves the highest prediction accuracy of 0.957 at an MCC of 91.86%, followed by DPC (0.936 at an MCC of 88.02%). In addition, DNNPSPDs achieves an ROC-AUC score of 0.982 by using PseAAC and 0.981 by using DPC. This model also has higher accuracy than other classifiers, including LDA, GBC, RFC, GNBC, DTC, MLP, ETC, and DNN. All methods have randomly predicted 115 pathway-protein associations for carbon metabolism, lipid, energy, and non-standard amino acids. These pathway-protein associations are preserved by strong co-inheritance patterns in genetic information processing and may be linked to one another via physical cell interactions.

Figure 3. Binary classifications of pathway-specific protein domains

Figure 4. Architecture of the proposed DNNPSPDs model and preparation of the training dataset. (A) The DNN network structure, which comprises an input layer unit, four hidden layer units, and an output layer unit. (B) The functions of each hidden layer and a nonlinear activation function that measures the output value.

Figure 5. ReLU activation function.
However, any negative value automatically becomes 0, which reduces the ability of the proposed model to fit or train correctly from the data. In other words, the ReLU function maps any negative input to 0 in the graph and, in turn, does not represent the negative data. For instance, given the weights $w_1, w_2, w_3, \dots, w_N$ (Figure 4) and the input-layer values $a_1, a_2, a_3, \dots$, the neuron computes the weighted sum $w_1 a_1 + w_2 a_2 + w_3 a_3 + \dots$. In the corresponding expression, $R$ represents the activation function, $w$ represents the connected weight matrix, and $a_l$ represents the input-layer values for the $z$-th layer neuron, whose output indicates the class. In recent years, neural network (NN) models have achieved state-of-the-art performance in language modeling and are now being adopted for biological problems.
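The weighted sum and the zeroing of negative values described above can be sketched for a single neuron as follows. The weights, inputs, and the optional `bias` term are hypothetical illustration values and are not taken from the trained DNNPSPDs model.

```python
def relu(x):
    """ReLU activation: negative inputs become 0, positive inputs pass through."""
    return max(0.0, x)

def neuron_output(weights, inputs, bias=0.0):
    """Weighted sum w1*a1 + w2*a2 + ... (plus an assumed bias), then ReLU."""
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return relu(z)

inputs = [2.0, 1.0, 4.0]                                  # hypothetical a1, a2, a3
positive = neuron_output([0.5, -1.0, 0.25], inputs)       # sum is 1.0
negative = neuron_output([-0.5, -1.0, -0.25], inputs)     # sum is -3.0, clipped
print(positive, negative)
```

The second call shows the limitation noted in the text: once the weighted sum is negative, the neuron outputs 0 regardless of how negative the sum is.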

$$\text{Precision} = \frac{TP}{TP + FP} \tag{11}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{12}$$

$$F\text{-measure} = 2 \times \left( \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \right)$$

A true positive (TP) is an outcome in which the model correctly predicts the positive class, a true negative (TN) is an outcome in which the model correctly predicts the negative class, and a false positive (FP) is an outcome in which the model incorrectly predicts the positive class. False negatives (FN) represent cases in which the prediction is negative and the actual class is positive. The four parameters shown in Eqs. (6) to (10) are then determined as follows. Some of these measures, especially the Matthews correlation coefficient, are less intuitive to interpret.
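As an illustration of how TP, TN, FP, and FN feed into the evaluation measures, the sketch below computes the standard definitions of accuracy, precision, recall, specificity, F-measure, and MCC from the four counts. The function name and the counts are hypothetical and are not results from this study.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification measures from confusion-matrix counts."""
    def safe_div(num, den):
        # Return 0.0 instead of raising when a denominator is zero.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)              # also called sensitivity
    specificity = safe_div(tn, tn + fp)
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    f_measure = safe_div(2 * precision * recall, precision + recall)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_den)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_measure": f_measure,
        "mcc": mcc,
    }

# Hypothetical counts for illustration only.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, so it stays informative when the positive and negative classes are unbalanced.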

Figure 9. Combined comparison of ACC model predicted data.

Quality of classifiers selected for minimum-redundancy maximum-relevance (mRMR)
K-fold cross-validation is one of the most popular methods for model evaluation and model selection in the field of machine learning. The central concept of cross-validation is that every sample in our dataset is used for validation. In k-fold cross-validation, k iterations are performed over the dataset; 5-fold cross-validation is the case where k = 5. The dataset is divided into k parts, and in each round one part is used for validation while the other k−1 parts are merged into a training subset. In 10-fold cross-validation, 9 sets are used to prepare the training set, whereas the 1 remaining set is used for validation or testing.
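The splitting scheme described above can be sketched in a few lines. This is a minimal illustration with a hypothetical helper name, a fixed shuffle seed, and a toy dataset size; it is not the evaluation code used in this work.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]     # k roughly equal parts
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, validation

# Hypothetical example: 20 samples, 5 folds.
splits = list(k_fold_indices(n_samples=20, k=5))
for training, validation in splits:
    print(len(training), len(validation))
```

Each sample appears in exactly one validation part, so every sample is used for validation once and for training k−1 times.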

Table 1. Parameters of the DNN model settings.

Table 2. Identifying optimum parameter for various models
Given that the testing dataset samples include the synthetic amino acid O, we set the parameter value to 1, 2, 3, 4, 5, and 6. The optimum parameter values of DNNPSPDs and the other models are shown in Table 2.

Classification performance of DPC model
Table 4 in the supplementary material (Appendix B) compares DPC across the 10 classifiers in terms of several measures, including AUC, accuracy, precision, and a score of 85.22%. The ROC-AUC scores achieved with DPC are shown in Figure 10. Figure 10 clearly shows that DPC outperforms all classifiers with a ROC-AUC score of 0.939.

Table 7. Performance comparison of different prediction models.

Table 8. WNT1-inducible signaling pathway protein sequence and database entry ID evidence

Table 10. Tissue factor pathway inhibitor protein sequence and database entry IDs
We analyze the top 10 epidermal growth factor receptor pathways shown in Table 11. Eps8 is a component of the WHRN and MYO15A complex, which is located at stereocilia tips and is needed to elongate the stereocilia actin core. Its degradation during the G2 phase of the cell cycle is required to sustain the changes in cell structure [64]. With its barbed-end capping activity and its ability to modulate Rac activity, Eps8 is involved in actin dynamics. In addition, IRSp53 binds to Eps8. A novel actin crosslinking activity of Eps8 has also been described.

Table 11. Epidermal growth factor receptor protein sequence and database entry IDs