CAA-PPI: A Computational Feature Design to Predict Protein-Protein Interaction using different Encoding Strategies

Protein-protein interactions (PPIs) underlie a wide variety of biological processes, including cell-to-cell interactions and metabolic and developmental control. Predicting PPIs has become one of the most important aims of systems biology, because PPIs play a fundamental part in predicting the function of a target protein and the druggability of molecules. Considerable work has been done to develop methods that predict PPIs computationally, as this supplements laboratory experiments and offers a cost-effective way of predicting the most likely set of interactions at the scale of the entire proteome. This article presents a feature representation method (CAA-PPI) that extracts features from protein sequences using two different encoding strategies; an ensemble learning method, Random Forest, is then used as the classifier for PPI prediction. CAA-PPI considers the role of the trigram and the association of a given amino acid with its neighbouring ones. The proposed PPI model achieves prediction accuracies of more than 98% with one encoding scheme and more than 95% with the other on two diverse PPI datasets, H. Pylori and Yeast. Further comparison of CAA-PPI with existing sequence-based methods shows the proficiency of the proposed method with both encoding strategies. To further assess its practical prediction ability, a blind test was carried out on five other species' datasets independent of the training set, and the results confirm the effectiveness of CAA-PPI with both encoding schemes.


Introduction
The word protein is derived from the Greek word 'protos', meaning the first element Reeds (2000); proteins are undoubtedly fundamental to life. Proteins are complexes made of 20 types of amino acids linked through α-peptide bonds. A protein cannot perform any function alone; it accomplishes its roles by interacting with other molecules such as DNA, RNA, or other proteins, which catalyse different biological functions at the system or cellular level. According to different studies, PPI is the consequence of hydrophobic and electrostatic interactions and hydrogen bonds, which together contribute to the binding interaction, although the significance of hydrophobic forces in particular has been demonstrated Maleki et al. (2013). Some properties of PPIs are noteworthy. Firstly, changes in the kinetic properties of enzymes due to PPIs may cause subtle changes in substrate binding or allosteric effects. Secondly, PPIs can form a new binding site for small effector molecules. Next, they can deactivate or suppress a protein. Finally, by interacting with different binding partners, PPIs can alter the specificity of a protein for its substrate.
PPI types are varied and can be categorized by the stability, affinity, and composition of the resulting complex Keskin et al. (2016) as transient or permanent and as nonobligate or obligate interactions; they can similarly be characterised as homo-oligomeric or hetero-oligomeric complexes according to the similarity of the protein pair involved in the interaction. An interaction is obligate if the components of the complex are unstable on their own in vivo, whereas the components of nonobligate interactions can exist independently. Nonobligate interactions are either transient or permanent. Transient interactions associate and dissociate temporarily in vivo, whereas in permanent interactions the complex remains stable once the proteins have interacted. PPIs have various effects Golemis and Adams (2002):
• Permit substrate channelling.
• Form a novel binding site.
• Deactivate or abolish a protein.
• Alter the specificity of a protein.
• Serve a regulatory role in an upstream or a downstream event.
All of the above strongly influence many biochemical events; consequently, the exploration of PPIs can help researchers reveal tissue functions and structures, understand the development of diseases, and identify drug targets for gene therapy. In past years, numerous high-throughput experimental methods have been employed to detect PPIs, including immunoprecipitation, the Yeast two-hybrid system, affinity purification-mass spectrometry (AP-MS), and protein microarrays. However, biological experiments are mostly expensive and laborious. In addition, both the false negative (FN) and false positive (FP) rates of these methods are very high Prieto et al. (2014). Developing reliable computational models for PPI prediction therefore has great practical impact.
As Galileo put it, the 'Book of Nature is written in mathematical language'; the possibility of interaction can therefore be explored with mathematical approaches that take the properties and associated data of proteins as input to computational models. Multiple models have so far been introduced for predicting PPI, and they are categorized Rai and Bhatnagar (2017) into gene data-based methods, network topology-based methods, structural profile-based methods, and ML-based methods, as shown in Figure 1. The genetic approach uses genetic linkage, gene fusion, phylogenetic profiles, and in silico two-hybrid systems for PPI prediction. The structural approach uses three-dimensional protein information, whereas the network topology-based approach generates a confidence score matrix for prediction purposes. ML-based methods train the prediction model on diverse features of the interacting proteins; the trained model then predicts the interaction of proteins.
In the proposed approach, an ML-based model named Connecting Amino Acids Feature based PPI (CAA-PPI) is introduced to predict PPI. The major contribution of the model is a novel feature generation method based on the hypothesis that different amino acids are associated with a residue in a given trigram. CAA-PPI uses PCA to eliminate irrelevant and redundant features from the dataset. The CAA-based feature extraction approach is implemented with two different encoding schemes, named ES1 and ES2, followed by an RF classifier to train the model. The performance of CAA-PPI with the RF classifier is verified on two different PPI datasets, Yeast and H. Pylori, and achieves average accuracies of 98.25% and 98.25% with one encoding scheme and 98.69% and 95.49% with the other encoding scheme respectively; the comparison of CAA-PPI with competitive approaches likewise shows it to be more accurate. The proposed model is also tested on a further dataset (the five species dataset), referred to as an independent dataset because it is independent of the training dataset. The overall outcomes show that the proposed approach is efficient in predicting PPI with both encoding strategies.
The structure of this research article is as follows. Firstly, the importance and challenges of PPI and the need for PPI prediction are discussed. Next, previously published studies on PPI are reviewed, including research related to encoding strategies and feature extraction approaches. The following section presents details of the additional materials used to carry out the experimental work. This is followed by a detailed description of the proposed approach, including a systematic workflow and pseudocode with an example for further understanding of CAA-PPI. After that, the performance of CAA-PPI with ES1 and ES2 on two diverse datasets is presented with seven standard measures, followed by the respective Bonferroni post-hoc analysis comparisons with the state-of-the-art models. Finally, the research work is concluded with possible opportunities for CAA-PPI.

Documentary Research
Building a sequence-based PPI prediction model depends primarily on three factors:
• Selecting an appropriate way to capture the essential information about PPI.
• Developing a strategy for protein sequence feature extraction.
• Applying a suitable classification algorithm.
This article concentrates mainly on the first two factors, and this section therefore concisely reviews the related studies, as summarized in Table 1.
Several investigations have addressed the development of encoding schemes that fully capture biological sequence information. Stupar (2010) suggested 7 classes of amino acids based on their dipoles and side-chain volumes, and the features of the protein pairs were extracted according to those classes. Al-Daoud (2011) used three different encoding strategies based on the chemical properties, polarity, and structure of amino acids, with three newly created feature sets. In 2017, Zhou encoded a protein sequence at multiple scales using seven properties covering qualitative and quantitative descriptions of amino acids. These encodings were then used to represent each protein sequence in terms of five different protein descriptors, i.e. AC, composition, frequency, transformation, and distribution Zhou et al. (2017). ElAbd et al. (2020) compared two amino acid encoding schemes using CNN, RNN, and a hybrid CNN-RNN architecture, applied to two challenging problems. In 2020, Jing presented a broad review of encoding methods for amino acids, followed by a systematic analysis and a performance comparison of 16 representative encoding methods, which the author classified into five categories (Le and Nguyen (2019)).
Numerous computational methods proposed for extracting sequence features deal mostly with the evolutionary information, physicochemical information, or structural information of proteins. One popular feature extraction method, published by Chou (Kuo-Chen (2009)), reflects the amino acid composition together with the positional information of the amino acids. Yang et al. (2010) considered residue interactions in both continuous and discontinuous regions, extracted more of the PPI information present in the protein sequence, and proposed LD with a KNN model. Guo et al. (2008) took into account the discontinuous amino acid fragments of the protein sequence by using an AC-based method that considered physicochemical properties. A descriptor called the 'signature product' was developed to determine PPIs (Martin et al. (2005)). You et al. (2013) proposed a hierarchical model that first extracts the information responsible for interaction from the protein sequence using CT, AC, MAC, and LD, then applies PCA, and finally employs an E-ELM classifier to predict PPI. You et al. (2014) considered the interactions between sequentially distant but spatially close amino acid residues. You et al. (2015) suggested another feature representation approach based on the postulate that interaction between protein pairs can occur in continuous amino acid segments of different lengths. Wong et al. (2015) used image processing methods for feature extraction from a Physicochemical PR Matrix and then employed LPQ to mine complex and essential coefficients from the obtained features; the RoF classifier was used to predict PPIs and performed efficiently in comparison with existing approaches. Ding et al. (2016) improved prediction precision by using an AAC matrix to obtain an SMR matrix, followed by SVD and HOG algorithms that generate a feature vector. An et al. (2019) used both local and global features, applying a PSSM-based local encoding approach to create a novel multi-features fusion matrix (CPSSM) and then employing Local Average Group (LAG) and Bigram Probability (BP) to extract key features from the obtained matrix.
In the current research article, a feature representation method, CAA-PPI, is proposed to extract the key information present in a protein sequence; it takes into consideration the association of different amino acids with a residue in a given trigram. These features are extracted using two different encoding schemes for representing amino acids. RF is then used as a classifier to demonstrate the efficacy of the approach for predicting interaction between protein pairs. The proposed PPI model achieves prediction accuracies of 98.25% and 98.25% with one encoding scheme and 98.69% and 95.49% with the other encoding scheme respectively when applied to two diverse PPI datasets, Yeast and H. Pylori. Further comparisons of the proposed approach with existing sequence-based methods show the proficiency of the proposed method. In addition, to evaluate the practical prediction ability, a blind test has been carried out on five other species' datasets that are independent of the training set, and the results confirm the effectiveness of CAA-PPI.

Table 1, rows 9 and 10:
9. Ding et al. (2016); HOG+SVD; RF; Proposed SVD and HOG algorithms for feature vector generation.
10. An et al. (2019); LCPSSMMF; SVM; Proposed a feature extraction method which considered residues' interactions of both continuous and discontinuous sections present in sequences.

Dataset
The data is collected from DIP Xenarios et al. (2001) and PIR to validate the CAA-PPI approach. Evaluation is performed on the Yeast and H. Pylori datasets, which have different numbers of interacting protein pairs. The PPI dataset of S. cerevisiae (Yeast) is taken from DIP, following the work of An et al. (2019) on PPI prediction. Redundant protein pairs are removed by retaining only pairs with less than 40% similarity, which yields 5594 interacting pairs. To test the performance of the model effectively, non-interacting pairs must also be included in the dataset used to train the model. Consequently, 5594 non-interacting (negative) pairs are selected, consisting of proteins from different subcellular localizations. In total, the Yeast dataset contains 11,188 protein pairs to evaluate. H. Pylori is the next PPI dataset considered; it is a collection of 2916 protein pairs, comprising 1458 interacting and 1458 non-interacting pairs, as used by Martin et al. (2005). In addition, the PPI datasets of five species, M. musculus, H. Pylori, C. elegans, E. coli, and H. sapiens, are used to test the performance of CAA-PPI; they comprise 313, 1420, 4013, 6954, and 1412 interacting pairs respectively, and were also used in the PPI-related work of Zhou et al. (2011).

Cross Validation
Cross-validation is a standard procedure for avoiding sampling bias and for corroborating the reliability of a model Stone (1974). In this article, five-fold cross-validation is used to assess the classifier's performance. In X-fold cross-validation (X is any valid number of folds), the complete dataset is randomly divided into X equal parts called folds; in each round, X-1 folds are used for training and the remaining fold is used as the test set. This procedure is repeated X times, so that X distinct models are obtained. Lastly, the outcomes of the X trials are averaged to give an overall assessment.
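The X-fold procedure described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names `k_fold_indices` and `cross_validate`, the `evaluate` callback, and the fixed random seed are assumptions made for the example and do not come from the original work.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # random fragmentation of the dataset
    folds = [idx[i::k] for i in range(k)]     # k folds of (nearly) equal size
    for i in range(k):
        test = folds[i]                       # one fold is held out for testing
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def cross_validate(n_samples, evaluate, k=5, seed=0):
    """Average a score over the k rounds.

    `evaluate(train, test)` is a hypothetical callback that trains a model on
    the `train` indices and returns a score measured on the `test` indices.
    """
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

With k = 5, each round trains on four fifths of the pairs and tests on the remaining fifth, and every pair is used for testing exactly once.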

Performance Evaluation
To appraise the efficacy and stability of a classifier quantitatively, a number of widely used statistical measures are considered in this article, namely accuracy (A), sensitivity/recall (Se), specificity (Sp), positive predictive value/precision (Pr), NPV, F-score (Fs), and MCC. These measures are expressed in terms of TP, TN, FP, and FN. Here, TP is the number of true interacting pairs that are correctly predicted as interacting pairs. TN is the number of true non-interacting pairs that are correctly predicted as non-interacting. FP is the number of true non-interacting pairs that are wrongly predicted as interacting ones. FN is the number of true interacting pairs that are wrongly predicted as non-interacting pairs. Although A is a simple assessment measure, it may lead to a very biased evaluation on an imbalanced dataset. Pr is the proportion of predicted interacting pairs that are true PPIs. Because Pr and Se trade off against each other, Fs is evaluated as the weighted harmonic mean of Pr and Se to reflect the prediction performance of PPI comprehensively (Hripcsak and Rothschild (2005)). A high value of Fs indicates that both Pr and Se are high. MCC is a further objective index of overall method performance that accounts for under-prediction as well as over-prediction (Matthews (1975)).
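The seven measures above can be computed from the four counts with their standard textbook definitions, sketched below. The function name `ppi_metrics` is hypothetical, Fs is assumed to be the equally weighted F1 score, and a zero denominator is mapped to 0.0 as a convention chosen for this example; the original text does not state these details.

```python
import math


def ppi_metrics(tp, tn, fp, fn):
    """Return the seven measures as fractions in [0, 1] (MCC in [-1, 1])."""

    def ratio(num, den):
        # Assumed convention: an undefined ratio (zero denominator) becomes 0.0.
        return num / den if den else 0.0

    se = ratio(tp, tp + fn)                    # sensitivity / recall
    sp = ratio(tn, tn + fp)                    # specificity
    pr = ratio(tp, tp + fp)                    # precision / positive predictive value
    npv = ratio(tn, tn + fn)                   # negative predictive value
    a = ratio(tp + tn, tp + tn + fp + fn)      # accuracy
    fs = ratio(2 * pr * se, pr + se)           # harmonic mean of Pr and Se (F1)
    mcc = ratio(tp * tn - fp * fn,
                math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"A": a, "Se": se, "Sp": sp, "Pr": pr,
            "NPV": npv, "Fs": fs, "MCC": mcc}
```

Multiplying each value by 100 gives the percentages in which such results are usually tabulated.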

Principal Component Analysis (PCA)
PCA (Vipsita et al. 2013) is an unsupervised linear dimensionality reduction method that projects a data space onto a lower-dimensional space by an orthogonal transformation. It is a widespread method for eliminating redundant and noisy data and for extracting relevant features. The objective of PCA is to condense a large feature set into a smaller one without losing relevant information from the original set. The reduction process using PCA consists of six basic steps:
• Transform the entire dataset into a matrix of dimension i × j, ignoring the class label.
• Calculate the mean vector of the matrix.
• Calculate the covariance matrix of all dimensions.
• Compute the eigenvectors and corresponding eigenvalues of the covariance matrix.
• Arrange the eigenvectors by decreasing eigenvalue and select the p eigenvectors with the highest eigenvalues, which results in a matrix of dimension j × p.
• Use the resulting j × p matrix to transform the sample space into the new subspace.
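As a compact summary of the steps above, the projection can be written in matrix form. The sketch assumes that the rows x_n of the i × j data matrix X are the samples; the symbols μ, C, v_k, λ_k, W, and Y, the centring of the data before projection, and the 1/(i-1) normalization of the covariance are notational assumptions rather than details given in the original.

```latex
\begin{aligned}
\mu &= \frac{1}{i}\sum_{n=1}^{i} x_n, \\
C &= \frac{1}{i-1}\sum_{n=1}^{i} (x_n-\mu)(x_n-\mu)^{\top} \in \mathbb{R}^{j\times j}, \\
C\,v_k &= \lambda_k v_k, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_j, \\
W &= [\,v_1,\dots,v_p\,] \in \mathbb{R}^{j\times p}, \\
Y &= \left(X - \mathbf{1}\mu^{\top}\right) W \in \mathbb{R}^{i\times p}.
\end{aligned}
```

Each row of Y is the reduced, p-dimensional representation of the corresponding sample.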

Random Forest Classifier
RF is a widely used classifier in machine learning. It is an ensemble classification procedure that employs a set of decision trees (DTs) and reduces the variance of individual trees to improve the stability and accuracy of classification. RF takes advantage of two influential ML techniques:
• the selection of training samples for each tree;
• random feature selection to split the data set.
The selection of training samples is implemented by drawing a bootstrap sample from the original data (termed bagging). Bagging yields two disjoint bags: one holding the training data, around 63.2% of the instances, and the other holding the remaining samples, generally called out-of-bag (OOB) samples. Usually, in-bag samples are used to build the RF classifier and OOB samples are used to assess prediction. The second technique selects a subset of features at every node of each classification tree, i.e., RF picks a fixed number of features at random at every node of a tree, and the one with the largest decrease in Gini index (Qi and Y 2012) is selected for the split when growing the tree.
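The figure of roughly 63.2% follows from sampling n instances with replacement: the chance that a given instance is drawn at least once is 1 - (1 - 1/n)^n, which tends to 1 - 1/e. The short sketch below checks this numerically; the function names and the seed are illustrative assumptions.

```python
import random


def expected_in_bag_fraction(n):
    """Probability that a given instance appears in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def bootstrap_split(n, seed=0):
    """Draw n indices with replacement; return (in_bag draws, sorted OOB indices)."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n) for _ in range(n)]   # bagging: sample with replacement
    oob = sorted(set(range(n)) - set(in_bag))       # instances that were never drawn
    return in_bag, oob
```

For large n, the number of distinct in-bag instances divided by n settles close to 0.632, and the remaining instances form the OOB set.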
A forest is made up of trees, and more trees mean a more robust forest. In the same way, the RF algorithm generates DTs on data samples, obtains a prediction from each of them, and finally chooses the best result by voting. This ensemble scheme is superior to a single DT because it reduces over-fitting by averaging the outcomes.
The RF algorithm can be understood through the following steps:
• Firstly, select random samples from the given dataset.
• Then, generate a DT for every sample and obtain the prediction outcome from each DT.
• Then, perform voting over the predicted outcomes.
• In the end, take the most voted prediction as the final prediction result.
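The voting steps above can be illustrated with a small sketch. The three `STUMPS` are hypothetical stand-ins for trained decision trees (each is just a callable returning 1 or -1), so the example shows only the aggregation step, not how the trees are grown.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


def forest_predict(trees, x):
    """Collect one prediction per tree for sample x and aggregate them by voting."""
    return majority_vote([tree(x) for tree in trees])


# Hypothetical stand-ins for trained trees: 1 = interacting, -1 = non-interacting.
STUMPS = [
    lambda x: 1 if x[0] > 0.5 else -1,
    lambda x: 1 if x[1] > 0.5 else -1,
    lambda x: 1 if x[0] + x[1] > 1.2 else -1,
]
```

Using an odd number of trees avoids ties in a two-class problem; with an even number, this sketch returns the tied label that was seen first.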

Working of CAA-PPI
The proposed approach is based on the fact that information related to the possibility of interaction exists in the sequences of a protein pair. This information can be obtained by deriving different features from the sequence with a suitable feature extraction approach. The number of generated features must not vary with the length of the protein sequence; hence features are generated with respect to amino acids.
CAA-PPI generates these features as the ratio of the number of occurrences of a trigram in the protein sequence to the count of the central amino acid of that trigram in the same sequence. Here a trigram (3-mer) is a set of 3 consecutive symbols in the protein sequence. Since a protein sequence contains 20 amino acids, there are 20^3 tri-peptide combinations, so 8000 feature values must be generated for a protein, one for each tri-peptide combination; a protein pair therefore has about 16000 values in its feature set. Working with such a large feature set is challenging.
Encoding is proposed to reduce the number of features of a protein by aggregating the 20 amino acids into 7 classes, which reduces the number of features from 20^3 to 7^3. Encoding improves the speed of the proposed model, but its effect on performance remains an issue, which is analyzed by evaluating multiple encoding schemes and selecting the best one. The proposed work evaluates results with the encoding schemes proposed by Stupar (2010) (ES1) and Talwar (2015) (ES2).
The first step of CAA-PPI is the encoding of the protein sequence, which converts amino acid characters into the symbols '1', '2', '3', '4', '5', '6', '7'. Feature values are needed for the 7^3 combinations of trigrams, so 7^3 features are generated, one for each combination. For example, if the sequence is 'AWGVWEGIAVGWAWG', then for the combination 'AWG' the feature value is:
• Count of 'AWG' in given sequence / Count of 'W' in given sequence = 2/4 = 0.5
The input dataset has positive (interacting) and negative (non-interacting) protein pairs, which are labeled 1 and -1 respectively. PCA is then applied to the resulting features for feature selection, followed by five-fold cross-validation of the dataset, as shown in Figure 4. The five-fold process divides the dataset into 5 equal parts, of which one part is used as the test dataset and the remaining parts as the training dataset; it therefore represents a 1:4 partition of the data. The selected training dataset is used to train a model with the random forest classification method. The trained model predicts the class of the test dataset, from which the performance measures A, Pr, Se, MCC, Fs, Sp, and NPV are calculated.
The proposed feature generation algorithm, representing CAA-PPI, is depicted in Algorithms 1-3. Algorithm 1 is the main procedure of the model and takes the protein sequence (seq), the encoding pattern style (encoding_pattern), and the number of elements in a combination (p_gram) as arguments. The first step is encoding, performed by the function Encoding_PPI, which takes the protein sequence and the encoding pattern as input arguments and generates the encoded protein sequence according to either ES1 or ES2, depending on the value of the encoding pattern. The next step generates the CAA-PPI features; it is performed by the function Generate_Feature_CAA, which counts the trigrams and the central amino acid in a given sequence, generates the features for all combinations, and returns the value CAA_featureset. To filter out correlated data from the feature sets, the PCA function is used; it returns CAA_filtered_featureset, which is then labeled by the Add_label function according to whether the protein pair is interacting or non-interacting.
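The encoding and feature generation steps described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the grouping in `PLACEHOLDER_CLASSES` is a hypothetical 7-class assignment used only to make the example runnable (the actual ES1 and ES2 class memberships are defined in the paper's figures), the function names are assumptions, trigrams are counted with an overlapping sliding window, and a feature is set to 0.0 when the central symbol does not occur.

```python
from collections import Counter
from itertools import product

# Hypothetical 7-class grouping for illustration only; it is NOT taken from
# the paper, whose ES1/ES2 classes are given in its figures.
PLACEHOLDER_CLASSES = {
    "1": "AGV", "2": "ILFP", "3": "YMTS", "4": "HNQW",
    "5": "RK", "6": "DE", "7": "C",
}
AA_TO_CLASS = {aa: c for c, group in PLACEHOLDER_CLASSES.items() for aa in group}


def encode(seq):
    """Replace every amino acid letter by the symbol of its class."""
    return "".join(AA_TO_CLASS[aa] for aa in seq)


def caa_features(seq, alphabet, p_gram=3):
    """Map each p-gram over `alphabet` to count(p-gram) / count(central symbol)."""
    windows = Counter(seq[i:i + p_gram] for i in range(len(seq) - p_gram + 1))
    symbols = Counter(seq)
    centre = p_gram // 2
    features = {}
    for combo in product(alphabet, repeat=p_gram):
        gram = "".join(combo)
        denom = symbols[gram[centre]]
        features[gram] = windows[gram] / denom if denom else 0.0
    return features


def pair_features(seq_a, seq_b, alphabet="1234567"):
    """Concatenate the feature values of the two encoded proteins of a pair."""
    fa = caa_features(encode(seq_a), alphabet)
    fb = caa_features(encode(seq_b), alphabet)
    return list(fa.values()) + list(fb.values())
```

Applied to the unencoded example sequence, `caa_features("AWGVWEGIAVGWAWG", "AWGVEI")["AWG"]` reproduces the worked value 2/4 = 0.5. With a 7-symbol alphabet each protein yields 7^3 = 343 values, so a pair yields 686 values before PCA.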

Performance of PPI Prediction
This section reports the performance of the neighbourhood-based feature representation approach in predicting PPIs on two diverse PPI datasets with the two encoding strategies discussed in the previous sections. The outcomes are then compared with numerous existing approaches suggested in published works. Subsequently, a blind test is carried out on five other species' datasets (M. musculus, H. sapiens, C. elegans, H. Pylori and E. coli), which are independent of the training set, to demonstrate the effectiveness of CAA-PPI.
Performance of CAA-PPI model using ES1 and ES2 on the Yeast Dataset. The CAA representation of protein sequences with the RF predictor is tested using five-fold cross-validation on the Yeast dataset, as shown in Table 2 for ES1 and in Table 3 for ES2. Table 2 shows that a high prediction accuracy of 98.25% is attained by the CAA-PPI approach with ES1. The values of the other six standard measures are also assessed for the proposed model to examine its prediction capability further; performance is good with both encoding schemes, but ES1 performs comparatively better than ES2.
Furthermore, the prediction model using the CAA approach with ES1 and ES2 is compared against the methodologies proposed by An et al. (2019), You et al. (2014), Wong et al. (2015), Zhou et al. (2011), and Guo et al. (2008), and the Bonferroni post-hoc analysis is presented in Table 4. These approaches, discussed in the previous sections, variously used LCPSSMMF, PSSMMF, LCPSSMAB, LCPSSMBG, AC+CT+LD+MAC, MCD, PR-LPQ, LD, ACC, and AC to encode the amino acid sequence and predicted PPI using SVM, RF, RoF, and E-ELM classifiers. Table 4 shows that CAA-PPI with ES1 and ES2 outperforms all the competing methods, i.e. its prediction accuracy generally differs significantly from that of these state-of-the-art PPI predictors for the Yeast dataset, as also depicted in Figure 5.
Performance of CAA-PPI model using ES1 and ES2 on the H. Pylori Dataset. To further evaluate the efficacy of the proposed method, the CAA-PPI model with ES1 and ES2 has been tested on the H. Pylori dataset using five-fold cross-validation; the results are shown in Table 5 and Table 6. Tables 5 and 6 show that the average accuracy of the proposed model is 98.25% with ES1 and 98.69% with ES2. Moreover, the performance of CAA-PPI is assessed comprehensively with the other evaluation metrics, Se, Sp, Pr, NPV, Fs, and MCC, also reported in Tables 5 and 6. Likewise, the performance of CAA-PPI with both ES1 and ES2 is compared with the approaches suggested in previous works, and the Bonferroni post-hoc analysis is presented in Table 7. The compared approaches are HOG+SVD (Ding et al. (2016)), AC+CT+LD+MAC (You et al. (2013)), MCD (You et al. (2014)), DCT+SMR (Huang et al.), the method of Zhou et al. (2011), Phylogenetic bootstrap (Bock and Gough (2003)), HKNN Nanni (2005), Ensemble of HKNN Nanni and Lumini (2006), Signature products (Martin et al. (2005)), and Boosting (Liu et al. (2013)); each expresses the amino acid sequence in its own way and uses a suitable classifier to predict PPIs. Table 7 shows that CAA-PPI is more effective than the other competing methods with both ES1 and ES2, i.e. it generally differs significantly from these state-of-the-art PPI predictors for the H. Pylori dataset, as also represented in Figure 6.

Outcomes on five species datasets
To assess practical prediction ability, CAA-PPI is first trained on the PPIs of the Yeast dataset using ES1 and ES2 separately, and is then tested on five independent species' datasets, none of which overlaps the training set, using the same proposed approach. The resulting performances are shown in Table 8. The results in Table 8 for both ES1 and ES2 confirm the significant proficiency (p<0.05) of CAA-PPI compared with existing published works. Additionally, it is worth noting that both encoding strategies are equally effective with the CAA-PPI approach for predicting new protein interactions.
There is always room for improvement; the challenge is to find it. A next step could be to discover more interacting protein pairs and to generate a new set of features using the proposed approach. Moreover, CAA-PPI can be assessed with other encoding strategies and applied to other organisms. Likewise, a new encoding scheme can be developed through a systematic categorization of amino acids. The proposed approach can also be extended towards the interaction of proteins with other possible molecules.
ES1 is taken from Stupar (2010) and ES2 from Talwar (2015). Both encoding schemes aggregate the 20 amino acids into 7 classes, but their grouping criteria differ, as shown in Figures 2 and 3. In ES1, amino acids are categorized by chemical properties such as dipole scale and volume scale, whereas in ES2 the encoding is based on the structure of the side chain, reflecting the significance of the side chain in determining the properties of an amino acid.
Algorithm 1: Initialization of CAA PPI
Procedure PPI()
    Input: Protein sequence (seq), Size of combinational p_gram = 3
    seq_encoded = Encoding_PPI(seq, encoding_pattern)
    CAA_featureset = Generate_Feature_CAA(seq_encoded, p_gram)
    CAA_filtered_featureset = PCA(CAA_featureset)
    If y in CAA_filtered_featureset is Negative PPI
        Add label -1 to y
    else
        Add label 1 to y
    end
end Procedure

Fig. 5: Comparison of CAA-PPI applied on Yeast dataset using ES1 and ES2 with existing approaches (Ref: Table 4).

Table 1 : Brief overview of previously suggested feature extraction methods for PPI prediction.
Function seq_encoded = Encoding_PPI(seq, encoding_pattern)
    If encoding_pattern is ES1
        Aggregation of amino acid in 7 classes in seq_encoded {

Table 3 :
Result of Five-fold cross validation for proposed approach on Yeast dataset using ES2

Table 4 :
Bonferroni post-hoc analysis result of CAA-PPI using ES1 and ES2 compared with existing approaches for overall prediction accuracy for Yeast dataset.

Table 5 :
Five-fold cross validation result of CAA-PPI using ES1 on H. Pylori dataset.
Columns: Testing Set, A (%), Se (%), Sp (%), Pr (%), NPV (%), Fs (%), MCC (%)

(Shi et al. (2010)). Henceforth, in this section, the experimentally demonstrated interactions of one species, the Yeast dataset (with 11,188 samples in this case), are employed to predict the interactions of other species. Then, a blind test is implemented on five other species' datasets, which are independent of the training set.

Table 6 :
Five-fold cross validation result of CAA-PPI using ES2 on H. Pylori dataset.

Table 7 :
Bonferroni post-hoc analysis result of CAA-PPI using ES1 and ES2 compared with existing approaches for Overall prediction accuracy for H. Pylori dataset.

Table 8 :
Performance of PPI Prediction on five species datasets taking Yeast dataset for training (in terms of Accuracy).

With the growing number of PPI prediction methods, the encoding of amino acid feature vectors is also evolving. Even though considerable advancement has been achieved, further practical approaches are required to deal with the remaining limitations. This research presents an ML-based model (CAA-PPI) to predict PPI using two distinct encoding strategies. The major contribution of the model is a novel feature generation method that uses the association of different amino acids with a residue in a given trigram. The CAA-based feature extraction approach is implemented with different encoding schemes, followed by a random forest classifier to train the model. The performance of the proposed CAA-PPI with the RF classifier is then verified on two diverse PPI datasets, Yeast and H. Pylori, and attains favourable outcomes with both encoding schemes.