Hot spot identiﬁcation by means of classiﬁcation methods employing wavelet transform-based features

Background: Proteins can interact with one another by an interface composed of two proteins. Some of the interface residues – called hot spots – have the greatest impact on binding energy in a protein complex. This paper introduces the application of continuous wavelet transform based on the fast Fourier transform (CWTFT) to the analysis of hot spots in proteins. Results: The algorithm was evaluated by using data sets containing 30 proteins. From the number of tested classiﬁers the best 10 models were preferred. The classiﬁers achieved sensitivity of 52%–71%, speciﬁcity of 70%–83% and accuracy of 70%–74%. Conclusion: The analyses show that the method combining CWTFT and classiﬁcation algorithms is able to identify hot spot residues with valuable results. Methods: The basis of the algorithm is extraction of features from spectra obtained by using CWTFT with diﬀerent wavelet functions including Morlet, m th order derivative of Gaussian, Paul and Bump wavelets. Then, the classiﬁers that are able to separate hot spot from non-hot spot residues according to these features are applied.


Background
Protein complexes control most of the biological processes in living cells. They can interact with one another at the interface, which is composed of two proteins linked by a noncovalent bond [1]. If the distance between any two atoms of two amino acid chains is less than the sum of their van der Waals radii plus 0.5Å tolerance, then these amino acids can be deemed interface residues [2,3]. One of the most important properties of an interface is that the energy is not uniformly distributed. Some of the interface residues have the greatest impact on binding energy in the protein complex. Such residues are called hot spots. Studies have shown that hot spots tend to cluster near the centre of the interface [2,4,5]. The hot spot residues are surrounded by energetically less important residues, which occlude the hot spots from the solvent. Such occlusion plays an important role in highly energetic interactions and determines the location of hot spots in the interface centre. Studies also show that hot spots usually comprise structurally conserved sequences because hot spot residues are found to mutate more slowly than the rest of the protein surface. Furthermore, Bogan and Thorn [4] observed that tryptophan, arginine, and tyrosine are the preferred residues in hot spots. Hot spot analysis is important because it can help to discover biological functions and structures of proteins and understand the interactions between them. Studies on energy distribution in proteins may also be useful in the detection of mutations and single-nucleotide polymorphism analysis. Another application of hot spot detection is the possibility of creating proteins with predefined functions, which can be useful in drug design. The basis of drug design is often mimmicking the regulation of interaction between proteins; hence, it is important to understand the rules governing these interactions [2,4,6].

Methods of hot spot identification
Experimentally, hot spot residues can be identified by measuring the change in binding free energy (∆∆G) by using a molecular biology technique called Alanine Scanning Mutagenesis (ASM) [7,8]. In ASM experiments, each of the amino acids is mutated to alanine one by one, and the binding affinity of mutated chain is measured. The binding affinity can be determined by ∆∆G upon mutation of a given residue to alanine. In general, it is assumed that a residue can be deemed hot spot if it has a ∆∆G ≥ 2 kcal/mol when mutated to alanine. Mutations to alanine can cause changes in the biological properties of the proteins; thus, if the given residue is important for protein function, then replacing it with alanine should change the protein function. The ∆∆G from ASM experiments along with other data are deposited in the Alanine Scanning Energetic Database (ASEdb) [4,9]. However, the ASM method is expensive and time-consuming, and thus a number of computational algorithms presenting different approaches were created to predict hot spot residues. Most of the developed models require information about the complex, such as the 3D structure of proteins (e.g. Robetta [6,10], HotPoint [2,3,5], MAPPIS [1], KFC [11], SpotOn [12], PredHS [13], iPPHOT [14], the method using docking approach [15], HSPred [16], HEP [17]) or physico-chemical properties of their residues (e.g. method using random projection-based classifier [18], MSCA [19], method applying ensemble learning [20], DICFC [21], iFrag [22]). Most of the aforementioned algorithms require knowledge of the protein structure, which is a significant drawback of these methods because the protein structure has been determined only for a limited number of proteins. However, it is possible to analyse hot spots by using only the primary structure of proteins because the amino acid sequence contains all the information about the functions and structure of the protein.
This assumption is used in algorithms applying digital signal processing methods to hot spot analysis [23]. Most such models try to find regions in the protein sequence where the characteristic frequency (described below) is dominant. The most direct algorithms apply Fourier transform in order to search for hot spots, either by changing the amplitude in the Fourier spectrum [24] or by simply subtracting the consensus spectrum from the amplitude spectrum of the given protein [25]. However, Fourier transform is insufficient for the analysis of protein sequences because such signals are nonstationary (their characteristics change in time). Thus, most researchers use signal processing methods by employing time-frequency tools, which are more suitable for nonstationary signal analysis. Continuous wavelet transform (CWT) is one of the most popular methods of such analysis and was used in many research studies for the identification of hot spots [26,27,28,29]. The wavelet-based approach helps to find areas of high energy that correspond to hot spots, but precise identification of hot spots is missing in most of these works. A more accurate method of hot spot detection was presented in a study by applying modified Gabor wavelet transform [30]. Other methods of hot spot analysis use short-time Fourier transform (STFT) [31] and S transform [32,33], which allow for signal analysis in the time-frequency plane. An algorithm employing digital filtering was also presented [34]. Another technique uses statistically optimal null filters [35] with the smoothed pseudo Wigner-Ville distribution [36] incorporated. Another approach focuses on finding sequence-based frequency-derived descriptors capable of discriminating hot spot residues from others [37].

Resonant Recognition Model
Digital signal processing tools can be applied to the analysis of hot spots in proteins on the basis of the Resonant Recognition Model (RRM), which treats protein (and nucleic acid) sequences as discrete signals. The RRM assumes correlation between protein function and a numerical representation of its amino acids, which can be obtained by assigning each of the residues a numerical value relevant to the biological activity of the protein [23,24]. A number of parameters were considered (including physicochemical, thermodynamic, structural and statistical parameters), and the electron-ion interaction pseudo-potential (EIIP, Table 1) was selected as the most suitable for the analysis of proteins by using digital signal processing methods [38]. EIIP is a physical quantity denoting average energy state of all delocalised electrons of a given amino acid. The RRM shows that proteins with the same biological functions (i.e., the same target or receptor) have a common frequency component in the energy distribution of the delocalised electrons. The common frequency component of the biological process is called characteristic frequency. The cross-spectral function of the group of sequences with the same biological function is called a consensus spectrum. Peak frequencies in the consensus spectrum of the considered proteins correspond to frequencies common in all of the analysed signals (sequences), whereas frequency components that do not occur in all signals are removed from the consensus spectrum. Thus, the characteristic frequency can be obtained by finding the frequency of significant peak in the consensus spectrum of functionally related proteins. In a previous study [29], the authors showed that some problems with gathering a proper set of related proteins may occur; thus, the characteristic frequency could not always be properly determined [24,23].

Results
The analysis of hot spots with the aid of continuous wavelet transform based on the fast Fourier transform was conducted according to the algorithm described in the Methods section. The model was validated by comparing the results with ASEdb, which contains hot spots from experimental techniques (ASM). In order to present the performance of the algorithm, the following evaluation metrics were calculated: • Sensitivity: • Specificity: • Accuracy: where TP and TN are defined as the number of correctly identified hot spots and non-hot spots, respectively, whereas FP is defined as the number of non-hot spots incorrectly identified as hot spots, and FN denotes the number of hot spots incorrectly identified as non-hot spots.
In order to demonstrate the application of the algorithm to the identification of hot spots, 10 models that achieved the best results were chosen from all tested classifiers. Each classifier was validated with a five-fold cross-validation. To evaluate the models, 10 iterations were performed. The mean values for accuracy, sensitivity and specificity along with standard errors (se, eq. 4) acquired by using the presented algorithm are presented in Table 4.
where s is a standard deviation and n = 10 is the number of iterations. The classifiers were compared using one-way analysis of variance (ANOVA) for a significance level of α = 0.05. P-values for the null hypothesis that the means of the groups (classifiers) are equal are presented in Tables 5, 6 and 7. Figure 4 presents 95% confidence interval for the comparison.

Discussion
The aim of this study was to create models that are able to predict the potential unknown location of hot spots, which could help researchers in performing ASM in the wet lab. ASM is expensive and time-consuming; hence, by identifying the possible locations of hot spots with the aid of computational methods, the researchers can perform alanine mutations only in amino acid locations identified by these methods. In order to correctly detect the greatest percentage of hot spots in the ASM method, the model should be described by as large a sensitivity (Sn) value as possible.
The results obtained by using the proposed algorithm and presented in Table 4 indicate that the Classifier 10 (RUSBoost algorithm) can be considered the best tool to identify hot spots. It returns the greatest Sn value (∼ 71%) compared to other tested classifiers (Sn of other classifiers varies from ∼ 52% for Classifier 1 to ∼ 58% for Classifier 6; differences between them are not statistically significant, Table 5), even though it shows a low Sp value. The accuracy (ACC) and specificity (Sp) values for Classifier 10 are both ∼ 70%; thus, it is the smallest Sp value of the tested classifiers, while the ACC is similar compared to the other classifiers (only Classifiers 2 and 3 exhibit significantly greater ACC values than Classifier 10, Table 7).
Comparing ACC values, it can be observed that Classifier 3 (decision tree) achieved the greatest ACC ∼ 74%; however, the statistical testing showed that the ACC value is significantly greater compared to only four classifiers (Table 7). Additionally, the Sp value is the greatest for the Classifier 3 (∼ 83%, differences are statistically significant for three classifiers, Table 6). Finally, in addition to Classifier 10, Classifier 3 can also be considered a good tool for hot spot identification due to the greatest Sp and ACC values and Sn ∼ 58%.
It should be noted that non-hot spots were overrepresented in the analysed data set (∼ 66% NH vs. ∼ 34% HS), therefore, some of classifiers could achieve the ACC ∼ 66% by assigning all observations to one class (thus, Sp = 100% and Sn = 0%). This observation resulted in focusing this study on finding the classifier achieving as great an Sn value as possible. The RUSBoost algorithm, used by Classifier 10, is effective for imbalanced data, which was shown by the performance of Classifier 10. Classifier 10 is also described by a greater sensitivity value compared with other feature-based approaches introduced by Nguyen et al. Table 8) and Cho et al. (58%) [56]. In addition, in [37], it was shown that changing the value of one sample does not significantly affect the spectrum of the analysed sequence; thus, a window of five residues centred at the position of a given residue was replaced by alanine. In this paper, only one residue (corresponding to a hot spot or non-hot spot) was substituted by alanine; thus, the effect of the change in spectra by a single residue was examined. Even though the application of continuous wavelet transform to hot spot analysis was considered earlier [26,27,28], the exact identification of hot spots is missing in these works, and the results were presented only for a few example proteins, which did not allow for the evaluation of the performance of the proposed methods. A more accurate model employing modified Gabor wavelet transform (MGWT) was presented by Shakya et al. [30], which achieved accuracy of 67%, sensitivity of 70%, and specificity of 65% (Table 8). The presented algorithm using all of the classifiers showed greater Sp and ACC values than the MGWT-based method, whereas only Classifier 10 acquired slightly better results for the Sn metric. Although the algorithm does not improve the performance significantly compared with MGWT, it does not require knowledge of the characteristic frequency, which is a considerable advantage over the MGWT method. As mentioned before, gathering a proper set of related proteins can be difficult but is important for the determination of characteristic frequency, and thus for hot spot identification. Therefore, the superiority of the proposed approach over the other wavelet-based methods of hot spot identification was demonstrated.

Conclusion
In this paper, the analysis of hot spots by using classification algorithms and signal processing methods employing a time-frequency tool such as continuous wavelet transform based on the fast Fourier transform (CWTFT) was conducted. The protein data set consisted of 219 amino acid residues, and various wavelet functions were used for hot spot analysis. The numerical signals were acquired by converting the amino acid sequences into EIIP values. The obtained numerical signals were then processed by using CWTFT. The employed approach focuses on finding classifiers that are able to separate hot spots from non-hot spots through features extracted from CWTFT spectra. Ten classifiers were considered. Classifier 10 (employing decision tree learners with random under-sampling boost ensemble method) was found to be the most appropriate tool for hot spot analysis because of the greatest sensitivity value (∼ 71%) among all of the tested classifiers. Most of the classifiers achieved similar Sp values (from ∼ 77% to ∼ 83% for Classifier 3) except for Classifier 10 (∼ 70%). The ACC values were also similar and ranged from ∼ 70% for Classifier 10 to ∼ 74% for Classifier 3. Thus, Classifier 3 (employing a decision tree) can also be used due to the greatest ACC and Sp values and sufficient Sn. The above leads to the conclusion that it is difficult to maintain a balance between all considered evaluation metrics. This study shows that CWTFT is a proper digital signal processing tool for hot spot analysis and provides valuable results. Furthermore, the proposed algorithm does not require a priori knowledge of the characteristic frequency, which is a considerable advantage over other wavelet-based methods.

Methods
This study presents the analysis of hot spots by using one of the digital signal processing tools, that is continuous wavelet transform based on the fast Fourier transform and several classification methods.

Continuous wavelet transform based on the fast Fourier transform
The continuous wavelet transform (CWT) describes the correlation between the analysed signal x(t) and shifted and compressed or stretched versions of a function called wavelet (ψ(t)). The CWT can be defined as follows: where ψ * (t) denotes the complex conjugate of the analysing mother wavelet ψ(t), and C(a,b) is the function of the parameters: where f c is the centre frequency of the mother wavelet at scale a = 1 and position b = 0. The CWT is invertible, that is the original signal can be recovered from the wavelet transform coefficients by the inverse wavelet transform: where K ψ is a constant factor depending on the wavelet function. The analysed signal x(t) representing an amino acid sequence, is discrete; thus, the continuous wavelet transform was calculated on the discrete data [39,40,41]. The difference between CWT performed on discrete data and discrete wavelet transform is that CWT can be calculated for any value of scale. The CWT is also continuous to position, i.e. for the given scale, the analysing wavelet is shifted in the whole interval of the analysed function's domain [42]. CWT has one significant drawback -it is time-consuming -which induced the need to develop new more efficient algorithms. One of them uses fast Fourier transform to obtain CWT coefficients and is often referred to as continuous wavelet transform based on the fast Fourier transform (CWTFT). In the CWTFT algorithm, the CWT is expressed as an inverse Fourier transform: whereX(ω) andΨ * a,b (ω) denote the Fourier transforms of the analysed signal x(t) and the wavelet at scale a and location b, respectively [43]. In the present study, six different wavelets were considered, which in Fourier domain can be defined as follows: • the analytic Morlet wavelet: where U (ω) is the Heaviside step function, and ω 0 is a frequency parameter. • the nonanalytic Morlet wavelet: • the m th order derivative of Gaussian wavelets: where Γ is the gamma function. The second-order (Mexican hat wavelet) and the fourth-order derivatives of Gaussian wavelets were used. • the m th order Paul wavelet: • Bump wavelet: where 1 [(µ−σ)/a,(µ+σ)/a] is the indicator function for the interval (µ − σ)/a ≤ ω ≤ (µ + σ)/a [41,43].

Classification
Classification is a problem of machine learning that focuses on assigning observations to classes on the basis of features of these observations. The aim of classification is to construct a model that makes predictions based on the information obtained from learning carried out on observations. The training of the model is conducted by using the data set with known classes (the training set), and then the obtained model can be used for prediction of classes for the data set with unknown classes (test set) [44,45,46,47]. In this work, the following classification methods were used: • Support vector machine (SVM) -the data set is divided by finding the best hyperplane that separates data points belonging to one class from data points of other classes. • k-nearest neighbours -assumes that observations in the data set are close to other observations with similar attributes. The algorithm searches for k nearest neighbours of the given observation and determines its label by finding the most frequent label among these neighbours. • Decision trees -the method classifies observations by constructing a tree that assigns each observation to the class on the basis of its features. Each node of the tree corresponds to the feature of the given observation to be classified, and the branch represents the value that the node can receive [45]. • Ensemble classifiers -the basis of these algorithms is independent classification of observations by the group of classifiers. Ensemble classifiers need two kinds of information: ensemble aggregation method and a type of weak learner, such as a decision tree or k-nearest neighbours [47,48]. Among the others, the following ensemble aggregation methods can be distinguished [49]: -Adaptive boosting (AdaBoost) -in each iteration the weighted classification error for all of the tested weak classifiers is measured. Then, in the next iteration, the weights of the misclassified observations are boosted [50]. -Bootstrap aggregating (Bagging) -a decision tree is generated on the many bootstrap replicates of the analysed data set [51]. -Random under-sampling boost (RUSBoost) -classes with a greater number of observations are under sampled; thus, for each of the weak learners, the subsets of observations of each class are analysed [52].

Data set
Primary sequences of the analysed proteins were obtained from the Protein Data Bank (PDB) [53] and UniProt database [54,55]. The used data set (Table 2) consisted of 30 proteins [37]. The two proteins from the data set (PDB id = 1fc2 and PDB id = 1jtg) were excluded because of inconsistency between the acquired sequences and information given by the data set. For all residues in the data set, the change in the binding free energy (∆∆G) was characterised and is available in the ASEdb. Residues associated with a value ∆∆G ≥ 2 kcal/mol when mutated to alanine were deemed hot spots, whereas residues with ∆∆G < 0.4 kcal/mol were considered as non-hot spots. Thus, the data set was found to be comprised of 219 residues of which 75 were hot spots (class HS) and 144 were non-hot spots (class NH). For each residue in the data set, a set of six features was determined according to the algorithm described below.

Algorithm used in hot spot identification
The algorithm demonstrating the application of CWTFT in the hot spot analysis is described below. The method is based on an approach presented in [37] and [56]. It performs ASM, but computationally, by replacing the given residue (corresponding to the hot spot or non-hot spot) by alanine and looking for the differences between wild-type and mutated sequence spectra. It focuses on finding classifiers that are able to separate hot spots from non-hot spots through the given features. The features were selected from the spectra acquired by using CWTFT with the six different wavelets described above. The algorithm is presented as follows ( Figure 1): 1 Mutate residue at a given position to alanine. 2 Convert wild-type and mutated amino acid sequences into numerical signals using EIIP values. 3 Compute the CWTFT coefficients (with different wavelets) of the wild-type and mutated sequences. 4 Compare the wild-type and mutated sequences by computing the difference of their CWTFT coefficients as follows: where CW T wt and CW T mut are the CWTFT coefficients of the wild-type and mutated sequences, respectively. 5 Extract a feature by finding the greatest difference in D CW T (Figure 3). 6 Select the best classifiers that are able to discriminate hot spots from non-hot spots on the basis of the extracted features. The performance of the algorithm is presented in Figures 2 and 3 and Table 3 using the example of Colicin E9 immunity protein (1bxi). Table 3 shows all considered positions of hot spots and non-hot spots of the 1bxi protein selected according to the value of the change in the binding free energy from the ASEdb database. Then, the sequence of the 1bxi protein (wild-type) and the signal obtained for this sequence are presented (Figure 2A) along with the sequence mutated for one of the hot spot positions (Glu 41) and its signal ( Figure 2B). Finally, the amplitude spectra for the wild-type and mutated sequences are presented in the Figure 3A and Figure  3B, respectively. Then, the feature is extracted by computing the greatest value of the modulus of the difference between these spectra ( Figure 3C). The procedure is repeated for all hot spots and non-hot spots for each of the proteins in the data set.
A number of classifiers were tested, and 10 that gave the best results were chosen [57]: • Classifier 1 -a decision tree with many leaves that makes many fine distinctions between classes. • Classifier 2 -a medium-complexity decision tree with fewer leaves.
• Classifier 3 -a simple decision tree with few leaves that makes coarse distinctions between classes. • Classifier 4 -support vector machine with quadratic kernel function.
• Classifier 5 -support vector machine with Gaussian kernel function.
• Classifier 8 -ensemble classifier with decision tree learners and adaptive boosting ensemble method. • Classifier 9 -ensemble classifier with decision tree learners and bagging ensemble method. • Classifier 10 -ensemble classifier with decision tree learners and random under-sampling boost (RUSBoost) ensemble method.
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
The datasets during and/or analysed during the current study available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests.

Funding
Silesian University of Technology, Poland Brno University of Technology, Czech Republic Author's contributions AT participated in design of study, investigation and the analysis of results. ET and IP helped to draft the manuscript and supervised the study. All authors read and approved the final manuscript.    Figure 4 The confidence intervals for comparison of (A) sensitivities, (B) specificities and (C) accuracies of classifiers. The best classifier for each evaluation metrics is in blue. Red indicates the classifiers which means are significantly different than the mean of the classifier in blue. Each interval is a 95% confidence interval for the mean of a group.     1e-07 5e-07 2e-06 7e-06 1e-07 9e-06 1e-06 1e-07 2e-07 1e-07 1e-07 1e-07 1e-07 1e-07 3e-07 1e-07 1e-07 1e-07