Extended Representation
In the present paper, a binary form of representation is used and it is considered for all the 20 amino acids in the order shown in Table-1. Here the order stands for alphabetical order of abbreviated form of 20 amino acids. Each amino acid is represented by a 20 component vector. Each vector contains nineteen ‘0’s and one ‘1’ depending on the sequence number of the amino acid as shown in the last column in Table-1. This is an extension of Voss type of representation [18]. This maps each amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y) into 20 binary indicator sequences as shown in Table-1 using ‘1’ or ‘0’ to give the binary representation. Now this 20 component vector is used on a protein sequence of length ‘n’ to get a 20*n component vector of 0 and 1.
Calculation of descriptor
To calculate the descriptor, first of all, FFT is applied on the above represented sequence of length 20*n to get 20*n component complex values using the following formula.
\({U_i}(k)=\sum\limits_{{n=0}}^{{N - 1}} {{u_i}(n)} {e^{ - i(2\pi /N)kn}},k=0,1,..,N - 1\)
where N is the sample size and k stands for frequency.
Next power spectrum for each component signal Ui(n) is calculated using the following formula.
\(P{S_{{U_i}}}(k)={\left| {{U_i}(k)} \right|^2},k=0,1,2,..,N - 1;i=1,2,..,20\)
The Perseval’s identity assumes a simple form given by
\(\frac{1}{N}\sum\limits_{{k=0}}^{{N - 1}} {PS{U_i}(k)} ={\sum\limits_{{n=0}}^{{N - 1}} {\left| {{u_i}(n)} \right|} ^2}=(N{u_i}),i=1,2,...,20.\)
where Nui is the number of 1’s present in ui(n).
Next normalization is done on the components of the power spectrum using the following formula \({Z_i}(k)=\frac{{PS{U_i}(k)}}{{N(N{u_i})}},{\text{where }}\sum\limits_{{k=1}}^{{N - 1}} {{Z_i}(k)=1}\)
Now descriptors are calculated using 2nd order moments for each component of Zi(k), where the jth order moment of each (zi(k)), is given by
\(\frac{1}{N}{\sum\limits_{{k=1}}^{{N - 1}} {\left| {{Z_i}(k) - A} \right|} ^j},A={\text{ mean of each (}}{{\text{Z}}_{\text{i}}}{\text{(k))=}}\frac{1}{N},i=1,2,..,20\)
This gives the descriptor of the protein sequence considered. This process is repeated for all the protein sequences to be compared.
Computation of distance matrix- Let S1and S2 be two protein sequences of two different species to be compared. Let Di and Dj are the two descriptors produced in the last step from S1 an S2. Now, distance between two sequences S1 and S2 is calculated using Euclidean Distance by the following formula-
Distance(Di,Dj)=(Σ(Di-Dj)2)1/2
Now if there are ‘m’ numbers of sequences, the distance between each pair of sequences is measured to form a matrix called distance matrix. Finally, Phylogenetic tree is generated to compare the protein sequences using the UPGMA algorithm. The whole procedure is described by the following algorithm.
Procedure MOMENT
Take m number of protein sequences one by one. Find out the maximum length of the sequences and let it be L. Make all the sequence of length L by appending required number of ‘0’ at the end.
for i=1 to m by 1 do
Take the ith protein sequence of length L.
Represent all the amino acids present in the sequence individually in binary form of representation
using extended Voss typeRepresentation to give a 20 component vector of ‘0’ and ‘1’. So a sequence of length L, will be represented by 20*L size vector.
FFT is used on this binary form and subsequently power spectrum is calculated and normalized on
these represented values of length L. It will produce L/2 frequencies.
Descriptors are calculated using 2nd order moments.
End for
Construct distance matrix from the mnumber of descriptive vectors using Euclidean Distance between
each pair of protein sequences.
Finally construct Phylogenetic Tree from the distance matrix using UPGMA Algorithm.
End MOMENT
Complexity Analysis
The proposed algorithm has following steps-
a) Representation of protein sequence of length L using extended Voss type representation which involved
O(L) in the worse case.
b) Calculation of FFT of a binary sequence of length 20*L which is O(L log L) in the worse case.
c) Calculation of descriptor which is also O(L).
All these steps are executed for m number of sequences, which gives a complexity of O(m*L log L). Without loss of generality, if we assume m and L are equal, then the complexity becomes O(m2 log m).