Similarity/dissimilarity study of protein and genome sequences remains a challenging task and selection of techniques and descriptors to be adopted, plays an important role in computational biology. Again, genome sequence comparison is always preferred to protein sequence comparison due the presence of 20 amino acids in protein sequence compared to only 4 nucleotides in genome sequence. So it is important to consider suitable representation that is both time and space efficient and also equally applicable to protein sequences of equal and unequal lengths. In the binary form of representation, Fourier transform of a protein sequence reduces to the transformation of 20 simple binary sequences in Fourier domain, where in each such sequence, Perseval’s Identity gives a very simple computable form of power spectrum. This gives rise to readily acceptable forms of moments of different degrees. Again such moments, when properly normalized, show a monotonically descending trend with the increase in the degrees of the moments. So it is better to stick to moments of smaller degrees only. In this paper, descriptors are taken as 20 component vectors, where each component corresponds to a general second order moment of one of the 20 simple binary sequences. Then distance matrices are obtained by using Euclidean distance as the distance measure between each pair of sequence. Phylogenetic trees are obtained from the distance matrices using UPGMA algorithm. In the present paper, the datasets used for similarity/dissimilarity study are 9 ND4, 16 ND5, 9 ND6, 24 TF proteins and 12 Baculovirus proteins. It is found that the phylogenetic trees produced by the present method are at par with those produced by the earlier methods adopted by other authors and also their known biological references. Further it takes less computational time and also it is equally applicable to sequences of equal and unequal lengths.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Figure 8

Figure 9

Figure 10

Figure 11

Figure 12

Figure 13

Loading...

Posted 30 Nov, 2021

###### No community comments so far

Posted 30 Nov, 2021

###### No community comments so far

Similarity/dissimilarity study of protein and genome sequences remains a challenging task and selection of techniques and descriptors to be adopted, plays an important role in computational biology. Again, genome sequence comparison is always preferred to protein sequence comparison due the presence of 20 amino acids in protein sequence compared to only 4 nucleotides in genome sequence. So it is important to consider suitable representation that is both time and space efficient and also equally applicable to protein sequences of equal and unequal lengths. In the binary form of representation, Fourier transform of a protein sequence reduces to the transformation of 20 simple binary sequences in Fourier domain, where in each such sequence, Perseval’s Identity gives a very simple computable form of power spectrum. This gives rise to readily acceptable forms of moments of different degrees. Again such moments, when properly normalized, show a monotonically descending trend with the increase in the degrees of the moments. So it is better to stick to moments of smaller degrees only. In this paper, descriptors are taken as 20 component vectors, where each component corresponds to a general second order moment of one of the 20 simple binary sequences. Then distance matrices are obtained by using Euclidean distance as the distance measure between each pair of sequence. Phylogenetic trees are obtained from the distance matrices using UPGMA algorithm. In the present paper, the datasets used for similarity/dissimilarity study are 9 ND4, 16 ND5, 9 ND6, 24 TF proteins and 12 Baculovirus proteins. It is found that the phylogenetic trees produced by the present method are at par with those produced by the earlier methods adopted by other authors and also their known biological references. Further it takes less computational time and also it is equally applicable to sequences of equal and unequal lengths.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Figure 8

Figure 9

Figure 10

Figure 11

Figure 12

Figure 13

Loading...