Graphical signature representation is an important step for characterizing the variation between genomic sequences [31, 34–40]. Its major advantage is that it requires no prior biological knowledge. All that is needed is a biological database (DNA sequences) and familiarity with bioinformatics and signal processing tools. We start by building a genomic database that contains the 25 investigated virus sequences. We then apply different coding techniques to each DNA sequence, followed by analysis techniques on the resulting genomic signals to assess the correlation between the genomes.
1. Genomic database:
In this step, all the investigated coronavirus genomes were downloaded from the public NCBI database (http://www.ncbi.nlm.nih.gov/Genbank/). The constructed database contains 25 genomes, including the COVID-19 genome (Fig. 1).
2. Genomic Image Representations:
Numerical representation of DNA is an important step for revealing repetitive patterns in DNA sequences [34–40]. Two-dimensional (2-D) numerical representations of a DNA sequence can be obtained in different ways: as a CGR image, or by applying an analysis technique to a DNA signal. A DNA signal is obtained with a coding technique such as binary coding [31], EIIP mapping [41], structural bending trinucleotide coding (PNUC) [42], the Frequency Chaos Game Signal (FCGS) [34–40], and so on.
2.1 CGR representation
The CGR image represents a DNA sequence unambiguously and reveals both local and global patterns hidden in it [31, 34–45]. This simple representation of a DNA sequence is derived from chaos theory; it was proposed by Jeffrey in 1990 as a mapping method for genome sequences [34, 43–45].
To construct a 2-D image for a DNA sequence, an iterative mapping method assigns each nucleotide a coordinate (X, Y) in 2-D space [43–45]. The constructed image records the distribution of the dots as a square matrix whose entries are 0 (empty coordinate) or 1 (dot coordinate). A sequence of M nucleotides \({N}_{1}, {N}_{2}, \ldots, {N}_{M}\), where each \({N}_{i}\) is A, T, C, or G, is represented in the square by the points \({CGR}_{i}\). The point for nucleotide \({N}_{i}\) is placed halfway along the segment joining the previously plotted point (for \({N}_{i-1}\)) to the vertex corresponding to the letter \({N}_{i}\) [34]. The following formula gives the iterative CGR function.
Square with corners \(\left\{\begin{array}{l}{X}_{A}=(0, 0)\\ {X}_{T}=(1, 0)\\ {X}_{C}=(0, 1)\\ {X}_{G}=(1, 1)\\ {P}_{0}=(0.5, 0.5)\end{array}\right.\); position of nucleotide \({N}_{i}\) in 2-D space: \(\left\{\begin{array}{l}{N}_{i}=({X}_{i}, {Y}_{i})\\ {X}_{i}=0.5\left({x}_{\text{Nucleotide}}+{X}_{i-1}\right)\\ {Y}_{i}=0.5\left({y}_{\text{Nucleotide}}+{Y}_{i-1}\right)\\ i=1, 2, \ldots, M\end{array}\right.\) with \(({X}_{0}, {Y}_{0})={P}_{0}\), where \(({x}_{\text{Nucleotide}}, {y}_{\text{Nucleotide}})\) is the corner assigned to the letter read at step i.
Figure 1 illustrates the CGR representation applied step by step (5 steps) to a DNA sequence; the result is 5 points.
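The iteration above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `cgr_points` is hypothetical, and skipping symbols other than A, T, C, and G is an assumption that the text does not specify.

```python
# Corner coordinates as given in the text: A=(0,0), T=(1,0), C=(0,1), G=(1,1).
CORNERS = {"A": (0.0, 0.0), "T": (1.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Return one CGR point (x_i, y_i) per recognised nucleotide."""
    x, y = start  # P0, the centre of the unit square
    points = []
    for base in sequence.upper():
        corner = CORNERS.get(base)
        if corner is None:
            # Assumption: ambiguous symbols (e.g. N) are skipped.
            continue
        # Move halfway from the previous point towards the nucleotide's corner.
        x = 0.5 * (x + corner[0])
        y = 0.5 * (y + corner[1])
        points.append((x, y))
    return points


# A five-nucleotide sequence yields five points.
points = cgr_points("ATCGA")
```

Each point depends on the whole prefix of the sequence read so far, which is why the final image encodes both local and global patterns.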
To compare more than two DNA sequences, we propose to calculate the center point (centroid) of each of the \(4^{k}\) squares of the CGR image of each sequence. Each centroid corresponds to one sub-image obtained by partitioning the CGR image into \(2^{k} \times 2^{k}\) equal sub-images. For each sub-region, we then compare the centroid values across sequences to identify the most similar sequences. Each sub-region carries local information: if we divide the image into four sub-images (k = 1) and calculate their centers, we obtain the CGR centroid corresponding to each nucleotide (A, T, C, or G). Two points within the same quadrant correspond to successions of nucleotides that end with the same mononucleotide. When these points also lie within the same sub-quadrant, the successions end with the same dinucleotide, and so on. The coordinates of the centroid therefore summarize the local information of the sub-region, can differentiate sequences, and can be used to measure the degree of similarity between DNA sequences. For each sub-region, we calculate all pairwise distances between the COVID-19 centroids and the centroids of the other sequences. The distance between them indicates their degree of similarity. The flowchart in Fig. 3 presents the steps of the CGR centroid method.
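The centroid comparison can be sketched as follows. The sketch assumes that the partition is a \(2^{k} \times 2^{k}\) grid (sub-squares of side \((1/2)^{k}\)), that the distance is Euclidean, and that only sub-regions occupied in both sequences are compared; none of these choices is stated explicitly in the text, and the function names are hypothetical.

```python
import math

CORNERS = {"A": (0.0, 0.0), "T": (1.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0)}


def cgr_points(sequence):
    """CGR points of a sequence, starting from the centre (0.5, 0.5)."""
    x, y = 0.5, 0.5
    points = []
    for base in sequence.upper():
        if base in CORNERS:
            cx, cy = CORNERS[base]
            x, y = 0.5 * (x + cx), 0.5 * (y + cy)
            points.append((x, y))
    return points


def cell_centroids(points, k):
    """Map each occupied cell (col, row) of the 2^k x 2^k grid to the mean of its points."""
    n = 2 ** k
    sums = {}
    for x, y in points:
        cell = (min(int(x * n), n - 1), min(int(y * n), n - 1))
        sx, sy, count = sums.get(cell, (0.0, 0.0, 0))
        sums[cell] = (sx + x, sy + y, count + 1)
    return {cell: (sx / c, sy / c) for cell, (sx, sy, c) in sums.items()}


def centroid_distances(seq_a, seq_b, k=1):
    """Euclidean distance between centroids, for cells occupied in both sequences."""
    ca = cell_centroids(cgr_points(seq_a), k)
    cb = cell_centroids(cgr_points(seq_b), k)
    return {cell: math.dist(ca[cell], cb[cell]) for cell in ca if cell in cb}
```

With k = 1 the four cells (0, 0), (1, 0), (0, 1), and (1, 1) collect the points whose last nucleotide is A, T, C, and G respectively; smaller distances in a cell suggest more similar local composition in that sub-region.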
2.2 FCGR representation
The CGR image contains the specific genomic signature of a given DNA sequence. Dividing this image into \(4^{k}\) squares and counting the points in each square gives the FCGR representation of order k (\({FCGR}_{k}\)), which presents global information about the DNA sequence. Each sub-square is associated with one sub-pattern and has a side of \({\left(\frac{1}{2}\right)}^{k}\). A visible pattern in the \({FCGR}_{k}\) corresponds to a specific pattern of the DNA sequence. FCGR images of order 1 to order 4 are presented in Fig. 4.
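A minimal sketch of the FCGR count matrix is given below. The function name `fcgr` is hypothetical. Two assumptions are made that the text does not state: the first k-1 points are not counted, so that every counted point reflects a full pattern of k nucleotides, and symbols other than A, T, C, and G are skipped.

```python
CORNERS = {"A": (0.0, 0.0), "T": (1.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0)}


def fcgr(sequence, k):
    """Return a 2^k x 2^k matrix of CGR point counts (grid[row][col], row = y cell)."""
    n = 2 ** k
    grid = [[0] * n for _ in range(n)]
    x, y = 0.5, 0.5
    seen = 0
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # assumption: ambiguous symbols are skipped
        cx, cy = CORNERS[base]
        x, y = 0.5 * (x + cx), 0.5 * (y + cy)
        seen += 1
        if seen >= k:  # assumption: count only points backed by k nucleotides
            row = min(int(y * n), n - 1)
            col = min(int(x * n), n - 1)
            grid[row][col] += 1
    return grid


# Order 1: the four cells count A, T (row 0) and C, G (row 1).
order1 = fcgr("AATTCG", 1)
```

Dividing each count by the total number of counted points would turn the matrix into relative frequencies, which makes sequences of different lengths comparable.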
2.3 Time-frequency analysis technique
Time-frequency (T-F) analysis techniques are a vital step for visualizing hidden information in DNA, RNA, or protein signals [31, 35–40]. First, to obtain a signal from a genomic sequence, various coding techniques can be used: the electron-ion interaction pseudo-potential (EIIP) mapping [41], the binary coding [31], the structural bending trinucleotide coding (PNUC) [42], the Frequency Chaos Game Signal (FCGS) [34–40], and so on.
After that, several analysis techniques can be applied to the obtained signal: the Fourier Transform (FT), the Wavelet Transform (WT), the S transform, and so on. In this paper we used EIIP as the coding technique (Table 1) and, as analysis techniques, the Smoothed Discrete Fourier Transform (SDFT) [31, 36] and the Continuous Wavelet Transform (CWT) [31, 35–37]. For this, each genomic sequence (nucleotide or protein sequence) is converted into a 1-D signal before processing. The investigated genomic sequences were retrieved from the public NCBI platform.
Table 1
EIIP coding technique for transformation of the genomic sequence into a signal.
Amino Acid | Single Letter Symbol | EIIP amino-acid value | Amino Acid | Single Letter Symbol | EIIP amino-acid value
Ala | A | 0.0373 | Leu | L | 0
Arg | R | 0.0959 | Lys | K | 0.0371
Asn | N | 0.0036 | Met | M | 0.0823
Asp | D | 0.1263 | Phe | F | 0.0946
Cys | C | 0.0829 | Pro | P | 0.0198
Gln | Q | 0.0761 | Ser | S | 0.0829
Glu | E | 0.0058 | Thr | T | 0.0941
Gly | G | 0.005 | Trp | W | 0.0548
His | H | 0.0242 | Tyr | Y | 0.0516
Ile | I | 0 | Val | V | 0.0057

Nucleotide | EIIP
A | 0.126
G | 0.0806
C | 0.1340
T | 0.1335
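The EIIP coding of Table 1 amounts to a simple lookup. The sketch below uses the values of the table; the names `EIIP_NUCLEOTIDE`, `EIIP_AMINO_ACID`, and `eiip_signal` are hypothetical, and dropping symbols that are absent from the table is an assumption.

```python
# EIIP values copied from Table 1.
EIIP_NUCLEOTIDE = {"A": 0.1260, "G": 0.0806, "C": 0.1340, "T": 0.1335}

EIIP_AMINO_ACID = {
    "A": 0.0373, "R": 0.0959, "N": 0.0036, "D": 0.1263, "C": 0.0829,
    "Q": 0.0761, "E": 0.0058, "G": 0.0050, "H": 0.0242, "I": 0.0000,
    "L": 0.0000, "K": 0.0371, "M": 0.0823, "F": 0.0946, "P": 0.0198,
    "S": 0.0829, "T": 0.0941, "W": 0.0548, "Y": 0.0516, "V": 0.0057,
}


def eiip_signal(sequence, table=EIIP_NUCLEOTIDE):
    """Convert a sequence into a 1-D numerical signal using an EIIP table."""
    # Assumption: symbols missing from the table (e.g. N or gaps) are dropped.
    return [table[symbol] for symbol in sequence.upper() if symbol in table]


dna_signal = eiip_signal("ACGT")
protein_signal = eiip_signal("MKV", EIIP_AMINO_ACID)
```

The resulting list is the 1-D genomic signal on which the time-frequency analyses operate.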
After the genomic sequences are converted into signals, the SDFT, which is based on the Discrete Fourier Transform, is applied to each genomic signal [31, 36]. Figure 5 and Eq. 1 illustrate the SDFT steps applied to a DNA numerical sequence in order to obtain a time-frequency representation of the DNA sequence.
$$\left\{\begin{array}{ll} R=512,\ \text{frame length}, & \Delta r=256,\ \text{shift index},\\ N=64,\ \text{subframe length}, & \Delta n=32 \end{array}\right.;\quad \text{window type}=\text{Blackman window} \qquad \text{(Eq. 1)}$$
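The exact SDFT procedure is described in [31, 36] and is not reproduced here. The sketch below is only one plausible reading of the parameters in Eq. 1. It assumes that Δn is the subframe shift, that each subframe is Blackman-windowed before the DFT, and that the power spectra of the subframes are averaged within each frame to obtain the smoothing. The function names are hypothetical, and a plain DFT is used for clarity rather than speed.

```python
import cmath
import math


def blackman(n):
    """Symmetric Blackman window of length n."""
    return [0.42 - 0.5 * math.cos(2 * math.pi * i / (n - 1))
            + 0.08 * math.cos(4 * math.pi * i / (n - 1)) for i in range(n)]


def dft_power(samples):
    """Power spectrum |X[f]|^2 for f = 0 .. n/2, computed with a direct DFT."""
    n = len(samples)
    return [abs(sum(s * cmath.exp(-2j * math.pi * f * t / n)
                    for t, s in enumerate(samples))) ** 2
            for f in range(n // 2 + 1)]


def smoothed_spectrogram(signal, frame_len=512, frame_shift=256,
                         sub_len=64, sub_shift=32):
    """One averaged power spectrum per frame (rows = frames, columns = frequency bins)."""
    window = blackman(sub_len)
    rows = []
    for start in range(0, len(signal) - frame_len + 1, frame_shift):
        frame = signal[start:start + frame_len]
        acc = [0.0] * (sub_len // 2 + 1)
        count = 0
        for s in range(0, frame_len - sub_len + 1, sub_shift):
            sub = [w * v for w, v in zip(window, frame[s:s + sub_len])]
            for i, p in enumerate(dft_power(sub)):
                acc[i] += p
            count += 1
        # Assumption: smoothing = averaging the subframe spectra of the frame.
        rows.append([a / count for a in acc])
    return rows
```

With the default parameters, each frame of 512 samples contains 15 overlapping subframes, and each row of the result holds 33 frequency bins. Plotting the rows against the frame index gives a time-frequency image of the sequence.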
The Continuous Wavelet Transform (CWT), with 64 scales and the parameter w0 ≈ 5.5, was applied to the genomic signal to obtain a DNA image (scalogram) [31, 35, 37]. Applying the CWT to a DNA signal yields a scalogram together with its corresponding coefficient matrix. We then exploit the time-frequency information contained in the DNA image by calculating the energy of the scalogram at each scale. This gives a vector (1-D spectrum) of size 64 that contains the energy of the DNA scalogram (wavelet matrix) per scale [37].
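The scale-energy vector can be sketched as below. The text gives w0 but does not name the mother wavelet or the scale values, so the complex Morlet wavelet and integer scales are assumptions, as are the function names and the truncation of the wavelet at four standard deviations. Very small integer scales undersample a wavelet with w0 = 5.5, so a practical implementation may prefer a different scale grid.

```python
import cmath
import math


def morlet(t, w0=5.5):
    """Complex Morlet wavelet (assumed mother wavelet)."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def cwt(signal, scales, w0=5.5):
    """Wavelet coefficient matrix: one row per scale, one column per position."""
    n = len(signal)
    rows = []
    for s in scales:
        half = int(math.ceil(4 * s))  # Gaussian envelope is negligible beyond 4*s
        offsets = range(-half, half + 1)
        kernel = [morlet(j / s, w0).conjugate() / math.sqrt(s) for j in offsets]
        row = []
        for b in range(n):
            acc = 0j
            for j, kv in zip(offsets, kernel):
                t = b + j
                if 0 <= t < n:  # samples outside the signal are treated as zero
                    acc += signal[t] * kv
            row.append(acc)
        rows.append(row)
    return rows


def scale_energy(coeffs):
    """Energy per scale: sum over positions of |W(s, b)|^2."""
    return [sum(abs(c) ** 2 for c in row) for row in coeffs]
```

Calling `scale_energy(cwt(signal, range(1, 65)))` on an EIIP-coded signal would return a vector of 64 energies, one per scale, matching the size stated in the text.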
2.4 Recombination analysis
To further examine the similarity between the investigated sequences, we aligned them with the Clustal X program and then analyzed the alignment with the Simplot program [46] using its default settings (window size of 200, 100 replicates, step size of 20, tree model “Neighbor Joining”, distance model “Kimura”, gap stripping “on”). To find shorter alignments, we used the BLAST sequence tool with an Expect threshold (E-value) of 10, according to the stochastic model of Karlin and Altschul (1990) [47–48].