k-mer proximity index for phylogeny comparison of SARS-CoV-2 with other pathogens

We developed a compact and computationally inexpensive method for in-silico comparison of nucleotide sequences at a macro level using subtraction-percentage plots (SP-plots) of a modified chaos game representation (CGR). Analyzing these plots, we defined the k-mer proximity index, which quantifies the differences between the genome sequences of SARS-CoV-2 and other pathogens. We categorized 31 pathogens into four groups on the basis of their proximity to SARS-CoV-2, as a possible aid in planning a treatment strategy for Covid-19.


Introduction
The order of the nitrogenous bases (adenine, thymine, guanine and cytosine, or A, T, G and C) in a nucleotide sequence determines the genetic code of a species. One of the fundamental tasks of genomics is to compare these sequences for phylogenetic analysis. Several methods are available to compare genetic sequences, either through sequence alignment [1,2] or through alignment-free approaches [3,4,5,6,7]. Because genomic data sets are large, sequence-alignment approaches are not time- or memory-efficient, and they have other shortcomings as well [8,9]. Alignment-free approaches provide several different methodologies [10,11,12,13]. One of these is based on the chaos game representation (CGR), which has become popular due to its compact graphical representation of the whole genome sequence [14,15,16,17,18].
Here we describe a novel and simple approach to quantify the similarity or dissimilarity between two genome sequences using a modified CGR. CGR is an iterative mapping technique that converts a genome sequence of a given length into a single image based on the frequency of k-mers (words of k consecutive nucleotides in the sequence), where each pixel corresponds to a specific nucleotide combination [19]. As the genome sequences of different species differ in length, a direct comparison of species based solely on the CGR image is difficult. To make the CGR image length-independent, we first convert it into a percentage CGR plot (PC-plot) by plotting, at each pixel, the percentage of the corresponding k-mer frequency in the genomic sequence (see Methods). Visually, the PC-plots and traditional CGR images are identical. To quantify the similarity between two species, we introduce two new concepts: a subtraction percentage CGR plot (SP-plot) and the k-mer proximity index (Pr). An SP-plot is obtained by subtracting, k-mer by k-mer, the percentage values of one sequence from those of the other. The SP-plot consists of positive and negative values indicating the differences in k-mer percentage distribution between the two species (see Methods). Because the percentages in each PC-plot add up to 100, the sum of the positive differences always equals the magnitude of the sum of the negative differences. This sum is named the k-mer proximity index (Pr), and it represents the degree of similarity between the genome sequences of two species (see Methods). The value of the index increases with the degree of dissimilarity between the two species. It also changes with the value of k, because the distribution of nucleotide combinations of a specific length changes as k changes.
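The equality of the positive and negative sums follows directly from the normalization of the PC-plots. A short derivation, writing p_i and q_i for the percentages of the i-th k-mer in the two genomes and d_i = p_i - q_i for the SP-plot values (this notation is introduced here for illustration and is not the authors'):

```latex
\sum_i p_i = \sum_i q_i = 100
\;\Longrightarrow\;
\sum_i d_i = 0
\;\Longrightarrow\;
\sum_{d_i > 0} d_i \;=\; \sum_{d_i < 0} |d_i| \;=\; \tfrac{1}{2} \sum_i |d_i| \;=\; P_r .
```

Under this reading, Pr is half the L1 distance between the two percentage distributions, so it lies between 0 (identical k-mer distributions) and 100 (no k-mer in common).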

Results
As the world is currently suffering from the deadly Covid-19 pandemic, it is important to quickly understand the interrelationship between different pathogens in order to plan a treatment strategy for Covid-19 based on the available cures for existing pathogens. We compared the genome sequence of the SARS-CoV-2 virus with those of 31 other pathogens by visually inspecting the PC-plots of individual pathogens and the SP-plots of pathogen pairs (Supplementary Figures 3-33), and by quantifying the similarity through the k-mer proximity indices of each pair (Supplementary Table 1). We calculated the SARS-CoV-2 k-mer proximity indices for several higher-order oligomers (k = 4 to 9) for all 31 pathogen pairs (see Supplementary Table 1). The bottom panel of Figure 1 shows the 4-mer (tetra-nucleotide) proximity indices for all 31 pathogen pairs in increasing order.
The lowest tetra-nucleotide proximity index is for the SARS-CoV-2 and SARS-CoV-1 pair, and the highest is for the SARS-CoV-2 and Rubella pair. We divided the pathogens into four categories according to their proximity index values.
Category A - This category contains the pathogens with the lowest 4-mer proximity index with SARS-CoV-2 (less than 10); their genomes show a striking similarity to that of the COVID-19 virus. Existing drugs or treatments, if available, for any of these pathogens stand a very high chance of being successful. For example, Sofosbuvir, a drug for hepatitis C virus (HCV), was recently suggested [20] as a potential treatment for COVID-19.
Category B - Pathogens with a low 4-mer proximity index (between 10 and 20) belong to this group, and any treatment available for these pathogens stands a moderate chance of success when tried against SARS-CoV-2. This has already been noted in the recent trials of remdesivir (earlier used for hepatitis and Ebola).
Category C - Pathogens with a moderate 4-mer proximity index (between 20 and 30) belong to this group, and any treatment available for these pathogens has very little chance of success as a cure for SARS-CoV-2. This has been observed in a few recent trials in which HIV drugs were administered to coronavirus patients without much success.
Category D - The PC-plots of the pathogens in this group are quite far from the PC-plot of SARS-CoV-2 (4-mer proximity index greater than 30), and therefore treatments for these pathogens are not expected to carry over to COVID-19. The recent failures of hydroxychloroquine (malaria) and the BCG vaccine (tuberculosis) support our hypothesis.
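The four categories amount to a threshold rule on the 4-mer proximity index. The sketch below is one possible encoding in Python; the function name is hypothetical, and the handling of values that fall exactly on 10, 20 or 30 is an assumption, since the text does not say which side of a boundary is inclusive.

```python
def proximity_category(pr4):
    """Map a 4-mer proximity index (0 to 100) to category A, B, C or D.

    Assumption: intervals are half-open, i.e. A = [0, 10), B = [10, 20),
    C = [20, 30), D = [30, 100]. The text does not fix the boundary cases.
    """
    if pr4 < 0:
        raise ValueError("a proximity index cannot be negative")
    if pr4 < 10:
        return "A"
    if pr4 < 20:
        return "B"
    if pr4 < 30:
        return "C"
    return "D"
```

For instance, a pair with an index of 12.5 (an illustrative value, not taken from Supplementary Table 1) would be placed in Category B.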
We plotted the variation of the k-mer proximity indices with k for the 31 pathogen pairs (Figure 2). The index value increases with the size of the k-mer, which means that the dissimilarity between species increases at higher k-mer levels, but the relative changes in the k-mer proximity index for a given pathogen pair remain the same throughout.
Another important aspect of this method is its time efficiency. The MATLAB code written for this method takes only 37 seconds to compare two sequences of 30,000 points each on a Core i7 laptop computer with 8 GB of RAM. Our method provides a novel and compact way to compare phylogeny and to quantify the similarity between two species in a time-efficient way.

Methods
CGR - A genetic sequence X can be considered as a string composed of A, G, C and T, which represent adenine, guanine, cytosine and thymine, respectively. Its k-th element is denoted X(k):

$X(k) \in \{C, A, T, G\}$
We consider a unit square U and name its corners $C_i$ (i = 1, 2, 3, 4) as C, A, T and G, respectively, corresponding to the possible values of X(k). The initial point P(0) is the midpoint of the square. The next point P(1) is the midpoint between P(0) and the corner $C_{X(1)}$. In general, P(k) is plotted as the midpoint between P(k-1) and $C_{X(k)}$ [14].
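The midpoint rule can be sketched in a few lines of Python. The corner coordinates below are an assumption made for illustration: the text names the corners C, A, T and G but does not say which corner of the unit square each one occupies. Skipping symbols other than A, C, G and T is likewise a choice made here.

```python
# Assumed corner layout; the text does not specify the coordinates.
CORNERS = {"C": (0.0, 0.0), "A": (0.0, 1.0), "T": (1.0, 1.0), "G": (1.0, 0.0)}


def cgr_points(sequence):
    """Return the CGR points P(1), ..., P(n) for a nucleotide string."""
    x, y = 0.5, 0.5  # P(0): midpoint of the unit square
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # ignore ambiguous symbols such as N (a choice made here)
        cx, cy = CORNERS[base]
        # P(k) is the midpoint between P(k-1) and the corner named by X(k)
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


points = cgr_points("GCTTATGT")  # the eight-base example sequence from the text
```

Because every step halves the distance to a corner, all points stay inside the unit square, and the position of P(k) is dominated by the most recent bases.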
After plotting the genetic sequence X in the unit square U, the unit square is divided into $2^k \times 2^k$ sub-squares; each sub-square represents a unique sub-sequence of length k (a k-mer). Here k denotes the word length, not the position index used above.
An example of the movement of points in CGR is shown with the first eight members of the data sequence (GCTTATGT) in Supplementary Figure 1. An example of the addresses of the sub-squares for nucleotides, di-nucleotides (2-mer), tri-nucleotides (3-mer) and tetra-nucleotides (4-mer) is given in Supplementary Figure 2.
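The correspondence between sub-squares and k-mers can be checked with a small sketch. It assumes a corner layout (not specified in the text) in which each base is encoded by one bit per axis; `kmer_cell` is a hypothetical helper, and the addresses it returns will match Supplementary Figure 2 only if the same corner layout is used there.

```python
from itertools import product

# Assumed corner layout: each base maps to a pair of bits (x, y).
CORNER_BITS = {"C": (0, 0), "A": (0, 1), "T": (1, 1), "G": (1, 0)}


def kmer_cell(kmer):
    """Return the (column, row) index of the sub-square that holds a k-mer
    in the 2^k x 2^k grid, counted from the origin of the unit square."""
    col = row = 0
    for j, base in enumerate(kmer.upper()):
        bx, by = CORNER_BITS[base]
        # Each midpoint step halves the previous coordinate, so later bases
        # end up in the more significant bits of the final position.
        col |= bx << j
        row |= by << j
    return col, row


# Every one of the 4^k possible k-mers lands in its own sub-square.
k = 3
cells = {kmer_cell("".join(p)) for p in product("ACGT", repeat=k)}
print(len(cells) == 4 ** k)  # True
```

The last base of the k-mer selects the quadrant, the one before it selects the quadrant within that quadrant, and so on, which is why a point's sub-square depends only on the k most recent bases.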
PC-plots - To make these plots, the percentage of points falling in each sub-square is calculated. This percentage value represents the intensity of points in the sub-square. After plotting the points by CGR and dividing the unit square into $2^k \times 2^k$ sub-squares, each sub-square is color-filled according to its calculated intensity value. Supplementary Figure 3 shows the percentage plots (Y) made for SARS-CoV-2 and SARS-CoV-1 for k = 7. Similar plots were made for all the pathogens (see Supplementary Figures 4-33).
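Since each sub-square corresponds to exactly one k-mer, the intensity values of a PC-plot can be approximated by counting k-mers directly. The sketch below takes that shortcut and is not the authors' MATLAB implementation: it counts sliding windows of length k, so it leaves out the first k-1 CGR points, which have not yet seen k bases. The function name and the dictionary representation are choices made here.

```python
from collections import Counter


def kmer_percentages(sequence, k):
    """Return {k-mer: percentage of all k-mer occurrences in the sequence}."""
    seq = sequence.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing symbols other than A, C, G, T are skipped
    # (an assumption; the text does not discuss ambiguous bases).
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no valid k-mer of this length")
    return {kmer: 100.0 * n / total for kmer, n in counts.items()}


pc = kmer_percentages("GCTTATGT", 2)  # di-nucleotide percentages of the example sequence
```

Dividing by the total makes the values independent of genome length, which is the property that allows genomes of different sizes to be compared.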
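SP-plots and k-mer proximity index - The subtraction plot between genome 1 (g1) and genome 2 (g2) is plotted as

$S = Y_{g1} - Y_{g2}$

where $Y_{g1}$ and $Y_{g2}$ are the percentage plots of the two genomes and the subtraction is carried out sub-square by sub-square. For example, for di-nucleotides, $Y_{g1}$ and $Y_{g2}$ are 4 x 4 matrices of percentage density values, and S is the 4 x 4 matrix of their element-wise differences. From the subtraction plot S, the sum of all the positive numbers (which equals the sum of the moduli of the negative numbers) is a measure of the similarity or dissimilarity between the two genetic sequences; this sum is the k-mer proximity index (Pr).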
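A minimal sketch of the subtraction step and the proximity index, assuming each PC-plot is stored as a dictionary that maps k-mers to percentages (a representation chosen here; the authors' MATLAB code is not shown in the text). The function names are hypothetical, and the two example distributions are made-up numbers for illustration only, not data from the paper.

```python
def sp_values(pc1, pc2):
    """Per-k-mer differences pc1 - pc2; a k-mer missing from a plot counts as 0 %."""
    kmers = set(pc1) | set(pc2)
    return {kmer: pc1.get(kmer, 0.0) - pc2.get(kmer, 0.0) for kmer in kmers}


def proximity_index(pc1, pc2):
    """Sum of the positive entries of the subtraction plot."""
    return sum(d for d in sp_values(pc1, pc2).values() if d > 0)


# Made-up 1-mer percentages, for illustration only.
g1 = {"A": 30.0, "C": 20.0, "G": 20.0, "T": 30.0}
g2 = {"A": 25.0, "C": 25.0, "G": 30.0, "T": 20.0}

s = sp_values(g1, g2)
positive = sum(d for d in s.values() if d > 0)
negative = -sum(d for d in s.values() if d < 0)
print(positive, negative)  # 15.0 15.0
```

Swapping the two genomes flips the sign of every entry of S but leaves the index unchanged, so the measure is symmetric in the pair.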
Figure 1. The top panel shows PC-plots (upper half) of five pathogens, SARS-CoV-2 and one each from the four categories, and SP-plots (lower half) of four pathogens compared with SARS-CoV-2. The number in parentheses is the tetra-nucleotide proximity index of the corresponding pathogen pair. The bottom panel shows the variation in tetra-nucleotide proximity indices for the 31 pathogen pairs. The pathogens are grouped into four categories, A, B, C and D, based on their dissimilarity from SARS-CoV-2.