Datasets and experimental protocol
The experimental protocol consists of three steps, illustrated in Figure 1:
- Collecting the dataset: A dataset containing 12962 genomes belonging to 96 families of coronaviruses was collected from [22]. The dataset includes 10313 genomes of SARS-CoV-2.
- Preprocessing of the dataset: After fixing the value of N, we used the N-grams technique to count the occurrences of each subsequence of N nucleotides in every genome of the dataset. We then form a common base of fixed size by selecting the first subsequences that are shared by the most viruses. Next, we project each virus onto this common base and normalize the results. An example of this preprocessing with two genomes, “ATGATGGATTG” and “ATGGATGTGGG”, is given in Table 2.
Table 2: Example of applying N-grams (N = 3) to two nucleic acid sequences; the values in brackets are the normalized values.

| N-grams of genome N°1 (ATGATGGATTG) | Count | N-grams of genome N°2 (ATGGATGTGGG) | Count | Common base of size 4 | Projection of genome N°1 onto the common base | Projection of genome N°2 onto the common base |
|---|---|---|---|---|---|---|
| ATG | 2 | ATG | 2 | ATG | 2 (0.8660254) | 2 (0.8660254) |
| GAT | 2 | TGG | 2 | GAT | 2 (0.8660254) | 1 (-0.8660254) |
| ATT | 1 | GGA | 1 | GGA | 1 (-0.8660254) | 1 (-0.8660254) |
| GGA | 1 | GGG | 1 | TGG | 1 (-0.8660254) | 2 (0.8660254) |
| TGA | 1 | GAT | 1 | | | |
| TTG | 1 | TGT | 1 | | | |
| TGG | 1 | GTG | 1 | | | |
- Identifying SARS-CoV-2: We treated all 10313 genomes of SARS-CoV-2, which were collected from different geographic locations, as unknown. We then used the aforementioned five supervised machine learning algorithms to identify the species (class) of each one.
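The preprocessing step can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' code: the function names are hypothetical, the tie-breaking rule used to rank shared subsequences is an assumption, and the normalization is taken to be a z-score with the sample standard deviation because that convention reproduces the bracketed values of Table 2.

```python
from collections import Counter
from statistics import mean, stdev


def ngram_counts(genome, n):
    """Count every overlapping subsequence of length n in a genome string."""
    return Counter(genome[i:i + n] for i in range(len(genome) - n + 1))


def common_base(all_counts, size):
    """Keep the `size` n-grams that are shared by the most genomes.

    Ties are broken by total count, then alphabetically. This rule is an
    assumption made here so that the result is deterministic.
    """
    shared = Counter()  # number of genomes that contain each n-gram
    total = Counter()   # total occurrences over all genomes
    for counts in all_counts:
        shared.update(counts.keys())
        total.update(counts)
    ranked = sorted(shared, key=lambda g: (-shared[g], -total[g], g))
    return sorted(ranked[:size])


def project(counts, base):
    """Read off a genome's count for each n-gram of the base (0 if absent)."""
    return [counts[g] for g in base]


def normalize(vector):
    """Z-score with the sample standard deviation (inferred from Table 2)."""
    if len(set(vector)) < 2:  # constant or too short: nothing to scale
        return [0.0] * len(vector)
    m, s = mean(vector), stdev(vector)
    return [(v - m) / s for v in vector]


# The two example genomes of Table 2, with N = 3 and a base of size 4.
genomes = ["ATGATGGATTG", "ATGGATGTGGG"]
counts = [ngram_counts(g, 3) for g in genomes]
base = common_base(counts, 4)
vectors = [project(c, base) for c in counts]
scaled = [normalize(v) for v in vectors]

print(base)     # ['ATG', 'GAT', 'GGA', 'TGG']
print(vectors)  # [[2, 2, 1, 1], [2, 1, 1, 2]]
print([[round(v, 7) for v in row] for row in scaled])
```

On real genomes the same functions would be applied to every sequence of the dataset, so that each virus ends up as one fixed-length numeric vector.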
Step 3 is performed in two phases (Figure 1). The first phase is training, in which each machine learning algorithm learns from the genomes of the whole dataset (except those of SARS-CoV-2). In this phase we selected the parameters of each algorithm using a grid search: SVM has two parameters, the regularization parameter and the width of the Gaussian kernel; ANN has two parameters, the number of hidden layers and the number of neurons in each layer; KNN has two parameters, k (the number of neighbors) and the distance measure, which was chosen to be Euclidean; DT has one parameter, the criterion used to split a given node (two impurity functions can be used, the Gini index and information gain); and NB has practically no parameter to estimate. The second phase is testing, in which the algorithms are used to predict the species of the SARS-CoV-2 genomes that were previously treated as unknown.
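The two phases can be illustrated with KNN, the simplest of the five classifiers. The sketch below is a toy under stated assumptions: the feature vectors and the labels `species_A` and `species_B` are made up, the grid contains only values of k, and candidates are scored by leave-one-out accuracy, which the text does not specify.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training vectors closest to x (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy of k-NN on the training set."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)


def grid_search_k(X, y, grid):
    """Return the first k in `grid` with the best leave-one-out accuracy."""
    return max(grid, key=lambda k: loo_accuracy(X, y, k))


# Training phase: labelled genomes only (hypothetical 2-D feature vectors).
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.0, 0.3),
           (1.0, 1.0), (0.9, 1.1), (1.1, 0.9), (1.0, 1.2)]
train_y = ["species_A"] * 4 + ["species_B"] * 4
best_k = grid_search_k(train_X, train_y, grid=[1, 3, 5])

# Testing phase: genomes treated as unknown are assigned to a known species.
unknown_X = [(0.15, 0.15), (0.05, 0.10), (0.20, 0.25)]
predictions = [knn_predict(train_X, train_y, x, best_k) for x in unknown_X]

# Tally how many unknown genomes were assigned to each species.
origin_tally = Counter(predictions)
print(best_k, origin_tally)
```

The same pattern (fit on everything except the held-out class, then count the predicted labels of the held-out genomes) applies to the other classifiers, each with its own parameter grid.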
Experimental results
Before investigating the origins and intermediate hosts of COVID-19, we propose to visualize the set of 10313 genomes of this virus. Since each genome, after applying N-grams, is represented in a high-dimensional space (e.g. 64 dimensions if N = 3), it is very difficult, if not impossible, to visualize them directly. Reducing the space to 2D or 3D, while preserving as much of the variance as possible, allows us to plot the genomes and observe patterns more clearly. Figure 2 shows the result of applying a dimensionality reduction technique known as principal component analysis (PCA), from 64D to 2D, to the aforementioned set. It can be seen clearly that the majority of SARS-CoV-2 genomes belong to the same cluster or family (except for some outliers).
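As a rough illustration of this reduction, the sketch below projects toy vectors onto their first two principal axes. It is not the implementation used for Figure 2: the data are random 4-dimensional stand-ins for the 64-dimensional 3-gram vectors, the function names are hypothetical, and the principal axes are found by power iteration with deflation, chosen here only to keep the example free of external libraries.

```python
import random


def center(rows):
    """Subtract the per-feature mean from every row."""
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iters=500, seed=0):
    """Dominant eigenvector and eigenvalue of `cov` by power iteration."""
    rng = random.Random(seed)
    d = len(cov)
    v = [rng.random() + 0.1 for _ in range(d)]
    norm = sum(x * x for x in v) ** 0.5
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:  # no variance left to explain
            break
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    return v, sum(v[i] * cov_v[i] for i in range(d))


def pca(rows, n_components=2):
    """Project rows onto their first `n_components` principal axes."""
    centred = center(rows)
    cov = covariance(centred)
    d = len(cov)
    axes = []
    for _ in range(n_components):
        v, lam = top_component(cov)
        axes.append(v)
        # Deflation: remove the variance explained by v before the next pass.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(r[j] * a[j] for j in range(d)) for a in axes]
            for r in centred]


# Toy data: 50 random 4-D vectors with very different spreads per feature.
rng = random.Random(1)
data = [[rng.gauss(0, s) for s in (5, 2, 0.5, 0.1)] for _ in range(50)]
coords = pca(data, n_components=2)
print(coords[:3])  # first three genomes as 2-D points, ready to plot
```

Each input vector becomes one 2-D point; the first coordinate carries the largest share of the variance and the second the next largest, so some information is necessarily discarded.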
Figures 3, 4, 5, 6 and 7 show that the predicted origins of the 10313 genomes of SARS-CoV-2 vary with the machine learning technique used in the experiment. The majority of classifiers (KNN, SVM, DT and ANN) gave practically the same result, namely that pangolin is the most probable intermediate host of SARS-CoV-2. In the result illustrated in Figure 7, which corresponds to NB, all of the SARS-CoV-2 genomes are categorized as Alphacoronavirus 1. This unacceptable result, which is completely different from the others, can be explained by the fact that NB applies Bayes’ theorem with strong (naïve) independence assumptions between the features. The features here are the 3-grams, which are evidently dependent because their succession determines a nucleic acid sequence.
Since SVM is well known for its high discriminative capacity, we use its detailed results to investigate this virus further (Figure 8).
Figure 8 shows in detail the degree of belonging of a sample of SARS-CoV-2 to each member of the Coronaviridae family (96 in total). This degree of belonging is a natural number that counts how many times a member is voted as a potential origin of the virus. The most voted origin of COVID-19 is pangolin, followed directly by Alphacoronavirus 1. The latter includes Canine coronavirus, Feline coronavirus and Transmissible gastroenteritis virus, which are linked to dogs, cats and pigs respectively [22]. Combined with the results obtained previously (Figures 3, 4, 5 and 6), this shows that even though Alphacoronavirus 1 ranks second and is linked to the most common domestic animals (which are in direct contact with humans), it is still a very weak competitor to pangolin. The reason is that four of the five classifiers used in the previous experiment practically never gave Alphacoronavirus 1 as the origin, despite the variety of genomes collected from different geographical locations. In third position we find bat alphacoronavirus, which no classifier gave as a potential origin of SARS-CoV-2 (except ANN, for a negligible number of genomes). In conclusion, pangolin is the most likely actual origin of SARS-CoV-2.