Nucleic acids are the biopolymers that carry the genetic information of living organisms. The two main classes of nucleic acids are deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). DNA is a double-stranded molecule built from four nucleotide bases: adenine (A), guanine (G), thymine (T), and cytosine (C). RNA is typically single-stranded and contains guanine, uracil (in place of thymine), adenine, and cytosine, denoted by the letters G, U, A, and C, respectively.
In molecular biology and genetics, a genome is the complete set of genetic information of an organism. It consists of a long sequence of the aforementioned letters arranged in a particular order (e.g., the human genome contains about 3.2 billion of them). These letters encode the genes, that is, the instructions that control the fundamental biological processes of life. A genome can therefore be seen as a text written in a language with a simple alphabet of just four letters rather than 26.
Biological sequence analysis, a subfield of bioinformatics and computational biology, aims to computationally process and decode the information stored in genomes. It brings together several fields, from computer science to probability and statistics. Its goals include searching for similarities between sequences of different organisms, identifying intrinsic features of a sequence, determining sequence differences, and identifying molecular structure.
This paper falls within the scope of biological sequence analysis and has two goals. The first is to search for the origin(s) and inter-host(s) of each SARS-CoV-2 genome by comparing it with all members of the Coronaviridae family. The second is to investigate the relationship between the sampling regions and the origins of this virus. The experimental study is performed in two successive steps. The first uses N-grams to extract relevant information from a given sequence and to represent it in numeric form. The second uses machine learning techniques to find biological homology between the genomes of different viruses.
The rest of this paper is organized as follows. Section 2 briefly introduces SARS-CoV-2 and presents some pioneering earlier studies on its possible origin(s) and inter-host(s). Section 3 gives a short presentation of supervised machine learning, the five techniques used, and the N-gram method. Section 4 describes the experimental protocol and the results of applying these techniques to the genomes of this virus. Finally, the conclusions are discussed in Section 5.
A brief overview of coronaviruses, their origin(s) and inter-host(s)
Coronaviruses belong to the family Coronaviridae, a group of enveloped, positive-sense, single-stranded RNA viruses. The virus studied here, officially named SARS-CoV-2 by the International Committee on Taxonomy of Viruses [9], is spreading rapidly from its origin in Wuhan, China, to the rest of the world. It is mainly transmitted through droplets that are produced when an infected person coughs, sneezes, or exhales and that fall onto floors or surfaces because of their weight. A person can be infected by inhaling these droplets directly or by touching a contaminated surface and then their eyes, nose, or mouth. The symptoms of the disease include fever, dry cough, breathing difficulties, headache, nasal congestion, runny nose, and pneumonia. This illness, for which there is neither an approved drug nor a vaccine, is significantly affecting people's health, businesses, and the economy all over the world. Unfortunately, the only ways to prevent it are to avoid direct exposure to the virus and contact with other people.
Potential origins and inter-hosts of this virus have been discussed in many research papers. For example, researchers [12, 26] found that SARS-CoV-2 showed a high sequence homology to Bat-CoV-RaTG13 and stated that the virus originated in bats. Another paper [6] suggested that Bovidae and Cricetidae should be included in the screening of intermediate hosts for SARS-CoV-2. In the same context, [30] predicted that SARS-CoV-2 utilizes the ACE2 of various mammals, excluding murines, and of some birds, such as pigeons, making them candidate intermediate hosts. Several recent studies [11, 27, 28, 29] proposed that pangolins might be the intermediate hosts between bats and humans because of the similarity of the pangolin coronavirus to SARS-CoV-2.
A brief overview of the five supervised learning algorithms and the N-gram method
Machine learning is a subfield of artificial intelligence concerned with the development of techniques and methods that enable a computer to learn. There are two main types of machine learning tasks: supervised and unsupervised. In supervised learning, the machine is trained on a labeled dataset (i.e., the classes of the objects are known), for example a dataset of genome sequences in which each sequence is tagged with its species by a domain expert. Typical supervised tasks are classification, regression, and time series analysis. In unsupervised learning, the machine is trained on an unlabeled dataset, for example a dataset of genome sequences whose species are unknown. Typical unsupervised tasks are projection, clustering, density estimation, and generative modeling.
This research paper focuses on supervised learning algorithms and uses five of them to search for similarities between the genomes of different viruses.
The first is the Support Vector Machine (SVM), which can be used for classification, regression, and prediction problems. SVM was first introduced by Boser, Guyon, and Vapnik at COLT-92 in 1992 [1]. The basic idea is simple: in the two-dimensional case, the algorithm finds a line (a hyperplane in higher dimensions) that separates two classes of objects. The equation of this separator is computed so as to maximize the margin between the two classes. The resulting decision boundary is then used to classify new, unknown objects based on their position (above or below the line). In the medical domain, SVM has been used intensively in applications such as disease diagnosis [5], detection of medical disorders in MRI images [7], and medical data classification [8].
The second is the Artificial Neural Network (ANN), which is inspired by the way the human brain learns and processes information. An ANN consists of a collection of connected units or nodes called artificial neurons, which approximately model the neurons in a biological brain. In bioinformatics, ANNs have been used for many tasks, such as classification of biological data [13], identification of functional genetic variants and prediction of traits [14], and breast cancer image classification [15].
The third is Naïve Bayes (NB), a probabilistic classifier based on applying Bayes' theorem with strong independence assumptions between the features. This technique has also been applied in the medical domain, for example to the diagnosis of Alzheimer's disease [16], the classification of DNA barcodes [17], and gene expression data [18].
The fourth is K-Nearest Neighbors (k-NN), a simple and fast classification method. It searches a given dataset for the k points closest to an unknown sample, with similarity defined by a distance function. It then assigns to the unknown sample the most frequent class among those k points. k-NN has found several applications in biology, for example predicting protein subchloroplast locations [19], gene expression [20], and gene expression in cancer diagnosis [21].
The fifth classifier is the Decision Tree (DT), which uses a tree-like model of decisions in which each internal node represents a test on an attribute. This classifier has been used in many biological applications, such as missing value imputation in DNA microarray gene data [23], analysis of gene expression data [24], and classification of pathogenic gene sequences [25].
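The k-NN procedure described above (find the k closest labeled points, then take a majority vote) can be illustrated with a minimal sketch. The function name `knn_predict`, the choice of Euclidean distance, and the toy two-dimensional data are assumptions made for illustration only; they are not taken from the experiments reported in this paper.

```python
from collections import Counter
import math


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs. Similarity is
    measured with Euclidean distance (an assumption; other distance
    functions can be substituted).
    """
    # Sort the training points by distance to the query and keep the k closest.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # The predicted class is the most frequent label among those neighbours.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: 2-D feature vectors with made-up class labels.
train = [
    ([1.0, 1.0], "A"),
    ([1.2, 0.8], "A"),
    ([0.9, 1.1], "A"),
    ([5.0, 5.0], "B"),
    ([5.2, 4.9], "B"),
]
print(knn_predict(train, [1.1, 1.0], k=3))  # A
```

In a sequence-analysis setting, the feature vectors would be numeric representations of genomes rather than hand-written points, and the labels would be the known classes of the reference sequences.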
In language modeling and text categorization, an N-gram [2] is a sequence of N consecutive items in a text. N-grams fall into two categories: character N-grams and word N-grams. A character N-gram is a sequence of N consecutive characters extracted from a word; it is used for tasks such as language identification and data compression. A word N-gram is a sequence of N consecutive words extracted from a text; it is used for a wide range of tasks, such as statistical language modeling and information retrieval. As noted earlier, a nucleotide sequence can be seen as a text (or, more precisely, a word, since it contains no spaces), so N-grams can be applied to it. Table 1 shows the result of applying this process to a random sequence, “TGATGACTGATACA”. N-grams have also been applied in numerous medical and biological fields, such as the analysis of RNA [3], clustering of DNA sequences [4], and genome data classification [10].
Table 1: Example of extracting N-grams from a nucleic acid sequence “TGATGACTGATACA”
| # | N-grams (N = 3) | Occurrences | N-grams (N = 6) | Occurrences | N-grams (N = 9) | Occurrences |
|---|---|---|---|---|---|---|
| 1 | TGA | 3 | GATGAC | 1 | GATGACTGA | 1 |
| 2 | GAT | 2 | TGACTG | 1 | GACTGATAC | 1 |
| 3 | ACT | 1 | ATGACT | 1 | ACTGATACA | 1 |
| 4 | ATG | 1 | CTGATA | 1 | TGACTGATA | 1 |
| 5 | CTG | 1 | GATACA | 1 | ATGACTGAT | 1 |
| 6 | GAC | 1 | TGATGA | 1 | TGATGACTG | 1 |
| 7 | TAC | 1 | GACTGA | 1 | | |
| 8 | ACA | 1 | ACTGAT | 1 | | |
| 9 | ATA | 1 | TGATAC | 1 | | |
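The extraction summarized in Table 1 can be reproduced with a short sketch that slides a window of length N over the sequence one character at a time and counts each N-gram. The function names `count_ngrams` and `ngram_vector` are hypothetical, and the fixed-length count vector over all 4^N possible N-grams is only one plausible way to obtain a numeric representation; it is an assumption, not necessarily the encoding used in the experiments.

```python
from collections import Counter
from itertools import product


def count_ngrams(sequence, n):
    """Count every overlapping window of n consecutive characters."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def ngram_vector(sequence, n, alphabet="ACGT"):
    """Map a sequence to a fixed-length vector of N-gram counts.

    The vector has one entry per possible N-gram over the alphabet
    (4**n entries for nucleotides), in lexicographic order. This
    encoding is an illustrative assumption.
    """
    counts = count_ngrams(sequence, n)
    return [counts["".join(p)] for p in product(alphabet, repeat=n)]


sequence = "TGATGACTGATACA"
trigrams = count_ngrams(sequence, 3)
print(trigrams.most_common(2))         # [('TGA', 3), ('GAT', 2)]
print(len(count_ngrams(sequence, 6)))  # 9 distinct 6-grams
print(len(count_ngrams(sequence, 9)))  # 6 distinct 9-grams
print(len(ngram_vector(sequence, 3)))  # 64
```

A vector of this kind gives every sequence the same dimensionality regardless of its length, which is what a classifier needs as input. Its size grows as 4^N, so large values of N would call for a sparse representation.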