DOI: https://doi.org/10.21203/rs.3.rs-322281/v1
Objectives: SARS-CoV-2, SARS-CoV and MERS-CoV all belong to the Coronaviridae family. In this study we compare the amino acid and codon sequences of SARS-CoV-2, SARS-CoV and MERS-CoV using several statistical programs to understand their characteristics. Specifically, we are interested in how differences in the amino acid and codon sequences lead to different incubation periods and outbreak periods.
Results: Our initial question was how SARS-CoV-2 compares with other viruses in the coronavirus family. Experiments using BLAST, Apriori and Decision Tree showed that SARS-CoV-2 has high similarity with SARS-CoV and comparatively low similarity with MERS-CoV. We therefore compared the codons of SARS-CoV-2 and MERS-CoV to examine the difference. Although the viruses appear very alike according to the BLAST and Apriori experiments, the SVM experiment showed that they can be classified effectively using non-linear kernels. The Decision Tree experiment identified several notable properties of the SARS-CoV-2 amino acid sequence that are not found in the MERS-CoV amino acid sequence.
The ultimate purpose of this paper is to help minimize the damage to humanity from SARS-CoV-2. Hence, further studies can focus on comparing SARS-CoV-2 with other viruses that can also be transmitted during latent periods.
SARS-CoV-2, SARS-CoV and MERS-CoV all belong to the Coronaviridae family, the Orthocoronavirinae subfamily and the Betacoronavirus genus. Betacoronaviruses infect mammals; other known members include Bovine Coronavirus, Human coronavirus OC43 and Tylonycteris bat coronavirus HKU4. Of the six known species of human coronaviruses (seven viruses when the two strains subdivided from one species are counted separately), the three viruses mentioned above are known to produce severe symptoms.
In this study we compare the amino acid and codon sequences of SARS-CoV-2, SARS-CoV and MERS-CoV using several statistical programs to understand their characteristics. Specifically, we are interested in how differences in the amino acid and codon sequences lead to different incubation periods and outbreak periods. We also hope to provide insight toward resolving the current SARS-CoV-2 pandemic and to suggest future research directions.
SARS-CoV-2, SARS-CoV and MERS-CoV are all members of the coronavirus family and therefore share many microbiological similarities. Table 1 summarizes some of the similarities and differences among the viruses. The table is based on the WHO website.
Characteristic | SARS-CoV-2 | SARS-CoV | MERS-CoV |
---|---|---|---|
Microbiology | Enveloped RNA virus | Enveloped RNA virus | Enveloped RNA virus |
Outbreak period | 2019–present | 2002–2003 | 2012–present |
Initial site of isolation | Wuhan, China | Guangdong province, China | Saudi Arabia |
Countries | 214 | 29 | 27 |
No. of cases (mortality) | 1,033,187 (2.9%) | 8096 (9.6%) | 2494 (~34%) |
Reservoir (intermediate host) | Likely bats (pangolins) | Bats (palm civet) | Bats (dromedary camels) |
Incubation period | 2–5 days (range, 2–14 days) | 2–7 days (range, 2–21 days) | 2–7 days (range, 2–14 days) |
Infectivity, R0 | 2.5–3 | 2.2–3.7 (range, 0.3–4.1) | 0.3–1.3 |
Super spreaders | Yes | Yes | Yes (uncommon) |
Transmission (including to HCP) | Droplet/direct, airborne/indirect | Droplet/direct, airborne/indirect? | Droplet/direct, airborne/indirect? |
Treatment (PEP) | Dexamethasone, Remdesivir | Supportive (none) | Supportive (none) |
Infection prevention | Droplet, contact, face shield | Droplet, contact, face shield | Droplet, contact, face shield |
Window | Species | Rule (default class 1) | Rule (default class 2) | Rule (default class 3) |
---|---|---|---|---|
9window | SARS-CoV-2 | pos2 = D & pos9 = Q | pos2 = K & pos7 = N | |
9window | SARS-CoV | | | |
9window | MERS-CoV | pos3 = W & pos6 = V; pos3 = S & pos9 = L; pos2 = F & pos6 = C; pos3 = I & pos6 = V; pos2 = L & pos6 = R; pos1 = Q & pos2 = G | pos1 = M & pos4 = F; pos1 = P & pos7 = T; pos5 = G & pos7 = M; pos4 = H & pos7 = N; pos4 = D & pos7 = _; pos5 = E & pos7 = M; pos2 = L & pos6 = R; pos2 = E & pos4 = N; pos2 = L & pos4 = Y; pos1 = Y & pos4 = F; pos1 = P & pos4 = K | |
13window | SARS-CoV-2 | pos12 = D & pos13 = N | | |
13window | SARS-CoV | pos1 = D & pos13 = V | pos12 = S & pos13 = N | |
13window | MERS-CoV | pos1 = D & pos10 = I; pos11 = I & pos13 = V; pos12 = L & pos13 = I; pos12 = A & pos13 = P; pos12 = V & pos13 = P; pos6 = I & pos13 = _; pos7 = Y & pos13 = A; pos11 = V & pos13 = L; pos3 = A & pos13 = Q; pos11 = F & pos13 = V; pos11 = I & pos13 = V; pos5 = V & pos13 = P; pos11 = H & pos13 = V | pos11 = H & pos13 = V; pos5 = V & pos13 = P; pos6 = L & pos13 = G; pos11 = I & pos13 = E; pos11 = I & pos13 = V | |
19window | SARS-CoV-2 | pos10 = T & pos12 = K | | |
19window | SARS-CoV | pos5 = L & pos10 = V; pos4 = I & pos7 = K; pos10 = I & pos13 = K | | |
19window | MERS-CoV | pos5 = Y & pos10 = V; pos4 = L & pos7 = K; pos7 = A & pos12 = Y; pos7 = I & pos19 = T; pos15 = G & pos16 = I; pos13 = L & pos15 = K; pos13 = V & pos15 = K; pos15 = V & pos16 = A; pos15 = V & pos16 = P; pos3 = S & pos15 = G; pos12 = E & pos16 = G; pos3 = S & pos6 = S; pos7 = H & pos11 = I; pos2 = S & pos15 = Q; pos2 = E & pos15 = Q; pos4 = T & pos10 = I; pos3 = L & pos10 = L; pos7 = S & pos15 = T; pos15 = V & pos16 = S; pos15 = V & pos16 = V | pos13 = V & pos15 = K | |
Window
In a peptide sequence, a window is one region of the sequence after it has been divided at regular intervals. An appropriate window size is important for reducing variability and obtaining reliable patterns.
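As a minimal sketch of this idea, the Python function below cuts a peptide sequence into consecutive, non-overlapping windows of a fixed size. The function name `split_into_windows` is hypothetical, and padding the last window with `_` is an assumption made here so that every window has the same length; the source does not describe how incomplete windows were handled.

```python
def split_into_windows(sequence: str, size: int, pad: str = "_") -> list[str]:
    """Cut a peptide sequence into consecutive, non-overlapping windows."""
    if size <= 0:
        raise ValueError("window size must be positive")
    windows = []
    for start in range(0, len(sequence), size):
        chunk = sequence[start:start + size]
        # Assumption: pad the final, shorter chunk so all windows share one length.
        windows.append(chunk.ljust(size, pad))
    return windows


# Example with an arbitrary 11-residue peptide and a window size of 9.
example_windows = split_into_windows("MESLVPGFNEK", 9)
```

With a window size of 9, the 11-residue example yields one full window and one padded window.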
FASTA Format
FASTA format represents nucleotide or peptide sequences using single-letter codes. This is useful in bioinformatics because sequence information can then be handled by text-processing tools.
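To illustrate why a plain-text format is convenient, the following sketch reads FASTA text into a dictionary using only string operations. It assumes the usual layout (a header line starting with `>` followed by one or more sequence lines); the function name `parse_fasta` and the record names in the example are hypothetical.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records: dict[str, list[str]] = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical two-record input, for illustration only.
example = """>virus_A example
MESLVPGF
NEKTHVQL
>virus_B example
MSFVAGVT
"""
sequences = parse_fasta(example)
```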
BLAST
BLAST is a program provided by NCBI for comparing biological sequence information. Among the several BLAST programs, we chose nucleotide-nucleotide BLAST (blastn), which finds the DNA sequences in the NCBI DNA database that are most similar to the query DNA.
Apriori Algorithm
Apriori is an algorithm that finds the frequency of individual items and identifies the relationships among them. Given a database of itemsets, the Apriori algorithm reports the itemsets whose frequency exceeds a given threshold.
SVM
SVM is used for classification, prediction and regression problems. It classifies samples into categories and is based on Statistical Learning Theory. Each sample is plotted as a point in an n-dimensional space.
Decision Tree
A decision tree displays decisions and their possible consequences. We use this algorithm to specify the differences between viruses. Of the two uses of decision tree algorithms, classification and regression, we use classification: each case follows the appropriate branch at each node, starting from the root, until it reaches a single leaf after satisfying all the conditions along the path.
We conducted a data analysis on the protein sequences of SARS-CoV-2, SARS-CoV and MERS-CoV using three algorithms: BLAST, Apriori and Decision Tree. From the results of those experiments on protein sequences, we concluded that MERS-CoV is remarkably different from SARS-CoV-2 and SARS-CoV. Hence, we conducted further analysis using BLAST, Apriori, SVM and Decision Tree to compare SARS-CoV-2 and MERS-CoV, this time comparing the codon sequences of the viruses rather than the protein sequences. We expected to obtain more accurate and useful results from these experiments, since a codon sequence is a form of DNA sequence that is more closely related to the actual properties of a virus.
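The relationship between a codon sequence and the protein sequence derived from it can be sketched as follows. This is an illustration rather than the authors' pipeline: it assumes the standard genetic code, reads a single frame from the first base, and stops at the first stop codon. The names `split_codons` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each of the three positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(orf: str) -> list[str]:
    """Split a nucleotide sequence into complete codons (triplets)."""
    orf = orf.upper().replace("U", "T")  # accept RNA or DNA letters
    usable = len(orf) - len(orf) % 3     # drop an incomplete trailing codon
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def translate(orf: str) -> str:
    """Translate codons to one-letter amino acids until the first stop codon."""
    protein = []
    for codon in split_codons(orf):
        aa = CODON_TABLE.get(codon, "X")  # 'X' marks an unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


peptide = translate("ATGGAGAGCCTTTAA")
```

Because several codons encode the same amino acid, two different codon sequences can translate to the same peptide, which is one reason codon-level and protein-level comparisons can give different results.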
BLAST
First, we briefly ran BLAST on the three coronaviruses: SARS-CoV-2, SARS-CoV and MERS-CoV. The result shows that SARS-CoV-2 is highly similar to SARS-CoV, while MERS-CoV shows a substantial difference in amino acid sequence.
We compared the viruses in pairs. The BLAST experiment on SARS-CoV-2 and SARS-CoV shows 92% identities, 96% positives and 0% gaps, which indicates high similarity. The BLAST experiment on SARS-CoV-2 and MERS-CoV shows 51% identities, 66% positives and 3% gaps, which indicates relatively low similarity. In addition, the BLAST experiment on SARS-CoV and MERS-CoV shows 56% identities, 72% positives and 1% gaps.
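For readers unfamiliar with these figures, the sketch below shows how identity and gap percentages can be computed from an alignment that has already been produced. It does not perform the alignment itself and is not a reimplementation of BLAST. The function name `alignment_stats` and the toy sequences are hypothetical, and the "positives" figure is omitted because it depends on a substitution matrix.

```python
def alignment_stats(aligned_a: str, aligned_b: str, gap: str = "-") -> dict[str, float]:
    """Percent identities and gaps over the columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    length = len(aligned_a)
    columns = list(zip(aligned_a, aligned_b))
    # A column counts as an identity when both residues match and neither is a gap.
    identities = sum(1 for x, y in columns if x == y and x != gap)
    # A column counts as a gap when either sequence has a gap character.
    gaps = sum(1 for x, y in columns if x == gap or y == gap)
    return {
        "identity_pct": 100.0 * identities / length,
        "gap_pct": 100.0 * gaps / length,
    }


# Toy alignment: 6 columns, 4 identical, 1 containing a gap.
stats = alignment_stats("MKT-LV", "MRTALV")
```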
Apriori
We first analysed the genomes of SARS-CoV-2, SARS-CoV and MERS-CoV using the Apriori algorithm in 9, 13 and 19 windows. For each window, we set the minimum support to 0.1, so that only associations appearing in more than 10% of all instances are regarded as best rules. We define a rule as the tendency of an amino acid A to appear in position N of a window, written posN = A. For accurate analysis, we set the minimum confidence to 0.9 and performed the experiment for 18 cycles.
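The support and confidence thresholds described above can be illustrated with a short sketch. It treats each (position, amino acid) pair in a window as an item and only examines single items and pairs of items, so it is a simplification and not the full level-wise Apriori search. The function name `position_rules` and the toy windows are hypothetical, and the comparison uses "greater than or equal to" the thresholds, which is an assumption.

```python
from collections import Counter
from itertools import combinations


def position_rules(windows, min_support=0.1, min_confidence=0.9):
    """Return frequent posN = A items and pairwise rules lhs -> rhs."""
    n = len(windows)
    item_counts = Counter()
    pair_counts = Counter()
    for window in windows:
        # An item is (position, amino acid), with positions counted from 1.
        items = [(i + 1, aa) for i, aa in enumerate(window)]
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    frequent_items = {
        item: count / n
        for item, count in item_counts.items()
        if count / n >= min_support
    }

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence of lhs -> rhs: share of windows with lhs that also have rhs.
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return frequent_items, rules


# Toy 3-residue windows, for illustration only.
frequent, rules = position_rules(["ALV", "ALV", "GLV", "ALT"],
                                 min_support=0.5, min_confidence=0.9)
```

In the toy data, Leucine in position 2 appears in every window, so both "pos1 = A" and "pos3 = V" imply "pos2 = L" with confidence 1.0, while the reverse rules fall below the confidence threshold.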
Apriori in 9window. Most rules involved Leucine in position 5, with many instances in all three genomes. Additionally, in MERS-CoV, Valine appeared frequently in positions 4 and 6.
Apriori in 13window. All three genomes involved Valine in position 1 and Leucine in position 2, with many instances in each genome. Additionally, in MERS-CoV, Valine appeared frequently in position 2.
Apriori in 19window. All three genomes involved Leucine in some positions as one of the best rules, with many instances. Additionally, both SARS-CoV-2 and MERS-CoV involved Valine. In SARS-CoV-2, Valine appeared frequently in position 4; in MERS-CoV, Valine was more dominant than Leucine, appearing frequently in positions 4, 6, 9, 11 and 13. SARS-CoV had only one best rule, Leucine in position 1.
These results suggest that Leucine is a significant amino acid throughout all three genomes. The experiment also suggests that Valine is an important amino acid in SARS-CoV-2 and MERS-CoV, especially in MERS-CoV.
Decision Tree
We defined SARS-CoV-2 as class 1, SARS-CoV as class 2 and MERS-CoV as class 3. We compared the data from the start codon to the stop codon. The characteristics listed are rules with a probability of at least 0.800. This value is high enough to conclude that the species possesses a trait that distinguishes it from the default class.
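One way to read the probability attached to a rule is as the fraction of windows matching the rule that belong to a given class. The sketch below computes that fraction for a rule such as pos2 = D & pos9 = Q. This interpretation is an assumption, since the source does not define how the probability was calculated, and the function names and toy data are hypothetical.

```python
def rule_matches(window: str, rule: dict[int, str]) -> bool:
    """True if the window has the required amino acid at every rule position."""
    return all(
        len(window) >= pos and window[pos - 1] == aa  # positions count from 1
        for pos, aa in rule.items()
    )


def rule_class_probability(labelled_windows, rule, target_class) -> float:
    """Fraction of rule-matching windows whose label equals target_class."""
    matched = [label for window, label in labelled_windows
               if rule_matches(window, rule)]
    if not matched:
        return 0.0  # the rule never fires on this data
    return matched.count(target_class) / len(matched)


# Toy data: (window, class label); five windows match the rule, four are class 1.
match = "AD" + "X" * 6 + "Q"
other = "AK" + "X" * 6 + "Q"
data = [(match, 1)] * 4 + [(match, 3), (other, 2)]
probability = rule_class_probability(data, {2: "D", 9: "Q"}, target_class=1)
keep_rule = probability >= 0.800
```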
Decision tree in 9window. The results show that SARS-CoV-2 and MERS-CoV each have unique characteristics that distinguish them from default classes 1 and 2 (SARS-CoV-2 and SARS-CoV). SARS-CoV has no distinct amino acid sequence characteristics compared with the other two viruses, which suggests that it is more similar to them. There are also no unique characteristics that distinguish the three viruses from default class 3 (MERS-CoV). This means that all three viruses are similar to default class 3.
Decision tree in 13window. The results show that SARS-CoV-2 has one unique characteristic that distinguishes it from default class 2. SARS-CoV has one distinct characteristic each for default classes 1 and 2. MERS-CoV has a few unique characteristics that distinguish it from default classes 1 and 2. There are no unique characteristics that distinguish the three viruses from default class 3. This means that all three viruses are similar to default class 3.
Decision tree in 19window. The results show that SARS-CoV-2 has one unique characteristic that distinguishes it from default class 2. SARS-CoV has three distinct characteristics for default class 1. MERS-CoV has a few unique characteristics that distinguish it from default class 1 and one unique characteristic for default class 2. There are no unique characteristics that distinguish the three viruses from default class 2. This means that all three viruses are similar to default class 2.
BLAST
The BLASTN program of NCBI was used to analyze the sequence identity of SARS-CoV-2 and MERS-CoV. The result shows 59% identity, and we could see the distribution of the top 8 BLAST hits on the subject sequence.
Therefore, using the remaining three methods, we compared the two DNA sequences to identify appreciable similarities and differences. Throughout the following experiments, we compared orf1ab, the first and longest ORF, of SARS-CoV-2 and MERS-CoV, since among the ORFs at the same position it presents the most remarkable difference between the two viruses.
Apriori Algorithm
We first analysed the genomes of SARS-CoV-2 and MERS-CoV using the Apriori algorithm in 9, 13 and 19 windows. Other settings were identical to the previous experiment.
Apriori Algorithm in 9window. Most rules involved Leucine in most positions, with many instances in both genomes. Additionally, in MERS-CoV, Valine appeared frequently in positions 1, 3, 4 and 8.
Apriori Algorithm in 13window. Most rules involved Leucine in almost all positions, with many instances in both genomes. Additionally, in SARS-CoV-2, Valine appeared frequently in position 4. In MERS-CoV, Valine appeared frequently in positions 3, 6, 7 and 13.
Apriori Algorithm in 19window. Most rules involved Leucine in almost all positions, with many instances in both genomes. Additionally, in SARS-CoV-2, Valine appeared frequently in positions 12 and 16, and Threonine appeared frequently in position 17. In MERS-CoV, Valine appeared frequently in positions 2, 13, 14 and 16; Threonine appeared frequently in position 13; and Serine appeared frequently in position 19.
These results suggest that Leucine is a significant amino acid throughout both genomes. In addition, Valine and Threonine are important amino acids in certain positions of both genomes, with MERS-CoV having more Valine as well as Serine.
SVM
The result of the Apriori experiment suggests that the DNA sequences of SARS-CoV-2 and MERS-CoV are very similar, with Leucine as their main amino acid. However, slight differences such as the frequency of Valine and Threonine are not negligible, so the SVM algorithm was used to obtain more accurate results. The SVM experiment was conducted in 9window, 13window and 19window with four types of kernel functions: normal, polynomial, RBF and sigmoid. The evaluation method was 10-fold cross-validation.
For the experiment, we built a two-class data set consisting of SARS-CoV-2 and MERS-CoV. The normal SVM experiments have accuracy rates slightly over 50%, which is quite low. This implies that there is some difference between SARS-CoV-2 and MERS-CoV. The polynomial and sigmoid SVM experiments also show low accuracy rates. These results support the view that SARS-CoV-2 and MERS-CoV are difficult to differentiate using linear classification. However, the accuracy rate of RBF, a non-linear kernel, is remarkably high, implying that it offers the best chance of classifying the data set.
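For reference, the four kernel families named above are commonly defined as follows. Reading the "normal" kernel as the linear kernel is an assumption, and the parameters gamma, r and d are generic symbols whose values are not reported in the source.

```latex
\begin{aligned}
K_{\text{linear}}(x, y)  &= x^{\top} y \\
K_{\text{poly}}(x, y)    &= \left(\gamma\, x^{\top} y + r\right)^{d} \\
K_{\text{RBF}}(x, y)     &= \exp\!\left(-\gamma \lVert x - y \rVert^{2}\right) \\
K_{\text{sigmoid}}(x, y) &= \tanh\!\left(\gamma\, x^{\top} y + r\right)
\end{aligned}
```

The RBF kernel depends only on the distance between two samples, which lets it separate classes that no single hyperplane in the original feature space can divide.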
Decision Tree
We defined SARS-CoV-2 as class 1 and MERS-CoV as class 2. We compared the data from the start codon to the stop codon. Rules with a probability of at least 0.850 were selected as distinguishing traits. Table 3 shows that SARS-CoV-2 and MERS-CoV each have unique characteristics in all of the 9, 13 and 19 windows. The results show that there are many unique characteristics that distinguish the two viruses.
Species | Rules in 9window | Rules in 13window | Rules in 19window |
---|---|---|---|
SARS-CoV-2 | pos3 = L & pos5 = P; pos3 = N & pos8 = I; pos1 = G & pos3 = V | pos1 = T & pos10 = G; pos5 = L & pos11 = I; pos6 = T & pos11 = A; pos2 = R & pos6 = M; pos10 = L & pos12 = I | pos17 = N & pos19 = L; pos14 = K & pos18 = L; pos12 = T & pos17 = V; pos17 = H |
MERS-CoV | pos1 = Y & pos3 = V; pos1 = V & pos3 = P; pos3 = S & pos9 = V; pos1 = M & pos3 = V; pos2 = D & pos3 = L; pos1 = Y & pos3 = V; pos2 = L & pos3 = Q; pos1 = Q & pos3 = V; pos2 = Y & pos4 = S | pos10 = Q & pos13 = L; pos3 = A & pos10 = T; pos6 = C & pos11 = A; pos11 = W; pos5 = S & pos11 = I; pos2 = T & pos13 = I; pos5 = V & pos11 = D; pos6 = V & pos11 = A | pos4 = V & pos12 = G; pos12 = S & pos17 = V; pos17 = L & pos18 = V |
Our research is composed of three experiments on SARS-CoV-2, SARS-CoV and MERS-CoV using three algorithms (BLAST, Apriori and Decision Tree), followed by four further experiments on SARS-CoV-2 and MERS-CoV (BLAST, Apriori, SVM and Decision Tree).
Comparing SARS-CoV-2, SARS-CoV and MERS-CoV, the BLAST result shows that SARS-CoV-2 and SARS-CoV differ markedly from MERS-CoV. The Apriori experiment indicates that SARS-CoV-2 and SARS-CoV have almost the same distribution of amino acids, with Leucine as the unrivaled main amino acid, while MERS-CoV also has a high frequency of Valine. In the Decision Tree experiment, all three viruses are similar to MERS-CoV in 9window and 13window, and similar to SARS-CoV in 19window.
These experiments, which show both high similarity and remarkable differences between SARS-CoV-2 and MERS-CoV, led us to conduct further experiments on those two viruses, this time using the codon sequences of the viruses instead of the protein sequences, as the codon sequence is more closely related to the actual properties of the virus.
In the further experiments on the codon sequences of SARS-CoV-2 and MERS-CoV, the BLAST result shows 59% identity. The Apriori experiment indicates that the viruses are similar in having Leucine and Valine as their main amino acids, with Threonine also appearing frequently. However, the SVM result shows that, although the viruses are very alike, they can be classified effectively using non-linear kernels such as RBF. The Decision Tree experiment identified several notable properties of the SARS-CoV-2 amino acid sequence that are not found in the MERS-CoV amino acid sequence: the 9window, 13window and 19window results each show characteristic rules for both MERS-CoV and SARS-CoV-2.
Not applicable
SARS-CoV-2: Severe acute respiratory syndrome coronavirus 2
SARS-CoV: Severe acute respiratory syndrome coronavirus
MERS-CoV: Middle East respiratory syndrome coronavirus
BLAST: Basic Local Alignment Search Tool
SVM: Support Vector Machine
(1) Ethics approval and consent to participate: N/A
(2) Consent for publication: Yes
(3) Availability of data and materials: The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request. [Email] [email protected]
(4) Competing interests: The authors declare that they have no competing interests.
(5) Funding: N/A
(6) Authors’ contributions: Jung Eun Huh and Seunghee Han contributed equally to this work. Taeseon Yoon was the supervisor.
(7) Acknowledgements: N/A