We used NCBI's BLASTN program to measure the sequence identity between SARS-CoV-2 and MERS-CoV. The alignment shows 59% identity, along with the distribution of the top 8 BLAST hits on the subject sequence.
Therefore, using the remaining three methods, we compared the two DNA sequences to identify appreciable similarities and differences. Throughout the following experiments, we compared orf1ab, the first and longest ORF, of SARS-CoV-2 and MERS-CoV, since among the ORFs that occupy the same position in both genomes it shows the most remarkable difference between the two viruses.
Apriori Algorithm
We first analysed the genomes of SARS-CoV-2 and MERS-CoV using the Apriori algorithm with window sizes of 9, 13, and 19. All other settings were identical to those of the previous experiment.
Apriori Algorithm, window size 9. In both genomes, most rules involved Leucine, at most positions and with high instance counts. Additionally, in MERS-CoV, Valine appeared frequently in positions 1, 3, 4, and 8.
Apriori Algorithm, window size 13. In both genomes, most rules involved Leucine, at almost all positions and with high instance counts. Additionally, in SARS-CoV-2, Valine appeared frequently in position 4. In MERS-CoV, Valine appeared frequently in positions 3, 6, 7, and 13.
Apriori Algorithm, window size 19. In both genomes, most rules involved Leucine, at almost all positions and with high instance counts. Additionally, in SARS-CoV-2, Valine appeared frequently in positions 12 and 16, and Threonine in position 17. In MERS-CoV, Valine appeared frequently in positions 2, 13, 14, and 16, Threonine in position 13, and Serine in position 19.
These results suggest that Leucine is a prominent amino acid throughout both genomes. In addition, Valine and Threonine are important amino acids at certain positions in both genomes, with MERS-CoV showing more Valine as well as Serine.
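The windowed analysis in this section can be illustrated with a minimal sketch. It is not the tool or configuration used in the experiments: the sliding step of 1, the encoding of each window as (position, residue) items, the minimum support of 0.4, and the toy sequence are all assumptions made for illustration. The sketch stops at frequent itemsets and does not derive association rules from them.

```python
from collections import Counter
from itertools import combinations


def windows(seq, size):
    """Slide a fixed-size window over seq (a step of 1 is an assumption)."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def to_transaction(window):
    """Encode a window as a set of (position, residue) items, 1-based."""
    return frozenset((pos, aa) for pos, aa in enumerate(window, start=1))


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset meeting min_support."""
    min_count = min_support * len(transactions)

    # Level 1: count single items and keep the frequent ones.
    counts = Counter(item for t in transactions for item in t)
    current = {frozenset([i]): c for i, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Count how many transactions contain each surviving candidate.
        counts = Counter()
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy protein fragment, not real orf1ab data.
toy = "LLVLLVLLV"
transactions = [to_transaction(w) for w in windows(toy, 3)]
freq = apriori(transactions, min_support=0.4)
for itemset, count in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), count)
```

Tagging each residue with its position inside the window is what allows a statement such as "Valine appeared frequently in position 4" to be read directly from the frequent itemsets.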
SVM
The results of the Apriori experiment suggest that the DNA sequences of SARS-CoV-2 and MERS-CoV are very similar, with Leucine as the main amino acid in both. However, slight differences, such as the frequencies of Valine and Threonine, are not negligible, so we used the SVM algorithm to obtain more accurate results. The SVM experiments were conducted with window sizes of 9, 13, and 19 and four types of kernel function: normal, polynomial, RBF, and sigmoid. Each experiment was evaluated with 10-fold cross-validation.
For these experiments, we built a two-class data set, <SARS-CoV-2, MERS-CoV>. The normal SVM experiments reached accuracy rates only slightly over 50%, which is quite low for a two-class problem. This implies that there is some difference between SARS-CoV-2 and MERS-CoV, although this kernel captures it only weakly. The polynomial and sigmoid SVM experiments also showed low accuracy rates. These results suggest that SARS-CoV-2 and MERS-CoV are difficult to differentiate using linear classification. However, the accuracy rate of RBF, a non-linear kernel, is remarkably high, suggesting that it offers the best chance of classifying the data set.
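To make the experimental setup concrete, the sketch below writes out the four kernel functions in their standard textbook forms, together with a simple 10-fold split. It is an illustration under stated assumptions, not the software used in the experiments: we read "normal" as the linear kernel, the one-hot encoding of windows is assumed, the `gamma`, `coef0`, and `degree` defaults are arbitrary, and the fold assignment is interleaved without shuffling.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(window):
    """Flatten a window into a 0/1 vector (20 slots per position; assumed encoding)."""
    vec = []
    for aa in window:
        slot = [0.0] * len(AMINO_ACIDS)
        slot[AMINO_ACIDS.index(aa)] = 1.0
        vec.extend(slot)
    return vec


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    # Assumed to correspond to the "normal" SVM in the text.
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    # Depends only on the squared distance between the two vectors.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# Hypothetical 3-residue windows, shorter than the 9, 13, and 19 used in the text.
a, b = one_hot("LVL"), one_hot("LLL")
print(linear_kernel(a, b), polynomial_kernel(a, b),
      rbf_kernel(a, b, gamma=0.5), sigmoid_kernel(a, b))
```

With one-hot windows, the dot product counts the positions at which two windows carry the same residue, so the linear kernel is a position-wise match count, while the RBF kernel decays with the number of mismatched positions.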
Decision Tree