COVID-Predictor: RNA Sequence based Prediction of Coronavirus

Virus classification has been a subject of concern in virology and epidemiology for decades. Moreover, the detection of highly divergent or as yet unknown viruses remains a major challenge despite its clinical importance. In this situation, the outbreak of the novel coronavirus (SARS-CoV-2) and its behaviour under different epidemic conditions around the world suggest that the virus is mutating to create divergent variants, which makes the task of virus prediction more challenging. In addition to the novel coronavirus, two other coronaviruses, MERS and SARS-CoV-1, are already present. Therefore, machine learning techniques are needed to predict coronaviruses by considering their divergent genetic functional characteristics. We propose a machine learning based coronavirus prediction technique, called COVID-Predictor, in which 1000 RNA sequences of SARS-CoV-1, MERS, SARS-CoV-2 and other viruses are used to train a Naïve Bayes classifier so that it can predict any unknown sequence of these viruses. To develop COVID-Predictor, the feature vector is constructed from sequence motifs generated by k-mer and n-gram techniques. The model has been validated using 10-fold cross validation in comparison with other classification techniques. The results show the superiority of our predictor, which achieves an average accuracy of 97% on an unseen validation set. The same pre-trained model has been used to build a web based application where RNA sequences of unknown viruses can be uploaded to predict the class of coronavirus. The predictor, code and datasets are available here: http://www.nitttrkol.ac.in/indrajit/projects/COVID-Predictor/


Introduction
On 31st December 2019, the China Country Office of the World Health Organization (WHO)1, 2 reported that a few pneumonia cases of unknown etiology had been detected in Wuhan City, Hubei Province, China. Subsequently, on 7th January 2020, the Chinese authorities identified a novel virus as the cause of this disease, which WHO and the International Committee on Taxonomy of Viruses named Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) or novel coronavirus (COVID-19)2, 3 on 11th February 2020. Subsequent research found that the novel coronavirus belongs to the coronavirus family, which also includes Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS). The medical research community suspects that COVID-19 is more transmissible but comparatively less fatal than SARS. According to the available evidence4, 5, the human-to-human transmission rate of COVID-19 seems higher than that of SARS, and the virus might be of bat or pangolin origin3, 6, 7. It is also suggested that the virus is transmitted mostly via droplets: when an infected person coughs or sneezes and the emitted droplets come into contact with another individual over a short distance, that individual might become infected.
As of 15th April 2020, 2077453 positive cases had been registered across the world, while 134355 patients had died and 509741 had recovered, according to "worldometers.info" 1. The virus is spreading rapidly and is a threat to the human population. Generally, coronaviruses can infect multiple organs in different hosts, such as animals and humans. Like the other two viruses of the same family, SARS and MERS, it mainly attacks the respiratory system in humans. The genetic features of SARS-CoV-2, such as potential etiological agents, have recently been identified through metagenomic analysis using next-generation sequencing (NGS)2. Moreover, another study8 shows that the spike protein receptor-binding domain (RBD) of SARS-CoV-2 binds to the host receptor angiotensin-converting enzyme 2 (ACE2), which generally helps to regulate cross-species and human transmission of COVID-19. At present, the virus appears to adapt to its environment by mutating and creating divergent variants. Thus, the early prediction of pneumonia caused by COVID-19 is a challenging task9.
To address this urgent requirement, we have developed a machine learning based technique, called COVID-Predictor, in which RNA sequences of three different coronaviruses and of other viruses, such as Ebola and Dengue, are taken from the National Center for Biotechnology Information (NCBI) 2 and the Global Initiative on Sharing All Influenza Data (GISAID) 3 to train a Multinomial Naive Bayes (MNB)10 classifier so that it can predict any unknown sequence of these viruses. For this purpose, the k-mer algorithm11, 12 is used to create motifs from the RNA sequences. Thereafter, the n-gram concept is used to create a Bag-of-Words (BoW), from which a count vector is built. The count vectors of the sequences are used to train the MNB. Subsequently, testing is done in the same fashion with 10-fold cross validation and with unseen sequences of coronaviruses from the databases. The model has also been compared with other classification techniques, namely the kernel based Gaussian Support Vector Machine (GSVM)13 and Random Forest (RF)14. The MNB based model shows superior performance in comparison with the models based on the other classifiers, achieving an average accuracy of 97% on validation data. The same pre-trained model is used to develop a web application so that the scientific and diagnostic communities working on coronavirus prediction can benefit from it.

Results And Discussion
In this section, we discuss the data preparation, the parameters and metrics used for COVID-Predictor, and the outcome of the predictor.

Parameters setting and Metrics
The experiments have been performed using Python 3.6 and executed on a machine with an Intel Core i5-2410M CPU at 2.30 GHz, 8 GB RAM and the Windows 7 operating system. The required input parameters are set experimentally: the number of trees for RF is 100, the split criterion for RF is "gini", the alpha value used as the smoothing factor of MNB is 0.1, and the kernel used in GSVM is "rbf". To evaluate the results of COVID-Predictor, the popular performance metrics Accuracy, Precision, Recall and F1-Score are used. The training dataset is used with three independent machine learning techniques, viz. MNB, GSVM and RF. For each machine learning technique, the motifs of the virus sequences are created using the k-mer method. Thereafter, these motifs are combined using the n-gram technique to create the count vector that is used to train the classifiers. In our experiments, the value of k for k-mer varies from 2 to 7, while the value of n for n-gram varies from 2 to 5. Each classifier has been evaluated with 10-fold cross validation, followed by further validation on an unseen dataset taken from NCBI and GISAID on 8th April 2020. The performance metrics of each machine learning technique with 10-fold cross validation for different values of k-mer and n-gram are reported in Table 3. The four quantitative metrics are further consolidated into a single aggregated score for ease of comparison. The aggregated score is computed simply by taking the average of all the scores, following an approach similar to that used in ref. 20. The aggregated score lies in [0,1], where a higher value signifies a better result. It is evident from Table 3 that the MNB based COVID-Predictor produces a higher aggregated score than the other classifiers, i.e. 0.99953 for a k-mer value of 7.
Similar results are also observed for the MNB based COVID-Predictor for other values of k-mer. Thus, according to the results, we have prepared the pre-trained model of COVID-Predictor with 1000 genomic sequences of four virus classes, with the values of k-mer and n-gram set to 7 and 3 respectively. To gain further confidence, we have used an additional validation set of sequences, as reported in Table 4. When validated with 2043 samples, 159 cases are false positives, taking a prediction of SARS-CoV-2 as the positive class. Further investigation showed that these 159 sequences are SARS-CoV-1 sequences misclassified by COVID-Predictor as SARS-CoV-2. As our primary objective is to predict SARS-CoV-2, we further examined the rate of false negatives. For this purpose, two additional sets of SARS-CoV-2 sequences are used separately, one with 493 samples from NCBI and another with 4747 samples from GISAID. In both cases, COVID-Predictor predicted SARS-CoV-2 with 100% accuracy. This experiment establishes that COVID-Predictor, with the proposed feature building approach, has the potential to predict SARS-CoV-2 with high accuracy. The same pre-trained model is used to build the web based application where unknown sequences can be uploaded to predict the class of coronavirus. Screenshots of the web based predictor are shown in Figures 3 and 4.
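The aggregated score described above can be sketched as follows. This is a minimal illustration, not the authors' code: the text only states that the four metrics are averaged, so the use of macro-averaging over classes for Precision, Recall and F1-Score, and the function names macro_metrics and aggregated_score, are assumptions made for this example.

```python
def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall and F1.

    Macro-averaging over classes is an assumption; the paper does not
    say how the multi-class metrics are averaged.
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


def aggregated_score(metrics):
    """Average the four metrics into one score in [0, 1]."""
    return sum(metrics.values()) / len(metrics)


# Toy labels for illustration only; these are not results from the paper.
toy_true = ["a", "a", "b", "b"]
toy_pred = ["a", "b", "b", "b"]
toy_score = aggregated_score(macro_metrics(toy_true, toy_pred))
```

Because each of the four metrics lies in [0, 1], their mean does too, which matches the stated range of the aggregated score.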

Method
The primary objective of the proposed COVID-Predictor is to correctly classify the RNA sequences of coronaviruses. To this end, the complete RNA sequences are split into motifs using the popular k-mer technique. The motifs for the four classes of viruses are shown as word clouds in Figure 6(a)-(d). Thereafter, the n-gram technique is used to create a feature from n consecutive motifs. The top 10 n-grams for the different viruses are shown in Figure 7(a)-(d). These n-grams, or features, are collectively called a Bag-of-Words (BoW). The BoW is then used to create a count vector for each virus sequence. Count vectorization computes the frequencies of the n-grams in a particular sequence and creates a numeric feature vector that is used by the subsequent machine learning techniques. As visualised in Figure 1, the RNA sequences of SARS-CoV-1 and SARS-CoV-2 are similar and challenging to distinguish. Therefore, machine learning techniques can play an important role in predicting such sequences. As the RNA sequence is broken into motifs that are grouped into n-grams, the features behave like sequences of text. Therefore, the task becomes a text classification problem.
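The feature construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: it assumes overlapping k-mers with a stride of one and n-grams formed from n consecutive k-mers joined by spaces, and the function names kmers, ngram_counts and count_vector are hypothetical. The toy sequences and the small values k = 3, n = 2 are chosen for readability; the pre-trained model in the paper uses k = 7 and n = 3.

```python
from collections import Counter


def kmers(sequence, k):
    """Split a sequence into overlapping k-mers (stride of one, an assumption)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def ngram_counts(sequence, k, n):
    """Count n-grams, each built from n consecutive k-mers joined by spaces."""
    words = kmers(sequence, k)
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(grams)


def count_vector(counts, vocabulary):
    """Turn n-gram counts into a fixed-order numeric vector."""
    return [counts.get(term, 0) for term in vocabulary]


# Toy sequences for illustration only.
toy_sequences = ["ATGCGATG", "ATGCGTTG"]
per_sequence = [ngram_counts(s, k=3, n=2) for s in toy_sequences]

# The Bag-of-Words vocabulary is the set of all n-grams seen in any sequence.
vocabulary = sorted(set().union(*per_sequence))
vectors = [count_vector(c, vocabulary) for c in per_sequence]
```

Each row of vectors has one entry per vocabulary term, so sequences of different lengths map to feature vectors of the same dimension, which is what the classifiers require.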
In this regard, we have evaluated the probabilistic Multinomial Naive Bayes (MNB), the kernel based support vector machine (SVM) and a tree based technique, random forest (RF). All three machine learning techniques are evaluated independently with the features generated by count vectorization for different values of k-mer and n-gram. Based on the performance of the three machine learning techniques over 10-fold cross validation on the training data, we have finalised MNB as the underlying technique for building COVID-Predictor, as used in our web application. The pipeline of the proposed COVID-Predictor is described in Figure 5.
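To make the role of the smoothing factor concrete, the following is a small from-scratch sketch of a multinomial Naive Bayes classifier over count features. It is not the authors' implementation, which the paper does not show. The class name TinyMultinomialNB, the helper kmer_counts, the toy sequences and the labels class_x and class_y are all hypothetical; only the smoothing value alpha = 0.1 is taken from the parameter settings reported above.

```python
import math
from collections import Counter


def kmer_counts(sequence, k):
    """Count overlapping k-mers; a toy stand-in for the n-gram count features."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


class TinyMultinomialNB:
    """Multinomial Naive Bayes with additive (Lidstone) smoothing."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = sorted({f for d in docs for f in d})
        label_counts = Counter(labels)
        self.log_prior = {
            c: math.log(label_counts[c] / len(labels)) for c in self.classes
        }
        # Sum the feature counts of all training documents per class.
        totals = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            totals[label].update(doc)
        self.log_lik = {}
        for c in self.classes:
            denom = sum(totals[c].values()) + self.alpha * len(self.vocab)
            self.log_lik[c] = {
                f: math.log((totals[c][f] + self.alpha) / denom)
                for f in self.vocab
            }
        return self

    def predict(self, doc):
        def score(c):
            # Features never seen in training are ignored.
            return self.log_prior[c] + sum(
                n * self.log_lik[c][f] for f, n in doc.items() if f in self.log_lik[c]
            )
        return max(self.classes, key=score)


# Toy data for illustration only; not real virus sequences.
train_seqs = ["AAAAATAAAA", "AAATAAAAAT", "GGGGCGGGGG", "GGCGGGGGCG"]
train_labels = ["class_x", "class_x", "class_y", "class_y"]
model = TinyMultinomialNB(alpha=0.1).fit(
    [kmer_counts(s, 3) for s in train_seqs], train_labels
)
prediction = model.predict(kmer_counts("AAAATAAA", 3))
```

The smoothing term keeps the likelihood of a feature non-zero in classes where it never occurred in training, so one unseen n-gram cannot rule out a class on its own.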

Conclusion
In the current worldwide context, it has become essential to predict coronavirus as early as possible, because SARS-CoV-2 infection has become a pandemic and, at the time of writing, both the infection rate and the death rate are increasing almost exponentially every day around the world. As a contribution in this direction, we have proposed COVID-Predictor for predicting the coronaviruses SARS-CoV-1, MERS and SARS-CoV-2 based on their RNA sequences. It is also provided as a web application so that the scientific and diagnostic communities working on coronavirus prediction can benefit from it. In order to achieve better performance, we have
Figures

Figures 3 and 4. Screenshots of the web based predictor.

Table 1. Statistics of the refined datasets of corona and other viruses

Sequences of SARS-CoV-1, MERS and other viruses such as Ebola and Dengue were downloaded from NCBI, while SARS-CoV-2 sequences were downloaded from GISAID, in FASTA format on 28th March 2020. The proposed predictor does not require sophisticated data preprocessing; it only requires the complete genome sequences of the viruses. As a result, 515, 291 and 2369 sequences of SARS-CoV-1, MERS and SARS-CoV-2 respectively, each of length more than 20K bp, and 600 sequences of other viruses such as Ebola and Dengue, each of length more than 10K bp, are considered in our experiment. The statistics of the refined consolidated datasets are shown in Table 1, while the country-wise statistics of SARS-CoV-2 are reported in Table 2. In order to visualise the virus sequences, t-distributed Stochastic Neighbor Embedding (tSNE)15 is applied to the count vectors generated by the k-mer and n-gram techniques. k-mers are now an essential part of many methods in bioinformatics, such as genome and transcriptome assembly, metagenomic sequencing and error correction of sequence reads16. Solis-Reyes et al.11 reported that k-mer based methods work better than other popular methods such as REGA17, SCUEAL18 and COMET19. The embedded representations of all four virus classes and of the top 21 country-specific sequences of SARS-CoV-2 are shown in Figures 1 and 2.
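The length-based refinement of the downloaded sequences can be sketched as follows. This is an illustrative outline, not the authors' preprocessing script: the helper names parse_fasta and filter_by_length and the record identifiers are hypothetical, and the example uses a tiny threshold so that it can run on toy data. In the setting described above, the thresholds would be 20000 bp for the coronavirus sequences and 10000 bp for the other viruses.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def filter_by_length(records, min_length):
    """Keep only the sequences strictly longer than min_length bases."""
    return {h: s for h, s in records.items() if len(s) > min_length}


# Toy FASTA text for illustration only; identifiers are placeholders.
toy_fasta = """>seq_long
ATGCGATGCA
TTGACC
>seq_short
ATGC
"""

toy_records = parse_fasta(toy_fasta)
kept = filter_by_length(toy_records, min_length=10)
```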

Table 2. Statistics of country-wise refined sequences of SARS-CoV-2

Table 3. Classification performance of different machine learning techniques after performing 10-fold cross validation with different values of k-mer and n-gram on 1000 genome sequences of SARS-CoV-1, MERS, SARS-CoV-2 and Other virus samples

Outcome of the Predictor
The dataset, consisting of all four types of virus sequences, i.e. SARS-CoV-1, MERS, SARS-CoV-2 and Other viruses, has been divided into two sets, one for training and the other for validation. A stratified sampling method has been applied to prepare the training dataset to ensure that representatives of all four virus classes are present. As a result, 1000 virus sequences are used in training. Moreover, the data samples are carefully selected from each category to avoid the class imbalance problem. The validation dataset contains only sequences that are not present in the training dataset.
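One way to build such a training set is sketched below. The per-class allocation is not given in the text, so drawing an equal 250 sequences from each of the four classes is purely an assumption for illustration, as are the function name balanced_split and the placeholder identifiers; the resulting hold-out sizes therefore need not match the validation set used in the paper. The class sizes are those of the refined datasets reported in the data preparation.

```python
import random


def balanced_split(ids_by_class, per_class, seed=0):
    """Draw per_class ids from every class for training; hold out the rest."""
    rng = random.Random(seed)
    train, validation = {}, {}
    for label, ids in ids_by_class.items():
        shuffled = list(ids)
        rng.shuffle(shuffled)
        train[label] = shuffled[:per_class]
        validation[label] = shuffled[per_class:]
    return train, validation


# Class sizes from the refined datasets; the ids themselves are placeholders.
class_sizes = {"SARS-CoV-1": 515, "MERS": 291, "SARS-CoV-2": 2369, "Other": 600}
ids_by_class = {
    label: [f"{label}_{i}" for i in range(size)]
    for label, size in class_sizes.items()
}

# 250 per class is an assumed allocation that yields 1000 training sequences.
train, validation = balanced_split(ids_by_class, per_class=250)
train_total = sum(len(ids) for ids in train.values())
```

Sampling the same number from each class keeps the large SARS-CoV-2 class from dominating training, and holding out the remaining ids ensures that no validation sequence is seen during training.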

Table 4. Classification performance of COVID-Predictor on validation data