Investigation of Oligonucleotide Usage Variance Between SARS-related Coronaviruses and Common Cold Coronaviruses

Background: The widespread outbreak of SARS-CoV-2 has become a serious threat to human health. This newly emerged virus, together with the severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS) viruses, belongs to the Coronaviridae family and causes severe acute respiratory syndrome in humans. However, prior to the emergence of these virulent viruses, coronaviruses were known as leading causes of the mild common cold. More knowledge about the genome organization of different strains can show us how these viruses evolve and become virulent. Here, we report the differences in oligonucleotide distribution in the genomes of two groups of coronaviruses, SARS-related viruses versus common cold coronaviruses, using attribute weighting algorithms. Results: In this study, we found a few oligonucleotides that significantly distinguish the two viral groups. Among the dinucleotide features, the discrepancy in TC and CC between SARS-related viruses and common cold coronaviruses was considerable. Furthermore, the CC dinucleotide recurred in several longer patterns, including the CCA, CCAC, ACCAC, and CACCAC motifs, which received the highest values and also discriminated the two viral groups. Conclusions: These remarkable oligonucleotides might point towards the existence of particular RNA elements that might be involved in viral infectivity.


Background
The Coronaviridae family is one of the largest groups of RNA viruses and causes a broad range of diseases in animals and humans (1). Until 2002, human coronaviruses were thought to cause only a mild, self-limiting respiratory disease. The emergence of severe acute respiratory syndrome (SARS) in 2003, Middle East respiratory syndrome (MERS) in 2012, and the recent widespread outbreak of SARS-CoV-2 in 2019 drew more attention to human coronaviruses, as these pathogens cause pneumonia, severe acute respiratory syndrome, and even death in humans (2). In contrast, some coronavirus strains, such as OC43, HKU1, NL63, and 229E, are more likely to be associated with the seasonal common cold and rarely lead to severe disease in humans (3)(4)(5)(6). Although extensive research is under way during the current global SARS-CoV-2 pandemic to expand knowledge about the virulence factors involved in pathogenicity (7)(8)(9), the mechanism by which coronavirus strains cause mild to severe illness in humans is still unclear. Among infectious pathogens, viruses, particularly RNA viruses, have continually evolved their genomes to adapt to the host and avoid host antiviral mechanisms, and finally to establish severe disease in the host (10). In fact, the viral genome contains particular regions in both coding and non-coding areas that interact with viral and host factors and dictate the progression of disease (10,11).
Viruses are able to change their genomes under selective pressure to escape host defenses (12,13). For example, a human antiviral protein called zinc-finger antiviral protein (ZAP) restricts viral replication by targeting specific sequences enriched in CG dinucleotides in the viral genome (14)(15)(16); however, some RNA viruses such as HIV have reduced the abundance of CG in their genomes during evolution and thereby hamper ZAP binding (17). Although the genome organization of coronaviruses has been determined, further investigation of the genome structure of different coronavirus species can guide a better understanding of virus evolution and help discover patterns in the viral genome that might have a significant role in the virus life cycle and pathogenicity. The coronavirus genome is a positive-sense single-stranded RNA of approximately 27-30 kb, which is the largest genome among RNA viruses. The genome organization is similar in almost all strains: gene 1 (ORF1ab) occupies about two-thirds of the genome (approximately 20 kb) and encodes the replicase enzyme and a number of non-structural proteins, while ORF2-9 make up only about 10 kb of the genome and encode the structural proteins S, E, M, and N, in that order. The 5′ and 3′ ends of the genome contain untranslated regions (UTRs), which include a leader sequence at the 5′ end, several stem loops at both the 5′ and 3′ ends, and a poly(A) tail at the 3′ end, and which play significant roles in viral replication and translation (18,19). In general, a viral genome is built from a distribution of nucleotides that can form particular patterns, including dinucleotides, trinucleotides, and longer oligonucleotides, which may have a vital role in viral replication and pathogenesis (20). As mentioned above, viruses are able to change their genomes during evolution to maximize their ability to counter host defenses.
In this study, we performed a comprehensive analysis of the genomes of two groups of coronaviruses, SARS-related viruses and common cold coronaviruses, in order to expand our knowledge of the genome structure of human coronaviruses and the oligonucleotide distribution in their genomes. To achieve this goal, the relative frequencies of oligonucleotides from dinucleotides to hexanucleotides were first calculated, and then various attribute weighting algorithms were used to determine the differences in oligonucleotide distribution between the genomes of the two groups of human coronaviruses. The findings of the current study can help identify particular oligonucleotide patterns in coronavirus genomes that might have a vital role in evolutionary adaptation and viral infectivity.

Datasets generation
In total, 5 datasets were generated, one per oligonucleotide length. Each contained 532 samples (293 viral sequences related to SARS and 239 viral sequences related to common cold), with 16 dinucleotides, 64 trinucleotides, 256 tetranucleotides, 1024 pentanucleotides, and 4096 hexanucleotides, respectively, as the oligonucleotide attributes. In addition, one dataset containing all 5440 attributes was created to be analyzed by the weighting algorithms (Sup 1).

Selection of the most important features
Although data cleaning was performed to remove uninformative attributes, all attributes proved informative and were retained in each dataset. The importance of each attribute in the viral genome was evaluated by attribute weighting algorithms in the two groups of SARS and common cold coronaviruses. Only a small number of oligonucleotide attributes differed significantly between the two viral groups (Table 1 and Sup 2), but a notable oligonucleotide pattern that discriminated the two groups was also observed. Briefly, the CC dinucleotide received a significant value among the dinucleotide attributes. Among the trinucleotide features, CC also recurred with a high value in the CCA, GCC, and ACC features. These features were then sought among the tetranucleotides, where three features, CCAC, GCCG, and ACCC, were identified with significant values. Interestingly, the CCAC and ACCC patterns also recurred with significant weights among a few penta- and hexanucleotides (Fig 1). Furthermore, to identify the most important oligonucleotide pattern, all attributes from di- to hexanucleotides were also analyzed together in one dataset by the different weighting methods. Remarkably, the CACCAC oligonucleotide, together with a few other features, was highlighted with the highest score (a value of seven, that is, selected by seven of the ten weighting algorithms), as shown in Table 1. The results of feature selection are provided in Sup 2.
The positions of the CACCAC motif were illustrated on the reference genomes in Fig 2. We found ten conserved CACCAC motifs at different positions in the SARS-CoV-2 and SARS genomes. However, only seven conserved CACCAC motifs were identified in the MERS genome.
Although most repetitions were located in ORF1, the CACCAC motif also occurred once in the S ORF of SARS and MERS and three times in the SARS S ORF. Moreover, this motif was also observed in the 3′ UTR of SARS-CoV-2 and SARS. The motif was also identified in the genomes of common cold coronaviruses, but the number of repetitions was quite variable in each species. In addition, the motifs were not well conserved among some strains, especially HKU1 strains.
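The motif positions reported above could be reproduced with a simple scan of a genome sequence. The following is a minimal sketch, not the procedure used in this study: the function name `find_motif` and the toy sequence are hypothetical, and overlapping occurrences are counted separately, which is an assumption.

```python
def find_motif(genome: str, motif: str = "CACCAC") -> list[int]:
    """Return 1-based start positions of every occurrence of `motif`.

    Overlapping occurrences are reported separately (an assumption).
    RNA input is accepted by mapping U to T before searching.
    """
    genome = genome.upper().replace("U", "T")
    motif = motif.upper().replace("U", "T")
    positions = []
    start = genome.find(motif)
    while start != -1:
        positions.append(start + 1)               # convert to 1-based coordinate
        start = genome.find(motif, start + 1)     # step by one to allow overlaps
    return positions


# Toy sequence for illustration only, not taken from any reference genome.
toy_genome = "ACACCACCACGT"
motif_positions = find_motif(toy_genome)
print(motif_positions)
```

Mapping the returned coordinates to ORF1, the S ORF, or the 3′ UTR would additionally require the annotation of each reference genome, which is not shown here.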

Discussion
The genomes of RNA viruses contain different structures, such as cis-acting elements, repeated sequences, and RNA motifs, which contribute to the viral life cycle (11,26). These elements are able to interact with viral and cellular factors and regulate viral translation, replication, and encapsidation (27,28). For instance, a particular RNA structure called the internal ribosome entry site (IRES), located at the 5′ end of the genome in many pathogenic viruses such as hepatitis A virus (HAV), hepatitis C virus (HCV), and poliovirus, allows them to interact with host ribosomal proteins and recruit the eukaryotic translation machinery for synthesis of their own proteins (29). Other features can be involved in viral strategies for inducing and regulating the host immune system. A conspicuous example is the pathogen-associated molecular pattern (PAMP), a small piece of RNA in the viral genome. PAMPs are small conserved sequences of the viral genome, or viral replication products, that are recognized by pattern-recognition receptors (PRRs) such as Toll-like receptors (TLRs) or RIG-I-like receptors (RLRs), after which the host innate immune system is activated against the pathogen (30,31). In contrast, other motifs or RNA elements in the genomes of some viruses help them evade host immune mechanisms. For example, an RNA structure in the 3C protease ORF of the poliovirus genome inhibits RNase L, an antiviral endonuclease that is activated during viral infection as part of the innate immune system (32,33). Given the importance of these elements in viral replication and infectivity, the current study comprehensively analyzed the viral sequences of highly virulent coronaviruses in comparison with common cold coronaviruses to predict a few potentially significant RNA motifs.
With the development of computational programs, the presence of RNA structures in viral genomes can be predicted by bioinformatics methods. Recently, feature selection techniques such as attribute weighting algorithms have been used to predict the most important attributes at the nucleotide and amino acid level among large numbers of protein or genome sequences (23)(24)(25).
In this study, the relative frequency of oligonucleotides (dinucleotides to hexanucleotides) in the genomes of different coronavirus strains was calculated as explained in the Methods section, and the most important patterns were then identified by different attribute weighting algorithms. According to the results, a sequential pattern running from the CC dinucleotide to the CACCAC hexanucleotide, selected by almost 90 percent of all attribute weighting algorithms, was identified as the most important set of features distinguishing SARS and common cold coronaviruses (Fig 1). A few previous experiments showed that the presence of CCA boxes in a viral genome, particularly the genomes of positive-sense single-stranded RNA viruses, significantly increases the level of transcriptional initiation at multiple sites. In fact, the viral replicase seems to be able to initiate transcription from CCA boxes without a unique promoter (34). In the current study, the CCA motif was a remarkable feature among the trinucleotides, and it recurred in the CCAC, ACCAC, and CACCAC motifs. Furthermore, among all attributes (di- to hexanucleotides), CACCAC was selected by 70 percent of all weighting models (Table 1). It is possible that the presence of multiple conserved copies of this motif in the genomes of SARS-related viruses, especially SARS and SARS-CoV-2, which have the highest frequency of the motif, exerts a strong influence on viral RNA synthesis. Notably, this motif was also present as a conserved motif in the 3′ UTR of SARS-CoV-2 and SARS, but it was not observed in this region in the genomes of the other coronaviruses. Given the importance of 3′ UTR sequences in viral replication and infectivity, the role of this motif should be evaluated. In this study, some other oligonucleotide features with sizable scores were also found to differ between the two viral groups, as shown in Table 1 and Sup 2.
To understand the biological importance of these features in the life cycle of different coronavirus strains, they should be targeted and examined by laboratory techniques in cell culture systems and animal models.
Among the dinucleotide features, the TC and CC dinucleotides, which were selected by 80 and 90 percent of all attribute weighting algorithms, respectively, also drew our attention. According to numerous studies, dinucleotide composition constitutes a genomic signature among a variety of virus species, which might have a significant impact on the viral life cycle and host adaptation (20,35). For example, the reduced frequency of UA and UU dinucleotides in the HCV genome leads to interferon (IFN) resistance among some HCV genotypes (36). Other research suggests that the frequency of CG and UA enables RNA viruses to escape the host immune system (17). In this study, the relative frequencies of TC and CC dinucleotides in SARS-related viruses were significantly different from those of common cold coronaviruses. It can be supposed that TC and CC dinucleotides play an important role in coronavirus pathogenicity. Interestingly, a human enzyme named apolipoprotein B mRNA-editing enzyme catalytic polypeptide-like 3 (APOBEC3) has an effective role in innate antiviral immunity, especially against retroviruses and DNA viruses (37,38). The preferred target sites of two main isoforms of this enzyme, APOBEC3A and APOBEC3G, were reported as TC and CC, respectively (39). Both of these dinucleotides were distinguishing features between the two viral groups in the current study. Although in most studies the antiviral activity of this enzyme has been identified against retroviruses and DNA viruses, a recent study on NL63 coronavirus showed that replication of RNA viruses can also be restricted by APOBEC3 activity (40). It can be hypothesized that the difference in TC and CC dinucleotides between the genomes of the two groups of coronaviruses is more likely the result of an evolutionary process and thus may have a substantial role in viral pathogenicity.

Conclusion
To conclude, this data mining analysis highlighted a few oligonucleotide features that differ between the genomes of the two groups, common cold and SARS coronaviruses. These features might contribute to a better understanding of coronavirus pathogenicity and its interaction with innate immunity in the future.

Viral Genome Sequences
First, the NCBI nucleotide database was searched for each virus species, including SARS-CoV-2, SARS, MERS, HKU1, OC43, NL63, and 229E, to obtain full-length genome sequences of each strain. In total, about one hundred full genome sequences of each virus species were retrieved as initial data. However, for NL63, 229E, and HKU1, fewer than 100 full genomes had been deposited. To confirm that the retrieved sequences belonged to the same species, multiple sequence alignments were computed using the Clustal Omega algorithm in the EBI web service. Finally, after checking the aligned sequences and excluding some genomes related to animal species, the final initial dataset for each human virus strain was created. More detailed information on the viral sequences is summarized in Table 2.

Oligonucleotide frequency analysis and attribute extraction
For the preliminary analysis, a Hyper Talk program was written in the lab view software, which accepted FASTA-formatted text files. The program scanned the sequences sequentially and built up the overall nucleotide composition, along with the frequency of each oligonucleotide in turn. In this study, the frequency of each oligonucleotide from dinucleotides to hexanucleotides was computed for each sequence as the observed count in the lab view software (21). On completion of the scan, the expected number of each oligonucleotide was also calculated using the Markov method (22). To avoid the effect of sequence length and to estimate the statistical significance of oligonucleotide occurrences, the observed-to-expected ratios were obtained (21), and finally each oligonucleotide odds ratio was taken as an attribute.
In total, 5440 attributes (16 dinucleotides, 64 trinucleotides, 256 tetranucleotides, 1024 pentanucleotides, and 4096 hexanucleotides) were extracted for each virus sequence by the lab view software. The list of attributes and calculated values is presented in Sup 3.
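The observed-to-expected calculation described above can be illustrated with a short sketch. The text does not state which Markov order was used, so this example assumes the maximal-order Markov estimate, in which the expected count of a word w1..wk is N(w1..wk-1) × N(w2..wk) / N(w2..wk-1), and for dinucleotides the denominator is taken as the sequence length. The function names are hypothetical and this is not the original program.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq: str, k: int) -> Counter:
    """Count all overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def odds_ratios(seq: str, k: int) -> dict[str, float]:
    """Observed/expected ratio for every word of length k over A, C, G, T.

    Expected counts use a maximal-order Markov estimate (an assumption):
    E(w) = N(prefix) * N(suffix) / N(core), where core is w without its
    first and last letters. For k == 2 the core is empty and the sequence
    length is used as the denominator.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    seq = seq.upper().replace("U", "T")
    counts = {n: kmer_counts(seq, n) for n in (k - 2, k - 1, k) if n >= 1}
    ratios = {}
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        prefix, suffix, core = word[:-1], word[1:], word[1:-1]
        denom = counts[k - 2][core] if core else len(seq)
        if denom == 0:
            expected = 0.0
        else:
            expected = counts[k - 1][prefix] * counts[k - 1][suffix] / denom
        # Words with zero expected count get a ratio of 0.0 by convention here.
        ratios[word] = counts[k][word] / expected if expected > 0 else 0.0
    return ratios


# One attribute vector per sequence: ratios for k = 2..6 concatenated.
def attribute_vector(seq: str) -> dict[str, float]:
    vector = {}
    for k in range(2, 7):
        vector.update(odds_ratios(seq, k))
    return vector
```

Because the ratio divides observed counts by counts expected from shorter words in the same sequence, it is largely independent of genome length, which is the property the text relies on when comparing genomes of different sizes.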
Next, a new dataset was generated for each oligonucleotide feature in two viral groups: the viral sequences related to severe acute respiratory syndrome (SARS), including SARS-CoV-2, SARS-CoV, and MERS viruses, and the viral sequences related to common cold, including OC43, HKU1, NL63, and 229E. The attributes of SARS-related viruses were then compared with those of common cold coronaviruses (Sup 1). For this purpose, each dataset was imported into RapidMiner software [RapidMiner, Germany] and the following steps were carried out sequentially. The processes of dataset creation and data mining are outlined in Fig 3.

Data Filtering
To obtain a final cleaned database (FCdb), duplicated attributes, useless attributes, correlated attributes with a Pearson correlation coefficient greater than 0.9, and numerical attributes with a standard deviation less than or equal to a given threshold (0.1) were excluded from the datasets (23).
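The filtering step above can be sketched in a few lines. This is an illustration under stated assumptions rather than the RapidMiner process itself: the function names are hypothetical, the sample standard deviation is used, the signed (not absolute) Pearson coefficient is compared with 0.9 as the text states, and of two correlated attributes the one encountered first is kept.

```python
import math
from statistics import stdev


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def filter_attributes(table: dict[str, list[float]],
                      max_corr: float = 0.9,
                      min_std: float = 0.1) -> list[str]:
    """Return the names of attributes that survive both filters.

    table maps attribute name -> values across all samples.
    Exact duplicates have correlation 1.0 and are removed by the same rule.
    """
    kept: list[str] = []
    for name, values in table.items():
        if stdev(values) <= min_std:
            continue  # near-constant attribute
        if any(pearson(values, table[other]) > max_corr for other in kept):
            continue  # redundant with an attribute already kept
        kept.append(name)
    return kept


# Toy values for illustration only, not data from this study.
toy_table = {
    "AA": [1.0, 2.0, 3.0, 4.0],
    "AC": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with AA
    "AG": [1.0, 1.05, 1.0, 1.05],  # standard deviation below 0.1
    "AT": [4.0, 1.0, 3.0, 2.0],
}
kept_attributes = filter_attributes(toy_table)
```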

Attribute weighting
Ten different attribute weighting algorithms, namely Information Gain, Information Gain Ratio, Rule, Deviation, Chi Squared, Gini Index, Uncertainty, Relief, Support Vector Machine (SVM), and PCA (24,25), were applied to all datasets to find the most important nucleotide attributes that likely discriminate coronaviruses that cause SARS from those known as common cold coronaviruses. During attribute weighting, each attribute received a value between 0 and 1 indicating its importance. The attribute that received a weight higher than 0.7 from the largest number of weighting algorithms was then designated the most important attribute. All attributes and the relevant weighting models are presented in Sup 1.
The sequential pattern with the highest value, which discriminates the genome of SARS-related coronaviruses from common cold coronaviruses
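The selection rule described in the attribute weighting step, counting how many algorithms assign an attribute a weight above 0.7, can be sketched as follows. The function names, the algorithm labels, and the toy weights are hypothetical and serve only to show the counting; they are not results from this study.

```python
def support_counts(weights: dict[str, dict[str, float]],
                   threshold: float = 0.7) -> dict[str, int]:
    """Count, for each attribute, the algorithms that weight it above threshold.

    weights maps algorithm name -> {attribute name -> normalized weight in [0, 1]}.
    """
    counts: dict[str, int] = {}
    for per_attribute in weights.values():
        for attribute, weight in per_attribute.items():
            counts[attribute] = counts.get(attribute, 0) + int(weight > threshold)
    return counts


def rank_attributes(weights: dict[str, dict[str, float]],
                    threshold: float = 0.7) -> list[tuple[str, int]]:
    """Sort attributes by descending support, breaking ties alphabetically."""
    counts = support_counts(weights, threshold)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented weights for three of the ten algorithms, for illustration only.
toy_weights = {
    "information_gain": {"CC": 0.90, "TC": 0.80, "AA": 0.20},
    "gini_index":       {"CC": 0.75, "TC": 0.60, "AA": 0.10},
    "relief":           {"CC": 1.00, "TC": 0.71, "AA": 0.70},
}
ranking = rank_attributes(toy_weights)
```

With ten algorithms, a support count of seven corresponds to the "70 percent of all weighting models" figure used in the text.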