The first report of the most important sequential differences between COVID-19 and MERS viruses by attribute weighting models, the importance of Nucleocapsid (N) protein

COVID-19 and the Middle East respiratory syndrome-related coronavirus (MERS) viruses both belong to the Coronaviridae family; the former became a pandemic while the latter remained confined to a limited region. Their pathogenicity and infection rates also differ: MERS has a high mortality rate but a low spreading capability. To investigate possible structural changes in the RNA sequences of the two viruses, 1621 COVID-19 and 125 MERS sequences were downloaded and converted to polynomial datasets, and seven attribute weighting (feature selection) approaches were used to analyse the genomic sequences. The end nucleotide sequences (from position 29288 to the end of the genome) were selected by most attribute weighting models as significantly different between the two virus classes, followed by smaller segments at nucleotide positions 5700, 1750 and 7600. These parts encode the Nucleocapsid (N), Papain-like protease (NSP3) and NSP4 proteins of COVID-19. This finding reports, for the first time, the structural differences between two important viruses at the sequence level and paves the way to deciphering the high pathogenicity of the newly emerged COVID-19 virus.


Introduction
Coronavirus disease 2019 (COVID-19) was first identified and reported in patients with severe respiratory disease in Wuhan, China. The virus was a novel member of the coronavirus family and was scientifically named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1][2][3]. Since its discovery, more than four million people have been infected, including nearly 400,000 who have died. The most worrisome features of COVID-19 are its apparent ability to spread readily, to cause severe disease in high-risk patients and older adults, and to mutate and recombine, with changes in the genomic sequence observed since it was first reported [4][5][6].
Coronaviruses belong to the subfamily Coronavirinae within the family Coronaviridae and the order Nidovirales [7]. They are zoonotic pathogens that can be transmitted to humans through direct contact with animals; many scientific reports have claimed that COVID-19 originated in bats and was transmitted to humans via intermediate host animals in the seafood market [8][9][10]. The coronavirus genome is a single-stranded positive-sense RNA (+ssRNA) molecule, 27-32 kb in size, which contains at least six open reading frames (ORFs) [11][12][13]. The first ORFs (ORF1a/b) encode the polyproteins 1a,b (pp1a, pp1b), while the other ORFs, located at the 3′ end, encode at least four structural proteins: the envelope glycoprotein spike (S), responsible for recognizing host cell receptors; the Membrane (M) proteins, responsible for shaping the virions; the Envelope (E) proteins, responsible for virion assembly and release; and the Nucleocapsid (N) proteins, which are involved in packaging the RNA genome into the virions and play roles in pathogenicity as an interferon (IFN) inhibitor [14,15]. In addition to the four main structural proteins, there are structural and accessory proteins that are species-specific, such as the HE protein, 3a/b protein, and 4a/b protein. Upon entry of the viral genome into the cytoplasm of the target cell, the positive-sense RNA genome is translated into two polyproteins 1a,b (pp1a, pp1b), which are then processed into 16 Non-Structural Proteins (NSPs) to form a replication-transcription complex (RTC) that is involved in genome transcription and replication [16,17].
COVID-19, or SARS-CoV-2, is the third novel coronavirus to cause a large-scale epidemic (or, as currently named by the WHO, a pandemic) in this century, after the Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV) in 2003 [18] and the Middle East Respiratory Syndrome Coronavirus (MERS) in 2012 [19]. Coronaviruses are a large group of viruses with large peplomers that make them look like a crown. They are common among many animals and can cause respiratory illnesses in humans and gastrointestinal illnesses in animals. Before the SARS-CoV epidemic in 2003, this virus family was not considered deadly in humans and caused only mild symptoms in immunocompetent people, with a chance of lower respiratory illnesses such as pneumonia and bronchitis. After 2003, however, the first and second pandemics of the 21st century, SARS-CoV and the Middle East Respiratory Syndrome Coronavirus (MERS-CoV), were reported from China and Saudi Arabia, respectively [20][21][22]. MERS-CoV, with a high fatality rate of 34.4%, has been confirmed in approximately 2500 cases, including 861 deaths [23].
In the present study, to search for the causes of the different pathogenicity and fatality rates of these two members of the Coronaviridae family, we converted the genomic sequences of COVID-19 and MERS into a polynomial dataset (each nucleotide position was set as a variable, so about 30,000 variables were generated for each virus RNA sequence) and applied seven attribute weighting (or feature selection) models to the dataset. As some attributes are more important than others, each attribute weighting model assigns a weight to each variable and normalizes the figure to a value between 0 and 1.0; higher weights reflect the importance of that variable with respect to the virus type (a variable with two classes, COVID-19 and MERS).

Materials And Methods
The flowchart of this study is presented in Figure 1. One thousand two hundred and sixty-one (1261) COVID-19 sequences and 136 MERS sequences in FASTA format were downloaded from the NCBI nucleotide site (https://www.ncbi.nlm.nih.gov/nuccore). The average sequence lengths for COVID-19 and MERS were 29500 and 29650, respectively. The sequences were converted to a polynomial dataset (each single nucleotide position was treated as an attribute, so the dataset contained ~30,000 attributes or variables), with the virus type (two groups) as the label variable. The dataset was imported into RapidMiner software (RapidMiner GmbH, Westfalendamm 87, 44141 Dortmund, Germany) and seven attribute weighting models were applied to it as follows:
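The conversion of sequences into a position-wise dataset can be illustrated with a minimal Python sketch. It is not the RapidMiner workflow used in this study: the function name `sequences_to_table`, the column names, the toy sequences and the padding of shorter sequences with `-` are all assumptions made for illustration.

```python
# Minimal sketch: turn labelled sequences into a table with one nominal
# attribute per nucleotide position plus a label column.
# All names and the toy data are hypothetical; real input would be FASTA records.

def sequences_to_table(records, pad="-"):
    """records: list of (label, sequence). Returns (column names, rows)."""
    width = max(len(seq) for _, seq in records)
    columns = [f"pos_{i + 1}" for i in range(width)] + ["virus"]
    rows = []
    for label, seq in records:
        # Assumption: shorter sequences are right-padded to a common length.
        padded = seq.upper().ljust(width, pad)
        rows.append(list(padded) + [label])
    return columns, rows


toy_records = [
    ("COVID-19", "ATTAAAGG"),
    ("COVID-19", "ATTAAAGC"),
    ("MERS", "GATTTAA"),
]
columns, rows = sequences_to_table(toy_records)
```

Each row then holds one sequence, each `pos_*` column holds the nucleotide observed at that position, and the `virus` column is the label that the weighting models try to explain.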

ATTRIBUTE WEIGHTING
To identify the most important features or attributes (nucleotide positions) that differ between the COVID-19 and MERS viruses, the following attribute weighting models were applied to the dataset:

Weight by Information gain
The Weight by Information Gain operator calculates the weight of attributes with respect to the class attribute by using the information gain. The higher the weight of an attribute, the more relevant it is considered. Although information gain is usually a good measure for deciding the relevance of an attribute, it is not perfect. A notable problem occurs when information gain is applied to attributes that can take on a large number of distinct values. For example, consider data that describe the customers of a business. When information gain is used to decide which of the attributes are the most relevant, the customer's credit card number may have high information gain. This attribute has a high information gain because it uniquely identifies each customer, but we may not want to assign high weights to such attributes.
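As an illustration of the measure (a sketch of the textbook definition, not the RapidMiner implementation), information gain is the entropy of the label minus the entropy that remains after splitting on the attribute. The function names and toy columns below are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(values)
    return sum(-(n / total) * log2(n / total) for n in Counter(values).values())


def information_gain(attribute, labels):
    """Entropy of the labels minus the entropy remaining after the split."""
    total = len(labels)
    remaining = 0.0
    for value in set(attribute):
        subset = [lab for a, lab in zip(attribute, labels) if a == value]
        remaining += len(subset) / total * entropy(subset)
    return entropy(labels) - remaining


# Hypothetical toy columns: one label and three candidate attributes.
labels = ["COVID-19", "COVID-19", "MERS", "MERS"]
perfect = ["A", "A", "G", "G"]           # separates the two classes
uninformative = ["A", "G", "A", "G"]     # carries no class information
unique_id = ["s1", "s2", "s3", "s4"]     # one distinct value per example
```

In this toy data the `unique_id` column reaches the same information gain as the genuinely discriminating `perfect` column, which is the drawback described above.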

Weight by Information Gain ratio
The Weight by Information Gain Ratio operator calculates the weight of attributes with respect to the label attribute by using the information gain ratio. The higher the weight of an attribute, the more relevant it is considered. Information gain ratio is used because it addresses the drawback of information gain. Although information gain is usually a good measure for deciding the relevance of an attribute, it is not perfect. A notable problem occurs when information gain is applied to attributes that can take on a large number of distinct values. For example, consider data that describe the customers of a business. When information gain is used to decide which of the attributes are the most relevant, the customer's credit card number may have high information gain. This attribute has a high information gain because it uniquely identifies each customer, but we may not want to assign high weights to such attributes. The Weight by Information Gain operator uses information gain for generating attribute weights.
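The correction can be sketched as follows, assuming the usual textbook definition in which the information gain is divided by the entropy of the attribute itself (the split information). This is an illustrative sketch with hypothetical names and toy data, not the RapidMiner code.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(values)
    return sum(-(n / total) * log2(n / total) for n in Counter(values).values())


def information_gain(attribute, labels):
    total = len(labels)
    remaining = 0.0
    for value in set(attribute):
        subset = [lab for a, lab in zip(attribute, labels) if a == value]
        remaining += len(subset) / total * entropy(subset)
    return entropy(labels) - remaining


def gain_ratio(attribute, labels):
    """Information gain divided by the split information of the attribute."""
    split_info = entropy(attribute)
    if split_info == 0:
        # A constant attribute cannot split the data; give it zero weight.
        return 0.0
    return information_gain(attribute, labels) / split_info


# Hypothetical toy columns.
labels = ["COVID-19", "COVID-19", "MERS", "MERS"]
perfect = ["A", "A", "G", "G"]
unique_id = ["s1", "s2", "s3", "s4"]
constant = ["A", "A", "A", "A"]
```

Here the many-valued `unique_id` column is penalized by its large split information, while the two-valued `perfect` column keeps the maximum ratio.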

Weight by Rule
The Weight by Rule operator calculates the weight of attributes with respect to the label attribute by constructing a single rule for each attribute and calculating the errors. The higher the weight of an attribute, the more relevant it is considered.
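A single-rule weighting of this kind can be sketched in the spirit of the OneR approach: for each value of an attribute, predict the majority class, then score the attribute by how often that rule is right. How RapidMiner maps the errors to a weight is not stated here, so using the accuracy (1 minus the error rate) is an assumption, as are the function name and the toy data.

```python
from collections import Counter, defaultdict


def one_rule_accuracy(attribute, labels):
    """Accuracy of the rule 'predict the majority class for each attribute value'."""
    by_value = defaultdict(Counter)
    for value, label in zip(attribute, labels):
        by_value[value][label] += 1
    # For each attribute value, the majority class count is the number of
    # examples the single rule classifies correctly.
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(labels)


# Hypothetical toy columns.
labels = ["COVID-19", "COVID-19", "COVID-19", "MERS", "MERS"]
pos_a = ["A", "A", "A", "G", "G"]   # rule makes no errors
pos_b = ["A", "A", "G", "G", "G"]   # rule misclassifies one example
```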

Weight by Chi-squared statistic
The Weight by Chi-Squared Statistic operator calculates the weight of attributes with respect to the class attribute by using the chi-squared statistic. The higher the weight of an attribute, the more relevant it is considered. Please note that the chi-squared statistic can only be calculated for nominal labels. The chi-square statistic is a nonparametric statistical technique used to determine whether a distribution of observed frequencies differs from the theoretically expected frequencies. Chi-square statistics use nominal data; thus, instead of using means and variances, this test uses frequencies. The value of the chi-square statistic is given by

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where $\chi^2$ is the chi-square statistic, $O$ is the observed frequency and $E$ is the expected frequency. Generally, the chi-squared statistic summarizes the discrepancies between the expected number of times each outcome occurs (assuming that the model is true) and the observed number of times each outcome occurs, by summing the squares of the discrepancies, normalized by the expected numbers, over all the categories.
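For a single nucleotide position, the statistic can be computed from the contingency table of nucleotide versus virus type, with the expected frequencies derived from the row and column totals under independence. The following is a minimal sketch with hypothetical names and toy data; it returns the raw statistic and does not reproduce any normalization applied by RapidMiner.

```python
from collections import Counter


def chi_squared(attribute, labels):
    """Chi-squared statistic of the attribute-by-label contingency table."""
    n = len(labels)
    observed = Counter(zip(attribute, labels))
    attribute_totals = Counter(attribute)
    label_totals = Counter(labels)
    statistic = 0.0
    for value, value_count in attribute_totals.items():
        for label, label_count in label_totals.items():
            # Expected frequency if attribute and label were independent.
            expected = value_count * label_count / n
            statistic += (observed[(value, label)] - expected) ** 2 / expected
    return statistic


# Hypothetical toy columns.
labels = ["COVID-19", "COVID-19", "MERS", "MERS"]
perfect = ["A", "A", "G", "G"]
uninformative = ["A", "G", "A", "G"]
```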

Weight by Gini index
The Weight by Gini Index operator calculates the weight of attributes with respect to the label attribute by computing the Gini index of the class distribution that would result if the given example set were split according to the attribute. The higher the weight of an attribute, the more relevant it is considered. Please note that this operator can only be applied to ExampleSets with a nominal label.
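One common way to turn this idea into a weight is the reduction in Gini impurity achieved by the split. The sketch below follows that convention as an assumption; the exact formula used by the RapidMiner operator is not given in the text, and the function names and toy data are hypothetical.

```python
from collections import Counter


def gini(values):
    """Gini impurity of a list of class labels."""
    total = len(values)
    return 1.0 - sum((n / total) ** 2 for n in Counter(values).values())


def gini_reduction(attribute, labels):
    """Impurity of the labels minus the weighted impurity after the split."""
    total = len(labels)
    after_split = 0.0
    for value in set(attribute):
        subset = [lab for a, lab in zip(attribute, labels) if a == value]
        after_split += len(subset) / total * gini(subset)
    return gini(labels) - after_split


# Hypothetical toy columns.
labels = ["COVID-19", "COVID-19", "MERS", "MERS"]
perfect = ["A", "A", "G", "G"]
uninformative = ["A", "G", "A", "G"]
```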

Weight by Uncertainty
The Weight by Uncertainty operator calculates the weight of attributes with respect to the label attribute by measuring the symmetrical uncertainty with respect to the class. The higher the weight of an attribute, the more relevant it is considered. The relevance is calculated by the following formula: relevance = 2 * (P(Class) - P(Class | Attribute)) / (P(Class) + P(Attribute)). In the standard definition of symmetrical uncertainty, these terms are entropies rather than probabilities.
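A minimal sketch of symmetrical uncertainty, assuming the standard entropy-based definition (twice the mutual information between attribute and class, divided by the sum of their entropies), is given below. The function names and toy data are hypothetical and the code is not taken from RapidMiner.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(values)
    return sum(-(n / total) * log2(n / total) for n in Counter(values).values())


def symmetrical_uncertainty(attribute, labels):
    h_attribute = entropy(attribute)
    h_class = entropy(labels)
    if h_attribute + h_class == 0:
        return 0.0
    # Mutual information: H(Class) - H(Class | Attribute),
    # computed here as H(Attribute) + H(Class) - H(Attribute, Class).
    h_joint = entropy(list(zip(attribute, labels)))
    mutual_information = h_attribute + h_class - h_joint
    return 2.0 * mutual_information / (h_class + h_attribute)


# Hypothetical toy columns.
labels = ["COVID-19", "COVID-19", "MERS", "MERS"]
perfect = ["A", "A", "G", "G"]
uninformative = ["A", "G", "A", "G"]
unique_id = ["s1", "s2", "s3", "s4"]
```

Because the denominator includes the entropy of the attribute, a many-valued column such as `unique_id` scores lower than a two-valued column that separates the classes equally well.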

Weight by Relief
Relief is considered one of the most successful algorithms for assessing the quality of features due to its simplicity and effectiveness. The key idea of Relief is to estimate the quality of features according to how well their values distinguish between instances of the same and of different classes that are near each other.

One hundred twenty-six variables were weighted higher than 0.5 by 6 of the 7 attribute weighting models (more than 85%); five nucleotide positions (29579, 29598, 29621, 29652 and 29662) received weights higher than 0.75, and just one position (29617) had a weight equal to 1 by 60% of attribute weighting models. These findings indicate that, around nucleotide position 29600, clear structural differences can be traced between COVID-19 and MERS. In COVID-19, this region of the sequence encodes the Nucleocapsid (N) proteins, which package the RNA genome in the virions and play roles in pathogenicity as an interferon (IFN) inhibitor. Although N is a structural protein, it also functions, in an unknown way, in viral replication and localizes to the viral replication-transcription complexes (RTCs). In fact, the nucleocapsid protein packages the viral genomic RNA to form the helical nucleocapsid that is incorporated into the budding particle, but it also fulfils additional roles during viral infection. It has been shown to function as an RNA chaperone (33), to facilitate viral RNA synthesis (2,5,16) and to contribute to the perturbation of several host cellular processes (reviewed in reference 27).
The results of the attribute weighting models also suggested three other regions in the genomic sequences that are significantly different between the COVID-19 and MERS viruses (weights higher than 0.5 by at least 5 of the 7 attribute weighting models, i.e. more than 71% of the models). Those regions code for the Papain-like protease (NSP3) and NSP4 proteins.
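The selection criterion used here (a normalized weight above 0.5 in at least 5 of the 7 models) can be sketched as a simple vote count across models. The sketch assumes min-max scaling for the normalization to the 0 to 1.0 range, which is not specified in the text; the function names, model names, position names and weights are all hypothetical and are not results of this study.

```python
def normalize(weights):
    """Min-max scale a {position: raw weight} mapping to the range 0..1."""
    low, high = min(weights.values()), max(weights.values())
    if high == low:
        return {position: 0.0 for position in weights}
    return {position: (w - low) / (high - low) for position, w in weights.items()}


def consensus(model_weights, threshold=0.5, min_models=5):
    """Positions whose weight exceeds the threshold in at least min_models models."""
    positions = set().union(*(w.keys() for w in model_weights.values()))
    selected = {}
    for position in positions:
        votes = sum(
            1 for w in model_weights.values() if w.get(position, 0.0) > threshold
        )
        if votes >= min_models:
            selected[position] = votes
    return selected


# Hypothetical normalized weights from seven models for three positions.
toy_weights = {
    f"model_{i}": {
        "pos_A": 0.9,                       # above 0.5 in all seven models
        "pos_B": 0.6 if i <= 5 else 0.2,    # above 0.5 in five models
        "pos_C": 0.7 if i <= 2 else 0.1,    # above 0.5 in two models
    }
    for i in range(1, 8)
}
selected = consensus(toy_weights)
```

With these toy weights, `pos_A` and `pos_B` pass the criterion and `pos_C` does not.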
The results of this study report, for the first time, the structural differences between two important viruses of the Coronaviridae family at the genomic level and pave the way to deciphering the high pathogenicity of the newly emerged COVID-19 virus.