Emerging SARS-CoV-2 Genetic Variations and Mutations in the COVID-19 Genomic Sequence: A Systematic Review and Meta-analysis

Objectives: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the causative agent of coronavirus disease 2019 (COVID-19). The high mutation rate of RNA viruses drives genetic variation and virus evolution, and it is a strategy for escaping the immune system. In the present study, all studies and evidence were extracted from the available online national databases. Two researchers randomly evaluated the sensitivity of the search. After quality assessment and application of specific inclusion and exclusion criteria, the eligible articles were entered into the meta-analysis. The heterogeneity between the results of studies was measured using the test statistic (Cochran's Q) and the I² index. The forest plots illustrated the point and pooled estimates with 95% confidence intervals (crossed lines). All statistical analyses were performed using Comprehensive Meta-Analysis V.2 software. This meta-analysis included 13 primary studies investigating SARS-CoV-2 genetic variations and mutations in the COVID-19 genomic sequence. According to the pooled prevalence (95% confidence interval) of mutations, spike gene variations showed the highest frequency among non-synonymous mutations (16.4%, CI: 13.6, 16.6), and the non-structural protein (NSP) genes had the highest frequency among total mutations (31.6%, CI: 21, 44.6). Genomic mutation analysis of SARS-CoV-2 strains may provide knowledge about different infrequent biological mutations and their relationships with viral transmission, pathogenicity, infectivity, and fatality rates in the interaction between SARS-CoV-2 and human cells.


Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of coronavirus disease 2019 (COVID-19), poses a major challenge to public health. Since the first appearance of SARS-CoV-2 in late December 2019 in Wuhan, Hubei province, central China, a high dissemination rate has been observed worldwide [1]. Based on information released by the World Health Organization (WHO) on 29 December 2020, the present COVID-19 pandemic has caused nearly 79 million confirmed cases and over 1.7 million deaths worldwide [2]. SARS-CoV-2 is classified in the order Nidovirales, the family Coronaviridae, and the genus Betacoronavirus [3]. Similar to other coronaviruses, the genome of SARS-CoV-2 consists of specific genes encoding structural and non-structural proteins [4]. The mutation rate among RNA viruses is notably high, and this phenomenon is essential for viral adaptation [5]. However, coronaviruses have been reported to possess proofreading systems, and so nucleotide sequence variation in SARS-CoV-2 has been observed at a very low level [6]. In a study, Wang et 30.53% and 29.47%, respectively [7]. In addition, based on the evidence obtained from a study of 48,635 SARS-CoV-2 sequences, 353,341 mutations have been detected throughout the world. Among them, the D614G mutation in the C-terminal region of the spike protein (aspartate to glycine substitution at position 614) is one such evolutionary alteration detected in SARS-CoV-2, and it has become the most common type reported in many regions of the world, such as Europe, Oceania, South America and Africa [8], [9]. The present study aimed to assess the prevalence of SARS-CoV-2 genetic variations and mutations in COVID-19 sequences.

Genetic diversity and mutations of the COVID-19
There are several reports of unusual public health events due to variants of SARS-CoV-2, with changes in transmissibility, clinical features, and severity. Table 1 lists the significant mutations reported worldwide.

Search strategy
In the present study, the search was carried out in the available online national databases, including ISI, ScienceDirect, Scopus, PubMed, Wiley and Google Scholar, between December 2019 and January 2020. The search was based on the keywords SARS-CoV-2, Variation, Mutation and COVID-19 sequences, which were combined with the operators AND/OR/NOT to identify and screen articles. In addition, the reference lists of the published studies were examined to improve the sensitivity of the search. The search was randomly evaluated by two researchers, who confirmed that all suitable studies had been detected.

Study selection
At first, all research articles, evidence and reports were extracted from the electronic databases. After examination of the studies, duplicate articles were identified and removed. Then, irrelevant articles were excluded by reviewing the title, abstract, and full text. The remaining articles were screened for eligibility, and review articles and articles published in other languages were excluded from this study.

Quality assessment
The PRISMA checklist was used to evaluate the quality of the related studies and to select studies based on title and content. The PRISMA checklist consists of 27 items covering different aspects of research methodology, such as protocol and registration, eligibility criteria, search, study selection, definition of variables, method of data collection, risk of bias in individual studies, presentation of results and statistical tests. Each item was assigned one point.

Inclusion/exclusion criteria
All articles that passed the above assessment phases and met the following criteria were considered eligible for the final meta-analysis: 1) Studies in English. 2) Studies reporting the prevalence of SARS-CoV-2 genetic variation among total mutations. 3) Studies reporting the prevalence of SARS-CoV-2 genetic variation among non-synonymous mutations. The following studies were ruled out: 1) Duplicated studies. 2) Non-relevant articles. 3) Articles with non-full-length sequences. 4) Abstracts, letters or review studies. 5) Studies published in languages other than English. 6) Articles with no access to the full text.

Data extraction
After selection of the appropriate articles, the following data were extracted for each study: first author's name, geographical region, publication year, language, the number of total mutations, the number of non-synonymous mutations, and mutations in the S protein, N protein, M protein, E protein, ORF 1a/1b, ORF 3a, ORF 7a, ORF 7b, ORF8a, ORF 10a, ORF6, ORF 1a and NSP. The data were entered into a Microsoft Excel spreadsheet.

Statistical analysis
The primary outcome was the SARS-CoV-2 genetic variation and mutation in COVID-19 sequences. The heterogeneity between the results of studies was measured using the test statistic (Cochran's Q) and the I² index. A P-value of less than 0.1 was taken to indicate significant heterogeneity.
The forest plots illustrated the point and pooled estimates with 95% confidence intervals (crossed lines). Each box in a forest plot indicated the study's weight. Random-effects and fixed-effects models were used for heterogeneous and homogeneous results, respectively, and I² values above 50% were considered to indicate a high degree of heterogeneity. All statistical analyses were performed using Comprehensive Meta-Analysis V.2 software.
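
The quantities named above can be illustrated with a minimal sketch. It assumes that proportions are pooled on the logit scale with inverse-variance weights and that the between-study variance is estimated with the DerSimonian-Laird method; the exact procedure used by Comprehensive Meta-Analysis may differ, and the function name `pool_proportions` and the example counts are hypothetical, not data from the included studies.

```python
import math

def pool_proportions(events, totals):
    """Pool study proportions on the logit scale (illustrative sketch).

    Assumes 0 < events[i] < totals[i] for every study.
    Returns Cochran's Q, its degrees of freedom, I^2 (%), tau^2,
    and the random-effects pooled proportion with a 95% CI.
    """
    # Logit-transformed proportion and its approximate variance per study
    y = [math.log(e / (n - e)) for e, n in zip(events, totals)]
    v = [1.0 / e + 1.0 / (n - e) for e, n in zip(events, totals)]

    # Fixed-effect (inverse-variance) weights and pooled logit
    w = [1.0 / vi for vi in v]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

    # Cochran's Q and the I^2 index (truncated at zero)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # DerSimonian-Laird estimate of the between-study variance
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)

    # Random-effects weights, pooled logit and its standard error
    w_re = [1.0 / (vi + tau2) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    def inv_logit(x):
        return 1.0 / (1.0 + math.exp(-x))

    return {
        "Q": q,
        "df": df,
        "I2": i2,
        "tau2": tau2,
        "pooled": inv_logit(pooled),
        "ci": (inv_logit(pooled - 1.96 * se), inv_logit(pooled + 1.96 * se)),
    }

# Hypothetical counts: mutations in one gene / all mutations, per study
result = pool_proportions(events=[5, 50, 20], totals=[100, 100, 150])
print(result)
```

With such output, an I² above 50% would fall under the high-heterogeneity criterion described above, in which case the random-effects estimate is the one reported.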

Results
In the present study, 1370 articles were identified at the start of the process. The number of studies was reduced to 1209 following the removal of duplicate articles. In the next step, 890 irrelevant documents were removed after reviewing the full texts. Then, 319 articles were considered for further screening. After the exclusion of 291 articles, 28 articles were assessed for eligibility. Six articles with non-full-length sequences, 8 review articles and one article in another language were excluded. Finally, 13 relevant articles were included in the meta-analysis (Fig. 1). In addition, the geographic distribution and the frequently mutated residues among COVID-19 sequences are shown in Table 2 and Fig. 2, respectively.
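
The selection counts reported above can be checked with simple arithmetic. The short sketch below only restates the numbers from this paragraph; the variable names are illustrative, and the number of duplicates is derived rather than stated in the text.

```python
# Counts as reported in the Results paragraph
identified = 1370
after_duplicates = 1209
irrelevant_removed = 890
excluded_on_screening = 291
excluded_at_eligibility = {
    "non_full_length_sequence": 6,
    "review_article": 8,
    "other_language": 1,
}

# Derived values at each stage of the selection flow
duplicates_removed = identified - after_duplicates  # derived, not stated in the text
screened = after_duplicates - irrelevant_removed
assessed = screened - excluded_on_screening
included = assessed - sum(excluded_at_eligibility.values())

print(duplicates_removed, screened, assessed, included)
```

The derived values for screened, assessed and included articles agree with the 319, 28 and 13 reported in the text.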

-Analysis of S mutation
Our analysis revealed that the D614G spike mutation has the highest frequency. This mutation improved the fitness of the spike protein for cell surface receptors and increased the virus's transduction compared to the wild type [20]. Other S mutations, P1263L, V483A, and L54F, have a low frequency. The forest plot shows that the overall frequency of S mutations is 16.4% (95% CI: 13.6, 16.6) based on the random-effects model. The heterogeneity statistics (I²: 85.98%, Q=49.947, P<0.001) show that there is heterogeneity among the primary results of the studies (Fig 3, A).
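
As a reading aid for the heterogeneity statistics, the sketch below assumes the usual definition I² = 100 × (Q − df) / Q, where df is the number of contributing studies minus one. Under that assumption, the degrees of freedom implied by a reported pair of Q and I² can be recovered; the helper name `implied_df` is hypothetical.

```python
def implied_df(q, i2_percent):
    """Degrees of freedom implied by Q and I^2, assuming I^2 = 100 * (Q - df) / Q."""
    return q * (1.0 - i2_percent / 100.0)

# Values reported above for the S mutation analysis
df_s = implied_df(49.947, 85.98)
print(round(df_s, 2))
```

For the S mutation values this gives a figure very close to 7, which would correspond to eight contributing studies if the assumed definition is the one applied by the software.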

-Analysis of N mutation
Other frequent mutations are R203K and G204R, located in the N region. The N gene encodes the nucleocapsid protein, which contributes to the formation of helical ribonucleoproteins in the virus [21]. These mutations modify the binding mechanism of miRNAs and change the pathogenesis and development of COVID-19 infection in subjects [22]. Other mutations in the N region include S197L, P13L, L37F, P323L, and P1103L, which are less frequent, in that order. The total prevalence of N mutations is estimated as 11.7% (95% CI: 7, 19.1) based on the random-effects model. The heterogeneity statistics (I²: 98.23%, Q=396.15, P<0.001) show that there is heterogeneity among the initial results of the studies (Fig 3, B).

-Analysis of M mutation
The M protein plays a part in viral envelope packaging by interacting with the S protein [23]. Our analysis revealed two low-frequency mutations, T175M and D3G, in the M gene. The prevalence of M mutations is calculated as 1.9% (95% CI: 0.9, 4.1) based on the random-effects model. The heterogeneity statistics (I²: 84.70%, Q=45.76, P<0.001) show that there is heterogeneity among the results of these studies (Fig 3, C).

-Analysis of ORF1a/1b mutation
ORF1ab is a large gene that encodes a polyprotein (16 proteins) involved in virus genome synthesis and replication [24]. The P4715L, L3606F, C8517T, A876T and F3071Y mutations are the more frequent mutations in ORF1ab. The overall prevalence of ORF 1a/1b mutations is 12.8% (95% CI: 5.7, 26.4) based on the random-effects model, and the heterogeneity statistics (I²: 97.09%, Q=240.66, P<0.001) show that there is heterogeneity among the results of the studies (Fig 3, D).

-Analysis of ORF3a mutation
Q57H, G251V, S193I, and G196V are the more frequent mutations in ORF3a. ORF3a proteins are located in host cells and found in the endoplasmic reticulum or Golgi intermediate space, acting as ion channels and controlling the virus's release [25]. Moreover, ORF3a triggers pro-inflammatory pathways and contributes to severe forms of infection [26]. It is noteworthy that the ORF3a gene shows a high level of non-synonymous and neutral mutations with a potential effect on B cells, such as epitope generation, which is a significant point [27]. The forest plot shows that the prevalence of ORF 3a mutations among non-synonymous mutations is 5.7% (95% CI: 4.3, 7.6). The results of the analysis demonstrated heterogeneity among the reported studies (P<0.001; I²=78.67%, Q=32.81) (Fig 3, E).

-Analysis of ORF7a mutation
The forest plot shows that the average rate of ORF 7a mutations is 2.1% (95% CI: 1.3, 3.3). The heterogeneity statistics (I²: 60.03%, Q=17.51, P≤0.014) show that there is heterogeneity among these studies (Fig 3, F).

-Analysis of ORF7b mutation
The forest plot shows the prevalence of ORF 7b mutations among non-synonymous mutations with confidence intervals (95% CI). The average frequency of ORF 7b mutations is estimated to be 0.4% (0.1, 1.4). We observed heterogeneity (I²: 72%, Q=25, P<0.001) among these studies (Fig 3, G).

-Analysis of ORF10a mutation
Given the heterogeneity between the results of the studies (I²: 50.84%, Q=14.24, P<0.047), the random-effects model was used, and the overall prevalence of ORF 10a mutations is 0.5% (95% CI: 0.2, 1) (Fig 3, I).

-Analysis of E mutation
The heterogeneity indices show heterogeneity between the primary results for E mutations. Therefore, the random-effects model is applied for combining the results (I²: 56.68%, Q=16.16, P<0.024). The pooled event rate for ORF6 mutations is estimated as 0.4% (0.2, 1.1) (Fig 3, K).

Analysis of mutations among total mutations
In the current meta-analysis, 5 primary studies were reviewed, and S, N, M, ORF 3a, ORF 7a, ORF 7b and NSP mutations were examined among total mutations.

-Analysis of ORF3a mutation
The forest plot indicates that the overall frequency of ORF3a mutations is 3.9% (95% CI: 2.5, 6) based on the random-effects model. The heterogeneity statistics (I²: 60.33%, Q=10.08, P<0.039) show that there is heterogeneity among the primary results of the studies (Fig 3, M).

-Analysis of M mutation
The forest plot shows that the prevalence of NSP mutations among total mutations is 31.6% (95% CI: 21, 44.6). The results of the analysis show heterogeneity among the reported studies (P=0.00; I²=90.47%, Q=42) (Fig 3, N).

-Analysis of ORF7b mutation
In this study, considerable heterogeneity was observed between the results of studies regarding ORF 7b mutations (I²: 58.07%, Q=9.54, P<0.049). Therefore, the random-effects model was applied, which estimated the pooled event rate for this mutation as 0.4% (95% CI: 0.1, 1.2) (Fig 2, R).

Discussion
Research on the variation in the SARS-CoV-2 genome sequence is necessary for examining the course and progression of COVID-19 and for monitoring, controlling and treating SARS-CoV-2 infection [28]. In the present study, the genome sequences of SARS-CoV-2 isolates were examined. The aim of this study was to assess the impact of epitope deletion among non-synonymous mutations, which is related to immune escape and pathogenesis. Our study showed that, according to the pooled prevalence (95% confidence interval) of mutations, S variation showed the highest frequency among non-synonymous mutations, 16.4% (13.6, 16.6), and NSP was the most commonly mutated region among total mutations, 31.6% (21, 44.6).
The high mutation rate of RNA viruses causes genetic variation and virus evolution, and it is a strategy for escaping the immune system and developing drug resistance. Complete SARS-CoV-2 genomes from different geographical locations are essential for detecting the genetic variations in the virus that cause viral shedding [29]. Several genome variations in SARS-CoV-2, such as those in the nucleocapsid (N) protein, ORF4a and the surface protein S, are associated with the host immune system [30]. Research indicates that genetic variants of SARS-CoV-2 could be transmitted during the early stage of the epidemic; however, the genomes are remarkably stable and do not evolve rapidly [31].
It has been demonstrated that the fatality rate of COVID-19 can vary between populations and that the level of virulence varies among humans [32]. A larger number of specific mutations with rapid transmission has been detected in Italy, Spain and the US, and this is related to critical conditions [33]. Although genome sequences of SARS-CoV-2 are similar, with only a few mutations, regions such as North America and Europe were heavily affected, whereas Australia, Asia and Africa were less affected by sequence variation [34,35]. Research shows that the variation of RNA viruses is pivotal during an outbreak and that it depends on nucleotide substitutions. Depending on viral transmission, mutation rates vary between viruses and help the virus adapt to its host [36].
Finding non-synonymous mutations through databases is useful for identifying mutations and their modes of transmission [37]. Several new variants (deletion 69-70, deletion 144, N501Y, A570D, D614G, P681H, T716I, S982A, D1118H) have been defined in the spike protein of SARS-CoV-2. The novel mutation N501Y, which is found in the UK virus variant, is located in the receptor-binding domain (RBD). The severity and infectivity of the UK variant remain unknown [13]. In SARS-CoV-2, D614G is a common spike protein mutation around the world. It has also been proposed that the high frequency of the spike (S) D614G mutation may be associated with higher viral loads, cellular infectivity, infection severity and lethal outcome in COVID-19 [38]. The relation between high viral loads in the upper respiratory tract and the G clade was measured by RT-PCR. It has been suggested that the G variant of the SARS-CoV-2 spike is more sensitive to neutralizing antibodies than the D variant [8]. It has been reported that the D to G mutation at position 614 (D614G) in the spike glycoprotein, which originated in Europe or China, is a significant variation that changes the secondary structure of the protein [39] [8]. The D614G mutation has appeared in all affected regions, such as Bangladesh (with 95.6% D614G mutation), Italy, Spain, North America and European countries [40] [41]. In addition, amino-acid substitutions at position 1109 (F→L) and position 76 (S_T76I) of the spike protein were found in Bangladeshi and Indonesian strains, respectively [37]. It has also been suggested that mutations in RNA-dependent RNA polymerase (RdRp) and D614G increase SARS-CoV-2 transmission and promote the infectivity of SARS-CoV-2 [42]. A study of 12,300 SARS-CoV-2 genome sequences from different countries reported that the D614G and P4715L variations were associated with higher COVID-19 mortality [43].
It is evident that ORF1ab P4715L (nsp 12) plays a pivotal role in viral replication, and it has been reported that the ORF1ab-V378I mutation is associated with COVID-19 infection in Taiwan, Australia and Germany [4] [44]. Also, three mutations, M5865V and S5932F in ORF1ab and R203K in N, have been described [45]. Mutations in the nucleocapsid (N) protein (R203K and G204R) have been observed in Italy, Spain, India and France, and the N_S202N mutant was detected in Saudi Arabia [46] [47].