Modest Evolutionary Changes of the SARS-CoV-2 Genome in Bangladesh

Background: The human-to-human transmissive nature of SARS-CoV-2 makes Bangladesh, as well as the other South Asian regions, vulnerable to the ongoing pandemic of COVID-19 due to their high population densities. The present study was designed based on the genome wide analysis of Bangladeshi and other South Asian isolates. Complete sequences of SARS-CoV-2 were retrieved from the EpiCoV database in order to identify molecular features demonstrating the evolutionary trail and mutation rate. Result: The complete genome mutation rate of the Bangladeshi isolates was estimated to be 0.49E-3 nucleotide substitutions/site/year. A higher mutation rate was found in the non-structural protein-coding genes at: ORF6 (10.29E-3), ORF7a (31.81E-3), and ORF8 (18.35E-3). In contrast, the mutation rates of the structural protein-coding genes were relatively low at: M (1.14E-3), S (1.47E-3), E (3.35E-3), and N (4.59E-3). Conclusions: A comparison of Bangladeshi and other South Asian isolates demonstrated that there were limited mutational changes in the SARS CoV-2 genome. Knowledge of the Southeast Asian SARS CoV-2 evolutionary genome will help in selecting future vaccine candidates and designing therapeutic drug targets.


Background
A novel -coronavirus nCoV-19, later known as Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), caused acute pneumonia in Wuhan City, China on December 2019 (1).SARS-CoV-2 has rapidly evolved into a global pandemic.Just in one year of its emergence, the virus has infected 56.3 million people and is responsible for more than 1.3 million deaths worldwide (2).
The human-to-human transmissive nature of SARS-CoV-2 makes South Asia, the most densely populated region in the world, extremely vulnerable to the current pandemic.The rst laboratory-con rmed case in Southeast Asia was reported in Nepal, which was documented on January 23, 2020 (3).This initial report was followed by case reports from Sri Lanka, India, Pakistan, Bhutan, Maldives and Bangladesh.In terms of the number of coronavirus cases and deaths, India was the rst South Asian country to overtake China, followed by Pakistan and Bangladesh.India is the world's second-worst affected country with total cases of more than 6 million, trailing only the United States, which has more than 7 million cases (2).In Pakistan, cases have fallen from a peak in mid-June of almost 6,000 new infections, to about a few hundred a day in September.Bangladesh, which had a total of 438,795 cases as of 18 November, saw its daily cases peak in mid-June before they dropped in the last week of July.Nepal has had a relatively low overall number of cases but is now seeing their reported infection rate rising again after an initial peak in June.Sri Lanka has much lower levels of infection t (2).SARS-CoV-2, a positive-sense single-stranded RNA virus, has a genome size of approximately 30 kb which encodes 27 proteins from 11 open reading frames (ORF) (4).The largest ORF (ORF1ab) constitutes about 70% of the genome and encodes polyproteins PP1ab and PP1a (5).These two polyproteins are further cleaved into 16 non-structural proteins (NSP1 to 16).The NSPs play a role in pathogenesis and are responsible for replication, transcription, proteolytic processing (6-8).The RNA-dependent polymerase is a potential antiviral drug target (9,10).The remaining ten ORFs, encode four structural proteins and at least six accessory proteins.The structural proteins are spike glycoprotein (S), matrix protein (M), envelope protein (E), and nucleocapsid protein (N), while the accessory proteins are ORF3a, ORF6, ORF7a, ORF7b, ORF8, and ORF10 (4,11).The S protein is crucial for viral attachment and entry through ACE2 receptor recognition (12)(13)(14).Antibodies against the S protein were found to be effective in inhibiting SARS-CoV entry into the host cell ( 15) thus serving as a potential drug target and the basis of a future vaccine (16).On the other hand, mutations in S protein help the virus escape the host cell immune system and lead to the emergence of variant strains.As the most abundant structural protein, M protein can also be considered as an antiviral drug target (17).Data on the mutation rate can provide information about the level of genetic diversity in the population.Therefore, understanding the evolution of SARS-CoV-2 at the genomic and individual gene level is important for the development of low cost diagnostic tools, therapeutics and vaccines.
In the present study, we analysed the full genome of SARS-CoV-2 Bangladeshi isolates in order to geographical variants and the mutation rate.The evolutionary pattern of three major genes (ORF1a, S, and M), which are considered as drug/vaccine targets, were investigated by comparing Bangladeshi isolates with South Asian isolates and the reference Wuhan-Hu-1 strain.

Geographic distribution of Bangladeshi isolates
A total of 306 SARS-CoV-2 full genomes were submitted from patients representing 20 different districts of Bangladesh (Figure1).The highest number (n=66) of was from the capital city Dhaka, while 20 samples were not speci ed at the district level.
The evolutionary rate for Bangladeshi isolates We analysed four structural genes (S, E, M and N) and four major open reading frames (ORF1a, ORF6, ORF7a and ORF8) along with the full genome (Table -1).The mutation rate of the in full Bangladesh genome was 0.49E-3 (95% HPD 0.30E-3 to 0.68E-3) nucleotide substitutions/site/year.In ORFs, the highest mutation rate was found for ORF7a (31.8E-3 nucleotide substitutions/site/year) and lowest for ORF1a (0.59E-3 nucleotide substitutions/site/year) while all segments showed negative or stabilizing selection (dN/dS, 0.47 to 0.95).In the structural proteins, the highest mutation rate was 4.59E-3 nucleotide substitutions/site/year for N and it gradually decreased for E, S and M (3.35E-3, 1.47E-3 and 1.14E-3 nucleotide substitutions/site/year).Only the envelop protein gene (E) showed positive selection (dN/dS, 1.43) while other structural proteins showed negative selection (dN/dS, 0.20 to 0.57).

Mutation in S gene of Bangladeshi isolates
A total of 107 nucleotide changes were found in the Bangladeshi S gene; among them 53 were nonsynonymous.Among 107 substitutions, 98 were present in <1% isolates, and six changes were present in <2% isolates (Supplementary Table 4).Nucleotide changes C13T, C882T and A1841G were present among 2.6%, 6.3% and 97.7% of isolates respectively.Among them C882T was synonymous, while C13T and A1841G substituted amino acid L5F and D614G.A non-synonymous change in amino acid position ve (L5F) occurred between the same group of amino acids (non-polar Leucine to Phenylalanine).While, in position 614, a hydrophilic amino acid D (Aspartate) was substituted by an amphoteric amino acid, G (Glycine).
The evolutionary rate of S gene in South-Asia The evolutionary rate of S gene varied in the different regions between 1.34E-3 to 1.84E-3 nucleotide substitutions/site/year, except for Maharashtra which was 3.50E-3 (Table -2).With regards to the degree of natural selection (dN/dS) acting on all S genes, it ranged from 0.41 to 0.76 (stabilizing selection) except for Gujrat at 1.0 (neutral selection).,The major substitution found for the Bangladeshi S gene (e.g., D614G substitution) was also analysed for Indian isolates.Only 15.7% of isolates from New Delhi contained this substitution and the highest substitution rate was identi ed for Gujarat isolates (93.8%) which are comparable with Bangladeshi isolates (97.7%).
Mean p-distance of genes from the reference strain Wuhan-Hu-1 Finally, from 'Dataset-2', three major genes (ORF1a, S, and M) were subjected to genetic distance analysis over time through the p-distance method.For Bangladeshi isolates, mean p-distance from the reference sequence (Wuhan-Hu-1) increased during rst month then ran horizontally for ORF1a while it slightly increased for S gene (Figure -2).For M gene, p-distance showed a curved line starting with high similarity with Wuhan-Hu-1 strain, then increased and nally declined again.A similar pattern to Bangladesh was observed for Maharashtra and West Bengal for S and M genes.All Indian regions showed a slightly upward trend for ORF1a.Only Delhi and Gujarat showed an upward trend for S and M genes.Notably, in the last two time-points of these two regions, the numbers of isolates were less than ten (Supplementary Table -3).

Discussion
The study presented the evolutionary characteristics of SARS-CoV-2 circulating in Bangladesh and the South Asian region (mainly four states of India by sequence analysis).Here we provided: (i) the mutation rate of the SARS-CoV-2 genome in Bangladesh including the major genes (e.g., ORF1a, ORF6, ORF7a, ORF8, S, E, M and N); (ii) the type of substitution occurring in the S gene of Bangladeshi and South Asian isolates; (iii) the mutation rate of S gene from South Asian region; (iv) and the genetic changes over time in three longer genes (ORF1a, S, and M) from South Asian region through the p-distance methodology.
During the rst global epidemic of SARS-CoV in 2003, the mutation rate was estimated to be 0.80E-3 to 2.38E-3 nucleotide substitutions/site/year (16).For MERS-CoV which caused an epidemic in the Middle East in 2012, the mutation rate was 0.46E-3 to 1.12E-3 nucleotide substitutions/site/year (18, 19).In the early stage of current pandemic, several studies estimated the mutation rate of full genome of SARS-CoV-2 to be approximately 1.05E-3 to 1.80E-3 nucleotide substitutions/site/year (12,20) which is comparable with our result (0.49E-3 nucleotide substitutions/site/year) for Bangladesh.For Bangladeshi isolates, ORF7a, ORF8 and ORF6 showed higher mutation rates compared to other genes (N, E, S, M and ORF1a).
However, all mutations were conserved during evolution (negative selection, dN/dS value <1) except the envelop protein gene (E) which adapted (dN/dS=1.43).Importantly, the mutation was conserved during the evolution of the spike protein gene (S).
Amino acid position 614 in the S gene is located at the C-terminal region of the S1 domain and does not belong to a receptor binding domain.This study showed that 98% of the Bangladeshi isolates had D614G substitution.This mutation received attention because some studies claimed that the D614G mutation might be involved in increased infectivity and mortality (21)(22)(23).Moreover, one study claimed that the substitution was most likely neutral due to protein function and interaction with the human ACE2 receptor by in silico analysis (24) and another clinical study supported this nding where no association of this mutation was found with severe illness (25).Although a high D614G substitution was present in the Bangladeshi isolates, the case fatality rate ( The overall mutation rate for the S gene of SARS-CoV-2 is lower than other respiratory RNA virus membrane glycoprotein genes or other genes used for drug and vaccine targets.For example, the HA and NA gene of in uenza virus have mutation rates of 2.0E-3 to 10.0E-3 nucleotide substitutions/site/year (26-29), the SH and G gene of Respiratory Syncytial Virus have rates of 2.5E-3 to 3.49E-3 nucleotide substitutions/site/year (30,31), and human metapneumovirus rates of 1.6E-3 to 3.61E-3 nucleotide substitutions/site/year (31,32).
This study aimed to cover all of the South Asian region, but due to a low number of the available sequences, we had to exclude Pakistan, Sri-Lanka, Nepal, the Maldives, Bhutan and some states of India (Supplementary Table -2).Here we examined how average genetic distances of SARS-CoV-2 isolates from different regions changed over time.In our analysis, we did not show an upward trend of genetic changes except some uctuations due to the low numbers of isolates during that time points.This indicates that the virus is not evolving away from Wuhan-Hu-1 strain signi cantly as reported previously (33).The only exception was for the M gene of the Gujarat isolates with upward trends, indicating a gene-speci c evolutionary change.

Conclusions
A comparison of the Bangladeshi and other South Asian isolates with the Wuhan-Hu-1 strain provides evidence that that SARS-CoV-1 genome is largely conserved and this has important implications for future vaccine and drug development.

Methods Preparation of 'Dataset-1' for Bangladeshi isolates
In this study the full genomes of SARS-CoV-2 submitted from Bangladesh were retrieved from the EpiCoV database available via the GISAID website (www.gisaid.org)for analysis of evolutionary pattern.Multiple sequence alignment was performed using the MUSCLE programme (34) and gene segments (ORF1a, ORF6, ORF7a, ORF8, S, E, M and N) were segregated from align le with BioEdit (version 7.0.9.0).Geographical locations in district level were also retrieved (Supplementary Table-1) to prepare a sample distribution map.

Evolutionary rate and selection pressure measurement
An uncorrelated relaxed clock log-normal molecular clock model was used to calculate the evolutionary rate (numbers of nucleotide substitutions/site/year) with coalescent Bayesian skyline prior of the study isolates through a Markov Chain Monte Carlo (MCMC) framework in BEAST (v2.6.3)(35,36).MCMC chains were run for 100 million steps with sampling every 10,000 steps from the posterior distribution to ensure adequate mixing of model parameters.Hasegawa-Kishino-Yano (HKY) nucleotide substitution model was used (12), and the evolutionary rate was evaluated using Tracer v1.7.1.Finally, the S gene of the study isolates were subjected for selection pressures calculate using Datamonkey server (37), by calculating the ratio of non-synonymous (dN), and synonymous (dS) nucleotide substitutions per site (dN/dS) (38).

Preparation of 'Dataset-2' for genetic distance analysis
To observe genetic change over time different datasets of sequences were prepared where the Wuhan-Hu-1 strain (NC_045512.2) was used as a reference sequence.For this dataset the full genomes of SARS-CoV-2 from South-Asian countries with at-least 100 isolates were retrieved from the EpiCoV database up to 32 weeks (January 01 to August 11, 2020) (Supplementary Table -2).During sequence retrieval, only Bangladesh and seven states of India (Delhi, Gujrat, Karnataka, Odisha, Telangana and West Bengal) having at-least 100 isolates were sampled during weeks of 11-32 (11 March to 11 August) of 2020.
Finally, datasets of sequences were prepared for the individual region with two weeks' time-points.Datasets having a minimum of ten isolates in continuing ve time-points were selected for analysis (Supplementary Table -3).Sequences which did not mention collection dates or had high ambiguity (contained a lot of Ns for example) were excluded during the preparation of the dataset.Finally, a total of 1480 isolates were included in the analysis from ve South Asian regions; Bangladesh (n=306), Delhi (n=253), Gujarat (n=457), Maharashtra (n=312) and West Bengal (n=152).

Genetic distance matrix calculations
Multiple sequence alignment was performed using the MUSCLE programme (34) and MEGA-X software was used to calculate the distance matrix of study sequences by p-distance model (39).Genetic distances were calculated between all possible pairs.By using distance matrix data the average distance between reference sequences and study sequences were calculated.).An open access geographical index map of Bangladesh was downloaded from the national portal of 'Geological survey of Bangladesh (www.gsb.gov.bd)' and re-created for the manuscript.
Note: The designations employed and the presentation of the material on this map do not imply the expression of any opinion whatsoever on the part of Research Square concerning the legal status of any

Figures
Figures

Figure 1 Available
Figure 1

Table 2 :
Evolutionary rate and presences of D614G substitution of S gene SARS-CoV-2 Bangladesh and different provinces of India 1 Selection pressures calculate using Datamonkey server, by calculating the ratio of non-synonymous (dN), and synonymous (dS) nucleotide substitutions per site.CFR (Case Fatality Rate) was retrieved from database of Johns Hopkins Coronavirus Resource Center on September 15, 2020 (Dong, Du et al. 2020). 2