Whole Genome Sequence Analysis of SARS-CoV-2 Strains Circulating in Malaysia During First Wave and Early Second Wave of Infections.

Background: Since December 2019, the outbreak of COVID-19 has raised great public health concern globally. Here, we report the whole genome sequencing analysis of SARS-CoV-2 strains isolated in Malaysia from six patients diagnosed with COVID-19. Methods: SARS-CoV-2 viral RNA extracted from clinical specimens and isolates was subjected to whole genome sequencing on the NextSeq 500 platform. The sequencing data were assembled into full genome sequences using Megahit, and a phylogenetic tree was constructed using Mega X software. Results: Six full genome sequences of SARS-CoV-2, comprising strains from the 1st wave (25th January 2020) and 2nd wave (27th February 2020) of infection, were obtained. Downstream analysis demonstrated diversity among the Malaysian strains, with several synonymous and non-synonymous mutations in four of the six cases affecting the M, orf1ab, and S genes of the SARS-CoV-2 virus. The phylogenetic analysis revealed that the viral genome sequences of the Malaysian SARS-CoV-2 strains clustered under the ancestral Type B. Conclusion: This study provides insight into the evolution of the SARS-CoV-2 virus during its circulation in Malaysia. Continuous monitoring and analysis of the whole genome sequences of confirmed cases will be crucial to further understand the genetic evolution of the virus.


Background
Coronavirus disease 2019 (COVID-19) emerged in December 2019 as a new pandemic form of life-threatening infection caused by a novel coronavirus, namely SARS-CoV-2 [1]. As of 14th May 2020, there were 4.2 million cases of SARS-CoV-2 infection with 294,046 deaths reported worldwide (WHO) [2]. In Malaysia, the total number of cases was 6,819, with 5,351 recoveries and 112 deaths, as of 14th May 2020 [3].
Coronaviruses are enveloped, positive-sense, single-stranded RNA viruses belonging to the family Coronaviridae. To date, four coronavirus genera have been identified: Alphacoronavirus, Betacoronavirus, Gammacoronavirus and Deltacoronavirus [5]. These viruses generally infect animals, including birds and mammals [6]. Similar to the viruses that cause SARS (Severe Acute Respiratory Syndrome) and MERS (Middle East Respiratory Syndrome), SARS-CoV-2 is a zoonotic coronavirus that belongs to the genus Betacoronavirus. It has a genome size varying from 29.8 kb to 29.9 kb that encodes multiple structural and non-structural proteins [7]. The structural proteins include the spike (S) protein, the envelope (E) protein, the membrane (M) protein, and the nucleocapsid (N) protein [8]. Briefly, the S protein guides the entry of the virus into host cells, the E protein plays a role in production and maturation of the virus, the M protein determines the shape of the virus, and the N protein is involved in viral replication [9,10].
Malaysia experienced the first wave of SARS-CoV-2 infection on 25th January 2020, when three cases were confirmed positive for COVID-19 through contact tracing of the index case identified in Singapore.
The second wave of SARS-CoV-2 infection started on 27th February 2020, after Malaysia had reported no new cases for 11 days. Initially, most confirmed positive cases were imported by travellers from China. Subsequently, clustered and confirmed cases without a history of travel to China increased as the outbreak progressed. Hence, there is a need to examine the molecular epidemiology of SARS-CoV-2 to understand the evolution of the virus and to compare it with strains circulating elsewhere. Whole genome sequencing was carried out by the Institute for Medical Research, Malaysia, to compare the genomic evolution of SARS-CoV-2 strains circulating in Malaysia during the first wave and early second wave of infections.

Sample selection and viral cultivation
Nasopharyngeal swab (NPS) and oropharyngeal swab (OPS) specimens from suspected COVID-19 patients that were sent for routine diagnosis to the Virology Unit, Institute for Medical Research, Malaysia, were selected for viral cultivation. Selection was based on cases from the first wave and second wave clusters that were confirmed positive by COVID-19 Real-Time RT-PCR. Specimens were inoculated into Vero E6 cells in a Biosafety Level-3 facility according to the WHO laboratory biosafety guidelines and monitored for cytopathic effect (CPE).

Viral RNA Extraction
Virus isolates were harvested from passage 1. Viral RNA was extracted from these isolates using the QIAamp Viral RNA Mini kit (QIAGEN, Hilden, Germany) according to the manufacturer's instructions and confirmed for the presence of the SARS-CoV-2 genome by Real-Time RT-PCR for the E gene and RdRp (Berlin WHO) [11]. Additionally, viral RNA was re-extracted from retrospective original specimens from confirmed COVID-19 cases that showed Cq <20.

Next Generation Sequencing (NGS) and data analysis
NGS libraries were constructed using the TruSeq Stranded Total RNA Gold library prep kit (Illumina, USA) from five clinical specimens and five viral isolates that passed quality control assessment for RNA concentration by the Qubit™ RNA HS Assay (Thermo Fisher, USA). Sequencing was performed on the NextSeq 500 platform (Illumina, USA), yielding 160 million paired reads that were evaluated with FastQC [12,13]. These raw reads then underwent quality filtering, including removal of remaining TruSeq Illumina adaptor sequences. Low-quality, low-complexity and unpaired reads were removed using Trimmomatic [14] with the options LEADING:3 TRAILING:3 MINLEN:30 and Phred+33 quality encoding. De novo genome assembly was conducted using Megahit [15] with default parameters. The genome sequence built from each sample was compared by BLAST against the reference viral genome SARS-CoV-2 (NC_045512.2) and mapped using the Hisat2 program [16]. Gene prediction was done with Vgas (http://cefg.uestc.cn/vgas/) [17], followed by identification of single nucleotide variants (SNVs) using samtools [18], GATK [19] and Lofreq [20]. The SNVs were annotated against the reference strain using SnpEff [21] and their effects were predicted via SnpSift [22].
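The read-trimming options quoted above can be illustrated with a minimal Python sketch. It is not Trimmomatic itself and not the authors' pipeline; it only mimics the documented behaviour of LEADING, TRAILING and MINLEN on Phred+33 encoded reads, applied in that order. The function names `trim_read` and `trim_pair` are hypothetical, and adaptor clipping and low-complexity filtering are deliberately left out.

```python
from typing import Optional, Tuple

PHRED_OFFSET = 33  # Phred+33 encoding: quality score = ASCII code - 33


def trim_read(seq: str, quality: str, leading: int = 3, trailing: int = 3,
              minlen: int = 30) -> Optional[Tuple[str, str]]:
    """Trim one read in the spirit of LEADING:3 TRAILING:3 MINLEN:30.

    Returns the trimmed (sequence, quality) pair, or None if the read
    ends up shorter than `minlen` and should be discarded.
    """
    scores = [ord(ch) - PHRED_OFFSET for ch in quality]

    # LEADING: drop bases from the 5' end while their quality is below the threshold.
    start = 0
    while start < len(scores) and scores[start] < leading:
        start += 1

    # TRAILING: drop bases from the 3' end while their quality is below the threshold.
    end = len(scores)
    while end > start and scores[end - 1] < trailing:
        end -= 1

    # MINLEN: discard reads that are too short after trimming.
    if end - start < minlen:
        return None
    return seq[start:end], quality[start:end]


def trim_pair(read1: Tuple[str, str], read2: Tuple[str, str], **options):
    """Keep a read pair only if both mates survive trimming (unpaired reads are dropped)."""
    trimmed1 = trim_read(*read1, **options)
    trimmed2 = trim_read(*read2, **options)
    if trimmed1 is None or trimmed2 is None:
        return None
    return trimmed1, trimmed2
```

With these thresholds only bases of very low quality (Phred score below 3) are clipped from the read ends, so the filter mainly removes unusable bases and reads shorter than 30 nt.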

Phylogenetic analysis
The full-length SARS-CoV-2 genome sequences were subjected to phylogenetic analysis. A dataset of 50 complete SARS-CoV-2 genomes from different countries was retrieved from GISAID (https://www.gisaid.org/, last accessed 16 March 2020). Sequence alignment was performed using the Multiple Sequence Comparison by Log-Expectation (MUSCLE) software, and the phylogenetic tree was constructed by the neighbour-joining method (1,000 bootstrap replicates) using Molecular Evolutionary Genetics Analysis (MEGA, v7).
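As a rough illustration of what the neighbour-joining step operates on, the sketch below computes pairwise distances from an alignment and applies the standard neighbour-joining selection criterion to pick the first pair of sequences to join. It is not MEGA's implementation. The use of the simple p-distance is an assumption, since the substitution model is not stated above, and the function names and toy sequences are hypothetical. Bootstrapping and the remaining joining iterations are omitted.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            if x != y:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def first_nj_pair(aligned: dict) -> tuple:
    """Return the pair of taxa that neighbour joining would merge first.

    Uses the usual criterion Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j),
    where r(i) is the sum of distances from taxon i to all other taxa.
    """
    names = sorted(aligned)
    n = len(names)
    d = {pair: p_distance(aligned[pair[0]], aligned[pair[1]])
         for pair in combinations(names, 2)}

    def dist(i, j):
        if i == j:
            return 0.0
        return d[(i, j)] if (i, j) in d else d[(j, i)]

    row_sum = {i: sum(dist(i, k) for k in names) for i in names}
    return min(combinations(names, 2),
               key=lambda p: (n - 2) * dist(*p) - row_sum[p[0]] - row_sum[p[1]])


# Toy alignment (made-up sequences, for illustration only).
toy = {
    "a": "AAAAAAAAAA",
    "b": "AAAAAAAAAC",
    "c": "CCCCCAAAAA",
    "d": "CCCCCAAAAG",
}
pair = first_nj_pair(toy)  # joins either ("a", "b") or ("c", "d"); the two are tied
```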

Ethical consideration
This study used retrospective specimens and therefore did not require approval from a human subjects ethics review committee.

Results
General information of selected COVID-19 cases
Ten confirmed cases were subjected to whole genome sequencing. Of these, only six were successfully sequenced to full genome, comprising two clinical specimens and four viral isolates and representing the first and early second waves of the SARS-CoV-2 outbreak in Malaysia. As shown in Table 1

Whole genome sequencing analysis
After removing reads that mapped to the human genome, about 4 to 10 million reads passed quality control for each sample. De novo assembly produced a genome sequence of about 29.8 kb for each sample, with coverage between 439X and 1166X and an average GC content of about 45% (Supplementary Table 1). The genome sequence assembled from each sample had 99-100% similarity to the reference viral genome SARS-CoV-2 (NC_045512.2). Our findings revealed five non-synonymous variants in four of the six cases, affecting the genes M (membrane glycoprotein), orf1ab (orf1ab polyprotein), and S (spike protein) of the SARS-CoV-2 virus (Table 2). Only the cases from the first wave of the outbreak were closely related to the Wuhan strains, while the other Malaysian SARS-CoV-2 strains segregated towards other strains in the same group.
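The assembly statistics quoted above (GC content and fold coverage) follow simple definitions, sketched here in Python. This is an illustrative calculation, not the tool used in the study. The function names are hypothetical, and mean depth is assumed to be total mapped bases divided by genome length.

```python
def gc_content(sequence: str) -> float:
    """Percentage of G and C among the unambiguous (A, C, G, T) bases."""
    sequence = sequence.upper()
    unambiguous = sum(sequence.count(base) for base in "ACGT")
    if unambiguous == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (sequence.count("G") + sequence.count("C")) / unambiguous


def mean_depth(mapped_bases: int, genome_length: int) -> float:
    """Average fold coverage: total mapped bases divided by genome length."""
    if genome_length <= 0:
        raise ValueError("genome length must be positive")
    return mapped_bases / genome_length
```

For example, under this definition a 29.8 kb assembly covered at 439X corresponds to roughly 13 million mapped bases.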

Discussion
The currently circulating SARS-CoV-2 was reported to have diverged into three main variants, A, B and C, characterized through whole genome sequencing [23]. Whilst types A and C were found to be the dominant types outside East Asia, type B is most commonly detected in East Asia. However, derived B types, mutated from the ancestral B type, were able to transmit outside East Asia. In comparison to type A, the type B genomic sequence differs by two mutations, the synonymous T8782C and the nonsynonymous C28144T. Type C, in turn, is reported to carry the nonsynonymous mutation G26144T compared to its parental type B [23].
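The typing scheme described above can be expressed as a small rule set over the three marker positions. The Python sketch below is a simplified illustration, not the network analysis of the cited study. The function name `classify_type` is hypothetical, and it assumes the input genome is already in NC_045512.2 reference coordinates, with no insertions or deletions shifting positions upstream of the markers.

```python
# Marker positions (1-based, reference coordinates) taken from the mutations named above.
POS_8782, POS_28144, POS_26144 = 8782, 28144, 26144


def classify_type(genome: str) -> str:
    """Assign type A, B or C from the bases at the three marker positions.

    Type A: T at 8782 and C at 28144.
    Type B: C at 8782 and T at 28144 (T8782C, C28144T), with G at 26144.
    Type C: as type B, but with T at 26144 (G26144T).
    """
    b8782 = genome[POS_8782 - 1].upper()
    b28144 = genome[POS_28144 - 1].upper()
    b26144 = genome[POS_26144 - 1].upper()

    if (b8782, b28144) == ("T", "C"):
        return "A"
    if (b8782, b28144) == ("C", "T"):
        if b26144 == "G":
            return "B"
        if b26144 == "T":
            return "C"
    return "unclassified"
```

Genomes with ambiguous bases or other combinations at these positions are reported as unclassified rather than forced into a type.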
In this study, SARS-CoV-2 from clinical specimens and culture isolates was successfully sequenced to whole genome using the Illumina NextSeq platform. The nucleotide identity reached up to 99.9% similarity to 93 full genomes of SARS-CoV-2, and the sequence homology analysis showed that all sequenced samples belonged to the ancestral Type B variant. The two cases in the first wave were closely related to the Wuhan strains, whereas the other four cases diverged from the Wuhan strain and branched close to the strains from Shenzhen and Hangzhou. Although analysis of a larger sample size is needed, our preliminary finding supports the evidence of Forster et al. (2020) that the ancestral B type predominates in East Asia.
Variant analysis of the Malaysian SARS-CoV-2 strains was also carried out. Intriguingly, the SARS-CoV-2 virus from case number 26 (EPI_ISL_430442) had a 15-nucleotide in-frame deletion in the spike protein gene that none of the other strains had. This in-frame deletion has previously been characterised as resulting in the loss of five amino acids (QTQTN) flanking the polybasic cleavage site of the spike protein and was hypothesized to be passage-related [24]. However, one study demonstrated that the deletion also occurred in SARS-CoV-2 extracted from clinical samples [25]. The spike protein of coronaviruses plays a pivotal role in viral infectivity, transmissibility, and antigenicity. Therefore, genetic characterization of the spike protein in SARS-CoV-2 would shed light on its evolution. The viral isolate from case 26 was passaged once in Vero E6 cells, which could have promoted such a deletion. The prevalence of this mutation among clinical samples warrants further investigation. Notably, based on epidemiological mapping, case 26 was reported to have generated first-generation and second-generation clusters after attending a meeting, which highlights a high transmission ability that could have been caused by the deletion in the spike protein of the virus. However, it is unclear whether this deletion contributes to the severity of the disease.
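Why a 15-nucleotide deletion removes exactly five residues without disturbing the downstream protein can be shown with a short translation example. The Python sketch below uses the standard genetic code. The nucleotide string is a toy sequence chosen to encode YQTQTNSPRRAR and is not taken from the study's data, and the helper names are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nucleotides: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    nucleotides = nucleotides.upper()
    usable = len(nucleotides) - len(nucleotides) % 3
    return "".join(CODON_TABLE.get(nucleotides[i:i + 3], "X")
                   for i in range(0, usable, 3))


def delete_span(nucleotides: str, start: int, length: int) -> str:
    """Remove `length` nucleotides beginning at 0-based index `start`."""
    return nucleotides[:start] + nucleotides[start + length:]


def is_in_frame(deletion_length: int) -> bool:
    """A deletion keeps the reading frame when its length is a multiple of three."""
    return deletion_length % 3 == 0


# Toy coding sequence (illustrative only): Y Q T Q T N S P R R A R
toy_cds = "TATCAGACTCAGACTAATTCTCCTCGGCGGGCACGT"
before = translate(toy_cds)                     # "YQTQTNSPRRAR"
after = translate(delete_span(toy_cds, 3, 15))  # "YSPRRAR"
```

Because 15 is a multiple of three and the toy deletion starts on a codon boundary, the five residues QTQTN are lost while the residues on either side are translated unchanged.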

Conclusion
In this study, six SARS-CoV-2 strains isolated from Malaysian COVID-19 patients and Chinese tourists were sequenced and analysed.
