A gapless unambiguous RNA metagenome-assembled genome sequence of a unique SARS-CoV-2 variant encoding spike S813I and ORF1a A859V substitutions

The novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is causing an unprecedented pandemic, threatening global health, daily life, and economy. Genomic surveillance continues to be a critical effort towards tracking the virus and containing its spread, and more genomes from diverse geographical areas and different time points are needed to provide an appropriate representation of the virus evolution. We here report the successful assembly of a single gapless, unambiguous contiguous sequence representing the complete viral genome from a nasopharyngeal swab of an infected healthcare worker in Cairo, Egypt. The sequence has all typical features of SARS-CoV-2 genomes, with no protein-disrupting mutations; however, three mutations are worth highlighting and future tracking: a nonsynonymous mutation (causing a rare spike S-813-I variation) and two other uncommon ones leading to an A41V variation in NSP3, encoded by ORF1a (ORF1a A859V), and a Q677H variation in the spike protein. Both affected proteins, S and NSP3, are relevant to vaccine and drug development. While the genome, named CU_S3, belongs to the prevalent global genotype, marked by the D614G spike variation, the combined variations in the spike protein and ORF1a have not been observed in any of the 197,000 genomes reported to date. Future studies will assess the biological, pathogenic, and epidemiologic implications of this set of genetic variations.


Background And Motivation
Since the first case of COVID-19 in December 2019, the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been circulating among humans (Li et al., 2020), causing a historically unmatched pandemic that is threatening global health, lifestyle, and economy.
In context, two of the most unforgettable pandemics (the medieval bubonic plague, 'the Black Death', and the 1918 influenza) occurred at times when neither bacteria nor viruses were properly characterized, and thus remained with no defined etiological agent (identified neither morphologically/microscopically nor, of course, genetically). Conversely, the early 21st century SARS epidemic of 2002-2003 was promptly identified and well contained, and the success of its containment was, at least in part, due to genomics.
The SARS virus genome was sequenced and tracked properly; cases were traced and isolated in a proportionate way; and the severe disease, which had a high case fatality rate, vanished within two years (de Wit et al., 2016).
On the other hand, the new SARS-like virus, SARS-CoV-2, is intensely challenging all containment measures for several epidemiological reasons (e.g., longer incubation period, transmission from presymptomatic or mildly symptomatic individuals, and, sadly, politics).
As of November 11, 2020, over 197,000 genome sequences have been made available (with an hourly increase), at an extraordinary speed, through different databases, including the GISAID website (GISAID), which allows swift, immediate, and open sharing of data. To the best of our knowledge, no organism or virus has been sequenced at this throughput, not even close. Continuous sequencing allows tracking mutations and their resulting genome variants, following trends of transmission, and, most importantly, getting prepared for potential biologically significant variants of the virus that may increase its spread, virulence, or immune escape.
Here we report a complete genomic sequence of SARS-CoV-2, assembled from a metagenomic sequence of a nasopharyngeal swab obtained from a healthcare worker in a hospital in Egypt. The reported sequence encodes a unique combination of amino acid variations that affect the sequence of the replicase and spike proteins.

Materials And Methods
Ethics statement. All protocols were approved by the Cairo University anti-COVID-19 Task Force on 15 April 2020. Sampling, informed consent forms, and all sequencing protocols have been approved by the Ethical Committee and Institutional Review Board of the Faculty of Medicine, Cairo University (IRB Approval: 07072018).
Sample information. The sequenced swab was obtained on June 19th, 2020 from a 40-year-old woman, a healthcare worker, who had typical symptoms of mild COVID-19 pneumonia and a 38.5 °C fever. After confirmation of SARS-CoV-2 positivity by real-time PCR, the patient's consent was obtained for full sequencing of the extracted RNA.
Library preparation and sequencing. Total RNA was directly extracted from the swab with the Qiagen QIAamp Viral RNA Mini Kit (Qiagen, Valencia, CA, USA) and depleted of human, bacterial, and mitochondrial ribosomal RNA by Ribo-Zero Plus (Illumina, La Jolla, CA, USA). Illumina MiSeq was used for sequencing a shotgun metagenomic library prepared by Illumina's TruSeq RNA Library Prep Gold Kit (v2) protocol.
Assembly and mapping. The sequences were mapped to SARS-CoV-2 genomes by Blast+ local alignment (Camacho et al., 2009). Mapped hits were filtered and assembled in the PATRIC platform (Davis et al., 2020) (accessed on 9/9/2020) by the built-in SPAdes de novo assembly algorithm (Bankevich et al., 2012).
Bioinformatics analysis. All genomes used in multiple sequence alignments or phylogenetic analysis were obtained from the GISAID or NCBI Virus portals (Brister et al., 2015). Only genomes labeled as 'complete' were used. The data set was stringently filtered for ambiguous sequences (no more than 3 Ns) and for redundancy (all duplicate/100% identical genomes were filtered out). The stand-alone MUSCLE program (Edgar, 2004a, b) was used for multiple sequence alignment of the full nucleotide sequence of the filtered SARS-CoV-2 genomes. FastTree (Price et al., 2009, 2010) was used to compute phylogenetic distances, with a general time-reversible (GTR) substitution model, and FigTree (v.1.4.4) was used for the tree visualization (Rambaut, 2018). Other phylogenetic and phylogeographic comparisons were implemented from built-in tools in the GISAID (GISAID), NextStrain (Hadfield et al., 2018), and CoV-Glue (Singer et al., 2020) sites. CoVsurver, an application available at the GISAID website, was used to map spike amino acid variations to the 3D structure of a spike protein trimer.
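The data set filtering described above can be illustrated with a short Python sketch. This is a minimal illustration and not the pipeline used in this study: the function names (`parse_fasta`, `filter_genomes`) and the toy records are hypothetical, and only the two criteria (no more than 3 Ns per genome, removal of 100% identical sequences) are taken from the text.

```python
from typing import Dict, Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def filter_genomes(text: str, max_n: int = 3) -> Dict[str, str]:
    """Keep genomes with at most `max_n` N bases and drop exact duplicates."""
    kept: Dict[str, str] = {}
    seen = set()
    for header, seq in parse_fasta(text):
        if seq.count("N") > max_n:
            continue  # too many ambiguous bases
        if seq in seen:
            continue  # 100% identical to a genome already kept
        seen.add(seq)
        kept[header] = seq
    return kept


# Hypothetical toy records: "b" has 4 Ns, "c" duplicates "a", "d" has exactly 3 Ns.
toy = """>a
ACGTACGT
>b
ACGTNNNNACGT
>c
ACGTACGT
>d
ACGNNNTT
"""
filtered = filter_genomes(toy)
print(sorted(filtered))  # ['a', 'd']
```

In this sketch the first occurrence of each identical sequence is retained; which representative of a duplicate group was kept in the actual analysis is not stated in the text.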
Data availability. The CU_S3 sequence was deposited in both the GISAID and NCBI Virus databases and was given the accession IDs EPI_ISL_529032 and MT990450, respectively.

Results
The MiSeq high-throughput sequencing of the rRNA-depleted shotgun RNA library, obtained from the patient's nasopharyngeal swab extract, generated over 5 million high-quality RNA sequence reads. Out of 15,000 reads with positive BlastN hits (>90% identity) to SARS-CoV-2, the complete viral genome of SARS-CoV-2 was successfully assembled into a single gapless contiguous sequence with no ambiguous bases. The coverage of any given genomic position ranged from 15x to 50x.
Brief genome description. We named the assembled genome CU002b-S3 (or CU_S3 for short). Its sequence has 29,792 bases with typical features of SARS-CoV-2 genomes (Dabravolski and Kavalionak, 2020; Li et al., 2020) and includes all its open reading frames (ORFs), ORF1a through ORF10, without any disruptive mutations. The genome has been classified as part of clade GR (according to the GISAID classification); of the pangolin lineage B.1.1.1 (Rambaut et al., 2020); or of the major clade 20B (according to the new NextStrain classification (Hadfield et al., 2018)), formerly known as clade A2a.
Other than this unique variation, the spike protein in the CU_S3 genome carries three other variations from the 'reference' Wuhan isolate, the most common of which is the D614G variation that is becoming prevalent worldwide (~88% frequency), and two more substitutions at positions 12 and 677 of the spike protein (S12F and Q677H, Fig. 1B and C) with frequencies of ~0.11% and ~0.19%, respectively. Another rather rare variation is an A-to-V substitution in residue 859 of ORF1a, the ORF that encodes the replicase (amino acid 41 in NSP3). This substitution has been seen in only 45/157,853 (~0.03%) global isolates (http://cov-glue.cvr.gla.ac.uk/#/project/replacement/NSP3:A:41:V), including one from Malaysia and one from Italy (Fig. 1D and 2A), and only one from Africa/Egypt (isolate CUNCI-HGC4I031).
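The correspondence between ORF1a residue 859 and NSP3 residue 41, and the quoted frequency, can be checked with a few lines of Python. This is only a sketch: the NSP1 to NSP3 boundaries within ORF1a are an assumption taken from the annotation of the reference genome NC_045512.2 (they are not stated in this report), and the helper names `orf1a_to_nsp` and `percent` are hypothetical.

```python
# ORF1a boundaries of the first three non-structural proteins
# (1-based, inclusive). Assumed from the NC_045512.2 reference annotation.
NSP_BOUNDS = {
    "NSP1": (1, 180),
    "NSP2": (181, 818),
    "NSP3": (819, 2763),
}


def orf1a_to_nsp(position: int):
    """Map an ORF1a residue number to (protein name, residue number within it)."""
    for name, (start, end) in NSP_BOUNDS.items():
        if start <= position <= end:
            return name, position - start + 1
    raise ValueError(f"ORF1a position {position} is outside NSP1-NSP3")


def percent(count: int, total: int) -> float:
    """Frequency of `count` observations among `total`, as a percentage."""
    return 100.0 * count / total


print(orf1a_to_nsp(859))              # ('NSP3', 41)
print(round(percent(45, 157853), 3))  # 0.029, i.e. ~0.03%
```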
To date, none of the 197,265 publicly available SARS-CoV-2 genome sequences has the unique combination of mutations observed in CU_S3. Specifically, the combination of S: S813I and ORF1a: A859V variations encoded by this genome has no precedent in any public genomic sequence.
Phylogeny and phylogeography. Although genomic sequences reported to date remain closely related, we used multiple sequence alignment followed by phylogenetic analysis, at the nucleotide level, to analyze CU_S3 and trace its microevolution in the context of local and global genomes.
To visualize the phylogeny of CU_S3, we compared it to selected genomes from all around the globe representing different major clades (Fig. 2A) and to all non-redundant genomic sequences from Egypt that were available until October 1st (Fig. 2B). As of that date, the most closely related local isolate was one from the same city and hospital (CUNCI-HGC4I031) and the most closely related global isolate was one from Malaysia (Fig. 2).

Discussion
We here report the successful metagenome-based assembly of a single gapless contiguous sequence representing the complete viral genome of a SARS-CoV-2 isolate from an Egyptian patient. The genome, named CU002b-S3 (CU_S3 for short), was analyzed for mutations that resulted in amino acid variations, and was phylogenetically compared to representative local and global genomes.
In spite of the millions of cases worldwide and the tens of thousands of sequenced genomes, the SARS-CoV-2 genome is relatively stable, with an estimated mutation rate of ~24 bases per genome per year (GISAID); thus phylogenetic analyses do not lead to strongly distinct clusters, except for the major clades (Rambaut et al., 2020) that have been defined so far in several studies (reviewed in Dabravolski and Kavalionak (2020) and monitored by NextStrain (Hadfield et al., 2018)).
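To put the quoted rate in per-site terms, the following Python sketch divides it by the length of the CU_S3 genome. It is a back-of-the-envelope illustration only: the variable and function names are hypothetical, and a uniform, clock-like accumulation of substitutions is assumed for simplicity.

```python
# Values taken from the text above.
SUBSTITUTIONS_PER_GENOME_PER_YEAR = 24
GENOME_LENGTH = 29_792  # bases in the CU_S3 assembly

# Assumption: substitutions are spread uniformly across the genome.
per_site_per_year = SUBSTITUTIONS_PER_GENOME_PER_YEAR / GENOME_LENGTH


def expected_substitutions(months: float) -> float:
    """Expected genome-wide substitutions after `months`, assuming a constant rate."""
    return SUBSTITUTIONS_PER_GENOME_PER_YEAR * months / 12.0


print(f"{per_site_per_year:.2e} substitutions per site per year")
print(expected_substitutions(6), "expected substitutions in 6 months")
```

Under these assumptions the rate is on the order of 8 x 10^-4 substitutions per site per year, or about two substitutions per genome per month, which is consistent with the weak clustering noted above.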
What we find intriguing and worth sharing, while we continue to monitor isolates from Egyptian patients, is that the specific isolate reported here (CU_S3) encodes a rare S813I variation, not seen in Africa. Although the biological significance of this variation is yet to be explored, and although serine and isoleucine are not dramatically different, they still are physicochemically distinct, as isoleucine is larger in size and less polar than serine.
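The size and polarity difference between the two residues can be made concrete with standard reference values. The sketch below uses the Kyte-Doolittle hydropathy scale and average residue masses, neither of which is used in this report; they are brought in here only for illustration, and the dictionary and function names are hypothetical.

```python
# Standard reference values (not from this report):
# Kyte-Doolittle hydropathy index and average residue mass in daltons.
RESIDUE_PROPERTIES = {
    "S": {"name": "serine", "hydropathy": -0.8, "mass_da": 87.08},
    "I": {"name": "isoleucine", "hydropathy": 4.5, "mass_da": 113.16},
}


def substitution_shift(ref: str, alt: str) -> dict:
    """Return the change in hydropathy and residue mass for a ref -> alt substitution."""
    a, b = RESIDUE_PROPERTIES[ref], RESIDUE_PROPERTIES[alt]
    return {
        "hydropathy_change": round(b["hydropathy"] - a["hydropathy"], 2),
        "mass_change_da": round(b["mass_da"] - a["mass_da"], 2),
    }


shift = substitution_shift("S", "I")
print(shift)  # {'hydropathy_change': 5.3, 'mass_change_da': 26.08}
```

A positive hydropathy change indicates a more hydrophobic (less polar) side chain, and the positive mass change reflects the bulkier isoleucine side chain.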
Of interest, the rare cases in which this S813I variation has been reported (24 cases) are mostly scattered and polyphyletic (belonging to different clades, with no evidence of phylogenetic contiguity). This observation suggests that the variation results from spontaneous mutations that may occur independently and repeatedly (akin to convergent evolution, but with no particular selective pressure favoring the event so far).
Additionally, while all 24 genomes with reported nonsynonymous mutations leading to the spike S813I variation also have the D614G variation, CU_S3 is the only genome sequence, in all public databases, to have both mutations causing S: S813I and ORF1a: A859V (NSP3: A41V) variations.
Numerous reports focused on the spike D614G variation (Díez-Fuertes et al., 2020; Korber et al., 2020; Plante et al., 2020; Zhou et al., 2020). Although it is not the most frequent variation, it is quite prevalent, and its occurrence in the spike protein made it an attractive target for speculations about an effect on viral transmissibility, infection efficiency, virulence, immunogenicity, immune evasion, and even disease prognosis (Grubaugh et al., 2020; Hou et al., 2020; Korber et al., 2020; Zhou et al., 2020). Yet, the only solid experimental evidence supports a slightly higher transmissibility but no significant structural effects leading to virulence or immune evasion (Díez-Fuertes et al., 2020; Grubaugh et al., 2020; Korber et al., 2020; Plante et al., 2020). Quite interestingly, D614G, Q677H, and S813I, three of the four variations in the CU_S3 isolate, all occur in the middle of the spike protein, away from its receptor-binding domain (which is logically the most impactful on infectivity, drug susceptibility, and neutralizing antibody binding, Fig. 3). However, suggestions that these variations may affect the protein flexibility or stability are valid and remain to be structurally confirmed by X-ray crystallography or cryogenic electron microscopy. As for the S12F variation, it does not appear in the available protein structure (Fig. 3) as it is in the N-terminal region.
In conclusion, using a direct, amplification-free RNA metagenome/metatranscriptome sequencing approach, we fully assembled and identified a novel variant of SARS-CoV-2, with a nonsynonymous mutation causing a rare spike S-813-I variation and two other uncommon ones (NSP3: A41V and S: Q677H). The two affected proteins, S and NSP3, are significant for vaccine and drug development, respectively.
To the best of our knowledge, no other sequenced genomes combine mutations that lead to S: S813I and ORF1a: A859V (NSP3: A41V) variations.
Future studies are needed, in parallel with relentless genomic surveillance programs, to assess the biological, pathogenic, and epidemiologic implications of this set of genetic variations. Whether this variant is going to expand and be seen further depends on an increase of genomic sampling as well as on potential fitness advantages conferred by the observed amino acid substitutions.