SARS-COV-2 genomic surveillance in the Brazilian Western Amazon region: evolutionary history of epidemic dissemination and genetic signature in Rondônia, Brazil

SARS-CoV-2 has spread rapidly around the world, with Brazil currently considered an epicenter of the pandemic. The northern region of the country has the highest incidence and mortality rates. This study aimed to investigate information about the evolutionary history of epidemic spread and genetic aspects of strains isolated on the Western Amazon, in the State of Rondônia, Brazil. It was possible to detect a total of 22 mutations. Some of these alterations may possibly be related to effects on transmissibility, the delity of RNA replication, the ability of cancer patients to respond to infection, beyond a mutation that emerged after the introduction of SARS-CoV-2 in Rondônia. At least two events of introduction were detected, corresponding to the B.1 and B.1.1 European lineages. An introduction was observed possibly through Argentina, where strains originated that circulated in the Minas Gerais and Ceará Brazilian states, prior to Rondônia (B.1.), as well as through the Minas Gerais state and the Federal District, which gave rise to strains that spread to Rondônia, from the capital to more rural parts of the state (B.1.1.). The ndings show the need to monitor the genetic epidemiology of COVID-19, in order to surveil the virus’s evolution, dispersion and diversity.


Introduction
An outbreak of a serious respiratory disease of unknown etiology emerged in December 2019 in Wuhan, China. In early 2020, epidemiological and genetic analyses allowed for the identi cation of a new Coronavirus, later named Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) which causes a clinical presentation now de ned as Coronavirus Disease of 2019/COVID-19 1,2 . It is classi ed as part of the genus Betacoronavirus, subgenus Sarbecovirus, along with SARS-CoV, the strain responsible for the SARS outbreak in 2003 3 . Both have zoonotic origin associated with bats; the speci c host responsible for the SARS-CoV-2 leap to the human species has not yet been determined, although there is evidence that the pangolin was the intermediate host 4,5 .
SARS-CoV-2 has a large and complex positive chain RNA genome of approximately 30,000 nucleotides in length that encodes known and hypothetical proteins 6 . Much of the viral genome corresponds to a large open reading frame (ORF), called ORF1ab. This gene encodes a polyprotein which, when cleaved during the replication process, gives rise to 16 non-structural proteins (nsp), including an RNA-dependent RNA polymerase (RdRp) and a helicase (Hel), as well as proteins involved in viral transcription, revision (Exonuclease nsp14), translation, cleavage (3CL-PRO), assembly and others related to the suppression of the host cell and immune system functions. The other part of the genome comprises genes encoding structural and accessory proteins, including the Spike surface glycoprotein (ORF2), made up of S1 and S2 domains, responsible for viral recognition and penetration through the human ACE-2 receptor; the membrane glycoprotein M (ORF5), responsible for the morphogenesis and assembly of the virus; nucleocapsid phosphoprotein N (ORF9), responsible for packaging genomic RNA, and envelope protein E (ORF4) 1,[6][7][8] . Other reading frames are predictive and lack functional evidence. Thus, the viral genetic content has not been fully elucidated, with several predicted hypothetical open reading frames, whose function or even protein coding status is unknown 3 .
Since the outbreak began, SARS-CoV-2 has spread rapidly around the world. The virus has a high potential for transmissibility, with a reproductive number (R0) that can vary between 2.0 and 4.0 9 .
Currently, four main transmission routes have been de ned: 1) symptomatic transmission, which occurs through direct contact with a symptomatic individual; 2) environmental transmission, which occurs through sharing a contaminated environment or surface; 3) asymptomatic transmission, which occurs through direct contact with infected individuals who never presented symptoms and; 4) pre-symptomatic transmission, which occurs through direct contact with individuals who are in the incubation period (who do not yet have visible symptoms, a phase that can last on average 5 days, and may vary between 3 and 14 days). The latter is the main form of transmission, responsible for 42% to 66% of SARS-CoV-2 contamination, a rate su cient to sustain an epidemic which, therefore, explains why social isolation restricted to symptomatic individuals is ineffective, and other strategies are necessary 10-14 . Due to this ease and high rate of transmissibility, through August 14, 2020, COVID-19 has already been responsible for more than 750,000 deaths worldwide, with Brazil accounting for about 13,8% of this total (104,201) (COVID-19 MAP -Johns Hopkins Coronavirus Resource Center), in addition to being considered by the World Health Organization (WHO) an epicenter of the pandemic in South America and in the world.
The Central-West region of the country is considered the Brazilian epicenter; however, the Northern region currently has the highest incidence coe cient, as well as the highest mortality rate (2409,3 and 66,9 per 100,000 inhabitants, respectively) 16 . Since the rst con rmed case on March 20, 2020, Rondônia, a Brazilian state located in the southern part of the Amazon, has seen more than 46,000 con rmed cases (2648,3/100,000 inhabitants), and over 990 deaths (52,9/100,000 inhabitants), with the great majority located in the capital, Porto Velho (BRASIL, 2020; BRASIL/RONDÔNIA, 2020). Although some Brazilian states have publicly disclosed genomic sequences of SARS-CoV-2, Rondônia does not yet have this type of information.
Extensive sequencing of the viral genome from different regions of the country provides insights into the prevalence of viral strains and any regional differences that may lead to a better understanding of patterns of transmission, outbreak tracking and formulation of containment measures. This study presents genetic data of the rst 08 sequences of SARS-CoV-2 isolates in the state of Rondônia. The information available on the main mutations detected was reviewed in order to provide current and important details for the development of vaccines, speci c antivirals and effective diagnostic tests.
Additionally, the phylodynamic relationships between samples and sequences of isolates from different locations were also studied to assess the epidemic and evolutionary history of the virus in this State.

Nucleotide sequencing of viral isolates
The study was carried out by the Molecular Virology Laboratory of the Oswaldo Cruz Foundation of Rondônia (FIOCRUZ-RO) in collaboration with the Central Laboratory of Public Health of Rondônia (LACEN/RO) and the Leônidas and Maria Deane Institute (ILMD) of FIOCRUZ Amazonas. Ten samples of combined swabs collected from individuals residing in the State of Rondônia/Brazil with clinical symptoms of COVID-19 had detectable viral RNA following the RTq-PCR protocol from the Center of Disease Control and Prevention (CDC, Atlanta, USA). Sequencing of a new generation of isolates was carried out using the MiSeq platform (Illumina), according to Nascimento et al. (2020). The study was approved by the Research Ethics Committee of the Research Center for Tropical Medicine of Rondônia (CEP / CEPEM-RO), under opinion number 4,000,086. The research procedure followed all the stipulated recommendations, including obtaining the consent of the involved participants.

Analysis of mutations
Sequences of the viral isolates were aligned with the reference sequence for SARS-CoV-2 (NC_045512) in MEGA7 software (Molecular Evolutionary Genetics Analysis) 19 , under the action of the MUSCLE algorithm 20 . The positions where the nucleotide present in the samples diverged from that present in the reference sequence were marked as mutations. for infections in most parts of the world. This latter classi cation was also used by Forster et al. (2020) and, therefore, this organization of super-spreading groups was used in the present study.

Determination of the evolutionary group
The previous classi cation of SARS-CoV-2 strains isolated in the state of Rondônia/Brazil in SS groups had two main objectives: 1) to stratify the amount of genetic data needed to carry out evolutionary analyses and 2) greater precision in the choice and use of evolutionary models and the molecular clock detailing the evolutionary history, permitting non-generalized performance in a more speci c set of data. Therefore, 10 strains from each SS group were collected randomly and these, along with the 08 sample sequences, were included in a multiple sequence alignment performed by the online software MAFFT 23 (https://mafft.cbrc.jp/alignment/software/). In order to increase the quality of phylogenetic inference, a root was also included in the alignment.
In this context, Pipes et al. (2020) published a recent study evaluating the uncertainty of rooting methods in the phylogeny of SARS-CoV-2. They concluded that neither the outgroup method using the bat CoV sequences RatG13 and RmYN02, nor the molecular clock method, presented reliable probabilities of root determination. However, through root-to-tip regression, a great posterior probability (0.516) of root placement in an old Wuhan sequence was observed (Wuhan / IPBCAMS-WH-01/2019 -EPI_ISL_402123). Therefore, this sequence was also included in the alignment for rooting the inferred phylogeny. This alignment was called Dataset A and was made up of a total of 49 sequences.
Due to low homology at the extremities, the following analyses were performed based on the region of the genome that extends from position 55 to 29 838 (positions determined based on the reference sequence NC_045512). The alignment was used as an inference input for a non-clock tree based on Maximum Likelihood using the software IQtree v.1.6.12. 25 . The replacement model was selected using the integrated Model Finder tool 26 and based on the lowest Bayesian Information Criterion (BIC) value.
The TIM+F+I model (transition model, with empirical frequency of bases and heterogeneity of invariant sites) was listed as most suitable for dataset A. Branch support values were obtained using 2000 replicas from Ultrafast Bootstrap 27 , with a search for nearest neighbor interchange (NNI) to optimize each initialization tree and reduce the risk of overestimating branch support. The generated consensus tree was imported into Tempest v.5.1 software 28 , in order to verify the existence of a temporal signal in this data set. Once the existence of a time signal is detected, the alignment then becomes su cient for a molecular clock approach.
The molecular clock method was used for Bayesian inference of the phylogeny of evolutionary groups. The Lognormal and Exponential distributions of the Relaxed Non-Correlated Clock were tested. The replacement model used was the second best according to the estimate previously described, TN93+F+I (Tamura-Nei with empirical base frequency and heterogeneity of invariant sites), since the TIM model is

Detailing of the evolutionary history
Once the SS group of the strains was determined, the analyses continued to elucidate the detailed evolutionary history, through phylogenetic inference of the molecular clock with speci c representatives of the group. Based on the results of the analysis of mutations and of the evolutionary group, it was observed that the study group would be SS4. Therefore, all SS4 representatives that were identi ed by Yang and collaborators (2020) were included in this second analysis.
Additionally, with exceptional help from the interactive panel of the GISAID platform, it was observed that the GISAID clades 20A, 20B and 20C comprise isolates that have the signature mutations of the SS4 group and, consequently, all sequences belonging to these clades were collected, using June 26 th as the deadline. This collection was directed exclusively at strains isolated in South American countries, with Brazil being the only country whose sequences were collected at the State level. The use of this geographical lter at this stage of the study is justi ed due to the issue of urban mobility, a factor that may have been a facilitator and differential in the introduction of SARS-CoV-2 in the State of Rondônia.
Uncertainty and lack of resolution have been described regarding the phylogenies of SARS-CoV-2, due to the relatively small genetic diversity that has been accumulated during the short time of the outbreak 30 . Considering this, therefore, and in order to obtain good representativeness of genetic variability, locations with more than ten sequences available under the criteria previously mentioned, passed through two additional lters: 1) Maintain greatest disparity between the collection dates, to avoid sequences that are too similar and; 2) Maintain the representativeness of all pangolin lines available by location, ensuring a limited number of sequences per lineage, based on patient exposure location (States of Brazil: 10; other countries: 5). The whole process prioritized the highest degree of coverage between sequences when selecting them, not exceeding values less than 97.5% (considered average). At the end of the entire process, the alignment had a total of 307 sequences and was performed under the in uence of the online software MAFFT. The region of analysis was the same as that previously used (nt 55 -29 838). This second set of sequences was called Dataset B.
This phylogenetic inference followed a methodology similar to that used to determine evolutionary groups. In summary, a non-clock tree based on Maximum Likelihood was built using IQtree v.1.6.12 software under the previously described parameters. This, in turn, was used to visualize the existence of a temporal signal in the data set using the software Tempest v.1.5. Once a time signal was detected, the alignment was then imported into software from the BEAST package to perform a molecular clock approach. In this case, the Lognormal and Exponential distributions of the Non-Correlated Relaxed Clock were also tested. The replacement model used was GTR+F+I. The adjusted length of the MCMC for convergence of the parameters was 3 x 10 7 , with collection every 3000. The model used was Coalescent: Exponential Growth. The sample of trees was summarized, and the consensus tree was visualized/customized using FigTree v.1.4.4.

Analysis of mutations
Of the ten samples from patients with a con rmed diagnosis for COVID-19, 8 were successful in the new generation sequencing procedure, generating complete genomic sequences with mean coverage level > 99%. The analysis of mutations in the genome of these strains demonstrated the presence of a total of 22 alterations in different sites, one of which was found in a non-coding region. Among those found in coding regions, 12 are classi ed as non-synonymous mutations. They are found in 7 viral proteins: nsp1, nsp12, Spike, ORF3a, ORF6, ORF8 and Nucleoprotein. The complete list of mutations found, and the percent frequency of occurrence is shown in Table 1. Four mutations were found in 100% of the isolates: C241T, C3037T, C14408T and A23408G. With the exception of alteration C14408T, the others were classi ed as signature mutations for super spreader group 4 (SS4) identi ed by Yang and collaborators (2020). Therefore, according to this information, all samples from the present study that were isolated belong to the SS4 group. Phylogenetic analysis to determine evolutionary groups ( gure 2) con rms this classi cation.
The vast majority of the mutations found have no clinical/virological signi cance described in the literature, with some being considered unique. Among the known mutations in the ORF1ab gene, C14408T was found in 100% of the samples, which results in the replacement of a proline amino acid with a leucine at position 323 (P323L) of nsp12 RdRp (RNA-dependent RNA polymerase). Alterations in viral enzymes of this nature raise a level of concern, since they can cause resistance to drugs that have RdRp as a target, as previously described for hepatitis C, In uenza and also for one Coronavirus infection in mice treated with Rendesivir 31-33 .
However, the P323L alteration results in an amino acid with an isoelectric point similar to the wild type amino acid 34 , which may mean a not-so-signi cant change in the molecular structure of this protein. In addition, it is located outside the nsp12 RdRp catalytic region. However, it is located in a region equivalent to the SARS-CoV RdRp interface domain. This domain is supposedly implicated in the interaction with other viral proteins that can regulate the processivity of RdRp during the activity of Replicase Transcriptase Complex (RTC) 35,36 . RTC, in turn, has an interaction with the exonuclease nsp14, the protein responsible for reviewing viral RNA synthesis, and this interaction is important in the control of accurate RNA replication 35 . Thus, it is assumed that there is the possibility of an indirect in uence of C14408T on the viral mutation rate.
In addition, it was recently proposed in a pre-printed study with over 11,200 sequences that this alteration may be associated with an increase in the rate of viral mutations 37 . In addition to this study, another also observed an extremely high rate of nucleotide substitutions in a group of viruses descending from an isolated parent strain in Germany (Germany/BavPat1/2020|EPI_ISL_406862), the SS4 group, which acquired the C14408T mutation later and was more widespread to other European countries 22 . Thus, considering the information raised about it and the high frequency of occurrence of this substitution in the analyzed samples, further studies are needed to assess the role of this mutation in the delity of RdRp, whose errors can directly affect the long-term effectiveness of a vaccine and speci c antiviral drugs.
Among the modi cations found in the S gene, A23403G stands out. This non-synonymous mutation was found in 100% Other reasons have also been pointed out to justify this association: 1) Structural reason: the resulting conformational change can improve the membrane fusion step by facilitating the separation of the S1 domain from the S2 domain of the Spike protein bound to the receptor; 2) Immune reason: because it is a residue located in a region equivalent to the target epitope of antibody-dependent improvement in SARS-CoV, it is assumed that the binding with the antibody may alter the conformation of the protein and increase its interaction with the ACE-2 receptor and; 3) Genetic epidemiological reason: an increase in the isolation frequency of strains that contain the D614G mutation has been detected in several regions of the world, including detection of the G614 variant's prevalence in a matter of weeks in places where the D614 variant was previously prevalent 34,41 .
Considering that this mutation was rst identi ed on January 28, 2020 (Germany/BavPat1/2020|EPI_ISL_406862), this increase in frequency was also found in a direct genomic analysis of all 1539 SARS-CoV-2 genomes deposited in the GISAID platform between February 29 th and March 26 th . There was a prevalence of 56% of isolates belonging to the SS4 group, which hosts the D614G mutation as a signature characteristic of the group, showing the rapid dissemination of this variant over time 22 .
A recently published in vitro study that is currently in prepress performed comparisons of the functional properties of the D614 and G614 variations of the Spike protein, nding greater e ciency of infectivity with the G614 variant in the replication of pseudotyped retroviruses in cells that express ACE-2. The improvement was associated with a possible marked incorporation of Spike protein into the nal structure of the virus, which may therefore improve the transmission of SARS-CoV-2 between different hosts 42  Other changes were also found in the N gene of the strains analyzed. Three sequential nucleotide changes are highlighted: G28881A, G28882A and G28883C, which were found together in 75% of the samples. They result in the replacement of two amino acids of the viral Nucleoprotein, R203K and G204R (R -arginine; K -lysine; G -glycine). The potential effect of these mutations on viral and host processes has been investigated, and it has been observed that they result in considerable changes in the predicted binding with some miRNAs, which may play a role in in uencing the progress of the infection. Some of the miRNAs that bind to this mutated type of nucleoprotein may be under-regulated in several types of cancer. This increases the possibility that cancer patients may have a high susceptibility to the mutated variant due to a reduced ability to contain the virus, compared to the wild-type infection 40 .
Another alteration detected in the same gene is C29367T, found in three (37.5%) of the eight samples in the study. It is a nonsynonymous mutation that results in a P365L substitution. This mutation has not yet been described in the scienti c literature. When looking for other sequences with this mutation in Dataset B, it was observed that none of the strains included show this change. This leads to the assumption that it has appeared more recently in the viral evolutionary history of SARS-CoV-2. Because it was detected only in some sequences in Rondônia, it may have appeared after the virus entered the state and can be used as a marker to study viral spread among different municipalities in the state.

Evolutionary analysis
Temporal signal evaluation

Determination of evolutionary group
Phylogenetic analysis to determine evolutionary groups of SARS-CoV-2 strains in the present study, using dataset A, con rmed one of the conclusions previously obtained with the study of mutations: all samples were identi ed as belonging to the SS4 group, with a posterior support of the 100% cladistic distribution ( gure 2). The best distribution of the non-correlated relaxed clock was "exponential", chosen through the analysis of convergence of MCMC run parameters and tree topology.  Detailed evolutionary history Phylogenetic analysis to detail the evolutionary history of the SARS-CoV-2 strains from the present study was performed based on the relaxed correlated molecular clock model using dataset B. The "lognormal" distribution was chosen through the convergence analysis of MCMC run parameters and tree topology.
The inferred tree allowed us to observe that 75% of the strains isolated in the State of Rondônia belong to pangolin lineage B.1.1.; while the remaining 25% belong to line B.1. ( gure 3, A and B). This classi cation was supported by 100% of subsequent probability in determining the lineage at some cladistic level of the tree and provides support for the previous conclusion of the occurrence of at least two SARS-CoV-2 entry events in the State.
In order to avoid inaccurate conclusions regarding the introduction of SARS-CoV-2 in the State, details of the evolutionary path were obtained from information on clades that included samples with posterior support greater than 85%. Therefore, considering this criterion, it was not possible to fully detail the evolutionary history of the introduction of the B. C29367T alteration in their genome, which presumably arose after the introduction of SARS-CoV-2 in the State and which can be used as a source of information to study the form of dissemination. Therefore, it is presumed that this alteration occurred after passing, not necessarily directly, from 07 to 08 in the city of Porto Velho (place of residence of their respective carriers). Subsequently, there was a continuation of the transmission of strains that carry this mutation before reaching the list of samples 01-03. Since sample 03 was isolated from a patient residing in Porto Velho, and sample 01 was isolated from a patient residing in the municipality of Jaru (about 290 kilometers away from the capital), it is assumed that a strain was transmitted from Porto Velho to Jaru.

Conclusion
This study presented the genetic data of the rst 08 SARS-CoV-2 sequences isolated in the state of Rondônia/Brazil, located in the southern portion of the Western Amazon. It was possible to determine at least two events of viral introduction into the state, corresponding to strains B.      1.67. and B.1.8. The study samples are colored black, along with the sequence used for rooting the inferred tree. In each node, the subsequent probability rate for supporting the branches in decimal data is