Unravelling the epidemiological diversity of Zika virus by analyzing key protein variations

The consequences of Zika virus (ZIKV) infections were limited to sporadic mild diseases until almost a decade ago, when epidemic outbreaks took place, with quick spread into the Americas. Simultaneously, novel severe neurological manifestations of ZIKV infections were identified, including congenital microcephaly. However, why the epidemic strains behave differently is not yet completely understood, and many questions remain about the actual significance of genetic variations in the epidemiology and biology of ZIKV. In this study, we analysed a large number of viral sequences to identify genes with different levels of variability and patterns of genomic variations that could be associated with ZIKV diversity. We compared numerous epidemic strains with pre-epidemic strains, using the BWA-mem algorithm, and we also examined specific variations among the epidemic ZIKV strains derived from microcephaly cases. We identified several viral genes with dissimilar mutation rates among the ZIKV strain groups and novel protein variation profiles that might be associated with epidemiological particularities. Finally, we assessed the impact of the detected changes on the structure and stability of the NS1, NS5, and E proteins using the I-TASSER, trRosetta, and RaptorX modelling algorithms, and we found some interesting variations that might help to explain the heterogeneous features of the diverse ZIKA strains. This work contributes to the identification of genetic differences in the ZIKV genome that might have a phenotypic impact, providing a basis for future experimental analysis to elucidate the genetic causes of the recent ZIKV emergency.


Introduction
Zika virus (ZIKV), a member of the genus Flavivirus, is a single-stranded positive-sense RNA virus with a 10.8kb genome that is translated as a single polyprotein. This polyprotein is subsequently cleaved into structural (capsid [C], premembrane/membrane [prM/M], envelope [E]) and non-structural proteins (NS1, NS2A, NS2B, NS3, NS4A, p2K, NS4B, and NS5) [1]. ZIKV was discovered in Africa in 1947, and the consequences of its infection of humans seemed to be restricted to sporadic cases associated with self-limiting mild fever or asymptomatic infections [2]. However, the situation changed in 2013 when ZIKV also began to be associated with severe disease during some epidemic bursts reported in the Pacific islands [3][4][5]. Later, when it was introduced into the Americas, ZIKV started to be recognized as a pathogen with a strong impact on public health. At that time, a large number of cases occurred in Brazil, after which it quickly spread to many Central and South American countries, as well as to Mexico and the Caribbean [6]. During the most recent outbreaks, previously unrelated serious neurological manifestations were associated with ZIKV infections, including Guillain-Barré syndrome in adults and teratogenic neurological disorders in neonates, mainly microcephaly, which pointed towards the possibility of vertical transmission [7][8][9]. The detection of ZIKV in both fetal brain tissue and amniotic fluid, as well as the ability of the virus to infect neuronal progenitor cells, strongly supported this theory [5,10]. Moreover, in addition to its well-known vectorial transmission by mosquitoes, principally by members of the genus Aedes, several studies have provided evidence of sexual transmission of ZIKV, which had not been observed with any flavivirus [8,11,12]. Furthermore, ZIKV was detected in different human fluids, indicating extensive dissemination in humans and suggesting the possibility of multiple infection routes [13,14]. Phylogenetic analysis of different ZIKV isolates has indicated the existence of two major lineages: the African and the Asian lineages [3]. Interestingly, the strains responsible for the epidemic burst belong to the Asian lineage [8,15], which is also the only lineage that has so far been associated with microcephaly cases and other neurological disorders [16]. Despite the recent global emergency and the novel severe associated diseases, it is still not understood why the Asian strains that caused the most recent outbreaks behave differently from previously circulating variants. To address this question, several studies have been carried out to determine the relevance of genetic variations present in various epidemic ZIKV isolates and their potential effects on pathogenesis. Some of those studies were based on Brazilian strains, involving only a limited number of ZIKV sequences that were available at that time [17,18]. Subsequently, with the spread of the ZIKV outbreak and the use of next-generation sequencing technology, many more sequences became available in public databases, and more-complex analysis were performed to understand the genetic diversity of ZIKV [19][20][21]. Many of those findings are documented on the website https:// nexts train. org/ zika. However, the biological impact of these genetic variations it is still not clear, and it is still not obvious how they are related to the novel features of the epidemic strains.
The growing bulk of recently identified mutations and the lack of a sufficient understanding of the relevance of these findings indicate the need for further characterization and analysis of genetic polymorphisms in the ZIKV genome. In this context, we used a series of bioinformatics tools to make the analysis of ZIKV sequences feasible, aiming to identify genes with different levels of variability and to look for patterns of genomic variation that might be associated with the biological and epidemiological diversity of ZIKV. We compared a large number of epidemic strains with pre-epidemic ones from both Asian and African lineages. In addition, we investigated genetic variations within the specific subgroup of epidemic ZIKV strains derived from microcephaly cases. We detected different rates of nucleotide changes for each ZIKV strain group and subgroup. Moreover, we identified viral genes with different degrees of variability and found non-synonymous variation profiles that might be associated with the epidemiological and biological characteristics of the diverse ZIKV strains, some of which were predicted to affect the stability of viral proteins. Finally, we selected the most likely epidemiologically relevant variations to perform deeper structural analysis and found that some amino acid variations were potentially linked to changes in the functionality of viral proteins. This work contributes to the identification of genetic variations that might be significant for understanding the pathogenicity of ZIKV, helping to lay a foundation for future experimental analysis.

Database generation
A set of complete ZIKV genome sequences available on 15 March 2020 was collected from different online platforms, including the National Center for Biotechnology Information (NCBI) Viral Genome Database (https:// www. ncbi. nlm. nih. gov/ labs/ virus/ vssi/#/) and the Virus Pathogen Resource (ViPR) Genome Database (https:// www. viprb rc. org/ brc/ home. spg? decor ator= flavi_ zika). This set was composed of 779 sequences from different locations with known collection dates. These sequences were initially curated based on the sequence length, which had to be longer than 10,000 nucleotides to ensure that the database only included genomic sequences spanning the entire coding region of the ZIKV genome. In addition, all sequences with indeterminate regions constituting more than 1% of the total were discarded. In this way, all of the genes and regions of the genome were equally represented to improve the subsequent alignment process. The collection date was used to identify isolates that were obtained before or during the last ZIKV outbreak (since 2013) to discriminate between epidemic and pre-epidemic strains. In addition, the lineage, geographic location, and isolation date of each sequence were reviewed to identify cases of resequencing or repetitive sequences in order to avoid overrepresented sequences. The type of sample used for virus isolation was also considered in order to identify strains associated with severe neurological disorders such as microcephaly. Finally, after the curation process, the sequences were classified into the following groups: The "African" group contained sequences of the African lineage, which so far has not been associated with any outbreak. The "Asian pre-epidemic" group consisted of sequences from the Asian lineage isolated before 2013, the year when ZIKV began to be associated with severe pathology and epidemic bursts [22]. The "Asian epidemic" group contained sequences of the Asian lineage that had been isolated since 2013. Moreover, we created a subgroup called "brain" from the Asian Epidemic group, comprising sequences that were isolated from brain tissues of neonates with microcephaly.
Given that ZIKV has only acquired public health relevance during the past decade and was rarely isolated from patients prior to the epidemic in French Polynesia, there are a relatively small number of pre-epidemic sequences available in public databases. Furthermore, since the frequency of severe clinical manifestations during outbreaks was relatively low compared to oligosymptomatic cases, and because of the complexity of sampling brain tissues from neonates, there are also only a limited number of available sequences of isolates obtained from brain tissue of microcephaly cases. Consequently, most of the sequences in public databases belong to the Asian epidemic group. In this study, the African group had seven sequences, the Asian pre-epidemic group had three sequences, the Asian epidemic group had 558 sequences, and the brain subgroup had seven sequences. The accession numbers of the collected sequences are listed in Supplementary Table S1.

Consensus sequence for the Asian epidemic group
The large and increasing number of sequences available for members of the Asian epidemic group makes comparisons to other groups difficult. Therefore, in order to determine the most frequent alleles for the epidemic strains and to compare them with other ZIKV groups, we generated an Asian epidemic consensus sequence based on 558 sequences of Asian lineage isolates obtained during epidemic bursts. First, using MEGA X software [23], we computed a pairwise distance matrix for the whole Asian epidemic group in order to quantify the similarity among the sequences of this group. The overall mean distance for these sequences was 0.006 using the maximum-likelihood algorithm, confirming the high degree of sequence conservation among the epidemic strains (Supplementary Table S2). Next, all of the sequences of this group were aligned using two independent methods to increase the precision and reliability of the result. On one hand, the sequences were aligned using the MUSCLE algorithm with the default settings [24], and a consensus sequence was extracted using the software EMBOSS cons v.6.5.7 [25]. On the other hand, the sequences were also aligned using the Burrows-Wheeler Aligner maximal exact matches (BWA-mem) algorithm with default settings [26]. The alignment was visualized and a consensus sequence was extracted using Integrative Genomics Viewer v.2.6.3 (IGV) software [27]. The consensus sequences obtained by both methods were identical and are provided in Supplementary Text File S1, and the alignment is shown in Supplementary Text File S2.

Screening of variations
Given the current large number of genomic sequences available, the patterns of mutations and their relationship to the reported changes in ZIKV biology and epidemiology have become harder to analyze using conventional methods such as dendrograms [22]. Therefore, we used a graphical approach to facilitate the identification of genetic variations that could be associated with the epidemiological and biological characteristics of the different groups of ZIKV strains.
The genome sequences of the African, Asian pre-epidemic, and brain groups were aligned against the Asian epidemic consensus sequence using the BWA-mem algorithm with the default settings [26]. These alignments were visualized using IGV software [27] to verify the alignment. Individual nucleotide variations were then identified using the bcftools mpileup v.1.7 and bcftools call v.1.7 software packages [28]. Next, the translational consequences of each detected variation were predicted using bcftools csq v.1.7 software [28]. The correct reading frame and the gene annotations were obtained from the GenBank database (accession number NC_035889). The type, genomic coordinates, translational consequences, and frequency of each variation within the different groups were saved in variant call format (vcf) files, and a multi-dimensional database was generated to establish possible relationships between these changes and the variations in ZIKV properties.

Quantification of sequence variability among ZIKV groups
The variations detected in each ZIKV group were divided into synonymous and non-synonymous substitutions. The number of each type of substitution was determined for each ZIKV group and normalized to the number of sequences in the group. This last step allowed the results from groups of different sizes to be compared. This analysis was performed using an in-house code written in R with the tidyverse library [29], available at https:// github. com/ Santi 8Leiva/ Zika.
The intrinsic variability was estimated by comparing sequences from the Asian epidemic group to the consensus sequence. In order to avoid the bias that would be produced by analyzing groups with significant size differences, the Asian epidemic isolates were randomly split into subgroups of seven sequences. The sequences of each subgroup were aligned to the Asian epidemic consensus sequence and processed using the same software packages and settings described above. The variations detected in the different random subgroups were subsequently combined and analyzed as described above to estimate the intrinsic synonymous and non-synonymous substitution rates in the Asian epidemic group.

Analysis of variability of ZIKV genes
The total number of synonymous and non-synonymous variations was determined for each gene. The results were normalized to the gene length to allow comparisons between genes of different length. In order to obtain a quick and complete visualization of the results, these normalized values were graphed in a point plot. To evaluate the influence of gene length on the acquisition of variations, the point sizes in the graph were made proportional to the absolute number of variations detected in each particular gene. Also, the total number of synonymous and non-synonymous changes detected for the whole coding region of the genome was determined and normalized to the length of the coding region. This analysis was performed using an in-house code written in R with the tidyverse library [29], available at https:// github. com/ Santi 8Leiva/ Zika.

Screening of non-synonymous variation profiles
The non-synonymous variations detected by alignment of the Asian epidemic consensus sequence with those of the remaining ZIKV groups were plotted on the ZIKV genome using their genomic coordinates. We employed a concise and intuitive graphical representation, a lollipop-diagram, to achieve a precise visualization of both the large number of variations detected and the relationships among multiple variables. The gene annotations were obtained from GenBank database (accession number NC_035889). Each variation was represented as a circle, and a color code was used to distinguish the groups of sequences analyzed. This analysis was performed using a code written in R with the trackViewer library according to the protocol published by the author [30].

Protein 3D structure modeling and validation
In order to generate 3D models of each ZIKV protein, we applied different modelling algorithms, including I-TASSER, trRosetta, and RaptorX. I-TASSER uses multiple threading alignment approaches to build up the full-length protein model structures [31], while trRosetta predicts interresidue orientations and distances from co-evolutionary data applying, deep-knowledge algorithms [32], and RaptorX predicts a contact and distance matrix using a deep convolutional residual neural network [33]. The Asian epidemic consensus sequence was used as the input for each protein structure prediction. The resultant structures were then visualized using YASARA View 20.12.24. The validation servers Rampage [34] and QMEANBrane [35] were used to assess the predicted models. The TM-align algorithm was also employed to identify the best models [36].

Protein stability analysis
To predict the effect of non-synonymous variations on ZIKV protein structure and molecular stability, FoldX 5.0 software [37] was run locally in order to estimate the change in the Gibbs free energy of protein folding (ΔΔG) between the protein models from the Asian epidemic consensus sequence and each individual non-synonymous variant (ΔΔG = ΔG individual variant − ΔG asian epidemic consensus ). The best protein models generated for each ZIKV protein were used as starting points. Each of the non-synonymous variations was evaluated individually using the FoldX -BuildModel function [37]. ΔΔG calculations were executed on previously optimized structures and were performed 10 times. All reported ΔΔG values represent the average of these independent runs. Other options were set to default. Each nonsynonymous change was categorized as stabilizing (< −1.5 kcal/mol), neutral (− 1.5 to + 1.5 kcal/mol), or destabilizing (> +1.5 kcal/mol). These cutoff values were established based on the reported experimental standard error for the technique (ΔΔG ∼ 0.5 kcal/mol) [37,38]. The most extreme changes that were also present at high frequency within the ZIKV group were visualized using YASARA View 20.12.24 [39].

Different variability levels among ZIKV groups
Amongst the ZIKV genomic sequences currently available, we defined three groups and one subgroup of isolates with similar epidemiological characteristics, as described in Materials and methods (Supplementary Table S1), and we estimated the degree of divergence of the different ZIKV groups from the Asian epidemic consensus sequence. As shown in Fig. 1, synonymous variations were more frequent than non-synonymous ones in each ZIKV group (ratios of synonymous to non-synonymous changes: African, 8.8/1; Asian pre-epidemic, 8.8/1; brain, 3.8/1; Asian epidemic, 8.2/1). The African group showed the most divergence from the Asian epidemic consensus sequence with an average of 1244.9 nucleotide changes per sequence; which was expected, since these sequences belong to different lineages. The second most divergent group from the Asian epidemic sequence was the Asian pre-epidemic group, with an average of 250 variations per sequence, representing a fifth of the variability exhibited by the African group (Fig. 1). The brain subgroup, a specific Asian epidemic subset associated with severe pathological outcomes, had an average of 36 variations per sequence when compared with the epidemic isolates, which is slightly higher than the total internal variations in the Asian epidemic group (31.3 variations per sequence). However, the brain subgroup had twice as many non-synonymous changes as the average within the Asian epidemic group: 7.4 versus 3.4 non-synonymous variations per sequence, respectively (Fig. 1, red bars). This suggests a high level of variability for this specific subgroup regarding to the other Asian epidemic isolates. Altogether, the data shown in Fig. 1 indicate different degrees of divergence of the different ZIKV groups from the Asian epidemic consensus sequence (Fig. 1).

Different variability rate among ZIKV genes
To determine the degree of variability of each particular ZIKV gene, we analyzed the distribution of the detected variations throughout the viral genome while distinguishing the rates of synonymous and non-synonymous changes for each epidemic group. To obtain a complete visualization of the results, the rates of synonymous and non-synonymous variations were graphed in a point plot (Fig. 2A). The genes with the highest accumulated number of variations in all of the ZIKV sequence groups were the largest ones, such as NS5 and NS3 (see size points in Fig. 2A). The synonymous variation rates were higher than the non-synonymous ones for each gene in all ZIKV groups ( Fig. 2A), in agreement with the data shown in Fig. 1.
The numbers of genetic changes in each gene were then normalized to the corresponding gene length in order to compare genes of different lengths. In this way, we were able to identify highly conserved and hypervariable genes. We found that in all of the ZIKV groups, the genes with the lowest non-synonymous variation rates compared with the Asian epidemic group were p2K and NS4A (Fig. 2 and Supplementary Table S2). Curiously, no non-synonymous variations were detected in p2K for any of the groups, indicating that their amino acid sequences were identical to the Asian epidemic consensus sequence. The gene with the second-lowest non-synonymous variation rate was NS4A, in which no non-synonymous changes were detected in the Asian pre-epidemic and brain subgroups in comparison with the Asian epidemic group. However synonymous changes were detected in both p2K and NS4A in all of the groups ( Fig. 2A, cyan points).
The C and prM genes exhibited the highest non-synonymous variation rates (Fig. 2). The prM gene had the highest non-synonymous variability in the Asian pre-epidemic and brain groups, while the C gene had the highest nonsynonymous variation rate in the African group. Among the non-structural genes, the NS2A, NS4B, and NS1 genes, showed the highest variability in the African, Asian preepidemic, and brain groups, respectively, when compared to the Asian epidemic consensus sequence (Fig. 2). Interestingly, the most variable genes code for proteins that have been shown to be involved in pathogenesis, and hence, the detected variations may impact the development of ZIKVassociated disease [40][41][42][43][44][45][46][47].
These results demonstrate that ZIKV genes have different plasticity to support changes in their encoded proteins. Therefore, while some of them are able to accumulate a large number of non-synonymous variations, others are tightly conserved.

Profiles of non-synonymous variations
To investigate whether specific genetic changes could be associated with distinct biological and epidemiological characteristics, we generated a lollipop diagram. We employed this graphic tool to analyze the distribution, translational consequences, and frequency of the non-synonymous variations present in the different ZIKV groups in comparison with the Asian epidemic consensus sequence (Fig. 3). The resulting profiles revealed probable epidemiologically significant non-synonymous changes. Several of these variations were detected only in sequences belonging to the African (143 variations) or Asian pre-epidemic (31 variations) group ( Fig. 3 and Supplementary Table S3). Interestingly, we found variations that, while being exclusive to a certain group, were detected in all of the sequences belonging to either the African (41 variations) or the Asian pre-epidemic (4 variations) group (in blue and green, respectively, in Fig. 3 and Supplementary Table S3), suggesting that these changes are representative of these groups. It is important to note that the M763V mutation in the E protein, which was detected in all of the African and Asian pre-epidemic sequences (in yellow in Fig. 3 and Supplementary Table S3). This variation could be interpreted as representative of all pre-epidemic isolates, regardless their lineage. The number of such unique variations found was much larger in the African strains than in the Asian pre-epidemic isolates.  The graph shows the identity, distribution, predicted consequences, and frequency of non-synonymous variations in comparison to the Asian epidemic consensus sequence. The horizontal axis represents the ZIKV genome, with the genes annotated. Non-synonymous variations are represented as a circle and associated by a color code with the specific ZIKV groups in which the variations were detected. The colored fraction of each circle represents the frequency of the variant within the group. Each circle is labeled with the corresponding amino acid of the Asian epidemic consensus sequence and the translational consequence predicted for the variation found at this genomic position. The colors of the fonts indicate variations that are representative of and restricted to a ZIKV group (gray, none; blue, African; green, Asian pre-epidemic; yellow, both African and Asian pre-epidemic).
In addition, 47 non-synonymous variations were detected in the brain isolates when compared to the Asian epidemic group. However, none of these variations were common to all of the sequences of this subgroup, and all of them occurred at low frequency, as indicated by red circles in Fig. 3. Therefore, none of them appear to be representative of the neuropathology-prone isolates, suggesting that none of these changes would individually be sufficient to cause severe pathology.

Impact of amino acid changes on viral protein stability
To assess the potential structural effect of the predicted amino acid substitutions, we evaluated changes in the predicted Gibbs free energy of protein folding (ΔΔG) in the corresponding protein structure model [48,49].
As shown in Fig. 4, only 14% (40 of 275) of the nonsynonymous variations were predicted to affect the stability of the corresponding viral proteins ( Fig. 4 and Supplementary Table S3), consistent with the observation that ZIKV is under strong negative selection pressure.
Moreover, only four of the mutations predicted to affect protein stability were present in all of the sequences of a particular group: M763V, I459V, V988A, and S3162P (in red in Fig. 4). To our knowledge, the impact of these variations on the corresponding protein structure has not been evaluated previously. We considered these variations particularly interesting because they not only disturb the protein structure but are also common to all members of a particular ZIKV group. We therefore modeled the corresponding viral proteins based on the Asian epidemic consensus sequence and evaluated the potential effects of these mutations (Fig. 5).
The S3162P variation (ΔΔG = −2.55 kcal/mol) is located in the vicinity of the catalytic site of the RNA polymerase domain of the NS5 protein [50] (Fig. 5A). All of the African isolates and some of the Asian pre-epidemic strains have a proline (P) at this position in place of serine (S). As shown in Fig. 5B, this difference might stabilize the NS5 protein despite the fact that some hydrogen bonds are disturbed, in particular, the hydrogen bond between serine 3162 and glutamate 3163, because the proline at position 3162 fits better into a hydrophobic pocket created between two adjacent alpha-helices, involving leucine 3159, valine 3165, and valine 2894 (Fig. 5B).
The V988A variation (ΔΔG = 1.76 kcal/mol) is located in the NS1 protein within the oligomerization domain [51] (Fig. 5C). In this case, all of the African isolates had alanine (A) at this position instead of valine (V). This mutation might destabilize the NS1 protein by disrupting some hydrogen bonds, especially one connecting two beta sheets, involving serine 990 and glycine 979. This change would also cause some large neighboring residues, such as tyrosine 994 and tryptophan 995 to move to higher-energy positions (Fig. 5D).
The other two potentially significant mutations were in the envelope protein, but in different domains. The mutation I459V is in domain I, and M763V is in domain III [52] (Fig. 5E), and both were predicted to be destabilizing. The change from isoleucine (I) to a valine (V) at position 459 (ΔΔG = 1.75 kcal/mol) was present only in the African isolates. The modeling results suggested that this valine does not fit as well as isoleucine into the hydrophobic pocket created among the neighboring tyrosine 427, isoleucine 429, valine 457, alanine 466, and leucine 478 residues. Moreover, this variation would cause some of the larger neighboring residues, such as tyrosine 427, to move to a higher-energy position (Fig. 5F). On the other hand, in the variant M763V (ΔΔG = 2.88 kcal/mol), the presence of valine, which has a less-flexible side chain than methionine (M), would result in destabilizing interactions between two adjacent alphahelices. The modeling results showed that the presence of valine at this position would result in Van der Walls clashes between two adjacent alpha-helices, especially against leucine 781 (Fig. 5G). Notably, this change was detected in all of the African and Asian pre-epidemic sequences analyzed, suggesting that this variation is characteristic of preepidemic strains.

Discussion
The unprecedented ZIKV outbreaks that began on some Pacific islands and then spread to the Americas were associated with changes in the biology and the epidemiology of this virus [15]. Several studies have attempted to identify genetic changes related to different clinical outcomes of infection. However, so far, no clear associations have been found. Here, we used a strategy that facilitates the analysis and comparison of the numerous viral sequences available in order to find associations between genetic variations and pathogenic properties of the virus. By generating a consensus sequence for the strains responsible for outbreaks worldwide, we indirectly compared this large set of Asian epidemic isolates with African and Asian pre-epidemic isolates and with a specific subgroup of the Asian epidemic strains that were associated with neurological diseases.
Our analysis showed that the number of synonymous variations in each of the groups was greater than the number of non-synonymous ones. This observation agrees with the fact that synonymous changes are only marginally selected and hence tend to be more abundant than non-synonymous ones. This indicates that negative selection plays an important role in ZIKV evolution [53]. The African group was found to contain the largest number of non-synonymous changes, which was expected, given that these sequences belong to a different lineage than the rest of the isolates. Interestingly, the brain subgroup showed a significantly larger number of non-synonymous changes than the Asiatic epidemic group as a whole, indicating a higher mutation rate in this specific subgroup.
Analysis of the distribution of variations in each viral gene showed that synonymous mutations were more common than non-synonymous ones in all of the genes, and this was true of all of the ZIKV groups analyzed. As expected, the largest genes contained the largest number of variations, consistent with these changes having been introduced randomly during viral replication [54]. However, when the number of mutations was normalized to gene size, this relationship changed, and different patterns of variation were observed, allowing us to identify conserved regions that may play crucial roles in the ZIKV life cycle. In particular, the NS4A and p2K genes were found to be the least variable in all of the groups, highlighting the importance of these genes in the ZIKV life cycle. NS4A has been shown to be a strong antagonist of the type I interferon pathway in different flaviviruses, including dengue virus [55], Japanese encephalitis virus [56], and ZIKV [57][58][59]. Thus, this protein is an important viral immunomodulator that plays important roles in flavivirus pathogenesis. Furthermore, in the case of ZIKV, it has been reported that NS4A is able to cause microcephaly in Drosophila flies [60]. The functional importance of the ZIKV p2K gene is not well understood, but a few reports have suggested that it could have an important function during the early stages of the viral cycle of dengue virus, being essential for the correct assembly of the replication complex [61]. In addition, non-synonymous mutations in the West Nile virus p2K gene have been associated with resistance to several antiviral agents [62,63].
In contrast to the highly conserved genes mentioned above, we found that the prM and C genes, both of which encode structural proteins, were the most variable. prM is required for proper virion assembly and maturation [64], and the C protein forms the nucleocapsid that packages the viral RNA [65]. Alterations in both of these genes have been associated with changes in ZIKV pathogenesis, and some mutations in prM have been shown to increase the neurovirulence of ZIKV both in in vivo and in vitro models [40]. Interaction of the C protein with neuronal cell proteins has been reported to affect brain development and alter important neurogenic processes, an effect that can be suppressed by certain amino acid substitutions [43]. Thus, the observed high rates of non-synonymous variations in the ZIKV prM and C genes might be associated with either a loss or gain of function. However, since all of the sequences analyzed in this study were from infected patients, they all represented variants that had maintained the functions necessary to complete their viral cycle.
To analyze the distribution of non-synonymous variations within each viral gene, as well as their frequency in each group, we used a lollipop graph, which has proven to be a useful tool for studying genetic mutations in cancer [66]. This analysis showed that many of the variations were restricted to the African or Asian pre-epidemic group. These might therefore be linked to the epidemiological particularities of each of these groups. One specific variation in particular (M763V) was found in all of the sequences belonging to the African and Asian pre-epidemic groups and might therefore be a representative marker of pre-epidemic isolates. It would be worthwhile to analyze the biological impact of these non-synonymous changes in further studies in order to determine the reasons for the apparent heterogeneity among ZIKV strains.
Most of the non-synonymous mutations described here have not been yet investigated with respect to the epidemiological or biological diversity of ZIKV, but some have been shown previously to have a significant impact on ZIKV pathogenesis. The V982A variation in the NS1 protein is of particular interest because Asian strains with valine (V) at position 982 are more pathogenic than those with alanine (A) at this position [45,46]. In our study, we found that valine 982, which is associated with virulence, was the most common residue among the Asian epidemic isolates, whereas all of the Asian pre-epidemic sequences had alanine at this position. Therefore, our study contributes to linking the previously reported biologically significant Asian variants, such as V928A [45,46], to epidemiological features related to differences between epidemic and pre-epidemic Asian isolates.
Despite the identification of a significant number of non-synonymous variations in the brain subgroup, all of them had a low frequency within the group. Therefore, we could not identify any changes that were representative of the group that would serve as a marker of neuropathogenicity. However, we cannot yet rule out that the detected changes could modify the virulence of some Asian ZIKV strains and thus contribute to the novel pathological features. For instance, it has been reported that ZIKV strains encoding an asparagine at position 139 of the ZIKV polyprotein (prM protein) showed greater neuroinvasiveness and neurovirulence in mouse models than those encoding a serine (N139S variant), and hence, the variant 139N has been proposed to be important for the occurrence of severe neurological disorders [40]. Accordingly, our results showed that sequences unrelated to epidemic bursts had serine at this position. However, a minority of the brain strains analyzed also had the putatively less pathogenic variation. Therefore, this mutation could be relevant while not being essential for pathogenicity. It is therefore likely that most ZIKV strains would be capable of causing severe manifestations but that the final disease outcome would be determined by a complex interplay between the pathogen and its host, as has been suggested previously by other authors [67].
Finally, to assess the structural impact of specific nonsynonymous mutations, we performed in silico experiments to calculate the effect of each amino acid change on the stability of the corresponding viral protein. We found that only a few of these changes would have a significant effect on protein stability. This indicates that ZIKV is under strong negative selection pressure, making variations that do not substantially affect protein structure more likely to spread into the viral population. We focused on specific variations that affect protein stability while also being epidemiologically representative of a certain group of ZIKV isolates. To our knowledge, the structural impact of these variations on the corresponding protein had not yet been evaluated. In the case of the S3162P variation in the RNA polymerase domain of the NS5 protein, which is representative of the African group, proline at this position significantly stabilizes the NS5 protein compared to serine at the same position. Thus, the African NS5 might be more stable than the Asian epidemic one. Moreover, since NS5 is responsible for replication of the viral genome, this amino acid change could be related to the higher replication rates of African strains than Asian epidemic strains shown in vivo and in vitro by several independent authors [68][69][70].
Other variation of interest was V988A in the NS1 protein. This change was restricted to and representative of the African lineage. Modeling analysis indicated that the presence of alanine at position 988 would destabilize the NS1 protein compared to valine at the same position. This change is located in a beta ladder domain that is responsible for NS1 oligomerization [51]. Since the functions of NS1 depend on its oligomeric state, changes in this domain could have a direct impact on the function of this protein. In fact, other variations in this region have been shown previously to affect NS1 stability and functionality [45,46,71]. In particular, the presence of valine at position 982, which we found to be more common in Asian epidemic strains, has been shown to be associated with increased infectivity in mosquitoes [45] and enhanced antagonism of IFN-β induction [46]. Therefore, it stands to reason that other variations in this region, such as V988A, which is representative of and restricted to the African group, could be related to the biological differences between the African and the Asian epidemic strains. We therefore consider the V988A variation worthy of further analysis.
Two of the variations analyzed affected different domains of the E protein. The I459V mutation is in domain I, which plays a crucial role in essential conformational changes in the E protein that occur during the ZIKV infection process [51]. The presence of valine at position 459, which was restricted to and representative of the African group, was predicted to be destabilizing compared to isoleucine at the same position. Modeling of selected ZIKV protein variants. (a, c, and e) Protein models of the NS5, NS1, and E protein, respectively. Purple circles indicate the locations of amino acid variations. (b, d, f, and g) Detailed views of areas surrounding the variant amino acids. Backbone colors: green, beta-sheet; blue, alpha-helix; cyan, loop. Amino acid residue colors: red, changed residues; orange, residues disturbed by the variation. Atom colors: blue, nitrogen; cyan, carbon; green, sulfur; purple, oxygen; yellow, hydrogen bonds ◂ The other variation, M763V, is positioned in domain III, which is the primary viral receptor-binding region [51]. In this case, the presence of valine at position 763, which was predicted to be destabilizing compared to methionine in the same position, was detected in every African and Asian pre-epidemic sequence analyzed. Thus, the 763V variation might be representative of the pre-epidemic strains. Considering the importance of the E protein for viral entry and neutralization by antibodies, these two variations might affect early viral infection processes, especially receptor binding and subsequent entry of the virus into the target cell [72]. Furthermore, due to the epidemiological relevance of these variations among the pre-epidemic strains, it is possible to associate them with the differences reported in the aforementioned processes between the epidemic and pre-epidemic strains [68][69][70].
In conclusion, this study contributes to the identification of genetic variations in ZIKV that may have a significant phenotypic impact and constitutes an important basis for future experimental analysis to elucidate the complex novel epidemiological and biological features of ZIKV.
Author contributions SL planned and performed bioinformatics analysis, analysed data, and wrote the paper. MBV planned and analysed the data and wrote the paper. DG supervised the project, analysed data, and wrote the paper.
Funding Marina Bugnon Valdano and Santiago Leiva were supported by fellowships from CONICET and Agencia de Promoción Científica y Tecnológica, respectively. This work was supported by a research grant from the Agencia de Promoción Científica y Tecnológica (Argentina, PICT 2016-1725, PICT 2019-03050).
Data availability All data generated or analysed during this study are included in this article and its supplementary information files, and the code used for the analysis is publically available in the GitHub repository at https:// github. com/ Santi 8Leiva/ Zika.