Cosmopolitan genotype demonstrated inter-continental dispersal through an extensive spatial network
A spatial analysis of DENV-2 sequences of all known genotypes was conducted to determine their distribution patterns. All complete E gene sequences (1,485 bp) of DENV-2 as of July 2019 were retrieved from the NCBI database and tagged with the country of origin (location) and year of isolation (time) information available in NCBI flat files. The final alignment included 4,551 sequences. Each sequence was classified into a known genotype, based on the phylogenetic relationship. Sequences of each genotype were mapped to determine the global distribution (Figure 1). The analysis demonstrated a variable degree of geo-distribution of each genotype. Cosmopolitan genotype was the most widely distributed, though was remarkably absent in the Americas. Asian II and Asian-American genotypes were present in Southeast Asia as well as Central and South America. American genotype was the least widespread. Asian I genotype was particularly restricted to Southeast Asia. Southeast Asia showed the highest genotype heterogeneity and was the most notable global epicenter of DENV-2 transmission.
Bayesian Stochastic Search Variable studies (BSSVS)-based phylogeography analyses were conducted to determine the spatio-temporal dispersal and the most active hubs of cosmopolitan genotype (Figure 2). The analyses showed an extensive dispersal network of short- and long-distance links. All BSSVS links achieved high statistical support (Bayes Factor >3 and posterior probability >=0.75). Inter-continental dispersal was notable among the long-distance links.
The BSSVS findings reflected the epidemiological reality of cosmopolitan genotype’s presence in a wide geographical scale [20, 33, 52-59]. The most intense virus activity was visible in Southeast Asia, mainly in two epicenters; first, an area covering Singapore, Malaysia and Indonesia and the second centered on Taiwan and China in the Far East. Both epicenters actively exchanged virus strains with Australasia and the Pacific regions. The Far East hub also demonstrated long-range dispersal links with East Africa. However, both hubs were not linked to the Middle East. Instead, Middle East demonstrated alternative dispersal links with the Indian sub-continent. Indian sub-continent also showed decisive links with Southeast Asian and Far-East epicenters, suggesting active virus exchange between the sub-continent and other regions. The extensive network demonstrated how the global trade and travel have facilitated the geo expansion of DENV-2 cosmopolitan genotype. The long-range dispersal of DENV is possible either through the migration of humans, infected mosquitoes or desiccated mosquito eggs. Human migration is unarguably the more likely mode of distant virus dispersal than live mosquito transfer. Virus dispersal through desiccated mosquito eggs is less plausible than other modes due to limited evidence of vertical transmission of DENV [60]. The rapid expansion of air travel in recent decades has immensely facilitated human migration and thereby DENV dispersal [61]. For instance, Singapore is one of the most connected global aviation hubs and showed numerous inbound and outbound dispersal links in our BSSVS analysis (Figure 1). China’s economic and trade expansion resonates well with its long-distance outbound virus dispersal to the African continent. The substantial dispersal of DENV-2 from Indian sub-continent to the Middle East might be due to increased trade and pilgrimage travel [62].
Overall the extensive dispersal of cosmopolitan genotype suggests its ability to adapt to vector and host populations in a wide geographical range. This contrasts with DENV-2 genotypes such as American and Asian I, which are largely restricted to certain geographies [5]. However, cosmopolitan genotype seemed to have struggled to establish in locations where these geographically restricted genotypes dominated. For example, of 4,551 E gene sequences analyses in the present study, Asian I genotype (70.3%, n=333) was the most common among 474 sequences from Vietnam. This was in comparison to 7.6% (n=33) of cosmopolitan genotype sequences. The same pattern was observed among 322 sequences from Thailand (91.3% Asian I; n=294 and 5.9% cosmopolitan; n=19). Therefore, the ability of cosmopolitan genotype to survive in different environments seemed to be variable. One of the contributory factors for this variability might be evolutionary adaptations that generate heterogeneity, confer fitness advantages, and cause selective sweeps, allowing virus lineages to dominate differentially in various geographies [63]. This warranted further analyses of evolutionary signatures, heterogeneity, sub-lineage geo-clustering and natural selection of the cosmopolitan genotype.
Cosmopolitan genotype demonstrated high intra-genotype heterogeneity
The Bayesian phylogeny of DENV-2 complete coding sequences revealed distinct clustering of each genotype, with strong posterior probability support (Figure 3). The cosmopolitan genotype comprised of two distinct groups. One group included sequences belonging to the Indian sub-continent lineage, which has largely been reported in the Indian sub-continent [64, 65]. The other group included the rest of cosmopolitan genotype sequences reported from other countries, especially in Southeast Asia.
The mean nucleotide substitution rate of each genotype ranged from 10-4 to 10-3 subs/site/year (Table 1). This estimate agreed with the empirical data [62, 64, 66], demonstrating comparable evolutionary rates among all genotypes of DENV-2, regardless of their differences in distribution. The inferred mean root age of DENV-2 was ~287.32 years from 2019 (95% highest posterior density (HPD) - 164.77 to 440.30 years), which was comparable to previous estimates of 325.2 years [62]. The tMRCA data also suggested that cosmopolitan genotype might have emerged in parallel to Asian I and II genotypes between 1944 and 1964. The analysis predicted a major split in cosmopolitan genotype, with the emergence of Indian sub-continent lineage in early 1980s.
Table 1. Time to most recent common ancestor (tMRCA) analysis of DENV-2 genotypes
Description
|
Node height
(95% HPD) in years
|
Estimated year of emergence¶
|
Root ancestor
|
287.32 (164.77-440.3)
|
1745
|
Sylvatic
|
140 (87.9-208.46)
|
1886
|
Cosmopolitan genotype (other lineages)
|
64.45 (54.96-75.05)
|
1955
|
Cosmopolitan genotype (Indian sub-continent lineage)
|
36.1 (30.6-41.92)
|
1983
|
Asian II genotype
|
74.24 (64.14-86.89)
|
1939
|
Asian I genotype
|
62 (49.06-77.62)
|
1946
|
American genotype
|
125.26 (90.46-174.85)
|
1899
|
Asian-American genotype
|
51.12 (43.78-60.13)
|
1968
|
¶Estimated year of emergence was calculated by subtracting the node height from 2019, which was the latest temporal data point in the analysis. HPD=Highest posterior density
The cosmopolitan genotype was distinct from other genotypes by a 12-amino acid signature (Table 2). Among non-sylvatic genotypes of DENV-2, each residue of this signature was “fixed” and unique to the cosmopolitan genotype. The signature mutations were spread across the structural and non-structural proteins (PreM/M, Envelope, NS3, NS4B and NS5), implying a genome-wide selection of amino acid residues during the evolutionary history of cosmopolitan genotype. Whether these residues conferred a fitness advantage to cosmopolitan genotype warrant further investigations.
Table 2. Unique amino acid signature of cosmopolitan genotype.
Polyprotein
position
|
Gene
|
Gene position
|
REF
|
Other genotypes
|
Cosmopolitan genotype (% of fixation¶)
|
166
|
preM/M
|
52
|
K
|
K
|
N (100%)
|
196
|
preM/M
|
82
|
T
|
T
|
A (99%)
|
262
|
preM/M
|
148
|
H
|
H
|
Y (96%)
|
351
|
E
|
71
|
E/D
|
E/D
|
A (100%)
|
429
|
E
|
149
|
H
|
H
|
N (99%)
|
670
|
E
|
390
|
N/D
|
N/D
|
S (97%)
|
742
|
E
|
462
|
I
|
I
|
V (99%)
|
1635
|
NS3
|
160
|
A/S
|
A/S
|
T (100%)
|
2256
|
NS4B
|
13
|
L
|
L
|
F (99%)
|
2496
|
NS5
|
5
|
I/V/M
|
I/V/M
|
T (99%)
|
2506
|
NS5
|
15
|
S
|
S
|
N (100%)
|
2826
|
NS5
|
335
|
V/L
|
V/L
|
I (100%)
|
¶The percentage values indicate the extent of fixation of each mutation in 445 sequences of cosmopolitan genotype as compared to 1,007 sequences of other genotypes of DENV-2. M=membrane; NS=non-structural proteins; preM=pre-membrane
A separate phylogenetic analysis was performed by using complete E gene sequences to determine the intra-genotype heterogeneity of cosmopolitan genotype. Here, E gene was preferred to complete genomes because of the availability of more E gene sequences from different geographical regions. E gene is also a widely used candidate for the phylogenetic classification of DENV lineages [13, 67, 68]. The analysis included 2,392 complete E gene sequences of cosmopolitan genotype reported from 39 countries (Supplementary Table 2).
Similar to the whole genome-based phylogeny, E gene tree also branched into two main groups at the root, separating the Indian sub-continent lineage from the rest of cosmopolitan genotype (other lineages) sequences (Figure 4). Besides, other lineage sequences clustered into five distinct lineages, suggesting an overall high heterogeneity of the cosmopolitan genotype (Figure 4). Based on the nucleotide and amino acid similarity matrices (Table 3), Indian sub-continent lineage was the most divergent among all lineages within the cosmopolitan genotype.
Table 3. E gene nucleotide similarity matrix between different lineages of the cosmopolitan genotype
Otheror Indian sub-continent lineages
|
Other lineages
|
Indian sub-continent lineage
|
Lineage-1
|
Lineage-2
|
Lineage-3
|
Lineage-4
|
Lineage-5
|
Lineage-1
|
1.00
|
0.97
|
0.97
|
0.97
|
0.97
|
0.94
|
Lineage-2
|
0.97
|
1.00
|
0.97
|
0.97
|
0.96
|
0.93
|
Lineage-3
|
0.97
|
0.97
|
1.00
|
0.97
|
0.97
|
0.93
|
Lineage-4
|
0.97
|
0.97
|
0.97
|
1.00
|
0.98
|
0.94
|
Lineage-5
|
0.97
|
0.96
|
0.97
|
0.98
|
1.00
|
0.94
|
Indian sub-continent lineage
|
0.94
|
0.93
|
0.93
|
0.94
|
0.94
|
1.00
|
The geographical distribution of these lineages varied within the complete dataset. Notably, sequences from Indonesia positioned at the base of all lineages, except the Indian sub-continent lineage. On the other hand, sequences from Singapore, Malaysia and China appeared in all lineages. While all lineages were spread across different countries, certain lineages dominated in a particular country or a region. For example, lineage 1 almost entirely comprised sequences from the Philippines. There were two outgroups of lineage 1 that included sequences from the Pacific Islands (founder outgroup) and Indonesia (secondary outgroup), suggesting a likelihood of the earliest ancestry of lineage 1 either in the Pacific Islands or Indonesia, and subsequent introduction into the Philippines. Interestingly, the phylogeography analysis showed a decisive direct link between the Philippines and the Pacific, and an indirect decisive link with Indonesia, through Papua New Guinea (Figure 2). Root state posterior probability data extracted from the MCC tree suggested that the most likely common ancestor of lineage 1 originated in Indonesia. This strain might have spread to the Philippines via Papua New Guinea as shown in the phylogeography analysis (Figure 2). A similar observation was also made in the Indian sub-continent lineage that is largely circulating in the Indian sub-continent countries. The earliest available sequence in the basal clade of this lineage (outgroup) was a sequence reported from India in 1974 (NCBI Accession FJ538920). Two sub-lineages (sub-lineage 1 and 2) emerged from this outgroup in early 1990s. One of them showed limited evidence of outbound spread from the Indian sub-continent, mainly to the Middle East and East Africa. In contrast, the other sub-lineage showed relatively wide dispersal from the Middle East to Southeast and East Asia. The available data showed records of both sub-lineages as late as 2017-18.
Is Indian sub-continent lineage becoming a new genotype?
Having been genetically and geographically distinct from the rest of cosmopolitan lineages (Figure 4), we investigated whether the Indian sub-continent lineage is divergent enough to be classified as a new genotype. To achieve this, we analyzed the genetic divergence of Indian sub-continent lineage from other known non-sylvatic genotypes (Asian I, Asian II, Asian-American and American genotypes). These analyses included complete E gene sequences of all cosmopolitan genotype sequences used in the current study, but split into two categories, namely “Indian sub-continent lineage” and “other lineages”. Other lineages category included lineages 1-5 illustrated in Figure 4. The dataset also included sequences of Asian I, Asian II, Asian-American and American genotypes that were selected to represent the genetic and geographical variations within each genotype. First, E gene nucleotide and amino acid similarity scores were calculated and compared between the Indian sub-continent and other lineages of cosmopolitan genotype and other genotypes (Table 4).
Table 4. Comparison of E gene divergence matrices between the Indian sub-continent lineage and other lineages of cosmopolitan genotype and other genotypes of DENV-2.
Genotype
|
Nucleotide dissimilarity (Amino acid dissimilarity)
|
Cosmopolitan (Other lineages)
|
Cosmopolitan (Indian sub-continent lineage)
|
Cosmopolitan (other lineages)
|
0% (0%)
|
5.6% (5.3%)
|
Cosmopolitan (Indian sub-continent lineage)
|
5.6% (5.3%)
|
0% (0%)
|
Asian II
|
8.8% (7.9%)
|
8.6% (7.4%)
|
Asian I
|
8.8% (8.8%)
|
8.9% (8.2%)
|
American
|
10.2% (10%)
|
9.4% (9%)
|
Asian/American
|
7.8% (8.5%)
|
7.1% (7.7%)
|
The nucleotide dissimilarity between the cosmopolitan (other lineages) and Indian sub-continent lineage was 5.6% (Table 4). Respective values between the cosmopolitan (other lineages) and other genotypes ranged between 7.8% (Asian-American) and 10.2% (American). The Indian sub-continent lineage also demonstrated comparable nucleotide divergence with other genotypes of DENV-2 (divergence range: 7.1%-9.4%). Empirical data has suggested a nucleotide divergence (range 3-6%) threshold of 6% within E gene among different genotypes of DENV-2 [7, 14, 69]. Our data showed that nucleotide divergence between the cosmopolitan (other lineages) and Indian sub-continent lineage (5.6%) is on the verge of this threshold. To further corroborate the genetic distinction between Indian sub-continent and other lineages of cosmopolitan genotype, we compared π nucleotide diversity index values of E gene sequences of cosmopolitan genotype (split into Indian sub-continent and other lineages) and other genotypes of DENV-2. As shown in Figure 5, cosmopolitan genotype demonstrated the highest nucleotide diversity among all non-sylvatic genotypes of DENV-2. Moreover, Indian sub-continent and other lineages formed distinct subpopulations within the cosmopolitan genotype, with ~6% E gene pairwise distance between the two groups. The π index of Indian sub-continent lineage overlapped with a batch of Asian II genotype sequences, further corroborating the genetic distinction of Indian sub-continent lineage from other lineages of cosmopolitan genotype. These observations suggested that Indian sub-continent lineage has already diverged from the rest of cosmopolitan genotype lineages and is almost at the threshold of 6% nucleotide distance within E gene to qualify as a separate genotype.
To further support the genetic distinction of Indian sub-continent lineage within cosmopolitan genotype, we compared complete coding sequences of Indian sub-continent and other lineages. The analysis showed a 16-amino acid signature unique to the Indian sub-continent lineage (Figure 6, Supplementary Table 3).
Eight amino acid residues of this signature were also shared by a small subset of Indonesian sequences of the cosmopolitan genotype, isolated during 1974-75 (Table 5). Based on NCBI data (latest by July 2019), the earliest sequence of Indian sub-continent lineage is a strain (NIV_742295) isolated in 1974 (NCBI Accession FJ538920). NCBI Accession FJ538920 is a partial genome sequence and was thus not included in the whole genome based tMRCA analysis that estimated the origin of Indian sub-continent lineage in early 1980s (Table 1). Therefore, the partial genome data suggested that Indian sub-continent lineage might have emerged even earlier than early 1980s. This was corroborated by the group of Indonesian sequences that were reported at the same time as the earliest genetic record of Indian sub-continent lineage in 1974. Sharing of amino acid residues unique to Indian sub-continent lineage by 1974-75 Indonesian sequences also implied a common ancestry between the two groups.
Table 5. Unique amino acid substitutions shared between the Indian sub-continent lineage and Indonesian sequences (1974-76).
Polyprotein position
|
Gene
|
Gene position
|
REF
|
Cosmopolitan genotype
|
Indonesian
(1974-76; n=07)
|
Indian sub-continent lineage (n=94)
|
Other lineages (n=344)
|
143
|
preM
|
29
|
D
|
D (86%), N (14%)
|
D (100%)
|
N (100%)
|
332
|
E
|
52
|
Q
|
Q (71%), H (29%)
|
Q (100%)
|
H (97%)
|
906
|
NS1
|
131
|
Q
|
Q (71%), H (29%)
|
Q (100%)
|
H (98%)
|
1408
|
NS2B
|
63
|
D
|
D (100%)
|
D (100%)
|
E (98%)
|
1748
|
NS3
|
273
|
V
|
V (100%)
|
V (100%)
|
I (98%)
|
1936
|
NS3
|
461
|
I
|
I (86%), V (14%)
|
I (98%)
|
V (97%)
|
2589
|
NS5
|
98
|
R
|
R (71%), K (29%)
|
R (98%)
|
K (98%)
|
3096
|
NS5
|
605
|
G
|
G (71%), V (29%)
|
G (100%)
|
V (98%)
|
Percentage values indicate the extent of fixation of each mutation in respective groups of sequences. NS=non-structural proteins; preM=pre-membrane
Selection pressures acted on different amino acid residues between Indian sub-continent and other lineages of cosmopolitan genotype
The spatio-temporal dominance of virus genotypes is influenced by selective sweeps that partly occur during virus-host adaptations. In arboviruses, such as DENV-2, these adaptations occur in both vertebrate and invertebrate hosts [70]. Therefore, we hypothesized that selection pressure analyses among different lineages would depict amino acid residues that altered during evolution and dispersal of the cosmopolitan genotype. To achieve this, differential rates of nonsynonymous and synonymous substitution rates per site were estimated in the complete coding sequences of cosmopolitan lineages (447 sequences of other lineages and 98 sequences of Indian sub-continent lineage) using SLAC, FEL, FUBAR and MEME methods [71-74]. The findings showed that majority of residues were neutral and purifying selection was more pronounced than the positive selection. This was in agreement with previous studies [71, 75]. Among the few positively-selected sites, there was no overlap between the Indian sub-continent lineage and other lineages, suggesting differences in selective pressures acting on these lineages. NS2-1160 and NS5-3122 residues showed evidence of positive selection in the Indian sub-continent lineage. Valine was present at residue NS2-1160 in 77% of Indian sub-continent lineage sequences, whereas Isoleucine was present at the same position of other lineages. On the other hand, the positive selection was evident on C-9R, NS2-1298A, NS3-1867S and NS4-3139V residues in other lineages of the cosmopolitan genotype (Table 6). At residue NS5-3122, 35% of Indian sub-continent lineage sequences possessed Glycine. In contrast, Serine was present at the same residue in ~99% of sequences of other lineages.
Table 6. Selection pressure analysis of the complete coding region of cosmopolitan genotype
Lineage
|
Protein
|
Amino acid position
|
SLAC
|
FEL
|
FUBAR
|
MEME
|
Lineages 1-5
|
C
|
9
|
0.05
|
0.03
|
89.45
|
0.67
|
C
|
95
|
0.14
|
0.09
|
46.73
|
0.11
|
E
|
627
|
0.26
|
0.22
|
29.06
|
0.07
|
E
|
670
|
0.11
|
0.02
|
214.78
|
0.67
|
NS1
|
953
|
0.18
|
0.16
|
48.62
|
0.16
|
NS2
|
1298
|
0.02
|
0.03
|
88.23
|
0.04
|
NS3
|
1867
|
0.04
|
0.03
|
134.29
|
0.02
|
NS5
|
3139
|
0.03
|
0.23
|
80.97
|
0.14
|
Indian sub-continent lineage
|
C
|
130
|
0.34
|
0.18
|
9.69
|
0.2
|
C
|
171
|
0.41
|
0.19
|
10.86
|
0.1
|
E
|
332
|
0.28
|
0.14
|
14.86
|
0
|
E
|
640
|
0.29
|
0.14
|
20.05
|
0.67
|
NS1
|
967
|
0.34
|
0.19
|
9.61
|
0.01
|
NS2
|
1160
|
0.07
|
0.06
|
65.28
|
0.67
|
NS5
|
3122
|
0.16
|
0.04
|
91.78
|
0.67
|
Positive selection was evident on residues in bold font, based on the selection pressure analyses. The values shown for each method are p-values, except for the FUBAR method for which the values indicate the Bayes factor. The p-values = < 0.1 were considered as significant for SLAC, FEL, and MEME methods, whereas the Bayes factor >=30 was used for FUBAR method. C=Capsid; E=envelope; NS=non-structural proteins