Sequence and Google mobility data
Research data for this report consists of SARS-CoV-2 genomes (n = 1,597) that were sequenced from SARS-CoV-2 PCR positive patient samples with Illumina NovaSeq and MiSeq sequencing platforms in-house within the Department of Virology in the University of Helsinki and submitted to the GISAID database. Due to HUSLAB initially being the only clinical laboratory sequencing patient samples, some of the virus sequences originate from outside the HUS area. The collection period was from spring to fall 2020. In addition to the local sequence data, global SARS-CoV-2 genomes (n = 20,720) were acquired from the GISAID database (Supplementary Table S1).
In order to infer the geographic source(s) of SARS-CoV-2 lineages contributing to the first wave in Finland, we extended our dataset of Finnish genomes with genomes available for other European countries. A recent phylogeographic analysis demonstrated that SARS-CoV-2 spread in Europe was strongly predicted by Google mobility flows 31. To inform our sampling, we therefore turned to the Google COVID-19 Aggregated Mobility Research Dataset containing anonymized mobility flows aggregated over users who have turned on the Location History setting (on a range of platforms). Aggregated mobility flows between Finland and all other European countries were summarized between January and April 2020, and we selected the following 16 countries that were responsible for 95% of international travels from and to Finland: Estonia, Latvia, Norway, Hungary, Poland, Turkey, Sweden, Netherlands, Austria, Denmark, Italy, Germany, Switzerland, Spain, France and the United Kingdom. For these countries, we downloaded the available SARS-CoV-2 genomes from GISAID on April 17, 2020. For six countries (Estonia, Latvia, Norway, Hungary, Poland and Turkey) represented by a relatively small number of genomes, we decided to augment our dataset with genomes from GISAID with a sampling date up to April 31, 2020.
We selected only sequences from the B.1 lineage with the D614G mutation for the analyses. We removed duplicate genomes for each country using SeqKit v0.11 32. For Finland, we retained duplicate genomes when these were sampled from cases with different travel histories. All genomes were aligned using MAFFT 33 and trimmed at the 5′ and 3′ ends. We then subsampled each country proportionally to the cumulative number of cases on April 17, 2020 by setting an arbitrary threshold of 7.5 genomes per 10,000 cases, with a minimum number of 100 sequences per country. For the 6 countries where the number of unique genomes was below 100, all genomes were included in the analysis. To maximize the spatial and temporal coverage of the subsampling, we partitioned each country's genome pool by week and sampled as evenly as possible, selecting sequences from a different region within the country when available. We checked the resulting dataset for potential outliers with a root-to-tip regression using TempEst v1.5.3 34 on a maximum likelihood inferred using IQ-TREE v2.0.3 35, and removed 9 genomes. The final dataset consisted of 1,643 genomes out of an initial 8,513 genomes. Total, unique and downsampled number of genomes by country are given in Supplementary Table S1. All genomes were associated with exact sampling dates, except for the four genomes from Estonia that were sampled in March 2020.
Bioinformatic analyses
Consensus sequence data for Finnish SARS-CoV-2 were computed and classified with HAVoC 36 and a modified pipeline consisting of Jovian 37 and pangolin 38. Clade and lineage assignment for GISAID sequences was done Nextclade 19 and pangolin.
Maximum-likelihood phylogenetic trees
Viral sequences were aligned with MAFFT 33 and the trees were computed with a SARS-CoV-2 version IQ-TREE 35. The phylogenetic trees were visualized in R with ggtree 39.
Bayesian time-measured phylogeographic analyses
We performed Bayesian evolutionary reconstruction of timed phylogeographic history using BEAST 1.10 40 incorporating genome sequences, their country and date of sampling, Google mobility data and individual travel history 23,24. We modelled sequence evolution using a strict molecular clock model and an HKY nucleotide substitution model 41 with gamma-distributed rate variation among sites 42. We assumed an exponential growth coalescent model as the tree-generative process prior. Uncertainty in the sampling time for the four Estonian genomes was accommodated by sampling uniformly across the reported collection month in the Markov chain Monte Carlo (MCMC) analysis. Our phylogeographic model incorporated the country of sampling as discrete traits associated with the sampled genomes, and following a recent European SARS-CoV-2 phylogeographic analysis 31, we adopted a generalized linear model (GLM) specification to parametrize each rate of among-location movement as a log linear function of the total Google mobility flows for the period January-April 2020. Total mobility flows were log-transformed and standardized after adding a pseudocount to each entry in the matrix. The main goal of our GLM extension was to obtain well-informed phylodynamic estimates. As the ancestral reconstruction of locations depends on the availability of samples, over- or undersampling of sequences from a given location can greatly impact the estimated ancestral locations 23. To mitigate sampling bias and improve the location-transition history reconstructions, we augmented our elementary phylogeographic model by incorporating travel history information obtained from 44 cases that returned to Finland from Austria (n = 20), Italy (n = 13), Spain (n = 7), Estonia (n = 1), Germany (n = 1), Switzerland (n = 1) and United Kingdom (n = 1).
We also investigated how unsampled diversity for six European countries or oversampling of Finnish SARS-CoV2 diversity may impact our phylogeographic reconstructions. Building on our extended phylogeographic model including sampling locations and individual travel histories, we incorporated unsampled taxa for the under-sampled countries Estonia (n = 96 taxa added), Latvia (n = 83), Norway (n = 56), Hungary (n = 54), Poland (n = 46) and Turkey (n = 41) to arrive at a minimum of 100 genomes for all countries. Unsampled taxa without observed sequence data were added with associated location and sampling times, for which we randomly sampled dates from case count distributions per country. For this analysis, we also downsampled the Finnish genome dataset to 100 taxa, while ensuring we incorporated the 44 samples with known travel histories.
We performed inference under the full model specification using MCMC sampling while employing the BEAGLE library v3 43 to increase computational performance. Because MCMC burn-in takes considerable computational time due to the size of our dataset, with the tree topology representing the most challenging parameter for convergence, we initially only considered sequence evolution to arrive at a tree distribution from which trees were taken as starting trees in our phylogeographic analyses. Multiple independent MCMC runs were run to ensure that their combined posterior samples achieved effective sample sizes (ESSs) larger than 100 for all continuous parameters. Transition histories were summarized using the tree sample tool, TreeMarkovJumpHistoryAnalyzer, implemented in BEAST to collect Markov jumps 44 and their timings from a posterior tree distribution annotated with Markov jumps histories 31.