Epidemiological data
We retrieved the number COVID-19 cases reported by PRDH from March 2020 to September 30th, 2021 from the PRDH database dashboard on December 1st, 2021 available here
https://covid19datos.salud.gov.pr/. The collection of cases includes cases classified as confirmed (by molecular tests) or probable (by antigen tests) and plotted by date of sample collection.
Patient sample selection
Nasopharyngeal swab samples pre-selected for genomic surveillance were received from COVID-19 passive surveillance conducted by the Puerto Rico Department of Health (PRDH), the Ponce Health Sciences University (PHSU) and hospital-based acute febrile illness surveillance conducted by the Centers for Disease Control and Prevention (CDC) Dengue Branch. A total of 785 samples were collected from March 2020 to September 30th, 2020 from the 7 health regions of the island, including 63 out of the 78 municipalities, and selection criteria included all samples with SARS-CoV-2 detected by reverse-transcriptase polymerase chain reaction (RT-PCR), viral load (CT<28) and sufficient residual sample volume stored at -80°C39. All samples were de-linked from patient identifiable information and processed under the guidelines approved by the CDC and Ponce School of Medicine institutional review boards (IRB) protocol 6731.
Lineage frequency analysis
Frequency of SARS-CoV-2 lineage detection in Puerto Rico was calculated using the total number of SARS-CoV-2 genomes published in the Global Initiative on Sharing All Influenza Data (GISAID) (https://www.gisaid.org) with collection dates ranging between March 1st, 2020 and September 30th, 2021. All complete genome sequences and metadata were retrieved from the GISAID database as of October 31st, 2021. The dataset was filtered for complete genome data, high-coverage data, and complete collection date for a final dataset of 2,514 entries. Lineage assignment on GISAID was determined by the Phylogenetic Assignment of Named Global Outbreak Lineages (Pangolin)40, 41. R with ggplot package was used to calculate lineage frequency and plot the graph focusing on the following lineages of interest: B.1.1.7 (Alpha), P.1+P.1.1 (Gamma), B.1.588, Delta (B.1.617.2+AY.x), B.1.427+B.1.429 (Epsilon), B.1.526 (Iota), B.1.621/1 (Mu), and all other Pangolin-designated B lineages grouped as Other. No genomes collected in May 2020 have been published in GISAID by October 31st, 2021.
Complete genome sequencing and assembly
Complete SARS-CoV-2 genome sequences were generated directly from clinical nasopharyngeal samples. Viral RNA was extracted from viral transport media using the automated MagNA Pure 96 system (Roche) with the MagNA Pure 96 DNA and Viral Nucleic Acid Small Volume Kit (Roche) following manufacturer-recommended protocols for 0.2mL sample input volume and 0.1mL RNA elution volume. MP96 external lysis buffer was used to pre-treat the samples for neutralization and assist the lysis process. First strand cDNA was synthesized with random hexamers using SuperScript IV reverse transcriptase (ThermoFisher), and tiling PCR amplicons were generated using Q5® high-fidelity DNA polymerase (New England Biolabs) and the ARTIC nCoV-2019 V3 primer scheme purchased from Integrated DNA Technologies (https://github.com/artic-network/artic-ncov2019/blob/master/primer_scheme/nCoV-2019/V3/nCov-2019.tsv). Candidate samples for sequencing presented clearly visible bands of target size (approximately 400bp) in DNA gel electrophoresis for both primer pools. PCR products were purified with AMPure XP magnetic beads (Beckman Coulter) and quantified using Qubit 4.0 fluorometer (ThermoFisher). DNA libraries were generated using the NEBNext Ultra II DNA Library Prep Kit for Illumina (New England Biolabs) reducing all reagents volumes to 25% from the manufacturer recommended protocol to increase throughput. The resulting products were screened for size and quality using the Bioanalyzer 2100 instrument (Agilent Technologies) and quantified with Qubit 4 fluorometer (ThermoFisher). Passing libraries were pooled and run in the MiSeq sequencer instrument (Illumina) using the MiSeq Reagent Kit v3 in 600-cycle program.
The resulting sequence reads were screened for quality, trimmed and assembled into complete consensus SARS-CoV-2 genomes using the Genome Detective Virus Tool v1.13642 (https://www.genomedetective.com) and assembly confirmed with iVar43. The Pangolin COVID-19 Lineage Assigner tool was used for lineage assignment41 (https://pangolin.cog-uk.io). A total of 753 samples were sequenced with more than 95% genome coverage at a minimum of 10x sequence depth. All sequence data obtained for this study was submitted to GISAID, accession numbers available in Supplementary Table 1.
Phylogenetic analysis
Our Puerto Rico SARS-CoV-2 genomes dataset was analyzed against a diverse panel of genomes from across the world which provide regional phylogenetic context. Initially, we downloaded the Genomic Epidemiology metadata package for all entries from GISAID on August 18th, 2021 to screen genomes for subsampling. However, due to the large number of genomes available in GISAID, we downloaded and combined the following pre-sampled datasets for regional studies: NextRegion-North America, NextRegion-South America, and NextRegion-Global. We then used the standard ncov augur/auspice multiple input workflow available in the Nextstrain platform44 (https://github.com/nextstrain/ncov) to subsample contextual genomes and phylogenetic inference with time-stamped trees. The custom subsampling scheme program selected 2,611 contextual genomes from the United States, North America, the Caribbean, Central America, South America, Africa, Europe, Asia, and Oceania, with higher proportions on The Americas based on collection dates and genetic proximity to our Puerto Rico dataset. The combined dataset of 3,364 genomes was aligned using MAFFT45 and a global maximum likelihood (ML) phylogenetic inference was reconstructed with IQ-TREE46. The ncov workflow then transferred the ML tree to TreeTime47 for time calibration and ancestral state reconstruction of the tree topology at constate rate of 8x10-4 nucleotide substitutions per site per year. The resulting global ML tree was visualized with Nextstrain auspice44 and annotated with iTol for region of origin and emerging variants48. Subsampling from the Genomic Epidemiology metadata package retrieved in August and from the concatenated NextRegions produced phylogenetic inference trees with similar topologies. A list of all the sequences used in this study, including sequence labels and authors can be found in Supplementary Table 2.
Significant lineages of interest were studied further by reconstruction of phylogenetic inference sub-trees. For the B.1.588 lineage sub-tree, we selected all B.1.588 genomes published in GISAID by October 31st, 2021. Contextual B.1 lineage genomes were selected based on phylogenetic clustering near the base of the B.1.588 clade in the global tree and by temporal proximity to the date range of B.1.588 circulation between June 2020 – January 2021. Maximum likelihood phylogenetic trees were reconstructed with the resulting dataset of 239 genomes under the GTR+G+I nucleotide substitution model and 1,000 bootstrap replicates using IQ-TREE v1.6.1246. The resulting tree topology and node support were compared to Bayesian maximum clade credibility (MCC) tree reconstruction using BEAST v1.10.449. Briefly, we used time-calibrated genomes with sample collection dates and the nucleotide substitution model parametrized using Yang96 model under strict molecular clock, and Bayesian Skyline coalescent model. Markov chains were run for a total of 100 million steps with sampling every 10,000 steps in the chain. Run results were evaluated in Tracer (http://tree.bio.ed.ac.uk/software/tracer/) to ensure stationary parameters with statistical errors reflected in 95% highest probability density ranges with effective sample size (ESS) higher than 200 for each tree prior. MCC trees were generated in TreeAnnotator from BEAST package after discarding 10% of runs as burn-in. The resulting ML and MCC trees were visualized in FigTree v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree).
For the B.1.1.7 (Alpha VOC) lineage sub-tree, we selected all B.1.1.7 + Q.x designated genomes from the global tree and supplemented the dataset with additional B.1.1.7 + Q.x genomes from Puerto Rico and the United States to understand the lineage emergence and spread in the island. A custom subsampling scheme was selected on the Nextstrain ncov workflow to select genomes from samples collected between November 1st, 2020 and February 28th, 2021. Maximum likelihood phylogenetic trees were reconstructed with the resulting dataset of 729 genomes under the GTR+G+I nucleotide substitution model and 1,000 bootstrap replicates using IQ-TREE v1.6.1246. The resulting ML tree was visualized in FigTree v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree).
For the B.1.617.2 (Delta VOC) lineage sub-tree, we selected all B.1.617.2 + AY.x designated genomes from the global tree and supplemented the dataset with additional B.1.617.2 + AY.x genomes from Puerto Rico to understand lineage spread and sub-lineage clustering patterns across the island. The genome selector Python script designed by Anderson Brito (https://github.com/andersonbrito/ncov) was used to select additional Delta-designated genomes from Puerto Rico with collection dates range between June 1st, 2021 to September 30th, 2021. Maximum likelihood phylogenetic trees were reconstructed with the resulting dataset of 815 genomes under the GTR+G+I nucleotide substitution model and 1,000 bootstrap replicates using IQ-TREE v1.6.1246. The resulting ML tree was visualized in FigTree v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree).
The date of the most recent common ancestor (tMRCA) determined by the Nextstrain ML phylogenetic inference was confirmed with Bayesian coalescent analyses for B.1.588, Alpha VOC and Delta VOC lineage sub-trees. Due to the large size of the datasets, each sub-tree was reduced by tree-pruning to datasets with less than 150 genomes. tMRCA analyses were performed with BEAST v1.10.4 under a strict molecular clock fixed at 8x10-4 substitutions per site per year and 150 million Markov chains sampling every 10,000 steps. Results were evaluated in Tracer (http://tree.bio.ed.ac.uk/software/tracer/) for convergence and ESS greater than 200.