2.1 Data acquisition
The analysis presented in this article is based on nucleotide sequences with accession numbers EPI_ISL_402124 to EPI_ISL_541337, downloaded from the GISAID database (Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017) as a file in "fasta" format on 22 September 2020. Only patients with additional metadata (age, sex, and hospitalization status as plain text comments) were selected on GISAID, resulting in 7,151 samples.
2.2 Data cleaning
We filtered the 7,151 samples for complete nucleotide sequences, and aligned them using the anchor sequence given in Hahn et al. (2020a). After this step, 4,385 samples remained.
Using the location tag in the fasta file, we grouped all samples according to the WHO regional offices for Africa (AFRO), for the Eastern Mediterranean (EMRO), for Europe (EURO), for South-East Asia (SEARO), for the Western Pacific (WPRO), as well as the Pan American Health Organization (PAHO). In particular, the countries included in each group are as follows: (1) AFRO (Algeria, South Africa, Gambia, Nigeria, Senegal, as well as Congo); (2) EMRO (Egypt, Morocco, Kuwait, Lebanon, Oman, Saudi Arabia, United Arab Emirates, as well as Iran); (3) EURO (Austria, Belgium, Bosnia and Herzegovina, Bulgaria, Croatia, Cyprus, Czech Republic, Denmark, Faroe Islands, France, Germany, Hungary, Italy, Poland, Portugal, Romania, Russia, Slovakia, Spain, Sweden, as well as Israel, Turkey, Kazakhstan, Andorra, and Georgia); (4) PAHO (Canada, USA, Costa Rica, Mexico, Argentina, Brazil, Chile, Colombia, Ecuador, Peru, Venezuela, as well as Puerto Rico and Uruguay); (5) SEARO (Bangladesh, India, Indonesia, Myanmar, Nepal, Sri Lanka, Thailand); (6) WPRO (Cambodia, Japan, Malaysia, Vietnam, Australia, Guam, Hong Kong, China, Singapore, as well as South Korea and Taiwan). Filtering for only those samples falling into the aforementioned geographical categories resulted in 4,378 samples.
Finally, we matched these 4,378 samples to the metadata information (age, sex, clinical outcome) available on GISAID. Filtering for those samples having complete metadata information resulted in n=3,626 samples.
2.3 Data analysis
After aligning all sequences, we established a window of length 28,492bp in which all viral sequences have reads. We compared all trimmed and aligned sequences entry-wise to the SARS-CoV-2 reference sequence published on GISAID (accession number EPI_ISL_412026), and denoted in a matrix X with an entry Xij=1 that sequence i deviated from the reference sequence at position j. All other entries of X are zero.
We used the R-package "locStra" (Hahn et al., 2020c,d) to calculate the Jaccard similarity matrix (Jaccard, 1901; Tan et al., 2005; Prokopenko et al., 2016; Schlauch et al., 2017) for the n viral genomes based on the matrix X. The Jaccard matrix J(X) has n rows and n columns, and each entry (i,j) is the Jaccard similarity index between the ith and jth SARS-CoV-2 genome in our dataset. Computing the first 5 eigenvectors of the Jaccard similarity matrix J(X) allows us to visualize the geographic clustering of the viral genomes, as well as to guard the logistic regression analysis against such confounding by including the first eigenvectors in the regression analysis as covariates.
For the association analysis of the entire viral genome, we defined the response to be a binary indicator for the clinical outcome, where we only distinguish between all those patients/hosts whose hospitalization status tag at enrollment into the GISAID database was listed as “deceased” (outcome of 1) versus the remaining samples as non-deceased (outcome of 0). At this point, no other information regarding clinical outcome was available in GISAID.
We then performed p=28,492 logistic regressions of the binary outcome variable on the following covariates: the column vector X·i encoding the mismatches of each sample at the i'th location on the SARS-CoV-2 nucleotide sequence, patient’s age, sex and the first 5 eigenvectors of the Jaccard matrix. The logistic regression was carried out in R using the default “glm” command, where the parameter "family” was set to “family=binomial(link="logit")”. We tested the i'th locus/location of the viral genome for association with mortality by testing whether the regression coefficient for column X·i is equal to zero. Finally, we assessed the significance of each tested locus by comparing the individual p-values to a Bonferroni corrected threshold of 0.05/p=1.75e-06.
We observed in Figure 1 that the viral genomes that carry least one mutation at 12,053bp or 25,088bp are located in distinct branches that correspond to geographic regions. Therefore, we performed two additional logistic regression analyses to take the observed clustering into account: one in which we also included “region” as a categorical covariate in the regression model, and one in which we matched the viral genomes based on their positions in the eigenvector plot (Figure 1). For the matched analysis, we identified the 359 viral genomes whose patient indicator variable is set as “deceased” (outcome of 1). For each of the 359 deceased samples, we identified the “non-deceased” viral genome that is closest in terms of Euclidian distance in the eigenvector plot (Figure 1), where each deceased sample is matched to a different “non-deceased” sample.
Finally, we also report results from the Fisher exact test that is applied to contingency tables. To this end, for a certain subgroup of the population (e.g., location tag “Brazil”), we determined the number of deceased and non-deceased samples which are carriers or non-carriers of the mutations at 12,053bp or 25,088bp, respectively. The resulting 2x2 contingency table was tested in R with the default “fisher.test” command, and the p-value of the Fisher test was reported.