SNVs with significant and substantial expansions
The SARS-COV-2 accumulates mutations during replication, potentially generating new strains, some of which have undergone substantial expansions in the population either because of super-spreader events or due to functional changes resulting in elevated transmissibility. Consequently, here we considered an SNV to be of interest if it had a statistically significant expansion in the study period and further its average proportion of mutations in the last three months exceeded 10%, to focus on prevalent variants. To identify such SNVs, we utilized 7137 viral genomes that had been generated from laboratories in Washington state and deposited to GISAID and that had been aligned and subject to quality control (see Materials and Methods). Comparing all viral genome sequences to the reference sequence led to counts of mutations at each nucleotide in the genome shown in Figure 1A. This analysis showed the Spike protein had two common SNVs with 300 or more copies and the nucleocapsid (N) protein demonstrated several SNVs with 300 or more occurrences. Since over 90% of positions in the 30 Kb genome had fewer than 3 mutations, we focused on the remaining 2516 nucleotides for possible significant expansions. A non-linear logistic regression model to regress the binary indicator for mutant type at each selected nucleotide on the collection time was then employed. In essence, this model fitted “locally averaged proportions of mutants” throughout the study period. If an SNV had gone through expansion, its locally averaged proportion would increase over time, as the mutant type became more common in the population.
This temporal expansion, deviating from random fluctuation, was quantified by a non-parametrically estimated function and its statistical significance was quantified by the p-value. To account for multiple comparisons, p-values were then converted to false positive error rates, i.e., q-values. An SNV was deemed to have a significant expansion if the q-value was less than 0.01. Meanwhile, to focus on those pertinent SNVs that emerged or maintained their dominance in the population, fitted models were used to compute locally averaged proportions daily, and calculated the maximum proportion in last three months, denoting them as Pmax. An SNV was deemed to have a substantial expansion if the Pmax exceeded 10%. Figure 1B shows q-values and Pmax from 2516 nucleotides, in which 53 nucleotides met the established threshold values (q-value<0.01 and Pmax>0.10) (supplementary Table S1). Noticeably, four coding SNVs and 1 non-coding SNVs were identified in the Spike protein. Equally important to note was that SNVs with substantial expansions occurred in other ORFs/genes as well.
variants in Nucleocapsid, endoRNase and ORF3a have significant associations with severe COVID-19
The central goal of this study was to correlate identified SNVs with the hospitalization status in the discovery case-control samples in Washington state. The discovery cohort included 295 outpatients (controls) and 388 inpatients (cases), on which SARS-CoV-2 genome sequences were obtained (Table 1). Given constraints on the assembly of nasal swab samples and extraction of clinical data from the operational database, the case-control study took available samples from COVID-19 patients, while attempting to balance cases and controls for sex, age and collection times. As a result, there are some imbalances in collection time for outpatients from March to June and for inpatient from March to August.
To ensure the robustness of the association analysis on individual SNVs, we limited our analysis to 13 SNVs that had ten or more mutants observed (Table 2). Correlating these SNVs with case/control status through an unadjusted logistic regression model, we estimated coefficient (log odds ratio), standard error, Z value, p-value and q-value. For two SNVs with zero occurrences among outpatients, we performed the Fisher’s exact test, where the logistic regression was not appropriate. On the far-right panel, we present results from the adjusted logistic regression analysis. For readability, we highlighted q-value if it was less than 0.01 and used green and red to correspond to positive and negative Z-value, corresponding the susceptibility and resistance to severe COVID-19.
Strikingly, three SNVs (g28881, g28882, g28883) in the nucleocapsid had significant associations with hospitalization status. These SNVs were found in perfect linkage disequilibrium and denoted as haplotype g28881/2/3, significantly elevated the risk of hospitalization (OR=5.81 and 6.55, q=1.47*10-5 and 4.15*10-6 in the unadjusted and adjusted analysis, respectively). Further, the SNV c28854, also in the nucleocapsid, was not observed among outpatients and was found to be highly associated with hospitalization (p=2.03*10-11) by the Fisher’s exact test, given the logistic regression could not deal with the extreme association.
Two SNVs (t19838 and a20268) in endoRNase were found to have significant associations. The SNV t19838 mutant was absent among outpatients and was found to be significantly associated with hospitalization status by the Fisher’s exact test (p=1.09*10-11). Similarly, the SNV a20268 was found to associate with the risk of hospitalization in both unadjusted and adjusted analysis (OR=38.47 and 42.95, q=6.85*10-4 and 4.67*10-4, respectively). Two remaining SNVs (c1059, g25563) were found to have resistant association with severe COVID-19 (OR=0.46 and 0.45, p=8.06*10-6 and 7.18*10-6, respectively). Similar resistant associations were observed for the adjusted analysis. It is of interest to note that a single SNV a23403, coding for the well-known D614G, appeared to have no association with the severe COVID-19 in unadjusted or adjusted analysis (p=0.39 and 0.29, respectively).
SNVs in nucleocapsid and endoRNase have synchronized expansion patterns
By the selection threshold values, six discovered SNVs were expected to have significant and substantial expansion during the study period (Figure 1C). The SNV haplotype g28881/2/3 started to expand prior to day 100, peaked around day 150, declined to below 10% around day 320, and re-emerged in January 2021 (black line). Following a similar temporal pattern, SNV t19839 followed a synchronized pattern with g28881/2/3, with a slightly later incline and earlier decline (red line). Similarly, two SNVs (a20268, c28854) in enodRNase and Nucleocapsid, respectively, appeared to have a synchronized expansion pattern; proportions started to rise around day 150 and reached a plateau after day 200 (blue dash and dotted lines, respectively). Finally, both SNVs (c1059, g25563) in nsp2 of ORF2ab and ORF3a, respectively, were synchronizing well, expanding from day 30, then contracting and expanding again towards the end of the study period. Such synchronies may imply that these variants share the same mutational histories, and thus the same haplotypes.
SNV-Haplotype (t19839-g28881-g28882-g28883) associates with Severe COVID
Due to the haploidic nature of the viral genome, the presence of multiple SNVs in a single patient implies that they form a single haplotype. Focusing on four SNVs that were synchronized temporally (t19839, g28881, g28882, g28883), their haplotypic frequencies across outpatients and inpatients in the discovery case-control study were tabulated (Table 3). The reference haplotype “tggg” had frequencies of 287 (97%) and 334 (86%) copies in outpatients and inpatients, respectively, while the haplotype “taac” with a single mutation was observed 8 (3%) and 11 (3%) in outpatients and inpatients. Interestingly, the haplotype “caac” was absent among outpatients completely, while it was observed 43 times (11%) among inpatients. The application of Fisher’s exact analysis suggested that this SNV haplotype was significantly associated with the severe COVID-19 (p-value=2.84*10-11).
Repeating the same haplotype tabulation in the replication case-control study yielded corresponding haplotype frequencies. Other than including a rare haplotype “tag”, the replication analysis showed largely comparable frequencies of three SNV haplotypes “tggg”, “taac” and “caac”. This haplotypic association with severe COVID-19 was replicated with high confidence (p=2.21*10-10).
Applying the logistic regression of hospitalization status over this SNV haplotype, haplotypic association with severe COVID-19, with “tggg” as the reference haplotype was evaluated (Table 4). Treating “tggg” as the reference, we effectively set its coefficient to zero (OR=1). The mutant haplotype “caac” was found to have a significantly elevated risk of severe COVID-19 (OR=3.69, p=3.44*10-10), without adjusting any covariates. After adjusting for sex, age and a potential non-linear effect of collection time, the “caac” association was improved further (OR=5.46, p=4.71*10-12). Note that male and older age tended to increase risk of severe COVID-19 from the adjusted analysis, while the collection time appeared to have a significant curvature, i.e., risk of severe COVID-19 was relatively lower in the middle of the study period.
Using temporal synchrony, we considered the haplotypic association of t20268-c28854 with hospitalization (Table 3). In the discovery set, the mutant “gt” was absent in outpatients, and was observed 10% among all inpatients. As a result, this haplotype was found to have a significant association (p=4.56*10-11) in the discovery set. However, the replication analysis provided a support for this association with a marginal significance (p=0.05), given 22% of outpatients carried this haplotype in comparison with 18% of inpatients. The discovered association of c1059-g25563 was replicated also with marginal significance (p=0.07).
dynamic Expansion of SNV haplotype (t19839-g28881-g28882-g28883) in Washington State
The SNV haplotype t19839-g28881-g28882-g28883 has a group of seven relatively uncommon haplotypes (cagg, cgac, cggc, taaa, tagc, tag, ttgg) with fewer than 5 copies, known as rare haplotypes, and has four other relatively common haplotypes tggg/0 with 5462 copies, cggg/1 with 29 copies, taac/3 with 434 copies and caac/4 with 1201 copies, in which /# indicates the number of mutants in the haplotype. Tabulating these haplotypes over collection time by months, Figure 2 shows that the reference haplotype tggg (gray) dominated over all months. The mutant haplotype “caac” (green) appeared in April, peaked in June, declined to a relatively low level, and appeared to rise again in January, 2021. The other mutant haplotype taac (red) was relatively steady throughout the year, since its appearance from April, 2020. In Washington state, the reference haplotype “tggg” had a frequency of 76%, while the mutant “caac” had a frequency of 17% (Table 3).
Classification of The haplotype (t19839c-g28881a-g28882a-g28883c)
All of Washington viral genomes obtained from GISAID were classified by nextstrains, GISAID-clade, and lineage. To assess the relationship between the haplotype and nextstrain classification, we tabulated their cross-table frequencies (Table 5). All 1201 carriers of “caac” haplotype were classified to 20B, while a few carriers of “tggg” were classified to 20B. Similarly, carriers of “caac” belonged to the clade GR, as did “taac”, while no carriers of the reference haplotype were assigned to the clade. Finally, with respect to the assigned lineage, 80% of “caac” carriers were assigned to the lineage B.1.1.291, 7% to B.1.1.290, 6% to B.1.1, in addition to several sporadic assignments to, mostly, B.1.1 (Table 6). In contrast, only 11% of the carriers of the reference haplotype “tggg” were assigned to the B.1.371 lineage.