Sample Characteristics
A total of 24,441 SARS-CoV-2 positive samples (April to June 2021) were sequenced (Table S1). The majority (24,237) were saliva collections (SA); 131 were nasopharyngeal swabs (NP) and 73 were oropharyngeal swab collections (OP) (Table 1). Of the 24,441 individuals, 74% were from NJ and MN, with the remaining 26% distributed across other U.S. states and territories. Where data were provided (21,125 patient samples), the mean age was 33 (range 0-97). Gender was reported for 20,868 patients: 38% of samples were from females and 46% from males, and 15% had no gender data available (Table 1, Table S1). Of the 24,441 positive individuals, ~60% self-identified with an ethnicity: ~10% as Hispanic or Latino, ~50% as Non-Hispanic or Latino, ~0.02% as Black not of Hispanic Origin, and ~0.13% as White not of Hispanic Origin; 14% were of other or unknown origin. The remainder declined to identify, or the information was unavailable. Vaccination status was provided by 23,329 (~95%) patients: ~7% of all patients indicated yes and ~88% indicated no, while the remainder declined to provide their status.
Table 1
Totals and Percentages of Samples Sequenced and Characteristics
Characteristic | Value
Total Samples | 24,441
Sample Type |
Saliva | 24,237
Nasopharyngeal | 131
Oropharyngeal | 73
Geographic Location |
MN | 16,485
NJ | 1,684
CA | 386
Other U.S. States and Territories | 5,886
Avg Age Yrs (Range) |
Males | 32.35 (0-94)
Females | 32.17 (0-99)
Undisclosed | 34.23 (6-74)
Mean | 32.29 (0-99)
Gender (% Samples Sequenced) |
Male | 11,418 (46.72)
Female | 9,450 (38.66)
Not Identified | 3,573 (14.62)
Ethnicity (% Samples Sequenced) |
Hispanic or Latino | 2,518 (10.30)
Non-Hispanic or Latino | 12,112 (49.56)
Others/Unknown Not Disclosed | 9,811 (40.14)
Vaccination Status (% Yes/No/Unknown) |
Yes | 1,757 (7.18)
No | 21,572 (88.26)
Unknown | 1,113 (4.55)
Genome Coverage and Ambiguity Rates Between Sequencing Instruments
Of all samples, 73.54% (17,974) were sequenced on the NextSeq 550, while 26.46% (6,467) were sequenced on the NovaSeq 6000. The two instruments were compared using the average sequencing coverage over the SARS-CoV-2 genome and the fraction of nucleotides masked due to sequencing ambiguity in the consensus sequence generated. The global average SARS-CoV-2 genome coverage for all samples sequenced across both instruments was 1324x. Samples run on the NovaSeq 6000 had twice the average coverage (2133x) of samples run on the NextSeq 550 (1034x) because more reads per sample were generated on the NovaSeq 6000. The fraction of masked nucleotides in the consensus sequence was 5.9% globally. The ambiguity rate for genomes sequenced on the NovaSeq 6000 was 2.9%, while genomes sequenced on the NextSeq 550 had a higher ambiguity rate of 6.9% (Figures 2a, b and c).
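As a consistency check on the figures above, the global averages can be approximated as sample-count-weighted means of the per-instrument values. The sketch below is illustrative only: the dictionary layout and the `pooled_mean` helper are assumptions introduced here, not part of the study's analysis pipeline, and because the inputs are the rounded values reported in the text, the results agree with the reported 1324x and 5.9% only approximately.

```python
# Per-instrument figures as reported in the text (rounded values).
instruments = {
    "NextSeq 550": {"samples": 17_974, "mean_coverage": 1034, "ambiguity_pct": 6.9},
    "NovaSeq 6000": {"samples": 6_467, "mean_coverage": 2133, "ambiguity_pct": 2.9},
}


def pooled_mean(stats, key):
    """Return the sample-count-weighted mean of one per-instrument metric."""
    total_samples = sum(entry["samples"] for entry in stats.values())
    weighted_sum = sum(entry["samples"] * entry[key] for entry in stats.values())
    return weighted_sum / total_samples


pooled_coverage = pooled_mean(instruments, "mean_coverage")
pooled_ambiguity = pooled_mean(instruments, "ambiguity_pct")

print(f"Pooled mean coverage: {pooled_coverage:.0f}x")
print(f"Pooled ambiguity rate: {pooled_ambiguity:.1f}%")
```

Weighting by sample count matters here because the NextSeq 550 processed nearly three times as many samples as the NovaSeq 6000, so a simple average of the two instrument means would overstate the global coverage.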
Sequence Data Quality Between Sample Collection Methods
The mean RT-PCR Ct values for the three sample types used in this study (NP, OP and SA) were 20.83, 21.83 and 22.70, respectively. By Tukey's test, the mean Ct values differed significantly between SA and NP (p = 4e-7), but not between the other sample pairs. With respect to SARS-CoV-2 sequence data quality metrics, several sample-specific patterns were identified. The rate of failure of analytical detection of SARS-CoV-2 sequence was highest in OP samples (4.11%, 3 of 73 samples), followed by SA samples (1.58%, 382 of 24,237 samples). The sample type exhibiting the lowest rate of SARS-CoV-2 detection failure was NP (0%, 0 of 131 samples). An important measure of input sample and sequence data quality is the percentage of consensus sequence masked as ambiguous. Higher values of this percentage indicate a greater proportion of data that is not informative for variant determination. Figure 3a depicts the average percentage of consensus sequence masked as ambiguous for the three sample types. With a value of 0.5%, NP samples yielded the highest quality sequence data, while SA samples (5.84%) and OP swabs (41%) performed less well. Another critical measure of quality, sequencing depth, was assessed for the three sample types as depicted in Figure 3b. This analysis determined that NP and SA samples yielded the greatest depth, with values of 1,948x and 1,323x, respectively, while OP sample coverage was the lowest at 586x.
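The detection failure rates quoted above follow directly from the failed and total counts per sample type. A minimal sketch of that arithmetic is shown here; the `failures` mapping and the `failure_rate_pct` helper are hypothetical names used for illustration, and only the counts come from the text.

```python
# (failed, total) sample counts per collection type, as reported in the text.
failures = {
    "NP": (0, 131),
    "OP": (3, 73),
    "SA": (382, 24_237),
}


def failure_rate_pct(failed, total):
    """Percentage of samples in which sequencing did not detect the virus."""
    return 100.0 * failed / total


rates = {sample_type: failure_rate_pct(failed, total)
         for sample_type, (failed, total) in failures.items()}

# Report sample types from highest to lowest failure rate.
for sample_type, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{sample_type}: {rate:.2f}%")
```

Note that the OP rate rests on only 73 samples, so a single additional failure would shift it by more than one percentage point.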
Association Between Sequencing-Based Detection of SARS-CoV-2 Virus and Baseline RT-PCR Ct Values
Samples were selected for inclusion in the sequencing study based on a minimum threshold of SARS-CoV-2 detection by RT-PCR (N gene Ct value ≤ 30). Further investigation of the SARS-CoV-2 PCR data for the samples in this study revealed a direct correlation between the ability to detect SARS-CoV-2 viral content by sequencing and the Ct values of the N gene. This observation held across all three sample types. Of the 24,441 samples sequenced, 385 (1.58%) were negative for SARS-CoV-2 virus by sequencing, while 24,056 samples were positive. Among the negative samples, the mean Ct value for the N gene was 24.56, while for positive samples it was 22.57 (Figure 4a, Table 2).
Table 2
Mean Ct Value Ranges for Three SARS-CoV-2 Genes in Detected and Not Detected Samples
Sequencing Result | # Samples | x̅ of N Gene Ct | x̅ of S Gene Ct | x̅ of ORF1Ab Ct
Not Detected | 385 | 24.56 | 24.75 | 24.79
Detected | 24,056 | 22.57 | 22.47 | 22.64
Total | 24,441 | 22.60 | 22.54 | 22.68
Association between Ct Value and Genome Coverage
We observed a strong inverse association between genome coverage and N gene Ct values as determined by RT-PCR (partial r, controlling for sequencer, = -0.58; partial r-squared = 0.34; p < 1e-15) (Figure 5), with higher Ct values resulting in lower overall mean genomic coverage. Spearman's rank correlation coefficient between mean Ct and % genome coverage was -0.52 (p < 1e-15) for samples sequenced on the NovaSeq and -0.75 (p < 1e-15) for samples sequenced on the NextSeq. To examine the relationship between baseline Ct value and mean genome coverage, we divided the samples into four groups: group 1 with Ct values <22, group 2 with Ct values between 22 and 25, group 3 with Ct values between 25 and 28, and group 4 with Ct values >28. We saw a clear difference in mean genome coverage among the four groups (Figure 5). Group 1, with the lowest Ct values, had the highest genomic coverage, while group 4, with the highest Ct values, had the lowest mean genomic coverage. We identified only 8 samples in our dataset with a mean genome coverage below 100x. For the remaining 24,433 samples, we determined that a mean Ct value of 26.5 for the NextSeq 550 and 27.9 for the NovaSeq 6000 was a threshold for producing high-quality genome sequencing reads (Figure 5a, b and c).
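The four-group stratification described above can be sketched as a simple binning step followed by a per-group mean. This is an illustration under stated assumptions, not the study's code: the text does not say which group receives a Ct of exactly 22, 25 or 28, so the boundary handling in `ct_group` is an assumption, and the `(Ct, coverage)` pairs are made-up values chosen only to show the mechanics.

```python
from collections import defaultdict
from statistics import mean


def ct_group(ct):
    """Assign a Ct value to groups 1-4.

    Boundary handling is an assumption: 22 goes to group 2, 25 and 28 go
    to group 3, and only values strictly above 28 fall in group 4.
    """
    if ct < 22:
        return 1
    if ct < 25:
        return 2
    if ct <= 28:
        return 3
    return 4


# Hypothetical (mean Ct, mean genome coverage) pairs, NOT study data.
samples = [
    (19.5, 2400.0), (21.0, 2000.0),
    (23.4, 1500.0),
    (26.1, 800.0), (27.9, 600.0),
    (29.3, 200.0),
]

coverage_by_group = defaultdict(list)
for ct, coverage in samples:
    coverage_by_group[ct_group(ct)].append(coverage)

group_means = {group: mean(values) for group, values in coverage_by_group.items()}

for group in sorted(group_means):
    print(f"Group {group}: mean coverage {group_means[group]:.0f}x")
```

With real data, the same grouping would typically be run separately for each sequencer, since the reported correlations and Ct thresholds differ between the NextSeq 550 and the NovaSeq 6000.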
Strain Prevalence
We identified a total of 161 lineages of SARS-CoV-2 variants in our dataset. The 10 most prevalent lineages are listed in Table 3. For a full list of all variants identified, refer to Supplementary Table S1. The most prevalent lineage identified was B.1.1.7, the Alpha variant (65.86%, n=15,806), followed by the B.1.526 variant (5.54%, n=1,330). We identified 3 of the 4 CDC variants of interest (B.1.525, B.1.526, B.1.617.1) and all CDC variants of concern (B.1.1.7, B.1.351, B.1.351.2, B.1.351.3, B.1.617.2, P.1 and P.1 sublineages) in our study (Figure 6a). The only CDC variant of interest not identified within the sample set and time frame tested was B.1.617.3. We observed a rapid increase in the B.1.617 and AY lineages during May and June 2021 in our dataset, while the proportion of the Alpha variant declined over the same period. Further details on trends in lineage discovery as the data continued to accumulate are provided in Supplementary Table S1.
Table 3
Incidence of CDC SARS-CoV-2 Variants of Concern in Population of Samples Sequenced
Lineage | # Samples | % of Total Sequenced
B.1.1.7 | 15,806 | 65.86
B.1.526 | 1,330 | 5.54
B.1.429 | 1,015 | 4.23
B.1.2 | 950 | 3.96
B.1.1.519 | 730 | 3.04
B.1.526.2 | 488 | 2.03
P.1 | 435 | 1.81
B.1.427 | 411 | 1.71
B.1 | 396 | 1.65