SARS-CoV-2 has a single positive-sense (+) RNA genome that encodes at least 29 proteins, including four structural proteins: the spike glycoprotein (S), the membrane protein (M), the envelope protein (E), and the nucleocapsid protein (N) [11]. The remaining proteins are known as nonstructural proteins (nsp) [12], which are produced by the host cell after viral infection [13]. The S protein, with a size of 180-200 kDa, is a transmembrane protein anchored in the viral membrane whose main function is to allow the virus to fuse with the host cell. To avoid detection by the host immune system, the spike proteins are coated with polysaccharides (hence the term glycoprotein) that camouflage the virus [14, 15]. The M protein is the most abundant structural protein of this virus, and while it is thought to also be a glycoprotein, its function is not fully understood [16].
There are three additional transmembrane proteins that are not as well studied as the S protein and are named after their open reading frame location: ORF3a, ORF7a, and ORF8. The ORF3a protein regulates the subcellular environment in the host and plays an important role in the defense against infection by inducing apoptosis [17]. Both ORF7a and ORF8 are accessory proteins whose functions have not been determined [18, 19]. Finally, the E protein forms a protein-lipid ion-transporting pore and is involved in virus morphogenesis, assembly, and the induction of apoptosis [20, 21]. Replicase polyprotein 1ab (pp1ab) and replicase polyprotein 1a (pp1a) are non-structural proteins (NSPs) involved in the transcription and replication of the virus and are produced after the virus infects the host cell. These proteins inhibit host translation and maintain optimal cellular conditions for the replication of SARS-CoV-2 [22].
In wastewater-based studies of COVID-19, the main technique for detecting SARS-CoV-2 has been RT-qPCR, which has been used to measure SARS-CoV-2 in untreated wastewaters in Australia [23], Canada [5, 6], Germany [8], the Netherlands [10], Spain [9], and the USA [7]. In these studies, scientists measured the concentration of the virus (number of copies/L) using an average of the N1 and N2 gene targets (both within the gene encoding the nucleocapsid protein) [4-6, 8-10]. Although this technique has produced satisfactory results for monitoring population-level infection, measuring viral proteins using liquid chromatography tandem mass spectrometry (LC-MS/MS) offers several advantages, including a considerably shorter run time [24], potentially improved sensitivity [25], the ability to multiplex and perform simultaneous non-targeted measurements, and a significantly reduced cost per sample compared to RT-qPCR [24]. In the present study, our goal was to determine whether we could detect SARS-CoV-2 proteins in untreated wastewater using a top-down proteomics approach with LC-MS/MS, and then to evaluate the potential of this methodology as a routine method for the detection of SARS-CoV-2 in wastewaters.
We analyzed 171 influent samples from 3 WWTPs (see Methods section for full details). In total, we identified 298 unique peptides from eight SARS-CoV-2 proteins: (1) structural proteins: M (3 peptides), S (30 peptides), and N (9 peptides); (2) transmembrane proteins: ORF3a (5 peptides), ORF7a (70 peptides), and ORF8 (4 peptides); and (3) the NSP produced after infection: pp1ab (177 peptides) (see Table S1 for complete protein identification details). The peptides were first identified with our untargeted method, and we then incorporated their m/z and retention time (RT) into a target list (see Table S2 for the target list). Figure 1 displays the number of peptides for each protein in influent samples over the course of the sampling period as a heatmap for each WWTP. The sequence coverage for each protein was: 13.33% (M), 38.02% (S), 13.33% (N), 30.26% (ORF3a), 40.50% (ORF7a), 57.02% (ORF8), and 45.67% for the non-structural protein pp1ab (Figure 2). It is immediately striking that pp1ab frequently yielded more detected peptides than any of the other 7 SARS-CoV-2 proteins.
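The sequence coverage values reported above can be illustrated with a minimal sketch. It assumes the usual definition of coverage (the percentage of residues in the protein that fall inside at least one identified peptide) and exact substring matching; the function name and the toy sequences are hypothetical and are not taken from the study data or from the search software used by the authors.

```python
def sequence_coverage(protein: str, peptides: list[str]) -> float:
    """Percent of protein residues covered by at least one peptide.

    Assumption: peptides are matched to the protein by exact substring
    search, and overlapping peptides are counted only once per residue.
    """
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:
            # Mark every residue spanned by this occurrence of the peptide.
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein)


# Toy example with made-up sequences (NOT SARS-CoV-2 data).
toy_protein = "MKTAYIAKQRQISFVKSHFSRQ"
toy_peptides = ["TAYIAK", "IAKQR", "SHFSR"]
coverage = sequence_coverage(toy_protein, toy_peptides)
print(f"coverage = {coverage:.2f}%")
```

In the toy example the first two peptides overlap, so together they cover residues 3-10 only once; this is why coverage is computed from a per-residue mask rather than by summing peptide lengths.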
Thus, we determined that the pp1ab protein was the most consistently present and abundant protein in the Durham Region wastewaters (WWs). Considering that the structural proteins will degrade as the SARS-CoV-2 virus travels through the human gut and the wastewater system, it makes sense that we did not detect structural proteins with as much consistency or intensity as the pp1ab protein. In contrast, after LC-MS/MS analyses of nasal swabs and gargle samples collected from positive COVID-19 patients, the most abundant peptides were from structurally abundant proteins (N most often, then the E, M, and S proteins) [25-27]. Those structural proteins are all part of the viral envelope, which makes them the easiest to detect in a freshly collected clinical sample, but they are also the proteins that are most exposed to environmental factors and will likely degrade first in wastewaters, which are known to contain many chemicals with the potential to degrade SARS-CoV-2 [28].
What is more interesting to consider is the fate of the pp1ab protein. As we currently understand SARS-CoV-2 pathology, the pp1ab protein would not be found inside the viral envelope because it is made by human cellular machinery after SARS-CoV-2 infection, and so we wonder: how is it possible that this protein is the most abundant in wastewaters? First, we considered the possibility that we were obtaining false positives; however, our most abundant and consistently detected peptide sequence (KAIKCVUPQADVEWKFY) had a local FDR < 0.1% and matched only SARS-CoV-2 viruses in a BLAST search against the entire UniProt database (1000 hits, default settings). Additionally, our results demonstrated > 45% sequence coverage of the pp1ab protein, with other high-quality (global FDR < 1%) peptide matches to pp1ab in our combined dataset from 15 weeks of wastewater samples across 6 different wastewater catchments. Therefore, we are confident that we are not observing a false positive result.
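The text does not state how the FDR values were estimated. As an illustration only, one widely used way to apply a global FDR cutoff is target-decoy counting, in which matches to a reversed or shuffled (decoy) database stand in for false positives. The sketch below assumes that approach; the function name, the score scale, and the toy matches are hypothetical, and a local FDR (the error rate near one particular score) would require a different estimator.

```python
def score_cutoff_at_fdr(matches, max_fdr=0.01):
    """Lowest score cutoff whose estimated global FDR stays within max_fdr.

    Assumptions: `matches` is a list of (score, is_decoy) pairs, a higher
    score means a better match, and the global FDR at a cutoff is estimated
    as (#decoys above cutoff) / (#targets above cutoff).
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep extending the accepted list while the estimate is acceptable.
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


# Toy matches (made-up scores, NOT study data): True marks a decoy hit.
toy_matches = [(9.0, False), (8.5, False), (8.0, False),
               (7.0, True), (6.5, False), (6.0, True)]
cutoff = score_cutoff_at_fdr(toy_matches, max_fdr=0.25)
print("accept matches with score >=", cutoff)
```

With a 1% threshold, as quoted for the global FDR in the text, at most one decoy would be tolerated per hundred accepted target matches under this estimator.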
We posit that there are two main factors that could affect the persistence of the pp1ab protein in the environment: (1) the abundance of the protein in the feces and urine of SARS-CoV-2-infected people within the community, and (2) the size and structure of the pp1ab protein. After an extensive literature search, we could not find any reports on the abundance of the pp1ab protein in urine or feces; the majority of papers that mention pp1ab describe only its general function in SARS-CoV-2 pathogenesis [29-31]. Thus, it is difficult to discuss whether this viral protein is shed at high levels through urine and feces, because we did not measure this directly. However, based on our results, we think it is highly probable that this protein is abundant in human excrement. The pp1ab protein is very large (7096 amino acid residues in length and 794,058 Da in mass) [32], which could be a second factor contributing to the abundance of pp1ab in wastewater. In infected cells, pp1ab is cleaved into multiple functional proteins by proteolytic auto-processing. If the protein is released from cells before complete auto-processing, its sheer size could make the pp1ab protein more resistant to complete degradation while travelling through the human gut and municipal wastewater system. Additionally, the peptide that we detected with the greatest frequency falls into the nsp14 region (amino acid positions 6261-6276), which is cleaved into the proofreading exoribonuclease (ExoN). We could not find any available information on the structure of ExoN, so we cannot determine whether our most abundant peptide falls into a protected pocket or region of ExoN, but we did attempt to model the peptide's likely structure using PEP-FOLD3, a freely available online modelling tool (RPBS, Parisian Resource in Structural Bioinformatics). The model suggests that our peptide would form a hairpin-like structure that could be stabilized by hydrogen bonding (Figure S1).
It is difficult to make any assumptions about the stability of the ExoN protein based upon this information, but we are confident that what we have observed in our study warrants further investigation into the 3D structure of this particular SARS-CoV-2 NSP. We also observed much of the pp1ab sequence coverage in the nsp2 and nsp3 (3C-like proteinase) regions, which may indicate that these regions (or proteins) are more resistant to degradation than other proteins in the pp1ab polyprotein sequence.
We employed a cross-correlation analysis comparing the abundance of pp1ab in the wastewater from each treatment plant to the number of cumulative active cases of COVID-19 by onset date in the corresponding municipalities (Figures 3 and 4). The pp1ab signal from WWTP1 had a lag of 23 days and a low positive, statistically significant correlation with the cumulative active cases (r = 0.5462, p < 0.0001 in 79 pairs of data); the pp1ab signal from WWTP2 had a lag of 9 days, with a low positive, statistically significant correlation (r = 0.4814, p < 0.0001 in 93 pairs of data); and the pp1ab signal from WWTP3 had a lag of 45 days and a low positive, statistically significant correlation (r = 0.4304, p = 0.0008 in 57 pairs of data). Even though the correlations are not strong, these results show that we can detect the presence of SARS-CoV-2 proteins from infected patients up to three weeks in advance of clinical case results (based upon the lag). There are many reasons why the correlations would be low. The presence of pp1ab in the feces of asymptomatic people or people who have not yet presented with symptoms, and variations in wastewater composition due to industrial influents, flow rates, storm events, and degradation of sample material during residence time, would all contribute to a high level of variability. The differences in lag could be due to the size of the population using the wastewater system and the residence time of the proteins within the wastewater catchment (from toilet to WWTP). We acknowledge that the lag of 45 days at WWTP3 seems much too large to be an accurate representation of clinical cases for the municipalities that this WWTP serves. This is likely because there are inputs from a subpopulation of another nearby municipality, which made it impossible to accurately identify the active clinical COVID-19 cases that contribute to the viral signal.
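The lagged comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: it assumes daily, equally spaced series, a positive lag meaning that the wastewater signal leads the case counts, and selection of the lag with the highest Pearson r. The function names and the synthetic series are hypothetical, and no p-value is computed here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def best_lag(signal, cases, max_lag):
    """Return (lag, r) maximising r between signal[t] and cases[t + lag].

    Assumption: lag >= 0, i.e. the wastewater signal precedes the cases.
    """
    best = None
    for lag in range(max_lag + 1):
        n = min(len(signal), len(cases) - lag)
        if n < 3:
            continue  # too few overlapping pairs to correlate
        r = pearson_r(signal[:n], cases[lag:lag + n])
        if best is None or r > best[1]:
            best = (lag, r)
    return best


# Synthetic data (NOT study data): cases copy the signal 3 days later.
signal = [math.sin(0.3 * t) + 0.05 * t for t in range(40)]
cases = [0.0, 0.0, 0.0] + signal[:-3]
lag, r = best_lag(signal, cases, max_lag=10)
print(f"best lag = {lag} days, r = {r:.3f}")
```

Note that the number of overlapping pairs shrinks as the lag grows, which is one reason the number of data pairs differs between treatment plants in an analysis of this kind.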
By comparison, much weaker and less significant correlations were obtained for qPCR measurements of SARS-CoV-2 N1 RNA in the same wastewater samples, which also likely reflects the many unknowns that contribute to high variability in the wastewater data (e.g., flow rates, storm events, population usage, degradation). We performed the same cross-correlation analysis between N1 RNA (normalized) by qPCR and the number of total cumulative active cases of COVID-19 by onset date (Figure 5). There was no statistical significance at WWTP1 in 80 pairs of data or at WWTP2 in 80 pairs of data, whereas at WWTP3 there was statistical significance with p < 0.0001 in 91 pairs of data. The correlation is stronger for WWTP3 but has a lag of 0, which is very different from the lag for the protein cross-correlation and is also inconsistent considering the complex wastewater matrix and the number of variables that affect this particular location.
To our knowledge, we are the first to use LC-MS/MS to measure SARS-CoV-2 proteins in wastewater. Neault et al. (2020) detected SARS-CoV-2 structural proteins with immunoblotting and then measured their abundance with an immune-linked PCR method. They found that proteins were detected more frequently than RNA and concluded that proteins were present in higher abundance. They also found that protein and RNA were visually correlated, but no statistical correlation analyses were performed. We attempted to correlate our RNA and protein data, but found no significant correlation, even with cross-correlation (data not shown). Thus, we think that a protein-based LC-MS/MS approach may have more utility than immunoblot and RNA methods because it correlates more strongly with case data, and because it is possible to measure many other human-protein biomarkers of SARS-CoV-2 infection and susceptibility simultaneously in the same sample using our shotgun approach, a topic we are exploring for future work.
Here, we have presented an MS-based method for wastewater (WW) samples that specifically detects SARS-CoV-2 proteins. We were able to identify unique peptides of at least eight proteins related to the SARS-CoV-2 virus and COVID-19 infection. We noticed a consistent presence in all the samples of the NSP pp1ab, which is only produced after host cell infection. Given the results from the present study, we suspect that the pp1ab protein is present in high abundance in the urine and (more likely) feces of active COVID-19 cases, and that this protein might make an excellent alternative biomarker for testing people for infection in a less invasive manner than the common nasal swab qPCR test. Our next steps are (1) to quantify our strongest signal peptide of pp1ab using a stable isotope labelled peptide standard, and (2) to conduct a multivariate analysis of all human proteins detected within the samples to identify proteins we can use for normalization, as well as other protein biomarkers of infection and susceptibility for COVID-19. Our ultimate goal is to use this methodology for sensitive, accurate, and routine detection of SARS-CoV-2 population-level infection in WW samples, thereby providing robust monitoring data and proof of principle that establishes the value of wastewater-based epidemiology for future public health studies.