Highly predictive regression model of active cases of COVID-19 in a population by screening wastewater viral load

The quantification of the SARS-CoV-2 load in wastewater has emerged as a useful method to monitor COVID-19 outbreaks in the community. This approach was implemented in the metropolitan area of A Coruña (NW Spain), where wastewater from the treatment plant of Bens was analyzed to track the dynamics of the epidemic in a population of 369,098 inhabitants. We developed statistical regression models that allowed us to estimate the number of infected people from the viral load detected in the wastewater with a reliability close to 90%. This is the first wastewater-based epidemiological model that could potentially be adapted to track the evolution of the COVID-19 epidemic anywhere in the world, monitoring both symptomatic and asymptomatic individuals. It can help to determine, with a high degree of reliability, the true magnitude of the epidemic in a place at any given time, and can be used as an effective early warning tool for predicting outbreaks.


medRxiv preprint (which was not certified by peer review), this version posted July 4, 2020. https://doi.org/10.1101/2020.07.02.20144865. The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

INTRODUCTION
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a novel member of the Coronaviridae family and is the pathogen responsible for coronavirus disease 2019 (COVID-19), which has led to a worldwide pandemic. Patients may present with a wide variety of symptoms, and the prognosis ranges from mild or moderate disease to severe disease and death 1 . Importantly, a significant percentage of those infected are asymptomatic, with studies finding that 20% to over 40% of cases show no symptoms 2-4 , which facilitates the silent spread of the disease. SARS-CoV-2 is an enveloped virus with a nucleocapsid made up of single-stranded RNA bound to protein N (Nucleocapsid), surrounded by a lipid membrane that contains structural proteins M (Membrane), E (Envelope) and S (Spike) 5-8 . The structure of protein S gives the virus its distinctive crown of spikes and is responsible for binding to the angiotensin-converting enzyme 2 (ACE2) receptor, which allows the virus to enter the host cell 9,10 . ACE2 receptors are present in a range of human cell types, with particular abundance in respiratory and gastrointestinal epithelial cells. In fact, an analysis of ACE2 receptor distribution in human tissues found the highest levels of expression in the small intestine 11 . Although respiratory symptoms are the most frequently described in patients with COVID-19, several studies have shown that the gastrointestinal tract can also be affected by SARS-CoV-2. A meta-analysis found that 15% of patients had gastrointestinal symptoms and that around 10% of patients presented gastrointestinal symptoms but not respiratory symptoms 12 . Conversely, SARS-CoV-2 RNA has been found in the faeces of people without gastrointestinal symptoms [13][14][15][16][17][18] . A systematic literature review found that more than half (53.9%) of those tested for faecal viral RNA were positive, and noted that the virus is excreted in the stool for long periods, in some cases a month or more after the individual has tested negative in respiratory samples 15,[19][20][21][22][23][24] . The fact that the virus can grow in enterocytes of human small intestine organoids 25,26 and the discovery of infectious virus in faeces highlight the potential for replication in the gastrointestinal epithelium of patients. However, despite several studies suggesting that transmission of the virus could take place through the faecal-oral axis 27,28 , there is so far insufficient evidence to confirm this route of contagion 19,29 .
Similarly, there has been no evidence of contagion through wastewater, which might reflect SARS-CoV-2's instability in water and its sensitivity to disinfectants 9,29-34 . Viral RNA can, nonetheless, be found in wastewater 33 , which has made monitoring of viral RNA load in sewage a promising tool for the epidemiological tracking of the pandemic [35][36][37][38][39][40] .
Wastewater is a dynamic system that can reflect the circulation of microorganisms in the population. Previous studies have evaluated the presence in wastewater of several viruses [41][42][43][44][45] . Processes to monitor SARS-CoV-2 in wastewater were first developed in the Netherlands 35 , followed by the USA 46 , France 39 , Australia 36 , Italy 47 and Spain 37,48 . In the Netherlands, no viral RNA was detected 3 weeks before the first case was reported, but genetic material started to appear over time, as the number of cases of COVID-19 increased 35 . A wastewater plant in Massachusetts detected a higher viral RNA load than expected based on the number of confirmed cases at that point, possibly reflecting viral shedding of asymptomatic cases in the community 38 . In Paris, wastewater measurements over a 7-week period, which included the beginning of the lockdown on March 17th, found that viral RNA loads reflected the number of confirmed COVID-19 cases and decreased as the number of cases went down, roughly following an eight-day delay 39 .
Another study, in a region of Spain with the lowest prevalence of COVID-19, detected the virus in wastewater before the first COVID-19 cases were reported 37 . A study from Yale University measured the concentration of SARS-CoV-2 RNA in sewage sludge and found that viral RNA concentrations were highest 3 days before peak hospital admissions of COVID-19 cases, and 7 days before peak community COVID-19 cases 40 . Again, wastewater inflow variability was highlighted as a prominent source of uncertainty, as well as the stability of the substances in wastewater and their pharmacokinetics. In general, wastewater-based epidemiology (WBE) studies show that, despite the large number of parameters involved in predicting the consumption rate of a specific substance, a correct selection of assumptions combined with a thoughtful modelling process can overcome such uncertainty, leading to accurate results.
Based on the available data, we have hypothesized that the viral load obtained from wastewater allows modelling that can predict outbreaks with high reliability. In fact, an earlier study has confirmed the theoretical feasibility of combining WBE approaches with SARS-CoV-2/COVID-19 data 52 .
In the present study we set out to monitor the wastewater viral RNA load from a treatment plant in the Northwest of Spain that services a metropolitan area with 369,098 residents, with the aim of developing new statistical regression models to estimate the number of infected people, including symptomatic and asymptomatic persons.

Flow variations during the COVID-19 epidemic
The main objective of the present work was to develop a useful statistical model to determine the entire SARS-CoV-2 infected population, including symptomatic and asymptomatic people, by tracking the viral load present in the wastewater. Since the flow is expected to influence the concentration of SARS-CoV-2 detected in the wastewater treatment plant (WWTP) Bens, a study of this variable was first carried out at the wastewater inlet of the treatment plant before, during and after the lockdown. Hourly mean flow box-plots (Figure 1) at WWTP Bens showed a clear daily trend, with the highest values between 12:00 and 18:00 in period A (before the lockdown). Over periods B, C and D, both the level and the variability of the flow decreased. The estimated time for the wastewater to reach the WWTP Bens along the network is between 1.5 h and 3 h, depending on the source in the metropolitan area. Therefore, when interpreting Figure 1, a peak between 12:00 and 18:00 reflects greater human or industrial activity at least 1.5 h beforehand.
In addition, the daily two-minute flow curves were exceptionally noisy (Figure 2A), so these curves were smoothed (Figure 2B) to study their underlying shape. When considering all the daily curves and grouping them into the four time intervals (a-d), clear patterns can be seen in Figure 2C. These patterns were even clearer when plotting the central curves (the deepest curves) within every group (Figure 2D).

Estimating COVID-19 active cases in the metropolitan area of A Coruña
To model the viral load, the number of COVID-19 cases needs to be reported or estimated. As explained in the methods section, there were difficulties in determining the number of people infected with SARS-CoV-2. Therefore, mathematical models were developed to estimate the COVID-19 cases, based on the data collected in Dataset S1.
Linear regression models (Figure 3) were successfully used to predict COVID-19 active cases in the A Coruña-Cee health area, where the region of this study is located (Figure S1). This was based on Intensive Care Unit (ICU) patients before April 29th (Figure 3A) and, from this date on, on active cases reported by health authorities in Galicia. Thus, a linear regression trend was fitted to predict the proportion of active cases in the health area of A Coruña-Cee based on the proportion of cumulative cases (Figure 3B). This linear regression fit was finally used to estimate the proportion of active cases in the metropolitan area of A Coruña served by the WWTP Bens, by means of the proportion of cumulative cases in the same area, which was directly obtained from the individual patient data.

Daily variation of viral load in the metropolitan area of A Coruña
The evolution of the viral load over the course of the day is an important feature for selecting narrower sampling intervals when the viral load is low and difficult to detect. Figure 4 shows the viral load trends at CHUAC (Figure 4A) and at Bens (Figure 4B) by hour of the day, on four different days. Figure 4A shows the hourly trend at CHUAC, with a maximum around 08:00, whereas the viral load curves at Bens (Figure 4B) attained a minimum around 05:00 and a maximum between 14:00 and 15:00.

Lockdown de-escalation in the metropolitan area of A Coruña
As expected, the mean viral load decreased with time when measured at CHUAC (Figure 5A, late April to mid-May) and at WWTP Bens (Figure 5B, mid-April to early June), following an asymptotic-type trend (fitted using GAM with cubic regression splines).
The viral load in WWTP Bens was consistent with the number of estimated COVID-19 cases in the metropolitan area of A Coruña, as shown in Figure 6. The number of copies of viral RNA per litre decreased from around 500,000 to less than 1,000, while the estimated cases of patients infected by SARS-CoV-2 decreased approximately 6-fold in the same period, reaching in both cases the lowest levels in the metropolitan area at the beginning of June.

Wastewater epidemiological models based on viral load for COVID-19 active cases prediction
Based on the correlation analysis (Figure 7), the number of active cases correlates strongly and linearly with the logarithm of the daily mean viral load at Bens (R=0.923), and more weakly with the mean flow (R=-0.362). Nonetheless, there is also a strong inverse linear relationship between active cases and time (R=-0.99), probably because our measurements coincide with the lockdown period. Therefore, fitting a linear model to estimate the active cases as a function of the logarithm of the viral load is reasonable.
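As a minimal sketch of the kind of correlation check described above, the following Python snippet computes a Pearson coefficient between active cases and the natural logarithm of the viral load. The function name `pearson_r` and all numeric values are hypothetical illustrations, not the study's data or code (the original analysis was carried out in R).

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical illustrative values, NOT the study's measurements.
viral_load = np.array([4.0e5, 2.5e5, 1.2e5, 6.0e4, 2.0e4, 5.0e3])   # copies per L
active_cases = np.array([6500.0, 6000.0, 5200.0, 4300.0, 3000.0, 1500.0])

r_log = pearson_r(np.log(viral_load), active_cases)  # cases vs. ln(viral load)
r_raw = pearson_r(viral_load, active_cases)          # cases vs. raw viral load
print(f"R with ln(V): {r_log:.3f}; R with V: {r_raw:.3f}")
```

With data of this shape, the coefficient against the logarithm is higher than against the raw load, which is the pattern that motivates a log-linear model.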
Different regression models were used to fit the backcasted number of COVID-19 cases based on the viral load, the flow, and the most relevant atmospheric variables (rainfall, temperature, and humidity). The best results were obtained using GAM models depending on the viral load and the mean flow (Figure 8). The effect of the viral load on the real number of COVID-19 active cases showed a logarithmic shape (Figure 8A), which suggests that the number of COVID-19 active cases can be modelled linearly as a function of the logarithm of the viral load. On the other hand, the shape of the effect of the mean flow on the number of COVID-19 active cases appears to be quadratic (Figure 8B), but its confidence band was wide and contained the horizontal line at height zero, which means that the effect of the mean flow was not significant (p-value=0.142).
Therefore, the only independent variable that was significant was the viral load, with R²=0.86.
Since the nonparametric estimation of the viral load effect had a logarithmic shape, a multiple linear model was fitted using the logarithmic transformation of the viral load, daily flow, rainfall, temperature, and humidity. Figure S2 shows the most explanatory models for a varying number of predictors under the R² maximization criterion; the only significant predictor was the viral load. In fact, when a multivariate linear model depending on three predictors (viral load, daily flow, and rainfall) was fitted, the data showed that the only significant explanatory variable was the viral load (p-value=1.32·10⁻⁸). Table S1 shows that the effects of the other two predictors, daily flow (p-value=0.186525) and rainfall (p-value=0.099239), were not clearly significant.
Finally, ignoring the rest of the explanatory variables, the natural logarithm of the viral load gave a good linear model fit (R²=0.851) that was useful to predict the real number of active COVID-19 cases (Figure 9A). After removing three outliers, the fit improved slightly (R²=0.894), as shown in Figure 9B.
The final fitted linear model had the form

$$N = \hat{\beta}_0 + \hat{\beta}_1 \log(V),$$

where N denotes the real number of active COVID-19 cases, V is the viral load (number of RNA copies per L), log stands for the natural logarithm, and $\hat{\beta}_0$ and $\hat{\beta}_1$ are the fitted intercept and slope.
For instance, a viral load of V = 150,000 copies per liter would lead to an estimated number of N = 5,543 active cases.
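The log-linear relationship described above, with N modelled as a linear function of the natural logarithm of V, can be sketched in Python as an ordinary least-squares fit. This is an illustration under stated assumptions: the helper names `fit_log_linear` and `predict_cases` are hypothetical, and the data are synthetic, generated from made-up coefficients that are unrelated to the coefficients fitted in the study.

```python
import numpy as np

def fit_log_linear(viral_load, active_cases):
    """Least-squares fit of N = b0 + b1 * ln(V); returns (b0, b1, R^2)."""
    x = np.log(np.asarray(viral_load, dtype=float))
    y = np.asarray(active_cases, dtype=float)
    X = np.column_stack([np.ones_like(x), x])        # design matrix [1, ln V]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return float(coef[0]), float(coef[1]), float(r2)

def predict_cases(b0, b1, viral_load):
    """Predicted number of active cases for a given viral load (copies per L)."""
    return b0 + b1 * np.log(viral_load)

# Synthetic data from made-up coefficients (assumption, for illustration only).
rng = np.random.default_rng(0)
true_b0, true_b1 = -1000.0, 500.0
V = np.exp(rng.uniform(7.0, 13.0, size=40))
N = true_b0 + true_b1 * np.log(V) + rng.normal(0.0, 50.0, size=40)

b0, b1, r2 = fit_log_linear(V, N)
print(f"b0={b0:.1f}, b1={b1:.1f}, R^2={r2:.3f}")
print(f"prediction at V=150,000 (synthetic model): {predict_cases(b0, b1, 150_000):.0f}")
```

The printed prediction refers only to the synthetic model; it is not comparable with the estimate quoted in the text.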
The prediction ability of this fitted linear model, the GAM, and the linear and quadratic LOESS models was evaluated using a 6-fold cross-validation procedure, to prevent overfitting. In all cases, the response variable was the estimated number of real active cases in the metropolitan area, and the explanatory variable was the natural logarithm of the viral load. Table S2 shows the corresponding prediction R² for each one of the four models, along with the root mean squared prediction error (RMSPE). The smaller this error, the better the predictive ability of the model. All the models provided quite accurate predictions for the number of active cases using the viral load, with an error of around 10% of the response range. The model with the lowest prediction error, 9.5%, was the quadratic LOESS model. Flexible models, such as LOESS and GAM, slightly improved the predictive performance when compared with the linear model, which has a prediction error of around 11.4% of the response range. The quadratic LOESS model was also the one with the largest value for R². Therefore, it provided the best predictive results. Figure 10A shows a scatter plot of the number of real active cases in the metropolitan area versus the natural logarithm of the viral load, along with the quadratic LOESS fitted curve. Figure 10B displays the actual and predicted values of real active cases. The diagonal line was added to compare with the perfect model prediction.
In summary, the quadratic LOESS curve of Figure 10A captured the relationship between the number of active cases and the viral load, avoided overfitting, and showed a good predictive ability (R²=0.88, RMSPE=478), the best among all the considered models.
In addition, the linear model (R²=0.851, RMSPE=581.94) has the advantage of simplicity, so both models could be successfully used to predict the number of infected people in a given region based on data about viral load obtained from wastewater.
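The cross-validated evaluation described above (prediction R², RMSPE, and the error as a fraction of the response range) can be sketched as follows for the log-linear model only. This is a hedged illustration: the function name `kfold_cv_log_linear`, the random fold assignment and the synthetic data are assumptions, and the study itself used the caret R package rather than this code.

```python
import numpy as np

def kfold_cv_log_linear(viral_load, active_cases, k=6, seed=0):
    """k-fold cross-validation of N ~ ln(V).

    Returns (RMSPE, prediction R^2, RMSPE as a fraction of the response range).
    """
    x = np.log(np.asarray(viral_load, dtype=float))
    y = np.asarray(active_cases, dtype=float)
    idx = np.random.default_rng(seed).permutation(len(y))
    pred = np.empty_like(y)
    for test_idx in np.array_split(idx, k):
        train_idx = np.setdiff1d(idx, test_idx)
        # np.polyfit returns the highest-degree coefficient first: slope, intercept.
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        pred[test_idx] = intercept + slope * x[test_idx]   # out-of-fold prediction
    err = y - pred
    rmspe = float(np.sqrt(np.mean(err ** 2)))
    pred_r2 = float(1.0 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2))
    return rmspe, pred_r2, rmspe / float(y.max() - y.min())

# Synthetic data from made-up coefficients (assumption, for illustration only).
rng = np.random.default_rng(42)
V = np.exp(rng.uniform(7.0, 13.0, size=36))
N = 200.0 + 450.0 * np.log(V) + rng.normal(0.0, 150.0, size=36)

rmspe, pred_r2, rel_err = kfold_cv_log_linear(V, N, k=6)
print(f"RMSPE={rmspe:.1f}, prediction R^2={pred_r2:.3f}, error/range={rel_err:.1%}")
```

Because every prediction is made on observations excluded from the fit, the resulting error estimates the out-of-sample performance rather than the in-sample fit.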

[...] were continuously analyzed until early June for this study, although surveillance will continue at WWTP Bens until the virus disappears.
The wastewater data obtained from April 19th onwards confirmed the decrease in COVID-19 incidence. We showed that time-course quantitative detection of SARS-CoV-2 in wastewater from WWTP Bens correlated with COVID-19 confirmed cases, which supports the plausibility of our approach. Moreover, the seroprevalence studies carried out by the Spanish Centre for Epidemiology showed that cases in A Coruña represented about 1.8% of the local population. This means that, for a population of about 369,098 inhabitants, the number of people infected with SARS-CoV-2 contributing their sewage to the WWTP Bens would be around 6,644, which includes people with symptoms and those who are asymptomatic. Considering that the ratio between people with symptoms (reported by the health service) and the total infected population (including asymptomatic people) is estimated to be 1:4, we calculated that reported cases contributing their wastewater to WWTP Bens would be around 1,661, which is close to the maximum number of cases reported in the A Coruña-Cee area (1,667 cases on April 28th). It must be noted that the criteria used by the authorities to report cases varied over time, which may explain the gap between the graphs reported in the media throughout the epidemic and our Figure 6, where a decrease in both the viral load and the estimated COVID-19 cases can be observed from mid-April to early June.
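The arithmetic behind these two figures can be checked with a few lines of Python. The variable names are illustrative; the inputs are the population, seroprevalence and 1:4 ratio quoted in the paragraph above.

```python
population = 369_098          # inhabitants served by WWTP Bens
seroprevalence = 0.018        # about 1.8% of the local population
symptomatic_to_total = 1 / 4  # reported (symptomatic) cases : total infected

total_infected = round(population * seroprevalence)            # 6,644
reported_cases = round(total_infected * symptomatic_to_total)  # 1,661

print(total_infected, reported_cases)
```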
An initial study of flows was made to analyze their variability, which could have influenced the concentration of the virus in the wastewater. For instance, it was expected that on rainy days the viral load detected at the entrance of WWTP Bens would be lower than on dry days with the same number of COVID-19 cases. Therefore, a study of flow rates was first carried out at the wastewater inlet of the treatment plant before, during and after the lockdown. The daily flow analysis showed that the usage of the sewage network changed over time, especially when comparing the pre-pandemic period with the three phases during the lockdown. The mean flow curves exhibited higher levels before the lockdown (period a) and their levels decreased as time passed (b-d). In addition, the mean flow curves tended to shift to the right when moving from period a to periods b, c, and d. This is probably due to the change in habits related to the restrictions on work activity, the confinement of people in their homes and the increasing paralysis of economic activity during the state of alarm in Spain.
The level of the curve at WWTP Bens at the beginning of May was much greater than that corresponding to the 11th May. This is due to the effectiveness of the confinement measures applied in Spain. The 12th May daily curve at CHUAC showed a higher viral load than the one corresponding to the 11th May at WWTP Bens, showing that the viral load measured at the hospital tends to be higher than at WWTP Bens, as expected.
In the present work, nonparametric and even simple parametric regression models have been shown to be useful tools to construct prediction models for the real number of [...] Other possible explanatory variables (such as rainfall or the mean flow) did not enter the model. Although this is somewhat counterintuitive (dilution should affect the viral load measured), it is important to point out that rainfall fluctuated little: its median was 0, its mean was 2.88 L/m² and its standard deviation was 6.59 L/m².
Therefore, as a consequence of the results of the GAM fit, a simple linear model was considered to fit the estimated number of COVID-19 active cases as a function of the logarithm of the viral load. The percentage of variability explained by the model was reasonably high (85.1%), but three outliers were detected in the sample. Excluding these three observations, no further outliers were detected, and the percentage of variability explained by the model increased to 89.4%. Alternative, more flexible models, such as GAM and LOESS, were also fitted. They produced slightly better results in terms of R² and RMSPE. However, the final fit for the quadratic LOESS model was similar to the linear model fit.
In conclusion, a simple linear model that relates the logarithm of the viral load to the number of COVID-19 active cases gave a good fit, explaining nearly 90% of the variability of the response. Of course, this is a simple model. Although a quadratic LOESS model improved the prediction error to a certain degree, the final fit was similar.
These two models (linear or quadratic LOESS) can serve as new epidemiological tools, complementary to the seroprevalence or RT-PCR tests carried out by healthcare institutions.
Our models, as described, are only applicable to the metropolitan area of A Coruña, the region for which these models have been developed. This area has Atlantic weather and it may rain substantially in autumn and winter, which could lead to explanatory variables such as rainfall and/or mean flow becoming significant for those seasons and needing to enter the prediction model. Thus, when applying these models to the same location but in seasons with different climatic behavior, they might need to be reformulated. In addition, the methodology used to build these statistical models could be used at other locations for epidemiological COVID-19 outbreak detection, or even for other epidemic outbreaks caused by other microorganisms. Of course, in that case a detailed data analysis would have to be carried out as well, since specific features of the sewage network or the climate may affect the model itself.
These are the first highly reliable wastewater-based epidemiological statistical models that could be adapted for use anywhere in the world. The models allow the actual number of infected patients to be determined with around 90% reliability, since they take into account the entire population, whether symptomatic or asymptomatic. These statistical models can estimate the true magnitude of the epidemic at a specific location, and their cost-effectiveness and sampling speed can help alert health authorities to possible new outbreaks, which will help to protect the local population.

Sample Collection
The WWTP of Bens (

RNA extraction and qRT-PCR assays
RNA was extracted from the concentrates using the QIAamp Viral RNA Mini Kit (Qiagen, Germany) according to the manufacturer's instructions. Briefly, the sample was [...]
For RNA quantification, a standard curve was built using the Human 2019-nCoV RNA standard from European Virus Archive Global (EVAg) (Figure S3). To build the calibration line, the decimal logarithm of SARS-CoV-2 RNA copies/µL, ranging from 5 to 500, was plotted against Ct (threshold cycle) values. Calibration was done by amplifying the N gene.
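A calibration of this kind can be sketched in Python as a straight-line fit between Ct and the decimal logarithm of the copy number, which is then inverted to quantify an unknown sample. The function names, the direction of the regression (Ct regressed on log10 copies) and the Ct values below are assumptions made for illustration; they are not the study's calibration data.

```python
import numpy as np

def fit_standard_curve(copies_per_ul, ct):
    """Fit Ct = intercept + slope * log10(copies/µL); returns (slope, intercept)."""
    slope, intercept = np.polyfit(np.log10(copies_per_ul), ct, deg=1)
    return float(slope), float(intercept)

def copies_from_ct(ct, slope, intercept):
    """Invert the calibration line to estimate copies/µL from a Ct value."""
    return 10.0 ** ((np.asarray(ct, dtype=float) - intercept) / slope)

# Hypothetical dilution series within the 5-500 copies/µL range; Ct values are made up.
standards = np.array([5.0, 50.0, 500.0])
ct_values = np.array([35.0, 31.7, 28.4])

slope, intercept = fit_standard_curve(standards, ct_values)
estimate = float(copies_from_ct(31.7, slope, intercept))
print(f"slope={slope:.2f} Ct per decade, estimated copies/µL at Ct 31.7: {estimate:.1f}")
```

Estimates for Ct values outside the calibrated range would be extrapolations and should be treated with caution.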

Data collection
The present study also includes data gathered from different sources. More specifically, two-minute flow measurements at WWTP Bens for the period January [...]

Exploratory flow study
Since flow may be an important variable when determining the viral load in the wastewater, an exploratory data analysis of the volume of water pumped at the WWTP in Bens during the lockdown period was performed. Two-minute data of volume of [...] The mean hourly flow at the WWTP was computed and a multivariate exploratory data analysis was performed. Box-plots for flow at every hour were computed using the data in every time period a-d. They were plotted as a function of the hour of the day.
Exploratory functional data analyses were also performed. Since the raw flow curves are exceedingly noisy, local polynomial methods 57 were used to obtain smooth curves. A direct plug-in method 58 was used to choose the smoothing parameter. The collection of smoothed daily curves was analyzed and the deepest curve 59 within every time period a-d was computed.
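The smoothing step can be illustrated with a minimal local linear (degree-one local polynomial) smoother in Python. This sketch makes several assumptions: it uses a Gaussian kernel and a fixed, hand-picked bandwidth instead of the direct plug-in selector cited above, the function name `local_linear_smooth` is hypothetical, and the flow curve is simulated.

```python
import numpy as np

def local_linear_smooth(t, y, bandwidth, grid=None):
    """Local linear regression with a Gaussian kernel, evaluated on `grid`."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    grid = t if grid is None else np.asarray(grid, dtype=float)
    fitted = np.empty(len(grid))
    for i, t0 in enumerate(grid):
        d = t - t0
        w = np.exp(-0.5 * (d / bandwidth) ** 2)          # kernel weights
        s0, s1, s2 = w.sum(), (w * d).sum(), (w * d * d).sum()
        m0, m1 = (w * y).sum(), (w * d * y).sum()
        # Intercept of the weighted least-squares line centred at t0.
        fitted[i] = (s2 * m0 - s1 * m1) / (s0 * s2 - s1 ** 2)
    return fitted

# Simulated two-minute flow curve for one day (made-up shape and noise level).
rng = np.random.default_rng(1)
hours = np.arange(0.0, 24.0, 2.0 / 60.0)
true_flow = 1.0 + 0.5 * np.sin(2.0 * np.pi * hours / 24.0)
noisy_flow = true_flow + rng.normal(0.0, 0.3, size=hours.size)

smooth_flow = local_linear_smooth(hours, noisy_flow, bandwidth=1.0)
print(f"mean abs error raw: {np.mean(np.abs(noisy_flow - true_flow)):.3f}, "
      f"smoothed: {np.mean(np.abs(smooth_flow - true_flow)):.3f}")
```

A larger bandwidth gives smoother but more biased curves, which is why a data-driven selector such as the plug-in method is preferable in practice.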

Backcasting of COVID-19 active cases
Preliminary statistical methods have been devised to backcast the real number of COVID-19 active cases based on reported official figures.
Follow-up times (available only until May 7th) for anonymized individual official COVID-19 cases in Galicia (NW Spain, where A Coruña is located, Figure S1) were used to count the number of cases by municipality based on patient zip codes. Since the epidemiological discharge time is missing, the number of active cases in the metropolitan area of A Coruña could not be obtained, but the cumulative number of cases was computed. On the other hand, the main epidemiological series for COVID-19 were publicly available in Galicia at the level of health areas. However, the definition of one of the series changed from cumulative cases to active cases on April 29th.
[...] for May 18th-June 1st (prevalence 2.2%). Comparing these numbers with the official figures on May 11th (10,669) and June 1st (11,308) gives estimated ratios of 5.316 and 5.254 in these two periods, with an average of around 5.29. This conversion factor was used to backcast the series of real active cases based on the estimated daily official COVID-19 cases in the metropolitan area of A Coruña. Some of these series, including the backcasted series of real active cases, are included in Dataset S1.
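The conversion step can be illustrated with a short Python sketch. It assumes that the backcast is a simple multiplication of the official series by the averaged ratio; the function name `backcast_real_cases` and the example official counts are hypothetical.

```python
# Ratios of seroprevalence-based estimates to official counts, as quoted in the text.
ratios = [5.316, 5.254]
conversion_factor = sum(ratios) / len(ratios)   # 5.285, reported as "around 5.29"

def backcast_real_cases(official_cases, factor=conversion_factor):
    """Scale a series of official case counts to estimated real cases (assumed rule)."""
    return [factor * c for c in official_cases]

# Hypothetical official daily counts, for illustration only.
example_official = [1000, 800, 600]
print([round(x) for x in backcast_real_cases(example_official)])
```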

Nonparametric fitting of viral load over time
Generalized Additive Models (GAM) using a basis of cubic regression splines 61 were used to fit the viral load as a function of time at CHUAC from April 22nd to May 12th and at WWTP Bens from April 16th to June 3rd. Several outliers were removed from the data, corresponding to unexpected and intensive pipeline cleaning episodes (8-hour cleaning with water at 70 °C on Thursday-Friday nights) carried out on April 23rd-24th, April 30th-May 1st and May 7th-8th.

Viral load models
GAM and LOESS 62 nonparametric regression models were used to explain the viral load (number of RNA copies) as a function of time, to describe the trend of the response variable within a day and across days.
Well-known regression models, such as simple and multivariate linear models, and more flexible models, such as nonparametric (e.g. local linear polynomial regression) and semiparametric (GAM and LOESS) models, were formulated. The latter allowed the introduction of linear and smooth effects of the predictors on the response.
All these models were successfully used to predict the number of COVID-19 active cases based on the measured viral load at WWTP Bens, the daily flow in the sewage network, and other environmental variables, such as rainfall, temperature and humidity.
Diagnostic tools (Q-Q plots, residuals versus fitted values plots and Cook's distance) were used for outlier detection, which improved the model fit.
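As an illustration of one of these diagnostics, the following Python sketch computes Cook's distance for a simple linear regression and flags influential observations. The function name `cooks_distance`, the synthetic data and the 4/n cut-off (a common rule of thumb, not a threshold stated in this work) are assumptions.

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance of each observation in the regression y = b0 + b1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat (projection) matrix
    h = np.diag(H)                            # leverages
    resid = y - H @ y                         # ordinary residuals
    s2 = resid @ resid / (n - p)              # residual variance estimate
    return (resid ** 2 / (p * s2)) * h / (1.0 - h) ** 2

# Synthetic example: a clean line with one contaminated observation at index 10.
rng = np.random.default_rng(3)
x = np.arange(20, dtype=float)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=20)
y[10] += 30.0

d = cooks_distance(x, y)
flagged = np.flatnonzero(d > 4.0 / len(x))    # rule-of-thumb threshold (assumption)
print("most influential index:", int(np.argmax(d)), "flagged:", flagged.tolist())
```

Flagged observations are candidates for inspection; whether to remove them remains a judgement that depends on the measurement context.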
The R statistical software was used to perform statistical analysis 63 . Namely, the mgcv library 64 was applied to fit GAM models and ggplot2 and GGally 65,66 to perform correlation analysis, obtain graphical output and fit LOESS models, respectively. The caret R package was used to fit and evaluate regression models.
Although some RT-qPCR replicates could not be measured due to the limitations of the detection technique (some errors occurred when the number of copies/L was under 10,000), 74% of the assays led to three or more measured replicates, which provides a reasonable statistical basis. Conditional mean imputation 67 was used for unmeasured replicates. Thus, unmeasured replicates in an assay were replaced by the sample mean of the observed measurements in that assay. In the only assay in which all (six) replicates were unmeasured, the number of RNA copies was imputed using the minimum measured viral load across the whole set of assays.
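The imputation rule described above can be sketched in Python as follows. The sketch assumes that assays are stored as rows of a rectangular array with NaN marking unmeasured replicates, and that the minimum is taken over individual measured replicates; the function name `impute_replicates` and the example values are hypothetical.

```python
import numpy as np

def impute_replicates(assays):
    """Conditional mean imputation of unmeasured replicates (NaN) per assay.

    Rows are assays and columns are replicates. An assay with no measured
    replicate is filled with the minimum measured value over all assays.
    """
    data = np.array(assays, dtype=float)      # copy, so the input is not modified
    global_min = np.nanmin(data)              # minimum over measured values only
    for row in data:                          # each row is a view into `data`
        missing = np.isnan(row)
        if missing.all():
            row[:] = global_min
        elif missing.any():
            row[missing] = row[~missing].mean()
    return data

# Made-up replicate measurements (copies per L), for illustration only.
example = [
    [10.0, np.nan, 14.0],
    [np.nan, np.nan, np.nan],
    [20.0, 22.0, 24.0],
]
print(impute_replicates(example))
```

In this example the first assay's missing replicate becomes 12.0 (the mean of 10.0 and 14.0) and the fully unmeasured assay is filled with 10.0, the smallest measured value.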

ACKNOWLEDGEMENTS
This work was funded by Project INV04020 from the University of A Coruña Foundation (FUAC-UDC) and EDAR Bens S.A., A Coruña, Spain, awarded to MP, by Projects [...]
