Prevalence Estimates of COVID-19 by Web Survey Compared to Inadequate Testing

Background: Current prevalence of COVID-19 drives many policy decisions, but its estimation is hampered by ambiguities in testing and reporting. We propose an alternative method for estimating community prevalence that is inexpensive and timely. We test the hypothesis that survey sampling provides a quantitative prevalence estimate similar to that from widespread genomic or serological testing. Methods: We built a simple, web-based survey of signs and symptoms of COVID-19 based on six questions. To maintain privacy, no personally identifiable information is collected. Sampling can be directed to a population of interest, such as a company, or broadcast widely to obtain geographic sampling. Data reporting can be real-time and plotted onto zip code maps. Rates of prevalence were calculated from presumed COVID cases and respondents, with confidence intervals based on the Blaker method. Results: The website was created quickly, and survey results were quantitatively useful after only a few days. Analyzing 3161 cases from CountCOVID.org, we found a community prevalence of 7% in Georgia, much greater than the reported confirmed cases. Our prevalence estimate of 21% in New York City was similar to the 19.6% reported by surveillance antibody serotesting. Our estimates are validated by five other community surveillance studies using genomic or antibody testing. Conclusions: Prevalence and incidence of COVID-19 symptoms in the community can be estimated by a crowd-sourced website at considerably less expense than widespread testing.

available in the future, but it is equally unrealistic to rely on future testing only for public health decisions.
The President has threatened to withhold testing and testing results to reduce the perceived number of cases. Some advocate testing an entire population at multiple times to capture the spread of the disease. Testing millions of people, then repeating the testing every two weeks to catch an outbreak, would be extremely expensive. The nasopharyngeal swab is currently the best method for obtaining samples for COVID-19, but it is uncomfortable, and almost no one volunteers to give this sample a second time. Some people liken the experience to a brain biopsy or report that it gave them headaches for 3 days.
Epidemiology has traditionally counted prevalence and incidence by directly interviewing populations. 6 Clinical diagnosis of COVID-19 currently relies primarily on symptoms and signs, especially for mild cases. 5 Traditionally, public health surveillance does not require PCR testing, but does require ongoing, systematic collection of data, defined by the CDC as "Syndromic Surveillance". 5,25 Statistically, sampling is an efficient way of gathering these population statistics without testing everyone. 8 Some groups have reported using digital means such as a smartphone application or web-based series of directed questions to gather information on symptoms that can be tracked individually. [9][10][11] Triangulation of estimates of prevalence and incidence by independent means can substantially reduce errors. 7 The independent information may address issues if a state manually changes data 12 or claims that coronavirus outbreaks are due to rises in testing. 13 We hypothesize that a web-based system of data collection from the population can provide a reasonable and useful estimate of prevalence at much lower cost than expensive, widespread laboratory testing. Our metrics are then compared and validated against several independent publications describing community testing.

Methods:
An interactive website was developed to survey the community for signs and symptoms of COVID-19. See CountCOVID.org. This minimum viable system was developed as a proof of concept that could be implemented within one week with minimal expenditures. No application was required to be downloaded and no personally identifiable information was collected or extracted. Because no personally identifiable information was collected, the need for consent was waived by the Institutional Review Board of Georgia Institute of Technology. The website was purposely designed to ask only 6 binary questions to encourage completion (Figure 1). The questionnaire was newly written for this study by one of the authors, and the authors retain copyright of the website and print screens. User zip code was requested to assign cases to a geographic area. Users were asked about fever, cough, shortness of breath, loss of smell, difficulty breathing, and previous COVID testing. These symptoms were selected to be sensitive and specific to reports of COVID illness. 1,2 The respondents were instructed to report any symptoms experienced since March 2020. A ThankYou page acknowledged the submission of the survey and provided instantaneous information to the user (Figure 2). Knowing the sample size, we could calculate a prevalence per 100,000 population (php) for a geographic region. 8 Crude rates of prevalence were calculated from presumed COVID cases and respondents in any one county, with confidence intervals based on the Blaker method. 14 No adjustment for age and gender was made due to the lack of personal identification information. For geographic data visualization, the open-source QGIS 3.12 GIS application was used.

Results:
The first 3161 cases were collected between April 10 and April 25, 2020. The responses to the six questions were analyzed for COVID symptoms (Fig. 1). 9% had fever, 18% had a dry cough, and 4% had lost smell (Fig. 2). 7% reported difficulty breathing, while 88% could easily hold their breath at the time of the survey. Only 1% had tested positive for COVID. Using a combination of signs and symptoms led to a metric of Presumed COVID infection in 7% of the sample population. Presumed COVID was defined for this dataset as any of the following: a positive COVID test; fever plus cough; or loss of smell with any other symptom such as fever, cough, difficulty breathing or inability to hold breath.
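The Presumed COVID definition above can be written as a short rule. The sketch below is one possible encoding, not the website's actual code; the `Response` class and its field names are hypothetical stand-ins for the survey answers.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One survey submission (field names are hypothetical)."""
    fever: bool
    cough: bool
    loss_of_smell: bool
    difficulty_breathing: bool
    can_hold_breath: bool
    tested_positive: bool


def is_presumed_covid(r: Response) -> bool:
    """Apply the three criteria for Presumed COVID described in the text."""
    if r.tested_positive:                 # criterion 1: positive COVID test
        return True
    if r.fever and r.cough:               # criterion 2: fever plus cough
        return True
    other_symptom = (
        r.fever or r.cough or r.difficulty_breathing or not r.can_hold_breath
    )
    return r.loss_of_smell and other_symptom   # criterion 3


# Example: loss of smell together with a cough counts as presumed COVID.
example = Response(fever=False, cough=True, loss_of_smell=True,
                   difficulty_breathing=False, can_hold_breath=True,
                   tested_positive=False)
print(is_presumed_covid(example))
```

Keeping the rule in one small function makes it straightforward to re-run the analysis if the criteria are later revised.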
This yielded a prevalence of 7,000 php for Georgia (7%) and about 20,000 php for New York City (20%). Note that the prevalence values are based on limited sampling, so the 95% confidence intervals are given in Table 1. We used data from 1018 cases in Georgia and 103 cases in NYC. The COVID confirmed case counts for reference were obtained from the JHU CSSE COVID-19 data repository. 15 The county prevalence varied throughout the state of Georgia. The four major counties in the Atlanta metropolitan area are shown in Figure 3. At the county level, the prevalence (php) from CountCOVID.org for Fulton County was 5,255, lower than that of Cobb County, which had the highest prevalence at 9,890, even though Fulton had the most total confirmed cases in Georgia.
Counties in the New York City area are shown in Figure 4. Note that Nassau and Queens display a very high php of up to 29,000, while Manhattan and Westchester were lower at around 16,000 php. Hudson County in New Jersey was high, while the adjoining Bergen County was much lower at 5,900 php.
Comparison with confirmed cases in Georgia indicates that the Presumed COVID cases in the wider community are ~40x the confirmed cases. 16 There were regional differences: Fulton County had Presumed COVID cases about 20x higher than its ~300 confirmed cases for this period, while Cobb County symptomatic cases were ~40x higher than its ~230 reported confirmed cases.
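The ratio quoted above scales a survey prevalence to a whole population and divides by the confirmed count. A minimal sketch of that arithmetic follows; the function name and all three input values are hypothetical and are not taken from the study data.

```python
def presumed_to_confirmed_ratio(php, population, confirmed_cases):
    """Ratio of presumed community cases to confirmed cases.

    php is the survey prevalence per 100,000 population.
    """
    presumed_cases = php * population / 100_000
    return presumed_cases / confirmed_cases


# Hypothetical region: 1,000,000 residents, survey prevalence of 7,000 php,
# and 1,750 confirmed cases.
ratio = presumed_to_confirmed_ratio(php=7_000, population=1_000_000,
                                    confirmed_cases=1_750)
print(f"presumed cases are about {ratio:.0f}x confirmed cases")
```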
For the month of May, the survey was changed to ask about symptoms in the past two weeks only. The time restriction yields an estimate of incidence rather than prevalence. The data were then analyzed in weekly intervals to estimate incidence as it changed. The incidence for May is given in Table 2. 22 The remarkable similarity in prevalence estimates between our survey-based study and the aforementioned testing studies highlights the ability of self-reporting to yield a reasonable determination of COVID-19 prevalence.
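The weekly analysis of the May survey described above amounts to grouping submissions by calendar week and taking the fraction of respondents meeting the presumed COVID definition. The sketch below shows one way to do this, assuming each record is a pair of submission date and a presumed-COVID flag; the record layout, the function name, and the sample records are hypothetical.

```python
from collections import defaultdict
from datetime import date


def weekly_incidence(responses):
    """Group (submission_date, is_presumed) pairs by ISO week.

    Returns {monday_of_week: (presumed_count, respondent_count, fraction)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for day, presumed in responses:
        year, week, _ = day.isocalendar()
        week_start = date.fromisocalendar(year, week, 1)  # Monday of that week
        counts[week_start][0] += int(presumed)
        counts[week_start][1] += 1
    return {
        week_start: (cases, total, cases / total)
        for week_start, (cases, total) in sorted(counts.items())
    }


# Hypothetical submissions, for illustration only.
sample = [
    (date(2020, 5, 4), True),
    (date(2020, 5, 6), False),
    (date(2020, 5, 10), False),
    (date(2020, 5, 12), False),
]
for week_start, (cases, total, fraction) in weekly_incidence(sample).items():
    print(week_start, cases, total, round(fraction, 3))
```

Small weekly samples give wide intervals, so each weekly fraction would normally be reported with a confidence interval.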
The prevalence values for community COVID are much greater than confirmed cases. Given the preferential testing of moderate to severe cases which present in the hospital setting, it is likely that the number of confirmed cases greatly underestimates the overall prevalence and incidence of the disease.
When the CountCOVID results showed 40x confirmed cases in Georgia, we were initially concerned that this ratio was improbable and too high. Since then, two antibody surveillance studies in California gave ratios of 43.5x (28-55) for Los Angeles County 20 and 54x (25-91) for Santa Clara County 21. Thus, despite the striking discrepancy with confirmed cases, our estimate of 40x for Georgia appears to be validated.
The ratio of prevalence values to confirmed counts may fall as the amount of widespread testing (php) increases. For example, the ratio for NYC (20x) is about half of that for Georgia (40x), although NYC has 5x more confirmed cases php. NYC had the largest number of tests in the country as of May 2020. Although presumed cases and confirmed cases may approach each other with increasing testing, urgency and financial considerations are of utmost importance.
Incidence will track the number of new cases and may provide information on basal levels and outbreaks.
For the month of May, we estimate the prevalence in Georgia to be between 0.56-1.79% as the state reopened. In comparison, the COVID-19 website estimates symptomatic COVID incidence as between 0.2-0.4% for a slightly different time frame. 23 The order of magnitude is similar, and all the values may reflect sampling bias and the definition of COVID symptoms. Nonetheless, the similarity in values provides confidence that this method of self-reporting is scientifically reasonable.
Criteria for presumed COVID may need to be refined as we learn more about this disease. We did not ask for symptoms of diarrhea or "COVID toes" in the initial survey. Each question is subject to false positives and false negatives. For instance, cough was quite common in the community, especially given the temporal association with an uptick in hay fever symptoms during the spring. Thus, cough by itself at 17% was not specific and would have many false positives. Fever was at 9%, but can be caused by a plethora of illnesses. However, the intersection of fever and cough yielded about 5%. We included in the definition of Presumed COVID all positively tested individuals and loss of smell plus at least one other symptom. These additional categories contributed a small subset of the total cases in the final estimate of Presumed COVID prevalence of 7%. This algorithm of signs and symptoms mirrors current clinical judgement. We do note that the survey by the Imperial College, with users predominantly from the UK, yielded a much greater percentage of loss of smell. 24 It is not known whether this is a difference in patient population or survey technique. The selection of other criteria can be made post-hoc, but our current criteria yielded extraordinarily similar results to serotesting. Given time, the analysis can be back-calibrated with selective surveillance testing to correct for errors or biases. However, the application of natural intelligence instead of artificial intelligence may be good enough.
Prevalence is an important parameter for determining the effectiveness of social distancing, testing, and herd immunity. The proposed method of web sampling is rapid and inexpensive. Given the current financial crisis which has resulted from this pandemic, web sampling can minimize the economic cost of quantifying disease burden. In contrast, serologic or PCR testing is much more expensive. PCR testing of 1000 people would cost approximately $1 million. The web-based survey of 1000 people is estimated to cost approximately $100. Because one can sample quickly and often, a sudden increase in symptoms on CountCOVID.org may provide advance warning of an outbreak.
This survey, like all sampled studies, has bias. Bias is possible if the sample size is small or skewed by the population completing the survey. It would be essential to widely encourage the population to participate in a web-based survey such as ours. The current results are likely biased toward adult faculty and staff from Georgia Tech who are employed, rather than the general population. Unfortunately, the comparison confirmed cases are also subject to large bias, since testing is mostly directed at severe cases that can access high quality health care, and not at a sampling of the wider community. Conversely, sampling may be purposely restricted to quantitatively assess the baseline and trends for selected populations such as the elderly or particular neighborhoods. There are other electronic systems that sample based on signs and symptoms. [13][14] We applaud all of these systems and encourage them to report on their findings for academic comparison and collaboration.

Conclusions:
We describe an inexpensive, crowd-sourced system to rapidly obtain prevalence in the community that does not rely on testing. The website may be directed to a particular population, such as a large office building or an at-risk population, to identify an outbreak faster than following confirmed cases or COVID-19 deaths.

Abbreviations:
COVID-19: COrona VIrus Disease of 2019
PCR: Polymerase Chain Reaction
Declarations:
Ethics approval and consent to participate: As we are not collecting any personally identifiable data, no Human Subjects consent is required. The IRB of Georgia Institute of Technology reviewed this study and waived the need for consent.
Availability of data and materials: The datasets analysed during the current study are available from the corresponding author on reasonable request.
Authors Contributions: DK designed the study, analyzed the data and wrote the manuscript. BK wrote the software code for the website. TL provided statistical analysis of the results. ZL displayed the prevalence and incidence on

24. Menni C, Valdes AM, Freidin MB. Real-time tracking of self-reported symptoms to predict potential COVID-19. Nature medicine. https://doi.org/10.1038/s41591-020-0916-2.

Figure 1
First page of website with questions for survey on one screen.

Figure 2
Thank-you page shown after submission of data showing aggregate results in real time.

Figure 3
Example map of prevalence per hundred thousand population for four Atlanta area counties. Fulton County surprisingly had lower prevalence even though it has the most confirmed cases, while Cobb County had higher prevalence.
Figure 4
Example map of prevalence per hundred thousand population for eight counties near New York City. The heterogeneity is evident and may have utility in deciding issues of public health response.