Data collection
We collected data using the COVID-19 streaming API of Twitter (8). This API was made available by Twitter specifically to support COVID-19 related research, and it does not impose throughput limitations or daily/monthly quotas. Consequently, we were able to collect all tweets that mentioned COVID-19 related keywords and phrases (e.g., coronavirus, covid19, and covid) (8). Streaming data were stored in real time in a MongoDB database hosted on the Google Cloud Platform. Data collection was continuous, with only minor downtimes necessary for system modifications or updates.
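As a rough illustration of the collection step, the sketch below filters incoming tweet texts against a small keyword set before storage. The keyword list and matching rule are simplified assumptions (the actual COVID-19 stream matched keywords on Twitter's side), and the MongoDB insertion is indicated only in a comment.

```python
# Illustrative keyword filter, assuming a simplified subset of the tracked
# keywords; the real stream performed this matching server-side.
KEYWORDS = {"coronavirus", "covid19", "covid"}

def matches_covid_keywords(text: str) -> bool:
    """Return True if the tweet text mentions any tracked keyword."""
    tokens = {tok.strip("#@.,!?").lower() for tok in text.split()}
    return not KEYWORDS.isdisjoint(tokens)

# Each matching tweet would then be written to storage in real time, e.g.
# collection.insert_one(tweet_json) with pymongo (not shown here).
```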
Product detection
The list of products and entities was manually compiled from the FDA website (4). The products included were advertised as treatments/cures, tests, or preventative measures for COVID-19. We curated a comprehensive list of entity names, products, FDA letter dates, person(s) who owned the entities or products, websites, and social media profiles (if any). We curated this information for a total of 183 letters issued by the FDA, each of which was manually reviewed. From these, we manually curated a set of product and/or entity names that were potentially used for promotion over social media. If the same product was mentioned in multiple letters, we included only the first mention of the product or entity and the corresponding date, excluding later mentions. We also manually curated keywords and phrases that were likely to be used to refer to the products or entities on Twitter. The full list of products and entities and their earliest letter dates is provided in Table 1.
Since product and entity names are often misspelled by social media users, we generated potential spelling variants or misspellings of the products and entities using a data-centric tool (14). The variant generation tool uses a combination of semantic and lexical similarity measures to automatically identify common misspellings and spelling variants of terms/phrases, including multi-word expressions. Our past work revealed that such lexical expansion strategies are capable of significantly increasing retrieval/detection rates from Twitter (15). Examples of product names extracted from the warning letters and their automatically-generated lexical variants are shown in Table 2. We included all products/entities and their spelling variants that had at least 10 mentions in our collected data; key phrases mentioned fewer than 10 times were excluded, because such low occurrence indicated that the corresponding products/entities were either not promoted over Twitter or never actually gained popularity on the platform. We counted the number of mentions of each product/entity, including their spelling variants, across the entire collected dataset, grouping counts of spelling variants with the original products/entities. Daily counts were normalized by the total number of posts collected on the same days, and the daily relative frequencies were expressed as the number of mentions per 1000 tweets.
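The counting and normalization steps above can be sketched as follows; the variant map and input structures are hypothetical stand-ins for the variant-generation tool's output and our collected counts.

```python
from collections import defaultdict

# Hypothetical variant map: each spelling variant points to its canonical
# product name (in the study, variants came from the generation tool).
VARIANTS = {
    "chlorine dioxide": "chlorine dioxide",
    "chloride dioxide": "chlorine dioxide",
    "clorine dioxide": "chlorine dioxide",
}

def daily_relative_frequency(daily_mentions, daily_totals):
    """Group variant counts under the canonical product and normalize to
    mentions per 1000 tweets collected on the same day.

    daily_mentions: {day: {variant: count}}; daily_totals: {day: total tweets}.
    Returns {day: {product: mentions per 1000 tweets}}.
    """
    out = {}
    for day, counts in daily_mentions.items():
        grouped = defaultdict(int)
        for variant, count in counts.items():
            grouped[VARIANTS[variant]] += count
        total = daily_totals[day]
        out[day] = {p: 1000 * c / total for p, c in grouped.items()}
    return out
```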
Table 2. Fraudulent product names extracted from the FDA warning letters and their automatically-generated lexical variants.
Product | Spelling variants
chlorine dioxide | chlorinedioxide||chloride dioxide||chorine dioxide||clorine dioxide||clorinedioxide
fortify humic beverage concentrate | fortify humic beverage concentrates||fortify humic beverage cocentrate
electrify fulvic beverage concentrate | electrify fulvic beverage cocentrate||electrify fulvic beverage concetrate||electrify fulvic beverage concentrates
supersilver whitening toothpaste | supersilver whitening toothpast||supersilver whitening toothpastes||supersilver whitening tooth paste
superblue fluoride free toothpaste | superblue fluoride free tooth paste||superblue fluoride free toothpastes||superblue fluoride free toothpast
prefense hand sanitizers | prefense handsanitzers||prefense hand sanitizes||prefense hand sanitiers||prefense hand andsanitizers||prefense hand||prefense hand handsantizers||prefense hand handsanitzers||prefense handsantizer||prefense handsanitizers||prefense||prefense hand santitizers||prefense handsanitisers||prefense handsanitzer
covid-19 cough syrup | covid 19 cough syrups||covid 19 coughsyrup||covid 19 cough syrup||covid 19 cough coughsyrup
ncov19 spike protein | ncov19 spike spike protein||ncov19 spike spikeproteins||ncov19 spike protei||ncov19 spikey proteins||ncov19 spike spikeprotein||ncov19 spikeprotien||ncov19 spike proteins||ncov19 spike spikey proteins||ncov19 spikeprotein||ncov19 spikeproteins||ncov19 spike spikeprotien
Detecting anomalies
We applied a 14-day moving average filter to construct a smooth line representing the daily mention frequencies, and anomalies or outliers were detected relative to this moving-average line. For each day, the residual for the standard deviation calculation was computed by subtracting the 14-day moving average from the relative frequency per 1000 tweets on that day. For a given day n, the standard deviation for that day (σ_n) was computed progressively over all residuals up to and including day n:

σ_n = sqrt( (1/n) ∑_{i=1}^{n} r_i² ),

where r_i denotes the residual on day i. Many products had 0 mentions early on in their timelines, but the added bias causes the progressive standard deviation to be non-zero. For 3 products with letter issue dates in March 2020, this added bias caused the method to miss early outliers that are detectable without adding the bias. Specific details are provided in the supplementary material (S3). The minimum value for the daily relative frequency was set at 0.001 (i.e., k = 0.001 served as the minimum threshold for outlier detection).
The chosen window size (14 days) and standard deviation threshold (3), for which we report results in this paper, were relatively conservative choices for signal detection. We also performed experiments with multiple window sizes (7, 10, and 14) and standard deviation thresholds (2, 2.5, and 3) to study how the anomaly detection performance varied based on these parameters. Slight variations in window sizes and standard deviation thresholds did not impact overall performance.
Evaluation
Data points that had a distance of more than 3 standard deviations from the moving average were considered to be outliers (i.e., signals). For each key phrase, the date of the first outlier was compared with the FDA letter issuance date to determine whether the signal was detected earlier than, within 1 week of, or later than the FDA letter issuance date. System percentage accuracy was computed using the formula: accuracy (%) = (number of products with signals detected early / total number of products evaluated) × 100. For products that were mentioned in multiple letters, our approach was considered successful in early detection only if the outlier was detected prior to the first mention date. Thus, the reported system performance likely underestimates the performance achievable in practice.
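The comparison and accuracy computation can be sketched as follows; the category labels and the success definition passed to percent_accuracy are illustrative assumptions based on the description above.

```python
from datetime import date, timedelta

def classify_signal(first_outlier: date, letter_date: date) -> str:
    """Compare the date of the first detected outlier against the FDA
    letter issuance date, using the three categories described above."""
    if first_outlier < letter_date:
        return "earlier"
    if first_outlier <= letter_date + timedelta(days=7):
        return "within 1 week"
    return "later"

def percent_accuracy(outcomes, success=("earlier",)):
    """Share of products whose signal falls in a success category, as a
    percentage (assumption: only 'earlier' counts as early detection)."""
    hits = sum(1 for o in outcomes if o in success)
    return 100 * hits / len(outcomes)
```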