Collecting COVID-19 Drug Sets from Drug Repurposing Publications
Since the onset of the COVID-19 epidemic, tens of thousands of new publications related to COVID-19 research have appeared in a very short period (2 months). We continually survey these publications to identify research that describes drug repurposing efforts, and manually extract drug sets from these studies to populate the drug set library. We also submit to the platform published drug sets from historical sources, such as studies that listed drugs showing antiviral activity against related viruses. So far, we have collected 20 drug repurposing publications (Table 1). An updated version of this table is maintained here: https://docs.google.com/spreadsheets/d/1x6aKaZGadfLqNrQoFQwLlRhCXfUGYiRbzwYipIon_WM/edit?usp=sharing
To help us develop and maintain the collection, we enlist the research community by allowing researchers to upload gene and drug sets to the database. These submissions are manually evaluated before they are made publicly available.
Collecting SARS Signatures from GEO with GEO2Enrichr and GEN3VA
A set of 35 gene expression signatures resulting from infection by different coronaviruses in different cell types and tissues, with expression data originating from the Gene Expression Omnibus (GEO) database, was processed using the GEO2Enrichr tool (15) and stored on the GEN3VA platform (16). Each signature yields an upregulated and a downregulated gene set, and the resulting 70 gene sets were submitted to the COVID-19 crowdsourcing platform. The GEN3VA report for these signatures is available here: https://amp.pharm.mssm.edu/gen3va/report/646/SARS.
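The step of turning each signature into two library entries can be sketched as follows. This is a minimal illustration only: it does not reproduce the differential expression method used by GEO2Enrichr, and the signature name, gene names, scores, and the `top_n` cutoff are hypothetical.

```python
def split_signature(signature, top_n):
    """Split a {gene: score} signature into (up, down) gene lists.

    Positive scores are treated as upregulation and negative scores as
    downregulation; this sign convention is an assumption of the sketch.
    """
    ranked = sorted(signature.items(), key=lambda item: item[1], reverse=True)
    up = [gene for gene, score in ranked[:top_n] if score > 0]
    down = [gene for gene, score in ranked[::-1][:top_n] if score < 0]
    return up, down


# Hypothetical toy signature; real signatures cover thousands of genes.
signatures = {
    "sig_A": {"G1": 2.5, "G2": -1.8, "G3": 0.4, "G4": -3.1, "G5": 1.2},
}

# Each signature contributes two entries, so N signatures give 2N gene sets.
library = {}
for name, signature in signatures.items():
    up, down = split_signature(signature, top_n=2)
    library[f"{name} up"] = up
    library[f"{name} down"] = down
```

With 35 signatures, the same loop would produce the 70 gene sets mentioned above.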
Collecting COVID-19-Related Gene Sets with Geneshot
Geneshot (17) is a platform that we developed to convert PubMed searches into gene sets. Using Geneshot, gene sets associated with the search terms SARS, SARS-CoV, MERS-CoV, ACE2, and TMPRSS2 were created using both the AutoRIF and GeneRIF (18) methods. Additionally, top COVID-19 drug repurposing candidates reported in recent literature (Table 1), including chloroquine and hydroxychloroquine, were used as search terms. Predictions of additional genes potentially associated with these terms were also added to the COVID-19 gene set library. These predictions were derived from the literature-associated genes using each of five strategies: co-occurrence via AutoRIF, GeneRIF, Enrichr (19), and Tagger (20), and co-expression using data from ARCHS4 (21).
Collecting COVID-19 Drug Sets from Twitter
Twitter is an important source of timely discussions related to therapeutics for COVID-19, including drug repurposing efforts and clinical trials. Using the Twitter API, we query Twitter daily with a list of more than 14,000 drug terms and their synonyms to collect tweets that mention these drugs in the context of COVID-19. The drug search list was curated from DrugBank (22), L1000FWD (23), and the list of drugs submitted to the COVID-19 drug and gene set library website. We then filter the identified tweets for those that also mention COVID-19, SARS, or their linguistic variations. For each drug, we count the matching tweets and record a tally of mentions for each day. Data collection continues, producing daily reports, the IDs of the tweets that originated the discussions, and longitudinal drug trends. These data are shared publicly on GitHub: https://github.com/MaayanLab/COVID19DrugsTrendTracker/tree/master/daily_reports
Each day, the set of drugs discussed on Twitter is automatically deposited into the COVID-19 drug set library via an API. This approach enables real-time detection of trends in the drugs most discussed as potential therapeutics for COVID-19, while enriching the content of the COVID-19 drug and gene set library.
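The filtering and daily tally described above can be sketched as follows. This is a simplified illustration that works on tweets already retrieved; it makes no calls to the Twitter API, and the synonym table, the COVID-19 pattern, and the tweet records are hypothetical stand-ins for the much larger curated lists.

```python
import re
from collections import Counter, defaultdict

# Hypothetical excerpt of a drug synonym table (the real list has >14,000 terms).
DRUG_SYNONYMS = {
    "hydroxychloroquine": ["hydroxychloroquine", "plaquenil"],
    "chloroquine": ["chloroquine"],
    "remdesivir": ["remdesivir"],
}

# Assumed set of COVID-19 / SARS spelling variations.
COVID_PATTERN = re.compile(
    r"\b(covid[-\s]?19|covid|sars[-\s]?cov[-\s]?2|sars|coronavirus)\b",
    re.IGNORECASE,
)


def compile_drug_patterns(synonyms):
    """Build one whole-word, case-insensitive regex per drug."""
    return {
        drug: re.compile(
            r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
            re.IGNORECASE,
        )
        for drug, terms in synonyms.items()
    }


def tally_mentions(tweets, drug_patterns):
    """Count, per day and per drug, tweets that also mention COVID-19."""
    tally = defaultdict(Counter)
    for tweet in tweets:
        if not COVID_PATTERN.search(tweet["text"]):
            continue  # drop tweets with no COVID-19 or SARS co-mention
        for drug, pattern in drug_patterns.items():
            if pattern.search(tweet["text"]):
                tally[tweet["date"]][drug] += 1
    return tally


# Hypothetical tweets for illustration.
tweets = [
    {"date": "2020-04-01", "text": "Hydroxychloroquine trial for COVID-19 starts"},
    {"date": "2020-04-01", "text": "Plaquenil shortage reported amid coronavirus"},
    {"date": "2020-04-01", "text": "Remdesivir and chloroquine tested against SARS-CoV-2"},
    {"date": "2020-04-02", "text": "Chloroquine is also an antimalarial"},
    {"date": "2020-04-02", "text": "remdesivir data for #COVID19 expected soon"},
]

tally = tally_mentions(tweets, compile_drug_patterns(DRUG_SYNONYMS))
daily_drug_sets = {date: sorted(counts) for date, counts in tally.items()}
```

Whole-word matching keeps a mention of hydroxychloroquine from also being counted as chloroquine. The keys of each daily counter form the drug set that would be deposited for that day.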
Developing the COVID-19 Gene and Drug Set Library Website
The COVID-19 gene and drug set library website has two sortable and searchable tables that list the drug and gene sets (Fig. 1). Sorting can be based on the date of submission, alphabetical ordering, or list size. The two tables are searchable via metadata terms such as title, authors, and description, as well as by specific gene or drug names. Users can download each gene set or drug set, as well as the entire library. In addition, each gene set is provided with the option to perform gene set enrichment analysis with Enrichr (19), while genes are linked to Harmonizome (24) for further interrogation. Individual drugs that map to known compounds are linked to their corresponding DrugBank landing pages (22). The website enables users to submit drug and gene sets related to COVID-19 research by completing a simple form. The form includes a dataset title, a source URL, and a description that explains how the set is relevant to COVID-19 research. The submitter can also add metadata terms that describe the cell type, tissue, organism, and other critical information about the submitted set. Users can specify the category of each metadata term, which allows a broad range of additional metadata to be attached to each set. Users can also choose to submit their contact information; this information is kept private unless they opt in to make it public. Once a user submits a contribution to the site, their dataset is directed to a review queue in which we examine the validity and relevance of the contribution. During review, an administrator approves or rejects the submitted set. If approved, the set is added to the library. To make it easy for contributors to submit multiple sets, users can access the site via an API. The code behind the site is open source and available at: https://github.com/MaayanLab/covid19_crowd_library
Expression Analysis of In Vitro Screen Hits
Drug sets were first extracted from three in vitro screens (1-3). The drugs were matched to drugs profiled by the L1000 assay, available from GSE92742. An average signature for each drug was computed by taking the mean z-score for each gene. Clusters were identified from the average signatures using hierarchical clustering. Genes with differential z-scores between the two clusters were identified using the t-test statistic. The top up and down differentially expressed genes in each cluster were submitted to Enrichr for gene set enrichment analysis. To quantify the z-scores of genes co-expressed with ACE2, we calculated the correlation of each gene with ACE2 across 2,000 randomly sampled drug signatures from the L1000 database. We then calculated the mean z-scores of the 50 genes most correlated with ACE2 and compared those values against a null distribution obtained by sampling 50 random genes 10,000 times. P-values were calculated against the sampled distribution and corrected for multiple hypothesis testing using the Bonferroni method. The code behind this analysis is open source and available at: https://github.com/maayanlab/covid19l1000
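The random-sampling test and Bonferroni correction described above can be sketched as follows. This is an illustrative reconstruction, not the code in the linked repository. It assumes the test is run once per drug signature, uses a two-sided comparison on absolute mean z-scores, and adds a pseudocount of one to avoid p-values of zero. All gene names, drug names, and z-scores are hypothetical, and the number of samples is reduced from 10,000 to 1,000 to keep the example fast.

```python
import random
from statistics import mean


def null_distribution(z_scores, set_size, n_samples, rng):
    """Mean z-scores of randomly sampled gene sets of a fixed size."""
    genes = list(z_scores)
    return [
        mean(z_scores[gene] for gene in rng.sample(genes, set_size))
        for _ in range(n_samples)
    ]


def empirical_p_value(observed, null_values):
    """Fraction of null values at least as extreme as the observed one."""
    extreme = sum(1 for value in null_values if abs(value) >= abs(observed))
    return (extreme + 1) / (len(null_values) + 1)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return {name: min(1.0, p * n_tests) for name, p in p_values.items()}


# Hypothetical data: 500 genes, the first 50 standing in for the genes
# most correlated with ACE2.
genes = [f"G{i}" for i in range(500)]
ace2_correlated = genes[:50]
signatures = {
    # drug_A shifts the ACE2-correlated genes strongly; drug_B does not.
    "drug_A": {g: (3.0 if i < 50 else 0.0) for i, g in enumerate(genes)},
    "drug_B": {g: (1.0 if i % 2 == 0 else -1.0) for i, g in enumerate(genes)},
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
p_values = {}
for drug, z_scores in signatures.items():
    observed = mean(z_scores[g] for g in ace2_correlated)
    null_values = null_distribution(z_scores, set_size=50, n_samples=1000, rng=rng)
    p_values[drug] = empirical_p_value(observed, null_values)

adjusted = bonferroni(p_values)
```

In this toy setup, drug_A receives the smallest attainable p-value because no random gene set matches its observed mean, whereas drug_B is indistinguishable from random sets.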