Project SCANCER: An Open-access Drug Repurposing and Data-mining Platform to Enhance Target Validation and Optimize International Research Efforts against Highly Progressive Cancers

The expanding body of potential therapeutic targets requires easily accessible, structured, and transparent real-time interpretation of molecular data. We invite research teams to join and aim to enhance the cooperative work of more experienced groups to harmonize international efforts to overcome less prevalent, devastating malignancies. We show how we integrated our previous expertise in small-cell lung cancer (SCLC) with data mining approaches to initiate a new platform to overcome highly progressive cancers such as triple-negative breast and pancreatic cancer. Here, we integrate available fundamental data and present a novel, open-access, data-mining, drug repurposing platform, deriving our searches from the entries of Clue.io. Through the front end, users can create their drug-target list to select the clinically most relevant targets for further functional validation assays or drug trials. SCANCER integrates searches from publicly available databases, such as PubChem, DrugBank, PubMed, and EMA, and complements them with background information on compounds using resources like UniProt, STRING, and GeneCards. Cancer drug discovery requires a convergence of complex, often disparate fields. We present a simple, transparent, and user-friendly drug repurposing database.


Introduction
There is an increasing need for open-access drug repurposing databases for researchers in the translational and clinical field due to the emergence of many potential new therapeutic targets every year. This trend will likely continue as data repositories grow exponentially, and data mining requires particular expertise and qualifications that cannot be expected from a single group of researchers. The development of novel pharmaceuticals takes enormous effort and immense financial and human resources, typically a decade of research and clinical trials until approval, with more advancements in prevalent cancers. Research groups usually select their target and related research directions without cross-optimization of efforts across individual groups. Industry AI-based drug development has limitations and also requires preclinical validation. Therefore, in silico approaches backed by solid human resources might assist rather than fully take over all steps.
For a given molecular target, a suitable inhibitor (or agonist) might already be available in another indication, or a small-molecule lead compound may be available that needs further modification and preclinical testing. Nowadays, a wide selection of databases is available for various goals in drug repurposing (Parvathaneni et al., 2019; Gns et al., 2019; Tanoli et al., 2021). Omics data about molecular targets can be retrieved from UniProt (UniProt Consortium, 2021), GeneCards (Stelzer et al, 2016), TTD (Chen et al, 2002), STITCH (Kuhn et al, 2008), BioGRID (Oughtred et al, 2020), and STRING (Szklarczyk et al, 2019); corresponding pathways can be retrieved from KEGG (Kanehisa and Goto 2000), Pathway Commons (Rodchenkov et al, 2020), or Reactome (Griss et al, 2020), with a certain level of deficient or overlapping information. Information about drugs launched or in development (drug omics data) can be obtained from PubChem (Kim et al, 2019), DrugBank (Wishart et al, 2018), Drug Map Central (Fu et al, 2013), or the FDA or EMA label repositories. Also, the Clinicaltrials.gov, SIDER (Kuhn et al, 2015), and FDA Adverse Event Reporting System (FAERS) platforms provide essential information on stages of drug testing.
Platforms for browsing and visualizing drug-target interactions and drugs in disease context are already available, such as Drug Target Profiler (Tanoli et al, 2018), Cancer Genome Interpreter (Tamborero et al, 2018), SwissTargetPrediction (Daina et al, 2019), OpenTargets (Hecker et al, 2012), and PharmGKB (Eichelbaum et al, 2009). These platforms include many entries and a vast amount of unstructured information that is sometimes time-consuming and difficult to handle, especially for those lacking expertise in the field. The data mining techniques require specific processes that are usually nonexistent in less prevalent cancers, which also often lack financial incentive for clinical testing (Josephs et al, 2019). Cancer drug discovery requires a convergence of complex, often disparate fields. There is a great need for simple, transparent and user-friendly drug repurposing databases.
One recent endeavor with high impact is the Broad Institute's Clue.io, which serves as a drug repurposing hub: a curated and annotated collection of FDA-approved drugs, clinical trial drugs, and preclinical tool compounds with a companion information resource (Corsello et al., 2017). In this paper, we present a novel, open-access, data-mining, drug repurposing platform, deriving our searches from the entries of Clue.io. Project Scancer provides researchers and clinicians (especially in cancer research) with an easy-to-use platform to retrieve all the necessary information for a freely selected array of potential therapeutic targets, using a data-mining application that works in parallel. In addition, Project Scancer provides detailed biological information on the selected set of target molecules using open-access databases, such as UniProt, GeneCards, Gene Ontology and STRING. This way, the user receives a concise summary of the biological relevance of every target, which is especially important for researchers who are not experts in molecular biology.

Clue.io and dataPatch R script
Target inclusion/exclusion depends on search results from a clue.io query. Scancer consists of 3 separate R scripts. The first script, clue.R, calls various clue.io REST API endpoints to build up a result table (Fig. 2). If the main API call does not find any compound for a target, that target will not be involved in further processing steps since no known drug repurposing approaches are available in Clue.io. clue.R looks up the input target list in two ways. First, it tries to access a private and/or shared Google Sheet file. This requires a unique "key" (a token) given to clue.R via an environment variable. If this secret key is available to the script, it authenticates with the gargle package to access Google Sheets API services. Next, it reads the sheet and takes the values from its first three columns. An ID string that identifies the Google Sheet is also passed via an operating system environment variable. If there is no API key or Google Sheet identifier, clue.R tries to load a TSV file from the INPUT directory of the Scancer directory. clue.R merges the outputs of the various clue.io API calls and saves the composed table into an RDS file (an R-specific data format to store and load R objects). At the next stage, dataPatch.R reads this RDS file and restores the data frame from it.

FDA Label function
dataPatch.R collects additional details from external resources. Most of them are used by clue.io itself, but gaps can occur in its dataset. For example, EMA data is not included at all, so appending it is an improvement. Another plus is harvesting direct links to drug labels from the search interface of FDALabel. The fdaLabel function implements this feature and contains several internal sub-functions (some of them anonymous and vectorized). This function sends a POST HTTP request to a specific FDALabel link, the same one that the human search interface uses to send a query.
The fdaLabel function uses pert_iname from clue.io as a search parameter and uses the following parameters to define the search criteria. Document types (documentTypeCodes): Human Rx 34391-3, Human OTC 34390-5, Vaccine 53404-0. Labeling section: selectedLabelingType: 0, sectionTypeCode: Active Ingredient 2-55106-9. fdaLabel uses its internal function, getFDALabelResults, to send the query and interpret the HTTP response from the FDA service.
The result may contain several duplicates for a specific product. This is a consequence of the FDA's strict registration and regulation procedures (products are approved for different time ranges, can be withdrawn, etc.). The function uses a weighting method to reduce these duplicates and return the most relevant items. The fdaLabel function filters out products that are parts of aid kits by checking product names for the pattern first aid|kit|KIT. It receives responses from the FDALabel service in JSON format. It parses, prioritizes and filters the list of returned drug products based on their market date and their match mode with pert_iname (exact, prefix, suffix, inner, not at all). This is a necessary step since the product list contains copies of drugs that differ only in their dates.
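The filtering and prioritization described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the field names (`name`, `market_date`), the numeric ranks, and the choice to keep the most recent market date per product name are assumptions made for illustration; only the kit pattern and the list of match modes come from the text.

```python
import re
from datetime import date

# Pattern quoted in the text; it is case-sensitive as written.
KIT_PATTERN = re.compile(r"first aid|kit|KIT")

# Assumed ranking: lower is better, in the order the match modes are listed.
MATCH_RANK = {"exact": 0, "prefix": 1, "suffix": 2, "inner": 3, "none": 4}


def match_mode(product_name, pert_iname):
    """Classify how the compound name occurs in the product name."""
    name, query = product_name.lower(), pert_iname.lower()
    if name == query:
        return "exact"
    if name.startswith(query):
        return "prefix"
    if name.endswith(query):
        return "suffix"
    if query in name:
        return "inner"
    return "none"


def prioritize(products, pert_iname):
    """Drop kit products, collapse date-only duplicates, sort by relevance.

    `products` is a list of dicts with hypothetical keys 'name' and
    'market_date' (ISO date string).
    """
    kept = [p for p in products if not KIT_PATTERN.search(p["name"])]

    # Assumption: among copies that differ only by date, keep the newest one.
    newest = {}
    for product in kept:
        key = product["name"].lower()
        if key not in newest or product["market_date"] > newest[key]["market_date"]:
            newest[key] = product

    def sort_key(product):
        rank = MATCH_RANK[match_mode(product["name"], pert_iname)]
        age = -date.fromisoformat(product["market_date"]).toordinal()
        return (rank, age)  # best match first, then most recent first

    return sorted(newest.values(), key=sort_key)
```

Ranking by match mode before date means an exact name match always outranks a newer product whose name merely contains the compound name; a real weighting scheme could combine the two criteria differently.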

PubMed function
This function takes the compound name received from clue.io and sends a search request to the PubMed® service of the National Center for Biotechnology Information (NCBI). It restricts the result set to clinical trials, meta-analyses, randomized controlled trials, reviews and systematic reviews. The results are ordered by the best match algorithm of PubMed®. Our function picks the top 3 items of the result set and stores them in the global data table. If there is no hit at all, dataPatch.R provides the search URL that was used, so the search can be re-initiated and/or refined by users of Scancer. The pubMed function uses a simple XPath query to extract the PubMed identifiers embedded in the resulting HTML source code.
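A minimal Python sketch of this search step follows. The exact request that dataPatch.R sends is not given here, so the query below is an assumption: it expresses the publication-type restriction with PubMed's `[Publication Type]` field tag in the `term` parameter of the public search page. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Publication types named in the text, written as PubMed publication-type terms.
PUBLICATION_TYPES = [
    "Clinical Trial",
    "Meta-Analysis",
    "Randomized Controlled Trial",
    "Review",
    "Systematic Review",
]


def build_search_url(compound):
    """Compose a PubMed search URL restricted to the listed publication types."""
    type_filter = " OR ".join(
        f'"{ptype}"[Publication Type]' for ptype in PUBLICATION_TYPES
    )
    term = f'"{compound}" AND ({type_filter})'
    return "https://pubmed.ncbi.nlm.nih.gov/?" + urlencode({"term": term})


def summarize_hits(compound, pmids, limit=3):
    """Keep the first `limit` identifiers (assumed to be in best-match order).

    The search URL is always returned, so a user can re-run or refine the
    query when the identifier list is empty.
    """
    return {"pmids": pmids[:limit], "search_url": build_search_url(compound)}
```

For example, `summarize_hits("aspirin", [])` returns an empty identifier list together with a URL that can be opened directly in a browser.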

EMA function
The EMA function complements the dataset with product details from the website of the European Medicines Agency (EMA). It sends search queries to the site through the interface intended for human users, similar to the fdaLabel and pubMed functions above. It also uses an XPath expression to extract the first hit from the result list and to identify the most specific link to a document in PDF format. Since there is no strict naming/referencing convention for these documents on the EMA site, this function tries to discover a PDF document in the following order (with descending priority): summary in English; product information; refusal public assessment report; public statement which is not "non-renewal" and not "authorization". Otherwise, it selects the first PDF document that is encountered on the page of the product. If the EMA function finds a PDF reference on the product page, it verifies the existence of the referred PDF file by sending an HTTP HEAD request to the EMA web server. This additional check ensures that Scancer will not provide a URL that leads its users to a "Page Not Found" experience.
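The priority order and the HEAD check can be sketched in Python as follows. This is an illustration under assumptions, not the authors' R implementation: it supposes that each link on the product page is available as a (label, URL) pair and that document kinds can be recognized by keywords in the label. The function names are hypothetical.

```python
import urllib.request


def rank(label):
    """Map a link label to a priority (lower is better)."""
    text = label.lower()
    if "summary" in text and "english" in text:
        return 0
    if "product information" in text:
        return 1
    if "refusal public assessment report" in text:
        return 2
    if ("public statement" in text
            and "non-renewal" not in text
            and "authorization" not in text):
        return 3
    return 4  # any other PDF; ties are resolved by page order


def pick_pdf(links):
    """Choose the preferred PDF from (label, url) pairs given in page order."""
    pdfs = [link for link in links if link[1].lower().endswith(".pdf")]
    if not pdfs:
        return None
    # min() returns the first minimal element, which preserves page order.
    return min(pdfs, key=lambda link: rank(link[0]))[1]


def url_exists(url, timeout=10):
    """Send an HTTP HEAD request and report whether the server accepts it."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except OSError:  # covers URLError, HTTPError and timeouts
        return False


def pick_verified_pdf(links, exists=url_exists):
    """Return the preferred PDF URL only if it can actually be reached."""
    url = pick_pdf(links)
    return url if url is not None and exists(url) else None
```

Passing the existence check as a parameter keeps the selection logic testable without network access.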

XmlUniProt function
This function collects data from the UniProt website. UniProt provides APIs to access and query its data, which makes it easy to access the human-readable contents in machine-readable formats (for example XML, RDF, etc.).
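As a rough illustration of working with UniProt's XML format, the Python sketch below pulls Gene Ontology molecular function and cellular component terms, plus STRING and Reactome cross-references, from an entry. It is not the xmlUniProt R code; the sample document is hand-written and abbreviated for illustration, and the function name and output keys are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"u": "http://uniprot.org/uniprot"}  # namespace of UniProt XML entries

# Hand-written, abbreviated sample in the shape of a UniProt XML entry.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>P04637</accession>
    <dbReference type="GO" id="GO:0005634">
      <property type="term" value="C:nucleus"/>
    </dbReference>
    <dbReference type="GO" id="GO:0003677">
      <property type="term" value="F:DNA binding"/>
    </dbReference>
    <dbReference type="GO" id="GO:0006915">
      <property type="term" value="P:apoptotic process"/>
    </dbReference>
    <dbReference type="STRING" id="9606.ENSP00000269305"/>
    <dbReference type="Reactome" id="R-HSA-69541">
      <property type="pathway name" value="Stabilization of p53"/>
    </dbReference>
  </entry>
</uniprot>"""


def extract_uniprot_data(xml_text):
    """Collect GO function/component terms and STRING/Reactome identifiers."""
    root = ET.fromstring(xml_text)
    data = {
        "molecular_function": [],
        "cellular_component": [],
        "STRING": [],
        "Reactome": [],
    }
    for ref in root.iterfind(".//u:dbReference", NS):
        kind, ref_id = ref.get("type"), ref.get("id")
        if kind == "GO":
            term = ref.find("u:property[@type='term']", NS)
            value = term.get("value") if term is not None else ""
            # GO terms carry an aspect prefix: F (function), C (component), P (process).
            if value.startswith("F:"):
                data["molecular_function"].append((ref_id, value[2:]))
            elif value.startswith("C:"):
                data["cellular_component"].append((ref_id, value[2:]))
        elif kind in ("STRING", "Reactome"):
            data[kind].append(ref_id)
    return data
```

Biological process terms (prefix `P:`) are skipped on purpose, since only molecular function and cellular component are collected here.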
The usage of the UniProt website REST API is straightforward, since the input target list also contains UniProt identifiers. xmlUniProt extracts GO (Gene Ontology) molecular function and cellular component terms, as well as STRING and Reactome references, from the received XML data. These specific entries are stored in simple R lists and added to the already collected data in a new column: UniProtData.

RenderWebPage function and renderWebPage.R script
RenderWebPage is responsible for rendering human-readable HTML documentation from the "patched" data produced by dataPatch.R. This function iterates over the rows of the input table created by dataPatch.R. It prepares a compound, hierarchical data structure from the data table. This data structure simplifies data access from the template file, which is an essential part of generating the HTML output.
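The step from a flat table to a hierarchical structure for the template can be pictured with a short Python sketch. The column names (`target`, `label`, `moa`, `status`) and the shape of the result are assumptions chosen for illustration; only `pert_iname` is a field name taken from the text, and the actual R structure may differ.

```python
def build_template_data(rows):
    """Group flat compound rows under their targets for template rendering.

    `rows` is a list of dicts, one per (target, compound) pair.
    """
    targets = {}
    for row in rows:
        # Create the target node on first sight, then attach the compound to it.
        target = targets.setdefault(
            row["target"],
            {"name": row["target"], "label": row["label"], "compounds": []},
        )
        target["compounds"].append(
            {
                "pert_iname": row["pert_iname"],
                "moa": row["moa"],
                "status": row["status"],
            }
        )
    # Dicts keep insertion order, so targets appear in table order.
    return {"targets": list(targets.values())}
```

A logic-less template can then loop over `targets` and, within each, over `compounds`, without doing any grouping itself.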

MultivaluedCellsToHTML function
A single compound can have multiple related values as elements of various resources. For example, a compound can have two Mechanism of Action (MoA) items, or a PubChem reference along with a DrugBank reference. These values have to be organized into the same row as the compound, and items of the same category must be displayed in a single table cell. The multivaluedCellsToHTML function handles these cases. It uses helper functions to compose the corresponding URL for each identifier from the different data sources, including ChEMBL, PubChem and DrugBank.
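A minimal Python sketch of this idea is given below. The URL patterns are assumptions based on the public addresses of the three resources and are not taken from the Scancer source; the function names are hypothetical.

```python
from html import escape

# Assumed public URL patterns for each data source.
URL_TEMPLATES = {
    "PubChem": "https://pubchem.ncbi.nlm.nih.gov/compound/{id}",
    "DrugBank": "https://go.drugbank.com/drugs/{id}",
    "ChEMBL": "https://www.ebi.ac.uk/chembl/compound_report_card/{id}/",
}


def link(source, identifier):
    """Render one identifier as a hyperlink to its source database."""
    url = URL_TEMPLATES[source].format(id=identifier)
    return f'<a href="{escape(url)}">{escape(identifier)}</a>'


def multivalued_cell(fragments):
    """Place several already-rendered HTML fragments into one table cell."""
    return "<td>" + "<br>".join(fragments) + "</td>"
```

For instance, `multivalued_cell([link("PubChem", "2244"), link("DrugBank", "DB00945")])` yields a single cell with two links, one per line.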

Presentation layer
A web browser is "mandatory" software on every end user's computer, so HTML is a clear choice to summarize, visualize and deliver collections of texts, images and hyperlinks. An important part of this rendering is building an HTML source file and populating it with the collected data in a user-friendly way. Scancer follows the popular Model-View-Controller design pattern even though it composes only static HTML output (View) from the data (Model) at this stage of the workflow. (NOTE: The previous actions and functionalities of the workflow can, however, be interpreted as the Controller part of the MVC pattern.)

Templating
Most web frameworks incorporate a templating system, either as their own solution or by reusing a third-party component. These templating components are not tightly coupled with web services, so any software can use their power. Scancer uses the whisker package, which implements the Mustache template language. This approach excludes any occurrence of program source code from the template code. It provides a strict separation for the View layer of Scancer. The structure of the rendered HTML output is based on Bootstrap components and their hierarchy. (NOTE: The current hierarchical structure and the content of components cannot support a responsive page design. Scancer output is tailored for desktop browsing.)

Dynamic request for STRING network image and a simple cache for HTML/XML contents
The current functionality of Scancer is very similar to that of a web crawler. This kind of interaction with popular web servers requires respecting their policies, which are controlled by their robots.txt files and described in public pages (FAQ, Usage Guideline, Terms of Service etc.), in order to minimize the load on their resources. Respecting their resources also helps to avoid being blocked or denied access by a well-protected website.
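How a crawler-like tool might consult robots.txt rules before sending a request can be sketched with Python's standard `urllib.robotparser` module. This is only an illustration of the policy check, not part of Scancer; the sample rules, the user agent string, and the wrapper function are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
SAMPLE_ROBOTS_TXT = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_policy(robots_txt, user_agent):
    """Parse robots.txt text and return (allowed, delay) helpers."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        # True if the rules permit this user agent to fetch the URL.
        return parser.can_fetch(user_agent, url)

    def delay():
        # Requested pause between requests in seconds, or None if unspecified.
        return parser.crawl_delay(user_agent)

    return allowed, delay
```

In practice the rules would be downloaded once per host, and the returned delay would set the pause between consecutive requests to that host.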
Scancer uses an event-based method, a dynamic DOM modification from JavaScript (see scancer.js). If the user opens the STRING accordion of a target, the getSTRING function (set up as an onclick handler) inserts the image tag. Because the Document Object Model has been modified, the browser initiates the HTTP request for the STRING network image and loads the missing or uncached image content to complete the rendering of that part of the document. Only the images belonging to the visited STRING panels of the document are loaded.
The simple cache is also intended to save resources on websites. Scancer sends HTTP requests and receives HTTP responses via its getPageCached function. This function checks the cache.tsv file in the caching folder and returns the content immediately if it has already been downloaded. If the looked-up entry is missing from the cache, the function downloads it and adds it to the cache. This not only spares remote resources but also speeds up subsequent queries to a specific server, since Scancer does not need to wait politely between requests that are served from the cache.
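The caching behaviour can be illustrated with a small Python sketch. It is not the getPageCached implementation: the layout assumed here (cache.tsv as an index that maps each URL to a file holding the downloaded body), the class name, and the injectable `fetch` and `sleep` functions are all assumptions made so the example is self-contained.

```python
import csv
import hashlib
import os
import time


class PageCache:
    """File-backed cache: cache.tsv maps each URL to a stored body file."""

    def __init__(self, folder, fetch, delay_seconds=1.0, sleep=time.sleep):
        self.folder = folder
        self.fetch = fetch            # function(url) -> text, does the download
        self.delay_seconds = delay_seconds
        self.sleep = sleep
        self.index_path = os.path.join(folder, "cache.tsv")
        os.makedirs(folder, exist_ok=True)
        self.index = {}
        if os.path.exists(self.index_path):
            with open(self.index_path, newline="", encoding="utf-8") as handle:
                for url, name in csv.reader(handle, delimiter="\t"):
                    self.index[url] = name

    def get(self, url):
        """Return the page body, downloading it only on a cache miss."""
        if url in self.index:
            path = os.path.join(self.folder, self.index[url])
            with open(path, newline="", encoding="utf-8") as handle:
                return handle.read()  # cache hit: no network, no waiting

        self.sleep(self.delay_seconds)  # polite pause before a real request
        body = self.fetch(url)
        name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".txt"
        with open(os.path.join(self.folder, name), "w", newline="", encoding="utf-8") as handle:
            handle.write(body)
        with open(self.index_path, "a", newline="", encoding="utf-8") as handle:
            csv.writer(handle, delimiter="\t").writerow([url, name])
        self.index[url] = name
        return body
```

Because the pause happens only on a cache miss, repeated runs over the same target list spend almost no time waiting.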

Construction and Content
Each entry obtained from the search results in the interactive online platform of Project Scancer is referenced and has at least one piece of scientific evidence. Figure 1 shows a flowchart of the processing steps and workflow of Scancer. First, users of Scancer can start their workflow by opening the project's starting page on GitHub (https://cycle20.github.io/scancer/). Then, users can upload their target list with three pieces of information into a Google spreadsheet (Target INPUT). The first piece (Fig. 2, column A) is the HUGO ID. In column B, users can give a "Label" to every target for classification and clustering, which is useful in later work. The third piece (Fig. 2, column C) is the UniProtKB ID of the searched gene. Inputs for the HUGO and Label columns are limited to 12 characters. On the right side of the spreadsheet (columns E-K), hyperlinks provide access to the results page on GitHub, and in the "Results of Update Request" box, users can check the query's status. Hitting the "Start Rendering" button located in columns E-F starts the query. The area within E1-K8 is protected and is automatically overwritten if edited (Fig. 2).
By clicking on the "Result page" link on the target spreadsheet, users can access the results of their query within approximately 30 minutes. Clicking on the hyperlink in cell H6 shows the progress of the query (Fig. 2, arrow). A new query overwrites the earlier one in the web application, but every previous version is saved on GitHub under the "Result of Update Request" link (https://github.com/cycle20/scancer/actions/workflows/clue.yml). A scrollable panel on the left displays all the targets with at least one valid drug compound available. Scancer automatically excludes entries where no drug or small molecule inhibitor/agonist is available according to the Clue.io repurposing hub. The platform creates a table for every target, where different columns indicate the mechanism of action (MoA), clinical status (preclinical, phase 1, phase 2, phase 3, or launched), and the search resources from PubMed, EMA, and the direct entry from Clue.io. Furthermore, the query table includes hyperlinks with DrugBank, PubChem, and ChEMBL IDs to quickly access the compounds' chemical and pharmacological properties (Fig. 3).
Project Scancer also gives a comprehensive, highly structured overview of the selected targets regarding their molecular biology data retrieved from various databases, including molecular function and cellular localization (Gene Ontology), their connectome (STRING), and participation in pathways (Reactome).
Hyperlinks to GeneCards and DrugBank Target Search are also available but are structured differently from those for UniProt entries. The "STRING" entry opens a static STRING map for the target and provides a hyperlink to string-db.org (Fig. 4). The following entry carries the "Molecular Functions / Subcellular Localisations" title, where the two main hyperlinks (source) lead to UniProt's "Function" and "Subcellular Localization" pages. Molecular function entries and target localizations are also provided separately as text, where hyperlinks lead to the QuickGO platform to obtain further information about relevant compartment-specific molecular pathways (Fig. 4). The last entry, named "Pathways", provides links to the Reactome database for every relevant pathway, where the target's participation is visualized (Fig. 4).
Supplementary Video 1 shows a short tutorial about the functionality of the program and the main steps to generate a query.

Current content
The current database includes 97 targets that focus on molecular targets of SCLC (Dora et al, 2021), but it can be expanded or replaced with targets concerning other aggressive malignancies, including triple-negative breast cancer, glioblastoma multiforme, and pancreatic cancer. Users and groups can expand the content and assign their own tasks to indicate progress percentages in a different platform. Also, the "Label" column can be used to tag targets and classify them into different groups.

Discussion
Open-access journals and databases are an essential basis for drug target development. Endeavours like the TCGA database and Oncomine (Rhodes et al, 2004) have contributed vastly to accelerating drug research in oncology over the last decade, and concurrently multiple other enterprises have emerged to assist researchers with valuable genomic, transcriptomic and proteomic data in the pursuit of novel cancer biomarkers (Krempel et al, 2018; Wishart et al, 2021; Banck et al, 2021; Pantziarka et al, 2021). However, the information is not well structured for specific diseases, including rare and highly progressive cancers (Gadaleta et al, 2011; Creighton 2018). Moreover, it is challenging and time-consuming to associate the latest biomarkers with drugs in order to pick the optimal way to contribute to the field. Only a few research groups with diverse expertise can participate, leaving many researchers without an equal opportunity of involvement. Also, the individual interest of these groups might not represent the optimal way to examine diseases. Therefore, we propose a novel, optimal target selection methodology. Notably, after the success of PD-L1 inhibitors in non-small cell lung cancer (NSCLC), where 5-year survival in extensive-stage disease increased from 2% to 25%, there has been a keen interest in expanding immunotherapy utilization to small cell lung cancer (SCLC) as well. Although two anti-PD-1 immunotherapies, nivolumab and pembrolizumab, received FDA approval (Horn et al., 2018; Paz-Ares et al., 2019), the approvals were withdrawn after the confirmatory phase III trials did not reach statistical significance for overall survival. Nevertheless, PD-L1 expression in SCLC has never been unequivocally correlated with the response.
Open-access data on the latest research requiring further validation is of high interest to the field. In low-prevalence and highly aggressive cancers, the scarcity of available tissue samples limits research, so there is an unmet need to share and optimize resources in the field. For decades, there have been only modest therapeutic advancements for highly progressive cancers such as pancreatic cancer, triple-negative breast cancer, glioblastoma multiforme, or SCLC. To enhance drug target development, we believe that Project Scancer can serve as an easy-to-use, semi-comprehensive data-mining platform for drug repurposing and can significantly assist smaller research groups in the fight against malignancies. An outstanding advantage of Project Scancer compared to other drug-repurposing platforms is that it does not require any experience in database handling and provides information on the biological background of the target molecules in a processed way that is easy to understand. The latter feature can be used for educational purposes as well. We believe that Project Scancer is a useful addition to the field of drug repurposing in cancer science and oncology and will be particularly useful for smaller research groups with limited expertise in database handling.

Figure 1
Flowchart of functionality. The flowchart describes the main steps of Project Scancer's functionality, including data input, Clue.io target search, cross-referencing in databases (Datapatch) and molecular background information on selected targets (Render).

Figure 2
Input table for molecular targets. Users can enter the HUGO name, label and UniProt ID of selected targets in columns A, B and C. Hitting "Start Rendering" will initiate the Clue.io search (arrowhead). Progress can be traced by clicking on the hyperlink in cells H6-K6 (arrow). Clicking on the hyperlink in cells F2-I2 reveals the results page.

Figure 3
List of drug targets. Clicking on the labels of selected targets (column on the left side) reveals the list of available compounds (black box), which also describes the mechanism of action (MoA, dashed box), clinical status (red box), information resources on PubMed (green box) and DrugBank/PubChem/ChEMBL entries (blue box).

Figure 4
Details on the target's molecular background. Panel A shows the "results page" menu for the network map from STRING, with a static STRING map and a hyperlink to the string-db.org entry. Panel B displays hyperlinks to "molecular function" and "subcellular localisation" for browsing the UniProt database on molecular background. Panel C shows hyperlinks to visualize "Reactome" pathways of the selected target.