PlatCOVID: A Novel Web Tool to Analyze, Curate and Share COVID-19 Literature


Abstract
Background
To confront the COVID-19 pandemic, the global scientific community has made great efforts to produce useful and reliable data to help patients and physicians and to guide public health policies. A huge amount of information is released every week, making it impossible for a single person (or even a research group) to read everything and stay up to date on the scientific literature concerning COVID-19 and its etiological agent, SARS-CoV-2. Therefore, we developed PlatCOVID (www.platcovid.com), a Web platform designed to analyze, cluster, classify and discuss the COVID-19 literature available on LitCovid (NCBI).

Results
PlatCOVID was created as a novel COVID-19 hub that adds text mining and syntax analysis methods, such as word and sentence atomization and tokenization, clustering and classification. The literature is divided into five main categories: 1) Diagnosis; 2) Epidemiology; 3) Clinical, Signs & Symptoms; 4) Transmission; and 5) Treatment & Prevention. Consequently, it is possible to reduce the amount of text to be read with minimal loss of information, identifying target subjects by mining as new insights arise and thereby improving the efficiency of data analysis. PlatCOVID has been designed with central panels (Gene, Drug and Tissue panels) to easily gather important COVID-19 information and share it with the scientific community.

Conclusions
Although most of the text mining and syntax analysis is performed by automated computing processes, the final results must be curated by humans. With this in mind, PlatCOVID allows researchers to be part of our effort to curate, analyze, discuss and rate each matter of interest (helpus.platcovid.com). We welcome user feedback for further enhancement.

Background
In recent months the world has been distraught by the COVID-19 pandemic, a disease responsible for an aggressive and acute respiratory syndrome. COVID-19 patients also present other non-specific symptoms such as fever, myalgia, headache, lymphopenia, hyposmia, and hypogeusia [1]. Although only in a minor fraction of cases, other tissues may be affected by SARS-CoV-2, causing diarrhea, nausea and vomiting, which suggests that the gastrointestinal system is susceptible to the infection [2]. Moreover, proteinuria and acute renal tubular damage in COVID-19 patients indicate kidney impairment [3], and elevated troponin T and N-terminal pro-B-type natriuretic peptide levels imply possible cardiovascular injury [4].
COVID-19 has caused thousands of deaths worldwide. Several specialists have declared COVID-19 the most important health issue of the century, with millions of people directly or indirectly affected. The course of this pandemic has shown us that science communication and sharing have never been as important as they are now. Producing objective management guidelines for patients with COVID-19 and dealing with the high demand for hospital beds requires effective and reliable scientific data.
Thousands of scientific reports about COVID-19 have been published and the number of articles is still increasing, as reported by LitCovid [5].
In the current scenario, the speed of the publication process could be a pitfall, since methodological accuracy and the relationship between results and conclusions may sometimes be flawed.
Faced with this great amount of information, what has science been doing to efficiently translate it into health policies? What path, if any, has been followed so far? How can this massive quantity of information be passed along to our leaders and policy makers in a comprehensible way? How can it reach the general population? To obtain an overview of science's response to this complex situation, we designed PlatCOVID [6], a tool to automatically analyze and categorize the whole literature about COVID-19, allowing scientists to discuss and classify published data. We believe that professional engagement will accelerate the curation of the literature. Finally, our platform will gather dynamic information aiming to build a scientific consensus to assist policy managers in decision-making processes.

Implementation
PlatCOVID is a free Web platform for analyzing and curating scientific data, enabling the identification of useful information concerning the pandemic caused by the 2019 novel coronavirus (SARS-CoV-2). The platform aims to provide scientific consensus about COVID-19 issues by analyzing, discussing and classifying published scientific data, making it possible to assist and guide health care policies. It is therefore addressed mainly to scientists, academic staff, specialists in the field and health professionals.
In compiler design, the analysis of the input begins with two phases: lexical analysis and syntax analysis. Lexical analysis, or "tokenization", is the process of breaking up a sequence of characters into pieces called "tokens". Syntax analysis, or "parsing", comes after lexical analysis and analyzes the syntactical structure of the given input (source code or a program). It does so by building a data structure called a "parse tree" or "syntax tree" [7].
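The lexical-analysis step described above can be illustrated on abstract text rather than source code. The following is a minimal sketch in Python, not the platform's implementation (the study itself was carried out in R). The regular expressions and the function names `tokenize_sentences` and `tokenize_words` are assumptions chosen for illustration, and the sentence splitter is deliberately naive.

```python
import re
from typing import List

# Naive sentence boundary (assumption): ., ! or ? followed by whitespace
# and then an uppercase letter or a digit.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")

# A word token: runs of letters/digits, optionally joined by internal
# hyphens so that terms such as "SARS-CoV-2" stay in one piece.
_WORD = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize_sentences(text: str) -> List[str]:
    """Split a text into sentences using the naive boundary rule above."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def tokenize_words(sentence: str) -> List[str]:
    """Split a sentence into lowercase word tokens."""
    return [w.lower() for w in _WORD.findall(sentence)]


abstract = (
    "COVID-19 patients often present fever and myalgia. "
    "SARS-CoV-2 may also affect the kidney."
)
sentences = tokenize_sentences(abstract)
tokens = [tokenize_words(s) for s in sentences]
```

Here `sentences` holds two strings and `tokens` holds one list of lowercase words per sentence. A production pipeline would need to handle abbreviations such as "et al." or "Fig.", which this boundary rule splits incorrectly when they are followed by a capitalized word.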
After that, a secondary search was performed according to five categories: Diagnosis, Treatment, Epidemiology, Transmission and Clinical & Signs & Symptoms. For this categorization we used the MeSH [14] and DeCS [15] term lists. Then, we selected articles that had available abstracts. The abstracts were analyzed linguistically at the level of sentence and word tokenization using the pubmed.mineR and tokenizer packages. The online map was built with the tmap [16] and sp [17] R packages. Common words and numerals were removed from the results (Supplementary Table 1). All analyses were developed in the R environment, and all scripts and data (.Rdata) are accessible at our GitHub repository [18].
To facilitate the screening of publications, we assembled panels for the genes, tissues and drugs involved in COVID-19. A FAQ section is available with tutorials and information about how to curate data.
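The term-based categorization described above can be sketched as a lookup of abstract tokens against one term list per category. This is an illustrative Python sketch under stated assumptions, not the authors' R code: the term sets in `CATEGORY_TERMS` are hypothetical and heavily shortened stand-ins for the MeSH and DeCS lists, and the rule "one matching term assigns the category" is an assumption.

```python
from typing import Dict, Iterable, Set

# Hypothetical, heavily shortened term lists (assumption); the study
# used MeSH and DeCS terms for this step.
CATEGORY_TERMS: Dict[str, Set[str]] = {
    "Diagnosis": {"diagnosis", "pcr", "elisa"},
    "Treatment": {"treatment", "therapy", "antiviral"},
    "Epidemiology": {"epidemiology", "incidence", "prevalence"},
    "Transmission": {"transmission", "droplet", "contact"},
    "Clinical & Signs & Symptoms": {"fever", "pneumonia", "symptoms"},
}


def categorize(tokens: Iterable[str]) -> Set[str]:
    """Return every category with at least one term among the tokens."""
    present = set(tokens)
    return {name for name, terms in CATEGORY_TERMS.items() if present & terms}


# Toy tokenized abstracts keyed by a made-up identifier.
abstracts = {
    "a1": ["fever", "and", "pneumonia", "were", "common", "symptoms"],
    "a2": ["pcr", "diagnosis", "and", "antiviral", "therapy"],
    "a3": ["hospital", "bed", "capacity"],
}
labels = {key: categorize(toks) for key, toks in abstracts.items()}

# Abstracts matching no category, and abstracts matching all five.
uncategorized = [k for k, cats in labels.items() if not cats]
in_all = [k for k, cats in labels.items() if len(cats) == len(CATEGORY_TERMS)]
```

Because an abstract can receive several labels, the same structure yields both the overlap counts shown in a Venn diagram and the number of abstracts left uncategorized.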

Results And Discussion
On 6 July 2020, the search found 14,057 abstracts from 26,980 published articles. As expected, we observed an exponential increase in publications that is unprecedented in recent scientific literature (Fig. 1). These articles were published mainly as Journal Articles (60.8%), Letters (17.09%), Editorials (6.84%), Reviews (6.51%) and Comments (2.03%). We excluded articles without an available abstract (12,923) and applied the word and sentence tokenization methodology. Then, using the countrycode R package [19], we calculated how many times a country was cited in the abstract and the article affiliation. The United States (43.59%), the United Kingdom (16.63%), China (11.25%), Italy (5.71%) and Spain (5.35%) were the main sources of scientific literature. About 82.53% of the articles analyzed came from these five countries.
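The country tally described above can be sketched as follows. This is a hedged Python illustration, not the authors' method: the study relied on the countrycode R package, whereas `COUNTRY_ALIASES` below is a small hypothetical lookup, and counting each country at most once per text is an assumption.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Hypothetical alias lookup (assumption); the study used the
# countrycode R package to recognize country names.
COUNTRY_ALIASES: Dict[str, str] = {
    "united states": "United States",
    "usa": "United States",
    "united kingdom": "United Kingdom",
    "china": "China",
    "italy": "Italy",
}


def count_country_mentions(texts: Iterable[str]) -> Counter:
    """Count, per country, the number of texts that mention it."""
    counts: Counter = Counter()
    for text in texts:
        lowered = text.lower()
        # A set ensures a country is counted once per text even when
        # several of its aliases match.
        found = {
            country
            for alias, country in COUNTRY_ALIASES.items()
            if re.search(r"\b" + re.escape(alias) + r"\b", lowered)
        }
        counts.update(found)
    return counts


def shares(counts: Counter) -> Dict[str, float]:
    """Convert raw counts into percentages of all mentions."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {c: round(100 * n / total, 2) for c, n in counts.items()}


texts = [
    "Department of Medicine, Trieste, Italy",
    "Cases reported in China and Italy",
    "Boston, MA, USA; United States cohort",
    "London, United Kingdom",
]
mentions = count_country_mentions(texts)
percentages = shares(mentions)
```

In this toy input Italy is mentioned in two texts and the other three countries in one each, so the shares are computed over five mentions in total.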
Using the atomization process, 75,368 words/terms were found. Of these, 7,899 common words were excluded, leaving 67,469 words. The ten most cited terms are shown in Table 1. After that, we selected the 50 most recurrent words in the abstracts to continue the investigation (Supplementary Table 2). Our analysis suggests that the scientific focus, until now, has been to summarize the main clinical symptoms of COVID-19. It is also possible to infer that many articles aimed to describe the spread of the virus. The other scientific efforts discussed concerned the transmission, prevention, treatment, health care management and diagnosis of SARS-CoV-2 and COVID-19.
Table 1. The ten most cited words in COVID literature.
[...] Fig. 1). Twenty-eight articles hit all five criteria simultaneously (Fig. 2) and 3,374 abstracts were not categorized.
Diagnosis studies have focused on the clinical diagnosis of acute symptoms, mainly respiratory. The terms "PCR" and "qPCR" were rarely found in the abstracts. Curiously, molecular diagnosis was seldom cited and, consequently, seldom discussed. We are sensitive to this matter, since molecular and antibody detection tests (qPCR and ELISA/CLIA, respectively) are considered the gold standard for diagnosis.
Treatment studies focused on the clinical treatment of severe acute respiratory syndrome and pneumonia. Health care management was frequently mentioned. The use of antivirals was suggested, but no specific drugs were found to be relevant. The words "therapy", "drugs", "trials" and "effective" indicate that investigations into forms of treatment are currently being conducted. Nonetheless, we implemented a Drug panel at PlatCOVID to list all cited drugs.
Epidemiology studies have focused on clinical and infection features of the disease as well as on transmission risks. Epidemiological data on pneumonia status seem to be relevant to medical prevention and treatment during the COVID-19 pandemic.
Transmission studies have reported how the disease is transmitted by respiratory routes. The terms "transmission", "disease", and "infection" were highly cited in the abstracts, suggesting that forms of infection play an important role in epidemic transmission. Articles categorized as Clinical & Signs & Symptoms were the most abundant in the general analysis. In detail, these studies discussed the severe acute respiratory conditions and pneumonia symptoms in the infected group, with "acute", "pneumonia" and "lung" being common terms used to describe patients' clinical condition.
Moreover, the most frequent terms (Table 1) indicate the importance of determining the clinical aspects of the infection. Taking all these findings into account, the primary scientific response during the pandemic seems to have focused on reporting the main clinical signs and symptoms in order to extend this information to appropriate treatment and patient management. Nevertheless, a new perspective on molecular treatment and diagnosis will be critical to face COVID-19.
Translating scientific language is a continuous challenge. Public perception of science and the fake news circulating with dramatic frequency in the media and social networks can distort the real meaning of scientific evidence. Thus, we implemented a Web platform dedicated to the COVID-19 scientific literature that is able to automatically analyze, classify and highlight the important information in published articles.
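The ranking of frequent terms referred to above can be sketched as a count over tokenized abstracts after common words and numerals are dropped. This is a minimal Python illustration, not the authors' R script: `COMMON_WORDS` is a tiny hypothetical stop list standing in for the list in Supplementary Table 1, and the toy documents are invented.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical, tiny stop list (assumption); the study's own list of
# common words is given in its Supplementary Table 1.
COMMON_WORDS = {"the", "of", "and", "in", "with", "were", "a"}


def top_terms(token_lists: Iterable[List[str]], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent tokens, skipping common words and numerals."""
    counts = Counter(
        tok
        for tokens in token_lists
        for tok in tokens
        if tok not in COMMON_WORDS and not tok.isdigit()
    )
    return counts.most_common(n)


# Toy tokenized abstracts.
docs = [
    ["acute", "pneumonia", "in", "the", "lung"],
    ["pneumonia", "and", "fever", "in", "41", "patients"],
    ["acute", "pneumonia", "with", "fever"],
]
ranking = top_terms(docs, n=3)
```

With these toy documents "pneumonia" ranks first with three occurrences, followed by "acute" and "fever" with two each. Raising `n` to 10 or 50 corresponds to the ten-term table and the fifty-word selection mentioned in the text.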

Conclusions
Aware of the computational limitations in studying the linguistics and semantics of scientific articles, we invite scientists and all specialists in the field to join us and help mine and curate the COVID-19 literature. The categorization, classification and discussion of scientific issues led by professionals in the field should be translated into support for public health measures and for policy managers' decisions in controlling and managing this pandemic.

Availability And Requirements
All data are available at www.platcovid.com. The source code, the data analyzed and the results are available in the platcovidsource repository on our GitHub (https://github.com/bio-hub/platcovidsource). We wish to confirm that there are no known conflicts of interest associated with this publication.

Consent for publication
Not applicable

Competing interests
None.

Funding
This study was supported by the following grants: ISE-EMH (Italian-Slovenian Ecosystem for Electronic and Mobile Health) from the European Community and RC03/20 BioHub - A High-throughput Platform For OMICs Data Analysis And Integration from the Italian Ministry of Health. Both ISE-EMH and RC03/20 supported the computational stations and requirements needed to perform the current analysis. The RC03/20 project also supports Ronald Moura as the recipient of a senior fellowship. Lucas Brandão is the recipient of a senior fellowship from the Brazilian National Council for Scientific and Technological Development (CNPq). None of the funding bodies cited above was involved in the design of the study or in the collection, analysis, and interpretation of data.

Figure 2
Venn diagram for the five categories in PlatCOVID.
