regCOVID: Tracking publications of registered COVID-19 studies

In response to the COVID-19 pandemic, many clinical studies have been initiated, creating a need for efficient ways to track and analyze study results. We expanded our previous project that tracked registered COVID-19 clinical studies to also track result articles generated from these studies. We searched ClinicalTrials.gov and PubMed to identify articles linked to COVID-19 studies, and used criteria based on trial phase, intervention, location, and record recency to develop a prioritized list of result publications. We found 760 articles linked to 419 interventional trials (15.7% of all 2 669 COVID-19 interventional trials as of 15 August 2021), with 418 identified via abstract-link in PubMed and 342 via registry-link in ClinicalTrials.gov. Of the 419 trials publishing at least one article, 123 (29.4%) have multiple linked publications. We used an attention score to develop a prioritized list of all publications linked to COVID-19 trials and identified 58 publications that are result articles from late phase (Phase 3) trials with at least one US site and multiple study record updates. For COVID-19 vaccine trials, we found 69 linked result articles for 40 trials (13.9% of 290 total COVID-19 vaccine trials). Our method allows for the efficient identification of important COVID-19 articles that report results of registered clinical trials and are connected via a structured article-trial link.


Introduction
The COVID-19 pandemic led to the initiation of thousands of clinical studies testing various interventions and studying the natural course of the disease. For researchers or the public, it can be difficult to navigate and organize such a large number of studies. We previously created a framework, known as regCOVID, for monitoring registered COVID-19 studies using the ClinicalTrials.gov (CTG) registry. 1,2 The framework uses data science methods that computationally identify COVID-19 clinical studies using a keyword search. It also uses computer code to regularly monitor and analyze key features of COVID-19 interventional trials, observational studies, and patient registries registered at CTG.
A study may publish three types of information: (1) registration data at study initiation (in a clinical trial registry, such as CTG), (2) basic summary results at study completion (in a clinical trial registry), or (3) an article with well-commented full study results (in a journal). Prior analyses of phase-2-or-higher interventional trials indicate that only 27.8% publish a study result article. 3 A completed study with one or more study result journal articles provides the most value to researchers and the public. Poor information about study status or study results may reduce public trust in the clinical trials enterprise. 4 In this study, we extended our regCOVID monitoring project to identify study result articles that are linked to registered COVID-19 trials. 1,3 Since the total volume of published COVID-19 articles may be overwhelming, we propose focusing only on articles that are linked to formally registered studies to facilitate an effective review of COVID-19 scientific literature. Unlike many efforts that use predominantly manual review to provide the public with an overview of trials and their results, we use a computational clinical research informatics approach to assess which COVID-19 studies are publishing, what they are publishing, and when. 5 Because a reader may have limited time to read and review articles or abstracts, the purpose of this research is to create a system that prioritizes which articles to read to best understand the current state of clinical trial research for COVID-19. Our computerized processing script can also be generalized and applied to other conditions.

Materials And Methods
Our project repository (available at https://github.com/lhncbc/r-snippets-bmi/tree/master/regCOVID/regCOVIDpublications) includes our computer code, supplemental files, analysis results and a detailed web-based results report. 6 We also refer to the project using the short name regCOVIDpub.
Throughout the methods and results, we reference supplemental files in the project repository by file name. The script is written in the R language. For result reports, we use the R Markdown framework. For most analyses, the repository will offer monthly refreshed results.
To find result articles linked to COVID-19 clinical studies, we perform three high-level steps. In the first step, we identify all COVID-19 studies. In the second step, we attempt to gather all published study result articles linked to those studies. In the third step, we retrieve additional metadata about the articles and their affiliated studies and create a prioritization scoring system to identify the most significant publications. The sections below elaborate on each high-level step.

COVID-19 Studies
For the first step, we retrieved all COVID-19 studies (see supplemental file '../regCOVIDpublications_trials_all.csv' in the study repository) using the results of our previously published work on tracking registered COVID-19 clinical studies (regCOVID). 1 We considered a study eligible if it was a COVID-19 interventional trial, observational study or registry that was recruiting, active, or ended (completed or terminated) and was registered at CTG.

Identification of COVID-19 research articles
Once we identified the eligible studies, in the second step, we searched for publications linked to each study using two different methods: registry-linked and abstract-linked. This methodology is based on prior published work by our research group. 3 We describe each article linkage mechanism separately below.

Registry-linked result article search
Registry-linked result articles are those included in the study record on the CTG registry. We used the Aggregate Analysis of ClinicalTrials.gov (AACT) database developed by researchers at Duke University. 7 The AACT database is created by parsing the XML study data from CTG. 7 We used the 'result_reference' XML field within the study record. Because some result_reference articles are known to be incorrectly labelled as such, we used the article publication date to remove misclassified articles (that were actually of type 'supporting_reference'). See the prior publication for details. 3 We then linked the result publications found in the CTG study records to their PubMed abstracts to identify key details about each article, such as article title and type. For context, a prior study of 8 907 trials completed between 2006 and 2009 found that 7.3% of trials had at least one registry-linked result article. 3

Abstract-linked result article search
Abstract-linked articles are those where the authors of trial result articles follow the guidance of the International Committee of Medical Journal Editors and properly reference the relevant trial identifier in the article abstract. This reference is processed by PubMed and turned into searchable article metadata (called a secondary identifier). We retrieved abstract-linked articles through a PubMed metadata search for articles whose secondary identifier contained the CTG identifier (NCT ID) of a COVID-19 trial. For context, the same prior study found that 23.3% of trials had abstract-linked result articles. 3
We combined the lists of publications from these two search methods to generate a master list of linked COVID-19 articles (see supplemental file 'regCOVIDpublications_publication_list_all.csv'). The master publication list allows for an enhanced review of the resulting articles. It combines PubMed and CTG data and shows the trial NCT identifier, PubMed PMID identifier, trial intervention (e.g., convalescent plasma), article keywords using Medical Subject Headings (MeSH), trial sponsor (e.g., University of Oxford) and many other article or trial metadata. We separated the article set based on study type and performed the rest of the analysis on interventional trials only, as they are the most relevant trials (at this point in the pandemic) and the main focus of our study.
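The two linkage mechanisms and their combination can be illustrated with a minimal Python sketch (the project's own script is written in R, so this is a re-expression rather than the authors' code). The record layouts, function names, and example identifiers are hypothetical. The sketch relies on two points from the text and the registry's conventions: an NCT ID is "NCT" followed by eight digits, and a registry result reference published before the trial start date is treated as misclassified.

```python
import re
from datetime import date

# ClinicalTrials.gov identifiers are "NCT" followed by eight digits.
NCT_PATTERN = re.compile(r"NCT\d{8}")


def registry_links(references, trial_start):
    """Keep registry result references published on or after the trial start.

    references: iterable of (nct_id, pmid, publication_date) tuples
                (hypothetical layout).
    trial_start: dict mapping nct_id -> trial start date.
    """
    kept = []
    for nct_id, pmid, published in references:
        start = trial_start.get(nct_id)
        # A "result" published before the trial began is treated as misclassified.
        if start is not None and published < start:
            continue
        kept.append((nct_id, pmid, "registry"))
    return kept


def abstract_links(secondary_ids, covid_trials):
    """Turn PubMed secondary identifiers into (nct_id, pmid, 'abstract') links.

    secondary_ids: iterable of (pmid, identifier_string) pairs (hypothetical layout).
    covid_trials: set of NCT IDs of the COVID-19 trials of interest.
    """
    links = []
    for pmid, identifier in secondary_ids:
        for nct_id in NCT_PATTERN.findall(identifier):
            if nct_id in covid_trials:
                links.append((nct_id, pmid, "abstract"))
    return links


def master_list(registry, abstract):
    """Union of both lists: one row per trial-article-link-type combination."""
    return sorted(set(registry) | set(abstract))


# Made-up example data.
trial_start = {"NCT00000001": date(2020, 4, 1)}
refs = [
    ("NCT00000001", 111, date(2019, 6, 1)),    # predates the trial: dropped
    ("NCT00000001", 222, date(2020, 11, 15)),  # kept
]
sec_ids = [(222, "ClinicalTrials.gov/NCT00000001"), (333, "ISRCTN12345678")]
links = master_list(
    registry_links(refs, trial_start),
    abstract_links(sec_ids, set(trial_start)),
)
```

In this toy example, article 222 is found through both mechanisms and therefore appears twice in the master list, once per link type, which mirrors how an article identified via both link types is counted.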

Interventions
The intervention being studied (e.g., remdesivir) in a trial and discussed in a publication contributes to how significant the publication is in the research landscape. Interventions must progress through the phases of interventional trials (phase 1/2/3) to receive regulatory approval for a given indication.
Different interventions were studied for COVID-19 and advanced to different phases. Therefore, we created an intervention significance score for each intervention studied. The score was calculated by assigning a phase-based numeric value for each phase in which an intervention has a trial and adding 0.01 for each trial in that phase, to reflect the added significance of multiple trials in that phase. For example, tocilizumab had 12 phase 3 trials, which adds 3.12 to its intervention score (3 for having a phase 3 trial and 0.12 [12 × 0.01] for having 12 phase 3 trials). The higher the score, the more significant the level of study of the intervention in the COVID-19 research landscape. For trials that combined two phases, we counted the trial under the higher phase (a phase 2/3 trial was considered a phase 3 trial).
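The scoring rule can be written out as a short Python sketch (an illustration, not the project's R code). The function names are hypothetical, and the text does not state how labels such as "Early Phase 1", phase 4, or "N/A" are treated, so their handling here is an assumption. The tocilizumab contribution above (3.12) and the mRNA-1273 score reported in the Results (three phase 1, two phase 2 and four phase 3 trials; 6.09) serve as checks.

```python
import re
from collections import Counter


def phase_number(phase_label):
    """Return the highest phase number named in a label, or 0 if none is named.

    "Phase 2/Phase 3" -> 3, following the rule that combined-phase trials
    count under the higher phase. Labels without a digit (e.g. "N/A") give 0.
    """
    digits = re.findall(r"\d", phase_label)
    return max(map(int, digits)) if digits else 0


def intervention_score(phase_labels):
    """Sum, over each phase with at least one trial, of phase + 0.01 * trial count."""
    counts = Counter(phase_number(label) for label in phase_labels)
    counts.pop(0, None)  # assumption: trials without a numbered phase add nothing
    return sum(phase + 0.01 * n for phase, n in counts.items())


# Contribution of 12 phase 3 trials (the tocilizumab example): about 3.12.
tocilizumab_phase3 = intervention_score(["Phase 3"] * 12)

# Three phase 1, two phase 2 and four phase 3 trials: about 6.09.
mrna_1273 = intervention_score(["Phase 1"] * 3 + ["Phase 2"] * 2 + ["Phase 3"] * 4)
```

Because the values are floating-point sums, comparisons against reported scores are best done with a small tolerance rather than exact equality.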

Publication attention score
Our goal was to generate a ranked list of publications with the most significant publications appearing at the top. We used the construct of an attention score that gives the most significant publications higher values.
The score is based on the recency of the publication, the phase of the trial, the intervention significance score, the number of times the trial record has been updated (high-impact trials are updated more frequently), and whether the trial includes a US site. In other words, publications ranked higher if they were recent, came from a later phase trial, involved a significant intervention, and were linked to a CTG study record that had been updated multiple times and had at least one US site. For scoring purposes, if a trial was a combination of two phases, such as a phase 2/3 trial, we considered it under the higher phase (phase 3 in this example).
We also retrieved the article type from PubMed and gave publications that were not study result articles, such as protocols or editorials, less significance, and therefore lower attention scores, than study result articles.
In the final ranked publication list, we also present to the user further important publication and study metadata that are not input parameters for the calculation of the attention score. This information includes the study sponsor, the journal where the publication was published, and whether study results were deposited on CTG as part of the trial record. This information can be seen in the supplemental material (regCovidpublications_Master.csv at the project repository).
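The text names the inputs to the attention score but not the exact formula, so the Python sketch below only illustrates how such inputs could be combined into a ranking. Every weight, the two-year recency window, the 0.5 penalty for non-result articles, and all field and function names are hypothetical choices made for illustration; they are not the values used in the project's R script.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class LinkedPublication:
    pmid: int
    published: date
    phase: int                 # combined phases already mapped to the higher one
    intervention_score: float
    record_updates: int        # updates to the CTG record after registration
    has_us_site: bool
    is_result_article: bool    # False for protocols, editorials, and similar


def attention_score(pub, as_of):
    """Illustrative additive score; every weight below is a hypothetical choice."""
    months_old = (as_of.year - pub.published.year) * 12 + (
        as_of.month - pub.published.month
    )
    score = 0.0
    score += max(0, 24 - months_old) / 24     # recency, fading over two years
    score += pub.phase                        # later phase ranks higher
    score += pub.intervention_score / 10      # more heavily studied intervention
    score += 1.0 if pub.record_updates > 1 else 0.0
    score += 1.0 if pub.has_us_site else 0.0
    if not pub.is_result_article:
        score *= 0.5                          # protocols and editorials rank lower
    return score


def rank(publications, as_of):
    """Return publications sorted from highest to lowest attention score."""
    return sorted(publications, key=lambda p: attention_score(p, as_of), reverse=True)


# Made-up example: a recent phase 3 result article versus an older phase 1 protocol.
query_date = date(2021, 8, 15)
result_article = LinkedPublication(1, date(2021, 6, 1), 3, 6.09, 5, True, True)
protocol = LinkedPublication(2, date(2020, 6, 1), 1, 1.01, 0, False, False)
ranked = rank([protocol, result_article], query_date)
```

An additive form keeps each factor's contribution easy to inspect; a different weighting would change the order of closely scored publications but not the overall idea of ranking by these inputs.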

Subset of COVID-19 vaccine trials
Due to the great importance of and interest in vaccine trials for COVID-19, we looked specifically at a subset of COVID-19 vaccine interventional trials. The subset was developed by searching for the term vaccine in the trial's title (an approach developed and evaluated in the previously published regCOVID study; as of 2021, CTG does not capture vaccine as a separate intervention type). 1 Similar to the overall set of COVID-19 studies, we analyzed the vaccine trials based on key trial and publication features and generated attention scores for each publication associated with a trial of a COVID-19 vaccine.

Observational studies and registries
We also analyzed both observational studies and registries. As with interventional trials, we identified both abstract-linked and registry-linked publications and assigned attention scores based on the recency of the publication, the number of study record updates and whether or not the study included a US site. Phase is not relevant for observational studies and registries.

Results
All analytical results presented below were based on a query date of 15 August 2021. We plan to publish refreshed results at the study repository. 6 The repository history mechanism and formal data releases allow retrieval of any data release over time. The repository contains a report generated using an R notebook framework (computer code combined with user-friendly result outputs). In addition to the report, important results are available as separate files in spreadsheet format. Such separate files are referred to in the results by names prefixed with 'regCOVIDpublications_'.

Interventional trials
As of the query date (15 August 2021), we identified and analyzed a total of 2 669 recruiting, active or ended (completed or terminated) COVID-19 interventional trials (see file regCOVIDpublications_trials_int.csv). On the trial level, a total of 419 trials (15.7% of all 2 669 trials) have at least one linked result article. A total of 123 trials have multiple publications, with 63 trials having published three or more articles.
The total number of trial-article-link-type combinations was 760, with 418 (55.0%) articles identified via abstract link and 342 (45.0%) identified via registry link. Eleven articles (1.5%) overlapped and were identified via both link types. Since the same article can be linked to multiple trials (e.g., a meta-analysis or an editorial about multiple trials), we found that there were 679 distinct publications linked to all included COVID-19 interventional trials.
It is important to consider the level of effort required (of the principal investigator or other study officials) to link a publication to a trial. Abstract linking is easier and faster because the article author can simply state the NCT ID in the abstract, and the article-study linkage is generated automatically through the processing of PubMed abstracts. The majority of result articles (55.0%) were abstract-linked. Registry linking, on the other hand, requires an update of the record in CTG, either by XML file submission through the registry's application programming interface or by using CTG's web-based data entry system (called the Protocol Registration and Results System; PRS). Per our methodology, 964 registry-linked articles were removed as incorrect, misclassified result articles (articles that had a publication date prior to the start of the trial).
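The counts reported in this section are simple aggregations over a table of trial-article-link-type rows. The Python sketch below shows one way to compute them. The row layout, function names and toy data are hypothetical, and the project's own implementation is in R.

```python
from collections import defaultdict


def summarize_links(links):
    """Summarize a list of (nct_id, pmid, link_type) rows.

    link_type is 'abstract' or 'registry' (hypothetical layout).
    """
    by_type = defaultdict(int)
    pmids_per_trial = defaultdict(set)
    types_per_pair = defaultdict(set)
    for nct_id, pmid, link_type in links:
        by_type[link_type] += 1
        pmids_per_trial[nct_id].add(pmid)
        types_per_pair[(nct_id, pmid)].add(link_type)
    return {
        "combinations": len(links),
        "by_link_type": dict(by_type),
        # trial-article pairs found through both mechanisms
        "linked_by_both": sum(1 for t in types_per_pair.values() if len(t) == 2),
        # one article may be linked to several trials, so count PMIDs once
        "distinct_publications": len({pmid for _, pmid, _ in links}),
        "trials_with_article": len(pmids_per_trial),
        "trials_with_multiple": sum(1 for p in pmids_per_trial.values() if len(p) > 1),
    }


def share(part, whole):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100 * part / whole, 1)


# Toy data: article 10 is linked to trial 1 by both mechanisms,
# and article 11 is linked to two different trials.
toy_links = [
    ("NCT00000001", 10, "abstract"),
    ("NCT00000001", 10, "registry"),
    ("NCT00000001", 11, "abstract"),
    ("NCT00000002", 11, "registry"),
]
summary = summarize_links(toy_links)
```

Applied to the reported totals, `share(418, 760)` and `share(419, 2669)` reproduce the 55.0% and 15.7% figures given above.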

Interventions
Using our computerized approach, we identified 3 295 interventions used in COVID-19 interventional trials. Of these 3 295 interventions, 549 had at least one publication connected to a trial. Table 1 shows the top 10 interventions based on intervention score, and includes the number of total trials, the count of trials by phase, the number of sponsors testing a given intervention, and the number of publications resulting from these trials. Data for all interventions (beyond the top 10 shown in Table 1) are available in file regCovid_intervention-phase_cnts_int2.csv as well as in the regCOVIDpub report at the project repository. While hydroxychloroquine was the intervention with the most publications (81) and the highest intervention score (8.301), based on the number of trials and the breadth of the phases the trials covered, convalescent plasma was the intervention with the most distinct sponsors studying it (103). A total of 708 interventions had at least one phase 3 (or phase 2/3) trial. While multiple vaccine candidates have progressed through each phase, their intervention significance scores are lower than those of most other interventions that progressed to a similar phase, since the volume of trials studying a vaccine candidate is usually limited by the fact that only the developer (and select co-sponsors) study it.
For example, the vaccine candidate mRNA-1273 from Moderna has 9 total trials (three Phase 1, two Phase 2 and four Phase 3) with an intervention significance score of 6.09, which is lower than that of most other interventions that also proceeded to phase 3 (as seen in Table 1) and have a much higher volume of total trials.

Publication significance
Using the attention score to rank publications, we generated a ranked list of all 760 publications and a short list of 58 prioritized publications (publications that were not protocols, were from late phase trials (phase 3) with at least one US site and had multiple study record updates). Of the 760 trial-publication combinations, 234 (30.8%) were phase 3, 186 (24.5%) had at least one US site, and 528 (69.5%) had multiple study record updates. Our methodology quickly identified result publications for prominent trials, such as trials involving vaccines approved in the US. Targeted review of those studies shows that they updated their CTG record frequently, which gives more confidence in the study metadata and study status (completed, terminated, or ongoing). In terms of pairing trials with their result-reporting journal articles, the majority of linked result articles for interventional COVID-19 trials were found via abstract-link (55.0%), perhaps due to the easier practice of including the NCT ID in the article abstract.
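The short-list criteria stated above translate directly into a filter. The following Python sketch is illustrative only: the dictionary keys and the "protocol" label are hypothetical names, and "multiple" study record updates is interpreted as more than one.

```python
def is_short_listed(pub):
    """True when a linked publication meets all four short-list criteria.

    pub is a dict with hypothetical keys: 'article_type', 'phase',
    'us_sites', and 'record_updates'.
    """
    return (
        pub["article_type"] != "protocol"   # not a protocol
        and pub["phase"] == 3               # late phase (phase 2/3 counted as 3)
        and pub["us_sites"] >= 1            # at least one US site
        and pub["record_updates"] > 1       # multiple study record updates
    )


# Made-up candidates: only the first satisfies every criterion.
candidates = [
    {"pmid": 1, "article_type": "result", "phase": 3, "us_sites": 4, "record_updates": 6},
    {"pmid": 2, "article_type": "protocol", "phase": 3, "us_sites": 2, "record_updates": 3},
    {"pmid": 3, "article_type": "result", "phase": 2, "us_sites": 1, "record_updates": 5},
    {"pmid": 4, "article_type": "result", "phase": 3, "us_sites": 0, "record_updates": 2},
]
short_list = [pub["pmid"] for pub in candidates if is_short_listed(pub)]
```

Keeping the US-site condition as a separate clause makes it easy to drop or to replace with a different country.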
The main advantage of our approach is that it offers researchers and the public a structured overview of the literature with valuable metadata that combines information from the scientific literature (PubMed) and a clinical trial registry (CTG). It allows researchers to sort or aggregate articles based on various useful parameters (trial phase, sponsor, intervention and many others). Such capability is not available in existing tools: neither PubMed search nor the clinical trial registry allows a review that combines data from both sources. Our approach provides an overview of the clinical research in a given disease generated through an automated computer script. For example, a review of all articles for a given intervention (such as hydroxychloroquine) could reveal whether there is a consensus opinion on its efficacy or whether there is a divide and more research is needed. In the case of hydroxychloroquine, five result articles from four clinical trials in the US (on the prioritized short list) all reported that the intervention was ineffective. A review of the full article master list (worldwide scope; not restricted to trials with at least one US site) would show a total of 81 articles from 38 trials studying hydroxychloroquine (see the supplemental file for the master article list, called 'regCOVIDpublications_publication_list_int.csv').

Levels of trial visibility
Our results show various levels of trial result reporting, ranging from zero to multiple result articles. We found 96 COVID-19 interventional trials that had multiple study result articles as well as multiple registry record updates. On the next level are trials with exactly one result article. Of the trials with at least one linked journal article, 70.6% have exactly one article. Within the set of trials with exactly one article, 26.9% had only a protocol publication rather than a study result article, which is the most valuable publication type. Finally, the vast majority of COVID-19 trials do not have any linked result publications (2 250 studies, 84.3%), making it difficult for interested parties to know the outcome of the trial. An even more extreme case of minimal trial information is that of trials with no linked result articles and zero updates (besides the initial registration) to the CTG study record (459 interventional trials, 17.2% of 2 669 total interventional trials). Our project, regCOVID, is the first to use the number of registry record updates (and the type of each update) as a novel, computed study metadata construct to further categorize studies by level of activity. This can be helpful in comparing studies with identical official study status and can improve the prioritization of result publications stemming from these studies.
Result deposition: As an alternative to publishing study results through an article, many studies chose to distribute study results by depositing them on CTG. A total of 61 trials deposited basic summary results.
Within those, 35 trials only did registry result deposition and have no study result article and the remaining 26 trials did both result deposition and published a result article.
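The visibility levels described in this section can be expressed as a small classification over two per-trial counts. In the Python sketch below, the level names, the function name, and the choice to treat more than one update as "multiple" are hypothetical labels for the groups named in the text; this is not code from the project repository.

```python
from collections import Counter


def visibility_level(result_articles, record_updates):
    """Classify a trial by linked result articles and registry record updates.

    result_articles: number of linked result articles.
    record_updates: number of updates to the CTG record after initial registration.
    """
    if result_articles > 1 and record_updates > 1:
        return "multiple articles, multiple updates"
    if result_articles > 1:
        return "multiple articles"
    if result_articles == 1:
        return "one article"
    if record_updates == 0:
        return "no articles, no updates"
    return "no articles"


# Made-up (result_articles, record_updates) pairs for five trials.
trials = [(3, 7), (2, 1), (1, 4), (0, 0), (0, 5)]
levels = Counter(visibility_level(a, u) for a, u in trials)
```

Tallying the levels over all trials gives the kind of breakdown reported above, from trials with multiple articles and updates down to trials with neither.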

Trial registration timing
As part of our analysis, we found that trials register at three different points in time: (1) prior to trial initiation, (2) after trial initiation and prior to completion (during), and (3) after trial completion.
Publication bias: While manual review of the abstracts of result publications was out of scope, we recognize the potential presence of publication bias that may lead some trials not to formally publish results in a medical journal. For example, given reports that some sponsors have clearly terminated plans for further vaccine development, a lack of result articles for certain trials and vaccine candidates hints at possible publication bias in vaccine trials.
Other manual trial trackers: Besides computational methods to obtain the most relevant COVID-19 journal articles, it is also possible to rely on websites (and research teams) that provide manually reviewed lists of completed studies with reported results. For example, The New York Times maintains a vaccine and therapy tracker. 5 Another study tracker is published by the NIH. 10
Generalization to other diseases: regCTGpublications
Due to the computerized nature of our methodology, the method and the developed script can be applied to other conditions to achieve an analogous overview of interventions and ranked list of publications. Our project called regCTG 11 finds a list of studies for a given condition (generalization of regCOVID

Limitations
Our study has several limitations. First, we rely on structured links between a registered study and the result article. A prior study of trials completed from 2004 to 2008 indicates that the negative predictive value of such a link may be as low as 56%. 13 In other words, an unlinked result article may exist for a trial. However, in recent years, journal requirements to include NCT trial identifiers in an abstract may be better enforced. Second, researchers have no obligation to publish result articles in a medical journal. Our study uses indexed medical journal publications, though sponsors may instead make study results public via a press release. Third, our study uses only a single, US-based clinical trial registry, ClinicalTrials.gov. On the other hand, other registries often do not allow linking of a result publication in a registry record, do not support basic summary result deposition and have limited or no API access options. Also, the CTG registry has a significant number of non-US studies: as of March 2021, 60% of studies in the recruiting status were non-US only. Fourth, one part of our algorithm, which can be turned off or re-configured for a different country, focuses on trials with at least one US site. We chose this because some legal mandates are tied to this factor. Also, approval in the US (by the Food and Drug Administration) is a significant factor in the worldwide regulatory context (with some exceptions). Fifth, interventions are entered into CTG as free text, and proper linkage of identical interventions (expressed using similar intervention strings, such as 'anti-sars-cov-2 convalescent plasma' and 'convalescent covid 19 plasma') depends on a computational algorithm that can miss some links between identical interventions.

Conclusion
We developed a data science driven approach to quickly identify and track linked articles for COVID-19 clinical studies. We characterized which studies are publishing and what type of trial-article link is used, and designed a ranking score to prioritize the most significant publications for understanding clinical research for COVID-19. For a set of 2 669 active or ended interventional trials, we found 760 published study result articles, including a short list of 58 key articles from late phase, US-based trials with multiple study updates. We separately analyzed trials for COVID-19 vaccines and found 69 linked result articles (including the Pfizer/BioNTech, Moderna and Johnson and Johnson vaccine trials). The computerized nature of our analyses allows for the publication of monthly refreshed data at our GitHub repository and the development of a generalized format that can be used to perform similar analyses for other conditions.

Declarations

Data Availability
The datasets generated and analysed during the study are available in the regCOVIDpublications repository: https://github.com/lhncbc/r-snippets-bmi/tree/master/regCOVID/regCOVIDpublications