Several recent studies have reported a conspicuously high rate of termination among studies conducted with full financial and institutional support from national agencies such as the National Science Foundation (NSF) and the National Institutes of Health (NIH). One such study reported that about 19 percent of all studies registered in the ClinicalTrials.gov repository were terminated before yielding results.
Prior to the publication of this report, [1] followed a cohort of randomized controlled trials conducted in Switzerland, Germany, and Canada over a three-year period (2000 to 2003) and reported that about 25 percent of the studies they observed were discontinued. The risk of study failure is also known to vary by focus area. For example, it is reported that 19% of studies conducted between 2008 and 2010 that focused on pediatric medicine did not yield results, and Jamjoom et al. reviewed neurosurgery trials in the ClinicalTrials.gov repository and reported that about 26.6% of such trials were discontinued early. If study success is judged by publication rather than by mere non-termination, the proportion of study failures is intuitively even higher than what is reported in the literature above.
In most cases of study termination, the reasons are not readily given. Among the known reasons, inadequate subject enrollment appears to be the most common. Other factors, such as unanticipated adverse drug events (e.g., toxicity in drug trials) and early termination due to higher-than-expected treatment efficacy, are also cited, but to a much lesser degree.
We know that scientific studies in all disciplines are initiated with extensive planning and deliberation, often by a highly trained team of scientists. Further, to assure that the quality, integrity, and feasibility of funded research projects meet certain standards, research funding agencies such as the National Institutes of Health and the National Science Foundation vet proposed research plans through a rigorous peer review process before deciding whether a project should proceed. The proposal review process has been described as a time-consuming and costly enterprise. Yet some studies pass through all of this scrutiny and still end up being terminated before yielding results. Our assessment of this circumstance convinces us of the need to explore an approach that could improve the screening process so as to minimize trial terminations.
The existence of the ClinicalTrials.gov repository presents a unique opportunity to study a number of issues regarding the life cycle of scientific studies. The repository traces its origin to the Food and Drug Administration Modernization Act (FDAMA) of 1997, which required the registration of all trials testing the effectiveness of investigational drugs for serious or life-threatening conditions. In 2000, Congress authorized the creation of the ClinicalTrials.gov (CT.gov) registry to provide information about, and access to, clinical trials for persons with serious medical conditions.
The Food and Drug Administration Amendments Act of 2007 (FDAAA) established mandates requiring sponsors of applicable interventional human research studies to register and report basic summary results on CT.gov, widening the range of studies that must be registered. In general, this included all non-phase-1 interventional trials of drugs, medical devices, or biologics initiated after September 27, 2007. The FDAAA also required that all such trials report results within one year after the primary completion date or within one year after the date of early termination. Subsequently, the Department of Health and Human Services (HHS) issued a regulation known as “the final rule,” which clarified the requirements for reporting summary results in the ClinicalTrials.gov repository [3]. Currently, government-funded studies conducted within the United States must be registered by law and as a prerequisite for publication, making CT.gov useful for cross-disciplinary analysis of trends in clinical trial protocol and conduct. Although the FDAAA and HHS policies outlined above have improved the completeness of the ClinicalTrials.gov data, major inconsistencies in the manner in which the data are reported, as Cahan and Anand (2017) observe, cause significant problems for researchers who wish to use the data for analysis [4]. Such inconsistencies create obstacles to using the “structured data” in the repository for statistical modeling and analysis.
At the time of writing this manuscript, there were 281,648 research studies registered in the ClinicalTrials.gov registry, with varying levels of detail about each study. Researchers can provide information about a study in a total of 356 attribute fields, most of which store structured attribute data (string, numeric, and date types). There are 36 free-text fields in which lengthier descriptions of study characteristics are saved.
Recently, researchers have highlighted the ubiquity of unstructured data generated through health care practice transactions. Such observations have spurred increased interest in the application of text mining approaches in health care and medical research. Examples include the works of Lazard et al. and of Glowacki and collaborators, both of whom used text mining to extract the distinct topics present in tweets concerning health issues. Lazard et al. examined the use of e-cigarettes, while [6] and [7] focused on public interest and concern regarding Ebola and Zika, respectively. Topic generation in text mining uses one of two approaches. The first, called latent semantic indexing (LSI), uses a method from linear algebra (singular value decomposition) to identify topics.
The second approach, called latent Dirichlet allocation (LDA), uses a Bayesian approach to modeling documents and their corresponding topics and terms. The goal of both techniques is to extract semantic components from the lexical structure of a document or corpus. LDA is the more recent (and more popular) of the two approaches; it was introduced by [8] in 2003.
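To make the contrast concrete, the following is a minimal sketch of the LSI approach using a standard truncated SVD implementation from scikit-learn; the toy corpus and the choice of two topics are illustrative placeholders, not data from this study.

# Minimal LSI sketch: topics as singular vectors of the term-document matrix.
# The corpus and the choice of two topics are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "randomized trial of drug toxicity in adult patients",
    "enrollment of pediatric subjects in a vaccine study",
    "drug efficacy and adverse events in oncology patients",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(corpus)        # weighted term-document matrix

lsi = TruncatedSVD(n_components=2, random_state=0)
doc_coords = lsi.fit_transform(X)      # documents in the latent topic space

# Each "topic" is a weighted combination of terms (a singular vector).
terms = tfidf.get_feature_names_out()
for k, component in enumerate(lsi.components_):
    top_terms = [terms[i] for i in component.argsort()[::-1][:3]]
    print(f"LSI topic {k}: {top_terms}")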
LDA uses Bayesian methods to model each document as a mixture of topics and each topic as a mixture of words. The word ‘mixture’ here denotes a set of elements (topics or words) with corresponding probabilities. This reflects the realistic view that a body of text (a document or corpus) incorporates multiple themes and that the topics are fluid in nature. Thus, each document can be represented by a vector of topic probabilities, while each topic can be represented by a vector of word probabilities.
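As a concrete illustration of these two mixtures, the sketch below fits LDA to a toy corpus with scikit-learn; each row of the document-topic matrix and each normalized row of the topic-word matrix is one of the probability vectors described above. The corpus and the number of topics are placeholders.

# Hedged sketch: document-topic and topic-word mixtures from LDA (toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "randomized trial of drug toxicity in adult patients",
    "enrollment of pediatric subjects in a vaccine study",
    "drug efficacy and adverse events in oncology patients",
]

counts = CountVectorizer(stop_words="english").fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)   # each row: topic probabilities for one document
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # each row: word probabilities for one topic

print(doc_topic.round(2))    # rows sum to 1: a document is a mixture of topics
print(topic_word.round(2))   # rows sum to 1: a topic is a mixture of words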
The number of topics used in LDA is a user-supplied parameter, and there is currently no formal way of determining how many topics should be extracted. Hence, researchers are generally at liberty to choose the number of topics for whatever corpora and terms they work with. Much of the current literature suggests that researchers in diverse applied and scientific fields are still in pursuit of an approach for determining a number of topics that robustly characterizes their corpora. Amado et al., in their comprehensive study of current trends on big data in the marketing literature, use a simple approach suggested by [10], while others take a more exploratory approach and try several numbers of topics. Cai et al. presented an alternative way to represent documents as vectors calculated from the word-topic probabilities in conjunction with word-document counts, demonstrating the method with 4, 8, 12, 16, and 20 topics. They showed in an empirical study that this “probability sum” representation results in more efficient document classification.
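In the absence of a formal rule, one simple exploratory device (not necessarily the procedure used by any of the cited authors) is to fit LDA for several candidate numbers of topics and compare a fit measure such as perplexity, ideally computed on held-out documents. The sketch below uses a placeholder corpus and placeholder candidate values.

# Exploratory choice of the number of topics: fit LDA for several candidate
# values and compare perplexity (lower is better; ideally on held-out text).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder corpus; in practice these would be the registered study descriptions.
docs = [
    "randomized trial of drug toxicity in adult patients",
    "enrollment of pediatric subjects in a vaccine study",
    "drug efficacy and adverse events in oncology patients",
    "device safety study with slow subject enrollment",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)

for n_topics in (2, 4, 8):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)
    print(n_topics, round(lda.perplexity(counts), 1))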
While LDA is useful for description alone, it can also be used in conjunction with supervised machine learning techniques and statistical algorithms to make predictions. The topic-document probabilities (or, as suggested by [11], probability sums) can be used as supplemental structured data and supplied as input to prediction algorithms. For example, [12] used LDA along with AdaBoost to predict whether or not a body of text was derived from a phishing attempt, and demonstrated a high level of accuracy in classifying documents that truly were phishing attempts. Xiao et al. recognized the advantage of the interpretability of LDA results as well as its ability to improve the prediction performance of standard methods. In their work, they predict adverse drug reactions from LDA output, using the “drug document” as the textual input; here, topics had the useful interpretation of biochemical mechanisms that link the structure of a drug to its adverse reactions.
Unstructured text is an integral part of the funding and acceptance of clinical trials. When a study is initially proposed, the researchers must specify expected or planned features such as enrollment numbers, enrollment requirements, assignment of treatments, and timeline. Further, the researchers submit a description of the study and its ultimate research objectives. This description can contain a wealth of useful information that is used not only for funding decisions but also to investigate the study’s life cycle. In particular, this description may well hold the key to the intricate underlying causes of study failure or success. We propose using LDA to extract topics from the descriptions of the research studies registered in ClinicalTrials.gov, and then using these topics to train a random forest to predict whether or not a trial will ultimately terminate.
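A minimal sketch of this proposed pipeline is given below; the toy descriptions, termination labels, structured field, and parameter settings are placeholders rather than the actual ClinicalTrials.gov extract or the final model configuration.

# Hedged sketch of the proposed pipeline: LDA topic probabilities extracted
# from trial descriptions, optionally combined with structured fields, and
# used to train a random forest termination classifier. All data are toy.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

descriptions = [
    "randomized trial of drug toxicity in adult patients",
    "enrollment of pediatric subjects in a vaccine study",
    "drug efficacy and adverse events in oncology patients",
    "device safety study with slow subject enrollment",
]
terminated = np.array([0, 1, 0, 1])       # 1 = trial terminated early (toy labels)

# Step 1: LDA topic probabilities from the free-text descriptions.
counts = CountVectorizer(stop_words="english").fit_transform(descriptions)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)    # one row of topic probabilities per trial

# Step 2: append structured fields (placeholder: planned enrollment).
enrollment = np.array([[120.0], [40.0], [300.0], [25.0]])
features = np.hstack([doc_topics, enrollment])

# Step 3: train a random forest to predict termination.
X_train, X_test, y_train, y_test = train_test_split(
    features, terminated, test_size=0.5, random_state=0, stratify=terminated)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(clf.predict(X_test))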
Our specific goal in this study is to continue to investigate the extent to which study terminations can be predicted from the characteristics assigned to studies prior to their funding or approval, building upon the work of [14]. To achieve this goal, we pursue two specific objectives. First, we explore the use of LDA to extract topics from the descriptions of trials prior to their funding. Second, we use the LDA-derived topic probabilities assigned to each clinical trial to improve the detection of trial termination over the use of standard structured data alone.