Finding Success in Clinical Trial Recruitment: A Trial Registry Analysis

Background: Randomized clinical trials are the gold standard for generating high-quality medical evidence, but patient recruitment remains one of the most important barriers to their success despite the significant administrative effort and money spent to address it. While previous studies have highlighted key trial design characteristics, such as trial phase, trial sponsor, and high target accrual, as important factors in why some trials fail to recruit enough patients, these studies have been limited in the number of trials analyzed and in the scope of trial characteristics considered. This work aims to thoroughly assess the association of different trial characteristics with patient enrollment, in terms of recruitment rate and early termination rate, on a larger scale than has been accomplished previously. Methods: This trial registration analysis collected recruitment information on clinical trials registered in ClinicalTrials.gov as well as trial characteristics from multiple additional databases (Clinical Trials Transformation Initiative, COHD.io, automatically parsed eligibility criteria). Descriptive statistics were calculated. The primary outcomes were the associations of individual trial characteristics with patient recruitment rate and with the likelihood of early termination due to failed patient recruitment; variable selection was performed using Group LASSO. Results: The trial characteristics with the strongest significant associations with patient recruitment included design variables (e.g. intervention model, allocation status, number of locations, phase), sponsor experience (e.g. sponsor class, number of previous trials terminated due to recruitment issues, ratio of terminated trials to completed trials), eligibility criteria (e.g. number of inclusion criteria, number of exclusion criteria), and trial competition (e.g. overlapping eligibility criteria, similar trials within 100 miles). Different disease categories also showed different recruitment efficacy. Conclusions: When designing clinical trials, special attention should be paid to design variables, sponsor trial experience, eligibility criteria, and trial competition to balance the likelihood of successful recruitment against evidence strength. Further research is needed to identify causal variables, to improve the prediction of patient recruitment rates, and to increase the breadth of this analysis.

sponsor historical trial information, and eligibility criteria to better characterize recruitment success.
We hypothesize that the relative population representativeness of eligibility criteria, as well as past trial experience of investigators and sponsors, will be strongly associated with the recruitment rate or early termination rate of clinical trials. Further, we hypothesize the size of eligible population, estimated using Electronic Health Record (EHR) data, will be associated with trial recruitment. By investigating the above hypotheses, we aim to provide additional information for future study developers to consider when designing new clinical trials.

Methods

Trial Selection and Curation
As of February 2019, 15,602 clinical trials registered in ClinicalTrials.gov (CT.gov) provided recruitment details in the 'participant flow' table of the Aggregate Analysis of ClinicalTrials.gov (AACT) database (15); these trials were extracted. To improve sample homogeneity, trials were further restricted to those with 'interventional' in the study_type field, 'treatment' in the primary_purpose field, 'no' in the healthy_volunteers field, and 'actual' in the enrollment_type field, the last of which suggests that the trial's recruitment phase had ended. Total patient enrollment was extracted from the enrollment field, and the dates of first and last enrollment were extracted from free text using SUTime from the Stanford NLP Group (16). A manual review of the extracted data was also performed to ensure accuracy.
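The field-based filtering step described above can be sketched as follows. This is a minimal illustration, assuming each trial is available as a dictionary keyed by the AACT field names mentioned in the text; the dictionary layout, the identifiers NCT-A and NCT-B, and the helper names are hypothetical and are not the authors' pipeline.

```python
from typing import Dict, List

# Field/value pairs named in the text. Comparing case-insensitively is an
# assumption, since the exact capitalization stored in AACT may differ.
FILTERS = {
    "study_type": "interventional",
    "primary_purpose": "treatment",
    "healthy_volunteers": "no",
    "enrollment_type": "actual",
}


def keep_trial(trial: Dict[str, str]) -> bool:
    """Return True only when every filter field matches its required value."""
    return all(
        (trial.get(field) or "").strip().lower() == wanted
        for field, wanted in FILTERS.items()
    )


def filter_trials(trials: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the trials that satisfy all four homogeneity filters."""
    return [t for t in trials if keep_trial(t)]


# Made-up records for illustration only.
example = [
    {"nct_id": "NCT-A", "study_type": "Interventional",
     "primary_purpose": "Treatment", "healthy_volunteers": "No",
     "enrollment_type": "Actual"},
    {"nct_id": "NCT-B", "study_type": "Observational",
     "primary_purpose": "Treatment", "healthy_volunteers": "No",
     "enrollment_type": "Actual"},
]
kept_ids = [t["nct_id"] for t in filter_trials(example)]
print(kept_ids)
```

A trial with a missing field is dropped by this sketch; whether missing values were handled that way in the original analysis is not stated in the text.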

Data Extraction and Annotation
To enable robust aggregate analysis, this dataset was further annotated. Due to inconsistencies in trial records, information about the 8 trial design parameters outlined in Table 1 was sought in both structured data fields (e.g. phase, start date) and free-text data fields (e.g. official title, brief summary). The keyword-based searches used in the free-text data fields (11) were developed following a manual analysis of a subset of 300 randomly selected clinical trials. Center status was categorized as 'single-center' or 'multi-center' according to the number of locations provided in the facilities table.
Control status was filled with 'controlled' if:
1. the trial has an arm of type Placebo comparator, No Intervention, Sham comparator, or Active comparator, AND
2. the trial has an intervention whose name includes the phrase placebo, vehicle, or sugar pill, AND
3. the trial's brief title, official title, or brief summary contains any of the following phrases: controlled, active-controlled, active comparator, comparative study, non-inferiority, standard therapy, standard of care, or standard treatment.
Otherwise, the control status was filled with 'non-controlled.'
Agency Class was mapped to three sponsor classes:
'NIH/US Fed', if NIH or US Fed was listed either as the lead sponsor of a trial or as a collaborator;
'Industry', if NIH was not involved, but either the lead sponsor or a collaborator was from industry;
'Other Investigators', for all remaining trials.
The following variables were classified based on the values in the corresponding fields in AACT: intervention model ('parallel assignment' vs. 'factorial assignment' vs. 'single group assignment' vs. 'crossover assignment'), allocation status ('randomized' vs. 'non-randomized'), masking status ('blind' vs. 'open label'), and data monitoring committee (DMC) status ('has dmc' vs. 'no dmc'). A criteria entity is defined as a concept of the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) Vocabulary (17) present in the eligibility criteria of the clinical trial.
Wherever possible, missing information was inferred from available fields. For example, single-arm interventional trials with missing allocation and masking status were assigned 'non-randomized' and 'open label'; for trials with missing location data, the center status was inferred from the information in the countries table, the overall_officials table, the official_title field, and the recruitment_details field.
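The control-status and sponsor-class rules in this section can be expressed as two small functions. The sketch below is illustrative only: it assumes plain case-insensitive substring matching and assumes particular spellings of the agency class values ('NIH', 'U.S. Fed', 'Industry'), neither of which is confirmed by the text, and the function names are hypothetical.

```python
from typing import Iterable

CONTROL_ARM_TYPES = {
    "placebo comparator", "no intervention", "sham comparator", "active comparator",
}
CONTROL_INTERVENTION_PHRASES = ("placebo", "vehicle", "sugar pill")
CONTROL_TEXT_PHRASES = (
    "controlled", "active-controlled", "active comparator", "comparative study",
    "non-inferiority", "standard therapy", "standard of care", "standard treatment",
)


def control_status(arm_types: Iterable[str],
                   intervention_names: Iterable[str],
                   texts: Iterable[str]) -> str:
    """Apply the three conditions, joined by AND as written in the text."""
    has_arm = any(a.strip().lower() in CONTROL_ARM_TYPES for a in arm_types)
    has_intervention = any(
        phrase in name.lower()
        for name in intervention_names
        for phrase in CONTROL_INTERVENTION_PHRASES
    )
    # Caveat of substring matching: "uncontrolled" also contains "controlled".
    has_text = any(
        phrase in text.lower() for text in texts for phrase in CONTROL_TEXT_PHRASES
    )
    return "controlled" if (has_arm and has_intervention and has_text) else "non-controlled"


def sponsor_class(agency_classes: Iterable[str]) -> str:
    """Map the agency classes of a trial's lead sponsor and collaborators."""
    classes = {c.strip().lower() for c in agency_classes}
    # Assumed spellings of the federal agency classes.
    if classes & {"nih", "u.s. fed", "us fed"}:
        return "NIH/US Fed"
    if "industry" in classes:
        return "Industry"
    return "Other Investigators"


print(control_status(["Placebo Comparator"], ["Placebo tablet"],
                     ["A placebo-controlled study of drug X"]))
print(sponsor_class(["Industry", "NIH"]))
```

Checking the federal classes first means a trial with both NIH and industry involvement is labeled 'NIH/US Fed', which matches the precedence implied by the mapping above.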

Recruitment Success
The dataset was further enriched with derived variables. We derived recruitment time from the dates of first and last enrollment. Each trial's average recruitment rate was calculated as the total number of patients recruited divided by the recruitment time, in units of participants per week.
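As a worked illustration of this definition, the sketch below computes participants per week from an enrollment total and two dates. The function name and the example values are hypothetical, and converting days to weeks by dividing by 7 is an assumption about how recruitment time was measured.

```python
from datetime import date


def recruitment_rate(enrolled: int, first_enrollment: date, last_enrollment: date) -> float:
    """Average recruitment rate in participants per week."""
    days = (last_enrollment - first_enrollment).days
    if days <= 0:
        raise ValueError("last enrollment date must be after first enrollment date")
    weeks = days / 7  # assumed conversion from elapsed days to weeks
    return enrolled / weeks


# Made-up trial: 120 participants enrolled over 336 days (48 weeks).
rate = recruitment_rate(120, date(2015, 1, 5), date(2015, 12, 7))
print(rate)  # 120 / 48 = 2.5 participants per week
```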
Additionally, each clinical trial in this dataset was further labeled as 'active', 'completed', 'terminated due to recruitment issue', or 'terminated due to other issues' based on the information in overall_status and why_stopped fields (specific criteria outlined below in sponsor trial experience).
Trials that were active but no longer recruiting at the time of analysis were considered 'completed' because they had completed their patient recruitment. For this analysis, trials stopped early for reasons unrelated to recruitment were excluded, leaving 3,077 trials for analysis, hereafter referred to as the target trials. A flowchart of these trial selection steps is shown in Fig. 1.

Sponsor Trial Experience
All study sponsors with at least one occurrence of 'lead' in the lead_or_collaborator field were identified. For each sponsor, the count of previous trials and the counts of trials by overall status (completed, terminated, actively recruiting, etc.) were collected. Further, trials were considered 'stopped' if they were listed as withdrawn, suspended, or terminated in the overall_status field, and were considered 'terminated due to recruitment issue' if any of the following terms were found in the why_stopped field: enroll, recruit, accrue, eligibility, eligible, lack of patient, not enough patient, no patient. The following additional characteristics were calculated using these counts:
Because this package only uses zip codes within the US, the scope of nearby trials was limited to sites within the United States.
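The status and keyword rules given in this section for labeling stopped trials can be sketched as below. This is an illustration under stated assumptions: matching is by case-insensitive substring, the label 'not stopped' is a placeholder introduced here for trials in any other status, and the function name is hypothetical.

```python
from typing import Optional

STOPPED_STATUSES = {"withdrawn", "suspended", "terminated"}
RECRUITMENT_TERMS = (
    "enroll", "recruit", "accrue", "eligibility", "eligible",
    "lack of patient", "not enough patient", "no patient",
)


def label_trial(overall_status: str, why_stopped: Optional[str]) -> str:
    """Label a trial from its overall_status and why_stopped fields."""
    if overall_status.strip().lower() not in STOPPED_STATUSES:
        return "not stopped"  # placeholder label, not taken from the source
    reason = (why_stopped or "").lower()
    # Substring matching: "enroll" covers "enrollment" and "enrolled", but
    # "accrue" does not match the spelling "accrual".
    if any(term in reason for term in RECRUITMENT_TERMS):
        return "terminated due to recruitment issue"
    return "terminated due to other issues"


print(label_trial("Terminated", "Slow enrollment"))
print(label_trial("Withdrawn", "Funding withdrawn by sponsor"))
```

Per-sponsor history counts could then be obtained by tallying these labels over each sponsor's earlier trials.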

Eligibility Criteria Statistics
Previous work has leveraged the sizable amount of eligibility criteria information stored within CT.gov to parse these criteria and map them to medical concepts within the OMOP CDM (17). Using Criteria2Query's pre-parsed eligibility criteria, total inclusion and exclusion criteria counts were extracted. Additionally, because these criteria were mapped to the OMOP CDM, counts of inclusion and exclusion criteria were also extracted according to OMOP concept type (e.g. measurement, procedure, condition).

Individual Eligibility Criteria Analysis and Retrieval
Individual medical concepts used in eligibility criteria were extracted for each trial and were simplified using concept clustering as described previously (19). For each medical concept, the total count as an eligibility criterion across all trials in ClinicalTrials.gov was calculated and was titled concept count.
Additionally, for each medical concept and each target trial, the count of times a concept is used as an eligibility criterion in related trials according to MeSH category (e.g. in other autoimmune diseases) was calculated and titled overlap count.
Thirdly, for each medical concept, a competition score was calculated using a formula modified from Liu et al. (19), where p is the total count and p'_i is the concept count for overlapping disease category i. A higher competition score indicates that a trial might face more competition from other trials during recruitment.
These measures were then averaged for each target trial across all medical concepts in its eligibility criteria.
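The concept count, the overlap count, and the per-trial averaging can be illustrated with the sketch below. It rests on assumptions not stated in the text: each trial carries a single disease category label, 'related trials' means other trials with the same label, and each concept is counted once per trial. The competition score is omitted because its formula is not reproduced here. All names and data are hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import Dict


def concept_counts(trials: Dict[str, dict]) -> Counter:
    """Number of trials that use each concept as an eligibility criterion."""
    counts: Counter = Counter()
    for trial in trials.values():
        counts.update(set(trial["concepts"]))
    return counts


def overlap_count(trials: Dict[str, dict], trial_id: str, concept: str) -> int:
    """Number of other trials in the same category that use the concept."""
    target = trials[trial_id]
    return sum(
        1
        for other_id, other in trials.items()
        if other_id != trial_id
        and other["category"] == target["category"]
        and concept in other["concepts"]
    )


def trial_averages(trials: Dict[str, dict], trial_id: str) -> Dict[str, float]:
    """Average both counts over the concepts in one trial's criteria."""
    counts = concept_counts(trials)
    concepts = set(trials[trial_id]["concepts"])  # must be non-empty
    return {
        "mean_concept_count": mean(counts[c] for c in concepts),
        "mean_overlap_count": mean(overlap_count(trials, trial_id, c) for c in concepts),
    }


# Made-up trials for illustration only.
trials = {
    "T1": {"category": "autoimmune", "concepts": {"methotrexate", "pregnancy"}},
    "T2": {"category": "autoimmune", "concepts": {"methotrexate"}},
    "T3": {"category": "cardiovascular", "concepts": {"pregnancy", "hypertension"}},
}
print(trial_averages(trials, "T1"))
```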

Concept Prevalence
The prevalence of each medical concept in a real patient population was extracted from Columbia
To further explore the dependency and impact of the design parameters and sponsor type, 1,000 bootstrap iterations of Group LASSO analysis (22) were conducted for both recruitment rate and early trial termination. In each bootstrap dataset, hyper-parameters were optimized using five-fold cross-validation, and a Group LASSO model with the optimized hyper-parameters was then fitted to the full bootstrap dataset for variable selection. Variables were selected for inclusion when their coefficients were nonzero in at least 95% of all bootstrap datasets. Subsequently, multivariate linear and logistic regressions were fitted using the selected variables for recruitment rate and early termination, respectively. In addition, linear regression was performed on the NYP/CUIMC subset to test the association between EHR eligible rate and recruitment rate. All statistical analyses were performed using R (3.3.3) (23), and Group LASSO was performed using the gglasso package (24). Significance was defined as a p-value < 0.05.
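The bootstrap selection rule (keep a variable when its coefficient is nonzero in at least 95% of resampled datasets) can be sketched independently of the model being fitted. The code below is not the authors' R/gglasso analysis: the `fit` argument is a hypothetical stand-in for a cross-validated Group LASSO fit, and `toy_fit` is a simple thresholding rule used only to make the example runnable.

```python
import random
from typing import Callable, Dict, List


def bootstrap_select(rows: List[dict],
                     fit: Callable[[List[dict]], Dict[str, float]],
                     n_boot: int = 1000,
                     threshold: float = 0.95,
                     seed: int = 0) -> List[str]:
    """Return variables whose coefficient is nonzero in >= threshold of resamples."""
    rng = random.Random(seed)
    nonzero_hits: Dict[str, int] = {}
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]  # resample with replacement
        for name, coef in fit(sample).items():
            nonzero_hits[name] = nonzero_hits.get(name, 0) + (1 if coef != 0 else 0)
    return sorted(n for n, hits in nonzero_hits.items() if hits / n_boot >= threshold)


def toy_fit(sample: List[dict]) -> Dict[str, float]:
    """Stand-in for a penalized fit: zero out weak feature/outcome associations."""
    coefs = {}
    for name in ("x1", "x2"):
        score = sum(r[name] * r["y"] for r in sample) / len(sample)
        coefs[name] = score if abs(score) > 0.5 else 0.0  # hard threshold
    return coefs


# Made-up data: y depends on x1 only; x2 is balanced against y.
rows = []
for i in range(40):
    x1 = 1 if i % 2 == 0 else -1
    x2 = 1 if (i // 2) % 2 == 0 else -1
    rows.append({"x1": x1, "x2": x2, "y": 2 * x1})

selected = bootstrap_select(rows, toy_fit, n_boot=200)
print(selected)
```

Separating the resampling loop from the fitting function keeps the 95% rule visible and would allow any penalized model to be plugged in.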

Descriptive Statistics
Trials included in this analysis were active from 9/30/1991 to 6/30/2017. Among the 3,077 trials, 2,789 trials (90.6%) were 'completed' and 288 trials (9.4%) were 'stopped due to recruitment failure.' The average recruitment rate was 3.71 participants per week. Descriptive statistics are available in

Design Factors
Many design factors show a significant association with both recruitment rate and termination rate (Table 6).

Eligibility Criteria
In general, fewer eligibility criteria in trial protocols are associated with a higher recruitment rate and a lower termination rate. This trend is statistically significant for inclusion criteria. Trials with fewer than 3 inclusion criteria in the protocol have an average recruitment rate of 5.52 patients/week, about 3 times that of trials with more than 8 inclusion criteria (Table 5 Number of Criteria).

Sponsors
Trials sponsored by industry have a significantly higher recruitment rate and lower termination rate than those sponsored by the National Institutes of Health. Both a sponsor's specialization and its trial termination history were strongly associated with slower patient recruitment and an increased likelihood of early termination (Table 5 Sponsors).

Competing Trials
Trials facing more competing trials that target the same group of eligible patients, represented by higher overlap counts (Table 5 Competing Trials), were shown to have a lower recruitment rate and a higher termination rate. In addition, the number of nearby trials was significantly associated with recruitment rate and early trial termination. Patient prevalence of the concepts used in eligibility criteria was also significantly associated with recruitment rate and trial termination, though this differed between inclusion and exclusion criteria.

Target Disease
Different target conditions and interventions can have different recruitment rates and termination rates. Nutritional and metabolic diseases have the highest recruitment rate, followed by cardiovascular diseases and pathological conditions (Fig. 3).

EHR Eligible Rate
We further investigated whether the number of eligible patients in EHR is associated with the recruitment process using 255 trials conducted with NYP/CUIMC. Linear regression identified a significant association between the patient recruitment rate and number of eligible patients in the EHR with a slope coefficient of 9.82 × 10⁻⁵ and standard error of 2.45 × 10⁻⁵ (p-value < 0.001; Fig. 4).
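For readers less familiar with how a slope coefficient and its standard error arise, the sketch below fits a one-predictor ordinary least squares line in plain Python. The data are made up and unrelated to the NYP/CUIMC trials, and the function name is hypothetical; the original analysis was performed in R.

```python
from math import sqrt
from typing import Sequence, Tuple


def simple_ols(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Return (slope, intercept, standard error of the slope) for y ~ x."""
    n = len(x)
    if n < 3 or n != len(y):
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, with n - 2 degrees of freedom.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    slope_se = sqrt(rss / (n - 2) / sxx)
    return slope, intercept, slope_se


# Made-up values: x could stand for eligible patients, y for recruitment rate.
slope, intercept, slope_se = simple_ols([0, 1, 2, 3], [1, 3, 4, 7])
print(slope, intercept, slope_se)
```

The ratio of the slope to its standard error is the t statistic from which the reported p-value would be derived.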

Discussion
In an effort to further explore how observational data and historical clinical trial experience relate to patient recruitment in clinical trials, this study identified trial characteristics, including design factors, sponsors, competing trials, and eligibility criteria, that are significantly associated with both recruitment rate and early trial termination. Further, using a subset of trials conducted at NYP/CUIMC and its EHR data, we explored the correlation between the size of the eligible patient population and the trial recruitment process. This study provides new depth and clarity on the association between clinical trial design and participant enrollment. Significant associations were also identified for trial sponsor experience, more specifically the sponsor's historical completed and stopped trials and its disease specialization. Patient recruitment requires significant financial and administrative investment (30), and it could thus be assumed that more experienced sponsors with a broad clinical focus are more capable of meeting these demands.
However, some sponsors specialize in rare diseases in an effort to combat the historic difficulty of finding eligible patients (25, 31). In a separate Group LASSO analysis (results not shown here), disease specialization was excluded but historical experience remained, suggesting that some difficulties in recruiting patients may relate to the previous success of the sponsor itself (30). To the best of our knowledge, this is the first study to show that a sponsor's historical experience is strongly correlated with its future performance, regardless of sponsor class; this finding should inspire future research aimed at establishing a causal relationship.
Though the idea is not new, this study provides quantitative evidence that eligibility criteria are also an important factor in patient recruitment, with increased numbers of inclusion criteria limiting patient recruitment and increasing the likelihood of early termination (2, 32, 33). However, the same association was not present for the number of exclusion criteria. This conclusion was further supported by the differing associations of concept prevalence within inclusion and exclusion criteria with early trial termination. The roles of inclusion and exclusion criteria are reported to differ: inclusion criteria cover demographic, clinical, and geographical characteristics, whereas exclusion criteria define features that could interfere with the success of the study or increase the risk of an unfavorable outcome (34). Because exclusion criteria limit the study population to those who have the greatest chance of experiencing clinical benefit and avoiding unnecessary risk, these criteria are also understood to hamper the trial's generalizability (35-37).
Further, these criteria are often not justified by their authors, leading to a lack of uniform application across trials targeting the same disease or drug (37, 38). As such, their inconsistent use among related trials is one possible explanation for their relatively low impact on patient recruitment in this analysis, though further experimentation is required to confirm and expand this hypothesis.
Beyond the number of criteria alone, trial competition also showed a strong association with patient recruitment. This is a relatively well-recognized phenomenon: in 2018 the Clinical Trials Transformation Initiative (CTTI) proposed a series of actionable recommendations for improving patient recruitment, which included site selection based on access to the target population (8). In this work, however, we were able to quantify this negative association with patient recruitment and to provide new means of quantitatively assessing competition, helping to inform future protocol planning and site selection methods (14, 39).
EHRs have been adopted in almost every hospital, clinic, and other healthcare institution. As evidenced here, EHR data have the potential to optimize trial recruitment prediction at a single clinical site, though further research is required to establish their clinical utility (42).
As mentioned in the introduction, the primary aim of this study was to assess the relationship between certain clinical trial characteristics and patient recruitment.

Limitations of This Study
This study has several limitations. First, because of its retrospective design, our analysis was unable to establish causality between the collected variables and successful patient recruitment.
Another limitation is that automated eligibility criteria parsing can be incomplete and occasionally inaccurate (17). In this work, many pitfalls were avoided by limiting the analysis to eligibility metadata (e.g. total counts of inclusion and exclusion criteria, averaged patient prevalence) and not including individual criteria or concepts. However, this approach also limits the scope of trial characteristics assessed. Finally, though this study considered a wider set of trial characteristics than previous work, the set is not complete. Much of the information regarding clinical trials remains in free text within CT.gov, which makes extraction across a large number of clinical trials difficult and error-prone; this complexity has been well described (9). Future work in this field should include more longitudinal data collection, improved automated natural language processing, and a broader range of trial information to address these limitations and further improve our understanding of patient recruitment.

Conclusions
Although patient recruitment is a well-recognized barrier to clinical trial success, little was understood about the role of specific trial characteristics in this issue. In this analysis, multiple categories of clinical trial characteristics were found to be strongly associated with patient recruitment and trial success, including design parameters, eligibility criteria restrictiveness/competition, and sponsor trial experience, and quantitative evidence was provided for previous hypotheses. As increased success of clinical trials can mean greater and faster availability of novel therapies for patients, adopting a more informed approach to trial design may provide a new way for investigators to give studies the best chance of success.

Ethics Approval and Consent to Participate
Not applicable as this study did not involve the use of any animal or human data or tissue.

Consent for Publication
Not applicable as this study does not contain any data from any individual person.

Availability of Data and Materials
In addition to data sources listed and cited in the Methods section above, the datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

Competing Interests
The authors declare that they have no competing interests.

Funding
This research was supported by National Library of Medicine Research grants R01LM009886, UL1TR001873, and Janssen Pharmaceuticals (grant reference CU15-2317). The study's design, data collection, analysis, and interpretation, as well as the writing of the report and the decision to submit the article for publication, were independent of these funding sources.

Authors' Contributions
These

Supplementary Files
This is a list of supplementary files associated with this preprint: butlerstrobechecklist.docx