Data source. The database for Aggregate Analysis of ClinicalTrials.gov (AACT) is a publicly available relational database maintained by the Clinical Trials Transformation Initiative (CTTI) that contains both protocol and result elements for all studies recorded in ClinicalTrials.gov 5. On 6 July 2022, the AACT database contained 420,268 clinical studies registered between 1999 and July 2022; this version was extracted from the database in comma-separated values (CSV) format for this research.
Filtering the data. Of the three study types included in the raw dataset ("Interventional", "Observational" and "Patient Registry"), only "Interventional" studies are retained. The dataset records 14 different overall status values, including recruiting, completed, withdrawn and unknown status. Only studies that are completed, terminated or withdrawn are extracted from the overall dataset, as these are the focus for generating our binary classification models.
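For illustration, a minimal pandas sketch of this filtering step could look as follows; the file name and column names ('study_type', 'overall_status') are assumptions about the AACT CSV export rather than details given in the paper.

```python
import pandas as pd

# Load the exported AACT "studies" table (file/column names are assumed).
studies = pd.read_csv("studies.csv", low_memory=False)

# Keep only interventional studies with a terminal overall status.
interventional = studies[studies["study_type"] == "Interventional"]
kept_statuses = {"Completed", "Terminated", "Withdrawn"}
filtered = interventional[interventional["overall_status"].isin(kept_statuses)]
```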
The objective of clinical study registries is to provide complete, accurate and timely trial data. Although the emphasis on registering clinical studies and providing quality data has increased over time (since 1999), a high number of studies still contain a significant number of missing data points or substantial errors 6. Figure 1 shows a bar plot of the average missing-value ratio for each registration year, from 1999, when the first study was registered, to 2022, considering the 24 study design features investigated in this research. The average missing-value ratio for a year is calculated as the sum of missing values divided by the total number of studies registered that year. There is a significant decrease in the average missing-value rate over the years, especially from 1999 to 2008, and after 2011 the rate is stable below 10%. Hence, studies registered before 2011 are removed from the dataset, leaving 112,647 studies.
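A sketch of this per-year missing-value calculation is shown below; the registration-date column name and the short list of design features are placeholders, not names taken from the paper.

```python
import pandas as pd

# 'filtered' is the interventional completed/terminated/withdrawn subset from the
# filtering step. 'design_features' stands in for the 24 study design features and
# 'study_first_submitted_date' for the registration date column (both assumed).
design_features = ["enrollment", "number_of_arms", "phase"]  # placeholder subset

filtered["reg_year"] = pd.to_datetime(filtered["study_first_submitted_date"]).dt.year

# Per year: sum of missing cells across the design features divided by the
# number of studies registered that year.
missing_ratio = filtered.groupby("reg_year")[design_features].apply(
    lambda g: g.isna().sum().sum() / len(g)
)

# Keep only studies registered from 2011 onwards.
recent = filtered[filtered["reg_year"] >= 2011].copy()
```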
Study characteristics features. Table 1 summarises the number of studies and their recorded phases in the dataset. Some studies are recorded as belonging to two phases. For phase-specific subset generation, if the recorded study phase includes the phase of interest, the study is included in that subset. For example, a study recorded as Phase 2/Phase 3 is included in both the Phase 2 and Phase 3 subsets.
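A sketch of this phase-subset logic, assuming a 'phase' column holding strings such as "Phase 2/Phase 3":

```python
# A study joins a phase subset whenever its recorded phase string contains the
# phase of interest, so "Phase 2/Phase 3" falls into both subsets. In practice,
# "Early Phase 1" studies would need to be excluded when building the Phase 1 subset.
def phase_subset(df, phase_label):
    mask = df["phase"].fillna("").str.contains(phase_label, regex=False)
    return df[mask]

phase2 = phase_subset(recent, "Phase 2")
phase3 = phase_subset(recent, "Phase 3")
```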
Out of 190,678 studies, 19,252 did not record the number of sites. Although the recent growth in decentralised trials has shown that sites are not always needed, most historical studies do not belong to this category of trials 7. The number of clinical sites has an impact on trial enrolment and patient demographics, as a clinical study is limited to participants who live near the defined sites and can attend study visits. Therefore, we used the number of sites in the final feature set.
One of the essential parts of clinical trial design is to define the primary and secondary outcomes 8. The primary outcome measures directly form part of the study hypothesis. The numbers of primary and secondary outcome measures are included in our final feature set as two separate features.
Table 1 Study phases and number of studies recorded under each phase in ClinicalTrials.gov

Phase | Number of studies
Early Phase 1 | 1,549
Phase 1 | 23,914
Phase 1/Phase 2 | 5,350
Phase 2 | 25,599
Phase 2/Phase 3 | 2,910
Phase 3 | 19,699
Phase 4 | 16,463
Figure 2 illustrates the proportions of completed and failed studies by disease category. A clinical study is categorised by searching through the conditions recorded for that study, using the same condition categorisation as the ClinicalTrials.gov database. As illustrated, studies under the neoplasms and blood and lymph conditions categories are the most likely to fail, whereas studies under occupational diseases and disorders of environmental origin are the least likely to fail. The full list of study characteristics features is provided in the supplementary materials.
Eligibility criteria statistical and search features. The eligibility criteria field is used to create a set of features, including basic descriptive features as well as more complex search features. Eligibility criteria control who can participate in a clinical study. The descriptive features include acceptance of healthy volunteers and acceptance of patients by gender and age, followed by the numbers of inclusion and exclusion criteria and the total and average number of words in the eligibility criteria per study. The 54,758 studies that accepted healthy volunteers had a 7% failure rate, whereas the 134,842 studies that did not accept healthy volunteers had a 17% failure rate. The importance of inclusive eligibility criteria has been increasingly emphasised over the years, as exclusion of particular subgroups makes it harder for studies to recruit patients and deliver inclusive outcomes 9.
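As an illustration, the descriptive eligibility features could be derived from the free-text field roughly as follows; the assumed layout ("Inclusion Criteria:" / "Exclusion Criteria:" headers with one criterion per bullet line) and the helper name are ours, not details from the paper.

```python
import re

def criteria_features(text):
    """Counts of inclusion/exclusion criteria plus word statistics for one study."""
    text = text if isinstance(text, str) else ""
    parts = re.split(r"exclusion criteria:", text, flags=re.IGNORECASE)
    inclusion = parts[0]
    exclusion = parts[1] if len(parts) > 1 else ""

    def count_bullets(block):
        # One criterion per bullet line ("-", "*" or "•"), an assumption about the layout.
        return len(re.findall(r"^\s*[-*\u2022]", block, flags=re.MULTILINE))

    n_inc, n_exc = count_bullets(inclusion), count_bullets(exclusion)
    n_words = len(text.split())
    return {
        "n_inclusion": n_inc,
        "n_exclusion": n_exc,
        "total_words": n_words,
        "avg_words_per_criterion": n_words / max(n_inc + n_exc, 1),
    }
```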
The next set of features generated from the free-text eligibility criteria field are defined as search features. The CHIA dataset is a public benchmark dataset generated by Kury et al. 10. It is a large, annotated corpus of patient eligibility criteria extracted from 1,000 Phase IV studies registered in ClinicalTrials.gov and contains 12,864 inclusion and exclusion criteria. We use CHIA to create search terms to extract features from the free-text inclusion and exclusion fields.
We use the following category types in CHIA to generate our search terms: "Condition", "Procedure", "Person", "Temporal", "Drug", "Observation", "Mood" and "Visit". Category and entity pairs are generated separately for inclusion and exclusion criteria. 12,864 pair features were generated, and these pairs are used to search the inclusion and exclusion free-text fields. For computational reasons, we restricted search terms to those with five or fewer words. This process generated sparse binary features, which we concatenated to our original features.
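A sketch of this search-feature construction is given below; the two example (category, entity) pairs and the column holding the inclusion text are hypothetical stand-ins for the CHIA-derived terms.

```python
from scipy.sparse import lil_matrix

pairs = [("Condition", "diabetes"), ("Drug", "metformin")]  # illustrative pairs only

def search_features(texts, pairs):
    """One binary column per (category, entity) pair: 1 if the entity occurs in the text."""
    mat = lil_matrix((len(texts), len(pairs)), dtype="int8")
    for i, text in enumerate(texts):
        text = text.lower() if isinstance(text, str) else ""
        for j, (_category, entity) in enumerate(pairs):
            if entity.lower() in text:
                mat[i, j] = 1
    return mat.tocsr()  # sparse binary features, concatenated to the other features

inclusion_features = search_features(recent["inclusion_text"].tolist(), pairs)
```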
Data labelling. An overall status is recorded for every clinical trial in the AACT database. If no participants were ever enrolled, the status of the trial is ‘Withdrawn’; if the trial was stopped prematurely, its status is ‘Terminated’. Out of 28,098 terminated or withdrawn studies that reported a reason for stopping, 9,260 stopped prematurely due to reasons related to participant recruitment and enrolment. Trials that are successfully completed have the status ‘Completed’. The classifying factor between studies for supervised model training is whether their overall status falls in the success class or the failure class: terminated and withdrawn studies are labelled as “failure” and completed studies are labelled as “success”.
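The label construction itself reduces to a simple status mapping; encoding failure as 1 and success as 0 is a convention of this sketch.

```python
# Terminated and withdrawn studies form the failure class, completed studies the success class.
label_map = {"Completed": 0, "Terminated": 1, "Withdrawn": 1}
recent["failure"] = recent["overall_status"].map(label_map)
```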
Numerical and categorical feature encoding. The final feature set is a mixture of numerical and categorical columns, which require different encoding methods for a machine learning model. Large public datasets come with a lot of missing and erroneous data, and, especially for numerical features, the handling of missing values can have a big impact on the performance of machine learning algorithms. The Multiple Imputation by Chained Equations (MICE) algorithm was selected to handle missing numerical data, as it is a robust and informative method 11. Missing cells are imputed through an iterative sequence of predictive models, where in each iteration one of the features is imputed using the other features of the dataset. The algorithm runs until it converges, at which point all missing numerical feature values have been imputed.
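As a minimal sketch, scikit-learn's IterativeImputer, which is inspired by MICE, can stand in for the imputation step described above; X_num denotes the numerical feature matrix and is assumed to be already assembled.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (activates IterativeImputer)
from sklearn.impute import IterativeImputer

# X_num: the numerical feature columns, with missing values as NaN.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_num_imputed = imputer.fit_transform(X_num)
```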
One-hot encoding is an effective method to encode categorical features. This method generates a new binary feature for each sub-category of a categorical feature and handles missing categorical values by encoding them as zeros.
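A pandas sketch of the categorical encoding; with dummy_na=False (the default), a missing category yields all-zero indicator columns, matching the handling described above. The column names are placeholders.

```python
import pandas as pd

cat_cols = ["phase", "gender", "healthy_volunteers"]  # illustrative categorical columns
X_cat = pd.get_dummies(recent[cat_cols], columns=cat_cols, dummy_na=False)
```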
Train/test datasets. Phase-specific datasets for Phase 1, Phase 2 and Phase 3 studies were generated for comparison with each other and with the overall dataset. It is also possible to generate disease-category-specific datasets, but this is out of scope for this study. To estimate the performance of our machine learning models, the train-test split method was used 12. For the final model, a 70:30 train-to-test split ratio was selected. The training set is used to train the models, whereas the test set is held aside for the final evaluation of the model. This is an effective and fast approach to test our trained models on data they have never seen before.
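A sketch of the split, assuming X and y are the assembled feature matrix and failure labels; stratifying on the label is our addition to keep the held-out class ratio representative.

```python
from sklearn.model_selection import train_test_split

# 70:30 train-to-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```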
Handling data imbalance. The ratio of positive to negative class labelled studies changes across the generated subsets, but data imbalance is a common issue for all of them. For the overall dataset, which contains studies from all phases, the ratio of positive to negative samples is 15:85. This imbalance biases the input data towards one class, which can result in a falsely inflated model accuracy. To prevent this, random undersampling is applied to the training set: the necessary number of data points is deleted from the majority class subset to reach a 1:1 ratio between the negative and positive classes. Random undersampling was applied only to the training samples, after the train-test split and during the cross-validation process. The test set remained unsampled to preserve a realistic test distribution.
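One way to implement this step is imbalanced-learn's RandomUnderSampler; this is a stand-in sketch rather than the exact implementation used.

```python
from imblearn.under_sampling import RandomUnderSampler

# 1:1 class ratio, applied to the training data only.
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_train_bal, y_train_bal = rus.fit_resample(X_train, y_train)
# X_test / y_test are left untouched so evaluation reflects the real class ratio.
```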
Top feature selection. The eligibility criteria features increased the feature set size significantly. To achieve the best performance without introducing unnecessary noise, feature selection was applied to the various datasets generated from the main dataset. An ablation study was performed to understand the effect of adding more features on model performance. Figure 3 shows an example plot of the number of features versus model error on the test set for the Phase 2 studies dataset. The purpose of these plots is to find an elbow point: the point where the slope of the error curve flattens, indicating that adding more features no longer has a significant effect on performance. Analysing the plot in Figure 3, we choose the elbow point at 4,000 features. The same ablation study was conducted for all datasets in this paper and the number of features selected accordingly.
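A sketch of the ablation loop: train on an increasing number of top-ranked features, record the test error, and read the elbow off the resulting curve. The 'ranked_features' index list and the choice of classifier are assumptions.

```python
from xgboost import XGBClassifier

feature_counts = [500, 1000, 2000, 4000, 8000]
errors = []
for k in feature_counts:
    cols = ranked_features[:k]  # indices of the k top-ranked features
    model = XGBClassifier(eval_metric="logloss", random_state=0)
    model.fit(X_train_bal[:, cols], y_train_bal)
    errors.append(1.0 - model.score(X_test[:, cols], y_test))
# Plotting feature_counts against errors reveals the elbow point
# (about 4,000 features for the Phase 2 dataset in Figure 3).
```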
Machine learning models. Logistic regression, random forest and extreme gradient boosting classifiers were selected for the experiments. The logistic regression classifier is a simpler algorithm compared to tree-based ensemble models such as random forest and extreme gradient boosting 13,14. Even though feature selection is applied, the final datasets are still large, sparse datasets, which ruled out many machine learning model architectures. Cross-validation is used for model evaluation to obtain unbiased metric scores. Using 5-fold cross-validation, the dataset is split into five chunks and the model is trained five times, each time keeping a different chunk as the test set to evaluate the model. This provides reliable metric scores for comparing the different models.
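A sketch comparing the three model families with 5-fold cross-validation; the scoring metric and hyperparameters here are placeholders, not the settings reported in the paper.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "xgboost": XGBClassifier(eval_metric="logloss", random_state=0),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X_train_bal, y_train_bal, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC {scores.mean():.3f}")
```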
Tree-based models require careful hyperparameter tuning, but it is computationally expensive to test every combination of parameters to achieve the best results. Therefore, a strategy was defined to find the best possible parameters for the models in hand: to prevent overfitting, the first method is to control the model complexity, and the second is to add randomness to make training robust to noise 15. Optimal parameters were chosen after several iterations following this strategy.
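As an illustration of that strategy for the gradient boosting model, the parameters split into a complexity-control group and a randomness group; the values below are placeholders, not the tuned values from the paper.

```python
from xgboost import XGBClassifier

params = {
    # complexity control (limits how expressive each tree can become)
    "max_depth": 6,
    "min_child_weight": 5,
    "gamma": 0.1,
    # randomness (makes training more robust to noise)
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    # step-size / ensemble-size trade-off
    "learning_rate": 0.05,
    "n_estimators": 500,
}
xgb_model = XGBClassifier(**params, random_state=0, eval_metric="logloss")
```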
Model interpretation using SHapley Additive exPlanations. SHAP (SHapley Additive exPlanations) is a framework based on Shapley values, a game theory approach 16. This method is used to produce visual outputs that explain model predictions 17. SHAP locally explains the feature contributions to individual predictions by connecting optimal credit allocation to local explanations using Shapley values. A base value and an output value are calculated for each plot: the base value is the average model output over the training data, and the output value is the base value plus the sum of the Shapley values of the features for that instance. This allows us to explain the influence of features on the overall prediction.
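A minimal SHAP sketch for a tree model, assuming a fitted XGBoost classifier and a feature-name list; the variable names are ours.

```python
import shap

# xgb_model: a fitted tree-based classifier; X_test: the held-out feature matrix.
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test)

# Force plot for a single prediction (first test instance): the base value is the
# average model output on the training data, and the output value is that base
# value plus this instance's summed SHAP values.
shap.force_plot(
    explainer.expected_value,
    shap_values[0],
    feature_names=feature_names,
)
```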