Common features of the SR process include developing a research question, developing inclusion criteria, conducting a literature search, pre-screening article titles and abstracts, abstracting articles for analysis, and aggregating results to generate summary findings.2 Our ML application efforts focused on creating efficiency in article title and abstract pre-screening. We refer to this step as “down selection”: reducing the initial pool of articles to those that potentially meet inclusion criteria. To develop a generalizable ML approach to down selection, we 1) created a theoretical ML “down selection” process based on the goals and constraints of including ML in a down selection process, 2) established ML configuration guides by testing settings with experiment SRs, and 3) applied the process and ML configuration guides to an SR conducted by a team of scientists at the Centers for Disease Control and Prevention (CDC).
Developing a theoretical ML “down selection” process
To operationalize the ML addition to the down selection process,2 we developed a process based on the constraints of ML and SRs. The proposed process has four steps: 1) train ML models and perform ML predictions; 2) conduct human review of articles selected by the ML model and determine the accuracy of prediction; 3) incorporate newly human-reviewed articles into iterative ML training; and 4) perform random sampling to ensure performance (Fig. 1).
Step 1. Train the ML Models and Perform ML Prediction. Supervised ML algorithms require training data to build models. These data “teach” the algorithms which articles meet inclusion criteria, and which do not, facilitating the creation of a model.11,12 Effective training data can come from a small set of articles that are pre-identified as meeting or not meeting inclusion criteria. In addition, a random sampling of articles can be drawn from a keyword literature search. Iterative training will occur later in the proposed process after more articles are reviewed.
In our proposed process, after training, an ML model makes predictions from the combined title and abstract text of each article. Each potential article fed to the model is given a score of how likely it is to fit inclusion criteria, based on the articles in the training data set. Prediction scores are probabilistic, ranging from 0 to 1.0, with 1.0 being a perfect match to inclusion examples.3
Step 2. Human Review of ML Selected Articles and Determination. In this step, human reviewers examine articles predicted by the ML model to fit inclusion criteria. During review, humans correct the ML prediction by confirming whether or not each article met inclusion criteria. This process creates a new human-reviewed training set for use in a new iteration of training. The iterative training process is proposed to help improve ML prediction by expanding training sets and overcoming potential bias introduced by the small training sets in step 1.
Step 3. Incorporate New Human Reviewed Articles into Iterative ML Training. For each iteration, a new model is trained on the set of human reviewed articles in step 2 and any previous training sets. The iterated model is then employed on the remaining unreviewed articles to determine a new predictive score. Iterations should continue until the number of articles predicted as relevant becomes small and human review does not confirm articles predicted to be relevant.
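The loop described in steps 1 to 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the `train` and `human_review` callables, the fixed threshold, and the simplified stopping rule (stop when nothing is flagged or nothing flagged is confirmed) are all assumptions made for the example.

```python
from typing import Callable, Dict, Set

Scorer = Callable[[str], float]


def iterative_down_selection(
    labeled: Dict[str, bool],
    unreviewed: Set[str],
    train: Callable[[Dict[str, bool]], Scorer],
    human_review: Callable[[str], bool],
    threshold: float = 0.5,
    max_iterations: int = 3,
) -> Dict[str, bool]:
    """Return all human-reviewed labels after iterative training (hypothetical helper)."""
    labeled = dict(labeled)        # copy so the caller's training set is not mutated
    remaining = set(unreviewed)
    for _ in range(max_iterations):
        score = train(labeled)     # steps 1 and 3: (re)train on every label collected so far
        flagged = [a for a in sorted(remaining) if score(a) >= threshold]
        if not flagged:            # nothing predicted as relevant: stop iterating
            break
        confirmed = 0
        for article in flagged:    # step 2: humans confirm or correct each prediction
            is_relevant = human_review(article)
            labeled[article] = is_relevant
            confirmed += int(is_relevant)
        remaining -= set(flagged)
        if confirmed == 0:         # predictions no longer confirmed by reviewers: stop
            break
    return labeled
```

Here `train` stands in for any routine that fits a model on the labeled articles and returns a scoring function, and `human_review` stands in for the manual screening decision. Articles that are never flagged stay unreviewed and would be the pool sampled in step 4.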
Step 4. Random Sampling to Ensure Accuracy. After exiting step 3, humans select a random sample of the articles not predicted as relevant by ML to test the sensitivity of the ML process. For our process, we suggested sizing the random sample for a 99% confidence level with a 10% margin of error. We recommend this step to increase confidence that all articles meeting inclusion criteria have been identified. Humans check the randomly selected articles for any that fit the review criteria. If more than one or two such articles are found, the ML approach has not reached a reasonable sensitivity and should continue with a new iteration (step 2).
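One common way to turn a confidence level and margin of error into a sample size is Cochran's formula for a proportion, with an optional finite population correction. The text does not state which formula was used, so the sketch below is an assumption, as are the function name and the worst-case proportion p = 0.5.

```python
import math
from statistics import NormalDist
from typing import Optional


def random_check_sample_size(
    confidence: float = 0.99,
    margin: float = 0.10,
    population: Optional[int] = None,
    p: float = 0.5,
) -> int:
    """Sample size for estimating a proportion (assumed formula, not from the source)."""
    # Two-sided critical value, e.g. about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite population correction for a known pool of unselected articles.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)
```

Under these assumptions, a 99% confidence level with a 10% margin of error gives 166 articles when the pool is treated as very large, and slightly fewer when the number of remaining articles is passed as `population`.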
Establish ML Configurations for the Down Selection Process
Supervised ML offers a very large number of possible configurations for predictive model development. Areas identified that could have multiple configurations include a) cleaning text; b) reducing dimensionality; c) feature engineering; d) developing a training sample; e) initial algorithm assessment; f) creating a soft voting stacked model; and g) choosing thresholds for the iterative modeling steps. To identify which ML configurations should be used for the proposed ML down selection process, we took a hybrid theoretical and results-driven approach, testing on four previously completed SRs, hereafter referred to as experiment SRs. Configurations for steps a, b, c, and d were selected based on theoretical knowledge, while configurations for steps e, f, and g were selected by testing the performance of different configurations on each experiment SR individually.
Step a. Cleaning Text. For each of the experiment SRs, we performed standard text cleaning on the combined titles and abstracts,13 removing numbers and common English words, and tokenizing the text into unigrams and bigrams. We used Python’s Natural Language Toolkit (NLTK) version 3.2.4 for this process.
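The cleaning step can be illustrated with a small standard-library sketch. It is not the NLTK pipeline used in the study: the short `STOP_WORDS` set is a stand-in for a full English stop-word list, the function name is hypothetical, and forming bigrams after stop-word removal is a choice made for this example.

```python
import re
from typing import List

# Tiny illustrative list; a real pipeline would use a full English stop-word list.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def clean_and_tokenize(title: str, abstract: str) -> List[str]:
    """Return unigram and bigram tokens from the combined title and abstract."""
    text = f"{title} {abstract}".lower()
    # Keep alphabetic runs only, which drops numbers and punctuation.
    words = re.findall(r"[a-z]+", text)
    words = [w for w in words if w not in STOP_WORDS]
    # Bigrams are built from adjacent words that survive stop-word removal.
    bigrams = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + bigrams
```

Note that the alphabetic-only pattern also splits terms that mix letters and digits, which may or may not be desirable for a given review topic.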
Step b. Reducing Dimensionality. Because our text cleaning process resulted in a data set with a large number of rows and columns (a high-dimensional matrix) that represent the numerical frequency of token occurrences, we performed dimensionality reduction.14 We also converted the cleaned data into a term frequency–inverse document frequency (TF-IDF) matrix.15 TF-IDF is a statistical weight meant to show how important a word is to a document relative to the entire collection of documents in an analysis.
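To make the TF-IDF weight concrete, the sketch below computes one textbook variant: term frequency within a document multiplied by the log of the inverse document frequency. The exact variant used in the study is not stated, and library implementations often add smoothing and normalization, so treat the formula and the function name as assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return a token -> weight mapping for each tokenized document."""
    n_docs = len(docs)
    # Number of documents in which each token appears at least once.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {
                token: (count / total) * math.log(n_docs / doc_freq[token])
                for token, count in counts.items()
            }
        )
    return weights
```

With this variant, a token that appears in every document receives a weight of zero, while a token concentrated in a few documents receives a higher weight in those documents.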
Step c. Feature Engineering. In ML applications, variables for modeling are often referred to as “features” of the data. Feature engineering involves manipulating variables to create new “features” of the data and is often used to boost predictive performance.16 We used latent Dirichlet allocation (LDA) on the reduced TF-IDF matrix to create new features based on topics found in the data using a generative probabilistic approach.17 Using topics instead of just word counts as features makes it possible to identify patterns across articles that do not rely on word token occurrences. We set the number of LDA topics to 30 new features under the theoretical assumption that 30 would reach topic saturation in the data. We also used truncated singular value decomposition (TSVD) to perform feature decomposition (reducing a matrix to its constituent parts) on the TF-IDF matrix. This resulted in a condensed TF-IDF matrix containing the 50 most significant features in terms of their representation of the original data.18 We also know that a literature search typically returns few articles that meet inclusion criteria (imbalanced data). To address this issue, we applied the Synthetic Minority Over-sampling Technique (SMOTE), which creates new feature points aimed at overcoming imbalanced data.19 We appended the features derived from the LDA, TSVD, and SMOTE to the reduced TF-IDF matrix to get our final matrix for ML modeling.
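The core idea behind SMOTE, creating synthetic minority-class points by interpolating between a minority point and one of its nearest minority neighbors, can be sketched in a few lines. This is a simplified illustration rather than the implementation used in the study; the function name, the default `k`, and the seeding are assumptions.

```python
import random
from typing import List, Sequence


def smote_like(
    minority: List[Sequence[float]], n_new: int, k: int = 2, seed: int = 0
) -> List[List[float]]:
    """Generate n_new synthetic points from at least two minority-class points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (squared Euclidean distance).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic
```

Each synthetic point lies on the line segment between two existing relevant-article feature vectors, so the minority class is enlarged without simply duplicating rows.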
Step d. Developing Training Sample. Our approach to creating a training set was to mimic the operational approach we outlined in the proposed ML “down-selection” process. We assumed that only a small number of articles would be available for training: no more than 60, with examples of both relevant and non-relevant articles. We therefore created our initial training data through stratified random selection of the experiment SRs’ articles. Our test data set was the remaining unreviewed data from the experiment SRs.
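A stratified draw of a small training set can be sketched as below. The text does not say how the strata were allocated, so proportional allocation with at least one article per class is an assumption, as are the function name and the seed. The sketch also assumes both classes are present in the labeled pool.

```python
import random
from typing import Dict, List, Tuple


def stratified_training_sample(
    labels: Dict[str, bool], n_train: int = 60, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Split article IDs into a small stratified training set and a test set."""
    rng = random.Random(seed)
    relevant = sorted(a for a, y in labels.items() if y)
    non_relevant = sorted(a for a, y in labels.items() if not y)
    # Proportional allocation, forcing at least one article from each class.
    n_rel = max(1, round(n_train * len(relevant) / len(labels)))
    n_rel = min(n_rel, len(relevant), n_train - 1)
    n_non = min(n_train - n_rel, len(non_relevant))
    train = rng.sample(relevant, n_rel) + rng.sample(non_relevant, n_non)
    chosen = set(train)
    test = [a for a in labels if a not in chosen]  # everything not used for training
    return train, test
```

For a pool of 1,000 articles of which 50 are relevant, this draws 3 relevant and 57 non-relevant articles for training and leaves 940 for testing.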
Step e. Initial Algorithm Assessment. Many ML algorithms for building models exist. We tested the following algorithms for overall accuracy from the initial training set: Support Vector Machine (SVM) with Stochastic Gradient Descent, K-Nearest Neighbors (KNN), Decision Tree Binary, SVM with a Sigmoidal Kernel, Gradient Boosting Classifier, Random Forest Classifier, and Multinomial Naive Bayes.20,21,22,23 From these, we chose the four models that performed best in terms of accuracy (the ratio of the number of correct predictions to the total number of predictions made) to include in a stacked ensemble model,24,25 which combines the strengths of the best-performing models. The four were SVM with Stochastic Gradient Descent, KNN, Decision Tree, and Sigmoidal SVM.
Step f. Compiling into a Stacked Ensemble Model. We used a soft voting ensemble classifier to build predictive models to overcome any weakness in each individual model.25 In a soft voting ensemble, different models are given weights that are applied to their predictions, which are then combined into a final stacked prediction. We evaluated various weight distributions according to their area under the receiver operating characteristic (ROC) curve from initial model training (step d).26 The [10 – SVM, 1 – KNN, 1 – Decision Tree, 1 – Sigmoidal SVM] weighting distribution consistently resulted in the highest area under the curve of the options tested. As with individual models, the stacked ensemble model predicts a value from 0 to 1.0 for each article, where values closer to 1.0 indicate that an article is relevant for inclusion.
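Soft voting reduces to a weighted average of each model's predicted probability. The sketch below applies the 10:1:1:1 weighting reported above to one article's scores; the dictionary keys and the function name are hypothetical labels chosen for the example, and the input probabilities are made up for illustration.

```python
from typing import Dict

# Weights reported in the text; the key names are illustrative labels only.
WEIGHTS = {"svm_sgd": 10, "knn": 1, "decision_tree": 1, "sigmoid_svm": 1}


def soft_vote(probabilities: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted average of per-model relevance probabilities for one article."""
    total_weight = sum(weights[model] for model in probabilities)
    weighted_sum = sum(weights[model] * p for model, p in probabilities.items())
    return weighted_sum / total_weight


# Illustrative (made-up) per-model scores for a single article.
example = {"svm_sgd": 0.9, "knn": 0.2, "decision_tree": 0.4, "sigmoid_svm": 0.6}
ensemble_score = soft_vote(example)
```

With these weights the SVM with Stochastic Gradient Descent dominates the ensemble score, and the other three models mainly adjust it when they disagree.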
Step g. Choosing Prediction Thresholds for Iterative Modeling. The prediction threshold applied to our scores (0–1.0) can be changed to influence the volume of articles selected for review in each iteration (SR Down Selection Step 3).27 Higher thresholds result in lower volumes, and vice versa. By comparing the actual results of the experiment SR data under different prediction thresholds, we were able to identify prediction threshold guides that optimize the selection of articles for review during iterations. Once we determined an optimal threshold for an iteration, we tested multiple thresholds on the next iteration to confirm sensitivity. We were able to do this because we had known results and could simulate a human review (SR Down Selection Step 3). Based on testing of the experiment SRs, we found that 3 iterations, including the original training round, would reach the optimal trade-off between sensitivity and percent of articles reviewed, based on a 98% sensitivity goal for finding relevant articles.
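The trade-off a threshold controls can be made explicit with a small helper that, for known results, reports the share of articles sent to human review and the sensitivity achieved. The function name and the toy scores are assumptions for illustration and do not reproduce the study's data.

```python
from typing import Dict, Set, Tuple


def threshold_summary(
    scores: Dict[str, float], relevant: Set[str], threshold: float
) -> Tuple[float, float]:
    """Return (fraction of articles selected for review, sensitivity) at a threshold."""
    selected = {article for article, score in scores.items() if score >= threshold}
    found = selected & relevant  # relevant articles that would reach a human reviewer
    return len(selected) / len(scores), len(found) / len(relevant)


# Toy example with known relevance, mimicking a simulated human review.
toy_scores = {"a": 0.9, "b": 0.6, "c": 0.4, "d": 0.1}
toy_relevant = {"a", "c"}
high = threshold_summary(toy_scores, toy_relevant, 0.5)
low = threshold_summary(toy_scores, toy_relevant, 0.3)
```

In the toy data, lowering the threshold from 0.5 to 0.3 raises the reviewed share from half to three quarters of the articles and raises sensitivity from 50% to 100%, which is the kind of comparison used to pick a threshold guide.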
From our predictive threshold testing, we used the weighted average of the best thresholds from each experiment SR as a guide for non-experimental application. These thresholds, shown in Table 1, should be thought of as guides. The volume of articles selected for review at different thresholds should also be considered when choosing which threshold to proceed with.
Table 1
Average Post-Hoc Model Performance
Data Set | Prediction Threshold for 1st Iteration | Prediction Threshold for 2nd Iteration | Prediction Threshold for 3rd Iteration | Total Articles | % of total human-reviewed articles needed to return 95% relevant articles | % of total human-reviewed articles needed to return 98% relevant articles |
1st SR Review | 50.0% | 20.0% | 20% | 14,655 | 19.3% | 24% |
2nd SR Review | 50.1% | 30.3% | 44% | 15,234 | 18.9% | 25% |
3rd SR Review | 75.0% | 20.0% | 20% | 7,670 | 10.0% | 34% |
4th SR Review | 70.0% | 27.5% | 19.5% | 1,820 | 30.0% | 41.8% |
Weighted Average | 57.6% | 26.0% | 29.5% | N/A | 20.9% | 29.8% |
Using these ML configurations, we examined the percent of total articles that would need to be reviewed to reach 95% and 98% sensitivity relative to what human reviewers selected for inclusion in the experiment SRs. On average, only 21% of articles would have to be reviewed to find 95% of what the human SR selected, while 30% would have to be reviewed to find 98%, including initial training articles (Table 1).