Sample
This study includes the abstracts and associated meta-data for Implementation Science articles published between February 22, 2006 (Welcome to Implementation Science19) and October 1, 2020. We retrieved the study data using PubMed’s Application Programming Interface (API) to assemble a database that included the following publication attributes: title, authors, affiliation, abstract, keywords, and date indexed. Of primary interest for this study were the publication abstracts and the dates indexed. There were no applicable EQUATOR standards because i) we limited our database to articles published in one journal and ii) we used all indexed data.
As of October 1, 2020, there were 1711 unique article abstracts available in PubMed, all of which we used in this analysis. We note that there appears to be a lag of a day or two between when articles are uploaded to the Implementation Science website and when they appear in PubMed. Additionally, articles published in 2008 and 2009 were not indexed until 2010, for reasons unknown to us. We chose to preserve the PubMed indexing date because that is when a consumer of research would have encountered the studies in PubMed.
Data Analysis
We used several NLP methods to review the content of the abstracts. The core of NLP is tokenization, which treats words as “tokens” that are meaningful information units in and of themselves. Generally, the more frequently a token occurs, the more relevant it is in descriptive analysis; this is known as the Bag of Words approach. However, a fair amount of preprocessing is needed to prepare the text for analysis. We first removed stop words, words that add little informative value to the overall text, such as “and,” “the,” and “with.” We then lemmatized the remaining words. Lemmatization converts a word to its root (dictionary) form.20 In some cases, this means making plural words singular; it also involves normalizing the tense of a word (which is especially relevant for irregular English verbs such as “be”). A similar but cruder method, stemming, is sometimes used instead because it is less computationally expensive than lemmatization; however, our dataset was not large enough for this to be a concern.
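The preprocessing steps above can be sketched in a few lines. The analysis itself was done in R (with tidytext), so the following stdlib-Python snippet is purely illustrative, and its stop-word list and lemma map are tiny toy stand-ins for real linguistic resources:

```python
from collections import Counter

# Toy stop-word list and lemma map; real pipelines draw on much larger
# resources (e.g., tidytext's stop-word lexicons in R).
STOP_WORDS = {"and", "the", "with", "of", "a"}
LEMMAS = {"studies": "study", "was": "be", "were": "be"}

def preprocess(text):
    """Lowercase, tokenize on whitespace, drop stop words, lemmatize."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]

abstract = "The studies of implementation were reviewed, and the study was indexed."
bag = Counter(preprocess(abstract))  # the Bag of Words representation
print(bag)
```

Note that plural forms ("studies") and irregular verb forms ("was", "were") collapse to single tokens, so the counts reflect concepts rather than surface forms.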
What is the composition of content areas published?
To examine study question 1, we used topic modeling with Latent Dirichlet Allocation (LDA). LDA assumes that each topic is a cluster of words that co-occur, and that documents (in this case, abstracts) are mixtures of topics.21-22 To identify the ideal number of topics (k), researchers look for the lowest perplexity score across a range of candidate values. Perplexity is a metric that measures how well a probability distribution predicts a sample. To find this minimum, we tested candidate values of k from 2 to 60. While our goal was to minimize the perplexity score, we also wanted to maintain topic interpretability. This can be a tradeoff: a model may show mathematical improvement while several of its emergent topics strain a clear interpretation. We decided on a parameter of k = 30 topics. Although perplexity continued to improve at higher values of k, too many of the emergent topics did not add conceptual value. Topic modeling is a Bayesian process,22 so to ensure replicability of our analysis we set a seed number.
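Perplexity itself is simple to state: it is the exponentiated negative mean log-likelihood that the model assigns to a held-out sample, so lower values mean better prediction. A minimal sketch with invented token probabilities (the actual analysis used the R topicmodels package, which computes this from the fitted model):

```python
import math

def perplexity(token_probs):
    """Exponentiated negative mean log-likelihood of held-out tokens.
    token_probs: the model's probability for each observed token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A uniform guess over 10 candidate words gives perplexity 10;
# a model that concentrates probability on the observed tokens scores lower.
uniform = [0.1] * 5
confident = [0.5, 0.4, 0.3, 0.5, 0.2]
print(perplexity(uniform))    # 10.0
print(perplexity(confident))  # lower: the model fits the sample better
```

Scanning this quantity across candidate values of k, as described above, traces out the curve whose minimum (balanced against interpretability) guides the choice of k.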
An LDA yields a matrix in which each document (in this case, an abstract) is assigned a likelihood of belonging to each topic: the γ (“gamma”) value.23 Researchers generally take the highest γ value to assign a document to a topic, with the caveat that this statistic is a likelihood rather than a hard classification. Because we end up with a k-dimensional matrix, we needed a dimensionality reduction algorithm to appropriately visualize the research space. We used Uniform Manifold Approximation and Projection (UMAP), a general-purpose manifold learning and dimension reduction algorithm24 that has advantages over principal components analysis (PCA) and more advanced techniques like t-distributed stochastic neighbor embedding (t-SNE) because it better preserves the global structure of the data. We have posted an interactive UMAP visualization online at www.dawnchorusgroup.com, as the color scheme and embedded data do not translate to print, PDF, or static HTML pages.
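The maximum-γ assignment step amounts to a row-wise argmax over the document-topic matrix. A toy sketch with invented γ values (in the study, this matrix comes from the fitted LDA model):

```python
# Toy document-topic (gamma) matrix: rows are abstracts, columns are topics.
gamma = [
    [0.70, 0.20, 0.10],   # abstract 0: mostly topic 0
    [0.15, 0.25, 0.60],   # abstract 1: mostly topic 2
    [0.40, 0.35, 0.25],   # abstract 2: topic 0, but less decisively
]

def assign_topics(gamma):
    """Assign each document to its highest-gamma topic.
    A likelihood-based heuristic, not a hard classification."""
    return [max(range(len(row)), key=row.__getitem__) for row in gamma]

print(assign_topics(gamma))  # [0, 2, 0]
```

The third row illustrates the caveat above: abstract 2 is assigned to topic 0 even though its γ values are nearly tied.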
Topic modeling is a primarily data-driven process, though there are methods to seed the model with specific words.25 Because of this, the LDA outputs may be uninterpretable, or so broad that a human can extract little meaning from them. Boyd-Graber and colleagues21 developed a number of metrics to judge the quality of topics generated by an LDA approach (listed below). As with most such metrics, there is no rule of thumb that sets absolute cutoff scores.
- Topic size: Total number of tokens by topic. There is a strong relationship between topic size and topic quality because common topics are generally represented in many documents, and are not as susceptible to being “diluted” by smaller topics.21
- Mean token length: The average number of characters for the top tokens in a topic.
- Prominence: How many unique documents in which a topic appears. In this type of analysis, we generally find that the methodology-specific topics have higher prominence because they are more likely to feature in many articles.
- Coherence: How often the top tokens in each topic appear together in the same document.
- Exclusivity: How unique the top tokens in each topic are when compared to the other topics.
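Several of these metrics can be made concrete on toy data in which each document is a set of tokens. The definitions below are simplified stand-ins for illustration, not Boyd-Graber and colleagues' exact formulations:

```python
from itertools import combinations

# Toy corpus and topics; real values come from the fitted LDA model.
docs = [{"implementation", "trial", "barrier"},
        {"implementation", "barrier", "fidelity"},
        {"trial", "outcome", "cluster"}]
topics = {"A": ["implementation", "barrier"], "B": ["trial", "cluster"]}

def mean_token_length(top_tokens):
    """Average number of characters across the topic's top tokens."""
    return sum(len(t) for t in top_tokens) / len(top_tokens)

def prominence(top_tokens):
    """Number of documents containing any of the topic's top tokens."""
    return sum(1 for d in docs if any(t in d for t in top_tokens))

def coherence(top_tokens):
    """Mean pairwise document co-occurrence of the topic's top tokens."""
    pairs = list(combinations(top_tokens, 2))
    return sum(sum(1 for d in docs if a in d and b in d)
               for a, b in pairs) / len(pairs)

def exclusivity(name, top_tokens):
    """Fraction of this topic's top tokens not shared with other topics."""
    others = {t for k, v in topics.items() if k != name for t in v}
    return sum(1 for t in top_tokens if t not in others) / len(top_tokens)

for name, tokens in topics.items():
    print(name, mean_token_length(tokens), prominence(tokens),
          coherence(tokens), exclusivity(name, tokens))
```

Here topic A's top tokens co-occur in two documents (coherence 2.0) and are not shared with topic B (exclusivity 1.0), illustrating a topic that would pass these quality filters.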
How have these content areas changed over time?
We assigned each article to a topic based on its maximum γ value, and then plotted the number of articles in each topic over time, using the years indexed in PubMed. To decide which topics to visualize, we used the above metrics to filter the results so that we did not end up with a time-series plot of 30 indistinguishable greyscale lines.
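Tallying articles per topic per indexing year is a straightforward grouped count; a toy sketch with invented topic assignments and years:

```python
from collections import Counter

# Toy (topic, year) pairs; in the study these come from the max-gamma
# topic assignments joined with each article's PubMed index year.
assignments = [("A", 2017), ("A", 2017), ("B", 2017),
               ("A", 2018), ("B", 2018), ("B", 2018), ("B", 2018)]

counts = Counter(assignments)  # articles per (topic, year) cell
series = {topic: [(y, counts[(topic, y)]) for y in (2017, 2018)]
          for topic in ("A", "B")}
print(series)  # {'A': [(2017, 2), (2018, 1)], 'B': [(2017, 1), (2018, 3)]}
```

Each entry in `series` is one line of the time-series plot; the topic-quality filtering described above would drop low-quality topics before this step.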
All statistics were computed in R 4.0.2 using a number of open-source packages (tidytext, topicmodels, quanteda, umap). We used “out-of-the-box” algorithms, meaning we did not make any substantial changes to the underlying mathematics. All data used in this analysis are available, either directly as a database from the authors, through PubMed, or on Implementation Science’s website.