3.1. Textual mining approach
Different types of data mining have recently been deployed in medicine, such as bioinformatics (Alizadehmohajer et al., 2022), but this paper proposes a novel method for finding latent topics in the literature. Latent Dirichlet Allocation (LDA), proposed by Blei, Ng, and Jordan (2003), has attracted much attention as a means of revealing latent topics in scholarly documents; it models each document as a finite mixture over an underlying set of topics. As a three-level hierarchical Bayesian model, LDA minimizes the effects of a researcher's bias. It builds on the Latent Semantic Indexing (LSI) and probabilistic Latent Semantic Indexing (pLSI) approaches to disclosing latent topics in text (Deerwester, Dumais, Landauer, Furnas, & Harshman, 1990; Hofmann, 1999). LDA has been applied to classify discussions into groups of correlated words, giving researchers the opportunity to investigate the literature guided by such a structure (Amado, Cortez, Rita, & Moro, 2018; Choi, Lee, & Sohn, 2017; Griffiths & Steyvers, 2004; Moro, Cortez, & Rita, 2015; Moro, Rita, & Cortez, 2017). The main idea behind LDA is a hierarchical Bayesian analysis following a generative process in which each topic is a Dirichlet-distributed distribution over words and each document derives from a distribution over topics. An advantage that distinguishes LDA from single-membership clustering is that it is a mixed-membership model, meaning that each word may belong to multiple topics.
LDA-based models estimate the posterior probability that each word belongs to a specific topic. One advantage of the algorithm is that it lets researchers rank words by their correlation with a topic. Another is that each topic can be profiled by its most correlated terms in order to determine the fundamental discussion. LDA also facilitates finding the most discussed topics in each paper by calculating the probability that a document is relevant to each topic (Blei et al., 2003).
Figure 1 depicts the generative process underlying LDA: for each of the K topics, a distribution over the V vocabulary words is drawn, \(\overrightarrow{\beta_{k}} \sim \mathrm{Dir}_{V}(\eta)\); for each of the D documents, a vector of topic proportions is drawn, \(\overrightarrow{\theta_{d}} \sim \mathrm{Dir}(\overrightarrow{\alpha})\); each word position n then receives a topic assignment \(z_{d,n} \sim \mathrm{Mult}\left(\overrightarrow{\theta_{d}}\right)\), \(z_{d,n} \in \{1,\dots,K\}\), and a word \(w_{d,n} \sim \mathrm{Mult}\left(\overrightarrow{\beta_{z_{d,n}}}\right)\), \(w_{d,n} \in \{1,\dots,V\}\) (Blei & Lafferty, 2009). This paper adopts Gibbs sampling, a Markov chain Monte Carlo (MCMC) algorithm for sampling according to a probability distribution, owing to its efficiency and convergence capabilities. Text mining involves several steps to find topics in textual information and to prepare the data for LDA. The first step defines the corpora without any formatting, converting words to lower case and stripping white spaces and numbers from each corpus (Meyer, Hornik, & Feinerer, 2008). The last transformation stage removes auxiliary terms known as stop words (e.g., terms such as "the", "as", "that") and omits infrequent terms (Delen & Crossland, 2008) from the topic evaluation. A document-term matrix (DTM) is subsequently formed from the collection of preprocessed documents to count the frequency of each term per document. Penalty functions are then used to reduce the DTM, which removes outliers (Blei & Lafferty, 2009; Delen & Crossland, 2008; Meyer et al., 2008). Hidden topics are revealed by applying the generated LDA models to the DTM.
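The generative process above can be sketched numerically. The following is a minimal illustration with arbitrary toy values for V, K, D, N and the hyperparameters, not the corpus or settings used in this study:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, D, N = 8, 3, 4, 10        # vocabulary size, topics, documents, words per doc
eta, alpha = 0.1, 0.5           # Dirichlet hyperparameters (toy values)

# beta_k ~ Dir_V(eta): one distribution over the vocabulary per topic
beta = rng.dirichlet(np.full(V, eta), size=K)           # shape (K, V)

docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))          # theta_d ~ Dir_K(alpha)
    z = rng.choice(K, size=N, p=theta_d)                # z_{d,n} ~ Mult(theta_d)
    w = [int(rng.choice(V, p=beta[z_n])) for z_n in z]  # w_{d,n} ~ Mult(beta_{z_{d,n}})
    docs.append(w)
```

LDA inference (e.g., via Gibbs sampling) inverts this process: given only the observed words, it recovers the latent topic distributions.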
(Please insert Figure.1 here)
3.2. Process and Sampling
We queried the Web of Science (WOS), PubMed, Scopus and Google Scholar search engines with the term "Pharmaceutical business and public health" and found 737 articles on the relation between the pharmaceutical supply chain and public health. However, closer inspection revealed that most of the WOS categories were associated with Computer Science, Engineering, Medicine and other fields unrelated to the research topic. We therefore carried out a second search for latent topics relating pharmaceutical marketing to public health matters such as health outcomes. The search targeted words in the title, abstract and keywords, using wildcards so that a root word matched its multiple possible variants. The query results were categorized into English and non-English papers, and our study focused on English-language papers in peer-reviewed journals. At first glance, the dispersion of the articles among various journals indicates that, although the research is focused on pharmaceutical and health terms, papers appeared in a wide range of journals. The 432 articles meeting these criteria were then filtered through a systematic literature review, since we aimed to find papers focused on the outcomes of the pharmaceutical business for public health while most were focused on other topics.
The systematic analysis process evaluates four basic criteria: accuracy (how close the data from the study are to the information presented in the articles for evaluating what we want to investigate, that is, the effect of the pharmaceutical industry on public health); reliability (the overall consistency of a measure and the degree to which results replicate under consistent conditions, which confirms how generalizable the results are); credibility (considering articles published in the most prestigious and well-reputed journals worldwide); and integrity (how reliable, accurate and complete the investigation is, and whether precision in the selected research process is taken into account) (Moher, Liberati, Tetzlaff, Altman, & Altman, 2009; Nill & Schibrowsky, 2007). Systematic analysis of the titles, keywords, abstracts and text of the studied articles eliminated a set of 303 duplicated papers; from the 129 remaining articles, we focused our deeper content analysis on 53 papers, because the rest were unrelated to pharmaceutical decisions. Two independent researchers identified potentially relevant papers, and disagreements between them were subsequently discussed to reach consensus on the final set of papers, with inter-rater agreement above 0.85 (Cohen's kappa coefficient). Figure 2 illustrates how the final set of papers was created after identifying the most relevant articles and excluding the rest.
(Please insert Figure.2 here)
The first paper dates back to 1992, while the most recent article was published in 2018. The 5-year journal Impact Factor (5Y-IF) of the journals that published those papers in each 5-year period is shown in Fig. 3. Attention to the topic has increased, but it remains limited. From Fig. 3 it is evident that, as the number of papers increases, they become dispersed over a larger number of journals.
(Please insert Figure.3 here)
Since 2017, papers on the relation between the pharmaceutical business and healthcare have been published in 16 different journals, while from 2012 to 2017 only 12 journals published such articles, 11 in 2007–2012, 7 in 2002–2007, 5 in 1997–2002 and 2 in 1992–1997. From 2017 to date, most of the papers were published in higher-ranked journals, although they have appeared in lower 5Y-IF journals as well. Abstracts of the articles were extracted and transformed following best practices in text-mining analysis (Guerreiro, Rita, & Trigueiros, 2016). The R programming language was used to turn the text into a structured document-term matrix with the "tm" and "topicmodels" libraries. After converting all text to lower case, whitespace, punctuation and numbers were stripped from the text corpus. Commonly used English words (such as "the", "is" and "and") and core keywords such as "pharmacy", "health", "pharmaceutical business" and "public health" were removed from the text, since the most frequent terms may jeopardize the cohesion of the topics because of their high overall frequency (Guerreiro et al., 2016). The final tokenization step selects individual or co-occurring terms; single terms ("pharmacy") and two-term combinations ("public health") were used for the analysis. Treating the document-term matrix for sparsity revealed a total of 3107 terms over the 53 papers. Latent Dirichlet allocation (LDA) topic models using the Gibbs sampling method were used to categorize the latent topics discussed in the literature based on the co-occurrence of words (Cao, Xia, Li, Zhang, & Tang, 2009; Geman & Geman, 1984; Griffiths & Steyvers, 2004). To select the K that could best represent the underlying groups of discussions expected in the corpus of documents, a collection of candidate LDA models was developed, from K = 2 to K = 60 topics. The results for each LDA model are shown in Fig. 2.
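As an illustration only, the tm-style transformations and document-term matrix described here can be sketched in Python with the standard library (the study itself used R's "tm" package; the toy abstracts and stop-word list below are hypothetical):

```python
import re
from collections import Counter

# Toy abstracts; preprocessing mirrors the transformations described above:
# lower-casing, stripping punctuation and numbers, removing stop words.
abstracts = [
    "Pharmaceutical pricing affects patient access to medicines in 2017.",
    "Supply-chain disruptions reduce the availability of medicines.",
]
stop_words = {"the", "to", "of", "in", "and", "is", "as", "that"}

def tokenize(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # drop punctuation and numbers
    return [t for t in text.split() if t not in stop_words]

tokens = [tokenize(doc) for doc in abstracts]
vocab = sorted({t for doc in tokens for t in doc})

# Document-term matrix: one row per document, columns count term frequencies
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokens]
```

In a full pipeline, sparse (infrequent) columns of this matrix would then be pruned before fitting the topic model.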
Measures for selecting the K with the largest likelihood were derived from the work of Griffiths and Steyvers (2004) and Cao et al. (2009). As Guerreiro et al. (2016) suggest, the ideal number of clusters/topics depends on whether the explained variability stabilizes after more clusters are added. Too few topics produce overly general topics, while too many decrease the interpretability of the results (Pavlinek & Podgorelec, 2017). The optimal number of topics (K) is selected at the point where the log-likelihood becomes most negative after the variability of the model stabilizes and before the log-likelihood increases again. According to the studies of Cao et al. (2009) and Griffiths and Steyvers (2004), the optimal number of topics is 7. Fewer than 7 topics would significantly reduce the explained variability, while a higher K would increase the explained variability only at the cost of interpretability.
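A model-selection loop of this kind can be sketched as follows. Note that scikit-learn fits LDA by variational inference rather than the Gibbs sampling used in this study, and the toy corpus and candidate range (K = 2 to 5 instead of 2 to 60) are illustrative only:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the 53 abstracts
corpus = [
    "hospital pharmacy services patient care",
    "drug pricing market pharmaceutical firms",
    "supply chain distribution logistics medicines",
    "patient care health outcomes hospital",
    "pharmaceutical firms marketing drug pricing",
]
dtm = CountVectorizer().fit_transform(corpus)

# Fit one model per candidate K and record the approximate log-likelihood
scores = {}
for k in range(2, 6):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(dtm)
    scores[k] = lda.score(dtm)      # higher (less negative) is better

best_k = max(scores, key=scores.get)
```

On a real corpus one would plot `scores` against K, as in Fig. 2, and choose K where the curve stabilizes.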
3.3. Topic modelling
The LDA model was fitted using the Gibbs sampling method with 849 iterations (Griffiths & Steyvers, 2004). A burn-in period (a procedure that discards the samples at the beginning of the Markov chain owing to their low accuracy) of 4000 iterations was used. The model yielded the terms most correlated with each topic, and each of the seven topics was labelled according to its most frequent noun terms.
According to the results of the study, 23 articles mainly focused on healthcare services, 17 papers correlated with the topic of pharmaceutical firms and health, 7 papers centred their discussion on the pharmaceutical supply chain, and 6 papers discussed issues concerning community and hospital pharmacies.