Semantic similarity measure for topic modeling using latent Dirichlet allocation and collapsed Gibbs sampling

Automatically extracting topics from large amounts of text is one of the main uses of natural language processing (NLP). The latent Dirichlet allocation (LDA) technique is frequently used to extract topics from pre-processed material based on word frequency. One of the main problems of LDA is that the topics it extracts are of poor quality if the document does not coherently belong to a single topic. Gibbs sampling, however, operates on a word-by-word basis, modifying the topic assignment of one word at a time, which allows it to be used on documents covering a variety of topics. To improve the quality of the extracted topics, this paper develops a hybrid semantic similarity measure for topic modeling that combines LDA and Gibbs sampling to maximize the coherence score. To verify the effectiveness of the suggested model, an unstructured dataset was taken from a public repository. The evaluation carried out shows that the proposed LDA-Gibbs model had a coherence score of 0.52650, against an LDA coherence score of 0.46504. The proposed multi-level model provides a better quality of extracted topics.


Introduction
In real life, every discussion between two or more people revolves around a specific topic. In natural language processing (NLP), a topic is a collection of words that indicates the documents belonging to a particular subject. Some words will appear more frequently in a document about a particular subject than in others. For example, an article about an election will frequently mention words such as "campaign," "vote," "manifesto," "ballot," and "president." However, unless it is included as an example, it is unlikely to use a phrase like "pride and prejudice" (a romance novel by Jane Austen published in 1813). Topic Modeling (TM) automatically finds topics in a set of documents. The origin of the topic model is latent semantic indexing [1], which served as the foundation for the development of TM. It is a common analytical tool for evaluating data, and its primary responsibility is to identify the topics to which each document belongs [2]. To infer topics from a document or unstructured data, TM counts words and groups those with similar word patterns. For example, if Samsung wants to know what customers think about a particular feature of one of its products, instead of spending hours trying to identify which comments discuss that feature or topic of interest, a topic modeling algorithm can be used to analyze those comments. Manually reading through such large volumes of documents and compiling the topics would be very difficult.
Most documents are not limited to covering just one topic. They frequently cover a variety of topics. In the case of an article about "sport and SEO," it will likely contain words like "search engines," "optimization," and "goal" in addition to those previously mentioned topic model words. There are several uses of TM in NLP. With the advent of social media, topic models have been used for social media content analysis [3], such as content characterizing and recommendation [4,5], text classification [6,7], event tracking [8][9][10], and community discovery [11].
Topic models provide insights into unstructured data. Existing topic model algorithms include LDA (latent Dirichlet allocation) [12], LSA (latent semantic analysis), and TF-IDF (term frequency-inverse document frequency) [13]. One issue in topic modeling is measuring how the generated topics relate to the document, that is, how good the generated topics are. This is where semantic similarity comes into play, using a coherence score to measure the quality of the topics generated in a document.
Semantic similarity measure (SSM) is a key part of information processing technology [14]. Semantic similarity refers to the degree of relevance between concept words [15]. It determines the similarity between different terms such as words, sentences, documents, concepts, and instances. SSM has many applications in computing, including information retrieval, educational systems, text summarization, and NLP [16]. One issue with measuring semantic similarity is that two words or sentences often portray different meanings. Take, for example, the phrases "Tunde and Ayo studied English and Agric." and "Tunde studied English and Ayo studied Agric." Although these two sentences include the same words, their meanings are not the same. Conversely, the phrases "Mary is lactose intolerant" and "Mary is allergic to dairy products" have the same meaning but use different vocabulary. This is where the coherence score comes into play, as it measures the semantics shared between such phrases.
Though several methods and models for measuring semantic similarity have been developed, results from their implementations still show that much improvement is required in the quality of learned topics. This paper presents a semantic similarity measure for topic modeling using LDA and Gibbs sampling, to check whether combining Gibbs with LDA provides a better quality of topics than using LDA alone. The commonly used LDA topic model was applied to a given dataset and the coherence score was measured; Gibbs sampling was then combined with LDA to find out whether the coherence score could be improved. The goal of the study is to generate higher-quality learned topics than existing models like LDA produce, with that quality evaluated using a coherence score. The main difference between the proposed model and existing ones is that it adopts Gibbs to improve the quality of learned topics (the coherence score): LDA is first used independently on the dataset and its coherence measured, then the output of LDA is used as a feeder to Gibbs and coherence is measured again, so the coherence score is calculated separately at the LDA level and at the LDA/Gibbs level. Most studies do not measure coherence at the LDA level and ignore the important role LDA plays in improving the quality of learned topics. The improvement in the quality of learned topics is achieved by introducing Gibbs on top of the LDA output.
The dataset comprises 11,000 newsgroup postings collected from an online database [17] that featured 20 various categories, including ICT-related articles from Nigerian newspapers' sports, entertainment, politics, and health sections. Punch Nigeria, the Sun daily, metro news, and other outlets were among the news sources taken into consideration. The choice of the dataset was based on having a variety of topics to be generated using both LDA and Gibbs sampling. The general result from the research includes generating topics for both algorithms and increasing the coherence score of LDA using Gibbs. Other results include visualization of topics generated by LDA, word count, the importance of topic keywords, and distribution of document word counts by dominant topic. Each of the images was edited at 300 dpi.
Section 2 presents the review of relevant literature. The methodology, data source, text pre-processing, and model architecture are presented in Sect. 3, while Sect. 4 focuses on results and discussion. The conclusion drawn from the research is presented in Sect. 5.

Related work
The study in [18] concentrated on examining LDA, topic modeling, its applications, and related surveys. To understand the evolution of research, current trends, and the intellectual structure of topic modeling, the work reviewed scholarly articles (published between 2003 and 2016) about LDA-based topic modeling. It presented methods like user behavior modeling and LDA-based topic visualization but did not address other methods like Gibbs sampling. A thorough analysis of TM techniques was presented in [19], covering classification hierarchy, posterior inference methods, various LDA evolution models, and their applications in a variety of technological fields, such as scientific literature, bioinformatics, software engineering, and social network analysis.
The latent semantic indexing approach in [20] employed the empirical prior Dirichlet allocation (epLDA) model to construct the priors necessary for topic computation from the data. Using an exponential function, the acquired priors' parameters were connected to those of the traditional LDA model. The model's prediction accuracy was 92.15% when it was put into use and tested on benchmarked data. It was found that the epLDA model consistently outperforms the traditional LDA model on various datasets, by an average percentage accuracy of 6.33%, but did not do well on datasets with non-linear dependencies. Using two well-known text semantic representation approaches, Semantic Role Labeling (SRL) and Explicit Semantic Analysis (ESA), [21] developed a text summarization methodology. According to experimental results, the proposed summarizer outperforms all contemporary related comparators when summarizing a single document using the ROUGE-1 and ROUGE-2 measures, and comes in second when summarizing multiple documents with ROUGE-1 (0.499) and ROUGE-SU4 (0.286) scores. One of the major drawbacks of this study was that both methods do not cater to different words that have the same meaning, as they measure syntactical matches rather than semantics. Topic extraction from software engineering data was the main objective of the work in [22]. The study demonstrated how to use LDA to generate various topics from a set of textual data. It revealed that the topic-topic correlation matrix has 95% confidence intervals for the correlation value with a small number of datasets.
[23] investigated the effectiveness of TM methods on a Twitter dataset. The best models for each experiment were selected with K = 100 topics because that was the number of search queries used to gather the data. The study's findings suggested that when dealing with short documents alone, the biterm topic model (BTM) was superior to all other models. Semantic similarity was reviewed in the study in [2]; the result categorized numerous semantic similarity measures and approaches along with their benefits and drawbacks across various documents.
[24] presented a step-by-step architecture with three steps: pre-processing, topic modeling, and post-processing, where the topic model LDA is utilized, to make topic modeling accessible to academics. A 650 by 20 matrix of topic probabilities is the result of the approach, which took 3.5 h to compute for the entire run of 650 papers taken into account for 20 topics. The highest probability distribution for one document was found in topic 16 as well but the quality of topics was not measured in the study.
The study in [25] quantitatively examined the literature's current series of metrics by benchmarking them against a generated dataset with a known value of k, assessing each metric's capacity to recover the true value while varying the level of topic resolution in the Dirichlet prior distributions. The new metric suggested in the research fared substantially better throughout the tested range of k and suffered from far less overfitting at low values of k. As k surpassed 80, all three measurements started to show early indicators of underfitting. The result was that the new metric can both determine the right value of k in this situation and provide evidence for potential overfitting. There was, however, no measure of topic quality.
[26] examined the issue of determining the maximum a posteriori (MAP) assignment of topics to words to examine the computational cost of probabilistic inference in LDA, but the result was highly dependent on the choice of parameters. [27] considered utilizing LDA to enhance the topic modeling of the Spanish state's official bulletin (BOE). Because the documents are not adequately defined at the documentary level and some of their crucial meta-data are empty, the analysis's findings indicated that more than 89% of the documents cannot be recommended.
[28] developed t-distributed stochastic neighbor embedding with LDA to improve scientific reading comprehension of enterprise-architecture-related papers. The analysis's findings revealed that among the subjects assigned to the studied documents, "sustainability" had the highest rating; the quality of the resulting documents was not stated in the study. A quick and effective learning process that was guaranteed to recover the parameters for a variety of topic models, including LDA, was described in [29]. [30] worked on online topic inference using LDA without showing how the coherence score of LDA could be improved. The research in [31] investigated various approaches to describe the structure and behavior of virtual organizations observed in contemporary social media and social networking environments using the LDA automated topic modeling algorithm, but did not show how the topics LDA generated could be measured.
Three techniques for the LDA model's parameter estimation were examined in [32]. It was discovered through experimental comparison that online variational Bayesian inference converges more quickly than the other two inference procedures while maintaining a level of result quality, even though that level was heavily dependent on the starting point of the optimization. Through the use of Bayesian mixture modeling, [33] estimated the number of topics by inference on the LDA model. The research in [34] presented a multiple-corpora LDA (mLDA) model which assumed that the proportions of document topics follow a symmetric Dirichlet distribution. The outcome demonstrated that by adding more data to a single topic model, mLDA made it possible to apply the power of TM to a wide range of sectors with heterogeneous data. One of the issues in the study was the fixed K (the number of topics is fixed and must be known ahead of time).
The limitations of the works presented in this section include having a fixed number of K (the number of topics is fixed and must be known ahead of time), poor performance on datasets with non-linear dependencies, and difficulty in some methods catering to different words that have the same meaning, as it measures syntactical matches rather than semantics.
Others are that the quality of topics was not measured, the quality of the resulting documents was not stated, generated LDA topics were not measured, and the level of result quality was heavily dependent on the starting point of the optimization. More specifically, the objective of this research was to develop a system where the quality of the topics generated by LDA is measured and evaluated using a coherence score, and the Gibbs sampling algorithm is used to improve on the topics generated.

Methodology
This section presents four important techniques used in our topic modeling. A total of 11,000 newsgroup postings were collected from an online database [17] that featured 20 various categories, including ICT-related articles from Nigerian newspapers' sports, entertainment, politics, and health sections. Punch Nigeria, the Sun daily, metro news, and other outlets were among the news sources taken into consideration.

Text preprocessing
To increase accuracy, decrease data redundancy, and shorten model training times, text preprocessing was employed to clean and normalize the text data [35]. Regular expression patterns were used to tokenize each sentence into a list of words, eliminating all punctuation and unnecessary characters. The next step was creating the bigrams and trigrams: bigrams are pairs of words that commonly appear together in a text, while trigrams are sequences of three words that frequently occur together. Bigrams and trigrams were applied to the dataset to identify words that frequently appear together and to forecast the conditional probability of subsequent words. The LDA model was constructed using 20 different topics, each made up of several keywords, with each keyword contributing a certain weight to the topic. Table 1 outlines the step-by-step procedure for the initial text cleaning, and Fig. 1 displays the results of the pre-processed text.
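The cleaning and bigram steps described above can be sketched in pure Python. This is a minimal stdlib-only illustration, not the study's implementation; the function names and the threshold `min_count` are illustrative assumptions.

```python
import re
from collections import Counter

def tokenize(doc):
    """Lowercase, strip punctuation, and split a raw document into word tokens."""
    return re.findall(r"[a-z]+", doc.lower())

def find_bigrams(token_lists, min_count=2):
    """Return word pairs that co-occur adjacently at least min_count times."""
    pair_counts = Counter()
    for tokens in token_lists:
        pair_counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in pair_counts.items() if n >= min_count}

def merge_bigrams(tokens, bigrams):
    """Rewrite a token list so frequent pairs become single 'w1_w2' tokens."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

docs = [
    "Search engines reward good optimization.",
    "Search engines index pages; optimization helps ranking.",
]
token_lists = [tokenize(d) for d in docs]
bigrams = find_bigrams(token_lists)
processed = [merge_bigrams(t, bigrams) for t in token_lists]
```

Because "search engines" occurs adjacently in both toy documents, it is merged into the single token `search_engines`; trigram detection would repeat the same pass over the bigram-merged output.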

Feature extraction concepts
A critical component of topic modeling performance is feature extraction. Each document is a set d_i = {w_1, w_2, w_3, …, w_n}, where w_i represents a word in document d. To map each word to a topic and each topic to a document, the words go through a feature extraction procedure. Feature extraction is divided into two classes: document-topic and word-topic mapping. Term frequency-inverse document frequency (TF-IDF) is used for topic-document feature extraction and is given in Eq. 1:

tf-idf(t, d) = tf(t, d) × log(n_d / df(t, d))    (1)

where tf(t, d) is the frequency of word t in document d, n_d is the total number of documents, and df(t, d) is the number of documents that contain the word t.
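Eq. 1 can be computed directly from token lists. The following stdlib-only sketch is for illustration; the documents and scores are toy values, not the study's data.

```python
import math
from collections import Counter

def tf_idf(term, doc, docs):
    """tf-idf(t, d) = tf(t, d) * log(n_d / df(t, d)), as in Eq. 1."""
    tf = Counter(doc)[term] / len(doc)     # term frequency within document d
    df = sum(term in d for d in docs)      # number of documents containing t
    return tf * math.log(len(docs) / df) if df else 0.0

docs = [
    ["vote", "ballot", "campaign"],
    ["vote", "goal", "match"],
    ["goal", "match", "player"],
]
score_vote = tf_idf("vote", docs[0], docs)      # appears in 2 of 3 documents
score_ballot = tf_idf("ballot", docs[0], docs)  # appears in only 1 document
```

As expected, "ballot" scores higher than "vote" in the first document: both have the same term frequency, but "ballot" is rarer across the corpus, so its inverse document frequency is larger.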

Methods I-latent Dirichlet allocation
Given a set of documents d_i, each made up of unique words w_i associated with a variety of topics k_i, LDA assumes that the words in a document identify its topics, and it converts each document into a list of topics by associating each word with a number of topics. The likelihood that a word w_j belongs to a topic t_k is then determined by LDA, where j and k are the word and topic indices, respectively. Once the probabilities have been computed, it is possible to select the words with the highest r probabilities, or to define a probability threshold and select only the words whose probabilities are greater than or equal to it. For example, if there are 3 topics and 3 words, LDA expresses a document as:

d_i = (w_1i × Topic-1) + (w_2i × Topic-2) + (w_3i × Topic-3)

In this representation there are three weights, for topic-1, topic-2, and topic-3 respectively, for a given document d_i: (w_1i × Topic-1) indicates the proportion of words in the document that represent topic-1, (w_2i × Topic-2) indicates the proportion of words that represent topic-2, and so on.
LDA regards each document as produced by a statistical generative process. As a result, each document is a mixture of topics, and each topic is a mixture of words. To generate a document, a topic is first chosen from the document-topic distribution, and then a word is chosen from that topic's multinomial topic-word distribution. Starting with an assumption of K topics, the LDA algorithm goes through each document and randomly assigns each word to one of the K topics.
For each document d, it loops through each word w and computes the following: • P(Topic t | Document d) = p(t_k | d_i), defined as the percentage of words in document d_i that are currently assigned to topic t_k. • P(Word w | Topic t) = p(w_j | t_k), defined as the percentage of assignments to topic t_k, over all documents, that come from the word w_j.
The proportion of words w_j in document d_i assigned to topic t_k captures how many words belong to topic t_k for a given document d_i and is expressed as p(t_k | d_i). The final step of LDA reassigns the word to a new topic: a topic t_k is chosen with probability P(Topic t_k | Document d_i) × P(Word w_j | Topic t_k), which is essentially the probability that topic t_k generated word w_j. The aim of LDA for topic modeling is to see each document as a collection of topics distributed in a specific way, and each topic as composed of a specific number of keywords. Once the algorithm is given the number of topics, all LDA does is rearrange the distribution of topics within the documents and the distribution of keywords within the topics to obtain a good composition of topic-keyword distributions. A topic is a group of dominant keywords that are typical representatives; one can tell what the topic is about just by glancing at its keywords. The LDA algorithm operates as follows: i. Select the k topics that make up the corpus. ii. At random, place each word in each document under one of the k topics. iii. Examine each word and its assigned topic in each document. iv. Examine how frequently the topic appears in the document and how frequently the word's terms appear across the k topics, where k = 20. v. Assign the word to a new topic following step (iii), and repeat.
The model can be written as in Eq. 5:

p(W | θ_d) = Π_n p(Wd_n | Zd_n) p(Zd_n | θ_d)    (5)

where W is a specific word, Z represents a specific topic, θ_d is the topic distribution for document d, p(Wd_n | Zd_n) is the probability of words in topics, and p(Zd_n | θ_d) is the probability of the distribution of topics in documents. LDA's approach to topic modeling is that it considers each document as a collection of topics in a certain proportion and each topic as a collection of words.
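The two-factor reassignment rule P(Topic t | Document d) × P(Word w | Topic t) can be illustrated with a small pure-Python sketch. The count tables and numbers below are invented for illustration; this is not the study's implementation.

```python
# Illustrative count tables of the kind built during LDA's random initialization:
# doc_topic[d][t]  = words in document d currently assigned to topic t
# topic_word[t][w] = corpus-wide assignments of word w to topic t
doc_topic = {0: {0: 4, 1: 1}}
topic_word = {
    0: {"vote": 5, "ballot": 3, "goal": 1},
    1: {"vote": 1, "ballot": 0, "goal": 6},
}

def reassignment_scores(word, d):
    """score(t) = P(t | d) * P(w | t): the quantity LDA uses when
    moving a word in document d to a new topic."""
    n_d = sum(doc_topic[d].values())
    scores = {}
    for t, words in topic_word.items():
        p_t_given_d = doc_topic[d].get(t, 0) / n_d
        p_w_given_t = words.get(word, 0) / sum(words.values())
        scores[t] = p_t_given_d * p_w_given_t
    return scores

scores = reassignment_scores("vote", 0)
best_topic = max(scores, key=scores.get)
```

Here "vote" is pulled toward topic 0 both because document 0 already favors topic 0 and because "vote" dominates topic 0's word counts; the two pressures multiply, which is exactly the intuition behind the reassignment step.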

Methods II-Gibbs sampling
Gibbs sampling operates on a word-by-word basis, modifying each word's topic assignment. It operates on the presumption that while the topic assignment of the given word is unknown, it is known for all other words in the text, and this information is utilized to infer the topic that will be given to this word. Taking m as the corpus of documents and k as the number of topics, the model randomly assigns k topics to all the words in the m documents. A document m_i in the corpus is represented as in Eq. 6:

m_i = {w_1, w_2, w_3, …, w_n}    (6)

Equation 6 lists the words in a document of the corpus, from word-1 to word-n. Assuming there are 3 topics to be considered, the model randomly assigns one of them to each word in m_i. The model then counts the number of words in the ith document belonging to the kth topic, denoted n(i, k); for example, n(1, 2) is the total number of words in the 1st document belonging to the 2nd topic. After generating a topic count per document, the model likewise generates a word count per topic for all documents; that is, it counts each word as belonging to a specific topic across all documents. To explore the entire space, a small number α, known as the Dirichlet parameter for the document-to-topic distribution, is added to n(i, k), giving n(i, k) + α. The model continues by decrementing the count for the topic currently allocated to the word: it subtracts 1 from the number of words N_i in the ith document and adds the product of the number of topics k and the hyperparameter α introduced earlier, giving N_i − 1 + kα. The probability of how much document d_i likes topic t_k can then be represented as:

p(t_k | d_i) = (n(i, k) + α) / (N_i − 1 + kα)

In order to take care of the word-to-topic assignment, the model introduces β to explore the entire space of the corpus.
The corpus-wide count of assignments of word w_j to the kth topic, m(j, k), is added to the hyperparameter β as was done for the document-to-topic assignment, giving m(j, k) + β. For instance, m(3, 4) denotes the count of the third word assigned to the fourth topic. As with the earlier decrement on the document-topic matrix, the count for the relevant topic is decremented. The model scans the entire vocabulary of the corpus, from the first word to the last, summing the counts and adding the product of the vocabulary size V and β, giving Σ_j m(j, k) + Vβ. The probability of how much topic t_k likes word w_j is then represented as:

p(w_j | t_k) = (m(j, k) + β) / (Σ_j m(j, k) + Vβ)

For the word w_j, the product of the probability of how much document d_i likes topic t_k and the probability of how much topic t_k likes word w_j is:

p(w_j | t_k, d_i) = p(t_k | d_i) × p(w_j | t_k)

In expanded form:

p(w_j | t_k, d_i) = [(n(i, k) + α) / (N_i − 1 + kα)] × [(m(j, k) + β) / (Σ_j m(j, k) + Vβ)]

The model locates the topic k for which p(w_j | t_k, d_i) is maximal for a given word w_j in document d_i and reassigns the word to that topic. The procedure is repeated for the specified number of iterations.
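The whole word-by-word procedure can be sketched as a minimal collapsed Gibbs sampler, using only the standard library. This is an illustration of the equations above, not the study's code; it samples the new topic from the conditional weights rather than taking the argmax described in the text — both variants are common, and the weights are the same in either case.

```python
import random

def gibbs_lda(docs, K, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Minimal collapsed Gibbs sampler for LDA: resample each word's topic
    with weight (n(i,k)+alpha) * (m(j,k)+beta) / (sum_j m(j,k) + V*beta)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    n_dk = [[0] * K for _ in docs]        # doc-topic counts n(i, k)
    m_wk = {w: [0] * K for w in vocab}    # word-topic counts m(j, k)
    n_k = [0] * K                         # total words per topic
    z = []                                # topic assignment per word
    for i, d in enumerate(docs):          # random initialization
        zd = []
        for w in d:
            k = rng.randrange(K)
            zd.append(k)
            n_dk[i][k] += 1; m_wk[w][k] += 1; n_k[k] += 1
        z.append(zd)
    for _ in range(iters):
        for i, d in enumerate(docs):
            for j, w in enumerate(d):
                k = z[i][j]               # decrement the current assignment
                n_dk[i][k] -= 1; m_wk[w][k] -= 1; n_k[k] -= 1
                weights = [(n_dk[i][t] + alpha) * (m_wk[w][t] + beta)
                           / (n_k[t] + V * beta) for t in range(K)]
                k = rng.choices(range(K), weights)[0]
                z[i][j] = k               # increment the new assignment
                n_dk[i][k] += 1; m_wk[w][k] += 1; n_k[k] += 1
    return z, n_dk, m_wk

docs = [["vote", "ballot", "vote"], ["goal", "match", "goal"]]
z, n_dk, m_wk = gibbs_lda(docs, K=2)
```

The decrement-score-reassign-increment cycle keeps the count tables consistent at every step, which is what allows the conditional for one word to be computed from all the other words' assignments.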

Methods III-proposed hybrid model
LDA and Gibbs are both used in the proposed hybrid model to generate results. Gibbs sampling operates on a word-by-word basis, modifying each word's topic assignment. It operates on the presumption that while the topic assignment of the given word is unknown, it is known for all other words in the text, and this information is utilized to infer the topic that will be given to this word. For a single word W in document d associated with topic Z = k, the conditional probability is:

p(Z = k | W, d) ∝ [(n(d, k) + α) / (N − 1 + Kα)] × [(m(W, k) + β) / (Σ_W m(W, k) + Vβ)]

where W is a specific word, Z represents the specific topic, θ is the topic distribution for the document, Φ is the word distribution for topic k, α is the Dirichlet parameter for the document-to-topic distribution, β is the Dirichlet parameter for the topic-to-word distribution, M represents the number of documents, N is the number of words in a given document, and K is the number of topics. The equation consists of two parts: the first indicates the percentage of each topic's presence in a document, and the second indicates the percentage of each topic's presence in a word. For the multi-level system, the output of LDA was considered first, followed by the output of the LDA/Gibbs algorithm, and finally the two were compared: we collected and compared the coherence scores for LDA and for LDA/Gibbs (the hybrid, multi-level concept). The architectural view of the proposed modeling process is presented in Fig. 2.
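The "feeder" step of the multi-level pipeline — handing LDA's per-word topic assignments to Gibbs as its starting state instead of a random initialization — might be sketched as follows. The assignments shown are hypothetical toy values, not the study's data.

```python
def counts_from_assignments(docs, assignments, K):
    """Turn per-word topic assignments (e.g. taken from a fitted LDA model)
    into the document-topic and word-topic count tables that a collapsed
    Gibbs sampler then resamples from."""
    n_dk = [[0] * K for _ in docs]   # n(i, k): doc-topic counts
    m_wk = {}                        # m(j, k): word-topic counts
    for i, d in enumerate(docs):
        for w, k in zip(d, assignments[i]):
            n_dk[i][k] += 1
            m_wk.setdefault(w, [0] * K)[k] += 1
    return n_dk, m_wk

docs = [["vote", "ballot"], ["goal", "match"]]
lda_assignments = [[0, 0], [1, 1]]   # hypothetical: LDA put doc 0 in topic 0
n_dk, m_wk = counts_from_assignments(docs, lda_assignments, K=2)
```

Seeding Gibbs from LDA's output rather than from random assignments is the design choice the hybrid model rests on: the sampler starts near a reasonable topic structure and refines it word by word.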

Coherence Score
Topic coherence is an important metric for assessing the quality of a particular topic model and is used in topic model evaluation. It is calculated over the top n most frequent words of a topic. For one topic, the word pairs (w_i, w_j) are scored as in Eq. 18:

Coherence = Σ_{i < j} score(w_i, w_j)    (18)

where w_i and w_j are the top words of the topic, p(w_i) is the probability of seeing w_i in a random document, and p(w_i, w_j) is the probability of seeing both w_i and w_j co-occurring in a random document.
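Eq. 18 leaves the pairwise score unspecified; a PMI-style score built from the document probabilities just defined is one common choice, and is assumed in the following stdlib-only sketch (the toy documents are illustrative):

```python
import math

def coherence(top_words, docs, eps=1e-12):
    """Sum over pairs i < j of log((p(w_i, w_j) + eps) / (p(w_i) * p(w_j) + eps)),
    a PMI-style instance of the score(w_i, w_j) in Eq. 18."""
    n = len(docs)
    def p(*words):
        # fraction of documents containing all the given words
        return sum(all(w in d for w in words) for d in docs) / n
    total = 0.0
    for i in range(len(top_words)):
        for j in range(i + 1, len(top_words)):
            wi, wj = top_words[i], top_words[j]
            total += math.log((p(wi, wj) + eps) / (p(wi) * p(wj) + eps))
    return total

docs = [
    {"vote", "ballot", "campaign"},
    {"vote", "ballot", "president"},
    {"goal", "match", "player"},
]
coherent = coherence(["vote", "ballot"], docs)   # co-occur in 2 of 3 docs
incoherent = coherence(["vote", "goal"], docs)   # never co-occur
```

Word pairs that co-occur more often than chance push the score up, while pairs that never share a document drive it sharply down, which is why coherence separates well-formed topics from bags of unrelated words.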

Discussion of the findings
The experiment setup for topics modeling was implemented on Anaconda Jupyter with GPU capability using the following packages: Numpy, Pandas, pyLDAvis, and the NLTK in python 3.6.

LDA Topics
The LDA model is composed of many topics, each made up of several keywords, and each keyword gives a certain weight to the topic. As demonstrated in Fig. 3, the keywords for each topic and their relative relevance are listed. The results of the various topics, where each topic is made up of several keywords and each keyword gives the topic a particular weight, are displayed in Fig. 3. It is observed that Topic 2 relates to buying and selling, though not in much detail, while Topic 0 is about anything related to ICT, mainly hardware. Previous works failed to provide the detailed weightage of generated topics. Works like [32] also failed to show how Gibbs could be used to improve LDA's topic generation or to evaluate the generated topics to assess the quality of the extracted topics. Figure 3 shows the combination of keywords and how each keyword contributes to the weightage of the topic. Topic 0 is represented as (0, '0.036*"law" + 0.030*"child" + 0.030*"government" + 0.029*"gun" + 0.027*"people" + 0.027*"kill" + 0.026*"state" + 0.023*"death" + 0.021*"right" + 0.020*"die"'). This means the top 10 keywords that contribute to this topic are 'law', 'child', 'government', and so on, and the weight of 'law' on topic 0 is 0.036. The weights represent a keyword's relative importance to the topic. Looking at these keywords, one can guess that this topic could be related to "terror" or "negativity", which indicates that the dataset was tailored more toward negativity.

Visualize the topics-keywords
Following the construction of the LDA model, the generated topics and their related keywords are analyzed using an intertopic distance map, which displays the fraction of words associated with each topic. On the left-hand side of Fig. 4, each bubble represents a topic. The bigger the bubble, the more prevalent that topic is. A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant; a model with too many topics will frequently have overlapping, small bubbles grouped in one area of the chart. Topic 4 has the most overlap with the other topics, as illustrated in Fig. 4, implying it has the most relevant keywords and meaning. In addition, topic 1 had the highest likelihood of being discussed if words were chosen at random from the dataset, followed by topics 2, 3, and 4. The main topics 1, 2, 3, and 4 were quite far from all other topics. The size of the circle can also be used to interpret word prediction in the outcome: the bigger the circle, the higher the topic's predicted prevalence.

Frequency distribution of word counts in documents
Knowing the size of the documents, overall and by topic, is necessary when working with a large number of documents, so the documents' word counts were plotted. Figure 5 shows the distribution of document word counts, with the number of documents on the y-axis and the document word count on the x-axis. The plot showed that words that occurred frequently in the documents were far more numerous than those that appeared rarely. The peak document word count was between 0 and 110.

Distribution of document word counts by dominant topic
The distribution of document terms by topic, which dominates the dataset, is examined in Fig. 6. Different topics were found, as described before. It is observed that Topics 2 and 1 have a greater number of words under consideration than the other topics; Topic 2 is the dominant topic in the document. Each of the topics is distributed similarly.

Word count and importance of topic keyword
When it comes to the keywords in the topics, the importance (weights) of the keywords matters, along with how frequently the words appear in the documents. Figure 7 shows the significance of the keywords in each topic as indicated by the weights. Where there is overlap, it is possible to reduce the number of topics by adding frequent terms to the stop words, ensuring that common words are not taken into consideration. The keywords for each topic are plotted in Fig. 7, which shows words that occur in multiple topics as well as the ones whose relative frequency carries more weight; such words often turn out to be less important. The chart drawn is the result of adding several such words to the stop-words list at the beginning and re-running the process. It is observed that the Topic 3 chart has an evenly distributed word count.

LDA model coherence score
A way to assess a topic model's quality is topic coherence. Table 2 shows the coherence score (the evaluation metric used to judge how good a topic is) of the LDA output:

LDA/Gibbs model topics
Gibbs operates on the presumption that while the topic assignment of the supplied word is unknown, it is known for all other words in the text, and this information is utilized to deduce the topic that will be given to this word. The objective of the study is to present that the proposed multi-level approach provides extracted topics of higher quality. The output shown in Fig. 8 can be used to calculate coherence after creating the topics for LDA-Gibbs sampling. It is observed in Topic 1 that the country "Nigeria" suffered an "attack", people were "killed", "unknown" "people" killed "children" on "land" and the "government" failed to do anything. Topic 2 was about health and how "doctors" "treated" "patients" with different "diseases" (Tables 3 and 4).

Proposed model coherence score
A way to assess a topic model's quality is topic coherence. The proposed hybrid model's coherence score is displayed in Table 5, showing a significant increase in the coherence score as compared to LDA.

Proposed model comparison with LDA
According to the evaluation, the proposed model had a coherence score of 0.5265013 as opposed to LDA's coherence score of 0.465038, indicating that it had retrieved topics of higher quality. In comparison to LDA, the proposed multilevel model offered a more accurate and practical way to assess the quality of a particular topic.

Conclusion
This study gave a better understanding of topic modeling. One of the main problems with LDA is that if the document does not coherently explore a single topic, the quality of the topics extracted is poor. However, Gibbs sampling with LDA can be used on documents having different topics.
The strength of the suggested multi-level SSM for topic modeling utilizing LDA and Gibbs sampling is backed up by the better coherence score it produces. This was demonstrated by using the output of LDA as a feeder to Gibbs sampling to increase the coherence score (which is used to determine the quality of extracted topics). Future research can compare LDA and Gibbs sampling with additional conditional distribution approaches such as variational inference, which was previously discussed in [35]. The visualization of the results of Gibbs sampling and variational inference can also be highlighted in future works.
Author contributions The research's conception and design were influenced by the work of all contributors. Data gathering, material preparation, and analysis were completed by MOA. The first draft of the manuscript was written by AAO, UCC, POS. All the authors read and approved the final manuscript and also agreed to all the content of the article including the author list and contributions.
Funding There was no outside funding for the study.

Data availability
The data that support the findings of this study are openly available in the Newsgroup Master Datasets at https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json.