This section describes the features of the dataset used in the study and the data preprocessing techniques applied prior to feature extraction.
3.1 Dataset
To conduct the clustering and topic prediction experiments, we used the benchmark COVID-19 Open Research Dataset (CORD-19), which contains more than 900,000 academic articles about COVID-19, SARS-CoV-2, and related coronaviruses, assembled for use by the worldwide research community; of these, we considered the 801,857 documents published after January 2020 [8]. CORD-19 is a growing collection of scientific papers on COVID-19 and related past coronavirus research, designed to enable the development of text mining and information retrieval techniques over its rich metadata and structured full-text papers.
3.2 Data Preprocessing
Data preprocessing transforms raw data into a form suitable for machine learning models. Real-world data is often inaccurate, inconsistent, or incomplete, and is likely to contain many errors; preprocessing is a proven way to address these problems and prepare the raw data for further processing. The input data is first pipelined through data cleaning, data integration, data transformation, data reduction, and data discretization steps. The following preprocessing techniques and algorithms are used in this study.
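As an illustration, the cleaning step of such a pipeline can be sketched as below. This is a minimal example, not the exact pipeline used in the study; the stop-word list is a small illustrative subset.

```python
import re

# Small illustrative stop-word list; a real pipeline would use a fuller set.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "for", "to", "is"}

def preprocess(text: str) -> list[str]:
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The spread of SARS-CoV-2 in 2020 is studied."))
# → ['spread', 'sars', 'cov', 'studied']
```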
BERT (Bidirectional Encoder Representations from Transformers): BERT is a natural language processing model proposed by Google researchers [11]. Combined with class-based TF-IDF, transformer (BERT) embeddings support a topic modeling approach that generates dense clusters whose resulting topics can be easily interpreted and visualized. We used Sentence Transformers in this paper.
Latent Dirichlet Allocation (LDA): LDA is a widely used algorithm based on statistical topic models. Using this model, topics can easily be discovered and visualized from the documents.
Clustering: Clustering is an unsupervised technique that groups similar data points together. We use clustering to group the documents [12] so that topics can be found within the resulting clusters. We used the k-means algorithm for clustering.
K-means Clustering: K-means is an unsupervised learning algorithm that partitions an unlabeled dataset into k groups [13], where k is the pre-defined number of clusters; for example, k = 8 yields 8 clusters. To identify the appropriate number of clusters, we applied the Elbow Method.
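The Elbow Method can be sketched as follows: run k-means for a range of k and inspect the within-cluster sum of squares (inertia), which drops sharply until the true cluster count and flattens afterwards. The synthetic data and parameter values below are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three well-separated 2-D blobs of 50 points each.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

# Inertia (within-cluster sum of squares) for k = 1..6.
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia decreases as k grows; the "elbow" (here at k = 3) is where the
# marginal improvement flattens, suggesting the cluster count to use.
print([round(v, 1) for v in inertias])
```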
Dimensionality Reduction: Dimensionality reduction is a significant step in text mining. It enhances the performance of clustering methods by reducing the number of terms that text mining techniques must process. Multiple methods are available for dimensionality reduction; here we used PCA, UMAP, and t-SNE.
PCA (Principal Component Analysis): PCA is a dimensionality reduction method applied to massive datasets that transforms a large collection of variables into a smaller one while retaining most of the information in the original data.
UMAP (Uniform Manifold Approximation and Projection): UMAP is a relatively recent dimension reduction technique. It performs comparatively well and preserves a considerable portion of the high-dimensional structure in the lower-dimensional embedding.
t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a dimensionality reduction technique particularly well suited to visualizing high-dimensional datasets.
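The reduction step can be sketched with scikit-learn as below, using synthetic data as a stand-in for the document vectors. UMAP (from the umap-learn package) exposes the same `fit_transform` interface, so it is shown only in a comment; the dimensions and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))   # stand-in for high-dimensional document vectors

# PCA: linear projection onto the directions of maximal variance.
X_pca = PCA(n_components=2, random_state=0).fit_transform(X)

# t-SNE: nonlinear embedding suited to visualization; it is common to run
# it on PCA-reduced data to speed up the computation.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)

# UMAP (umap-learn) would be used analogously:
#   X_umap = umap.UMAP(n_components=2).fit_transform(X)

print(X_pca.shape, X_tsne.shape)
# → (100, 2) (100, 2)
```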
The following section describes the experimental setup and analyzes the results in detail.