Type2 IFC with SOA for Topic Detection and Document Clustering Analysis

Abstract: Automatic document clustering and topic extraction from a corpus is an essential requirement in many real-time applications, since clustering and topic detection allow data to be located quickly. Hence, in this paper, Type 2 Intuitionistic Fuzzy Clustering with the Seagull Optimization Algorithm (Type 2 IFCSOA) is developed for document clustering and topic detection. The Type 2 IFCSOA is utilized to cluster the documents, and an ensemble approach is utilized to identify the topics from the clustered documents. In the proposed methodology, pre-processing steps such as tokenization, stop-word removal and stemming are utilized to remove unwanted information from the documents. After that, the proposed method is utilized to cluster the documents, and the clustered documents are labeled on the basis of their clusters. To achieve topic detection, the ensemble approach is then utilized with feature extraction phases such as Term Frequency-Inverse Document Frequency (TF-IDF), Mutual Information (MI), the TextRank algorithm and keyword extraction from Co-occurrence Statistical Information (CSI). The proposed methodology is implemented in MATLAB, and its performance is evaluated with statistical measurements such as precision, recall, accuracy, sensitivity, purity measure and entropy. The proposed method is compared with conventional methods such as Fuzzy C-Means clustering (FCM), FCM-Particle Swarm Optimization (FCM-PSO), FCM-Genetic Algorithm (FCM-GA) and K-means clustering.


Introduction
A large number of online scientific publications are published on the World Wide Web every day in the digital era [1]. There is growing interest in identifying topics in published scientific papers, including defining the boundaries of scientific fields and identifying innovative capabilities [2], which in turn is central to the process of producing scientific knowledge. Resource-intensive research demands knowledge of earlier research and technology [3]. These in turn affect subjects and form the foundation for advanced research that builds on earlier developments to achieve new findings and advances. Academic literature is generally a reliable data source [4]. The growth rate of knowledge and science has been examined through publications, and academic publications make scientific communication possible. Written documents therefore contain scientific explanations and make up the existing scientific literature [5].
Topic extraction (TE), also known as 'theme discovery,' 'theme identification,' 'tag identification,' 'keyword extraction,' 'category tagging' and 'cluster tagging,' is useful in a number of practical applications [6]. For example, by examining new publications in computer science, it is possible to identify increasingly important areas; in the foreseeable future, their trends and popularity may be further predicted [7]. There is, however, much uncertainty about how these topics are defined, and there is a continuing discussion about how to extract them automatically. Furthermore, manual methods of extracting these subject areas are slow, costly and error-prone [8,9]. One of the most common techniques for identifying topics is to cluster papers in order to recognize coherent collections of academic papers that represent an associated topic. The most relevant terms are then extracted and classified from each cluster [10].
Text Document Clustering (TDC) [11] is one of the strongest and most effective unsupervised text-mining learning techniques. In this method, documents that share similar keywords are grouped and placed in the same cluster, while dissimilar documents are located in different clusters.
Metaheuristic algorithms have made considerable progress in recent years in solving the TDC problem [12]. This is because of the weaknesses of today's deterministic methods in finding globally optimal solutions to the problem. The majority of these algorithms are evolutionary algorithms (EA) [13], based on survival-of-the-fittest theory, applied to document clustering [14]. Clustering methods have also been improved through artificial intelligence (AI) techniques [15] such as Particle Swarm Optimization (PSO), the Whale Optimization Algorithm (WOA) and Grey Wolf Optimization (GWO).

The main contribution of the paper
The main contributions of the paper are presented as follows:
 In this paper, Type 2 IFCSOA is developed for document clustering and topic detection. The Type 2 IFCSOA is utilized to cluster the documents.
 Additionally, an ensemble approach is utilized to identify the topics from the clustered documents.
 In the proposed methodology, pre-processing steps such as tokenization, stop-word removal and stemming are utilized to remove unwanted information from the documents. After that, the proposed method is utilized to cluster the documents, and the clustered documents are labeled on the basis of their clusters.
 To achieve topic detection, the ensemble approach is utilized with feature extraction phases such as TF-IDF, MI, the TextRank algorithm and CSI.
 The proposed methodology is implemented in MATLAB, and its performance is evaluated with statistical measurements such as precision, recall, accuracy, sensitivity, purity measure and entropy. The proposed method is compared with conventional methods such as FCM, FCM-PSO, FCM-GA and K-means clustering.
The remainder of the paper is organized as follows: Section 2 reviews existing work on topic detection and document clustering. A detailed description of the proposed topic detection and document clustering method is presented in Section 3. The results and discussion of the proposed methodology are presented in Section 4. The conclusion of the paper is presented in Section 5.

Literature Review
Many different methods are available for topic detection and document clustering. Some of these methods are reviewed in this section.
Stephan A. Curiskis et al. [16] analyzed word embedding models for document clustering and topic modeling on three Twitter and Reddit datasets. The benchmark combined four different feature representations, including term frequency-inverse document frequency (tf-idf) and word embedding models, with four clustering methods. Since several different evaluation measures have been used in the literature, a discussion and recommendation of the most appropriate extrinsic measures for this task was also provided.
Peng Yang et al. [17] developed a probabilistic tpLDA model, which incorporates different levels of topic popularity information to determine the prior distribution of the LDA, identify latent topics and improve clustering. In particular, global topic popularity is introduced to reduce potential distraction from the popularity of local clusters, while local popularity focuses more on parts of the global theme. The two popularities provide complementary information, and the model's statistical parameters can be adjusted dynamically by their integration.
An ensemble method for automatic topic extraction from a series of scientific publications in text documents was introduced by Kamal Abasi et al. [18], intended to extract topics from clustered documents. Current TE methods often draw on statistical theory; however, even when the same clustered documents are used, the results can differ. Consequently, owing to the behavior of TE methods, inaccurate results can be found in the topics extracted from the clustered papers.
In order to build an effective document representation model, Nabil Alami et al. [19] submitted an approach using clustering, topic modeling and unsupervised neural networks. First, a document grouping technique using an extreme learning machine was applied to a large text collection. Second, topic modeling was applied to the grouped documents so that the subjects present in each cluster could be identified. Third, every document was represented in a topic space by a matrix whose rows represent the sentences of the document.
Di Wu et al. [20] presented a short-text clustering algorithm for micro-blog hot topic discovery based on the linear fusion of BTM and GloVe similarities (BG and SLF-Kmeans). The preprocessed micro-blog short texts were modeled by BTM and GloVe. To calculate text similarity under BTM topic modeling, the JS divergence was adopted. For text similarity based on GloVe word-vector modeling, WMD with improved word weights (IWMD) was used. Finally, these two similarities are linearly fused and used as the distance function to perform K-means clustering.

Proposed System Model
Nowadays, topic extraction and document clustering are essential for many real-world applications. Moreover, since there is uncertainty related to defining these topics, an efficient topic detection scheme is designed. In this paper, Type 2 IFCSOA is developed for document clustering and topic detection. The complete architecture of the proposed system for document clustering and topic detection is presented in figure 1.

Pre-processing
Text documents are presented in an electronic format whose usage has increased; hence, text clustering has become a required technique to arrange the documents into clusters. Text clustering techniques mostly aim to create clusters of text documents on the basis of their intrinsic contents. Before the clustering method starts, the text documents should be processed with pre-processing methods such as tokenization [21], removal of stop words and stemming [22]. Hence, the text documents are changed into the required format.
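The three pre-processing steps can be sketched as follows. This is a minimal illustration, not the paper's implementation: the stop-word list and suffix rules are hypothetical stand-ins for a full stop-word corpus and a Porter-style stemmer.

```python
import re

# Illustrative stop-word list and suffix rules (not exhaustive).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}
SUFFIXES = ("ing", "edly", "ed", "es", "s")

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Strip the first matching suffix (a crude stand-in for Porter stemming)."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(document):
    """Tokenization -> stop-word removal -> stemming."""
    return [stem(t) for t in remove_stop_words(tokenize(document))]

print(preprocess("The clustering of documents is performed"))
```

After this step each document is reduced to a list of normalized term stems, which the clustering stage operates on.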

Type-2 Intuitionistic Fuzzy C-Means Clustering for document clustering
The proposed clustering is utilized for topic detection and document clustering: the documents are clustered on the basis of their topics. The proposed clustering algorithm is an extension of Fuzzy C-Means clustering (FCM) and handles more uncertainty in the data than FCM [23]. In the proposed methodology, two different fuzzifiers $m_1$ and $m_2$ are utilized, which define an interval-valued fuzzy degree. The objective function is mathematically formulated as follows:

$$J_m = \sum_{j=1}^{C}\sum_{i=1}^{N} u_{ij}^{m}\, d^2(x_i, v_j),$$

where $d^2(x_i, v_j)$ is the squared Euclidean distance between the $j$th cluster center $v_j$ and the $i$th pattern $x_i$, $u_{ij}$ is the membership function of the $i$th pattern related to the $j$th cluster, $C$ is the number of clusters and $N$ is the number of data $X = \{x_1, x_2, \dots, x_N\}$. In the proposed clustering, the lower and upper membership functions obtained from the two fuzzifiers are formulated as follows:

$$\bar{u}_{ij} = \max\big(u_{ij}(m_1),\, u_{ij}(m_2)\big), \qquad \underline{u}_{ij} = \min\big(u_{ij}(m_1),\, u_{ij}(m_2)\big), \qquad u_{ij}(m) = \Bigg[\sum_{k=1}^{C}\bigg(\frac{d(x_i, v_j)}{d(x_i, v_k)}\bigg)^{2/(m-1)}\Bigg]^{-1}.$$

To compute the minimum value $v_j^{L}$ and the maximum value $v_j^{R}$ of the $j$th cluster center, the Karnik-Mendel iterative algorithm is utilized. While the iterative algorithm runs, the right and left memberships of a pattern are calculated over its features as follows:

$$u_j^{R}(x_i) = \frac{1}{M}\sum_{l=1}^{M} u_{ij,l}^{R}, \qquad u_j^{L}(x_i) = \frac{1}{M}\sum_{l=1}^{M} u_{ij,l}^{L},$$

where $M$ is the number of features of a pattern, $u_j^{R}(x_i)$ is the right membership function and $u_j^{L}(x_i)$ is the left membership function. The final membership matrix and crisp centroids are then obtained through type reduction followed by defuzzification.
With the help of the proposed clustering method, the documents are clustered. The fuzzy membership function is an essential parameter for achieving the best clusters of the documents. The optimal fuzzy membership function is selected by the SOA process. A detailed description of the proposed SOA-based fuzzy membership function selection is presented in the section below.

Seagull Optimization Algorithm
In this proposed methodology, the seagull optimization algorithm (SOA) is developed to select the optimal fuzzy membership function parameter in Type-2 Intuitionistic Fuzzy C-Means Clustering. A detailed description of the seagull optimization algorithm is presented in this section. Seagulls, formally the family Laridae, are sea birds found throughout the world. Seagulls are omnivorous and eat insects, earthworms, amphibians, rodents and fish. Seagulls are intelligent birds [24]: they attract fish with breadcrumbs and produce a rain-like sound with their feet to draw out earthworms. Seagulls drink both fresh and salt water; a special pair of glands above their eyes is used to flush the salt from their systems. Seagulls live in colonies, and each seagull computes the location of prey based on its own knowledge. In seagulls, the migration and attacking mechanisms are the most significant behaviours. Migration is defined as the seasonal movement of seagulls from one place to another in search of the richest and most abundant food sources. The mathematical models of migration and attacking the prey are presented as follows.

Migration
In the migration process, the algorithm simulates how the seagulls move from one position to another. During migration, a seagull must satisfy three conditions.
Avoiding collisions: In the SOA, collisions among neighbours should be avoided, and an additional variable is employed to compute the new search agent position:

$$C_s(t) = A \times P_s(t),$$

where $C_s(t)$ is the search agent position that does not collide with the remaining search agents, $P_s(t)$ is the current position of the search agent, $t$ is the current iteration and $A$ represents the movement characteristics of the search agent:

$$A = f_c - \Big(t \times \frac{f_c}{Max_{iteration}}\Big), \qquad t = 0, 1, 2, \dots, Max_{iteration},$$

where $f_c$ (set to the value 2) controls the frequency of employing the variable $A$, which is linearly reduced from $f_c$ to 0.

Movement towards the best neighbour's direction:
After avoiding collisions among neighbours, the search agents move towards the direction of the best neighbour:

$$M_s(t) = B \times \big(P_{bs}(t) - P_s(t)\big),$$

where $M_s(t)$ represents the movement of the search agent $P_s(t)$ towards the best-fit search agent $P_{bs}(t)$, and $B$ is a random variable responsible for the efficient balancing between exploration and exploitation:

$$B = 2 \times A^2 \times rd,$$

where $rd$ is a random number generated in the range $[0, 1]$.

Best search agent:
At last, the search agent position is updated based on the equation below:

$$D_s(t) = \big|\,C_s(t) + M_s(t)\,\big|,$$

where $D_s(t)$ is the distance between the best-fit search agent and the current search agent.

Attacking the prey
The motive of exploitation is to exploit the history and experience of the search procedure. During migration, the seagulls can change the angle and speed of attack [25]. They maintain their altitude using their wings and weight. While attacking the prey, a spiral movement behaviour occurs in the air.
This spiral movement behaviour in the $x$, $y$ and $z$ planes can be presented as follows:

$$x' = r \cos(k), \qquad y' = r \sin(k), \qquad z' = r\,k, \qquad r = u\,e^{kv},$$

where $e$ is the base of the natural logarithm, $u$ and $v$ are the spiral shape constants, $k$ is a random number within the range $[0, 2\pi]$ and $r$ is the radius of each turn of the spiral. The search agent updating process is computed by the equation below:

$$P_s(t) = \big(D_s(t) \times x' \times y' \times z'\big) + P_{bs}(t),$$

where $P_{bs}(t)$ is the optimal solution; the positions of the other search agents are updated accordingly. The flowchart of the SOA is presented in figure 2.
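One iteration of the migration and attack equations above can be sketched as follows. This is an illustrative sketch, not the paper's MATLAB implementation: the parameter values ($f_c = 2$, spiral constants $u = v = 1$) follow common SOA defaults, and the fitness evaluation that selects the best agent is assumed to happen outside this function.

```python
import math
import random

def soa_step(positions, best, t, max_iter, fc=2.0, u=1.0, v=1.0, rng=random):
    """Update every search agent's position for iteration t of the SOA."""
    A = fc - t * (fc / max_iter)                 # linearly decreased from fc to 0
    new_positions = []
    for P in positions:
        B = 2.0 * A * A * rng.random()           # balances exploration/exploitation
        C = [A * p for p in P]                   # collision avoidance: C_s = A * P_s
        M = [B * (b - p) for b, p in zip(best, P)]  # move towards best: M_s
        D = [abs(c + m) for c, m in zip(C, M)]   # distance to best: D_s = |C_s + M_s|
        k = rng.uniform(0.0, 2.0 * math.pi)      # random spiral angle
        r = u * math.exp(k * v)                  # spiral radius r = u * e^(k v)
        spiral = (r * math.cos(k)) * (r * math.sin(k)) * (r * k)
        new_positions.append([d * spiral + b for d, b in zip(D, best)])
    return new_positions
```

As $t$ approaches `max_iter`, $A$ shrinks to 0, so the collision-avoidance term vanishes and the agents converge spirally around the best solution.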

Fitness function
Once the initial population is generated, the fitness function is computed; based on the fitness function, the fuzzy membership function of the fuzzy clustering is selected. The fitness function is evaluated in terms of distance, so it is formulated as the minimization of distance:

$$D(d_1, d_2) = \Bigg(\sum_{i=1}^{t} \big|\,w_{1,i} - w_{2,i}\,\big|^2\Bigg)^{1/2},$$

where $t$ is the number of terms contained in the documents, and $d_1 = (w_{11}, w_{12}, \dots, w_{1t})$ and $d_2 = (w_{21}, w_{22}, \dots, w_{2t})$ are the term-weight vectors of documents 1 and 2, respectively. Based on the fitness function, the optimal fuzzy membership function is selected, which enhances the clustering process. The proposed SOA starts with a randomly generated population. After that, the search agents update their positions relative to the optimal search agent over the iterations. The value of $A$ is linearly reduced from $f_c$ to 0, and the variable $B$ is responsible for the smooth transition between exploration and exploitation.
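The distance-based fitness above amounts to the Euclidean distance between two documents' term-weight vectors, as in this small sketch (the weight values are hypothetical):

```python
import math

def fitness(d1, d2):
    """Euclidean distance between two equal-length term-weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

# Two toy documents described by three term weights each.
print(fitness([1.0, 0.0, 2.0], [1.0, 2.0, 0.0]))
```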
So, the SOA can attain the global optimal solution because of its strong exploration and exploitation capability. Once the documents are clustered, topic detection proceeds with feature extraction and the ensemble method. The ensemble method is utilized to detect topics from the clustered documents.

Proposed Ensemble method for topic detection
Based on the proposed clustering algorithm, labelling the clusters becomes essential, so the essential topics must be detected for each cluster. Initially, keyword extraction is an essential requirement for topic classification. The RAKE [26] algorithm for automatic keyword extraction is used to compute the keywords of the documents related to each cluster. In the RAKE algorithm, pre-processing steps such as stop-word removal and a frequency cut are applied; the frequency cut allows many less common terms to be omitted. The extracted keywords are then sent to the feature extraction methods for topic calculation. The feature extraction methods are explained in the following section.
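The RAKE step above can be sketched as follows. This is a compact illustration of RAKE-style scoring rather than the full published algorithm: candidate phrases are the runs of words between stop words, each word is scored by degree/frequency, and a phrase scores the sum of its word scores; the stop-word list is illustrative, and the frequency cut is omitted.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "of", "and", "is", "a", "to", "in", "for"}  # illustrative

def rake_keywords(text):
    """Return candidate phrases ranked by RAKE-style degree/frequency scores."""
    words = re.findall(r"[a-z]+", text.lower())
    # Split the word sequence into candidate phrases at stop words.
    phrases, current = [], []
    for w in words:
        if w in STOP_WORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    # Word score = degree / frequency.
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)   # degree counts co-occurring words too
    scores = {w: degree[w] / freq[w] for w in freq}
    ranked = {" ".join(p): sum(scores[w] for w in p) for p in phrases}
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)
```

Multi-word phrases accumulate their members' scores, so they naturally outrank isolated words, which matches RAKE's preference for longer keyword phrases.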

Feature extraction
Feature extraction is an essential requirement to classify the documents based on their topics. In this proposed methodology, the MI, TextRank algorithm, CSI and TF-IDF features are utilized to classify the documents. The mathematical computation of each feature extraction method is presented in this section.

MI
The MI is utilized to compute how much information the absence or presence of a specific term contributes to determining the correct cluster. Based on information theory, mutual information reduces the uncertainty of the cluster [27]. The mutual information between a specific term and a cluster is presented as follows:

$$MI(t, c) = \log \frac{P(t, c)}{P(t)\,P(c)},$$

where $t$ is a specific term, $c$ is a cluster, and $P(t)$ and $P(c)$ are the probabilities that the events occur independently; $t = 0$ indicates that the document does not contain the term, and $c = 0$ indicates a document that does not belong to the cluster. These probability values are computed by dividing the observed event frequencies by the number of documents.
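Estimating MI from counts, as described above, can be sketched as follows. The 2x2 contingency counts are made-up for illustration: `n11` is the number of documents in the cluster that contain the term, `n10` contain the term but lie outside the cluster, and so on.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI(t, c) = log2( P(t, c) / (P(t) P(c)) ) from a 2x2 contingency table."""
    n = n11 + n10 + n01 + n00
    p_tc = n11 / n             # P(term present AND document in cluster)
    p_t = (n11 + n10) / n      # P(term present)
    p_c = (n11 + n01) / n      # P(document in cluster)
    return math.log2(p_tc / (p_t * p_c))

# Term and cluster strongly associated: MI is positive.
mi = mutual_information(40, 10, 10, 40)
```

A positive score means the term occurs in the cluster more often than independence would predict; near zero means the term carries no information about cluster membership.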

TextRank Algorithm
The TextRank algorithm is a graph-based ranking model for text mining. It can be utilized in many natural language processing tasks, including keyword and sentence extraction from input documents. In the TextRank algorithm, the extracted keywords are tokenized and become the graph nodes; an edge is then generated between two nodes when the corresponding words co-occur within a fixed window. The algorithm scores each term on the basis of the votes it receives from its neighbours. Each term score is computed as follows:

$$S(V_i) = (1 - d) + d \times \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|},$$

where $In(V_i)$ is the set of vertices that point to the vertex $V_i$, $Out(V_j)$ is the set of vertices that the vertex $V_j$ points to, and $d$ is a damping factor, often set to 0.85.
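The score iteration above can be sketched on an undirected co-occurrence graph, where each neighbour's score is divided by its degree. This is a minimal illustration: the tiny adjacency-list graph is made-up, and a fixed iteration count stands in for a proper convergence test.

```python
def textrank(graph, d=0.85, iterations=50):
    """Iterate S(v) = (1 - d) + d * sum_u S(u) / deg(u) over neighbours u of v."""
    scores = {v: 1.0 for v in graph}
    for _ in range(iterations):
        new_scores = {}
        for v in graph:
            new_scores[v] = (1 - d) + d * sum(
                scores[u] / len(graph[u]) for u in graph[v]
            )
        scores = new_scores
    return scores

# Toy co-occurrence graph: adjacency lists of word nodes.
graph = {
    "topic": ["detection", "cluster"],
    "detection": ["topic"],
    "cluster": ["topic", "document"],
    "document": ["cluster"],
}
scores = textrank(graph)
```

Well-connected words such as "topic" and "cluster" end up with higher scores than the leaf words, which is exactly the voting behaviour the formula describes.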

CSI
The CSI technique computes the significant terms with respect to the word co-occurrence distribution over the documents and the distribution of similar documents in a cluster. Initially, the frequent terms are computed; co-occurrences are then counted within the same document between the frequent terms and the remaining terms [28]. The degree to which a term deviates from the expected probability distribution of the frequent terms is computed as follows:

$$\chi^2(w) = \sum_{g \in G} \frac{\big(freq(w, g) - n_w\,p_g\big)^2}{n_w\,p_g},$$

where $G$ is the set of frequent terms, $n_w$ is the total number of co-occurrences of the term $w$, $p_g$ is the expected probability of the specific frequent term $g$ and $\chi^2$ is the statistics-based keyword extraction score. Hence, the robustness is enhanced with the consideration of document clustering. To enhance the performance of the statistic, it can be reformulated by discounting the single largest contribution:

$$\chi'^2(w) = \chi^2(w) - \max_{g \in G} \frac{\big(freq(w, g) - n_w\,p_g\big)^2}{n_w\,p_g}.$$

Normally, the CSI proceeds in six phases. The first stage performs stemming on the text document. After that, the Apriori algorithm is utilized for phrase extraction, and the frequent terms are selected from the extracted phrases. Using the Jensen-Shannon divergence, the frequent terms are clustered, and the probability distribution of the documents is calculated. Finally, the keywords are extracted based on the $\chi^2$ value formulated above.
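The chi-square score above can be sketched as follows. The co-occurrence counts and expected probabilities in the example are made-up: `cooc[g]` is how often the word co-occurs with frequent term `g`, and `expected_p[g]` is the share of co-occurrences that `g` would attract under the expected distribution.

```python
def chi_square(cooc, expected_p):
    """CSI chi-square score: deviation of a word's co-occurrence counts
    with the frequent terms from their expected values n_w * p_g."""
    n_w = sum(cooc.values())
    return sum(
        (cooc[g] - n_w * expected_p[g]) ** 2 / (n_w * expected_p[g])
        for g in cooc
    )

# A word that co-occurs with "cluster" far more than expected is biased,
# hence a likely keyword.
score = chi_square({"cluster": 8, "topic": 2}, {"cluster": 0.5, "topic": 0.5})
```

A high score means the word's co-occurrence pattern is strongly biased towards particular frequent terms, which CSI takes as evidence of a significant keyword.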

TF-IDF
The TF-IDF is used to represent each document extracted from the text collection as a vector of TF-IDF weights. The TF-IDF [29] value of a term in a document is the product of the term frequency and the inverse document frequency. This is formulated as follows:

$$w_{ij} = tf_{ij} \times \log \frac{N}{df_j},$$

where $df_j$ is the quantity of documents that contain the term $j$ at least once, $tf_{ij}$ is the frequency of the term $j$ in the document $i$ and $N$ is the total number of documents. The complete features are then combined, and the topics are classified with the ensemble method. The outcomes of the proposed methodology are evaluated in the section below.
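The weighting formula above can be sketched as follows; the two-document toy corpus is illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, w = tf * log(N / df)."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]

weights = tfidf([["topic", "cluster"], ["topic", "detection"]])
```

Note that a term appearing in every document (here "topic") gets weight 0: it carries no discriminating power between documents, which is exactly what the inverse-document-frequency factor encodes.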

Results and Discussion
The proposed method is used for topic detection and document clustering. The parameter settings of the compared optimization algorithms include a population size of 20, an inertia weight of 0.8, an inertia weight damping ratio of 1, a personal learning coefficient of 1.0 and a global learning coefficient of 1.5. The confusion matrix is computed based on the following conditions:  A topic is actually present and is detected as present, which is named True Positive (TP).
 A topic is actually not present and is detected as not present, which is named True Negative (TN).
 A topic is actually not present but is detected as present, which is named False Positive (FP).
 A topic is actually present but is detected as not present, which is named False Negative (FN).
Based on these confusion-matrix terms, the proposed methodology is evaluated by the following performance metrics.
Accuracy: It is defined as the number of correctly detected data instances out of the total number of instances:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}.$$

Precision: It is defined as the fraction of positive detections that are actually correct; precision effectively computes a complete probability and performance measure:

$$Precision = \frac{TP}{TP + FP}.$$

Recall (sensitivity): It is defined as the ratio of correctly detected positive samples to the total positive instances:

$$Recall = \frac{TP}{TP + FN}.$$

Specificity: It is defined as the proportion of negative instances that are correctly detected out of the total negative instances:

$$Specificity = \frac{TN}{TN + FP}.$$

F_Measure: The F_Measure lies between 0 and 1, where 0 is the worst value and 1 is the best; it is the harmonic mean of precision and recall:

$$F\_Measure = \frac{2 \times Precision \times Recall}{Precision + Recall}.$$

Purity Measure: The purity measure is used to calculate the percentage of documents assigned to the most common class in each cluster. The ideal value of purity is equal to 1. The purity can be calculated using the following equation:

$$Purity = \frac{1}{N} \sum_{k=1}^{K} \max_j \big|\,c_k \cap t_j\,\big|,$$

where $K$ is the quantity of clusters, $N$ is the whole number of documents in the dataset and $\max_j |c_k \cap t_j|$ is the size of the largest class within the cluster $c_k$.

Entropy:
The entropy measure is computed to identify the correct clusters for every class. The entropy of a single cluster is computed by the equation below:

$$E(c_j) = -\sum_i p_{ij} \log p_{ij},$$

where $p_{ij}$ is the probability of a document in the cluster $c_j$ belonging to the class $i$ and $E(c_j)$ is the entropy of the cluster. Over the complete set of clusters, the entropy is computed as follows:

$$E = \sum_{j=1}^{K} \frac{n_j}{N}\, E(c_j),$$

where $K$ is the number of clusters, $n_j$ is the quantity of documents in the cluster $c_j$ and $N$ is the quantity of documents presented in the dataset.
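The purity and entropy measures above can be sketched from predicted cluster labels and true class labels; the toy label assignments below are illustrative.

```python
import math
from collections import Counter

def purity(clusters, classes):
    """Fraction of documents assigned to the majority class of their cluster."""
    n = len(classes)
    total = 0
    for c in set(clusters):
        members = [cls for cl, cls in zip(clusters, classes) if cl == c]
        total += Counter(members).most_common(1)[0][1]   # majority-class size
    return total / n

def entropy(clusters, classes):
    """Weighted average of each cluster's class entropy (log base 2)."""
    n = len(classes)
    e = 0.0
    for c in set(clusters):
        members = [cls for cl, cls in zip(clusters, classes) if cl == c]
        for count in Counter(members).values():
            p = count / len(members)
            e -= (len(members) / n) * p * math.log2(p)
    return e

clusters = [0, 0, 0, 1, 1, 1]          # predicted cluster of each document
classes  = ["a", "a", "b", "b", "b", "b"]  # true class of each document
```

For these toy labels, purity is 5/6 (one document in cluster 0 is misplaced) and entropy is nonzero only for the mixed cluster; a perfect clustering would give purity 1 and entropy 0.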

Case 1: Document Clustering
In this section, the proposed document clustering is evaluated and compared with the conventional methods using statistical measurements. To evaluate the proposed methodology, 100 documents are collected and processed with the proposed methodology.

Case 2: Topic Detection
In this section, the proposed ensemble-approach-based topic detection is evaluated and compared with the conventional methods. Three different topics are classified with the help of the proposed methodology, which is validated in this section by statistical measurements.

Conclusion
In this paper, Type 2 IFCSOA has been developed for document clustering and topic detection. The Type 2 IFCSOA has been utilized to cluster the documents, and an ensemble approach has been utilized to identify the topics from the clustered documents. In the proposed methodology, pre-processing steps such as tokenization, stop-word removal and stemming have been utilized to remove unwanted information from the documents. After that, the proposed method has been utilized to cluster the documents, and the clustered documents have been labeled on the basis of their clusters. To achieve topic detection, the ensemble approach has then been utilized with feature extraction phases such as TF-IDF, MI, the TextRank algorithm and CSI. The proposed methodology is implemented in MATLAB, and its performance was evaluated with statistical measurements such as precision, recall, accuracy, sensitivity, purity measure and entropy. The proposed method is compared with conventional methods such as FCM, FCM-PSO, FCM-GA and K-means clustering. From the outcome analysis, the proposed methodology achieved the best results in terms of accuracy. In future, a larger database with various topics will be analyzed with effective methods.

Data Availability Statement
Not Applicable.