Text mining: identification of similarity of text documents using hybrid similarity model

The volume of data accessible on the internet has increased dramatically, and this growth will only continue as more data-generating devices are connected to the network. A part of these data consists of documents from various sources. As the data from digital sources increase, identifying the information most relevant to a user's needs becomes increasingly difficult. The goal of this research is to present a hybrid similarity algorithm that uses text summarization techniques to identify papers that are similar in terms of both semantic and contextual similarity. Some of these methods aim to quantify the corpus's polysemy quotient using deep learning with numerous layers and prebuilt Natural Language Processing (NLP) models to determine document similarity. In comparison with other conventional algorithms, the experimental results of our model showed an accuracy of 76.25%.


Introduction
Text summarization is one of the applications of Natural Language Processing (NLP). Text summarization is the method used to create a short, accurate, and consistent summary of long text documents. As the amount of information increases rapidly in the internet era, it is very difficult for users to find the relevant information they are looking for. Suppose a user searches for a particular document through a search engine: the user will receive thousands of documents of a similar type based on the keywords used. However, the major problem is that the user has to go through all of these documents to identify which one is most useful for their reference. This is a very time-consuming task, and the difficulty of judging which documents are relevant and which are irrelevant for future reference is known as information overloading. According to International Data Corporation, the total amount of digital data in circulation worldwide was expected to reach 80-100 zettabytes in 2020. It can be challenging to find the pertinent information the user needs due to the vast volume of data floating about in cyberspace. Therefore, there is a need to develop algorithms that automatically reduce the information by shortening large data and summarizing it into the relevant information. The text summarization technique came into existence to overcome this problem. Summarization is the process of reducing the complexity of a document without modifying the original information, by retaining the features of the text document [1].
Text summarization falls into two distinct categories: extractive and abstractive methods. Without changing the original content, extractive text summarization brings together the key sentences or phrases from the source material to create a summary. It produces a summary by considering the words and phrases used in the actual document, and the sentences typically follow the same order as in the original text [2]. Abstractive summarization, with the aid of linguistic methods to comprehend and analyse the text, performs summarization by understanding the original text. The goal of abstractive summarization is to create a generalised summary that communicates information precisely, which typically calls for advanced language generation and compression techniques. It generates summaries that capture the salient ideas of the text: it does not only consider the words or phrases from the original document but is also capable of generating new words, providing an effective summarized report as the result.
In this research work, we aim to provide a hybrid algorithm that identifies similar documents in terms of semantic and contextual similarity by utilising various text summarization techniques. Some of these techniques use deep learning with multiple layers and prebuilt NLP models to provide similarity scores, achieving a correlation coefficient value between documents and attempting to assign a quantitative number to the corpus's polysemy quotient. The hybrid model approach has given the best results compared to state-of-the-art text summarization methods.
This research project demonstrated adaptive text summarization and relevant text document identification using a hybrid similarity model and user-specified keywords. The tasks were performed with a hybrid similarity algorithm that combines text summarization methods to find articles that are comparable in terms of semantic and contextual similarity. Several of these strategies use deep learning with many layers and prebuilt NLP models to measure similarity between texts and to estimate the corpus's polysemy quotient. The experiment was evaluated against various traditional algorithms, and our model provided the best accuracy in comparison with them.
Contributions made in this research work are as follows:

1. A hybrid similarity algorithm has been proposed for the identification of similar documents in terms of both semantic and contextual similarity using a text summarization procedure.
2. The relevancy of each document has been checked based on the semantic nature of the documents present in our corpus.
3. The proposed hybrid algorithm has been compared with other state-of-the-art algorithms and found to provide the best accuracy among them.

Motivation
In today's era, the amount of information on the internet is increasing rapidly day by day, and storing this huge amount of information is becoming a major task in the world of the Internet. Consequently, users find it difficult to extract the required relevant information manually. Therefore, to overcome this problem, we thought of designing an algorithm that can summarize text automatically by considering the keywords/features provided by the user. The summarized document helps the user understand the main context of the documents and reduces time-consuming work by eliminating the need to go through every document. The main objective of this paper is to provide the most suitable algorithm, one that summarizes the text in a document and provides appropriate and relevant information for the user.

Related work
Yang Zenith and YAO Fei have depicted the process of text dimensionality reduction using mutual information-preserving mapping. Even though dimensionality can be reduced in a variety of ways, some work has been done to achieve dimensionality reduction without changing the inner semantic relationships among high-dimensional data [5].
To address this issue, they developed mutual information-preserving mapping (MIPM), a manifold learning-based method for exploring low-dimensional, neighbourhood- and mutual information-preserving embeddings of high-dimensional inputs. The experimental results show that the proposed method is more effective for the text dimensionality reduction task than traditional text summarization methods.
To create summaries of Turkish texts, Celal Cigar and Mucahid Kutlu have presented a generic text summarization method that scores sentences according to their surface-level properties and extracts the highest-ranked ones from the source documents [6]. They used feature extraction, key phrases, title similarity, and the position of the text in the original document, and applied the best feature weights to summarize the text using various machine-learning techniques. The results obtained on Turkish text are acceptable when compared with other text analytics methods.
Cheng-Ying Liu, Chi-Yao Tseng, and Ming-Syan Chen focused their research [7] on short text reports in the comment stream of a specific message from social network services (SNS). Social network services have become very popular, and the volume of comments on social media grows faster every day, making knowledge management an increasingly difficult task. To overcome this problem, they introduced an SNS summarization method for short text that groups comments of a similar kind into clusters and converts each cluster into a concise summary, which helps the user scan the comments briefly and also minimizes the information stored in the repository. They used a novel agglomerative clustering algorithm together with SNS data to produce short summaries, obtained high scalability and accuracy in their results, and demonstrated that the algorithm can effectively convert text into a short summary.
Hien Nguyen and Eugene Santos have aimed at studying the impact of user cognitive styles when accessing multi-document summaries [8]. They chose two-dimensional user cognitive styles to study how users access a summary generated from a set of documents. Their approach identifies whether the documents are closely or loosely related. The results show that different users assess the information coverage differently, along with the way the information is represented as a closely or loosely related document set.
Soe-Tsyr Yuan and Jerry Sun presented Structured Cosine Similarity (SCS), a novel method for document clustering that models document summarization by considering the structure of the documents, improving the performance of document clustering in terms of quality, stability, and efficiency [9]. This work aims to support knowledge acquisition and information sharing, and it produced promising results with the help of SCS.
Atsushi Shimada, Fumiya Okubo, Chengjiu Yin, and Hiroaki Ogata have proposed a novel method for summarizing lecture slides to enhance preview efficiency and improve students' understanding of the contents [10]. In most cases, students find it difficult to read all the information presented in lecture slides, and full slides give poor preview results because students' attention spans are limited. To overcome this drawback, they proposed a novel method for summarizing the text in the slides, helping students go through the different concepts easily and effectively. They used image processing and text processing to extract the text from the lecture material and then optimised the selected pages according to the requirements of the concepts and users. In a study involving more than 300 students, they compared the relative effectiveness of the summarized slides and the original materials in terms of quiz scores, preview achievement ratio, and time spent previewing. They found that students who previewed the summarized slides achieved better scores on pre-lecture quizzes, even though they spent less time reviewing the material.
Kuan-Yu Chen, Shih-Hung Liu, and Berlin Chen have proposed a recurrent neural network technique for extractive broadcast news summarization [11]. Extractive text or speech summarization selects a set of salient sentences from an original document and concatenates them to form a summary, enabling users to better browse through and understand the content of the document. In this paper, they proposed a novel use of the recurrent neural network language modelling (RNNLM) framework for extractive broadcast news summarization.
On top of such a framework, the deduced sentence models are able to render not only word usage cues but also long-span structural information of word co-occurrence relationships within broadcast news documents, getting around the need for the strict bag-of-words assumption.
Kuan-Yu Chen, Shih-Hung Liu, and Berlin Chen have also proposed the de-noising essence vector (D-EV) model for extractive summarization of text documents [12]. In the context of natural language processing, representation learning has become an active area of research with many applications. In this paper, they first proposed an unsupervised embedding method called the essence vector (EV) model, which concentrates not only on distilling the most important information from a paragraph but also on excluding the general background information, producing a more informative low-dimensional vector representation for the paragraph of interest. Second, they proposed an extension of the EV model called the D-EV model, which not only inherits the advantages of the EV model but can also infer a more robust representation for a given spoken paragraph against imperfect speech recognition. Their experiments showed that the effectiveness and applicability of the framework are acceptable compared with state-of-the-art summarization methods.
The Semantic Link Method for the summarization of scientific papers based on reinforcement ranking was proposed by Xiaoping Sun and Hai Zhuge in this research study [13]. The Semantic Link Network (SLN) is a semantic approach that contains semantic nodes, semantic links, and reasoning rules, and is used for effective information processing. SLN naturally supports relational reasoning, analogical reasoning, and inductive reasoning, and an SLN is enriched after reasoning. An SLN is statically defined but autonomously evolving; it can be localized or decentralized. This paper proposes a new text summarization approach that extracts a Semantic Link Network from a scientific paper, consisting of language units of various granularity as nodes and semantic links between the nodes, and then ranks the nodes to select the top-k sentences that compose a summary.
In reference [14], Pawan Goyal and Laxmidhar Behera presented a context-based word indexing model for document summarization. Document summarization is an information extraction task that aims at producing a minimized version of the original document [15]. The major goal of text summarization is to provide effective and efficiently oriented text in a minimized form of the original document. Summaries can be obtained from single documents or from many documents [16]. In this paper, they proposed a new approach that uses the lexical association between terms to give a context-sensitive weight to the document terms. The resulting indexing weights are used to compute the sentence similarity matrix, and the proposed sentence similarity measure is used with baseline graph-based ranking models for sentence extraction [17].
In reference [18], Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao proposed the optimization and summarization of text using C-Miner. Research on microblog sentiment analysis has mainly been conducted on polarity classification [19][20][21]. However, classifying microblog texts at the sentence level is often insufficient to understand the opinions on a topic. In this paper, they proposed the C-Miner method, which concentrates on extracting opinion targets and summarizing microblogs. A novel algorithm was developed and integrated into a complete system. C-Miner provides different opinion target groups with a different opinion summary for each group. They created an unsupervised parallel propagation model for opinion target extraction, which extracts opinion targets from all the messages collectively, based on the assumption of similarity among messages in the blog. The messages are then clustered into groups based on similarity measurements, and the text of each group is summarized [22].
In reference [23], Tsutomu Hirao, Masaaki Nishino, Yasuhisa Yoshida, Jun Suzuki, Norihito Yasuda, and Masaaki Nagata proposed a trimming model for the summarization of text documents. Recent research on extractive text summarization formulates it as a combinatorial optimization problem: extracting the optimal subset from a set of textual units while staying within a length constraint. Although these methods improve automatic evaluation scores, they do not take into account the discourse structure of the source document. The authors therefore proposed a method for producing coherent summaries that makes use of a discourse tree structure. They formulated the summarization procedure as a Tree Knapsack Problem whose tree corresponds to the DEP-DT, obtained by transforming a traditional rhetorical structure theory-based discourse tree (RST-DT) into a dependency-based discourse tree (DEP-DT) [24]. The paper expands on previous work by providing a thorough analysis of the strategy and a new, effective dynamic programming algorithm for solving the Tree Knapsack Problem [25]. Experiments reveal that their approach received the highest ratings in both the automated and human evaluations.
In reference [27], Yinglong Ma, Peng Zhang, and Jiangang Ma presented an ontology knowledge-driven model for the classification of Chinese judgment documents via summarization. Document categorization is a key task with applications in a variety of areas, including document search, information extraction, domain-specific content, and more. Legal document classification has recently attracted more attention, for example in case-based reasoning [28,29], legal citations [30], legal knowledge extraction [31], precedent retrieval [32], etc. In this study, they suggested an ontology-driven knowledge-block summarization method for calculating document similarity for the classification of Chinese judgment documents. The study comprised three distinct processes. First, semantic knowledge based on a top-level ontology and a domain-specific ontology was adopted, which showed how to integrate various ontologies effectively. Second, the fundamental semantic knowledge needed for ontology-based information extraction was modelled to condense the documents into knowledge blocks. Third, instead of using the original Chinese judgment documents, Word Mover's Distance (WMD) was used to calculate the similarity between different knowledge blocks. Finally, KNN-based experiments for Chinese judgment document classification demonstrated that their approach outperforms the original WMD approach in terms of classification accuracy and computation speed.

Pythagorean distance similarity
The standard metric for geometric problems is the Pythagorean (Euclidean) distance. It is the ordinary distance between two points, simple to calculate in either two- or three-dimensional space and measurable with a ruler. The categorization and clustering of text using distance measures are common practices in text mining. Figure 1 represents the basic model of Euclidean distance, which represents the similarity between the terms in the documents.
Let $d(x, y)$ represent the distance between any two objects $x$ and $y$ in a set. A valid distance metric must satisfy the following conditions:

1. Non-negativity: $d(x, y) \geq 0$, i.e., the distance between any two points must be non-negative.
2. Identity: the distance between two items must be zero if and only if they are identical, so $d(x, y) = 0$ if and only if $x = y$.
3. Symmetry: $d(x, y) = d(y, x)$.
4. Triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$.
This approach meets all four of the aforementioned criteria, making it a valid metric. It is also commonly utilised as the distance metric in the K-means clustering algorithm.
Given two text documents, $d_1$ and $d_2$, represented by their respective term vectors $\vec{t_1}$ and $\vec{t_2}$, the Euclidean distance of the two documents is computed as

$$D_E(\vec{t_1}, \vec{t_2}) = \left( \sum_{t=1}^{m} \left| w_{t,1} - w_{t,2} \right|^2 \right)^{1/2},$$

where $T = \{t_1, \ldots, t_m\}$ is the term set and $w_{t,1}$, $w_{t,2}$ are the weights of term $t$ in the two documents. For identifying the term weight, which is represented in formula (2), we typically employ the tf-idf approach.
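As a minimal sketch of how this distance might be computed over tf-idf weight vectors (the document contents and weight values below are purely illustrative), consider:

```python
import math

def euclidean_distance(w_a, w_b, terms):
    """D_E = sqrt of the sum over the term set T of |w_t,a - w_t,b|^2,
    with terms missing from a document treated as weight 0."""
    return math.sqrt(sum((w_a.get(t, 0.0) - w_b.get(t, 0.0)) ** 2 for t in terms))

# Hypothetical tf-idf weights for two short documents
d1 = {"text": 0.42, "mining": 0.58, "model": 0.17}
d2 = {"text": 0.35, "summarization": 0.61, "model": 0.22}
T = set(d1) | set(d2)                 # shared term set T = {t1, ..., tm}
print(euclidean_distance(d1, d2, T))  # smaller value -> more similar documents
```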

Cosine similarity
One of the most often used similarity metrics for text documents is the cosine measure, which is used in information retrieval, categorization [33,34], and clustering processes [35]. When documents are represented as term vectors, the correlation between the vectors indicates how similar two documents are to one another. This is quantified as the cosine of the angle between the vectors, the so-called cosine similarity. Given two documents with term vectors $\vec{t_a}$ and $\vec{t_b}$, where $a$ represents document 1 and $b$ represents document 2, their cosine similarity is measured using formula (3):

$$\mathrm{SIM}_C(\vec{t_a}, \vec{t_b}) = \frac{\vec{t_a} \cdot \vec{t_b}}{\left| \vec{t_a} \right| \times \left| \vec{t_b} \right|}.$$

The cosine similarity measure's most important property is that it is independent of document size. When two identical copies of document $d$ are combined to form a new document $d'$, the cosine similarity between $d$ and $d'$ is 1, indicating that the two documents are treated as identical. Documents with the same term composition but different totals are treated the same way. Strictly speaking, this does not satisfy the metric's second condition, because the combination of two copies is a different object than the original document. In practice, however, when the term vectors are normalised to unit length, the representations of $d$ and $d'$ are the same.
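A minimal sketch of formula (3) over term-weight dictionaries (the example weights are hypothetical) also demonstrates the length-independence property described above:

```python
import math

def cosine_similarity(w_a, w_b):
    """SIM_C = (t_a . t_b) / (|t_a| * |t_b|), with missing terms as weight 0."""
    terms = set(w_a) | set(w_b)
    dot = sum(w_a.get(t, 0.0) * w_b.get(t, 0.0) for t in terms)
    norm_a = math.sqrt(sum(v * v for v in w_a.values()))
    norm_b = math.sqrt(sum(v * v for v in w_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d = {"text": 2.0, "mining": 1.0}
d_prime = {t: 2 * w for t, w in d.items()}  # d' = two copies of d combined
print(cosine_similarity(d, d_prime))        # -> 1.0, independent of document size
```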

Jaccard coefficient
The Jaccard coefficient is also known as the intersection over union. The Jaccard coefficient similarity is a statistical technique for evaluating the similarity and diversity of sample data sets. The Jaccard coefficient [33] measures the similarity between finite sample sets as the size of the intersection divided by the size of the union of the sample sets, expressed by formula (4), where A and B represent documents of different types:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$

As a measure of similarity, the Jaccard coefficient ranges from 0 to 1. It is 1 when $A = B$ and 0 when A and B are disjoint, where 1 denotes identity and 0 signifies total difference between the two objects. In subsequent tests, we utilise the equivalent distance measure $D_J = 1 - J(A, B)$ rather than $J(A, B)$ itself [33]. The relationship between intersection and union is shown in Fig. 2.
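A short sketch of formula (4) over term sets (the sample sentences are illustrative):

```python
def jaccard(a_terms, b_terms):
    """J(A, B) = |A intersect B| / |A union B|; ranges from 0 to 1."""
    a, b = set(a_terms), set(b_terms)
    if not a and not b:
        return 1.0                   # two empty sets are treated as identical
    return len(a & b) / len(a | b)

a = "text mining finds similar documents".split()
b = "text mining summarizes documents".split()
print(jaccard(a, b))                 # similarity J(A, B) = 0.5
print(1 - jaccard(a, b))             # equivalent distance measure D_J
```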

Inverse document frequency
The process of determining the number of times a specific word appears in a document is known as term frequency. The inverse document frequency measures how common or rare a specific word is across the documents of the corpus. Tf-idf is a metric for determining the importance of words: words that appear frequently in many documents are given a lower weighting, while words that appear infrequently are given a higher weighting.
The tf-idf value grows proportionally to the number of times a word appears in the document, but it is offset by the frequency of the word in the corpus, which helps to account for the fact that some words appear more frequently than others. Text mining, search queries, and summarization are all examples of how tf-idf is used in NLP. During the summarization of text documents, we must perform a normalisation process to condense the text. To calculate the average weight of the document, formula (5) has been used; a sketch of this computation is given below.
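Since formula (5) is not reproduced here, the sketch below shows one plausible reading of the average document weight, computed from scikit-learn tf-idf vectors over a hypothetical mini-corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; the paper's corpus is built from online news articles.
corpus = [
    "text summarization reduces a long document to a short summary",
    "the hybrid model measures similarity between text documents",
]
tfidf = TfidfVectorizer().fit_transform(corpus)  # rows: documents, columns: terms

# Assumed interpretation of formula (5): the mean tf-idf weight of the
# terms that actually occur in each document, usable as a selection threshold.
total_weight = np.asarray(tfidf.sum(axis=1)).ravel()
distinct_terms = tfidf.getnnz(axis=1)            # terms present per document
avg_weight = total_weight / distinct_terms
print(avg_weight)
```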

Semantic labelling
Semantic labelling is a parsing technique that is commonly used in natural language processing operations. It recognises the arguments associated with a verb in a sentence, analogous to a function with certain parameters. Each function can be thought of as a verb corresponding to an action. Because each action is associated with an agent and a theme, the parameters of the function can be thought of as the agent and the theme. Each verb may also be associated with modifiers, such as temporal, location, or adverbial modifiers; these modifiers can likewise be thought of as parameters of the function that represents the verb [36].
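As an illustration of this function-with-parameters view (the sentence and role assignments below are a hypothetical example, not the output of any particular parser):

```python
# "Mary delivered the report yesterday" seen as a function call:
# the verb acts as the function, and the semantic roles are its parameters.
frame = {
    "predicate": "deliver",      # the verb corresponding to the action
    "agent": "Mary",             # who performs the action
    "theme": "the report",       # what the action is applied to
    "temporal": "yesterday",     # a temporal modifier of the verb
}
```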

Methodology
The process of the hybrid similarity model is represented in the flowchart below. Identifying whether two documents are similar or not is one of the major research challenges today. In this paper, we have concentrated on developing an algorithm that helps us determine whether documents are similar, which also involves overcoming polysemy problems. In our proposed method, we take a URL as input to get the source data from the webpage. We then parse the web page to extract its text. After extracting the text, we apply preprocessing to remove unwanted information using various steps: sentence tokenization, cleaning, case conversion, stop-word removal, and lemmatization.
Sentence tokenization is the process where we break the entire text into sentences with the help of delimiters such as the full stop and end-of-line characters such as \n. Given a text extracted from document D, we tokenize it into sentences as < S1, S2, S3, …, Sn >. Figure 3 represents the diagrammatic formation of the hybrid model.
Next, we performed a cleaning step, eliminating all special characters present in the text document and replacing them with white spaces. Here, we used regular expression patterns to remove characters that count as metadata and any characters that do not fall within the basic ASCII range [0-9, A-Z, a-z]. Then, we performed a case conversion operation to put all characters of the text into a single case, which lets us identify similarity between words by comparing their ASCII values directly. Word tokenization then breaks the sentences down into words so that our algorithm can process them easily. Next, we used the standard NLTK method for removing stop words from the text documents, which helps minimize the context of the text document. Finally, we applied lemmatization to combine words of a similar kind. A sketch of this pipeline is given below.
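The following is a minimal sketch of this preprocessing pipeline using NLTK; lower case is used as the single case here so that NLTK's stop-word list matches, the exact regular expression may differ from the paper's, and the NLTK resources (punkt, stopwords, wordnet) are assumed to be downloaded:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(text):
    """Sentence tokenization -> cleaning -> case conversion ->
    word tokenization -> stop-word removal -> lemmatization."""
    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    cleaned = []
    for sentence in nltk.sent_tokenize(text):                # sentence tokenization
        sentence = re.sub(r"[^0-9A-Za-z\s]", " ", sentence)  # keep basic ASCII range
        sentence = sentence.lower()                          # one uniform case
        words = [w for w in nltk.word_tokenize(sentence)     # word tokenization
                 if w not in stop]                           # stop-word removal
        cleaned.append([lemmatizer.lemmatize(w) for w in words])  # lemmatization
    return cleaned
```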
After the pre-processing operation is completed, the text data are ready for the various algorithms. We have demonstrated the text summarization process using both extractive and abstractive algorithms.
The practice of counting the number of times a specific word appears in the text is known as term frequency. The procedure known as inverse document frequency refers to calculating the number of times a specific term appears in a corpus document. Frequent words are given a lower weighting, whereas words that are used sparingly across numerous documents receive a greater value. The tf-idf value rises proportionally to a word's usage in the text but is countered by the word's usage frequency in the corpus, which helps to account for the fact that some words are used more frequently than others overall. During the summarization of text documents, we need to perform a normalization process to condense the text; the average weight of the document can be calculated using formula (5).
Text ranking is an extractive text summarization technique, derived from the algorithm used to rank web pages, that helps identify the similarity between the texts in the documents and perform the summarization operation, as shown in Fig. 4. Here, we took two different text documents and combined them into a single text document, then divided the document into sentences. From these sentences we formed vectors using a vector conversion matrix (GloVe word embeddings). These vectors allowed us to calculate similarity scores based on the similarity between the sentences and words. Based on the similarity score matrix and the edges between sentences, we plotted a graph for sentence rank calculation and finally ranked the sentences by similarity index. The top-ranked sentences form the summary; a sketch of this procedure follows below.
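A compact sketch of this pipeline is shown below; for self-containment it uses tf-idf sentence vectors in place of the GloVe embeddings used in the paper, and PageRank from networkx for the sentence ranking:

```python
import networkx as nx
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(doc1, doc2, top_k=3):
    # Combine both documents and split into sentences
    sentences = nltk.sent_tokenize(doc1 + " " + doc2)
    # Sentence vectors (tf-idf stands in for GloVe word embeddings here)
    vectors = TfidfVectorizer().fit_transform(sentences)
    sim_matrix = cosine_similarity(vectors)    # similarity-score matrix
    np.fill_diagonal(sim_matrix, 0.0)          # drop self-similarity edges
    graph = nx.from_numpy_array(sim_matrix)    # sentences as nodes, scores as edges
    scores = nx.pagerank(graph)                # rank the sentences
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return " ".join(sentences[i] for i in sorted(top))  # keep document order
```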
The score for the number of times a specific word occurs in the document is calculated using formula (6):

$$T_s = \frac{T_w}{W_f} = \frac{\text{total number of times the word occurs in the document}}{\text{total number of words in the document}},$$

where $T_s$ stands for the term score, $T_w$ stands for the term frequency weight, and $W_f$ stands for the total number of words in the document. Latent Semantic Analysis (LSA) is a theory and method for extracting the contextual meaning of words by statistical analysis applied to a large corpus of text. LSA is used in natural language processing to analyze relationships between a set of documents and the terms that they contain. In this paper, we divided the corpus into clusters using unsupervised learning. We use singular value decomposition to derive the topics and then fetch important sentences based on each topic; a sketch follows below. This process checks how often words appear in the same document and compares how often the same words appear across documents.
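A minimal sketch of this topic-based sentence selection using truncated SVD from scikit-learn (the topic count is an assumed parameter and must be smaller than the vocabulary size):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def lsa_top_sentences(sentences, n_topics=2, top_k=3):
    """Derive topics via singular value decomposition and pick the
    sentences that load most strongly on any topic."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    svd = TruncatedSVD(n_components=n_topics)     # n_topics < vocabulary size
    topic_weights = svd.fit_transform(tfidf)      # sentence-by-topic matrix
    salience = np.abs(topic_weights).max(axis=1)  # strongest topic loading
    best = np.argsort(salience)[::-1][:top_k]
    return [sentences[i] for i in sorted(best)]   # keep original order
```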
Abstractive text summarization is an advanced machine learning technique that uses deep learning to generate summary words that do not appear in the initial document. The learning is carried out by training over a large corpus of similar data; the resulting model is then applied to the document to generate the summarized text.
A recurrent neural network (RNN) model is developed by training the data over a set of layers. Each layer contains neurons that perform a specific task based on the data learned through backpropagation, and the model contains multiple hidden layers with a set of neurons in each layer on which the data are trained. Figure 5 indicates the basic model of the recurrent neural network, a natural generalization of feed-forward neural networks to sequences [37,38]. When the alignment between the inputs and outputs is known in advance, the RNN can map sequences to sequences. A sequence model is any model that involves sequential information, as shown in Fig. 6; this includes the classification of text documents based on machine learning, and text entity recognition. Our goal is to produce a summary (itself a sequence) from an input of a lengthy list of terms in a text body (also a sequence). Here, we have built a sequence-to-sequence model that can reduce a long text sequence to a short text sequence.
We take the URL or the text for two documents:

1. If it is a URL, we fetch the URL and get the source data from the webpage.
2. We parse the webpage and extract the < body > of the page.
3. We then follow up with the pre-processing steps described above.

A sketch of the fetch-and-parse step is given below.
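A minimal sketch of steps 1 and 2, assuming the requests and BeautifulSoup libraries (the paper does not name the parsing tools it used):

```python
import requests
from bs4 import BeautifulSoup

def fetch_body_text(url):
    """Fetch a web page and return the visible text inside its <body> tag."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()                  # fail loudly on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")
    body = soup.find("body")                     # step 2: extract the <body>
    return body.get_text(separator=" ", strip=True) if body else ""
```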
The encoder-decoder architecture is used to solve sequence-to-sequence problems, where the input and output sequences are of different lengths (Fig. 7).
In this paper, we have set up two phases for this approach:

1. Training phase
2. Inference phase
The encoder and decoder are first configured during the training phase. The model is then trained to forecast the target sequence with a one-timestep offset. The complete input sequence is read by an encoder long short-term memory (LSTM) model, with one word being fed into the encoder at each timestep. The encoder processes the information at every timestep and captures the contextual information present in the input sequence, as shown in Fig. 8.
The decoder is initialised using the hidden state (hi) and cell state (ci) from the final time step of the encoder. Keep in mind that the encoder and decoder are two separate sets of LSTM layers.
Decoder: as illustrated in Fig. 9, the decoder also uses an LSTM network to read the complete target sequence word-by-word and predict the same sequence offset by one timestep. After training, new source sequences with unknown target sequences are used to test the model; therefore, to decode a test sequence, we must configure the inference architecture (Figs. 9, 10). A sketch of the training-phase architecture is given below.
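The following Keras sketch shows the training-phase encoder-decoder wiring described above; the vocabulary sizes and latent dimension are assumed placeholders, not values reported in the paper:

```python
from tensorflow.keras.layers import Dense, Embedding, Input, LSTM
from tensorflow.keras.models import Model

src_vocab, tgt_vocab, latent_dim = 10000, 5000, 256   # hypothetical sizes

# Encoder: reads the full input sequence, one word per timestep,
# and keeps only its final hidden state (h) and cell state (c).
enc_inputs = Input(shape=(None,))
enc_emb = Embedding(src_vocab, latent_dim)(enc_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: a separate LSTM initialised with the encoder states, trained
# to predict the target sequence offset by one timestep.
dec_inputs = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, latent_dim)(dec_inputs)
dec_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                         return_state=True)(dec_emb,
                                            initial_state=[state_h, state_c])
outputs = Dense(tgt_vocab, activation="softmax")(dec_outputs)

model = Model([enc_inputs, dec_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```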

Results and discussion
To judge the potency and effectiveness of our approach, we used the hybrid similarity model to perform experiments in comparison with the prevailing state-of-the-art approaches. Different documents of a similar kind were used for the experimental analysis. We observed that our approach performs better at identifying the similarity between documents and representing each document compactly as output, compared with various state-of-the-art approaches such as tf-idf, spaCy, Sumy, LDA, and LSI. We found that our approach works better and its efficiency is good compared with the other approaches.

Data corpus
We took online data web links to create a large data corpus for performing the summarization of text and the identification of similarities between documents. Each annotator chose news pieces on their own, with no restrictions on the subject or sources of the news. We gathered articles from numerous online news portals in the domains of politics, sports, economy, entertainment, and so on. Because there are no constraints on the data, it is more helpful for us to analyse and work on the text summaries. Table 1 represents the statistics of the data corpus.

Experimental results
Identification of similarities between documents is one of the important tasks in the field of text mining. In our approach, we used the hybrid model, which can find the similarity between various documents of a similar type. In our model, we considered different metrics, such as the vector index and the cosine similarity index, to identify the similarity between documents, and found our model to be more effective than other traditional approaches. Figure 11 represents the similarity index rate and ROUGE score of the different algorithms.
The ROUGE score is used to evaluate the quality of text summarization and machine translation output. Figure 12 indicates the comparative results of different models against our hybrid model approach; our model's performance is acceptable compared with the other models. In this approach, we conducted the test with Abstract-Rouge and Non-Abstract-Rouge features. Figure 13 represents the word cloud for documents 1 and 2, indicating the similar words present in both documents: a keyword that occurs more often in the data set is rendered in a larger font, while the least frequent keywords appear in a smaller font, as shown in Fig. 13. Figure 14 gives a graphical representation of the various ROUGE scores reported in Fig. 11. The ROUGE score reflects the summarization quality and the similarity index rate between the various documents of a similar type.

Conclusion
This research work presented adaptive text summarization and identification of relevant text documents based on user-specified keywords using a hybrid similarity model. The tasks were completed using a hybrid similarity algorithm that uses text summarization techniques to identify publications that are comparable in terms of semantic and contextual similarity. To measure similarity between texts and assign a quantifiable number to the corpus's polysemy quotient, several of these strategies use deep learning with numerous layers and prebuilt NLP models. The experiment was evaluated against various traditional algorithms, and our model provided the best accuracy in comparison with them. In future, the work can be extended by considering polysemy and synonyms, which can help the user identify the most relevant documents based on their requirements.

Author contributions
The overall contribution to this research paper was carried out by Dr. SPKM. This research work specifies the process involved in the identification of relevant information from a huge dataset, using the adaptive text summarization and hybrid similarity model described above.