An Enhanced Exploration of Sentiment Analysis in Health Care

Medical datasets capture crucial patient information, including disease diagnoses, interventions, and descriptions of examination results. Detecting the mindset of a patient affected by acute disease is a particularly challenging task. Although sentiment analysis can reveal patients' perspectives, broad medical applications have not yet addressed the analysis of patient mindset, and we identified this as a major shortcoming in studies of the mindsets of people affected by diverse diseases. Hence, we introduce a practical framework to analyse patients' perspectives using a socio-medical dataset containing reviews and feedback from people affected by critical diseases. Initially, we apply pre-processing techniques to the dataset, including lowercase conversion, removal of special characters, removal of stop words, number-to-word conversion, stemming, and lemmatization. Next, N-gram tokenization is used to extract valuable features; a polarity score is then assigned to each extracted sentiment and the overall polarity of the context is calculated. Finally, a probabilistic LDA model is employed to combine the reviews. Furthermore, various machine learning classifiers are explored to evaluate the performance of the proposed framework.


Introduction
Evolving work on medical concepts depends on category assignment and emotion analysis. Because of the absence of domain-specific lexicons and the scarcity of researchers interested in this domain, the challenges are considerable. Another issue is capturing the semantic relations of the healthcare domain and separating knowledge-dependent features. The main reason is that general-purpose medical lexicons do not provide characteristics such as category and sentiment. Over the years, experts have tried to address these issues through data-extraction resources such as GENIA and PennBioIE. The primary need is to recreate either structured or unstructured versions of the corpora. Other economic and ontology-based methods are used together with linguistic and machine learning (ML) techniques [1]. These are used to separate healthcare topics by their syntactic and semantic characteristics [2]. In recent work, two systems have evolved to separate semantic relations in the healthcare domain. The first system performs tokenization (assigning groups). The second recognizes emotions in the healthcare domain and their contexts. A healthcare topic is a word or phrase whose entities, knowledge, and data belong to healthcare attributes. The recognized field is of two types: (i) medical and (ii) non-medical. Negation and stop words or sentences are handled with the aim of identifying the area. Consider two examples: "regular headache" and "uncontrolled jerking." These are considered medical or non-medical depending on the presence of the healthcare domain. A headache can also be a symptom of early-stage cancer, so it is a medical context accompanied by medical concepts. Additionally, every word or phrase of the corpus is recognized as a context in our work. A sentence like "Orange is good or bad" belongs to a non-medical domain without any medical concepts.
The categorization and emotion recognition systems are used to separate contexts and domains. In categorization, the separated field is divided into five classes: (i) disease, (ii) symptoms, (iii) drugs, (iv) human anatomy, and (v) miscellaneous medical terms (unidentified topics, represented as MMT in the remainder of the paper). An example of the disease type is "headache." Based on utterances and on the first occurrence of each separated concept, healthcare researchers and scientists assigned these five categories in the corpora. Every type has concepts of its own, which determine the overall classification based on context. Eleven pairwise classifications of the healthcare domain are recognized, such as disease-symptom and disease-drug. Based on both concepts and contexts, the sentiment recognition model [3][4][5] is augmented with sense-based information. Here, only positive and negative emotions are taken into account. For example, "There is something wonderful about being pregnant" expresses a positive emotion. The outputs of emotion recognition differ across domain types: "anatomy of the human body" carries a neutral sentiment, while positive or negative feelings depend on symptom types. Lexicons from previous models are used to evaluate these systems. The WordNet of Medical Events (WME) is utilized to separate the healthcare domain from contexts. The lexicons also assign linguistic and emotional characteristics to the healthcare domain [6]; the WME has two versions, WME 1.0 [7] and WME 2.0 [8]. Linguistic characteristics such as POS tags, glosses with polarity scores, and emotions are covered by the WME 1.0 lexicon. It contains 6415 healthcare-domain entries, but it cannot offer the sentiment-based field and related knowledge data. We therefore adopt WME 2.0, which has 10,186 entries and offers knowledge-related characteristics such as affinity score, similar sentiment words (SSW), and gravity score. A blended model is introduced to combine the linguistic features of WME 2.0 with a machine learning prototype. This offers aspects such as negation [9], unigrams, and bigrams. Two classifiers, (i) Naive Bayes and (ii) Logistic Regression [13], achieve averages of 0.81 and 0.86 F-measure for assigning categories to the healthcare domain and contexts in the categorization system. With WME 2.0, the emotion recognition system was evaluated using Naive Bayes and the support-vector-based Sequential Minimal Optimization (SMO) classifier, attaining averages of 0.91 and 0.81 F-measure for recognizing emotions of the healthcare domain and contexts. While unigrams and bigrams identify the divisions of the medical domain, the negation feature identifies the initial sentiments of medical concepts [10][11][12].
The paper is organized as follows: Sect. 2 discusses the related work and literature associated with sentiment analysis. Section 3 presents the problem statement along with the proposed solution. The methodologies used in the proposed approach are discussed in Sect. 4. Experimentation and results are discussed in Sect. 5. Finally, the paper is concluded in Sect. 6.

Related Study and Literature Survey
This section analyzes and describes the existing approaches built for analyzing the sentiment of patients affected by acute disease.
Many researchers are examining how social media influences public perception of medical care. Text mining is being applied (Ficek and Kencl [14], Rahnama [15]) and plays a significant role in the high-performance processing of unstructured data using Apache Spark. Using binary and ternary classification, Baltas and Tsakalidis [16] introduced a Twitter sentiment analysis. Another method is a conventional extreme learning machine on a Spark cluster by Oneto et al. [17]; combining Spark with deep learning shows a higher performance level than other Spark models. For mobile big data analytics, a deep learning framework using the Apache Spark model was introduced by Chen et al. [18]. A further method is sentiment analysis with Spark architecture on large-scale data, presented by Nodarakis et al. [19].
To separate sentiments from HPV-vaccine-related tweets, the best-performing ML system was presented by Du et al. [20]. A ranking scheme with an SVM classifier is used, and 6000 tweets were manually annotated. Against the baselines, this system achieves the best F-measure of 0.6732. Medical sentiment analysis is an evolving technology. Denecke and Nejdl [21] introduced a healthcare ontology to evaluate the factual level of healthcare texts, which differs from past emotion-analysis systems. Emotion analysis is performed either by rules or by ML methods, and more work has used ML methods than rule-dependent methods.

Problem Statement and Solution
From the literature survey, we found the difficulties involved in analyzing the mindset of people affected by acute disease. Many parameters are involved in extracting sentiment, and analyzing the mindsets of diverse people is a very challenging task. Hence, we propose a practical framework to analyze the sentiment of people affected by perceptive diseases. In the framework we use five widely used machine learning classifiers: Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), and K-nearest neighbor (KNN). Our proposed framework includes:
• Initially, we perform pre-processing over the dataset, including lowercase conversion, elimination of special characters, elimination of stop words, number-to-word conversion, stemming, and lemmatization, so that the dataset becomes accurate and crisp.
• After pre-processing, N-gram tokenization is performed over the dataset.
• A polarity score is then assigned to each extracted review and the overall polarity score is calculated.
• To combine the data reviews, we employ a probabilistic LDA over the resultant dataset.

Methodologies
This section describes all the methodologies used in our proposed work and explains their functionality in detail.

Data Pre-processing
Data pre-processing is the primary task of any data classification process. The techniques we use include lemmatization and stop word removal, which are crucial and widely generalized approaches. Several major pre-processing techniques are employed over our socio-medical dataset. Pre-processing improves the quality of the classification process and makes the extracted features more robust.

Conversion of Lowercase
Initially, we employ lowercase conversion. It sequentially scans the entire dataset and finds uppercase words; any uppercase word is converted into its lowercase form using the NumPy Python library.

Eliminate Special Characters
This is the second step of data pre-processing. It eliminates special characters such as *, %, $, @, and # from the dataset. It takes the uppercase-free dataset as input, i.e., it starts after step 1 has finished.

Eliminate Stop Words
Stop words occur frequently in text and convey no meaningful information. Hence, using the NLTK library, we eliminate the stop words. Figure 1 demonstrates this clearly.

Conversion of Number to Word
This step converts numbers to words using the Python library num2words (for instance, 7 is converted into seven).

Stemming and Lemmatization
This is the final step of pre-processing. Stemming removes the suffix or prefix from a given word, reducing the word's complexity and leaving the stem or root of the term. We employ the Porter stemmer for this process. Lemmatization reduces a word to a well-founded dictionary word.
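Taken together, the pre-processing steps above can be sketched as a single pipeline. The snippet below is a minimal stdlib-only illustration: the tiny stop-word set, the digit map, and the `crude_stem` suffix stripper are stand-in assumptions, whereas the actual implementation uses NLTK's stop-word list, the num2words library, and the Porter stemmer.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "to", "of"}  # tiny stand-in list
NUM_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
             "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def crude_stem(word):
    # Very rough suffix stripping; the paper uses the Porter stemmer instead.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                           # 1. lowercase conversion
    text = re.sub(r"[^a-z0-9\s]", " ", text)      # 2. drop special characters
    tokens = []
    for tok in text.split():
        if tok in STOP_WORDS:                     # 3. stop-word removal
            continue
        if tok.isdigit():                         # 4. number-to-word conversion
            tok = " ".join(NUM_WORDS[d] for d in tok)
        tokens.append(crude_stem(tok))            # 5. stemming
    return tokens

print(preprocess("The patient reported 7 severe headaches!"))
```

The output token list is what the N-gram tokenizer in the next section consumes.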

N-gram Tokenization
N-gram tokenization [39-41] treats text as a collection of co-occurring tokens within a frame and is used to find further features. After pre-processing, it divides the text into tokens for the purpose of feature extraction. Based on the pre-processed data, the N-gram dictionary is manually curated: words like 'a', 'and', 'the', and 'there' carry no critical information, so they are rejected from the review sentences. Opinion words are separated from the N-gram dictionary by sentence-level annotation and summarization. The summary of every sentence, along with its frequency, is represented in Fig. 2. Using the summarized characteristics, the score of every aspect is evaluated to predict the number of positive, negative, and neutral words, as represented in Table 1; this is done after the manual annotation and summarization. Scores are calculated for positive and negative emotions individually. The process has three steps: (i) recognizing the opinion terms, (ii) feature vectorization, and (iii) vector transformation. The feature vectors are computed using Term Frequency-Inverse Document Frequency (TF-IDF) [42], which assigns a weight, and hence an importance, to every feature vector. The weighted feature vectors are passed to LDA (latent Dirichlet allocation) [43] through a pipeline. The Bayesian optimizer within LDA separates the features by grouping them into a number of topics; 'dimension' is the term used to describe each topic in LDA. The section below briefly describes how LDA works.
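The tokenization and TF-IDF weighting steps above can be sketched as follows. This is a stdlib-only illustration assuming the common tf · log(N/df) weighting; the real pipeline uses library implementations and a manually curated stop list.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of size n over the token list to form n-grams.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    # docs: list of token lists; returns one {term: weight} dict per document.
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: (c / len(doc)) * math.log(n_docs / df[t])
                         for t, c in tf.items()})
    return weighted

print(ngrams(["severe", "headache", "severe", "pain"], 2))
```

A term appearing in every review gets weight 0, so only discriminative features are passed on.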

Assign Polarity Score
Emotions are derived by the learning section [44] depending on the polarity level of the whole domain in context. Phrases such as "no", "not", "never", and "neither" are considered in order to identify the relevant emotions of the healthcare domain [11,45,46]. The algorithm below assigns emotions to the healthcare domain through the emotion recognition system.
STEP 1: Set the polarity score (polarity level) and emotion of both the healthcare and non-healthcare parts of the domain. The emotion lexicons used to assign the polarity level are SenticNet and SentiWordNet.
STEP 2: Determine the negation words or phrases in order to assign the relevant emotion of the domain.
STEP 3: Calculate the total polarity level of the domain as the sum of the individual polarity levels of every topic in the domain.
The following algorithm examines a single person's emotions. It is applied after the polarity level has been determined, and it finds the classification of the healthcare domain along with the domain classification.
STEP 1: Set the type of the healthcare domains of the environment with the concept categorization system. We denote the medical concepts and their types as CM in an environment, where Pcc1 and Pcc2 are the partial context categories of the environment.
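STEPS 1-3 can be sketched as a simple scoring loop. The four-entry lexicon below is a hypothetical stand-in for SenticNet/SentiWordNet scores, so the example is only an illustration of the mechanism, not the paper's actual lexicon.

```python
# Hypothetical mini-lexicon; the paper draws scores from SenticNet and SentiWordNet.
LEXICON = {"wonderful": 1.0, "good": 0.7, "pain": -0.8, "worse": -0.9}
NEGATIONS = {"no", "not", "never", "neither"}

def context_polarity(tokens):
    """STEPS 1-3 in miniature: score each topic, flip on negation, sum up."""
    total, negate = 0.0, False
    for tok in tokens:
        if tok in NEGATIONS:
            negate = True                    # STEP 2: remember the negation cue
            continue
        score = LEXICON.get(tok, 0.0)        # STEP 1: per-topic polarity level
        if score and negate:
            score, negate = -score, False    # negation flips the next sentiment word
        total += score                       # STEP 3: cumulative context polarity
    return total

print(context_polarity("the treatment was not good".split()))  # → -0.7
```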

Latent Dirichlet Allocation (LDA) with Topic Modelling
LDA is a hierarchical Bayesian model applied to the feature vectors to inspect the text in the corpus; it is an unsupervised probabilistic model that maps the corpus into groups of content [47]. Each topic is modeled as a distribution over words. Suppose there are 't' possible topics in the corpus 'c', which holds 'r' reviews. Each probability distribution associated with a feature vector is a multinomial distribution, and each review produces an arbitrary constant 'k'. Feature extraction in the LDA model is governed by the probability P(t_a = b), the likelihood that topic b is sampled for feature f_a in every review of corpus c, and P(f_a | t_a = b), the likelihood of feature f_a under topic b, where n denotes the total number of topics.
These terms are used when computing Eq. (8). The vector φ(b) is a multinomial distribution over features for topic b, and for a review r, θ(r) is a multinomial distribution over topics. The estimated parameters φ and θ represent the feature vectors and reviews respectively; the total feature representation of the reviews is given by R × N, where R is the entire set of reviews. The topic-word hyperparameter is β (with α on the review-topic side), and the counts underlying the Dirichlet distributions are updated cell by cell. The review-level variable is sampled once per feature-level variable, i.e., over R and f for each review r with N features. Because the entailed features are difficult to score and process directly, Eq. (8) determines the probability over all possibilities for each feature vector via collapsed Gibbs sampling:

P(t_i = k | t_-i, f) ∝ (C^FN_{f_i,k} + β) / (Σ_f C^FN_{f,k} + Fβ) · (C^RN_{r_i,k} + α) / (Σ_k C^RN_{r_i,k} + Kα)

where t_i = k means feature f_i is assigned to topic k; t_-i denotes the topic assignments of all other features; R is the entire set of reviews; F is the entire set of features; C^FN and C^RN are the feature-topic and review-topic count matrices; C^FN_{f_i,k} counts how often feature f_i is assigned to topic k; and C^RN_{r_i,k} counts the topics assigned to words of review r, in each case excluding the current feature f_i.
Equations (10) and (11) give the parameter estimates φ̂ and θ̂(r), where α and β are the hyperparameters:

φ̂^(f)_k = (C^FN_{f,k} + β) / (Σ_f C^FN_{f,k} + Fβ),  θ̂^(r)_k = (C^RN_{r,k} + α) / (Σ_k C^RN_{r,k} + Kα)
In this proposed work, LDA is utilized to discover the topics in the review corpus and fuse them into the latent domain. We modeled the number of topics as K = 100, 200, 300, 400, and 500 on the review corpus. The essential features were identified from the domain based on the probabilities correlated with each segment. Later, feature selection is applied, as presented in the section below, to handle the curse-of-dimensionality problem.
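The collapsed Gibbs update of Eq. (8) can be sketched in a few lines. The snippet below is an illustrative stdlib-only toy (word ids instead of text, two topics, a handful of iterations); the experiments in this paper run LDA over the full review corpus with K up to 500.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA. docs: lists of integer word ids."""
    rng = random.Random(seed)
    n_fk = [[0] * vocab_size for _ in range(n_topics)]   # C^FN: feature-topic counts
    n_rk = [[0] * n_topics for _ in docs]                # C^RN: review-topic counts
    n_k = [0] * n_topics                                 # words assigned to each topic
    z = []                                               # topic assignment per word
    for d, doc in enumerate(docs):                       # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            n_fk[k][w] += 1; n_rk[d][k] += 1; n_k[k] += 1
            zd.append(k)
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                               # remove current assignment
                n_fk[k][w] -= 1; n_rk[d][k] -= 1; n_k[k] -= 1
                # Eq. (8): p(k) ∝ (C^FN + β)/(n_k + Fβ) · (C^RN + α)
                p = [(n_fk[k][w] + beta) / (n_k[k] + vocab_size * beta)
                     * (n_rk[d][k] + alpha) for k in range(n_topics)]
                r = rng.random() * sum(p)
                for k, pk in enumerate(p):                # draw from the posterior
                    r -= pk
                    if r <= 0:
                        break
                n_fk[k][w] += 1; n_rk[d][k] += 1; n_k[k] += 1
                z[d][i] = k
    return n_fk, n_rk
```

Normalizing the returned count matrices with β and α gives the estimates φ̂ and θ̂ of Eqs. (10) and (11).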

Classification Process
In this subsection we discuss the classifiers used for evaluation; with these classifiers we analyze and classify the sentiment as positive, negative, or neutral.

Support Vector Machine (SVM):
The SVM model is widely used for text classification and hypertext categorization. It can significantly reduce the number of labeled training samples needed in both inductive and transductive settings. It is a classifier that optimizes a cost function and thereby enhances classification performance. We used the LIBSVM implementation through scikit-learn's SVC. SVM is a robust learning method that analyses the data while minimizing structural risk, classifying the data appropriately.
The training phase finds the optimal separating hyperplane by reducing the cost function so that the margin between the two classes is maximized in feature space. Suppose there are m instances of data in the training phase. Each sample is a pair (a_i, b_i), where a_i ∈ R^n is the attribute vector of instance i and b_i ∈ {+1, -1} is the class label of the instance.
The main objective of SVM is to find the hyperplane W · a + c = 0 that best separates the data belonging to the two classes. The decision function used to classify a test instance y is f(y) = sign(W · y + c).
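As a concrete illustration of the decision function f(y) = sign(W · y + c), the toy below trains a linear SVM by sub-gradient descent on the hinge loss. This is only a sketch: the reported experiments use scikit-learn's SVC (LIBSVM), and the 2-D points here are made-up data.

```python
def train_linear_svm(X, labels, lr=0.01, lam=0.01, epochs=500):
    """Toy primal sub-gradient descent on the regularized hinge loss."""
    w = [0.0] * len(X[0])
    c = 0.0
    for _ in range(epochs):
        for a, b in zip(X, labels):            # (a_i, b_i) with b_i in {+1, -1}
            margin = b * (sum(wi * ai for wi, ai in zip(w, a)) + c)
            if margin < 1:                     # inside the margin: hinge gradient
                w = [wi - lr * (lam * wi - b * ai) for wi, ai in zip(w, a)]
                c += lr * b
            else:                              # outside the margin: shrink w only
                w = [wi * (1 - lr * lam) for wi in w]
    return w, c

def svm_decision(w, c, y):
    # Decision function f(y) = sign(W · y + c)
    return 1 if sum(wi * yi for wi, yi in zip(w, y)) + c >= 0 else -1

w, c = train_linear_svm([[2, 2], [3, 3], [-2, -2], [-3, -3]], [1, 1, -1, -1])
print(svm_decision(w, c, [2.5, 2.5]), svm_decision(w, c, [-2.5, -2.5]))  # → 1 -1
```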

Naïve Bayes
The NB classifier applies Bayes' theorem with independence assumptions between the predictors. Since NB handles discrete features, multinomial Naive Bayes is used with a fitted prior. It is simple, easy to build, and requires no iterative parameter estimation, making it suitable for large datasets, and it produces well-calibrated class divisions. Bayes' theorem provides a way of calculating the posterior probability P(g|h) from P(g), P(h), and P(h|g):

P(g|h) = P(h|g) P(g) / P(h)

The Naive Bayes classifier assumes that the effect of the value of a predictor (h) on a given class (g) is independent of the values of the other predictors. This assumption is called class conditional independence. Here P(g|h) is the posterior probability of the class (target) given the predictor (attribute), P(g) is the prior probability of the class, P(h|g) is the likelihood, i.e., the probability of the predictor given the class, and P(h) is the prior probability of the predictor.
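A minimal multinomial Naive Bayes with Laplace smoothing illustrates the posterior computation P(g|h) ∝ P(h|g)P(g). The two-review corpus and vocabulary below are made-up toy data; the experiments use a library implementation.

```python
import math
from collections import Counter

def train_nb(docs, labels, vocab, alpha=1.0):
    """Multinomial NB with Laplace (alpha) smoothing over token-count features."""
    classes = set(labels)
    prior = {g: labels.count(g) / len(labels) for g in classes}   # P(g)
    counts = {g: Counter() for g in classes}
    for doc, g in zip(docs, labels):
        counts[g].update(doc)
    loglik = {}
    for g in classes:                                             # log P(h|g)
        total = sum(counts[g].values())
        loglik[g] = {w: math.log((counts[g][w] + alpha) / (total + alpha * len(vocab)))
                     for w in vocab}
    return prior, loglik

def predict_nb(doc, prior, loglik):
    # argmax_g  log P(g) + Σ log P(h|g); P(h) is a shared constant and drops out.
    scores = {g: math.log(prior[g]) + sum(loglik[g][w] for w in doc if w in loglik[g])
              for g in prior}
    return max(scores, key=scores.get)

prior, loglik = train_nb([["good", "great"], ["bad", "awful"]], ["pos", "neg"],
                         {"good", "great", "bad", "awful"})
print(predict_nb(["good"], prior, loglik))  # → pos
```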

Decision Tree (DT)
The decision tree comes from the classification tree algorithm family. The subtrees are derived by recursively splitting the source set; 12 is the maximum depth of the tree used here. The tree structure is the basic form for classification and regression models. It is built by splitting the dataset into smaller and smaller subsets while the associated tree is grown incrementally. The final results are decision nodes and leaf nodes; as usual, the topmost node is the root node, which can handle categorical and numerical data. It is a top-down approach, so the data are divided into homogeneous groups of instances. The splits of the decision tree are chosen using entropy, H(S) = -Σ_i p_i log2(p_i), where p_i is the proportion of class i in node S.
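The entropy criterion above can be computed directly; the snippet below (with made-up class labels) also shows the information gain used to compare candidate splits.

```python
import math
from collections import Counter

def entropy(labels):
    # H(S) = -Σ p_i · log2(p_i) over the classes present in node S.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    # Gain of a split = parent entropy minus the size-weighted child entropies.
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

print(entropy(["pos", "pos", "neg", "neg"]))  # evenly mixed two-class node → 1.0
```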

Random Forest (RF)
The RF builds multiple decision trees on samples of the dataset, each with a maximum depth of up to 'n'. It is constructed from different individual decision trees at the learning phase, and the final prediction uses the predictions from all the trees: the mode of the classes for classification or the mean forecast for regression. Since a group of outputs is combined to reach a final answer, this is called an ensemble method.
The importance of each feature in a decision tree is then calculated as

importance(i) = Σ_{j : node j splits on feature i} n_j / Σ_{k ∈ all nodes} n_k

where n_j is the number of samples reaching node j.
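The feature-importance formula above can be sketched as follows. The (feature, sample-count) node list is a hypothetical stand-in for the internal nodes of a fitted tree.

```python
def feature_importance(nodes):
    """nodes: (feature, n_samples) pairs, one per internal node of a fitted tree.
    importance(i) = samples routed through nodes splitting on feature i,
    divided by samples routed through all nodes."""
    total = sum(n for _, n in nodes)
    importance = {}
    for feature, n in nodes:
        importance[feature] = importance.get(feature, 0.0) + n / total
    return importance

# Hypothetical tree: the root (100 samples) splits on 'tf_idf', its children
# on 'polarity' and again on 'tf_idf'.
print(feature_importance([("tf_idf", 100), ("polarity", 60), ("tf_idf", 40)]))
```

In a random forest these per-tree importances are averaged over all trees.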

K-nearest Neighbor
K-nearest neighbor is a fast algorithm that produces accurate and precise classification results with enhanced performance. It is mainly used to find similar objects and is widely applicable to such problems. Applications such as recommendation systems and search engines essentially work based on KNN.
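A minimal KNN classifier works by majority vote among the k nearest points. The 2-D feature vectors and labels below are made-up illustrations; a real pipeline would use the TF-IDF/LDA features described earlier.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: (vector, label) pairs; majority vote among the k nearest neighbors."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

reviews = [((0.0, 0.1), "negative"), ((0.1, 0.0), "negative"),
           ((0.9, 1.0), "positive"), ((1.0, 0.9), "positive"), ((0.8, 0.9), "positive")]
print(knn_predict(reviews, (0.95, 0.95)))  # → positive
```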

Experimentation and result Discussion
This section presents the implementation details and a comparison with the existing approach. For experimentation we used the Windows operating system with 12 GB RAM, a 2 GHz processor, and a 1 TB hard disk. The implementation uses the Python programming language with the appropriate library functions. We used the PyCharm IDE for implementation; since a CPU is no longer sufficient when the dataset is huge, we also used a GPU via Google Colab.

Dataset Description
Notably, 821,483,453 general tweets were collected from Twitter between 16th March 2019 and 2nd October 2020. Among them, 438,072,932 concern healthcare issues, especially numerous social-environment health domains. In addition, three medical and health datasets are used to assess the coherence of the project; a brief explanation of these data is given in the sections below. Another standard survey was taken between October 2013 and January 2016 using the UCI machine learning repository (a common Twitter dataset). This conventional dataset contains medical tweets gathered from many Twitter accounts, as follows: a. reutershealth b. kaiserhealthnews c. latimeshealth d. bbchealth e. msnhealthnews f. NBChealth g. cbchealth h. nytimeshealth i. gdnhealthcare j. everydayhealth k. nprhealth l. foxnewshealth. Table 1 describes the statistical details of the data and Table 2 gives examples of the emo-tag list.
Our experimentation consists of five main phases. The first is data pre-processing, comprising lowercase conversion, elimination of special characters, elimination of stop words, number-to-word conversion, and stemming and lemmatization. The resulting pre-processed dataset is passed to N-gram tokenization, in which each emotion is converted into tokens. We then assign a polarity value to each emotion and compute the cumulative value. The result is fed to Latent Dirichlet Allocation (LDA) with topic modelling, in which each topic is converted into a set of sentences. Finally, the significant classifiers SVM, NB, DT, RF, and KNN are used to analyse the sentiment of people affected by acute disease.

Data Pre-processing
A set of pre-processing techniques, including lemmatization and stop word removal, is applied to the collected dataset. We also include a manually developed wordlist providing crucial sentiment keywords (positive, negative, and neutral), some of which are shown in Table 3; this wordlist is also embedded in the pre-processing step.

N-gram Tokenization
We then applied N-gram tokenization over the resultant dataset. Sentence-level annotation and summarization are performed, and opinion words are extracted from the N-gram dictionary. Words such as "a", "the", "and", and "so" are removed from the dictionary, giving clarity about what kinds of emotions we want to predict. The result of N-gram tokenization was simulated and the graph generated as shown below: the pink line is our approach and the orange line the baseline approach without N-gram tokenization [i.e., pre-processing only]. We found that our approach outperforms the baseline and provides accurate results (Table 4).

Assign Polarity Score and Calculate it's Cumulative
This step happens after the completion of N-gram tokenization, whose result is the input to this step. To obtain exact predictions, we set a polarity score for each entity so that negation words such as "not", "never", "none", and "neither" can be recognized and the exact sentiment extracted. After fixing a polarity score for every entity, we categorize the context into unigrams, bigrams, and trigrams, which define the priority for analysing the sentiment and are also used to examine the disease-related statistics. The categorization over each entity is done most exactly with trigrams, for which the corresponding polarity score is 45. The cumulative score is then computed over each entity, after which the entire sequence recognizes negation, i.e., negative sentiment. The comparison graph over each categorization is shown below.

Latent Dirichlet Allocation (LDA) with Topic modelling
This step assigns a probability value to every emotion and categorizes the emotions based on that probability value, from high risk to low risk. The model keeps iterating this process so that it is fine-tuned as finely as possible; here k = 100. Figure 3 demonstrates the overall entities of the dataset obtained from the LDA model, clearly scattering all the emotions involved. The fine-tuned model provides highly reliable features with which classification is done with high accuracy.

Evaluate result using classifier:
To evaluate the results, we use the significant classifiers SVM, KNN, DT, RF, and NB, which are widely used for analyzing crucial information. The classifiers extract the exact sentiment from each entity. We used the evaluation metrics Accuracy (A), Recall (R), and Precision (P) for our analysis (Figs. 6 and 7). The table above gives the overall comparison of the classifiers in classifying and analysing sentiment with our proposed method. Figure 4 categorizes sentiments based on polarity score. In our experimental analysis, SVM outperforms the rest and produces better accuracy than the remaining classifiers: it achieves the highest accuracy of 98.2% when classifying negative sentiment, 96.5% for positive sentiment, and 96.1% for neutral sentiment. After SVM, Naïve Bayes performs best, with an accuracy of 95.2% for classifying negative sentiment. The corresponding comparison graph is given below.
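The Accuracy, Recall, and Precision figures can be computed per sentiment class as below; the four-label example is made up purely for illustration, not taken from the reported experiments.

```python
def evaluate(y_true, y_pred, positive):
    """Accuracy (A), Recall (R), and Precision (P) for one sentiment class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, recall, precision

# Hypothetical predictions, scored for the negative-sentiment class.
print(evaluate(["neg", "neg", "pos", "pos"], ["neg", "pos", "pos", "neg"], "neg"))
```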
The final comparison is between our proposed methodology with SVM and the baseline methodology with SVM. We take our SVM classifier and compare it with the baseline SVM classifier; our proposed methodology with the SVM classifier analyses the sentiment better than the traditional SVM classifier, and the corresponding table and comparison are given as follows (Table 7).

Conclusion and Future Work
Analysing the mindset of people affected by acute disease is a very challenging task, and the dataset must be very accurate. Hence, we collected datasets from three different environments, including reviews from social media, critical reviews from Twitter, and abstracts of medical studies from the Wall Street Journal. We then proposed a practical framework that differs from the traditional approach and includes four crucial techniques: enhanced pre-processing, N-gram tokenization, polarity score assignment, and topic modelling with Latent Dirichlet Allocation. For evaluation we used significant classifiers that are prominent in medical applications, including SVM, NB, DT, KNN, and RF. Of these classifiers, SVM performs best with our proposed method, achieving the top accuracy of 98.2%. We also compared our proposed framework with a baseline approach to show that our work analyses the emotions of people affected by acute disease efficiently. In future work we will incorporate a deep learning approach to process larger feature sets of the dataset.

Funding No Funding.
Data Availability statement Data sharing not applicable to this article as no datasets were generated or analysed during the current study.
Code Availability Not Applicable.

Ambeshwar Kumar is currently pursuing his PhD at SASTRA Deemed University, Thanjavur, India. He received his Bachelor of Engineering and Master of Technology degrees from Visvesvaraya Technological University, Belagavi, and Dayananda Sagar University, Bangalore, respectively. He has knowledge in the fields of machine learning, deep learning, and big data analytics. His research is in the field of deep learning, and he has published articles in Scopus- and SCI-indexed venues.