ERIL: An Algorithm for Emotion Recognition from Indian Languages Using Machine Learning

It is critical for a computer to understand the speaker's mood during a human–machine conversation. Until now, mostly neutral phrases or utterances have been used to train machines. A person's mood affects their performance. Machines have a hard time deciphering human mood from voice because humans can make fourteen distinct sounds in a second. For a machine to comprehend human behavior, it must first comprehend the acoustic abilities of the human ear. Linear Prediction Coefficients (LPC) and Mel Frequency Cepstral Coefficients (MFCC) can simulate the human auditory system. Emotion Recognition from Indian Languages (ERIL) extracts emotions such as fear, anger, surprise, sadness, happiness, and neutral. ERIL first pre-processes the voice signal, extracts selective MFCC, LPC, pitch, and voice-quality features, and then classifies the speech using CatBoost. We tested ERIL against different benchmark classifiers before choosing CatBoost. ERIL is a multilingual emotion classifier; it is independent of any particular language. We checked it on Hindi, Gujarati, Marathi, Punjabi, Bangla, Tamil, Oriya, Kannada, Assamese, and Telugu, and recorded a speech dataset of various emotions in these languages. The accuracy across distinct emotions is 95.05 percent on average; the combined average across languages is 95.05082 percent.


Introduction
Knowledge exchange is closely linked to the interacting parties' reciprocal knowledge. A normal two-person discussion begins with shared identification and concludes with mutual confidence. To apply human skills to computers, it is necessary to first understand how people perceive emotions. Over the years, a lot of effort has gone into isolating the right features from speech and correctly classifying them [1][2][3][4]. Technology has brought voice bots, applications, and gadgets that understand human voice commands. Google Assistant, Siri, and Alexa, which respond to users' voice commands, are household names. These applications are multilingual and work brilliantly on voice commands, but they cannot understand emotions. If a user says, "I am in pain", these applications will search on the keywords and return a song, a movie, or an article. They would be far more useful if they could determine the emotion and search according to the emotion of the speaker [5].
Natural Language Processing (NLP) with emotion recognition helps organizations plan their marketing strategies. These days, people post voice messages on social platforms instead of text. When voice messages replace text, existing NLP techniques do not work well, as the speech must first be converted to text before NLP can be applied [6]. Emotion recognition from speech finds applications in medicine, virtual reality, education, business development, and entertainment [7]. Capturing the emotions in phone calls made to emergency services or the police is equally important. Emotion extraction from threats made over the phone can help the police build a character sketch of the caller.
We need a robust algorithm to decipher the emotions in speech. Emotion Recognition from Speech (ERS) involves four steps: pre-processing, feature extraction, classification, and recognition [1][2][3][4][5][6][7]. Before retrieving the essential characteristics of the voice, we must analyze the speech signal to eliminate noise [8]. Feature extraction transforms the speech waveform into a parametric representation for subsequent processing and analysis at a lower data rate. Quality feature extraction makes the data easier to classify [9].
To extract vocal features the way the human ear does, the algorithm should replicate human acoustics. MFCC, LPC, Linear Prediction Cepstral Coefficients (LPCC), Linear Spectral Frequencies (LSF), and Perceptual Linear Prediction (PLP) imitate the human hearing and vocal tract and yield relevant features [10]. MFCC filters frequencies linearly at low frequencies and logarithmically at high frequencies to preserve the phonetically vital properties of the speech signal. LPC helps determine the position of formants and crests in the spectrum [11]. LSF considers the nasal cavity and mouth structure, laying the groundwork for the physiological interpretation of linear prediction [12]. LPCC uses vocal-tract features to collect emotion-specific information. The PLP method incorporates critical bands, intensity-to-loudness compression, and equal-loudness pre-emphasis to extract features from the speech. It is based on the nonlinear Bark scale and helps in speech recognition by removing speaker-dependent characteristics [13]. Other feature extraction approaches are based on discrete wavelets, hybrid models, and deep learning.
The Discrete Wavelet Transform (DWT), an extension of the Wavelet Transform (WT), can collect information from latent signals in both the time and frequency domains [14]. Many wavelets are orthogonal, an excellent characteristic for representing signals in small spaces. The wavelet transform divides a signal into wavelets, which are basis functions. The WT's main feature is that it uses a variable window to scan the frequency range, which increases temporal precision; the frequency ranges themselves are among its parameters. This improves the quality of the speech information obtained at the appropriate frequency. The WT provides enough frequency bands for effective speech recognition, but since the input signals are finite in range, the wavelet coefficients can have overly large variations at the edges because of discontinuities.
Deep learning offers i-vector and x-vector features [15,16]. These features are fusions of MFCC with DWT, and of MFCC with Gammatone Frequency Cepstral Coefficients (GFCC), respectively. With the advancement of deep learning (DL), there is a trend in speech processing to use DL to extract features from speech signals automatically. The Convolutional Neural Network (CNN) is good at extracting local features from unprocessed data. CNNs were created specifically for visual recognition tasks, and their success prompted researchers to investigate 2-D CNNs in the field of Speech Emotion Recognition (SER) [17]. To understand emotions in the voice, CNN models derive high-level salient information from speech signals. Similarly, some researchers have used CNNs to build fully convolutional networks (FCNs) that can accommodate inputs of differing sizes. In time-series classification tasks based on a set input variable scale, FCNs have provided good results; an FCN, though, cannot learn temporal features. A combined CNN and Long Short-Term Memory (LSTM) model can learn spatio-temporal features and is thus widely used in SER [7]. These techniques perform well in a noisy environment but suffer from over-fitting on tiny datasets. Deep learning, Transformers, knowledge distillation, and Bidirectional Encoder Representations from Transformers (BERT) or its variants have shown good results in automatic speech recognition, classification, speech separation, speaker recognition, and speech enhancement [16]. But there is very little historical evidence of emotion recognition for Indian speech. Table 1 illustrates work done in various Indian languages.
Our study concludes that the techniques used for sentiment analysis from speech thus far work better on larger datasets and on a single language. There is no historical evidence of emotion extraction from multilingual speech data of Indian languages [23-25, 27, 28, 40-47]. Our research is based on 10 Indian languages: Hindi, Gujarati, Marathi, Punjabi, Bangla, Tamil, Oriya, Kannada, Assamese, and Telugu. Most of them have a small data corpus.
ERIL's efficiency is compared with that of state-of-the-art systems. Comprehensive experimental results for each dataset are explained in separate sections, along with discussions and comparisons against baseline state-of-the-art approaches. In this research, we experimented with various AI architectures to find the most effective method for recognizing the emotion in speech signals. After extensive testing, we propose this system for SER, which ensures high-level output with a higher prediction rate.

Methodology
The objective is to extract sentiments from Multilingual Indian voices.

Experimental Set-Up
The section is split into hardware and dataset.

Dataset
We created a dataset of semi-natural expression. The UTU Semi Natural Emotion Speech Corpus (UTU-SNESC) was developed for training and validating emotion recognition models. The corpus contains dialogues from Indian films by well-known actors and actresses (Table 2).

Model
ERIL first reads every voice file and filters it by removing the silence. It then extracts the MFCC and LPC features and calculates the pitch. A method is designed to retain only the relevant features. The pitch, voice quality, and the retained MFCC and LPC features of every voice file are stored as a vector to create the dataset. Figure 1 illustrates the working: Fig. 1a shows the classification model and Fig. 1b the predictive model. The output from classification is used to predict the emotion in a new voice.

Objective Function
The objective of the paper is to design a function that predicts the sentiments in multilingual Indian voices. The function is represented mathematically as

S_v = P_s(∀_f(V_f)) (1)

Here, S_v = sentiment from voice; V_I = Indian voice; V_f = filtered Indian voice; P_s = predictive function; ∀_f = all the features.

Filter the Voices
The dataset considered is recorded in a studio environment. There are some silent spots in the recording, which we removed using the following equation:

V_f(n) = V_I(n), for all n such that V_I(n) ≠ 0 (2)

Using Eq. (2), the code iterates through the voice sample and removes all the zeros. The filtered voice is stored in V_f.
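As a minimal illustration of this filtering step, the following Python sketch drops zero-amplitude samples from a signal. The `eps` tolerance is our own addition, not part of the paper; `eps=0.0` reproduces the exact-zero removal described above.

```python
def remove_silence(samples, eps=0.0):
    """Drop silent samples, in the spirit of Eq. (2).

    eps is a hypothetical tolerance: samples with |x| <= eps are treated
    as silence. eps=0.0 removes exact zeros, as the paper describes.
    """
    return [x for x in samples if abs(x) > eps]

voice = [0.0, 0.0, 0.3, -0.2, 0.0, 0.5, 0.0]
print(remove_silence(voice))  # [0.3, -0.2, 0.5]
```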

MFCC Feature Extraction
Function (3) calculates the first and second derivatives of the cepstral coefficients to obtain the signal's spatial features. Algorithm 1 shows how it works.
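The mel-filter computation of Algorithm 1 is not reproduced in the paper; a plausible sketch of the filter-point calculation is shown below. Only the 10 mel bands come from the paper; the frequency range, FFT size, and sampling rate are illustrative assumptions.

```python
import math

def hz_to_mel(f):
    # Standard mel-scale mapping
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_points(fmin, fmax, n_bands, fft_size, fs):
    """FFT-bin indices of the triangular mel-filter edges.

    n_bands=10 matches the paper; fmin, fmax, fft_size, and fs are
    illustrative assumptions, not values given in the paper.
    """
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    mels = [lo + i * (hi - lo) / (n_bands + 1) for i in range(n_bands + 2)]
    return [int(math.floor((fft_size + 1) * mel_to_hz(m) / fs)) for m in mels]

# 10 bands over 0-4 kHz with a 512-point FFT at 8 kHz sampling
points = mel_filter_points(0.0, 4000.0, 10, 512, 8000)
```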

LPC
The LPC is a linear predictive filter that uses a linear combination of previous samples to calculate the value of the next sample. LPC models the speech signal as produced by the glottis at the end of a tube, characterized by its intensity and frequency. LPC analyses the speech signal by estimating the formants, removing their effects from the signal, and estimating the intensity and frequency of what remains. The method of extracting the formants is known as inverse filtering, and the remaining signal is known as the residue.
The LPC coefficients at lag La are obtained from the autocorrelation of the filtered voice,

R(La) = Σ_n V_f(n) · V_f(n + La) (4)

This is further expressed as

Σ_{k=1}^{p} a_k · R(La − k) = R(La), La = 1, …, p (5)

Equation (5) is the Yule-Walker equation.
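To make the Yule-Walker step concrete, here is a self-contained Levinson-Durbin solver in Python. The paper's own implementation is in Matlab and is not reproduced; this is a standard textbook version, with the sign convention that the predictor is x̂(n) = −Σ_k a_k x(n−k) for the returned polynomial a = [1, a_1, …, a_p].

```python
def autocorrelation(x, max_lag):
    """R(la) = sum_n x(n) * x(n + la) for la = 0..max_lag, as in Eq. (4)."""
    n = len(x)
    return [sum(x[i] * x[i + la] for i in range(n - la))
            for la in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for the LPC polynomial
    a = [1, a_1, ..., a_p]; returns (a, prediction_error)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                    # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)              # updated prediction error
    return a, err

# AR(1) check: autocorrelation r = [1, 0.5, 0.25] implies a_1 = -0.5
a, err = levinson_durbin([1.0, 0.5, 0.25], 2)
```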

Calculate Pitch
After filtering the voice, the next step is to calculate the pitch using autocorrelation. The calculation is shown in Eq. (6):

Pitch = Sampling Frequency / (max_l + index) (6)

Here, max_l is the maximum lag and index is the index of the maximum peak of the autocorrelation A(n).
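A minimal Python sketch of autocorrelation pitch estimation in the spirit of Eq. (6): the sampling frequency is divided by the lag of the strongest autocorrelation peak, with the peak index counted from a search offset as in the (max_l + index) form. The `min_lag`/`max_lag` search bounds here are illustrative assumptions, not values from the paper.

```python
import math

def estimate_pitch(samples, fs, min_lag=40, max_lag=400):
    """Pitch via the autocorrelation peak, in the shape of Eq. (6).

    min_lag skips the trivial zero-lag peak; both bounds are
    illustrative assumptions."""
    n = len(samples)
    max_lag = min(max_lag, n - 1)
    # autocorrelation A(la) for la = 0..max_lag
    ac = [sum(samples[i] * samples[i + la] for i in range(n - la))
          for la in range(max_lag + 1)]
    # index of the maximum peak, counted from the search offset
    index = max(range(len(ac) - min_lag), key=lambda i: ac[min_lag + i])
    return fs / (min_lag + index)   # sampling frequency / peak lag

# 100 Hz sine sampled at 8 kHz has a period of 80 samples
tone = [math.sin(2 * math.pi * 100 * i / 8000) for i in range(800)]
```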

Prepare Data
We use Algorithms 1 and 2 to measure the Pitch and MFCC (VMF) features for each sound file. A numeric label is assigned to each record in the dataset, ranging from 1 to the number of files.
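The data-preparation step can be sketched as follows. Here `extractors` is a hypothetical list of feature functions (pitch, MFCC, LPC, voice quality) standing in for Algorithms 1 and 2, and labels run from 1 to the number of files as described; the two toy extractors are illustrative only.

```python
def build_dataset(files, extractors):
    """One feature vector per voice file, plus a numeric label per record."""
    dataset, labels = [], []
    for label, samples in enumerate(files, start=1):
        row = []
        for extract in extractors:
            feats = extract(samples)
            # a feature function may return one value or a list of values
            row.extend(feats if isinstance(feats, list) else [feats])
        dataset.append(row)
        labels.append(label)
    return dataset, labels

# toy stand-ins for the real feature extractors
mean = lambda s: sum(s) / len(s)
peak = lambda s: max(abs(x) for x in s)
X, y = build_dataset([[0.1, 0.3], [0.2, -0.4]], [mean, peak])
```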

Design Deep Learning Model
Here, Cat_m is the CatBoost regressor model and p its parameters. The model is created with the following parameters and values: we increased the number of iterations to a high value, used the overfitting-detector parameters, and turned on the use-best-model option to avoid over- or under-fitting before modifying other parameters. Only the best iterations are included in the final model. We resolved the issue of overfitting by keeping the minimum number of data points in a leaf at 20.
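The configuration described above could look like the following sketch. Parameter names follow the CatBoost Python API, but apart from the 20-sample leaf minimum stated in the text, the concrete values are illustrative assumptions.

```python
# Hypothetical CatBoost configuration sketch; only min_data_in_leaf=20
# is stated in the paper, the remaining values are illustrative.
catboost_params = {
    "iterations": 5000,          # "a high amount" of iterations
    "od_type": "Iter",           # overfitting-detector parameters
    "od_wait": 100,              # stop after 100 rounds without improvement
    "use_best_model": True,      # keep only the best iterations
    "grow_policy": "Depthwise",  # min_data_in_leaf needs a non-default policy
    "min_data_in_leaf": 20,      # minimum records per leaf, per the text
}
# model = CatBoostClassifier(**catboost_params)  # if catboost is installed
```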

Fit and Predict
The model in fn (9) is trained using Eq. (10). The model trains on and classifies the V dataset and its labels. The output is stored in V_c.
Here, S_v is the sentiment and N_v is a new voice carrying a sentiment. Once the model has been trained and has classified the training data, the next step is to predict the sentiment of a new voice. The trained model from Eq. (10) is used to predict the sentiment using Eq. (11). We took a new voice (N_v) with a random sentiment and ran the trained model over it. The predicted sentiment is stored in S_v. We evaluate the accuracy based on whether the predicted sentiment is correct. N_v is recorded in the same environment, and we preprocessed it using Eq. (2) before passing it to the model.
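The train-then-predict flow of Eqs. (10) and (11) can be sketched with a scikit-learn-style fit/predict interface. A trivial majority-class stand-in is used here so the sketch runs without the catboost package; in ERIL the model would be the tuned CatBoost classifier.

```python
class MajorityStub:
    """Stand-in with a fit/predict interface, so the Eq. (10)/(11)
    flow can be shown without the catboost dependency."""
    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)   # most frequent label
        return self
    def predict(self, X):
        return [self.label_ for _ in X]

def preprocess(samples):
    # Eq. (2): drop the silent (zero) samples before prediction
    return [x for x in samples if x != 0.0]

model = MajorityStub()
model.fit([[0.1], [0.2], [0.3]], ["happy", "happy", "sad"])   # Eq. (10)
n_v = preprocess([0.0, 0.25, 0.0])        # new voice N_v, filtered
s_v = model.predict([n_v])[0]             # Eq. (11): predicted sentiment
```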

Results
The results are tabulated along with the mathematical formulations used to arrive at them. Figure 2 is a frequency plot of a Bollywood movie dialogue, before applying any filter. The result of Algorithm 2 applied to the original voice is plotted in Fig. 3.
The effect of Eq. (2) is reproduced in Fig. 4. The recorded dialogues do not have much leading or trailing silence; the in-between silence is removed from the original signal. The pitch of the filtered voice is shown in Fig. 5. The evident change in the frequency of the filtered voice affects the pitch. There is no loss of voice after filtering; it only plays as if one had pressed the fast-forward button.
To calculate the MFCC we used Algorithm 1. The voice is windowed first and then converted to the frequency domain using a Hann window. Steps 6-16 of Algorithm 1 compute the filter points, and the mel bands are then computed, as shown in Figs. 6 and 7, respectively. We considered 10 mel bands.
The cepstral coefficients are commonly referred to as static features, as they only contain information from a single frame; the second derivative of the MFCC is therefore computed in step 17 of Algorithm 1. A spectrogram plot of the same is shown in Fig. 8. The second-order derivative contains details such as speech acceleration, which is a critical feature for emotion recognition. Figure 9 plots the LPC estimate against the filtered voice. The LPC is calculated using Algorithm 2.

Calculate Robustness of the Model
We calculated True Negative (TN), True Positive (TP), False Negative (FN), and False Positive (FP) counts to determine accuracy. The following equations are used to measure precision, recall, accuracy, and F1 score:

Precision = TP / (TP + FP); Recall = TP / (TP + FN); Accuracy = (TP + TN) / (TP + TN + FP + FN); F1 = 2 × Precision × Recall / (Precision + Recall).

TP denotes the number of correctly predicted sentiments. FP is the number of sentiments detected by the model that are not actual sentiments. FN is the number of sentiments that do not match. TN is the number of incorrect sentiments detected: the output is a sentiment but not the actual one. The tabulation is done manually: the model is fed new sentiments at random, and the evaluated sentiment is registered (Fig. 10). Table 3 uses CatBoost to train and predict; 80% of the data is used for training and 20% for testing and validation. Tables 4 and 5 highlight the reason for choosing CatBoost. The results helped us select CatBoost for classification and prediction. Further tests were conducted using CNN, LSTM, XGBoost, and LightGBM.
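The four measures reduce to the usual confusion-matrix formulas; a small Python helper, with illustrative counts:

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 score from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# illustrative counts, not figures from the paper
p, r, acc, f1 = classification_metrics(tp=8, fp=2, fn=2, tn=8)
```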
The results in Tables 3 and 4 indicate that CatBoost is the better choice with MFCC, LPC, and pitch features. CatBoost also extracts its own features from the data.
The average accuracy across the different emotions in Table 6 is 95.05%. The accuracy of emotions for the individual languages is computed in Table 7; the average across languages comes to 95.05082%. Pre-processing, selective feature selection, and fine-tuning of the CatBoost parameters help the classifier classify and predict the emotions accurately.

Conclusion
ERIL is a step toward addressing the Indian languages. The supplied dataset is limited owing to the small number of speakers, and a small dataset is difficult to train and classify. In a cocktail-party-like environment, ERIL can extract and identify the voices. The accuracy of 95.05 percent is very promising for Indian languages with smaller datasets. ERIL can be used to create an Indian-language speech bot that works even in a noisy atmosphere. ERIL would assist native-language speakers in communicating effectively with computers. It could assist businesses in comprehending speakers' emotions and providing appropriate replies. ERIL can help online education providers understand students' emotional connection with a subject, and thereby design better strategies and syllabi for the students.
We plan to add more languages and noises in the future. In a noisy setting, we would like to try out different native speakers speaking at the same time. To transform the signal to the frequency domain, we would like to try FFT and wavelets. The future model should be able to grasp more Indian languages, even if they overlap. It could aid corporations in developing devices that recognize patients' emotions; medical science can use emotion recognition to treat patients. In Table 6, a plot is drawn of the contribution of each language to the accuracy.
Author contributions: As per our study, ERIL is the first algorithm to address the problem of extracting emotions from 10 Indian languages with one algorithm. Equation (2), designed by us, is able to remove the blanks and silence from the audio files. We wrote the code in Matlab to accomplish the task. We selected the MFCC and LPC features based on a threshold value and ignored the other features. Both methods are based on autocorrelation. The time complexity of a conventional method is O(N^2); ERIL has a complexity of O(N log N). The parameters of CatBoost are fine-tuned to avoid overfitting. The dataset has some recordings of dialogues made in a studio; the rest are cut from movies. A code is written to extract only the audio from the movies.
Funding: There is no funding for the project.
Data availability: Our research is still in progress; the data and the code will be made available in the public domain as soon as the research is complete and the thesis is awarded.