3.1 Feature Extraction and Normalization:
Initially, we built an audio library for different emotions. We selected the audio samples from the Berlin Database of Emotional Speech [Bartels 2009]. The Berlin Database, or EMO-DB, is a collection of over 500 recorded utterances. The utterances were recorded by different actors in 7 emotional states: Anger, Anxiety or Fear, Boredom, Disgust, Happiness, Sadness, and Neutral. The utterances also cover up to 10 different sentence contents for each emotion and each actor. We selected this library because it is a well-established collection with a clear classification of the emotional states of speech. The audio samples in this library are spoken in German, which gives an additional advantage to non-German-speaking researchers: we could focus purely on the acoustic features of the audio samples for emotion recognition.
We analyzed 196 audio clips from four different actors, two male and two female. The information about the speakers is shown in Table 3.1. The emotion codes are W, L, A, F, T, and N, which stand for Anger, Boredom, Fear, Happiness, Sadness, and Neutral respectively. The 10 different sentence contents of the audio clips and their text codes can be found in Table 3.2.
Table 3.1
The information of the speakers in EMO-DB [Bartels 2009]
Code | Speaker Details    | Code | Speaker Details
03   | Male, 31 years old | 13   | Female, 32 years old
10   | Male, 32 years old | 16   | Female, 31 years old
Table 3.2
The contents in EMO-DB [Bartels 2009]
Code | Content (German)                                                                   | English Translation
a01  | Der Lappen liegt auf dem Eisschrank.                                               | The tablecloth is lying on the fridge.
a02  | Das will sie am Mittwoch abgeben.                                                  | She will hand it in on Wednesday.
a04  | Heute abend könnte ich es ihm sagen.                                               | Tonight I could tell him.
a05  | Das schwarze Stück Papier befindet sich da oben neben dem Holzstück.               | The black sheet of paper is located up there beside the piece of timber.
a07  | In sieben Stunden wird es soweit sein.                                             | In seven hours it will be time.
b01  | Was sind denn das für Tüten, die da unter dem Tisch stehen?                        | What about the bags standing there under the table?
b02  | Sie haben es gerade hochgetragen und jetzt gehen sie wieder runter.                | They just carried it upstairs and now they are going down again.
b03  | An den Wochenenden bin ich jetzt immer nach Hause gefahren und habe Agnes besucht. | At the weekends I have always gone home and visited Agnes.
b09  | Ich will das eben wegbringen und dann mit Karl was trinken gehen.                  | I will just take this away and then go for a drink with Karl.
b10  | Die wird auf dem Platz sein, wo wir sie immer hinlegen.                            | It will be in the place where we always put it.
The audio samples in EMO-DB were recorded against a quiet background, which eliminated the need for an extra noise-reduction step in the analysis. The audio samples are stored as mono-channel Microsoft WAVE files sampled at 16 kHz; the WAVE format holds the digitized analog signal directly. After the audio samples were selected, we extracted the features from the audio clips.
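The following is a minimal sketch of how features of this kind can be extracted and normalized. It is illustrative only: the study's actual 18 feature measurements are not reproduced here, and librosa, the energy and pitch statistics, and the z-score normalization are assumptions made for demonstration.

    # Illustrative sketch only: librosa and the statistics below are assumptions,
    # not the exact feature set used in this study.
    import numpy as np
    import librosa

    def extract_features(wav_path):
        """Extract simple prosodic statistics from a 16 kHz mono WAVE file."""
        y, sr = librosa.load(wav_path, sr=16000, mono=True)

        # Frame-level short-time energy (RMS) contour.
        rms = librosa.feature.rms(y=y)[0]

        # Frame-level fundamental frequency (pitch) contour; unvoiced frames are NaN.
        f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=50, fmax=500, sr=sr)
        f0 = f0[~np.isnan(f0)]

        # Summarize each contour with a few per-clip statistics.
        return np.array([
            rms.mean(), rms.std(), rms.max(),
            f0.mean(), f0.std(), f0.max() - f0.min(),
        ])

    def normalize(feature_matrix):
        """Z-score normalization: zero mean and unit variance per feature column."""
        mean = feature_matrix.mean(axis=0)
        std = feature_matrix.std(axis=0)
        return (feature_matrix - mean) / (std + 1e-12)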
3.2 Vector Space Model for Scoring:
After computing all the normalized values of each feature, we applied a vector space model for scoring to classify the emotions of the audio samples. Out of the 18 normalized feature values, we selected 7 measurements to represent each audio clip. The representative features were selected by picking those whose values varied most across the emotions: for each feature measurement, we summed the mean values of the individual speakers for one emotion, and repeated the same calculation for each emotion. Since we tried to classify five different emotions, 5 values of each feature were obtained, and the variance was calculated over that set. The 7 feature measurements with the highest variance were selected to represent the audio clip. Then, we computed a single value, or score, to represent the audio clip with the equation

    score = w_1 * feature_1 + w_2 * feature_2 + ... + w_n * feature_n + c        (3.3.1)

where each w_i is a constant weight, feature_i is a normalized feature value, n is the number of selected features, and c is a constant. We tried different weight values to increase the contrast between different emotions. After the score was computed for each audio sample, we constructed a graph to analyze the relationship between the score and the emotional states.
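The selection-by-variance step and the scoring equation can be summarized in the following sketch. It is illustrative only: NumPy and the function names are our own choices, the weights passed to score are placeholders rather than the values tuned in this study, and we assume every speaker has clips for every emotion.

    import numpy as np

    def select_top_features(features, emotions, speakers, k=7):
        """Pick the k feature measurements whose values vary most across emotions.

        features : (n_clips, n_features) array of normalized feature values
        emotions : emotion label of each clip
        speakers : speaker code of each clip
        """
        emotions = np.asarray(emotions)
        speakers = np.asarray(speakers)

        per_emotion_values = []
        for emotion in sorted(set(emotions)):
            # Sum the per-speaker mean values of each feature for this emotion.
            speaker_means = [
                features[(emotions == emotion) & (speakers == spk)].mean(axis=0)
                for spk in sorted(set(speakers))
            ]
            per_emotion_values.append(np.sum(speaker_means, axis=0))

        # Variance across the per-emotion values, computed separately per feature;
        # the k features with the largest variance are kept.
        spread = np.var(per_emotion_values, axis=0)
        return np.argsort(spread)[::-1][:k]

    def score(feature_vector, weights, c=0.0):
        """Score of Eq. (3.3.1): weighted sum of the selected feature values plus a constant."""
        return float(np.dot(weights, feature_vector)) + c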
3.3 Support Vector Machine (SVM) Recognition:
In addition to the scoring method, a Support Vector Machine was used to recognize the emotions of the audio samples. An application called SVMlight was used to recognize the emotional states of all utterances. SVMlight is an SVM command-line tool developed by Thorsten Joachims at Cornell University [Joachime 2009]. Using any SVM is a two-step process: training and recognition. In the training process, the SVM calculates the hyperplane that separates the emotions. First, we needed to construct a training set out of the audio samples. For each audio clip, we constructed a vector with 18 normalized feature values in the following format:
{Class} 1: {feature value 1} 2: {feature value 2} ... {n}: {feature value n}
where {Class} is a numeric value representing the emotional state, each feature value is the normalized value of a feature, and n is the number of features used. In our case, n was 18. We constructed a vector list from the audio samples of three different speakers as the training set: we put the audio clip vectors of speakers 03, 10, and 16 in the training set and applied the SVM training method to obtain the hyperplane model. We then put the audio clip vectors of speaker 13 in the testing set and tested the ability of the SVM to recognize emotions. We repeated the same process for the four combinations of three speakers in the training set and the remaining speaker in the testing set, until all emotion clips of all speakers had been tested. In addition, using the same speaker splits, we tested the recognition ability with only the 7 features selected in the previous section, and compared the result against using the full feature vector in the SVM. A minimal sketch of this setup is given below.
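The following sketch shows one way this leave-one-speaker-out setup could be prepared for SVMlight. It is illustrative only: the clips list, the helper function, and the file names are ours, and the feature values shown are placeholders rather than measurements from the study.

    # A minimal sketch of the leave-one-speaker-out setup described above.
    # `clips` stands in for the real data: one (speaker code, emotion class,
    # normalized feature vector) tuple per audio clip; the values are placeholders.
    clips = [
        ("03", 1, [0.42, -0.13, 1.05]),
        ("10", 2, [-0.71, 0.88, 0.02]),
        ("13", 1, [0.15, 0.34, -0.27]),
        ("16", 3, [-0.08, -0.52, 0.61]),
    ]

    def write_svmlight_file(path, examples):
        """Write clips in the SVMlight text format:
        {Class} 1:{feature value 1} 2:{feature value 2} ... n:{feature value n}
        """
        with open(path, "w") as fh:
            for label, vector in examples:
                pairs = " ".join(f"{i}:{value:.6f}" for i, value in enumerate(vector, start=1))
                fh.write(f"{label} {pairs}\n")

    speakers = ["03", "10", "13", "16"]
    for held_out in speakers:
        train = [(lab, vec) for spk, lab, vec in clips if spk != held_out]
        test = [(lab, vec) for spk, lab, vec in clips if spk == held_out]
        write_svmlight_file(f"train_{held_out}.dat", train)
        write_svmlight_file(f"test_{held_out}.dat", test)
        # The SVMlight command-line tools are then run on the generated files, e.g.:
        #   svm_learn    train_03.dat  model_03
        #   svm_classify test_03.dat   model_03  predictions_03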
Vector scoring is able to show classification capability by combining different feature variables into one equation. There are many automated methods for finding such equations. An artificial neural network (ANN) can be trained to obtain the optimal feature weights for emotion classification, and k-nearest neighbors (KNN) can determine the classification boundaries of the features in the vector space. Other methods such as decision trees, logistic regression, and random forests can also be used as classification methods once the scores have been obtained by vector scoring. In this study, we use only SVM as our main recognition method; extended studies are required to compare the effectiveness of SVM and the other methods in emotion recognition. A sketch of how such a comparison could be set up follows.
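The snippet below sketches how the same feature vectors could be fed to several standard classifiers for such a comparison. It is illustrative only: scikit-learn is assumed here purely for brevity (the study itself used SVMlight), and X_train, y_train, X_test, and y_test stand for the speaker-wise splits described in Section 3.3.

    # Sketch only: scikit-learn classifiers trained on the same normalized features.
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    candidates = {
        "SVM": SVC(kernel="linear"),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "Decision tree": DecisionTreeClassifier(),
        "Logistic regression": LogisticRegression(max_iter=1000),
        "Random forest": RandomForestClassifier(n_estimators=100),
    }

    def compare(X_train, y_train, X_test, y_test):
        """Train each candidate classifier and report its test accuracy."""
        for name, model in candidates.items():
            model.fit(X_train, y_train)
            print(f"{name}: {model.score(X_test, y_test):.3f}")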
Support Vector Machine is an effective tool for emotion recognition. From the results, we noticed that the performance of the SVM depends on the quality of the training set we provide initially. The accuracy for recognizing Fear, Boredom, and Sadness is higher when the vector is modeled using only the 7 major features. Although the SVM recognized a small number of the Sadness utterances as Boredom, and a few Fear audio clips as Anger, the result is highly acceptable if we consider that Boredom and Sadness are similar in nature. Also, an utterance may mix more than one emotion. Sometimes there is no definite conclusion about which emotion an utterance should be perceived as.
Another observation is that increasing the dimensionality of the training vector does not necessarily improve the emotion detection capability of the SVM. When we blindly train the SVM model with all the features we constructed, the accuracy of recognizing Anger and Happiness increases slightly, but the accuracy of recognizing the other emotions decreases at the same time. In some cases, the results are not acceptable, such as a Fear utterance being mistaken for a Sad utterance, or a Bored utterance for a Happy one.
Judgment based on good-quality features with high discriminating power will improve the accuracy of the system and also result in more consistent predictions. In our study, the SVM performed poorly in distinguishing between Anger and Happiness because we have not discovered a feature with enough discriminating power to separate them. Moreover, we can improve the performance by using different approaches for measuring and selecting audio features. We may also examine the temporal features and apply other modeling methods for emotion classification. We can combine and compare different recognition methods to achieve an optimal result when we attempt to build a high-performance system for recognizing emotions from spoken words.