3.1 Feature Extraction and Normalization:
Initially, we built an audio library for different emotions. We selected the audio samples from the Berlin Database of Emotional Speech [Bartels 2009]. The Berlin Database, or EMO-DB, is a collection of over 500 recorded utterances. The utterances were recorded by different actors in 7 emotional states: Anger, Anxiety or Fear, Boredom, Disgust, Happiness, Sadness, and Neutral. The utterances also cover up to 10 different sentence contents for each emotion and each actor. We selected this library because it is a well-established collection with a clear classification of the emotional states of speech. The audio samples in this library are spoken in German, which gives an additional advantage to non-German-speaking researchers: we could focus purely on the acoustic features of the audio samples for emotion recognition.
We analyzed 196 audio clips from four different actors, two male and two female. The information about the speakers is shown in Table 3.1. The emotion codes are W, L, A, F, T, and N, which stand for Anger, Boredom, Fear, Happiness, Sadness, and Neutral respectively. The 10 different sentence contents of the audio clips and their text codes can be found in Table 3.2.
Table 3.1
The information of the speakers in EMO-DB [Bartels 2009]
Code | Speaker Details    | Code | Speaker Details
03   | Male, 31 years old | 13   | Female, 32 years old
10   | Male, 32 years old | 16   | Female, 31 years old
Table 3.2
The contents in EMO-DB [Bartels 2009]
Code | Content (German)                                                                   | English Translation
a01  | Der Lappen liegt auf dem Eisschrank.                                               | The tablecloth is lying on the fridge.
a02  | Das will sie am Mittwoch abgeben.                                                  | She will hand it in on Wednesday.
a04  | Heute abend könnte ich es ihm sagen.                                               | Tonight I could tell him.
a05  | Das schwarze Stück Papier befindet sich da oben neben dem Holzstück.               | The black sheet of paper is located up there beside the piece of timber.
a07  | In sieben Stunden wird es soweit sein.                                             | In seven hours it will be time.
b01  | Was sind denn das für Tüten, die da unter dem Tisch stehen?                        | What about the bags standing there under the table?
b02  | Sie haben es gerade hochgetragen und jetzt gehen sie wieder runter.                | They just carried it upstairs and now they are going down again.
b03  | An den Wochenenden bin ich jetzt immer nach Hause gefahren und habe Agnes besucht. | At the weekends I have always gone home and visited Agnes.
b09  | Ich will das eben wegbringen und dann mit Karl was trinken gehen.                  | I will just take this away and then go for a drink with Karl.
b10  | Die wird auf dem Platz sein, wo wir sie immer hinlegen.                            | It will be in the place where we always put it.
The audio samples in EMO-DB were recorded against a quiet background, which eliminated the need for an extra noise-reduction step in the analysis. The audio samples are stored as mono-channel Microsoft WAVE files sampled at 16 kHz; the WAVE format holds the digitized analog signal directly. After the audio samples were selected, we extracted the features from the audio clips.
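The following is a minimal sketch of how features of this kind can be extracted and normalized. It is illustrative only: the study's actual 18 feature measurements are not reproduced here, and librosa, the energy and pitch statistics, and the z-score normalization are assumptions made for demonstration.

    # Illustrative sketch only: librosa and the statistics below are assumptions,
    # not the exact feature set used in this study.
    import numpy as np
    import librosa

    def extract_features(wav_path):
        """Extract simple prosodic statistics from a 16 kHz mono WAVE file."""
        y, sr = librosa.load(wav_path, sr=16000, mono=True)

        # Frame-level short-time energy (RMS) contour.
        rms = librosa.feature.rms(y=y)[0]

        # Frame-level fundamental frequency (pitch) contour; unvoiced frames are NaN.
        f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=50, fmax=500, sr=sr)
        f0 = f0[~np.isnan(f0)]

        # Summarize each contour with a few per-clip statistics.
        return np.array([
            rms.mean(), rms.std(), rms.max(),
            f0.mean(), f0.std(), f0.max() - f0.min(),
        ])

    def normalize(feature_matrix):
        """Z-score normalization: zero mean and unit variance per feature column."""
        mean = feature_matrix.mean(axis=0)
        std = feature_matrix.std(axis=0)
        return (feature_matrix - mean) / (std + 1e-12)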
3.2 Vector Space Model for Scoring:
After computing all the normalized values of each feature, we applied a vector space model for scoring to classify the emotions of the audio samples. Out of the 18 normalized feature values, we selected 7 measurements to represent each audio clip. The representative features were selected by picking those whose values varied most across the emotions: for each feature measurement, we summed the mean values of the individual speakers for one emotion, and repeated the same calculation for each emotion. Since we tried to classify five different emotions, 5 values of each feature were obtained, and the variance was calculated over that set. The 7 feature measurements with the highest variance were selected to represent the audio clip. Then, we computed a single value, or score, to represent the audio clip with the equation

    score = w_1 * feature_1 + w_2 * feature_2 + ... + w_n * feature_n + c        (3.3.1)

where each w_i is a constant weight, feature_i is a normalized feature value, n is the number of selected features, and c is a constant. We tried different weight values to increase the contrast between different emotions. After the score was computed for each audio sample, we constructed a graph to analyze the relationship between the score and the emotional states.
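The selection-by-variance step and the scoring equation can be summarized in the following sketch. It is illustrative only: NumPy and the function names are our own choices, the weights passed to score are placeholders rather than the values tuned in this study, and we assume every speaker has clips for every emotion.

    import numpy as np

    def select_top_features(features, emotions, speakers, k=7):
        """Pick the k feature measurements whose values vary most across emotions.

        features : (n_clips, n_features) array of normalized feature values
        emotions : emotion label of each clip
        speakers : speaker code of each clip
        """
        emotions = np.asarray(emotions)
        speakers = np.asarray(speakers)

        per_emotion_values = []
        for emotion in sorted(set(emotions)):
            # Sum the per-speaker mean values of each feature for this emotion.
            speaker_means = [
                features[(emotions == emotion) & (speakers == spk)].mean(axis=0)
                for spk in sorted(set(speakers))
            ]
            per_emotion_values.append(np.sum(speaker_means, axis=0))

        # Variance across the per-emotion values, computed separately per feature;
        # the k features with the largest variance are kept.
        spread = np.var(per_emotion_values, axis=0)
        return np.argsort(spread)[::-1][:k]

    def score(feature_vector, weights, c=0.0):
        """Score of Eq. (3.3.1): weighted sum of the selected feature values plus a constant."""
        return float(np.dot(weights, feature_vector)) + c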
3.3 Support Vector Machine (SVM) Recognition:
In addition to the scoring method, a Support Vector Machine was used to recognize the emotions of the audio samples. An application called SVMlight was used to recognize the emotional states of all utterances. SVMlight is an SVM command-line tool developed by Thorsten Joachims at Cornell University [Joachime 2009]. Using any SVM is a two-step process: training and recognition. In the training process, the SVM calculates the hyperplane that separates the emotions. First, we needed to construct a training set out of the audio samples. For each audio clip, we constructed a vector with 18 normalized feature values in the following format:
{Class} 1: {feature value 1} 2: {feature value 2} ... {n}: {feature value n}
where {Class} is a numeric value representing the emotional state, each feature value is the normalized value of a feature, and n is the number of features used. In our case, n was 18. We constructed a vector list from the audio samples of three different speakers as the training set: we put the audio clip vectors of speakers 03, 10, and 16 in the training set and applied the SVM training method to obtain the hyperplane model. We then put the audio clip vectors of speaker 13 in the testing set and tested the ability of the SVM to recognize emotions. We repeated the same process for the four combinations of three speakers in the training set and the remaining speaker in the testing set, until all emotion clips of all speakers had been tested. In addition, using the same speaker splits, we tested the recognition ability with only the 7 features selected in the previous section, and compared the result against using the full feature vector in the SVM. A minimal sketch of this setup is given below.
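The following sketch shows one way this leave-one-speaker-out setup could be prepared for SVMlight. It is illustrative only: the clips list, the helper function, and the file names are ours, and the feature values shown are placeholders rather than measurements from the study.

    # A minimal sketch of the leave-one-speaker-out setup described above.
    # `clips` stands in for the real data: one (speaker code, emotion class,
    # normalized feature vector) tuple per audio clip; the values are placeholders.
    clips = [
        ("03", 1, [0.42, -0.13, 1.05]),
        ("10", 2, [-0.71, 0.88, 0.02]),
        ("13", 1, [0.15, 0.34, -0.27]),
        ("16", 3, [-0.08, -0.52, 0.61]),
    ]

    def write_svmlight_file(path, examples):
        """Write clips in the SVMlight text format:
        {Class} 1:{feature value 1} 2:{feature value 2} ... n:{feature value n}
        """
        with open(path, "w") as fh:
            for label, vector in examples:
                pairs = " ".join(f"{i}:{value:.6f}" for i, value in enumerate(vector, start=1))
                fh.write(f"{label} {pairs}\n")

    speakers = ["03", "10", "13", "16"]
    for held_out in speakers:
        train = [(lab, vec) for spk, lab, vec in clips if spk != held_out]
        test = [(lab, vec) for spk, lab, vec in clips if spk == held_out]
        write_svmlight_file(f"train_{held_out}.dat", train)
        write_svmlight_file(f"test_{held_out}.dat", test)
        # The SVMlight command-line tools are then run on the generated files, e.g.:
        #   svm_learn    train_03.dat  model_03
        #   svm_classify test_03.dat   model_03  predictions_03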
Vector scoring is able to show classification capability by combining different feature variables into one equation. There are many automated methods for finding such equations. An artificial neural network (ANN) can be trained to obtain the optimal feature weights for emotion classification, and k-nearest neighbors (KNN) can determine the classification boundaries of the features in the vector space. Other methods such as decision trees, logistic regression, and random forests can also be used as classification methods once the scores have been obtained by vector scoring. In this study, we use only SVM as our main recognition method; extended studies are required to compare the effectiveness of SVM and the other methods in emotion recognition. A sketch of how such a comparison could be set up follows.
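The snippet below sketches how the same feature vectors could be fed to several standard classifiers for such a comparison. It is illustrative only: scikit-learn is assumed here purely for brevity (the study itself used SVMlight), and X_train, y_train, X_test, and y_test stand for the speaker-wise splits described in Section 3.3.

    # Sketch only: scikit-learn classifiers trained on the same normalized features.
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    candidates = {
        "SVM": SVC(kernel="linear"),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "Decision tree": DecisionTreeClassifier(),
        "Logistic regression": LogisticRegression(max_iter=1000),
        "Random forest": RandomForestClassifier(n_estimators=100),
    }

    def compare(X_train, y_train, X_test, y_test):
        """Train each candidate classifier and report its test accuracy."""
        for name, model in candidates.items():
            model.fit(X_train, y_train)
            print(f"{name}: {model.score(X_test, y_test):.3f}")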
Support Vector Machine is an effective tool for emotion recognition. From the results, we noticed that the performance of the SVM depends on the quality of the training set we provide initially. The accuracy for recognizing Fear, Boredom, and Sadness is higher when the vector is modeled using only the 7 major features. Although the SVM recognized a small number of the Sadness utterances as Boredom, and a few Fear audio clips as Anger, the result is highly acceptable if we consider that Boredom and Sadness are similar in nature. Also, an utterance may mix more than one emotion. Sometimes there is no definite conclusion about which emotion an utterance should be perceived as.
Another observation is that increasing the dimensionality of the training vector does not necessarily improve the emotion detection capability of the SVM. When we blindly train the SVM model with all the features we constructed, the accuracy of recognizing Anger and Happiness increases slightly, but the accuracy of recognizing the other emotions decreases at the same time. In some cases, the results are not acceptable, such as a Fear utterance being mistaken for a Sad utterance, or a Bored utterance for a Happy one.
Judgment based on good-quality features with high discriminating power will improve the accuracy of the system and also result in more consistent predictions. In our study, the SVM performed poorly in distinguishing between Anger and Happiness because we have not discovered a feature with enough discriminating power to separate them. Moreover, we can improve the performance by using different approaches for measuring and selecting audio features. We may also examine the temporal features and apply other modeling methods for emotion classification. We can combine and compare different recognition methods to achieve an optimal result when we attempt to build a high-performance system for recognizing emotions from spoken words.