The performance of an SER system depends mainly on the modeling methodology and on the relations between emotional labels and speech features across languages. Many audio datasets are publicly available, such as EMO-DB (535 utterances), IEMOCAP (10039 utterances), RAVDESS (1440 samples), and EMOVO (588 utterances) (Burkhardt et al. 2005; Busso et al. 2008; Livingstone and Russo 2018; Costantini et al. 2014), and are widely used in SER research. However, emotional analysis of the Odia language is rarely addressed. In this work, an audio dataset for the Odia language is created, and two benchmark datasets, RAVDESS and IEMOCAP, are used to cross-validate the Odia dataset. Odia is an Indo-Aryan language spoken in various parts of eastern India and has been influenced by neighboring languages of both the Dravidian and Aryan families. Its dialectal variations are Baleswari (Balasore), Sambalpuri (Sambalpur and the western districts), Cuttacki (Cuttack), Beharampuri (Beharampur), Ganjami (Ganjam and Koraput area), Chhattisgarhi (areas of Chhattisgarh adjoining Odisha), and Medinipuri (Midnapur district of West Bengal on the Odisha border) (Jain and Cardona 2007). After presenting the datasets, the feature extraction and feature selection processes are described, followed by the proposed baseline model.
3.1 Datasets
The SITB-OSED database contains 7317 utterances recorded from 12 Odia speakers (6 male and 6 female) in the Odia Indic language. They express six different emotional states: happiness, surprise, anger, fear, sadness, and disgust. Each emotion class contains a nearly equal number of utterances so that classification performance can be measured more reliably. The waveforms of the six emotions are shown in Fig. 1. All utterances were recorded at a sampling rate of 22050 Hz with 16-bit quantization and were downsampled to 16000 Hz for feature extraction. Table 1 shows the number of audio samples for each emotion in the SITB-OSED dataset. The dataset will be publicly available at: https://www.speal.org/sitb-osed/.
Table 1
The number of emotion samples in the SITB-OSED dataset
Emotion | Number | Participation (%)
Anger | 1197 | 16.36
Disgust | 1231 | 16.82
Fear | 1196 | 16.34
Happiness | 1197 | 16.36
Sadness | 1309 | 17.89
Surprise | 1187 | 16.22
RAVDESS is an English-language dataset containing recordings of 24 actors (12 female and 12 male) with 1440 utterances. The utterances cover eight emotions: anger, fear, disgust, sadness, calm, happiness, neutral, and surprise.
The IEMOCAP dataset is used as a second benchmark for the experiments. It is divided into two parts, scripted dialogs and improvised dialogs. The whole dataset contains five sessions recorded by ten actors, with a total duration of about 12 hours. Each session includes recordings of two actors (1 female and 1 male) at a sampling rate of 16000 Hz. The audio samples are labeled with nine different emotions (anger, excitement, happiness, frustration, sadness, disgust, neutral, surprise, and fear). In this experiment, only four emotion classes (anger, neutral, happiness, and sadness) from the improvised samples are used, as most previous work on IEMOCAP uses these four labels.
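As a hedged illustration of this class selection, the improvised subset could be filtered along the lines below; the `metadata` list and its field names are hypothetical placeholders rather than the official IEMOCAP file format.

```python
# Hypothetical sketch: keep only the four classes used here (anger, neutral,
# happiness, sadness) from improvised IEMOCAP utterances. `metadata` and its
# field names are illustrative; they are not the official IEMOCAP file layout.
KEEP_LABELS = {"ang": "anger", "neu": "neutral", "hap": "happiness", "sad": "sadness"}

def select_improvised_subset(metadata):
    """Return (wav_path, emotion) pairs for improvised utterances with a kept label."""
    subset = []
    for item in metadata:
        if item["session_type"] == "improvised" and item["label"] in KEEP_LABELS:
            subset.append((item["wav_path"], KEEP_LABELS[item["label"]]))
    return subset
```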
3.2 Feature extraction
The feature extraction process is vital for building a successful deep neural network. For this study, 70 features are extracted for analysis, covering voice quality, prosodic, and spectral features. Spectral features were extracted with the Python Librosa audio library (McFee et al. 2015), and prosodic and voice quality features were extracted with the Praat software; together these comprise 13 MFCCs, 16 LPCs, and 41 prosodic and voice quality features. The prosodic and voice quality features include HNR, fundamental frequency (mean and standard deviation), formant frequencies (minimum and maximum of the first to fifth formant frequencies, F1–F5), formant bandwidths (mean, standard deviation, and median of the first to fifth formant bandwidths, B1–B5), jitter (local, rap, local absolute, ppq5, ddp), shimmer (local, localdb, apq3, apq5, apq11, dda), and PCA of shimmer and jitter.
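A minimal sketch of this kind of extraction is shown below, assuming the Librosa and parselmouth (Praat wrapper) Python libraries. Only a small subset of the 70 features (MFCCs, LPCs, and f0 statistics) is illustrated, and the function is ours, not code from the original work.

```python
# Sketch of the kind of feature extraction described above; a minimal subset only.
# Spectral features via librosa, pitch statistics via Praat through parselmouth
# (the full 70-dimensional set also needs Praat's jitter/shimmer/formant queries).
import numpy as np
import librosa
import parselmouth

def extract_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)           # downsample to 16 kHz

    # 13 MFCCs, averaged over time frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

    # 16 LPC coefficients (librosa returns order+1 values; drop the leading 1)
    lpc = librosa.lpc(y, order=16)[1:]

    # Fundamental-frequency statistics from Praat
    snd = parselmouth.Sound(wav_path)
    f0 = snd.to_pitch().selected_array["frequency"]
    f0 = f0[f0 > 0]                                    # keep voiced frames only
    f0_mean, f0_std = (f0.mean(), f0.std()) if f0.size else (0.0, 0.0)

    return np.concatenate([mfcc, lpc, [f0_mean, f0_std]])
```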
3.3 Feature selection
The performance of any model or system depends largely on the number of input features. Here, a feature selection method is incorporated before the classification task to remove correlated features, because a large feature set contains irrelevant features that lead to over-fitting and reduce accuracy. Selecting a smaller number of features allows a better model fit and improves performance.
In this work, a GBDT feature selection mechanism is used to obtain the best features for emotion classification (Wijaya et al. 2019; Zulfiqar et al. 2021). It is a statistical approach that provides an importance score for each feature and captures how much dependency exists between pairs of features. Gradient boosting combines weak base learners that individually have high bias and low variance. The method is computationally expensive but gives higher accuracy than other feature selection methods such as recursive elimination and forward and backward selection. In this algorithm, 75% of the data is used for training.
The training dataset is defined as \(\{u_i, v_i\}_{i=1}^{p}\), where \(p\) is the total number of samples, \(u_i\) is an input feature vector, and \(v_i\) is the corresponding emotion class. The algorithm below is used to ensure the convergence of the GBDT. The base learner is \(h(u)\), where \(u_i = (u_{1i}, u_{2i}, \dots, u_{mi})\) and \(m\) is the number of predictor variables. The multiplier \(\Psi_n\) of the normalized multiclass GBDT at the \(n^{th}\) iteration is given by
$$\Psi_n = \underset{\Psi}{\text{argmin}} \sum_{i=1}^{p} L\left(v_i,\, F_{n-1}(u_i) + \Psi\, h_n(u_i)\right) \tag{1}$$
where \(L(v, F(u))\) is a differentiable loss function and \(n\) is the iteration index.
The pseudocode of the GBDT is described as follows; an illustrative implementation sketch follows the listing:
1. Input data: \(\{u_i, v_i\}_{i=1}^{p}\) with the total number of iterations \(k\).
2. Initialize the result with the selected feature set; the initial constant value \(\Psi\) of the model is given by:
$$F_0(u) = \underset{\Psi}{\text{argmin}} \sum_{i=1}^{p} L(v_i, \Psi), \quad i = 1, 2, \dots, p \tag{2}$$
3. For iterations \(n = 1\) to \(k\):
i) Update the weights for the targets based on the previous output.
ii) Fit the base learner \(h_n\) on the selected sample of the dataset.
4. Update the model:
$$F_n(u) = F_{n-1}(u) + \Psi_n h_n(u) \tag{3}$$
5. Return the final model \(F_k(u)\) after the final iteration.
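As a minimal sketch of this selection step, scikit-learn's GradientBoostingClassifier can serve as a stand-in GBDT: rank features by the fitted importance scores and keep the top 30. The default hyper-parameters used here are assumptions, not the paper's settings; only the 75% training fraction follows the text.

```python
# Sketch of GBDT-based feature selection with scikit-learn: fit a gradient-boosted
# tree ensemble on 75% of the data, rank features by importance, keep the top 30.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def select_top_features(U, v, n_keep=30, seed=0):
    U_train, _, v_train, _ = train_test_split(U, v, train_size=0.75, random_state=seed)
    gbdt = GradientBoostingClassifier(random_state=seed)   # hyper-parameters left at defaults
    gbdt.fit(U_train, v_train)
    ranking = np.argsort(gbdt.feature_importances_)[::-1]  # most important first
    return np.sort(ranking[:n_keep])                       # indices of the selected features

# Usage (hypothetical arrays): idx = select_top_features(features, labels); X_sel = features[:, idx]
```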
From the above feature selection algorithm, the 30 best features, those that contribute most to recognition accuracy, are obtained. Table 2 lists these 30 selected features, comprising the spectral, prosodic, and voice quality features used in the models.
Table 2
Selected features after feature selection method
Feature | Group
MFCCs: 0 to 10 and 12; LPCs: 1, 2, 3, 5, 7, 8, 11 | Spectral features
Mean (f0), Standard deviation (f0), HNR, Local Jitter, Max (F3), Min (F5), Median (B1, B2), apq11 Shimmer, Mean (B5), dda Shimmer | Combined prosodic and voice quality features
3.4 Proposed baseline model
The proposed baseline model is a cascaded CNN-BiGRU model with a self-attention layer for emotion classification from the speech features of raw audio files. It consists of two parts: the first is a one-dimensional convolutional block that cascades into the second, a Bi-GRU block, followed by a self-attention layer and a fully connected network. The proposed baseline model is shown in Fig. 2.
3.4.1 Convolutional neural network
First, experiments were conducted using the pre-trained AlexNet network (Krizhevsky et al. 2012) instead of a custom CNN. The pre-trained AlexNet model requires large amounts of data because it was trained on millions of images, and in this experiment it did not reach good accuracy because of the limited amount of speech data. Therefore, our own CNN network was created. Several model configurations with various numbers of convolutional layers, kernel sizes, kernel filters, and other hyper-parameters were simulated, and the number of convolutional layers and the parameter values that gave the most accurate results were selected.
The proposed framework consists of four 1D convolutional layers. A 1D convolutional network is chosen over a 2D one because it can handle larger feature datasets and learns sequential data faster than a 2D CNN model (Kiranyaz et al. 2019). Each convolutional layer is followed by batch normalization, a pooling layer, and a dropout layer. The dataset size is S×R, where S is the number of samples and R is the final dimension of the feature vector.
The first convolutional layer receives a 30×1 input. It uses 256 filters with a kernel size of 5×1, followed by batch normalization and an activation layer. Batch normalization allows each layer to learn independently, and activation uses rectified linear units (ReLU). A max-pooling layer with pool size 2 is then applied to compress the data and reduce complexity. After the pooling operation, a dropout layer is used to reduce over-fitting and thereby improve the model's recognition rate; the dropout rate is set to 0.4, meaning 40% of the neurons are dropped during training. This block is repeated four times with different filter sizes and kernel sizes, and the output of the last convolutional layer is fed to the bidirectional gated recurrent unit.
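A minimal Keras sketch of this convolutional front end is given below, assuming TensorFlow/Keras. Only the first block's settings (256 filters, kernel size 5, input 30×1, pool size 2, dropout 0.4) are stated in the text; the filter and kernel sizes of the remaining three blocks are placeholders, not the paper's values.

```python
# Sketch of the baseline's convolutional front end (TensorFlow/Keras assumed).
from tensorflow.keras import layers

def conv_block(x, filters, kernel_size):
    """Conv1D -> BatchNorm -> ReLU -> MaxPool(2) -> Dropout(0.4), as described above."""
    x = layers.Conv1D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    return layers.Dropout(0.4)(x)

inputs = layers.Input(shape=(30, 1))                     # 30 selected features, one channel
x = conv_block(inputs, 256, 5)                           # stated settings for the first block
for filters, kernel in [(128, 5), (128, 3), (64, 3)]:    # remaining blocks: assumed settings
    x = conv_block(x, filters, kernel)
cnn_output = x                                           # fed to the Bi-GRU layers (Sect. 3.4.2)
```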
3.4.2 Bi-directional gated recurrent unit
The bidirectional gated recurrent unit is an enhanced type of recurrent neural network (Zhu et al. 2020). It has two RNN paths, one forward and one backward, concatenated at the same output layer. Since an audio signal is a frequency-based, time-varying speech signal, the Bi-GRU layer adds two independent hidden layers, one in the forward and one in the backward direction.
For the \(i^{th}\) utterance, the input feature vector \(u\) encoded at time step \(t\) is described by the two layers as follows:
$$\overrightarrow{h_{i,t}} = \overrightarrow{GRU}\left\{\left(u_{i,t},\, \overrightarrow{h_{i,t-1}}\right) + \overrightarrow{b_f}\right\} \tag{4}$$
$$\overleftarrow{h_{i,t}} = \overleftarrow{GRU}\left\{\left(u_{i,t},\, \overleftarrow{h_{i,t+1}}\right) + \overleftarrow{b_b}\right\} \tag{5}$$
$$\widehat{y_i} = \overrightarrow{h_{i,t}} + \overleftarrow{h_{i,t}} \tag{6}$$
where \(\overrightarrow{h_{i,t}}\), \(\overleftarrow{h_{i,t}}\), and \(\widehat{y_i}\) are the forward path, backward path, and output state, respectively. Similarly, \(u_{i,t}\), \(\overrightarrow{h_{i,t-1}}\), and \(\overleftarrow{h_{i,t+1}}\) are the input vector of the \(i^{th}\) utterance at time \(t\), the forward hidden state at time step \(t-1\), and the backward hidden state at time step \(t+1\), with the corresponding forward bias \(\overrightarrow{b_f}\) and backward bias \(\overleftarrow{b_b}\). The baseline framework uses two Bi-GRU layers because they experimentally perform better than a single Bi-GRU layer.
The first Bi-GRU layer, with 64 units, is connected to the output of the last convolutional layer. The second Bi-GRU layer, with 32 units, is serially connected to the output of the first; each layer's output is activated by the same 'ReLU' activation function and followed by a dropout rate of 0.4. The baseline cascaded CNN-BiGRU model loses some long-term-dependency information when features pass through the CNN layers; to overcome this and improve performance, two separate CNN and Bi-GRU channels are used in the proposed model.
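Continuing the convolutional sketch above, the two stacked Bi-GRU layers (64 and 32 units, ReLU activation, dropout 0.4) might be added as follows; `cnn_output` is the tensor produced by that earlier sketch.

```python
# Continuation of the baseline sketch: two stacked Bi-GRU layers after the CNN blocks.
from tensorflow.keras import layers

x = layers.Bidirectional(layers.GRU(64, return_sequences=True, activation="relu"))(cnn_output)
x = layers.Dropout(0.4)(x)
x = layers.Bidirectional(layers.GRU(32, return_sequences=True, activation="relu"))(x)
bigru_output = layers.Dropout(0.4)(x)   # passed on to the self-attention layer
```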
3.4.3 Self-attention layer
After the important feature information passes through the output of the Bi-GRU layers, a self-attention mechanism is constructed to capture further important information (Li et al. 2021). The self-attention layer does not forget past information and responds more accurately to the sequence of information. A flatten layer is then used to obtain a fixed-length weight matrix, which is passed through a dense layer with 64 hidden units. The final dense layer contains 8 neurons for the RAVDESS dataset, 7 neurons for the Odia dataset, and 4 neurons for the IEMOCAP dataset, equal to the number of emotions in each dataset, and the emotion is classified with a 'Softmax' classifier. The baseline model uses 'categorical cross-entropy' as the loss function and the 'Adam' optimizer with a learning rate of 0.0001 and a decay rate of 1e-06.
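A sketch of the attention head and classifier is shown below, continuing from `bigru_output` in the previous sketch. Keras's built-in dot-product Attention layer stands in for the self-attention mechanism, whose exact formulation is not detailed here, and the `decay` argument assumes an older Keras optimizer API.

```python
# Continuation of the baseline sketch: self-attention, flatten, dense, softmax.
from tensorflow.keras import layers, models, optimizers

attended = layers.Attention()([bigru_output, bigru_output])  # dot-product self-attention (query = value)
x = layers.Flatten()(attended)
x = layers.Dense(64, activation="relu")(x)                   # 64 hidden units; activation assumed

num_classes = 7  # 7 for the Odia dataset, 8 for RAVDESS, 4 for IEMOCAP (per the text)
outputs = layers.Dense(num_classes, activation="softmax")(x)

baseline_model = models.Model(inputs, outputs)               # `inputs` from the convolutional sketch
baseline_model.compile(
    loss="categorical_crossentropy",
    optimizer=optimizers.Adam(learning_rate=1e-4, decay=1e-6),  # `decay` as in older Keras APIs
    metrics=["accuracy"],
)
```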
3.5 Proposed model
The proposed baseline model described above forms the basis for the final proposed model. The baseline is modified by changing, adding, and removing some layers to better address the emotion classification task. The characteristics of the final proposed model are discussed below, and Fig. 3 shows its architecture.
The first part of the proposed model consists of two parallel convolutional layers with different kernel sizes of 15×1 and 5×1 to extract features in the vertical (spectral, voice quality, and prosodic) and horizontal (cross-time) directions. A concatenation layer then maps the selected input features. After mapping, two parallel deep learning networks are used: one is a convolutional network and the other is a Bi-GRU network. The convolutional network is composed of three convolutional layers with filter sizes of 128, 256, and 256 and corresponding kernel sizes of 5×1, 7×1, and 7×1. These three convolutional layers pass all features from the input feature maps; the CNN learns frequency-domain spectral features (MFCCs, LPCs) more accurately than time-series features. Each convolutional layer uses a batch normalization layer with a 'ReLU' activation function, followed by a max-pooling layer to reduce layer complexity and map the feature weights. Finally, a global average pooling (GAP) layer is employed to reduce the number of training parameters and generate feature points for fusion, which also helps counter over-fitting.
The second channel of the parallelized CNN-BiGRU model consists of two Bi-GRU layers with the same numbers of neurons as in the baseline model. This separate Bi-GRU channel easily learns the time-series features in the feature vector sequence (Zhu et al. 2020), so the prosodic and voice quality features pass quickly with minimal loss of information, and higher-level time-series features are learned by the Bi-GRU layers from the input feature map. With the help of the self-attention layer, some long-term dependencies are also learned, and all information is passed through the flatten layer to reshape the feature vector for the fusion operation.
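A rough Keras sketch of this two-channel design is given below. The kernel sizes (15 and 5), the CNN-branch filters (128, 256, 256) and kernels (5, 7, 7), and the Bi-GRU units (64, 32) follow the text, while the filter count of the two parallel input convolutions and the use of Keras's dot-product Attention layer are assumptions.

```python
# Sketch of the two-channel proposed model (TensorFlow/Keras assumed).
from tensorflow.keras import layers

feat_input = layers.Input(shape=(30, 1))

# Two parallel input convolutions with kernel sizes 15 and 5, then concatenation
a = layers.Conv1D(64, 15, padding="same", activation="relu")(feat_input)   # 64 filters assumed
b = layers.Conv1D(64, 5, padding="same", activation="relu")(feat_input)
merged = layers.Concatenate()([a, b])

# Channel 1: CNN branch (filters 128/256/256, kernels 5/7/7), ending in global average pooling
c = merged
for filters, kernel in [(128, 5), (256, 7), (256, 7)]:
    c = layers.Conv1D(filters, kernel, padding="same")(c)
    c = layers.BatchNormalization()(c)
    c = layers.Activation("relu")(c)
    c = layers.MaxPooling1D(pool_size=2)(c)
cnn_branch = layers.GlobalAveragePooling1D()(c)

# Channel 2: Bi-GRU branch (64 and 32 units) with self-attention and flatten
g = layers.Bidirectional(layers.GRU(64, return_sequences=True))(merged)
g = layers.Bidirectional(layers.GRU(32, return_sequences=True))(g)
g = layers.Attention()([g, g])
bigru_branch = layers.Flatten()(g)
```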
3.5.1 Fusion method
Typically, a fusion method uses fused features as input. To combine the two neural networks and obtain an efficient prediction, a confidence-based decision-level method, similar to decision score fusion, is used to produce the final classification results. A confidence-based decision-level fusion technique takes the output of every recognizer as input and determines which combination of features gives a better result (Yao et al. 2020; Wang et al. 2021). The confidence scores obtained from the convolutional module and from the Bi-GRU module are each used as one-dimensional confidence score vectors, and these vectors are summed. The mathematical expression of the fusion vector is as follows:
$$w_1 = \{c_1, c_2, c_3, \dots, c_n\} \tag{7}$$
$$w_2 = \{d_1, d_2, d_3, \dots, d_n\} \tag{8}$$
$$w^{fusion} = f^{sum}(w_1, w_2) \tag{9}$$
Here, \(w_1\) is the output confidence vector of the convolutional (spectral feature) channel and \(w_2\) is the output confidence vector of the Bi-GRU (prosodic feature) channel, while \(w^{fusion}\) denotes the fusion vector. After fusion, the fused vector passes through a dense layer for the final prediction. The number of neurons in the dense layer and the other hyper-parameters are the same as in the baseline model. The proposed model achieves a better recognition rate than the baseline model.
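Continuing the sketch of the two-channel architecture above, the sum-based fusion of Eqs. (7) to (9) and the final prediction layer might be written as follows; the common projection width of 64 and the ReLU projections are assumptions, chosen to match the baseline's dense layer.

```python
# Continuation of the proposed-model sketch: confidence-vector fusion and prediction.
from tensorflow.keras import layers, models, optimizers

# Project each branch to a common width (64 assumed), sum the confidence vectors,
# and classify with softmax, as in Eqs. (7)-(9).
w1 = layers.Dense(64, activation="relu")(cnn_branch)     # w1: CNN-channel confidence vector
w2 = layers.Dense(64, activation="relu")(bigru_branch)   # w2: Bi-GRU-channel confidence vector
w_fusion = layers.Add()([w1, w2])                        # w_fusion = f_sum(w1, w2)

num_classes = 7  # set to the number of emotion classes in the chosen dataset
outputs = layers.Dense(num_classes, activation="softmax")(w_fusion)

proposed_model = models.Model(feat_input, outputs)
proposed_model.compile(loss="categorical_crossentropy",
                       optimizer=optimizers.Adam(learning_rate=1e-4),
                       metrics=["accuracy"])
```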