6.1 Reflection of the approach
As summarized in Chap. 1.2, the research approach comprises a systematic literature search and exploratory prototyping, including porting to possible end devices. The systematic literature search is conducted according to vom Brocke et al. (2009) and includes a forward and backward search. During this process, high-quality literature sources are primarily used, following Rowley and Slack (2004). However, since the literature search is not ongoing, there is always a risk that relevant literature or primary sources are overlooked or that inferior sources are used. Thus, despite the care taken in selecting literature, this research is not entirely exempt from this risk.
The exploratory prototyping, following Floyd (1984), uses both a machine learning and a deep learning approach. Both approaches are based on the same data corpus, which is determined and acquired from the available literature sources according to the criteria described in Chap. 3. Among these criteria are a minimum audio length of one second and a maximum length of 20 seconds. The minimum length of one second is justified because speech or audio files below this duration generally do not contain useful information. The maximum length of 20 seconds, on the other hand, is chosen arbitrarily; a maximum of 10, 15, or 30 seconds would also have been conceivable. It is questionable, however, whether a different selection of audio databases would have resulted from changing this criterion since, as shown in Fig. 4, the majority of the audio files have a duration of fewer than 10 seconds. Another criterion is the restriction to spoken sentences. Since the goal of the prototypes is SER and not audio classification, the audio files must consequently contain speech. Pieces of music may include speech, but it is interfered with by the accompanying instruments.
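The duration criterion above can be illustrated with a short, hedged sketch. The directory layout, file format, and thresholds below are illustrative assumptions, not the exact selection script of this work; only the 1–20 second bounds come from the text.

```python
# Hypothetical sketch: filter candidate audio files by the duration criterion
# (minimum 1 s, maximum 20 s). Paths and thresholds are illustrative.
from pathlib import Path
import soundfile as sf

MIN_DURATION_S = 1.0   # shorter files rarely carry usable speech information
MAX_DURATION_S = 20.0  # upper bound chosen in this work; other values conceivable


def select_files(audio_dir):
    selected = []
    for path in Path(audio_dir).glob("**/*.wav"):
        duration = sf.info(str(path)).duration  # duration in seconds from the file header
        if MIN_DURATION_S <= duration <= MAX_DURATION_S:
            selected.append(path)
    return selected


if __name__ == "__main__":
    files = select_files("databases/")  # placeholder directory name
    print(f"{len(files)} files satisfy the 1-20 s criterion")
```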
Furthermore, emotion recognition in music is not part of this work but is a suggestion for further research. Nevertheless, it would be possible to consider audio files that include background noise, because in real situations communication, and thus speech, does not take place in silence but in noisy environments. In contrast to music, background noises are not in the foreground but in the background. It would therefore be an enrichment to include additional databases with such audio data in the research.
Furthermore, it is specified as a criterion that the files must be in an auditory or audiovisual format. However, no restriction is made on the file type, sampling rate, or dubbing of the files. A limitation to purely auditive file formats would be feasible and would influence the selection of the databases but would not affect the further procedure, because all audio files are transformed to the same format and file type before the model training. The native language used in the audio files is also not a selection criterion, which is why Table 1 shows that the selected databases contain German, English, and Italian language data. It would also be possible to choose audio files in, for example, Turkish, Danish, or Chinese. Since the six basic emotions, according to Darwin (1873) and Ekman (1971), are expressed in the same way across cultures, the native language spoken has no significant influence on the Mel spectrograms, the model training, or the result. However, the selected databases must include both male and female voices; not meeting this criterion may prevent the generalization of the data and lead to overfitting or underfitting of the model. In addition, open accessibility and the presence of labels are mandatory criteria for acquisition: without open access to the databases, the procedures and results of this work cannot be reproduced by third parties, and without labeled data, no supervised machine-learning algorithm is feasible. Investigating whether a different quantity or composition of the databases or of the included languages changes the outcome of this work can be the subject of further research.
Other methodological choices that affect both procedures equally include splitting the data corpus into training and validation partitions in an 80:20 ratio and transforming the audio files to a 16,000 hertz mono format. An alternative division of the training and validation partitions in a ratio of 50:50 (Krizhevsky et al. 2017) or of two-thirds to one-third would represent a potential option. Formatting the audio files in the manner above is the most commonly cited preprocessing method in the literature and is consistent with the approaches taken in this work (Lin and Wei 2005; Lim et al. 2017; Zhang et al. 2018a), but it also leaves room for research with different parameters.
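The two preprocessing choices can be sketched as follows. This is a minimal, hedged illustration rather than the exact scripts of this work; the file names and labels are placeholders, and librosa and scikit-learn are assumed as tooling.

```python
# Hedged sketch of the shared preprocessing: resampling every file to a
# 16,000 hertz mono format and splitting the file list 80:20.
import librosa
import soundfile as sf
from sklearn.model_selection import train_test_split

TARGET_SR = 16_000  # 16 kHz mono, as in the cited preprocessing approaches


def to_mono_16k(in_path, out_path):
    # librosa.load downmixes to mono and resamples to the requested rate
    y, _ = librosa.load(in_path, sr=TARGET_SR, mono=True)
    sf.write(out_path, y, TARGET_SR)


if __name__ == "__main__":
    # Placeholder corpus: ten files with two emotion labels, only to show the split
    files = [f"clip_{i:02d}.wav" for i in range(10)]
    labels = ["anger", "joy"] * 5
    train_f, val_f, train_y, val_y = train_test_split(
        files, labels, test_size=0.2, stratify=labels, random_state=42
    )
    print(len(train_f), "training files,", len(val_f), "validation files")
```

Stratifying the split by emotion label, as in the sketch, keeps the class balance identical in both partitions.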
In implementing the machine learning method, the hyperparameters and their value ranges are defined at the beginning of the training, based on which the algorithm determines the optimum independently. The selection of these hyperparameters is based on the research of Mao et al. (2014). However, other parameters or ranges of values are also conceivable, as in Wang et al. (2015) or Cummins et al. (2017). For feature extraction, the framework openSMILE, including the eGeMAPS, is used, as in Cummins et al. (2017) and Ottl et al. (2020). Although this approach is in line with current research, DEEP SPECTRUM (Ottl et al. 2020) would also be conceivable.
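A hedged sketch of this pipeline is given below: eGeMAPS functionals extracted with the opensmile Python package and an SVM whose hyperparameter ranges are searched automatically. The value ranges, the placeholder feature matrix, and the use of GridSearchCV are illustrative assumptions, not the exact configuration of this work.

```python
# Sketch: eGeMAPS feature extraction plus hyperparameter search over value ranges.
import numpy as np
import opensmile
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)


def extract_features(wav_path):
    # Returns one row of 88 eGeMAPS functionals for the given audio file
    return smile.process_file(wav_path).to_numpy().ravel()


if __name__ == "__main__":
    # Placeholder features and labels so the search can be demonstrated;
    # in the real pipeline X would be stacked extract_features() outputs.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(70, 88))
    y = np.repeat(np.arange(7), 10)  # seven emotions, ten placeholder samples each

    param_grid = {  # example value ranges handed to the optimizer
        "svc__C": [0.1, 1, 10, 100],
        "svc__gamma": ["scale", 1e-2, 1e-3],
    }
    search = GridSearchCV(make_pipeline(StandardScaler(), SVC()), param_grid, cv=5)
    search.fit(X, y)
    print("best hyperparameters:", search.best_params_)
```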
In contrast to the Machine Learning method, the Deep Learning method uses explicit hyperparameters, which are described in Chap. 4.3. The division of the training into two phases of 50 epochs each is based on the research of Tan et al. (2018). A different approach was taken by Zhang et al. (2018a), who used a batch size of 30, the SGD optimizer, and a learning rate of 10⁻³ as hyperparameters. However, an SGD with a learning rate of 10⁻², a dropout of 0.25, and a Rectified Linear Unit activation function are standard hyperparameters, as used by Lim et al. (2017). Differences also prevail between the Mel spectrogram parameters described in Sect. 2.3.3 and those in the relevant literature. For example, Zhang et al. (2018a) use 64 Mel filters for audio classification over a frequency range of 20 to 8000 hertz on a 25-millisecond Hamming window with a ten-millisecond overlap. According to Zhang et al. (2018a), the difference between the Mel spectrogram configurations can be justified because SER and speech recognition are likewise distinct processes. The use of Mel spectrograms, as mentioned in Chap. 2, is in line with the current state of research. Alternatively, the use of MFCC within the Deep Learning procedure would be possible.
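The Mel spectrogram configuration cited from Zhang et al. (2018a) can be reproduced with librosa as sketched below. The synthetic tone merely stands in for a speech signal, and the sketch interprets the ten-millisecond figure as the frame shift (hop), the common reading in the speech literature; a literal ten-millisecond overlap would instead correspond to a 15-millisecond hop.

```python
# Sketch: 64 Mel filters, 20-8000 Hz, 25 ms Hamming window, 10 ms hop at 16 kHz.
import librosa
import numpy as np

SR = 16_000
WIN_LENGTH = int(0.025 * SR)   # 25 ms Hamming window -> 400 samples
HOP_LENGTH = int(0.010 * SR)   # 10 ms frame shift -> 160 samples

# Placeholder signal: 3 seconds of a 440 Hz tone instead of real speech
t = np.linspace(0, 3.0, 3 * SR, endpoint=False)
y = 0.1 * np.sin(2 * np.pi * 440 * t)

mel = librosa.feature.melspectrogram(
    y=y, sr=SR, n_fft=512, win_length=WIN_LENGTH, hop_length=HOP_LENGTH,
    window="hamming", n_mels=64, fmin=20, fmax=8000,
)
log_mel = librosa.power_to_db(mel, ref=np.max)  # log scale, as typically fed to a CNN
print(log_mel.shape)  # (64 Mel bands, number of frames)
```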
When applying transfer learning, MobileNetV2 is used here as the base model, but the CNNs ResNet50 or SqueezeNet (Ottl et al. 2020) are also used in the literature for this purpose. Furthermore, in this study, fine-tuning operates only on the last 54 of the 154 available layers. Optionally, another number of trainable layers, such as 32 or 16, would be possible. To what extent a change of these hyperparameters would alter the results described in the previous chapter can be the subject of further investigation.
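A hedged sketch of such a setup in Keras is shown below; the Keras MobileNetV2 feature extractor is commonly reported to contain 154 layers, matching the count above. The input size, classifier head, and optimizer settings are illustrative assumptions, not this work's exact code.

```python
# Sketch: MobileNetV2 as base model with only the last 54 layers trainable.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)
base.trainable = True
for layer in base.layers[:-54]:   # freeze everything except the last 54 layers
    layer.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Dense(7, activation="softmax"),  # seven emotion classes
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-2),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
print(f"{len(base.layers)} base layers, "
      f"{sum(l.trainable for l in base.layers)} of them trainable")
```

Changing the slice index from 54 to, for example, 32 or 16 corresponds to the alternative numbers of trainable layers mentioned above.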
When speech is segmented for real-time processing, a time window of three seconds is selected. In retrospect, however, intervals of, for example, one, two, or five seconds would also be possible. Such alternative window lengths can be examined in further analyses. What is worth investigating here is whether such a recording duration is also suitable for audio and emotion classification and whether a user subjectively perceives the SER system as running in real time. In general, the subjective view of an end user has not been part of this research.
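The three-second window can be captured as sketched below, assuming the sounddevice package and an attached microphone; WINDOW_SECONDS is the parameter that alternative analyses with one-, two-, or five-second windows would vary. This is an illustration, not the recording code of the prototypes.

```python
# Sketch: record one fixed-length window for real-time classification.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16_000   # matches the 16 kHz mono preprocessing
WINDOW_SECONDS = 3.0   # recording window handed to the classifiers


def record_window():
    frames = int(SAMPLE_RATE * WINDOW_SECONDS)
    audio = sd.rec(frames, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()            # block until the window is fully recorded
    return audio[:, 0]   # flatten to a 1-D mono signal


if __name__ == "__main__":
    window = record_window()
    print(f"captured {len(window) / SAMPLE_RATE:.1f} s of audio")
    # the window would then be passed to audio and emotion classification
```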
Regarding porting, it must be stated that only three of the maximum four possible ports could be realized. The lack of compatibility of openSMILE with a 32-bit operating system, as well as the lack of a qualitative alternative, leads to the conclusion that the selection of this framework was not the most suitable. Nevertheless, the use of openSMILE is consistent with the corresponding literature.
6.2 Interpretation of the results
The results of the previous chapter show that both methods are, with minor restrictions, suitable for answering the research questions. However, comparing the two approaches and their results in Table 5 does not allow a direct conclusion as to whether one of the methods is superior to the other. When comparing the training time, it becomes clear that the Machine Learning approach needs about 16 times the time of the Deep Learning approach. However, the Machine Learning model achieves higher accuracy, classifies faster, and causes a lower increase in processor load and memory requirements.
The speed advantage of the Machine Learning method in real-time classification arises from the different emotion recognition algorithms. The audio classification is not considered here, as it is identical in both approaches and precedes emotion recognition. In the Machine Learning method, the speech input is first processed by openSMILE, normalized, and then classified by the SVM. In the Deep Learning model, the speech input is first converted to a spectrogram, stored, normalized, and then processed through all layers of the neural network. When the spectrograms are stored and read, additional write and read transactions occur that the Machine Learning method does not require, which in turn throttles the Deep Learning model in terms of speed. However, the difference in speed is marginal and negligible, since humans cannot perceive a difference of a few milliseconds.
Furthermore, it is noticeable from Table 5 that the mere emotion classification is, on average, ten times faster than the combined audio and emotion classification, regardless of the method used. Particularly concerning the porting to the Raspberry Pi, this difference is decisive for the declaration of real-time capability. While the mere emotion classification is declared real-time capable, this is not the case for the combined classification due to the increased runtime. The difference presumably originates in the CNN YAMNet used for audio classification and its structure, which was developed outside this research. Consequently, a detailed analysis and statement of the time difference and its exact origin cannot be made. Thus, optimizing the audio classification, which is not considered in detail in this paper, can significantly improve the overall latency of the process.
When looking at the individual process steps in Fig. 6 and Fig. 8, no direct conclusion on the training duration is recognizable. However, more process steps are necessary for the Deep Learning method than for the Machine Learning method. As mentioned above, the parameterizations of the models are not listed in these figures since they are part of the mapped training. In particular, the choice between specifying fixed hyperparameters and selecting ranges of hyperparameter values is critical to the training time. In the Deep Learning method, the training duration varies with the parameterization of the batch size, the number of epochs, and the number of steps per epoch. In the Machine Learning method, the training duration is determined by the number of hyperparameters and their value ranges. As explained in Chap. 4.2, the resulting hyperparameters are determined by the algorithm itself through processing and optimization of the possible combinations. Compared to explicit parameterization, this processing of all potential combinations is presumably responsible for the difference in training duration. It presumably also explains why the overall accuracy of the Machine Learning model is higher than that of the Deep Learning model.
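The multiplicative effect of value ranges on training effort can be made concrete with a short sketch; the ranges below are illustrative, not this work's exact grid.

```python
# Sketch: each additional hyperparameter value multiplies the number of model fits.
from sklearn.model_selection import ParameterGrid

grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", 1e-2, 1e-3, 1e-4],
    "kernel": ["rbf", "linear"],
}
n_candidates = len(ParameterGrid(grid))
print(n_candidates, "candidate combinations")         # 4 * 4 * 2 = 32
print(n_candidates * 5, "model fits with 5-fold CV")  # versus one fit per fixed setup
```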
Furthermore, it is evident from Fig. 13 that the curve of the validation loss increases again from epoch 80 onwards. An analogous phenomenon can be observed in Fig. 14 from epoch 50 onwards. However, the validation accuracies in the figures do not increase further. The rising course of these curves may indicate overfitting, which can be investigated in continuing research.
The higher memory requirement of the Deep Learning model arises because, in addition to the primary audio file, the generated spectrogram is also necessary for emotion recognition and must be kept in memory. Another apparent cause is that the CNN, in contrast to an SVM, has a higher number of parameters, which are also held in memory.
Contrary to expectations, the processor utilization is higher for the Machine Learning method than for the Deep Learning method. This phenomenon is probably due to unfavorable timing of the utilization measurement. In any case, the difference of 21.7 percentage points in the processor utilization of the two methods is considered negligible here.
It should also be mentioned that both models require the presence of a microphone as a strict prerequisite. Unlike multimodal emotion recognition, a unimodal SER system does not require other devices such as cameras. Most ambient terminals, such as smart speakers or smart TVs, have a native microphone but not a native camera. Thus, the prototypes developed in this work are suitable for porting to these devices, which is discussed in more detail in Chap. 6.4.
Referring to the theoretical foundations of this research, SER represents an increasingly important component of human-computer interaction (HCI). HCI, for its part, occurs in the context of remote participation, which is a component of the increase in computer-supported hybrid communication in everyday life. In conclusion, SER is an increasingly important part of everyday life. Furthermore, SER is the subject of current research. For example, analogous to this work, there are studies of SER in real time (Vogt et al. 2008) and of real-time edge computing (Liu et al. 2018; Satyanarayanan 2017; Abbas et al. 2018). However, research on SER applications on edge devices, as defined by Shi et al. (2016), does not exist. Thus, the combination of SER, edge computing, and real-time processing in the present work represents a novelty and an extension of the research.
6.3 Limitations of the work
To maintain the focus of this work, some restrictions are made deliberately. However, external constraints also limit this work; both are explained in more detail below.
According to Ekman, only the six basic emotions, plus a seventh neutral emotion, are considered in this work, which is why emotions such as tiredness or boredom are excluded. Accordingly, the data acquisition is carried out with the seven emotions mentioned, further limiting the selection of suitable databases. Furthermore, the dimensions of arousal and valence are omitted. These dimensions can be considered in continuing work but do not play a role in the mere emotion recognition in this research. This work therefore acknowledges that these dimensions exist but does not address them in the further course of the study.
A technical limitation, however, is the recording of processor and memory utilization. Since the system constantly updates these two indicators, it is impossible to identify the exact utilization. Thus, the documentation of the processor and RAM utilization represents only a snapshot, not a calculated average value.
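A hedged sketch of how this limitation could be narrowed is given below, assuming the psutil package: utilization is sampled repeatedly while a classification runs and the mean is reported instead of a single reading. The sampling window and interval are illustrative choices.

```python
# Sketch: averaged CPU/RAM utilization instead of a single snapshot.
import time
import psutil


def sample_utilization(duration_s=3.0, interval_s=0.1):
    cpu, ram = [], []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        cpu.append(psutil.cpu_percent(interval=interval_s))  # blocks for interval_s
        ram.append(psutil.virtual_memory().percent)
    return sum(cpu) / len(cpu), sum(ram) / len(ram)


if __name__ == "__main__":
    mean_cpu, mean_ram = sample_utilization()
    print(f"mean CPU {mean_cpu:.1f} %, mean RAM {mean_ram:.1f} %")
```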
Furthermore, the maximum number of simultaneously recognizable emotions is another technical limitation. As mentioned in Chap. 2, this paper assumes that only one emotion is contained in a sentence or voice recording. As the length of the sentences and audio files increases, the probability that multiple emotions are contained also increases. However, the Machine Learning method using the SVM can only classify one emotion, which is why the length of the voice recording is limited to three seconds. The CNN in the Deep Learning method, on the other hand, calculates a probability for each of the seven emotions, which is why this method can potentially identify multiple emotions within one speech recording.
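The difference in output structure can be illustrated with a small, purely hypothetical sketch: the data are random placeholders, not features from this work, and the Dirichlet draw merely stands in for a CNN softmax output.

```python
# Illustrative contrast: single SVM label versus a per-emotion probability vector.
import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

rng = np.random.default_rng(0)
X = rng.normal(size=(70, 10))      # placeholder feature vectors
y = np.repeat(np.arange(7), 10)    # ten placeholder samples per emotion

svm = SVC().fit(X, y)
print("SVM:", EMOTIONS[svm.predict(X[:1])[0]])  # exactly one emotion per recording

softmax_scores = rng.dirichlet(np.ones(7))      # stands in for a CNN softmax output
print("CNN:", dict(zip(EMOTIONS, softmax_scores.round(2))))  # one probability per emotion
```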
Another technical limitation is the applicability of the prototypes to only one person. As described in Chap. 2, the model training is based on emotional audio files from the acquired databases. A single individual can be heard in each audio file, which is why the prototypes can effectively apply emotion recognition only to individual speakers. When multiple individuals speak simultaneously, the prototypes cannot distinguish between the individuals and their emotions. The extension to multi-person recognition goes beyond the scope of these prototypes and therefore needs to be investigated in further work.
Furthermore, the porting of the prototypes is also limited: only two device categories were selected, since porting to more devices would exceed the scope of this paper. For this reason, porting to smartphones or tablets, for example, is not included.
6.4 Practical implication
The results and interpretations shown in this and the previous chapter prove that the prototypes created are functional and thus also suitable for practical use. The possible uses of such SER applications are manifold and serve as extensions to existing products and services. Some of these areas of application and the resulting benefits are mentioned below.
Although a microphone is a prerequisite for SER applications, these applications can be used universally wherever speech plays a central role. Examples include call centers, radio broadcasts, podcasts, and television shows. As Kiesler et al. (1984) mentioned, physical communication and participation are increasingly complemented by virtual technologies. In connection with this and with the recommendation of Shirmohammadi et al. (2012) to combine physical and virtual presence, new business models can be generated through SER applications. For example, using SER on a smart speaker, vocal activity and emotions in the living room at home can be determined. These determined emotions can be visually processed for the user and transmitted, with the user's consent, to the producer, who can use the feedback to optimize the product and pay the user a premium. It is also conceivable that the highlights of a broadcast sports game could be automatically edited based on the determined emotions. Similar scenarios are also feasible for Internet-based broadcasts such as Twitch or Netflix.
Another use case for SER applications could be to capture the current mood of an audience. Unlike the previous use case, the emotion is not determined and summarized over a period of time, as is the case with a broadcast. Instead, the emotion level is determined directly for an exact point in time. Possible scenarios are, for example, a loudspeaker announcement at a train station, an expression of opinion at political talks, or the presentation of new products at events. In such situations, it is essential to determine the emotion directly as it occurs because, according to Averill (1980), an emotion evaluates a position in a sociocultural context. Speakers in these scenarios can thus also receive unbiased feedback on what they say. In addition, the scenarios mentioned, except for loudspeaker announcements, occur in both physical and virtual or hybrid forms, which underpins the aspect of increasing remote participation.
The SER use cases mentioned so far are examples in which the emotions of an audience or of several cumulated individuals are in focus. However, use cases are also conceivable that involve only individual persons. For example, a use case is feasible in which a machine exhibits different behavior depending on the determined emotion of the user. An SER system could be implemented in a smart speaker or automobile and play music or change the lighting corresponding to the determined emotion. A scenario in the field of gaming is also possible, in which the algorithm offers the player relief within the game when anger is detected. Emotion-controlled individual advertising, for example in social media or e-commerce, is also conceivable. Depending on the emotional state, the price could also vary dynamically and, for example, increase in the case of a joyful emotion.