6.1 Reflection of the approach
As summarized in Chap. 1.2, the research approach comprises a systematic literature search and exploratory prototyping, including porting to possible end devices. The systematic literature search is conducted according to vom Brocke et al. (2009) and includes a forward and backward search. During this process, high-quality literature sources are primarily used, following Rowley and Slack (2004). However, since the literature search is not ongoing, there is always a risk that relevant literature or primary sources are overlooked or that inferior sources are used. Thus, despite the care taken in selecting literature, this research is not entirely exempt from this risk.
The exploratory prototyping, following Floyd (1984), uses both a machine learning and a deep learning approach. Both approaches are based on the same data corpus, which is determined and acquired from the available literature sources according to the criteria described in Chap. 3. Among these criteria are a minimum audio length of one second and a maximum length of 20 seconds. The minimum length of one second is justified because speech or audio files below this duration generally do not contain useful information. The maximum length of 20 seconds, on the other hand, is chosen arbitrarily; a maximum of 10, 15, or 30 seconds would also have been conceivable. It is questionable, however, whether a different selection of audio databases would have resulted from changing this criterion since, as shown in Fig. 4, the majority of the audio files have a duration of fewer than 10 seconds. Another criterion is the restriction to spoken sentences. Since the goal of the prototypes is SER and not audio classification, the audio files must consequently contain speech. Pieces of music may include speech, but it is interfered with by the accompanying instruments.
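The duration criterion above can be illustrated with a short, hedged sketch. The directory layout, file format, and thresholds below are illustrative assumptions, not the exact selection script of this work; only the 1–20 second bounds come from the text.

```python
# Hypothetical sketch: filter candidate audio files by the duration criterion
# (minimum 1 s, maximum 20 s). Paths and thresholds are illustrative.
from pathlib import Path
import soundfile as sf

MIN_DURATION_S = 1.0   # shorter files rarely carry usable speech information
MAX_DURATION_S = 20.0  # upper bound chosen in this work; other values conceivable


def select_files(audio_dir):
    selected = []
    for path in Path(audio_dir).glob("**/*.wav"):
        duration = sf.info(str(path)).duration  # duration in seconds from the file header
        if MIN_DURATION_S <= duration <= MAX_DURATION_S:
            selected.append(path)
    return selected


if __name__ == "__main__":
    files = select_files("databases/")  # placeholder directory name
    print(f"{len(files)} files satisfy the 1-20 s criterion")
```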
Furthermore, emotion recognition in music is not part of this work but is a suggestion for further research. Nevertheless, it would be possible to consider audio files that include background noise, because in real situations communication, and thus speech, does not take place in silence but in noisy environments. In contrast to music, background noises are not in the foreground but in the background. It would therefore be an enrichment to include additional databases with such audio data in the research.
Furthermore, it is specified as a criterion that the files must be in an auditory or audiovisual format. However, no restriction is made on the file type, sampling rate, or dubbing of the files. A limitation to purely auditive file formats would be feasible and would influence the selection of the databases but would not affect the further procedure, because all audio files are transformed to the same format and file type before the model training. The native language used in the audio files is also not a selection criterion, which is why Table 1 shows that the selected databases contain German, English, and Italian language data. It would also be possible to choose audio files in, for example, Turkish, Danish, or Chinese. Since the six basic emotions, according to Darwin (1873) and Ekman (1971), are expressed in the same way across cultures, the native language spoken has no significant influence on the Mel spectrograms, the model training, or the result. However, the selected databases must include both male and female voices; not meeting this criterion may prevent the generalization of the data and lead to overfitting or underfitting of the model. In addition, open accessibility and the presence of labels are mandatory criteria for acquisition: without open access to the databases, the procedures and results of this work cannot be reproduced by third parties, and without labeled data, no supervised machine-learning algorithm is feasible. Investigating whether a different quantity or composition of the databases or of the included languages changes the outcome of this work can be the subject of further research.
Other methodological choices that affect both procedures equally include splitting the data corpus into training and validation partitions in an 80:20 ratio and transforming the audio files to a 16,000 hertz mono format. An alternative division of the training and validation partitions in a ratio of 50:50 (Krizhevsky et al. 2017) or of two-thirds to one-third would represent a potential option. Formatting the audio files in the manner above is the most commonly cited preprocessing method in the literature and is consistent with the approaches taken in this work (Lin and Wei 2005; Lim et al. 2017; Zhang et al. 2018a), but it also leaves room for research with different parameters.
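The two preprocessing choices can be sketched as follows. This is a minimal, hedged illustration rather than the exact scripts of this work; the file names and labels are placeholders, and librosa and scikit-learn are assumed as tooling.

```python
# Hedged sketch of the shared preprocessing: resampling every file to a
# 16,000 hertz mono format and splitting the file list 80:20.
import librosa
import soundfile as sf
from sklearn.model_selection import train_test_split

TARGET_SR = 16_000  # 16 kHz mono, as in the cited preprocessing approaches


def to_mono_16k(in_path, out_path):
    # librosa.load downmixes to mono and resamples to the requested rate
    y, _ = librosa.load(in_path, sr=TARGET_SR, mono=True)
    sf.write(out_path, y, TARGET_SR)


if __name__ == "__main__":
    # Placeholder corpus: ten files with two emotion labels, only to show the split
    files = [f"clip_{i:02d}.wav" for i in range(10)]
    labels = ["anger", "joy"] * 5
    train_f, val_f, train_y, val_y = train_test_split(
        files, labels, test_size=0.2, stratify=labels, random_state=42
    )
    print(len(train_f), "training files,", len(val_f), "validation files")
```

Stratifying the split by emotion label, as in the sketch, keeps the class balance identical in both partitions.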
In implementing the machine learning method, the hyperparameters and their value ranges are defined at the beginning of the training, based on which the algorithm determines the optimum independently. The selection of these hyperparameters is based on the research of Mao et al. (2014). However, other parameters or ranges of values are also conceivable, as in Wang et al. (2015) or Cummins et al. (2017). For feature extraction, the framework openSMILE, including the eGeMAPS, is used, as in Cummins et al. (2017) and Ottl et al. (2020). Although this approach is in line with current research, DEEP SPECTRUM (Ottl et al. 2020) would also be conceivable.
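A hedged sketch of this pipeline is given below: eGeMAPS functionals extracted with the opensmile Python package and an SVM whose hyperparameter ranges are searched automatically. The value ranges, the placeholder feature matrix, and the use of GridSearchCV are illustrative assumptions, not the exact configuration of this work.

```python
# Sketch: eGeMAPS feature extraction plus hyperparameter search over value ranges.
import numpy as np
import opensmile
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)


def extract_features(wav_path):
    # Returns one row of 88 eGeMAPS functionals for the given audio file
    return smile.process_file(wav_path).to_numpy().ravel()


if __name__ == "__main__":
    # Placeholder features and labels so the search can be demonstrated;
    # in the real pipeline X would be stacked extract_features() outputs.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(70, 88))
    y = np.repeat(np.arange(7), 10)  # seven emotions, ten placeholder samples each

    param_grid = {  # example value ranges handed to the optimizer
        "svc__C": [0.1, 1, 10, 100],
        "svc__gamma": ["scale", 1e-2, 1e-3],
    }
    search = GridSearchCV(make_pipeline(StandardScaler(), SVC()), param_grid, cv=5)
    search.fit(X, y)
    print("best hyperparameters:", search.best_params_)
```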
In contrast to the Machine Learning method, the Deep Learning method uses explicit hyperparameters, which are described in Chap. 4.3. The division of the training into two phases of 50 epochs each is based on the research of Tan et al. (2018). A different approach was taken by Zhang et al. (2018a), who used a batch size of 30, the SGD optimizer, and a learning rate of 10⁻³ as hyperparameters. However, an SGD with a learning rate of 10⁻², a dropout of 0.25, and a Rectified Linear Unit activation function are standard hyperparameters, as used by Lim et al. (2017). Differences also prevail between the Mel spectrogram parameters described in Sect. 2.3.3 and those in the relevant literature. For example, Zhang et al. (2018a) use 64 Mel filters for audio classification over a frequency range of 20 to 8000 hertz on a 25-millisecond Hamming window with a ten-millisecond overlap. According to Zhang et al. (2018a), the difference between the Mel spectrogram configurations can be justified because SER and speech recognition are likewise distinct processes. The use of Mel spectrograms, as mentioned in Chap. 2, is in line with the current state of research. Alternatively, the use of MFCC within the Deep Learning procedure would be possible.
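The Mel spectrogram configuration cited from Zhang et al. (2018a) can be reproduced with librosa as sketched below. The synthetic tone merely stands in for a speech signal, and the sketch interprets the ten-millisecond figure as the frame shift (hop), the common reading in the speech literature; a literal ten-millisecond overlap would instead correspond to a 15-millisecond hop.

```python
# Sketch: 64 Mel filters, 20-8000 Hz, 25 ms Hamming window, 10 ms hop at 16 kHz.
import librosa
import numpy as np

SR = 16_000
WIN_LENGTH = int(0.025 * SR)   # 25 ms Hamming window -> 400 samples
HOP_LENGTH = int(0.010 * SR)   # 10 ms frame shift -> 160 samples

# Placeholder signal: 3 seconds of a 440 Hz tone instead of real speech
t = np.linspace(0, 3.0, 3 * SR, endpoint=False)
y = 0.1 * np.sin(2 * np.pi * 440 * t)

mel = librosa.feature.melspectrogram(
    y=y, sr=SR, n_fft=512, win_length=WIN_LENGTH, hop_length=HOP_LENGTH,
    window="hamming", n_mels=64, fmin=20, fmax=8000,
)
log_mel = librosa.power_to_db(mel, ref=np.max)  # log scale, as typically fed to a CNN
print(log_mel.shape)  # (64 Mel bands, number of frames)
```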
When applying transfer learning, MobileNetV2 is used here as the base model, but the CNNs ResNet50 or SqueezeNet (Ottl et al. 2020) are also used in the literature for this purpose. Furthermore, in this study, fine-tuning operates only on the last 54 of the 154 available layers. Optionally, another number of trainable layers, such as 32 or 16, would be possible. To what extent a change of these hyperparameters would alter the results described in the previous chapter can be the subject of further investigation.
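A hedged sketch of such a setup in Keras is shown below; the Keras MobileNetV2 feature extractor is commonly reported to contain 154 layers, matching the count above. The input size, classifier head, and optimizer settings are illustrative assumptions, not this work's exact code.

```python
# Sketch: MobileNetV2 as base model with only the last 54 layers trainable.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)
base.trainable = True
for layer in base.layers[:-54]:   # freeze everything except the last 54 layers
    layer.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Dense(7, activation="softmax"),  # seven emotion classes
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-2),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
print(f"{len(base.layers)} base layers, "
      f"{sum(l.trainable for l in base.layers)} of them trainable")
```

Changing the slice index from 54 to, for example, 32 or 16 corresponds to the alternative numbers of trainable layers mentioned above.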
When speech is segmented for real-time processing, a time window of three seconds is selected. In retrospect, however, intervals of, for example, one, two, or five seconds would also be possible. Such alternative window lengths can be examined in further analyses. What is worth investigating here is whether such a recording duration is also suitable for audio and emotion classification and whether a user subjectively perceives the SER system as running in real time. In general, the subjective view of an end user has not been part of this research.
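The three-second window can be captured as sketched below, assuming the sounddevice package and an attached microphone; WINDOW_SECONDS is the parameter that alternative analyses with one-, two-, or five-second windows would vary. This is an illustration, not the recording code of the prototypes.

```python
# Sketch: record one fixed-length window for real-time classification.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16_000   # matches the 16 kHz mono preprocessing
WINDOW_SECONDS = 3.0   # recording window handed to the classifiers


def record_window():
    frames = int(SAMPLE_RATE * WINDOW_SECONDS)
    audio = sd.rec(frames, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()            # block until the window is fully recorded
    return audio[:, 0]   # flatten to a 1-D mono signal


if __name__ == "__main__":
    window = record_window()
    print(f"captured {len(window) / SAMPLE_RATE:.1f} s of audio")
    # the window would then be passed to audio and emotion classification
```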
Regarding porting, it must be stated that only three of the maximum four possible ports could be realized. The lack of compatibility of openSMILE with a 32-bit operating system, as well as the lack of a qualitative alternative, leads to the conclusion that the selection of this framework was not the most suitable. Nevertheless, the use of openSMILE is consistent with the corresponding literature.
6.2 Interpretation of the results
The results of the previous chapter show that both methods are, with minor restrictions, suitable for answering the research questions. However, comparing the two approaches and their results in Table 5 does not allow a direct conclusion as to whether one of the methods is superior to the other. When comparing the training time, it becomes clear that the Machine Learning approach needs about 16 times the time of the Deep Learning approach. However, the Machine Learning model achieves higher accuracy, classifies faster, and causes a lower increase in processor load and memory requirements.
The speed advantage of the Machine Learning method in real-time classification arises from the different emotion recognition algorithms. The audio classification is not considered here, as it is identical in both approaches and precedes emotion recognition. In the Machine Learning method, the speech input is first processed by openSMILE, normalized, and then classified by the SVM. In the Deep Learning model, the speech input is first converted to a spectrogram, stored, normalized, and then processed through all layers of the neural network. When the spectrograms are stored and read, additional write and read transactions occur that the Machine Learning method does not require, which in turn throttles the Deep Learning model in terms of speed. However, the difference in speed is marginal and negligible, since humans cannot perceive a difference of a few milliseconds.
Furthermore, it is noticeable from Table 5 that the mere emotion classification is, on average, ten times faster than the combined audio and emotion classification, regardless of the method used. Particularly concerning the porting to the Raspberry Pi, this difference is decisive for the declaration of real-time capability. While the mere emotion classification is declared real-time capable, this is not the case for the combined classification due to the increased runtime. The difference presumably originates in the CNN YAMNet used for audio classification and its structure, which was developed outside this research. Consequently, a detailed analysis and statement of the time difference and its exact origin cannot be made. Thus, optimizing the audio classification, which is not considered in detail in this paper, can significantly improve the overall latency of the process.
When looking at the individual process steps in Fig. 6 and Fig. 8, no direct conclusion on the training duration is recognizable. However, more process steps are necessary for the Deep Learning method than for the Machine Learning method. As mentioned above, the parameterizations of the models are not listed in these figures since they are part of the mapped training. In particular, the choice between specifying fixed hyperparameters and selecting ranges of hyperparameter values is critical to the training time. In the Deep Learning method, the training duration varies with the parameterization of the batch size, the number of epochs, and the number of steps per epoch. In the Machine Learning method, the training duration is determined by the number of hyperparameters and their value ranges. As explained in Chap. 4.2, the resulting hyperparameters are determined by the algorithm itself through processing and optimization of the possible combinations. Compared to explicit parameterization, this processing of all potential combinations is presumably responsible for the difference in training duration. It presumably also explains why the overall accuracy of the Machine Learning model is higher than that of the Deep Learning model.
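The multiplicative effect of value ranges on training effort can be made concrete with a short sketch; the ranges below are illustrative, not this work's exact grid.

```python
# Sketch: each additional hyperparameter value multiplies the number of model fits.
from sklearn.model_selection import ParameterGrid

grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", 1e-2, 1e-3, 1e-4],
    "kernel": ["rbf", "linear"],
}
n_candidates = len(ParameterGrid(grid))
print(n_candidates, "candidate combinations")         # 4 * 4 * 2 = 32
print(n_candidates * 5, "model fits with 5-fold CV")  # versus one fit per fixed setup
```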
Furthermore, it is evident from Fig. 13 that the curve of the validation loss increases again from epoch 80 onwards. An analogous phenomenon can be observed in Fig. 14 from epoch 50 onwards. However, the validation accuracies in the figures do not increase further. The rising course of these curves may indicate overfitting, which can be investigated in continuing research.
The higher memory requirement of the Deep Learning model arises because, in addition to the primary audio file, the generated spectrogram is also necessary for emotion recognition and must be kept in memory. Another apparent cause is that the CNN, in contrast to an SVM, has a higher number of parameters, which are also held in memory.
Contrary to expectations, the processor utilization is higher for the Machine Learning method than for the Deep Learning method. This phenomenon is probably due to unfavorable timing of the utilization measurement. In any case, the difference of 21.7 percentage points in the processor utilization of the two methods is considered negligible here.
It should also be mentioned that both models require the presence of a microphone as a strict prerequisite. Unlike multimodal emotion recognition, a unimodal SER system does not require other devices such as cameras. Most ambient terminals, such as smart speakers or smart TVs, have a native microphone but not a native camera. Thus, the prototypes developed in this work are suitable for porting to these devices, which is discussed in more detail in Chap. 6.4.
Referring to the theoretical foundations of this research, SER represents an increasingly important component of human-computer interaction (HCI). HCI, for its part, occurs in the context of remote participation, which is a component of the increase in computer-supported hybrid communication in everyday life. In conclusion, SER is an increasingly important part of everyday life. Furthermore, SER is the subject of current research. For example, analogous to this work, there are studies of SER in real time (Vogt et al. 2008) and of real-time edge computing (Liu et al. 2018; Satyanarayanan 2017; Abbas et al. 2018). However, research on SER applications on edge devices, as defined by Shi et al. (2016), does not exist. Thus, the combination of SER, edge computing, and real-time processing in the present work represents a novelty and an extension of the research.
6.3 Limitations of the work
To maintain the focus of this work, some restrictions are made deliberately. However, external constraints also limit this work; both are explained in more detail below.
According to Ekman, only the six basic emotions, plus a seventh neutral emotion, are considered in this work, which is why emotions such as tiredness or boredom are excluded. Accordingly, the data acquisition is carried out with the seven emotions mentioned, further limiting the selection of suitable databases. Furthermore, the dimensions of arousal and valence are omitted. These dimensions can be considered in continuing work but do not play a role in the mere emotion recognition in this research. This work therefore acknowledges that these dimensions exist but does not address them in the further course of the study.
A technical limitation, however, is the recording of processor and memory utilization. Since the system constantly updates these two indicators, it is impossible to identify the exact utilization. Thus, the documentation of the processor and RAM utilization represents only a snapshot, not a calculated average value.
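A hedged sketch of how this limitation could be narrowed is given below, assuming the psutil package: utilization is sampled repeatedly while a classification runs and the mean is reported instead of a single reading. The sampling window and interval are illustrative choices.

```python
# Sketch: averaged CPU/RAM utilization instead of a single snapshot.
import time
import psutil


def sample_utilization(duration_s=3.0, interval_s=0.1):
    cpu, ram = [], []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        cpu.append(psutil.cpu_percent(interval=interval_s))  # blocks for interval_s
        ram.append(psutil.virtual_memory().percent)
    return sum(cpu) / len(cpu), sum(ram) / len(ram)


if __name__ == "__main__":
    mean_cpu, mean_ram = sample_utilization()
    print(f"mean CPU {mean_cpu:.1f} %, mean RAM {mean_ram:.1f} %")
```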
Furthermore, the maximum number of simultaneously recognizable emotions is another technical limitation. As mentioned in Chap. 2, this paper assumes that only one emotion is contained in a sentence or voice recording. As the length of the sentences and audio files increases, the probability that multiple emotions are contained also increases. However, the Machine Learning method using the SVM can only classify one emotion, which is why the length of the voice recording is limited to three seconds. The CNN in the Deep Learning method, on the other hand, calculates a probability for each of the seven emotions, which is why this method can potentially identify multiple emotions within one speech recording.
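The difference in output structure can be illustrated with a small, purely hypothetical sketch: the data are random placeholders, not features from this work, and the Dirichlet draw merely stands in for a CNN softmax output.

```python
# Illustrative contrast: single SVM label versus a per-emotion probability vector.
import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

rng = np.random.default_rng(0)
X = rng.normal(size=(70, 10))      # placeholder feature vectors
y = np.repeat(np.arange(7), 10)    # ten placeholder samples per emotion

svm = SVC().fit(X, y)
print("SVM:", EMOTIONS[svm.predict(X[:1])[0]])  # exactly one emotion per recording

softmax_scores = rng.dirichlet(np.ones(7))      # stands in for a CNN softmax output
print("CNN:", dict(zip(EMOTIONS, softmax_scores.round(2))))  # one probability per emotion
```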
Another technical limitation is the applicability of the prototypes to only one person. As described in Chap. 2, the model training is based on emotional audio files from the acquired databases. A single individual can be heard in each audio file, which is why the prototypes can effectively apply emotion recognition only to individual speakers. When multiple individuals speak simultaneously, the prototypes cannot distinguish between the individuals and their emotions. The extension to multi-person recognition goes beyond the scope of these prototypes and therefore needs to be investigated in further work.
Furthermore, the porting of the prototypes is also limited: only two device categories were selected, since porting to more devices would exceed the scope of this paper. For this reason, porting to smartphones or tablets, for example, is not included.
6.4 Practical implication
The results and interpretations shown in this and the previous chapter prove that the prototypes created are functional and thus also suitable for practical use. The possible uses of such SER applications are manifold and serve as extensions to existing products and services. Some of these areas of application and the resulting benefits are mentioned below.
Although a microphone is a prerequisite for SER applications, these applications can be used universally wherever speech plays a central role. Examples include call centers, radio broadcasts, podcasts, and television shows. As Kiesler et al. (1984) mentioned, physical communication and participation are increasingly complemented by virtual technologies. In connection with this and with the recommendation of Shirmohammadi et al. (2012) to combine physical and virtual presence, new business models can be generated through SER applications. For example, using SER on a smart speaker, vocal activity and emotions in the living room at home can be determined. These determined emotions can be visually processed for the user and transmitted, with the user's consent, to the producer, who can use the feedback to optimize the product and pay the user a premium. It is also conceivable that the highlights of a broadcast sports game could be automatically edited based on the determined emotions. Similar scenarios are also feasible for Internet-based broadcasts such as Twitch or Netflix.
Another use case for SER applications could be to capture the current mood of an audience. Unlike the previous use case, the emotion is not determined and summarized over a period of time, as is the case with a broadcast. Instead, the emotion level is determined directly for an exact point in time. Possible scenarios are, for example, a loudspeaker announcement at a train station, an expression of opinion at political talks, or the presentation of new products at events. In such situations, it is essential to determine the emotion directly as it occurs because, according to Averill (1980), an emotion evaluates a position in a sociocultural context. Speakers in these scenarios can thus also receive unbiased feedback on what they say. In addition, the scenarios mentioned, except for loudspeaker announcements, occur in both physical and virtual or hybrid forms, which underpins the aspect of increasing remote participation.
The SER use cases mentioned so far are examples in which the emotions of an audience or of several cumulated individuals are in focus. However, use cases are also conceivable that involve only individual persons. For example, a use case is feasible in which a machine exhibits different behavior depending on the determined emotion of the user. An SER system could be implemented in a smart speaker or automobile and play music or change the lighting corresponding to the determined emotion. A scenario in the field of gaming is also possible, in which the algorithm offers the player relief within the game when anger is detected. Emotion-controlled individual advertising, for example in social media or e-commerce, is also conceivable. Depending on the emotional state, the price could also vary dynamically and, for example, increase in the case of a joyful emotion.