Using Voice and Biofeedback to Predict User Engagement during Requirements Interviews

Capturing users’ engagement is crucial for gathering feedback about the features of a software product. In a market-driven context, current approaches to collect and analyze users’ feedback are based on techniques leveraging information extracted from product reviews and social media. These approaches are hardly applicable in bespoke software development, or in contexts in which one needs to gather information from specific users. In such cases, companies need to resort to face-to-face interviews to get feedback on their products. In this paper, we propose to utilize biometric data, in terms of physiological and voice features, to complement interviews with information about the engagement of the user on product-relevant topics. We evaluate our approach by interviewing users while gathering their physiological data (i.e., biofeedback) using an Empatica E4 wristband, and capturing their voice through the default audio recorder of a common laptop. Our results show that we can predict users’ engagement by training supervised machine learning algorithms on biometric data (F1 = 0.72), and that voice features alone are sufficiently effective (F1 = 0.71). This work is one of the first studies in requirements engineering in which biometrics are used to identify emotions. Furthermore, this is one of the first studies in software engineering that considers voice analysis. The usage of voice features can be particularly helpful for emotion-aware requirements elicitation in remote communication, either performed by human analysts or voice-based chatbots, and can also be exploited to support the analysis of meetings in software engineering research.


Introduction
The development of novel software products, as well as the improvement of existing ones, can deeply benefit from the involvement of users in requirements engineering (RE) activities (Bano and Zowghi, 2015). Getting feedback from the user base has been recognised to lead to increased usability, improved satisfaction (Bakalova and Daneva, 2011), better understanding of requirements (Hanssen and Faegri, 2006), and creation of long-term relationships with customers (Heiskari and Lehtola, 2009).
User feedback can take implicit and explicit forms, and different means are available to collect this information. In particular, data analytics applied to user opinions and to usage data has seen increasing interest in recent years, leading to the birth of RE sub-fields such as crowd RE (Murukannaiah et al., 2016;Groen et al., 2017) and data-driven RE (Williams and Mahmoud, 2017). In the case of bespoke development (i.e., when customer- or domain-specific products' requirements need to be engineered), it is still common to resort to traditional RE practices, such as prototyping, observations, usability testing, and focus groups (Zowghi and Coulin, 2005). Among the classical techniques, user interviews are one of the most commonly used to gather requirements and feedback (Fernández et al., 2017;Davis et al., 2006;Hadar et al., 2014). Several aspects have been observed to influence the success and failure of interviews, such as the domain knowledge of the requirements analyst (Hadar et al., 2014;Aranda et al., 2015), ambiguity in communication (Ferrari et al., 2016a), and typical mistakes such as not providing a wrap-up summary at the end of the interview session, or not creating rapport with the interviewee (Bano et al., 2019).
Currently, little attention is dedicated to the emotional aspects of interviews and, in particular, to users' engagement. Capturing users' engagement is crucial for gathering feedback about the features of a certain product, and for better understanding users' preferences. The field of affective RE has recognised the role of users' emotions and studied it extensively. Contributions include applications of sentiment analysis to app reviews (Guzman and Maalej, 2014;Kurtanović and Maalej, 2018), analysis of users' facial expressions (Scherr et al., 2019a;Mennig et al., 2019), the study of physiological reactions to ambiguity (Spoletini et al., 2016), and the augmentation of goal models with user emotions elicited through psychometric surveys (Taveter et al., 2019).
In this paper, we aim to extend the body of knowledge in affective RE by studying users' emotions during interviews. We focus on engagement, i.e., the degree of positive or negative interest in a certain product-related aspect discussed in the interview. We perform a study with 31 participants taking part in a simulated interview, during which we capture their biofeedback using an Empatica E4 wristband, record their voice through a common laptop recorder, and collect their self-assessed engagement. We compare different machine learning algorithms to predict users' engagement based on features extracted from biofeedback and voice signals.
Our experiments show that topics related to privacy, ethics and usage habits tend to create more positive users' engagement. Furthermore, we show that engagement can be predicted in terms of valence and arousal (Russell, 1991) considering solely biofeedback signals. When using voice signals alone, the performance in terms of F1-measure remains comparable (F1 = 0.71), showing that voice features alone can still be strongly predictive of users' engagement. The combination of biofeedback and voice features leads to the best performance for both valence and arousal (F1-measure of 0.72).
This paper makes two main contributions:
- A methodology, based on the use of machine learning and biometric features, including physiology- and voice-related metrics, which can be applied to predict users' engagement during requirements interviews.
- A replication package (Ferrari et al., 2021) 1 to enable other researchers to build on our results.
This paper builds upon a previous conference contribution by the same authors (Girardi et al., 2020a), in which only the biofeedback signals were used for prediction. The current paper repeats and expands the experiments from Girardi et al. (2020a). In particular, we introduce additional biometric features, based on voice signals, as well as additional data preparation options, namely standard scaling, oversampling and data imputation, that allow us to radically improve the previous results, even with voice features alone. This is a particularly relevant result, as using voice analysis can have multiple benefits:
1. It dramatically decreases the overall cost of the proposed methodology, which is due to the use of specific biofeedback devices (i.e., Empatica E4 wristbands);
2. It can be effectively applied during remotely performed interviews, a common scenario nowadays, especially due to the COVID-19 pandemic, as voice is remotely transmitted as the main part of the conversation, while biofeedback needs to be pre-processed locally, and its transmission requires additional data transfer;
3. It scales the approach to a larger number of users, possibly interviewed in a semi-automated way by artificial agents;
4. It mitigates potential issues related to the acceptance of the non-intrusive, yet potentially undesired, biofeedback device.
The remainder of the paper is structured as follows. In Section 2, we present background definitions of engagement and emotions, as well as related work in RE and software engineering. In Section 3, we report our study design, whereas Section 4 reports its results. We discuss the implications of our study in Section 5 and its limitations in Section 6. Finally, Section 7 concludes the paper.

Background and Related Work
In this section, we first clarify the relationship between emotion modelling and engagement (Sect. 2.1). Then, we present the background on affect modelling and emotion classification using biofeedback (Sect. 2.2) and voice analysis (Sect. 2.3). Finally, we discuss relevant related work in the broader area of emotions in RE (Sect. 2.4) and use of biometrics in software engineering (Sect. 2.5), and discuss our contribution to the field.

Engagement and Emotions
Affective states range from personality traits, which are stable features of an individual, to emotions, which are, conversely, dynamic, episodic, and rapidly changing events depending on individual and contextual factors (Cowie et al., 2011). Psychologists have investigated the nature and triggers of emotions for decades. As a consequence, a plethora of theories of emotions emerged in the last decades. Cognitive models describe emotions as reactions to cognition. For example, the OCC model (Ortony et al., 1988) defines a taxonomy of emotions and identifies them as valenced reactions (either positive or negative) to the cognitive processes involved in evaluating objects, events, and agents. Analogously, Lazarus describes nine negative (Anger, Fright, Anxiety, Guilt, Shame, Sadness, Envy, Jealousy, and Disgust) and six positive (Joy, Pride, Love, Relief, Hope, and Compassion) emotions, as well as their appraisal patterns: when a situation is congruent with the person's goals, positive emotions arise; conversely, negative emotions are triggered when one's goals are threatened (Lazarus, 1991).
In line with these theories, and consistently with the operationalization adopted in our previous study (Girardi et al., 2020a), we use emotions as a proxy for users' engagement during interviews. Our choice is further supported by previous empirical findings demonstrating how emotions can be leveraged for detecting engagement in speech-based analysis of conversations (Yu et al., 2004) or for detecting students' motivation (Barhenke et al., 2011). When a user evaluates the importance of a feature, his/her appraisal process triggers an emotional reaction based on the perceived importance and relevance of that feature with respect to his/her goals, values, and desires.
Consistently with prior research on emotion awareness in software engineering (Müller and Fritz, 2015;Graziotin et al., 2015;Girardi et al., 2020b), we adopt a dimensional representation of emotions. In particular, we refer to the Circumplex Model of Affect by Russell (Russell, 1991), which models emotions along two continuous dimensions: valence, that is, the pleasantness of the emotion stimulus, ranging from pleasant to unpleasant, and arousal, that is, the level of emotional activation, ranging from activation to deactivation. Pleasant emotional states, such as happiness, are associated with positive valence, while unpleasant ones, such as sadness, are associated with negative valence. Some emotions are associated with the person being inactive, thus experiencing low arousal, as in sadness or relaxation. Conversely, high levels of arousal are associated with high emotional activation, as in anger or excitement.
We expect to observe different forms of engagement in relation to valence and arousal: positive-high engagement (i.e., positive valence and high arousal) may occur when users discuss topics that they consider relevant and towards which they have a positive feeling, e.g., a feature users like and about which they have an opinion they want to discuss; negative-high engagement (i.e., negative valence and high arousal) may occur when topics are relevant but more controversial, such as a feature that users do not like, or a bug they find annoying. Low engagement may occur when the user does not have a strong opinion on the topic of the discussion, and is either calm (positive valence, low arousal) or bored by the conversation (negative valence, low arousal).
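The quadrant mapping described above can be expressed as a small helper function; this is an illustrative sketch, assuming valence and arousal are self-reported on a 0–100 scale with 50 as the neutral point (the scale used in our self-assessment questionnaire), and the function name and labels are our own.

```python
def engagement_quadrant(valence: float, arousal: float) -> str:
    """Map self-reported valence and arousal (0-100 scale, 50 = neutral)
    to the forms of engagement discussed above. Illustrative sketch."""
    positive = valence >= 50
    if arousal >= 50:
        # High arousal: engaged, either positively or negatively.
        return "positive-high engagement" if positive else "negative-high engagement"
    # Low arousal: calm (positive valence) or bored (negative valence).
    return "calm, low engagement" if positive else "bored, low engagement"
```

For instance, a user who reports valence 80 and arousal 75 while discussing a favourite feature would fall into the positive-high quadrant.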

Biofeedback-based Classification of Emotions
The use of physiological signals for emotion recognition has been largely investigated by affective computing research (Canento et al., 2011;Kim and André, 2008;Koelstra et al., 2012;Soleymani et al., 2016;Girardi et al., 2017). Previous work studied the relationship between emotions and biometrics such as the electrical activity of the brain, e.g., using electroencephalogram (EEG) (Kramer, 1990;Reuderink et al., 2013;Soleymani et al., 2016;Li and Lu, 2009), the electrical activity of the skin, or electrodermal activity (EDA) (Burleson and Picard, 2004;Kapoor et al., 2007), the electrical activity of contracting muscles measured using electromyogram (EMG) (Koelstra et al., 2012;Nogueira et al., 2013;Girardi et al., 2017), and the blood volume pulse (BVP) from which heart rate (HR) and its variability (HRV) are derived (Canento et al., 2011;Scheirer et al., 2002b). In this study, we leverage metrics based on electrodermal activity (EDA), heart rate (HR), and blood volume pulse (BVP).
Electroencephalogram (EEG) captures the electrical activity of the brain through electrodes placed on the surface of the scalp. Changes in the EEG spectrum correlate with increased or decreased overall levels of arousal or alertness (Kramer, 1990) as well as with the valence of the emotion experienced (Reuderink et al., 2013;Soleymani et al., 2016).
The electrodermal activity (EDA) measures the electrical conductance of the skin due to sweat gland activity. EDA correlates with the arousal dimension (Lang and Bradley, 2007) and its variations occur in the presence of emotional arousal and cognitive workload. Hence, EDA has been employed to detect excitement, stress, interest, and attention, as well as anxiety and frustration (Burleson and Picard, 2004;Kapoor et al., 2007).
Heart-related metrics have been successfully employed for emotion detection (Canento et al., 2011;Scheirer et al., 2002b). In particular, blood volume pulse (BVP) measures the changes in the volume of blood in vessels, while heart rate (HR) and its variability (HRV) capture the rate of heart beats and its variation over time. Significant changes in the BVP are observed in the presence of increased cognitive and mental load (Kushki et al., 2011). Increases in HR occur when the body needs a higher blood supply, for example in the presence of mental or physical stressors (Greene et al., 2016).
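As a concrete illustration of heart-related metrics, the sketch below derives mean HR and RMSSD (the root mean square of successive differences, a standard time-domain HRV metric) from a sequence of inter-beat intervals, such as those exported by wearables like the Empatica E4. This is a simplified sketch, not the device's proprietary algorithm.

```python
def hr_and_rmssd(ibi_seconds):
    """Derive mean heart rate (beats per minute) and RMSSD from a list of
    inter-beat intervals expressed in seconds."""
    mean_ibi = sum(ibi_seconds) / len(ibi_seconds)
    hr = 60.0 / mean_ibi  # beats per minute
    # RMSSD: root mean square of successive inter-beat differences.
    diffs = [b - a for a, b in zip(ibi_seconds, ibi_seconds[1:])]
    rmssd = (sum(d * d for d in diffs) / len(diffs)) ** 0.5
    return hr, rmssd
```

A perfectly regular train of 1 s beats yields HR = 60 bpm and RMSSD = 0; alternating 1.0 s and 0.8 s intervals yields a higher HR and a non-zero RMSSD, reflecting beat-to-beat variability.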
Finally, several studies demonstrated the high predictive power of facial EMG for emotion recognition (Koelstra et al., 2012;Nogueira et al., 2013). However, it leads to poor results when the sensors are placed on body parts other than the face (i.e., the arms (Girardi et al., 2017)).
In a recent study, Girardi et al. (Girardi et al., 2020b) identify a minimum set of sensors including EDA, BVP, and HR for valence and arousal classification. To collect such physiological signals, they use the Empatica E4 wristband and detect developers' emotions during software development tasks. They found that the performance obtained using only the wristband is comparable to that obtained using an EEG helmet together with the wristband.
Accordingly, in this study we use EDA, BVP, and HR collected using the Empatica E4, a non-invasive device that participants can comfortably wear during interviews (see Section 3.2), thus increasing the ecological validity of our study. Furthermore, we combine biofeedback with voice features, which were not considered in previous works.

Voice Analysis and Classification of Emotions
Classification of emotions based on the analysis of voice features is a well-developed research field, normally referred to as speech emotion recognition (SER). Different surveys have been recently published on the topic (Akçay and Oguz, 2020;Schuller, 2018;Sailunaz et al., 2018), which highlight the maturity of the research, but also point out its limits in terms of real-world applications, mainly due to the limited gold standard datasets available for training and assessing SER systems.
Speech is composed of a diverse set of acoustic features, and its information content is usually accompanied by other so-called supporting modalities, including linguistic features (i.e., the textual content equivalent to a verbal utterance), visual features, and physiological signals such as those discussed in the previous section.
Acoustic features used in SER are normally classified into prosodic (e.g., pitch, tone), spectral (i.e., frequency-based representations of the sound produced), voice quality (e.g., measuring the stability of the voice) and Teager energy operator (TEO)-based features, the latter specifically developed to detect stress from the voice signal. Prosodic and spectral features are the most commonly used in the literature (Akçay and Oguz, 2020;Sailunaz et al., 2018). In particular, the most commonly used features are the Mel-scaled spectrogram and Mel-frequency cepstral coefficients (MFCCs), spectral features that mimic the human reception pattern of sound frequencies (Issa et al., 2020). Issa et al. (2020) also use chromagrams, typically used for music representation, since the other features were recognised to be poor at distinguishing pitch classes and harmony (Beigi, 2011).
Research in SER initially focused on identifying relevant features and combinations thereof to optimise the performance of traditional classification algorithms (Lee and Narayanan, 2005;Ververidis and Kotropoulos, 2006), leading to good recognition rates especially with Support Vector Machines (Chen et al., 2012;Schuller et al., 2004). With the advent of deep neural networks, and the possibility of overcoming the feature engineering problem altogether, the focus shifted to the selection of appropriate network architectures, and promising results have been achieved with Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) models (Zhao et al., 2019;Trigeorgis et al., 2016). To address the problem of data scarcity, transfer learning methods have also recently been experimented with by Ottl et al. (2020). Another avenue of research in SER, still under development, is the combination of pure audio features with other contextual cues, including videos (Chao et al., 2015;Ren et al., 2019;Ottl et al., 2020).
Applications of SER techniques are mostly in the field of human-computer interaction (HCI). Ramakrishnan and El Emary (2013) list a set of ten possible applications, including lie detection, treatment of language disorders, driving support systems (Schuller et al., 2004), surveillance (Nanda et al., 2017), and smart assistants. The latter have become commercially available in recent years, e.g., Siri and Cortana, and represent one of the natural areas of exploitation of SER research. Another traditional application, closely related to our context, is the recognition of customers' emotions during conversations with call center operators (Han et al., 2020;Li et al., 2019;Morrison et al., 2007;Batliner et al., 2003;Petrushin, 1999). These works aim to identify critical phases of the dialogue between a customer and an artificial operator. This is generally useful to understand when the artificial agent is irritating the customer, and thus to decide whether to transfer the call to a human operator.
With respect to work in SER, ours is among the first studies to bring these techniques to the software engineering domain, and to the RE area in particular, thereby identifying a novel application field.

Sentiment and Emotions in Requirements Engineering
Researchers acknowledge the role and importance of users' emotions in RE activities (Sutcliffe, 2011). As far as written communication is concerned, data such as stakeholders' conversational traces and users' feedback (e.g., tweets and app reviews) can be collected and analyzed once a software product is in use (e.g., sentiment extracted from reviews on the current version of an app is analyzed to prioritize new features). Studies in this area focus on the application of natural language processing (NLP) to textual artefacts to mine the stakeholders' emotions and opinions about a given product or feature. For example, Guzman et al. (2017) use sentiment analysis to analyze a large dataset of 10M tweets about 30 different software applications. They found that tweets are mostly neutral (85%), whereas negative emotions correlate with complaints and positive ones with praises and satisfaction about existing features. Martens and Maalej (2019) apply sentiment analysis to 7M app reviews over 245 free and paid apps. They showed that specific categories are characterized by either positive or negative sentiment. Furthermore, they observed a correlation between the users' sentiment polarity and the rating (e.g., 1-5 stars). Researchers also leveraged emotion mining from app store reviews to evaluate single app features (Guzman and Maalej, 2014;Johann et al., 2017;Shah et al., 2019).
In particular, several studies propose supervised machine learning approaches that leverage sentiment information extracted from a textual source to support RE tasks. For example, Maalej and Nabil (2015) propose a method that uses sentiment scores to classify app reviews into bug reports or feature requests, in order to support the stakeholders involved in software artifact maintenance and evolution in fruitfully dealing with large amounts of feedback. Kurtanović and Maalej (2017) use sentiment scores to investigate how users argue and justify their decisions in Amazon App Store reviews. Other uses of sentiment analysis in RE include the prediction of ticket escalation in customer support systems (Werner et al., 2019).
Finally, emotions have also been considered in early-stage RE activities, such as requirements elicitation and modelling. For example, Colomo-Palacios et al. (2011) asked users to rank requirements according to Russell's Valence-Arousal theory, which is the one that we adopt in the present study (see Section 2.1). They use this information to enhance the effective resolution of conflicting requirements. Other researchers leverage information regarding users' emotions gathered through psychometrics (e.g., surveys) to augment traditional requirements goal modelling approaches (Taveter et al., 2019;Miller et al., 2015) and artefacts, such as user stories (Kamthan and Shahmir, 2017).
With respect to previous work on emotions and RE, we focus on the users' feedback phase. Differently from existing works that leverage app reviews and NLP techniques, thereby considering purely textual or structured feedback (e.g., star ratings), we investigate emotions in oral interviews.

Biofeedback and Voice Analysis in Software Engineering and RE
Biometric sensors have been leveraged in several software engineering studies for recognizing cognitive and affective states of software developers.
In one of the early studies in this field, Parnin (2011) envisions an approach to infer developers' cognitive states based on the analysis of sub-vocal signals, that is, the electrical signal the brain transfers to the mouth and vocal cords while performing complex cognitive activities. While presenting preliminary findings, this study demonstrates that it is possible to use EMG to capture the subvocalization associated with programming and leverage this information to distinguish between easy and hard development tasks. Fucci et al. (2019) use multiple biometrics, including EEG, EDA, HR, and BVP, to distinguish between code and prose comprehension tasks during software development. Fritz et al. (2014) employ EEG, BVP, and an eye tracker to measure the difficulty of programming tasks, thus preventing developers from introducing bugs. The authors use the same set of sensors in a follow-up work aimed at classifying the emotional valence of developers during programming tasks (Müller and Fritz, 2015). Girardi et al. (2020b) replicate previous findings by Müller and Fritz (2015) regarding the use of non-invasive sensors for valence classification during software development tasks. Furthermore, Girardi et al. also address the classification of arousal. Other studies also propose approaches for predicting developers' interruptibility (Züger et al., 2018) and for identifying code quality concerns (Müller and Fritz, 2016) by leveraging EDA, HR, and HRV.
Biofeedback has been used also in RE, mainly to capture users' emotions while using an app. For example, Scherr et al. (2019b) and Mennig et al. (2019) use the mobile phone camera to recognize facial muscle movements and associate them with users' emotions when using different features of an app. This methodology was recently applied to enable user validation of new requirements (Scherr et al., 2019a) and for identifying usability issues (Johanssen et al., 2019a) with minimal privacy concerns (Stade et al., 2019). Part of the authors of the current paper previously proposed using biometrics in requirements elicitation interviews (Spoletini et al., 2016). Our previous work focused on ambiguity, and remained at the research preview stage, as it evolved into the current work after pilot experiments.
With respect to using biofeedback and voice analysis in software engineering and RE, our study is among the first ones to specifically focus on users' interviews rather than product usage or development tasks. Previous studies (e.g., Scherr et al. (2019b);Mennig et al. (2019)) focus on detecting the user's engagement experienced while using the software features. In our case, we aim to detect users' engagement about certain features when users reflect on the features and speak about them. This captures a different moment of the relationship between the user and the product: a verbalized, more rational one. Furthermore, in interviews we can consider what-if scenarios (e.g., financial and privacy-related questions in Table 1), which is not possible when performing observations without interacting with users. Finally, to our knowledge, our work is among the first ones that use voice features to predict the emotion of a speaker involved in a software engineering activity.

Research Design
Our study is exploratory in nature, aiming to investigate a certain area of interest, i.e., engagement in interviews, and to identify possible avenues of research. We adopt a quantitative experimental approach involving human subjects, oriented to compare software-based artifacts (i.e., machine learning algorithms and feature configurations). The study was approved by the Kennesaw State University review board (study 16-068).
The main goal of this study is to understand to what extent we can use biofeedback devices and voice analysis to predict users' engagement during interviews. Accordingly, we formulate the following research questions (RQs).

- RQ1: To what extent can we predict users' engagement using biofeedback measurements and supervised classifiers? With this question, we aim to understand whether it is possible to automatically recognize engagement with biofeedback. More specifically, we want to assess to what extent we can recognize emotional valence and arousal, i.e., the two dimensions we use for the operationalization of engagement. To collect training and testing data, we first interview Facebook users 2, asking their opinion about the platform. After the interview, we ask them to report their engagement for each of the different questions. During the interviews with users, we acquire their raw biofeedback signals. We use features extracted from the signals, and consider intervals of reported engagement as classes to be predicted. Based on these data, we evaluate and compare different supervised machine learning classifiers.
- RQ2: To what extent can we predict users' engagement using voice analysis and supervised classifiers? This question aims to understand whether we can recognize engagement with automatic voice analysis. To this end, we record the audio of the interviews, and we extract voice features from the audio signals. We then use the voice features to train and compare the previously used supervised classifiers.
- RQ3: To what extent can we predict users' engagement by combining voice and biofeedback features? This question aims to use voice and biofeedback features in conjunction. By training and comparing the classifiers, we check if and in which way the combination of features improves the performance of the approach.
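The training-and-comparison step shared by the three research questions can be sketched with scikit-learn: imputation and standard scaling are fitted inside a pipeline (so preprocessing never leaks information from test folds), and candidate classifiers are ranked by mean F1 across cross-validation folds. The candidate set and hyperparameters below are illustrative, not the exact configurations evaluated in the study, and minority-class oversampling is omitted for brevity.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def compare_classifiers(X, y, cv=5):
    """Rank candidate classifiers by mean F1 over cross-validation folds.
    Imputation and scaling are re-fitted per fold via the pipeline."""
    candidates = {
        "svm": SVC(),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    }
    scores = {}
    for name, clf in candidates.items():
        pipe = Pipeline([
            ("impute", SimpleImputer(strategy="mean")),  # fill missing feature values
            ("scale", StandardScaler()),                 # standard scaling
            ("clf", clf),
        ])
        scores[name] = cross_val_score(pipe, X, y, cv=cv, scoring="f1").mean()
    return scores
```

`X` would hold the biofeedback and/or voice feature vectors (depending on the RQ) and `y` the binarized valence or arousal labels derived from self-reports.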

Study Participants
We recruited 31 participants among the students of Kennesaw State University through opportunistic sampling. Participation was not restricted by major or academic level; the main requirement was to be an active Facebook user (access to Facebook at least once per day, self-declared), as the user interview questions dealt with this social network. More than 90% of the participants were undergraduate students across 11 majors. To account for differences in biometrics due to physiological aspects (Bent et al., 2020), we tried to recruit a pool of participants as varied as possible by including multiple ethnic groups and both female and male subjects. Specifically, approximately 65% of the participants were male, and their age varied between 18 and 34, with both median and average equal to 22. Participants were either native speakers of English or proficient in it. The majority (58%) were white/Caucasian, 23% black/African American, 13% Hispanic/Latino, and the remaining 6% were Asian/Pacific islander. During the data analysis, we removed 10 participants because either the collected data were incomplete or the available information was not considered reliable (e.g., they provided the same response to all the questions in the surveys). Of the remaining 21 participants, approximately 67% were male, with the following racial/ethnic distribution: 67% white/Caucasian, 14% black/African American, 14% Hispanic/Latino, and 5% Asian/Pacific islander. We collected information about the ethnicity of participants because research has shown that heart-rate optical sensors may give more or less reliable readings depending on skin tone. Having a diverse pool of participants in terms of ethnicity strengthens the validity of our empirical findings. Participants received a monetary incentive of $25 for up to one hour of their time.

Biofeedback Device and Signals
The device we use to acquire the biofeedback is the Empatica E4 3 wristband. We selected it as it is used in several studies in affective computing (Greene et al., 2016) as well as in the field of software engineering (Müller and Fritz, 2015;Fucci et al., 2019). Using the Empatica E4, we collected the following signals:
- Electrodermal Activity: EDA can be evaluated based on measures of skin resistance. Empatica E4 achieves this by passing a small amount of current between two electrodes in contact with the skin, and measuring the electrical conductance (inverse of resistance) across the skin. EDA is considered a biomarker of individual characteristics of emotional responsiveness and, in particular, it tends to vary based on attentive, affective, and emotional processes (Critchley and Nagai, 2013).
- Blood Volume Pulse: BVP is measured by Empatica E4 through photoplethysmography (PPG), an optical sensor that senses changes in light absorption density of the skin and tissue when illuminated with green and red lights (Allen, 2007;Sinex, 1999).
- Heart Rate: HR is measured by Empatica E4 based on elaboration of the BVP signal with a proprietary algorithm.
Research identified a minimal set of biometrics for reliable valence and arousal detection, consisting of the EDA, BVP, and HR measured by the E4 wristband (Girardi et al., 2020b).
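Before classification, the raw signals are typically summarised into fixed-length feature vectors. A minimal sketch of such per-window summarisation is shown below; the window length and the particular statistics (mean, standard deviation, min, max) are illustrative choices, not the exact feature set used in this study.

```python
import numpy as np

def window_features(signal, fs, win_s=5.0):
    """Split a raw biofeedback signal (e.g., EDA, sampled at 4 Hz by the E4)
    into non-overlapping windows and summarise each with simple statistics."""
    n = max(1, int(fs * win_s))  # samples per window
    rows = []
    for start in range(0, len(signal) - n + 1, n):
        w = np.asarray(signal[start:start + n], dtype=float)
        rows.append([w.mean(), w.std(), w.min(), w.max()])
    return np.array(rows)  # shape: (num_windows, 4)
```

Feature rows produced this way, one per window of the interview, can then be aligned with the self-reported engagement labels for each question.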

Audio Device and Signals
The interviews' audios were captured using the default audio recorder of a Mac OS laptop, and the files were stored in the classical Waveform Audio File Format (.wav), which is an uncompressed representation of the raw signal.
Audio is a complex, information-rich signal, and a large and varied set of classical features is used to characterise its salient aspects (Schuller and Batliner, 2013). Among these features, we consider the following ones:
- Mel Spectrogram: it represents the acoustic time-frequency representation of sound.
- Mel-Frequency Cepstral Coefficients (MFCC): MFCCs are a representation of the short-term power spectrum of sound. In more detail, cepstral features contain information about the rate of change in the different spectrum bands, and they have the ability to separate the impact of the vocal cords and the vocal tract in a signal. In the MFCC, these features are extracted at the frequencies most audible to human ears.
- Chromagram: the chromagram is a 12-element feature vector indicating how much energy of each pitch class is present in the signal. This is typically used to model harmonic and melodic characteristics of music, and it is recognised as useful also to model the emotional aspect of voice (Schuller and Batliner, 2013).
We chose these features as they are amongst the most common in speech emotion recognition (Issa et al., 2020; Schuller and Batliner, 2013).

Experimental Protocol and Data Collection
Three main roles are involved in the experiment: interviewer, user, and observer. The interviewer leads the experiment by asking questions to the user, while the observer tracks the interview by annotating timestamps of each question, monitoring the output of the wristband, checking that audio recording is operational, and annotating general observations on the interview and behaviour of the user.
The experimental protocol consists of four phases: (i) device calibration and emotion triggering, (ii) user's interview, (iii) self-assessment questionnaire, and (iv) wrap-up.
Device calibration and emotion triggering In line with previous research (Müller and Fritz, 2015; Girardi et al., 2020b), we run a preliminary step for device calibration and emotion elicitation. The purpose of this phase is threefold. First, we want to check the correct acquisition of the biofeedback signal by letting the wristband record the raw signals for all sensors under the experimenter's scrutiny. Second, the collected data will be needed to adjust the scores obtained during the self-assessment questionnaire (see Sect. 3.6). Third, we want the participants to get acquainted with the emotion self-report task.
Accordingly, we run a short emotion elicitation task using a set of emotion-triggering pictures. Each participant watches a slideshow of 35 pictures. Each picture is displayed for 10 seconds, with intervals of five seconds between them to allow the user to relax. The whole slideshow lasts for nine minutes. During the first and last three minutes, calming pictures are shown to induce a neutral emotional state, while during the central three minutes the user sees pictures aimed at triggering negative and positive emotions. The pictures have been selected from the Geneva database (Dan-Glauser and Scherer, 2011), previously used in software engineering studies by Müller and Fritz (2015). The user is then asked to fill in a form to report the degree of arousal and valence they associate with the pictures on a visual scale from 0 to 100. As done in previous work (Müller and Fritz, 2015), for each picture, the user is asked two questions: 1) You are judging this image as: 0 = Very Negative; 50 = Neutral; 100 = Very Positive; 2) Confronted with this image you are feeling: 0 = Relaxed; 50 = Neutral; 100 = Stimulated.
User's Interview A trained interviewer conducts the interview with each user. The interview script consists of 38 questions concerning the Facebook platform. Questions are grouped into seven topics, i.e., usage habits, privacy, procedures, relationships, information, money, and ethics. The questions are reported in Table 1. For each topic, we include multiple questions, to allow users sufficient time to get immersed in the topic, and to obtain more stable biometric parameters in relation to the topic. Questions related to topics we expect to raise more engagement (i.e., privacy, relationships, money, and ethics) are separated by questions on topics that are expected to reduce user engagement (i.e., usage habits, procedures, and information).
The lower degree of engagement for the latter topics was assessed during preliminary experiments in which the questions were drafted and finalised 4 . During the interview, the wristband records the biofeedback parameters, the audio recorder acquires the voice of the speakers, while the observer annotates the timestamp of each question. We use this information to align the sensor data with the questions. Based on a preliminary run, each interview was estimated to last for about 20 minutes.
Self-assessment Questionnaire For each question in the interview script (i.e., Q i ), the interviewer asks the participant to report their involvement using two 10-point rating scale items: (q A (Q i )) How much did you feel involved with this topic? (1 = Not at all involved; 10 = Extremely involved); (q V (Q i )) How would you rate the quality of your involvement? (1 = Extremely negative; 10 = Extremely positive). These two questions aim at measuring the engagement of the user in terms of arousal (q A ) and valence (q V ). The participants' answers to these questions represent our gold standard for the machine-learning study (see Section 3.6).
Wrap-up The observer downloads and stores the wristband data as well as the voice recording and the questionnaires filled by the participant. The wristband memory is then erased to allow further recording sessions.

Pre-processing and Feature Extraction
The data from the interview questionnaire are used to produce the gold standard, i.e., the labels for valence and arousal to be predicted.
We define positive, negative, and neutral labels for valence, and high, low, and neutral labels for arousal. We discretize the scores in the rating scale following an approach utilized in previous research (Müller and Fritz, 2015;Girardi et al., 2020b). First, we adjust the valence and arousal scores based on the mean values reported while watching the emotion-triggering pictures (see Section 3.5). This step is necessary to take into account fluctuations due to individual differences in the interpretation of the scales in the interview questionnaire. Then, we perform a discretization of the values into the three categories (i.e., labels) for each dimension using k-means clustering 5 .
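The discretization step can be sketched as follows; the adjusted scores below are hypothetical, and the mapping from clusters to labels by ordering the cluster centroids is an assumption about how the three categories are assigned:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical valence scores after adjusting for the calibration-phase mean
scores = np.array([-3.1, -2.8, -0.2, 0.1, 0.4, 2.9, 3.3, 3.6]).reshape(-1, 1)

# Discretize into three categories with k-means clustering (k = 3)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores)

# Name clusters by ordering their centroids: lowest -> negative, highest -> positive
order = np.argsort(km.cluster_centers_.ravel())
label_names = {order[0]: "negative", order[1]: "neutral", order[2]: "positive"}
labels = [label_names[c] for c in km.labels_]
print(labels)
```

The same procedure, run on the arousal scores, would yield low/neutral/high labels instead.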
To synchronize the measurement of biofeedback and voice signals with the self-assessment, we (1) save the timestamp corresponding to the interviewer asking question Q i (i.e., timestamp(Q i )), (2) calculate the timestamp associated to the next question Q i+1 (timestamp(Q i+1 )), and (3) select the signal samples recorded between timestamp(Q i ) and timestamp(Q i+1 ).
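The per-question alignment amounts to slicing each signal by the two timestamps; a toy sketch (function name and values are illustrative, not from the study's code):

```python
# Keep the samples recorded between timestamp(Q_i) (inclusive) and
# timestamp(Q_{i+1}) (exclusive); samples[i] was recorded at timestamps[i].
def slice_by_question(samples, timestamps, t_start, t_end):
    return [s for s, t in zip(samples, timestamps) if t_start <= t < t_end]

eda = [0.11, 0.12, 0.15, 0.14, 0.13]       # toy EDA samples
ts = [10.0, 10.5, 11.0, 11.5, 12.0]        # acquisition timestamps (s)
q1, q2 = 10.5, 11.5                        # timestamp(Q1), timestamp(Q2)
print(slice_by_question(eda, ts, q1, q2))  # samples attributed to Q1
```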
For each interview question Q i , we have: a set of biofeedback signal samples (for EDA, BVP and HR) within the time interval associated to Q i ; a voice signal sample in the form of a .wav file-the segment of the .wav file of the whole interview for the time interval associated to Q i ; two labels, one representing arousal (q A (Q i )) and the other representing valence (q V (Q i )) according to the self-assessment questionnaire.
The labels are used to form the gold standard to be predicted by the algorithms based on features extracted from the signal samples.
We normalize the signals collected during the entire duration of the experiment to each participant's baseline using the Z-score (Müller and Fritz, 2015). To maximize the signal information and reduce noise caused by movements, we apply multiple filtering techniques. Regarding BVP, we extract frequency bands using a band-pass filter at different intervals (Canento et al., 2011). The EDA signal consists of a tonic component (i.e., the level of electrical conductivity of the skin) and a phasic component representing phasic changes in electrical conductivity, or skin conductance response (SCR) (Braithwaite et al., 2015). We extract the two components using the cvxEDA algorithm (Greco et al., 2016). For the audio signal, we use the Python package Librosa (McFee et al., 2015) 6 for audio and music analysis to process the files, and we extract the different features (Mel Spectrogram, MFCC, Chromagram). After signal pre-processing, we extract the features presented in Table 2, which we use to train our classifiers. We select biofeedback features based on previous studies using the same signals (Müller and Fritz, 2015; Fucci et al., 2019; Girardi et al., 2020b), and we choose audio features according to recommendations from the specialised literature (Schuller and Batliner, 2013; Issa et al., 2020).
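As a concrete illustration of the per-participant baseline normalization, a minimal Z-score sketch (the HR values are toy data, not the study's measurements):

```python
import numpy as np

# Normalize a signal to a participant's own baseline via Z-score:
# subtract the mean and divide by the standard deviation.
def z_normalize(signal):
    signal = np.asarray(signal, dtype=float)
    return (signal - signal.mean()) / signal.std()

hr = [62.0, 64.0, 70.0, 80.0, 74.0]  # toy HR samples for one participant
z = z_normalize(hr)
print(round(z.mean(), 10), round(z.std(), 10))  # ~0 mean, unit std by construction
```

After this step, values from different participants are expressed on a comparable scale, so inter-individual baseline differences do not dominate the features.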

Analysis Procedure
The analysis procedure aims at answering the three RQs, as detailed in the following. We first collect descriptive data and provide qualitative considerations. To this end, we measure the range of engagement in terms of arousal and valence, based on the results of the self-assessment questionnaire. This allows us to understand which are the most engaging topics according to the users, and to what extent engagement varies during the interview. Then, for each user, we use the biometrics gathered in the user's interview phase as input features for the different classifiers listed in Sect. 3.4. To answer the different questions, we first consider solely biofeedback features (RQ1), then voice features (RQ2) and finally their combination (RQ3).
In line with previous research (Müller and Fritz, 2015;Girardi et al., 2020b), we target a binary classification task using machine learning. In particular, we distinguish between positive and negative valence and high and low arousal. As such, we exclude the neutral label from the gold standard and focus on more polarised values. Although this reduces our dataset, it also facilitates the separation between clearly distinguished emotional states 7 .
We evaluate our classifiers in the Hold-out setting. Therefore, we split the gold standard into train (70%) and test (30%) sets using the stratified sampling strategy, which allows having a balanced set of instances from the different classes in both sets. For each algorithm, we search for the optimal hyperparameters (Tantithamthavorn et al., 2016, 2019) using leave-one-out cross-validation, i.e., the recommended approach for small training sets (Raschka, 2018) such as ours. The resulting model is then evaluated on the test set to assess its performance on unseen data and avoid overfitting. We repeat this process 10 times with different splits of the train and test sets to further increase the validity of the results. The performance is then evaluated by computing the mean for precision, recall, F1-measure, and accuracy over the different runs. This setting is directly comparable to the one implemented by Müller and Fritz (2015) and by Girardi et al. (2020b), which includes data from the same subject in both training and test sets.
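The evaluation loop described above can be sketched with Scikit-learn; the data, the parameter grid, and the choice of Random Forest here are illustrative (the study evaluates several algorithms), and the sketch uses 3 repetitions instead of 10 to stay compact:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, LeaveOneOut
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))                                  # toy feature vectors
y = (X[:, 0] + 0.2 * rng.normal(size=60) > 0).astype(int)     # toy binary labels

scores = []
for run in range(3):  # the study repeats this 10 times with different splits
    # 70/30 stratified hold-out split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=run)
    # hyperparameter search with leave-one-out cross-validation on the training set
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          {"n_estimators": [10, 50]},          # hypothetical grid
                          cv=LeaveOneOut())
    search.fit(X_tr, y_tr)
    # evaluate the tuned model on unseen test data
    scores.append(f1_score(y_te, search.predict(X_te), average="macro"))

print(round(float(np.mean(scores)), 3))  # mean macro-F1 over the runs
```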
The process outlined above is repeated with a maximum of 8 different settings, based on the three following data preparation options, oriented to improve the performance of the machine learning algorithms without losing validity of the results:
- Standard Scaling: the features in the training set are standardised so that their distribution has a mean of 0 and a standard deviation of 1. The standardisation parameters from the training set are then applied to scale the test set. This way, information from the test set (i.e., its standard deviation) is not passed to the training set, which could bias the learning process. Standard scaling is essential for machine learning algorithms that calculate distances between data points, in our case SVM and MLP: if the features are not scaled, those with a larger value range dominate the distance computation. Scaling should not affect rule-based algorithms, which consider each feature separately and are not affected by monotonic transformations of the variables, such as standard scaling. Standardisation is performed by means of the StandardScaler of Scikit-learn.
- Balancing: Synthetic Minority Oversampling Technique (SMOTE) is a traditional data augmentation technique applied to train machine learning models in case of class imbalance (Chawla et al., 2002). Indeed, in case of class imbalance, machine learning algorithms tend to perform poorly on the minority class, as they do not have a sufficient number of examples to learn from and build a fair classification model. To overcome this issue, SMOTE creates synthetic examples of the minority class, in our case based on the k-nearest neighbour algorithm. To prevent data leakage, SMOTE is applied solely to the training set; therefore, the test set does not contain synthetic data. SMOTE is performed through the SMOTE class from the imblearn Python package.
- Imputation: data imputation is normally adopted when some features have missing data.
In our case, we miss voice feature data for 66 arousal vectors and 60 valence vectors. We can however infer (impute) the data by using the corresponding biofeedback features. Imputation is performed by means of k-nearest neighbors imputation, using the KNNImputer from Scikit-learn.

Execution and Results
The data were initially gathered from 31 participants. Interviews lasted 18 minutes on average. We discarded the data from those subjects for which data were largely incomplete, or that showed a low standard deviation (i.e., lower than 1) in their labels of valence and arousal. Although these subjects may in principle have had little variation in their actual emotions, they can be considered outliers with respect to the rest of the subjects. As data are treated in aggregate form, and given the limited number of data points, including these outliers could have introduced undesired noise. We also discarded data whenever some inconsistency was observed through the different pre-processing steps, e.g., implausible timestamps. At the end of this process, we produced the feature vectors and associated labels for valence and arousal (776 vectors in total from 21 subjects). The scatter plot for the two dimensions is reported in Fig. 1. Part of the vectors resulted neutral for the dimensions, based on the participants' answers. Therefore, our gold standard includes only the vectors labelled as high (positive) or low (negative), and we model our problem as a binary classification task. Table 3 reports the gold standard dataset with valence and arousal distribution, when considering biofeedback features (for RQ1). Voice feature vectors corresponding to each biofeedback vector could not be identified for part of the gold standard items, as the audio recording was not reliable for some subjects. Therefore, the gold standard dataset for audio only (RQ2) and for combined features (RQ3) without imputation is a subset of the original gold standard, and is reported in Table 4.

Descriptive Statistics
In the following, we report some descriptive statistics on the data. Table 5 reports the ranges of valence and arousal, according to the self-assessment questionnaire. We report both original values and normalised ones ("norm", in the table). We see that, overall, users tend to give high scores both for arousal and valence (both averages are above 7), indicating that the interview is generally perceived as positively engaging. Although they used the whole 1 to 10 scale for both dimensions, indicating that the interview appeared to cover the whole range of emotions, we see that the standard deviation is not particularly large, especially for valence. Indeed, considering the more intuitive 1-10 scale, the value of the standard deviation (Std. Dev. in Table 5) indicates that around 68% of the subjects gave scores in [6-9] for valence, and in [5-9] for arousal. This indicates that subjects tended to report scores around the average, and that apparently most of the interview triggered a similar level of engagement.
To gain more insight, it is useful to look at the reported engagement for each question 8 . Figure 2 reports the box plots for valence and arousal for each question, divided by question group. We see that questions related to privacy, ethics and usage habits tend to create more (positive) arousal on average, while questions related to procedures are associated with more neutral values of arousal and valence (i.e., closer to 0 in the plot). Interestingly, questions related to relationships show the largest variation in terms of arousal and valence (the box plot appears larger), indicating that this is a sensitive topic for the users, leading to more polarised scores in terms of emotional dimensions. The maximum average valence, instead, is observed for questions related to ethics.

RQ1:
To what extent can we predict users' reported engagement using biofeedback measurements and supervised classifiers?
In Figure 3 we report the performance of the different classifiers in terms of their precision, recall, F1-measure and accuracy, considering the different configurations. Specifically, for each metric, we report the mean over the ten runs of the Hold-out train-test procedure, i.e., the macro-average. This choice is in line with consolidated recommendations from the literature on classification tasks using machine learning (Sebastiani, 2001). Specifically, macro-averaging is recommended with unbalanced data such as ours, as it emphasizes the ability of a classifier to also perform well on categories with fewer training instances.
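The difference macro-averaging makes is easy to see on a small imbalanced example (toy labels, not the study's data): each class contributes equally to the average, regardless of its size:

```python
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

y_true = [1, 1, 1, 1, 0, 0]   # imbalanced toy gold standard (majority class: 1)
y_pred = [1, 1, 1, 0, 0, 1]   # toy predictions

# Macro-average: compute each metric per class, then take the unweighted mean,
# so the minority class weighs as much as the majority class.
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
acc = accuracy_score(y_true, y_pred)
print(round(p, 3), round(r, 3), round(f1, 3), round(acc, 3))  # 0.625 0.625 0.625 0.667
```

Here accuracy (0.667) is higher than the macro scores because the classifier does better on the majority class; the macro-average exposes the weaker minority-class performance.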

Green cells indicate cases with good performance, red cells indicate poor performance, and intermediate shades indicate intermediate values.
We see that the best performance (in bold) for valence is achieved by the Multi-layer Perceptron (MLP), while for arousal the best algorithm is Random Forest (RF), when applying both balancing and standard scaling. The worst performing algorithm is Naive Bayes (NB), regardless of the configuration. In general, we see that balancing the dataset (Bal. set to Y in the table) leads to the most relevant improvement for all the algorithms except NB (the top-2 lines are green), allowing the F1-measure to pass from the range of 0.5-0.6 to the range of 0.8-1.0. This indicates that compensating for class imbalance in the training set substantially boosts the performance of the algorithms in this setting, characterised by a limited number of data points, especially for negative valence (see Table 3). Given that high performance is observed for most algorithms, this suggests that the biofeedback features considered are effective in discriminating between positive (high) and negative (low) valence (arousal).
As expected, standard scaling (Scale set to Y) has a particularly positive influence on SVM (F1 passes from 0.648 to 0.923 for valence, and from 0.684 to 0.890 for arousal), since this algorithm is strongly influenced by the scale of the features. Scaling does not have a strong influence on rule-based algorithms that consider feature values individually, such as Decision Trees (DT) and Random Forest (RF). Actually, for DT, we even see a decrease in performance for arousal (F1 decreases from 0.934 to 0.823) when applying standard scaling, and similarly for NB.
Table 6: Performance of the best classifiers based on F1, using EDA, BVP, and HR features with respect to the majority class baseline classifier. Improvement over the baseline is also shown.
In Table 6, we report the results of the two best algorithms with the best configurations, and we compare them with a baseline. Following previous research on sensor-based emotion recognition in software development (Girardi et al., 2020b), we select as baseline the trivial classifier always predicting the majority class, that is high for arousal and positive for valence. For the sake of completeness, we also report accuracy, even if its usage is not recommended in the presence of unbalanced data such as ours.
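Such a majority-class baseline can be sketched with Scikit-learn's DummyClassifier; the label counts below are toy values chosen only to mirror an imbalanced valence distribution, not the study's dataset:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_recall_fscore_support

y_train = np.array([1] * 14 + [0] * 6)   # toy imbalanced training labels (1 = positive/high)
X_train = np.zeros((20, 1))              # features are ignored by the baseline
y_test = np.array([1] * 7 + [0] * 3)
X_test = np.zeros((10, 1))

# Trivial baseline: always predict the majority class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
p, r, f1, _ = precision_recall_fscore_support(
    y_test, baseline.predict(X_test), average="macro", zero_division=0)
print(round(p, 2), round(r, 2), round(f1, 2))
```

Because the baseline never predicts the minority class, its macro recall is pinned at 0.5 and its macro precision is roughly half the majority-class proportion, which is why the trained classifiers show such large relative improvements.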
For valence, the MLP classifier distinguishes between negative and positive emotions with an F1 of 0.98, thus obtaining an increment of 117% with respect to the baseline. Furthermore, we observe an improvement in precision of 145% (from 0.40 of the baseline to 0.98 of MLP) and 98% in recall (from 0.50 to 0.99). These results indicate that the classifiers' behavior is substantially better than the baseline classifier that always predicts the positive class.
As for arousal, we observe a comparable performance. The RF classifier distinguishes between high and low activation with an F1 of 0.97, representing an improvement of 170% over the baseline (0.36). Again, the classifier substantially outperforms the baseline with an improvement of 246% for precision (from 0.28 to 0.97) and 94% for recall (from 0.50 to 0.97). The improvement with respect to the baseline is particularly high for arousal in terms of precision since arousal data (cf. Table 3) are more balanced. Therefore, the baseline that always predicts the positive class is inherently less effective, with a precision of 0.28.

RQ2:
To what extent can we predict users' engagement using voice analysis and supervised classifiers? Figure 4 reports the comparison of the performance between the different machine learning algorithms, evaluated in terms of precision, recall, F1-measure and accuracy (macro-average), for the different configurations. Differently from the biofeedback case, here we also apply data imputation (Imp. column) as a configuration option, synthesising data for vectors with missing voice data, based on the corresponding biofeedback vectors. Therefore, the gold standard considered when Imp. is set to Y (Yes) is analogous to the one used for biofeedback and reported in Table 3. When Imp. is set to N (No), the gold standard is the one in Table 4. This is an important aspect to notice for at least two reasons: (1) the two gold standards have slightly different distributions, and thus the default baselines for comparison will differ. In particular, when using imputation, the baselines are the same as those used in Table 6. When imputation is not used, the baselines need to be recomputed; in particular, the majority class baseline for arousal will always predict the negative class, as this is the most frequent in Table 4; (2) the usage of imputation in a real-world context assumes that, although solely voice data are used for classification, biofeedback data are collected anyway, so the practical advantage, both economic and logistic, is limited.
As it happened for the biofeedback case, the actual boost in performance for all the algorithms, except Naive Bayes (NB), is provided by the application of SMOTE (Bal. set to Y)-green lines are the top 4 of each algorithm.
The best performance (in bold) for valence and arousal is achieved by the Multi-layer Perceptron (MLP) and by Support Vector Machines (SVM), while NB again remains far behind the other algorithms, regardless of the applied configuration options.
For valence, both MLP and SVM lead to perfect classification (all measures equal 1.000) when balancing and scaling are applied, and imputation is not applied. Therefore, voice features alone appear to be particularly effective in discriminating the quality of the engagement (positive or negative valence), thereby confirming that our set-up is effective in capturing the so-called emotional prosody (Buchanan et al., 2000), which reveals the sentiment of the speaker.
Lower performance is achieved with imputation (F1=0.969 for MLP and 0.958 for SVM). These values are lower also with respect to the best performance obtained using biofeedback features (F1=0.983 for MLP, cf. Fig. 3). The decrease in performance when using imputation is common to all the algorithms, when considering valence-F1 for valence is always lower in the top line of each algorithm in Figure 4.
For arousal, the best performance (bold) is obtained by MLP, when using all the configuration options, including imputation. The performance is comparable to the best one obtained with biofeedback features (F1=0.979 vs F1=0.969 for Random Forest (RF), cf. Fig. 3). When imputation is not used, the best performance is obtained by RF, and remains comparable with the best case for biofeedback (F1=0.966 vs F1=0.969). Differently from the valence case, here imputation appears to improve the performance for all the algorithms-cf., e.g., top lines of each algorithm for arousal in Fig. 4-except RF and NB.
Overall, our findings suggest that voice features represent a valid alternative to biofeedback for emotion recognition during requirements elicitation. In fact, our classifiers' performance demonstrates that, in the absence of biofeedback information, both valence and arousal can still be successfully predicted with voice-only features.
Table 7: Performance of the best classifiers, according to F1, using voice features and without imputation, with respect to the majority class baseline classifier. Improvement over the baseline is also shown.
To gain additional insights, Table 7 compares the results of the best algorithms with respect to the majority class baselines. As it is more interesting, given the practical advantages of using voice features only, we consider the case in which imputation is not applied. For valence, the perfect classification achieved leads to an increase of 150% in terms of precision, 100% for recall and 122% for F1. For arousal, the increase in performance is again higher, with 235% for precision, 94% for recall and 162% for F1. These numbers are essentially equivalent to those obtained with biofeedback, confirming that voice-only features are sufficient to predict engagement in interviews.

RQ3:
To what extent can we predict users' engagement by combining voice and biofeedback features? Figure 5 reports the comparison of the performance of the different algorithms with the different configurations when using voice and biofeedback features combined.
General trends are analogous to those observed when features are treated separately, with increased performance for all the algorithms except Naive Bayes thanks to the usage of SMOTE, and to the application of standard scaling for Support Vector Machines (SVM) and Multi-layer Perceptron (MLP). The best performance for valence (in bold) is achieved by MLP, when applying SMOTE and scaling, and regardless of the application of imputation.

Table 8: Performance of the different algorithms using combined voice and biofeedback features, for valence and arousal, under the different configurations. In bold, we report the best performance for each algorithm considering F1.
Instead, the best performance for arousal is obtained by the SVM algorithm, when considering data without imputation. For these cases, perfect classification is achieved, with all measures equal to 1.000. These results suggest that voice and biofeedback features can play complementary roles in engagement recognition, and can lead to the best classification results compared to those obtained when using only one source of data, when specific algorithms are selected. These observations are confirmed by Table 9.
Table 9: Performance of the best classifiers, according to F1, using combined features with respect to the majority class baseline classifiers. Improvement over the baseline is also shown.
As for the other cases, in Table 9 we compare the performance of the best algorithms with the majority class baselines. In all cases, the improvement for precision, recall and F1 is always greater than or equal to 100%, thus confirming that the combination of voice and biofeedback features allows obtaining a fine-grained distinction of classes for both valence and arousal, which cannot be achieved without combining the features.

Discussion
The main take-away messages of this study are:
- Users' interviews are activities that can trigger positive engagement in the involved users.
- Different levels of engagement are experienced depending on the topic of the question, with topics such as privacy, ethics and usage habits leading to higher engagement, and relationships leading to larger variations of engagement.
- By combining biofeedback features into vectors and training the Multi-layer Perceptron (MLP) and Random Forest (RF) algorithms, it is possible to predict engagement in a way that outperforms a majority-class baseline, with an F1-measure of 98% for valence when using MLP, and 97% for arousal with RF.
- Using voice features only, when training MLP, Support Vector Machines (SVM) and RF, performance increases further. Engagement can be predicted through voice features alone with an F1-measure of 100% (valence, SVM or MLP) and 97% (arousal, RF).
- The combination of biofeedback and voice features maximises performance, with an F1-measure of 100% for both valence and arousal. In this case, the best performance is achieved with SVM for arousal, and with MLP for valence.
- The major boost in performance is generally obtained by all the algorithms when applying data augmentation by means of SMOTE. Regardless of the types of features, its usage increases the F1-measure from the range of 0.5-0.6 to the range of 0.8-1.0.
In the following sections, we discuss our results in relation to existing literature and outline possible applications and timely avenues of research that are enabled by the current study.

Engagement and Topics
Our descriptive statistics indicate that users experienced different levels of engagement with respect to the question topic. Specifically, our participants reported a positive attitude when discussing privacy, ethics, and usage habits. Concerning privacy and ethics, these topics were selected on purpose to trigger higher engagement. Given the rising interest in these two fields, especially in relation to Facebook and online communities in general (e.g., Trice et al. (2019)), the obtained results are not surprising. Concerning usage habits, we expected to see lower values of arousal. As questions regarding usage habits were asked at the beginning, the high arousal observed may result from the excitement of the new experience. However, we observed that question 19, also about usage habits but asked later, was the one with the highest average arousal (3.6 in normalised values, while the average for all questions regarding usage habits is 2.8) and valence (3.2 vs 2.5) 9 . Therefore, we argue that speaking about usage habits triggers positive engagement. This indicates that users generally like the platform and are interested in speaking about their habitual relation with it. Qualitative analysis of the audio of the actual answers, not performed in this study, can further clarify these aspects. Overall, these results show that 1) users' interviews elicit emotions and engagement, with varying degrees of reactions depending on the topic; and 2) some topics are perceived as more engaging than others.

Performance Comparison with Related Studies
According to the theoretical model of affect described in Sect. 2, in this study we use emotions as a proxy for engagement. Specifically, we operationalize emotions along the valence and arousal dimensions of the Circumplex Model of affect (Russell, 1991), which we recognize using biometrics and voice. Using machine learning, we are able to classify emotions of users engaged in requirements elicitation interviews by distinguishing between positive and negative valence and high and low arousal. We experimented with different experimental settings, i.e., with/without data balancing using SMOTE, data scaling, and data imputation (for voice data only).
As for biometrics, we observe that the performance significantly increases when SMOTE is applied for balancing our training data, achieving F1-measures up to 0.98 and 0.97 for valence and arousal, respectively (see Table 6), thus outperforming our previous classifier (Girardi et al., 2020a). A direct comparison is also possible with the performance achieved in the empirical study by Girardi et al. (2020b), as we use the same device (i.e., Empatica E4 wristband) and include the same metrics for EDA, BVP, and HR. Our classifier performance for arousal (F1 = 0.97, accuracy = 0.99) and valence (F1 = 0.98, accuracy = 0.97) outperforms the one they obtain using Empatica, i.e., 0.55 for arousal and 0.59 for valence. They report a slightly better performance, though still lower than ours, when also including the EEG helmet (F1 = 0.59 for arousal and F1 = 0.60 for valence). Müller and Fritz (2015) report an accuracy of 0.71 for valence, using a combination of features based on EEG, HR, and pupil size captured by an eye-tracker. Overall, the tasks are different from ours, as neither voice nor active expression of emotions was triggered in these related works. Our considerably better performance may also be linked to the specific task of interviewing and the actual use of voice, not only as a feature for emotion prediction, but as a means for emotion expression (Laukka, 2017; Scherer, 2003). Indeed, the simple act of vocalizing can be regarded as an explicit, although not necessarily voluntary, expression of emotion that has an effect on biometric aspects, thus improving the performance of our classifiers even when using biofeedback only as a predictor.
Previous studies in affective computing report comparable performance, e.g., an accuracy of 0.97 for arousal (Soleymani et al., 2015; Chen et al., 2015; García et al., 2016) and 0.91 for valence (Nogueira et al., 2013). However, it is worth highlighting that such studies rely on high-definition EEG helmets (Soleymani et al., 2015; Chen et al., 2015; García et al., 2016) and facial electrodes for EMG (Nogueira et al., 2013), which are not comfortable to wear and, thus, could not be used outside a laboratory setting, e.g., during real interviews with users or in remote interviews.
Our approach also achieves comparable performance when using voice features only, for both arousal (F1 = 0.97, accuracy = 0.97) and valence (F1 = 1.00, accuracy = 1.00) recognition. Furthermore, the model relying on voice-based features paves the way to future replications in in vivo studies that do not require the use of wearable sensors. The voice-based classifier we trained and tested in the scope of this study outperforms most state-of-the-art approaches for speech emotion recognition (Akçay and Oguz, 2020). However, further research with a larger pool of participants is needed to assess the classifier performance on new, unseen speakers.
Previous work has also tried to recognize discrete emotions instead of valence and arousal. Lin and Wei (2005) used HMM and SVM to classify five emotional states, namely anger, happiness, sadness, surprise, and the absence of emotion (i.e., the neutral condition), achieving up to 99.5% accuracy.
As far as the combination of biofeedback and voice is concerned, our classifier outperforms the deep learning approach recently proposed by Aledhari et al. (2020). As in our study, they use the Empatica E4 wristband for collecting biofeedback, and report an accuracy of 85% on the test set and 79% on the validation set for the recognition of emotional valence.

Implications for Research and Practice
This is an exploratory study, which is not specifically oriented to direct applications, but rather aims at a first understanding of engagement in user interviews, and of the potential usage of biofeedback devices and voice analysis in this context. However, we argue that our results, once consolidated, can have multiple applications and can open new avenues of research.

Applications in User Feedback
In user interviews similar to those staged in our experiment, biometric information in the form of biofeedback and voice features can be exploited to better investigate possible discrepancies between user engagement and the reported relevance of features, and thus facilitate requirements prioritization tasks, similarly to sentiment analysis applied to textual user feedback (Sutcliffe, 2011). Furthermore, these technologies can be extended to identify the engagement of the user on the fly, i.e., during the interview, to guide analysts in steering the flow of the dialogue. These applications, which support human analysts in their activity, become particularly important when artificial agents are used to elicit feedback or provide customer support, as shown by related research on voice analysis for call centers (Han et al., 2020; Li et al., 2019). In these works, the detection of negative emotions is used to understand when a human operator needs to replace an artificial one, because the latter is irritating the customer. Therefore, our work also opens the way to further applications of emotion-aware, voice-based chatbots for user interviews.
The Role of Voice The introduction of voice features is particularly decisive in this sense. Biofeedback needs to be locally acquired with specialised devices such as the Empatica E4, which: (i) costs about $1,690.00 at the time of writing; (ii) needs to locally register the different signals; (iii) does not send the signals remotely in an automated manner; (iv) can raise privacy concerns in users who are not accustomed to this type of device. Therefore, their usage is realistic only during face-to-face interviews, in which a certain level of mutual trust can be achieved and all data can be acquired in loco. Instead, the analysis of voice is particularly appropriate in remote communication scenarios, involving either human or artificial agents, which are increasingly common due to the COVID-19 pandemic. Voice is voluntarily produced and transmitted by users, and can be remotely recorded and processed without the need for specialised devices, with evident savings in terms of costs. This cost reduction extends the applicability of the idea to large-scale scenarios. With voice analysis, automated user feedback campaigns become feasible, and companies can improve automated A/B testing of web apps or pages. Specifically, they can ask multiple users to interact with different versions of an interface, and speak up their reflections on the experience. The recording and analysis of engagement can then be used to identify preferred versions, appreciated features, or relevant interaction problems.
Applications in RE and Software Engineering In the case of more classical requirements elicitation interviews (Davis et al., 2006; Zowghi and Coulin, 2005), the usage of biometrics can support these activities by improving the analyst's ability to create a trustworthy relationship with the customer, and by improving the quality of the interview and of the collected data. In this context, it is relevant to extend the work to identify the customer's frustration, which often corresponds to the first step towards mistrust in the analyst (Distanont et al., 2012). Frustration can be detected through biofeedback by analyzing changes in heart rate, temperature, and other vital signs (Haag et al., 2004; Wagner et al., 2005; Mandryk et al., 2006; Scheirer et al., 2002a), and used to warn the analyst. Furthermore, frustration is closely related to stress, which can be detected in voice signals through Teager energy operator (TEO)-based features (Zhou et al., 2001; Bandela and Kumar, 2017).
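The discrete Teager energy operator underlying such features is simple to state: for a sampled signal x, it is psi[x](n) = x(n)^2 - x(n-1)*x(n+1), which for a pure tone tracks both amplitude and frequency. The following numpy sketch illustrates only this basic operator; the cited works build considerably richer stress-related features on top of it, and the test tone here is an invented example.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator:
    psi[x](n) = x(n)^2 - x(n-1) * x(n+1).
    For a tone A*cos(w*n), the identity cos(a-b)cos(a+b) = cos^2(a) - sin^2(b)
    gives psi = A^2 * sin(w)^2, i.e., a constant that grows with both
    amplitude A and (for small w) frequency w."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# A pure tone yields a constant Teager energy profile
n = np.arange(1000)
w = 0.2 * np.pi
psi = teager_energy(np.cos(w * n))
```

Because the operator reacts to modulations of both amplitude and frequency, deviations of psi from a flat profile are what TEO-based stress features summarize.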
Overall, we argue that the analysis of voice, given its relative cost-effectiveness, can be broadly applied not only to RE, but to all software engineering scenarios in which conversations are central (e.g., SCRUM stand-up meetings, information exchange between developers, etc.), to investigate the emotional side of these human-intensive activities, which have a relevant impact on development but are currently ephemeral in terms of data permanence.
Tacit Knowledge It is worth noting that the improved performance obtained with voice features, and the lower cost of the approach, do not rule out biofeedback. We have shown that the best performance is actually obtained with a combination of both types of features. In addition, biofeedback captures involuntary body signals that the speaker cannot fully control, while voice tone can, to a certain extent, be manipulated to deceive (Kim and André, 2006). Biofeedback thus provides a more faithful representation of emotions, and one can compare discrepancies between emotion predictions based on biofeedback and on voice to identify situations in which what the voice appears to tell differs from what the speaker feels. This can happen in requirements elicitation interviews, which can involve controversial political aspects (Milne and Maiden, 2012), or domain experts who need to be interviewed to gather process-related information, but may be reluctant to share their knowledge (Gervasi et al., 2013). Therefore, our research contributes to further scratching the surface of the open problem of tacit knowledge in requirements engineering (Gervasi et al., 2013; Ferrari et al., 2016b; Sutcliffe et al., 2020).
Recently, SE researchers have proposed to identify software usability problems by relying on user emotions derived from facial expression analysis (Johanssen et al., 2019b). In follow-up studies, we plan to investigate the combination of audio and visual signals, which have already been proven to be complementary to each other. Specifically, we envisage an approach in which facial expression analysis is combined with voice-based emotion detection, thus implementing multimodal arousal and valence classification, in line with previous research on affective computing (Sebe et al., 2006; Pantic and Rothkrantz, 2003; Busso et al., 2004; Tzirakis et al., 2017).

Threats to Validity
In this section, we discuss the main limitations of our study and report how we address them.
External validity. The generalizability of our results is limited by the number of subjects (and associated data points) who took part in the study. Although with some imbalance, our sample includes multiple ethnic groups and genders to account for physiological differentiation (Bent et al., 2020). Further replications with a confirmatory design should engage more participants, and consider balance across ethnicity, culture, age, and gender to account for differences in emotional reactions due to these aspects. As for the topic of the interviews, we selected features from a commonly-used social media app for which no particular expertise is needed.
To support generalizability, we share the materials and procedures described in this paper, and encourage researchers to adapt them to other domains (e.g., gaming apps) and populations (e.g., children). Moreover, we make this study reproducible and extensible to new sets of data by sharing the scripts necessary to run our analysis (Ferrari et al., 2021).
Conclusion validity. The validity of our conclusions relies on the robustness of the machine learning models. To mitigate any threat arising from having a small dataset, we ran several algorithms addressing the same classification task. In all runs, we performed hyperparameter tuning as recommended by state-of-the-art research (Tantithamthavorn et al., 2018). Following consolidated guidelines for machine learning, we split our data into train and test subsets. The training is performed using cross-validation, and the final model performance is assessed on a hold-out test set. The entire process is repeated ten times for each algorithm, to account for random variations in the data. Moreover, our classifier configurations included scaling and data balancing techniques. To increase the validity of our study, we report all the results related to such configurations.
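The validation protocol just described (ten repetitions of a train/test split, with hyperparameter tuning performed via cross-validation on the training portion only) can be sketched as follows in scikit-learn. The synthetic data, the SVC classifier, and the parameter grid are illustrative stand-ins, not the study's actual features, algorithms, or grids.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data; the study uses biofeedback/voice features instead.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

scores = []
for run in range(10):  # repeat the whole split-tune-test process ten times
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=run)
    # Hyperparameters are tuned by cross-validation on the training set only
    search = GridSearchCV(
        Pipeline([("scale", StandardScaler()), ("clf", SVC())]),
        param_grid={"clf__C": [0.1, 1, 10]}, cv=5)
    search.fit(X_tr, y_tr)
    # Final performance is assessed on the untouched hold-out test set
    scores.append(search.score(X_te, y_te))
mean_score = sum(scores) / len(scores)
```

Keeping the scaler inside the pipeline ensures that scaling parameters are fitted only on the training folds, avoiding information leakage from the test set into model selection.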
Construct validity. This threat refers to the reliability of the operationalization of the study constructs. Our study may suffer from threats to construct validity in capturing emotions through self-reports. To address this issue, we performed data quality assurance and excluded participants who did not show engagement with the task (e.g., those who always provided the same score, or scores with an overall low standard deviation). We believe that the designed interview script is sufficiently representative of typical user interviews in terms of triggered engagement.
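This kind of data quality check can be expressed as a simple variance filter over the self-report scores. The sketch below is illustrative only: the participant data and the 0.5 standard-deviation threshold are invented for the example, not taken from the study.

```python
import statistics

def keep_engaged(ratings_by_participant, min_std=0.5):
    """Keep only participants whose self-report scores vary enough;
    a flat sequence of identical scores suggests disengagement.
    The min_std threshold here is an illustrative choice."""
    return {pid: scores
            for pid, scores in ratings_by_participant.items()
            if statistics.pstdev(scores) >= min_std}

ratings = {
    "p1": [3, 4, 2, 5, 1],   # varied scores: retained
    "p2": [4, 4, 4, 4, 4],   # always the same score: excluded
    "p3": [3, 3, 3, 3, 2],   # nearly flat: excluded under this threshold
}
kept = keep_engaged(ratings)
```

Such a filter discards self-reports that carry no discriminative signal before they can distort the training labels.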
Internal validity. Threats to internal validity deal with confounding factors that can influence the results of a study. We collected data in a laboratory setting. Factors existing in our setting, such as the presence of the experimenter, can influence the emotional status of the participants (Adair, 1984). Establishing a trust-based rapport with the participants in a relaxed setting is crucial to mitigate these threats. Thus, we invited each participant to wear the wristband when entering the room, before the actual interview started, in order to get acquainted with the device, the setting, and the presence of the experimenter. Furthermore, self-assessment questionnaires were filled in immediately after the interview. This choice was driven by the need to preserve a realistic interview context. However, with this design, the engagement is recalled by the subject and not reported in the moment in which it emerged. Therefore, discrepancies may occur between the feeling of engagement and its rationally-processed memory. Similarly, to maintain a realistic setting, we did not perform pre-interviews to assess the participants' mood (i.e., the presence of a long-lasting emotion) or their personality traits. We acknowledge that an emotionally-charged event, either sad or happy, in the life of a participant before the interview can impact the results.

Conclusion and Future Work
This paper presents the first study on engagement prediction in user interviews. In particular, we show that it is possible to predict the positive or negative engagement of a user during an interview about a product. This can be achieved through the usage of biofeedback measurements acquired through a wristband, the analysis of emotional prosody (Buchanan et al., 2000) through speech processing, and the application of supervised machine learning algorithms. Furthermore, in budget-constrained development contexts, the usage of voice analysis alone can lead to sufficiently good results. The approach can be extended to large-scale scenarios, for example for A/B testing, once low-cost devices become available to acquire the considered measurements, or by resorting to voice features only. Furthermore, the approach is particularly promising for equipping artificial agents with some form of emotional sensitivity, so as to upgrade the relational abilities of voice-based chatbots for gathering product feedback, as well as of automated interviewers for requirements elicitation.
The study is exploratory in nature, and the application of our results requires further investigation, especially concerning the acceptance of the non-intrusive, yet potentially undesired, biofeedback device. Among the future works, we plan to: (a) replicate the experiment with a larger and more representative sample of participants; (b) complement our analysis with the usage of other emotion-revealing signals considered in other studies, such as facial expressions captured through cameras (Soleymani et al., 2016) and electroencephalographic (EEG) activity data (Girardi et al., 2020b; Müller and Fritz, 2015); (c) apply the study protocol to requirements elicitation interviews for novel products to be developed; (d) investigate requirements analysts' emotions in relation to users' emotions during interviews, to explore the emotional dialogue that occurs between the two of them; (e) investigate and compare the emotional footprint of different software-related tasks, for example by looking at differences between the physiological signals of the multiple actors of the development process across different phases, such as development, elicitation, testing, etc. Overall, we believe that the current work, with its promising results, establishes the basis for further research on emotions during the many human-intensive activities of system development.