This study shows that participant-generated narratives describing emotional states were categorized more accurately by computational methods when participants answered one open-ended question with five descriptive words than when they completed four dedicated rating scales. This finding has significant clinical implications, as it indicates that open-ended language-based responses may have higher validity than the rating scales commonly used for mental health assessments. To our knowledge, this is the first time that language-based stimuli of emotions have been better categorized by word responses than by rating scales. Furthermore, the effect size was substantial: the percentage of correct categorizations was 64% for word responses compared to 44% for the rating scales, i.e., a difference of 20 percentage points in the emotional states correctly categorized.
The results of this study are consistent with other studies showing that computational language assessment produces very strong correlations with rating scales of harmony and satisfaction (e.g., r = 0.84), rivaling the theoretical limits set by test-retest reliability and inter-item correlations (Kjell et al., 2021). Evidence from related fields, such as the assessment of facial expressions and cooperative behavior, suggests that language responses may possess greater validity than rating scales. Kjell and colleagues (2009) used validated facial expressions of "happy", "sad", and "contemptuous", where participants described a facial expression either with three descriptive words or with rating scales corresponding to these emotions. That study found an advantage of word responses for the identification of facial expressions, but the advantage (only 4%) was substantially smaller than that found in the current study. Kjell and colleagues (2021) studied cooperative behavior in a one-shot give-some dilemma game (GSDG), where participants completed rating scales (HILS and SWLS) or word-response measures of harmony and satisfaction before playing a GSDG (for details, see Van Lange & Kuhlman, 1994). The results showed that word responses, but not the rating scales, predicted cooperative behavior. We argue that an important reason why word responses showed better categorization than rating scales in the present study is that the outcome variable (i.e., the ground truth of the narratives) was generated independently of the rating scales, whereas earlier studies on QCLA (Kjell et al., 2019) were typically validated by correlations with rating scales.
Additional information was found by examining the confusion matrix, where the number of incorrect predictions in Phase 2 was higher for the rating scales than for the word responses. The Pearson correlation between estimated coefficients was considerably higher for the rating-scale models than for the word-based models, suggesting that word responses discriminate better between emotional states. This is consistent with previous studies showing that word plots discriminate better than rating scales between related concepts (Kjell et al., 2021; Stochl et al., 2022).
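The comparison described above can be illustrated with a minimal sketch: a confusion matrix counts how often each true emotional state is predicted as each state, and overall accuracy is the fraction of counts on the diagonal. The data below are hypothetical and do not reproduce the study's results; the function names are ours, not part of any reported analysis pipeline.

```python
EMOTIONS = ["depression", "anxiety", "satisfaction", "harmony"]

def confusion_matrix(true_labels, predicted_labels):
    """Count how often each true emotion is predicted as each emotion."""
    matrix = {t: {p: 0 for p in EMOTIONS} for t in EMOTIONS}
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Fraction of all predictions lying on the diagonal (correct categorizations)."""
    correct = sum(matrix[e][e] for e in EMOTIONS)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total

# Hypothetical example: four narratives, three categorized correctly.
true = ["depression", "anxiety", "harmony", "satisfaction"]
pred = ["depression", "anxiety", "harmony", "harmony"]
m = confusion_matrix(true, pred)
print(accuracy(m))  # 0.75
```

Off-diagonal cells (here, "satisfaction" predicted as "harmony") show which emotional states a response format confuses, which is the information the in-text comparison of rating scales and word responses draws on.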
A smaller subset of the Phase 2 data was based on participants who currently work in healthcare-related professions. This group was included in the study because they were expected to have more profound knowledge of the definition and assessment of depression, anxiety, harmony, and satisfaction. Nonetheless, this group did not show higher categorization accuracy using the rating scales, and their nominal values of correct categorizations for the rating scales were lower than those for the word responses. Nominally, their data looked similar to those of the larger group of non-professional control participants; however, the number of professional participants was too small (N = 34) to draw any firm conclusions.
In addition to having greater validity compared to rating scales, computational methods for language-based assessment of mental health have several other advantages. First, language is a natural way for people to communicate their mental states. People prefer to communicate their mental health with language rather than rating scales because they find language more precise and elaborate, and they prefer using language when communicating with clinicians, although rating scales are seen as easier and faster (Sikström et al., 2023). Second, open-ended language responses allow for an idiosyncratic description of the participant's mental health, thus providing an opportunity for person-centered health care. This is very different from rating scales, which measure a fixed construct defined by the researcher and to which patients cannot add their person-centered view (Kjell et al., 2019). Third, writing about emotional events or traumas improves participants' mental health and can therefore be viewed as a treatment intervention (Pennebaker, 2011). Engaging in expressive writing, where individuals freely express their thoughts and emotions related to a specific event or trauma, has been found to offer several benefits, as it allows individuals to externalize and confront their emotions, thoughts, and experiences and provides an opportunity for emotional catharsis (Pennebaker & Smyth, 2016). Assessment with rating scales, on the other hand, is not known to influence participants' mental health. Combined with the data presented here, this suggests that QCLA can be seen as a method for simultaneously assessing and treating mental health (Sikström et al., accepted register report). Another advantage of the proposed language measure is that it is short, easy, and quick to administer. As it stands, it can be conducted in a brief conversation by asking individuals to provide five keywords that describe their emotional state (e.g., fine, great, good, restless, excited).
In contrast, administering even a single rating scale within the same timeframe would be challenging, as it necessitates reading several sentences of item statements.
Our study highlights the potential benefits of using semantic measures such as QCLA in clinical and commercial contexts. The use of words instead of numerical scales in diagnosis can provide a more person-centered approach, which can help patients feel more understood and less depersonalized. For example, unstructured clinical notes are rarely made available in structured electronic health records, and a response format where patients are allowed to answer health-related questions in their own words, rather than in a one-dimensional closed format, presents numerous opportunities. The proposed method could facilitate diagnostic accuracy and treatment planning, ultimately improving treatment outcomes (Han et al., 2022; Kjell et al., 2022). As suggested by Kjell and colleagues (2021), unlike traditional rating scale methods, open-ended questions might be less likely to impose socially desirable, acquiescent responses or to suggest likely symptoms (e.g., "Are you having trouble relaxing?" [GAD-7]; American Psychiatric Association, 2013), and they are arguably a more natural form of expression. Emotional support tools available digitally are proliferating, and the academic community has recently observed a rise in social media text-mining studies (Ford et al., 2021; Karafillakis et al., 2021). NLP methods allow for the efficient evaluation of hundreds of predictors simultaneously and suggest economically viable solutions that can anticipate future outcomes, such as suicide actualization, attempts, or ideation (e.g., Allesøe et al., 2023). Social media text mining, where written autobiographical accounts of one's state of mind are the primary means of communication, offers an alternative for preventative screening and detection of mental illness in the population, particularly in the prodromal phase, and for the assessment of risks for different mental health issues as a whole (e.g., Levanti et al., 2023).
Finally, healthcare-related data are well positioned to provide insights into the health of our communities. However, one of the principal shortcomings of the use of NLP in text analysis concerns the privacy and ethical concerns around scanning entire populations for mental health purposes. While NLP algorithms have the potential to analyze vast volumes of text data and detect patterns of mental health issues, doing so without appropriate ethical considerations could be seen as invasive and could lead to negative consequences for individuals and society as a whole. To ensure the ethical use of such algorithms, it is important to have strict guidelines and regulations in place. For instance, any program or initiative using NLP to scan text for mental health purposes should have clear protocols around data collection, sharing, storage, and usage to protect the privacy of individuals. The implementation of text analysis in a clinical context, as proposed by this study, would therefore have the means to ensure respect for privacy and ethical considerations. With the introduction of the General Data Protection Regulation (GDPR) and other health data privacy laws, a balance between privacy and intellectual advancement within clinical settings is already in place. For these reasons, we believe that QCLA has significant potential as a tool to be incorporated into the clinical setting.
Limitations
The terms "anxiety" and "depression" are used differently in a clinical setting than by laypersons. Clinicians use the DSM-5 definitions to assess mental health disorders, whereas participants writing about an event of depression or anxiety may apply a broader understanding of the term that may or may not fit a clinician's assessment. This might have influenced the way the texts were written in this study and might have negatively affected the categorization accuracy.
Another concern is that participants were instructed to report on past emotional episodes, so the emotional experience they had at the time might differ from the way they felt about it when writing the narrative. People may have difficulty accurately recalling and describing their emotional experiences, leading to errors or inaccuracies in the data. In Phase 1, an independent set of participants generated narratives of self-experienced events relating to one of the four emotions; these narratives were later read by participants in Phase 2, who described them in five keywords and rated them using rating scales commonly used for measuring the corresponding emotions. Thus, the success or failure of the categorization of the Phase 2 data depended on how participants interpreted the Phase 1 data, and participants in Phase 1 may have understood or interpreted these emotions differently from how the rating scales are generally constructed. Although we find this possibility less likely, because depression, anxiety, satisfaction, and harmony are commonly used concepts, relying solely on self-report measures of emotions may not provide a complete or accurate picture of emotional experiences.
The payment rate of individual participants, especially in the group of professionals, could have affected the accuracy of descriptions and the outcomes of the study. The perception of inadequate payment might have dissuaded participants from exerting the effort required to provide precise descriptions, or attracted individuals with less commitment to the study, potentially leading to distorted results.