This study shows that participant-generated narratives describing emotional states were categorized more accurately by computational methods when participants answered one open-ended question with five descriptive words than when they completed four dedicated rating scales. This finding has significant clinical implications, as it indicates that open-ended language-based responses may have higher validity than the rating scales commonly used for mental health assessments. To our knowledge, this is the first time that language-based stimuli of emotions have been better categorized by word responses than by rating scales. Furthermore, the effect size was substantial: the percentage of correct categorizations was 64% for word responses compared to 44% for the rating scales, i.e., a difference of 20 percentage points in the emotional states correctly categorized.
The results of this study are consistent with other studies showing that computational language assessment produces very strong correlations with rating scales of harmony and satisfaction (e.g., r = 0.84), rivaling the theoretical limits set by test-retest reliability and inter-item correlations (Kjell et al., 2021). Evidence from related fields, such as the assessment of facial expressions and cooperative behavior, suggests that language responses may possess greater validity than rating scales. Kjell and colleagues (2009) used validated facial expressions of "happy", "sad", and "contemptuous", where participants described a facial expression either with three descriptive words or with rating scales corresponding to these emotions. That study found an advantage of word responses for the identification of facial expressions, but the advantage (only 4%) was substantially smaller than that found in the current study. Kjell and colleagues (2021) studied cooperative behavior in a one-shot give-some dilemma game (GSDG), where participants completed rating scales (HILS and SWLS) or word-response measures of harmony and satisfaction before playing a GSDG (for details, see Van Lange & Kuhlman, 1994). The results showed that word responses, but not the rating scales, predicted cooperative behavior. We argue that an important reason why word responses showed better categorization than rating scales in the present study is that the outcome variable (i.e., the ground truth of the narratives) was generated independently of the rating scales, whereas earlier studies on QCLA (Kjell et al., 2019) were typically validated by correlations with rating scales.
Additional information was found by examining the confusion matrix, where the number of incorrect predictions in Phase 2 was higher for the rating scales than for the word responses. The Pearson correlation between estimated coefficients was considerably higher for the rating-scale models than for the word-based models, suggesting that word responses discriminate better between emotional states. This is consistent with previous studies showing that word plots discriminate better than rating scales between related concepts (Kjell et al., 2021; Stochl et al., 2022).
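The comparison described above can be illustrated with a minimal sketch: a confusion matrix counts how often each true emotional state is predicted as each state, and overall accuracy is the fraction of counts on the diagonal. The data below are hypothetical and do not reproduce the study's results; the function names are ours, not part of any reported analysis pipeline.

```python
EMOTIONS = ["depression", "anxiety", "satisfaction", "harmony"]

def confusion_matrix(true_labels, predicted_labels):
    """Count how often each true emotion is predicted as each emotion."""
    matrix = {t: {p: 0 for p in EMOTIONS} for t in EMOTIONS}
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Fraction of all predictions lying on the diagonal (correct categorizations)."""
    correct = sum(matrix[e][e] for e in EMOTIONS)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total

# Hypothetical example: four narratives, three categorized correctly.
true = ["depression", "anxiety", "harmony", "satisfaction"]
pred = ["depression", "anxiety", "harmony", "harmony"]
m = confusion_matrix(true, pred)
print(accuracy(m))  # 0.75
```

Off-diagonal cells (here, "satisfaction" predicted as "harmony") show which emotional states a response format confuses, which is the information the in-text comparison of rating scales and word responses draws on.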
A smaller subset of the Phase 2 data was based on participants who currently work in healthcare-related professions. This group was included in the study because they were expected to have more profound knowledge of the definition and assessment of depression, anxiety, harmony, and satisfaction. Nonetheless, this group did not show higher categorization accuracy using the rating scales, and their nominal values of correct categorizations for the rating scales were lower than those for the word responses. Nominally, their data looked similar to those of the larger group of non-professional control participants; however, the number of professional participants was too small (N = 34) to draw any firm conclusions.
In addition to having greater validity compared to rating scales, computational methods for language-based assessment of mental health have several other advantages. First, language is a natural way for people to communicate their mental states. People prefer to communicate their mental health with language rather than rating scales because they find language more precise and elaborate, and they prefer using language when communicating with clinicians, although rating scales are seen as easier and faster (Sikström et al., 2023). Second, open-ended language responses allow for an idiosyncratic description of the participant's mental health, thus providing an opportunity for person-centered health care. This is very different from rating scales, which measure a fixed construct defined by the researcher and to which patients cannot add their person-centered view (Kjell et al., 2019). Third, writing about emotional events or traumas improves participants' mental health and can therefore be viewed as a treatment intervention (Pennebaker, 2011). Engaging in expressive writing, where individuals freely express their thoughts and emotions related to a specific event or trauma, has been found to offer several benefits, as it allows individuals to externalize and confront their emotions, thoughts, and experiences and provides an opportunity for emotional catharsis (Pennebaker & Smyth, 2016). Assessment with rating scales, on the other hand, is not known to influence participants' mental health. Combined with the data presented here, this suggests that QCLA can be seen as a method for simultaneously assessing and treating mental health (Sikström et al., accepted register report). Another advantage of the proposed language measure is that it is short, easy, and quick to administer. As it stands, it can be conducted in a brief conversation by asking individuals to provide five keywords that describe their emotional state (e.g., fine, great, good, restless, excited).
In contrast, administering even a single rating scale within the same timeframe would be challenging, as it necessitates reading several sentences of item statements.
Our study highlights the potential benefits of using semantic measures such as QCLA in clinical and commercial contexts. The use of words instead of numerical scales in diagnosis can provide a more person-centered approach, which can help patients feel more understood and less depersonalized. For example, unstructured clinical notes are rarely made available in structured electronic health records, and a response format where patients are allowed to answer health-related questions in their own words, rather than in a one-dimensional closed format, presents numerous opportunities. The proposed method could facilitate diagnostic accuracy and treatment planning, ultimately improving treatment outcomes (Han et al., 2022; Kjell et al., 2022). As suggested by Kjell and colleagues (2021), unlike traditional rating scale methods, open-ended questions might be less likely to impose socially desirable, acquiescent responses or to suggest likely symptoms (e.g., "Are you having trouble relaxing?" [GAD-7]; American Psychiatric Association, 2013), and they are arguably a more natural form of expression. Emotional support tools available digitally are proliferating, and the academic community has recently observed a rise in social media text-mining studies (Ford et al., 2021; Karafillakis et al., 2021). NLP methods allow for the efficient evaluation of hundreds of predictors simultaneously and suggest economically viable solutions that can anticipate future outcomes, such as suicide actualization, attempts, or ideation (e.g., Allesøe et al., 2023). Social media text mining, where written autobiographical accounts of one's state of mind are the primary means of communication, offers an alternative for preventative screening and detection of mental illness in the population, particularly in the prodromal phase, and for the assessment of risks for different mental health issues as a whole (e.g., Levanti et al., 2023).
Finally, healthcare-related data are well positioned to provide insights into the health of our communities. However, one of the principal shortcomings of the use of NLP in text analysis concerns the privacy and ethical concerns around scanning entire populations for mental health purposes. While NLP algorithms have the potential to analyze vast volumes of text data and detect patterns of mental health issues, doing so without appropriate ethical considerations could be seen as invasive and could lead to negative consequences for individuals and society as a whole. To ensure the ethical use of such algorithms, it is important to have strict guidelines and regulations in place. For instance, any program or initiative using NLP to scan text for mental health purposes should have clear protocols around data collection, sharing, storage, and usage to protect the privacy of individuals. The implementation of text analysis in a clinical context, as proposed by this study, would therefore have the means to ensure respect for privacy and ethical considerations. With the introduction of the General Data Protection Regulation (GDPR) and other health data privacy laws, a balance between privacy and intellectual advancement within clinical settings is already in place. For these reasons, we believe that QCLA has significant potential as a tool to be incorporated into the clinical setting.
Limitations
The terms "anxiety" and "depression" are used differently in a clinical setting than by laypersons. Clinicians use the DSM-5 definitions to assess mental health disorders, whereas participants writing about an event of depression or anxiety may apply a broader understanding of the term that may or may not fit a clinician's assessment. This might have influenced the way the texts were written in this study and might have negatively affected the categorization accuracy.
Another concern is that participants were instructed to report on past emotional episodes, so the emotional experience they had at the time might differ from the way they felt about it when writing the narrative. People may have difficulty accurately recalling and describing their emotional experiences, leading to errors or inaccuracies in the data. In Phase 1, an independent set of participants generated narratives of self-experienced events relating to one of the four emotions; these narratives were later read by participants in Phase 2, who described them in five keywords and rated them using rating scales commonly used for measuring the corresponding emotions. Thus, the success or failure of the categorization of the Phase 2 data depended on how participants interpreted the Phase 1 data, and participants in Phase 1 may have understood or interpreted these emotions differently from how the rating scales are generally constructed. Although we find this possibility less likely, because depression, anxiety, satisfaction, and harmony are commonly used concepts, relying solely on self-report measures of emotions may not provide a complete or accurate picture of emotional experiences.
The payment rate of individual participants, especially in the group of professionals, could have affected the accuracy of descriptions and the outcomes of the study. The perception of inadequate payment might have dissuaded participants from exerting the effort required to provide precise descriptions, or attracted individuals with less commitment to the study, potentially leading to distorted results.