The goal of this study is to compile and assess multiple types of data into a single longitudinal dataset. Clinically relevant psychotherapy sessions with real patients, coupled with anonymized EHR data, provide a unique set of multimodal data. This dataset offers many opportunities to gain a better understanding of mental health and to build assistive products for clinicians. Here we describe our overall plan for analyzing the dataset and discuss some preliminary analyses that have been conducted. Importantly, it should be acknowledged that as the data analysis progresses, novel insights may lead to the generation of new hypotheses.
3.1 Plan for Overall Data Analysis
The initial step is to build representative features from the raw data. By leveraging pre-existing models trained on large collections of data (43), we can obtain high-quality multidimensional representations that enable improved data analysis through clustering (44) or correlation with other specific features. This will provide insights into the meaningfulness of the data and how it relates to traditional mental health measurements.
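As a minimal sketch of this step, the snippet below embeds session-level feature vectors with a stand-in encoder (a simple PCA, not the pre-trained models of (43)) and then clusters the embeddings; all shapes, names, and the cluster count are illustrative assumptions rather than the study's actual pipeline.

```python
# Minimal sketch: embed raw session features and cluster the embeddings.
# PCA stands in for a pre-trained encoder; shapes and k are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

raw_features = np.random.rand(200, 512)          # toy data: one row per session segment
embeddings = PCA(n_components=32).fit_transform(raw_features)

# Cluster the embeddings; cluster assignments can then be correlated with
# traditional mental health measurements (e.g., questionnaire scores).
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embeddings)
```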
A second analysis aims to extract video features correlated with emotional expression. Different feature frameworks allow for the analysis of different properties of emotion. The valence-arousal framework (45) represents emotion on a continuous two-dimensional spectrum, while the Facial Action Coding System (46) uses action units to tightly represent the activation of specific facial muscles. Current approaches (47), (48) unify these representations in the same model, allowing more data to be considered in training and potentially providing a robust representation of the emotional state of both the patient and the therapist. Additionally, although we do not expect the entire body to be visible in the recorded videos, exploring pose estimation may help us understand features such as body position or hand gestures in relation to the expression of the participant (49).
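The following sketch illustrates per-frame valence-arousal extraction from session video; `predict_valence_arousal` is a placeholder for any joint model of the kind cited in (47), (48), and the video path is hypothetical.

```python
# Sketch: detect faces per frame and collect a (valence, arousal) trajectory.
# The predictor is a stand-in, not a named model from the study.
import cv2

def predict_valence_arousal(face_crop):
    """Placeholder for a trained model returning (valence, arousal) in [-1, 1]."""
    return 0.0, 0.0  # stand-in output

cap = cv2.VideoCapture("session_0001.mp4")  # hypothetical session recording
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

trajectory = []  # per-frame (valence, arousal) points
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        trajectory.append(predict_valence_arousal(frame[y:y+h, x:x+w]))
cap.release()
```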
Beyond visualizing how facial expressions and body postures evolve over time and correlate with other data types, we may also be able to better understand the level of empathy and rapport between patient and therapist by analyzing the emotional response of the patient to the therapist and vice versa.
Another class of features will be extracted from natural language: the text messages between patient and clinic, and the dialogue during therapy sessions. As participants also use secure text messages to communicate with providers between sessions, analysis of these data could provide a representation of mental health states between therapy sessions and a means to annotate moments of crisis or acute mental health symptoms.
Current approaches to automatic speech recognition (50) also allow speech to be transcribed to text with high accuracy. The resulting dialogue could then give us a representation of sentiment in each therapy session and allow for topic modeling, in which patterns in conversation within and across sessions, and their correlation with other features, can be explored. For natural language features, we can leverage large pre-trained models to provide a richer sentiment representation (51), (52).
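The sketch below illustrates this transcription-plus-sentiment step with off-the-shelf Hugging Face pipelines; the model choices and the audio file name are illustrative defaults, not the models adopted in the study.

```python
# Sketch: transcribe session audio, then score sentiment on the transcript.
# Model names and the audio path are assumptions for illustration.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
sentiment = pipeline("sentiment-analysis")   # default pre-trained model

transcript = asr("session_0001_audio.wav")["text"]  # hypothetical recording
scores = sentiment(transcript[:512])                # respect the model's context window
```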
Similarly, sound can provide an additional layer of representation of mental health. Although emotional sentiment can be represented solely by the content of the conversation, vocal features add an extra layer of representation of an individual's mental state. Recent work has been able to separate emotional content in speech from speaker identity and lexical content (53). Our work may demonstrate the feasibility of objectively analyzing vocal features of a patient's speech such as rate, rhythm, and volume, which are key components of a Mental Status Exam (MSE) and are typically assessed only through the subjective experience of a clinician.
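As a minimal sketch of objective vocal-feature extraction, the following computes frame-wise loudness and a rough rate/rhythm proxy with librosa; the feature definitions are simplified stand-ins for the MSE components discussed above, and the file path is hypothetical.

```python
# Sketch: simple objective proxies for volume and speech rate from audio.
import librosa
import numpy as np

y, sr = librosa.load("session_0001_audio.wav", sr=16000)  # hypothetical file

rms = librosa.feature.rms(y=y)[0]                 # frame-wise loudness (volume)
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
rate_proxy = len(onsets) / (len(y) / sr)          # onset events per second (rate/rhythm proxy)

print(f"mean volume (RMS): {np.mean(rms):.4f}, onset rate: {rate_proxy:.2f}/s")
```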
EHR data can provide an overview of each patient's medical history. Visit notes taken by physicians after each session can be automatically summarized based on the video session content (54). EHR data could also serve as weak annotations of diagnoses for building models. Prescribed medications provide an additional layer of features that can represent mental state, as physicians prescribe medication to treat specific mental health conditions.
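A minimal sketch of the note-summarization step is shown below, assuming a generic pre-trained summarizer; the model name and the toy transcript are illustrative, not the approach of (54).

```python
# Sketch: draft a visit-note summary from session-derived text.
# Model choice and input are assumptions for illustration.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

transcript = (
    "Patient reported improved sleep this week but ongoing anxiety at work. "
    "Discussed coping strategies and agreed to continue current medication."
)  # toy transcript standing in for session content
draft_note = summarizer(transcript, max_length=40, min_length=10)[0]["summary_text"]
```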
Finally, more traditional clinical measurements such as the Patient Health Questionnaire-9 (PHQ-9) (55) and the Beck Depression Inventory (BDI) (56) are also collected throughout the participant's contact with the clinic in this study. These measurements will therefore provide validated labels for a participant's mental health state and improve the accuracy of ML model training.
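To illustrate how such measurements could serve as supervised labels, the sketch below regresses PHQ-9 totals from session embeddings; the data, shapes, and regressor are assumptions for demonstration only.

```python
# Sketch: use validated questionnaire scores as labels for an ML model.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

embeddings = np.random.rand(120, 32)          # toy session embeddings
phq9 = np.random.randint(0, 28, size=120)     # PHQ-9 totals range 0-27

# Cross-validated fit of a simple regressor from embeddings to PHQ-9 scores.
r2 = cross_val_score(Ridge(alpha=1.0), embeddings, phq9, cv=5, scoring="r2")
```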
3.2 Preliminary Analyses
To demonstrate the feasibility of this research, and how therapeutic sessions can be recorded, analyzed, and quantified in a way that is informative and meaningful for domain experts, we have prepared several visual examples using imagery and natural language ML models in combination with an initial set of clinical data collected in the study.
The interaction between clinicians and patients can be observed via different signals. As examples for this paper, we have chosen to visualize facial expressions and speech transcripts because of the clinical relevance of these signals in psychotherapy. In Fig. 2, various facial expressions are plotted on a valence-arousal grid to illustrate how facial expressions map to valence-arousal scores.
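A visualization of this kind can be produced along the following lines; the plotted points here are synthetic stand-ins for per-frame model predictions, not data from Fig. 2.

```python
# Sketch: scatter per-frame predictions on a valence-arousal 2D grid.
import matplotlib.pyplot as plt
import numpy as np

valence, arousal = np.random.uniform(-1, 1, (2, 300))  # synthetic predictions
plt.scatter(valence, arousal, s=8, alpha=0.5)
plt.axhline(0, color="gray", lw=0.5)   # quadrant boundaries
plt.axvline(0, color="gray", lw=0.5)
plt.xlabel("valence")
plt.ylabel("arousal")
plt.xlim(-1, 1)
plt.ylim(-1, 1)
plt.show()
```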
Figure 3. An example of a psychotherapy session represented by automated topic detection (below the time axis) and the evolution of the valence-arousal distribution of clinician and participant facial expressions (above the time axis). While the clinician's facial expressions remain closer to the center of the valence-arousal 2D grid throughout the session (orange area), the participant's facial expressions (blue area) shift from the positive-valence area toward the third quadrant (lower valence and arousal), corresponding with the identified discussion topics.
Figure 4 utilizes imagery and NLP sentiment model predictions to classify facial expressions and speech transcripts into 8 discrete sentiment classes (e.g., sadness, anger, or fear). At the time of this analysis, the models used to classify sentiment were not built on our own clinical dataset but on large, publicly available datasets (57)-(68). While these models may have limited validity for accurately classifying sentiment, they remain useful for demonstrating the potential of this research as it is developed further. Time-aligned predictions for each frame of the video provide a detailed sentiment analysis of the whole session. By zooming into a specific emotion type and time window, as in Fig. 4, we can identify moments of facial and speech sentiment dissonance, such as when the participant describes a sad story with a happy facial expression. Another example in Fig. 4 showcases moments of emotional correlation between clinician and participant, which may be helpful in the analysis of rapport.
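Detecting such dissonance from time-aligned predictions can be as simple as comparing the two label streams, as in the sketch below; the times, labels, and alignment scheme are synthetic assumptions.

```python
# Sketch: flag intervals where face and speech sentiment labels disagree.
# Labels and times are synthetic; the 8-class label set is an assumption.
face_sentiment = [(0, "happiness"), (1, "happiness"), (2, "neutral")]
speech_sentiment = [(0, "sadness"), (1, "sadness"), (2, "neutral")]

dissonant = [
    (t, face, speech)
    for (t, face), (_, speech) in zip(face_sentiment, speech_sentiment)
    if face != speech
]
# e.g., dissonant -> [(0, 'happiness', 'sadness'), (1, 'happiness', 'sadness')]
```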