The goal of this study is to compile and assess multiple types of data into a single longitudinal dataset. Clinically relevant psychotherapy sessions with real patients, coupled with anonymized EHR data, provide a unique set of multimodal data. This dataset offers many opportunities to gain a better understanding of mental health and to build assistive products for clinicians. Here we describe our overall plan for analyzing the dataset and discuss some preliminary analyses that have been conducted. Importantly, it should be acknowledged that as the data analysis progresses, novel insights may lead to the generation of new hypotheses.
3.1 Plan for Overall Data Analysis
The initial step is to build representative features from the raw data. By leveraging pre-existing models trained on large collections of data (43), we can obtain high-quality multidimensional representations that enable improved data analysis through clustering (44) or correlation with other specific features. This will provide insights into the meaningfulness of the data and how it relates to traditional mental health measurements.
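As a minimal sketch of this step, the snippet below embeds session-level feature vectors with a stand-in encoder (a simple PCA, not the pre-trained models of (43)) and then clusters the embeddings; all shapes, names, and the cluster count are illustrative assumptions rather than the study's actual pipeline.

```python
# Minimal sketch: embed raw session features and cluster the embeddings.
# PCA stands in for a pre-trained encoder; shapes and k are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

raw_features = np.random.rand(200, 512)          # toy data: one row per session segment
embeddings = PCA(n_components=32).fit_transform(raw_features)

# Cluster the embeddings; cluster assignments can then be correlated with
# traditional mental health measurements (e.g., questionnaire scores).
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embeddings)
```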
A second analysis aims to extract video features correlated with emotional expression. Different feature frameworks allow for the analysis of different properties of emotion. The valence-arousal framework (45) represents emotion on a continuous two-dimensional spectrum, while the Facial Action Coding System (46) uses action units to tightly represent the activation of specific facial muscles. Current approaches (47), (48) unify these representations in the same model, allowing more data to be considered in training and potentially providing a robust representation of the emotional state of both the patient and the therapist. Additionally, although we do not expect the entire body to be visible in the recorded videos, exploring pose estimation may help us understand features such as body position or hand gestures in relation to the expression of the participant (49).
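The following sketch illustrates per-frame valence-arousal extraction from session video; `predict_valence_arousal` is a placeholder for any joint model of the kind cited in (47), (48), and the video path is hypothetical.

```python
# Sketch: detect faces per frame and collect a (valence, arousal) trajectory.
# The predictor is a stand-in, not a named model from the study.
import cv2

def predict_valence_arousal(face_crop):
    """Placeholder for a trained model returning (valence, arousal) in [-1, 1]."""
    return 0.0, 0.0  # stand-in output

cap = cv2.VideoCapture("session_0001.mp4")  # hypothetical session recording
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

trajectory = []  # per-frame (valence, arousal) points
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        trajectory.append(predict_valence_arousal(frame[y:y+h, x:x+w]))
cap.release()
```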
Beyond visualizing how facial expressions and body postures evolve over time and correlate with other data types, we may also be able to better understand the level of empathy and rapport between patient and therapist by analyzing the emotional response of the patient to the therapist and vice versa.
Another class of features will be extracted from natural language: the text messages between patient and clinic, and the dialogue during therapy sessions. As participants also use secure text messages to communicate with providers between sessions, analysis of these data could provide a representation of mental health states between therapy sessions and a means to annotate moments of crisis or acute mental health symptoms.
Current approaches to automatic speech recognition (50) also allow speech to be transcribed to text with high accuracy. The resulting dialogue could then give us a representation of sentiment in each therapy session and allow for topic modeling, in which patterns in conversation within and across sessions, and their correlation with other features, can be explored. For natural language features, we can leverage large pre-trained models to provide a richer sentiment representation (51), (52).
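The sketch below illustrates this transcription-plus-sentiment step with off-the-shelf Hugging Face pipelines; the model choices and the audio file name are illustrative defaults, not the models adopted in the study.

```python
# Sketch: transcribe session audio, then score sentiment on the transcript.
# Model names and the audio path are assumptions for illustration.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
sentiment = pipeline("sentiment-analysis")   # default pre-trained model

transcript = asr("session_0001_audio.wav")["text"]  # hypothetical recording
scores = sentiment(transcript[:512])                # respect the model's context window
```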
Similarly, sound can provide an additional layer of representation of mental health. Although emotional sentiment can be represented solely by the content of the conversation, vocal features add an extra layer of representation of an individual's mental state. Recent work has been able to separate emotional content in speech from speaker identity and lexical content (53). Our work may demonstrate the feasibility of objectively analyzing vocal features of a patient's speech such as rate, rhythm, and volume, which are key components of a Mental Status Exam (MSE) and are typically assessed only through the subjective experience of a clinician.
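As a minimal sketch of objective vocal-feature extraction, the following computes frame-wise loudness and a rough rate/rhythm proxy with librosa; the feature definitions are simplified stand-ins for the MSE components discussed above, and the file path is hypothetical.

```python
# Sketch: simple objective proxies for volume and speech rate from audio.
import librosa
import numpy as np

y, sr = librosa.load("session_0001_audio.wav", sr=16000)  # hypothetical file

rms = librosa.feature.rms(y=y)[0]                 # frame-wise loudness (volume)
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
rate_proxy = len(onsets) / (len(y) / sr)          # onset events per second (rate/rhythm proxy)

print(f"mean volume (RMS): {np.mean(rms):.4f}, onset rate: {rate_proxy:.2f}/s")
```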
EHR data can provide an overview of each patient's medical history. Visit notes taken by physicians after each session can be automatically summarized based on the video session content (54). EHR data could also serve as weak annotations of diagnoses for building models. Prescribed medications provide an additional layer of features that can represent mental state, as physicians prescribe medication to treat specific mental health conditions.
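A minimal sketch of the note-summarization step is shown below, assuming a generic pre-trained summarizer; the model name and the toy transcript are illustrative, not the approach of (54).

```python
# Sketch: draft a visit-note summary from session-derived text.
# Model choice and input are assumptions for illustration.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

transcript = (
    "Patient reported improved sleep this week but ongoing anxiety at work. "
    "Discussed coping strategies and agreed to continue current medication."
)  # toy transcript standing in for session content
draft_note = summarizer(transcript, max_length=40, min_length=10)[0]["summary_text"]
```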
Finally, more traditional clinical measurements such as the Patient Health Questionnaire-9 (PHQ-9) (55) and the Beck Depression Inventory (BDI) (56) are also collected throughout the participant's contact with the clinic in this study. These measurements will therefore provide validated labels for a participant's mental health state and improve the accuracy of ML model training.
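To illustrate how such measurements could serve as supervised labels, the sketch below regresses PHQ-9 totals from session embeddings; the data, shapes, and regressor are assumptions for demonstration only.

```python
# Sketch: use validated questionnaire scores as labels for an ML model.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

embeddings = np.random.rand(120, 32)          # toy session embeddings
phq9 = np.random.randint(0, 28, size=120)     # PHQ-9 totals range 0-27

# Cross-validated fit of a simple regressor from embeddings to PHQ-9 scores.
r2 = cross_val_score(Ridge(alpha=1.0), embeddings, phq9, cv=5, scoring="r2")
```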
3.2 Preliminary Analyses
To demonstrate the feasibility of this research, and how therapeutic sessions can be recorded, analyzed, and quantified in a way that is informative and meaningful for domain experts, we have prepared several visual examples using imagery and natural language ML models in combination with an initial set of clinical data collected in the study.
The interaction between clinicians and patients can be observed via different signals. As examples for this paper, we have chosen to visualize facial expressions and speech transcripts because of the clinical relevance of these signals in psychotherapy. In Fig. 2, various facial expressions are plotted on a valence-arousal grid to illustrate how facial expressions map to valence-arousal scores.
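A visualization of this kind can be produced along the following lines; the plotted points here are synthetic stand-ins for per-frame model predictions, not data from Fig. 2.

```python
# Sketch: scatter per-frame predictions on a valence-arousal 2D grid.
import matplotlib.pyplot as plt
import numpy as np

valence, arousal = np.random.uniform(-1, 1, (2, 300))  # synthetic predictions
plt.scatter(valence, arousal, s=8, alpha=0.5)
plt.axhline(0, color="gray", lw=0.5)   # quadrant boundaries
plt.axvline(0, color="gray", lw=0.5)
plt.xlabel("valence")
plt.ylabel("arousal")
plt.xlim(-1, 1)
plt.ylim(-1, 1)
plt.show()
```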
Figure 3. An example of a psychotherapy session represented by automated topic detection (below the time axis) and the evolution of the valence-arousal distribution of clinician and participant facial expressions (above the time axis). While the clinician's facial expressions remain closer to the center of the valence-arousal 2D grid throughout the session (orange area), the participant's facial expressions (blue area) shift from the positive-valence area toward the third quadrant (lower valence and arousal), corresponding with the identified discussion topics.
Figure 4 utilizes imagery and NLP sentiment model predictions to classify facial expressions and speech transcripts into 8 discrete sentiment classes (e.g., sadness, anger, or fear). At the time of this analysis, the models used to classify sentiment were not built on our own clinical dataset but on large, publicly available datasets (57)-(68). While these models may have limited validity for accurately classifying sentiment, they remain useful for demonstrating the potential of this research as it is developed further. Time-aligned predictions for each frame of the video provide a detailed sentiment analysis of the whole session. By zooming into a specific emotion type and time window, as in Fig. 4, we can identify moments of facial and speech sentiment dissonance, such as when the participant describes a sad story with a happy facial expression. Another example in Fig. 4 showcases moments of emotional correlation between clinician and participant, which may be helpful in the analysis of rapport.
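Detecting such dissonance from time-aligned predictions can be as simple as comparing the two label streams, as in the sketch below; the times, labels, and alignment scheme are synthetic assumptions.

```python
# Sketch: flag intervals where face and speech sentiment labels disagree.
# Labels and times are synthetic; the 8-class label set is an assumption.
face_sentiment = [(0, "happiness"), (1, "happiness"), (2, "neutral")]
speech_sentiment = [(0, "sadness"), (1, "sadness"), (2, "neutral")]

dissonant = [
    (t, face, speech)
    for (t, face), (_, speech) in zip(face_sentiment, speech_sentiment)
    if face != speech
]
# e.g., dissonant -> [(0, 'happiness', 'sadness'), (1, 'happiness', 'sadness')]
```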