The primary aim of our present study is to utilize large language models (LLMs) to systematically analyze and characterize the changes in conscious awareness induced by a variety of psychedelic substances. By leveraging the advanced capabilities of LLMs, the study seeks to design, annotate, and evaluate a comprehensive set of experiential dimensions from free-form psychedelic experience reports. More specifically, we hypothesized that 1) large language models can effectively identify and quantify distinct experiential dimensions in psychedelic reports, and 2) the resulting model-derived question ratings allow discrimination between different psychedelic substances based on their unique profiles of subjective effects, validating the LLM's ability to capture meaningful aspects of subjective psychedelic experiences. This analytical protocol aims to overcome the limitations of traditional questionnaires and to provide a more nuanced and scalable approach for understanding how psychedelics induce changes in conscious awareness.
Our present study aimed to move beyond several limitations of existing questionnaires in psychedelic drug research. In this neuroscience community, researchers measure the subjective effects specific to psychedelic drugs using questionnaires built from questions that were often conceived decades ago, with the intention of being broadly relevant to multidimensional alterations in conscious experience (Bayne & Carter, 2018). Such long-standing questionnaires appear to have been designed by small numbers of scientists, without necessarily seeking broader consensus or pooling design decisions across alternative viewpoints from other scientists and actors in the psychedelics community (Barrett et al., 2015). For example, previously established views about mystical experiences, including implicitly religious views, have influenced the study of psychedelic drug experiences, such as the items comprising the Mystical Experience Questionnaire (MEQ) (Barrett et al., 2015; Maclean et al., 2012).
Still today, only a limited set of established questionnaires is accepted and commonly used in psychedelic research. Aside from the MEQ, this set mostly includes the Hallucinogen Rating Scale (HRS), developed within a mystical framework similar to that of the MEQ, the States of Consciousness Questionnaire, the 5-Dimensional Altered States of Consciousness questionnaire, the Challenging Experience Questionnaire, the Emotional Breakthrough Inventory, and the Psychological Insight Scale (Barrett et al., 2016; Barrett et al., 2015; Dougherty et al., 2023; Hovmand et al., 2024; Maclean et al., 2012; Peill et al., 2022; Roseman et al., 2019; Spriggs et al., 2021). But what if the incumbent psychological questionnaires for psychedelic experiences are missing key experiential features that we have never planned to probe, and that are therefore routinely evading us?
As an unfolding paradigm shift, natural language processing (NLP) has seen the emergence of large language models (LLMs), which have grown ever more powerful since the introduction of transformer architectures (Vaswani et al., 2017). Extensively pretrained generative AI models have reached new performance peaks in tasks such as semantic similarity assessment and question answering. Semantic similarity benchmarks typically use datasets in which pairs of sentences are rated on a scale for semantic equivalence. The popularity of LLMs has also grown with their emergent capabilities, that is, abilities that are categorically out of reach for smaller models (Wei et al., 2022). Thus, LLMs are now well positioned to help unlock insights in neuroscience that were inconceivable with previous generations of NLP tools, enabling advanced text annotation and elaboration in an automated fashion.
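To make the semantic similarity setting concrete, the following minimal sketch scores a sentence pair with off-the-shelf pretrained sentence embeddings. The sentence-transformers library and the model name are illustrative choices for exposition, not tools used in the present study.

```python
# Minimal sketch: scoring semantic equivalence of a sentence pair with
# pretrained sentence embeddings (illustrative library/model choice).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small pretrained encoder

pair = [
    "I felt a deep sense of unity with everything around me.",
    "Everything seemed interconnected, as if I merged with my surroundings.",
]

# Encode both sentences into dense vectors and compare them.
embeddings = model.encode(pair, convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {similarity:.2f}")  # values near 1.0 = near-equivalent
```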
A key differentiating factor of LLM technologies over more classical NLP tools is that the recently emerged modeling architectures lend themselves naturally to self-supervised learning objectives (Lee, 2023). This model-fitting regime uses raw text, mostly from the internet as probably the largest text source available, to train a model on domain-agnostic general objectives, such as inferring randomly masked words or predicting the next upcoming word. This approach obviates the need for expert-determined target labels (normally an outcome variable 'y' in supervised machine learning algorithms), since the to-be-inferred internal world model emerges organically from the text sources themselves. What sets the self-attention mechanism used in transformer architectures apart from previous deep learning tools (Huang et al., 2015) is that the input sequence (such as a text document) is fed in as a whole, rather than word by word. The self-attention mechanism computes semantic correspondences between all words in the input, while word order in the sentence is also taken into account. Not needing an outcome target ('y') means that several orders of magnitude more data can be leveraged as input for LLM training. This scale gives rise to LLMs with semantic world knowledge, which can then provide the basis for queries in downstream tasks. Intuitively, LLMs can provide a judgment of what a wide variety of persons on the internet, in books, and in other sources would generally think and respond in a certain context (Schrimpf et al., 2021).
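The masked-word objective mentioned above can be illustrated in a few lines. The sketch below uses the Hugging Face transformers library and a BERT-style model as an illustrative assumption; it is not the training setup of the present study.

```python
# Minimal sketch of the masked-word objective: the model infers the hidden
# word purely from surrounding context, with no expert-provided label 'y'.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The psychedelic experience dissolved my sense of [MASK]."):
    print(f"{prediction['token_str']:>12}  (score: {prediction['score']:.3f})")
```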
Offering untapped possibilities for neuroscience studies, LLMs have been shown to be extraordinarily useful for text annotation tasks (Bansal & Sharma, 2023). In particular, they work even in zero-shot settings (providing no examples of how to answer), such as annotating the type of legal texts provided to the LLM (Savelka, 2023). LLMs have even been shown to outperform human annotators in detecting and classifying the political affiliations of Twitter users from their messages (Törnberg, 2023). Research has shown that LLMs can generate explanations for texts with labeled annotations, which can then be folded into prompts for labeling texts that lack meta-information (He et al., 2023). Drawing on expansive semantic knowledge, LLMs have also been shown to go beyond basic sentiment analysis, for example by annotating apology components, such as whether an apology has been given or received, its reason, and its intensity (Yu et al., 2023). In our present study, we harnessed several of these text-appraisal capabilities to annotate texts based on a corpus of questions. Researchers have argued that human validation is required for annotation using generative AI (Pangakis et al., 2023); however, validation can be achieved in various forms, as we describe in our present work. Indeed, LLMs can emulate various alternative viewpoints and human perspectives that can be applied to appraise texts, and thus also activities in society as a whole.
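As a concrete picture of zero-shot annotation of the kind described above, the sketch below rates an experience report against a single question via an LLM API. The model name, prompt wording, and 0-10 scale are illustrative assumptions, not the exact protocol of the present study.

```python
# Minimal sketch: zero-shot rating of one report on one question,
# with no worked examples supplied to the model.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

report = "Colors breathed and pulsed, and I lost all track of time."
question = "To what extent did the person experience an altered sense of time?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system",
         "content": "Rate the report on the question from 0 (not at all) "
                    "to 10 (extremely). Answer with a single integer."},
        {"role": "user", "content": f"Question: {question}\nReport: {report}"},
    ],
)
print(response.choices[0].message.content)  # e.g. "9"
```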
Of relevance to the neuroscience community, which operates in the small-to-medium data regime (Bzdok & Yeo, 2017), LLMs also have transfer learning capabilities far beyond those of previous generations of NLP models (Alyafeai et al., 2020; Malte & Ratadiya, 2019). Progressively larger models with more parameters, trained on ever more data, potentially yield ever better internal world models, which in turn improve transfer to downstream tasks. Both the amount of natural text a model can accept as input and the granularity of context it can decipher continue to grow. As the transfer learning capabilities of LLMs increase, fewer examples are needed for fine-tuning, exemplifying data efficiency. With prompt tuning on exceptionally large language models, fewer example observations are needed to elicit a desired outcome from a pre-trained LLM, paralleling the reduced fine-tuning requirements. Building on such capabilities of LLMs, we can investigate their uses for psychedelic experience reports, as psychedelic drugs have received renewed interest for their emerging mental health benefits (Nutt et al., 2020; Vollenweider & Kometer, 2010; Vollenweider & Preller, 2020).
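The data-efficiency point can be illustrated with in-context examples: a handful of demonstrations steer a pretrained LLM toward a desired output format, where a classical supervised model would need a large labeled training set. The sketch below uses few-shot prompting as an illustration (prompt tuning in the strict sense learns soft prompt vectors and is a related but distinct method); the question and ratings are invented for exposition.

```python
# Minimal sketch: two in-context examples are enough to fix the task and
# output format for a pretrained LLM, instead of a large labeled dataset.
FEW_SHOT_PROMPT = """\
Label each report for 'ego dissolution' on a 0-10 scale.

Report: I dissolved completely; there was no 'me' left.
Rating: 10

Report: I felt mildly detached from myself for a moment.
Rating: 3

Report: {report}
Rating:"""

prompt = FEW_SHOT_PROMPT.format(
    report="The boundary between myself and the room faded away entirely."
)
# 'prompt' can now be sent to any pretrained LLM completion endpoint.
print(prompt)
```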
In particular, LLMs are well positioned to overcome biases in the canonical legacy questionnaires endorsed by the psychedelics research community today. Again, default use of these assessment tools assumes that evaluators already know which cognitive and experiential dimensions of psychedelic drug experience are most important. Yet, what if these instruments of inquiry are letting some relevant experience dimensions slip through the cracks? Scientists may be building layers of knowledge on what previous researchers have already investigated, in a tautological loop. For example, rather than relegating some topics to ineffability (Barrett et al., 2015), expanding the range of questions may uncover previously hidden facets of experience. Since the basis for the MEQ came from a few researchers hand-crafting questions in the 1960s while studying non-psychedelic mystical experiences, a priori assumptions, preconceived ideas, and possible inadequacies are passed down from one generation of psychedelics investigators to the next. Even though the question batteries in use have undergone certain reliability and validity checks, this still leaves open many lines of inquiry that may have been missed in past and present research (Barrett et al., 2015). Moreover, the lack of heterogeneity in a narrow set of questionnaire items implies that a wide array of askable questions has not yet been asked; automated LLMs can be useful in establishing new standards on this front (Hovmand et al., 2024). This points to an unrealized opportunity for a more flexible, data-first approach: starting afresh in nominating and rigorously evaluating the candidate cognitive dimensions most relevant or promising for faithfully analyzing psychedelic experiences in a disciplined, holistic manner.
Moreover, across-drug studies have rarely been possible in psychedelic research for a number of reasons, including high financial and logistical costs, as well as the invasiveness and implied harms of multi-drug use. The regulatory approval process for scheduled-drug research is long and challenging: final approval can take months or years, and procurement is not always trivial (How Studies Get Psychedelics). Moreover, some drug effects can last for hours, creating special logistical challenges. Importantly, a holistic re-assessment of the various cognitive dimensions at play during psychedelic drug experiences requires data from a large and diverse series of experimental sessions, a requirement that many traditional study approaches to psychedelics struggle to satisfy. For example, studies using psychedelics may be limited to several dozen participants (Garcia-Romeu et al., 2019; Griffiths et al., 2016; Johnson et al., 2014; Ross et al., 2016), and fMRI samples may comprise fewer than two dozen participants (Carhart-Harris et al., 2016; Carhart-Harris et al., 2017). Conventional randomized clinical trials, the gold standard for demonstrating the clinical relevance of psychedelics as a treatment option, do not naturally scale to many psychedelic drugs either, and present their own logistical and validity challenges in this context (Carhart-Harris et al., 2022). Therefore, we probably cannot rely on the usual kinds of neuroscience studies alone to fully and faithfully chart the mental effects of a variety of psychoactive or psychedelic substances. Hence, to ignite rapid progress, a radically different approach may be in order.
In the present investigation, we expanded on previous research on thousands of psychedelic reports and word usage patterns in free-form text reports (Ballentine et al., 2022; Martial et al., 2019; Zamberlan et al., 2018) to chart the changes in mental processes caused by 30 different psychoactive and psychedelic substances. We aimed to explore and characterize the whole semantic design space relevant to psychedelic experiences, rather than a handful of narrow questionnaire items, as previous research has done. We found patterns in the reports based on LLM-generated questions, yielding finer insight into reported changes in conscious awareness during these drug experiences. This fine-grained insight proved rich enough that the substance behind a held-out test report could be predicted from the question ratings alone, even disregarding the actual free-form experience accounts. We derived almost 200 semantic dimensions of psychedelic drug experience in an organic fashion. In short, our study designed LLM analysis pipelines to establish and interrogate the target cognitive dimensions for each of the 30 psychedelic drugs in a fully bottom-up fashion, articulated appropriate query questions for automated annotation, and performed impartial grading of the psychedelic reports.
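The substance-discrimination check described above can be sketched as a standard classification problem: given a reports-by-questions matrix of LLM-derived ratings, test whether the substance behind a held-out report can be predicted from the ratings alone. The sketch below uses random stand-in data, a logistic regression classifier, and illustrative dimensions; it is an assumption-laden outline, not the exact model or data of the present study.

```python
# Minimal sketch: predicting the substance label of a report from its
# vector of question ratings, with cross-validated accuracy as the check.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_reports, n_questions, n_substances = 600, 200, 30

# Stand-in data: ratings on a 0-10 scale and one substance label per report.
X = rng.integers(0, 11, size=(n_reports, n_questions)).astype(float)
y = rng.integers(0, n_substances, size=n_reports)

# Above-chance accuracy (chance = 1/30 here) would indicate that the
# LLM-derived ratings carry substance-specific information.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f} (chance ~ {1/n_substances:.3f})")
```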