Chronic pain is a major health problem [1], with impacts at the individual, social, and economic levels [2]. Language is a key medium for clinical chronic pain assessment and management [3, 4]: a description of the experience often includes valuable information about the bodily distribution of the pain, its temporal patterns of activity, its intensity, and its emotional and psychological impacts, among others, revealing the multidimensionality of this experience [3]. Additionally, the choice of words may reflect the underlying mechanisms of the causal agent(s) [3], if any, which in turn may be used to redirect therapeutic processes. This linguistic expression has been studied previously, for example in the structuring of the Grammar of Pain [5] and in the study of its lexical profile, which resulted in the McGill Pain Questionnaire (MPQ) [6], an instrument widely used to characterize pain from a verbal standpoint in clinical settings [7, 8]. However, all these studies relied on manual methods, expensive human evaluation, and limited sample sizes (e.g., the MPQ was originally developed with only 297 participants).
Language has been explored with increasingly complex Natural Language Processing (NLP) techniques, owing both to the development of these techniques and to the greater availability of relevant data, usually in the thousands of instances or more. In health-related applications specifically, various works have focused on mental health because of its close relation with language, such as depression diagnosis [9], suicidal ideation detection [10], and the linguistic analysis of multiple and co-occurring mental health conditions [11]. Some works have also computationally explored the language of chronic pain, including: extraction of biomedical entities and relations from disease-specific online forums [12]; importance analysis of latent topics (pre-defined by the authors, as opposed to automatically extracted) in online discussions of Inflammatory Bowel Disease [13]; qualitative analysis of the concerns of women with Rheumatoid Arthritis, based on textual submissions to Reddit sub-forums dedicated to this disease [14]; and topic modelling over Reddit's ChronicPain sub-forum to analyze common semantic structures of online chronic pain reports, which found that back pain is, by far, the most mentioned [15].
Reddit is a social media platform structured in sub-forums (called subreddits), each focused on a given topic and self-moderated. Subreddits differ in their specific rules, topic(s) of discussion, and quality of moderation. In public subreddits, any user can participate, provided they follow the subreddit's rules. Additionally, Reddit is implicitly anonymous, i.e., users can choose not to disclose their identity without limiting their use of the platform. Reddit's data are publicly available through the Reddit API, accessible via wrappers such as the Python Reddit API Wrapper (PRAW) and the Python Pushshift.io API Wrapper (PSAW). Because of these characteristics, a growing number of research studies are based on Reddit data, including health-related applications [16]. Given user anonymity, Reddit's demographics cannot be easily described. According to the platform administrators, as of 2021, Reddit was among the five most visited sites in the United States, with over 52 million daily users and more than 100k active subreddits. Regarding age groups, 58% of its users are reported to be between 18–34 years old, 27% between 35–44 years old, and 19% are 45 years old or older. Moreover, 57% identify as male and 43% as female. Regarding users' physical location, as of May 2022, some websites report that 47.13% of Reddit.com internet traffic comes from the United States of America (USA); we could not find official sources reporting this information. Although internet traffic is not conclusive regarding the geographical distribution of users, it points towards a distributional hotspot.
In this work, we present the Reddit Reports of Chronic Pain (RRCP) dataset, which comprises textual descriptions and discussions of chronic pain experiences posted on Reddit, spanning multiple base pathologies (represented by subreddits explicitly focused on those pathologies) that are known to be commonly accompanied by chronic pain. We used the RRCP to model the language of chronic pain, as used in that corpus. We started by discovering the latent topics of the whole corpus and explicitly describing the corpus in the space they define, which we call the semantic space. Then, observing only the textual entries of a single subreddit in this semantic space, we approximated their distribution in that space and identified regions of high density. These regions, enriched by their latent semantics, allowed us to identify the core concerns, or qualities, of what it is like to experience chronic pain, as reported in that subreddit. The set of concerns of a given subreddit, which we call its semantic span, defines the model of that subreddit. Using graph theory, we compared the semantic spans of all subreddits, which allowed us to determine the similarities and differences between distinct experiences of chronic pain, as represented by their respective subreddits. We further explored which concerns were shared by all reported experiences of chronic pain, and which were exclusive to specific reported experiences. With this, we show that our findings are useful for gaining insights into what it is like to experience chronic pain, as reported in each subreddit in the RRCP.
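As a rough illustration of the comparison step described above, the following Python sketch treats each semantic span as a plain set of concern labels and uses Jaccard overlap as the edge weight of a similarity graph between subreddits. The subreddit names, the concern labels, and the choice of Jaccard similarity are assumptions made for illustration only; the graph measure is not specified in the text above, and the topic-discovery and density-estimation steps are omitted.

```python
from itertools import combinations

# Hypothetical semantic spans: subreddit name -> set of concern labels.
# Names and labels are illustrative placeholders, not results from the RRCP.
spans = {
    "subreddit_a": {"medication", "sleep", "diagnosis"},
    "subreddit_b": {"medication", "sleep", "flare-ups"},
    "subreddit_c": {"medication", "surgery"},
}


def jaccard(a, b):
    """Overlap between two sets of concerns: 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_graph(spans):
    """Weighted undirected graph with one edge per pair of subreddits."""
    return {
        (s, t): jaccard(spans[s], spans[t])
        for s, t in combinations(sorted(spans), 2)
    }


def shared_concerns(spans):
    """Concerns present in the semantic span of every subreddit."""
    return set.intersection(*spans.values()) if spans else set()


def exclusive_concerns(spans):
    """Concerns that appear in exactly one subreddit's semantic span."""
    result = {}
    for name, concerns in spans.items():
        # Union of the concerns of all other subreddits.
        others = set().union(*(c for n, c in spans.items() if n != name))
        result[name] = concerns - others
    return result


graph = similarity_graph(spans)
shared = shared_concerns(spans)
exclusive = exclusive_concerns(spans)
```

With these placeholder spans, `graph` holds one weight per pair of subreddits, `shared` holds the concerns common to all of them, and `exclusive` maps each subreddit to the concerns found nowhere else. Any other set-overlap or distance measure could be substituted for `jaccard` without changing the overall structure.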
To the best of our knowledge, this is the first research work to model the linguistic expression of various chronic pain-inducing pathologies (as found on Reddit) and to compare these models in order to identify and quantify the similarities and differences between the reported chronic pain experiences.