3.1 Data collection
The texts of the corpus were recorded at the Department of Psychiatry, Faculty of Medicine, University of Szeged, and by the Prevention of Mental Illnesses Interdisciplinary Research Group (University of Szeged, Hungary), led by István Szendi. Data collection was approved by the Ethics Committee of the University of Szeged and was conducted in accordance with the Declaration of Helsinki. Written informed consent was obtained from all participants involved in the research project. We have official written permission to use the recordings in our research. The medical diagnosis of each person was provided along with the speech samples. All speakers were native speakers of Hungarian.
The database contains spontaneous speech recordings of people with schizophrenia-bipolar spectrum disorders and of healthy controls. In spontaneous speech, in contrast to planned speech, speakers have no time to prepare what they say, so their speech may more faithfully reflect their individual language characteristics, e.g. their difficulties in word finding (Vincze et al., 2021). Here we work with directed spontaneous speech, which is at the same time a memory task. A total of 90 subjects participated in the study: 31 with SZ, 16 with SAD, 16 with BD, and 27 healthy controls.
The first task (henceforth Narr1) consisted of three parts. The interviewer first asked the subjects to talk about themselves and then asked them to describe their mother and their father in a few words. Respondents who did not want to talk about a parent could choose another person close to them as the subject of the monologue. In the first part of the second task (henceforth Narr2), the respondents were asked to recall the last years of their studies or the first years of their employment. The interviewer then asked them to describe the same period in the life of someone close to them. Lastly, in the third task (henceforth TegnN), the subjects were asked to talk about their previous day. The final recorded speech corpus consists of 526 monologues in total from the 90 subjects. Demographic data are presented in Table 1.
Table 1
Demographic data (i.e. age and education in years) of the four subject groups (M = mean; SD = standard deviation).
| | SZ | SAD | BD | Control | All |
| Participants | 31 | 16 | 16 | 27 | 90 |
| Texts | 183 | 91 | 94 | 158 | 526 |
| Age; M (SD) | 38.00 (9.78) | 40.09 (9.77) | 49.42 (8.49) | 36.28 (10.03) | 39.89 (10.55) |
| Education; M (SD) | 14.26 (2.86) | 14.91 (2.96) | 16.34 (4.05) | 14.67 (2.99) | 14.87 (3.18) |
| Sex ratio; f:m | 10:21 | 10:6 | 9:7 | 13:14 | 42:48 |
3.2 Data processing
After data collection, the recordings were transcribed manually. No specialized transcription software was needed at this stage, since we did not aim to produce transcripts suitable for phonological analysis: for our planned research, a simple written form of the texts was sufficient, without any annotation of phonetic or phonological properties. We did, however, annotate stops and restarts, false starts, and silent and filled pauses. The transcripts were saved as plain text files in UTF-8 encoding. A total of 526 separate transcripts were produced, of which 183 belonged to the SZ, 91 to the SAD, 94 to the BD and 158 to the control group.
Since the texts were manually transcribed, no data cleaning was needed before the automatic processing. Thus, as a first step, we performed an automatic linguistic analysis with “magyarlanc”, a linguistic preprocessing toolkit for Hungarian (Zsibrita et al., 2013). With this tool, the texts were first split into sentences and then tokenized; finally, the tokens were lemmatized and assigned part-of-speech and morphological tags. Lemmatization is especially important for morphologically rich languages such as Hungarian.
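As an illustration of how such an analysis can be consumed in later steps, the following is a minimal Python sketch that reads tokenized, lemmatized, and tagged output into sentences of tokens. The four-column, tab-separated layout with a blank line between sentences is an assumption made for this example, not a documented property of magyarlanc; the `Token` type, the `read_analysis` function, and the sample tags are hypothetical.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # surface word form as transcribed
    lemma: str  # dictionary form
    pos: str    # part-of-speech tag
    feats: str  # morphological features


def read_analysis(text: str) -> list[list[Token]]:
    """Parse analyzer output into a list of sentences.

    Assumed (hypothetical) layout: one token per line with four
    tab-separated columns, and a blank line between sentences.
    """
    sentences: list[list[Token]] = []
    current: list[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        form, lemma, pos, feats = line.split("\t")
        current.append(Token(form, lemma, pos, feats))
    if current:
        sentences.append(current)
    return sentences


# Invented sample, not real tool output: "Láttam valami borzasztót. Igen."
SAMPLE = (
    "Láttam\tlát\tVERB\t_\n"
    "valami\tvalami\tPRON\t_\n"
    "borzasztót\tborzasztó\tADJ\tCase=Acc\n"
    ".\t.\tPUNCT\t_\n"
    "\n"
    "Igen\tigen\tADV\t_\n"
    ".\t.\tPUNCT\t_\n"
)

sentences = read_analysis(SAMPLE)
lemmas = [tok.lemma for sent in sentences for tok in sent if tok.pos != "PUNCT"]
print(lemmas)
```

Keeping both the word form and the lemma for each token means that later lookups can be made against either one.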
After this preprocessing, dictionary-based analyses were carried out on the corpus using different types of lexicons, such as dictionaries of sentiment words and intensifiers. The decision to focus on these features was motivated by the fact that emotion regulation dysfunction is characteristic of psychotic disorders (Kring et al., 2013; Chapman, 2020). For instance, previous research on emotions shows that people with schizophrenia have difficulty in sensing and predicting emotional events and in integrating emotional impressions and contexts, as well as with the richness and maintenance of emotional experiences (Kring et al., 2013). The underlying brain activity shows a deficit in the functioning of the networks responsible for cognitive control, indicating insufficient integration of emotion and cognition (Kring et al., 2013). Moreover, an abnormally elevated mood in BD is associated with specific neurocognitive deficits consistent with neuropathology in neural networks that are critical for emotion regulation (Green et al., 2007). Hence, we argue that a sentiment analysis of this corpus may produce relevant results concerning the mental disorders in question. As for linguistic intensification, research findings suggest that the use of intensifiers is closely related to emotion regulation (Athanasiadou, 2007; Strous et al., 2009). Moreover, according to Athanasiadou (2007), intensifiers are linguistic markers of speaker subjectivity whose primary function is to signal the speaker’s point of view and attitude. Because of the links between mental illness and emotion regulation, it is worth focusing, within the group of intensifiers, on the so-called negative emotive intensifiers (henceforth NEIs), whose prior semantic content is related to a negative emotion, but which can function as intensifiers.
Notably, there is evidence in the literature that intensifier use differs in, for example, schizophrenia (Strous et al., 2009).
For the dictionary-based analysis, we used a sentiment dictionary that combines two previously published Hungarian sentiment lexicons (Szabó, 2015, and a Hungarian translation of Liu, 2012). To identify linguistic intensifiers, we used a basic (non-emotive) intensifier dictionary of 125 words (Szabó et al., 2023) and a 225-item dictionary of NEIs (Szabó & Guba, 2022). The sentiment analysis was performed on the lemmas of the words, which allowed us to find all suffixed forms of the dictionary lemmas in the corpus (e.g. the word forms borzasztó ‘awful’ and borzasztót ‘awful-ACC’ of the lemma borzasztó). The intensifier lexicon, in contrast, was applied to the word forms rather than the lemmas: linguistic intensification is also a subject of our research, and we aim to analyze intensifiers with different morphological properties separately (e.g. both fantastic and fantastically occur), so the corpus properties of each word form were important to us (Szabó et al., 2022b).
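The difference between lemma-based and form-based lookup can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the two toy lexicons, the `match_lexicons` function, and the (word form, lemma) pairs, including the lemma assigned to each form, are hypothetical. Plain set membership is also a simplification, since it cannot tell whether a matched word is actually used as an intensifier in context.

```python
from collections import Counter

# Toy stand-ins for the cited dictionaries (hypothetical contents).
SENTIMENT_LEMMAS = {"borzasztó", "szép"}       # matched against lemmas
INTENSIFIER_FORMS = {"borzasztóan", "nagyon"}  # matched against word forms


def match_lexicons(tokens):
    """Count lexicon hits in an iterable of (word_form, lemma) pairs.

    Sentiment words are looked up by lemma, so every suffixed form of a
    dictionary lemma is found. Intensifiers are looked up by word form,
    so morphologically different forms are counted separately.
    """
    sentiment = Counter()
    intensifier = Counter()
    for form, lemma in tokens:
        if lemma.lower() in SENTIMENT_LEMMAS:
            sentiment[lemma.lower()] += 1
        if form.lower() in INTENSIFIER_FORMS:
            intensifier[form.lower()] += 1
    return sentiment, intensifier


# Invented (word_form, lemma) pairs for illustration only.
tokens = [
    ("Nagyon", "nagyon"),
    ("szép", "szép"),
    ("volt", "van"),
    ("borzasztót", "borzasztó"),
    ("borzasztóan", "borzasztóan"),
]

sentiment, intensifier = match_lexicons(tokens)
print(dict(sentiment))    # hits keyed by lemma
print(dict(intensifier))  # hits keyed by word form
```

In this sketch the accusative form borzasztót is found through its lemma in the sentiment lexicon, whereas the intensifier counts stay separate for each surface form.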
Basic statistical data of the processed corpus are presented in Table 2 below.
Table 2
Basic data of the processed corpus (AllPat = all patient groups combined; Tokens_noPunct = tokens excluding punctuation).
| | SZ | SAD | BD | AllPat | Control | All |
| Texts | 183 | 91 | 94 | 368 | 158 | 526 |
| Tokens | 42,335 | 30,350 | 49,820 | 122,505 | 75,542 | 198,047 |
| Tokens_noPunct | 34,670 | 23,685 | 39,752 | 98,107 | 60,279 | 158,386 |
| Sentiment word | 3,242 | 1,939 | 3,733 | 8,914 | 5,080 | 13,994 |
| Intensifier | 767 | 533 | 1,107 | 2,407 | 1,646 | 4,053 |
| NEI | 20 | 19 | 49 | 88 | 48 | 136 |
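Because the four groups differ considerably in size, the raw counts in Table 2 are easier to compare once they are related to corpus size. The following Python sketch recomputes the column totals and derives occurrences per 1,000 tokens. It is offered as a reading aid only: the table reports raw counts, and the choice of Tokens_noPunct as the denominator, as well as the `per_thousand` and `total` helpers, are assumptions of this example rather than part of the reported analysis.

```python
GROUPS = ("SZ", "SAD", "BD", "Control")

# Raw counts copied from Table 2.
TOKENS_NO_PUNCT = {"SZ": 34670, "SAD": 23685, "BD": 39752, "Control": 60279}
COUNTS = {
    "Sentiment word": {"SZ": 3242, "SAD": 1939, "BD": 3733, "Control": 5080},
    "Intensifier": {"SZ": 767, "SAD": 533, "BD": 1107, "Control": 1646},
    "NEI": {"SZ": 20, "SAD": 19, "BD": 49, "Control": 48},
}


def total(row):
    """Sum a table row over the four subject groups."""
    return sum(row[group] for group in GROUPS)


def per_thousand(count, tokens):
    """Occurrences per 1,000 tokens (assumed normalization)."""
    return 1000 * count / tokens


# Relative frequency of each feature in each group.
rates = {
    feature: {
        group: per_thousand(row[group], TOKENS_NO_PUNCT[group])
        for group in GROUPS
    }
    for feature, row in COUNTS.items()
}

for feature, by_group in rates.items():
    rounded = {group: round(rate, 2) for group, rate in by_group.items()}
    print(f"{feature}: total={total(COUNTS[feature])} per_1000={rounded}")
```

Using the Tokens row instead of Tokens_noPunct would lower every rate, so the denominator should be stated whenever such figures are compared.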