This section outlines the procedures for data collection, representation, and processing used in the HUJICorpus project. First, the recording of data and metadata and the anonymization process will be presented (§ 4.1). Next, the workflow and work platform will be described, including the ‘transcription chain’ and the technological tools used for data storage and management (§ 4.2). Finally, the transcription conventions will be outlined, including adaptations and expansions of the GAT2 system (§ 4.3).
4.1 Data collection
The primary data for the HUJICorpus come from telephone conversations recorded during 2020–2021 by students at the Hebrew University of Jerusalem talking to their relatives and friends. The students, who participated in classes on spoken discourse and conversation, were encouraged to record long conversations of at least 10 minutes, to let the speakers adjust to the recording situation and thereby loosen their control of their speech. Indeed, references to the recording and displays of awareness were mostly observed at conversation boundaries, in great part at the beginning of the call. When engaged in topical talk, participants did not seem particularly mindful of the recording, either in terms of the subjects of talk or in terms of their speech style.
Participants who were willing to contribute their recorded conversations signed a consent form granting permission to include the recording in a database open to the public for the study of spoken Hebrew. Besides the consent, participants were also asked to fill in a metadata form with details about the conversational situation (its time, place, and main topics) and about the participants, in particular their gender, age, native language, place of residence, and relationship to their co-conversationalist.
In order to protect the privacy of the participants, all identifying details were anonymized in the transcripts and in the audio files. Names of persons and locations were replaced by pseudonyms in the transcripts and masked in the audio files with a low-pass filter.
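The audio-masking step can be illustrated with a minimal sketch: a low-pass filter applied only to the span containing the name, so that the segment loses its intelligibility (carried by higher frequencies) while keeping its rough amplitude envelope. The function name, the one-pole filter design, and the cutoff value are our own illustrative choices, not the project's actual tooling.

```python
import math

def lowpass_blur(samples, sr, start_s, end_s, cutoff_hz=400.0):
    """Mask a named entity by low-pass filtering samples[start:end] with a
    one-pole IIR filter; audio outside the span is left untouched."""
    # Pole placement for a one-pole low-pass at the given cutoff.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sr)
    out = list(samples)
    i, j = int(start_s * sr), int(end_s * sr)
    y = 0.0
    for k in range(i, j):
        y += alpha * (out[k] - y)   # y[n] = y[n-1] + alpha * (x[n] - y[n-1])
        out[k] = y
    return out
```

In practice a steeper (higher-order) filter, or a lower cutoff, would be chosen so that formant information above the cutoff is removed more aggressively.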
4.2 Data processing
The building of the HUJICorpus was carried out in several phases. The preparatory phase involved an assessment of the audio quality of all the contributed recordings. Recordings of good quality were selected for further work and assigned an ID number (e.g., HCSH002). Each recording was then assigned to an initial transcriber who, after listening to the entire recording, selected a coherent conversational segment of 5 to 10 minutes for transcription. This segment was labeled according to its most prominent topic (e.g., HCSH002 is labeled ‘the new apartment’).
The transcription process of each conversation comprised four steps carried out by three different transcribers. First, an initial transcript of the selected conversational segment was produced by transcriber A. This transcript was then handed for review to transcriber B. Next, the commented version was returned to transcriber A, who revised the initial transcript accordingly. Finally, the revised version was handed to a senior transcriber, C, for a final inspection and emendation. This multistep process was designed to increase the reliability of the analysis proposed by each transcript and to ensure the consistency of the method applied throughout the corpus.
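The four-step chain can be modeled as a simple task state machine; the sketch below is a hypothetical illustration (the stage names are our own labels, not project terminology):

```python
# Ordered stages of the transcription chain: A drafts, B reviews,
# A revises, C inspects and emends.
STAGES = ["draft_A", "review_B", "revision_A", "final_C", "done"]

class TranscriptionTask:
    """One recording's position in the transcription pipeline."""

    def __init__(self, recording_id):
        self.recording_id = recording_id  # e.g. "HCSH002"
        self._i = 0

    @property
    def stage(self):
        return STAGES[self._i]

    def advance(self):
        """Move the task to the next stage; 'done' is absorbing."""
        if self._i < len(STAGES) - 1:
            self._i += 1
        return self.stage
```

A Kanban board like the one described below is essentially a visual rendering of this state machine, one card per recording ID.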
The monitoring and coordination of the transcription process were carried out in Notion (https://www.notion.so/product), a workspace software designed for data organization and project management. To track the progress of the shared transcription tasks, a customized Kanban board was designed, in which the recordings’ identifiers could be moved along the different stages of the transcription pipeline, with each task attributed to one team member (see Figure 1).
In addition, as work progressed, a growing number of files of different types (.docx, .pdf, .wav, .mp3) had to be stored and organized while allowing team members easy access to the files’ version history. This is another need that Notion met well, as it allows the creation of versatile databases that can accommodate multiple versions of files of all kinds (see Figure 2).
Finally, since Notion has an API (Application Programming Interface) that allows information to be retrieved from the platform for use in other applications, it could also serve as the database behind the HUJICorpus website. This is advantageous both because it keeps the corpus website light and fast and because it ensures that any additions, corrections, or other developments documented in the project’s workspace are immediately and automatically reflected on the corpus’s public website.
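As a sketch of how such retrieval might look, the snippet below builds (without sending) a query against Notion's public database-query endpoint. The database ID and token are placeholders, and the version string is one published version of the Notion API; the actual integration used by the website is not documented here.

```python
import json
import urllib.request

NOTION_VERSION = "2022-06-28"  # one published version of the Notion API

def build_query_request(database_id, token, page_size=100):
    """Build (without sending) the POST request that pages through a
    Notion database, as a website backend might do."""
    url = f"https://api.notion.com/v1/databases/{database_id}/query"
    headers = {
        "Authorization": f"Bearer {token}",
        "Notion-Version": NOTION_VERSION,
        "Content-Type": "application/json",
    }
    body = json.dumps({"page_size": page_size}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Sending the request (e.g., with `urllib.request.urlopen`) returns a JSON page of database entries that a static site generator or server can render.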
While each task in the transcription chain was accomplished individually, the project team met on a regular basis to discuss dilemmas and questions that emerged during this process. In some cases, small surveys were conducted in order to make an informed decision about a new feature or method of annotation. For instance, the introduction of the label ‘mimicking’ followed several group discussions of a collection of candidate cases and the subsequent delimitation of the category to only certain forms of prosodic delivery that conveyed animation and stereotyping, rather than, for instance, quoting or emphasis (see also § 4.3.4).
4.3 Transcription
As mentioned above, the HUJICorpus uses the GAT2 transcription system. This system was designed to accurately capture the temporal and sequential organization of talk-in-interaction and to enable a rich representation of prosodic phenomena (Selting et al. 2009). GAT2 also provides means for the representation of non-verbal sounds and details of the surrounding environment, as well as the inclusion of interpretive comments. In addition, the system is expandable and adaptable to new languages (Couper-Kuhlen and Barth-Weingarten 2011). The following sub-sections outline and illustrate the main phenomena that are captured in the HUJICorpus transcripts and describe the expansions of GAT2 and its adaptation to Hebrew. The full transcription conventions can be found on the HUJICorpus website (https://huji-corpus.com/method.html).
4.3.1 Temporal and Sequential Organization
The transcripts run in numbered lines that represent the prosodic phrasing of the speakers’ contributions as they unfold in time. Each line consists of an Intonation Unit (IU, Chafe 1994),1 i.e., a short stretch of talk produced under a single unified intonation contour (Du Bois et al. 1992) and delimited by phonetic and prosodic cues. Incomplete words are marked with a special symbol at the exact place of truncation. Silent pauses are measured to the nearest hundredth of a second. Inbreaths and outbreaths are measured and classified into duration groups of 0.2–0.5 seconds, 0.5–0.8 seconds, and 0.8–1.0 seconds. Laughter is represented within text lines by the designated symbol @ (one @ for each burst of laughter). Overlapping speech is indicated by aligned square brackets in two consecutive lines, delimiting precisely the parts that were produced simultaneously. Finally, a quicker-than-expected transition between two consecutive IUs is marked by the latching symbol =.
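The duration groups for inbreaths and outbreaths can be expressed as a small classifier. The half-open treatment of the bin boundaries (e.g., a 0.5-second breath falling into the second group) is our assumption, since the conventions list the bins without specifying boundary handling:

```python
def breath_category(duration_s):
    """Map a measured (in/out)breath duration in seconds to the
    duration group used in the transcripts."""
    # Bins as listed in the conventions; boundaries treated as half-open.
    for lo, hi in [(0.2, 0.5), (0.5, 0.8), (0.8, 1.0)]:
        if lo <= duration_s < hi:
            return f"{lo}–{hi} s"
    return None  # shorter than 0.2 s or longer than 1.0 s: not categorized
```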
4.3.2 Representation of Prosodic Phenomena
Prosodic representation in GAT2 can be either local, i.e., depict phenomena which take scope over one syllable, or global, i.e., depict phenomena which take scope over at least one word and up to several consecutive IUs. The prosodic labels apply to parameters of pitch, amplitude, rhythm and voice quality.
The local pitch features that are represented in the transcripts include: final pitch movement of each IU analyzed as one of five form-based categories (rise-to-high, rise-to-mid, level, fall-to-mid, and fall-to-low); sudden pitch jumps, either up or down; and mid-IU noticeable pitch movements (rise, fall, rise-fall or fall-rise). Globally, pitch can be marked as either high or low with respect to the speaker’s usual pitch register.
Speech can be represented globally as either loud or soft with respect to the surrounding context by the forte/piano labels. A gradual increase or decrease in amplitude can be represented as well, by the crescendo/decrescendo labels. In addition, the marcato label was introduced to annotate a sequence of words that are produced with a prosodic accent on each word. Locally, a syllable can be marked as carrying a marked prosodic accent, with a distinction between two levels of strength. Finally, when the standard orthography and the context are not enough to disambiguate several possible readings of a word, the syllable carrying the lexical stress is marked.
Locally, prosodic lengthening of a given speech sound (either vowel or consonant) is marked by duration categories of 0.2–0.5 seconds, 0.5–0.8 seconds, 0.8–1.0 seconds, etc. Globally, speech can be marked as fast or slow with respect to the surrounding context with the allegro/lento labels. In addition, gradual increases and decreases in speech rate are marked with the accelerando/rallentando labels.
GAT2 provides means to annotate an open-ended list of voice quality modulations. Voice quality annotation is always global, i.e., taking scope over at least one word. The most frequently used voice quality labels in the HUJICorpus are ‘creaky voice’, ‘smiling/laughing voice’, and ‘whisper’. Importantly, GAT2 allows the annotation of more than one global label category, making it possible to mark a given stretch of talk as having, for example, high global pitch and laughing voice quality. In the interest of readability, we decided to limit this multiple global annotation to two labels at most.
4.3.3 Non-Verbal Sounds, Background Noise, and Pronunciation Issues
GAT2 enables the insertion of transcribers’ comments in the body of the text, enclosed in double parentheses. In the HUJICorpus, comments within text lines were used to specify non-verbal sounds such as coughs, gasps and clicks; comments in separate, un-numbered lines were used to specify background noises, with an indication of their duration when relevant, e.g.: ((siren sound in the next 6.4 seconds)). In addition, pronunciations deviating from the expected form were specified in a comment appended to the relevant part of the text. Transcribers added such a specification in order to facilitate the readability of the transcript in cases of homograph-induced ambiguity, when several pronunciations are possible (for instance, of a borrowed word), or when speakers used markedly high register or an idiosyncratic pronunciation.
To maximize the readability of the transcript, wording in GAT2 is notated according to the standard orthography of the given language, rather than using an ‘eye-dialect’ method (cf. Jefferson 2004) or a narrow phonetic transcription. The HUJICorpus follows this preference, and thus cases of phonetic reduction (e.g., cliticizations), which are not captured by standard orthography, are in some cases followed by double-parenthesis comments in which the actual pronounced form is specified. Transcribers were free to judge when to add these comments based on considerations of salience and relevance. However, for highly frequent and nearly grammaticized phonetic reductions, such as the contraction of the accusative marker et with the definite article ha, viz. ta, transcribers were instructed to comment on all cases where reduced pronunciation was observed.
4.3.4 GAT2 Adaptation to Hebrew and Expansion
As indicated above, the HUJICorpus uses the conventions of standard Hebrew orthography to notate the wording of Hebrew talk. While most GAT2 conventions could be smoothly applied to the Hebrew orthography, a few adaptations had to be made. First, Hebrew is written from right to left rather than from left to right. Second, since Hebrew orthography does not have an uppercase/lowercase distinction, the use of capital letters to mark prosodic accents was replaced by underlining. Third, since Hebrew orthographic representation of vowels is limited, and to avoid an excessive use of within-text commentary (see § 4.3.3), a simplified usage of the Hebrew system of diacritical signs (niqqud) was introduced in order to indicate the pronunciation of otherwise vocalically ambiguous forms such as hesitation markers, recipiency tokens, and truncated clitic elements. This simplified system of diacritics is intended to differentiate between the three vowels that are used in the above-mentioned cases: /a/, /e/, and /i/.
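The three-vowel system might be sketched as a simple mapping; the concrete choice of signs (patach for /a/, segol for /e/, hiriq for /i/) is our assumption of the natural correspondence and is not specified above:

```python
# Combining niqqud codepoints from the Unicode Hebrew block:
# patach U+05B7 (/a/), segol U+05B6 (/e/), hiriq U+05B4 (/i/).
NIQQUD = {"a": "\u05B7", "e": "\u05B6", "i": "\u05B4"}

def vocalize(letter, vowel):
    """Attach one of the three simplified niqqud signs to a Hebrew letter,
    disambiguating, e.g., a hesitation marker's vowel."""
    return letter + NIQQUD[vowel]
```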
In addition to the GAT2 conventions, several new annotations were introduced in the HUJICorpus to enrich the representation of prosodic and interactional phenomena. One such convention is the annotation of code-switching, which includes a specification of the source language and an indication of whether the speaker’s accent was adapted to it. For example, the English word ‘ring’, delivered with an adapted American English accent, is notated in the running text by the Hebrew form רינג followed by the comment ((ENadap. “ring”)). Another convention concerns cases in which speakers perform a global modulation of their voice in order to echo or mimic a specific character or a prototype. In these cases, the detailing of specific formal features, such as global pitch or voice quality, does not suffice to capture the effect achieved by this modulation, and therefore the interpretive label ‘mimicking’ was introduced to indicate the referencing quality of the utterance (which as a rule takes scope over at least one word).
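The code-switching comment can be sketched as a small formatter; the ((ENadap. …)) form follows the example above, while the suffix used for a non-adapted accent is a hypothetical placeholder of ours:

```python
def code_switch_comment(lang_code, word, adapted):
    """Render the within-text comment for a code-switched word.
    The 'adap.' suffix matches the convention's example; 'non-adap.'
    is an assumed counterpart for unadapted accents."""
    suffix = "adap." if adapted else "non-adap."
    return f'(({lang_code}{suffix} "{word}"))'
```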
4.3.5 An illustration
To make the above description more concrete and to demonstrate the rich information provided by the transcriptions, two excerpts from the HUJICorpus are presented. The excerpts are adapted for non-Hebrew readers and include: (a) a representation of the Hebrew wording in the Latin alphabet; (b) glossing according to the Leipzig Glossing Rules (Comrie et al. 2008); and (c) a translation into English.
Excerpt (1) comes from a late-night conversation between two friends, Ira and Ido (recorded around 03:00 AM). The two participants are tired after a long day of studies and not entirely focused on the conversation (Ira reported in her metadata form that she was also engaged in texting while talking over the phone). The transcript presents many displays of the participants’ low degree of involvement: first, the abundance of long silent pauses (lines 01–02, 04, 06, 09); second, Ira’s noticeable yawn, which begins with a long inbreath in line 02 and continues through line 05 until the long release of air at the end of line 06; third, the use of soft volume and creaky voice in lines 10–11 (delimited by the marking <<p, creaky> text>); and finally, the extreme lengthening of the hesitation marker mm in line 14, followed by the slow production of line 15 (marked by lento).
Excerpt (2), in contrast, presents a passionate complaint sequence, in which Gaya critically describes the hysterical reaction of her roommate to the Covid-19 pandemic. The transcript reflects Gaya’s high degree of emotive involvement in several ways. First, the dynamicity of her pitch is manifest in frequent sudden pitch jumps (marked by the arrow symbols ↑ and ↓ in lines 01, 06–08), noticeable mid-IU rise-fall pitch movements (marked with the circumflex accent symbol ˆ in lines 03 and 11), and the delivery of lines 05–06 with high global pitch (delimited by the marking <<h> text>). In addition, the relatively large number of words packed into each IU (each line in the transcript), together with the rarity of pauses and the quick transitions between IUs (marked by the latching sign = in lines 10–11), suggests that Gaya’s speech was produced with great fluency.
As illustrated in (1) and (2), the annotation of vocal and paralinguistic elements provides information that goes beyond the ‘text’ of the conversation and captures the state and stance of the speakers as made hearable to their co-conversationalists. This information is inherent to spoken interaction, and it has a crucial role in contextualizing its contents and advancing particular forms of its interpretation (Auer 1992). Such detailed transcripts thus afford a more solid ground for the analysis of spoken language in both the formal and social areas of linguistic research.
[1] These prosodic chunks are referred to by many names in the literature, such as (intermediate) Intonational Phrase (Beckman & Pierrehumbert 1986), Intonation Group (Cruttenden 1997), or Intonation Unit (Chafe 1994). The latter is the most widely used term (see Barth-Weingarten 2016: 3–4).