This section outlines the procedures for data collection, representation, and processing used in the HUJICorpus project. First, the recording of data and metadata and the anonymization process will be presented (§ 4.1). Next, the workflow and work platform will be described, including the ‘transcription chain’ and the technological tools used for data storage and management (§ 4.2). Finally, the transcription conventions will be outlined, including adaptations and expansions of the GAT2 system (§ 4.3).
4.1 Data collection
The primary data for the HUJICorpus come from telephone conversations recorded during 2020–2021 by students at the Hebrew University of Jerusalem talking to their relatives and friends. The students, who participated in classes on spoken discourse and conversation, were encouraged to record long conversations of at least 10 minutes, to let the speakers adjust to the recording situation and thereby loosen their control of their speech. Indeed, references to the recording and displays of awareness were mostly observed at conversation boundaries, in great part at the beginning of the call. When engaged in topical talk, participants did not seem particularly mindful of the recording, either in terms of the subjects of talk or in terms of their speech style.
Participants who were willing to contribute their recorded conversations signed a consent form granting permission to include the recording in a database open to the public for the study of spoken Hebrew. Besides the consent, participants were also asked to fill in a metadata form with details about the conversational situation (its time, place, and main topics) and about the participants, in particular their gender, age, native language, place of residence, and relationship to their co-conversationalist.
In order to protect the privacy of the participants, all identifying details were anonymized in the transcripts and in the audio files. Names of persons and locations were replaced by pseudonyms in the transcripts and masked in the audio files with a low-pass filter.
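The audio-masking step can be illustrated with a minimal sketch: a low-pass filter applied only to the span containing the name, so that the segment loses its intelligibility (carried by higher frequencies) while keeping its rough amplitude envelope. The function name, the one-pole filter design, and the cutoff value are our own illustrative choices, not the project's actual tooling.

```python
import math

def lowpass_blur(samples, sr, start_s, end_s, cutoff_hz=400.0):
    """Mask a named entity by low-pass filtering samples[start:end] with a
    one-pole IIR filter; audio outside the span is left untouched."""
    # Pole placement for a one-pole low-pass at the given cutoff.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sr)
    out = list(samples)
    i, j = int(start_s * sr), int(end_s * sr)
    y = 0.0
    for k in range(i, j):
        y += alpha * (out[k] - y)   # y[n] = y[n-1] + alpha * (x[n] - y[n-1])
        out[k] = y
    return out
```

In practice a steeper (higher-order) filter, or a lower cutoff, would be chosen so that formant information above the cutoff is removed more aggressively.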
4.2 Data processing
The building of the HUJICorpus was carried out in several phases. The preparatory phase involved an assessment of the audio quality of all the contributed recordings. Recordings of good quality were selected for further work and assigned an ID number (e.g., HCSH002). Each recording was then assigned to an initial transcriber who, after listening to the entire recording, selected a coherent conversational segment of 5 to 10 minutes for transcription. This segment was labeled according to its most prominent topic (e.g., HCSH002 is labeled ‘the new apartment’).
The transcription process of each conversation comprised four steps carried out by three different transcribers. First, an initial transcript of the selected conversational segment was produced by transcriber A. This transcript was then handed for review to transcriber B. Next, the commented version was returned to transcriber A, who revised the initial transcript accordingly. Finally, the revised version was handed to a senior transcriber, C, for a final inspection and emendation. This multistep process was designed to increase the reliability of the analysis proposed by each transcript and to ensure the consistency of the method applied throughout the corpus.
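The four-step chain can be modeled as a simple task state machine; the sketch below is a hypothetical illustration (the stage names are our own labels, not project terminology):

```python
# Ordered stages of the transcription chain: A drafts, B reviews,
# A revises, C inspects and emends.
STAGES = ["draft_A", "review_B", "revision_A", "final_C", "done"]

class TranscriptionTask:
    """One recording's position in the transcription pipeline."""

    def __init__(self, recording_id):
        self.recording_id = recording_id  # e.g. "HCSH002"
        self._i = 0

    @property
    def stage(self):
        return STAGES[self._i]

    def advance(self):
        """Move the task to the next stage; 'done' is absorbing."""
        if self._i < len(STAGES) - 1:
            self._i += 1
        return self.stage
```

A Kanban board like the one described below is essentially a visual rendering of this state machine, one card per recording ID.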
The monitoring and coordination of the transcription process were carried out in Notion (https://www.notion.so/product), a workspace software designed for data organization and project management. To track the progress of the shared transcription tasks, a customized Kanban board was designed, in which the recordings’ identifiers could be moved along the different stages of the transcription pipeline, with each task attributed to one team member (see Figure 1).
In addition, as work progressed, a growing number of files of different types (.docx, .pdf, .wav, .mp3) had to be stored and organized while allowing team members easy access to the files’ version history. This is another need that Notion met well, as it allows the creation of versatile databases that can accommodate multiple versions of files of all kinds (see Figure 2).
Finally, since Notion has an API (Application Programming Interface) that allows information to be retrieved from the platform for use in other applications, it could also serve as the database behind the HUJICorpus website. This is advantageous both because it keeps the corpus website light and fast and because it ensures that any additions, corrections, or other developments documented in the project’s workspace are immediately and automatically reflected on the corpus’s public website.
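As a sketch of how such retrieval might look, the snippet below builds (without sending) a query against Notion's public database-query endpoint. The database ID and token are placeholders, and the version string is one published version of the Notion API; the actual integration used by the website is not documented here.

```python
import json
import urllib.request

NOTION_VERSION = "2022-06-28"  # one published version of the Notion API

def build_query_request(database_id, token, page_size=100):
    """Build (without sending) the POST request that pages through a
    Notion database, as a website backend might do."""
    url = f"https://api.notion.com/v1/databases/{database_id}/query"
    headers = {
        "Authorization": f"Bearer {token}",
        "Notion-Version": NOTION_VERSION,
        "Content-Type": "application/json",
    }
    body = json.dumps({"page_size": page_size}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Sending the request (e.g., with `urllib.request.urlopen`) returns a JSON page of database entries that a static site generator or server can render.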
While each task in the transcription chain was accomplished individually, the project team met on a regular basis to discuss dilemmas and questions that emerged during this process. In some cases, small surveys were conducted in order to make an informed decision about a new feature or method of annotation. For instance, the introduction of the label ‘mimicking’ followed several group discussions of a collection of candidate cases and the subsequent delimitation of the category to only certain forms of prosodic delivery that conveyed animation and stereotyping, rather than, for instance, quoting or emphasis (see also § 4.3.4).
4.3 Transcription
As mentioned above, the HUJICorpus uses the GAT2 transcription system. This system was designed to accurately capture the temporal and sequential organization of talk-in-interaction and to enable a rich representation of prosodic phenomena (Selting et al. 2009). GAT2 also provides means for the representation of non-verbal sounds and details of the surrounding environment, as well as the inclusion of interpretive comments. In addition, the system is expandable and adaptable to new languages (Couper-Kuhlen and Barth-Weingarten 2011). The following sub-sections outline and illustrate the main phenomena that are captured in the HUJICorpus transcripts and describe the expansions of GAT2 and its adaptation to Hebrew. The full transcription conventions can be found on the HUJICorpus website (https://huji-corpus.com/method.html).
4.3.1 Temporal and Sequential Organization
The transcripts run in numbered lines that represent the prosodic phrasing of the speakers’ contributions as they unfold in time. Each line consists of an Intonation Unit (IU, Chafe 1994),1 i.e., a short stretch of talk produced under a single unified intonation contour (Du Bois et al. 1992) and delimited by phonetic and prosodic cues. Incomplete words are marked with a special symbol at the exact place of truncation. Silent pauses are measured to the nearest hundredth of a second. Inbreaths and outbreaths are measured and classified into duration groups of 0.2–0.5 seconds, 0.5–0.8 seconds, and 0.8–1.0 seconds. Laughter is represented within text lines by the designated symbol @ (one @ for each burst of laughter). Overlapping speech is indicated by aligned square brackets in two consecutive lines, delimiting precisely the parts that were produced simultaneously. Finally, a quicker-than-expected transition between two consecutive IUs is marked by the latching symbol =.
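The duration groups for inbreaths and outbreaths can be expressed as a small classifier. The half-open treatment of the bin boundaries (e.g., a 0.5-second breath falling into the second group) is our assumption, since the conventions list the bins without specifying boundary handling:

```python
def breath_category(duration_s):
    """Map a measured (in/out)breath duration in seconds to the
    duration group used in the transcripts."""
    # Bins as listed in the conventions; boundaries treated as half-open.
    for lo, hi in [(0.2, 0.5), (0.5, 0.8), (0.8, 1.0)]:
        if lo <= duration_s < hi:
            return f"{lo}–{hi} s"
    return None  # shorter than 0.2 s or longer than 1.0 s: not categorized
```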
4.3.2 Representation of Prosodic Phenomena
Prosodic representation in GAT2 can be either local, i.e., depict phenomena which take scope over one syllable, or global, i.e., depict phenomena which take scope over at least one word and up to several consecutive IUs. The prosodic labels apply to parameters of pitch, amplitude, rhythm and voice quality.
The local pitch features that are represented in the transcripts include: final pitch movement of each IU analyzed as one of five form-based categories (rise-to-high, rise-to-mid, level, fall-to-mid, and fall-to-low); sudden pitch jumps, either up or down; and mid-IU noticeable pitch movements (rise, fall, rise-fall or fall-rise). Globally, pitch can be marked as either high or low with respect to the speaker’s usual pitch register.
Speech can be represented globally as either loud or soft with respect to the surrounding context by the forte/piano labels. A gradual increase or decrease in amplitude can be represented as well, by the crescendo/decrescendo labels. In addition, the marcato label was introduced to annotate a sequence of words that are produced with a prosodic accent on each word. Locally, a syllable can be marked as carrying a marked prosodic accent, with a distinction between two levels of strength. Finally, when the standard orthography and the context are not enough to disambiguate several possible readings of a word, the syllable carrying the lexical stress is marked.
Locally, prosodic lengthening of a given speech sound (either vowel or consonant) is marked by duration categories of 0.2–0.5 seconds, 0.5–0.8 seconds, 0.8–1.0 seconds, etc. Globally, speech can be marked as fast or slow with respect to the surrounding context with the allegro/lento labels. In addition, gradual increases and decreases in speech rate are marked with the accelerando/rallentando labels.
GAT2 provides means to annotate an open-ended list of voice quality modulations. Voice quality annotation is always global, i.e., taking scope over at least one word. The most frequently used voice quality labels in the HUJICorpus are ‘creaky voice’, ‘smiling/laughing voice’, and ‘whisper’. Importantly, GAT2 allows the annotation of more than one global label category, making it possible to mark a given stretch of talk as having, for example, high global pitch and laughing voice quality. In the interest of readability, we decided to limit this multiple global annotation to two labels at most.
4.3.3 Non-Verbal Sounds, Background Noise, and Pronunciation Issues
GAT2 enables the insertion of transcribers’ comments in the body of the text, enclosed in double parentheses. In the HUJICorpus, comments within text lines were used to specify non-verbal sounds such as coughs, gasps and clicks; comments in separate, un-numbered lines were used to specify background noises, with an indication of their duration when relevant, e.g.: ((siren sound in the next 6.4 seconds)). In addition, pronunciations deviating from the expected form were specified in a comment appended to the relevant part of the text. Transcribers added such a specification in order to facilitate the readability of the transcript in cases of homograph-induced ambiguity, when several pronunciations are possible (for instance, of a borrowed word), or when speakers used markedly high register or an idiosyncratic pronunciation.
To maximize the readability of the transcript, wording in GAT2 is notated according to the standard orthography of the given language, rather than using an ‘eye-dialect’ method (cf. Jefferson 2004) or a narrow phonetic transcription. The HUJICorpus follows this preference, and thus cases of phonetic reduction (e.g., cliticizations), which are not captured by standard orthography, are in some cases followed by double-parenthesis comments in which the actual pronounced form is specified. Transcribers were free to judge when to add these comments based on considerations of salience and relevance. However, for highly frequent and nearly grammaticized phonetic reductions, such as the contraction of the accusative marker et with the definite article ha, viz. ta, transcribers were instructed to comment on all cases where reduced pronunciation was observed.
4.3.4 GAT2 Adaptation to Hebrew and Expansion
As indicated above, the HUJICorpus uses the conventions of standard Hebrew orthography to notate the wording of Hebrew talk. While most GAT2 conventions could be smoothly applied to the Hebrew orthography, a few adaptations had to be made. First, Hebrew is written from right to left rather than from left to right. Second, since Hebrew orthography does not have an uppercase/lowercase distinction, the use of capital letters to mark prosodic accents was replaced by underlining. Third, since Hebrew orthographic representation of vowels is limited, and to avoid an excessive use of within-text commentary (see § 4.3.3), a simplified usage of the Hebrew system of diacritical signs (niqqud) was introduced in order to indicate the pronunciation of otherwise vocalically ambiguous forms such as hesitation markers, recipiency tokens, and truncated clitic elements. This simplified system of diacritics is intended to differentiate between the three vowels that are used in the above-mentioned cases: /a/, /e/, and /i/.
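The three-vowel system might be sketched as a simple mapping; the concrete choice of signs (patach for /a/, segol for /e/, hiriq for /i/) is our assumption of the natural correspondence and is not specified above:

```python
# Combining niqqud codepoints from the Unicode Hebrew block:
# patach U+05B7 (/a/), segol U+05B6 (/e/), hiriq U+05B4 (/i/).
NIQQUD = {"a": "\u05B7", "e": "\u05B6", "i": "\u05B4"}

def vocalize(letter, vowel):
    """Attach one of the three simplified niqqud signs to a Hebrew letter,
    disambiguating, e.g., a hesitation marker's vowel."""
    return letter + NIQQUD[vowel]
```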
In addition to the GAT2 conventions, several new annotations were introduced in the HUJICorpus to enrich the representation of prosodic and interactional phenomena. One such convention is the annotation of code-switching, which includes a specification of the source language and an indication of whether the speaker’s accent was adapted to it. For example, the English word ‘ring’, delivered with an adapted American English accent, is notated in the running text by the Hebrew form רינג followed by the comment ((ENadap. “ring”)). Another convention concerns cases in which speakers perform a global modulation of their voice in order to echo or mimic a specific character or a prototype. In these cases, the detailing of specific formal features, such as global pitch or voice quality, does not suffice to capture the effect achieved by this modulation, and therefore the interpretive label ‘mimicking’ was introduced to indicate the referencing quality of the utterance (which as a rule takes scope over at least one word).
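The code-switching comment can be sketched as a small formatter; the ((ENadap. …)) form follows the example above, while the suffix used for a non-adapted accent is a hypothetical placeholder of ours:

```python
def code_switch_comment(lang_code, word, adapted):
    """Render the within-text comment for a code-switched word.
    The 'adap.' suffix matches the convention's example; 'non-adap.'
    is an assumed counterpart for unadapted accents."""
    suffix = "adap." if adapted else "non-adap."
    return f'(({lang_code}{suffix} "{word}"))'
```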
4.3.5 An illustration
To make the above description more concrete and to demonstrate the rich information provided by the transcriptions, two excerpts from the HUJICorpus are presented. The excerpts are adapted for non-Hebrew readers and include: (a) a representation of the Hebrew wording in the Latin alphabet; (b) glossing according to the Leipzig Glossing Rules (Comrie et al. 2008); and (c) a translation into English.
Excerpt (1) comes from a late-night conversation between two friends, Ira and Ido (recorded around 03:00 AM). The two participants are tired after a long day of studies and not entirely focused on the conversation (Ira reported in her metadata form that she was also engaged in texting while talking over the phone). The transcript presents many displays of the participants’ low degree of involvement: first, the abundance of long silent pauses (lines 01–02, 04, 06, 09); second, Ira’s noticeable yawn, which begins with a long inbreath in line 02 and continues through line 05 until the long release of air at the end of line 06; third, the use of soft volume and creaky voice in lines 10–11 (delimited by the marking <<p, creaky> text>); and finally, the extreme lengthening of the hesitation marker mm in line 14, followed by the slow production of line 15 (marked by lento).
Excerpt (2), in contrast, presents a passionate complaint sequence, in which Gaya critically describes the hysterical reaction of her roommate to the Covid-19 pandemic. The transcript reflects Gaya’s high degree of emotive involvement in several ways. First, the dynamicity of her pitch is manifest in frequent sudden pitch jumps (marked by the arrow symbols ↑ and ↓ in lines 01, 06–08), noticeable mid-IU rise-fall pitch movements (marked with the circumflex accent symbol ˆ in lines 03 and 11), and the delivery of lines 05–06 with high global pitch (delimited by the marking <<h> text>). In addition, the relatively large number of words packed into each IU (each line in the transcript), together with the rarity of pauses and the quick transitions between IUs (marked by the latching sign = in lines 10–11), suggests that Gaya’s speech was produced with great fluency.
As illustrated in (1) and (2), the annotation of vocal and paralinguistic elements provides information that goes beyond the ‘text’ of the conversation and captures the state and stance of the speakers as made hearable to their co-conversationalists. This information is inherent to spoken interaction, and it has a crucial role in contextualizing its contents and advancing particular forms of its interpretation (Auer 1992). Such detailed transcripts thus afford a more solid ground for the analysis of spoken language in both the formal and social areas of linguistic research.
[1] These prosodic chunks are referred to by many names in the literature, such as (intermediate) Intonational Phrase (Beckman & Pierrehumbert 1986), Intonation Group (Cruttenden 1997), or Intonation Unit (Chafe 1994). The latter is the most widely used term (see Barth-Weingarten 2016: 3–4).