Participants
The aim of Exp 1 was to examine the general pattern of event segmentation behavior in the Hungarian versions of the texts. Because of this, there was no clearly defined effect that Exp 1 aimed to replicate, and so no formal power analysis was conducted. Because the original study detected the differences between no-shift and shift sentences with a group size of N = 20 (Exp 1, control group), we judged that a sample size of N = 30 would be sufficient for investigating patterns of segmentation behavior. Thus, to allow for data loss due to dropouts, the final sample size was somewhat higher, N = 36 (26 women; Mage = 22.86 years, SDage = 4.36, range: 18-40).
In Exp 2, because boundary-related pupil dilations had not been reported previously in the literature, we had no prior expectations about the size of this effect. Thus, we used the statistical software G*Power79 to compute the sample size required to detect a small to medium effect (Cohen’s d = 0.35). The computation was performed for a nonparametric one-sample Wilcoxon signed rank test testing whether the baseline-corrected pupil dilation values were significantly greater than zero (i.e. significant dilation relative to the baseline). The required sample size was N = 63 with a power of 0.85 and an alpha of 0.05. Thus, expecting dropouts, we collected data from N = 71 participants (45 women; Mage = 21.84 years, SDage = 1.84, range: 18-26).
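The sample size computation can be approximated in a few lines. The sketch below is illustrative, not the exact G*Power algorithm: it uses the normal approximation of a one-sided one-sample test and then inflates the sample size by the asymptotic relative efficiency of the Wilcoxon test relative to the t-test (3/π for a normally distributed parent), which is one of the correction methods G*Power offers.

```python
# Hedged sketch of the Exp 2 sample size calculation (normal approximation;
# the exact G*Power computation may yield a slightly different N).
import math
from statistics import NormalDist

d, alpha, power = 0.35, 0.05, 0.85
z_alpha = NormalDist().inv_cdf(1 - alpha)   # one-sided test
z_power = NormalDist().inv_cdf(power)
n_parametric = ((z_alpha + z_power) / d) ** 2
# inflate by the asymptotic relative efficiency of the Wilcoxon test
# (3/pi, assuming a normal parent distribution)
n_wilcoxon = math.ceil(n_parametric / (3 / math.pi))
print(n_wilcoxon)  # close to the reported N = 63
```

With these settings the approximation lands within a participant or two of the reported N = 63; the small discrepancy comes from the normal (rather than t) approximation.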
Finally, the main aim of Exp 3 was to replicate the boundary-linked pupil dilations observed in Exp 2. As a first step of the sample size calculation, we used the data of Exp 2 to estimate the magnitude of the effect for pupil responses after spatial shift sentences and event segmentation hotspots. To this aim, we computed one-sample Wilcoxon signed rank tests testing whether pupil dilation was significantly greater than zero during the peak interval of the two pupil responses (i.e. the time bin with the largest pupil dilation). The magnitude of the effect was medium both for event segmentation hotspots (Cohen’s d = 0.6 for the time bin 4-4.5 sec) and for spatial shift sentences (Cohen’s d = 0.5 for the time bin 6.5-7 sec). The smaller of the two effect sizes was used to calculate that the required sample size for detecting the boundary-linked pupil dilations was N = 32 (power: 0.85, alpha = 0.05). Thus, expecting dropouts, we collected data from N = 39 participants (26 women; Mage = 21.31 years, SDage = 1.83, range: 19-26).
Due to low task performance, missing data, or low quality of pupil data, four participants were excluded from Exp 2 and six participants from Exp 3 (see below); thus, the final sample size was N = 67 (44 women; Mage = 22.00 years, SDage = 1.81, range: 18-26) in Exp 2 and N = 33 (24 women; Mage = 21.21 years, SDage = 1.78, range: 19-26) in Exp 3. The experiments were approved by the United Ethical Review Committee for Research in Psychology (Hungary) and were carried out in accordance with the Code of Ethics of the World Medical Association (Declaration of Helsinki) for experiments involving humans.
Stimuli
We adapted the stimulus set from Bailey and colleagues73. The authors created eight stories of everyday activities (e.g. a mother and a child visiting an aquarium; a family preparing for a camping trip). Each story was designed to contain narrative shifts: in four sentences, the identity of a character changed (character shift sentences), and in four further sentences, the spatial location changed (spatial shift sentences). As a control condition, the authors defined four sentences in each story in which no shift occurred (no shift sentences).
To adapt these stories to Hungarian, the eight stories were first translated into Hungarian by a professional translator. After the translation, one of the authors (PP) reviewed all stories, and the translated versions were finalized after discussing all open questions regarding terminology and sentence formulation. Minor changes in content were required due to cultural differences (e.g. one of the original stories mentioned an American football game, which was changed to a soccer game in the Hungarian version). Furthermore, in the original study, the stories were shown on a computer screen. To avoid artefacts in pupil size measurement due to visual processing, we opted for auditory presentation. The stories were read out by a female narrator, and her voice was recorded. These recordings were then played to the participants, who listened to them through earphones (Exp 2-3) or through the speakers of their own computer (Exp 1).
To test whether the narrative shifts are identified as event boundaries in the Hungarian translations of the texts as well, we conducted a pilot experiment (N = 32), in which participants listened to all eight stories created by Bailey and colleagues73. Participants were instructed (1) to memorize spatial locations and characters for a later memory test, and (2) to indicate with a single key press whenever they identified an event boundary (i.e. event segmentation behavior). We found that location and character shifts were more often associated with event segmentation behavior than the sentences that Bailey and colleagues73 used as control text locations without narrative shifts (see Section 1 in the supplementary material for the description and results of the pilot study). We selected the four texts in which the predefined spatial/character shifts most consistently triggered key presses, and these stories were used in all three experiments (with respect to their topic, the selected stories are referred to as “aquarium”, “hospital”, “castle”, and “camping”).
To test memory for the stories in Exp 1-2, we created seven questions for each story. These questions focused on specific characters and spatial locations (e.g. “Where did they stop the car?”, “What kind of phobia does Julia have?”).
To test memory in Exp 3 with a forced choice recognition task, we created 41 sentence pairs. One of the sentences was always a boundary sentence taken from one of the stories, whereas the other sentence was a slightly modified version of it: one specific word or phrase was always replaced, slightly altering the sentence without substantially changing its meaning (e.g. ‘That day they visited a medieval castle, which was Stephen's idea’ vs. ‘That day they visited a ruined castle, which was Stephen's idea’).
The supplementary material contains the full text of all stories (Section 4) as well as the memory tests (Section 5).
Design
In Exp 1-3, we used the four stories chosen during the piloting work (see above). The order of presentation was counterbalanced: that is, the stories were presented in different orders for different participants, and each story appeared in each position approximately the same number of times. Thus, the mean serial positions of the stories were similar in Exp 1 (aquarium: M = 2.5; hospital: M = 2.5; castle: M = 2.61; camping: M = 2.38), Exp 2 (aquarium: M = 2.52; hospital: M = 2.48; castle: M = 2.49; camping: M = 2.51), and Exp 3 (aquarium: M = 2.49; hospital: M = 2.51; castle: M = 2.54; camping: M = 2.46).
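One standard way to implement such counterbalancing is a balanced Latin square. The sketch below is purely illustrative (the exact counterbalancing scheme is not specified in the text): across every block of four participants, each story appears once in each serial position, and each story follows each other story equally often.

```python
# Illustrative counterbalancing sketch (balanced Latin square); the actual
# scheme used in the study may have differed.
stories = ["aquarium", "hospital", "castle", "camping"]

def balanced_latin_square(items):
    """Balanced Latin square for an even number of items."""
    n = len(items)
    orders = []
    for row in range(n):
        seq, j, k = [], 0, 1
        for i in range(n):
            if i % 2 == 0:
                val = (row + j) % n   # walk forward on even steps
                j += 1
            else:
                val = (row + n - k) % n  # walk backward on odd steps
                k += 1
            seq.append(items[val])
        orders.append(seq)
    return orders

orders = balanced_latin_square(stories)
# participant p would receive presentation order orders[p % 4]
```

With four stories this yields four presentation orders; cycling through them keeps the mean serial position of each story close to 2.5, as reported above.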
In Exp 1-2, participants’ memory for the previously heard story was tested immediately after listening to each story. In Exp 3, participants took part in a forced choice recognition task after listening to all four stories. During this task, we presented participants with a series of sentence pairs. One of the sentences had always been presented previously at a boundary location, whereas the other was a slightly modified version of this sentence. Participants had to select the previously heard sentence.
Procedure
The pilot study and Exp 1 were conducted online. The presentation of the stories/questions and the recording of the responses were programmed using the presentation software PsychoPy80 and hosted online using the Pavlovia platform (https://pavlovia.org). Each story and the related memory test were programmed as separate tasks. The research assistant contacted the participants using online chat applications and sent the web links related to the different stories in the predefined pseudorandomized order (see above). At the beginning of each task, the instruction was presented on the computer screen and was also explained by the experimenter. Participants were told that they would listen to everyday stories. They were instructed to try to memorize the stories, because they would be asked questions regarding the characters and locations of the stories. In addition, they were asked to segment the stories into meaningful events by indicating event boundaries with a press of the SPACE key on the keyboard (see Section 6 of the supplementary material for the full instruction). Participants performed the tasks on their own laptop or PC.
Exp 2 and Exp 3 were conducted in the laboratory. Participants were seated in front of a computer screen and an SMI RED500 remote eye-tracker; we asked them to place their head in a chinrest to restrict head movements. As a first step, the eye-tracker was calibrated. Thereafter, in Exp 2, we instructed participants to memorize the stories; this part of the instruction was similar to Exp 1. Importantly, however, in contrast to Exp 1, in Exp 2 we did not instruct participants to segment the stories; thus, their only task was to listen carefully to the narratives. In Exp 3, we omitted the direct reference to the mnemonic task from the instruction, as well as all hints toward spatial locations or characters. We only told participants to listen to the texts carefully, as further tasks related to the content of the stories would be conducted later. In both experiments, participants listened to the stories using earphones.
The questions regarding the stories in Exp 1-2 were presented on the computer screen, and participants had to type in their responses. In Exp 3, after listening to the last story, participants were given a spreadsheet, which listed all sentence pairs one below the other. Participants’ task was to mark the correct version of each sentence, i.e. the one they had heard previously. To make the task easier, the sentences were presented in blocks corresponding to the specific stories; that is, participants knew in which story the correct version of the sentence had been presented.
Data processing
Rating of memory performance in Exp 2: Although the memory task was also administered in Exp 1 to maintain consistency between the designs of Exp 1 and 2, memory performance was analyzed only in Exp 2. Here, due to technical problems, all responses of one participant, and the responses for one of the stories in the case of another two participants, were lost. All other responses were coded according to a predefined scoring protocol developed by the first author (P.P.). One point was given for correct responses and zero points for incorrect ones. For some answers, half a point could be given for partially correct responses. Based on this protocol, all responses were scored by a PhD student who had no knowledge of the design, hypotheses, or aims of the study. Twenty-five percent of the responses (data from 21 participants) were also coded by the second author (Sz.Á.). She was unaware of the scores given by the PhD student and followed the same scoring protocol. Between-rater correspondence was high: the two raters gave the same points in 94.55% of the cases (kappa = 0.88). As there were four stories with seven questions per story, the maximum score was 28. Thus, the points attained by each participant were divided by 28 to obtain a percentage score; this memory score was used in all analyses. The mean memory score in Exp 2 was M = 0.71 (SD = 0.11). The performance of two participants was more than 3 SD below the mean of the sample, thus their data were also excluded from further analysis. In sum, three participants were excluded from further analysis due to missing data or low memory performance (i.e. their pupil data were not analyzed either).
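The between-rater agreement statistics reported above can be computed as in the following sketch (the rating lists are made-up illustrations, not the real data): percent agreement is the share of identical scores, and Cohen's kappa corrects this for chance agreement based on each rater's marginal score frequencies.

```python
# Sketch of percent agreement and Cohen's kappa for two raters who score
# responses as 0, 0.5, or 1 point (example ratings are hypothetical).
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # chance agreement from the raters' marginal score frequencies
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return observed, (observed - expected) / (1 - expected)

agreement, kappa = cohens_kappa([1, 1, 0.5, 0, 1, 0],
                                [1, 1, 0.5, 0.5, 1, 0])
```

Applied to the full set of doubly coded responses, this yields the kind of values reported in the text (94.55% agreement, kappa = 0.88).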
Computing memory performance in Exp 3: We calculated the percentage of sentence pairs for which participants selected the correct, previously heard sentence. For one of the participants, the percentage of correct responses was at chance level (48%), thus their behavioral and pupillometric data were excluded from further analysis. After the exclusion, the mean accuracy was M = 0.69 (SD = 0.09).
Segmentation behavior: In the pilot study and in Exp 1, segmentation behavior was assessed by analyzing the frequency and distribution of key presses. For each second of each story, we calculated the percentage of participants who indicated an event boundary. These key-press percentage values were then used to define event boundaries (see next section). Processing of the raw key-press frequency data was done using self-written scripts in MATLAB81.
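The per-second agreement measure can be sketched as follows (the original scripts were written in MATLAB; this is a hypothetical Python equivalent, and the input layout `key_press_times` is an assumption): for every second of a story, count the participants who pressed the segmentation key in that second, counting each participant at most once per second.

```python
# Sketch of the per-second key-press percentage measure (assumed input:
# key_press_times[p] lists the story times, in seconds, of participant p's
# segmentation key presses).
import numpy as np

def keypress_percentages(key_press_times, story_length_sec):
    n_participants = len(key_press_times)
    counts = np.zeros(story_length_sec)
    for times in key_press_times:
        # a participant contributes at most once to any one-second bin
        seconds = {int(t) for t in times if 0 <= t < story_length_sec}
        for s in seconds:
            counts[s] += 1
    return 100 * counts / n_participants
```

The resulting vector (one percentage per second of the story) is the input for the boundary definitions described in the next section.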
Alignment points: Because we were interested in how event boundaries affect pupil size, we aimed to select the story locations that represent event boundaries and to investigate pupil size changes following these locations. To this aim, pupil size data were aligned to these locations, and we will thus refer to them as alignment points.
First, we identified the sentences containing a spatial or a character shift, as identified by Bailey and colleagues73. As the sentences were usually multiple seconds long, we selected the time point in the story at which the first information indicating a character/spatial shift was spoken by the narrator (i.e. the name of the new character or the first information regarding the new spatial location). Four character- and four spatial-shift locations were identified in each story by Bailey and colleagues73. We refer to these alignment points as content-driven, as they are based on the narrative content of the texts and reflect the assumptions of the stories’ creators about event boundaries.
We also used data-driven alignment points to determine event boundaries: in these cases, we used the segmentation behavior of participants in Exp 1 to identify the locations in each story where key presses indicating event boundaries were most frequent. After inspecting segmentation behavior over time, we selected the seconds in which at least 15% of the participants pressed the segmentation key. These locations were termed event segmentation hotspots. Because we aimed to investigate pupil size change in the time window after such changes, we aimed to avoid overlapping pupil data in cases where multiple segmentation hotspots followed each other within less than ten seconds. Thus, in such cases, only the hotspot with the larger key-press value was included in the analysis. In the case of equally large key-press frequency values within ten seconds, the first location was chosen. No segmentation hotspots were chosen from the first and last ten seconds of the stories. Following the above criteria, we identified 24 such hotspots (aquarium: 6, hospital: 7, castle: 6; camping: 5). The mean key-press percentage for these segmentation hotspots was M = 19.45% (SD = 3.01). Note that this mean frequency value was computed from the key-press frequencies at those specific locations (seconds) in the story where at least 15% of the participants pressed the segmentation key. When all key presses initiated during the sentences containing an event segmentation hotspot were counted, larger agreement was observed: on average, M = 36.76% (SD = 12.91) of the participants indicated an event boundary during the sentences that contained these event segmentation hotspots. Furthermore, on average, M = 48.01% (SD = 14.72) of the participants pressed the segmentation key within three seconds before or after the event segmentation hotspots. These agreement values are comparable to those observed for the original stories in the original study73.
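The hotspot selection criteria above can be sketched as a short selection routine. This is an illustrative reconstruction from the stated rules, assuming `pct` is the per-second key-press percentage vector for one story: at least 15% agreement, the first and last ten seconds excluded, at least ten seconds between hotspots, the larger value winning within that window, and ties resolved in favor of the earlier second.

```python
# Sketch of the data-driven hotspot selection (criteria as stated in the text;
# `pct` is assumed to hold one key-press percentage per second of the story).
def find_hotspots(pct, min_pct=15.0, guard_sec=10):
    candidates = [s for s in range(guard_sec, len(pct) - guard_sec)
                  if pct[s] >= min_pct]
    hotspots = []
    for s in candidates:
        if hotspots and s - hotspots[-1] < guard_sec:
            # within 10 s of the previous hotspot: keep the larger value,
            # and on a tie keep the earlier second
            if pct[s] > pct[hotspots[-1]]:
                hotspots[-1] = s
        else:
            hotspots.append(s)
    return hotspots
```

Run over the four stories' percentage vectors, a routine like this yields the 24 hotspots reported above.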
In the original study, participants could read the written version of the texts and could use a pencil to mark the event boundaries, whereas in our task, they had to identify the boundaries online while listening to the stories. Thus, the rather low agreement values at event segmentation hotspots (15-25%) might be due to the key presses in our experiment being spread out in time around the actual event boundary; if this is compensated for by aggregating responses over multiple seconds, the same level of agreement can be achieved. Note that pupil-linked brain arousal might be triggered at the discrete time point when a given participant detects a change in the incoming input; thus, despite its lower mean agreement value, we favored the second-by-second key-press measure for identifying event segmentation hotspots over aggregate measures that sum key-press frequencies over longer time periods.
We also identified two sets of control alignment points, where no event boundaries were expected. First, as content-driven control alignment points, we used the control sentences from the original paper73, which were defined by the authors as sentences in which no narrative shift occurred. Here, we randomly chose one word from each control sentence to identify the specific alignment point. Second, to identify data-driven control alignment points, we used data from Exp 1 reflecting segmentation behavior: we identified locations where no key presses occurred in the current and in the surrounding two (aquarium, castle) or three (hospital, camping) seconds. This difference in criteria was used because of different patterns in the segmentation data: clusters of subsequent zero key-press frequencies were shorter in the former two stories (aquarium, castle) than in the latter two (hospital, camping). Thus, using the same criterion would have led to widely differing numbers of no-segmentation spots in the four stories. When multiple subsequent seconds met the above criterion, the last second was chosen. When two no-segmentation spots followed each other within less than ten seconds, the first one was chosen. We identified 23 such no-segmentation spots (aquarium: 7, hospital: 4, castle: 5; camping: 7).
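The data-driven control selection can likewise be sketched from the stated rules. This is an illustrative reconstruction, again assuming `pct` holds the per-second key-press percentages; `radius` is 2 or 3 depending on the story, the last second of each qualifying run is kept, and spots closer than ten seconds to an already chosen one are dropped.

```python
# Sketch of the data-driven control alignment points ("no-segmentation spots"):
# zero key presses in the current second and the surrounding `radius` seconds.
def find_no_segmentation_spots(pct, radius=2, min_gap=10):
    n = len(pct)
    quiet = [s for s in range(radius, n - radius)
             if all(pct[s + d] == 0 for d in range(-radius, radius + 1))]
    spots = []
    for s in quiet:
        if s + 1 in quiet:
            continue                 # not the last second of a quiet run
        if spots and s - spots[-1] < min_gap:
            continue                 # within 10 s of a chosen spot: keep the first
        spots.append(s)
    return spots
```

Applied with the story-specific radii, such a routine yields the 23 control spots listed above.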
Importantly, as key-press percentage values were calculated for each second of each story, the alignment points also refer to one specific second of each story (e.g. an event segmentation hotspot was identified in the 34th second of the aquarium story). All alignment points are marked in the stories presented in the supplementary material (Section 5).
Pupil size preprocessing: Pupil size data were recorded in mm by the eye-tracker software and were preprocessed for each story and each participant separately using self-written scripts in MATLAB81. As a first step, we removed from each data series outliers beyond the physiologically plausible range of 2-8 mm, as well as blinks identified by the data processing software of SMI. These data points were replaced using linear interpolation. As a second step, the data were downsampled from the original 500 Hz recording frequency to 10 Hz and then smoothed using a Savitzky-Golay filter (parameters: window size = 51, polynomial degree = 2). Blink frequency (blinks per minute) was M = 17.93 (SD = 14.19) in Exp 2 and M = 19.05 (SD = 15.01) in Exp 3. On average, the percentage of interpolated points was M = 7.81% (SD = 7.11) in Exp 2 and M = 8.52% (SD = 7.44) in Exp 3 (calculated for the downsampled data series).
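The preprocessing pipeline can be sketched as follows. The original scripts were in MATLAB; this is a hypothetical Python equivalent, and the block-averaging downsampler is an assumption (the text does not specify the downsampling method, and the real pipeline additionally interpolates the blink samples flagged by the SMI software).

```python
# Hedged sketch of the pupil preprocessing pipeline: outlier removal with
# linear interpolation, 500 Hz -> 10 Hz downsampling, Savitzky-Golay smoothing.
import numpy as np
from scipy.signal import savgol_filter

def preprocess_pupil(pupil, fs_in=500, fs_out=10):
    pupil = np.asarray(pupil, dtype=float).copy()
    # mark physiologically implausible samples as missing
    bad = (pupil < 2.0) | (pupil > 8.0)
    idx = np.arange(len(pupil))
    if bad.any():
        pupil[bad] = np.interp(idx[bad], idx[~bad], pupil[~bad])
    # downsample by averaging non-overlapping 50-sample blocks (an assumption)
    step = fs_in // fs_out
    n = len(pupil) // step
    down = pupil[:n * step].reshape(n, step).mean(axis=1)
    # smooth with the reported filter settings (51-sample window, 2nd order)
    return savgol_filter(down, window_length=51, polyorder=2)
```

Note that the 51-sample window operates on the 10 Hz signal, i.e. it spans roughly five seconds of data.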
Next, we identified data sets with a large number of interpolated data points (suggesting low data quality). First, we averaged the blink frequencies and interpolation ratios of each participant across all four stories. We excluded the data of one participant from each experiment whose average blink frequency was more than 3 SD above the mean of the sample (computed separately for Exp 2 and Exp 3). Furthermore, one participant was excluded from Exp 3 because their interpolation ratio was more than 3 SD above the mean of the sample. Then, we checked blink frequencies and interpolation ratios for the individual stories, and data linked to individual stories were also excluded if these values exceeded 3 SD above the mean of the respective sample. Data from one story were excluded due to excessive blinking in both Exp 2 and Exp 3, whereas due to high interpolation ratios, data were excluded for five stories in Exp 2 and one story in Exp 3. Finally, we also excluded data from those stories where the mean blink frequency was less than two blinks/minute. Such values are far below the average blinking frequency found in other studies82 and might indicate a failure of the blink detection algorithm. Because of this, we excluded data from nine stories in Exp 2 and from one story in Exp 3. Furthermore, in Exp 3, we also excluded all data from three participants, because their mean blink frequency was below two blinks/min for three or four of the stories. Note that when data from a specific story of a participant were excluded due to low data quality, the data of the same participant from other stories with appropriate data quality were included in the analysis.
In summary, in Exp 2, the data of 67 participants were processed, as four participants were excluded due to low data quality (1), missing data (1), or low memory scores (2). These 67 participants completed 268 stories, and data from 15 stories were excluded overall (5.6%). In Exp 3, the data of 33 participants were processed, as six participants were excluded due to low data quality (2), low blink frequency (3), or a low memory score (1). These 33 participants completed 132 stories, and data from three stories were excluded (2.2%).
Pupil dilation analysis: Pupil size is sensitive to several psychological and physiological factors and varies considerably over time (see e.g. 69). Thus, to investigate how event boundaries affect pupil size, we had to examine the relative change in pupil size around specific time points of interest. To this aim, we aligned the pupil data to the different alignment points (see above) and investigated how pupil size changed in the subsequent seconds relative to its baseline value immediately before the alignment point. This relative pupil dilation might reflect the neural processes underlying information processing associated with the alignment points.
To identify pupil size changes of interest, we extracted data segments starting from 0.5 sec before to 12 sec after each alignment point. Then, the average pupil size was computed for each 500 msec bin of each data segment (i.e. mean pupil size was calculated for the period from 0.5 sec before the alignment point to the alignment point, from the alignment point to 0.5 sec after it, from 0.5 sec to 1 sec after it, etc.). Thereafter, to represent pupil size change relative to the alignment point, the mean value of the 500 msec before the alignment point was subtracted from each subsequent value. Thus, at the end of this processing, for each participant and each alignment point, we had a time series of 25 data points spanning from 500 msec before to 12 sec after the alignment point, representing pupil size changes relative to the 500 msec preceding the alignment point.
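The epoching and baseline correction reduce, at 10 Hz, to slicing 125 samples around the alignment point, averaging them into 25 bins of 5 samples, and subtracting the first (pre-alignment) bin. The sketch below illustrates this, assuming `trace` is a preprocessed 10 Hz pupil trace and `align_sec` an alignment point in seconds.

```python
# Sketch of the epoching and baseline correction: 25 bins of 500 ms spanning
# -0.5 to +12 s, each expressed relative to the 500 ms pre-alignment baseline.
import numpy as np

def baseline_corrected_bins(trace, align_sec, fs=10, pre=0.5, post=12.0):
    start = int(round((align_sec - pre) * fs))
    n_samples = int(round((pre + post) * fs))           # 125 samples at 10 Hz
    seg = np.asarray(trace[start:start + n_samples], dtype=float)
    bins = seg.reshape(-1, int(round(0.5 * fs))).mean(axis=1)  # 25 bins
    return bins - bins[0]  # subtract the pre-alignment baseline bin
```

The first element of the returned series is zero by construction (the baseline bin relative to itself); the remaining 24 bins cover 0-12 s and enter the statistical analysis described below.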
Statistical analysis
All statistical tests (except cluster-based permutation testing) were conducted using self-written R scripts83. During the statistical analysis of the data, we first checked the distribution of the to-be-investigated variables using the Shapiro-Wilk test; depending on the normality of the variables, parametric (ANOVA, Pearson correlation, etc.) or nonparametric statistical tests (Wilcoxon signed rank test, Spearman correlation, etc.) were performed. For the correlation analyses in Exp 2-3, we checked for and excluded outliers whose values deviated from the sample mean by more than 3 SD. Only one outlier had to be removed, from the maximum pupil dilation values linked to segmentation hotspots in Exp 3 (note that even when including this outlier, the correlation between the pupil dilation variable and the memory score remained significant, rs = .37, p = .033).
For comparing the pupil size time series between 0 and 12 seconds, we used cluster-based permutation testing, which controls for the Type I error resulting from multiple comparisons of conditions at each of the 24 time bins. We used an MNE-Python script written for this purpose84,85. During this analysis, the data are first randomly permuted 500 times, and each time a one-sample Wilcoxon test is conducted for each time bin, either on the mean of that time bin for a specific condition or on the difference in the means of two to-be-compared conditions (depending on the specific analysis, see the results section). Then, significant clusters are identified in each run, that is, locations where this statistical test indicates significant differences relative to zero or between conditions (depending on the specific analysis, see the results section). Then, z-scores are summed over the different significant clusters, and a distribution of these summed z-scores over the 500 permutations is created. Only those clusters in the observed data are treated as significant whose summed z-score (a weighted score of cluster duration and effect size) is above the 95th percentile of this distribution (corresponding to the 5% significance level).
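The logic of this procedure can be illustrated with a minimal sign-flip cluster permutation sketch. This is not the MNE-Python implementation used in the study: for brevity, a one-sample t statistic and a fixed cluster-forming threshold stand in for the per-bin Wilcoxon z-scores, and the null distribution is built from the maximum cluster sum of each sign-flipped permutation.

```python
# Minimal, illustrative sign-flip cluster permutation test (simplified stand-in
# for the MNE-Python cluster permutation routine described in the text).
import numpy as np

def cluster_perm_test(data, n_perm=500, alpha=0.05, t_thresh=2.0, seed=0):
    """data: (n_participants, n_bins) baseline-corrected pupil dilations."""
    rng = np.random.default_rng(seed)

    def tstats(x):  # per-bin one-sample t statistic against zero
        return x.mean(0) / (x.std(0, ddof=1) / np.sqrt(len(x)))

    def cluster_sums(t):  # sums of supra-threshold stats over contiguous bins
        sums, cur = [], 0.0
        for v in t:
            if v > t_thresh:
                cur += v
            elif cur:
                sums.append(cur)
                cur = 0.0
        if cur:
            sums.append(cur)
        return sums

    obs = cluster_sums(tstats(data))
    # null distribution of the maximum cluster sum under random sign flips
    null = []
    for _ in range(n_perm):
        flipped = data * rng.choice([-1.0, 1.0], size=(len(data), 1))
        null.append(max(cluster_sums(tstats(flipped)), default=0.0))
    crit = np.percentile(null, 100 * (1 - alpha))
    return [c for c in obs if c > crit], crit
```

Observed clusters whose summed statistic exceeds the 95th percentile of the permutation distribution are declared significant, which is the same family-wise error control the full analysis applies across the 24 time bins.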