In Experiment 1, pupil traces differed markedly depending on the task (recall vs repeat), consistent with past studies that investigated the effect of recall on the pupillary response. Although baseline dilation increased at the beginning of each list for both tasks, the two traces diverged after the first 2–3 words. The repeat-only task was extremely easy, and as a consequence the pupil relaxed, ending up 0.2 mm lower than where it started. In contrast, recalling as many words as possible demanded sustained cognitive resources, leading to a continued increase in pupil diameter. Initial PPD values were high, but they decreased over the course of the 10-word list, meaning that the average PPD could be inordinately affected by the first word. These effects of task and position caution against relying on a single metric such as PPD 43,46,54. In ecological situations, listeners often have to multi-task, and measuring how the averaged diameter varies over long periods of time can yield useful information.
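To make the metrics discussed above concrete, the following is a minimal sketch of how a baseline, a PPD, and a peak latency might be derived from a single word's pupil trace, and of how a list-averaged PPD can be pulled upward by the first word. It is not the analysis pipeline used in this study: the function name `word_metrics`, its parameters, the assumption that the baseline is the mean of the samples preceding word onset, and the PPD values in the example are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def word_metrics(trace_mm, sample_rate_hz, n_baseline):
    """Return (baseline_mm, ppd_mm, latency_s) for one word's pupil trace.

    Assumption (not from the paper): the first `n_baseline` samples
    precede word onset and define the baseline.
    """
    baseline = mean(trace_mm[:n_baseline])
    post_onset = trace_mm[n_baseline:]
    # Index of the largest post-onset sample (the first one if tied).
    peak_index = max(range(len(post_onset)), key=lambda i: post_onset[i])
    # PPD is the peak diameter expressed relative to the baseline.
    ppd = post_onset[peak_index] - baseline
    # Latency is the time from word onset to the peak sample.
    latency_s = peak_index / sample_rate_hz
    return baseline, ppd, latency_s


# Made-up PPD values (mm) for a 10-word list, high at position 1 only.
ppd_by_position = [0.40, 0.22, 0.18, 0.15, 0.14, 0.13, 0.12, 0.12, 0.11, 0.11]
mean_all = mean(ppd_by_position)
mean_without_first = mean(ppd_by_position[1:])
```

With these illustrative numbers, `mean_all` exceeds `mean_without_first`, which is the sense in which a list-averaged PPD can be dominated by the first position; reporting PPD by position alongside the baseline trajectory avoids that ambiguity.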
More to the point of this study, we found that pupillometry was sensitive to subtle effects generated by the F0 contour within words. Consistent with our hypothesis for pupillometry outcomes, the PPD was larger in both the monotonized and inverted conditions, suggesting that these conditions required extra resources to process. For behavioral outcomes, we replicated the poorer intelligibility of these two conditions previously found for sentences in noise 55,56, though the impairment was minimal in the current study because individual words in quiet can hardly be misunderstood (ceiling performance). Note that this was by design: had we presented the words in background noise (to lower intelligibility), the periodicity information would have been less well extracted (or would have served other purposes, e.g. segregation).
Additionally, we observed two interesting interactions between pitch and task conditions: one for the baseline and the other for PPD latency. The latency effect is relatively easy to interpret, since the effect of recall (delaying the peak dilation) was seen exclusively for the conditions that were already difficult to process, namely the monotonized and inverted pitch contours. The task would therefore exacerbate any initial difficulty in word decoding. The baseline effect is not as easy to interpret: it is possible that the absence of pitch information (or, more accurately, the fact that it is uninformative) in monotonized words makes them even more difficult to decode than inverted words (which have some, but potentially misleading, pitch information). In turn, anticipation of the extra difficulty throughout the monotonized condition would exacerbate the overall effort induced by recall 57. In other words, both interactions might point to the idea that the recall task is more costly in listening situations that are already taxing resources.
These results demonstrate that, even in normal-hearing native speakers of a non-tonal language, pupillometry is sensitive to unnatural and distorted F0 inflections, and we interpret the differences in response as reflecting an explicit investment of cognitive resources. The ELU model would explain this in the following manner: the phonological representation of monotonized or inverted words becomes harder to match with a “template” of a known word in the listener’s long-term memory. To make the match, explicit engagement of extra cognitive resources is necessary, and this extra engagement is shown in the task-evoked pupillary response (even when the behavioral response and subjective reflection do not capture it).
Thus, abnormal pupil responses to words in quiet may indicate that performance will worsen in noisy situations, even though in our experiment there was little effect of pitch conditions on intelligibility, memory, or subjective difficulty rating. Changes in behavioral and subjective responses can in principle occur without adding noise, but detecting them would likely require more statistical power and adjustments to the experimental design, such as a larger sample size or a longer wait time before testing recall.
In Experiment 2, many findings from Experiment 1 were replicated: the effects of task and position were identical for the baseline and very similar for the PPD amplitude (but affecting the entire list rather than changing the slope across word positions).
More to the point of this study, we found again that pupillometry was sensitive to pitch manipulations, this time across words instead of within words. Given the “across-word” nature of the manipulation, this calls for a different interpretation, no longer about the phonological representation of words but rather about their encoding as a sequence forming a melody via their respective F0s. A complete understanding of this phenomenon is not trivial.
Although these manipulations did not have behavioral consequences here, we know that they can in principle affect memory performance. Sares et al. 35 tested free recall performance after presenting word lists with different pitch sequences, and found an improvement in free recall only when the sequences indicated a grouping (arpeggios). Since the melodic condition in this experiment had a similar pattern to the arpeggios in Sares et al. (2023), one might have expected the melodic condition to lead to better recall than the fixed condition. Not only was this not the case, but performance in the melodic condition tended to be worse. One important parameter differentiating the two designs is the speed of presentation: while in Sares et al. (2023) the words were presented close in time (less than 1 second apart), in a way which could facilitate grouping and pitch pattern recognition, here the words were presented around 5 seconds apart (1s waitpeak + participants repeating back the words), a timescale which may be perceived quite differently 58. The flattened profile of ten words in a sequence with short intervals may make them difficult to retain in memory, since items that are more distinct from one another could be easier to store and retrieve than items that share common features. For instance, several studies showed that, in NH listeners, the recall of words that rhyme is poorer than the recall of words that do not rhyme (Baddeley, 1966; Conrad & Hull, 1964; Nittrouer et al., 2013; Salamé & Baddeley, 1986). Another possibility is that, on this larger timescale, the role of pitch is not to facilitate memory encoding but rather to support participants’ engagement with the task. A previous study has shown that the pupillary response is sensitive to changes in task engagement even when behavioral performance is the same (or similar, as in our case) 57.
However, we did not observe differences between the two manipulations in baseline pupil diameter, which is typically related to anticipating or mobilizing attention 63–65. We did not have any other markers to estimate task engagement, nor a physiological correlate of arousal and stress (e.g. salivary cortisol, skin conductance, or heart rate recordings). The melodic condition resulted in larger PPDs, but it remains unclear whether this is a sign of additional effort in processing and storing words, or whether it reflects a more enthusiastic engagement with this condition.
In summary, the pitch manipulations applied in both Experiment 1 and Experiment 2 (within-word and across-word) had small effects on intelligibility (in the ceiling region) and did not significantly affect immediate recall or subjective difficulty ratings. It is surprising that immediate recall did not show any significant difference, especially considering that similar dual-task behavioral paradigms have been shown to be sensitive to different acoustic manipulations 28,29. This lack of sensitivity could be due to the use of words rather than sentences. In our case, the recall was likely tapping into a phonological loop of recently stored monosyllabic words, so the impact of the acoustic manipulation could be more heavily influenced by the recency effect. When using longer and more complex stimuli, for instance the SWIR, the impact of an experimental manipulation might rely more on the primacy effect, where sentence-final words are transferred to long-term memory. Additionally, the effect size of our acoustic manipulations might be smaller than that of other manipulations (e.g., SNR, noise reduction turned on and off), and hence harder to observe with the current statistical power.
On the other hand, the pupil responses did differ, suggesting that pupillary responses might be sensitive to subtleties in the allocation of cognitive resources. These subtleties are meaningful within the ELU framework: the better the acoustic input matches the stored template of the pitch contour, the less need there is for explicit cognitive resources to resolve the mismatch, and hence the smaller the pupillary response. This is presumably why the inverted and monotone pitch contours led to increased pupil dilation and response latency, especially during the recall task, as well as slightly decreased intelligibility; we could perhaps call this “unproductive effort.” Experiment 2 showed increased pupil dilation and response latency for the melodic pitch condition, and intelligibility was slightly increased (though not significantly). This is the opposite relationship to that in Experiment 1 and could be thought of as “productive effort.”
The dichotomy between productive and unproductive effort is seen elsewhere in past studies using pupillometry to quantify effort. For instance, in speech recognition tasks, elevated noise during listening can lead to greater pupil dilation and poorer intelligibility or recall, up to a point 37,42. In memory tasks, greater pupil dilation is seen for words that are recalled than for those that are not recalled, both here (see details of the replication results in Supplementary material 1) and in previous work 54. Thus, although pupillometry is a powerful and sensitive technique for registering fluctuations in physiological responses to cognitive demands, it is difficult to interpret pupil signals without reference to the task demands and behavioral outcomes. An increase in the task-evoked pupillary response or in cognitive resource expenditure is not necessarily a negative marker; it can reflect engagement and be followed by successful completion of a more complex task. Ultimately, it is individual differences in cognitive and motivational status that decide whether an event is perceived as negative or positive by the listeners (Carolan et al., 2022; Herrmann & Johnsrude, 2020; Pichora-Fuller et al., 2016).
The impact of F0-related acoustic manipulation on intelligibility and pupillary response found in this paper has not been reported systematically in previous studies. While the results are consistent with the ELU framework in general, there has not been enough research on the contribution of F0 to the internal identity of words. For non-tonal language speakers, the F0 contour has not been considered as important as phonemic components (i.e., vowels and consonants), and its encoding has therefore not been investigated as thoroughly as it has been in tonal languages. Without knowing the encoded ‘template’ of an F0 contour for a given word, it is difficult to predict the impact of acoustic degradation. Our results seem to suggest that even though the F0 contour does not play a phonemic role in non-tonal language phonology, distorting the contour still has an impact on single word and word list recognition and on listening effort. In Exp1, inverting and monotonizing the F0 contour introduced mismatches that required extra cognitive resources to resolve (as shown by pupillometry). Exaggerating the contour, however, did not, even though it also deviated from the typical contour. This is a promising result, which is in line with the correct performance obtained with caricatures 49,50 and reinforces the notion that pitch exaggerations should be further explored as a solution for CI difficulties in this domain (e.g. He et al. 68 for a similar endeavor in tonal languages). The effect seen in Exp2 for the melodic condition, on the other hand, would be difficult to implement ecologically and needs to be understood as a complex interaction among attention, engagement, arousal, anxiety, and effort 26. Without other markers of engagement and attention, it is difficult to pin down whether the benefit of adding a melodic pattern over a series of words is due to increasing attention or to providing some pitch variability similar to natural speech.
We doubt the latter: overlaying a pattern of F0 variation on isolated words is not the same as fully restoring a natural pitch contour to a connected utterance, because natural sentences use the pitch contour to transmit extra information (i.e., prosody, emotion and intonation). Based on previous studies, restoring the natural contour could be expected to reduce listening effort further, even if it does not significantly improve intelligibility, and even in a non-tonal language 69. But Exp2 suggests that introducing some pitch variability affects the processing of a series of speech inputs, even if that pitch variability does not resemble natural prosody.
Note that the current experiments were conducted in NH listeners. Future studies should investigate whether these findings extend to hearing-impaired populations, and specifically to CI users. The auditory input from a CI contains less salient and sometimes distorted F0 cues (e.g. due to incomplete array insertion). Although many factors contribute to the challenging and effortful speech recognition in CI users, the importance of F0 saliency and fidelity for speech recognition and the associated listening effort cannot be ignored. This is due not only to the importance of F0 in transmitting prosodic information (e.g., intonation, emotion), but also to its role in decoding the words themselves, since a word’s F0 contour is part of its identity. Whether CI users completely ignore pitch in their phonological representation, or whether they struggle with it because it lacks discriminative power, is an open question which we hope to address in a future study.