Children typically communicate with their caregivers in multimodal environments and interact through a variety of modalities, including eye gaze, facial expressions and hand gestures, in addition to speech (e.g. Çetinçelik, Rowland, & Snijders, 2021; Özçaliskan & Dimitrova, 2013). It is known that children integrate information from different communicative channels, such as hand gestures and speech, from around 3 years of age (Sekine, Sowden & Kita, 2015). The multimodal nature of interactions may be especially crucial in environments where speech is degraded, such as in classrooms, on sports fields, on public transport, or, as has become common since the start of the COVID-19 pandemic, when others are wearing masks. In such everyday environments, speech is not always clear. We do not know, however, to what extent children can integrate multimodal information in these challenging situations. To gain a full understanding of children’s comprehension of multimodal language, we need to investigate the flexibility of their system compared to that of adults, not only in optimal but also in challenging listening contexts.
Previous studies have shown that gestures, defined as meaningful hand movements that accompany speech, are part of an integrated system of language (Kita & Özyürek, 2003; Özyürek, 2014; McNeill, 1992), both in production and comprehension (Kelly, Özyürek, & Maris, 2010), and both in typical environments and in noisy situations. In line with this integrated view of speech and gesture, it has also been shown that adult listeners are flexible and use gestures to disambiguate speech when comprehension is hampered by externally or internally induced noise, as observed with degraded speech (e.g. Drijvers & Özyürek, 2017, 2020; Schubotz, Holler, Drijvers, & Özyürek, 2020; Wilms, Drijvers, & Brouwer, 2022), in noisy environments (Kendon, 2004), or in listeners with hearing difficulties (Obermeier, Dolk, & Gunter, 2012). It is not clear, however, whether this flexible integrated system is in place in children as it is in adults, especially in noisy situations, which are more taxing than clear speech situations. Thus, little is known about the extent to which children can benefit from gestures when speech is degraded. Therefore, in the current study, we examined for the first time the enhancement effect of iconic gestures on the comprehension of degraded speech in children, and compared it to that in adults.
Throughout this paper, we use the term ‘clear speech’ to mean speech that is not degraded, although we are aware that the term is also used in other contexts for therapeutic techniques utilised with individuals with motor speech disorders and those with hearing loss.
Gesture-speech integration in clear and noisy speech
Speakers use hand gestures to represent semantic information that is relevant to the content conveyed by the concurrent speech. These gestures are referred to as iconic gestures, as they iconically represent concrete aspects of a referent, such as shape, size, motion, or relative position (McNeill, 1992). Such gestures are known to be integrated with the linguistic information of the accompanying clear speech at the semantic, syntactic, prosodic, discourse and pragmatic levels during both production and comprehension (Kelly et al., 2010; Özyürek, 2014), as well as during interactive aspects of communication such as dialogue (Rasenberg, Özyürek, & Dingemanse, 2020).
Recent research has shown that gesture and speech are also integrated in noisy situations. For example, it has been observed that people effectively use and modulate their gestures in adverse listening conditions (Kendon, 2004; Trujillo et al., 2021). Numerous studies have empirically shown that iconic gestures benefit adult listeners when speech is degraded as well (e.g. Drijvers & Özyürek, 2017; Holle, Obleser, Rueschemeyer, & Gunter, 2010; Rogers, 1978; Wilms et al., 2022). For example, Drijvers and Özyürek (2017) examined the enhancement effect of both iconic gestures and lip movements (visible speech) on the comprehension of degraded speech by comparing comprehension across three noise level conditions: 2-band noise-vocoding (severe degradation), 6-band noise-vocoding (moderate degradation) and clear speech. Participants were presented with a series of video clips in which an actor recited an action verb with or without a gesture and/or lip movements. After the presentation of each verb, participants were asked to respond by typing the verb they believed the actor had conveyed. Results showed that accuracy scores were higher when both visual articulators were present than when only one of them (i.e. lip movements) was present. In addition, the enhancement effects of both iconic gestures and lip movements were larger in the 6-band condition than in the 2-band noise-vocoding condition. From these results, the authors concluded that at a moderate level of noise-vocoding (6-band), when there are adequate phonological cues in the degraded speech channel, there is an optimal range for maximal multimodal integration in which listeners can benefit most from visual information. This finding is consistent with a classic study by Rogers (1978) that examined the effect of visual modalities on the comprehension of speech with signal-to-noise ratios (SNRs) ranging from −8dB to +7dB. This study revealed that the effect of seeing the speaker was greater when the SNR was lower than when it was higher. More recently, Holle et al. (2010) used functional magnetic resonance imaging (fMRI) to investigate the brain regions in which iconic gestures and speech are integrated by manipulating the signal-to-noise ratio of the speech. Results showed greater neural enhancement in the left posterior superior temporal sulcus/gyrus (pSTS/STG) when the noise was moderate (SNR −6dB) than when the noise was severe (−14dB) or mild (+2dB). Thus, both behavioural and neuroimaging studies support the enhancement effects of gestures in noisy speech, particularly at moderate levels of degradation.
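For reference, the SNR values reported in these studies express the power of the speech signal relative to that of the background noise on a logarithmic (decibel) scale. A minimal formulation, with P_speech and P_noise denoting the average power of the speech and the noise respectively, is:

$$\mathrm{SNR}_{\mathrm{dB}} = 10\,\log_{10}\!\left(\frac{P_{\mathrm{speech}}}{P_{\mathrm{noise}}}\right)$$

Negative values (e.g. −8dB or −14dB) thus indicate that the noise is more intense than the speech, whereas positive values (e.g. +2dB or +7dB) indicate the reverse.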
The gestural enhancement effect on the comprehension of degraded speech has also been observed in studies involving elderly adults (Schubotz et al., 2020) as well as in studies involving non-native listeners (Drijvers & Özyürek, 2020). These studies showed, however, that elderly adults and non-native listeners benefited less from gestures than young adults and native listeners did, as these groups needed more phonological information in the speech signal before they could benefit from gestures; that is, they could benefit from gestures only when speech was less degraded than was necessary for native young adult listeners. Thus, studies with adult participants have shown that gestures can help to disambiguate degraded speech to varying degrees; however, it is unclear to what extent children, who have less experience with speech input than adults, can benefit from gestures when speech is degraded, that is, how flexible their multimodal integration is.
Gesture-speech integration in clear and noisy speech in children
Previous studies have shown that children can process information from gestures and are able to integrate it effectively with clear speech. First of all, children between the ages of 5 and 10 (as well as adults) can pick up gestural information when speech and gestures are presented simultaneously. For example, studies have shown that 8- and 10-year-olds are able to detect information conveyed solely in iconic gestures when presented with children's explanations of Piagetian conservation tasks (Church, Kelly, & Lynch, 2000; Kelly & Church, 1998). Another study revealed that 5- and 6-year-olds can respond to interview questions using information conveyed solely through an interviewer’s iconic gestures (Broaders & Goldin-Meadow, 2010).
Furthermore, research shows that children and adults integrate gestures and speech such that each component contributes unique information to a unified interpretation (adults: Cocks, Sautin, Kita, Morgan, & Zlotowitz, 2009; Kelly et al., 2010; children: Kelly, 2001; Sekine et al., 2015). Thus far, two studies have examined children’s ability to effectively integrate iconic gestures and speech. Sekine et al. (2015) examined the ability of both children and adults to integrate speech and iconic gestures in a manner in which each mutually constrains the other's meaning. The participants were presented with an iconic gesture, a spoken sentence, or a combination of the two and were instructed to select the photograph that best matched the message communicated. Results showed that 3-year-olds had difficulty integrating information from speech and gesture, whereas 5-year-olds performed at a level similar to adults. The researchers concluded that the ability to integrate iconic gestures and speech develops after 3 years of age.
This claim was also supported by a study using electrophysiological measures (Sekine et al., 2020). Sekine et al. examined gesture-speech integration in 6- to 7-year-olds by focusing on the N400 event-related potential (ERP) component, which is modulated by semantic integration load. The ERPs showed that the amplitude of the N400 was larger in the mismatched gesture-speech condition than in the matched gesture-speech condition. This result provided neural evidence that children integrate gestures and speech in an online fashion by the age of 6 or 7. Taken together, these two lines of research show that children are able to extract information from gestures and can integrate it with concurrent speech.
Although, to our knowledge, no studies to date have investigated the extent to which iconic gestures assist children with the recognition of speech in adverse conditions, previous research on the recognition of degraded speech alone has revealed that children are able to process degraded speech, albeit not as well as adults (e.g. Eisenberg, Shannon, Martinez, Wygonski, & Boothroyd, 2000; Grieco-Calub et al., 2017; Newman & Chatterjee, 2013; Roman, Pisoni, Kronenberger & Faulkner, 2017). For example, Newman and Chatterjee (2013) assessed the ability of toddlers (27-month-olds) to recognise noise-vocoded speech by comparing their performance with clear speech to their performance with 24-, 8-, 4- and 2-band noise-vocoded speech. By measuring the amount of time toddlers spent looking at the target picture, they found that toddlers looked at the target object in equivalent proportions with clear speech and with 24- or 8-band noise-vocoded speech, but failed to look appropriately with 2-band noise-vocoded speech and showed variable performance with 4-band noise-vocoded speech. These results suggest that even 2-year-olds have developed some ability to interpret vocoded speech; however, children require a long learning period before they are able to recognise spectrally degraded speech as well as adults do. Eisenberg et al. (2000) examined the development of the ability to recognise degraded speech by comparing 5- to 7-year-olds, 10- to 12-year-olds, and adults. They presented words or sentences in 4-, 6-, 8-, 16- or 32-band noise-vocoded conditions. Results showed that word and sentence recognition scores did not differ statistically between adults and older children (10- to 12-year-olds). In contrast, accuracy scores for 5- to 7-year-olds were significantly lower than those of the other two age groups. Younger children required more spectral resolution (more than 8 bands) to perform at the same level as adults and older children. The authors suggested that the deficits in younger children are partially due to their inability to fully utilise sensory information and partially due to their incomplete linguistic and cognitive development, including their still-developing auditory attention, working memory capacity and receptive vocabularies. It is unknown, however, whether children are able to attain the same level of performance as adults when presented with visual cues, such as gestures.
Finally, a few studies have investigated the contribution of visual cues other than gestures, such as visible lip movements, to children's comprehension of degraded speech. Maidment, Kang, Stewart and Amitay (2015) explored whether lip movements improve speech identification in typically developing children with normal hearing when the auditory signal is spectrally degraded. They presented sentences that were noise-vocoded with 1–25 frequency bands and compared the task performance (identifying the colour and number mentioned in the spoken sentences) of children aged 4–11 with that of adults. They found that, unlike older children (aged 6–11) and adults, who did benefit from visual cues when recognising degraded speech, the youngest children (aged 4–5) did not benefit from accompanying lip movements.
However, it is not yet clear to what extent children benefit from hand gestures when comprehending degraded speech, and how they compare to adults across different levels of speech degradation. It is possible that while children might be able to integrate iconic gestures with speech in clear speech conditions by ages 6–7 (Sekine et al., 2015, 2020), they might perform less well than adults when speech is degraded. Because children are less proficient at understanding degraded speech, they might also benefit less from gestures than adults, in line with findings that non-native and elderly listeners benefit less from gestures in noise (e.g. Drijvers & Özyürek, 2020; Schubotz et al., 2020). If this is the case, it would show that even though children, like adults, possess the ability to integrate multimodal language, this ability still needs to develop further, especially for challenging situations in which the quality of the input in one channel is not optimal.
Present Study
The purpose of the current study was to investigate, in children and adults, the enhancement effect of gestures on the comprehension of degraded speech, in addition to the information gathered from visible speech, using noise-vocoded speech. We presented both populations with a word recognition task in two contexts: speech only or a gesture-and-speech combination. Following previous studies on the effect of gestures on degraded speech in adults (e.g. Drijvers & Özyürek, 2017, 2018, 2020; Schubotz, Holler, Drijvers, & Özyürek, 2020), lip movements (visible speech) were visible in both contexts. The current study was conducted with 6- and 7-year-olds, as previous research (e.g. Sekine et al., 2020) has shown that children are able to understand and integrate iconic gestures with speech by this age.
Speech quality was manipulated using noise-vocoded speech, a speech signal that has been processed to preserve amplitude and temporal information while systematically varying the spectral information in the signal (Davis, Johnsrude, Hervais-Adelman, Taylor, & McGettigan, 2005; Shannon, Zeng, Kamath, Wygonski & Ekelid, 1995). This type of signal was originally created by Shannon et al. (1995) to simulate speech perception with a cochlear implant and to investigate the perception of degraded speech in listeners with normal hearing. In the current study, children and adults were presented with three types of vocoded speech as well as clear speech, with or without gestures, and with visible lip movements in all trials. After the presentation of each stimulus, participants were asked to say what they heard into a microphone. We compared the performance of the children with that of the adults to examine differences between the two populations and their responses to different levels of degradation.
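To illustrate the kind of signal processing involved, below is a minimal sketch of noise-vocoding in Python (assuming NumPy and SciPy; the function name, band edges and filter settings are illustrative and are not the exact parameters of the stimuli used here or in the studies cited above). The speech is split into logarithmically spaced frequency bands, the amplitude envelope of each band is extracted, and each envelope then modulates noise limited to the same band; the fewer the bands, the less spectral detail is preserved.

```python
import numpy as np
from scipy.signal import butter, hilbert, sosfiltfilt

def noise_vocode(speech, fs, n_bands=6, f_lo=50.0, f_hi=8000.0):
    """Minimal noise-vocoder sketch (after Shannon et al., 1995): keep the
    amplitude envelope of each frequency band, discard fine spectral detail.
    Assumes `speech` is a 1-D float array sampled at fs, with fs > 2 * f_hi."""
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)   # logarithmic band edges
    noise = np.random.randn(len(speech))            # white-noise carrier
    vocoded = np.zeros(len(speech))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band_speech = sosfiltfilt(sos, speech)      # speech limited to this band
        envelope = np.abs(hilbert(band_speech))     # amplitude envelope of the band
        band_noise = sosfiltfilt(sos, noise)        # noise limited to the same band
        vocoded += envelope * band_noise            # envelope-modulated noise band
    # Rescale so the vocoded signal has the same overall power as the input.
    vocoded *= np.sqrt(np.mean(speech ** 2) / np.mean(vocoded ** 2))
    return vocoded

# Example: a severely degraded (2-band) versus a moderately degraded (6-band) version.
# severe = noise_vocode(speech, fs, n_bands=2)
# moderate = noise_vocode(speech, fs, n_bands=6)
```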
We anticipated that the children would perform more poorly than the adults in the degraded speech-only condition, as shown by Eisenberg et al. (2000), but that their performance would improve with the addition of gestures. Furthermore, we predicted that, when gestures were present, the children’s performance would be similar to that of the adults in the degraded speech-only condition. We also expected the gestural enhancement effect to be greater in adults than in children, because children have more difficulty extracting phonological cues from degraded speech than adults, which may limit the benefit they derive from gestures, as has been shown for elderly adults and non-native listeners (Drijvers & Özyürek, 2020; Schubotz et al., 2020).
Finally, in addition to accuracy, we also examined response times, measured from the onset of the verbal response. In this respect, our study differs from that of Schubotz et al. (2020), which used responses to multiple alternatives rather than verbal responses; verbal responses might provide a more sensitive measure of response time. We expected gestures to facilitate both children’s and adults’ responses when repeating the speech they heard, compared to instances without gestures. In a gesture, a person’s hand first starts moving in preparation for the meaningful part of the gesture; this initial phase is called the ‘preparation phase’ and the meaningful phase is called the ‘stroke phase’ (McNeill, 1992). Previous studies (e.g. Holler, Kendrick, & Levinson, 2018; Sekine & Kita, 2017) showed that adult listeners responded faster when a speaker produced gestures with speech than when no gestures occurred, possibly because the preparation phase gives the listener a clue about what the speaker is going to say. To this end, we calculated response times in both the gesture-speech condition and the speech-only condition, in addition to accuracy scores.