Speech-to-touch sensory substitution: a 10-decibel improvement in speech-in-noise understanding after a short training

Katarzyna Ciesla (kasia.j.ciesla@gmail.com), The Baruch Ivcher Institute for Brain, Cognition & Technology, The Baruch Ivcher School of Psychology, Interdisciplinary Center Herzliya (IDC), Herzliya
T. Wolak, World Hearing Centre, Institute of Physiology and Pathology of Hearing, Warsaw
A. Lorens, World Hearing Centre, Institute of Physiology and Pathology of Hearing, Warsaw
H. Skarżyński, World Hearing Centre, Institute of Physiology and Pathology of Hearing, Warsaw
A. Amedi, The Baruch Ivcher Institute for Brain, Cognition & Technology, The Baruch Ivcher School of Psychology, Interdisciplinary Center Herzliya (IDC), Herzliya


Introduction
The COVID-19 pandemic imposed on us the obligation of social distancing and wearing masks that cover the mouth. Both restrictions reduce the transmission of sound and prevent access to visual cues from speech/lip reading. Yet access to speech reading, and efficient integration of this visual information with the auditory input, is very important for speech understanding in people with hearing loss, who naturally receive degraded auditory cues from the environment [1,2]. Healthy individuals also use lip reading in all face-to-face communication, and especially benefit from it when the acoustic context is ambiguous, such as when exposed to rapid speech, speech in a non-native language, background noise and/or several speakers talking simultaneously [3,4,5]. With visual cues present, understanding of speech against noise has consistently been found to improve, both in healthy individuals and in patients with hearing loss [1,6,7,8,9].
In modern society, challenging acoustic conditions occur increasingly often, including exposure to multiple concurrent auditory streams and almost constant exposure to noise. At the same time, the prevalence of hearing impairment is growing (WHO 2020); if left untreated, hearing impairment can accelerate cognitive decline and thereby lead to social exclusion [10,11,12]. In addition, many users of modern hearing aids and cochlear implants complain that their devices fail to compensate effectively for hearing loss in the ambiguous acoustic situations described above [13,14,15,16].
All this indicates the importance of developing novel methods and devices that can improve communication, both among healthy individuals and among those with hearing deficits.
Solutions combining multisensory inputs are especially appealing, as a growing number of experimental studies [17,18,19,20] and conceptual works point to the superiority of multisensory over unisensory training for learning and sensory recovery [21,22,23,24,25]. For improved understanding of auditory signals that are either degraded or presented in suboptimal conditions, multisensory training regimes that complement audition with vision have been found successful [20,25], including in the rehabilitation of patients with hearing aids or cochlear implants, where speech reading or sign cues are added [1,6].
Interestingly, several recent works, including our own findings, have shown benefits of adding tactile stimulation to improve comprehension of degraded auditory speech [26,27,28]. Combining auditory and tactile stimulation in assistive communication devices is an interesting and novel approach that can prove useful especially when access to visual cues during communication is limited. Audition and touch share some key similarities: both use mechanoreceptors to encode vibration in a shared range of frequencies (approx. 50 Hz to 700 Hz). Therefore, vibrotactile and auditory information can be naturally perceived as an interleaved signal, which is then also processed in partially shared brain regions, both in hearing participants and in congenitally deaf individuals [29,30,31,32,33,34,35].
Given these similarities, we developed an in-house audio-to-touch (assistive) Sensory Substitution Device (SSD). SSDs convey information typically delivered by one sensory modality (e.g., vision) through a different sensory modality (audition, touch) using specific translation algorithms that can be learned by the user [36,37,38]. A classic example is a chair developed by Prof. Bach-y-Rita that delivered a visual image to the backs of its blind users through patterns of vibration [39].
In the first work with our audio-to-touch SSD we showed an immediate and robust mean improvement of 6 dB in speech-in-noise understanding (Speech Reception Threshold) when auditory speech was complemented with low-frequency tactile vibration delivered on the fingertips [28]. Importantly, in that experiment the improvement occurred without any training, in contrast to numerous other works using SSDs, which required hours of training and/or prolonged use to yield benefits, probably due to the complexity of the applied algorithms [40,41,42,43,44].
The key goal of the current study was to investigate whether understanding of speech delivered as a multisensory audio-tactile input can improve even further when a short training session is applied. This research is important both for practical reasons and for enhancing the basic-science understanding of multisensory processing and cross-modal integration. It is especially intriguing because the audio-tactile multisensory context is utterly novel for speech perception and is only learned in adulthood. This contrasts with audio-visual speech, i.e. listening to speech and speech/lip reading at the same time, which is a natural language input acquired in critical/sensitive periods of development and to which we are all exposed throughout our lifetime. In terms of practical applications, we believe that our research can inform the development of potentially rapid and successful rehabilitation protocols using touch to improve speech perception in patients with hearing loss, as well as the design of novel technologies to assist the general public in practical cases. These include understanding speech in noisy environments when wearing a mask, when trying to learn a new language, or when talking on the phone.

Materials and Methods
Seventeen (N = 17) native Hebrew speakers (6 male/11 female; mean age 27 years) participated in the study, mostly university students or their friends. Participants were all right-handed and reported no history of neurological/neurodevelopmental impairments. All were also fluent speakers of English and used English on an everyday basis. All research was performed in accordance with the relevant guidelines/regulations and the Declaration of Helsinki. Informed consent was obtained from all participants, and they were compensated for participation. The experiment was approved by the Institutional Review Board of the School of Psychology, IDC Herzliya.
Subjects participated in a series of tasks. In all of them we used our in-house audio-to-touch Sensory Substitution Device (SSD) (designed in collaboration with the World Hearing Centre in Warsaw, Poland and http://www.neurodevice.pl/en), noise-cancelling headphones (BOSE QC35 IIA), a 5.1 soundcard (Creative Labs, SB1095) and a PC. The SSD was used to deliver the speech signal as tactile vibration on the index and middle fingers of the dominant hand. A dedicated MATLAB (version R2016a, The MathWorks Inc., Natick, MA, USA) application with a user-friendly GUI was developed to run the study. Only a brief description of the study set-up is provided here, as all the details with accompanying figures can be found in our previous publication [28].
In the current study all participants took part in three tests of speech comprehension, once before and once after a short training session. The main differences in the experimental paradigm compared to our previous work [28] were one additional test condition and a short, dedicated training session, introduced to see whether subjects would improve further in understanding novel, non-trained sentences.

Practice
The first test session was preceded by a brief practice of listening to two different vocoded (i.e. modified to resemble stimulation through a cochlear implant system) and non-vocoded sentences from the English HINT sentence database [45], with and without accompanying vibration, to familiarize the participants with the study setting. The details of the vocoding procedure can be found in our previous work [28].

Speech Reception Threshold tests
In the actual tests, as well as in the training session, the task of the participants was to repeat sentences presented via noise-cancelling headphones. All the sentences were vocoded using an in-house algorithm and presented against background noise (IFFN; URL: https://www.ehima.com/). There were 3 test conditions, both before and after training, all with sentences presented in the headphones: a) with no concurrent vibration delivered on the fingertips (audio only; hereafter A for Audio), b) together with low frequencies, including the fundamental frequency (f0), extracted from the heard sentence, i.e. matching f0, delivered as vibration on the fingertips (hereafter AM, as in "Audio-Matching"), c) together with low frequencies, including the fundamental frequency, not corresponding (i.e. non-matching) to the heard sentence, delivered as vibration on the fingertips (hereafter AnM). The A condition always came first, followed by the AM condition in 9 participants and by the AnM condition in 8 participants. As described in detail in our previous work [28], f0 had been extracted beforehand from each original sentence derived from the English HINT database, using a further improved version of the STRAIGHT algorithm. The outcome measure of each test was the Signal-to-Noise Ratio (SNR) for 50% understanding, i.e. the Speech Reception Threshold (SRT), of the target (vocoded) lists of sentences against background noise. For each test 20 different sentences were used (2 lists of 10 HINT sentences) and the difficulty level was adapted based on individual performance (2 dB up, 2 dB down). The applied adaptive procedure is typical of the clinical ENT setting when assessing speech understanding in hearing aid and/or cochlear implant users [46]. After the training session participants performed the second round of the 3 test conditions (A, AM, AnM).
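The adaptive procedure described above can be illustrated with a minimal sketch. It is not the MATLAB application used in the study: the function names, the starting SNR, the direction of the steps (SNR lowered after a correct repetition, raised after an incorrect one) and the rule of averaging the SNRs after the first four sentences are all assumptions made for illustration, since the text specifies only the 2 dB step size and the 20 sentences per test.

```python
def run_adaptive_track(responses, start_snr_db=20.0, step_db=2.0):
    """Return the SNR (dB) at which each sentence was presented.

    responses: list of booleans, True if the sentence was repeated correctly.
    Assumed rule: a correct repetition makes the next sentence harder
    (SNR lowered by step_db); an incorrect one makes it easier (SNR raised).
    """
    snr = start_snr_db
    track = []
    for correct in responses:
        track.append(snr)
        snr += -step_db if correct else step_db
    return track


def estimate_srt(track, skip=4):
    """Estimate the SRT as the mean presentation SNR after the first
    `skip` sentences (an assumed, illustrative averaging rule)."""
    tail = track[skip:]
    return sum(tail) / len(tail)


# Hypothetical listener: four correct answers, then alternating errors and hits.
responses = [True] * 4 + [False, True] * 8   # 20 sentences in total
track = run_adaptive_track(responses)
srt = estimate_srt(track)
```

With a symmetric one-up, one-down rule of this kind, the track oscillates around the SNR at which about half of the sentences are repeated correctly, which is why the average of the later presentation levels can serve as an estimate of the 50% point.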

Training
After the initial series of tests each person participated in a short training session. The training consisted of listening to and repeating 148 vocoded sentences, each accompanied by matching low frequencies delivered on the fingertips of two fingers via our SSD. The SRT calculated for each person before training in the test condition involving auditory and matching tactile input (AM) was used as the SNR throughout the whole training session. Each sentence was presented up to 3 times as a combined audio-tactile input, and if the person was unable to repeat it correctly, the sentence was presented as text on the PC screen in front of the participant (black font, middle of the screen). A sentence that required such visual feedback remained in the database of training sentences and was presented again at the end of the session. The training continued until all 148 sentences had been repeated correctly without visual feedback. In all tested participants the training lasted approx. 35-50 min. Visual rather than auditory feedback was chosen because we intend to apply the same testing (and training) procedures in participants with hearing deficits in the future.
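The training schedule described above amounts to a simple queue with re-insertion, sketched below. This is an illustration under stated assumptions, not the study software: `run_training` and the `listener_repeats_correctly` callback are hypothetical names, and the callback stands in for the experimenter's judgement of whether a sentence was repeated correctly on a given presentation.

```python
from collections import deque


def run_training(sentences, listener_repeats_correctly, max_presentations=3):
    """Present each sentence up to `max_presentations` times.

    A sentence that is never repeated correctly is shown as text (visual
    feedback) and moved to the end of the queue. The session ends once every
    sentence has been repeated correctly without visual feedback.
    Returns the presentation order and the number of visual-feedback events.
    """
    queue = deque(sentences)
    order = []
    visual_feedback_count = 0
    while queue:
        sentence = queue.popleft()
        order.append(sentence)
        # any() stops at the first successful repetition
        repeated = any(
            listener_repeats_correctly(sentence) for _ in range(max_presentations)
        )
        if not repeated:
            visual_feedback_count += 1   # sentence displayed as text
            queue.append(sentence)       # and presented again later
    return order, visual_feedback_count


# Hypothetical listener who needs visual feedback once for sentence "b".
attempts = {}

def listener(sentence):
    attempts[sentence] = attempts.get(sentence, 0) + 1
    return sentence != "b" or attempts[sentence] > 3

order, n_feedback = run_training(["a", "b", "c"], listener)
```

Note that the loop only terminates if every sentence is eventually repeated correctly, which mirrors the stopping criterion stated in the text.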

Results
Most importantly, we found that a short multisensory training session provides a robust, statistically significant improvement of almost 10 dB in SRT for understanding speech in noise when the speech is accompanied by matching tactile frequencies delivered on the fingertips. The 10 dB difference in SRT indicates that after training the same performance (50% understanding) was achieved with background noise perceived as twice as loud. For the trained AM condition the mean improvement was 9.8 ± 6.8 dB (from 12.1 ± 5 dB to 2.4 ± 5.4 dB, p = 0.001, Wilcoxon Signed Rank test, two-tailed asymptotic significance), with 16 out of the 17 tested participants obtaining a better SRT after training for novel, untrained lists of sentences. This result is shown in Figs. 1A and 1B.
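To make the interpretation of a 10 dB change concrete, the short sketch below converts a level difference in decibels into a power ratio, an amplitude ratio and an approximate perceived-loudness ratio. The first two follow from the definition of the decibel. The third relies on the common psychoacoustic rule of thumb that a 10 dB increase is perceived as roughly a doubling of loudness, which is an approximation assumed here and not a result of this study; the function names are illustrative.

```python
def power_ratio(delta_db):
    """Ratio of sound powers corresponding to a level difference in dB."""
    return 10 ** (delta_db / 10)


def amplitude_ratio(delta_db):
    """Ratio of sound pressure amplitudes for a level difference in dB."""
    return 10 ** (delta_db / 20)


def perceived_loudness_ratio(delta_db):
    """Approximate loudness ratio, assuming loudness doubles every 10 dB
    (a rule of thumb, not an exact law)."""
    return 2 ** (delta_db / 10)


# A 10 dB shift corresponds to a tenfold change in power but, under the
# rule of thumb, only to about a doubling of perceived loudness.
for delta in (9.8, 10.0):
    print(delta, power_ratio(delta), amplitude_ratio(delta),
          perceived_loudness_ratio(delta))
```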
After training, the SRT obtained for the AM test condition was significantly lower than the SRT obtained for the AnM condition, indicating better performance (p = 0.007). Before the training, scores in these two conditions were not statistically different (p = 0.3). The SRT for the AnM condition also improved after training, by 3.8 ± 6.5 dB (from 11.2 ± 6.7 dB to 7.3 ± 4.8 dB), with 13 out of the 17 tested subjects showing improvement (see Figs. 2A and 2B). Compared to the trained AM condition, this improvement was smaller and reached a much lower level of statistical significance (p = 0.02, Wilcoxon Signed Rank test).
The third tested condition, audio only (A), was always presented first, both before and after training. Therefore, we refrain from comparing the SRT values obtained for this test condition with those for the AM and AnM conditions, as the differences may solely reflect the effect of order. Nevertheless, it should be noted that the SRT for the A condition also improved significantly after the AM training session, from 17.7 ± 7.7 dB to 7 ± 5 dB (p < 0.001), with 14 out of 17 participants showing improvement. We explain in the Discussion section why this outcome is important for the development of multisensory training regimens for the hearing-impaired population.
An additional analysis revealed positive, statistically significant correlations between some of the SRT test outcomes, namely between: (a) the AM and the AnM SRT values before training (r = 0.5, p = 0.04), and (b) the A and the AM SRT values after training (r = 0.51, p = 0.035).

Discussion
We show here that understanding of speech in noise improves significantly after a short and simple training session when auditory sentences are combined with a tactile input that corresponds to their low frequencies, including the fundamental frequency (f0). The effect that we report for the trained condition (though using novel sentences), a mean decrease of 10 dB in the speech reception threshold, is profound, as it represents maintaining the same performance with background noise perceived as twice as loud [47]. The current work expands our previous findings [28], which showed that understanding of degraded speech against background noise improves immediately, by a mean of 6 dB in SRT and without any training, when the speech is accompanied by corresponding and synchronized low frequencies delivered as vibrotactile input on the fingertips. Interestingly, in both our studies we found a very similar mean score in the audio-tactile condition with matching vibration (AM before training in the current study), with a mean group SRT of approx. 12 dB. The sentence database, the applied vocoding algorithm and the background noise were identical in both experiments. Obtaining similar scores in a total of almost 30 participants strengthens the interpretations we propose. Interestingly, and as we expected, the AM training applied in the current study also led to improved speech-in-noise understanding in the unisensory, auditory-only test condition.
These main outcomes of the study have several implications, both for basic science and in terms of potential practical applications. Intriguingly, we show here that the improvement in speech comprehension through audio-tactile training (~10 dB in SRT) was similar in magnitude to, or higher than, that found in audio-visual speech-in-noise testing settings when performance is compared to auditory-only input [5,48,49,50,51]. In those studies the reported improvement in SRT ranged from 3 to 11 dB, although the applied methodology and the language content varied. The implications of this finding for basic science are discussed further in the text. In addition, showing improvement in speech understanding in both the multisensory and the unisensory test setting is especially important for the development of novel technologies for the general public and of rehabilitation programs for patients with hearing problems. We discuss this in more detail below.

Multisensory perceptual learning
The findings of our study are an example of rapid perceptual learning, which has already been shown for various acoustic contexts and distortions of the auditory signal, including natural speech presented in background noise, as well as synthetically manipulated vocoded/time-compressed speech [25,52,53,54]. Furthermore, our results agree with the emerging scientific literature demonstrating that learning and memory of sensory/cognitive experiences are more efficient when the applied inputs are multisensory [22,23]. The experimental procedure was also specifically designed to benefit from a fundamental rule of multisensory integration, namely the inverse effectiveness rule [1,55,56,57]. Although mainly shown for more basic sensory inputs, the rule predicts that multisensory enhancement, i.e. the benefit of adding information through an additional sensory channel, is especially profound for low signal-to-noise conditions and when learning novel tasks [56]. In our study the auditory speech signal was new to the study participants, degraded (vocoded), presented against background noise and in their non-native language. All these manipulations led to a low signal-to-noise context and made the task of understanding the sentences challenging, thereby increasing the chance of improving performance by adding a specifically designed tactile input.
At the same time, the study conditions were ecologically valid, in that we aimed to recreate an everyday challenging acoustic situation encountered by both hearing-impaired patients and healthy people: being exposed to two auditory streams at the same time. We showed here that auditory stream segregation, and specifically focusing on one speaker, can be facilitated by adding vibrotactile stimulation that is congruent with the target speaker (see [5] for a discussion of similar benefits of audio-visual binding).

Design of the SSD: considerations for the current study
To deliver tactile inputs, we developed our own audio-to-touch SSD. In the ever-growing body of literature it has been suggested that SSDs can advance the benefits of multisensory training even further than mentioned above, since they convey input from one modality via another in a way that is specifically tailored to the neuronal computations characteristic of the original sensory modality [36,37,38,39,40,58,59,60,61]. In our study we used the SSD to deliver, through touch, an input complementary to the auditory speech that nevertheless maintained features typical of the auditory modality, as vibrations are also a periodic signal that fluctuates in frequency and intensity.
At the same time, the applied frequency range of the inputs was detectable by the tactile Pacinian corpuscles, which are most densely represented on the fingertips and most sensitive to frequencies in the range of 150-300 Hz (and up to 700-1000 Hz) [62]. Specifically, the tactile vibration provided on the fingertips of our participants was the part of the signal below (and including) the extracted fundamental frequency of the speech signal. Access to this low-frequency aspect of the temporal fine structure of speech is specifically reduced in patients with sensorineural hearing loss, including those using a cochlear implant. It has been shown that lack of this input profoundly impairs speech understanding in challenging acoustic situations, especially with several competing speakers [63,64,65].
Apart from the current and previous work of our lab, one other research group has shown that adding low-frequency input delivered on the fingertips can improve comprehension of auditory speech presented in background noise in normal-hearing individuals [26]. The main advantage of our approach, however, was that we estimated the SNR for the various task conditions with an adaptive procedure, as opposed to using fixed SNRs, thereby avoiding floor and ceiling effects [46]. In addition, we showed a larger benefit of adding the specifically designed vibrotactile input than the other group did, possibly because the stimuli we used were in the non-native language of the participants, which further enhanced the effect of the inverse effectiveness rule. Using the native language of the participants, and thus making the task easier, may also be the reason why Fletcher and colleagues failed to show a benefit of adding the vibrotactile input before training (which we did show in our previous work [28]).

The audio-tactile interplay
Interestingly, in the current study we also saw improvement in speech understanding in the control test condition, namely when the auditory sentences were paired with vibrotactile inputs corresponding to a different sentence than the one presented through audition, i.e. paired with non-matching vibration (AnM). The degree of improvement was, however, far less robust (mean of 4 dB in SRT) and far less statistically significant than the improvement reported for the trained AM condition (mean of almost 10 dB in SRT). We introduced this control test condition to acquire more information about the mechanisms of multisensory enhancement. Interestingly, we found that the group scores before training for the two conditions combining an auditory and a tactile input (matching and non-matching) were almost the same, and they also correlated (r = 0.5). We hypothesize that at this point the participants were still becoming familiar with the study set-up and the vibrotactile input, and that only after a short, dedicated training session did they "realize" the true benefit of the matching tactile f0. It is also possible that presenting these two conditions next to one another in the pre-training session caused confusion.
In our future studies we would like to investigate further the role of the vibrotactile input delivered on two fingertips and the audio-tactile binding effects. It remains to be studied whether it is of crucial importance that the amplitude and frequency fluctuations delivered as vibrations follow the auditory signal precisely. Alternatively, non-matching vibrations whose spectro-temporal characteristics are nevertheless still very different from the background speech signal may provide the same benefit, if trained. To answer this question, in our future work we will include two additional training groups: one exposed to the degraded speech input with concurrent vibrotactile stimulation representing a fundamental frequency that does not match the auditory sentence (an alternative multisensory training), and one receiving training that is only auditory (a unisensory training).

Implications for rehabilitation
The findings of our study have implications for auditory rehabilitation programs for the hearing impaired (including the elderly). Besides providing a large improvement in speech recognition in noise, our set-up is also intuitive to the user and thus requires minimal cognitive effort. In addition, it is relatively undemanding in terms of technical and time resources, and we are working on reducing the size of the device even further. This contrasts with the more cumbersome solutions using tactile inputs that are available on the market [43,66].
Interestingly, after audio-tactile training the participants of our study also improved when speech understanding was tested only through the auditory channel, with the tactile stimulation removed. These findings represent a transfer of a short multisensory training not only to novel multisensory stimuli but also to the unisensory modality (audio only). The scores for the auditory-only and the AM conditions after training were also correlated (which was not the case before training), suggesting common learning mechanisms. We were hoping to see such an effect, as this outcome has crucial implications for the development of rehabilitation programs for hearing-impaired patients, including current and future users of hearing aids and cochlear implants. Our in-house SSD and the multisensory training procedure can potentially be applied to a population of HA/CI users to help them progress in their auditory (i.e. unisensory) performance. A similar idea, of multisensory training boosting unisensory auditory and visual speech perception, was shown by Bernstein and colleagues [20] and by Eberhard and colleagues [25], respectively, although the applied language tasks were more basic than repeating whole sentences. At the same time, for future hard-of-hearing candidates for cochlear implantation, we believe that both unisensory tactile and multisensory audio-tactile training can be applied using our set-up, with the aim to "prepare" the auditory cortex for future processing of its natural sensory input [22,67]. We believe that this can be achieved, based on a number of works from our lab which show that specialization of sensory brain regions, such as the (classically termed) visual or auditory cortex, can also emerge following specifically designed training with inputs from another modality, provided these inputs preserve the computational features specific to a given sensory brain area. We have been referring to this type of brain organization in our works as Task-Selective and Sensory-Independent (TSSI) [21,60].

A revised critical period theory
We show here, with our device, significant multisensory enhancement of speech-in-noise comprehension, at a level comparable to or higher than that reported when auditory speech in noise is complemented with cues from lip/speech reading [5,48,49,50,51]. This is an interesting finding, as synchronous audio-visual speech information is what we as humans are exposed to from the very early years of development and throughout our lifetime. The brain networks for speech processing, which combine higher-order and sensory brain structures and often involve auditory and visual cortices, are also well established. At the same time, exposure to an audio-tactile speech input is an utterly novel experience. We argue, therefore, that one can establish in adulthood a new coupling between a given neuronal computation and an atypical sensory modality that was never used for encoding that type of information before. We also show that this coupling can be leveraged to improve performance through tailored SSD training. Quite interestingly, this can be achieved even for a very complex and dynamic signal such as speech. The current study thus provides further evidence supporting our new conceptualization of critical/sensitive periods in development, as presented in our recent review paper [22]. Although, in line with the classical assumptions, we agree that brain plasticity spontaneously decreases with age, we also believe that it can be reignited across the lifespan, even with no exposure to certain unisensory or multisensory experiences during childhood. Several studies from our lab and other research groups, mainly involving patients with congenital blindness, as well as the current study, point in that direction [22,68,69,70].

Summary, future applications and research directions
In summary, we show that understanding of speech in noise greatly improves after a short multisensory audio-tactile training with our in-house SSD. The results of the current experiment expand our previous findings, where we showed a clear and immediate benefit of complementing an auditory speech signal with tactile vibrations on the fingertips, with no training at all [28]. Our research and the specifically developed experimental set-up are novel and contribute to the very scarce literature on audio-tactile speech comprehension.
We believe that the development of assistive communication devices involving tactile cues is especially needed in the time of the COVID-19 pandemic, which imposes numerous restrictions on live communication, including limited access to visual speech cues from lip reading. Besides the discussed rehabilitation regimes for the hearing impaired and the elderly, our technology and the tactile feedback can also assist normal-hearing individuals in second language acquisition, in improving their appreciation of music, and when talking on the phone. Furthermore, our lab has already started developing new minimal tactile devices that can provide vibrotactile stimulation on other body parts, beyond the fingertips. Our aim is to design a set-up that will be wearable and will therefore assist with speech-in-noise comprehension (and sound source localization) in real-life scenarios.
In addition, our current SSD is compatible with a 3T MRI scanner. We have already collected functional magnetic resonance imaging (fMRI) data in a group of participants performing the same tasks of unisensory and multisensory speech comprehension. To our knowledge this study will be the first to look into the neural correlates of understanding speech presented as combined auditory and vibrotactile stimulation. Our results can help uncover the brain mechanisms of speech-related audio-tactile binding, and may elucidate the neuronal sources of inter-subject variability in the benefits of multisensory learning. This latter aspect, in turn, could further direct rehabilitation and training programs. Future fMRI studies are planned in the deaf population in Israel to investigate the neural correlates of perceiving a closed set of trained speech stimuli solely through vibration.

Figure 1 The AM test condition: outcomes before and after AM training, (A) at the group level, (B) in single participants; SRT - Speech Reception Threshold; PRE - test performed before training; POST - test performed after training; the bars are error bars; participant no. 15 had an SRT = 0 dB in the POST training AM test.