1. Speech Recognition Model
OpenAI Whisper 1 is a robust multilingual speech recognition model that is publicly available and open-source (https://github.com/openai/whisper/). Whisper is built on the transformer architecture 17. The transformer's ability to process contextual information within and across sentences has made it a widely adopted and validated technique for natural language processing. Whisper was trained on a very large speech corpus of around 680,000 hours of multilingual data. The training set consisted of naturally produced speech tokens of varying quality, recorded in a range of realistic and noisy environments. With default settings, the Whisper model assigns a probability to the possible candidates for each word in a sentence and randomly chooses among the candidates assigned the highest probability. This probabilistic decoding process provides flexibility in the model's decisions, which can result in better performance 1. Probabilistic decoding may also produce variability in the model's output across different runs on the same speech token, which is somewhat analogous to perceptual variability and uncertainty in human speech perception.
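For illustration, below is a minimal sketch of transcribing a single stimulus with the openai-whisper Python package; the model size and file name are placeholders, not the configuration used in this study.

```python
# Minimal sketch: transcribing one audio file with openai-whisper.
import whisper

model = whisper.load_model("large")  # model size is a placeholder

# With default decoding settings, transcription can involve sampling,
# so repeated runs on the same file may produce different outputs.
result = model.transcribe("vocoded_sentence.wav")
print(result["text"])
```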
2. Vocoder Processing
Noise-vocoded stimuli were generated using the vocoder signal processing shown in Figure 9. The input speech signal was passed through N non-overlapping band-pass filters, and the band-passed signals were half-wave rectified and low-pass filtered to obtain the envelope of each of the N channels. In conventional vocoder processing, the resulting channel envelopes modulated narrow-band random noise signals that had been filtered over the same band-pass frequency range as the corresponding channel. In some of the vocoder simulations of this study, the extracted envelopes were subjected to additional degradations before being multiplied by the random noise, including dynamic range reduction and quantization. Finally, the modulated noise bands were band-pass filtered by the corresponding N non-overlapping filters and summed to produce the final vocoder output. Band-pass and low-pass filters were implemented as sixth- and fourth-order Butterworth filters, respectively.
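As a concrete illustration, the sketch below implements this processing chain with NumPy and SciPy; function and variable names are ours. The filter orders follow the text (note that in SciPy a band-pass design of order 3 yields a sixth-order filter).

```python
import numpy as np
from scipy.signal import butter, sosfilt

def noise_vocoder(signal, fs, band_edges, env_cutoff):
    """Sketch of the N-channel noise vocoder chain in Figure 9.

    signal     : float array, input speech waveform
    fs         : sampling rate in Hz
    band_edges : array of N+1 channel boundary frequencies in Hz
    env_cutoff : envelope low-pass cutoff frequency in Hz
    """
    rng = np.random.default_rng()
    output = np.zeros_like(signal)
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        # Order 3 here gives a sixth-order band-pass Butterworth filter
        bp = butter(3, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(bp, signal)
        # Half-wave rectify, then fourth-order low-pass filtering
        # extracts the channel envelope
        lp = butter(4, env_cutoff, btype="lowpass", fs=fs, output="sos")
        envelope = sosfilt(lp, np.maximum(band, 0.0))
        # The envelope modulates a narrow-band noise carrier; the product
        # is band-pass filtered again and summed across channels
        carrier = sosfilt(bp, rng.uniform(-1.0, 1.0, size=signal.shape))
        output += sosfilt(bp, envelope * carrier)
    return output
```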
The vocoder processing parameters that were manipulated in this study, and the range of values for each parameter, are shown in Table 1. These parameters were intended to simulate important signal processing and perceptual aspects of sound processing with auditory implants. The rationale for choosing each parameter is described below.
The number of channels was manipulated to simulate different numbers of active electrodes in an auditory implant device. It was varied between 4 and 22, which spans the range of active electrodes in clinical cochlear implant and auditory brainstem implant devices.
The frequency boundaries of the vocoder channels were chosen in two ways: 1) equal bandwidth across channels when expressed in octaves, and 2) frequencies mimicking the standard channel boundaries of commercial Cochlear Ltd. devices. The aim of the frequency boundary manipulations was to evaluate how different approaches to determining frequency boundaries affect speech recognition at different numbers of channels. To create equal-octave channels, the overall frequency range was converted to octaves and divided by the number of channels; the resulting bandwidth was used to determine all the channel frequency boundaries, starting from the first channel (see the sketch below). The lowest and highest cutoff frequencies of the signal were manipulated to test the effects of the overall frequency range on the model's speech recognition performance.
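The equal-octave boundary computation can be sketched as follows (the function name and example values are ours):

```python
import numpy as np

def equal_octave_edges(f_low, f_high, n_channels):
    """Channel boundaries with equal bandwidth in octaves.

    The overall range in octaves is log2(f_high / f_low); dividing by
    the number of channels gives the per-channel octave width.
    """
    total_octaves = np.log2(f_high / f_low)
    step = total_octaves / n_channels
    return f_low * 2.0 ** (step * np.arange(n_channels + 1))

# Example: 8 equal-octave channels between 100 Hz and 8 kHz
print(equal_octave_edges(100.0, 8000.0, 8))
```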
The envelope low-pass cutoff frequency was manipulated to evaluate the range of envelope frequencies that provide usable speech cues. The tested envelope cutoff frequencies ranged between 5 and 400 Hz. The results have significant implications for determining the range of envelope variations that should be preserved by auditory implant devices and made accessible to device users.
The envelope dynamic range was manipulated by limiting the range of envelope values in each channel. The aim of this manipulation was to evaluate the range of envelope amplitudes that contribute to speech recognition. The dynamic range was manipulated by varying the low end of the envelope values, while the high end was set to the maximum envelope value across all channels for a given stimulus. Envelope values outside the dynamic range were set to zero. The envelope dynamic range was expressed as the ratio between the high and low envelope values on a logarithmic scale (dB). Dynamic range values tested in this study ranged between 10 and 150 dB.
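As an illustration, the sketch below applies this limiting to an array of channel envelopes; the function name is ours, and treating the dB ratio as an amplitude ratio (20·log10) is an assumption.

```python
import numpy as np

def limit_dynamic_range(envelopes, dr_db):
    """Restrict channel envelopes to a dynamic range of dr_db decibels.

    The high end is the maximum envelope value across all channels of
    the stimulus; values below the resulting floor are set to zero.
    """
    high = envelopes.max()
    floor = high * 10.0 ** (-dr_db / 20.0)  # assumes amplitude dB
    out = envelopes.copy()
    out[out < floor] = 0.0
    return out
```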
The number of quantized envelope steps was manipulated to simulate the number of envelope steps discriminable by individual implant users, which can range from just a few to several tens among auditory implant patients 25. The aim of this simulation was to evaluate the extent to which impaired sensitivity to changes in stimulus level could disrupt speech perception by implant users. Envelope quantization was implemented by rounding envelope values to the closest quantized level. Quantized levels were obtained by dividing the envelope dynamic range (in dB) into equal dB steps. Logarithmic steps were used to be consistent with discriminable acoustic level steps in normal-hearing human listeners. The number of quantization steps tested in this study was between 1 and 100 (Table 1). It should be noted that, due to the band-pass filtering at the final stage of vocoder processing (Figure 9), quantized envelopes would likely be slightly smeared and may not be precisely stepwise in the vocoder output.
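A sketch of the quantization step follows, under the same amplitude-dB assumption as above; the exact placement of the quantized levels within the dynamic range is not fully specified in the text, so rounding to equal-dB levels relative to the stimulus maximum is an assumption.

```python
import numpy as np

def quantize_envelope(envelopes, n_steps, dr_db):
    """Round nonzero envelope values to the nearest of n_steps
    equal-dB levels spanning the dynamic range dr_db."""
    high = envelopes.max()
    out = np.zeros_like(envelopes)
    nz = envelopes > 0  # zeros (outside the dynamic range) stay zero
    # Envelope values in dB relative to the stimulus maximum
    level_db = 20.0 * np.log10(envelopes[nz] / high)
    step = dr_db / n_steps
    # Round to the nearest quantized dB level, then convert back
    quant_db = np.round(level_db / step) * step
    out[nz] = high * 10.0 ** (quant_db / 20.0)
    return out
```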
3. Speech Testing
3.1. Stimuli
We tested the Whisper model on a set of 60 AzBio sentences 18 and 200 CNC words. The same set of words and sentences was used for all vocoder conditions. Each of the 60 AzBio sentences was a separate audio file. The 200 CNC words were presented in lists of 30 words each. The words were divided into lists, rather than presented one word at a time, for the model's efficiency and accuracy. By default, the Whisper model processes audio input as 30-second clips and pads shorter files with zeros; the model therefore runs much faster on a 30-second audio file containing a list of words than on multiple single-word audio files. We also found that the model output was more accurate when it was fed a list of words rather than one word at a time. The vocoded sentence and word stimuli were tested in quiet and in the presence of speech-shaped noise at 5 dB SNR (signal-to-noise ratio). Speech-shaped noise was generated using the pyAcoustics Python package (https://github.com/timmahrt/pyAcoustics/blob/main/pyacoustics). Because probabilistic decoding was used at the decoding stage of the Whisper model, the model output could vary each time it was run. We therefore ran the Whisper model 5 times for each set of 60 sentences and 200 words at each vocoder condition and used these repetitions to estimate the mean and the standard error of the mean.
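For illustration, below is a minimal sketch of mixing a stimulus with pre-generated speech-shaped noise at a target SNR; RMS-based scaling is a standard approach, and the function name is ours. The noise generation itself (via pyAcoustics) is omitted.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db=5.0):
    """Scale noise so the speech-to-noise RMS ratio equals snr_db."""
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    noise = noise[: len(speech)]  # assumes noise at least as long as speech
    gain = rms(speech) / (rms(noise) * 10.0 ** (snr_db / 20.0))
    return speech + gain * noise
```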
3.2. Scoring
Since it was not feasible to grade hundreds of thousands of words and sentences by hand, we implemented automated scoring algorithms to analyze the sentence and word results. Scoring was done by comparing the output of the Whisper model, which we will refer to as the model result, to the correct transcription of the input audio, which we will refer to as the audio transcript. The first step in automatic scoring was to "clean" the model results and audio transcripts by removing all punctuation and lower-casing all letters. The cleaned model results were then compared to the corresponding audio transcripts. We implemented separate grading algorithms for sentence and word stimuli.
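A minimal sketch of this cleaning step (the function name is ours):

```python
import string

def clean(text):
    """Remove punctuation, lower-case, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()
```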
Sentence Scoring Algorithm: Each sentence was graded by counting the number of words common to the model result and the audio transcript of that sentence. The total percent-correct score for multiple sentences was obtained by summing the individual sentence grades and dividing by the total number of words in the set of sentences. The algorithm did not take word order into account; for example, the two sentences "Abbey skipped rocks" and "Skipped Abbey rocks" were graded as 3 correct words. This approach is consistent with how speech perception tests are commonly graded.
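One way to implement this order-insensitive count is a multiset intersection; the sketch below assumes cleaned word lists as input.

```python
from collections import Counter

def sentence_score(result_words, transcript_words):
    """Count words shared between the model result and the transcript,
    ignoring word order (multiset intersection)."""
    common = Counter(result_words) & Counter(transcript_words)
    return sum(common.values())

# "abbey skipped rocks" vs. "skipped abbey rocks" -> 3 correct words
print(sentence_score(["abbey", "skipped", "rocks"],
                     ["skipped", "abbey", "rocks"]))
```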
Word Scoring Algorithm: The word scoring algorithm compared the model result for each 30-word list to the audio transcript of the list on a word-by-word basis. To obtain the final score, the number of words correctly recognized by the model was divided by the total number of words. One major challenge in scoring words was that the model result and the audio transcript did not always contain the same number of words, because the model could miss words or interpret a single word as multiple words. We addressed this issue with an iterative approach that finds the alignment yielding the highest similarity between the audio transcript and the model result. The optimal word-by-word alignment was found by iteratively removing single words from the model result or the audio transcript and calculating the similarity score between the remaining words. Each removed word was then replaced, and the scoring was repeated until every word had been removed once. The word whose removal produced the highest similarity score was then permanently removed, and the process of removing single words and calculating similarity scores was repeated. The algorithm stopped when the model result and the audio transcript contained equal numbers of words. The highest score obtained across alignments was used as the correct word score.
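The sketch below illustrates this iterative alignment, under the assumption that the similarity score is the position-wise match count between the two word lists (the text does not fully specify the metric); names are ours.

```python
def align_and_score(result, transcript):
    """Iteratively trim the longer word list until both lists are the
    same length, keeping the removal that maximizes similarity."""
    result, transcript = list(result), list(transcript)
    n_total = len(transcript)  # total words before any trimming

    def similarity(a, b):
        # position-wise match count between two word lists
        return sum(x == y for x, y in zip(a, b))

    while len(result) != len(transcript):
        longer = result if len(result) > len(transcript) else transcript
        shorter = transcript if longer is result else result
        # Try removing each word of the longer list in turn and keep
        # the removal that yields the highest similarity
        best_i = max(range(len(longer)),
                     key=lambda i: similarity(longer[:i] + longer[i + 1:],
                                              shorter))
        del longer[best_i]  # permanently remove the best candidate

    return similarity(result, transcript) / n_total
```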
Another issue in scoring words was homophones: words that are pronounced the same but spelled differently, such as "whine" and "wine". Homophones are phonetically identical and should be scored as correct when comparing the speech recognition model's results to the audio transcript. Homophones were handled using the grapheme-to-phoneme (g2p) Python package (https://pypi.org/project/g2p-en/): all word pairs with mismatched spellings were compared by phonetic content rather than spelling.
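A sketch of this phonetic comparison using g2p-en; the helper name is ours.

```python
from g2p_en import G2p

g2p = G2p()

def same_pronunciation(word_a, word_b):
    """Score differently spelled words as correct when their
    phoneme sequences match."""
    return g2p(word_a) == g2p(word_b)

print(same_pronunciation("whine", "wine"))  # expected: True
```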