1. Speech Recognition Model
OpenAI Whisper 1 is a robust multilingual speech recognition model that is publicly available and open-source (https://github.com/openai/whisper/). Whisper is built on the transformer architecture 17. The transformer's ability to process contextual information within and across sentences has made it a widely adopted and validated technique for natural language processing. Whisper was trained on a very large speech corpus of around 680,000 hours of multilingual data. The training set consisted of naturally produced speech tokens of varying quality, recorded in a range of realistic and noisy environments. With default settings, the Whisper model assigns a probability to the possible candidates for each word in a sentence and randomly chooses among the candidates assigned the highest probability. This probabilistic decoding process provides flexibility in the model's decisions, which can result in better performance 1. Probabilistic decoding may also produce variability in the model's output across different runs on the same speech token, which is somewhat analogous to perceptual variability and uncertainty in human speech perception.
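For illustration, below is a minimal sketch of transcribing a single stimulus with the openai-whisper Python package; the model size and file name are placeholders, not the configuration used in this study.

```python
# Minimal sketch: transcribing one audio file with openai-whisper.
import whisper

model = whisper.load_model("large")  # model size is a placeholder

# With default decoding settings, transcription can involve sampling,
# so repeated runs on the same file may produce different outputs.
result = model.transcribe("vocoded_sentence.wav")
print(result["text"])
```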
2. Vocoder Processing
Noise-vocoded stimuli were generated using the vocoder signal processing shown in Figure 9. The input speech signal was passed through N non-overlapping band-pass filters, and the band-passed signals were half-wave rectified and low-pass filtered to obtain the envelope of each of the N channels. In conventional vocoder processing, the resulting channel envelopes modulated narrow-band random noise signals that had been filtered over the same band-pass frequency range as the corresponding channel. In some of the vocoder simulations of this study, the extracted envelopes were subjected to additional degradations before being multiplied by the random noise, including dynamic range reduction and quantization. Finally, the modulated noise bands were band-pass filtered by the corresponding N non-overlapping filters and summed to produce the final vocoder output. Band-pass and low-pass filters were implemented as sixth- and fourth-order Butterworth filters, respectively.
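As a concrete illustration, the sketch below implements this processing chain with NumPy and SciPy; function and variable names are ours. The filter orders follow the text (note that in SciPy a band-pass design of order 3 yields a sixth-order filter).

```python
import numpy as np
from scipy.signal import butter, sosfilt

def noise_vocoder(signal, fs, band_edges, env_cutoff):
    """Sketch of the N-channel noise vocoder chain in Figure 9.

    signal     : float array, input speech waveform
    fs         : sampling rate in Hz
    band_edges : array of N+1 channel boundary frequencies in Hz
    env_cutoff : envelope low-pass cutoff frequency in Hz
    """
    rng = np.random.default_rng()
    output = np.zeros_like(signal)
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        # Order 3 here gives a sixth-order band-pass Butterworth filter
        bp = butter(3, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(bp, signal)
        # Half-wave rectify, then fourth-order low-pass filtering
        # extracts the channel envelope
        lp = butter(4, env_cutoff, btype="lowpass", fs=fs, output="sos")
        envelope = sosfilt(lp, np.maximum(band, 0.0))
        # The envelope modulates a narrow-band noise carrier; the product
        # is band-pass filtered again and summed across channels
        carrier = sosfilt(bp, rng.uniform(-1.0, 1.0, size=signal.shape))
        output += sosfilt(bp, envelope * carrier)
    return output
```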
The vocoder processing parameters that were manipulated in this study, and the range of values for each parameter, are shown in Table 1. These parameters were intended to simulate important signal processing and perceptual aspects of sound processing with auditory implants. The rationale for choosing each parameter is described below.
The number of channels was manipulated to simulate different numbers of active electrodes in an auditory implant device. It was varied between 4 and 22, which spans the range of active electrodes in clinical cochlear implant and auditory brainstem implant devices.
The frequency boundaries of the vocoder channels were chosen in two ways: 1) equal bandwidth across channels when expressed in octaves, and 2) frequencies mimicking the standard channel boundaries of commercial Cochlear Ltd. devices. The aim of the frequency boundary manipulations was to evaluate how different approaches to determining frequency boundaries affect speech recognition at different numbers of channels. To create equal-octave channels, the overall frequency range was converted to octaves and divided by the number of channels; the resulting bandwidth was used to determine all the channel frequency boundaries, starting from the first channel (see the sketch below). The lowest and highest cutoff frequencies of the signal were manipulated to test the effects of the overall frequency range on the model's speech recognition performance.
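The equal-octave boundary computation can be sketched as follows (the function name and example values are ours):

```python
import numpy as np

def equal_octave_edges(f_low, f_high, n_channels):
    """Channel boundaries with equal bandwidth in octaves.

    The overall range in octaves is log2(f_high / f_low); dividing by
    the number of channels gives the per-channel octave width.
    """
    total_octaves = np.log2(f_high / f_low)
    step = total_octaves / n_channels
    return f_low * 2.0 ** (step * np.arange(n_channels + 1))

# Example: 8 equal-octave channels between 100 Hz and 8 kHz
print(equal_octave_edges(100.0, 8000.0, 8))
```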
The envelope low-pass cutoff frequency was manipulated to evaluate the range of envelope frequencies that provide usable speech cues. The tested envelope cutoff frequencies ranged between 5 and 400 Hz. The results have significant implications for determining the range of envelope variations that should be preserved by auditory implant devices and made accessible to device users.
The envelope dynamic range was manipulated by limiting the range of envelope values in each channel. The aim of this manipulation was to evaluate the range of envelope amplitudes that contribute to speech recognition. The dynamic range was manipulated by varying the low end of the envelope values, while the high end was set to the maximum envelope value across all channels for a given stimulus. Envelope values outside the dynamic range were set to zero. The envelope dynamic range was expressed as the ratio between the high and low envelope values on a logarithmic scale (dB). Dynamic range values tested in this study ranged between 10 and 150 dB.
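As an illustration, the sketch below applies this limiting to an array of channel envelopes; the function name is ours, and treating the dB ratio as an amplitude ratio (20·log10) is an assumption.

```python
import numpy as np

def limit_dynamic_range(envelopes, dr_db):
    """Restrict channel envelopes to a dynamic range of dr_db decibels.

    The high end is the maximum envelope value across all channels of
    the stimulus; values below the resulting floor are set to zero.
    """
    high = envelopes.max()
    floor = high * 10.0 ** (-dr_db / 20.0)  # assumes amplitude dB
    out = envelopes.copy()
    out[out < floor] = 0.0
    return out
```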
The number of quantized envelope steps was manipulated to simulate the number of envelope steps discriminable by individual implant users, which can range from just a few to several tens among auditory implant patients 25. The aim of this simulation was to evaluate the extent to which impaired sensitivity to changes in stimulus level could disrupt speech perception by implant users. Envelope quantization was implemented by rounding envelope values to the closest quantized level. Quantized levels were obtained by dividing the envelope dynamic range (in dB) into equal dB steps. Logarithmic steps were used to be consistent with discriminable acoustic level steps in normal-hearing human listeners. The number of quantization steps tested in this study was between 1 and 100 (Table 1). It should be noted that, due to the band-pass filtering at the final stage of vocoder processing (Figure 9), quantized envelopes would likely be slightly smeared and may not be precisely stepwise in the vocoder output.
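A sketch of the quantization step follows, under the same amplitude-dB assumption as above; the exact placement of the quantized levels within the dynamic range is not fully specified in the text, so rounding to equal-dB levels relative to the stimulus maximum is an assumption.

```python
import numpy as np

def quantize_envelope(envelopes, n_steps, dr_db):
    """Round nonzero envelope values to the nearest of n_steps
    equal-dB levels spanning the dynamic range dr_db."""
    high = envelopes.max()
    out = np.zeros_like(envelopes)
    nz = envelopes > 0  # zeros (outside the dynamic range) stay zero
    # Envelope values in dB relative to the stimulus maximum
    level_db = 20.0 * np.log10(envelopes[nz] / high)
    step = dr_db / n_steps
    # Round to the nearest quantized dB level, then convert back
    quant_db = np.round(level_db / step) * step
    out[nz] = high * 10.0 ** (quant_db / 20.0)
    return out
```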
3. Speech Testing
3.1. Stimuli
We tested the Whisper model on a set of 60 AzBio sentences 18 and 200 CNC words. The same set of words and sentences was used for all vocoder conditions. Each of the 60 AzBio sentences was a separate audio file. The 200 CNC words were presented in lists of 30 words each. The words were divided into lists, rather than presented one word at a time, for the model's efficiency and accuracy. By default, the Whisper model processes audio input as 30-second clips and pads shorter files with zeros; the model therefore runs much faster on a 30-second audio file containing a list of words than on multiple single-word audio files. We also found that the model output was more accurate when it was fed a list of words rather than one word at a time. The vocoded sentence and word stimuli were tested in quiet and in the presence of speech-shaped noise at 5 dB SNR (signal-to-noise ratio). Speech-shaped noise was generated using the pyAcoustics Python package (https://github.com/timmahrt/pyAcoustics/blob/main/pyacoustics). Because probabilistic decoding was used at the decoding stage of the Whisper model, the model output could vary each time it was run. We therefore ran the Whisper model 5 times for each set of 60 sentences and 200 words at each vocoder condition and used these repetitions to estimate the mean and the standard error of the mean.
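For illustration, below is a minimal sketch of mixing a stimulus with pre-generated speech-shaped noise at a target SNR; RMS-based scaling is a standard approach, and the function name is ours. The noise generation itself (via pyAcoustics) is omitted.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db=5.0):
    """Scale noise so the speech-to-noise RMS ratio equals snr_db."""
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    noise = noise[: len(speech)]  # assumes noise at least as long as speech
    gain = rms(speech) / (rms(noise) * 10.0 ** (snr_db / 20.0))
    return speech + gain * noise
```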
3.2. Scoring
Since it was not feasible to grade hundreds of thousands of words and sentences by hand, we implemented automated scoring algorithms to analyze the sentence and word results. Scoring was done by comparing the output of the Whisper model, which we will refer to as the model result, to the correct transcription of the input audio, which we will refer to as the audio transcript. The first step in automatic scoring was to "clean" the model results and audio transcripts by removing all punctuation and lower-casing all letters. The cleaned model results were then compared to the corresponding audio transcripts. We implemented separate grading algorithms for sentence and word stimuli.
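A minimal sketch of this cleaning step (the function name is ours):

```python
import string

def clean(text):
    """Remove punctuation, lower-case, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()
```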
Sentence Scoring Algorithm: Each sentence was graded by counting the number of words common to the model result and the audio transcript of that sentence. The total percent-correct score for multiple sentences was obtained by summing the individual sentence grades and dividing by the total number of words in the set of sentences. The algorithm did not take word order into account; for example, the two sentences "Abbey skipped rocks" and "Skipped Abbey rocks" were graded as 3 correct words. This approach is consistent with how speech perception tests are commonly graded.
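One way to implement this order-insensitive count is a multiset intersection; the sketch below assumes cleaned word lists as input.

```python
from collections import Counter

def sentence_score(result_words, transcript_words):
    """Count words shared between the model result and the transcript,
    ignoring word order (multiset intersection)."""
    common = Counter(result_words) & Counter(transcript_words)
    return sum(common.values())

# "abbey skipped rocks" vs. "skipped abbey rocks" -> 3 correct words
print(sentence_score(["abbey", "skipped", "rocks"],
                     ["skipped", "abbey", "rocks"]))
```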
Word Scoring Algorithm: The word scoring algorithm compared the model result for each 30-word list to the audio transcript of the list on a word-by-word basis. To obtain the final score, the number of words correctly recognized by the model was divided by the total number of words. One major challenge in scoring words was that the model result and the audio transcript did not always contain the same number of words, because the model could miss words or interpret a single word as multiple words. We addressed this issue with an iterative approach that finds the alignment yielding the highest similarity between the audio transcript and the model result. The optimal word-by-word alignment was found by iteratively removing single words from the model result or the audio transcript and calculating the similarity score between the remaining words. Each removed word was then replaced, and the scoring was repeated until every word had been removed once. The word whose removal produced the highest similarity score was then permanently removed, and the process of removing single words and calculating similarity scores was repeated. The algorithm stopped when the model result and the audio transcript contained equal numbers of words. The highest score obtained across alignments was used as the correct word score.
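The sketch below illustrates this iterative alignment, under the assumption that the similarity score is the position-wise match count between the two word lists (the text does not fully specify the metric); names are ours.

```python
def align_and_score(result, transcript):
    """Iteratively trim the longer word list until both lists are the
    same length, keeping the removal that maximizes similarity."""
    result, transcript = list(result), list(transcript)
    n_total = len(transcript)  # total words before any trimming

    def similarity(a, b):
        # position-wise match count between two word lists
        return sum(x == y for x, y in zip(a, b))

    while len(result) != len(transcript):
        longer = result if len(result) > len(transcript) else transcript
        shorter = transcript if longer is result else result
        # Try removing each word of the longer list in turn and keep
        # the removal that yields the highest similarity
        best_i = max(range(len(longer)),
                     key=lambda i: similarity(longer[:i] + longer[i + 1:],
                                              shorter))
        del longer[best_i]  # permanently remove the best candidate

    return similarity(result, transcript) / n_total
```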
Another issue in scoring words was homophones: words that are pronounced the same but spelled differently, such as "whine" and "wine". Homophones are phonetically identical and should be scored as correct when comparing the speech recognition model's results to the audio transcript. Homophones were handled using the grapheme-to-phoneme (g2p) Python package (https://pypi.org/project/g2p-en/): all word pairs with mismatched spellings were compared by phonetic content rather than spelling.
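A sketch of this phonetic comparison using g2p-en; the helper name is ours.

```python
from g2p_en import G2p

g2p = G2p()

def same_pronunciation(word_a, word_b):
    """Score differently spelled words as correct when their
    phoneme sequences match."""
    return g2p(word_a) == g2p(word_b)

print(same_pronunciation("whine", "wine"))  # expected: True
```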