Participants
Table 1
Participant characteristics. Each participant’s vibro-tactile detection thresholds (measured during screening), wrist temperature during testing, wrist dimensions, dominant hand, age, and sex are shown.
ID | 31.5 Hz thresh. (m/s²) | 125 Hz thresh. (m/s²) | Wrist temp. (°C) | Wrist height/width (mm) | Wrist circum. (mm) | Dom. hand (L/R) | Age (yrs.) | Sex (M/F) |
---|---|---|---|---|---|---|---|---|
1 | 0.02 | 0.08 | 30.3 | 39/58 | 166 | R | 36 | M |
2 | 0.02 | 0.09 | 28.8 | 36/48 | 150 | R | 22 | F |
3 | 0.04 | 0.08 | 31.6 | 36/48 | 149 | L | 18 | F |
4 | 0.05 | 0.07 | 31.6 | 41/52 | 160 | R | 29 | M |
5 | 0.06 | 0.41 | 32.2 | 43/55 | 172 | R | 37 | M |
6 | 0.03 | 0.03 | 28.4 | 32/49 | 149 | R | 21 | F |
7 | 0.05 | 0.03 | 25.6 | 44/59 | 171 | R | 19 | M |
8 | 0.05 | 0.10 | 31.8 | 41/50 | 160 | R | 23 | M |
9 | 0.05 | 0.06 | 28.4 | 43/51 | 168 | R | 18 | F |
10 | 0.08 | 0.15 | 30.1 | 41/48 | 148 | L | 20 | F |
11 | 0.08 | 0.20 | 28.6 | 40/49 | 154 | R | 20 | F |
12 | 0.02 | 0.03 | 31.0 | 35/51 | 148 | R | 28 | F |
13 | 0.01 | 0.05 | 31.0 | 41/51 | 166 | R | 26 | F |
14 | 0.09 | 0.29 | 27.9 | 39/51 | 162 | R | 20 | M |
15 | 0.05 | 0.05 | 31.2 | 38/55 | 153 | R | 18 | F |
16 | 0.02 | 0.06 | 34.0 | 33/46 | 138 | R | 21 | F |
Mean | 0.05 | 0.11 | 30.2 | 39/51 | 157 | - | 24 | - |
Table 1 shows the characteristics of the 16 adults who completed the experiment. There were 6 males and 10 females, with an average age of 24 years (ranging from 18 to 37 years). Participants all had normal touch perception, which was assessed through a screening questionnaire and by measuring vibro-tactile detection thresholds at the fingertip (see “Procedure”). All participants had British English as their first language and reported no hearing problems. Participants were paid an inconvenience allowance of £20 for taking part.
Stimuli
The tactile stimulus in the experiment phase (after screening) was generated using the EHS Research Group Sentence Corpus. This contained 83 sentences, each spoken by both a British English male and a British English female talker. The sentences were taken from readings (connected discourse) of a public engagement article written in a semi-conversational style40. They contained a range of natural variations in prosodic pattern, speaking rate, pitch, and phoneme pronunciation.
The long-term spectrum for each talker is shown in Fig. 4. The male talker had an average fundamental frequency of 147.9 Hz (ranging from 80.5 to 220.7 Hz; SD: 19.0 Hz) and the female talker had an average fundamental frequency of 205.5 Hz (ranging from 108.2 to 285.7 Hz; SD: 31.4 Hz). The fundamental frequency (estimated using a Normalized Correlation Function) and the harmonic ratio were determined using the MATLAB audioFeatureExtractor object (MATLAB R2022b). A 300-ms Hamming window was used, with a 30-ms overlap length. Samples were included in the analysis if their harmonic ratio was greater than 0.8.
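To make the extraction pipeline concrete, the following is a minimal Python/NumPy sketch of windowed pitch tracking with harmonicity gating, not the MATLAB audioFeatureExtractor implementation used in the study. The file name is hypothetical, and the peak of the normalized correlation function is used here as a rough stand-in for MATLAB’s harmonic ratio.

```python
import numpy as np
from scipy.io import wavfile

def ncf_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate F0 of one frame from the peak of a normalized correlation function."""
    frame = frame - frame.mean()
    best_lag, best_ncf = None, -1.0
    for lag in range(int(fs / fmax), int(fs / fmin) + 1):
        a, b = frame[:-lag], frame[lag:]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        ncf = np.dot(a, b) / denom if denom > 0 else 0.0
        if ncf > best_ncf:
            best_lag, best_ncf = lag, ncf
    return fs / best_lag, best_ncf  # F0 estimate and peak NCF (harmonicity proxy)

fs, x = wavfile.read("sentence.wav")      # hypothetical file name
x = x.astype(float)
win = int(0.300 * fs)                     # 300-ms Hamming window
hop = win - int(0.030 * fs)               # 30-ms overlap between windows
f0_track = []
for start in range(0, len(x) - win + 1, hop):
    frame = x[start:start + win] * np.hamming(win)
    f0, harmonicity = ncf_pitch(frame, fs)
    if harmonicity > 0.8:                 # keep only strongly harmonic frames
        f0_track.append(f0)
print(f"mean F0: {np.mean(f0_track):.1f} Hz, SD: {np.std(f0_track):.1f} Hz")
```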
The EHS Research Group Sentence Corpus was recorded in an anechoic chamber at the Institute of Sound and Vibration Research (UK). The audio was recorded using a Rode M5 microphone and RME Fireface UC soundcard (with a 96 kHz sample rate and a bit depth of 24 bits). The microphone was 0.2 m from the talker’s mouth.
Table 2
The subset of sentence pairs used in the experiment. For each pair, the table shows the syllable-count difference between the sentences (Set), the number of syllables in each sentence (N Syl.), and the sentence text. The spoken text is the same for the male and female talkers.
Set | N Syl. | Sentence 1 text | Sentence 2 text |
---|---|---|---|
Match | 3 | However | Did it work? |
Match | 4 | Blocked by your head | For example |
Match | 5 | Over and over | And the one they chose |
Match | 5 | For many people | Interestingly |
Match | 6 | It tells you where things are | But to reach your left ear |
Match | 8 | In India, for example | When you block it by closing a door |
Match | 8 | If you're in a noisy hall | This can cause a lot of problems |
Match | 9 | We were very pleased with what we found | They could help people across the world |
Match | 10 | They keep on getting better and better | Together with a team of researchers |
Match | 13 | We calculated how well they located the sound | And the music blaring from the speaker to your left |
1 diff | 2/3 | So far | Who is shy |
1 diff | 2/3 | Quieter | We hoped that |
1 diff | 4/5 | To get a job | Through another sense |
1 diff | 4/5 | Ears, nose, and mouth | To make this happen |
1 diff | 6/7 | And who likes to show off | Between the correct speaker |
1 diff | 6/7 | Touch and temperature | When using only their ears |
1 diff | 11/12 | You listen all the more closely for footsteps | This difficulty isn't a temporary one |
1 diff | 11/12 | And is always hungrily searching for more | Perhaps we can send the missing information |
1 diff | 14/15 | They might be especially useful in poorer countries | They tended to be much closer to the correct location |
1 diff | 14/15 | Which are a type of surgically fitted hearing aid | Less than a third of children with hearing problems go to school |
2 diff | 7/9 | Our volunteers performed best | If you don't know where it's coming from |
2 diff | 7/9 | Could overcome these problems | You can see this from the big blue bar |
2 diff | 7/9 | As they go about their day | As illustrated in figure one |
2 diff | 7/9 | That we took advantage of | The wristbands we are developing |
2 diff | 9/11 | We measured this distance in degrees | Like the voice of the person in front of you |
2 diff | 9/11 | That people can wear outside the lab | This is one of the main ways your brain works out |
2 diff | 10/12 | Or expensive medical equipment | Adults with hearing problems in poorer countries |
2 diff | 10/12 | Have been invented to solve this problem | And so are often forced to live in poverty |
2 diff | 11/13 | But because a sense isn't working properly | If you're trying to follow a conversation |
2 diff | 11/13 | Now we're looking to create a device | Were converted into vibration on the right wrist |
A subset of 60 sentences from the EHS Research Group Sentence Corpus was used in the experiment (see Table 2). This cross-section of sentences contained a variety of prosodic patterns, with different pitch contours, total lengths, and phoneme inventories. Each of the 60 sentences was spoken by both the male and female talker, giving 120 speech samples in total. The sentences were grouped into pairs: 10 pairs had the same number of syllables, 10 pairs differed by 1 syllable, and 10 pairs differed by 2 syllables. This ensured that trials spanned a wide range of difficulties.
For the conditions with background noise, a non-stationary multi-talker noise recorded by the National Acoustic Laboratories (NAL)41 was used. The noise sample was recorded at a party and has a long-term spectrum that matches the international long-term average speech spectrum42. This noise was selected to reproduce the real-world challenges that haptic hearing-aid users would face. For the speech-in-noise conditions, the speech and noise signals were mixed at an SNR of 2.5 dB. Importantly, the RMS level of the speech was calculated with silences removed (although silences were not removed from the stimulus for presentation). Silent sections were identified by extracting the speech amplitude envelope using a Hilbert transform and a zero-phase 6th-order Butterworth low-pass filter with a cut-off frequency of 23 Hz. Sections where the amplitude envelope dropped below 10% of its maximum were excluded from the RMS level calculation. This meant that the SNR setting in the current study was lower than in comparable studies where silences were not removed for the SNR calculation: for the EHS Research Group Sentence Corpus, the SNR was on average 2.6 dB lower with silences removed than without (ranging across sentences from 0.9 to 4.7 dB lower; SD: 0.7 dB). For the more naturalistic speech material used in the current study, which contains a variety of prosodic characteristics with differing syllabic stress patterns, speaking rates, and overall modulation characteristics, silence-stripping was considered important for achieving a stable SNR across sentences.
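As a minimal sketch of the silence-stripped level calculation and SNR mixing described above (in Python with SciPy; function names and signal variables are ours, not the study’s code):

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def active_speech_rms(speech, fs):
    """RMS of the speech with silent sections removed, as described above."""
    env = np.abs(hilbert(speech))            # amplitude envelope
    b, a = butter(6, 23 / (fs / 2))          # 6th-order Butterworth, 23-Hz cut-off
    env = filtfilt(b, a, env)                # zero-phase low-pass filtering
    active = speech[env >= 0.1 * env.max()]  # drop sections below 10% of maximum
    return np.sqrt(np.mean(active ** 2))

def mix_at_snr(speech, noise, fs, snr_db=2.5):
    """Scale the noise so the silence-stripped speech level sits at snr_db, then mix.

    Silences are removed only for the level calculation, not from the
    presented stimulus.
    """
    speech_rms = active_speech_rms(speech, fs)
    noise_rms = np.sqrt(np.mean(noise ** 2))
    target_noise_rms = speech_rms / (10 ** (snr_db / 20))
    return speech + noise * (target_noise_rms / noise_rms)
```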
On each trial, the masker duration was set so that the masker began a randomly selected 500 to 1,500 ms before, and ended a randomly selected 500 to 1,500 ms after, the longest speech stimulus of the sentence pair (see Table 2). This excluded the possibility that participants could learn when the sentence onset or offset would occur and use total-duration cues without detecting the speech signal. If the speech in the trial was the shorter sentence of the pair, it was placed at a random point within the time window for the longer sentence. The speech and noise samples were ramped on and off with 50-ms raised-cosine ramps. A new randomly selected section of the noise sample was used in each trial.
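The trial-level timing rules can be illustrated with the following sketch (function and variable names are ours; relative speech/noise levels are assumed to be set beforehand, as in the SNR sketch above):

```python
import numpy as np

def ramp(x, fs, dur=0.05):
    """Apply 50-ms raised-cosine onset and offset ramps."""
    n = int(dur * fs)
    r = 0.5 * (1 - np.cos(np.pi * np.arange(n) / n))
    y = x.astype(float).copy()
    y[:n] *= r
    y[-n:] *= r[::-1]
    return y

def build_trial(speech, longest_len, noise, fs, rng):
    """Place the speech inside the masker per the timing rules described above.

    `longest_len` is the sample length of the longer sentence in the pair;
    `noise` is a freshly drawn random section of the party-noise recording,
    assumed long enough for the trial.
    """
    pre = int(rng.uniform(0.5, 1.5) * fs)    # 500-1,500 ms before the window
    post = int(rng.uniform(0.5, 1.5) * fs)   # 500-1,500 ms after the window
    masker = noise[:pre + longest_len + post].astype(float)
    # Shorter sentence: random onset within the longer sentence's time window.
    onset = pre + rng.integers(0, longest_len - len(speech) + 1)
    masker[onset:onset + len(speech)] += ramp(speech, fs)
    return ramp(masker, fs)
```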
For conversion from audio to haptic stimulation, the audio was first downsampled to 16 kHz. This matches the typical sample rate available in compact wearable audio devices. For the conditions in quiet and in noise with noise reduction applied, the audio was first processed with the DPRNN algorithm shown in Fig. 5. This consisted of an end-to-end time-domain audio separation network (TasNet) with three stages. In the first stage, a learned encoder block transformed the time-domain audio frames into a 2D feature space (similar to a time-frequency representation). The second stage was a masking network that processed consecutive chunks within the latent feature space to estimate a noise-reduction mask. This mask was then applied to the noisy input speech representation to remove the background noise and produce the enhanced speech signal. The final stage was a learned decoder block, which transformed the enhanced speech signal back into a time-domain audio output signal.
The DPRNN algorithm used 1D-convolutional layers with 128 filters for the encoder and decoder blocks, with a kernel size of 16 samples (1 ms) and a stride of 4 samples (0.25 ms)20. The chunks processed by the masking network consisted of 20 frames, which limited the use of future information to 5.75 ms and so allowed real-time processing. The masking network consisted of six DPRNN blocks, each with bottleneck and hidden dimensions of 256 units. Channel-wise normalisation was employed to ensure causal processing.
The DPRNN algorithm was implemented using the publicly available Asteroid PyTorch toolkit43. It was trained on a large dataset of speech-in-noise stimuli, comprising speech utterances from the LibriSpeech corpus44 mixed with noise samples from the WHAM! stimulus set45 at various SNRs (360 hours in total). The SNRs were sampled from a uniform distribution between −6 and 10 dB, with 5% of the samples retained to ensure that performance for clean speech had not degraded. These speech and noise stimuli differed from those used during the behavioural and objective assessment of the performance of the DPRNN. Training used the Adam optimizer46 with the scale-invariant signal-to-distortion ratio47 as the loss function, for a total of 200 epochs. The training of the DPRNN algorithm was performed on three NVIDIA A100 40 GB Tensor Core graphics processing units.
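The stated hyper-parameters map onto Asteroid’s DPRNNTasNet class roughly as follows. This is a minimal sketch rather than the authors’ training code: the argument names follow the toolkit’s public API, the learning rate and batch contents are placeholder assumptions, and a fully causal configuration may additionally require a unidirectional RNN (not specified above).

```python
import torch
from asteroid.models import DPRNNTasNet
from asteroid.losses import PITLossWrapper, pairwise_neg_sisdr

# Hyper-parameters as described above; other arguments keep toolkit defaults.
model = DPRNNTasNet(
    n_src=1,            # one output source: the enhanced speech
    n_filters=128,      # 128 encoder/decoder filters
    kernel_size=16,     # 16 samples = 1 ms at 16 kHz
    stride=4,           # 4 samples = 0.25 ms
    chunk_size=20,      # 20 latent frames per processed chunk
    n_repeats=6,        # six DPRNN blocks in the masking network
    bn_chan=256,        # bottleneck dimension
    hid_size=256,       # hidden dimension
    norm_type="cLN",    # channel-wise (causal) normalisation
    sample_rate=16000,
)

loss_func = PITLossWrapper(pairwise_neg_sisdr, pit_from="pw_mtx")  # negative SI-SDR loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)          # placeholder learning rate

noisy = torch.randn(4, 16000)      # placeholder batch: four 1-s noisy mixtures
clean = torch.randn(4, 1, 16000)   # placeholder clean-speech targets
estimate = model(noisy)            # shape: (batch, n_src, samples)
loss = loss_func(estimate, clean)
loss.backward()
optimizer.step()
```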
Following downsampling and, for some conditions, the application of noise reduction, the signal was converted to tactile stimulation using a tactile vocoder (following the method used previously by Fletcher et al.10). The audio was first passed through a 512th-order FIR filter bank with eight frequency bands, equally spaced on the auditory equivalent rectangular bandwidth scale48 between 50 and 7,000 Hz. Next, the amplitude envelope was extracted for each frequency band using a Hilbert transform and a zero-phase 6th-order Butterworth low-pass filter with a cut-off frequency of 23 Hz (targeting the envelope modulation frequencies most important for speech recognition49,50). These amplitude envelopes were then used to modulate the amplitudes of eight fixed-phase vibro-tactile tonal carriers.
The eight tactile tones had frequencies of 94.5, 116.5, 141.5, 170, 202.5, 239, 280.5, and 327.5 Hz. These frequencies remained within the range that can be reproduced by the latest compact haptic actuators and were spaced so as to be discriminable, based on frequency discrimination thresholds from the dorsal forearm51. A different gain was applied to each tone to equalise perceived intensity across frequency, based on tactile detection thresholds10,52. The gains were 13.8, 12.1, 9.9, 6.4, 1.6, 0, 1.7, and 4 dB, respectively. Tactile stimuli were scaled to have an equal RMS amplitude, with a nominal level of 141.5 dB re 10⁻⁶ m/s² (1.2 G). This intensity can be produced by a range of compact, low-powered haptic actuators. Pink noise was presented through headphones throughout speech identification testing at a level of 60 dBA to ensure that any auditory cues were masked. During familiarisation, there was no masking noise, and the speech audio was played through the headphones at 65 dBA.
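The vocoder chain can be summarised in the following Python/SciPy sketch (not the authors’ MATLAB implementation; the ERB-edge helper and function names are ours, and the output would still need scaling to the nominal RMS level given above before presentation):

```python
import numpy as np
from scipy.signal import butter, filtfilt, firwin, hilbert, lfilter

def erb_spaced_edges(f_lo=50.0, f_hi=7000.0, n_bands=8):
    """Band edges equally spaced on the ERB-number scale."""
    erb = lambda f: 21.4 * np.log10(1 + 0.00437 * f)
    inv = lambda e: (10 ** (e / 21.4) - 1) / 0.00437
    return inv(np.linspace(erb(f_lo), erb(f_hi), n_bands + 1))

def tactile_vocoder(x, fs=16000):
    """A sketch of the eight-band tactile vocoder described above."""
    edges = erb_spaced_edges()
    carriers_hz = [94.5, 116.5, 141.5, 170, 202.5, 239, 280.5, 327.5]
    gains_db = [13.8, 12.1, 9.9, 6.4, 1.6, 0, 1.7, 4]
    b_lp, a_lp = butter(6, 23 / (fs / 2))              # 23-Hz envelope filter
    t = np.arange(len(x)) / fs
    out = np.zeros(len(x))
    for k in range(8):
        taps = firwin(513, [edges[k], edges[k + 1]], fs=fs, pass_zero=False)
        band = lfilter(taps, 1.0, x)                        # 512th-order FIR band-pass
        env = filtfilt(b_lp, a_lp, np.abs(hilbert(band)))   # zero-phase envelope smoothing
        carrier = np.sin(2 * np.pi * carriers_hz[k] * t)    # fixed-phase tonal carrier
        out += 10 ** (gains_db[k] / 20) * env * carrier     # per-band gain and modulation
    return out
```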
Apparatus
Participants sat in a vibration-isolated, temperature-controlled room (mean temperature: 23°C; SD: 0.45°C). The room temperature and the participant’s skin temperature were measured using a Digitron 2022T type K thermocouple thermometer, which was calibrated following ISO 80601-2-56:201753 using the method described in Fletcher et al.10. In screening, vibro-tactile detection thresholds were measured with an HVLab Vibro-tactile Perception Meter54. The circular probe was 6 mm in diameter and contacted the skin through a rigid surround with a 10-mm diameter. The probe applied a constant upward force of 1 N. The downward force applied by the participant was measured using a force sensor built into the surround; this sensor was calibrated using Adam Equipment OIML calibration weights, and the force applied was displayed to the participant. The Vibro-tactile Perception Meter output was calibrated using its built-in accelerometers (Quartz Shear ICP, model number: 353B43) and a Brüel & Kjær (B&K) Type 4294 calibration exciter. The system conformed to ISO 13091-1:200155, and the stimuli had a total harmonic distortion of less than 0.1%.
For the sentence identification task, the EHS Research Group haptic stimulation rig was used (see Fletcher et al.10). This consisted of a Ling Dynamic Systems V101 shaker suspended from an aluminium strut frame by an adjustable elastic cradle. The shaker had a downward-facing circular probe with a 10-mm diameter, which contacted the participant’s dorsal wrist. A foam block with a thickness of 95 mm was placed below the shaker probe for participants to rest their palmar forearm on. The probe applied a downward force of 1 N, which was calibrated using a B&K UA-0247 spring balance. The shaker was driven using a MOTU UltraLite-mk5 sound card, RME QuadMic II preamplifier, and HVLab Tactile Vibrometer power amplifier. The vibration output was measured using a B&K 4533-B-001 accelerometer and calibrated using a B&K Type 4294 calibration exciter. All stimuli had a total harmonic distortion of less than 0.1%.
Masking noise in the experiment phase and speech audio in the familiarisation phase were played to the participant through Sennheiser HDA 300 headphones, driven by the MOTU UltraLite-mk5 sound card. A B&K G4 sound level meter with a B&K 4157 occluded-ear coupler (Royston, Hertfordshire, UK) was used to calibrate the audio. Sound level meter calibration checks were carried out using a B&K Type 4231 sound calibrator.
Procedure
For each participant, the experiment was completed in a single session lasting approximately two hours. Participants first gave informed consent to take part in the study. They then completed a screening questionnaire, which ensured that they (1) had not had any injury or surgery on their hands or arms, (2) did not suffer from conditions that might affect their sense of touch, and (3) had not been exposed to intense or prolonged hand or arm vibration over the previous 24 hours. Their self-reported hearing health was also recorded.
Following this, the participant’s skin temperature on the index fingertip of the dominant hand was measured. When the participant’s skin temperature was between 27 and 35°C, their vibro-tactile detection thresholds were measured at the index fingertip following BS ISO 13091-1:200155. During these measurements, participants applied a downward force of 2 N, which was monitored by the participant and experimenter using the HVLab Vibro-tactile Perception Meter display. Participants were required to have touch-perception thresholds in the normal range (< 0.4 m/s² RMS at 31.5 Hz and < 0.7 m/s² RMS at 125 Hz), conforming to BS ISO 13091-2:202156. The fingertip was used for screening because normative data were not available for the wrist. If all screening stages were passed, the participant’s wrist dimensions were measured at the position where they would usually wear a wristwatch, and they proceeded to the experiment phase.
In the experiment phase, participants sat in front of the EHS Research Group haptic stimulation rig10. They placed their arm on the foam surface with the shaker probe contacting the centre of the dorsal wrist, at the position where they would normally wear a wristwatch. During the experiment phase, participants completed a two-alternative forced-choice sentence identification task. This was done using a custom-built MATLAB (R2022b) app with three buttons: one that allowed the user to play the stimulus for the current trial, and two that displayed the sentence-text alternatives for the current trial. The app also displayed the instruction: “Select the text that matches the sentence played through vibration”. When the play button was pressed, both buttons displaying the sentence-text alternatives turned green for the duration of the stimulus. The sentence-text buttons were only selectable after the stimulus had been played at least once.
Before testing began, participants completed familiarisation to ensure they understood the task. In this stage, English male and female speech was used that differed from the speech used in testing. The male talker was from the ARU speech corpus (ID: 02)57 and the female talker was from the University of Salford speech corpus58. They each spoke four different sentences from the Harvard sentence set59. Sentences were always played without background noise, and the participant was permitted either to feel the sentence through tactile stimulation or to hear it through the headphones (without tactile signal processing). There was no limit on the number of times these sentences could be repeated, and the participant was encouraged to ask the experimenter questions if they were unsure of the task. Once the participant selected one of the sentences, all buttons became inactive for 0.5 seconds and green text reading “Correct” was displayed if the response was correct, or red text reading “Incorrect” if it was incorrect. Once the experimenter confirmed that the participant understood the task, they continued to the testing stage.
During testing, the participant performed the same task as in familiarisation, except that they could not opt to hear the audio stimulus and the headphones played masking noise. Participants were also limited to a maximum of four repeats of the tactile stimulus per trial, after which the play button became inactive and they were forced to select one of the two text alternatives. All sentences were tested once in each of the four experimental conditions (in quiet and in background noise, both with and without the DPRNN), giving a total of 480 trials per participant in the testing stage. The trial order was randomised for each participant, with the rule that the same sentence could not appear in consecutive trials. None of the speech or background-noise material used for testing had been used during the training of the DPRNN.
The experimental protocol was approved by the University of Southampton Faculty of Engineering and Physical Sciences Ethics Committee (ERGO ID: 68477). All research was performed in accordance with the relevant guidelines and regulations.
Statistics
Behavioural performance measures
The percentage of correctly identified sentences was calculated for each condition for the male and female talkers. Primary analysis consisted of a three-way RM-ANOVA (see “Results”) and six two-tailed t-tests, with a Bonferroni-Holm multiple comparisons correction60 applied. The t-tests compared: (1) identification in quiet with and without noise reduction; (2) identification in noise with and without noise reduction; (3) identification in quiet without noise reduction and identification in noise with noise reduction; (4) identification in quiet and in noise, both without noise reduction; (5) the difference in identification in quiet with and without noise reduction for the male and female talkers; and (6) the difference in identification in noise with and without noise reduction for the male and female talkers. Data were determined to be normally distributed based on visual inspection as well as Kolmogorov-Smirnov and Shapiro-Wilk tests.
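For illustration, a minimal Python sketch of one paired comparison and the Bonferroni-Holm correction (using hypothetical placeholder scores and p-values, not the study data) might look like this:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# Placeholder data: % correct for 16 participants in two paired conditions.
# The real analysis would use the measured identification scores.
scores_a = rng.normal(75, 8, 16)
scores_b = rng.normal(65, 8, 16)

t_stat, p_val = stats.ttest_rel(scores_a, scores_b)   # two-tailed paired t-test

# Bonferroni-Holm correction across the six planned comparisons;
# the remaining five p-values here are placeholders.
p_vals = [p_val, 0.01, 0.02, 0.03, 0.20, 0.50]
reject, p_adj, _, _ = multipletests(p_vals, alpha=0.05, method="holm")
print(reject, p_adj)
```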
Following the primary analysis, exploratory secondary analyses were performed. These were not corrected for multiple comparisons, as no effects were anticipated based on the results of previous studies (e.g., 10).
Objective performance measures
The effectiveness of the DPRNN noise reduction was also assessed objectively for each audio frequency band in the tactile vocoder using pairwise Pearson correlation coefficients. The amplitude envelopes of the temporally aligned clean speech signal were compared to those of (1) speech in quiet with noise reduction, (2) speech in noise with noise reduction, and (3) speech in noise without noise reduction. All audio samples used in the experiment were assessed.
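As a small sketch of the per-band comparison (function and array names are ours):

```python
import numpy as np
from scipy.stats import pearsonr

def band_envelope_correlations(clean_envs, test_envs):
    """Pearson r between clean and processed envelopes, one value per band.

    Both inputs are (n_bands, n_samples) arrays of temporally aligned
    amplitude envelopes from the tactile vocoder's eight frequency bands.
    """
    return np.array([pearsonr(c, t)[0] for c, t in zip(clean_envs, test_envs)])
```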