2.1. Ethical approval
This study was approved by the Human Research Ethics Committee of The University of Sydney (project number: 2020/399). Informed consent was obtained from all participants. All methods used in the present study were performed in accordance with the relevant ethical guidelines and regulations. The measurement procedures conformed to the standards set by the latest revision of the Declaration of Helsinki.
The study required recruitment of two cohorts of participants: speakers and listeners. Speakers were recorded wearing and not wearing two types of masks, while listeners were required to listen to voice recordings and rate specific auditory-perceptual features.
2.2. Voice recording
2.2.1. Speakers
Sixteen speaker HCWs took part in this study (12 females, 4 males) with mean age = 43 years (range = 24 - 61), including two otolaryngologists, 13 practicing speech language pathologists, and one registered nurse working in an Ear Nose and Throat clinic. Inclusion criteria were English speakers, non-smokers, with no self-reported voice or hearing problems at the time of the study. Exclusion criteria included any voice or hearing problem at the time of the study.
2.2.2. Procedure
Voice recordings were performed in a quiet room or soundproof booth at the participants’ respective clinics, as social distancing measures during the COVID-19 pandemic prohibited the use of the same room. Participants used their habitual voice to read the Rainbow Passage 45 in three conditions with the speaker (1) not wearing a mask, (2) wearing a surgical mask, and (3) wearing a KN95 mask. The order of conditions was randomised across speakers to minimise biases related to intra-speaker variability in phonation and potential compensation whilst wearing a mask. When wearing these masks, participants were required to use the highest level of fitting to ensure maximal barrier level. They were required to press the metal nose bar so that it fitted tightly to the nose contour. The straps of the mask were securely placed behind the auricles and the lower side of the mask was pulled fully downward so that it covered the chin completely. It is known that in unfavourable or challenging speaking conditions, speakers may adopt a phonation style that helps improve clear phonation 46,47. Therefore, we required participants to maintain a similar habitual voice in terms of pitch, loudness, and speaking style throughout recording sessions, both with and without a mask, to minimise intra-speaker variability in voice production.
All voice signals were captured using an AKG C520 cardioid ear-mounted microphone 48 placed at a constant distance of 6 cm, 45° off the mouth axis, with analog-to-digital conversion via a professional external sound card (Roland Quadcapture 49) at 44.1 kHz and 16-bit resolution. The signals were processed and saved to a laptop computer using the Audacity sound editing software 50 in *.wav format. Calibration of sound level in the voice signals was deemed unnecessary given that the data were used to test within-subject effects of mask and non-mask conditions.
Given that voice recordings took place in different clinic rooms with different levels of background noise, audio files were examined for signal-to-noise ratio (SNR) using a Praat script called Speech-to-noise ratio/Voice-to-noise ratio v.01.01 51. Only samples with an SNR greater than 30 dB were used for auditory-perceptual and acoustic analyses 52.
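To illustrate the screening criterion, an SNR in decibels can be formed from the RMS amplitude of a speech segment and of a noise-only segment. The following is a minimal sketch under that assumption; it is not the cited Praat script, and the function names and sample values are hypothetical.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def snr_db(speech, noise):
    """SNR in dB: 20*log10 of the ratio of speech RMS to noise RMS."""
    return 20.0 * math.log10(rms(speech) / rms(noise))

def passes_snr_criterion(speech, noise, threshold_db=30.0):
    """True when the SNR exceeds the threshold (30 dB in this study)."""
    return snr_db(speech, noise) > threshold_db

# Hypothetical samples: the speech RMS is 100 times the noise RMS.
speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.005, -0.005, 0.005, -0.005]
keep_recording = passes_snr_criterion(speech, noise)
```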
2.3. Auditory-perceptual analyses
2.3.1. Listeners
Listeners were recruited via email advertisement sent to an international professional network of speech-language pathologists and ENT specialists. Inclusion criteria were (1) working with voice patients as a speech-language pathologist, voice specialist, or laryngologist; and (2) normal hearing at the time of the study. Exclusion criteria were not working in any of the above-mentioned occupations and self-reported hearing impairment. Twenty listeners initially signed up to complete the ratings. Seven raters demonstrated good intra-rater reliability (see below) and only data from these raters were included in the study. The remaining listeners had poor intra-rater reliability and were excluded from further analysis.
2.3.2. Stimuli
The Rainbow Passage 45 was used for listening tests (‘When the sunlight strikes raindrops in the air... the pot of gold at the end of the rainbow’). The stimuli represented non-mask (n=16), surgical mask (n=12), and KN95 mask (n=12) conditions. Twelve samples were repeated for intra-rater reliability evaluation. In total, there were 52 samples, which were coded and randomized for presentation to the listeners. Given that auditory-perceptual rating of clear speech is affected by vocal intensity 53,54, all stimuli were normalized for intensity using the ‘Normalize’ command in Audacity with the checkbox ‘Normalize peak amplitude to -3.0 dB’ checked. After normalization, the output intensity level of the stimuli was between 72.0 and 75.0 decibels (dB) sound pressure level (SPL) as measured in Praat, and the stimuli were presented to listeners via headphones. This level was used as it has been shown that intelligibility of speech produced at a range of vocal effort levels from 55 dB to 78 dB was constant, whilst speech produced at louder and softer intensity levels resulted in decreased word intelligibility 55.
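Peak normalization of this kind amounts to applying a single gain so that the largest absolute sample sits at the target level relative to full scale. The sketch below illustrates that idea only; it assumes samples scaled to the range -1 to 1, is not Audacity's implementation, and uses a hypothetical function name.

```python
import math

def normalize_peak(samples, target_db=-3.0):
    """Scale samples so the peak absolute value equals target_db re full scale (1.0)."""
    peak = max(abs(x) for x in samples)
    if peak == 0.0:
        return list(samples)  # silent signal: nothing to scale
    target_linear = 10.0 ** (target_db / 20.0)  # -3 dB is roughly 0.708
    gain = target_linear / peak
    return [x * gain for x in samples]

# Hypothetical stimulus with a peak of 0.2 before normalization.
normalized = normalize_peak([0.1, -0.2, 0.05])
peak_db = 20.0 * math.log10(max(abs(x) for x in normalized))
```

Because a single gain is applied to the whole file, the relative levels within a stimulus are preserved.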
2.3.3. Perceptual rating scale
In this study we were interested in the ratings of clarity of speech in surgical mask and KN95 mask conditions. The perception of speech clarity has been examined using different rating schemes 56. Magnitude-estimation scaling, for example, is a reliable estimation for speech clarity 57. Magnitude estimation can be implemented using word identification tests and/or ratings of the degree of the attribute of interest 58. Some authors have also used rating scales to quantify speech clarity. For example, Tasko and Greilick 59 used a computer-based slider scale where raters compared the clarity of word pairs and made a judgment by moving the slider from the mid-point of the scale toward the clearer stimulus. Reinhart and Souza 60 used a 7-point Likert scale with 1 representing ‘completely unclear’ and 7 ‘completely clear’. In the present study, we used a visual analogue scale (VAS) consisting of a straight line containing 100 points (1-100), with 1 and 100 representing ‘completely clear’ and ‘completely unclear’, respectively, i.e. the higher the score, the less clear the speech sound.
2.3.4. Procedure
Listening tasks were conducted using an online auditory-perceptual rating tool called Bridge2practice, which is a free online education and research platform developed for perceptual learning and practice of allied health professionals and students 61. Listeners were required to listen to the speech stimuli as many times as they wished using headphones and make a judgment about speech sound clarity by changing the position of the slider on the VAS line described above. All stimuli were randomized and raters were not aware that a number of stimuli were produced with speakers wearing a facemask. Responses were registered in the rating platform and exported to an Excel spreadsheet.
2.3.5. Reliability of perceptual ratings
Intra- and inter-rater reliability were assessed using SPSS 24.0 (SPSS, Inc., Chicago, IL, USA). Intraclass correlation coefficients (ICC) 62 were used to determine the level of agreement between the first and second (repeated) ratings (intra-rater reliability) and across listeners (inter-rater reliability). ICC was calculated using a two-way mixed model, consistency type, and single measure analysis [ICC (3,1)]. To assess the level of correlation, ICC < 0.5 indicates poor correlation, 0.5 - 0.75 moderate, 0.75 - 0.9 good, and > 0.9 excellent correlation 63. Intra-rater reliability ranged from ICC = 0.647 to 0.785 for Single Measures and from ICC = 0.785 to 0.880 for Average Measures. Inter-rater reliability amongst the seven raters was moderate based on average measures (ICC = 0.692, p = 0.003).
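For readers who want to see what the ICC (3,1) model computes, the sketch below derives the single-measure and average-measure consistency coefficients from the two-way ANOVA mean squares. It is an illustrative reimplementation, not the SPSS procedure used in the study; the function name and the ratings matrix are hypothetical.

```python
def icc_consistency(ratings):
    """Two-way mixed, consistency ICC for a table of n targets (rows) by k raters (columns).

    Returns (single_measure, average_measure), i.e. ICC(3,1) and ICC(3,k).
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between targets
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols                  # residual

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    single = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    average = (ms_rows - ms_error) / ms_rows
    return single, average

# Hypothetical VAS ratings: 4 stimuli rated on two occasions.
example = [[10, 14], [35, 30], [60, 66], [82, 80]]
icc_single, icc_average = icc_consistency(example)
```

Under the consistency definition, a constant offset between raters (or between the first and second rating) does not lower the coefficient, because rater variance is excluded from the denominator.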
2.4. Acoustic analyses
2.4.1. Root-mean-square (RMS) amplitude (ARMS) of fricatives
Two fricatives /s/ (alveolar) and /ʃ/ (palato-alveolar) were used for analysis for specific reasons. Firstly, these fricatives have higher amplitude than other voiceless fricatives e.g. /θ/ and /f/ 64 which would make it easier to reliably identify and extract them from connected speech compared to other fricatives. Secondly, the duration of these two fricatives is longer than that of the other fricatives 64, making it more likely to obtain stable and reliable identification of fricative boundaries and acoustic values. The /s/ and /ʃ/ are characterized by well-defined spectral shapes compared to labio-dental and dental fricatives /f/, /v/, /θ/ and /ð/ which have a relatively flat spectrum without a clear dominant peak 41. Lastly, /s/ and /ʃ/ differ in their spectral mean, representing different locations of spectral peaks in the spectrum 41, and allowing investigation of a wider range of high-frequency fricative signal assumed to be affected by facemasks.
The ARMS was used as it has been frequently examined to characterize English voiceless fricatives 41,42. The amplitude of the fricative signal has also been considered an important cue to perceive the place of articulation in voiceless fricatives and hence the accuracy of fricative consonant production 65. Firstly, the signals were high-pass filtered at 1 kHz in Audacity 50, with a 6-dB roll-off per octave, to remove any potential trace of voicing due to the pre- and postvocalic environment of the fricative 66 and to minimize low-frequency energy which could interfere with detection of zero-crossings due to the turbulent source 42. The fricative /ʃ/ was extracted from the word ‘shape’ in ‘...These take the shape of a long round arch...’ and /s/ was extracted from the word ‘say’ in the sentence ‘...his friends say he is looking for the pot of gold...’. The boundaries of each consonant were identified visually using the acoustic waveform and spectrograms in Praat 6.1.40 67 (Figure 1) and by listening to the sample. Fricative signals were defined as meeting the following criteria: (1) a characteristic waveform with zero-crossings; and (2) high-frequency noise energy in the narrow-band spectrogram. Koenig et al. 68 used 25-millisecond (ms) fricative segments. In this study, the middle 50 ms segment was extracted from the centre of these fricatives for acoustic analysis. Onset and offset segments were excluded because, for these fricatives, the onset and the offset (immediately before voicing onset) have lower amplitude than the middle 64; hence extracting the middle segment would increase the probability of capturing the amplitude peaks of the fricative noise signals. The edited /s/ underwent a Fast Fourier transform and was analysed in Praat in the frequency range 1 - 10 kHz.
The ARMS over the time interval t1 ≤ t ≤ t2 was defined using the formula 67:
$$A_{\mathrm{RMS}} = \sqrt{\frac{1}{t_2 - t_1}\int_{t_1}^{t_2} A(t)^2\,dt} \qquad (1)$$
in which A is the amplitude of the sound. ARMS was converted from the Pascal (Pa) unit in Praat to sound pressure level (SPL) in decibels (dB) using the formula:
$$\mathrm{dB\ SPL} = 20\log_{10}\left(\frac{P}{P_0}\right) \qquad (2)$$
where P was the ARMS value and P0 = 20 micropascals (μPa), which was the reference value.
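The extraction of a centred 50 ms window can be sketched as a simple slicing operation on the samples of a hand-labelled segment. This is an illustration only: in the study the segments were edited in Praat, and the function name and the dummy segment below are hypothetical.

```python
def middle_segment(samples, sample_rate_hz, duration_ms=50.0):
    """Return the centred window of duration_ms from an already-segmented sound."""
    n_window = int(round(sample_rate_hz * duration_ms / 1000.0))
    if n_window > len(samples):
        raise ValueError("segment is shorter than the requested window")
    start = (len(samples) - n_window) // 2  # equal trim from onset and offset
    return samples[start:start + n_window]

# Hypothetical 100 ms fricative at the 44.1 kHz recording rate (4410 samples).
fricative = list(range(4410))
window = middle_segment(fricative, 44100)
```

At 44.1 kHz a 50 ms window corresponds to 2205 samples, so the onset and offset portions are trimmed symmetrically.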
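As a worked illustration of the RMS definition and of formula (2), the sketch below computes a discrete RMS amplitude (the sampled counterpart of the integral over t1 to t2) and converts it to dB SPL. The function names and the sample pressures are hypothetical; Praat performed these calculations in the study.

```python
import math

P0_PA = 20e-6  # reference pressure: 20 micropascals

def rms_amplitude(pressures_pa):
    """Discrete RMS: square root of the mean squared sample value."""
    return math.sqrt(sum(p * p for p in pressures_pa) / len(pressures_pa))

def pa_to_db_spl(pressure_pa):
    """Formula (2): dB SPL = 20 * log10(P / P0)."""
    return 20.0 * math.log10(pressure_pa / P0_PA)

# Hypothetical pressure samples (in Pa) from a 50 ms segment.
segment_pa = [0.02, -0.02, 0.02, -0.02]
level_db = pa_to_db_spl(rms_amplitude(segment_pa))
```

Since 0.02 Pa is 1000 times the reference pressure, this hypothetical segment corresponds to 60 dB SPL.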
2.4.2. Spectral moments of fricatives
Apart from quantifying amplitude of the two fricatives, we were interested in clarifying whether facemasks affected other attributes of these consonants. Centre of Gravity (Hz), Standard Deviation (SD, in Hz), Skewness, and Kurtosis have been used extensively in the literature to characterize voiceless fricatives in both conventional speech 41 and clear speech contexts 42. These measures were obtained from two fricatives, being /s/ and /ʃ/ in Praat.
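The four spectral moments can be understood as power-weighted statistics of frequency. The sketch below computes them from a list of frequency bins and the corresponding power values. It assumes power weighting and reports excess kurtosis (the normalized fourth moment minus 3); whether these conventions match the exact Praat settings used in the study is an assumption, and the function name and spectrum are hypothetical.

```python
import math

def spectral_moments(freqs_hz, power):
    """Return (centre of gravity, SD, skewness, excess kurtosis) of a power spectrum."""
    total = sum(power)
    cog = sum(f * p for f, p in zip(freqs_hz, power)) / total

    def central_moment(order):
        # Power-weighted central moment of frequency about the centre of gravity.
        return sum(((f - cog) ** order) * p for f, p in zip(freqs_hz, power)) / total

    mu2 = central_moment(2)
    sd = math.sqrt(mu2)
    skewness = central_moment(3) / mu2 ** 1.5
    kurtosis = central_moment(4) / mu2 ** 2 - 3.0  # excess kurtosis (assumed convention)
    return cog, sd, skewness, kurtosis

# Hypothetical symmetric spectrum peaking at 5 kHz.
cog, sd, skewness, kurtosis = spectral_moments([4000.0, 5000.0, 6000.0], [1.0, 2.0, 1.0])
```

A spectrum that is symmetric about its peak, as in this toy example, has zero skewness; energy concentrated above or below the centre of gravity shifts the skewness away from zero.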
2.4.3. Amplitude measures of vowels
2.4.3.1. Vowel root-mean-square amplitude
The ARMS was measured as it has been used frequently to investigate spectral amplitude of vowels 69. The following vowels were edited from the Rainbow Passage: /ɐː/ in 'arch', /ɪ/ in 'many', and /ʊ/ in 'two', which represent primary cardinal vowels with the highest and most forward tongue position (/ɪ/), highest and most backward tongue position /ʊ/, and lowest tongue position (/ɐː/) 35. A previous study has shown that the RMS amplitude was different across vowels when put in context (connected speech) with /ɐː/ presenting the highest RMS amplitude whilst /ɪ/ and /ʊ/ have the lowest RMS amplitude 69. Using these vowels would help clarify whether different vowel amplitudes were affected similarly or differently by the masks. In addition, these three vowels are also produced with different levels of lip rounding and protrusion 35 which might also be affected by the KN95 mask because of its tighter levels of fitting than a standard surgical mask. The vowels were extracted in Praat by listening and identifying waveform and spectrogram characteristics associated with the required vowel. For each vowel, the middle 50ms was extracted and analysed for ARMS using the same protocols as mentioned above for fricatives.
2.4.3.2. Amplitude of the first three formants with formant frequency
Although the amplitude of vowels in context can be quantified using both RMS and formant amplitude 69, the RMS gives an overall vowel amplitude rather than the amplitude of specific formants. Therefore, this study measured the amplitude of the first three formants (henceforth A1, A2, and A3) from the above-mentioned vowels using a MATLAB code called VoiceSauce 70,71 employing the Snack Sound Toolkit 72. Our previous observations indicated that facemasks impacted the high-frequency ranges above 1 kHz 28; therefore, the first formant was not expected to be affected. However, the amplitude of this formant (A1) was included so that between-formant cross-reference could be made if formant amplitude is normalized to eliminate between-speaker variability. The signals were first down-sampled to 16 kHz and all measurements were implemented automatically every 1 millisecond (ms) for voiced segments with a window length of 25 ms 40. The highest formant amplitude within this window length was obtained. Settings were as follows: min F0 = 75 Hz and max F0 = 400 Hz; pre-emphasis = 0.96; and Linear Predictive Coding (LPC) order = 12. Data points with zero values were deleted.
Frequencies of the first three formants (F1, F2, and F3) were measured in Praat to present the range of specific formant frequencies of the formants used in amplitude measurements. Formant frequencies were measured using Praat from the middle 30ms of the vowels where the vowel production was the most stable 73. Formant settings followed default in Praat including: Maximum formant = 5500Hz, number of formants = 4, window length = 25ms, dynamic range = 30dB, and dot size = 1.0mm 73.
2.4.4. Reliability of acoustic analyses
A co-author repeated the file editing and measurement process on both /s/ and /ʃ/. Table 1 shows results of intraclass correlation coefficients calculated for the two fricative consonants in the no-mask condition.
Table 1. Intraclass correlation coefficient (ICC) for inter-rater reliability of spectral measures for the two fricatives in the no-mask condition (SM, Single Measures; AM, Average Measures). All p values were < 0.001.
| Spectral measures | Measures | /s/ ICC | /ʃ/ ICC |
|---|---|---|---|
| Root-mean-square | SM | .997 | .999 |
| Root-mean-square | AM | .998 | 1.000 |
| Centre of gravity | SM | .998 | .996 |
| Centre of gravity | AM | .999 | .998 |
| Standard Deviation | SM | .987 | .966 |
| Standard Deviation | AM | .993 | .983 |
| Skewness | SM | .968 | .991 |
| Skewness | AM | .984 | .995 |
| Kurtosis | SM | .991 | .998 |
| Kurtosis | AM | .996 | .999 |
A co-author also repeated measurements of formants F1 and F2 of the three vowels mentioned above for all three conditions in 50% of the participants. In total, 72 repeats were implemented (n = 8 participants × 3 conditions × 3 vowels = 72). ICC was calculated for inter-rater reliability and the results are presented in Table 2, which shows good to excellent agreement between the two raters for these acoustic measures.
Table 2. Intraclass correlation coefficient (ICC) for inter-rater reliability of formant measurement (SM, Single Measures; AM, Average Measures)
| Formants | Measures | ICC | p |
|---|---|---|---|
| F1 | SM | 0.832 | < 0.001 |
| F1 | AM | 0.908 | < 0.001 |
| F2 | SM | 0.899 | < 0.001 |
| F2 | AM | 0.947 | < 0.001 |
2.5. Statistical analyses
Data were managed in Microsoft Excel 365 74 and analysed using IBM SPSS Statistics v.24.0 75 and Prism v8.1.2 76. One-way repeated-measures analysis of variance (ANOVA) was used to examine the effects of masks on acoustic measures. Significant main effects were further analysed using pairwise comparisons with Bonferroni-adjusted p values. Prior to analyses, normal distribution of the data was examined using Kolmogorov-Smirnov tests 77. Mauchly’s test of sphericity was performed before ANOVA and, if sphericity assumptions were not met, a Greenhouse-Geisser adjustment was used. Effect size was calculated using partial eta squared (η²). Effect sizes of 0.01, 0.1, and 0.25 indicated small, medium, and large effects, respectively 78. If the normality assumption was not met, the non-parametric Friedman test was used to compare data across non-mask, KN95, and surgical mask conditions. Pearson’s correlation coefficient (r) was used to calculate the correlation between acoustic data and perceptual rating of speech clarity, in which r = 0.1, 0.3, and 0.5 indicated small, medium, and large effects, respectively 79. Multivariate linear regression was used to examine acoustic predictors of speech clarity. In all statistical calculations, a significance level of 0.05 was used (two-tailed). Where there were multiple calculations, Bonferroni adjustment was applied to p values.
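The Bonferroni adjustment mentioned above multiplies each p value by the number of comparisons and caps the result at 1. The sketch below shows that arithmetic for three pairwise comparisons; the function name and the p values are hypothetical and are not results from this study.

```python
def bonferroni_adjust(p_values):
    """Multiply each p value by the number of comparisons, capping at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

ALPHA = 0.05  # two-tailed significance level used in the study

# Hypothetical unadjusted p values for three pairwise comparisons
# (no mask vs surgical, no mask vs KN95, surgical vs KN95).
raw_p = [0.01, 0.04, 0.5]
adjusted_p = bonferroni_adjust(raw_p)
significant = [p < ALPHA for p in adjusted_p]
```

Comparing adjusted p values with 0.05 is equivalent to comparing the unadjusted values with 0.05 divided by the number of comparisons.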