Acoustic stimuli
All the acoustic recordings were obtained from a public German voice database, Saarbrueken Voice Database27. From the recordings of over 1600 speakers available in the database, we selected recordings with speakers with dysphonic voice under three selected diagnoses (n: 157): vocal fold paralysis (rekurrensparese, n:145), hypofunctional dysphonia (hypofunktionelle dysphonie, n: 10), and hypotonic dysphonia (hypotone dysphonie, n: 2). These diagnoses are often associated with breathiness.28,29
Each recording session in Saarbrueken Voice Database includes twelve sustained vowel segments (three vowels—/a/, /i/, and /u/—and four pitch profiles—normal, high, low, low-high-low). The vowel segments are pre-curated to exclude the onsets and offsets of the phonations. For this study, three vowel segments in normal pitch were retrieved from the database as WAV files. The file comprises signed 16-bit samples at the sampling rate of 50000 samples-per-second (S/s). The total of 630 WAV files were considered. These files include multiple recordings of some speakers (210 recording sessions by the 157 speakers). Eighty-six (86) speakers have at least two recordings, and the most frequent speaker was recorded six times. The interval between repeated recordings is at least one week apart and as far apart as five years.
Stimuli Screening Process
As the study focuses on the breathiness, vowel segments with irregular vocal fold vibration or roughness perception were excluded with a four-step screening process. The first two steps were objective and based on the fundamental frequency (\({f}_{0}\)) estimations of the Praat software30 (v6.1.38). Severely irregular vibration of vocal folds causes \({f}_{0}\) estimator to fail as the signal is mostly—if not entirely—nonharmonic. If voice is breathy and not rough, the vibration remains periodic but expected to contain increased turbulent noise. Accordingly, a vowel segment was excluded if Praat fails to estimate \({f}_{0}\) at any point within the segment. This step eliminated 24 vowel segments.
Second objective exclusion condition was the presence of subharmonics, which is another indicator of roughness31. Subharmonics were detected by a custom Python program. When subharmonics are present, Praat could erroneously detect the lowest subharmonic frequency as the \({f}_{0}\), reporting by 50% (period-2) or 33% (period-3) of the speaking \({f}_{0}\). Since the subharmonics, however, is seldom present for the entire phonation32, the partial presence of the subharmonics can be identified by detecting a sudden change in the \({f}_{0}\) estimates. The Python analysis checked for a sudden change in Praat’s \({f}_{0}\) estimates over short analysis windows, denoted by \({f}_{0,n}\), for the condition:
$$\frac{\left|{f}_{0,n}-{f}_{0,n+1}\right|}{{f}_{0,n}+{f}_{0,n+1}}>\frac{1}{6}$$
This ratio is near zero if no change in harmonic structure in two successive analysis windows or is 1/3 if switches between no subharmonics to period-2 subharmonics. This constitutes intervals that are stable in harmonic structure, and the intervals with the highest \({f}_{0}\) average are considered to have no subharmonics. Based on this rule, a vowel segment was excluded if more than 5% of its analysis windows contained subharmonics. This step eliminated 70 vowel segments.
While the objective screening steps successfully removed the severe cases, the vowel segments with mild irregularities required additional subjective screening steps. The third step was the spectrogram evaluation for harmonic irregularities. Praat was often able to estimate \({f}_{0}\) correctly despite mild irregularity in the signals, including the presence of mild subharmonics. Also, the second step could not filter out the subharmonics if they persist throughout the vowel. To detect the obvious missed irregularity, the spectrograms of the selected remaining vowel segments were inspected by the author (T. Ikuma) for strong presence of irregularity. This step only tested those segments with Praat HNR below 15 dB (5 dB below the expected HNR of normophonic speakers33). This step eliminated 6 vowel segments.
The final screening step was conducted by the expert raters as a part of breathiness evaluation (see the proceeding section for the details). The raters were instructed to check two boxes along with filling the breathiness rating for each segment. The first box was checked if the segment is too short to evaluate breathiness confidently, and the second box was checked if roughness was perceived at any point in the segment. A segment was dropped if any rater marked it too short or if two or more raters indicated it rough. The final step reduced the number of vowel segments by 162, including 25 considered too short for perceptual assessment.
After the screening process, 368 vowel segments (58%) were included in the study from 123 speakers (78%). Of 123 speakers, all 3 vowels were collected from 70 speakers. The study dataset and the speaker demographics are summarized on Table 1. Gender and age information are presented as reference and were not analyzed.
Table 1
Study Sample (after screening out spectral harmonic irregularity or perceptual roughness)
Vowel
|
No. Segments
|
No. Speakers
|
Gender
|
|
Age
|
Female
|
Male
|
< 35
|
35–55
|
> 55
|
/a/
|
103
|
88
|
67
|
21
|
|
9
|
33
|
46
|
/i/
|
120
|
93
|
65
|
29
|
|
12
|
33
|
49
|
/u/
|
145
|
114
|
74
|
40
|
|
13
|
42
|
59
|
Total
|
368
|
123*
|
82
|
41
|
|
14
|
45
|
64
|
*Out of 123 speakers, all 3 vowels were available from 70 speakers.
|
Perceptual Rating Of Pathological Voice Recordings
To establish reference score for breathiness, all of the vowel segments were perceptually rated by four (4) speech language pathologists. Each rater was given an Excel spreadsheet to perform the rating. Each row of the spreadsheet had a play button to play a voice recording, a checkbox for insufficient duration, another checkbox for roughness, and a pair of slider and editable cell for breathiness rating. The checkboxes were for the screening step described earlier. A spreadsheet was machine-generated for each rater with rows of the vowel segments in randomized order.
The raters were asked to fill in the form in a quiet room, listening to each recording as many times as needed. For each recording, they were asked to check the roughness or insufficient-duration checkboxes for the screening purpose (as described above) and enter a breathiness rating between 0 (normal) and 100 (most severe) either with a scrollbar or directly typing in the score. The entered value was immediately reflected on both the scrollbar and editable cell. They were allowed to revisit and rescore the recordings as felt necessary.
Once all the rating sheets were returned, a combined breathiness score for each vowel segment was obtained as follows. First, the scores from each rater were rescaled to match the score ranges of all the raters. This normalization was necessary because no reference voice stimuli were given to the raters. These rescaling anchor points were selected for each rater based on the vowel segments which were in the bottom three (3) percentile of all four raters for the least-breathy anchor, and top three percentile for the most-breathy anchor. The anchor scores of each rater were their median scores of the common vowel segments. Each rater’s scores were adjusted linearly so that the bottom anchor score maps to zero (0) and the top anchor score to 100. The adjusted breathiness scores were then averaged across the raters to form the combined scores to be compared against the objective parameters. The averaged adjusted rating is hereafter referred to as Adjusted Breathiness Rating (ABR).
Short-time Acoustic Signal Model
Acoustic parameters for a short segments of recordings were measured with a time-varying harmonic model (HM)34:
$$x\left(t\right)={\sum }_{p=0}^{P}{a}_{p}\left(t\right)\text{cos}\left(2\pi p{f}_{0}\left(t\right)\right)+{\sum }_{p=1}^{P}{b}_{p}\left(t\right)\text{sin}\left(2\pi p{f}_{0}\left(t\right)\right)+v\left(t\right),$$
where \(P\) is the harmonic order, \({a}_{p}\left(t\right)\) and \({b}_{p}\left(t\right)\) are the time-varying amplitudes, \({f}_{0}\left(t\right)\) is the time-varying fundamental frequency, and \(v\left(t\right)\) comprises nonharmonic components, most of which were the glottal turbulence noise. Here, \({a}_{p}\left(t\right)\) and \({b}_{p}\left(t\right)\) were modeled with cubic polynomials while \({f}_{0}\left(t\right)\) was modeled with a quadratic polynomial. These time-varying parameters enabled the harmonic model to capture slow variations in frequency and amplitude, resulting in more accurate harmonic power estimation of the voice signals. The number of harmonics \(P\) was dependent on \({f}_{0}\) (populating up to the Nyquist frequency, 8000 Hz), and each signal segment was modeled with (\(8P+11\)) coefficients and the residual as the turbulent noise.
For each fitted model, the average \({f}_{0}\) in Hz is given by
$${\stackrel{-}{f}}_{0}=\frac{1}{T}{\int }_{0}^{T}{f}_{0}\left(t\right)dt,$$
the \(p\)th harmonic power is given by
$${H}_{p}=\frac{1}{2T}{\int }_{0}^{T}\left({a}_{p}^{2}\left(t\right)+{b}_{p}^{2}\left(t\right)\right)dt,$$
and the noise power over the \(p\)th harmonic range—i.e., between \(\left(p-0.5\right){\stackrel{-}{f}}_{0}\) Hz to \(\left(p+0.5\right){\stackrel{-}{f}}_{0}\) Hz—is given by
$${N}_{p}=\frac{1}{{\stackrel{-}{f}}_{0}}{\int }_{\left(p-0.5\right){\stackrel{-}{f}}_{0}}^{\left(p+0.5\right){\stackrel{-}{f}}_{0}}{\left|V\left(f\right)\right|}^{2}df.$$
Here, \(V\left(f\right)\) is the Fourier transform of \(v\left(t\right)\) and was estimated by the discrete Fourier transform (DFT).
To obtain the models, the acoustic recordings were first resampled to 16000 S/s by FFmpeg (v5.0). Each waveform was split into nonoverlapping 50-millisecond frames (\(T=0.05\), 800 samples per frame) and a set of model coefficients were estimated in the least squares sense with an iterative algorithm34. The algorithm received the \({f}_{0}\) estimate by Praat35 (via Python package, Parselmouth36, v0.4.1) as the initial \({f}_{0}\).
Spectral Segmentation
Signal analyses were conducted in four nonoverlapping spectral bands, generically denoted by \({b}_{i}\), \(i=0, 1, 2, 3\). Each frequency band was assigned to cover specific formants to minimize the effect of the vocal tract on the acoustic voice parameters. Four distinctive spectral band designs were investigated in the study. The first was from de Krom’s study6 for the vowel /a/ while the other three, one for each vowel, were designed to isolate each formant. De Krom’s principle of the spectral band design could not be applied to other vowels. The problem lied in the 60–400 Hz (below the first formant) \({b}_{0}\) band. Adjusting this band for /i/ and /u/ lowers the upper bound to around 200 Hz, which is near, possibly below, \({f}_{0}\). If \({f}_{0}>200\) Hz, some parameters could not be evaluated in this band. A modified design, referred to as the “per-formant” bands, was considered in this study to circumvent the issue. De Krom’s \({b}_{0}\) band was excluded, and his \({b}_{1}\) band was split into two separate bands (alternate \({b}_{0}\) and \({b}_{1}\) bands): the alternate \({b}_{0}\) covers the first formant, and the alternate \({b}_{1}\) covers the second formant. This separation was intended to explore potential differences in how breathiness affects acoustic signals in the two formant regions. The upper two bands \({b}_{2}\) and \({b}_{3}\) remained the same as the de Krom’s bands. With the modified spectral band design, the per-formant band edge frequencies were defined as according to the distributions of the formant frequencies of German vowels37. The edge frequencies of the per-formant bands as well as de Krom’s bands are tabulated in Table 2.
Table 2
Boundary frequencies of spectral bands under four different designs.
Type
|
Vowel
|
|
—— \({{b}}_{0}\)——
|
——\({{b}}_{1}\)——
|
——\({{b}}_{2}\)——
|
——\({{b}}_{3}\)——
|
|
de Krom6
|
/a/
|
60 Hz
|
400 Hz
|
2000 Hz
|
5000 Hz
|
8000 Hz
|
per-formant
|
/a/
|
400 Hz
|
1000 Hz
|
2000 Hz
|
5000 Hz
|
8000 Hz
|
/i/
|
200 Hz
|
1100 Hz
|
2400 Hz
|
5000 Hz
|
8000 Hz
|
/u/
|
200 Hz
|
600 Hz
|
1700 Hz
|
5000 Hz
|
8000 Hz
|
Acoustic Parameter Measurements
This study evaluated two broad types of parameters: intra-band and inter-band. The former consists of the harmonics-to-noise ratio (HNR) measurements by four (4) different algorithms. The inter-band parameters assess the differences between two neighboring bands either in harmonic power (harmonics-to-harmonics ratio, HHR), in noise power (noise-to-noise ratio, NNR), and in waveform envelope shapes (GNE ratio). In addition, the fundamental frequency \({f}_{0}\) and the ABI were obtained from Praat and were included as a supplemental parameter and reference parameter, respectively. These parameters are summarized in Table 3.
Table 3
List of Spectral Parameters
Name
|
Units
|
Description
|
Ref’s
|
type: intra-band (harmonics-to-noise ratios)
|
HNR (HM)
|
dB
|
HNR over a frequency band using the harmonic model
|
34
|
HNR (de Krom)
|
dB
|
HNR over a frequency band using cepstrum-based method
|
6,26
|
HNR (Shama)
|
dB
|
HNR over a frequency band by spectral segmentation
|
38
|
HNR (Qi)
|
dB
|
HNR over a frequency band using modified cepstrum-based method
|
39
|
type: inter-band
|
HHR
|
dB
|
Ratio of harmonic power in adjacent bands (harmonic model)
|
–
|
NNR
|
dB
|
Ratio of noise power in adjacent bands (harmonic model)
|
–
|
GNE Ratio
|
–
|
Glottal-to-noise excitation ratio (GNE) between 2 adjacent bands
|
17
|
Slope
|
dB
|
Ratio of total signal power in adjacent bands
|
6
|
type: supplementary
|
\({{f}}_{0}\)
|
Hz
|
Fundamental frequency estimated by Praat software
|
35
|
ABI
|
–
|
Acoustic breathiness index
|
22
|
Note: HNR (HM), HNR (de Krom), HNR (Shama), and HNR (Qi) indicate the HNRs evaluated by the algorithms specified in the parenthesis. HM: Harmonic Model.
|
The primary parameters of interest are those based on the harmonic models: the intra-band HNRs, inter-band HHRs, and inter-band NNRs. These ratios are computed from the harmonic and noise power measurements, \({H}_{p}\) and \({N}_{p}\) in each target frequency bands. For each of target bands \({b}_{i}\), \(i=0, 1, \text{2,3}\), let \({\mathcal{P}}_{i}\) be a set of the indices of all the harmonics residing in \({b}_{i}\). Then, the model-based parameters were measured as follows:
If \({\mathcal{P}}_{0}\) contains all the harmonics of the observed signal, then the corresponding \({\text{HNR}}_{{b}_{0}}\) is the overall HNR parameter. The HHR can be seen as a generalization of existing harmonic ratio parameters such as H1-H2 and H1-A1. Measurements of the above parameters were recorded in decibels (dB, \(10{\text{l}\text{o}\text{g}}_{10}\)), and the inter-band parameters are shorthanded as \({\text{HHR}}_{\text{low}}\triangleq {\text{HHR}}_{{b}_{0}{b}_{1}}\), \({\text{HHR}}_{\text{mid}}\triangleq {\text{HHR}}_{{b}_{1}{b}_{2}}\), \({\text{HHR}}_{\text{high}}\triangleq {\text{HHR}}_{{b}_{2}{b}_{3}}\), \({\text{NNR}}_{\text{low}}\triangleq {\text{NNR}}_{{b}_{0}{b}_{1}}\), \({\text{NNR}}_{\text{mid}}\triangleq {\text{NNR}}_{{b}_{1}{b}_{2}}\), and \({\text{NNR}}_{\text{high}}\triangleq {\text{NNR}}_{{b}_{2}{b}_{3}}\) to be consistent with the de Krom’s spectral slope nomenclature6.
In addition to the HHRs and NNRs, the glottal-to-noise excitation (GNE) ratios17 were computed between frequency bands (\({\text{GNE}}_{\text{low}}\), \({\text{GNE}}_{\text{mid}}\), and \({\text{GNE}}_{\text{high}}\)). The inter-band GNE ratio is the maximum cross-correlation coefficient of the envelopes of bandpass-filtered signals. Bandpass filtering was performed in the frequency domain according to the defined frequency bands. The GNE ratio computed in this study is a partial application of the original GNE ratio by Michaelis, et al. The original GNE ratio is the maximum of the inter-band GNE ratios between many overlapping bands with a fixed bandwidth.
The rest of the parameters were included as reference. First, three additional algorithms to compute the HNR were included. Shama’s HNR38 is an extension of Kasuya’s NNE40. The magnitude samples of the discrete Fourier transform (DFT) frequency bins were labeled as either harmonic or noise based on the main lobe width of the analysis window function, and the HNR is the ratio of the sum of the harmonic bins and that of the noise bins. De Krom’s HNR26 avoids the frequency segmentation by estimating the noise spectrum with the comb-liftering operation in the ceptrum domain. Fourth, Qi HNR39 is a modification of the de Krom’s HNR by low-pass liftering and simplified noise level sampling. Finally, de Krom’s spectral slopes6 were recorded as the reference inter-band measure. The spectral slope is the ratio of the overall signal power in two frequency bands obtained directly from the DFT samples. All these reference parameters were measured with the Praat’s \({f}_{0}\) estimates.
All the short-time measurements were calculated with 50-ms non-overlapping window, and the average over all the windows was recorded for each vowel segment. Finally, per-speaker measurements were formed by averaging the per-recording averages for each speaker for each vowel. The measurements in dB were averaged in the dB scale. The averaging was done to address the speaker effect.
For the ABI and its member-parameters, Barsties et al. provided the Praat script as supplemental material22. The original script was modified to use only sustained vowel recording. The ABI officially requires 3-second acoustic data from each of sustained /a/ and concatenated voiced segments of continuous speech. Technically, the dataset in this study does not conform to the ABI protocol because of the lack of continuous speech data and the shortness of sustained vowels (all shorter than 3 seconds). To better gauge the effectiveness of the ABI, measurements of its nine parameters were also recorded to recalibrate the ABI model coefficients for the study dataset.
Statistical analysis
All statistical analyses were conducted in the R (v4.2.1) software environment.
The interrater reliability was tested with the intraclass correlation (ICC) measure41. The ICC estimates and their 95% confident intervals (CIs) were calculated with irr package based on a mean-rating, agreement, two-way random-effect model. The ICC estimates were computed on the raw breathiness ratings as well as the adjusted breathiness ratings.
Inter-vowel differences of the averaged adjusted breathiness rating (ABR) were tested at Type I error rate 0.05 with the matched dataset (i.e., 70 speakers, each with all 3 vowels) using the non-parametric Friedman test followed by the post-hoc pairwise comparison42.
The relationships between spectral parameters and the ABR were assessed using simple and multiple linear regression models. The adjusted coefficients of determination (adjusted \({R}^{2}\) or \({R}_{\text{adj}}^{2}\)) of the fitted models were used as the performance indicator of the linear models. A simple linear model was used to predict individual spectral parameter by the ABR. This model provides the expected ranges of each spectral measure over the breathiness rating range (0 to 100).
Ability of a set of spectral parameters to predict the ABR was modeled by multiple linear regression: \(\text{ABR}={\beta }_{0}+{\sum }_{i}^{ }{\beta }_{i}{Z}_{i}+ϵ\) where \({Z}_{i}\) denotes the measurement of the \(i\)th parameter, \({\beta }_{i}\) denotes the regression coefficient for the \(i\)th parameter, and \(ϵ\) is the model error. All the possible subset models were evaluated (16363 models with unique combinations of up to 14 variables) However, only the 12910 models were considered, comprising up to 8 parameters because the per-vowel dataset sizes could not adequately support the full parameter space.
Along with identifying the best performing model at each model order, importance of parameters was assessed based on how frequently they were used in all the models within 99% of the best \({R}_{\text{adj}}^{2}\) at each model order. These models are referred to as the top-1% models. The importance was determined by two metrics: the average of the parameter’s contribution to each model’s \({R}_{\text{adj}}^{2}\) (%Contribution) and the percent of used in models (%Usage). The contribution of a parameter was determined by the increase in \({R}_{\text{adj}}^{2}\) when a model was constructed by adding one parameter at a time from the most important to least.
The regression analyses were first applied to the spectral parameters of /a/ computed with de Krom’s frequency bands and all the HNR algorithms. The analyses were next applied to the spectral measurements of all vowels with per-formant frequency bands and the model-based parameters and GNE ratios, both over the full datasets and speaker-matched datasets. The latter set was used to compare \({R}_{\text{adj}}^{2}\) between vowels.