Formant-Aware Spectral Analysis of Sustained Vowels of Pathological Breathy Voice

DOI: https://doi.org/10.21203/rs.3.rs-2588358/v1

Abstract

Objectives. This paper reports the effectiveness of formant-aware spectral parameters in predicting the perceptual breathiness rating. Breathy voice has a stronger first harmonic, a steeper spectral slope, and more turbulent noise than normal voice. Measuring spectral parameters of the acoustic signal over formant regions is a known approach to capturing the properties related to breathiness. This study examines this approach by testing the contemporary spectral parameters and algorithms within the framework, alternate frequency band designs, and vowel effects.

Methods. Sustained vowel recordings (/a/, /i/, and /u/) of speakers with voice disorders in the German Saarbrueken Voice Database were considered (n = 368). Recordings with spectral irregularity or perceived roughness were excluded from the study. Four speech-language pathologists perceptually rated the recordings for breathiness on a 100-point scale, and their averages were used as the ratings of the recordings. The acoustic spectra were segmented into four frequency bands according to the vowel formant structures. Five different spectral parameters were considered in each band or between bands (13 total, plus the fundamental frequency) to predict the perceptual breathiness rating.

Results. Linear combinations of spectral parameters, led by the formant-focused harmonics-to-noise ratios (HNRs), were shown to explain up to 85% of the variance in perceptual breathiness ratings of disordered voice. This performance exceeded that of the Acoustic Breathiness Index (82%). Also, the best performing parameter (the HNR over the first two formants, 78%) explained more variance in breathiness than the smoothed cepstrum peak prominence (74%). Some vowel effects were observed in the perceptual rating (higher for /u/), in predictability (5% lower for /u/), and in model parameter selections.

Conclusions. Strong breathiness correlates were found by segmenting the spectrum to isolate the portion most affected by breathiness.

Introduction

Breathiness is one of the primary perceptual voice qualities assessed during clinical examination. There are standards for evaluating breathiness in voice, such as the Grade, Roughness, Breathiness, Asthenia, and Strain (GRBAS) scale1 and the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V)2. Perceptual evaluation, however, is subject to large variation in rating reliability (see review papers3,4 and references therein). The inherent variability of subjective ratings has motivated the investigation of acoustic correlates of breathiness and other perceptual qualities to objectify the assessment, as reviewed and meta-analyzed by Barsties v. Latoszek et al.5.

The present study revisits and expands the formant-based spectral correlates of breathiness by de Krom6 and investigates the effects of algorithm, vowel, and frequency selections. Pathological breathy voice quality is a manifestation of turbulence caused by air flow through insufficiently adducted vocal folds during phonation. It has the following four spectral characteristics7: (1) intensification of the first harmonic as the open quotient increases, (2) replacement of the higher harmonics by turbulent (aspiration) noise, (3) reduction of lower formant bandwidths, and (4) alteration of the frequency response of the vocal tract (additional poles and zeros) due to tracheal coupling. De Krom’s study6 employed spectral parameters covering the above characteristics, including the harmonics-to-noise ratios (HNRs) over prescribed frequency bands. His frequency band design was borrowed from the long-time-average-spectrum (LTAS) analysis methods of connected speech8,9, and the bands’ frequency coverages were defined as below the first formant (60–400 Hz), over the first two formants (400–2000 Hz), above the second formant with mixed harmonics and noise (2–5 kHz), and the highest band predominantly occupied by noise (5–8 kHz). Notably, the HNRs over 400–2000 Hz and 2–5 kHz were found to best explain the variability in breathiness.

De Krom’s results align well with the study by Shrivastav and Sapienza10, in which breathiness was predicted by the partial loudness of the harmonic energy (i.e., the perceived loudness of harmonics against the background noise) and the loudness of aspiration noise. The changes in the partial loudness pattern of the signal were most prominent between 400 and 2000 Hz, while much of the change in the specific loudness of the noise occurred over the 400 to 5000 Hz range. Shrivastav and Sapienza10 reported that these auditory-loudness-based parameters predicted breathiness better than a linear combination of cepstrum peak prominence (CPP)11, HNR, and shimmer12.

The premise that breathiness represents the replacement of the higher harmonics by turbulent (aspiration) noise prompted studies that investigated only higher frequency components, but with limited success. Hillenbrand et al.11,13 reported that the CPP’s ability to predict breathiness was slightly lower when the acoustic signals were prefiltered (passing between 2.5 and 3.5 kHz or passing above 2.5 kHz) than with the unfiltered signals. The same studies also demonstrated that the prefiltering benefitted the peak-to-average ratio parameters and produced mixed results for the Pearson \(r\) at the correlation peak (beneficial for disordered voice13 but not for imitated breathy voice11). The correlation of the normalized noise energy (NNE) measurement to breathiness diminished when the spectral content below 1 kHz was excluded14.

Likewise, the expected increase of the first harmonic and loss of higher harmonics (i.e., steeper spectral tilt) led to investigations of the relationships between breathiness and various power ratios of selected harmonics: the first harmonic vs. the second (H1-H2, H1/H2)6,10,11,13, the first harmonic vs. the peak harmonic in the first formant (H1-A1)10,15, and the first harmonic vs. the peak harmonic in the third formant (H1-A3)10. These were reported to have poor to moderate correlations.

Recently, Barsties v. Latoszek et al. developed a composite index named the Acoustic Breathiness Index (ABI)16. The ABI is a linear combination of nine existing acoustic parameters: smoothed CPP11,13, jitter, glottal-to-noise excitation (GNE) ratio17, relative energy level of high-frequency noise18, HNR19, H1-H2, shimmer in linear and decibel scales, and period standard deviation20. It explained 71% of the variance of perceived breathiness in the original study in Dutch, with comparable results in follow-up cross-language validations (68% in Spanish21, 80% in Japanese22, 72% in German23, 56% in Brazilian Portuguese24, and 76% in Korean25). While it attains a reasonable performance in most languages, the index was developed by selecting a pool of 28 existing acoustic parameters with potential and reducing the parameter space by stepwise multiple linear regression. As a result, how the different sources of breathiness perception (e.g., which of the four spectral characteristics) contribute to the index is not clearly conveyed by the ABI’s construction.

The present study aimed to draw a clearer picture of how different frequency contents contribute to breathiness perception. Specifically, the spectral correlates under study include the harmonic and noise power relationships both within and between frequency bands and the similarities of the envelope waveforms of frequency bands using the glottal-to-noise excitation (GNE) ratio concept. The study investigated their relationships to breathiness individually as well as collectively. In addition, the following four questions were posed. (1) Would newer HNR estimation algorithms improve on the breathiness predictability of de Krom’s cepstrum-assisted HNR algorithm26? (2) Does the vowel (/a/, /i/, or /u/) affect the breathiness predictability? (3) How does the frequency band design influence the predictability? (4) How does the performance compare to the overall HNR and the ABI?

Material And Methods

Acoustic stimuli

All the acoustic recordings were obtained from a public German voice database, the Saarbrueken Voice Database27. From the recordings of over 1600 speakers available in the database, we selected recordings of speakers with dysphonic voice under three diagnoses (n = 157): vocal fold paralysis (Rekurrensparese, n = 145), hypofunctional dysphonia (hypofunktionelle Dysphonie, n = 10), and hypotonic dysphonia (hypotone Dysphonie, n = 2). These diagnoses are often associated with breathiness.28,29

Each recording session in the Saarbrueken Voice Database includes twelve sustained vowel segments (three vowels, /a/, /i/, and /u/, crossed with four pitch profiles: normal, high, low, and low-high-low). The vowel segments are pre-curated to exclude the onsets and offsets of the phonations. For this study, the three vowel segments in normal pitch were retrieved from the database as WAV files. Each file comprises signed 16-bit samples at a sampling rate of 50000 samples per second (S/s). A total of 630 WAV files were considered. These files include multiple recordings of some speakers (210 recording sessions by the 157 speakers). Eighty-six (86) speakers had at least two recordings, and the most frequent speaker was recorded six times. The interval between repeated recordings ranged from one week to five years.

Stimuli Screening Process

As the study focuses on breathiness, vowel segments with irregular vocal fold vibration or perceived roughness were excluded through a four-step screening process. The first two steps were objective and based on the fundamental frequency (\({f}_{0}\)) estimations of the Praat software30 (v6.1.38). Severely irregular vibration of the vocal folds causes the \({f}_{0}\) estimator to fail because the signal is mostly, if not entirely, nonharmonic. If a voice is breathy but not rough, the vibration remains periodic but is expected to contain increased turbulent noise. Accordingly, a vowel segment was excluded if Praat failed to estimate \({f}_{0}\) at any point within the segment. This step eliminated 24 vowel segments.

The second objective exclusion condition was the presence of subharmonics, which is another indicator of roughness31. Subharmonics were detected by a custom Python program. When subharmonics are present, Praat may erroneously detect the lowest subharmonic frequency as the \({f}_{0}\), reporting 50% (period-2) or 33% (period-3) of the speaking \({f}_{0}\). Since subharmonics are seldom present for the entire phonation32, however, their partial presence can be identified by detecting a sudden change in the \({f}_{0}\) estimates. The Python analysis checked Praat’s \({f}_{0}\) estimates over short analysis windows, denoted by \({f}_{0,n}\), for the condition:

$$\frac{\left|{f}_{0,n}-{f}_{0,n+1}\right|}{{f}_{0,n}+{f}_{0,n+1}}>\frac{1}{6}$$

This ratio is near zero if the harmonic structure does not change between two successive analysis windows and is 1/3 if the signal switches between no subharmonics and period-2 subharmonics. The detected changes delimit intervals with stable harmonic structure, and the intervals with the highest average \({f}_{0}\) are considered to have no subharmonics. Based on this rule, a vowel segment was excluded if more than 5% of its analysis windows contained subharmonics. This step eliminated 70 vowel segments.
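A minimal sketch of this screening rule is shown below, assuming the per-window Praat \({f}_{0}\) estimates are already available as a NumPy array. The function name and the simplification of treating every window outside the single highest-\({f}_{0}\) stable interval as subharmonic are illustrative, not the study’s exact implementation.

```python
import numpy as np

def has_subharmonics(f0, jump_threshold=1/6, max_fraction=0.05):
    """Flag a vowel segment whose f0 track suggests subharmonics.

    f0 : per-window fundamental frequency estimates in Hz (from Praat).
    A jump ratio |f0[n] - f0[n+1]| / (f0[n] + f0[n+1]) above 1/6 marks a
    change in harmonic structure (e.g., a switch to period-2 subharmonics).
    """
    f0 = np.asarray(f0, dtype=float)
    jumps = np.abs(np.diff(f0)) / (f0[:-1] + f0[1:])
    change_points = np.flatnonzero(jumps > jump_threshold)

    # Split the track into intervals with stable harmonic structure and
    # treat the interval with the highest mean f0 as subharmonic-free.
    intervals = np.split(np.arange(f0.size), change_points + 1)
    clean = max(intervals, key=lambda idx: f0[idx].mean() if idx.size else -np.inf)

    # Exclude the segment if more than 5% of the windows fall outside
    # the subharmonic-free interval.
    return (f0.size - clean.size) / f0.size > max_fraction
```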

While the objective screening steps successfully removed the severe cases, vowel segments with mild irregularities required additional subjective screening steps. The third step was a spectrogram evaluation for harmonic irregularities. Praat was often able to estimate \({f}_{0}\) correctly despite mild irregularities in the signals, including the presence of mild subharmonics. Also, the second step could not filter out subharmonics that persisted throughout the vowel. To detect obvious missed irregularities, the spectrograms of the remaining vowel segments were inspected by the author (T. Ikuma) for a strong presence of irregularity. This step only examined the segments with a Praat HNR below 15 dB (5 dB below the expected HNR of normophonic speakers33). This step eliminated 6 vowel segments.

The final screening step was conducted by the expert raters as a part of the breathiness evaluation (see the following section for details). The raters were instructed to mark two checkboxes along with the breathiness rating for each segment. The first box was checked if the segment was too short to evaluate breathiness confidently, and the second box was checked if roughness was perceived at any point in the segment. A segment was dropped if any rater marked it as too short or if two or more raters marked it as rough. The final step reduced the number of vowel segments by 162, including 25 considered too short for perceptual assessment.

After the screening process, 368 vowel segments (58%) from 123 speakers (78%) were included in the study. Of the 123 speakers, all 3 vowels were collected from 70 speakers. The study dataset and the speaker demographics are summarized in Table 1. Gender and age information are presented for reference and were not analyzed.

 
Table 1

Study Sample (after screening out spectral harmonic irregularity or perceptual roughness)

| Vowel | No. Segments | No. Speakers | Gender: Female | Gender: Male | Age < 35 | Age 35–55 | Age > 55 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| /a/ | 103 | 88 | 67 | 21 | 9 | 33 | 46 |
| /i/ | 120 | 93 | 65 | 29 | 12 | 33 | 49 |
| /u/ | 145 | 114 | 74 | 40 | 13 | 42 | 59 |
| Total | 368 | 123* | 82 | 41 | 14 | 45 | 64 |

*Out of 123 speakers, all 3 vowels were available from 70 speakers.

Perceptual Rating Of Pathological Voice Recordings

To establish a reference score for breathiness, all of the vowel segments were perceptually rated by four (4) speech-language pathologists. Each rater was given an Excel spreadsheet to perform the rating. Each row of the spreadsheet had a play button to play a voice recording, a checkbox for insufficient duration, another checkbox for roughness, and a slider paired with an editable cell for the breathiness rating. The checkboxes were for the screening step described earlier. A spreadsheet was machine-generated for each rater with the rows of vowel segments in randomized order.

The raters were asked to fill in the form in a quiet room, listening to each recording as many times as needed. For each recording, they were asked to check the roughness or insufficient-duration checkboxes for the screening purpose (as described above) and to enter a breathiness rating between 0 (normal) and 100 (most severe), either with the slider or by typing the score directly. The entered value was immediately reflected on both the slider and the editable cell. They were allowed to revisit and rescore the recordings as they felt necessary.

Once all the rating sheets were returned, a combined breathiness score for each vowel segment was obtained as follows. First, the scores from each rater were rescaled to match the score ranges of all the raters. This normalization was necessary because no reference voice stimuli were given to the raters. The rescaling anchor points were selected for each rater from the vowel segments that fell in the bottom three (3) percentiles for all four raters (the least-breathy anchor) and in the top three percentiles for all four raters (the most-breathy anchor). The anchor scores of each rater were their median scores of these common vowel segments. Each rater’s scores were then adjusted linearly so that the bottom anchor score mapped to zero (0) and the top anchor score to 100. The adjusted breathiness scores were then averaged across the raters to form the combined scores to be compared against the objective parameters. The averaged adjusted rating is hereafter referred to as the Adjusted Breathiness Rating (ABR).
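A minimal sketch of this normalization is given below, assuming the raw ratings are held in a pandas DataFrame with one column per rater and one row per vowel segment; the quantile-based anchor selection is an illustrative reading of the percentile rule above, not the study’s exact code.

```python
import pandas as pd

def adjusted_breathiness_ratings(raw: pd.DataFrame, pct: float = 3.0) -> pd.Series:
    """Rescale each rater's scores to common anchors, then average them (ABR).

    raw : DataFrame of raw 0-100 ratings, one column per rater.
    Anchor segments are those rated in the bottom `pct` percentiles by all
    raters (least breathy) or in the top `pct` percentiles by all raters
    (most breathy).
    """
    low_anchor = raw.le(raw.quantile(pct / 100)).all(axis=1)
    high_anchor = raw.ge(raw.quantile(1 - pct / 100)).all(axis=1)

    adjusted = {}
    for rater in raw.columns:
        lo = raw.loc[low_anchor, rater].median()   # least-breathy anchor score
        hi = raw.loc[high_anchor, rater].median()  # most-breathy anchor score
        adjusted[rater] = (raw[rater] - lo) / (hi - lo) * 100  # map lo->0, hi->100

    return pd.DataFrame(adjusted).mean(axis=1)  # ABR: average across raters
```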

Short-time Acoustic Signal Model

Acoustic parameters for short segments of the recordings were measured with a time-varying harmonic model (HM)34:

$$x\left(t\right)={\sum }_{p=0}^{P}{a}_{p}\left(t\right)\text{cos}\left(2\pi p{\int }_{0}^{t}{f}_{0}\left(\tau \right)d\tau \right)+{\sum }_{p=1}^{P}{b}_{p}\left(t\right)\text{sin}\left(2\pi p{\int }_{0}^{t}{f}_{0}\left(\tau \right)d\tau \right)+v\left(t\right),$$

where \(P\) is the harmonic order, \({a}_{p}\left(t\right)\) and \({b}_{p}\left(t\right)\) are the time-varying amplitudes, \({f}_{0}\left(t\right)\) is the time-varying fundamental frequency, and \(v\left(t\right)\) comprises the nonharmonic components, most of which are glottal turbulence noise. Here, \({a}_{p}\left(t\right)\) and \({b}_{p}\left(t\right)\) were modeled with cubic polynomials while \({f}_{0}\left(t\right)\) was modeled with a quadratic polynomial. These time-varying parameters enabled the harmonic model to capture slow variations in frequency and amplitude, resulting in more accurate harmonic power estimation of the voice signals. The number of harmonics \(P\) was dependent on \({f}_{0}\) (populating up to the Nyquist frequency, 8000 Hz), and each signal segment was modeled with \(8P+11\) coefficients and the residual as the turbulent noise.

For each fitted model, the average \({f}_{0}\) in Hz is given by

$${\stackrel{-}{f}}_{0}=\frac{1}{T}{\int }_{0}^{T}{f}_{0}\left(t\right)dt,$$

the \(p\)th harmonic power is given by

$${H}_{p}=\frac{1}{2T}{\int }_{0}^{T}\left({a}_{p}^{2}\left(t\right)+{b}_{p}^{2}\left(t\right)\right)dt,$$

and the noise power over the \(p\)th harmonic range (i.e., between \(\left(p-0.5\right){\stackrel{-}{f}}_{0}\) Hz and \(\left(p+0.5\right){\stackrel{-}{f}}_{0}\) Hz) is given by

$${N}_{p}=\frac{1}{{\stackrel{-}{f}}_{0}}{\int }_{\left(p-0.5\right){\stackrel{-}{f}}_{0}}^{\left(p+0.5\right){\stackrel{-}{f}}_{0}}{\left|V\left(f\right)\right|}^{2}df.$$

Here, \(V\left(f\right)\) is the Fourier transform of \(v\left(t\right)\) and was estimated by the discrete Fourier transform (DFT).

To obtain the models, the acoustic recordings were first resampled to 16000 S/s by FFmpeg (v5.0). Each waveform was split into nonoverlapping 50-millisecond frames (\(T=0.05\) s, 800 samples per frame), and a set of model coefficients was estimated in the least-squares sense with an iterative algorithm34. The algorithm received the \({f}_{0}\) estimate from Praat35 (via the Python package Parselmouth36, v0.4.1) as the initial \({f}_{0}\).
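A minimal sketch of a single-frame fit is shown below. It simplifies the study’s iterative algorithm34 by holding \({f}_{0}\) constant at the Praat estimate within the frame; the function name and return values are illustrative.

```python
import numpy as np

def fit_harmonic_model(x, fs, f0, poly_order=3):
    """Least-squares fit of a harmonic model with polynomial amplitudes.

    x  : one 50-ms frame of the resampled (16 kHz) waveform.
    f0 : initial fundamental frequency estimate for the frame (Hz).
    Returns the per-harmonic powers H_p, the noise residual v(t), and P.
    """
    t = np.arange(len(x)) / fs
    P = int(fs / 2 // f0)                                    # harmonics up to Nyquist
    tpow = np.vstack([t**k for k in range(poly_order + 1)])  # 1, t, t^2, t^3

    # Design matrix: polynomial-weighted cosines and sines of each harmonic.
    cols = [tpow * np.cos(2 * np.pi * p * f0 * t) for p in range(P + 1)]
    cols += [tpow * np.sin(2 * np.pi * p * f0 * t) for p in range(1, P + 1)]
    A = np.vstack(cols).T

    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    residual = x - A @ coef                                  # estimate of the noise v(t)

    # Per-harmonic power: time average of (a_p^2(t) + b_p^2(t)) / 2.
    n = poly_order + 1
    H = np.zeros(P + 1)
    for p in range(1, P + 1):
        a_p = tpow.T @ coef[p * n:(p + 1) * n]               # a_p(t)
        b_p = tpow.T @ coef[(P + p) * n:(P + p + 1) * n]     # b_p(t)
        H[p] = np.mean(a_p**2 + b_p**2) / 2
    return H, residual, P
```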

Spectral Segmentation

Signal analyses were conducted in four nonoverlapping spectral bands, generically denoted by \({b}_{i}\), \(i=0, 1, 2, 3\). Each frequency band was assigned to cover specific formants to minimize the effect of the vocal tract on the acoustic voice parameters. Four distinct spectral band designs were investigated in the study. The first was from de Krom’s study6 for the vowel /a/, while the other three, one for each vowel, were designed to isolate each formant. De Krom’s principle of spectral band design could not be applied to the other vowels. The problem lay in the 60–400 Hz (below the first formant) \({b}_{0}\) band. Adjusting this band for /i/ and /u/ lowers its upper bound to around 200 Hz, which is near, and possibly below, \({f}_{0}\). If \({f}_{0}>200\) Hz, some parameters could not be evaluated in this band. A modified design, referred to as the “per-formant” bands, was considered in this study to circumvent the issue. De Krom’s \({b}_{0}\) band was excluded, and his \({b}_{1}\) band was split into two separate bands (alternate \({b}_{0}\) and \({b}_{1}\) bands): the alternate \({b}_{0}\) covers the first formant, and the alternate \({b}_{1}\) covers the second formant. This separation was intended to explore potential differences in how breathiness affects the acoustic signal in the two formant regions. The upper two bands, \({b}_{2}\) and \({b}_{3}\), remained the same as de Krom’s bands. The edge frequencies of the per-formant bands were defined according to the distributions of the formant frequencies of German vowels37. The edge frequencies of the per-formant bands as well as de Krom’s bands are tabulated in Table 2.

 
Table 2

Boundary frequencies of spectral bands under four different designs.

| Type | Vowel | \({b}_{0}\) | \({b}_{1}\) | \({b}_{2}\) | \({b}_{3}\) |
| --- | --- | --- | --- | --- | --- |
| de Krom6 | /a/ | 60–400 Hz | 400–2000 Hz | 2000–5000 Hz | 5000–8000 Hz |
| per-formant | /a/ | 400–1000 Hz | 1000–2000 Hz | 2000–5000 Hz | 5000–8000 Hz |
| per-formant | /i/ | 200–1100 Hz | 1100–2400 Hz | 2400–5000 Hz | 5000–8000 Hz |
| per-formant | /u/ | 200–600 Hz | 600–1700 Hz | 1700–5000 Hz | 5000–8000 Hz |

Acoustic Parameter Measurements

This study evaluated two broad types of parameters: intra-band and inter-band. The former consists of the harmonics-to-noise ratio (HNR) measurements by four (4) different algorithms. The inter-band parameters assess the differences between two neighboring bands in harmonic power (harmonics-to-harmonics ratio, HHR), in noise power (noise-to-noise ratio, NNR), or in waveform envelope shape (GNE ratio). In addition, the fundamental frequency \({f}_{0}\) and the ABI were obtained from Praat and were included as a supplemental parameter and a reference parameter, respectively. These parameters are summarized in Table 3.

 
Table 3

List of Spectral Parameters

| Name | Units | Description | Ref’s |
| --- | --- | --- | --- |
| type: intra-band (harmonics-to-noise ratios) | | | |
| HNR (HM) | dB | HNR over a frequency band using the harmonic model | 34 |
| HNR (de Krom) | dB | HNR over a frequency band using cepstrum-based method | 6,26 |
| HNR (Shama) | dB | HNR over a frequency band by spectral segmentation | 38 |
| HNR (Qi) | dB | HNR over a frequency band using modified cepstrum-based method | 39 |
| type: inter-band | | | |
| HHR | dB | Ratio of harmonic power in adjacent bands (harmonic model) | |
| NNR | dB | Ratio of noise power in adjacent bands (harmonic model) | |
| GNE Ratio | | Glottal-to-noise excitation ratio (GNE) between 2 adjacent bands | 17 |
| Slope | dB | Ratio of total signal power in adjacent bands | 6 |
| type: supplementary | | | |
| \({f}_{0}\) | Hz | Fundamental frequency estimated by Praat software | 35 |
| ABI | | Acoustic breathiness index | 22 |

Note: HNR (HM), HNR (de Krom), HNR (Shama), and HNR (Qi) indicate the HNRs evaluated by the algorithms specified in parentheses. HM: harmonic model.

The primary parameters of interest are those based on the harmonic models: the intra-band HNRs, inter-band HHRs, and inter-band NNRs. These ratios are computed from the harmonic and noise power measurements, \({H}_{p}\) and \({N}_{p}\), in each target frequency band. For each target band \({b}_{i}\), \(i=0, 1, 2, 3\), let \({\mathcal{P}}_{i}\) be the set of the indices of all the harmonics residing in \({b}_{i}\). The intra-band HNR of \({b}_{i}\) is then the ratio of the summed harmonic power to the summed noise power over \({\mathcal{P}}_{i}\),

$${\text{HNR}}_{{b}_{i}}=\frac{{\sum }_{p\in {\mathcal{P}}_{i}}{H}_{p}}{{\sum }_{p\in {\mathcal{P}}_{i}}{N}_{p}},$$

while the inter-band HHR and NNR of two adjacent bands are the corresponding ratios of their summed harmonic powers and of their summed noise powers, respectively.

If \({\mathcal{P}}_{0}\) contains all the harmonics of the observed signal, then the corresponding \({\text{HNR}}_{{b}_{0}}\) is the overall HNR parameter. The HHR can be seen as a generalization of existing harmonic ratio parameters such as H1-H2 and H1-A1. Measurements of these parameters were recorded in decibels (dB, \(10{\text{l}\text{o}\text{g}}_{10}\)), and the inter-band parameters are shorthanded as \({\text{HHR}}_{\text{low}}\triangleq {\text{HHR}}_{{b}_{0}{b}_{1}}\), \({\text{HHR}}_{\text{mid}}\triangleq {\text{HHR}}_{{b}_{1}{b}_{2}}\), \({\text{HHR}}_{\text{high}}\triangleq {\text{HHR}}_{{b}_{2}{b}_{3}}\), \({\text{NNR}}_{\text{low}}\triangleq {\text{NNR}}_{{b}_{0}{b}_{1}}\), \({\text{NNR}}_{\text{mid}}\triangleq {\text{NNR}}_{{b}_{1}{b}_{2}}\), and \({\text{NNR}}_{\text{high}}\triangleq {\text{NNR}}_{{b}_{2}{b}_{3}}\), consistent with de Krom’s spectral slope nomenclature6.
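The following Python sketch illustrates these band measurements, assuming per-harmonic powers `H[p]` and `N[p]` (as defined above) and the band edges of Table 2. The lower-band-over-upper-band orientation of the HHR and NNR ratios here is an assumption made for illustration.

```python
import numpy as np

def band_harmonic_indices(f0_mean, band_edges):
    """Map each band (f_lo, f_hi) to the set P_i of harmonic numbers inside it."""
    p_max = int(band_edges[-1][1] / f0_mean)
    return [[p for p in range(1, p_max + 1) if f_lo <= p * f0_mean < f_hi]
            for f_lo, f_hi in band_edges]

def intra_and_inter_band_ratios(H, N, f0_mean, band_edges):
    """Compute the per-band HNRs and adjacent-band HHRs/NNRs in dB.

    H, N : arrays of per-harmonic powers and noise powers (indexed by p).
    band_edges : e.g., [(400, 1000), (1000, 2000), (2000, 5000), (5000, 8000)].
    The ratio orientation (lower band over upper band) is assumed here.
    """
    bands = band_harmonic_indices(f0_mean, band_edges)
    h_band = [sum(H[p] for p in P_i) for P_i in bands]  # harmonic power per band
    n_band = [sum(N[p] for p in P_i) for P_i in bands]  # noise power per band

    hnr = [10 * np.log10(h / n) for h, n in zip(h_band, n_band)]
    hhr = [10 * np.log10(h_band[i] / h_band[i + 1]) for i in range(len(bands) - 1)]
    nnr = [10 * np.log10(n_band[i] / n_band[i + 1]) for i in range(len(bands) - 1)]
    return hnr, hhr, nnr
```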

In addition to the HHRs and NNRs, the glottal-to-noise excitation (GNE) ratios17 were computed between frequency bands (\({\text{GNE}}_{\text{low}}\), \({\text{GNE}}_{\text{mid}}\), and \({\text{GNE}}_{\text{high}}\)). The inter-band GNE ratio is the maximum cross-correlation coefficient of the envelopes of the bandpass-filtered signals. Bandpass filtering was performed in the frequency domain according to the defined frequency bands. The GNE ratio computed in this study is a partial application of the original GNE ratio by Michaelis et al.17, which is the maximum of the inter-band GNE ratios between many overlapping bands with a fixed bandwidth.
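A minimal sketch of the inter-band GNE ratio is given below, using an FFT-mask band-pass filter and Hilbert envelopes; it is a simplified stand-in for the procedure of Michaelis et al.17, and the function names are illustrative.

```python
import numpy as np
from scipy.signal import hilbert

def bandpass_fft(x, fs, f_lo, f_hi):
    """Zero out spectral content outside [f_lo, f_hi) (frequency-domain filter)."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), d=1 / fs)
    X[(f < f_lo) | (f >= f_hi)] = 0
    return np.fft.irfft(X, n=len(x))

def interband_gne(x, fs, band_a, band_b):
    """Maximum cross-correlation coefficient of the two band envelopes."""
    envelopes = []
    for f_lo, f_hi in (band_a, band_b):
        env = np.abs(hilbert(bandpass_fft(x, fs, f_lo, f_hi)))  # Hilbert envelope
        env = env - env.mean()
        envelopes.append(env / np.linalg.norm(env))             # unit norm
    # Normalized cross-correlation over all lags; its peak is the GNE ratio.
    return float(np.max(np.correlate(envelopes[0], envelopes[1], mode="full")))
```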

The rest of the parameters were included as references. First, three additional algorithms to compute the HNR were included. Shama’s HNR38 is an extension of Kasuya’s NNE40: the magnitude samples of the discrete Fourier transform (DFT) frequency bins are labeled as either harmonic or noise based on the main-lobe width of the analysis window function, and the HNR is the ratio of the sum of the harmonic bins to that of the noise bins. De Krom’s HNR26 avoids the frequency segmentation by estimating the noise spectrum with a comb-liftering operation in the cepstrum domain. Qi’s HNR39 is a modification of de Krom’s HNR with low-pass liftering and simplified noise level sampling. Finally, de Krom’s spectral slopes6 were recorded as the reference inter-band measure. The spectral slope is the ratio of the overall signal power in two frequency bands obtained directly from the DFT samples. All these reference parameters were measured with Praat’s \({f}_{0}\) estimates.

All the short-time measurements were calculated with 50-ms nonoverlapping windows, and the average over all the windows was recorded for each vowel segment. Finally, per-speaker measurements were formed by averaging the per-recording averages of each speaker for each vowel. The measurements in dB were averaged in the dB scale. The averaging was done to address the speaker effect.

For the ABI and its member parameters, Barsties v. Latoszek et al. provided a Praat script as supplemental material22. The original script was modified to use only the sustained vowel recordings. The ABI officially requires 3 seconds of acoustic data from each of a sustained /a/ and the concatenated voiced segments of continuous speech. Technically, the dataset in this study does not conform to the ABI protocol because of the lack of continuous speech data and the shortness of the sustained vowels (all shorter than 3 seconds). To better gauge the effectiveness of the ABI, measurements of its nine parameters were also recorded to recalibrate the ABI model coefficients for the study dataset.

Statistical analysis

All statistical analyses were conducted in the R (v4.2.1) software environment.

The interrater reliability was tested with the intraclass correlation (ICC) measure41. The ICC estimates and their 95% confidence intervals (CIs) were calculated with the irr package based on a mean-rating, agreement, two-way random-effects model. The ICC estimates were computed on the raw breathiness ratings as well as on the adjusted breathiness ratings.

Inter-vowel differences of the averaged adjusted breathiness rating (ABR) were tested at a Type I error rate of 0.05 with the matched dataset (i.e., 70 speakers, each with all 3 vowels) using the non-parametric Friedman test followed by post-hoc pairwise comparisons42.
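A minimal sketch of this comparison is shown below using SciPy (the study’s analyses were run in R); the ABRs of the 70 matched speakers are assumed to be arranged per vowel, and Wilcoxon signed-rank tests with a Bonferroni correction stand in for the cited post-hoc procedure42.

```python
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

def vowel_effect(abr):
    """abr : dict mapping vowel ('a', 'i', 'u') -> ABRs of the same 70 speakers."""
    stat, p = friedmanchisquare(abr["a"], abr["i"], abr["u"])
    print(f"Friedman test: chi2 = {stat:.2f}, p = {p:.2g}")

    pairs = list(combinations(abr, 2))
    for v1, v2 in pairs:  # post-hoc pairwise comparisons, Bonferroni-corrected
        _, p = wilcoxon(abr[v1], abr[v2])
        print(f"/{v1}/ vs /{v2}/: corrected p = {min(p * len(pairs), 1.0):.3g}")
```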

The relationships between spectral parameters and the ABR were assessed using simple and multiple linear regression models. The adjusted coefficients of determination (adjusted \({R}^{2}\) or \({R}_{\text{adj}}^{2}\)) of the fitted models were used as the performance indicator of the linear models. A simple linear model was used to predict individual spectral parameter by the ABR. This model provides the expected ranges of each spectral measure over the breathiness rating range (0 to 100).

The ability of a set of spectral parameters to predict the ABR was modeled by multiple linear regression: \(\text{ABR}={\beta }_{0}+{\sum }_{i}{\beta }_{i}{Z}_{i}+ϵ\), where \({Z}_{i}\) denotes the measurement of the \(i\)th parameter, \({\beta }_{i}\) denotes the regression coefficient of the \(i\)th parameter, and \(ϵ\) is the model error. Of the 16383 possible subset models with unique combinations of up to 14 variables, only the 12910 models comprising up to 8 parameters were considered because the per-vowel dataset sizes could not adequately support the full parameter space.
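A minimal sketch of this all-subsets search is given below, assuming the parameter measurements are in a pandas DataFrame `X` (one column per parameter, including \({f}_{0}\)) and the ABR values in `y`; statsmodels is used here as an illustrative OLS backend (the study’s analyses were run in R).

```python
from itertools import combinations

import pandas as pd
import statsmodels.api as sm

def best_subsets(X: pd.DataFrame, y: pd.Series, max_order: int = 8):
    """Fit ABR ~ every parameter subset up to `max_order`; track adjusted R^2."""
    best = {}  # model order -> (adjusted R^2, parameter names)
    for k in range(1, max_order + 1):
        for subset in combinations(X.columns, k):
            fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
            if fit.rsquared_adj > best.get(k, (-1.0, ()))[0]:
                best[k] = (fit.rsquared_adj, subset)
    return best
```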

Along with identifying the best performing model at each model order, the importance of the parameters was assessed based on how frequently they were used in all the models within 99% of the best \({R}_{\text{adj}}^{2}\) at each model order. These models are referred to as the top-1% models. The importance was determined by two metrics: the average of the parameter’s contribution to each model’s \({R}_{\text{adj}}^{2}\) (%Contribution) and the percentage of models in which the parameter was used (%Usage). The contribution of a parameter was determined by the increase in \({R}_{\text{adj}}^{2}\) when the model was constructed by adding one parameter at a time, from the most important to the least.

The regression analyses were first applied to the spectral parameters of /a/ computed with de Krom’s frequency bands and all the HNR algorithms. The analyses were next applied to the spectral measurements of all vowels with per-formant frequency bands and the model-based parameters and GNE ratios, both over the full datasets and speaker-matched datasets. The latter set was used to compare \({R}_{\text{adj}}^{2}\) between vowels.

Results

Perceptual Ratings

The averaged ICC estimate of the raw breathiness ratings of 368 vowel segments among four raters was 0.84 with the 95% CI of (0.71, 0.90). This level of reliability is considered “moderate” to “good”41. After the rating rescaling, the 95% confidence interval of the ICC improved to (0.80, 0.91), solidly regarded as “good”. This ICC range was considered sufficient to use the averaged adjusted breathiness ratings (ABRs) as the reference for the objective parameters.

The mean (and standard deviation) of the adjusted breathiness rating (ABR) per vowel among the 70 speakers with all three vowels in the dataset were 27.6 (21.3) for /a/, 26.3 (19.8) for /i/, and 33.7 (19.0) for /u/. A significant vowel effect was found (Friedman test, \(p<{10}^{-3}\)), and the post-hoc pairwise comparison42 revealed that the ABR of /u/ was significantly higher than those of both /a/ (\(p<0.01\)) and /i/ (\(p<{10}^{-3}\)).

Baselines: Overall HNR And ABI

To establish the reference points of the \({R}_{\text{adj}}^{2}\) performance, the model-based overall HNR and the ABI were assessed with the /a/ dataset to predict the variations in the ABR. The linear model with the overall HNR predicted 49% of the variance in the ABR (\(p<{10}^{-6}\)). Likewise, the linear model with the ABI (calculated with the default coefficients16) yielded \({R}_{\text{adj}}^{2}=0.81\) (\(p<{10}^{-6}\)).

To optimize the ABI model for the dataset, the multiple linear regression of the nine ABI parameters was evaluated. The refined model explained 82% of the variance in the ABR (\(p<{10}^{-6}\)). The primary parameter of the ABI, the smoothed CPP, alone explained 74% of the ABR variance (\(p<{10}^{-6}\)).

De Krom’s Frequency Band Design And Comparison Of Algorithms

With de Krom’s frequency band design, the /a/ dataset was analyzed with the regression analyses. The target parameters were the four HNR measures (harmonic-model (HM) based, de Krom, Qi, and Shama), the HM-based inter-band measurements (HHRs and NNRs), and the GNE ratios. Table 4 summarizes the findings of the linear regression analysis to explain the variation in the spectral parameters by the adjusted breathiness rating (ABR). The highest \({R}_{\text{adj}}^{2}=0.79\) was found with Shama’s HNR in the \({b}_{1}\) band (400 Hz to 2 kHz), closely followed by the harmonic model (HM) based HNR with \({R}_{\text{adj}}^{2}=0.77\), also in \({b}_{1}\). Both Shama’s and the harmonic model HNRs were also strong in \({b}_{2}\) (2 to 5 kHz), reporting \({R}_{\text{adj}}^{2}\ge 0.6\). In contrast, de Krom’s and Qi’s cepstrum-based HNRs reported \({R}_{\text{adj}}^{2}\) values more than 0.10 lower than the Shama and HM HNRs in the \({b}_{1}\), \({b}_{2}\), and \({b}_{3}\) bands. In the \({b}_{0}\) band (60 to 400 Hz), the HM and Qi HNRs could not be predicted with \(\alpha =0.05\) confidence (\({R}_{\text{adj}}^{2}<0.04\)), while de Krom’s HNR scored \({R}_{\text{adj}}^{2}=0.40\) in \({b}_{0}\), which is close to its \({R}_{\text{adj}}^{2}\) in the \({b}_{2}\) band.

  
Table 4

Linear Regression Results of /a/ Dataset Using de Krom Bands

| Parameter | Band | Algorithm | \({R}_{\text{adj}}^{2}\) | \(\rho\) | 95% CI at ABR = 0 | 95% CI at ABR = 100 | Delta§ | Sig. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HNR | \({b}_{0}\) | HM | | | (33, 35) | (30, 35) | | |
| HNR | \({b}_{0}\) | de Krom | 0.40 | -0.50 | (18, 20) | (8, 11) | -7 | *** |
| HNR | \({b}_{0}\) | Qi | | | (25, 29) | (26, 34) | | |
| HNR | \({b}_{0}\) | Shama | 0.24 | -0.35 | (31, 32) | (27, 29) | -2 | *** |
| HNR | \({b}_{1}\) | HM | 0.77 | -0.81 | (28, 29) | (5, 9) | -19 | *** |
| HNR | \({b}_{1}\) | de Krom | 0.58 | -0.63 | (17, 19) | (3, 7) | -10 | *** |
| HNR | \({b}_{1}\) | Qi | 0.22 | -0.30 | (22, 25) | (9, 16) | -6 | ** |
| HNR | \({b}_{1}\) | Shama | 0.79 | -0.83 | (26, 27) | (2, 6) | -20 | *** |
| HNR | \({b}_{2}\) | HM | 0.60 | -0.82 | (13, 15) | (-5, -1) | -13 | *** |
| HNR | \({b}_{2}\) | de Krom | 0.47 | -0.71 | (8, 9) | (-2, 1) | -6 | *** |
| HNR | \({b}_{2}\) | Qi | 0.34 | -0.5 | (12, 14) | (1, 5) | -7 | *** |
| HNR | \({b}_{2}\) | Shama | 0.64 | -0.82 | (10, 12) | (-7, -3) | -13 | *** |
| HNR | \({b}_{3}\) | HM | 0.32 | -0.70 | (6, 8) | (-3, 1) | -5 | *** |
| HNR | \({b}_{3}\) | de Krom | 0.20 | -0.44 | (4, 4) | (0, 2) | -1 | ** |
| HNR | \({b}_{3}\) | Qi | 0.12 | -0.30 | (6, 7) | (2, 5) | -1 | ** |
| HNR | \({b}_{3}\) | Shama | 0.33 | -0.64 | (2, 4) | (-6, -3) | -5 | *** |
| HHR | low | HM | 0.45 | 0.67 | (-5, -3) | (9, 15) | 12 | *** |
| HHR | mid | HM | 0.27 | -0.37 | (28, 31) | (14, 21) | -8 | *** |
| HHR | high | HM | | | (11, 15) | (5, 13) | | |
| NNR | low | HM | 0.05 | -0.26 | (-11, -8) | (-17, -11) | 0 | * |
| NNR | mid | HM | 0.23 | -0.46 | (14, 16) | (6, 10) | -4 | ** |
| NNR | high | HM | | | (4, 7) | (7, 14) | | |
| GNE | low | Michaelis | 0.04 | 0.22 | (0.85, 0.88) | (0.88, 0.93) | 0 | * |
| GNE | mid | Michaelis | 0.39 | -0.62 | (0.87, 0.89) | (0.77, 0.81) | -0.06 | *** |
| GNE | high | Michaelis | | | (0.80, 0.83) | (0.76, 0.81) | | |
| Slope | low | de Krom | 0.20 | 0.47 | (-5, -4) | (-1, 2) | 2 | ** |
| Slope | mid | de Krom | 0.48 | -0.58 | (12, 13) | (4, 6) | -6 | *** |
| Slope | high | de Krom | | | (4, 6) | (1, 6) | | |

Acronyms: ABR: adjusted breathiness rating, HNR: harmonics-to-noise ratio, HHR: harmonics-to-harmonics ratio, NNR: noise-to-noise ratio, GNE: glottal-to-noise excitation, HM: harmonic model.

Uses de Krom’s frequency band configuration (see Table 2).

Confidence intervals are in dB except for GNE ratios, which are unitless.

§Distance between the confidence intervals at ABR = 0 and ABR = 100; omitted if overlapping.

Significance levels are based on the \({F}\left(1, 86\right)\) statistic: *\({p}<0.05\), **\({p}<{10}^{-3}\), ***\({p}<{10}^{-6}\).

The confidence intervals (CIs) at the least breathy (ABR = 0) and most breathy (ABR = 100) points and their difference (Delta) indicate the ranges of the HNRs as the breathiness varies. The most performant HNRs (Shama and HM in \({b}_{1}\)) predicted a drop of 19 to 20 dB over the full range of the ABR.

In contrast to the HNRs, the inter-band parameters were not as performant; none reached \({R}_{\text{adj}}^{2}=0.5\). However, all the “mid” parameters (\({b}_{1}\) vs. \({b}_{2}\)) consistently separated the least (ABR = 0) and most (ABR = 100) breathy cases.

Next, multiple linear regression analyses were conducted on each parameter type. For each type of parameter (per-algorithm HNR, HHR, NNR, GNE ratio, and Slope), the measurements from all the bands were combined to predict the ABR. Figure 1 shows the resulting \({R}_{\text{adj}}^{2}\) values and the 95% confidence intervals of the estimated ABRs. Among the HNR algorithms in Fig. 1(a), the HM-based, Qi, and Shama HNRs achieved \({R}_{\text{adj}}^{2}\ge 0.83\) while the de Krom HNRs yielded the worst \({R}_{\text{adj}}^{2}=0.73\). The gains of both the HM and Shama HNRs were marginal (0.04) relative to their best individual \({R}_{\text{adj}}^{2}\) in the \({b}_{1}\) band. Meanwhile, the contrasting performance of the Qi HNRs individually (\({R}_{\text{adj}}^{2}<0.35\)) vs. collectively (\({R}_{\text{adj}}^{2}=0.84\), the highest) was a surprising result. The regression coefficients of the Qi HNR model (\({b}_{0}:4.44\), \({b}_{1}:-4.97\), \({b}_{2}:-2.49\), and \({b}_{3}:1.41\)) revealed a substantial contribution from the \({b}_{0}\) band, even though the variance in the Qi HNR in this band was not explained by the ABR with any significance (see Table 4).

Of the inter-band parameters, Fig. 1(b) shows that the HHRs and Slopes, both influenced by the harmonic power, gained over 0.1 relative to their best single-band \({R}_{\text{adj}}^{2}\) values, and their estimates generally follow the ABR. The others, the NNRs and GNE ratios, did not gain any performance by combining the measurements from the three bands, and their estimates severely underestimated the ABR at high ABRs.

The last part of the regression analysis, the all-subsets regression analysis, included \({f}_{0}\) as a covariate. To anticipate the role of \({f}_{0}\), its correlation with the ABR was tested first. As expected with the mixed-gender dataset, \({f}_{0}\) was not normally distributed: the Shapiro-Wilk test showed that the distribution of \({f}_{0}\) deviated significantly from normality (\(p<0.005\)). Accordingly, Spearman’s rank correlation was computed, and the correlation between the two variables was not statistically significant (\(\rho =0.087\), \(p=0.13\)). Meanwhile, \({f}_{0}\) yielded statistically significant Spearman’s rank correlations with a number of objective parameters: e.g., \({\text{NNR}}_{\text{low}}\) of /a/ (\(\rho =-0.56\), \(p<{10}^{-6}\)), \({\text{HHR}}_{\text{mid}}\) of /i/ (\(\rho =-0.57\), \(p<{10}^{-6}\)), and \({\text{HNR}}_{{b}_{3}}\) of /u/ (\(\rho =-0.58\), \(p<{10}^{-6}\)). Hence, \({f}_{0}\) was included as a covariate in the all-subsets regression models to account for the dependencies of those variables on \({f}_{0}\).

The best \({R}_{\text{adj}}^{2}\) as a function of the model order (Fig. 2, circle markers) indicates that the \({R}_{\text{adj}}^{2}\) exceeded 0.85 (i.e., the models explained 85% of the variance in the ABR) and leveled off above the fifth order. Accordingly, the sixth order was chosen for comparisons among different datasets and vowels. The best sixth-order model employed \({\text{HNR}}_{{b}_{1}}\), \({\text{NNR}}_{\text{mid}}\), \({f}_{0}\), \({\text{GNE}}_{\text{mid}}\), \({\text{GNE}}_{\text{low}}\), and \({\text{HHR}}_{\text{low}}\), with the estimated ABR as shown in Fig. 3 (far left panel). Expanding the model selection, de Krom’s frequency bands yielded 230 linear models (out of 3003) as the top-1% models (i.e., models within 1% of the best \({R}_{\text{adj}}^{2}\) at each order). As shown in Fig. 4 (leftmost bars), the most important parameter was \({\text{HNR}}_{{b}_{1}}\) (over the first and second formants), which contributed on average 61% to the models’ \({R}_{\text{adj}}^{2}\) and was used in 79% of the models. There was a divergence between the two metrics in determining the secondary parameter. The %Contribution suggested \({\text{HNR}}_{{b}_{2}}\) (over the 2–5 kHz range) and \({\text{HHR}}_{\text{mid}}\) (harmonic power ratio between the \({b}_{1}\) and \({b}_{2}\) bands) as the next highest contributors, while the %Usage indicated \({\text{NNR}}_{\text{mid}}\) (noise power ratio between the \({b}_{1}\) and \({b}_{2}\) bands) as the second most frequently used parameter. This stemmed from the general preference for \({\text{HNR}}_{{b}_{1}}\) to be paired with \({\text{NNR}}_{\text{mid}}\) rather than with either \({\text{HNR}}_{{b}_{2}}\) or \({\text{HHR}}_{\text{mid}}\). The fourth-order model using these parameters was not among the top-1% models.

Per-formant Frequency Band Design And Vowel Comparison

The linear regression analyses were repeated for the per-formant frequency bands (defined in Table 2). The measurements of each vowel (/a/, /i/, and /u/) were analyzed separately. Table 5 shows the simple regression outcomes predicting the spectral parameters by the ABR. No parameter peaked in the same frequency band across all vowels. The highest \({R}_{\text{adj}}^{2}\) values were observed with the HNRs but in different bands: /a/ in the first formant (\({b}_{0}\), 0.71), /i/ in the third formant (\({b}_{2}\), 0.73), and /u/ in the second formant (\({b}_{1}\), 0.61). The HNRs in the \({b}_{1}\) bands scored \({R}_{\text{adj}}^{2}\ge 0.58\) for all three vowels. The HHR and NNR were both ineffective, with the highest \({R}_{\text{adj}}^{2}=0.33\) achieved by the NNR with /u/, while the GNE ratio reached \({R}_{\text{adj}}^{2}>0.4\) for /a/ and /i/.

  
Table 5

Linear Regression Results of Spectral Parameters Using Per-Formant Frequency Bands

| Parameter | Band | Vowel | \({R}_{\text{adj}}^{2}\) | \(\rho\) | 95% CI at ABR = 0 | 95% CI at ABR = 100 | Delta§ | Sig. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HNR | \({b}_{0}\) | a | 0.71 | -0.74 | (29, 31) | (7, 11) | -18 | *** |
| HNR | \({b}_{0}\) | i | 0.36 | -0.38 | (38, 41) | (17, 25) | -13 | *** |
| HNR | \({b}_{0}\) | u | 0.45 | -0.61 | (36, 38) | (18, 23) | -13 | *** |
| HNR | \({b}_{1}\) | a | 0.65 | -0.77 | (23, 25) | (1, 6) | -17 | *** |
| HNR | \({b}_{1}\) | i | 0.61 | -0.78 | (17, 19) | (-5, 0) | -17 | *** |
| HNR | \({b}_{1}\) | u | 0.58 | -0.78 | (25, 28) | (0, 5) | -20 | *** |
| HNR | \({b}_{2}\) | a | 0.60 | -0.82 | (13, 15) | (-5, -1) | -13 | *** |
| HNR | \({b}_{2}\) | i | 0.73 | -0.89 | (17, 19) | (-8, -3) | -20 | *** |
| HNR | \({b}_{2}\) | u | 0.42 | -0.70 | (10, 12) | (-2, 1) | -9 | *** |
| HNR | \({b}_{3}\) | a | 0.32 | -0.70 | (6, 8) | (-3, 1) | -5 | *** |
| HNR | \({b}_{3}\) | i | 0.55 | -0.84 | (9, 11) | (-7, -3) | -11 | *** |
| HNR | \({b}_{3}\) | u | 0.15 | -0.46 | (5, 7) | (0, 3) | -3 | ** |
| HHR | low | a | 0.11 | -0.28 | (8, 11) | (-3, 5) | -3 | ** |
| HHR | low | i | | | (31, 37) | (24, 38) | — | |
| HHR | low | u | | | (15, 18) | (10, 17) | — | |
| HHR | mid | a | 0.06 | -0.17 | (18, 21) | (10, 17) | -1 | * |
| HHR | mid | i | | | (-13, -9) | (-13, -4) | — | |
| HHR | mid | u | 0.05 | -0.18 | (24, 28) | (17, 24) | -1 | * |
| HHR | high | a | | | (11, 15) | (5, 13) | — | |
| HHR | high | i | 0.16 | -0.44 | (13, 16) | (0, 8) | -5 | ** |
| HHR | high | u | 0.09 | -0.36 | (6, 10) | (-3, 3) | -3 | ** |
| NNR | low | a | 0.22 | -0.49 | (2, 5) | (-7, -2) | -5 | ** |
| NNR | low | i | | | (11, 15) | (4, 12) | — | |
| NNR | low | u | 0.33 | -0.61 | (5, 7) | (-7, -3) | -8 | *** |
| NNR | mid | a | | | (9, 11) | (5, 10) | — | |
| NNR | mid | i | | | (-13, -10) | (-15, -8) | — | |
| NNR | mid | u | 0.15 | 0.47 | (10, 12) | (16, 20) | 3 | ** |
| NNR | high | a | | | (4, 7) | (7, 14) | — | |
| NNR | high | i | | | (5, 8) | (2, 9) | — | |
| NNR | high | u | | | (1, 5) | (-1, 4) | — | |
| GNE Ratio | low | a | 0.17 | -0.32 | (0.93, 0.95) | (0.84, 0.89) | -0.03 | ** |
| GNE Ratio | low | i | 0.42 | -0.57 | (0.93, 0.96) | (0.80, 0.84) | -0.09 | *** |
| GNE Ratio | low | u | | | (0.86, 0.89) | (0.83, 0.87) | — | |
| GNE Ratio | mid | a | 0.30 | -0.59 | (0.86, 0.88) | (0.77, 0.81) | -0.05 | *** |
| GNE Ratio | mid | i | 0.41 | -0.69 | (0.87, 0.89) | (0.75, 0.79) | -0.08 | *** |
| GNE Ratio | mid | u | 0.18 | -0.48 | (0.82, 0.84) | (0.77, 0.79) | -0.03 | ** |
| GNE Ratio | high | a | | | (0.80, 0.83) | (0.76, 0.81) | — | |
| GNE Ratio | high | i | 0.24 | -0.52 | (0.83, 0.85) | (0.75, 0.79) | -0.04 | *** |
| GNE Ratio | high | u | 0.05 | -0.18 | (0.81, 0.82) | (0.78, 0.80) | 0 | * |

Acronyms: ABR: adjusted breathiness rating, HNR: harmonics-to-noise ratio, HHR: harmonics-to-harmonics ratio, NNR: noise-to-noise ratio, GNE: glottal-to-noise excitation, HM: harmonic model.

Uses the per-formant frequency band configuration (see Table 2).

Confidence intervals are in dB except for GNE ratios, which are unitless.

§Distance between the confidence intervals at ABR = 0 and ABR = 100; “—” if overlapping.

Significance levels are based on \({F}\left(1, 86\right)\) for /a/, \({F}\left(1, 91\right)\) for /i/, and \({F}\left(1, 112\right)\) for /u/: *\({p}<0.05\), **\({p}<{10}^{-3}\), ***\({p}<{10}^{-6}\).

As for the ranges of the spectral parameters over the ABR range, the HNR of /a/ showed similar declines over the first and second formants (-18 dB in \({b}_{0}\) and -17 dB in \({b}_{1}\)). These declines were slightly smaller than that of the HNR over de Krom’s \({b}_{1}\) band (-20 dB, see Table 4). Meanwhile, the HNRs of /i/ and /u/ dropped 20 dB in their most effective bands (\({b}_{2}\) for /i/ and \({b}_{1}\) for /u/).

Next, the all-subsets linear regression analysis was conducted on the full dataset of each vowel. The best \({R}_{\text{adj}}^{2}\) as a function of the model order (Fig. 2, solid lines with non-circle markers) showed that all three vowels plateaued by a subset model order of 6. Figure 3 (right three plots) shows the best sixth-order subset models. All three vowels achieved \({R}_{\text{adj}}^{2}\ge 0.80\), although the parameters selected were quite different between vowels; no parameter other than \({f}_{0}\) was chosen for all three vowels.

Figure 2 also includes the plots of \({R}_{\text{adj}}^{2}\) vs. the model order for the speaker-matched results (dotted lines). The 70-speaker dataset produced results similar to the full dataset for /a/ and /u/: a peak value of 0.84 in the matched vs. 0.85 in the full dataset for /a/, and 0.79 vs. 0.80 for /u/; both were lower in the matched dataset. The /i/ datasets, however, trended in the opposite direction: the matched dataset yielded a higher maximum \({R}_{\text{adj}}^{2}\) (0.87) than the full dataset (0.85).

The oddity with the /i/ datasets was hypothesized to be caused by discrepancies in the inclusion of potentially rough voices among the vowels. Here, a potentially rough vowel segment is one that a single rater classified as rough while the other three did not. However, no drastic change in the percentage of speakers with potential roughness was observed between the two datasets: the full /a/ dataset had 43% of speakers with potential roughness vs. 41% in the matched dataset (a 2-point change), the /i/ datasets went from 47% to 46% (1 point), and the /u/ datasets from 23% to 21% (2 points).

The hatched bar plots in Fig. 4 indicate the importance metrics (%Contribution and %Usage) of the parameters among the top-1% models under the per-formant frequency band designs. The parameters with the highest regression results (Table 5: \({\text{HNR}}_{{b}_{0}}\) for /a/, \({\text{HNR}}_{{b}_{2}}\) for /i/, and \({\text{HNR}}_{{b}_{1}}\) for /u/) were the primary contributors to the models. The fundamental frequency \({f}_{0}\), despite its small contribution to \({R}_{\text{adj}}^{2}\) (1–3%), was the second or third most frequently used parameter for all three vowels (%Usage > 66%). Out of the 33 top-1% models of the per-formant /a/ case, only 3 models included both \({\text{HNR}}_{{b}_{0}}\) and \({\text{HNR}}_{{b}_{1}}\). This indicates that separating the first and second formants in the per-formant frequency band design was ineffective in exposing potential differences in how breathiness affects these two formant regions.

Finally, the effect of using a mismatched frequency band design was tested using the best-subset regression. Limiting the search to the sixth-order models, the highest \({R}_{\text{adj}}^{2}\) was recorded from the 3003 models for every combination of the 3 vowel datasets and the 3 formant-based frequency band designs (9 total, 6 mismatched). The results are shown in Table 6. When the frequency bands did not align with the vowel formants, \({R}_{\text{adj}}^{2}\) was up to 0.06 lower than with the properly paired results. The worst cases were the /a/ dataset evaluated with the /i/ frequency bands and the /u/ dataset with the /a/ bands. Meanwhile, the /u/ frequency bands had the least negative effect on the other vowels: a 0.02 reduction in \({R}_{\text{adj}}^{2}\) for /a/ and 0.01 for /i/.

 
Table 6

Best \({R}_{\text{adj}}^{2}\) of the 6th order models of vowels vs. per-formant frequency band designs

| Dataset | /a/ band design | /i/ band design | /u/ band design |
| --- | --- | --- | --- |
| /a/ | 0.85 | 0.79 | 0.83 |
| /i/ | 0.80 | 0.85 | 0.84 |
| /u/ | 0.74 | 0.79 | 0.80 |

Discussion

Frequency segmentation is a powerful concept for separating a weak-but-important phenomenon from the overarching signal behavior when the two are spectrally separable. The spectral characteristics of the acoustic signals of breathy voice fit this description. Voice source signals are lowpass: -6 dB/octave for normal voice43 and -12 dB/octave for breathy voice44. Assuming \({f}_{0}=100\) Hz, the 16th harmonic at 1.6 kHz is down 24 dB (roughly 1/250 in linear scale) relative to the first harmonic for normal voice and down 48 dB (below 1/60000 in linear scale) for breathy voice. HNRs are generally evaluated by averaging in the linear scale. Without eliminating the influence of the lower harmonics, the expected 24-dB reduction of the 16th harmonic (or the reductions of the higher harmonics in general) could not be fully captured by spectral parameters. This explains why the overall HNR explained only 49% of the variance of perceptual breathiness. The smoothed CPP11 (\({R}_{\text{adj}}^{2}=74\%\)) alleviates this problem by converting the power spectrum to decibels before combining the spectral bins. Isolating the breathiness-sensitive frequency range can further improve the breathiness correlates, as demonstrated by the HNR over de Krom’s \({b}_{1}\) band6 (0.4 to 2 kHz, 20-dB drop if severely breathy, \({R}_{\text{adj}}^{2}=79\%\), Shama’s algorithm). The frequency range of this band also roughly corresponds to that of the auditory loudness patterns, which produced a comparable \({R}_{\text{adj}}^{2}\) gain over the CPP (4%, reported by Shrivastav and Sapienza10).
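For reference, the harmonic attenuation figures above follow directly from the stated source-spectrum slopes: 1.6 kHz lies four octaves above a 100-Hz fundamental, so

$$4\times 6\ \text{dB}=24\ \text{dB}\ \left({10}^{-24/10}\approx \frac{1}{250}\right),\qquad 4\times 12\ \text{dB}=48\ \text{dB}\ \left({10}^{-48/10}\approx \frac{1}{63000}\right).$$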

De Krom’s \({b}_{1}\) band specifically targets the first and second formants of /a/. This approach differs from that of parameters such as H1-A1 or H1-A3, which evaluate only one harmonic within the target formant. Specifying a formant range eliminates the need to estimate the formant frequencies. Automatic formant analysis of breathy voice could misidentify the formants due to the tracheal coupling effect, which may change the formant bandwidths and add poles and zeros to the vocal tract response7. Evaluating the parameters over a formant frequency range avoids such potential misidentifications as long as the speakers’ formants are in the expected ranges37,45. The vocal tract response naturally weighs the frequency bins in the formants more than those outside them.

Separating the first and second formants was hypothesized to benefit breathiness analysis because the stronger first formant may be overshadowing the inherently weaker second formant. The latter may carry useful new information in its own HNR or in the differences between the first and second formants (\({\text{HHR}}_{\text{low}}\), \({\text{NNR}}_{\text{low}}\), and \({\text{GNE}}_{\text{low}}\)). However, the results (Fig. 2 and Fig. 3) do not support the hypothesis: the performance of de Krom’s and the per-formant /a/ designs is at best indistinguishable, and the per-formant /a/ \({R}_{\text{adj}}^{2}\) is slightly lower than de Krom’s across model orders. The finding that only 3 of the 33 top-1% per-formant models utilized both formants indicates that the HNR responds similarly to breathiness in both formants; thus, there is no strong incentive to include both in a model. The performance loss of the per-formant design is likely attributable to the reduced number of harmonics per band, especially for speakers with high \({f}_{0}\). Meanwhile, the loss was minimal (< 1%), and the added redundancy of the per-formant design could be beneficial for accounting for other perceptual dimensions (e.g., roughness).

Comparing the four HNR algorithms, Shama’s DFT-based method38 and the harmonic-model (HM) based method34 stood out over the de Krom and Qi cepstrum-based HNRs26,39. For evaluating voice with breathiness alone, Shama’s method is the simplest and best choice. If the analysis is to be expanded to other perceptual dimensions, the HM-based method may have an advantage as a more versatile tool. For example, the HM produces the estimated noise samples, which can be further analyzed for the presence of nonharmonic tones46. The behavior of Qi’s cepstrum-based method39 is intriguing: its simple regression performance was the worst among the four (Table 4), while its collective performance surpassed (albeit only by 1%) the Shama and HM HNRs (Fig. 1). A possible explanation lies in its unique representation of the noise level in each harmonic region by a single sample at the harmonic frequency; the other algorithms average over all the frequency bins.

The vowel effect on breathiness, or on voice qualities in general, is a scarcely investigated topic. The majority of existing acoustic breathiness correlate studies targeted only sustained /a/5, and those that considered multiple vowels did not pursue the vowel effects (e.g., including vowel only as a repeated-measures factor20) or found no vowel effects11. In a perceptual study, /i/ was found to have a significantly less severe overall quality than /a/ or /u/, with a small effect size47. A recent study by Anand et al.48 investigated the vowel effect via a matching task and found a moderate effect on breathiness ratings, with /u/ less breathy than /i/ and /a/. Interestingly, the present study found the exact opposite relationship: /u/ was found to be significantly breathier than /a/ and /i/ (more than 6 points higher on the 100-point scale). The difference is likely attributable to the difference in rating procedures: a matching task vs. an unanchored rating protocol. Another interesting result in the present study was that /u/ was also the vowel least often labeled as rough or as having spectral irregularities (thus, the study included 20% more /u/ recordings than /i/ and 40% more than /a/). Meanwhile, the breathiness of /u/ was also the hardest to model with the spectral parameters (80% compared to 85% for /a/ and /i/).

The higher breathiness rating and the lower predictability of /u/ are likely related. De Krom49 showed that the variabilities of perceptual qualities are highest in the middle of the rating scale. Specifically, the 95% confidence interval of the breathiness rating was the largest (reaching 30% of the overall range) in the middle and near zero at both extrema of the scale. The 6-point increase in the rating shifted the mean of the breathiness ratings in the current study from 27 to 33, towards the middle. This may have increased the overall uncertainty in the perceptual ratings, and this added perceptual uncertainty could not be explained by the objective parameters.

Limitations And Future Directions

The models in this study accounted for up to 85% of the variance in the perceptual breathiness rating. While much of the remaining 15% could be due to the general uncertainty surrounding the perceptual ratings49,50 as well as the uncertainty of the formant locations37,45, some may be attributable to the experimental design of the study.

First, the perceptual ratings were intentionally collected with minimal restrictions. Raters were allowed to perform the rating in virtually any manner they preferred (e.g., listening environment, number of playbacks per stimulus, order of rating, and granularity of the ratings). The idea was to allow the experts to complete the rating as quickly as they would assess voice qualities in the clinic. Despite this lax protocol, the interrater consistency reached a solid “good” ICC value. However, this protocol likely increased the variability in the averaged breathiness rating. To formalize or standardize a spectral breathiness index based on the parameters from this study, the reference perceptual ratings will need to be collected in a more controlled manner, e.g., with a matching-task protocol48.

Second, the stimuli were drawn from the Saarbrueken Voice Database, which is, to our knowledge, the only publicly available database with multiple sustained vowels. The data collection protocol did not control the duration of the sustained vowels. Consequently, there is a wide variation in the recording durations (mean = 1.26 s, SD = 0.21 s, min = 0.27 s, max = 2.12 s), which results in a variable number of 50-ms windows per recording. The recordings with fewer windows to average may present higher variance in their parameter measurements.

Third, the frequency bands were chosen based on previous research on the formant frequencies37. Frequency band designs could be optimized to improve the performance further. However, the optimization may not be a trivial numerical problem due to the frequency quantization employed in the parameter calculation.

Also, the estimated models do not represent the full span of the breathiness scale evenly. The distribution of breathiness ratings in the dysphonic population is naturally not uniform: there are far more stimuli in the lower half of the scale than in the upper half. This is visually apparent in Fig. 1 and Fig. 3 and agrees with the ABI studies16,21,22,24. The lack of support in the upper end of the scale makes the estimated models less accurate in the upper range (i.e., wider confidence intervals in Table 4 and Table 5). Additional stimuli are desirable to fill the upper range and improve the models.

Finally, this study explicitly targeted breathy-but-not-rough voice to isolate the breathiness correlates. As such, the models are not expected to perform well in quantifying the breathiness of rough or breathy-and-rough voices. The next phase of this research will be to add roughness correlates to the parameter set (and possibly a nonlinear model structure) to construct models that are sensitive only to breathiness and not to roughness.

Conclusions

Loss of higher harmonics has long been associated with breathy voice, and capable acoustic correlates of breathiness are known. This study focused on spectral parameters to better understand how their ability to predict breathiness is influenced by the choice of algorithm, frequency band, and vowel. Breathiness was found to be better explained by the changes in the harmonics-to-noise ratios (HNRs) over the first and second formants than by the changes in the differences of harmonic or noise power between formants. The HNR in this critical band was shown to explain the variance in the perceptual breathiness rating better than the holistic smoothed cepstrum peak prominence. Moreover, linear combinations of the frequency-specific spectral parameters exceeded the performance of the parameters used in the Acoustic Breathiness Index in predicting breathiness. Some vowel effects were observed: /u/ was less predictable than /a/ or /i/, while /u/ was perceived as breathier and less rough than the other vowels.

Declarations

Competing interests: The authors declare no competing interests.

References

  1. Hirano M. Clinical Examination of Voice. Springer Verlag; 1981.
  2. Kempster G. CAPE-V: Development and future direction. Perspect Voice Voice Disord. 2007;17(2):11-13. doi:10.1044/vvd17.2.11
  3. Barsties B, De Bodt M. Assessment of voice quality: Current state-of-the-art. Auris Nasus Larynx. 2015;42(3):183-188. doi:10.1016/j.anl.2014.11.001
  4. Kreiman J, Gerratt BR, Kempster GB, Erman A, Berke GS. Perceptual evaluation of voice quality: Review, tutorial, and a framework for future research. J Speech Lang Hear Res. 1993;36(1):21-40. doi:10.1044/jshr.3601.21
  5. Barsties v. Latoszek B, Maryn Y, Gerrits E, De Bodt M. A meta-analysis: acoustic measurement of roughness and breathiness. J Speech Lang Hear Res. 2018;61(2):298-323. doi:10.1044/2017_JSLHR-S-16-0188
  6. de Krom G. Some spectral correlates of pathological breathy and rough voice quality for different types of vowel fragments. J Speech Hear Res. 1995;38(4):794-811.
  7. Klatt DH, Klatt LC. Analysis, synthesis, and perception of voice quality variations among female and male talkers. J Acoust Soc Am. 1990;87(2):820-857. doi:10.1121/1.398894
  8. Hammarberg B, Fritzell B, Schiratzki H. Teflon injection in 16 patients with paralytic dysphonia: Perceptual and acoustic evaluations. J Speech Hear Disord. 1984;49(1):72-82. doi:10.1044/jshd.4901.72
  9. Kitzing P. LTAS criteria pertinent to the measurement of voice quality. J Phon. 1986;14(3-4):477-482. doi:10.1016/S0095-4470(19)30693-X
  10. Shrivastav R, Sapienza CM. Objective measures of breathy voice quality obtained using an auditory model. J Acoust Soc Am. 2003;114(4):2217-2224. doi:10.1121/1.1605414
  11. Hillenbrand J, Cleveland RA, Erickson RL. Acoustic correlates of breathy vocal quality. J Speech Hear Res. 1994;37(4):769-778.
  12. Milenkovic P. Least mean square measures of voice perturbation. J Speech Hear Res. 1987;30(4):529-538.
  13. Hillenbrand J, Houde RA. Acoustic correlates of breathy vocal quality: Dysphonic voices and continuous speech. J Speech Hear Res. 1996;39(2):311-321.
  14. McAllister A, Sundberg J, Hibi S. Acoustic measurements and perceptual evaluation of hoarseness in children’s voices. Logoped Phoniatr Vocol. 1998;23(1):27-38. doi:10.1080/140154398434310-1
  15. Gorham-Rowan MM, Laures-Gore J. Acoustic-perceptual correlates of voice quality in elderly men and women. J Commun Disord. 2006;39(3):171-184. doi:10.1016/j.jcomdis.2005.11.005
  16. Barsties v. Latoszek B, Maryn Y, Gerrits E, De Bodt M. The Acoustic Breathiness Index (ABI): A multivariate acoustic model for breathiness. J Voice. 2017;31(4):511.e11-511.e27. doi:10.1016/j.jvoice.2016.11.017
  17. Michaelis D, Gramss T, Strube HW. Glottal-to-noise excitation ratio – a new measure for describing pathological voices. Acta Acust United Acust. 1997;83(4):7.
  18. Dejonckere PH. Recognition of hoarseness by means of L. T. A. S. Int J Rehabil Res. 1983;6(3):343-344.
  19. Dejonckere PH, Lebacq J. An analysis of the diplophonia phenomenon. Speech Commun. 1983;2(1):47-56. doi:10.1016/0167-6393(83)90063-8
  20. Wolfe VI, Steinfatt TM. Prediction of vocal severity within and across voice types. J Speech Hear Res. 1987;30(2):230-240.
  21. Delgado Hernández J, León Gómez NM, Jiménez A, Izquierdo LM, Barsties v. Latoszek B. Validation of the Acoustic Voice Quality Index Version 03.01 and the Acoustic Breathiness Index in the Spanish language. Ann Otol Rhinol Laryngol. 2018;127(5):317-326. doi:10.1177/0003489418761096
  22. Hosokawa K, Barsties v. Latoszek B, Ferrer-Riesgo CA, et al. Acoustic breathiness index for the Japanese-speaking population: Validation study and exploration of affecting factors. J Speech Lang Hear Res. 2019;62(8):2617-2631. doi:10.1044/2019_JSLHR-S-19-0077
  23. Barsties v. Latoszek B, Lehnert B, Janotte B. Validation of the Acoustic Voice Quality Index Version 03.01 and Acoustic Breathiness Index in German. J Voice. 2020;34(1):157.e17-157.e25. doi:10.1016/j.jvoice.2018.07.026
  24. Englert M, Barsties v. Latoszek B, Maryn Y, Behlau M. Validation of the acoustic breathiness index to the Brazilian Portuguese language. Logoped Phoniatr Vocol. 2022;47(1):56-62. doi:10.1080/14015439.2020.1864467
  25. Kim GH, Barsties v. Latoszek B, Lee YW. Validation of Acoustic Voice Quality Index Version 3.01 and Acoustic Breathiness Index in Korean population. J Voice. 2021;35(4):660.e9-660.e18. doi:10.1016/j.jvoice.2019.10.005
  26. de Krom G. A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals. J Speech Hear Res. 1993;36(2):254-266.
  27. Pützer M, Barry WJ. Saarbruecken Voice Database. Published October 13, 2008. Accessed October 25, 2021. http://www.stimmdatenbank.coli.uni-saarland.de/
  28. Watts CR, Awan SN. Use of spectral/cepstral analyses for differentiating normal from hypofunctional voices in sustained vowel and continuous speech contexts. J Speech Lang Hear Res. 2011;54(6):1525-1537. doi:10.1044/1092-4388(2011/10-0209)
  29. Hartl DM, Hans S, Vaissière J, Brasnu DF. Objective acoustic and aerodynamic measures of breathiness in paralytic dysphonia. Eur Arch Otorhinolaryngol. 2003;260(4):175-182. doi:10.1007/s00405-002-0542-2
  30. Boersma P, Weenink D. Praat: doing phonetics by computer. Published online 2021. https://www.fon.hum.uva.nl/praat/
  31. Gerratt BR, Kreiman J. Toward a taxonomy of nonmodal phonation. J Phon. 2001;29(4):365-381. doi:10.1006/jpho.2001.0149
  32. Ikuma T, McWhorter AJ, Adkins L, Kunduk M. Investigation of vocal bifurcations and voice patterns induced by asymmetry of pathological vocal folds. J Speech Lang Hear Res. In press.
  33. Richardson K, Matheron D, Martel-Sauvageau V, et al. A comparative normative study between Multidimensional Voice Program, Praat, and TF32. Perspect ASHA Spec Interest Groups. 2019;4(3):563-573. doi:10.1044/2019_PERS-SIG19-2018-0006
  34. Ikuma T, Story B, McWhorter AJ, Adkins L, Kunduk M. Harmonics-to-noise ratio estimation with deterministically time-varying harmonic model for pathological voice signals. J Acoust Soc Am. 2022;152(3):1783-1794. doi:10.1121/10.0014177
  35. Boersma P. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. Proc Inst Phon Sci. 1993;17:97-110.
  36. Jadoul Y, Thompson B, de Boer B. Introducing Parselmouth: A Python interface to Praat. J Phon. 2018;71:1-15. doi:10.1016/j.wocn.2018.07.001
  37. Pätzold M, Simpson AP. Acoustic analysis of German vowels in the Kiel Corpus of Read Speech. In: The Kiel Corpus of Read/Spontaneous Speech Acoustic Data Base, Processing Tools and Analysis Results. IPDS Kiel; 1997:215-247. https://www.ipds.uni-kiel.de/kjk/pub_exx/aipuk32/mpas.pdf
  38. Shama K, Krishna A, Cholayya NU. Study of harmonics-to-noise ratio and critical-band energy spectrum of speech as acoustic indicators of laryngeal and voice pathology. EURASIP J Adv Signal Process. 2006;2007(1):085286. doi:10.1155/2007/85286
  39. Qi Y, Hillman RE. Temporal and spectral estimations of harmonics-to-noise ratio in human voice signals. J Acoust Soc Am. 1997;102(1):537-543. doi:10.1121/1.419726
  40. Kasuya H, Ogawa S, Mashima K, Ebihara S. Normalized noise energy as an acoustic measure to evaluate pathologic voice. J Acoust Soc Am. 1986;80(5):1329-1334. doi:10.1121/1.394384
  41. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15(2):155-163. doi:10.1016/j.jcm.2016.02.012
  42. Eisinga R, Heskes T, Pelzer B, Te Grotenhuis M. Exact p-values for pairwise comparison of Friedman rank sums, with application to comparing classifiers. BMC Bioinformatics. 2017;18(1):68. doi:10.1186/s12859-017-1486-2
  43. Flanagan JL. Some properties of the glottal sound source. J Speech Hear Res. 1958;1(2):99-116.
  44. Childers DG, Lee CK. Vocal quality factors: Analysis, synthesis, and perception. J Acoust Soc Am. 1991;90(5):2394-2410.
  45. Hillenbrand J, Getty LA, Clark MJ, Wheeler K. Acoustic characteristics of American English vowels. J Acoust Soc Am. 1995;97(5):3099-3111. doi:10.1121/1.411872
  46. Ikuma T, Kunduk M, McWhorter AJ. Advanced waveform decomposition for high-speed videoendoscopy analysis. J Voice. 2013;27(3):369-375. doi:10.1016/j.jvoice.2013.01.004
  47. Gerratt BR, Kreiman J, Garellek M. Comparing measures of voice quality from sustained phonation and continuous speech. J Speech Lang Hear Res. 2016;59(5):994-1001. doi:10.1044/2016_JSLHR-S-15-0307
  48. Anand S, Skowronski MD, Shrivastav R, Eddins DA. Perceptual and quantitative assessment of dysphonia across vowel categories. J Voice. 2019;33(4):473-481. doi:10.1016/j.jvoice.2017.12.018
  49. de Krom G. Consistency and reliability of voice quality ratings for different types of speech fragments. J Speech Hear Res. 1994;37(5):985-1000.
  50. Kreiman J, Gerratt BR. Validity of rating scale measures of voice quality. J Acoust Soc Am. 1998;104(3):1598-1608. doi:10.1121/1.424372