The classification of neurodegenerative disease from acoustic speech data

Neurodegenerative diseases often affect speech. Speech acoustics can be used as objective clinical markers of pathology. Previous investigations of pathological speech have primarily compared controls with one specific condition and excluded comorbidities. We broaden the utility of speech markers by examining how multiple acoustic features can delineate diseases. We used supervised machine learning with gradient boosting (CatBoost) to differentiate healthy speech and speech from people with multiple sclerosis or Friedreich ataxia. Participants performed a diadochokinetic task where they repeated alternating syllables. We extracted 74 spectral and temporal prosodic features from the speech recordings, which were subjected to machine learning. Results showed that Friedreich ataxia, multiple sclerosis, and healthy controls were all identified with high accuracy (over 82%). Twenty-one acoustic features were strong markers of neurodegenerative diseases, falling under the categories of spectral qualia, spectral power, and speech rate. We demonstrated that speech markers can delineate neurodegenerative diseases and distinguish healthy speech from pathological speech with high accuracy. Findings emphasize the importance of examining speech outcomes when assessing indicators of neurodegenerative disease. We propose large-scale initiatives to broaden the scope for differentiating other neurological diseases and affective disorders.


Introduction
Neurodegenerative diseases can impair speech due to the breakdown of motor control. Therefore, acoustic features of speech can be used as objective clinical markers for neurodegenerative disease. Previous studies examining acoustic changes in neurodegenerative disease have primarily focused on differences between healthy controls and various patient populations, such as multiple sclerosis (MS) (1)(2)(3)(4), Huntington's disease (HD) (5-7), Parkinson's disease (PD) (1,3,8,9), and Friedreich ataxia (FA) (8,10,11). These studies report that various acoustic features indeed change as the disease progresses: patients tend to exhibit slower and more variable speech rates, lower and more variable pitch, and reduced spectral clarity compared to healthy controls. Although machine learning has been used to differentiate healthy controls from single, well-defined patient populations (e.g., Parkinson's disease, spasmodic dysphonia), real-world machine learning implementations will encounter multiple different diseases that may have overlapping acoustic outcomes. The present study aims to broaden the utility of these speech markers by determining how different acoustic profiles of speech may accurately identify specific neurodegenerative diseases. In other words, we advance beyond discriminating between healthy and pathological speech by examining differences between different neurodegenerative diseases simultaneously across multiple acoustic dimensions.
Clinicians use a combination of tools to diagnose neurodegenerative disease including genetic sequencing, neurological scans (e.g., magnetic resonance imaging), and motor tests. Although these methods have improved considerably, misdiagnosis is still possible, particularly during earlier disease stages where motor problems, cognitive impairment, and/or mood disturbances could be misattributed to other neurological conditions (e.g., (12,13)). Acoustic clinical markers have several advantages over traditional tools. First, speech can be recorded remotely in a home environment without the need to visit a specialist or hospital. Given that some populations with neurodegenerative disease are considered at-risk, remote identification reduces the risk of contracting potentially life-threatening pathogens. Remote testing is also more accessible for populations with limited mobility and those living in rural areas.
Second, acoustic markers are obtained using non-invasive techniques. Invasive procedures (e.g., blood tests, surgery) can cause discomfort and have a risk of infection. These procedures can also impose large financial burdens, particularly when multiple tests are required due to misdiagnosis. Acoustic markers have the potential to alleviate these burdens by providing accessible, low-cost, and low-risk tests that can guide clinician decisions in the early stages of diagnosis.

Acoustic Markers Of Pathological Speech
Acoustic features of speech can be used to construct profiles of different patient populations. The most common acoustic features used to identify pathologies include speech rate (the number of syllables per second), pauses (the duration between utterances or syllables), and frequency information related to the pitch of the voice (fundamental frequency, f0) and its formants (cf. 12). These features reflect the underlying dysfunction of clinical populations that have decreased motor control over the articulators during speech.
As such, neurodegenerative disease can result in a slower and more variable speech rate, longer pauses, and decreased pitch quality compared to controls. Other spectral features have also been shown to be useful markers of neurodegenerative disease (e.g., jitter and shimmer) as they capture instabilities in pitch and amplitude that arise as a result of tremor or decreased muscle strength within the articulatory system (7).
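For illustration, the following is a minimal Python sketch, assuming librosa, of how frame-wise f0 statistics and a crude pause measure of this kind could be extracted from a recording. The present study used custom MATLAB scripts (see Methods), so the library, thresholds, and parameter values here are assumptions rather than the authors' pipeline; the f0 coefficient of variation is only a coarse stand-in for period-to-period measures such as jitter.

```python
# Illustrative sketch only: basic f0 and pause features from one recording.
import numpy as np
import librosa

def basic_acoustic_features(wav_path, fmin=75.0, fmax=500.0, hop_length=512):
    y, sr = librosa.load(wav_path, sr=None)

    # Frame-wise fundamental frequency via probabilistic YIN (NaN = unvoiced).
    f0, voiced_flag, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr,
                                      hop_length=hop_length)

    # Frame-wise RMS energy; frames below a crude threshold are treated as pauses.
    rms = librosa.feature.rms(y=y, hop_length=hop_length)[0]
    silent = rms < (0.1 * np.median(rms))          # assumed threshold
    frame_dur = hop_length / sr
    total_pause_s = float(silent.sum() * frame_dur)

    voiced_f0 = f0[~np.isnan(f0)]
    return {
        "f0_mean": float(np.mean(voiced_f0)),
        "f0_sd": float(np.std(voiced_f0)),
        "f0_cv": float(np.std(voiced_f0) / np.mean(voiced_f0)),  # pitch variability proxy
        "total_pause_s": total_pause_s,
        "voiced_fraction": float(np.mean(voiced_flag)),
    }
```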
There are multiple speech elicitation tasks for speech pathology assessment. The diadochokinetic (DDK) task is a common method of speech elicitation that requires the speaker to produce as many repetitions of /PATAKA/ as possible within one breath or for a specified duration (15). The DDK task is a controlled elicitation method that allows high consistency between different speakers in terms of the speech content while remaining sensitive to oromotor performance (e.g., speech rate) (16). Other speech elicitation tasks include reading a paragraph aloud or semi-structured interviews (17,18). Although these tasks increase ecological validity, they may also increase cognitive load, which may induce speech changes based on third variables like education level, language or reading impairments, or other confounds based on cognitive ability. Moreover, the linguistic content may encourage changes in prosodic features based on emphasis, stress patterns, and emotion, which may differ based on personality, accent, or emotional state. To avoid these concerns, the present study examined speech from a DDK task that was performed by healthy controls (HC) and two patient populations (FA, MS) using uniform practices (see Methods). We calculated acoustic features that have been examined in previous studies comparing HCs and various patient populations (1)(2)(3)(4)(5)(6)(7)(8)(9).
Previous implementations of machine learning on speech have compared healthy controls with only a single patient group (cf. (14)). Although these approaches are useful as initial triage for identifying pathological voice disturbances that should be investigated (19), they do not provide nuanced classification of the underlying pathology or disease phenotypes, potentially due to small sample sizes and, consequently, low accuracy (20). This is especially the case for models with hidden layers that reflect latent variables that are not defined and, therefore, do not aid in developing specific acoustic profiles that characterize a disease (21). We instead adopted a transparent machine learning approach, gradient boosting, that quantifies the contribution of each acoustic feature in distinguishing between healthy and pathological voices, and between multiple diseases.

Results
Two-sample t-tests revealed that overall model performance, as assessed by the Matthews correlation coefficient (M = 0.82, SD = 0.04), was significantly above chance (33% accuracy), t(99) = 114.56, p < 0.001, Cohen's d = 11.46. Classification accuracy between groups was assessed using F1 scores, which equally weight model precision and recall (see supplementary materials for the full statistical analysis); these were also significantly better than chance for all groups (ps < 0.001) with large effect sizes for HC

Model optimization
To measure the contribution of each acoustic feature to categorising each group, Shapley additive explanation (SHAP) values were examined. These show the probability of each outcome based on the information provided by each feature (23,24). To achieve a more parsimonious model, we performed the same machine learning procedure twice more, including only features that produced SHAP values above criteria of 2% (n = 87) and 5% (n = 21) for at least one group (see supplementary materials for rankings). Overall model accuracy (Matthews correlation coefficient) significantly increased relative to the full model (M = 82.3%, SEM = 0.4%) for the 2% cut-off (p = 0.001; M = 83.7%, SEM = 0.4%) and the 5% cut-off (p < 0.001; M = 83.6%, SEM = 0.4%). Pairwise comparisons of accuracy between models for each group revealed significant increases in F1 accuracy between the full model and the 2% cut-off for all groups (ps < .002), and between the full model and the 5% cut-off for the HC and MS groups (ps < 0.03) but not the FA group (p = 0.11) (see Figure 1). These results suggest that high discrimination accuracy can be achieved with a reduced subset of 21 acoustic features. It should be noted, however, that larger subsets of variables may be required to achieve high discrimination accuracy if a broader range of clinical groups is included.
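As a hedged sketch of how such SHAP-based pruning could be implemented with a fitted CatBoost model: the function below computes per-group mean absolute SHAP contributions, converts them to percentage shares, and keeps features whose maximum group-wise share exceeds a cut-off. The function name, normalisation, and cut-off handling are assumptions for illustration, not the authors' exact procedure.

```python
# Illustrative SHAP-based feature selection with a fitted multiclass CatBoost model.
import numpy as np
from catboost import CatBoostClassifier, Pool

def select_features_by_shap(model: CatBoostClassifier, X, y, cutoff=0.05):
    pool = Pool(X, y)
    # For multiclass models CatBoost returns SHAP values shaped
    # (n_samples, n_classes, n_features + 1); the last column is the bias term.
    shap_values = model.get_feature_importance(data=pool, type="ShapValues")
    contrib = np.abs(shap_values[:, :, :-1]).mean(axis=0)   # (n_classes, n_features)
    share = contrib / contrib.sum(axis=1, keepdims=True)    # per-group percentage share
    keep = share.max(axis=0) >= cutoff                      # max group-wise share above cut-off
    if hasattr(X, "columns"):                                # pandas DataFrame input
        return np.asarray(X.columns)[keep]
    return np.flatnonzero(keep)                              # otherwise return column indices
```

The reduced feature set can then be fed back into the same training and cross-validation procedure to compare accuracy against the full model.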

Optimal acoustic clinical markers
Here we describe the top 21 features overall, as well as the top 10 for each group and overall (see supplementary materials for all features). As shown in Table 1, the dominant acoustic features for accurate classification were spectral decrease, peak f0 energy, peak energy in the low, high, and broadband frequency ranges, low-frequency summed energy, utterance duration based on summed broadband energy (including low-, mid-, and high-frequency sub-bands), spectral spread, and acoustic intensity. Figures 3a and 3b show that healthy controls are characterized by a less steep and less variable spectral decrease, a smaller spectral spread and range of energy produced in low frequencies, greater energy in the low and f0 frequency bands, and shorter utterance durations. The FA group was characterized by low intensity and energy in the low, high, and broadband frequency bands, a higher and more variable spectral spread, and longer utterance durations. The MS group was characterized by a steeper and more variable spectral decrease, as well as utterance durations and spectral spread values that fell between the control and FA groups (see the link in the Figure 3a note for figures of all acoustic features). Other acoustic features that were useful in delineating groups include metrics of speech timing (pause duration, speech rate, and stress rate (25)), spectral features (crest, slope, centroid, flatness, and entropy (26)), formants 1-5, and the alpha ratio (27).

Discussion
Our machine learning approach was able to distinguish between healthy controls and two cohorts with different neurodegenerative diseases with high accuracy using acoustic properties of speech alone.
These results indicate that multiclass supervised machine learning has the potential to discriminate between diseases, a step beyond the mere healthy-pathological dichotomy. Therefore, through the accumulation of big data that merges speech data from various patient populations, we may be able to use machine learning to assist in the detection of specific diseases using acoustic markers.
There are numerous advantages to using acoustic markers to detect neurodegenerative disease, including the decreased risk and burden of travelling to a hospital to undergo a range of tests, some of which are invasive. Speech, on the other hand, can be recorded within a familiar and comfortable setting using common household devices (e.g., smartphones). Smartphones have demonstrated relative robustness for obtaining acoustic clinical markers and, therefore, increase accessibility to these automated detection methods (28). Although the present study recorded speech within laboratory settings, it is also possible to record speech data remotely (29). Practitioners could use this information to refine which tests should be performed to confirm a diagnosis. This would be particularly useful for people living in rural communities with increased travel burdens or during situations where the risk of infection is heightened (e.g., pandemics). Speech markers can be used as a remote tool to initially detect signs of neurodegenerative disease, expand our understanding of the clinical characteristics of these diseases to improve our ability to develop targeted interventions, and monitor disease progression.
We identified several acoustic features that strongly contributed to distinguishing between groups.
Spectral decrease, the average of all slopes between the peak amplitude at the fundamental frequency and the peak amplitude of the formants (i.e., harmonics), was the most useful variable in distinguishing our three groups. This finding is in line with previous results that suggest vocal fold dysfunctions are associated with greater energy in the lower frequency range relative to higher frequencies (e.g., the soft phonation index (30)). Other spectral features associated with the distribution of vocal energy also contributed to classification accuracy, including summed and peak energy within low-frequency bands (1-75 Hz), peak energy within the f0 (75-500 Hz), high (4000-8000 Hz), and broadband (1-8000 Hz) frequency ranges, and the spectral spread of peak frequencies. Therefore, the distribution of acoustic energy across the spectrum, which reflects voice qualia, constitutes a strong set of acoustic clinical markers for distinguishing neurodegenerative diseases.
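A rough sketch of how such a slope-based spectral decrease could be computed from a magnitude spectrum is given below. The peak-picking parameters are assumptions, and the authors' implementation (drawn from the music information retrieval toolboxes cited in the Methods) may differ.

```python
# Illustrative slope-based spectral decrease: average slope from the spectral
# peak nearest f0 to the peaks of higher harmonics/formants.
import numpy as np
from scipy.signal import find_peaks

def spectral_decrease(magnitude, freqs, f0):
    # Pick prominent spectral peaks and anchor on the one closest to f0.
    peak_idx, _ = find_peaks(magnitude, prominence=0.05 * magnitude.max())
    if len(peak_idx) < 2:
        return np.nan
    anchor = peak_idx[np.argmin(np.abs(freqs[peak_idx] - f0))]
    higher = peak_idx[freqs[peak_idx] > freqs[anchor]]
    if len(higher) == 0:
        return np.nan
    slopes = (magnitude[higher] - magnitude[anchor]) / (freqs[higher] - freqs[anchor])
    return float(np.mean(slopes))   # more negative = steeper spectral decrease
```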
Speech timing measures were also strong contributors to classification accuracy, specifically the duration of syllables based on summed energy in low and broadband frequency ranges, and the rate of stressed syllable onsets based on peak energy across broadband frequencies (25). These results corroborate previous findings that demonstrate slowed speech rate and decreased phonation time for a range of neurodegenerative diseases including Parkinson's disease (31), Huntington's disease (6,32), multiple sclerosis (1,3), and others (32,33). Speech rate and phonation time reflect both pneumoarticulatory capacity and oromotor function, and could serve as clinical markers for neurodegenerative diseases and their progression. Therefore, speech timing measures are useful markers for distinguishing between diseases and may also aid in determining disease severity.

Limitations & Considerations
We used the most common acoustic features (or similar proxies) based on an a priori analysis of the neurodegenerative disease literature and timbral features used in music information retrieval. There are, however, other acoustic features that may increase the accuracy and sensitivity of the machine learning algorithm that were not considered here. For example, voice onset time, the time between the burst of a stop consonant and the onset of the vowel, is an acoustic feature that differs significantly between controls and people with Parkinson's disease (34). We opted not to use this measure because our data contain a high degree of coarticulation, and there is little agreement on the best way to extract the burst and vowel onset times and which acoustic features should be considered (see (35)). Similarly, we did not include measures from other voice assessment tasks (e.g., sustained vowel) (36) that can more reliably measure certain features (e.g., jitter and shimmer) but preclude the measurement of speech timing. We chose to constrain the number of variables and tasks to avoid overfitting. Future studies could use feature selection and pruning methods (e.g., (37)) to find the best feature set and remove unreliable variables prior to analysis.
The inclusion of non-speech performance measurements could also increase discrimination accuracy, for instance, cognitive (38) and motor performance (39) measures. The primary aim of this experiment was to examine accuracy using speech features alone because speech data can easily be obtained in the absence of a clinician through websites and smartphone applications (40). Other cognitive and motor tests often require the scoring of a clinician or dedicated tools to measure gait and tremor, although some smartphone tests are available (41). We show that neurodegenerative diseases can be delineated with high accuracy from speech data alone, but future applications could also consider other non-verbal features, for example, irregular gait patterns using smartphone accelerometers or irregular typing patterns. Whether these movement features or others would increase the accuracy of machine learning algorithms for neurodegenerative disease remains unknown.

Conclusion
We provide strong evidence that neurodegenerative diseases can be differentiated through acoustic clinical markers and machine learning. This model can be expanded and improved through the inclusion of additional diseases and phenotypes. Big data initiatives that bring together researchers and speech data from multiple laboratories are necessary to increase the scope of diseases that can be identified by acoustic clinical markers and machine learning. Moreover, a combination of remote testing tools for physical and cognitive assessment could be included in addition to speech to improve identification accuracy. These technologies promise to provide tools that can aid practitioners in reaching a diagnosis and relieve the physical and financial burden on patients.

Declarations
Acknowledgements This work was undertaken in collaboration with the Melbourne Data Analytics Platform (MDAP) at The University of Melbourne. Data collection for the multiple sclerosis group was supported by an NHMRC Project grant (#108546). APV was supported by an NHMRC Fellowship (#1135683).

Contributions
BGS managed the machine learning, performed the acoustic feature extraction and statistical analysis, interpreted the data, and wrote the manuscript. ZJ, UN, and MMQ performed the data wrangling and machine learning. Healthy control data were collected by GN and SR under the supervision of APV and AvdW. Data for people with multiple sclerosis were collected by GN. Data for people with Friedreich ataxia were collected by HR and APV. Speech production protocols were designed by APV. BGS and APV contributed to the conception of the study. All authors were involved in manuscript revisions and have approved the final manuscript.

Competing interests
APV is the CEO of Redenlab, a speech clinical marker company.

Materials & correspondence
Raw speech data are not publicly available due to Institutional Review Board restrictions. Participants did not consent to their data being publicly available. Processed data and the code used for data processing and analysis are available on request from the corresponding author (BGS). Descriptions of protocols are available on request from the corresponding author (APV).

Procedure
Participants performed a DDK task where the syllables /PA/, /TA/, and /KA/ were repeated in an alternating fashion as many times as possible within one breath for a maximum of 10 seconds. Speech recordings were screened prior to feature analysis to manually remove speech artefacts and background noise.

Acoustic Feature Extraction
Acoustic features were extracted using custom-made MATLAB scripts that used standard signal processing functions from MATLAB (44), onset and offset detection algorithms (45), beat detection algorithms (25), music information retrieval (46), and speech analysis toolboxes (47,48). Acoustic features consisted of summary statistics (mean, standard deviation, coefficient of variation, minimum, maximum, range) of 74 variables that measure different aspects of speech qualia. These features include speech rate, utterance duration, pause duration, fundamental frequency, the first five formants, intensity, summed and peak energy across frequency bands, spectral decrease and spread, and a range of other spectral features used in the acoustic clinical marker literature (see supplementary materials for a full list and additional references).
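As a small illustration of the summarisation step (the extraction itself was performed in MATLAB), the Python function below computes the six summary statistics for a single frame-wise feature track; the handling of missing frames is an assumption.

```python
# Illustrative summary statistics for one frame-wise feature track.
import numpy as np

def summarise(feature_track):
    x = np.asarray(feature_track, dtype=float)
    x = x[~np.isnan(x)]                 # drop unvoiced/missing frames (assumption)
    return {
        "mean": x.mean(),
        "sd": x.std(ddof=1),
        "cv": x.std(ddof=1) / x.mean(),  # coefficient of variation
        "min": x.min(),
        "max": x.max(),
        "range": x.max() - x.min(),
    }
```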

Machine Learning
We used CatBoost as our machine learning classification algorithm. CatBoost is an open-source decision-tree-based algorithm with gradient boosting and hardware optimisation (49). The main advantage of CatBoost over other algorithms is that it builds symmetric trees, employs weighted sampling, and performs ordered boosting. It also lowers the weights of variables that are less useful in identifying groups. These features decrease the need for hyperparameter tuning and reduce the chance of overfitting (49). Cross-validation was performed using 67%-33% train-test splits with 100 resamples, using stratification to achieve the same balance for each class (a minimal sketch is given after the figure caption below).

Figure 1. Mean, dispersion, and range for classification accuracy (A = F1 accuracy, B = precision, C = recall) for healthy controls and the groups with Friedreich ataxia and multiple sclerosis, for the full model and for models using subsets of acoustic features selected with maximum group-wise average SHAP value cut-offs of 2% and 5%.
