Dementia due to Alzheimer’s disease (AD) is the most common chronic neurodegenerative disorder worldwide. Mild cognitive impairment (MCI), the prodromal phase of cognitive decline, can be detected with neuropsychological tests and reverted with appropriate interventions (1). For all of these cognitive conditions, early detection is essential to ensure effective and timely treatment and to slow the progression of cognitive deterioration. For example, there is growing consensus that pharmaceutical interventions may be most effective at the earliest stages of dementia (DM), before serious and irreversible neuropathological changes begin (2).
Various screening techniques have been developed for detecting cognitive decline. Cognitive function tests such as the Mini-Mental State Examination (MMSE) (3) and the Montreal Cognitive Assessment (4) are conventional methods widely used to screen for DM and MCI. In addition, fluid biomarkers collected from cerebrospinal fluid, blood, saliva, and tears (5), as well as brain imaging with magnetic resonance imaging (MRI) (6) and positron emission tomography (PET) (7), are used as reliable clinical examinations to detect pathological findings such as the accumulation of amyloid β, a causative agent of AD. However, these methods have several disadvantages, including their time-consuming nature, high cost, invasiveness, and the need for dedicated equipment.
As a relatively new approach, diagnostic assistance based on the analysis of patients’ voices to detect cognitive deterioration (i.e., vocal biomarkers) has been studied extensively over the last decade (8). This approach is non-invasive, requires no specialized or expensive equipment, and can be conducted efficiently and remotely. In addition, voice data collection and analysis are inexpensive compared with brain imaging or fluid tests. Many studies have successfully detected cognitive impairment using voice data as vocal biomarkers. However, most of these studies extract prosodic and/or temporal features from voice recorded during cognitive tasks such as picture description (using the “cookie theft” picture in most instances) (9–12), sentence reading (13–16), and storytelling or conversation with a clinician (17–20), all of which are somewhat time-consuming and require a skilled examiner to administer the task. In addition, when a patient is examined repeatedly with the same task to monitor cognitive function, the task-based recordings can be strongly affected by the “learning effect”: repeated exposure to the same task can mask cognitive decline (e.g., when an individual remembers the answers to a task) (21).
Another common method is a machine-learning model built on linguistic features, primarily using natural language processing (NLP) (10, 11, 22, 23). Although these methods offer high performance in dementia detection, their linguistic features are highly language-dependent. Thus, text-based models can be applied only in regions where patients speak the same language as the one on which the model was trained.
In this study, we aimed to test the performance of prediction models for detecting cognitive dysfunction using purely acoustic features (i.e., without linguistic features). Our model uses prosodic and temporal features extracted from two simple, language-independent phrases and can therefore be applied to patients in different regions speaking various languages.
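To illustrate the kind of purely acoustic measures such an approach can draw on, the following Python/NumPy sketch extracts one prosodic feature (fundamental frequency, estimated by autocorrelation) and one temporal feature (pause ratio, from short-time energy) from a synthetic signal. This is not the pipeline used in this study; the function names, frame sizes, and thresholds are illustrative assumptions only.

```python
import numpy as np

def estimate_f0(frame, sr, fmin=75, fmax=400):
    """Estimate fundamental frequency (Hz) of a voiced frame via autocorrelation.

    The search is restricted to lags corresponding to the plausible
    human F0 range [fmin, fmax]; these bounds are illustrative defaults.
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

def pause_ratio(signal, sr, frame_len=0.025, threshold=1e-4):
    """Fraction of non-overlapping frames whose short-time energy
    falls below an (illustrative) silence threshold."""
    n = int(frame_len * sr)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n, n)]
    energies = np.array([np.mean(f ** 2) for f in frames])
    return float(np.mean(energies < threshold))

# Synthetic demo: 0.5 s of a 200 Hz tone (voiced) followed by 0.5 s of silence
sr = 16000
t = np.arange(int(0.5 * sr)) / sr
voiced = 0.3 * np.sin(2 * np.pi * 200 * t)
signal = np.concatenate([voiced, np.zeros(int(0.5 * sr))])

f0 = estimate_f0(signal[:1024], sr)   # expected near 200 Hz
pr = pause_ratio(signal, sr)          # expected near 0.5
```

Features of this kind, computed over short fixed phrases, could then be fed to a standard classifier without any transcription or language-specific processing.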