An Approach for Assisting Diagnosis of Alzheimer's Disease Based on Multi-Modal Features of Narrative Speech

Background: Alzheimer's Disease (AD) is a common dementia that affects patients' linguistic function, memory, cognition and visuospatial ability. More and more studies have sought non-invasive, accessible, cost-effective methods for the detection of AD. Speech has been proven to be related to AD, so a time when AD can be diagnosed in a doctor's office is coming. Methods: In our study, the ADReSS 2020 dataset, which is balanced in gender and age, was used to detect AD. First, we extracted three categories of features: acoustic features extracted with the openSMILE toolkit, BERT embeddings extracted automatically, and complicated linguistic features extracted manually. The linguistic features are based on POS tags, lexical richness, fluency and semantic features. Then seven different classifiers were used to distinguish AD patients from normal controls: SVM, Logistic Regression, Random Forest, Extra Trees, AdaBoost, LightGBM and a novel ensemble approach with a majority voting strategy, which is applied to overcome errors made by individual base classifiers. Finally, ten-fold cross-validation was adopted for the evaluation of our approach. In addition, individual features and their combinations were fed to the six base classifiers and the ensemble of classifiers. Results: We obtained the top-performing classification result on the test set with the ensemble of classifiers, whose best accuracy is 85.4%. The best-performing feature set is the linguistic features, with an accuracy of 85.6% using the LightGBM classifier, and the SFS approach was used to identify seven discriminative linguistic features. Conclusions: The statistical and experimental results illustrate the feasibility of using speech to predict AD effectively based on acoustic and linguistic features. Stronger classifiers and discriminative features are vital to the final results. We emphasize that the best linguistic features for predicting AD are based on POS tags, lexical richness, fluency and semantic features.
An ensemble of classifiers usually performs better than a single classifier.

years old suffering from dementia in China, including 9.83 million AD patients, 3.92 million vascular dementia patients and 130,000 patients with other dementias. Meanwhile, there are over 5 million AD patients in America now (https://www.alz.org/alzheimers-dementia/facts-figures). It is estimated that by 2030, 76 million people will be diagnosed with AD or other dementias. AD has gradually become a worldwide problem. AD is a chronic, progressive disease characterized by a gradual loss of the ability to live independently in daily life. Although clinicians can differentiate people with AD from healthy controls with a combination of cognitive test scales [2], this is time-consuming and the selected measures lack uniformity. So it is essential to find a more reliable but simpler test method to help differentiate people with different cognitive statuses, especially for the early diagnosis of AD.
As a part of higher brain function, language function has a close relationship with cognitive function [3]. Discourse reflects one's psychological activities and can clearly manifest the complicated relationships among cognition, language and communication [4]. As discourse reflects the speaker's intention and attention, researchers have found that AD is strongly related to linguistic function [5]. A universal symptom among AD patients is a language barrier [6], and discourse disorders appear even earlier than memory and orientation damage, so Caplan [7] pointed out that the most common method to study the relationship between language and the brain is to analyze speech disorders caused by brain damage. Snowdon [8] demonstrated that low language proficiency is an important index for people with cognitive impairment in daily life. Language may therefore be a better identifying indicator than other measures such as memory, learning and cognitive function, and automatic detection and screening methods based on speech have great potential for AD patients.
Previous studies on the relationship between linguistic complexity and AD demonstrated that low language ability is strongly related to cognitive impairment [12]. Larrieu [13] and Howieson [14] found that some people progress steadily from mild cognitive impairment (MCI) to AD, while others remain stable for many years, and a minority even return to normal cognitive status. Some studies have attempted to quantify linguistic impairments using computational techniques, as there are many differences in spoken language that can supply discriminative markers. Guinn [15] took filled pauses, repetitions and incomplete words as linguistic features, which proved more discriminative than POS tags and measures of lexical diversity, and finally obtained an accuracy of 79.5%. Using the Praat software, Meilán [16] extracted acoustic features such as the number of voice breaks, shimmer, the number of voice periods, the percentage of voice breaks and the noise-to-harmonics ratio, and finally reached an accuracy of 84.8% in distinguishing 30 AD patients from 36 healthy controls. Jarrold [17] found that AD patients tend to use more verbs, adjectives and pronouns and fewer nouns than healthy controls, and obtained a best accuracy of 88% using POS features, acoustic features and psychologically motivated word lists. Orimaye [18] found that AD patients used fewer syntactic components and significantly more lexical components. Yancheva [19] and Fraser [20] extracted 477 acoustic, semantic and lexico-syntactic features from a Cookie Theft picture task and selected the forty most informative features from the DementiaBank database; the accuracy finally reached over 92% in distinguishing AD patients from healthy controls.
To summarize, previous work in this area mainly used two methods. The first is to build feature engineering and then use machine learning classifiers to recognize AD, which requires much expert knowledge to obtain distinguishing features, so the completeness of the features cannot be guaranteed; feature extraction is mostly based on grammar, semantics, pragmatics and so on [9][10][11]. The second method is deep learning, which uses powerful neural networks with multiple hidden layers to solve general machine learning tasks without feature engineering. Deep neural networks learn representations from data using cascades of multilevel nonlinear processing units for feature extraction. The performance of deep learning is better, but its interpretability is worse than that of the first method, which matters greatly for clinical diagnosis. Because the relationship between the brain's neurocognitive mechanisms and language itself remains unclear, the development of AD linguistics has been hindered, while the linguistic analysis of AD patients, especially the relationship between linguistic features and patients' impaired brain areas, may reveal the internal pathogenesis of AD. Linguistic feature extraction therefore has great significance for the diagnosis and treatment of AD patients, and we believe that using oral linguistic markers to diagnose AD is a compelling direction for future research.
Much work has been done using the ADReSS dataset (including both the classification and MMSE prediction tasks), while we are only interested in the classification task between AD patients and normal controls [21][22][23].
The challenge winner, Yuan [21], used BERT embeddings combined with encoded pauses. Pompili [24] used both acoustic and textual feature embeddings and attained 81.25% accuracy in the ADReSS challenge.
In all, most research shows that, by extracting clinical markers from audio and transcripts, classification among AD, MCI and healthy controls is practicable. These studies include a small number of specialized classifications, such as PPA, vascular dementia and so on. The tasks include the most common picture description task (e.g. the Cookie Theft picture) and recalling or repeating stories (e.g. Cinderella's story). The databases used mainly include public databases such as DementiaBank or self-built corpora. Transcription is done manually or automatically, and the accuracy of these studies is roughly between 80% and 92%.
In this work, we present a multi-modal system for this topic. We exploited syntactic, lexical and semantic features with measures of global and theme coherence; we also tried the widely acclaimed BERT method and acoustic features, which are relatively easy to obtain with software, and then combined different features in pursuit of state-of-the-art performance. Meanwhile, we used a novel ensemble approach with a majority voting mechanism and obtained a best accuracy 4.2% lower than the champion's (89.6% [21]) in the ADReSS challenge. We tried feature extraction methods across different modalities so as to provide a more comprehensive characterization of speech and linguistic abilities in AD and a more reliable identification.
The main purpose of our study is to explore different analysis measures to identify acoustic and linguistic features unique to AD, and then to classify subjects correctly with a stronger classifier. The main contributions of our work include the following four aspects.
(1) We combine state-of-the-art deep learning methods (BERT embeddings, extracted automatically) with phonological/linguistic features extracted manually in order to find suitable discriminative features.
(2) We extracted complicated linguistic measures manually from the Cookie Theft picture task and used the SFS method to find the most discriminative features for different subjects, in order to improve the interpretability of the model.
(3) The linguistic features we extracted are based on POS tags, lexical richness, fluency and semantic features, and they achieved the best results for a single classifier.
(4) To our knowledge, this is the first time an ensemble of classifiers with a majority voting strategy has been used on features of narrative speech and language to discriminate AD patients from normal controls, and we achieved our best accuracy by this means.

Methods
One of the main contributions of this study is an ensemble framework for AD prediction using a majority voting strategy. The approach includes three phases: preprocessing, feature extraction, and an ensemble of classifiers that produces the final result. The detailed workflow of our method is shown in Figure 1.

ADReSS Dataset
The data used in this study come from the 2020 ADReSS Challenge. The data consist of speech recordings and transcripts from the Boston Diagnostic Aphasia Exam [25], a picture description task using the Cookie Theft picture. The text was transcribed and annotated using the CHAT coding system [26]. The speech was segmented using a voice activity detection method based on signal energy. The dataset includes 2,122 acoustic segments from 78 AD patients and 1,955 acoustic segments from 78 normal controls. All recordings were pre-processed by removing noise and normalizing speech volume, and the corpus includes many user-defined tags. The composition of the full dataset is shown in Table 1; the mean and standard deviation (SD) of age and MMSE are shown in Table 2. There is little difference in age between the two groups, and each group contains 78 subjects.
As the preprocessing, including transcription and annotation, has already been completed by the challenge organizers, we did not need to repeat this first step.

Feature extraction methods
Feature extraction in this study includes three sections: acoustic features, BERT embeddings and manually extracted linguistic features. The manually extracted linguistic features are described in the most detail because they are comparatively the most complicated.

Acoustic features
We used the 384 acoustic features that can be obtained from the openSMILE toolkit [27]. This feature set was proposed by Schuller [28] for the 2009 InterSpeech challenge. Briefly, the extraction process is as follows. First, 16 low-level descriptors (LLDs) are calculated, including the zero-crossing rate, root-mean-square energy, F0, HNR and MFCCs 1-12. Then the first-order deltas of the 16 LLDs are computed, giving 32 LLDs. Finally, 12 statistical functionals are applied to the 32 LLDs, yielding 32 × 12 = 384 features.
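openSMILE computes these functionals internally; as a rough illustration of the final stage, the following minimal numpy sketch (our own simplification, not the openSMILE implementation; the 12 functionals only approximate the IS09 set) applies 12 functionals to each of 32 pre-extracted LLD contours:

```python
import numpy as np

def functionals(contour):
    """12 statistical functionals (approximating the IS09 set) of one LLD contour."""
    n = len(contour)
    t = np.arange(n)
    mean, std = contour.mean(), contour.std()
    z = (contour - mean) / (std + 1e-12)
    slope, offset = np.polyfit(t, contour, 1)      # linear regression fit
    resid = contour - (slope * t + offset)
    return np.array([
        contour.max(), contour.min(),
        contour.max() - contour.min(),             # range
        contour.argmax() / n, contour.argmin() / n,  # relative positions of extrema
        mean, std,
        (z ** 3).mean(), (z ** 4).mean() - 3.0,    # skewness, excess kurtosis
        slope, offset, (resid ** 2).mean(),        # regression slope, offset, MSE
    ])

def utterance_features(llds):
    """llds: (n_frames, 32) LLD matrix -> 32 * 12 = 384-dim feature vector."""
    return np.concatenate([functionals(llds[:, j]) for j in range(llds.shape[1])])
```

Applied to a frame-by-LLD matrix for one utterance, this collapses the variable-length contours into a fixed 384-dimensional vector, which is what makes the features usable by the classifiers below.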

Linguistic Features on BERT Embeddings
A pre-trained Bidirectional Encoder Representations from Transformers (BERT) [29] model is used as a feature extractor. BERT is a pre-trained language representation model that obtains state-of-the-art results in many Natural Language Processing (NLP) tasks; it outperforms earlier methods as it is the first deeply bidirectional system for NLP. There are six layers in the architecture, and every layer has two sub-layers: a multi-head self-attention layer and a fully connected feed-forward layer. We use only the encoder of the transformer, composed of these transformer blocks, as the feature extractor, obtaining high-level word embedding representations that capture information universal across different tasks. The sum of the position embeddings and word embeddings forms the input; we then use the BERT model with the PyTorch deep learning framework to extract embeddings, whose dimension is (156, 768) for each speaker's dialogue in the transcripts, where 156 is the sequence length and 768 is the hidden size. The encoder structure of the transformer is shown in Figure 2, where x1 and x2 are input words. The configuration of the BERT model we used is shown in Table 3.

Complicated linguistic features

The authors of [37] and Bucks [38] found an increase in the proportion of verbs, adjectives and pronouns and a decrease in the proportion of nouns in AD patients. Ahmed [39] found a change in the number of verbs and pronouns. AD patients often exhibit perseverative behavior in their daily life, including obvious linguistic differences from healthy controls [40]. For example, Tomoeda [41] and Nicholas [42] found that AD patients repeated some words or phrases more often than normal speakers, although the frequency was not related to the severity of the disease.
Pauses are also a common phenomenon in AD patients; the frequency and location of pauses can convey important information, for example about the expression of objects in a scene and the coherence and organization of language. Fluency and coherence are both important linguistic reference characteristics.
There is a consistent conclusion that the impairment of word retrieval and naming ability in AD patients is closely related to the impairment of semantic memory, among other factors. There is no uniform standard for linguistic measures, and weighing the pros and cons of these measures is beyond the scope of our study. After careful consideration, we expound our extraction method from the following four aspects: Part-of-Speech (POS), lexical richness, fluency and semantic features. We used the Natural Language Toolkit (NLTK), developed at the University of Pennsylvania, to automatically extract POS information for nouns, gerunds, pronouns, verbs and gerund phrases. We then computed the frequency of occurrence of these POS tags, normalized by the total number of words in each utterance. Meanwhile we calculated ratios such as pronoun-to-noun and word-to-sentence, as well as counts of pauses and unintelligible words. In addition, lexical richness (TTR, ARI, CLI and so on), linguistic fluency and semantic understanding also serve as indices in our study. A more detailed description of the 20 linguistic features is shown in Table 4. We count the number of keywords (from the list of the ten groups above) in the dialogue.
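As a simplified illustration of a few of these measures (the feature names follow Table 4; for brevity, the sketch takes pre-tagged (word, POS) pairs as input, in the format produced by NLTK's `pos_tag`, rather than calling NLTK itself):

```python
from collections import Counter

def linguistic_features(tagged_tokens, n_sentences):
    """Compute a few Table-4-style measures from Penn Treebank (word, POS) pairs."""
    words = [w.lower() for w, _ in tagged_tokens]
    tags = Counter(tag for _, tag in tagged_tokens)
    nouns = tags["NN"] + tags["NNS"] + tags["NNP"] + tags["NNPS"]
    pronouns = tags["PRP"] + tags["PRP$"]
    return {
        "TTR": len(set(words)) / len(words),        # type-token ratio
        "prp_noun_ratio": pronouns / max(nouns, 1), # pronoun-to-noun ratio
        "word_sentence_ratio": len(words) / n_sentences,
        "noun_count": nouns,
    }

# Toy utterance from a Cookie-Theft-style description.
sample = [("the", "DT"), ("boy", "NN"), ("is", "VBZ"), ("taking", "VBG"),
          ("the", "DT"), ("cookie", "NN"), ("and", "CC"), ("he", "PRP"),
          ("falls", "VBZ")]
feats = linguistic_features(sample, n_sentences=1)
```

Normalizing counts by utterance length, as here, keeps the measures comparable across speakers who talk for different amounts of time.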

SFS algorithm
In order to improve the interpretability of the model, we need to identify the most discriminative features that influence the final result. We chose the sequential forward selection (SFS) [49] method, an iterative search strategy and a simple greedy algorithm. The feature set starts empty; at each iteration, the accuracy of the model is calculated after adding one candidate feature at a time, and the feature that yields the best result is added to the final feature set. When adding a new feature no longer improves the accuracy of the model, the iteration ends. In our work, we explored a variant of SFS in which this terminating condition is relaxed, so that the search does not stop as soon as the first local maximum is found. Finally, seven features which account for over 80% of the performance were found to meet the performance convergence criterion. Figure 3 shows the accuracy when adding the linguistic features T0, T1, T2, T3, T4, T5 and T6 in turn (T0: prp_noun_ratio, T1: num_concepts_mentioned, T2: SIM_score, T3: TTR, T4: noun_count, T5: word_sentence_ratio, T6: ARI). T0 (prp_noun_ratio) alone accounts for about 63.63% of the performance of all 20 linguistic features, and the accuracy improves by 14.3% after adding T1. The rate of accuracy increase is slower from T2 to T6, which together yield only a 2% improvement. Figure 4 is a boxplot of the seven features (T0-T6), from which the different scores between the AD and control groups can be seen.
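The basic greedy SFS loop can be sketched as follows (our own simplification; `score` stands for any evaluation callback, e.g. the cross-validated accuracy of a classifier trained on the selected features):

```python
def sfs(features, score, min_gain=0.0):
    """Greedy sequential forward selection.

    features: iterable of candidate feature names.
    score: callable taking a list of feature names and returning a
           performance estimate (e.g. cross-validated accuracy).
    Stops when no remaining feature improves the score by more than
    min_gain (the basic SFS termination criterion).
    """
    selected, remaining = [], list(features)
    best = float("-inf")
    while remaining:
        trials = {f: score(selected + [f]) for f in remaining}
        f_best = max(trials, key=trials.get)
        if trials[f_best] - best <= min_gain:
            break  # first local maximum reached
        best = trials[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best
```

In the variant used in this work, the stopping test above is relaxed so that the search can continue past the first local maximum instead of terminating there.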

Classifier
In the following experiments, we chose six classifiers to train on the datasets: SVM, Logistic Regression, Random Forest, Extra Trees, AdaBoost and LightGBM. A brief description of these classifiers follows. (1) Support Vector Machines (SVM).
SVMs are supervised classifiers [50] that perform well on limited data. They work by finding the maximum-margin hyperplane that best separates the two classes. We used an SVM with an RBF kernel for our work, as its performance was better than that of a linear kernel on this small dataset.
(2) Logistic Regression
Logistic Regression establishes a cost function, solves for the best model parameters iteratively through an optimization method, and then verifies the performance of the solved model. It is often used in binary classification and is popular in industry because it is simple, parallelizable and highly interpretable. The decision function of Logistic Regression is h(x) = 1 / (1 + e^(-w^T x)), where x is a training sample and w is the learned weight vector.
(3) Random Forest and (4) Extra Trees
Random Forest chooses the best split point of a feature based on the Gini coefficient or mean squared error, while Extra Trees chooses a split value at random. Compared with Random Forest, Extra Trees further decreases variance at the cost of further increased bias, so its generalization ability is relatively better.

(5) Adaboost
AdaBoost (adaptive boosting) is an iterative algorithm whose main idea is to train different weak classifiers on the same training set and then assemble these weak classifiers into a final, stronger classifier. By adjusting the weights of the samples and of the weak classifiers, the trained classifiers with the smallest error rates receive the largest weight coefficients in the final strong classifier.
The process runs as follows. Each training sample is given a weight, and these weights form a vector D. Initially the weights are equal, and they are updated after each round of training. AdaBoost allocates each base classifier a weight alpha calculated from the error rate of that base classifier; the error rate m and the weight alpha are defined as m = (number of misclassified samples) / (total number of samples) and alpha = (1/2) ln((1 - m) / m). After calculating alpha, the weight vector D is updated by reducing the weights of correctly classified samples and increasing the weights of misclassified samples.
If a sample is classified correctly, its weight is updated as D_i ← D_i · e^(-alpha) / Z, while if a sample is classified wrongly, its weight is updated as D_i ← D_i · e^(alpha) / Z, where Z is a normalization factor that keeps D a distribution.
Ensemble of classifiers with majority voting
The final prediction of the ensemble is the class that receives the most votes from the base classifiers, where "0" stands for the AD class, "1" stands for normal controls, and "n" is the number of base classifiers.
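One AdaBoost round implementing these updates can be sketched as follows (our own illustration; note that, as in standard AdaBoost, the sketch computes the error over the current sample weights, whereas the text above defines m as a plain count ratio):

```python
import numpy as np

def adaboost_round(D, y_true, y_pred):
    """One AdaBoost boosting round.

    D: current sample-weight vector (a distribution summing to 1).
    y_true, y_pred: arrays of 0/1 labels (0 = AD, 1 = normal control).
    Returns the updated, renormalized weights and the classifier
    weight alpha = 0.5 * ln((1 - m) / m).
    """
    miss = y_pred != y_true
    m = D[miss].sum()                      # weighted error rate
    alpha = 0.5 * np.log((1 - m) / m)
    # e^{+alpha} for misclassified samples, e^{-alpha} for correct ones
    D = D * np.exp(np.where(miss, alpha, -alpha))
    return D / D.sum(), alpha              # renormalize so D sums to 1
```

After the update, misclassified samples carry more weight, so the next weak classifier is pushed to focus on the examples the previous one got wrong.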
The six base classifiers were trained on the seven feature sets, giving 6 × 7 = 42 votes in all. If the vote is tied at 21:21, the prediction of the classifier that performs best on the current feature set is used as the final result.
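The voting rule, including the tie-break, can be sketched as follows (a simplified illustration; `fallback` stands for the prediction of the best-performing single model):

```python
from collections import Counter

def majority_vote(votes, fallback):
    """votes: 0/1 predictions from the 42 (classifier, feature-set) pairs.
    fallback: prediction of the best single model, used on a 21:21 tie."""
    counts = Counter(votes)
    if counts[0] == counts[1]:
        return fallback
    return 0 if counts[0] > counts[1] else 1
```

With 42 voters an exact tie is possible, which is why the tie-break rule is needed.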

Model evaluation and Cross-Validation
The computer we used has an Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz and 8.00 GB RAM. The experimental environment is the Windows 10 operating system with Python 3.7.0 and the scikit-learn library. We use 10-fold cross-validation, in which a 10% test set is held out in each iteration for evaluation and the remaining 90% is used to train the models; the reported result is the average across the 10 folds. That is to say, within a given fold, data from any speaker appear in either the training set or the test set, but never both. As ADReSS is a balanced dataset, with 78 AD patients and 78 normal controls, we use accuracy, AUC and F1 score as the final evaluation metrics.
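Such a speaker-level split can be sketched as follows (our own illustration; in practice scikit-learn's `GroupKFold` provides the same guarantee):

```python
import random

def speaker_kfold(speaker_ids, k=10, seed=0):
    """Yield (train, test) sets of speaker IDs so that no speaker's
    data can appear on both sides of the same fold."""
    ids = sorted(set(speaker_ids))
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]   # k near-equal folds of speakers
    for i in range(k):
        test = set(folds[i])
        yield set(ids) - test, test
```

Splitting by speaker rather than by segment prevents the classifier from being evaluated on segments of a voice it has already seen during training.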
The relationship between the actual and predicted classes is shown in Table 6, and the evaluation metrics used in this study are defined in Eqs. (2) to (6). The classification results are provided in Tables 7, 8 and 9.
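These metrics can be computed directly from the confusion-matrix counts of Table 6; a minimal sketch (AUC is omitted since it additionally requires the ranked classifier scores):

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

F1, as the harmonic mean of precision and recall, complements accuracy even on this balanced dataset by penalizing classifiers that trade one kind of error for the other.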

The result of ensemble of classifiers
In order to compare our results with the champion's (best accuracy 89.6%), we use accuracy as the general evaluation index. As the dataset is balanced, with 78 AD and 78 normal subjects, and the threshold of the binary classifier is 0.5, it is reasonable to evaluate classifier performance with accuracy. The accuracy of majority voting is shown in Figure 5; the best accuracy we obtained is 85.4%. Two conclusions can be drawn from this result. On the one hand, the majority voting strategy used for the ensemble in this study (the minority is subordinate to the majority) may not be impartial: in our view, base classifiers with better performance should carry more weight, so simple majority voting may not be the best strategy, although how to allocate the weight of each base classifier is itself a subject worth studying. On the other hand, the ensemble approach may achieve better results by discarding features with poor performance. From Table 6 we know that the acoustic features perform worst among the three individual feature sets. Among the combined feature sets (AL, AB, LB and ALB), the best performance comes from the LB set, which excludes acoustic features, with ALB second.

Conclusions
This study is an essential step toward a simple, practical, low-cost and reliable tool for the early detection of AD and other dementias based on multi-modal features of narrative speech and language. We also hope such a tool can track the gradual changes of AD in real time as the disease develops. In our work, we explored several speech and linguistic complexity measures in order to find discriminative markers of AD, and then validated the effect of those markers in discriminating clinically different cognitive-function groups. We tried three different feature extraction methods; in particular, we spent much energy on the manual linguistic feature extraction method and gave a detailed explanation of the different statistical measures and of the best seven linguistic features selected by the SFS algorithm. Finally, through an ensemble of classifiers with a majority voting strategy over different feature sets, we obtained a best accuracy of 85.4%.
From the study we find that the acoustic features extracted by openSMILE are relatively easier to obtain than the other features, as they do not require a transcript, but their performance is the worst in the final results. BERT embedding features have received more attention in this area since transformers emerged in 2017, especially after the best ADReSS result in 2020 was achieved with a BERT architecture. The result with BERT embeddings is not bad, though not the best, and it is a simple approach. Manual linguistic feature extraction is the most complicated: it first requires transcription and annotation with CLAN, it needs professional knowledge, and it is hard to make comprehensive; the accuracy in current studies is usually around 80%-88%.
Differences in oral language do supply markers of good discriminative utility for elderly subjects. The topic-focused and narrow use of oral language makes automatic feature extraction much more accurate for many elders. Based on the above considerations, we believe the use of spoken language markers to diagnose AD is a promising and compelling area for further research.
In the future, acoustic features remain an area worth exploring; for example, the importance of pauses, such as their location and number, has already been proven. Deep learning models are also a promising research direction, as such methods are relatively easy to apply without a complicated feature extraction process and their performance is usually better, although their explainability is worse, since deep learning is, after all, a "black box". We think the following aspects may be future research directions for the detection of cognitive impairment based on speech: