Alzheimer's Disease Early Detection Using Machine Learning Techniques

Alzheimer's disease is the main cause of dementia and frequently affects older adults. The disease is costly, especially in terms of treatment. In addition, Alzheimer's is one of the causes of death among older adults. Early detection of Alzheimer's supports medical staff in diagnosing the disease, which can help decrease the risk of death. This makes the early detection of Alzheimer's disease a crucial problem in the healthcare industry. The objective of this research study is to introduce a computer-aided diagnosis system for Alzheimer's disease detection using machine learning techniques. We employed data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Open Access Series of Imaging Studies (OASIS) brain datasets. Common supervised machine learning techniques, such as logistic regression, support vector machine, random forest, and linear discriminant analysis, were applied for automatic Alzheimer's disease detection. The best accuracy values provided by the machine learning classifiers are 99.43% and 99.10%, given respectively by logistic regression and support vector machine on the ADNI dataset, whereas on the OASIS dataset we obtained 84.33% and 83.92%, given respectively by logistic regression and random forest.


Introduction
Dementia is "the loss of memory, language, problem solving and other thinking abilities that are severe enough to interfere with daily life tasks" (Alzheimer's Association, 2019). It is not considered a specific disease, but a set of symptoms involving a progressive decline in memory or other thinking and reasoning skills (Creative Caregivers LLC, 2020). In an aging society, dementia is a priority in health and social care. Dementia usually affects the elderly: only 2% of people with dementia are aged 65 or younger (Alickovic & Subasi, 2020). Worldwide, around 50M people have dementia, and approximately 10M new cases occur every year (World Health Organization, 2020b). It is predicted that by 2030 the number of people with dementia will be around 75M, which will cost society nearly US$ 2 trillion (Prince et al., 2015). In today's world, dementia is more significant in terms of healthcare than other diseases (World Health Organization, 2020a). In addition, diagnosing dementia is difficult because there is no standardized test for its detection (Stamate et al., 2020). Moreover, there is currently no treatment that cures dementia, although some treatments are available to support and improve the lives of patients and their caregivers.
Alzheimer's is a frequent type of dementia and is considered a main threat for the healthcare industry in today's world (Jo et al., 2019). It accounts for 60-80% of dementia cases (Creative Caregivers LLC, 2020). In 2018, around 50M people were affected by Alzheimer's (Patterson, 2018). Moreover, Alzheimer's is the sixth leading cause of death in the United States (World Health Organization, 2020a). Alzheimer's disease can be clinically diagnosed by physical and neurological examination, which can be costly and time-consuming. Alzheimer's symptoms generally develop slowly and worsen over time, eventually becoming severe enough to affect daily activities (Alzheimer's Association, 2019). However, detecting this disease early, before most of its symptoms are observable, is difficult. Predicting Alzheimer's at pre-symptomatic stages is recommended in order to slow down the disease progression. Currently, Alzheimer's disease is diagnosed by calculating the Multi Slice Multi Echo (MSME) score and by the manual study of the Magnetic Resonance Imaging (MRI) scan (Janghel & Rathore, 2020). This may require analyzing thousands of slides of brain tissue, which is lengthy and costly. Learning-based techniques can be used to speed up the diagnosis process and reduce its cost. In recent years, researchers in artificial intelligence have investigated the use of advanced technologies to improve the quality and precision of Alzheimer's detection. As a result, several machine learning models have been applied successfully for early disease detection (Kumar, 2019).
In this paper, we propose a computer-aided detection system for the early detection of Alzheimer's using machine learning. The remainder of this paper is organized as follows. In Sect. 2, we provide background information about Alzheimer's disease and review previous work on the use of learning-based methods in the early detection of Alzheimer's. Section 3 describes the use of machine learning in the early detection of Alzheimer's disease. Section 4 presents and discusses the experimental results. We conclude this work in Sect. 5.

Alzheimer's Disease And Related Work
In this section, we provide background information about Alzheimer's disease and survey previous research studies that proposed learning-based approaches for the early detection of this disease.

Alzheimer's Disease
Alzheimer's disease, referred to simply as Alzheimer's, is a "progressive neurological brain disease, which is caused due to the damage of nerve cells in parts of the brain" (Alzheimer's Association, 2019). Alzheimer's usually has severe physical and psychological effects on the person with the disease and on their family. It normally starts with a slow progression and gets increasingly worse over time. At the beginning of the disease, the first symptom to appear is memory loss. In the advanced stages, the symptoms become more serious: a person with Alzheimer's may suffer from emotional changes (e.g., depression, apathy, etc.), changes in behavior, and even a decrease in physical abilities (e.g., coordination, managing self-care, etc.). To date, there is no cure for Alzheimer's despite the worldwide effort to find better ways of treating this disease. Nevertheless, treatments for Alzheimer's symptoms are available. Those treatments cannot prevent the progression of Alzheimer's, but they are used to temporarily reduce the severity of its symptoms. Hence, the earlier a person is diagnosed with Alzheimer's, the sooner they can receive help. Alzheimer's disease has recently received considerable attention in scientific research because it eventually leads to death. However, diagnosing Alzheimer's requires a thorough clinical assessment based on the patient's medical history, several neuropsychological tests, and other pathological evaluations (Kundaram & Pathak, 2021). Those examinations can be costly and time-consuming.

Related Work
In recent years, learning-based techniques such as supervised machine learning have been increasingly applied in healthcare. In particular, computer-aided detection systems that use learning-based techniques have been successfully applied to the detection of multiple diseases (e.g., heart disease (Barik et al., 2020), breast cancer (Asri et al., 2016), etc.). Correct early detection can be beneficial for clinicians and patients. Regarding the use of machine learning for Alzheimer's detection, different models have been commonly used. For instance, Alickovic & Subasi conducted a comparative study to evaluate how well supervised machine learning models can be used in the prediction of Alzheimer's disease (Alickovic & Subasi, 2020). This study focused mainly on support vector machine, naïve Bayes, k-nearest neighbours, random forest, artificial neural network, and logistic regression. They conducted their experiments using the ADNI data repository (ADNI, 2017). The highest performances were given by the random forest classifier with an accuracy of 85.77%, and the k-nearest neighbours classifier with an accuracy of

The Alzheimer's Disease Neuroimaging Initiative Dataset
The data used in this research study have been obtained from the widely used data repository ADNI (Alzheimer's Disease Neuroimaging Initiative) (ADNI, 2017). As illustrated in Table 1, this dataset has been widely used, mainly to detect Alzheimer's disease at the earliest stages (ADNI, 2017). A detailed description of the ADNI dataset is given in Table 2. The ADNI dataset includes data recorded from North American male and female individuals who are "Cognitively Normal", with "Early Mild Cognitive Impairment", with "Late Mild Cognitive Impairment", or with "Alzheimer's Disease". The dataset used in this paper contains 502 attributes for 1737 participants. This dataset is longitudinal since it contains data from multiple visits per patient. In fact, ADNI contains records of individuals' examinations, at different monthly intervals (i.e., from 0 to 120 months), from July 2005 to May 2017. Consequently, this dataset contains a total of 8320 examinations (see Table 2

ADNI Dataset Preprocessing
Data pre-processing is an important task in this research work. It refers to all the necessary transformations done on the data to be used.

ADNI dataset cleaning
As mentioned in Sect. 3.1, the ADNI dataset contains a total of 502 parameters (i.e., attributes). Of these, the 22 most relevant attributes that are mainly used in Alzheimer's disease detection are listed in Table 3. They include String (e.g., PTGENDER and PTMARRY), Number (e.g., AGE, PTEDUCAT, etc.) and Boolean (e.g., APOE4) data types. The class of people diagnosed with Alzheimer's disease is assigned DXCHANGE = 3.0. The class of people with mild cognitive impairment is assigned DXCHANGE = 2.0. Finally, the class of cognitively normal people is assigned DXCHANGE = 1.0. However, the ADNI dataset contains missing values. Hence, before being used by the different learning-based models, this dataset needs to be cleaned. We removed rows that include missing values, but kept rows that contain data for the main features listed in Table 3 in order to avoid data loss.
Finally, the ADNI dataset includes attributes of String data type (e.g., PTGENDER, PTMARRY, etc.). Since the learning-based models require numeric inputs, we used a one-hot encoder to convert the non-numeric values into numeric values. After applying all the data cleaning tasks mentioned above, the data used by our learning-based models includes 1000 instances with 22 features, where 521 patients are cognitively normal and 479 patients have Alzheimer's.
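The row-filtering rule described above can be illustrated with a minimal sketch. It assumes records are held as plain dictionaries, uses only four of the attribute names mentioned in the text as a stand-in for the 22 features of Table 3, and runs on made-up toy values; the helper names `is_missing` and `clean` are hypothetical and are not taken from the original implementation.

```python
# Assumed subset of the main features (the full list of 22 is in Table 3).
MAIN_FEATURES = ["AGE", "PTGENDER", "PTEDUCAT", "APOE4"]

# Class codes as described in the text.
DX_LABELS = {
    1.0: "Cognitively Normal",
    2.0: "Mild Cognitive Impairment",
    3.0: "Alzheimer's Disease",
}


def is_missing(value):
    """Treat None, blank strings, and NaN as missing values."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    return isinstance(value, float) and value != value  # NaN is not equal to itself


def clean(records, required):
    """Keep only records whose required features and DXCHANGE are all present."""
    columns = list(required) + ["DXCHANGE"]
    return [r for r in records if not any(is_missing(r.get(c)) for c in columns)]


# Toy records with invented values; "OTHER" stands for any non-essential attribute.
records = [
    {"AGE": 74.3, "PTGENDER": "Male", "PTEDUCAT": 16, "APOE4": 0, "DXCHANGE": 1.0, "OTHER": None},
    {"AGE": None, "PTGENDER": "Female", "PTEDUCAT": 12, "APOE4": 1, "DXCHANGE": 3.0, "OTHER": 2.5},
    {"AGE": 81.0, "PTGENDER": "Female", "PTEDUCAT": 18, "APOE4": 1, "DXCHANGE": 3.0, "OTHER": None},
]

cleaned = clean(records, MAIN_FEATURES)
labels = [DX_LABELS[r["DXCHANGE"]] for r in cleaned]
```

In this sketch a record is dropped only when one of the main features is missing, so a gap in a non-essential attribute does not cost a whole examination.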
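As an illustration of the encoding step, the following pure-Python sketch shows what one-hot encoding does to String attributes such as PTGENDER and PTMARRY. The category values ("Male", "Married", and so on) and the function name `one_hot_encode` are assumptions made for the example; a library encoder would typically be used in practice, and the original implementation is not specified here.

```python
def one_hot_encode(records, column):
    """Replace a string column with one 0/1 indicator column per observed category."""
    categories = sorted({r[column] for r in records})
    encoded = []
    for r in records:
        # Copy every attribute except the one being encoded.
        row = {key: value for key, value in r.items() if key != column}
        for category in categories:
            row[f"{column}_{category}"] = 1 if r[column] == category else 0
        encoded.append(row)
    return encoded


# Toy rows with invented category values.
rows = [
    {"AGE": 74.3, "PTGENDER": "Male", "PTMARRY": "Married"},
    {"AGE": 81.0, "PTGENDER": "Female", "PTMARRY": "Widowed"},
]

for column in ("PTGENDER", "PTMARRY"):
    rows = one_hot_encode(rows, column)
```

Each String attribute becomes as many numeric columns as it has distinct values, which is why the encoded data can be passed directly to models that accept only numeric inputs.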

Correlation Matrix
To understand the relationships between the different features in our dataset, we used the correlation matrix. This matrix highlights the most correlated attributes in the ADNI dataset. Figure 2 shows the correlation matrix for the 22 selected features from the ADNI dataset. As shown in this figure, a few features are highly correlated with each other (> 0.85). For instance, FAQ is correlated with CDRSB at 0.9, and ADAS13 is correlated with ADAS11 at 0.98. However, we decided to keep those features since the total number of features is restricted to 22.
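A minimal sketch of how highly correlated feature pairs can be identified is given below. It assumes the Pearson correlation coefficient (the text does not state which coefficient was used), applies the 0.85 threshold to the absolute value, and runs on invented toy columns rather than ADNI data; the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def highly_correlated(columns, threshold=0.85):
    """Return (feature_a, feature_b, r) for every pair with |r| above the threshold."""
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, round(r, 2)))
    return pairs


# Toy columns with invented values; only the feature names come from the text.
columns = {
    "ADAS11": [5.0, 10.0, 15.0, 20.0, 25.0],
    "ADAS13": [8.0, 16.0, 22.0, 31.0, 38.0],
    "AGE": [70.0, 82.0, 65.0, 77.0, 71.0],
}

pairs = highly_correlated(columns)
```

On these toy values only the ADAS11 and ADAS13 pair exceeds the threshold, mirroring the kind of pair reported in the correlation matrix.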

Experiments And Discussion
This section presents and discusses the results of the experiments performed using the ADNI and OASIS datasets. We compare the results provided by the common machine learning models.

Experimental Results
We performed several experiments with different parameters. After the pre-processing phase, the conversion of all variables into numerical features, and the selection of the pertinent features to be used by the machine learning models, we split the data into training and test sets using 5-fold cross-validation. We then evaluate the ability of machine learning models to detect Alzheimer's disease using the accuracy, precision, recall, and F-measure metrics, given respectively by Eq. 1, Eq. 2, Eq. 3, and Eq. 4. The references and labels used in the evaluation are given in Table 4. In Table 5, we compare the results given by the machine learning models using the ADNI dataset. As shown in this table, the best accuracy values were provided by the logistic regression and the support vector machine models with 99.43% and 99.10%, respectively. The lowest accuracy value was given by the naïve Bayes classifier with 87.07%.
The same learning-based models were also used with the OASIS dataset. In Table 6, we compare the results provided by the selected machine learning models using the OASIS dataset. As illustrated in this table, the best accuracy values were provided by the logistic regression classifier and random forest with 84.33% and 83.92%, respectively, whereas the lowest accuracy value was given by the naïve Bayes classifier with 71.91%.
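The evaluation protocol can be sketched as follows, under stated assumptions: the folds are built by shuffling the sample indices with a fixed seed (whether the original split was shuffled or stratified is not stated), and the four metrics follow their standard definitions from the confusion-matrix counts for a binary problem (cognitively normal versus Alzheimer's). The function names and the toy label vectors are illustrative only.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F-measure from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# 1000 instances, as in the cleaned ADNI data, split into 5 folds.
splits = list(k_fold_indices(1000, k=5))

# Invented labels and predictions, only to exercise the metric definitions.
scores = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
```

With 5 folds over 1000 instances, each model is trained on 800 instances and tested on the remaining 200, and the per-fold metrics can then be averaged.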

Comparative Evaluation
In this section, we compare the results given in this paper to those provided in the related work section. We selected the research studies that used the same ADNI database. Table 7 provides the comparison between our results and the state-of-the-art models. As illustrated in this table, we obtained better results than the state-of-the-art models. For instance, compared to the results obtained by (Alickovic & Subasi, 2020), who used random forest, k-nearest neighbours, and support vector machine, we achieve better performance for the k-nearest neighbours with 97.55% compared to 84.27%, for the support vector machine with 99.10% compared to 83.15%, and for the random forest classifier with 99.43% compared to 85.77%. Finally, we compared our results with those given by (Shahbaz et al., 2019), who used random forest, k-nearest neighbours, and decision tree. We achieved better performance for the k-nearest neighbours with 97.55% compared to 43.26%, for the random forest with 98.89% compared to 69.69%, and for the decision tree with 97.53% compared to 74.22%.

Conclusion
Dementia, and Alzheimer's disease in particular, has an important impact on healthcare and society. Early detection of this disease is recommended to slow down the progression of symptoms and avoid brain damage. Hence, detecting the disease early can help people with Alzheimer's, as well as their family members, to have a healthy life. In the work presented here, we proposed to use machine learning models for Alzheimer's disease detection. The evaluation of the classification models was performed using the ADNI and OASIS datasets. The experimental results showed that the best accuracy values provided by the machine learning models are 99.43% and 99.10%, given respectively by logistic regression and support vector machine on the ADNI dataset, whereas on the OASIS dataset we obtained 84.33% and 83.92%, given respectively by logistic regression and random forest.
Further research could focus on other diseases. It would also be interesting to restrict the input to MRI scans, without any handcrafted features.

Declarations
Funding: Not applicable
Figure: Visualization of the ADNI dataset according to DXCHANGE, PTGENDER, PTEDUCAT, and AGE attributes