A person diagnosed with Alzheimer's disease (AD) develops amyloid plaques and tau neurofibrillary tangles in the brain, and connections between neurons are lost. The hippocampus is the likely place where the problem starts, but as neurons die, other parts of the brain are subsequently affected. In the initial stages this leads to short-term memory loss. Next, other cognitive faculties decline, followed by behavioral issues. Broadly, three stages are defined for the progression of the disease: cognitively normal (CN), mild cognitive impairment (MCI) and AD. “(MCI) is attractive because it represents a transitional state between normal aging and dementia” [1]. 50–75% of people aged 65 years or above are generally prone to AD. As life expectancy increases, the number of AD patients is also increasing all over the world.
No major cure for AD has been established to date [2]. However, the clinical application of MRI (magnetic resonance imaging) and PET (positron emission tomography) has led to tremendous progress in brain analysis and in understanding brain function and its changes in the MCI and AD stages [3, 4].
Other approaches, such as genomic data analysis, can also be beneficial for AD diagnosis. Genetic analysis can predict an individual's AD risk much earlier than the clinical symptoms of AD appear. Genetic factors play a major role in 80% of AD cases [5]. Genome-wide association studies (GWAS) have discovered some AD candidate genes. However, GWAS have largely failed to produce reliable AD candidate genes: thousands of genes are considered potential AD risk factors [6, 7]. Moreover, GWAS only discover genes that are associated with some phenotype and do not address the functionality of the genes that cause AD [8]. Gene expression makes it possible to analyse biochemical pathways, regulatory mechanisms and cellular functions to find the key AD and MCI genes. Some research has utilised gene expression values from brain tissue obtained from biopsy or autopsy samples [9, 10]. However, various difficulties are involved in analysing such samples. Brain dynamics and changes are also expressed in blood, and a large portion of the body's gene expression is also found in PBMCs (peripheral blood mononuclear cells) [11]. Amyloid precursor protein expression and oxidative damage to the RNA and DNA of AD brain tissue are reflected in peripheral blood as well [12]. Blood gene expression is therefore gaining attention as an appropriate method to diagnose AD and MCI [13, 14].
With biomarker datasets available in the public domain, machine learning (ML) is becoming a major support in diagnosing a disease and its stage, and it is now widely used as more AD datasets become available. The clinical biomarker AD dataset is categorized into: brain structural integrity measurements from MRI ROIs; primary cognitive tests; cell metabolism measurements from FDG PET ROI averages; amyloid-beta load in the brain from AV45 PET ROI averages; biomarkers for measuring tau load in the brain; axon-related microstructural parameters from DTI ROIs; CSF biomarkers for measuring tau and amyloid levels in cerebrospinal fluid; and others such as demographic information and APOE status with the count of APOE4 alleles. The APOE4 allele increases the risk of late-onset Alzheimer's disease [5].
1.1 Related Work
Much of the earlier research on AD diagnosis applied ML techniques to clinical data. The primary issues in that research were the lack of sufficient and authentic data samples and the low accuracy achieved [20]. For the other data modality, some important research on AD with gene expression data has already been done using ML techniques [9, 13–18]. We reviewed several papers on the use of gene expression data for Alzheimer's disease diagnosis; Table 1 lists some recent studies (2017–2022) on identifying AD from gene expression data.
Table 1 – Review of blood gene expression research in 2017–2022
Study | Data source | Feature selection | Classifier | Results |
Li et al. (2017) [15] | ANM1 and ANM2 | Student's t-test | Ref-REO | AUC: 0.733 (ANM2 test set); 0.775 (ANM1 test set) |
Li et al. (2018) [16] | ANM1 and ANM2 | LASSO regression | Majority voting of RF, SVM, RR | AUC: 0.866 (ANM2 test set); 0.864 (ANM1 test set) |
Lee H. et al. (2020) [9] | ANM1, ANM2, ADNI | VAE, TF genes | Binary classification with logistic regression (LR), L1-regularized LR (L1-LR), SVM, RF and DNN | AUC: 0.657 (ADNI); 0.874 (ANM1); 0.804 (ANM2) |
Park C. et al. (2020) [17] | Gene expression: GSE33000 and GSE44770; methylation data: GSE80970 | Integration of DEGs and DMPs by intersection | DNN (deep neural network) | Average accuracy: 0.823 |
Kalkan H. et al. (2022) [18] | GSE63060, GSE63061, GSE140829 | LASSO regression | CNN on transformed image representation | AUC: 0.875 (AD vs. CTL); 0.664 (MCI vs. AD); 0.619 (MCI vs. CTL) |
Datasets from NCBI and ADNI have been used in most of the research on AD diagnosis with statistical, machine learning and deep neural network methods. To highlight performance, researchers used only binary classification and mostly focussed on identifying AD, omitting diagnosis of the MCI stage, although MCI is an important stage [1]. The performance of some of the earlier research is promising, with AUC scores around 0.80 when balanced GEO datasets are used for model training and testing. However, results are not promising when the training and test data come from ADNI. In the work of Lee H. et al. [9] on ADNI data, the AUC for CN vs. AD binary classification was 0.657 under internal validation (ADNI data used for both training and test/validation), far lower than the other results. The primary reason for this low score is that ADNI gene expression data poses multiple challenges. It is ‘HDLSS’ (high-dimension, low-sample-size) in nature [19]: it has 49,386 gene probes / gene transcripts (features) but only 744 samples, and it is also imbalanced, as reflected in Table 2. So, when a model is constructed and validated on the same ADNI data, performance is low.
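To make the scale of the HDLSS and imbalance problem concrete, the following minimal Python sketch computes the feature-to-sample ratio from the figures quoted above and derives inverse-frequency class weights, one common way to compensate for imbalance. It is an illustration only, not the method used in this work: the per-class counts are hypothetical placeholders (the actual split is the one reported in Table 2), and the helper names are ours.

```python
from collections import Counter

N_FEATURES = 49_386  # gene probes / transcripts, as stated in the text
N_SAMPLES = 744      # ADNI gene expression samples, as stated in the text


def features_per_sample(n_features, n_samples):
    """Number of features available per training sample."""
    return n_features / n_samples


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_total = len(labels)
    n_classes = len(counts)
    return {cls: n_total / (n_classes * cnt) for cls, cnt in counts.items()}


# HYPOTHETICAL class split, chosen only so that it sums to 744 samples;
# it is NOT the real ADNI distribution (see Table 2 for that).
labels = ["CN"] * 250 + ["MCI"] * 380 + ["AD"] * 114
assert len(labels) == N_SAMPLES

ratio = features_per_sample(N_FEATURES, N_SAMPLES)
weights = balanced_class_weights(labels)

print(f"features per sample: {ratio:.1f}")
for cls in ("CN", "MCI", "AD"):
    print(f"weight[{cls}] = {weights[cls]:.3f}")
```

With roughly 66 features per sample, a classifier can easily fit noise, which is why feature selection or dimensionality reduction usually precedes model training on such data. The weighting rule shown is the same heuristic that scikit-learn applies for `class_weight="balanced"`; rarer classes receive proportionally larger weights.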
1.2 Key contribution
This work makes two key contributions. First, we achieve a multiclass ROC performance score on genomic data that is much better than the AUC of 0.657 obtained from binary classification with internal validation in the earlier research mentioned in Section 1.1. Second, we perform stage diagnosis from two different data modalities of ADNI participants: blood gene expression profiles and clinical data. We constructed two separate models and compared their performance.
When the gene expression profiles of ADNI subjects are used for model training and testing, the challenges mentioned earlier, such as HDLSS characteristics and data imbalance, appear. We applied appropriate techniques to handle these challenges and improved the performance score in diagnosing the stages CN, MCI and AD. Additionally, we identified new genes that contribute to (or protect against) AD. From the second modality, the ADNI clinical dataset, we found the most effective biomarkers for diagnosis and achieved the best known ‘F1 score’ and ‘ROC AUC’ for multiclass AD stage diagnosis. We analysed the results of both models built on the two modalities of ADNI subjects.
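A multiclass ROC AUC can be defined in more than one way, and the text above does not fix a particular variant. As an illustration, the sketch below assumes the macro-averaged one-vs-rest definition: each stage is scored against the other two, and the three binary AUCs are averaged. The binary AUC is computed as the fraction of positive-negative pairs that are ranked correctly, with ties counted as half. The predicted probabilities are made-up toy values, not results from this study, and the function names are ours.

```python
def binary_auc(scores, is_positive):
    """AUC as the probability that a positive sample outranks a negative one."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0   # correctly ranked pair
            elif sp == sn:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(y_true, y_prob, classes):
    """Average the one-vs-rest AUC over all classes (macro averaging)."""
    aucs = []
    for k, cls in enumerate(classes):
        scores = [row[k] for row in y_prob]       # predicted P(class k)
        is_pos = [label == cls for label in y_true]
        aucs.append(binary_auc(scores, is_pos))
    return sum(aucs) / len(aucs)


CLASSES = ["CN", "MCI", "AD"]

# TOY data for illustration only: true stages and predicted probabilities
# in the column order given by CLASSES.
y_true = ["CN", "CN", "MCI", "MCI", "AD", "AD"]
y_prob = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.3, 0.5, 0.2],
    [0.6, 0.3, 0.1],  # an MCI subject scored as most likely CN
    [0.1, 0.3, 0.6],
    [0.2, 0.3, 0.5],
]

score = macro_ovr_auc(y_true, y_prob, CLASSES)
print(f"macro one-vs-rest AUC: {score:.3f}")
```

Because macro averaging weights every stage equally regardless of its sample count, it does not let the majority class dominate the score, which matters for imbalanced data. A multiclass score of this kind is not directly comparable to a binary CN vs. AD AUC, since the two measure different tasks.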