What Do Machines Tell Us About Dementia? Machine Learning Applied to Aging, Dementia and Traumatic Brain Injury Study

Dementia, a syndrome characterized by the progressive deterioration of memory and cognition, arises from different pathologies, with Alzheimer's Disease (AD) as its most common cause. Patterns of gene expression in dementia of different etiologies may function as generalist biomarkers of the condition. We used RNA-Seq data from the Allen Dementia and Traumatic Brain Injury Study (ADTBI) to identify differentially expressed genes in brains with dementia. The Machine Learning algorithms Decision Tree (DT) and Random Forest (RF) were used to create models that identify dementia samples based on their gene expression profile. Importance analyses were conducted to identify the most relevant genes in each classification model. A total of 1629 differentially expressed (DE) genes were found in brains with the condition. Gene PAN3-AS1 was the only DE gene shared by three brain regions. The artificial intelligence models correctly identified up to 92.85% of dementia samples. Our analyses provide insights into the use of brain-specific gene expression profiles as biomarkers of dementia, identify genes possibly involved in dementia, and can guide future studies on the prediction and early identification of the syndrome.


Introduction
Dementia is one of the most challenging issues in aging populations. Its prevalence is between 5% and 7% of the elderly, affecting around 50 million people worldwide (Prince et al, 2013). Dementia is a clinical syndrome with different etiologies, including complex, multigenic, and even epigenetic causes.
Alzheimer's Disease (AD), vascular dementia, Lewy body disease, frontotemporal dementia, and Creutzfeldt-Jakob disease are among its most common causes. Traumatic brain injury and alcohol consumption are also frequent causes of the condition, especially in younger people.
Machine Learning (ML) computes probabilities and classifies data based on statistics. This method has proved invaluable for disease classification based on images, with growing precision and predictive power (Rizk-Jackson et al, 2011; Liu et al, 2015). ML is currently used to aid the diagnosis of heart and liver diseases, diabetes, dengue, and many others (Fatima and Pasha, 2017). RNA-Seq is a high-throughput technology capable of assessing transcriptomic profiles (Courtney et al, 2010). The combination of RNA-Seq with ML models can help identify potential biomarkers of diseases and assess the expression of these genes to predict disease development (Singireddy et al, 2015).
Changes in gene expression caused by neurodegenerative diseases can highlight the pathogenic mechanisms (Arneson et al, 2018). Identifying gene expression patterns related to these pathologies may help predict developing diseases early. To further understand the brain molecular networks involved in memory and dementia, we investigated the transcriptome patterns of healthy and dementia RNA-Seq data from the Aging, Dementia, and Traumatic Brain Injury Study (ADTBI, Miller et al, 2017). We used the set of differentially expressed (DE) genes from dementia samples to build ML models capable of predicting the condition. We identified genes that are DE in dementia and that can be used as biomarkers of the condition.

Data Acquisition
RNA-Seq data of human brains with and without dementia were extracted from the ADTBI (aging.brainmap.org). This database contains data on 50282 transcripts from 107 brain transcriptomes. Of these samples, 30 were from AD brains, 21 from brains with dementia of other etiologies (vascular, multiple etiologies, others, and unknown causes), and 56 from non-dementia brains. The parietal cortex (PC), temporal cortex (TC), forebrain white matter (FWM), and hippocampus are the four areas sampled in the ADTBI, and all four were included in this analysis.

Data Cleansing and Treatment
The R language was used to clean and treat the datasets. Data from three files (Donor information, Gene Expression Matrix, and Column Samples) were merged to create the final dataset. We used the column DSM IV (clinical diagnosis) from the donor information table to classify samples. We labeled all sources of dementia simply as "Dementia", and the remaining samples as "No Dementia". Non-normalized Fragments per Kilobase of sequence per million mapped reads (FPKM) data were loaded, and the DESeq2 library (Love, 2017) was used to extract differential expression data for the genes. The Python module Pandas was used to further filter the DE genes by false discovery rate (FDR) adjusted p-values, considering as significantly DE only those genes with an adjusted p-value below 0.05.
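The labeling and filtering steps described above can be sketched as follows. This is a minimal illustration in plain Python rather than the Pandas code used in the study; the row layout, the gene names, the numeric values, and the helper names `label_sample` and `filter_de_genes` are all assumptions made for the example. In Pandas, the same filter would typically be a boolean mask on the adjusted p-value column.

```python
# Hypothetical DESeq2-style result rows (gene, log2 fold change, adjusted p-value).
# All values are made up for illustration and are not taken from the study.
results = [
    {"gene": "GENE_A", "log2FoldChange": 2.1, "padj": 0.003},
    {"gene": "GENE_B", "log2FoldChange": -1.8, "padj": 0.04},
    {"gene": "GENE_C", "log2FoldChange": 0.4, "padj": 0.20},
    {"gene": "GENE_D", "log2FoldChange": 1.2, "padj": None},  # missing adjusted p-value
]

FDR_THRESHOLD = 0.05  # significance cutoff stated in the text


def label_sample(dsm_iv_diagnosis):
    """Collapse every dementia etiology into one class.

    Assumption: non-dementia donors carry the literal value "No Dementia"
    in the DSM IV column; any other value is treated as dementia.
    """
    return "No Dementia" if dsm_iv_diagnosis == "No Dementia" else "Dementia"


def filter_de_genes(rows, threshold=FDR_THRESHOLD):
    """Keep rows whose FDR-adjusted p-value is present and below the threshold."""
    return [r for r in rows if r["padj"] is not None and r["padj"] < threshold]


significant = filter_de_genes(results)
print([r["gene"] for r in significant])
```

Rows with a missing adjusted p-value are dropped here rather than treated as significant, which is one reasonable choice and not necessarily the one made in the original scripts.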

Disease modeling
To check whether the dementia samples would cluster together, we drew a heatmap of the 1000 genes with the lowest false discovery rate-adjusted p-values for each data stratification.
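Selecting the heatmap input amounts to ranking genes by adjusted p-value and keeping the first 1000. A minimal sketch of that selection, assuming a hypothetical mapping from gene name to adjusted p-value (the names and values below are illustrative only):

```python
import heapq


def lowest_padj_genes(padj_by_gene, n=1000):
    """Return the names of the n genes with the smallest adjusted p-values.

    Genes without an adjusted p-value (None) are ignored.
    """
    scored = [(padj, gene) for gene, padj in padj_by_gene.items() if padj is not None]
    return [gene for padj, gene in heapq.nsmallest(n, scored)]


# Illustrative values, not study data.
example = {"GENE_A": 0.003, "GENE_B": 0.04, "GENE_C": 0.20, "GENE_D": None}
top_two = lowest_padj_genes(example, n=2)
print(top_two)
```

The expression values of the selected genes would then be passed to whichever heatmap routine is used for plotting.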
Transcriptome patterns of DE genes were used as input for the classification models Decision Tree (DT), using the R library rpart (Therneau and Atkinson, 2019), and Random Forest (RF), using the R library randomForest (Liaw and Wiener, 2002).
The first model considered all DE genes as classification features. Feature selection then kept all genes whose attribute importance was greater than 3, as defined by the random.forest.importance method from the randomForest library. This step was done for each brain region, and new RF models were created using only these selected genes.
For the RF models, the number of trees (ntree) and the number of variables tried at each split (mtry) were adjusted based on the out-of-bag error, and the proximity was calculated and visualized using multidimensional scaling.

Model metrics analysis
For each model, a confusion matrix was calculated, containing the total number of true positives (correctly identified dementia samples), true negatives (correctly identified non-dementia samples), false positives (non-dementia samples classified as dementia), and false negatives (dementia samples classified as non-dementia). From these confusion matrices, the accuracy, precision, recall, and F1 score were calculated for each model. Accuracy is the ratio of correctly predicted observations (true positives + true negatives) to the total number of observations. Precision is the ratio of true positives to all samples predicted as positive (true positives and false positives); high precision corresponds to a low false-positive rate. Recall, also known as sensitivity, is the ratio of true positives to all actual positives (true positives and false negatives). Finally, the F1 score summarizes the test's accuracy as the harmonic mean of precision and recall. All metrics were added to a single comparison data frame.
Protein interaction networks were built from the STRING database using the DE gene list of each stratification as input. First, each gene was mapped to a non-ambiguous identifier using the "get_string_ids" method, returning only the best match for each gene name. The network images were generated using the "network" method, with nodes connected based on confidence, with a score of at least 750, and without secondary nodes. Finally, the method "function_annotation" was used for the functional enrichment of the networks, considering all annotated functions shared by at least 5 nodes.
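The confusion-matrix metrics defined in the model metrics analysis can be written out as a short sketch. The function name `classification_metrics` and the example counts are assumptions chosen for illustration, not values from the study; the F1 score is computed with its usual definition, the harmonic mean of precision and recall.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    tp: dementia samples classified as dementia
    tn: non-dementia samples classified as non-dementia
    fp: non-dementia samples classified as dementia
    fn: dementia samples classified as non-dementia
    Ratios with an empty denominator are reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only.
metrics = classification_metrics(tp=8, tn=9, fp=1, fn=2)
print(metrics)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model with high precision but low recall still receives a low F1 score, which makes F1 useful when false negatives matter as much as false positives.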

Data Clustering
Our data clustering analysis revealed that, when taking the 1000 genes with the lowest adjusted p-values, the dementia samples grouped closer to each other than to any non-dementia sample (Fig. 1). This clustering suggests that the models can find patterns and correctly classify the samples.

Differential Expression of all samples
We found differential expression patterns in brains with dementia, with a total of 1629 DE genes relative to non-dementia samples. A total of 1550 genes were overexpressed across the brain samples, while only 79 were underexpressed. Of these, 382 DE genes were found in the TC, 377 in the PC, 376 in the FWM, and 502 in the hippocampus. Figure 2 shows the DE genes in the FWM as a volcano plot, as an example. Only genes with false discovery rate-adjusted p-values below 0.05 were considered significantly DE (green and gray in the figure).
A Venn diagram was drawn to identify DE genes shared between brain regions (Fig. 3). This revealed a total of 24 DE genes shared by at least 2 brain regions. Except for the underexpressed genes TNFRSF21 in the PC (-3.37-fold) and OR4C5 in the FWM (-3.44-fold), all of these genes were overexpressed by at least 2.74-fold in the dementia samples. Gene PAN3-AS1 was the only DE gene shared by three brain areas (hippocampus, FWM, and PC), and was overexpressed by over 3-fold.
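The overlap counts read from the Venn diagram correspond to counting, for each gene, the number of regions in which it is DE. A minimal sketch of that count, assuming hypothetical per-region gene sets (the gene names and the region key `HIP` are placeholders, not study data):

```python
from collections import Counter

# Hypothetical DE gene sets per brain region; placeholders only.
de_genes_by_region = {
    "TC": {"GENE_A", "GENE_B", "GENE_C"},
    "PC": {"GENE_A", "GENE_D"},
    "FWM": {"GENE_A", "GENE_E"},
    "HIP": {"GENE_B", "GENE_F"},
}


def genes_shared_by(regions_to_genes, min_regions=2):
    """Map each gene found in at least min_regions regions to its region count."""
    counts = Counter(
        gene for genes in regions_to_genes.values() for gene in genes
    )
    return {gene: n for gene, n in counts.items() if n >= min_regions}


shared = genes_shared_by(de_genes_by_region)
print(sorted(shared.items()))
```

Raising `min_regions` to 3 isolates genes shared by three regions, which is the kind of query that singles out a gene such as PAN3-AS1 in the text.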

Machine Learning Models
All DT-based models had a precision over 80%. The Hippocampus model had 100% precision and the lowest false-negative rate (Fig. 4A). The Whole-Brain model had a precision of 93.10% (Fig. 4B). The TC model had the highest F1 score among all models (80%) (Fig. 4C).
Overall, the RF models did not score as well as the DT models. The Hippocampus model created with feature selection had the highest recall (sensitivity) among all models (77.27%), with the fewest true dementia samples mislabeled. These comparisons are presented in Table 1.

Feature selection Genes
The attribute importance for feature selection revealed several genes whose patterns could be particularly useful in the classification of dementia samples. Models for each brain region had different numbers of selected features. The TC had 9 chosen genes (HDAC11, CNGB3, TRIM56, TBC1D13, PDLIM7, PPP1CB, ADGRB3, GET4, and TYK2, in ascending order of importance), and the FWM had 7 chosen genes (ARL10, OBSCN, STX7, VAPA, CRHR1, SAR1AP3, and MARCH2). The hippocampus and the PC both had 3 genes selected based on the attribute importance: TMEM106B, PCCB, and ZNF358 for the PC, and FPGT.TNNI3K, CRB2, and ST6GALNAC2 for the hippocampus.

Protein network analyses
All DE gene lists yielded protein functional interaction networks from the STRING database. The most connected network was created from the hippocampus DE genes, with 45 nodes directly connected to the central UBA52 gene and 7 adjacent protein clusters (Fig. 5).

Functional enrichment
The most-reported Gene Ontology process in all sample stratifications was Cellular Process, shared by around 78% of DE genes. Biological Regulation was the second most reported process, shared by 60% of the genes. In Reactome pathways, Signal Transduction was the most frequent process, followed by Immune System, Metabolism, and Protein Metabolism (present in 19%, 12%, 12%, and 10% of the genes, respectively). Specific pathways differ in each brain region, with relatively fewer participating genes in each. 'Asparagine-bound N-glycosylation', 'Signal Transduction Diseases', and 'MAPK Family Signaling Cascades' are among the pathways of the Hippocampus DE genes. The complete list of functionally enriched DE genes is available in the supplementary material (Supplementary 1-8).

Discussion
The continuous growth of elderly human populations makes understanding, preventing, and treating age-related conditions a priority in research and public health policy. This study attempted to assess the neuroanatomical expression profile of dementia shared across multiple etiologies. Our most efficient models were able to correctly identify up to 92.85% of dementia samples. These models indicate that there are differential expression patterns in brains with some type of dementia. While these models might not be used as dementia prediction tools, they contain important information about genes whose differential expression profile correlates with dementia. This is consistent with recent comparative transcriptome studies of AD, Frontotemporal Dementia, and Huntington's Disease (Annese et al, 2018; Stopa et al, 2018). The DE genes highlighted by these models point to potential biomarkers for the condition.
ML models can predict dementia in RNA-Seq samples. This indicates that brain-specific gene expression profiles can be used as disease biomarkers. Changes in the TC, PC, and FWM are sufficient to be modeled and used to compare and classify dementia. Genes resulting from the importance analysis are potential candidates for the development of a gene expression panel that could work as a type of dementia indicator.
Our results show that differences in transcriptome patterns between healthy and dementia brains were most significant in the Hippocampus. This was expected, since there is strong evidence of hippocampus involvement in the origin and development of dementia. Hippocampus volume loss and shape change are common features used to distinguish healthy brains from those with Alzheimer's Disease dementia (Wang et al, 2003).
This phenotype can also be a potential predictor of future dementia (Apostolova et al, 2010). In addition to the atrophy, molecular changes in the dementia hippocampus are also reported. The presence of α-synuclein and of ubiquitin-immunoreactive inclusions, and reduced choline acetyltransferase, are molecular aspects of the hippocampus in dementia of different etiologies (Galvin et al, 1999).
Most DE genes are involved in general cellular processes, such as metabolism and protein regulation. Proteins involved in the immune response are also overexpressed in this condition, pointing to the role of the innate immune system in neurodegeneration. This hypothesis is reported by Richards and colleagues (2016), who state that neurodegenerative disease is the result of chronic activation of an innate surveillance pathway. DE genes shared by the brain regions during dementia are linked to collagen metabolism, tyrosine kinases, adaptive immunity, protein digestion and absorption, and osteoclast differentiation (Sup. Table 1). Genes SYK and TYK2 participate in most of these molecular functions and are differentially expressed in the Temporal Cortex and Parietal Cortex, and in the Temporal Cortex and Hippocampus, respectively. Indeed, SYK appears to be directly involved in the susceptibility to vascular dementia (Kim, Kong, and Lee, 2013). In addition, TYK2 participates in the TYK2/STAT3 signaling pathway that mediates β-amyloid-induced neuronal cell death (Wan et al, 2010). In the Decision Tree classification model, overexpression of TYK2 (over 7-fold) was the most important feature distinguishing dementia from non-dementia brains in our study.
Other DE genes shared by more than one region have been reported in the context of dementia. TNFRSF21, also known as death receptor 6, is differentially expressed in the Parietal Cortex and Temporal Cortex in dementia brains and is involved in neuronal apoptosis by binding to β-amyloid precursor protein and recruiting Caspase 6 (Nikolaev et al, 2009). Gene TIMM23B encodes an inner mitochondrial membrane protein, which is also associated with App plaques in AD (Heinemeyer et al, 2019).
RNA genes and pseudogenes were also important features in distinguishing dementia brains in our models. Little to no physiological information is available for these yet. Among these genes, some have an association with neurodegenerative conditions. RPS27AP4, a pseudogene differentially expressed in both the Temporal Cortex and Hippocampus, is possibly involved with schizophrenia (Xu et al, 2013). Lastly, the MIR181C gene is DE in both the Parietal Cortex and Hippocampus, and is involved with microglia-mediated apoptosis, increased amyloid-beta plasma levels with aging, and cognitive impairment (Fang et al, 2017; Crespo, Atienza and Cantero, 2019).
Although the deep brain tissues used in this study are not accessible in exploratory examinations, this effort was important to reveal several transcriptomic aspects of the syndrome. The predictive power of these models reveals key molecules that may be involved in the development and progress of the condition. Furthermore, we demonstrated that ML algorithms can be used to classify disease samples based on transcriptome data. Future studies in the field may allow the development of early detection tools for dementia, promoting healthier aging.

Declarations Ethical Statement
Ethics approval and consent to participate
Not Applicable

Consent for publication
Not Applicable

Availability of data and materials
All data used in the production of this paper are publicly available at the Allen Institute's Aging, Dementia and Traumatic Brain Injury Study website.

Code Availability
All data and scripts underlying this article are available in GitHub repository rnaSeqDementia at https://github.com/mouradap/rnaSeqDementia.

Competing interests
Conflict of Interest
The authors declare that they have no conflict of interest.

Authors' contributions
Denis Moura was responsible for all the data operations and analytics in this paper, and João Oliveira contributed to the ideation and support throughout the entire project.

Tables
Due to technical limitations, Table 1 is only available as a download in the Supplemental Files section.
Decision tree based dementia classification models

Supplementary Files
This is a list of supplementary files associated with this preprint.