Data Acquisition
RNA-Seq data from human brains with and without dementia were obtained from the Aging, Dementia and TBI Study (ADTBI; aging.brain-map.org). This database contains expression data for 50,282 transcripts across 107 brain transcriptomes. Of the dementia samples, 30 were from AD and 21 from other etiologies (vascular, multiple etiologies, others, and unknown causes); the remaining 56 samples were from non-dementia brains. The four areas sampled in the ADTBI, namely the parietal cortex (PC), temporal cortex (TC), forebrain white matter (FWM), and hippocampus, were all included in this analysis.
Data Cleansing and Treatment
The R language was used to clean and process the datasets. Three files (Donor Information, Gene Expression Matrix, and Column Samples) were merged to create the final dataset. Samples were classified using the DSM IV (clinical diagnosis) column of the donor information table: all dementia etiologies were grouped as "Dementia", and the remaining samples were labeled "No Dementia". Non-normalized fragments per kilobase of sequence per million mapped reads (FPKM) data were loaded, and the DESeq2 library (Love, 2017) was used to extract differential expression (DE) data for the genes. The Python module Pandas was then used to filter the DE genes by false discovery rate (FDR)-adjusted p-value; only genes with an adjusted p-value below 0.05 were considered significantly DE.
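The labeling and filtering steps can be illustrated with a minimal Pandas sketch. This is not the authors' script: the donor column name, the diagnosis strings, and the gene names are hypothetical placeholders, and the results table is assumed to be a DESeq2 export in which the adjusted p-value column is named padj.

```python
import pandas as pd

# Hypothetical donor table; column name and diagnosis strings are illustrative.
donors = pd.DataFrame({
    "donor_id": [1, 2, 3, 4],
    "dsm_iv_clinical_diagnosis": [
        "Alzheimer's Disease Type",
        "No Dementia",
        "Vascular",
        "No Dementia",
    ],
})

# Collapse every dementia etiology into a single "Dementia" class:
# rows equal to "No Dementia" are kept, all others are replaced.
diagnosis = donors["dsm_iv_clinical_diagnosis"]
donors["status"] = diagnosis.where(diagnosis == "No Dementia", "Dementia")

# Hypothetical DESeq2 results table (made-up values).
results = pd.DataFrame({
    "gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "log2FoldChange": [1.8, -0.2, -1.1, 0.4],
    "padj": [0.001, 0.40, 0.049, None],
})


def significant_genes(table, alpha=0.05):
    """Return rows whose adjusted p-value is strictly below alpha."""
    # Comparisons against NaN are False, so genes without padj are dropped.
    return table[table["padj"] < alpha].reset_index(drop=True)


de_genes = significant_genes(results)
```

Genes for which no adjusted p-value is available are excluded by the comparison itself, so no separate missing-value step is needed in this sketch.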
Disease Modeling
To check whether the dementia samples clustered together, we drew a heatmap of the 1000 genes with the lowest FDR-adjusted p-values for each data stratification.
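Selecting the heatmap genes amounts to a top-N query on the adjusted p-values. A minimal sketch, assuming the p-values are held in a plain dictionary keyed by gene name (the function name and the example values are hypothetical):

```python
import heapq


def top_n_by_padj(padj_by_gene, n=1000):
    """Return the n gene names with the smallest adjusted p-values."""
    # Skip genes that have no adjusted p-value.
    scored = [(p, gene) for gene, p in padj_by_gene.items() if p is not None]
    # nsmallest sorts by p-value first, then by gene name to break ties.
    return [gene for _, gene in heapq.nsmallest(n, scored)]


# Made-up values for illustration only.
example = {"GENE_A": 0.001, "GENE_B": 0.40, "GENE_C": 0.049, "GENE_D": None}
top_two = top_n_by_padj(example, n=2)
```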
Transcriptome patterns of DE genes were used as input for two classification models: Decision Tree (DT), using the R library rpart (Therneau and Atkinson, 2019), and Random Forest (RF), using the R library randomForest (Liaw and Wiener, 2002).
The first model used all DE genes as classification features. Features were then selected by keeping every gene whose attribute importance was greater than 3, as defined by the random.forest.importance method from randomForest library. This selection was performed separately for each brain region, and new RF models were built using only the selected genes.
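The per-region selection rule can be sketched as a simple threshold filter. The importance scores below are made up, and the function name is hypothetical; in the analysis itself the scores come from the R importance computation described above.

```python
def select_features(importance_by_gene, threshold=3.0):
    """Keep genes whose importance score is strictly greater than threshold."""
    return sorted(
        gene for gene, score in importance_by_gene.items() if score > threshold
    )


# Hypothetical importance scores per brain region (values are made up).
importance = {
    "PC": {"GENE_A": 5.2, "GENE_B": 1.1, "GENE_C": 3.0},
    "TC": {"GENE_A": 2.4, "GENE_D": 7.9},
    "FWM": {"GENE_B": 3.5},
    "hippocampus": {"GENE_C": 0.2},
}

# One selected gene list per region; each list would feed a new RF model.
selected = {
    region: select_features(scores) for region, scores in importance.items()
}
```

A score of exactly 3 is excluded here because the text requires the importance to exceed the threshold.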
For the RF models, the number of trees (ntree) and the number of variables tried at each split (mtry) were tuned based on the out-of-bag (OOB) error, and sample proximities were calculated and visualized using multidimensional scaling.
Model Metrics Analysis
For each model, a confusion matrix was calculated, containing the total number of true positives (correctly identified dementia samples), true negatives (correctly identified non-dementia samples), false positives (non-dementia samples classified as dementia), and false negatives (dementia samples classified as non-dementia). From these confusion matrices, the accuracy, precision, recall, and F1 score were calculated for each model. Accuracy was calculated as the ratio of correctly predicted observations (true positives + true negatives) to the total number of observations. Precision was calculated as the ratio of true positives to all predicted positives (true positives + false positives); a high precision reflects few false positives. Recall, also known as sensitivity, is the ratio of true positives to all actual positives (true positives + false negatives). Finally, the F1 score summarizes the test's performance as the harmonic mean of precision and recall. All metrics were collected in a single comparison data frame.
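The four metrics follow directly from the confusion matrix counts. The sketch below is illustrative rather than the code used in the study: the function names and example labels are hypothetical, "Dementia" is treated as the positive class, and F1 is computed with the standard definition, the harmonic mean of precision and recall.

```python
def confusion_counts(actual, predicted, positive="Dementia"):
    """Count true positives, true negatives, false positives, false negatives."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels for eight samples.
D, N = "Dementia", "No Dementia"
actual = [D, D, D, N, N, N, N, D]
predicted = [D, D, N, N, N, D, N, D]

counts = confusion_counts(actual, predicted)
metrics = classification_metrics(*counts)
```

Each ratio falls back to 0.0 when its denominator is zero, so a model that never predicts the positive class does not raise a division error.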
Data Visualization
The R library EnhancedVolcano (Blighe, Rana, and Lewis, 2020) was used to draw volcano plots showing the DE genes in each data stratification. The R library VennDiagram (Chen, 2018) was used to plot a Venn diagram of the four datasets, showing the DE genes shared between brain regions. The R library ggplot2 was used to create and visualize the DT trees, the out-of-bag errors, and the multidimensional scaling plots. The R library volcanoplot was used to show the significant DE genes.
Functional Enrichment and Protein Network Analysis
The STRING (Szklarczyk et al, 2019) database API was queried using the lists of DE genes for each stratification as input. First, each gene was mapped to a non-ambiguous identifier using the "get_string_ids" method, returning only the best match for each gene name. The network images were generated using the "network" method, with nodes connected based on confidence, a minimum score of 750, and no secondary nodes. Finally, the "function_annotation" method was used for the functional enrichment of the networks, considering all annotated functions shared by at least 5 nodes based on Gene Ontology (The Gene Ontology Consortium, 2019), Reactome (Jassal et al, 2020) or SMART (Letunic and Bork, 2017). The script used to access the API can be found in the author's repository (github.com/mouradap/StringAPI).
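As an illustration of the first two requests, the sketch below only builds the request URLs and does not contact the server. It is not the script from the repository. The URL layout and the parameter names (identifiers, species, limit, required_score, network_flavor, add_white_nodes) reflect the public STRING API as we understand it and should be checked against the current STRING documentation; mapping "no secondary nodes" to add_white_nodes=0 is an assumption, and the helper name and gene list are hypothetical.

```python
from urllib.parse import urlencode

STRING_API = "https://string-db.org/api"


def build_string_url(output_format, method, genes, species=9606, **extra):
    """Build a STRING API request URL (9606 is the NCBI taxon ID for human)."""
    # STRING expects multiple identifiers separated by carriage returns,
    # which urlencode turns into %0D.
    params = {"identifiers": "\r".join(genes), "species": species}
    params.update(extra)
    return f"{STRING_API}/{output_format}/{method}?{urlencode(params)}"


# Hypothetical input list; the study used the DE genes of each stratification.
genes = ["APP", "MAPT"]

# Step 1: map gene names to STRING identifiers, best match only (limit=1).
mapping_url = build_string_url("tsv", "get_string_ids", genes, limit=1)

# Step 2: network image, confidence-based edges, score >= 750,
# no additional interactor nodes (assumed meaning of "no secondary nodes").
network_url = build_string_url(
    "image",
    "network",
    genes,
    required_score=750,
    network_flavor="confidence",
    add_white_nodes=0,
)
```

The resulting URLs can be passed to any HTTP client; keeping URL construction separate from the request makes the parameters easy to inspect before the server is queried.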