Identifying and Ranking Potential Driver Genes of Alzheimer's Disease Using Weighted Co-Expression Network Analysis

Background: Alzheimer’s disease (AD) is one of the most common neurodegenerative diseases. Identiﬁcation of AD-related genes from transcriptomics provided new direction to the mechanism for ﬁnding potential targets for drug therapy. Methods: We mined gene co-expression network modules from diﬀerentially expressed genes (DEGs) of AD and normal samples in multiple datasets by weighted gene co-expression network analysis (WGCNA). A convergent functional genomic (CFG) method was used to prioritize potential driver genes. Results: The 7567 DEGs were enriched signiﬁcantly with 61 KEGG pathway and 242 GO terms. Then, the genes in 5 AD-speciﬁc modules obtained signiﬁcantly from DEGs were interconnected with well-known AD risk genes in common PPI network. Remarkably, compared to the number of Tau production-related genes, Aβ play a more critical role. Lastly, the 23 potential driver genes was prioritized by CFG method from 5 AD-speciﬁc modules. Conclusions: Identiﬁcation of AD-related genes could be useful for understanding pathophysiology of AD and looking for candidates drug targets.


Background
Alzheimer's disease (AD) is one of the most common neurodegenerative diseases, accounting for the majority of dementia patients [1]. AD is estimated to affect in 13.8 million individuals in the United States (US), with 7.0 million being aged 85 years or older by 2050 [2,3]. Currently, genetic factor are believed to be partially responsible for AD [4]. Genome-wide association studies (GWAS) have also revealed that some single nucleotide polymorphisms (SNPs) contribute to AD disease onset [5,6]. These include common variants such as amyloid protein precursor (APP), presenilin-1 (PSEN1), presenilin-2 (PSEN2) and apolipoprotein E (APOE). PSEN1, PSEN2 and APP genes are clear pathogenic genes of early-onset AD [7]. APOE, as the only identified risk gene for late-onset AD, can increase the rate of cognitive decline [8]. Different microRNAs (miRNAs) are also involved in the pathophysiology of AD [9]. For example, miRNA-377 promotes cell proliferation and inhibits cell apoptosis by regulating the expression level of cadherin 13 (CDH13), thus participating in the occurrence and development of AD [10]. Long non-coding RNAs (lncRNAs) have been widely reported to be associated with a variety of physiological and pathological processes, such as AD. Brain cytoplasmic RNA is a kind of lncRNA, and the overexpression of brain cytoplasmic may lead to synaptic/dendritic degeneration in AD [11]. Despite the fact that remarkable advances have been made in the understanding of the genetic basis of AD, there is no pharmaceutical or other intervention that can treat AD. Identification of AD-related genes from transcriptomics becomes an attractive strategy for finding potential targets for drug therapy. Gene expression profiling of transcriptomic datasets of AD and normal brain samples has identified potential genes and contributed to the search for potential targets [12]. Recently, most studies on AD transcriptome analysis pay attention to the identification of DEGs from different brain regionsand there are only few work studied highly correlated pairs of genes to obtain gain or loss of co-expression in AD [13]. Generally, describing the degree of association between two genes can be obtained by calculating correlation coefficients such as Pearson and Spearman between expression values. But the disadvantage of a fixed threshold approach is that the threshold is artificially defined and it ignores a lot of potential correlations. At the same time, this approach can also lose information about the trend of genetic changes, and it will be difficult to describe the strength of correlation in the network. In order to solve these problems, the idea of weighting is proposed. WGCNA is widely used to describe the association pattern between gene expression in microarrays or RNA-seq, and can be divided the gene co-expression network of complex biological processes into several highly related feature modules. The feature modules represent several groups of highly coordinated gene sets, associate the modules with specific clinical features to find genes that perform key functions, and help identify potential mechanisms involved in specific biological processes and explore candidate biomarkers [14,15]. In this paper, we aimed to identify potential driver genes for AD from DEGs using WGCNA and CFG methods based on multiple transcriptomics dataset. Firstly, the datasets was download from AlzData and ADNI database. Then genes with |logF C| > 0.1 and P < 0.05 were regarded as DEGs. We also identified AD-specific modules using WGCNA method. Lastly, we prioritized potential driver genes by the CFG method in Figure 1.

Materials and Methods
AD expression data collection and preprocessing All the original microarray data regarding AD were downloaded from Alz-Data(http://www.alzdata.org/)and Alzheimer's Disease Neuroimaging Initiative(http://adni.loni.usc.edu/). We have 467 controls and 309 AD from five dataset for subsequent analyses, including EC (39 vs 39), HP(67 vs 74), FC(128 vs 104), TC(39 vs 52) and ADNI(194 vs 40). Only ADNI had Aβ and Tau related data. Detailed information of each dataset is shown in Table-1. Of note, to solve data imbalanced in ADNI dataset, we randomly chose 40 samples from the control in 10 times.
AD data detection of DEGs and enrichment of biological process In order to identify potential genes of AD, differential expression analysis was conducted by R package edgeR and the Benjamini-Hochberg's method was used to correct for multiple comparisons [16]. For each dataset, t-test was used to identify DEGs with a cut-off of statistical significance P<0.05 and foldchange > 0.1. Functional enrichment of the DEGs was produced from Database for DAVID, which now provides a comprehensive set of functional annotation tools for investigators to understand biological meaning behind large list of genes. For obtained list of DEGs, DAVID is able to identify enriched biological themes, particularly KEGG pathway and GO terms.
Weighted gene co-expression network analysis We use WGCNA approach, which is good for distinguishing the strong and weak relationships, to divide the gene co-expression network of complex biological processes into several highly related feature modules. For genes i and j, the correlation coefficient is r ij , we define the correlation intensity :a ij = r β ij , which depends on the choice of power β. The β value is determined by the lowest value close to the scale-free network. According to this idea, the correlation strength of expression levels among all genes was calculated, and the adjacency matrix was obtained. Further, the topological overlap matrix (TOM) was calculated. In simple terms, if genes i and j have a lot of the same adjacent genes, then T OM i,j will be high, which means that there are similar expression patterns between the two genes. Therefore, several AD-specific modules from DEGs of AD and normal samples could be obtained by WGCNA.

Protein-protein interaction(PPI) network
Mentioned AD-specific modules before, to identify the genes in these AD-specific modules and the functional relationship between well-known AD risk genes(APP, PSEN1, PSEN2, APOE or MAPT), a protein-protein interaction(PPI) network was constructed by Gephi(https://gephi.org/).

Convergent functional genomics
The potential driver genes was prioritized from AD-specific modules by CFG method, which integrated various levels of AD-related evidence [4,17]. There were five AD-related evidence: (1) regulation of target gene expression by AD genetic variants;(2) association at least one locus significantly ;(3) physical interaction between target genes and APP, PSEN1, PSEN2, APOE or MAPT;(4) differential expression of target gene before AD pathology emergence;(5) correlation of target gene expression with AD pathology in Aβ and Tau. The range of CFG score was from 0 to 5, with 5 indicating highest priority.

DEG detection
A total of 776 samples and 108,302 genes from multiple transcriptomic datasets were compiled for DEGs detection(P<0.05 and foldchange > 0.1). Besides, for ADNI dataset, we randomly chose 40 samples from the control in 10 times and selected gene with frequency greater than or equal to 3. Each red node represented DEG for five datasets (Figure 2). We identified 7567 DEG(2166 EC, 1952 HP, 949 FC, 3075 TC and 3204 ADNI) for subsequent analyses. About 6%˜19% of the total genes could be identified as DEGs. Among the DEG list in all five datasets, the expression patterns of well-known AD risk genes, such as APP, PSEN1, PSEN2, APOE and MAPT were only slightly altered or unchanged in AD patients. In addition, 19 genes had a consistently differential expression from EC, HP, FC, TC and ADNI( Figure   3).

Functional annotation of AD-related DEGs
We investigated functional enrichment of the AD-related DEGs. The 7567 target genes in the network were enriched in 324 KEGG pathway and 1381 GO terms in Figure 4. We identified 61 KEGG pathway and 324 GO terms (P<0.005 ), respectively. As shown in Table-2, we also found several pathways have been reported to be associated with AD, including Alzheimer disease pathway, MAPK signaling pathway and AMPK signaling pathway. Top 20 significantly KEGG pathway selected was exhibited for each dataset in Figure 5. Besides, these GO terms are divided into ontologies based on a hierarchical relations. Specifically, DEGs related to the biological processes for synaptic-related functions were significant enriched in Table-3, such as chemical synaptic transmission, regulation of postsynaptic membrane potential, synaptic vesicle exocytosis, synaptic transmission, GABAergic, regulation of synaptic transmission, glutamatergic, synaptic vesicle endocytosis and long-term synaptic potentiation. In addition, they were associated with neuron-related processes, including neurotransmitter secretion, neuron projection morphogenesis, negative regulation of neuron apoptotic process and negative regulation of neuron projection development. These results of KEGG pathway and GO terms could contribute to understand the neuropathological process of AD.

DEG-enriched modules
We used WGCNA to divide the DEGs into several highly related gene modules. As shown in Figure 6, a very significant positive correlation was observed between five modules and AD for five dataset. A modular size was ranged from 96 to 142 genes that might reflect the different layers and complexity of gene regulation in the AD brain. These five DEG-enriched modules were used for identifying potential driver genes for AD etiology and pathology. We obtained a PPI network from five ADspecific modules, which contained 5 seed genes and 209 neighbors. Due to the lack of connectivity, several genes were not included in the PPI network, such as S100A6, PSAT1 and MXI1. Notably, only a few genes were linked to all five well-known AD risk genes (APP, PSEN1, PSEN2, APOE and MAPT) in Figure 7.

Pathology correlation Aβ and Tau
We also investigated the relationship between the genes in these five AD-specific modules and AD pathology in Aβ and Tau. Table-4 shows that there were significant differences in AD pathology among different datasets. Compared with the number of Tau production-related genes, AD pathology in Aβ play a vital role. These results could contributed to understanding the pathology process of AD.
Potential driver genes of AD To search potential driver genes of AD, we identified 23 related-AD genes based on CFG score(>=4) and P-value(<0.05) from the genes in five AD-specific module. We also list previously published gene-based studies for the identification of patients with AD in Table-5. For example, GJA1, also known as connexin 43, shows upregulated mRNA and protein levels in AD[18].

Discussion
In this study, we aimed to identify potential driver genes for AD from multiple transcriptomics dataset DEGs by WGCNA and CFG methods. Pathway enrichment analysis was performed to interpret the function of these DEGs. KEGG pathway analysis for the 7567 DEGs were significantly enriched in one KEGG pathway "MAPK signaling pathway", which is composed of ERK, P38, and JNK. In the adult nervous system, ERK activation is necessary for synaptic plasticity and memory formation [19]. In the brains of AD patients, P38 is highly expressed. Aβ-induced P38 activation increases tau phosphorylation and promotes the amyloidogenic processing of APP [20,21]. In a mouse model of AD, the JNK signaling pathway is overactivated in the spine before cognitive decline [22]. These studies indicate that the overactivation of MAPK signaling pathway could cause the occurrence of AD. Therefore, preventing MAPK overactivation is effective strategy in order to reduce Aβ deposition, Tau hyperphosphorylation, neuronal apoptosis, and memory impairment. MAPKs could be potential targets for novel and effective therapeutics of AD[23, 24]. GO term analysis indicated that the 7567 DEGs were mainly involved in chemical synaptic transmission, regulation of postsynaptic membrane potential, synaptic vesicle exocytosis, synaptic transmission, GABAergic synapses, regulation of synaptic transmission, glutamatergic, synaptic vesicle endocytosis, long-term synaptic potentiation, neurotransmitter secretion, neuron projection morphogenesis, negative regulation of neuron apoptotic process and negative regulation of neuron projection development. Damage to neuronal and synaptic function has always been considered an important pathological feature of neurodegenerative diseases, and decreased synaptic activity is also considered to be the most relevant pathological feature of AD cognitive impairment [25]. For example, the downregulation of GABAergic synapses is closely related to the loss of GABAergic inhibition [26]. Studies have found that GABAergic neurotransmission is closely related to various aspects of AD pathology, including Aβ toxicity and Tau hyperphosphorylation [27]. The level of GABA inhibitory neurotransmitter in AD patients was significantly reduced, suggesting that AD has insufficient synaptic function and neuronal transmission [28]. In addition, In a mouse model of AD indicate that the impairment of hippocampal neurogenesis may be mediated by GABAergic signal dysfunction or the imbalance between excitatory and inhibitory synapses [29]. Therefore, GABAergic synapses not only plays an important role in the function of the hippocampus, but also in the pathogenesis of AD. To explore the protein interactions, a PPI network was consisted of 5 well-known AD risk genes(APP, PSEN1, PSEN2, APOE and MAPT) and 209 genes. The 14 genes with greater than four degrees (including CXCL12,BAX, IRF2, IRF5, SMAR-CC1, SP1, CASP6, DCN, FZD2, GNA13, HNF4A, ITGAV, PELP1 and SDC2) were selected. Especilally, CXCL12 plays a major role in neuroinflammation because it mediates the local immune response and attracts leukocytes, which are thought to migrate across the blood-brain barrier along the concentration gradient of chemokines to their targets. This occurs for example in AD in the vicinity of the amyloid plaques that attract and/or activate local glial cells [30].

Limitations
There are some limitations in this study. First, although we identified 23 potential driver genes of AD by the WGCNA and CFG method, these approachs could be used to prioritize genes rather than to identify true causal genes. Therefore, further biological validation of the identified genes are necessary in future studies. Second, 4 of 5 datasets were downloaded from AlzData, which only retained the common genes from different studies during the cross-platform normalization. Third, the sample size of EC, HP and TC available for analyze was still limited, and the larger sample size of FC and ADNI might have a greater influence on the results. Fourth, the rapid development of various omics provide new opportunities for understanding of AD. However, we only used transcriptomics dataset to identify potential driver genes of AD. Finally, more potential genes of AD were not considered. Deep learning has capacity to dig out more hidden gene in data and is a machine learning algorithm based on artificial neural network, which is a computational model inspired by the structure of human brain. The main difference between deep learning and traditional artificial neural network lies in the scale and complexity of network structure. The networks of deep learning have a larger number of hidden layers, while traditional artificial neural networks usually have only one hidden layer. This is due to the lack of big data and GPU hardware technical support in the last century. Due to the emergence of more powerful CPU and GPU hardware, deep learning with more hidden layers is proposed on the basis of artificial neural network, and more nodes can be used in each hidden layer[31, 32].

Conclusions
In this study, we identified potential driver genes from AD-specific modules using multiple transcriptomics datasets and observed that DEGs were enriched with several pathways significantly, which are consistent with observations from previous studies. In summary, Our results contribute to understanding pathophysiology of AD and looking for candidates drug targets.

Consent for publication Not applicable
Availability of data and materials AlzData and ADNI data are publicly available.

Competing interests
The authors declare that they have no competing interests.       In total 123 47 74 244 Table 5 The brief description of 23 potential driver genes of AD. The potential 23 driver genes are prioritized by the CFG method, which is integrated various levels of AD-related data including expression of target gene is regulated by eQLT, GWAS, PPI, Early DEG, Pathology correlation Aβ and Tau. The last column is list previously published gene-based studies for the identification of patients with AD. (*: P-value<0.05; **: P-value <0.01; ***: P-value<0.001)     PPI network. The PPI network contained 5 well-known AD risk genes (green) and 209 (yellow) neighbors, which are from ve AD-speci c modules.