The Human Brainome: changes in expression of VGF, SPECC1L, HLA-DRA and RANBP3L act with APOE E4 to alter risk for late onset Alzheimer’s disease

While there are currently over 40 replicated genes with mapped risk alleles for Late Onset Alzheimer’s disease (LOAD), the Apolipoprotein E locus E4 haplotype is still the biggest driver of risk, with odds ratios for neuropathologically confirmed E44 carriers exceeding 30 (95% confidence interval 16.59–58.75). We sought to address whether the APOE E4 haplotype modifies expression globally through networks of expression to increase LOAD risk. We have used the Human Brainome data to build expression networks comparing APOE E4 carriers to non-carriers using scalable mixed-datatypes Bayesian network (BN) modeling. We have found that VGF had the greatest explanatory weight. High expression of VGF is a protective signal, even on the background of APOE E4 alleles. LOAD risk signals, considering an APOE background, include high levels of SPECC1L, HLA-DRA and RANBP3L. Our findings nominate several new transcripts, taking a combined approach to network building including known LOAD risk loci.


INTRODUCTION
Apolipoprotein E (APOE) is a major risk factor in both early onset AD (EOAD, onset before 65) and late onset AD (LOAD, onset after 65) 1 .The APOE locus has three common haplotypes (APOE E2, APOE E3 and APOE E4) de ned by two single nucleotide polymorphisms (rs7412 and rs429358).While the E4 haplotype is a major driver of risk 1 , the E2 haplotype appears to be protective 2 .We have found that considering APOE background can help to map additional factors involved in genetic risk 3 .These factors could be either acting in conjunction with APOE to increase or decrease risk or, conversely, could be acting independently of the APOE locus.It is through this context that we decided to map how the APOE E4 risk allele can affect 'omics pro les.
Our prior work [4][5][6][7] has involved analyzing expression pro les to determine the interplay of DNA, RNA, and protein in LOAD risk.We have used Bayesian network (BN) probabilistic / causal modeling to infer potential causality and we uncovered a potentially predictive transcript (in other words, a "hit") that replicated in two separate sets of data as well as cell culture models 5 .The choice of BN modeling (over alternative network methodologies, e.g., Markov networks and correlation-based co-expression networks) is motivated by a number of factors.First, the BNs ability (due to the BN directed acyclic graph structure) to lter out numerous spurious (indirect, or transitive) dependencies inherent to the multicollinear data rich in mutually dependent variables 8 , such as expression datasets.Second, the BNs ability to model non-additive higher-order relationships, both linear and non-linear.Third, the BNs ability to incorporate mixed data types (such as, in our case, expression data plus APOE haplotypes) into a uni ed analysis framework.Finally, and importantly, BN modeling is an explainable and segment-able statistical/machine learning activity, wherein each fragment of the network can be mechanistically and biologically interpreted as a probabilistic dependency and potential directional causality at high levels of granularity, and can be further scrutinized and in silico validated using "conventional" statistical tools.By concentrating on a variable of interest (e.g., LOAD phenotype) and its immediate neighborhood in the network, we can perform a classi cation task with clearly de ned local conditional probability tables, something that the dimensionality reduction / clustering / visualization methods, such as UMAP 9 or t-SNE 10 , cannot directly address.
Here, we sought to extend our prior BN work using a few modi cations.First, we included the APOE E4 risk haplotype in our analysis.Second, we prioritized data inputs using mutual information (MI) as a variable selection lter, as opposed to differential expression (DE).MI measures the informational gain for understanding the state of one variable given knowledge about the state of another variable.Rather than rely on statistically signi cant differences in means (DE), MI estimates the overall dependence between two variables.Mixed MI (MMI) can deal with both quantitative and qualitative variables in the same space.Additionally, relationships can be given ranks based on the MMI cumulative distribution function (CDF), with only inputs showing high levels of MMI with the variable of interest (here diagnosis of LOAD) included.The nal difference from our prior work is that we have used a different platform for BN network modeling, BNOmics [11][12][13] .BNOmics uses a novel hybrid (constraint-based + search-andscore) heuristic algorithm to increase scalability by signi cantly shrinking the search space thus making a series of full BN modeling experiments with varying hyper-parameters (e.g., discretization schedules) computationally feasible (days instead of weeks) in our setting, increasing convergence and robustness.In addition, di culties with handling both qualitative and quantitative measures in the same analysis (i.e., the mixed-variable problem) are solved by the novel MMI metric and by our use of adaptive maximum entropy-based discretization.This allows for robust handling of mixed type variables and their linear and non-linear interactions within a single analysis framework.
In summary, we used BNOmics to analyze our Human Brainome datasets 5 including the APOE E4 risk haplotype in the model, as well as performed separate analyses in the group of individuals who were APOE E4 positive (Group 1) and APOE E4 negative (Group 2).Input data was selected by the transcripts which best explained LOAD status (scored as presence (1) / absence (0) of pathologically con rmed LOAD), and expression pro les were discretized to maximize both biological interpretability and rigor and reproducibility (by insuring convergence in the search space).We were focused on searching for de novo hits that were protective against APOE E4 risk as well as those that contributed to APOE E4 risk.

SUBJECTS AND METHODS
1.2.1 Human brain tissue samples.Sample sets are from The Human Brainome series 5 .Our KRONOSII series was obtained from 21 National Alzheimer's Coordinating Center (NACC) brain banks and from the Miami Brain Bank as previously described 4,7,14 .Additional cohorts were obtained in the same manner as the original US series.Our criteria for inclusion were as follows: self-de ned ethnicity of European descent (in an attempt to control for the known allele frequency differences between ethnic groups), neuropathologically con rmed AD or no neuropathology present, and age of death greater than 65.
Neuropathological diagnosis was de ned by board-certi ed neuropathologists according to the standard NACC protocols 15 .Samples derived from subjects with a clinical history of stroke, cerebrovascular disease, Lewy bodies, or comorbidity with any other known neurological disease were excluded.
Alzheimer's disease or control neuropathology was con rmed by plaque and tangle assessment with 45% of the entire series undergoing Braak staging 16 .
The RUSH series includes deceased subjects from two large, prospectively followed cohorts maintained by investigators at Rush University Medical Center in Chicago, IL: The Religious Orders Study (ROS) and the Memory and Aging Project (MAP).The ROS cohort, established in 1994, consists of Catholic priests, nuns, and brothers from 40 groups in 12 states who were at least 55 years of age and free of known dementia at the time of enrollment.The MAP cohort, established in 1997, consists of men and women primarily from retirement facilities in the Chicago area who were at least 53 years of age and free of known dementia at the time of enrollment.All participants in ROS and MAP sign an informed consent agreeing to annual detailed clinical evaluations and cognitive tests, and the rate of follow-up exceeds 90%.Similarly, participants in both cohorts signed an Anatomical Gift Act donating their brains at the time of death.The overall autopsy rate exceeds 85%.The ROS and MAP cohorts were analyzed jointly since they were designed to be combined, are maintained by a single investigative team, and a large set of phenotypes collected are identical in both studies 17 .More detailed information regarding the two cohorts can be found in previously published literature 18 .
Samples were chosen by the diagnostic criteria as stated above as well as the availability of high quality RNA.1.2.2Ethics Statement.The research for this project fails under 45 CFR 46.101(b)(4).Human Subjects exemption #4 states research involving the collection or study of existing data, documents, records, pathological specimens, or diagnostic specimens, if these sources are publicly available or if the information is recorded by the investigator in such a manner that subjects cannot be identi ed, directly or through identi ers linked to the subjects.In both series, samples were de-identi ed before receipt, and the study met human studies institutional review board and HIPPA regulations.Consents for biobanking are obtained from the source brain banks.This work is declared not human-subjects research and is IRB exempt under regulation 45 CFR 46.See the Acknowledgements section for a list of individual sites that contributed samples to this effort.1.2.3Data Collection.Sample data was adjusted for several biological covariates (gender, age at death and cortical region) and several methodological covariates (institute source of sample, post-mortem interval, detection and hybridization date).Adjustment methodologies were as in the prior report 5 .1.2.4Data Analysis-Networks.Our focus was on the construction of high-dimensional multimodal Bayesian networks (BNs).As before 5 , we analyzed the KRONOSII and the RUSH datasets separately.Our analysis pipeline is shown in Fig. 1.APOE status was de ned as either having no APOE E4 alleles (APOE negative) or at least one APOE E4 allele (APOE positive).
All analyses were performed using the BNOmics platform.BNOmics is a highly scalable open-source universal-purpose BN modeling and visualization software developed by us previously 11,12,19 .Each analysis involved the following: MMI.The goal of the MMI-based variable selection/ranking was to select a more compact set of transcripts for subsequent BN analysis.We selected two series of 2000 top expression transcripts from the KRONOSII and RUSH data, respectively, as 2000 constituted a reasonable compromise between the computational e ciency of the full BN construction (hours to 1-2 days for each BN) and low probability of false negatives (given the smooth and monotonic nature of the MMI curves and the fact that the resulting Markov blankets of LOAD phenotype contained no more than 10-30 variables).
DISCRETIZATION.In order for our BN analyses to proceed, expression data rst must be discretized.We used the multinomial model with the maximum entropy-based discretization, as the linear Gaussian model is not always a good t with the transcript expression data 20 .We discretized the continuous transcript expression data into 3 bins ("high-medium-low"), which showed stable BN convergence and was also the most convenient discretization schedule from the biological interpretation standpoint.To insure that our discretization was robust and did not create artifacts such as false dependencies or independencies, we additionally plotted the full distribution using the ggstatsplot package in R 21 for each of our main targets (continuous values) against disease risk.
BNOmics.We used a hybrid (constraint-based + search-and-score) algorithm with MDL scoring function previously described in 11,12,19 , with 20 restarts.BN edges are directed; however, a directed edge in the BN does not necessarily imply either directional causality or the biological hierarchy.Rather, edge directionalities describe the dependency structure between the variables (the factorization of the joint probability of the variables in the BN) that is most likely given the data, largely resolving multicollinearity issues.To identify the most salient transcripts, Markov neighborhoods (MNs) of the primary variable of interest (LOAD status, denoted as "DX" node in the networks) were used.MN is a simpli cation of the more rigorous concept of the Markov blanket (MB) that includes only the variables that are immediately adjacent to the variable/node of interest (such as DX) in the network.MB, consisting of the node's "parents", "children", and other "parents" of the node's "children", is a superset of MN; given the MB of the variable of interest, it is independent of the remaining variables in the network, allowing us to concentrate on a small subset of variables to identify and highlight potential hits.

Data Analysis-dataset similarity.
To examine global similarity in transcript pro les, for each dataset (D), KRONOSII (K) and RUSH (R), we computed the Mixed Mutual Information (MMI) 13,22 between the level of expression of each transcript i ( ) and the corresponding binary variable associated with the LOAD status (DX) as .
To simplify notation, we de ne .Subsequently, we de ne two transcript subsets r and k as , and , with corresponding to the percentile value in the distribution.We used the Jaccard Index (JI) to measure the similarity between , .JI is de ned as the size of the intersection divided by the size of the union of the sample sets.In our analysis we have computed JI using different cut-offs of MMI as , with being the set of all transcript labels in KRONOSII and RUSH.Similar to a common practice in transcript-set enrichment analysis, a hypergeometric distribution was used to compute the probability ( ) of observing two subsets sampled randomly without replacement from of size equal to and and with JI greater than or equal to that observed for the corresponding and .APOE genetic similarity between the two datasets was determined by looking at haplotype frequencies.
1.2.6 Data Analysis-Visualizations. MBs of the variable of interest (LOAD status / DX) were visualized as parts of the full BNs.Notably, these are the fragments of the full BNs, and not the small BNs built from the further reduced variable sets.The MBs included dependencies between the peripheral variables.The MMI CDFs were plotted for KRONOSII and RUSH, and speci c hits were mapped to determine how much explanatory weight they had in each dataset according to a univariate MMI criterion.To dissect the effects (and their directions) of top BN-identi ed hits, two-level Conditional Probability Trees (CPTs) were created using the local conditional probability tables.CPT visualizations are intuitively similar to the conventional decision trees.

RESULTS
Figure 2 depicts the MB of LOAD diagnosis (DX) for the 3-bin discretized expression data plus APOE genotype for KRONOSII and RUSH.We obtained converging structures in both sets.The top hits in KRONOSII were APOE E4 presence/absence, VGF Nerve Growth Factor Inducible (VGF), Sperm Antigen with Calponin Homology and Coiled-Coil Domains 1 Like (SPECC1L), and Major Histocompatibility Complex, Class II, DR Alpha (HLA-DRA).The top hits in RUSH were APOE, Kinesin Family Member 5A (KIF5A) and Ornithine Decarboxylase (ODC1).Violin plots of normalized, but not discretized pro les for all top targets are given in Supplementary Fig. 1 demonstrating our MB results (dependencies) were not due to possible discretization artifacts.Of note, for the small set of top explanatory hits for LOAD, only APOE E4 presence/absence replicated between the two datasets.
To investigate why there was little replication, we compared both the expression input from both KRONOSII and RUSH as well as examined the haplotype frequencies for APOE E4.We used the JI to compare the MMI expression datasets and found that at high levels of mutual information (> 60%), as we had in our top 2000 MMI input sets, the KRONOSII and RUSH expression datasets overlap, as measured by the JI (red line, Supplementary Fig. 2A).This similarity is statistically more signi cant than a random distribution of the JI (Supplementary Fig. 2B).From this we conclude that as prior 5 , the KRONOSII and RUSH expression datasets on their own are globally similar.
In this study, we are interested in including genetics in our modeling, speci cally the APOE haplotype.To examine whether this was a factor impeding reproducibility, we plotted the frequencies of the main APOE To attempt to validate the results in KRONOSII, we split the dataset into APOE E4 positive individuals and APOE E4 negative individuals and ran these two groups separately.While not as ideal as a completely independent dataset, this approach does serve as a validation of effects given that MMI predictions and BN structures are produced independently from the prior runs using the entire KRONOSII series.Figure 3 shows the MB for the two separate groups of APOE E4 positive and APOE E4 negative samples in KRONOSII.Again, VGF is a top hit.In the KRONOSII group without the APOE risk haplotype, VGF and Homo sapiens unc-13 homolog B (UNC13B) had the greatest explanatory weight for pathologically con rmed LOAD.In the KRONOSII group with the APOE risk haplotype, RAN Binding Protein 3 like (RANBP3L) had the greatest explanatory weight for pathologically con rmed LOAD.Violin plots of normalized, but not discretized pro les for all top targets are given in Supplementary Fig. 4.
While MBs indicate patterns of dependencies and strengths, they do not indicate how the dependency is manifested, e.g., whether the transcript is highly expressed with APOE E4 haplotypes, or vice versa.We were interested to see how our case and control groups split by the level of expression and copies of APOE E4 haplotype.Figures 4-7 and Supplementary Figs.5-6 show conditional probability trees (CPT), which are decision tree-like visualizations of the conditional probability tables for each of the main hits.We found that VGF, SPECC1L, HLA-DR and RANBL3L gave strong consistent signals and acted in concert with APOE E4 haplotypes.While UNC13 and NEU4 gave strong signals in the network, there was less of a consistent (in its direction) signal.
High levels of VGF and no APOE alleles were most commonly found in the group of controls, with only 8% of the samples in this set having a pathologically con rmed diagnosis of LOAD (Fig. 4 top box, APOE = 0, VGF = 2).The greatest proportion of cases was found in the group of individuals with APOE 4 haplotypes and low levels of VGF (Fig. 4 bottom box, APOE = 1, VGF = 0).Of interest, there is a risk reduction in individuals with high levels of VGF, even if they possess APOE E4 alleles (Fig. 4 yellow highlight, APOE = 1, VGF = 2), indicating that high levels of VGF can be protective on an APOE background.This is a signi cant effect (comparing high VGF expression and APOE E4 presence with low VGF expression and APOE E4 presence OR = 2.2810; 95% CI 1.2217-4.2588,p-value = 0.0096) with an effective reduction in probability of ~ 40%.
For risk effects, we have mapped SPECC1L, HLA-DRA and RANBP3L.All of these elevate risk with increasing expression in a linear fashion with the APOE E4 haplotype.SPECC1L (Fig. 5) and HLA-DRA (Fig. 6) were seen without splitting the KRONOSII cohort by APOE E4 haplotypes (see Fig. 2).RANBP3L (Fig. 7A) was only ranked in the network when we split the cohort by APOE E4 haplotype (see Fig. 3).To investigate this effect, we traced the path of RANBP3L in the full cohort (Fig. 7B).RANBP3L was "missed" in the full dataset because it is steps removed from the diagnosis (DX) signal and in terms of network informative content it acts through VGF.VGF is our strongest signal, and thus, through splitting the sample into APOE E4 negatives, where VGF has the most effect given it is protective, and APOE E4 positives, where VGF has less of an effect, we are able to uncover RANBP3L.
As stated previously, UNC13 (Supplementary Fig. 5) and NEU4 (Supplementary Fig. 6) did not give a consistent signal in the CPT analysis.For UNC13 consistency in expression is only seen in the APOE E4 haplotype absent set (APOE = 0 branch, top of gure).Note that in multinomial (and therefore multi-bin) BN analysis with more than two bins, relationships do not have to be linear; therefore, "consistency" in this sense indicates relationships with sustained "low-medium-high" or "high-medium-low" directions.
There is a modest increase in the probability of LOAD with high UNC13 expression if there are no APOE E4 haplotypes present.This is logical given these hits were only uncovered in the split analysis under the condition where the input KRONOSII dataset had no APOE E4 alleles (Fig. 3A).NEU4 was also found in the MB of the network built from individuals without APOE E4 haplotypes; however, unlike UNC13 there was no real relationship between high levels of expression or low levels of expression and risk.This suggests that, rst, the minimum description length (MDL) scoring function in BNs tends towards higher sensitivity in general and, second, multinomial models capture differences at the distribution level that might not directly translate into the sustained (linear) directions.Therefore, "borderline" (lower) edge strengths in the MBs should be further scrutinized using the local probability tables (visualized as CPTs) and validated using traditional statistical univariate tests.
Finally, we were interested to search for the main effects we found in KRONOSII in RUSH and compare the level of MMI (univariate metric that does not account for multicollinearity and does not distinguish between direct and transitive dependencies) with diagnosis for each transcript.We considered the RUSH cohort as well, because while the reduced number of APOE E4 samples inhibits the power to map APOE E4 contingent structures, we still may be able to map effects considering MMI, which is a direct

DISCUSSION
In this study we have demonstrated the high utility of the BN-driven computational systems biology approach for studying LOAD.The principal advantages of the BN modeling outlined in Introduction above (removal of spurious relationships; mixed data types handling; no linearity or normality assumptions; combination of both network visualization/interpretation and classi cation, at high levels of granularity) proved to be instrumental in our analysis.In this, we concur with the recent LOAD literature [23][24][25] .We believe the BNs to be a robust, exible and tting tool, around which further multiscale network modeling in the LOAD context should be centered.The main impediment to a wider BN modeling application is the limited scalability due to the high computational complexity; however, this is now less of an issue due to the increasingly better hardware, and continuing progress in both algorithmic development and software implementation 11,23,26 .
Our most robust hit was VGF.High levels of VGF are protective against the development of Alzheimer's disease generally, and on an APOE E4 risk background there is approximately a 50% reduction in the proportion of individuals with LOAD when there are high levels of VGF (see Fig. 4, yellow box).This hit was rst implicated in LOAD using SELDI-TOF-MS to nd novel biomarkers in patient CSF 27 .This initial CSF result has now been validated in the Alzheimer's Disease Neuroimaging Initiative (ADNI) 28 .
Differences in VGF pro les have also been mapped in brain tissues 23,29 further implicating this hit.VGF (nonacroymic) is a granin-like neuropeptide precursor protein which is processed by prohormone convertases 30 .Precursor cleavage results in several downstream peptides with various functions including in ammation 31 , pain 31 , reproduction 32 , energy metabolism/feeding 33 , and circadian rhythms 34 .Central to all these functions is the crucial role that VGF plays in the secretory pathway along with other members of the granin family 35 .The secretory pathway has long been implicated in Alzheimer's disease pathogenesis with the rst ndings coming from studies of EOAD, where the Swedish mutation results in differences in Amyloid precursor protein (APP) sorting 36 .APOE pathogenesis in LOAD also involves pathway regulation; however, in this case APP pathogenesis is likely caused by faster endocytosis through APOE E4 binding of the LRP1 receptor 37 .It remains to be seen whether VGF may block this process directly, or whether the protection from overexpressed VGF is merely due to its growth factor properties relationship between the hits and LOAD diagnosis.Figure 8 graphs the MMI CDFs for KRONOSII and RUSH, considering our main hits.As can be seen, VGF and APOE E4 haplotype ("APOE_genetic") have extremely high explanatory signals in the KRONOSII data (solid lines) with the VGF signal actually exceeding APOE in KRONOSII.VGF is also high in the RUSH dataset (dotted lines), but not strong enough to overtake the main APOE signal.Interestingly, RANBP3L is also a MMI hit in both datasets, whereas our other hits only appear to have strong effects in KRONOSII.Finally, we tested APOE expression levels for MMI with DX.APOE expression is not a robust signal in either set.This matches with our nding that there is no relationship between APOE haplotypes and APOE expression when just plotting expression without considering MMI (Supplementary Fig. 7).
For all the other hits we mapped, increases in expression caused increases in risk for LOAD development.SPECC1L colocalizes with tubulin and actin.De ciencies in SPECC1L protein expression can lead to de cits in vertebrate facial morphogenesis 38 .SPECC1L is thought to be a regulator of adherens junction stability and remodeling of the actin cytoskeleton 39 .Loss of SPECC1L increases staining of adherens junctions 40 , and therefore, increased expression of SPECC1L, as seen in our study, may be involved in risk via breakdowns in blood brain barrier (BBB) integrity.APOE also acts on BBB integrity.Speci cally, the E4 containing forms of apoE increase age-dependent breakdowns in the blood brain barrier in mouse models 41 .
HLA-DRA is a member of HLA class II set of proteins, which sit on the surface of antigen presenting cells and act in pathogen recognition.Upregulation of HLA class II proteins is a marker of activated microglia in LOAD 42 and AD patients have a higher load of CD4 + HLA-DR + and CD8 + HLA-DR + lymphocytes 43 .In large, multisite genome-wide association screens variants near the HLA-DRB5-DRB1 cluster just downstream of HLA-DRA are consistently found to be associated with LOAD risk 44,45 .HLA is known to be involved in immune in ammatory response 46 , as well as CNS plasticity and signal transmission for the HLA MHC class I proteins 47 .In the context of APOE genetic risk, prior work has focused on class I proteins, showing protective effects of HLA-DRB haplotypes 48 ; however, this is counter to the GWAS results, where HLA-DRB and HLA-DQA/B are risk factors 44,45 .Further work has suggested that HLAantigen incongruence resulting in persistent immune activation can lead to risk for AD 49 , which is in concordance with our nding of HLA-DRB activation.It remains to be seen how APOE haplotypes, HLA-DRB haplotypes and HLA-DRA expression activation interact.
RANBP3L is a nuclear pore protein which enables nuclear protein translocation through the nuclear pore complex (NPC).RANBP3L speci cally acts on BMP-speci c SMAD 1/5/8 proteins and terminates BMP signaling by blocking nuclear import 50 .Reductions in SMAD signaling have been linked to AD pathogenesis in both neurons and glia 51,52 .Prior work on RANs in AD showed reduced expression of RAN, RANBP1, RANBP2, RANBP5, RANBP9 and RANBP10 in AD tissues, proposing that the toxic forms of APP knocked down RANs and thus reduced import into the nucleus 53 .RANBP3L was not measured in that screen.In our series there are also seen statistically signi cant reductions in RAN, RANBP1, RANBP6, and RANBP9, which is consistent with the prior report 5 ; however, none of these downregulated hits were crucial to the APOE effect.From our studies the crucial effect is an upregulation in RANBP3L blocking SMAD signaling by blocking nuclear import.It remains to be seen how APOE E4 acts in concert with RANBP3L.
In conclusion, our study seeks to de ne the role of APOE risk haplotypes on the downstream effects of expression being agnostic to genomic location or biological hypothesis.We have used MI and BN modeling to de ne novel hits which can either increase risk or offer protection against developing LOAD in the context of risk conferred by the APOE E4 haplotype.This study extends prior work in that we are sampling all expression pro les, rather than just looking at genomic locations closest to APOE, which is a common technique in GWAS follow-up or cis expression pro ling.We are additionally performing network analysis in the context of risk genotypes, which is also not widely done as most network pro ling focusses on continuous variable data only (i.e.expression).Network analysis, and speci cally BN modeling, offer distinct advantages for the reasons stated above.Finally, we have demonstrated that power is a crucial component to these screens, with reproducibility only seen where there are enough alleles/haplotypes to capture effects.Further work will involve an understanding of additional genetic risk hits beyond APOE as well as compound modeling, where power permits.

CONCLUSION
Taking a combined approach to network building including the APOE E4 locus, our ndings nominate VGF, SPECC1L, HLA-DRA and RANBP3L as novel expression loci involved in the pathogenesis of LOAD.Of particular interest, high levels of VGF are protective on an APOE E4 background.
Pipeline. is the overview of all computational procedures performed in the KRONOSII and RUSH datasets.As mentioned in the text, the matrix of gene expression was reduced to n=2000 using MI for each dataset independently prior to BN construction.Markov Blankets (MBs) were captured from the full BN structures to determine top hits.Reproducibility between outputs was determined by rank comparison of MB hits and reproducibility of inputs was determined by the JI.Conditional Probability Trees (CPTs) were only built for KRONOSII given the lack of reproducibility between sets with respect to APOE genotype and the fact that KRONOSII gave the most robust results.The CPTs along with the BNs provide biological interpretability since we have the order to the network structure from the BNs and risk/resilience/expression pattern x genotype relationships from the CPTs.SPECC1L CPT.Same as in Figure 4 with the exception that in the second tree layer (3 rd column from the left) the probability is computed for SPECC1L.
Figure 6 HLA-DRA CPT.Same as in Figure 4 with the exception that in the second tree layer (3 rd column from the left) the probability is computed for HLA-DRA.
Page 21/22 A. RANBP3L CPT.Same as in Figure 4 with the exception that in the second tree layer (3 rd column from the left) the probability is computed for RANBP3L.
Cumulative Distribution Function (CDF) of the MMI.For each variable in the dataset with respect to DX the CDF was calculated from the MMI data.KRONOSII in solid blue; RUSH in dashed black.Rankings of selected genes are reported for KRONOSII (solid arrows) and RUSH (dashed arrows).APOE and APOE_Genetic represent the level of expression and haplotype state, respectively.The higher a hit is on the graph, the more explanatory weight there is for the state of diagnosis (i.e.whether or not samples have LOAD).As can be seen, the largest MMI signal was VGFin the KRONOSII dataset, followed by APOE_Genetic in the KRONOSII dataset.In RUSH, the largest signal was with APOE_Genetic, even though there were not as many APOE E4 samples as in KRONOSII.The next highest hit in RUSH was RANBP3L,

B
. RANBP3L dependency path in the full Bayesian Network (BN) for SET1.BN comprising top 2000 most informative (as measured by MMI) genes (represented by dots) is shown.The undirected dependency path between RANBP3L and DX (LOAD status) is highlighted.(Note that in this representation the edge lengths do not correspond to any meaningful measures and are merely for visual clarity).
haplotypes in each dataset.Of note, in Supplementary Fig.3there is a very low number of APOE E4s in the RUSH cohort.This effectively reduces the cohort size contributing to mapping effects by ~ 40% (KRONOSII APOE E4 positive controls n = 56; RUSH APOE E4 positive controls n = 22; KRONOSII APOE E4 positive cases n = 150; RUSH APOE E4 positive cases n = 57).This difference in APOE haplotype frequencies was not just due to sampling, given the KRONOSII set is larger (n = 475) than the RUSH dataset (n = 306).Examining APOE frequencies in additional RUSH samples (n = 458) where expression was not pro led due to tissue availability and/or quality, the APOE haplotype frequencies were very This reduction in power can clearly be seen in Fig.2, where the explanatory weights (edge strengths) for diagnosis and APOE in RUSH are ~ 40% of those in KRONOSII (RUSH APOE E4 explanatory weight = 21.24;KRONOSII APOE E4 explanatory weight = 58.19).Mathematically, this makes sense because all other things being equal, edge strengths in the BNs are approximately linearly proportional to the sample size when using the MDL scoring function.It is likely this reduction in power is contributing to the lack of reproducibility; generally, we are seeing much larger effects in KRONOSII.We therefore decided to focus on the KRONOSII dataset for uncovering effects.