Bayesian Networks and Causal Inference for the Interpretation of Patients' Symptom Experience

Given that oncology patients experience an average of 14 co-occurring symptoms, the identiﬁcation of sentinel or core symptoms is a critical need for effective symptom management. However, this task is an extremely challenging one. To address this need, we used Bayesian Network Analysis (BNA) approaches on a comprehensive dataset of 38 distinct and co-occurring cancer symptoms from a sample of 1328 cancer patients receiving chemotherapy. We evaluate three classes of algorithms (constrained-based, score-based, hybrid algorithms) on their structural stability and performance with different sample sizes and demographic characteristics (gender and age). We demonstrate their potential clinical use for casual discovery and inference. The hybrid algorithm identiﬁed more dense network structures that were clinically relevant, not only for the entire sample but for our case studies of gender and age. Our work demonstrates how to process a comprehensive set of symptom experience data for interactions and provides valuable information on the use of BNA in clinical practice when different algorithms, sample sizes and comparative sub-groups are evaluated.


Introduction
Cancer and its treatments create a significant symptom burden for oncology patients regardless of tumor types, stage of disease, and/or phases of care. 1,2 Oncology patients undergoing cancer treatment experience an average of 10 to 15 unrelieved symptoms that can be highly variable in their occurrence, severity and distress. [3][4][5] Research on these symptoms and other patient reported outcomes (PROs) is needed to improve patient' quality of life and in some cases survival. 1,2,[6][7][8][9][10][11] Until now, studies that evaluated multiple co-occurring symptoms have used analytical techniques such as exploratory factor analysis or cluster analysis. [12][13][14] One of the main research hypotheses in this field is that symptoms that co-occur or cluster together may share common underlying mechanisms that are potential targets for therapeutic interventions. 15 Nevertheless, the existing research on multiple co-occurring symptoms and symptom clusters has several limitations. 12 The current analytical approaches are limited to an evaluation of the nature of the relationships between and among individual symptoms and multiple-occurring symptoms. However, these approaches do not permit an examination of how the presence of one or more symptoms may exert an influence on other co-occurring symptoms. The accurate identification of these sentinel or core symptoms may provide target(s) for early therapeutic interventions to decrease patients' symptom burden. In this study, we explore the application of Bayesian Network Analysis (BNA) to evaluate the direction of the relationships between and among multiple co-occurring symptoms in oncology patients receiving chemotherapy (CTX). Being one of the first BNA studies in the domain of cancer symptoms, we explore how BNA can be used to map the pathways that illustrate how symptoms interact and influence each other. In addition, we elaborate on key analytical aspects that are needed to increase the reproducibility of this work as well as for the extension/comparison of clinical findings to different samples of cancer patients.
BNA is an analytical approach that can be used to explore multivariate relationships with intuitive graphs, also known as causal graphs, causal diagrams, or directed acyclic graphs (DAGs). Bayesian networks (BNs) are probabilistic graphical models that represent a set of variables (e.g., cancer symptoms) and their conditional dependencies using a DAG. 16 Using BNA, the multivariate relationships (i.e. joint probability distributions) between these variables are quantified and visualised graphically, with nodes and arcs (see Materials and Methods, Figure 2), making the discovery and analysis of their inter-relatedness more comprehensive. In principle, such causal graphs can be used to answer questions related to: (i) causal discovery (e.g., which symptom(s), among many co-occurring symptoms, can influence another distinct symptom) and (ii) causal inference (e.g., of what magnitude and causal connection -direct or confounding -these symptoms affect the aforementioned sentinel or core symptom). 17 Because of BNA's ability to handle both observable as well as uncertain information (i.e., latent factors from observational data), [18][19][20] BNA has been used in biomedical informatics for more than two decades.
BNs are suitable tools for the probabilistic inference that can aid in clinical decision making for several reasons. 18,[21][22][23] For example, they provide a visual representation of causality that can be easily understood by a clinician. 22,24,25 They can integrate knowledge from clinical experts into their modelling procedures while learning the structure and parameters of a network through the available data. 22,23,26 In addition to their capacity to provide information on observational and causal inferences, 22,27 BNs can extract parameter estimates through the representation of the joint probability space among the nodes or clinical variables (e.g., symptoms) in a model. 22 Likewise, BNs can be used with and perform well with incomplete data. 22,[28][29][30] With respect to their healthcare applications, BNs are commonly used in genomics and system biology. [31][32][33][34] The usefulness of their causal diagrams has been explored and applied in epidemiology, public health and health services research. [35][36][37] Causal diagrams have been used to control for and identify confounding variables; 38, 39 estimate effect sizes from observational data; 17 select features that are most predictive of a target variable; 40 and test for external validity, 41 fault diagnosis and health monitoring, 42 missing data, 43 and selection bias. 39 Notably, BN applications supported decision making in a wide spectrum of the clinical practice activities (e.g., diagnosis, treatment, prognosis, risk modeling). 21 In cancer research, BN applications have been used in survival prediction and treatment selection for patients with lung and colon cancer; 22, 28-30 mammography interpretation; 44 detection, diagnosis and prognosis of breast cancer; 22, 28-30 modelling and predicting the occurrence of brain metastasis from lung cancer 45 and revealing the genetic and environmental basis of bladder cancer. 16 They have been even applied to model and asses the inter-relationships between life-style, health-related factors and quality of life of cancer patients. 46,47 Nevertheless, the use of BNs in cancer symptom management research is extremely limited. To our knowledge, only one longitudinal study used BNA to investigate the effect of sleep disturbance, fatigue, and mood disturbance on cognition and quality of life in breast cancer patients before, during, and after CTX. 47 While this study illustrates the use of BNs to evaluate the relationships among cancer symptoms, it does not provide an extensive summary of the types of BN algorithms that can be used. In addition, it does not specify how the proposed model can drive data-driven domain knowledge discovery when various symptom structures are unknown. Finally, it does not elaborate on the analytical challenges that such clinical applications may have. While this study provides important information on the relationships among three common co-occurring symptoms and quality of life, 47 the sample size was small (n=74) and included only patients with breast cancer. In addition, only three symptoms were evaluated in the BNA. Given that oncology patients receiving CTX report an average of 14 co-occurring symptoms, 3 , the goals of our study are multifold and use a broader approach than the previous study. We describe how BNA methods can be used to explore the interrelationships among 38 co-occurring symptoms in a large, clinically representative sample of 1328 cancer patients undergoing CTX. In addition, we show how these methods can be used to build new clinical knowledge by highlighting notable differences in the BNs created using gender and age (i.e. <65 versus >65 years of age based on the definition of an older person from the World Health Organisation). 48 Finally, we explain how our findings can be used to guide the development and testing of future interventions to improve symptom management in oncology patients.
Our contributions in this paper are two-fold. First, we implemented three different families of BNA algorithms using crosssectional data on the occurrence of 38 symptoms, as well as on symptom occurrence data dichotomised by gender and age, to assess and compare their structure, performance and stability. Second, we provide the analytical implications, the clinical interpretations and the significance of our findings in relationship to these 38 common co-occurring symptoms in cancer patients receiving CTX. Our work provides the first case study on how to process a comprehensive set of 38 cancer symptoms to assess their interactions and explore potential target symptoms for early interventions. An additional contribution to the field of symptom research lies in the demonstration of how to utilise findings from BNs in clinical practice when using different algorithms, sample sizes and comparative sub-groups.

Results
Sample Characteristics: Of the 1328 (n Complete =1328) patients in this study, 77.8% were female (n Female =1033,n Male =295) and 72.1% were under the age of 65 (n Age <65 =1033,n Age >65 =295), with a mean age of 57.3 (±12.3) years. The majority of the patients had breast (40.2%) or gastrointestinal (30.6%) cancer. The remaining patients had gynaecological cancer (17.5%) and lung cancer (11.7%). These 1328 patients reported an average of 13.9 (±7.2) symptoms prior to their second or third cycle of CTX. Sample characteristics and symptom occurrence rates are summarised in Appendix A, Table S1 and Online Figures 1 through 5, 49 respectively.

Grow Shrink (GS) Algorithm
To learn the structure of the 38 different symptoms using the GS algorithm, we set the nominal type I error rate to 0.05. We averaged 10,000 BNs and parsed their edges with the two different statistical criteria described in the Methods section.
Both of these networks were relatively sparse, with the majority of the symptoms not connected by any edges. The network that was parsed with a 0.85 threshold identified 10 directed arcs. The network parsed with the optimal threshold (0,4963), described by Scurati and Nagarajan, 50 identified 22 directed arcs (Appendix A,

Hill-Climbing (HC) Algorithm
To learn the structure of the 38 different symptoms using the HC algorithm we set its hyperparameters to have 50 random restarts and 30 attempts to randomly insert/remove/reverse an arc on every random restart. We constructed and averaged 10,000 BNs and parsed their edges with the two different statistical criteria described in the Methods section. The results are illustrated in Online Figures 8 and 9. 49 Compared to the GS algorithm, the BNs learned using the HC algorithm were denser (Online Figures 12 and 13 49 ). As illustrated in Online Figure 8, 49 the network that was parsed with a 0.85 threshold identified 29 directed arcs and seven symptoms (i.e, numbness/tingling in hands/feet, problems with urination, itching, dizziness, difficulty swallowing, constipation, swelling of arms and legs) were not included in the BN. The network that was parsed using the optimal threshold (0.4989), described by Scurati and Nagarajan, 50 identified 59 directed arcs (Appendix A, Table S2). The only symptom not included within the network was swelling of the arms and legs (Online Figure 9 49 ).

Max Min Hill-Climbing (MMHC) Algorithm
To learn the structure of the 38 different symptoms with the MMHC algorithm, we set its hyperparameters to have a nominal type I error rate to 0.05, during the first phase of the algorithm. In addition, we included 50 random restarts and attempted 30 times to randomly insert/remove/reverse an arc on every random restart, during the second phase. We constructed and averaged 10,000 BNs and parsed their edges with different statistical criteria described in the Methods section. The results are illustrated in Appendix A, SF1, SF2, Online Figures 10 and 11. 49 Similar to the findings with the HC algorithm, the networks learning with the MMHC algorithm were denser in comparison to the networks learned with the GS algorithm and the HC algorithm (Online Figures 14 through 17 49 ). As illustrated in Online Figure 10, 49 the network that was parsed with a 0.85 threshold identified 29 directed arcs and seven symptoms (i.e, numbness/tingling in hands/feet, problems with urination, itching, dizziness, difficulty swallowing, constipation, swelling of arms and legs) were not included in the BN. The network that was parsed using an optimal threshold (0.5010), described by Scurati and Nagarajan, 50 identified 62 directed arcs (Appendix A, Table S2). The only symptom not included within the network was swelling of the arms and legs (Online Figure 11 49 ).

Bayesian Network Algorithms Comparison
To assess the goodness of fit and findings of the networks constructed, we compared GS, HC, and MMHC based on different scoring functions such as k2, BDE, AIC, BIC and logik (Appendix A, Table S2).
Based on these functions, both HC and MMHC performed better than GS. In addition, their performance was similar for all five scoring criteria with MMHC performing better overall when applied to our smaller and larger subgroup analyses (Appendix A, Table S2). Using the complete dataset of all cancer patients, the findings for HC and MMHC were similar (see Table 1). Depending on the threshold used (i.e., 0.85 or optimal threshold), the Hamming Distance (HD) of their networks were 0 and 2 respectively, and their Structural Hamming Distance (SHD) were 0 and 5, respectively. Most of their edges were common as illustrated in Figure 1 (see also Online Figures 12 to 17 49 ).

Stability of the Identified BNs
To evaluate the stability of the constructed BNs, we compared the total number of arcs identified in four bootstrap sub-samples (n 1 =332, n 2 =664, n 3 =996, n 4 =1328). In addition, we compared their similarity based on their common arcs. The averaged BNs were created in a similar way, as explained above. The results are shown in Appendix A, Table S2.
Based on the scoring function, MMHC provided a better fit for the data and it contributed the highest number of common arcs across the procedure (Appendix A, Table S2). Of note, all of the algorithms seemed to be affected by the sample size, with the score and the number of arcs increasing as the sample size increased (Appendix A, Table S2).

Bayesian Networks for Male and Female Cancer Patients
To demonstrate the potential applications and challenges of BNA in cancer symptom research, we applied the MMHC algorithm to four different subgroups of cancer patients based on their gender and age.
The sample sizes for the two subgroups of male and female cancer patients were not balanced (n male =295, n female =1033). Checking their BNs' similarities provides a view that female cancer patients experience more symptoms than male patients 5/15 (Table 2, 3). More specifically, women seem to experience up to 59 different symptom interactions, while men experience up to 37 symptom interactions. Male and female patients experience up to 14 different symptoms in common (Table 3).
However, when we created the BNs using bootstrapped, random samples of female cancer patients of equal size with the male cancer patients (n=295), this view reversed (Table 2). In this case, women seemed to experience up to 25 different symptom interactions compared to 37 in men.  49 From these diagrams we can see that when we keep the sub-groups disproportional women seem to experience a more interconnected network of symptoms than men. In contrast, when we balance the gender sub-groups with a similar sample size the findings reverse with men experiencing a more interconnected network of symptoms than women.

Bayesian Network for younger (< 65 years) and older (> 65 years) cancer patients
In a similar fashion, we compared the BNs identified using the MMHC algorithm, between older and younger cancer patients (n Age<65 =958, n Age>65 =370). The results are illustrated in Appendix A, Figure SF5

6/15
In our initial comparison, keeping the age sub-groups disproportional, younger cancer patients seem to experience a more interconnected network of symptoms than older ones. More specifically, younger people seem to experience up to 59 different symptom interactions, while older ones experience up to 42 symptom interactions. Younger and older cancer patients experience up to 18 different symptoms in common ( Table 3).
As in our previous example, when we created the BNs using bootstrapped, random samples of younger cancer patients of equal size with the older cancer patients (n=370), this view reversed ( Table 2). The comparison with the balanced sample sizes showed that older patients experience a more interconnected network of symptoms. In this case, younger cancer patients seem to experience up to 27 different symptom interactions compared to 42 interactions in older ones. Younger and older cancer patients experience up to 13 different symptoms in common (Table 3).

Discussion
In this study, we compared the network structures that were learned using constraint-based, score-based, and hybrid algorithms. 51 The hybrid algorithm identified more dense network structures that were clinically relevant, not only for the entire sample but for our case studies of gender and age differences. While our sample size was relatively small (N=1328), and we evaluated the three methods on a constrained set of confounding parameters, our results show that the score-based and hybrid algorithms identified relatively similar structures. The constraint-based algorithm identified less sparse networks which may be the results of the relatively small sample size. 52 Future analyses, with a broader set of confounding factors (e.g. specific chemotherapy regimens, supportive care interventions) for structure-learning as well as different subgroups of patients (e.g., different cancer types, different cancer treatments) may provide additional insights into the use of BNA in cancer symptom research. In addition, if our findings are confirmed regarding the existence of common core or sentinel symptoms identified in the BNs of different subgroups of cancer patients, this information can be used to develop and test interventions to decrease symptom burden in oncology patients. One can hypothesize that if an intervention is directed at the core/sentinel symptom, co-occurring symptoms associated with this "parent" symptom will decrease as well.
Given the cross-sectional design and the need for replication, one needs to proceed with caution regarding the clinical implications of this study. However, an evaluation of the various edges in Figure SF2, Appendix A, illustrates relationships that are commonly observed in oncology patients. For example cancer patients will often report the co-occurrence of difficulty breathing and shortness of breath. 53 In a similar fashion, oncology patients can experience sweats and hot flashes from a variety of therapeutic interventions. 54,55 Another set of interesting findings from our study is the age and gender differences in the density of BNs when the sample sizes were controlled for in the BNA. While most of the studies of age and gender differences in symptom burden found higher rates and severity of symptoms in younger 56, 57 and female 56,58 patients, the majority of these studies were cross-sectional and did not account for the inter-relationships among multiple co-occurring symptoms. These findings warrant replication in future studies with cancer types that occur equally in men and women (e.g., gastrointestinal, lung) and in younger and older patients with the same cancer diagnosis who are undergoing similar types of treatments.
We have listed methods and ways in which BNA could inform symptom management research. Among the strengths of this study is the use of an extensive set of cancer symptoms from a comprehensive sample of oncology patients; the complete evaluation of BNA with different algorithms, performance, stability criteria, and contrasting sub-groups of patients; and the use of bootstrap methods and different scoring functions which can reduce overfit and improve replicability. Nevertheless, several limitations warrant consideration. We do not know what the true structure of patients' cancer symptom experiences. Also, our analyses lack the inclusion of potentially important confounding factors (e.g., specific chemotherapy regimens) that could increase or decrease the occurrence of specific symptoms. Nevertheless, BNs are exploratory tools that are well suited for hypothesis generation. Markov blankets allow us to identify factors that directly influence each symptom and can be regarded a part of a causal (mechanistic) model. Despite the fact that it is very difficult to prove causality from cross-sectional observational data, BNs can provide insights for specific causal models that could be validated in experimental data and/or in longitudinal studies. Therefore, our findings need to be verified in independent cohorts and/or in longitudinal studies.
BNs and network analysis approaches 59 are promising in detecting patterns among symptoms. The various BNA approaches have the potential to identify possible causal pathways, with predictive value inside the Markov blankets of each network. To take full advantage of this network approach in symptom management research, we need to explore its utility using additional sets of data modelling parameters (e.g., new algorithms, hyper-parameters, statistical criteria). In addition, a need exists to study multiple and diverse sub-groups of patients, based on various demographic, clinical, and molecular characteristics. The application of BNA to a broader set of parameters, similar to those that we present in these analyses, may provide new evidence regarding the mechanisms that underlie multiple co-occurring symptoms in oncology patients undergoing CTX.

7/15
Breinan et al. 60 distinguished two cultures of statistical modelling, namely algorithmic and data modelling. The first one aims to optimise the predictions of the modelling produced from a given set of inputs while treating the process itself as a black box. The second approach focuses on the identification of patterns between all the given inputs and outputs and an explanation of how the specific modelling works. In the context of data modelling for cancer symptoms, algorithmic modelling is crucial for implementing state-of-the-art predictions that can be used in clinical practice. 61 In addition, data modelling can be a useful tool to expand conceptual knowledge about the domain itself. 3,59 The use of hybrid approaches such as BNA, that combines the strengths of both analytic fields, provides additional tools for both cancer research and care. Its use on temporal data will offer more in-depth insights on the ways in which cancer symptoms interact with each other both in causal and transitional ways.

Dataset Description
This paper reports on a secondary analysis of data from a longitudinal study of the symptom experience of oncology outpatients receiving CTX. The methods for this study are described in detail in our previous publications. [62][63][64] For the BNA experiments, enrollment data from the parent, longitudinal study were analysed (n Complete =1328). Patients were eligible to participate if they: were 18 years of age or older; had a diagnosis of breast, gastrointestinal, gynecological, or lung cancer; had received CTX within the preceding four weeks; were scheduled to receive at least two additional cycles of CTX; were able to read, write, and understand English; and gave written informed consent. Patients were recruited from two Comprehensive Cancer Centers, one Veteran's Affairs hospital, and four community-based oncology programs. This study was approved by the Committee on Human Research at the University of California, San Francisco, and was implemented in accordance with the Declaration of Helsinki. Written informed consent was obtained from all patients.

Symptom dimensions
To determine the occurrence of 38 symptoms commonly associated with cancer and its treatment, a modified version of the Memorial Symptom Assessment Scale (MSAS) 65 was used. In addition to the original 32 MSAS symptoms, the following six symptoms were assessed: hot flashes, chest tightness, difficulty breathing, abdominal cramps, increased appetite, and weight gain. Using the MSAS, patients were asked to indicate whether or not they had experienced each symptom in the past week (i.e., symptom occurrence; yes= presence of a symptom, no = absence of symptom). The validity and reliability of the MSAS are well established in studies of oncology inpatients and outpatients. 65

Bayesian Networks
Bayesian Networks (BNs), also known as a Bayesian Belief Networks (BBNs) or Belief Networks, 66 are graphical models that represent the relationships among multiple variables and describe the conditional dependencies between them. They comprise a network structure in the form of a DAG and a set of conditional probabilities, one for each variable, that characterize the dependencies represented by the DAG's edges (see Figure 2).
In BNA terminology, when we have a directed edge (i.e., arc or arrow) going from node X to node Y, node X is considered to be a parent of Y. Accordingly, node Y is considered a child of X. Every variable (i.e., node) inside a DAG, given the values of its parents, is considered to be conditionally independent of the set of all its predecessors in this graph. 16,67 Based on the Markov Condition when two nodes are not connected directly through an arrow, we consider these two variables to be independent. Taking advantage of the aforementioned graphical structure and the chain rule of probability calculus the total joint probability distribution of the group of variables represented inside a BN can be expressed as the product of their conditional probabilities: 16 where x = {x 1 , ..., x n } are the variables (i.e., the nodes inside a BN) and θ = {θ 1 , ..., θ n } are the BN's parameters. Each i is the set of parameters necessary to specify the distribution of the variable x i given its parents pa(x i ). 16 The structure mentioned above can support researchers and clinicians to understand the main cause(s) of a certain problem (e.g., symptom). It can also be used to describe the probabilities of different effects (e.g., increased occurrence rates of other symptoms) given an action (i.e., presence of a core or sentinel symptom). To predict the value or behaviour of a specific node inside a DAG, we need its Markov Blanket. The Markov Blanket of a node (i.e., variable or symptom) is a set of nodes, consisting of its parent nodes, child nodes, and all of the other nodes that share a child with it. For example, in Figure 2

Bayesian Network Learning Algorithms
The process of identifying and understanding the structure of a BN from the available data is called Learning, and it requires two steps. 69 The first step, called Network Structure or Structure Learning (i.e., identification of the topology of the BN), 69 consists of finding the graph structure that encodes the conditional independencies present in the data. The second step, called Parameter Learning, deals with the estimation of the parameters of the global distribution given the identified network topology. 69 Regarding Network Structure Learning, three different types of algorithms exist for constructing BNs from observational data, namely, constraint-based, score-based, and hybrid algorithms. 51 In Parameter Learning, two main categories of algorithmic approaches exist, one that is focused on complete data (e.g., Maximum Likelihood Estimate) and the other that is focused on incomplete data (e.g., Expectation-maximization, Monte-Carlo, Gaussian approximation). 70 In this study, we focus on network structure learning.

Structural Learning Algorithms
In this study, we use three well known structural learning algorithmic implementations, one for each of the aforementioned approaches. These algorithms are the Grow-shrink (GS, constraint-based), the Hill-Climbing (HC, score-based) and the Max-Min Hill-Climbing (MMHC, hybrid). 51 Constraint-based algorithms focus on recognising the conditional independence (i.e., Markov Condition) of all of the observed variables between them. 69 They learn the network structure with conditional independence tests, such as x 2 , to determine the absence of edges between variables, and then construct a graph, adding directions to edges that satisfy the d-separation criterion. Given a causal graph, d-separation is a criterion for deciding whether a set X of variables is independent of another set Y, given a third set Z. More specifically, if X, Y, Z are three disjoint sets of variables in a BN, "Z is said to d-separate X from Y if and only if Z blocks every path from a node in X to a node in Y". 71 Thus, GS is based on the recovery of the optimal Markov Blanket of every variable X, based on pairwise independence tests. 72 Grow-shrink's name refers to its two different phases, the growing and the shrinking one. Initially, the GS algorithm starts with a variable X and an empty set S. 16,72,73 The growing phase adds variables to S as long as they are dependent on X, conditional on the rest of the variables in S. In the 9/15 subsequent shrinking phase, variables that violate the Markov blanket property of X are removed from S. Then, the algorithm identifies the local neighbourhood (direct parents and children) of each variable in the network within the Markov blanket to recover the exact structure around each variable. Edge directions are determined by examining the triples of variables using the d-separation criterion.
Score-based methods calculate a number of possible BN structures and allocate a score to each that measures how well it explains the observed set of data. 16 These algorithms select the BN structure that maximizes this score. Because the number of potential structures is exponential to the number of nodes, even datasets with few variables have too many possible BN structures to allow for an exhaustive search. That is the reason why most score-based algorithms have employed heuristic search techniques, such as HC or simulated annealing. 16,51 More specifically, a HC search starts from either an empty, full or possibly random network and existing prior knowledge can be used to seed the initial candidate network. The main loop consists of attempting every possible single-edge addition, removal, or reversal in a candidate network. The structure that increases the score, the most becomes the current candidate. The process iterates until a change in a single-edge does not increase the score.
Finally, hybrid algorithms were developed that combine the constraint-and score-based methods to maximise their advantages. They consist of two steps, called restrict and maximise. 51 During the restrict step, a constraint-based algorithm is implemented to identify the skeleton of the network using conditional independence tests. During the maximise step, a score-based method is employed to maximise the score function of this network and identify the best set of edge orientations. In this manner, MMHC first learns the skeleton of a BN using a local discovery algorithm called Max-Min Parents and Children (MMPC). MMPC consists of a forward phase and a backward phase. In the forward phase, all variables with an edge to and from a variable are selected using a heuristic function. In the backward phase, all false positives chosen in the first phase are removed. After MMPC identifies the skeleton of the BN, a greedy Bayesian-scoring HC search is used to orient the edges. The algorithm has the advantages of reliably scaling up to thousands of variables in a reasonable amount of time.

Computation, Performance, Stability, Reproducibility and Structure Assessment
We implemented each of the analyses in this study using the bnlearn R package. 74 We implemented the visualisations using the visNetwork R package 75 . The algorithms (i.e., GS, HC, MMHC) used were assessed for their performance (i.e., the goodness of fit with respect to particular score functions), stability (i.e., performance and topology across different sample sizes) and structure (i.e., discrepancies between the different algorithms).

Computation
To ensure the reliability of the identified DAGs, we bootstrapped 10.000 BNs, for each analysis, and averaged them to obtain the final, resultant network. To identify the possible true, high-confidence arcs (i.e., directed edges), we assessed the frequency of the presence of an arc (i.e., directed edge) across the aforementioned bootstrapped networks. The final visualised BNs contained the arcs that had a predefined frequency/inclusion threshold. These thresholds were based on both an ad-hoc approach and an 85% frequency, according to Sachs et al. 76 or a statistically motivated approach and statistically optimised frequency, according to Scutari et al. 50 . In the first case, for example, for the final averaged BNs we kept only the edges that appeared in at least 85% 76 of the 10.000 bootstrapped networks (i.e., at least 8.500).
In order to to keep only the edges that were significant in a specific direction, we kept the ones that pointed from symptom X to symptom Y in at least 51% of the 10.000 bootstrapped networks(i.e., at least 5.100). Finally, the selected arcs made a different contribution to the the goodness of fit of the final BNs. This contribution was represented by their width. Arcs which characterized the structure of the data with more confidence were visualised as thicker than others. The thicker an arc, the more damaging it would be for the model fit to remove it from the network. 77 You can see the full list of these, visualised networks in Online Figures 6 through 41. 49

Performance
To evaluate the averaged networks learned, on all of the combinations of the different algorithms, thresholds, and samples sizes used, we compared their post-hoc scores on a set of different scoring functions (Appendix A, Table S2). The selected functions were suitable for assessing the discrete BNs made from our categorical, symptom occurrence variables (i.e., Yes/No). 74,78 These metrics were the logarithm of the K2 score (k2), the logarithm of the Bayesian Dirichlet Equivalent score (BDE), the Bayesian Information Criterion score (BIC), the Akaike Information Criterion score (AIC) and the multinomial log-likelihood (loglik) score. The values for these scoring metrics reflect their goodness of fit, which the algorithm attempts to maximize. In Table S2, Appendix A, higher values reflect better goodness of fit.

10/15
Many of these scoring metrics are in the form of a penalised log-likehood (LL) function. 79 The LL is the log probability of the dataset D, used, given a learnt network structure B. With the assumption that the data are independent and identically distributed, the likelihood of the data D given a structure B can be calculated as: 79 where D ij is the instantiation of X i in data point D j , and PA ij is the instantiation of X i 's parents in D j .
Adding an arc to a network increases the likelihood of the network. If this arc does not provide more information it is ignored. Nevertheless, extra arcs may lead to overfitting of the training data. To address the overfitting problem these penalized LL functions add a penalty term which penalizes the complex networks. In this manner, despite the fact that the complex networks may have a very good LL score, the high penalty term will reduce their score to be below that of a less complex network.

Stability and Reproducibility
Our study examines the data-driven, exploratory usefulness of BNA for cancer symptom data. Until now, very few network analyses studies were done on this domain. Furthermore, no consensus or evidence exists on the true network structure of how co-occuring symptoms' interact with each other. 59 Finally, an inherent problem in BNA is obtaining this "true" network structure. 80 The identified structures could be sensitive to the sample size, the dataset used, and/or the specific algorithms used.
To assess the stability and reproducibility of our findings we tested their consistency among four equally sized, bootstrapped and randomly assigned subsets (n 1 =332,n 2 =664,n 3 =996,n 4 =1328). We ran 10.000 BNs for each bootstrapped and randomly assigned subset.
In Appendix A, Table S2 provides the full list of the results of these analyses. These analyses showed the stability of the identified networks as well as the reproducibility of each BNA algorithm based on our cancer symptoms' dataset.

Comparing Networks
To facilitate comparison between the BNs in this study, we compared them on their structural similarities and distances with the Hamming Distance (HD) and the Structural Hamming Distance (SHD). 74, 80-82 HD and SHD describe the changes that have to be made to a network for it to turn into the one that it is being used for comparison. HD simply counts the number of edges that are different between the skeleton of the two networks, ignoring the direction of their arcs. SHD counts how these networks differ, taking into account the direction of their arcs. It sums the counts of the edges that are present in the original structure but are missing in the comparator structure, the counts of the edges that are found in the comparator structure but are not present in the original structure, and the counts of directed edges in the comparator structure that are oriented differently.
We provide a visual side-by-side comparison of the networks, identified with all three algorithms (i.e., GS, HC, MMHC) on the complete dataset (n Complete =1328), and the two sub-groups based on gender (n Male =295,n Female =1033) and age(n Age< 65 =958, n Age>65 =370). Lacking a "gold standard" network structure for cancer symptoms' data, in order to compare the performance and stability of the GS, HC, MMHC algorithms across the four different sub-samples (n 1 =332,n 2 =664,n 3 =996,n 4 =1328) we used the networks obtained from the complete, unrandomised data set (n Complete =1328)) as the reference network. Online Figures 12 through 17, 24 through 29, and 36 through 41, 49 provide a visual overview of the structural differences between the networks identified in this study. More precisely, by taking one network as a reference network, we computed the true positive, false positive and false negative arcs in the other network. In addition, we created an interactive visualisation of these differences.

Data availability
The data used in this study will be available upon request and subject to ethics approval. All data requests should be sent to Christine Miaskowski (chris.miaskowski@ucsf.edu).