Directed acyclic graphs approach in epidemiological research: an example with NHANES II data on the relationship between skin color and heart attack

doi:10.21203/rs.3.rs-3338576/v1

Background: Statistical methods are essential in epidemiology research, but they can generate erroneous estimates when selecting variables based only on statistical criteria. The use of directed acyclic graphs (DAG) helps to understand the causal relationships between variables and to avoid biases.

Objective: Compare the estimate of the effect of skin color on heart attack obtained from two data analysis techniques: a stepwise approach based on statistical criteria and a graphical approach based on causal criteria.

Methods: Population-based cross-sectional study using data from the second National Health and Nutrition Examination Survey (NHANES II). The exposure variable was skin color (black or non-black) and the outcome was heart attack (yes or no). To identify the association between the variables, multivariate logistic regressions were carried out using the stepwise technique and the DAG-based approach. In the stepwise technique, all variables potentially related to the outcome were included in the model and a forward or backward algorithm was used. In the DAG-based approach, different possible causal models were developed between the variables, identifying confounding, mediation, and collision factors. The models were created considering self-rated health as a confounding or collider variable, and the modeling results were verified.

Results: A total of 10,351 adults were evaluated, the majority female (52.1%), aged 20 to 39 years (48.5%), and with non-black skin color (90.4%). The prevalence of heart attack was 3.0%, and 45% rated their health as good, fair or poor. Using different modeling techniques, no association was found between skin color and heart attack (p>0.05), except when self-rated health, a collider variable, was included in the models. In this case, there was an inverse and biased association between the two variables, indicating a collider bias (stepwise-backward-OR: 0.48; 95%CI: 0.33-0.70; stepwise-forward-OR: 0.64; 95%CI: 0.44-0.94).

Conclusion: Skin color was not associated with heart attack when controlling for appropriate confounding factors. However, when adjusting for self-rated health, a colliding variable, there was an inverse and distorted association between the two variables, indicating a collider bias. The DAG-based approach can avoid this bias by correctly identifying confounding factors and colliders.

Directed acyclic graphs

Causal inference

Confounding variables

Collider bias

Bias

NHANES

Epidemiology is a science that seeks to understand the determinants and consequences of phenomena related to health and disease in human populations (Rouquayrol; Goldbaum; Santana, 2013). This science uses quantitative methods to measure and compare frequencies, distributions, and associations between exposure factors and outcomes (Brachman, 1996). A relevant challenge in epidemiology is to establish causal relationships between exposure factors and outcomes, that is, to identify whether an explanatory variable can produce a change in the response variable regardless of other variables that may interfere with this relationship (Ananth; Schisterman, 2017). To overcome this challenge, appropriate data analysis techniques should account for possible confounding, effect modification, and mediation.

One of the main techniques for epidemiological data analysis is variable selection through stepwise algorithms, which consist of adding or excluding variables from the statistical model based on predefined criteria, such as the level of significance or the coefficient of determination (Hernán et al., 2002). This technique is widely used in statistical software and has the advantage of simplifying the model-fitting process (Prieto-Merino; Pocock, 2012). However, it has some limitations, such as dependence on the sample data, instability of results, violation of the assumptions of statistical tests, and introduction of bias in estimating the causal effect (Prieto-Merino; Pocock, 2012). For these reasons, some authors have criticized the indiscriminate use of the stepwise technique in epidemiological research, as it could lead to erroneous or misleading conclusions about the causal relationships between variables (Digitale; Martin; Glymour, 2022).

Directed acyclic graphs (DAGs) are an alternative to the stepwise technique. DAGs are visual representations of causal relationships between variables based on causal theory and the researcher's prior knowledge (Greenland; Pearl; Robins, 1999). Thus, DAGs can be used to identify which variables should be added to or excluded from the statistical model, considering possible confounding, effect modification, mediation, and collision (Ananth; Schisterman, 2017). This approach makes the researcher's causal hypotheses explicit, facilitating the communication of results and avoiding bias in estimating the causal effect (Greenland; Pearl; Robins, 1999; Hernán et al., 2002). Some authors have recommended the use of DAGs in epidemiological research because they can help in the preparation and evaluation of studies, in addition to the interpretation and synthesis of findings (Suttorp et al., 2015; Digitale; Martin; Glymour, 2022).

Despite the advantages of DAGs over the stepwise technique and other statistical models, few studies compare the results obtained with both approaches, showing the differences and implications for estimating the causal effect (Heinze; Wallisch; Dunkler, 2018). An example of a research question that can be addressed with both techniques is the relationship between skin color and heart attack. Skin color reflects racial inequalities in society and can be associated with several health outcomes (Dressler; Oths; Gravlee, 2005), such as heart attacks, one of the leading causes of morbidity and mortality worldwide (Roth et al., 2020; Vos et al., 2020). However, the relationship between skin color and heart attack can be influenced by other variables, such as self-rated health, which can act as a confounder, effect modifier, mediator, or collider, depending on the causal model adopted. This question can be examined, for example, using data from the second National Health and Nutrition Examination Survey (NHANES II). The NHANES II is a national health and nutrition survey conducted in the United States (US) between 1976 and 1980, which collected information on several sociodemographic, clinical, and behavioral variables from a representative sample of the adult population (NCHS, 2023).

The general objective of this study is to compare the results based on statistical criteria (stepwise) with results based on causal criteria (DAG) in estimating the causal effect of skin color on heart attack, using data from NHANES II. The specific objectives are: 1) Estimate the association between skin color and heart attack using the stepwise technique, considering several variables potentially related to the outcome; 2) Estimate the association between skin color and heart attack using the DAG-based approach, considering different possible causal models; 3) Compare the results obtained with the two approaches, analyzing the differences between the causal effect estimates and the variables selected for the adjustment of the model; 4) Discuss the advantages and limitations of both approaches and their implications for epidemiological research.

We hypothesized that the stepwise technique introduces a bias in the estimation of the causal effect of skin color on heart attack when adjusting for a collider variable, such as self-rated health. The DAG-based approach is expected to be more adequate for identifying confounding, effect modification, mediation, and collision, thereby avoiding bias in the estimation of the causal effect.

Study design

This is an observational, cross-sectional, and analytical study conducted with publicly available data from the 1976-1980 NHANES II, a unique source of national data on the health and nutritional status of the US population, aiming to determine the prevalence of major diseases and risk factors for diseases. The methods and data collection procedure behind NHANES are described in detail elsewhere (NCHS, 2023).

Population or sample studied

The NHANES II sample was composed of 27,801 individuals from 6 months to 74 years of age, of whom 25,286 were interviewed and 20,322 were examined, resulting in an overall response rate of 73.0% (NCHS, 2023).

Participants who had complete information on the variables of interest were included in this study: skin color, heart attack, self-rated health, sex, body mass index (BMI), total cholesterol, systolic and diastolic blood pressure, and geographic location. Participants who were younger than 18 years old or older than 74 years old, who had a physical or mental disability that prevented data collection, or who had a medical condition that could interfere with the results were excluded. The final sample of this study consisted of 10,351 participants.

Outcome variable

The outcome variable was a heart attack, measured using the following question: “Has a doctor ever told you that you had a heart attack?”. The answer options were: yes (1) or no (0).

Exposure variable

The exposure variable was skin color, and the answer options were: white (1), black (2), or other (3). For analysis purposes, the variable was dichotomized into black (1) or non-black (0), the latter being the reference category.

Covariates

Variables potentially related to the outcome were: sex (male or female), age (20-29, 30-39, 40-49, 50-59, 60-69, and ≥ 70 years), location (urban or rural), diabetes (no or yes), BMI, self-rated health, systolic and diastolic blood pressure, and high blood pressure. Self-rated health was assessed using the question: “Would you say your health in general is?” Responses included five categories: excellent, very good, good, fair, and poor. For this study, the responses were re-categorized as excellent/very good (reference category), good (moderate), and fair/poor, based on other studies that analyzed self-rated health (Brett O’Hara & Kyle Caswell, 2010; White, Philogene, Fine, & Sinha, 2009). BMI was evaluated as a continuous variable, calculated as the ratio between weight in kilograms and the square of height in meters. Systolic and diastolic blood pressure were measured using a sphygmomanometer and expressed in millimeters of mercury, and high blood pressure was defined as systolic blood pressure ≥ 140 mmHg or diastolic blood pressure ≥ 90 mmHg.
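As a minimal illustration of the two derived covariates described above, the following Python sketch computes BMI and the high blood pressure indicator from the stated definitions. The function and parameter names are hypothetical and are not taken from the NHANES II dataset or from the authors' Stata code.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """BMI as weight in kilograms divided by the square of height in meters."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def has_high_blood_pressure(systolic_mmhg: float, diastolic_mmhg: float) -> bool:
    """High blood pressure: systolic >= 140 mmHg or diastolic >= 90 mmHg."""
    return systolic_mmhg >= 140 or diastolic_mmhg >= 90
```

For instance, a hypothetical participant weighing 70 kg with a height of 1.75 m would have a BMI of about 22.9 kg/m², and a reading of 120/90 mmHg would be classified as high blood pressure because the diastolic threshold is met.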

Statistical analysis

Data were analyzed using Stata software version 16.0. The NHANES II database was loaded in Stata using the 'webuse nhanes2' command. The sampling plan was then checked using the 'svydescribe' command, which confirmed that the 'finalwgt' variable was the sampling weight, the 'strata' variable was the stratum, and the 'psu' variable was the primary sampling unit. All statistical analyses were performed considering a 5% significance level and adopting procedures for studies with complex sampling designs (cluster sampling and multiple stages). We incorporated the prefix 'svy' (survey commands) into the syntax, which accounts for the effects of stratification and clustering derived from the complex sampling design, along with the sample weighting, to extrapolate the results to the population evaluated.

The descriptive analyses included absolute and relative frequencies for categorical variables and measures of central tendency and dispersion for continuous variables. For the categorical variables, the data were tabulated using the following command: 'svy: tabulate variable_category variable_outcome, ci col', which produced a table with the absolute and relative frequencies, 95% confidence intervals (95%CI), and Pearson's chi-squared tests for each category. For the continuous variables, the data were summarized using the following commands: 'svy: mean continuous_variable, over (outcome_variable)' and 'regress continuous_variable, outcome_variable', which produced a table with the means, 95%CI, and the test for statistical significance.

For the multivariate analyses, two approaches were used: the stepwise technique and the DAG-based approach. The stepwise technique is a method that uses predefined criteria to select the variables that remain in the statistical model. In this study, two algorithms were used: forward and backward. In the forward algorithm, the empty model does not contain any variables, and the inclusion of each variable potentially related to the outcome is tested, using the p-value as the model fit criterion. The variable (if any) with the lowest p-value of less than 0.05 is included in the model and the process continues until no other variable reaches this level of significance. In the backward algorithm, the initial model contains all the variables potentially related to the outcome that had a p-value of less than 0.20 in the univariate analysis. These variables are tested for removal from the model, using the p-value as the model fit criterion. The variable (if any) with the highest p-value and which is greater than 0.05 is removed from the model and the process is repeated until all the remaining variables have a p-value of less than 0.05.
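The selection logic of the two algorithms described above can be sketched in Python as follows. This is only an illustration of the stopping rules, not the authors' Stata implementation: the `p_values` callback is a hypothetical stand-in for fitting a logistic regression and returning one p-value per variable, the toy p-values are invented, and the sketch does not model forcing the exposure into the model or the p < 0.20 univariate screening step.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical callback: given the variables in a model, return the p-value
# of each variable in that model (e.g. from a fitted logistic regression).
PValueFn = Callable[[Sequence[str]], Dict[str, float]]


def forward_selection(candidates: Sequence[str], p_values: PValueFn,
                      alpha: float = 0.05) -> List[str]:
    """Start from the empty model; add the lowest-p variable while p < alpha."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining:
        # p-value of each remaining variable when added to the current model
        trial = {v: p_values(selected + [v])[v] for v in remaining}
        best = min(trial, key=trial.get)
        if trial[best] >= alpha:
            break  # no remaining variable reaches the significance level
        selected.append(best)
        remaining.remove(best)
    return selected


def backward_elimination(initial: Sequence[str], p_values: PValueFn,
                         alpha: float = 0.05) -> List[str]:
    """Start from the full model; drop the highest-p variable while p > alpha."""
    selected = list(initial)
    while selected:
        current = p_values(selected)
        worst = max(selected, key=lambda v: current[v])
        if current[worst] <= alpha:
            break  # every remaining variable is within the significance level
        selected.remove(worst)
    return selected


# Toy p-values (invented for illustration; they ignore the rest of the model).
TOY_P = {"age": 0.001, "sex": 0.03, "rural": 0.40}


def toy_p_values(model: Sequence[str]) -> Dict[str, float]:
    return {v: TOY_P[v] for v in model}
```

With the toy p-values, both procedures keep `age` and `sex` and discard `rural`. Note that neither rule uses any information about the causal role of a variable, which is why a collider with a small p-value can be retained.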

The DAG-based approach is a method that uses causal criteria to select the variables that remain in the statistical model. In this study, different possible causal models between the variables were constructed using DAGitty software version 3.0. To do this, causal connections represented by arrows were established between the variables. Each variable in the DAG was represented with a rectangle, and each color had a different meaning: green, exposure variable; blue, outcome variable; light red, potential confounding variables; white, collider variable (Figure 1). To avoid spurious associations and unnecessary adjustments, the backdoor criterion was adopted to select the minimal set of confounding variables. A set of variables is considered sufficient for confounding control if the variables it contains block all non-causal paths linking the exposure to the outcome. In addition, the set must be minimal, as the inclusion of unnecessary variables not only risks causing collider bias but also reduces the precision of the estimates.
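The path-blocking rule described above can be illustrated with a short Python sketch. It is a simplified stand-in for what DAGitty does, written under stated assumptions: the graph below is a hypothetical, reduced version of the causal model (one mediator and one collider), and all function and variable names are invented for illustration.

```python
from typing import Dict, List, Set

# Hypothetical, simplified DAG (parent -> set of children).
DAG: Dict[str, Set[str]] = {
    "skin_color": {"high_bp", "self_rated_health"},
    "high_bp": {"heart_attack"},
    "heart_attack": {"self_rated_health"},
}


def descendants(dag: Dict[str, Set[str]], node: str) -> Set[str]:
    """All nodes reachable from `node` by following arrows."""
    seen: Set[str] = set()
    stack = [node]
    while stack:
        for child in dag.get(stack.pop(), set()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def all_paths(dag: Dict[str, Set[str]], start: str, end: str) -> List[List[str]]:
    """All simple paths between start and end, ignoring arrow direction."""
    nodes = set(dag) | {c for children in dag.values() for c in children}
    neighbours = {n: set(dag.get(n, set())) for n in nodes}
    for parent, children in dag.items():
        for child in children:
            neighbours[child].add(parent)
    paths: List[List[str]] = []

    def walk(path: List[str]) -> None:
        if path[-1] == end:
            paths.append(path)
            return
        for nxt in sorted(neighbours[path[-1]]):
            if nxt not in path:
                walk(path + [nxt])

    walk([start])
    return paths


def is_causal(dag: Dict[str, Set[str]], path: List[str]) -> bool:
    """True if every arrow on the path points from exposure towards outcome."""
    return all(b in dag.get(a, set()) for a, b in zip(path, path[1:]))


def is_open(dag: Dict[str, Set[str]], path: List[str], adjusted: Set[str]) -> bool:
    """Apply the blocking rules to each interior node of the path."""
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        is_collider = mid in dag.get(prev, set()) and mid in dag.get(nxt, set())
        if is_collider:
            # A collider blocks the path unless it or a descendant is adjusted for.
            if not (({mid} | descendants(dag, mid)) & adjusted):
                return False
        elif mid in adjusted:
            # Chains and forks are blocked by adjusting for the middle node.
            return False
    return True


def open_noncausal_paths(dag: Dict[str, Set[str]], exposure: str, outcome: str,
                         adjusted: Set[str]) -> List[List[str]]:
    return [p for p in all_paths(dag, exposure, outcome)
            if not is_causal(dag, p) and is_open(dag, p, adjusted)]
```

In this reduced graph, `open_noncausal_paths(DAG, "skin_color", "heart_attack", set())` returns an empty list, so no adjustment is needed for the total effect, whereas passing `{"self_rated_health"}` as the adjustment set opens the path through the collider.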

To design the DAG and select the path between the explanatory variable (skin color), the outcome (heart attack), and the probable collider (self-rated health), a portfolio based on the Hill criteria was prepared to support the path of analysis; it is available in the Supplementary Material (Appendices 1, 2, and 3). The Hill criteria are a set of nine aspects that can be used to assess the strength of evidence of a causal relationship between two variables. These aspects are the strength of association, consistency, specificity, temporality, biological gradient, plausibility, coherence, experimental evidence, and analogy. These criteria were applied to assess three paths: 1) skin color to heart attack; 2) heart attack to self-rated health; and 3) skin color to self-rated health. This made it possible to check which scenario is more realistic. It was found that self-rated health can be a mediator or a collider between skin color and heart attack.

After that, in the multivariate analyses, seven logistic regression models were performed using two approaches: the stepwise technique and the DAG-based approach. The seven models were: Model 1: univariate, without adjusting for any variable: 'svy: logistic heartattack i.black'. Model 2: stepwise backward, without considering the colliding variable (self-rated health), adjusted for location, diabetes, and blood pressure: 'svy: logistic heartattack i.black i.rural i.diabetes i.highbp'. Model 3: stepwise forward, without considering the colliding variable (self-rated health), adjusted for sex, age, diabetes, and BMI: 'svy: logistic heartattack i.black i.sex i.age i.diabetes c.bmi'. Model 4: based on DAG, for the total effect of skin color on heart attack, without adjusting for any variable, as there was no spurious path opened by the backdoor criterion: 'svy: logistic heartattack i.black'. Model 5: based on DAG, for the direct effect of skin color on heart attack, adjusted for mediators (blood pressure, BMI, diabetes, and location): 'svy: logistic heartattack i.black i.highbp c.bmi i.diabetes i.rural'. Model 6: stepwise backward, considering the collider variable (self-rated health), adjusted for self-rated health and blood pressure: 'svy: logistic heartattack i.black i.hlthstat i.highbp'. Model 7: stepwise forward, considering the collider variable (self-rated health), adjusted for self-rated health, sex, and age: 'svy: logistic heartattack i.black i.hlthstat i.sex i.age'.

Characteristics of the participants

This study included 10,351 participants, the majority female (52.1%), aged between 20 and 39 years (48.5%), with non-black skin color (90.4%), and living in rural areas (68.3%). Regarding health conditions, 36.9% had high blood pressure, 3.0% had a heart attack, the prevalence of diabetes was 3.4%, and the average BMI was 25.3 kg/m². Concerning self-rated health, 27.5% reported excellent or very good health, 28% good health, 12.3% fair health, and 4.7% poor health (Table 1).

Table 1 also shows the association of the covariates with the explanatory variable (skin color) and the outcome (heart attack). Black individuals more often lived in rural areas and had a higher frequency of diabetes and high blood pressure, and a higher BMI, than non-black individuals (p < 0.05). Regarding heart attack, older individuals, men, those who lived in urban areas, and those who had diabetes, high BMI, and high blood pressure had a higher prevalence of heart attack.

In addition, black individuals were less likely to rate their health as excellent or very good. Similarly, individuals who rated their health as good, fair, or poor had a higher prevalence of heart attacks than individuals who rated their health as very good or excellent (Table 1 and Figure 2).

Association between skin color and heart attack

Table 2 shows the association between skin color and heart attack. Different modeling techniques were used to select the confounding factors: univariate, stepwise backward or forward, and DAG-based. Table 2 summarizes the results of the seven models tested.

In models 1 to 5, there was no significant association between skin color and heart attack (p>0.05). These models were adjusted by different sets of variables, but none of them included self-rated health, which was considered a colliding variable. In models 6 and 7, from a stepwise backward (Odds Ratio [OR]: 0.48; 95%CI: 0.33-0.70) and forward (OR: 0.64; 95%CI: 0.44-0.94) selection approach, which included self-rated health as a confounding factor, there was a significant inverse association between skin color and heart attack, indicating that black individuals were less likely to have a heart attack than non-black individuals. However, this association was distorted by the presence of the collider variable, which introduced a collision bias.

The results of the present study showed a significant difference between the two approaches (stepwise and DAG) in estimating the causal effect of skin color on heart attack. In the stepwise technique, considering the collider variable (self-rated health), skin color was associated with heart attack, indicating that black people had lower odds of having a heart attack than non-black people. In the DAG-based approach, with no collider bias, skin color was not associated with heart attack, evidencing no difference in the odds of having a heart attack between black and non-black people. The difference between the two approaches is that the stepwise technique, by adjusting for a collider variable (self-rated health), introduced a bias in the estimation of the causal effect. The DAG-based approach avoided this bias by correctly identifying the confounding factors and not adjusting for the collider.

The evaluation of the methods used in this study reveals some advantages and disadvantages. The advantage of the stepwise technique is that it is simple and easy to apply, as it uses predefined criteria to select the variables that remain in the statistical model. The disadvantage of the stepwise technique is that it can introduce a bias in the causal effect estimation by adjusting for variables that are not confounding factors but rather colliders or mediators (Hernán; Hernández-Díaz; Robins, 2004). Regarding the DAG approach, in addition to being a technique used to identify and select confounding variables, the DAG makes explicit the a priori hypotheses about the causal relationships between the variables being studied since it is based on causal theory and the researcher's prior knowledge (Schipf et al., 2011; Werneck, 2016). The disadvantage of the DAG-based approach is that it depends on the quality and availability of the data, as well as the researcher's ability to build and test possible causal models (Law; Green; Ellison, 2012).

Understanding a particular health-disease process, such as cardiovascular disease (CVD), is a challenge, as it is usually the result of a complex interaction between genetic, individual, sociocultural, economic, environmental, and behavioral factors (Patwardhan; Mutalik; Tillu, 2015). Cardiovascular diseases are the leading cause of disease burden in the world, contributing to deaths and disability and, consequently, loss of healthy life years (Nascimento et al., 2022; Roth et al., 2020). Some hypotheses suggest that race is associated with CVD. One is that race reflects genetic and epigenetic differences that can influence metabolism, inflammation, coagulation, and vascular response (Ho et al., 2022). For example, some studies have shown that genetic variants in genes related to blood pressure, cholesterol, and glucose are more frequent in certain racial groups (Harrell; Burford; Davis, 2022). Another hypothesis is that race is a marker of environmental, social, and behavioral factors that can affect cardiovascular health (Ho et al., 2022). For example, some studies have shown that psychosocial stress, racial discrimination, poverty, pollution, smoking, diet, and a sedentary lifestyle are more prevalent in certain racial groups (Paradies et al., 2015). One way of measuring the impact of these factors is through the social determinants of health, which are the conditions in which people are born, live, work, and age. Some studies have shown that social determinants of health are independent predictors of cardiovascular events and mortality in different racial groups (Wang; Li; Zheng, 2023). However, most studies only consider the indirect effects between race and cardiovascular risk, and not its direct effects. Thus, race may be an indicator of the socioeconomic and environmental conditions that influence the development of cardiovascular disease, rather than a direct causal factor.

One hypothesis regarding self-rated health is that it acts as a mediator between skin color and heart attack. Individuals of black skin color might self-assess their health more negatively than non-black individuals because they face socioeconomic conditions, racism, and other factors that affect social equity (Borrell et al., 2006). Thus, this perception of poor health could reflect their health condition and, consequently, be associated with an increased risk of cardiovascular outcomes in the long term. However, for this hypothesis to hold, a significant association between skin color and heart attack would be expected in the crude analysis, unadjusted for other factors. Furthermore, this association should be attenuated by including self-rated health in the model, as we would be controlling for the mediating effect. The results of our study do not show this pattern. In the univariate analysis, no association was found between skin color and heart attack. Nonetheless, when we adjust for self-rated health, an effect that did not exist before appears, indicating that this variable is probably a confounder or a collider. To define whether it is a confounder or a collider, we can use the DAG approach. To be a confounder, self-rated health should cause both the explanatory variable (skin color) and the outcome (heart attack), described as a fork in the DAG approach. However, this scenario is not plausible since skin color is only determined by hereditary genetic characteristics. Therefore, the possibility of it being a collider remains. A collider is a variable resulting from the effect of both variables; that is, it is caused by the explanatory variable (skin color) and the outcome variable (heart attack), an inverted fork in the DAG approach. In this case, adjusting the model for this variable opens up a spurious non-causal path between skin color and heart attack, which was previously closed off by the collider. This is the situation observed in our study. To assess the causal plausibility of these paths, we used the Hill criteria, which are a set of principles proposed by Austin Bradford Hill to examine the evidence of a causal relationship between two variables (Fedak et al., 2015). We found that the path that best fits Hill's nine criteria is the one that considers self-rated health as a collider between skin color and heart attack. This reinforces the importance of using graphical tools, such as the DAG, to represent causal relationships between variables and discuss the implications of these relationships clearly and transparently for readers.
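The mechanism described above can be made concrete with a small numerical sketch in Python. All probabilities below are assumptions chosen for illustration (they only loosely echo the sample proportions and are not estimated from NHANES II): exposure and outcome are generated independently, and poor self-rated health is made more likely by each of them. The crude odds ratio is then 1, while the odds ratio within each stratum of the collider falls below 1.

```python
from itertools import product

# Illustrative assumptions, not estimates from the study data.
P_BLACK = 0.10          # probability of exposure (black skin color)
P_HEART_ATTACK = 0.03   # probability of outcome, independent of exposure


def p_poor_health(x: int, y: int) -> float:
    """Assumed probability of poor self-rated health given exposure x, outcome y."""
    return 0.2 + 0.2 * x + 0.5 * y


def joint_probability(x: int, y: int, s: int) -> float:
    """P(X=x, Y=y, S=s) under the collider structure X -> S <- Y."""
    px = P_BLACK if x else 1 - P_BLACK
    py = P_HEART_ATTACK if y else 1 - P_HEART_ATTACK
    ps = p_poor_health(x, y) if s else 1 - p_poor_health(x, y)
    return px * py * ps


def odds_ratio(cells: dict) -> float:
    """Odds ratio from a 2x2 table of probabilities keyed by (x, y)."""
    return (cells[(1, 1)] * cells[(0, 0)]) / (cells[(1, 0)] * cells[(0, 1)])


def crude_or() -> float:
    """Exposure-outcome odds ratio ignoring self-rated health."""
    cells = {(x, y): sum(joint_probability(x, y, s) for s in (0, 1))
             for x, y in product((0, 1), repeat=2)}
    return odds_ratio(cells)


def stratum_or(s: int) -> float:
    """Exposure-outcome odds ratio within one level of self-rated health."""
    cells = {(x, y): joint_probability(x, y, s)
             for x, y in product((0, 1), repeat=2)}
    return odds_ratio(cells)


if __name__ == "__main__":
    print(f"crude OR: {crude_or():.2f}")
    print(f"OR among poor self-rated health: {stratum_or(1):.2f}")
    print(f"OR among good self-rated health: {stratum_or(0):.2f}")
```

Under these assumed probabilities, conditioning on the collider produces an inverse association in both strata even though none exists in the whole population, which is the same qualitative pattern as the shift from Models 1 to 5 to Models 6 and 7.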

The interpretation of the results of this study considers the hypotheses, objectives, and theory. The hypothesis of this study was confirmed, as it was found that the stepwise technique introduced a bias in the estimate of the causal effect of skin color on heart attack, by adjusting for a collision variable. In addition to that, the objective of this study was achieved, as the results obtained with the two approaches were compared, analyzing the differences in the estimates of the causal effect and the variables selected for adjusting the model. The theory underpinning this study was causal theory, which provides the concepts and methods for identifying and estimating causal effects between variables, besides providing the tools to build and test possible causal models, such as DAGs.

This study has some limitations to be considered when interpreting the results. One of the limitations of this study is that it was based on secondary data, which may contain errors of measurement or recording of variables. Another limitation is the cross-sectional design, which does not allow the temporality between variables to be established.

Despite the limitations mentioned above, we should consider some of the strengths of our study. A notable strength of our study is the application of Hill's causality criteria, which increases the credibility of our conclusions regarding assessing the chain of causality between skin color, self-rated health, and heart attacks. Furthermore, the use of a DAG as a theoretical framework provides a structured approach to modeling the complex relationships between variables, contributing to the robustness and analytical clarity of our study. Moreover, using a database with a large and representative sample of the US adult population increases our results’ statistical power and validity, making them more applicable and impactful.

Finally, the knowledge gaps and open questions observed in this study suggest several directions for future research: studies that measure the variables with greater precision and reliability; studies with longitudinal data, which would allow the temporality between the variables to be established; studies with data from different populations, to assess the generalizability of the results; and studies with more variables, to explore other factors potentially related to the causal relationship between skin color and heart attack.

According to the proposed causal model, the results showed that skin color was not associated with a heart attack when controlling for the appropriate confounding factors. However, when self-rated health, a collider variable, was included in the stepwise models, there was an inverse and distorted association between the two variables, indicating a collider bias. This bias can lead to erroneous or misleading conclusions about the causal relationships between variables when not identified and avoided. The DAG-based approach can avoid this bias by correctly identifying confounding factors and colliders based on causality criteria.

This study contributed to epidemiological research by demonstrating the importance of the DAG-based approach in analyzing causal relationships and addressing a relevant public health issue related to racial inequalities in society. The findings highlight the importance of the use of DAGs by researchers in the planning and evaluation of their studies, as well as in the interpretation and synthesis of their results, as DAGs can generate relevant insights, even in situations where it is not possible to identify a set of confounding factors.

Funding

This study was supported by the Federal University of Ouro Preto (UFOP), the Brazilian Council for Scientific and Technological Development (CNPq), the Coordination for the Improvement of Higher Education Personnel-Brazil (CAPES), and the Foundation for Research Support of the State of Minas Gerais (FAPEMIG) through M.Sc. and Ph.D. student scholarships.

Acknowledgments

The authors gratefully acknowledge the support of the Federal University of Ouro Preto (UFOP) and the Research and Teaching Group in Nutrition and Collective Health (GPENSC). We extend our sincere thanks to the NHANES participants and researchers for generously providing the necessary data for this study. In addition, we would like to thank researcher Rafael Vieira Duarte for his contributions to our research discussions. His support was fundamental to the success of this study.

Conflict of interest

The authors declare no conflicts of interest related to this article.

Authors' contributions

The authors confirm their contribution to the paper as follows: study conception and design: LAAMJ; data collection: LAAMJ; analysis of data: LAAMJ, BCRB, WCO, MCP, MCAV, DFB, and TO; interpretation of results: LAAMJ, BCRB, WCO, MCP, MCAV, DFB, and TO; draft manuscript: LAAMJ; preparation: LAAMJ, BCRB, WCO, MCP, MCAV, DFB, and TO. All authors reviewed the results and approved the final version of the manuscript.

Data availability

The data used in this study were taken from the National Health and Nutrition Examination Survey (NHANES II) database. The public data of NHANES II are available at https://wwwn.cdc.gov/nchs/nhanes/ and can be loaded in Stata using the command: "webuse nhanes2".

Ananth, C. V.; Schisterman, E. F. Confounding, causality, and confusion: the role of intermediate variables in interpreting observational studies in obstetrics. American Journal of Obstetrics and Gynecology, v. 217, n. 2, p. 167-175, 2017.
Borrell, Luisa N. et al. Self-reported health, perceived racial discrimination, and skin color in African Americans in the CARDIA study. Social science & medicine, v. 63, n. 6, p. 1415-1427, 2006.
Brachman, P. S. Epidemiology. In: Baron S, editor. Medical Microbiology. 4th edition. Galveston (TX): University of Texas Medical Branch at Galveston; 1996. Chapter 9. Available from: https://www.ncbi.nlm.nih.gov/books/NBK7993/
Cortes, T. R.; Faerstein, E.; Struchiner, C. J. Utilização de diagramas causais em epidemiologia: um exemplo de aplicação em situação de confusão. Cadernos de Saúde Pública, v. 32, n. 8, p. e00103115, 2016.
Digitale, Jean C.; Martin, Jeffrey N.; Glymour, Medellena Maria. Tutorial on directed acyclic graphs. Journal of Clinical Epidemiology, v. 142, p. 264-267, 2022.
Dressler, W. W.; Oths, K. S.; Gravlee, C. C. Race and ethnicity in public health research: models to explain health disparities. Annual Review of Anthropology, v. 34, p. 231-252, 2005.
Fedak, Kristen M. et al. Applying the Bradford Hill criteria in the 21st century: how data integration has changed causal inference in molecular epidemiology. Emerging themes in epidemiology, v. 12, p. 1-9, 2015.
Greenland, S.; Pearl, J.; Robins, J. M. Causal diagrams for epidemiologic research. Epidemiology, p. 37-48, 1999.
Harrell, Camara Jules P.; BURFORD, Tanisha I.; DAVIS, Renee. From Race to Racism in the Study of Cardiovascular Diseases: Concepts and Measures. In: Handbook of Cardiovascular Behavioral Medicine. New York, NY: Springer New York, 2022. p. 207-230.
Heinze, G.; Wallisch, C.; Dunkler, D. Variable selection–a review and recommendations for the practicing statistician. Biometrical Journal, v. 60, n. 3, p. 431-449, 2018.
Hernán, M. A. et al. Causal knowledge as a prerequisite for confounding evaluation: an application to birth defects epidemiology. American Journal of Epidemiology, v. 155, n. 2, p. 176-184, 2002.
Hernán, Miguel A.; HERNÁNDEZ-DÍAZ, Sonia; ROBINS, James M. A structural approach to selection bias. Epidemiology, p. 615-625, 2004.
Ho, Frederick K. et al. Ethnic differences in cardiovascular risk: examining differential exposure and susceptibility to risk factors. BMC medicine, v. 20, n. 1, p. 1-10, 2022.
Kabad, Juliana Fernandes; Bastos, João Luiz; Santos, Ricardo Ventura. Raça, cor e etnia em estudos epidemiológicos sobre populações brasileiras: revisão sistemática na base PubMed. Physis: Revista de Saúde Coletiva, v. 22, p. 895-918, 2012.
Law, Graham R.; Green, Rosie; Ellison, George Th. Diagramas de confusão e de caminho causal. In: Métodos modernos de epidemiologia . Dordrecht: Springer Holanda, 2012. p. 1-13.
Lunkes, Luciana Crepaldi, et al. Fatores socioeconômicos relacionados às doenças cardiovasculares: uma revisão. Hygeia: Revista Brasileira de Geografia Médica e da Saúde, v. 14, n. 28, p. 50, 2018.
Nascimento, Bruno Ramos et al. Carga de Doenças Cardiovasculares Atribuível aos Fatores de Risco nos Países de Língua Portuguesa: Dados do Estudo “Global Burden of Disease 2019”. Arquivos Brasileiros de Cardiologia, v. 118, p. 1028-1048, 2022.
National Center for Health Statistics (NCHS). NHANES II. Centers for Disease Control and Prevention, Atlanta, GA: 2023. Available in: https://wwwn.cdc.gov/Nchs/Nhanes/nhanes2/default.aspx. Acessed in: 03 Sep. 2023.
O’Hara Brett, & Caswell Kyle. (2010). Health Status, Health Insurance, and Medical Services Utilization: 2010. Current Population Reports, U.S. Census Bureau, Washington, DC, 2012.
Paradies, Yin et al. Racism as a determinant of health: a systematic review and meta-analysis. PloS one, v. 10, n. 9, p. e0138511, 2015.
Patwardhan, B., Mutalik, G., & Tillu, G. (2015). Concepts of Health and Disease. Integrative Approaches for Health, 53–78. doi:10.1016/b978-0-12-801282-6.00003-6
Prieto-Merino, D.; Pocock, S. J. The science of risk models. European Journal of Preventive Cardiology, v. 19, n. 2_suppl, p. 7-13, 2012.
Roth, Gregory A. et al. Global burden of cardiovascular diseases and risk factors, 1990–2019: update from the GBD 2019 study. Journal of the American College of Cardiology, v. 76, n. 25, p. 2982-3021, 2020.
Rouquayrol, M.Z.; Goldbaum, M.; Santana, E.W. de P. Epidemiologia, História Natural e Prevenção de Doenças. In: Rouquayol, M.Z.; Silva, M.G. da. (Org.). Epidemiologia & Saúde. 7. Ed. Rio de Janeiro: MedBook, 2013. p. 11-24.
Schipf, S. et al. Directed acyclic graphs (DAGs)-the application of causal diagrams in epidemiology. Gesundheitswesen (Bundesverband der Arzte des Offentlichen Gesundheitsdienstes (Germany)), v. 73, n. 12, p. 888-892, 2011.
Suttorp, M. M. et al. Graphical presentation of confounding in directed acyclic graphs. Nephrology Dialysis Transplantation, v. 30, n. 9, p. 1418-1423, 2015.
Vos, T. et al. Global burden of 369 diseases and injuries in 204 countries and territories, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019. The Lancet, v. 396, n. 10258, p. 1204-1222, 2020.
Werneck, Guilherme L. Diagramas causais: A epidemiologia brasileira de volta para o futuro. Cadernos de Saúde Pública, v. 32, p. e00120416, 2016.
White AM, Philogene GS, Fine L, & Sinha S (2009). Social Support and Self-Reported Health Status of Older Adults in the United States. American journal of public health, 99(10), 1872–1878. doi: 10.2105/AJPH.2008.146894

Table 1. Characteristics of NHANES II participants and association with skin color and heart attack
		Skin color		Heart attack
	Total, % (95%CI)	Black, % (95%CI)	p-value*	Yes, % (95%CI)	p-value*
Sociodemographic
Sex
Male	47.9 (46.8-49.1)	45.5 (41.4-49.8)	0.245	66.2 (60.4-71.5)	<0.001
Female	52.1 (50.9-53.2)	54.5 (50.2-58.6)	0.245	33.8 (28.5-39.6)	<0.001
Age
20 to 29 years	28.1 (26.6-29.5)	27.5 (26.1-29.0)	0.391	0.4 (0.1-0.3)	<0.001
30 to 39 years	20.4 (19.3-21.6)	20.3 (19.0-21.7)		1.4 (0.5-4.1)
40 to 49 years	16.8 (16.0-17.7)	17.0 (16.0-17.9)		8.5 (4.7-15.0)
50 to 59 years	16.7 (15.7-17.8)	16.8 (15.7-18.0)		30.8 (24.8-37.7)
60 to 69 years	13.4 (12.6-14.2)	13.7 (12.9-14.5)		40.6 (35.8-45.6)
≥ 70 years	4.6 (4.1-5.3)	4.7 (4.1-5.4)		18.3 (15.1-22.0)
Location
Rural	68.3 (63.8-72.4)	88.8 (81.8-93.3)	< 0.001	61.7 (54.4-68.5)	0.013
Urban	31.7 (27.6-36.2)	11.2 (6.7-18.2)	< 0.001	38.3 (31.5-45.6)	0.013
Health conditions
Self-reported health
Excellent	27.5 (26.1-28.9)	16.0 (12.8-19.8)	<0.001	4.6 (2.7-7.9)	<0.001
Very good	27.5 (26.4-28.6)	19.4 (16.6-22.6)		7.9 (5.2-12.0)
Good	28.0 (26.9-29.0)	34.9 (32.1-37.9)		25.3 (20.0-31.4)
Fair	12.3 (11.4-13.2)	19.7 (16.7-23.0)		32.8 (28.6-37.3)
Poor	4.7 (4.0-5.6)	10.0 (8.2-12.3)		29.4 (23.1-36.7)
Diabetes
No	96.6 (96.2-96.9)	94.1 (92.7-95.2)	<0.001	88.7 (85.5-91.4)	<0.001
Yes	3.4 (3.1-3.8)	5.9 (4.8-7.3)	<0.001	11.3 (8.6-14.5)	<0.001
Body mass index (BMI)
BMI, kg/m², mean (95%CI)	25.3 (25.2-25.4)	26.5 (26.0-27.1)	<0.001	26.6 (25.9-27.3)	<0.001
Blood pressure
Systolic blood pressure, mmHg, mean (95%CI)	126.9 (125.7-128.2)	128.6 (125.4-128.1)	0.061	139.1 (125.4-127.8)	<0.001
Diastolic blood pressure, mmHg, mean (95%CI)	81.0 (80.0-82.1)	83.2 (81.9-84.5)	<0.001	84.2 (82.7-85.8)	<0.001
High blood pressure
No	63.1 (60.2-66.0)	56.5 (52.1-60.8)	0.002	43.1 (36.9-49.6)	<0.001
Yes	36.9 (34.0-39.8)	43.5 (39.3-47.9)	0.002	56.9 (50.4-63.1)	<0.001
* Pearson's chi-square test was performed. High blood pressure: systolic blood pressure ≥ 140 mmHg or diastolic blood pressure ≥ 90 mmHg.
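The high blood pressure definition in the footnote to Table 1 can be written as a one-line rule. The sketch below is only an illustration: the function name `has_high_blood_pressure` is hypothetical, and it assumes both readings are given in mmHg.

```python
def has_high_blood_pressure(systolic: float, diastolic: float) -> bool:
    """Classify a reading with the Table 1 rule.

    Assumes both values are in mmHg. The thresholds (140 and 90) are the
    ones stated in the table footnote; the function name is hypothetical.
    """
    return systolic >= 140 or diastolic >= 90


if __name__ == "__main__":
    # Either threshold alone is enough to classify a reading as high.
    for reading in [(120, 80), (140, 80), (120, 90)]:
        print(reading, has_high_blood_pressure(*reading))
```

Because the rule uses "or", a participant is classified as hypertensive when either the systolic or the diastolic value reaches its threshold.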

Table 2. Association between black skin color and heart attack adjusted for factors selected by stepwise methods (backward and forward) and by directed acyclic graphs. NHANES II.
	Skin color
Models	Non-black (reference), OR	Black, OR	95%CI	p-value
Model 1: Univariate	1.00	0.81	0.57-1.16	0.245
Model 2: Stepwise backward without collider variable	1.00	0.80	0.57-1.13	0.194
Model 3: Stepwise forward without collider variable	1.00	0.90	0.63-1.29	0.555
Model 4: DAG – Total effect	1.00	0.81	0.57-1.16	0.245
Model 5: DAG – Direct effect	1.00	0.74	0.52-1.06	0.095
Model 6: Stepwise backward with collider variable	1.00	0.48	0.33-0.70	<0.001
Model 7: Stepwise forward with collider variable	1.00	0.64	0.44-0.94	0.024
DAG: directed acyclic graph. Odds ratios (OR) and 95% confidence intervals (95%CI) were estimated for each model using logistic regression. The outcome variable was heart attack, coded as 0 (no) or 1 (yes). The explanatory variable was skin color, with non-black as the reference category and black as the test category. Seven models were tested: model 1, univariate; model 2, stepwise backward selection without the collider variable, adjusted for location, diabetes and blood pressure; model 3, stepwise forward selection without the collider variable, adjusted for sex, age, diabetes and body mass index; model 4, minimal adjustment set identified by the DAG for the total effect of skin color on heart attack, which indicated that no adjustment for confounders was necessary because no spurious backdoor path was open; model 5, minimal adjustment set identified by the DAG for the direct effect of skin color on heart attack, which indicated adjustment for the mediators blood pressure, body mass index, diabetes and location; model 6, stepwise backward selection including the collider variable, adjusted for self-reported health and blood pressure; and model 7, stepwise forward selection including the collider variable, adjusted for self-reported health, sex and age. In the backward stepwise procedure, variables with p-value < 0.20 in the univariate analysis were entered, and variables with p-value > 0.05 in the multivariate analysis were excluded.
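The pattern in models 6 and 7, where an inverse association appears only once the collider enters the model, can be reproduced with a small simulation. The sketch below is not the authors' analysis and uses no NHANES II data. All probabilities are illustrative assumptions, the function names are hypothetical, and a Mantel-Haenszel odds ratio stratified by the collider stands in for the logistic regression adjustment used in the paper. The exposure and the outcome are generated independently, and both raise the probability of fair or poor self-rated health.

```python
import random


def simulate_collider(n=200_000, seed=42):
    """Return counts[s][x][y] for a simulated population.

    x: exposure, y: outcome, s: collider (fair/poor self-rated health).
    x and y are independent by construction; both increase P(s = 1).
    Every probability here is an illustrative assumption.
    """
    rng = random.Random(seed)
    counts = [[[0, 0], [0, 0]], [[0, 0], [0, 0]]]
    for _ in range(n):
        x = int(rng.random() < 0.10)
        y = int(rng.random() < 0.03)
        p_s = 0.15 + 0.25 * x + 0.55 * y  # collider depends on x and y
        s = int(rng.random() < p_s)
        counts[s][x][y] += 1
    return counts


def crude_or(counts):
    """Odds ratio of y by x, ignoring the collider."""
    a = counts[0][1][1] + counts[1][1][1]  # exposed, outcome
    b = counts[0][1][0] + counts[1][1][0]  # exposed, no outcome
    c = counts[0][0][1] + counts[1][0][1]  # unexposed, outcome
    d = counts[0][0][0] + counts[1][0][0]  # unexposed, no outcome
    return (a * d) / (b * c)


def mantel_haenszel_or(counts):
    """Odds ratio of y by x, pooled across strata of the collider."""
    num = den = 0.0
    for stratum in counts:
        a, b = stratum[1][1], stratum[1][0]
        c, d = stratum[0][1], stratum[0][0]
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den


if __name__ == "__main__":
    counts = simulate_collider()
    print("Crude OR:", round(crude_or(counts), 2))
    print("OR adjusted for collider:", round(mantel_haenszel_or(counts), 2))
```

Under these assumptions the crude odds ratio stays close to 1, while conditioning on the collider pushes the estimate below 1, which is the same direction of bias reported for models 6 and 7.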

