Universal Mechanism to Link Multiple Datasets with Synthetic Populations for Clinical and Epidemiological Research

doi:10.21203/rs.3.rs-456335/v1

Research Article

https://doi.org/10.21203/rs.3.rs-456335/v1

This work is licensed under a CC BY 4.0 License

Version 1

Many modern predictive models require knowledge acquired from multiple datasets, yet data linkage can be a substantial challenge. Datasets exist in a wide variety of formats, and linking them directly is often infeasible or restricted by privacy requirements. One solution is to map variables from different datasets onto a synthetic population, producing a single dataset that contains information from multiple sources and is sufficient for reliable statistical inference with quantifiable uncertainty. This approach is universal in the sense that it is applicable to a broad range of datasets, and it has good potential for privacy protection.

We consider datasets with information about individuals and describe three methods for building linked synthetic data: resampling, modeling predictors independently, and modeling predictors sequentially. We apply these methods to predict the prevalence of youth vaping in Florida at the state, county, and census tract levels, using the 2018 Florida Youth Substance Abuse Survey (FYSAS) and the 5-Year American Community Survey (ACS). We find that resampling and sequential modeling most closely approximate the 2018 survey results, and that the sequential model captures more variability. The order in which variables are predicted in sequential modeling does not substantially affect the outcome.
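The three methods named above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy survey records, the synthetic population, the variable names (smokes, vapes), and the use of empirical cell frequencies in place of fitted statistical models are all assumptions made for illustration, and no FYSAS or ACS data are used.

```python
import random
from collections import defaultdict

# Hypothetical toy survey rows: (age_group, sex, smokes, vapes). Not FYSAS data.
SURVEY = [
    ("11-14", "F", 0, 0), ("11-14", "F", 0, 0), ("11-14", "F", 0, 1),
    ("11-14", "M", 0, 0), ("11-14", "M", 1, 1), ("11-14", "M", 0, 0),
    ("15-18", "F", 1, 1), ("15-18", "F", 0, 1), ("15-18", "F", 0, 0),
    ("15-18", "M", 1, 1), ("15-18", "M", 1, 1), ("15-18", "M", 0, 0),
]

# Hypothetical synthetic population: demographics only, no survey answers.
SYNTHETIC = ([("11-14", "F")] * 30 + [("11-14", "M")] * 30
             + [("15-18", "F")] * 20 + [("15-18", "M")] * 20)


def by_key(rows, key_fn):
    """Group survey rows by the value of key_fn(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    return groups


def rate(rows, index):
    """Empirical frequency of a binary column (assumed stand-in for a fitted model)."""
    return sum(row[index] for row in rows) / len(rows)


def resample(synthetic, survey, rng):
    """Resampling: copy both answers from a random respondent with the same demographics."""
    donors = by_key(survey, lambda r: r[:2])
    return [person + rng.choice(donors[person])[2:] for person in synthetic]


def model_independently(synthetic, survey, rng):
    """Independent modeling: draw each answer from its own model given demographics only."""
    cells = by_key(survey, lambda r: r[:2])
    linked = []
    for person in synthetic:
        smokes = int(rng.random() < rate(cells[person], 2))
        vapes = int(rng.random() < rate(cells[person], 3))
        linked.append(person + (smokes, vapes))
    return linked


def model_sequentially(synthetic, survey, rng):
    """Sequential modeling: draw smokes first, then vapes given demographics and smokes."""
    cells = by_key(survey, lambda r: r[:2])
    cells_with_smokes = by_key(survey, lambda r: r[:3])
    linked = []
    for person in synthetic:
        smokes = int(rng.random() < rate(cells[person], 2))
        # Fall back to the demographic cell if no respondent matches on smokes too.
        rows = cells_with_smokes.get(person + (smokes,)) or cells[person]
        vapes = int(rng.random() < rate(rows, 3))
        linked.append(person + (smokes, vapes))
    return linked


def prevalence(linked):
    """Share of the linked synthetic population with vapes == 1."""
    return sum(row[3] for row in linked) / len(linked)


rng = random.Random(0)
for method in (resample, model_independently, model_sequentially):
    print(method.__name__, round(prevalence(method(SYNTHETIC, SURVEY, rng)), 3))
```

In this sketch, resampling and sequential modeling both preserve the association between the two survey answers within a demographic cell, while independent modeling discards it. Swapping the order in model_sequentially (vapes first, then smokes) would be one way to probe the sensitivity to variable ordering that the abstract mentions.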

Health Economics & Outcomes Research

Health Policy

Medical Genetics

Synthetic Populations

Clinical

FYSAS

No competing interests reported.
