Many modern predictive models require knowledge acquired from multiple datasets, yet data linkage can be a substantial challenge. Datasets exist in a wide variety of formats, and linking them directly is often infeasible or restricted by privacy requirements. One solution is to map variables from different datasets onto a synthetic population, producing a dataset that contains information from multiple sources sufficient for reliable statistical inference with quantifiable uncertainty. This approach is broadly applicable: it can be used with a wide range of datasets and has good potential for privacy protection.
We consider datasets with information about individuals and describe three methods for building linked synthetic data: resampling, modeling predictors independently, and modeling predictors sequentially. We apply these methods to predict the prevalence of youth vaping in Florida at the state, county, and census tract levels, using the 2018 Florida Youth Substance Abuse Survey (FYSAS) and the 5-Year American Community Survey (ACS). We find that resampling and sequential modeling most closely approximate the 2018 survey results, and that the sequential model captures more variability. The order in which variables are predicted in sequential modeling does not substantially affect the results.
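To make the distinction between the three linkage methods concrete, the following is a minimal sketch on toy data. It is an illustration under stated assumptions, not the procedure used in this work: the variables `age`, `smokes`, and `vapes`, the sample sizes, and the simulated rates are all hypothetical, and simple group-wise empirical rates stand in for whatever predictive models one would actually fit. The function names are likewise invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy survey: one shared covariate (age group 0-2) and
# two survey-only binary variables. All rates below are made up.
n_survey = 2000
survey_age = rng.integers(0, 3, size=n_survey)
survey_smokes = rng.random(n_survey) < (0.05 + 0.05 * survey_age)
survey_vapes = rng.random(n_survey) < np.where(survey_smokes, 0.6, 0.1)

# Hypothetical synthetic population that carries only the shared covariate.
n_pop = 5000
pop_age = rng.integers(0, 3, size=n_pop)


def rate_by_age(values):
    """Empirical rate of a binary survey variable within each age group."""
    return np.array([values[survey_age == a].mean() for a in range(3)])


def link_by_resampling(pop_age):
    """Copy both variables jointly from a random survey respondent of the same age group."""
    smokes = np.zeros(pop_age.size, dtype=bool)
    vapes = np.zeros(pop_age.size, dtype=bool)
    for a in np.unique(pop_age):
        donors = np.flatnonzero(survey_age == a)
        members = np.flatnonzero(pop_age == a)
        picked = rng.choice(donors, size=members.size, replace=True)
        smokes[members] = survey_smokes[picked]
        vapes[members] = survey_vapes[picked]
    return smokes, vapes


def link_independently(pop_age):
    """Draw each variable from its own age-specific rate, ignoring the other variable."""
    smokes = rng.random(pop_age.size) < rate_by_age(survey_smokes)[pop_age]
    vapes = rng.random(pop_age.size) < rate_by_age(survey_vapes)[pop_age]
    return smokes, vapes


def link_sequentially(pop_age):
    """Draw smoking given age, then vaping given age and the drawn smoking status."""
    smokes = rng.random(pop_age.size) < rate_by_age(survey_smokes)[pop_age]
    p_vapes = np.zeros((3, 2))
    for a in range(3):
        for s in (0, 1):
            mask = (survey_age == a) & (survey_smokes == bool(s))
            p_vapes[a, s] = survey_vapes[mask].mean()
    vapes = rng.random(pop_age.size) < p_vapes[pop_age, smokes.astype(int)]
    return smokes, vapes


results = {
    "resampling": link_by_resampling(pop_age),
    "independent": link_independently(pop_age),
    "sequential": link_sequentially(pop_age),
}
for name, (smokes, vapes) in results.items():
    corr = np.corrcoef(smokes.astype(float), vapes.astype(float))[0, 1]
    print(f"{name}: vaping prevalence={vapes.mean():.3f}, smoke-vape correlation={corr:.3f}")
```

In this toy setup, all three methods should give similar marginal prevalence, but only resampling and the sequential draw carry the dependence between the two survey-only variables into the synthetic population; the independent draw discards it by construction.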