This work was conducted under a study protocol that was approved by the Institutional Review Board at the University of North Carolina at Chapel Hill.
Motivation
FHIR PIT is a complex, custom, open-source software application that uses geocodes and time stamps of varying resolution (e.g., hour, day, year) to automatically integrate multiple sources of spatiotemporal data, irrespective of the degree to which the data depend on space and time. FHIR PIT was motivated by our research and development of the Integrated Clinical and Environmental Exposures Service [ICEES; 16]. ICEES was developed as part of the Biomedical Data Translator program in response to a need to openly expose clinical data that have been integrated at the patient and visit level with environmental exposures data [17,18]. FHIR PIT provides the integrated clinical and environmental exposures data to support ICEES.
Implementation overview and spatiotemporal data sources
For initial research and development of FHIR PIT, clinical data on patients from UNC Health Care System were integrated with a variety of public data on environmental exposures, including: airborne pollutant exposures from the US Environmental Protection Agency; roadway exposures from the Federal Highway Administration’s Highway Performance Monitoring System, within the US Department of Transportation; roadway exposures from the US Census Bureau’s Topologically Integrated Geographic Encoding and Referencing system; and socio-environmental exposures from the US Census Bureau’s American Community Survey. (A graphical overview of the FHIR PIT integration pipeline can be found in Figure 1. A list of currently available feature variables can be found in Supplementary Table 1. This table and additional documentation are maintained and regularly updated on the ICEES OpenAPI.) Importantly, the integration step is conducted within a secure environment and under a protocol that was approved by our institution’s Institutional Review Board, because data integration necessitates the use of patient geocodes (i.e., primary home residence), date/time stamps, and patient identifiers, all of which are considered Protected Health Information under the Health Insurance Portability and Accountability Act (HIPAA).
Multiple integration steps are required to harmonize across these data sources, which vary in spatiotemporal resolution and in the format of geocodes and time stamps. For example, patient primary home residence is coded as latitude and longitude in the patient data, whereas the American Community Survey data are provided at the Census block level. Airborne pollutant exposures are available as hourly estimates, daily estimates, or annual averages, depending on the exposure entity and source year. Roadway data are provided as GIS shapefiles, with latitudes and longitudes in decimal format under WGS84, the 1984 World Geodetic System datum. Separate software code is required to convert the spatiotemporal representation used by each data source into a common format that allows integration across data sources. In addition, separate mappings are required to link patient identifiers and geocodes with each non-clinical data source, thereby supporting the final integration step that merges the different data sources.
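As an illustration of the temporal side of this harmonization, the following minimal Python sketch collapses hourly exposure estimates into daily means so that they can be joined with data keyed by calendar day. It is not taken from the FHIR PIT code base (which is written in Scala on Apache Spark); the function name `daily_means`, the record layout, and the sample values are assumptions made for this example only.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_means(hourly_records):
    """Collapse (ISO timestamp, value) hourly records into per-day means.

    Hypothetical helper: the record layout is an assumption for illustration.
    """
    by_day = defaultdict(list)
    for timestamp, value in hourly_records:
        # Truncate the hourly time stamp to its calendar day.
        day = datetime.fromisoformat(timestamp).date()
        by_day[day].append(value)
    # Key the result by ISO date string, in chronological order.
    return {day.isoformat(): mean(values) for day, values in sorted(by_day.items())}


# Made-up hourly estimates for two days.
hourly = [
    ("2012-07-01T00:00:00", 8.0),
    ("2012-07-01T01:00:00", 10.0),
    ("2012-07-02T00:00:00", 12.0),
]
print(daily_means(hourly))
```

The same pattern extends to other resolutions by changing the truncation step, for example by grouping on the year rather than the day.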
The final product of the FHIR PIT software pipeline is a set of “integrated feature tables”, with feature variables binned or recoded and data de-identified according to §164.514(b) of HIPAA for subsequent open access via the ICEES OpenAPI.
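The binning of a continuous feature variable can be sketched as follows. This is a minimal illustration only: the cut points, the variable name `PM25_EDGES`, and the 1-based bin labels are hypothetical and are not the values used by FHIR PIT or ICEES, and binning on its own is not presented here as sufficient for de-identification under §164.514(b).

```python
from bisect import bisect_right


def bin_value(value, edges):
    """Return a 1-based bin index for `value`, given ascending bin edges.

    Values below the first edge fall in bin 1; values at or above the
    last edge fall in bin len(edges) + 1.
    """
    return bisect_right(edges, value) + 1


# Hypothetical cut points for a pollutant concentration (assumption).
PM25_EDGES = [5.0, 10.0, 15.0]

for concentration in (3.2, 10.0, 20.0):
    print(concentration, "-> bin", bin_value(concentration, PM25_EDGES))
```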
Implementation details
FHIR PIT consists of several transformation steps, or building blocks, that can be chained together to form a transformation and integration workflow. Several of these transformation steps are generic, such that they can take in any data that conform to a certain format. Thus, the incorporation of new types of data amounts to adding new transformation steps or reusing generic steps. FHIR PIT is implemented using Apache Spark, which is used to parallelize and distribute the data transformation steps. A Python script is used to simplify the application interface to the transformation steps. FHIR PIT supports building containers in both Singularity and Docker, which makes the application portable across machines and platforms.
Each block in FHIR PIT is implemented as a plugin consisting of a set of Scala classes that can be plugged into the pipeline. FHIR PIT is configured using a YAML file, and steps can be switched on or off for rapid re-execution of the pipeline. The plugins consist of both generic building blocks such as joining of tables and data set–specific building blocks such as preprocessing of environmental data (Table 1). The input and output of each plugin can be configured so that the output of the previous step in a pipeline configuration can be fed as input for the next step.
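The chaining behavior described above can be sketched in Python as follows. This is a simplified illustration, not the FHIR PIT implementation: the actual plugins are Scala classes and the actual configuration fields are those defined in Table 2. The step names, the `enabled` and `arguments` keys, and the two toy transformations are assumptions introduced for this example.

```python
def drop_missing(rows, column):
    """Toy generic step: drop rows whose `column` value is missing."""
    return [row for row in rows if row.get(column) is not None]


def rename_column(rows, old, new):
    """Toy generic step: rename column `old` to `new` in every row."""
    return [{(new if key == old else key): val for key, val in row.items()}
            for row in rows]


# Hypothetical registry mapping step names to implementations.
REGISTRY = {"drop_missing": drop_missing, "rename_column": rename_column}


def run_pipeline(steps, registry, initial):
    """Run enabled steps in order, feeding each output to the next step."""
    data = initial
    for step in steps:
        if not step.get("enabled", True):
            continue  # step switched off in the configuration
        data = registry[step["name"]](data, **step.get("arguments", {}))
    return data


# Hypothetical configuration, as it might look after loading a YAML file.
config = [
    {"name": "drop_missing", "enabled": True, "arguments": {"column": "pm25"}},
    {"name": "rename_column", "enabled": False,
     "arguments": {"old": "pm25", "new": "PM2.5"}},
]
rows = [{"patient": "a", "pm25": 9.0}, {"patient": "b", "pm25": None}]
result = run_pipeline(config, REGISTRY, rows)
print(result)
```

Because the second step is switched off, only the first transformation is applied; switching it on requires a change to the configuration alone, not to the code.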
One of our goals for implementation of the pipeline is to enable automatic and rapid re-execution. Given the extensive number of input files and parameters, we use the Dhall configuration language to author configuration files and avoid code duplication. Dhall code is converted to a YAML file that is then read by the pipeline. An example YAML configuration of a step in the FHIR PIT pipeline is provided below, with fields defined in Table 2.
See Function 1 in the supplementary files.
Writing the entire FHIR PIT pipeline configuration in YAML would necessitate rewriting the pipeline for every new calendar year and every new data set. With Dhall, we are able to create a function in the configuration that can be instantiated for each new calendar year or data set. A simplified version of this function to address additional years is shown below.
See Function 2 in the supplementary files.
To instantiate this function for calendar year 2012, we simply apply it to its arguments:
envDataSourceStep False "2012"
To extend this to multiple calendar years, we map the function over a list of years:
List/map ["2012", "2013", "2014"] (envDataSourceStep False)
Here, the List/map function takes a list of terms and a function, applies the function to each element in the list, and returns a list of values.
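For readers unfamiliar with Dhall, the same pattern of partial application followed by a map can be sketched in Python. The function `env_data_source_step` below is a hypothetical stand-in for `envDataSourceStep`; the meaning assigned to its first argument (a skip flag) and the fields of the returned dictionary are assumptions, since the actual definition is given in the supplementary files.

```python
from functools import partial


def env_data_source_step(skip, year):
    """Hypothetical stand-in for envDataSourceStep: build one step config.

    The `skip` flag and the returned fields are assumptions for illustration.
    """
    return {"name": f"EnvDataSource{year}", "skip": skip, "year": year}


years = ["2012", "2013", "2014"]

# Fix the first argument, then apply the resulting function to each year,
# mirroring `List/map [...] (envDataSourceStep False)`.
steps = list(map(partial(env_data_source_step, False), years))

for step in steps:
    print(step)
```

Adding a calendar year then amounts to appending one element to the list, rather than duplicating a block of configuration.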
Execution of the FHIR PIT pipeline generates a report of skipped tasks, succeeded tasks, failed tasks, and errors from failed tasks.
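A report of this kind can be assembled as in the minimal Python sketch below. It is an illustration under stated assumptions rather than the FHIR PIT reporting code: the task tuple layout, the report keys, and the example tasks are all hypothetical.

```python
def run_tasks(tasks):
    """Run (name, callable, enabled) tasks and collect a status report."""
    report = {"skipped": [], "succeeded": [], "failed": [], "errors": {}}
    for name, func, enabled in tasks:
        if not enabled:
            report["skipped"].append(name)
            continue
        try:
            func()
        except Exception as exc:  # record the error and continue with the next task
            report["failed"].append(name)
            report["errors"][name] = repr(exc)
        else:
            report["succeeded"].append(name)
    return report


def failing_task():
    """Hypothetical task that always fails."""
    raise ValueError("bad input")


# Hypothetical tasks: one succeeds, one fails, one is switched off.
tasks = [
    ("load", lambda: None, True),
    ("transform", failing_task, True),
    ("export", lambda: None, False),
]
report = run_tasks(tasks)
print(report)
```

In this sketch a failed task does not stop the run, so the report covers every task.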