1. Integrated data from existing environmental health monitoring programs
Many different approaches can be used to quantify environmental exposures: direct methods (measuring, monitoring or biomonitoring) or indirect methods, which estimate exposure from measurements and existing data such as environmental monitoring, questionnaires and exposure models. The availability of data for the pollutants evaluated in the geographic area of interest is an essential prerequisite. The quality and usability of all environmental data should be assessed before they are used in health or risk assessment, as many factors can bias environmental sampling results [9]. Ideally, direct measures of exposure (e.g., biomarkers or personal monitoring data) would be available for all key stressors related to health effects, throughout the critical time period of exposure, and in the population of interest [10]. However, exclusive use of biomarker data in exposure assessment to characterize EHI is currently not practicable when considering a large number of diverse chemicals, due to analytical and resource limitations [11], particularly when the assessment must cover a large territory at a fine resolution. Environmental quality data are often available at a fine administrative level or spatial resolution and enable environmental indicators to be built on a regional or national scale. The processing of variables for the identification and characterization of environmental inequalities depends on the reuse of this type of data, which is very diverse in nature because of the different objectives for which it was initially collected. Determining how representative those measured levels of contamination are of other locations or time frames is not always a simple task [12].
Health and environment databases have been under development for several years and continue to evolve and expand. Actions to identify and monitor the quality of soils, water and air are conducted by different agencies, institutes or observatories. The production of this type of data, together with advances in computer technology, allows the data to be reused in conceptual frameworks and for objectives different from those for which they were originally collected. The emergence of quality data and their integration into GIS make territorial analysis possible. These environmental data reflect the actual contamination of the environment and therefore the overall exposure of populations. Indicators based on these data make it possible to characterize the population's exposure and how it evolves as public prevention policies are implemented. When this type of data is reused for exposure science, a database must be set up in which the variables are associated with the modes of exposure (concentrations in environmental and exposure media, eating behavior, space-time budget, ...).
These variables must go through several processing stages before indicators can be constructed:
- the identification of data sources allowing the construction of the different variables,
- the acquisition of these data, taking into account access arrangements and financial, legal or human constraints,
- the analysis of the quality and representativeness of the databases with regard to the objective of the study (choice of a database, validity and representativeness of the data), sometimes involving approximations or simplifying assumptions,
- the preprocessing of databases: cleaning the databases, rebuilding missing data,
- the construction of ad-hoc data where the appropriate data sources are not available or exhaustive in relation to the objectives of the study,
- data transformation (homogenization, aggregation or disaggregation of data).
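The cleaning and aggregation steps listed above can be illustrated with a minimal sketch. It assumes point measurements given as hypothetical `(x_km, y_km, value)` tuples, where a missing value is encoded as `None`, and averages them over square grid cells. The function name `aggregate_to_grid`, the cell size and the sample values are illustrative assumptions, not part of any platform described here.

```python
from collections import defaultdict
from statistics import mean


def aggregate_to_grid(points, cell_size_km):
    """Average point measurements within square grid cells.

    points: iterable of (x_km, y_km, value); value is None when missing.
    Returns {(col, row): mean value} for cells holding at least one valid value.
    """
    cells = defaultdict(list)
    for x, y, value in points:
        if value is None:  # cleaning step: drop missing measurements
            continue
        # Index of the cell containing the point (floor division on each axis)
        key = (int(x // cell_size_km), int(y // cell_size_km))
        cells[key].append(value)
    # Aggregation step: one mean value per populated cell
    return {key: mean(vals) for key, vals in cells.items()}


# Hypothetical measurements, for illustration only
measurements = [
    (0.5, 0.5, 10.0),
    (1.5, 2.0, 14.0),
    (2.9, 2.9, None),
    (4.0, 1.0, 20.0),
]
grid = aggregate_to_grid(measurements, cell_size_km=3.0)
```

Here missing values are simply discarded. Rebuilding them, as mentioned in the preprocessing step, would require an explicit imputation model.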
The estimation of exposure requires knowledge of the concentrations in the environmental compartments to which an individual or a population is exposed. These concentrations can be measured or modeled. A wide range of data can potentially be mobilized for integrated assessment. The selection of databases and the definition of the study design should aim for the best compromise between data representativeness and method robustness, consistent with the objectives of the study.
Characteristics of air pollution (e.g., chemical components, particle properties) vary spatially [13] and may differ between areas near and far from monitors [14]. Automated monitoring networks operating in Europe provide detailed air quality information on a regular basis. The routes of human exposure to soil contaminants are inhalation of dust and vapor, ingestion of contaminated soil particles (mainly for children) or contaminated food, and dermal absorption. Once a site is considered contaminated, enough accurate data must be provided to limit the lack of statistical representativeness and improve spatial quantification. The time spent evaluating the presence and extent of contamination can be reduced by an adequate sampling plan [15], which can at the same time reduce project costs [16]. A soil monitoring system can provide comparable and objective data on the current state and evolution of soils. The database of a soil monitoring system allows data to be created and maintained for each monitoring site on agricultural land and prepared for further processing with specialized programs [17]. Position information provides a link to the GIS and thus opens up possibilities for further spatial analysis and for the identification and assessment of risk areas. For example, in France, soil pollutant stocks and properties and most explanatory variables were derived from the French National Soil Monitoring Network (Réseau de Mesures de la Qualité des Sols or RMQS). The RMQS surveys soils and their properties on a regular 16 km grid across the French mainland territory (around 2,200 sites covering 550,000 km²) [18].
The Drinking Water Directive (80/778/EEC) and its successor (98/83/EC, which came into force in 2003) aim to ensure that water intended for human consumption is safe. In addition to microbiological and physicochemical parameters, a number of toxic substances such as pesticides, polyaromatic hydrocarbons, cyanide compounds, and heavy metals are to be monitored. This is because the raw supply may be contaminated, for example, with pesticides from agricultural land that have leached into groundwater, or because contamination may occur within the distribution system, such as lead from piping. In France, 300,000 samples are tested each year. Indeed, tap water is one of the most strictly controlled foodstuffs. Each year, the health agencies carry out close to 12.3 million tests covering all of the country’s public water and wastewater services (both publicly and privately managed). In 2013, more than 8.1 million tests were carried out on services managed by private water companies.
Work has been carried out by INERIS to identify environmental and spatialized databases for the purpose of characterizing exposures, in association with the main data producers and managers identified [19, 20]. This work makes it possible to propose elements for the specification of environmental health platforms and to improve the integration of data when building an environmental health tracking information system. However, the spatial data used to characterize environmental exposures have not always been collected and collated with these objectives in mind, which can introduce bias when they are reused. Measurement frequencies or spatial sampling densities are not always sufficient. To partially overcome these problems, different techniques are adopted to address the specific features of the different environmental, behavioral or population databases. The selection of a processing method depends on the problem to be solved and the quality of the data available.
2. Statistical approaches to link and optimize data representativeness
The data available in a region of interest characterize levels of contamination at very specific locations, over a given spatial support (i.e., the support on which the data are measured, such as a point, surface or volume), and for very specific time frames. To construct exposure maps from spatialized databases in the context of evaluating environmental inequalities, methods must be developed to process the available data and harmonize them to the same resolution and support, while taking their specific features into account (missing values, limited number of observations, etc.).
In the mathematical field of numerical analysis, interpolation is a method of constructing new data points within the range of a discrete set of known data points. In recent years, the increasing availability of spatial and spatiotemporal data has driven the development of many spatial interpolation methods, including geostatistics [21]. Spatial analysis includes any of the formal techniques that study entities using their topological, geometric, or geographic properties. Spatial dependence is the co-variation of properties in geographical space: features at nearby locations tend to be correlated. The fundamental principle is Tobler's first law of geography: if the interrelation between entities increases with proximity in the real world, representation in geographical space and evaluation using spatial analysis techniques are appropriate [22]. The closer the locations concerned, the stronger these interactions. In statistics, spatial autocorrelation measures the correlation of a georeferenced variable with itself, which makes it possible to quantify the degree of similarity between neighboring observations. Spatial dependence violates the assumption of independence between observations on which classical statistical techniques rely, but it should also be considered a source of information. To characterize the different scales of variability (local, regional and global) of the phenomena studied, spatial data structures are often analyzed with geostatistical tools (variogram, autocorrelation analysis) [23].
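As an illustration of how spatial autocorrelation can be quantified, the sketch below computes Moran's I, one common autocorrelation statistic, which the text above does not name explicitly. It assumes a hypothetical set of monitoring sites arranged along a line, with binary weights linking each site to its immediate neighbors. The helper `chain_weights` and the sample values are illustrative assumptions.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a square spatial weight matrix."""
    n = len(values)
    m = sum(values) / n
    dev = [v - m for v in values]
    w_sum = sum(sum(row) for row in weights)
    # Cross-products of deviations, weighted by spatial proximity
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * num / den


def chain_weights(n):
    """Binary weights for n sites on a line: 1 between adjacent sites, else 0."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
            for i in range(n)]


w = chain_weights(4)
smooth = morans_i([1.0, 2.0, 3.0, 4.0], w)       # neighbors are similar
alternating = morans_i([1.0, 4.0, 1.0, 4.0], w)  # neighbors are dissimilar
```

A positive value (here for the smooth gradient) indicates that nearby sites carry similar values, while a negative value (the alternating pattern) indicates that neighbors differ systematically.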
Several more sophisticated methods of spatial analysis can be applied to include additional information and take advantage of spatial and inter-variable correlation to improve data representativeness and characterize the associated uncertainty [24].
For air, several methods exist for estimating exposure to pollutants, including monitor-based approaches such as proximity-based assessments and statistical interpolation, as well as land-use regression and air quality modeling [25]. Using data from existing monitoring networks remains popular because of cost considerations, data availability, and population coverage. Statistical methods aim to combine multiple types of information to inform exposure estimates and make it possible to estimate exposure in areas far from monitors. In addition to fused data, several other approaches have been developed to estimate individual- and population-level exposures, including various interpolation methods, land use regression (LUR) models, aerosol measurements obtained from satellites, and source- and traffic-proximity analysis [26]. Stochastic methods such as kriging are preferred [27]. A commonly reported issue is data availability: some databases have limitations (for example, a limited number of observations) that prevent an adequate assessment of the population’s exposure. External drift kriging is therefore widely used in air and soil quality modeling to incorporate secondary information into the model.
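To make the idea of external drift kriging more concrete, the following sketch solves the kriging system for a single target location with one auxiliary variable. It is a minimal illustration under stated assumptions, not the implementation used in any of the cited studies: the exponential covariance model, its parameters (`sill`, `corr_range`), the function names and all data values are hypothetical, and a real application would fit the covariance model to the data.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def covariance(h, sill=1.0, corr_range=10.0):
    """Exponential covariance model (assumed here, not fitted)."""
    return sill * math.exp(-h / corr_range)


def ked_predict(coords, values, drift, target, target_drift):
    """Kriging with external drift: mean = b0 + b1 * drift (b0, b1 unknown)."""
    n = len(coords)
    size = n + 2  # n weights plus two Lagrange multipliers
    a = [[0.0] * size for _ in range(size)]
    b = [0.0] * size
    for i in range(n):
        for j in range(n):
            a[i][j] = covariance(math.dist(coords[i], coords[j]))
        a[i][n] = a[n][i] = 1.0                  # weights sum to 1
        a[i][n + 1] = a[n + 1][i] = drift[i]     # weights reproduce the drift
        b[i] = covariance(math.dist(coords[i], target))
    b[n] = 1.0
    b[n + 1] = target_drift
    weights = solve(a, b)[:n]
    return sum(w * z for w, z in zip(weights, values))


# Hypothetical monitors (km), auxiliary variable and concentrations
coords = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0), (5.0, 5.0)]
emissions = [1.0, 2.0, 1.5, 3.0, 2.5]
conc = [2.0 + 3.0 * e for e in emissions]
pred = ked_predict(coords, conc, emissions, (3.0, 7.0), 2.2)
```

The two constraints on the weights are what let the auxiliary variable inform the estimate away from monitors. In this toy case the concentrations are an exact linear function of the auxiliary variable, so the prediction follows that relation at the target location.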
Machine learning uses algorithms and statistical methods to “learn” information directly from data without relying on a predetermined equation as a model. The algorithms adaptively improve their performance as the number of samples available for learning increases. Machine learning makes it possible, for example, to build a metamodel from a dataset of deterministic model outputs. The fundamental concepts of machine learning and its uses with spatially distributed data are given in Kanevskij et al. [28].
To construct exposure maps from spatialized databases in the context of evaluating health risks, methods have been developed to process the available data and harmonize them to the same resolution and support, while taking their specific features into account (missing values, limited number of observations, etc.). PLAINE (environmental inequalities analysis platform [29]), a GIS-based modeling platform developed in France by INERIS for quantifying human exposure to chemical substances, aims to spatialize an environmental indicator related to human health using risk assessment methods and to map environmental disparities at a fine resolution. The platform is designed for the systematic collection, integration, and analysis of data on emission sources, environmental contamination, exposure to environmental hazards, and population and health. Ad-hoc methodologies are used to align the available data to the same pixels. Spatial analysis and statistical methods are employed to process (georeferencing, data checking, pre-processing, reformatting) and assemble the databases for the purpose of the study, using R and QGIS. For example, atmospheric concentration data were collected in France in the context of regulatory surveillance for two years (2010 and 2011). Estimating concentrations over France with a classical interpolation method could misrepresent the spatial distribution because of the limited number of observations. To address this issue, auxiliary variables were employed in the context of external drift kriging [30]. The best auxiliary variable for defining linear drifts was found to be the one that includes atmospheric emissions as well as population and altitude.
Measurements of PAH topsoil concentrations are available through the French Soil Monitoring Network. Qualitative data on the location of polluted sites are integrated by computing a distance-to-polluted-soil proxy. These data, along with 14 variables describing physicochemical soil properties, were combined in a hybrid regression-kriging approach fitted using Random Forest models [31], which was shown to outperform the traditionally used linear regression. Because of its hydrophobic nature, benzo[a]pyrene (B[a]P) is found in water in small concentrations; therefore, an exact measurement cannot always be reported. The proportion of observations below the detection limit is quite high, which requires careful handling. A complex multiple imputation method was developed to extract the maximum information from the available measurements without introducing too much bias in the results. This method takes advantage of the temporal dimension and of the correlations between the substance of interest and other PAHs. Spatial estimation of water concentrations was carried out by taking into account the multi-annual data and the complexity of the water distribution network, using a bootstrap-based expectation-maximization algorithm. Together, these methods permitted the construction of a representative spatial database on a 9 km² reference grid covering the whole of France (550,000 km²), which was used to perform the integrated exposure assessment [3].
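The general principle of multiple imputation for observations below the detection limit can be sketched as follows. This is a deliberately simplified stand-in, not the bootstrap-based expectation-maximization method described above: censored values (encoded here as `None`) are drawn uniformly between zero and the detection limit, the estimate of interest is computed on each completed dataset, and the estimates are pooled by averaging. The function name, the uniform assumption and the sample values are hypothetical.

```python
import random
from statistics import mean


def impute_censored(values, detection_limit, n_imputations=20, seed=0):
    """Pool the mean concentration over several randomly completed datasets.

    values: list of floats, with None for observations below the detection limit.
    Returns (pooled mean, list of per-imputation means).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(n_imputations):
        # Assumption: censored values are uniform on [0, detection_limit]
        completed = [v if v is not None else rng.uniform(0.0, detection_limit)
                     for v in values]
        estimates.append(mean(completed))
    return mean(estimates), estimates


# Hypothetical concentrations; None marks a result below the detection limit
observed = [0.8, None, 1.2, None, 0.5]
pooled, per_imputation = impute_censored(observed, detection_limit=0.4)
```

The spread of the per-imputation estimates gives a rough indication of the uncertainty introduced by censoring. A method that exploits temporal structure and correlations with other substances, as described above, replaces the uniform draw with draws from a fitted model.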