Geocoding and spatio-temporal modeling of long-term PM2.5 and NO2 exposure: the Mexican Teachers' Cohort

Epidemiological studies on the effects of air pollution in Mexico often use the environmental concentrations of monitors closest to the home as exposure proxies, yet this approach disregards the space gradients of pollutants and assumes that individuals have no intra-city mobility. Our aim was to develop high-resolution spatial and temporal models for predicting long-term exposure to PM 2.5 and NO 2 in a population of ~ 16 500 participants from the Mexican Teachers’ Cohort study. We geocoded the home and work addresses of participants. Using information from secondary sources on geographic and meteorological variables as well as other pollutants, we tted two generalized additive models to predict monthly PM 2.5 and NO 2 concentrations in the 2004–2019 period. The models were evaluated through 10-fold cross validation. Both showed high predictive accuracy with out-of-sample data and no overtting (CV RMSE = 0.102 for PM 2.5 and CV RMSE = 4.497 for NO 2 ). Participants were exposed to a monthly average of 24.38 (6.78) µg/m 3 of PM 2.5 and 28.21 (8.00) ppb of NO 2 during the study period. These models offer a solid alternative for estimating PM 2.5 and NO 2 exposure with high spatio-temporal resolution for epidemiological studies in the Valle de México region.


Introduction
In 2015, WHO recognized exposure to ambient air pollutants and its effects on human health as a priority public-health problem requiring further study (WHO, 2015). The rapid spread of urban spaces around the world has increased the number of individuals exposed to pollutant concentrations above the values recommended by the WHO (Riojas-Rodríguez et al., 2016). For instance, the MCMA is the third largest metropolitan area among the member states of the OECD and one of the three megacities in Latin America (OECD, 2016). Historically, this area has registered high rates of air pollution (INE-SEMARNAT, 2011) which have placed the health of its almost 21 million inhabitants permanently at risk (Instituto Nacional de Estadística y Geografía (México)., 2015).
Epidemiological studies have shown that chronic exposure to air PM 2.5 and NO 2 is associated with increased mortality (Beelen et al., 2014), cardiovascular disease (Dockery, 2001;Hoek et al., 2013), lung cancer (Chen et al., 2008;Hamra et al., 2014;Pope et al., 2002), cardiopulmonary conditions (Krewski et al., 2009) and diabetes (Li et al., 2014), among others. All these studies, which have used a variety of methods to estimate exposure to air pollutants, have developed deterministic and/or probabilistic models as the cornerstone of their analyses (Jerrett et al., 2005). In Mexico, the majority of epidemiological studies on the health effects of air pollutants have estimated exposure to pollutants using proximity methods such as those based on data from the nearest monitor Escamilla-Nuñez et al., 2008;Hernández-Cadena et al., 2009;Rojas-Martinez et al., 2007), city or municipal averages (Carbajal-Arroyo et al., 2011;Téllez-Rojo et al., 2000), IDW interpolation (Riojas-Rodríguez et al., 2014;Trejo-González et al., 2019). One disadvantage of these methodological approaches lies in their failure to capture the spatial variability of exposure caused by local sources, urban topography and local meteorological factors (Jerrett et al., 2005). The strength of these methods is therefore undermined by considerable uncertainty as regards the assignment of exposure. To the best of our knowledge, the literature offers only a handful of epidemiological studies that have employed more sophisticated, highresolution methods (e.g., based on satellite data) for estimating exposure to ambient air pollutants in the Mexican population (Gutiérrez-Avila et al., 2018;Rosa et al., 2017;Téllez-Rojo et al., 2020). In addition, previous studies have documented that ignoring the daily mobility patterns of the study population can lead to erroneous exposure measurements, thus introducing bias in epidemiological analyses based on exposure estimates at the individual level (Nyhan et al., 2019;Setton et al., 2011). To offset these limitations, we geocoded the home and work addresses of the study population and tted GAMs to predict PM 2.5 and NO 2 concentrations at both locations. The principal advantage of the GAMs resides in their use of semiparametric methods to model non-linear functions through penalized splines (Yanosky et al., 2008).
This work represents an improvement over the exposure models developed thus far for the MCMA (Just et al., 2015;Rivera-González et al., 2015;Son et al., 2018). In order to craft a feasible model that could be adapted to conditions of limited information in developing countries, we used the highest-quality and largest amount of information available from secondary sources on air pollutant predictors in Mexico.
This study provides a novel methodological tool for the MTC study as well as for subsequent epidemiological studies seeking to assess the effects of PM 2.5 and NO 2 exposures on health. Our objective was to assign the outdoor PM 2.5 and NO 2 exposure of ~16 500 participants from an MTC living and working in the MCMA, on a monthly basis over the 2004-2019 period.

Material And Methods
We developed spatio-temporal prediction models for PM 2.5 and NO 2 , using the pollutant concentrations at monitoring locations as our dependent variable, and meteorological and geographic variables as well as other pollutants as predictors. For statistical and geostatistical modeling, geoprocessing and spatial analysis, we used R version 3.5.3 software (R Core Team, 2019).

Study population and geocoding
The MTC study, an ongoing prospective initiative established in 2006 -2008, includes 115 314 female public-school teachers from 12 states in Mexico. A detailed description of the cohort has been published elsewhere (Lajous et al., 2015). Brie y, recruitment was performed in two phases: the rst, in 2006, took place in two states (n=27 979), and the second, in 2008, incorporated 10 additional states (n=87 335) including CDMX and EdoMex. The participating teachers answered a reference questionnaire concerning sociodemographic characteristics, reproductive history, diet, lifestyle and health status. We continue to follow-up on the women every 3-4 years to obtain information on disease diagnoses and to update their risk factor pro les. In the rst follow-up cycle (2011-2013) the response rate was 83%, and in the second follow-up cycle (2014-2020, ongoing) the response rate was 59%.
Considering the availability and quality of data required to geocode the homes of the teachers, we selected the MCMA as our study area. It was composed of 16 municipalities in CDMX, 59 in EdoMex and one municipality in Hidalgo ( Figure 2). Based on the meteorological and air pollutant monitoring stations in place during the study (Figure 2), we delimited the boundaries of the area of analysis using the extreme coordinate position of a 5 km buffer around each monitoring station. Within these boundaries, we de ned a grid composed of cells measuring 1 km x 1 km.
We geocoded the HA and WA of each teacher in the MTC. The HAs included ZC, municipalities and states reported by the teachers at baseline. Since information was unavailable on changes of HAs before and after follow-up, we assumed that they were permanent. Based on the criteria of SEPOMEX for valid ZCs, we selected only those containing ve digits and veri ed that the rst two corresponded to the codes of the home municipalities reported by the teachers. Finally, we used a geographic layer of SEPOMEX ZCs ("gob.mx," n.d.) to assign the coordinates of the ZC's centroid for each HA. To geocode the WAs, we used the WCs provided by the PNCM at baseline and follow-ups. The geographic coordinates were drawn from the SNIE ("Sistema Nacional de Información de Escuelas," n.d.).

Meteorological and air pollutant data
We obtained the databases of the hourly measurement of pollutants (PM 10 , PM 2.5 , SO 2 , NO 2 and O 3 ) and meteorological variables (relative humidity, temperature and wind speed) at the monitoring sites, from 1:00 a.m., January 1st, 2004, to 12:00 a.m., July 31st, 2019. Data were drawn from the monitoring network of the SEDEMA in CDMX. These datasets are permanently available for open consultation on the o cial websites of both networks ("Dirección de Monitoreo Atmosférico," n.d.).
We calculated the daily estimators (daily averages) for each pollutant and meteorological variable at each monitoring site only when a minimum of 75% of the required observations (18 hours) were available. Otherwise, we entered "missing data." On this basis, we calculated the monthly estimators by averaging at least 23 daily estimators (75%). The seasons of the year, used as a categorical variable, were de ned as follows: the cold-dry season, from November to February; the hot-dry season, from March to May; and the rainy season, from June to October.

Geographic data and other covariables
The geographic pro les of the monitoring sites were characterized using a GIS. We created a geographic layer of the air quality monitoring stations in place from 2004 to 2019 in the study area, based on coordinate data provided by the SINAICA ("INECC | Sistema Nacional de Información de la Calidad del Aire (SINAICA)," n.d.). The altitude variable was provided by the SINAICA in masl (meters above sea level) units and categorized into tertiles as follows: low ≤ 2 300 masl, medium ≥ 2 301 masl and ≤ 2 500 masl, and high ≥ 2 501 masl. The vehicle motorization index (the number of motorized vehicles per 1 000 inhabitants) was calculated annually, at the municipal level, based on population projections from the CONAPO ("Consejo Nacional de Población | Gobierno | gob.mx," n.d.) and the historical databases for registered motorized vehicles (automobiles, passenger buses, cargo trucks and motorcycles) in circulation, according to the INEGI (INEGI, n.d.). This variable was categorized into tertiles as follows: low ≤ 222 vehicles per 1 000 inhabitants, medium ≥ 223 vehicles per 1 000 inhabitants and ≤ 487 vehicles per 1 000 inhabitants, and high ≥ 488 vehicles per 1 000 inhabitants.

Generalized Additive Models
To ascertain the contribution of each predictor to PM 2.5 and NO 2 variability at the monitoring station level during the 2004-2019 period, we tted two independent GAMs (Equation 1).

Equation 1. Generalized Additive Model formula
where f j (.) are smooth functions, Y is the vector of outcomes and the X j are the vectors of covariates. For each model, we used monthly PM 2.5 or NO 2 concentrations per monitoring site as the outcome. The vector PM 2.5 was log-transformed because of its skewed distribution. We only included in the model the covariables which we expected a priori to exert a physical in uence on the PM 2.5 and NO 2 levels. The predictors introduced in the adjusted model were the group of other pollutants, the meteorological variables and the geographic variables. To take in to account the correlation between pollutants, we included them using interaction terms. With the view of ensuring a parsimonious model speci cation, we eliminated each term to assess its contribution and retained only those that improved predictive accuracy and offered statistical signi cance (p <0.05). Model comparison and selection were performed according to DE, AIC, and RMSE. To test the goodness of t, we performed a residual diagnosis, allowing us to verify their normality and the absence of any overdispersion patterns. Finally, we performed stepwise selection to test and select the right number of base functions for achieving a smoothed curve that would maximize the t of our data and reduce the RMSE.

Model validation
To test for possible over tting in the GAMs and evaluate their predictive accuracy, we carried out a 10-fold cross validation and determined the predictive accuracy of each model based on its CV RMSE.

Model predictions and geostatistical conditional simulation
We used the GAM coe cients to predict the monthly concentrations of PM 2.5 and NO 2 in all the 1x1-km grid cells during the January 2004-July 2019 period. For each grid cell and for each month of the study period, we carried out 3 000 simulation replicates under the GCS method (ESRI, n.d.). The object was to obtain the mean value of the predictor variables: PM 10 , O 3 , SO 2 , NO 2 , wind speed, relative humidity and temperature in each cell. As part of GCS, we tted gaussian and exponential variograms with and without nugget effect to the empirical semivariograms in order to characterize the spatial structures at the unmeasured locations (Equation 2). To this end, we used the Cholesky decomposition of covariance matrix (Chilès and Del ner, 2012): where the degree of spatial dependence among the values of Attribute Z in two different locations or points in space was semivariance γ(h), and the distance between the points was lag h.
To take into account error ε ij in the GAMs caused by the spatial correlation of the observations, we carried out the same simulation procedure with the residuals and then added them to the predicted values in each grid cell.

Assigning of cohort participant exposures
To maximize exposure's estimate precision, we considered the intra-city mobility of the teachers. For this purpose, we assigned the predicted monthly values of PM 2.5 and NO 2 in the grid cell where the HA and WA of each teacher were located each month during the study period. Under the assumption that estimates based on mobility (HWA weig ) were closer to actual exposure (individual monitoring), HA and WA exposures were weighted according to the number of hours spent in each address, assuming eighthour workdays, which is the average working day in Mexico.

Equation 3. Weighting formula
To assess the relevance of considering the mobility factor in our study, we compared the HA and WA exposures through the difference between HA -HWA weig . The null hypothesis of a difference equaling zero was proved using a mixed-effects model where the teacher was the grouping variable.

Study population and geocoding
Of the 115 314 MTC participants at the national level, 22 743 had their HAs in CDMX or the EdoMex. We were able to geocode the ZC centroids of 95% of them (n=21 594), with 19 547 residing in the MCMA. In this area, the median of teachers per ZC was 23, ranging from 1 to 175 (Figure 2). The areas of the HA ZCs of participants oscillated between 0.005 km 2 and 17.003 km 2 .
Of the 19 547 participants with geocoded HAs, we were able to identify and geocode the schools for 17 135 and 16 672 in the MCMA, at the beginning of follow-up (2008)  participants. Finally, the study population included 16 407 MTC participants who lived and worked in the area of analysis (Figure 1).

Descriptive statistics
The monthly averages of PM 2.5 , NO 2 and their predictor variables at the monitoring site level are summarized in Table 1. During the 2004-2019 period, a maximum of 11 PM 2.5 and 17 NO 2 monitoring stations provided data on all the variables required for adjusting the GAMs. The geometric mean and SD of the monthly PM 2.5 concentrations were 3.18 and 0.27 µg/m 3 , respectively. Meanwhile, the mean and SD of the monthly NO 2 concentrations were 28.34 and 8.59 ppb, respectively. Over the study period, the mean ambient temperature oscillated 17 ºC, wind speed averaged 1.85 m/s and humidity reached 53%. The lowest concentrations of both pollutants occurred in the rainy season and the highest in the cold-dry season. The PM 2.5 concentrations were greatest at the lowest altitudes.

Generalized Additive Model for PM 5 and predicted values
The PM 2.5 model predictors were altitude, relative humidity, wind speed, season and year, in addition to the PM 10 and NO 2 pollutants. Only PM 10 , relative humidity and wind speed were included as continuous variables with smoothing functions, using penalized splines of up to seven degrees of freedom (Figure 3). The maps in Figure 4 illustrate the average spatial distribution of the monthly PM 2.5 levels predicted by the model per year on the surface of the area of analysis (the grid). In general, the predicted levels were higher to the north of the MCMA and showed a downward trend until 2014. The map for 2019 considers only measurements until July; therefore, according to Table 1, the rainy season, which yielded the lowest PM 2.5 levels, was underrepresented. The percentage of deviance explained by the model was high (ED=87.3 %) for the 2004-2019 period ( Table 1). The root mean square error in the sample indicated high predictive accuracy with a low level of error (RMSE = 0.100) and con rmed that no over tting of data existed, as the value was very close to the cross-validation error (CV RMSE = 0.102).

The Generalized Additive Model for NO 2 and predicted values
The NO 2 model predictors were the vehicle motorization index, wind speed, temperature, season and year, in addition to the PM 10 , O 3 and SO 2 pollutants. Only PM 10 , O 3 and wind speed were included as continuous variables with smoothing functions using penalized splines of up to seven degrees of freedom, since the other continuous variables indicated a linear relationship with the dependent variable ( Figure 3). The maps in Figure 4 show the average spatial distribution of the monthly levels of NO 2 predicted by the model on the surface of the study area per year. In general, the predicted levels of NO 2 were highest in the center of the MCMA and showed a downward trend until 2016, with a mean (SD) of 16.87 (7.77) ppb in the study area. As was the case with PM 2.5 , only measurements until July 2019 were considered. The percentage of deviance explained by the model was intermediate (ED =74.4 %) for the 2004-2019 period ( Table 1). The root mean square error in the sample indicated high predictive accuracy with a low level of error (RMSE = 4.406) and con rmed that no over tting of data existed, as the value came very close to the cross-validation error (CV RMSE = 4.497).

Exposure assignment based on a mobility approach
The HAs of all the teachers were different from their WAs. This strengthens the case for considering mobility in estimates of exposure to air pollutants. For the entire period and population analyzed, the ranges of differences in estimated exposure at the homes and workplaces of participants were 0 -6.01 µg/m 3 for PM 2.5 and 0 -14.76 ppb for NO 2 . Only 0.11% (n=1,799) of the teachers changed WAs between the rst and second follow-up cycles. On average (SD), they traveled 6.285 (6.216) kilometers between their homes and their workplaces (min: 0.0098 km, Q1: 1.846 km, Q2: 4.282 km, Q3: 8.746 km, max: 56.155 km).
We obtained 187 weighted monthly averages of PM 2.5 and 187 of NO 2 assigned to each teacher as a proxy for long-term exposure. Figure 5 illustrates the annual distribution and principal statistics for the weighted monthly averages (HWA weig ) assigned to the 16 407 MTC participants during the 2004-2019 period. For both pollutants, we captured the expected exposure variability attributable to the spatial distribution of the pollutants and the location of the HAs and WAs in the study area. On average (SD), for the entire population and period analyzed, weighted exposures were 24.38 (6.78) µg/m 3 to PM 2.5 and 28.21 (8.00) ppb to NO 2 .

Discussion And Conclusion
The models developed in our study provide a useful method for predicting monthly exposure to outdoor PM 2.5 and NO 2 at any location within a big area though the MCMA, with high spatial (1x1 km) and temporal (16 years: 2004-2019) resolution. These models delivered a solid performance and high predictive accuracy, with cross-validation errors slightly higher than those of the models (CV RMSE = 0.102 for PM 2.5 and CV RMSE = 4.497 for NO 2 ). A large difference between these values would have suggested that GAMs over t data, as performance would be lower in the data not included in the adjustment during cross-validation. Because of their high resolution, these models satisfy the latent need of epidemiological studies to predict and assign exposures to ambient air pollutants in numerous MCMA locations with a high degree of precision. Furthermore, they cover a long-time span (16 years) on a small temporal scale (monthly), making it possible to assess the effects of medium-and long-term exposures.
To the best of our knowledge, only three published studies have proposed models for assigning exposure to ambient air pollutants in Mexico (Just et al., 2015;Rivera-González et al., 2015;Son et al., 2018); all of them are focused on Mexico City and/or the MCMA. Rivera et al. provide a comparison of four basic methods based on proximity and interpolation for all criteria pollutants (Rivera-González et al., 2015). The principal limitation of several of these approaches is that they do not capture the spatial variability of exposure (Jerrett et al., 2005). We overcome this limitation by using high spatial resolution, which captures variability to a great extent. Son et al. propose a mixed-effects regression model based on land use for all criteria pollutants. Although they follow the traditional assumption that the relationship between the predictors and the dependent variable is linear, these authors are innovative in that they explore the LASSO method as a statistical technique for tting models. In addition, they analyze various spatio-temporal scales (hourly, daily, monthly, semi-annual and annual) (Son et al., 2018). As opposed to their model, we propose GAMs, which maximize prediction quality for the dependent variable considering its non-linear relationship with the predictors and incorporate smoothed -not parametric-terms using penalized splines. However, we only explored a monthly exposure scale for two criteria pollutants. Finally, Just et al. developed a more sophisticated mixed-effects regression model, including AOD satellite measurements as their primary PM 2.5 predictor, over a period of 10 years (Just et al., 2015). While innovative, this method is limited mainly in that it is inapplicable in areas where little or no PM 2.5 monitoring data exist for its validation, and no satellite measurements can be taken on cloudy days (Bravo et al., 2012;Paciorek and Liu, 2009). We overcame the limitation related to unavailable entry data, an obstacle faced by many developing countries, by using information on predictors from secondary sources that are easy to access and interpret and are permanently monitored. Furthermore, we covered a longer time span (16 years) and modeled NO 2 exposure. Notwithstanding the differences with all the previous studies, in general, we obtained similar results. The predicted spatial patterns for PM 2.5 were comparable with theirs and, on average, the predicted concentrations for both pollutants (PM 2.5 and NO 2 ) were only slightly lower.
The results of our analyses demonstrate the relevance of using meteorological variables and other pollutants as the principal PM 2.5 y NO 2 predictors. This could indicate that, even with secondary monitoring data from local networks, it is possible to generate highly predictive models. We therefore con rm that, in spite of limited information conditions, it is possible to obtain feasible methods for predicting outdoor exposures to air pollutants. However, it is important to include geographic predictors given the nature of the contaminants of interest and the geographic characteristics of the study area. In the two models we developed, the geographic predictors indicated smaller coe cients than the other predictors. In addition to altitude and the vehicle motorization index, we generated and tested other geographic variables. The linear meters of the different types of avenues in buffers of 100, 300 and 500 meters around the monitoring stations did not prove signi cant and therefore were not included in the nal models. We tested the UV-B radiation variable and found a signi cant relationship with NO 2 ; however, we excluded it from the nal model because we were unable to perform a semivariographic analysis owing to insu cient monitoring sites in the study area (only 0-6 stations during the 2004-2019 period). This caused an increase of 0.583 units of CV RMSE in the NO 2 model (CV RMSE including UV-B = 3.914).
A recent study which analyzed and compared the individual home-work mobility patterns of ~ 400 000 individuals corroborated that ignoring mobility causes an erroneous classi cation when estimating health effects (Nyhan et al., 2019). The principal strength of our study was lay in considering home-work mobility to estimate PM 2.5 and NO 2 exposures. It is expected that, based on the ndings of Nyhan et al. (Nyhan et al., 2019), this will help reduce measurement errors that could lead the association measurements of subsequent epidemiological studies to a null value. Another strength of our study concerned the method adopted to ascertain the spatial distribution of the predictor variables at unmeasured sites. Although no existing method can offer an exact estimate of an unknown reality, GCS considers the dispersion of the phenomenon analyzed and allows for obtaining one of the possible executions of a random function (surface). The simulated variables thus present the same variability and correlation characteristics as the observed data (measured according to their mean, their variance and the semivariogram) (Deutsch and Journel, 1998;Goovaerts, 1997). Among other available methods, we could have chosen IDW interpolation or Kriging; however, despite their extensive use in earth and environmental sciences, they provide only a smooth image of reality. GCS assumptions include seasonality and a normal distribution of data (Deutsch and Journel, 1998;Goovaerts, 1997).
We are conscious that the precision and measurement error inherent in predictive models such as the one we developed depends not only on mobility, but also on the residential history of the participants and a precise geocoding of the addresses. Prior studies have suggested that the geocoding technique can in uence health effect estimates when using ne-scale exposure models (Jacquemin et al., 2013). We therefore recognize two sources of error. First, we assumed that the HAs of cohort participants were permanent throughout follow-ups. Nonetheless, Pimienta et al. (Pimienta Lastra and Toscana Aparicio, 2019) registered a low level of inter-municipal migration within the MCMA (4.8%) during the 2010-2015 period. This could suggest that the study population was unlikely to change addresses, a hypothesis similar to our assumption. Second, we geocoded the centroids of the ZCs instead of the exact location of each HA. As the ZCs were spread across a wide-ranging area (0.005-17.003 km 2 ), it is expected that the estimated exposure was more precise for participants residing in the smaller ZC areas. Finally, we assume that the long-term outdoor exposure assigned constitutes a reasonable representation of exposure in interior spaces and microenvironments.
Our models were subject to various limitations. First, the delimitation of the study area and population was restricted by the quality of data (the geographic layer of ZCs) obtained from secondary sources of information to geocode the HAs of the MTC participants. Although the cohort was extended across 12 Mexican states, we were able to partially include only two of them in this study. Second, given the suboptimal quality of the data, we did not include information to characterize the sources of pollution attributable to land use and vehicular tra c emissions. Nonetheless, as a proxy for the latter, we used a vehicle motorization index created with o cial data from secondary sources regarding registered vehicles at the municipal level. This variable proved signi cant in the NO 2 but not the PM 2.5 model. This is consistent with our theoretical framework, given that NO 2 is generated primarily by the combustion of vehicles.
It is crucial for environmental epidemiology to rely on methods created and re ned to estimate exposure to ambient air pollutants, which transcend the limitation of unavailable entry data, particularly in developing countries. In conclusion, our results suggest that the models developed in this study offer a solid alternative that can be used in epidemiological studies of the MCMA to predict PM 2.5 and NO 2 exposure with high spatio-temporal resolution for the 2004-2019 period. Beyond the MTC context, these models can be replicated in other cities with similar sources of information. Our results support the relevance of developing exposure models with high predictive accuracy based on available secondary data from o cial sources of information. Future research efforts should therefore focus on expanding the models developed thus far to other Mexican cities.   Violin plot and descriptive statistics for assignment of monthly PM2.5 (B and D) and NO2 (A and C) weighted exposures (HWAweig) by year. Note: First, second and third quartiles estimated for n=16,407 teachers.