Wildfire worsens population exposure to PM2.5 pollution in the Continental United States

Abstract
As wildfires become more frequent and intense, fire smoke has significantly worsened ambient air quality, posing greater health risks. To better understand the impact of wildfire smoke on air quality, we developed a modeling system to estimate daily PM2.5 concentrations attributed to both fire smoke and non-smoke sources across the Continental U.S. We found that wildfire smoke has the most significant impact on air quality on the West Coast, followed by the Southeastern U.S. Between 2007 and 2018, fire smoke affected daily PM2.5 concentrations at 40% of all regulatory air monitors in EPA's Air Quality System (AQS) for more than one month each year. People residing outside the vicinity of an EPA AQS monitor were subject to 36% more smoke impact days than those residing nearby. Lowering the national ambient air quality standard (NAAQS) for annual mean PM2.5 concentrations to between 9 and 10 µg/m³ would result in approximately 29% to 40% of the AQS monitors falling in nonattainment areas without taking into account the contribution from fire smoke. When fire smoke impact is considered, this percentage would rise to 35% to 49%, demonstrating the significant negative impact of wildfires on air quality.


Introduction
With a changing climate, large-scale wildfire events have increased in frequency and intensity, and fire seasons have been prolonged in the Contiguous U.S. (CONUS) in recent decades (1,2). Wildfire smoke contains large quantities of fine particulate matter (PM2.5, airborne particles with diameters smaller than 2.5 µm), and can adversely affect regional air quality in downwind communities that are tens to hundreds of kilometers away. For instance, Jaffe et al. (2008) reported that summer PM2.5 levels have increased due to wildfires in the western U.S. (3), and Geng et al. observed a significant enhancement in PM2.5 concentrations in intensive wildfire years in Colorado (4). This impact has become so expansive that a previous analysis of PM2.5 measurements from the U.S. EPA's ground monitoring network between 1988 and 2016 attributed the increasing trend in the 98th quantile of 24-hour PM2.5 concentrations in the Northwestern U.S., in contrast to the decreasing trend in the rest of the Continental U.S., to the influence of wildfires (5). In January 2023, the U.S. EPA proposed to revise the National Ambient Air Quality Standards (NAAQS) for PM2.5 by lowering the primary annual PM2.5 standard to the range of 9.0 to 10.0 µg/m³ (6). As both fire season length and wildfire frequency are projected to increase worldwide, and particularly in North America, under both the 1.5°C and 2.0°C global warming targets (7), attainment under the new annual PM2.5 standard will be more challenging in fire-prone regions.
Studies worldwide have reported significant associations of both acute and chronic exposure to ambient PM2.5 with various adverse health outcomes, including respiratory and cardiovascular diseases, nervous system diseases, and premature mortality (8-10). Smoke PM2.5 contains 5-20% elemental carbon (EC) and at least 50% organic carbon (OC), including many polar organic compounds (11). The greater oxidative potential of smoke PM2.5 suggests a different, if not greater, toxicity than ambient PM2.5. In addition, with the expanding wildland-urban interface and an aging U.S. population, the overall burden of wildfire-related diseases is expected to increase (12). A few previous studies have linked acute exposure to wildfire smoke PM2.5 with a series of adverse health outcomes (13). For example, Alman et al. (2016) found a positive association between short-term exposure to wildfire PM2.5 and respiratory illnesses (14). Stowell et al. (2019) reported a significant association between smoke PM2.5 exposure and a greater risk of emergency department visits due to asthma attacks after controlling for PM2.5 exposure from non-smoke sources in Colorado (15).
While chronic exposure to ambient PM2.5 has been shown to present a much greater risk to human health than acute exposure (9), few studies have assessed the health effects of chronic wildfire smoke PM2.5 exposure, primarily because of the challenge of estimating long-term wildfire smoke PM2.5 exposure at high spatial and temporal resolutions. Since most wildfires start in remote areas, regulatory monitoring networks such as the U.S. EPA's Air Quality System (AQS) are often insufficient to characterize the spatial patterns of smoke PM2.5. In addition, ground observations alone cannot separate fire smoke PM2.5 from other sources. Chemical transport models (CTMs) such as the Community Multiscale Air Quality (CMAQ) model can simulate fire-specific PM2.5 with full coverage in space and time, greatly expanding the study population of air pollution epidemiological studies to cover both urban and rural populations (4). However, uncalibrated CTM smoke simulations frequently suffer from substantial prediction errors caused by imperfect characterization of complex fire chemistry, inaccurate emission inventories, and rapidly changing local meteorology surrounding fires (16). Most recently, machine learning or statistical models that integrate ground observations, satellite remote sensing data, land cover and land use information, and CTM simulations have shown great promise in generating long-term, accurate, high-resolution ambient PM2.5 concentrations worldwide with full spatial and temporal coverage. To date, a handful of non-CTM-based fusion models to estimate smoke PM2.5 levels have been reported. For example, O'Dell et al. (2019) estimated the contribution of wildland-fire smoke to seasonal mean PM2.5 levels in the CONUS at a spatial resolution of ~15 km (17). Childs et al. (2020) estimated daily smoke PM2.5 concentrations at 10 km spatial resolution using satellite-based fire smoke contours to define fire days. The coarse spatial resolutions of these studies cannot capture the detailed spatial gradients of smoke PM2.5 levels. The lack of ground observations near the fires in model training also contributes to the underestimation of peak smoke PM2.5 concentrations in these studies.
Here, we designed a multi-stage, CTM-based modeling framework to estimate full-coverage, daily smoke PM2.5 concentrations in the CONUS at 1 km spatial resolution. This framework integrated CMAQ PM2.5 simulations, multiple satellite remote sensing products, meteorological reanalysis, land cover and land use information, and ground observations from both regulatory and low-cost sensor networks. Taking advantage of the high spatial and temporal resolution of our model predictions, we investigated the long-term impact of wildfires on national air quality as well as the representativeness of the AQS monitoring network in estimating population exposure to fire smoke. In addition, we investigated the impact of lowering the PM2.5 standard on attainment areas and the number of individuals affected by it, both with and without the influence of smoke emissions from fires.

Ground PM2.5 Measurements and Calibrations
We obtained Environmental Protection Agency (EPA) federal reference and acceptable ground PM2.5 measurements, which are publicly available from the AQS (18). We calculated daily PM2.5 concentrations by averaging the hourly measurements for station-days with at least 16 of 24 possible measurements. The rapidly developing low-cost sensor networks are a significant supplement to traditional monitoring because of their high spatial density and temporal frequency (19,20). We included measurements from PurpleAir low-cost PM2.5 sensors to extend the spatiotemporal coverage of ground monitoring and increase the probability of capturing PM2.5 pollution from wildfire smoke (21,22). PurpleAir is a citizen-based real-time PM2.5 monitoring network with nearly 10,000 sensors currently online globally (23). Previous studies using measurement calibration methods (24,25) suggested that low-cost sensors can be a significant supplement to reference ground monitors in PM2.5 exposure assessments (26, 27). Vu et al. combined the PurpleAir network with AQS monitors to estimate regional PM2.5 levels during a fire event in California (28). Since PurpleAir PM2.5 measurements are biased relative to reference-grade measurements, we performed a series of quality control and calibration steps (29). We first removed all station-days with fewer than 16 hourly measurements and those with a relative difference exceeding 30 percent between the two channels. We also removed extreme measurements with a daily value above 1,000 µg/m³, temperature below −20 °F or above 140 °F, or humidity below 0% or above 100%. We used Geographically Weighted Regression (GWR) to calibrate the PurpleAir measurements, similar to many previous studies (26). To calibrate across the entire study domain, we matched PurpleAir monitors with AQS stations within 5 km buffers. A total of 230 AQS stations were paired, but around half of the paired stations are located in the western U.S.
Since relative humidity and temperature have a strong impact on PurpleAir accuracy (26), we divided the CONUS into four parts to balance the climate regions and the locations of paired stations, as shown in Figure S1. We developed four regional GWR models that use relative humidity and temperature to calibrate the PurpleAir measurements. A 20-km buffer was created for each region, and calibrated PurpleAir observations located within the buffers were calculated as the mean of the two GWR models' outputs to ensure a smooth transition between regions. Calibrated daily PurpleAir observations above the annual standard of 12 µg/m³ were added to our final model.
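The screening rules described above can be illustrated with a minimal Python sketch. The thresholds come from the text; the function names, the use of the channel mean as the denominator of the relative difference, and the strict "greater than" comparison are assumptions of this sketch, not details confirmed by the study.

```python
from statistics import mean

MIN_HOURS = 16            # minimum hourly readings per station-day (from the text)
MAX_CHANNEL_DIFF = 0.30   # relative difference between the two channels (from the text)
MAX_DAILY_PM = 1000.0     # µg/m³, upper bound on a plausible daily value (from the text)
TEMP_RANGE_F = (-20.0, 140.0)
RH_RANGE = (0.0, 100.0)


def daily_mean(hourly_values):
    """Return the daily mean, or None when fewer than MIN_HOURS readings exist."""
    valid = [v for v in hourly_values if v is not None]
    if len(valid) < MIN_HOURS:
        return None
    return mean(valid)


def passes_qc(pm_a, pm_b, temp_f, rh):
    """Screen one PurpleAir station-day given daily channel A/B values.

    Assumption: the relative channel difference is |A - B| divided by the
    channel mean; the source does not state the exact definition.
    """
    pm = (pm_a + pm_b) / 2.0
    if pm > MAX_DAILY_PM:
        return False
    if not (TEMP_RANGE_F[0] <= temp_f <= TEMP_RANGE_F[1]):
        return False
    if not (RH_RANGE[0] <= rh <= RH_RANGE[1]):
        return False
    if pm > 0 and abs(pm_a - pm_b) / pm > MAX_CHANNEL_DIFF:
        return False
    return True
```

Station-days that pass this screen would then go on to the regional GWR calibration, which is not reproduced here.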

Data Integration
A large array of predictor variables was used to develop the PM2.5 models, including satellite-retrieved aerosol, cloud, and smoke plume information, gridded meteorology, population, land cover and topographic data (detailed descriptions provided in the Supplementary Materials).
All data sets at various spatial resolutions were integrated onto the 1-km grid of the Multi-Angle Implementation of Atmospheric Correction (MAIAC) Aerosol Optical Depth (AOD). Because of missing data in MAIAC AOD, we applied a two-step gap-filling approach to obtain full-coverage MAIAC AOD (detailed descriptions provided in the Supplementary Materials). Daily average PM2.5 measurements from the AQS monitors and PurpleAir sensors were assigned to their collocated grid cells, and measurements were averaged in grid cells with multiple monitors. Note that the PurpleAir data were calibrated based on a previously reported method before merging with AQS measurements (26). We interpolated the coarse-resolution variables, including CMAQ, Copernicus Atmosphere Monitoring Service (CAMS) AOD and meteorological factors, to 1-km resolution using inverse distance weighting (30). We obtained land cover data at 30-meter resolution from the National Land Cover Database. We collected road network and elevation data from the Global Roads Inventory Project and the Global Digital Elevation Model Version 3, respectively. For each grid cell, we calculated the percentages of land cover types, average elevation, and total road length.
We matched our grid with the 1-km resolution population density data from the LandScan Program at Oak Ridge National Laboratory (ORNL) (31). We calculated the daily total smoke plume duration and the daily weighted average plume density for each grid cell using the fire smoke polygons produced by the National Oceanic and Atmospheric Administration (NOAA) Hazard Mapping System (HMS) (32,33). Terra and Aqua Moderate Resolution Imaging Spectroradiometer (MODIS) cloud fractions at 5 km resolution were assigned to the overlapping grid cells and averaged where both were available. Based on climate types, the CONUS is divided into nine climate regions (34), and climate region indicators were assigned to the overlapping grid cells.
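As an illustration of the inverse distance weighting step used to bring coarse-resolution variables onto the 1-km grid, the following Python sketch interpolates a value at one target location from surrounding coarse-grid points. The power of 2, the use of planar distances, and the function name are assumptions of this sketch; the text specifies only that inverse distance weighting was used.

```python
import math


def idw_interpolate(target, points, power=2.0):
    """Inverse distance weighted estimate at `target`.

    target: (x, y) coordinates of the fine-grid cell center.
    points: list of ((x, y), value) pairs from the coarse grid.
    power:  distance exponent (2.0 is a common default, assumed here).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in points:
        distance = math.hypot(target[0] - x, target[1] - y)
        if distance == 0.0:
            # The target coincides with a source point: use its value directly.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total
```

For example, a cell halfway between two coarse points with values 10 and 20 receives their simple average, while a cell closer to the first point is pulled toward 10.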

Smoke PM2.5 Model Development
Random Forest (RF) is an ensemble algorithm based on multiple decision trees, and the outputs from all decision trees are averaged to give the prediction of the dependent variable (35,36). Each decision tree is built on a bootstrap sample of the training data, and a subset of the independent variables is randomly selected at each tree node (35). The bootstrap strategy makes RF robust against overfitting (36). RF also provides an estimated importance ranking that indicates the weights of the predictors and allows easier interpretation than neural network models (35,37). The R² and Root Mean Squared Error (RMSE) were calculated from overall, spatial and temporal 20-fold cross-validation (CV), and we used them to assess model performance and to adjust model parameters.
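The evaluation metrics and the idea behind spatial cross-validation can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes R² is the coefficient of determination (some studies instead report the squared correlation), and it assumes spatial CV holds out whole stations so that no station contributes to both training and validation in the same fold. The function names are hypothetical.

```python
import math
import random


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean squared error in the units of the observations."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def spatial_folds(station_ids, n_folds=20, seed=0):
    """Assign each record a fold so that all records of a station share it."""
    unique_stations = sorted(set(station_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_stations)
    fold_of = {sid: i % n_folds for i, sid in enumerate(unique_stations)}
    return [fold_of[sid] for sid in station_ids]
```

A temporal CV can be built the same way by grouping on the date instead of the station identifier.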
Two random forest algorithms were trained independently in order to separate smoke PM2.5 from the background PM2.5 (Fig. 1).
First, the modeling grid cells and days were divided into smoke-impacted regions and no-smoke regions according to daily HMS smoke plume polygons and the CMAQ smoke ratio (i.e., simulated smoke PM2.5 over total PM2.5 concentration). A smoke grid cell was defined as either being inside an HMS smoke plume polygon or having a CMAQ smoke ratio greater than 0.03 on a given day. Next, in the smoke-impacted region, a random forest algorithm was trained to estimate daily total PM2.5 concentrations, which were assumed to be the sum of the smoke contribution and the background (i.e., the contribution from all other sources). In the no-smoke region, the smoke contribution was assumed to be negligible, and a separate random forest algorithm was trained to estimate daily background PM2.5 concentrations. Then, this no-smoke algorithm was also used to predict daily background PM2.5 concentrations in the smoke-impacted region. Finally, the daily smoke PM2.5 concentration in each grid cell of the smoke-impacted region was calculated as the difference between the predicted total PM2.5 concentration and the predicted background PM2.5 concentration. Since only a small proportion of extremely high ground PM2.5 concentrations were captured by AQS data, we applied the Synthetic Minority Over-sampling Technique (SMOTE) to oversample the underrepresented high-level measurements and improve model performance at high PM2.5 concentrations (28, 38). SMOTE generated synthetic samples along with their predictions from the five nearest neighbors in the training dataset (38). PM2.5 concentrations over 35 µg/m³ (U.S. NAAQS for 24-hour PM2.5) and below 100 µg/m³ were oversampled once, while PM2.5 measurements over 100 µg/m³ were oversampled twice through SMOTE. The oversampled data accounted for 0.85% of the total input data, and the SMOTE process did not skew the distribution of PM2.5 observations. Our final training datasets for the smoke-impacted and no-smoke models had 1,657,449 and 2,003,085 station-day observations, respectively.
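The labeling rule and the final differencing step can be summarized in a short Python sketch. The 0.03 ratio threshold and the subtraction come from the text. The function names are hypothetical, and clipping negative differences to zero is an assumption of this sketch, since the text does not say how cases where the predicted background exceeds the predicted total were handled.

```python
SMOKE_RATIO_THRESHOLD = 0.03  # CMAQ smoke PM2.5 / total PM2.5 (from the text)


def is_smoke_cell(in_hms_plume, cmaq_smoke_pm, cmaq_total_pm):
    """Label a grid cell-day as smoke-impacted.

    A cell-day qualifies if it lies inside an HMS smoke polygon or if the
    CMAQ smoke ratio exceeds the threshold.
    """
    ratio = cmaq_smoke_pm / cmaq_total_pm if cmaq_total_pm > 0 else 0.0
    return bool(in_hms_plume) or ratio > SMOKE_RATIO_THRESHOLD


def smoke_pm25(smoke_impacted, predicted_total, predicted_background):
    """Smoke PM2.5 as predicted total minus predicted background.

    predicted_total comes from the smoke-impacted model and
    predicted_background from the no-smoke model applied to the same cell-day.
    Assumption: negative differences are clipped to zero.
    """
    if not smoke_impacted:
        return 0.0
    return max(predicted_total - predicted_background, 0.0)
```

For a cell-day inside a plume with a predicted total of 30 µg/m³ and a predicted background of 8 µg/m³, this yields a smoke contribution of 22 µg/m³.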
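The models in the smoke-impacted and no-smoke regions can be written as:

$$\mathrm{PM}_{s,t} = \mathrm{PM}^{\mathrm{fire}}_{s,t} + \mathrm{PM}^{\mathrm{bg}}_{s,t}$$

$$\text{No-smoke region:}\quad \mathrm{PM}_{s,t} = \mathrm{PM}^{\mathrm{bg}}_{s,t} = f_{\mathrm{ns}}\left(\mathrm{CMAQ}^{\mathrm{bg}}_{s,t},\ \mathbf{X}_{s,t}\right)$$

$$\text{Smoke-impacted region:}\quad \mathrm{PM}_{s,t} = f_{\mathrm{sm}}\left(\mathrm{CMAQ}^{\mathrm{tot}}_{s,t},\ \mathbf{Z}_{s,t}\right)$$

where $\mathrm{PM}_{s,t}$, $\mathrm{PM}^{\mathrm{fire}}_{s,t}$ and $\mathrm{PM}^{\mathrm{bg}}_{s,t}$ denote the ground-measured PM2.5 concentration, the fire component of PM2.5 and the non-fire background PM2.5 at location $s$ on day $t$, respectively, and $f_{\mathrm{ns}}$ and $f_{\mathrm{sm}}$ denote the random forest models for the no-smoke and smoke-impacted regions. For the model in the no-smoke region, $\mathrm{CMAQ}^{\mathrm{bg}}_{s,t}$ is the CMAQ-simulated background PM2.5 at location $s$ on day $t$, and $\mathbf{X}_{s,t}$ is a vector of additional predictors, including gap-filled MAIAC AOD, meteorological factors, cloud fractions, land cover and climate region types, as listed in Table S1. For the model in the smoke-impacted region, $\mathrm{CMAQ}^{\mathrm{tot}}_{s,t}$ is the CMAQ-simulated total PM2.5 at location $s$ on day $t$, while $\mathbf{Z}_{s,t}$ includes the HMS data and all predictors in $\mathbf{X}_{s,t}$.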

Model Performance
The same monthly aggregation (Figure S3) was applied to the spatial and temporal CV, and the R² of both the smoke-impacted and no-smoke models improved as a result, as shown in Table S2. After aggregating the overall CV to the annual level, the R² between all predictions and AQS measurements is 0.9, implying high accuracy of the model predictions. As for variable importance, CMAQ is the most important predictor in both the smoke-impacted and no-smoke models, and AOD and wind are the parameters ranked in the top five in both models (Figure S4).
Spatiotemporal Patterns of Smoke PM2.5 across the CONUS
In the Southeast, annual smoke PM2.5 levels up to 9 µg/m³ were common in Alabama, Georgia, and the Carolinas. Fire smoke also contributed significantly to elevated PM2.5 levels in Georgia and Florida in 2010 and 2017. In addition, air quality in the Midwestern states was periodically affected by fire smoke. For example, approximately half of Texas, Oklahoma and Kansas showed detectable fire smoke impact in 2010, 2011, and 2017, with high smoke PM2.5 levels observed over large cities such as Dallas, Austin and San Antonio. PM2.5 levels in the states around the Great Lakes and in the Northeastern U.S. were rarely affected by fire smoke during our study period.
Conducting large-scale epidemiological studies to investigate the impact of fire smoke on human health has been challenging, largely because of the difficulty in estimating spatially resolved exposure to fire smoke PM2.5. Recently, a few modeling studies of smoke PM2.5 concentrations in the CONUS have been conducted with spatial resolutions ranging from 10 to 15 km (17,39). Using machine-learning models such as those presented in this study allows the integration of CTM fire simulations, high-resolution satellite remote sensing of fire smoke, and the broader spatial representation of the PurpleAir sensor network to achieve high spatial resolution (1 km), high temporal resolution (daily), and full coverage of the CONUS for a 12-year period. The temporal trend and spatial characteristics of our model-predicted smoke PM2.5 concentrations align with major fire events across the country. For example, data from the National Interagency Fire Center (40) showed that fire activities in Southern California, eastern Texas, and southern North Carolina and Tennessee in 2007 were 125% and 121% of the previous 10-year average, respectively. The acres burned in the Rocky Mountains were 367% and 351% of the previous 10-year average in 2012 and 2017, respectively, and our model successfully captured these features. Compared with uncalibrated CMAQ simulations of smoke PM2.5 (Figure S5, Panel A), our predictions better represent the spatial and temporal distribution of fire smoke. For instance, our model captured the high smoke PM2.5 values in the West and Southeast during extreme fire years such as 2007 and 2018 (Fig. 2), and the low smoke PM2.5 values in 2015, which follow the same temporal trend as reported by the National Interagency Fire Center (40). In addition, our model was able to capture finer spatial features more accurately because of its high spatial resolution of 1 km. Compared with previous smoke PM2.5 estimates at coarse resolution, our predictions provided a clearer boundary of the smoke-impacted areas and captured detailed variability in population exposure levels. As illustrated in Figure S6, the population within a 100 km² area in Sacramento, California could be assigned 100 unique smoke PM2.5 values based on location rather than one average value, which makes high-resolution health impact studies feasible.
To the best of our knowledge, our study is the first large-scale attempt to use calibrated PM2.5 concentration measurements from low-cost sensors such as PurpleAir monitors in conjunction with AQS monitors to better characterize the spatial variability of smoke PM2.5.
Previous research has shown that low-cost sensor measurements can increase the likelihood of detecting wildfire smoke (21,22), and integrating low-cost sensor data with regulatory measurements has allowed for better training of satellite-based machine learning models for identifying air pollution hotspots (26, 41). In our study, PurpleAir sensors reported extreme PM2.5 concentrations over 200 µg/m³ during the Camp Fire in California, while the highest AQS measurement was approximately 100 µg/m³ because there were no AQS monitors located near the smoke plumes. Including the high PM2.5 measurements from PurpleAir in our training dataset reduced the model's underestimation of high PM2.5 values. For instance, the smoke PM2.5 prediction from models without PurpleAir (Figure S5, Panel B) was biased low in California, where high smoke PM2.5 values always occurred, and the difference in annual smoke PM2.5 predictions between models with and without PurpleAir measurements reached up to 16 µg/m³ in 2018. Unlike earlier studies, which attributed the deviation from background levels of PM2.5 to smoke using ground total PM2.5 measurements, satellite-based smoke plume identification, and air trajectories (17,39), we employed two different CMAQ simulations, with and without fire emissions, along with satellite-based HMS smoke contours to more accurately label smoke-impacted areas and days. Our approach facilitates independent modeling of both background PM2.5 and total PM2.5 accounting for smoke impact nationwide.
Effect of Fire Smoke on National PM2.5 Concentration Levels
Using our daily model predictions, we assessed the impact of fire smoke on the regulatory air quality monitoring network. We defined a smoke impact day as a day when fire smoke contributed more than 25% of the model-estimated daily total PM2.5 mass concentration at the location of an air quality monitoring station included in the EPA AQS. Daily PM2.5 concentrations at ~40% of the 1836 AQS monitoring sites were significantly affected by smoke for more than a month each year during our study period (Fig. 3). In 2009 and 2010, when our model predicted the lowest smoke impact on national PM2.5 levels, over 25% of the national ambient PM2.5 monitoring network was under significant smoke impact for more than a month. In intensive fire years such as 2017, 50% of all monitoring locations were affected for at least a month, indicating a widespread impact at the national scale. During the worst fire year of 2007, 25% of all monitoring locations were affected for more than 90 days. Smoke impact on air quality was highest in summer and fall in most years. However, in low fire years such as 2009 and 2010, fire smoke had the greatest impact in spring and fall.
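The smoke impact day definition above translates directly into a counting rule. The Python sketch below counts smoke impact days at one monitor and the share of monitors affected for longer than a chosen duration. The 25% threshold comes from the text; the function names and the use of 30 days to represent "more than a month" are assumptions of this sketch.

```python
SMOKE_FRACTION_THRESHOLD = 0.25  # smoke share of daily total PM2.5 (from the text)


def count_smoke_impact_days(daily_records):
    """Count days on which smoke exceeds 25% of the daily total PM2.5.

    daily_records: iterable of (smoke_pm, total_pm) pairs for one monitor.
    """
    days = 0
    for smoke_pm, total_pm in daily_records:
        if total_pm > 0 and smoke_pm / total_pm > SMOKE_FRACTION_THRESHOLD:
            days += 1
    return days


def share_of_monitors_over(impact_days_by_monitor, min_days=30):
    """Fraction of monitors with more than `min_days` smoke impact days.

    Assumption: "more than a month" is represented as more than 30 days.
    """
    if not impact_days_by_monitor:
        return 0.0
    over = sum(1 for d in impact_days_by_monitor.values() if d > min_days)
    return over / len(impact_days_by_monitor)
```

Applying `count_smoke_impact_days` to each monitor's yearly predictions and passing the results to `share_of_monitors_over` gives the kind of annual network-level percentage reported in this section.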
AQS's Representativeness of Population Exposure to Fire Smoke
Using our model predictions and annual population data at 1 km resolution, we estimated the U.S. population affected by fire smoke. As shown in Table 1, nearly the entire population in the CONUS, ranging from 95% in 2018 to 100% in 2007, has been exposed to fire smoke. On average, a slightly higher percentage of people living outside the vicinity of an EPA AQS monitoring station (defined by a 5 km radius) has been exposed to fire smoke. The average duration of population exposure to fire smoke showed a more substantial difference. On average, people living outside the vicinity of an AQS monitoring station experienced 25.2 smoke impact days, 36.5% (ranging from −8% in 2018 to 70% in 2012) more than people living near an AQS station. While the mean model-estimated total PM2.5 concentration in regions near an AQS station (10.79 µg/m³) is significantly higher than that in regions without AQS coverage (8.87 µg/m³), the estimated smoke PM2.5 concentration shows the opposite pattern (0.50 µg/m³ vs. 0.65 µg/m³). Since the majority of AQS stations are located in urban areas, these findings suggest that using EPA observations alone may substantially underestimate both the duration and the concentration of fire smoke exposure for the rural and suburban population. In January 2023, the U.S. EPA proposed to lower the NAAQS for annual mean PM2.5 concentrations, calculated as the average of the past three years, to a value between 9 µg/m³ and 10 µg/m³. We estimated the total population as well as the number of AQS monitoring sites that would reside in nonattainment areas under the new standard (Tables S3 and S4).

Figure 2
Figure 2 presents spatial distributions of annual mean smoke PM2.5 in the CONUS from 2007 to 2018. While the Western U.S. has seen a significant and more persistent impact of fire smoke on PM2.5 levels, other regions including the Midwest and the Southeast have also suffered high smoke PM2.5 in certain years. For example, annual average smoke PM2.5 concentrations over 8 µg/m³ occurred in California, Oregon, and Washington in 2007-2009, 2011, 2013, 2017 and 2018, and over 50% of the areas in these states were impacted by fire smoke during these years. Along the California coast and in the Central Valley, annual average smoke PM2.5 concentrations exceeded 12 µg/m³ in 2007, 2017 and 2018. We observed the highest annual average wildfire smoke PM2.5 level, 25 µg/m³, north of Ventura County in Southern California in 2017. Other Western states such as Idaho, Montana, Utah, Colorado, Arizona, and New Mexico have been affected to a lesser degree, with annual mean smoke PM2.5 levels ranging between 0 and 5 µg/m³. The region second most affected by fire smoke is the Southeast. For example, annual smoke PM2.5 levels up to 9

Figures
Figure 1 Flow
The R² of the overall, spatial and temporal CV of the smoke-impacted model is 0.75 (RMSE = 4.59 µg/m³), 0.59 (RMSE = 5.88 µg/m³) and 0.67 (RMSE = 5.18 µg/m³), respectively, indicating good model performance in fire grids. For the no-smoke model, the R² of the random, spatial and temporal cross-validation is 0.68, 0.47 and 0.63, with RMSE of 3.35 µg/m³, 4.30 µg/m³ and 3.59 µg/m³, respectively, which indicates satisfactory performance of the random forest model for background PM2.5. As shown in Figure S2, the random forest models slightly overestimated low PM2.5 concentrations and underestimated high PM2.5 values, especially when the daily PM2.5 concentration exceeded 100 µg/m³. After aggregating the daily PM2.5 predictions to the monthly level, the R² of the smoke-impacted and no-smoke models in the overall 20-fold CV increased to 0.84 and 0.78, respectively, indicating that the estimation bias is random. Scatter plots for the aggregated monthly CV are shown in Figure S3.

Table 1
Fire smoke impact on the U.S. population.

Impact of Fire Smoke on Attainment Status with the Proposed New PM2.5 Standard
Without considering the impact of fire smoke, an average of 116.83 million people (from 68.73 million in 2016 to 148.74 million in 2013) and 30% of all AQS monitoring sites (from 15% in 2017 to 40% in 2011) in the CONUS would be in areas with annual mean PM2.5 concentrations equal to or above 10 µg/m³. When we considered the fire smoke contribution to PM2.5 levels, an additional 21.4 million people and 6% of AQS monitors would reside in nonattainment areas. Under the stricter standard of 9 µg/m³, the average affected population would increase to 167.23 million without considering the effect of fire smoke, and to 197.68 million (ranging from 153.73 million in 2016 to 225.27 million in 2013) with the contribution of fire smoke. Regarding air quality monitoring, an average of 41% of all AQS monitoring sites would fall into nonattainment areas. When the contribution of fire smoke was considered, this percentage rose to 50% (ranging from 37% in 2016 to 58% in 2011 and 2012). As increasing regulation of emissions of PM2.5 and its precursors from anthropogenic sources has effectively improved air quality in most parts of the U.S., fire emissions are becoming a major contributor to PM2.5. The proximity of large populations to wildland fires poses a nontrivial threat to public health and to compliance with ambient air quality standards. According to EPA (42), approximately 20.9 million Americans (2010 population) reside in PM2.5 nonattainment areas based on the current NAAQS as of 2023. Our model estimated that 95.9 to 146.3 million more people would live in nonattainment areas if the annual mean PM2.5