Atmospheric visibility refers to the maximum horizontal distance at which the human eye can adequately perceive and recognize a prominent dark object against its background through sufficient visual contrast (Watson 2002). It is a routinely measured meteorological parameter at standard surface weather stations. Visibility is an important meteorological parameter because its degradation can affect various modes of transportation (e.g., traffic congestion, delayed flights, and on-road accidents) and decrease tourism and recreational activities due to disruption of scenic beauties (Watson 2002). In a polluted environment, visibility degradation is mainly caused by the scattering of absorption of light of particulate matter and gaseous pollutants in the atmosphere. Due to this direct linkage with air pollution, it is used as a surrogate for particulate matter (PM) for long-term air quality studies when long-term air quality data is not available (Aman et al. 2019; Singh et al. 2017). In addition to air pollutants, it is also dependent on various meteorological variables either directly or indirectly (Aman et al. 2019, 2023; Singh et al. 2017; Majewski et al 2022). As most of the aerosols are hygroscopic, an increase in humidity can lead to an increase in aerosol size, and this increases light scattering. Other meteorological variables also affect visibility indirectly by affecting aerosol concentrations (Watson 2002). Greater Bangkok (GBK) is one of the largest urban and industrial agglomerations in Southeast Asia. It consists of Bangkok, the capital of Thailand, and five surrounding provinces. Elevated levels of PM2.5 during the dry season (November to April) are experienced in GBK every year as reported by PCD (2023). This is linked to emissions from both industrial and vehicular sources (ChooChuay et al. 2020; Narita et al. 2019), as well as biomass burning in the area (ChooChuay et al. 2020; Narita et al. 2019; Phairuang et al. 2019). Additionally, stagnant meteorological conditions induced by cold surges and sea breezes in GBK contribute to the persistence of this pollution (Aman et al. 2020, 2023).
The increase in particulate pollution has also led to a decrease in visibility in GBK (Aman et al. 2022). Visibility and its relationship with particulate pollution and meteorological variables in Bangkok have been investigated by many researchers using observational analysis as well as numerical modeling (Vajanapoom et al. 2001; Ruangjun and Exell 2008; Aman et al. 2022; Lee et al. 2017). Vajanapoom et al. (2001) reported a linear inverse relationship between visibility and PM10 in Bangkok. Ruangjun and Exell (2008) developed a regression model using meteorological variables to predict fog and visibility at Don Muang Airport in Bangkok. Lee et al. (2017) used the WRF-Chem model to investigate the effect of biomass-burning aerosol on low-visibility events in Bangkok and other cities in Southeast Asia. Aman et al. (2022) reported two visibility events in the winter of 2014 and 2015 affected by the synergetic effect of particulate pollution and meteorology. Recently, visibility-related studies across the world have started using machine learning (ML) models for visibility prediction using particulate matter and meteorological variables as input data (Kim et al. 2022a; Kim et al. 2022b; Penov and Guerova 2023) and the importance of different features is identified by feature importance plots (Kim et al. 2022a; Penov and Guerova 2023). However, feature importance has limited insights, as it does not give information about their directional relationship with visibility and interactive effects of different variables. Also, it is important to quantify and separate the effect of meteorology on visibility to support air quality-related studies with visibility data. Traditionally meteorological adjustment for visibility or particulate pollution has been done using statistical models (Aman et al. 2019; Barmpadimos et al. 2011). More recently, Grange et al. (2018) developed a meteorological normalization technique to quantify and separate the effect of meteorology on air pollutants (discussed in Section on Meteorological normalization). This method has been used in multiple studies to quantify the effect of meteorology on different air pollutants (Grange et al. 2018; Qu et al. 2020; Wang et al. 2022). To understand the directional relational and interactional effect of different variables, it is important to understand the effect of different variables on visibility for each instance. This is done by first developing the ML model for visibility prediction and explaining each prediction by coupling it with another mathematical concept which is combinedly referred to as explainable machine learning (XML). Shapley additive explanation (SHAP) is one such mathematical concept from cooperative game theory which is coupled with any ML model to explain and quantify the contribution of each predictor variable for each instance of prediction (Lundberg and Lee 2017). This helps in developing a better understanding of factors affecting air pollution overall and for episodical investigation (Wu et al. 2022a; Hou et al. 2022; Wang et al. 2023a; Wang et al. 2023b) or visibility (Yao and Li 2023).
Various studies have used ML models for PM2.5 prediction in GBK and Thailand (Gupta et al. 2021; Thongthammachart et al. 2023; Wongnakae et al. 2023; Aman et al. 2024) as well as other parts of the world (Hu et al. 2017; Fathollahi et al. 2023; Wei et al. 2021; Tian et al. 2023; Kumar et al. 2023). However, no study has explored ML models for visibility prediction or used methods like meteorological normalization and XML to quantify and understand the effect of meteorology on visibility degradation in GBK. Based on above stated facts and motivation, the objectives of this study are as follows: a) evaluation and comparison of different ML models for visibility prediction, (b) quantification of the effect of meteorology on visibility using ML model-based meteorological normalization, c) understanding of drivers of the visibility degradation using game theory-based XML method i.e, Shapley method. The ML models use different air pollutants, meteorological data, and time-related variables as input variables. The best-identified ML model was used for both the meteorological normalization technique and for integrating with the XML method to identify important variables affecting the visibility and the effect of hygroscopic growth of aerosols on visibility.
Study area and general climate
Greater Bangkok consists of the capital city of Thailand i.e., Bangkok (13.7°N, 100.5°E), and its five neighboring provinces (Samut Prakan, Nonthaburi, Pathumthani, Nakhon Pathom, and Samut Sakhon) (Fig. 1). It is located in the lower part of Central Thailand with a geographical area of 7762 km2 and a registered population of 11 million (DOPA 2023). It serves as Thailand's economic hub, contributing approximately 47% to the nation's overall gross domestic product (NESDB 2022). The cityscape is characterized by an intricate blend of commercial, residential, agricultural, and industrial zones (LDD 2016), situated across a predominantly flat terrain with an average elevation of less than 10 meters above mean sea level. The prevailing climate in GBK is tropical and humid, influenced by the northeast monsoon (November-February) and the southwest monsoon (May-October) (TMD 2023). The northeast monsoon introduces cool, dry air from continental mid-latitudes, marking the winter season, while the southwest monsoon brings moist air from the Gulf of Thailand and the Indian Ocean, leading to widespread rain, categorizing it as the wet or rainy season. The transitional period (March-April) between these monsoons typically experiences higher average temperatures, constituting the summer season. The winter and summer seasons are collectively known as the dry season.
Insert Fig. 1
Data collection and processing
Hourly human-observed visibility (VIS, km), temperature (TEMP, °C), relative humidity (RH, %), wind speed (WS, ms− 1), and direction (WD, deg.) data for the surface weather station at Don Mueang airport (T01) was obtained from the Thai Meteorological Department (TMD) (Fig. 1 and Table 1). Hourly air pollutants datasets were requested and obtained from PCD for seven air quality stations (P05, P08, P19, P27, P52, P53, P61; Fig. 1) for the years 2017–2022. These air quality monitoring stations are located in three provinces of GBK and have been used as a representative of air pollution level in GBK affecting visibility. The air pollutant datasets included PM2.5, PM10, Sulphur dioxide (SO2), nitrogen oxides (as the sum of nitrogen oxide and nitrogen dioxide, i.e., NOx = NO + NO2), Ozone (O3), and carbon monoxide (CO). It should be noted that all the air pollutants datasets are not available over all seven stations. All surface measurement data are from near-surface monitoring, except for winds at about 10 m above ground level (AGL). Surface pressure (SP), global radiation (GR), and rainfall (RN) data were downloaded for GBK from the 5th generation of the European Centre for Medium-Range Weather Forecasts (ECMWF), namely ERA5_LAND at a spatial resolution of 0.1°. Cloud cover (CC) and planetary boundary layer height (PBLH) data were downloaded from ERA5 at a spatial resolution of 0.25°. Both reanalysis datasets were obtained from the Climate Data Store (CDS; https://cds.climate.copernicus.eu) which is an operational service of Copernicus Climate Change Services implemented by ECMWF. The values for different meteorological variables from ERA5_Land and ERA5 were extracted for the location of Don Mueang Airport using the bilinear interpolation method. Only daytime data (here, 08 LT-17 LT) for visibility and other variables were considered here. In previous studies on visibility (Aman et al., 2019; Aman et al., 2022), visibility observations under certain meteorological conditions, e.g., high RH (RH > 90%), rain, fog, and mist, etc., were excluded to understand the effect of anthropogenic activities on visibility. However, in this study, we excluded visibility observation under rainfall but used data under all other weather conditions for visibility estimation. In our quality-checking, the following probable ranges or detection limits were applied the initial data: PM2.5 and PM10 (0 to 1,000 µg m–3), SO2 (0 to 1,000 ppb), NOx (0 to 1,000 ppb), O3 (0 to 1,000 ppb), CO (0 to 1,000 ppm), TEMP (–5 to 50°C), RH (0 to 100%), WS (0 to 50 m s–1), WD (0° to 360°), RN (0 to 1,000 mm h–1) and GR (0 to 1,000 W m–2) (Aman et al. 2023). No improbable values were found in the data. In addition to the air pollutants and meteorological variables, three time-related variables, namely the day of the season year (referred to here as Julian day or JD), day of the week (DOW), and hour of the day (HOD) were also considered. For the day of the season year, 1st November is assigned as day 1, and so on. The hour of day and day of the week were used as surrogates for the local traffic emissions while the day of the season year was used as proxies to account for the seasonal and long-term variability in visibility due to changes in emissions and not accounted for in the meteorological variables. HOD also represents hourly sky conditions due to changing position of the sun. To understand the variability of visibility and other variables on a monthly scale, its monthly average values by month were assessed. This helped us determine that visibility degradation is mainly a problem in the dry season (discussed in Result and Discussion) and hence was chosen for this study. The strength and direction of the association between visibility and other variables were examined with correlation analysis through a a correlation heatmap.
Table 1
Data and variables used in the study
Datasets | Product/Station Namea | Variableb | Spatial Resolution | Sourcec |
Meteorological data | T01 | VIS, TEMP, RH, WS, WD | In situ | TMD |
Air Quality data | P05, P08, P19, P27, P52, P53, P61 | PM10, PM2.5 SO2, NOx, O3, CO | In situ | PCD |
Reanalysis data | ERA5–LAND | SP, GR, RN | 0.1° × 0.1° | ECMWF |
ERA5 | CC, PBLH | 0.25° × 0.25° | ECMWF |
aT01: Don Muang Airport (WMO484560), P05 (Thai Meteorological Department Bang Na), P08 (Vocational Rehabilitation Center for Persons with Disabilities Phra Pradaeng), P19 (Bang Phil National Housing Authority), P27 (Samut Sakhon Witthayalai School), P52 (Metropolitan Electricity Authority Substation Thonburi), P53 (Chokchai Police Station Ladprao Road), P61 (Bodindecha (Sing Singhaseni) School. ERA5: European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis V5. bVIS: Visibility; TEMP: Air temperature; RH: Relative humidity; WS and WD: Wind speed and direction; PM10 and PM2.5: Particulate matter with size smaller than or equal to 10 µm and 2.5µm, respectively; SO2: Sulphur dioxide; NOx: Nitrogen oxides; O3: Ozone; CO: Carbon monoxide; SP: Surface Pressure; GR: Global Radiation; RN: Rain;; CC: Cloud cover; and PBLH: Planetary boundary layer height. cTMD: Thai Meteorological Department, PCD: Pollution Control Department, ECMWF: European Centre for Medium-Range Weather Forecasts. All datasets are are hourly and available for season year 2017–2022 (November 2016 to October 2022). |
Insert Table 1
Visibility prediction with machine learning (ML) models
Visibility was estimated using the following six individual ML models: random forest (RF), adaptive boosting or AdaBoost (ADB), gradient boosting (GB) and extreme gradient boosting or XGboost (XGB), cat boosting or Catboost (CB), and light gradient boosting machine or LightGBM (LGB). The technical details of these individual ML models are given elsewhere (Breiman 2001; Freund and Schapire 1999; Friedman 2001; Chen and Guestrin 2016; Prokhorenkova et al. 2018; Ke et al. 2017). A set of hyperparameter values for each ML model was selected based on author’s learned experience and judgment (here, Aman et al. 2024) and the literature review. Various combinations of hyperparameters were then explored to optimize visibility estimation (Table 2). The model validation process employed a nested cross-validation technique with a double loop. In the inner loop, hyperparameters underwent optimization using the Random Search Method and a 5-fold cross-validation. In the outer loop, a 10-fold cross-validation was performed, with 90% of the data used for training and 10% for testing the developed model. This procedure was repeated 10 times to ensure all data points should be utilized for both training and testing. Four statistical metrics namely correlation coefficients (ρ), mean bias (MB), mean absolute error (ME), and root mean square error (RMSE) were used for the evaluation of different ML models. These are mathematically defined as below:
\(\rho = \frac{1}{n-1}{\sum }_{i=1}^{n}\frac{(\stackrel{-}{O}-{O}_{i})}{{\sigma }_{o}}\times \frac{(\stackrel{-}{P}-{P}_{i})}{{\sigma }_{p}}\) | (1) |
\(MB= \frac{\sum _{i=1}^{n}({P}_{i}-{O}_{i})}{n}\) | (2) |
\(ME= \frac{\sum _{i=1}^{n}|{P}_{i}-{O}_{i}|}{n}\) | (3) |
\(RMSE= \sqrt{\frac{\sum _{i=1}^{n}{({P}_{i}-{O}_{i})}^{2}}{n}}\) | (4) |
Table 2
Hyperparameters used by ML model
Model | Hyperparameter | Space |
Random forest | n_estimator | [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000] |
max_depth | [3, 5, 7, 9, 10, 12, 14, 16, 18, 20] |
max_features | [“sqrt”, “log2”] |
min_samples_split | [2, 4, 6, 8, 10] |
min_samples_leaf | [1, 2, 4, 6, 8] |
criterion | ['squared_error'] |
Adaptive boosting | n_estimator | [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000] |
learning_rate | [0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1.0] |
loss | ["linear", "square"] |
Gradient boosting | n_estimator | [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000] |
max_depth | [3, 5, 7, 9, 10, 12, 14, 16, 18, 20] |
max_feaatures | ['sqrt', 'log2'] |
min_samples_split | [2, 4, 6, 8, 10] |
min_samples_leaf | [1, 2, 4, 6, 8] |
learning_rate | [0.1, 0.2, 0.4, 0.6, 0.8, 1.0] |
criterion | ['squared_error'] |
subsample | [0.5, 0.7, 0.9, 1] |
Extreme gradient boosting | n_estimator | [100, 200, 400, 500, 600, 700, 800, 900, 1000, 1200, 1400] |
col_sample_bytree | [0.4, 0.5, 0.6, 0.7, 0.8] |
max_depth | [3, 5, 7, 9, 10, 12, 14, 16, 18, 20] |
learning_rate | [0.01, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0] |
subsample | [0.5, 0.7, 0.9, 1] |
min_child_weight | [1, 3, 5, 7, 9] |
gamma | [0.1, 0.5, 1, 3, 5] |
Cat boosting | n_estimator | [100, 200, 400, 600, 800, 1000, 1100, 1400, 1500] |
col_sample_bylevel | [0.6, 0.8, 1] |
max_depth | [4, 6, 8, 10] |
learning_rate | [0.01, 0.03, 0.1, 0.2, 0.3] |
subsample | [0.6, 0.8, 1] |
min_child_samples | [1, 3, 5] |
max_leaves | [31, 62] |
l2_leaf_reg | [0.1, 1, 3, 5, 7, 10] |
Light gradient boosting | n_estimator | [100, 200, 300, 500, 700, 900, 1000] |
col_sample_bytree | [0.5, 0.8, 0.9] |
max_depth | [5, 10, 15, 20, 25, 30] |
learning_rate | [0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 1.0] |
subsample | [0.5, 0.7, 0.9, 1] |
min_child_samples | [10, 20, 30, 40, 50] |
num_leaves | [31, 60, 90, 100, 150, 200] |
Here, O̅ and P̅ are the means of observed and predicted values, Oi and Pi denote the individual observed and predicted values, and n is the number of data points. The model that showed the best performance in visibility estimation was selected and the final tuning of the model again using full datasets. The model development and hyperparameter optimization were done using the scikit-learn library (Pedregosa et al. 2011).
Insert Table 2
Meteorological normalization of visibility
Meteorological normalization is a technique that was first introduced by Grange et al. (2018) and later adopted by other studies with some adjustments to decouple the effect of meteorology on PM2.5. The decoupling was achieved by predicting PM2.5 for a specific time multiple times with randomly selected meteorological variables and then averaged to give normalized PM2.5. Here, we have adopted this method to decouple the effect of meteorology on visibility. Grange et al. (2018) and some other studies (Hou et al. 2022; Mallet 2021; Qu et al. 2020) predicted PM2.5 1000 times for each hour while Liu et al. (2022) repeated the prediction process 500 times. Vu et al (2019) proposed to normalize only weather conditions without normalizing seasonal and diurnal variation by randomly resampling meteorological variables only from the particular hour for which prediction is being made within 4 weeks (2 weeks before and 2 weeks after) for the entire study period. Wang et al. (2022) resampled both meteorological variables and time variables for each hour prediction which could not be used for comparing seasonal or diurnal variation with observed PM2.5. Wu et al. (2022b) compared three meteorological normalization methods: a) resample meteorological variables, b) resample meteorological and time variables, and (c) no resampling and found that resampling meteorological variables only gives the best results. Here, we have adopted this method to decouple the effect of meteorology on visibility. Here, Vu et al (2019) were followed and only meteorological variables were resampled keeping air pollutants and time-related variables constant. It can be represented as follows:
$${VIS}_{norm}= \frac{1}{1000} \times {\sum }_{i=1}^{1000}{VIS}_{, i \left(prd\right)}$$
5
Here, VISnorm is the meteorologically normalized visibility, and VISi, (prd) is the ML model predicted visibility for the ith set of predictor variables. The mean level of observed visibility and meteorologically normalized visibility were calculated. Next, the seasonal average of observed and normalized visibility were compared.
SHapley Additive exPlanation (SHAP): A game theory-based approach
The average importance of different predictor variables on visibility and their overall effect on the output of an ML model can be determined using feature importance plots and partial dependence plots, respectively. However, these methods can not investigate the complex interactive effects of different variables on visibility. Also, the contribution of different variables for each instance of prediction is not possible by either of the two plots. To address these two important aspects, this study used SHapley Additive exPlanation (SHAP) method which is a cooperative game theory-based approach given by Nobel laureate Lloyd Shapley. In this method, the Shapley value is calculated to quantify the contribution of each player to the outcome of any cooperative game (Lundberg and Lee 2017; Lundberg et al. 2020). In the context of ML model interpretability, Shapley values are used to attribute the contribution of each predictor variable to the prediction of a specific instance. Shapley values help to understand the impact of each predictor variable on a model's output by considering all possible combinations of predictor variables and calculating the average marginal contribution of the variable across all possible combinations. A positive Shapley value means a positive while a negative value means a negative effect on visibility. A detailed explanation of SHAP method can be found in Lundberg and Lee (2017). In this study, Shapley values of each predictor variable for each prediction were computed by the best-identified ML model i.e., the light gradient boosting model (see sub-section on ML model evaluation in Results and Discussions. Next, the mean Shapley values for each feature were calculated and represented by the SHAP feature importance plots for dry, winter, and summer seasons. A list of important predictor variables was identified based on mean SHAP values being higher than a threshold value (here set as 0.1). To understand the directional relationship of visibility with the selected important predictor variables and to understand the distribution of Shapley value for these predictor variables, SHAP summary plots (also called beeswarm plots) were used. In summary plots, the Shapley values (both positive and negative) for each feature and for each instance are represented but it still cannot help to understand the exact form of relationship between visibility and its predictor variables. To understand this, the dependence plots for Shapley values of the important predictor variables were also analyzed.
Effect of hygroscopic growth on visibility with SHAP approach
RH is considered an important predictor variable that can affect the visibility relationship with PM2.5 and PM10 due to the hygroscopic growth of aerosols leading to a change in light scattering (Watson, 2002; Singh et al. 2017; Aman et al. 2019). To understand this interaction, Shapley values for PM2.5 and PM10 were plotted against their respective concentrations under different RH intervals. A total of seven RH intervals: 0–40%, 40–50%, 50–60%, 60–70%, 70–80%, 80–90%, and 90–100% were used. In addition to the SHAP approach, an equal step-size method as adopted by Aman et al. (2019) was also used to understand the effect of the hygroscopic growth of aerosols. In this method, RH data was divided into 11 intervals: 0–40%, 40–45%, 45–50%, 50–55%, 55–60%, 60–65%, 65–70%, 70–75%, 75–80%, 80–85%, 85–90%, 90–95%, and 95–100%. The average values for original and meteorologically normalized visibilities, PM2.5 and PM10 were calculated for each interval. To understand the effectiveness of meteorological normalization on nullifying the effect of meteorology on visibility.