Spatio-Temporal Pollution Forecasting using Hybrid Networks

Rising real estate prices to set up infrastructure coupled with expensive maintenance costs and lack of spares during times of instrument failure have become major issues for statutory bodies when dealing with real-time pollution monitoring stations. As a possible solution to these problems, a novel class of hybrid spatio-temporal pollution forecasting networks which are a combination of various widely used temporal forecasting methods and spatial interpolation methods have been proposed in this paper. In addition, a novel multi-site Multi Layer Perceptron based Ensemble method, capable of improving accuracy by taking exogenous variables into account, has also been proposed as a spatial interpolation method in addition to popular techniques such as Ordinary Kriging (OK) and Inverse Distance Weighted (IDW). Experimental results based on the multi-site air pollution data of Beijing suggest that the proposed class of hybrid networks have been effective in predicting the pollution of unknown locations with the best root mean squared error (RMSE) being 18.475 for Holt-Winters + IDW combination. Moreover, the proposed novel MLP Ensemble method for spatial inter-Pritthijit polation has also been empirically shown to perform in the ± 15% range in terms of RMSE when used in combination with any temporal method discussed.


Introduction
Air pollution has become a major concern in recent times. Rapid industrialization and towering technological growth have increased the levels of particulate matter in the atmosphere like never before. High socio-economic activities have made cities the hotbed of pollution in recent times. To monitor the levels of pollution, governments set up numerous air quality monitoring stations in major industrial districts all throughout the city. The hourly data collected is then analyzed and used by the concerned government agencies to frame new policies accordingly. Due to increased consciousness in the implications of air pollution and its long term effects on society, a new field of research termed Urban computing 1 has now become important in the recent decades, involving the application of wireless networks, sensors, computational resources and finally data to solve the issues arising in densely populated areas.
Data retrieval and processing is one of the major engineering challenges when it comes to air quality analysis and poses changes in both the spatial and temporal domains. In the majority of the cities worldwide, the number of air quality monitoring stations are comparatively lower than required to create an accurate geo-spatial profile of the pollution levels in the city. High real estate prices and expensive maintenance costs can be attributed to this problem as the focal points of pollution of a city are most often in the industrial and busy districts, thus producing a major challenge in the effective spatial distribution of sensor stations. Regarding the temporal aspect of the problem, instrument failure turns out to be one of the inevitable challenges agencies face during air quality analysis. Due to instrument failure, historical data of certain time periods of various stations go missing and has to be imputed using several statistical techniques. In order to counter instrument failure, stations have to maintain backup monitoring devices, which further adds to the cost of maintenance of such air quality stations.
An extensive amount of research has been conducted to analyze the spatio-temporal variations of pollution in the recent decades. Sun et al. 2 investigated the pollution patterns of 5 year PM 2.5 data belonging to Jiangsu province in eastern China using Kriging and Hybrid Single Particle Lagrangian Integrated Trajectory (HYPLIT) models. Wang et al. 3 used the Box-Jenkins approach to predict pollution values of Long Beach in Los Angeles. Li et al. 4 proposed a hybrid CNN-LSTM model to forecast particulate matter concentration for the next 24 hours in Beijing. Although being crucial works in the field of pollution forecasting, the majority of the works in the field of air quality research are concentrated in the temporal aspect. The number of studies conducted to understand the spatio-temporal aspects have been relatively fewer.
Taking motivation from the recent advancements in spatial as well as the temporal aspects of the problem, this paper proposes a novel class of Hybrid Spatio-Temporal Pollution Forecasting Networks to counter the challenges mentioned before. The networks leverage the power of temporal forecasting by statistical and deep learning methods coupled with the effectiveness of geo-spatial and deep learning based interpolation techniques. Multi-site air-quality pollution data of Beijing 56 is used to demonstrate the performance of the proposed networks in this study.
The dataset is first segregated on the basis of different stations. The spatial coordinates of the stations are geocoded using Google Maps Geocoding API 7 . Stations are divided randomly into two sets -spatial train and spatial test. The proposed networks are trained on the stations mentioned in the spatial train set and the spatial interpolation evaluation is conducted on the stations present in the spatial test set. Within the spatial train set, the last one year is taken as the prediction horizon to evaluate the prediction accuracy of the temporal forecasting models. Various statistical approaches such as AutoRegressive (AR) 8 , AutoRegressive Integrated Moving Average (ARIMA) 9 and Holt-Winters 10 are used, whereas in deep learning models such as Stacked 11 and Bi-Directional LSTMs 12 are used to model temporal data. Geo-statistical methods such as Ordinary Kriging, Inverse Distance Weighted (IDW) along with a novel Multi-Layer Perceptron 13 Ensemble based approach, are used to perform spatial interpolation on the data as predicted by the temporal models to give us the required final predictions.
To summarize, the main contributions presented in this paper are as follows: -A novel class of hybrid networks are proposed to perform spatio-temporal forecasts based on historical real world data to predict the pollution values of unknown locations. -Display of a novel stacked ensemble spatial interpolation model, empirically shown to be capable of performing equivalently when compared to widely used methods such as Ordinary Kriging and IDW. -Demonstration through extensive evaluation on a real world dataset that the proposed networks perform efficiently on the data.
The rest of the paper is organised as follows: In Section 2, relevant studies conducted in this field are discussed. Section 3 consists of the problem formulation along with a detailed description of the various methods used in the proposed class of networks. The results of the study are laid out in great detail in Section 4, along with a comprehensive discussion on the performance of the models. The conclusions drawn from the findings are ultimately presented in Section 5.

Related Works
This section deals with the recent research studies conducted in the field of air quality analysis to model pollution data using different novel spatio-temporal models .
Lindström et al. 24 developed a model using spatial and spatio-temporal covariates. They combined data from several different monitoring stations along with Geographic Information System (GIS) covariates to produce NO x predictions in the area around Los Angeles. Zheng et al. 25 used a semisupervised learning approach to model the pollution data. A spatial classifier based on an artificial neural network (ANN) was used to model the spatial correlation between air qualities of different locations. To predict the temporal data, a linear chain conditional random field (CRF) was used to model the temporal dependency of air quality in a location. A previous study by Nath et al. 26 resampled the daily pollution data into monthly to better forecast the long term trends of temporal data in the city of Kolkata. Das et al. 27 performed a comparitive review of high granular and short term time-series forcasting of PM 2.5 pollutant.
Self Organizing Maps (SOM) were used by Chang et al. 28 to extract and distinguish the spatio-temporal features of PM 2.5 concentrations of Taiwan. Liu et al. 29 proposed Spatio-Temporal Extreme Learning Machine (STELM) method to construct a prediction model based on air quality and meteorological data of Beijing. Abirami et al. 30 presented a hybrid spatio-temporal deep learning framework based on CNN-LSTM to model PM 2.5 concentrations based on data collected from different IoT based air quality sensors. Used STNN with an encoder-decoder architecture and region-based spatial dependencies and external factors to improve accuracy. Das et al. 17 2017 Fuzzy Bayesian Network with incorporated Residual Correction mechanism (FBNRC) Proposed a hybrid probabilistic model which reduces uncertainty during parameter learning process and compensates for the unknown factors during prediction of variables. Zhou et.al 18 2019 Knowledge Graph and deep Spatio-Temporal Convolutional Neural Network Used features from a Graph convolutional network as the input to a spatio-temporal convolutional neural network to accurately forecast traffic congestion. Soh et al. 19 2018 Adaptive Deep Learning based Prediction Model Used a combination of artificial neural network (ANN), a convolutional neural network (CNN), and a long-short-term memory (LSTM) to extract spatial-temporal relations for better performance. Bui et al. 20 2020 Spatio-Temporal Prediction using Multimodal Approach Multimodal fusion of critical factors were used to predict future air quality levels thereby reducing the mean absolute errors of PM 2.5 prediction. Fan et al. 21 2017

Deep RNN based Prediction Framework
Used missing value processing algorithms and (DRNN) for prediction thereby outperforming baseline models. Szpiro et al. 22 2008 Bayesian hierarchical spatiotemporal model Used spatial regression structure in a spatiotemporal model implemented using an optimized Markov Chain Monte Carlo (MCMC) sampling algorithm to estimate concentration of air pollutants. Sahu et al. 23 2006 Spatio-Temporal Modelling of Fine Particulate Matter A fully Bayesian model was implemented, using MCMC techniques, enabling inference with regard to process unknowns as well as predictions in time and space. Table 1: Summary of spatio-temporal models proposed by researchers in recent decades A novel deep convolution neural network model named Multi-Scale Spatial Temporal Network (MSSTN) was proposed by Wu et al. 31 to better discover multi-scale spatial temporal patterns. MSSTN was used to make hourly PM 2.5 concentration predictions jointly for a number of cities based on the training done on Urban Air Pollution Datasets in North China.
Kloog et al. 32 proposed the use of recent advances in Moderate Resolution Imaging Spectroradiometer (MODIS) satellite data processing algorithms to use high resolution Aerosol Optical Depth (AOD) data which is then used to train generalized additive mixed models with spatial smoothing to generate grid predictions.
An Extended Spatio-Temporal Granger Causality model was proposed by Zhu et al. 33 to analyze the causality relations present in air quality data using an efficient solution to overcome the computational complexity to process massive volumes of data along with an adaptation of the grid-based algorithm used to other non-grid based applications. Other related works are summarized in Table 1.
The merits of the class of networks proposed in this paper can be enumerated in the following ways. First, widely used temporal methods such as AR, ARIMA, Holt-Winters are used to produce temporal predictions. Since, pollution data follows repeated patterns, statistical and LSTM based methods perform efficiently compared to other temporal forecasting methods. Second, in the multi-site air-quality pollution data of Beijing 56 , inter-station distances being closer in the order of magnitude of 10 1 kilometers cause geo-statistical methods such as Ordinary Kriging and IDW methods to produce projections very close to actual levels if measured accordingly. Third, a novel Multi-Layer Perceptron (MLP) Ensemble based approach for spatial interpolation is also proposed which could handle data containing exogenous variables to further increase the accuracy of the model. The MLP Ensemble of various station data is shown to perform equivalent to existing geo-statistical methods mentioned before .   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 3

Problem Formulation and Proposed Approach
This section deals with the formulation of the problem statement followed by a detailed discussion of the approach along with the steps undertaken to build the model for this study.

Problem Statement
Given a set of n known air quality monitoring stations S = for the i th station over a duration of t timesteps in the past, our intention is to find the pollution data P j = {w j t+1 , w j t+2 ,..., w j t+t 0 } over the next t 0 timesteps for each j th location belonging to the set of m unknown locations Here, v and w denote actual and predicted pollution values respectively

Proposed Approach
The approach undertaken in this study can be divided into two folds. First, the pollution data for the i th station, ..,S n } over the next t 0 timesteps are found out using one-step ahead prediction. These predictions are then taken as input for the spatial interpolation methods to calculate the pollution values P j = {w j t+1 , w j t+2 ,...,w j t+t 0 } for each unknown location S j

Temporal Modelling
For temporal modelling purposes, different widely used statistical methods such as AR 8 , ARIMA 9 , Holt-Winters 10 and deep learning methods such as Stacked 11 and Bi-Directional LSTMs 12 are used. The approach for carrying out one-step ahead predictions using a temporal model Ω is laid out in Algorithm 1.

Algorithm 2: INVERSEDISTANCEWEIGHTED
Input: P out = {P out 1 , P out 2 ,...,P outn }, S, S 0 j Output: P j = {w j t+1 , w j t+2 ,...,w j t+t 0 } 1 distances generateDistanceMatrix(S, S 0 j ) 2 weights 1/(distances) 2 3 normalWeights weights/sum(weights) 4 sptPredictions P out · normalWeights 5 return sptPredictions Algorithm 3: ORDINARYKRIGING The model Ω is re-trained in each run on new data. The new data is created by appending the predicted data in the i 1 th run to the existing data P i in provided as initial input. For each run, the resulting prediction w i k 8 k 2 {t + 1,t + 2,...,t + t 0 } is appended to the result set P i out produced as the required output. This output so produced serves the basis for the spatial interpolation process in the following subsections.

Geo-Statistical Spatial Interpolation
Spatial interpolation is carried out in two ways in the study. One way is using geo-statistical methods such as Inverse   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63 64 65 append(X, cell) 10 y append(y, pollValue) 11 return X, y Distance Weighted (IDW) and Ordinary Kriging, while the other way is by using the proposed multi-site MLP Ensemble. Methods such as Ordinary Kriging (OK) and Inverse Distance Weighted (IDW) have been used due to the wide variety of applications in spatial interpolation purposes. Recent studies such as by Kostopoulou 34 and Apriyono et al. 35 use methods such as OK and IDW to aid in interpolation of spatial features when the datapoints are sparsely distributed. The approaches for geostatistical spatial interpolation using IDW and Ordinary Kriging as discussed are laid out in Algorithms 2 and 3.
In case of the IDW approach, a distance matrix is calculated first, based on the spatial co-ordinates of S and S 0 j .A power parameter, in our case, a value of 2 is chosen, thereby implying that the square of the inverse of the distances will be used as the weights. The weights are further normalized by dividing with the sum. The normalized weights so calculated are then multiplied using the scalar product operation, with the temporal predictions P out from different stations to ultimately produce the final results P j .
Ordinary Kriging uses a different approach to solve the spatial interpolation problem. A variogram is used to model the spatial dependence between different pairs of points situated at different distances. Rows of values based on the the pollution value P out k i and the spatial coordinates of stations S k 8 k 2 {1, 2, 3,...,n} are created which are used to fit the variogram. The output for each i th timestamp is calculated by executing the grid based on the spatial values of the unknown station S 0 j .

Spatial Interpolation using MLP Ensemble
The multi-site ensemble proposed in this study is based on a deep learning approach to model spatial relations in the data and use the spatial relations so found out to help in producing required predictions. The novel ensemble also aids in the investigation of the closeness in terms of efficacy when compared to tried and tested approaches such as OK and IDW. In this study, a stacked ensemble of various station based MLP deep learning models are used for modeling the data to produce the final predictions. The stacked ensemble learning approach is illustrated in Algorithm 4. In the ensembling stage, the processed datasets containing X and y output from the data modeling stage is used to train a MLP based deep learning model Φ i for each fixed station S f . As illustrated in Fig.1, all intermediate models Φ i 8 i 2 {1, 2, 3,...,n} are then used to produce predictions which are used to train a linear regressor based meta-learner Ψ , to produce the final predictions.

Model Building
The steps followed to build the model are illustrated in Fig. 2. In step 1, the multi-site air quality data is split into two components namely, Spatial Train and Spatial Test. Spatial Train consists of known stations denoted by S = {S 1 , S 2 ,...,S n } while Spatial Test consists of unknown stations denoted by S 0 = {S 0 1 , S 0 2 ,...,S 0 m }. In the following step, the time series pollution values belonging to Spatial Train are subdivided further into Temporal Train and Temporal Test (also known as the Prediction Horizon) for the training and evaluation of temporal methods such as AR 8 , ARIMA 9 , Holt-Winters 10 , Stacked 11 and Bi-directional LSTMs 12 .
Step 3 involves the carrying forward of the predictions P i out 8 i 2 {1, 2, 3,...,n}, produced by the temporal methods mentioned earlier. The predictions P i out are then fed into various spatial-interpolation methods such as IDW, Ordinary Kriging and the proposed stacked MLP ensemble to produce the final predictions P j 8 j 2 {1, 2, 3,...,m} needed. The resulting final predictions P j from the spatial-interpolation models are then used to carry out the final evaluation w.r.t the ground truth values present in Spatial Test as pointed by the arrow for step 4. The extensive analysis of different combinations of temporal and spatial interpolation methods are reported later in the following section of this study.

Evaluation and Results
In this section, the performance of the proposed class of networks is presented along with the discussion on the evaluation metrics and the parameters.

Experimental Setup
All calculations have been performed on a test bench consisting of a 6C/12T Ryzen 5 3600 CPU clocked at 3.6 GHz coupled with a 16 GB 3000 MHz DDR4 RAM and a 1 TB NVMe SSD. In case of deep learning based models, an Nvidia RTX 2070 Super GPU is used as a hardware accelerator in order to increase parallelism in matrix related calculations.  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63 64 65 The code for the development of this study has been written in Python 3.8 36 , due to the rich ecosystem consisting of several highly efficient libraries such as NumPy 37 , Tensor-Flow 38 , StatsModels 39 and Sci-kit Learn 40 to reduce overall complexity of the code without any trade off with regards to performance. The operating system by choice has been Ubuntu 20.04 LTS due to increased support for the developed code in platforms like Linux.
For carrying out the performance analysis of the models, the dataset containing 12 stations is first split randomly in half to create sub-datasets for training and evaluation purposes respectively. Due to the presence of temporal methods, the last year of both datasets is taken as the prediction horizon. The results presented are the final predictions performed by the spatial interpolation methods, on the one step ahead predictions produced by the temporal models. The metrics are evaluated by comparing the spatially interpolated predictions w.r.t. the actual last year values for all unknown stations in the test set.

Data Description
The dataset used for this study is the Beijing multi-site airquality pollution data 5 donated to the UCI Machine Learning Repository 41 by Zhang et al. 6 . The dataset contains multisite pollution data from 12 different government controlled air quality monitoring stations scattered all throughout the capital city of Beijing. 6 pollutants such as PM 2.5 , PM 10 , SO 2 , NO 2 , CO and O 3 are reported with hourly frequency in the dataset, out of which this study concerns only with the PM 2.5 data. In addition to pollution data, other meteorological factors such as temperature, pressure, dew point, precipitation, wind speed and wind direction are also present in the dataset, of which all are currently out of scope for the current version of the study. As the geographical locations of the air quality monitoring stations are absent, the required coordinates for each location have been externally found out using the Google Maps Geocoding API 7 . Missing values (if present) in the data for each station are imputed using Mean Before After and Window based Last Observation Carried Forward (LOCF) techniques depending on the time period of the values missing. The data ranges from 1st March, 2013 to 28th February, 2017 thus, accumulating to a total of 1461 rows for each station when resampled into daily frequency by taking the mean.

Evaluation Metrics
The performance of different models of the proposed class of networks is based on Root Mean Squared Error (RMSE) and Mean Average Error (MAE) in contrast to the Test Set Mean. RMSE causes the effect of each error to be proportional to the size of the squared error, thereby causing larger errors to disproportionately have a larger effect on the RMSE. The Test Set Mean is the mean of all the values present in the testing set, thus helping in providing a perspective of the errors made during prediction.
In addition, relative improvements in the performance of all temporal models having the same spatial interpolation method is also studied to analyze how much the models outperform when compared with the worst performing ones.
whereŷ t is the prediction made by the model and y t is the actual value at instant t. Here T denotes the count of the number of time-series samples.
Due to the high training time for deep learning models, one epoch was used to retrain the model during each one step ahead forecast.
The ability for hybrid pairs containing Holt-Winters 10 to outperform all others stems from the advantage of using triple exponential smoothing for the level, trend and seasonal components of the data. Since exponential smoothening mod-els are based on the description of trend and seasonality, it performs better than hybrid pairs using AR 8 and ARIMA 9 as those are built to describe the correlations in the data.
From Table 2, it can be seen that the worst performing combinations out of all, are those hybrid pairs involving AR 8 as the temporal method. This can be attributed to the presence of only the auto-regressive component whereas methods like   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63 64 65  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 ARIMA 9 take into account both the auto-regressive as well as the moving-average component of the time series. Hybrid pairs involving deep learning methods such as Stacked LSTM 11 and Bi-Directional LSTM 12 performed relatively worse compared to methods involving ARIMA 9 and Holt-Winters 10 as can be inferred from Table 2. The relatively poor performance of deep learning methods can be attributed to many different causes. One of them might be the lack of adequate data for the methods to properly model data while another cause can be attributed to noisy components present. One interesting observation to note from Fig.6 is that deep learning based temporal methods were unable to model high values such as peaks and hence plateaued around the value of 200, while models such as AR 8 , ARIMA 9 and Holt-Winters 10 were successful in producing predictions close to the actual ones both in the higher as well as in the lower ends of the spectrum.
Another interesting observation that can be concluded from Table 2 is that all spatial interpolation methods including the proposed novel MLP Ensemble, showed equivalent performances with each other. This also can be attributed to the fact that the geo-spatial surface in Beijing is flat and not uneven, thereby enabling the spatial interpolation methods to model the surface accurately.

Discussion
Out of all the different combinations tested on the Beijing multi-site hourly pollution data 5 , hybrid pairs containing Holt-Winters 10 gave the overall best performance. It was also shown that the novel MLP based multi-site ensemble performed equivalently to other widely used spatial interpolation methods such as Ordinary Kriging and Inverse Distance Weighted (IDW). The biggest contributing performance of the multi-site ensemble method lies in its flexibility to account for more than one feature to accurately model the geospatial surface. If the amount of data increases to the order of magnitude 10 6 , deep learning based temporal methods will be expected to outshine the statistical ones.
In the case of Beijing, the range of pollution values was quite high compared to other major cities in the world, thus leading to higher error values in terms of RMSE for temporal methods. Cities, where the range of pollution values are comparatively lower, will allow better performance for temporal methods, thereby leading the proposed class of networks to produce much better overall predictions for unknown locations. In addition, the performance can be also attributed to the type of surface being operated on. If there are undulations present in the surface, spatial interpolation methods may not produce good and accurate results.
As pollution monitoring sensors are prone to instrument failures, in the likely scenario of a real-world deployment, the additional data due to failure in the sensor will be absent from the feed. As prediction on absent input data is not possible for the proposed class of networks, the missing data will have to be imputed through LOCF technique for the model to produce the required predictions for the future timesteps.

Conclusion
In this study, a novel class of hybrid spatio-temporal networks have been proposed to aid in predicting pollution values of unknown locations when historical time-series data of other locations are known. At this stage, exogenous variables in temporal data forecasting were not taken into account. As a possible future work, statistical methods such as ARIMAX can be used which takes into usage the contributions of exogenous variables in producing predictions. The data input for LSTMs can be extended to take into account a higher channel depth to include exogenous variables such as meteorological factors and any other location specific features.