Air Quality Forecasting with Hybrid LSTM and Extended Stationary Wavelet Transform

Artificial intelligence (AI)-enhanced air quality forecasting is one of the most promising directions in the field of smart environment development. Despite recent advances in this area, two difficulties remain unsolved. First, multiple factors influence forecasting results, such as weather conditions, fuel usage and traffic conditions, and these factors are usually unavailable in air quality sensor data. Second, traditional predicting models lack the complex internal structure needed to memorize longer-term historical states.


Air quality measurement and forecasting remain popular research topics in sustainable smart environmental design, urban area development and pollution control, especially for fast-developing countries such as China and India (Zhang et al. 2012).

Recently, the problem of industrial pollution gas emissions has become worse in these developing countries, threatening the health of a tremendous number of people (Bellinger et al. 2017). Accurate prediction of air quality indices (AQIs) in the short-term future helps decision makers take necessary actions to alleviate air pollution. For example, hour-ahead prediction of PM2.5 can enable a city pollution central control system to send precaution messages and issue further preventive actions if necessary.

Traditional air quality indicators usually include PM2.5 and PM10, which are defined according to pollutant particle sizes. In recurrent neural network (RNN) forecasting models, recurrent layers are used to store historical information to perform the forecasting tasks (Shi et al. 2017). The main shortcoming of the RNN is that it becomes problematic for it to memorize a long history of data. LSTM is a special form of RNN with better long-term memory, and extending the neural network further can improve forecasting accuracy. An AI-enhanced forecasting model is expected to have a more complex internal structure capable of memorizing longer historical states.
A shortcoming of Elangasinghe et al.'s work is that only part of the historical data is used for training. Pardo and Malpica (Pardo and

In this section, we first introduce the dataset and the necessary pre-processing steps for the hybrid deep learning framework for air quality forecasting, including zero-mean normalization and wavelet transform. By wavelet transform, the data is decomposed into more stationary sub-signals. Second, each sub-signal is attached to a nested LSTM (NLSTM) neural network for forecasting.

Last, the overall forecasting framework is described, combining wavelet transform, NLSTM neural networks and inverse wavelet transform (IWT).

The flowchart of the proposed method is shown in Figure 1. The entire air-quality forecasting process is composed of two phases: the data pre-processing phase and the prediction phase.

In the data pre-processing phase, the original univariate PM2.5 dataset is first normalized using the zero-mean normalization method.

Then, ESWT is applied to decompose the dataset into multiple sub-signals in two steps. On the basis of the decomposed sub-signals, the time series for each data sample is created and the whole dataset is finally divided into a training set, a validation set and a test set. After this phase, the original univariate PM2.5 data is transformed into multiple well-processed and properly divided sub-signals.

In the prediction phase, each sub-signal is considered an independent dataset and assigned to one NLSTM to perform hour-ahead prediction. Afterwards, with inverse wavelet transform (IWT), all of the sub-signal prediction results are combined to reconstruct the complete result. At last, the final prediction result is produced after conducting the inverse of zero-mean normalization.
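The two-phase pipeline can be sketched at a high level as follows. This is a minimal illustration, not the paper's implementation: the function names are ours, and the decomposition and predictors are stand-ins (an identity split into two halves, and persistence predictors in place of the NLSTMs):

```python
import numpy as np

def forecast_pipeline(y, decompose, predictors, reconstruct, mean, std):
    """High-level sketch of the two-phase process described above.

    Assumed callables: decompose(z) -> list of sub-signals,
    predictors[k](sub) -> one-step prediction for sub-signal k,
    reconstruct(preds) -> combined normalized prediction (IWT stand-in).
    """
    z = (y - mean) / std                                 # zero-mean normalization
    subs = decompose(z)                                  # ESWT decomposition
    preds = [p(s) for p, s in zip(predictors, subs)]     # one model per sub-signal
    combined = reconstruct(preds)                        # inverse wavelet transform
    return combined * std + mean                         # inverse normalization

# Toy usage: split into two halves that sum back, persistence predictors.
y = np.array([10.0, 12.0, 11.0, 13.0])
mean, std = y.mean(), y.std()
decompose = lambda z: [z / 2.0, z / 2.0]
predictors = [lambda s: s[-1], lambda s: s[-1]]
reconstruct = lambda preds: sum(preds)
forecast = forecast_pipeline(y, decompose, predictors, reconstruct, mean, std)
```

With these stand-ins the pipeline reduces to persistence on the original scale, which makes the normalization round trip easy to check.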

The forecasting performance is evaluated by calculating the prediction error.

The model proposed in this paper uses the zero-mean normalization (z-score normalization) method to normalize the recorded PM2.5 concentration data. Zero-mean normalization is also called standard deviation standardization: the processed data has a mean of 0 and a standard deviation of 1. The normalized data is calculated using the normalization formula

x* = (x − x̄) / σ,

where x̄ is the mean of the original data and σ is the standard deviation of the original data. The normalization helps the machine learning algorithms better measure the distances between processed data samples on a common scale.
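A minimal sketch of the normalization and its inverse (variable names are ours, and the sample values are hypothetical):

```python
import numpy as np

def zscore_normalize(x):
    """Zero-mean (z-score) normalization: output has mean 0, std 1."""
    mean, std = x.mean(), x.std()
    return (x - mean) / std, mean, std

def zscore_inverse(z, mean, std):
    """Inverse normalization: recover the original scale."""
    return z * std + mean

# Hypothetical PM2.5 concentrations (µg/m^3)
pm25 = np.array([35.0, 80.0, 120.0, 60.0, 45.0])
z, m, s = zscore_normalize(pm25)
recovered = zscore_inverse(z, m, s)
```

Keeping the mean and standard deviation from the training data allows predictions to be mapped back to concentration units at the end of the pipeline.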

Similarly, the inverse normalization formula can be derived as

x = x*·σ + x̄.

The PM2.5 data contains temporal features that are difficult to capture by deep learning models directly. This is caused by its mixture of factors of different temporal resolution: the raw data includes components of both low and high frequency, which can also be explained as a seasonal longer-term trend and shorter-term fluctuations. Therefore, ESWT is introduced to decompose the original data and separate these temporal features. The wavelet transform uses finite-length, attenuating wavelets as the basis function. Through the window adjustment method (Huang et al. 2002), the input signal is decomposed into low-frequency signals that reflect the overall trend of the data and high-frequency signals that fluctuate sharply.

The original PM2.5 data is decomposed into 'sub-signals' of different dimensions by discrete stationary wavelet transform (SWT) decomposition, using the Daubechies wavelet as the basis function. Compared with the original data, these sequence groups have the same sizes thanks to up-sampling after filtering, resulting in more stationary signals and fewer singular value points.
By wavelet transform decomposition of level m, a data sequence y(t) is decomposed into

y(t) = A_m(t) + Σ_{i=1..m} D_i(t).    (3)

In equation (3), A_m(t) is an approximate information set, indicating the overall trend characteristics of the original data, and D_i(t) is a high-frequency information set representing small high-frequency fluctuations, that is, the noise portion of the original data.

The regular SWT only performs further decomposition upon the lower-frequency component A_m(t), because the method is mostly used for filtering out noise. Therefore, the major focus of the SWT is to make the low-frequency component, which is the major

The proposed ESWT method decomposes the high-frequency components together with the low-frequency components.

Therefore, the time series information within the high-frequency signal can be expressed more effectively and accurately through the decomposition. The decomposition process is demonstrated in Figure 2.

In the proposed model, we used ESWT with a decomposition level of 3, decomposing the original raw data y(t) into six sub-signals in two steps. First, the original air quality data is decomposed into A1 and D1. Then, both A1 and D1 are further decomposed into A1-A2, A1-D2, A1-D1 and D1-A2, D1-D2, D1-D1, respectively. The original SWT method, decomposing the original data into A3,

D3, D2 and D1, is employed as a comparative model in Section IV to demonstrate the superiority of the ESWT method.
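The two-step scheme above can be sketched with a simple Haar-style pairwise filter. This is an illustrative stand-in for the paper's Daubechies-based SWT (the function names and the wavelet are our assumptions): each sub-signal keeps the original length, and, as in the ESWT, the detail component D1 is decomposed again alongside the approximation A1.

```python
import numpy as np

def haar_swt_step(x):
    """One undecimated Haar-style split: x -> (approximation, detail).

    Uses a circular pairwise average/difference so that A + D == x
    and both outputs keep the original length (no downsampling).
    """
    shifted = np.roll(x, -1)           # x[n+1] with circular wrap
    a = (x + shifted) / 2.0            # low-frequency trend
    d = (x - shifted) / 2.0            # high-frequency fluctuation
    return a, d

def eswt_two_step(y):
    """ESWT-style scheme: decompose BOTH A1 and D1 further
    (regular SWT would only decompose A1)."""
    a1, d1 = haar_swt_step(y)
    a1a2, a1d2 = haar_swt_step(a1)     # trend split again
    d1a2, d1d2 = haar_swt_step(d1)     # detail split again: the ESWT extension
    return {"A1-A2": a1a2, "A1-D2": a1d2, "D1-A2": d1a2, "D1-D2": d1d2}

y = np.sin(np.linspace(0, 4 * np.pi, 32)) + 0.1 * np.random.randn(32)
subs = eswt_two_step(y)
reconstructed = sum(subs.values())     # the leaves sum back to y exactly here
```

With this filter the inverse transform is simply the sum of the leaf sub-signals, which mirrors the role of the IWT in the prediction phase.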

3) Dividing the dataset

With wavelet transform, the original data is decomposed into six sub-signal datasets. In time series forecasting, historical data is employed to predict future data.

Each of the datasets is divided into data series X and Y. Y is the sequence of data samples to be predicted. X is the sequence of

After dividing and fitting the decomposed datasets into a proper shape, they are ready to be learned and predicted by the deep learning models in the prediction phase.
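A common way to build such (X, Y) pairs is a sliding window over each sub-signal. A small sketch, with the window length as an assumed hyper-parameter (the variable names are ours, not the paper's):

```python
import numpy as np

def make_series(signal, window):
    """Slide a fixed-length window over a 1-D signal.

    X[i] holds `window` consecutive historical values and
    Y[i] is the value one step ahead (the hour-ahead target).
    """
    X, Y = [], []
    for i in range(len(signal) - window):
        X.append(signal[i:i + window])
        Y.append(signal[i + window])
    return np.array(X), np.array(Y)

sub_signal = np.arange(10, dtype=float)   # stand-in for one decomposed sub-signal
X, Y = make_series(sub_signal, window=3)
# X[0] = [0, 1, 2] and Y[0] = 3; there are 10 - 3 = 7 samples
```

Each of the six sub-signal datasets would be windowed this way before being split into training, validation and test sets.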

In LSTM, the output gate follows the principle that information not relevant to the current time step may still be worth remembering.

As shown in Figure 3, the structure of a common LSTM memory cell is as follows: the update of the memory cell state c_t is made by adding two parts,

c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t,

where f_t is the forget gate, i_t the input gate and g_t the candidate values. Replacing this additive update with another LSTM cell yields an NLSTM cell. The outer LSTM is called the outer memory, and the inner LSTM is called the inner memory. In the NLSTM cell, the additive update is replaced by an inner LSTM cell, where i_t ⊙ g_t and f_t ⊙ c_{t−1} are used as the short-term and long-term memory inputs x̃_t and h̃_{t−1}, respectively. The structure of the inner LSTM cell is the same as that of an ordinary LSTM cell.
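The nesting can be made concrete with a small NumPy sketch of one NLSTM time step, loosely after Moniz and Krueger's nested LSTM formulation; the function names, weight shapes and tiny sizes are our assumptions, not the paper's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(x, h, W, b):
    """Compute the four LSTM gate activations from [x; h]."""
    z = W @ np.concatenate([x, h]) + b
    n = len(b) // 4
    i = sigmoid(z[:n])            # input gate
    f = sigmoid(z[n:2 * n])       # forget gate
    o = sigmoid(z[2 * n:3 * n])   # output gate
    g = np.tanh(z[3 * n:])        # candidate values
    return i, f, o, g

def lstm_step(x, h, c, W, b):
    """Ordinary LSTM cell: c_t = f*c + i*g."""
    i, f, o, g = lstm_gates(x, h, W, b)
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def nlstm_step(x, h, c_inner, h_inner, Wo, bo, Wi, bi):
    """Nested LSTM cell: the outer additive state update is replaced
    by a full inner LSTM.  inner input = i*g (short-term part),
    inner hidden = f * previous outer state (long-term part); the
    outer state is the inner cell's output h_inner."""
    i, f, o, g = lstm_gates(x, h, Wo, bo)
    inner_x = i * g
    inner_h = f * h_inner
    h_inner_new, c_inner_new = lstm_step(inner_x, inner_h, c_inner, Wi, bi)
    c_outer = h_inner_new                     # outer cell state = inner output
    h_new = o * np.tanh(c_outer)
    return h_new, c_inner_new, h_inner_new

# Tiny usage example with random weights (hidden size 4, input size 1)
rng = np.random.default_rng(0)
n, m = 4, 1
Wo, bo = rng.standard_normal((4 * n, m + n)) * 0.1, np.zeros(4 * n)
Wi, bi = rng.standard_normal((4 * n, 2 * n)) * 0.1, np.zeros(4 * n)
h = np.zeros(n); c_in = np.zeros(n); h_in = np.zeros(n)
for t in range(5):                            # run a short input sequence
    x = np.array([0.5])
    h, c_in, h_in = nlstm_step(x, h, c_in, h_in, Wo, bo, Wi, bi)
```

The extra inner state gives the cell a second, slower memory, which is the property the framework relies on for longer-term historical information.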

The specific formulas of the above-mentioned four metrics are listed below:

MAE  = (1/n) Σ |f_i − f̂_i|
RMSE = sqrt( (1/n) Σ (f_i − f̂_i)² )
MAPE = (100%/n) Σ |(f_i − f̂_i) / f_i|
R²   = 1 − Σ (f_i − f̂_i)² / Σ (f_i − f̄)²

where f_i refers to the actual data value, f̂_i is the predicted value, and f̄ is the mean of the actual values.

Evaluation metrics including MAE, RMSE and MAPE evaluate the forecasting performance by calculating the level of error.

Amongst the metrics, MAE is the average of absolute errors between the real value and the predicted value; RMSE measures the deviation between the actual value and the predicted value but is more sensitive to outliers; MAPE measures the relative level of absolute error in a proportional approach. R² is the coefficient of determination, which evaluates the fitting effectiveness.
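The four metrics can be computed in a few lines; a sketch (function name ours, and MAPE assumes no zero actual values):

```python
import numpy as np

def metrics(actual, predicted):
    """MAE, RMSE, MAPE (%) and R^2 for 1-D forecasts."""
    err = actual - predicted
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    mape = 100.0 * np.mean(np.abs(err / actual))          # undefined if actual has zeros
    r2 = 1.0 - np.sum(err ** 2) / np.sum((actual - actual.mean()) ** 2)
    return mae, rmse, mape, r2

actual = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 33.0, 37.0])
mae, rmse, mape, r2 = metrics(actual, predicted)
```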

The absolute errors of the proposed method and the comparative models are compared in Figure 9.

Based on the aforementioned evaluation metrics, the forecasting performance of the proposed models and the comparative models is evaluated and listed in Table 1.

According to Figure 8 and Table 1, the forecasting performance of the proposed ESWT-NLSTM neural networks is remarkably improved. The lines in Figure 8 show that the proposed model can successfully and correctly predict most of the fluctuations, and Figure 9 shows that the absolute error of the proposed method is lower on average and distributed in a smaller range, which means the predicted values reflect the actual values not only more accurately but also more steadily.

In Table 1, the prediction performance of the proposed WNLSTM model outperforms all hybrid models using EMD for data pre-processing with various LSTM extensions. In particular, the performance comparison between WNLSTM and EMD-NLSTM shows that the wavelet transform has an obvious advantage over EMD and is more suitable for time series prediction of datasets like PM2.5 concentration.

Compared with the LSTM extensions with the regular SWT method shown in Figure 2(a), the proposed method also shows significant superiority by additionally decomposing the high-frequency component. The decomposition and separately conducted learning of the higher-frequency sub-signals improves the forecasting performance in predicting the more volatile part of the original data and thus improves the overall prediction accuracy. Therefore, according to Figures 8-9 and Table 1, the level of error is reduced and the time lag between the actual data and predicted data is also shortened.

Based on the comprehensive comparative results listed in Table 1, the combination of ESWT and NLSTM proves to be the best fit

The superior performance shows that NLSTM networks memorize and handle longer-term historical information better than traditional LSTM extensions.

Figures 10-13 illustrate the forecasting results of the methods used in Table 1, respectively. For better visualization, only 72 data samples are shown. In particular, the performance comparison between the proposed method and the machine learning methods listed in Table 1 is shown in Figure 10. The performance comparison between the proposed method and the LSTM extensions is shown in Figure 11. Figure 12 shows the comparison between the proposed method and LSTM extensions combined with EMD.

From Figures 10-13, the existing methods not only show higher prediction errors but also demonstrate obvious lagging effects compared with the proposed method. EMD, as one decomposition method, was originally proposed to deal with the lagging problem. However, EMD cannot completely separate signals of high and low frequencies under extremely sensitive and drastic changes, which can be notably observed in Figure 12. According to Figures 10-13, the lagging effect in air quality prediction using the proposed method is greatly reduced, resulting in less forecasting error. Learning more stationary sub-signals separately reduces the sensitivity to transient fluctuations, leading to more stable and robust performance.

Table 2 lists the prediction performance of the sub-signals decomposed by ESWT and SWT using NLSTM. The performance in predicting the high-frequency component D1 is compared. As described, the difference between the proposed ESWT and the regular SWT is that the ESWT also decomposes the high-frequency D1 into D1-A2, D1-D2 and D1-D1, and the decomposed sub-signals are learned by NLSTM to improve the forecasting performance on D1. According to the results listed in Table 2

such as the combinations of multiple sine waves or cosine waves. When decomposing unstable and irregular curves, such as the PM2.5 data, as shown in Figure 12, EMD cannot effectively segregate the seasonal trend and local fluctuations, making the prediction very unstable for harmonic waves. Therefore, EMD is less competitive than SWT in this case, making the models employing EMD to decompose the air quality data less effective compared to the proposed model.

3) Insensitivity to peaks. The regular SWT decomposition technique lacks further analysis of the higher-frequency sub-signals. Therefore, models combined with regular SWT are insensitive to high-frequency fluctuations due to their inability to accurately predict them. According to Figure 8 and Figure 13, compared with the proposed ESWT, the results of models combining regular SWT are too smooth and less accurate when predicting minor fluctuations. The SWT-embedded models also tend to fail to predict the peaks and troughs in the PM2.5 data, making the forecasting less timely and less meaningful.

From the experimental results shown in Tables

In the experimental phase, a real-world dataset collected by weather stations located around Beijing, China is utilized. A comprehensive comparative study with various existing methods in the literature has been conducted. By outperforming most of the existing techniques in the related field, we demonstrate that the proposed ESWT-NLSTM framework is effective and suitable for real-world applications for both PM2.5 value forecasting and PM2.5 trend forecasting.

The main limitation of this work is that we only perform prediction for PM2.5 values in the given dataset. Although the forecasting process and results for other air-quality indices are similar to those of PM2.5, the deep learning technique can actually perform transfer learning from one index to another. In this study, we did not take that advantage in the PM2.5 prediction, which is one of the possible directions for future work.