Time Series Modeling of Air Pollution and Its Association with Season and Climate Variables in Istanbul, Turkey

Air pollution depends on season, wind speed, temperature, wind direction and air pressure. The effect of the different seasons on air pollution has not been fully addressed in previous work. The current study investigated the impact of season on the air pollutants SO2, PM10, NO, NOX, and O3 using the NARX method. In the applied methodology, feature selection was used with each pollutant to find the most important season(s). Six models were then designed based on the feature selection to show the impact of seasons on estimating pollutant concentrations. A case study was conducted on Esenyurt, one of the most populated and industrialized districts of Istanbul, to validate the proposed framework. Across the designed models and pollutants, including the season effect improved the performance of the predictor, yielding high R2 and low error values.


Introduction
Air pollution increases with industrialization and urbanization, and it affects public health and economic development in metropolitan cities. The levels and types of air pollutants vary from one place to another depending on their sources, such as cars, power plants, oil refineries, industrial facilities, and factories (Koop et al., 2010; Wyrwa, 2015; Sacks et al., 2018). Air quality is monitored using the most common air pollutants (indicators), including SO2, NO2, CO2, O3, NO, NOx, PM2.5, and PM10 (Araujo et al., 2020). These indicators are present at different levels in ambient air; when their concentrations exceed acceptable levels, they threaten human health and may cause serious short-term and long-term problems. One of the most hazardous events caused by air pollutants was the Great Smog of London in 1952, which lasted five days and killed 4000 people. Monitoring the concentration of air pollutants can help decision makers take the right decisions for current and future plans. A large body of research has sought to forecast pollutant concentrations and to identify the most suitable way to evaluate air quality. Cogliani (2001) studied the relationship between meteorological variables and a daily pollution index in three Italian cities using linear multiple partial correlation analysis. The forecasts showed high evaluation error; the method can be used in the areas surrounding the observation stations, but it is inapplicable to places far from a station. Zhu et al. (2012) investigated the relationship between lower respiratory diseases and the monthly average concentrations of SO2, NO2 and PM10, considering the effect of season, especially winter.
The dataset covered the period from January 2001 to December 2005 and was collected from hospitals in Xigu District.
The results showed a relationship between short-term pollution and lower respiratory diseases, with a particularly strong association in winter. Feng et al. (2015) proposed a novel daily PM2.5 forecasting approach that improves the performance of an artificial neural network by using air mass trajectory analysis and wavelet transformation. The dataset was collected from 13 stations at different locations in China, covering Beijing, Tianjin, and Hebei.
The results showed that the new hybrid method can reduce the root mean square error by up to 40%. Furthermore, the results indicated that the proposed model could be applied efficiently in other countries. Fortelli et al. (2016) investigated the relationship between local meteorological variables and PM10 in Naples, Italy.
The meteorological variables were then used to forecast PM10 a couple of days ahead. The results revealed a relationship between air pollution crises and geopotential heights, and the predicted values correlated strongly with the PM10 observations (correlation coefficient of 0.8). Alimissis et al. (2018) evaluated two interpolation models, artificial neural networks and multiple linear regression, for predicting air quality in Athens, Greece. Air quality was measured using five pollutants: nitrogen dioxide, nitrogen monoxide, ozone, carbon monoxide and sulphur dioxide. The artificial neural networks were in most cases significantly superior, especially where the density of the air quality network was limited. Yu et al. (2019) proposed a fast forecasting method to estimate PM2.5 concentrations in six cities in northern China: Baoding, Beijing, Dezhou, Shijiazhuang, Tianjin, and Tangshan. The method is based on source-receptor relationship modeling with a backward Lagrangian stochastic particle dispersion model and emission inventory inversion. It was built using a dataset collected in 2015, while a dataset collected in 2016 was used for forecasting. The results showed that the new techniques achieved better results and higher correlation coefficients than the non-optimized model. Wang et al. (2019) developed a forecasting model based on a multilayer perceptron to predict interval PM2.5 concentrations from meteorological factors. Interval grey incidence analysis was adopted to select the most important input variables. The dataset was collected from three locations in Beijing, China. The results indicated that the developed model is more accurate and stable than other models in the literature.
Liu et al. (2019) developed a three-stage hybrid algorithm based on a neural network to forecast PM2.5. The dataset was collected from four cities in China: Beijing, Tianjin, Shijiazhuang, and Tangshan. The results indicated that the proposed model is more accurate than conventional methods in the field. Lu et al. (2019) investigated the relationship between PM2.5 and PM10 at different locations within Hong Kong. Based on the relationship between the two pollutants, a predictive model was built to estimate the PM2.5 concentration from PM10 using three prediction strategies: local, remote and mixed. The results showed that all three strategies can estimate missing or unmonitored values using the surrounding stations. Zhu et al. (2019) proposed a two-step hybrid prediction model to estimate the concentrations of NO2 and SO2 in four cities in Central China. The procedure has three stages: finding the high-frequency and low-frequency sequences, applying Support Vector Regression combined with the Cuckoo Search and Grey Wolf Optimizer algorithms, and finally forecasting the low-frequency and high-frequency data. The results indicated that different hybrid models should be used to predict the high and low frequencies efficiently. Catalano et al. (2016) studied the relationship between the hourly mean concentration of NO2 and the factors that influence the NO2 level (i.e., traffic and weather conditions). A neural network and an Autoregressive Integrated Moving Average with Explanatory Variables (ARIMAX) model were used to predict pollution peaks, both individually and in combination. The results revealed that ARIMAX outperformed the neural network in forecasting pollution peaks, while the neural network better represented the association between pollutant concentration and wind.
Integrating the two forecasting models predicted extreme pollution concentrations more efficiently than using either model separately. Durao et al. (2016) forecasted the concentration of O3 using a combination of meteorological, air quality and industrial emissions data for the Sines region, Portugal. Two forecasting models, Multi-Layer Perceptron (MLP) and Classification and Regression Trees (CART), were used to predict the O3 concentration. The results revealed that MLP successfully predicted the O3 concentration up to 24 hours ahead. Corani et al. (2016) designed a multi-label classifier to predict multiple air pollution variables. Bayesian networks were used as the learning technique to predict the levels of PM2.5 and ozone in three different studies, and the results were compared with other classifiers. The multi-label classifier performed better than the independent classifiers. Shi et al. (2020) investigated the meteorological variables that are directly or indirectly responsible for PM2.5 levels in Central East China. The results showed that the PM2.5 concentration increases when the wind direction changes from south to north, together with increases in important meteorological factors (i.e., large-scale subsidence and radiative cooling). Mo et al. (2020) investigated the growth of surface ozone concentration and its effect on health.
The study developed a new model that combines different machine learning techniques into a single model. The dataset was collected from four stations in China and divided into a training dataset (1 May 2014 to 31 May 2017) and a testing dataset (1 June 2017 to 30 May 2018). The results showed that the proposed model is accurate and stable and that it can be used in different locations. Researchers have investigated the capability of predicting air quality using different prediction models, including artificial intelligence (Bai et al., 2016; Biancofiore et al., 2017; Dotse et al., 2018), autoregression models (Wang et al., 2017), and other hybrid models (Carbajal-Hernández et al., 2012; Franceschi et al., 2018; Yang et al., 2019).
Pollutants can also produce acid rain, which harms plants, buildings, monuments, groundwater, soil composition and the organisms living in seas, ponds and rivers. This study is therefore important: it gives decision makers an opportunity to take the level of air pollution into account when developing future plans, especially as Turkey is working to increase exports, attract investment and expand the construction of factories. Moreover, few researchers have focused on the relationship between the seasons and increases or decreases in air pollutants including SO2, NO, NO2, NOX, O3 and PM10. Researchers have also tested different combinations of input variables without any scientific or mathematical justification. This article therefore addresses questions relevant to both international and national researchers. Our contributions are as follows:
• The study addresses the lack of work that considers the impact of season on the trend of pollutants in air.
• The study addresses the lack of work that applies feature selection to seasons and pollutants to determine the most effective season(s) for each pollutant.
• The study presents a new methodology that can be followed to improve the forecasting of pollutants including SO2, NO, NO2, NOX, O3 and PM10. It also explains the relationship between each independent variable and each pollutant, clarifying how each independent variable affects each pollutant.
• The study identifies the most effective season(s) for the most important pollutants in Turkey. Other researchers can follow the same steps to explain the movement of pollutants in different countries.
• The study demonstrates the capability of the NARX model in forecasting different pollutants in Turkey, which suggests that researchers in the field may start from this model rather than screening many alternatives, saving time and effort.
To improve the capability of the NARX model in forecasting pollutants, it is beneficial to develop a new methodology based on previous studies and on feature selection of the most effective season(s) for each pollutant.
New models are then designed to cover different scenarios. A dataset from Esenyurt, one of the most populated sites in Istanbul, covering 2015 to 2019 is used to validate the methodology. To compare the models, the determination coefficient and several error functions are used to find the best and most suitable model for each pollutant.

Research Methodology
To build a forecasting model for gases in air, feature selection between the season variables and each pollutant is used, as described in the following subsections.

Data Collection and Analysis
The case study was conducted in Esenyurt, one of the most polluted sites in Istanbul, Turkey. The dataset covers 5 years and contains 37645 hourly records in total. To build a forecasting model, the dataset is divided into training and testing datasets. The training dataset is used to teach the forecasting model the historical behavior of all the gases in ambient air, while the testing dataset is used to check the ability of the trained forecaster to predict future data. The dataset is divided by the year in which the data were collected:
2015 to 2018 is used for training and 2019 is used for testing. To represent the effect of season in the dataset, four dummy variables are used, one for each season. Each dummy variable takes the value one from the starting date to the ending date of its season and zero otherwise. The statistical descriptions of the training and testing datasets are shown in Tables 1 and 2.
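The season dummy variables and the split by year can be sketched as follows. This is a minimal illustration, not the authors' code: the month-to-season mapping is an assumption (the paper does not give the exact start and end dates), and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# ASSUMPTION: meteorological seasons by calendar month (Northern Hemisphere).
# The paper does not state the exact season boundaries it used.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
SEASONS = ("winter", "spring", "summer", "autumn")


def season_dummies(timestamp):
    """Return four 0/1 indicators, one per season, for one hourly timestamp."""
    current = SEASON_BY_MONTH[timestamp.month]
    return {season: int(season == current) for season in SEASONS}


def split_by_year(rows, test_year=2019):
    """Rows before test_year go to training; rows in test_year go to testing."""
    train = [row for row in rows if row[0].year < test_year]
    test = [row for row in rows if row[0].year == test_year]
    return train, test


# Hypothetical hourly timestamps straddling the 2018/2019 boundary.
stamps = [datetime(2018, 12, 31, 22) + timedelta(hours=h) for h in range(4)]
rows = [(ts, season_dummies(ts)) for ts in stamps]
train_rows, test_rows = split_by_year(rows)
```

Exactly one of the four indicators is 1 for any timestamp, so a model that uses all four dummies alongside an intercept carries one redundant column; the reduced season sets described later avoid this.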

Design Seasons' Models Based on Feature Selection
After the data are collected from the source, data processing should be conducted. Since the collected data contain no missing values and no outliers, the next step is model design, both to validate the relationship between the independent and dependent variables and to assess the capability of forecasting air pollution. The study uses six air pollutants (PM10, SO2, NO, NO2, NOX and O3) and Turkey has four seasons, so the total number of models that can be generated is 144 (24 season combinations × 6 pollutants).
Trying all the combinations is time consuming and generates a huge number of results. To reduce the number of models, a subset attribute evaluator with a greedy stepwise search is used as the feature selection method. The aim of feature selection is to find the most important season(s) associated with each studied pollutant. The process starts by selecting one of the pollutant gases as the target and all the seasons as inputs, while the meteorological variables are used without feature selection, following previous studies. The most important season(s) for each pollutant are retained. The results for all pollutants are then used to create the season models, which in turn are used to design forecasting models for all the pollutants, as shown in Figure 1.
In our dataset, a long hourly series (30840 hourly readings) is used to find the seasons most closely related to each pollutant, leaving 6806 time series samples for forecasting. The most important seasons and the designed models are discussed in the results section.
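A greedy stepwise search driven by a subset evaluator can be sketched as below. The paper does not specify the evaluator's scoring rule, so this sketch assumes a correlation-based (CFS-style) merit that rewards correlation with the target and penalizes redundancy among the selected features. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def cfs_merit(X, y, subset):
    """ASSUMED merit: k*r_cf / sqrt(k + k(k-1)*r_ff), using mean absolute correlations."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)


def greedy_stepwise(X, y):
    """Forward search: add the feature that raises the merit most; stop when none does."""
    selected, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: cfs_merit(X, y, selected + [j]) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:
            break
        selected.append(j_best)
        best = scores[j_best]
        remaining.remove(j_best)
    return selected


# Synthetic example: the target is elevated only in winter (column 0).
rng = np.random.default_rng(0)
season_index = np.arange(400) % 4            # 0=winter, 1=spring, 2=summer, 3=autumn
X = np.eye(4)[season_index]                  # one-hot season dummies
y = 5.0 * X[:, 0] + rng.normal(0.0, 1.0, 400)
selected = greedy_stepwise(X, y)
```

In this synthetic case the search keeps only the winter dummy, because adding any other season lowers the merit. With real pollutant data the selected set depends on the evaluator actually used.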

Figure 1: The methodology used to create the models.

Air Pollution Gases Forecasting Based on NARX
After the models are built using feature selection and the meteorological variables, the NARX forecasting model is used to forecast the different pollutants. The first step is to train the NARX model on the training dataset (2015 to 2018). The NARX model is run many times until the best model is achieved; for each pollutant, the best model is the one with the highest determination coefficient and the lowest error. The testing dataset (January 2019 to December 2019), which is not used in the training phase, is then used to evaluate the forecasting performance of the trained NARX model. These two steps are repeated to find the most appropriate weights for the NARX models. For each pollutant, all the generated models are considered, and the model that achieves the best performance metrics on the testing dataset (not the training dataset) is taken as the best model and used to report the performance metrics. The results from NARX and from the feature selection method are compared to draw conclusions about each pollutant and its season(s). Figure 2 shows the NARX network with one output (the pollutant) denoted y, inputs u and bias b. The NARX process starts by creating a series-parallel architecture (open-loop network), followed by a parallel architecture (closed-loop network). The aim of creating both the open-loop and closed-loop networks is to improve the forecasting process and to increase the efficiency of the network by using the directly preceding data. To forecast with the trained NARX model, the forecaster uses the first two inputs together with the training data to adjust the predicted values. The whole methodology used in this study is summarized in the flow chart in Figure 3.

Figure 3: The methodology used to build the models.
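The difference between the open-loop (series-parallel) and closed-loop (parallel) modes can be illustrated with a small sketch. It is not the authors' network: a linear least-squares map stands in for the neural network inside NARX, the delay of two steps is an assumption, and the data are synthetic. Only the lagged structure y(t) = f(y(t-1), ..., y(t-d), u(t-1), ..., u(t-d)) and the feedback of predictions are the point here.

```python
import numpy as np


def lagged_rows(y, u, d):
    """Build rows [y(t-1..t-d), u(t-1..t-d), 1] and targets y(t) for t >= d."""
    rows = [np.concatenate([y[t - d:t][::-1], u[t - d:t][::-1], [1.0]])
            for t in range(d, len(y))]
    return np.array(rows), y[d:]


def fit_open_loop(y, u, d):
    """Series-parallel training: measured past outputs are used as inputs.

    ASSUMPTION: a linear map fitted by least squares replaces the neural network.
    """
    A, target = lagged_rows(y, u, d)
    weights, *_ = np.linalg.lstsq(A, target, rcond=None)
    return weights


def predict_closed_loop(weights, y_init, u, d):
    """Parallel mode: after d seed values, predictions are fed back as inputs."""
    y_hat = list(y_init[:d])
    for t in range(d, len(u)):
        past_y = np.array(y_hat[t - d:t][::-1])
        past_u = u[t - d:t][::-1]
        y_hat.append(float(np.concatenate([past_y, past_u, [1.0]]) @ weights))
    return np.array(y_hat)


# Synthetic series: one exogenous input u and one output y.
rng = np.random.default_rng(1)
n, d = 400, 2
u = rng.uniform(-1.0, 1.0, n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.6 * y[t - 1] + 0.3 * u[t - 1] + 0.1 + rng.normal(0.0, 0.01)

weights = fit_open_loop(y[:300], u[:300], d)                  # train on the first 300 steps
forecast = predict_closed_loop(weights, y[300:], u[300:], d)  # seed with 2 measured values
rmse = float(np.sqrt(np.mean((forecast - y[300:]) ** 2)))
```

In closed-loop mode only the first d measured outputs are needed; every later step relies on the model's own earlier predictions, which is what allows forecasting over a period with no new pollutant measurements.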

Performance Analysis
After a time series model is built for each pollutant using the training dataset, the testing dataset is used to determine the most effective and suitable model for that pollutant. The training and testing results of the generated models are evaluated using four metrics: determination coefficient (R2), mean square error (MSE), mean absolute error (MAE) and root mean square error (RMSE). All four metrics are computed from the observed and predicted concentrations. The model with the highest R2 and the lowest errors is considered the best, most robust and most accurate model. In addition, the season(s) used in the best-performing model are considered the most influential in explaining the concentration of the pollutant.
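The four metrics named above can be computed as in the sketch below, which uses their standard textbook definitions. It assumes R2 is defined as one minus the ratio of the residual sum of squares to the total sum of squares; the paper may instead use the squared correlation coefficient. The sample values are hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Standard definitions over n paired observations and predictions."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    mse = sum(e * e for e in errors) / n          # MSE  = (1/n) * sum(e_i^2)
    mae = sum(abs(e) for e in errors) / n         # MAE  = (1/n) * sum(|e_i|)
    rmse = math.sqrt(mse)                         # RMSE = sqrt(MSE)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - (mse * n) / ss_tot                 # ASSUMED form: R2 = 1 - SS_res / SS_tot
    return {"R2": r2, "MSE": mse, "MAE": mae, "RMSE": rmse}


# Hypothetical concentrations, for illustration only.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

Because RMSE is the square root of MSE, the two always rank models identically; MAE can rank them differently when a few large errors dominate.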

Results, Analysis and Discussion

Study Area
This study was conducted in Esenyurt, a district of Istanbul Province that belongs to the metropolitan municipality of the city. Esenyurt is situated on the European side of the city and km far from the Marmara Sea. It is surrounded by Avcılar

Relationship Between Input Variables and Air Pollutants
To show the impact of each independent variable (input) on the concentrations of the air pollutants PM10, SO2, NO, NOX, NO2 and O3, correlation coefficients are shown in Table 3. The most influential date variable is the year, which shows a decrease in the concentration of all pollutants (except SO2) from year to year. This decrease may be attributed to the high percentage of modern cars in Istanbul, which emit fewer pollutants than older ones, and to the rules issued by the Turkish government to control emissions from cars and factories. The growing industry in Turkey, and in Istanbul in particular, may explain the increasing concentration of SO2 with time. In general, sulfur dioxide is released by fossil-fuel-burning power stations, by industrial processes such as extracting metal from ore, and by the burning of high-sulfur fuels in locomotives, large ships and non-road equipment (Patricio, 2001). In Esenyurt, the temperature ranged between -5 °C in winter and 38 °C in summer during the study period. Temperature has a negative correlation with all pollutants except ozone, which showed a positive correlation. These results are largely consistent with the correlations between the air pollutants and the winter and summer seasons. In this study, the negative correlation could be attributed to the reduced use of domestic heating in summer. Ozone is mainly formed by a photochemical reaction; consequently, the more intense the solar radiation (temperature), the higher the O3 concentration (Grosjean & Grosjean, 1998). In general, ozone concentration has been found to be higher in the areas surrounding a city than in the city center (Salem et al., 2009). As humidity increases, the concentration of air pollutants generally decreases because of the washing effect (Kwak et al., 2017).
The interaction between relative humidity and other parameters such as temperature explains the variation in the correlation between relative humidity and pollutant concentrations, indicating that it is difficult to analyze meteorological variables in isolation. Air pressure showed a positive correlation with all gases except O3. The positive results agree with previously reported work (Zeng et al., 2017). The negative effect of pressure on ozone may be attributed to the depletion of ozone under pressure.
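A table of correlation coefficients such as Table 3 can be produced along the following lines. This sketch assumes Pearson correlation (the paper does not name the coefficient type), and the variable names and values are hypothetical, chosen only to show a positive and a negative sign.

```python
import numpy as np


def correlation_table(inputs, pollutants):
    """Pearson r between every input series and every pollutant series (ASSUMED type)."""
    return {
        (input_name, pollutant_name): float(np.corrcoef(input_values, pollutant_values)[0, 1])
        for input_name, input_values in inputs.items()
        for pollutant_name, pollutant_values in pollutants.items()
    }


# Hypothetical series: "O3" rises with temperature, "NO" falls with it.
temperature = np.array([-5.0, 2.0, 10.0, 18.0, 27.0, 38.0])
table = correlation_table(
    {"temperature": temperature},
    {"O3": 2.0 * temperature + 1.0, "NO": 40.0 - temperature},
)
```

A coefficient near zero only rules out a linear relationship, which is why a weakly correlated season can still be useful to a nonlinear model such as NARX.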

Designing Forecasting Models Based on Season's Variables
Before a forecasting model is built, and to avoid a large number of season combinations, feature selection between the season variables and each pollutant gas is carried out. Table 4 presents the most effective season(s) for each gas.
Based on Table 4, and after combining the most effective seasons, 6 models are generated, as presented in Table 5. Models 1 and 6 represent no season effect and all the season effects, respectively. Models 2 and 3 represent spring and autumn, respectively, while Models 4 and 5 represent the summer-winter and autumn-spring seasons, respectively. Therefore, instead of testing 144 models, only 6 models are considered in this research. Every gas in the pollution list is tested with all the models, and the best model for each gas is recorded separately, as shown in the next section.

Table 4: The most effective season(s) for each gas (recovered entries: Spring, Autumn, Spring, Winter).
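The six season models can be written down as input sets, as in the sketch below. The assignment of spring to Model 2 and autumn to Model 3 is a reading of the results section rather than something stated in a table, and the column names are hypothetical.

```python
# ASSUMPTION: Models 2 and 3 are the single-season spring and autumn models;
# the other four follow the description in the text.
SEASON_MODELS = {
    1: [],                                        # no season effect
    2: ["spring"],
    3: ["autumn"],
    4: ["summer", "winter"],
    5: ["autumn", "spring"],
    6: ["winter", "spring", "summer", "autumn"],  # all season effects
}


def model_inputs(model_id, base_inputs):
    """Meteorological inputs plus the season dummies selected for this model."""
    return list(base_inputs) + [f"is_{season}" for season in SEASON_MODELS[model_id]]


# Hypothetical column names for the meteorological variables.
meteorological = ["temperature", "humidity", "pressure", "wind_speed", "wind_direction"]
inputs_model_5 = model_inputs(5, meteorological)
```

Each pollutant is then trained once per entry, so the experiment needs 6 runs per pollutant instead of one run per season combination.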

Air Pollution Gases Forecasting Results
The correlation results in Table 3 showed that the autumn and spring seasons have a weak correlation with NO, which means that these two seasons may instead have a strong nonlinear relationship with it. The training and testing results for NO forecasting show how the errors fluctuate across the designed models, and Figure 4 shows the forecasting errors for NO in 2019. All models showed higher errors in February and March, while the other months showed lower errors. The results indicated that using all the season variables together with the meteorological variables and time is efficient for forecasting NO. This indicates that NO is related to the seasons and changes from season to season.

The forecasting errors of NO2 for 2019 are shown in Figure 7. February and March have the highest errors for all the models. Using the meteorological variables with the autumn season achieved the best results for the training and testing datasets, which shows that including the season effect can improve NO2 forecasting.

The NOX forecasting results are reported in Table 8. Figure 8 shows the forecasting errors of NOX for Models 1 to 6. February and March again have the highest errors. The results indicated that the model with no season effect best describes the behavior of NOX.

The O3 forecasting results are shown in Table 9 and Figure 9. Model 5 is the best model for both the training and testing datasets, in line with the correlation analysis in Table 3, which shows that spring and autumn are the most influential seasons for O3. Model 5 shows the lowest errors compared with the other models, as presented in Figure 9. The results revealed that including the season effect benefits all the models on both the training and testing datasets.

For PM10, as shown in Table 10 and Figure 10, Models 4 and 2 showed the best performance for training and testing, respectively. The PM10 results showed that the spring model performs best compared with the other models, which indicates that using the meteorological variables with spring yields the best performance metrics.

For SO2, the results indicate that using all the seasons could improve forecasting performance. In addition, the summer-winter, spring and autumn models showed a 30% improvement on both datasets.
Combining the training and testing results, and setting aside results that were best only on the training dataset, the best performance for PM10 is obtained with the meteorological variables and the spring season. SO2 and NO performed very well with the meteorological variables and all seasons. NO2 and O3 were estimated well with the spring season and the autumn-spring seasons, respectively, with the meteorological variables used as inputs alongside the seasons. NOX showed no improvement with any season; only the meteorological variables are effective in NOX forecasting. The results for PM10 and NO2 are in line with the feature selection method, whereas those for SO2, NO, NOX and O3 are not. This indicates that the feature selection is appropriate for PM10 and NO2 only.

Conclusion
This work proposes a methodology that uses feature selection to find the most effective season(s) for each of the air pollutants PM10, SO2, NO, NO2, NOX and O3. Based on the results of feature selection, 6 models are proposed. The NARX method is used to build a forecasting model from an hourly training dataset covering 2015 to 2018, while an hourly testing dataset is used to validate the developed models. The dataset was collected in Esenyurt, Istanbul. The main findings of this study can be summarized as follows:
• This paper is one of the few studies that consider the effect of season through a feature selection method.
• Autumn and spring are the most effective seasons for the concentrations of NO2 and O3, while spring and autumn are effective for PM10 and NOX, respectively. SO2 and NO, on the other hand, are affected by all seasons.
• The NARX model was effective and accurate in building prediction models for the concentrations of different gases.
• The prediction models showed very good results for all pollutants except SO2, whose results were poor compared with the other pollutants.
This paper considered only Esenyurt, one of the most polluted sites in Istanbul; further sites need to be studied to validate the approach for different locations and different gases. As a next step, different locations and additional meteorological variables will be used to build more accurate and robust models for different gases in air.