Comparative Study on Predicting Particulate Matter (PM2.5) Levels Using LSTM Models

In recent years, air pollution has attracted the attention of policymakers and researchers as an important issue. The air that people breathe is contaminated by pollutants such as oxides of carbon, nitrogen and sulphur, as well as minuscule dust particles smaller than 0.0025 mm in diameter. These emissions contain many substances that are harmful to human health when exposure is prolonged or exceeds certain concentration levels. Recent advances in sensor technology and compact instruments make it possible to measure pollutant concentrations with considerable ease. This paper predicts air pollution using multiple deep learning models that are variations of the Long Short-Term Memory (LSTM) model. In this research, only PM2.5 is taken into consideration for prediction. Real-time air quality data were collected at selected places in the study area. The model predictions are found to match well with both other researchers' results and the real-time data.


Introduction
Pollution caused by pollutants such as CO, CO2, NOx, SO2 and dust particles with diameters less than 0.0025 mm is one of the leading causes of death in India. A study by 1 estimated that about 1.24 million deaths in India could be attributed to air pollution. The key purpose of this analysis was to identify the deep learning models best suited to predicting PM2.5 concentrations from the dataset. The data necessary for this experiment were collected from the Central Pollution Control Board of India.
Data were collected at 15 min intervals, and a type of artificial Recurrent Neural Network called Long Short-Term Memory was used to analyze the data and compare predictions. Data were collected from 3 monitoring stations in the city of Chennai, located in the neighbourhoods of Alandur, Manali and Velachery. These pollution monitoring stations collect a variety of data, and exploratory variables were chosen from the data gathered at these stations. The data were then cleaned to produce a more ordered and complete dataset and to avoid any inaccuracies caused by missing values.
Air pollution has become a major concern in India in recent years, as large parts of the urban population of India are exposed to some of the highest levels of pollution in the world 2 . The World Health Organization estimates that the health effects of air pollution have increased the hazard risks in major cities of India 3 .
Many cities in India have a population of over 1 million, and some of them rank among the top 10 most polluted cities in the world. Of the 3 million premature deaths that occur annually worldwide due to outdoor and indoor air pollution, the highest number is estimated to have occurred in India. India has many pollution problems, the most severe of which is air pollution. 4 developed three machine learning algorithms to predict PM2.5 levels using a downscaled dataset with associated uncertainty estimates. 5 applied Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) models to predict PM2.5 concentrations and compared the results with other machine learning methods. 6 applied three machine learning models to forecast PM2.5 concentrations; their results explained 80% of the variability in PM2.5 concentrations (R2 = 0.8) and predicted 75% of the pollution levels. 7 designed an attention mechanism to capture the degree of significance of the effects of feature states at different times in the past on future PM2.5 concentrations. 8 studied PM2.5 using the Interagency Monitoring of Protected Visual Environments (IMPROVE) and Chemical Speciation Network (CSN). They obtained data over 33 months from these two networks, which have different operating structures, sampling practices, analytical methods, analytical facilities, and data handling and validation practices. They concluded that combining the CSN and IMPROVE datasets gives a better understanding of PM2.5 in urban and rural areas.
Magazzino et al. established and tested the relationship among COVID-19-related deaths, economic growth, and PM10, PM2.5 and NO2 concentrations for New York, USA 9 . They found that PM2.5 and NO2 are the important pollutants and that economic growth increases pollution levels relevant to COVID-19 death rates. A study on the impact of outdoor and indoor meteorological conditions on COVID-19 transmission in the western region of Saudi Arabia was carried out by 10 . They considered 10 outdoor and indoor meteorological parameters for COVID-19 cases and concluded that the highest daily COVID-19 case counts occurred when the temperature ranged between 40.71 °C and 41.20 °C. A source analysis of heavy metal elements of PM2.5 in a university canteen in winter was carried out by 11 . They analysed indoor and outdoor PM2.5 in the canteen and found that indoor PM2.5 was 99.43 μg/m3 and outdoor PM2.5 was 103.09 μg/m3. They further found that more than half of the PM2.5 penetrated from the adjacent outdoor area at the study location. 12 developed a novel long short-term memory neural network extended (LSTME) model with spatiotemporal correlations. The authors used hourly PM2.5 data from Beijing City, and the results showed a mean absolute percentage error (MAPE) of 11.93%. In another study, on AQI prediction for Delhi, 13 used Moderate Resolution Imaging Spectroradiometer (MODIS) images with a 1 km spatial resolution and concluded that the LSTM model is best for the prediction of PM2.5 / PM10. 16 predicted PM using two sets of 3-D chemistry-transport model (CTM) simulations, and the resulting indices of agreement ranged from 0.62 to 0.79.
Research on PM2.5 prediction for Wuhan and Chengdu by 17 used PM2.5 concentration data from 2015-2017. Meteorological data were also used in developing the model, and better results were achieved because of this. A machine learning method was adopted to predict PM2.5 using six years of meteorological data 18 . The model showed that machine learning-based statistical models are important for forecasting PM2.5 concentrations from meteorological data. A study was carried out to predict air quality with prediction horizons of up to 48 h by combining multiple neural networks 19 . This experiment resulted in excellent performance and outperformed the then state-of-the-art methods. 20 proposed a deep learning model to predict air quality in South Korea that used Stacked Autoencoders to train and test the data. Research by 21 used meteorological data to forecast the AQI. This is the only prior study carried out for the city of Chennai, and it used one of the monitoring stations from which the data used here were collected.

About the study area
Chennai is located along the coast of the Bay of Bengal. It is the state capital of Tamil Nadu and the fourth largest metropolis in India. Chennai lies between the latitudes 12°50'49" and 13°17'24" and longitudes 79°59'53" and 80°20'12". It can be counted as a part of the Coromandel Coast along the eastern part of India 22 . The terrain around Chennai is a flat coastal plain, and since the city is close to the equator, it is usually humid and hot. The highest temperatures are reached in May-June, generally around 40°C for a few days, and the lowest temperatures occur in early January, at about 20°C throughout the month. Chennai is a major transport hub for road, rail, air and sea transport, linking major inland and overseas cities. It is also one of India's most prominent educational centres, with a range of institutions and research centres. The metropolitan area of Chennai stretches to some 1,189 sq. km.

About the dataset
The data were collected from the 3 Central Pollution Control Board (CPCB) monitoring stations in the city of Chennai [19]. The stations are located at Alandur, Manali and Velachery, as illustrated in Fig. 1. The exploratory variables collected from these locations were atmospheric pressure (BP), relative humidity (RH), PM2.5 values, wind degree (WD) and wind speed (WS). The data were recorded at 15 min intervals for the period from 00:00, 01 May 2019 to 23:59, 30 April 2020, and each station yielded a dataset containing 35,039 data rows, totalling 105,117 data rows. Approximately 78.28% of the PM2.5 values were missing. The data were processed to remove any rows that had empty columns, and the data were restricted to rows with PM2.5 levels of less than 2.5×10^-4 mg/L. This reduced the dataset to 22,827 rows, as certain elements were missing in all the other rows.
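The cleaning steps above can be sketched in pandas. This is a hypothetical illustration, not the actual CPCB export format: the column names and the toy rows are assumptions, and the threshold 2.5×10^-4 mg/L is expressed as its equivalent 250 µg/m3.

```python
import pandas as pd

# Hypothetical sketch of the cleaning described above; real CPCB column
# names and file layout may differ. 2.5x10^-4 mg/L corresponds to 250 ug/m3.
PM25_CAP = 250.0  # ug/m3

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with any missing field, then keep rows under the PM2.5 cap."""
    df = df.dropna()
    return df[df["PM2.5"] < PM25_CAP]

# Toy data: only the first row survives both filters.
raw = pd.DataFrame({
    "PM2.5": [35.2, None, 400.0, 80.1],
    "RH":    [60.0, 55.0, 58.0, None],
})
cleaned = clean(raw)
print(len(cleaned))  # 1
```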
The statistical summary of the dataset is shown in Table 1. The table provides insight into how the dataset is structured. The exploratory variables shown in Table 1 were then used to plot a heatmap. Figure 2 shows the correlation between the different exploratory variables collected in the dataset. Figure S1 shows the first 5 rows of the dataset. Figure S2 provides statistics about the dataset such as the central tendency, dispersion and shape of its distribution. Figure S3 provides information about the type of data stored within the dataset.

About DL and RNN
Deep learning is a subset of machine learning methods based on representation learning and artificial neural networks. Learning is of 3 types, namely unsupervised, supervised and semi-supervised. Deep learning architectures such as deep belief networks, deep neural networks, recurrent neural networks and convolutional neural networks are being applied to speech recognition, natural language processing, computer vision, audio recognition, machine translation, drug design, social network filtering, medical image analysis, bioinformatics, material inspection and board game programs. RNNs are the basis for the LSTMs used in the models. They are a class of Artificial Neural Networks (ANN) that use their internal state to process input sequences of variable length. RNNs can be defined as a generalized form of feedforward neural networks: they can feed previous outputs back as inputs while also maintaining hidden states. RNNs have the added advantage of being able to process inputs of varying lengths, and the size of the model does not change with the size of the input. Their disadvantage is that they are difficult to train and training them takes a long time.
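As a minimal illustration of the recurrence described above (a toy vanilla-RNN cell, not the paper's trained models), the sketch below shows how a fixed-size hidden state lets the same network consume sequences of any length; all weights and sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy vanilla-RNN cell: hidden state h is updated from input x and previous h.
W_xh = rng.normal(size=(3, 4)) * 0.1   # input -> hidden
W_hh = rng.normal(size=(4, 4)) * 0.1   # hidden -> hidden (the recurrence)
b_h  = np.zeros(4)

def rnn_forward(xs):
    """Process a sequence of any length; the model size stays fixed."""
    h = np.zeros(4)
    for x in xs:
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)
    return h

short = rnn_forward(rng.normal(size=(5, 3)))    # 5-step sequence
long_ = rnn_forward(rng.normal(size=(50, 3)))   # 50-step sequence
print(short.shape, long_.shape)  # same state size regardless of input length
```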

About the LSTM models
A generic LSTM unit has three gates that regulate the flow of information within the unit: the input, output and forget gates. For all the models, the dataset was split into training and testing datasets: two-thirds of the data was assigned to train the models and the remaining one-third was used to test them. All the models were trained for both 100 and 1,000 epochs.
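The gate arithmetic inside one LSTM step can be sketched in numpy; the weight shapes and sizes here are illustrative assumptions, not the trained models used in this study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM update: the gates control what is forgotten, written, emitted."""
    f = sigmoid(x @ W["f"] + h @ U["f"] + b["f"])   # forget gate
    i = sigmoid(x @ W["i"] + h @ U["i"] + b["i"])   # input gate
    o = sigmoid(x @ W["o"] + h @ U["o"] + b["o"])   # output gate
    g = np.tanh(x @ W["g"] + h @ U["g"] + b["g"])   # candidate cell state
    c_new = f * c + i * g                           # updated cell memory
    h_new = o * np.tanh(c_new)                      # emitted hidden state
    return h_new, c_new

rng = np.random.default_rng(1)
n, m = 3, 4  # toy sizes: 3 input features, 4 hidden units
W = {k: rng.normal(size=(n, m)) * 0.1 for k in "fiog"}
U = {k: rng.normal(size=(m, m)) * 0.1 for k in "fiog"}
b = {k: np.zeros(m) for k in "fiog"}
h, c = np.zeros(m), np.zeros(m)
h, c = lstm_step(rng.normal(size=n), h, c, W, U, b)
print(h.shape, c.shape)
```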

LSTM Network for Regression
The network has three layers, with the visible layer having one input. The hidden block was made up of 4 LSTM units, and the output layer produced a single-value prediction. The data from the dataset is then fit into the model, and from this the performance on the train and test datasets can be estimated. After this, the model is used to make predictions on both the train and test datasets, from which the visual skill of the model can be identified. Fig. 3(a) indicates the PM2.5 values against time for 100 epochs; green indicates the training dataset and red indicates the testing plot. The RMSE values obtained indicated that the model has an average error of 0.1552×10^-4 mg/L for the training dataset and 0.1289×10^-4 mg/L for the testing dataset. The R2 values obtained were 0.77 and 0.67 for the training and testing datasets, respectively. Fig. 3(b) shows the LSTM trained on regression for the dataset and the comparison of predicted values (blue) vs the training and testing datasets. Fig. 4(a) indicates the PM2.5 values against time for 1,000 epochs; green indicates the training dataset and red indicates the testing plot. The RMSE values obtained indicated that the model has an average error of 0.1553×10^-4 mg/L for the training dataset and 0.1276×10^-4 mg/L for the testing dataset. The R2 values obtained were 0.77 and 0.68 for the training and testing datasets, respectively. Fig. 4(b) shows the LSTM trained on regression for the dataset and the comparison of predicted values (blue) vs the training and testing datasets. It can be inferred that running for 100 or 1,000 epochs does not create any major differences in results and that the model fits both the training and testing datasets well.
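A sketch of the described architecture, assuming the Keras API: one input feature per time step, a hidden block of 4 LSTM units, and a single-value output. The optimizer and loss are assumptions, as the text does not state them.

```python
from tensorflow import keras

# Visible layer: 1 time step with 1 feature; hidden block: 4 LSTM units;
# output layer: one predicted value. Optimizer/loss are assumed here.
model = keras.Sequential([
    keras.layers.Input(shape=(1, 1)),
    keras.layers.LSTM(4),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mean_squared_error")
print(model.count_params())  # 101
```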

LSTM for Regression with Time Steps
Time steps can be used as inputs to predict the output at the next step, providing another way of tackling the time series problem. Any point of failure or surge, and the conditions that lead up to it, are the features that define a time step. Fig. S5(b) shows the LSTM trained on regression for the dataset and the comparison of predicted values (blue) vs the training and testing datasets. It can be inferred that running for 100 or 1,000 epochs does not create any major differences in results and that the model fits both the training and testing datasets well.
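The time-step framing above can be sketched as a sliding window over the series; the window length of 3 and the toy series are arbitrary illustrative choices.

```python
import numpy as np

def make_windows(series, time_steps=3):
    """Turn a 1-D series into (samples, time_steps) inputs and next-step targets."""
    X, y = [], []
    for i in range(len(series) - time_steps):
        X.append(series[i:i + time_steps])
        y.append(series[i + time_steps])
    return np.array(X), np.array(y)

series = np.arange(10, dtype=float)
X, y = make_windows(series, time_steps=3)
print(X.shape, y.shape)  # (7, 3) (7,)
print(X[0], y[0])        # [0. 1. 2.] 3.0
```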

LSTM with Memory Between Batches
LSTM in Python is executed through the Keras deep learning library, which supports both stateless and stateful LSTMs. Stateful LSTMs provide finer control over the internal state of the LSTM and over when that state is reset. This can be used to make predictions by building up state over the entire training sequence. The RMSE obtained for the testing dataset was 0.1653×10^-4 mg/L, and the R2 values obtained were 0.76 and 0.46 for the training and testing datasets, respectively. Fig. S6(b) shows the LSTM trained on regression for the dataset and the comparison of predicted values (blue) vs the training and testing datasets. For 1,000 epochs, the R2 values obtained were again 0.76 and 0.46 for the training and testing datasets, respectively. Fig. S7(b) shows the corresponding comparison of predicted values (blue) vs the training and testing datasets. It can be inferred that running for 100 or 1,000 epochs does not create any major differences in results and that the model fits both the training and testing datasets well.
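A sketch of a stateful LSTM, assuming the Keras API: the batch size must be fixed in advance, and the shuffle=False / reset pattern shown in comments is the usual way to build state across a full training sequence. All sizes here are illustrative assumptions.

```python
from tensorflow import keras

# Stateful LSTM: internal state persists across batches instead of being
# reset after each one, so the batch size must be fixed up front.
batch_size, time_steps, features = 1, 1, 1
model = keras.Sequential([
    keras.layers.Input(batch_shape=(batch_size, time_steps, features)),
    keras.layers.LSTM(4, stateful=True),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mean_squared_error")
# Typical training loop (sketch): fit one epoch at a time without shuffling,
# then clear the carried state between epochs:
#   for _ in range(n_epochs):
#       model.fit(X, y, batch_size=batch_size, shuffle=False, verbose=0)
#       model.reset_states()
print(model.layers[0].stateful)  # True
```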

Stacked LSTMs with Memory Between Batches
Stacked LSTMs are an extension of normal LSTMs, which have a single hidden layer: stacked LSTMs have multiple hidden layers with multiple memory cells. Stacking LSTM layers makes the model deeper and thus justifies the use of the term deep learning. Fig. S9(b) shows the LSTM trained on regression for the dataset and the comparison of predicted values (blue) vs the training and testing datasets. It can be inferred that running for 100 or 1,000 epochs does not create any major differences in results and that the model fits both the training and testing datasets well.
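A sketch of stacking, assuming the Keras API: the lower LSTM layer must return its full sequence (return_sequences=True) so the next LSTM layer receives one vector per time step. The layer sizes and sequence length are illustrative assumptions.

```python
from tensorflow import keras

# Two stacked hidden LSTM layers; the first returns the whole sequence so
# the second has a (time_steps, units) input to consume.
model = keras.Sequential([
    keras.layers.Input(shape=(3, 1)),             # 3 time steps, 1 feature
    keras.layers.LSTM(4, return_sequences=True),  # first hidden layer
    keras.layers.LSTM(4),                         # second hidden layer
    keras.layers.Dense(1),                        # single-value prediction
])
model.compile(optimizer="adam", loss="mean_squared_error")
print(len(model.layers))  # 3
```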

Results And Discussions
The collected dataset was divided into two parts: two-thirds of the data was used to train the models, and the remaining one-third was used to test the performance of the developed models against each other. The root-mean-square error (RMSE) and coefficient of determination (R2) were used to evaluate the performance of the different models presented in this paper.
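The two metrics can be computed directly from their definitions; this numpy sketch uses made-up toy values, not the paper's data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error: average magnitude of the prediction errors."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: fraction of variance explained."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(round(rmse(y_true, y_pred), 4))  # 0.1581
print(round(r2(y_true, y_pred), 4))    # 0.98
```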
Four different variations of the LSTM model were compared and tested based on performance. All four models were trained and tested using the same datasets: the PM2.5 mass concentrations from the three air quality monitoring stations in the city of Chennai were present in the dataset and were predicted to evaluate performance. The dataset was cleaned to remove any incomplete data, or data that exceeded 2.5×10^-4 mg/L, to provide more uniform predictions and results.
The prediction results from the four models are compared in terms of RMSE and R2 in Tables 2 and 3. The prediction horizon is 15 min, and most prior researchers have not used such a short horizon. All four DL models have similar RMSE values for the given dataset. The R2 values are also similar for the training dataset, but for the testing dataset there is a decrease in R2. This difference in R2 values between the training and testing datasets can be indicative of overfitting by the models.
All the models were trained for both 100 and 1,000 epochs and show very similar results in both cases.
The results, in terms of both RMSE and R2 values, are very similar, which can be attributed to the fact that a model should be trained only until it reaches its minimum error rate, as training beyond that point may cause overfitting. From the results, it can also be said that these models are very suitable for predicting urban PM2.5 concentrations in the future.
Nonetheless, the research has some drawbacks, as emissions have a huge effect on air quality. Because emission data are hard to acquire, the data used in this paper do not include emissions from factories and vehicles in the region, which affects the accuracy of the models' predictions. Also, when pollution increases suddenly because of accidents, the concentration of PM2.5 changes abruptly; whether the proposed models can forecast this well remains to be shown.

Conclusions
All the models were developed to predict PM2.5 concentrations, with the LSTM model that used regression with time steps showing the best results for 100 epochs of training. All four models produce very similar RMSE values for both the training and testing datasets. The smallest difference between training and testing RMSE values was seen in the LSTM with Memory Between Batches variation, while the lowest training and testing RMSE values were observed in the LSTM for Regression with Time Steps and the LSTM Network for Regression, respectively. The R2 values for these models were consistent for the training dataset but varied widely for the testing dataset. From this, it can be concluded that the LSTM Network for Regression produced the best results, as there was little to no overfitting in the model. While these results provide some insight into which models might be appropriate for predicting PM2.5 values, all of the models were trained for 100 and 1,000 epochs with little to no variation in the accuracy of predictions. There is also a need to introduce more statistical analysis techniques, and the introduction of more exploratory variables would improve the models' performance and open new avenues to study new exploratory variables and methods to analyze them. The development of such models is very useful to the city of Chennai, which is a vital industrial centre and a region of economic importance. The models can be used to identify factors that affect air pollution within the city and thereby reduce both the levels of pollution and their impact on the city's inhabitants. The models can also be extended to other cities around India and the world and thereby improve the quality of life of people everywhere.
Table 1: Statistical summary of the dataset

Figure 1: Location of air quality monitoring stations