An adaptive spatio-temporal neural network for PM2.5 concentration forecasting

Accurate PM2.5 concentration prediction is essential for environmental management; consequently, numerous air quality monitoring stations have been established, generating multiple time series with spatio-temporal correlation. However, the statistical distribution of data from different monitoring stations varies widely, which requires greater flexibility in the feature extraction stage. Moreover, the spatio-temporal correlation and mutation characteristics of the time series are difficult to capture. To this end, an adaptive spatio-temporal prediction network (ASTP-NET) is proposed, in which the encoder adaptively extracts the input data features and then captures the spatio-temporal dependencies and dynamic changes of the time series, while the decoder maps the encoded features into a predicted future time series representation; an objective function is also proposed to measure the overall fluctuations of the model's multi-step prediction. In this paper, ASTP-NET is evaluated on the Xi'an air quality dataset, and the results show that the model outperforms other baseline methods.


Introduction
The serious air pollution problem has attracted public attention as urban industrialization has progressed (Lizhong et al. 2017; Wei-Feng et al. 2018). PM 2.5 (particulate matter with a diameter of less than 2.5 μm), as one of the most common air pollutants, not only endangers people's health and can cause a variety of respiratory diseases (Jos et al. 2015), but also has a significant impact on the environment and the city's image. Therefore, accurate air quality forecasting is required to assist governments in implementing countermeasures and reducing the impact of PM 2.5 on cities and people.
However, the long-term prediction of PM 2.5 is an extremely challenging task. Firstly, due to the influence of many external factors, PM 2.5 time series data fluctuate greatly, and the temporal pattern is extremely complex and difficult to capture. Secondly, how to make use of heterogeneous multi-source data, including meteorological data, air quality data and future time data, is also a challenge for long time series prediction. The complex impact information can be divided into two main categories: spatial dependence, which indicates the correlation of time-series data in space, and temporal dependence, which means the correlation of time-series data in time sequence, or a combination of both. As shown in Fig. 1, arrows and lines are used to identify the complex impact information of different spatio-temporal attribute sequences. For example, at the same monitoring station, the pollutants PM 2.5 and PM 10 will show some correlation. For different monitoring stations, wind speed and direction have an impact on the diffusion and spread of pollutants, making the trend of PM 2.5 rise or fall in different areas. Meanwhile, due to cyclical human activities and seasonal changes, the same PM 2.5 trend may occur multiple times, and being able to capture these temporal entanglements is of great significance for time-series prediction.
Recently, hybrid deep neural network methods have gained a lot of attention in the field of time-series prediction. The core idea is to capture the spatio-temporal characteristics of the time-series data based on the attribute structure of realistic datasets, so as to efficiently utilize the dataset and complete the prediction. Among them, many graph-based deep learning models (Guo et al. 2021; Jin et al. 2021; Zhang et al. 2021; Huang et al. 2021; Wang et al. 2020, 2021) have been developed, where spatial structural correlations are captured by graph neural networks without considering the topology, and temporal correlations are captured by RNN-based, TCN-based or attention-based models. For instance, AST-GNN (Guo et al. 2021) develops a dynamic graph convolution module to capture spatial correlations in a dynamic manner, and captures temporal features of the local context in the temporal dimension through a self-attention mechanism. SpAttRNN (Huang et al. 2021) embeds graph-based attention units into recurrent neural networks to learn dynamic spatio-temporal correlations from multiple monitoring stations simultaneously. Similarly, ETGCN (Zhang et al. 2021) and ATGCN (Wang et al. 2021) fuse multiple graph structures and utilize graph convolutional networks (GCN) to model spatial correlations, then combine GRU with GCN to capture both spatio-temporal correlations and their changing states. These graph-based deep learning models have achieved impressive performance in spatio-temporal data feature extraction.

Fig. 1 The complex spatio-temporal entanglement between the properties of the different air monitoring stations. The abbreviations are as follows: (1) AQ stands for air quality data; (2) MD is meteorological data; (3) WD indicates the wind direction in the meteorological data; (4) WS indicates the wind speed in the meteorological data; (5) ID represents the internal dependencies of the attributes of a monitoring station; (6) SD represents the spatial dependence of different air monitoring stations; (7) TD represents the dependence entanglement in the spatio-temporal dimension

However, current forecasting methods based on spatio-temporal data still face the following challenges:

1. Capturing the spatial dependence between time series data can assist predictive modeling. The correlation of time series data from different monitoring stations cannot be determined simply by distance, and traditional attention mechanisms struggle to capture such complex information. Models based on multi-head self-attention can obtain the serial correlation between different input series by vector operations on them; however, this requires tuning a large number of attention matrix vectors and hyperparameters, resulting in a huge computational overhead. Reducing this computational overhead has become a serious challenge for time series prediction. In addition, the statistical distribution of attributes varies widely from station to station, so extracting features from monitoring stations before performing spatio-temporal processing of the data is also a challenge.

2. Time-series forecasting often uses the MSE loss function, which measures the absolute distance between two points in Euclidean space, but the overall fluctuations of the data are ignored by the MSE loss. Meanwhile, existing methods emphasize mining the pattern of the target time series from past data and ignore the effect of future known time series on the target series. The wide variety of data sources, coupled with little knowledge about their mutual interactions, makes time series prediction particularly challenging.
To the best of our knowledge, the adaptive spatio-temporal neural network (ASTP-NET) is a new approach to long time series PM 2.5 concentration prediction. The framework of the method is shown in Fig. 2, and its main processing steps are briefly described below. First, the input multivariate time series data are processed using the feature extraction mechanism of the prediction model, which adapts to the characteristics of the data itself; second, the influence of the entangled dependence information between the time series of multiple monitoring stations on the target series is captured by the spatio-temporal attention extraction mechanism; third, the overall trend and error of the target series are captured using the objective function.

Fig. 2 Framework of the proposed ASTP-NET
The main contributions of this paper are as follows:

1. An adaptive feature extraction network is built. The module can determine the degree of nonlinear processing required based on the characteristics of each station's own data, and it can process feature inputs of different dimensions.

2. We design a spatial attention mechanism for the model, using the adaptively extracted features of the target station and its neighboring stations as the basis. It focuses on the spatial connections between the target station and the neighboring stations and ignores the spatial connections among the neighboring stations, which we consider redundant. Compared with existing spatial feature processing modules, it improves the computational efficiency of spatial attention and can accurately capture the complex dynamic relationship between the target station and the neighboring stations.

3. Using deep learning embedding operations, known future time variables are mapped into numerical vectors in a continuous space, enabling the model to extract semantic information about future time variables along with a larger perceptual horizon.

4. A new objective function reasonably describes the long time series prediction task, including the MSE loss function and an error control term. The control terms mainly include regularization and variance. With the above operations, the objective function can guide the model to focus on the overall change magnitude of the target series.
The rest of this paper is organized as follows. In Sect. 2, we briefly review the existing solutions for PM 2.5 concentration prediction. Section 3 presents the preliminary knowledge used in this paper. In Sect. 4, the ASTP-NET method is described in an integrated manner. Section 5 discusses the operational efficiency of the method on real datasets and the effectiveness of the prediction model. Finally, Sect. 6 summarizes the conclusions and future research directions of the paper.

Related work
Air pollutant concentration prediction has been extensively studied and can be broadly divided into two categories: physical-principle-based methods and data-driven statistical methods. Physical-principle-based methods such as the CMAQ, MEGAN and WRF models (Binkowski et al. 2003; Carlton et al. 2010; Alex et al. 2006; Skamarock et al. 2005) are widely used for predicting emissions of gases and aerosols from nature. Their main advantage is that they can simulate the physical and chemical processes of pollutant diffusion and transformation based on atmospheric particle diffusion models (Xian-Xiang et al. 2006; Bray Casey et al. 2017; Holmes Nicholas and Lidia 2006). However, models based on physical principles require experts to carry out a complex investigation of a city; meanwhile, different cities are affected by terrain, climate, industrial facility density and other factors, and experts also need to adjust parameters accordingly. Therefore, we explore a data-driven method to predict urban PM 2.5 concentration. The advantage of this method is that the parameters can be adjusted dynamically and adaptively, while the spatio-temporal simulation of ground monitoring stations has greater flexibility and reliability in urban areas, and the prediction results are more accurate than those of physical models.
With the development of sensor collection technology and big data, data-driven models have become mainstream (Mclean et al. 2019; Araujo Lilian et al. 2020), aided by increasingly powerful computing resources. In general, data-driven methods can be divided into two categories: linear prediction models and nonlinear prediction models. Linear prediction models include geographically and temporally weighted regression (GTWR) (Mirzaei et al. 2019), land-use regression (LUR) (Qian et al. 2016), and autoregressive integrated moving average models (ARIMA) (Lanyi et al. 2018). However, PM 2.5 varies strongly in reality, and it is difficult for linear prediction to fit the complex curve of PM 2.5, which may lead to poor model performance. Nonlinear prediction models such as machine learning and deep learning are also widely used in air pollutant prediction, such as artificial neural networks (ANN) (Yao et al. 2021), support vector regression (SVR) (Yanlai et al. 2019; Liu et al. 2019), and random forest (Keyong et al. 2018). Compared to linear prediction models, these methods take nonlinear factors into account, because PM 2.5 concentration changes are related not only to other pollutants but also to meteorological conditions, weather, and topography, and these complex conditions require stronger model fitting capability.
However, the above methods are not specifically designed for time series models. Given the temporal nature of atmospheric pollutants, it is of critical importance to explore neural networks specifically designed for temporal modeling. Recurrent neural networks (RNN) (Biancofiore et al. 2017) can effectively deal with this problem by memorizing previous information and applying it to the computation of the current output; however, RNNs can suffer from exploding and vanishing gradients under long time dependence. Variants of RNNs, such as long short-term memory (LSTM) networks (Tsai et al. 2018), alleviate this problem to some extent, but they also fail to exploit the spatial correlation of the monitoring stations.
In recent years, more sophisticated data-driven hybrid approaches have been proposed. Compared with single prediction models, hybrid models can integrate the advantages of multiple models and efficiently capture the relationships within meteorological data. LSTM-FC (Jiachen et al. 2019) uses a neural network-based spatial combination model to capture the spatial and temporal correlation of PM 2.5 pollution between the target air quality monitoring station and its neighboring stations. Wavelet-BF-LSTM (Chen and Li 2021) takes into account the variations of different frequencies in various types of weather data, which can extract key features and improve the efficiency of prediction. Time-series prediction methods based on the self-attention mechanism can capture the dependencies between non-predicted sequences (Dairi et al. 2021; Yin et al. 2020; Fu et al. 2022). Due to the good performance of convolutional neural networks (CNN) on 2D and 3D matrix data, some CNN-based methods have been applied to feature extraction from atmospheric spatio-temporal data (Pak et al. 2020; Ding et al. 2021). These methods take into account the spatial connection between air monitoring stations. However, 2D CNNs have difficulty capturing the overall spatial dependence of the data when extracting features, and PM 2.5 concentrations at monitoring stations are influenced by other monitoring stations to different degrees, requiring a more flexible strategy to extract these features. To this end, graph neural networks have been applied extensively. Yanlin et al. (2019) proposed GC-LSTM, based on graph convolutional networks and long short-term memory networks, to model and predict the spatial and temporal variation of PM 2.5 concentrations. GMAN (Zheng et al. 2020) is an encoder-decoder model combining GCN and attention networks, capable of simulating dynamic spatial and nonlinear temporal correlations. However, the GCN global weight sharing mechanism may limit the ability to learn spatially local features, so ADVW-Net (Jin et al. 2022) integrates CNN and GCN to enable the model to capture both global and regional dependencies. Recently, many transformer-based models (Zhou et al. 2020; Wu et al. 2021; Liang et al. 2022; Cirstea et al. 2022) have been successful in the field of time-series prediction. AirFormer (Liang et al. 2022) proposes two new multi-head self-attention mechanisms that effectively capture spatial and temporal dependencies, respectively, and then exploits latent variables to capture the intrinsic uncertainty in air quality data. Triformer (Cirstea et al. 2022) employs a novel triangular structure of patch attention, which ensures linear complexity, while focusing on capturing the temporal dynamics of each variable of a multivariate time series.

Preliminary
In this section, we will elaborate on the multi-step prediction task of PM 2.5 concentration series and review some basic knowledge to facilitate the introduction of our method.

Problem formulation
In this part, we introduce the relevant notation and formalize the problem of PM 2.5 concentration prediction based on spatio-temporal data.
Suppose there are N monitoring stations in the air pollutant concentration monitoring network of the city. The set of monitoring stations is denoted as S = {S_1, S_2, ..., S_N}, where each monitoring station continuously collects observations of multiple air pollutants (e.g., PM 2.5, PM 10, O 3) and meteorological data (e.g., wind direction, wind speed, temperature) at a constant time interval. Let X_t ∈ ℝ^{N×m} be the observations of all stations at time step t, where N is the number of monitoring stations and m is the variable dimension of all air pollutants and meteorological data.
Some future known information (e.g., upcoming holiday dates, season) is represented as f_t ∈ ℝ^p, where p is the total number of dimensions of all time information.
Given a time window of length T, we use the observations of the past T time steps as the historical observations at moment T, represented as a sequence of feature matrices X = X_1, X_2, ..., X_T. Setting the number of prediction steps to τ, the sequence of known future information is f_1, f_2, ..., f_τ. We aim to learn a function F(·) that predicts the PM 2.5 concentration over the next τ steps:

y_{1:τ} = F(X_{1:T}, f_{1:τ})    (1)

where X_{1:T} ∈ ℝ^{N×T×m} and f_{1:τ} ∈ ℝ^{τ×p} are the input data and y_{1:τ} ∈ ℝ^τ is the target sequence.
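To make the tensor shapes of the formulation concrete, the following minimal sketch instantiates the inputs and the predicted output with illustrative dimension values (they are not the paper's experimental settings, and the placeholder function only demonstrates shapes, not the actual model):

```python
import numpy as np

N, T, m, p, tau = 11, 36, 6, 4, 24   # stations, history length, station features, time features, horizon

X = np.random.randn(N, T, m)          # past observations of all stations, X_{1:T}
f = np.random.randn(tau, p)           # known future time information, f_{1:tau}

def F(X, f):
    """Placeholder forecaster: any model F maps (X, f) to y_{1:tau}."""
    # mean-pool the history to a scalar, broadcast over the horizon
    return X[..., 0].mean(axis=(0, 1)) + 0.0 * f.sum(axis=-1)

y_hat = F(X, f)
assert y_hat.shape == (tau,)          # one PM2.5 value per future step
```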

Spatio-temporal modeling and mutation trends
The spatio-temporal pattern of PM 2.5 concentration is influenced by many factors. For example, pollutants in a city originate from local emissions and foreign transport, wind is the main driver of transport, and other meteorological factors such as temperature, rainfall, and humidity affect the accumulation and dissipation of pollutants locally.

Diebold-Mariano test
MAE and RMSE can be used to compare which model has better accuracy. However, higher accuracy alone is not sufficient to guarantee that a model is better. In particular, to minimally guarantee that a model is better, the difference in accuracy should be statistically significant. To evaluate this, the Diebold-Mariano (DM) test (Diebold Francis and Mariano Robert 2002) is a commonly used statistical test.
The null hypothesis of the DM test is that the target model A and the baseline model B have the same predictive accuracy, and the alternative hypothesis is that they have different predictive abilities. Using the mean square error as the loss function, the loss differential at moment t is d_t = (x(t) − x_A(t))^2 − (x(t) − x_B(t))^2, and the DM statistic is the mean of d_t divided by the standard error of that mean, where x_A(t) and x_B(t) denote the prediction values of test model A and benchmark model B at moment t, respectively. The p-value and S-statistic of the DM test reflect the difference in prediction accuracy between model A and model B. A p-value less than the significance level indicates that there is a difference in prediction accuracy between the two models, and a negative S-statistic indicates that the target model has smaller prediction error than the benchmark model.

Feature extraction module
Because of its small mass, PM 2.5 has a longer residence time in the atmosphere than larger particles, and it is difficult for it to settle to the ground naturally. Meanwhile, PM 2.5 diffuses easily between regions under the effect of wind speed and direction, so the concentration of PM 2.5 is influenced by the surrounding environment, and PM 2.5 concentration exhibits spatio-temporal correlation across districts. The traditional approach uses CNN and multi-layer perceptrons (MLP) for feature extraction when processing the initial inputs from different monitoring stations.
However, due to the lack of a priori knowledge, it is difficult to accurately determine the degree of influence of the inputs from multiple monitoring stations on the PM 2.5 concentration to be predicted, which makes feature extraction difficult. Meanwhile, complex feature processing sometimes has a bad effect on the prediction results. For example, when the monitoring station to be processed does not fluctuate much relative to other stations, a simpler network would be more beneficial. However, we do not know in advance the situation of the monitoring stations to be processed. Therefore, in order to process the data flexibly, we designed an adaptive feature extraction network (FEN), as shown in Fig. 3.
The set of monitoring stations is represented by S = {S_1, S_2, ..., S_N}, where S_i ∈ ℝ^{T×M} denotes the i-th monitoring station, i = 1, 2, ..., N, N is the number of monitoring stations, T is the total time step of the past input data, and M is the variable dimension of a single station.
A linear transformation with weights W_1^i, W_2^i and biases b_1^i, b_2^i is first applied, and the tanh activation function is used to obtain the nonlinear features of the monitoring station:

h_i^{tanh} = tanh(W_2^i (W_1^i S_i + b_1^i) + b_2^i)    (3)

A feature extraction branch based on Gated Linear Units (GLUs) (Gulati et al. 2020) provides adaptive selection of nonlinear features for monitoring stations, where the GLU can be expressed as:

GLU_i(S_i) = (W_3^i S_i + b_3^i) ⊙ σ(W_4^i S_i + b_4^i)    (4)

where σ is the sigmoid activation function, ⊙ is the element-wise Hadamard product, the output lies in ℝ^F, F is the hidden feature size, and W_3^i, W_4^i, b_3^i and b_4^i (i = 1, 2, ..., N) correspond to the weights and biases, respectively. The GLU is able to control the complexity of the network by limiting the propagation of nonlinear features in the network, even ignoring them when necessary.
Finally, the two feature sequences are added together to obtain the final monitoring station feature matrix:

h_i = h_i^{tanh} + GLU_i(S_i)    (5)

Fig. 3 The structure of the feature extraction module
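A minimal PyTorch sketch of this two-branch design follows. The class name, layer sizes, and single-layer tanh branch are illustrative assumptions, not the paper's exact architecture; the key point is that the GLU branch gates how much nonlinearity propagates, so stations with simpler data can be handled near-linearly:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Sketch of the adaptive feature extraction network (FEN).

    A tanh branch provides nonlinear features, while a GLU branch
    (value * sigmoid(gate)) adaptively limits how much nonlinearity
    propagates; the two branches are summed, as in Eq. (5).
    """

    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.tanh_branch = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.Tanh())
        self.glu_value = nn.Linear(in_dim, hidden_dim)   # W3, b3 in Eq. (4)
        self.glu_gate = nn.Linear(in_dim, hidden_dim)    # W4, b4 in Eq. (4)

    def forward(self, s):                                # s: (T, M), one station
        glu = self.glu_value(s) * torch.sigmoid(self.glu_gate(s))
        return self.tanh_branch(s) + glu                 # (T, F) feature matrix

fen = FeatureExtractor(in_dim=6, hidden_dim=5)
out = fen(torch.randn(36, 6))                            # T=36, M=6 illustrative
assert out.shape == (36, 5)
```

In practice one such extractor (with its own weights) would be instantiated per station, matching the per-station parameters W^i, b^i in the text.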

Spatial attention module
When people observe the outside world, they do not give equal attention to every region; more attention is given to regions of greater interest. Inspired by this human visual mechanism, the attention mechanism has been widely applied in various fields of artificial intelligence since it was first proposed.
With the increasing complexity of neural network structures and the data explosion of the big data era, the ability to distinguish the importance of different parts can undoubtedly improve the efficiency and quality of neural network fitting. It is obvious that the atmospheric pollutants and meteorological parameters of different auxiliary stations have different degrees of influence on the PM 2.5 concentration of the target station, and therefore the attention weights assigned should also be different. Compared with self-attention-based models, which calculate the correlation between all feature sequences, our model focuses on the correlation between the target station and the auxiliary stations and ignores the spatial correlation between auxiliary stations. Before processing temporal features, a well-designed spatial attention module is used to assign different spatial attention weights at each time step. The designed module is shown in Fig. 4.
The sequence of monitoring stations after feature extraction is represented as H = h_1, h_2, ..., h_tar, ..., h_N, where h_tar is the feature sequence of the target station. The base spatial-attention weight of h_tar is obtained through a linear network as follows:

h_0 = W_tar h_tar + b_tar    (6)

where W_tar and b_tar correspond to the weights and biases, respectively.

Fig. 4 The structure of the spatial attention module
Meanwhile, we concatenate the base spatial-attention weight h_0 with each element of H = h_1, h_2, ..., h_tar, ..., h_N to enhance the spatial attention perception of the auxiliary stations towards the target station. The feature sequences of the different monitoring stations can thus be expressed as E = e_1, e_2, ..., e_N, and we choose the hyperbolic tangent activation function to calculate the un-normalized spatial attention weight of each monitoring station:

e_i = tanh(W_e [h_0; h_i] + b_e)    (7)

The importance weights of each station are obtained by the softmax normalization operation:

α_i = exp(e_i) / Σ_{j=1}^{N} exp(e_j)    (8)

where α_i is the spatial attention weight of the i-th station and N denotes the number of stations.
Subsequently, the feature sequences of the stations are multiplied by their corresponding attention weights based on the spatial attention distribution α = α_1, α_2, ..., α_N to obtain the final spatial representation L_N:

L_N = Σ_{i=1}^{N} α_i h_i    (9)

In this process we focus on the spatial correlation between the target station and the auxiliary stations, reducing the complexity of the spatial correlation calculation to O(N).
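The target-centred attention described above can be sketched in PyTorch as follows. Layer names and sizes are illustrative assumptions; the essential property is that scores are computed only between the target's base weight h_0 and each station, giving O(N) scoring rather than the O(N^2) all-pairs scoring of full self-attention:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the target-centred spatial attention module."""

    def __init__(self, feat_dim):
        super().__init__()
        self.base = nn.Linear(feat_dim, feat_dim)       # W_tar, b_tar (Eq. 6)
        self.score = nn.Linear(2 * feat_dim, 1)         # W_e, b_e (Eq. 7)

    def forward(self, H, tar_idx):
        # H: (N, T, F) extracted features of all stations
        h0 = self.base(H[tar_idx])                      # (T, F) base weight
        N = H.shape[0]
        paired = torch.cat([h0.expand(N, -1, -1), H], dim=-1)   # [h0; h_i]
        e = torch.tanh(self.score(paired))              # (N, T, 1) raw scores
        alpha = torch.softmax(e, dim=0)                 # normalise over stations (Eq. 8)
        return (alpha * H).sum(dim=0)                   # (T, F) representation (Eq. 9)

att = SpatialAttention(feat_dim=5)
L = att(torch.randn(11, 36, 5), tar_idx=0)              # 11 stations, T=36
assert L.shape == (36, 5)
```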

Multi-layer Bi-TLSTM
The air quality data from the monitoring stations are affected by complex periodic human activities and seasonal factors, resulting in complex entangled time dependence. In order to capture the temporal characteristics of the air quality data, we use Bi-LSTM, which has been proven to achieve good results in many temporal prediction tasks.
The standard LSTM can selectively store and discard information in long time series data, but its ability to capture mutation trends in the input time series is still weak, so we use an LSTM network with a conversion mechanism (also known as TLSTM (Zheng and Hu 2022)) to solve this problem. Through the conversion processing of the input gating and the historical input C_t, the time series representation can be compressed as much as possible. In the TLSTM model, f_t, i_t and o_t stand for the forget gate, input gate, and output gate, respectively. Moreover, x_t represents the temporal features of the t-th time step, and Tf_t and Ti_t represent the outputs of the forget gate and the input gate after handling by the transformation mechanism at time t, respectively, with x_t ∈ ℝ^{T×P}, where P is the hidden feature dimension of the present moment.

Fig. 5 TLSTM neural unit structure

However, a standard LSTM can only obtain one-way temporal correlation, ignoring the auxiliary impact of subsequent data on the historical data at a certain moment. Therefore, Bi-LSTM is employed for time-dependent modeling, which can capture the forward and backward trends of periodic data simultaneously. As shown in Fig. 6, the forward and backward trending outputs A_i and A′_i are captured together at each moment, and then combined into the final output y_i for that moment. The Bi-TLSTM has a richer perceptual field of view and can capture more information.
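The bidirectional wiring can be sketched with a standard PyTorch LSTM standing in for the TLSTM cell (the gate conversion mechanism is specific to the paper and is not reproduced here); the point of the sketch is how forward and backward passes are combined per time step:

```python
import torch
import torch.nn as nn

# Bidirectional temporal modelling sketch. nn.LSTM replaces the TLSTM
# cell for illustration; bidirectional=True runs a forward and a
# backward pass whose hidden states are concatenated at each step.
bi_lstm = nn.LSTM(input_size=5, hidden_size=72,
                  num_layers=2, bidirectional=True, batch_first=True)

x = torch.randn(128, 36, 5)        # (batch, time steps, feature dim)
y, _ = bi_lstm(x)

# A_i and A'_i (forward/backward outputs) appear concatenated per step
assert y.shape == (128, 36, 2 * 72)
```

The hidden size (72) and layer count (2) match the experimental settings reported later; the input feature dimension is illustrative.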

Network decoder architecture
Under the influence of human natural activities and meteorological conditions, PM 2.5 has obvious multi-scale periodicity in many regions of the world, including seasons, upcoming holidays, weekends, etc. Unlike continuous numerical variables such as PM 2.5, these future time variables can be obtained in advance, and they have a direct time correspondence with the predicted results, which plays an auxiliary role in the final prediction.
For these future time variables, we use violin plots to show the frequency distribution from 1 June 2018 to 8 November 2020 (shown in Fig. 7a and b). For the hourly variables, PM 2.5 concentrations are slightly higher in the late evening and early morning than at other times, probably due to the high humidity, low temperature and low wind speed, which make meteorological conditions unfavorable for pollutant dispersion. For the weekly variables, the possibility of anomalies in PM 2.5 concentrations is greater on Tuesday than on the preceding days. In addition, PM 2.5 concentrations are significantly greater in spring and winter than in summer in Xi'an.
For future time category features, we use embedding operations to map the category features to a low-dimensional space, which is convenient for computation and automatic feature extraction. The sequence of future time features is represented as f = f_1, f_2, ..., f_τ, with f_t ∈ ℝ^p, where p is the dimension of the future time categorical features. The embedded features f′ then pass through the feature extraction module to get the final output f_time. As discussed above, a vector of prior knowledge of the target sequence is obtained to assist the decoding process by processing known future variables.
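The embedding step can be sketched as follows. The specific categories (hour, weekday, season, holiday flag), their cardinalities, and the embedding size are assumptions for illustration; the paper's exact set of future time variables may differ:

```python
import torch
import torch.nn as nn

class TimeEmbedding(nn.Module):
    """Sketch: map categorical future time variables to dense vectors
    via learned embedding tables, then concatenate per time step."""

    def __init__(self, dim=4):
        super().__init__()
        self.hour = nn.Embedding(24, dim)      # hour of day
        self.weekday = nn.Embedding(7, dim)    # day of week
        self.season = nn.Embedding(4, dim)     # season
        self.holiday = nn.Embedding(2, dim)    # holiday flag

    def forward(self, hour, weekday, season, holiday):
        return torch.cat([self.hour(hour), self.weekday(weekday),
                          self.season(season), self.holiday(holiday)], dim=-1)

emb = TimeEmbedding(dim=4)
tau = 24                                       # prediction horizon
out = emb(torch.arange(tau) % 24,
          torch.zeros(tau, dtype=torch.long),
          torch.zeros(tau, dtype=torch.long),
          torch.ones(tau, dtype=torch.long))
assert out.shape == (tau, 16)                  # 4 categories x 4 dims each
```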

Objective function
The objective function is divided into a loss function term and an overall distribution term. In the case of single-step prediction, the loss function term combines the MSE loss and L2 weight regularization to update the network parameters, where L2 weight regularization is employed to penalize neural units with higher weights and avoid overfitting:

L_reg = λ Σ_{i=1}^{N} w_i^2

where λ is the regularization parameter, N is the number of parameters, and w_i is the weight. The formula of the MSE is:

MSE = (1/M) Σ_{j=1}^{M} (y*(j) − y(j))^2

where M is the batch size, and y*(j) and y(j) are the predicted and observed values, respectively.
In the case of multi-step prediction, we consider the overall distribution term, which is the difference in statistical distribution between the predicted series and the true series:

L_std = |Std* − Std|

where Std* and Std represent the standard deviations of the predicted sequences and the observed sequences, respectively. When the difference of the standard deviations tends to 0 the term becomes ineffective, and when it is too large it can lead to gradient explosion; to avoid both situations, the term is bounded in the full objective. Finally, the training process of the proposed ASTP-NET is shown in Algorithm 1.
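A sketch of the composite objective follows. The clamp bounds, the weighting factor `gamma`, and the regularization strength `lam` are illustrative assumptions — the paper's exact bounding of the distribution term is not reproduced here:

```python
import torch

def astp_loss(pred, target, weights, lam=1e-4, gamma=0.1):
    """Sketch: MSE + L2 weight regularization + a bounded penalty on the
    gap between predicted and observed standard deviations."""
    mse = torch.mean((pred - target) ** 2)
    l2 = lam * sum((w ** 2).sum() for w in weights)
    # clamping keeps the distribution term from vanishing (gap -> 0)
    # or exploding (gap too large), per the discussion in the text
    std_gap = torch.clamp(torch.abs(pred.std() - target.std()),
                          min=1e-3, max=10.0)
    return mse + l2 + gamma * std_gap

pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.0, 2.0, 4.0])
params = [torch.ones(2, 2)]            # stand-in for model weights
loss = astp_loss(pred, target, params)
assert loss.item() > 0
```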

Experiments
In this section, we first introduce the dataset and evaluation metrics, and then we conduct many experiments to demonstrate the validity of our model.

Public datasets
The area we studied is Xi'an, which is bordered by the Qinling Mountains and lies in a typical basin terrain that is not conducive to the diffusion of air pollutants. Like most fast-growing cities, Xi'an has undergone industrial transformation, urbanization and massive energy consumption in the course of its development, and as a result the city's air quality has been affected. However, thanks to the development of data sensor technology, we are able to obtain more historical data for academic research and to assist relevant authorities in environmental management.
The dataset consists of air quality data, meteorological data and temporal characteristic data; the details are shown in Table 1. The air quality data and meteorological data were obtained from the data platform of the State Key Experiment Project of the Xi'an Meteorological Bureau, with portions of the meteorological data provided by the ERA5 website (cds.climate.copernicus.eu/). We selected 11 monitoring stations with a total of 21384 hourly records from June 1, 2018 to November 8, 2020. The air quality data contain three pollutants: PM 2.5, PM 10 and O 3. The meteorological data include temperature, humidity and wind speed. The geographical distribution of the air quality monitoring stations is shown in Fig. 8. To resolve timestamp inconsistencies caused by missing data, linear interpolation is performed using the data before and after each gap.
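The gap-filling step can be sketched with pandas. The column name and values are illustrative; the point is that interior missing hourly records are filled linearly from the neighbouring observations:

```python
import numpy as np
import pandas as pd

# One station's hourly PM2.5 series with two missing records
idx = pd.date_range("2018-06-01", periods=5, freq=pd.Timedelta(hours=1))
s = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0], index=idx, name="PM2.5")

# Linear interpolation from the values before and after each gap
filled = s.interpolate(method="linear")
assert filled.tolist() == [10.0, 12.0, 14.0, 16.0, 18.0]
```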

Evaluation metrics
The mean absolute error (MAE), the root mean square error (RMSE) and the symmetric mean absolute percentage error (SMAPE) are employed to verify the reliability of the designed method in the subsequent experiments. They measure the error between the predicted values ŷ_t and observed values y_t over n test samples as follows:

MAE = (1/n) Σ_{t=1}^{n} |y_t − ŷ_t|

RMSE = sqrt((1/n) Σ_{t=1}^{n} (y_t − ŷ_t)^2)

SMAPE = (100%/n) Σ_{t=1}^{n} |y_t − ŷ_t| / ((|y_t| + |ŷ_t|) / 2)
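These three metrics can be computed directly (the SMAPE variant below scales by the mean of the absolute values, consistent with the definition above; other SMAPE variants exist):

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def smape(y, y_hat):
    # |error| scaled by the mean of |y| and |y_hat|, in percent
    return np.mean(2 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat))) * 100

y = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])
assert round(mae(y, y_hat), 4) == 2.3333
assert round(rmse(y, y_hat), 4) == 2.3805
assert round(smape(y, y_hat), 2) == 12.74
```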

Comparison algorithm
To verify the effectiveness of the model, we compare ASTP-NET with six deep learning methods: (1) ANN (Cakir and Sita 2020). ANN is a classical neural network widely used in the field of prediction; a basic BPNN consists of an input layer, several hidden layers, and an output layer. (2) LSTM (Tsai et al. 2018). Unlike ordinary recurrent neural networks, the LSTM consists of input gates, forget gates, output gates and a cell state C_t. The gating mechanism of the LSTM allows the model to capture longer temporal patterns. (3) GRU (Zhou et al. 2019). GRU is a variant of the LSTM that combines the input and forget gates into a single update gate, which controls how much information from the previous and current time steps is passed on to the future; the other gate of the GRU, called the reset gate, controls how much information from the past is forgotten. (4) Deep-TCN (Jiang et al. 2021). The model is based on complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) and a deep temporal convolutional network (DeepTCN), which capture the data patterns of external factors affecting PM 2.5 concentration changes and model the historical pollutant concentration data. (5) CNN-LSTM (Ding et al. 2021). The model amplifies the spatial features of different monitoring stations by multi-layer convolution and pooling operations, and then an LSTM decoder obtains their time-dependent patterns. (6) Triformer (Cirstea et al. 2022). The model employs a novel triangular structure of patch attention, which ensures linear complexity, while focusing on capturing the temporal dynamics of each variable of a multivariate time series.

Implementation details
The specific details of the experimental settings are shown in Table 2. To ensure the reliability of the method in this paper, we divided the data into a 60% training set, 20% validation set, and 20% test set in chronological order. The time span of the past input sequence is 36, the time span of the future input sequence equals the prediction length, and the feature dimension of the single-station output in the adaptive feature extraction module is 5. The number of neurons in the Bi-TLSTM layer is 72 and the number of layers is 2. The batch size of the training data is set to 128, and the model repeats the training process 100 times on the training dataset to obtain its best state. The CPU of the experimental device is an Intel Xeon Silver 4214, and the GPU is an NVIDIA TITAN RTX. Our proposed method and the baseline methods are implemented in PyTorch 1.2.
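The chronological 60/20/20 split described above can be sketched as follows (the function name is ours):

```python
def chronological_split(series, train_frac=0.6, val_frac=0.2):
    # split a time-ordered sequence without shuffling, so the
    # validation and test sets always lie after the training data
    n = len(series)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:]
    return train, val, test
```

Keeping the split chronological matters for time series: a shuffled split would leak future information into training.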

Performance comparison
Table 3 summarizes the comparison between our method and the baseline methods, with independent experiments conducted at different prediction steps; the best result for each metric is shown in bold.
From Table 3, we find that the classical deep learning models ANN and LSTM perform consistently in short-term prediction but still lag behind recent approaches in long time series prediction, which may be because their model structures limit their ability to capture complex periodic patterns at spatio-temporal scales. In addition, CNN-LSTM shows declining prediction performance with increasing prediction step size, because feature extraction from different stations using a CNN destroys the inherent spatio-temporal structure of the data, and the local weight-sharing mechanism of the CNN struggles to capture global spatial features, whereas the adaptive feature extraction network gives the model more flexibility in the feature extraction stage. Triformer, based on the self-attention mechanism, also achieves excellent performance, demonstrating the importance of capturing dynamic spatio-temporal associations between sequences in multivariate spatio-temporal data, which offers important inspiration for future work. ASTP-NET clearly achieves better performance than the other methods on all metrics, with decreases of 2.11, 2.36, and 8.98% in MAE at the 8, 24, and 72 scales, respectively, which further proves that ASTP-NET is an effective PM 2.5 concentration prediction scheme. Firstly, from the spatial perspective, the model focuses on the correlation between the target station and the auxiliary stations while ignoring interference information between the auxiliary stations. Secondly, from the temporal perspective, our capture of abrupt outliers makes the model more stable and efficient.
To further explore the influence of the different spatial distributions of monitoring stations on model performance, we selected 11 monitoring stations for comparison experiments with a prediction step of 72 and RMSE as the evaluation metric; the specific results are shown in Fig. 9. We can see that the prediction accuracy of each model changes with the concentration at different monitoring stations. Among them, ANN and LSTM show severely degraded fitting ability and large errors on long time series; Deep-TCN's predicted concentration fluctuates noticeably with the average concentration at each station; and Triformer performs impressively in long time series prediction, slightly outperforming ASTP-NET at station 4, which shows the potential of self-attention-based models for long time series prediction. Overall, however, ASTP-NET shows no significant fluctuation in prediction error and outperforms the other models. In addition, monitoring stations 8 and 9 are located in the low-lying basin area near the Qinling Mountains, where air pollutants do not easily diffuse. As depicted in Fig. 9, the average PM 2.5 concentrations at these two stations were 59.19 and 55.85 μg/m³, higher than the average concentration across all stations (50.7 μg/m³), yet ASTP-NET still maintained a stable prediction capability.

As shown in Fig. 10, the red dashed line represents the true values and the orange solid line the predictions generated by ASTP-NET; ASTP-NET matches the true values in most cases. At the same time, some baseline models show significant lags during continuous up-and-down fluctuations, while ASTP-NET captures these fluctuations in advance. Moreover, at step 17 there is a significant abrupt change in the target curve, and ASTP-NET captures this trend faster than the other models. This reflects that the model effectively learns the spatial and temporal dependence information of PM 2.5 when faced with a complex air quality time series prediction task. In Fig. 11, the blue curve of the ASTP-NET method converges after 12 epochs with a lower final validation loss than the other baseline models, and it also starts from a smaller initial validation loss, which indicates that the model can learn more spatio-temporal features in the initial stage and that the proposed objective function enhances the model's ability to capture these features.

For air pollutant prediction, accurate prediction of extreme PM 2.5 concentrations is required for early warning. Precision and recall are common metrics used by the meteorological industry to assess the performance of prediction models. According to China's ambient air quality standards, exposure to PM 2.5 at concentrations greater than 115 μg/m³ may seriously affect citizens' health. We used this value as a threshold to evaluate how well the different models predict high concentrations (Table 4). When the prediction step size increases, the precision and recall of extreme value prediction both decrease significantly, but our method still maintains good efficiency. GRU fails to account for multi-site spatio-temporal correlation, and its prediction precision is low. CNN-LSTM is stable in precision and recall but still falls short in capturing extreme values. Triformer's ability to capture multivariate sequence correlation also yields good results in extreme value prediction.
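The extreme-value evaluation in Table 4 amounts to treating hours with concentration above 115 μg/m³ as positive events and scoring the forecasts as a binary classifier; a sketch (the helper name is ours):

```python
def exceedance_precision_recall(y_true, y_pred, threshold=115.0):
    # an hour counts as a "high pollution" event if concentration > threshold
    actual = [t > threshold for t in y_true]
    predicted = [p > threshold for p in y_pred]
    tp = sum(a and p for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```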

Ablation study
Ablation analysis is executed to measure the contribution of each component to ASTP-NET by removing the different components from the model. As shown in Fig. 2, ASTP-NET has four main parts: the Feature Extraction Module, the Spatial Attention Module, the Multi-Layer Bi-TLSTM, and the Future Inputs, whose contributions can be verified through ablation experiments. Theoretically, the Feature Extraction Module can adaptively extract the initial input features of different stations on demand. To verify this, we replace the feature extraction module with a simple linear layer to characterize its contribution. Figure 13b shows the model performance with and without the Feature Extraction Module. The model with the Feature Extraction Module clearly achieves better prediction performance than the one without it; in particular, it can accurately capture the mutation points in intervals with large short-term fluctuations. Meanwhile, when the Spatial Attention Module is removed, the predictions show larger oscillations and cannot fit the ground truth precisely; Fig. 12 further illustrates this phenomenon. In fact, the Spatial Attention Module is built on the spatial correlation between the target and auxiliary stations and can assign different weights to those stations, allowing the module to extract valid information while ignoring redundant information, thereby improving the model's prediction performance.
In addition, we use ablation experiments to verify the effectiveness of the Bi-TLSTM module in capturing entangled time-periodic features. As shown in Fig. 12, the MAE decreases by 31.35, 17.65, 9.76, and 13.64% at the scales of 1, 4, 48, and 72, respectively. In fact, Bi-TLSTM efficiently captures important temporal features from entangled long-term temporal patterns in both the forward and backward directions. This allows the model to focus on the moments that have a large impact on subsequent PM 2.5 concentrations, such as the moments when the PM 2.5 concentration surges abruptly.

Fig. 13 Visualization of ablation analysis
The enhancement of the model by future temporal data can also be seen. As shown in Fig. 12, the error is small when the future time inputs are missing in short-term forecasts, while the MAE fluctuates substantially when they are missing in long-term forecasts. In fact, the statistical distribution of PM 2.5 concentrations varies significantly across time scales. We feed these known future temporal statistical characteristics into the model as prior knowledge to guide prediction, which gives the model a larger temporal perception range and lets it know the temporal features of the PM 2.5 concentration sequence to be predicted in advance.
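The "known future inputs" are calendar features that are available for any future timestamp, unlike the pollutant values themselves. A minimal sketch of building such covariates; the specific choice of hour-of-day, day-of-week, and month features is our assumption:

```python
from datetime import datetime, timedelta

def future_time_features(start, horizon_hours):
    # calendar covariates are known in advance for the whole
    # prediction horizon and can be fed to the decoder as-is
    feats = []
    for k in range(horizon_hours):
        t = start + timedelta(hours=k)
        feats.append((t.hour, t.weekday(), t.month))
    return feats
```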

Hyperparameter sensitivity analysis
We investigate the effect of two hyperparameters of the model: the input historical data length t and the hidden feature dimension h of the feature extraction network.

We first explored the effect of the input historical data length by varying the hyperparameter t over {16, 24, 36, 72, 128}, with prediction lengths of 8 and 72 hours respectively, to capture the relationship between input length and prediction step size. In Table 5, we observe an increase in error when the number of prediction steps is 8 and t is too small, e.g., t = 16, which indicates that too short an input cannot capture the temporal patterns well. In addition, the error fluctuates significantly when t is 128, which may be a result of overfitting caused by the short prediction step combined with too much historical input information. When the prediction step is 72, prediction accuracy does not improve with increasing t. This may be because the dataset's own periodic patterns do not require much historical input to fit.
Then we analyze the hidden feature dimension of the feature extraction network by varying the hyperparameter h over {2, 4, 6, 8, 16}. In Table 6, there is a significant increase in error when h takes smaller values. This may be because, once the length of the input history is fixed, an appropriate number of hidden feature dimensions is needed for feature extraction. When h = 16, the accuracy of the model is almost the same as with h = 6, but considering operational efficiency and the number of training parameters, we choose h = 6 as the hidden feature dimension of the feature extraction network.

Diebold-Mariano test
Finally, we applied the DM test to ASTP-NET and the six baseline methods. As shown in Table 7, ASTP-NET rejected the null hypothesis of equal predictive accuracy and outperformed the existing models, which proves that the model's advantage is statistically significant.
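The DM test compares two forecasts through their loss differential. A minimal one-step-ahead sketch with squared-error loss and no autocovariance correction (adequate for horizon 1; longer horizons need a HAC variance estimate, which the paper's exact procedure may include):

```python
import math

def dm_statistic(errors_a, errors_b):
    # loss differential between the two competing forecasts
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    n = len(d)
    mean_d = sum(d) / n
    # sample variance of the differential (ddof = 1)
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
    # under H0 (equal accuracy) the statistic is approximately N(0, 1);
    # a large negative value favors model A
    return mean_d / math.sqrt(var_d / n)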

Conclusions
In this paper, we propose a PM 2.5 concentration prediction network based on spatio-temporal data, which we call ASTP-NET. First, the model performs flexible feature extraction through the encoder's adaptive feature extraction network; spatio-temporal correlation is then captured by a spatio-temporal attention module, while an LSTM with a transformation mechanism captures mutation features. In the decoder, we use known future data to align the prediction sequence, and finally we use a new objective function to measure the overall fluctuation error of the sequences. The evaluation shows that our proposed model significantly outperforms state-of-the-art methods on real datasets.

ASTP-NET still has some limitations. Firstly, during training the number of learnable parameters depends on preset choices; for example, we set the input window to 36 during training. If the input size is changed to 48 at test time, the pre-trained model cannot be used, which hinders the generalized application of the model. To address this, a generator could be introduced so that only the generator's parameters need to be learned, making the model more flexible. Secondly, the model cannot predict multiple pollution sources at the same time, so multi-objective optimization theory could be introduced. A future research direction is to establish a generalized time series prediction architecture, and more data sources, such as industrial emissions and traffic flow, will be used in the model to improve prediction accuracy.
The matrices above represent the gate weights, and b = {b_f, b_i, b_g, b_o} the bias vectors of the three gates; the memory cell c_t is responsible for storing historical information, and h_t represents the hidden state at the current moment. The exact structure of the TLSTM is shown in Fig. 5, where the input data pass through both the forget gate and the input gate; their principle is based on the output of the sigmoid function, where values close to 0 cannot pass and values close to 1 pass fully. The output range of the gating mechanism is [0, 1], and the 1 − tanh function maps the gate output to the interval [0.25, 1.0], while values near the middle are compressed to about 0.5. Compressing the data into this most responsive interval is more conducive to capturing dependencies between data points and detecting short-term abrupt trends.
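The effect of the 1 − tanh mapping on gate outputs can be checked numerically; the endpoints of [0, 1] land at 1.0 and ≈0.24 (the [0.25, 1.0] interval above is this range rounded), and a mid-level gate value of 0.5 maps to ≈0.54. This sketch assumes the sigmoid gate output feeds directly into 1 − tanh as described:

```python
import math

def gate_transform(g):
    # map a sigmoid gate output g in [0, 1] through 1 - tanh(g);
    # the transform is monotonically decreasing on this interval
    return 1.0 - math.tanh(g)
```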

Fig. 7
Fig. 7 PM 2.5 concentration distribution at different time scales

Fig. 8
Fig. 8 Geographical distribution map of Xi'an air quality monitoring stations. (The red flags in the right map represent air quality monitoring stations)

Fig. 9
Fig. 9 Distribution of the average observed hourly PM 2.5 concentrations at each station (histogram) and distribution of RMSE values under different methods (points connected by solid lines), denoted by different colored points

Fig. 11
Fig. 11 Comparison of the ASTP-NET method with baseline model validation loss curves

Fig. 12
Fig. 12 Histogram comparison results of ablation experiments with different modules

Table 1
Statistics

Table 2
Details

Table 3
The results of different models for PM 2.5 concentration prediction in different time periods based on the first station (station ID: 1)

Table 4
Average precision and recall of 11 stations of different models

Table 5
Sensitivity analysis of the time step t of past input data

Table 7
Diebold and Mariano tests between the proposed ASTP-NET and baseline methods