A novel PM2.5 concentration probability density prediction model combining the least absolute shrinkage and selection operator with quantile regression

PM2.5 has a significant negative impact on human health and atmospheric quality, and accurate prediction of its concentration is necessary. When common point prediction models are used for PM2.5 concentration prediction, the many sources of uncertainty affecting PM2.5 concentrations leave the results with poor accuracy. To address this issue, this paper proposes a quantile regression neural network (QRNN) model based on the least absolute shrinkage and selection operator (LASSO), combined with kernel density estimation (KDE), for probability density prediction of PM2.5 concentrations. The model uses LASSO regression to select the influential factors; the quantiles of daily PM2.5 concentrations calculated by the QRNN model are then imported into the KDE model to obtain probability density predictions of PM2.5 concentrations. Empirical analyses are carried out on the cities of Beijing and Jinan in China as well as six other datasets, and the prediction performance of the model is assessed with evaluation criteria for both point prediction and interval prediction. The simulations reveal that the predictive performance of the LASSO-QRNN-KDE model is good: the model is not only effective in filtering high-dimensional data, but also more accurate than common research models. In addition, the model is able to describe the uncertainty of PM2.5 concentration fluctuations and carries more information on the variation of PM2.5 concentrations, providing a novel and capable PM2.5 concentration prediction tool for relevant policy makers.


Introduction
PM2.5 refers to particulate matter in the ambient air with an aerodynamic equivalent diameter less than or equal to 2.5 microns; its small particle size allows it to act as a carrier of toxic substances. These fine particles can be inhaled through the respiratory system and adhere to the upper and lower respiratory tracts and lung lobes, causing damage to the nervous, cardiovascular and cerebrovascular, and respiratory systems.
People who have been exposed to fine particulate pollution for a long time are more likely to face cardiovascular and respiratory diseases (Geng et al. 2015; Lee et al. 2017; Mukhopadhyay and Sahu 2018). At the same time, PM2.5 pollution has a negative impact on the normal development of agriculture, industry, and cultural tourism; it also disrupts traffic order and causes major economic losses. Accurate prediction of PM2.5 concentration makes it possible to issue early warnings of heavily polluted air so that targeted protective measures can be taken. Therefore, in order to maintain the order of production activities, reduce public health risks, and provide a meaningful reference for relevant policy makers, the prediction of PM2.5 concentration is of great significance.
Currently, prediction models for air pollutant concentrations such as PM2.5 can be divided into three main categories: deterministic models, classical statistical models, and machine learning models (Jiang and Qiao 2021). Deterministic methods mainly include the weather research and forecasting model (WRF) (Cao et al. 2018), the community multiscale air quality model (CMAQ), and the coupled WRF-CMAQ meteorological-chemical model (Syrakov et al. 2016). These methods use relevant meteorological data and pollutant source data to simulate the process of pollutants from emission through accumulation to dispersion and transfer, so as to predict and analyze pollutant concentrations. According to Lightstone et al. (2017), such methods are relatively complex and do not have a significant advantage in prediction accuracy over other models. Among statistical models, the autoregressive integrated moving average model (ARIMA) and the multiple linear regression model (MLR) are the most widely used for PM2.5 concentration prediction (Zhang et al. 2018). However, PM2.5 concentration is affected by a variety of variable factors and exhibits strong nonlinearity and complexity, and statistical models are not satisfactory at processing nonlinear time series data. Machine learning draws on probability theory, statistics, approximation theory, and complex algorithms, and can simulate the way humans learn; it offers superior predictive performance and is widely used in air pollutant concentration prediction. Hybrid models are likewise widely applied to predicting the concentrations of air pollutants such as PM2.5. Huang and Kuo (2018) combined a convolutional neural network (CNN) and long short-term memory (LSTM) and applied the result to a PM2.5 prediction system; this method effectively improved the prediction accuracy of the LSTM model. Huang et al.
(2021) used empirical mode decomposition (EMD) to decompose the PM2.5 concentration series, and input the stationary sub-sequences and meteorological features obtained after decomposition into a gated recurrent unit neural network (GRU) for training and prediction. The results showed that the accuracy of the EMD-GRU model is much higher than that of the single GRU model. Li et al. (2017) introduced co-integration theory to address the lack of evaluation of possible correlations between different variables in current research. The authors used the flower pollination algorithm (FPA) to optimize the parameters of a support vector machine (SVM) model and made high-precision predictions of PM2.5 and PM10 concentrations in two important cities in Yunnan Province, China.
PM2.5 concentrations are influenced by a variety of factors, related not only to pollutant emissions but also to meteorological conditions. Common pollutant factors include PM10, NO2, CO, and O3 (Masiol et al. 2013; Xiao et al. 2018), while meteorological factors include wind speed, precipitation, temperature, and humidity (Liao et al. 2017; Wang et al. 2019; Xu et al. 2019). In the big data environment, the related data sources are extensive and heterogeneous, which makes the prediction of PM2.5 concentration more complicated. A review of the existing literature on predicting PM2.5 and other air pollutant concentrations shows that most methods produce deterministic point predictions, which can hardly reflect the uncertainty of concentration fluctuations. To solve this problem, probability density prediction of PM2.5 concentration can be used to overcome the shortcomings of traditional point and interval prediction methods. In this way, future PM2.5 concentration values and the fluctuation intervals under the corresponding confidence levels can be obtained, and the probability of each value within the interval can also be visualized. The probability density function of the predicted PM2.5 concentration conveys more detailed and useful information than traditional prediction methods, which is extremely helpful for accurate PM2.5 concentration prediction.
However, due to the complexity and variability of the factors affecting PM2.5 concentration, a probability density function that accurately portrays its uncertainty is difficult to obtain. Therefore, to predict PM2.5 concentration more accurately, valuable information first needs to be extracted from the wide range of influencing factors, so as to filter out those with significant explanatory power for PM2.5 concentration. The classical approach to variable selection is best represented by the subset selection method, which compares the 2^p − 1 non-empty subsets of the p variables under a given criterion to find the optimal subset. The most representative selection criteria are the Akaike information criterion (AIC), based on information theory and proposed by Hirotugu Akaike in 1974 (Hoerl and Kennard 2000), the BIC criterion based on Bayes' method proposed by Schwarz (1978), and the cross-validation criterion. But in an information age where high-dimensional data are proliferating in every field, the complexity and high cost of computation make traditional variable selection methods no longer applicable to high-dimensional data structures. Statisticians have therefore turned to variable selection methods for high-dimensional data. In this field, Tibshirani's 1996 proposal of the LASSO algorithm (least absolute shrinkage and selection operator) under the ordinary linear model was of epoch-making significance (Tibshirani 1996). It was inspired by two earlier algorithms, ridge regression (Hoerl and Kennard 2000) and the nonnegative garrote (Breiman 1995).
LASSO is essentially a coefficient compression method that shrinks the coefficients of non-significant variables to zero, achieving variable selection and parameter estimation at the same time. The method can be seen as an improvement on ridge regression, but it is milder in compressing the coefficients of variables with significant effects, thus preserving the accuracy of parameter estimation. In a previous study, Koen et al. (2020) used a LASSO regression model to select variables closely related to four common chronic diseases in the Netherlands in order to predict disease prevalence; the results showed that the LASSO algorithm significantly improved the predictive performance of the model. Jasleen and Mamta (2021) used a correlation-based adaptive LASSO algorithm to identify potentially important influences while studying pollutant concentrations and meteorological factors in Delhi and its surrounding cities; evaluation of the model revealed that this approach performs well at extracting a subset of features. In an empirical study, Li et al. (2017) found that real earnings management (REM) was strongly selected as a key crisis predictor by the LASSO algorithm, and developed an updated distress forecasting model for the Chinese market that accounts for earnings management. In addition, the method has been applied to the variable selection process of quantile regression models.
Quantile regression was first proposed by Koenker and Bassett (1978). Due to its unique explanatory power, it has become one of the key tools for data analysis and is widely used across data analysis fields (Koenker 2015). Quantile regression gives a more comprehensive picture of the conditional distribution of the response variable: instead of analyzing only the conditional expectation (mean) of the response variable, the method can also analyze how the explanatory variables affect its median, quantiles, and so on. Kang et al. (2021) used quantile regression to study the impact of macroprudential policies on the level of bank financing of Chinese firms, and pointed out that the intensity of the impact differed at different quantile points. Salari et al. (2021) applied quantile regression to study the impact of multiple factors on the ecological footprint of emerging countries, and discussed the significant impact of the factors at different quantiles. Sun et al. (2021) used quantile autoregressive models to analyze return and volatility series on multiple time scales, taking the crude oil market as an example, and found that the autoregressive coefficients vary with the quantile. These studies further demonstrate the advantages of quantile regression: the method can explain the effect of each input variable on the response variable at different quantiles and obtain a more detailed relationship between the input and response variables.
The approach of combining the LASSO algorithm with quantile regression has been applied to the power, financial, and medical sectors. He et al. (2019) predicted electricity consumption based on LASSO-quantile regression for Guangdong Province, China, and California, USA.
Simulation results showed that the method provided better performance for electricity consumption forecasting. Cao et al. (2021) studied the optimal portfolio problem based on the LASSO-quantile regression method and showed that, under certain regularity conditions, the results obtained were close to the given value at risk, demonstrating the robustness and effectiveness of the method. Taha et al. (2018) reduced data dimensionality with LASSO and used quantile regression to analyze a prostate cancer dataset; the results validated the superiority of this combined method.
To the best of our knowledge, no study has applied LASSO-quantile regression to air pollutant concentration prediction. In this paper, the LASSO-QRNN-KDE model is proposed to combine LASSO quantile regression with kernel density estimation. First, the LASSO variable selection method is applied to extract the factors that have significant effects on PM2.5 concentrations from the high-dimensional external influence data, yielding a lower-dimensional dataset. Subsequently, combined with the quantile regression method, a quantile regression model is established to obtain the conditional quantiles of PM2.5 concentrations at different quantile levels. Finally, kernel density estimation is employed to obtain the probability density prediction curves of PM2.5 concentrations for each day in the following month.
The main innovations and contributions of this paper can be divided into two aspects: (1) The LASSO-QRNN-KDE model proposed in this paper reduces prediction uncertainty by reducing the dimensionality of the influencing factors of PM2.5 concentration and selecting those with significant influence. Moreover, it can obtain the distribution of PM2.5 concentrations at different quantile points and from these derive probability density curves. Compared with common point prediction, probability density prediction expands the prediction range and provides a more detailed and informative account of PM2.5 concentrations. (2) This paper demonstrates the superiority of the method through experiments on two real datasets, Beijing and Jinan, China, and six other datasets. Three different scenarios are investigated and compared: prediction without considering external factors, prediction considering external factors without variable selection, and prediction using the LASSO algorithm to filter external factors. The results show that the proposed LASSO-QRNN-KDE model has superior performance in both point prediction and interval prediction. Moreover, when compared against five models commonly used to predict PM2.5 concentrations, namely ARIMA, BP, RBF, CNN-LSTM, and EMD-GRU, the LASSO-QRNN-KDE model achieves higher accuracy.
The rest of the paper is organized as follows: the second part describes the theoretical basis and methodology involved in the model proposed in the paper. The third part presents the constructed neural network model and its evaluation criteria. The fourth part gives an empirical analysis and discussion of the two cases and shows how the data are handled. Finally, the conclusions of the paper are given in the fifth part.

LASSO method
The LASSO method adds a penalty term to the traditional objective function so as to produce a sparse solution, which avoids the overfitting caused by an excess of explanatory variables (He et al. 2021). It works by extracting the key variables from a high-dimensional set of variables. The method effectively addresses multicollinearity among variables and improves the explanatory accuracy of the model, for which it is highly regarded and widely used. In addition, compared with ridge regression, the LASSO method can compress the regression coefficients of redundant predictors exactly to zero, yielding a more parsimonious set of predictors, while avoiding over-compression of important regression coefficients.
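As an illustration of this selection behavior, the following sketch fits a LASSO model with cross-validated penalty to synthetic data in which only three of ten variables matter; the data, dimensions, and variable names are illustrative, not the paper's dataset.

```python
# Hypothetical sketch: LASSO-based variable selection with scikit-learn.
# The data are synthetic: only columns 0, 3, and 7 truly affect the response.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.8 * X[:, 7] + rng.normal(scale=0.5, size=n)

# Standardize first, as the paper assumes; choose the penalty by cross-validation.
Xs = StandardScaler().fit_transform(X)
model = LassoCV(cv=5, random_state=0).fit(Xs, y)

selected = np.flatnonzero(model.coef_ != 0)  # indices of retained variables
print("selected variables:", selected)
```

Coefficients of the redundant columns are driven to exactly zero, so the retained index set is the selected variable subset.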
Given a matrix of independent variables X = (X_1, X_2, ⋯, X_n), where X_j = (x_{1j}, x_{2j}, ⋯, x_{mj})^T, j = 1, 2, ⋯, n, and a dependent variable Y = (y_1, y_2, ⋯, y_m)^T, and assuming the data have been standardized, the linear model between Y and X is established as Eq. (1):

Y = β_0 + β_1 X_1 + β_2 X_2 + ⋯ + β_n X_n + ε,  (1)

where β_0 is the constant term, β_j is the coefficient of each variable, and ε is the random disturbance term. Letting β = (β_1, β_2, ⋯, β_n)^T, the LASSO estimate of β is given by Eq. (2).

where the tuning parameter t satisfies t ≥ 0. If β̂_j^0 is the least squares estimate of β_j, let t_0 = Σ_{j=1}^n |β̂_j^0|. When t < t_0, the absolute value of the LASSO estimate of each regression coefficient β_j is smaller than |β̂_j^0|. As t decreases gradually, the LASSO estimates of some β_j shrink, even to zero. A variable whose estimate is zero can be eliminated, since this indicates that it has little association with the dependent variable Y; in this way the variable selection function is achieved. Commonly used methods for estimating and testing the penalty parameter are the cross-validation criterion, the generalized cross-validation method, and the AIC (Hoerl and Kennard 2000). For solving the LASSO problem, least angle regression (LARS) significantly improves computational efficiency. Suppose A = {j_1, j_2, ⋯, j_k} ⊆ {1, 2, ⋯, n} is the active index set, and define the m × k matrix X_A = (s_{j_1} x_{j_1}, s_{j_2} x_{j_2}, ⋯, s_{j_k} x_{j_k}), where the sign s_j can be determined by Eq. (3).

Subsequently, let G_A = X_A^T X_A and A_A = (I_A^T G_A^{-1} I_A)^{-1/2}, and define the equiangular vector u_A = X_A A_A G_A^{-1} I_A, where I_A is a k-dimensional column vector whose elements are all equal to 1. It can be shown that u_A is a unit vector and that X_A^T u_A = A_A I_A. Establish the LARS estimate as μ̂ = X β̂. Assuming the current estimate is μ̂_A, the vector ĉ = X^T (Y − μ̂_A) gives the current correlation of each variable with the residual. By the previous assumptions, the variables in A are maximally correlated with the residual, a relationship that can be expressed by Eq. (4).
The LARS procedure then iterates as follows: (1) Compute the current correlation vector ĉ = X^T (Y − μ̂_A) and the active set A of variables whose absolute correlation attains the maximum Ĉ = max_j |ĉ_j|. (2) Calculate X_A, A_A, and u_A, taking s_j = sign(ĉ_j) (j ∈ A). (3) Calculate a = X^T u_A, with X the full matrix of independent variables. (4) Calculate the step size γ̂ = min⁺_{j ∈ A^c} {(Ĉ − ĉ_j)/(A_A − a_j), (Ĉ + ĉ_j)/(A_A + a_j)}, where min⁺ means that only the smallest positive value in the set is selected. (5) Set μ̂_{A+} = μ̂_A + γ̂ u_A, and return to step (1) to continue the loop until all the independent variables have entered the model.
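The LARS iteration above is available in scikit-learn, which can also trace the full LASSO solution path via the LARS algorithm. The sketch below is illustrative, on synthetic data rather than the paper's.

```python
# Illustrative use of least angle regression (LARS) to trace the LASSO path;
# the data are synthetic: only columns 1 and 4 carry signal.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + rng.normal(scale=0.1, size=100)

# method="lasso" yields the LASSO solution path via the LARS algorithm.
alphas, active, coefs = lars_path(X, y, method="lasso")

# Variables enter the active set in order of correlation with the residual,
# so the genuinely relevant columns should appear first.
print("entry order:", active)
print("path shape:", coefs.shape)  # (n_features, n_steps)
```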

Quantile regression
Quantile regression is a linear model describing the relationship between the explanatory variables X and the conditional quantiles of the response variable Y. The formula is given in Eq. (5), where Q_τ(Y|X) is the estimate of the response variable Y corresponding to the explanatory variables X = (X_1, X_2, ⋯, X_m) at the τth quantile. β(τ) = [β_0(τ), β_1(τ), ⋯, β_m(τ)]^T is the model parameter vector associated with the quantile τ, which is estimated by solving Eq. (6).
where N is the number of samples and ρ_τ(·) is the loss function of the model, expressed as in Eq. (7): ρ_τ(u) = u(τ − I(u < 0)). Once the estimate of β(τ) is obtained, the estimate of the response variable at the τth quantile can be computed according to Eq. (5).
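The check function of Eq. (7) can be written directly; a minimal sketch (the sample residuals are illustrative):

```python
# Minimal sketch of the quantile (check) loss of Eq. (7):
# rho_tau(u) = u * (tau - I(u < 0)), also known as the pinball loss.
import numpy as np

def check_loss(u, tau):
    """Penalizes under- and over-prediction asymmetrically."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0).astype(float))

# Positive residuals (under-prediction) are weighted by tau,
# negative residuals (over-prediction) by (1 - tau).
print(check_loss([2.0, -2.0], tau=0.9))  # [1.8, 0.2]
```

Minimizing the expectation of this loss over constant predictions yields the τth quantile, which is why it is the natural objective for quantile regression.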

Kernel density estimation
The density function of a random variable can be obtained by parametric or nonparametric estimation methods. However, parametric estimation requires the form of the density function to be assumed in advance, which often leads to a large discrepancy between the actual density function and the estimated one.
The kernel density estimation (KDE) method overcomes the drawbacks of parametric estimation. KDE is used in probability theory to estimate unknown density functions and was proposed by Rosenblatt (1956) and Parzen (1962). The method employs a set of random variables drawn from the same unknown distribution to estimate its density function. KDE does not require knowledge of the prior distribution of the data in question, nor any assumptions about the data; it requires only the input variables, a kernel density estimation function, and an optimal bandwidth to perform the prediction, as in Eq. (8), where k is a non-negative sequence of constants and F(x) is the sample distribution function of the random variable x.
Let x_1, x_2, ⋯, x_n be n sample points drawn independently from the same one-dimensional continuous population with probability density function f(x). The kernel density estimate is given by Eq. (9):

f̂_h(x) = (1/(nh)) Σ_{i=1}^n K((x − x_i)/h),  (9)
where K(·) is the kernel function, which is essentially a weighting function, and h is the bandwidth, whose choice determines the smoothness and accuracy of the estimated density function.
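Eq. (9) is straightforward to evaluate numerically. The sketch below uses a Gaussian kernel and synthetic "concentration" samples purely for illustration; the kernel choice and bandwidth are assumptions, not the paper's settings.

```python
# Sketch of kernel density estimation as in Eq. (9):
# f_hat(x) = (1/(n*h)) * sum_i K((x - x_i) / h), here with a Gaussian kernel.
import numpy as np

def kde(x_grid, samples, h):
    """Evaluate the kernel density estimate on x_grid with bandwidth h."""
    u = (x_grid[:, None] - samples[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    return K.sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(2)
samples = rng.normal(loc=50.0, scale=10.0, size=500)  # synthetic values
grid = np.linspace(0.0, 100.0, 201)
density = kde(grid, samples, h=4.0)

# The estimate integrates to ~1 and peaks near the sample mean.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 2))  # ≈ 1.0
```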

Quantile regression model based on LASSO
The precision of PM2.5 concentration prediction is influenced by various external factors, and the impact of different factors on the prediction results varies in degrees.
According to the description of quantile regression and the LASSO method in the previous section, the quantile regression model is constructed as Eq. (10).
where Y_t is the prediction series and X_t is the sequence of influencing factors; m denotes the number of influencing factors and c indexes the lags of the prediction series, c = 1, 2, ⋯, p. τ is the quantile, τ ∈ (0, 1). The regression coefficient of the cth lag of the prediction target at the τth quantile is α_c(τ), and the regression coefficient of the jth lag of the dth influencing factor at the τth quantile is β_{dj}(τ), d = 1, 2, ⋯, m. q denotes the maximum number of lags of the influencing factors. r(τ) is the intercept term and ε_t is the error term. The predicted value of PM2.5 concentration y_t at the τth quantile can then be obtained from Eq. (11). According to Eqs. (10) and (11), heterogeneity can be observed in r(τ), α_c(τ), and β_{dj}(τ), which indicates differences in the importance of the studied samples and reflects how the selected influencing factors affect the PM2.5 concentration prediction results at different quantile points. The parameters α_c(τ) and β_{dj}(τ) in Eq. (11) can be solved via Eq. (12).
where ρ_τ(u) is the check function, given by Eq. (13).
I(u < 0) is the indicator function of the condition u < 0: when u < 0 holds, I(u < 0) = 1, and otherwise I(u < 0) = 0. To improve computational speed, the interior point algorithm is applied to solve the equation (Portnoy and Koenker 1997).
Since the influence of multiple factors on PM2.5 concentration is considered, this paper applies the LASSO method to filter the large number of influencing factors. An L1 penalty term is added to the original objective function, giving Eq. (14).
where λ is the penalty parameter, whose effect strengthens as its value increases. The two terms of Eq. (14) are both 1-norms: the first is the loss function, which measures how well the quantile regression model fits the data, and ‖β(τ)‖_1 is the penalty function, which achieves variable selection by constraining insignificant coefficients to zero. To facilitate the solution of the LASSO-QRNN model, Eq. (14) can be transformed into Eq. (15).
where s is the constraint parameter corresponding to λ; the constraint on the coefficients tightens as s decreases. The LASSO-QRNN model is used to characterize the distribution of PM2.5 concentrations at different quantiles and to filter out the influencing factors that contribute significantly to its explanation. In this way, the data dimensionality is reduced and the prediction model is optimized.

Probability density prediction of PM2.5 concentration
From the above method, the conditional quantiles of PM2.5 concentration at a range of quantile points can be obtained, and each is a prediction of PM2.5 concentration at the corresponding quantile. The Epanechnikov kernel function has optimal mean square error performance and a small efficiency loss. In this paper, the Epanechnikov kernel takes the predicted PM2.5 concentration at each quantile as its input value, from which the probability density prediction of PM2.5 concentration is obtained. The expression of the Epanechnikov kernel function is given in Eq. (16): K(x) = (3/4)(1 − x²)M(|x| ≤ 1).
where M(·) is an indicator function: when |x| ≤ 1 holds, M(·) takes the value 1, and otherwise 0.
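This step can be sketched directly: the predicted quantiles of one day serve as the sample points of an Epanechnikov KDE. The quantile values and bandwidth below are illustrative placeholders, not model output.

```python
# The Epanechnikov kernel of Eq. (16), K(x) = 0.75 * (1 - x^2) for |x| <= 1,
# applied to turn predicted quantiles into a density estimate.
import numpy as np

def epanechnikov(x):
    x = np.asarray(x, dtype=float)
    return 0.75 * (1.0 - x**2) * (np.abs(x) <= 1.0)

def epanechnikov_kde(x_grid, samples, h):
    u = (x_grid[:, None] - samples[None, :]) / h
    return epanechnikov(u).sum(axis=1) / (len(samples) * h)

# 19 predicted quantiles (tau = 0.05, ..., 0.95) of a hypothetical day.
quantile_preds = np.linspace(30.0, 90.0, 19)
grid = np.linspace(0.0, 120.0, 241)
density = epanechnikov_kde(grid, quantile_preds, h=8.0)
print(round(density.sum() * (grid[1] - grid[0]), 2))  # ≈ 1.0
```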

LASSO-QRNN-KDE neural network model
Neural networks employ nonlinear activation functions and thus offer a better way to handle complex nonlinear objects. In this paper, a quantile regression neural network, based on the single-hidden-layer neural network model proposed by Taylor (Madsen and Nielsen 1993), is used to predict the quantiles of random variables at different quantile levels. The hyperbolic tangent sigmoid function is used as the hidden-layer basis function of the neural network. First, a stable network structure is established between the input and response variables to predict the different quantiles of the target, and then the probability density can be estimated by nonparametric kernel density estimation. The expression of the quantile regression neural network model used in this paper is as follows.
where θ is the quantile, U(θ) = [u_ij(θ)] is the weight matrix to be estimated between the input layer and the hidden layer, and V(θ) = [v_j(θ)] is the connection weight vector between the hidden layer and the output layer. The parameters of Eq. (17) can be estimated by solving the optimization objective function of Eq. (18).
To prevent the neural network model from over-fitting, corresponding penalty terms are added to the objective function. The new objective function is as follows, where λ_1 and λ_2 are the penalty parameters of the model. By determining the optimal penalty parameters, the model can be effectively prevented from over-fitting the training data, which reduces the prediction error and improves the prediction accuracy. The optimal estimates of U(θ) and V(θ) can be obtained by optimizing Eq. (19).
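The QRNN idea can be sketched compactly: a single tanh hidden layer trained by subgradient descent on the pinball loss. This is an illustrative numpy sketch of the mechanism only; the paper's implementation also carries the penalty terms of Eq. (19) and its own training settings, and the data here are synthetic.

```python
# Minimal single-hidden-layer quantile regression neural network (QRNN)
# trained by subgradient descent on the pinball loss (no penalty terms).
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(400, 3))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=400)

def train_qrnn(X, y, tau, hidden=8, lr=0.05, epochs=800):
    n, d = X.shape
    W = rng.normal(scale=0.5, size=(d, hidden)); b = np.zeros(hidden)
    v = rng.normal(scale=0.5, size=hidden);      c = 0.0
    for _ in range(epochs):
        H = np.tanh(X @ W + b)            # hidden layer (tanh basis function)
        pred = H @ v + c
        g = -(tau - (y - pred < 0))       # subgradient of the pinball loss
        v -= lr * (H.T @ g) / n           # backpropagate through the network
        c -= lr * g.mean()
        dH = np.outer(g, v) * (1.0 - H**2)
        W -= lr * (X.T @ dH) / n
        b -= lr * dH.mean(axis=0)
    return lambda Xq: np.tanh(Xq @ W + b) @ v + c

q90 = train_qrnn(X, y, tau=0.9)
coverage = np.mean(y <= q90(X))           # should be close to 0.9
print(round(coverage, 2))
```

At the optimum the mean subgradient vanishes, which forces the fraction of observations below the prediction toward τ, i.e. the network output estimates the conditional τth quantile.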

Evaluation criteria
To test the prediction performance of the proposed method, a comprehensive and reasonable evaluation of the prediction results is required. In this paper, the mean absolute percentage error (MAPE) and the root mean square error (RMSE) are used to assess the accuracy of the model for point prediction; the prediction interval coverage probability (PICP), prediction interval normalized average width (PINAW), and prediction interval normalized average deviation (PINAD) are used to assess the reliability of the model for interval prediction (Shen et al. 2018). MAPE measures the relative magnitude of the prediction error, while RMSE is more sensitive to outliers and highlights large errors. Their expressions are as follows.
where n denotes the number of days of PM2.5 concentration used for prediction, A_i is the actual value, and F_i is the predicted value. The closer the values of MAPE and RMSE are to zero, the better the model performance.
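The two point-prediction criteria can be sketched as follows; the actual and predicted series here are illustrative numbers, not the paper's results.

```python
# The point-prediction criteria MAPE and RMSE as defined above;
# A is the actual series, F the predicted series (values are illustrative).
import numpy as np

def mape(A, F):
    A, F = np.asarray(A, float), np.asarray(F, float)
    return np.mean(np.abs((A - F) / A)) * 100.0

def rmse(A, F):
    A, F = np.asarray(A, float), np.asarray(F, float)
    return np.sqrt(np.mean((A - F) ** 2))

actual    = np.array([40.0, 50.0, 80.0, 120.0])
predicted = np.array([44.0, 45.0, 88.0, 108.0])
print(mape(actual, predicted))  # every error is 10% here, so MAPE = 10.0
print(rmse(actual, predicted))
```

Note how RMSE, unlike MAPE, is dominated by the largest absolute error (12 µg/m³ on the last day), illustrating its sensitivity to outliers.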
PICP gives the probability that the actual value falls within the prediction interval, and its expression is as follows.
where c_i indicates whether the actual value falls into the prediction interval at the given confidence level, and L_i and U_i are the lower and upper bounds of the ith prediction interval. In general, the value of PICP should be greater than or equal to the predetermined confidence level, and a larger PICP value indicates higher reliability of the model. PINAW reflects the sharpness of the probability density prediction; a sharper prediction carries a greater amount of information.
The expression for PINAW is as follows.
where R is the range of the actual values of the predicted series. The smaller the value of PINAW, the narrower the prediction interval, which means the interval carries more information and has higher accuracy.
PINAD reflects the deviation between the actual value and the forecast interval, and its expression is as follows.
where D i is the deviation from the prediction interval for the i th prediction point, and is defined by the following equation.
PINAD calculates the cumulative deviation of the points where the actual values fall outside the prediction interval from the bounds of the prediction interval. When the confidence level of the prediction interval is satisfied, the smaller the PINAD, the better the performance of the interval prediction model.
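The three interval-prediction criteria can be sketched together as follows, following the definitions above; the interval bounds and actual values are illustrative placeholders.

```python
# Sketch of the interval-prediction criteria PICP, PINAW, and PINAD;
# L and U are illustrative lower/upper interval bounds, A the actual values.
import numpy as np

def interval_metrics(A, L, U):
    A, L, U = (np.asarray(v, float) for v in (A, L, U))
    R = A.max() - A.min()                    # range of the actual series
    picp = np.mean((A >= L) & (A <= U))      # coverage probability
    pinaw = np.mean(U - L) / R               # normalized average width
    # D_i: deviation of points falling outside the interval, else 0.
    D = np.where(A < L, L - A, np.where(A > U, A - U, 0.0))
    pinad = D.mean() / R                     # normalized average deviation
    return picp, pinaw, pinad

actual = np.array([50.0, 60.0, 70.0, 90.0])
lower  = np.array([45.0, 55.0, 72.0, 80.0])
upper  = np.array([65.0, 75.0, 85.0, 100.0])
picp, pinaw, pinad = interval_metrics(actual, lower, upper)
print(picp, pinaw, pinad)  # the third point (70 < 72) falls outside
```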

Case 1: Beijing, China
Beijing is located in the north of China and has 12 national automatic air quality monitoring sites. In the twenty-first century, Beijing has developed rapidly and made great achievements in industrial construction as well as in building its transport network. However, along with the economic boom, air quality in Beijing has suffered and falls far short of the national standard and public expectations; smoggy weather in particular has drawn society's attention to fine particulate (PM2.5) pollution, which is especially difficult to prevent and control. In recent years, Beijing has vigorously promoted clean energy and strengthened environmental supervision, allowing pollutant concentrations to drop significantly. However, PM2.5 concentrations remain high in the first quarter, as shown in Fig. 2. To continuously promote environmental improvement, the prediction of PM2.5 is essential. In this case, data for Beijing, China, from January 1, 2018, to January 31, 2019, were selected. The data were obtained from the National Meteorological Data Center of China (http://data.cma.cn/), and all data were normalized in advance. Previous studies show that PM2.5 concentration is influenced not only by air quality but also by meteorological factors. In this paper, five air quality indicators and five meteorological indicators were selected as influencing factors to investigate their implications for PM2.5 concentration prediction.
According to the previous Eq. (11), the sequence of PM2.5 concentrations to be predicted is y_1, y_2, ⋯, y_n, where n is the number of samples to be predicted. The sequence of influencing factors is X_t = (X_t^1, X_t^2, ⋯, X_t^m), with m taken as 10 in this paper, and X_t^d denoting the dth influencing factor. Here, X_t^1 denotes the daily PM10 concentration (μg/m³), X_t^2 the daily SO2 concentration (μg/m³), X_t^3 the daily CO concentration (mg/m³), X_t^4 the daily NO2 concentration (μg/m³), X_t^5 the daily O3 concentration (μg/m³), X_t^6 the daily average temperature (0.1 °C), X_t^7 the daily relative humidity (%), X_t^8 the daily average air pressure (0.1 hPa), X_t^9 the daily average wind speed (0.1 m/s), and X_t^10 the wind direction of the daily maximum wind speed (16 directions). The details of each influencing factor are shown in Table 1. Y_t = (y_{t−1}, y_{t−2}, ⋯, y_{t−c}) denotes the historical PM2.5 concentrations with lag c. Both c and q are taken as 4, and the confidence level is 95%. The statistical descriptions and correlation analysis of the datasets are shown in Appendix 1 (Tables 8 and 9).
Data from January 1, 2018, to December 31, 2018, in Beijing were used as training data, and data from January 1, 2019, to January 31, 2019, were used as test data. With the rolling prediction method, the data of the four days preceding the prediction day were used as input variables to predict the PM2.5 concentration on that day. As shown in Table 1, the resulting training data contain a total of 44 input factors.
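As a rough sketch (not the authors' code), the rolling-window input construction described above could look like the following, assuming the data are held in NumPy arrays. With m = 10 factors and c = 4 lags, each day's input vector has 4 lags of PM2.5 plus 4 lags of each factor, giving the 44 inputs noted above.

```python
import numpy as np

def make_lagged_inputs(y, X, c=4):
    """Build rolling-window inputs: for each day t, stack the previous
    c days of PM2.5 (y) and the previous c days of every influencing
    factor (columns of X). Returns (inputs, targets) aligned so that
    inputs[i] is used to predict targets[i]."""
    n, m = X.shape
    rows = []
    for t in range(c, n):
        lagged_y = y[t - c:t][::-1]           # y_{t-1}, ..., y_{t-c}
        lagged_X = X[t - c:t][::-1].ravel()   # c lags of each of the m factors
        rows.append(np.concatenate([lagged_y, lagged_X]))
    return np.asarray(rows), y[c:]

# With m = 10 and c = 4 this yields 4 + 10*4 = 44 inputs per prediction day.
```

In the rolling scheme, the window then advances one day at a time so that each test day is predicted from the four observed days immediately before it.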
The LASSO method was applied to the training data for variable selection. From Appendix 2 (Table 11), it can be seen that the Cp value is smallest at the 38th step of the LASSO path, so the optimal number of steps for LASSO regression is 38. Eight factors with non-zero coefficients were retained, as marked with an asterisk in Table 1. The selected factors are those with significant explanatory power for PM2.5 concentrations in Beijing. The output shows that PM2.5 concentration in Beijing is influenced by its historical concentration, SO2 concentration, CO concentration, average temperature, wind speed, and wind direction.
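A minimal sketch of this selection step is shown below. The paper chooses the LASSO path step by Mallows' Cp; as an assumption, the sketch substitutes cross-validated penalty selection (scikit-learn's LassoCV), which serves the same purpose of zeroing out uninformative columns. The data here are synthetic and purely illustrative, not the Beijing training set.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for the 44-column training matrix: only four
# columns actually drive the response, the rest are noise.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(365, 44))
coef_true = np.zeros(44)
coef_true[[0, 5, 12, 20]] = [2.0, -1.5, 1.0, 0.8]
y_train = X_train @ coef_true + rng.normal(scale=0.5, size=365)

# Standardize, then let LASSO shrink irrelevant coefficients to zero.
Xs = StandardScaler().fit_transform(X_train)
lasso = LassoCV(cv=5, random_state=0).fit(Xs, y_train)

selected = np.flatnonzero(lasso.coef_)  # indices of retained factors
print("kept", selected.size, "of 44 factors:", selected)
```

Only the retained columns are then passed on to the QRNN, which is what reduces the 44-dimensional input to the eight factors reported for Beijing.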
Three situations are considered in the paper to make the empirical analysis more objective and to verify the validity of the LASSO-QRNN-KDE method for probability density prediction: probability density prediction without considering external factors, probability density prediction considering external factors but without variable selection, and probability density prediction applying LASSO for variable selection. After training, the neural network structures for the three situations are shown in Table 2. The number of training epochs was set to 1000, and the penalty parameters λ1 and λ2 in Eq. (19) were set to 0.5. The quantiles range from 0.05 to 0.95 with an interval of 0.05. With the trained neural network model, the continuous quantiles for each day in January 2019 can be obtained. The obtained PM2.5 concentration quantiles are then used as input variables and imported into the Epanechnikov kernel function to obtain the probability density curve of PM2.5 concentration for each day in January 2019. (Table 1 lists the influencing factors of PM2.5 concentration in Beijing with their codenames and lags.) Without considering external factors, on days 1, 17, and 25 in Fig. 3 the actual values appear at the tail of the probability density curve, and the predicted values deviate significantly from the actual values. With external factors taken into account but not selected, the prediction improves: according to Fig. 4, the predictions that deviated significantly from the actual values are markedly better, and only on day 13 does the actual value lie at the tail of the probability density curve.
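The quantile-to-density step can be sketched as follows. The 19 predicted quantiles for one day (hypothetical values here, standing in for QRNN output) are treated as sample points for an Epanechnikov kernel density estimate; the bandwidth rule is an assumption (a Silverman-style plug-in), since the paper does not state its choice.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical predicted quantiles for one day (levels 0.05 to 0.95,
# step 0.05), standing in for QRNN output; units are μg/m3.
quantiles = np.linspace(35.0, 80.0, 19)

# Silverman-style bandwidth as a simple assumption.
h = 1.06 * quantiles.std() * len(quantiles) ** (-1 / 5)

# Epanechnikov KDE over the quantile points.
kde = KernelDensity(kernel="epanechnikov", bandwidth=h)
kde.fit(quantiles.reshape(-1, 1))

# Evaluate the probability density curve on a concentration grid.
grid = np.linspace(20.0, 100.0, 400).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))

# The mode of the curve serves as the point prediction for that day.
mode = grid[np.argmax(density), 0]
```

Repeating this for every test day yields one probability density curve per day, whose mode is the point prediction compared against the actual value in the figures.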
In the case of using LASSO to select the variables, Appendix 3 (Table 13) gives the predicted values and relative errors of PM2.5 concentrations at the highest point of the probability density curve (the mode) for January 2019 in Beijing under the three situations. Combining this with Fig. 6, it can be seen that the maximum relative errors in the three situations are 15.64%, 13.29%, and 11.42%, and the minimum relative errors are 0.88%, 0.99%, and 0.08%. The performance of the LASSO-QRNN-KDE model is significantly better than in the other two situations.
According to the results in Table 3, in terms of point prediction, the best results are obtained with the LASSO-QRNN-KDE model, where the MAPE is 2.40% and the RMSE is 1.63. In addition, the probability density prediction considering external factors but without variable selection also performs better than the probability density prediction without external factors. In terms of interval prediction, the prediction with external factors has a qualified PICP, and the PICP increases by 2.26% after selecting the influencing factors with LASSO, which indicates that the proposed model yields more reliable prediction intervals. In addition, the proposed model has the smallest PINAW and PINAD, with decreases of 11.45% and 1.49%, respectively, compared to the model without variable selection, which means that the model has higher sharpness and its prediction interval deviates less from the actual values.
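For reference, the evaluation criteria used throughout can be written down compactly. These are the standard definitions of the point metrics (MAPE, RMSE) and interval metrics (PICP, PINAW, PINAD); the normalization by the target range in PINAW and PINAD is the usual convention and is assumed here, as the paper's formulas are not reproduced in this section.

```python
import numpy as np

def mape(y, yhat):
    """Mean absolute percentage error (%)."""
    return 100.0 * np.mean(np.abs((y - yhat) / y))

def rmse(y, yhat):
    """Root mean square error (same units as y)."""
    return np.sqrt(np.mean((y - yhat) ** 2))

def picp(y, lo, hi):
    """Prediction interval coverage probability (%): share of actual
    values falling inside [lo, hi]."""
    return 100.0 * np.mean((y >= lo) & (y <= hi))

def pinaw(y, lo, hi):
    """Prediction interval normalized average width (%), normalized by
    the range of the actual values; lower means sharper intervals."""
    return 100.0 * np.mean(hi - lo) / (y.max() - y.min())

def pinad(y, lo, hi):
    """Prediction interval normalized average deviation (%): average
    distance of actual values that fall outside the interval."""
    dev = np.where(y < lo, lo - y, np.where(y > hi, y - hi, 0.0))
    return 100.0 * np.mean(dev) / (y.max() - y.min())
```

A valid model should keep PICP above the nominal confidence level (95% here) while driving PINAW and PINAD as low as possible, which is the trade-off Table 3 reports.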
In the prediction of PM2.5 concentrations with quantile regression neural networks, the LASSO-QRNN-KDE model outperformed both the prediction without external factors and the prediction with external factors but without variable selection. This indicates that probability density prediction of PM2.5 concentration should adequately consider the influence of the external environment, and that the optimal prediction model can be obtained by filtering among multiple external factors.

Case 2: Jinan, China
Jinan is located in East China and has built 8 national automatic air quality monitoring sites. Jinan used to be a heavy industrial production base, and heavy industrial enterprises made outstanding contributions to its economic development. However, the development of heavy industry brought pollution of the air, water, and other media. In addition, the high topography around Jinan makes its air less mobile, so dust tends to accumulate over the city, which has made Jinan one of the cities with the poorest air quality in China. According to Fig. 7, its annual PM2.5 concentration has a U-shaped distribution. Accurate prediction of PM2.5 concentrations is essential to improve the local environment and to reduce public health risks.
In this case, data from January 1, 2020, to January 31, 2021, were selected from the National Meteorological Data Center of China (http://data.cma.cn/) for Jinan, China, and all data were normalized in advance. Consistent with case 1, m is taken as 10, c and q as 4, and the confidence level is 95%. The statistical descriptions and correlation analysis of the data sets are shown in Appendix 1 (Tables 8 and 10). The data from January 1, 2020, to December 31, 2020, in Jinan were treated as training data, and the data from January 1, 2021, to January 31, 2021, were treated as test data using the rolling prediction method. From Appendix 2 (Table 12), it can be seen that the Cp value is smallest at step 32, so the optimal number of steps for LASSO regression is 32. Ten factors were selected, as marked with an asterisk in Table 4. The results show that the PM2.5 concentration in Jinan is influenced by its historical concentration, PM10 concentration, CO concentration, NO2 concentration, O3 concentration, as well as wind speed and wind direction.
As in case 1, three situations were considered in case 2. After training, the neural network structures for the three situations are shown in Table 5. Figures 8, 9, and 10 show, in turn, the probability density curves of PM2.5 concentrations in Jinan for eight days in January under the three situations. When the prediction is made without considering external factors, the highest-probability point of the probability density curve often deviates from the actual value, and the true value often appears at the tail of the curve. When external factors are taken into account, the prediction results improve considerably; in particular, when the LASSO method is used to select the variables before prediction, the actual value basically appears at the highest point of the probability density curve. In this case as well, the LASSO-QRNN-KDE model performs well in predicting the probability density of PM2.5 concentration.
Appendix 3 (Table 14) presents the predicted values of PM2.5 concentrations at the highest point of the probability density curve (mode) and the relative errors for Jinan in January 2021 under three situations. Figure 11 shows that the maximum relative errors in the three cases are 14.77%, 11.46%, and 6.52%, and the minimum relative errors are 1.02%, 1.53%, and 0.05%. The overall performance using the LASSO-QRNN-KDE model is significantly improved over the other two situations.
The experimental results in Table 6 demonstrate that the LASSO-QRNN-KDE model outperforms the other two situations, from both the point prediction and the interval prediction perspectives. When LASSO is applied for variable screening, the point prediction results are optimal, with a MAPE of 2.36% and an RMSE of 2.07. In addition, the PICP is greater than 95% in all three situations, with the model presented in the paper reaching a high PICP of 99.03%, which means that the LASSO-QRNN-KDE model performs well in interval coverage. Similarly, after screening the influencing factors with LASSO, the PINAW and PINAD are 17.14% and 1.18%, respectively, the best among the three situations. Similar to case 1, the probability density prediction considering external factors outperforms the prediction without them. (Table 4 lists the influencing factors of PM2.5 concentration in Jinan with their codenames and lags.)

Comparative experiments
In order to validate the accuracy and robustness of the LASSO-QRNN-KDE model, several comparison models and datasets from other regions are introduced below. Experiments are conducted to compare the performance of five other commonly used forecasting models with the model proposed in this paper, and to verify the generalization capability of the proposed model using six additional datasets.

Comparisons with other models
To demonstrate the superiority of the LASSO-QRNN-KDE method for PM2.5 concentration prediction, a variety of traditional prediction models and novel research models are introduced for comparison: (1) ARIMA (autoregressive integrated moving average), a classical statistical model for time series forecasting; (2) BP (back propagation neural network), which continuously adjusts the weights and thresholds of the network through training, thereby learning the mapping between input and output data (Zhou and Qu 2017); (3) RBF (radial basis function network), similar to BP but with a good ability to approximate arbitrary nonlinear functions and to express the intrinsic regularities of the system (He et al. 2013); (4) CNN-LSTM (convolutional neural network-long short-term memory), which uses a CNN for feature extraction and then inputs the extracted features into the LSTM structure (Huang and Kuo 2018); (5) EMD-GRU (empirical mode decomposition-gated recurrent unit), which uses EMD to decompose the PM2.5 concentration sequence and then inputs the resulting smooth subsequences together with meteorological features into a GRU neural network for training and prediction (Huang et al. 2021).
The prediction results of each model for PM2.5 concentrations in Beijing are shown in Fig. 12 and for Jinan City in Fig. 13. The prediction errors of the two cases are compared in Fig. 14. In addition, the daily forecast data for each model are shown in Appendix 4 along with the daily forecast relative errors.
Based on the experimental results of each model, the following observations are derived: (1) Compared with the traditional prediction model ARIMA and the classical machine learning methods BP and RBF, the evaluation indexes of the other three models are better, which reflects that networks designed specifically for time series problems outperform typical shallow machine learning methods in time series prediction. Furthermore, for the task of PM2.5 concentration prediction, the combined models outperform the single models. (2) In both cases, the RMSE and MAPE of the LASSO-QRNN-KDE model attain the minimum values. Taking the Beijing case as an example, the RMSE and MAPE of the proposed model decreased by 2.14 and 2.78%, respectively, compared to the better-performing EMD-GRU model.

Validation of the model with other data sets
To validate the generalization capability of the LASSO-QRNN-KDE model, six additional datasets are added in this section, and their statistical descriptions are presented in Appendix 1 (Table 8). The prediction results for each dataset are evaluated with the evaluation criteria proposed in the previous section, and the results are presented in Table 7.
(1) The prediction results on the six additional datasets show that the LASSO-QRNN-KDE model has strong robustness, and its point predictions remain accurate across the datasets. (2) Based on the performance of interval prediction, the PICP of all six datasets is greater than the confidence level, which indicates that the proposed model performs well in interval coverage. In general, the PINAW of the six datasets stays at a low level of around 23%, which denotes that the LASSO-QRNN-KDE model obtains a narrower bandwidth for interval prediction and carries a higher density of predictive information. In addition, the low level of PINAD also validates that the model fits the true values well in the prediction process.

Discussion
In this paper, a new hybrid model is proposed to accurately predict the short-term variability of PM2.5 concentrations. Building on previous studies, LASSO is used to select the factors influencing PM2.5 concentrations, which greatly improves the prediction accuracy of the model. This is similar to the findings of Koen et al. (2020): the LASSO approach plays a major role in improving the interpretive accuracy of the model. In addition, the QRNN model is able to obtain the overall distribution of the response variable conditional on the explanatory variables, which provides more information than the widely used mean regression (Sun et al. 2021). The experimental results above show that, compared with the comparison models, the model proposed in the paper can effectively select sequences that are significant for the prediction target and can obtain the probability distribution of the predicted values.

The two case studies and the six other datasets analyzed in this paper are from more heavily polluted cities in China, for which PM2.5 prediction is more urgent. As the causes of high PM2.5 levels vary from region to region, the LASSO method is used to select the influencing factors and obtain more accurate predictions, as demonstrated in the comparison between case 1 and case 2. After the selection of influencing factors, the performance of the model improved significantly in all respects. The model presented in this paper performs well in predicting datasets from highly polluted areas, with better performance on less volatile datasets. Future research could consider extending the model to predict PM2.5 concentrations at lower levels and improving its performance on more volatile datasets. In addition, better selection methods for influencing factors are a direction for future research.

Conclusion
In this paper, the LASSO-QRNN-KDE model is proposed to predict the probability density of PM2.5 concentrations. Comparative tests were constructed based on datasets from Beijing, Jinan, and six other regions in China. The prediction performance of the model is evaluated using the point prediction metrics MAPE and RMSE and the interval prediction metrics PICP, PINAW, and PINAD, and compared with the control models, leading to the following conclusions.
(1) The model proposed in the paper is applicable to the prediction of PM2.5 concentrations and shows high accuracy and robustness. In the comparison tests with other models and several datasets, the LASSO-QRNN-KDE model obtained the smallest MAPE and RMSE, which could be maintained at around 2% and 2, respectively. On the one hand, the LASSO method is able to reduce the dimensionality of the influencing-factor data set without affecting the prediction performance and is suitable for prediction of PM2.5 concentrations with multiple influencing factors. On the other hand, the KDE model can overcome the shortcomings of parameter estimation and can fit the distribution according to the characteristics of the PM2.5 concentration data, further reducing the prediction error. In addition, the LASSO-QRNN-KDE model can provide decision makers with more accurate prediction information by obtaining narrower prediction intervals with high interval coverage. In tests on several datasets, the prediction intervals obtained deviated from the true values by a small margin, almost always below 3%.

(2) When analyzing the datasets from Beijing and Jinan, China, different influencing factors were screened out, owing to the variability between the regions. It is precisely because the model in this paper can identify this variability that it performs well in predicting PM2.5 concentrations. The model performed better on the Jinan data than on the Beijing data for all indicators, because the Jinan data are more stable than the Beijing data. This is also reflected in the additional tests on six other datasets, with less volatile datasets yielding better predictions, such as the AL dataset.
(3) When the LASSO-QRNN-KDE model is used for prediction, the true value is likely to occur near the highest point of the probability density curve, i.e., the mode. The model is able to obtain a probability density curve of PM2.5 concentrations for each day of the month, producing richer predictions than the single point prediction and interval prediction approaches. By making differentiated predictions for individual areas with high PM2.5 concentrations, the model can provide a meaningful reference for relevant policy makers to effectively reduce and control air pollution and reduce public health risks.

Declarations
Ethics approval Not applicable.
Consent to participate Not applicable.

Consent for publication Not applicable.
Competing interest The authors declare no competing interests.