Dynamic pollution emission prediction method of a combined heat and power system based on the hybrid CNN-LSTM model and attention mechanism

The combined heat and power (CHP) production mode plays an increasingly important role in energy production, but the impact of its pollutant emissions on the natural environment remains difficult to eliminate. Traditional pollutant control relies on post-treatment processes to degrade pollutants after they are generated, and there is little research on controlling pollutant generation at the source. Starting from the source, this paper therefore predicts pollutant emissions with a prediction model so as to support production regulation and avoid excessive emissions. A pollution emission prediction method for CHP systems based on feature engineering and a hybrid deep learning model is proposed. Feature engineering performs multi-step preprocessing on the original data, refines the correlated factors, and removes redundant variables. The hybrid deep learning model takes multi-variable input and combines a convolutional neural network and a long short-term memory network with an attention mechanism. A case study is conducted on an actual collected dataset. The influence of the periodicity of the prediction target on the prediction results is analyzed by season to verify the effectiveness of the hybrid model. The results show that the root mean square error of the proposed method is less than one and is lower than that of the baseline methods, which indicates the advantage of the proposed pollution emission prediction method over the existing methods.


Introduction
Among all environmental pollution sources in China, the power industry is one of the most dominant (Xiao and Liao 2021). Human production activities, such as automobile exhaust emission, power generation in thermal power plants, and fuel combustion in other industrial production, greatly increase the content of NOx and SO2 in the air. According to the preliminary estimation of the United Nations Environment Programme (UNEP) project "Comprehensive Consideration of Environmental Factors in Energy Planning," 70% of atmospheric NOx emissions in China originate from the direct combustion of coal (Zhang 2016), which has a substantial impact on nature.
In recent years, numerous effective measures have been taken in many fields to control anthropogenic pollutant emissions (Meng 2021). For instance, in thermal power generation, the combined heat and power (CHP) production mode (Hyeunguk et al. 2021) can improve resource utilization and reduce environmental impact for the same energy consumption. The Emission Standard of Air Pollutants for Thermal Power Plants (2004), which was implemented in China in 2012, strictly stipulates that the NOx and SO2 emissions of new circulating fluidized bed coal-fired boilers are limited to 100 mg/m³, and certain regions in China have even more stringent requirements. Traditional power plants adopt post-treatment processes to degrade the generated pollutants. However, the power generation capacity of power plants far exceeds the power consumption of users, so excess pollutants are produced, and compliant emission is difficult to guarantee by post-treatment alone. At the source, if future production patterns and pollutant emissions can be predicted in advance from the historical operation of the power plant and the pattern of regional power supply, the current equipment operating conditions can be adjusted to avoid excess coal consumption. Combined with post-treatment, the impact of pollutants on the environment can then be reduced more effectively.
In pollutant emission prediction, physical mechanism analysis has commonly been used to establish a mathematical model between the parameters of the various power plant components and the pollutants. However, establishing a physical model is a great challenge because of the complex structure of CHP equipment and the cumbersome production process (Nakaishi et al. 2021). Considering the nonlinearity and heterogeneity of pollutant emission data in industrial production, data-driven statistical methods can capture this complexity without explicit mechanism modeling and show good performance (Lusis et al. 2017). Pollution emission prediction is a multi-dimensional time series prediction problem, for which neural network models from the field of machine learning achieve excellent results. The long short-term memory network (LSTM) model describes the feature relationships contained in time series well (Zou et al. 2020; Chang et al. 2020) and has been widely used in time series prediction (Zhang and Guo 2020). Yang et al. (2020) used mutual information variable selection and an LSTM to predict the NOx emission from power plant boilers dynamically, which improved the accuracy and robustness of the prediction model. Weng et al. (2020) proposed a NOx emission prediction model based on a deep bidirectional LSTM, which reduced the prediction error by about 5%.
In the field of big industrial data, combining different methods with the LSTM to build hybrid models that extract the temporal features of data has become one of the mainstream trends (Lin and Zhang 2021). For instance, the convolutional neural network (CNN) can extract the deep features of multi-dimensional time series and significantly improve the performance of feature extraction. The attention mechanism (AM) (Zhang et al. 2021a, b, c; Guo and Zhu 2021; Ghaffarian et al. 2021) can optimize the weight distribution of the model and improve training efficiency by increasing the weights of the features that have a stronger impact on the results. Peng et al. (2021) combined the CNN with the LSTM and used their respective advantages to predict the coal storage of power plants. Zhang et al. (2021a, b, c) combined the AM with the LSTM network and proposed a short-term multi-energy load forecasting method for power plants, which was verified on actual collected data.
However, hybrid CNN, LSTM, and AM models have rarely been applied to pollutant emission prediction in power plants. Therefore, combining the respective advantages of these three models, this paper proposes a hybrid model for CHP emission prediction. The main contributions of this study can be summarized as follows: 1) The CNN is used to extract the features of the input data through its powerful feature extraction capability, and the temporal features of the multi-dimensional time series are extracted by the LSTM. 2) The attention mechanism is used to distribute the weights of the time series features, which improves the accuracy of CHP emission prediction. 3) The working and environmental variables of the equipment are filtered and reconstructed by feature engineering, and the seasonal pattern of pollutant emission is analyzed, which improves prediction accuracy and efficiency. 4) Real data from a combined cycle power plant are used to verify the proposed model, which provides a case for the application of artificial intelligence technology in the field of environmental protection and energy utilization.

Data overview
Zhejiang Jiaxing Tongxiang Tayes Environmental Protection Energy Limited Company is a CHP power plant (Fig. 1). It meets the power and heat demand of the surrounding industries and residents. Pulverized coal burns in the boiler to heat water and produce superheated steam. The steam is gathered in the main pipe and distributed to the steam turbines, which generate electricity for the users. The waste heat transmitted by the gas condenser is also supplied to the users as heat so as to make full use of resources. The NOx and SO2 produced by the power plant are controlled by post-treatment processes, such as desulfurization, and then discharged into the atmosphere. The hourly operation data of the CHP units collected from July 2, 2020, to July 9, 2021, are used in this study; the NOx and SO2 data are extracted from the flue gas emission data of the power plant and serve as the prediction targets.
Historical operation data are typically used as the input of a pollution emission prediction model because the prediction targets depend strongly on them. The pollutant emission curve of a thermal power plant reflects daily and weekly production patterns, which are affected by the production equipment of the power plant and by daily activities in the power supply area. As the power generation of a power plant shows a certain periodicity, the amount of discharged pollutants changes regularly. Moreover, owing to seasonal factors, regional power supply differs greatly between the summer and winter periods, and the distribution ranges of the emission values are completely different, which calls for a high-accuracy prediction model. The emission curves of NOx and SO2 from June 4 to June 9 in summer and from November 4 to November 9 in winter are presented in Fig. 2. As shown in Fig. 2, the NOx emission is in the range of 10-30 mg/m³ in summer but less than 3 mg/m³ in winter; the SO2 emission shows smaller differences between the two periods, being higher than 20 mg/m³ in summer and lower than 20 mg/m³ in winter. The daily emission results show that the NOx emission has high periodicity during the daytime but low periodicity at night in both summer and winter, while the SO2 emission is chaotic and irregular. Therefore, SO2 is more challenging to predict than NOx.

Proposed method operation flow
The proposed pollutant emission prediction method mainly includes feature engineering, model establishment, and prediction, as shown in Fig. 3.
Feature engineering uses two methods to preprocess the original data, determine the factors highly correlated with the prediction target, and reduce the data dimension. The data preprocessing includes the Pearson correlation analysis (Zhang et al. 2021a, b, c) and mutual information analysis (Ghosh et al. 2021; Miyashita and Yonezawa 2021), which are respectively used to extract the factors that are linearly and nonlinearly related to NOx and SO2 emission. The hybrid prediction model consists of a CNN and an LSTM, which are combined with the AM. The CNN is used to extract the potential spatial features of the data and simplify the network parameters, the LSTM is used to extract the temporal relationships in the data, and the AM is used to optimize the network weights to achieve better prediction results. The summer data (collected from April to September) are divided into the training and test sets, which are used to train and test the model. After the model has converged, the model parameters are saved; the winter data (collected from October to March) are used as a verification set for model verification and to obtain the final prediction results.

Feature engineering
To help the subsequent neural network mine features more effectively, the original data must be screened and reduced in dimension so that the error caused by irrelevant factors is reduced. In this study, two measures are used to reduce the dimension of the collected data, namely, the Pearson correlation coefficient and the mutual information.

Pearson correlation analysis
First, physical mechanism analysis and a variance test are performed to remove irrelevant variables and variables with little variation. Then, the Pearson correlation coefficient ($\rho_{X,Y}$), shown in Eq. (1), is used to measure the correlation between each parameter and pollutant emission. The Pearson correlation coefficient measures the degree of linear correlation between two variables. Its value lies in the interval [−1, 1]; the greater its absolute value, the stronger the correlation between the two variables.
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} \tag{1}$$
where Cov(X,Y) denotes the covariance of X and Y, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y, respectively.
The Pearson coefficients between each variable and the prediction targets are shown in Table 1. All of them are less than 0.3, indicating that the linear correlation between any single performance parameter of the CHP equipment and pollution emission is very weak (Dancey and Reidy 2017). This shows that, in power production, the complex equipment structure makes the coupling between the parameters a complex nonlinear relationship rather than a simple linear one. Therefore, the nonlinear correlation between the parameters and the targets must be measured with other indicators.
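As a minimal sketch of Eq. (1), the coefficient can be computed directly from its definition. The function name `pearson` and the synthetic inputs below are illustrative assumptions and are not the plant data; the second case shows why a strong but nonlinear dependence can still yield a coefficient near zero.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: Cov(X, Y) / (sigma_X * sigma_Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
    return cov / (x.std() * y.std())                # population standard deviations

# Synthetic illustration (assumed data, not measurements from the plant).
x = np.linspace(-1.0, 1.0, 201)
rho_linear = pearson(x, 2.0 * x + 1.0)  # exactly linear relation
rho_quadratic = pearson(x, x ** 2)      # deterministic but symmetric nonlinear relation
```

Here `rho_linear` equals one up to rounding, while `rho_quadratic` is essentially zero even though the second variable is fully determined by the first, which is the situation that motivates a nonlinear measure.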

Data dimensionality reduction based on mutual information
Mutual information evaluates the amount of information that the occurrence of one event contributes about another event. It is often used in feature engineering to test the nonlinear correlation between variables and is calculated by (Peng et al. 2005):
$$I(x;y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \tag{2}$$
where p(x) and p(y) are the marginal probability distributions of the random variables x and y, respectively, and p(x, y) denotes their joint probability distribution.
As shown in Table 1, the mutual information between each variable and the prediction target is generally higher than 0.9, with a maximum value of 0.970. This indicates that the variables contain a large amount of nonlinear information about the prediction target and that their changes have a strong impact on the prediction results.
Through the above two steps, the ten most relevant indicators are screened out and listed in Table 1, where the five largest values in each column are denoted in bold.
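The mutual information between two continuous series can be estimated in several ways, and the estimator used for Table 1 is not stated. The sketch below assumes a simple histogram estimator of the standard discrete definition, the sum of p(x, y) log[p(x, y) / (p(x) p(y))]; the function name, the bin count, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of I(x; y) in nats (assumed estimator, not the paper's)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()              # joint distribution p(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, bins)
    mask = p_xy > 0                         # skip empty cells to avoid log(0)
    ratio = p_xy[mask] / (p_x @ p_y)[mask]
    return float(np.sum(p_xy[mask] * np.log(ratio)))

# Synthetic illustration (assumed data, not measurements from the plant).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 5000)
mi_dependent = mutual_information(x, x ** 2)                          # nonlinear dependence
mi_independent = mutual_information(x, rng.uniform(-1.0, 1.0, 5000))  # no dependence
```

The raw estimate in nats is not bounded by one, so values such as those reported in Table 1 presumably rely on a normalization or estimator that the text does not specify.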
Because the linear correlation between the variables and the targets is very low, the traditional linear regression model is not suitable for predicting pollutant emission. The nonlinear correlation, however, is very high, which places higher demands on the selection and design of the model. Deep neural networks have strong nonlinear feature extraction ability, so this study builds the pollutant emission prediction model on a neural network.
The input dimension of a deep learning-based model should not be too large; otherwise, the curse of dimensionality will arise, resulting in excessively long training time and difficulty in model fitting. This study selects the five variables with the highest correlation with the target pollutants as the input data, namely $C_E$, $C_C$, $O_2$, $F_W$, and $T_{OS}$. Here, $O_2$ represents the sum of the oxygen content at the left and right inlets of the high-temperature economizer of the three boilers.
In this study, the pollutant emission of the power plant is predicted on an hourly basis; that is, the prediction horizon h is 1. The ith continuous lag prediction target is given by Eq. (3), where $l^i_m$ denotes the ith constituent element at time m, lag represents the number of continuous lag prediction targets, and t is the prediction time.
The vector of the jth sliding window is given by Eq. (4), where n is the sliding window width, and T is the total number of sliding windows, which is calculated by Eq. (5). Multiple input variables and continuous lag targets are reconstructed into two-dimensional (2D) features according to Eq. (6), where T × n denotes the size of a 2D feature map, and the two dimensions correspond to the number of time steps and the dimension of the input features in the LSTM, respectively. In this study, n is set to six, comprising the five pollutant emission influencing factors and the prediction target.
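One way to picture the reconstruction into 2D features is the sliding-window sketch below. It assumes that each sample stacks `lag` consecutive hourly rows of the five influencing factors plus the lagged target (six columns) and that the label is the target `horizon` hours after the last row. The function name, the window length of 40 h (taken from the experimental section), and the toy data are assumptions made for illustration.

```python
import numpy as np

def make_windows(features, target, lag=40, horizon=1):
    """Build (X, y) pairs from hourly series.

    X[k] has shape (lag, n_features + 1): rows k .. k+lag-1 of [features, target].
    y[k] is the target `horizon` steps after the last row of X[k].
    """
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    data = np.column_stack([features, target])
    n_windows = len(data) - lag - horizon + 1
    X = np.stack([data[k:k + lag] for k in range(n_windows)])
    first_label = lag + horizon - 1
    y = target[first_label:first_label + n_windows]
    return X, y

# Toy series: 100 hourly rows of 5 influencing factors and one target.
toy_features = np.arange(500).reshape(100, 5)
toy_target = np.arange(100)
X, y = make_windows(toy_features, toy_target)
```

With 100 rows, a lag of 40, and a horizon of 1, this yields 60 samples of shape 40 × 6, and the first label is the target at hour index 40.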

Data normalization
The statistical characteristics of each factor are calculated. As shown in Table 2, the data distribution intervals of the features are different.
To improve the training efficiency of the neural network and mine the potential features of the data, the original data are normalized to the interval [0, 1], which removes the dominance of features with large magnitudes. In this study, max-min normalization (Ge et al. 2021) is used, as shown in Eq. (7):
$$x^{*} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{7}$$
where x denotes the original data, $x_{\max}$ is the maximum value of the data, $x_{\min}$ is the minimum value, and $x^{*}$ is the normalized value.
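A minimal sketch of the max-min normalization is given below. Fitting the minimum and maximum on the training data only and reusing them for other data is an assumption made here, as the text does not state how the statistics are obtained; the helper names and the small array are hypothetical.

```python
import numpy as np

def fit_min_max(train):
    """Column-wise minimum and maximum of the training data."""
    return train.min(axis=0), train.max(axis=0)

def min_max_scale(x, x_min, x_max):
    """Map x to [0, 1] using (x - x_min) / (x_max - x_min)."""
    return (x - x_min) / (x_max - x_min)

def min_max_invert(x_scaled, x_min, x_max):
    """Map scaled values back to the original units."""
    return x_scaled * (x_max - x_min) + x_min

# Two features with very different ranges (assumed values).
raw = np.array([[10.0, 200.0],
                [20.0, 400.0],
                [30.0, 800.0]])
x_min, x_max = fit_min_max(raw)
scaled = min_max_scale(raw, x_min, x_max)
restored = min_max_invert(scaled, x_min, x_max)
```

Under this assumption, data from another season may fall slightly outside [0, 1] after scaling, and predictions are mapped back to mg/m³ with the inverse transform before the errors are computed.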

1D-CNN
The 1D-CNN is a deep feedforward neural network characterized by local connections and weight sharing. It extracts local spatial information by sliding a convolution kernel over the data matrix, and its pooling layer reduces the number of features and parameters, which improves training efficiency compared with a traditional fully connected layer. A typical CNN is composed of a convolution layer, an activation layer, a pooling layer, and a fully connected layer (Ghorbani and Behzadan 2021).
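The sliding of the kernel and the effect of pooling can be sketched with plain NumPy as follows. The kernel size of 3, the ReLU activation, and the pooling size of 2 are assumptions made for illustration; only the 64 kernels and the 40 × 6 input window come from the experimental section.

```python
import numpy as np

def conv1d(x, kernels, bias):
    """1D convolution along time, 'valid' padding, stride 1.

    x: (T, C_in); kernels: (K, C_in, C_out); bias: (C_out,).
    """
    k = kernels.shape[0]
    t_out = x.shape[0] - k + 1
    out = np.empty((t_out, kernels.shape[2]))
    for t in range(t_out):
        # The same kernel weights are reused at every position (weight sharing).
        out[t] = np.tensordot(x[t:t + k], kernels, axes=([0, 1], [0, 1])) + bias
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool1d(x, size):
    """Non-overlapping max pooling along the time axis."""
    t_out = x.shape[0] // size
    return x[:t_out * size].reshape(t_out, size, x.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
window = rng.random((40, 6))                      # 40 time steps, 6 input features
kernels = rng.normal(scale=0.1, size=(3, 6, 64))  # 64 kernels of assumed size 3
conv_out = relu(conv1d(window, kernels, np.zeros(64)))
features = max_pool1d(conv_out, size=2)
```

With these assumed sizes, the convolution output has 38 time steps and pooling halves it to 19, so the later layers process fewer positions than the raw window contains.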

LSTM
The LSTM is used in this study because it can capture long-term dependencies in time series. The LSTM is an extension of the recurrent neural network (RNN) architecture (Abbasimehr et al. 2020), and its structure is shown in Fig. 4. Each LSTM unit has a memory cell and three gates (Han et al. 2019), namely, the forget, input, and output gates. These internal structures control the flow of information related to the state of the memory cell and enable the LSTM to learn the temporal relationships of long sequences. The operation of the three gates at time step t is given by (Gers et al. 2000; Hochreiter et al. 1997):
$$f_t = \mathrm{sig}\left(W_f x_t + V_f h_{t-1} + b_f\right) \tag{8}$$
$$i_t = \mathrm{sig}\left(W_i x_t + V_i h_{t-1} + b_i\right) \tag{9}$$
$$o_t = \mathrm{sig}\left(W_o x_t + V_o h_{t-1} + b_o\right) \tag{10}$$
where $f_t$, $i_t$, and $o_t$ are the outputs of the forget, input, and output gates, respectively; $x_t$ is the input at time step t, and $h_{t-1}$ is the hidden state at the previous time step; $V_f$, $V_i$, and $V_o$ refer to the recurrent weights of the LSTM; $W_f$, $W_i$, and $W_o$ indicate the weight matrices of the forget gate, input gate, and output gate, respectively; $b_f$, $b_i$, and $b_o$ represent the biases of the three gates, respectively; and "sig" is the sigmoid activation function.
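A single LSTM time step can be sketched as follows. The three gates follow the description above with input weights W, recurrent weights V, and biases b; the candidate state and the cell and hidden state updates are the standard LSTM components, which the text does not spell out and which are included here as an assumption. The parameter dictionary layout and the random weights are illustrative only.

```python
import numpy as np

def sig(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps names such as 'W_f', 'V_f', 'b_f' to arrays."""
    f_t = sig(p["W_f"] @ x_t + p["V_f"] @ h_prev + p["b_f"])  # forget gate
    i_t = sig(p["W_i"] @ x_t + p["V_i"] @ h_prev + p["b_i"])  # input gate
    o_t = sig(p["W_o"] @ x_t + p["V_o"] @ h_prev + p["b_o"])  # output gate
    # Standard candidate state and state updates (assumed, not given in the text).
    c_hat = np.tanh(p["W_c"] @ x_t + p["V_c"] @ h_prev + p["b_c"])
    c_t = f_t * c_prev + i_t * c_hat
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hidden, n_steps = 6, 40, 40   # 40 units as in the experiments
params = {}
for gate in ("f", "i", "o", "c"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_in))
    params[f"V_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    params[f"b_{gate}"] = np.zeros(n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
hidden_states = []
for x_t in rng.random((n_steps, n_in)):
    h, c = lstm_step(x_t, h, c, params)
    hidden_states.append(h)
hidden_states = np.stack(hidden_states)   # one hidden vector per time step
```

The stacked hidden states, one row per time step, are the hidden features that a subsequent attention layer would weight.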

AM
The AM is highly suitable for pollutant emission prediction models involving an LSTM network. The internal correlation between hidden features and the effects of the hidden features in different time steps on the prediction results are difficult to identify using an LSTM network (Wang et al. 2019). The method proposed in this study uses adaptive weighted hidden features to enhance key information and weaken trivial information. The introduction of the AM helps to identify the importance of the LSTM input so as to extract high-level features better.
The block diagram of the AM, inspired by the work of Zang et al. (2021), is shown in Fig. 5, and the data conversion process of the AM is presented in Fig. 6.
First, the correlation between the hidden features of different time steps is examined in each dimension. The score of the jth time step in the dth dimension of the hidden feature is calculated by Eq. (11), where $h_{j,d}$ denotes the dth dimension of the hidden feature at the jth time step; $W_{j,d}$ is the weight vector, which is adjusted during training; the function $f_{sco}(\cdot)$ is implemented using a dense layer (Choi et al. 2018) whose number of neurons equals the number of time steps; and n and T are the dimension and the number of time steps of the hidden features, respectively.
The softmax function, which is defined by Eq. (14), is used to normalize the scores in each dimension so that they sum to one.
Finally, the normalized scores of all dimensions are combined by Eq. (15) to obtain the weight of the corresponding time step. Using Eq. (16), the hidden feature vector is multiplied by this weight and used as the input of the next layer. Here, $M_j$ denotes the weight of the jth time step, and $h^{Att}_j$ is the jth weighted hidden feature vector.
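The three steps (scoring, softmax normalization, and weighting) can be sketched as below. This is one plausible reading rather than the exact formulation: it assumes that the dense scoring layer acts along the time axis with T neurons, that the softmax normalizes over time steps within each dimension, and that the per-dimension scores are averaged to give the time-step weight. The function name and the random inputs are hypothetical.

```python
import numpy as np

def temporal_attention(H, W, b):
    """Weight hidden features by time step.

    H: (T, n) hidden features; W: (T, T) and b: (T,) are the dense-layer
    parameters (assumed shapes). Returns the weighted features and the weights.
    """
    S = W @ H + b[:, None]                    # scores s[j, d], shape (T, n)
    S = S - S.max(axis=0, keepdims=True)      # shift for numerical stability
    A = np.exp(S) / np.exp(S).sum(axis=0, keepdims=True)  # softmax over time steps
    M = A.mean(axis=1)                        # average over dimensions: weight per time step
    H_att = M[:, None] * H                    # scale each time step's feature vector
    return H_att, M

rng = np.random.default_rng(0)
T, n = 40, 40
H = rng.normal(size=(T, n))
W = rng.normal(scale=0.1, size=(T, T))
b = np.zeros(T)
H_att, M = temporal_attention(H, W, b)
```

Under these assumptions the weights `M` are non-negative and sum to one, so they can be drawn directly as a map of how strongly each time step contributes.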

Proposed model structure
The structure of the proposed hybrid CNN-LSTM-AM model is shown in Fig. 7. It has one input layer, one CNN layer, one LSTM layer, one attention layer, one flatten layer, and one output layer; the AM is located in the attention layer. The multi-variable time series data of pollution emissions are fed to the input layer, the potential local features are extracted by convolution in the CNN layer, and the temporal correlation of the target components is extracted by the LSTM layer. The extracted hidden features are transmitted to the attention layer, where they are weighted by the AM before being transmitted to the flatten layer; the flatten layer transforms the output into a one-dimensional vector. Finally, the output layer is used to obtain the prediction value corresponding to each superposition component so as to obtain the final prediction result of each combination.
Fig. 9 The NOx and SO2 emission prediction results in winter

Model structural parameters
To evaluate the performance of the proposed pollution emission prediction model, a case study was conducted using the real power plant data. The NOx and SO2 emission data collected in summer were used as the input data of the two prediction models and were divided into training and test sets at a ratio of 4:1; the data collected in winter were used as a verification set for model testing. The pollution emission prediction model was constructed based on the structure displayed in Fig. 7. Through repeated experiments, the model hyperparameters were determined according to the best prediction performance and are shown in Table 3.
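The seasonal partition and the 4:1 split can be sketched as follows. The sketch assumes gap-free hourly timestamps over the stated collection period and a chronological rather than random split of the summer data; neither detail is specified in the text, and the variable names are illustrative.

```python
import numpy as np

# Hourly timestamps from July 2, 2020 through July 9, 2021 (assumed gap-free).
timestamps = np.arange(np.datetime64("2020-07-02T00", "h"),
                       np.datetime64("2021-07-10T00", "h"))
# Months since 1970-01 modulo 12 gives the calendar month 1..12.
month = timestamps.astype("datetime64[M]").astype(int) % 12 + 1

is_summer = (month >= 4) & (month <= 9)     # April to September
summer_idx = np.flatnonzero(is_summer)
winter_idx = np.flatnonzero(~is_summer)     # October to March: verification set

# Chronological 4:1 split of the summer data (assumed, not stated in the text).
n_train = int(len(summer_idx) * 0.8)
train_idx = summer_idx[:n_train]
test_idx = summer_idx[n_train:]
```

Note that the summer portion of this period is not contiguous (July to September 2020 and April to July 2021), so windows built from it should not span the seasonal gap.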

Performance indicators
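Three evaluation indexes were used to evaluate the performance of the proposed pollution emission prediction model, namely, the mean absolute percentage error (MAPE), root mean square error (RMSE), and mean square error (MSE) (Chen et al. 2019), which are respectively given by Eqs. (15)-(17):
$$e_{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{\hat{L}_i - L_i}{L_i}\right| \times 100\% \tag{15}$$
$$e_{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{L}_i - L_i\right)^2} \tag{16}$$
$$e_{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{L}_i - L_i\right)^2 \tag{17}$$
where n is the total number of samples, $\hat{L}_i$ is the predicted value, and $L_i$ is the true value.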
For the same batch of datasets, the closer the above three indexes are to zero, the better the prediction effect is.
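The three indexes can be computed as in the sketch below, which follows their standard definitions; the function names and the three-point example are illustrative assumptions. The MAPE is undefined when a true value is zero, which matters for emission series that can approach zero.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (true values must be nonzero)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_pred - y_true) / y_true)) * 100.0)

def mse(y_true, y_pred):
    """Mean square error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_pred - y_true) ** 2))

def rmse(y_true, y_pred):
    """Root mean square error, in the units of the target."""
    return float(np.sqrt(mse(y_true, y_pred)))

# Hypothetical true and predicted emissions in mg/m^3.
y_true = [10.0, 20.0, 40.0]
y_pred = [11.0, 18.0, 40.0]
e_mape = mape(y_true, y_pred)   # (0.1 + 0.1 + 0) / 3 * 100
e_mse = mse(y_true, y_pred)     # (1 + 4 + 0) / 3
e_rmse = rmse(y_true, y_pred)   # square root of e_mse
```

Because the RMSE is expressed in the units of the target, an RMSE below one corresponds to a typical error of less than 1 mg/m³ when the errors are computed on the unnormalized values.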

Experimental parameters
The proposed prediction model was trained using backpropagation through time; namely, the neural network was unrolled into a deep network in time order, and the unrolled network was then trained by the error backpropagation (BP) algorithm (Li and Zhu 2021). To control the learning rate of the network and to prevent vanishing gradients and slow convergence, the Adam optimization algorithm (Tang et al. 2021) was used to update the network parameters, and the initial learning rate lr was set to 0.001. Dropout regularization was introduced before the LSTM layer to avoid overfitting, with a retention rate of 0.8. Experimental observation showed that the loss function has almost converged by 50 epochs and no longer declines. Therefore, the training parameters of the neural network were set as follows: the maximum number of epochs was 50, and the mini-batch size was 512.
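The two training ingredients named above, the Adam update and dropout with a retention rate of 0.8, can be sketched as follows. The learning rate and retention rate come from the text; the Adam decay rates and epsilon are the commonly used defaults and are assumptions here, as are the function names and the toy quadratic objective.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; beta1, beta2, and eps are assumed default values."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

def dropout(x, keep_prob, rng):
    """Inverted dropout: keep each unit with probability keep_prob, then rescale."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

# Toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta, m, v = 0.0, 0.0, 0.0
history = []
for t in range(1, 51):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t)
    history.append(theta)

rng = np.random.default_rng(0)
dropped = dropout(np.ones(100_000), keep_prob=0.8, rng=rng)
```

In this toy run each Adam step moves the parameter by roughly the learning rate regardless of the gradient scale, and the rescaling in the dropout function keeps the expected activation unchanged so that no correction is needed at prediction time.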

Performance comparison of different prediction models
The proposed pollution emission prediction model was compared with traditional prediction models, including the K-nearest neighbor (KNN), least squares support vector machine (LS-SVM), LSTM, CNN-LSTM, and LSTM-Attention models. Where applicable, their network parameter settings were consistent with those of the proposed method: the number of LSTM units was 40, and the number of CNN convolution kernels was 64. The remaining parameters of the other models were selected according to the best prediction performance. The prediction time step was 1 h, and the continuous data of the 40 previous hours were used to predict the NOx and SO2 emission levels in the next hour. The results of the above-mentioned models on June 2, 2021 are shown in Fig. 8.
As shown in Fig. 8, the traditional models were not effective in predicting the emission of the thermal power plant. The second panel of Fig. 8 shows that the NOx emission exhibited a certain periodicity and regularity during the day. Machine learning models treat such periodicity as a surface characteristic when capturing data characteristics to predict the final result. However, since the SO2 emission had only weak regularity, the traditional models performed unsatisfactorily on it. In contrast, the proposed CNN-LSTM-AM model could not only learn the shallow regularity of the data but also mine its deep characteristics through the CNN. Therefore, the proposed model predicted better than the traditional models; its prediction results fitted the real curve closely, and its trend was consistent with the real one. The generalization performance of the proposed model (Fig. 9) was further verified using the winter data as a verification set.
Owing to the obvious difference between the distributions of the winter and summer data, a poorly trained model will perform unsatisfactorily on the verification set. As shown in Fig. 9, the KNN and LS-SVM models performed poorly on the winter data, and their performance differed significantly from that on the summer data shown in Fig. 8. Thus, the robustness of these models was low, and they had difficulty adapting to changes in the data. The prediction performance of the LSTM and CNN-LSTM models declined more slowly, and these models showed a certain generalization ability. After the introduction of the AM, the characteristics learned over the same period were more accurate, and the prediction performance improved further. This was because the weight distribution paid more attention to the factors that have a strong impact on the results and ignored irrelevant factors, so the model training efficiency was higher.
The errors of each model are shown in Table 4, where the minimum error value in each column is boldfaced. The $e_{RMSE}$ of the proposed CNN-LSTM-AM in predicting the pollution emissions of the thermal power plant was lower than one and lower than that of the other models. The $e_{MAPE}$ of the proposed model was lower for NOx than for SO2, but both values were lower than those of the traditional models.

Loss function change experiment
To further compare the proposed prediction model with the other deep learning models, the change in the loss function ($e_{MSE}$) during training was compared among the proposed model and the LSTM, CNN-LSTM, and LSTM-Attention models; the results are shown in Fig. 10. The proposed method converged to the minimum value of the loss function in the early stage of training, after only several epochs. The LSTM with the AM converged more slowly than the proposed model, but its curve was still smooth, whereas the loss of the single LSTM and CNN-LSTM models rose in the early stage of training, which indicates overfitting, and their subsequent convergence was also slow. The LSTM did not converge even after 40 epochs, and its training effect was poor. These results indicate that adding the AM to the time series prediction model can effectively mitigate overfitting and improve the training efficiency and the final prediction effect.

Attention layer weight visualization
The weights of the attention layer of the power plant pollution emission prediction model were drawn as a heat map (Fig. 11). In the heat map, brighter and darker parts indicate higher and lower attention of the AM during weight allocation, respectively. As shown in Fig. 11, the difference between the brighter and darker parts increases in the areas close to the prediction time point for both the NOx and SO2 data; thus, the data closer to the prediction point have a more obvious impact on the prediction, which is consistent with intuition.

Conclusions and future work
To solve the problem of accurately predicting CHP pollutant emission, this study proposes a CHP emission prediction model, the hybrid CNN-LSTM-AM. The model is trained with the NOx and SO2 data collected in summer and tested on the data collected in winter, and the results are compared with those of traditional prediction models. Based on the comparison, the following conclusions can be drawn: 1) The hybrid CNN-LSTM-AM can accurately predict future NOx and SO2 emission levels, and its $e_{RMSE}$ is less than 1 mg/m³. 2) Compared with the traditional prediction models based on machine learning and deep learning, the proposed model has higher prediction accuracy and better robustness. 3) The introduction of the attention mechanism and the selection of key variables by feature engineering improve the performance of the prediction model.
This study provides an example of applying artificial intelligence to cleaner energy usage in the power production industry. In follow-up work, the regulation of the working conditions of power plant equipment will be studied on the basis of the prediction results so as to reduce the emission of pollutants.