Application of PCA-Kmeans method-based BP neural network to the prediction and optimization studies in S ZORB Sulfur Removal Technology

In this paper, models for predicting the gasoline octane number and sulfur content in S ZORB Sulfur Removal Technology (SRT) are established. Principal component analysis (PCA) and the unsupervised K-means clustering algorithm were first integrated to determine the key variables that affect the octane number and sulfur content of the product. With the selected key variables, backpropagation (BP) neural network prediction models of the product octane number and sulfur content were established, trained, and tested. The mean prediction accuracy was 94% for an error tolerance of 0.15 and 99% for an error tolerance of 0.3. In addition to predicting the output of the S ZORB SRT reactor, a multi-variable random walk optimization method was proposed and investigated to reduce the octane loss during desulfurization of fluid catalytic cracking gasoline by more than 30% while keeping the sulfur content relatively stable at less than 5 ppm. The results of the proposed models are reliable and could be applied in industrial practice, benefiting both economic efficiency and environmental protection.


Introduction
At this stage, one-third of the world's commercial gasoline comes from catalytic cracking units, and catalytic cracking gasoline accounts for more than 90% of the sulfur in commercial gasoline [1,2]. The exhaust emissions from the combustion of high-sulfur gasoline seriously damage the atmospheric environment and human health. S ZORB Sulfur Removal Technology, a sorbent-based sulfur-removal method, is widely used in gasoline desulfurization [3,4]. Under suitable pressure, temperature, and hydrogen conditions, adsorbents such as zinc oxide and nickel oxide are used in fluidized bed reactors. The sulfur contained in raw gasoline is adsorbed in the form of metal sulfide to produce gasoline components with low sulfur content (10 ppm) [5][6][7][8]. The S ZORB SRT reactor can produce gasoline with low octane loss and low sulfur content, both of which are important criteria for evaluating a desulfurization process [9]. As of 2019, 32 S ZORB SRT installations had been reported and placed into production in China, which effectively guarantees the domestic supply of clean gasoline.
However, when this technology is used for desulfurization, the octane number of the product inevitably decreases. At present, the octane loss of most S ZORB SRT reactors is 0.5-1.5 units, which results in economic losses and reduces the production of high-grade gasoline. Taking the Chinese gasoline market in October 2020 as an example, the research octane number (RON) values of 92# and 95# car gasoline differ by 3 units and the price difference is 300 yuan/ton, so 1 unit of RON is worth about 100 yuan/ton. Therefore, reducing the octane loss during the S ZORB SRT process is important for enterprises to increase profits and improve energy utilization [10].
Many models have been proposed to explain the principle of the process. With classic chemical process modeling, the loss of octane number could be reduced [9][11][12][13][14]. Because S ZORB SRT technology was proposed relatively recently, many operating procedures and complex related mechanisms remain to be studied, and establishing a model from the reaction mechanism is difficult. However, some scholars have made corresponding contributions. Bezverkhyy et al. used thermogravimetric analysis to explore the reaction kinetics of thiophene on the adsorbent under laboratory conditions and divided the reaction of thiophene on Ni/ZnO into three stages: a rapid adsorption stage, a surface reaction control stage, and a solid-phase diffusion stage. This method gives a mechanistic analysis of the working process of the adsorbent and establishes a model [15]. Schmidt et al. studied the effect of NH3 and HCl on the surface activity of the catalyst [16]. Qiu et al. used X-ray photoelectron spectroscopy and X-ray diffraction to study the performance conversion between the catalyst and the regenerated catalyst [17]. Jia et al. proposed a reactor modeling method on the basis of the process mechanism, which divided FCC gasoline into five lumps and built a reaction kinetic model and an octane number correlation model on that basis [18]. However, chemical process modeling focuses on the mechanism of a certain stage of production or on the content of a certain component, and it tends to overlook the desulfurization process as a whole.
In recent years, data science has received increasing attention, not only because data association methods can comprehensively consider the impact of multiple variables, but also because classification and regression algorithms such as neural networks, support vector machines, and random forests can establish correspondences between high-dimensional spaces and nonlinear multi-variables. Such data-driven methods have played an important role in the chemical industry and can be used to reduce chemical mixture risk, predict potential drug-drug interactions, automate chemical design, and so on [19][20][21][22][23][24]. With regard to the prediction of gasoline components, existing research is based on spectral analysis: after the spectral data of gasoline are obtained, different data fitting algorithms are used to predict the octane number of gasoline [25][26][27][28]. However, no paper has proposed a method to obtain the gasoline components by establishing a model of the S ZORB SRT process.
Given the lack of a mathematical model for the desulfurization of FCC gasoline, previous optimization of gasoline desulfurization used the controlled variable method to study the variables that may affect the product index. For example, Xiong et al. studied the effects of space velocity, adsorbent bed temperature, adsorbent roasting temperature, and roasting time on the desulfurization rate of gasoline [29]. However, this method cannot optimize multiple variables at the same time and ignores the interconnection among variables. Therefore, in this work prediction models are established for the sulfur content and octane number of the S ZORB SRT reaction, and multiple variables are optimized simultaneously through a random walk iterative optimization algorithm. The data used to model the S ZORB SRT reaction contain more than 300 variables, including raw material properties, catalyst properties, and process control variables of the unit. Directly using the raw data for modeling leads to multicollinearity and other spurious correlations among the model variables, which harms the generalization ability. Therefore, dimensionality reduction is necessary to extract the key variables. Common methods include principal component analysis (PCA), clustering, and manifold learning [30][31][32][33]. In this paper, PCA is used to reduce the dimensionality of the data, and the K-means algorithm is used to cluster the variables. Because the K-means clustering method is sensitive to the initial positions of the cluster centers, which determine the final clustering result, the PCA result is passed to the K-means algorithm as the initial positions of the cluster centers.
In this paper, a PCA-Kmeans method-based BP neural network model of the S ZORB SRT reactor is established. Based on the model, an optimization method for the operating variables is proposed to reduce the product octane loss by more than 30% while keeping the product sulfur content below 5 ppm. First, PCA and the K-means clustering algorithm are used to select representative key variables from the many physical and operating variables of the S ZORB SRT process. Then, the key variables are used as inputs to construct a neural network prediction model, and the performance of the model is tested. Finally, based on the prediction models of the S ZORB product sulfur content and octane number, a multivariate optimization method that limits the sulfur content and reduces the octane number loss is proposed. The algorithm can improve the octane number of desulfurized gasoline.

Modeling Analysis of the S ZORB SRT
This article takes the S ZORB SRT reactor of a petrochemical company in China as an example and establishes a model from data samples collected from the reactor. The data were collected during continuous operation of the reactor from April 2017 to May 2020. Each data sample includes 354 operating variables, 7 raw material properties (sulfur content, octane number, saturated hydrocarbons, olefins, aromatics, bromine number, and density), product property variables (sulfur content and octane number), spent adsorbent variables, and regenerated adsorbent variables. Except for the product property variables, which are the prediction targets of the model, all other variables affect the product of the reactor.
PCA was first used to process the original independent variable matrix and output the principal components. Then, K-means clustering was performed with the principal components obtained in the previous step as the initial cluster centers, and the variable closest to the cluster center in each cluster was selected as a key variable. With these key variables, BP neural network models were established to predict the sulfur content and octane number of the desulfurized gasoline in the S ZORB SRT reactor.

Extraction of principal components by PCA
PCA was first introduced by K. Pearson for non-random variables; the amount of information reflected by correlated variables overlaps to a certain extent [34]. Using linear algebra techniques on continuous attributes, new attributes (principal components) can be obtained from the data [35,36]. These attributes are linear combinations of the original attributes and are mutually orthogonal, without overlapping information. Therefore, PCA is often used to reduce dimensionality [37]. The variables of the S ZORB SRT reactor can be divided into raw material property variables, adsorbent property variables, and operating variables. This classification is based on the fact that changing raw material properties and adsorbent properties during optimization is more complicated than changing operating variables. Therefore, this article classifies all variables into three categories: raw material properties, adsorbent properties, and operating variables. The original data are preprocessed as follows: (a) eliminating abnormal values based on the range of each operating variable; (b) using the Pauta criterion (the 3σ rule) to remove outliers from the data; (c) replacing missing values with the average of the data in the 2 h before and after the missing point; (d) deleting variables with many missing values. Preprocessing removed eight operating variables with many missing values. Then, the PCA method is used to process the three types of variables.
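Preprocessing steps (b) and (c) can be illustrated with a minimal sketch. It assumes that the Pauta criterion is applied per variable as a 3σ test around the mean and that the "2 h before and after" window is expressed as a number of neighboring samples; the function names, the `window` parameter, and the use of `None` to mark missing readings are illustrative assumptions, not details given in the paper.

```python
import statistics

def pauta_filter(values, k=3.0):
    """Mark readings farther than k standard deviations from the mean as missing (None)."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    std = statistics.stdev(present)
    return [v if v is not None and abs(v - mean) <= k * std else None
            for v in values]

def fill_with_neighbors(values, window=2):
    """Fill each missing reading with the mean of the valid readings
    within `window` samples before and after it (assumed sampling grid)."""
    filled = list(values)
    for i, v in enumerate(values):
        if v is None:
            lo, hi = max(0, i - window), min(len(values), i + window + 1)
            neighbors = [x for x in values[lo:hi] if x is not None]
            if neighbors:  # leave the gap if no valid neighbor exists
                filled[i] = statistics.fmean(neighbors)
    return filled

# Hypothetical series of one operating variable with a single spike.
series = [10.0] * 10 + [100.0] + [10.0] * 10
cleaned = fill_with_neighbors(pauta_filter(series))
```

Note that the 3σ test only flags a point when the series is long enough for a single reading to sit more than three standard deviations from the mean, so it is best suited to long operating records such as the one described here.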
The number of principal components is taken as the smallest number for which the cumulative contribution rate exceeds 0.9 for the raw material properties and adsorbent properties and 0.85 for the operating variables. The variance contribution rate of each component is shown in Fig. 1, and the calculation results are shown in Table 1.
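The selection rule based on the cumulative contribution rate can be sketched as follows. This is a minimal illustration that assumes the variables are standardized before PCA (so the eigenvalues are those of the correlation matrix) and that constant columns have already been removed; the function name `components_for_threshold` is hypothetical.

```python
import numpy as np

def components_for_threshold(X, threshold):
    """Return (k, ratios): the smallest number of principal components whose
    cumulative variance contribution exceeds `threshold`, and all contribution rates."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each variable
    cov = np.cov(Xs, rowvar=False)                      # correlation matrix of X
    eigvals = np.linalg.eigvalsh(cov)[::-1]             # eigenvalues, descending
    eigvals = np.clip(eigvals, 0.0, None)               # guard against tiny negatives
    ratios = eigvals / eigvals.sum()                    # variance contribution rates
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, threshold, side="right")) + 1
    return min(k, len(ratios)), ratios
```

Under these assumptions, the function would be called with a threshold of 0.9 for the raw material and adsorbent property blocks and 0.85 for the operating variable block.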
As shown in Table 1, the number of principal components of the raw material properties and adsorbent properties amounts to more than 75% of the number of variables, whereas the number of principal components of the operating variables amounts to only 10% of the number of variables. This result is consistent with the actual situation: the raw material properties are only weakly correlated with one another, whereas the operating variables are highly correlated, for example the input pressure and output pressure of a vessel.

K-means clustering
The K-means clustering algorithm is a partition-based clustering algorithm that divides the data into meaningful groups (clusters) according to the relationships and information of the objects described by the data [38][39][40]. The goal of the division is that objects within a group are similar (related) and objects in different groups are dissimilar (unrelated). Therefore, the K-means clustering algorithm is often used in data processing tasks such as feature extraction and sample classification [41]. The key to the K-means clustering algorithm is to determine the K value. Here, the principal components obtained by PCA are used as the initial cluster centers of the K-means algorithm [42]. The theoretical basis for this approach is that, with the principal components as the initial cluster centers, the correlation between clusters is weak and the correlation between the variables within a cluster is strong. The information in the original variable matrix can therefore be represented by the clustering results.
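One possible reading of this PCA-initialized variable clustering is sketched below. It assumes that each variable is represented by its standardized column scaled to unit length, that the initial centers are the unit-length score vectors of the first K principal components (with their signs oriented toward the most strongly correlated variable), and that the key variable of a cluster is the member closest to the final center. These choices and the function name `pca_kmeans_key_variables` are assumptions for illustration; the paper does not specify them.

```python
import numpy as np

def pca_kmeans_key_variables(X, k, n_iter=100):
    """Cluster the columns (variables) of X with K-means started from the
    first k principal components; return (labels, key_idx, sse)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Columns of U are the unit-length principal component score vectors.
    U, _, _ = np.linalg.svd(Xs, full_matrices=False)
    points = (Xs / np.linalg.norm(Xs, axis=0)).T        # one unit-length row per variable
    centers = U[:, :k].T.copy()
    for j in range(k):                                   # resolve the sign ambiguity of PCA
        corr = points @ centers[j]
        if corr[np.abs(corr).argmax()] < 0:
            centers[j] = -centers[j]
    labels = np.full(points.shape[0], -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every variable to every center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                        # assignments are stable
        labels = new_labels
        for j in range(k):
            members = points[labels == j]
            if len(members):                             # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Key variable of each non-empty cluster: the member closest to its center.
    key_idx = [int(np.where(labels == j)[0][d2[labels == j, j].argmin()])
               for j in range(k) if np.any(labels == j)]
    sse = float(d2[np.arange(len(labels)), labels].sum())
    return labels, key_idx, sse
```

Starting from the principal components makes the result deterministic for a given data set, which is the practical benefit of avoiding random initial centers.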
In the clustering process, the initial cluster centers are the 30 principal components obtained by PCA, the cluster proximity measure is the Euclidean distance, and the sum of squared errors (SSE) is used as the objective function of the clustering. The objective function is given in Eq. (1).
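Eq. (1) is not reproduced in this text. For reference, the standard form of the SSE objective for K-means with Euclidean distance is given below; the symbols are assumed here and may differ from the notation of the original equation: K is the number of clusters (30 in this work), C_i is the i-th cluster, c_i is its center, and x is a clustered object.

```latex
\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - c_i \rVert_2^{2},
\qquad
c_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
```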
The K-means clustering algorithm is used to cluster the variables, and partial clustering results are shown in Fig. 2 and Table 2. In this way, PCA integrated with K-means was used to obtain key variables that are representative and contain most of the information in the original variable matrix. The key variables among the raw material properties include the hydrocarbon content, the octane number, and so on. The spent-catalyst variables are the key variables among the catalyst properties. This finding may indicate that the spent catalysts are directly related to the process of adsorbing sulfur compounds at this stage. The key variables among the operating variables are space velocity, temperature, pressure, and hydrogen concentration, which is consistent with the S ZORB SRT process. In addition, the correlation among the 30 key variables was analyzed, and the results are shown in Fig. 3. Lighter colors between the selected key variables indicate weaker correlation; a weak correlation means that the differences between clusters are evident and that the clustering is effective.

BP neural network model
The above-mentioned algorithm is used to obtain the key variables, with which the prediction model is established. The Pearson correlation coefficient between the sulfur content and octane number of the product is 0.208 [43]; therefore, these two properties are weakly correlated and can be treated as two separate output variables [44][45][46].
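For reference, the Pearson correlation coefficient between two series can be computed as in the following sketch. The function name `pearson_r` and the sample values are illustrative only and are not taken from the plant data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical product sulfur contents (ppm) and octane numbers.
sulfur = [3.2, 3.1, 3.4, 3.2, 3.3]
ron = [88.9, 89.4, 88.7, 89.6, 89.0]
r = pearson_r(sulfur, ron)
```

An absolute value of r close to 0, as reported for the two product properties, supports modeling them with separate networks.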
The initial training parameters of the network are defined as follows. The number of neurons in the input layer is 30, and the number of neurons in the hidden layer is determined to be 18 through empirical formulas and multiple tests, so a neural network with a 30-18-1 topology is obtained (as shown in Fig. 4). The transfer function of the hidden layer is logistic, and the transfer function of the output layer is purelin (linear). The initial learning rate of the neural network is 0.001, and the learning rate is adapted during training. The minimum convergence error of the training target is 0.001, and the maximum number of training epochs is 1000. The data were divided into an 80% training set and a 20% test set.
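The 30-18-1 structure described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it uses plain batch gradient descent with a fixed learning rate instead of the adaptive learning-rate mode mentioned in the text, and the weight initialization, the function names `train_bp` and `predict`, and the synthetic data are all assumptions.

```python
import numpy as np

def train_bp(X, y, hidden=18, lr=0.001, epochs=1000, goal=1e-3, seed=0):
    """Train an n_in-hidden-1 network (logistic hidden layer, linear output)
    by batch gradient descent on the mean squared error."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W1 = rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 1))
    b2 = np.zeros(1)
    target = y.reshape(-1, 1)
    mse = float("inf")
    for _ in range(epochs):
        H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))   # logistic hidden layer
        out = H @ W2 + b2                           # linear (purelin) output
        err = out - target
        mse = float(np.mean(err ** 2))
        if mse <= goal:                             # training goal reached
            break
        # Backpropagate the gradient of the mean squared error.
        g_out = 2.0 * err / len(X)
        g_hid = (g_out @ W2.T) * H * (1.0 - H)
        W2 -= lr * (H.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_hid)
        b1 -= lr * g_hid.sum(axis=0)
    return (W1, b1, W2, b2), mse                    # mse is the last evaluated training error

def predict(params, X):
    W1, b1, W2, b2 = params
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))
    return (H @ W2 + b2).ravel()

# Synthetic stand-in for the 30 standardized key variables and one target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1]
split = int(0.8 * len(X))                           # 80% training, 20% test
params, train_mse = train_bp(X[:split], y[:split], lr=0.05)
y_test_pred = predict(params, X[split:])
```

The usage example raises the fixed learning rate to 0.05 because this sketch has no adaptive schedule; with the initial rate of 0.001 held constant, 1000 epochs of plain gradient descent would make little progress.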
A prediction is defined as correct if the difference between the predicted result and the actual value is less than 0.15. The prediction accuracy rate is the ratio of the number of correctly predicted samples in the test set to the total number of samples in the test set. After many parameter adjustments, the final prediction results for RON and sulfur content are shown in Figs. 5 and 6, and the performance of the neural network is shown in Table 3.
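The accuracy criterion can be written as a short function. This is a minimal sketch with a hypothetical function name and made-up sample values, shown only to make the definition concrete.

```python
def accuracy_within(predicted, actual, tol):
    """Fraction of samples whose absolute prediction error is less than `tol`."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) < tol)
    return hits / len(actual)

# Hypothetical predicted and measured RON values for four test samples.
predicted = [89.0, 88.5, 90.2, 87.9]
actual = [89.1, 88.9, 90.0, 87.9]
acc_015 = accuracy_within(predicted, actual, 0.15)   # tolerance used for the 94% figure
acc_030 = accuracy_within(predicted, actual, 0.30)   # tolerance used for the 99% figure
```

Reporting the rate at two tolerances, as done in this paper, shows how quickly the error distribution tightens around zero.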
As shown in Figs. 5 and 6 and Table 3, the model prediction results almost coincide with the actual results. The average accuracy rate reaches 94% when the error does not exceed 0.15 and 99% when the error does not exceed 0.3. The model is therefore accurate enough to be used in the subsequent optimization. In addition, the octane number fluctuates by about 4 units, whereas the sulfur content fluctuates only slightly and is mostly stable at approximately 3.2 ppm. This result indicates that the octane number is easily affected by the input variables whereas the sulfur content is robust, which provides the basis for the following optimization method.

Model-based Optimization Method of Key Variables
In actual production, we hope to reduce the octane number loss and the sulfur content to increase economic benefits. Hence, the optimization goal is to use the above-mentioned model to optimize the operating variables so that the RON loss is reduced by more than 30% while the sulfur content stays relatively stable at less than 5 ppm [47,48].
Considering that the optimization of multiple variables often requires considerable computational effort and that the sulfur content prediction model is remarkably robust, a simple random walk iteration method is proposed to iterate the variables [49]. In each iteration, the distance between every manipulated variable and its boundary is determined. If a variable is too close to a boundary, its iteration direction must point away from that boundary. Otherwise, the variable takes a random positive or negative iteration direction. The iteration results are substituted into the trained neural network model to obtain the sulfur content and octane number. If the optimization goal is met or the number of iterations reaches the defined limit, the optimization ends and the iteration stops. If the optimization goal is not met, the procedure returns to the iteration step. The calculation and iteration processes are shown in Figure 7 and Table 4. For the two samples (No. 133 and No. 285), the sulfur content increases slightly after optimization but remains below 5 ppm. The RON loss is reduced from 1.3 to 0.8 and from 1.1 to 0.7, that is, by 38% and 36%, respectively, so the optimization goal is achieved. Figures 9 and 10 show how the sulfur content and octane number of the two samples change with some of the variables. Each variable randomly increases or decreases by its iteration step within the range of the variable, and the model result changes accordingly.
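The iteration logic described above can be sketched as follows. This is a minimal illustration under stated assumptions: `model` is a hypothetical surrogate standing in for the trained networks and returns the RON loss and sulfur content for a vector of operating variables, "too close to the boundary" is taken to mean closer than one step, each variable range is assumed to be at least two steps wide, and the step sizes, bounds, and iteration limit are placeholders rather than the values used in the paper.

```python
import random

def random_walk_optimize(model, x0, bounds, steps, ron_loss0,
                         target_cut=0.30, sulfur_max=5.0, max_iter=500, seed=0):
    """Random walk over the operating variables until the RON loss is cut by more
    than `target_cut` with sulfur below `sulfur_max`, or `max_iter` is reached.
    Returns (x, ron_loss, sulfur, iterations, goal_met)."""
    rng = random.Random(seed)
    x = list(x0)
    ron_loss, sulfur = model(x)
    for it in range(1, max_iter + 1):
        for i, (lo, hi) in enumerate(bounds):
            if x[i] - lo < steps[i]:
                direction = 1                      # too close to the lower bound: move up
            elif hi - x[i] < steps[i]:
                direction = -1                     # too close to the upper bound: move down
            else:
                direction = rng.choice((-1, 1))    # otherwise pick a random direction
            x[i] += direction * steps[i]
        ron_loss, sulfur = model(x)                # evaluate the surrogate prediction model
        if ron_loss < (1.0 - target_cut) * ron_loss0 and sulfur < sulfur_max:
            return x, ron_loss, sulfur, it, True
    return x, ron_loss, sulfur, max_iter, False

def toy_model(x):
    """Hypothetical surrogate: returns (RON loss, sulfur content in ppm)."""
    return 1.3 - 0.5 * x[0] + 0.2 * x[1], 3.2 + 0.5 * x[0]

start = [0.5, 0.5]
baseline_loss, _ = toy_model(start)
result = random_walk_optimize(toy_model, start, [(0.0, 1.0), (0.0, 1.0)],
                              [0.1, 0.1], ron_loss0=baseline_loss)
```

Because the walk accepts every step, it can fail to reach the goal within the iteration limit, which is consistent with the observation that some samples cannot meet the 30% target.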
Based on the above-mentioned ideas, the operating variables are optimized, and most of the samples reach the optimization goal after iteration. Eighteen of the 325 samples cannot reduce the octane loss by more than 30%, which may be due to the nature of their raw materials and adsorbent. In general, the model established in this paper can complete the optimal design.

Results and Discussion
The algorithm first selects 30 principal components from the 362 original variables through PCA and uses the principal components as the initial cluster centers of the K-means clustering algorithm to cluster the original variables. The variables closest to the cluster centers are selected as the 30 key variables. Among them, the number of raw material property variables and catalyst property variables is reduced by only 25%, whereas the number of operating variables is reduced by 90%. This indicates that the raw material and catalyst property variables carry less redundant information than the desulfurization operating variables and have a great impact on the results.
Subsequently, the neural network prediction models of the product sulfur content and octane number are determined. The average accuracy rate of the model prediction is 94% when the error is within 0.15 and 99% when the error is within 0.3. This result indicates that the model can accurately predict the sulfur content and octane number of desulfurized gasoline.
Finally, using the above-mentioned model, an optimization algorithm based on random walk iteration of the operating variables is proposed, which can reduce the octane loss by more than 30% under the premise that the product sulfur content is less than 5 ppm. When this algorithm is applied to data sample No. 133, the optimized operating variables increase the sulfur content from 3.2 to 4.02 and reduce the octane loss from 1.3 to 0.8, corresponding to a 38% reduction. When it is applied to data sample No. 285, the optimized operating variables increase the sulfur content from 4.03 to 4.37 and reduce the octane loss from 1.1 to 0.7, corresponding to a 36% reduction.
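As a check of the reported reductions, the relative decrease in octane loss for the two samples works out as follows (values as reported above; the text rounds the results down to 38% and 36%):

```latex
\frac{1.3 - 0.8}{1.3} \approx 0.385,
\qquad
\frac{1.1 - 0.7}{1.1} \approx 0.364
```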

Conclusion
A PCA-Kmeans backpropagation neural network prediction model that integrates PCA and the K-means clustering algorithm was established to predict the octane loss and sulfur content in the S ZORB SRT process. The model has high prediction accuracy and can be used for the optimization problem of reducing octane loss. Then, a random walk iteration algorithm was proposed to reduce the octane loss during desulfurization in the S ZORB SRT reactor by more than 30% while keeping the sulfur content relatively stable at less than 5 ppm. The results of the proposed model are reliable and could be applied in industrial practice, benefiting both economic efficiency and environmental protection. In addition, the model is suitable for other complex chemical process scenarios with multiple variables, nonlinearities, and strong coupling, providing new ideas for model establishment and target optimization in chemical processes.

Declarations
Availability of data and materials
The original data is submitted as an attachment.

Competing interests
No competing interests.

Funding
No funding.

Authors' contributions
The first author, Xiaoyi Geng, completed most of the experiments and the paper. The two corresponding authors, Dr. Xin Wang and Dr. Guangcheng Zhang, provided support for the theoretical basis and design of the PCA-Kmeans algorithm. The second author, Bingyan Song, and the third author, Yu Chen, wrote part of the paper and summarized the experiment.