3.1 Environment and Dataset
The experiments were run on a Dell PC with a Core i7 processor, 32 GB of memory, and the Windows 10 operating system, with Python as the programming language. A simulation comparison test was conducted on sample data collected in 2013 from a 250W gas turbine. The dataset contains 500 samples and 39 variables. To test the calculation errors of the four aforementioned filling methods under different missing rates, data in the middle five rows of the sample data were randomly deleted at missing rates of 25%, 50%, 75%, and 90%, respectively. The invalid data density of the dataset is shown in Fig. 1.
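The random deletion step can be sketched as follows. This is a minimal illustration rather than the procedure used in the experiments: it assumes that cells are deleted uniformly at random across the whole matrix, it uses synthetic Gaussian values in place of the gas turbine data, and the function name `mask_at_rate` and the seed values are hypothetical. Only the Python standard library is used.

```python
import random


def mask_at_rate(data, rate, seed=0):
    """Return a copy of `data` with a fraction `rate` of its cells set to None.

    Assumption: cells are chosen uniformly at random over the whole matrix.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(data), len(data[0])
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    n_missing = round(rate * len(cells))
    masked = [row[:] for row in data]  # copy so the original stays intact
    for r, c in rng.sample(cells, n_missing):
        masked[r][c] = None
    return masked


# Synthetic stand-in with the same shape as the dataset (500 samples, 39 variables).
rng = random.Random(42)
data = [[rng.gauss(0.0, 1.0) for _ in range(39)] for _ in range(500)]

for rate in (0.25, 0.50, 0.75, 0.90):
    masked = mask_at_rate(data, rate)
    n_missing = sum(v is None for row in masked for v in row)
    print(f"rate={rate:.2f}: {n_missing} of {500 * 39} cells missing")
```

Keeping the unmasked copy alongside the masked one is what makes the later error calculation possible, since each imputed value can be compared with the value that was deleted.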
3.2 Results and Comparative Analysis
To assess the accuracy of the interpolation methods, this section introduces four evaluation metrics that measure the difference between the interpolated data and the original data: mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and the coefficient of determination \(R^{2}\). They are defined as follows:
$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|\widehat{y}_{i}-y_{i}\right| \tag{1}$$
$$MSE=\frac{1}{n}\sum_{i=1}^{n}\left(\widehat{y}_{i}-y_{i}\right)^{2} \tag{2}$$
$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\widehat{y}_{i}-y_{i}\right)^{2}} \tag{3}$$
$$R^{2}=1-\frac{\sum_{i=1}^{n}\left(\widehat{y}_{i}-y_{i}\right)^{2}}{\sum_{i=1}^{n}\left(\bar{y}-y_{i}\right)^{2}} \tag{4}$$
Here n is the sample size, \(y_{i}\) is the original value of the i-th sample, \(\widehat{y}_{i}\) is its imputed value, and \(\bar{y}\) is the mean of the original values. The three indicators MAE, MSE, and RMSE measure the deviation between the imputed and actual values, with smaller values generally indicating better performance. Unlike raw residuals, the terms in these metrics cannot cancel each other out by sign, because each term is an absolute or squared error. \(R^{2}\) measures the proportion of the dependent variable's variation that can be explained by the independent variable. It is at most 1 and usually lies in the range 0–1; it becomes negative only when the imputed values fit the original data worse than their mean does. A value close to 1 indicates that the dependent variable is almost completely determined by the independent variable, so a higher value indicates better model fitting performance.
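The four definitions can be checked with a short sketch. The function name `imputation_metrics` and the four-point toy data are illustrative assumptions, not part of the experiments; the code uses only the Python standard library and follows the MAE, MSE, RMSE, and \(R^{2}\) definitions term by term.

```python
import math


def imputation_metrics(y_true, y_imputed):
    """Compute MAE, MSE, RMSE and R^2 between original and imputed values."""
    n = len(y_true)
    errors = [yh - y for yh, y in zip(y_imputed, y_true)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    y_mean = sum(y_true) / n
    ss_res = sum(e * e for e in errors)               # numerator of the R^2 ratio
    ss_tot = sum((y_mean - y) ** 2 for y in y_true)   # denominator of the R^2 ratio
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Toy example (hypothetical values): two of four points are imputed with error 0.5.
y_true = [1.0, 2.0, 3.0, 4.0]
y_imputed = [1.5, 2.0, 2.5, 4.0]
print(imputation_metrics(y_true, y_imputed))
```

For this toy input the errors are 0.5, 0, -0.5, and 0, so MAE is 0.25, MSE is 0.125, RMSE is its square root, and \(R^{2}\) is 1 - 0.5/5 = 0.9. The two nonzero errors have opposite signs and would sum to zero as raw residuals, yet both contribute to every metric.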
Given the operating conditions of gas turbines and the reasons for missing values, the four traditional missing value imputation methods mentioned above (mean value filling, median filling, regression imputation, and KNN filling) are compared with the MI-RandomForest method proposed in this paper, which is listed in Table 2 as multiple imputation. Table 2 shows six evaluation metrics for the five imputation methods: MAE, MSE, RMSE, \(R^{2}\), running time (s), and maximum deviation. The trends of the three indicators MAE, MSE, and RMSE are similar. The MI-RandomForest method produces the smallest values of the three indicators (MAE 0.0022, MSE 0.00141, RMSE 0.0375) and the largest \(R^{2}\) (0.9997, close to 1), while regression imputation produces the largest values of the three indicators (MAE 0.030, MSE 0.198, RMSE 0.445) and the smallest \(R^{2}\) (0.923). Among the imputation methods, median filling has the shortest running time (0.031 s), while multiple imputation has the longest (22.849 s) because the imputation is repeated several times. If the data volume is large, running time should therefore also be used as a criterion for evaluating the algorithm. Multiple imputation has the smallest maximum deviation (0.019), the KNN method has a relatively small one (0.127), and regression imputation has the largest (0.401).
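As an illustration of the two simplest baselines and of the maximum deviation column, consider the following sketch. It rests on stated assumptions: maximum deviation is taken to be the largest absolute difference between an imputed value and the original value at a deleted position, which the text does not define explicitly, and the helper names `fill_column` and `max_deviation` and the five-point column are hypothetical. Only the Python standard library is used.

```python
import statistics


def fill_column(column, strategy="mean"):
    """Replace None entries with the mean or median of the observed entries."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]


def max_deviation(original, masked, filled):
    """Largest absolute error over the positions that were deleted (assumed definition)."""
    return max(abs(f - o) for o, m, f in zip(original, masked, filled) if m is None)


# Toy column (hypothetical values) with one large observation.
original = [1.0, 2.0, 3.0, 4.0, 100.0]
masked = [1.0, None, 3.0, None, 100.0]

for strategy in ("mean", "median"):
    filled = fill_column(masked, strategy)
    print(strategy, filled, max_deviation(original, masked, filled))
```

In this toy column the single large observation pulls the mean fill value far from the deleted entries, whereas the median fill value stays close to them. The example only shows how the two fill rules react to an extreme value; it is not an explanation of the specific values in Table 2.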
This study visualizes the experimental results using the MSE and \(R^{2}\) results as examples, as shown in Figs. 2 and 3, respectively.
Table 2
Current Research on Data Cleaning
Method | MAE | MSE | RMSE | \(R^{2}\) | Time (s) | Maximum deviation |
Mean value filling | 0.019 | 0.047 | 0.216 | 0.993 | 0.045 | 0.299 |
Median filling | 0.016 | 0.036 | 0.189 | 0.993 | 0.031 | 0.288 |
Regression imputation [4] | 0.030 | 0.198 | 0.445 | 0.923 | 2.749 | 0.401 |
KNN method filling [14] | 0.0040 | 0.00568 | 0.0754 | 0.9988 | 0.047 | 0.127 |
Multiple imputation [15] | 0.0022 | 0.00141 | 0.0375 | 0.9997 | 22.849 | 0.019 |
As can be seen from Fig. 2 and Fig. 3, the multiple imputation method, which fills the missing values using the random forest regression algorithm and then fits the data with a random forest model, yields the smallest mean squared error and the best fitting effect. Therefore, multiple imputation is recommended for gas turbine data under complex operating conditions.