Outlier Analysis Results
Measurements with absolute standard scores greater than 3 are usually considered outliers. In this study, after calculating the standard scores of the physical and chemical indicators, the scores were sorted in descending order. As Table 1 shows, the hardness values contained no outliers, the mass values contained one, the vitamin C content contained two, and the soluble solids content contained three. In addition, across the 300 samples both the vitamin C and soluble solids contents were normally distributed: the mean vitamin C content was 35.0209 with a standard deviation of 5.7870, and the mean soluble solids content was 13.3479 with a standard deviation of 1.8589.
Table 1
Results of standard score calculations for the physical and chemical indicators
Soluble solids content standard score | Vitamin C content standard score | Mass standard score | Hardness standard score |
4.0132 | 4.5866 | 3.2923 | 2.7107 |
3.8516 | 3.0540 | 2.1905 | 2.7107 |
3.1336 | 2.9633 | 2.1497 | 2.2051 |
2.8464 | 2.7521 | 1.9457 | 2.1964 |
2.7746 | 2.3906 | 1.7824 | 2.0063 |
Because both the mean and the standard deviation are sensitive to outliers, a small number of anomalous values is expected. Since the number of outliers is negligible relative to the sample size and the sample data conform to a normal distribution, no outliers were excluded.
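The screening rule described above can be sketched in a few lines of numpy. The data below are a synthetic stand-in (99 typical readings plus one planted extreme value), not the study's measurements.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute standard score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=0)
    return values[np.abs(z) > threshold]

# Illustrative data: 99 readings near the reported vitamin C mean/std,
# plus one extreme value that should be flagged.
rng = np.random.default_rng(0)
data = np.append(rng.normal(35.0, 5.8, 99), 80.0)
print(zscore_outliers(data))
```

Sorting the absolute scores in descending order, as the study does, then amounts to `np.sort(np.abs(z))[::-1]`.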
One-Dimensional Feature Prediction Results
Results of One-Dimensional Feature Correlation Analysis
Pearson correlation analysis was performed between the measured hardness and mass values of the blueberries and the vitamin C and soluble solids contents, respectively. The correlation coefficients of hardness with vitamin C content and soluble solids content were 0.704 and −0.639, respectively, both strong correlations; the corresponding coefficients for mass were 0.334 and −0.308, both weak correlations. Accordingly, prediction models for the vitamin C and soluble solids contents were established from hardness and from mass separately.
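This correlation screening can be reproduced with `numpy.corrcoef`. The eight sample pairs below are invented for illustration; the study's real coefficients come from its 300 measured samples.

```python
import numpy as np

# Illustrative hardness / vitamin C pairs (roughly linearly related).
hardness  = np.array([4.1, 3.8, 5.0, 4.6, 3.5, 4.9, 4.2, 3.9])
vitamin_c = np.array([33.0, 30.5, 39.8, 36.9, 28.7, 38.6, 34.0, 31.2])

# Pearson correlation coefficient between the two indicators.
r = np.corrcoef(hardness, vitamin_c)[0, 1]
print(f"r = {r:.3f}")
```

A coefficient with |r| above roughly 0.6 would be treated as a strong correlation under the convention used here, and one below roughly 0.4 as weak.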
Model Parameter Settings
The parameters of each base learner were optimised by a combination of grid search and manual tuning to improve training speed and accuracy. Ensemble models with many parameters, such as ADA, LGBM, XGB, GBRT, and RF, used grid search over a mesh defined by each parameter's range, while the MLP, SVR, and KNN algorithms, which have fewer parameters, were tuned manually based on prior experience. After debugging, the parameter selections for each model under the one-dimensional feature condition are shown in Tables 2 and 3, respectively.
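The grid-search half of this procedure can be sketched with scikit-learn's `GridSearchCV`. The data and the parameter ranges below are illustrative assumptions, not the grids actually searched in the study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the hardness -> vitamin C data set.
rng = np.random.default_rng(42)
X = rng.uniform(2.0, 6.0, size=(120, 1))
y = 7.0 * X[:, 0] + rng.normal(0.0, 1.0, 120)

# Mesh division over each parameter's range, as done for the ensemble
# learners (ADA, LGBM, XGB, GBRT, RF); the values here are examples only.
grid = {"learning_rate": [0.1, 0.3], "n_estimators": [50, 200], "max_depth": [3, 5]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

The MLP, SVR, and KNN parameters would instead be set by hand and checked against cross-validated scores.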
Table 2
Parameters of each learner based on hardness prediction model
Learners | Main Parameter Settings |
Adaptive Boosting | learning_rate = 0.3, n_estimators = 30 |
Light Gradient Boosting Machine | learning_rate = 0.5, num_iterations = 300, max_depth = 5, min_child_samples = 10 |
Extreme Gradient Boosting | learning_rate = 0.5, n_estimators = 50, max_depth = 10 |
Gradient Boosting Regression Tree | learning_rate = 0.3, n_estimators = 200, max_depth = 5 |
Support Vector Machine Regression | C = 100, epsilon = 0.1, gamma = 30 |
Multilayer Perceptron | hidden_layer_sizes = 3, activation = 'relu', learning_rate_init = 0.001, solver = 'adam' |
Random Forest | n_estimators = 150, max_depth = 15 |
K-Nearest Neighbor | n_neighbors = 5 |
Table 3
Parameters of each learner based on mass prediction model
Learners | Main Parameter Settings |
Adaptive Boosting | learning_rate = 0.3, n_estimators = 25 |
Light Gradient Boosting Machine | learning_rate = 0.5, num_iterations = 300, max_depth = 3, min_child_samples = 10 |
Extreme Gradient Boosting | learning_rate = 0.5, n_estimators = 50, max_depth = 10 |
Gradient Boosting Regression Tree | learning_rate = 0.3, n_estimators = 200, max_depth = 5 |
Support Vector Machine Regression | C = 100, epsilon = 0.1, gamma = 30 |
Multilayer Perceptron | hidden_layer_sizes = 13, activation = 'relu', learning_rate_init = 0.001, solver = 'adam' |
Random Forest | n_estimators = 150, max_depth = 15 |
K-Nearest Neighbor | n_neighbors = 4 |
The Choice of Learner
To ensure model diversity and difference, the base learners were selected by averaging 10 prediction results per model and then combining Pearson correlation analysis with Euclidean distance analysis. Taking the one-dimensional hardness-based vitamin C prediction model as an example, the correlation analysis plot and the Euclidean distance analysis are shown in Fig. 1 and Table 4, respectively.
Table 4
Euclidean distance analysis for the hardness-based vitamin C prediction model (distances between the models' prediction-value vectors)
| MLP | RF | GBRT | LGBM | XGB | ADA | KNN | LR | SVR |
MLP | 0.000 | | | | | | | | |
RF | 2.236 | 0.000 | | | | | | | |
GBRT | 2.214 | 2.599 | 0.000 | | | | | | |
LGBM | 2.351 | 2.508 | 2.689 | 0.000 | | | | | |
XGB | 2.560 | 2.556 | 2.350 | 2.382 | 0.000 | | | | |
ADA | 2.053 | 2.364 | 2.220 | 2.262 | 2.355 | 0.000 | | | |
KNN | 2.262 | 2.587 | 2.638 | 2.591 | 2.800 | 2.491 | 0.000 | | |
LR | 2.169 | 2.605 | 2.652 | 2.793 | 2.686 | 2.414 | 2.798 | 0.000 | |
SVR | 2.344 | 2.520 | 2.512 | 2.576 | 2.385 | 2.260 | 2.632 | 2.622 | 0.000 |
As Fig. 1 shows, LR, RF, and GBRT each have four negative correlations, while XGB and KNN have five and six, respectively. Specifically, LR is negatively correlated with GBRT, LGBM, XGB, and KNN; RF with GBRT, ADA, XGB, and KNN; XGB with MLP, ADA, RF, KNN, and LR; GBRT with RF, LR, KNN, and LGBM; and KNN with every model except MLP. In Table 4, a Euclidean distance greater than 2.5 was used as the cut-off: negatively correlated pairs generally also show larger distances, the one exception being LR and RF, whose distance is large even though the pair is positively correlated. Taken together, LR, KNN, RF, GBRT, and XGB were selected as the base learners for the hardness-based vitamin C prediction model.
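The two diversity measures used here can be computed in one pass over the candidates' prediction vectors. The three learners and their (standardised) prediction vectors below are hypothetical placeholders for the ten models in Fig. 1 and Table 4.

```python
import numpy as np

def diversity_stats(preds):
    """Pairwise Pearson correlations and Euclidean distances between the
    prediction vectors of candidate base learners."""
    names = list(preds)
    P = np.array([preds[n] for n in names], dtype=float)
    corr = np.corrcoef(P)                                   # Fig. 1 analogue
    dist = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)  # Table 4 analogue
    return names, corr, dist

# Hypothetical standardised prediction vectors for three learners.
preds = {
    "MLP": [0.2, -0.1, 0.4, -0.3, 0.1],
    "RF":  [0.1,  0.3, -0.2, 0.4, -0.1],
    "KNN": [-0.2, 0.2, -0.4, 0.3, 0.0],
}
names, corr, dist = diversity_stats(preds)
```

Pairs with negative correlation and a distance above the chosen cut-off (2.5 in Table 4) are the most diverse candidates for the ensemble.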
After the base learners were selected, the first-layer ensemble was trained in combination with each candidate meta-model; the coefficient of determination and root mean square error were averaged over 15 runs to select the meta-learner and complete the second-layer model and the overall framework of the heterogeneous Stacking ensemble. The learner selections under the different conditions are shown in Table 5.
Table 5
Results of learner selection under different conditions
Condition | Base learner selection | Meta-learner selection |
Hardness-based vitamin C model | GBRT + KNN + LR + RF + XGB | SVR |
Hardness-based soluble solids model | ADA + GBRT + MLP + SVR + XGB | SVR |
Mass-based vitamin C model | GBRT + KNN + RF + SVR + XGB | ADA |
Mass-based soluble solids model | GBRT + KNN + LGBM + RF + XGB | SVR |
Stacking Ensemble Learning Prediction Results
Using the Stacking ensemble models designed in 3.1.2 and 3.1.3, this study took the experimentally obtained data set as input and predicted the vitamin C and soluble solids contents from hardness and from mass, respectively. The data set was divided with a training-set-to-test-set ratio of 4:1. The results are shown in Table 6.
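The two-layer design maps directly onto scikit-learn's `StackingRegressor`. The sketch below uses the base learners listed in Table 5 for the hardness-based vitamin C model, except that XGB is replaced by `GradientBoostingRegressor` so the example needs only scikit-learn; the data are synthetic stand-ins.

```python
import numpy as np
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              StackingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for the 300-sample hardness -> vitamin C data set.
rng = np.random.default_rng(1)
X = rng.uniform(2.0, 6.0, size=(300, 1))
y = 7.0 * X[:, 0] + rng.normal(0.0, 1.0, 300)

# 4:1 train/test split, as in the study.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# First layer: base learners from Table 5 (XGB swapped for GBRT here).
base = [
    ("gbrt", GradientBoostingRegressor(random_state=0)),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=150, max_depth=15, random_state=0)),
]
# Second layer: the SVR meta-learner selected for this condition.
model = StackingRegressor(estimators=base, final_estimator=SVR(C=100, epsilon=0.1))
model.fit(X_tr, y_tr)
print(f"R2 = {r2_score(y_te, model.predict(X_te)):.3f}")
```

`StackingRegressor` generates the first layer's out-of-fold predictions internally via cross-validation, which is what prevents the meta-learner from overfitting the base learners' training error.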
Table 6
Stacking ensemble prediction results
Feature | R² (vitamin C) | RMSE | MAE | R² (soluble solids) | RMSE | MAE |
Based on hardness | 0.875 | 0.263 | 0.199 | 0.873 | 0.250 | 0.194 |
Based on mass | 0.234 | 0.661 | 0.557 | 0.207 | 0.623 | 0.505 |
It can be seen from Table 6 that when hardness predicts vitamin C content, the R2 obtained by Stacking ensemble learning is 0.875, which is 0.641 higher than that of mass prediction; RMSE is 0.263, which is 0.398 lower than that of mass prediction; MAE is 0.199, which is 0.358 lower than that for mass prediction. When hardness predicts soluble solid content, the R2 obtained by Stacking ensemble learning is 0.873, which is 0.666 higher than that of mass prediction; RMSE is 0.250, which is 0.373 lower than that of mass prediction; MAE is 0.194, which is 0.311 lower than that of mass prediction.
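The three metrics reported in Table 6 can be computed directly from the held-out predictions; a minimal numpy sketch with toy vectors:

```python
import numpy as np

def r2(y, p):
    """Coefficient of determination."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    return 1.0 - np.sum((y - p) ** 2) / np.sum((y - np.mean(y)) ** 2)

def rmse(y, p):
    """Root mean square error."""
    return float(np.sqrt(np.mean((np.asarray(y, float) - np.asarray(p, float)) ** 2)))

def mae(y, p):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y, float) - np.asarray(p, float))))

# Toy vectors standing in for test-set targets and predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
```

Higher R² and lower RMSE/MAE indicate better predictions, which is how Table 6 ranks the hardness-based models above the mass-based ones.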
Multi-Dimensional Feature Prediction Results
The above results show that hardness predicts the vitamin C and soluble solids contents significantly better than mass does, so the multi-dimensional feature study was conducted for hardness only.
Results of Multi-Dimensional Feature Correlation Analysis
After adding multi-dimensional features of hardness (X2, X3, ...), correlation analysis was carried out against the vitamin C and soluble solids contents, respectively; the results are shown in Tables 7 and 8.
Table 7
Correlation change between hardness and vitamin C content after adding multi-dimensional features
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 | X15 |
0.704 | 0.703 | 0.702 | 0.699 | 0.696 | 0.691 | 0.686 | 0.683 | 0.672 | 0.661 | 0.609 | 0.624 | 0.344 | 0.093 | 0.000 |
Table 8
Correlation change between hardness and soluble solid content after adding multi-dimensional features
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 | X15 |
−0.639 | −0.640 | −0.641 | −0.640 | −0.639 | −0.637 | −0.632 | −0.631 | −0.621 | −0.608 | −0.581 | −0.570 | −0.353 | −0.075 | 0.000 |
Tables 7 and 8 show that, as the dimension increases, the correlation between hardness and vitamin C content first declines, rises slightly at X12 while still remaining strong, and then from X13 falls again, dropping from weak correlation to no correlation. The magnitude of the correlation between hardness and soluble solids content rises briefly through X3, declines gradually from X4 to X10 while remaining strong, and from X11 weakens further, changing from strong to weak correlation and finally to no correlation.
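The dimension expansion behind Tables 7 and 8 is simply the correlation of each power of hardness with the target. The vectors below are synthetic illustrations, not the study's data.

```python
import numpy as np

# Illustrative hardness and vitamin C vectors (300 samples, linear plus noise).
rng = np.random.default_rng(3)
hardness = rng.uniform(2.0, 6.0, 300)
vitamin_c = 7.0 * hardness + rng.normal(0.0, 3.0, 300)

# Correlation of each power feature X^k with the target, as in Tables 7 and 8.
for k in range(1, 6):
    r = np.corrcoef(hardness ** k, vitamin_c)[0, 1]
    print(f"X^{k}: r = {r:.3f}")
```

Because the target here is linear in hardness, higher powers correlate progressively less well, mirroring the broad downward trend in Table 7.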
Model Parameter Settings
The model parameters were set in the same way as under the one-dimensional condition. After debugging, the parameter selections for each model with added multi-dimensional features are shown in Tables 9 and 10, respectively; starting from X2, only the parameter items that differ from the previous dimension are listed.
Table 9
Learner parameters for vitamin C prediction after adding multi-dimensional features
Dimension | Main parameter settings |
X1 | Same as Table 3 |
X2 | ADA: learning_rate = 0.1, n_estimators = 200; LGBM: max_depth = 3; RF: n_estimators = 200; MLP: hidden_layer_sizes = 7; KNN: n_neighbors = 6 |
X3 | ADA: learning_rate = 0.3, n_estimators = 25; SVR: epsilon = 0.05; MLP: hidden_layer_sizes = 8 |
X4 | ADA: learning_rate = 0.1, n_estimators = 200; SVR: epsilon = 0.1; MLP: hidden_layer_sizes = 7 |
X5 | MLP: hidden_layer_sizes = 8 |
X6 | ADA: learning_rate = 0.05; KNN: n_neighbors = 12 |
X7 | SVR: C = 5000; MLP: hidden_layer_sizes = 7; KNN: n_neighbors = 9 |
X8 | ADA: learning_rate = 0.5, n_estimators = 30; SVR: C = 2000; MLP: hidden_layer_sizes = 8 |
X9 | SVR: C = 1000 |
X10 | SVR: C = 4000; MLP: hidden_layer_sizes = 7 |
X11 | ADA: learning_rate = 0.3; SVR: C = 100; MLP: hidden_layer_sizes = 10; KNN: n_neighbors = 6 |
X12 | MLP: hidden_layer_sizes = 11 |
X13 | ADA: learning_rate = 0.5; SVR: C = 2000; MLP: hidden_layer_sizes = 8 |
X14 | SVR: C = 5000; MLP: hidden_layer_sizes = 7 |
Table 10
Learner parameters for soluble solids prediction after adding multi-dimensional features
Dimension | Main parameter settings |
X1 | Same as Table 3 |
X2 | ADA: learning_rate = 0.1, n_estimators = 200; LGBM: max_depth = 3; RF: n_estimators = 200; MLP: hidden_layer_sizes = 7; KNN: n_neighbors = 6 |
X3 | ADA: learning_rate = 0.3, n_estimators = 30; RF: n_estimators = 150; MLP: hidden_layer_sizes = 6 |
X4 | Same as X3 |
X5 | RF: n_estimators = 200; MLP: hidden_layer_sizes = 8 |
X6 | ADA: learning_rate = 0.05, n_estimators = 200; MLP: hidden_layer_sizes = 7; KNN: n_neighbors = 11 |
X7 | KNN: n_neighbors = 9 |
X8 | ADA: learning_rate = 0.5, n_estimators = 30; MLP: hidden_layer_sizes = 8 |
X9 | SVR: C = 200; MLP: hidden_layer_sizes = 7 |
X10 | SVR: C = 4000; MLP: hidden_layer_sizes = 8 |
X11 | SVR: C = 2000; MLP: hidden_layer_sizes = 7; KNN: n_neighbors = 6 |
X12 | SVR: C = 1000; MLP: hidden_layer_sizes = 8 |
X13 | SVR: C = 2000; MLP: hidden_layer_sizes = 7 |
X14 | ADA: learning_rate = 0.05, n_estimators = 150; SVR: C = 1500; KNN: n_neighbors = 5 |
The Choice of Learner
The model combination process was the same as under the one-dimensional condition. After analysis, the model combinations for predicting the vitamin C and soluble solids contents with added multi-dimensional features are shown in Table 11.
Table 11
Learner selection results when adding to different dimensions
Dimension | Base learners (vitamin C model) | Meta-learner (vitamin C model) | Base learners (soluble solids model) | Meta-learner (soluble solids model) |
X2 | ADA + LR + MLP + SVR + XGB | SVR | GBRT + KNN + LR + RF + XGB | LR |
X3 | MLP + GBRT + XGB + RF + SVR | SVR | ADA + GBRT + KNN + RF + XGB | SVR |
X4 | GBRT + KNN + LGBM + LR + SVR | RF | GBRT + MLP + LGBM + LR + SVR | SVR |
X5 | KNN + LGBM + MLP + RF + XGB | LR | GBRT + KNN + MLP + SVR + XGB | SVR |
X6 | GBRT + KNN + LR + MLP + XGB | KNN | ADA + GBRT + MLP + RF + SVR | SVR |
X7 | KNN + LGBM + MLP + RF + XGB | LR | ADA + MLP + LGBM + LR + SVR | LR |
X8 | ADA + GBRT + LGBM + LR + SVR | RF | KNN + LGBM + MLP + RF + XGB | LR |
X9 | KNN + MLP + RF + SVR + XGB | SVR | GBRT + KNN + LGBM + RF + SVR | KNN |
X10 | GBRT + KNN + RF + SVR + XGB | SVR | ADA + KNN + RF + SVR + XGB | LR |
X11 | GBRT + KNN + MLP + SVR + XGB | LR | ADA + LGBM + RF + SVR + XGB | RF |
X12 | GBRT + LR + RF + SVR + XGB | SVR | ADA + KNN + LR + MLP + XGB | LR |
X13 | ADA + GBRT + KNN + LGBM + LR | ADA | ADA + GBRT + LR + MLP + XGB | SVR |
X14 | ADA + KNN + LR + SVR + XGB | LR | ADA + GBRT + LGBM + LR + MLP | LR |
Stacking Ensemble Learning Prediction Results
The results of predicting the vitamin C and soluble solids contents from hardness with added multi-dimensional features are shown in Table 12.
Table 12
Prediction results when adding different dimensions
Dimension | R² (vitamin C) | RMSE | MAE | R² (soluble solids) | RMSE | MAE |
X2 | 0.881 | 0.261 | 0.204 | 0.875 | 0.244 | 0.191 |
X3 | 0.878 | 0.267 | 0.206 | 0.889 | 0.232 | 0.181 |
X4 | 0.879 | 0.264 | 0.200 | 0.872 | 0.249 | 0.194 |
X5 | 0.872 | 0.272 | 0.209 | 0.883 | 0.241 | 0.192 |
X6 | 0.871 | 0.262 | 0.210 | 0.871 | 0.245 | 0.194 |
X7 | 0.876 | 0.266 | 0.209 | 0.877 | 0.244 | 0.187 |
X8 | 0.880 | 0.261 | 0.193 | 0.874 | 0.248 | 0.196 |
X9 | 0.881 | 0.262 | 0.201 | 0.871 | 0.243 | 0.189 |
X10 | 0.877 | 0.265 | 0.202 | 0.883 | 0.258 | 0.198 |
X11 | 0.885 | 0.262 | 0.203 | 0.866 | 0.255 | 0.197 |
X12 | 0.890 | 0.250 | 0.191 | 0.877 | 0.249 | 0.197 |
X13 | 0.881 | 0.260 | 0.201 | 0.878 | 0.245 | 0.196 |
X14 | 0.881 | 0.257 | 0.201 | 0.860 | 0.254 | 0.198 |
Table 12 shows that when hardness was raised up to the 12th power, the prediction of vitamin C content was best: R² was 0.890, on average 0.012 higher than for the other dimensions; RMSE was 0.250, on average 0.013 lower; and MAE was 0.191, on average 0.017 lower.
When hardness was raised up to the 3rd power, the prediction of soluble solids content was best: R² was 0.889, on average 0.015 higher than for the other dimensions; RMSE was 0.232, on average 0.016 lower; and MAE was 0.181, on average 0.013 lower.
Comparative Analysis
Comparative Analysis of Stacking Ensemble and Single Model
Under the one-dimensional feature condition, the hardness-based predictions of the single models were compared with those of the Stacking ensemble model; the comparison results are shown in Table 13.
Table 13
Comparison results of single model and Stacking ensemble model
Model (vitamin C from hardness) | R² | RMSE | MAE | Model (soluble solids from hardness) | R² | RMSE | MAE |
LR | 0.856 | 0.286 | 0.228 | ADA | 0.871 | 0.253 | 0.193 |
KNN | 0.850 | 0.283 | 0.221 | SVR | 0.816 | 0.290 | 0.228 |
RF | 0.865 | 0.278 | 0.202 | GBRT | 0.851 | 0.268 | 0.207 |
GBRT | 0.864 | 0.276 | 0.201 | MLP | 0.795 | 0.319 | 0.254 |
XGB | 0.861 | 0.273 | 0.200 | XGB | 0.849 | 0.266 | 0.208 |
Stacking ensemble | 0.875 | 0.263 | 0.199 | Stacking ensemble | 0.873 | 0.250 | 0.194 |
Table 13 shows that when vitamin C content was predicted from hardness, the Stacking ensemble achieved an R² of 0.875, which is 0.019, 0.025, 0.010, 0.011, and 0.014 higher than LR, KNN, RF, GBRT, and XGB, respectively, an average increase of 0.016. Its RMSE of 0.263 is 0.023, 0.020, 0.015, 0.013, and 0.010 lower than those of the same models, an average reduction of 0.016, and its MAE of 0.199 is 0.029, 0.022, 0.003, 0.002, and 0.001 lower, an average reduction of 0.012. When soluble solids content was predicted from hardness, the Stacking ensemble achieved an R² of 0.873, which is 0.002, 0.057, 0.022, 0.078, and 0.024 higher than ADA, SVR, GBRT, MLP, and XGB, respectively, an average increase of 0.034. Its RMSE of 0.250 is 0.003, 0.040, 0.018, 0.069, and 0.016 lower, an average reduction of 0.029, and its MAE of 0.194 is 0.001, 0.034, 0.013, 0.060, and 0.014 lower, an average reduction of 0.024. These results show that the Stacking ensemble has better predictive ability than any single model.
Comparative Analysis of One-Dimensional Features and Multi-Dimensional Features
The multi-dimensional prediction results show that for vitamin C content the model performs best when hardness is raised to the twelfth power, while for soluble solids content it is optimal at the third power; adding too many features reduces model accuracy. Combined with the multi-dimensional correlation analysis, in both cases the optimal model occurs at a point where the correlation fluctuates.
With the influence of the parameters on model accuracy taken into account, the model parameters were optimised separately for the one-dimensional and multi-dimensional conditions before the results were compared. Synthesising the one-dimensional and multi-dimensional prediction results: when vitamin C content was predicted from hardness, the one-dimensional Stacking ensemble achieved an R² of 0.875, an RMSE of 0.263, and an MAE of 0.199, while the multi-dimensional Stacking ensemble achieved an R² of 0.890 (0.015 higher), an RMSE of 0.250 (0.013 lower), and an MAE of 0.191 (0.008 lower). When soluble solids content was predicted from hardness, the one-dimensional Stacking ensemble achieved an R² of 0.873, an RMSE of 0.250, and an MAE of 0.194, while the multi-dimensional ensemble achieved an R² of 0.889 (0.016 higher), an RMSE of 0.232 (0.018 lower), and an MAE of 0.181 (0.013 lower).