Selecting Essential Factors for Predicting Reference Crop Evapotranspiration Through Tree-based Machine Learning and Bayesian Optimisation

Abstract: Reference crop evapotranspiration (ETo) is a basic component of the hydrological cycle, and its estimation is critical for agricultural water resource management and scheduling. In this study, three tree-based machine learning algorithms (random forest [RF], gradient boosting decision tree [GBDT], and extreme gradient boosting [XGBoost]) were adopted to determine the essential factors for ETo prediction. The tree-based models were optimised using the Bayesian optimisation (BO) algorithm and compared with the three standalone models in terms of daily ETo and monthly mean ETo estimation in North China, with different input combinations of essential variables. The results indicated that solar radiation (Rs) and air temperature (Ts), including the maximum, minimum, and average temperature, were the key parameters affecting daily ETo prediction accuracy. Rs was the most influential factor in the monthly mean ETo model, followed by Ts. Both relative humidity (RH) and wind speed at 2 m (U2) had little impact on ETo prediction at either scale, although their importance differed. Compared with the GBDT and RF models, the XGBoost model exhibited the highest performance for daily ETo and monthly mean ETo estimation. The hybrid tree-based models with the BO algorithm outperformed the standalone tree-based models. Overall, compared with other inputs, the model with three inputs (Rs, Ts, and RH/U2) had the highest accuracy. The BO-XGBoost model exhibited superior performance in terms of the global performance index (GPI) for daily ETo and monthly mean ETo prediction, and it is recommended as a more accurate model for predicting daily ETo and monthly mean ETo in North China or areas with a similar climate.

… the implementation of precision agriculture. Many methods have been applied for measuring ETo, with the experimental measurement method being the most scientific. However, this method is challenging because of its complex experimental operations and expensive materials. Therefore, various low-cost mathematical methods for predicting ETo have been proposed, such as the United Nations Food and Agriculture Organization Penman–Monteith (FAO-56 PM), Jensen–Haise, and Irmak–Allen methods. Among them, the FAO-56 PM method has been widely used given its wide applicability and high accuracy. However, this method requires many meteorological variables.

One study used inputs including wind speed at 2 m (U2), RH, and net radiation (Rn) and concluded that CatBoost was more effective than the GRNN model. Another study proposed various machine learning models to predict ETo using only local or cross-site temperature data and observed that tree models had higher estimation accuracy than other machine learning models. Overall, tree-based machine learning algorithms have exhibited strong performance in ETo model construction research. However, machine learning models have a high computation time and require numerous input variables. In one earlier study, the accuracy of ETo prediction reached 83% without lacking key parameters, which is more accurate than ETo prediction using other models. Although using principal component analysis to extract the main input factors is both scientific and feasible, that study had certain regional limitations.
Moreover, principal component analysis is mainly suited to linear problems and is not ideal for solving nonlinear problems. Machine learning can solve nonlinear problems more effectively, and using a tree model to select the main factors is thus more suitable.

The study area covers part of North China, including Shanxi and the Inner Mongolia Autonomous Region (Fig. 1). The climate of the study area is dominated by temperate monsoon and subtropical monsoon climates. The environmental changes are distinct across the four seasons: winter is cold and dry, summer is hot and rainy, and spring is dry with little precipitation and strong evaporation. Precipitation is insufficient and unevenly distributed throughout the year, with most rainfall occurring in summer.

The annual precipitation is 500–1000 mm, with 600–700 mm in the lower Yellow River plain and 500–600 mm in the area around Beijing and Tianjin.

The meteorological data from 1960 to 2019 were obtained from the China Meteorological Data Network (http://data.cma.cn). Table 1 provides a detailed description of the weather stations. The data were divided into a training set (80%) and a test set (20%); to reduce errors and improve the generalisability of the models, a 10-fold cross-validation method was adopted (Fig. 2).
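The data-partitioning procedure above can be sketched in Python (a minimal illustration; the sample count and the fixed random seed are assumptions, not values from the study):

```python
import random

def train_test_split(n_samples, test_fraction=0.2, seed=42):
    """Shuffle sample indices and hold out a fraction as the test set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = int(n_samples * test_fraction)
    return idx[n_test:], idx[:n_test]  # training (80%), test (20%)

def k_fold(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    for i in range(k):
        val_set = set(indices[i::k])   # every k-th index forms one fold
        train = [j for j in indices if j not in val_set]
        yield train, sorted(val_set)

train_idx, test_idx = train_test_split(1000)
folds = list(k_fold(train_idx, k=10))
```

Each of the ten folds serves once as the validation set while the remaining nine are used for training, so every training sample is validated exactly once.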

The Ångström formula was applied to calculate Rs.

Before ranking the importance of the input features, the data must be normalised; in this study, min–max normalisation was used. For feature selection, the impurity reduction method was applied to calculate feature importance.
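Min–max normalisation rescales each feature to the range [0, 1]; a minimal sketch (the radiation values below are illustrative only, not data from the study):

```python
def min_max_normalise(values):
    """Rescale a list of values to [0, 1] via (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# e.g. daily solar radiation values (illustrative)
rs = [5.0, 10.0, 25.0, 15.0]
rs_norm = min_max_normalise(rs)  # -> [0.0, 0.25, 1.0, 0.5]
```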

The importance of a feature can be measured by the total reduction in impurity that the feature induces across all trees in the forest. The score for each feature is calculated after training, and the results are standardised so that the importances of all features sum to 1.

The following Gini index calculation method was also applied:

$$GI_m = 1 - \sum_{k=1}^{K} p_{mk}^2 \qquad (4)$$

where $K$ represents the number of categories and $p_{mk}$ represents the sample weight of category $k$ at node $m$.

The importance of node $m$ is the change in the Gini index before and after the branch at node $m$, which can be expressed as

$$VIM_m = GI_m - GI_l - GI_r$$

where $GI_l$ and $GI_r$ represent the Gini indices of the two new nodes after branching, respectively.
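The node-importance calculation can be checked numerically. The sketch below uses the unweighted difference between the parent and child Gini indices; note that library implementations such as scikit-learn additionally weight each child by its sample fraction. All class counts are made up for illustration:

```python
def gini(counts):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# A parent node with 10 samples of each of two classes,
# split into two much purer child nodes.
parent, left, right = [10, 10], [9, 1], [1, 9]

# Node importance: change in the Gini index before and after the branch.
vim = gini(parent) - gini(left) - gini(right)
```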

For a feature $X_j$, let $M$ be the set of nodes at which $X_j$ appears in decision tree $i$. The importance of $X_j$ in the $i$-th tree is

$$VIM_{ij} = \sum_{m \in M} VIM_m.$$

For RF, assuming that $n$ represents the total number of trees, the feature importance is given by

$$VIM_j = \frac{1}{n} \sum_{i=1}^{n} VIM_{ij}.$$

For regression problems, the final result of the RF calculation is the average of all tree results. The regression process is described as follows:

First, $K$ training samples $(S_1, S_2, \cdots, S_K)$ are randomly generated from the total training sample using the bootstrap sampling method, corresponding to the $K$ decision trees to be constructed.

For GBDT, the weak learner is initialised as $f_0(x) = \arg\min_c \sum_{i=1}^{N} L(y_i, c)$; for the squared loss, $c$ can be set to the mean value of the sample $y$. For each leaf region $R_{tj}$, $j = 1, 2, \cdots, J$, the best-fit value is calculated as

$$c_{tj} = \arg\min_c \sum_{x_i \in R_{tj}} L\left(y_i, f_{t-1}(x_i) + c\right).$$

The strong learner $f(x)$ is then updated, and the expression for $f_t(x)$ is obtained as

$$f_t(x) = f_{t-1}(x) + \sum_{j=1}^{J} c_{tj} I(x \in R_{tj}).$$

XGBoost integrates multiple regression tree models to form a strong learner, thereby increasing the training speed, enabling parallel processing, and improving generalisability. Furthermore, with more data, the parallel efficiency is higher.
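For the squared loss, the best-fit leaf value $c_{tj}$ above reduces to the mean residual of the samples falling in that leaf, and the strong learner is updated by adding that constant on the leaf. A toy single-leaf update (all numbers illustrative):

```python
def leaf_best_fit(y_true, y_pred):
    """Best-fit constant for a leaf under squared loss: the mean residual,
    i.e. argmin_c sum((y - (f + c))^2)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return sum(residuals) / len(residuals)

y = [3.0, 5.0, 7.0, 9.0]
f0 = [sum(y) / len(y)] * len(y)        # weak learner initialised to mean(y)
c = leaf_best_fit(y[:2], f0[:2])       # leaf containing the first two samples
f1 = [p + c for p in f0[:2]] + f0[2:]  # strong-learner update on that leaf only
```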

XGBoost can be regarded as an additive model composed of $K$ CARTs:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F,$$

where $K$ is the number of trees; $F$ is the space of all possible CARTs; and $f_k$ is a specific CART.

In the training process, with the parameters $\Theta = \{f_1, f_2, \cdots, f_K\}$, the objective function of XGBoost is

$$Obj(\Theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k).$$

The first part of the formula is the loss function, and the second part is the regularisation term, obtained by summing the regularisation terms of the $K$ trees.

Through vector mapping for each tree, the decision tree is improved, and the regularisation term of XGBoost is obtained as follows:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

where $\gamma$ and $\lambda$ are the penalty coefficients of the model; $T$ is the number of leaf nodes; and $w_j$ is the score of leaf $j$.

The optimisation objective is approached step by step. At step $t$, an optimised CART is added on the basis of the first $t-1$ trees, and the objective function becomes

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{constant}.$$

A second-order Taylor expansion of this formula is then calculated to obtain

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),$$

where $g_i$ and $h_i$ are the first- and second-order derivatives of the loss function.

… the smaller the simulation error. The closer the NSE is to 1, the higher the model quality and credibility.
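From the second-order expansion, the optimal weight of a leaf follows in closed form as $w^* = -G/(H + \lambda)$, where $G$ and $H$ are the sums of $g_i$ and $h_i$ over the samples in the leaf. A numerical check for the squared loss ($g_i = \hat{y}_i - y_i$, $h_i = 1$; the $\lambda$ values are illustrative):

```python
def leaf_weight(grads, hessians, lam):
    """Optimal XGBoost leaf weight: w* = -G / (H + lambda)."""
    return -sum(grads) / (sum(hessians) + lam)

# Squared loss over the two samples in a leaf: g_i = y_pred - y_true, h_i = 1.
y_true, y_pred = [3.0, 5.0], [6.0, 6.0]
g = [p - t for t, p in zip(y_true, y_pred)]  # [3.0, 1.0]
h = [1.0] * len(g)

w_unreg = leaf_weight(g, h, lam=0.0)  # equals minus the mean residual of the leaf
w_reg = leaf_weight(g, h, lam=1.0)    # shrunk toward zero by the penalty
```

With no regularisation the leaf weight coincides with the plain GBDT mean-residual leaf value; a positive $\lambda$ shrinks it toward zero.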

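The statistical indicators referred to here can be computed as follows (a minimal sketch; the observed and simulated values are illustrative, and the GPI aggregation is omitted because its weighting is defined separately):

```python
import math

def evaluate(obs, sim):
    """RMSE, MAE, and Nash-Sutcliffe efficiency (NSE) of simulated vs. observed values."""
    n = len(obs)
    mean_obs = sum(obs) / n
    sq_err = sum((o - s) ** 2 for o, s in zip(obs, sim))
    rmse = math.sqrt(sq_err / n)
    mae = sum(abs(o - s) for o, s in zip(obs, sim)) / n
    nse = 1.0 - sq_err / sum((o - mean_obs) ** 2 for o in obs)
    return rmse, mae, nse

obs = [2.0, 4.0, 6.0, 8.0]   # observed ETo (illustrative)
sim = [2.5, 3.5, 6.5, 7.5]   # simulated ETo (illustrative)
rmse, mae, nse = evaluate(obs, sim)
```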
The higher the GPI, the more effective the overall simulation of the model.

Factor importance varied among the stations in the daily ETo and monthly mean ETo factor importance analyses, but the importance tended to be consistent on the same time scale.

For the daily ETo factor importance analysis (Fig. 3), … Rs and Ts were used as the basic input combination, with both input at the same time. Finally, RH and U2 were introduced (Table 3).

Comparison of daily ETo and monthly mean ETo models with different input combinations
The performance of the tree-based models (RF, GBDT, and XGBoost) and the hybrid tree-based models under different input combinations is presented in Table 4.

Boxplots of the performance of the tree-based models and hybrid tree-based models on the daily and monthly mean scales under different input combinations in the testing phase are illustrated in Fig. 5 and Fig. 6. The same letter denotes no significant difference between the models, as determined using Fisher's least significant difference test at the 0.05 significance level.

As presented in Fig. 5, for daily ETo estimation, no significant difference was observed between the Rs and Ts inputs (p > 0.05), indicating the similar overall accuracy of these two factors.

When the two factors were input together, the difference was obvious compared with single-factor input (p < 0.05). Three-factor input considerably improved the accuracy of most models; however, the RF model did not exhibit a marked improvement compared with the two-factor input condition.

Compared with the accuracy of the single-factor model, the three-input RF model showed …

As detailed in Table 4, the BO-XGBoost model exceeded the performance of the BO-GBDT and BO-RF models.

The BO-XGBoost model had the highest accuracy among the tree-based models, and the XGBoost models outperformed the GBDT models, followed by the RF models.

GPI ranking of the daily ETo and monthly mean ETo models
Although RMSE, R², MAE, and NSE were used, no single measure alone can be used to judge the performance of the tree-based models. Therefore, the GPI value was further calculated for all tree-based models at all stations under different input combinations (Table 5, Fig. 7, and Fig. 8). As listed in Table 5, the BO-XGBoost model outperformed the other tree-based models on the monthly mean scale. Moreover, the hybrid tree-based models exhibited higher GPI values than the standalone tree-based models at each station with different input combinations for daily ETo and monthly mean ETo; thus, the hybrid tree-based models outperformed the standalone tree-based models.

The GPI ranking comparison at each site in the study area is presented in Fig. 7 and Fig. 8.

The smaller the circle radius, the larger the GPI value and the more satisfactory the model. In the comparison of daily ETo GPI value rankings, the radius of the model with Rs input was larger than that with Ts input at the Inner Mongolia site, although the opposite was true for the other sites. This indicates that the Ts input at the Inner Mongolia site contributed more to the ETo calculation than the Rs input, which is consistent with the results of the main-factor analysis. In the two-factor input combination, the radii at stations 50548 and 54311 were small, indicating that the model accuracy at these two sites was high; the importance at these two sites was closer to 1 than at the other sites. When a third factor (RH or U2) was introduced at the same site, the radius for the factor with greater importance was smaller than that for the factor with less importance, indicating that the input combination containing the more important factor achieved higher accuracy. The feature importance ranking was thereby verified.

Fig. 7. GPI values of the tree-based daily ETo prediction models during the testing phase.

The BO algorithm was employed to identify the optimal parameter set for the respective models. … Compared with the standalone tree-based models, the hybrid tree-based models (BO-RF, BO-GBDT, and BO-XGBoost) …
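The BO loop can be illustrated schematically. True Bayesian optimisation fits a Gaussian-process surrogate to past (hyperparameter, score) pairs and proposes the next trial by maximising an acquisition function; the stand-in below deliberately replaces the GP with a cheap kernel smoother and uses pure exploitation, so it shows only the sequential propose-evaluate-update structure, not the real algorithm. The objective is a made-up one-dimensional stand-in for a cross-validated model error:

```python
import math
import random

def surrogate(x, xs, ys, length=0.3):
    """Kernel-smoother stand-in for a GP surrogate: RBF-weighted mean of scores."""
    weights = [math.exp(-((x - xi) / length) ** 2) for xi in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

def sequential_opt(f, lo, hi, n_init=4, n_iter=12, seed=1):
    """Propose-evaluate-update loop: evaluate the candidate the surrogate likes best."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(n_init)]   # initial random trials
    ys = [f(x) for x in xs]
    for _ in range(n_iter):
        candidates = [rng.uniform(lo, hi) for _ in range(200)]
        x_next = min(candidates, key=lambda x: surrogate(x, xs, ys))
        xs.append(x_next)          # evaluate the proposal and update the history
        ys.append(f(x_next))
    y_best, x_best = min(zip(ys, xs))
    return x_best, y_best, xs, ys

# Toy "validation error" with its minimum at x = 0.3 (purely illustrative):
x_best, y_best, xs, ys = sequential_opt(lambda x: (x - 0.3) ** 2, 0.0, 1.0)
```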