The results presented in this study are essential for more adequate water management, since accurate estimation of ET0 is fundamental for quantifying water demand. Furthermore, the use of different estimation techniques and combinations of input data in the models yielded results applicable at different spatial scales.
According to the results, a linear correlation was observed between the input data and ET0, with the variables Tmean, Tmax and Tmin showing the best correlation (Fig. 4). The other variables had low (lat, alt and RH) or no (lon and month) correlation with ET0. An inversely proportional relationship with ET0 was observed for the lat, alt and RH variables. High latitudes tend to be cooler regions, with less energy available for the ET0 process. Also, an increase in altitude results in a decrease in temperature, according to the vertical thermal gradient in the troposphere. An increase in RH reduces the potential gradient, decreasing the water transfer rate from the soil-plant system to the atmosphere. In contrast, a directly proportional relationship was observed between the Tmean, Tmax and Tmin variables and ET0: an increase in Tmean, Tmax and Tmin results in more energy available for ET0. Sattari et al. (2021) observed the same behavior of the variables Tmean, Tmax, Tmin and RH when estimating ET0. In their study, the variables Tmean, Tmax and Tmin were all highly correlated with ET0 and mean RH was the least correlated variable.
In this way, the ability of machine learning approaches to estimate ET0 from the variables mentioned above was investigated under different conditions and scenarios. The statistical performance indicators of the ANN, RF, SVM and MLR models for estimating ET0 at any location within the Minas Gerais state (SI: data from the 56 climatological stations, 100% of the input data available) are presented in Table 3.
Table 3 Statistical performance indicators of the ANN, RF, SVM and MLR models in SI
| Model (SI) | I8 r | I8 MAE | I8 RMSE | I6 r | I6 MAE | I6 RMSE | I3 r | I3 MAE | I3 RMSE | I2 r | I2 MAE | I2 RMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ANN | 0.966 | 0.167 | 0.215 | 0.963 | 0.178 | 0.224 | 0.943 | 0.210 | 0.278 | 0.860 | 0.332 | 0.429 |
| RF | 0.955 | 0.191 | 0.250 | 0.966 | 0.166 | 0.220 | 0.934 | 0.220 | 0.296 | 0.859 | 0.335 | 0.426 |
| SVM | 0.933 | 0.23 | 0.290 | 0.927 | 0.242 | 0.310 | 0.878 | 0.311 | 0.399 | 0.877 | 0.312 | 0.399 |
| MLR | 0.933 | 0.231 | 0.298 | 0.928 | 0.241 | 0.308 | 0.877 | 0.313 | 0.399 | 0.877 | 0.312 | 0.398 |
Values in bold indicate the best result within each model; values in italics indicate the best result within each input data combination. Input data combinations: (I8) latitude, longitude, altitude, month, Tmean, Tmax, Tmin and RH; (I6) latitude, longitude, altitude, month, Tmean and RH; (I3) month, Tmean and RH; and (I2) Tmean and RH.
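The three indicators reported in Table 3 can be illustrated with a minimal sketch. It assumes the standard definitions of Pearson's correlation coefficient (r), mean absolute error (MAE) and root mean square error (RMSE); the exact formulas used in the study are not given in this section, and the function names and sample values below are hypothetical.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def mae(observed, predicted):
    """Mean absolute error, in the same units as the data."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error, in the same units as the data."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


# Made-up ET0 values, for illustration only (not data from the study).
observed = [3.0, 4.0, 5.0, 6.0]
predicted = [3.5, 3.5, 5.5, 5.5]

r_value = pearson_r(observed, predicted)
mae_value = mae(observed, predicted)
rmse_value = rmse(observed, predicted)
print(f"r={r_value:.3f} MAE={mae_value:.3f} RMSE={rmse_value:.3f}")
```

Higher r and lower MAE and RMSE indicate better agreement between the estimated and reference ET0, which is how the values in the tables are compared.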
All the models developed with the I8 and I6 input combinations performed better than their versions developed with I3 and I2. The lowest predictive capacity was observed when the RF model was used with the I2 input combination. The greatest predictive capacity in SI was observed when the RF and ANN models were used with the I6 and I8 input combinations, respectively. The SVM and MLR models performed better than ANN and RF when only Tmean and RH (I2) were used as input data.
When comparing the I8 and I6 combinations, the average r, MAE and RMSE of all models did not vary markedly. The removal of the geographic coordinates (I6 to I3) resulted in a more pronounced performance reduction for the SVM and MLR models. The highest impact on performance was observed for the ANN and RF models when the month variable was removed (I3 to I2). The average r decreased by 8%; MAE and RMSE increased by 52.2% and 43.9%, respectively. The removal of month did not affect the performance of the SVM and MLR models.
In the case of SII, a scenario in which the state of Minas Gerais was divided into two areas (Tho1 and Tho2), the statistical performance indicators of the models used in the ET0 estimation are shown in Table 4.
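As a worked check of how such percentages can be obtained, the sketch below computes the relative change between the I3 and I2 combinations using the RF values listed in Table 3. The helper name `percent_change` is illustrative and not part of the study, and the choice of the RF row is an assumption made for the example.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, as a percentage of `before`."""
    return (after - before) / before * 100.0


# RF indicators in SI, copied from Table 3.
rf_i3 = {"r": 0.934, "MAE": 0.220, "RMSE": 0.296}
rf_i2 = {"r": 0.859, "MAE": 0.335, "RMSE": 0.426}

# Change in each indicator when the month variable is removed (I3 to I2).
changes = {name: percent_change(rf_i3[name], rf_i2[name]) for name in rf_i3}

for name, value in changes.items():
    print(f"{name}: {value:+.1f}%")
```

With these inputs the script reports changes of about -8.0%, +52.3% and +43.9% for r, MAE and RMSE, which is close to the figures quoted in the text.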
Table 4 Statistical performance indicators of the ANN, RF, SVM and MLR models in SII
| Area (SII) | Model | I8 r | I8 MAE | I8 RMSE | I6 r | I6 MAE | I6 RMSE | I3 r | I3 MAE | I3 RMSE | I2 r | I2 MAE | I2 RMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tho.1 (A, B4, B3, B2 and B1) | ANN | 0.976 | 0.135 | 0.168 | 0.965 | 0.156 | 0.196 | 0.959 | 0.169 | 0.212 | 0.904 | 0.266 | 0.322 |
| Tho.1 (A, B4, B3, B2 and B1) | RF | 0.964 | 0.164 | 0.198 | 0.973 | 0.143 | 0.174 | 0.963 | 0.16 | 0.198 | 0.920 | 0.234 | 0.291 |
| Tho.1 (A, B4, B3, B2 and B1) | SVM | 0.955 | 0.181 | 0.219 | 0.948 | 0.191 | 0.235 | 0.925 | 0.232 | 0.281 | 0.924 | 0.235 | 0.284 |
| Tho.1 (A, B4, B3, B2 and B1) | MLR | 0.957 | 0.178 | 0.216 | 0.949 | 0.190 | 0.233 | 0.927 | 0.23 | 0.278 | 0.925 | 0.233 | 0.282 |
| Tho.2 (C2, C1 and D) | ANN | 0.940 | 0.211 | 0.269 | 0.944 | 0.210 | 0.260 | 0.895 | 0.269 | 0.353 | 0.818 | 0.377 | 0.462 |
| Tho.2 (C2, C1 and D) | RF | 0.925 | 0.227 | 0.302 | 0.943 | 0.200 | 0.267 | 0.879 | 0.29 | 0.374 | 0.817 | 0.350 | 0.453 |
| Tho.2 (C2, C1 and D) | SVM | 0.893 | 0.276 | 0.353 | 0.898 | 0.271 | 0.346 | 0.840 | 0.342 | 0.427 | 0.840 | 0.340 | 0.427 |
| Tho.2 (C2, C1 and D) | MLR | 0.898 | 0.269 | 0.345 | 0.899 | 0.267 | 0.342 | 0.839 | 0.339 | 0.427 | 0.839 | 0.339 | 0.427 |
Values in bold indicate the best result within each model; values in italics indicate the best result within each input data combination. Input data combinations: (I8) latitude, longitude, altitude, month, Tmean, Tmax, Tmin and RH; (I6) latitude, longitude, altitude, month, Tmean and RH; (I3) month, Tmean and RH; and (I2) Tmean and RH.
Tho1 and Tho2 had 48.2% and 51.8%, respectively, of the data available as input data. The highest predictive capacity in the Tho1 and Tho2 areas was observed when the ANN model was used with the I8 input combination and the RF model was used with the I6 input combination, respectively. The removal of the Tmax and Tmin input data (I6) did not increase the models' predictive capacity in the Tho1 area, except for the RF model. This behavior is similar to that observed in SI. In the Tho2 area, however, all models performed better with the I6 input combination.
The removal of the month variable (I3 to I2) had the highest impact on the quality of the ANN and RF models. When comparing the I3 and I2 combinations, the average r of the ANN and RF models decreased by 7.2% and 5.7%, respectively. The MAE of the ANN and RF models increased by 36.4% and 31.6%, respectively. However, no marked variation was observed in the performance of the SVM and MLR models.
In SIII, the Minas Gerais state was divided into the K1 and K2 areas, which comprised 62.5% and 37.5% of the climatological stations, respectively. The statistical performance indicators of the models for this scenario are presented in Table 5.
Table 5 Statistical performance indicators of the ANN, RF, SVM and MLR models in SIII
| Area (SIII) | Model | I8 r | I8 MAE | I8 RMSE | I6 r | I6 MAE | I6 RMSE | I3 r | I3 MAE | I3 RMSE | I2 r | I2 MAE | I2 RMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K1 (Cwa, Cwb and Cfb) | ANN | 0.966 | 0.170 | 0.209 | 0.968 | 0.163 | 0.204 | 0.961 | 0.180 | 0.225 | 0.912 | 0.270 | 0.334 |
| K1 (Cwa, Cwb and Cfb) | RF | 0.963 | 0.175 | 0.221 | 0.973 | 0.15 | 0.191 | 0.962 | 0.174 | 0.222 | 0.920 | 0.261 | 0.318 |
| K1 (Cwa, Cwb and Cfb) | SVM | 0.949 | 0.199 | 0.256 | 0.944 | 0.209 | 0.267 | 0.927 | 0.247 | 0.305 | 0.926 | 0.248 | 0.306 |
| K1 (Cwa, Cwb and Cfb) | MLR | 0.950 | 0.204 | 0.253 | 0.945 | 0.212 | 0.266 | 0.928 | 0.245 | 0.303 | 0.917 | 0.247 | 0.305 |
| K2 (Am and Aw) | ANN | 0.964 | 0.16 | 0.201 | 0.889 | 0.269 | 0.350 | 0.895 | 0.263 | 0.340 | 0.817 | 0.360 | 0.447 |
| K2 (Am and Aw) | RF | 0.924 | 0.23 | 0.294 | 0.943 | 0.203 | 0.258 | 0.885 | 0.285 | 0.352 | 0.826 | 0.347 | 0.429 |
| K2 (Am and Aw) | SVM | 0.889 | 0.270 | 0.347 | 0.890 | 0.274 | 0.347 | 0.846 | 0.329 | 0.405 | 0.847 | 0.326 | 0.403 |
| K2 (Am and Aw) | MLR | 0.892 | 0.269 | 0.343 | 0.894 | 0.269 | 0.340 | 0.846 | 0.325 | 0.404 | 0.848 | 0.323 | 0.403 |
Values in bold indicate the best result within each model; values in italics indicate the best result within each input data combination. Input data combinations: (I8) latitude, longitude, altitude, month, Tmean, Tmax, Tmin and RH; (I6) latitude, longitude, altitude, month, Tmean and RH; (I3) month, Tmean and RH; and (I2) Tmean and RH.
In general, the ANN and RF models outperformed the SVM and MLR models with the input combinations I8, I6 and I3. When the I2 combination was used, the SVM and MLR models were superior. The model with the highest predictive capacity in the K1 area was the RF with the I6 input combination. The ANN model with the I8 input combination showed the highest predictive capacity in the K2 area.
In the K1 area, the removal of the month variable had the highest impact on the performance of the ANN and RF models. The removal of the alt, lat and lon variables had the highest impact on the performance of the SVM and MLR models. In the K2 area, the behavior of the RF, SVM and MLR models was similar to that observed in the K1 area. However, the removal of the alt, lat and lon variables had the highest impact on the ANN model in the K2 area.
The ANN and RF models showed greater predictive capacity than the SVM and MLR models in all scenarios. This high capacity is achieved with the input data combination I8 or I6. Both models had similar performances, but on average the RF was slightly superior. Ferreira et al. (2019) and Ferreira and Da Cunha (2020) evaluated the performance of different machine learning models in estimating ET0 in Brazil. In these studies, ANN generally performed slightly better than other traditional machine learning models (i.e., RF and extreme gradient boosting, XGBoost). However, in some studies the RF model performed slightly better than other models (i.e., generalized regression neural networks, GRNN) in estimating ET0 (Feng et al. 2017; Wang et al. 2019b). Other papers suggest that other machine learning models perform better in different situations and regions (Mehdizadeh et al. 2017; Shiri et al. 2014). Therefore, there is a need for studies that address more than one model.
The SVM and MLR models showed similar statistical indices and behavior in all scenarios. These results can be explained by the SVM's use of a linear kernel function, which probably made it behave similarly to an MLR. Tests with a nonlinear kernel function did not improve the predictions. According to Pisner and Schnyer (2020), SVM is used to recognize patterns in complex databases. Possibly the data used do not present a complexity that justifies the use of SVM.
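The similarity between the two models can be seen from the form of their prediction functions. The comparison below is a sketch that assumes a standard ε-insensitive support vector regression with a linear kernel; the symbols w, b, C and ε are textbook notation and are not taken from this study, whose exact SVM configuration is not detailed here.

```latex
\begin{aligned}
\text{MLR:}\quad & \hat{y} = \beta_0 + \sum_{i=1}^{p} \beta_i x_i,
  & \min_{\beta}\ & \sum_{j=1}^{n} \left( y_j - \hat{y}_j \right)^2 \\
\text{Linear-kernel SVM:}\quad & \hat{y} = b + \sum_{i=1}^{p} w_i x_i,
  & \min_{w,\,b}\ & \tfrac{1}{2} \lVert w \rVert^2
    + C \sum_{j=1}^{n} \max\!\left( 0,\ \lvert y_j - \hat{y}_j \rvert - \varepsilon \right)
\end{aligned}
```

Both predictors are linear in the inputs and differ only in the loss and regularization used to fit the coefficients, which is consistent with the nearly identical indicators reported for the two models.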
The SVM and MLR models showed a greater predictive capacity in all scenarios when the input data were limited to Tmean and RH only (I2). This result may indicate a low predictive capacity of the ANN and RF models in situations of low variability in the input data. This low variability may hinder the search for patterns that justify variations in ET0.
In some scenarios, the removal of Tmax and Tmin improved the ET0 estimation results. Sattari et al. (2021) observed an increase in the accuracy of the support vector regression (SVR) and Gaussian process regression (GPR) models with the removal of some input data, including Tmax and Tmin.
Although Tmax and Tmin showed a good correlation with ET0 (Fig. 4), their weight is diluted in the calculation of the Tmean used in the calculation of ET0. Thus, adding Tmax and Tmin can make the ET0 estimate more complex or confusing. This can decrease the accuracy of the models, and the removal of these input data can improve the prediction. Determining the input data is critical to the success of the models. This selection can facilitate the training and testing processes and improve the understanding of the system (Bowden et al. 2005; Maier and Dandy 2000). However, this result shows that linear regression alone is not enough to decide which input data should be removed in order to increase the predictive performance.
When the independent variables lat, lon and alt were removed (I3), a reduction in the statistical indices of all models was observed. These variables are related to the spatial location of the observed data. Although the correlation observed between these variables and ET0 is low (Fig. 4), the joint removal of these data negatively affected the models' performance. According to Mehdizadeh et al. (2017) and Souza et al. (2022), air temperature and solar radiation are among the main variables affecting ET0. Several studies have indicated the influence of the lat, lon and alt variables on air temperature and solar radiation (Alvares et al. 2013b; Ozgoren et al. 2012). Thus, variations in lat, lon and alt may indirectly affect ET0, which can explain the observed results.
The division of the input data into two areas with climatic similarity aimed to increase the performance of the models. The division presented in SII and SIII managed to slightly increase the capacities of the models in relation to SI. However, this increase was only observed in the Tho1 and K1 areas. Thus, we can infer that, although the division into areas with climatic similarity can reduce the amount of input data for training, in some situations this division is valid, and the models can respond more accurately. Machine learning models developed for broader scenarios (e.g., SI) typically have reduced predictive capacity due to the high nonlinearity and low similarity between input data; however, these models have a greater ability to generalize (Shiri et al. 2014). According to Ferreira et al. (2019), although the models developed locally perform better, these models may have low predictive capacity when used in other regions, since they can be highly specific to the location.
Regarding the importance of each input variable for the response variable of the evaluated algorithms, WEKA was used to select the attributes (Figs. 5, 6 and 7). Attributes were selected using the "ClassifierAttributeEval" tool associated with the "Ranker" method. These tools rank attributes by their individual evaluations. The correlation coefficient was the measure used to evaluate the performance of attribute combinations in the Ranker configuration. The same WEKA ranking method was used by Yadav et al. (2014) to verify the importance of each input variable in solar radiation prediction.
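The idea of ranking attributes by their individual evaluations can be sketched as follows. This is not WEKA's implementation: as a stand-in learner it fits a one-variable least-squares line per attribute and scores it by the correlation between its predictions and the target, evaluated on the same data rather than by cross-validation. All function names and values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x


def rank_attributes(attributes, target):
    """Score each attribute on its own, then sort from best to worst."""
    scores = {}
    for name, values in attributes.items():
        slope, intercept = fit_line(values, target)
        predictions = [intercept + slope * v for v in values]
        scores[name] = pearson_r(predictions, target)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up toy values, for illustration only (not data from the study).
attributes = {
    "Tmean": [18.0, 20.0, 22.0, 24.0, 26.0],
    "RH": [80.0, 70.0, 75.0, 60.0, 65.0],
    "month": [1.0, 10.0, 4.0, 12.0, 7.0],
}
et0 = [2.9, 3.4, 3.8, 4.5, 4.8]

ranking = rank_attributes(attributes, et0)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because each attribute is scored in isolation, this kind of ranking does not capture interactions between inputs, so a highly ranked attribute is not guaranteed to improve a multivariate model.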
Different ANN settings were used for each input data combination (Table 2). These ANN settings resulted in different weights for each input attribute (Fig. 5). However, a similar behavior was observed between the different configurations. In all scenarios, Tmean, Tmax and Tmin had a greater weight in the estimate. In SIII K2, the relative importance of Tmax surpassed that of Tmean (Fig. 5). This result may explain the decrease in the ANN's performance in this scenario when Tmax and Tmin were removed (Table 5). The variables lat and month had a similar weight in all scenarios. Although the weights were similar, the removal of the month variable resulted in a greater reduction in ANN performance than the removal of the variables lat, lon and alt.
The ranked value of each input variable of the RF is shown in Figure 6. The Tmean and month variables had a higher weight in the ET0 estimate. In SII Tho1 and SIII K2, the month variable was more important than the Tmean variable. This result may explain the drop in the RF model performance when the month variable was removed (I3 to I2). The Tmax and Tmin variables also had a high weight in the ET0 estimate. However, the removal of these variables increased the capacity of the RF model, as observed (Tables 3, 4 and 5) and discussed previously.
The relative importance of each input variable of the SVM is shown in Figure 7. The Tmean, Tmax and Tmin variables had a higher weight in the ET0 estimate, followed by RH and lat. The month variable was of low importance in the ET0 estimate. In SI, the month showed a negative weight. Therefore, this input variable can negatively affect the ET0 estimate. In the performance results of the SVM model (Tables 3, 4 and 5), there was no significant variation in performance when the month variable was removed. Both results indicate that, for this region, the month variable does not contribute to the performance of the SVM model.
Although each model has a different pattern in the ranking of the input variables (Figs. 5, 6 and 7), air temperature was the most important attribute. The observed correlation between air temperature and ET0 (Fig. 4) may explain the importance of air temperature in the estimate. This behavior was not observed in SIII K2 and SII Tho2. However, in these scenarios, no significant difference was observed between the month and Tmean variables. Wang et al. (2019b) ranked the importance of meteorological variables based on the RF method. The three most important variables were insolation (n), Tmax and RH. The high relative importance of air temperature observed by these authors corroborates the results of the present study.
The other variables presented different weights according to each model applied. These results indicate a peculiarity of the models tested. New research or applications can therefore build on these results and choose the method that best suits the conditions of the input data. However, it is recommended that the models be tested beforehand with different input data. As noted, some variables may have a relatively high weight in the ET0 estimate, but their use can decrease the predictive performance of the model. This behavior was observed when using the RF model, in which the removal of the variables Tmax and Tmin increased the predictive capacity, although these variables showed high relative importance.
It is important to note that the month variable was highly important in the RF estimate. However, its importance was low when the SVM model was used, since this variable was not correlated with ET0 (Fig. 4). These results highlight the need for more techniques to select the meteorological variables used in the modeling. Linear regression alone is not sufficient to identify the relevance of the input data. Furthermore, different models may rank the importance of the input variables differently and still present satisfactory results.
Unlike the importance evaluation applied to the ANN, RF and SVM attributes, for the MLR an attribute selection method (M5 method) was applied, which indicates the importance of each input attribute in the generated model. The adjusted coefficients are shown in Table 6. In some models, the M5 method excluded the month variable. This behavior indicates a low importance of this variable in the MLR estimate. This result was similar to that observed in the analysis of the importance of the input variables in the SVM. The exclusion of lat and Tmax was also observed in some cases.
Table 6 Coefficients of the multiple linear regression models in SI, SII and SIII
| Scenario | Area | Input | β1 (lat) | β2 (lon) | β3 (alt) | β4 (month) | β5 (Tmax) | β6 (Tmean) | β7 (Tmin) | β8 (RH) | β0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SI | | I8 | -0.0208 | 0.0579 | 0.0016 | 0.0091 | 0.0758 | 0.2966 | -0.0396 | -0.02 | -1.8209 |
| SI | | I6 | -0.0222 | 0.0402 | 0.0013 | 0.0065 | - | 0.2972 | - | -0.0264 | -0.4453 |
| SI | | I3 | - | - | - | Ø | - | 0.2262 | - | -0.0234 | 0.4921 |
| SI | | I2 | - | - | - | - | - | 0.2262 | - | -0.0234 | 0.4921 |
| SII | Tho.1 (A, B4, B3, B2 and B1) | I8 | -0.0343 | 0.06 | 0.0012 | 0.0172 | 0.0633 | 0.3096 | -0.0532 | -0.0229 | -1.2174 |
| SII | Tho.1 (A, B4, B3, B2 and B1) | I6 | -0.0523 | 0.0294 | 0.0008 | 0.0159 | - | 0.2807 | - | -0.0278 | -0.7404 |
| SII | Tho.1 (A, B4, B3, B2 and B1) | I3 | - | - | - | 0.0159 | - | 0.2521 | - | -0.0234 | -0.0037 |
| SII | Tho.1 (A, B4, B3, B2 and B1) | I2 | - | - | - | - | - | 0.2498 | - | -0.0251 | 0.2709 |
| SII | Tho.2 (C2, C1 and D) | I8 | Ø | 0.0517 | 0.0017 | Ø | Ø | 0.3857 | -0.0447 | -0.0215 | -1.344 |
| SII | Tho.2 (C2, C1 and D) | I6 | Ø | 0.0511 | 0.0016 | Ø | - | 0.331 | - | -0.0247 | -0.6378 |
| SII | Tho.2 (C2, C1 and D) | I3 | - | - | - | Ø | - | 0.2858 | - | -0.0297 | -0.6149 |
| SII | Tho.2 (C2, C1 and D) | I2 | - | - | - | - | - | 0.2858 | - | -0.0297 | -0.6149 |
| SIII | K1 (Cwa and Cwb) | I8 | Ø | 0.0713 | 0.0013 | 0.0149 | 0.091 | 0.2515 | -0.0203 | -0.0231 | -0.1477 |
| SIII | K1 (Cwa and Cwb) | I6 | -0.0166 | 0.0461 | 0.0009 | 0.0122 | - | 0.2861 | - | -0.029 | 0.6759 |
| SIII | K1 (Cwa and Cwb) | I3 | - | - | - | 0.0128 | - | 0.256 | - | -0.0243 | -0.0213 |
| SIII | K1 (Cwa and Cwb) | I2 | - | - | - | - | - | 0.254 | - | -0.0257 | 0.2013 |
| SIII | K2 (Am and Aw) | I8 | -0.0428 | 0.0595 | 0.002 | Ø | 0.066 | 0.3543 | -0.0588 | -0.0139 | -3.4505 |
| SIII | K2 (Am and Aw) | I6 | Ø | 0.0316 | 0.0014 | Ø | - | 0.3329 | - | -0.0208 | -1.7699 |
| SIII | K2 (Am and Aw) | I3 | - | - | - | Ø | - | 0.3172 | - | -0.029 | -1.5306 |
| SIII | K2 (Am and Aw) | I2 | - | - | - | - | - | 0.3172 | - | -0.029 | -1.5306 |
Ø: input data excluded by the M5 method; -: input data not included in the input combination
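To show how the coefficients in Table 6 translate into an estimate, the following minimal sketch applies the SI model with the I2 input combination (Tmean and RH only). The units of the inputs (°C and %) are an assumption, since they are not stated in this section, and the function name and input values are hypothetical.

```python
# Coefficients for SI with the I2 input combination, copied from Table 6.
BETA_0 = 0.4921      # intercept (beta_0)
BETA_TMEAN = 0.2262  # coefficient of Tmean (beta_6)
BETA_RH = -0.0234    # coefficient of RH (beta_8)


def et0_mlr_si_i2(tmean, rh):
    """Estimate ET0 from mean air temperature and relative humidity.

    Assumed units (not confirmed by the source): tmean in deg C, rh in %.
    """
    return BETA_0 + BETA_TMEAN * tmean + BETA_RH * rh


# Hypothetical input values, for illustration only.
estimate = et0_mlr_si_i2(tmean=22.0, rh=70.0)
print(round(estimate, 4))
```

The negative RH coefficient and positive Tmean coefficient reproduce the inverse and direct relationships with ET0 discussed earlier. Because the M5 method excluded month in SI, the I3 and I2 rows of Table 6 yield the same equation.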
The results revealed that, for locations in the Minas Gerais state, the models can be used safely. The ANN and RF models are recommended for estimating ET0 when a wider range of input data is available, as they have a better predictive capacity in this situation. The SVM and MLR models are recommended in situations where only temperature and relative humidity data are available. Between these two models, however, MLR is recommended because it requires less computational effort. Although the models have a high predictive capacity, they cannot be perfect. Other meteorological variables not considered as input data (e.g., solar radiation, wind speed and vapour-pressure deficit) and other factors (e.g., data recorded in error) contributed to the decrease in the predictive capacity of these models.
No statistical method or machine learning model can produce results identical to the observed and/or recorded data. There will always be some error, no matter how small. Therefore, it is important that the meteorological stations function continuously. The models developed in this study are expected to support decision-making by different professionals, mainly farmers. Agricultural companies are responsible for a considerable part of the Brazilian gross domestic product (Brugnaro and Bacha 2006), and the Minas Gerais state had the third largest gross domestic product in Brazil in 2018 (IBGE 2020). The results of these models assist in irrigation management, climatic zoning and the construction of productivity models, among other applications. In addition, the approaches used in the present study have the potential to benefit the development of other types of models and studies in other regions.