Evaluating evapotranspiration using data mining instead of physical-based model in remote sensing

Precise estimates of plant water requirements and of evapotranspiration are crucial for determining the volume of water consumed in plant production. To estimate evapotranspiration over an extended area, remote sensing algorithms require numerous climatological variables; however, measurements of these variables cover only limited areas, which leads to erroneous calculations over extended areas. Combining data mining with remote sensing makes it possible to model the evapotranspiration process. In this research, the physical-based SEBAL evapotranspiration algorithm was remodeled using M5 decision tree equations in GIS. The input variables of the M5 decision tree were Albedo, emissivity, and the Normalized Difference Water Index (NDWI), which were defined as absorbed light, transformed light, and plant moisture, respectively. The best equations were extracted from the M5 decision tree model for 8 April 2019 and were then implemented in GIS using Python scripts for both 8 April 2019 and 3 April 2020. The calculated correlation coefficient (R2), mean absolute error (MAE), and root mean squared error (RMSE) were 0.92, 0.54, and 0.42, respectively, for 8 April 2019, and 0.95, 0.31, and 0.23 for 3 April 2020. For further evaluation of the model, a sensitivity analysis and an uncertainty analysis were carried out. The analyses revealed that evapotranspiration is more sensitive to Albedo than to the other two model inputs, and that applying data mining techniques instead of SEBAL yields a lower accuracy in the estimation of evapotranspiration.


Introduction
Precision agriculture has become highly important and is now applied in most agricultural sectors, especially irrigation. Agricultural activities are carried out with greater precision through a range of hardware and software technologies: hardware such as pressurized irrigation systems, drones, and the internet of things for crop monitoring, and software based on machine learning, deep learning, and data mining, particularly decision trees.
Using meteorological data to calculate evapotranspiration allows crop irrigation to be scheduled precisely and in a timely manner. The literature indicates that by applying satellite images and different algorithms, evapotranspiration can be estimated over an extended area, thus allowing an accurate irrigation schedule to be developed (Jaferian et al. 2019; Song et al. 2018; Goodarzi and Eslamian 2018; Diarraa et al. 2017; Colaizzi et al. 2017; Anderson et al. 2012; Abdolhosseini et al. 2012).
Estimating the rate of evapotranspiration is complicated because it involves many independent parameters. Evapotranspiration is a dependent parameter that is affected by many climatological parameters and crop conditions. A variety of equations have been developed to estimate it under different conditions, such as the FAO Penman-Monteith equation and the Blaney-Criddle formula, among others. Ground observations represent the results at one specific point, and great precision is required to generalize them over an extended region because evaporation data vary from one station to another. By applying remote sensing technologies, one may attain satisfactory precision for a specific extended region; moreover, the use of satellite images within the framework of remote sensing allows ground observations (hard data) to be transformed into soft data. Gibert et al. (2018) point out that, among the various data mining methods, the M5 decision tree has been successfully applied to estimate evapotranspiration over an extended area.
The current research focuses on establishing an innovative and highly applicable linear relationship, using data mining techniques, between independent remote sensing parameters (Albedo, emissivity, and normalized difference water index) and a dependent parameter (evapotranspiration) via an M5 decision tree.
Landsat 8 satellite images along with the SEBAL algorithm were used for estimating evapotranspiration in the study area, and the obtained data were further used for many accurate evapotranspiration estimations (Mhawej et al. 2020; Elnmer et al. 2019; Kong et al. 2019; Ochege et al. 2019; Gobbo et al. 2019; Kamali and Nazari 2018).
One of the input parameters for estimating the rate of evapotranspiration is land surface temperature (LST); however, the thermal band from which it is derived has a spatial resolution of 100 m. Applying the SEBAL algorithm with this band therefore yields an estimated evapotranspiration image with a 100 m spatial resolution. The aim of the current research is to enhance the spatial resolution using the M5 decision tree: since its input parameters have a 30 m spatial resolution, applying the equations obtained through the M5 decision tree yields an evapotranspiration map with a higher spatial resolution.
In the Southwest of Iran, which is extremely dry and arid, an extremely high volume of water is consumed by the sugarcane plantations over an extended area (more than 94,000 ha); hence, by spatially enhancing the image resolution for evaluating evapotranspiration, optimal irrigation programming can be calculated more precisely.

Study area
This study was conducted at the Amir-Kabir Agro-Industry Sugarcane plantation located in the Southwest of Iran (Fig. 1). The soil texture is clay-loam, and the annual average evapotranspiration for a 20 year period was 3331.812 mm. The total area of the Amir-Kabir agro-industry is over 17,000 ha of which 14,000 ha is cultivated. Each farm is 25 ha in extent, having a low-pressure hydro flume irrigation system along with a subsurface drainage system at 40 m spacing and 1.8 m depth for each drain tile. The total irrigation water consumption is 3,000 mm, and peak irrigation consumption occurs in July.

Landsat 8 satellite
Landsat 8 OLI satellite images provided the main input data for the remote sensing procedure (http://glovis.usgs.gov). The thermal bands, however, have a lower resolution than the optical bands. In the case of Landsat 8, band 10 is a thermal band with the coarsest spatial resolution (100 m). Thermal bands are critical for estimating the evapotranspiration rate, and Landsat 8 provides the most appropriate thermal band for estimating agricultural evapotranspiration over an extended region. The 14,000 ha Amir-Kabir agro-industry plantation is large enough for evapotranspiration to be calculated from remote sensing images.

Ground measurements
Any estimation of ET requires meteorological data; the meteorological data applied in this study were obtained from the local weather station of the Amir-Kabir agro-industry plantation. The data used to calculate the evapotranspiration rate included maximum and minimum temperature, relative humidity, wind speed, and hours of sunshine. Ref-et software was used to determine the reference points for the evapotranspiration calculations.

SEBAL algorithm
The SEBAL algorithm, which is known as a one-source algorithm for the calculation of evapotranspiration, was used to calculate the rate of evapotranspiration in the sugarcane fields of the Amir-Kabir agro-industry plantation. The energy balance equation (Eq. 1) is the primary equation used in this algorithm:
$$\lambda ET = R_n - G - H \tag{1}$$
where λET is the latent heat flux (W/m2), Rn is the net radiation at the surface (W/m2), G is the soil heat flux (W/m2), and H is the sensible heat flux to the air (W/m2) (Bastiaanssen et al. 1998).
The net radiation (Rn) is the balance between the incoming and outgoing shortwave and longwave components, as shown in Eq. 2 (Bastiaanssen et al. 1998):
$$R_n = (1 - \alpha) R_{s\downarrow} + R_{L\downarrow} - R_{L\uparrow} - (1 - \varepsilon_o) R_{L\downarrow} \tag{2}$$
where Rs↓ is the incoming shortwave radiation (W/m2), RL↓ is the incoming longwave radiation emitted by the atmosphere (W/m2), RL↑ is the outgoing longwave radiation (W/m2), α is the surface Albedo, also known as the reflection coefficient, and εo is the broadband surface emissivity.
The incoming shortwave radiation Rs↓ is estimated from the radiation received at the top of the atmosphere and was calculated as per Eq. 3 (Allen et al. 2002):
$$R_{s\downarrow} = G_{sc} \times \cos\theta \times d_r \times \tau_{sw} \tag{3}$$
where Gsc is the solar constant (1367 W/m2), cos θ is the cosine of the solar zenith angle, dr is the inverse relative squared distance of the Earth to the Sun, and τsw is the atmospheric transparency factor.
The Stefan-Boltzmann equation (Eq. 4) was used to calculate the incoming longwave radiation emitted by the atmosphere (Allen et al. 2002):
$$R_{L\downarrow} = \varepsilon_a \times \sigma \times T_a^4 \tag{4}$$
where εa is the atmospheric emissivity (dimensionless), σ is the Stefan-Boltzmann constant (5.67 × 10−8 W/m2/K4), and Ta is the air temperature in K.
The Stefan-Boltzmann equation was also applied to estimate the longwave radiation emitted by the surface (Eq. 5) (Allen et al. 2002):
$$R_{L\uparrow} = \varepsilon_o \times \sigma \times T_s^4 \tag{5}$$
where εo is the broadband surface emissivity (dimensionless), σ is the Stefan-Boltzmann constant (5.67 × 10−8 W/m2/K4), and Ts is the surface temperature in K. The surface temperature was calculated through Eq. 6:
$$T_s = \frac{K_2}{\ln\left(\dfrac{\varepsilon_{NB} K_1}{\rho_{10}} + 1\right)} \tag{6}$$
where K1 and K2 are the constant coefficients of the Landsat 8 image, equal to 774.88 and 1321.0789, respectively, εNB = 1 (Allen et al. 2002), and ρ10 is the corrected thermal band (band 10) of the Landsat 8 image.
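As a small numerical illustration of the surface temperature and outgoing longwave steps, the sketch below assumes the usual inverse-Planck form Ts = K2 / ln(εNB·K1/ρ10 + 1) and the Stefan-Boltzmann law RL↑ = εo·σ·Ts^4. The constants are those quoted in the text; the function names and the sample radiance value are hypothetical.

```python
import math

# Landsat 8 band 10 constants and Stefan-Boltzmann constant, as quoted in the text
K1 = 774.88
K2 = 1321.0789
SIGMA = 5.67e-8  # W/m2/K4


def surface_temperature(corrected_radiance, e_nb=1.0):
    """Surface temperature in K from the corrected band 10 radiance (assumed inverse-Planck form)."""
    return K2 / math.log(e_nb * K1 / corrected_radiance + 1.0)


def outgoing_longwave(ts_kelvin, surface_emissivity):
    """Outgoing longwave radiation in W/m2 from the Stefan-Boltzmann law."""
    return surface_emissivity * SIGMA * ts_kelvin ** 4


# Hypothetical pixel: corrected radiance of 10.0 and broadband emissivity of 0.98
ts = surface_temperature(10.0)
rl_up = outgoing_longwave(ts, 0.98)
```

For a hypothetical radiance of 10.0, the resulting temperature is close to 303 K, which is a plausible value for a warm, dry surface.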
The Albedo coefficient was calculated using Eq. 7:
$$\alpha = \frac{\alpha_{toa} - \alpha_{path\_radiance}}{\tau_{sw}^2} \tag{7}$$
where αtoa is the Albedo at the top of the atmosphere, αpath_radiance is the average portion of the incoming solar radiation, taking into account all bands, that is backscattered to the satellite before it reaches the earth's surface, and τsw is the atmospheric transmissivity (Allen et al. 2002).
Soil heat flux (G) is the other main parameter of the energy balance equation and represents the heat storage in the equilibrium (Bastiaanssen et al. 1998). The ratio of G to Rn is taken as a constant value, and the best values obtained for G/Rn lie between 0.3 and 0.6. Eq. 8 is used for calculating G (Bastiaanssen et al. 1998):
$$\frac{G}{R_n} = \frac{T_s}{\alpha}\left(0.0038\alpha + 0.0074\alpha^2\right)\left(1 - 0.98\,NDVI^4\right) \tag{8}$$
where Ts is the surface temperature and NDVI is the normalized difference vegetation index, which is calculated using Eq. 9 (Allen et al. 2002):
$$NDVI = \frac{\rho_{NIR} - \rho_R}{\rho_{NIR} + \rho_R} \tag{9}$$
where ρNIR and ρR are the reflectances of the near-infrared band and the red band, respectively. It was observed that the NDVI of the study area is greater than zero due to plant cultivation.
NDVI ranges between −1 and +1. Areas with vegetation have an NDVI between zero and one, whereas water and clouds usually have an NDVI of less than zero. If the NDVI value is less than zero, the area is assumed to be water and the G/Rn ratio is taken as 0.5. Areas with Ts less than 4 °C and α greater than 0.45 are considered snow-covered, and the G/Rn ratio for these areas is also taken as 0.5 (Allen et al. 2002).
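The NDVI rules above can be sketched per pixel as follows. This is an illustration only: the empirical expression used for vegetated pixels is the form commonly given in the SEBAL literature (with Ts in °C), which is an assumption here, and the function names are hypothetical.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red reflectance."""
    total = nir + red
    return (nir - red) / total if total != 0 else 0.0


def g_over_rn(ts_celsius, albedo, ndvi_value):
    """G/Rn ratio for one pixel, following the water and snow rules in the text."""
    if ndvi_value < 0:                        # water
        return 0.5
    if ts_celsius < 4.0 and albedo > 0.45:    # snow-covered
        return 0.5
    # Assumed empirical form for land pixels (Ts in degrees Celsius)
    return (ts_celsius / albedo) * (0.0038 * albedo + 0.0074 * albedo ** 2) * (
        1.0 - 0.98 * ndvi_value ** 4
    )


# Hypothetical vegetated pixel
ratio = g_over_rn(30.0, 0.2, ndvi(0.5, 0.1))
```

Keeping the water and snow checks ahead of the empirical expression mirrors the order in which the text states the rules.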
In the SEBAL method, two pixels are selected to estimate the sensible heat flux (H). The first, called the cold pixel, is a well-irrigated area completely covered by plants. The surface temperature of this pixel is close to the air temperature, and its evapotranspiration equals the reference evapotranspiration. The second, called the hot pixel, is dry land without vegetation, so the latent heat flux of evaporation in this pixel is assumed to be zero (Sanaei Nezhad et al. 2011).
In the SEBAL method, the sensible heat flux of the other pixels is estimated according to Eq. 1 and based on the evapotranspiration values of these two pixels. The value of the sensible heat flux is calculated from Eq. 10 (Allen et al. 2002):
$$H = \frac{\rho_{air} \times C_{air} \times dT}{r_{ah}} \tag{10}$$
where ρair is the air density (kg/m3), Cair is the specific heat of air (1004 J/kg/K), dT (K) is the temperature difference (T1 − T2) between two heights (Z1 and Z2), and rah is the aerodynamic resistance to heat transport (s/m).
Aerodynamic resistance to heat transport is calculated by applying Eq. 11:
$$r_{ah} = \frac{\ln\left(Z_2 / Z_1\right)}{u^* \times k} \tag{11}$$
where Z1 and Z2 are the two heights, k is the von Karman constant (0.41), and u* is the friction velocity, which is calculated through Eq. 12:
$$u^* = \frac{k \times u_{200}}{\ln\left(200 / Z_{om}\right)} \tag{12}$$
where u200 (blended wind speed) is the wind speed at a height of 200 m at the weather station. Zom is empirically estimated from the vegetation height and is calculated through the leaf area index (LAI) with Eq. 13:
$$Z_{om} = 0.018 \times LAI \tag{13}$$
The LAI is determined from the soil-adjusted vegetation index (SAVI) with Eq. 14:
$$LAI = -\frac{\ln\left(\dfrac{0.69 - SAVI}{0.59}\right)}{0.91} \tag{14}$$
and the SAVI is calculated with Eq. 15:
$$SAVI = \frac{(1 + L)\left(\rho_{NIR} - \rho_R\right)}{L + \rho_{NIR} + \rho_R} \tag{15}$$
where L is a correction factor for the background soil.
The near-surface temperature difference used in Eq. 10 is taken as a linear function of Ts, as presented in Eq. 16:
$$dT = a + b\,T_s \tag{16}$$
where dT is the near-surface air temperature difference, Ts is the surface temperature, and a and b are empirical coefficients.
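From the latent heat flux term of the main equation (Eq. 1), the instantaneous evapotranspiration of each pixel (ETinst, mm/h) is calculated by Eq. 17:
$$ET_{inst} = 3600\,\frac{\lambda ET}{\lambda} \tag{17}$$
where λ is the latent heat of vaporization (J/kg), which can be computed as Eq. 18:
$$\lambda = \left[2.501 - 0.00236\,(T_s - 273)\right] \times 10^6 \tag{18}$$
The main aim of using the SEBAL algorithm is to obtain ET24 for each pixel using Eq. 19 (Allen et al. 2002):
$$ET_{24} = ET_rF \times ET_{r\text{-}24} \tag{19}$$
where ETrF is the reference ET fraction (the ratio of ETinst to the reference ET at the time of the image) and ETr-24 is the total ETr during a 24-h period of the same day.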
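The final steps of SEBAL can be illustrated with a short worked example. The sketch assumes the forms commonly given in Allen et al. (2002): λET = Rn − G − H, ETinst = 3600·λET/λ, λ = [2.501 − 0.00236(Ts − 273)] × 10^6, and ET24 = (ETinst/ETr) × ETr-24. The function names and all input values are hypothetical.

```python
def latent_heat_of_vaporization(ts_kelvin):
    """Latent heat of vaporization in J/kg (assumed empirical form)."""
    return (2.501 - 0.00236 * (ts_kelvin - 273.0)) * 1e6


def instantaneous_et(rn, g, h, ts_kelvin):
    """Instantaneous ET in mm/h from the energy balance residual (fluxes in W/m2)."""
    latent_heat_flux = rn - g - h
    return 3600.0 * latent_heat_flux / latent_heat_of_vaporization(ts_kelvin)


def daily_et(rn, g, h, ts_kelvin, etr_inst, etr_24):
    """Daily ET in mm/day, scaling the reference ET fraction by the 24-h reference ET."""
    etr_fraction = instantaneous_et(rn, g, h, ts_kelvin) / etr_inst
    return etr_fraction * etr_24


# Hypothetical pixel: Rn = 600, G = 100, H = 150 W/m2, Ts = 300 K,
# reference ET of 0.7 mm/h at image time and 8.0 mm over the day
et_inst = instantaneous_et(600.0, 100.0, 150.0, 300.0)
et_24 = daily_et(600.0, 100.0, 150.0, 300.0, etr_inst=0.7, etr_24=8.0)
```

With these made-up inputs the instantaneous value is roughly 0.52 mm/h and the daily value is just under 6 mm/day.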
For this study, two cloud-free satellite images were obtained for 8 April 2019 and 3 April 2020. Actual evapotranspiration (ET a ) maps in mm/day are generated by the SEBAL algorithm for each day.

Data mining
Data mining (DM) algorithms are the most fundamental components in data science analysis. Gibert et al. (2018) state that certain DM techniques such as artificial neural networks, clustering, and case-based reasoning or Bayesian networks have been successfully applied in environmental modeling.
The decision tree method selects the explanatory variables with the highest discriminatory power with respect to the response variable and then iteratively subdivides the training sample by building a tree in which the internal nodes are associated with the input variables and the corresponding branches with the possible values of each variable (Gibert et al. 2018). The M5 model tree (introduced by Quinlan in 1992) has linear regression functions at the leaf nodes, which establish a relationship between the input and output variables. The data are split into subsets, and a decision tree is created. The splitting criterion treats the standard deviation of the class values that reach a node as a measure of the error at that node and calculates the expected reduction in this error as a result of testing each attribute at that node. The standard deviation reduction (SDR) is calculated as per Eq. 20 (Quinlan 1992):
$$SDR = sd(T) - \sum_i \frac{|T_i|}{|T|}\, sd(T_i) \tag{20}$$
where T is the set of data that reaches the node, Ti is the subset of data that has the ith outcome of the potential set, and sd is the standard deviation (Rahimikhoob et al. 2013; Wang and Witten 1997). The data in the offspring nodes are purer than those in the parent nodes because their standard deviation is smaller. After scanning all the possible splits, the M5 tree selects the one that maximizes the expected error reduction.
Every inner node of a tree can produce a multiple linear regression model using the data associated with that node and all the attributes that are used in tests in the sub-tree rooted at that node. The linear regression models are simplified provided that the results have a lower expected error (Etemad-Shahidi and Bonakdar 2009). Figure 2 shows an M5 decision tree structure for two input parameter domains of X1 and X2 with 4 linear models from Y1 to Y4.
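The splitting criterion can be illustrated with a minimal sketch that scans the candidate thresholds of one attribute and keeps the split with the largest standard deviation reduction, SDR = sd(T) − Σ (|Ti|/|T|)·sd(Ti). The function names and the toy data are hypothetical, and the population standard deviation is an assumed choice.

```python
from statistics import pstdev


def sdr(parent, subsets):
    """Standard deviation reduction of a candidate split."""
    n = len(parent)
    return pstdev(parent) - sum(len(s) / n * pstdev(s) for s in subsets)


def best_split(xs, ys):
    """Return (threshold, sdr) of the binary split on xs that maximizes the SDR of ys."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_gain = None, float("-inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical attribute values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = sdr(ys, [left, right])
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Toy example: Albedo values as the attribute, ET values as the target
threshold, gain = best_split([0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 5.0, 5.0])
```

In this toy case the split between 0.2 and 0.3 leaves both subsets with zero standard deviation, so it removes all of the parent node's spread.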

Model inputs and output
Estimating evapotranspiration with the SEBAL algorithm requires meteorological data including temperature, humidity, and wind speed, among others. Some of the inputs used in the SEBAL algorithm, such as Albedo and emissivity, are affected by land surface temperature; thus, in order to minimize the number of variables in data mining, Albedo and emissivity were chosen as inputs of the M5 decision tree, since these two parameters are easily accessible and reflect temperature variations better than most. Furthermore, transpiration depends on the moisture of the plant, which in the data mining calculations is represented by a vegetation index, the normalized difference water index (NDWI). Consequently, Albedo and emissivity are defined as the light absorbed and the light transformed to the atmosphere, respectively, and NDWI represents plant moisture. By analogy with the basic SEBAL equation, ET = Rn − G − H, the formulation ET = a(Albedo) − b(emissivity) − c(NDWI) was adopted, in which the constants a, b, and c were obtained by applying the M5 decision tree model. The three inputs of the M5 decision tree are explained below in greater detail.

Evapotranspiration
One of the crucial parameters for estimating evapotranspiration is land surface temperature (LST), a factor upon which radiation and the exchange of energy flux between the earth's surface and atmosphere depend. Rechid et al. 2009); in addition, vegetation also strongly affects atmospheric properties through evapotranspiration (Gordon et al. 2005). Atmospheric emissivity is in itself determined by atmospheric water vapor pressure (Staley and Jurica 1972; Brutsaert 1975).

Albedo
Zhang et al. (2017) define Albedo as a dimensionless diffuse reflectivity or reflecting power of a surface and consider it an important parameter for digital climate models and surface energy balance equations. This parameter is one of the decision tree inputs for calculating evapotranspiration. Surface Albedo is computed by correcting αtoa for atmospheric transmissivity with Eq. 21:
$$\alpha = \frac{\alpha_{toa} - \alpha_{path\_radiance}}{\tau_{sw}^2} \tag{21}$$
where αtoa is the Albedo at the top of the atmosphere, αpath_radiance is the average portion of the incoming solar radiation, considering all bands, that is back-scattered to the satellite before it reaches the earth's surface, and τsw is the atmospheric transmissivity (Allen et al. 2002).

Emissivity
The surface emissivity is the ratio of the radiation actually emitted by a surface to that emitted by a black body at the same surface temperature (Allen et al. 2002). Surface emissivity is an important variable for estimating land surface temperature and determining the longwave surface energy balance (Mira et al. 2010). Sobrino et al. (2004) proposed an NDVI-based emissivity for three different cases (Eq. 22), where εv is the vegetation canopy emissivity and εs is the bare soil emissivity; in this paper, εv = 0.986 and εs = 0.973. The effects of the geometrical distribution of the natural surfaces are measured as dε in Eq. 22. Pv is the vegetation proportion obtained according to Carlson and Ripley (1997) as per Eq. 23:
$$P_v = \left(\frac{NDVI - NDVI_S}{NDVI_V - NDVI_S}\right)^2 \tag{23}$$
where NDVIS is the minimum NDVI, for bare soil over the study region, and NDVIV is the highest NDVI, for a fully vegetated pixel. The emissivity of land surfaces can differ significantly due to vegetation, surface moisture, and roughness (Nerry et al. 1988; Salisbury and D'Aria 1992).

Vegetation index
The normalized difference water index (NDWI) is a spectral index that represents crop moisture. NDWI was used to estimate the equivalent water thickness of the vegetation canopy (Yilmaz et al. 2008). The NDWI uses two infrared bands, one with a central wavelength of about 0.86 μm (NIR) and one with a central wavelength of about 1.24 μm (SWIR). The calculation is represented as Eq. 24:
$$NDWI = \frac{\rho_{NIR} - \rho_{SWIR}}{\rho_{NIR} + \rho_{SWIR}} \tag{24}$$
The M5 decision tree model uses Albedo, emissivity, and the vegetation index as inputs, and after performing the data mining process on these data, it extracts the linear equations. After applying the linear equations, a map with an enhanced spatial resolution was obtained as output. Figure 3 shows the flowchart of the M5 decision tree and the SEBAL algorithm.

Statistical analysis
The application of the three inputs (Albedo, emissivity, and NDWI) allows for the evaluation of the M5 decision tree. The accuracy of the M5 decision tree model and of the final evapotranspiration map produced with it was evaluated through statistical indices, namely the correlation coefficient (R2), the mean absolute error (MAE), and the root mean squared error (RMSE), where N is the number of data, ETo is the observed evapotranspiration value calculated via the SEBAL algorithm, and ETp is the evapotranspiration value predicted by the M5 decision tree model.
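A minimal sketch of these three indices is given below. It assumes the conventional definitions: MAE = (1/N) Σ|ETo − ETp|, RMSE = sqrt((1/N) Σ(ETo − ETp)^2), and R2 taken as the squared Pearson correlation between ETo and ETp. The function names and the sample values are hypothetical.

```python
import math


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def r_squared(observed, predicted):
    """Squared Pearson correlation coefficient (assumed definition of R2)."""
    mean_o = sum(observed) / len(observed)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


# Hypothetical SEBAL (observed) and M5 (predicted) ET values in mm/day
et_o = [4.0, 5.0, 6.0, 7.0]
et_p = [4.5, 5.0, 5.5, 7.0]
scores = (r_squared(et_o, et_p), mae(et_o, et_p), rmse(et_o, et_p))
```

Under these definitions the RMSE of a data set can never be smaller than its MAE, which is a quick consistency check when reporting both.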

Uncertainty analysis
The uncertainty analysis was based on the percentage of observed data bracketed by the 95PPU (95 percent prediction uncertainty) and on the average distance d̅ between the upper and the lower 95PPU (or the degree of uncertainty), calculated using Eq. 28 (Abbaspour et al. 2007):
$$\bar{d} = \frac{1}{k} \sum_{l=1}^{k} \left(X_U - X_L\right)_l \tag{28}$$
where k is the number of observed data, and XL and XU are the 2.5th and 97.5th percentiles of the cumulative distribution of every estimated data point. If 100% of the observed data are bracketed by the 95PPU and d̅ is close to zero, the results are in an acceptable range of uncertainty; hence, the d-factor was calculated using Eq. 29:
$$d\text{-}factor = \frac{\bar{d}}{\sigma_X} \tag{29}$$
where σX is the standard deviation of the measured variable; a d-factor of less than 1 is desirable (Abbaspour et al. 2007).
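The two uncertainty criteria can be sketched as follows. The example assumes that an ensemble of estimates is available for every observed value, from which the 2.5th and 97.5th percentiles are taken; how such an ensemble is produced is not specified here, so the inputs, the function names, and the use of the population standard deviation are all assumptions.

```python
from statistics import pstdev


def percentile(values, q):
    """Percentile by linear interpolation between ordered values (q in 0..100)."""
    ordered = sorted(values)
    pos = q / 100 * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def uncertainty(observed, ensembles):
    """Return (% of observations inside the 95PPU, average band width, d-factor)."""
    bands = [(percentile(e, 2.5), percentile(e, 97.5)) for e in ensembles]
    inside = sum(lo <= obs <= hi for obs, (lo, hi) in zip(observed, bands))
    bracketed = 100.0 * inside / len(observed)
    d_bar = sum(hi - lo for lo, hi in bands) / len(bands)
    return bracketed, d_bar, d_bar / pstdev(observed)


# Hypothetical data: two observations, each with an ensemble of 101 estimates (0..100)
bracketed, d_bar, d_factor = uncertainty([50.0, 120.0], [list(range(101)), list(range(101))])
```

In this made-up case only one of the two observations falls inside its band, and the d-factor is well above 1, which would indicate an unacceptably wide uncertainty band.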

Sensitivity analysis
The sensitivity coefficient (S) is a dimensionless index calculated as the ratio of the change in output to the change in an input, provided that the other variables remain constant. The sensitivity of the dependent variable (evapotranspiration) to a particular independent variable X (Albedo, emissivity, or NDWI) can be calculated from the derivative of evapotranspiration with respect to that variable, that is, ∂ETa/∂X (Beven 1979; McCuen 1974). To evaluate the sensitivity of a variable, S was divided into four classes from least to greatest sensitivity (Table 1). The higher the class to which S belongs, the greater the impact of the independent variable on the dependent variable.
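One common way to make the derivative dimensionless is the relative sensitivity S = (∂ET/∂X)·(X/ET), which is assumed in the sketch below and approximated with a small finite difference. The linear model and its coefficients are hypothetical placeholders, not equations from this study.

```python
def relative_sensitivity(model, inputs, name, rel_step=0.01):
    """Finite-difference estimate of (dET/dX) * (X/ET) for one input, others held constant."""
    base = model(**inputs)
    perturbed = dict(inputs)
    perturbed[name] = inputs[name] * (1.0 + rel_step)
    return ((model(**perturbed) - base) / base) / rel_step


def toy_et(albedo, emissivity, ndwi):
    """Hypothetical linear ET model used only to exercise the function."""
    return 6.0 + 10.0 * albedo - 2.0 * emissivity - 1.0 * ndwi


pixel = {"albedo": 0.2, "emissivity": 0.98, "ndwi": 0.3}
sensitivities = {name: relative_sensitivity(toy_et, pixel, name) for name in pixel}
```

For this toy model the Albedo term gives the largest positive coefficient, while the emissivity and NDWI coefficients are negative because those terms are subtracted.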

Results

M5 decision tree
Instead of using a physical-based model to estimate the amount of evapotranspiration, the M5 decision tree was applied as a data mining model that could increase the spatial resolution of the evapotranspiration map. Using the data mining model, several equations were obtained for estimating the amount of evapotranspiration on a single day (8 April 2019). The extracted equations were then applied to estimate the amount of evapotranspiration in the following year (3 April 2020). The values obtained were compared with the amount of evapotranspiration calculated by the SEBAL algorithm on that particular day. The results are discussed below.

Inputs
The inputs to the M5 decision tree consisted of four raster images derived from the satellite data with the SEBAL algorithm: Albedo, emissivity, normalized difference water index (NDWI), and estimated evapotranspiration. Figure 4 shows the Albedo maps for 8 April 2019 and 3 April 2020.
According to Fig. 4, the calculated Albedo for 8 April 2019 is higher than that for 3 April 2020 because of the drought conditions in 2019; Albedo is itself influenced by atmospheric vapor and temperature (Feng and Zou 2019). Figure 5 shows the emissivity maps obtained for 8 April 2019 and 3 April 2020. Emissivity is the most important input variable for the M5 decision tree because this variable is required for land surface temperature calculations and is affected by the percentage of vegetation.
According to Fig. 5, both maps have approximately the same maximum and minimum range, yet the amount of emissivity varies on the same farm when comparing both years. Emissivity variations depend on the extent and type of cultivation of farms. In April 2019, based on the type of agricultural activities in the area and managerial decisions taken, the western part of the plantation had more cultivated sugarcane (Fig. 5a), and in the next year (Fig. 5b), there was less cultivated sugarcane which most probably affected the amount of emissivity.
Vegetation moisture also plays an important role in the calculation of evapotranspiration, and the normalized difference water index (NDWI) was applied to determine the vegetation moisture in the sugarcane farms. Figure 6 shows the NDWI for 8 April 2019 and 3 April 2020. On 8 April 2019, the vegetation was under greater moisture stress due to drought conditions, whereas on 3 April 2020, the vegetation moisture shows an improved condition compared to 2019.
The climatological variables of wind velocity, relative humidity, and temperature were taken into account in calculating the amount of evapotranspiration; furthermore, these parameters were applied as fundamental inputs in the SEBAL algorithm. One of the input variables of the M5 decision tree, which serves as its target variable, is the evapotranspiration raster map for 2019 and 2020 derived as an output of the SEBAL algorithm.

Output
The values in parentheses under each label in the leaves of the decision tree indicate the number of segments resulting from the corresponding threshold. The second value indicates the number of times a misclassification occurred (Vieira et al. 2012). Figure 7 shows the decision tree for the evapotranspiration of 8 April 2019. Twenty different equations were extracted, with a correlation coefficient of 0.9429, a mean absolute error of 0.4749, and a root mean squared error of 0.6479.
Using a minimum of input parameters, specifically Albedo, emissivity, and NDWI, a number of evapotranspiration equations were derived. For the evapotranspiration of 8 April 2019, 20 conditional equations were obtained. According to Fig. 7, the Albedo input variable is located at the top of the decision tree, and the divisions are based on Albedo values, indicating that Albedo is highly important in calculating the amount of evapotranspiration. In the first leaf (first equation), the divisions are based on Albedo; as a consequence, as per Fig. 7, the majority of the equations are derived from Albedo. Given the geographical location of the study area, a high amount of light is received in the area, and the Albedo input variable is thus defined as absorbed light. The Albedo-based divisions of the M5 decision tree show that the amount of absorbed light plays a major role in the calculation of the evapotranspiration rate in this area.
The NDWI variable, defined as plant moisture, plays the next most important role in the evapotranspiration calculations after Albedo. This variable shows that plant moisture, in addition to absorbed light, is essential in producing the decision tree equations.
The emissivity variable, defined as diffused light, had less significance in the decision tree equations, given the geographical location of the study area.
The equations obtained by applying the M5 decision tree and the Python scripts used in the ArcMap environment are presented in Appendices 1 and 2, respectively, at the end of this article.

Combining M5 and GIS
After deriving the most appropriate equations from the M5 decision tree model, the equations were applied using Python scripts for faster and more accurate calculations (available by email from the corresponding author). The equations were obtained using evapotranspiration data from 8 April 2019 and were first applied to the input variables from the same date to check whether the extracted equations perform acceptably. Figure 8 shows the evapotranspiration maps calculated by the SEBAL algorithm and the M5 decision tree for 8 April 2019 and 3 April 2020. According to Fig. 8a and c, the evapotranspiration maps for 2019 show few differences in the calculated evapotranspiration amounts. Figure 8b shows the evapotranspiration rate calculated using the SEBAL algorithm for 2020, and Fig. 8d shows the evapotranspiration calculated using the M5 decision tree equations extracted from the 2019 SEBAL evapotranspiration map. A comparison of Fig. 8b and d shows that the equations mined from the M5 decision tree of 2019, when applied to the input variables of 3 April 2020, provide acceptable results, since the disparity between the evapotranspiration map calculated by the SEBAL algorithm and that calculated using the M5 decision tree for 2020 is small. Figure 9 shows the results of the comparison between the SEBAL algorithm and the M5 decision tree for 8 April 2019 and 3 April 2020, and Table 2 shows the corresponding statistical coefficients. According to Fig. 9 and Table 2, the high correlation coefficients and low errors obtained for the two years show that it is possible to calculate the amount of evapotranspiration using fewer input variables while focusing on the most important ones. As per Table 2, the calculated rate of evapotranspiration for 3 April 2020 shows better results than that for 8 April 2019. This is due to the differences between the input variables over the 2 years, which cause the derived equations to show better accuracy for 3 April 2020.
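The way such conditional linear equations are applied pixel by pixel can be sketched as follows. This is only an illustration: the Albedo thresholds, intercepts, and coefficients are hypothetical placeholders rather than the 20 equations of Appendix 1, and plain Python lists stand in for the GIS rasters.

```python
# Hypothetical leaf models: (upper Albedo bound, (intercept, a, b, c)).
# These numbers are placeholders, not the equations extracted in this study.
LEAVES = [
    (0.20, (6.0, 10.0, 2.0, 1.0)),
    (float("inf"), (5.0, 5.0, 1.0, 0.5)),
]


def m5_et(albedo, emissivity, ndwi):
    """Evaluate the linear model of the first leaf whose Albedo range contains the pixel."""
    for upper, (intercept, a, b, c) in LEAVES:
        if albedo <= upper:
            return intercept + a * albedo - b * emissivity - c * ndwi
    raise ValueError("no leaf matched")


def m5_et_map(albedo_grid, emissivity_grid, ndwi_grid):
    """Apply the leaf models to three co-registered grids of equal shape."""
    return [
        [m5_et(al, em, nd) for al, em, nd in zip(row_al, row_em, row_nd)]
        for row_al, row_em, row_nd in zip(albedo_grid, emissivity_grid, ndwi_grid)
    ]


# One-row example with two pixels that fall into different leaves
et_map = m5_et_map([[0.15, 0.30]], [[0.98, 0.98]], [[0.30, 0.30]])
```

Because every leaf is a simple linear expression, the same rules can be evaluated directly on the 30 m input rasters, which is what allows the output map to keep their resolution.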

Model evaluation
Using an uncertainty analysis in tandem with a sensitivity analysis, the rate of evapotranspiration obtained using the data mining model was evaluated. The uncertainty analysis of the obtained model was based on two criteria, the 95PPU and the d-factor. By increasing the observed data in the 95PPU level while decreasing the average values of the upper and lower bands (less than the standard deviation), while taking into consideration that the bracketed value 95PPU must be in the maximum range, it was observed that … distribution of the output variables. When 80% of the calculated data (calculated through the data mining model) are within the 95PPU level, it clearly indicates that they are of a higher quality. Table 3 shows the uncertainty coefficients for 8 April 2019 and 3 April 2020. As observed in the table, the obtained d-factor is less than 1 for both images, and the 95PPU is more than 80%. The calculated uncertainty coefficients show that the data mining model has a superior quality for estimating the amount of evapotranspiration using fewer variables and thus can be applied for calculating the amount of evapotranspiration with acceptable certainty.
Table 4 shows that the sensitivity coefficient varies among the variables. The Albedo variable has the greatest sensitivity for calculating evapotranspiration in both images, for 8 April 2019 and 3 April 2020. The study area is located in a region that is approximately at sea level and receives a great deal of solar radiation. The net radiation (Rn) is among the major variables for calculating the amount of evapotranspiration using the SEBAL algorithm, and Albedo is capable of representing this variable as one of the data mining model inputs. Moreover, the decision tree obtained for this research (Fig. 7) shows that most of the classifications were derived from the Albedo variable incorporated within the nodes of the decision tree, which clarifies the major role that Albedo has in deriving the equations due to its high sensitivity. The other two parameters have less sensitivity in the calculation of the evapotranspiration rate for 2020. The NDWI has less sensitivity due to the late start of irrigation of the cultivated sugarcane fields compared with the previous year, i.e., 2019.

Discussion
Climatological variables, which are part of a physical process, can affect the calculations for determining the amount of evapotranspiration. Measuring evapotranspiration over an extended area using remote sensing is a practical procedure and can be done with physical-based algorithms like SEBAL. By transforming a physical-based model into a data mining model, the number of input variables can be decreased so that only the major variables are applied. An M5 decision tree allows a physical-based evapotranspiration algorithm to be transformed into a data mining model. The three inputs of the decision tree were Albedo, emissivity, and NDWI, and the target function was the evapotranspiration value obtained by the SEBAL algorithm. Moreover, the possibility of calculating the amount of evapotranspiration using linear regression equations obtained from the M5 decision tree was considered. In this case, a complicated evapotranspiration process that depends on many variables could be described by equations derived through simple linear regression. Another objective of the current research was to apply the general equation ET = a(Albedo) − b(emissivity) − c(NDWI), for which the constant values were obtained using data mining procedures. In the basic SEBAL energy balance equation, λET = Rn − G − H, evapotranspiration is calculated as the absorbed energy (net radiation at the surface) minus the transmitted energy (soil heat flux and sensible heat flux to the air); so, instead of energy labels (heat fluxes), light labels were used in the main equation, and evapotranspiration was calculated as the absorbed light (Albedo) minus the transmitted light (emissivity), on the basis of the theory that light is another form of energy. Therefore, by applying the M5 decision tree, the most acceptable values were obtained for calculating the rate of evapotranspiration using data mining methods.

Conclusion
The main objective of the current research was to replace a physical-based model with a data mining model, since physical-based models need more input variables than data mining models, which use fewer input variables yet obtain acceptable results. Using three input variables, namely Albedo, emissivity, and NDWI, mathematical equations were extracted from the M5 decision tree. These three variables were selected by analogy with the basic SEBAL algorithm equation, ET = Rn − G − H. The main concept is to calculate evapotranspiration as ET = a(Albedo) − b(emissivity) − c(NDWI), in which the constant values were obtained from the M5 decision tree model. The extracted equations were first applied to 8 April 2019, and the resulting evapotranspiration values closely parallel those of the SEBAL algorithm. Furthermore, the equations extracted from 2019 were applied to 3 April 2020, and the results again closely matched the SEBAL evapotranspiration values, which makes them acceptable; however, when data mining methods are applied instead of SEBAL, the results obtained show a lower accuracy. In conclusion, evapotranspiration can be calculated over an extended area by using the SEBAL algorithm in tandem with a data mining model, and the M5 decision tree uses fewer inputs; furthermore, the research has also shown that light labels can be applied instead of energy labels.