Multi-step-ahead soil temperature forecasting at multiple-depth based on meteorological data: Integrating resampling algorithms and machine learning models

Direct soil temperature (ST) measurement is time-consuming and costly; thus, the use of a simple and cost-effective machine learning (ML) tool is helpful. In this study, ML approaches, including KStar, instance-based K-nearest learner (IBK) and locally weighted learner (LWL) coupled with resampling algorithms of bagging (BA) and dagging (DA) were developed and tested for multi-step ahead (3, 6 and 9 days ahead) ST forecasting. In addition, a linear regression model (LR) was used as a benchmark to compare the results. A dataset with daily ST time-series (as models’ output) along with meteorological data (mean (T Mean ), minimum (T Min ) and maximum (T Max ) air temperature, evaporation (Eva), sunshine hours (SSH) and solar radiation (SR); as models’ input) were collected at Isfahan synoptic station (Iran), in a farmland, during 13 years (1992–2005) at 5 and 50 cm soil depths. Six different input combination scenarios were proposed to the models based on Pearson’s correlation coe�cients between inputs and outputs. For the model building, we used 70% of the data and the remaining 30% was considered for model evaluation through different visual and quantitative metrics. Our �ndings showed that variable T Mean is the most effective input variable for ST forecasting in most of the developed algorithms, while in some cases the combination of several variables including T Mean , T Max and as well as the integration of T Mean , T Max , T Min , Eva and SSH proved to be the best input combinations. Among the evaluated models, KStar showed more compatibility with the BA algorithm, while, in most cases and depending on soil depth, IBK and LWL obtained more accurate results when they were hybridized with DA. For soil depth of 5 cm, BA-KStar has superior performance (i.e. Nash-Sutcliffe E�ciency (NSE) = 0.90, 0.87 and 0.85 for 3, 6 and 9 months ahead forecasting, respectively) while for soil depth of 50 cm, DA-KStar outperforms other algorithms (i.e. NSE = 0.88, 0.89 and 0.89


Introduction
Soil temperature (ST) plays a key role in different ecosystems by affecting processes such as the hydrological response of the soil, accumulation and degradation of organic matter, plant growth, nutrient mineralization, carbon emissions, proper time of sowing and even micro-organisms activity (Brar et al., 1992, Peng et al., 2009Hu et al., 2016;Citakoglu, 2017). ST variation can alter soil characteristics and accordingly has considerable environmental outcomes with the change in the carbon balance (Qian et al., 2011). It can be an important parameter in ecological, climatological and hydrological modeling (Tabari et al., 2015). Information about ST is therefore important in decision-making processes. ST also varies with depth, but this variation is much smaller in the deeper layers than near the soil surface, and thus, accurate soil temperature assessments have to be done at different depths. For instance, ST in the topsoil (less than 5 cm) affects seed sowing, while information about ST at a depth of 10-15 cm is required for tree grafting. Previous studies have tried to nd relationships between ST at various depths and the most important factors that affect this parameter (Citakoglu, 2017).
Unfortunately, spatially distributed data of ST does not exist in many regions of the world (Hu et al., 2016), as accurate measurements are expensive and time-consuming (Napagoda and Tilakaratne, 2012).
Moreover, many environmental factors in uence ST, including meteorological variables (e.g. solar radiation, air humidity, pressure and temperature, precipitation, sunny hours and wind speed), topographic conditions and soil factors such as soil water content, texture and surface cover (Paul et al., 2004;Samadianfard, 2018;Zeynoddin, 2019). Despite the technological advances in sensors and devices, direct measurements of ST are still expensive and require continuous measurements at different soil depths (Plauborg, 2002). To overcome the challenge of ST quanti cation for large areas, researchers have been concentrated on ST modelling and prediction using various techniques (Sándor and Fodor, 2012). Accurate ST modelling reduces time, costs and instrument maintenance (Maryanaji et al., 2017). Because ST has a strong correlation with meteorological variables, several models have been developed based on these relationships such as linear models (Kang et  Linear equations do not have reasonable prediction power due to their simple and linear structure, while analytical and numerical models are di cult to use because of their complexity and high data demand (Xing et al., 2018).
Over the past recent decades, machine learning (ML) models, as computational arti cial intelligencebased (AI) models, have captured researchers' attention in distinct disciplines, especially in geosciences.
ML tools are able to process large datasets e ciently. Moreover, non-linear models (e.g. AI models) have a high capability to simulate complex processes due to their non-linear and complex structures (Khosravi et al., 2018  In this study, we aim to propose six novel hybrid resampling algorithms of BA and dagging (DA) with IBK, KStar and LWL, namely: i) BA-IBK, ii) BA-KStar, iii) BA-LWL, iv) DA-IBK, v) DA-KStar, and vi) DA-LWL; for 3, 6 and 9 days ahead daily soil temperature forecasting based on nearby meteorological data at two different soil depth levels (5 and 50 cm). To the best author's knowledge and literature review, these hybrid algorithms are novel not only in soil science but also generally in geoscience, and this is the rst attempt to forecast ST using the proposed models. We tested these approaches in a largely arid and semi-arid area devoted to agriculture where soil and water resources are scarce and accurate ST estimations are necessary to promote long-term sustainable agriculture. Thus, through these algorithms, ST may easily and accurately be predicted worldwide. Moreover, ST prediction through these models could reduce the time and resources for measuring ST. . According to Pearson's correlation coe cient (r) between the input variables and the ST as an output (Table 2), six different scenarios as six different input combinations were considered for modeling process. Finally, and according to the correlation coe cient between forecasted and measured values, the optimum/best input combination was considered for model development and further analysis. This procedure was successfully applied in different prediction modeling.

Materials And
We assumed that the rst variables with the highest correlation coe cients were considered as the rst input to the model (i.e. T Mean ). The assumption is to examine whether variables with the highest r can predict ST accurately alone. Then, other variables with the next highest r were added one by one until the variable with the lowest r (i.e. SR) was added, and the input combination number 6 was constructed: . From a computational point of view, the IBK prediction and classi cation is achieved according to the relative node distance of instances from each category, without prior knowledge of the appropriate K value; and the appropriate K value is determined using a cross validation procedure, and they save and use only the selected instance to generate classi cation predictions, and the approach is sensitive to the number of irrelevant attributes (Aha et al., 1991). The IBK algorithm is described according to three functions: (i) the similarity function, (ii) the classi cation function and (iii) the concept description updater (Aha et al., 1991).

Instance-based Learning with Entropy-Based Distance Function (KStar)
Instance-based learner with entropy-based distance function (KStar), also called K* (Cleary and Trigg where w k is the weight of each individual linear model, and Y k is the output of the individual linear model for a speci c region of the data space. Bootstrap aggregating (Bagging) (Breiman, 1996) belongs in the category of ensemble modeling approaches that are also called resampling techniques, which are mainly adopted as a robust solution for solving the over tting problem that may occur during the training of machines learning models

Multiple linear regression (MLR)
In this study, the MLR was used as a benchmark model. The MLR measures the relationship between dependent and independent variables in terms of linear equations. In simple terms, it is a responsedependent variable (y) that is in uenced by more than one independent variable x i . An MLR can be expressed as follows: The terms β 0 , β 1 , …, β i are regression coe cients, and ϵ is an error term in the model, which represents effects other than those under experiment control, which were not considered in this study. To evaluate the performance of the models, the best input combination was served to the models and modeling procedure was carried out for each algorithm in the Waikato Environment for Knowledge Analysis (WEKA 3.9; The University of Waikato, New Zealand) software. It is well known that the most important and signi cant step in the application of machine learning models is to nd out and establish the best model structure that helps to adequately and e ciently achieve the objectives that are set out for us. In this study, the best model structure (i.e. model's parameter value) was selected by running a commonly used 'trial and error' procedure , where for the best input-combination, different values for each parameter of each model were tried. Models were initially developed using default parameter values and, according to the objective function (i.e. RMSE between observed and forecasted ST) score in the training step, lower or higher values were arbitrarily explored for the model's parameters until the objective function was minimized and the optimum values were obtained.

Model Evaluation Criteria
The evaluation of the proposed models in the present study was done by calculating the following statistical measures: RMSE, mean absolute error (MAE), percent bias (PBIAS), the Nash-Sutcliffe e ciency (NSE) which depends on both the correlation and the bias, and the Pearson's correlation coe cient (R) ( Table 3). NSE is a normalized/dimensionless metric that computes the relative magnitude of the residual variance compared to the variance of the measured dataset. PBIAS calculates the mean tendency of the forecasted values with respect to the observed data, re ecting whether predictions are, on average, higher or lower than the true value. Except for the aforementioned quantitative metric, different visual criteria were applied for further analysis including scatter plot, box plot and Taylor diagram. These criteria allow making a fast and simple comparison. Some of these visual metrics have some advantages over quantitative metrics, such as investigating and comparing between extremes, median and rst and third quartile of forecasted and measured values.

Most effective input variables
It is observed that the correlation between the air temperature and ST (Table 2) is more important than other variables e.g. Eva, SSH and SR. However, the correlation between soil and air temperature decreased with increasing soil depth which is in agreement with previous studies (e.g. Salamene et al., 2010). The in uence of Eva, SSH and SR also decreased in the deeper layers. The result of the correlation coe cient between the input variables and output revealed that the T Mean is the most effective input variable to forecast ST, followed by T Max , T Min , Eva, SSH, and SR (Tables 4 and 5).
After running the models with different inputs, the r-metric was calculated for the training and testing steps between the measured and forecasted STs (Tables 4 and 5 It can be stated that the considered assumption would be true in most of the cases and that the variables with the highest r values would be the most powerful variable in improving model performance and can predict ST accurately, especially for 5 cm soil depth. It is obvious from Tables 4 and 5 that by using T Mean variable, ST can be predicted easily for a soil depth of 5 cm, while for deeper soil depth levels, due to more complexity, a combination of input variables is required.
< Tabel 4. Determination of the best input variable combination for 3, 6 and 9 days ahead forecasting of the soil temperature at 5 cm soil depth using R-value > <Tabel 5. Determination of the best input variable combination for 3, 6 and 9 days ahead forecasting of the soil temperature at 50 cm soil depth using R-value >

Model Evaluation and Comparison
As models developed based on the training dataset, the results presented in this section only show how the developed algorithms t with the dataset. Hence, the results of testing data that represent how the developed algorithms are suitable for forecasting ST, used for model evaluation and comparison purposes. Table 6 lists the performance of the developed models for forecasting ST for 3, 6 and 9 days at a depth of 5 cm based on statistical criteria in the test step. It is observed that the BA-KStar model with an RMSE The visually models' performance evaluation using the Taylor diagram (Fig. 3), based on the correlation and standard deviation values, con rms that the BA-KStar and BA-LWL models result in the closest and farthest values to the measured soil temperature in 3 days ahead forecasting respectively. The Taylor diagram con rms that the BA-KStar model with a correlation of 0.94 and a standard deviation of 0.92 between forecasted and measured data shows better results compared to other developed models in ST (t + 6). Moreover, the forecasted BA-IBK values are close to the measured ST (t + 9) at 5 cm while the DA-LWL is not able to forecast the minimum and maximum ST (t + 9) levels properly.
The distributions of the measured and forecasted ST data at 5 cm depth are presented as box plots in Box plots of measured and forecasted ST 3, 6 and 9 days ahead using all the different models are shown in Fig. 7 (a, b, c).

Discussion
Timing for planting eld crops is partially based on when soil temperatures are optimal for seed germination and seedling emergence (Lindstrom et al., 1976). Therefore, the timing of spring crop planting is dependent on a number of variables. Apart from agronomic variables like seed availability, equipment readiness, eld dryness, tillage possibility, another important factor is soil temperature. As direct soil measurement is time-consuming and empirical models are not accurate enough, the results of this study provide a basis for ST forecasting at different depths (e.g. 5 and 50 cm).
The different model characteristics may need diverse input variables, so a suitable selection of variables is a key factor for accurate ST forecasting. Inappropriate and irrelevant variables complicate the development of models and decrease the performance of the model by creating a complicated relationship between the inputs and output. Although previous studies have used data with high correlation for developing machine learning models (e.g. Barzegar et al., 2017), this study con rms the ndings of the survey by Khosravi et al., (2020b) in which they concluded that both high and low correlation variables should be considered for developing a high-performance predictive model. Besides, each developed data mining algorithm has a different input variable as the best input-combination as well as behaves differently with the same input-combination due to the different structure of each algorithm and different learning processes. This different structure causes different computing capabilities and then different prediction power.
Statistical performance of the developed models for ST forecasting at soil depths of 5 and 50 cm shows that they are capable of using as promising tools for not only ST forecasting but also for other catchment phenomena. The results showed that all developed models performed very well to forecast ST at multiple depths. In addition, the results of the study were compared with those of the MLR model as a benchmark ( Chen et al., (2019) recently indicated that the complexity of a hybrid model is higher than that of a single model. However, the learning performance can be remarkably enhanced to a certain extent. Therefore, the hybrid resampling-ML model cannot only decrease the tting error, but could also ameliorate the generalization ability.
The main limitation of the applied models is their site-speci c calibration, and thus, they do not have generalization power to predict/forecast the target phenomena in another study area, especially when the physiographic conditions differ signi cantly. As a result, ML models are sensitive to the learning section, and therefore, in order to build a model that is more robust, it is important to include a large and long dataset from multiple study domains. However, the main advantage of these models is their ability to forecast each target phenomenon at each site.
In this study, the capability of the developed models was tested using the available meteorological variables. However, it is suggested to develop the used models in this study or other suitable ML models e.g., deep learning models using other different meteorological variables (atmospheric pressure, precipitation, relative humidity, wind speed, etc.), which may provide additional information. In a study by Hu and Feng (2005), the effect of precipitation on soil temperatures was insigni cant. Precipitation and other climate variables, however, may improve the accuracy of the ML models. The models also need to be further employed in different areas with different climatic characteristics and land uses (rangeland, arable land, forest, sub-urban areas) to validate the e ciency of the models in ST forecasting. Also, It is recommended to develop these types of algorithms in one study area and then use them to evaluate forecasted ST in another. As long as a model's predictive power is high, it can serve as an effective and simple tool for ST forecasting worldwide. 5

. Conclusions
This study successfully addressed the challenge of accurately forecasting ST through six novel hybrid algorithms of resampling methods (e.g. bagging (BA) and dagging (DA)) with machine learnings (MLs) (e.g. KStar, instance-based K-nearest learning (IBK) and locally weighted learner (LWL)) including BA-KStar, BA-IBK, BA-LWL, DA-KStar, DA-IBK and DA-LWL at two different soil depths, 5 and 50 cm, for the rst time. The main goal was to develop algorithms with reasonable prediction power and propose them as simple and promising tools for multistep (up to nine days ahead) ST forecasting in the Isfahan area, Iran, as a case study.
The main ndings of this research are as follows: 1. The modeling process based on correlation coe cient shows that mean air temperature (T Mean ) is the most effective input variable for ST forecast followed by maximum air temperature (T Max ),  Scatter plot of 3 (i.e. golden color), 6 (i.e. blue color) and 9 (i.e. red color) days ahead forecasted VS. measured soil temperature at a depth of 5 cm  Model performance using box-plot for 3 (a), 6 (b) and 9 (c) days ahead ST forecasting at a depth of 5 cm Scatter plot of 3 (i.e. golden color), 6 (i.e. blue color) and 9 (i.e. red color) days ahead forecasted VS. measured soil temperature at a depth of 50 cm Model performance via Taylor diagram for 3 (a), 6 (b) and 9 (c) days ahead ST forecasting at a depth of 50 cm Figure 7