The Soil Water Characteristic Curve (SWCC) is closely associated with soil physical properties and plays a crucial role in soil and water management (Shwetha and Varija, 2015). The SWCC provides valuable direct and indirect information about the behavior of water in unsaturated soils (Zhai and Rahardjo, 2012; van Genuchten et al., 2015). There is a need for reliable determination of the SWCC of any given soil using a combination of measurement and prediction techniques. However, field, laboratory, and computer vision-based measurements of the SWCC are expensive, tedious, time-consuming, and sometimes impossible because of issues related to scaling, spatial variability, and study-site inaccessibility (Achieng, 2019). Modeling procedures are therefore a very common approach to predicting the SWCC (Dobarco et al., 2019).
While multiple linear regression (MLR), artificial neural networks (ANNs), and support vector regression (SVR) have been commonly used in the development of pedo-transfer functions (PTFs) (Rani et al., 2022), there has been a significant increase in the application of machine learning (ML) algorithms such as LR, ANNs, support vector machines (SVMs), classification and regression trees (CART), and random forest (RF) in soil moisture research. These ML algorithms are preferred for their non-parametric nature and their ability to capture complex, non-linear relationships (Padarian et al., 2020).
Machine learning techniques for estimating SWCC fall under the category of supervised learning, where a labeled training dataset is provided with known output values. The model is trained using algorithms applied to the input dataset to predict the desired output. Training continues until the model achieves the desired accuracy on the training dataset. Supervised learning is commonly used for classification and regression tasks (Rani et al., 2022).
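As a minimal sketch of this supervised regression setting, the example below fits a k-nearest-neighbour regressor to labeled (matric suction, water content) pairs and scores it on held-out pairs. Everything in it is an illustrative assumption: the data are synthetic (generated from an arbitrary smooth decreasing curve, not measured), and the function names, the choice of k, and the train/test split are hypothetical rather than taken from any cited study.

```python
import math
import random


def synthetic_water_content(suction_kpa):
    """Arbitrary smooth decreasing curve, used only to generate toy labels."""
    return 0.08 + 0.37 / (1.0 + (0.05 * suction_kpa) ** 1.3)


def knn_predict(train, suction_kpa, k=3):
    """Average the labels of the k training points nearest in log10(suction)."""
    nearest = sorted(
        train,
        key=lambda pair: abs(math.log10(pair[0]) - math.log10(suction_kpa)),
    )[:k]
    return sum(label for _, label in nearest) / len(nearest)


def rmse(test, train, k=3):
    """Root-mean-square error of the k-NN predictions on held-out pairs."""
    squared = [(knn_predict(train, x, k) - y) ** 2 for x, y in test]
    return math.sqrt(sum(squared) / len(squared))


random.seed(0)
# Labeled dataset: suctions evenly spaced on a log scale, with small noise on the labels.
suctions = [10 ** (i / 10) for i in range(1, 33)]
data = [(h, synthetic_water_content(h) + random.gauss(0, 0.005)) for h in suctions]

# Hold out every fourth sample; the remaining samples form the training set.
test_set = data[::4]
train_set = [pair for i, pair in enumerate(data) if i % 4 != 0]

test_rmse = rmse(test_set, train_set)
print(f"held-out RMSE: {test_rmse:.4f}")
```

Scoring on held-out pairs rather than on the training pairs is a deliberate choice in this sketch: accuracy measured only on the training dataset can hide overfitting.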
Achieng (2019) conducted a comparative study of several ML algorithms for modeling the SWCC in loamy sand soil and found that RBF-based support vector regression (SVR) outperformed SVR with linear and polynomial kernels, single-layer ANN, and deep neural network (DNN) models. In another study, Araya and Ghezzehei (2019) demonstrated the superior performance of the Boosted Regression Tree (BRT) model compared to other algorithms, such as KNN, SVR, and RF, for predicting saturated hydraulic conductivity, although the RF model closely followed the BRT model in terms of performance. ML algorithms have also performed satisfactorily in predicting other environmental events. For instance, Hong and Pai (2007) and Hu et al. (2013) reported the effective use of techniques such as ANN, SVM, and KNN for forecasting soil water evaporation. Furthermore, Baydaroglu and Kocak (2014) reported good performance of these algorithms in predicting free water surfaces, while Valipour et al. (2012, 2013) used them to predict water reservoir inflows. As a result of their high flexibility, accurate predictive performance, and consistent results, data mining techniques have become a preferred choice for many researchers seeking to enhance their understanding of unsaturated soil hydrological properties (Botula et al., 2013).
The capability of machine learning methods to accurately fit the SWCC is directly influenced by the availability of measured soil water content data at various soil matric potentials (Hastie et al., 2009; Lamorski et al., 2013). Toth et al. (2014) analyzed the SWCC using the RF model at four matric suctions (0.1, 33, 1500 kPa, and 150 MPa). Their results demonstrated that the significance of soil properties in predicting soil water content varies across soil types and matric suctions. In another study, Gunarathna et al. (2019) evaluated ML algorithms, including ANN and KNN, for estimating the volumetric water content at matric suctions of 10, 33, and 1500 kPa. Pekel (2020) applied decision tree regression, specifically the CART algorithm, to estimate soil moisture from air temperature, time, relative humidity, and soil temperature. Cai et al. (2019) proposed the use of a Deep Learning Regression Network (DLRN) with big data fitting capability for constructing soil moisture prediction models. Numerical models such as HYDRUS-2D often require a large amount of input data to simulate time series of soil moisture. However, when only limited input data are available, ML algorithms such as SVM and Adaptive Neuro-Fuzzy Inference Systems (ANFIS) can efficiently handle the task (Karandish and Simunek, 2016). While the accuracy of ML algorithms may be lower than that of numerical models, they can serve as a better alternative under limited and missing data conditions.
Although machine learning techniques have been explored in various soil moisture-related studies, the use of boosting techniques for this purpose is relatively rare (Araya and Ghezzehei, 2019). Boosting methods iteratively combine weak learners to create a strong learner that provides more accurate predictions. One popular boosting technique is gradient boosting, which sequentially adds predictors to an ensemble, with each predictor correcting the errors made by its predecessor. Unlike adaptive boosting (AdaBoost, AB), which adjusts the weights of the data points at each iteration, gradient boosting trains each new predictor on the residual errors of the previous one. In this study, gradient boosting and AB were selected as the most popular boosting-based algorithms for estimating the SWCC using ML.
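The residual-fitting idea behind gradient boosting can be sketched in a few lines. The example below is an assumption-laden illustration for squared-error loss, not the configuration used in this study: the weak learner is a one-split regression stump, and the data, the number of rounds, the learning rate, and all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regressor (weak learner) that minimises squared error."""
    best = None
    ordered = sorted(set(xs))
    for left, right in zip(ordered, ordered[1:]):
        threshold = (left + right) / 2.0
        low = [r for x, r in zip(xs, residuals) if x <= threshold]
        high = [r for x, r in zip(xs, residuals) if x > threshold]
        low_mean = sum(low) / len(low)
        high_mean = sum(high) / len(high)
        sse = sum((r - low_mean) ** 2 for r in low)
        sse += sum((r - high_mean) ** 2 for r in high)
        if best is None or sse < best[0]:
            best = (sse, threshold, low_mean, high_mean)
    _, threshold, low_mean, high_mean = best
    return lambda x: low_mean if x <= threshold else high_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Add stumps one at a time, each trained on the current residuals."""
    base = sum(ys) / len(ys)  # initial prediction: the mean label
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data (not measured): log10 of matric suction vs. volumetric water content.
log_suction = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
water_content = [0.45, 0.43, 0.38, 0.30, 0.21, 0.14, 0.10]

mean_label = sum(water_content) / len(water_content)
mse_base = mean_squared_error(lambda x: mean_label, log_suction, water_content)
boosted = gradient_boost(log_suction, water_content)
mse_boosted = mean_squared_error(boosted, log_suction, water_content)
print(f"training MSE, mean only: {mse_base:.5f}; after boosting: {mse_boosted:.5f}")
```

AB would differ in the inner loop: instead of recomputing residuals, it would increase the weights of poorly predicted samples before fitting the next weak learner.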
Vereecken et al. (2010) concluded that incorporating soil structure information as one of the predictors in PTFs is likely to enhance their performance. Nguyen et al. (2014) found that including categorical soil structure information in point PTFs developed using the MLR technique improved the accuracy of SWCC estimation for tropical paddy soils. They also suggested further investigation of whether these improvements would hold when using different data mining techniques and other types of PTFs. Passoni et al. (2014) used ImageJ software to characterize the porosity of Oxisols in the southeastern region of Brazil on the basis of shape factors, or pore form. ImageJ offers a convenient built-in option for analyzing soil porosity. This feature provides valuable output parameters, including pore surface area (total area of porous regions, cm²), volume (total number of porous voxels × voxel volume, cm³), elongation (major axis length/minor axis length, dimensionless), flatness (average length of major plane/average length of minor plane), sphericity (4π × area/perimeter², dimensionless), and compactness (volume of the porous region/surface area of the porous region, dimensionless). Building upon these findings, we used detailed soil structural properties derived from image analysis as inputs to the machine learning techniques employed in this study. The aim was to assess whether incorporating such soil structure information would improve SWCC estimation.
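The shape descriptors listed above follow directly from their stated definitions. The sketch below writes four of them out as plain functions to make the arithmetic explicit; the function and argument names are hypothetical and are not part of the ImageJ interface, and the inputs are assumed to have already been measured in consistent units.

```python
import math


def pore_volume(n_porous_voxels, voxel_volume_cm3):
    """Total number of porous voxels multiplied by the volume of one voxel (cm3)."""
    return n_porous_voxels * voxel_volume_cm3


def elongation(major_axis_length, minor_axis_length):
    """Major axis length divided by minor axis length (dimensionless)."""
    return major_axis_length / minor_axis_length


def sphericity(area, perimeter):
    """4 * pi * area / perimeter**2, as defined in the text (dimensionless)."""
    return 4.0 * math.pi * area / perimeter ** 2


def compactness(volume, surface_area):
    """Volume of the porous region divided by its surface area."""
    return volume / surface_area


# Sanity checks on simple outlines: a circle of radius 2 and a unit square.
circle_value = sphericity(math.pi * 2.0 ** 2, 2.0 * math.pi * 2.0)
square_value = sphericity(1.0, 4.0)
print(f"circle: {circle_value:.3f}, unit square: {square_value:.3f}")
```

Under this definition a perfectly circular outline scores 1 and less regular outlines score lower (π/4 for a square), which is why the descriptor is useful for distinguishing rounded pores from irregular ones.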
Data mining techniques have shown superiority over traditional MLR techniques in modeling the interactions of the complex soil-water system. However, these techniques also have drawbacks, including susceptibility to overfitting, high data demand, and the need for expert knowledge. In this study, machine learning methods were employed to analyze soil structure using selected soil properties. Our objective was to predict the SWCC of soil samples with different properties using data mining algorithms. The prediction process was conducted under two conditions: 1) using matric suction as the only predefined input, and 2) using a range of input parameters obtained from laboratory and image analysis methods.