Groundwater is the largest and one of the most important sources of freshwater, and it supports a major part of the domestic and irrigation needs of many nations1. However, increasing groundwater extraction and continued climate change have contributed significantly to groundwater level decline2. Extreme climatic conditions, such as droughts, and anthropogenic pressures, such as rapid population growth, industrial development, the expansion of agricultural activities, and other domestic uses, have escalated the demand for groundwater and strongly influenced groundwater levels3. A decrease in the groundwater level can trigger serious environmental consequences, such as groundwater quality deterioration, ecosystem degradation, reduced agricultural production, land subsidence, and seawater intrusion4,5. Moreover, agricultural irrigation is critical for crop production and food security worldwide. In regions with intensive agricultural development, the increasing demand for water resources, coupled with the effects of climate change, has led to water scarcity and competition among different sectors. Therefore, the sustainable use of groundwater is essential for water resource conservation and food security.
Accurately predicting groundwater levels is a significant challenge in managing aquifer systems, especially in regions where surface water is scarce6. There are two main categories of models for groundwater prediction: physically descriptive models and data-driven empirical models7. Physical models typically require extensive datasets covering parameters such as the water content, hydraulic conductivity, precipitation, volume of groundwater extraction, and soil properties, as well as information about human activities, such as dam construction8,9. These models also require precise information on the properties of aquifers to account for subsurface variability10. However, such data are often limited by cost and time constraints, which makes physical models difficult to implement11,12. Although physical models are historically important, they can also take a long time to construct, particularly for complex hydrogeological systems. Furthermore, the nonlinear behavior of subsurface systems and their responses to climatic variables can complicate modeling, because large datasets are needed to achieve high accuracy.
As alternatives, data-driven empirical models have gained traction because they do not require an in-depth representation of all system properties13. Machine learning algorithms fit functions to the available data and support tasks such as prediction, classification, and anomaly detection. Because they excel at identifying intricate patterns in data, they are well suited to groundwater level prediction13. In the last two decades, artificial neural networks (ANNs) have become popular algorithms, followed by support vector machines (SVMs) and adaptive network-based fuzzy inference systems (ANFISs)14,15. However, the main drawback of these algorithms is that they are computationally intensive, and training is time consuming, especially for large datasets. Despite their good performance in modeling groundwater levels, ANNs can become trapped in local minima and can overfit during training. In addition, their inability to handle missing values and address data correlations can lead to complex simulations with long calibration times16. SVMs are sensitive to the choice of kernel function and parameters and struggle with noisy data. Moreover, it is challenging to determine suitable numbers of fuzzy sets and rules when applying ANFISs17.
Decision tree-based models, notably extreme gradient boosting (XGB), are appealing alternatives to ANNs18. These models are particularly advantageous for small datasets and offer various benefits. Unlike ANNs, decision trees provide more interpretable results. They allow for flexible model development, accommodating data with varying complexities and levels of accuracy. Decision trees can effectively model nonlinear relationships without requiring prior statistical assumptions, data transformations, or outlier elimination. These algorithms are valuable for both classification and regression tasks in supervised learning. Decision trees recursively partition the training data across input variables and fit simple functions to each partition. However, single decision trees are sensitive to data size and quality, which can lead to overfitting. To address this issue, ensemble methods combine multiple weak learners to form a robust model. Boosting and bagging are two key techniques in ensemble modeling. Bagging trains homogeneous weak learners in parallel on bootstrap subsets of the original dataset and combines their results to reduce model variance. Boosting trains weak learners sequentially and adaptively, with each learner focusing on the samples that earlier learners predicted poorly, which reduces model bias. These techniques have inspired various machine learning algorithms. For instance, random forests (RFs), which are based on bagging, have become prominent for classification and regression across various fields. Additionally, algorithms such as AdaBoost and the gradient boosting machine (GBM) implement boosting. GBM algorithms, such as the light gradient boosting machine (LightGBM) and XGB, excel at capturing the complex nonlinear relationships among variables. XGB, in particular, is renowned for its strong performance and has been widely recommended for predicting natural phenomena.
In summary, decision tree-based models, especially XGB, are attractive alternatives to ANNs. XGB models provide interpretable results, are well suited for small datasets, and have demonstrated remarkable performance and versatility in various practical applications19–21.
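The difference between bagging and boosting described above can be illustrated with a minimal sketch. It is not the XGB algorithm or the implementation used in this study. It assumes one-split regression trees ("stumps") as weak learners, a synthetic one-dimensional dataset, and the gradient-boosting variant of boosting, in which each new learner is fitted to the residuals of the current ensemble. All function names and parameter values are hypothetical.

```python
import math
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree (weak learner) on 1-D inputs."""
    best = None
    order = sorted(set(xs))
    for lo, hi in zip(order, order[1:]):
        thr = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    if best is None:  # all inputs identical: fall back to the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def bagging(xs, ys, n_learners=25, seed=0):
    """Train stumps independently on bootstrap samples and average them."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)

def boosting(xs, ys, n_learners=25, learning_rate=0.5):
    """Train stumps sequentially, each on the residuals of the ensemble so far."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_learners):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic nonlinear target (an assumption for illustration only).
xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]

single_mse = mse(fit_stump(xs, ys), xs, ys)
bagged_model = bagging(xs, ys)
bagged_mse = mse(bagged_model, xs, ys)
boosted_mse = mse(boosting(xs, ys), xs, ys)
print(single_mse, bagged_mse, boosted_mse)
```

In this sketch the bagged stumps are independent of one another and could be trained in parallel, whereas each boosted stump depends on the predictions of all previous ones, which is why boosting is inherently sequential.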
The application of XGB for groundwater level prediction remains relatively unexplored, despite recent reports of remarkable performance and accuracy. Table 1 shows that previous studies employed XGB for various purposes, ranging from small- to large-scale applications. These applications encompass irrigation pumping and planning, investigating the demands of growing populations, assessing hydrogeological interactions, and exploring drought mitigation strategies. The spatial scope, including the study area and the number of monitored groundwater wells, largely depends on the specific research objectives. The temporal scales of datasets are primarily determined by data availability and the objectives of groundwater level prediction. One critical concern in groundwater level prediction using machine learning is the collection of sufficient input features. Table 1 highlights that the most commonly employed model inputs are observed groundwater level, precipitation, evapotranspiration, and temperature records, as previously noted14. Groundwater levels in aquifer systems are predominantly monitored via observation wells, providing critical insights into how hydrological influences affect recharge, storage, and discharge. Previous studies have consistently underscored the significance of the relationships among groundwater levels, rainfall, and groundwater utilization by growing populations22–24. Groundwater discharge is frequently characterized by the pumping rate25–27. However, assessments of dynamic groundwater pumping in the field are often challenging due to substantial infrastructure costs, particularly in areas with intensive irrigation. As a practical alternative, many machine learning models can be used to estimate total groundwater pumping over a season or year, as a proxy for groundwater discharge. Notably, in regions where electric motors power most irrigation wells, electricity consumption has become a reliable indicator of groundwater pumping rates28,29.
Table 1
Previous studies applying the XGB model to groundwater level prediction
Purpose of previous study | Temporal scale | Spatial scale of study area | Input parameters | Performance in model testing |
To assess the effect of geological-geomechanical properties on groundwater level changes (Kenda et al., 2018) | Daily | 518 stations | Groundwater information; weather data; pump sensor data | R² = 0.644; RMSE = 0.0211 cm |
To improve real-time irrigation planning (Brédy et al., 2020) | Hourly | ~0.18 km² with 3 monitoring wells | Groundwater level; precipitation; evapotranspiration | R² = 0.73–0.83; RMSE = 2.66–3.59 cm (for P1 well) |
To support water sustainability and drought mitigation (Hussein et al., 2020) | Monthly | Global | 7 feature sets from satellite images | R² = 0.6165; RMSE = 5.544–6.145 m |
To predict groundwater level in highly populated towns (Osman et al., 2021) | Daily | 1429.1 km² with 5 monitoring wells | Groundwater level; precipitation; temperature; evaporation | R² = 0.11–0.92; RMSE = 0.137–0.448 m |
To predict groundwater level dynamics across a tropical peatland (Hikouei et al., 2023) | Monthly | 453 km² with 292 dipwells | Groundwater level; precipitation; evapotranspiration; surface heights; distance to the canal | R² = 0.995; RMSE = 0.101 m |
To assess the geo-environmental for sustainable groundwater restoration (Mahammad et al., 2023) | Seasonal | 3200 km² with 30 monitoring wells | Precipitation; population growth; use of groundwater | R = −0.72 to 0.9; RMSE = 4.73–0.52 m |
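Table 1 compares studies by R² and RMSE. The sketch below shows one common way to compute both from observed and predicted groundwater levels. It assumes R² is the coefficient of determination, whereas individual studies may instead report the squared correlation coefficient. The function names are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean square error, in the same units as the observations."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

Because RMSE carries the units of the groundwater level (cm or m in Table 1), it is not directly comparable across studies with different units or water table ranges, whereas R² is dimensionless.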
From a macroscopic perspective, the primary motivation behind groundwater level prediction is the expectation that drought events will become more frequent under continued climate change. The escalating stress of surface water scarcity has led to the overexploitation of groundwater, resulting in heightened pumping costs, diminished base flow, saltwater intrusion, and land subsidence30. However, curbing intensive irrigation pumping could lead to notable losses in both the agricultural economy and food security. Consequently, addressing the mitigation of irrigation practices and assessing the socioeconomic implications of groundwater management are imperative tasks31–33. Hence, the objective of this study is to establish a groundwater level prediction model that uses data related to groundwater dynamics, electricity consumption by pumping wells, and precipitation as inputs to the XGBoost (XGB) algorithm. The developed model can be applied to assess the impact of reducing groundwater usage for irrigation on crop yields and fallowing subsidies.
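One way the three input types named above could be arranged for a supervised learner is sketched below. This is an assumption for illustration, not the feature design of this study: it uses lagged values of each series as predictors of the current groundwater level, and the function name, lag count, and numbers are all hypothetical.

```python
def make_lagged_samples(level, electricity, precipitation, n_lags=2):
    """Build (features, targets) where each row holds the previous n_lags
    values of level, electricity use, and precipitation, and the target
    is the groundwater level at the current time step."""
    if not (len(level) == len(electricity) == len(precipitation)):
        raise ValueError("all series must have the same length")
    features, targets = [], []
    for t in range(n_lags, len(level)):
        row = []
        for lag in range(1, n_lags + 1):
            row += [level[t - lag], electricity[t - lag], precipitation[t - lag]]
        features.append(row)
        targets.append(level[t])
    return features, targets

# Synthetic values for illustration only (not data from this study).
level = [10.0, 9.8, 9.5, 9.7, 9.9]                # groundwater level
electricity = [100.0, 120.0, 150.0, 90.0, 80.0]   # pumping-well electricity use
precipitation = [0.0, 0.0, 5.0, 20.0, 10.0]       # precipitation

features, targets = make_lagged_samples(level, electricity, precipitation)
```

The resulting rows and targets could then be passed to any gradient-boosting regressor. Using only past values as predictors keeps the sketch usable for forecasting, since no information from the predicted time step enters the inputs.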