Assessing the Impact of Land Cover on Groundwater Quality in a Smart City Using GIS and Machine Learning Algorithms

The present study aimed to assess the impacts of land cover on groundwater quality by integrating physico-chemical data and satellite imagery. Initially, fourteen groundwater parameters were measured at nineteen sampling stations in both the pre-monsoon and post-monsoon seasons, and the water quality index (WQI) was calculated. Subsequently, Google Earth, Landsat-8, and Sentinel-2A imagery were used for land cover mapping, including the concentration of settlement and urban built-up, greenery coverage, and micro-level land use. Two machine learning models, artificial neural network (ANN) and random forest (RF), were used for pixel-based classification and to establish the spatial relation between water quality and land cover. The models were trained and tested for the whole study area as well as for 500 m buffers around each sampling station. The validation results, including mean absolute error (MAE), root mean squared error (RMSE), Kappa statistic (K), overall accuracy (OAC), and receiver operating characteristic (ROC), indicated that the random forest classifier performs better than the artificial neural network. For predicting the impact of land cover on groundwater quality, the pre-monsoon testing dataset gave higher accuracy (MAE 0.343, RMSE 0.397, K 0.55, and ROC 0.838) than the post-monsoon dataset. The results also reveal that classification accuracy is greater within the 500 m buffer areas than for the whole study area, with MAE and RMSE values close to 0 and K and ROC values of 1. Based on these findings, the present study suggests considering a large scale for determining the controlling factors of groundwater degradation.


Introduction
Groundwater is important for sustaining communities in many parts of the world. However, groundwater quality in many regions is being degraded by pollution from various point and non-point sources, which poses difficult challenges for residents and planners (Bui et al., 2020). Additionally, groundwater availability in many parts of the world is declining because of falling water tables caused by over-abstraction and the effects of climate change. These problems are evident in many parts of India (Singh et al., 2022; Ourarhi et al., 2022; Missaoui et al., 2022). According to the 2010 World Bank report, India is the largest user of groundwater, with an estimated average use of 230 km³ per year. Groundwater in the country is widely used for agriculture, industry, and urban supply. Developing nations frequently undergo rapid economic growth, and each development strategy is likely to have an adverse impact on the environment. Pervasive ecological changes have taken place as a result of human activity (Shaker et al., 2010). Low-quality drinking water is a critical concern for the developing world because it puts the entire ecosystem at risk along with human health (Li et al., 2016; Escobar et al., 2022). Most Indian cities, especially the landlocked cities in the central part of the country, depend mainly on groundwater as a reliable source of drinking water. Demand for groundwater is also rising because of the rapidly growing population, and the intensification of agricultural production needed to feed that population, together with increased industrial activity, is degrading groundwater quality in many areas.
As a result of these issues, it is becoming increasingly important to identify the sources of both geogenic and anthropogenic groundwater contamination and to monitor changes in groundwater quality over time to protect human health and the environment (Zhang and Li, 2019). Evaluation of groundwater quality is a challenging task. Over the past decade, many techniques have been developed and applied for the assessment of water quality, of which the water quality index (WQI) is an extensively used method (Chaurasia et al., 2018; Kanga et al., 2020; Kawo and Karuppannan, 2018; Tyagi et al., 2013; Yisa et al., 2012). The WQI is a mathematical tool that condenses the concentrations or other values of water quality parameters into a single value indicating the suitability of a water sample for potable or other uses (Akhtar et al., 2021). For example, Lumb et al. (2006) used WQI to evaluate the quality of freshwater against the Canadian Council of Ministers of the Environment (CCME) guidelines. Water quality parameters have been used in many studies (Yan et al., 2016; Rabeiy, 2018; Kim and Cardone, 2005; Wagh et al., 2019). The water quality index calculated from the selected water quality parameters is a dimensionless value (Rezaie-Balf et al., 2020; Rufino et al., 2019).
Landscape metrics are sensitive to the dynamic spatial pattern and to water quality in different seasons, from both micro- and macro-perspectives (Risal et al., 2020; Zhou et al., 2012; Shi et al., 2017; Calijuri et al., 2015; Escobar et al., 2022). Pratt and Chang (2012) observed that various topographic features such as built-up area, slope, and elevation affect water quality both across the watershed and within the stream buffer. Studies of the temporal impact of land cover on water quality have also shown that different land covers alter certain water quality parameters (He et al., 2020; Ahmad et al., 2021). Many studies have been carried out in this direction, but no recent study was found that assesses the impact of local land cover on groundwater quality in an urban area that depends mainly on groundwater. The present study attempts to fill this research gap between groundwater quality monitoring and its association with local land cover, for the whole city as well as for buffer regions, with the help of GIS and AI.
In this study, water quality indices were determined to observe how groundwater quality varied between pre-monsoon and post-monsoon conditions. There are various methods to calculate WQI; the rank weight method (Ali and Ahmad, 2019; Maurya et al., 2022) was used in the present study. The relationship between groundwater quality changes and land cover was then determined using remote sensing data. This is a complex task, but it can be simplified by using artificial intelligence (AI) techniques to obtain a more precise assessment of how groundwater quality interacts with various land use factors (Yaseen et al., 2018; Bedi et al., 2020; Wang et al., 2017; El Bilali et al., 2021). AI-based modeling does not require sub-index calculations and gives a more reliable output. The use of machine learning algorithms has increased tenfold over the past decade because they give quick and precise results and can handle large amounts of data. The output, however, depends entirely on the input data and the methodology.
Given these factors, the aim of this study was to determine the influence of land cover changes on water quality in the study area. In particular, the rapid increase in urbanization in the study area and in many other regions has caused substantial changes in hydrology and is likely to have adversely affected water quality (Dutta et al., 2018; Szewrański et al., 2018; Siqueira Castro et al., 2015). For example, the loss of vegetation in a watershed often accelerates erosion, which can increase the turbidity and suspended solids content of surface runoff (Hitouri et al., 2022; Taborda et al., 2022). Hence, it is important to monitor land cover change and water quality on an ongoing basis (Liu et al., 2006; Li, 1996). The land cover pattern produced by urbanization, where anthropogenic activity is intense, may also affect groundwater quality (Maurya et al., 2021; Rao and Chaudhary, 2019; He and Wu, 2019; Robertson et al., 2017).
To investigate the connection between groundwater quality and land cover, a three-phase approach was followed using the method of He and Wu (2019). This consisted of quantifying groundwater quality using a prescribed approach, determining the land cover pattern of the selected study area, and determining the relationship between groundwater quality and land cover patterns. For the first step, the water quality index can effectively quantify groundwater characteristics. For the second step, land cover data are generally acquired from two sources, i.e., satellite imagery or the local survey department (Cain et al., 1989). In this study, land cover patterns of the study area were derived from Landsat-8 and Sentinel-2A satellite imagery and further verified in Google Earth Engine. The third step determines the range of influence. He and Wu (2019) proposed a curved streamline searching model (CS-SLM) for this step. In this study, the artificial neural network and random forest classifier were used instead to examine the connection between water quality and land cover patterns.

Study Area
Aligarh city lies at 27.88°N latitude and 78.08°E longitude, 132 km south-east of the capital of India (Fig. 1a). Aligarh city is the administrative headquarters of Aligarh district. The city is divided into 70 wards and four zones. The total area of the city is 36 km². The city is famous for its lock industry. As per the 2011 census, Aligarh city has 874,408 residents (461,772 men and 412,636 women), while its urban population is 911,223. Aligarh falls in the subtropical humid climate region. Summer begins in April and is extremely hot, with temperatures ranging from 26 °C to 46 °C. The rainy season begins in the later part of June and lasts until September, with an average yearly rainfall of 800 mm. The winter season begins in December, with temperatures ranging from 12 °C to 16 °C and a low of 1 °C in January. The city's water supply system is managed by Aligarh Municipal Corporation. Based on municipal records, the city supplies 90 litres per capita per day (LPCD) of water.
Geomorphologically, the city is located in the older alluvial plain region, and the soil type is the western upland soil series. The area lies within the Ganga-Yamuna alluvial plain. Kali Nadi is the only river that passes near the city. There is a sharp demarcation between the old city and the new city. The older part of the city is congested and haphazard in terms of settlement and population density. Aligarh is located on a flat plain apart from a few low-lying areas. The city has a depression at its center, which gives it a saucer-shaped topography. As a result, this part of the city has drainage problems and is susceptible to waterlogging. The groundwater in the city is alkaline in nature, and based on TDS measurements nearly 30% of the total groundwater available is not suitable for drinking without purification (Wasim et al., 2014). Residents of slum areas and financially weaker sections of the city, as well as most residents who have their own submersible pumps drawing from below 200 ft, prefer to drink the water without any purification.
Land cover types in the city area include urban development, vegetation, barren land, swampland, open space, and water bodies. The largest land cover category is urban development, which covers about 71% of the city area, whereas water bodies have the least coverage (less than 1% of the total area of the city). Aligarh is situated in the Upper Ganga-Yamuna doab. Physiographically, it consists of clay and sandy soil. Because no stream passes through the city, groundwater is relied on for domestic and industrial consumption. There are five minor linear streams surrounding the city, but they do not pass through it. Rocks in the area consist of the Vindhyan group overlain by alluvial sediment. The groundwater aquifer occurs in three layers (Fig. 1b). The first layer extends from 0 to 122 mbgl, the second layer lies between 100 and 150 mbgl, and the third lies between 130 and 300 mbgl. Of these three layers, only the first is suitable for drinking and domestic use. The other two contain brackish water and are therefore of no use (Anwar and Aggarwal, 2014). Apart from groundwater, no alternative surface water source is available in the city.

Methods and Materials
To fulfill the objective, several steps were followed to assess the impacts of land cover on the groundwater quality of the study area. Initially, a land cover map of the study area was prepared. Groundwater sample locations were then selected based on the land cover map. A general preview in Google Earth Engine showed that the city has experienced haphazard growth in all directions without proper management and measures.
To prepare the land cover of the study area, three parameters were selected: the concentration of settlement and urban built-up (CSUB), greenery coverage (GC), and micro-level land use (MLU). The first two parameters were prepared from Sentinel-2A satellite imagery for the year 2020, and the last parameter was derived using pixel-based supervised classification. For the supervised classification, a total of nine classes were considered, namely agricultural land, barren land, settlement, water bodies, swamp, vegetation, agricultural fallow land, vegetation fallow land, and recreational areas. Groundwater sampling locations were selected to cover different land cover categories, but sampling sites were restricted by the location of the public groundwater supply system from which samples were collected. Most of the samples were collected from public water supply tube wells so that the entire city could be covered, but a few areas are dominated by private water supply systems equipped with submersible pumps. Consequently, some samples were also collected from private water sources. The sampling points were recorded as "x" and "y" GPS coordinates so that the results could be plotted for data visualization (Table 1).
Then WQI was calculated to identify the water quality of the area. To evaluate the correlation between land cover and groundwater quality, a 500m buffer was created around each sampled station to delineate water quality. For assessing the accuracy, machine learning techniques such as artificial neural network (ANN) and random forest (RF) methods were used in the case of both the entire city and within the buffer area. The flow chart presents the details of the methodology applied in the present study (Fig. 2).
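The 500 m buffer step can be illustrated with a short sketch. It assumes that station and point coordinates have already been projected to a metric system (for example UTM), so that Euclidean distance is meaningful; the coordinates and function names below are hypothetical and are not taken from the study.

```python
import math

BUFFER_RADIUS_M = 500.0  # buffer radius stated in the text


def within_buffer(point, station, radius_m=BUFFER_RADIUS_M):
    """Return True if `point` lies within `radius_m` of `station`.

    Both arguments are (easting, northing) tuples in metres; projected
    coordinates are an assumption of this sketch.
    """
    return math.dist(point, station) <= radius_m


def points_in_any_buffer(points, stations, radius_m=BUFFER_RADIUS_M):
    """Keep only the points that fall inside at least one station buffer."""
    return [
        p for p in points
        if any(within_buffer(p, s, radius_m) for s in stations)
    ]


# Hypothetical stations and candidate points (metres).
stations = [(0.0, 0.0), (2000.0, 0.0)]
points = [(300.0, 300.0), (600.0, 0.0), (2100.0, 100.0)]

inside = points_in_any_buffer(points, stations)
print(inside)
```

In a GIS workflow the same selection would normally be done with a buffer and clip operation; the sketch only makes the distance criterion explicit.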

Groundwater Sample Collection
For groundwater sample collection, clean 1-L plastic bottles were used, and the bottles were rinsed with distilled water before sampling. The water samples were collected from a running water source so that no stagnant water from the well was collected in the bottle, because water that stands in a well or storage tank can change in its bacteriological and physico-chemical properties. For this reason, only running water from hand pumps and submersible pumps was sampled. The groundwater samples were collected from 19 different stations throughout the study area. Pre-monsoon and post-monsoon data were collected from these sample points in May 2019 and October 2019, respectively. A few parameters, namely DO, EC, pH, and temperature, were tested on the spot, and the rest of the tests were done in the laboratory. TDS, TSS, TH, chloride, calcium, magnesium, alkalinity, turbidity, and COD were tested at the Environmental Engineering laboratory, Aligarh Muslim University, whereas the other parameters were tested at the analysis laboratory, Agra. All tests were performed as per the BIS norms (Table 2).

Relevant Data Used and Their Sources
This study was based on both primary and secondary data. Primary data comprise groundwater samples from 19 locations in two seasons, followed by laboratory tests and experiments. In addition, various secondary datasets were collected for different aspects of the study. For the machine learning analysis, the land cover map was prepared from Google Earth and verified using Landsat 8 data. The concentration of settlement and urban built-up and the greenery coverage of the city were prepared using Sentinel-2A imagery. Details of the data collected and their use are given in Table 3.

Selection of Chemical Parameters
The literature shows that the pre-monsoon and post-monsoon seasons have a considerable influence on the balance of chemical parameters in groundwater (Rao, 2017). Both the dry season and the wet season directly influence water stress and contamination, along with the various chemicals present in rock or introduced by infiltration (Haines et al., 2006). Groundwater contains a large number of geochemical compounds, of which a few physico-chemical parameters that may cause health problems were considered for this study: pH, alkalinity, dissolved oxygen (DO), total dissolved solids (TDS), total suspended solids (TSS), total hardness (TH), chloride (Cl⁻), electrical conductivity (EC), calcium (Ca), magnesium (Mg), turbidity, sulfate (SO₄²⁻), bicarbonate (HCO₃⁻), and sodium (Na). pH is one of the important parameters and determines whether the water is acidic or alkaline. Exposure to extreme pH can give rise to irritation of the eyes, skin, and mucous membranes (Ali and Ahmad, 2020). TDS refers to the organic and inorganic substances dissolved in water, some of which are essential. Very low TDS is unhealthy, but TDS beyond the permissible limit can also cause health problems (Burton and Cornhill, 1977; Schroeder, 1966). TSS may include silt and various decomposed matter in water. As per BIS and WHO guidelines, the presence of TSS is not permissible in drinking water. TSS may carry various unwanted microorganisms which can trigger nausea, diarrhea, headaches, etc. (Kjelland et al., 2015). Turbidity has no specific health implications, but it can act as a breeding ground for microbial development. High turbidity also makes it difficult to remove impurities from water by chlorination (Baghvand et al., 2010).
Chloride does not have a direct effect on human health, but people with existing health issues related to sodium chloride metabolism may be affected, and prolonged ingestion of chloride may raise health concerns. Alkalinity in drinking water is also an important parameter; exceeding its permissible limit may cause skin irritation and gastrointestinal diseases along with vomiting and nausea (Wynn et al., 2010). The presence of DO in water is a sign of good water quality; however, DO beyond the permissible limit can turn the water into a breeding ground for bacteria (Bhatia et al., 2015). Total hardness is an important property of water. It brings benefits, but beyond the permissible limit it interferes with soap lathering, pulses take longer than usual to boil, and water pipes narrow because of scale deposits. The health impact of TH is debated among researchers; nonetheless, it has been associated with kidney stones, cardiovascular diseases, digestive problems, etc. (Sengupta, 2013). Calcium is one of the important elements needed by the body. An adequate calcium intake is needed, and a lack of calcium can cause hypocalcemia, muscle cramps, dry skin, etc. (Pravina et al., 2013). Magnesium is another important geochemical constituent of water. Magnesium deficiency can cause hypomagnesemia, hypertension, osteoporosis, headaches, etc. (Watson et al., 2012; Al Alawi et al., 2018). EC indicates the presence of minerals in the water. High EC means a high mineral content, which may spoil the taste of water (Meride and Ayenew, 2016). Sulfate has a strong laxative effect on pregnant women, and infants are more prone to diarrhea when the water contains an undesirable amount of sulfate. It also degrades the taste of water (Heizer et al., 1997). Bicarbonate is a dependent ion in groundwater; it combines with sodium most of the time.
Exceeding the permissible limit of bicarbonate may result in hypernatremia and rebound alkalosis. The presence of sodium in water is a healthy sign. It can mitigate kidney damage, headache, hypertension, etc., but an excess can lead to heart disease, high blood pressure, etc. The spatial distribution of the selected groundwater parameters in both seasons is shown in Figs. 3 and 4.
Land cover plays an important role in the sustainable growth and development of the city. According to the 2019 assessment of the Forest Survey of India, only 1.83% of Aligarh district is under forest cover, whereas about 11.76% of Aligarh city is under vegetation cover. Various studies have shown that land use and land cover affect hydrometeorology, surface runoff, aquifer level, and water quality (Baidya and Avissar, 2002; Sajikumar and Remya, 2015; Schilling et al., 2008). Hence, there is a strong need to examine the relationship between groundwater quality and land cover pattern.
Several approaches to classifying and assessing groundwater quality are relevant here (Su et al., 2019; Tian and Wu, 2019). In the present study, the assessment of the impact of land cover on groundwater quality involves both supervised and unsupervised classification. Supervised classification uses information related to groundwater quality and quantifies its standard, whereas unsupervised classification is used to present the natural pattern of water quality. Supervised methods include WQI (Chaurasia et al., 2018), ANN (Gupta et al., 2019), and the RF classifier (Grbčić et al., 2020), and unsupervised methods include the self-organizing map (Liu et al., 2018).

Calculating Water Quality Index
Groundwater is considered the most important source of freshwater, and it occurs in the least polluted form because it is least interfered with by humans. It is estimated that around one-third of the world's population depends on groundwater. Aligarh city is a notable example, where no alternative source of drinking water is available.
Here, groundwater is the public's only source of domestic and drinking water; hence, checking its status is essential. WQI has been widely used for over a decade for the assessment of water quality (Gupta et al., 2017; Sadat-Noori et al., 2014; Şener et al., 2017; Mishra and Patel, 2003; Avvannavar and Shrihari, 2008; Vasanthavigar et al., 2010). WQI has several advantages: it is relatively easy to apply, new variables can easily be added, and it gives results in a well-defined range (Walsh and Wheeler, 2013). It is therefore regarded as a simple and reliable method. In this study, the rank weight method of WQI was used. The Bureau of Indian Standards (BIS) limits were taken as the standard for each parameter, and for parameters not described in BIS, the World Health Organization (WHO) standard was used. Total suspended solids (TSS) are not covered by either the BIS or the WHO guidelines; hence, the NEMA (Kenya) standard was used for TSS. A total of 14 geochemical parameters were taken into consideration. A rank from 1 to 4 out of five was assigned to each parameter based on its importance and previous literature (Batabyal and Chakraborty, 2015; Vasanthavigar et al., 2010). Turbidity was given the minimum rank of 1 because it is the least important parameter, and the maximum rank of 4 was given to pH, TDS, and sulfate. The relative weight of each parameter was then calculated by the following equation (Eq. 1):

$$W_i = \frac{R_i}{\sum_{i=1}^{n} R_i} \tag{1}$$

where $W_i$ is the relative weight, $R_i$ is the rank of each parameter, and $n$ is the number of parameters.

After the relative weight ($W_i$) was calculated, a quality rating ($Q_i$) was assigned to every parameter by dividing its concentration in each water sample by its standard as per the Bureau of Indian Standards and multiplying the result by 100 (Eq. 2):

$$Q_i = \frac{C_i}{S_i} \times 100 \tag{2}$$

where $Q_i$ is the quality rating, $C_i$ is the concentration of each chemical parameter in the water sample, and $S_i$ is the Indian drinking water standard for that parameter.

The next step is to determine the sub-index ($SI_i$) of each physico-chemical parameter of the water samples by using the following equation (Eq. 3):

$$SI_i = W_i \times Q_i \tag{3}$$

where $SI_i$ is the sub-index of the ith parameter and $W_i$ is the relative weight of the ith parameter.

Finally, the WQI was obtained by the following equation (Eq. 4):

$$WQI = \sum_{i=1}^{n} SI_i \tag{4}$$

After the water quality index was calculated, its value was divided into five categories ranging from excellent to unsuitable for drinking.

[Figure caption: Physico-chemical parameters of groundwater during pre-monsoon: a pH, b TDS, c TSS, d turbidity, e chloride, f alkalinity, g DO, h total hardness, i calcium, j magnesium, k EC, l sulfate, m bicarbonate, n sodium]
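The four steps of the rank weight method described above (relative weight, quality rating, sub-index, and summation) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses three of the fourteen parameters, the ranks for pH, TDS, and turbidity follow the text, and the standards and sample concentrations are illustrative placeholders rather than values from the study.

```python
def relative_weights(ranks):
    """Relative weight of each parameter: its rank divided by the sum of ranks."""
    total = sum(ranks.values())
    return {name: rank / total for name, rank in ranks.items()}


def water_quality_index(concentrations, standards, ranks):
    """WQI as the sum of sub-indices W_i * Q_i, with Q_i = C_i / S_i * 100."""
    weights = relative_weights(ranks)
    wqi = 0.0
    for name, concentration in concentrations.items():
        quality_rating = concentration / standards[name] * 100.0
        wqi += weights[name] * quality_rating  # sub-index of this parameter
    return wqi


# Ranks as stated in the text for these three parameters.
ranks = {"pH": 4, "TDS": 4, "turbidity": 1}
# Illustrative placeholder standards and sample values (assumptions).
standards = {"pH": 8.5, "TDS": 500.0, "turbidity": 1.0}
sample = {"pH": 7.65, "TDS": 600.0, "turbidity": 0.5}

wqi = water_quality_index(sample, standards, ranks)
print(round(wqi, 2))
```

With these placeholder inputs the quality ratings are 90, 120, and 50, so the index is (4 × 90 + 4 × 120 + 1 × 50) / 9. Because the weights sum to 1, a sample sitting exactly at every standard scores 100.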

Mapping of Land Cover
The land cover considered in the present study consists of three different raster layers: the concentration of settlement and urban built-up, greenery coverage, and micro-level land use. The micro-level land use data were obtained from Google Earth Engine for the year 2020, and supervised image classification was used for mapping. A total of nine pre-identified signatures were considered for classification. The signature categories include settlement, water body, vegetation, barren land, recreational area, fallow land, agricultural land, swamp, and open space. The concentration of settlement and urban built-up (CSUB) and greenery coverage (GC) were extracted from Sentinel-2A imagery (https://apps.sentinel-hub.com/eo-browser). The CSUB was derived from the band combination of 12 (shortwave infrared-2), 11 (shortwave infrared-1), and 4 (red). This composite is generally used to visualize settlement and urban built-up areas more clearly. The GC was derived using the following equation (Eq. 5):

$$GC = \frac{B8 - B4}{B8 + B4} \tag{5}$$

where B8 is the near infrared band and B4 is the red band. The value of GC ranges from −1 to 1. A negative value corresponds to water; values close to 0 correspond to barren areas of rock, sand, or snow. Positive values represent vegetation (nearly 0.2 to 0.4), whereas a high value indicates tropical forest.
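The greenery coverage index is a normalized difference of the near infrared (B8) and red (B4) bands. A minimal per-pixel sketch is shown below; the reflectance values are hypothetical, and returning 0 when both bands are zero is an assumption made here to avoid division by zero, not something specified in the study.

```python
def greenery_coverage(nir, red):
    """Normalized difference of the near infrared (B8) and red (B4) values."""
    denominator = nir + red
    if denominator == 0:
        # Assumption of this sketch: treat empty pixels as 0 (no signal).
        return 0.0
    return (nir - red) / denominator


def gc_raster(nir_band, red_band):
    """Apply the index pixel by pixel to two equally sized 2D grids."""
    return [
        [greenery_coverage(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Hypothetical 2 x 2 reflectance grids.
b8 = [[0.5, 0.1], [0.3, 0.0]]
b4 = [[0.1, 0.3], [0.3, 0.0]]

gc = gc_raster(b8, b4)
print(gc)
```

In this toy grid the first pixel is strongly positive (vegetation-like), the second is negative (water-like), and the third is 0 (barren-like), matching the interpretation of the value range given in the text.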

Models Used
The selected groundwater parameters were mapped first, followed by mapping of water quality index, and then land cover maps were generated as mentioned above. The WQI and land cover of the whole city and 500m buffer regions around the sampled stations for both pre-monsoon and post-monsoon were mapped. To know the relationship between WQI and land cover, machine learning algorithms were applied. Based on the model's classification accuracy, the spatial correlation between WQI and land cover was determined in this study.

Random Forest
Random forest is a tree-based classifier, an improved version of the bagging-based method introduced by Breiman (2001). It combines bagging trees with multivariate data, which makes it a better tool for pattern recognition in multivariate and large-scale data (Liaw and Wiener, 2002). Although RF has its own internal cross-validation (Breiman, 2001; Efron and Tibshirani, 1997), it is recommended to check the sensitivity of the model so that maximum accuracy can be obtained (Matthew, 2011). Random forest works in four basic steps:
1. Randomly select "k" features from the total of "m", where "k" < "m".
2. Calculate the node "d" using the best split point among the selected "k" features.
3. Split the node "d" into daughter nodes "dn" using the best split point.
4. Repeat the above steps until the Ith number of nodes is reached.

[Figure caption: Physico-chemical parameters of groundwater during post-monsoon: a pH, b TDS, c TSS, d turbidity, e chloride, f alkalinity, g DO, h total hardness, i calcium, j magnesium, k EC, l sulfate, m bicarbonate, n sodium]
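The RF classifier assigns as its output the class that receives the largest number of votes from the trees (Bonissone et al., 2010). Therefore, for an input "x", the output "y" is computed from the majority vote of the ensemble, as shown in Eq. (6):

$$y = \underset{c}{\arg\max} \sum_{t=1}^{T} I\left(h_t(x) = c\right) \tag{6}$$

where $h_t(x)$ is the class predicted by the $t$th tree, $T$ is the number of trees, and $I(\cdot)$ is an indicator function defined as:

$$I\left(h_t(x) = c\right) = \begin{cases} 1, & \text{if } h_t(x) = c \\ 0, & \text{otherwise} \end{cases}$$

In the present study, the good water quality and poor water quality locations were indicated by 'YES' and 'NO', respectively.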
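The voting rule of the ensemble can be illustrated with a toy forest. In the sketch below each "tree" is a hypothetical one-split stump on a land cover feature; the feature names (`gc`, `csub`), thresholds, and sample values are invented for illustration and do not come from the trained model in the study. Only the majority-vote step is shown, not tree training.

```python
from collections import Counter


def stump(feature, threshold, label_if_above, label_if_below):
    """Build a hypothetical one-split 'tree' that votes 'YES' or 'NO'."""
    def predict(sample):
        if sample[feature] > threshold:
            return label_if_above
        return label_if_below
    return predict


def forest_predict(forest, sample):
    """Return the class that receives the most votes across the trees."""
    votes = [tree(sample) for tree in forest]
    # most_common(1) yields the top class; ties fall to the first class seen.
    return Counter(votes).most_common(1)[0][0]


# Toy ensemble of three stumps (all thresholds are assumptions).
forest = [
    stump("gc", 0.2, "YES", "NO"),
    stump("csub", 0.5, "NO", "YES"),
    stump("gc", 0.4, "YES", "NO"),
]

green_pixel = {"gc": 0.3, "csub": 0.2}
built_up_pixel = {"gc": 0.1, "csub": 0.8}

print(forest_predict(forest, green_pixel))
print(forest_predict(forest, built_up_pixel))
```

For the first pixel two of the three stumps vote 'YES', so the ensemble output is 'YES'; for the second all three vote 'NO'.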

Artificial Neural Network
The artificial neural network is a technique that mimics the role of neurons in the human brain (Diamantopoulou et al., 2005; Wu et al., 2014) and is fast and reliable. The whole process works in three layers: the input layer, the hidden layer, and the output layer (Diamantopoulou et al., 2005). Both training and testing data are required; during training, the weights of the variables are adjusted iteratively (Isiyaka et al., 2019). In this study, a multi-layer perceptron feed-forward ANN with a back-propagation algorithm was used to identify the most polluting contributor to groundwater quality and to model the percentage contribution of each individual factor to the pollution.
In this regard, two input combination models were produced so that the most statistically significant API with high accuracy could be achieved (Alagha et al., 2014; Sarkar and Pandey, 2015). ANN was applied both to the whole city area and within the buffer region. The data were divided into training (80%) and testing (20%) datasets. To test the multi-layer perceptron ANN model, the root mean square error and the coefficient of determination were used (Eqs. 7 and 8):

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - y_i\right)^2} \tag{7}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(x_i - y_i\right)^2}{\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2} \tag{8}$$

where $x_i$ is the observed data, $y_i$ is the predicted data, $\bar{x}$ is the mean of the observed data, and $N$ is the total number of observations.
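The two test statistics named above can be computed directly from paired observed and predicted values. The sketch below is illustrative only: the data are hypothetical, and the coefficient of determination is written in the common form of one minus the ratio of residual to total sum of squares, which is an assumption about the variant used.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    squared_errors = ((x - y) ** 2 for x, y in zip(observed, predicted))
    return math.sqrt(sum(squared_errors) / n)


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed form)."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((x - y) ** 2 for x, y in zip(observed, predicted))
    ss_tot = sum((x - mean_observed) ** 2 for x in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted values.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]

print(rmse(observed, predicted))
print(r_squared(observed, predicted))
```

Here only the last prediction is off by one unit, so the RMSE is the square root of 1/4 and the coefficient of determination is 1 − 1/5.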

Inventory Data Preparation and Data Resampling
Inventory data preparation is essential for executing machine learning techniques and for validating the models. To apply random forest and artificial neural network, a dataset of water quality determining factors was considered: the micro-level land use of the study area, the concentration of settlement and urban built-up, greenery coverage, and the WQI of the area. Using GIS, 100 good water quality points and 100 poor water quality points were digitized within the study area. The good and poor water quality sample points were subsequently divided into training (80%) and testing (20%) data. Thus, for the whole city, 80 good and 80 poor water quality points were included in the training data, while the remaining 20 good and 20 poor water quality points were allocated for validation of the models. Similarly, for the demarcation of water quality within the buffer region, 140 of the 200 points were used. These points were divided into 80% training and 20% testing data, with the training set counted as 114 points and the testing set as 28 points (Table 4).
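The 80/20 division of the whole-city inventory can be sketched as a class-wise (stratified) random split, which reproduces the 80 good plus 80 poor training points and 20 good plus 20 poor testing points described above. The point records, the fixed random seed, and the function name are assumptions of this sketch; the text does not state how the points were drawn.

```python
import random


def stratified_split(points, train_fraction=0.8, seed=0):
    """Split points into training and testing sets, class by class."""
    rng = random.Random(seed)  # fixed seed only for reproducibility here
    by_class = {}
    for point in points:
        by_class.setdefault(point["label"], []).append(point)

    train, test = [], []
    for group in by_class.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


# Hypothetical inventory: 100 good and 100 poor water quality points.
points = (
    [{"id": i, "label": "good"} for i in range(100)]
    + [{"id": i, "label": "poor"} for i in range(100, 200)]
)

train, test = stratified_split(points)
print(len(train), len(test))
```

Splitting within each class keeps the good and poor classes balanced in both subsets, which a single shuffle of all 200 points would not guarantee.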

Feature Selection
In machine learning, the selection of features is a crucial step of model building (Booker and Snelder, 2012). Feature selection is required to evaluate the relevance of the selected factors for model building. In the present study, the ReliefF ranking evaluator was used to determine the importance of each factor. ReliefF ranking is a probabilistic approach used in data classification that assesses the conditional dependence and discriminative power of the identified factors (Belgiu and Drăguţ, 2016). A higher rank indicates greater significance, and a value of "0" indicates that the factor is irrelevant for modeling.

Receiver Operating Characteristics Curve
The ROC curve is a widely accepted validation method in various fields, including geospatial analysis; it shows the trade-off between specificity and sensitivity (Chen et al., 2020). The farther the curve lies from the diagonal of the ROC space, toward the upper-left corner, the more accurate the test. The ROC does not depend on the class distribution. A common numeric summary of the ROC is the area under the curve (AUC), where 1 − specificity and sensitivity are placed on the "x" and "y" axes, respectively. A greater value of AUC indicates higher accuracy of the result. In the accuracy measures, TP is the true positive, TN is the true negative, and both TP and TN represent the number of pixels correctly classified; FP is the false positive, FN is the false negative, and both FP and FN represent the number of pixels that are incorrectly classified; W_c is the number of pixels that are correctly classified as good water quality and poor water quality, W_exp is the expected agreement's value, X_ei is the predicted value, X_oi is the observed value, and n is the number of datasets.
Table 4 Number of training and testing points: whole study area, 160 training and 40 testing; 500m buffer, 114 training and 28 testing
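The confusion-matrix quantities defined above combine into the usual accuracy measures. The sketch below assumes the standard definitions of sensitivity, specificity, overall accuracy, and Cohen's kappa, with chance agreement computed from the row and column totals. The counts are hypothetical and are not taken from the study's tables.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy measures from binary confusion-matrix counts."""
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    oac = (tp + tn) / total        # overall accuracy (observed agreement)
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (oac - expected) / (1 - expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "oac": oac,
        "kappa": kappa,
    }

# Hypothetical counts for 40 testing points (illustration only).
metrics = classification_metrics(tp=18, tn=14, fp=6, fn=2)
print(metrics)
```

Kappa is lower than overall accuracy whenever part of the agreement could arise by chance, which is why both measures are reported side by side.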

Water Quality Assessment Parameters and Their Proportion
The prevailing condition of Aligarh city is that most of the samples rank as poor water quality against the BIS or WHO standards. In the pre-monsoon season, pH exceeds the permissible limit at 5 stations, DO exceeds the permissible limit in every sample, and TDS exceeds it at 10 stations, while TH lies within the permissible limit. Magnesium surpasses the permissible limit at 6 stations, EC and alkalinity exceed their permissible limits at 18 sampling stations, sulfate crosses its limit at 1 station, and 16 stations cross the sodium and magnesium permissible limits, whereas calcium, chloride, turbidity, and bicarbonate are within the limit. In the post-monsoon season, some values increased while others decreased.
Only one station crossed the permissible limit of pH, DO again crossed its permissible limit at every station, 9 stations surpassed the TDS permissible limit, and 14 stations exceeded the TSS limit. The magnesium limit is exceeded at 6 stations, the EC and alkalinity permissible limits are exceeded at 18 stations, and 7 stations exceeded the sodium permissible limit, while TH, calcium, chloride, turbidity, sulfate, and bicarbonate are within the permissible range. Water quality was observed to be comparatively poor throughout the year in areas near infiltration points with regular seepage, in areas where very little surface water penetrates, and in areas where industrial hazardous wastes are dumped deep underground via boreholes and find their way to the aquifer. In contrast, areas with large open spaces or a higher waterlogging level in the monsoon season have better rainwater harvesting capacity; hence, their water quality is better in the post-monsoon season.

Land Cover Classification
In the present study, the land cover was defined by three parameters, i.e., concentration of settlement and urban built-up, greenery coverage, and micro-level land use (Fig. 5). The image classification shows that about 80% of the total city area is covered by urban settlement and built-up area, whereas 10% is covered by vegetation, 7% by barren land, and the remaining 3% by other land-use types including agricultural land, open space, recreational area, water bodies, and swamp. In terms of greenery coverage, only the campus area (i.e., Aligarh Muslim University) is covered by vegetation, and other areas have the least vegetation cover. In order to train and test the machine learning models, the land cover types within 500m buffer areas were mapped (Fig. 6). The details of the pixel-based correlation between the land cover of the study area and the water quality index are discussed in section 4.6.

Assessed Water Quality Index
WQI was calculated by assigning each parameter a weight according to its importance in water quality assessment. It is generally recommended to examine at least four parameters to assess WQI, and in this study, a total of fourteen parameters were taken into consideration, out of which SO4, pH, and TDS were given the maximum weight of 0.1052, whereas alkalinity, total suspended solids, chloride, bicarbonate, and sodium were given a weight of 0.0789. The minimum weight of 0.0526 was allotted to DO, total hardness, calcium, magnesium, and EC. All 19 sampling stations were grouped into five categories based on WQI in both the pre-monsoon and post-monsoon seasons. The computed values of WQI range from 78.54 to 209.88 and from 78.39 to 179.98 during pre-monsoon and post-monsoon, respectively (Table 5).
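How such weights enter an index can be sketched as follows. The sketch assumes the common weighted arithmetic form, WQI = sum of W_i * q_i with quality rating q_i = 100 * C_i / S_i, which this section does not restate and which may differ from the formula used in the study. The weights come from the text above; the concentrations and standard values are hypothetical placeholders.

```python
def weighted_wqi(parameters):
    """Weighted arithmetic index: sum of weight * (100 * concentration / standard).

    `parameters` maps a name to (weight, concentration, standard).
    """
    return sum(w * 100.0 * c / s for w, c, s in parameters.values())

# Three of the fourteen parameters, for illustration only. Weights are the
# ones quoted in the text; concentrations and standards are hypothetical.
sample = {
    "TDS": (0.1052, 650.0, 500.0),
    "chloride": (0.0789, 120.0, 250.0),
    "calcium": (0.0526, 60.0, 75.0),
}
partial_index = weighted_wqi(sample)
print(round(partial_index, 2))
```

The full index would sum over all fourteen parameters; a parameter that exceeds its standard contributes more than 100 times its weight, so heavily weighted parameters such as TDS dominate the result.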
Relating the land cover of the study area to the WQI at each location, poor water quality was observed during the pre-monsoon season at Jama masjid, Ambedkar park, and Mulla Para. Jama masjid has the maximum elevation in the city, at which percolation of freshwater does not occur; it also falls within the industrial hub of the lock industry, where hazardous wastes have reportedly been dumped, secretly or sometimes openly, into boreholes that are no longer in use. Ambedkar Park has the lowest elevation in the entire city, and sanitation and hygiene are a daily issue there; hence, overexploitation of groundwater along with direct discharge into groundwater may be a cause. Mulla Para is situated just beside the city solid waste department (A2Z solid waste department), where stagnant drain water along with heaps of garbage degrades the water quality of the entire region. Very poor water quality was recorded at Pratibha colony, which has a dense population along with barren land that is of no use in the pre-monsoon season. Kali deh is the only region where the physical properties of the water were found to be damaged. The area is near many water bodies that hold stagnant, filthy water throughout the year, along with a pumping station occupying a large area where sewerage water is pumped out throughout the year.
In the post-monsoon season, the overall water quality of the city improved, with the maximum WQI falling to 179.98 from 209.88 in the pre-monsoon season. However, based on the arithmetic mean category, the post-monsoon season shows more poor water quality than the pre-monsoon season: 73.68% of the samples had good water quality in the pre-monsoon season, compared with 63.15% in the post-monsoon season. Similarly, the share of poor water quality increased in the post-monsoon season (36.84%) compared to the pre-monsoon season (26.31%). This phenomenon may be due to the percolation of impure water from the surface to the underground (Fig. 7). Despite the overall better water quality compared to the pre-monsoon season, poor water quality was noticed at Gandhi eye hospital, near student union hall, B.S.J. Hall, and Avas Vikas colony in the post-monsoon season. Except for Avas Vikas colony, all three places fall within the civil line sector, which is a low-lying area; hence, the rainwater percolation rate is higher in these areas, which allows rainwater mixed with impure wastewater to reach and degrade the groundwater.

Feature Selection and Binary Classification for ML Models
In the present study, the training and testing datasets were prepared from the locations of good and bad water quality, and a multi-values to point extraction tool was run so that each point receives the pixel value of every determining factor. In the training dataset, the binary values 1 and 0 denote good and bad water quality, respectively. The ReliefF evaluator was used to calculate the rank of the different factors responsible for deteriorating water quality. In the pre-monsoon season, all three determining factors were applied both in the whole city area and within the buffer region, and the same process was followed for the post-monsoon season.
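The point extraction step can be sketched as below. This is an illustrative stand-in for the GIS tool, assuming north-up rasters that share one grid, stored as nested lists with a top-left origin. The function name `extract_values`, the layer names, the grid values, and the coordinates are all hypothetical.

```python
def extract_values(rasters, origin_x, origin_y, pixel_size, points):
    """For each (x, y, label) point, read the pixel value of every layer.

    Returns one row per point: the layer values followed by the class label
    (1 = good water quality, 0 = bad water quality).
    """
    rows = []
    for x, y, label in points:
        col = int((x - origin_x) // pixel_size)
        row = int((origin_y - y) // pixel_size)  # rows count down from the top
        values = [layer[row][col] for layer in rasters.values()]
        rows.append(values + [label])
    return rows

# Hypothetical 2 x 2 layers on a grid with top-left corner (0, 20), 10 m pixels.
rasters = {
    "settlement": [[3, 1], [2, 0]],
    "greenery": [[0, 2], [1, 3]],
    "land_use": [[5, 5], [7, 8]],
}
points = [(5, 15, 1), (15, 5, 0)]
table = extract_values(rasters, 0, 20, 10, points)
print(table)  # [[3, 0, 5, 1], [0, 3, 8, 0]]
```

Each resulting row is one training or testing record: three land cover values as predictors and the binary water quality class as the target.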
In the pre-monsoon season, the ReliefF ranker assigns a rank of 0.0139 to the concentration of settlement and urban built-up, 0.0123 to greenery coverage, and 0 to micro-level land use during model training within the whole study area, and a rank of 0 to all three factors during model testing. The training dataset within the 500m buffer region gives a rank of 0.0340 for the concentration of settlement and urban built-up, 0.0179 for greenery coverage, and again close to 0 (i.e., 0.0046) for micro-level land use, whereas the testing dataset gives a rank of 0.0456 for the concentration of settlement and urban built-up, 0.0119 for greenery coverage, and exactly 0 for micro-level land use. In the post-monsoon season, the training data give a rank of 0.0263 for the concentration of settlement and urban built-up, 0.0522 for greenery coverage, and 0 for micro-level land use. The testing dataset gives ranks of 0, 0.0295, and 0.0281 for the concentration of settlement and urban built-up, greenery coverage, and micro-level land use, respectively. Similarly, for the buffer region, the result shows that the concentration of settlement and urban built-up, greenery coverage, and micro-level land use hold ranks of 0.0453, 0.0371, and 0 during model training and 0, 0.0089, and 0.0218 during model testing, respectively.

Models Applied and Their Comparison
In the present study, two machine learning algorithms, namely artificial neural network (ANN) and random forest (RF) classifier, were applied to assess the impact of land cover on groundwater quality. A statistical approach was used to overlay the factors contributing to good and bad water quality on the existing WQI data, which record the water quality of the city, for assessing and validating the results. Initially, three land cover factors were evaluated, and all three were then integrated to prepare the final result of each parameter's contribution to deteriorating water quality. The performance of these two models in predicting the groundwater quality of Aligarh city was evaluated and compared using ROC, and arithmetic measures including MAE, RMSE, and Kappa statistics were also used to determine the suitability of each model for water quality assessment. Table 7 and Table 8 show the results of the applied models for the whole study area and within the buffer region for both seasons. Within the whole study area during the pre-monsoon season, the training data show that the ROC is higher for RF (0.800) than for ANN (0.594). Among the arithmetic measures, the MAE of RF (0.3721) is lower than that of ANN (0.4694), and the RMSE of RF (0.4236) is lower than that of ANN (0.4865). Kappa statistics (K) and OAC were also estimated: K is higher in RF (0.418) than in ANN (0.1422), and the OAC is higher in RF (0.6937) than in ANN (0.59375). The testing data reveal that the ROC is greater for the RF classifier (0.838) than for ANN (0.665), the RMSE is lower for the RF classifier (0.3973) than for ANN (0.4595), and the MAE is lower for RF (0.3432) than for ANN (0.4301). The K is greater in RF (0.55) than in ANN (0.25), and the OAC is also higher in RF (0.8) than in ANN (0.6).
Fig. 7 Groundwater quality of the Aligarh city a pre-monsoon, b post-monsoon, c pre-monsoon within 500m buffer from sampled stations, d post-monsoon within 500m buffer from sampled stations
Within the whole study area during the post-monsoon season, the training data show that the ROC is higher in RF (0.863) than in ANN (0.741). Among the arithmetic measures, the RMSE is lower in RF (0.3848) than in ANN (0.443), and the MAE is lower in RF (0.3073) than in ANN (0.3953). The K is higher in RF (0.5113) than in ANN (0.3451), and the OAC is also higher in RF (0.75625) than in ANN (0.675). Using the testing data, the ROC is greater in RF (0.862) than in ANN (0.755), the RMSE is lower in RF (0.3871) than in ANN (0.427), and the MAE is a little lower in RF (0.3341) than in ANN (0.3728). The K is higher in RF (0.5745) than in ANN (0.4565), and the OAC is also a little higher in RF (0.7875) than in ANN (0.775).
In addition to the whole study area, the models were applied within the 500m buffer for both the pre-monsoon and post-monsoon seasons. Using the training data in pre-monsoon, the ROC is 1.00 for both the RF and ANN models. The RMSE is very low in both RF (0.0028) and ANN (0.0075). The MAE is negligibly higher in RF (0.008) than in ANN (0.0075), the K is 1.000 for both RF and ANN, and the OAC is very close to 1 for both models (0.99 for RF and 0.99 for ANN). Similar results were found using the testing dataset for ROC and K, which are 1.00 for both models. The RMSE is a little higher in RF (0.0163) than in ANN (0.0124), and the MAE shows almost the same difference, where RF (0.0163) is slightly higher than ANN (0.123). The OAC is slightly higher in RF (0.9821) than in ANN (0.9642).
Using the training data in post-monsoon, it was found that the ROC is higher in RF (0.7743) than in ANN (0.6964). The RMSE is lower in RF (0.3796) than in ANN (0.4366), and the MAE is lower in RF (0.3029) than in ANN (0.382). The K is greater in RF (0.4378) than in ANN (0.2815), and the OAC is also greater in RF (0.7743) than in ANN (0.6964).
From the above discussion, it is found that the random forest (RF) gives a more accurate result than the artificial neural network (ANN) in all cases, including within the whole study area, within the 500m buffer region, and during both the pre-monsoon and post-monsoon seasons. The results also reveal that classification within the 500m buffer of the sampled stations is more accurate, with ROC and K values very close to 1.00, very low RMSE, and very high OAC.

Correlating Land Cover Classification and Groundwater Quality
The nature and pattern of land cover are directly associated with groundwater quality; hence, examining them is needed to understand the current ecological problems properly. At present, the water quality of Aligarh city is rather poor as per the WQI analysis. To assess the impact of land cover, the water quality results were combined with the remotely sensed data and the machine learning models. The analysis of the previous section shows that both models are accurate and, hence, can be accepted. The pixel-based classification of land cover and water quality index using RF and ANN provides a spatial association between the two, which may help in understanding the impact of land cover on water quality.
For the whole city, the results show a strong correlation between land cover and water quality. During the post-monsoon season with training data, the highest correlation with WQI exists for the concentration of settlement and urban built-up (0.0340), followed by greenery coverage (0.0179) and micro-level land use (0.00469), whereas the testing dataset shows the highest relation between the concentration of settlement and urban built-up (0.0456) and WQI, followed by greenery coverage (0.0119), while micro-level land use had no impact on water quality. The pre-monsoon training data show a positive relation of greenery coverage (0.0123) and concentration of settlement and urban built-up (0.0139) with WQI, but micro-level land use failed to establish any relation with it. The pre-monsoon testing dataset reveals a negative correlation of all three parameters with WQI; hence, these parameters do not appear to affect the water quality of the area in that dataset. Except for the pre-monsoon testing dataset, the other three datasets of both seasons showed a positive correlation with the two leading factors. The concentration of settlement and urban built-up is the most highly correlated factor and hence the one that most affects water quality, followed by greenery coverage (Fig. 8).
The area within the 500m buffer shows a different result from the whole study area assessment. The post-monsoon training dataset within the buffer area shows a high value for the concentration of settlement and urban built-up, followed by greenery coverage. Unlike these two parameters, micro-level land use has a negative value and hence shows no effect on water quality. The post-monsoon testing dataset shows a high value for greenery coverage, while the concentration of settlement and urban built-up shows a negative value. The pre-monsoon training dataset indicates a greater relation for the concentration of settlement and urban built-up (0.05223) than for greenery coverage (0.02634), whereas micro-level land use shows no relation. In the pre-monsoon testing dataset, the concentration of settlement and urban built-up, greenery coverage, and micro-level land use show almost the same positive relation (Fig. 9).

Discussions
Water is the essence of life. A small areal extent combined with a high population and maximum built-up coverage is an indication of poor urban quality, which may largely affect environmental components such as groundwater (Escobar et al., 2022;Singh et al., 2022). Hence, precise techniques are called for to reduce the damage caused by humans. Developing suitable models is the prime motive for damage assessment and management. With the help of machine learning algorithms, good and bad water quality areas can be easily recognized and precautions can be taken over ongoing activities (Ourarhi et al., 2022). Here, diverse techniques and models have been presented to observe the water quality of Aligarh city at both the macro- and micro-level, where pixel-based calculations have been performed. At present, extensive work is being done in the field of groundwater along with model building (Chen et al., 2020;Belgiu and Drăguţ, 2016). Viewed in this way, the present study combined and compared different techniques, namely WQI along with machine learning algorithms such as random forest (RF) classifier and artificial neural network (ANN), to assess the land cover induced water quality in Aligarh city. In this regard, a mathematically valid model with a good prediction tool is needed. Recently, ensemble techniques have been globally appreciated for the assessment of hydro-geochemistry and its determining factors (Chen et al., 2020;Belgiu and Drăguţ, 2016;Tung and Yaseen, 2020;Najah et al., 2021;Kumar et al., 2020;Missaoui et al., 2022). Although a lot of work has been done in previous years, most of it focused only on physico-chemical analysis (Escobar et al., 2022;Singh et al., 2022). Thus, this study is the very first attempt to evaluate satellite data along with physico-chemical parameters of water quality to find a correlation between land cover and groundwater quality. Also, the present study focused on applying machine learning algorithms for spatial classification and establishing their relation using different statistical measures.
At present, machine learning has become a basic need for researchers; hence, many researchers are attracted to model building or prediction (Hameed et al., 2017;Wang et al., 2017;Najafzadeh Ali et al., 2021;Alagha et al., 2014). The artificial neural network is recognized for its good predictive performance, where it works like human neurons and turns big data into precise results (Isiyaka et al., 2019;Kadam et al., 2019;Diamantopoulou et al., 2005). The random forest, an ensemble of classification trees, is well recognized for its strong predictive performance (Tyralis et al., 2019;Baudron et al., 2013;Belgiu and Drăguţ, 2016;Bui et al., 2020;Chen et al., 2020). So, these two models were applied to identify the impacts of land cover on WQI in Aligarh city. Based on the output of these two models, it was found that the random forest classifier obtained higher accuracy (pre-monsoon training = 69.38%, pre-monsoon testing = 80%, post-monsoon training = 75.62%, post-monsoon testing = 78.75%) than the artificial neural network model (pre-monsoon training = 59.38%, pre-monsoon testing = 60%, post-monsoon training = 67.5%, post-monsoon testing = 77.5%). Similar results were obtained for the buffer region, where the accuracy of random forest (pre-monsoon training = 99.55%, pre-monsoon testing = 98.21%, post-monsoon training = 77.43%, post-monsoon testing = 80.35%) was higher than that of the artificial neural network (pre-monsoon training = 99.11%, pre-monsoon testing = 96.43%, post-monsoon training = 69.64%, post-monsoon testing = 78.57%). Hence, based on the comparative analysis, it is clear that the random forest classifier provides more accurate results than ANN.
Fig. 9 Correlation between land cover and water quality index within the buffer areas in pre-monsoon and post-monsoon season
The aim of this study was to detect changes in water quality with varying land cover characteristics over the study area with the help of these two applied models. The output of the models indicated that both of them can be considered for detecting the impact of land cover factors on water quality and for developing precautionary strategies to reduce the impacts. From the perspective of comparative analysis, the random forest classifier is the preferred model, since it outperformed the ANN model in terms of ROC-AUC, MAE, RMSE, OAC, and Kappa index.

Conclusion
Aligarh comes under the category of class-I cities in India, and more specifically, it comes under Delhi NCR (National Capital Region); hence, the population stress is increasing day by day. Apart from that, the industrial history of Aligarh with lock and brass smithy is well recognized, along with many newer industrial sectors, which makes the management and supply of water even more stressful. In this paper, the physico-chemical properties of groundwater were analyzed and the water quality index (WQI) was computed for the study area to assess the role of land cover on water quality, and the results were validated using artificial neural network and random forest classifier. The present assessment reveals that:
• As per the BIS norms, most of the samples fell under the poor water quality category. However, the WQI shows that most of the city comes under the good water quality category, but the areas near the sewer station or solid waste management, like Mulla Para or Kali deh, have comparatively poorer water quality than other areas. On overall comparison, the pre-monsoon season outperforms the post-monsoon season in terms of good water quality.
• Based on the ReliefF rank evaluator, the concentration of settlement and urban built-up is the most important factor responsible for the deteriorating water quality in the whole city, followed by greenery coverage, i.e., the water quality of green coverage areas was found to be good in comparison to non-vegetated areas. The results within the 500m buffer areas showed that greenery coverage is the most determining factor, followed by concentration of settlement and urban built-up.
• The machine learning-based analysis presented the role of greenery coverage and concentration of settlement and urban built-up in groundwater quality with higher classification accuracy (higher K value and OAC) and lower errors (lower MAE and RMSE values).
• After the comparative analysis of the classification results of the applied models, it was evident that the areas where stagnant water stays for the longest time, alongside the places where the chances of percolation are very low, have comparatively poorer water quality (WQI) than the other areas in the city.
• As per the analysis of these models, it is evident that the concentration of settlement is a significant controlling factor in determining groundwater quality. Hence, it can be concluded that these physico-chemical data, along with the suitable models in this study, may be advantageous for the planning of the city and other areas with similar geo-environmental conditions.

Recommendation
After the complete assessment, a few recommendations can be suggested in this regard:
• Urban encroachment should be monitored.
• The industrial sector should be moved away from residential areas.
• Tube well depth needs to be increased in areas with poor water quality.
• Tree plantation needs to be prioritized at the administrative level, as it is a nature-based solution to the existing problems.