To assess the impact of land cover on the groundwater quality of the study area, the following steps were taken. First, a land cover map of the study area was prepared. Groundwater sampling locations were then selected with reference to this map. A general preview in Google Earth Engine showed that the city has grown haphazardly in all directions without proper management, and the resulting stress on groundwater is becoming more prominent with each passing day. Hence, land cover was used as the basis for assessing water quality in the study area. To prepare the land cover of the study area, three parameters were selected: the concentration of settlement and urban built-up (CSUB), greenery coverage (GC), and micro-level land use (MLU). The first two parameters were prepared from Sentinel-2A satellite imagery for the year 2020, and the last was derived using pixel-based supervised classification. For the supervised classification, nine classes were considered: agricultural land, barren land, settlement, water bodies, swamp, vegetation, agricultural fallow land, vegetation fallow land, and recreational area. Keeping the micro-level land use in mind, groundwater samples were collected according to the location of the public water supply system and the surrounding land use pattern. Most sampling sites were chosen so that almost the entire city was covered; however, a few areas are dominated by private submersible pumps, so samples were collected from those private water sources as well. The sampling points were recorded as ‘x’, ‘y’ GPS coordinates so that the results could be used for data visualization. The WQI was then calculated to identify the water quality of the area.
To evaluate the correlation between land cover and groundwater quality, a 500 m buffer was created around each sampled station to delineate water quality. To assess accuracy, two machine learning techniques, artificial neural network (ANN) and random forest (RF), were applied both to the entire city and within the buffer area. The flow chart in Fig. 2 presents the details of the methodology applied in the present study.
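The buffer step can be illustrated with a minimal Python sketch that tests whether a point falls within 500 m of a sampled station. It assumes coordinates in decimal degrees and a spherical Earth (haversine distance); the function names and the example coordinates are hypothetical, and a GIS buffer tool would normally be used in practice.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def in_buffer(point, station, radius_m=500.0):
    """True if `point` (lat, lon) lies within `radius_m` of `station` (lat, lon)."""
    return haversine_m(point[0], point[1], station[0], station[1]) <= radius_m


# Hypothetical coordinates, chosen only to illustrate the test.
station = (27.8790, 78.0542)
near_point = (27.8800, 78.0542)  # roughly 111 m north of the station
far_point = (27.8890, 78.0542)   # roughly 1.1 km north of the station

print(in_buffer(near_point, station), in_buffer(far_point, station))
```

Only points for which `in_buffer` returns true would enter the "within buffer" analysis in this sketch.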
3.1 Groundwater sample collection
For groundwater sample collection, clean 1 L plastic bottles were used, and each bottle was rinsed with distilled water before sampling. The samples were collected from a running water source so that no stagnant water entered the bottle, because stagnant or stored water (in a tank, for example) can change in its bacteriological and physico-chemical properties. For this reason, only running water from hand pumps and submersible pumps was considered. Table 1 shows that a total of 19 groundwater sampling stations were selected. These are mainly bore wells that supply water to the entire city; a few samples were collected from submersible pumps in areas where people mainly rely on them for their water supply. Pre-monsoon and post-monsoon samples were collected from these points in May 2019 and October 2019, respectively. A few parameters, namely DO, EC, pH, and temperature, were tested on the spot, and the remaining tests were done in the laboratory. TDS, TSS, TH, chloride, calcium, magnesium, alkalinity, turbidity, and COD were tested at the Environmental Engineering laboratory, Aligarh Muslim University, whereas the other parameters were tested at the Analysis laboratory, Agra. All tests were performed as per the BIS norms (Table 2).
Table 1
Groundwater sampled points and their codes
| Sl. No. | Place | Coordinates | Sl. No. | Place | Coordinates |
|---|---|---|---|---|---|
| 1 | Double tankey colony, Shahjamal | 27°52'44.29"N, 78°3'15.02"E | 11 | Dodhpur chauraha | 27°54'21.88"N, 78°5'7.65"E |
| 2 | Avas Vikas colony | 27°53'50.23"N, 78°3'58.24"E | 12 | Near Student Union hall | 27°54'25.85"N, 78°4'31.76"E |
| 3 | Pratibha colony | 27°53'37.48"N, 78°3'21.93"E | 13 | Press colony | 27°53'47.74"N, 78°4'47.09"E |
| 4 | Malkhan Singh Hospital | 27°53'21.13"N, 78°4'15.21"E | 14 | Duda colony, Sootmill | 27°54'21.88"N, 78°3'12.68"E |
| 5 | Jumma Masjid, Upparkot | 27°52'46.42"N, 78°3'59.99"E | 15 | B.S.J hall | 27°55'8.96"N, 78°4'5.87"E |
| 6 | Dhobi Ghat, Jamalpur | 27°55'39.64"N, 78°4'44.57"E | 16 | J.N.M.C | 27°55'2.30"N, 78°5'20.47"E |
| 7 | Shiwalik Ganga Phase IV | 27°53'32.93"N, 78°6'1.06"E | 17 | Mulla para | 27°51'46.80"N, 78°3'37.29"E |
| 8 | Vaishno Royal apartment, Ramghat road/ Surendranagar | 27°52'54.12"N, 78°5'11.61"E | 18 | Kali Deh | 27°52'6.69"N, 78°5'2.22"E |
| 9 | Gandhi Eye Hospital | 27°53'21.64"N, 78°5'0.27"E | 19 | Patel Nagar | 27°52'19.27"N, 78°4'18.01"E |
| 10 | Ambedkar Park, Jiwangarh | 27°54'38.99"N, 78°5'40.53"E | | | |
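Because the station coordinates in Table 1 are given in degrees, minutes, and seconds, they need to be converted to decimal degrees before they can be plotted or buffered in most GIS software. The following Python sketch shows one way to do this; the function name and the parsing pattern are assumptions made for illustration, not part of the original workflow.

```python
import re


def dms_to_decimal(dms):
    """Convert a string such as 27°52'44.29"N to signed decimal degrees."""
    match = re.match(r"""\s*(\d+)°\s*(\d+)'\s*([\d.]+)"\s*([NSEW])""", dms)
    if match is None:
        raise ValueError(f"unrecognised coordinate: {dms!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # Southern and western hemispheres are negative by convention.
    return -value if hemisphere in "SW" else value


# Station 1 of Table 1 (Shahjamal).
latitude = dms_to_decimal("27°52'44.29\"N")   # about 27.87897
longitude = dms_to_decimal("78° 3'15.02\"E")  # about 78.05417
print(round(latitude, 5), round(longitude, 5))
```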
Table 2
Details about the test method of selected parameters
| Sl. No. | Parameters | Test methods |
|---|---|---|
| 1 | pH | pH meter |
| 2 | Turbidity | Nephelometer |
| 3 | EC | Conductivity meter |
| 4 | TDS | Filtration method |
| 5 | TSS | Evaporation method |
| 6 | Alkalinity | Indicator method |
| 7 | DO | Winkler’s method |
| 8 | Sodium | IS:3025(Part 45)-1993 |
| 9 | Potassium | IS:3025(Part 45)-1994 |
| 10 | Sulphate | IS:3025(Part 24)-1986 |
| 11 | Carbonate | IS:3025(Part 51)-2001 |
| 12 | Bicarbonate | IS:3025(Part 51)-2001 |
| 13 | Magnesium | Flame AAS |
| 14 | Chloride | Spectrophotometric |
3.2 Relevant data used and their sources
This study is based on both primary and secondary data. The primary data comprise groundwater samples from 19 locations in two seasons, followed by laboratory tests and experiments. In addition, various secondary datasets were collected for different aspects. For applying the machine learning techniques, the land cover map was prepared from Google Earth and verified using Landsat 8 data. The concentration of settlement and urban built-up and the greenery coverage of the city were prepared using Sentinel-2A imagery. The details of the data collected and their use are shown in Table 3.
Table 3
Relevant data and their sources
| Layers | Sources | Format |
|---|---|---|
| Groundwater parameters | Collection from sampled stations by authors | Numeric data |
| Micro-level land use | Google Earth; Landsat 8, 2021 (USGS Earth Explorer) | Vector data from Google Earth; raster data from Landsat 8 |
| The concentration of settlement and urban built-up | Sentinel-2A (https://apps.sentinel-hub.com/) | Raster |
| Greenery coverage | Sentinel-2A (https://apps.sentinel-hub.com/) | Raster |
3.3 Selection of chemical parameters
The literature indicates that the pre-monsoon and post-monsoon seasons considerably influence the balance of chemical parameters that govern groundwater quality (Rao, 2017). Both the dry and the wet season directly influence water stress and contamination, along with the chemicals released from rock or carried in by infiltration (Haines et al., 2006). Groundwater contains a large number of geochemical compounds. Of these, a few physico-chemical parameters that may cause health problems were considered for this study: pH, alkalinity, dissolved oxygen (DO), total dissolved solids (TDS), total suspended solids (TSS), total hardness (TH), chloride (\({Cl}^{-}\)), electrical conductivity (EC), calcium (Ca), magnesium (Mg), turbidity, sulphate (\({SO}_{4}^{2-}\)), bicarbonate (\({HCO}_{3}^{-}\)), and sodium (Na).
pH is an important parameter that determines whether water is acidic or alkaline. Exposure to extreme pH can cause irritation of the eyes, skin, and mucous membranes (Ali and Ahmad, 2020). TDS refers to the organic and inorganic substances dissolved in water, some of which are necessary. Water with very low TDS is unhealthy, but TDS beyond the permissible limit can also cause health problems (Burton and Cornhill, 1977; Schroeder, 1960; Schroeder, 1966). TSS may include silt and various decomposed matter. As per BIS and WHO guidelines, TSS is not permissible in drinking water. TSS can carry unwanted microorganisms that may trigger nausea, diarrhea, headaches, etc. (Kjelland et al., 2015). Turbidity has no specific health implications of its own, but it can act as a breeding ground for microbial growth, and high turbidity makes it difficult to disinfect water with chlorine (Baghvand et al., 2010). Chloride has no direct effect on human health, but people with existing conditions related to sodium chloride metabolism may be affected, and prolonged ingestion may raise health concerns. Alkalinity is also an important parameter in drinking water; exceeding its permissible limit may cause skin irritation and gastrointestinal problems, including vomiting and nausea (Wynn et al., 2010). The presence of DO is a sign of good water quality; however, beyond the permissible limit it turns the water into a breeding ground for bacteria (Bhatia et al., 2015). Total hardness is an important property of water. It brings benefits, but beyond the permissible limit it makes the water soapy, pulses take longer than usual to boil, and water pipes become narrower as deposits build up.
The health impact of TH is debated among researchers; nonetheless, it has been linked to kidney stones, cardiovascular diseases, digestive problems, etc. (Sengupta, 2013). Calcium is an important element for the human body. An adequate calcium intake is needed, and a lack of calcium can cause hypocalcaemia, muscle cramps, dry skin, etc. (Pravina et al., 2013). Magnesium is another important geochemical constituent of water. Magnesium deficiency can cause hypomagnesemia, hypertension, osteoporosis, headaches, etc. (Watson et al., 2012; Al Alawi et al., 2018). EC indicates the presence of minerals in water; high EC means a high mineral content, which may spoil the taste of water (Meride and Ayenew, 2016). Sulphate has a strong laxative effect, and pregnant women and infants are more prone to diarrhea when water contains an undesirable amount of sulphate. It also degrades the taste of water (Heizer et al., 1997). Bicarbonate is a dependent ion in groundwater and is associated with sodium most of the time. Exceeding the permissible limit of bicarbonate may result in hypernatraemia and rebound alkalosis. The presence of sodium in water is a healthy sign, as it can mitigate kidney damage, headache, hypertension, etc., but an excess can lead to heart disease, high blood pressure, etc. The spatial distribution of the selected groundwater parameters in both seasons is shown in Figs. 3 and 4.
Land cover plays an important role in the sustainable growth and development of a city. According to the 2019 assessment of the Forest Survey of India, the entire Aligarh district has only 1.83% of its area under forest cover, whereas Aligarh city has about 11.76% of its area under vegetation cover. Various studies have shown that land use and land cover affect hydrometeorology, surface runoff, aquifer level, and water quality (Baidya et al., 2002; Sajikumar et al., 2015; Schilling et al., 2008). Hence, there is a strong need to examine the relationship between groundwater quality and land cover pattern.
Several approaches to studying groundwater quality, together with their classification and assessment, are significant (Su et al., 2019; Tian and Wu, 2019). In the present study, the assessment of the impact of land cover on groundwater quality draws on both supervised and unsupervised classification. The supervised classification uses information related to groundwater quality and quantifies its standard, whereas the unsupervised classification is used to present the natural pattern of water quality. The supervised classification includes WQI (Chaurasia et al., 2018), ANN (Gupta et al., 2019), and the RF classifier (Grbčić et al., 2020), and the unsupervised classification includes a self-organized map (Liu et al., 2018).
3.4 Calculation of WQI
Groundwater is considered the most important source of freshwater and is present in the least polluted form because it is least interfered with by humans (Wagh et al., 2019). It is estimated that around one-third of the world's population depends on groundwater. Aligarh city is a striking example, as no alternative source of drinking water is available and groundwater is the public’s only source of domestic and drinking water. Hence, checking its status is essential. WQI is a method that has been widely used for over a decade to assess water quality (Gupta et al., 2017; Sadat-Noori et al., 2014; Şener et al., 2017; Mishra and Patel, 2003; Avvannavar and Shrihari, 2008; Vasanthavigar et al., 2010). WQI has several advantages: it is relatively easy to apply, new variables can easily be added, it gives a precise range of results, etc. (Walsh and Wheeler, 2013). It is therefore regarded as a reliable and simple method. In this study, the rank weight method of WQI was used. For the assessment of the parameters, the Bureau of Indian Standards (BIS) values were taken as the standards, and for parameters not described in BIS, the World Health Organization (WHO) standards were used. Total suspended solids (TSS) are mentioned in neither the BIS nor the WHO guidelines; hence, the NEMA (Kenya) standard was used for them. A total of 14 geochemical parameters were taken into consideration. Each parameter was assigned a rank from 1 to 4, on a scale of five, based on its importance and previous literature (Batabyal and Chakraborty, 2015; Vasanthavigar et al., 2010). Turbidity was given the minimum value of 1 since it is a less important parameter, and the maximum value of 4 was given to pH, TDS, and sulphate. The relative weight of each parameter was then computed using the following equation (Eq. 1):
$${W}_{i}=\frac{{R}_{i}}{\sum _{j=1}^{n}{R}_{j}} \tag{1}$$
Where \({W}_{i}\) is the relative weight of the \({i}^{th}\) parameter, \({R}_{i}\) is the rank assigned to the \({i}^{th}\) parameter, and \(n\) is the number of parameters.
After the relative weight (\({W}_{i}\)) was calculated, a quality rating (\({Q}_{i}\)) was assigned to every parameter by dividing its concentration in each water sample by its standard as per the Bureau of Indian Standards (BIS, 1991) and multiplying the result by 100 (Eq. 2):
$${Q}_{i}=\frac{{C}_{i}}{{S}_{i}}\times 100 \tag{2}$$
Where \({Q}_{i}\) is the quality rating, \({C}_{i}\) is the concentration of each chemical parameter in the water sample, and \({S}_{i}\) is the Indian drinking water standard for that parameter.
The next step is to determine the sub-index (\({SI}_{i}\)) of each physico-chemical parameter of the water samples using the following equation (Eq. 3):
$${SI}_{i}={W}_{i}\times {Q}_{i} \tag{3}$$
Where \({SI}_{i}\) is the sub-index of the \({i}^{th}\) parameter and \({W}_{i}\) is the relative weight of the \({i}^{th}\) parameter.
Finally, the WQI was obtained as the sum of the sub-indices (Eq. 4):
$$WQI=\sum _{i=1}^{n}{SI}_{i} \tag{4}$$
After calculating the water quality index, the values were divided into five categories ranging from excellent to unsuitable for drinking.
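The WQI calculation described in this section (relative weight, quality rating, sub-index, and their sum) can be sketched in a few lines of Python. The three-parameter example below is purely illustrative: the ranks for pH, TDS, and turbidity follow the text, but the standards and sample concentrations are placeholder values, not data from this study.

```python
def relative_weights(ranks):
    """Relative weight of each parameter: its rank divided by the sum of all ranks."""
    total = sum(ranks.values())
    return {name: rank / total for name, rank in ranks.items()}


def water_quality_index(concentrations, standards, ranks):
    """Weighted WQI: sum over parameters of weight x (concentration / standard x 100)."""
    weights = relative_weights(ranks)
    wqi = 0.0
    for name, weight in weights.items():
        quality_rating = concentrations[name] / standards[name] * 100
        wqi += weight * quality_rating  # sub-index of this parameter
    return wqi


# Ranks as stated in the text; standards and concentrations are placeholders.
ranks = {"pH": 4, "TDS": 4, "Turbidity": 1}
standards = {"pH": 8.5, "TDS": 500.0, "Turbidity": 5.0}
sample = {"pH": 7.6, "TDS": 650.0, "Turbidity": 2.0}

print(round(water_quality_index(sample, standards, ranks), 2))
```

A sample whose concentrations all equal their standards yields a WQI of 100 under this formulation, which gives a convenient sanity check for the weights.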
3.5 Mapping of land cover
The land cover considered in the present study consists of three raster layers: the concentration of settlement and urban built-up, greenery coverage, and micro-level land use. The micro-level land use data were obtained from the Google Earth Engine for the year 2020 and mapped using supervised image classification. A total of nine pre-identified signatures were considered for classification: settlement, water body, vegetation, barren land, recreational area, fallow land, agricultural land, swamp, and open space. The concentration of settlement and urban built-up (CSUB) and greenery coverage (GC) were extracted from Sentinel-2A imagery (https://apps.sentinel-hub.com/eo-browser). The CSUB was derived from the band combination of 12 (shortwave infrared-2), 11 (shortwave infrared-1), and 4 (red). This composite is generally used to visualize settlement and urban built-up areas more clearly. The GC was derived using the following equation (Eq. 5):
$$GC=\frac{\left(B8-B4\right)}{\left(B8+B4\right)} \tag{5}$$
Where \(B8\) is the near-infrared band and \(B4\) is the red band. The value of GC ranges from -1 to 1. Negative values correspond to water, and values close to 0 correspond to barren areas of rock, sand, or snow. Positive values of approximately 0.2 to 0.4 represent vegetation, whereas high values indicate tropical forest.
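The greenery coverage formula has the same form as the normalized difference vegetation index and can be applied pixel by pixel. The sketch below uses plain Python lists with made-up reflectance values; the function name and the handling of pixels where both bands are zero are assumptions for illustration.

```python
def greenery_coverage(nir, red):
    """Normalized difference of the near-infrared (B8) and red (B4) reflectance."""
    total = nir + red
    if total == 0:
        return 0.0  # assumption: pixels with no signal in either band are set to 0
    return (nir - red) / total


# Hypothetical 2 x 2 reflectance grids for bands B8 and B4.
nir_band = [[0.45, 0.30], [0.05, 0.20]]
red_band = [[0.05, 0.10], [0.10, 0.20]]

gc_grid = [
    [greenery_coverage(nir, red) for nir, red in zip(nir_row, red_row)]
    for nir_row, red_row in zip(nir_band, red_band)
]
print(gc_grid)
```

In this made-up grid the first pixel behaves like dense vegetation, the third like water (negative value), and the fourth like bare ground (value of zero).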
3.6 Models used
The selected groundwater parameters were mapped first, followed by the water quality index, and then the land cover maps were generated as described above. The WQI and land cover were mapped for the whole city and for the 500 m buffer regions around the sampled stations, for both the pre-monsoon and post-monsoon seasons. To determine the relationship between WQI and land cover, machine learning algorithms were applied. The spatial correlation between WQI and land cover was then determined from the classification accuracy of the models.
3.6.1 Random forest (RF)
Random forest is a tree-based classifier, an improved version of the bagging method proposed by Breiman (2001). It combines tree bagging with multivariate data, which makes it a considerably better tool for pattern recognition and tree building on multivariate, large-scale data (Liaw and Wiener, 2002). It produces a large number of trees by bootstrapping and provides numerous metrics that aid interpretation (Prasad et al., 2006; Baudron et al., 2013). Although RF has its own cross-validation (Breiman, 2001; Efron and Tibshirani, 1997), it is recommended to check the sensitivity of the model so that maximum accuracy can be obtained (Matthew, 2011). Random forest works in four basic steps (Ali et al., 2021):
1. Random selection of ‘k’ features from the total of ‘m’ features, where ‘k’ < ‘m’,
2. Calculation of the node ‘d’ by applying the best split point among the selected ‘k’ features,
3. Splitting of the node ‘d’ into daughter nodes ‘dn’ using the best split point, and
4. Repetition of the above steps until the \({I}^{th}\) node of the tree is reached.
The RF classifier assigns an input to the output class that receives the largest number of votes (Bonissone et al., 2010). Thus, for an input ‘x’, the output ‘y’ is computed from the ensemble as shown in Eq. 6:
$$y\left(x\right)=\max\left[\sum _{k}I\left(t\right)\right] \tag{6}$$
Where \(I\left(t\right)\) is an indicator function defined as:
$$I\left(t\right)=\left\{\begin{array}{ll}1, & t=\text{'YES'}\\ 0, & t=\text{'NO'}\end{array}\right.$$
In the case of the present study, the good water quality and poor water quality locations were indicated by ‘YES’ and ‘NO’, respectively.
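The voting rule behind the RF output can be illustrated with a short Python sketch: each tree returns ‘YES’ or ‘NO’ for a location, and the class with the most votes becomes the forest prediction. The function name and the tree outputs below are hypothetical; this shows only the aggregation step, not tree construction.

```python
from collections import Counter


def forest_vote(tree_outputs):
    """Return the class with the most votes together with the per-class vote counts."""
    # Summing the indicator over the trees is the same as counting the votes per class.
    counts = Counter(tree_outputs)
    winner, _ = counts.most_common(1)[0]
    return winner, dict(counts)


# Hypothetical outputs of seven trees for one location.
votes = ["YES", "NO", "YES", "YES", "NO", "YES", "NO"]
label, tally = forest_vote(votes)
print(label, tally)
```

An odd number of trees is used in the example so that a tie between the two classes cannot occur.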
3.6.2 Artificial neural network (ANN)
The artificial neural network is a fast and reliable technique that mimics the role of neurons in the human brain (Diamantopoulou et al., 2005; Wu et al., 2014). Like neurons, an ANN works in layers, where non-linear data are processed and transmitted from one layer to the next (Isiyaka et al., 2019). The process uses three layers: the input layer, the hidden layer, and the output layer (Diamantopoulou et al., 2005). Both training and testing data are required; the training data are used to set the weights of the variables, which are adjusted iteratively (Isiyaka et al., 2019). In this study, a multi-layer perceptron feed-forward ANN with a back-propagation algorithm was used to identify the largest contributor to groundwater pollution and to model the percentage contribution of each individual factor.
In this regard, two input combination models were produced so that the most statistically significant API with high accuracy could be achieved (Alagha et al., 2014; Sarkar and Pandey, 2015). ANN was applied both to the whole city area and within the buffer region. The data were divided into training (80%) and testing (20%) datasets. To test the multi-layer perceptron ANN model, the coefficient of determination (\({R}^{2}\)) and the root mean square error (RMSE) were used, following Eqs. 7 and 8:
$${R}^{2}=1-\frac{\sum {\left({x}_{i}-{y}_{i}\right)}^{2}}{\sum {y}_{i}^{2}-\frac{{\left(\sum {y}_{i}\right)}^{2}}{n}} \tag{7}$$
$$RMSE=\sqrt{\frac{1}{n}\sum _{i=1}^{n}{\left({x}_{i}-{y}_{i}\right)}^{2}} \tag{8}$$
Where \({x}_{i}\) is the observed data, \({y}_{i}\) is the predicted data, and \(n\) is the total number of observations.
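The two measures can be illustrated with a short Python sketch. RMSE follows the definition given above. For the coefficient of determination, the sketch uses the common form of one minus the ratio of the residual sum of squares to the total sum of squares about the mean of the observed values; this is an assumption, and it may differ in detail from the expression printed above. The data values are made up.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """1 - SS_res / SS_tot, with SS_tot taken about the mean of the observed values."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((x - y) ** 2 for x, y in zip(observed, predicted))
    ss_tot = sum((x - mean_observed) ** 2 for x in observed)
    return 1 - ss_res / ss_tot


# Hypothetical observed and predicted WQI values.
observed = [62.0, 75.0, 88.0, 101.0]
predicted = [60.0, 78.0, 85.0, 104.0]
print(round(rmse(observed, predicted), 3), round(r_squared(observed, predicted), 3))
```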
3.6.3 Inventory data preparation and data resampling
Inventory data preparation is essential for running machine learning techniques and for validating the models. To apply random forest and artificial neural network, a dataset of factors that determine water quality was assembled: micro-level land use of the study area, concentration of settlement and urban built-up, greenery coverage, and WQI of the area. Using GIS, 100 points in the good water quality category and 100 points in the poor water quality category were digitized within the study area. The good and poor water quality points were then divided into training (80%) and testing (20%) data. Thus, for the whole city, 80 good and 80 poor water quality points were included in the training data, while the remaining 20 good and 20 poor water quality points were allocated for validation of the models. Similarly, for the demarcation of water quality within the buffer region, 140 of the 200 points were used. These were divided into 80% training and 20% testing data, giving 114 training points and 28 testing points (Table 4).
Table 4
Details of training and testing dataset
| Season | Extent | Training | Testing |
|---|---|---|---|
| Pre-monsoon | Within whole city limit | 160 | 40 |
| Pre-monsoon | Within buffer limit | 114 | 28 |
| Post-monsoon | Within whole city limit | 160 | 40 |
| Post-monsoon | Within buffer limit | 114 | 28 |
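The 80/20 division of the whole-city inventory can be reproduced with a simple stratified split, in which the good and poor water quality points are shuffled and divided separately so that both classes keep the same proportion. The Python sketch below is a hypothetical illustration: the point records, the function name, and the random seed are assumptions.

```python
import random


def stratified_split(points, train_fraction=0.8, seed=42):
    """Split points into training and testing sets, class by class."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted({p["label"] for p in points}):
        group = [p for p in points if p["label"] == label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# 100 good and 100 poor water quality points, as in the whole-city inventory.
points = [{"id": i, "label": "good" if i < 100 else "poor"} for i in range(200)]
train, test = stratified_split(points)
print(len(train), len(test))
```

With 100 points per class, this yields 80 good and 80 poor points for training and 20 of each for testing, matching the whole-city counts.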
3.6.4 Feature selection
In machine learning, feature selection is a crucial step of model building (Booker and Snelder, 2012). It is required to evaluate the relevance of the selected factors for model building. In the present study, the ReliefF ranking evaluator was used to determine factor importance. ReliefF ranking is a probabilistic approach used in data classification that assesses the conditional dependence and discriminative power of the identified factors (Belgiu and Drăguţ, 2016). A higher rank indicates greater significance, and ‘0’ indicates that a factor is irrelevant for modeling.
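The idea behind Relief-type ranking can be shown with a simplified Python sketch. It implements the basic Relief scheme with a single nearest hit and nearest miss per sample, which is a simplification of ReliefF and not the evaluator used in this study. Features are assumed to be scaled to the range 0 to 1, and the four samples are invented: the first feature separates the two classes and the second does not.

```python
def relief_weights(samples, labels):
    """Basic Relief: reward features that differ on the nearest miss, penalise those that differ on the nearest hit."""
    n_samples = len(samples)
    n_features = len(samples[0])
    weights = [0.0] * n_features

    def distance(a, b):
        return sum(abs(u - v) for u, v in zip(a, b))

    for i, (x, label) in enumerate(zip(samples, labels)):
        hits = [s for j, (s, l) in enumerate(zip(samples, labels)) if j != i and l == label]
        misses = [s for s, l in zip(samples, labels) if l != label]
        nearest_hit = min(hits, key=lambda s: distance(x, s))
        nearest_miss = min(misses, key=lambda s: distance(x, s))
        for f in range(n_features):
            weights[f] += (abs(x[f] - nearest_miss[f]) - abs(x[f] - nearest_hit[f])) / n_samples
    return weights


# Hypothetical scaled features: [informative factor, uninformative factor].
samples = [[0.1, 0.2], [0.2, 0.8], [0.9, 0.3], [0.8, 0.7]]
labels = ["good", "good", "poor", "poor"]
print(relief_weights(samples, labels))
```

The informative feature receives a clearly positive weight while the uninformative one does not, which mirrors the ranking behaviour described above.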
3.6.5 Performance evaluation of the models
3.6.5.1 Receiver operating characteristics (ROC) curve
The ROC curve is a widely accepted method for model validation that has been used in various fields, including geospatial analysis. It shows the trade-off between specificity and sensitivity (Chen et al., 2020). The farther the curve lies from the diagonal of the ROC space, toward the upper-left corner, the more accurate the test. The ROC curve does not depend on the class distribution. A common way to summarize the ROC curve numerically is the area under the curve (AUC), where 1 − specificity and sensitivity are placed on the ‘x’ and ‘y’ axes, respectively. A greater AUC value indicates higher accuracy of the result.
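How the ROC curve and AUC are obtained from model scores can be sketched in Python as follows. The sketch assumes binary labels (1 for good water quality, 0 for poor), distinct scores without ties, and trapezoidal integration; the labels and scores are invented for illustration.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, sweeping the threshold from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and predicted probabilities of good water quality.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(labels, scores)), 4))
```

In this example eight of the nine positive-negative pairs are ranked correctly, so the AUC equals 8/9.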
3.6.5.2 Arithmetic measures
Besides the ROC curve, various arithmetic measures were used to assess the accuracy of the models. For this purpose, mean absolute error (MAE), root mean square error (RMSE), Kappa index (K), sensitivity, specificity, and overall accuracy (OAC) were computed for the random forest and ANN models used in the present analysis, following Eqs. (9)-(14) (Khosravi et al., 2019; Ali et al., 2020). MAE is the average, over the test sample, of the absolute differences between predictions and actual observations. RMSE is a measure of fit in which lower values indicate a better fit. The Kappa index measures the reliability of agreement among the variables.
$$MAE=\frac{1}{n}\sum _{i=1}^{n}\left|{X}_{ei}-{X}_{oi}\right| \tag{9}$$
$$RMSE=\sqrt{\frac{1}{n}\sum _{i=1}^{n}{\left({x}_{i}-{y}_{i}\right)}^{2}} \tag{10}$$
$$K=\frac{{W}_{c}-{W}_{exp}}{1-{W}_{exp}} \tag{11}$$
$$Sensitivity=\frac{TP}{TP+FN} \tag{12}$$
$$Specificity=\frac{TN}{TN+FP} \tag{13}$$
$$OAC=\frac{TP+TN}{TP+TN+FP+FN} \tag{14}$$
Where \(TP\) is the true positive and \(TN\) is the true negative, both of which represent the number of pixels correctly classified; \(FP\) is the false positive and \(FN\) is the false negative, both of which represent the number of pixels incorrectly classified; \({W}_{c}\) is the number of pixels that are correctly classified as good water quality and poor water quality; \({W}_{exp}\) is the value of the expected agreement; \({X}_{ei}\) is the predicted value; \({X}_{oi}\) is the observed value; \({x}_{i}\) and \({y}_{i}\) are the observed and predicted data as in Eq. 8; and \(n\) is the number of datasets.
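The confusion-matrix measures and MAE can be computed as in the following Python sketch. It treats the observed agreement in the Kappa index as a proportion (the overall accuracy) and derives the expected agreement from the row and column totals, which is the usual Cohen's kappa formulation and is stated here as an assumption. The counts are hypothetical and chosen to sum to a 40-point test set.

```python
def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, overall accuracy, and Kappa index from confusion-matrix counts."""
    total = tp + tn + fp + fn
    overall_accuracy = (tp + tn) / total
    # Expected chance agreement from the marginal totals (Cohen's kappa).
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "overall_accuracy": overall_accuracy,
        "kappa": (overall_accuracy - expected) / (1 - expected),
    }


def mean_absolute_error(predicted, observed):
    """Average absolute difference between predicted and observed values."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical test-set counts: 20 good (positive) and 20 poor (negative) points.
metrics = classification_metrics(tp=18, tn=16, fp=4, fn=2)
print(metrics)
print(mean_absolute_error([1, 0, 1, 1], [1, 1, 1, 0]))
```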