GAMs retain the interpretability of generalized linear models while offering the flexibility of machine learning algorithms, because they do not assume a linear relationship between the dependent and independent variables (Hastie and Tibshirani 1990; Larsen 2015). A GAM is, as the name suggests, a generalization of the linear model in which the linear function of each covariate is replaced with a smooth function (Hastie and Tibshirani 1990). Because of their semiparametric nature, GAMs are much more sensitive than GLMs to the particular distribution of the data, allowing nonlinear relationships to be modelled by deriving the predictor functions during model building (Härdle and Turlach 1992; Larsen 2015). At the same time, to avoid overfitting, the smoothness or ‘wiggliness’ of the predictor functions can be controlled (Larsen 2015). Despite their versatility, GAMs have not been explored as much as linear or machine learning models for understanding the relationship between livestock depredation by carnivores and environmental factors (Kaartinen et al. 2009; Miller et al. 2015; Rostro-Garcia et al. 2016; Broekhuis et al. 2017; Amirkhiz et al. 2018). Given that the data for this study do not meet the assumptions of linear models, and that the primary aim is not merely accurate prediction but identification of the drivers of HTC, a GAM was selected to model livestock kill locations as a function of various tiger-relevant habitat factors (after determining the appropriate scale for each) (Wood 2017).
In India, domestic livestock depredation by wild carnivores is compensated by the state forest department. These livestock compensation records were obtained from the tiger reserve management to understand the intensity of HTC. However, the compensation data were not geo-tagged, so the livestock kill data (which carry GPS locations), collected by the tiger monitoring teams in the reserve, were also obtained for the period 2009-2016. Because Panna holds many feral livestock, and the kill records make no distinction between feral and domestic animals, kill and compensation records were matched to obtain reliable locations of domestic livestock kills; 156 locations were matched for the period 2011-2016. For modelling, the livestock kill locations were treated as presences, while random absence points were generated in GIS. Kill data for 2011-2015 were used to train the model and the 2016 data to test it.
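The record-matching step above can be sketched as follows. This is an illustrative sketch only: the field names (`village`, `date`), the sample records, and the two-day matching window are assumptions, not the actual forest-department schema or the matching rule used in the study.

```python
from datetime import date

# Hypothetical record structures; the field names and values below are
# illustrative, not the actual forest-department data.
kills = [   # GPS-tagged kill records from tiger monitoring teams
    {"village": "A", "date": date(2012, 5, 3), "lat": 24.7, "lon": 80.0},
    {"village": "B", "date": date(2012, 6, 1), "lat": 24.6, "lon": 79.9},
]
claims = [  # compensation records (no coordinates)
    {"village": "A", "date": date(2012, 5, 3)},
]

def match_records(kills, claims, max_days=2):
    """Keep only kill locations that have a matching compensation claim
    (same village, dated within max_days of the kill), so that feral
    livestock kills, which generate no claim, are excluded."""
    matched = []
    for k in kills:
        for c in claims:
            if (k["village"] == c["village"]
                    and abs((k["date"] - c["date"]).days) <= max_days):
                matched.append(k)
                break
    return matched

confirmed = match_records(kills, claims)  # only the claimed kill survives
```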
A total of 27 variables were considered for modelling the ecological drivers; these can be grouped under the following heads:
Anthropogenic disturbance indices were quantified in the circular plots (as discussed earlier under section II). In each plot, all lopped (only branches cut) and cut (felled to the stump) trees were counted. Camera traps were deployed at 109 locations in 2 x 2 km grids within the national park, accounting for 7459 trap nights. The numbers of tigers, livestock, humans, and vehicles captured by each camera trap were then counted manually and encounter rates (total no. of captures/total trap nights) calculated.
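The encounter-rate calculation is straightforward; a minimal sketch (the 37 captures in the example are invented, only the 7459 trap nights come from the text):

```python
def encounter_rate(total_captures, total_trap_nights):
    """Encounter rate = total no. of captures / total trap nights."""
    if total_trap_nights <= 0:
        raise ValueError("trap nights must be positive")
    return total_captures / total_trap_nights

# e.g. a hypothetical 37 livestock captures over the 7459 trap nights above
rate = encounter_rate(37, 7459)
```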
Village and road location data were also obtained from the forest department, and Euclidean distance rasters were calculated. Additionally, human footprint data for 2009 were downloaded from the Socioeconomic Data and Applications Center (SEDAC) website (Sanderson et al. 2002; Venter et al. 2018). Population census data for 2011 were also downloaded from SEDAC (Balk et al. 2020).
Geostatistical modelling to create rasters: Canopy cover; prey, human, livestock, and vehicle encounter rates; and cutting and lopping intensities were interpolated into rasters using the Geostatistical Wizard in ArcMap 10.4 (Cressie 2015; ESRI 2016a). Four interpolation tools were considered: Inverse Distance Weighting (IDW), Simple Kriging (SK), Ordinary Kriging (OK), and Empirical Bayesian Kriging (EBK). Statistical measures of correctness from cross-validation (mean prediction error, root-mean-square error, standardized root-mean-square error, average standard error) were used to compare the algorithms. The best model was taken to be the one with the smallest root-mean-square prediction error (RMSE), a standardized mean nearest to zero, an average standard error nearest the RMSE, and a standardized RMSE nearest to 1 (ESRI 2016b) (Online Resource 1).
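The four selection criteria can be combined into a simple rank-sum comparison. A sketch under stated assumptions: the cross-validation numbers below are invented for illustration, and the equal weighting of the four criteria is our assumption, not a rule from ESRI or the study.

```python
# Illustrative cross-validation summaries for the four interpolators;
# the numbers are made up, only the ranking logic mirrors the criteria above.
cv_stats = {
    "IDW": {"rmse": 0.42, "std_mean": 0.10, "avg_se": 0.30, "std_rmse": 1.40},
    "SK":  {"rmse": 0.35, "std_mean": 0.04, "avg_se": 0.33, "std_rmse": 1.10},
    "OK":  {"rmse": 0.33, "std_mean": 0.02, "avg_se": 0.32, "std_rmse": 1.05},
    "EBK": {"rmse": 0.31, "std_mean": 0.01, "avg_se": 0.30, "std_rmse": 0.98},
}

def rank_models(stats):
    """Rank-sum the four criteria: low RMSE, standardized mean near 0,
    average standard error near the RMSE, standardized RMSE near 1."""
    criteria = {
        name: (s["rmse"],
               abs(s["std_mean"]),
               abs(s["avg_se"] - s["rmse"]),
               abs(s["std_rmse"] - 1.0))
        for name, s in stats.items()
    }
    names = list(criteria)
    totals = {n: 0 for n in names}
    for i in range(4):  # rank each criterion separately; lower is better
        ordered = sorted(names, key=lambda n: criteria[n][i])
        for rank, n in enumerate(ordered):
            totals[n] += rank
    return min(totals, key=totals.get)

best = rank_models(cv_stats)  # EBK wins on these illustrative numbers
```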
Scale and variable selection: To construct a multiscale model, each variable (except human footprint, LULC, and the FRAGSTATS variables) was resampled at five scales: 30 m (the finest resolution available), 50 m (the average kill-drag distance for tiger reported in the literature (Karanth and Sunquist 2000; Miller et al. 2015)), 100 m (midpoint between the fine and coarse resolutions), 350 m (the maximum kill-drag distance from the literature), and 1200 m (the coarsest resolution used here, and the one at which most global environmental data are available). In total this gave 118 variables. Values of all variables were then extracted at each presence/absence point using the Spatial Analyst toolbox of ArcMap 10.4.
For scale and feature selection, univariate logistic regressions were first run, after applying the Box-Tidwell procedure to test the linearity (in the logit) assumption of logistic regression (Shin and Ying 1994). Second, univariate GAMs were run to quantify how much r-square/deviance each predictor explained at each scale; for every variable, the scale at which it best explained the response was selected. Additionally, Information Value and Weight of Evidence were employed for feature selection (Good and Osteyee 1974). The results of the univariate logistic regressions, univariate GAMs, and Information Value were considered together to select the explanatory variables, each at its appropriate scale, for the predictive model. The data were then checked for multicollinearity and spatial autocorrelation. To check for multicollinearity among the selected variables, appropriate tests of association were run (Kendall’s tau-b for continuous vs continuous and continuous vs ordinal variables; logistic regression for categorical vs continuous variables; and Cramer’s V for categorical vs categorical variables). Among correlated variables, the one explaining more of the response was retained. The all-prey (wild prey plus livestock) encounter rate was highly correlated with the wild prey encounter rate (r = 0.812); since the all-prey encounter rate explained more deviance in the dependent variable, it was selected. Similarly, NDVI and NDWI were moderately correlated (r = 0.636); since NDVI explained more deviance, it was selected.
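The Weight of Evidence / Information Value screen can be sketched for a single continuous predictor against the binary kill response. This is a minimal illustration, assuming equal-frequency binning and a 0.5 smoothing count for empty bins; the study's actual binning scheme is not specified in the text.

```python
import math

def information_value(x, y, bins=3):
    """Information Value of one continuous predictor x against a binary
    response y (1 = kill present, 0 = absent), via equal-frequency bins.
    WoE_i = ln(p_event_i / p_nonevent_i);
    IV    = sum_i (p_event_i - p_nonevent_i) * WoE_i."""
    pairs = sorted(zip(x, y))           # order observations by the predictor
    n = len(pairs)
    events = sum(y) or 1
    nonevents = (len(y) - sum(y)) or 1
    iv = 0.0
    for b in range(bins):
        chunk = pairs[b * n // bins:(b + 1) * n // bins]
        e = sum(v for _, v in chunk) + 0.5       # 0.5 smooths empty bins
        ne = sum(1 - v for _, v in chunk) + 0.5
        pe, pne = e / events, ne / nonevents
        iv += (pe - pne) * math.log(pe / pne)
    return iv
```

A predictor that separates presences from absences yields a large IV, while an uninformative one yields an IV near zero, which is the basis of the screening.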
After variable selection, the global model thus consisted of the following 12 variables at the indicated scales: all-prey encounter rate (50 m), slope (50 m), elevation (1200 m), slope deviation (100 m), NDVI (100 m), shrub abundance (50 m), canopy cover (30 m), distance to water (1200 m), livestock encounter rate (ct) (50 m), human encounter rate (1200 m), distance to road (100 m), and distance to village (100 m) (Figure 2).
Spatial autocorrelation among these variables was checked using Moran’s I statistic, and many of the variables were found to be spatially autocorrelated (Moran 1950; Getis 2007).
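Global Moran’s I has a compact closed form; a minimal sketch (the binary chain adjacency used in the usage example is an illustrative toy, not the study’s actual spatial-weights specification):

```python
def morans_i(values, weights):
    """Global Moran's I:
        I = (n / S0) * sum_ij w_ij (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2
    where weights is an n x n spatial-weights matrix and S0 = sum_ij w_ij.
    Positive I indicates spatial clustering; negative I, dispersion."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den

# toy usage: six cells in a line, neighbours share an edge
w = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]
clustered = morans_i([1, 1, 1, 0, 0, 0], w)   # like values adjoin -> positive
dispersed = morans_i([1, 0, 1, 0, 1, 0], w)   # values alternate  -> negative
```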
To account for the spatial autocorrelation in the data, a spatial GAM was constructed using the geoGAM package in R ver. 3.6.3 (Nussbaum and Papritz 2017). geoGAM is a procedure for building a parsimonious model based on gradient boosting, smoothing splines, and a smooth spatial surface that accounts for the spatial structure. The GAM for spatial data, or geoadditive model, in its full generality is represented by

g(E[Y(s)]) = β₀ + Σⱼ fⱼ(xⱼ(s)) + fₛ(s)
where fₛ(s) is a smooth function of the spatial coordinates, which accounts for residual autocorrelation (Nussbaum and Papritz 2017).
Since the response variable in this case is binary, a Bernoulli distribution is assumed and the logit link is used.
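Written out, the Bernoulli response with logit link takes the standard form (notation consistent with the geoadditive model above):

```latex
Y(s_i) \mid \mathbf{x}_i \sim \mathrm{Bernoulli}(p_i), \qquad
\operatorname{logit}(p_i) \;=\; \ln\!\frac{p_i}{1 - p_i}
\;=\; \beta_0 + \sum_{j} f_j\bigl(x_{ij}\bigr) + f_s(s_i)
```

so the fitted linear predictor is transformed back through the inverse logit to give a predation probability between 0 and 1.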
To build a parsimonious model, geoGAM automatically selects factors, covariates, and the spatial effect using componentwise gradient boosting, after which the model is further reduced using cross-validation (Nussbaum and Papritz 2017).
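The selection principle of componentwise boosting can be illustrated in miniature: at each iteration every covariate is fitted to the current residuals, and only the best-fitting one is updated, so covariates that are never selected drop out. The sketch below substitutes squared-error loss and single-covariate linear base learners for geoGAM’s Bernoulli loss and penalized splines, purely for brevity; it is not geoGAM’s implementation.

```python
def componentwise_boost(X, y, n_iter=50, nu=0.1):
    """Componentwise L2 gradient boosting.
    X: list of p covariate columns (assumed centred, so no intercept);
    y: continuous pseudo-response. Returns one coefficient per covariate;
    covariates never selected keep a coefficient of exactly 0."""
    n, p = len(y), len(X)
    coefs = [0.0] * p
    fit = [0.0] * n
    for _ in range(n_iter):
        resid = [y[i] - fit[i] for i in range(n)]
        best_j, best_beta, best_rss = 0, 0.0, float("inf")
        for j in range(p):                      # try each covariate's learner
            xj = X[j]
            sxx = sum(v * v for v in xj)
            beta = sum(xj[i] * resid[i] for i in range(n)) / sxx
            rss = sum((resid[i] - beta * xj[i]) ** 2 for i in range(n))
            if rss < best_rss:
                best_j, best_beta, best_rss = j, beta, rss
        coefs[best_j] += nu * best_beta         # update only the winner,
        fit = [fit[i] + nu * best_beta * X[best_j][i]   # damped by nu
               for i in range(n)]
    return coefs
```

With a response driven by only the first of two covariates, the second is never selected and its coefficient stays zero, which is how boosting performs variable selection.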
The final selected model was then run on the test data to obtain the model performance measures, area under the curve (AUC) and true skill statistic (TSS).
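Both performance measures have simple definitions; a minimal sketch (the 0.5 classification threshold for the TSS is an assumption for illustration, as the study's threshold is not stated):

```python
def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen presence
    outscores a randomly chosen absence (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def tss(scores, labels, threshold=0.5):
    """True skill statistic = sensitivity + specificity - 1;
    ranges from -1 to +1, with +1 indicating perfect discrimination."""
    tp = sum(1 for s, l in zip(scores, labels) if l == 1 and s >= threshold)
    fn = sum(1 for s, l in zip(scores, labels) if l == 1 and s < threshold)
    tn = sum(1 for s, l in zip(scores, labels) if l == 0 and s < threshold)
    fp = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= threshold)
    return tp / (tp + fn) + tn / (tn + fp) - 1
```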
A raster with 42 m cell size was created and masked to the reserve boundary. It was then converted into points, and for each point the values of all explanatory variables were extracted. The final selected model was then run on these data to predict the predation-risk probability at each point. The points were converted back into a raster, with the risk predictions assigned as cell values, to create the HTC risk map.
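The point-wise prediction step amounts to evaluating the fitted model at every cell centre and reassembling the results into a grid. A structural sketch only: the `extract` and `predict` callables below are placeholders for the GIS extraction and geoGAM prediction described above, and the covariate surface and logistic link in the toy usage are invented.

```python
import math

def risk_surface(extract, predict, nrows, ncols):
    """Evaluate a fitted model over a regular grid of cell centres and
    return the predictions as a 2-D grid (the risk-probability raster).
    extract(r, c) -> covariate vector for cell (r, c);
    predict(cov)  -> predation-risk probability in [0, 1]."""
    return [[predict(extract(r, c)) for c in range(ncols)]
            for r in range(nrows)]

# toy usage: one synthetic covariate that grows across the grid,
# pushed through an inverse-logit so outputs stay in (0, 1)
grid = risk_surface(
    extract=lambda r, c: [0.1 * (r + c)],
    predict=lambda cov: 1 / (1 + math.exp(-(cov[0] - 0.5))),
    nrows=3, ncols=3,
)
```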