Geography Matters for Sanitation! Spatial Heterogeneity of the District-level Correlates of Open Defecation in India

Background: Sanitation interventions often fail in meeting the desired expectations. A prominent reason is the ecological nature of sanitation phenomena that inherently associates with its high dependence on contextual specics. This also constrains the generalization of research ndings in this domain and substantiates the need to supplement most prevalent micro-level evidence with wider population-level (ecological) studies. This paper thus provides the rst quantitative analysis of the spatial heterogeneity of the district-level correlates of open defecation in India. It focuses on a contiguous region of 266 districts (their rural parts) comprising of seven northern and eastern Indian states, all of which had comparatively high open defecation rates in 2011. Methods: We employ standard non-spatial regression, spatially explicit regressions and multi-scale geographically weighted regression to compare the stability of the measurable correlates of open defecation across these different methods as well as across the analyzed spatial units. Results: Attributes like the ownership of household assets, drinking water inaccessibility and prevalent literacy rates were identied as the most stable district-level correlates of open defecation in this region. A higher share of Muslims in the population positively correlated with lower open defecation rates, while this relationship was amplied by the population density. This nding support hypotheses about the positive sanitation externalities stemming from the spatial co-concentrations of Muslim and non-Muslim populations. Conclusions: Our analyses demonstrate notable spatial clustering and signicant spatial non-stationarity of the examined variables. Therefore, research ndings that ignore the spatial heterogeneity of sanitation drivers possibly provide incomplete information for practice.


Background
The elimination of Open Defecation (OD) is a critical initial step towards ensuring safe sanitation, with this being a basic prerequisite for improving human health [1]. The eradication of this practice has been central to achieving the target of universal access to sanitation by 2030, as was declared under the sixth SDG (Sustainable Development Goal). However, while such SDGs and global efforts to ensure safe and hygienic sanitation has no doubt bettered conditions, it is unlikely that the above ambitious goal shall be achieved within the stipulated timeframe [2,3]. To stress the daunting task further, in 2017, OD was practiced by more than 673 million people worldwide, mostly in Sub-Saharan Africa and South Asia, with almost 50% of all such cases being from India alone [4].
Possibly, the repeated failures or lowered effectiveness of past intervention policies have stemmed from sanitation practices being highly context sensitive and deeply embedded in human-environment interactions. Local sanitation conditions result from complex interactions between historically and culturally conditioned social behaviour, structural societal inequalities, multi-scalar sanitation politics and various natural environment parameters [16,17,18]. Thus the same factors have distinct and even opposite effects on sanitation outcomes in different geographical settings and furthermore, the interrelationships between distinct sanitation drivers are also context sensitive [19,20,10].
The implications of the contextual nature of sanitation have still to be fully understood. It makes the transferability of both, the policy interventions and any ndings accrued from sanitation research, quite complex and di cult to generalize. Various measures have thus been recommended for adjusting the planned interventions to the targeted local environments [21,22,7]. This contextual sensitivity also ampli es an inverse relationship between the internal and external validity of garnered research ndings.
The experimental impact-evaluations represent an exemplar case in this respect. Although they attain high internal validity due to the effective elimination of contextual in uences from the examined relationship between an intervention and its outcome, the methodological disregard of contextual factors means there is a weaker external validity (generalizability) of their results [23,19,11]. Contrarily, observational studies usually focus solely on statistical associations. As such, their internal validity is weaker. However, they can provide richer information on the roles of various observable contextual parameters and may also cover larger populations, which then adds to their external validity. On the other hand, inferences from larger-scale (i.e. ecological or population-level) studies are more susceptible to bias caused by various heterogeneities within the analyzed population. This challenges their ecological validity in terms of the extent to which such research results can be generalized to real-life settings [24].
The present article addresses the latter issue by examining how the inferences usually drawn about the correlates of OD practices in India are affected by heterogeneities demonstrated in space. More speci cally, we focus on the problem of spatially varying relationships between OD and its covariates, which can be referred to as spatial non-stationarity or process spatial heterogeneity [25,26]. With this in mind, our paper has two goals. Firstly, to examine the district-level correlates of OD in India, and secondly, to assess the role of spatial heterogeneity in these correlates by comparing the results obtained using a standard non-spatial regression model with those obtained from spatially explicit regressions and a Multi-scale Geographically Weighted Regression (MGWR). Our analysis uses data from the Census of India (2011) covering a contiguous area of 266 districts (their rural parts) across seven northern and eastern states of India, wherein 65% of all households practicing OD across the country reside. Although at present, the 2011 Census data may be already somewhat outdated, the more recent data required for such district-level analysis of OD is not yet available. To the best of our knowledge, this paper presents the rst spatially explicit analysis of district-level sanitation correlates in India as well as the rst quantitative analysis of the spatial non-stationarity of sanitation determinants globally. This article thus seeks to make not only an empirical but also a methodological and epistemological contribution. It demonstrates an approach that can be adopted for a meso-scale analysis of sanitation drivers in India, as and when more recent district-level data is available. Moreover, despite the recent decline in OD in India, the prevalence of this practice is still likely to be substantial [13,27] and possibly has a pronounced spatial variation. Therefore, the ascertained empirical evidence about the OD determinants in this study can still be valuable for addressing sanitation issues in India and other countries.
The variables used in the present study (which act as the correlates of OD) contain measures of the general socioeconomic situation of the studied districts, their occupational structure and their socioecological conditions. Additionally, we also examine the role of the proportion of Muslims in districts' population. Previous research has demonstrated that Muslim communities in India and elsewhere tend to have generally lower OD rates [28,29,30,31]. In addition, Geruso and Spears (2018) [32] uncovered that, due to the comparatively better sanitation conditions of Muslim communities, the spatial coconcentrations of Muslim and non-Muslim neighbourhoods generate positive health externalities and decrease infant mortality rates in both Muslim and non-Muslim families. Inspired by this interesting nding, we have hypothesized that a greater proportion of Muslims in a district's population may not only lead to better health, but that it would also affect the sanitation conditions of non-Muslims. Since such externalities have a higher chance of arising in more densely populated areas [33], we expect that an area's population density shall moderate (amplify) the negative relationship between the share of Muslims and its prevalent OD rate.
Our results demonstrate that the parameters relating to the household ownership of assets, drinking water inaccessibility and literacy rate represent the most stable district-level correlates of the OD rate in India. Furthermore, our surmise about the negative association between the share of Muslim population and the district-level OD rates being contingent upon the level of population density was also con rmed.
The latter nding supports our hypothesis about the positive sanitation effect of Muslim neighbourhoods and ultimately, the improved health externalities stemming from the spatial co-concentrations of Muslim and non-Muslim neighbourhoods. Overall, our analysis demonstrated a notable spatial clustering as well as a signi cant spatial non-stationarity of most of the examined variables. Therefore, studies that ignore the spatial heterogeneity of sanitation drivers possibly provide incomplete information, which may not be adequate for framing properly informed policies. While this claim of ours does have a general relevance with respect to most social and natural phenomena, we believe that sanitation represents a eld in which geography matters particularly strongly.

Complexity Of Sanitation Determinants
The contextual nature of sanitation predetermines the complexity of its predictors. This also explains why the theoretical knowledge in this area only encompasses a few conceptual typologies of potentially important factors [34,35,36,37,38], rather than being more causal theories. Conceptual typologies have to accommodate a large diversity of potential sanitation drivers operating on different scales. Some frequently represented types of factors that especially in uence OD are demographic characteristics [39,40], socio-economic and cultural attributes [41,42,30,43,44] occupation [37,45,17], dissatisfaction with sanitation infrastructure, in particular with poorly constructed government toilets [46,40,47,48] and environmental and socio-spatial factors (45,48,18,16,49].
A systematic review of the primary literature on rural community sanitation by   [19] identi ed 77 typological groups of sanitation drivers and 12 types of different sanitation outcomes.
Drawing on the dataset that supplemented the systematic review referred to above, Fig. 1 illustrates the broader types of in uencers of just two sanitation outcomes, in terms of OD, and the use of toilets, for select Asian and African countries. While there are some overarching types of factors similar to all these nations, it is notable that some of the drivers within these broader types may operate through distinct and contextually speci c mechanisms. For example, ndings from Benin and Bangladesh reveal that femaleheaded households are more likely to own toilets, while the evidence from Nigeria, Tanzania, Cambodia, and Nepal indicates an opposite relationship. In this case, the different socio-cultural and socioeconomic environments of female-headed households in these countries is the likely explanation [19].
The available sanitation research overwhelmingly comprises of case studies that most commonly draw on household-level micro-data. A few studies have investigated international and sub-national geographical inequalities in sanitation conditions [50,51,52,53], while some have also analyzed the various sanitation determinants [54,20,49]. The research focusing on the meso-and macro-level patterns of OD practices is generally scanty and rather descriptive, being often limited to simply monitoring basic sanitation indicators [55]. India is not an exception in this respect. We are aware of only two previous articles that have examined the district-level sanitation patterns in India [56,42], even though these did not employ any spatially explicit analytical techniques. However, this does not imply that sanitation is ignored in ecological studies. Such research nevertheless more often employs sanitation indicators as covariates that explain the prevalent conditions of human health [57,58,59], rather than directly examining them as outcomes.

Analyzed area
The chosen study area is a contiguous region of 266 districts spread across seven northern and eastern states of India (Rajasthan, Uttar Pradesh, Madhya Pradesh, Chhattisgarh, Bihar, Jharkhand, Odisha and a small part of West Bengal) (Fig. 2). This region was demarcated by considering the spatial concentration of OD in India and within these districts lay 65% of all rural Indian households that practiced OD, as per the 2011 Census. In six of the above seven states, the respective state OD rates exceeded the national gure, with the sole exception being West Bengal (76%). The eight districts from West Bengal were included in the studied region as they had disproportionately higher OD rates than the rest of the state and are socio-culturally and environmentally more similar to the neighbouring tracts of Jharkhand and Bihar than to the rest of West Bengal. The studied region has varied topography and land cover. As per the 2011 Census, it housed 494 million ruralites. It is socio-culturally diverse but socioeconomically less developed that the rest of India (e.g. literacy rate of 54%), with Scheduled Castes and Scheduled Tribes comprising 18% and 16% of the total population, respectively.

Note
Only the chosen part of West Bengal in the study area is shown Source: Census of India Atlas, 2011

3.2.Methods
The descriptive characteristics of the analyzed variables were enumerated rst and their spatial patterns/clustering was examined using both, global and local indices of spatial association [60]. The Theil decomposition of spatial inequality [61] was used to ascertain what share of the overall betweendistrict variations in particular variables was attributable to the differences between the respective state averages (between-state component) and to the differences between districts within respective states (within-state component). Next, the correlates of the district-level OD rate were examined using the following consecutive analytical steps to differentiate whether and how they addressed spatial heterogeneity and spatial non-stationarity (Table 1).
At rst, we used a simple linear regression as a baseline model, which was characterized as: where, Y was the district-level OD rate, X stood for the set of predictors described below, β denoted the respective regression coe cients for these predictors and was the error term. The intercept is not shown as we standardized all the variables analysed.
Linear regression assumes that the observations and error terms are mutually independent. However, in this case such an assumption may easily be violated by the spatial dependence in the data, e.g. by differences in sanitation interventions implemented in particular states or by any other state-level variations. Therefore, in the second step we additionally included dummy variables for particular Indian states among the independent variables, to control for any observable and unobservable differences between states (i.e. their xed effects).
In the third and fourth steps, we attempted to better address the presence of spatial autocorrelation using two common alternatives of spatial regressions-the Spatial Lag model and the Spatial Error model [60]. The spatial lag regression presumes that the dependent variable in a district is not independent of its values in the neighbouring districts and was speci ed as: where, ρ represented the spatial autoregressive lag coe cient and WY was a spatially lagged dependent variable. In the spatial error regression model, the assumption of spatially correlated errors was dealt with by including the spatial lag for errors. This was written as: where, λ was the spatial autoregressive error coe cient, Wε denoted the spatial lag of errors and µ symbolised a random error term.
Although the above approaches deal with the various spatial heterogeneities in the data, they ignore spatial non-stationarity by assuming the stability of relationships between the independent variables and the dependent variable over space. In the fth step, we explored this issue using MGWR [62,63], which is a recent extension of the Geographically Weighted Regression (GWR) models [64,65]. Unlike methods that simply elicit global conditional relationships, the GWR models estimate local relationships by weighting observations based on the distance from a given district (i.e. by "borrowing" data from nearby units). The results of the local estimates are thus typically represented through maps. The MGWR additionally incorporates an assumption that the processes which determine the spatially varying regression parameters may operate at different spatial scales. It was expressed as: where, y i was the dependent variable in a district i, β ij denoted the j-th local regression coe cient for the district i and bwj in β bwij refers to the bandwidth used for the calculation of the regression estimates for the j-th predictor x j (i.e. the spatial range for 'borrowing' data from nearby units). This implies that the bandwidths may differ for particular conditional relationships and the optimal bandwidths were therefore determined by a consecutive tting of models with different bandwidths and comparing their goodnessof-t statistics. A back-tting algorithm was also used for model calibration, as described in  [62]. The adaptive bi-square spatial kernel that was applied as the weighting function was denoted as: if and zero otherwise with d ij representing the distance between districts i and j and G i being the distance from district i to its Nth nearest neighbour, as was determined again by an iterative comparison of the goodness-of-t measures.
Although our district-level data was based on the 2011 Census and therefore may be considered as population data (in a statistical sense), we also report ndings on the statistical signi cance of the results derived to provide an indication of the level of a given measure's effect that may simply occur by chance (e.g. due to random spatial distribution).
In addition to the regression analyses, we also applied cluster analysis in order to identify groups of districts with similar local effects of the analyzed predictors on the OD rate. A standard K-means clustering method was used for this purpose.

Variables
The dependent variable was the district-wise percentage share of rural households that practiced OD, as was calculated from the Census of India, 2011 (Tables on Houses, Household Amenities and Assets). It was ascertained by considering the item 'no latrine within premises' but excluded those households that reported the use of public toilets.
A set of 11 potentially relevant predictors was initially considered based on the same source of data-rural literacy rate, workforce participation rate in primary and non-primary activities, population density, share of Muslims in rural population, accessibility of drinking water source and ve variables measuring the ownership of assets. For quantifying the primary sector workforce, cultivators and agricultural labourers were summed together, while the remaining categories of workforce were classi ed as the non-primary workforce. The population density (rural) was obtained by dividing the rural population of each district by that district's rural area (the total town area was subtracted from the total district area, as speci ed in the Town Amenities Table in the District Census Handbook).Water accessibility was measured as the share of households whose drinking water source was located away from the home premises (hereafter referred to as drinking water inaccessibility). The ve variables measuring the ownership of assets included the shares of households owning a television, computer (with and without internet connections), mobile phone, scooters/motorcycles, and cars/jeeps/vans.
Due to multi-collinearity issues and a high internal consistency (the Cronbach's alpha of their standardized gures corresponded to 0.84), we decided to use a single composite index of the household assets owned in the ensuing analyses. This index was created via Principal Component Analysis. Here, the rst component, which explained 65% of variance and revealed the high loadings for the ve variables of assets in the range 0.59 to 0.87, was taken to represent the composite assets scores for particular districts.
Based on the inspection of multicollinearity, the variable denoting the primary sector workforce participation rate was excluded from the analyses. Although some of the other covariates were correlated (the highest correlation coe cient corresponded to 0.66), the appropriate diagnostics suggested an acceptable level of multi-collinearity among the six predictors that were eventually examined in the ensuing regression analyses. In addition to this, we also considered a two-way interaction between the share of Muslims and population density to test our hypothesis about the positive sanitation externalities arising due to the spatial co-concentration of Muslim and non-Muslim communities, as described in Sect. 2 above.
Before the regression analyses was undertaken, both the dependent and independent variables were zstandardised to enable comparisons of regression estimates across both variables and regression models. Such standardisation of variables is also recommended for the MGWR analysis [62]. Table 2 reports the descriptive characteristics for the analyzed variables, plus the global Moran's I measure of spatial autocorrelation and results of the Theil index of spatial inequality and its decomposition. To explore the more nuanced patterns of spatial clustering, Fig. 3 provides the maps of the local indices of spatial autocorrelation for all of the variables. Overall, these exploratory results clearly demonstrate that the examined variables are considerably spatially heterogeneous with non-trivial spatial patterns. The district-level OD rates ranged between 17% and 96%, with a mean of 82% and the standard deviation being 12%. In about 70% of the analyzed 266 districts, the OD rate was above 80%, while there were only nine districts (two in Rajasthan and seven in Uttar Pradesh) where the share of OD practicing rural households was below 50%. All of the measures showed signi cant (either high or very high) spatial autocorrelation. Although not negligible, the OD rate revealed a relatively lower spatial variation compared to the other variables, seemingly because we purposefully chose an area where the practice of OD was generally prevalent in 2011. Furthermore, the Theil index decomposition uncovered that the between-state component of spatial inequality in OD rates accounted for only a small share (2%) of the overall spatial inequality. This share of between-state inequality is thus lower than that for other variables. In fact, it corresponds to what can be obtained based on a random spatial distribution (i.e. to the gure obtained by averaging results calculated from a repetitive random reallocation of the OD observations across the examined map of districts -see Novotný and Nosek, 2012 [66]. Rather surprisingly, it indicates that state boundaries are not crucial for delimiting/constraining the processes that in uence the observed spatial variations in OD rates. This scenario is however quite different for some of the other examined variables, such as for the environmentally conditioned population density and access to drinking water source variables or for the share of Muslims in rural population variable, which are known to be spatially uneven due to socio-culturally conditioned settlement patterns.  Table 3 shows the results derived from the examination of the global relationships between the prevalent OD rate and its ascertained predictors. Although all four regression techniques proved to have relatively good explanative power, the spatial regressions performed notably better. Both the spatial lag (ρ) and spatial error (λ) coe cients were strong and statistically signi cant, con rming the importance of inculcating such spatial patterns of data in the modelling and gauging of the prevalence of OD practices.

Exploratory results
The best model t was attained by the spatial error regression model and the appropriate diagnostics (robust Lagrange Multiplier statistics) indicated the superiority of this modelling approach with respect to our data. Comparisons of the two linear regression speci cations revealed that the inclusion of the statelevel dummy variables had led to only mild improvements in the goodness of t. These state dummy variables were also not statistically signi cant. This nding corroborates the results obtained above by the Theil decomposition, implying the relative unimportance of state boundaries in explaining the spatial variation in the district-level OD rates prevalent in the examined area in 2011.
The relationships reported in Table 3 had the expected signs: the asset ownership, literacy rate, share of Muslims and non-primary work participation rate parameters were negatively associated with the OD rate, while the inaccessibility of drinking water revealed positive association. The composite index derived for assets owned, literacy rate and drinking water inaccessibility were identi ed as the signi cant predictors of the OD rate in all the four regression models. The population density was positively associated with OD in the linear and spatial lag regressions but this relationship disappeared in the spatial error regression. The non-primary work participation revealed a statistically signi cant effect in the spatial error regression. The proportion of Muslims in the rural population of a district was better associated with the OD rate in the linear regressions but weakly so in the spatial regression models, indicating that the effects of this variable may be spatially heterogeneous due to the above documented high spatial clustering of Muslims. The instability of regression parameters across individual regression models may be caused by the spatial non-stationarity of their relationships, which is further examined in Sect. 4.3.
The last model in Table 3 examined the interaction term between the share of Muslims and population density. Congruent to our hypothesis, a statistically signi cant negative joint effect of these two predictors on the OD rate was ascertained. This is an interesting nding, indicating that a concentration of Muslims in more densely inhabited areas possibly has some spill over effects, which positively affect sanitation conditions in these areas, not only in the Muslims neighbourhoods but also in the other communities situated adjacent/together with them.

Examining local relationships
With an R 2 of 0.848, adjusted R 2 of 0.799 and AIC of 386, the main-effects the MGWR model (i.e. without the interaction term) clearly outperformed the global regressions reported in Table 3. The summary results obtained for the individual covariates are presented in Table 4. It shows that the mean local beta coe cients for individual variables generally correspond to the regression parameters reported for the global regressions above. The strongest average local relationships with the OD rate was again revealed for the assets ownership score followed by the drinking water inaccessibility and literacy rate variables.
The average local beta coe cients were closer to zero for the other variables and their standard deviations were higher. This suggests that their effects are either generally weaker or that they are spatially concentrated. The mean values of the local regression coe cients mask their considerable variability across the set of examined districts. Both the indices of variation of the local coe cients and the Monte Carlo spatial variability test documented that the relationships with the OD were signi cantly spatially unstable for all the predictors, except with the ownership of assets score. The spatial nonstationarity was particularly pronounced for the share of Muslims and population density, with 16% and 36%, respectively, of the statistically signi cant local regression coe cients of these variables (and 45% and 33%, respectively, of their all local coe cients) revealing opposite signs to those of their global beta coe cients. This seemingly implies that these variables may affect the OD rates through distinctly different mechanisms in different geographical regions. The interaction effect between this variable and the district-level population density may be one such mechanism.  The local goodness of t statistics (local R 2 ) ranged between 0.36 to 0.92 with the distribution of its values tending towards the higher margin (local R 2 values were above 0.80 for 78% of districts). As shown in Fig. 5, districts with a low local goodness of t were mostly concentrated in the eastern part of Uttar Pradesh. In addition to this, we inspected the spatial distribution of residuals obtained from the MGWR model and the Moran's I statistic of -0.035 indicated their quite random distribution.
The appropriate diagnostics did not suggest problems with collinearity in the global regressions and moreover, the mean values of the local collinearity measures were also below the recommended thresholds. More speci cally, the mean local condition number was 20 and the mean local variance decomposition proportion (VDP) values for these particular predictors ranged between 0.14 and 0.38. However, the inspection of these local collinearity measures indicated that collinearity may be an issue for some districts. Acknowledging that some of the predictors and their respective sets of local beta coe cients were signi cantly correlated, we investigated the sensitiveness of the results derived via the MGWR through alternative model speci cations. For this, we ran six restricted MGWR models in which individual covariates were consecutively omitted (i.e. each of these restricted models had ve independent variables instead of six, as in the full model above). The summary results obtained from this exercise appear in Table 5. We found that the sets of local beta coe cients obtained in the full maineffect model above and from the restricted models are highly correlated (the fth column in Table 5).
Furthermore, the maps of the local regression coe cients averaged across the six restricted MGWR models (see Additional le g. S-1) show very similar spatial patterns as that represented by the maps of the local coe cients obtained from the MGWR model (Fig. 4). Moreover, the other measures reported in Table 5 that characterize the regression results obtained from the restricted MGWR models, document their stability and do not challenge the inferences drawn from the MGWR model as reported above. In the next step of our analysis we ran the MGWR model that additionally included the interaction term between the share of Muslims and population density. The summary results appear in Table 6 and show that the mean local regression coe cients for the interaction term corresponded to -0.146, with the range between − 0.279 and − 0.081 and this was the lowest spatial variation from all analyzed predictors. The joint effect of the share of Muslims and population density on the prevalent OD rates thus seems to be quite stable spatially. However, the map of the local beta coe cients for the interaction term in Fig. 6.a indicates a clear west-east gradient, with the stronger and statistically signi cant effect localized in the western part of study area. Additionally, results revealed that the share of Muslim population (as a maineffect variable) of a district was also a statistically signi cant predictor of the OD rate in 110 districts. At the same time, in 94% of these districts, the local coe cients for the interaction term were also statistically signi cant. This implies that population density ampli es the effect of the share of Muslim population variable on OD rates. Finally, we tried to identify clusters of districts based on similarity in the local beta coe cients obtained in the previous step. For this purpose, we used a K-means method (applying the Euclidean distances and k-means + + algorithm) with the nal number of six clusters determined based on the consideration of changes in the sum of within-cluster squares. The nal classi cation of districts appears in Fig. 6.b which shows that the resulting clusters are mostly spatially compact, and as such they may be thought of as distinct spatial regimes. Table 6 depicts the cluster centres for individual variables indicating how their relationships to the OD rate differ between these spatial entities. All of the analyzed predictors revealed comparatively strong effects within Cluster 5 (comprising 37 districts in western Uttar Pradesh, three districts in Northern Rajasthan and one district from northern Madhya Pradesh) and especially within Cluster 6 (consisting of eight districts in northern Rajasthan). The above is evidenced by drinking water inaccessibility and population density, with considerably stronger positive relationships to the OD rate in these clusters (and particularly in Cluster 6) than in the remaining zones. The interaction term between the share of Muslims and population density was also strong in these two clusters, in addition to Cluster 2 (most of Madhya Pradesh, remaining districts of Rajasthan and four districts from the south-west part of Uttar Pradesh). The role of the non-primary work participation parameter was found to be important for Pradesh), the literacy rate parameter appeared to be a relatively strong predictor. However, this Cluster 3, together with Cluster 1, represents groups with generally weaker local beta coe cients for all of the independent variables. The district-level variation in OD rates was relatively well-explained by the variations in the few predictors considered in this study. The R 2 of the regression models ranged between 0.43 for the simple linear individualistic fallacy (disregard of population-level determinants when explaining individual-level outcomes), thereby invoking a need to avoid them, for example, by using a multi-level analysis [67,68]. A major practical issue, nevertheless, is that compatible individual-and population-level data are usually not available, as was the case in this study. Although weaker in terms of causal inference and limited in the range of available explanatory variables, population-level studies based on administrative data, remain a viable option for exploring larger-scale 'global' relationships, which are potentially more generalizable. However, using ndings from ecological studies often means applying them at the local level within the population being examined. As our article has demonstrated, prior explorations of the stability of the identi ed global relationships becomes essential for this, particularly if the dependence of the analyzed phenomena on the variations in the external environment is high.
We have argued that sanitation represents an exemplary case in this respect. Each of the predictors examined in this study revealed statistically signi cant global relationship with the prevalent OD rate (p < 0.05) in at least two of the ve global regression models, but only the assets ownership score was statistically signi cant in all of these regressions ( Table 3). The variation in regression parameters for the same predictors between different regression models can largely be attributed to the presence of spatial autocorrelation in our data. Simple linear models did not account for such spatial dependence at all, implying that that the regression parameters may be in ated for most spatially autocorrelated covariates relative to those obtained from spatial regressions. This is exactly what we saw in Table 3 for the population density, inaccessibility of drinking water and the share of Muslims variables. These predictors revealed their strong spatial clustering (with the respective univariate Moran's I being 0.80, 0.74, and 0.72) and their beta coe cients thus became considerably lower in the spatial regression models when compared to those obtained from the simple linear model. As the data analyzed in this paper can be considered as population data (in a statistical sense), the regression parameters obtained by both the non-spatial and spatial models are valid. However, they have distinct interpretations and their relevance depends on judgements about the role of spatial clustering vis-à-vis to analyzed global relationships. The results from the simple linear regression shows that the global relationships are relevant, if we believe that spatial autocorrelation represents a part of random disturbance. The spatial regressions provide additional information by indicating what the global relationships are if the effect of spatial clustering is controlled for.
Differences between the global regression coe cients obtained for the individual variables in the nonspatial and spatial regressions indicated the presence of spatially varying relationships that were explored in further detail using MGWR. While each of the predictors revealed statistically signi cant local relationships with the OD rate for at least 41 of the 266 districts, none of them were statistically signi cant in all the districts (Tables 4-6). The strengths of the local associations between each of the examined predictors and the OD rate were generally higher in the western part of the study area (region comprising of Rajasthan, western Uttar Pradesh and western Madhya Pradesh). The only exception was the literacy rate parameter, which displayed signi cant local effects in eastern Madhya Pradesh and in the north-eastern portion of the study area.
A more detailed look at the spatial variations in the local beta coe cients allowed us to identify six clusters of districts that had internally similar roles of the individual predictors of the OD rate ( Fig. 6.b).
Interestingly, these clusters often stretch across the state boundaries in our study area. Similarly, the decomposition of the Theil coe cient into the between-state and within-state components of district-level inequality ( Table 2) as well as the inclusion of the state-level dummies into one of the global regression models (Table 3) showed that the state-level differences surprisingly explain only a small part of the district-level variations in the OD rate. This is a notable nding because state-level analysis is often the rst option/method used when examining (or controlling for) spatial differences in India and it also represents a key policy framing level for the administration of sanitation interventions [13]. Our results, however, suggest that as far as sanitation issues go, the state as a unit does not represent the level at which the fundamental underlying processes related to this phenomenon operate. The differences in the spatial regimes of sanitation determinants discerned in this study (i.e. Figure 6.b) can be explored further and, possibly, considered for designing and administering suitable interventions.
Water availability is known to be a generally important barrier for toilet adoption in India and the same was con rmed in this study. However, the positive relationship between drinking water inaccessibility and the OD rate was considerably more pronounced in the north-western part of the study area, which is also comparatively more water-stressed [13]. The climatic conditions and associated spatial variations in the organisation of settlements may also be related to this positive relationship between the population density and the OD rate, as was identi ed for the same north-western part of Rajasthan. By contrast, the opposite. i.e. a negative association between population density and the OD rate was uncovered for the generally densely (and more evenly) populated north-eastern part of the study area (i.e. eastern Bihar, north-eastern portion of Jharkhand and the considered districts of West Bengal). These ndings suggest that population density may operate through distinct mechanisms in location-speci c different contexts.
Moreover, population density was also shown to enhance (amplify) the negative association between the share of Muslims in the district population and the OD rate. This observation indicates that in addition to its direct effect (i.e. lower prevalence of OD among Muslims), a higher proportion of Muslims in the district population may additionally affect sanitation conditions indirectly through positive externalities stemming from the co-concentration of Muslims with non-Muslim (mostly Hindu) communities. Our study thus provides further support for a similar nding documented previously by Geruso and Spears (2018) [32]. However, our results differ from theirs in two respects. Firstly, Geruso and Spears (2018) [32] used village-level data (more speci cally, the NFHS data at the level of its primary sampling units), while we have analyzed the information at the aggregated level of the districts. Secondly, they showed that the presence of Muslims in a locale generated positive health externalities for the neighbouring communities as well due to the better sanitation practices adopted by the Muslims neighbourhoods. Our results indicate that the sanitation conditions of the other communities are actually positively affected by their co-concentration with Muslims. As with the other statistical associations uncovered in this study, this result is only indicative. However, we regard this nding as quite an interesting topic for future research that can examine how the mechanisms underlying the nexus between the spatial concentrations of Muslims, sanitation conditions and health operate.
The assets ownership score and the literacy rate were identi ed as the two spatially most stable local correlates of OD in this study. These results signify the importance of wider socioeconomic development towards bettering sanitation conditions in India. This is a very basic and not a novel observation, but nevertheless is worth repeating, particularly in relation to the implementation of the Swachh Bharat Mission, which had aimed to make India OD free within a period of a few years. This nding implies that it is unlikely that this ambitious goal has been achieved or that the successes attained can be sustained only through speci c sanitation interventions, unless the broader socioeconomic situation of rural communities is enhanced signi cantly [29,69].
An obvious limitation of the present study is that it draws on the 2011 Census data, which may already be outdated. The National Family Health Survey (NFHS-4) from 2015-16 indicated a mild decline in the OD rate in the study area. However, the actual signi cant change in toilet coverage in this region probably occurred slightly after the NFHS-4 report was published, as a result of the implementation of the Swachh Bharat Mission. Recent data that records the effects of this intervention is not yet available and thus we could not undertake the related analyses in this paper. While the NFHS-4 data might be considered as an alternative information source, the district-level OD gures available from it are not reliable. Furthermore, as our analysis has demonstrated, studies conducted at the state level in India are not adequate for such spatial examinations of sanitation behaviour and its correlates. Even more importantly, we believe that our results are vital as they demonstrate the contextual nature of sanitation by documenting the considerable spatial heterogeneity and non-stationarity in the OD correlates. Therefore, our study has a general relevance and provides a model example for future research into similar topics, for example, when the 2021 Census data is published.

Conclusion
This article has analyzed the district-level correlates of open defecation in seven northern and eastern states of India that had reported comparatively high OD rates in 2011. Parameters like the ownership of assets, drinking water inaccessibility and the literacy rate were identi ed as the most stable correlates of the OD phenomenon. The representation of Muslims in the district population was negatively associated with its OD rate. Moreover, we showed that this effect is ampli ed by the level of population density, which indicates positive sanitation externalities due to that the spatial co-concentration of Muslim and non-Muslim populations. To our knowledge, this article provides the rst quantitative analysis of spatial nonstationarity of sanitation determinants and clearly demonstrates that the spatial clustering and spatial nonstationarity of sanitation phenomena is substantial. As such, the ndings derived from the various micro-level case studies that are usually undertaken should be generalized with much caution.
There is a pertinent need to complement this evidence by population-level studies. Ecological studies should, however, focus on patterns by considering both the global relationships and their spatial heterogeneity. Research ndings that ignore the spatial heterogeneity of sanitation drivers may possibly provide incomplete information for framing policies.   Study area with its administrative divisions