This section first details the model selection process and count data methods used to estimate retail demand thresholds for non-employer establishments, employer establishments, and employment for 11 finely disaggregated retail industries across the contiguous United States. The second part of this section details the data used to develop unique industry-specific location determinants, including time-invariant place-based factors and restricted-access establishment and employment data.
The count nature of the establishment and employment data suggests using a count data estimator over continuous linear alternatives. Count data estimators are better suited to such data because they restrict fitted values to be nonnegative and do not require the conditional mean, E(y|x), to be linear in x. While continuous data models have been used successfully, as with Mushinski and Weiler's (2002) simultaneous Tobit model, the choice to deviate from count data estimators seems to be driven by other econometric needs, such as the inherent endogeneity in Mushinski and Weiler.
In selecting a count data estimator, it is helpful to consider how the industry landscape reflects a potential business owner’s location decision. For lower-ordered goods, we would expect to observe establishments located in most counties, with large central places containing more lower-ordered goods vendors to serve a larger population. The distribution for these lower-ordered goods may consequently resemble a standard Poisson distribution. Higher-ordered goods, however, may be more likely to benefit from economies of agglomeration (Henderson et al., 2000), and thus may have large clusters of establishments in large central places. The existence of economies of agglomeration in an industry could cause the distribution to appear skewed right, adding overdispersion into the distribution of establishments.
Overdispersion occurs when the dependent variable’s variance is larger than its mean. It violates the assumption of equidispersion in Poisson models, leading to a higher probability of committing a Type I error in significance testing (Perumean-Chaney et al., 2013). In such cases the negative binomial distribution is more efficient, as it allows for overdispersion by introducing unobserved individual heterogeneity into the Poisson’s conditional mean (Greene, 2012). This distinction between ordered goods and their distributions is important not just for efficient estimation, but also for accurately modeling the business location decision. Consequently, establishment counts for lower-ordered industries are more likely to be Poisson distributed, while those for higher-ordered industries are more likely to be negative binomial distributed due to their propensity to agglomerate.
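The link between unobserved heterogeneity and overdispersion can be illustrated with a small simulation. The sketch below is illustrative only and is not the estimation procedure used in this paper: it assumes a gamma-distributed multiplicative heterogeneity term (the textbook route to the negative binomial), and the sample size, mean, and shape values are arbitrary choices made for the example.

```python
import math
import random
import statistics


def draw_poisson(rng, lam):
    """Draw one Poisson variate with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def dispersion_index(counts):
    """Sample variance divided by sample mean; roughly 1 under equidispersion."""
    return statistics.variance(counts) / statistics.mean(counts)


rng = random.Random(0)            # fixed seed so the example is reproducible
n, mu, shape = 5000, 3.0, 1.0     # hypothetical values chosen for illustration

# Pure Poisson: every "county" shares the same conditional mean.
poisson_counts = [draw_poisson(rng, mu) for _ in range(n)]

# Gamma-Poisson mixture: each county's mean is perturbed by unobserved
# heterogeneity with expectation mu, which yields negative binomial counts.
mixed_counts = [
    draw_poisson(rng, rng.gammavariate(shape, mu / shape)) for _ in range(n)
]

print(round(dispersion_index(poisson_counts), 2))
print(round(dispersion_index(mixed_counts), 2))
```

Under these assumptions the mixture has theoretical variance mu + mu^2/shape, so its dispersion index should sit well above one, while the pure Poisson draws should stay close to one.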
While some retail types such as gas stations are present in nearly all United States counties, other types of retailers such as art dealers may only be present in a fraction of counties, leading to an excess of zeros. There are likely two different regimes within the zero generation processes for a particular industry: 1) structural zeros – places that lack some essential characteristic to support the industry, and 2) sampling zeros – places that meet minimum essential requirements but do not meet some other set of economic factors to support the industry. For example, a boat dealer would presumably locate somewhere near bodies of water, so we would expect to observe structural zeros in establishment count data in counties with no bodies of water. However, we may still observe sampling zeros in counties where other economic factors, such as population or income combined with random chance, are the preventative factors.
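The two zero regimes can be made concrete with a small calculation. The sketch below assumes a zero-inflated Poisson-type process in which a county is a structural zero with probability `pi_structural` and otherwise draws a Poisson count with mean `lam`; both names and the example values are hypothetical and serve only to illustrate the decomposition.

```python
import math


def zip_zero_shares(pi_structural, lam):
    """Split the probability of observing a zero into its two sources.

    structural: the county lacks an essential characteristic (e.g., no water
                for a boat dealer), so the count is zero with certainty.
    sampling:   the county could support the industry, but the Poisson draw
                still happens to be zero.
    """
    structural = pi_structural
    sampling = (1.0 - pi_structural) * math.exp(-lam)
    return structural, sampling


# Hypothetical example: 30% of counties are structural zeros and the
# remaining counties have an expected count of 2 establishments.
structural, sampling = zip_zero_shares(0.3, 2.0)
print(structural, round(sampling, 4))
```

The total share of zeros is the sum of the two components, which is why a plain Poisson model with the same mean would underpredict zeros in such data.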
In the extant count data literature, this zero-generation process is captured by two mechanisms: hurdles and zero inflation. Hurdle models break the location decision into two stages, where the first choice is whether to locate in a county, and the second choice is how many establishments to locate there. The hurdle Poisson (HP) model, however, does not have two zero-generating regimes because its second stage is a zero-truncated count distribution. Thus, zero-inflated models, such as the zero-inflated Poisson (ZIP) and the zero-inflated negative binomial (ZINB), are more flexible than hurdle models, as they account for both structural zeros and sampling zeros. Within the ZIP model, the data generation process is defined first by a binary distribution that identifies whether the outcome is a structural zero, followed by a Poisson (or, for the ZINB, negative binomial) process in which zero is still a possible outcome. Following Henderson et al. (2000), the log-likelihood function may be written as:
where S is the set of observations for which y_i = 0, F is the logit link function, Z is a vector containing covariates in the participation decision, and X is a vector containing covariates in the amount decision. As a note, the last part of the second term is simply the standard Poisson model. By employing the ZIP and ZINB to model county-level retail establishment counts and comparing them to their conventional non-zero-inflated counterparts, we are able to test and account for overdispersion from the two zero-generating regimes as well as from an industry’s tendency to agglomerate.
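For reference, a standard textbook form of the ZIP log-likelihood that is consistent with the definitions above can be written as follows. This is a reconstruction under stated assumptions rather than a quotation of the cited source: the parameter vectors \gamma (participation decision) and \beta (amount decision) and the log-linear Poisson mean \exp(X_i'\beta) are notation introduced here.

```latex
\ln L = \sum_{i \in S} \ln\!\Big[ F(Z_i'\gamma)
        + \big(1 - F(Z_i'\gamma)\big)\exp\!\big(-\exp(X_i'\beta)\big) \Big]
      + \sum_{i \notin S} \Big[ \ln\!\big(1 - F(Z_i'\gamma)\big)
        - \exp(X_i'\beta) + y_i X_i'\beta - \ln(y_i!) \Big]
```

In this form, F(Z_i'\gamma) is the probability that observation i is a structural zero, the first sum covers the zero observations from both regimes, and the last three terms of the second sum are the standard Poisson log-likelihood contribution.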
Our method for choosing among the four identified estimators (Poisson, negative binomial, ZIP, and ZINB) follows Perumean-Chaney et al. (2013) by first testing for overdispersion (Poisson versus negative binomial), followed by testing for zero inflation in the resulting count model. While the test for overdispersion consists of a simple likelihood ratio test on the alpha overdispersion parameter, testing for zero inflation is more involved. Previous studies (e.g., Chakraborty, 2012; Perumean-Chaney et al., 2013) have used Vuong’s statistic to test for zero inflation, but Wilson (2015) demonstrates that this method is incorrect. When Vuong (1989) presented the test for “non-nested” models, he presented six assumptions, one of which was that “nesting must not occur at a boundary of the parameter space of the larger model” (Wilson, 2015). While zero inflation models collapse to their simpler count data counterparts when the zero inflation parameter equals zero, this outcome is on the boundary of the parameter space, leading to an unknown (non-normal) distribution for the test statistic (Wilson, 2015).[6] As a result, we identify zero inflation through visual inspection of dependent variable histograms as well as the Akaike and Bayesian information criteria (AIC and BIC, respectively) in post-estimation (Greene, 1994). The AIC and BIC are attractive measures for assessing zero inflation because they are not restricted to nested models. If an industry follows a zero-inflated data generation process, we retest the overdispersion parameter to ensure the overdispersion was not solely a product of the zero inflation process.
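The selection sequence described above can be sketched in a few lines. The example below is a minimal illustration, not the code used for the estimates in this paper: the log-likelihoods, parameter counts, and sample size are made-up placeholders, and halving the chi-square(1) p-value in the overdispersion test is a common convention for a parameter tested at its boundary (alpha = 0) that is assumed here rather than taken from the text.

```python
import math


def aic(loglik, k):
    """Akaike information criterion: 2k - 2 ln L (smaller is better)."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln L (smaller is better)."""
    return k * math.log(n) - 2 * loglik


def lr_overdispersion_test(ll_poisson, ll_negbin):
    """Likelihood ratio test of alpha = 0 (Poisson) against the negative binomial.

    Assumption: because alpha = 0 is on the boundary of the parameter space,
    the chi-square(1) tail probability is halved.
    """
    stat = max(0.0, 2.0 * (ll_negbin - ll_poisson))
    if stat == 0.0:
        return stat, 1.0
    # Survival function of chi-square(1) is erfc(sqrt(stat / 2)).
    return stat, 0.5 * math.erfc(math.sqrt(stat / 2.0))


def best_model(fits, n, criterion="aic"):
    """Return the model name with the smallest information criterion."""
    def score(item):
        loglik, k = item[1]
        return aic(loglik, k) if criterion == "aic" else bic(loglik, k, n)
    return min(fits.items(), key=score)[0]


# Hypothetical (log-likelihood, number of parameters) pairs for one industry.
fits = {
    "Poisson": (-5200.0, 10),
    "NB": (-4100.0, 11),
    "ZIP": (-4300.0, 20),
    "ZINB": (-4050.0, 21),
}
n_counties = 3000  # placeholder sample size

stat, p_value = lr_overdispersion_test(fits["Poisson"][0], fits["NB"][0])
print(stat, p_value < 0.05)
print(best_model(fits, n_counties, "aic"), best_model(fits, n_counties, "bic"))
```

In practice the log-likelihoods would come from the fitted models, and the information criteria would be read alongside the histograms of the dependent variable rather than used as the sole decision rule.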
The data generation processes described above are likely to be dependent not only on the specific retail sector, but also on the industry size measure in the model. For example, non-employer and employer establishments may experience different benefits from agglomerating, leading one establishment count to resemble a Poisson distribution while the other might follow a negative binomial distribution. Regarding zero inflation, one could either view a non-employer establishment as a predecessor to an employer establishment, or as a more efficient means of delivering higher-ordered goods within smaller rural economies. Therefore, we may expect to find more excess zeros within employer establishment counts compared to non-employer establishment counts due to dissimilar economic opportunities across space and the establishment type’s role in Christaller’s (1966) functional hierarchy for a particular industry.
While the data generation process for employment likely also differs among retailers, we include this third measure primarily as a robustness check for employer establishment counts. Previous retail demand threshold studies focus on establishment counts, arguing that they represent a degree of consumer choice and availability in an area (Shonkwiler & Harris, 1996), but there is value in providing a measure of economic intensity (i.e. employment) that can be compared to establishment counts. Establishments of differing sizes are likely to provide differing consumer choices (e.g., seasonal ice cream stand versus full-service restaurant). Alternatively, the three measures of industry size in this analysis may also be thought of as portraying and improving the understanding of the industrial organization of different stages in a specific retail industry’s development, by 1) modeling the decision process for smaller (non-employer) establishments to locate in a place, 2) modeling what factors cause a larger (employer) establishment to locate in a place, and 3) modeling what factors cause employer establishments to grow (add employees) within a place.
A primary objective of this paper is to explore how the data generation processes of retail establishments may inform their hierarchical order. The zero-inflated models integral to this objective limit the use of spatial autocorrelation terms and spatially lagged covariates, so we omit both. As zero-inflated spatial count data regression models become commonplace, this is an opportunity for future analysis. Instead, we address the spatial element through covariates that may identify these spatial relationships, namely urban influence code (UIC) indicators, the share of residents who work out of county, and, to some extent, location quotients. The UIC indicators recognize when a micropolitan county neighbors a metropolitan county (both defined by population), while the share of commuting residents accounts for how economic dependence may be influenced by geographical barriers. These elements would not be captured through a simple spatial lag on neighboring population. Furthermore, these time-invariant factors would not be captured by a panel model with fixed effects.
Data
Most retail demand threshold studies use the publicly available County Business Patterns (CBP) and Non-employer Statistics (NS) datasets for their analyses. However, despite noise infusion in the CBP and NS, data for numerous counties are suppressed or binned (e.g., 1-9 employees), both of which can lead to relatively large distortions in measuring smaller rural economies. These issues become more prevalent as an industry is disaggregated into its smaller component industries. While some industry count estimates are available from private vendors, the vendors do not disclose their estimation methods, so the estimates are of unknown accuracy. Anecdotal testimony from local economic development practitioners indicates that some vendor-provided local employment estimates diverge widely from actual numbers.
We use the restricted-access establishment-level LBD and ILBD to circumvent these issues and provide unbiased demand threshold estimates for 11 retail industries, including eight at the most refined six-digit industry level (NAICS 44-45). While the LBD focuses on employer establishments and the ILBD focuses on non-employer establishments, the Census Bureau bases both annual data series on the Business Register and Internal Revenue Service tax records (Jarmin & Miranda, 2002). In addition to the data being more complete in scope than in prior works, the data also allow us to estimate the demand threshold for the number of employees within an industry. This alternative metric provides an intensity measure to compare with the simple existence of an establishment within a particular industry.
Census data privacy policy requires us to aggregate the data to the county level. However, a county-level analysis also allows us to merge other important county-level data sources and to make comparisons across the demand threshold literature, which also tends to be at the county level. It should be reiterated that the county-aggregated LBD is still superior to the CBP due to the LBD’s completeness, retiming of establishments, and nonsuppression of employment counts.
Our choice of variables was informed not only by the literature, but also by virtual facilitated discussions with rural retail service providers (see Loveridge, Nawyn, & Szmecko (2013) for a description of the method). Table I provides descriptive statistics for variables from secondary data sources as well as for the publicly available versions (CBP and NS) of these data. To avoid losing variation in the data from splitting the sample, rurality is addressed in the models through the inclusion of urban influence codes as dummy variables, the inclusion of population and population density, and the zero-inflation stage of the models when appropriate.
[Approximate position of Table I]
For ease of discussion, we organize the county-level covariates into three general categories: demographics and labor force, infrastructure and institutions, and the restricted-access establishment and employment data. Most of the demographic data are common in retail demand threshold models; however, the two other data categories are relatively novel additions to the literature and warrant more discussion. As the set of relevant variables varies from industry to industry, the discussion here will be limited to general descriptions of the variable categories and how they relate to demand threshold theory.
Demographics and Labor Force: Population, race/ethnicity, age, unemployment, and income measures are common in demand threshold models; however, our inclusion of social capital (Rupasingha et al., 2006), health insurance, and opiate overdoses is innovative. Support from community social networks enhances the likelihood of retail success and rural community sustainability (Frazier & Niehm, 2004; Korsching & Allen, 2004), as it allows for network development beyond the physical boundaries of the community market setting. For instance, to become competitive, rural businesses exploit social networks to access important information pertaining to their local consumer market (Frazier & Niehm, 2004). Additionally, community and business development activities can only succeed if supported by a community with strong social networks that involve participation from local professionals, business owners, and community members (Sharp et al., 2002).
Observing Cleary et al.'s (2019) finding of lower demand thresholds for food hubs in areas with higher social capital, we expect social capital to lower demand thresholds via lower average costs, while labor inhibitors such as opiate overdoses increase demand thresholds due to higher labor costs. The opiate epidemic was a growing issue in 2014 and was mentioned several times throughout focus groups (2018) with retail stakeholders in the context of labor supply issues. Opiate prescription rate and health insurance act as controls for opiate deaths and other variables of interest.
Finally, we hypothesize that the coefficient on the percent of workers who work outside their county of residence is negative, as this variable addresses the retail leakages and spatial interdependencies found in Mushinski & Weiler (2002). Referring to Figure 1, retail leakages effectively shrink the market size, leading to lower demand for local retail. If a significant portion of workers commute to another county for work, this will likely lead to retail leakage for their county of residence.
Infrastructure and institutions: Median home value represents multiple place-based amenities and often increases in higher-ordered places, reflecting the higher retail demand in amenity-rich places or central places. Similarly, we expect to observe more internet service providers (ISPs) in central places; however, this measure may increase or decrease retail thresholds due to the countervailing effects of greater efficiency (lower AC) and market access (higher S) against competition from non-local e-commerce businesses (lower global price).[7] Still, evidence suggests that greater internet access may introduce businesses to other determinants of growth, such as greater social capital (Kharisma, 2022).
The combined state and average local sales tax rate is likely to increase demand thresholds due to higher costs of production for retailers, while we hypothesize average effective property tax rates to lead to lower demand thresholds. This hypothesis comes from evidence that manufacturers’ decisions to locate in a place are either not affected or are positively related to higher property tax rates because low property taxes frequently imply low-quality public services (Gabe & Bell, 2004; Reum & Harris, 2006). Glaeser, Kolko and Saiz (2001) argue the importance of multiple amenities for attracting and retaining workers in central places and similar arguments can likely be made for the hinterlands. Higher quality provision of public goods and services such as Main Street beautification projects, parking infrastructure, parks, and general city maintenance are likely to be drivers for retail sector establishments and employment.
The effects of other variables in this category are likely industry specific, depending on how closely the industry is related to outdoor recreation, its reliance on mobile clientele (e.g., gas stations), and its relationships with large institutions, such as universities.
Restricted-access establishment and employment data: While we are unable to present the summary statistics for much of the data due to Census Bureau disclosure limitations, we used the LBD and ILBD to create employment location quotients for 11 two-digit NAICS sectors to account for Jacobs' (1969) between-sector economies of agglomeration. Although we do not directly test for the mechanisms through which Jacobs' between-sector economies of agglomeration occur, a positive, significant coefficient will indicate evidence of between-industry economies of agglomeration. We also include an establishment location quotient for retail industries outside of the industry being modeled to account for Marshall-Arrow-Romer within-sector agglomeration economies (Glaeser et al., 1992). Both types of agglomeration economies may lead to lower costs.[8] While future studies might show how different measures of agglomeration (e.g., the Ellison-Glaeser Index [Ellison et al., 2010]) and its sources influence retail demand thresholds, in providing the first attempt to account for the phenomenon, we opt for the simpler and well-known location quotient.
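For readers unfamiliar with the measure, a location quotient compares an industry's local share of activity with its national share. The sketch below shows the conventional formula with hypothetical numbers; the function and argument names are illustrative, and the restricted-access data themselves cannot be reproduced here.

```python
def location_quotient(local_industry, local_total, national_industry, national_total):
    """Conventional location quotient.

    LQ = (local industry share) / (national industry share).
    Values above 1 indicate the county is more specialized in the industry
    than the nation as a whole. Returns None when a share is undefined.
    """
    if local_total == 0 or national_total == 0 or national_industry == 0:
        return None
    local_share = local_industry / local_total
    national_share = national_industry / national_total
    return local_share / national_share


# Hypothetical county: 200 of 2,000 jobs are in a sector that accounts for
# 5,000 of 100,000 jobs nationally (10% locally versus 5% nationally).
lq = location_quotient(200, 2_000, 5_000, 100_000)
print(lq)
```

The same function applies to establishment counts; for the within-retail measure described above, the industry being modeled would be excluded from the numerators before the shares are formed.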