Precise Event-level Prediction of Urban Crime Reveals Signature of Enforcement Bias

Policing efforts to thwart urban crime often rely on detailed reports of criminal infractions. However, crime rates do not document the distribution of crime in isolation, but rather its complex relationship with policing and society. Several results attempting to predict future crime now exist, with varying degrees of predictive efficacy. However, the very idea of predictive policing has stirred controversy, with the algorithms being largely black boxes producing little to no insight into the social system of crime, and its rules of organization. The issue of how enforcement interacts with, modulates, and reinforces crime has been rarely addressed in the context of precise event predictions. In this study, we demonstrate that while predictive tools have often been designed to enhance state power through surveillance, they also enable the tracing of systemic biases in urban enforcement: surveillance of the state. We introduce a novel stochastic inference algorithm as a new forecasting approach that learns spatio-temporal dependencies from individual event reports with demonstrated performance far surpassing past results (e.g., average AUC of ≈ 90% in the City of Chicago for property and violent crimes predicted a week in advance within spatial tiles ≈ 1000 ft across). These precise predictions enable equally precise evaluation of inequities in law enforcement, discovering that response to increased crime a 30% decrease in arrests when averaged over the city. The spatial distribution of locations that experience a positive vs. negative change in arrest rate reveals a strong preference favoring wealthy locations. If neighborhoods are doing better socio-economically, increased crime predicts increased arrests. A strong converse trend is observed in predictions for poor and disadvantaged neighborhoods, suggesting that under stress, wealthier neighborhoods drain resources from their disadvantaged counterparts.
b illustrates this more directly via a multi-variable regression, where hardship index is seen to make a strong negative contribution.

as they predict events at the target better than the target can predict itself. More details on the problem characteristics and performance are provided in Tab. I and II respectively.

For Chicago, we make predictions separately for violent and property crimes, individually within spatial tiles roughly 1000 ft across and time windows of 1 day, approximately a week in advance, with AUCs ranging from 80% to 99% across the city. We summarize our prediction results in Fig. 2, where panels a and b illustrate the geospatial scatter of AUC obtained for different spatial tiles and types of crime, and c shows the distribution of AUCs achieved. Out-of-sample predictive performance remains stable over time; our predictions on successive years (each time using three preceding years for training, and one year for out-of-sample test; see Fig. 8) show little variation in average AUC. Inspecting excerpts of the average daily crime rate for successive years also shows a close match between actual and predicted behavior (see Fig. 9, panels a, c and e). The remaining panels (b, d and f) in the same figure illustrate how the Fourier coefficients match up, showing that we are able to capture periodicities at the weekly and bi-weekly scales, and beyond.

Unlike previous efforts 1-5 , we do not impose pre-defined spatial constraints. In contrast to contiguous diffusion phenomena encountered in physical systems, crime may spread across the complex landscape of a modern city unevenly, with regions hyperlinked by transportation networks, socio-demographic similarity, or historical collocation. Rather than assuming that events far off across the city will have a weaker influence compared with those physically near in space or time, we probe the topological structure emergent in the inferred dependencies to estimate the shape, size and organization of neighborhoods that best predict events at each location. The results illustrated in Fig. 2d and e show that the situation is complex, with the locally predictive neighborhoods varying widely in geometry and size, implying that restricting analysis to relatively small local communities within the city is sub-optimal for crime prediction and enforcement analysis. In order to analyze whether the effects of reported criminal infractions diffuse outward in space and time, we simply calculate temporal-spatial distances of influences, then average across all neighborhoods in the city, revealing the rapid decay with time delay in diffusion rates shown in Fig. 2f. Interestingly, we find that property and violent crimes differ in their rates of influence diffusion (Fig. 2f); while the effect of property crimes decays rapidly in days, violent reported events shape the dynamics for weeks to come.

Forecasting crime via analyzing historical patterns has been attempted before 18,19 . These approaches use state-of-the-art deep learning tools based on recurrent and convolutional neural networks (NN). In the first article 18 , the authors train a NN model to predict next-day events for 60,348 sample points in Chicago. The

Plate c illustrates our modeling approach: We break the city into small spatial tiles approximately 1.5 times the size of an average city block, and compute models that capture multi-scale dependencies between the sequential event streams recorded at distinct tiles. In this paper, we treat violent and property crimes separately, and show that these categories have intriguing cross-dependencies. Plate d illustrates our modeling approach. For example, to predict property crimes at some spatial tile r, we proceed as follows: Step 1) we infer the probabilistic transducers that estimate event sequence at r by using as input the sequences of recorded infractions (of different categories) at potentially all remote locations (s, s′, s″ shown), where this predictive influence might transpire over different time delays (a few shown on the edges between s and

to track graffiti, achieving an out-of-sample AUC of 83.3%. Our AUC is demonstrably higher (see Table II), and we predict with significantly less data (only past events), and 7 days into the future (instead of next-day). Additionally, the use of demographic and graffiti data is problematic, with the possibility of introducing racial and socio-economic bias, and is of dubious causal value. In the second article 19 , the authors combine convolutional and recurrent neural networks with weather, socio-economic, transportation, and crime data, to predict the next-day count of crime in Chicago. As spatial tiles, the authors use standard police beats, which break up Chicago into 274 regions. Police

Panels a-f illustrate the AUCs achieved in six major US cities. These cities were chosen on the basis of the availability of detailed event logs in the public domain. All of these cities show comparably high predictive performance. Panel g illustrates the results obtained by regressing crime rate and perturbation response against SES variables (shown here for poverty, as estimated by the 2018 US census).
We note that while crime rate typically goes up with increasing poverty, the number of events observed one week after a positive perturbation of 5-10% increase in crime rate is predicted to fall with increasing poverty. We suggest that this decrease is explainable by a disproportionate reallocation of enforcement resources away from disadvantaged neighborhoods in response to increased event rates, which leads to a smaller number of reported crimes.

model achieves a classification accuracy of 75.6% for Chicago, which compares against our accuracy of > 90% (See Table II). While this competing model tracks more crime categories, it is limited to next-day predictions with significantly coarser spatial resolution. We also compare the predictive ability of naive autoregressive baseline models (See Material and Methods and Table III), which perform poorly, but provide a yardstick to meaningfully compare our claimed performance estimates.
With our precise predictive apparatus in place, we run a series of computational experiments that perturb the rates of violent and property crimes, and log the resulting alterations in future event rates across the city. By inspecting the effect of socio-economic status (SES) on the perturbation response, we investigate whether enforcement and policy biases modulate outcomes. The inferred stress response of the city suggests the presence of socio-economic bias (See Fig. 3). Wealthier neighborhoods away from the inner city respond to elevated crime rates with increased arrests, while arrest rates in disadvantaged neighborhoods drop, but the converse does not occur (See Fig. 3, panels e and f). Resource constraints on law enforcement, combined with biased prioritization to wealthier neighborhoods, result in reduced enforcement across the remainder of the city. This provides evidence for enforcement bias within U.S. cities that parallels widely discussed notions of

While crime rate increases with degrading SES of local neighborhoods, the number of predicted events a week after a positive 5-10% increase in crime rate is predicted to go down. Thus increasing the crime rate leads to a smaller number of reported crimes, a pattern holding more often in poorer neighborhoods.

Our analysis also sheds light on the continuing debate over the choice of neighborhood boundaries in urban crime modeling 28-31 . In Fig. 2d-f, we demonstrate that despite apparent natural boundaries, influence is often communicated over large distances and decays slowly, especially for violent crimes. More importantly, this study reveals how the "correct" choice of spatial scale should not be a major issue in sophisticated learning algorithms where optimal scales can be inferred automatically. We find that there exists a skeleton set of spatial tiles, which have strong influence on the overall event patterns (See Fig. 10). These induce a cellular decomposition of the city that identifies functional neighborhoods, where the cell-size adapts automatically to the local event dynamics.

To our knowledge, this is the first analysis exploring perturbations of predictive data-driven models to probe the social dynamics of crime and its enforcement. Our ability to probe for the extent of enforcement bias is limited by our dataset, since inference of crime patterns is easily skewed by arrest rates. Disproportionate police response in Black communities can contribute to biases in event logs, which might propagate into inferred models. This has resulted in significant pushback from diverse communities against predictive policing 32 . Our approach is

In this study we use historical geolocated incidence data of criminal infractions to model and predict future events in Chicago, Philadelphia, San Francisco, Austin, Los Angeles, Detroit and Atlanta. Each of the cities considered has a specific temporal and spatial resolution, which is optimized to maximize predictive performance (See Table I). The predictive performance obtained in these cities is enumerated in Table II.

The sources of the crime incidence data used in this study for the different US cities are enumerated in Table I. These logs include spatio-temporal event localization along with the nature, category, and a brief description of the recorded incident. For the City of Chicago, we also have access to the number of arrests made during or as a result of each event. For Chicago, the log is updated daily, keeping current with a lag of 7 days, and we make predictions for each of the years 2014-2017 (using 3 years before the target year for model inference, and 1 year for out-of-sample validation) for the prediction results shown in Figure 1. The evolving nature of the urban scenescape 37 necessitates that we restrict the modeling window to a few years at a time. The length of this window is decided by trading off the loss of performance from shorter data streams against that from the evolution of the underlying generative processes over longer streams. The training and testing periods of the other cities are tabulated in Table I.

In this study, we consider two broad categories of criminal infractions: violent crimes consisting of homicides, assault, battery etc., and property crimes consisting of burglary, theft, motor vehicle theft etc. Drug crimes are excluded from our consideration due to the possibility of ambiguity in the use of violence in such events. For the City of Chicago, the number of individuals arrested during each recorded event is considered a separate variable to be modeled and predicted, which allows us to investigate the possibility of enforcement biases in subsequent perturbation analyses.

We also use data on socio-economic variables available at the portal corresponding to Chicago community areas and census tracts, including % of population living in crowded housing, those residing below the poverty line, those unemployed at various age groups, per capita income, and the urban hardship index 38 . Such data is also obtained from the City of Chicago data portal. Additionally, we use data on poverty estimates for the other cities, which are obtained from https://www.census.gov.

Event logs are processed to obtain time-series of relevant events, stratified by occurrence locations. This is accomplished by choosing a spatial discretization, and focusing on one individual spatial tile at a time, which allows us to represent the event log as a collection of sequential event streams (See Fig. 1c). Additionally, we discretize time, and consider the sum total of events recorded within each time window.

Coarseness of these discretizations reflects a trade-off between computational complexity and event localization in space and time. Spatial and temporal discretizations are not independently chosen; a finer spatial discretization dictates a coarser temporal quantization, and vice versa, to prevent long no-event stretches and long periods of contiguous event records, both of which reduce our ability to obtain reliable predictors. For the City of Chicago, we fix the temporal quantization to 1 day, and choose a spatial quantization such that we have high empirical entropy rates for the time series obtained. This results in spatial tiles measuring 0.00276° × 0.0035° in latitude and longitude respectively, which is approximately 1000 ft across, roughly corresponding to an area of under 2 × 2 city blocks. Thus, any two points within our spatial tile are at worst in neighboring city blocks. We dropped from our analysis the tiles that have too low a crime rate (< 5% of days within the modeling window had any event recorded) to reduce computational complexity, resulting in N = 2205 spatial tiles in the City of Chicago. The temporal and spatial resolution is adjusted in a similar manner for the other cities (See Table I).

Thus, we end up with three different integer-valued time series at each spatial tile: 1) violent crime (v), 2) property crime (u) and 3) number of arrests (w) in the City of Chicago. For other cities, we have only the first two categories, since information on arrests was not available. We ignore the magnitude of the observations, and treat them as Boolean variables. Thus, our models simply predict the presence or absence of a particular event type in a discrete spatial tile within a neighboring city block and observation window, i.e., within the temporal resolution chosen, which is 1 day except for Atlanta, where it is chosen to be 2 days (See Table I).
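The discretization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the record layout `(day, lat, lon, category)`, the function names, and the grid origin `(lat0, lon0)` are hypothetical, and the sparsity filter is applied per stream here for simplicity, whereas the text applies it per tile.

```python
from collections import defaultdict
from datetime import date


def boolean_event_streams(events, lat0, lon0, dlat, dlon, start, n_days):
    """Map (day, lat, lon, category) records to per-tile daily 0/1 sequences.

    Keys are (row, col, category); values are lists of length n_days.
    The record layout and the grid origin are illustrative assumptions.
    """
    streams = defaultdict(lambda: [0] * n_days)
    for day, lat, lon, category in events:
        t = (day - start).days
        if not 0 <= t < n_days:
            continue  # record falls outside the modeling window
        row = int((lat - lat0) // dlat)
        col = int((lon - lon0) // dlon)
        # Event magnitude is ignored: any number of events maps to 1.
        streams[(row, col, category)][t] = 1
    return dict(streams)


def drop_sparse(streams, min_fraction=0.05):
    """Keep streams with an event on at least min_fraction of the days.

    Simplification: the filter acts on each stream, not on each tile.
    """
    return {k: s for k, s in streams.items() if sum(s) / len(s) >= min_fraction}
```

With `dlat = 0.00276`, `dlon = 0.0035` and a one-day step, this reproduces the tile size and temporal resolution quoted for Chicago; the 5% default in `drop_sparse` mirrors the exclusion threshold mentioned in the text.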
Inferring Generators of Spatio-temporal Cross-dependence

Let $\mathcal{L} = \{\ell_1, \cdots, \ell_N\}$ be the set of spatial tiles, and $\mathcal{E} = \{u, v, w\}$ be the set of event categories as described in the last section. At location $\ell \in \mathcal{L}$ for variable $e \in \mathcal{E}$, at time $t$, we have $(\ell, e)_t \in \{0, 1\}$, with 1 indicating the presence of at least one event. The set of all such combined variables (space + event type) is denoted as $S$, i.e., $S = \mathcal{L} \times \mathcal{E}$. Let $T = \{0, \cdots, M-1\}$ denote the training period consisting of $M$ time steps. Because for any time $t$, $(\ell, e)_t$ is a random variable, our goal here is to learn its dependency relationships with its own past, and with other variables in $S$, to accurately estimate its future distribution for times $t$ after the training period $T$.

To infer the structure of our predictive model, we learn a finite state probabilistic transducer 16 (referred to as a Crossed Probabilistic Finite State Automaton or an XPFSA 15 ) for each possible source-target pair $s, r \in S$. Given a sequence of events at the source, these inferred transducers estimate the distribution of events at target $r$ for some future point in time. Ability to estimate such a non-trivial distribution indicates the presence of causal influence. Here we assume that causal influence from the source to the target manifests as the source being able to predict events occurring at the target, better than the target can do by itself. This interpretation follows from Granger's eponymous approach to statistical causality 39 . Importantly, we do not assume that the underlying processes are i.i.d., or that the model has any particular linear structure. Additionally, such influence is not restricted to be instantaneous. The source events might impact the target with a time delay, i.e., a specific model between the source and target might predict events delayed by an a priori determined number of steps $\Delta_{\max} \geq 0$ specific to the model. Here we model the influence structure for each integer-valued delay separately. Thus, for source $s$ and target $r$, we can have $\Delta_{\max} + 1$ transducers, each modeling the influence for a specific delay in $\{0, \cdots, \Delta_{\max}\}$. The maximum number of steps in time delay $\Delta_{\max}$ is chosen a priori, based on the problem at hand.

While these influences or dependencies may differ for different delays, they need not be symmetric between source and target pairs. The complete set, comprising at most $|S|^2 (\Delta_{\max} + 1)$ models, represents a predictive framework for asymmetric multi-scale spatio-temporal phenomena. Note that the number of possible models increases quickly. For example, for the City of Chicago, for $\Delta_{\max} = 60$ with 2205 spatial tiles and three event categories, the number of inferred models is bounded above by ≈ 2.6 billion.

is good, we will consistently achieve high TPR with small FPR, resulting in a large area under the ROC curve, denoted as the AUC. Importantly, AUC measures intrinsic performance, independent of the threshold choice. Thus, the AUC is immune to class imbalance (the fact that crimes are by and large rare events). An AUC of 50% indicates that the predictor does no better than random, and an AUC of 100% implies that we can achieve perfect prediction of future events, with zero false positives.

We use a flexible approach in evaluating AUC; a positive prediction is treated as correct if there is at least one event recorded within ±1 time steps in the target spatial tile.

2 Tiles with less than threshold event-rate were excluded.

We also carried out similar perturbation analyses for the other cities, and observed that with increasing poverty we have the expected increase of observed crime rates, but an unexpected decrease in violent and property crimes after a 5-10% simulated uptick in either category of crimes (See Fig. 4).
t whose inclusion models the influence of past values on the current value (autoregression), and the $\varepsilon_{t-k}$ are the white noise terms whose inclusion models the dependence of the current value on current and previous (observed) white noise error terms or random shocks (moving average). Specifically, we use the following four models for the earthquake and the crime datasets

Similarly, the decrease of property crimes from increase of violent crimes is also localized to disadvantaged neighborhoods (panel a), as well as the decreased violent crimes from increased arrests (panel k). We see a weaker localization for the corresponding increases in crime rates under similar perturbations. Looking at other pairs of variables under perturbation (rest of the panels), we generally do not see a very prominent correspondence with the distribution of socio-economic indicators. It seems crimes (and particularly violent crimes) are easier to dampen in locales with high existing crime rates, which is a desirable result. But such conclusions are currently confounded by SES variables, and further work is needed to investigate these effects more thoroughly.

Note that the criterion for "stitching" two subtrees with roots $x$ and $x'$ is that their edge labels are identical for all depths, which translates to $p(y|x) = p(y|x')$ for sequences $y$ of all lengths. The criterion is not verifiable with finite data, and hence GenESeSS identifies two subtrees if they agree on depth one. Defining the symbolic derivative $\phi_x$ to be the vector with the entry indexed by $\sigma$ given by $p(\sigma|x)$, GenESeSS identifies $x$ and $x'$ if $\phi_x = \phi_{x'}$. This approach works well under the assumption that the target PFSA is in general position, meaning that different causal states have distinct symbolic derivatives. In practice, GenESeSS uses the empirical symbolic derivative defined below to approximate $\phi_x$.
Let $x$ be an input sequence of finite length. The empirical symbolic derivative $\hat{\phi}^x_y$ of a sub-sequence $y$ of $x$ is a probability vector with the entry indexed by $\sigma$ given by

$$\hat{\phi}^x_y(\sigma) = \frac{\text{number of sub-sequence } y\sigma \text{ in } x}{\text{number of sub-sequence } y \text{ in } x}$$

For simplicity, we first illustrate how GenESeSS solves the transition structure of the target PFSA from a sample path $x$ generated from a process of Markov order $k$. Assuming the $x'$ produced by Step 1 (line 4) is $\lambda$, the empty sequence, GenESeSS starts by calculating $\hat{\phi}^x_\lambda$, i.e., the empirical distribution on $\Sigma$, and records $\lambda$ as the identifier of the first state. Then, GenESeSS appends $\lambda$ with each $\sigma \in \Sigma$, and calculates $\hat{\phi}^x_\sigma$. By the general position assumption and assuming $x$ is long enough, with high probability, no $\hat{\phi}^x_\sigma$ is within an $\varepsilon$-neighborhood of $\hat{\phi}^x_{\sigma'}$ for $\sigma' \neq \sigma$, and hence each $\sigma$ is recorded as the identifier for a new state. In fact, GenESeSS will keep on appending symbols to identifiers of stored states and adding new states until it reaches a sequence of length $k+1$. Assuming $y = \sigma_1 \cdots \sigma_k \sigma_{k+1}$, since the process is of order $k$, we have $\phi_y = \phi_z$ for $z = \sigma_2 \cdots \sigma_{k+1}$, and hence, with high probability, $\hat{\phi}^x_y$ and $\hat{\phi}^x_z$ can be within an $\varepsilon$-neighborhood of each other given long enough input $x$. In this case, GenESeSS identifies the state represented by $y$ with that of $z$. In fact, GenESeSS will identify all states represented by sequences of length $k+1$ with some previously-stored states. And since no new states can be found, GenESeSS exits the loop on line 8 after iteration $k+1$. Taking the strongly connected component on line 19, GenESeSS gets the correct transition structure.
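As a rough illustration of the quantity GenESeSS compares when deciding whether two sub-sequences lead to the same state, the sketch below estimates an empirical symbolic derivative from a string. It assumes the entry for a symbol is the count of occurrences of `y` followed by that symbol, normalized over all occurrences of `y` that have a successor (the same ratio-of-counts form as the cross symbolic derivative given later in the text). The function names and the use of the max-norm for the ε-neighborhood test are assumptions made for illustration.

```python
from collections import Counter


def empirical_symbolic_derivative(x, y, alphabet):
    """Estimate the next-symbol distribution following sub-sequence y in x.

    Returns a dict {symbol: probability}, or None if y never occurs in x
    with a following symbol. Passing y = "" gives the empirical
    distribution of symbols in x.
    """
    n = len(y)
    counts = Counter()
    for i in range(len(x) - n):
        if x[i:i + n] == y:
            counts[x[i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return None
    return {s: counts[s] / total for s in alphabet}


def within_eps(d1, d2, eps):
    """True if two derivatives differ by less than eps in every entry
    (max-norm; the choice of norm is an assumption of this sketch)."""
    return all(abs(d1[s] - d2[s]) < eps for s in d1)
```

For a process of Markov order $k$, one would expect `within_eps` to hold for sub-sequences sharing their last $k$ symbols once `x` is long enough, which is the merging condition described above.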
However, not all processes generated by PFSA have finite Markov order. For such cases, Step 2 of GenESeSS will never exit in theory, since there exists no $n \in \mathbb{N}$ such that every causal state is visited for sequences with length $n$.
And if we implement an artificial exit criterion, the model inferred might be unnecessarily large, and have hard-to-model approximations. We address this issue via the notion of synchronization: the ability to identify that we are localized or synchronized to a particular state despite being uncertain of the initial state.
In Step 1 of Algorithm 2 (line 1-4), GenESeSS finds an almost synchronizing sequence, which allows GenESeSS to distill a structure that is similar to that of the finite Markov order cases, and thus carry out the subtree "stitching" procedure described before. A sequence $x$ is synchronizing if all sequences that end with the suffix $x$ terminate on the same causal state. A process is synchronizable if it has a synchronizing sequence, and a PFSA is synchronizable if the process it generates is synchronizable. The structure of the "graph" of a perfectly synchronizable PFSA is that of a co-final automaton 1 .
A sequence $x$ is $\varepsilon$-synchronizing 2 to the state $q$ if the distribution $\wp_x$ on the state set $Q$ induced by $x$ satisfies $\|\wp_x - e_q\|_\infty < \varepsilon$, where $e_q$ is the base vector with 1 on the entry indexed by $q$ and 0 elsewhere. The importance of an $\varepsilon$-synchronizing sequence is twofold: 1) since $\phi_x^T = \wp_x^T \widetilde{\Pi}$, where $\widetilde{\Pi}$ is the $|Q| \times |\Sigma|$ matrix with the row indexed by $q$ given by $\widetilde{\pi}(q)$, a $\wp_x$ close to $e_q$ gives rise to a $\phi_x$ close to $\widetilde{\pi}(q)$. And 2) although sequences prefixed by an $\varepsilon$-synchronizing sequence to a state $q$ may not remain $\varepsilon$-synchronizing to state $q$, they are close to $q$ on average.
To find an almost synchronizing sequence algorithmically 2 , GenESeSS first calculates the convex hull of symbolic derivatives of subsequences of $x$ up to length $L$ (line 1-3), and then selects a sequence $x'$ whose symbolic derivative is a vertex of the convex hull (line 4). Since the convex hull of $\{\phi_x : x \in \Sigma^L\}$ is a linear projection of the convex hull of $\{\wp_G(x) : x \in \Sigma^L\}$ via $\widetilde{\Pi}$, we can expect a sequence $x$ with $\phi_x$ being a vertex of the convex hull of $\{\phi_x : x \in \Sigma^L\}$ to be a good candidate for an almost synchronizing sequence.
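The vertex-selection idea can be illustrated for the special case of a binary alphabet, where each symbolic derivative is a point $(1-p, p)$ on a line segment and the convex hull has just two vertices: the sub-sequences with the smallest and largest $p$. The sketch below is a simplification under that assumption, not the general convex-hull step of GenESeSS; the function names are hypothetical, and derivatives are estimated empirically from the observed string.

```python
from itertools import product


def next_symbol_prob(x, y, symbol="1"):
    """Empirical probability that `symbol` follows sub-sequence y in x,
    or None if y never occurs in x with a following symbol."""
    n = len(y)
    hits = total = 0
    for i in range(len(x) - n):
        if x[i:i + n] == y:
            total += 1
            hits += x[i + n] == symbol
    return hits / total if total else None


def hull_vertex_candidates(x, max_len):
    """Binary alphabet only: return the two sub-sequences (length <= max_len)
    whose empirical derivatives are the extreme points of the hull."""
    scored = []
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            y = "".join(symbols)
            p = next_symbol_prob(x, y)
            if p is not None:
                scored.append((p, y))
    # Ties are broken lexicographically on the sub-sequence.
    return min(scored)[1], max(scored)[1]
```

For the periodic string `"001001001001"` with `max_len=2`, the candidates are `"01"` and `"00"`, each of which is always followed by the same symbol and so pins down the position in the cycle. For larger alphabets a proper convex-hull routine would be needed in place of the min/max step.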
The corresponding inference algorithm for XPFSA is called xGenESeSS, which takes as input two sequences $x_{in}$, $x_{out}$, and a hyperparameter $\varepsilon$, and outputs an XPFSA in a manner very similar to the inference algorithm of PFSA.
While a PFSA models how the past of a time series influences its own future, an XPFSA models how the past of an input time series influences the future of an output time series. Hence, while in the SSC algorithm of PFSA, we identify sequences if they lead to futures that are statistically indistinguishable, in the SSC algorithm of XPFSA, we identify sequences if they lead to the same future distribution of the output.
Definition 2 (Crossed Probabilistic Finite-State Automaton (XPFSA)). A crossed probabilistic finite-state automaton is specified by a quintuple $(\Sigma_{in}, R, \eta, \Sigma_{out}, \chi)$, where $\Sigma_{in}$ is a finite input alphabet, $R$ is a finite state set, $\eta$ is a partial function from $R \times \Sigma_{in}$ to $R$ called transition map, $\Sigma_{out}$ is a finite output alphabet, and $\chi$ is a function from $R$ to $P_{\Sigma_{out}}$ called output probability map, where $P_{\Sigma_{out}}$ is the space of probability distributions over $\Sigma_{out}$. In particular, $\chi(r, \sigma)$ is the probability of generating $\sigma \in \Sigma_{out}$ from a state $r \in R$.
Note that an XPFSA has no transition probabilities defined between states as a PFSA does. The XPFSA in the example 6 has a binary input alphabet and an output alphabet of size 3. The bar charts next to the R states of the XPFSA indicate the output probability distributions. To generate a sample path, an XPFSA requires an input sequence over its input alphabet.
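A minimal data-structure sketch of the definition above may help make the sample-path mechanism concrete. It is not the authors' implementation: the class name, the explicit `start` state (the definition does not fix one), and the convention that an output symbol is emitted from the state reached after consuming each input symbol are all assumptions of this illustration.

```python
import random


class XPFSA:
    """Toy crossed probabilistic finite-state automaton.

    transitions:  dict mapping (state, input_symbol) -> next state
                  (a partial map: missing pairs raise KeyError)
    output_probs: dict mapping state -> {output_symbol: probability}
    start:        initial state (an assumption of this sketch)
    """

    def __init__(self, transitions, output_probs, start):
        self.transitions = transitions
        self.output_probs = output_probs
        self.start = start

    def run(self, inputs, rng=None):
        """Drive the automaton with an input sequence and sample one
        output symbol per input symbol from the state-wise distribution."""
        rng = rng or random.Random()
        state = self.start
        outputs = []
        for a in inputs:
            state = self.transitions[(state, a)]
            dist = self.output_probs[state]
            symbols = list(dist)
            weights = [dist[s] for s in symbols]
            outputs.append(rng.choices(symbols, weights=weights)[0])
        return outputs
```

Note that the state trajectory is fully determined by the input sequence; randomness enters only through the output distribution attached to each state, which matches the remark that no transition probabilities are defined between states.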
Similar to the PFSA construction approach, here we compute the cross symbolic derivative, which is the ordered tuple $Pr(\sigma | x)$, with $\sigma \in \Sigma_{out}$ and a sequence $x$ over $\Sigma_{in}$. We compute the empirical approximation of the cross symbolic derivative from sequences $x_{in}$ and $x_{out}$ as:

$$\hat{\phi}^{x_{in}, x_{out}}_y(\sigma) = \frac{\text{number of } \sigma \text{ in } x_{out} \text{ after } y \text{ transpires in } x_{in}}{\text{number of sub-sequence } y \text{ in } x_{in}} \tag{5}$$

Thus, xGenESeSS is almost identical to GenESeSS except that, in Step 1, xGenESeSS finds an almost synchronizing sequence based on cross symbolic derivatives, and in Step 2, identifies the transition structure based on the similarity between cross symbolic derivatives. Arguments for establishing the effectiveness of GenESeSS carry over to xGenESeSS with empirical symbolic derivative replaced by empirical cross symbolic derivative.
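The counting in Eq. (5) can be sketched as follows. Two points are assumptions of this illustration rather than statements from the text: the two sequences are taken to be aligned in time and of equal length, and "after $y$ transpires" is read as the output symbol at the time step immediately following the end of $y$ in the input. The function name is hypothetical.

```python
from collections import Counter


def empirical_cross_symbolic_derivative(x_in, x_out, y, out_alphabet):
    """Estimate the distribution of the next output symbol given that
    sub-sequence y has just occurred in the input.

    Returns {output_symbol: probability}, or None if y never occurs in
    x_in with a following time step available in x_out.
    """
    n = len(y)
    horizon = min(len(x_in), len(x_out))
    counts = Counter()
    for i in range(horizon - n):
        if x_in[i:i + n] == y:
            # Output symbol at the step right after y ends (assumed alignment).
            counts[x_out[i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return None
    return {s: counts[s] / total for s in out_alphabet}
```

A delay-specific variant would read the output at `i + n + delay` instead of `i + n`; that extension is left out here to keep the sketch close to Eq. (5).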