A conditional machine learning classification approach for spatio-temporal risk assessment of crime data

Crime data analysis is an essential source of information to aid social and political decision makers regarding the allocation of public security resources. Computer-aided dispatch systems and technological advances in geographic information systems have made analysing and visualising historical spatial and temporal records of crimes a vital part of police operations and strategy. We treat our motivating crime problem as a spatio-temporal point pattern. Using a conditional approach based on properties of Poisson point processes, we transform the spatio-temporal point process prediction problem into a classification problem. We create handcrafted spatio-temporal features to link future and past events and use machine learning algorithms to learn behavioural patterns from the data. The fitted model is then used to carry out the reverse transformation, i.e. to perform spatio-temporal risk predictions based on the outcomes of the classification problem. Our procedure has theoretical formalism from point process theory and inherits flexibility and computational efficiency from the machine learning field. We show its performance under several simulated scenarios and in a real application to spatio-temporal prediction and risk assessment of homicides in Bogota, Colombia.


Introduction
The occurrence of crimes in urban conurbations is historically one of society's biggest problems. The larger the city, the higher its crime level tends to be in terms of frequency of occurrence. Curry et al. (2008) show that crime can have a destructive effect on the health of citizens; for example, it can increase depression due to constant contact with assaults, blows or shots. Therefore, crime reduction is usually a fundamental axis of the decisions of local authorities such as politicians and police.
Statistical analysis of crime and non-civic behaviour can provide public servants with valuable tools that help allocate public safety resources (Hart 2020; Leng and Li 2018; Mukhopadhyay et al. 2016). Science-based recommendations are even more valuable in developing countries, where resources are limited and crime is rampant. Therefore, any assistance related to criminal information processing is welcomed by governments.
Advances in computing and real-time data collection have allowed the recording of spatio-temporal data in large quantities. This recording is often accompanied by extra geographic information; for example, calls reporting crimes or reports from agents circulating on the streets often include the spatial coordinates of the infraction (Ratcliffe 2014; Rummens et al. 2017; Hu et al. 2018). This allows historical records to be spatial as well. This practice is increasingly common and is carried out with different parameters, such as spatial and temporal resolutions and the type of crime or offence (Mohler et al. 2015). According to these initial data collection parameters, several methodological proposals arise (Catlett et al. 2019).
In this particular context of modelling crime or offence reports, it is common practice to aggregate the data in some way. This aggregation, which can be in space, time or both dimensions simultaneously, is not necessarily related to statistical or numerical methodologies. It is generally done to comply with data protection laws or simply for security reasons. A common type of aggregation arises when the spatio-temporal region is divided into regular regions, i.e. when the data domain is gridded and the number of events in each subregion is counted. For example, Clark and Dixon (2018) and Clark and Dixon (2020) adopt a hierarchical Bayesian modelling approach in which the spatio-temporal correlation structure arises from a continuous underlying process.
There are also cases where exact locations in space-time are recorded, as in this work. Here, point process methodologies are adequate tools to describe the process of discrete events occurring in a continuous spatial, temporal or spatio-temporal domain. The literature in this context is also relatively abundant. Some authors propose Hawkes-type point process modelling of crime in a series of papers (Mohler et al., 2011, 2015; Mohler, 2014; Rosser and Cheng, 2019). Adopting the Hawkes process formulation, their models incorporate time-varying hotspots, assuming that each crime induces a higher local crime risk; this risk decays over space and time. There is abundant literature on Hawkes processes and self-exciting point processes for crime data analysis; we refer to the recent review article by Reinhart (2018), or to Zhuang and Mateu (2019) and references therein. On the other hand, Shirota and Gelfand (2017) used a log-Gaussian Cox process with circular time to model the daily and weekly periodicities of crimes. Rodrigues and Diggle (2012) used a spatio-temporal log-Gaussian Cox process to model crime data, focusing on surveillance analysis to detect crime clusters.
In a parallel research line, perhaps with more focus on computation than statistics, several machine learning techniques have been used to predict crime, mainly in space and less often in space-time (see, e.g., McClendon and Meghanathan, 2015, or Yuki et al., 2019). Neural network approaches have alternatively been used in Wu et al. (2020) and Ramirez-Alcocer et al. (2019).
In this article, we consider crime reports as spatiotemporal point patterns and perform the corresponding statistical analyses from this point of view. However, current modelling approaches based on self-exciting and log-Gaussian Cox processes in spatio-temporal contexts bring various estimation and computational problems. Non-compliance with assumptions and large data volumes make parametric inference difficult, and prediction in space and time is also challenging.
Our proposal addresses these methods' major limitations, focusing jointly on spatial and spatio-temporal point processes and machine learning techniques to predict the intensity and risk of homicides in a locality of Bogota, Colombia. We use a conditional approach based on properties of Poisson point processes to address a spatio-temporal point process prediction problem as a classification problem. With this transformation, we employ machine learning algorithms to learn from data from the past. The fitted model is then used to carry out a reverse transformation, i.e. to perform spatio-temporal predictions based on the outcomes of the classification problem. We apply the proposed methodology to simulated and real data (homicides in Kennedy, a district of Bogota). The results achieved over the considered scenarios were satisfactory. We indeed used an adaptation of a multivariate goodness-of-fit test (Sect. 3.3) to confirm the quality of our predictions.
Note that the self-exciting mechanism provides mathematical formality through the triggering functions (in space and time) that develop historical dependencies. In this case, the researcher needs to select and fit a particular triggering function that usually depends on parameters, and the problem of separability is crucial. Note, in addition, that the dependence on history enters through the triggering functions and not through the background component. In our case, we alternatively build covariate functions that play the role of the triggering functions without the need to use parametrised functions. As such, these constructions build both components, the background and the dependence on past events, making the mathematical building blocks much simpler.
This paper is organised as follows. Section 2 presents and describes the motivating data problem of homicides in Bogota. Section 3 develops the methodological approach to predict the intensity function of a spatio-temporal point pattern. Section 4 analyses the results of applying the methodology to simulated scenarios and real data. The paper ends with Sect. 5, outlining some conclusions and ideas for future work.

Data
Colombia is administratively divided into 32 departments and one capital district, Bogota, which is also the capital of the department of Cundinamarca; the department and the city are shown in Fig. 1, where the coordinate system is longitude-latitude.
Bogota has roughly 7,200,000 inhabitants, according to the last official census conducted by the Colombian government (www.dane.gov.co). The city is divided into 19 urban localities or mayorships, which represent a political segmentation of the territory with their own faculties for administrating and handling resources. Each locality has a police station responsible for its security control. The Kennedy locality has a population of approximately 1,230,000 inhabitants and is among the areas most affected by poverty, inequality and social exclusion. It is located in the west of the city (see Fig. 1).
Data provided by the Colombian National Police are used in this study to statistically analyse some features of homicides reported in Kennedy over a period of 6 years, from 2012 to 2017. The information was recorded in an information system belonging to the Colombian National Police known as the Sistema de Información Estadístico, Delincuencial, Contravencional y Operativo (SIEDCO, https://www.policia.gov.co/grupo-informacion-criminalidad/estadisticadelictiva).
Homicide is an extraordinarily severe crime and one of the least common in Kennedy (approximately one in 50 crimes is a homicide). This phenomenon deserves the authorities' attention due to its terrible negative impact on society and the difficulties of its investigation by the national police. Therefore, we have decided to focus on the statistical analysis of homicides in Kennedy, since our findings could eventually help law enforcement officers in their surveillance systems. In addition, the potentially low number of cases is a motivation for testing our computational and statistical modelling approach. The dataset contains the spatial locations (in planar coordinates) of 901 homicides as well as the corresponding dates in days. We note that 901 homicides in 6 years make a rough average of 150 per year, so we have many daily zero counts and an average of just 3 per week. This shortage of cases is typical of this type of crime and complicates daily or weekly analyses due to an excess of zeros. Furthermore, the police in Bogota organise teams in charge of homicide crimes over three- to four-week periods, developing the corresponding operations, and they need predictions one period ahead. By considering bins of 21 days, we mitigate the excess of zeros while helping to organise police actions from one period to the next.
We also have information on locations and times of other types of crimes, including car and motorcycle thefts, shoplifting and robberies during the same period. Figure 3 shows a random sample of 9000 crimes of these other types recorded in Kennedy and the counts of all crimes across time grouped in bins of 21 days.
As complementary information attached to each crime event, we have Euclidean distances to the nearest transport service, the closest construction zone, the closest water canal, the nearest pharmacy, the nearest school, the nearest medical centre, the nearest park, the nearest market, the nearest library and the nearest criminal injury (defined as the closest report of people physically injured as victims of violence). The information attached to each point can be interpreted as associated marks. It is helpful for analysing the likely dependence between crime occurrence and the distribution of additional risk factors, which are related, in this case, to the environmental conditions of the locality. We use a Nadaraya-Watson smoother (Nadaraya 1989; Baddeley et al. 2015) to apply spatial smoothing over the values of the associated marks and obtain continuous covariate maps (note that the distances are associated with crime locations and are indeed discrete in space). We employ least-squares cross-validation (Silverman 1986) to get a suitable bandwidth for smoothing each covariate, obtaining values around 12.53 m in all cases.
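The Nadaraya-Watson smoother is a kernel-weighted average of the mark values observed at the crime locations. A minimal sketch in Python (the Gaussian kernel, toy data and bandwidth are our own illustration, not the values used in the application):

```python
import numpy as np

def nadaraya_watson(query, points, marks, bandwidth):
    """Kernel-weighted average of mark values at each query location.

    query: (m, 2) locations where the smoothed covariate is evaluated,
    points: (n, 2) crime locations carrying the marks,
    marks: (n,) mark values, bandwidth: standard deviation of the kernel.
    """
    # Squared distances between every query point and every data point
    d2 = ((query[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-0.5 * d2 / bandwidth**2)          # Gaussian kernel weights
    return (w * marks[None, :]).sum(axis=1) / w.sum(axis=1)

# Toy usage: marks roughly follow the x-coordinate of the points,
# so the smoothed surface should increase from left to right
rng = np.random.default_rng(1)
pts = rng.uniform(0, 1, size=(50, 2))
mks = pts[:, 0] + rng.normal(0, 0.05, 50)
grid = np.array([[0.1, 0.5], [0.9, 0.5]])
smooth = nadaraya_watson(grid, pts, mks, bandwidth=0.2)
```

In practice the bandwidth would be chosen by least-squares cross-validation, as described above, rather than fixed by hand.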
The corresponding spatial covariates are displayed in Fig. 4. To link future events with historical data, we build several spatio-temporal covariates (or features) related to the point patterns. These covariates are built from any location in the region (later defined as tentative control events). They also refer to distances from homicides to the two previous temporal instants, as we intend to predict the intensity function. We conjecture that the relation between past and future homicides is stronger than the dependence between homicides and other types of crime. However, for completeness, distances from homicides to different types of crimes are also used to create covariates. For every location in the region (and, in particular, for every event considered as case or control, see Sect. 3.1), we compute the following covariates related to the point pattern of homicides:

• {d^1_{t−1}, d^2_{t−1}, d^3_{t−1}}: minimum, median and accumulated distances from every location in the region at a given temporal instant t to homicides from the previous temporal instant t − 1.
• {d^1_{t−2}, d^2_{t−2}, d^3_{t−2}}: minimum, median and accumulated distances from every location in the region at a given temporal instant t to homicides from the temporal instant t − 2.
• {d^4_{t−1}, d^5_{t−1}}: average and accumulated distances from every location in the region at a given temporal instant t to its neighbours at the temporal instant t − 1, in two neighbourhoods of radii 1.5 and 0.5 km, respectively.
• {d^4_{t−2}, d^5_{t−2}}: average and accumulated distances from every location in the region at a given temporal instant t to its neighbours at the temporal instant t − 2, in two neighbourhoods of radii 1.5 and 0.5 km, respectively.

Note that here, by accumulated distance, we mean the sum of distances. Additionally, we consider the point pattern given by the superposition of every other type of crime reported in Kennedy (see Fig. 3) and proceed similarly.
Still, instead of separately computing features related to previous times, the statistics are now based on distances related to crimes from the last two temporal instants, considering t − 1 and t − 2 together. With this in mind, we have the following additional covariates:

• {d^6_{t−(1,2)}, d^7_{t−(1,2)}}: minimum and average distances from every location in the region at a given temporal instant t to the other previous types of crimes.
• {d^8_{t−(1,2)}}: accumulated distance from every location in the region at a given time t to neighbours from the other previous types of crimes in a neighbourhood of radius 1.0 km.

Finally, in addition to the spatio-temporal covariates described above and the spatial covariates displayed in Fig. 4, we also consider the coordinates of the spatial locations as covariates to capture any spatial variation not explained by the spatial features. We allocate the covariates into two vectors to simplify the notation in the remainder of the paper. We introduce the vector d_t(x) = (d^1_{t−1}(x), d^2_{t−1}(x), ..., d^8_{t−(1,2)}(x)) to denote the spatio-temporal covariates, and the vector d_0(x) to represent the purely spatial covariates of Fig. 4 plus the spatial location.
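The distance-based features above can be computed directly from pairwise distance matrices. A minimal Python sketch of the d^1-d^5 constructions (function names and toy data are our own; the radii of the real application are 1.5 and 0.5 km):

```python
import numpy as np

def history_features(locations, past_events):
    """Minimum, median and accumulated (summed) distances from each
    candidate location to the events of one earlier temporal bin
    (the d^1, d^2, d^3 features)."""
    d = np.linalg.norm(locations[:, None, :] - past_events[None, :, :], axis=2)
    return np.column_stack([d.min(axis=1), np.median(d, axis=1), d.sum(axis=1)])

def neighbourhood_features(locations, past_events, radius):
    """Average and accumulated distances to past events lying within a
    neighbourhood of the given radius (the d^4, d^5 features)."""
    d = np.linalg.norm(locations[:, None, :] - past_events[None, :, :], axis=2)
    near = d <= radius
    acc = np.where(near, d, 0.0).sum(axis=1)
    cnt = near.sum(axis=1)
    # Average over neighbours only; zero when the neighbourhood is empty
    avg = np.divide(acc, cnt, out=np.zeros_like(acc), where=cnt > 0)
    return np.column_stack([avg, acc])

# Toy check: one location, two past events at distances 1 and 3
loc = np.array([[0.0, 0.0]])
past = np.array([[1.0, 0.0], [3.0, 0.0]])
hf = history_features(loc, past)             # min=1, median=2, accumulated=4
nf = neighbourhood_features(loc, past, 2.0)  # only the event at distance 1 is near
```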

Spatio-temporal prediction for point processes
This section describes the method used to predict Kennedy's spatio-temporal distribution of homicides. Section 3.1 discusses the conditional approach used to address the spatio-temporal point process problem as a binary classification problem. Section 3.2 presents the machine learning algorithms followed to solve the classification problem and describes the experimental framework used to evaluate the performance of the algorithms. Finally, Sect. 3.3 offers the goodness-of-fit procedure to assess the quality of the predictions.

Point-source modelling
Back in the nineties, several contributions in the context of point-source modelling appeared in the statistical literature. Diggle (1990) develops a point process modelling approach to the raised incidence of a rare phenomenon in the vicinity of a prespecified point. Diggle and Rowlingson (1994) use a conditional approach to estimate the effect of an incinerator on the spatial distribution of cases of larynx cancer in an area of south Lancashire. Diggle et al. (2000) propose an extension of the parametric modelling framework developed by Diggle (1990) and Diggle and Rowlingson (1994) to matched case-control studies and to investigate raised risk around putative sources of environmental pollution. A semi-parametric method for point process modelling with point-source interventions is developed in Rodrigues et al. (2010). See also Diggle (2008) and Diggle (2013) for general references. We are mainly motivated by the approach in Diggle and Rowlingson (1994). The authors consider two spatial Poisson point processes, the control and case processes, with intensities given by λ_0(x) and λ_1(x) = ρ λ_0(x) f(||x − x_0||), respectively. Here, x ∈ W ⊂ R^2, the location x_0 is the source, and f is a parametric function that models the effect of the source on the case process. Diggle and Rowlingson (1994) transformed the point process problem into a binary regression problem by noting that, conditional on the observed locations, an event at location x has a probability of being a case, i.e. an event from the point process with intensity λ_1(x), given by

p(x) = λ_1(x) / (λ_0(x) + λ_1(x)) = ρ f(||x − x_0||) / (1 + ρ f(||x − x_0||)).

Diggle and Rowlingson (1994) consider data on different types of cancer to create a set of binary variables and use a maximum likelihood approach to estimate the parameters of f(·). Rodrigues et al. (2010) use a similar approach to evaluate the effect of CCTV cameras on the spatial distribution of crimes in Belo Horizonte, Brazil.
Our proposal aims to extend the previous conditional approach to the spatio-temporal setting, so that we can predict the intensity function of a point pattern based on observed past events and related covariates. We thus propose incorporating information related to past events into the function f(·), so that it depends on both fixed and moving sources. With an abuse of notation, the function f(·) now depends on the covariate vectors rather than only on distances. We thus consider two spatio-temporal Poisson processes, controls and cases, with intensities q^0_t λ^0_t(x) and q^1_t λ^1_t(x), where λ^1_t(x) ∝ λ^0_t(x) f(d_t(x), d_0(x)), and q^0_t and q^1_t denote the expected numbers of controls and cases at time t. Conditional on the numbers of events of each type at time t, n^0_t and n^1_t, an event at location x is a case with probability

p_t(x; d_t(x), d_0(x)) = n^1_t λ^1_t(x) / (n^0_t λ^0_t(x) + n^1_t λ^1_t(x)).

This problem can be addressed as a classification problem. Machine learning algorithms are flexible and computationally feasible methods that use the dataset D of labelled events and covariates to train a model M. Once the model is trained, the prediction of the probabilities is not limited to locations where events (cases or controls) have occurred: using the trained model M, the conditional probability p_t(x; d_t(x), d_0(x)) can be predicted at any given location x ∈ W where the covariates d_t(x) and d_0(x) are defined. Although these probabilities are not of scientific interest in themselves, they may be used to estimate f(d_t(x), d_0(x)) by applying the reverse transformation, which, in turn, may be used to estimate the spatio-temporal intensity of cases λ^1_t at every location x. As in Rodrigues et al. (2010), our focus lies on estimating the spatio-temporal distribution of cases λ^1_t; therefore, estimation of the expected numbers of events at time t (q^0_t and q^1_t) is not of interest. Indeed, the trick of conditioning on n^0_t and n^1_t allows estimating λ^1_t(x) without specifying q^0_t or q^1_t.
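When the numbers of cases and controls are equal and the control intensity is flat (the choices adopted later in the paper), the conditional probability reduces, up to a constant absorbed into f, to p = f/(1 + f), so the classifier output can be inverted as f = p/(1 − p). A minimal sketch of this reverse transformation over a prediction grid, with our own clipping safeguard and normalisation:

```python
import numpy as np

def reverse_transform(p, eps=1e-6):
    """Recover f = p/(1-p) from predicted case probabilities over a grid,
    then normalise so the values sum to one, giving the predicted case
    intensity up to a multiplicative constant."""
    p = np.clip(p, eps, 1 - eps)        # guard against probabilities of 0 or 1
    f = p / (1.0 - p)
    return f / f.sum()

# Toy usage: three grid locations with predicted case probabilities
probs = np.array([0.5, 0.8, 0.9])       # f = 1, 4, 9
intensity = reverse_transform(probs)    # normalised to sum to one
```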

The control process
Whereas the choice of the case process is straightforward, the definition of the controls is a key point of our methodology. In this section, we propose and discuss three possibilities to define the control process:

1. Consider the intensity of the control process constant over the study area (λ^0_t(x) = λ^0);
2. Consider another type of event as the control process. For example, if homicides are the cases, another type of crime can be seen as control;
3. Consider the superposition (over time) of the case point process as the control process.
In the first approach, the spatio-temporal variation of the case process is entirely given by the function f ðÁÞ, which means that the covariates explain the spatio-temporal variation of the case process. We can simulate a homogeneous Poisson process to generate a sample of controls.
With the locations of cases and controls at hand, dataset D may be created to train model M.
In the second approach, f(·) models the change in the cases' spatio-temporal distribution relative to the control distribution. Under this approach, a sample of controls may be the locations where another type of crime occurred. Therefore, the predictions and interpretation of the results are relative to those of the other types of crime.
In the third scenario, f(·) models the spatio-temporal interaction of cases over the purely spatial distribution of cases. Under this approach, a sample of controls may be the locations where past homicides occurred. To avoid overfitting, the sample of controls must be independent of the sample of cases. For example, suppose the observed homicides go from time t_1 to time t_2. Then, the sample of controls may be the locations of homicides that occurred within the time window [t_1, t_1 + v], whereas the sample of cases comprises homicides from time (t_1 + v + 1) to time t_2. This third approach may be useful in the context of surveillance, where the main interest is to model the spatio-temporal interaction of events.
We carry out the methodological procedure considering the intensity function of the control process to be constant over the study area. The main reason for this choice is the aim of our application, which is predicting the future intensity function of the case process based on the observed past events. Under this choice, the spatio-temporal intensity of cases does not depend on an unknown control intensity, and it simplifies to λ^1_t(x) ∝ f(d_t(x), d_0(x)). Note that any other known intensity could have been used to model the control process; we have chosen the simplest one, as it already gave good performance (see Sect. 4.2). In addition, if we take the number of controls equal to the number of cases, i.e. n^0_t = n^1_t, the classification problem addressed is balanced. Therefore, one can simulate controls to compose D, evaluate the classifiers and predict f(·) at every location x. The experimental methodology proposed to compare several classifiers is described in Sect. 3.2.1.
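Simulating the balanced controls is straightforward: conditional on the number of points, a homogeneous Poisson process is a uniform sample over the window. A short sketch (the rectangular window and event count are illustrative, not the study region):

```python
import numpy as np

def simulate_controls(n_cases, window, rng):
    """Draw as many uniform control locations as there are cases, so the
    resulting classification problem is balanced (n_t^0 = n_t^1).

    window: ((x_min, x_max), (y_min, y_max)) bounding the study area."""
    (x0, x1), (y0, y1) = window
    xs = rng.uniform(x0, x1, n_cases)
    ys = rng.uniform(y0, y1, n_cases)
    return np.column_stack([xs, ys])

# Toy usage: 120 cases observed at some time t, 120 simulated controls
rng = np.random.default_rng(0)
controls = simulate_controls(120, ((0.0, 1.0), (0.0, 1.0)), rng)
labels = np.concatenate([np.ones(120), np.zeros(120)])  # 1 = case, 0 = control
```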

Machine learning classifiers
We test and compare four classifiers, commonly used in various scientific areas, to address the problem of predicting the intensity function: Naïve Bayes (NB), K-Nearest Neighbour (KNN), Support Vector Machine (SVM), and Random Forest (RF). We conduct experiments to tune the hyperparameters involved in each classifier to choose the one that best suits our classification problem.
The naïve Bayes classifier is a simple probabilistic classifier based on the premise that the features are conditionally independent given the class variable. The method uses Bayes' theorem and the independence assumption to factor the joint distribution into conditional probabilities. No hyperparameter must be tuned in its canonical form, the one used here. Naïve Bayes classifiers have been widely adopted in many applications; more details can be found in McCallum et al. (1998), Rish et al. (2001) and Zhang (2004). K-Nearest Neighbour (KNN) is a non-parametric classification/regression method based on distances between observations in the covariate space. Despite its simplicity, this method can estimate the real separability of the dataset reasonably well. The Cover-Hart inequality (Cover and Hart 1967) states that the error rate of the KNN rule is at most twice the Bayes error rate when the sample size is sufficiently large. The distance metric considered in most applications is the Euclidean distance, and the hyperparameter to be tuned is the number K of neighbours.
The support vector machine (SVM) is presently one of the most powerful machine learning methods for solving binary classification tasks (Steinwart and Christmann 2008). Over the last decade, it has been extensively used in several distinct scientific areas. The support vector machine performs classification by finding the hyperplane in the covariate space that maximises the margin between the classes. The hyperparameters of the SVM are a constraint parameter C and the kernel parameters used to learn a nonlinear decision boundary. In this paper, we assume a Gaussian kernel; therefore, the hyperparameters investigated are C and the standard deviation of the kernel.
A Random Forest (RF) combines bootstrap techniques with an ensemble of weak classifiers to build a powerful method for solving classification problems (Breiman 2001). A random forest combines decision trees constructed from subsets of the input covariates and randomly chosen samples. Two hyperparameters are tuned in the experiments: the number of covariates used in each tree and the total number of trees in the ensemble.

Performance evaluation
A typical experimental framework is developed and used to evaluate the performance of each classifier and to provide a fair comparison. The following steps outline such a framework:

1. Simulate, for each time t, n^1_t events from a Poisson process with flat intensity to define the controls;
2. Create a dataset with the binary labels and associated covariates;
3. Split the dataset into training (F_tr), validation (F_va) and test (F_te) sets, corresponding to the first 75% of temporal instants, the following 10%, and the remaining 15%, respectively;
4. For each model, train on F_tr (training phase) and find the best combination of hyperparameters according to the performance on F_va (tuning phase);
5. Retrain the model on F_tr ∪ F_va using the best hyperparameters found;
6. Evaluate the performance of the method on F_te (test phase);
7. Repeat steps 1 to 6 several times (a hundred times is often used) to output a performance matrix with elements corresponding to the accuracy of each method across the replications.
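Steps 3 to 6 can be sketched with scikit-learn. The data below are a hypothetical stand-in ordered by time (in the real framework the rows would be cases and controls with the covariates of Sect. 2), and only one hyperparameter is searched for brevity:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Stand-in data ordered by time: label depends on the first covariate
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)

# Temporal (not random) split: first 75% train, next 10% validation, last 15% test
n = len(X)
i_tr, i_va = int(0.75 * n), int(0.85 * n)

# Tuning phase: small grid over the number of trees
best_acc, best_n = -1.0, None
for n_trees in (50, 100):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X[:i_tr], y[:i_tr])
    acc = accuracy_score(y[i_tr:i_va], model.predict(X[i_tr:i_va]))
    if acc > best_acc:
        best_acc, best_n = acc, n_trees

# Retrain on train + validation with the winning setting, evaluate on test
final = RandomForestClassifier(n_estimators=best_n, random_state=0)
final.fit(X[:i_va], y[:i_va])
test_acc = accuracy_score(y[i_va:], final.predict(X[i_va:]))
```

Repeating this loop over fresh control simulations (step 7) yields the column of the performance matrix for the Random Forest.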
The experimental framework performs a paired comparison, since the training, validation and test sets are the same for each method. Also, the optimised hyperparameters might vary across replications; for instance, the best number of neighbours in the KNN classifier may differ across replications, since new controls are simulated in each repetition. For classifiers with more than one hyperparameter, we adopt the simple search strategy of exploring the Cartesian product of their candidate values in the tuning phase (a grid search). Table 1 gives an overview of the hyperparameters of each classifier and the ranges used in the grid search.
Since the classification problem addressed in this work is balanced, i.e. the number of events from the control process equals the number of cases, the performance criterion used in this work is accuracy, the most popular choice for balanced classification problems. We define accuracy as the rate of correct classification, i.e. the proportion of points correctly classified as cases or controls. In our context, an overall accuracy of 0.5 means that the model cannot differentiate a case from a control; therefore, the covariates are not useful in predicting crimes. On the other hand, an accuracy near 1 means that the trained model is a powerful tool for predicting crimes.

Goodness-of-fit test
In order to assess whether the observations come from a given multivariate distribution (the intensity function in our case), we use a goodness-of-fit test proposed by McAssey (2013). Let X be a spatial point pattern with a bivariate continuous distribution λ(x) (normalised by its integral to make it a probability density function) with mean μ and variance Σ. Let D denote the Mahalanobis distance of the points of X to the mean μ, and consider r > 0. This transformation maps all the points in an ellipse onto a circle of radius r centred at the origin. Define G(r) = P{D ≤ r}. Let N be the set of distributions U with the property that if Y ∼ U, then there exists some linear transformation T such that T^{−1}(Y) is distributed according to λ(x). If G is the distribution of the Mahalanobis distances, it can be proved that a goodness-of-fit test for G is equivalent to a goodness-of-fit test for N. Consider a sample {x_i}_{i≤n} from λ(x), and compute the associated distances D = {D_i}_{i≤n}. It can be shown that the empirical distribution function G_n(r) converges to G(r), i.e. lim_{n→∞} G_n(r) = G(r). This equality holds even when μ and Σ have to be estimated from a sample. To apply the test in our context, we employ the following algorithm.
Given a point pattern with estimated intensity λ̂(x), we compute μ̂ and Σ̂. We then simulate a point pattern with N points from that intensity and compute the corresponding distances D̂_i. We choose a partition {r_s} of the interval (0, 2 max{D̂_i}) to calculate Ĝ_N(r_s). Next, we choose a partition {p_j} of the interval [0, 1] of size R to estimate the corresponding p-quantiles of G(r), setting q_0 = 0, q_R = ∞ and q_j = min{r_s : Ĝ_N(r_s) ≥ p_j} for 0 < j < R. We set E_j = N(p_j − p_{j−1}) and count the number of observed distances D_i falling into the interval (q_{j−1}, q_j], denoting this count by O_j for 0 < j < R, with O_R counting the distances in the last interval (q_{R−1}, ∞). The test statistic is then

A_R = Σ_{j=1}^{R} (O_j − E_j)^2 / E_j.

Finally, we find the p-value using a simulated sample {A^k_R}_{k=1}^{M} and reject the null hypothesis if the p-value is lower than some given level of significance. Here, rejection suggests that the observed point pattern is not a realisation of a Poisson process with intensity proportional to the density function λ̂(x)/∫λ̂(x)dx, i.e. the predicted intensity function is not compatible with the observed point pattern.
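The core of the algorithm can be sketched as follows; the binning details (anchoring the first edge at zero and extending the last bin to cover all observed distances) are our own implementation choices:

```python
import numpy as np

def mahalanobis_distances(pts):
    """Mahalanobis distances of a 2-d sample to its own mean, with mu and
    Sigma estimated from the sample itself."""
    mu = pts.mean(axis=0)
    inv = np.linalg.inv(np.cov(pts, rowvar=False))
    diff = pts - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, inv, diff))

def gof_statistic(obs_d, sim_d, R=10):
    """Pearson-type statistic A_R: observed distance counts O_j compared
    with equal expected counts in quantile bins (the q_j of the algorithm)
    estimated from the simulated distances."""
    q = np.quantile(sim_d, np.linspace(0.0, 1.0, R + 1))
    q[0] = 0.0                                  # distances are non-negative
    q[-1] = max(q[-1], obs_d.max()) + 1e-9      # last bin open to the right
    O, _ = np.histogram(obs_d, bins=q)
    E = len(obs_d) / R                          # equal expected count per bin
    return ((O - E) ** 2 / E).sum()

# Toy usage: both samples from the same bivariate normal, so the
# statistic should be small relative to its Monte Carlo reference
rng = np.random.default_rng(3)
obs = rng.normal(size=(200, 2))
sim = rng.normal(size=(5000, 2))
a_r = gof_statistic(mahalanobis_distances(obs), mahalanobis_distances(sim))
```

In the full test, `gof_statistic` would also be applied to M replicate patterns simulated from λ̂(x) to build the Monte Carlo reference sample {A^k_R} from which the p-value is read off.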

Simulations and data analysis
This section describes the performance achieved by the proposed methodology when applied to simulated data and to real data on homicides in Bogota. We carry out four simulation experiments considering different degrees of spatio-temporal dependence. The application to crime data considers homicides as cases, while the other types of crimes are used to compute covariates.

Simulation experiments
Four scenarios stress-test the proposed methodology and evaluate its predictive power. In all situations, we employ the Random Forest classifier. The computed covariates are comparable to the ones defined in Sect. 2 but exclude the purely spatial covariates of Fig. 4 and the spatio-temporal covariates related to the other types of crimes.
The choice of the Random Forest tool is due to the results of evaluating the machine learning methods in Sect. 4.2, which showed that RF was the most suitable method for our application. We keep similar covariates to analyse the feature selection ability of our proposal.
Additionally, we consider the unit square as the observation window W and generate a sequence of spatial point patterns with particular spatial and/or temporal dependence structures. In all cases, we simulate 50 point patterns, one for each temporal instant, with the same expected number of events per time as observed in the real crime data motivating this paper. The simulated point patterns play the role of cases, while controls are always generated as homogeneous Poisson point patterns in W. The first 45 patterns were used to train and tune the model, whereas the last 5 were used as a test set.
Following the steps shown in Sect. 3.2.1, we used 1000 replications of the controls (step 7). Therefore, for each time t, we simulated n^0_t = n^1_t controls, so that the number of simulated controls equals the number of cases at each time. From this, we obtain an estimated accuracy (and an estimated intensity). To smooth out any random aspects of the procedure, we replicated this process 1000 times, yielding 1000 estimates.
The outputs of these simulation experiments are the predictions of the simulated intensity functions and the accuracy achieved in each scenario. The accuracies are shown in Fig. 5, and the predicted and true intensities in Fig. 6. The following subsections refer back to these figures when discussing the results.

Scenario 1
The first scenario generates cases within the subwindow W_0 = [0.4, 0.6]^2 according to a homogeneous Poisson process with intensity 100/|W_0|. This straightforward scenario allows us to compare theoretical and observed accuracies. Under this generating mechanism, an event (case or control) that occurred inside W_0 has probability 1 − |W_0| = 0.96 of being a case; we therefore expect the accuracy to be around this value. Figures 5 and 6 present the attained accuracy and the true and predicted normalised intensity functions for the last time instant. As expected, the average accuracy was 0.979. We used the goodness-of-fit test described in Sect. 3.3 to assess whether the observed point pattern (the case pattern) can be seen as a random sample from the predicted intensity function. We obtain a p-value of 0.842, so we do not reject the hypothesis that the cases are an observation of a Poisson process with intensity proportional to the predicted intensity.

Scenario 2
The second scenario considers a sequence of realisations of an inhomogeneous Poisson process whose intensity is the exponential of an observed continuous spatial Gaussian random field S(x) with mean μ and exponential covariance model Cov(S(x_1), S(x_2)) = σ^2 exp{−||x_1 − x_2||/τ}. We set μ = 4, σ^2 = 1 and τ = 0.2. Note that τ is a scale parameter controlling the distance at which the spatial dependence decays.
Since spatio-temporal interactions are absent in this scenario, the most critical covariate to predict the intensity function is the spatial location x.
The parameters in this scenario were chosen pragmatically to obtain simulated patterns that are not too dense, lightening the computational burden. Indeed, the average number of events is exp(4) ≈ 55 per time. By setting the variance to 1, we ensure that the variation of the points is controlled only through the scale parameter; as this parameter tends to zero, the generated point pattern becomes indistinguishable from a homogeneous Poisson process. We therefore chose a scale of 0.2, for which the covariance effect is considerable but not extreme. The overall aim is to choose non-extreme scenarios that illustrate the effectiveness of our methods. Figure 6 compares the true and predicted normalised intensity functions; this visual inspection suggests that our proposed methodology adapts well. The goodness-of-fit test gives a p-value of 0.371, indicating a proper fit. The average accuracy is 0.696; see the boxplot over 1000 replications in Fig. 5.

Scenario 3
The third scenario considers realisations of a Matérn cluster point process (Matérn 1986) within W. We first generate a realisation of a Poisson point process with intensity κ, defining the parent point pattern. Each parent point is then replaced by a cluster with, on average, μ offspring uniformly distributed in a disc of radius R centred on the parent point and intersected with the observation window. Only the offspring appear in the outcome. We set κ = 10, R = 0.06 and μ = 6. As in the previous scenario, these parameters were chosen to obtain representative patterns while keeping the computational cost under control: this gives κ · μ = 60 expected events per time, and the chosen radius produces clusters that are distinguishable from one another.
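The Matérn cluster mechanism just described can be simulated in a few lines. This is a minimal numpy sketch of the standard construction, with our own function name:

```python
import numpy as np

rng = np.random.default_rng(7)

def matern_cluster(kappa, R, mu, rng):
    """Matérn cluster process on the unit square: Poisson(kappa) parents,
    each replaced by a Poisson(mu) number of offspring uniform in a disc
    of radius R; only offspring inside the window are kept."""
    n_parents = rng.poisson(kappa)
    parents = rng.uniform(0, 1, size=(n_parents, 2))
    clusters = []
    for p in parents:
        n_off = rng.poisson(mu)
        r = R * np.sqrt(rng.uniform(size=n_off))        # uniform in the disc
        theta = rng.uniform(0, 2 * np.pi, size=n_off)
        off = p + np.column_stack([r * np.cos(theta), r * np.sin(theta)])
        clusters.append(off)
    pts = np.vstack(clusters) if clusters else np.empty((0, 2))
    inside = np.all((pts >= 0) & (pts <= 1), axis=1)    # intersect with W
    return pts[inside]

pattern = matern_cluster(kappa=10, R=0.06, mu=6, rng=rng)
```

Before edge trimming the expected point count is κ · μ = 60, matching the scenario's design.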
Consider the stochastic mechanism generating cases in this scenario, which randomly allocates clusters of points within W: since the parents form a Poisson point pattern, the intensity at a fixed time is unrelated to all previous point patterns. We thus expect an accuracy of around 0.5. As shown in Fig. 5, the accuracy is indeed approximately 0.5 (average 0.476), meaning that, as expected, the model cannot distinguish between cases and controls given past information.

Scenario 4
The fourth scenario emulates the presence of spatio-temporal explanatory variables. Consider again a realisation of a Matérn cluster point process, denoted by X_0. Let β_0, β_1 and β_2 be coefficients fixed a priori. At times t_i (i = 1, ..., 50), we proceed as follows. First, we estimate the intensity function of the point pattern X_{i−1} nonparametrically using a Gaussian kernel with a bandwidth selected by a rule of thumb depending only on W; λ̂_{i−1}(x) denotes this estimate. Then we compute the reciprocal distance from each point to its nearest neighbour, g_j, j = 1, ..., m, where m is the number of points of X_{i−1}, and assign these values as marks of X_{i−1}. We perform a spatial smoothing of the g values over W (see e.g. Baddeley et al. 2015, p. 195) and denote it by λ̂^g_{i−1}(x). The i-th intensity λ_i is computed through the linear model λ_i(x) = β_0 + β_1 λ̂_{i−1}(x) + β_2 λ̂^g_{i−1}(x). We rescale this function to preserve mass, i.e. so that its integral gives roughly the number of points. In our particular case, we use β_0 = 0.5, β_1 = 0.1 and β_2 = 1. We obtain an average accuracy of 0.648 and, again, a non-significant p-value (0.650) for the goodness-of-fit test, confirming the similarity between the predicted and actual intensities seen in Fig. 6.
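The intensity update of this scenario can be sketched on a grid as follows. The fixed bandwidth `bw` stands in for the rule-of-thumb choice, and all names are our own illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def next_intensity(X_prev, grid, b0=0.5, b1=0.1, b2=1.0, bw=0.1):
    """Sketch of the scenario-4 update: a linear combination of the kernel
    intensity estimate of X_{i-1} and smoothed reciprocal nearest-neighbour
    distances, rescaled to preserve mass."""
    # Gaussian-kernel intensity estimate of X_{i-1} on the grid
    d2 = ((grid[:, None, :] - X_prev[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * bw ** 2)) / (2 * np.pi * bw ** 2)
    lam_hat = K.sum(axis=1)
    # reciprocal nearest-neighbour distances as marks ...
    d2_pp = ((X_prev[:, None, :] - X_prev[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2_pp, np.inf)
    g = 1.0 / np.sqrt(d2_pp.min(axis=1))
    # ... spatially smoothed over the grid (Nadaraya-Watson weights)
    w = np.exp(-d2 / (2 * bw ** 2))
    g_hat = (w @ g) / w.sum(axis=1)
    lam = b0 + b1 * lam_hat + b2 * g_hat
    # rescale so the integral over W matches the observed point count
    cell_area = 1.0 / len(grid)
    return lam * len(X_prev) / (lam.sum() * cell_area)

m = 20
cs = (np.arange(m) + 0.5) / m
gx, gy = np.meshgrid(cs, cs)
grid = np.column_stack([gx.ravel(), gy.ravel()])
X0 = rng.uniform(0, 1, size=(60, 2))
lam1 = next_intensity(X0, grid)
```

Sampling a point pattern from the rescaled λ_i and iterating this update produces the 50-instant sequence used in the experiment.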

Data analysis
In this section, we apply the proposed methodology to data on reported crimes in Kennedy from 2012 to 2017 (aggregated in bins of 21 days), focusing on predicting homicides in space and time. See further comments and some exploratory analysis in Sect. 2. We first compare the four machine learning methods to find the most suitable one for predicting the spatio-temporal distribution of homicides. For this comparison, we consider all the covariates, both the purely spatial and the spatio-temporal ones. We then evaluate the relevance of the spatio-temporal covariates for predicting the distribution of homicides in Kennedy. We apply the machine learning methods within the framework of experiments described in Sect. 3.2 to find the one best suited to our classification task. As described in Sect. 3.2.1, a tuning stage is performed to find the best hyperparameter combination for each method. Figure 7 displays the accuracies of the machine learning algorithms considered in this study. The results suggest that Random Forest (RF) outperforms the alternative methods, making it the best approach for predicting the spatio-temporal distribution of homicides in Kennedy. Figure 8 shows the predicted intensity for the last two temporal instants. To compute it, we place a regular grid over the study region and estimate the intensity function at the centre of each pixel (the resolution can be as high as required).
The point patterns are also included in the figures to visually assess the quality of the predictions. To evaluate them statistically, we apply the goodness-of-fit test and obtain non-significant p-values (0.801 and 0.263). Therefore, each observed point pattern can be seen as a random sample from the corresponding normalised predicted intensity function.
We conduct two further experiments to investigate the contribution of the covariates according to their nature. In the first experiment, RF is trained, tuned and tested using only the spatial covariates, whereas the second experiment includes only the spatio-temporal features described in Sect. 2. Figure 9 shows the resulting accuracies, compared with those of the model using all the covariates (both spatial and spatio-temporal).
The results suggest that the purely spatial covariates contribute most to improving the accuracy, although including the spatio-temporal covariates slightly increases the method's performance. Figure 9 also shows the mean decrease in the Gini index, used to measure the importance of each variable in the RF model (Hastie et al. 2009). The Gini index measures node impurity; hence, the larger the mean decrease, the more important the covariate. These results corroborate the findings of the previous paragraph. Indeed, the most important covariates are the distance to the nearest criminal injury and d^6_{t−(1,2)}, a purely spatial and a spatio-temporal covariate, respectively.
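The mean decrease in Gini impurity is the default importance measure of scikit-learn's Random Forest. The sketch below uses invented feature names and synthetic case/control data merely to show how such importances are obtained; it does not reproduce the paper's results:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy case/control design matrix; the feature names below are invented
# stand-ins for the paper's covariates, not the real data set.
n = 500
X_cases = np.column_stack([
    rng.normal(0.5, 0.1, n),      # x coordinate (cases spatially concentrated)
    rng.normal(0.5, 0.1, n),      # y coordinate
    rng.exponential(0.05, n),     # e.g. distance to nearest injury crime
])
X_ctrls = np.column_stack([
    rng.uniform(0, 1, n),
    rng.uniform(0, 1, n),
    rng.exponential(0.2, n),
])
X = np.vstack([X_cases, X_ctrls])
y = np.repeat([1, 0], n)          # 1 = case, 0 = control

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean decrease in Gini impurity, the importance measure shown in Fig. 9
for name, imp in zip(["x_coord", "y_coord", "dist_nearest_injury"],
                     rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

The importances are normalised to sum to one, so they rank the covariates rather than quantify absolute effects.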

Conclusions
This paper proposes a machine learning-based methodology to predict the spatio-temporal distribution of a spatio-temporal point process. The spatial point process prediction problem is transformed into a classification problem for which several distinct classifiers can be employed. We applied the methodology to crime data from Kennedy (Bogota), focusing on predicting homicides in space and time, and conducted a study to investigate the contributions of the purely spatial and the spatio-temporal features. We underline that our approach is one of the few in the literature that combines a machine learning computational procedure with a more probabilistic point pattern approach. The combination of both fields shows promise for further progress in the study of spatio-temporal point processes.
(Fig. 8 caption: estimates, by Random Forest, of the intensity of homicides in Kennedy for two periods of 21 days; black highlighted points mark the locations of the homicides in these periods.)
A more detailed analysis of the features might be of scientific interest in other applications. For that aim, one can employ feature importance techniques to measure the contribution of each feature; an example is the Shapley additive explanations (SHAP) method introduced by Lundberg and Lee (2017). Our approach applies to any spatial region and temporal range; thus, any change in administrative boundaries (for example, working with countries, states, districts or suburbs) does not affect our method.
The methodology is designed to predict one step ahead, since the point patterns from the most recent temporal instants are required to compute the spatio-temporal covariates. However, further predictions can be made by simulating cases from the predicted intensities up to the target temporal instant. For instance, to predict two steps ahead, one can predict the intensity function one step ahead and simulate from this intensity to compute the covariates needed to predict the spatio-temporal distribution at the target instant.
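The simulation step of this multi-step scheme can be sketched as drawing a point pattern from a gridded predicted intensity. The flat intensity and grid below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_from_intensity(lam, centres, cell_area, rng):
    """Draw a synthetic point pattern from a gridded predicted intensity;
    the simulated points can then play the role of 'past events' when
    computing covariates for the next prediction step."""
    counts = rng.poisson(lam * cell_area)          # Poisson counts per cell
    jitter = (rng.uniform(-0.5, 0.5, size=(counts.sum(), 2))
              * np.sqrt(cell_area))                # uniform within each cell
    return np.repeat(centres, counts, axis=0) + jitter

m = 20
cs = (np.arange(m) + 0.5) / m
gx, gy = np.meshgrid(cs, cs)
centres = np.column_stack([gx.ravel(), gy.ravel()])
lam = np.full(m * m, 60.0)   # a flat one-step-ahead intensity, for illustration
pts = simulate_from_intensity(lam, centres, 1.0 / (m * m), rng)
```

Feeding `pts` back through the covariate construction and the fitted classifier yields the two-step-ahead intensity prediction.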
Since the predictions are based on historical data, our methodology does not consider the correlation among events within the same temporal instant, because we assume Poisson point processes. However, we model the intensities of the point processes as functions of previous events, so correlation among events from different temporal instants is accounted for. To minimise this shortcoming, one can define thinner temporal windows or, if data are available, adjust the methodology to accommodate continuous-time applications. A substantial limitation of our approach is the assumption that the relationship between the events and the historical data is temporally invariant. Again, one can minimise this deficiency by training the model on a sliding window so that only the most recent data comprise the training set.
The limitations and suggestions discussed in the two previous paragraphs will be addressed in future work. Additionally, the definition of spatio-temporal covariates is a crucial point of our methodology. To avoid this manual definition, one can consider deep learning models as alternatives to handcrafted features, potentially achieving more consistent and accurate predictions. Models of this kind will also be investigated in future work.