Learning heterogeneous delays in a layer of spiking neurons for fast motion detection

The precise timing of spikes emitted by neurons plays a crucial role in shaping the response of efferent biological neurons. This temporal dimension of neural activity holds significant importance in understanding information processing in neurobiology, especially for the performance of neuromorphic hardware, such as event-based cameras. Nonetheless, many artificial neural models disregard this critical temporal dimension of neural activity. In this study, we present a model designed to efficiently detect temporal spiking motifs using a layer of spiking neurons equipped with heterogeneous synaptic delays. Our model capitalizes on the diverse synaptic delays present on the dendritic tree, enabling specific arrangements of temporally precise synaptic inputs to synchronize upon reaching the basal dendritic tree. We formalize this process as a time-invariant logistic regression, which can be trained using labeled data. To demonstrate its practical efficacy, we apply the model to naturalistic videos transformed into event streams, simulating the output of the biological retina or event-based cameras. To evaluate the robustness of the model in detecting visual motion, we conduct experiments by selectively pruning weights and demonstrate that the model remains efficient even under significantly reduced workloads. In conclusion, by providing a comprehensive, event-driven computational building block, the incorporation of heterogeneous delays has the potential to greatly improve the performance of future spiking neural network algorithms, particularly in the context of neuromorphic chips.


Introduction
The human brain has the remarkable capacity to react efficiently at any given time while consuming a reasonable amount of energy, in the order of 20 watts.This system is the result of millions of years of natural selection, and a striking difference with artificial neural networks is the representation that both use.Indeed, our computers use digital values and for instance, the convolutional neural networks (CNNs) which are used for processing images represent the flow of information from one layer to another using tensors of floats.These networks store visual information densely across the visual topography, with different translation-invariant properties represented in different channels.CNNs have achieved state-of-the-art performance for some computer vision tasks, such as image recognition.These networks are also known to mimic several properties of the biological visual system, such that each can be assigned a "brain score" [68].However, this score does not take into account key aspects of the efficiency of biological systems, such as inference speed (usually several times the biological time) or the energy consumption of this mesocopic model of the brain, which is about 360 watts on a standard GPU (NVIDIA RTX 3090).In the vast majority of biological neural networks, on the other hand, information is represented as spikes, prototypical all-or-nothing (binary) events whose only parameters are their timing and the address of the neuron that fired the spike [53].Spiking neural networks (SNNs), known as the third generation of artificial neural networks, incorporate this temporal dimension into the way they perform their computations.One example of a SNN is the SpikeNet algorithm, which takes a purely temporal approach by encoding information using at most one spike per neuron [18].Alternatively, the SNN implemented in the SpikeProp algorithm [7] that uses the exact timing of spikes and learns the structure of the network by minimizing a specifically defined cost function.This was recently extended using the surrogate gradient method which is now widely used in attempts to transfer the performance of CNNs to SNNs [77].However, the performance of SNNs still lags behind that of networks using an analog, firing rate-based representation.The question of the advantage of using spikes in machine learning and computer vision remains open.
In a recent review, we reported previous theoretical and experimental evidence for the use of precise spiking motifs in biological neural networks [25].In particular, Abeles [1] asked whether the role of cortical neurons is the integration of synaptic inputs or rather the detection of coincidences in temporal spiking motifs.While the first hypothesis favors the firing rate coding theory, the second emphasizes on the importance of temporal precision in neural activity.Since then, numerous studies have demonstrated the emergence of synchronous activity within a neuronal population [16,65], efficient coding using precise spike timings [23,54,55], or precise timing in the auditory system [11,19].All these findings, and more [6], highlight the importance of the temporal aspect of the neural code and suggest the existence of spatiotemporal spiking motifs in biological spike trains.In neural models, the definition of heterogeneous delays [27,48,78] allows the efficient detection of these spatiotemporal motifs embedded in the spike train.Such spatiotemporal motifs present in neural activity may form useful representations to perform computations for various cognitive tasks using the synchrony of spikes reaching the soma of a neuron.In particular, Izhikevich [33] introduced the notion of the polychronous group as a spiking motif defined by a subset of neurons with different, but precise, relative spiking delays.This delay between a pair of connected neurons is defined as the time between the emission of a spike at the soma of the afferent neuron and its arrival at the soma of the efferent neuron.Importantly, due to the variety of weights and delays within a population, representations using polychronous groups have, in theory, a much higher information capacity than a firing rate-based approach.
The present paper proposes a real-world application that extends a recently proposed model of spiking neurons with heterogeneous synaptic delays [26].This model was trained to solve a simplified motion detection task on a synthetic event-based dataset generated by moving parameterized textures, and provides a first demonstration that formal neurons can exploit the precise timing of spikes to detect motion thanks to heterogeneous delays.Here, we extend these results to a much more complex and natural setting.First, we define the ecological cognitive task that the model must solve with the different datasets on which it will be tested.Instead of the synthetic textures used previously, we use natural scenes synthesized from natural images translated by biologically inspired saccadic movements.We then develop the Heterogeneous Delays Spiking Neural Network (HD-SNN) model from efficiency principles and derive a learning rule to adapt the weights of each heterogeneous synaptic delay using gradient descent.Applied to this detection task, we study the emergence of spatiotemporal spiking motifs when this single layer of spiking neurons is trained in a supervised manner.We investigate the efficiency of the motion detection mechanism and, in particular, its resilience to synaptic weight pruning.Indeed, once trained, the amount of event-driven computation could be drastically reduced by removing weak synapses while maintaining peak performance for the classification task.In this way, we will be able to show how such a model can provide an efficient solution to the energy/accuracy tradeoff.

Methods
This paper aims to investigate the capability of the HD-SNN model to effectively learn and solve a motion detection task using realistic event-based data streams, as typically captured by event-based cameras, also known as Dynamic Vision Sensors (DVS).DVS are designed to mimic the signal transmitted from retinal ganglion cells to the visual cortex through the optic nerve.The events in these data streams are binary in nature and should provide sufficient information for performing fast and efficient motion detection.In this study, we first outline the task definition, specifying the requirements for motion detection.Subsequently, we describe the HD-SNN model employed for inferring motion, highlighting its key characteristics and architecture.Lastly, we elaborate on the training procedure employed to train the HD-SNN model for the specific motion detection task.

Task definition: motion detection in a synthetic naturalistic event stream
To train and validate our model which uses supervised learning, we need to define a visual dataset for which we explicitly know the ground truth motion, that is, direction and speed.To achieve this, we define a procedure for animating a natural visual scene with virtual eye movements, similar to those used in studies from neurobiology [3,74] or computational neuroscience [38].First, we draw a trajectory inspired by biological eye movements.Indeed, these movements allow us to dynamically actuate the center of vision in the visual field.
In animals with a fovea, this is particularly useful as it allows the area with the highest density of photoreceptors to be moved, for example, to a point of interest in the environment.A specific mechanism for this function are saccades, which are rapid eye movements that reposition the center of vision.In humans, saccades are very frequent (on average 2 per second [14]), very fast (about 80 ms), and have a wide range of speeds.On a more microscopic scale, the human gaze moves with minute micro-saccades which trajectories are similar to a Brownian trajectory [62].To preserve the full generality of the task, we define eye movements using such a form of random walk [20]: We first define a finite set of possible motions in polar coordinates as the Cartesian product of the 12 regularly spaced directions and 3 geometrically spaced speeds.Next, we define a saccadic path as a sequence of time segments whose durations are drawn from a Poisson distribution with a mean block length of 24 ms, similar to a Lévy flight [44, p. 289].Finally, assuming motion is stationary during saccades, and that motions for each flight are drawn uniformly and independently, the global trajectory is generated by integrating this motion sequence.This generative model yields trajectories that are qualitatively very similar to those observed for human eye movements (see Fig. 1-(Left)).Once these eye trajectories are generated, we can apply them to a visual scene.For this purpose, we selected a database of 100 large-scale natural images that were previously used to study the statistics of natural images [72].Note that these were pre-processed to be in grayscale and to equalize (i.e., whiten) the energy in each frequency band, similarly to a process known to occur in the retino-thalamic pathway [13].The full-scale images are 1024 × 1024 in size, and we crop images of size 128 × 128 positioned around the center of gaze at each time step.We discretize time in 1 ms bins and produce movies of duration N T = 200 ms.To avoid boundary effects, we randomly position the full trajectory in image space so that the sub-image is translated using the position given by the trajectory at each time step and without touching the boundaries.At each time step, the translation is computed using a coordinate roll (Right) The dynamics of this image, translated according to the saccadic trajectory, produces a naturalistic movie, which is then transformed into an event-based representation.We show snapshots of the resulting synthetic event stream at different time steps (from t = 15 ms to t = 19 ms, these frames are marked on the trajectory by a white and black dot, respectively, in the left inset).Mimicking the response of ganglion cells in the retina, this representation encodes at each pixel all-or-none increases or decreases in luminance, i.e., ON (red) and OFF (blue) spikes.In the lower left corner of the snapshots, we show the corresponding instantaneous motion vector (red arrow).Note the change in the direction of motion between the third and fourth frames, and also that contours parallel to the motion produce fewer luminance changes, the so-called aperture problem, and thus relatively fewer spikes.
in the horizontal and vertical dimensions, followed by a sub-pixel translation defined in Fourier space [57].Note that the magnitude of the displacement is relative to the time bin, and we have defined the unit speed to correspond to a movement of one pixel per frame (i.e., per time bin of 1 ms).
To transform each movie into events, as recorded by a DVS, we compute a residual gradient image which we initialize at zero.We then compute the temporal gradient of the pixels' intensity over two successive frames.For a given pixel and time stamp, an event is generated when the absolute value of this gradient exceeds a threshold.The event has either an OFF or ON polarity, depending on whether the gradient is negative or positive.The signed threshold is then subtracted from the residual gradient image.When applied to the whole movie, the event stream is similar to the output of a neuromorphic camera [21], i.e. a list of events defined by x r and y r (their position on the pixel grid), their polarity p r (ON or OFF), and their time t r (see Fig. 1-(Right)).Ultimately, the goal of the model is to infer the correct motion solely by observing these events.

The HD-SNN model
In this task, the input consists of a stream of events or spikes, a representation common to the signal obtained from an event-based camera or from a neurobiological recording of the activity of single units.Formally, this can be defined as a list of tuples, each tuple representing the neural address and timestamp.We denote this list as = {(a r , t r )} r∈ [1,Nev] where N ev ∈ N is the total number of events in the data stream and the rank r is the index of each event in the list of events (see Fig. 2-Left-Top for an illustration).Each event has a time of occurrence t r and an associated address a r .Events are usually ordered by their time of occurrence.We define the address space A, which consists of the set of possible addresses.In neurobiological spiking activity, this may be the identified set of recorded neurons.For neuromorphic hardware like the output of a DVS or our task, this can be defined as where N p is the number of polarities (N p = 2 for the ON and OFF polarities encoded in event-based cameras) and (N X , N Y ) is the height and width of the image in pixels.Thus, each address a r is typically in the form (p r , x r , y r ) for event-based cameras.
In the HD-SNN model, neurons b ∈ B are connected to presynaptic afferent neurons from A using realistic synapses.In biology, a single cortical neuron typically has several thousand synapses.Each synapse can be defined by its synaptic weight and its delay, that is, the time it takes for a spike to travel from the soma of the presynaptic neuron to the soma of the postsynaptic neuron.Note that a neuron can contact another afferent neuron with different delays through different synaptic connections.By scanning all postsynaptic neurons b, we may thus define the full set of N s synapses, as S = {(a s , b s , w s , δ s )} s∈ [1,Ns] , where each synapse is associated with a presynaptic address a s , a postsynaptic address b s , a weight w s , and a delay δ s .This defines the full connectivity of the HD-SNN model (see Fig. 2-Right for an illustration of the connectivity of one neuron with synaptic weights and delays).
Of interest is to define the emitting field of a presynaptic neuron S a = {(a s , b s , w s , δ s ) a s = a} s∈[1,Ns] ⊂ S, or also the receptive field of a postsynaptic neuron In particular, when driven by a stream of spikes = {(a r , t r )} r∈ [1,Nev] , each incoming spike is multiplexed by the synapses of the receptive field S b of postsynaptic neuron b.This results in a weighted event stream (see Fig. 2-Left-Middle) for each postsynaptic neuron b: In biology, this new stream of events is naturally ordered in time as events reach the soma of postsynaptic neurons.In simulations, however, it should be properly reordered.Once transformed by the synaptic connectivity, this weighted event stream may be integrated, for instance as the membrane potential of a Leaky-Integrate-and-Fire neuron (see Fig. 2-Left-Bottom), yet the activation function of the HD-SNN neurons can be selected from the full range of spiking neuron response functions.Importantly, this activation function has to be such that when postsynaptic neurons are activated at their soma by a specific spatiotemporal motif imprinted in the synaptic set, and such that these spikes converge at the soma in a synchronous manner, the discharge probability should increase.In this subsection, we have briefly defined the HD-SNN model in all generality (see [35] for a more specific description and treatment), and in the next subsection we describe an implementation of our model adapted to the motion detection task.

Application of HD-SNN to motion detection
In fact, it is possible to adapt the HD-SNN model specifically for common computer vision tasks.First, neural addresses are defined to represent the range of possible positions and polarities.Second, to simulate such eventbased computations on standard CPU-or GPU-based computers and to benefit from parallel computing acceleration, we transform the temporal event-based representation into a dense dicretized representation.Indeed, by using this discretization, we transform any event-based input from an event-based camera into a Boolean matrix A ∈ {0, 1} Np×N T ×N X ×N Y defined for all polarities p, times t, and space coordinates x and y.The values are, by definition, equal to zero, except when events occur: ∀r ∈ [1, N ev ], A(p r , t r , x r , y r ) = 1.
Similarly, one may discretize the connectivity of the HD-SNN model defined above.The longest synaptic delay defines the depth K T of the kernel, so that all possible delays associated with the different presynaptic addresses are represented.In particular, for each class c of the supervision task, the entire synaptic set can be represented as a kernel, which is represented by the dense matrix K of size (N c , N p , K T , K X , K Y ), where N c is the number of classes, K T the number of delays and K X and K Y are the number of pixels in each spatial dimension.To keep the analogy with the HD-SNN model, K gives the synaptic set that defines the weight of all synapses s defined as a function of their class c, polarity, synaptic delay and relative position: In our simulations, we define as many classes as the number of motions (directions and velocities): N c = 12 × 3 and set the size of the model's kernel to (36,2,21,17,17).Such a kernel defines a dense representation of the full synaptic set.Then, it can be noted that by using a discretization, the computational block used in the equation ( 1) corresponds to a weighted reordering of the input A with each kernels and positions assigned to the postsynaptic neurons [26].Let's define evidence as the logit of a probability, that is, the inverse sigmoid of that probability.By this definition of evidence, logistic regression takes advantage of the fact that if different independent observations (here, the estimated motion at different spatial locations and timings) share a common cause (here, the rigid local motion of the image on the receptive field), then an optimal estimate of the evidence of this motion is the sum of the evidences from the independent sources.Interpreting the weights of the kernel as evidences (also called factors in logistic regression), we may therefore define the activity B of post-synaptic neurons as the integration of this activity in each voxel and for each channel c in order to infer the evidence of each motion: ∀x, y, t, B(c, t, x, y) = p,δt,δx,δy where δ x and δ y are the relative addresses of the synapses within a kernel and δ t is the synaptic delay.In this formulation, we recognize that it takes advantage of the position invariance observed in images and exploited in CNNs.Here, we further assume that synaptic motifs should be similar across different times as defined in the temporal convolution.As a consequence, this defines a 3D, spatiotemporal convolutional operator, in which the layers of neurons assigned to specific kernels form channels.Using this dense representation, the model's processing of the input A can be written as layer-wise convolution: B = K * A (see Fig. 3 for an illustration).
The well-known convolution defines a differentiable measure, which is very efficiently implemented for GPUs, and which we will use to detect the motion direction in the event stream.A similar type of spatiotemporal filtering was used as a preprocessing stage for an existing pattern recognition algorithm [22].In addition, Sekikawa et al. [69] developed an efficient 3D convolutional algorithm that implements a motion estimation task.By assuming locally a constant motion, the authors assume that the 3D kernel can be decomposed into a 2D kernel representing the shapes and a 3D kernel representing the motion.For convenience, the connectivity of the neuron b is defined locally around its position (x b , y b ).Furthermore, it is important to consider that in order to adhere to the limitations of causal computation using biologically realistic neurons, synaptic delays are assigned positive values.This ensures that only past information contributes to the inference made at the present moment.In practice, the kernels are temporally shifted so that the inference at the present time is solely influenced by past information.This temporal shift occurs after a duration equivalent to the depth of the kernel, denoted as K T .
Such a method contrasts with classical methods for delay learning, which explicitly manipulate the delay as a variable and which are not directly differentiable [48].Keeping the analogy with spiking neurons, the analog activity B represents the integration of synaptic activity, and we will now try to define the detection of motion using the spatiotemporal kernels.Since we know that at each instant, there may be different motions, we will define the activation function of our model as a sigmoid function that implements a form of Multinomial Logistic Regression (MLR).In our MLR model, a probability value for each class (i.e., each direction of motion) is predicted for each position x, y and time t as a sigmoid function σ(β) = 1 1+exp(−β) of the result of the convolution.Formally, using the kernels, the input raster plot is transformed into a probability with the following formula: where β c is a scalar representing the bias associated with the class c.In particular, we anticipate that certain specific patterns could result in closely synchronized outputs when they are integrated within the basal dendritic tree, consequently leading to heightened postsynaptic activity.By utilizing this analog representation of the evidence for each potential motion at every moment, we can progressively increase the likelihood of generating an output spike.To determine the spiking output, we establish a firing threshold.Here, we computed this threshold to ensure that neurons, on average, generate one spike per second.Therefore, the spiking output of the model corresponds to the motions in space and time that represent the highest probability.Now that this general framework has been described formally, we may include some heuristics based on neuroscientific observations to constrain our model and its strategies for solving the ecological task described in section 2.1.Note that the general framework is an extension to that presented in [26], in particular by: including a more complex task, the deeper analysis of the results, and these novel neuroscience-inspired heuristics.First, to avoid introducing biases in the directions which may be learned, we apply a circular mask to the spatial dimensions of the kernels.We also included a prior in the selectable motions, as there is a prior for slow speeds in natural scenes [71].Since we want to capture the possible convergence of the trajectories of the events converging on each voxel, we apply a mask to the spatiotemporal kernels such that the smaller the delay, the smaller the radius of the circular mask that is applied (see Fig. 4 for an illustration).In our simulations, we observed that including this prior accelerated the learning but was not necessary to reach convergence.Second, we observed that moving images produced trajectories of ON and OFF spikes, and that these were present in both polarities.This is due to the fact that our whitened images have a relative symmetry in the luminance profiles, that is, that an image with inverted contrast is indistinguishable from a standard one.Since this arrangement of polarities is independent of motion, we added a mechanism that collects the linear values for the movie and that with the ON and OFF cells flipped, keeping only the maximum value for each voxel.This is similar to the computation done for complex cells in primary visual cortex.

Supervised learning of the motion detection task
Since the model is fully differentiable, we can now implement a supervised learning rule to learn the weights of the model's kernel.This rule was implemented using the binary input events as inputs and the corresponding motion's labels as the desired output.The loss function of the MLR model is the binary cross entropy of the output of the classification layer knowing the ground truth.The labels were defined at each time point as a one-hot encoding of the current motion in the channel corresponding to the current motion, and applied for all positions.Note that in this context, the label is known, but the position of visual features is not, mainly due to the sparse spatial content of natural images.However, the supervised optimization of this MLR model adjusts the weights of the kernels.As a result, we observed that the error is only propagated back to the spatial locations of these most active cells.This is reminiscent of previous methods that solve this problem using a winner-takes-all mechanism [47], but is implicit in our formulation.Simulations are performed with the PyTorch library using gradient descent with Adam (for 2 10 movies, each of size 200 × 128 × 128, a learning rate of 10 −5 and 100 epochs).Finally, the output of the MLR model is a representation that predicts the probability of each motion at each position and time.Such an output provides a form of optical flow that can be exploited for non-rigid motion, but we have defined here, for simplicity, an evaluation method that applies to our full-field motion task.Using the properties of logistic regression, by taking the mean evidence represented in the output given by the model at all positions for any given time, and using the sigmoid function, we can derive each motion's probability at that time.Taking the most probable class as the output, this allows one to calculate the accuracy as the percentage of times the motion is accurately predicted at any given time step.For validation, these calculations are performed on a different input dataset than the one used in the training or validation steps.The complete code to reproduce the results of this paper is available at https://github.com/SpikeAI/2023_GrimaldiPerrinet_HeterogeneousDelaySNN (see Data Availability).

Kernels learned for motion detection
Once our model has been trained, we can begin by examining the learned weights for the various motions (see Fig. 4).Notably, when we track each spatial motif from the shortest delay (on the right) to the longest delay (on the left), we observe that the cells exhibit highly localized selectivity and their preferences are conveyed along linear trajectories in the space-delay domain.When focusing on the positive weights, we notice a pronounced selectivity along specific motion axes for each kernel, and these directions correspond closely to the associated motion's physical direction in visual space.For instance, the first kernel demonstrates a robust preference for downward motion.The negative weights are symmetrically arranged around these positive weights, forming a center-surround profile that is known to enhance the response.We also observed a strong dependence between the weights reaching the ON polarities and those reaching the OFF polarities.In particular, whenever a weight for a given position and delay is positive for one polarity, it will be negative for the other.This property is due to the way events are generated and the fact that the luminance cannot increase and decrease at the same time.Interestingly, the relative organization of the receptive fields that we observe is in quadrature of phase and follows a push-pull organization predicted by Kremkow et al. [38] to explain neurophysiological results obtained after showing similar natural scenes with synthetic eye movements [3].Finally, we observe that these delay to the left (up to a delay of 12 steps, the remaining synapses being not represented).Each image corresponds to the weights at a given delay, with excitatory and inhibitory weights in warm and cold colors, respectively.Due to the symmetry between the ON and OFF event streams, we observed that the kernels for the OFF polarities are very similar and are not shown here.Different kernels are selective for the different motion directions, and we observe a slight orientation preference perpendicular to the respective direction for all kernels.
receptive fields show also a relative selectivity to the orientation perpendicular to motion, similarly to what is found for neurons in cortical area MT which is known to be selective to visual motions [17].This reflects the way events are generated and in particular the so-called aperture problem which implies that a line moving along its axis would generate no change in luminance and therefore generate no event [61].
If we now widen our focus on the interpretation of these kernels in terms of spatiotemporal motifs embedded in the event stream, these show a prototypical anisotropic profile adapted to motion detection [34].In [24], a link was drawn between event-based MLR training and Hebbian learning, allowing to say that the present model learns its weights according to a presynaptic activity associated to the different motion directions.Each neuron becomes selective to a specific motion direction through the learning of an associated prototypical spatiotemporal spiking motif.Each voxel in the 3D kernels defines a specific property by associating a weight to a position and a delay.Consequently, our model is able to detect precise spatiotemporal motifs embedded Fig. 5 In response to a specific event-based input instance (Left), we present the neural activity of the HD-SNN model (Right).To aid visualization, we display a single spatiotemporal slice for a given vertical position (y = 32) and 10 horizontal positions.The input spiking activity comprises ON and OFF spikes, as explained in Fig. 1, showcasing the switching within the naturalistic event-based stream from one motion to another due to saccades.The dots above the graph indicate the corresponding motion class at each instant (the motion being represented by the matching color).The output activity consists of two components.Firstly, there is an analog component that corresponds to the evidence accumulated by the model on the spatio-temporal kernels.Secondly, there is a spiking component represented by vertical bars superimposed on the analog activity.These spikes signify moments when the evidence surpasses the spiking threshold.Importantly, this activity aligns with the motion depicted in the input stream.Finally, it is possible to compute the accuracy by comparing the ground truth motion in the input video with the motion predicted by the model (as represented by the colored dots on top of the graph).
in the spike train and associated to the different motion directions.Note that as the delays become larger, two effects can be remarked.First, coefficients become lower which is consistent with the fact that trajectories are defined in a piece wise fashion, such that this decrease provides with an optimal integration considering the gradual diminishing of evidence as time progresses [52].Second, coefficients become less localized compared to the kernel's spatial profile at short delays, consistent with the average diffusion of information included in the generative model and with the diffusion introduced in motion-based prediction models [36,61].

Accuracy versus efficiency trade-off
After training our MLR model, we obtain spatiotemporal kernels corresponding to the weights associated with the heterogeneous delays of our layer of spiking neurons, which can be used for detection.For this, we quantify its ability to categorize different motions, i.e. on event streams for which the ground truth motion is known at each instant.When applied to new instances of the input movies, the model develops a neural activity which may be used to infer the correct motion (see Fig. 5) and from which we may deduce an accuracy value.This accuracy was computed on a novel dataset of 200 novel movies.The accuracy computed on the test set was approximately 91% (with a chance level of 1/12/3 ≈ 2%).with error bars indicating the 5% -95% quantiles and a sigmoid fit (blue line).The relative computational load (on a logarithmic axis) is controlled by changing the percentage of nonzero weights relative to the dense convolution kernel.If we shorten the length of the kernel by using only the weights at the shortest delays, the accuracy quickly drops.However, if we prune the lowest coefficients from the whole kernel, we observe a stable accuracy value, with a drop to half-saturation observed at about 375 times fewer computations.
We also observed that the distribution of the kernel's weights is sparse, with most values close to zero (see Fig. 4).As shown in the formalization of our event-based model, the computational cost of our model, if implemented on a neuromorphic chip, would be dominated by the computations used for the convolution operation.In a dense setting, this corresponds for all voxels in the output to a sum over all voxels in the inputs for all weights in the kernel.But if the information support is sparse, then computations can be now performed only on those events.Specifically, if we set some weights of the kernels to zero, then the additive operation in the convolution for those addresses can be dropped.As a consequence, computations will be performed only on those events which were multiplexed by the pruned connectivity matrix.Thus, knowing the sparseness of the input, the total number of computations scales with the number of spikes multiplied by the number of non-zero synaptic weights.This hypothesis is consistent with biological observations which have shown that communication consumes 35 times more energy than computation in the human cortex [42].
In order to evaluate the resilience of the classification performance with respect to computational load, we adopt first a pruning approach, where we remove weights in K that fall below a specified threshold.The accuracy of classification is then plotted as a function of the relative number of computations or active weights per decision for each neuron in the layer (refer to Fig. 6).
To provide a basis for comparison and to account for the benefits of utilizing variable delays, we also present the accuracy achieved by an MLR model employing a shortening strategy.This strategy involves adjusting the temporal width by selecting only the weights associated with the shortest delays.In comparison to the inference performed using the complete 3D kernels without any pruning (36×2×21×17×17), both approaches demonstrate a reduction in computational requirements as indicated by the number of non-zero weights.
By selectively setting certain weights to zero, we observe that the accuracy's evolution, as a function of the logarithmic percentage of active weights, aligns well with a sigmoid curve for both pruning and shortening strategies.The shortening strategy (depicted in orange) demonstrates a rapid decline in accuracy, reaching half-saturation when approximately one-third of the weights remain.On the other hand, the pruning strategy (shown in blue) exhibits a different behavior.It reaches half-saturation when the ratio of active weights is approximately 2.6 × 10 −3 , corresponding to around 375 less computations compared to the dense scenario.In comparison to using the complete kernels, our method maintains accuracy close to its peak performance even when the number of computations is divided by a factor of up to around 31.This substantial reduction in computations showcases the robustness of the presented method.

Testing with natural-like textures
In order to assess the influence of spatiotemporal parameters of the stimuli on the performance of the model, we now test the model on simpler, parameterized stimuli.For this purpose, we use a set of synthetic visual stimuli, Motion Clouds [40], which are natural-like random textures for which we can control relevant parameters for motion detection, including motion direction, spatial orientation, and spatial frequency along with their respective precisions (see Fig. 7) [41,71].By matching the spatial and temporal characteristics of the generated movies with those of the motion task mentioned earlier, we created a range of textures featuring different spatial properties and motions.This procedure defines a set of textures with different spatial properties and different motions chosen from the same set of 12 directions and 3 speeds.For each motion, we also varied the texture parameters, such as mean and variance of orientation or spatial frequency content, to provide some naturalistic variability.This method provides a rich dataset of textured movies for which we know the ground truth for the motion.
We observe some interesting facts.First, as we change the mean spatial frequency of the texture, we observe a broadly tuned response in accuracy.This comes as a similar trend as shown in the primary visual areas [60,63] and reveals the most informative scales learned by our model.Then, by modifying the bandwidth in spatial frequency, we show that the accuracy is worse for a grating-like stimulus than for a large one (which qualitatively resembles a more textured stimulus), reminiscent of the behavioral response of humans to such stimuli [64,70].Interestingly, we also see a modulation of accuracy as a Note that these accuracies are computed both in the case where the orientation of the synthetic texture is necessarily perpendicular to the motion ('no aperture' condition) and in the generic case where the orientation is independent of direction ('aperture').
function of orientation bandwidth.When the stimulus is grating-like and the orientation is arbitrary with respect to the direction of motion, the system faces the aperture problem and experiences a sharp decrease in accuracy.This is not the case for isotropic stimuli or when the orientation is perpendicular to the direction of motion.Finally, we manipulated the amount of change between two successive frames, similar to a temperature parameter.This shows a progressive decrease in accuracy, similar to that observed in the amplitude of human eye movements [45].

Discussion
This paper presents a novel and versatile heterogeneous delay spiking neural network (HD-SNN) that was trained using supervised learning for visual motion detection.We demonstrate the effectiveness of our model by comparing its performance to other event-based classification algorithms for this specific task.Notably, the learned model exhibits similarities with neurobiological and behavioral observations.One key advantage of our approach is the ability to significantly reduce the computational requirements through synapse pruning, while still maintaining robust classification performance.This highlights the potential to leverage the precise timing of spikes to enhance the efficiency and effectiveness of neural computations.Overall, our findings underscore the potential of incorporating precise spike timing in neural models and demonstrate the promising capabilities of our heterogeneous delay SNN for event-based computations, specifically in the context of visual motion detection.

Synthesis and main contributions
The HD-SNN model was trained and evaluated on a naturalistic motion detection task with realistic eye movements.It is defined such as to provide an optimal detection of spatiotemporal motifs and learns kernels similar to those found in the visual cortex [17,38].We have evaluated the computational cost of this model when implemented in a setting similar to event-based hardware.We show that the use of heterogeneous delays may be an efficient computational solution for future neuromorphic hardware, but also a key to understanding why spikes are a universal component of neural information processing.We would like to highlight a few innovations in the contributions presented in this paper.First, while [22,76] use a correlation-based heuristic, the generic heterogeneous model is formalized from first principles for optimal detection of spatiotemporal spiking motifs using a time-invariant logistic regression.Moreover, compared to classical CNN solutions, the parameters of this one-layered model (weights and delays) are explainable, as they directly inform about the evidence of detection for each spatiotemporal spike motif, where we define evidence as the logit of the probability, that is, the inverse sigmoid of the probability.Another novelty is that the model learns simultaneously weights and delays.In contrast, the polychronization model [33] learns only the weights using STDP, while the delays are randomly drawn at initialization and their values are frozen during learning.In addition, the model is evaluated on a realistic task, while models such as the tempotron are tested on simplified toy problems [28].Another major contribution is to provide a model that is suitable for learning any kind of spatiotemporal spiking motif and that can be trained in a supervised manner by providing a dataset of supervised pairs.Instead of relying on a careful description of the physical rules governing a task, e.g. the luminance conservation principle for motion detection [4,15], this allows a more flexible definition of the model using this properly labeled dataset.

Main limits
We have identified a number of limitations of our model, which we will now discuss in detail.First, this implementation of the HD-SNN model is based on a discrete binning of time, which is not compatible with the continuous nature of biological time.We used this binning to efficiently implement the framework on conventional hardware, especially GPUs, in particular to be able to use fast, differentiable three-dimensional convolutions.This is consistent with the relative robustness of other event-based frameworks [24,39], where accuracy was unaffected when the input spikes were subjected to noisy perturbations up to 4 ms on the N-MNIST dataset [24].It suggests the potential advantage of analytically including an additional precision term to the temporal value of input spikes.Such a mechanism may be implemented by the filtering implemented by the synaptic time constant of about 5 ms.Furthermore, it is possible to circumvent the need for time discretization by the use of a purely event-based scheme.In fact, it is possible to derive event-triggered computations of the continuous activity of the SNN [30] and thus to define a purely event-based framework.Such an architecture could provide promising computational speedups.
Another limitation is that the model is purely feed-forward.Thus, the spikes generated by the postsynaptic neurons are based solely on the information contained in the classical receptive field.However, it is known that neurons in the same layer can interact with each other through lateral interactions, for example in V1, and that this can be the basis for more complex computational principles [12].For example, the combination of neighboring orientations may contribute to image categorization [59].Furthermore, neural information may be modulated by feedback information, e.g. to distinguish a figure from its background [66].Feedback has been shown to be essential for building realistic models of primary visual areas [9,10], especially to explain non-linear mechanisms [8].Currently, mainly due to our use of convolutions, it is not possible to implement these recurrent connections in our implementation (lateral or feedback).However, by inserting new spikes into the list of spikes reaching presynaptic addresses, the generic HD-SNN model is able to incorporate them.While this is theoretically possible, it must be properly tuned in practice so that these recurrent connections do not bring neuronal activity outside a homeostatic state (by extinction or explosion).
Such recurrent activity would be essential for the implementation of predictive or anticipatory processes [5].This is essential in a neural system because it contains several delays that require temporal alignment [31].This has been modeled before to explain, for example, the flash-lag illusion [36].As mentioned previously, this could be implemented using generalized coordinates (i.e., variables such as position complemented by motion, acceleration, jerk, . . .), and knowing that "neurobiologically, using delay operators just means changing synaptic connection strengths to take different mixtures of generalized sensations and their prediction errors" [58].Our proposed model using heterogeneous delays provides an alternative and elegant implementation solution to this problem.

Perspectives
In defining our task, we emphasized that the generation of events depends on the spatial gradient in each image.This gradient has both horizontal and vertical dimensions, and its maxima are generally orientation dependent.Taken together, these oriented edges form the contours of visual objects in the scene [37].Thus, there is an interdependence between motion information and orientation information within the event stream, which we put in evidence by the shape of the kernels.It would be crucial to investigate this dependency further.This could be initiated by training the model on a dataset with labels that provide local orientation.Exploring this dependence will allow us to dissociate and integrate these two forms of visual information.In particular, it will allow us to consider that the definition of motion is more accurate perpendicular to an oriented contour (that is, the aperture problem).Thus, it will allow us to implement recurrent prediction rules, such as those identified to dissociate this problem [61].
The model is trained on a low-level local motion detection task, and one might wonder if it could be trained on higher-level tasks.An example of such a task would be depth estimation in the visual scene.There are several sources of information for depth estimation, such as binocular disparity or changes in texture or shading, but in our case motion parallax would be the most important cue [67].This is because objects that are close to an observer move on the retina relatively faster than an object that is far away, and also because visual occlusions depend on depth order.Using this information, it is possible to segment objects and estimate their depth [75].However, this would first require the computation of the optic flow, i.e., the extension of the framework described here for a rigid full-field motion to a more general one where the motion may vary in the visual field.One possible implementation is to add a new layer to our model, analogous to the hierarchical organization which is prevalent in the visual cortex.This is theoretically possible by using the output of our model (which estimates motion in retinotopic space) as input to a new layer of neurons that would estimate motion in the visual field, including the depth dimension in the output supervision labels.This could have direct and important applications, e.g. in autonomous driving to detect obstacles in a fast and robust way.Another extension would be to actively generate sensor motion (physical or virtual) to obtain better depth estimates, especially to disambiguate uncertain estimates [49].
In conclusion, the HD-SNN model that we have presented provides a way to efficiently process event-based signals.We have shown that we can train the model using a supervised rule that infers what is the output label, but not where it occurs.Another perspective would be to extend the model to a fully self-supervised learning paradigm, i.e. without any labelled data [2].This type of learning is thought to be prevalent in the central nervous system and, assuming the signal is sparse [50], one could extend these Hebbian sparse learning schemes to spikes [46,56].We expect that this would be particularly useful for exploring neurobiological data [35].Indeed, there is a large literature showing that brain dynamics often organize into stereotyped sequences, such as synfire chains [32], packets [43], or hippocampal sequences [51,73].These motifs are stereotyped and robust, as they can be activated following the same motif from day to day [29].In contrast to conventional methods of processing neurobiological data, such an event-based model would be able to answer key questions about the representation of information in neurobiological data, and it would open possibilities in the field of computational neuroscience.Furthermore, it would open possibilities in the field of machine learning, especially in computer vision, to address current key concerns such as robustness to attacks, scalability, interpretability, or energy consumption.

Fig. 1
Fig. 1 Motion Detection Task.To generate realistic event-based dynamic scenes, we mimic the effect of minute saccadic eye movements on a large natural scene (1024 × 1024) by extracting an image (128 × 128) which center is moving dynamically according to a jagged random walk.(Left) We show an instance of this trajectory (with a length of 200 ms, green line) superimposed on the luminance contrasts observed at time step t = 15 ms.(Right)The dynamics of this image, translated according to the saccadic trajectory, produces a naturalistic movie, which is then transformed into an event-based representation.We show snapshots of the resulting synthetic event stream at different time steps (from t = 15 ms to t = 19 ms, these frames are marked on the trajectory by a white and black dot, respectively, in the left inset).Mimicking the response of ganglion cells in the retina, this representation encodes at each pixel all-or-none increases or decreases in luminance, i.e., ON (red) and OFF (blue) spikes.In the lower left corner of the snapshots, we show the corresponding instantaneous motion vector (red arrow).Note the change in the direction of motion between the third and fourth frames, and also that contours parallel to the motion produce fewer luminance changes, the so-called aperture problem, and thus relatively fewer spikes.

Fig. 2
Fig. 2 Core mechanism of the HD-SNN model.(Left-Top) Four presynaptic neurons show some spiking activity in which a spiking motif is embedded (starting at time t = 50 ms).(Right)An illustration of a spiking neuron with different synaptic weights (represented by the thickness of the synapses) and different synaptic delays (represented by the length of the synapses).(Left-Middle) Each spike is weighted by the synaptic weights (height of the blue bars) and shifted in time according to the synaptic delays on each respective synapse (input spikes are shown in light gray for comparison).As a result, the spikes from the spiking motif are synchronized as they reach the soma of the postsynaptic neuron.(Left-Bottom) These spikes are then integrated, and contribute to a modification of the membrane potential of the output neuron according to the neural activation function.In this example, we use the activation function of a Leaky Integrate-and-Fire neuron.The first spiking motif is synchronized by the synaptic delays and causes a sudden rise in the membrane potential of the postsynaptic neuron.An output spike is emitted at time t = 75 ms when the membrane potential reaches the threshold, and it is then reset.

Fig. 3
Fig.3Applying HD-SNN to the task of motion detection.(Left) We plot a 2D representation of the input event stream as a raster plot (showing ON spikes in red and OFF spikes in blue for each presynaptic address and time).A spatiotemporal convolution is applied to the dense representation of the input with 2 different convolution kernels (the green and orange kernels), which define the output channels.The convolution is summed over the two polarities.Since we have two axes X and Y to represent the presynaptic addresses, like the pixel grid of a DVS, this results in a 3D convolution.Here we simplify the illustration to a 2D representation and to 2 possible classes (green and orange) associated to two different directions of motion.(Middle) For each position (address, time) one can compute the activation resulting from the convolution.The output of the convolution is processed by the nonlinearity of the MLR model (i.e., the sigmoid function).The output of the MLR gives a probability for each class associated with a particular kernel (colored bars in the highlighted pixel).(Right) By adding a spiking mechanism, here a winner-takes-all associated with thresholding, we obtain as output of the HD-SNN model a new spike train with the different spikes associated with a particular motion class.Note that the position of the output spikes does not systematically correspond to the position of the input spikes, but only when enough evidence is reached.

Fig. 4
Fig. 4 Representation of the weights for 8 directions for a single speed (among the 12 × 3 different kernels of the model) as learned on the dataset of naturalistic scenes.The directions are shown as red arrows in the left insets, where the disks correspond to the set of different possible motions.The spatiotemporal kernels are shown as slices of spatial weights at different delays.Delays vary along the horizontal axis from the far right (delay of one step)to the left (up to a delay of 12 steps, the remaining synapses being not represented).Each image corresponds to the weights at a given delay, with excitatory and inhibitory weights in warm and cold colors, respectively.Due to the symmetry between the ON and OFF event streams, we observed that the kernels for the OFF polarities are very similar and are not shown here.Different kernels are selective for the different motion directions, and we observe a slight orientation preference perpendicular to the respective direction for all kernels.

Fig. 6
Fig.6Accuracy as a function of computational load for the HD-SNN model (blue dots) with error bars indicating the 5% -95% quantiles and a sigmoid fit (blue line).The relative computational load (on a logarithmic axis) is controlled by changing the percentage of nonzero weights relative to the dense convolution kernel.If we shorten the length of the kernel by using only the weights at the shortest delays, the accuracy quickly drops.However, if we prune the lowest coefficients from the whole kernel, we observe a stable accuracy value, with a drop to half-saturation observed at about 375 times fewer computations.

Fig. 7
Fig.7Role of stimulus parameters in motion detection accuracy.Accuracy as a function of (from left to right) mean spatial frequency, bandwidth in spatial frequency (from gratings (left) to isotropic textures (right)), bandwidth in orientation (from isotropic textures (left) to gratings (right)), bandwidth in speed (from a rigid motion (left) to independent frames (right)).Examples snapshots are shown as an illustration in the top insets.Note that these accuracies are computed both in the case where the orientation of the synthetic texture is necessarily perpendicular to the motion ('no aperture' condition) and in the generic case where the orientation is independent of direction ('aperture').