The Backpropagation Algorithm Implemented on Spiking Neuromorphic Hardware

The capabilities of natural neural systems have inspired new generations of machine learning algorithms as well as neuromorphic very large-scale integrated (VLSI) circuits capable of fast, low-power information processing. However, it has been argued that most modern machine learning algorithms are not neurophysiologically plausible. In particular, the workhorse of modern deep learning, the backpropagation algorithm, has proven difficult to translate to neuromorphic hardware. In this study, we present a neuromorphic, spiking backpropagation algorithm based on synfire-gated dynamical information coordination and processing, implemented on Intel's Loihi neuromorphic research processor. We demonstrate a proof-of-principle three-layer circuit that learns to classify digits from the MNIST dataset. To our knowledge, this is the first work to show a Spiking Neural Network (SNN) implementation of the backpropagation algorithm that is fully on-chip, without a computer in the loop. It is competitive in accuracy with off-chip trained SNNs and achieves an energy-delay product suitable for edge computing. This implementation shows a path for using in-memory, massively parallel neuromorphic processors for low-power, low-latency implementation of modern deep learning applications.

I. Introduction

Dynamical Coordination in Cognitive Processes.
Spike-based learning in plastic neuronal networks is playing an increasingly key role in both theoretical neuroscience and neuromorphic computing. The brain learns in part by modifying the synaptic strengths between neurons and neuronal populations. While specific synaptic plasticity or neuromodulatory mechanisms may vary in different brain regions, it is becoming clear that a significant level of dynamical coordination between disparate neuronal populations must exist, even within an individual neural circuit [1]. Classically, backpropagation (BP) and related learning algorithms [2–4] have been essential for supervised learning in artificial neural networks (ANNs). Although the question of whether or not BP operates in the brain is still an outstanding issue [5], BP does solve the problem of how a global objective function can be related to local synaptic modification in a network. It seems clear, however, that if BP is implemented in the brain, or if one wishes to implement BP in a neuromorphic circuit, some amount of dynamical information coordination is necessary to propagate the correct information to the correct location such that appropriate local synaptic modification may take place to enable learning.
There has been a rapid growth of interest in the reformulation of classical algorithms for learning, optimization, and control using event-based information-processing mechanisms. Such spiking neural networks (SNNs) are inspired by the function of biophysiological neural systems [6], i.e., neuromorphic computing [7]. The trend is driven by the advent of flexible computing architectures such as Intel's neuromorphic research processor, codenamed Loihi, that enable experimentation with such algorithms in hardware [8]. There is particular interest in deep learning, which is a central tool in modern machine learning. Deep learning relies on a layered, feedforward network similar to the early layers of the visual cortex, with threshold nonlinearities at each layer that resemble mean-field approximations of neuronal integrate-and-fire models. While feedforward networks are readily translated to neuromorphic hardware [9–11], the far more computationally intensive training of these networks 'on chip' has proven elusive, as the structure of backpropagation makes the algorithm notoriously difficult to implement in a neural circuit [12,13]. A feasible neural implementation of the backpropagation algorithm has gained renewed scrutiny with the rise of new neuromorphic computational architectures that feature local synaptic plasticity [8,14–16]. Because of these well-known difficulties, neuromorphic systems have to date relied almost entirely on conventional off-chip learning and used on-chip computing only for inference [9–11]. It has been a long-standing challenge to develop learning systems whose function is realized exclusively using neuromorphic mechanisms.
Backpropagation has been claimed to be biologically implausible or difficult to implement on spiking chips because of several issues: (a) Weight transport: Usually, synapses in biology and on neuromorphic hardware cannot be used bidirectionally; therefore, separate synapses for the forward and backward pass are employed. However, correct credit assignment, i.e. knowing how a weight change affects the error, requires feedback weights to be the same as feedforward weights [17,18]; (b) Backwards computation: Forward and backward passes implement different computations [18]. The forward pass requires only weighted summation of the inputs, while the backward pass operates in the opposite direction and additionally takes into account the derivative of the activation function; (c) Gradient storage: Error gradients must be computed and stored separately from activations; (d) Differentiability: For spiking networks, the issue of non-differentiability of spikes has been discussed and solutions have been proposed [19–22]; and (e) Hardware constraints: For the case of neuromorphic hardware, there are often constraints on the plasticity mechanisms that allow for adaptation of synaptic weights. On some hardware, no plasticity is offered at all, while in some cases only specific spike-timing-dependent plasticity (STDP) rules are allowed. Additionally, in almost all available neuromorphic architectures, information must be local, i.e., information is only shared between neurons that are synaptically connected, in particular to facilitate parallelization.

Previous Attempts at sBP
The most commonly used approach to avoiding the above issues is to use neuromorphic hardware only for inference with fixed weights, obtained by training an identical network offline and off-chip [9,10,23–27]. This has recently achieved state-of-the-art performance [11], but it does not make use of neuromorphic hardware's full potential, because offline training consumes significant power. Moreover, to function in most field applications, an inference algorithm should be able to learn adaptively after deployment, e.g., to adjust to a particular speaker in speech recognition, which would enable autonomy and privacy of edge computing devices. So far, only last-layer training, without backpropagation and using variants of the delta rule, has been achieved on spiking hardware [28–34]. Other on-chip learning approaches use alternatives to backpropagation [35,36], bio-inspired non-gradient-based methods [37], or hybrid systems with a conventional computer in the loop [38,39]. Several promising alternative approaches for actual on-chip spiking backpropagation have been proposed recently [40–43], but have not yet been implemented in hardware.
To avoid the backwards computation issue (b), and because neuromorphic synapses are not bidirectional, a separate feedback network for the backpropagation of errors has been proposed [44,45] (see Fig. 1). This leads to the weight transport problem (a), which has been solved by using symmetric learning rules to maintain weight symmetry [45–47] or with the Kolen-Pollack algorithm [47], which leads to symmetric weights automatically. It has also been found that weights do not have to be perfectly symmetric, because backpropagation can still be effective with random feedback weights (random feedback alignment) [48], although symmetry of the sign between forward and backward weights matters [49].
The backwards computation issue (b) and the gradient storage issue (c) have been addressed by approaches that separate the function of the neuron into different compartments and use structures that resemble neuronal dendrites for the computation of backward propagating errors [5,41,43,50]. The differentiability issue (d) has been circumvented by spiking rate-based approaches [10,24,51,52] that use the ReLU activation function as done in ANNs. The differentiability issue has also been addressed more generally using surrogate gradient methods [9, 19–22, 25, 28, 53] and methods that use biologically-inspired STDP and reward-modulated STDP mechanisms [54–57]. For a review of SNN-based deep learning, see [58]. For a review of backpropagation in the brain, see [5].

Our Contribution
In this study, we describe a hardware implementation of the backpropagation algorithm that addresses each of the issues (a)-(e) introduced above using a set of mechanisms that have been developed and tested in simulation by the authors during the past decade, synthesized in our recent study [59], and simplified and adapted here to the features and constraints of the Loihi chip. These neuronal and network features include propagation of graded information in a circuit composed of neural populations using synfire-gated synfire chains (SGSCs) [60–63], control flow based on the interaction of synfire chains [61], and regulation of Hebbian learning using synfire gating [64,65]. We simplify our previously proposed network architecture [59], and streamline its function. We demonstrate our approach using a proof-of-principle implementation on Loihi [8], and examine the performance of the algorithm for learning and inference on the MNIST test data set [66]. The sBP implementation is shown to be competitive in clock time, sparsity, and power consumption with state-of-the-art algorithms for the same tasks.

The Binarized sBP model
We extend our previous architecture [59] using several new algorithmic and hardware-specific mechanisms. Each unit of the neural network is implemented as a single spiking neuron, using the current-based leaky integrate-and-fire (CUBA) model (see Eq. (19) in Section IV) that is built into Loihi. The time constants of the CUBA model are set to 1, so that the neurons are memoryless. Rather than using rate coding, whereby spikes are counted over time, we consider neuron spikes at every algorithmic time step, so we can regard our implementation as a binary neural network. The feedforward component of our network is a classic multilayer perceptron (MLP) with 3 layers, a binary activation function, and discrete (8-bit) weights. Our approach may, however, be extended to deeper networks and different encodings. In the following equations, each lowercase letter corresponds to a Boolean vector that represents a layer of spiking neurons on the chip (a spike corresponds to a 1). The inference (forward) pass through the network is computed as

h = f(W_1 x),    (1)
f(x) = H(x − 0.5),    (2)
o = f(W_2 h),    (3)

where W_i is the weight matrix of the respective layer, f is a binary activation function with a threshold of 0.5, and H denotes the Heaviside function. The forward pass thus occurs in 3 time steps as spikes are propagated through the layers. The degree to which the feedforward network's output (o) deviates from a target value (t) is quantified by the squared error,

E = ½ ‖o − t‖²,    (4)

which we would like to minimize. Performing backpropagation to achieve this requires the calculation of weight updates, which depend on the forward activations and on backward propagated local gradients d_l, which represent the amount by which the loss changes when the activity of that neuron changes:

d_2 = (o − t) • f′(W_2 h),    (5)
d_1 = (W_2^T d_2) • f′(W_1 x),    (6)
ΔW_l = −η sgn(d_l) a_{l−1}^T.    (7)

Here, • denotes a Hadamard product, i.e. the element-wise product of two vectors, ^T denotes the matrix transpose, sgn(x) is the sign function, and a_l denotes the activation of the l-th layer, f(W_l a_{l−1}), with a_0 = x, a_1 = h, a_2 = o.
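As a concrete sketch, the binarized forward pass can be written in a few lines of NumPy. The layer sizes (400-400-10) and random weights below are illustrative, not taken from the chip configuration:

```python
import numpy as np

def f(x):
    # Binary activation (threshold 0.5): a neuron "spikes" (1) or not (0)
    return (x >= 0.5).astype(float)

rng = np.random.default_rng(0)
x = (rng.random(400) < 0.25).astype(float)  # binarized input spike vector
W1 = rng.normal(0.0, 0.1, size=(400, 400))  # hidden-layer weights
W2 = rng.normal(0.0, 0.1, size=(10, 400))   # output-layer weights

h = f(W1 @ x)  # hidden layer fires in time step 2
o = f(W2 @ h)  # output layer fires in time step 3
```

Every intermediate vector is Boolean, mirroring the fact that each unit is a single spiking neuron that either fires or stays silent in a given algorithmic time step.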
Here, η denotes the learning rate, and it is the only hyperparameter of the model apart from the weight initialization. f′ denotes the derivative, but because f is a binary thresholding function (Heaviside), its derivative would be the Dirac delta function, which is zero everywhere apart from at the threshold. Therefore, we use a common method [67,68] and represent the thresholding function by a truncated (between 0 and 1) ReLU (Eq. (8)) as a surrogate or straight-through estimator when back-propagating the error. The derivative of the surrogate is a box function (Eq. (9)):

f̃(x) = min(1, max(0, x)),    (8)
f̃′(x) = H(x) H(1 − x).    (9)

The three functions (Equations (2), (8) and (9)) are plotted in the inset in Fig. 2. When performed for each target (t) in the training set, the model may be thought of as a stochastic gradient descent algorithm with a fixed step-size update for each weight in the direction of the sign of the gradient.
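A dense NumPy sketch of one training step makes the surrogate-gradient scheme concrete. The `box` function stands in for the derivative of the Heaviside activation, and the update applies a fixed step in the direction of the sign of the gradient; layer sizes, the learning rate, and the one-hot target below are illustrative:

```python
import numpy as np

def f(x):    # forward activation: Heaviside with threshold 0.5
    return (x >= 0.5).astype(float)

def box(x):  # surrogate derivative: 1 on [0, 1], 0 elsewhere
    return ((x >= 0.0) & (x <= 1.0)).astype(float)

def train_step(W1, W2, x, t, eta=0.01):
    h = f(W1 @ x)                          # forward pass
    o = f(W2 @ h)
    d2 = (o - t) * box(W2 @ h)             # local gradient at the output
    d1 = (W2.T @ d2) * box(W1 @ x)         # backpropagated local gradient
    W2 -= eta * np.sign(np.outer(d2, h))   # fixed-step (sign of gradient) update
    W1 -= eta * np.sign(np.outer(d1, x))
    return o

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.5, size=(20, 30))
W2 = rng.normal(0.0, 0.5, size=(5, 20))
x = (rng.random(30) < 0.3).astype(float)   # binarized input sample
t = np.eye(5)[2]                           # one-hot target
o = train_step(W1, W2, x, t)
```

Note that `np.sign(np.outer(d2, h))` yields updates of exactly ±η or 0 per synapse, which is what makes the rule implementable with a fixed-magnitude on-chip weight change.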

sBP on Neuromorphic Hardware
On the computational level, Equations (1)–(9) fully describe our model as it is implemented on Loihi, excluding the handling of bit-precision constraints, i.e., integer discreteness and limited value ranges. In the following, we describe how these equations are translated from the computational level to the algorithmic neural circuit level, thereby enabling implementation on neuromorphic hardware. Further details on the implementation can be found in Section IV.
a. Hebbian weight update Equation (7) effectively results in the following weight update per single synapse from presynaptic index i in layer l − 1 to postsynaptic index j in layer l:

Δw_{l,ij} = −η sgn(d_{l,j}) a_{l−1,i},    (10)

where η is the constant learning rate. To accomplish this update, we use a Hebbian learning rule [69] implementable on the on-chip microcode learning engine (for the exact implementation on Loihi, see Section IV B). Hebbian learning means that neurons that fire together, wire together, i.e., the weight update Δw is proportional to the product of the simultaneous activity of the presynaptic (source) neuron and the postsynaptic (target) neuron. In our case, this means that the values of the two factors of Equation (10) have to be propagated simultaneously, in the same time step, to the pre- (a_{l−1,i}) and postsynaptic (d_{l,j}) neurons, while the pre- and postsynaptic neurons are not allowed to fire simultaneously at any other time. For this purpose, a mechanism to control the information flow through the network is needed.
b. Gating controls the information flow As our information control mechanism, we use synfire gating [60–62,70]. A closed chain of 12 neurons, around which a single spike is perpetually sent, is the backbone of this flow control mechanism, which we call the gating chain. The gating chain controls information flow through the controlled network by selectively boosting layers to bring their neurons closer to the threshold, thereby making them receptive to input. By connecting particular layers to the gating neuron that fires in the respective time step, we lay out a path that the activity through the network is allowed to take. For example, to create the feedforward pass, the input layer x is connected to the first gating neuron and is therefore gated 'on' in time step 1, the hidden layer h is connected to the second gating neuron and gated 'on' in time step 2, and the output layer o is connected to the third gating neuron and gated 'on' in time step 3.
A schematic of this path of the activity can be found in Fig. 2. To speak in neuroscience terms, we are using synfire gating to design functional connectivity through the network anatomy shown in Supplementary Fig. 4. Using synfire gating, the local gradient d_{l,j} is brought to the postsynaptic neuron at the same time as the activity a_{l−1,i} is brought back to the presynaptic neuron, effecting a weight update. However, in addition to bringing activity at the right time to the right place for Hebbian learning, the gating chain also makes it possible to calculate and back-propagate the local gradient.
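The gating mechanism described above can be caricatured in a few lines. The step-to-layer assignment below covers only the feedforward portion of the 12-step cycle and is an illustrative reading of the schedule, not the full on-chip wiring:

```python
# A single spike circulates in a closed ring of 12 gating neurons; the layer
# connected to the currently firing gating neuron is boosted ("gated on").
GATED_LAYER = {1: "x", 2: "h", 3: "o"}  # feedforward part of the gated path

def gating_schedule(n_steps, ring_len=12):
    spike_pos = 0  # index of the single spike in the gating chain
    for step in range(1, n_steps + 1):
        yield step, GATED_LAYER.get(spike_pos + 1)  # layer gated 'on', if any
        spike_pos = (spike_pos + 1) % ring_len      # spike advances one neuron

schedule = [layer for _, layer in gating_schedule(3)]
# input, hidden, and output layers become receptive in successive time steps
```

Because the ring has period 12, the same path repeats for every sample, which is what lets the circuit interleave inference and the two learning phases deterministically.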
c. Local gradient calculation For the local gradient calculation, according to Equation (5), the error o − t and the box function derivative of the surrogate activation function (Equation (9)) are needed. Because there are no negative (signed) spikes, the local gradient is calculated and propagated back twice for a Hebbian weight update in two phases with different signs. The error o − t is calculated in time step 4 in a layer that receives excitatory (positive) input from the output layer o and inhibitory (negative) input from the target layer t, and vice versa for t − o.
The box function has the role of initiating learning when the presynaptic neuron receives a non-negative input and of terminating learning when the input exceeds 1, which is why we call the two conditions 'start' and 'stop' learning (inspired by the nomenclature of [71]). This inherent feature of backpropagation avoids weight updates that have no effect on the current output as the neuron is saturated by the nonlinearity with the current input. This regulates learning by protecting weights that are trained and used for inference when given different inputs.
To implement these two terms of the box function (9), we use two copies of the output layer that receive the same input (W_2 h) as the output layer. Using the above-described gating mechanism, one of the copies (start learning, o_<) is brought exactly to its firing threshold when it receives the input, which means that it fires for any activity greater than 0, i.e., whenever the input is not in the lower saturation region of the ReLU. The other copy (stop learning, o_>) is brought further away from the threshold (to 0), which means that if it fires, the upper saturation region of the ReLU has been reached and learning should cease.
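A minimal sketch of how the two output-layer copies jointly realize the box function (the bias values are schematic stand-ins for the gating boosts):

```python
import numpy as np

def box_from_copies(u):
    # u: common net input (W2 @ h) received by both copies of the output layer.
    o_start = (u >= 0.0).astype(float)  # 'start learning' copy o_<: boosted to
                                        # threshold, fires for any input >= 0
    o_stop = (u > 1.0).astype(float)    # 'stop learning' copy o_>: fires only
                                        # in the upper ReLU saturation region
    return o_start * (1.0 - o_stop)     # learning enabled only on [0, 1]

u = np.array([-0.5, 0.0, 0.5, 1.0, 1.5])
gate = box_from_copies(u)  # 1 exactly where the surrogate derivative is 1
```

The combination "start fired AND stop did not fire" is thus a spiking readout of the box-function derivative, without ever computing a derivative explicitly.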
d. Error backpropagation Once the local gradient d_2 is calculated as described in the previous paragraph, it is sent to the output layer as well as to its copies to bring about the weight update of W_2 and its 4 copies in time steps 5 and 9. From there, the local gradient is propagated back through the transposed weight matrices W_2^T and −W_2^T, which are copies of W_2 connected in the opposite direction and, in the case of −W_2^T, with opposite sign. Once propagated backwards, the back-propagated error is also combined with the 'start' and 'stop' learning conditions, and then it is sent to the hidden layer and its copies in time steps 7 and 11.
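Because spikes are unsigned, the signed error must travel in two phases. The following sketch shows why the two transposed copies recombine to the usual backpropagated term (a schematic, not the on-chip wiring):

```python
import numpy as np

def backprop_two_phases(W2, err):
    # Split the signed error into two non-negative spike patterns and send
    # each through its own transposed copy of the weights; W2^T and -W2^T
    # correspond to the two phases with opposite weight-update sign.
    pos_phase = np.maximum(err, 0.0)   # first learning phase
    neg_phase = np.maximum(-err, 0.0)  # second learning phase
    return W2.T @ pos_phase - W2.T @ neg_phase  # recombines to W2.T @ err

rng = np.random.default_rng(0)
W2 = rng.normal(size=(5, 20))
err = np.array([1.0, 0.0, -1.0, 0.0, 1.0])  # o - t, entries in {-1, 0, 1}
back = backprop_two_phases(W2, err)
```

Linearity guarantees the two unsigned passes sum to exactly the signed product, so no information is lost by the phase split.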

Algorithm Performance
Our implementation of the sBP algorithm on Loihi achieves an inference accuracy of 95.7% after 60 epochs (best: 96.2%) on the MNIST test data set, which is comparable with other shallow, stochastic gradient descent (SGD) trained MLP models without additional allowances. In these computational experiments, the sBP model is distributed over 81 neuromorphic cores. Processing of a single sample takes 1.48 ms (0.169 ms for inference only) on the neuromorphic cores, including the time required to send the input spikes from the embedded processor, and consumes 0.653 mJ of energy on the neuromorphic cores (0.592 mJ of which is dynamic energy, i.e. energy used by our neural circuit in addition to the fixed background energy), resulting in an energy-delay product of 0.878 µJs. All measurements were obtained using NxSDK version 0.9.9 on the Nahuku32 board nclghrd-01. Table I shows a comparison of our results with published performance metrics for other neuromorphic learning architectures that were also tested on MNIST. Table III in the Supplementary Material shows a breakdown of energy consumption and a comparison of different conditions and against a GPU. Switching off the learning engine after training reduces the dynamic energy per inference to 0.0204 mJ, which reveals that the on-chip learning engine is responsible for most of the power consumption. Because the learning circuit is not necessary for performing inference, we also tested a reduced architecture that is able to do inference within four time steps using the previously trained weights. This architecture uses 0.00249 mJ of dynamic energy and 0.169 ms per inference.
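As a plausibility check on the reported figures (our reading, not stated explicitly above: the energy-delay product appears to pair the dynamic energy with the per-sample latency):

```python
latency_s = 1.48e-3          # per-sample processing time on the neuromorphic cores
dynamic_energy_j = 0.592e-3  # dynamic energy per sample

edp_js = dynamic_energy_j * latency_s
# about 0.876e-6 J*s, i.e. ~0.88 uJ*s, consistent up to rounding with the
# reported energy-delay product of 0.878 uJ*s
```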
FIG. 2. Functional connectivity of the 2-layer backpropagation circuit. Layers are only shown when they are gated 'on', and synapses are only shown when their target is gated 'on'. Plastic connections are all-to-all (fully connected), i.e. all neurons of a layer are connected to all neurons of the next layer. The gating connections from the gating chain are one-to-all, and all other connections are one-to-one, which means that a firing pattern is copied directly to the following layer. The names of the neuron layers are given on the left margin, so that each row corresponds to a layer identity. The columns correspond to time steps of the algorithm, which are the same as the time steps on Loihi. Table II shows the information contained in each layer in each respective time step. The red background in time steps 5 and 7 indicates that in these steps, the sign of the weight update is inverted (positive), as r = 1 in Eq. (21). A detailed step-by-step explanation of the algorithm is given in Section IV C 2 and in Table II.

The sBP algorithm trains the network without explicit sparsity constraints, and yet it exhibits sparsity because of its binary (spiking) nature. After applying the binary threshold of 0.5 to the MNIST images, one image is encoded using on average 100 spikes in the input layer, which corresponds to a sparsity of 0.25 spikes per neuron per inference. This leads to a typical number of 110 spikes in the hidden layer (0.28 spikes per neuron per inference) and 1 spike in the output layer (0.1 spikes per neuron per inference). The spikes of the input and hidden layers are repeated in the two learning phases (see Fig. 2), independent of the training progress. However, the error-induced activity from the local gradient layer d_1 starts at 0.7 spikes per neuron per sample (during the first 1000 samples) and decreases to approximately 0 spikes in the trained network as the error goes towards 0.

III. Discussion

Summary
As we have demonstrated here, by using a well-defined set of neuronal and neural circuit mechanisms, it is possible to implement a spiking backpropagation algorithm on contemporary neuromorphic hardware. Previously proposed methods to address the issues outlined in Section I were not on their own able to offer a straightforward path to implement a variant of the sBP algorithm on current hardware. In this study, we avoided or solved these previously encountered issues with spiking backpropagation by combining known solutions with synfire-gated synfire chains (SGSC) as a dynamical information coordination scheme that was evaluated on the MNIST test data set on the Loihi VLSI hardware.

Solutions of Implementation Issues
The five issues (a)-(e) listed in Section I were addressed using the following solutions: (a) The weight transport issue was avoided via the use of a deterministic, symmetric learning rule for the parts of the network that implement inference (feed-forward) and error propagation (feedback), as described by [86]. This approach is not biologically plausible because of a lack of developmental mechanisms to assure the equality of corresponding weights [87]. It would, however, be feasible, without modifications to the architecture, to employ weight decay as described by Kolen and Pollack [87] to achieve self-alignment of the backward weights to the forward weights, or to use feedback alignment to approximately align the feedforward weights to random feedback weights [48]; (b) The backwards computation issue was solved by using a separate error propagation network through which activation is routed using an SGSC; (c) The gradient storage issue was solved by routing activity through the inference and error propagation circuits within the network in separate stages, thereby preventing the mixing of inference and error information. There are alternatives that would not require synfire-gated routing, but they are more challenging to implement on hardware [41,43], as also described in a more comprehensive review [5]; (d) The differentiability issue was solved by representing the step activation function by a surrogate in the form of a (truncated) ReLU activation function with an easily implementable box function derivative; and (e) The hardware constraint issue was solved by the proposed mechanism's straightforward implementation on Loihi, because it only requires basic integrate-and-fire neurons and Hebbian learning that is modulated by a single factor which is the same for all synapses.

Encoding and Sparsity
While neural network algorithms on GPUs usually use operations on dense activation vectors and weight matrices, and therefore do not profit from sparsity, spiking neuromorphic hardware only performs an addition operation when a spike event occurs, i.e., adding the weight to the input current as in Equation (19). This means that the power consumption depends directly on the number of spikes. Therefore sparsity, i.e., the property of a vector of having mostly zero elements, is important for neuromorphic algorithms [75,88], and it is also observed in biology [89]. Consequently, significant effort has been made to make SNNs sparse, moving beyond the classical rate-based approach of counting spikes [74,88,90,91]. The binary encoding used here can be seen as a limit case of the rate-based approach, allowing only 0 or 1 spike. Even without regularization to promote sparse activity, it yields very sparse activation vectors that are even sparser than most timing codes; the achievable encoded information per spike is, however, unquestionably lower. In a sense, we already employ spike timing to route spikes through the network, because the location of a spike in time within the 12 time steps determines if and where it is sent and whether the weights are potentiated or depressed. However, usage of a timing code for activations could be enabled by having more than one Loihi time step per algorithm time step. Therefore the use of SGSCs is not limited to this particular binary encoding, and in fact, SGSCs were originally designed for a population rate code.
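The cost model can be illustrated directly: on event-driven hardware the synaptic work per layer scales with the spike count, not with the layer size. A NumPy sketch (not Loihi code; layer sizes and spike counts are illustrative):

```python
import numpy as np

def event_driven_input(W, spikes):
    # Accumulate input currents one spike event at a time: each spike adds
    # one weight column, so the number of additions equals the spike count.
    u = np.zeros(W.shape[0])
    for i in np.flatnonzero(spikes):
        u += W[:, i]
    return u

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 400))
s = np.zeros(400)
s[rng.choice(400, size=100, replace=False)] = 1.0  # sparse binary activity

u_sparse = event_driven_input(W, s)  # cost: 100 column additions
u_dense = W @ s                      # dense equivalent: touches all 400 columns
```

Both paths give the same currents, but the event-driven one does work proportional to the number of 1s, which is why sparse binary activity translates directly into energy savings.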
Similarly, the routing method we use in this work is not limited to backpropagation, but it could serve as a general method to route information in SNNs where autonomous activity (without interference from outside the chip) is needed. That is, our proposed architecture can act in a similar way as or even in combination with neural state machines [92,93].

Algorithmically-Informed Neurophysiology
Although our implementation of sBP here was focused primarily on a particular hardware environment, we point out that the synfire-gated synfire chains and other network and neuronal structures that we employ could all potentially have relevance to the understanding of computation in neurophysiological systems. Many of the concepts that we use, such as spike coincidence, were originally inspired by neurophysiological experiments [60,61,94]. Experimental studies have shown recurring sequences of stereotypical neuronal activation in several species and brain regions [95–97], and particularly replay in the hippocampus [98]. Recent studies also hypothesize [99,100] and show [101] that a mechanism like gating by synfire chains may play a role in memory formation. Additional evidence [102] shows that large-scale cortical activity has a stereotypical, packet-like character that can convey information about the nature of a stimulus, or be ongoing or spontaneous. This type of apparently algorithmically-related activity has a very similar form to the SGSC-controlled information flow found previously [60–63]. Additionally, this type of sequential activation of populations is evoked by the sBP learning architecture, as seen in the raster plot in Fig. 5 in the Supplementary Material.
Other algorithmic spiking features, such as the backpropagated local gradient layer activity decreasing from 0.7 spikes per neuron to 0 by the end of training, could be identified and used to generate qualitative and quantitative hypotheses concerning network activity in biological neural systems.

Future Directions
Although the accuracy we achieve is similar to early implementations of binary neural networks in software [68], subsequent approaches now reach 98.5% [103] and generally include binarized weights. However, networks that achieve such accuracy typically employ either a convolutional structure or multiple larger hidden layers. Additional features such as dropout, softmax final layers, gain terms, and others could in principle be included in spiking hardware and may also account for this roughly 3% gap. So, while we show that it is possible to efficiently implement backpropagation on neuromorphic hardware, several non-trivial steps are still required to make it usable in practical applications: 1) The algorithm needs to be scaled to deeper networks. While the present structure is in principle scalable to more layers without major adjustments, investigation is needed to determine whether gradient information remains intact over many layers, and to what extent additional features such as alternatives to batch normalization may need to be developed.
2) Generalization to convolutional networks is compelling, in particular for application to image processing. The Loihi hardware presents an advantage in this setting because of its weight-sharing mechanisms.
3) Although our current implementation demonstrates on-chip learning, we train on the MNIST images in an offline fashion by iterating over the training set in epochs. Further research is required to develop truly continual learning mechanisms such that additional samples and classes can actually be learned without losing previously trained synaptic weights and without retraining on the whole dataset.
Additionally, the proposed algorithmic methodology can be used to inform hardware adjustments to promote efficiency for learning applications. Although our algorithm is highly efficient in terms of power usage, in particular for binary encoding, the Loihi hardware is not specifically designed for implementing standard deep learning models, but rather as general-purpose hardware for exploring different SNN applications [75].
This leads to a significant computational overhead for functions that are not needed in our model (e.g. neuronal dynamics), or that could have been realized more efficiently if integrated directly on the chip instead of using network mechanisms. Our results provide a potential framework to guide future hardware modifications to facilitate more efficient learning algorithm implementations. For example, in an upgraded version of the algorithm, it would be preferable to replace relay neurons with presynaptic (eligibility) traces to keep a memory of the presynaptic activity for the weight update.

Significance
To our knowledge, this work is the first to show an SNN implementation of the backpropagation algorithm that is fully on-chip, without a computer in the loop. Other on-chip learning approaches so far either use feedback alignment [36], forward propagation of errors [35], or single-layer training [28,30,32,72,73]. Compared to an equivalent implementation on a GPU, there is no loss in accuracy, but there are about two orders of magnitude power savings in the case of small batch sizes, which are more realistic for edge computing settings. So, this implementation shows a path for using in-memory, massively parallel neuromorphic processors for low-power, low-latency implementation of modern deep learning applications. The network model we propose offers opportunities as a building block that can, e.g., be integrated into larger SNN architectures that could profit from a trainable on-chip processing stage.

IV. Methods
In this section, we describe our system on three different levels [104]. First, we describe the computational level by fully stating the equations that result in the intended computation. Second, we describe the spiking neural network (SNN) algorithm. Third, we describe the details of our hardware implementation that are necessary for exact reproducibility.

A. The Binarized Backpropagation Model
Network Model. Backpropagation is a means of optimizing synaptic weights in a multi-layer neural network. It solves the problem of credit assignment, i.e., attributing changes in the error to changes in upstream weights, by recursively applying the chain rule. The inference (forward) pass through the network is computed as

a_l = f(W_l a_{l−1}), with a_0 = x,   (11)

where f is an element-wise nonlinearity and W_l is the weight matrix of the respective layer. The degree to which the network's output o deviates from the target values t is quantified by the squared error, E = (1/2) ‖o − t‖², which we aim to minimize. The weight updates for each layer are computed recursively by

d_L = (o − t) ∘ f′(n_L),   (12)
d_l = (W_{l+1}^T d_{l+1}) ∘ f′(n_l),   (13)
ΔW_l = −η d_l a_{l−1}^T,   (14)

where n_l is the (pre-activation) input to layer l, i.e., n_l = W_l a_{l−1} = W_l f(n_{l−1}). Here, ′ denotes the derivative, ∘ denotes a Hadamard product, i.e., the element-wise product of two vectors, T denotes the matrix transpose, and a_l denotes f(W_l a_{l−1}), with a_0 = x. The parameter η denotes the learning rate. These general equations (12)–(14) are realized for two layers in our implementation as given by (5)–(7) in Section II. Below, we relate these equations to the neural Hebbian learning mechanism used in the neuromorphic implementation of sBP. Although in theory the derivative f′ of the activation function is applied in (12), in the case that f is a binary thresholding (Heaviside) function, the derivative is the Dirac delta function, which is zero everywhere except at the threshold. We therefore use a common approach [68] and represent the activation function by a surrogate (or straight-through estimator [67]), in the form of a rectified linear map (ReLU) truncated between 0 and 1 (Eq. (16)), in the part of the circuit that affects error backpropagation. The derivative of this surrogate, f′_surrogate, has the form of a box function, i.e.
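The forward and backward passes above can be sketched in a few lines of NumPy. This is an illustrative re-implementation of Eqs. (11)–(14) with the binary activation and box-function surrogate, not the on-chip code; the layer sizes, random seed, and learning rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # binary (Heaviside) activation with threshold 0.5 in ANN units
    return (x >= 0.5).astype(float)

def f_prime(x):
    # box-function surrogate derivative: 1 on [0, 1], 0 elsewhere
    return ((x >= 0.0) & (x <= 1.0)).astype(float)

def train_step(W1, W2, x, t, eta=0.01):
    # forward pass, Eq. (11)
    n1 = W1 @ x
    a1 = f(n1)
    n2 = W2 @ a1
    o = f(n2)
    # backward pass, Eqs. (12)-(14)
    d2 = (o - t) * f_prime(n2)
    d1 = (W2.T @ d2) * f_prime(n1)
    W2 -= eta * np.outer(d2, a1)
    W1 -= eta * np.outer(d1, x)
    return o

W1 = rng.normal(0.0, 0.1, (5, 8))
W2 = rng.normal(0.0, 0.1, (3, 5))
x = rng.integers(0, 2, 8).astype(float)   # binary input vector
t = np.array([0.0, 1.0, 0.0])             # one-hot target
o = train_step(W1, W2, x, t)
```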
f_surrogate(x) = min(max(x, 0), 1),   (16)

with derivative

f′_surrogate(x) = H(x) − H(x − 1),   (17)

where H denotes the Heaviside function. In the following section, we describe how we implement Equations (11)–(16) in a spiking neural network.

Weight Initialization. Plastic weights are initialized by sampling from a Gaussian distribution with a mean of 0 and a standard deviation of √(2/(N_fanin + N_fanout)) (He initialization [105]). N_fanin denotes the number of neurons in the presynaptic layer and N_fanout the number of neurons in the postsynaptic layer.

Input Data. The images of the MNIST dataset [66] were cropped by a margin of 4 pixels on each side to remove pixels that are never active and thus avoid unused neurons and synapses on the chip. The pixel values were thresholded at 0.5 to obtain a black-and-white image for use as input to the network. In the case of the 100-300-10 architecture, the input images were additionally downsampled by a factor of 2. The dataset was presented in a different random order in each epoch.

Accuracy Calculation. The reported accuracies are calculated on the full MNIST test set. A sample was counted as correct when the index of the spiking neuron of the output layer in the output phase (time step 3 in Fig. 2) was equal to the correct label index of the presented image. In fewer than 1% of cases, there was more than one spike in the output layer; in that case, the lowest spiking neuron index was compared.
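The input pipeline above can be sketched as follows. `img` stands in for a single MNIST image scaled to [0, 1]; the function is an illustrative reconstruction of the stated steps (crop 4 pixels per side, binarize at 0.5, and, for the smaller network, downsample by averaging 2×2 blocks before thresholding).

```python
import numpy as np

def preprocess(img, downsample=False):
    # crop 4 pixels per side: 28x28 -> 20x20
    img = img[4:-4, 4:-4]
    if downsample:
        # average over 2x2 pixel grids: 20x20 -> 10x10
        img = img.reshape(10, 2, 10, 2).mean(axis=(1, 3))
    # binarize to a black-and-white image
    return (img > 0.5).astype(np.uint8)

img = np.random.default_rng(1).random((28, 28))
x400 = preprocess(img)                    # input for the 400-400-10 network
x100 = preprocess(img, downsample=True)   # input for the 100-300-10 network
```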

B. The Spiking Backpropagation Algorithm
Spiking Neuron Model. For a generic spiking neuron element, we use the current-based linear leaky integrate-and-fire model (CUBA). This model is implemented on Intel's neuromorphic research processor, codenamed Loihi [8]. Time evolution in the CUBA model as implemented on Loihi is described by the discrete-time dynamics, with t ∈ ℕ and time increment Δt ≡ 1:

U_i(t) = U_i(t−1)(1 − 1/τ_U) + Σ_j w_ij δ_j(t),
V_i(t) = V_i(t−1)(1 − 1/τ_V) + U_i(t) + I_const,

where i identifies the neuron. The membrane potential V_i(t) is reset to 0 upon exceeding the threshold V_thr and remains at its reset value 0 for a refractory period T_ref. Upon reset, a spike is sent to all connecting synapses. Here, U_i(t) represents a neuronal current, δ_j(t) represents time-dependent spiking input from presynaptic neuron j with synaptic weight w_ij, and I_const is a constant input current.

Parameters and Mapping. In our implementation of the backpropagation algorithm, we take τ_V = τ_U = 1, T_ref = 0, and I_const = −8192 (except in gating neurons, where I_const = 0). This yields a memoryless point neuron that spikes whenever its input in the respective time step exceeds V_thr = 1024. This happens when the neuron receives synaptic input larger than 0.5·V_thr and, in the same time step, a gating input overcomes the strong inhibition of I_const, i.e., it is gated 'on'. This is how the Heaviside function in Equation (15) is implemented. For the other activation functions, a different gating input is applied.
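With these parameters, the gating mechanism reduces to simple integer arithmetic. The sketch below illustrates the resulting coincidence detection using the stated constants; it mirrors the described behavior, not the actual Loihi microcode.

```python
V_THR, I_CONST = 1024, -8192

def spikes(syn_input, gate_input):
    # memoryless point neuron (tau_V = tau_U = 1): the potential is just
    # the summed input of the current time step plus the constant current
    v = syn_input + gate_input + I_CONST
    return v > V_THR

W_GATE = -I_CONST + V_THR // 2   # standard gating weight: 8704

no_gate = spikes(600, 0)          # input alone cannot overcome I_const
no_input = spikes(0, W_GATE)      # gating alone only reaches 0.5*V_thr
both = spikes(600, W_GATE)        # gated 'on' AND input > 0.5*V_thr
```

Because the neuron carries no state between time steps, the spike condition is a logical AND of "gated on" and "sufficient synaptic input", which is exactly the coincidence detection described in the gating-chain section.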
There is a straightforward mapping between the weights and activations in the spiking neural network (SNN) described in this section and the corresponding artificial neural network (ANN) described in Section IV A:

W_ANN = W_Loihi / V_thr.   (20)

So, a value of V_thr = 1024 allows for a maximal ANN weight of 0.25, because the allowed weights on Loihi are the even integers from −256 to 254.
Feed-forward Pass. The feedforward pass can be seen as an independent circuit module consisting of 3 layers: an input layer x with 400 (20×20) neurons that spikes according to the binarized MNIST dataset, a hidden layer h of 400 neurons, and an output layer o of 10 neurons. The 3 layers are sequentially gated 'on' by the gating chain, so that activity travels from the input layer to the hidden layer through the plastic weight matrix W_1, and then from the hidden to the output layer through the plastic weight matrix W_2.

Learning Rule. The plastic weights follow a standard Hebbian learning rule with a global third factor that controls the sign. Note, however, that here, unlike in other work with Hebbian learning rules, due to the particular activity routed to the layers, the learning rule implements a supervised mechanism (backpropagation). The discrete update equation as implemented on Loihi is

Δw(t) = 2 x(t) y(t) (2 r(t) − 1).   (21)

Above, x, y, and r represent time series that are available at the synapse on the chip. The signals x and y are the pre- and postsynaptic neuron's spike trains, i.e., they are equal to 1 in time steps when the respective neuron fires, and 0 otherwise. The signal r is a third factor that is provided to all synapses globally and determines which phase (potentiation or depression) the algorithm is in. With T = 12 phases per trial, r is 0 in all time steps apart from the 5th and 7th of each iteration, where the potentiation of the weights happens. This regularly occurring r signal could thus be generated autonomously using the gating chain. On Loihi, r is provided using a so-called "reinforcement channel". Note that the reinforcement channel can only provide a global modulatory signal that is the same for all synapses. The above learning rule produces a positive weight update in time steps in which all three factors are 1, i.e., when both pre- and postsynaptic neurons fire and the reinforcement channel is active. It produces a negative update when only the pre- and postsynaptic neurons fire, and the weight stays the same in all other cases.
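The three-factor rule can be written compactly. The closed form below is a reconstruction consistent with the potentiation/depression behavior described above; the exact on-chip formulation may differ in bookkeeping details.

```python
def dw(x, y, r):
    # three-factor Hebbian update: coincident pre/post spikes (x*y)
    # potentiate by +2 when the reinforcement signal r is active,
    # depress by -2 when it is not; the step size of 2 matches the
    # even-integer weight grid on Loihi
    return 2 * x * y * (2 * r - 1)
```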
To achieve the correct weight update according to the backpropagation algorithm (see (10)), the spiking network has to be designed such that the presynaptic activity a_{l−1,i} and the local gradient d_{l,j} are present in neighboring neurons at the same time step. Furthermore, the sign of the local gradient has to determine whether the simultaneous activity happens in a time step with an active third factor r or not.
This requires control over the flow of spikes in the network, which is achieved by a mechanism called synfire gating [64, 65], which we adapt and simplify here.

Gating Chain. Gating neurons form a separate structure within the backpropagation circuit and are connected in a ring. That is, the gating chain is a circular chain of neurons that, once activated, controls the timing of information processing in the rest of the network. This allows information routing throughout the network to be autonomous, realizing the benefits of neuromorphic hardware without the need for control by a classical sequential processor. Specifically, the neurons in the gating chain are connected to the relevant layers of the network, which allows them to control when and where information is propagated. All layers are by default inhibited far below their firing threshold, as described in Section IV B, and can only transmit information, i.e., generate spikes, if this inhibition is neutralized via activation by the gating chain. Because a neuron only fires if it is gated 'on' AND receives sufficient input, such gating corresponds to a logical AND, or coincidence detection, with the input.
In our implementation, the gating chain consists of 12 neurons, which induce 12 algorithmic (Loihi) time steps that are needed to process one sample. Each neuron is connected to all layers that must be active in each respective time step. The network layers are connected in the manner shown in Supplementary Fig. 4, but the circuit structure, which is controlled by the gating chain, results in a functionally connected network as presented in Fig. 2, where the layers are shown according to the timing of when they become active during one iteration.
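The ring structure can be sketched as a one-hot state vector advancing through a cyclic connection matrix; the wiring from the gating neurons to the network layers is omitted here for brevity.

```python
import numpy as np

N = 12  # one gating neuron per algorithmic time step of a sample

# neuron i excites neuron i+1 (mod N): a circular chain
ring = np.roll(np.eye(N, dtype=int), 1, axis=0)

state = np.zeros(N, dtype=int)
state[0] = 1                      # kick off the chain once
history = []
for _ in range(2 * N):            # simulate two full samples
    history.append(int(np.argmax(state)))
    state = ring @ state          # advance one time step
```

Once started, exactly one gating neuron is active per time step and the pattern repeats every 12 steps, so each training sample occupies one cycle of the ring without any external sequencing.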
The weight of the standard gating synapses (from one of the gating neurons to each neuron in a target layer) is w_gate = −I_const + 0.5 V_thr, i.e., each neuron that is gated 'on' is brought to half of its firing threshold, which effectively implements Eq. (15). In four cases, i.e., for the synapses to the start learning layers in time steps 2 (h_<) and 3 (o_<) and to the backpropagated local gradient layer d_1 in time steps 6 and 10, the gating weight is w_gate = −I_const + V_thr. In two cases, i.e., for the synapses to the stop learning layers in time steps 2 (h_>) and 3 (o_>), the gating weight is w_gate = −I_const. These different gating inputs lead to step activation functions with different thresholds, as required for the computations explained below in Section IV B 3.

Backpropagation Network Modules. In the previous sections, we have explained how the weight update happens and how to bring the relevant values (a_{l−1} and d_l according to (10)) to the correct place at the correct time. In this section, we discuss how these values are actually calculated. The signal a_{l−1}, which is the layer activity from the feedforward pass, does not need to be calculated but only remembered. This is done using a relay layer with synaptic delays, as explained in Section IV B 1. The signal d_2, the last-layer local gradient, consists of two parts according to (5): the difference between the output and the target, o − t (see Section IV B 2), and the box function f′ must be calculated. We factorize the latter into two terms, a 'start' and a 'stop' learning signal (see Section IV B 3). The signal d_1, the backpropagated local gradient, also consists of two parts, according to Eq. (6). In addition to another 'start' and 'stop' learning signal, we need sgn(W_2^T d_2), whose computation is explained in Section IV B 4.
In the following equation, the first-layer weight update is annotated with the number of the paragraph (IV B 1–4) in which the calculating module is described:

ΔW_1 = −η ( sgn(W_2^T d_2)_{[4]} ∘ b_{h,[3]} ) x^T_{[1]},  with d_2 containing the error (o − t)_{[2]}.

Relay Neurons
The memory used to store the activity of the input and the hidden layer is a single relay layer, connected both from and to the respective layer in a one-to-one manner with the proper delays. The input layer x sends its activity to the relay layer m_x, so that the activity can be restored in the W_1 update phases in time steps 7 and 11. The hidden layer h sends its activity to the relay layer m_h, so that the activity can be restored in the W_2 update phases in time steps 5 and 9.

Error Calculation
The error calculation requires a representation of signed quantities, which is not directly possible in a spiking network because there are no negative spikes. This is achieved here by splitting the error evaluation into two parts, t − o and o − t, which yield the positive and negative components separately. Similarly, the calculation of the back-propagated local gradients d_1 is performed using a negative copy of the transposed weight matrix, in two phases for the positive and the negative local gradient, respectively. In the spiking neural network, t − o is implemented by an excitatory synapse from t and an inhibitory synapse of the same strength from o, and vice versa for o − t. As for almost all other non-plastic synapses in the network, the absolute weight of these synapses is just above the value that makes the target neuron fire when gated 'on'. The difference between the output and the target is, however, only one part of the local gradient d_2; the other part is the derivative of the activation function (the box function).
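The two-population trick can be sketched directly in NumPy; rectification stands in for the fact that a neuron receiving net inhibition simply does not spike.

```python
import numpy as np

def signed_error(o, t):
    # since spikes cannot be negative, the signed error t - o is split
    # into two non-negative populations: one carries t - o (excitation
    # from t, equal inhibition from o), the other carries o - t
    pos = np.maximum(t - o, 0.0)   # drives potentiation
    neg = np.maximum(o - t, 0.0)   # drives depression
    return pos, neg

o = np.array([0.0, 1.0, 0.0])      # network output spikes
t = np.array([1.0, 0.0, 0.0])      # target spikes
pos, neg = signed_error(o, t)
```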

Start and Stop Learning Conditions
The box function (17) can be split into two conditions: a 'start' learning and a 'stop' learning condition. These two conditions are calculated in parallel with the feedforward pass. The feedforward activation f(x) = H(x − 0.5 V_thr) corresponding to Eq. (1) is an application of the spiking threshold to the layer's input with an offset of 0.5 V_thr, which is given by the additional input from the gating neurons. The first term of the box function (9), H(x), is also an application of the spiking threshold, but this time with an offset equal to the firing threshold, so that any input larger than 0 elicits a spike. We call this first term the 'start' learning condition; it is represented in h_< for the hidden and in o_< for the output layer. The second term of the box function in Eq. (9), −H(x − V_thr), is likewise an application of the spiking threshold, but without any offset, so that only an input larger than the firing threshold elicits a spike. We call this second term the 'stop' learning condition; it is represented in h_> and o_> for the hidden and output layers, respectively. For the W_1 update, the two conditions are combined in a box function layer b_h = h_< − h_>, which then gates the d_1 local gradient layer. For the W_2 update, the two conditions are applied directly to the d_2 layers, because an intermediate b_o layer would waste one time step. The function is, however, the same: the 'stop' learning signal o_> inhibits the d_2 layers and the 'start' learning signal o_< gates them. In our implementation, the two conditions are obtained in parallel with the feedforward pass, which requires two additional copies of each of the two weight matrices. An alternative method, which would avoid these copies at the cost of additional time steps, would perform this computation sequentially, reusing the synapses and layers of the feedforward pass three times per layer with different offsets, and then route the outputs to their respective destinations.
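In chip units, the two conditions and their combination can be sketched as follows (illustrative inputs, with V_thr = 1024).

```python
import numpy as np

V_THR = 1024

def start(x):
    # gating offset equal to the firing threshold: any positive
    # input elicits a spike -> H(x)
    return (x > 0).astype(int)

def stop(x):
    # no offset: only input above the firing threshold elicits a
    # spike -> H(x - V_thr)
    return (x > V_THR).astype(int)

x = np.array([-100, 500, 2000])   # example layer inputs in chip units
b = start(x) - stop(x)            # box function H(x) - H(x - V_thr)
```

Only the middle input (positive but below V_thr) falls inside the box, so only that neuron's gradient is allowed to drive a weight update.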

Error Backpropagation
Error calculation, gating by the 'start' learning signal, and inhibition by the 'stop' learning signal are combined in time step 4 to calculate the last-layer local gradients d_2^+ and d_2^−. From there, the local gradients are routed into the output layer and its copies for the last-layer weight update. This happens in two phases: the positive local gradient d_2^+ is connected without delay, so that it leads to potentiation of the forward and backward last-layer weight matrices in time step 5. The negative local gradient is connected with a delay of 4 time steps, so that it arrives in the depression phase in time step 9. For the connections to the o_−^T layer, which is connected to the negative copy −W_2^T, the opposite delays are used to obtain a weight update with the opposite sign. See Fig. 2 for a visualization of this mechanism. Effectively, this leads to the last-layer weight update

ΔW_2 = η ( d_2^+ a_1^T − d_2^− a_1^T ),

where the first term on the right-hand side is non-zero when the local gradient is positive, corresponding to an update happening in the potentiation phase, and the second term is non-zero when the local gradient is negative, corresponding to an update happening in the depression phase. The functions f and f′ are as described in Equations (15) and (9). The local gradient activation in the output layer does not only serve the purpose of updating the last-layer weights; it is also directly used to backpropagate the local gradients. Propagating the signed local gradients backwards through the network layers requires a positive and a negative copy of the transposed weights, W_2^T and −W_2^T, which are the weight matrices of the synapses between the output layer o and the back-propagated local gradient layer d_1, and between o_−^T and d_1, respectively.
Here o_−^T is an intermediate layer created exclusively for this purpose. The local gradient is propagated backwards in two phases. The part of the local gradient that leads to potentiation is propagated back in time steps 5 to 7, and the part that leads to depression of the W_1 weights is propagated back in time steps 9 to 11. In time step 6, the potentiating local gradient is calculated in layer d_1 as

d_1^+ = H( W_2^T d_2^+ + (−W_2^T) d_2^− ),

and in time step 10 the depressing local gradient is calculated in layer d_1 as

d_1^− = H( W_2^T d_2^− + (−W_2^T) d_2^+ ).

Critically, this procedure does not simply separate the weights by sign, but rather maintains a complete copy of the weights that is used to associate the appropriate sign information with the back-propagated local gradient values. Note that here the Heaviside activation function H(x) is used rather than the binary activation function f = H(x − 0.5 V_thr), so that any positive gradient will induce an update of the weights; any positive threshold in this activation would make the learning insensitive to small gradient values and thus lead to poor performance. The transposed weight copies must be kept in sync with their forward counterparts, so the updates in the potentiation and depression phases are also applied to the forward and backward weights concurrently. In total, after each trial, the actual first-layer weight update is the sum of four different parts, corresponding to the four combinations of {d_2^+, d_2^−} routed through {W_2^T, −W_2^T}:

ΔW_1 = η [ H(W_2^T d_2^+ + (−W_2^T) d_2^−) − H(W_2^T d_2^− + (−W_2^T) d_2^+) ] ∘ b_h · x^T.

These four terms are necessary because, e.g., a positive error can also lead to depression if backpropagated through a negative weight matrix, and vice versa.
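The two-phase computation of the signed backpropagated gradient can be sketched as follows; the example weights and gradients are illustrative.

```python
import numpy as np

def H(x):
    # Heaviside with zero threshold: any positive input spikes
    return (x > 0).astype(int)

def backprop_phases(W2, d2_pos, d2_neg):
    # potentiation phase (time step 6): positive gradient through W2^T
    # combined with negative gradient through the negative copy -W2^T
    d1_pot = H(W2.T @ d2_pos - W2.T @ d2_neg)
    # depression phase (time step 10): the opposite pairing
    d1_dep = H(W2.T @ d2_neg - W2.T @ d2_pos)
    return d1_pot, d1_dep

W2 = np.array([[1.0, -1.0],
               [2.0,  1.0]])
d2_pos = np.array([1.0, 0.0])   # potentiating part of the local gradient
d2_neg = np.array([0.0, 1.0])   # depressing part of the local gradient
d1_pot, d1_dep = backprop_phases(W2, d2_pos, d2_neg)
```

Together the two phases recover the positive and negative parts of sgn(W_2^T d_2): here W_2^T (d_2^+ − d_2^−) = [−1, −2], so no hidden unit is potentiated and both are depressed.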

C. The sBP Implementation on Loihi
Partitioning on the Chip. To distribute the spike load over cores, the neurons of each layer are approximately equally distributed over 96 cores of a single chip. This distribution is advantageous because only a few layers are active at each time step, and Loihi processes spikes within a core sequentially. In total, the network as presented here needs 2N_in + 6N_hid + 7N_out + 12N_gat neurons. With N_in = 400, N_hid = 400, N_out = 10, and N_gat = 1, this amounts to 3282 neurons and about 200k synapses, most of which belong to the 3 plastic all-to-all connections between the input and the hidden layer.
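As a quick check of the stated neuron budget:

```python
def total_neurons(n_in, n_hid, n_out, n_gat=1):
    # neuron count formula from the text: the input layer appears twice
    # (layer + relay), the hidden layer six times, the output layer
    # seven times, and the gating chain twelve times
    return 2 * n_in + 6 * n_hid + 7 * n_out + 12 * n_gat

n = total_neurons(400, 400, 10)   # the 400-400-10 configuration
```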
Learning Implementation. The learning rule on Loihi is implemented as given in Eq. (21). Because the precision of the plastic weights on Loihi is at most 8 bits, with a weight range from −256 to 254, we can only change the weight in steps of 2 without making the update nondeterministic. This is necessary for keeping the various copies of the weights in sync (hence the factor of 2 in Eq. (21)). With V_thr = 1024, this corresponds to a learning rate of η = 2/1024 ≈ 0.002. The learning rate can be changed by increasing the factor in the learning rule, which reduces the usable weight precision, or by changing V_thr, which changes the range of possible weights according to Eq. (20). Several learning rates (settings of V_thr) were tested, with the result that the final accuracy is not very sensitive to small changes; the learning rate that yielded the best accuracy is reported here. In the NxSDK API, the neuron traces used for the learning rule are x0, y0, and r1 for x(t), y(t), and r(t) in Eq. (21), respectively. r1 was used with a decay time constant of 1, so that it is only active in the respective time step, effectively corresponding to r0. To provide the r signal, a single reinforcement channel was used, activated by a regularly firing spike generator in time steps 5 and 7.
Weight Initialization. After the He initialization described in Section IV A, the weights are mapped to Loihi weights according to (20). Then, the weights are truncated to [−240, 240] and discretized to 8-bit resolution, i.e., steps of 2, by rounding them to the next valid number towards 0.
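These initialization and discretization steps can be sketched as follows. The mapping w_Loihi = V_thr · w_ANN is our reading of Eq. (20), and the layer sizes are illustrative.

```python
import numpy as np

V_THR = 1024

def to_loihi(w_ann):
    # map ANN weights to chip units and truncate to [-240, 240]
    w = np.clip(V_THR * w_ann, -240, 240)
    # discretize to the even-integer grid (8 bits, steps of 2),
    # rounding to the next valid number towards 0
    return (np.trunc(w / 2) * 2).astype(int)

rng = np.random.default_rng(0)
fan_in, fan_out = 400, 400
std = np.sqrt(2.0 / (fan_in + fan_out))          # He-style std from IV A
w = to_loihi(rng.normal(0.0, std, size=(fan_out, fan_in)))
```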
Power Measurements. All Loihi power measurements are obtained using NxSDK version 0.9.9 on the Nahuku32 board ncl-ghrd-01. Both the software API and the hardware were provided by Intel Labs. All other probes, including the output probes, are deactivated. For the inference measurements, we use a network that consists only of the three feedforward layers with non-plastic weights and the gating chain of four neurons. The power was measured for the first 10000 samples of the training set for the training measurements and all 10000 samples of the test set for the inference measurements.

[Displaced figure caption:] (1) An error (target but no output spike) leads to potentiation of the W2 synaptic weight and the positive transpose; (2) the same error leads to depression of the negative transpose via activity of d1; (3) no error, because o and t fire at the same location, i.e., there is no update in this iteration; (4) there is an error (t fires at index 4, but o at index 7), but the local gradient is 0: it is gated 'off' at index 7 because the derivative of the activation function is 0 there, i.e., both o< and o> fire, and it is not gated 'on' at index 4, because o< does not fire; (5) a local gradient (output but no target) leads to potentiation of the weight of the synapses from oT− to d1 (red), and (6) depression of the h−o and d1−o synaptic weights; (7) the orange spikes show the back-propagated local gradient from (1), which leads to potentiation of the x−h weights. Note that for visualization purposes, the gating from bh is applied one time step later, directly to h, h< and h>. That is, the orange spikes in time step 7 are the full backpropagated error, but only the neurons that are also gated 'on' by the combination of h< and h> are actually active in the potentiation phase in time step 8. (8) Same as (7), but the error from (2) leads to depression of the x−h weights.

TABLE III. Breakdown of power consumption of the chip for different cases.
'ds100-300-10' means that the network is run with 100 input and 300 hidden layer neurons, with the input downsampled by 2. 'Start' means that the measurement is taken in the first 10000 iterations of the first training epoch, and 'end' means that it is taken with the fully trained model. 'Learning engine off' means that the weights are set to non-plastic after training, so that the on-chip circuitry that handles learning is not active. The inference net, which uses the same weights as the fully trained network, consists only of the 3 feedforward layers and the gating chain.
• Excitatory one-to-all gating (green) synapses from a single gating-chain neuron to all neurons of a layer, usually with a fixed weight of w_g = −I_const + 0.5 V_thr = 8704. These standard gating synapses are used to gate 'on' a layer that is supposed to have the feedforward activation function f. Different weights, w_g> = −I_const + V_thr = 9216 for the 'start' learning condition and w_g< = −I_const = 8192 for the 'stop' learning condition, are used to obtain the two step functions from which the box function is built.
• Excitatory one-to-one gating (yellow) synapses from the start learning layer o_< and the box function layer b_h to the respective local gradient layers. The w_g weight is used for gating of the d_2 layers in time step 4, and the w_g> weight is used for gating of the d_1 layer in time steps 6 and 10.
To a certain extent, within the precision boundaries of the chip, the scaling of the weights is arbitrary. All plastic synapses and most other synapses have a synaptic delay of 0 time steps. Some synapses, specifically those originating from the relay layers, the d_2 layers, and the b_h layer, have delays of up to 6 time steps, i.e., their output spikes affect the target neuron several time steps later. The delays, corresponding to the horizontal arrow length, can be read from Fig. 2. Table III above refers to network structures with input sizes of 400 and 100. The 400-400-10 network used a set of trimmed images, where whitespace was removed to reduce the required size of the input layer; for this, a border of width 4 was removed from the MNIST images. For the smaller 100-300-10 network, these images were then downsampled by averaging over 2 × 2 pixel grids and binarized by thresholding at 0.5, giving a set of 10 × 10 binary images.

Tensorflow Implementation
For the feedforward network, our GPU TensorFlow implementation is similar to the binary network presented in [68], but with continuous weights, and simplified to correspond to the neuromorphic design. The loss function was changed to mean squared error and minimized with stochastic gradient descent with a fixed learning rate. Dropout and batch normalization were removed, and the batch size was set to 1.
The network was initialized with Glorot-style parameters for each layer l, i.e., γ_l = √(1.5 / (N_{l−1} + N_l)), where N_l is the number of neurons in layer l. For each layer, weights were generated uniformly on the interval [−γ_l, γ_l] and clipped to these values during learning. While these initializations do not take into account the binary activation functions of our network, tests with alternative initializations showed little or no improvement, likely due to the shallow depth of the networks considered.
Gradients on the binary network in [68] were calculated with a 'straight-through' estimator, in which the derivative of the binary activation function

σ(x) = 1 if x ≥ 0, and 0 otherwise,   (29)

was approximated as dσ/dx ≈ 1_{|x| ≤ 1}. This same gradient calculation is maintained here, but to mimic the binary gradient signal in the neuromorphic implementation, the final gradient with respect to each weight is thresholded: if the gradient calculated with respect to weight i is g_i, the applied gradient in the learning step is sgn(g_i) σ(|g_i| − t) for a threshold t representing the threshold of signal propagation in the neuromorphic implementation, and σ as defined above.
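The thresholded gradient rule can be sketched as follows. The box-shaped surrogate derivative 1_{|x| ≤ 1} follows the straight-through estimator of [68]; the example values are illustrative.

```python
import numpy as np

def surrogate_grad(x):
    # straight-through estimator: pass the gradient only where the
    # pre-activation lies in [-1, 1]
    return (np.abs(x) <= 1).astype(float)

def thresholded(g, t):
    # binarized gradient signal: sgn(g_i) * sigma(|g_i| - t),
    # i.e., sgn(g_i) where |g_i| >= t, and 0 otherwise
    return np.sign(g) * (np.abs(g) - t >= 0)

g = np.array([0.5, -0.01, -2.0, 0.0])   # example per-weight gradients
applied = thresholded(g, t=0.1)         # gradient actually applied
```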