Shape-Dependent Multi-Weight Magnetic Artificial Synapses for Neuromorphic Computing

In neuromorphic computing, artificial synapses provide a multi-weight conductance state that is set based on inputs from neurons, analogous to the brain. Additional properties of the synapse beyond multiple weights may be needed depending on the application, requiring different synapse behaviors to be generated from the same materials. Here, we measure artificial synapses based on magnetic materials that use a magnetic tunnel junction and a magnetic domain wall. By fabricating lithographic notches in a domain wall track underneath a single magnetic tunnel junction, we achieve 4-5 stable resistance states that can be repeatably controlled electrically using spin-orbit torque. We analyze the effect of geometry on the synapse behavior, showing that a trapezoidal device has asymmetric weight updates with high controllability, while a straight device has higher stochasticity but stable resistance levels. The device data are input into neuromorphic computing simulators to show the usefulness of application-specific synaptic functions. Implementing an artificial neural network applied to streamed Fashion-MNIST data, we show that the trapezoidal magnetic synapse can be used as a metaplastic function for efficient online learning. Implementing a convolutional neural network for CIFAR-100 image recognition, we show that the straight magnetic synapse achieves near-ideal inference accuracy due to the stability of its resistance levels. This work shows that multi-weight magnetic synapses are a feasible technology for neuromorphic computing and provides design guidelines for emerging artificial synapse technologies.


INTRODUCTION
The human brain is incredibly efficient at processing unstructured information in real time. In the brain, synapses control the degree of connectivity between neurons, and this has been analogously implemented by synapses in artificial neural networks (NNs). An artificial synapse implemented in an analog electrical device has multi-weight (MW) conductance states that can be set based on inputs from neurons. Emerging understanding of the brain shows that the synaptic connection has many useful properties beyond MW that aid in learning 1,2 . Analogous properties in artificial synapses have been shown to be beneficial in NNs, but the desired properties of the synapse depend on the application. One desired property is controllability, i.e., a given input sets the synapse to a given state (a weight in the context of a NN), while still retaining a certain degree of stochasticity [3][4][5] to avoid local minima. Neural networks that behave probabilistically are also increasingly used to emulate the stochastic spiking behavior of neurons, as well as to quantify the network's confidence in its predictions 6 . Some applications, such as online training using backpropagation, have additional requirements such as symmetry, i.e., the MW state has equal but opposite responses to positive and negative electrical stimuli 7 . By contrast, single-shot programming 8 , where a given voltage brings the synapse directly to a given weight, benefits from asymmetry, i.e., the various weight states can be set in a positive voltage direction and fully reset in the opposite direction.
While a number of materials classes are being used as synapses, including resistive random-access memory (RRAM) 9 and ferroelectric memory (Fe-RAM) 10 , spintronic solutions benefit from being naturally stochastic 11 , and magnetic random-access memory (MRAM) has been shown to have high endurance 12 , scalability down to 25 nm feature sizes 13 , and CMOS compatibility 12 . To achieve a number of conductance states per device, this paper focuses on the three-terminal domain wall-magnetic tunnel junction (DW-MTJ) device, where a magnetic domain wall (DW) is electrically pushed back and forth underneath a magnetic tunnel junction (MTJ), changing the MTJ's conductance. For neuromorphic computing, the DW-MTJ may offer additional benefits, including the realization of various bio-mimetic functions in the same film stack through modifications to the device geometry. For example, the same film stack can behave both as a synapse and as a leaky integrate-and-fire neuron with intrinsic leaking [14][15][16]17,18 . The DW-MTJ can also exhibit device-to-device magnetic field interactions that mimic the connectivity of the brain 19 , and it can be used in both artificial neural networks (ANNs) 14 and spiking neural networks (SNNs) 15 . In Fig. 1a-b, the DW track was fabricated in a trapezoidal shape with one edge 4× wider than the other, while in Fig. 1c-d the track is a straight rectangle. In the reset direction, the DW does not exhibit MW switching because depinning from the rightmost notch requires the maximum voltage of 3.7 V (with a coefficient of variation of 1.35%), slightly above the maximum switching voltage in the MW direction of 3.5 V. Since MW switching only occurs in one direction, we achieve a dependable reset at the highest voltage amplitude: the DW will not get stuck at one of the intermediate weights during reset.
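As an illustrative sketch of this operating principle (an assumed linear-interpolation model, not the authors' code), the MTJ conductance can be treated as a parallel combination of the portions of the free layer that are parallel and antiparallel to the reference layer, with the notches quantizing the allowed DW positions:

```python
import numpy as np

# Illustrative parallel-conduction model of a DW-MTJ synapse. Assumption:
# conductance interpolates linearly between antiparallel and parallel states
# as the DW sweeps under the MTJ; values loosely based on the straight device
# measured later in the text.
G_P, G_AP = 1.0 / 913.0, 1.0 / 1061.0       # siemens
NOTCHES = np.linspace(0.0, 1.0, 5)          # 5 notches -> 5 stable DW positions

def mtj_conductance(dw_position):
    """Fraction `dw_position` of the free layer under the MTJ is parallel."""
    x = float(np.clip(dw_position, 0.0, 1.0))
    return x * G_P + (1.0 - x) * G_AP

levels = [mtj_conductance(x) for x in NOTCHES]  # the discrete synaptic weights
```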
[Figure caption fragment: b, The voltage amplitude is increased between 0 and 5.5 V over ten cycles; each color shows the response for one cycle. c, State distribution map showing the likelihood for the DW to be in a particular notch at a particular voltage; at a given voltage, the height of a colored region is proportional to the probability that the DW occupies the corresponding notch.]

The straight synapse shows a 150 Ω difference between the maximum and minimum MTJ resistances. The reduced TMR compared to the trapezoidal synapse is due to additional resistance at the junction and/or contacts. Nevertheless, perpendicular magnetic anisotropy and a similarly offset minor field loop are observed.
The straight synapse's stochastic MW switching behavior is shown in Fig. 3b. Following the same method and parameters as the trapezoidal synapse, the device is first saturated in a DC bias field of 120 mT (−ẑ) to set the magnetization of the free and reference layers fully antiparallel. After saturation, a DC bias field of 4 mT (−ẑ) is set and 50 ns voltage pulses V are applied between the IN and CLK terminals with increasing amplitude. The DW is re-initialized and the process is repeated over ten cycles. Five resistance levels are observed, showing the DW is pinned by all five notches present. While the stochasticity in V_C is evident over the 10 cycles, the resistance of each MW level is extremely stable, with R_MTJ = 1061.0 ± 1.1 Ω, 1043.3 ± 2.6 Ω, 989.0 ± 2.1 Ω, 929.5 ± 0.8 Ω, and 913.1 ± 0.4 Ω at each respective notch. Here, the depinning energy for each notch is nominally identical, so the final DW position is not a function of the voltage alone, and the DW is more liable to move over multiple notches with a single pulse. This process is random; the switching behavior can be statistically analyzed but not accurately predicted, making it potentially useful for stochastic neural networks. This contrasts with the trapezoidal synapse, which showed repeatable, voltage-controlled MW switching with less variation.
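A toy Monte Carlo picture of this behavior (all parameters here are assumptions for illustration, not fits to the measured data) shows why nominally identical notches yield stochastic switching voltages and occasional multi-notch hops:

```python
import numpy as np

rng = np.random.default_rng(0)

def depin_prob(v, v_c=3.0, width=0.3):
    """Assumed sigmoidal depinning probability around a nominal V_C."""
    return 1.0 / (1.0 + np.exp(-(v - v_c) / width))

def ramp_cycle(n_notches=5, v_max=5.5, n_steps=50):
    """One voltage ramp: record which notch the DW occupies after each pulse."""
    pos, trace = 0, []
    for v in np.linspace(0.0, v_max, n_steps):
        # Each notch ahead is cleared independently with depin_prob(v), so a
        # single pulse can carry the DW over several notches at once.
        while pos < n_notches - 1 and rng.random() < depin_prob(v):
            pos += 1
        trace.append(pos)
    return trace

cycles = [ramp_cycle() for _ in range(10)]  # analogous to the ten measured cycles
```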
The stochastic behavior is further exemplified in Fig. 3c, where the probability of being in a particular resistance state is plotted against the voltage pulse amplitude, showing five regions that correspond to the five notches. Previous work has investigated constant voltage pulse measurements in binary MTJs 24 , where a constant current pulse is sent through the MTJ stack for a fixed number of times, and the probability that the magnetization of the free layer switches is a function of the temperature, with the switching probability emulating a sigmoid. Here, only one current pulse is sent for each voltage, and the voltage is continuously ramped upwards. This allows for a reduction in the number of pulses required, since the bias, and therefore probability, is based on a single voltage pulse. In addition, while variation in weight updates is usually detrimental to neural network accuracy, stochastic updates similar to what is shown in Fig. 3c can offset the negative impact of weight quantization. Small updates in deterministically quantized systems are lost when the update is not large enough to advance the synaptic weight to the next level. Similar to stochastic rounding techniques in software, a notch position switch probabilistically occurs in the DW-MTJ system even with a small update, creating an effectively higher resolution than the number of levels provided by the notches when averaged over thousands of updates, which we have previously shown leads to greater neural network accuracy 7 .
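The stochastic-rounding argument can be illustrated with a short sketch (hypothetical values, not the authors' simulation code): an update smaller than one quantization step is lost under deterministic rounding but preserved in expectation under stochastic rounding:

```python
import numpy as np

rng = np.random.default_rng(0)
levels = np.linspace(-1.0, 1.0, 5)   # a few quantized weight levels
step = levels[1] - levels[0]

def stochastic_round(w):
    """Round w to an adjacent level with probability proportional to proximity."""
    lower = np.floor((w - levels[0]) / step) * step + levels[0]
    p_up = (w - lower) / step
    return lower + step * (rng.random(np.shape(w)) < p_up)

# A small update (step/10) is lost deterministically but preserved on average:
w = np.full(10_000, levels[2])
w_updated = stochastic_round(w + step / 10)
print(w_updated.mean() - levels[2])  # ~= step/10: higher effective resolution
```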
The prototypes are 250-350 nm wide with 500 nm-1 μm long MTJs: this size made the nanofabrication more feasible, but also increased the likelihood of MTJ breakdown through pinholes and defects in the relatively large-area MTJ. We have shown, in simulation, scaling to 50 nm widths and comparison of the predicted scaled behavior to other emerging synaptic memory devices 7 . This feature size has been achieved in fabricated MTJ devices 13 . The necessary scaled device length depends on the number of states needed. For example, obtaining 8 states with a minimum feature size of 15 nm (requiring 30 nm separation between notches) would require the notches and MTJ to span a total length of 240 nm.
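Written out, the assumed 30 nm notch pitch gives:

$$L_{\min} = 8 \text{ notches} \times 30\,\mathrm{nm} = 240\,\mathrm{nm}.$$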
In the following sections, we use the device data to inform model parameters and then simulate neuromorphic computing tasks to show the usefulness of the two different geometries: the trapezoidal synapse for stream learning and the straight synapse for inference.

STREAM LEARNING TASK USING TRAPEZOIDAL MAGNETIC SYNAPSE GEOMETRY
In this section, we show how a trapezoidal DW-MTJ can be optimized for stream learning by taking advantage of tunable nonlinearity. A nonlinear update response, where the amount of conductance change produced by a given pulse depends on the initial conductance, has been shown to have beneficial properties. For example, nonlinear synapses in spiking network arrays using spike-timing-dependent plasticity train faster and more accurately than linear synapses 28 . In another example, which we will focus on here, binarized NN synapses with nonlinear metaplastic hidden weights have been shown to combat catastrophic forgetting and reduce overfitting in low-information situations 29 . In Reference 29 , the update is applied to the full analog weight, but the analog weight is binarized according to its sign to either −1 or +1. As a result, each synapse can be thought of as having an apparent state (binarized) and a metaplastic state (analog), where the metaplastic weight informs the binarized weight. The effect of this binarized metaplasticity is that only accumulated changes that flip the sign of the weight affect network performance. In addition, in Reference 29 , a nonlinear and asymmetric update is applied to the weight, introducing a variable learning rate as a function of the analog weight: updates are larger for weights closer to 0 and smaller for weights with larger magnitudes (nonlinear), but only for updates that would decrease the weight's magnitude (asymmetric). The authors then show that the combination of these factors reduces overfitting, and therefore improves performance, when learning from information-limited datasets.
Here, we use the trapezoidal DW-MTJ to implement such synaptic metaplasticity in hardware. The DW propagating in the track acts as the hidden, metaplastic weight, and binarization can be introduced either by patterning a small sensing MTJ that covers only a fraction of the length of the track, or by using a binary comparator circuit if two devices are used differentially, shown in the inset circuit in Fig. 4a. In this application, following Reference 29 , the applied voltage V sets the DW to one of multiple states depending on which notch it reaches, but the MTJ itself does not have to produce MW values and could be binary. This is attractive for the DW-MTJ since the MTJ can be kept small and centered along the DW track. By using a binarized readout while holding a nonlinear multi-state hidden weight, we show that the DW-MTJ can reduce system complexity requirements while providing the capability to prevent overfitting and forgetting of sequentially learned information.
We translate the trapezoidal DW-MTJ MW switching characteristics (Fig. 2b) into a metaplastic function f_meta that represents how easy (low voltage required, i.e. DW at the narrow side of the trapezoid with width w_1) or hard (high voltage required, i.e. DW at the wide side of the trapezoid with width w_2) it is to update the DW position. Thus, f_meta is defined such that f_meta = 1 when the DW is at the narrow end (synaptic weight of 0) and f_meta linearly decreases as the DW moves toward the wide end. There is a 1:4 ratio between w_1 and w_2, and a corresponding 1:4 ratio in depinning voltage V_C, which is included in the model.
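One plausible linear form consistent with this description (an assumption on our part; the paper's Eq. 1 did not survive extraction) ties the minimum of the metaplastic function to the 1:4 depinning-voltage ratio:

$$f_\mathrm{meta}(W) = 1 - \left(1 - \frac{w_1}{w_2}\right)\frac{|W|}{W_\mathrm{max}} = 1 - \frac{3}{4}\,\frac{|W|}{W_\mathrm{max}},$$

so that an update is four times harder to make when the DW sits at the wide end than at the narrow end.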
To implement the metaplastic update defined in Eq. 1, the algorithm is adapted from Reference 29 and detailed in Supplementary Section 3. The optimizer update is obtained using a form of stochastic gradient descent, and then the metaplastic update is obtained by multiplying the metaplastic function by the optimizer update. For updates that prescribe to increase the magnitude of the weight, an additional asymmetry parameter δ is multiplied to increase the learning rate toward weights with larger values, which can be incorporated into a physical circuit by using larger current sources for positive weight updates.
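A minimal PyTorch sketch of this update rule, under the same assumed linear f_meta (function and parameter names are ours, not the authors' implementation):

```python
import torch

def metaplastic_update(hidden_w, opt_update, w_max=1.0, ratio=4.0, delta=3.0):
    """One metaplastic weight update (illustrative reading of the text).

    hidden_w   : analog (metaplastic) weights, i.e. the DW positions
    opt_update : raw optimizer update (e.g. from SGD/Adam), same shape
    """
    # Linear metaplastic function: 1 at |w| = 0, 1/ratio at |w| = w_max,
    # mirroring the 1:4 depinning-voltage ratio of the trapezoidal track.
    f_meta = 1.0 - (1.0 - 1.0 / ratio) * hidden_w.abs().clamp(max=w_max) / w_max
    update = f_meta * opt_update
    # Asymmetry: updates that grow the weight magnitude get a delta-times
    # larger learning rate (a larger current source in a physical circuit).
    grows = torch.sign(update) == torch.sign(hidden_w)
    update = torch.where(grows, delta * update, update)
    hidden_w = (hidden_w + update).clamp(-w_max, w_max)
    return hidden_w, torch.sign(hidden_w)  # hidden weight, binarized apparent weight
```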
To show the network performance of this type of nonlinear DW-MTJ synapse, a learning task is performed in which only a subset of the training set is available to the network at a time, as is the case in online learning. We developed a streamed Fashion-MNIST 31 task by splitting the 60,000 images in the training set into 60 subsets of 1000 images each, with each subset shown to the network for 30 epochs. The NN is shown in Fig. 4a and detailed in Methods.
To approximate the effect of notches along the track, quantization to 4 levels is applied to the metaplastic weights after each update. Quantization is applied by stochastically rounding the weights to a defined level, mirroring the non-deterministic spread in depinning voltages for each notch seen in Fig. 2b. Supplementary Information Fig. 3 provides results on the difference between stochastic and deterministic rounding. Figure 5a compares three synapse types: linear synapses with symmetric updates (blue curves with circles), linear synapses with asymmetric updates (orange curves with diamonds), and DW-MTJ synapses with 1:4 width ratio and δ = 3 to reflect the trapezoidal synapse data (green curves with triangles). For all three device types, the validation accuracy is also calculated when they are trained on the full data set for 30 epochs (horizontal ticks in Fig. 5a) to match the number of epochs trained per subset for the streamed data set.
The trapezoidal DW-MTJ synapse (green triangles) provides a clear advantage in the streamed Fashion-MNIST task, reaching close to 86% accuracy once all the subsets have been presented, the same as when training the network on all the data at once. The linear synapse reached a final accuracy of around 83% regardless of update asymmetry δ (blue circles and orange diamonds). There was an advantage for the linear synapse with asymmetric updates (δ = 3, orange diamonds) until around 35 subsets, but this can be attributed to the fact that updates that increase weights have a three times higher learning rate. Supplementary Information Fig. 3 provides further results confirming more effective learning for larger trapezoidal width ratios.
In addition, the binarization scheme with stochastic rounding of weights is very resilient to the effects of heavy quantization, shown in Supplementary Information Fig. 3, where training performance for binarized trapezoidal synapses remains the same for 4, 8, and 16 levels. This is because the effects of quantization are masked by binarization, and the stochastic weight updates prevent small updates from being lost. In Fig. 5b, the training performance of the trapezoidal binarized synapse with the metaplastic weight quantized to 4 levels is compared to the performance of a non-binarized linear synapse with weight quantization of 32 levels. The final accuracy is roughly the same, approaching 86%. The linear synapse was selected because it performed best when evaluating non-binarized training performance, shown in Supplementary Information Section 5. However, there are clear benefits to using the binarized synapse over a typical non-binarized linear synapse, as seen when comparing the two devices in Figs. 5c and 5d: the MTJ size requirement for the binarized synapse is much smaller, reducing the chance for pinholes to occur, and the fabrication requirements for patterning the track are lower as well, requiring only 4 relatively uniform levels as opposed to 32.
Due to the level requirement, the linear synapse is also much less compact than the trapezoidal binarized synapse.

We see here the advantage of the trapezoidal synapse over a straight synapse for stream learning. For a network consisting of linear synapses, overfitting on the current subset can occur, resulting in the forgetting of details learned in previous subsets. Specifically, training a network for many epochs on a limited dataset overfits the network to the earliest presented information without effectively consolidating information across streamed batches. Thus, the network loses details learned in previous subsets that may be significant to the dataset overall. The trapezoidal metaplastic synapse, by contrast, learns more effectively in the stream learning scenario because each update carries a weight-dependent learning rate. Large-magnitude weights have an effectively lower learning rate than weights closer to 0, so the network remembers the values of larger weights for longer while maintaining learning flexibility in synapses with smaller weights. The result is an application well suited to trapezoidal DW-MTJ synapses, with low power, low information requirements, and reduced lithographic complexity, since only a binary MTJ and a few notches are required. Moreover, the ability to freely modify the metaplastic function through lithographic definition of the ferromagnetic track is another indication that the DW-MTJ synapse is well suited for online learning applications.

INFERENCE TASK USING STRAIGHT MAGNETIC SYNAPSE GEOMETRY
Here, we simulate an inference task that takes advantage of the high stability of the straight DW-MTJ synapse's resistance levels and is not drastically affected by the randomness observed in setting the levels. When using the DW-MTJ for an NN inference application, the DW-MTJs are programmed to their desired weight levels using a write-verify process, with multiple pulses per device as needed to overcome the effects of stochasticity. This is a one-time programming operation and is not needed after the system is deployed for the inference application. Figure 6a shows the measured conductance error over 10 cycles for each conductance weight, σ_G = 0.68 − 2.73 μS, showing that once the straight DW-MTJ synapse is updated, the conductance levels are highly precise. Cycle-to-cycle write noise leads to a conductance error of only about 0.25% for the notches below the MTJ. This very high precision is a consequence of the fact that the DW is only pinned at notches whose positions cannot move, because they are lithographically defined.
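A generic write-verify loop of the kind described might look as follows (the `device` interface, pulse amplitudes, and tolerance are hypothetical, not the paper's measurement code):

```python
def write_verify(device, target_g, tol=1e-6, v_set=3.0, v_reset=5.5, max_tries=20):
    """Program a DW-MTJ to a target conductance by pulse-read iteration."""
    device.pulse(v_reset)                 # dependable reset past all notches
    for _ in range(max_tries):
        g = device.read_conductance()
        if abs(g - target_g) <= tol:
            return True                   # verified within tolerance
        # Step the DW one way or the other depending on the sign of the error
        device.pulse(v_set if g < target_g else -v_set)
    return False
```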
Low write noise allows more precise representation of NN weights. To evaluate the effect on accuracy, we use CrossSim 32 to simulate analog inference in a DW-MTJ system (see Methods for details). For the evaluation, we use the CIFAR-100 classification task 33 : 32×32 pixel images with 3 color channels are classified into one of 100 object categories. This is a significantly more challenging task than MNIST or CIFAR-10, and it requires higher precision in the weights. After mapping weights to conductances, a conductance error is applied that is sampled from a normal distribution with zero mean and a standard deviation of 2.53 μS, corresponding to the measured write noise in Fig. 6a. The cycle-to-cycle measurement, however, does not capture the effect of device-to-device variations. Thus, we also performed the inference task with a write noise that is 10× larger than the measured value, to understand the effect of device-to-device variability. Figure 6b shows that with this higher write noise the accuracy is degraded, but it is still well above zero. This again suggests that the measured level of DW-MTJ write noise is comfortably below the level that would compromise high accuracy on CIFAR-100.
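The noise model used for this evaluation is simple to sketch (illustrative code; CrossSim's actual API is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_write_noise(conductances, sigma_g=2.53e-6, scale=1.0):
    """Perturb programmed conductances with zero-mean Gaussian write noise.

    sigma_g is the measured cycle-to-cycle write noise (2.53 uS); scale=10
    reproduces the pessimistic 10x device-to-device variability test.
    """
    return conductances + rng.normal(0.0, scale * sigma_g, size=conductances.shape)
```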

COMPARISON TO OTHER SYNAPSE TECHNOLOGIES
In this section, the DW-MTJ is compared against state-of-the-art alternatives for artificial synapses. The stand-out feature of DW-MTJ synapses is their relatively high speed and low update energy. Though an update costs around 0.124 pJ in this work, our previous simulation work has demonstrated that this design can be scaled down, leading to lower current densities and smaller update energies. Additionally, though the number of states for the DW-MTJ in this work is five, this can be increased by adding more notches to define more levels, if needed. The DW-MTJ has tunable nonlinearity, which enables unique neuromorphic functionality like the metaplastic function. Electrochemical devices, in particular, show extreme promise, with very low write noise allowing them to hold many states, as well as extremely linear and symmetric behavior. However, these devices operate on much slower timescales than the operation speed of the DW-MTJ device. These advantages set up DW-MTJ artificial synapses as a strong candidate for online learning applications.

CONCLUSIONS
In conclusion, we have shown that DW-MTJ devices can achieve multiple weights with low-noise resistance levels, and that a single materials stack can yield devices with complex multifunctionality that is tunable via patterning geometry. Using neuromorphic models based on experimental data, we showed, as an example, that the trapezoidal geometry can create a metaplastic function useful for online learning in information-limited environments, where the NN has to learn as the data arrives. We also showed that the extremely low write noise in the straight geometry produces excellent inference performance when compared to other synaptic devices, approaching software accuracy for quantized systems. This work shows that MW magnetic synapses are a feasible technology for neuromorphic computing and provides design guidelines for emerging synapse technologies.

Fabrication of DW-MTJ synapse devices
The device fabrication consisted of eight main steps, starting from the thin film stack. Patterning of all features was performed using a Raith e-LiNE electron beam lithography (EBL) tool with negative-tone Ma-N 2405 and positive-tone PMMA-A4 resists. The first step was to pattern and then etch the DW track using an AJA International ion miller. EBL was then used to pattern the MTJ above the DW track, which was etched with the ion miller down to the MgO layer. Vias were patterned and silicon nitride was deposited using a Plasma-Therm plasma-enhanced chemical vapor deposition (PECVD) tool. After encapsulation, Cr/Au (5/95 nm) contacts were deposited with a Kurt J. Lesker PVD75 e-beam evaporator.

Electrical testing of DW-MTJ synapse devices
A West Bond wire bonder was used to connect the devices to a home-built testing setup with a tunable electromagnet. Electrical testing was performed by first saturating the devices in an out-of-plane magnetic field to achieve the desired configuration (parallel or antiparallel), and then setting a DC bias field as described in the main text. Voltage pulses of 50 ns (with 10 ns rise and fall times) were used to move the DW along the track, and the 4-point resistance of the MTJ was measured after each pulse.

Setup of stream Fashion-MNIST training task
The Fashion-MNIST 31 training set of 60,000 images is split into 60 subsets of 1000 images each. In between every epoch, the network's performance is quantified as it performs inference on a separate validation set of 10,000 images.
The network is trained sequentially on the subsets for 30 epochs each, and all 60 subsets are trained for 5 different seeds (i.e. 5 different randomly initialized weights). The network architecture is a fully connected binarized NN with 784 input units, two hidden layers of 512 units each with sign activations, and 10 output units, with batch normalization applied after each layer. The Adam optimizer 43 is used with a learning rate of 0.005 for all runs. The network is implemented in PyTorch 44 .
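A minimal PyTorch sketch of this architecture (using a straight-through estimator for the sign activations; weight binarization and the metaplastic update are omitted for brevity, and this is our reconstruction, not the authors' code):

```python
import torch
import torch.nn as nn

class SignSTE(torch.autograd.Function):
    """sign() forward, straight-through gradient backward (common BNN trick)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)
    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()

class StreamNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1, self.bn1 = nn.Linear(784, 512), nn.BatchNorm1d(512)
        self.fc2, self.bn2 = nn.Linear(512, 512), nn.BatchNorm1d(512)
        self.fc3, self.bn3 = nn.Linear(512, 10), nn.BatchNorm1d(10)

    def forward(self, x):
        x = SignSTE.apply(self.bn1(self.fc1(x.flatten(1))))
        x = SignSTE.apply(self.bn2(self.fc2(x)))
        return self.bn3(self.fc3(x))

model = StreamNet()
opt = torch.optim.Adam(model.parameters(), lr=0.005)
# Stream the 60,000 training images as 60 sequential subsets of 1,000, e.g.:
# subsets = torch.utils.data.random_split(fashion_mnist_train, [1000] * 60)
```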

Setup of CIFAR-100 inference task
A NN is first trained using Keras 45 over 200 epochs without accounting for any physical nonidealities. We use the ResNet56 topology, which has the same structure as the identically named network in Ref. 10 for CIFAR-10, but has 4× more channels in every convolutional layer to accommodate the greater complexity of the CIFAR-100 dataset. The final network obtains a software accuracy of 73.9% on the test set and contains 13.7M floating-point weights, which must be quantized to n bits before mapping to hardware. In CrossSim, an n-bit signed weight is mapped to the difference in conductance of two DW-MTJ devices: if the weight is positive, one device encodes the weight's magnitude, while the other device is set to the minimum conductance (0.94 mS), and vice versa if the weight is negative. Each device has 2^(n−1) distinct conductance levels between 0.94 mS and 1.1 mS, implemented using 2^(n−1) DW notches. A larger n increases the available precision, but requires more notches and a longer DW-MTJ device.
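A sketch of this differential mapping (an illustrative helper, not CrossSim's API):

```python
import numpy as np

G_MIN, G_MAX = 0.94e-3, 1.1e-3  # siemens, from the device range quoted above

def map_weight_differential(w, w_max, n_bits=4):
    """Map a signed weight to a (G_plus, G_minus) DW-MTJ device pair.

    Each device offers 2**(n_bits - 1) discrete levels between G_MIN and
    G_MAX, set by the DW notches; the weight is the difference G_plus - G_minus.
    """
    n_levels = 2 ** (n_bits - 1)
    step = (G_MAX - G_MIN) / (n_levels - 1)
    # Quantize the magnitude to the available notch-defined levels
    mag = round(abs(w) / w_max * (n_levels - 1)) * step
    g_pos, g_neg = (G_MIN + mag, G_MIN) if w >= 0 else (G_MIN, G_MIN + mag)
    return g_pos, g_neg
```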

ASSOCIATED CONTENT
Supplementary Information is available and includes resistance state stability data, data on an additional trapezoidal device, and additional data from the stream learning task.