A Neural Network Accelerator to Avoid Inference Inaccuracy


 Artificial neural networks (ANNs) have demonstrated their superiority over traditional computing architectures in tasks such as pattern classification and learning. While ANNs demonstrate high prediction accuracy, they do not measure uncertainty in predictions, and hence they can make wrong predictions with high confidence, which can be detrimental for many mission-critical applications. In contrast, Bayesian neural networks (BNNs) naturally include such uncertainty in their model. Unlike ANNs, where the synaptic weights are point estimates (single-valued), in BNNs, the weights are represented by probability distributions (e.g., Gaussian distributions). This makes the hardware implementation of BNNs challenging since on-chip Gaussian random number generators (GRNGs) based on silicon complementary metal oxide semiconductor (CMOS) technology are area and energy inefficient. Stochastic switching in memristors can be used to build probabilistic synapses, but with very limited tunability. Additionally, memristor technology relies heavily on CMOS-based peripherals to emulate neurons. Here we introduce three-terminal memtransistors based on two-dimensional (2D) materials, which can emulate both probabilistic synapses as well as reconfigurable neurons. The cycle-to-cycle variation in the programming of the 2D memtransistor is exploited to achieve GRNG-based synapses, whereas 2D memtransistor-based integrated circuits are used to obtain neurons with hyperbolic tangent and sigmoid activation functions. Finally, memtransistor-based synapses and neurons are combined in a crossbar array architecture to realize a BNN accelerator, and the performance is evaluated using the IRIS data classification task.


Introduction
Machine learning has seen unprecedented growth and success in recent years owing to the development of artificial neural networks (ANNs). By mimicking the biological neural architecture and employing deep learning algorithms, ANNs have demonstrated notable advantages over standard computing methods for tasks such as image classification, facial recognition, data mining, weather forecasting, and stock market prediction [1][2][3][4][5]. While ANNs offer high performance, especially in terms of high prediction accuracy, they often suffer from overfitting due to lack of generalization, as they do not model uncertainty. Large datasets and various regularization techniques are often required to reduce overfitting in ANNs [6]. However, this can limit the use of ANNs in applications where data is scarce. Additionally, uncertainty estimation is important in applications such as autonomous driving and medical diagnosis, where machine learning must be complemented with uncertainty-aware models or human intervention [7,8]. The integration of probabilistic computing paradigms with ANNs allows regularization and enables us to model uncertainty in predictions [9][10][11][12]. This is achieved in Bayesian neural networks (BNNs) by incorporating Bayes' theorem into the traditional neural network scheme [12,13]. BNNs are capable of modelling uncertainty and avoiding overfitting, while working well with small datasets [14]. In fact, BNNs are extremely powerful as they represent an ensemble model, which is equivalent to a combination of numerous ANNs, but with a small number of parameters.
Unlike ANNs, where the synaptic weights are point estimates (single-valued), in BNNs, the weights ($w$) are represented by probability distributions, as shown in Fig. 1. According to Bayes' theorem, these weights are given by the posterior probability distribution in Eq. 1.

$$P(w|D) = \frac{P(D|w)\,P(w)}{P(D)} \qquad (1)$$

Here, $D$ is the training data, $P(w|D)$ is the posterior distribution, $P(D|w)$ is the likelihood, $P(w)$ is the prior, and $P(D)$ is the evidence. The true posterior distribution is intractable in BNNs, and hence methods such as variational inference [12] and Markov chain Monte Carlo (MCMC) sampling [15] are used to approximate the posterior distribution. Variational inference is typically preferred due to better convergence and scalability compared to MCMC [16]. In the variational inference method, $P(w|D)$ is estimated using a family of variational posterior distributions (typically Gaussian distributions), $q(w;\theta)$, where $\theta$ represents the variational parameters. For a Gaussian distribution, the variational parameters are its mean and standard deviation, i.e., $\theta = (\mu, \sigma)$.

The estimation is performed by minimizing the Kullback-Leibler divergence between $P(w|D)$ and $q(w;\theta)$. In the training phase, for each synapse, $\mu$ and $\sigma$ are learned using the traditional backpropagation method [12]. Here, $\sigma$ represents the uncertainty introduced by each synapse. To perform inference using a BNN, multiple forward passes of the trained network are evaluated.

During each forward pass, each Gaussian weight distribution is sampled once. The output of the network ($\hat{y}$) is obtained by averaging the outputs of these forward passes obtained by sampling the weight distributions. It can be approximated by drawing $T$ Monte Carlo samples and computing their mean, as given by Eq. 2.

$$\hat{y} \approx \frac{1}{T}\sum_{t=1}^{T} f(x; w_t) \qquad (2)$$

Here, $x$ is the input, $f$ is the transfer function of the neural network, and $w_t$ represents the $t$-th Monte Carlo weight sample.
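To make Eq. 2 concrete, the following Python sketch performs Monte Carlo inference for a single tanh layer with hypothetical weight statistics (the layer size and weight values are illustrative, not the paper's trained network); the mean of the sampled outputs gives the prediction, and their spread gives the uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

def bnn_forward(x, mus, sigmas, T=100):
    """Approximate Eq. 2: average the output over T forward passes,
    sampling every weight from its Gaussian posterior N(mu, sigma).
    Illustrative single tanh layer; mus/sigmas hold one (mu, sigma)
    pair per weight."""
    outputs = []
    for _ in range(T):
        w = rng.normal(mus, sigmas)      # one Monte Carlo weight sample w_t
        outputs.append(np.tanh(x @ w))   # f(x; w_t)
    outputs = np.asarray(outputs)
    # Mean approximates the predictive output; spread gives the uncertainty.
    return outputs.mean(axis=0), outputs.std(axis=0)

# Toy usage: 4 inputs -> 1 output, as in the iris example later on.
x = np.array([5.1, 3.5, 1.4, 0.2])
mu_w = rng.normal(0.0, 0.5, size=4)      # hypothetical learned means
sigma_w = np.full(4, 0.1)                # hypothetical learned deviations
y_mean, y_std = bnn_forward(x, mu_w, sigma_w)
```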
Over the years we have witnessed the development of neural network accelerators aimed at improving the size, energy consumption, and speed of neural networks, especially for edge computing applications [17][18][19]. Since the training process in neural networks is energy and resource intensive, these works typically rely on off-chip training and on-chip inference [16,20,21]. Moreover, these demonstrations are based on the von Neumann architecture with separate memory and logic units, requiring frequent shuttling of data between the two. BNN accelerators based on emerging, non-von Neumann memristive and spintronic synapses utilize cycle-to-cycle variability in switching to generate Gaussian random numbers (GRNs) [22][23][24]. However, these GRNG-based synapses are limited to $\mu = 0$ and $\sigma = 1$ and require extensive CMOS-based peripheral circuitry to obtain unrestricted $\mu$ and $\sigma$ values. For example, multiplication and addition operations are used to transform $\mathcal{N}(0,1)$ to $\mathcal{N}(\mu, \sigma) = \sigma \cdot \mathcal{N}(0,1) + \mu$.
Finally, two-terminal memristors also lack the capability to emulate neurons and their activation functions. Therefore, energy- and area-efficient acceleration of BNNs will benefit from a standalone hardware platform, which can offer both neurosynaptic functionalities as well as programmable stochasticity. A minimal sketch of the peripheral transform mentioned above is given below.
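For scale, the $\mu$-$\sigma$ transform amounts to one multiplication and one addition per drawn sample; a minimal Python sketch follows (target values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

def scale_grn(z, mu, sigma):
    """Peripheral multiply-and-add that maps standard GRNs z ~ N(0,1)
    to N(mu, sigma). In CMOS-assisted accelerators this costs one
    multiplier and one adder per synapse sample."""
    return sigma * z + mu

z = rng.standard_normal(1000)        # stand-in for a hardware N(0,1) source
g = scale_grn(z, mu=3.5, sigma=0.9)  # arbitrary illustrative target values
```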
In this work, we introduce a three-terminal memtransistor technology based on two-dimensional (2D) monolayer MoS2, offering all the computational primitives needed for a BNN accelerator. First, we realize an ultra-low-power GRNG-based synapse by exploiting the cycle-to-cycle variability in the programming/erasing operation of the 2D memtransistor. Next, using a circuit comprising two memtransistors, we achieve reconfigurable $\mu$ and $\sigma$. Activation functions such as hyperbolic tangent (tanh) and sigmoid are also realized using 2D memtransistor-based circuits. Finally, we demonstrate a crossbar array architecture in order to implement on-chip BNN inference.
Furthermore, the entire network is simulated using LTSpice.

2D memtransistor
The schematic of a 2D memtransistor is shown in Fig. 2a (Supplementary Fig. 1a shows the optical image). This 2D memtransistor has a local back-gated geometry, where atomic-layer-deposition-grown 50 nm Al2O3 is used as the gate dielectric and TiN/Pt is used as the local gate electrode (see Methods section for details on fabrication). This geometry (similar to a top-gated geometry) enables independent modulation of each memtransistor and the development of the circuits necessary for a BNN. Note that we have used monolayer MoS2 grown using metal-organic chemical vapor deposition (MOCVD), as described in our previous reports [25,26]. The choice of MoS2 as the channel material of the memtransistor is motivated by recent demonstrations highlighting the technological viability of 2D materials [27][28][29] and their wide-scale adoption in brain-inspired computing [30][31][32][33][34].
The transfer characteristics, i.e., drain current ($I_{DS}$) versus gate-to-source voltage ($V_{GS}$), at a drain-to-source voltage ($V_{DS}$) of 1 V for 250 MoS2 memtransistors are shown in Fig. 2b. The dependence of programming and erasing on $V_P$ is shown in Supplementary Figs. 1c and 1d, respectively. The non-volatile nature of the MoS2 memtransistor is shown in Fig. 2f, where the retention characteristics for 5 different conductance states are demonstrated for 2000 s. The working principle of this analog and non-volatile memory has been described in detail in our earlier report [32].

Gaussian random number generator-based synapse
BNN accelerators rely on techniques such as cumulative density function inversion, central limit theorem (CLT)-based approximation, and the Wallace method to generate standard GRNs [16,20,21]. These methods typically require linear feedback shift registers, multipliers, and adders, involving numerous transistors to implement the GRNGs, rendering them area and energy inefficient. In contrast, here we use cycle-to-cycle variation in the programmability of our MoS2 memtransistor to generate GRNs. While cycle-to-cycle variation is undesirable for traditional computing, it can be exploited to reduce the design complexity of a BNN accelerator [22,23,35].
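For context, a minimal Python sketch of the CLT-based approach follows; the uniform source and the choice of twelve addends are standard textbook assumptions, not details of the cited designs.

```python
import numpy as np

rng = np.random.default_rng(0)

def clt_grn(n_samples, k=12):
    """CLT-based GRNG: the sum of k uniform(0,1) variables approaches a
    Gaussian. Each uniform has variance 1/12, so with k = 12 the sum has
    unit variance; subtracting k/2 centers it, giving approximately
    N(0, 1). Each sample costs k uniform draws plus k-1 additions --
    hardware overhead the memtransistor GRNG avoids."""
    u = rng.uniform(0.0, 1.0, size=(n_samples, k))
    return u.sum(axis=1) - k / 2.0

z = clt_grn(10_000)  # z.mean() ≈ 0, z.std() ≈ 1
```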
To demonstrate the effect of programming variation, we use dynamic programming on 40 MoS2 memtransistors, where we measure the transfer characteristics with different $V_{GS}$ sweep ranges. To evaluate the effect of $V_P$ ($V_E$), the maximum positive (negative) $V_{GS}$ is fixed at +2 V (-2 V), while the maximum negative (positive) $V_{GS}$ is stepped from -3 V to -13 V (3 V to 13 V). As shown in Fig. 2g, high $V_P$ and $V_E$ (±13 V) increase the device-to-device variation (post programming/erasing). High $V_P$ and $V_E$ (beyond ±7 V) result in a significant $V_{TH}$ shift (see Supplementary Fig. 2), while also increasing $\sigma_{V_{TH}}$, as shown in Fig. 2h. This increase in device-to-device variation for high $V_P$ and $V_E$ is also accompanied by an increase in the cycle-to-cycle variation.
To utilize the cycle-to-cycle variation, the gate of a MoS2 memtransistor is subjected to successive erase-program-read pulse cycles with $V_E = 13$ V, $V_P = -13$ V, and a read voltage ($V_R$) of 0 V, as shown in Fig. 3a. The corresponding $V_{DS}$ values were 0, 0, and 0.1 V, respectively. The conductance ($G$) of the memtransistor, measured at each read step, is shown in Fig. 3b for 200 cycles. As evident from the histogram shown in Fig. 3c, $G$ follows a Gaussian distribution, with mean $\mu = 3.5$ nS and standard deviation $\sigma = 0.9$ nS. The quantile-quantile (Q-Q) plot of $G$ further confirms the Gaussian distribution. The quantiles of $G$ (represented using circles) are plotted against the theoretical quantiles from a Gaussian distribution, as shown in Fig. 3d. As expected, the plot closely follows a straight line. Note that the slope of the Q-Q plot represents $\sigma$, and the value corresponding to quantile 0 represents $\mu$. Further characterization of the 2D memtransistor-based GRNG has been done in our previous report [36]. In essence, MoS2 memtransistors are used to generate a physical random variable that samples analog conductance values from a Gaussian distribution, i.e., $G \sim \mathcal{N}(\mu, \sigma)$. Moreover, the MoS2 memtransistor can be used as a synapse, which scales the input by its synaptic weight. If the input is applied as a voltage to the drain terminal of the memtransistor, the output current is scaled by $G$, i.e., $I_{DS} = G \cdot V_{DS}$, as shown in Fig. 3a. Therefore, by combining the cycle-to-cycle variation in $G$ with the synaptic functionality of the memtransistor, we are able to realize a GRNG-based synapse.

Note that to implement a BNN accelerator, it is important to tune both $\mu$ and $\sigma$ of the GRNG-based synapse independently. $\mu$ and $\sigma$ can be tuned by modulating $V_P$ in the erase-program-read pulse cycle, as shown in Fig. 3e. However, $\mu$ and $\sigma$ are found to be coupled, and the coefficient of variation ($CV = \sigma / \mu$) depends on $V_P$, as shown in Fig. 3f. A similar trend is seen in $\mu$, $\sigma$, and $CV$ as a function of $V_E$, as shown in Supplementary Fig. 3. This dependence of $\mu$, $\sigma$, and $CV$ on $V_P$ and $V_E$ is demonstrated across multiple memtransistors in Supplementary Fig. 4.

Fig. 4a shows the design of our GRNG-based synapse with independent control over its $\mu$ and $\sigma$, using two MoS2 memtransistors, $T^+$ and $T^-$. While prior demonstrations rely on additional mathematical manipulations of the generated GRNs to establish control over their $\mu$ and $\sigma$, we are able to achieve it without any additional manipulations or circuitry [22][23][24][35]. It is common practice in neural network accelerators to use two devices per synapse in order to map both positive and negative weights [37]. Here, the input to the synapse, $V_{in}$, is applied as $+V_{in}$ and $-V_{in}$ to $T^+$ and $T^-$, respectively, as shown in Fig. 4a. The current at the output node ($I_{out}$) is then given by the sum of the currents through $T^+$ and $T^-$, i.e., $I_{T^+}$ and $I_{T^-}$, according to Kirchhoff's current law (KCL), given by Eq. 3.

$$I_{out} = I_{T^+} + I_{T^-} = (G^+ - G^-)\,V_{in} = G_{eff}\,V_{in} \qquad (3)$$
Here, $G^+$ and $G^-$ are the conductances of $T^+$ and $T^-$, respectively, and $G_{eff}$ is the effective conductance of the synapse. While the conductance of a device is always positive, by modulating $G^+$ and $G^-$ using $V_P^+$ and $V_P^-$ (by applying different $V_P$), we can obtain both positive and negative $G_{eff}$. Here, we use $V_P$ to modulate $G^+$ and $G^-$, as programming shows better linearity and lower device-to-device variation compared to erasing in GRN generation (see Supplementary Fig. 4). To control $\mu_{eff}$ and $\sigma_{eff}$, $T^+$ is subjected to successive erase-program-read pulse cycles, while $T^-$ is programmed to a given state and subsequently only read, using the waveforms shown in Fig. 4b. This results in $G^+$ being drawn from a Gaussian distribution with $\mu^+ = 5$ nS and $\sigma^+ = 0.49$ nS, i.e., $G^+ \sim \mathcal{N}(5, 0.49)$ nS, and $G^-$ having a constant value of ≈ 8.89 nS, as shown in Fig. 4b. $G_{eff}$ is expected to be drawn from a distribution with $\sigma_{eff} = \sigma^+$ and $\mu_{eff} = \mu^+ - G^-$. This is confirmed by our measurements, as shown in Fig. 4c: $G_{eff} \sim \mathcal{N}(-3.9, 0.49)$ nS. Note that $G^-$ is not perfectly constant due to the presence of random telegraph fluctuations. However, these fluctuations were found to have a standard deviation of 0.06 nS, making their contribution negligible. The histograms and Q-Q plots of $G^+$, $G^-$, and $G_{eff}$ are shown in Supplementary Fig. 5. Fig. 4d shows the independent control of $\mu_{eff}$ for constant $\sigma_{eff}$ using the GRNG-based synapse. Here, $T^+$ is subjected to the same erase-program-read cycle to obtain a constant $\sigma_{eff}$, whereas $T^-$ is programmed to different states (using $V_P$) to tune $\mu_{eff}$. Fig. 4e shows the independent control of $\sigma_{eff}$ for constant $\mu_{eff}$. In order to modulate $\sigma_{eff}$, $\sigma^+$ is changed by applying different erase-program-read cycles (different $V_P$) to $T^+$. Since this also leads to an unwanted change in $\mu^+$, $T^-$ is reprogrammed to account for the change in $\mu^+$, maintaining a constant $\mu_{eff}$.
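A behavioral Python sketch of this differential synapse, using the measured statistics from Figs. 4b-c, follows; the sampling model is our abstraction of the erase-program-read cycling, not a device-level simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def synapse_current(v_in, mu_plus, sigma_plus, g_minus, n_samples=200):
    """Behavioral sketch of the two-memtransistor synapse (Eq. 3),
    assuming T+ is cycled so that G+ ~ N(mu_plus, sigma_plus) while T-
    holds a fixed conductance g_minus. Since +v_in and -v_in drive T+
    and T-, the output current is I_out = (G+ - G-) * v_in, so
    mu_eff = mu_plus - g_minus and sigma_eff = sigma_plus."""
    g_plus = rng.normal(mu_plus, sigma_plus, size=n_samples)
    g_eff = g_plus - g_minus
    return g_eff * v_in

# Values from Figs. 4b-c: G+ ~ N(5, 0.49) nS, G- ≈ 8.89 nS,
# so G_eff ~ N(-3.9, 0.49) nS; v_in is capped at 0.1 V (see below).
i_out = synapse_current(v_in=0.1, mu_plus=5e-9, sigma_plus=0.49e-9,
                        g_minus=8.89e-9)
print(i_out.mean(), i_out.std())
```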
In a synapse, the distribution of $I_{out}$ is expected to scale linearly with $V_{in}$, as given by Eq. 4.

$$I_{out} \sim \mathcal{N}\!\left(\mu_{G_{eff}} V_{in},\; \sigma_{G_{eff}} V_{in}\right) \qquad (4)$$
This is demonstrated in Fig. 4f, where $\mu_{I_{out}}$ and $\sigma_{I_{out}}$ show a linear dependence on $V_{in}$.
The output characteristics of the MoS2 memtransistor are shown in Supplementary Fig. 6 for positive and negative $V_{DS}$. While the current is highly non-linear and asymmetric for large $\pm V_{DS}$ values, it is sufficiently linear and symmetric between ±0.1 V. Hence, we limit the maximum $V_{in}$ to 0.1 V. The low $V_{in}$ allows us to operate the synapse with extremely low currents, as shown in Fig. 4f, offering significant energy efficiency. Overall, we demonstrate independent control over $\mu_{eff}$ and $\sigma_{eff}$ to implement a GRNG-based synapse with just two MoS2 memtransistors, resulting in significant area and energy efficiency.

Neurons with modified hyperbolic tangent activation function
The hardware for activation functions in neural accelerators is generally realized using standard CMOS-based analog and digital components, and hence these implementations do not utilize the advantages offered by emerging materials [37]. Moreover, the hyperbolic tangent (tanh) and sigmoid functions are highly non-linear, significantly complicating their hardware implementation [38]. We demonstrate a circuit for a modified tanh (m-tanh) activation function using two MoS2 memtransistors (T1 and T2), as shown in Fig. 4g. The transfer function of the circuit, i.e., output voltage ($V_O$) versus input voltage ($V_S$), closely follows the tanh activation function, as shown in Fig. 4h. The maximum of the m-tanh activation function is determined by the drain voltage ($V_{DD}$), and a $V_{DD}$ of 1 V results in the ideal tanh activation function. Here, $V_S$ is applied to the gate of T2, while T1 is used as a resistive load. We use the charge-trap memory to program T1 to have the required resistance at zero gate voltage, eliminating the need for a constant gate bias. T2 is programmed to ensure that the m-tanh function passes through the origin. Note that when $V_S = -3$ V, T2 operates in the off-state, i.e., T1 is more conductive than T2, resulting in $V_O = -V_{DD}$, whereas for $V_S = 3$ V, T2 operates in the on-state and becomes more conductive than T1, which results in $V_O = V_{DD}$. Note that the m-tanh activation function can also be implemented using complementary n-type and p-type transistors, as demonstrated in our previous report [39]. Additionally, a modified sigmoid activation function can be realized by applying 0 V to the drain terminal of T1, as shown in Supplementary Fig. 7.
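The circuit behavior can be captured by a simple resistive-divider model; the following Python sketch uses an assumed exponential turn-on for T2 (a generic transistor abstraction with hypothetical parameter values, not a fit to the measured devices).

```python
import numpy as np

def m_tanh(v_s, v_dd=1.0, g1=1e-8, g_on=1e-6, v_0=0.5):
    """Behavioral sketch of the two-memtransistor m-tanh circuit: T1 is
    a fixed resistive load (conductance g1) toward -V_DD, and T2 (gate
    driven by V_S) pulls the output toward +V_DD, giving the divider
    voltage V_O = V_DD * (G2 - G1) / (G2 + G1). The exponential turn-on
    of G2, saturating at g_on, is an assumed model; choosing G2(0) = G1
    makes the curve pass through the origin, mirroring the programming
    of T2 described above."""
    g2 = np.minimum(g1 * np.exp(v_s / v_0), g_on)
    return v_dd * (g2 - g1) / (g2 + g1)

v_s = np.linspace(-3, 3, 121)
v_o = m_tanh(v_s)  # swings from ~ -V_DD (T2 off) to ~ +V_DD (T2 on)
```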

Crossbar array architecture
The crossbar array architecture is routinely used in neural network accelerators to perform the MAC operation of a neuron. Fig. 5a shows the circuit used to implement a portion of a BNN shown in Fig. 5b, where $m = 4$ input neurons are connected to $n = 1$ output neuron. Each input neuron is multiplied with its corresponding synaptic weight distribution and the resultants are summed at the output neuron (MAC operation). The result of the MAC operation is passed through the m-tanh activation function to obtain the output. To implement this in a circuit, as shown in Fig. 5a, the conductance distribution of the synapse in the $i$-th row and $j$-th column of the $l$-th layer ($G_{ij}^{(l)}$), given by the combination of $G_{ij}^{+(l)}$ and $G_{ij}^{-(l)}$, is modulated using the $V_{P,ij}^{+(l)}$ and $V_{P,ij}^{-(l)}$ lines. Inputs to the $i$-th row of the $l$-th layer are applied as voltages ($\pm V_i^{(l)}$). The current through the $j$-th column due to these synapses is then given by the dot product of $G_j^{(l)}$ and $V^{(l)}$, according to KCL. To obtain a voltage proportional to this dot product, we use a sense transistor, as shown in Fig. 5a. The voltage drop ($V_S^{(j,l)}$) across this sense transistor is given by Eq. 5. Here, $G_S^{(j,l)}$ is the conductance of the sense transistor, modulated using its gate voltage. Using $V_S^{(j,l)}$ allows us to seamlessly integrate the circuit for the m-tanh activation function into the crossbar array, as shown in Fig. 5a, to obtain the corresponding output ($V_O^{(j,l)}$). There are some non-idealities that also need to be accounted for. First, the synaptic weight distribution ($w_{ij}^{(l)}$) is mapped to the crossbar array using a conductance scaling factor ($\beta$) to obtain $G_{ij}^{(l)}$. Second, the denominator of $V_S^{(j,l)}$ (Eq. 5) presents a non-ideality, which can be expressed as the product of $\beta$ and a non-ideality factor ($\alpha^{(j,l)}$).
By mapping the input ($x_i^{(l)}$) to $V_i^{(l)}$, using $\alpha^{(l)}$ as the scaling factor, the ideal $V_S^{(j,l)}$ and $V_O^{(j,l)}$ can be obtained, as shown in Eq. 6.
The sense-transistor conductance $G_S^{(j,l)}$ is used to make sure that each column of the crossbar array has the same $\alpha^{(l)}$. With this proposed scheme, we can evaluate the dot product between $G_j^{(l)}$ and $V^{(l)}$ in the voltage domain and use the m-tanh activation function to obtain the ideal output, $V_O^{(j,l)}$. Note that this scheme is not limited to the implementation of a BNN and can be adopted to implement standard ANN crossbar arrays with tanh and sigmoid activation functions.
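A minimal Python sketch of this mapping scheme for a single column follows; it treats the combined $\beta \cdot \alpha$ correction as a known constant and idealizes away the sense-transistor non-ideality, which the real scheme compensates per column.

```python
import numpy as np

rng = np.random.default_rng(0)

def crossbar_mac(x, w_mu, w_sigma, beta=1e-8, alpha=1.0):
    """Sketch of one crossbar column under the mapping scheme above:
    weights are scaled to conductances with a factor beta, inputs are
    scaled to voltages with the per-layer factor alpha, and KCL sums
    the per-synapse currents into a dot product. The returned value
    divides out beta * alpha to recover the ideal MAC result."""
    g = beta * rng.normal(w_mu, w_sigma)   # sample G_ij from the weight posterior
    v = alpha * x                          # map inputs x_i to row voltages V_i
    i_col = g @ v                          # column current: dot product via KCL
    return i_col / (beta * alpha)          # ideal w.x result

x = np.array([5.1, 3.5, 1.4, 0.2])        # e.g., the four iris features
w_mu = rng.normal(0.0, 0.5, size=4)        # hypothetical weight means
w_sigma = np.full(4, 0.1)                  # hypothetical weight deviations
y = crossbar_mac(x, w_mu, w_sigma)
```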

Neural network evaluation
We evaluate the performance of our BNN accelerator using the iris data classification task [40].
The iris dataset consists of the lengths and widths of both sepals and petals (shown in Fig. 5c). Here, a $V_E$ of 13 V and a $V_R$ of 0 V are used, while $V_P$ is used to tune $G_{ij}^{+(l)}$ and $G_{ij}^{-(l)}$. $\alpha^{(l)}$ is determined for each layer and multiplied with $x_i^{(l)}$ to obtain $V_i^{(l)}$. A $V_{DD}$ of 1 V is used for the m-tanh activation circuit. Note that at the output layer we do not use the tanh activation function. Instead, following Eq. 2, $V_S^{(j,l)}$ is sampled $T = 100$ times to obtain a distribution, and its mean is used to make the classification. Note that the distribution of $V_S^{(j,l)}$ at the output layer can be used to calculate the uncertainty in classification [42,43]. The BNN accelerator in Fig. 5f is evaluated using LTSpice simulations, where we are able to obtain a test accuracy of 93.78 %. Here, we use resistors to implement the synapses, and the other components are modeled using NMOS transistors. The dip in accuracy is attributed to the non-symmetric output of the m-tanh circuit (Fig. 4g). By implementing the tanh activation function using complementary n-type and p-type transistors [39], the test accuracy of 97.78 % can be replicated. It is also important to evaluate the effect of device-to-device variation on the performance of the BNN. Fig. 5g shows the effect of device-to-device variation on the testing accuracy. Here, the BNN is simulated with a variation of up to 10 % in the synaptic weights, and the testing accuracy is averaged over 10 runs.
While we observe a decrease in the test accuracy, it does not significantly impact the operation of the BNN, and an accuracy of ≈ 80 % is maintained even at 10 % variation.
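The robustness study can be summarized by the following Python sketch; `evaluate_bnn` is a hypothetical stand-in for the LTSpice test-accuracy evaluation, and the proportional perturbation model is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def accuracy_under_variation(eval_fn, w_mu, w_sigma, variation, n_runs=10):
    """Sketch of the robustness study in Fig. 5g: perturb every synaptic
    weight mean by a zero-mean Gaussian whose scale is a fraction
    (`variation`) of the weight magnitude, then average the test
    accuracy over n_runs perturbed copies of the network. `eval_fn`
    is a placeholder for the BNN test-accuracy evaluation."""
    accs = []
    for _ in range(n_runs):
        dev = rng.normal(0.0, variation * np.abs(w_mu))
        accs.append(eval_fn(w_mu + dev, w_sigma))
    return float(np.mean(accs))

# Usage (with a hypothetical evaluator):
# acc_10pct = accuracy_under_variation(evaluate_bnn, w_mu, w_sigma,
#                                      variation=0.10)
```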

Conclusion
This work demonstrates the development of the computational primitives needed for a BNN accelerator using 2D memtransistors. The cycle-to-cycle variation in the programming of the memtransistor is exploited as a source of randomness, and a circuit comprising two such memtransistors is used to obtain an ultra-low-power stochastic synapse, which allows sampling of both positive and negative weights from a Gaussian distribution with reconfigurable mean and standard deviation. We also develop circuits to implement the modified hyperbolic tangent and sigmoid activation functions based on the 2D memtransistors. Additionally, we integrate these components into a crossbar array architecture to perform efficient MAC operations.
Finally, we develop a BNN accelerator to perform on-chip inference to classify the iris dataset and benchmark it using circuit simulations.

Methods
Device fabrication: Local back-gated MoS2 memtransistors are fabricated using photolithography and e-beam lithography. Photolithography is used to define the back-gate islands. A p++ Si substrate is first spin-coated with LOR 5A and baked at 180 ˚C for 120 s, and subsequently spin-coated with SPR 3012 and baked at 95 ˚C for 60 s. Using a Heidelberg MLA 150, the desired regions are exposed to 405 nm light. The exposed regions are developed using 1:1 CD 26 and DI water.
To form the back-gate islands, 20 nm TiN followed by 50 nm Pt is deposited through sputtering. A 50 nm Al2O3 gate dielectric is then deposited using atomic layer deposition. Al2O3 is etched from the back-gate contact regions using a BCl3 etch, with the etch regions defined by photolithography. Following this, MOCVD MoS2 is transferred onto the substrate and the MoS2 transistors are fabricated as discussed in our previous reports [25,39].
Electrical characterization: A Lake Shore CRX-VF probe station and a Keysight B1500A parameter analyzer were used to perform the electrical characterization at room temperature. The device-to-device variation measurements were performed using a FormFactor Cascade Summit 12000 semi-automated probe station.

Data availability:
The datasets generated during and/or analyzed during the current study are available from the corresponding authors on reasonable request.

Code availability:
The codes used for plotting the data are available from the corresponding authors on reasonable request.