Ising Model Optimization Problems on a FPGA Accelerated Restricted Boltzmann Machine

Optimization problems, particularly NP-Hard Combinatorial Optimization problems, are some of the hardest computing problems with no known polynomial time algorithm existing. Recently there has been interest in using dedicated hardware to accelerate the solution to these problems, with physical annealers and quantum adiabatic computers being some of the state of the art. In this work we demonstrate usage of the Restricted Boltzmann Machine (RBM) as a stochastic neural network capable of solving these problems efficiently. We show that by mapping the RBM onto a reconfigurable Field Programmable Gate Array (FPGA), we can effectively hardware accelerate the RBM's stochastic sampling algorithm. We benchmark the RBM against the DWave 2000Q Quantum Adiabatic Computer and the Optical Coherent Ising Machine on two such optimization problems: the MAX-CUT problem and finding the ground state of a Sherrington-Kirkpatrick (SK) spin glass. On these problems, the hardware accelerated RBM shows best in class performance compared to these other accelerators, with an empirical scaling performance of $\mathcal{O}(e^{-N})$ for probability of reaching the ground state compared to a similar empirical $\mathcal{O}(e^{-N})$ for the CIM (with the RBM showing a constant factor of improvement over the CIM) and empirical $\mathcal{O}(e^{-N^2})$ for the DWave Annealer. The results show up to $10^7$x and $10^5$x time to solution improvement compared to the DWave 2000Q on the MAX-CUT and SK problems respectively, along with a $150$x and $1000$x performance increase compared to the Coherent Ising Machine annealer on those problems. By using commodity hardware running at room temperature for acceleration, the RBM also has greater potential for immediate and scalable use.

Optimization problems, particularly NP-Hard Combinatorial Optimization problems, are some of the hardest computing problems with no known polynomial time algorithm existing. Recently there has been interest in using dedicated hardware to accelerate the solution to these problems, with physical annealers and quantum adiabatic computers being some of the state of the art. In this work we demonstrate usage of the Restricted Boltzmann Machine (RBM) as a stochastic neural network capable of solving these problems efficiently. We show that by mapping the RBM onto a reconfigurable Field Programmable Gate Array (FPGA), we can effectively hardware accelerate the RBM's stochastic sampling algorithm. We benchmark the RBM against the DWave 2000Q Quantum Adiabatic Computer and the Optical Coherent Ising Machine on two such optimization problems: the MAX-CUT problem and finding the ground state of a Sherrington-Kirkpatrick (SK) spin glass. On these problems, the hardware accelerated RBM shows best in class performance compared to these other accelerators, with an empirical scaling performance of O(e −N ) for probability of reaching the ground state compared to a similar empirical O(e −N ) for the CIM (with the RBM showing a constant factor of improvement over the CIM) and empirical O(e −N 2 ) for the DWave Annealer. The results show up to 10 7 x and 10 5 x time to solution improvement compared to the DWave 2000Q on the MAX-CUT and SK problems respectively, along with a 150x and 1000x performance increase compared to the Coherent Ising Machine annealer on those problems. By using commodity hardware running at room temperature for acceleration, the RBM also has greater potential for immediate and scalable use.
Combinatorial Optimization problems are a particularly important class of computing problem and are prevalent across disciplines, from scheduling and logistics to analysis of physical systems to efficient routing. These problems belong to the NP-Hard and NP-Complete class, where no polynomial time solution exists. This leads to an interest in novel algorithms, architectures, and systems to solve these problems. The Ising Model problem is an example of this type of problem, with foundations in statistical physics [1][2][3] . For this reason, the Ising Model has emerged as an efficient way of mapping these problems onto various physical accelerators [4][5][6][7][8] . Among these, the DWave Adiabatic Quantum Computer 8-10 and the Coherent Ising Machine (CIM) 5,6 are two promising implementations that are capable of solving large scale combinatorial optimization tasks by mapping them onto the Ising model.
The Boltzmann Machine is a stochastic neural network over binary variables which maps directly onto the Ising model. Because of this, the Boltzmann Machine has received attention for its usage to solve Combinatorial Optimization problems 11,12 . However, the stan-dard Gibbs Sampling algorithm 13,14 used with the Boltzmann Machine is computationally expensive due to its sequential nature, and many samples needed to reach convergence. Restricted Boltzmann Machine (RBM) 15 can address some of these problems by introducing a parallel sampling scheme via removing intra-layer connections. This is why Restricted Boltzmann Machines have found substantial interest as next generation accelerators for computationally difficult problems 7,[16][17][18] .
In this work we show how the Restricted Boltmann Machine's parallel stochastic sampling scheme is a strong candidate for Ising Model accelerators, especially for combinatorial optimization tasks. We exploit the intrinsic parallelism in this architecture by mapping it onto an FPGA based design with the flexibility in memory and compute to efficiently take advantage of this parallelism. This accelerator allows us achieve better performance than other accelerators based on quantum computation (DWave 2000Q Computer) or novel physical phenomena (Coherent Ising Machine). We demonstrate the performance boost by benchmarking on two types of problems, the Dense Max-Cut Problem up to 200 nodes, and finding ground states for the Sherrington-Kirkpatrick (SK) spin system up to 150 nodes. The CIM and RBM both show a O(e −N ) performance for probability of reaching the ground state with the DWave annealer demonstrating O(e −N 2 ) performance. This leads to an asymptotic time to solution performance of the RBM (O(e √ N )) which is better than that of the DWave Annealer (O(e N )) and similar to that of the CIM (also O(e √ N )) but with a large constant factor scaling improvement over all problem sizes evaluated here.

Results
Stochastic Sampling Algorithm The Restricted Boltzmann Machine is a stochastic neural network that encodes an exponential family probability distribution over binary variables, taking the form given in Equation 1. The energy function E(v, h) is typically set to reflect the problem being solved, and can take a variety of possible forms 19 . Here we choose the energy function to reflect linear synaptic weights with two body interactions to allow for simplicity of hardware and algorithmic implementation.
As many optimization problems have been formulated for the Ising Model 3 , they must be transformed from the fully connected Ising model problem, to the bipartite graph structure in the RBM as demonstrated in Figures 1 A) and B). To do this, each logical node in the original graph is copied to create two physical nodes in the RBM, with one being in the visible layer and one being in the hidden layer. The connection between the two physical copies is referred to here as the "coupling coefficient" (C) and forces both nodes to remain at the same value. Once the bipartite graph is formed, the RBM energy function is set to E(v, h) = v T W h with W being the bipartite graph's weight matrix. This encodes the lowest energy states in the original Ising Model problem as the highest probability state in the RBM probability distribution. Further details on the RBM embedding method are outlined in the Methods section.
To sample from the RBM probability distribution, we perform block Gibbs sampling 15 , a type of Markov Chain Monte Carlo, on the RBM nodes. Each neuron has a stochastic activation function where p(vi = 1|h) = σ(w T i h + bi) and σ(x) = (1 + e −x ) −1 , with wi being the ith row vector of the weight matrix, and bi being the bias associated with that neuron. The lack of intra-layer connections allows for each neuron in a layer to be sampled in parallel, creating a massively parallel sampling scheme. Each layer is sampled and the result is passed to the next layer to get the next sample. This process of passing neuron activations back and forth is demonstrated in Figure 1 C).
An example of the sampling algorithm on a 150 Node MAX-CUT instance is shown in Figure 1 D). The raw samples produced stochastically fluctuate through the high probability states, with a higher probability of transitioning into and staying in the highest probability/lowest energy state. The samples can be analyzed to find the solution in one of two ways; either the samples can be collected and the mode of the sampled distribution is taken as an estimate for the highest probability state, or the unnormalized probability can be calculated for each sample and the highest probability state seen thus far can be taken as an estimate of the highest probability state for the underlying distribution. The first method is related to the "mixing time" of the Markov Chain and the second is related to the "hitting time" of the Markov Chain 16,19 . The mixing time method requires less computation but more post-processing, while also allowing for analysis of the full distribution to find other potential high probability states as solutions to the problem. The hitting time method produces only one output solution and converges to the solution faster than the mixing time method, but it requires an extra computation for every sample. In Figure 1E) we show that the hitting time method performs better for the same number of samples, with a cut distribution closer to the maximum. Further, Figure 1F) shows that the hitting time method has a constant factor of improvement in probability of reaching the ground state over the mixing time method. This result is expected from the theory of Markov chain samplers 20,21 . FPGA Acceleration and Hitting Time Engine To show the inherent parallelism of the RBM architecture, we map the problem onto an FPGA accelerator. FPGA based Accelerators for the RBM have been made before, but most focus on RBMs in the Machine Learning context [22][23][24] , or for resource constrained environments 25,26 . We base our FPGA accelerator on previous work 16 which implements a high performance accelerator specifically for fast Gibbs sampling operations on an RBM.
The FPGA based accelerator runs at 70Mhz and produces 1 sample each cycle. Although this runs at a clock frequency considerably slower than a typical CPU or GPU (usually 1-4Ghz), the FPGA accelerator is able to produce samples at a considerably faster rate than a CPU or GPU implementation 16 . This speedup can be attributed to the parallelism, binary activations, efficient sigmoid approximations, and reduced precision weights. These cause matrix multiplications to become fixed precision accumulation minimizing the complexity of computation. The accelerator supports 9 bit precision fixed-point weights and biases, which allows for solving Ising Hamiltonians for various problems other than the one included in here. We note that while the problems benchmarked in this work only use weights in {−1, 0, 1}, we wanted to maintain the generality of our accelerator to solve a variety of other problems within the RBM framework, such as machine learning inference 15 , and other instances of NP-Hard Combinatorial Optimization problems 3,16 . We note that if we were to restrict the weights our RBM implementation was able to support further we would expect the accelerator as designed to support larger problem instances.
As outlined above, the two methods for finding the best solution to the optimization problem involves either using the mixing time method or the hitting time method. As the hitting time method has a theoretical advantage for solution quality, we prefer using this. Calculating the probability of each sample on the CPU is computationally expensive, so we implement a "hitting time engine" on the FPGA which calculates an approximate, unnormalized log-probability for each sample and keeps track of the state with the highest probability. By using partial computations already available when computing node activations, the hitting time engine has negligible hardware costs and is able to decrease the amount of FPGA to CPU communication to a minimum, freeing the CPU for other computational tasks. The details of the hitting time engine are outlined in the Methods section. Effect of Sampler Parameters on Algorithm Performance The choice of parameters has a strong effect on algorithm performance and should be characterized to select optimal parameters for new problem instances. These parameters are found for each problem and set empirically based on trends seen in the data. The parameters for this sampler are the coupling (C), the temperature (β), and the number of samples taken (Ns). We first optimize over the temperature parameter, as the other parameters have a strong dependence on the temperature chosen. The temperature parameter (β) refers to a scaling constant in the probability model, where p(v, h) = 1 Z e −βE(v,h) , and is equal to the inverse temperature seen in the physical Boltzmann distribution. As β → ∞ the distribution becomes more sharply peaked with a larger probability difference between the ground state and first excited state. Conversely, as β → 0, the distribution closer to uniform making it easier to sample from and converge to a solution. The results of the temperature analysis are shown in Figure 2 A), where β ≈ 0.25 yields the best results. We note that we only tested in increments of 0.125, as the fixed precision available on our FPGA accelerator only allowed for increments of 2 −2 . More precision is possible, but deemed unnecessary in comparison to the hardware costs.
The coupling coefficient, C is a soft constraint that forces the two copies of the logical node to be the same on the physical RBM implementation. For small values of this constraint, the two copies of the logical node are free from constraint, and an incorrect state will generally be chosen as the ground state. In the case of MAX-CUT, this generally leads to a state of 0 cut, where all nodes are the same value. In the case of the SK Problem, the ground state becomes a random state depending on the problem instance values. The coupling coefficient has a much stronger effect on the performance in the MAX-CUT problem than the SK problem, demonstrated by examining the performance for various values of C. In Figure 2 B) we see the probability of finding the ground state solution strongly peaks around the optimal C ≈ 12 for Ns = 70000, β = 0.25 for the MAX-CUT problem. For the SK problem, we see less of a sharp peak, with a more gentle decline in performance (shown in Supplementary Figure 8 A)). In addition, the large values of C in the MAX-CUT problem cause a slower mixing time leading to worse performance on the MAX-CUT problem compared to the SK problem 27,28 . If we instead use the median cut outputted by the sampler as our metric, we see a slightly different picture where the performance is very poor below a certain threshold, peaks at the optimal C, and then slowly degrades with higher value, which is shown in Figure 2 C).
Additionally, the optimal C tends to change with the number of samples taken, demonstrated in Figure 2 D). This is because small values of the C allow for the system to approach the model distribution faster via a smaller mixing time and hitting time 15,19 but this comes at the cost of having the highest probability state be a state of zero cut (See Supplementary Figure 6 and accompanying discussion for further details). The result is that for small problem sizes, where enough samples are taken such that the sampled distribution is very close to the model distribution, we see a linear relation between the problem size and the optimal C. This also corresponds to the region where the probability of reaching the ground state, p gnd ≈ 1 as shown by comparing Figure  2 D) and E). When problem sizes go past this point and the problem's sampled distribution is sufficiently far from the model distribution so it does not correctly identify the mode in all cases, the optimal coupling parameter is mostly flat.
As the number of samples taken increases, the sampled distribution approaches the model distribution, and the ground state can be correctly identified. This generally causes a smooth and monotonic increase in probability of reaching the ground state as shown in Figure 2 E). The stochastic hill climbing of the Gibbs sampling algorithm causes the mode to be correctly identified and found well before the full distribution is mapped. Additionally, as each trial is independent, many trials can be performed and a higher success probability can be reached by probability amplification 29 . We combine these properties with the Time to Solution framework used in other works 6,[30][31][32] , and adapted for the RBM in equation 2, as the standard for evaluating probabilistic accelerators. This corresponds to the 99% quantile for reaching the ground state of a given problem.
In this equation Ns is the number of samples taken, f clk the clock frequency (70Mhz for our FPGA implementation), and p gnd the probability of reaching the ground state for that particular problem. We use this equation along with the data from Figure 2 E) to create Figure 2 F) which shows the time to solution for various Ns. We note that if multiple FPGAs are available we can parallelize this scheme and further reduce the T soln . Here we use a sequential Time to Solution framework for fair algorithmic comparison to other accelerators. From the graph in Figure 2 E) we see that the optimal number of samples taken (in this framework) is generally lower than the number of samples needed to reach very high accuracy on an individual problem. We can take the lower bound on the data in Figure 2 F) and create a Pareto-optimal boundary for performance on these problems. The same parameter optimizations presented in Figure 2 for MAX-CUT are performed for the SK problem and shown in Supplementary Figure 8. Benchmarking Performance Many accelerators have been developed for solving these Ising Model problems, such as specialized ASICs 4, 33-35 , FPGA designs 36,37 , Memristor based accelerators 17, 31, 38 , Quantum Mechanical Accelerators based on quantum adiabatic processes 8-10 , Optical Parametric Oscillators 6, 39 , Magnetic Tunnel Junction 7, 40, 41 and many others. In this work we benchmark against two notable candidates: the DWave 2000Q Quantum Adiabatic Computer and the Optical Coherent Ising Machine as they represent the state of the art in two different domains of accelerators. The DWave Annealer has 2000 spins, but due to limited connectivity is only able to solve instances of the dense max-cut and SK problem up-to 62 nodes 9, 30 . The Stanford CIM is able to solve fully connected problems up to 100 nodes 6 , and the NTT CIM is able to solve fully connected problems to 2000 nodes 5 . The single FPGA instance presented in this work is able to solve up to 200 node optimization problems.
We benchmark this sampling algorithm on the Dense MAX-CUT problem and the Sherrington-Kirkpatrick (SK) Problem. The MAX-CUT problem involves separating the nodes into two groups and finding the maximal graph cut which separates the two groups of nodes. The MAX-CUT problem is mapped onto the Ising Model by setting weight matrix values wij = +1 with probability p = 0.5 and 0 otherwise. The SK problem sets the Ising Model weights to wij = +1 with p = 0.5 and wij = −1 otherwise. The problem is then transformed from the Ising Model and fully connected Boltzmann Machine into the RBM using the above methodology to form a probability distribution on the RBM, where the maximum cut for the MAX-CUT problem, or minimal energy for the SK problem, is encoded as the highest probability state. We benchmark on instances from 10 to 200 nodes in increments of 10 on the MAX-CUT problem and 10 to 150 nodes in increments of 10 on the SK problem. Each node value in both problem instances has 10 randomly generated problems, each of which is run 10000 times to generate probabilities of reaching the ground state. Problem instances for node values ≤ 150 were provided by Ref 30 , with instances >150 generated by us. Further details of problem instances are detailed in the Method section.

Discussion
Performance Comparison In Figures 3 and 4 we show results of benchmarking performance on the MAX-CUT and SK problems. In Figure 3 A) we show how the probability of reaching the ground state scales for a fixed annealing schedule. We fix Ns = 70000 for the RBM, corresponding to 1000µs of time at 70Mhz and see that the performance outperforms the other annealers at all problem sizes when given less time to solution. Using the Time to Solution framework shown in Equation 2 we convert the optimal sampling solution from where the performance on problem instances drops quickly to 0 after 50 Nodes in the MAX-CUT problem, while the RBM is still able to solve larger instances. When looking at time to solution, this accounts for a 10 6 x difference in performance at 50 nodes. When comparing to the the Coherent Ising Machine, we see very similar scaling performance for time to solution. While the Stanford CIM and the NTT CIM perform very similarly, the RBM performs at a constant ≈ 150x advantage over all problem sizes. When comparing to the simulated curve, this constant advantage becomes even more apparent.
This difference in performance for a given problem size is more pronounced when examining the SK problem. First, we note that with much less computation time (10us compared to ≈ 1000us) the RBM is able to outperform both the DWave and CIM for the given problem instances shown in Figure 4 A). This difference becomes more apparent when looking at the Time to Solution metric, where we see a 10 5 x improvement at 60 nodes compared to DWave in Figure 4 B) and a constant 10 3 x improvement against the Coherent Ising Machines in Figure 4 C). As with the MAX-CUT problem, there appears to be a scaling difference between the RBM and the DWave 2000Q, while the difference between the RBM and Coherent Ising Machine seems to be a constant factor improvement. Scaling and Connectivity One of the biggest challenges in implementing Ising Machines is creating all-to-all connectivity between nodes. When mapping arbitrary graphs onto the Ising Machine, this is a necessary requirement to build a usable machine. The CIM supports this kind of all-to-all connectivity, while the DWave 2000Q, with its limited connectivity, uses ≈ N 2 /κ where κ is the level of connectivity in the physical graph 30 , leading to a large overhead in computation. A consequence of this is that the DWave annealer has scaling performance of O(e −N 2 )) in probability of reaching the ground state, while both the CIM and the RBM based methods exhibit O(e −N ) 30,42 . The RBM based methodology, although mapping N logical nodes to 2N physical nodes, does not suffer from the same scaling problems as the DWave Machine. We believe this is because the RBM increases the number of physical nodes by a constant factor, rather than a factor that depends on the size of the input graph.
We expect similar asymptotic performance for both the SK problem and MAX-CUT as they both are NP-Hard problems, based on the Ising Model glassy spin system configuration, using the same embed-ding and underlying sampling algorithm. This is confirmed based on our experiments with both problems having an underlying O(e −N ) for probability of reaching the ground state for a fixed number of samples (see Figures  ). We note that as the SK problem requires such few samples for computation, the optimal sample point does not transition very much between different sample values. This causes the behavior to appear linear in the exponential at first glance, but we expect it to follow the O(e √ N ) behavior for larger problem sizes. Although we empirically see a probability scaling of O(e −N ) and a time to solution scaling of O(e √ N ) in the RBM annealing algorithm, we acknowledge that without a theoretical result, this scaling is not proven. More work is necessary to understand the sampling algorithm in greater detail. It should be noted, however, that the scaling principles seen in the CIM and DWave are also empirical 30,43 , thus only experimental data can be reasonably compared and care should be taken when extrapolations are taken of the data. Effect of C and graph embedding on algorithm performance The purpose of the coupling constraint (C) is to enforce the two physical copies of the logical node to be the same value. When the constraint is violated, the two physical copies have different values and the state energy is no longer proportional to the Ising Energy of the problem Hamiltonian being solved. This would imply large values of C would improve performance, but that is not what is observed. When the value of C is too large, the Markov Chain does not mix quickly and settles in local minima 15,19,28 .
As shown by comparing Figures 3 and 4, we can see that the there is a significant difference in performance between the MAX-CUT and SK problems, even when comparing the performance difference to the other annealers. This is partially caused by the difference in optimal C required for the problems. While the SK problem uses C ≈ 1 for all problems, the MAX-CUT problem has an optimal value of C ≈ 12 for the same Ns = 70000, β = 0.25. In addition, the graph embedding we use leads the MAX-CUT problem to have a high probability for a state with a cut of 0, a state which is suppressed for high values of the coupling parameter (See Supplementary Figure 6 and 7 for further discussion). We would expect that remapping the MAX-CUT problem to the RBM via a different method that requires smaller C could result in a increase in performance due to lower mixing times. Conclusion By exploiting the intrinsic parallelism present in the RBM architecture on a flexible FPGA based accelerator, we show that our sampling framework is competitive with state of the art computing machines based on novel physics. Importantly, we empirically show that there appears to be no scaling advantage for the accelerators based on novel physics, indicating that classical hardware is sufficient for solving the computationally difficult problems chosen here. We additionally show that the RBM has a large constant factor performance advantage on both of these problems. Although we chose the MAX-CUT and SK problems to benchmark and solve, all of Karp's 21 NP-Complete problems can be mapped onto the Ising Model, and the RBM using the proposed framework, in polynomial resources 3,33,44 . In addition, our usage of a FPGA framework with up to 9 bit precision arithmetic allow for many varieties of these problems to be solved, including arbitrary real world instances that the lower precision DWave and CIM would not support. The use of commodity hardware working at room temperature in a standard server setup allows for widespread adoption and usage. Further accelerator-level parallelization and scaling also become possible through the use of multi-FPGA designs and communication 22,24,45 , time division multiplexing 46 and more efficient pipeline stages 47 .
The work presented represents a proof of concept for the possibility of using parallel, stochastic computing to solve NP-Hard and NP-Complete problems. As these problems represent some of the hardest for traditional computers to solve, this approach has far reaching consequences in fields like logistics, scheduling, resource allocation, and many others. Further improvement on this methodology can be accomplished through the use of novel devices 7,17,38,40,41 for either the matrix multiplication/accumulation or the noise generation. Recently, there has been similar work which modifies the Hopfield Network with noise and variation from memristors to create stochastic behavior 31 creating a network with very similar properties to a Stochastic Boltzmann Machine, showing how the combination of stochastic sampling and a parallel accelerator can yield impressive performance gains. A dedicated accelerator for the RBM using these novel technologies would enable large scale optimization at high speed and throughput, with potential to solve some of the most difficult computational problems.
. . . . The Ising Model is mapped to an RBM by making two copies of each graph node and edge and arranging them into a bipartite graph. One copy is in the "visible" layer neurons and one in the "hidden" layer neurons, with no intra-layer edges. Each physical copy of the neuron is connected by a "coupling" parameter (C) which constrains the two copies to be the same value. (C) Due to the lack of intra-layer connections, the layers can be sampled in parallel. Each of the neurons in a layer is sampled in parallel and used to calculate the values of the opposite layer, creating a two-step sampling procedure. This sampling procedure proceeds until the output of the algorithm has reached the ground state, or until the algorithm output is of sufficient quality. (D) A demonstrative sampling run showing two different methods for interpreting the output samples from the RBM. The user can either aggregate the samples produced from the sampler and check the mode of the distribution, or can take the best individual sample outputted so far. (E) A histogram showing the output cuts after 1000 independent sampler iterations with C = 12, Ns = 70000, β = 0.25 on a 150 Node Max-CUT problem. This histogram shows that both the sampled mode procedure, and the best sample procedure output good states with high probability, but the best sample procedure generally outputs better states.
(F) Analysis of the scaling of both of these sampler types. We see that both the sampled mode and best sample procedure perform well on the MAX-CUT problem, with the best sample method performing a constant factor above the sampled mode method. The parameter β scales the weights and biases by a constant factor to change the speed of convergence of the sampler. We settle on β = 0.25 as the optimal parameter, which is used for all experiments in this paper. (B) We show the performance on varying problem sizes for varying coupling parameters at a fixed β = 0.25 and Ns = 70000. This shows that for the MAX-CUT problem the C is generally optimal at C ≈ 12 for most problem sizes. (C) Although the probability of reaching the ground state is sensitive to the coupling parameter, the median cut outputted from the algorithm tends to be very close to the optimal value for a large range of coupling values. Below a certain value, the median value is very low, but it undergoes a sharp transition to its peak cut value, before slowly degrading. (D) The number of samples taken also increased the optimal coupling value for a given value. This is a consequence of the mixing time of the underlying distribution, where smaller coupling values mix faster but output statistics that are further from the ideal distribution for the problem. Compared to the MAX-CUT problem, the RBM performs considerably better on this problem instance, only requiring 10 µs to get to the ground state with very high probability. This is compared to DWave and the CIM requiring 100x the anneal time to get close to this performance. (B) Compared to the DWave 2000Q, we see a performance increase of 10 5 on large problem instances with better asymptotic performance on the RBM in these problem instances. The lack of connectivity for the DWave annealer contributes to the drop in performance on these instances as many logical copies need to be made to accommodate the fully connected SK graph. (C) The RBM also compares very favorable to the two instance of the Coherent Ising Machine 5, 6 , with a 1000x time to solution difference on the largest problem instances. The scaling performance of these two problems also suggests that the RBM will continue its constant factor performance increase for much larger instances of the SK problem.

METHODS
Mapping Ising Problems onto the RBM Ising model problems take the form of a minimization on of an energy function (E(s)) over bipolar states (s ∈ {−1, 1} n ), that usually takes the form of Equation 3. We will denote quantities relating to the original Ising model with an I subscript, and quantities related to the RBM with an R subscript (i.e. n R is the number of visible nodes in the RBM, while n I is the number of nodes in the original Ising model).
The RBM however maps a probability distribution over binary variables (v ∈ {0, 1} n , h ∈ {0, 1} m ) without intra-layer connections. This takes the form shown below.
The RBM has twice the number of nodes as the original Ising model graph, with n R = m R = n I = n, and each visible node having a corresponding hidden node that should hold the same value in the ground state of the RBM model (i.e. v i = h i for the RBM ground state). With this, we can set the weights by equating the energy of an Ising Model state to the the energy in an RBM state (causing minimal energy Ising states to be maximal probability RBM states), and collecting terms related to the same logical node, shown below. We first show results for an RBM with bipolar states, and then show the transformation between bipolar states and binary states.
Based on this equality, we set each w ij = w ji = −J ij and b i = c i = − 1 2 a i . This sets the correct probabilities for an RBM for states where v i = h i , but the probabilities for states where v i = h i should be correctly penalized. To do this we set ∀i, w ii = −C, which penalizes states so that E R (s, x) E R (s, s), ∀x = s. The size of C is problem specific and is empirically analyzed to find the optimal parameter for a particular problem instance.
The above transformation works for RBMs with bipolar states ({−1, 1}), but RBMs traditionally use an energy function over binary variables ({0, 1}). The transformation between these two is straight forward, and shown below for an arbitrary Ising model. In the below equation W * , b * correspond to Ising models where s ∈ {0, 1} n , W, b are the original weights and biases for the Ising model with s ∈ {−1, 1} n , and 1 is a vector of 1s.
Depending on the type of accelerator architecture available, mapping between bipolar and binary states is trivial to perform, and represents minimal computational overhead. In this work, the RBM accelerator was designed to use binary states, and the Ising model problems were adapted for that. Hitting Time Engine The hitting time engine calculates the approximate log probability for each sample that the stochastic sampling algorithm outputs and keeps track of the highest probability state seen so far. This method works to offload the computation of calculating probabilities or aggregating results from the CPU to the FPGA. The FPGA can operate in raw sample output or hitting time engine output, where it either pushes the raw samples to the CPU or the highest probability state seen. The hitting time engine uses partial computations from each cycle to reduce computational overhead for the FPGA. To do this, we first look at the log probability for a given visible node state.
The log(Z) term is a normalizing constant and can be ignored if we are only comparing probabilities between samples. Additionally, the log(1 + e x ) term is simplified as follows.
This simplification is valid for x 0 and x 0, but introduces errors when x ≈ 0. These errors are not significant in the probability calculation, as the largest contributions to the probability mass are for x 0. The n i=1 a j + w ij v i is calculated each cycle to update the hidden units and is thus recycled for calculation of the overall log probability of the given state. This means the only calculation the hitting time engine has left to do is accumulation of the visible biases and accumulation of the thresholded sums that have been pre-calculated by the hidden neurons. The hardware overhead for the hitting time engine is very small due to the efficient usage of these pre-calculated sums. The hardware usage translates to < 1% of additional FPGA LUT utilization and < 1% of additional FPGA flip flop usage (see Supplementary Table 1). FPGA Programming All programming was done using the Xilinx Vivado suite on the on the Xilinx Virtex UltraScale+ XCVU9P-L2FLGA2104 following Ref 16 . All weights and biases were stored in on-chip SRAM and communication done over PCIe using through a Xillybus IP Core 48 . The design from that work was slightly modified to widen the bit count from 8 bits to 9 bits. This was used to test effects of various parameters and have sufficient dynamic range to perform experiments. Along with this, partial sums were stored for use in the hitting time engine.
The hitting time engine is split into two modules, each calculates the log probability for every other cycle. Each module is composed of an accumulator which takes the partial sums from the hidden nodes and accumulates half of them each cycle along with the visible biases. By splitting the calculation over two clock cycles we are able to meet the 70Mhz timing requirements set by the rest of the design. The hardware cost of these accumulators is approximately the same as an additional visible node (see Supplementary  Table 1). We additionally add an extra function in the hitting engine, where we ignore any output that corresponds to a zero cut state. The results of this additional feature are shown in Supplementary Figure 7, and function to increase the average output cut from the sampler. Problem Instances and Validation Problem instances for MAX-CUT and SK and performance data for the CIM and DWave accelerators for sizes ≤ 150 were provided by Ref 30 and validated by them. Their ground state results are in agreement with ours, as in many runs we were unable to find states with Ising Energies lower than the ground state solutions provided. For the problem instances we generated on MAX-CUT instances, we generated random graphs with edge density 0.5 and tested the MAX-CUT instances on our solver. As exact solvers, such as BiqMac 49,50 , time out for instances of size > 100 nodes (corresponding to 3 hours of computation time) we were unable to confirm a provable ground state solution to these instances. Instead, we compared the best solution generated by the RBM based solver with best output from the solver implemented in Ref 51 and found agreement between those two solutions. As the best solution instances from these two solvers are in agreement and follow the curve that from smaller problem instances, we are fairly certain the RBM is finding the ground state solution.
To characterize performance of the RBM on the MAX-CUT and SK, each problem instance was run 10000 times and the algorithm output was checked against the ground state solution for that problem. Each problem size had 10 randomly generated instances to test on and results were aggregated based on their performance. Error bars on each of the graphs were generated by calculating the bootstrapped, two tailed, 95% confidence interval for the parameter being estimated. Solution time calculations for the RBM do not include pre-processing or post-processing of sample data, only the raw computation time required to solve the underlying problem. This is consistent with the methods used for the other annealers. Anneal times for the CIM use the base anneal time without a parallel sampling protocol to fairly compare to the RBM instances and allow for analysis of scaling. above, we can see how many iterations are needed to reach the ground state for a given number of samples. We can take the lower bound of the graph at each problem size to find the optimal time to solution for the RBM sampling algorithm as a whole. (C) By finding the sample number for the minimum at each problem size in (B) we can find the optimal sample number for each problem size. This graph shows the e N increase in optimal sample number as problem size increases. The discrete jumps shown here are due to the discrete number of samples taken, and would become smooth as this procedure is interpolated for more sample values. For N = 50, we can tell that the Ns is large enough such that the sampler has fully mixed, as the sampler performance peaks when p soln > pzero and the probability of correct ≈ 1. (B) For N = 100, we see that the sampler is further from convergence as the peak performance no longer approaches 1, and the optimal coupling parameter is for a state where p soln < pzero. (C) As the problem size increases to N = 150 we see that the sampler is even further from convergence as p soln pzero at the optimal coupling value.
For the MAX-CUT problem there is an intuitive explanation for the role of the coupling parameter. When C = 0, the two physical copies of a node in the original graph are completely disconnected resulting in the two states not having any direct effect on each other. The maximum cut in this degenerate graph is the one which separates the hidden from the visible nodes and passes through all of the edges and where v = We refer to this state as the "zero cut state" as it corresponds to a state that has zero cut in the original Ising Model graph. As C increases, the relative probability of this state decreases compared to the actual MAX-CUT state for the original Ising Model graph. However, large C causes slower mixing rates, which means that the performance of the sampler tends to peak significantly before the solution state has a higher probability than the zero cut state. This also serves as a good proxy for how close the sampler is to the model distribution, as the probability of reaching the ground state should peak when the MAX-CUT state has higher probability than the zero cut state if the sampled distribution is close to the model distribution. In Figure 6 we show this phenomenon, where for Ns = 70000 we can see the regions of operation. For N = 50 we can see the sampled distribution is very close to the model distribution, as the probability of reaching the ground state peaks when the ground state probability is larger than the zero cut probability and we are able to reach the ground state in almost all instances. For the N = 100 and N = 150 instances we can see the sampler is further away from the the model distribution as the probability of reaching the ground state decreases and the sampler performance peaks for smaller coupling values where p soln pzero. We add an extra constraint in the hitting time engine to ignore all states that have an output cut of 0 (i.e. states with all 0 or all 1). The output is now concentrated very strongly around the optimal cut value, and all zero cut states are suppressed. (C) Although the addition of the exclusion of zero states improves the average cut from the sampler, the probability of reaching the ground state remains mainly unchanged across all problem sizes.
Although the RBM has a relatively high probability of reaching the ground state (≈ 30% here), many of the sampling runs end by reaching a state with 0 cut value, where all nodes are on the same side of the cut and have the same value as shown above in Figure 7 A). This occurs when the coupling constraint is violated and the two physical nodes in the RBM are different for each logical node. Shown here for the optimal coupling value, we can see that the zero cut states still retain a very high probability causing a decrease in overall performance of the sampler.
To combat this, we add an extra constraint into the hitting time engine to ignore all states of zero cut. This moves the average output cut up, improving the output of the sampler, shown in Figure Taken (SK)   samps  7  22  70  221  700  2213  7000  22135  70000   20  40  60  80  100  120  140 Problem Size The SK problem has an inherent symmetry, as it has both +1 and -1 connections, causing the optimal coupling parameter to be lower for this type of problem. (B) The optimal coupling value for maximizing the median cut follows the optimal coupling to find the ground state. Similar to the MaxCut problem, the median cut remains high and is not as sensitive to the coupling parameter as the probability of reaching the ground state. (C) As above, in the MAX-CUT problem, the probability of reaching the ground state improves smoothly as more samples are taken. However, more samples takes more time, and we can optimize for the number of samples to take. The performance is significantly better on the SK problem, compared to the MAX-CUT problem when performed on the RBM. (D) Using the time to solution framework outlined in the Results section above we can use the sampling performance in part (C) and find the time to solution for a given sample number. Here, we see fewer than 2000 samples should be taken across all problem sizes, much lower than the MAX-CUT problem.
The minimum across all sample numbers is taken to find the global time to solution for the RBM.  Table 1: (Supplementary) FPGA Utilization Utilization numbers for FPGA and various RBM sizes All usage numbers reported are for 9 bit weights and biases, including the hitting time engine (unless otherwise noted). The usage shows that the FPGA is not memory limited for the problem sizes we are interested in, but compute limited, as the LUT usage goes up much faster than the FF usage as the problem size grows. All weights and biases fit in on chip SRAM, allowing for fast access and data reuse. This also shows that the hardware overhead of the hitting time engine is minimal and should be included in designs to increase algorithmic performance. Simulated Annealing is performed on a Xeon E5-2620 processor using the code from 51 , while the Parallel Tempering results are copied from 30 using the NASA/TAMU Unified Framework for Optimization, running on a Xeon E5-1650 v2 processor. The RBM performs better than both algorithms for small problem instances on the MAX-CUT problem (closer to 5x improvement), but converges for the larger instances presented in the dataset. Although parallel tempering narrowly outperforms the RBM on large MAX-CUT instances, the empirically seen O(e √ N ) scaling of the RBM is asymptotically favorable to the O(e N ) scaling of the parallel tempering algorithm. The RBM is able to outperform both these algorithms on the SK problem across all problem instances, demonstrating a performance advantage for problems with full connectivity. The RBM performs competitively with these state of the art algorithms, but more work needs to be done to increase the performance relative to these baselines. Many of the optimizations used in the simulated annealing (pre-computing energies, using weight sparsity, efficient random number generation) can be implemented with the RBM as well to increase its performance. Parallel tempering can also be added to the RBM sampling algorithm to yield improved sampling, which we expect to improve the overall algorithm scaling and performance. 54,55