## In-memory search system using a 1S1R memristor array

The similarity search is implemented with a memristive system as shown in Fig. 2a (the circuit diagram and a photograph of the hardware system are shown in fig. s1). This system consists of an encoder that maps real-valued vectors into binary codes and a TCAM circuit that calculates the HD between the query and the stored data. In this work, we built the IMS system using V/HfOx/HfO2/Pt 1S1R devices organized in a 32×32 crossbar array (Fig. 2b). As illustrated in the cross-sectional transmission electron microscopy image of the device stack (Fig. 2c), the interfacial VOx layer between V and HfOx serves as a volatile selector with highly reproducible insulator-metal-transition characteristics23. In addition, the VOx-based selector raises the overall resistance, shifting the low resistance state (LRS) of the whole device into a sub-high resistance region (90 kΩ ~ 243 kΩ) while the selector is in the off state (V < |Vth,s|, ranges C and D), as depicted in the typical 1S1R I-V curves over 100 cycles (Fig. 2d). This higher overall resistance reduces the search energy compared with pure binary memristors. Meanwhile, the 1S1R device maintains a stable, large resistance ratio of 17.3 between the high resistance state (HRS) and the sub-high resistance state even after $10^{10}$ switching cycles and $10^4$ s of retention (Fig. 2e-f), providing a wide and reliable sense margin for the TCAM (more details about the device can be found in fig. s2 in the Supplementary Information). In addition, the SET/RESET switching energy is as low as 17.5 fJ/538 fJ, and the read energy is 0.23 fJ/0.013 fJ in the LRS/HRS, enabling the ultralow-power operation of the IMS system.

## Memristor-based hardware locality-sensitive hash encoder

Locality-sensitive hashing (LSH) is widely used for dimensionality reduction in approximate nearest neighbour search, and is usually described by the random projection operation expressed by Eq. (1), where x is the real-valued feature vector with dimensionality *d*×1, W is an *n*×*d* Gaussian random projection matrix whose elements follow a zero-mean Gaussian distribution, b is a random offset vector of size *n*×1, and y is the binary code obtained from the sign function, which outputs 1 when Wx + b is positive and 0 otherwise. For simplicity, in this study, W also represents [W b] and x also represents [x 1]T (further discussion can be found in fig. s3 in the Supplementary Information).
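Written out explicitly from the definitions above, the random projection of Eq. (1) reads

$$y=\operatorname{sign} \left( {Wx+b} \right)$$

1

where sign(·) is applied element-wise, outputting 1 for a positive argument and 0 otherwise.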

According to the experimental results, the conductance values of the 1S1R devices in the HRS follow a lognormal distribution, as shown in **Fig. 3a**, which is consistent with the results of prior studies24, 25. In addition, we found that the conductance difference (*G*+ - *G*-) of any pair of devices follows a Gaussian distribution with a near-zero mean (Fig. 3b; more evidence about the distribution can be seen in fig. s2). Therefore, the conductance difference matrix *G* of a 1S1R array in the HRS can be regarded as a natural Gaussian random matrix *W*. The matrix *W* could also be represented with an LRS array, because the LRS follows a Gaussian distribution25 and thus the conductance difference (*G*+ - *G*-) of the LRS follows a Gaussian distribution too; for lower computing energy, the HRS is adopted in this work. Based on these results, a memristive random projection circuit is designed, as shown in Fig. 3c. The conductance difference of each pair of rows represents a row vector of *W*. The input *x* is mapped to voltage pulses *V*x applied to the array in parallel, and the output differential currents are the result of the vector-matrix multiplication according to Ohm's law and Kirchhoff's current law. Subsequently, by connecting each pair of differential rows to the same comparator, the comparators implement the subtraction and the sign function at the same time. After the above process, an *n*-bit binary code *V*y is finally obtained according to the circuit equation, Eq. (2). The advantage of the hardware encoder is that the functions of the random number generator and the vector-matrix multiplier are realized simultaneously, in one step, in the same circuit.
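One explicit form of the circuit relation consistent with the description above (taking *V*y,*i* as the comparator output for the *i*th row pair) is

$$V_{y,i}=\operatorname{sign} \left( {\sum\limits_{{j=1}}^{d} {\left( {G_{{ij}}^{+} - G_{{ij}}^{-}} \right){V_{x,j}}} } \right)$$

2

where the weighted sum is realized physically by Ohm's law and Kirchhoff's current law, and the sign by the comparator.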

Figure 3d illustrates the encoding process for the iris dataset, which contains 150 4×1 feature vectors for three types of irises (setosa, versicolor and virginica). For simplicity, only the features of calyx width, petal length and petal width are selected, and an additional 150×1 all-ones vector is grouped with the 150×3 feature matrix for the offset. The grouped feature vectors are mapped into analogue voltages and then applied to the four rows of the array. The 150 16-bit codes Vy are obtained with the 4×32 memristor array and the comparators. However, it can be observed that there are many common bits in these raw codes (more detailed information can be found in fig. s3 in the Supplementary Information), which are redundant for distinguishing the different types of irises and waste computing resources.
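The encoding step can be emulated in software. The sketch below is a minimal stand-in for the hardware encoder: a zero-mean Gaussian matrix plays the role of the conductance differences (G+ - G-), and a sign threshold plays the role of the comparators. The feature values and the 16-bit code length are illustrative assumptions, not measured data.

```python
import numpy as np

def lsh_encode(X, n_bits, rng):
    """Software sketch of the hardware LSH encoder: a zero-mean Gaussian
    matrix W stands in for the conductance differences (G+ - G-) of the
    memristor array, and the comparators implement the sign function."""
    # Append a constant-1 column so the offset b is absorbed into W,
    # mirroring the all-ones vector grouped with the feature matrix.
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    W = rng.normal(0.0, 1.0, size=(X.shape[1] + 1, n_bits))  # Gaussian projection
    return (Xa @ W > 0).astype(np.uint8)                     # sign -> {0, 1} codes

# toy stand-in for the 150x3 iris feature matrix, encoded into 16-bit codes
rng = np.random.default_rng(0)
X = rng.random((150, 3))
codes = lsh_encode(X, 16, rng)
print(codes.shape)  # (150, 16)
```

Because W is drawn once and reused for every sample, nearby inputs tend to receive nearby codes, which is the similarity-preserving property exploited later.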

To obtain more compact codes, a common bits compression (CBC) approach is proposed in this work. The non-common bits are extracted by bitwise accumulation: since the distributions of 0 and 1 in a common bit are significantly unbalanced, the sum of a common bit over all codes falls outside the range [bottom threshold (THB), top threshold (THT)]. Only the bits whose sums lie within the range [THB, THT] remain. After the CBC, the final codes consist of only 4 bits (reduced from 16 bits). With the 4-bit codes, setosa was successfully distinguished. However, versicolor and virginica are too similar to be distinguished with such short codes, implying that a longer code length is required.
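The CBC step amounts to a column filter over the code matrix. The sketch below illustrates it with hypothetical threshold values (THB = 30, THT = 120 for 150 codes) and synthetic codes in which four common bits are forced; the thresholds in the actual system would be chosen from the data.

```python
import numpy as np

def cbc(codes, th_bottom, th_top):
    """Sketch of common-bits compression (CBC): bits whose column sum
    over all codes falls outside [THB, THT] are (nearly) constant across
    the dataset and are dropped as uninformative."""
    sums = codes.sum(axis=0)                       # bitwise accumulation
    keep = (sums >= th_bottom) & (sums <= th_top)  # non-common bit positions
    return codes[:, keep], keep

rng = np.random.default_rng(1)
codes = rng.integers(0, 2, size=(150, 16), dtype=np.uint8)
codes[:, :4] = 1                                   # force four common bits
compact, keep = cbc(codes, th_bottom=30, th_top=120)
print(compact.shape[1])  # 12 of the 16 bits remain
```

The forced all-ones columns sum to 150, fall outside [30, 120], and are discarded, while the balanced columns survive.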

To verify the similarity-preserving capability, we simulated the encoder with all four features of the iris dataset. Figure 3e and Fig. 3f illustrate the similarity matrix of the original real-valued features based on Euclidean distance and the similarity matrix of the 32-bit codes based on normalized HD, respectively. The HD-based similarity matrix accurately describes the similarity relationships among the data, indicating that the similarity information is not distorted by the encoding. This is also confirmed by the preservation of the inter-class and intra-class distances shown in Fig. 3g: items that are close in Euclidean space remain close in Hamming space. According to our simulation results, for the iris dataset, the effective code length *m* after compression is approximately a quarter of the raw code length (*m* = *n* / 4), as shown in Fig. 3h. In addition, Cosine distance can also be replaced by HD using the same operations but without the offset b (fig. s5). In short, we have shown that by using the hardware encoder and the CBC-LSH scheme, the Euclidean distance and Cosine distance can be approximately replaced by the HD and calculated on the binary CAM.

## 2S2R TCAM for Hamming distance computing acceleration

For binary data, HD is a more efficient distance metric than the Euclidean and Cosine distances, as it can be computed quickly with bitwise operations. The calculation of HD is given by the XOR-accumulation operation expressed by Eq. (3), where *Q**i*/*P**i* denotes the *i*th bit of the *m*-bit code *Q*/*P*. TCAMs based on non-volatile devices have shown powerful performance in memory-centric applications such as pattern matching26, 27, data query28, tree-based machine learning29, 30 and memory augmented neural networks31, 32, owing to their ultra-high parallelism in performing the HD-based search. However, most prior memristor-based TCAM structures are still limited by a small sense margin and a large area overhead33, 34: the former depends on the device's HRS/LRS ratio, and the latter results from the use of transistors. Additionally, the search energy is also a key performance metric. Accordingly, our 1S1R device offers a stable, large HRS/LRS ratio and a compact integrated structure, providing a sufficient sense margin and a small cell footprint. In addition, the high overall resistance of the 1S1R device enables the low search energy of the TCAM.

$$HD(Q,P)=\sum\limits_{{i=1}}^{m} {{Q_i} \oplus {P_i}}$$

3
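In software, Eq. (3) is a one-line XOR-and-count, shown here as a reference against which the TCAM's parallel computation can be checked:

```python
import numpy as np

def hamming_distance(q, p):
    """Eq. (3): XOR the two binary codes bitwise and accumulate."""
    return int(np.count_nonzero(np.bitwise_xor(q, p)))

q = np.array([1, 1, 1, 1, 1, 1, 1, 1], dtype=np.uint8)
p = np.array([0, 1, 1, 1, 1, 1, 1, 1], dtype=np.uint8)
print(hamming_distance(q, p))  # 1 (a 1-bit mismatch)
```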

Figure 4a schematically illustrates the structure of the proposed 2S2R TCAM and the corresponding definition of the states. For the data *P* stored in the TCAM, 1 and 0 are defined as HRS-LRS and LRS-HRS, respectively, while for the query data *Q*, 1 and 0 are converted into two search voltage (Vs, 0.1 V) pulses (Vs-0 and 0-Vs) applied to the left search line (SL) and the right search line (\(\overline {{{\text{SL}}}}\)), respectively. For the *don't care* state X, *P* is defined as HRS-HRS; when *Q* = X, the two SLs of the corresponding column are floated. The voltage sense amplifier (SA) senses the ML voltage (VML) and outputs a digital voltage by comparing VML to a reference voltage (VREF). In this cell structure, the ML is hardly charged when the query data *Q* matches the stored data *P*. For instance, suppose a 0 is written into the TCAM cell, i.e., R1 is in the LRS and R2 is in the HRS. To search for 0, the SL (LRS device) and \(\overline {{{\text{SL}}}}\) (HRS device) are applied with 0 V and Vs, respectively; the ML voltage is pulled down to a level of approximately 0, which denotes the match case, as shown in Fig. 4b. On the contrary, when searching for 1, Vs is applied to the SL and 0 V to the \(\overline {{{\text{SL}}}}\); VML is therefore pulled up to a level close to Vs, which denotes the mismatch, as shown in Fig. 4c.

For a single TCAM cell, VML represents the result of XOR(*Q*, *P*). Further, for a TCAM array with *m* cells in each row, VML corresponds to the HD between the *m*-bit stored data *P* and the query data *Q* when the *m* search voltages are applied to the SLs simultaneously (S3). Figure 4d illustrates the structure diagram of the TCAM array; the detailed circuit structure and the relationship between VML and HD are illustrated in fig. s6. For such a TCAM, before applying the search voltages to the SLs, the MLs must be initialized by discharging (VDCH). The search voltages are then applied to the SLs, and the MLs are charged to voltages that depend on the number of matching bits between the stored data *P* and the query data *Q*. Subsequently, the SA starts working upon the enable signal (VSAEN) and generates a digital output representing the match or mismatch result according to VREF: if VML is lower than VREF, the node *out* outputs a high voltage near VDD (match case); otherwise, it remains at a low voltage close to 0 (mismatch case). To verify the circuit, simulations were performed on a 32×32 TCAM array with 32 cells on each ML according to the experimental results; the simulation results of the 32-bit TCAM are shown in fig. s6b and fig. s6c in the Supplementary Information. For clarity, here we demonstrate the search operation for the 8-bit code 11111111. The TCAM stores nine codes: 00000000, 00000001, ..., and 11111111. The voltage signals of the query data (11111111) were then applied to the array. Figure 4e shows the simulation results of the search operation in the 1-bit-mismatch (01111111, HD = 1) and full-match (11111111, HD = 0) cases. After initialization, the search operation was completed within 6 ns. Figure 4f shows the voltage signals of the nine MLs, which denote the degree of match between the query and the stored data; they can be clearly distinguished and easily sensed by the SA or by analogue-to-digital converters.

As an essential performance merit, the sense margin (ΔVML, defined as the ML voltage difference between a full match and a 1-bit mismatch) of the 2S2R TCAM mainly depends on the HRS/LRS ratio, the number of cells per ML, and Vs (given by equation S9). Figure 4g depicts the relationship between the sense margin and the HRS/LRS ratio for different array sizes: a large sense margin requires a small array and a large HRS/LRS ratio. In addition, owing to resistance variations, the actual sense margin is usually narrower than in the ideal case, as shown in fig. s6d. To improve the robustness of the TCAM, we also carefully calculated the optimal reference voltage for the SA (a more detailed analysis can be seen in fig. s6d).
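Under a simplified steady-state divider picture (a sketch, not the paper's exact equation S9), the sense margin scales as

$$\Delta {V_{{\text{ML}}}}={V_{\text{s}}}\,\frac{{{G_{{\text{LRS}}}} - {G_{{\text{HRS}}}}}}{{m\left( {{G_{{\text{LRS}}}}+{G_{{\text{HRS}}}}} \right)}}=\frac{{{V_{\text{s}}}}}{m} \cdot \frac{{r - 1}}{{r+1}},\qquad r=\frac{{{R_{{\text{HRS}}}}}}{{{R_{{\text{LRS}}}}}}$$

which captures the trend of Fig. 4g: the margin grows with the HRS/LRS ratio *r* (saturating at Vs/*m* for large *r*) and shrinks inversely with the number of cells *m* per ML.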

Another performance merit is the search delay, which is associated with the parasitic capacitance of the ML (CML) and the resistance of the devices. According to the equivalent RC model of the 2S2R TCAM (fig. s7), the search delay is dominated by the LRS and CML. The CML depends on the materials and the wire feature size and thus cannot be tuned once fabricated, but the device resistance can be reprogrammed to meet various demands in practice. The results in Fig. 4h suggest that a lower LRS provides a shorter search delay but results in higher search energy and more severe IR drop (fig. s8). Therefore, there is a trade-off between the search delay and the power consumption. In addition, the case of the X match was also considered and successfully demonstrated (fig. s9).
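The trade-off can be sketched with order-of-magnitude estimates: the charging delay scales as R_LRS·CML while the static search energy per cell scales as Vs²·t/R_LRS, so the two pull in opposite directions as the LRS is reprogrammed. The CML and search-time values below are illustrative assumptions, not values from the paper.

```python
def search_tradeoff(r_lrs, c_ml=5e-15, vs=0.1, t_search=6e-9):
    """Order-of-magnitude sketch of the delay/energy trade-off.
    c_ml (match-line capacitance) and t_search are assumed values."""
    delay = r_lrs * c_ml                 # RC charging time constant
    energy = vs ** 2 / r_lrs * t_search  # per-cell static search energy
    return delay, energy

for r in (10e3, 150e3):
    d, e = search_tradeoff(r)
    print(f"LRS = {r/1e3:.0f} kOhm: tau ~ {d*1e12:.0f} ps, E ~ {e*1e15:.2f} fJ/cell")
```

Lowering the LRS from 150 kΩ to 10 kΩ shortens the RC constant by 15× but raises the static energy by the same factor, consistent with the trend of Fig. 4h.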

Figure 4i shows a comparison of the 2S2R TCAM with previous counterparts (more details can be found in table s1 in the Supplementary Information). Our passive TCAM shows significant improvements in terms of cell area and search energy. First, the cell area of our TCAM is 8F² (2×4F²) in the ideal case, whereas a previous study achieved a 16.3F² cell area with a 2-diode-2-resistor (2D2R) structure35. Second, owing to the selector, the 1S1R device shows a sub-high resistance after being SET-switched to the LRS when read with a voltage (Vs, 0.1 V) lower than Vth,s (~ 1 V). That enables an ultra-low search energy of 0.25 fJ per bit per search, although the higher resistance also results in a longer search latency of about 6 ns. Third, if the 2S2R TCAM is used in speed-first scenarios, it can work in the R mode by applying a search voltage higher than Vth,s to turn on the selector. In this case, the LRS resistance is the actual value of the memristor, as shown in Fig. 2a. With the lower LRS resistance, the search latency can be further reduced, according to the results in Fig. 4h.

## In-memory similarity search for clustering and classification

*K*-means clustering and *k*-nearest neighbour (*k*-NN) classification are two of the most important and typical similarity-measurement-based algorithms in data mining. In this work, the nearest-neighbour similarity search based on HD is described by Eq. (4), where *P**i* (also written *P*[*i*]) denotes the *i*th item in the database. Based on Eq. (4), classification can be described by Eq. (5), which is the key operation of *k*-means and *k*-NN, where *Class**pred* is the classification result.

$${Q_{pred}}=P\left[ {\mathop {\arg \hbox{min} }\limits_{{i \in \{ 1, \ldots ,v\} }} HD\left( {Q,{P_i}} \right)} \right]$$

4

$$Clas{s_{pred}}=\mathop {\arg \hbox{min} }\limits_{{i \in \{ 1, \ldots ,c\} }} HD\left( {Q,{P_i}} \right)$$

5

To find the item *Q**pred* whose distance to the query *Q* is the smallest among the stored data *P*, an adjustable threshold scheme40 was adopted in this study. As shown in Fig. 5a, based on the TCAM circuit, the VREF applied to the SA was changed from a constant voltage to a voltage scanned from 0 to Vs; the row whose output state flips first corresponds to the nearest object (fig. s10). Based on this scheme, we demonstrated *k*-means clustering on the iris dataset encoded with different code lengths. Longer codes achieve higher average search accuracy and robustness (Fig. 5b). An accuracy of 87.9% was achieved with 16-bit codes, which is very close to the software accuracy of 88.7% based on Euclidean distance, even within three iterations, as shown in Fig. 5c. The inset graph shows the average number of iterations required for different code lengths, indicating that *k*-means requires more iterations for longer codes. To further investigate the performance of the IMS system in high-dimensional space, we simulated *k*-means on the ISOLET dataset, which contains 617 phonetic features for the 26 spoken letters from 300 subjects; it achieves an accuracy of 60.2% with 1536-bit codes, slightly higher than the accuracy of the software method based on Euclidean distance. As shown in Fig. 5d, compared with hyperdimensional computing41, our IMS system exhibits great superiority, especially in terms of code length. For low-dimensional data (iris), hyperdimensional computing still requires 10,000 bits to achieve an accuracy close to that of the Euclidean distance scheme, whereas our scheme requires only 16 bits. For high-dimensional data (ISOLET), hyperdimensional computing clustering nearly fails (33.1% accuracy), whereas our scheme shows excellent performance (60.2% accuracy) even with a shorter code length.
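A minimal software sketch of *k*-means over binary codes is shown below. The assignment step uses HD, which is what the TCAM search accelerates; updating each centroid by a bitwise majority vote (so that centroids stay binary and storable in the TCAM) is an assumption of this sketch, not necessarily the paper's exact update rule. The data are synthetic.

```python
import numpy as np

def kmeans_hamming(codes, k, iters=3, seed=0):
    """Sketch of k-means over binary codes: assignment by Hamming
    distance (the TCAM search), centroid update by bitwise majority
    vote so that centroids remain binary."""
    rng = np.random.default_rng(seed)
    centroids = codes[rng.choice(len(codes), k, replace=False)]
    for _ in range(iters):
        # HD from every code to every centroid, then nearest-centroid labels
        dists = (codes[:, None, :] != centroids[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = codes[labels == j]
            if len(members):
                centroids[j] = (members.mean(axis=0) >= 0.5).astype(np.uint8)
    return labels, centroids

rng = np.random.default_rng(0)
codes = rng.integers(0, 2, size=(150, 16), dtype=np.uint8)  # e.g. 150 16-bit iris codes
labels, cents = kmeans_hamming(codes, k=3)
print(labels.shape, cents.shape)  # (150,) (3, 16)
```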

Furthermore, we demonstrated *k*-NN classification with our IMS system. The classification accuracy on the iris dataset for different code lengths is shown in Fig. 5e. The 32-bit codes provide an accuracy of 95.9% in the hardware implementation, very close to the software accuracy of 96.1% based on Euclidean distance. To investigate the robustness of our scheme, Monte Carlo simulations concerning the device-to-device (D2D) resistance variation and the device HRS/LRS ratio were performed, as shown in Fig. 5f. Note that when the device HRS/LRS ratio is larger than 10, the *k*-NN is almost insensitive to variations within 200% (fig. s11). Moreover, we simulated *k*-NN on a random subset of the Mixed National Institute of Standards and Technology (MNIST) handwritten digits dataset, where each 28×28 picture is flattened into a 784-dimensional vector. The accuracy for different *k* is shown in fig. s12; the system performs well even at a 500-bit code length, with an accuracy of 89%, which is higher than the maximal accuracy obtained by *k*-NN based on Euclidean distance (86%). In addition, for high-dimensional sparse text vectors (7599 dimensions) using Cosine distance, a high accuracy of 97% in BBC sports news classification was obtained with only 2000-bit codes (fig. s13). These results further prove that the CAM can approximately calculate the Euclidean and Cosine distances after the real-valued vectors are transformed into binary codes using the proposed CBC-LSH.
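The *k*-NN step itself reduces to ranking stored codes by HD, which the TCAM performs in parallel by sweeping VREF; the sketch below is a software reference with synthetic codes and labels (the class assignment here is illustrative, not iris data).

```python
import numpy as np
from collections import Counter

def knn_hamming(query, codes, labels, k=3):
    """Sketch of k-NN classification via Hamming distance: rank the
    stored codes by HD to the query and take a majority vote over the
    k nearest labels."""
    dists = (codes != query).sum(axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]  # stable: ties keep storage order
    return Counter(labels[nearest]).most_common(1)[0][0]

rng = np.random.default_rng(0)
codes = rng.integers(0, 2, size=(150, 32), dtype=np.uint8)  # 150 32-bit codes
labels = np.repeat(np.arange(3), 50)                        # three balanced classes
pred = knn_hamming(codes[0], codes, labels, k=5)
print(pred)
```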

To reduce the computational complexity of querying an unknown object in the database, classification is performed first by identifying the nearest class centroid, which was calculated a priori with *k*-means. A *k*-NN search is then performed over the data belonging to the same class as the query object, as shown in Fig. 5g and 5h. We simulated this two-stage search process with the iris dataset: the three centroids were calculated and stored in the first-stage TCAM array, all samples of the same class are stored in the same array at the second stage, and the NN search and the *k*-NN search were implemented by adjusting VREF. This method avoids an aimless search over the entire database and significantly reduces the search energy and delay35.
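The two-stage flow can be sketched as follows: stage 1 is a nearest-neighbour search over the (few) class centroids, and stage 2 is a *k*-NN search restricted to the selected class. The synthetic codes and the majority-vote centroids below are illustrative assumptions standing in for the iris codes and *k*-means centroids.

```python
import numpy as np

def two_stage_search(query, centroids, class_codes, k=3):
    """Sketch of the two-stage scheme: stage 1 finds the nearest class
    centroid (one small TCAM search); stage 2 runs k-NN only inside
    that class, avoiding a scan of the whole database."""
    cls = int((centroids != query).sum(axis=1).argmin())  # stage 1: NN over centroids
    codes = class_codes[cls]                              # stage 2: within-class k-NN
    nearest = np.argsort((codes != query).sum(axis=1), kind="stable")[:k]
    return cls, nearest

rng = np.random.default_rng(0)
class_codes = [rng.integers(0, 2, size=(50, 16), dtype=np.uint8) for _ in range(3)]
centroids = np.stack([(c.mean(axis=0) >= 0.5).astype(np.uint8) for c in class_codes])
cls, idx = two_stage_search(class_codes[1][0], centroids, class_codes)
print(cls, idx)
```

Only the centroid array and one class array are searched per query, rather than the full database, which is the source of the energy and delay savings described above.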