A hash table is a useful data structure for applications that need to discover and retrieve data rapidly [1]–[6]. Hash tables are widely used in graph analytics and AI/ML applications. For example, the graph sampling mechanism of Graph Convolutional Networks (GCNs) uses hash tables to check whether a currently sampled vertex (or edge) is already in the sampled set [5].
A hash table stores key-value pairs and uses a special function, the hash function, to locate elements quickly. During lookup, the key is hashed and the result indicates where the associated value is kept. Ideally, the hash function assigns each key to a different bucket. However, most hash table designs use an imperfect hash function, which can produce the same index for multiple keys, resulting in a hash collision; such collisions must be handled in some fashion. In a suitably dimensioned hash table, the average cost of a lookup is independent of the number of entries stored.
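As a minimal illustration of the key-to-bucket mapping and of a collision (a software sketch only; the hardware design described later uses its own hash functions):

```python
# Map a key to one of NUM_BUCKETS buckets; two distinct keys that
# receive the same index form a hash collision.
NUM_BUCKETS = 8

def bucket_index(key: int) -> int:
    return key % NUM_BUCKETS      # the hash result selects the bucket

assert bucket_index(3) == bucket_index(11)   # 3 and 11 collide in bucket 3
```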
The development of parallel hash tables for CPU and GPU platforms has been the subject of several studies [5]. Owing to the fine-grained parallelism they expose [7], FPGAs have succeeded in accelerating data- and compute-intensive applications [6]. They offer large on-chip memories (up to 500 MB) and large numbers of programmable logic elements (up to 5.5 million), allowing users to create highly efficient, well-performing designs [8]–[9]. FPGAs are used in many low-power edge applications [10] as well as in high-performance cloud and data centre platforms.
On the CPU, the objective has been to build concurrent, lock-free hash tables using shared memory and message passing. A recent study built a hash table implementation using SIMD extensions [8]–[9]; however, that hash table is limited to lookups. GPU research has focused on partitioning a hash table into coarse- or fine-grained divisions and extracting parallelism through concurrent processing. Such designs allow key-value pair insertions and removals with a constant average cost per operation. Several studies have focused on parallelizing hash table implementations or on exploiting particular features of the hash table, such as the fact that it has many partitions [11].
Current initiatives either deepen pipelining to raise the clock frequency of a single pipeline or perform parallel processing using atomic accesses to hash table partitions. The former technique has a modest throughput limit, while the latter is heavily data-dependent: in the worst case, multiple queries need to access the same partition of the hash table simultaneously, and partition processing becomes serialized. To overcome these limitations [12], we developed a hash table that can handle p (p > 1) queries in each clock cycle using p parallel processing engines, even in the worst-case scenario.
Bando et al. [13] suggested a hash table-based IP lookup approach with a throughput of 250 Mops/s. Choi et al. [14] developed a hash table implementation that employs a bloom filter to avoid unnecessary hash table accesses. Tong et al. [15] developed an off-chip DRAM-based hash table; the authors designed a forwarding unit for dynamic hash table modifications to mitigate data hazards. [16] and [11] used the cuckoo hashing algorithm to create efficient hash tables. The authors of [16] provide decoupled key-value storage to simplify the parallel computation of hash values. Pontarelli et al. [11] built a cuckoo hash table with several concurrent pipelines, each with its own entry point.
Reversible Logic Gates
Reversible logic gates are designed to use the fewest possible garbage outputs, the fewest possible reversible gates, and the fewest constant inputs, and they occupy less area than irreversible gates. The number of reversible gates used and the number of garbage outputs produced are two key constraints in reversible logic. A reversible logic synthesis technique should have the following characteristics:
• Use the fewest possible garbage outputs.
• Use the fewest possible constant inputs.
• Use the smallest possible number of gates.
• Keep the length of cascaded gates to a minimum.
Various reversible logic gates exist, including the Feynman gate, Fredkin gate, Peres gate, and Toffoli gate. The Feynman gate is used here because it gives better results than the other reversible logic gates for this design. As Fig. 1 shows, the Feynman gate is a controlled-NOT circuit with two inputs (A, B) and two outputs (P = A, Q = A ⊕ B). Because fan-out is not allowed in reversible circuits, the Feynman gate is employed in many applications to replicate signals.
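A minimal behavioural model of the Feynman gate (an illustrative Python sketch, not the hardware description):

```python
def feynman_gate(a: int, b: int) -> tuple[int, int]:
    """Feynman (controlled-NOT) gate: P = A, Q = A XOR B.
    Reversible: applying the gate to (P, Q) recovers (A, B)."""
    return a, a ^ b

# Signal replication: with B = 0 the gate copies A to both outputs,
# which is how fan-out is realized in reversible circuits.
assert feynman_gate(1, 0) == (1, 1)
assert feynman_gate(0, 0) == (0, 0)
```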
Multiported memory: The mWnR XOR-based design requires m × (m − 1 + n) BRAMs, where each BRAM provides one write port and one read port, and m and n denote the numbers of write and read ports required by the design. For example, a 2W1R design (m = 2, n = 1) requires 4 BRAMs (M0 to M3), and a 2W2R design (m = 2, n = 2) requires 6 BRAMs (M0 to M5). In the 2W1R design, M0 and M2, and likewise M1 and M3, hold the same data for a given address. A 2W1R write operation proceeds as follows: input data A is XOR-ed with the old data (OLD1) read from M1, and the result is stored to M0 and M2 at offset 1; similarly, input data B is XOR-ed with the old data (OLD2) read from M0, and the result is stored to M1 and M3 at offset 2, as shown in Fig. (a). For the read operation, Read Data A is retrieved by XOR-ing (A ⊕ OLD1), read from M2, with OLD1, read from M3, at offset 1. In the 2W2R case, a write operation proceeds as follows: input data A is XOR-ed with the old data (OLD1) read from M1, and the result is stored to M0, M2, and M4 at offset 1. Similarly, input data B is XOR-ed with the old data (OLD2) read from M0, and the result is stored to M1, M3, and M5 at offset 2, as shown in Fig. (b). For the read operation, Read Data 1 (A) is retrieved by XOR-ing (A ⊕ OLD1), read from M4, with OLD1, read from M5, at offset 1; similarly, Read Data 2 (B) is retrieved by XOR-ing (B ⊕ OLD2), read from M3, with OLD2, read from M2, at offset 2. A limitation of this approach is that as the number of read and write ports increases, more BRAMs are required for the design.
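A behavioural sketch of the 2W1R case under the bank roles just described (illustrative Python; in hardware each bank is a separate BRAM, and the two write ports operate on different addresses in the same cycle):

```python
DEPTH = 16  # words per bank

class Xor2W1R:
    """Software model of the XOR-based 2W1R memory (banks M0..M3)."""

    def __init__(self):
        # M0/M2 hold port-A data and M1/M3 hold port-B data (pairwise copies).
        self.m = [[0] * DEPTH for _ in range(4)]

    def write_a(self, addr: int, data: int) -> None:
        enc = data ^ self.m[1][addr]          # XOR with old data from M1
        self.m[0][addr] = self.m[2][addr] = enc

    def write_b(self, addr: int, data: int) -> None:
        enc = data ^ self.m[0][addr]          # XOR with old data from M0
        self.m[1][addr] = self.m[3][addr] = enc

    def read(self, addr: int) -> int:
        return self.m[2][addr] ^ self.m[3][addr]  # XOR of the two copies

mem = Xor2W1R()
mem.write_a(7, 0x5)
assert mem.read(7) == 0x5
mem.write_b(7, 0x9)        # a later write through the other port wins
assert mem.read(7) == 0x9
```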
On an FPGA, XOR-based memory has been shown to scale to many read and write ports while avoiding significant logic complexity. An example 2R2W XOR-based memory is shown in Figure 2. To perform a read, each SRAM block (third coloured column) is read out and the results are XOR-ed together. To write, fresh data must first be XOR-ed with the data already in the SRAM block (a read followed by an XOR), and the XOR-ed data is then written to each SRAM block.
The downside of the XOR-based approach is that it consumes more SRAM resources: an mRnW XOR-based memory requires n × (n − 1 + m) SRAM blocks [15]. To conserve memory resources, the read and write operations of the XOR-based memory in our design share the same read ports. As a result, the memory requirement is reduced to m × n. To reduce power consumption, reversible gates are used instead of irreversible gates.
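For example, by the formulas above, a 4R4W memory (m = n = 4) would need 4 × (4 − 1 + 4) = 28 SRAM blocks under the original scheme, whereas sharing the read ports reduces the requirement to 4 × 4 = 16 blocks.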
The use of XOR-based placement functions for cache memories was presented in a previously published article, which demonstrates that XOR-mapping schemes can eliminate a significant number of conflict misses for direct-mapped and two-way associative organizations. Our work addresses the limitations of that approach by completely redesigning the hash table to support all operations performed on a hash table, including search, insert, update, and delete. We accomplish this by storing the hash table in XOR-based multi-ported block memories, even though this approach results in greater memory demand. The performance of several reversible XOR-mapping schemes is analyzed and compared for various cache organization types, and we replace the previously used XOR-mapping schemes with reversible XOR-mapping schemes. The conclusion of this research is that XOR-based placement functions offer significant performance benefits to most cache organizations.
The following are the primary contributions of the paper:
• We design an XOR-array and reversible XOR-array logic-based memory for a parallel hash table.
• We demonstrate that our hash table can perform 128 concurrent searches per cycle, with a throughput of 8000 million operations per second (8 Gops/s) at 325 MHz, using state-of-the-art FPGAs.
• Even in the worst-case scenario, our RXOR-based block memory fully exploits the large on-chip SRAM to support p queries (search, insert, update, and delete) in each clock cycle.
• We analyze performance and compare it with current results.
3. HASH TABLE DEFINITION AND SUPPORTED OPERATIONS
Our hash table follows a standard closed-addressing design: each bucket provides several collision-resolution slots, and each slot stores one key-value pair. A key-value pair with key x is stored at the index h(x) computed by the hash table, where h(·) is the hash function used by the table. Our hash table supports the following operations (a minimal software sketch follows the list):
• Search (key): If the key appears in the hash table, the search operation returns the value associated with the key. Otherwise, nothing is returned.
• Insert/Update (key, value): The Insert/Update operation first scans the hash table for the key. If the key already exists, it replaces the existing value with value. Otherwise, a new key-value pair is placed into an empty hash table slot.
• Delete (key): A delete operation removes the key-value pair from the hash table, leaving the relevant slot available.
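The following sketch illustrates these three operations on a closed-addressing table with a fixed number of collision slots per bucket (illustrative Python with placeholder parameters; the paper's design realizes this in hardware):

```python
BUCKETS, SLOTS = 8, 2
table = [[None] * SLOTS for _ in range(BUCKETS)]   # each slot: (key, value)

def h(key: int) -> int:
    return key % BUCKETS                 # placeholder hash function

def search(key):
    for slot in table[h(key)]:
        if slot is not None and slot[0] == key:
            return slot[1]
    return None                          # key absent: nothing returned

def insert_update(key, value) -> bool:
    bucket = table[h(key)]
    for i, slot in enumerate(bucket):
        if slot is not None and slot[0] == key:
            bucket[i] = (key, value)     # key exists: update in place
            return True
    for i, slot in enumerate(bucket):
        if slot is None:
            bucket[i] = (key, value)     # insert into first empty slot
            return True
    return False                         # all slots taken: bucket overflow

def delete(key) -> None:
    bucket = table[h(key)]
    for i, slot in enumerate(bucket):
        if slot is not None and slot[0] == key:
            bucket[i] = None             # leave the slot available
```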
We refer to hash table mutation queries (insert, update, and delete) as non-search queries (NSQs) in this study. A parallel hash table can handle many queries at the same time, but contention between concurrent queries frequently reduces its performance; the extent to which this manifests depends on the individual design. In previous studies, the hash table was frequently partitioned, and parallelism was obtained by accessing the partitions simultaneously. When two or more queries compete for the same partition, whether fine-grained [23] or coarse-grained [18], the queries are serialized. In contrast to these earlier methods, our parallel hash table guarantees p parallel accesses in each cycle, even in the most adverse circumstances. Hash tables come in two types: static and dynamic.
Static hash table: In a static hash table, the key set S remains fixed throughout execution; as a direct consequence, search is the only permitted operation. Static, collision-free hash tables can be created using perfect hashing [32]. One way to generate such a table is to use two tiers of hash functions. The first-level hash table is constructed using a hash function drawn from a universal family H [33]. For each first-level bucket containing more than one item, a second-level hash table with O(ni²) entries is generated, where ni is the number of items in bucket i (ni > 1); the second-level hash function is chosen so that it produces no collisions. A hash table generated this way contains no collisions.
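A compact sketch of this two-level construction (illustrative Python; the prime P and the affine hash family stand in for the universal family H, and all names are placeholders):

```python
import random

P = 10_000_019   # an assumed prime larger than every key

def make_hash(m: int):
    """Draw h(k) = ((a*k + b) mod P) mod m from an affine universal family."""
    a, b = random.randrange(1, P), random.randrange(P)
    return lambda k: ((a * k + b) % P) % m

def build_perfect(S):
    n = len(S)
    h1 = make_hash(n)
    buckets = [[] for _ in range(n)]
    for k in S:
        buckets[h1(k)].append(k)
    second = []
    for items in buckets:
        ni = len(items)
        if ni <= 1:
            second.append((None, items))   # 0 or 1 item: no second level
            continue
        while True:                        # retry until collision-free; a
            h2 = make_hash(ni * ni)        # size-ni**2 table succeeds quickly
            tbl = [None] * (ni * ni)
            ok = True
            for k in items:
                j = h2(k)
                if tbl[j] is not None:     # second-level collision: redraw
                    ok = False
                    break
                tbl[j] = k
            if ok:
                second.append((h2, tbl))
                break
    return h1, second

def lookup(key, h1, second):
    h2, tbl = second[h1(key)]
    return key in tbl if h2 is None else tbl[h2(key)] == key

S = [3, 17, 42, 99, 256, 1024]
h1, second = build_perfect(S)
assert all(lookup(k, h1, second) for k in S)
```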
Dynamic hash table: With dynamic hashing, search operations can be performed while data is incrementally added to the hash table; as a direct consequence, dynamic hash tables are subject to change. A collision occurs when a new key hashes to an entry already occupied by another key. On FPGAs, it is common practice to handle such collisions by employing multiple hash functions or linear chaining.
Parallel hash table: In a parallel hash table, each cycle may involve p (p > 1) separate operations instead of a single key search or insertion, and the p operations can be any combination of searches and inserts. Figure 1 breaks the parallel hash table concept down into its basic components. In this scenario, the hash table completes p operations on each clock tick. Designing a parallel architecture that accesses a hash table effectively on an FPGA is a challenging research area. The primary issue is resource contention, which manifests mainly as on-chip memory conflicts brought about by concurrent hash table operations. The proposed layout ensures that p operations are carried out during each clock tick.
To complete p hash table queries in parallel every cycle, our parallel hash table employs p PEs, which together guarantee p search, insert, update, or delete queries per cycle. Meeting this throughput requirement calls for a parallel memory with as many read ports as write ports. Because each FPGA SRAM block has only one read port and one write port, it is difficult to support all of the operations without slowing the pipelines. In prior architectures, supporting delete and update operations requires redirecting queries to the PE that handled the initial insert for the matching input key; performance is then no longer guaranteed and, depending on the workload, may decline, in addition to the increased logic complexity caused by the all-to-all PE connections. Our hash table design overcomes these challenges: by utilizing reversible XOR-based block memory, it handles search, insert, update, and delete queries with a guaranteed rate of p queries per cycle.
4. HASH TABLE ARCHITECTURE
As shown in Figure 2, the proposed design includes p processing engines (PEs). Each PE can independently receive input queries of various operation types, and search operations are carried out within each PE independently. To maintain data integrity during insert operations, each PE must propagate changes to the hash tables of the other PEs; the inter-PE dataflow accomplishes this. Our architecture ensures that inserts initiated by one PE never interfere with inserts initiated by other PEs. As a result, we can guarantee p parallel accesses in every clock cycle.
The proposed architecture comprises p processing engines (PEs). The hash table is replicated p times, and each PE stores a single copy of the replicated table. Each PE stores key-value pairs as XOR-encoded data across k Partial XOR Stores, where k is the maximum number of non-search queries the architecture can handle during each cycle. Each slot in a Partial XOR Store holds one key-value pair, and each bucket contains multiple slots to manage hash table collisions.
Query Dataflow: To ensure that all replicas hold the most recent information, every NSQ must be propagated whenever one replica issues it. We devised a component called the Partial XOR Store (M) to provide conflict-free synchronization across replicas. Only PEs that contain this component can perform hash table mutation operations (insert, update, and delete). In addition to being the starting point for mutation operations, it acts as a gateway to the Partial XOR Stores located in the other PEs. Because of this component, our architecture ensures that mutations made by one PE never collide with those made by other PEs. The query flow for our hash table, which maps search, insert, update, and delete operations onto our parallel architecture, is described below.
Search: Whenever a PE receives a search query, it computes the bucket using a predefined hash function and reads the contents of all Partial XOR Stores in parallel at that index. The original key-value pair is then reconstructed by XOR-ing the data from each Partial XOR Store; this result is used only if the data contained in the partial stores is valid.
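A sketch of this reconstruction step (illustrative Python; K, the store contents, and all names are placeholders):

```python
from functools import reduce
from operator import xor

K, BUCKETS = 4, 8                        # k Partial XOR Stores, bucket count
stores = [[0] * BUCKETS for _ in range(K)]

def search_word(index: int) -> int:
    shares = [s[index] for s in stores]  # read all partial stores in parallel
    return reduce(xor, shares)           # XOR reduction recovers the pair
```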
The query dataflow for a four-PE configuration is depicted in Figure 1.4: (a) a configuration that allows 100% non-search queries in a single cycle; (b) a configuration that allows up to 50% non-search queries per cycle.
Insert/Update: Insertion and update queries use the same dataflow as search queries to determine whether the key-value pair already exists in the table. The PE then produces the XOR-encoded data, which is written first to the Partial XOR Store (M). For both new key-value pairs and updates, the PE generates the encoded data by XOR-ing the incoming key-value pair with the existing data from each Partial XOR Store, excluding the encoded data held in Partial XOR Store (M) itself [17]. A new key-value pair additionally requires an empty slot in order to proceed (collision handling). After that, p additional writes containing the same encoded data are sent via inter-PE communication to the remaining PEs.
Delete: The dataflow for delete queries initially follows that of search queries. If the input key matches, the PE sets the invalid bit at the corresponding position in the Partial XOR Store (M). As with insert/update queries, this invalidation is propagated to the other PEs.
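Continuing the sketch above, a hedged illustration of the mutation flow: the encoded word enters at the Partial XOR Store (M), and a delete merely clears a valid bit (M, the valid bits, and the single-slot simplification are all illustrative):

```python
M = 0                                   # index of Partial XOR Store (M)
valid = [False] * BUCKETS               # per-slot valid bits (one slot shown)

def insert_update_word(index: int, new_word: int) -> None:
    # XOR the incoming word with every share except store M's, then write
    # to M: the XOR across all k stores now reconstructs new_word.
    others = reduce(xor, (s[index] for i, s in enumerate(stores) if i != M), 0)
    stores[M][index] = new_word ^ others
    valid[index] = True

def delete_word(index: int) -> None:
    valid[index] = False                # slot becomes reusable; data stays

insert_update_word(3, 0xBEEF)
assert valid[3] and search_word(3) == 0xBEEF
```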
Workload Customization: As p and k grow, our architecture's memory requirement grows. Our technique offers flexibility when a workload's non-search queries (NSQs) per cycle have an upper bound: search-only PEs, which have no hash table mutation capability, can be used to conserve memory resources. Search-only PEs differ from full PEs in that they lack the Partial XOR Store (M) component.
Design of the Processing Engine:
After generating the bucket index, the PE performs read operations on each Partial XOR Store in parallel. The XOR-encoded data that is read out passes through two XOR reduction trees. The first XOR reduction tree retrieves the original key-value pair. The second XOR reduction tree produces the newly encoded data and is active only when the query being processed is not a search. After the decoded key-value pair is retrieved from the first XOR reduction tree, the result resolution unit forwards the query to the next hop according to the operation performed and the query dataflow. To handle collisions, it scans the slots one at a time and selects the first vacant one.
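Reusing the names from the sketches above, one PE step might be modelled as follows (collision-slot scanning omitted for brevity; a sketch of the datapath, not the actual hardware):

```python
def pe_step(op: str, key: int, word: int = 0) -> int:
    idx = key % BUCKETS                          # bucket index from the hash
    shares = [s[idx] for s in stores]            # parallel partial-store reads
    decoded = reduce(xor, shares)                # first XOR reduction tree
    if op == "search":
        return decoded if valid[idx] else 0
    if op == "delete":
        valid[idx] = False                       # clear the valid bit only
        return decoded
    # insert/update: the second XOR reduction tree encodes the new word
    others = reduce(xor, (s for i, s in enumerate(shares) if i != M), 0)
    stores[M][idx] = word ^ others               # the write enters at store M
    valid[idx] = True
    return word
```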
Inter-PE Communication: Inter-PE data communication is required to keep the hash table replicas consistent. It is a pipeline connecting the PEs, through which an NSQ propagates from one PE to all others in p cycles. As described above, all hash table changes start at the Partial XOR Store (M). Because of this communication scheme, mutations initiated by one PE never clash with mutations initiated by other PEs.
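A toy model of this propagation (illustrative names; apply stands for applying an NSQ to one PE's replica):

```python
P_PES = 4
pipe = [None] * P_PES                    # one pipeline register per PE

def issue(nsq, src: int) -> None:
    pipe[src] = (nsq, 0)                 # the mutation starts at its PE

def tick(apply) -> None:
    nxt = [None] * P_PES
    for pe, entry in enumerate(pipe):
        if entry is None:
            continue
        nsq, hops = entry
        apply(pe, nsq)                   # update this PE's replica
        if hops + 1 < P_PES:             # after p hops every PE has it
            nxt[(pe + 1) % P_PES] = (nsq, hops + 1)
    pipe[:] = nxt
```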
Analysis of Memory Requirements
In our architecture, the on-chip SRAM blocks consume the most resources, due to the hash table replication and the XOR-encoded data store [20]. Several factors impact SRAM consumption:
1) The target throughput, i.e., the number of PEs.
2) The maximum number of NSQs per cycle.
3) The length of the key-value pair.
4) Collision support, i.e., the number of slots per hash table entry.
As a result, an ideal design should account for all of these factors together with the workload characteristics and the available device resources. As an example, consider the SRAM requirements for a 50K-entry hash table with two slots per entry, where both the key and the value are 4 bytes long [21].
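For a rough sense of scale under these figures, one logical copy of such a table occupies 50,000 × 2 × 8 B ≈ 800 KB, before accounting for the p replicas and the k Partial XOR Stores per replica.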
Hash Table Consistency
As mentioned in Section 3, a new key is not visible to all PEs for up to p + t0 cycles, where t0 covers the hash computation [23], the partial XOR read, and result resolution; t0 is a constant because our design reads all blocks in parallel. Furthermore, p cycles are required to complete a hash table mutation on all PEs via the inter-PE pipeline.
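As an illustration, with p = 4 PEs and an assumed t0 = 3 cycles, a newly inserted key becomes visible to all PEs within p + t0 = 7 cycles.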