Locality Sensitive Hardware Signature Variants for Hardware Transactional Memory

Lock-based techniques have limitations such as priority inversion, convoying, and deadlock. Lock-free techniques overcome these limitations. Transactional memory (TM) is a leading lock-free technique used in recent multi-core processors such as Intel Haswell and IBM BlueGene/Q. TM has to perform data versioning and conflict detection. For conflict detection, TM uses hardware signatures built on a probabilistic data structure called the Bloom filter (BF), which handles shared-memory conflicts such as RAW, WAR, and WAW hazards. Hardware signatures store memory addresses in hashed form in Bloom filters. Bloom filters are easy to use and performance efficient; they can produce false positives but never false negatives. Locality-sensitive hardware signatures reduce filter occupancy by sharing bits among contiguous memory addresses, which in turn reduces the false positive rate. This paper implements the existing H3-HS and LS-HS proposed by Ricardo Quislant et al. [13] and proposes RS-HS, CS-HS, and RO-HS. RO-HS spreads addresses equally among the Bloom filters and thereby reduces filter occupancy, which in turn leads to a better false positive rate.


Introduction
Access to shared resources is controlled by locks or lock-based techniques, which are also used to handle critical sections. However, locks limit performance through disadvantages such as priority inversion, convoying, and deadlock [1], [15]. Lock-free techniques such as transactional memory overcome these limitations. Transactional memory is a parallel programming paradigm that supports concurrent execution of threads. Like a database transaction, it must maintain atomicity, consistency, and isolation; unlike a database transaction, it need not maintain durability, since it is transient. It can replace locks of all granularities, from coarse-grained to fine-grained.

Transactional memory mainly performs two operations: data versioning and conflict detection. Data versioning can be done in two ways. The first is eager (undo) data versioning, which works with logs: it changes memory directly and performs an undo operation when needed. The second is lazy (write) data versioning, which works with buffers: it keeps the changes in a buffer until the commit operation is performed and applies them to memory during commit.

Conflict detection handles shared-memory conflicts such as RAW, WAR, and WAW hazards. There are two kinds of conflict detection in transactional memory. The first is eager (encounter, pessimistic) conflict detection, which is performed during loads and stores. The second is lazy (commit, optimistic) conflict detection, which is performed during the commit operation [6]. For conflict detection, each transaction must maintain its read-set and write-set addresses. The Bloom filter (BF) is the data structure most commonly used to hold these addresses: it is a probabilistic data structure that stores memory addresses after hashing, at the cost of false positives [12]. There are three types of transactional memory available.
They are (a) software transactional memory (STM), (b) hardware transactional memory (HTM), and (c) hybrid transactional memory. Software TM is implemented through software constructs such as atomic. Recent implementations of software TM include Intel C++ STM, Intel Java STM, HASTM, and Microsoft OSTM. HTM implements TM fully in hardware. HTM performs better than STM, while STM is more flexible than HTM because it allows a wider variety of algorithms. Recent processors such as Intel Haswell and IBM BlueGene/Q use hardware transactional memory. Because the signatures are implemented in hardware transactional memory, they are called hardware signatures [2]. Hybrid TM implementations aim to eliminate the limitations of STM and HTM and to provide a better TM implementation [10], [22], [21].

Read-set and write-set addresses can also be tracked by adding read and write bits to the cache tag. This is called cache tag augmentation, and it leads to several limitations such as block replacement, thread switching, and modification of cache structures. To maintain read-set and write-set addresses, cache tag augmentation, hardware signatures, or both can be used. Hardware signatures do not require many changes to the existing cache structures. Locality-sensitive hardware signatures share signature bits among contiguous memory locations, which reduces filter occupancy and leads to a low false positive rate. H3 hashing behaves well for memory address sequences [13].

The contributions of this paper are as follows:
- Compared the newly proposed hardware signatures with H3-HS (hardware signature using H3 hashing) and LS-HS (hardware signature using H3 hashing with nullification, or classical locality-sensitive hardware signature).
Section VI concludes the work done in this paper.
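The read-set and write-set bookkeeping described in this section can be illustrated with a minimal Python sketch of eager conflict detection. Everything here is an assumption made for illustration: the names `Signature`, `Transaction`, and `eager_conflict`, the 256-bit signature size, and the SHA-256 based index function (real hardware signatures use cheap hardware hashes such as H3).

```python
import hashlib

SIG_BITS = 256  # illustrative signature size in bits (assumption)

def sig_index(address, seed):
    """Map an address to a bit position; SHA-256 stands in for a hardware hash."""
    digest = hashlib.sha256(f"{seed}:{address}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % SIG_BITS

class Signature:
    """Bloom-filter signature stored as an integer bitmask."""
    def __init__(self, num_hashes=2):
        self.bits = 0
        self.k = num_hashes

    def insert(self, address):
        for seed in range(self.k):
            self.bits |= 1 << sig_index(address, seed)

    def may_contain(self, address):
        # True for every inserted address; may also be True for others
        # (false positive), never False for an inserted one.
        return all((self.bits >> sig_index(address, s)) & 1 for s in range(self.k))

class Transaction:
    """Tracks the read set and write set of one running transaction."""
    def __init__(self, name):
        self.name = name
        self.read_sig = Signature()
        self.write_sig = Signature()

    def load(self, address):
        self.read_sig.insert(address)

    def store(self, address):
        self.write_sig.insert(address)

def eager_conflict(other, address, is_store):
    """Check one incoming access against another running transaction."""
    if other.write_sig.may_contain(address):
        return True   # RAW if the access is a load, WAW if it is a store
    if is_store and other.read_sig.may_contain(address):
        return True   # WAR
    return False

t1 = Transaction("T1")
t1.load(0x1000)
t1.store(0x2000)
raw = eager_conflict(t1, 0x2000, is_store=False)  # load of an address T1 wrote
waw = eager_conflict(t1, 0x2000, is_store=True)   # store to an address T1 wrote
war = eager_conflict(t1, 0x1000, is_store=True)   # store to an address T1 read
```

A load of an address that the other transaction has only read is not a conflict, although the signature may still report one as a false positive.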

Related Works
Bloom filter (BF) based hardware signatures are widely used in fields such as image processing, transactional memory, networking, databases, and intrusion detection for their compactness and speed. The size of a Bloom filter does not depend on the number of items stored in it. Many systems based on multi-core processors use Bloom filter based hardware signatures, including network processors, parallel debugging, deterministic replay [10], auto-memoization processors, thread-level speculation, and transactional memory. This section discusses the different Bloom filter architectures and hardware signatures that have been proposed. They differ in the type of hash function used, the number of hash functions used, and how the hashed values are handled.

True (standard) Bloom filters support only insertion of an element and membership checking, and all hash functions share the same bits. Parallel BFs are the same as true Bloom filters except that the bits are partitioned among the hash functions applied [19]. Parallel BFs perform better than true BFs by reducing the false positive rate (FPR). Counting Bloom filters perform better than standard Bloom filters by reducing filter occupancy and FPR, but they increase the memory space required according to the size of the counters: if the counter size is 4 bits, the counting BF is four times the size of a standard BF. The d-left counting BF is similar to the counting BF and is based on the d-left hashing technique [5]. Parallel multi-set Bloom filters maintain a single parallel Bloom filter for both read-set and write-set addresses. Parallel multi-set shared Bloom filters also maintain a single parallel Bloom filter for read-set and write-set addresses, and treat both as read-write addresses [14].
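The counting Bloom filter mentioned above, including the fourfold storage cost of 4-bit counters, can be sketched as follows. This is a minimal illustration and not taken from any of the cited works: the class name, the SHA-256 based index function, and the policy of leaving saturated counters untouched are all assumptions.

```python
import hashlib

class CountingBloomFilter:
    """Bloom filter whose cells are small saturating counters instead of bits."""
    def __init__(self, num_counters, num_hashes, counter_bits=4):
        self.counters = [0] * num_counters
        self.k = num_hashes
        self.counter_bits = counter_bits
        self.max_count = (1 << counter_bits) - 1

    def _indices(self, item):
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.counters)

    def insert(self, item):
        for i in self._indices(item):
            if self.counters[i] < self.max_count:
                self.counters[i] += 1

    def remove(self, item):
        # Only valid for items that were inserted earlier. A saturated counter
        # is left untouched because its true count is no longer known.
        for i in self._indices(item):
            if 0 < self.counters[i] < self.max_count:
                self.counters[i] -= 1

    def contains(self, item):
        return all(self.counters[i] > 0 for i in self._indices(item))

    def storage_bits(self):
        # counter_bits times the storage of a standard filter with as many cells
        return len(self.counters) * self.counter_bits

cbf = CountingBloomFilter(num_counters=64, num_hashes=3)
cbf.insert(0x1000)
present_before = cbf.contains(0x1000)
cbf.remove(0x1000)
present_after = cbf.contains(0x1000)
```

With 64 cells and 4-bit counters, `storage_bits()` is 256, four times the 64 bits a standard filter with the same number of cells would need.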
Parallel BF with BF indexing (PBF BF) uses a BF to store the inherent dependency values of an item, while parallel BF with hash table indexing (PBF HT) uses a hash table to accumulate the inherent dependency values of an item. Both PBF BF and PBF HT reduce FPR, and PBF HT reduces it further than PBF BF [20], [24]. Hierarchical Bloom filters are used for substring matching: when the first substring matches in the first-level Bloom filter, matching of the second substring proceeds with the second-level Bloom filter, and so on. This type of Bloom filter is suitable for dictionary-based applications [5]. In weighted Bloom filters, elements with higher weight use more hash functions to reduce FPR. Pipelined Bloom filters apply several stages of hash functions; a match in the first stage proceeds to the second stage, and so on [20]. Hamming metric locality-sensitive Bloom filters (HLBF) were proposed to handle updates or changes in Hamming distances. Along with HLBF, Jiangbo Qian et al. also proposed the M-HLBF algorithm to support AMT [12]. Generalized Bloom filters are sent as messages in networks; to provide security, reset hash functions are introduced in addition to set hash functions, at the cost of false negatives [7]. Deletable Bloom filters support false-negative-free deletions at the cost of additional memory [17]. Dynamic Bloom filters support dynamic data sets and the following operations: insertion, deletion, item check, and filter union. This type of Bloom filter is suitable for applications with dynamic and static data sets [4].
Fast Bloom filters, called Bloom-1, work faster than other Bloom filters by requiring a single memory look-up, but they increase FPR slightly [16]. Variable-increment counting BFs work on the basis of variable increments and improve on counting BFs by reducing FPR [18]. Ternary Bloom filters occupy the same amount of space as counting Bloom filters; they minimize the number of bits used per counter and increase the number of counters to provide a low FPR [9]. Two-stage adaptive Bloom filters (TSABFs) were proposed by Yan Du and Sheng Wang to perform per-flow monitoring in software-defined networks. This type of Bloom filter keeps the FPR under a threshold value and improves resource utilization and rejection probability [3]. Jungwon Lee et al. proposed the circled BF to identify the association of many sets. A circled Bloom filter uses the values 1, 2, and 3: value 1 indicates membership in set 1, value 2 indicates membership in set 2, and value 3 represents the intersection of sets 1 and 2. Circled Bloom filters improve the accuracy of membership querying [8]. C.Y. Tseung et al. proposed a novel self-learning Bloom filter that defends against DDoS attacks based on features selected by the ANOVA algorithm [23].

The signatures discussed in this section are LS-Sig, FlexSig, and unified signatures. Locality-sensitive signatures (LS-Sig) work on the basis of spatial locality and share the signature bits of nearby locations to reduce filter occupancy. Reducing filter occupancy improves performance and reduces FPR [13]. Flexible signatures (FlexSig) provide allocate, deallocate, insert, and check operations: allocate allocates the resources needed for the signature, deallocate releases them, insert inserts an address, and check checks for the presence of an address. Flexible signatures are thus created on demand [11].
Unified signatures merge read signatures and write signatures. This generates read-read dependencies, which helper signatures are used to resolve [2].

This paper proposes three new hardware signatures, called RS-HS, CS-HS, and RO-HS, and implements them along with H3-HS and LS-HS. RO-HS performs better than the other four signatures because of its rotational way of spreading the load among the Bloom filters. RS-HS works like LS-HS but in the reverse order. CS-HS combines LS-HS and RS-HS: addresses that fall in the first half of the address range are spread using LS-HS, and addresses that fall in the second half are spread using RS-HS. CS-HS avoids overloading a specific Bloom filter and thereby reduces filter occupancy, which in turn improves the false positive rate. H3-HS performs better than LS-HS, RS-HS, and CS-HS because of how it handles address sequences.

False Positive:
A false positive occurs when the Bloom filter (BF) reports a match for an address that was never inserted into it.

False Positive Rate (FPR):
FPR = Number of false positives ÷ Total number of addresses

False Negative:
A false negative would be reporting no match for an address that was previously inserted. A BF never produces false negatives.

Filter Occupancy:
The total number of ones (set bits) present in the BF.

Filter Occupancy Rate:
Filter occupancy rate = Total number of ones in the BF ÷ Size of the BF
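The metrics defined above can be measured on a small software Bloom filter. The sketch below is illustrative only: the `BloomFilter` class, its SHA-256 based index function, the filter size, and the address ranges are assumptions, and "total number of addresses" is taken to mean the number of addresses checked that were never inserted.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.k = num_hashes
        self.bits = [0] * size_bits

    def _indices(self, address):
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{address}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def insert(self, address):
        for i in self._indices(address):
            self.bits[i] = 1

    def contains(self, address):
        return all(self.bits[i] for i in self._indices(address))

    def occupancy(self):
        """Filter occupancy: number of ones in the filter."""
        return sum(self.bits)

    def occupancy_rate(self):
        """Filter occupancy rate: ones divided by filter size."""
        return self.occupancy() / self.size

def false_positive_rate(bf, probes):
    """FPR over addresses that the caller guarantees were never inserted."""
    false_positives = sum(1 for a in probes if bf.contains(a))
    return false_positives / len(probes)

bf = BloomFilter(size_bits=1024, num_hashes=4)
inserted = range(0, 100)
for address in inserted:
    bf.insert(address)

probes = range(1000, 2000)   # disjoint from the inserted addresses
fpr = false_positive_rate(bf, probes)
print(bf.occupancy(), bf.occupancy_rate(), fpr)
```

Each insertion sets at most `num_hashes` bits, so occupancy grows with the number of distinct addresses inserted, and a higher occupancy rate makes it more likely that a probe finds all of its bits already set.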

H3-HS:
H3-HS uses H3 hashing. H3 hashing performs binary multiplication followed by an XOR operation: to hash an n-bit address, the address is multiplied by an n × m random binary matrix.
This paper implements 4 parallel Bloom filters for handling a 64-bit address. To implement k parallel Bloom filters, k random H3 matrices are used. The bit index set in each Bloom filter is given by the value of h(a) [13].
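The H3 computation described above can be sketched in Python. Multiplying the address bit-vector by a binary matrix over GF(2) is equivalent to XOR-ing together the matrix rows selected by the set bits of the address, which is how `h3_hash` is written. The names `make_h3_matrix`, `h3_hash`, and `ParallelSignature`, the 8-bit index width, and the seeded random matrices are assumptions for illustration, not the paper's implementation.

```python
import random

def make_h3_matrix(n_bits, m_bits, rng):
    """Return an n x m random binary matrix, stored as one m-bit row per address bit."""
    return [rng.getrandbits(m_bits) for _ in range(n_bits)]

def h3_hash(address, matrix):
    """Multiply the address by the matrix over GF(2): AND selects rows, XOR sums them."""
    index = 0
    for bit_pos, row in enumerate(matrix):
        if (address >> bit_pos) & 1:
            index ^= row
    return index

class ParallelSignature:
    """k parallel Bloom filters, each indexed by its own H3 matrix."""
    def __init__(self, k=4, n_bits=64, m_bits=8, seed=0):
        rng = random.Random(seed)
        self.matrices = [make_h3_matrix(n_bits, m_bits, rng) for _ in range(k)]
        self.filters = [0] * k   # each filter is a bitmask of 2**m_bits bits

    def indices(self, address):
        return [h3_hash(address, q) for q in self.matrices]

    def insert(self, address):
        for f, idx in enumerate(self.indices(address)):
            self.filters[f] |= 1 << idx

    def may_contain(self, address):
        return all((self.filters[f] >> idx) & 1
                   for f, idx in enumerate(self.indices(address)))

# Tiny hand-checkable matrix: rows for address bits 0, 1, 2.
small = [0b10, 0b01, 0b11]
h_011 = h3_hash(0b011, small)   # rows 0 and 1: 0b10 ^ 0b01
h_101 = h3_hash(0b101, small)   # rows 0 and 2: 0b10 ^ 0b11

sig = ParallelSignature()
sig.insert(0x7FFF_0000_1234)
```

Because the hash is linear over GF(2), `h3_hash(a ^ b, q)` always equals `h3_hash(a, q) ^ h3_hash(b, q)`.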
The preprocessed matrices hs0', hs1', hs2', hs3' are multiplied with the address A to generate the bit indices. In LS-HS, contiguous addresses share most of the bit indices, which leads to lower filter occupancy and reduced FPR. It is depicted in the

Proposed Hardware Signature Variants
The following hardware signatures are implemented in this paper: H3-HS, LS-HS, RS-HS, CS-HS, and RO-HS. Of these, RS-HS, CS-HS, and RO-HS are newly proposed.

RS-HS uses H3 hashing with a reverse nullification procedure. Under reverse nullification, the ha3 matrix is kept as is, the last row of the ha2 matrix is nullified, the last 2 rows of the ha1 matrix are nullified, and the last 3 rows of the ha0 matrix are nullified. Table 1 shows the manipulations for sample addresses and the bit sharing of H3-HS, LS-HS, and RS-HS. In LS-HS, the bit indices of Bfr 4 are reused (shared) many times and Bfr 1 is overloaded, or filled more. In RS-HS, the bit indices of Bfr 1 are reused (shared) many times and Bfr 4 is overloaded, or filled more. RS-HS is depicted in Figure 1.

In CS-HS, LS-HS is used for the first 50% of the address range and RS-HS is used for the next 50%. For example, if the address range of CS-HS is 2^n, then addresses 0 to 2^n/2 - 1 use LS-HS and addresses 2^n/2 to 2^n - 1 use RS-HS. CS-HS is depicted in Figure 2.

RO-HS uses the hsx' matrices on a rotational basis. For the first 25% of the address range, the matrix order is hs0', hs1', hs2', hs3'. For the second 25%, the order is hs1', hs2', hs3', hs0'. For the third 25%, the order is hs2', hs3', hs0', hs1'. For the fourth 25%, the order is hs3', hs0', hs1', hs2'. RO-HS distributes the load more evenly across all four Bloom filters than the other hardware signatures implemented. RO-HS is depicted in Figure 3.

Table 1. Manipulations for sample addresses and bit sharing of H3-HS, LS-HS, and RS-HS (each group lists the values for Bfr 1, Bfr 2, Bfr 3, Bfr 4)

Address | H3-HS   | LS-HS   | RS-HS
0       | 0 0 0 0 | 0 0 0 0 | 0 0 0 0
1       | 2 3 3 1 | 2 0 0 0 | 0 0 0 1
10      | 1 2 2 2 | 1 2 0 0 | 0 0 0 2
11      | 3 1 1 3 | 3 2 0 0 | 0 0 0 3
100     | 3 1 3 0 | 3 1 3 0 | 0 1 3 0
101     | 1 2 0 1 | 1 1 3 0 | 0 1 3 1
110     | 0 3 1 2 | 2 3 3 0 | 0 1 1 2
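The four locality-sensitive variants described in this section can be compared in a short Python sketch. It rests on several assumptions that the text does not confirm: the "last rows" that are nullified are taken to be the rows for the lowest-order address bits (so that contiguous addresses share indices), the hsx' matrices used by RO-HS are taken to be the LS-HS nullified matrices, and addresses are 8 bits wide rather than 64. The matrices are random, so the printed indices are not those of Table 1.

```python
import random

N_BITS = 8   # address width for this sketch (the paper hashes 64-bit addresses)
M_BITS = 4   # bits per filter index, so each filter has 2**M_BITS bits
K = 4        # number of parallel Bloom filters (Bfr 1 to Bfr 4)

def h3_hash(address, matrix):
    """XOR together the matrix rows selected by the set bits of the address."""
    index = 0
    for bit_pos, row in enumerate(matrix):
        if (address >> bit_pos) & 1:
            index ^= row
    return index

def nullify(matrix, count):
    """Zero the rows of the `count` lowest address bits (assumed meaning of 'last rows')."""
    return [0] * count + matrix[count:]

def ls_matrices(base):
    # LS-HS: filter i ignores the i lowest address bits; the first filter keeps everything.
    return [nullify(m, i) for i, m in enumerate(base)]

def rs_matrices(base):
    # RS-HS: reverse order; the last filter keeps everything, the first ignores K-1 bits.
    return [nullify(m, len(base) - 1 - i) for i, m in enumerate(base)]

def indices(address, matrices):
    return [h3_hash(address, m) for m in matrices]

def cs_indices(address, base):
    # CS-HS: LS-HS for the first half of the address range, RS-HS for the second half.
    half = 1 << (N_BITS - 1)
    mats = ls_matrices(base) if address < half else rs_matrices(base)
    return indices(address, mats)

def ro_indices(address, base):
    # RO-HS: rotate the matrix order by one position per quarter of the address range.
    quarter = address >> (N_BITS - 2)          # 0, 1, 2 or 3
    mats = ls_matrices(base)
    return [h3_hash(address, mats[(i + quarter) % K]) for i in range(K)]

rng = random.Random(1)
base = [[rng.getrandbits(M_BITS) for _ in range(N_BITS)] for _ in range(K)]

for address in (0x10, 0x11, 0x50, 0x90, 0xD0):
    print(f"{address:08b}",
          indices(address, ls_matrices(base)),
          indices(address, rs_matrices(base)),
          cs_indices(address, base),
          ro_indices(address, base))
```

Under these assumptions, two addresses that differ only in bit 0 (such as 0x10 and 0x11) share the indices of the last three filters in LS-HS and of the first three filters in RS-HS, which is the bit sharing that lowers filter occupancy.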