A searchable encryption scheme with hidden search pattern and access pattern on distributed cloud system

Dynamic searchable encryption methods allow a client to perform searches and updates over encrypted data stored in the cloud. Recent schemes enable secure searching over an encrypted database stored on a server, but these schemes focus only on hiding the access pattern using ORAM. Although the traditional Oblivious Random Access Machine (ORAM) can hide the access pattern, which refers to the set of documents that match the client’s queries, it also incurs significant communication overhead and cannot hide the search pattern. Existing research shows that general dynamic searchable symmetric encryption (DSSE) schemes are vulnerable to statistical attacks due to the leakage of both search patterns and access patterns. Hiding the access pattern alone is therefore not enough; it is essential to hide both the access pattern and the search pattern with high efficiency. To overcome this limitation, a DSSE scheme called obliviously shuffled incidence matrix DSSE (OSM-DSSE) is proposed in this paper to access the encrypted data obliviously without using ORAM. The OSM-DSSE scheme realizes efficient search and update operations based on an incidence matrix. In particular, a shuffling algorithm based on Paillier encryption is combined with a 1-out-of-n oblivious transfer (OT) protocol to hide the access pattern, and an optimized padding scheme obfuscates the search pattern with low storage overhead. Simulation results and security analysis confirm that the OSM-DSSE scheme achieves high security and efficient searches. Also, this scheme provides adaptive security against malicious attacks by adversaries.
Furthermore, OSM-DSSE is capable of searching for a keyword out of 9 × 10^10 keyword-file pairs within 2.19 s, which is 3-4x better execution efficiency than state-of-the-art solutions.


Introduction
The rise of cloud services provides vast benefits to society and the IT industry. Storage-as-a-Service is one of the most common cloud services available; it allows the client to store data online and access it from anywhere, reducing the cost of data management and maintenance. Despite these merits, Storage-as-a-Service also brings significant security and privacy issues. Once data is outsourced, a client loses the ability to control it, and a malicious user may tamper with or steal sensitive information. Although the client can encrypt data with standard encryption schemes (e.g., AES) to ensure confidentiality, basic operations (e.g., search/update) on the encrypted data cannot then be performed. Furthermore, substantial computational overhead is incurred, which greatly reduces the benefits of the cloud service.
To solve the above problems, the searchable symmetric encryption (SSE) scheme was proposed. In 2000, Song et al. [1] first proposed the concept of SSE. As a new encryption primitive, searchable encryption enables the user to search for a keyword over the ciphertext. However, its application was limited to searching static encrypted data, and it could not resist simple adversary attacks. In 2003, Goh et al. [2] formally defined the secure index and developed a security model called "semantic security" against adaptive chosen keyword attacks. However, the accuracy of the query result was limited due to the use of the Bloom filter. In 2006, Curtmola et al. [3] proposed two new security models called "adaptive security" and "non-adaptive security", introducing a single-keyword-search SSE with a formal security definition. Due to the limitations of the SSE schemes proposed earlier and the dilemma between ensuring user privacy and efficient data usage on the cloud, Kamara et al. [4] introduced the dynamic searchable symmetric encryption (DSSE) method, which enabled the user to perform search and update operations on encrypted data. The general searchable encryption algorithm improves search efficiency at the cost of leaking information about files or queries to the server, such as the search pattern and the access pattern. It is generally acknowledged that a searchable encryption scheme is secure if it does not reveal user data or query information beyond what is disclosed by the leakage profile. However, in the real world, an adversary can exploit these leakages to launch statistical attacks to recover user data and query information.
For instance, several attack schemes have been proposed that recover the user's query information from the access pattern or the search pattern. Islam et al. [5] and Cash et al. [7] first exploited access pattern leakage and prior knowledge about the dataset to recover the user's query information. Liu et al. [6] exploited the search pattern to launch attacks and obtain users' query information. Zhang et al. [8] completely exposed the client's queries and recovered user data and query information through a file injection attack. Oya et al. [9] proposed an attack that leverages both access and search pattern leakage, as well as some background and query distribution information, to recover the keywords of the queries performed by the client. Hoang et al. [10] also leveraged both access and search pattern leakages to recover query keywords. Therefore, an important direction for future research is to suppress this information disclosure rather than accepting it by default.
Several solutions have been proposed for the leakages and attacks described above. Firstly, much of the research focuses on forward-secure and backward-secure methods [11][12][13]. Secondly, ORAM can address the problem of access pattern leakage [14,15,40], but it is impractical for widespread adoption. Garg et al. [16] exploited ORAM and garbled RAM (Random Access Memory) to hide the search pattern. Kamara et al. [17] proposed a general scheme for suppressing search pattern leakage; its structured encryption (SE) united with ORAM made the scheme more efficient than plain ORAM, but the scheme was static. To overcome the shortcomings of ORAM, Wang et al. [19] proposed a scheme that hides the search pattern without utilizing ORAM and has high efficiency. However, the search operation requires an auxiliary server, leading to extra costs. Dauterman et al. [18] leverage linear scans to build a system that achieves better performance for expected workloads, but they do not implement features expected from a search engine, such as finding non-exact matches, ranking results, or providing summaries. In principle, solutions that leak no information to the server can be built on powerful techniques such as secure two-party computation and fully homomorphic encryption (FHE), but they are often impractical.
To achieve more secure searchable encryption, both the search pattern and the access pattern must be hidden. The solutions above address either the search pattern leakage or the access pattern leakage, but not both. Although the methods proposed by Hoang et al. [20,21] exploited distributed data structures to hide the search pattern or the access pattern, it is usually necessary to consider collusion between the servers. Akavia et al. [22] proposed a secure search using FHE on an encrypted data structure. This scheme can seal the leakages of both the search pattern and the access pattern, but it is difficult to deploy in real environments. In addition, these solutions only address the information leakage from the incidence structure (explicit) without considering the information leakage from accessing the files (implicit).
The works [14][15][16][17] detailed above addressed the problem of access pattern leakage, but using ORAM causes high storage and communication overhead. The works [18,19] hide the search pattern without using ORAM, but they do not hide both the search pattern and the access pattern. Oya et al. [9] pointed out that hiding the access pattern or the search pattern alone is not enough. Although the works [20][21][22] hide both the search pattern and the access pattern, they incur unacceptable computational and communication burdens.
It is essential for a DSSE scheme to hide both the search pattern and the access pattern while maintaining high efficiency. This paper proposes a new dynamic searchable encryption scheme called obliviously shuffled incidence matrix DSSE (OSM-DSSE) to access encrypted data obliviously under a single server. OSM-DSSE can hide both the explicit and the implicit search pattern in addition to the access pattern, contributing to a higher level of security with low storage and communication overhead. The contributions are as follows:
1. This paper proposes a shuffling algorithm based on Paillier encryption [23,24] to address the problem of access pattern leakage; it shuffles the data in the incidence matrix to change the access path.
2. This paper performs the search based on the group query and an efficient 1-out-of-n OT protocol, ensuring the privacy of both the server and the client. At the same time, random tokens generated in combination with the shuffling algorithm hide the explicit search pattern.
3. Since searching for the same keyword always returns the same file set, the response length is disclosed. This paper proposes an optimized database padding algorithm to hide the response length, which combines a computationally optimal padding-length determination scheme with a clustering algorithm to reduce redundant data storage and completely hide the implicit search pattern with minimum storage overhead.
The rest of this paper is organized as follows. The related work is presented in Section 2. The preliminaries are presented in Section 3. The overview of the proposed scheme is described in Section 4. A detailed description of the algorithms is provided in Section 5. The experiment and analysis are provided in Section 6.

Related work
Many researchers have focused on searchable encryption. Song et al. [1] first proposed the concept of static searchable encryption, making search over encrypted data possible. Curtmola et al. [3] proposed the concept of adaptive security and the first scheme with optimal search complexity, linear in the number of documents containing the queried keyword w. Many improvements have been made in their subsequent work [25][26][27]. Chase and Kamara proposed structured encryption to support queries on arbitrary data structures. Kamara et al. [4] proposed the concept of dynamic searchable encryption, so that searchable encryption was no longer limited to static operations. Although subsequent research efforts focused on effectiveness [17,28,29], dynamics [11,30,31], localization [32,33], security [34][35][36], and complex functions [37][38][39], they still leak some important information. Attackers can use these leakages to recover data and cause even more serious information leakage [5][6][7][8].
To solve the above problems in traditional DSSE, some solutions have been proposed to deal with these leakages and attacks. Most of this research focused on forward-secure and backward-secure properties. Forward security refers to breaking the linkability between newly added data and previous query keywords; backward security means that the server can no longer match and retrieve deleted data. Stefanov et al. [11] proposed the first solution to support the forward-secure property, but it exhibits linear time complexity for the search. Bost et al. [12] proposed a scheme relying on primitives such as constrained pseudorandom functions and puncturable encryption to achieve fine-grained control of the adversary's power, preventing the adversary from evaluating functions on selected inputs or decrypting specific ciphertexts, for forward and backward security. Sun et al. [13] proposed the first practical, non-interactive backward-secure SSE scheme using symmetric puncturable encryption. However, the forward-secure and backward-secure methods mainly target the information leakage in the update phase without considering the leakage of the access pattern and the search pattern. As a result, the problem of information leakage was not completely solved.
The oblivious random access machine (ORAM) is a traditional method to solve this problem. It can hide the access pattern by obfuscating each access so that it is indistinguishable from a random access. The access pattern here refers to the sequence of operations and memory addresses. ORAM was first proposed by Goldreich and Ostrovsky [14] to ensure that no data block in memory permanently resides at a physical address and that any two accesses are unrelated. Goldreich and Ostrovsky also proposed an ORAM model, giving a square-root and a hierarchical (layered) solution. Zhang et al. [15] proposed an ORAM-based access pattern protection method for the cloud storage environment. Garg et al. [16] proposed the TWORAM scheme, which reduces client storage overhead while hiding the file access pattern with ORAM. Demertzis et al. [40] presented SEAL, a framework for encrypted databases with improved security via light use of ORAM and padding. However, research has shown that using ORAM to eliminate information leakage leads to high overhead and low execution efficiency [41][42][43][44][45].
There are also many schemes that can hide the search pattern without using ORAM, and most of them do so with high efficiency. Dauterman et al. [18] hide the search pattern in order to guarantee trapdoor and keyword privacy. They used a particular type of additive homomorphic encryption to accomplish conjunctive-keyword searchable encryption. To meet the privacy aims of the strategy, they used two servers: a cloud server and an auxiliary server. They also used random polynomials to improve user security. Their approach ensures a higher level of protection for cloud users; however, the search operation requires the auxiliary server, making the method less efficient. Recently, Dauterman et al. [46] presented Waldo, a time-series database with several functionality and security guarantees. Waldo allows multi-predicate querying and protects data contents, query filter values, and search access patterns. Waldo leverages function secret sharing, a recent cryptographic primitive; however, it does not support cryptographic access control. Cheng et al. [47] proposed a privacy-preserving protocol for vehicle feedback. To improve the security and privacy of vehicle feedback, they use Paillier encryption to encrypt individual feedback and forward it to the RSU. The RSU then aggregates multiple encrypted feedbacks into one aggregated ciphertext and sends it to the cloud service provider (CSP). Therefore, the CSP cannot learn anything about a single vehicle's feedback and obtains only the sum of the vehicles' feedback in each segment of the feedback list.
Oya et al. [9] point out that, beyond the access pattern, SSE also leaks the search pattern, which can be further leveraged for query recovery. However, the effectiveness of this attack relies on the strong assumption of fully knowing real-world query frequencies. Besides, the accuracy of their attack, based on maximum likelihood estimation, is also probabilistic and depends significantly on the query distribution. Recent attacks do not require strong knowledge assumptions about the plaintext data, but the search pattern is essential for their accuracy [48,49].
In Table 1, we have described the main pros and cons of the existing related schemes.
Unfortunately, no existing scheme hides both the access pattern and the search pattern successfully. Moreover, the amount of data stored on cloud servers has recently increased dramatically [50]. Hence traditional methods with high storage and communication overhead are impractical, which requires researchers to probe for new solutions to the problems mentioned above.

System goal
We aim to effectively perform privacy-protected keyword search and file update on an encrypted cloud database. The main objectives of this system are as follows:
• Hide the access pattern. We utilize Paillier encryption to shuffle the incidence matrix. This algorithm randomizes the positions of keywords in the incidence matrix, confuses access paths, and hides the access pattern.
• Hide the search pattern. 1. Based on the group query, the server utilizes the two-level map to obtain the target data block containing multiple pieces of data. Furthermore, it executes an efficient 1-out-of-n OT protocol with the client to obtain the target. In this process, the 1-out-of-n OT protocol prevents the server from distinguishing which keyword the client is searching for, while the client learns nothing about the server's other messages except for the searched keyword. This protocol protects the privacy of the client and the server simultaneously. Besides, shuffling is performed after each search, which changes the row position of the keyword in the incidence matrix. It also converts the deterministic token into a random token (explicit search pattern), so the adversary cannot launch an attack by analyzing the search frequency. 2. If the client searches for the same keyword, a file set of the same size is always returned, so the adversary can launch an attack by analyzing the response length. Therefore, this paper utilizes an optimized padding strategy based on a clustering algorithm to hide the response length (implicit search pattern).

System model
Our system utilizes the client-server model (refer to Fig. 1). The client extracts the keywords of the files, constructs an incidence matrix between the keywords and the files, encrypts the incidence matrix and the files, and sends them to the server. The client issues search and update requests to the server. The server stores the encrypted incidence matrix and responds to the client's search and update requests. Note that we consider a semi-honest (honest-but-curious) server: even though data files are encrypted during access, the cloud server may try to derive sensitive information from the users' search requests. Thus, although the server faithfully follows the protocol, it can still learn information.

Table 1 Comparison with existing schemes
• Traditional DSSE schemes: leak both the access pattern and the search pattern information.
• [14][15][16] ORAM protocol: hides the file access pattern; leaks the search pattern information and leads to high storage overhead.
• [17] Combined ORAM protocol: hides the file search pattern; not a dynamic protocol.
• [18] Search pattern hiding protocol without using ORAM: improves the performance and security of searchable encryption; the auxiliary server makes the protocol less efficient.
• [47] Privacy protection protocol for feedback providers in the cloud-assisted VANET: identity privacy preservation, feedback privacy protection, improved VANET performance.
• Our paper (OSM-DSSE): searchable encryption scheme hiding both the search pattern and the access pattern with high efficiency; hides the access pattern with low storage overhead, hides the search pattern with high efficiency, and considerably reduces the storage overhead of padding schemes.

System overview
The design goal of this system is to hide the search pattern and the access pattern of the searchable encryption scheme. Search and update are two core operations in the system.

Hiding the access pattern
The access pattern refers to the user's access path, where the same access path easily identifies a repeated query. When the client sends a search request to the server, repeated searches also disclose the access pattern. The attacker can track the access path to obtain the query keyword and information about the incidence matrix. It is challenging to hide the access pattern while significantly reducing the computational and communication cost of the searchable encryption scheme. The existing schemes [20,43] usually use the "fetch-decrypt-reencrypt-upload" strategy to hide the access pattern; since the data volume in searchable encryption schemes is usually huge, this causes high communication and computation overhead. Although the "fetch-decrypt-reencrypt-upload" strategy can be applied adaptively, e.g., re-encrypting only the access path of one search plus several dummy access paths instead of the whole data structure, this still leaks important information under statistical attack. Fortunately, the matrix data structure allows the client to upload only a confusion matrix to the server, and the server performs a homomorphic calculation between the confusion matrix and the incidence matrix. The shuffling process is divided into two stages, shuffling and homomorphic decryption, as shown in Fig. 2.
The specific procedure of shuffling is as follows. On the left in Fig. 2, the client calculates the confusion matrix from the permutation matrix and the diagonal matrix, and the confusion matrix is encrypted with the Paillier public key pk. Here, some formulas are given to facilitate the calculation of the confusion matrix.
1. Matrix-based data shuffling. Given a data sequence B = (B_1, ..., B_n) and an n × n permutation matrix P, the positions of the data blocks are changed by B · P; for example, a permutation matrix that swaps two columns swaps the corresponding data blocks. 2. A diagonal matrix Q is constructed to mask the permuted values. 3. Based on formulas (1) and (2), the confusion matrix is obtained as M = P · Q. 4. The client encrypts the confusion matrix with the Paillier public key pk. On the right in Fig. 2, the server performs the homomorphic calculation between the encrypted confusion matrix and the incidence matrix to obtain the shuffled incidence matrix.
The row positions in the incidence matrix are changed once the shuffling phase is over, and homomorphic decryption is performed subsequently. After shuffling, the incidence matrix on the server is encrypted under homomorphic encryption. By the properties of Paillier encryption, the two operands of the calculation are one homomorphically encrypted matrix (the confusion matrix) and one non-homomorphically encrypted matrix (the incidence matrix). Therefore, to facilitate the next shuffle operation, the server needs to decrypt the incidence matrix with the Paillier secret key sk before the next search is performed. It should be noted that the client generates a different Paillier public/private key pair (pk, sk) each time to resist malicious attacks by the server.
The server performs homomorphic decryption with sk to get the symmetric encrypted incidence matrix.
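The two-stage process above can be sketched in Python with a textbook Paillier implementation. This is an illustration under simplifying assumptions: the primes are toy-sized, the diagonal masking matrix Q is omitted, and the incidence matrix is shown as plain bits rather than symmetrically encrypted values.

```python
import random
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

class Paillier:
    """Minimal textbook Paillier (g = n + 1 variant); toy parameters, illustration only."""
    def __init__(self, p, q):
        self.n = p * q
        self.n2 = self.n * self.n
        self.g = self.n + 1
        self.lam = lcm(p - 1, q - 1)
        self.mu = pow(self.lam, -1, self.n)   # with g = n+1, L(g^lam mod n^2) = lam mod n

    def enc(self, m):
        r = random.randrange(2, self.n)
        while gcd(r, self.n) != 1:
            r = random.randrange(2, self.n)
        return (pow(self.g, m, self.n2) * pow(r, self.n, self.n2)) % self.n2

    def dec(self, c):
        L = (pow(c, self.lam, self.n2) - 1) // self.n
        return (L * self.mu) % self.n

pk = Paillier(1789, 1861)                     # toy primes

# Client: permutation matrix P that swaps rows 0 and 2, encrypted entrywise.
P = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
encP = [[pk.enc(v) for v in row] for row in P]

# Server: incidence matrix I (plaintext bits here for clarity).
I = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]

# Homomorphic "dot product" Enc(P) ⊙ I: multiplying ciphertexts adds plaintexts,
# and exponentiating by the plaintext bit scales them, so the server obtains
# Enc(P · I), i.e. an encryption of the row-shuffled matrix, obliviously.
shuffled = [[None] * 3 for _ in range(3)]
for i in range(3):
    for j in range(3):
        acc = 1
        for k in range(3):
            acc = (acc * pow(encP[i][k], I[k][j], pk.n2)) % pk.n2
        shuffled[i][j] = acc

# Decryption (done with sk in the scheme) recovers the shuffled incidence matrix.
result = [[pk.dec(c) for c in row] for row in shuffled]
assert result == [I[2], I[1], I[0]]           # rows 0 and 2 have been swapped
```

Note that the server only ever sees Paillier ciphertexts of the permutation, so it cannot tell which rows moved where.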

Hiding the search pattern
Oya and Kerschbaum [9] pointed out that the search pattern can be divided into the explicit and the implicit. The explicit means that searching for the same keyword always generates the same deterministic token, while the implicit means that searching for the same keyword always returns a result set of the same size. Attackers can perform query recovery attacks using the query volume and frequency leakages. It is challenging to hide the search pattern while significantly reducing the communication cost of the searchable encryption scheme. A naive padding scheme can effectively hide the implicit search pattern, but it pads the result set of every keyword to the maximum length, which causes huge storage overhead. This paper proposes an optimized padding scheme that uses a clustering algorithm and calculates the optimal padding length without leaking private information.
Therefore, both the explicit and the implicit search patterns are hidden in this paper.
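The clustering-based padding just described can be sketched as follows. This is our simplified illustration, not the paper's exact OptPad algorithms: a 1-D k-means groups keywords by real response length, and every keyword is padded to its cluster's maximum, so the server observes at most k distinct volumes.

```python
import random

def kmeans_1d(lengths, k, iters=50):
    """Toy 1-D k-means over response lengths (illustrative, not production code)."""
    centers = random.sample(lengths, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in lengths:
            groups[min(range(k), key=lambda i: abs(x - centers[i]))].append(x)
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    return groups

random.seed(1)
lengths = [2, 3, 3, 40, 41, 42]        # real response length of each keyword
clusters = kmeans_1d(lengths, k=2)
# Pad each keyword's result set up to the maximum length in its cluster.
padded = {x: max(c) for c in clusters for x in c}
assert all(padded[x] >= x for x in lengths)          # padding never shrinks a set
assert len(set(padded.values())) <= 2                # server sees at most k volumes
```

Compared with padding everything to 42, the cluster with small response lengths only pays a few dummy entries per keyword, which is the storage saving the optimized scheme targets.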

Implicit search pattern hiding
In this paper, we design an optimized padding scheme that significantly reduces server storage overhead to hide the implicit search pattern. The details of the optimized padding scheme are introduced as follows.
Padding the response length of each keyword to the maximum response length is a naive approach to designing a volume-hiding encryption scheme. It is easy to see that this hides the response lengths; unfortunately, it also induces a non-trivial storage overhead, so we propose an optimized padding method to overcome this shortcoming. Assume that the number of keywords in the database is N. To prevent attacks based on the response length, the database owner can pad the real response-volume distribution by inserting dummy files. More formally, when responding to a query for the keyword w, the CS observes that it matches N_w^o = N_w^r + N_w^p entries, where N_w^r is the real response length and N_w^p the number of padded dummy entries. Since e_w^p must be between 0 and 1, a lower bound on the storage overhead can be obtained. To achieve the optimal padding that minimizes this overhead, we utilize the classical k-means clustering algorithm to cluster keywords according to their response lengths. For a given optimal clustering Γ = (G_1, ..., G_m), when w ∈ G_i, e_max(i)^r represents the maximum value of e_w^r and e_min(i)^r the minimum. According to Eq. (7), the minimum storage overhead can be calculated; it is attained when every keyword in a cluster is padded to the maximum response length within that cluster. The optimized padding scheme OptPad = (ResNum, Cluster, PadNum, Pad) is a tuple of four polynomial-time algorithms:
• L ← ResNum(I) is a probabilistic algorithm that calculates the length of the result set of each keyword. It takes a random incidence matrix I as input and outputs L, where L_i is the length of the result set of keyword w_i.
• Γ ← Cluster(L) is a probabilistic algorithm that clusters keywords according to their response lengths; it takes L as input and outputs the optimal clustering Γ = (G_1, ..., G_m).

Explicit search pattern hiding

The group query is performed in this paper. It should be noted that if keywords with similar semantics are grouped together, the adversary can use the semantic similarity to infer relationships between the keywords, causing a partial privacy leakage.
So, the secure incidence is constructed where the pseudo-random function determines the location of keywords and files to ensure the randomness of the storage location. In this case, the attacker cannot infer the relationship between the keywords in the subsequent group queries. For search, the two-level map and the efficient 1-out-of-n OT protocol are utilized to hide the explicit search pattern. After the search, the server performs the shuffling operation (shown in Fig. 2) to change the row position of the incidence matrix. Since the search token is related to the keyword's position, the shuffling also converts the deterministic search token into a random one.
The client utilizes the pseudo-random function and a hash to randomly locate the keywords before performing the group query. The search process is as follows. The client obtains the row number of the search keyword according to the dictionary D and calculates the block number l to which the keyword belongs. The client's selection is then combined with l to generate a search token, which is sent to the server. Once the server receives the search token, it retrieves the two-level map Ω to obtain a row-number group according to the search token, and then retrieves the incidence matrix according to the row-number group to obtain a data block. After that, the server and the client execute an efficient 1-out-of-n OT protocol. The server returns the search results to the client for decryption, and the client obtains the set of file identifiers containing the searched keyword.
After the search, the incidence matrix is shuffled with the shuffling algorithm so that the next search token for the same keyword is a fresh, random one. The diagram of hiding the explicit search pattern is shown in Fig. 3.
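The effect of shuffling on the explicit search pattern can be illustrated with a small sketch. The HMAC-based token derivation and the names (make_token, the block size, the demo key) are our illustrative assumptions, not the scheme's exact token algorithm; the point is only that a token bound to the keyword's current row becomes random once shuffling moves the row.

```python
import hashlib
import hmac

def make_token(key: bytes, row: int, block_size: int = 4) -> bytes:
    """Derive a search token from the keyword's current row; only the block
    number enters the token (group query), not the exact row."""
    block_no = row // block_size
    return hmac.new(key, block_no.to_bytes(4, "big"), hashlib.sha256).digest()

key = b"k3-demo-key"
D = {"invoice": 9}                 # dictionary D: keyword -> current row number
t1 = make_token(key, D["invoice"])

D["invoice"] = 17                  # after PathShuffling the keyword's row changes...
t2 = make_token(key, D["invoice"])
assert t1 != t2                    # ...so repeated searches emit different tokens
```

Without the shuffle, both searches would hit row 9 and emit the same token, which is exactly the deterministic (explicit) search pattern the scheme removes.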

Data structure
• Incidence matrix. The incidence matrix is used to build an encrypted incidence structure, and its elements represent the correspondence between keywords and files. Specifically, a row indicates a keyword w, and a column indicates a file id.
• Two-level map Ω(M_w, A). The two-level map is constructed through the following three steps: (a) The dictionary D is partitioned into blocks I_1, ..., I_t, and I_t is padded up to the block size if necessary. (b) The block number is taken as the key of M_w, and the value corresponding to the key is the starting position of the corresponding data block in the array A. (c) The row numbers of the incidence matrix are stored in the array A.
• Address map table M_f⟨s_id, j⟩. For an update, the server determines the column j where the updated file is located according to the update token.
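The two-level map and the group query it supports can be sketched as follows. The layout is an illustrative assumption (plain block numbers instead of encrypted keys, and a fixed block size v):

```python
v = 3                                   # block size (illustrative)
# Second level A: row numbers of the incidence matrix for keywords w1..w6,
# stored block by block.
A = [4, 0, 7, 2, 9, 5]
# First level M_w: block number -> starting position of that block in A.
# In the real scheme the key is an encrypted block number.
Mw = {blk: blk * v for blk in range(len(A) // v)}

def group_query(block_no):
    """Server side of a group query: return the whole row-number block,
    so the server never learns which single row the client wants."""
    start = Mw[block_no]
    return A[start:start + v]

assert group_query(0) == [4, 0, 7]
assert group_query(1) == [2, 9, 5]      # block containing the target keyword
```

The client then runs the 1-out-of-n OT over the returned block to pick out the one row it actually needs.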

Construction of OSM-DSSE
This section defines the OSM-DSSE scheme, consisting of the four algorithms (Setup, Search, PathShuffling, Update) presented in the subsections below.
• (K, I, C) ← Init( , , F) : various parameters are input to obtain the public parameters.
Firstly, the client generates public parameters by DSSE.KeyGen . These parameters include the symmetric key K F to encrypt files and the key K I to encrypt incidence.
Secondly, the client constructs a secure random incidence matrix by DSSE.BuildIndex . As shown in Algorithm 1, the client generates a random key K I . The client extracts the keywords W = w 1 , ..., w m from the file set F = f 1 , ..., f n (each file has a unique identifier id 1 , ..., id n ).
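The index construction in this step can be sketched as follows. The toy files and the whitespace-based keyword extraction are illustrative assumptions, and the matrix is shown in plaintext before the symmetric encryption and random-placement layers:

```python
# Toy file set F with identifiers (illustrative data).
files = {"id1": "pay the invoice", "id2": "invoice overdue", "id3": "pay now"}

# Extract the keyword set W (naive whitespace tokenization for the sketch).
keywords = sorted({w for text in files.values() for w in text.split()})
ids = sorted(files)

# Incidence matrix I: rows = keywords, columns = files; entry is 1 iff the
# keyword appears in the file.
I = [[1 if w in files[f].split() else 0 for f in ids] for w in keywords]

# Dictionary D: keyword -> row number (kept by the client to build tokens).
row = dict(zip(keywords, range(len(keywords))))

assert I[row["invoice"]] == [1, 1, 0]   # "invoice" occurs in id1 and id2 only
```

In the real scheme the row positions are chosen by a pseudo-random function and the matrix entries are symmetrically encrypted before upload.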

Fig. 3 Hidden search pattern diagram
Thirdly, the client divides the keyword set W into data blocks of a fixed size. The last data block is padded up to the block size if necessary, and the blocks are numbered 1, 2, ..., ⌈m/(block size)⌉. The client constructs a two-level map Ω(M_w, A) and encrypts the keys of M_w. Simultaneously, the client constructs the dictionary D from the pairs ⟨w_i, x_i⟩ and the address map table M_f from the pairs ⟨id_j, ⟨y_j, st⟩⟩.
Then, the client encrypts the files with the symmetric key, sending the secure random incidence matrix I, the encrypted files C, the two-level map Ω, and the address map table M_f to the server, while saving the keys and the dictionary D locally. Lastly, the optimized padding method is executed to hide the implicit search pattern. As shown in line 15 of Algorithm 1, we determine the length of the result set that needs to be padded for each keyword, L = (L_1, ..., L_m), according to Eq. (3.2), select the maximum padding length as the number of dummy files, and randomly insert the columns corresponding to the dummy files into the incidence matrix. Consequently, if the response length of the i-th row needs to be padded, we randomly select l dummy columns.
For a search, the client generates the search token ⟨l||k_3, y⟩ by DSSE.SrchToken, where l||k_3 is the encrypted block number and y is the client's choice used to implement the 1-out-of-n OT protocol. The client obtains the row number x_i by x_i ← D[w_i] and calculates y = g^r h^α mod p, where r (r < p) is a random value and α is the client's choice index within the block.
The server performs the group query according to the search token as follows. Firstly, the server parses ⟨l||k_3, y⟩ and queries M_w[l||k_3] of Ω(M_w, A) to obtain the starting position i of the keyword's block in the array A, and then sequentially reads A to obtain the rows (r_i, ..., r_{i+v}). Secondly, the server searches the encrypted incidence matrix according to the rows (r_i, ..., r_{i+v}) and obtains a data block B = (b_1, ..., b_v) of the block size, which is symmetrically encrypted.
Then, for this result set, the server and the client execute the efficient 1-out-of-n OT protocol, where n is the block size. The server calculates the pairs (a_1, c_1), ..., (a_n, c_n), where a_j = g^{k_j} mod p, c_j = m_j · (y/h^j)^{k_j} mod p, k_j ∈_R Z*_p, j = 1, ..., n; a_j is the auxiliary parameter, and c_j is a ciphertext sent by the server to the client.
Finally, the client decrypts the ciphertext. The client performs the first decryption m = c_α / (a_α)^r mod p to obtain the symmetrically encrypted result. Then, the client performs the second decryption with the initial row number r_i and the key K_I to obtain the file identifiers containing the queried keyword, i.e., Υ_w ← m ⊕ H(K_I || r_i), where Υ_w ∈ {0, 1}^n is the result vector.
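The OT exchange can be sketched as follows, assuming a Tzeng-style 1-out-of-n OT with toy parameters. A real run uses large primes, and the discrete logarithm of h with respect to g must be unknown to the client; here we derive h from a fixed exponent only to set the demo up.

```python
import random

p = 2027                          # toy prime; far too small for real use
g = 3
h = pow(g, 17, p)                 # setup value; exponent 17 is a demo assumption

def client_query(alpha):
    """Client's choice alpha is blinded as y = g^r * h^alpha mod p."""
    r = random.randrange(1, p - 1)
    y = (pow(g, r, p) * pow(h, alpha, p)) % p
    return r, y

def server_respond(y, messages):
    """Server sends (a_j, c_j) with a_j = g^{k_j}, c_j = m_j * (y / h^j)^{k_j}."""
    out = []
    for j, m in enumerate(messages):
        k = random.randrange(1, p - 1)
        a = pow(g, k, p)
        hj_inv = pow(pow(h, j, p), p - 2, p)        # (h^j)^(-1) mod p
        c = (m * pow(y * hj_inv % p, k, p)) % p
        out.append((a, c))
    return out

def client_recover(r, pairs, alpha):
    """For j = alpha, y / h^alpha = g^r, so c_alpha = m_alpha * a_alpha^r."""
    a, c = pairs[alpha]
    return (c * pow(pow(a, r, p), p - 2, p)) % p    # m = c_alpha / a_alpha^r

messages = [101, 202, 303, 404]                     # the block's n entries
r, y = client_query(2)
pairs = server_respond(y, messages)
assert client_recover(r, pairs, 2) == 303           # only the chosen entry opens
```

For every j ≠ α the blinding factor (y/h^j)^{k_j} stays random to the client, so the other block entries remain hidden, while the server learns nothing about α from y.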
The server returns the file sets, containing dummy items, to the client based on the result vector after padding. After receiving the encrypted data, the client decrypts it using the key K_F to obtain the data satisfying the query.
• I'' ← PathShuffling(I'): input the incidence matrix to be shuffled, and output the new incidence matrix after shuffling.
After the search, the server executes the path shuffling, as shown in Fig. 2. The client constructs and encrypts the confusion matrix with Paillier, then uploads it to the server. The server performs the homomorphic calculation. This process can change the access path to hide the search pattern and access pattern.
First, the client constructs the permutation matrix P and the diagonal matrix Q. The matrix product of P and Q forms the confusion matrix M = P · Q.
The client generates a public and private key pair (pk, sk) by PE.Gen, and encrypts the confusion matrix M with pk, i.e., M′ = PE.Enc_pk(M). Then, the client sends M′ to the server. The server performs a homomorphic calculation between M′ and I′ to obtain the shuffled incidence matrix Ĩ, namely Ĩ ← M′ ⊙ I′. Since Paillier encryption is semantically secure, the same plaintext can generate different ciphertexts. The result is homomorphically encrypted after the homomorphic calculation is performed. However, this is not conducive to the next data shuffling, so the client sends sk to the server for homomorphic decryption before the next search, i.e., I″ ← PE.Dec_sk(Ĩ), where I″ represents the result of symmetric encryption. After shuffling, the keywords' positions in the incidence matrix will be changed.
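The shuffle can be sketched end to end with a toy Paillier instance. The primes below are deliberately tiny, and the 3 × 3 permutation, diagonal, and incidence matrices are made-up illustrative values; the homomorphic "matrix product" computes Enc((M · I′)[i][k]) by multiplying Enc(M[i][j])^{I′[j][k]} over j.

```python
import math
import random

# --- Minimal Paillier sketch (toy primes; real keys are >= 2048 bits) ---
PRIME_P, PRIME_Q = 1_000_003, 1_000_033
N = PRIME_P * PRIME_Q
N2 = N * N
LAM = math.lcm(PRIME_P - 1, PRIME_Q - 1)   # lambda = lcm(p-1, q-1)
MU = pow(LAM, -1, N)                        # valid because g = N + 1

def enc(m):
    while True:
        r = random.randrange(2, N)
        if math.gcd(r, N) == 1:             # r must be a unit mod N
            return (pow(N + 1, m, N2) * pow(r, N, N2)) % N2

def dec(c):
    u = pow(c, LAM, N2)
    return ((u - 1) // N) * MU % N

# --- Client: confusion matrix M = P_perm . Q_diag, encrypted entrywise ---
perm = [2, 0, 1]   # hypothetical permutation: output row i takes input row perm[i]
diag = [3, 5, 7]   # hypothetical diagonal scaling values
M = [[diag[j] if perm[i] == j else 0 for j in range(3)] for i in range(3)]
M_enc = [[enc(v) for v in row] for row in M]

# --- Server: homomorphic matrix product with the incidence matrix I'.
# prod_j Enc(M[i][j])^I'[j][k] = Enc(sum_j M[i][j] * I'[j][k]) = Enc((M . I')[i][k])
I_mat = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
shuffled = []
for i in range(3):
    row = []
    for k in range(3):
        acc = 1
        for j in range(3):
            acc = acc * pow(M_enc[i][j], I_mat[j][k], N2) % N2
        row.append(acc)
    shuffled.append(row)

# --- Decryption (in the scheme, done with the sk sent before the next search) ---
result = [[dec(c) for c in row] for row in shuffled]
# result[i] equals diag[perm[i]] * I_mat[perm[i]]: rows permuted and scaled
```

Because M has exactly one nonzero entry per row, the homomorphic product both permutes and rescales the rows of I′, which is what changes each keyword's position between searches.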
To correctly obtain the row number of the next keyword for search, the client needs to update the dictionary D . Moreover, the client should generate the different public and private key pairs to ensure the server cannot decrypt the confusion matrix.
• (C′, I′_upd) ← Update(U): Input the key and a file, and output the updated encrypted file and incidence matrix.
The update operation needs an interaction between the client and the server, and it covers both add and delete operations. An update token generated by DSSE.UpdToken is used by the server to perform the update. It should be noted that the proposed solution will not reveal the update type, since both the add and delete operations write a column back to the server in the same way.
For the add operation, the client confirms via T_f that column j is to be added and sets the status value to 1. The client extracts the keywords of the file and constructs a column matrix I according to T_w before encrypting the file. Then, the client sends the update token and the encrypted file c′ to the server. The server utilizes the token to update the incidence matrix I′, the address map table M_f, and the ciphertext C.
For the delete operation, the client confirms via T_f that column j is to be deleted and sets the status value to 0. The client constructs a column matrix I with all 0s. Then, the client sends the update token to the server. The server utilizes the token to update the incidence matrix I′, the address map table M_f, and the ciphertext C.
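The update-type hiding described above can be sketched as follows. The function names and the plaintext 0/1 columns are assumptions for illustration; in the real scheme the column is encrypted before upload, so the server sees the same kind of write either way.

```python
def build_update_column(keywords, file_keywords, m, op):
    """Client side: produce the column to write back for a file update.

    keywords      -- ordered list of all m keywords (fixes the row order)
    file_keywords -- keywords of the file being added (ignored for delete)
    op            -- "add" or "delete"

    An add writes the file's real 0/1 column; a delete writes an all-zero
    column. Both are a full-column write, hiding the update type.
    """
    if op == "add":
        return [1 if w in file_keywords else 0 for w in keywords]
    return [0] * m

def apply_update(matrix, j, column):
    """Server side: overwrite column j of the incidence matrix in place."""
    for i, bit in enumerate(column):
        matrix[i][j] = bit
    return matrix
```

Since `apply_update` is identical for both operations, the server's view of an add and a delete differs only in (encrypted) column contents.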

Security analysis
Some definitions and theorems are given first to prove the security of the above scheme.
Definition 1: The leakage function L = (L^stp, L^srch, L^upd) is defined as follows.
1. (N, ID, <‖c_1‖, …, ‖c_n‖>) ← L^stp(I′, C): Input the encrypted incidence matrix I′ and the encrypted file set C; output the maximum number N of keywords and files, the identifier set ID, and the ciphertext sizes ‖c_1‖, …, ‖c_n‖.
The adversary's advantage given in Theorem 1 is negligible. A series of five games is defined to prove the security (refer to the Appendix for details). The first game is a real experiment, and the last game is an ideal experiment. The success event of each game is also defined, where Game_i = 1 stands for the event that the adversary correctly guesses the challenge bit b, and Pr[Game_i = 1] represents the probability that the adversary's attack succeeds. The security is proven by the progressive relationship between the games, and the full proof is provided in the Appendix.

Storage overhead
Client storage: The client maintains two hash tables and a dictionary. The storage costs of the two hash tables are proportional to the number of keywords and files, i.e., O(m) and O(n), where m and n represent the number of keywords and files, respectively. The storage cost of the dictionary is proportional to the number of keywords, i.e., O(m). Therefore, the total storage cost of the client is O(m + n).
Server storage: The server maintains the incidence matrix I′, a two-level map Ω, and an address map table M_f. The incidence matrix is an m × n-dimensional matrix with a storage cost of O(m · n). The two-level map Ω consists of two parts: an address map table M_w and an array A. The storage of M_w is proportional to the number of blocks; assuming that the data are divided into t blocks, the storage cost of M_w is O(t). The size of array A is related to the number of rows of the incidence matrix, so its storage cost is O(m). The storage cost of M_f is related to the number of files, i.e., O(n). Therefore, the total storage cost of the server is O(m · n + m + n + t).

Communication overhead
In the setup phase, the client sends the encrypted incidence matrix and the encrypted files to the server. The communication overhead is O(m · n + n · c_i), where m × n is the size of the encrypted incidence matrix and c_i is the size of each encrypted file.
In the search phase, the client sends the m × m confusion matrix to the server, and the communication overhead is O(m²). The server returns an encrypted data block of the block size, with a communication overhead linear in the block size.
In the update phase, the client sends the m × 1 column matrix to the server, and the communication overhead is O(m).

Computational overhead
Client computational overhead: The client mainly generates a permutation matrix, a diagonal matrix, and an encrypted confusion matrix. Both the permutation matrix and the diagonal matrix are of dimension m × m, so the confusion matrix is also of dimension m × m. Each of the permutation matrix and the diagonal matrix contains only m nonzero entries; the remaining entries are all 0, and the computational cost of generating zeros is negligible. So, the computational cost of generating the permutation matrix and the diagonal matrix is O(m).
Server computational overhead: The server needs to re-encrypt a data block of the block size and performs two modular exponentiations per message when executing the 1-out-of-n OT protocol. The server mainly performs the homomorphic calculation between the confusion matrix and the incidence matrix. The size of the confusion matrix is m × m, and the target matrix has m rows, so the computational cost is O(m³).

Comparison
As shown in Table 2, the proposed OSM-DSSE scheme is compared with some existing schemes in terms of storage overhead, communication overhead, and the ability to hide the search and access patterns. The overhead of all schemes is measured on average, and only the size of the encrypted incidence matrix is considered for server-side storage. m and n respectively denote the number of keywords and files, with maximum values M and N, i.e., m ≤ M, n ≤ N; k represents the number of servers; the block size denotes the size of a data block; p represents the number of processors; and z is the size of the bucket in PathORAM.
IM-DSSE is a traditional DSSE scheme that leaks the search pattern and access pattern. ODSE employs multi-server PIR and Write-Only ORAM to hide the access pattern. DOD-DSSE leverages two non-colluding servers to realize a "fetch-reencrypt-swap" strategy, so that the data-structure access pattern can be hidden. DORY splits trust between multiple servers and utilizes Bloom filters to hide the access pattern. Compared with the above schemes, the proposed scheme not only achieves low storage and communication overhead, but also hides both the search pattern and the access pattern.

Experiment preparation
The proposed OSM-DSSE scheme is evaluated in a real network environment and system setting. For the search operation, a round of interaction is defined as client → server → client, meaning that a search request is sent from the client to the server and the data block is then downloaded from the server to the client. For the update operation, a round of interaction is defined as client → server, indicating that the client sends an update request to the server.
The hardware of the client and the server is configured as follows. The client is configured with an Intel Core i5-8400 CPU @ 2.80 GHz, 16 GB RAM, a 256 GB hard disk, and a 1 TB SSD, and runs Windows 10 64-bit. The server is configured with 32 CPUs @ 2.70 GHz and 512 GB RAM, and runs CentOS 7.2 64-bit.
The Google sparse hash is used to realize the hash tables T_f and T_w, and the hash tables are saved on the client. The files and the incidence matrix are pre-encrypted with an IND-CPA secure scheme and sent to the server.
The online public dataset Enron [51] (a mail dataset) is taken as the experimental dataset. The dataset contains data from approximately 150 users, and the corpus contains about 500,000 messages. In the experiment, the emails of the 150 users are used. Since most of the emails are personal, they capture informal conversations between two individuals. Therefore, a stemming algorithm, namely the Porter Stemming Algorithm [52], is used to find each word's root in the document set, and the most common words such as 'the', 'a', and 'from' are deleted to extract keyword sets from the corpus. For comparison, 300,000 files and 300,000 keywords are selected to construct encrypted incidence matrices of different sizes (the largest incidence matrix has 9 × 10^10 keyword-file pairs).
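A minimal sketch of this preprocessing is shown below. It only performs lowercasing, tokenization, and stopword removal; the stopword list is an assumption, and a real pipeline would additionally reduce each word to its Porter stem as in [52].

```python
import re

# Assumed (abbreviated) stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "from", "to", "and", "of", "in"}

def extract_keywords(text):
    """Lowercase, split into alphabetic words, and drop common stopwords.
    (The paper's pipeline would also apply the Porter Stemming Algorithm.)"""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOPWORDS}
```

The resulting keyword set per document is what populates one column of the incidence matrix.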

Experimental results
In the experiment, the performance of the search and update operations of the proposed scheme is evaluated and compared with existing schemes. The time for creating incidence matrices of different sizes is evaluated to illustrate how the size of the dataset influences the construction time. As shown in Fig. 4, the construction time is 10.114 s for an incidence matrix of 10^3 × 10^3. When the size of the incidence matrix exceeds 10^3 × 10^3, the time to construct the encrypted incidence matrix increases rapidly; for example, it takes approximately 20 min to construct a 10^4 × 10^4 incidence matrix with 10^8 entries. Since the incidence matrix is only constructed once during the setup phase, the relationship between the search and update times and the size of the incidence matrix is the main focus.
Then the cluster distance threshold is evaluated. In the setup operation, keywords are clustered by their response lengths into various sets; the cluster distance threshold not only affects the cluster number but also influences the maximum padding length. Therefore, it is essential to choose a suitable cluster distance threshold.
For the cluster distance threshold, multiples of 50 in [50, 500] are selected as the experimental data to evaluate the cluster number. The number of clusters represents the number of distinct response lengths; excess clusters increase the time cost of the clustering algorithm and of matrix construction in the setup operation. It can be seen from Fig. 5 that the cluster number decreases as the cluster distance threshold increases: a threshold of 50 yields 156 clusters, and a threshold of 100 yields 107 clusters. However, the cluster number curve tends to flatten once the threshold exceeds 200 (74 clusters); at a threshold of 250, there are 63 clusters.
The maximum padding length is then evaluated. As shown in Fig. 6, the maximum padding length is 99, 149, and 199 for clusters of size 100, 150, and 200, respectively.
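The trade-off between cluster number and maximum padding length can be sketched with a simple one-dimensional, threshold-based clustering; this is an illustrative stand-in, as the paper's exact clustering algorithm may differ.

```python
def cluster_by_length(lengths, threshold):
    """Group sorted response lengths: start a new cluster whenever the gap
    to the previous length exceeds the distance threshold (sketch)."""
    clusters = []
    for length in sorted(lengths):
        if clusters and length - clusters[-1][-1] <= threshold:
            clusters[-1].append(length)
        else:
            clusters.append([length])
    return clusters

def max_padding_length(clusters):
    """Worst-case number of dummy entries any keyword needs: the largest
    spread (max - min) within a single cluster."""
    return max(c[-1] - c[0] for c in clusters)
```

Raising the threshold merges clusters, which lowers the cluster count but widens the spread inside each cluster, hence a larger maximum padding length; this is exactly the trade-off shown in Figs. 5-7.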
To obtain a cluster distance threshold that yields both a good cluster number and a good maximum padding length, Fig. 7 shows the relationship between cluster number and maximum padding length under different cluster distance thresholds, where the red bar and the blue bar represent the cluster number and the maximum padding length, respectively. It can be seen that the difference between the cluster number and the maximum padding length varies little for cluster distances of 200 and 250. Since this paper's main purpose is to reduce server storage overhead, a cluster distance of 200 is selected. Then, the search and update times of the proposed scheme are compared with those of other schemes under different sizes of the incidence matrix. Finally, the performance of the search and update operations of the proposed scheme is compared with that of DOD-DSSE [20], ODSE [21], DORY [18], and Wang et al.'s scheme [19] under different incidence matrix sizes, and the results are shown in Figs. 8 and 9. DOD-DSSE leverages two non-colluding servers and exploits the properties of an incidence matrix to avoid information leakage. Wang et al.'s scheme uses multiplicative homomorphism and random polynomials of appropriate degree to guarantee that the user cannot learn anything other than the desired search result. It should be noted that although Paillier is used to achieve shuffling in the proposed scheme, the shuffling operation is performed after the search, so it does not affect the search performance. For the update operation, it can be seen from Fig. 9 that the update times are 3.63 s and 4.83 s, respectively, for keyword-document pairs of size 10^8. When the size of the keyword-document pairs increases to 9 × 10^10, the update time of ODSE_wo_it is approximately 3× that of OSM-DSSE.
This is because the size of the update token is minimal (e.g., a binary string), while ODSE_wo_it needs to transmit a large amount of data to several servers.

Conclusion
This paper proposes a searchable encryption scheme named OSM-DSSE to hide the search pattern and access pattern. An effective shuffling algorithm based on Paillier is proposed to shuffle the incidence matrix, so that the positions of the rows in the incidence matrix are changed. This scheme combines the 1-out-of-n OT protocol and the optimized padding strategy to realize random data access. Besides, the security of the proposed scheme is formally analyzed, showing that it provides adaptive semantic security. Furthermore, OSM-DSSE achieves approximately 3-4× the execution speed of existing schemes. The optimal block size and scenarios with different security levels will be investigated in the future.

(Fig. 7: The comparison between cluster number and maximum padding length. Fig. 8: The relation between search time and number of keyword-file pairs. Fig. 9: The relation between update time and number of keyword-file pairs.)

Appendix

Game 1: It is the same as Game 0, except that the key K ← OSM-DSSE.Gen(1^λ) is replaced with a uniformly random value in the setup phase (Game 1: line 3). Because the random value is indistinguishable from the key generated by the key generation algorithm, the search and update operations can still proceed normally. So, Pr[Game_1 = 1] − Pr[Game_0 = 1] ≤ negl(λ).

Game 2: It is the same as Game 1, except that the deterministic function for generating the search or update token is replaced by a random function (Game 2: lines 8, 12). The random function is truly random, and its output is indistinguishable from the output of the hash function (a pseudo-random function). So, Pr[Game_2 = 1] − Pr[Game_1 = 1] ≤ negl(λ).

Moreover, since the 1-out-of-n OT protocol performed in the search is based on the hardness of DDH (Game 2: line 9), the client's choice is unconditionally secure: for any choice x′, there is an r′ that satisfies y = g^{r′} h^{x′}. The client hides its choice in the token sent to the server by introducing a random number, so the server cannot obtain any information about the client's choice from the token.

Game 3: It is the same as Game 2, except that the values used for the homomorphic calculation in the shuffle phase are replaced with other randomly selected values. In this scheme, from the perspective of the server, the path shuffling algorithm involves two parts: the homomorphically encrypted confusion matrix M and the symmetrically encrypted incidence matrix I′. The confusion matrix is composed of a permutation matrix P and a diagonal matrix Q that are randomly selected by the client, so P and Q are not visible to the server.

To establish the security of this part, the following theorem is given.

Theorem 2: Even if the encrypted confusion matrix M is given, the server cannot infer the permutation matrix P and the diagonal matrix Q.

Proof: The security of the confusion matrix M is based on the semantic security of Paillier encryption. The confusion matrix M does not reveal any information about P and Q: there are multiple choices of P and Q that generate the same confusion matrix M, and these choices are not visible to the server, so the server cannot recognize the correct P and Q. The randomness of the matrix selection ensures that the server cannot correctly infer the true values of P and Q, so the uploaded confusion matrix is safe.

For the server-side incidence matrix I′, symmetric encryption is performed to meet the IND-CPA security standard.

According to Theorem 2 and the above analysis, both parties involved in the homomorphic calculation are secure. At the same time, according to the homomorphic properties of the Paillier encryption system (i.e., any calculation performed by the homomorphic operation protects the privacy of the original data and of the calculation result), the calculation result is also secure, and the server cannot correctly distinguish the real confusion matrix from a randomly generated one. So, Pr[Game_3 = 1] − Pr[Game_2 = 1] ≤ negl(λ).

Game 4: It is the same as Game 3, except that the number of searched blocks of a keyword is padded to the maximum within its cluster.

Theorem 3: The leakage function L^srch is volume-hiding.

Proof: An optimized padding method is presented to prevent volume leakage. In this scheme, a clustering algorithm sorts keywords into different sets according to their numbers of searched blocks, so the numbers of searched blocks of all keywords in the same set are equal. Since the adversary A has no information about the number of keywords in a cluster set C_i, when A receives a result set with k_i searched blocks, A cannot distinguish whether it corresponds to a previously searched keyword or to another keyword in C_i. Note that the only other leakage is the number of clusters, which is independent of the response lengths of the query operations. Consequently, the adversary A cannot obtain any information from the number of searched blocks, completing the proof.