Figure 1 shows the overall functionality of the proposed model of processing the data streams. As discussed above, the phases of the EAPPA model are two-fold. The first contribution is called Data Approximation and Pre-processing (DAP) and the second contribution is called Adaptive Clustering-based Privacy Preservation (ACPP). Figure 1 shows the processing of both phases step-by-step using the delay constraint approach. The functionality is mainly derived from the concept of periodical data processing of incoming streaming data. Before acquiring the data streams, we first initialized the timer \(t\). The acquired data streams are then processed using the DAP. We applied the FM algorithm to reduce the redundant data streams followed by the pre-processing algorithm to filter out the noisy contains. The sequentially received data streams are first checked for duplication, pre-processed, and then stored into output matrix \(D\). The functionality of DAP continues until the timer reaches a pre-defined threshold value \(\lambda\). Once the delay constraint is satisfied, EAPPA launches the ACPP phase which takes the \(D\) as input. In the ACPP phase, we first compute the k-number of centroids using a basic k-means algorithm. Then estimate the similarities of each data stream tuple with its corresponding centroid tuple. All the estimated similarities are recorded and sorted in descending order. Finally, the k-anonymized clusters are formed as per the sorted order to ensure the k-anonymity. As the k-anonymization failed to prevent the attribute disclosure, we applied the l-diversity privacy notion on each cluster. The clusters ensuring the k-anonymity and l-diversity are then published before taking the next periodic data stream. We explore the design of both phases is explored in the below section.
A. DAP
The functionality of DAP is consists of FM-based data streams approximation and pre-processing of each received tuple without losing the sensitive information. Figure 2 shows the methodology of the DAP phase in detail. The input data stream \(S\) holds the 1 or more tuples. If the number of tuples in \(S\) is more than 1, then we initiate the FM algorithm and pre-processing algorithm. The main aim of the FM algorithm is to estimate the total number of distinct data streams, but we explored the FM algorithm to extract the distinct data streams and discard the redundant data streams. This process continues for each incoming tuple. Algorithm 1 shows the modified FM algorithm integrated pre-processing algorithm 2.
As shown in algorithm 1, we have effectively utilized the FM algorithm to approximate the periodic data stream. The advantage of estimating the number of unique tuples using the FM algorithm is explored in this paper to identify the redundant or duplicate incoming tuple. The core functionality of the FM algorithm is belonging to the steps such as defining the hash function (step 9), computing that the hash function of each attribute belongs to each stream (step 11), binary conversion of each hash value (step 12), counting trailing zeros of a binary number (step 13), and computing total distinct streams in \(S\) using step 15 and 16. After discovering the number of distinct elements, we utilized that parameter to discover whether the current data stream \(s\) is unique or redundant and accordingly we take the actions as shown in steps 17–24 in algorithm 1. From algorithm 1, we called algorithm 2 for the pre-processing of the input data stream and stored the final pre-processed streams in \(D\). The core part of this algorithm is the manual discovery of the unique attribute of each streaming tuple and defining hash function. For this work, we defined the hash function shown in Eq. (1).
$$h\left(x\right)\leftarrow \left(a.x+b\right)mod c$$
1
Where, we set the x represents the attribute value of current stream. We set \(a=1, b=6 \& c=32\) to compute the hash values.
Algorithm 1: DAP |
Input \(S: input data stream\) \(\lambda :pre-defined time constraint\) \(j:unique senstive attribute\) Output \(D:Approximted and pre-processed data stream\) |
1. Initial timer\(t=0\) 2. \(s\leftarrow acquire \left(stream\right)\) 3. \(S\leftarrow add\left(s\right)\) 4. If\(\left(size \left(S\right)>1\right)\) 5. For\(i=1:size \left(S\right)\) 6. Estimate the sensitive unique attribute from all streams 7. \(w\left(i\right)\leftarrow S\left(i, j\right)\), record the \({j}^{th}\)position unique value 8. End For 9. Define hash function for stream \(w\) using Eq. (1) and apply 10. For\(i=1:size \left(w\right)\) 11. \(h\left(i\right)\leftarrow \left(a.w\left(i\right)+b\right)mode c\) 12. \(h\left(i\right)\leftarrow binary \left(h\left(i\right)\right)\) 13. \(r\left(i\right)\leftarrow trailingzeros\left(h\left(i\right)\right)\) 14. End For 15. Compute maximum value:\(R\leftarrow \text{max}\left(r\right)\) 16. Compute distinct tuples:\(N\leftarrow {2}^{R}\) 17. If\((N==size (w)\) 18. Current stream \(s\) is unique and apply pre-processing 19. \(p\leftarrow algorithm 2\left(s\right)\) 20. \(t++\) 21. Else 22. Discard stream s from stream S as:\(S\leftarrow substract\left(s\right)\) 23. \(t++\) 24. End If 25. Else 26. \(S\leftarrow algorithm 2\left(S\right)\) 27. \(t++\) 28. End If 29. \(D\leftarrow add\left(p\right)\) 30. Check time constraint 31. If\(\left(t\ge \lambda \right)\) 32. Return \(\left(D\right),\)Launch ACPP phase 33. Reset timer \(t=0\), goto step 1 34. Else 35. goto step 2 36. End If |
Algorithm 2
shows the pre-processing of each input stream. First, we have checked whether the attribute is a string. If it strings then, we performed the lemmatization using NLP to convert the incorrect strings into the meaningful form and remove the noise in the string. Apart from this, we have addressed the challenges of missing or incomplete data in this work for numeric attributes. We have discovered the numeric attributes and replaced them with relevant values using the function. The discover the most relevant value using statistical analysis of same attributed of other steams.
Algorithm 2: Data Pre-processing |
Input \(s:input data stream\) Output \(p:pre-processed data stream\) |
1. Acquisition of test stream s 2. For each attribute each attribute\(i=1:size \left(s\right)\) 3. If\(\left(s\left(i\right)==string\right)\) 4. \(p\left(i\right) \leftarrow Lemmatization \left(s\left(i\right)\right)\) 5. End If 6. If\(\left(s\left(i\right)==NULL\right)\) 7. \(p\left(i\right)\leftarrow newVal\left(\right)\) 8. End If 9. End For 10. Return\(\left(p\right)\) |
B. ACPP Phase
This phase belongs to achieving the complete privacy preservation of periodically collected data stream D without losing information. We estimate the centroids using the existing k-means clustering before doing the k-anonymization. The arguments for using k-means clustering are that (1) it is straightforward to group tuples based on their similarities, (2) it is quick and creates efficient clusters, and (3) outliers in the dataset cannot be avoided using k-means and all outliers have privacy. Therefore, we form the initial centroids of input data stream \(D\) as:
$$[C, \ddot{C] } = kmeans (D, n)$$
2
Where, \(n\) defines the number of clusters (in this work, we have set n as 30, 60, 90, 120, and 150). \(C\) represents the set of \(n\) clusters where the tuples of \(D\) are distributed. \(\ddot{C }\)represents the centroid tuple for each cluster. The k-means algorithm failed to achieve the complete k-anonymity across all the clusters. The clusters are k-anonymized if they satisfied the constraint of having exactly k-number of tuples in each cluster. The value of \(k\) is discovered by:
$$k=\left|\frac{size \left(D\right)}{n}\right|$$
3
Therefore, we have proposed the adaptive clustering mechanism in this paper to achieve the complete k-anonymization privacy notion for periodically received data stream \(D\). This is done by enhancing the output of k-means clustering as showing in algorithm 3. As showing in algorithm, it takes inputs such as \(C\), \(\ddot{C }\), \(n, x,\) and \(y\) and return the set clusters \(O\) that ensures the k-anonymity. Before enhancing the present clusters, we have first estimate the distance between \({i}^{th}\) tuple of \({j}^{th}\) cluster and \({j}^{th}\)centroid. This distance is measured by Manhattan distance technique in \(getDist (.)\) function. It is calculated as the sum of the absolute differences among two numeric vectors of two tuples. All the distances are measured into the vector \(P\) which contains the entire tuple and its distance value. We then sorted the tuples in \(P\) in descending order of distance values. Finally, the clusters are reformed that ensures maximum k-tuples per cluster criteria to ensure the k-anonymity. The proposed clustering takes simple approach to achieve the k-anonymity for the current data stream D. The number of tuples in each cluster should be less than or equal to k. Therefore, algorithm 3 returns the clusters of similar size to achieve the k-anonymization with less IL.
Algorithm 3: K-anonymized adaptive clustering |
Inputs \(C: Set of clusters\) \(\ddot{C }:set of centroid tuples for each cluster\) \(n: number of clusters\) \(x:number of tuples in D\) \(y: number of attributes in each tuple\) Output \(O: set of k-anonimized clusters\) |
1. Initialize, \(P\leftarrow zeros(x, y+1)\),\(q=1\) 2. For \(i = 1:n\) 3. For\(j = 1: size \left(\text{C}\left(\text{i}\right)\right)\) 4. \(\text{d}\leftarrow getDist\left(\text{C}\left(\text{i}, \text{j}\right), \ddot{C\left(i\right) }\right)\) 5. \(P\left(q, 1:x\right)\leftarrow \text{C}\left(\text{i}, \text{j}\right)\) 6. \(P\left(q, y+1\right)\leftarrow \text{d}\) 7. \(q++\) 8. End For 9. End For 10. \(temp\leftarrow sort\left( \text{descening}, P\left(:, m+1\right)\right)\) 11. for\(i = 1: size \left(temp\right)\) 12. for \(j = 1:n\) 13. if\(\left(size\left(\text{O}\left(\text{j}\right)\right) \le \text{k}\right)\) 14. \(\text{O}\left(\text{i}, \text{j}\right)\leftarrow join \left(temp \left(i, :\right)\right)\) 15. end if 16. end for 17. end for 18. Return\(\left(O\right)\) |
Due to its limitations of attribute disclosure, background knowledge, and homogeneity, k-anonymity does not guarantee total privacy protection. The l-diversity resolved the k-anonymity issues. Therefore, we further extend the k-anonymized clusters with l-diversity notion in this paper. We used the entropy l-diversity idea to expand the clusters in \(O\) to meet the l-diversity requirement [38]. We calculated diversity using entropy for each k-anonymized cluster and stored the result in matrix \(L\). The greedy approach had used to guarantee that each cluster met the l-diversity requirement. The procedure continues until all of the clusters are l-diverse. Because we are not removing any tuples from the cluster during the whole algorithm 4, the privacy concept of k-anonymity remains the same. As showing in algorithm 4, our aim is achieve the clusters with diversity below 1. We therefore rearranged the each cluster until we achieve the diversity level below 1. This is achieved by computing the current cluster with maximum diversity value and cluster with minimum diversity value. And according to that value \((Max+Min=l)\), we arranged the clusters in output vector \(F\). Finally, we publish the privacy preserved data stream towards the intended destinations. The process of clustering repeated for each incoming data stream with similarity functionality, therefore it is named as adaptive clustering for privacy preservation.
Algorithm 4: l-diversity and publish |
Input \(O: set of k-anonimized clusters\) Output \(F: set of k-anonimized and l-diverity ensured clusters\) |
1. For\(j = 1: size \left(\text{O}\left(\text{i}\right)\right)\) 2. \(L\left(i\right)\leftarrow ComputeDiversity \left(O\left(i\right)\right)\) 3. End For 4. Ensures the l-diversity 5. While (\(L<1)\) do 6. \(Max\leftarrow \text{max}\left(L\right)\) 7. \(Min\leftarrow \text{min}\left(L\right)\) 8. \(l\leftarrow Max+Min\) 9. \(F\leftarrow O-\left\{Max, Min\right\}+ l\) 10. End While 11. Publish\(\left(F\right)\) 12. Return \(\left(F\right)\) |